Using SAS Viya to Analyze Love at First Sight

Data Science February 11, 2020

The Game Changer series

By Stefan Dimitrov Stoyanov, Business Analytics Intern at Boemska

As you may have noticed from my first and second articles about SAS Viya, I fell in love at first sight with the SAS in-memory cloud analytics platform. But do you believe you can fall in love at first sight with a person?

According to the 2017 Elite Singles poll, the majority of people (61% of women and 72% of men) say yes. Research conducted by the University of Groningen also supports these results. So, there is a chance that, just like me, you are a hopeless romantic. 🙂

I had never really considered the reasons that make people fall in love so suddenly. That is, not until I bumped into the SAS Viya Visual Data Mining and Machine Learning free trial. It walked me through the analysis and visualization of a speed dating dataset collected at Columbia University and hosted on Kaggle. Through the trial and the GitHub challenge, both of which I will discuss in this article, SAS Viya helped me to understand the most important factors and how to predict a potential match with that special someone the first time we meet.

Based on the findings, I feel more prepared for my next date. As I’ll show in this edition of the #GameChanger series, SAS Viya turned out to be a game-changer for me one more time! ☺

Getting to know the data

The experiment was led by Columbia Business School professors Ray Fisman and Sheena Iyengar. It consisted of a series of speed dating meetups held from 2002 to 2004. Columbia University gathered data by distributing surveys at different stages of the experiment. Before each meetup, participants answered demographic questions about age, region, career, interests and hobbies, as well as their expectations of the speed dating event. They were also asked to rate the importance they attached to six key characteristics: attractiveness, ambition, intelligence, fun, sincerity, and shared interests. During the meetups, the attendees spent four minutes speaking with each participant of the opposite sex. After each four-minute date, they were asked to rate their date on the same six attributes and to say whether they wanted to see that person again. The gathered information resulted in a dataset with 195 columns and over 8000 records.

As soon as I started the free trial, I could begin exploring this dataset through a beautiful and comprehensive Visual Analytics report. The coolest feature is that it visualizes the data in a single interactive dashboard, so I could pick up valuable new knowledge about the participants in the experiment at a glance.

For example, I noticed immediately that the participants are quite social, yet most of them do not date very often. More interestingly, I could already see that one of the main factors in falling in love at first sight is whether the other person is fun. ☺

What I like is that the dashboard is interactive and let me explore more insights very quickly. I could expand a diagram if I needed a closer look. I could even filter the charts to show only females or only males by using the gender selection at the top of the report. I could also see more insights about a specific diagram by clicking on it and opening an information pane. For example, by exploring the “Correlation of Interests” chart, you can find out which kind of person you have the best chance of matching with, based on your interests.

By following the step-by-step instructions in the free trial, with only a few clicks in the SAS Viya Visual Analytics tool, I was able to create the following box plots in just a few minutes:

The two box plots show us that men were rated highest on Intelligence, followed by Sincerity and Ambition. The female results appear to be more evenly distributed among the six attributes, with only Shared Interests scoring visibly lower.

Now that we have revealed some interesting facts about how male and female participants scored each other, do you want to know if some of them made a match?

Creating an automated analysis of the data

The “Automated analysis” machine learning object within the SAS Viya Visual Analytics environment helped me to determine quickly which variables influence a match. By following the step-by-step instructions, I was able to build the first basic predictive model in the trial very easily. You can see all the results nicely arranged in a single interactive dashboard. But that is not all. What is most impressive is that you don’t even need to work hard to interpret the figures and charts, because SAS Viya does the hard work for you: it applies natural language processing to automatically generate valuable insights in the form of easy-to-understand text.

The automated analysis reveals that 16.49% of all 4184 speed dates resulted in a match (690 couples). The colourful bar at the top orders the predictors of a match by relative importance, starting with the most important on the left. The most important predictors of a successful match were the fun, shared interests and attractiveness scores given by females, and the attractiveness, fun and shared interests scores given by males. This means that there was an excellent chance of a match if both the female and the male considered their speed-dating partner to be fun and attractive and felt they had shared interests.
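As a quick sanity check, the match rate can be reproduced from the two counts quoted above. This is only a minimal sketch of the arithmetic in plain Python; the variable names are my own and are not part of the SAS Viya report.

```python
# Counts quoted in the automated analysis described in the article.
total_dates = 4184   # all speed dates in the analysis
matches = 690        # dates that ended in a match

# Share of dates that resulted in a match, as a percentage with two decimals.
match_rate_pct = round(matches / total_dates * 100, 2)

print(f"Match rate: {match_rate_pct}%")  # prints "Match rate: 16.49%"
```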

What else is needed for a match? The automated analysis calculates the probability of a match for different combinations of variables. The dashboard reveals that when both the female and the male found their partner attractive and the female felt that the male shared her interests, there was a 64.09% chance of a match.

From the colourful bar at the top, I could also click on a predictor. Then, in the chart on the right, I would see how the selected variable related to the number of matches. For example, on the screenshot above the selected predictor is 2_154 – Shared Interest Score by Female. Below the bar chart, there is an insight generated by the SAS natural language processing algorithm. It showed that when a male was given a score of 6, 7 or 8 for shared interests, the total count of yes (match) was high.

Interestingly, the highest score for a specific characteristic did not always lead to the most matches.

Building advanced Machine Learning models to find the best prediction of love at first sight <3

Now we can go even further. The free trial teaches us to build, navigate, and assess different machine learning models easily and quickly, in order to find the combination of variables that best predicts a match. You will explore the Decision Tree, Gradient Boosting and Forest models. I even learned how to integrate an open-source Python model into SAS Viya. Moreover, SAS Viya enables us to compare the models and choose the best-performing one automatically. Wow!

Below you can see the whole well-structured analytical workflow of a model in a single interactive diagram, called a “pipeline”. It is provided in Model Studio – the web-based visual interface of SAS Viya for doing visual Machine Learning.

Using prepared templates to build predictive models faster

SAS Viya makes it faster and easier to build machine learning models by using prepared templates. The free trial showed me how to include the feature engineering pipeline template in my project.

By exploring it, I learned about different data preprocessing techniques I can use when trying to improve the set of predictors of love at first sight. For example, the free trial showed me how to create new features really fast and how to combine supervised with unsupervised methods to select the most significant variables when building a model. I also learned by practical example how to use early stopping, which halts training once the model stops improving on validation data and so helps to prevent over-training. This way, the model can make more accurate predictions. Cool, isn’t it! ☺
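To make the idea of early stopping concrete, here is a minimal, generic sketch in plain Python. It is not how SAS Viya implements the feature; the function name, the `patience` parameter and the list of validation errors are all made up for illustration. The rule is simply: keep training while the validation error improves, and stop once it has failed to improve for a set number of rounds in a row.

```python
def early_stopping_round(validation_errors, patience=3):
    """Return the index of the best training round seen before stopping.

    Training "stops" once the validation error has not improved for
    `patience` consecutive rounds (hypothetical rule for illustration).
    """
    best_error = float("inf")
    best_round = -1
    rounds_without_improvement = 0

    for round_index, error in enumerate(validation_errors):
        if error < best_error:
            # New best model so far: remember it and reset the counter.
            best_error = error
            best_round = round_index
            rounds_without_improvement = 0
        else:
            rounds_without_improvement += 1
            if rounds_without_improvement >= patience:
                break  # no improvement for `patience` rounds: stop training

    return best_round


# Made-up validation errors, one per training round.
errors = [0.30, 0.22, 0.18, 0.16, 0.17, 0.165, 0.19, 0.15]
best = early_stopping_round(errors, patience=3)
print(f"Keep the model from round {best}")
```

With a patience of 3, training stops after three rounds without improvement and the model from the best round so far is kept. A larger patience trains for longer and may find a later, better round, at the cost of more computation.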

Automating interpretability of the most significant predictors of love at first sight

The free trial applies the fantastic automated model interpretability feature of SAS Viya. It provides us with several plots that help us to understand the most important predictors of a model and to compare results across many different models. This way, we can choose the best one.

After we have selected our champion model, the free trial shows us how to deploy it into production and to score our data. You can see now that SAS Viya enables you to go through the whole analytical lifecycle from data discovery to model deployment in one integrated environment. This is so convenient!

Model Studio includes a number of cool tools that will enable you to boost your productivity while doing machine learning using SAS Viya. If you want to learn more about the features of SAS Model Studio, check out my first blog from the #GameChanger series.

Predicting love at first sight – The GitHub Challenge!

After showing how to do some tuning of the models, the free trial script presents us with its champion model. The Gradient Boosting model performed the best, with a Validation Misclassification Rate (Event) of 0.1267. That is, the model made wrong predictions in only 12.67% of cases. For 87.33% of the speed dates, the model correctly predicted whether or not there was a match. From this model, we see that the top predictors in determining a match are: Fun Score by Female, Attractiveness Score by Male, Field of Study by Female, Attractiveness Score by Female, and Shared Interest Score by Female.

However, is this really the best set of predictors of two people’s desire to get to know each other better? Can we obtain a better-performing model which predicts a match between a female and a male more accurately? This is the challenge that the trial offers us. And I accepted it.

I decided to go one step further. We are trying to predict whether or not there will be a match between two people. In machine learning, that means that we have a target with two classes: yes and no. Therefore, I added the Advanced Pipeline Template for Class Target to my project. I included the variable clustering data preprocessing node in it, before the Gradient Boosting model node. I also set some of the options of the nodes in the pipeline. Then, I used the SAS Viya autotuning feature, which helped me to find the optimal hyperparameters of the models.

I succeeded in building a machine learning model that was more accurate than the champion model presented in the free trial, by about 1.76 percentage points going by the two misclassification rates. My Gradient Boosting model performed the best, with a Validation Misclassification Rate (Event) of 0.1091. That is, the model made wrong predictions in only 10.91% of cases. For 89.09% of the speed dates, the model correctly predicted whether or not there was a match.
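The relationship between the two misclassification rates can be worked out with a few lines of Python. This is only a sketch of the arithmetic, using the two rates quoted in this article and treating accuracy as one minus the misclassification rate, as the article does; the variable names are my own.

```python
# Validation misclassification rates quoted in the article.
trial_champion_error = 0.1267   # champion model from the free trial
my_model_error = 0.1091         # the tuned Gradient Boosting model

# Accuracy is taken as 1 minus the misclassification rate.
trial_accuracy = 1 - trial_champion_error
my_accuracy = 1 - my_model_error

# Absolute gain in accuracy, in percentage points.
gain_points = round((my_accuracy - trial_accuracy) * 100, 2)

# Relative reduction in the error rate, as a percentage of the old error.
relative_error_reduction = round(
    (trial_champion_error - my_model_error) / trial_champion_error * 100, 1
)

print(f"Accuracy gain: {gain_points} percentage points")
print(f"Relative error reduction: {relative_error_reduction}%")
```

Reporting both figures is a deliberate choice: a small gain in percentage points can still mean that a noticeable share of the previous errors has been removed.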

From this model we see that the top 10 predictors in determining a match are: Attractiveness Score by Female, Fun Score by Female, Attractiveness Score by Male, Fun Score by Male, Estimated Number of Matches by Female, Shared Interest Score by Female, Shared Interest Score by Male, Diff Like Score, Probability Partner Likes You by Male, and Probability Partner Likes You by Female.

So, according to my SAS Viya predictive model, the essential factors influencing love at first sight are fun, attractiveness and shared interests. However, some less obvious predictors appear here too. “Probability Partner Likes You by Female” represents how a female thought the male would score her on the overall like score. Likewise, “Probability Partner Likes You by Male” represents how a male thought the female would score him on the overall like score. That is, how much we think our date liked us is a significant catalyst for our feelings of ‘love at first sight’!

After the speed dates, each partner gave an overall “how much did you like the person” score. “Diff Like Score” is the difference between the two scores. For example, if the male gave the female an overall like score of 8 and the female gave the male an overall like score of 7, then this variable would be 1 (8-7). If there was no significant difference in how much the two partners liked each other, the couple was more likely to make a match.
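The “Diff Like Score” feature is simple enough to sketch in a few lines of Python. The function name and argument names below are hypothetical, and the direction of the subtraction (male score minus female score) is an assumption based only on the 8-7 example above; the dataset may define it differently, for instance as an absolute difference.

```python
def diff_like_score(like_by_male, like_by_female):
    """Difference between the two overall like scores.

    Assumption: computed as the male's score minus the female's score,
    following the article's 8 - 7 example.
    """
    return like_by_male - like_by_female


# The example from the article: he rates her 8, she rates him 7.
example = diff_like_score(like_by_male=8, like_by_female=7)
print(example)  # prints 1

# If only the size of the gap matters, the absolute value can be used.
gap = abs(diff_like_score(like_by_male=6, like_by_female=9))
print(gap)  # prints 3
```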

Based on the predictor “Estimated Number of Matches by Female”, we can see one more interesting trend. If a female believed she was going to make a match during the speed dates, it was more likely to happen.

Do you want to see the other determinants of initial attraction between two people according to this predictive model? You can download my pipeline from the SAS Visual Data Mining and Machine Learning Trials Challenge GitHub repository.

Being better prepared for your next date. Beat my best predictive model!

Why don’t you try building new models? You may find one which is more accurate and provides better predictors of love at first sight. If you do, download the model and upload it to the SAS Viya Trials Challenge GitHub repository.

By building your model and predicting the influencers of love at first sight, you will get better prepared for your next date, just like me. Even if you have already found love, you can become a mentor to all your single friends in their attempts to find THE One! 🙂

http://www.datasciencecentral.com/xn/detail/6448529:BlogPost:930545
