GraphLab came as an unexpected breakthrough in my learning plan. After all, good things happen when you least expect them. It all started at the end of the Black Friday Data Hack. Out of 1,200 participants, we got our winners and their interesting solutions.
I read and analyzed them, and realized that I had missed out on an incredible machine learning tool. A quick exploration told me that this tool has immense potential to reduce the pain of machine learning modeling. So I decided to explore it further, and I have now dedicated a few days to understanding the science behind it and how it is meant to be used. To my surprise, it wasn't difficult to understand.
Have you been trying to improve your machine learning models, mostly without success? Try this advanced machine learning tool. A one-month trial is free, and a one-year subscription is free for academic use; after that, you can purchase a subscription for subsequent years.
To get you started quickly, here is a beginner's guide to GraphLab in Python. For ease of understanding, I've tried to explain these concepts in the simplest possible manner.
GraphLab has an interesting inception story. Let me tell it briefly.
GraphLab, now known as Dato, was founded by Carlos Guestrin, who holds a Ph.D. in Computer Science from Stanford University. It all started around 7 years ago, when Carlos was a professor at Carnegie Mellon University. Two of his students were working on large-scale distributed machine learning algorithms. They ran their models on top of Hadoop and found that computation took far too long. The situation didn't improve even after switching to MPI (a high-performance computing library).
So they decided to build a system that would let them write more papers, faster. With this, GraphLab came into existence.
P.S. GraphLab Create is a commercial product by GraphLab (the company). It is accessed in Python using the "graphlab" library. Hence, in this article, 'GraphLab' means GraphLab Create. Don't get confused.
GraphLab is a parallel framework for machine learning written in C++. It is an open-source project designed for the scale, variety and complexity of real-world data. It incorporates high-level algorithms such as stochastic gradient descent (SGD) and gradient descent, along with locking schemes, to deliver high performance. It helps data scientists and developers easily create and deploy applications at large scale.
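To build intuition for what an algorithm like SGD does, here is a minimal stochastic gradient descent loop for one-variable linear regression in plain Python. This is a sketch for intuition only; the function name and data are mine, and GraphLab's actual implementation is parallel C++ operating at much larger scale.

```python
import random

def sgd_linear_fit(xs, ys, lr=0.01, epochs=500, seed=0):
    """Fit y ~ w*x + b by stochastic gradient descent on squared error."""
    random.seed(seed)
    w, b = 0.0, 0.0
    idx = list(range(len(xs)))
    for _ in range(epochs):
        random.shuffle(idx)            # visit samples in random order each epoch
        for i in idx:
            err = (w * xs[i] + b) - ys[i]   # gradient factor of the squared error
            w -= lr * err * xs[i]           # one small step per sample,
            b -= lr * err                   # not per full pass over the data
    return w, b

# Noise-free data generated from y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = sgd_linear_fit(xs, ys)
print(round(w, 1), round(b, 1))  # → 2.0 1.0
```

The key property SGD exploits is that each update touches only one sample, which is what makes it attractive for the large, distributed data sets GraphLab targets.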
But what makes it amazing? Its neat libraries for data transformation, manipulation and model visualization. In addition, it comprises scalable machine learning toolkits that have (almost) everything required to improve machine learning models. The toolkits include implementations of deep learning, factorization machines, topic modeling, clustering, nearest neighbors and more.
Here is the complete architecture of GraphLab Create.
There are multiple benefits to using GraphLab.
You can use GraphLab once you have obtained a license. You can get started with the free trial or with the academic edition's 1-year subscription. Prior to installation, make sure your machine fulfills the system requirements to run GraphLab.
System Requirement for GraphLab:
If your system fails to meet the above requirements, you can also use GraphLab Create on the AWS Free Tier.
Steps for Installation:
After you’ve installed GraphLab successfully, you can access it using “import <library_name>”.
import graphlab
# or
import graphlab as gl
Here, I'll demonstrate the use of GraphLab by solving a data science challenge. I have taken the data set from the Black Friday Data Hack.
sf_train = graphlab.SFrame('C:/Users/Analytics Vidhya/Desktop/DataSets/Black_Friday/train.csv')
sf_test = graphlab.SFrame('C:/Users/Analytics Vidhya/Desktop/DataSets/Black_Friday/test.csv')
Numeric columns (int and float) show basic summary statistics (num_unique, missing values as num_undefined, min, max, median, mean, std) along with a box plot. String columns show the number of unique values, the number of missing values and a table of the most frequent items in the column.

# Make a change to an existing variable
# Combine all age bins greater than 50
def combine_age(age):
    if age == '51-55':
        return '50+'
    elif age == '55+':
        return '50+'
    else:
        return age

sf['Age'] = sf['Age'].apply(combine_age)
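Since the recoding function is ordinary Python, you can sanity-check it on a plain list without GraphLab. A minimal sketch (the sample ages are illustrative values from the Black Friday age bins):

```python
def combine_age(age):
    """Merge the two oldest age bins into a single '50+' bin."""
    if age in ('51-55', '55+'):
        return '50+'
    return age

ages = ['0-17', '26-35', '51-55', '55+', '46-50']
print([combine_age(a) for a in ages])
# → ['0-17', '26-35', '50+', '50+', '46-50']
```

GraphLab's `SArray.apply` simply maps such a function over every element of the column, much like a list comprehension does here.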
Now, look at the pre and post visualization of variable “Age”.
For more details on Data Manipulation using GraphLab, please refer this link.
# Create the data
# Variables on which the imputation is based, plus the variable to impute
# You can look at the algorithms behind the imputation in the documentation
sf_impute = sf_train[['Age', 'Gender', 'Product_Category_2']]

imputer = graphlab.feature_engineering.CategoricalImputer(feature='Product_Category_2')

# Fit and transform on the same data
transformed_sf = imputer.fit_transform(sf_impute)

# Retrieve the imputed values
transformed_sf
Finally, you can bring this imputed variable back into the original data set.
sf_train['Predicted_Product_Category_2']=transformed_sf['predicted_feature_Product_Category_2']
Similarly, you can apply other feature engineering operations to the data set based on your requirements. You can refer to this link for more details.
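To see conceptually what a categorical imputer does, here is a rough stand-in in plain Python: it fills missing values of one column with the most frequent value observed for rows sharing another column's value. This is only a sketch with made-up data; GraphLab's CategoricalImputer is model-based and more sophisticated, and the helper name `impute_mode` is mine.

```python
from collections import Counter

def impute_mode(rows, group_key, target):
    """Fill missing `target` values with the most frequent value
    seen among rows that share the same `group_key` value."""
    # Count observed target values per group
    modes = {}
    for r in rows:
        if r[target] is not None:
            modes.setdefault(r[group_key], Counter())[r[target]] += 1
    # Replace None with the group's most common value
    out = []
    for r in rows:
        r = dict(r)
        if r[target] is None and r[group_key] in modes:
            r[target] = modes[r[group_key]].most_common(1)[0][0]
        out.append(r)
    return out

rows = [
    {'Gender': 'F', 'Product_Category_2': 8},
    {'Gender': 'F', 'Product_Category_2': 8},
    {'Gender': 'F', 'Product_Category_2': None},
    {'Gender': 'M', 'Product_Category_2': 14},
    {'Gender': 'M', 'Product_Category_2': None},
]
filled = impute_mode(rows, 'Gender', 'Product_Category_2')
print([r['Product_Category_2'] for r in filled])
# → [8, 8, 8, 14, 14]
```

The principle is the same: use the columns you trust (here, Gender) to predict a plausible value for the column with gaps.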
In the Black Friday challenge, we are required to predict a numeric quantity, "Purchase", i.e. we need a regression model.
In GraphLab, we have three types of regression models:
A) Linear Regression
B) Random Forest Regression
C) Gradient Boosted Regression
If you are unsure which algorithm to pick, don't worry: GraphLab can select an appropriate regression model automatically.
# Make a train-test split
train_data, validate_data = sf_train.random_split(0.8)

# Automatically picks the right model based on your data
model = graphlab.regression.create(train_data, target='Purchase',
                                   features=['Gender', 'Age', 'Occupation', 'City_Category',
                                             'Stay_In_Current_City_Years', 'Marital_Status',
                                             'Product_Category_1'])

# Save predictions to an SArray
predictions = model.predict(validate_data)

# Evaluate the model and save the results into a dictionary
results = model.evaluate(validate_data)
results
Output:
{'max_error': 13377.561969523947, 'rmse': 3007.1225949345117}
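The two numbers reported by `model.evaluate` are easy to compute by hand: RMSE is the root of the mean squared prediction error, and max_error is the single worst absolute deviation. A plain-Python sketch on illustrative toy numbers (not the competition data):

```python
import math

def rmse(actual, predicted):
    """Root mean squared error."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)

def max_error(actual, predicted):
    """Largest absolute deviation of any single prediction."""
    return max(abs(a - p) for a, p in zip(actual, predicted))

actual    = [9000.0, 7000.0, 12000.0, 5000.0]
predicted = [8500.0, 7600.0, 10000.0, 5200.0]
print(round(rmse(actual, predicted), 1))  # → 1078.2
print(max_error(actual, predicted))       # → 2000.0
```

Note how a single badly predicted row (off by 2000) dominates max_error while only moderately inflating RMSE; this is why the two metrics in the output above differ so much.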
# Make predictions on the test data set
final_predictions = model.predict(sf_test)
To know more about other modeling techniques such as clustering, classification, recommendation systems, text analysis and graph analysis, you can refer to this link. Alternatively, here is the complete user guide by Dato.
In this article, we learnt about GraphLab Create, which helps you handle large data sets while building machine learning models. We also looked at the data structures, SFrame and SGraph, that enable GraphLab to handle large data sets. I'd recommend giving GraphLab a try. You'll love its automated features: data exploration with Canvas (an interactive web-based exploration tool), feature engineering, automatic model selection and deployment.
For better understanding, I've also demonstrated a modeling exercise using GraphLab. In my next article on GraphLab, I will focus on graph analysis and recommendation systems.
Did you find this article helpful? Share your experience with GraphLab with us.
Hi Sunil, very good write-up. I just wanted to add that Carlos Guestrin is currently offering a Machine Learning specialization on Coursera, and they use GraphLab for all the exercises, in case anyone is interested. Also, I wish to highlight that GraphLab has its own limitations, such as no 32-bit OS support.
Carlos Guestrin, along with Emily Fox, conducts the Machine Learning Specialization MOOC on Coursera. Both are currently professors at the University of Washington in Seattle. I just completed the first course and am getting ready to start the course on regression analysis. Enrolling in the course gives you one-year access to GraphLab.
Very useful. It would be great to see something similar on other ML and data science tools and platforms, and also some comparison between them. There are so many options available these days that it's difficult to separate the wheat from the chaff, and to know which ones will still be around in five years' time and are hence worth investing in (both time and money). Thanks!