Tutorial – Getting Started with GraphLab For Machine Learning in Python

Sunil Ray Last Updated : 26 Jun, 2020

9 min read

Introduction

GraphLab came as an unexpected breakthrough in my learning plan. After all, ‘Good things happen when you least expect them’. It all started at the end of the Black Friday Data Hack. Out of 1200 participants, we got our winners and their interesting solutions.

I read and analyzed them, and realized that I had missed out on an incredible machine learning tool. A quick exploration told me that this tool has immense potential to reduce our machine learning modeling pains. So, I decided to explore it further and dedicated a few days to understanding its science and logical methods of usage. To my surprise, it wasn’t difficult to understand.

Have you been trying to improve your machine learning model, but mostly failing? Try this advanced machine learning tool. A one-month trial is free, and a one-year subscription is available for free for academic use. After that, you can purchase a subscription for the following years.

To get you started quickly, here is a beginner’s guide to GraphLab in Python. For ease of understanding, I’ve tried to explain these concepts in the simplest possible manner.

Topics Covered

How it all started ?
What is GraphLab ?
Benefits and Limitations of GraphLab
How to install GraphLab ?
Getting started with GraphLab

How it all started ?

GraphLab has an interesting story of its inception. Let me tell you in brief.

GraphLab, also known as Dato, was founded by Carlos Guestrin. Carlos holds a Ph.D. in Computer Science from Stanford University. It happened around 7 years back, when Carlos was a professor at Carnegie Mellon University. Two of his students were working on large-scale distributed machine learning algorithms. They ran their models on top of Hadoop and found that they took quite long to compute. The situation didn’t improve even after using MPI (a high-performance computing library).

So, they decided to build a system to write more papers quickly. With this, GraphLab came into existence.

P.S. – GraphLab Create is commercial software from GraphLab. It is accessed in Python using the “graphlab” library. Hence, in this article, ‘GraphLab’ refers to GraphLab Create. Don’t get confused.

What is GraphLab?

GraphLab is a new parallel framework for machine learning written in C++. It is an open source project and has been designed considering the scale, variety and complexity of real-world data. It incorporates various high-level algorithms such as Stochastic Gradient Descent (SGD), Gradient Descent & Locking to deliver a high-performance experience. It helps data scientists and developers easily create and install applications at large scale.

But, what makes it amazing? It’s the presence of neat libraries for data transformation, manipulation and model visualization. In addition, it comprises scalable machine learning toolkits which have (almost) everything required to improve machine learning models. The toolkits include implementations of deep learning, factorization machines, topic modeling, clustering, nearest neighbors and more.

Here is the complete architecture of GraphLab Create.

What are the Benefits of using GraphLab ?

There are multiple benefits of using GraphLab as described below:

Handles Large Data: The data structures of GraphLab can handle large data sets, which results in scalable machine learning. Let’s look at the data structures of GraphLab:
  - SFrame: It is an efficient disk-based tabular data structure which is not limited by RAM. It helps to scale analysis and data processing to large data sets (terabytes), even on your laptop. Its syntax is similar to pandas or R data frames. Each column is an SArray, which is a series of elements stored on disk. This makes SFrames disk-based. I have discussed the methods to work with SFrames in the following sections.
  - SGraph: A graph helps us to understand networks by analyzing relationships between pairs of items. Each item is represented by a vertex in the graph, and relationships between items are represented by edges. To perform graph-oriented data analysis, GraphLab uses the SGraph object. It is a scalable graph data structure which stores vertices and edges in SFrames. To know more about this, please refer to this link. Below is a graph representation of James Bond characters.
Integration with various data sources: GraphLab supports various data sources like S3, ODBC, JSON, CSV, HDFS and many more.
Data exploration and visualization with GraphLab Canvas: GraphLab Canvas is a browser-based interactive GUI which allows you to explore tabular data, summary statistics and bi-variate plots. Using this feature, you spend less time coding for data exploration. This will help you to focus more on understanding the relationships and distributions of variables. I have discussed this part in the following sections.
Feature Engineering: GraphLab has inbuilt options to create new useful features to enhance model performance. These include transformation, binning, imputation, one-hot encoding, tf-idf, etc.
Modeling: GraphLab has various toolkits to deliver easy and fast solutions for ML problems. It allows you to perform various modeling exercises (regression, classification, clustering) in fewer lines of code. You can work on problems like recommendation systems, churn prediction, sentiment analysis, image analysis and many more.
Production automation: Data pipelines allow you to assemble reusable code tasks into jobs and then automatically run them on common execution environments (e.g. Amazon Web Services, Hadoop).
GraphLab Create SDK: Advanced users can extend the capabilities of GraphLab Create using the GraphLab Create SDK. You can define new machine learning models/programs and integrate them with the rest of the package. See the GitHub repository here.
License: It has usage limitations. You can go for a 30-day free trial or a one-year license for the academic edition. To extend your subscription, you’ll be charged (see subscription structure here).
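
To make the SGraph idea from the list above a little more concrete, here is a minimal plain-Python sketch. It does not use GraphLab at all. It only illustrates the idea of keeping a graph as two tables, one for vertices and one for edges, which is similar in spirit to how SGraph stores vertices and edges in SFrames. The vertex ids, the `relation` labels and the `neighbors` helper are all made up for illustration.

```python
# Plain-Python stand-in for a graph stored as two tables (NOT the GraphLab API).
# Vertex table: one row per item. All ids and roles are hypothetical.
vertices = [
    {"id": "A", "role": "agent"},
    {"id": "B", "role": "manager"},
    {"id": "C", "role": "villain"},
]

# Edge table: one row per relationship between a pair of items.
edges = [
    {"src": "A", "dst": "B", "relation": "works_for"},
    {"src": "A", "dst": "C", "relation": "enemy"},
]


def neighbors(vertex_id, edge_rows):
    """Return the ids connected to vertex_id, ignoring edge direction."""
    found = set()
    for row in edge_rows:
        if row["src"] == vertex_id:
            found.add(row["dst"])
        elif row["dst"] == vertex_id:
            found.add(row["src"])
    return sorted(found)


# Build a neighbor list for every vertex from the two tables.
graph_neighbors = {v["id"]: neighbors(v["id"], edges) for v in vertices}
print(graph_neighbors)
```

Because both tables are just rows and columns, the same tabular tooling can be used for the graph and for ordinary data, which is presumably part of the appeal of storing a graph this way.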

How to Install GraphLab?

You can use GraphLab once you have obtained its license. However, you can also get started with the free trial or the academic edition with a one-year subscription. Prior to installation, your machine must fulfill the system requirements to run GraphLab.

System Requirements for GraphLab:

If your system fails to meet the above requirements, you can also use GraphLab Create on the AWS Free Tier.

Steps for Installation:

Register for a free trial. After registration, you will receive a product key
Select your operating system (Auto selection is on) and follow the given instructions
Below is the command line installation instruction (For “Anaconda Python Environment”).

Getting started with GraphLab

After you’ve installed GraphLab successfully, you can access it using “import <library_name>”.

```
import graphlab
# or
import graphlab as gl
```

Here, I’ll demonstrate the use of GraphLab by solving a data science challenge. I have taken the data set from the Black Friday Data Hack.

- Load Data Set: As discussed, SFrame is the tabular data structure available in GraphLab. SFrames can import data in various formats (CSV, JSON, Apache Avro, ODBC, ...). A common data format is the comma-separated values (CSV) file, which I’ll use in this example. Now, I load the train and test data sets:
```
sf_train=graphlab.SFrame('C:/Users/Analytics Vidhya/Desktop/DataSets/Black_Friday/train.csv')
sf_test=graphlab.SFrame('C:/Users/Analytics Vidhya/Desktop/DataSets/Black_Friday/test.csv')
```
- Data Exploration and Visualization: This is a crucial stage of any data modeling exercise. It usually requires lots of code to infer relationships, outliers and missing values. But not in GraphLab. Here, the coding element has been reduced to a minimum, which allows you to focus on decision making. Let’s explore the data set:
  - See top/bottom x records: We can look at the top x records of the data set like we do in pandas, e.g. sf_train.head(3).
  - Unique values of a variable: Again, we have commands similar to pandas to look at the unique values in a variable.
  - Distribution and relationship of variables: You can visualize relationships and distributions with just one command for all variables. I have already discussed Canvas (a web-based interactive visualization platform to explore data). Data in an SFrame can be visualized with SFrame.show(). When we run the command SFrame.show(), it returns a URL which redirects to GraphLab Canvas. By default, the data section remains active and has three tabs, namely:
    - Summary: It shows the number of columns available in the SFrame with a summary of the data. Numeric columns (int and float) show basic summary statistics (num_unique, missing values as num_undefined, min, max, median, mean, std) with a box plot. String columns show the number of unique values, missing values and a table of the most frequent items in the column.
    - Tables: It provides an interactive tabular view of the data inside the SFrame. The paging controls on the left side (image below) allow you to move quickly through the SFrame.
      Click on a variable name to look at its distribution. Below, you can see the frequency table and bar graph for the variable Age.
    - Plot: It helps to create bi-variate plots. Currently, it supports scatter plot, heatmap, bar chart, box plot and line chart. Here, you select variables for the X and Y axes. Then, based on the data types of the variables, it activates the relevant chart option(s). For example, if both the x and y-axis variables are numeric, it activates the scatter plot. In the bar chart, you have options to add other metrics like standard deviation, min, max, sum, mean and others.
      Above, we learnt the steps to create visualizations in a browser. But, you can also visualize this information in your IPython Notebook. This can be done by first setting the target to IPython Notebook and then visualizing the information.
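
If you are curious what the numbers on the Summary tab mean, here is a small plain-Python sketch that computes the same kind of statistics for one numeric column. It does not call GraphLab or Canvas. The sample values, the `summarize` helper and the use of `None` for missing values are my own assumptions, and I have picked the population standard deviation arbitrarily, so treat it as an illustration of the statistics rather than of how Canvas computes them.

```python
import statistics

# Hypothetical numeric column; None marks a missing (undefined) value.
purchase = [8370, 15200, None, 1057, 7969, 15200]


def summarize(column):
    """Compute basic summary statistics for a numeric column."""
    present = [x for x in column if x is not None]
    return {
        "num_unique": len(set(present)),            # distinct non-missing values
        "num_undefined": len(column) - len(present),  # count of missing values
        "min": min(present),
        "max": max(present),
        "median": statistics.median(present),
        "mean": statistics.mean(present),
        "std": statistics.pstdev(present),          # population std (an assumption)
    }


summary = summarize(purchase)
for name, value in summary.items():
    print(name, value)
```

The point of Canvas is that you get all of this, plus the plots, for every column without writing such helpers yourself.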

  - Data Manipulation: You can also perform data manipulation operations with SFrame, such as adding a constant value to all values, concatenating two or more variables, or creating a new variable based on existing variable(s), as shown below:
    - Add a constant value to a variable:
    - Concatenate two strings and store the result in a new variable:
    - Update values of existing variables: This can be done using the apply function. In this data set, I have combined the age buckets greater than 50 using the code below:
```
# Make a change to an existing variable:
# combine all age bins greater than 50 into a single '50+' bucket
def combine_age(age):
    if age == '51-55':
        return '50+'
    elif age == '55+':
        return '50+'
    else:
        return age

# Apply the function to every element of the 'Age' column
sf_train['Age'] = sf_train['Age'].apply(combine_age)
```
      Now, look at the pre and post visualization of variable “Age”.
      For more details on Data Manipulation using GraphLab, please refer to this link.
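
The three manipulation steps listed above can be sketched in plain Python as follows. This is only an illustration using ordinary lists as stand-ins for SFrame columns, so it runs without GraphLab. The sample values and the names `occupation_plus_5`, `gender_age` and `age_combined` are made up, and the element-wise list comprehensions are my stand-in for column operations, not GraphLab syntax.

```python
# Plain-Python stand-ins for SFrame columns (hypothetical values).
occupation = [10, 16, 7]
gender = ["F", "M", "M"]
age = ["0-17", "51-55", "55+"]

# 1) Add a constant value to every element of a numeric column.
occupation_plus_5 = [x + 5 for x in occupation]

# 2) Concatenate two string columns element by element into a new column.
gender_age = [g + "_" + a for g, a in zip(gender, age)]


def combine_age(age_bucket):
    """Merge all age buckets above 50 into a single '50+' bucket."""
    if age_bucket in ("51-55", "55+"):
        return "50+"
    return age_bucket


# 3) Update an existing column by applying a function to each element.
age_combined = [combine_age(a) for a in age]

print(occupation_plus_5)
print(gender_age)
print(age_combined)
```

In each case the operation is applied to every row of the column at once, which is the same mental model you use with pandas.
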
  - Feature Engineering: Feature engineering is an efficient method to improve model performance. Using this technique, we can create new variable(s) after transformation or manipulation of existing variable(s). In fact, GraphLab has automated this process. It has various transformation options for numerical, categorical, text and image features. Also, you’ll find direct options for feature binning, imputation, one-hot encoding, count threshold, TF-IDF, hasher, tokenizing and others. Let’s look at the imputation of the categorical feature “Product_Category_2” based on “Age” and “Gender” in the “Black Friday” data set.
```
# Create the data:
# the variables on which the imputation is based, plus the variable to impute
# You can look at the algorithms behind the imputation here.
sf_impute = sf_train[['Age', 'Gender', 'Product_Category_2']]

imputer = graphlab.feature_engineering.CategoricalImputer(feature='Product_Category_2')
# Fit and transform on the same data
transformed_sf = imputer.fit_transform(sf_impute)

# Retrieve the imputed values
transformed_sf
```
    Finally, you can add this imputed variable to the original data set.
```
sf_train['Predicted_Product_Category_2']=transformed_sf['predicted_feature_Product_Category_2']
```
    Similarly, you can apply other feature engineering operations to the data set based on your requirements. You can refer to this link for more details.
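
To build some intuition for what imputing a categorical feature from other columns means, here is a deliberately simplified plain-Python sketch. Please note that this is not the algorithm behind GraphLab’s CategoricalImputer. As a stand-in, it fills each missing “Product_Category_2” with the most frequent value seen among rows with the same “Age” and “Gender”. The sample rows and the names `counts` and `imputed` are made up for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical rows; None marks a missing Product_Category_2 value.
rows = [
    {"Age": "26-35", "Gender": "M", "Product_Category_2": 8},
    {"Age": "26-35", "Gender": "M", "Product_Category_2": 8},
    {"Age": "26-35", "Gender": "M", "Product_Category_2": 2},
    {"Age": "26-35", "Gender": "M", "Product_Category_2": None},
    {"Age": "0-17", "Gender": "F", "Product_Category_2": 14},
    {"Age": "0-17", "Gender": "F", "Product_Category_2": None},
]
target = "Product_Category_2"

# Count observed target values within each (Age, Gender) group.
counts = defaultdict(Counter)
for row in rows:
    if row[target] is not None:
        counts[(row["Age"], row["Gender"])][row[target]] += 1

# Fill each missing value with the most frequent value of its group.
# If a group has no observed values at all, the value stays None.
imputed = []
for row in rows:
    value = row[target]
    group = (row["Age"], row["Gender"])
    if value is None and counts[group]:
        value = counts[group].most_common(1)[0][0]
    imputed.append(value)

print(imputed)
```

A group-wise most-frequent fill like this is a common baseline. The GraphLab imputer is documented separately, so refer to its documentation for what it actually does.
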
  - Modeling: At this stage, we make predictions from past data. GraphLab easily creates models for common tasks, such as:
    A) Predicting Numeric Quantities
    B) Building Recommendation Systems
    C) Clustering Data and Documents
    D) Analyzing Graphs

In the Black Friday challenge, we are required to predict the numeric quantity “Purchase”, i.e. we need a regression model to predict “Purchase”.

In GraphLab, we have three types of regression models:
A) Linear Regression
B) Random Forest Regression
C) Gradient Boosted Regression
If you have any confusion in algorithm selection, GraphLab takes care of that. Don’t worry. It selects the right regression model automatically.

```
# Make a train-test split
train_data, validate_data = sf_train.random_split(0.8)

# Automatically picks the right model based on your data.
model = graphlab.regression.create(train_data, target='Purchase',
                                   features=['Gender', 'Age', 'Occupation', 'City_Category',
                                             'Stay_In_Current_City_Years', 'Marital_Status',
                                             'Product_Category_1'])

# Save predictions to an SArray
predictions = model.predict(validate_data)

# Evaluate the model and save the results into a dictionary
results = model.evaluate(validate_data)
results
```

Output:

```
{'max_error': 13377.561969523947, 'rmse': 3007.1225949345117}
```

```
# Do prediction on test data set
final_predictions = model.predict(sf_test)
```
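
To see what the split and the two reported metrics mean, here is a small plain-Python sketch that runs without GraphLab. I am assuming that “rmse” is the root mean squared error and “max_error” is the largest absolute difference between actual and predicted values. The purchase amounts are made up, the shuffle-and-cut split is my own simplification rather than how random_split works internally, and the “model” is just a baseline that always predicts the training mean.

```python
import math
import random

# Hypothetical purchase amounts standing in for the 'Purchase' column.
purchases = [8370, 15200, 1422, 1057, 7969, 15227, 19215, 15854, 15686, 7871]

# Roughly 80/20 train-validation split: shuffle a copy, then cut.
rng = random.Random(0)  # fixed seed so the split is repeatable
shuffled = purchases[:]
rng.shuffle(shuffled)
cut = int(0.8 * len(shuffled))
train, validate = shuffled[:cut], shuffled[cut:]

# Baseline "model": always predict the mean purchase of the training set.
baseline = sum(train) / len(train)
predictions = [baseline for _ in validate]


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def max_error(actual, predicted):
    """Largest absolute difference between actual and predicted values."""
    return max(abs(a - p) for a, p in zip(actual, predicted))


results = {
    "max_error": max_error(validate, predictions),
    "rmse": rmse(validate, predictions),
}
print(results)
```

A baseline like this is also a handy sanity check: a real regression model should give a noticeably lower rmse on the validation data than simply predicting the mean.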

To know more about other modeling techniques like clustering, classification, recommendation systems, text analysis and graph analysis, you can refer to this link. Alternatively, here is the complete user guide by Dato.

End Notes

In this article, we learnt about “GraphLab Create”, which helps to handle large data sets while building machine learning models. We also looked at the data structures of GraphLab, such as “SFrame” and “SGraph”, which enable it to handle large data sets. I’d recommend you to use GraphLab. You’d love its automated features like data exploration (Canvas, the interactive web data exploration tool), feature engineering, selecting the right models and deployment.

For better understanding, I’ve also demonstrated a modeling exercise using GraphLab. In my next article on GraphLab, I will focus on Graph Analysis and Recommendation System.

Did you find this article helpful ? Share with us your experience with GraphLab.

If you like what you just read & want to continue your analytics learning, subscribe to our emails, follow us on twitter or like our facebook page.

Sunil Ray

Sunil Ray is Chief Content Officer at Analytics Vidhya, India's largest Analytics community. I am deeply passionate about understanding and explaining concepts from first principles. In my current role, I am responsible for creating top notch content for Analytics Vidhya including its courses, conferences, blogs and Competitions.

I thrive in fast paced environment and love building and scaling products which unleash huge value for customers using data and technology. Over the last 6 years, I have built the content team and created multiple data products at Analytics Vidhya.

Prior to Analytics Vidhya, I have 7+ years of experience working with several insurance companies like Max Life, Max Bupa, Birla Sun Life & Aviva Life Insurance in different data roles.

Industry exposure: Insurance, and EdTech

Major capabilities: Content Development, Product Management, Analytics, Growth Strategy.

Punardeep Singh

Hi Sunil, Very good write up, just wanted to add that currently Carlos Guestrin is offering a Machine Learning specialization on Coursera and they are using Graphlab for all the exercises. Just in case if any one is interested. Also, I wish to highlight that Graphlab has its own issues like no 32-bit OS support etc.

Krishna Mohan

Carlos Guestrin along with Emily Fox conducts the Machine Learning Specialization MOOC on Coursera. Both are currently Professors at University of Washington in Seattle. I just completed the first course - getting ready to start the course on Regression Analysis. Enrolling in the course gives you one-year access to Graphlab.

Very useful. Would be great to see something similar on other ML and data science tools and platforms, and also some comparison between them. There are so many options available these days its difficult to separate wheat from the chaff, and also to know which ones will still be around in 5y time and hence are worth investing (both time and money) in. Thanks!
