With the growing demand for data science, we need tools that take the difficulty out of doing data science and make it fun. Not everyone is willing to learn coding, even if they want to learn and apply data science. This is where GUI-based tools can come in handy.
Today, I will introduce you to another GUI-based tool: Orange. This tool is great for beginners who wish to visualize patterns and understand their data without really knowing how to code.
In my previous article, I presented you with another GUI-based tool, KNIME. If you do not want to learn to code but still want to apply data science, you can try out either of these tools.
By the end of this tutorial, you'll be able to use Orange to predict which people from a given set are eligible for a loan!
Orange is a platform built for data mining and analysis through a GUI-based workflow. This means you do not have to know how to code to mine data, crunch numbers and derive insights using Orange.
You can perform tasks ranging from basic visuals to data manipulations, transformations, and data mining. It consolidates all the functions of the entire process into a single workflow.
The best part about Orange, and its key differentiator, is its wonderful visuals. You can try silhouette plots, heat maps, geo maps and all sorts of other visualizations.
Orange comes bundled with the Anaconda distribution if you've previously installed it. If not, follow these steps to download Orange.
Step 1: Go to https://orange.biolab.si and click on Download.
Step 2: Install the platform and set the working directory for Orange to store its files.
This is what the start-up page of Orange looks like. You have options that allow you to create new projects, open recent ones or view examples and get started.
Before we delve into how Orange works, let's define two key terms to help our understanding: a widget is the basic processing unit of Orange, performing a single task such as loading a file, visualizing data or training a model; a workflow is the sequence of connected widgets that together solve a problem.
You can also go to “Example Workflows” on your start-up screen to check out more workflows once you have created your first one.
For now, click on “New” and let’s start building your first workflow.
This is the first step towards building a solution to any problem. We first need to understand what steps to take in order to achieve our final goal. After you click on "New" in the step above, this is what you should see.
This is your blank Workflow on Orange. Now, you’re ready to explore and solve any problem by dragging any widget from the widget menu to your workflow.
Orange is a platform that can help us solve most data science problems today, covering everything from the most basic visualizations to training models; you can even evaluate models and perform unsupervised learning on datasets.
The problem we’re looking to solve in this tutorial is the practice problem Loan Prediction that can be accessed via this link on Datahack.
We begin with the first and most necessary step towards understanding our data and making predictions: importing the data.
Step 1: Click on the “Data” tab on the widget selector menu and drag the widget “File” to our blank workflow.
Step 2: Double click the “File” widget and select the file you want to load into the workflow. In this article, as we will be learning how to solve the practice problem Loan Prediction, I will import the training dataset from the same.
Step 3: Once you can see the structure of your dataset using the widget, go back by closing this menu.
Step 4: Now that we have loaded the raw .csv file, we need to convert it into a format we can use for our mining. Click on the dotted line encircling the "File" widget, drag, and then click anywhere in the blank space.
Step 5: As we need a data table to better visualize our findings, we click on the “Data Table” widget.
Step 6: Now double click the widget to visualize your table.
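As an aside, everything the File and Data Table widgets do can also be scripted, since Orange doubles as a Python library. Here is a minimal sketch, assuming the training file is saved as train.csv in your working directory:

```python
import Orange

# Load the CSV, as the File widget does
data = Orange.data.Table("train.csv")

# Inspect it, as the Data Table widget does
print(len(data), "rows")
print(data.domain)   # column names and types
print(data[:5])      # first five rows
```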
Neat! Isn’t it?
Let’s now visualize some columns to find interesting patterns in our data.
4.3.1 Scatter Plot
Click on the semicircle in front of the “File” widget and drag it to an empty space in the workflow and select the “Scatter Plot” widget.
Once you create a Scatter Plot widget, double click it and explore your data like this! You can select the X and Y axes, colors, shapes, sizes and a lot of other manipulations.
The plot I've explored is Gender by Income, with the colors set to education levels. As we can see, among males the higher income group naturally belongs to the graduates!
Among females, however, we see that a lot of the graduates are earning very little or almost nothing at all. Any specific reason? Let's find out using the scatter plot.
One possible reason I found was marriage. A large number of graduates who were married were found to be in the lower income groups; this may be due to family responsibilities or added commitments. Makes perfect sense, right?
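If you want to double-check this pattern numerically, here is a rough sketch using pandas (assuming the Loan Prediction column names Gender, Education, Married and ApplicantIncome):

```python
import pandas as pd

df = pd.read_csv("train.csv")

# Median income split by gender and education
print(df.groupby(["Gender", "Education"])["ApplicantIncome"].median())

# Among female graduates, does marriage coincide with lower incomes?
female_grads = df[(df["Gender"] == "Female") & (df["Education"] == "Graduate")]
print(female_grads.groupby("Married")["ApplicantIncome"].median())
```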
4.3.2 Distribution
Another way to visualize our distributions would be the “Distributions” widget. Click on the semi-circle again, and drag to find the widget “Distributions”.
Now double click on it and visualize!
What we see is a very interesting distribution: our dataset contains more married males than married females.
4.3.3 Sieve diagram
How does income relate to the education levels? Do graduates get paid more than non-grads?
Let’s visualize using a sieve diagram.
Click and drag from the “File” widget and search for “Sieve Diagram”.
Once you place it, double click on it and select your axes!
This plot divides the distribution into four bins, and each section can be investigated by hovering the mouse over it.
For example, graduates and non-graduates are divided 78% to 22%. Subdivisions of 25% each are then made by splitting the applicant incomes into four equal groups. Here's a task for you: generate insights from these charts and share them in the comments section.
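To reproduce the sieve diagram's binning outside the GUI, here is a rough equivalent in pandas (again assuming the ApplicantIncome and Education columns):

```python
import pandas as pd

df = pd.read_csv("train.csv")

# Split applicant income into 4 equal-sized groups (quartiles),
# mirroring the sieve diagram's subdivisions
df["IncomeBin"] = pd.qcut(df["ApplicantIncome"], q=4)

# Cross-tabulate education against the income quartiles
print(pd.crosstab(df["Education"], df["IncomeBin"], normalize="index"))
```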
Let’s now look at how to clean our data to start building our model.
Here, for cleaning purposes, we will impute missing values. Imputation is a very important step in understanding and making the best use of our data.
Click on the “File” widget and drag to find the “Impute” widget.
When you double click on the widget after placing it, you will see that there are a variety of imputation methods you can use. You can also use default methods or choose individual methods for each class separately.
Here, I have selected the default method to be Average for numerical values and Most Frequent for text based values (categorical).
You can select from a variety of imputation methods, such as average/most frequent, treating a missing entry as a distinct value, random values, model-based imputation, or removing rows with missing values.
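The same imputation is available in Orange's scripting layer. A minimal sketch, assuming Orange3's preprocess module:

```python
import Orange
from Orange.preprocess import Impute

data = Orange.data.Table("train.csv")

# The default Impute method replaces missing numeric values with the
# average and missing categorical values with the most frequent value,
# matching the widget settings chosen above
clean_data = Impute()(data)
```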
Other things you can include in your approach to training the model are feature extraction and generation. For further understanding, follow this article on data exploration and feature engineering: https://www.analyticsvidhya.com/blog/2016/01/guide-data-exploration/
Beginning with the basics, we will first train a linear model encompassing all the features just to understand how to select and build models.
Step 1: First, we need to set a target variable on which to apply logistic regression.
Step 2: Go to the “File” widget and double click it.
Step 3: Now, double click on the Loan_Status column and select it as the target variable. Click Apply.
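In scripting terms, setting the target variable means moving Loan_Status into the class position of the table's domain. A sketch, assuming Loan_Status was loaded as an ordinary column:

```python
import Orange
from Orange.data import Domain, Table

data = Table("train.csv")

# Keep every attribute except Loan_Status, and make Loan_Status the class
attrs = [a for a in data.domain.attributes if a.name != "Loan_Status"]
data = data.transform(Domain(attrs, data.domain["Loan_Status"]))

print(data.domain.class_var)   # should print: Loan_Status
```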
Step 4: Once we have set our target variable, drag from the "Impute" widget, which holds the clean data, and place the "Logistic Regression" widget.
Step 5: Double click the widget and select the type of regularization you want to perform.
For a better understanding of these, please read this article about Ridge and Lasso regression: https://www.analyticsvidhya.com/blog/2016/01/complete-tutorial-ridge-lasso-regression-python/
I have chosen Ridge for my analysis; you are free to choose between the two.
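For reference, the widget's regularization choice maps to the penalty parameter of Orange's logistic regression learner, which wraps scikit-learn. A sketch, continuing from the one above:

```python
from Orange.classification import LogisticRegressionLearner

# penalty="l2" corresponds to Ridge; penalty="l1" would be Lasso
learner = LogisticRegressionLearner(penalty="l2", C=1.0)

# Train on the imputed table with Loan_Status set as the target
model = learner(data)
```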
Step 6: Next, click on the “Impute” or the “Logistic Regression” widget and find the “Test and Score” widget. Make sure you connect both the data and the model to the testing widget.
Step 7: Now, click on the “Test and Score” widget to see how well your model is doing.
Step 8: To visualize the results better, drag and drop from the "Test and Score" widget to find the "Confusion Matrix" widget.
Step 9: Once you’ve placed it, click on it to visualize your findings!
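The Test and Score widget's cross-validation, along with the confusion matrix, can be reproduced in a few lines. A sketch continuing from the snippets above (data with the target set, learner as defined), assuming Orange3's evaluation API; the confusion matrix is built here with scikit-learn from the raw results:

```python
from Orange.evaluation import CrossValidation, CA, AUC
from sklearn.metrics import confusion_matrix

# 10-fold cross-validation, as in the Test and Score widget
results = CrossValidation(data, [learner], k=10)
print("Accuracy:", CA(results))
print("AUC:", AUC(results))

# Actual vs. predicted classes for the first (and only) learner
print(confusion_matrix(results.actual, results.predicted[0]))
```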
This way, you can test out different models and see how accurately they perform.
Let's evaluate how a Random Forest would do. Change the modeling method to Random Forest and look at the confusion matrix.
Looks decent, but the Logistic Regression performed better.
We can try again with a Support Vector Machine.
Better than the Random Forest, but still not as good as the Logistic Regression model.
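Scripted, this model comparison is a single cross-validation over a list of learners. A sketch, under the same assumptions as above:

```python
from Orange.classification import (LogisticRegressionLearner,
                                   RandomForestLearner, SVMLearner)
from Orange.evaluation import CrossValidation, CA

learners = [LogisticRegressionLearner(penalty="l2"),
            RandomForestLearner(),
            SVMLearner()]

# Score all three models in one pass, as Test and Score does
results = CrossValidation(data, learners, k=10)
for learner, accuracy in zip(learners, CA(results)):
    print(learner.name, accuracy)
```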
Sometimes the simpler methods are the better ones, aren't they?
This is how your final workflow would look after you are done with the complete process.
For people who wish to work in groups, you can also export your workflows and send them to friends who can work alongside you!
The resulting file has the .ows extension and can be opened in any other Orange installation.
Orange is a platform that can be used for almost any kind of analysis but, most importantly, for beautiful and easy visuals. In this article, we explored how to visualize a dataset. We also undertook predictive modeling, using logistic regression, SVM and random forest predictors to find the loan status of each applicant.
Hope this tutorial has helped you figure out aspects of the problem that you might not have understood or missed out on before. It is very important to understand the data science pipeline and the steps we take to train a model, and this should surely help you build better predictive models soon!
Wonderful tutorial article. I had tried working with Orange earlier, and your example helped me understand this tool better. I will run this example. For your query on analyzing the sieve diagram: the relation between graduates and non-graduates seems inverse. A higher number of graduates are paid more than 5790 compared to non-graduates, whereas a higher number of non-graduates are paid 2875.5 compared to graduates. So graduates are obviously paid well compared to non-graduates, according to my inference.
Wonderful! This is how data science tools should be! Thanks for the good explanation of Orange. 2 questions... Shouldn't you include how to get a prediction of loan value given a new input to the model?
Here you have how to get prediction https://github.com/furas/orange3-examples/tree/master/wikipedia-bayes-example
You can predict on the new data (test.csv) using the steps below:
a. In the same workflow, include a "File" widget and double click it to select test.csv.
b. Include "Prediction": click on the "File" widget and drag it to include the "Prediction" widget.
c. Feed your trained model to Prediction: select the "Random Forest" widget (which we created for the train dataset) and connect it to the Prediction widget. This way we feed the trained model the input needed to predict the test data.
d. Create the output file: select the Prediction widget and drag the connector to include the "Save Data" widget. There you have the final output file.
Hope this helps.
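In scripting terms, the same prediction steps look roughly like this (a sketch assuming Orange3's Python API and the train.csv/test.csv files from the practice problem):

```python
import Orange
from Orange.data import Domain, Table

# Load training data and make Loan_Status the target, as earlier
train = Table("train.csv")
attrs = [a for a in train.domain.attributes if a.name != "Loan_Status"]
train = train.transform(Domain(attrs, train.domain["Loan_Status"]))

# Train the model and predict on the unseen test set
test = Table("test.csv")
model = Orange.classification.RandomForestLearner()(train)
predictions = model(test)   # one predicted class index per test row

# Show the first few predicted loan statuses
for row, p in zip(test[:5], predictions[:5]):
    print(row["Loan_ID"], train.domain.class_var.values[int(p)])
```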
Wonderful... Found it similar to SAS E-Miner.