Getting started with Machine Learning in MS Excel using XLMiner

Analytics Vidhya Last Updated : 26 Jun, 2020

Introduction

Machine learning is about building a ‘machine’ that ‘learns’ from experience and gets better with it, just as humans do. Companies like Google, Facebook and Microsoft use machine learning techniques at a large scale.

However, one common misconception is that you need to learn coding to start machine learning. Coding becomes necessary for anyone who does machine learning seriously, but not for getting started. You can use a GUI-driven tool like Weka, or even Excel, to start with machine learning.

Here, I’ll introduce you to a simpler way to get started with Machine Learning.

Do you find coding hard to understand?

Machine learning demands strong coding and algorithmic skills. That’s why people with a computer science degree find it relatively easier to succeed in the machine learning domain.

But the scenario has changed. Although you can’t escape coding completely, you can still get started with machine learning without it and brush up your coding skills later.

The good news is that you can now start machine learning using Microsoft Excel. Yes, you heard it right!

Frontline Solvers has introduced the ‘XLMINER DATA MINING’ add-in for MS Excel. It is an easy-to-use tool, made for professionals, for data visualization, forecasting and data mining. You’ll find it easy to use if:

You’ve worked with MS Excel in the past
You have working experience with SPSS

What tasks can XLMiner do?

I knew this was coming. Well! XLMiner can do a lot of the things you would do in R, Python or Julia, without writing a single line of code. It covers a wide range of machine learning and data mining tasks. XLMiner supports Excel 2007, Excel 2010 and Excel 2013 (32-bit and 64-bit). Here is the list of tasks that can be done using XLMiner:

Data Exploration and Visualization
Feature Engineering
Text Mining
Time Series Analysis
Machine Learning
- Regression
- Classification
- Clustering
- Ensemble Modeling
- Neural Networks

Note: It is not available for free. You can download it for a 15-day trial period and later purchase a two-year license for $2495.

In this article, I’ll demonstrate the steps to perform regression, classification and clustering in Excel. I’d recommend working on small data sets, as Excel might crash on large ones. It works well on data sets like Titanic.

To get the best from this article, you should have or gain basic knowledge of these algorithms. If you need a quick refresher on machine learning, I recommend this tutorial: Essentials of Machine Learning Algorithms

I’ve installed XLMiner. After installation, you will notice XLMINER appearing among the main tabs of the Excel ribbon. You can also watch this overview of the XLMiner platform.

Let’s get started!

Tutorial: Multiple Linear Regression

Regression is not a big deal. You can also perform it using the ‘data analysis tool pack’ add-in available in Excel, which is good for statistical analysis. For machine learning, you would need XLMiner. Here I demonstrate multiple regression using XLMiner. For simple linear regression, all the steps remain the same except that you select only one independent variable for modeling. The steps are as follows:

1. I’ve used the Boston Housing data set. This data represents housing prices in Boston based on various influencing factors. You can load the data set using: Help -> Examples -> Boston Housing.

2. Here is the data set.

3. There are no missing values in this data set. However, this add-in provides a convenient option to deal with missing values. You can access this option from here.

Simply select the variables that contain missing values. If missing values are represented by ‘null’, ‘N/A’ or any other marker, specify it. Finally, choose the treatment method and you are done.

4. Now we’ll do feature selection. MEDV is the response variable. MEDV represents the median value of owner-occupied homes in units of $1000.

5. Use Shift + Click to select all independent variables at once. Send MEDV to Output variable. Click Next.

6. Select correlation filters. I’ve selected all three. Click Next.

7. Now select features. Let’s find the top 5 most important predictor variables. Click Finish.

8. Here is the variable importance chart. We see that LSTAT is the most important variable, followed by RM, PTRATIO, INDUS and TAX.

9. Close this chart. You will see the Output Navigator, which helps you move between the various output sheets. Let’s check out ‘selected predictors’.

10. Here are the selected predictors. Let’s proceed to build a regression model using these variables.

11. Prior to modeling, let’s divide (partition) this data into training and validation sets.

12. On the basis of feature selection, select the variables to be included in the partition. Leave the rest at their default values and click OK.

13. And here we’ve got the training data set ready for modeling.

14. Click on any cell in Selected Variables and proceed to build a multiple regression model. Click Multiple Linear Regression.

15. Select the set of predictor and response variables. Click Next.

16. Select your required metrics. Click Finish.

17. Your multiple linear regression model is ready. Use the output navigator to access different metrics and model accuracy.
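
If you are curious what these clicks do behind the scenes, the workflow above (rank predictors by correlation, partition the data, fit a multiple linear regression, check the error on the validation set) can be roughly sketched in plain Python. This is only an illustration under assumptions: the data is synthetic rather than the Boston Housing table, the names x1, x2 and x3 are made up, the 60/40 split is my choice, and XLMiner's own feature selection and fitting routines may work differently.

```python
import random

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def fit_ols(rows, y):
    """Ordinary least squares via the normal equations; coef[0] is the intercept."""
    X = [[1.0] + list(r) for r in rows]
    p = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * t for r, t in zip(X, y)) for i in range(p)]
    return solve(xtx, xty)

def predict(coef, row):
    """Predicted response for one row of predictor values."""
    return coef[0] + sum(c * v for c, v in zip(coef[1:], row))

# Synthetic stand-in for a housing table (hypothetical): x3 has no effect on the response.
rng = random.Random(0)
names = ["x1", "x2", "x3"]
rows, target = [], []
for _ in range(200):
    r = [rng.uniform(0, 10) for _ in names]
    rows.append(r)
    target.append(3.0 + 2.0 * r[0] - 1.5 * r[1] + rng.gauss(0, 0.5))

# Feature selection: keep the 2 predictors most correlated with the response.
score = {n: abs(pearson([r[i] for r in rows], target)) for i, n in enumerate(names)}
selected = sorted(names, key=score.get, reverse=True)[:2]
cols = [names.index(n) for n in selected]

# Partition: 60% training, 40% validation (the split ratio is an assumption).
idx = list(range(len(rows)))
rng.shuffle(idx)
cut = int(0.6 * len(idx))
train, valid = idx[:cut], idx[cut:]

coef = fit_ols([[rows[i][c] for c in cols] for i in train], [target[i] for i in train])

# Root mean squared error on the held-out validation rows.
errors = [target[i] - predict(coef, [rows[i][c] for c in cols]) for i in valid]
rmse = (sum(e * e for e in errors) / len(errors)) ** 0.5
print(selected, [round(c, 2) for c in coef], round(rmse, 3))
```

The validation error is the number to watch here: a model that looks good on the training rows but poorly on the validation rows is overfitting, which is why the data is partitioned before modeling.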

Tutorial: Logistic Regression

Logistic regression is a classic example of a classification algorithm. As with multiple linear regression, below are the steps to build a logistic regression model. If you wish to quickly refresh your logistic regression concepts, you can refer to this tutorial: Simple Guide to Logistic Regression

1. Load the data set ‘Charles_bookclub’. On the XLMiner ribbon, click on Help -> Examples and select this data set. It contains information about individuals who are members of a book club. We’ll build a model for predicting whether a person will purchase a book about the city of Florence based on past purchases.

2. Now, we’ll divide the data set into training (70%) and validation (30%). This time you need to specify the percentages for the partition. Click OK.

3. You’ll see a data partition sheet. Click on any cell in the ‘selected variables’ table and click on logistic regression as shown.

4. Here you select the input and output variables. Florence is the output variable: it takes the value 1 when a customer purchased a book about the city of Florence and 0 otherwise. Here 1 denotes success and 0 failure, as indicated in the option below. Leave the rest at their default values. Click Next.

5. Set the confidence level to 95%. Ticking ‘Force constant term to zero’ omits the constant term from the regression, so leave it unticked. Click on Advanced and tick ‘perform collinearity diagnostics’. This displays information that helps in dealing with correlated variables, which tend to have large standard errors. Click OK. Now, click Variable Selection.

6. Variable Selection helps us deal with a large number of predictor variables and find the best among them. ‘Maximum size of best subset’ takes a value from 1 to N, where N is the number of input variables. We’ll not change this value. For the selection procedure, you can choose any option you prefer. I’ve chosen ‘Best Subsets’ as it searches all combinations of variables and selects only the best-fitting ones. Click OK. Click Next.

7. Now we’ll select the outputs required to evaluate the model. Select Covariance matrix of coefficients and Residuals. Residuals will produce a table of fitted values and their residuals in the output. Click Finish.

8. Here is your logistic regression model. If you scroll down this sheet, you’ll find various metrics useful for evaluating the model’s performance. A commonly used tool to check a model’s accuracy is the confusion matrix. As you scroll down, you’ll find it.
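
To make the confusion matrix less of a black box, here is a minimal Python sketch of the same idea: partition the data 70/30, fit a logistic regression with a constant term, and count the four cells of the confusion matrix on the validation rows. Treat it as an illustration under assumptions: the data is synthetic (not Charles_bookclub), the 0.5 cutoff and the plain gradient descent fit are my choices, and XLMiner's own estimation procedure may differ.

```python
import math
import random

def sigmoid(z):
    # Clamp z so math.exp cannot overflow on extreme inputs.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(w, row):
    """Probability of class 1; w[0] is the constant term."""
    return sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], row)))

def fit_logistic(rows, labels, lr=0.5, epochs=300):
    """Batch gradient descent on the average log-loss."""
    w = [0.0] * (len(rows[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for row, y in zip(rows, labels):
            err = predict_proba(w, row) - y
            grad[0] += err
            for j, xi in enumerate(row):
                grad[j + 1] += err * xi
        w = [wi - lr * g / len(rows) for wi, g in zip(w, grad)]
    return w

# Synthetic stand-in for the book club table (hypothetical): two numeric
# predictors and a binary outcome where 1 means "purchased".
rng = random.Random(1)
rows, labels = [], []
for _ in range(400):
    x1, x2 = rng.gauss(0, 1), rng.gauss(0, 1)
    p = sigmoid(-0.5 + 2.0 * x1 - 1.0 * x2)
    rows.append([x1, x2])
    labels.append(1 if rng.random() < p else 0)

# Partition: 70% training, 30% validation, as in step 2.
idx = list(range(len(rows)))
rng.shuffle(idx)
cut = int(0.7 * len(idx))
train, valid = idx[:cut], idx[cut:]

w = fit_logistic([rows[i] for i in train], [labels[i] for i in train])

# Confusion matrix on the validation rows with a 0.5 cutoff; 1 is "success".
tp = fp = tn = fn = 0
for i in valid:
    pred = 1 if predict_proba(w, rows[i]) >= 0.5 else 0
    if pred == 1 and labels[i] == 1:
        tp += 1
    elif pred == 1:
        fp += 1
    elif labels[i] == 1:
        fn += 1
    else:
        tn += 1
accuracy = (tp + tn) / len(valid)
print("TP", tp, "FP", fp, "FN", fn, "TN", tn, "accuracy", round(accuracy, 3))
```

Accuracy is simply the share of validation rows that land on the diagonal of the matrix (true positives plus true negatives). Moving the cutoff away from 0.5 trades false positives against false negatives.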

Tutorial: k-Means Clustering

If you are new to clustering, here is your quick refresher to Clustering Analysis. In simple words, clustering is a technique for grouping observations with similar attributes. It is generally used for customer profiling and for tailoring products to customers’ needs.

Let’s look at the steps to perform k-means clustering in XLMiner.

1. Load the data set Wine. Go to the XLMiner ribbon, click Help -> Examples and select Wine. In this data set, each row represents a sample of wine belonging to one of 3 classes (A, B and C). On the basis of this data, we’ll build a clustering model to determine the class of wine. Here is the data set.

2. Click on any cell in the data set. Then, click on k-means clustering.

3. Type is the output variable. Hence, we’ll select all variables except Type to be used in clustering. Click Next.

4. Let’s set the number of clusters to 8, because with a larger number of clusters the sum of squared errors (SSE) stays small. SSE is defined as the sum of the squared distances between each member of a cluster and its centroid. You can set any value of k and compare the outputs to check which one is best. Setting the number of random starts to, say, 5 lets the algorithm build the model from 5 different random starting points. XLMiner will then generate 5 cluster sets and produce the output from the best one. Leave the default values for the rest and click Next.

5. Leave the values as default. Click Finish.

6. Here is your clustering model. Check out the various evaluation metrics to assess the quality of this model.

Random Starts Summary: This table identifies the best start, the one with the lowest sum of squared distances. In this case, start #1 is the best. Once the best start is determined, the remaining output of the model is generated using it as the starting point.

Cluster Centers: Here you will find two boxes. The lower box shows the distances between cluster centroids. The larger the distance, the more different the clusters are. For example, the distance between cluster 4 and cluster 8 is 1176.59, which suggests these clusters are very different. The upper box shows the variable values at the cluster centers.

Data Summary: It shows the average distance of observations from the center of their cluster. We can infer that cluster 2 has the lowest average distance from its centroid and cluster 6 the highest.

7. Click on sheet KMC_Clusters. Here you’ll find the predicted clusters. Check Record ID 1. It has been assigned to cluster 6 because its distance to the centroid of cluster 6 is the smallest. Similarly, all other observations have been assigned to their nearest cluster.
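
The ideas used above (SSE, random starts, and assigning each record to its nearest centroid) can be sketched in a few lines of plain Python. This is a rough illustration under assumptions: the points are synthetic 2-D blobs rather than the Wine data, I use k = 3 and 5 random starts, and the helper names are made up. XLMiner's implementation may differ in its details.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [min(range(len(centroids)), key=lambda c: sq_dist(p, centroids[c]))
            for p in points]

def kmeans_once(points, k, rng, max_iter=100):
    """One k-means run from a random start; returns (sse, centroids, labels)."""
    centroids = rng.sample(points, k)
    for _ in range(max_iter):
        labels = assign(points, centroids)
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            # An empty cluster keeps its previous centroid.
            updated.append(tuple(sum(v) / len(members) for v in zip(*members))
                           if members else centroids[c])
        if updated == centroids:
            break
        centroids = updated
    labels = assign(points, centroids)
    sse = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return sse, centroids, labels

# Synthetic stand-in for the Wine table (hypothetical): three 2-D blobs of 30 points.
rng = random.Random(2)
points = [(rng.gauss(cx, 1.0), rng.gauss(cy, 1.0))
          for cx, cy in [(0, 0), (10, 10), (0, 10)] for _ in range(30)]

# Five random starts; the run with the lowest SSE is kept as the best start.
runs = [kmeans_once(points, 3, rng) for _ in range(5)]
best_sse, centroids, labels = min(runs, key=lambda run: run[0])

print("SSE per start:", [round(run[0], 1) for run in runs])
print("record 1 assigned to cluster", labels[0] + 1)
```

Different starts can settle on different cluster sets, which is why several starts are tried and only the one with the lowest SSE is reported.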

End Notes

I wrote this tutorial just to get you started with machine learning in Excel. Once you understand these algorithms, you can easily use them in R, Python or any other programming language. Since many of us have worked with Excel at some point, it shouldn’t be difficult to understand these concepts there. If you get stuck, you can refer to the Help option in the XLMiner ribbon. The documentation is helpful and easy to understand.

Now that you know the steps, I’d suggest you spend time interpreting the model and iterating on it to get the best fit. Excel might slow down with large data sets, so work on small data sets to save time while learning.

Did you find this article useful? Have you ever worked on XLMiner? I’d love to hear your experiences and suggestions in the comments section below.

If you like what you just read & want to continue your analytics learning, subscribe to our emails, follow us on twitter or like our facebook page.

Analytics Vidhya

Analytics Vidhya Content team
