22 Widely Used Data Science and Machine Learning Tools in 2020

Ram Dewani Last Updated : 12 Jul, 2020

11 min read

Overview

There are a plethora of data science tools out there – which one should you pick up?
Here’s a list of over 20 data science tools catering to different stages of the data science lifecycle

Introduction

What are the best tools for performing data science tasks? And which tool should you pick up as a newcomer in data science?

I’m sure you’ve asked (or searched for) these questions at some point in your own data science journey. These are valid questions! There is no shortage of data science tools in the industry. Picking one for your journey and career can be a tricky decision.

Let’s face it – data science is a vast spectrum and each of its domains requires handling of data in a unique way that leads many analysts/data scientists into confusion. And if you’re a business leader, you would come across crucial questions regarding the tools you and your company choose as it might have a long term impact.

So again, the question is which data science tool should you choose?

In this article, I will be attempting to clear this confusion by listing down widely used tools used in the data science space broken down by their usage and strong points. So let us get started!

And if you’re a newcomer to machine learning and/or business analytics, or are just getting started, I encourage you to leverage an incredible initiative by Analytics Vidhya called UnLock 2020. Covering two comprehensive programs – Machine Learning Starter Program and the Business Analytics Starter Program – this initiative is time-bound so you’d need to enroll as soon as you can to give your data science career a massive boost!

Diving into Big Data – Tools for handling Big Data
- Volume
- Variety
- Volume
Tools for Data Science
- Reporting and Business Intelligence
- Predictive Modelling and Machine Learning
- Artificial Intelligence

Data Science Tools for Big Data

To truly grasp the meaning behind Big Data, it is important that we understand the basic principles that define the data as big data. These are known as the 3 V’s of big data:

Volume
Variety
Velocity

Tools for Handling Volume

As the name suggests, volume refers to the scale and the amount of data. To understand the scale of the data I’m talking about, you need to know that over 90% of the data in the world was created in just the last two years!

Over the decade, with the increase in the amount of data, the technology has also become better. The decrease in computational and storage costs has made collecting and storing huge amounts of data far easier.

The volume of the data defines whether it qualifies as big data or not.

When we have data ranging from 1Gb to around 10Gb, the traditional data science tools tend to work well in these cases. So what are these tools?

Microsoft Excel – Excel prevails as the easiest and most popular tool for handling small amounts of data. The maximum amount of rows it supports is just a shade over 1 million and one sheet can handle only up to 16,380 columns at a time. These numbers are simply not enough when the amount of data is big.

Microsoft Access – It is a popular tool by Microsoft that is used for data storage. Smaller databases up to 2Gb can be handled smoothly with this tool but beyond that, it starts cracking up.

SQL – SQL is one of the most popular data management systems which has been around since the 1970s. It was the primary database solution for a few decades. SQL still remains popular but there’s a drawback – It becomes difficult to scale it as the database continues to grow.

We have covered some of the basic tools so far. It is time to unleash the big guns now! If your data is greater than 10Gb all the way up to storage greater than 1Tb+, then you need to implement the tools I’ve mentioned below:

Hadoop – It is an open-source distributed framework that manages data processing and storage for big data. You are likely to come across this tool whenever you build a machine learning project from scratch.

Hive – It is a data warehouse built on top of Hadoop. Hive provides a SQL-like interface to query the data stored in various databases and file systems that integrate with Hadoop.

Tools for Handling Variety

Variety refers to the different types of data that are out there. The data type may be one of these – Structured and Unstructured data.

Let us go through the examples falling under the umbrella of these different data types:

Take a moment to observe these examples and correlate them with your real-world data.

As you might have observed in the case of Structured data, there is a certain order and structure to these data types whereas in the case of unstructured data, the examples do not follow any trend or pattern. For example, customer feedback may vary in length, sentiments, and other factors. Moreover, these types of data are huge and diverse.

It can be very challenging to tackle this type of data, so what are the different data science tools available in the market for managing and handling these different data types?

The two most common databases are SQL and NoSQL. SQL has been the market-dominant players for a number of years before NoSQL emerged.

Some examples for SQL are Oracle, MySQL, SQLite, whereas NoSQL consists of popular databases like MongoDB, Cassandra, etc. These NoSQL databases are seeing huge adoption numbers because of their ability to scale and handle dynamic data.

Tools for Handling Velocity

The third and final V represents the velocity. This is the speed at which the data is captured. This includes both real-time and non-real-time data. We’ll be talking mainly about the real-time data here.

We have a lot of examples around us that capture and process real-time data. The most complex one is the sensor data collected by self-driving cars. Imagine being in a self-driving car – the car has to dynamically collect and process data regarding its lane, distance from other vehicles, etc. all at the same time!

Some other examples of real-time data being collected are:

CCTV
Stock trading
Fraud detection for credit card transaction
Network data – social media (Facebook, Twitter, etc.)

Did you know?

More than 1Tb of data is generated during each trade session at the New York stock exchange!

Now, let’s head on to some of the commonly used data science tools to handle real-time data:

Apache Kafka – Kafka is an open-source tool by Apache. It is used for building real-time data pipelines. Some of the advantages of Kafka are – It is fault-tolerant, really quick, and used in production by a large number of organizations.

Apache Storm – This tool by Apache can be used with almost all the programming languages. It can process up to 1 Million tuples per second and it is highly scalable. It is a good tool to consider for high data velocity.

Amazon Kinesis – This tool by Amazon is similar to Kafka but it comes with a subscription cost. However, it is offered as an out-of-the-box solution which makes it a very powerful option for organizations.

Apache Flink – Flink is yet another tool by Apache that we can use for real-time data. Some of the advantages of Flink are high performance, fault tolerance, and efficient memory management.

Now that we have a solid grasp on the different tools commonly being used for working with Big Data, let’s move to the segment where you can take advantage of the data by applying advanced machine learning techniques and algorithms.

Widely Used Data Science Tools

If you’re setting up a brand new data science project, you’ll have a ton of questions in mind. This is true regardless of your level – whether you’re a data scientist, a data analyst, a project manager, or a senior data science executive.

Some of the questions you’ll face are:

Which tools should you use in different domains of data science?
Should I buy licenses for the tools or opt for an open-source one?, and so on

In this section, we will be discussing some of the popular data science tools used in the industry according to different domains.

Data Science is a broad term in itself and it consists of a variety of different domains and each domain has its own business importance and complexity which is beautifully captured in the below image:

The data science spectrum consists of various domains and these domains are represented by their relative complexity and the business value that they provide. Let us take up each one of the points I’ve shown in the above spectrum.

Reporting and Business Intelligence

Let’s begin with the lower end of the spectrum. It enables an organization to identify trends and patterns so as to make crucial strategic decisions. The types of analysis range from MIS, data analytics, all the way over to dashboarding.

The commonly used tools in these domains are:

Excel – It gives a diverse range of options including Pivot tables and charts that let you do analysis in double-quick time. This is, in short, the Swiss Army Knife of data science/analytics tools
QlikView – It lets you consolidate, search, visualize, and analyze all your data sources with just a few clicks. It is an easy and intuitive tool to learn which makes it so popular.

Tableau – It is amongst the most popular data visualization tools in the market today. It is capable of handling large amounts of data and even offers Excel-like calculation functions and parameters. Tableau is well-liked because of its neat dashboard and story interface.

Microstrategy – It is yet another BI tool that supports dashboards, automated distributions, and other key data analytics tasks.

PowerBI – It is a Microsoft offering in the Business Intelligence (BI) space. PowerBI was built to integrate with Microsoft technologies. So if your organization has a Sharepoint or SQL database user, you and your team will love working on this tool.

Google Analytics – Wondering how did Google Analytics make it to this list? Well, digital marketing plays a major role in transforming businesses and there’s no better tool than this to analyze your digital efforts.

Predictive Analytics and Machine Learning Tools

Moving further up the ladder, the stakes just got high in terms of complexity as well as the business value! This is the domain where the bread and butter of most data scientists come from. Some of the types of problems you’ll solve are statistical modeling, forecasting, neural networks, and deep learning.

Let us understand the commonly used tools in this domain:

Python – This is one of the most dominant languages for data science in the industry today because of its ease, flexibility, open-source nature. It has gained rapid popularity and acceptance in the ML community.

R – It is another very commonly used and respected language in data science. R has a thriving and incredibly supportive community and it comes with a plethora of packages and libraries that support most machine learning tasks.

Apache Spark – Spark was open-sourced by UC Berkley in 2010 and has since become one of the largest communities in big data. It is known as the swiss army knife of big data analytics as it offers multiple advantages such as flexibility, speed, computational power, etc.

Julia – It is an upcoming language and is being touted as the successor of Python. It’s still in its nascent stage and it will be interesting to see how it performs in the future.

Jupyter Notebooks – These notebooks are widely used for coding in Python. While it is predominantly used for Python, it also supports other languages such as Julia, R, etc.

The tools we have discussed so far are true open-source tools. You don’t require to pay for them or buy any extra licenses. They have thriving and active communities that maintain and release updates on a regular basis.

Now, we will check out some premium tools that are recognized as industry leaders:

SAS – It is a very popular and powerful tool. It’s prevalently and commonly used in the banking and financial sectors. It has a very high share in private organizations like American Express, JP Morgan, Mu Sigma, Royal Bank of Scotland, etc.

SPSS – Short for Statistical Package for Social Sciences, SPSS was acquired by IBM in 2009. It offers advanced statistical analysis, a vast library of machine learning algorithms, text analysis, and much more.

Matlab – Matlab is really underrated in the organizational landscape but it is widely used in academia and research divisions. It has lost a lot of ground in recent times to the likes of Python, R, and SAS but universities, especially in the US, still teach a lot of undergraduate courses using Matlab.

Common Frameworks for Deep Learning

Deep Learning requires high computational resources and needs special frameworks to utilize those resources effectively. Due to this, you would most likely require a GPU or a TPU.

Let us look at some of the frameworks used for Deep Learning in this section.

TensorFlow – It is easily the most widely used tool in the industry today. Google might have something to do with that!
PyTorch – This super flexible deep learning framework is giving major competition to TensorFlow. PyTorch has recently come into the limelight and was developed by researchers at Facebook
Keras and Caffe are other frameworks used extensively for building deep learning applications

Artificial Intelligence Tools

The era of AutoML is here. If you haven’t heard of these tools, then it is a good time to educate yourself! This could well be what you as a data scientist will be working with in the near future.

Some of the most popular AutoML tools are AutoKeras, Google Cloud AutoML, IBM Watson, DataRobot, H20’s Driverless AI, and Amazon’s Lex. AutoML is expected to be the next big thing in the AI/ML community. It aims to eliminate or reduce the technical side of things so that business leaders can use it to make strategic decisions.

These tools will be able to automate the complete pipeline!

End Notes

We have discussed the data collection engine and the tools required to accomplish the pipeline for retrieval, processing, and storage of data. Data Science consists of a large spectrum of domain and each domain has its own set of tools and frameworks.

Picking your data science tool will often come down to your personal choice, your domain or project, and of course, your organization.

Let me know in the comments about your favorite data science tool or framework that you love to work with!

Ram Dewani

Product Growth Analyst at Analytics Vidhya. I'm always curious to deep dive into data, process it, polish it so as to create value. My interest lies in the field of marketing analytics.

Free Courses

4.7

Generative AI - A Way of Life

Explore Generative AI for beginners: create text and images, use top AI tools, learn practical skills, and ethics.

4.5

Getting Started with Large Language Models

Master Large Language Models (LLMs) with this course, offering clear guidance in NLP and model training made simple.

4.6

Building LLM Applications using Prompt Engineering

This free course guides you on building LLM apps, mastering prompt engineering, and developing chatbots with enterprise data.

4.8

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Explore practical solutions, advanced retrieval strategies, and agentic RAG systems to improve context, relevance, and accuracy in AI-driven applications.

4.7

Microsoft Excel: Formulas & Functions

Master MS Excel for data analysis with key formulas, functions, and LookUp tools in this comprehensive course.

MUID

Used by Microsoft Clarity, to store and track visits across websites.

Expiry: 1 Year

Type: HTTP

_clck

Used by Microsoft Clarity, Persists the Clarity User ID and preferences, unique to that site, on the browser. This ensures that behavior in subsequent visits to the same site will be attributed to the same user ID.

Expiry: 1 Year

Type: HTTP

_clsk

Used by Microsoft Clarity, Connects multiple page views by a user into a single Clarity session recording.

Expiry: 1 Day

Type: HTTP

SRM_I

Collects user data is specifically adapted to the user or device. The user can also be followed outside of the loaded website, creating a picture of the visitor's behavior.

Expiry: 2 Years

Type: HTTP

SM

Use to measure the use of the website for internal analytics

Expiry: 1 Years

Type: HTTP

CLID

The cookie is set by embedded Microsoft Clarity scripts. The purpose of this cookie is for heatmap and session recording.

Expiry: 1 Year

Type: HTTP

SRM_B

Collected user data is specifically adapted to the user or device. The user can also be followed outside of the loaded website, creating a picture of the visitor's behavior.

Expiry: 2 Months

Type: HTTP

_gid

This cookie is installed by Google Analytics. The cookie is used to store information of how visitors use a website and helps in creating an analytics report of how the website is doing. The data collected includes the number of visitors, the source where they have come from, and the pages visited in an anonymous form.

Expiry: 399 Days

Type: HTTP

_ga_#

Used by Google Analytics, to store and count pageviews.

Expiry: 399 Days

Type: HTTP

_gat_#

Used by Google Analytics to collect data on the number of times a user has visited the website as well as dates for the first and most recent visit.

Expiry: 1 Day

Type: HTTP

collect

Used to send data to Google Analytics about the visitor's device and behavior. Tracks the visitor across devices and marketing channels.

Expiry: Session

Type: PIXEL

AEC

cookies ensure that requests within a browsing session are made by the user, and not by other sites.

Expiry: 6 Months

Type: HTTP

G_ENABLED_IDPS

use the cookie when customers want to make a referral from their gmail contacts; it helps auth the gmail account.

Expiry: 2 Years

Type: HTTP

test_cookie

This cookie is set by DoubleClick (which is owned by Google) to determine if the website visitor's browser supports cookies.

Expiry: 1 Year

Type: HTTP

_we_us

this is used to send push notification using webengage.

Expiry: 1 Year

Type: HTTP

WebKlipperAuth

used by webenage to track auth of webenagage.

Expiry: Session

Type: HTTP

ln_or

Linkedin sets this cookie to registers statistical data on users' behavior on the website for internal analytics.

Expiry: 1 Day

Type: HTTP

JSESSIONID

Use to maintain an anonymous user session by the server.

Expiry: 1 Year

Type: HTTP

li_rm

Used as part of the LinkedIn Remember Me feature and is set when a user clicks Remember Me on the device to make it easier for him or her to sign in to that device.

Expiry: 1 Year

Type: HTTP

AnalyticsSyncHistory

Used to store information about the time a sync with the lms_analytics cookie took place for users in the Designated Countries.

Expiry: 6 Months

Type: HTTP

lms_analytics

Used to store information about the time a sync with the AnalyticsSyncHistory cookie took place for users in the Designated Countries.

Expiry: 6 Months

Type: HTTP

liap

Cookie used for Sign-in with Linkedin and/or to allow for the Linkedin follow feature.

Expiry: 6 Months

Type: HTTP

visit

allow for the Linkedin follow feature.

Expiry: 1 Year

Type: HTTP

li_at

often used to identify you, including your name, interests, and previous activity.

Expiry: 2 Months

Type: HTTP

s_plt

Tracks the time that the previous page took to load

Expiry: Session

Type: HTTP

lang

Used to remember a user's language setting to ensure LinkedIn.com displays in the language selected by the user in their settings

Expiry: Session

Type: HTTP

s_tp

Tracks percent of page viewed

Expiry: Session

Type: HTTP

AMCV_14215E3D5995C57C0A495C55%40AdobeOrg

Indicates the start of a session for Adobe Experience Cloud

Expiry: Session

Type: HTTP

s_pltp

Provides page name value (URL) for use by Adobe Analytics

Expiry: Session

Type: HTTP

s_tslv

Used to retain and fetch time since last visit in Adobe Analytics

Expiry: 6 Months

Type: HTTP

li_theme

Remembers a user's display preference/theme setting

Expiry: 6 Months

Type: HTTP

li_theme_set

Remembers which users have updated their display / theme preferences

Expiry: 6 Months

Type: HTTP

Luis

Hi, great article. Thank you. Where would you put KNIME and Alteryx in your Data Science Spectrum?

Ali

Sir ! I really enjoyed and learn a lot about different tools used for different domains . It is really helpful to me for choosing the right one according to my domain.

Sri

Interesting and informative post. Very cleanly devided.

Reading list

Basics of Machine Learning

Machine Learning Lifecycle

Importance of Stats and EDA

Understanding Data

Probability

Exploring Continuous Variable

Exploring Categorical Variables

Missing Values and Outliers

Central Limit theorem

Bivariate Analysis Introduction

Continuous - Continuous Variables

Continuous Categorical

Categorical Categorical

Multivariate Analysis

Different tasks in Machine Learning

Build Your First Predictive Model

Evaluation Metrics

Preprocessing Data

Linear Models

KNN

Selecting the Right Model

Feature Selection Techniques

Decision Tree

Feature Engineering

Naive Bayes

Multiclass and Multilabel

Basics of Ensemble Techniques

Advance Ensemble Techniques

Hyperparameter Tuning

Support Vector Machine

Advance Dimensionality Reduction

Unsupervised Machine Learning Methods

Recommendation Engines

Improving ML models

Working with Large Datasets

Interpretability of Machine Learning Models

Automated Machine Learning

Model Deployment

Deploying ML Models

Embedded Devices

22 Widely Used Data Science and Machine Learning Tools in 2020

Overview

Introduction

Table of Contents

Data Science Tools for Big Data

Tools for Handling Volume

Tools for Handling Variety

Tools for Handling Velocity

Widely Used Data Science Tools

Reporting and Business Intelligence

Predictive Analytics and Machine Learning Tools

Common Frameworks for Deep Learning

Artificial Intelligence Tools

End Notes

Login to continue reading and enjoy expert-curated content.

Free Courses

Generative AI - A Way of Life

Getting Started with Large Language Models

Building LLM Applications using Prompt Engineering

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Microsoft Excel: Formulas & Functions

Recommended Articles

Responses From Readers

Write for us

Analytics Vidhya (4)

brahmaid

csrftoken

Identityid

sessionid

Google (1)

g_state

Microsoft (7)

MUID

_clck

_clsk

SRM_I

SM

CLID

SRM_B