The Key Concepts To Investigating Your Dataset

Vamshi krishna Last Updated : 07 Dec, 2020


This article was published as a part of the Data Science Blogathon.

“Don’t jump into modeling. First, understand and explore your data!”

Overview

This is common advice for data scientists. If your data set is messy, building models will not help you solve your problem; the result is “garbage in, garbage out.” To build a powerful machine learning model, we need to explore and understand our data set before we define a predictive task and solve it.

Data analysis is a nuanced discipline, and there are enough ways to slice and dice data to make a beginner’s head spin. A common question data analysts hear is “where do I even start with my analysis?” The hints below offer a methodical way to start uncovering the answers in your data.

Introduction

Data scientists spend most of their time exploring, cleaning, and preparing their data for modeling. This helps them build accurate models and check the assumptions required for fitting those models.

It also lets them create meaningful data visualizations and predict future trends from the data.

If you understand your data and prepare it well, a large part of the work, often put at almost 80%, is already done. The steps covered in this article are:

Ask the right questions
Analyze different subsets of data
Explore trends
Find your blind spots
Investigate the whys

Ask the right questions

Whether it’s survey results, sales data, or an email campaign, you’ve collected data for a specific purpose. Apply that purpose to the questions you ask of the data itself. Beginning with specific questions keeps your research focused and helps you see the forest for the trees. A question like “what does my revenue look like for the past 3 years” is vague; it allows for exploration but also invites confusion.

Instead, something like “which channel brings in the most revenue for the past 3 years” has a clearer answer. Subsequent questions may be: “which department brings in the most revenue per year” or “are sales in climbing gear increasing or decreasing this year?” It’s important to have a specific question in mind when you begin data analysis so as to provide some structure and avoid stumbling into false positives.
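
As a rough illustration, a focused question such as “which channel brings in the most revenue for the past 3 years” maps onto a short filter-and-sum. The records, channel names, and the helper revenue_by_channel below are hypothetical, invented only for this sketch.

```python
from collections import defaultdict

# Hypothetical sales records: (year, channel, revenue). Not real data.
SALES = [
    (2010, "Online", 90.0),
    (2011, "Online", 120.0),
    (2011, "Retail", 200.0),
    (2012, "Online", 150.0),
    (2012, "Retail", 180.0),
    (2013, "Online", 210.0),
    (2013, "Retail", 170.0),
    (2013, "Catalog", 40.0),
]


def revenue_by_channel(records, last_year, n_years=3):
    """Total revenue per channel over the n_years ending at last_year."""
    first_year = last_year - n_years + 1
    totals = defaultdict(float)
    for year, channel, revenue in records:
        if first_year <= year <= last_year:
            totals[channel] += revenue
    return dict(totals)


totals = revenue_by_channel(SALES, last_year=2013)
top_channel = max(totals, key=totals.get)
print(totals)
print("Top channel:", top_channel)
```

Because the question names both the grouping (channel) and the window (3 years), the code has exactly one thing to compute, which is the point of asking a specific question first.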

Analyze different subsets of data

It’s easier to spot relationships if you analyze different subsets of the data. For example, segment your revenue data by channel, as in the chart above, or by department. Experiment with the subsets and variables that make the most sense for the questions you developed in the previous step.

A tool designed for this kind of exploration lets you stay within your train of thought and move smoothly from question to question, without tripping up on formatting or equations. It can also be helpful to use what would be referred to as a pivot table in Excel. In our outdoor gear retailer example, you can switch from a quarter-by-quarter view over time to revenue by quarter of the year just by selecting from a drop-down menu. The graph below then shows the aggregate of each quarter’s revenue between 2010 and 2013.
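
The switch between those two views can be sketched as the same sum grouped by a different key, which is roughly what a pivot table does. The figures and the helper pivot_sum are assumptions made up for illustration, not the retailer’s actual numbers.

```python
from collections import defaultdict

# Hypothetical quarterly revenue: (year, quarter, revenue). Not real data.
QUARTERLY = [
    (2010, "Q1", 10.0), (2010, "Q2", 12.0), (2010, "Q3", 15.0), (2010, "Q4", 20.0),
    (2011, "Q1", 11.0), (2011, "Q2", 13.0), (2011, "Q3", 16.0), (2011, "Q4", 22.0),
]


def pivot_sum(records, key):
    """Sum revenue grouped by whatever key(record) returns."""
    totals = defaultdict(float)
    for record in records:
        totals[key(record)] += record[2]
    return dict(totals)


# View 1: every quarter on the timeline, e.g. (2010, "Q1").
by_period = pivot_sum(QUARTERLY, key=lambda r: (r[0], r[1]))

# View 2: quarter of the year, aggregated across all years.
by_quarter = pivot_sum(QUARTERLY, key=lambda r: r[1])

print(by_period)
print(by_quarter)
```

Only the key function changes between the two views, so trying a new subset costs one line rather than a new analysis.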

Explore trends

Experiment with your time variables. Look at the quarter, month, or week, whichever makes sense for what you’re looking for. What is missing can be just as important as what is there, so if there are holes in your data, take note. It also helps to keep notes throughout your analysis as reminders of what you’d like to research or discuss with colleagues later.

Take a look at this quarterly analysis of revenue by department. It’s not very helpful because it’s hard to spot trends.
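
One simple way to take note of holes is to list the time periods that have no records at all. The sketch below assumes monthly data keyed by (year, month); the counts and the helper missing_months are hypothetical.

```python
# Hypothetical monthly order counts keyed by (year, month). Not real data.
MONTHLY_ORDERS = {
    (2013, 1): 120, (2013, 2): 115, (2013, 3): 130,
    (2013, 5): 140, (2013, 6): 150, (2013, 8): 160,
}


def missing_months(observed, year):
    """Return the months of `year` that have no entry in `observed`."""
    return [month for month in range(1, 13) if (year, month) not in observed]


gaps = missing_months(MONTHLY_ORDERS, 2013)
print("Months with no data:", gaps)
```

A gap found this way is only a prompt for a question, such as whether nothing happened that month or whether the data was never recorded.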

This yearly line graph makes it much easier to see that Climbing is the fastest-growing department and Running sales have been decreasing for the past three years.
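
The step from a quarterly view to a yearly one can be sketched as a roll-up followed by a year-over-year comparison. The department figures below are invented for illustration and only mimic the pattern described (Climbing growing, Running declining); the helper names are assumptions too.

```python
# Hypothetical quarterly revenue per department and year. Not real data.
QUARTERLY_REVENUE = {
    "Climbing": {2011: [10, 11, 12, 13], 2012: [14, 15, 16, 17], 2013: [18, 20, 22, 24]},
    "Running": {2011: [30, 29, 28, 27], 2012: [26, 25, 24, 23], 2013: [22, 21, 20, 19]},
}


def yearly_totals(quarterly):
    """Roll four quarterly values up into one total per department and year."""
    return {
        dept: {year: sum(values) for year, values in years.items()}
        for dept, years in quarterly.items()
    }


def year_over_year_change(totals_by_year):
    """Fractional change of each year's total relative to the previous year."""
    years = sorted(totals_by_year)
    return {
        later: (totals_by_year[later] - totals_by_year[earlier]) / totals_by_year[earlier]
        for earlier, later in zip(years, years[1:])
    }


totals = yearly_totals(QUARTERLY_REVENUE)
for dept, by_year in totals.items():
    changes = year_over_year_change(by_year)
    print(dept, by_year, {year: round(change, 3) for year, change in changes.items()})
```

A positive change in every year suggests growth and a negative one suggests decline; with only a few years of data this is a hint to investigate, not a forecast.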

Find your blind spots

Do you regularly bump up against a particular question that your data can’t answer? If so, you can either find a way to gather that information from your users or at least add it to a data collection wish list for later discussion. There is a fine line between collecting as much data as you need to get answers and frustrating your users with too many questions, so weigh this consideration when deciding how much data you’d like to collect.

Often the data have not actually been collected for the task you are being asked to do, and you are being asked to make the data validate an outcome that has already been decided. Most organizations don’t think scientifically. They don’t create a hypothesis and then decide what data they need to collect to validate it; they choose an outcome, then make the data fit. Often the data come from something else entirely, frequently as a byproduct of a business process. Then someone has the bright idea: “We could use this to work out…”
The graph below illustrates blind spots in a data set. Hidden data is one of the obstacles to reaching a solution, and finding outliers is one way to start addressing it.
Outlier correction based on the R parameter. The leftmost graph shows the original data with detected outliers. The middle graph uses a noise value of zero to place, or correct, the location of the outliers in the linear model. The rightmost graph places the outlier near the linear model at a distance based on a positive value for R (R = 0.5).
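
The caption does not say how the outliers were detected or exactly how R sets the distance, so the following is only a sketch under stated assumptions: outliers are flagged by a median-absolute-deviation rule on the residuals of a least-squares line, and a flagged point is moved to R units (in y) from a line refit on the remaining points. The data and all function names are hypothetical.

```python
from statistics import median


def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def detect_outliers(xs, ys, cutoff=3.5):
    """Indices whose residual is far from the median residual (MAD rule).

    Assumption of this sketch, not the method behind the original figure.
    """
    slope, intercept = fit_line(xs, ys)
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    center = median(residuals)
    mad = median([abs(r - center) for r in residuals])
    scale = 1.4826 * mad  # makes the MAD comparable to a standard deviation
    return [i for i, r in enumerate(residuals) if abs(r - center) > cutoff * scale]


def correct_outliers(xs, ys, r=0.0, cutoff=3.5):
    """Move each flagged point to r units from the line refit on the inliers."""
    flagged = set(detect_outliers(xs, ys, cutoff))
    inliers = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i not in flagged]
    slope, intercept = fit_line(*zip(*inliers))
    corrected = list(ys)
    for i in flagged:
        fitted = slope * xs[i] + intercept
        side = 1.0 if ys[i] >= fitted else -1.0  # stay on the original side
        corrected[i] = fitted + side * r
    return corrected


# Hypothetical data: points on y = 2x + 1 with one corrupted value at x = 5.
XS = list(range(10))
YS = [2 * x + 1 for x in XS]
YS[5] = 30

print("Outlier indices:", detect_outliers(XS, YS))
print("Corrected with R = 0:  ", correct_outliers(XS, YS, r=0.0)[5])
print("Corrected with R = 0.5:", correct_outliers(XS, YS, r=0.5)[5])
```

With r=0.0 the flagged point lands on the refit line, and with r=0.5 it sits half a unit away from it, mirroring the middle and rightmost panels only in spirit.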

Investigate the whys

After your daily, weekly, or quarterly analysis, take your charts, notes, and conclusions to the rest of the team and start trying to piece together as much as you can. The data can tell you what is happening, but not the why. The why requires piecing together the backstory. Because so many factors play into your sales data, coming together with your team to discuss insights from your data can lead to a lot more understanding. The marketing manager may know something about the third quarter’s climbing gear sales that the business analyst didn’t.

Data analysis is a continual process and the best way to approach it is to try to get less and less wrong. You probably won’t ever have all the data you want or need to answer every question about your business, but you can at least push toward more answers and better decisions. This continual feedback loop (question, analyze, investigate, repeat) can be improved but will never be perfect.

Endnotes

Understanding and interpreting data is a crucial step in machine learning. In this blog post, we tried to provide an overview of techniques that can help you get to know your data better.

Depending on the size, dimensionality, and type of your data, you can choose a suitable technique. For instance, when you have a large amount of raw data, you can work with representative examples instead of random samples. If you have a wide data set, you can also identify the most important dimensions to help you understand those representative samples.

Different techniques can give you different insights into your data. It is your job to use the tools to solve the mystery like a detective.
