What does not get discussed about analysis: Chaos to order

Priyanka Last Updated : 11 Oct, 2020


This article was published as a part of the Data Science Blogathon.

Introduction

How can you ensure that the many lines of code you write meet the objective? The critical phase of data analysis falls after you have the dataset and before you set up the environment to write code. There is a science to what data we have access to and whether it will suffice to answer the questions the business is asking. That is a discussion for another time. Here we will focus on what happens once data is available.

An effort that is critical to the success of any analysis exercise is to organize thought.

Organizing data and thought

This is the phase before we set up the environment to run the code. The analysis begins well before we start writing code. The process begins with understanding the question. Map it against the libraries of Python, mathematics, statistics, color, and language. That list is not comprehensive, but it is a good starting point.

Let’s understand this with an example.

“The learning curve is an upward sloping line with a positive delta”

That line will be understood differently by each reader here. Take a pause and note your thoughts. Think about what would be an appropriate visual description, other than words. You are welcome to share your thoughts with Analytics Vidhya and the author even before you finish reading. You can write to [email protected].

There are many layers rolled up in that statement. Towards the end of the article, we will list them. Here we look at how we organize thought and the impact of color on visualizations.

Understanding Data sets

A data set (or dataset) is a collection of data. In a more technical sense, data are a set of values of qualitative or quantitative variables about one or more persons or objects, while a datum (singular of data) is a single value of a single variable. [1]

Tabulating is the first step towards organizing thoughts. It is a representation of something along rows and columns. Tabulation works great for numbers. Markdown cells are a good way to tabulate verbal descriptions.
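As a minimal sketch of tabulation, the snippet below lays a few readings out along rows and columns using only the Python standard library. The variable names (`week`, `learning_effort`, `performance`), the numbers, and the helper `tabulate` are all hypothetical and invented for illustration; they are not the author's data.

```python
# Hypothetical readings: names and numbers are invented for illustration only.
rows = [
    {"week": 1, "learning_effort": 100, "performance": 20},
    {"week": 2, "learning_effort": 400, "performance": 35},
    {"week": 3, "learning_effort": 1000, "performance": 55},
]


def tabulate(records):
    """Return the records as a plain-text table, one column per variable."""
    headers = list(records[0].keys())
    # Each column is as wide as the longest of its header and its cells.
    widths = {
        h: max(len(h), *(len(str(r[h])) for r in records)) for h in headers
    }
    lines = [" | ".join(h.ljust(widths[h]) for h in headers)]
    lines.append("-+-".join("-" * widths[h] for h in headers))
    for r in records:
        lines.append(" | ".join(str(r[h]).rjust(widths[h]) for h in headers))
    return "\n".join(lines)


table = tabulate(rows)
print(table)
```

Once the readings sit in a table like this, each column heading names a variable, which makes it easier to ask what is actually being measured.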

Let’s look at the underlying data. Tabulating and quantifying aid understanding.

[Image in the original: the underlying data, tabulated]

The temptation as a data scientist would be to say: value 1 performance at 25. If your first reaction was “Why? Who said so? Can we draw any conclusion on this basis?”, then congratulations, you are on your way to being a data scientist!

Visualization: Learning Curve: Line Graph

Visualization is a visual description of thought. Graphs are nonverbal descriptions.

Be aware that visualization of a learning curve need not always be a line graph. Nor does it have to be upward sloping. Data Science encourages us to build. Share your thoughts on other visualization possibilities of the learning curve.
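Whether a curve is upward sloping can be checked before anything is drawn. The sketch below takes "delta" to mean the difference between consecutive readings, which is one reasonable reading of the statement rather than a definition given in this article. The function names and the sample readings are hypothetical.

```python
def deltas(values):
    """Differences between consecutive readings."""
    return [later - earlier for earlier, later in zip(values, values[1:])]


def is_upward_sloping(values):
    """True when there is at least one delta and every delta is positive."""
    steps = deltas(values)
    return len(steps) > 0 and all(step > 0 for step in steps)


# Hypothetical performance readings, invented for illustration.
steady = [20, 35, 55, 70]
plateau = [20, 35, 35, 30]

print(deltas(steady), is_upward_sloping(steady))
print(deltas(plateau), is_upward_sloping(plateau))
```

A curve that flattens or dips fails the check, which is a reminder that an upward slope is a property of the data and not of the chart type.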

Understanding color as a dimension

No matter how detailed the visualization is, if it is presented on paper or a screen it will be two-dimensional. Color will add the third dimension.

Some of the simplest visualizations around are bar graphs. They are simple to decipher. They are also simple because they are familiar. Tally marks and bar graphs go back at least 10 years for most of us. Let us use bar graphs to understand color as a dimension.

The dominant features above can be presented as follows:

[Image in the original: bar graph]

On paper or a screen, all conversations will be between the x-axis and the y-axis. Color adds the third dimension.
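One way to picture color as a third dimension is to map a third variable onto a color scale, so that each bar carries a label (x), a height (y), and a shade. The sketch below is an assumption about how that could be done with plain linear interpolation between two RGB colors. The helper `value_to_hex`, the two endpoint colors, and the bar data are all hypothetical.

```python
def value_to_hex(value, low, high, start=(222, 235, 247), end=(8, 48, 107)):
    """Map a value in [low, high] to a hex color between two RGB endpoints."""
    if high == low:
        raise ValueError("low and high must differ")
    # Clamp so that out-of-range values still get a valid color.
    t = min(1.0, max(0.0, (value - low) / (high - low)))
    rgb = [round(s + t * (e - s)) for s, e in zip(start, end)]
    return "#{:02x}{:02x}{:02x}".format(*rgb)


# Hypothetical bars: (label on x-axis, height on y-axis, third variable).
bars = [("A", 20, 100), ("B", 35, 400), ("C", 55, 1000)]
efforts = [effort for _, _, effort in bars]

for label, height, effort in bars:
    shade = value_to_hex(effort, min(efforts), max(efforts))
    print(label, height, shade)
```

The light-to-dark blue endpoints are an arbitrary choice; any pair of colors that the reader can order intuitively would serve the same purpose.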

Layers to analysis

Preconceived notions and cultural influences add depth to data analysis. Data and datum can be numbers or words, expanding to pictures and even voice. So everything that affects anything is data.

If you can organize thought, then you will be able to identify the variables. Each variable will be the heading of a column in the table. All readings of a variable together (if numeric) will form a distribution. Data science will then become an exercise in understanding the interplay between variables.
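The idea that each variable is a column whose readings form a distribution can be sketched with the standard library. The example below regroups row records by variable and reports a few summary values for each. The table contents and the helpers `columns` and `summarize` are hypothetical, chosen only to illustrate the point.

```python
from statistics import mean

# Hypothetical table: each key is a variable, i.e. a column heading.
rows = [
    {"learning_effort": 100, "performance": 20},
    {"learning_effort": 400, "performance": 35},
    {"learning_effort": 1000, "performance": 55},
]


def columns(records):
    """Turn a list of row dicts into {variable: [readings]}."""
    return {name: [r[name] for r in records] for name in records[0]}


def summarize(readings):
    """Minimum, maximum and mean give a first look at a distribution."""
    return {"min": min(readings), "max": max(readings), "mean": mean(readings)}


for name, readings in columns(rows).items():
    print(name, summarize(readings))
```

Three summary values are only a starting point, but seeing them side by side already shows that the two variables live on very different scales.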

Let’s list the various layers rolled up in the statement.

“The learning curve is an upward sloping line with a positive delta”.

It is verbal.
In Python, I will need functions that work with strings.
For a machine learning model, convert words to numeric form.
Graphs will help us visualize.
Mathematics will help quantify.
Statistics will help make sense. What is the average performance? Can I add performance and learning effort? If performance values range from 0 to 100 but learning effort values range from 0 to 1000, can I still use a line graph?
Cultural influences: make a radial graph of the same dataset. Think about whether it will make sense to the reader. Will the reader have the motivation to invest the effort needed to read a complex graph?
Upward sloping line and positive delta: technical description.
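Two of the layers above lend themselves to a short sketch: turning words into numbers, and putting variables with different ranges on a comparable scale. The example below uses simple word counting and min-max rescaling. Both techniques are choices made here for illustration and are not named in the article, and the function names and sample readings are hypothetical.

```python
from collections import Counter

statement = "The learning curve is an upward sloping line with a positive delta"


def word_counts(text):
    """Lowercase the text, split on whitespace, and count each word."""
    return Counter(text.lower().split())


def min_max_scale(values):
    """Rescale readings to the 0-1 range so different units can share an axis."""
    low, high = min(values), max(values)
    if high == low:
        # All readings are identical, so there is no spread to rescale.
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


counts = word_counts(statement)
print(counts["learning"], len(counts))

# Hypothetical readings on very different scales (0 to 100 and 0 to 1000).
performance = [20, 35, 55, 100]
learning_effort = [100, 400, 1000, 700]
print(min_max_scale(performance))
print(min_max_scale(learning_effort))
```

After rescaling, both series run from 0 to 1, so they could be drawn on one line graph. Whether adding or directly comparing them is meaningful remains a question for the analyst, not for the code.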

In conclusion: organizing thought forms the bedrock of data analysis. Tabulation helps organize thought. Visualization will generate insights about variables. In the journey of data science, list as many layers as you can. Explore the interplay between variables. Enjoy the journey! Share your thoughts with the author at

https://www.linkedin.com/in/priyanka-krishna-sharma/

and

[email protected]

Priyanka
