Data Analysis Project for Beginners Using Python

Raghav Agrawal Last Updated : 02 Jun, 2022

This article was published as a part of the Data Science Blogathon.

Introduction

Data analysis is a skill you should master before diving into machine learning algorithms, because it is the process of exploring data to understand it better. Data analysis is a core part of any data science or machine learning project and typically takes 70 to 80 per cent of the complete project lifecycle. It is a vast domain containing different techniques and methods like data cleaning, preprocessing, visualization, transformation, encoding, etc. In this tutorial, we will work through a basic data analysis that will boost your confidence to learn more and help you start your journey into data technologies using Python.

Data Analysis Project

Dataset Overview

The dataset we will use is a simple weather dataset: a time series that records temperature, humidity, wind speed, etc. on an hourly basis across different dates in 2012. The dataset is available on Kaggle, and you can access or download it using this link. It contains 8784 rows and 8 columns, where the last column states the weather condition corresponding to the recorded climatic measurements. You can get a basic overview of the dataset with the head() call shown below.

Load Dataset

Once you have the dataset, open a Jupyter Notebook, or create a Kaggle notebook directly on the dataset page. The first step is to import the necessary libraries and load the dataset into the notebook. Pandas is a popular data preprocessing library in Python that helps you analyse and explore data using built-in functions.

import pandas as pd
import numpy as np
#Load the data
data = pd.read_csv("/kaggle/input/weather-data-set-for-beginners/1. Weather Data.csv")
data.head()

Basic Python Pandas Data Analysis Functions

1. Shape – shape is an attribute of a pandas DataFrame that stores the number of rows and the number of columns as a tuple. If you use the shape attribute on data, it will show the tuple (8784, 8).

data.shape

2. Data types – the dtypes attribute displays the data type of each column in the DataFrame, and it can be applied to a single column as well:
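
data.dtypes #data type of every column
data['Weather'].dtype #data type of a single column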

3. Unique – unique() is a function that returns an array of all the unique values present in a given column. It applies to a single column (a Series), not to the whole DataFrame.

data['Weather'].unique()

4. nunique – nunique() is a function that returns the number of unique values. The function can be applied to a single column as well as to the complete DataFrame at a time.

#To view the number of unique values in each column, apply the function to the whole DataFrame
data.nunique()

5. Count – the count() function displays the total number of non-null values present in a particular column. You can apply the function to the complete DataFrame as well as to a single column.

data.count()

6. Value counts – the value_counts() function displays the count of each unique value present in a column. It can be used on only one column at a time.

data['Weather'].value_counts()

7. Info – the info() function is used to get basic details about the dataset, such as the column names, non-null counts, data types, and memory usage:
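
data.info()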

8. Describe – describe() is a function that returns basic information about the numeric variables present in the dataset, like count, minimum, maximum, standard deviation, mean, and quartiles. In short, the describe function is used to get the statistical summary of the data:
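
data.describe()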

Answering Different Data Analysis Problems

This is where the main work of data analysis happens: given a problem, you write queries to find the solution. We will practice some basic but important data analysis questions covering filtering, aggregating, and retrieving data. Remember that there are usually multiple ways to solve a problem, and you can go with any solution based on its simplicity or performance.

Que-1) Find all records from data when the weather was exactly clear?

The question simply asks us to display the rows where the weather condition (last column) is 'Clear'. We can find the answer in three different ways: filtering, value counts, and grouping. Let us try each method.

I] Filtering the data

Filtering simply means extracting the rows from the dataset that match a certain condition, which in our case is that the weather should be 'Clear'. We compare the Weather values against 'Clear' using the equality operator (==) and embed the condition in square brackets to print the filtered DataFrame. If you only want to know the number of rows where the weather is clear, you can chain the shape attribute onto the same expression. Both statements are demonstrated in the snippet below.

data[data['Weather'] == 'Clear'] #to display complete dataframe

data[data['Weather'] == 'Clear'].shape #to view number of records

II] Use value counts

value_counts() displays the total count of records for each unique value in a column, so we can use it on the Weather column and read off the count for 'Clear'.

data.Weather.value_counts().to_frame().reset_index()

The to_frame() function converts the Series into a DataFrame, and reset_index() rebuilds the index from 0 since a new DataFrame is formed.

III] Use Grouping

The groupby command groups the data by each unique value, and we can apply an aggregate function on the result to get the desired number of rows where the weather is 'Clear'. To display the DataFrame itself, we can use the get_group() method of groupby and pass 'Clear' to fetch all matching rows.

#groupby
data.groupby('Weather').get_group('Clear').shape

Que-2) Find the number of times when the wind speed was exactly 4 km/hr?

This question is similar to the previous one, so I hope you can write the query yourself. The answer can be found using filtering or the value_counts function; both approaches are shown below.

data[data['Wind Speed_km/h'] == 4].shape
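
An alternative sketch uses value_counts: look up the value 4 in its result. This assumes a wind speed of exactly 4 km/h actually occurs in the column; the label lookup raises a KeyError otherwise.

data['Wind Speed_km/h'].value_counts()[4] #count of records with wind speed exactly 4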

Que-3) Check if there are any NULL values present in the dataset?

Null values are missing values: entries that do not contain a proper value for the column, denoted as NA or NULL in the dataset. To find the null values, pandas has the direct isnull() function, and to print the number of null values per column we can chain the sum() function:
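
data.isnull().sum() #number of NULL values in each column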

In our data, no NULL values are present, but when you work on real-life or real-time data, there will often be a large number of missing values, and you will need to treat them. If you want to study missing value detection and treatment in more depth, you can refer to this article.

Que-4) Rename the column Weather to Weather_Condition?

You might think that renaming a column is not part of data analysis, but some columns in your dataset may contain jumbled words or spaces that create problems while accessing them, so it is better to rename them. To demonstrate how to rename a column, we pick the Weather column.

data.rename(columns = {'Weather' : 'Weather_Condition'}, inplace=True)
#to permanently rename col use inplace

Que-5) What is the mean Visibility of a given dataset?

The mean is the average of all the values in a column. It is calculated as the sum of all values divided by the total number of values. To find the mean, directly use the mean() function of pandas; to verify the output, you can also compute the sum with the sum() function and divide it by the number of rows, as sketched after the snippet below.

data['Visibility_km'].mean()
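
To verify, here is a minimal sketch of the manual calculation, assuming the column has no missing values (mean() skips NaNs, while this plain division does not):

data['Visibility_km'].sum() / data.shape[0] #should match data['Visibility_km'].mean()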

Like mean, pandas provides other basic aggregate and statistical functions that can be used on any numerical column, such as standard deviation, variance, maximum value, minimum value, count of values, skewness, etc., as sketched below.
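
For example, a few of these applied to the Visibility_km column:

data['Visibility_km'].std()  #standard deviation
data['Visibility_km'].var()  #variance
data['Visibility_km'].min()  #minimum value
data['Visibility_km'].max()  #maximum value
data['Visibility_km'].skew() #skewness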

Que-6) Find the number of records where wind speed is greater than 24 and visibility is equal to 25?

The question again asks us to filter the dataset, but now based on two conditions. When we have two or more conditions, we combine them with logical operators. In this case we need records where the wind speed is greater than 24 AND the visibility is equal to 25, so the logical AND operator (&) is used, and the rest of the filtering syntax stays the same. When writing multiple conditions inside the square brackets, we wrap each one in parentheses for correctness and readability.

data[(data['Wind Speed_km/h'] > 24) & (data['Visibility_km'] == 25)].shape

Que-7) What is the mean value of each column against each weather condition?

Whenever a question asks about EACH of something, you will usually need groupby in the query, because you have to group the data by each unique weather value and then aggregate the other columns to find the mean, as sketched below.
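
A minimal sketch, assuming the Weather column was renamed to Weather_Condition in Que-4; numeric_only=True, needed on recent pandas versions, restricts the mean to the numeric columns:

data.groupby('Weather_Condition').mean(numeric_only=True)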

In the same way, you can find the minimum or maximum value of all columns for each weather value. Try this in your own notebook.

Que-8) Find all instances where weather is clear and relative humidity is greater than 50 or visibility is above 40?

The question asks you to filter the dataset based on three different conditions. With three conditions you need two logical operators, and in this question we have to use one AND and one OR. For simplicity, wrap each condition in parentheses, and read the question carefully to understand where and how the conditions should be grouped, as in the sketch below.
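
A hedged sketch of one reading of the question, grouping the AND before the OR. It assumes the relative humidity column in this Kaggle file is named 'Rel Hum_%' and that Weather was renamed to Weather_Condition in Que-4:

data[((data['Weather_Condition'] == 'Clear') & (data['Rel Hum_%'] > 50)) | (data['Visibility_km'] > 40)]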

Que-9) Find the number of Weather conditions where snow is there?

The question does not ask you to find the records where the weather is exactly equal to 'Snow'; rather, it asks you to find all weather conditions that contain the word snow, such as snow fog, snow showers, blowing snow, etc. Think of it as finding every sentence that contains a particular word in a list of sentences.

To solve this, pandas provides the str.contains() function, with which we can check whether each string value contains a particular substring; it is only applicable to string columns. A sketch follows.
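
A minimal sketch, again assuming the rename from Que-4; case=False matches 'Snow' whatever the capitalization:

data[data['Weather_Condition'].str.contains('snow', case=False)].shape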

Conclusion

Data analysis is a continuous process, and how deeply and clearly you present your analysis to the client determines whether the insights can be understood and used to drive business decisions. Let us conclude the article with some key takeaways.

  • Filtering with logical operators selects data based on certain conditions and retrieves the rows for which the defined condition is true.
  • Statistical techniques like mean, median, standard deviation, and variance represent a lot of information about the spread of data.
  • Always treat NULL values with the best imputation technique, and try not to delete them if your dataset is small or contains many NULL values.
  • Data analysis is a continuous process involving many different techniques. After finishing this article, I suggest following an exploratory data analysis article to carry your journey forward and learn how to analyze data with better visualization charts and graphs, which make the data analysis and representation steps simple and streamlined.

👉 I hope each step was easy to follow and understand. If you have any queries, feel free to post them in the comment section below, or connect with me. I hope you liked my article on data analysis with Python.

👉 Connect with me on LinkedIn.

👉 Check out my other articles on Analytics Vidhya and crazy-techie

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

