10 Awesome Data Manipulation and Wrangling Hacks, Tips and Tricks

Ram Dewani Last Updated : 05 Apr, 2023
5 min read

Introduction

“Efficiency is doing things right. Effectiveness is doing the right thing.” – Zig Ziglar

As data scientists, we are often taught to be effective and do whatever it takes to get the job done. But ask yourself this – are we efficient at what we do on a day-to-day basis in a data science project? Is there any way to speed up the code we run or the monotonous processes we go through each day?

When I entered the field of data science, I was really happy to have the sexiest job of the 21st century. I dreamed of building state-of-the-art machine learning models until I came face to face with a harsh reality – the majority of my time was going into cleaning and organizing data. I’m sure most of you have gone through this frustrating stage. It’s no secret – data preparation accounts for about 60-80% of a data scientist’s role.

This motivated me to take the path of becoming an efficient data scientist. I started searching for hacks to quickly build my code, tips to speed up the data preparation/cleaning process, and tricks to become far more efficient at data science tasks.

In this article, I will walk you through some of these data manipulation and data wrangling hacks, tips, and tricks that have served me well. I hope these help you in your journey and role as well!

I have also converted my learning into a free course that you can check out:

Also, if you have your own Data Science hacks, tips, and tricks, you can share them with the open community on this GitHub repository – Data Science hacks, tips and tricks on GitHub.

We are posting these hacks daily on social media platforms like LinkedIn, Twitter, and Facebook. Make sure to follow #avhackoftheday to get your daily dose of freshly brewed data science hacks, tips, and tricks!

 

Table of Contents

We’ll cover these data manipulation and data wrangling hacks, tips and tricks:

  1. Data Science Hack #1 – Select Data Type using Pandas
  2. Data Science Hack #2 – Extract E-mail from text
  3. Data Science Hack #3 – Remove Emojis from Text
  4. Data Science Hack #4 – Image Augmentation
  5. Data Science Hack #5 – Resizing Images
  6. Data Science Hack #6 – Apply Pandas Operations in Parallel
  7. Data Science Hack #7 – Pandas Melt
  8. Data Science Hack #8 – Divide equal proportion of classes (Classification)
  9. Data Science Hack #9 – Reading Data from multiple files
  10. Data Science Hack #10 – Splitting Column using str.split()

 

Data Science Hack #1 – Select Data Type using Pandas

At the start of my data science journey, I used to write an ‘if’ condition to separate out continuous and categorical variables for data analysis. This was a taxing task as it consumed a lot of unnecessary time and energy. Then I came across this simple Pandas hack which made my life so much simpler!
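
The original snippet is not reproduced here, so what follows is a minimal sketch of one way to do this with `DataFrame.select_dtypes`. The toy dataframe and its column names are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data: two numeric columns and two text columns.
df = pd.DataFrame({
    "age": [25, 32, 47],
    "income": [42000.0, 58500.0, 91000.0],
    "city": ["Pune", "Delhi", "Mumbai"],
    "segment": ["A", "B", "A"],
})

# Keep only the numeric (continuous) columns.
numeric_df = df.select_dtypes(include="number")

# Keep everything that is not numeric (treated here as categorical).
categorical_df = df.select_dtypes(exclude="number")

print(list(numeric_df.columns))      # ['age', 'income']
print(list(categorical_df.columns))  # ['city', 'segment']
```

Both `include` and `exclude` also accept a list of dtypes, so the same call can replace a hand-written `if` check per column.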

 

Data Science Hack #2 – Extract E-mail from text

One of the most important parts of digital marketing is getting the e-mail IDs of your customers. Is there any way that I can extract these IDs? Of course there is – RegEx to the rescue!

This hack provides a regular expression you may use to extract e-mail IDs from text!
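
As a rough sketch (the exact pattern from the original post is not available, so this one is an assumption), a commonly used approximation looks like the following. It is deliberately simple and does not cover every address the e-mail standards allow.

```python
import re

# Approximate pattern: local part, "@", domain labels, then a TLD of 2+ letters.
# This is an illustrative assumption, not a full RFC-compliant matcher.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Hypothetical sample text.
text = "Contact alice@example.com or bob.smith@mail.example.org for details."

emails = EMAIL_PATTERN.findall(text)
print(emails)  # ['alice@example.com', 'bob.smith@mail.example.org']
```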

 

Data Science Hack #3 – Remove Emojis from Text

Preprocessing is one of the key steps for improving the performance of any machine learning model. A main goal of text preprocessing is to remove characters that the problem statement does not need, such as punctuation, emojis, and links.

This hack will help you get rid of these unnecessary emojis!
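
One possible way to do this, sketched with only the standard library, is a regular expression over a few Unicode emoji blocks. The list of ranges below is an assumption and is not exhaustive, and `remove_emojis` is a hypothetical helper name.

```python
import re

# A handful of Unicode blocks that contain most common emojis (not exhaustive).
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols and pictographs
    "\U0001F680-\U0001F6FF"  # transport and map symbols
    "\U0001F1E0-\U0001F1FF"  # regional indicators (flags)
    "\U0001F900-\U0001F9FF"  # supplemental symbols and pictographs
    "\u2700-\u27BF"          # dingbats
    "]+"
)

def remove_emojis(text: str) -> str:
    """Return text with every character in the ranges above stripped out."""
    return EMOJI_PATTERN.sub("", text)

# \U0001F355 is the pizza emoji and \U0001F680 is the rocket emoji.
sample = "I love pizza\U0001F355 and rockets\U0001F680"
cleaned = remove_emojis(sample)
print(cleaned)  # I love pizza and rockets
```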

 

Data Science Hack #4 – Image Augmentation

Deep Learning models usually require a lot of data for training. But acquiring massive amounts of data comes with its own challenges. Instead of spending days manually collecting data, you can make use of Image Augmentation techniques.

Image augmentation is the process of generating new images from the existing training images, so we don’t have to collect them manually.
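
The library used in the original example is not shown here, so the sketch below swaps in plain NumPy to illustrate the idea: flips, a 90-degree rotation, and additive noise each produce a new training image from an existing one. The random input image and the `augment` helper are hypothetical.

```python
import numpy as np

def augment(image: np.ndarray, rng: np.random.Generator) -> dict:
    """Return a few simple augmented variants of an H x W x C uint8 image."""
    # Gaussian noise is added in float and clipped back to the valid 0-255 range.
    noise = rng.normal(loc=0.0, scale=10.0, size=image.shape)
    noisy = np.clip(image.astype(np.float64) + noise, 0, 255).astype(np.uint8)
    return {
        "flip_lr": np.fliplr(image),    # mirror left to right
        "flip_ud": np.flipud(image),    # mirror top to bottom
        "rotate_90": np.rot90(image),   # 90 degrees counter-clockwise
        "noisy": noisy,
    }

rng = np.random.default_rng(seed=0)
# Stand-in for a real training image: 64 rows, 48 columns, 3 channels.
image = rng.integers(0, 256, size=(64, 48, 3), dtype=np.uint8)

augmented = augment(image, rng)
for name, variant in augmented.items():
    print(name, variant.shape)
```

Dedicated image libraries add arbitrary-angle rotation, shifts, and other transforms on top of these basics.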

 

Data Science Hack #5 – Resizing Images

While building an image classification model using deep learning, it is required that all the images should be of the same size. However, as the data comes from different sources, images may have different shapes.

So, to convert them to the same shape, we can use the resize function from the OpenCV library. This hack will help you convert the images of any shape to a specified shape:

 

Data Science Hack #6 – Apply Pandas Operations in Parallel

Pandas runs most operations on a single CPU core, so it can be slow on large datasets. Pandarallel is a simple and efficient tool to parallelize Pandas operations across all your available CPUs! This trick is certainly going to save loads of your precious time.

 

Data Science Hack #7 – Pandas Melt

Pandas’ melt function helps you bring your dataframe into a tidy form by unpivoting it from wide to long format. In pd.melt(), one or more columns are used as identifiers (id_vars), and the remaining columns are stacked into a variable column and a value column. You can “unmelt” the data using the pivot() function.
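
A short sketch of the round trip, using a made-up table of exam scores:

```python
import pandas as pd

# Hypothetical wide table: one row per student, one column per subject.
wide = pd.DataFrame({
    "name": ["Asha", "Ravi"],
    "math": [90, 75],
    "science": [85, 80],
})

# Wide -> long: "name" is the identifier, subject columns become rows.
long = pd.melt(wide, id_vars="name", var_name="subject", value_name="score")
print(long)

# Long -> wide again ("unmelt") with pivot.
restored = long.pivot(index="name", columns="subject", values="score").reset_index()
restored.columns.name = None  # drop the leftover "subject" axis label
print(restored)
```

After `melt`, the table has one row per (name, subject) pair, which is the shape most plotting and group-by code expects.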

 

Data Science Hack #8 – Divide equal proportion of classes (Classification)

A very common beginner mistake in classification problems is splitting the data without preserving the class proportions in the train and test sets, which often leads to spurious results. Sklearn provides an easy way to keep the proportions the same in both sets: the “stratify” parameter of the train_test_split function.

In this example, we pass stratify = y, and you can observe the difference of proportion in both cases – with stratify and without stratify.
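
The original example is not reproduced here, so the following is a minimal sketch on made-up labels with a 90/10 class imbalance:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 90 samples of class 0 and 10 of class 1.
y = np.array([0] * 90 + [1] * 10)
X = np.arange(100).reshape(-1, 1)

# Plain random split: the share of class 1 in the test set depends on luck.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# Stratified split: each set keeps the original 10% share of class 1.
X_tr_s, X_te_s, y_tr_s, y_te_s = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("without stratify, class 1 share in test:", y_te.mean())
print("with stratify, class 1 share in test:   ", y_te_s.mean())
```

With `stratify=y`, the 20-sample test set gets 2 positives and the 80-sample training set gets 8, matching the 10% share in the full data.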

 

Data Science Hack #9 – Reading Data from multiple files

You will often need to read multiple data files. For example, a retailer may maintain sales data in files split by year. In this case, you can use glob, a module that finds all the pathnames matching a specified pattern according to the rules used by the Unix shell, and then read each matching file. Let’s see it in this example:
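
A self-contained sketch: it first writes three hypothetical yearly sales files to a temporary directory so that it can run anywhere, then uses `glob` to find and read them. The file names and columns are assumptions for illustration.

```python
import glob
import os
import tempfile

import pandas as pd

# Create a few hypothetical yearly sales files in a temporary directory.
data_dir = tempfile.mkdtemp()
for year, amount in [(2017, 100), (2018, 150), (2019, 200)]:
    yearly = pd.DataFrame({"year": [year], "sales": [amount]})
    yearly.to_csv(os.path.join(data_dir, f"sales_{year}.csv"), index=False)

# glob returns matching paths in arbitrary order, so sort them for a stable result.
paths = sorted(glob.glob(os.path.join(data_dir, "sales_*.csv")))

# Read every file and stack the rows into a single dataframe.
sales = pd.concat([pd.read_csv(path) for path in paths], ignore_index=True)
print(sales)
```

The `*` in the pattern matches any run of characters, so new files such as `sales_2020.csv` are picked up without changing the code.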

 

Data Science Hack #10 – Splitting Column using str.split()

pandas.Series.str exposes vectorized string functions on a Pandas dataframe column, and str.split() is one of them. Let’s say you want to split the names in a dataframe column into first name and last name. pandas.Series.str along with split() can be used to perform this task.
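
A minimal sketch with made-up names:

```python
import pandas as pd

# Hypothetical data: full names in a single column.
df = pd.DataFrame({"name": ["Elise Mccann", "Aiden Berger", "Elle Kelley"]})

# n=1 splits at the first space only; expand=True returns one column per part.
df[["first_name", "last_name"]] = df["name"].str.split(" ", n=1, expand=True)

print(df)
```

Using `n=1` keeps multi-word last names together in the second column instead of producing a third one.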

 

End Notes

In this article, we covered 10 data manipulation and data wrangling hacks, tips and tricks across various tools and techniques to help you become a better and more efficient data scientist. I hope these hacks will help you with day-to-day niche tasks and save you a lot of time.

Let me know your Data Science hacks, tips and tricks in the comments section below!

Product Growth Analyst at Analytics Vidhya. I'm always curious to deep dive into data, process it, polish it so as to create value. My interest lies in the field of marketing analytics.

Responses From Readers

Dr.D.K.Samuel

Please give a way to embed Jupyter animations in PowerPoint as HTML or HTML5 video, or even as MP4. Please.

Rakshit Sakhuja

It's awesome. I was not aware that there is also a way to do parallel operations in pandas. For image augmentation, can you explain what the wrap mode in rotation does?

Jeeva

It's very useful. Great guidance. Nice.
