Python is one of the most widely used programming languages for Data Science, Data Analytics, and Machine Learning. It is popular because it is easy for beginners to pick up, has a large online community of learners, and offers powerful data-centric libraries (like Pandas, NumPy, and Matplotlib) that help us manage and manipulate large amounts of data with ease. Python has become the go-to language for Data Scientists and Data Analysts.
The Pandas library in Python allows us to store tabular data in a data type called a dataframe. A pandas dataframe lets users store a large amount of tabular data and makes it easy to access this data using row and column indices. We can store data with hundreds of columns (fields) and thousands of rows (records).
When dealing with a large amount of data, we have to be careful with how we use memory. Running short of memory is a common issue with large datasets. If the available RAM is used up, the program can crash with a MemoryError, which can be tricky to handle. Limiting memory usage becomes important in this case. Reducing memory usage can also speed up computation and help save time.
The info() method in Pandas tells us how much memory is being taken up by a particular dataframe. To do this, we can pass memory_usage="deep" to the info() method. This gives us the total memory being taken up by the pandas dataframe, including the strings stored in object columns.
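As a minimal sketch of the call described above, the snippet below builds a small made-up dataframe (the rows are hypothetical and only the column names follow the article) and captures the info() report in a string so the memory line can be inspected.

```python
import io

import pandas as pd

# Hypothetical toy data; only the column names mirror the article's dataset.
df = pd.DataFrame({
    "age": [24, 53, 23, 33],
    "gender": ["M", "F", "M", "F"],
    "occupation": ["technician", "other", "writer", "executive"],
})

# info() prints to stdout by default; passing buf captures the report instead.
buf = io.StringIO()
df.info(memory_usage="deep", buf=buf)
report = buf.getvalue()

# The last line of the report states the total memory usage.
print(report)
```

Calling df.info(memory_usage="deep") without buf prints the same report directly.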
However, the info() method does not break the memory usage down. It only tells us the total memory being used by the dataframe. For a more detailed overview, we can use the memory_usage() method, which gives us the memory being used by each column in the dataframe. It returns a Pandas series that lists the space taken up by the index and by each column, in bytes. By default, object columns are measured only by the size of their references. Passing deep=True to the memory_usage() method also counts the objects those columns point to (such as strings), which gives the actual memory usage of each column.
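The difference between the default and the deep measurement can be seen in a short sketch. The dataframe here is a made-up stand-in, and the string columns are explicitly created with object dtype to match the situation the article describes.

```python
import pandas as pd

# Hypothetical toy data; string columns are forced to object dtype on purpose.
df = pd.DataFrame({
    "age": pd.Series([24, 53, 23, 33], dtype="int64"),
    "gender": pd.Series(["M", "F", "M", "F"], dtype=object),
    "occupation": pd.Series(
        ["technician", "other", "writer", "executive"], dtype=object
    ),
})

shallow = df.memory_usage()        # object columns: size of the references only
deep = df.memory_usage(deep=True)  # also counts the Python strings themselves

print(deep)                        # one entry per column, plus one for the index
total_bytes = int(deep.sum())      # overall footprint in bytes
```

Numeric columns report the same size either way; only the object columns grow under deep=True.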
In general, columns having object datatype (gender, occupation, and zip code in case of our data) take up a lot of space since they are storing strings in them which take up more space than integers and floating-point numbers. Having columns with object datatype can increase memory usage significantly.
To get around this, we can change the datatype of certain object columns to category. For instance, the gender column takes only two values, M or F. Thus, it makes sense to change the datatype of the gender column from object to category. This reduces the space taken up by the gender column.
When the datatype of the gender column is changed to category, the gender records are stored as integer codes instead of strings. These integer codes in turn refer to the string values, either M or F. Since integers take up less space than strings, the memory usage comes down significantly. The dataframe may look the same on the surface, but the way it stores data on the inside has changed. The space taken up by the gender column goes down from 58,466 bytes to 1,147 bytes, a 98% reduction.
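The conversion can be sketched as follows. The series below is synthetic (1,000 made-up values), so the byte counts will differ from the figures quoted above; the point is only to show the integer codes and the drop in memory.

```python
import pandas as pd

# Synthetic stand-in for the gender column, stored as object dtype.
gender = pd.Series(["M", "F", "M", "M", "F"] * 200, dtype=object)
as_category = gender.astype("category")

before = gender.memory_usage(deep=True)
after = as_category.memory_usage(deep=True)
print(f"object: {before} bytes, category: {after} bytes")

# Internally, each row holds a small integer code pointing at a category.
print(list(as_category.cat.categories))  # the distinct string values
print(as_category.cat.codes.dtype)       # integer dtype used for the codes
```

The values read back from the categorical column are the same strings as before; only the storage changes.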
Similarly, we can change the data type of other object columns in our dataframe. This can reduce memory usage to a large extent, and can prevent the unnecessary occurrence of MemoryError in our program.
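One way to apply this to several columns at once is sketched below. The helper name categorize_low_cardinality and its max_unique_ratio threshold are hypothetical choices made for this illustration, not part of pandas. The threshold exists because a column with mostly unique values (such as zip codes in a small table) gains little from the category type and can even end up larger.

```python
import pandas as pd


def categorize_low_cardinality(df, max_unique_ratio=0.5):
    """Return a copy with repetitive object columns converted to category.

    Hypothetical helper: the 0.5 threshold is an arbitrary illustration value.
    """
    out = df.copy()
    if len(out) == 0:
        return out
    for col in out.select_dtypes(include=["object"]).columns:
        # Convert only when few distinct values repeat across many rows.
        if out[col].nunique() / len(out) <= max_unique_ratio:
            out[col] = out[col].astype("category")
    return out


# Made-up data: gender repeats heavily, zip_code is unique per row.
df = pd.DataFrame({
    "gender": pd.Series(["M", "F"] * 5, dtype=object),
    "zip_code": pd.Series([str(10000 + i) for i in range(10)], dtype=object),
})
converted = categorize_low_cardinality(df)
print(converted.dtypes)
```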
Another way to reduce memory being used by columns storing only numerical values is to change the data type according to the range of values. For example, in the case of our data, the minimum and maximum values of age are 7 and 73 respectively. This range of values can very well be represented by an 8-bit binary number. So, instead of storing age data as a 64-bit integer which is the default in most newer versions of Pandas, we can store it as an 8-bit integer. As the number of bits required to store the data has reduced, the memory usage also comes down.
A signed 8-bit integer can range from -128 to +127 (in 2’s complement representation), which is sufficient for the age column in our dataframe. This will result in a significant reduction in the memory being taken up by the age column.
When the datatype of the age column is converted from int64 to int8, the space taken up by the column goes down from 7,544 bytes to 943 bytes, an 87.5% reduction.
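A minimal sketch of this conversion, using four made-up ages rather than the article's data, checks the range first and then compares the bytes used by the values alone.

```python
import pandas as pd

# Made-up ages spanning the same minimum and maximum as the article's column.
age = pd.Series([7, 24, 53, 73], dtype="int64")

# Check the range before narrowing: out-of-range values would not survive int8.
fits_int8 = -128 <= age.min() and age.max() <= 127

age_small = age.astype("int8") if fits_int8 else age

# index=False measures only the values: 8 bytes each before, 1 byte each after.
bytes_before = age.memory_usage(index=False)
bytes_after = age_small.memory_usage(index=False)
print(bytes_before, bytes_after)
```

The eightfold drop per value matches the 87.5% reduction reported above.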
We can also change the datatype from int64 to int16 or int32. While int16 supports a range of -32,768 to +32,767, int32 supports a much larger range of numbers, from -2,147,483,648 to +2,147,483,647. We can choose int8, int16, or int32 depending on the range of values.
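Instead of picking the width by hand, pandas can choose it. The sketch below uses pd.to_numeric with downcast="integer", which selects the smallest signed integer type that holds every value; the three series are made-up examples, one for each width.

```python
import pandas as pd

ages = pd.Series([7, 24, 73], dtype="int64")    # fits in int8
counts = pd.Series([0, 3000], dtype="int64")    # needs int16
totals = pd.Series([0, 40000], dtype="int64")   # exceeds int16, needs int32

ages_small = pd.to_numeric(ages, downcast="integer")
counts_small = pd.to_numeric(counts, downcast="integer")
totals_small = pd.to_numeric(totals, downcast="integer")

print(ages_small.dtype, counts_small.dtype, totals_small.dtype)
```

The choice is based on the values present at the time of the call, so data appended later may need a wider type.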
The table below lists the entire range of values that can be represented by the different integer data types:
Data type | Maximum value | Minimum value |
int8 | 127 | -128 |
int16 | 32767 | -32768 |
int32 | 2147483647 | -2147483648 |
int64 | 9223372036854775807 | -9223372036854775808 |
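The limits in the table do not need to be memorised. As a small sketch, NumPy's iinfo reports them for any integer type:

```python
import numpy as np

# Collect (minimum, maximum) for each signed integer width.
ranges = {}
for dtype in (np.int8, np.int16, np.int32, np.int64):
    info = np.iinfo(dtype)
    ranges[np.dtype(dtype).name] = (int(info.min), int(info.max))
    print(f"{np.dtype(dtype).name}: {info.min} to {info.max}")
```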
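Similarly, we can change the data type of columns holding floating-point numbers. Changing the datatype from float64 to float32 halves the space taken up, and float16 cuts it to a quarter. Unlike the integer case, this also reduces precision: float16 keeps only about three significant decimal digits and cannot represent values above 65,504, so it is worth checking that the smaller type is accurate enough for the data.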
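The trade-off between space and precision can be seen in a short sketch. The three values are made up for illustration; the rounding error is measured by converting back to float64 and comparing with the original.

```python
import pandas as pd

# Made-up floating-point values, stored at full precision.
values = pd.Series([3.14159265, 2.71828183, 1234.5678], dtype="float64")

as32 = values.astype("float32")
as16 = values.astype("float16")

# Bytes used by the values alone: 8, 4, and 2 bytes per element.
size64 = values.memory_usage(index=False)
size32 = as32.memory_usage(index=False)
size16 = as16.memory_usage(index=False)

# Largest absolute rounding error introduced by each narrower type.
err32 = (values - as32.astype("float64")).abs().max()
err16 = (values - as16.astype("float64")).abs().max()
print(size64, size32, size16)
print(err32, err16)
```

For these values the float32 error is tiny, while float16 visibly changes the largest number, which is why float32 is often the safer choice.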
In this blog post, we have learned about two methods in pandas that tell us about the memory being taken up by a dataframe: the info() method and the memory_usage() method. We also looked at two ways to reduce the memory being used by a pandas dataframe.
The first way is to change the data type of an object column to category when the column holds categorical data. This does not affect the way the dataframe looks but reduces the memory usage significantly. The second way is to change the data type of numerical columns based on the range of values. This works for columns storing either integers or floating-point numbers.
You can also refer to the YouTube video linked below to get a deeper understanding of the same. It explains the same methods to reduce the memory being taken up by a pandas dataframe.
The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.