While learning data science, we have to know how to work with many libraries such as NumPy, Matplotlib, and seaborn. It is important to learn about machine learning algorithms and to know how to use these packages to your advantage. Another important skill when working with data is handling errors and warnings. No matter how long you’ve worked with pandas, sooner or later you will encounter the SettingWithCopyWarning. If you are trying to understand what it is and why it keeps showing up even when you do get the output you expected, then this article is for you.
This article was published as a part of the Data Science Blogathon.
It doesn’t pay to ignore warnings. Even when they don’t make sense. – Debra Doyle
One of the things I was taught while learning to code in Python was not to be bothered by ‘Warnings’ in my code. “Focus on fixing major bugs and errors; warnings aren’t a big deal” was the advice I got. I realized it was terrible advice when I started working on real-world problems. Sometimes warnings can cost you more than you think. While working with the pandas DataFrame, I faced one such warning, the ‘SettingWithCopy’.
In order to explain the logic behind the warning, I will take you through a hands-on tutorial in Python. I have used the Car Sales dataset from Kaggle as a sample dataset. The dataset contains information about different types of cars.
Here is a glimpse of the data and the structure of the dataset.
Here we import pandas with the import pandas statement (you can import NumPy and Matplotlib as well if you need them) and read the data with the read_csv function. We then call the info method to see, for every column, the number of non-null values and the data type, which tells us whether the column is categorical or numerical.
Inference: Some columns contain null (NaN) values, because their non-null count is lower than the total number of rows.
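The loading step described above can be sketched roughly as follows. The Kaggle file itself is not reproduced here, so this sketch reads a few made-up rows from an in-memory CSV; the column names follow the ones used later in this article, and the values are placeholders rather than the real data.

```python
import io

import pandas as pd

# Hypothetical stand-in for the Kaggle CSV: a few made-up rows with the
# column names used in this article. With the real file, you would pass
# its path to pd.read_csv instead of the StringIO object.
csv_text = """Manufacturer,Model,Vehicle_type,Fuel_capacity
Porsche,Boxter,Car,17.0
Porsche,Carrera Coupe,Car,17.0
Chevrolet,Cavalier,Passenger,
"""

car_sales = pd.read_csv(io.StringIO(csv_text))

# info() prints each column's non-null count and dtype.
car_sales.info()

# Count the missing values per column explicitly.
missing_per_column = car_sales.isna().sum()
print(missing_per_column)
```

The empty last field in the third row is read as NaN, which is why the non-null count for Fuel_capacity is lower than the number of rows.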
Let’s assume a scenario where we have received an update that the Fuel_capacity of all the Porsche cars has been increased from 17.0 to 18.0, and we have been asked to make the change. Let’s go ahead and do it.
car_sales[car_sales['Manufacturer'] == 'Porsche']['Fuel_capacity'] = 18.0
Uh-oh! We have triggered the famous ‘SettingWithCopyWarning’.
If we examine the dataframe now, we can see that the values have not been updated.
car_sales[car_sales['Manufacturer'] == 'Porsche']
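To make the failed update easy to reproduce, here is a minimal sketch on a tiny made-up dataframe (the rows are placeholders, not the Kaggle data). The warning is silenced inside the block only so that the sketch runs quietly; the point is the result afterwards.

```python
import warnings

import pandas as pd

# Made-up rows standing in for the real dataset.
car_sales = pd.DataFrame(
    {
        "Manufacturer": ["Porsche", "Porsche", "Chevrolet"],
        "Fuel_capacity": [17.0, 17.0, 15.0],
    }
)

with warnings.catch_warnings():
    # Hide the warning for this demonstration only.
    warnings.simplefilter("ignore")
    # Chained assignment: the boolean filter builds a new object, and the
    # value is set on that temporary object rather than on car_sales.
    car_sales[car_sales["Manufacturer"] == "Porsche"]["Fuel_capacity"] = 18.0

# Read the Porsche rows back from the original dataframe.
porsche_capacity = car_sales.loc[
    car_sales["Manufacturer"] == "Porsche", "Fuel_capacity"
].tolist()
print(porsche_capacity)  # the Porsche rows still hold 17.0
```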
We have to understand that “SettingWithCopy” is a warning and not an error. An error breaks your code and prevents you from moving on without fixing it. A warning, on the other hand, lets the code run but signals that something about it may be wrong, even though it still produces output.
In this case, we might sometimes get the output we intended and be tempted to ignore the warning. But we should never ignore it, because it means that the operation we are trying to perform may not have worked as we expected, and it can cause unexpected issues later.
These are the words of Jeff Reback, one of the core developers of pandas, on why you should never ignore this warning.
In order to understand how to fix this warning and what to do when we face it, it is imperative to know the difference between Views and Copies in Pandas and how they work.
An important step in machine learning is preprocessing, and there will be many situations in which we have to work with a data frame’s views and copies. It is especially important to know about views and copies when using the df.loc and df.iloc indexers, as they can return either views or copies depending on how they are used. Some methods, such as groupby, return neither a view nor a copy but a grouped object. So we must stay familiar with what each one returns. If you want to know more about this, I recommend going through the documentation on pandas’ APIs.
In the code above, where we try to return all the Porsche cars from the data, the result we receive may either be a view or a copy of the dataframe.
A view (or a shallow copy) is a subset of the original object which doesn’t have its own memory and address space. It is just a projection of the object we are trying to access.
A copy (or a deep copy) is a duplicate of the original object which has its own memory and address space. It is a separate entity that Pandas throws away once we are done operating on it.
One of the main differences between views and copies is that modifying a view modifies the original dataframe and vice versa, whereas modifying a copy doesn’t affect the original dataframe.
Let’s say we change ‘sales_in_thousands’ for the car of ‘Model‘ Boxter to 9.35.
You can see above modifying a view modifies the original dataframe as well.
On the contrary, modifying a copy doesn’t modify the original dataframe.
Pandas inherited this behavior of views and copies from the underlying NumPy arrays. A NumPy array holds a single data type, so whether a view or a copy is returned can be predicted. While Pandas is built on NumPy, it follows a complex set of rules to optimize space and determine whether to return a view or a copy. Because of that, whenever we index a dataframe, there is no reliable way to predict whether a view or a copy is returned. To quote the pandas documentation,
Outside of simple cases, it’s very hard to predict whether it will return a view or a copy (it depends on the memory layout of the array, about which pandas makes no guarantees) ………. That’s what SettingWithCopy is warning you about!
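The view-versus-copy distinction is easiest to see on a plain NumPy array, where the rules are predictable. The sketch below is an illustration on a made-up array, not code from the original walkthrough: basic slicing gives a view, while indexing with a list of positions gives a copy.

```python
import numpy as np

original = np.arange(5)

# Basic slicing returns a view: it shares memory with the original array.
view = original[1:4]
# Indexing with a list of positions returns a copy with its own memory.
copy = original[[1, 2, 3]]

view[0] = 99  # writes through to original[1]
copy[1] = -1  # stays local to the copy

print(original)
print(np.shares_memory(original, view), np.shares_memory(original, copy))
```

Only the write through the view shows up in the original array. A dataframe can hold several columns of different types, which is one reason the same prediction is much harder to make in Pandas.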
To check whether a view or a copy is returned, you can use the internal attributes _is_view or _is_copy. _is_view returns a boolean, and _is_copy returns a reference to the original dataframe or None. Both are private attributes (note the leading underscore), so they may change or disappear between pandas versions.
If you want to read more about views, you can look at the official documentation from pandas.
Let’s look at three of the most common situations in which you encounter the SettingWithCopyWarning and how to handle them.
One of the most common reasons Pandas generates this warning is when it detects chained assignment or chained indexing.
There are two things we do with a Pandas dataframe: we either get (access) values or set (assign) values.
A chained assignment is when we try to assign (set) something by using more than one indexing operation.
Recall the example below, which we used previously.
car_sales[car_sales['Manufacturer'] == 'Porsche']['Fuel_capacity'] = 18.0
Here, two indexing operations are combined to set a value. First, we try to access (get) all the ‘Porsche’ cars from the dataframe, then we try to assign (set) a new value to ‘Fuel_capacity’.
We want to modify the original dataframe, but this operation may create a copy and modify that instead. This is what the warning is telling us: ‘A value is trying to be set on a copy of a slice from a DataFrame.’
We discussed above that Pandas can create either a view or a copy when we try to access (get) a subset of a dataframe.
Let’s see if the operation we are trying to perform is on a view or a copy.
car_sales[car_sales['Manufacturer'] == 'Porsche']['Fuel_capacity']._is_view
# output
True
car_sales[car_sales['Manufacturer'] == 'Porsche']['Fuel_capacity']._is_copy
#output
_is_view has returned ‘True’, meaning it’s a view, while _is_copy has returned a ‘weakref’, meaning it’s a copy. Hence, the output of the ‘get‘ operation is ambiguous. It can be anything in the end. This is why ignoring the ‘SettingWithCopyWarning’ is a bad idea. It can eventually lead to breaking something in your code when you least expect it.
You can easily tackle the problem of chained assignment by combining the back-to-back indexing operations into a single operation using .loc.
car_sales.loc[car_sales.Manufacturer == 'Porsche', 'Fuel_capacity'] = 18.0
car_sales[car_sales.Manufacturer == 'Porsche']['Fuel_capacity']
#output
124 18.0
125 18.0
126 18.0
Name: Fuel_capacity, dtype: float64
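The same fix can be tried on a tiny made-up dataframe (placeholder rows, not the Kaggle data). Here the row filter and the column label go into a single .loc call, so the assignment is applied directly to car_sales.

```python
import pandas as pd

# Made-up rows standing in for the real dataset.
car_sales = pd.DataFrame(
    {
        "Manufacturer": ["Porsche", "Porsche", "Chevrolet"],
        "Fuel_capacity": [17.0, 17.0, 15.0],
    }
)

# One indexing operation: row mask and column label together.
is_porsche = car_sales["Manufacturer"] == "Porsche"
car_sales.loc[is_porsche, "Fuel_capacity"] = 18.0

print(car_sales["Fuel_capacity"].tolist())
```

Only the two Porsche rows change; the Chevrolet row keeps its original value.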
The second most common reason that triggers this warning is Hidden Chaining. It can be tricky and hard to track down the source of this problem as it may span across your entire codebase. Let’s look at a scenario for Hidden Chaining. We’ll go ahead and create a new dataframe containing all the ‘Chevrolet’ cars while bearing in mind to use .loc from our previous lesson.
chevrolet_cars = car_sales.loc[car_sales.Manufacturer == 'Chevrolet']
chevrolet_cars
We do some other operations for some time and play around with our code.
chevrolet_cars['Model'].value_counts()
....
# few lines of code
chevrolet_cars['Sales_in_thousands'].std()
....
chevrolet_cars['__year_resale_value'].max()
....
# few lines of code
chevrolet_cars.loc[20,'Price_in_thousands'] = 17.638
Boom! This warning again!!
There was no chained assignment in that last line of code, but it still went ahead and triggered that warning. Let’s look at the values in our dataframe.
It has updated our value. So should we go ahead and ignore the warning this time? Probably not.
There is no obvious chained assignment in this code. In reality, it can occur on one line or even across multiple lines of code. When we created the ‘chevrolet_cars’ dataframe, we used a get operation. So there is no guarantee whether this returned a view or a copy. So, we might be trying to modify the original dataframe as well.
Identifying this problem can be very tedious in real codebases spanning thousands of lines, but it is very simple to tackle this. When you want to create a new dataframe, explicitly make a copy using the .copy() method. This will make it clear to Pandas that we are operating on a new dataframe.
chevrolet_cars = car_sales.loc[car_sales.Manufacturer == 'Chevrolet'].copy()
chevrolet_cars.loc[20,'Price_in_thousands'] = 17.638
chevrolet_cars.loc[20, 'Price_in_thousands']
#output
17.638
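As a self-contained check of this fix, the sketch below uses a small made-up dataframe (the prices are placeholders) and confirms that a change made to the explicit copy does not reach the original.

```python
import pandas as pd

# Made-up rows standing in for the real dataset.
car_sales = pd.DataFrame(
    {
        "Manufacturer": ["Chevrolet", "Porsche", "Chevrolet"],
        "Price_in_thousands": [13.26, 41.43, 16.535],
    }
)

# .copy() makes the subset an independent dataframe.
chevrolet_cars = car_sales.loc[car_sales["Manufacturer"] == "Chevrolet"].copy()
chevrolet_cars.loc[0, "Price_in_thousands"] = 17.638

print(chevrolet_cars.loc[0, "Price_in_thousands"])  # updated in the copy
print(car_sales.loc[0, "Price_in_thousands"])       # unchanged in the original
```

Making the copy explicit costs some memory, but it removes the ambiguity about which object a later assignment will change.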
A false positive occurs when the warning triggers even though it shouldn’t. It is generally safe to ignore the warning in this case. Pandas has fixed many scenarios that caused ‘False Positive’ warnings over the years. They are discussed in the Pandas documentation if you want to take a look.
Let’s say we want only the cars with Vehicle_type as ‘Passenger‘, and we would like to create a new dataframe column which will be a boolean indicating whether the car is available or not.
car_sales = car_sales[car_sales['Vehicle_type'] == 'Passenger']
car_sales['In Stock'] = True
#output
:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
If you look at the dataframe, you will see that the new column has been added. In this case, we do not care if it overwrites the original dataframe.
We can go ahead and suppress the warning by changing the default behavior, that is, by setting the pandas option pd.options.mode.chained_assignment to None.
Do not change this behavior unless you know what you are doing.
To avoid the “SettingWithCopy” warning, which typically arises when modifying a slice of a DataFrame, either assign through .loc in a single step or take an explicit copy of the slice before modifying it:
df.loc[df['column'] > 5, 'new_column'] = 0
df_copy = df[df['column'] > 5].copy()
df_copy['new_column'] = 0
Hope this article has given you a detailed overview of the ‘SettingWithCopyWarning’ in pandas in Python. You can avoid most scenarios that trigger the ‘SettingWithCopyWarning’ by communicating clearly with Pandas and understanding why it occurs in the first place. When you want to modify the original dataframe, use .loc, or when you want a copy, specify it directly. This will not only prevent future warnings and errors, but it will also make your codebase more robust for maintenance.
You can take a look at these issues on GitHub #5390 and #5597 for background discussion.
A. We can use the copy() method to create a copy of a dataframe. If you want to make changes to a dataframe, create a copy first so that you can roll back if needed.
A. You can use pd.options.mode to stop Pandas from printing this warning. Setting the chained_assignment option to None suppresses it. However, be careful, because this might hide potential issues in your code.
A. DataFrame and Series are both data structures in pandas that you can use to manipulate data. A dataframe can have multiple columns; however, a series has only one column. Also, we can store multiple types of data, like an object, float64, etc., in a dataframe, but in series, since it is only one column, you can only have data of the same data type.
Chained indexing happens in Python when you use multiple indexing operations, like [ ], one after another, on a DataFrame or a series. This can sometimes lead to unexpected behavior and make it unclear whether you’re working with a copy or a view of the data. It’s generally