Generative AI and Large Language Models (LLMs) have brought a new era to Artificial Intelligence and Machine Learning. These large language models are used in various applications across different domains and have opened up new perspectives on AI. These models are trained on a vast amount of text data from all over the internet and can generate text in a human-like manner. The most well-known example of an LLM is ChatGPT, which OpenAI developed. It can perform various tasks, from creating original content to writing code. In this article, we will look into one such application of LLMs: the PandasAI library.
PandasAI can be considered a fusion of Python’s popular Pandas library and OpenAI’s GPT. It is a quick way to get insights from data without writing much code. In this article, you will learn how to set up an OpenAI API key for PandasAI and work through examples of the PandasAI API.
This article was published as a part of the Data Science Blogathon.
PandasAI is a new tool for making data analysis and visualization tasks easier. It is built on Python’s Pandas library and uses Generative AI and LLMs to do its work. Unlike Pandas, where you analyze and manipulate data manually, PandasAI lets you generate insights from data by simply providing a text prompt. It is like giving instructions to a skilled assistant who can do the work for you quickly. The only difference is that the assistant is not a human but a machine that can understand and process information much like one.
In this article, I will review the full data analysis and visualization process using PandasAI with code examples and explanations. So, let’s get started.
To use the PandasAI library, you must create an OpenAI account (if you don’t already have one) and use your API key. It can be done as follows:
If you have followed the above-given steps, you are all set to leverage the power of Generative AI in your projects.
Run the command below in a terminal to install the PandasAI package on your computer. In a Jupyter Notebook or Google Colab cell, prefix it with an exclamation mark (!pip install pandasai).
pip install pandasai
Installation will take some time, but once installed, you can directly import it into a Python environment.
from pandasai import PandasAI
This will import PandasAI to your coding environment. We are ready to use it, but let’s first get the data.
You can use any tabular data you like. I will use the medical charges data for this tutorial. (Note: PandasAI, like regular pandas, can only analyze tabular and structured data, not unstructured data such as images.)
The data looks like this.
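The loading step is not shown in the article, so here is a minimal sketch of how the `data` variable used in the rest of the tutorial might be created. The file name in the comment is hypothetical, the inline rows are made-up values for illustration (not rows from the real dataset), and the seven column names are assumed from the columns the article refers to later.

```python
import io

import pandas as pd

# Made-up sample rows for illustration only. With a real file you would use
# something like: data = pd.read_csv("insurance.csv")  # hypothetical file name
sample_csv = io.StringIO(
    "age,sex,bmi,children,smoker,region,charges\n"
    "25,female,22.5,0,no,northeast,3200.50\n"
    "47,male,31.2,2,yes,southeast,24500.00\n"
    "36,female,28.0,1,no,northwest,6100.75\n"
)
data = pd.read_csv(sample_csv)

# Quick look at the first rows and the inferred column types
print(data.head())
print(data.dtypes)
```

Checking `data.dtypes` early is useful because the later prompts about numeric columns only make sense if `age`, `bmi`, `children`, and `charges` were parsed as numbers.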
With the data in place, we need our OpenAI API key to instantiate a Large Language Model. To do this, type in the following code:
# Use your API key to instantiate an LLM
from pandasai.llm.openai import OpenAI
llm = OpenAI(api_token="YOUR_API_KEY")
pandas_ai = PandasAI(llm)
Just enter your secret key created above in place of the YOUR_API_KEY placeholder in the above code, and you’ll be all good to go. Now, we can analyze our data and find some key insights using PandasAI.
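If you would rather not paste the secret key into a notebook, one option is to read it from an environment variable. The sketch below is only an illustration: the helper name `load_api_key` and the variable name `OPENAI_API_KEY` are assumptions, not part of PandasAI.

```python
import os


def load_api_key(var_name: str = "OPENAI_API_KEY") -> str:
    """Return the API key stored in an environment variable, or fail loudly."""
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable first.")
    return key


# Hypothetical usage with the objects created above:
# llm = OpenAI(api_token=load_api_key())
# pandas_ai = PandasAI(llm)
```

Keeping the key out of the code makes it harder to leak it by accident when sharing a notebook.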
PandasAI mainly takes two parameters as input: the dataset and the prompt, which is the query or question asked. You might wonder how it works under the hood, so let me explain.
Executing your prompt using PandasAI sends a request to the OpenAI server on which the LLM is hosted. The LLM processes the request and converts the query into appropriate Python code. PandasAI then runs that code with pandas to calculate the answer and outputs it to your screen.
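The prompt-to-code-to-answer flow described above can be pictured with a small conceptual sketch. This is not PandasAI’s actual implementation: `fake_llm` is a hypothetical stand-in that returns canned code without any network call, and the sketch assumes the generated code is executed locally against the dataframe.

```python
import pandas as pd


def fake_llm(prompt: str, columns: list[str]) -> str:
    """Hypothetical stand-in for the hosted LLM: returns canned pandas code."""
    # A real model would write code based on the prompt and the column names.
    return "result = f'{df.shape[0]} {df.shape[1]}'"


def run_prompt(df: pd.DataFrame, prompt: str) -> str:
    """Ask the (fake) LLM for code, execute it on df, and return the result."""
    code = fake_llm(prompt, list(df.columns))
    namespace = {"df": df}
    # Executing generated code is risky; only do this with output you trust.
    exec(code, namespace)
    return namespace["result"]


df = pd.DataFrame({"age": [25, 47, 36], "charges": [3200.5, 24500.0, 6100.75]})
print(run_prompt(df, "What is the size of the dataset?"))  # prints "3 2"
```

The point of the sketch is that the answer comes from running code on your data, which is why it can always be cross-checked with plain pandas.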
Let’s start with one of the most basic questions!
prompt = "What is the size of the dataset?"
pandas_ai(data, prompt=prompt)
Output:
'1338 7'
It’s always best to check the AI’s answers to make sure it understood the question correctly. I will use the pandas library, which you are probably familiar with, to validate its answers. Let’s see whether the above answer is correct.
import pandas as pd
print(data.shape)
Output:
(1338, 7)
The output matches PandasAI’s answer, and we are off to a good start. PandasAI can also impute missing values in the data. The data doesn’t contain any missing values, so I deliberately changed the first value of the charges column to null.
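The article does not show how the null value was introduced. A minimal sketch of one way to do it with pandas follows, using a small made-up frame named `demo` so that it does not overwrite the tutorial’s `data` variable.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only
demo = pd.DataFrame({
    "age": [25, 47, 36],
    "charges": [3200.5, 24500.0, 6100.75],
})

# Overwrite the first value of the charges column with a missing value
demo.loc[demo.index[0], "charges"] = np.nan

# Count missing values per column to confirm the change
print(demo.isnull().sum())
```

The per-column counts from `isnull().sum()` are also a handy reference when checking the LLM’s answer to the next prompt.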
prompt = '''How many null values are in the data.
Can you also tell which column contains the missing value'''
pandas_ai(data, prompt=prompt)
Output:
'1 charges'
This outputs ‘1 charges’, which indicates that 1 value is missing in the charges column, which is correct.
prompt = '''Impute the missing value in the data using the mean value.
Output the imputed value rounded to 2 decimal digits.'''
pandas_ai(data, prompt=prompt)
Output:
13267.72
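The imputed value can be cross-checked by hand. The sketch below shows mean imputation in plain pandas on made-up numbers, so its result differs from the value reported for the real dataset.

```python
import numpy as np
import pandas as pd

# Made-up charges with one missing value, for illustration only
demo = pd.DataFrame({"charges": [np.nan, 1000.0, 2000.0, 4000.0]})

# mean() skips missing values by default
mean_charge = demo["charges"].mean()

# Replace the missing entry with the column mean
demo["charges"] = demo["charges"].fillna(mean_charge)

print(round(mean_charge, 2))  # 2333.33
```

Running the same two steps on the tutorial’s `data` frame gives a reference value to compare against the LLM’s answer.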
Output:
   Age  Average BMI
0   64    32.976136
1   52    32.936034
2   58    32.718200
3   61    32.548261
4   62    32.342609
Generally, BMI values greater than 30 fall in the range of the obese category. Therefore, the data shows that people in their 50s and 60s are more likely to be obese than other age groups.
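A table of average BMI per age like the one above can be reproduced with a pandas group-by. This is a sketch on made-up rows, and it assumes the table lists the ages with the highest average BMI.

```python
import pandas as pd

# Made-up rows for illustration only
demo = pd.DataFrame({
    "age": [64, 64, 52, 30, 30],
    "bmi": [34.0, 32.0, 31.0, 24.0, 26.0],
})

# Mean BMI per age, highest first, keeping at most five ages
top = (
    demo.groupby("age")["bmi"]
    .mean()
    .sort_values(ascending=False)
    .head(5)
    .reset_index(name="Average BMI")
)
print(top)
```

On the real data, the same chain applied to `data` should yield figures that can be compared directly with the LLM’s table.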
prompt = '''Which region has the greatest number of smokers and which has the lowest?
Include the values of both the greatest and lowest numbers in the answer.
Provide the answer in form of a sentence.'''
pandas_ai(data, prompt=prompt)
Output:
'The region with the greatest number of smokers is southeast with 91 smokers.'
'The region with the lowest number of smokers is southwest with 58 smokers.'
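As with the earlier answers, the smoker counts can be checked with pandas. The sketch below uses made-up rows, so the counts differ from the real dataset; only the technique carries over.

```python
import pandas as pd

# Made-up rows for illustration only: ten smokers and three non-smokers
demo = pd.DataFrame({
    "smoker": ["yes"] * 10 + ["no"] * 3,
    "region": (
        ["southeast"] * 4
        + ["northeast"] * 3
        + ["northwest"] * 2
        + ["southwest"] * 1
        + ["southwest"] * 3
    ),
})

# Count smokers per region
counts = demo.loc[demo["smoker"] == "yes", "region"].value_counts()

print(counts.idxmax(), counts.max())  # region with the most smokers
print(counts.idxmin(), counts.min())  # region with the fewest smokers
```

Note that a region with no smokers at all would not appear in `counts`, so it is worth comparing against the full list of regions on real data.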
Let’s increase the difficulty a bit and ask a tricky question.
The region column contains four regions: northeast, northwest, southeast, and southwest. So, the north should contain both northeast and northwest regions. But can the LLM understand this subtle but important detail? Let’s find out!
prompt = '''What are the average charges of a female living in the north region?
Provide the answer in form of a sentence to 2 decimal places.'''
pandas_ai(data, prompt=prompt)
Output:
The average charges of a female living in the north region are $12479.87
Let’s check the answer manually using pandas.
north_data = data[(data['sex'] == 'female') &
((data['region'] == 'northeast') |
(data['region'] == 'northwest'))]
north_data['charges'].mean()
Output:
12714.35
The above code gives a different answer from the LLM’s, and the pandas result is the correct one. In this case, the LLM didn’t perform well. We can be more specific, tell the LLM what we mean by the north region, and see if it then gives the correct answer.
prompt = '''What are the average charges of a female living in the north region?
The north region consists of both the northeast and northwest regions.
Provide the answer in form of a sentence to 2 decimal places.'''
pandas_ai(data, prompt=prompt)
Output:
The average charges of a female living in the north region are $12714.35
This time it gives the correct answer. As this was a tricky question, we must be more careful about our prompts and include relevant details, as the LLM might overlook these subtle differences. Therefore, you can see that we can’t trust the LLM blindly as it can generate incorrect responses sometimes due to incomplete prompts or some other limitations, which I will discuss later in the tutorial.
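One way to make this kind of validation routine is to compute a reference value with pandas and compare it with the number the LLM reports. The helper `matches_reference` below is hypothetical and the small frame is made up; the two dollar figures are the ones quoted above.

```python
import math

import pandas as pd


def matches_reference(llm_value: float, reference: float, tol: float = 0.01) -> bool:
    """Return True when the LLM's number agrees with the pandas result within tol."""
    return math.isclose(llm_value, reference, abs_tol=tol)


# Made-up rows for illustration only
demo = pd.DataFrame({
    "sex": ["female", "female", "female", "male"],
    "region": ["northeast", "northwest", "southeast", "northeast"],
    "charges": [1000.0, 3000.0, 9000.0, 5000.0],
})

# "North" spelled out explicitly as the two northern regions
north = demo["region"].isin(["northeast", "northwest"])
reference = demo.loc[north & (demo["sex"] == "female"), "charges"].mean()

print(matches_reference(2000.0, reference))    # True
print(matches_reference(12479.87, 12714.35))   # False: the first LLM answer
print(matches_reference(12714.35, 12714.35))   # True: the corrected answer
```

Spelling out the regions with `isin` is the pandas counterpart of adding the extra sentence to the prompt: both remove the ambiguity about what "north" means.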
So far, we have seen PandasAI’s proficiency in analyzing data; now, let’s test it by plotting some graphs and seeing how well it visualizes data.
Let’s create a correlation heatmap of the numeric columns.
prompt = "Make a heatmap showing the correlation of all the numeric columns in the data"
pandas_ai(data, prompt=prompt)
That looks great. Under the hood, PandasAI uses Python’s Seaborn and matplotlib libraries to plot data. Let’s create some more graphs.
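The code that PandasAI generated for the heatmap is not shown, so the following is only a sketch of what equivalent hand-written code might look like. It uses plain matplotlib rather than Seaborn to keep dependencies small, and the rows are made up for illustration.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display

import matplotlib.pyplot as plt
import pandas as pd

# Made-up rows for illustration only
demo = pd.DataFrame({
    "age": [25, 47, 36, 52],
    "bmi": [22.5, 31.2, 28.0, 33.1],
    "charges": [3200.5, 24500.0, 6100.75, 11000.0],
    "region": ["northeast", "southeast", "northwest", "southwest"],
})

# Keep only numeric columns before computing pairwise correlations
corr = demo.select_dtypes("number").corr()
labels = list(corr.columns)

fig, ax = plt.subplots()
image = ax.imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(len(labels)))
ax.set_xticklabels(labels)
ax.set_yticks(range(len(labels)))
ax.set_yticklabels(labels)
fig.colorbar(image, ax=ax)
# In a notebook the figure renders inline; call fig.savefig(...) to keep a copy.
```

Dropping the non-numeric `region` column first matters, because a correlation is only defined for numeric columns.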
prompt = "Create a histogram of bmi with a kernel density plot."
pandas_ai(data, prompt=prompt)
The distribution of BMI values roughly resembles a normal distribution with a mean near 30.
prompt = "Make a boxplot of charges. Output the median value of charges."
pandas_ai(data, prompt=prompt)
The median value of the charges column is roughly 9382. In the plot, this is depicted by the orange line in the middle of the box. The circles in the above plot show that the charges column contains many outlier values.
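Both the median and the outliers can be checked numerically. The sketch below uses made-up charges and assumes the plot follows matplotlib’s default boxplot rule, where points more than 1.5 times the interquartile range beyond the quartiles are drawn as circles.

```python
import pandas as pd

# Made-up charges for illustration only, with one extreme value
charges = pd.Series([1000.0, 2000.0, 3000.0, 4000.0, 5000.0, 6000.0, 50000.0])

median = charges.median()

# Quartiles and interquartile range
q1, q3 = charges.quantile([0.25, 0.75])
iqr = q3 - q1

# Values outside the whiskers under the 1.5 * IQR rule
outliers = charges[(charges < q1 - 1.5 * iqr) | (charges > q3 + 1.5 * iqr)]

print(median)          # 4000.0
print(len(outliers))   # 1
```

Applied to `data['charges']`, the same steps give the median to compare with the LLM’s answer and a count of the points shown as circles.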
Now, let’s create some plots that show the relationship between more than one column.
prompt = "Make a horizontal bar chart of region vs smoker. Make the legend smaller."
pandas_ai(data, prompt=prompt)
The graph clearly shows that the southeast region has more smokers than any other region.
prompt = '''Make a scatterplot of age with charges and colorcode using the smoker values.
Also provide the legends.'''
pandas_ai(data, prompt=prompt)
Looks like age and charges follow a linear relationship for non-smokers, while no specific pattern exists for smokers.
To make things a little more complex, let’s try creating a plot using only a subset of the data instead of the full dataset and see how the LLM performs.
prompt = '''Make a scatterplot of bmi with charges and colorcode using the smoker values.
Add legends and use only data of people who have less than 2 children.'''
pandas_ai(data, prompt=prompt)
It did a great job creating a plot, even with a complex question. PandasAI has now unveiled its true potential. You have witnessed the true power of Large Language Models.
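To confirm that the filter in the prompt was applied, the same plot can be built by hand. This is a sketch on made-up rows using plain matplotlib; the column name `children` is assumed from the wording of the prompt, and the code is not what PandasAI generated.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display

import matplotlib.pyplot as plt
import pandas as pd

# Made-up rows for illustration only
demo = pd.DataFrame({
    "bmi": [22.5, 31.2, 28.0, 33.1, 26.4],
    "charges": [3200.5, 24500.0, 6100.75, 30000.0, 4100.0],
    "smoker": ["yes", "no", "no", "yes", "no"],
    "children": [0, 3, 1, 2, 0],
})

# Keep only people with fewer than 2 children
subset = demo[demo["children"] < 2]

fig, ax = plt.subplots()
# One scatter call per smoker value gives each group its own color and legend entry
for label, group in subset.groupby("smoker"):
    ax.scatter(group["bmi"], group["charges"], label=f"smoker: {label}")
ax.set_xlabel("bmi")
ax.set_ylabel("charges")
ax.legend()
```

Comparing `len(subset)` with the number of points in the LLM’s plot is a quick sanity check that the filter was respected.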
Pandas and PandasAI are both tools used for data analysis in Python, but they serve different purposes:
PandasAI represents a significant advancement in data analysis, combining the power of Pandas with the capabilities of Large Language Models. It simplifies complex data tasks through natural language prompts, making data analysis more accessible and efficient. While it excels at quick insights and visualizations, users should know its limitations, including potential biases and misinterpretations. PandasAI is not a replacement for traditional data analysis methods but a complementary tool that enhances productivity. As with any AI-powered tool, critical thinking and result validation remain crucial for accurate and reliable data analysis. I hope you enjoyed the article and now understand how to set up the API key and use the PandasAI API.
Here are some key takeaways from this article:
A. To start with PandasAI, visit their website, sign up, and explore their tools for AI-powered data analysis and automation using natural language.
A. Yes, PandasAI operates independently of OpenAI, leveraging its technology stack for data analysis and automation tasks.
A. PandasAI is known for its robust AI capabilities in data handling and analysis. It offers efficient tools for automating tasks traditionally done with the Pandas library in Python.
A. PandasAI’s limitations may include dependence on the quality of underlying AI models, potential for errors in complex data scenarios, and constraints in customization compared to traditional coding approaches with Pandas.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.