A Comprehensive Guide to PandasAI

nikhil1e9 Last Updated : 14 Oct, 2024
10 min read

Introduction

Generative AI and Large Language Models (LLMs) have brought a new era to Artificial Intelligence and Machine Learning. These large language models are used in various applications across different domains and have opened up new perspectives on AI. These models are trained on a vast amount of text data from all over the internet and can generate text in a human-like manner. The most well-known example of an LLM is ChatGPT, which OpenAI developed. It can perform various tasks, from creating original content to writing code. In this article, we will look into one such application of LLMs: the PandasAI library.

PandasAI can be considered a fusion of Python’s popular Pandas library and OpenAI’s GPT. It is extremely powerful for getting quick insights from data without writing much code. This article covers how to obtain an OpenAI API key, how to use the PandasAI API, and worked examples on a real dataset.

Learning Objectives

  • Understanding the differences between Pandas and PandasAI
  • PandasAI and its Role in Data Analysis and Visualization
  • Using PandasAI to build a full exploratory data analysis workflow
  • Understanding the importance of writing clear, concise, and specific prompts
  • Understanding the limitations of PandasAI

This article was published as a part of the Data Science Blogathon.

What is PandasAI?

PandasAI is a new tool for making data analysis and visualization tasks easier. It is built on Python’s Pandas library and uses Generative AI and LLMs in its work. Unlike Pandas, in which you have to analyze and manipulate data manually, PandasAI allows you to generate insights from data by simply providing a text prompt. It is like giving instructions to a skilled and proficient assistant who can do the work for you quickly. The only difference is that it is not a human but a machine that can understand and process information like a human.

In this article, I will review the full data analysis and visualization process using PandasAI with code examples and explanations. So, let’s get started.

Set up an OpenAI Account and Extract the API Key

To use PandasAI with OpenAI models, as this tutorial does, you must create an OpenAI account (if you don’t already have one) and use your API key. It can be done as follows:

  1. Go to https://platform.openai.com and create a personal account.
  2. Sign in to your account.
  3. Click on Personal on the top right side.
  4. Select View API keys from the dropdown.
  5. Create a new secret key.
  6. Copy and store the secret key to a safe location on your computer.

If you have followed the above-given steps, you are all set to leverage the power of Generative AI in your projects.
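One common way to keep the secret key out of your notebook is to read it from an environment variable. The sketch below assumes a variable named `OPENAI_API_KEY`; both that name and the helper `load_api_key` are illustrative choices, not something PandasAI requires.

```python
import os


def load_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Return the API key stored in the given environment variable."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        # Fail early with a clear message instead of sending an empty key.
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return key
```

You could then pass `load_api_key()` wherever this tutorial uses the YOUR_API_KEY placeholder.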

Installing PandasAI

Run the command below in a Jupyter Notebook, Google Colab, or a terminal to install the PandasAI package on your computer.

pip install pandasai

Installation will take some time, but once installed, you can directly import it into a Python environment.

from pandasai import PandasAI 

This will import PandasAI to your coding environment. We are ready to use it, but let’s first get the data.

Getting the Data and Instantiating an LLM

You can use any tabular data you like. I will use the medical charges data for this tutorial. (Note: PandasAI, like regular pandas, can only analyze tabular and structured data, not unstructured data such as images.)

The data looks like this.

[Image: preview of the dataset]
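If you want to follow along without the original file, a small stand-in DataFrame is enough to try the prompts. The column names below are the ones referenced in this article (age, sex, bmi, children, smoker, region, charges). The rows are made-up illustration values, not records from the real dataset, and the file name in the comment is hypothetical.

```python
import pandas as pd

# With the real file you would load it instead, for example:
# data = pd.read_csv("insurance.csv")  # hypothetical file name

# Made-up rows that only mimic the structure of the medical charges data.
data = pd.DataFrame(
    {
        "age": [19, 33, 46, 52, 64, 28],
        "sex": ["female", "male", "female", "male", "female", "male"],
        "bmi": [27.9, 22.7, 33.4, 30.8, 31.3, 36.4],
        "children": [0, 0, 1, 3, 0, 2],
        "smoker": ["yes", "no", "no", "no", "yes", "no"],
        "region": ["southwest", "northwest", "southeast",
                   "northeast", "northwest", "southeast"],
        "charges": [16000.0, 4500.0, 8200.0, 11000.0, 47000.0, 5100.0],
    }
)

print(data.head())
print(data.shape)  # (rows, columns)
```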

With the data in place, we will need our OpenAI API key to instantiate a Large Language Model. To do this, type in the code:

# Use your API key to instantiate an LLM
from pandasai.llm.openai import OpenAI

llm = OpenAI(api_token="YOUR_API_KEY")  # replace YOUR_API_KEY with your secret key
pandas_ai = PandasAI(llm)

Just enter your secret key created above in place of the YOUR_API_KEY placeholder in the above code, and you’ll be all good to go. Now, we can analyze our data and find some key insights using PandasAI.

Analyzing Data with PandasAI

PandasAI mainly takes two parameters as input: the dataset and the prompt, which is the query or question asked. You might wonder how it works under the hood, so let me explain.

Executing your prompt using PandasAI sends a request to the OpenAI server on which the LLM is hosted. The LLM processes the request and converts the query into appropriate Python code. PandasAI then runs that code with pandas to calculate the answer and outputs it to your screen.
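The flow described above can be sketched in a few lines. This is a conceptual illustration only, not PandasAI's actual source: `fake_llm` and `answer` are hypothetical names, the LLM is replaced by a stub that returns a fixed snippet, and the exact contents of the request are an assumption.

```python
import pandas as pd


def fake_llm(request: str) -> str:
    """Stand-in for the remote LLM: always returns the same pandas snippet."""
    return "result = df.shape"


def answer(df: pd.DataFrame, prompt: str, llm=fake_llm):
    # 1. Build a request from the question plus a small preview of the data.
    request = f"Question: {prompt}\nColumns: {list(df.columns)}\n{df.head().to_string()}"
    # 2. The LLM replies with Python code rather than with the final answer.
    code = llm(request)
    # 3. Run the generated code against the full DataFrame and read `result`.
    scope = {"df": df, "pd": pd}
    exec(code, scope)  # generated code should only ever run in a sandbox
    return scope["result"]


df = pd.DataFrame({"age": [19, 33, 46], "charges": [16000.0, 4500.0, 8200.0]})
print(answer(df, "What is the size of the dataset?"))  # (3, 2)
```

Because the answer depends on code the model writes, checking results against plain pandas is a sensible habit.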

Prompts

Let’s start with one of the most basic questions!

Q1. What is the size of the dataset?

prompt = "What is the size of the dataset?"
pandas_ai(data, prompt=prompt)

Output:
'1338 7'

It’s always best to check the correctness of the AI’s answers to ensure it understands our question correctly. I will use the pandas library, which you must be familiar with, to validate its answers. Let’s see if the above answer is correct or not.

import pandas as pd
print(data.shape)

Output:
(1338, 7)


The output matches PandasAI’s answer, and we are off to a good start. PandasAI can also impute missing values in the data. The data doesn’t contain any missing values, but I deliberately changed the first value of the charges column to null.

Finding the missing value and column it belongs to

prompt = '''How many null values are in the data.
            Can you also tell which column contains the missing value'''
pandas_ai(data, prompt=prompt)

Output:
'1 charges'

This outputs ‘1 charges’, which indicates that 1 value is missing and that it is in the charges column, which is correct.
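The same check can be done directly in pandas. The sketch below uses a tiny made-up DataFrame with one deliberately missing value; only the `charges` column name comes from this article.

```python
import numpy as np
import pandas as pd

# Toy data with one deliberately missing value, mirroring the setup above.
data = pd.DataFrame(
    {"age": [19, 33, 46], "charges": [np.nan, 4500.0, 8200.0]}
)

null_counts = data.isnull().sum()      # missing values per column
total_nulls = int(null_counts.sum())   # missing values overall
columns_with_nulls = null_counts[null_counts > 0].index.tolist()

print(total_nulls, columns_with_nulls)  # 1 ['charges']
```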

Imputing the missing value

prompt = '''Impute the missing value in the data using the mean value. 
            Output the imputed value rounded to 2 decimal digits.'''
pandas_ai(data, prompt=prompt)

Output:
13267.72

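For comparison, mean imputation in plain pandas is a one-liner with `fillna`. The values below are made up for illustration, so the imputed number differs from the one reported for the real data.

```python
import numpy as np
import pandas as pd

# Made-up charges with the first value missing.
charges = pd.Series([np.nan, 4500.0, 8200.0], name="charges")

mean_value = charges.mean()            # NaN values are skipped by default
imputed = charges.fillna(mean_value)   # replace NaN with the column mean

print(round(mean_value, 2))  # 6350.0
print(imputed.tolist())      # [6350.0, 4500.0, 8200.0]
```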

Now the first row looks like this:

[Image: the first row of the data. Source: Author]

Let’s check this using pandas.

# Checking mean values of charges excluding the first value
data['charges'].iloc[1:].mean()

Output:
13267.718823

This, too, outputs the same value. This is some incredible stuff. You can just talk to the AI, and it can solve your queries in a matter of seconds. And this is just one of many things it can do.

prompt = '''What is the proportion of males to females in the data?
            Output should look like this [Males: value, Females: value] where value is the answer.
            Also round the answer to 2 decimal places'''
pandas_ai(data, prompt=prompt)

Output:
'Males: 0.51, Females: 0.49'

You can also optimize your prompts and tell it to output the answer in a certain format, like the one given above. Detailed prompts make it easier for the AI to understand the question and help in extracting accurate answers, even for complex problems. Let’s check the answer using pandas.

data['sex'].value_counts(normalize=True)

Output:
male      0.505232
female    0.494768
Name: sex, dtype: float64

That’s correct.

Answering Interesting Questions Using PandasAI

Now let’s answer some more interesting questions to gain insights into the data.

Question: Which gender has higher medical charges on average?

prompt = '''Medical charges for which gender is more on average and by how much?
            Round the answer to 2 decimal places.
            Provide the answer in form of a sentence.'''
pandas_ai(data, prompt=prompt)

Output:
'On average, charges for male are higher by $1387.17.'

Question: Does smoking lead to higher charges on average?

prompt = '''Does smoking causes more charges on average and by how much?
            Provide the answer in form of a sentence rounded down to 2 decimal places.'''
pandas_ai(data, prompt=prompt)

Output:
'Smoking causes an average increase in charges of $23615.96.'

Now let’s ask a slightly more complicated question to test the limits of PandasAI.

Question: Which 5 ages have the highest average BMI?

prompt = '''What are the 5 ages with the highest average BMI?.
            Sort the values in descending order and display them in a table.'''
pandas_ai(data, prompt=prompt)

Output:
   Age  Average BMI
0   64    32.976136
1   52    32.936034
2   58    32.718200
3   61    32.548261
4   62    32.342609

Generally, a BMI of 30 or more falls in the obese category. The table therefore suggests that people in their 50s and 60s have the highest average BMI and are more likely to be obese than other age groups.
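A ranking like this is easy to cross-check in pandas with a group-by. The rows below are made up for illustration; only the `age` and `bmi` column names come from this article.

```python
import pandas as pd

# Made-up rows; the real dataset has many more people per age.
data = pd.DataFrame(
    {
        "age": [64, 64, 52, 52, 30, 30, 45],
        "bmi": [34.0, 32.0, 31.0, 33.0, 25.0, 27.0, 29.5],
    }
)

top_bmi = (
    data.groupby("age")["bmi"]
    .mean()                          # average BMI per age
    .sort_values(ascending=False)    # highest average first
    .head(5)                         # keep at most five ages
    .reset_index(name="Average BMI")
)
print(top_bmi)
```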

Q2. Which region has the greatest number of smokers?

prompt = '''Which region has the greatest number of smokers and which has the lowest?
            Include the values of both the greatest and lowest numbers in the answer.
            Provide the answer in form of a sentence.'''
pandas_ai(data, prompt=prompt)

Output:
'The region with the greatest number of smokers is southeast with 91 smokers.'
'The region with the lowest number of smokers is southwest with 58 smokers.'
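The smoker counts per region can be verified with a filter and `value_counts`. This is a sketch on made-up rows that reuse the article's `region` and `smoker` column names. Note that a region with no smokers at all would not appear in the counts.

```python
import pandas as pd

# Made-up rows for illustration only.
data = pd.DataFrame(
    {
        "region": ["southeast", "southeast", "southeast", "northeast",
                   "northeast", "southwest", "southwest", "northeast"],
        "smoker": ["yes", "yes", "yes", "yes", "yes", "yes", "no", "no"],
    }
)

# Keep only smokers, then count how many fall in each region.
smokers_by_region = data.loc[data["smoker"] == "yes", "region"].value_counts()

print(smokers_by_region.idxmax(), smokers_by_region.max())  # southeast 3
print(smokers_by_region.idxmin(), smokers_by_region.min())  # southwest 1
```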

Let’s increase the difficulty a bit and ask a tricky question.

Q3. What are the average charges of a female living in the north?

The region column contains four regions: northeast, northwest, southeast, and southwest. So, the north should contain both northeast and northwest regions. But can the LLM understand this subtle but important detail? Let’s find out!

prompt = '''What are the average charges of a female living in the north region?
            Provide the answer in form of a sentence to 2 decimal places.'''
pandas_ai(data, prompt=prompt)

Output:
The average charges of a female living in the north region are $12479.87

Let’s check the answer manually using pandas.

north_data = data[(data['sex'] == 'female') & 
                 ((data['region'] == 'northeast') |
                  (data['region'] == 'northwest'))]
north_data['charges'].mean()

Output:
12714.35

The above code gives a different answer from the LLM’s, and the pandas result is the correct one. In this case, the LLM didn’t perform well. We can be more specific and tell the LLM what we mean by the north region and see if it can give the correct answer.

prompt = '''What are the average charges of a female living in the north region?
            The north region consists of both the northeast and northwest regions.
            Provide the answer in form of a sentence to 2 decimal places.'''
pandas_ai(data, prompt=prompt)

Output:
The average charges of a female living in the north region are $12714.35

This time it gives the correct answer. As this was a tricky question, we must be more careful about our prompts and include relevant details, as the LLM might overlook these subtle differences. Therefore, you can see that we can’t trust the LLM blindly as it can generate incorrect responses sometimes due to incomplete prompts or some other limitations, which I will discuss later in the tutorial.
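Since answers come back as sentences, a small helper can automate this kind of cross-check. The sketch below is one possible approach, not part of PandasAI: `extract_number` and `matches_reference` are hypothetical helpers, and the two strings are the answers quoted above.

```python
import re


def extract_number(answer: str) -> float:
    """Pull the first number (commas allowed) out of a sentence."""
    match = re.search(r"-?\d[\d,]*(?:\.\d+)?", answer)
    if match is None:
        raise ValueError(f"No number found in: {answer!r}")
    return float(match.group().replace(",", ""))


def matches_reference(answer: str, reference: float, tol: float = 0.01) -> bool:
    """True when the number in the LLM's answer is within `tol` of the pandas value."""
    return abs(extract_number(answer) - reference) <= tol


first_try = "The average charges of a female living in the north region are $12479.87"
second_try = "The average charges of a female living in the north region are $12714.35"

print(matches_reference(first_try, 12714.35))   # False
print(matches_reference(second_try, 12714.35))  # True
```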

Visualizing Data with PandasAI

So far, we have seen PandasAI’s proficiency in analyzing data; now, let’s test it by plotting some graphs and seeing how well it visualizes data.

Correlation Heatmap

Let’s create a correlation heatmap of the numeric columns.

prompt = "Make a heatmap showing the correlation of all the numeric columns in the data"
pandas_ai(data, prompt=prompt)

[Image: correlation heatmap of the numeric columns]

That looks great. Under the hood, PandasAI uses Python’s Seaborn and matplotlib libraries to plot data. Let’s create some more graphs.
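To reproduce such a heatmap by hand, you would first compute the correlation matrix and then pass it to Seaborn. The sketch below uses made-up rows. The plotting function imports its libraries lazily so the numeric part runs even where Seaborn and matplotlib are not installed. The styling options are my own choices, not necessarily what PandasAI generates.

```python
import pandas as pd

# Made-up rows; `sex` is included to show that text columns are excluded.
data = pd.DataFrame(
    {
        "age": [19, 33, 46, 52, 64],
        "bmi": [27.9, 22.7, 33.4, 30.8, 31.3],
        "children": [0, 0, 1, 3, 0],
        "sex": ["female", "male", "female", "male", "female"],
    }
)

# Keep only numeric columns, then compute pairwise Pearson correlations.
corr = data.select_dtypes(include="number").corr()
print(corr.round(2))


def plot_heatmap(corr_matrix: pd.DataFrame) -> None:
    """Draw the matrix as an annotated heatmap (needs seaborn and matplotlib)."""
    import matplotlib.pyplot as plt
    import seaborn as sns

    sns.heatmap(corr_matrix, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
    plt.title("Correlation of numeric columns")
    plt.tight_layout()
    plt.show()
```

Calling `plot_heatmap(corr)` draws the figure.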

Distribution of BMI using Histogram

prompt = "Create a histogram of bmi with a kernel density plot."
pandas_ai(data, prompt=prompt)

[Image: histogram of bmi with a kernel density plot]

The distribution of BMI values roughly resembles a normal distribution with a mean near 30.

Distribution of Charges Using Boxplot

prompt = "Make a boxplot of charges. Output the median value of charges."
pandas_ai(data, prompt=prompt)

[Image: boxplot of charges]

The median value of the charges column is roughly 9382. In the plot, this is depicted by the orange line in the middle of the box. The circles in the above plot show that the charges column contains many outlier values.
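The numbers behind a box plot can be computed directly. The sketch below assumes the common 1.5 × IQR rule for flagging outliers, which is the default whisker length in matplotlib box plots. The charges are made-up values.

```python
import pandas as pd

# Made-up charges with one very large value.
charges = pd.Series(
    [1200.0, 3000.0, 4500.0, 8200.0, 9400.0, 11000.0, 13000.0, 47000.0]
)

median = charges.median()                        # the line inside the box
q1, q3 = charges.quantile(0.25), charges.quantile(0.75)
iqr = q3 - q1                                    # height of the box
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr    # outlier thresholds
outliers = charges[(charges < lower) | (charges > upper)]

print(median)             # 8800.0
print(outliers.tolist())  # [47000.0]
```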

Now, let’s create some plots that show the relationship between more than one column.

Region vs. Smoker

prompt = "Make a horizontal bar chart of region vs smoker. Make the legend smaller."
pandas_ai(data, prompt=prompt)

[Image: horizontal bar chart of region vs. smoker]

The graph clearly shows that the southeast region has the greatest number of smokers compared to other regions.

Variation of Charges with Age

prompt = '''Make a scatterplot of age with charges and colorcode using the smoker values. 
            Also provide the legends.'''
pandas_ai(data, prompt=prompt)

[Image: scatterplot of age vs. charges, colored by smoker]

Looks like age and charges follow a linear relationship for non-smokers, while no specific pattern exists for smokers.

Variation of Charges with BMI

To make things a little more complex, let’s try creating a plot using only a subset of the data instead of the full data and see how the LLM performs.

prompt = '''Make a scatterplot of bmi with charges and colorcode using the smoker values.
            Add legends and use only data of people who have less than 2 children.'''
pandas_ai(data, prompt=prompt)

[Image: scatterplot of bmi vs. charges, colored by smoker, for people with fewer than 2 children]

It did a great job creating a plot, even with a complex question. PandasAI has now unveiled its true potential. You have witnessed the true power of Large Language Models.
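For reference, the filtering step in that prompt corresponds to a simple boolean mask in pandas. The sketch below uses made-up rows and a hypothetical `plot_subset` helper. matplotlib is imported inside the function so the filtering part runs on its own.

```python
import pandas as pd

# Made-up rows that reuse the article's column names.
data = pd.DataFrame(
    {
        "bmi": [27.9, 22.7, 33.4, 30.8, 31.3],
        "charges": [16000.0, 4500.0, 8200.0, 11000.0, 47000.0],
        "smoker": ["yes", "no", "no", "no", "yes"],
        "children": [0, 1, 2, 3, 0],
    }
)

# Keep only people with fewer than 2 children.
subset = data[data["children"] < 2]
print(len(subset), "of", len(data), "rows kept")  # 3 of 5 rows kept


def plot_subset(df: pd.DataFrame) -> None:
    """Scatter bmi against charges, one color per smoker value."""
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    for label, group in df.groupby("smoker"):
        ax.scatter(group["bmi"], group["charges"], label=f"smoker: {label}")
    ax.set_xlabel("bmi")
    ax.set_ylabel("charges")
    ax.legend()
    plt.show()
```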

Limitations of PandasAI

  • The responses generated by PandasAI can sometimes exhibit inherent biases due to the vast amount of data LLMs are trained on from the Internet, which can hinder the analysis. Understanding and mitigating such biases is essential to ensuring fair and unbiased results.
  • LLMs can sometimes misinterpret ambiguous or contextually complex queries, leading to inaccurate or unexpected results. One must exercise caution and double-check the answers before making any critical data-driven decision.
  • It can sometimes be slow to come to an answer or completely fail. The server hosts the LLMs, and occasionally, technical issues may prevent the request from reaching the server or being processed.
  • It cannot be used for big data analysis tasks as it is not computationally efficient when dealing with large amounts of data and requires high-performance GPUs or computational resources.
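For the occasional slow or failed request, one common mitigation is to retry with a growing delay. The helper below is a generic sketch and not a PandasAI feature: `run_with_retries` is a hypothetical name, and catching every exception is a simplification.

```python
import time


def run_with_retries(ask, attempts: int = 3, base_delay: float = 1.0):
    """Call `ask()` and retry with exponential backoff if it raises."""
    last_error = None
    for attempt in range(attempts):
        try:
            return ask()
        except Exception as error:  # real code should catch narrower error types
            last_error = error
            if attempt < attempts - 1:
                # Wait 1x, 2x, 4x, ... the base delay before the next try.
                time.sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"Request failed after {attempts} attempts") from last_error
```

With the objects from this tutorial, a call might look like `run_with_retries(lambda: pandas_ai(data, prompt=prompt))`.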

What Are Pandas and PandasAI Used For?

Pandas and PandasAI are both tools used for data analysis in Python, but they serve different purposes:

  • Pandas:
    • It is a well-established library that provides powerful data manipulation and analysis functionalities.
    • You directly interact with the data using Python code.
    • It offers many features for working with DataFrames, which are like spreadsheets on steroids. You can load data, clean it, perform calculations, and create visualizations.
    • It requires knowledge of Python programming to be used effectively.
  • PandasAI:
    • It is a relatively new tool built on top of Pandas.
    • It integrates generative AI to allow you to analyze data using natural language.
    • You can ask questions about your data in plain English, and PandasAI will translate those questions into Python code and generate insights or visualizations.
    • Aims to make data analysis more accessible, especially for those less familiar with programming.
    • It is a complementary tool to Pandas, not a replacement.

Conclusion

PandasAI represents a significant advancement in data analysis, combining the power of Pandas with the capabilities of Large Language Models. This tool simplifies complex data tasks through natural language prompts, making data analysis more accessible and efficient. While it excels in quick insights and visualizations, users should know its limitations, including potential biases and misinterpretations. PandasAI is not a replacement for traditional data analysis methods but a complementary tool that enhances productivity. As with any AI-powered tool, critical thinking and result validation remain crucial for accurate and reliable data analysis. I hope you liked the article and now understand how to set up an API key and use the PandasAI API.

Key Takeaways

Here are some key takeaways from this article:

  • PandasAI is a Python library that adds Generative AI capabilities to Pandas by combining it with large language models.
  • PandasAI makes Pandas conversational by allowing us to ask questions in natural language using text prompts.
  • Despite its amazing capabilities, PandasAI has its limitations. Don’t trust its answers blindly, and don’t rely on it for demanding use cases like big data analysis.

Frequently Asked Questions

Q1. How do I get started with PandasAI?

A. Install the package with pip install pandasai, obtain an API key for the LLM you want to use (OpenAI in this article), instantiate the LLM, and then pass your DataFrame and a natural language prompt to PandasAI.

Q2. Can I use PandasAI without OpenAI?

A. PandasAI is not tied to a single provider and can be configured with other LLM backends. The examples in this article use OpenAI and therefore need an OpenAI API key.

Q3. How good is PandasAI?

A. In the examples in this article, PandasAI answered most questions correctly and produced reasonable plots, but it gave a wrong answer to an ambiguous prompt until the prompt was made more specific. It works well for quick exploration as long as you validate the results.

Q4. What are the limitations of PandasAI?

A. PandasAI’s limitations may include dependence on the quality of underlying AI models, potential for errors in complex data scenarios, and constraints in customization compared to traditional coding approaches with Pandas.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion. 

Hello, I'm Nikhil Kotra, a data science enthusiast with a bachelor's degree from Indian Institute of Technology Roorkee.

I have done various internships and projects in the field of AI, machine learning and deep learning and want to contribute to the tech industry and the future of AI.

I am really passionate about leveraging the power of AI for the benefit of humanity and to tackle real issues like environmental crisis and health hazards. I believe that AI should be used ethically and morally by respecting and upholding other people's opinions.

I am really interested in doing some real-world projects using Generative AI and Large Language Models and contributing to the data science community by sharing my knowledge and learnings through articles and blogs.

In my free time, I enjoy traveling, playing chess and reading books.
