A Comprehensive Guide to PandasAI

nikhil1e9 Last Updated : 14 Oct, 2024

10 min read

Introduction

Generative AI and Large Language Models (LLMs) have brought a new era to Artificial Intelligence and Machine Learning. These large language models are used in various applications across different domains and have opened up new perspectives on AI. These models are trained on a vast amount of text data from all over the internet and can generate text in a human-like manner. The most well-known example of an LLM is ChatGPT, which OpenAI developed. It can perform various tasks, from creating original content to writing code. In this article, we will look into one such application of LLMs: the PandasAI library.

PandasAI can be considered a fusion of Python’s popular Pandas library and OpenAI’s GPT. It is extremely powerful for getting quick insights from data without writing much code. In this article, you will learn how to set up an API key for PandasAI, how to use its API, and how it behaves on worked examples.

Learning Objectives

Understanding the differences between Pandas and PandasAI
PandasAI and its Role in Data Analysis and Visualization
Using PandasAI to build a full exploratory data analysis workflow
Understanding the importance of writing clear, concise, and specific prompts
Understanding the limitations of PandasAI and the LLMs behind it

This article was published as a part of the Data Science Blogathon.

Introduction
What is PandasAI?
Set up an OpenAI Account and Extract the API Key
Installing PandasAI
Getting the Data and Instantiating an LLM
Analyzing Data with PandasAI
Visualizing Data with PandasAI
Limitations of PandasAI
What is the use of PandasAI and Pandas?
Conclusion
Frequently Asked Questions

What is PandasAI?

PandasAI is a new tool for making data analysis and visualization tasks easier. It is built on Python’s Pandas library and uses Generative AI and LLMs in its work. Unlike Pandas, in which you have to analyze and manipulate data manually, PandasAI allows you to generate insights from data by simply providing a text prompt. It is like giving instructions to a skilled and proficient assistant who can do the work for you quickly. The only difference is that it is not a human but a machine that can understand and process information like a human.

In this article, I will review the full data analysis and visualization process using PandasAI with code examples and explanations. So, let’s get started.

Set up an OpenAI Account and Extract the API Key

To use the PandasAI library, you must create an OpenAI account (if you don’t already have one) and use your API key. It can be done as follows:

Go to https://platform.openai.com and create a personal account.
Sign in to your account.
Click on Personal on the top right side.
Select View API keys from the dropdown.
Create a new secret key.
Copy and store the secret key to a safe location on your computer.

If you have followed the above-given steps, you are all set to leverage the power of Generative AI in your projects.
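Rather than pasting the secret key directly into a notebook, one common option is to read it from an environment variable. The snippet below is a minimal sketch of that idea; the helper name load_api_key and the variable name OPENAI_API_KEY are assumptions of this example, not something PandasAI requires.

```python
import os


def load_api_key(var_name: str = "OPENAI_API_KEY") -> str:
    """Return the API key stored in an environment variable.

    The variable name is an assumption of this sketch; use whatever
    name you exported the key under.
    """
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable first.")
    return key
```

The returned string can then be passed wherever the key is needed, which keeps the secret out of the notebook itself.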

Installing PandasAI

Run the command below in a Jupyter Notebook, Google Colab, or a terminal to install the PandasAI package on your computer.

pip install pandasai

Installation will take some time, but once installed, you can directly import it into a Python environment.

from pandasai import PandasAI

This will import PandasAI to your coding environment. We are ready to use it, but let’s first get the data.

Getting the Data and Instantiating an LLM

You can use any tabular data you like. I will use the medical charges data for this tutorial. (Note: PandasAI, like regular pandas, can only analyze tabular and structured data, not unstructured data such as images.)

The data looks like this.
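As a sketch of this step, tabular data can be read into a DataFrame with pandas. The seven column names below are the ones referenced throughout this tutorial; the four rows are illustrative values invented for this example, and the inline text stands in for a real CSV file.

```python
import io

import pandas as pd

# Illustrative rows only; in practice you would pass the path of your
# own CSV file to pd.read_csv instead of an in-memory string.
csv_text = """age,sex,bmi,children,smoker,region,charges
25,female,24.5,0,no,northeast,3200.50
41,male,31.2,2,yes,southeast,28500.75
33,male,27.8,1,no,northwest,5100.00
58,female,35.1,3,no,southwest,12900.25
"""

data = pd.read_csv(io.StringIO(csv_text))
print(data.head())
print(data.shape)
```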

With the data in place, we will need our OpenAI API key to instantiate a Large Language Model. To do this, type in the code:

# Use your API key to instantiate an LLM
from pandasai.llm.openai import OpenAI

llm = OpenAI(api_token="YOUR_API_KEY")
pandas_ai = PandasAI(llm)

Just enter your secret key created above in place of the YOUR_API_KEY placeholder in the above code, and you’ll be all good to go. Now, we can analyze our data and find some key insights using PandasAI.

Analyzing Data with PandasAI

PandasAI mainly takes two parameters as input: the dataset and the prompt, which is the query or question asked. You might wonder how it works under the hood, so let me explain.

Executing your prompt using PandasAI sends a request to the OpenAI server on which the LLM is hosted. The LLM processes the request and converts the query into appropriate Python code that uses pandas. PandasAI runs that code on your data to calculate the answer and then outputs it to your screen.
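The prompt-to-code idea can be illustrated with a toy sketch. Nothing here is PandasAI's actual implementation: fake_llm is a hypothetical stand-in that returns a canned snippet instead of calling a hosted model, and answer simply runs that snippet against the DataFrame.

```python
import pandas as pd


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for the hosted LLM: maps a prompt to pandas code."""
    canned = {
        "What is the size of the dataset?": "result = df.shape",
    }
    return canned[prompt]


def answer(df: pd.DataFrame, prompt: str):
    """Ask the (fake) LLM for code, run it against df, and return `result`."""
    code = fake_llm(prompt)
    scope = {"df": df}
    exec(code, scope)  # generated code should only ever run in a trusted sandbox
    return scope["result"]


toy = pd.DataFrame({"age": [19, 18, 28], "charges": [100.0, 200.0, 300.0]})
print(answer(toy, "What is the size of the dataset?"))  # (3, 2)
```

Because the answer comes from generated code, a misunderstood prompt produces code that runs fine but answers a different question, which is why the checks in the rest of this tutorial matter.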

Prompts

Let’s start with one of the most basic questions!

Q1. What is the size of the dataset?

prompt = "What is the size of the dataset?"
pandas_ai(data, prompt=prompt)

Output:
'1338 7'

It’s always best to check the correctness of the AI’s answers to ensure it understands our question correctly. I will use the pandas library, which you must be familiar with, to validate its answers. Let’s see if the above answer is correct.

import pandas as pd
print(data.shape)

Output:
(1338, 7)

Output

The output matches PandasAI’s answer, and we are off to a good start. PandasAI can also impute missing values in the data. The data doesn’t contain any missing values, so I deliberately changed the first value of the charges column to null.

Finding the missing value and column it belongs to

prompt = '''How many null values are in the data.
            Can you also tell which column contains the missing value'''
pandas_ai(data, prompt=prompt)

Output:
'1 charges'

This outputs ‘1 charges’, which indicates that 1 value is missing in the charges column, which is correct.
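The same check can be done by hand with pandas. Below is a minimal sketch on a toy frame (the values are invented for illustration) that counts the missing values and lists the columns that contain them.

```python
import pandas as pd

# Toy frame with one deliberately missing value in `charges`.
toy = pd.DataFrame(
    {
        "age": [19, 18, 28],
        "charges": [float("nan"), 1725.55, 4449.46],
    }
)

null_counts = toy.isnull().sum()      # missing values per column
total_nulls = int(null_counts.sum())  # missing values overall
cols_with_nulls = null_counts[null_counts > 0].index.tolist()

print(total_nulls, cols_with_nulls)
```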

Imputing the missing value

prompt = '''Impute the missing value in the data using the mean value. 
            Output the imputed value rounded to 2 decimal digits.'''
pandas_ai(data, prompt=prompt)

Output:
13267.72
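For comparison, mean imputation in plain pandas comes down to two calls. This is a sketch on a toy column with invented values, not the code PandasAI generated.

```python
import pandas as pd

# Toy column with the first value missing, mirroring the setup above.
charges = pd.Series([float("nan"), 100.0, 200.0, 300.0], name="charges")

mean_value = charges.mean()           # NaN values are skipped by default
imputed = charges.fillna(mean_value)  # replace NaN with the mean

print(round(mean_value, 2))  # 200.0
print(imputed.tolist())      # [200.0, 100.0, 200.0, 300.0]
```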

Now the first row looks like this

(Image: the first row of the data. Source: Author)

Let’s check this using pandas.

# Checking the mean value of charges, excluding the first value
data['charges'].iloc[1:].mean()

Output:
13267.718823

This too outputs the same value. This is some incredible stuff. You can just talk to the AI, and it can answer your queries in a matter of seconds, and this is just one of many things it can do.

prompt = '''What is the proportion of males to females in the data?
            Output should look like this [Males: value, Females: value] where value is the answer.
            Also round the answer to 2 decimal places'''
pandas_ai(data, prompt=prompt)

Output:
'Males: 0.51, Females: 0.49'

You can also optimize your prompts and tell it to output the answer in a certain format, like the one given above. Detailed prompts make it easier for the AI to understand the question and help in extracting accurate answers, even for complex problems. Let’s check the answer using pandas.

data['sex'].value_counts(normalize=True)

Output:
male      0.505232
female    0.494768
Name: sex, dtype: float64

That’s correct.

Answering interesting questions

Now let’s answer some more interesting questions to gain insights on the data.

Question: Medical charges for which gender are higher on average?

prompt = '''Medical charges for which gender is more on average and by how much?
            Round the answer to 2 decimal places.
            Provide the answer in form of a sentence.'''
pandas_ai(data, prompt=prompt)

Output:
'On average, charges for male are higher by $1387.17.'

Question: Does smoking cause higher charges on average?

prompt = '''Does smoking causes more charges on average and by how much?
            Provide the answer in form of a sentence rounded down to 2 decimal places.'''
pandas_ai(data, prompt=prompt)

Output:
'Smoking causes an average increase in charges of $23615.96.'

Now let’s ask a slightly more complicated question.

Question: List the 5 age groups having the highest average BMI

prompt = '''What are the 5 ages with the highest average BMI?.
            Sort the values in descending order and display them in a table.'''
pandas_ai(data, prompt=prompt)

Output:
   Age  Average BMI
0   64    32.976136
1   52    32.936034
2   58    32.718200
3   61    32.548261
4   62    32.342609

Generally, BMI values greater than 30 fall in the range of the obese category. Therefore, the data shows that people in their 50s and 60s are more likely to be obese than other age groups.
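A ranking like this can be cross-checked with a pandas group-by. The sketch below uses a toy frame with invented values; the same chain would apply to the age and bmi columns of the real data.

```python
import pandas as pd

# Toy data: values are illustrative only.
toy = pd.DataFrame(
    {
        "age": [64, 64, 52, 52, 30, 30],
        "bmi": [34.0, 32.0, 31.0, 33.0, 25.0, 27.0],
    }
)

top_ages = (
    toy.groupby("age")["bmi"]
    .mean()                          # average BMI per age
    .sort_values(ascending=False)    # highest average first
    .head(5)                         # keep at most five ages
    .reset_index(name="Average BMI")
)
print(top_ages)
```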

Q2. Which region has the greatest number of smokers?

prompt = '''Which region has the greatest number of smokers and which has the lowest?
            Include the values of both the greatest and lowest numbers in the answer.
            Provide the answer in form of a sentence.'''
pandas_ai(data, prompt=prompt)

Output:
'The region with the greatest number of smokers is southeast with 91 smokers.'
'The region with the lowest number of smokers is southwest with 58 smokers.'
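To validate an answer like this by hand, the smokers can be counted per region with pandas. A minimal sketch on toy data follows; the counts are invented and do not match the real dataset.

```python
import pandas as pd

# Ten smokers spread over four regions, plus two non-smokers.
regions = ["southeast"] * 4 + ["northeast"] * 3 + ["northwest"] * 2 + ["southwest"]
toy = pd.DataFrame(
    {
        "region": regions + ["southwest", "northwest"],
        "smoker": ["yes"] * 10 + ["no"] * 2,
    }
)

# Keep only the smokers, then count rows per region.
smokers_per_region = toy.loc[toy["smoker"] == "yes", "region"].value_counts()

print(smokers_per_region.idxmax(), smokers_per_region.max())
print(smokers_per_region.idxmin(), smokers_per_region.min())
```

One caveat of this approach: a region with no smokers at all would not appear in the counts, so it is worth checking that every region is present before reading off the minimum.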

Let’s increase the difficulty a bit and ask a tricky question.

Q3. What are the average charges of a female living in the north?

The region column contains four regions: northeast, northwest, southeast, and southwest. So, the north should contain both northeast and northwest regions. But can the LLM understand this subtle but important detail? Let’s find out!

prompt = '''What are the average charges of a female living in the north region?
            Provide the answer in form of a sentence to 2 decimal places.'''
pandas_ai(data, prompt=prompt)

Output:
The average charges of a female living in the north region are $12479.87

Let’s check the answer manually using pandas.

north_data = data[(data['sex'] == 'female') & 
                 ((data['region'] == 'northeast') |
                  (data['region'] == 'northwest'))]
north_data['charges'].mean()

Output:
12714.35

The above code outputs a different answer (which is the correct answer) than the LLM gave. In this case, the LLM didn’t perform well. We can be more specific and tell the LLM what we mean by the north region and see if it can give the correct answer.

prompt = '''What are the average charges of a female living in the north region?
            The north region consists of both the northeast and northwest regions.
            Provide the answer in form of a sentence to 2 decimal places.'''
pandas_ai(data, prompt=prompt)

Output:
The average charges of a female living in the north region are $12714.35

This time it gives the correct answer. As this was a tricky question, we must be more careful about our prompts and include relevant details, as the LLM might overlook these subtle differences. Therefore, you can see that we can’t trust the LLM blindly as it can generate incorrect responses sometimes due to incomplete prompts or some other limitations, which I will discuss later in the tutorial.
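Checks like the one above can be made routine by comparing the number in the LLM's sentence against a value computed with pandas. The helper names below (extract_number, matches) are hypothetical and the tolerance is an assumption of this sketch.

```python
import math
import re


def extract_number(answer: str) -> float:
    """Pull the first number out of a sentence-style answer."""
    match = re.search(r"-?\d+(?:\.\d+)?", answer.replace(",", ""))
    if match is None:
        raise ValueError(f"No number found in: {answer!r}")
    return float(match.group())


def matches(answer: str, expected: float, tol: float = 0.01) -> bool:
    """True if the number in the answer is within `tol` of the expected value."""
    return math.isclose(extract_number(answer), expected, abs_tol=tol)


first = "The average charges of a female living in the north region are $12479.87"
second = "The average charges of a female living in the north region are $12714.35"
print(matches(first, 12714.35))   # False
print(matches(second, 12714.35))  # True
```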

Visualizing Data with PandasAI

So far, we have seen PandasAI’s proficiency in analyzing data; now, let’s test it by plotting some graphs and seeing how well it visualizes data.

Correlation Heatmap

Let’s create a correlation heatmap of the numeric columns.

prompt = "Make a heatmap showing the correlation of all the numeric columns in the data"
pandas_ai(data, prompt=prompt)

That looks great. Under the hood, PandasAI uses Python’s Seaborn and matplotlib libraries to plot data. Let’s create some more graphs.
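The numbers behind such a heatmap are just a correlation matrix of the numeric columns, which pandas can compute directly. The sketch below uses a toy frame with invented values and is not the code PandasAI generated.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "age": [20, 30, 40, 50],
        "bmi": [22.0, 27.0, 25.0, 30.0],
        "charges": [2000.0, 3000.0, 4000.0, 5000.0],
        "sex": ["male", "female", "male", "female"],  # non-numeric, dropped below
    }
)

# Keep only numeric columns, then compute pairwise Pearson correlations.
corr = toy.select_dtypes(include="number").corr()
print(corr.round(2))
```

Passing corr to seaborn's heatmap function (for example with annot=True) would draw a plot similar to the one above.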

Distribution of BMI using Histogram

prompt = "Create a histogram of bmi with a kernel density plot."
pandas_ai(data, prompt=prompt)

The distribution of BMI values somewhat resembles the normal distribution plot with a mean value near 30.

Distribution of Charges Using Boxplot

prompt = "Make a boxplot of charges. Output the median value of charges."
pandas_ai(data, prompt=prompt)

The median value of the charges column is roughly 9382. In the plot, this is depicted by the orange line in the middle of the box. The circles in the above plot show that the charges column contains many outlier values.
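The median and the outliers shown in a boxplot can also be computed numerically. The sketch below assumes the common 1.5 × IQR whisker rule (matplotlib's default) and uses a toy series with invented values.

```python
import pandas as pd

charges = pd.Series([1000.0, 2000.0, 3000.0, 4000.0, 5000.0, 60000.0])

median = charges.median()
q1, q3 = charges.quantile(0.25), charges.quantile(0.75)
iqr = q3 - q1  # interquartile range, the height of the box

# Points beyond 1.5 * IQR from the box are drawn as individual circles.
outliers = charges[(charges < q1 - 1.5 * iqr) | (charges > q3 + 1.5 * iqr)]

print(median)
print(outliers.tolist())
```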

Now, let’s create some plots that show the relationship between more than one column.

Region vs. Smoker

prompt = "Make a horizontal bar chart of region vs smoker. Make the legend smaller."
pandas_ai(data, prompt=prompt)

The graph clearly shows that the Southeast region has the greatest number of smokers compared to other regions.

Variation of Charges with Age

prompt = '''Make a scatterplot of age with charges and colorcode using the smoker values. 
            Also provide the legends.'''
pandas_ai(data, prompt=prompt)

Looks like age and charges follow a linear relationship for non-smokers, while no specific pattern exists for smokers.

Variation of Charges with BMI

To make things a little more complex, let’s try creating a plot using only a subset of the data instead of the full data and see how the LLM performs.

prompt = '''Make a scatterplot of bmi with charges and colorcode using the smoker values. 
            Add legends and use only data of people who have less than 2 children.'''
pandas_ai(data, prompt=prompt)

It did a great job creating a plot, even with a complex question. PandasAI has now unveiled its true potential. You have witnessed the true power of Large Language Models.

Limitations of PandasAI

The responses generated by PandasAI can sometimes exhibit inherent biases due to the vast amount of data LLMs are trained on from the Internet, which can hinder the analysis. Understanding and mitigating such biases is essential to ensuring fair and unbiased results.
LLMs can sometimes misinterpret ambiguous or contextually complex queries, leading to inaccurate or unexpected results. One must exercise caution and double-check the answers before making any critical data-driven decision.
It can sometimes be slow to come to an answer or completely fail. The server hosts the LLMs, and occasionally, technical issues may prevent the request from reaching the server or being processed.
It cannot be used for big data analysis tasks as it is not computationally efficient when dealing with large amounts of data and requires high-performance GPUs or computational resources.
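For the occasional slow or failed request mentioned above, one possible mitigation is to retry the call with a growing delay. The wrapper below is a generic sketch; call_with_retries is a hypothetical helper of this example, not part of PandasAI.

```python
import time


def call_with_retries(func, attempts: int = 3, base_delay: float = 1.0):
    """Call func(); on an exception, wait and retry with exponential backoff."""
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts, surface the last error
            time.sleep(base_delay * 2 ** attempt)
```

It could wrap any prompt call, for example call_with_retries(lambda: pandas_ai(data, prompt=prompt)). Retrying helps with transient network errors only; it does nothing for answers that are wrong.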

What is the use of PandasAI and Pandas?

Pandas and PandasAI are both tools used for data analysis in Python, but they serve different purposes:

Pandas:
- It is a well-established library that provides powerful data manipulation and analysis functionalities.
- You directly interact with the data using Python code.
- It offers many features for working with dataframes like spreadsheets on steroids. You can load data, clean it, perform calculations, and create visualizations.
- It requires knowledge of Python programming to be used effectively.
PandasAI:
- It is a relatively new tool built on top of Pandas.
- Integrates generative AI to allow you to analyze data using natural language processing.
- You can ask questions about your data in plain English, and PandasAI will translate those questions into Python code and generate insights or visualizations.
- Aims to make data analysis more accessible, especially for those less familiar with programming.
- It is a complementary tool to Pandas, not a replacement.

Conclusion

PandasAI represents a significant advancement in data analysis, combining the power of Pandas with the capabilities of Large Language Models. This tool simplifies complex data tasks through natural language prompts, making data analysis more accessible and efficient. While it excels at quick insights and visualizations, users should know its limitations, including potential biases and misinterpretations. PandasAI is not a replacement for traditional data analysis methods but a complementary tool that enhances productivity. As with any AI-powered tool, critical thinking and result validation remain crucial for accurate and reliable data analysis. I hope you liked the article and now understand how to set up an API key and use PandasAI.

Key Takeaways

Here are some key takeaways from this article:

PandasAI is a Python library that adds Generative AI capabilities to Pandas by combining it with large language models.
PandasAI makes Pandas conversational by allowing us to ask questions in natural language using text prompts.
Despite its amazing capabilities, PandasAI has its limitations. Don’t trust it blindly or use it for demanding use cases such as big data analysis.

Frequently Asked Questions

Q1. How do I get started with PandasAI?

A. To get started with PandasAI, install it with pip install pandasai, instantiate an LLM with your API key, and pass your dataset and a natural-language prompt to it.

Q2. Can I use PandasAI without OpenAI?

A. Yes, PandasAI operates independently of OpenAI, leveraging its technology stack for data analysis and automation tasks.

Q3. How good is PandasAI?

A. PandasAI is known for its robust AI capabilities in data handling and analysis. It offers efficient tools for automating tasks traditionally done with the Pandas library in Python.

Q4. What are the limitations of PandasAI?

A. PandasAI’s limitations may include dependence on the quality of underlying AI models, potential for errors in complex data scenarios, and constraints in customization compared to traditional coding approaches with Pandas.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

nikhil1e9

Hello, I'm Nikhil Kotra, a data science enthusiast with a bachelor's degree from Indian Institute of Technology Roorkee.

I have done various internships and projects in the field of AI, machine learning and deep learning and want to contribute to the tech industry and the future of AI.

I am really passionate about leveraging the power of AI for the benefit of humanity and to tackle real issues like environmental crisis and health hazards. I believe that AI should be used ethically and morally by respecting and upholding other people's opinions.

I am really interested in doing some real-world projects using Generative AI and Large Language Models and contributing to the data science community by sharing my knowledge and learnings through articles and blogs.

In my free time, I enjoy traveling, playing chess and reading books.
