Recent breakthroughs in generative artificial intelligence (GenAI) are disrupting the data field. Companies are looking for ways to make the most of innovations such as ChatGPT and gain a competitive advantage. One such cutting-edge innovation is “PandasAI,” a library that adds GenAI-powered data analysis to the regular Pandas library. It is an open-source library that, in this article, runs on OpenAI's models. Unlike other applications of generative AI, PandasAI applies the technology to the analysis tool Pandas.
As the name suggests, it directly applies artificial intelligence to the traditional Pandas library. Pandas has become very popular among Python users in the data field for tasks such as preprocessing and data visualization, and this innovation makes it even better.
This article was published as a part of the Data Science Blogathon.
PandasAI is a Python library that uses generative AI models to carry out tasks with Pandas. It integrates generative AI capabilities, using prompt engineering, to make Pandas data frames conversational. When we think of Pandas, data analysis and manipulation come to mind. With PandasAI, we try to improve our productivity with Pandas by adding the benefits of GenAI.
With the help of generative AI, all we need to do is give conversational prompts to the dataset. This reduces the need to learn or understand complex code. The data scientist can query the dataset in natural human language and get results, which saves time in preprocessing and analysis. This is a new revolution in which programmers need to write far less code. They say what they have in mind and see their instructions carried out. Even non-techies can now build systems without writing complex code!
Before we see how to use PandasAI, let us see how it works. We have mentioned the term “Generative Artificial Intelligence” several times here. It serves as the technology behind the implementation of PandasAI. Generative AI (GenAI) is a subset of artificial intelligence that can produce a wide range of data types, including text, audio, video, pictures, and 3D models. It accomplishes this by identifying patterns in already collected data and exploiting them to create novel and distinctive outputs.
Another thing to note is the use of large language models (LLMs). PandasAI relies on LLMs, which are artificial neural networks (ANNs) with many parameters (tens of millions to billions). The LLM behind PandasAI takes human instructions, tokenizes them, and interprets them. PandasAI has also been designed to work with LangChain models, which makes building LLM applications easier.
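The flow described above can be pictured with a small sketch: describe the data frame in a prompt, ask a model for pandas code, then run that code. This is not PandasAI's actual implementation. The names `build_prompt`, `fake_llm`, and `ask` are hypothetical, and the model is replaced by a stub that returns a fixed snippet so the example runs offline.

```python
import pandas as pd


def build_prompt(df: pd.DataFrame, question: str) -> str:
    """Describe the data frame and the question in plain text for a model."""
    columns = ", ".join(f"{name} ({dtype})" for name, dtype in df.dtypes.items())
    return (
        f"You are given a pandas DataFrame named df with columns: {columns}.\n"
        f"Write one Python expression that answers: {question}"
    )


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM call; always returns the same code."""
    return "df['sepal_length'].mean()"


def ask(df: pd.DataFrame, question: str, llm=fake_llm):
    """Send the prompt to the model and evaluate the expression it returns."""
    code = llm(build_prompt(df, question))
    # Evaluating model-written code is risky; a real tool should sandbox it.
    return eval(code, {"df": df, "pd": pd})


# Tiny made-up sample, not the real iris data.
sample = pd.DataFrame(
    {"sepal_length": [5.0, 6.0, 7.0], "species": ["setosa", "setosa", "virginica"]}
)
answer = ask(sample, "What is the average of sepal_length?")
print(answer)
```

Because the stub ignores the prompt, the sketch only shows the shape of the pipeline; a real model would write different code for each question.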
Now let us see how to use PandasAI. We will look at two approaches: first using LangChain models, and then a direct implementation.
Using LangChain Models
To use LangChain models, you need to install the LangChain package first:
pip install langchain
Then we can instantiate a LangChain object:
from pandasai import PandasAI
from langchain.llms import OpenAI
langchain_llm = OpenAI(openai_api_key="my-openai-api-key")
pandasai = PandasAI(llm=langchain_llm)
Your environment is now ready. PandasAI will automatically take the LangChain LLM and convert it to a PandasAI LLM.
Direct Implementation (Without LangChain)
This article uses the second approach: installing PandasAI without LangChain. At the time of writing, Colab does not come with PandasAI preinstalled as it does with Pandas, so we need to start by installing it.
pip install pandasai
Another vital thing to note is that you need an OpenAI API key to use PandasAI. An API key can be created with an account on the OpenAI platform.
Remember to store the key safely for future use, because the site will not let you copy the key again when you return. I also keep my API key hidden from the public to protect my credits. Do the same!
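One common way to keep a key out of shared notebooks is to read it from an environment variable rather than typing it into the code. The sketch below assumes the key is stored in a variable named `OPENAI_API_KEY`; the helpers `load_api_key` and `mask` are hypothetical names used only for illustration.

```python
import os


def load_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Read the API key from an environment variable instead of the source code."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return key


def mask(key: str) -> str:
    """Show only the last four characters, e.g. for log messages."""
    return "*" * max(len(key) - 4, 0) + key[-4:]
```

The returned value could then be passed as `api_token` to the `OpenAI` wrapper used in this article in place of a hard-coded string.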
Note: With a free OpenAI account, you might not be able to plot graphs with PandasAI conveniently because of a limit of 3 prompts per minute. The limit is there to manage high demand on the system.
Let us continue by importing our dependencies.
import pandas as pd
# PandasAI
from pandasai import PandasAI
# For charts
import seaborn as sns
# iris inbuilt dataset from seaborn
iris = sns.load_dataset('iris')
# Viewing first rows
iris.head()
Next, we import OpenAI from the pandasai package we installed earlier. Be sure to replace INSERT_YOUR_API_KEY_HERE with your API key before running the code, as shown below.
# Sample DataFrame
df = iris
# Instantiating an LLM
from pandasai.llm.openai import OpenAI
# Assigning API key
llm = OpenAI(api_token="INSERT_YOUR_API_KEY_HERE")
# Calling PandasAI
pandas_ai = PandasAI(llm)
Now let us see some text prompts on the iris dataset.
Example 1
Prompt = 'Which is the most common specie?'
# Running PandasAI prompt
pandas_ai.run(df, prompt='Which is the most common specie?')
Oh, the most common specie is actually setosa!
Example 2
Prompt = 'What is the average of sepal_length?'
# Calling PandasAI
pandas_ai = PandasAI(llm)
# Running PandasAI prompt
pandas_ai.run(df, prompt='What is the average of sepal_length?')
The average sepal length of the dataset is 5.84.
Example 3
Prompt = 'What is the average of sepal_width?'
# Calling PandasAI
pandas_ai = PandasAI(llm)
# Running PandasAI prompt
pandas_ai.run(df, prompt='What is the average of sepal_width?')
The average sepal width is 3.0573333333333337.
Example 4
Prompt = 'Which is the most common petal_length?'
# Calling PandasAI
pandas_ai = PandasAI(llm)
# Running PandasAI prompt
pandas_ai.run(df, prompt='Which is the most common petal_length?')
Based on the data provided, the most common petal_length is 1.4.
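Answers like the ones above can be cross-checked with plain pandas. This matters in practice: the real iris data has 50 rows for each species, so "most common" is a tie there and an answer such as the one in Example 1 deserves a second look. The sketch below uses a few made-up rows that mimic the iris columns, so it runs without downloading anything; the numbers it prints belong to this sample, not to the full dataset.

```python
import pandas as pd

# Made-up rows that mimic the iris columns; swap in the real data frame
# to check real answers.
sample = pd.DataFrame(
    {
        "sepal_length": [5.1, 4.9, 6.3, 5.8, 7.1],
        "petal_length": [1.4, 1.4, 6.0, 5.1, 5.9],
        "species": ["setosa", "setosa", "virginica", "virginica", "virginica"],
    }
)

# Most frequent category, column mean, and most frequent numeric value.
most_common_species = sample["species"].value_counts().idxmax()
mean_sepal_length = sample["sepal_length"].mean()
most_common_petal_length = sample["petal_length"].mode().iloc[0]

print(most_common_species)
print(round(mean_sepal_length, 2))
print(most_common_petal_length)
```

Note that `idxmax` and `mode().iloc[0]` each return a single value even when several values are tied, so printing the full `value_counts()` is a good habit when the counts might be equal.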
Yes, it is not only text we can generate! We can also generate plots and graphs using PandasAI.
You will likely encounter a RateLimitError when you start generating plots or graphs with a free API key. One way out is to get a paid plan, which gives you more credit and resources for demanding tasks. If you just want to experiment, or only have access to a free key, you must pace your prompts manually: run a limited number of prompts and leave about 20 seconds between them. The limit exists to share the server between users during high demand.
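The manual pacing described above can also be scripted. The following is a minimal sketch, assuming a fixed pause between calls is enough to stay under the limit; `run_with_pause` is a hypothetical helper, and the 20-second default simply mirrors the interval mentioned in this article.

```python
import time


def run_with_pause(prompts, run_prompt, pause_seconds=20, sleep=time.sleep):
    """Run prompts one by one, waiting between calls to stay under a rate limit."""
    results = []
    for index, prompt in enumerate(prompts):
        if index > 0:
            sleep(pause_seconds)  # no wait needed before the first prompt
        results.append(run_prompt(prompt))
    return results
```

With the objects defined earlier, it could be called as `run_with_pause(prompts, lambda p: pandas_ai.run(df, prompt=p))`. The `sleep` parameter is injectable so the function can be tried out without actually waiting.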
Example 1
Prompt = 'Plot the histogram of the entries'
# Running PandasAI prompt
response = pandas_ai.run(
df,
"Plot the histogram of the entries",
)
print(response)
Sure, here's a histogram of the entries in the dataset. It shows the distribution of values for each variable, including sepal length, sepal width, petal length, petal width, and species. The histogram is a useful way to visualize the data and see any patterns or trends that may exist.
Example 2
Prompt = 'Perform scattered plot of sepal_length and sepal_width'
# Running Pandas AI command
response = pandas_ai.run(
df,
"Perform scattered plot of sepal_length and sepal_width",
)
print(response)
Sure! To create a scattered plot of sepal_length and sepal_width, we can use the data provided in the table. The table includes columns for sepal_length, sepal_width, petal_length, petal_width, and species. We can focus on just the sepal_length and sepal_width columns to create the plot.
Example 3
Prompt = 'Plot a scattered plot of sepal_length and sepal_width for the species'
# Running Pandas AI command
response = pandas_ai.run(
df,
"Plot a scattered plot of sepal_length and sepal_width for the species",
)
print(response)
Sure! To plot a scattered plot of sepal_length and sepal_width for the species, we can use the provided dataset which includes columns for sepal_length, sepal_width, petal_length, petal_width, and species. We'll focus on just the sepal_length and sepal_width columns. Then, we can create a scatter plot with sepal_length on the x-axis and sepal_width on the y-axis. This will allow us to visualize any potential relationship between these two variables for each species in the dataset.
The possibilities keep increasing. You can try your own prompts and see how it goes. The goal is to reap the benefits that come with generative artificial intelligence.
We have seen that, by using large language models to extract insights from datasets, PandasAI can potentially transform data analysis. However, it has limits, and its answers need human verification for accuracy. Learning prompt engineering can help reduce this problem. So, we can conclude by saying PandasAI is Pandas + AI or, more specifically, Pandas + generative AI. All of this is driven by prompts, which let the user interact with the data in a human-to-human way. The prompts are processed with advanced NLP and then tied to the underlying data tasks.
Q. What is prompt engineering?
A. Prompt engineering involves creating context-specific instructions (queries) to produce desired responses from language models. These instructions guide the model and shape its behavior and output.
Q. What is generative AI?
A. Generative artificial intelligence, or generative AI, is an artificial intelligence (AI) system capable of generating text, images, or other media in response to prompts.
Q. What are some examples of prompt engineering?
A. Some examples of prompt engineering in use are AI systems such as PandasAI and ChatGPT.
Q. What are the limitations of generative AI?
A. Although generative AI has achieved a lot recently, it still faces challenges around ethics, control of harmful content, copyright, data privacy, and more.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.