This article introduces LIDA, an open-source Python library for generating detailed data visualizations and appealing infographics. We will first look at how LIDA works and what its core capabilities are, and then see it in action by building a Streamlit application that lets the user explore a CSV dataset and surface valuable insights automatically through generated data visualizations.
This article was published as a part of the Data Science Blogathon.
Manual data exploration is a labor-intensive process that demands significant time and effort to clean, analyze, and visualize data. Analysts often have to sift through large datasets, which increases the likelihood of human error and of overlooked patterns or insights. The manual approach can also be inconsistent: it relies heavily on the skills and expertise of the individual analyst, which makes it difficult to reproduce results or to scale the process to larger datasets.
Automating data exploration accelerates the analysis process and helps produce more accurate and comprehensive insights. Automation tools like LIDA streamline data visualization and insight generation, allowing users to focus on decision-making and strategic planning.
LIDA is an open-source Python library for generating data visualizations and infographics. LIDA is grammar agnostic: it can work with any programming language and supports multiple visualization libraries such as matplotlib and seaborn.
LIDA consists of the following 4 key modules that work together in a sequence to generate automatic data visualizations and infographics:
Now that we are familiar with the building blocks of LIDA and their respective functions, let’s see how these blocks integrate and work together in a single workflow:
This integrated approach streamlines the process of data exploration, visualization, and infographic creation, making it efficient and user-friendly.
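The workflow can be sketched as a single function. This is a minimal sketch, not LIDA's internal implementation: `explore_dataset` is a hypothetical helper name, and `manager` is assumed to expose the `summarize`, `goals`, and `visualize` methods of LIDA's `Manager`, the same calls used in the application built later in this article.

```python
from typing import Any, List, Tuple


def explore_dataset(manager: Any, data: Any, textgen_config: Any,
                    n_goals: int = 5, library: str = "seaborn"
                    ) -> List[Tuple[Any, list]]:
    """Run summarize -> goals -> visualize and pair each goal with its charts.

    `manager` is assumed to behave like `lida.Manager`; `explore_dataset`
    itself is a hypothetical helper and not part of LIDA.
    """
    # Step 1: condense the dataset into a compact summary for the LLM.
    summary = manager.summarize(data, textgen_config=textgen_config)
    # Step 2: ask the LLM for visualization goals based on that summary.
    goals = manager.goals(summary, n=n_goals, textgen_config=textgen_config)
    # Step 3: generate and execute chart code for every goal.
    results = []
    for goal in goals:
        charts = manager.visualize(summary=summary, goal=goal,
                                   textgen_config=textgen_config,
                                   library=library)
        results.append((goal, charts))
    return results
```

With LIDA installed, `manager` would be `Manager(text_gen=llm("openai"))` and `textgen_config` a `TextGenerationConfig` instance, as in the application code. Taking the manager as a parameter keeps the three steps easy to follow and to try out with a stand-in object.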
Now that we have a fair idea of LIDA and how it functions, let’s roll up our sleeves and build a Streamlit application that accepts a CSV dataset as input and then uses LIDA to generate data visualizations automatically.
First things first, let’s install the Python libraries our application requires. We will create a requirements.txt file listing the following libraries:
Python Library | Description/Use case |
uvicorn | A lightning-fast ASGI server for running Python web applications |
streamlit | An open-source app framework for creating and sharing beautiful, custom web apps |
pandas | A powerful data manipulation and analysis library providing data structures like DataFrames |
lida | A toolkit for generating data visualizations and data-faithful infographics |
python-dotenv | Reads key-value pairs from a .env file and sets them as environment variables |
Then install all the libraries by running the command “pip install -r requirements.txt”.
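After installation, a quick check can confirm that every library is importable. The snippet below is a minimal sketch using only the standard library; `REQUIRED_MODULES` and `missing_packages` are hypothetical names introduced for illustration.

```python
from importlib.util import find_spec
from typing import Dict, List

# Maps each pip package from requirements.txt to the module name it installs.
# Note that python-dotenv is imported as `dotenv`.
REQUIRED_MODULES: Dict[str, str] = {
    "uvicorn": "uvicorn",
    "streamlit": "streamlit",
    "pandas": "pandas",
    "lida": "lida",
    "python-dotenv": "dotenv",
}


def missing_packages(required: Dict[str, str]) -> List[str]:
    """Return the pip names whose modules cannot be found in this environment."""
    return [pip_name for pip_name, module in required.items()
            if find_spec(module) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED_MODULES)
    print("Missing packages:", ", ".join(missing) if missing else "none")
```

If the script reports missing packages, rerun the install command inside the same virtual environment that will run the Streamlit app.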
Next, we need to integrate LIDA with an LLM, which will be used to summarize the dataset, create goals, and finally generate and execute visualization code. LIDA is highly flexible and integrates smoothly with multiple large language model providers, including OpenAI, Azure OpenAI, PaLM, Cohere, and Hugging Face. For our application, we will use the GPT-3.5 Turbo model by OpenAI, for which we need an OpenAI API key.
To generate an API key, first create an OpenAI account or sign in. Next, navigate to the API keys page and click “Create new secret key”, optionally naming the key. Make sure to save the key somewhere safe and do not share it with anyone.
Once you have the API key, create a .env file and save the key in it.
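A missing or misnamed key is a common cause of startup errors, so it can help to validate it explicitly. The sketch below assumes the .env file contains a line of the form `OPENAI_API_KEY=<your key>`, matching the variable name read by the application code; `get_openai_key` is a hypothetical helper name.

```python
import os
from typing import Mapping, Optional


def get_openai_key(environ: Optional[Mapping[str, str]] = None) -> str:
    """Return the OpenAI key from the environment, or raise a clear error.

    `get_openai_key` is a hypothetical helper. It expects the variable name
    OPENAI_API_KEY, i.e. a .env line of the form: OPENAI_API_KEY=<your key>
    """
    env = os.environ if environ is None else environ
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set. Add it to your .env file and call "
            "load_dotenv() before starting the app."
        )
    return key
```

In the app, this would be called right after `load_dotenv()`, so that a missing key fails early with a readable message instead of surfacing later as an API error.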
Finally, we will create the app.py file containing the Streamlit application logic and LIDA API call.
import streamlit as st
import pandas as pd
from lida import Manager, TextGenerationConfig, llm
from PIL import Image
from io import BytesIO
import base64
from dotenv import load_dotenv
import os
import openai

# Configure the OpenAI API key (loaded from the .env file)
load_dotenv()
openai.api_key = os.getenv("OPENAI_API_KEY")


# Convert a base64-encoded chart into an image that can be displayed on the Streamlit front end
def base64_to_image(base64_string):
    # Decode the base64 string
    byte_data = base64.b64decode(base64_string)
    # Use BytesIO to convert the byte data to an image
    return Image.open(BytesIO(byte_data))


# Streamlit app code
st.set_page_config(
    page_title="Automatic Insights and Visualization App",
    page_icon="🤖",
    layout="centered",
    initial_sidebar_state="expanded",
)
st.header("Automatic Insights and Visualization 🤖")
menu = st.sidebar.selectbox("Choose an Option", ["Automatic Insights"])

if menu == "Automatic Insights":
    st.subheader("Generate Automatic Insights")
    # Upload CSV dataset as input
    uploaded_file = st.file_uploader("Choose a csv file")
    if uploaded_file is not None:
        dataframe = pd.read_csv(uploaded_file)
        st.write(dataframe)
        btn = st.button("Generate Suggestions", type="primary")
        if btn:
            # Summarize the dataset and generate goals using LIDA
            lida = Manager(text_gen=llm("openai"))
            textgen_config = TextGenerationConfig(
                n=1,
                temperature=0.5,
                model="gpt-3.5-turbo-0301",
                use_cache=True,
            )
            summary = lida.summarize(
                dataframe,
                summary_method="default",
                textgen_config=textgen_config,
            )
            goals = lida.goals(summary, n=5, textgen_config=textgen_config)

            library = "seaborn"
            imgs = []
            # Use a lower temperature when generating chart code
            textgen_config = TextGenerationConfig(n=1, temperature=0.2, use_cache=True)
            # Create the corresponding data visualization for each goal
            for goal in goals:
                charts = lida.visualize(
                    summary=summary,
                    goal=goal,
                    textgen_config=textgen_config,
                    library=library,
                )
                img_base64_string = charts[0].raster
                imgs.append(base64_to_image(img_base64_string))

            # Show each goal and its chart in a separate tab
            tabs = st.tabs([f"Goal {n}" for n in range(1, len(goals) + 1)])
            for n, (tab, goal, img) in enumerate(zip(tabs, goals, imgs), start=1):
                with tab:
                    st.header(f"Goal {n}")
                    st.write(goal.question)
                    st.image(img)
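The application code reads `charts[0].raster` directly, which raises an error if no chart was produced for a goal, for example when the generated plotting code fails to execute. A more defensive variant could look like the sketch below. It assumes, as the code above does, that each chart exposes a base64-encoded `raster` attribute; `first_chart_bytes` is a hypothetical helper name.

```python
import base64
from typing import Any, Optional, Sequence


def first_chart_bytes(charts: Sequence[Any]) -> Optional[bytes]:
    """Return the decoded image bytes of the first usable chart, or None.

    Assumes each chart exposes a base64-encoded `raster` attribute;
    `first_chart_bytes` is a hypothetical helper, not part of LIDA.
    """
    for chart in charts:
        raster = getattr(chart, "raster", None)
        if raster:
            # Decode the base64 string into raw image bytes.
            return base64.b64decode(raster)
    return None
```

When the function returns `None`, the app can show a short message for that goal instead of crashing; otherwise the bytes can be wrapped in `BytesIO` and opened with `Image.open`, just as `base64_to_image` does.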
Once all the files are ready, you can run the Streamlit application using the command “streamlit run app.py”.
We explored the challenges associated with manual data exploration and how tools like LIDA streamline the process by providing a flexible, fully automatic solution for data exploration and insight generation. We also covered the LIDA system architecture and its core capabilities. Lastly, we saw LIDA in action by building an automatic insight generation application using Streamlit.
Here is the link to the video showing the final application and how it works.
A. LIDA supports multiple large language model providers, including OpenAI, Azure OpenAI, PaLM, Cohere, and Hugging Face.
A. LIDA itself is an open-source library that you install and run locally, so it does not require an API key. However, you may need an API key for the LLM you use with LIDA. For example, you will need an OpenAI API key if you are using a model like GPT-3.5 Turbo.
A. Instead of relying on LIDA for goal generation, a user can explicitly provide a query or goal and generate the desired chart. LIDA also supports multilingual input.
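As an illustration of a user-provided goal, the sketch below passes a plain-text question to `visualize` in place of a generated goal. It assumes `manager` behaves like LIDA's `Manager`; `chart_for_query` is a hypothetical helper name.

```python
from typing import Any


def chart_for_query(manager: Any, summary: Any, query: str,
                    textgen_config: Any, library: str = "seaborn") -> list:
    """Generate charts for a goal written by the user instead of by LIDA.

    `chart_for_query` is a hypothetical helper; `manager` is assumed to
    behave like `lida.Manager`, whose `visualize` method receives the
    user's question as the `goal` argument.
    """
    query = query.strip()
    if not query:
        raise ValueError("Please enter a question to visualize.")
    return manager.visualize(summary=summary, goal=query,
                             textgen_config=textgen_config, library=library)
```

In the Streamlit app, the question could come from a text input field, with `summary` produced by `lida.summarize` as before.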
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.