Gen AI Powered Data Insight Generation using LIDA

Ankit Bajaj Last Updated : 18 Jun, 2024

Introduction

This article introduces LIDA, an open-source Python library for generating data visualizations and infographics. We will first look at how LIDA works and what its core capabilities are, and then see it in action by building a Streamlit application that lets the user explore an uploaded CSV dataset and surface insights automatically through generated visualizations.

Learning Objectives

Understand the challenges associated with manual data exploration and analysis.
Explore LIDA architecture and core building blocks.
Learn to build a fully functional Streamlit application for automated data exploration and insight generation.

This article was published as a part of the Data Science Blogathon.

Challenges with Manual Data Exploration
What is LIDA and How Does it Work?
Building Application for Automatic Insight Generation
Frequently Asked Questions

Challenges with Manual Data Exploration

Manual data exploration is a labor-intensive process that demands significant time and effort to clean, analyze, and visualize data. Analysts often have to sift through large datasets, which increases the likelihood of human error and of overlooked patterns or insights. The manual approach can also be inconsistent: it relies heavily on the skills and expertise of the individual analyst, which makes it difficult to reproduce results or to scale the process to larger datasets.

Automating data exploration speeds up analysis and can make the resulting insights more consistent and comprehensive. Automation tools like LIDA streamline data visualization and insight generation, allowing users to focus on decision-making and strategic planning.

What is LIDA and How Does it Work?

LIDA is an open-source Python library for generating data visualizations and infographics. It is grammar agnostic, so it can work with any programming language, and it supports multiple visualization libraries such as matplotlib and seaborn.

LIDA consists of four key modules that work in sequence to generate data visualizations and infographics automatically:

Summarizer

Function: Converts datasets into a rich but compact natural language representation (context).
Process: Uses rules and large language models (LLMs) to analyze the dataset.
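To make the idea of a compact dataset context more concrete, here is a minimal rule-based sketch that uses only the Python standard library. It is not LIDA's Summarizer: the function name summarize_csv and the output format are assumptions made for illustration, and the LLM enrichment step is left out entirely.

```python
import csv
import io


def summarize_csv(text: str) -> str:
    """Build a compact, human-readable description of each column in a CSV string."""
    rows = list(csv.DictReader(io.StringIO(text)))
    if not rows:
        return "empty dataset"
    parts = []
    for name in rows[0]:
        # Ignore missing cells so they do not distort the column type
        values = [row[name] for row in rows if row[name] not in (None, "")]
        try:
            numbers = [float(v) for v in values]
        except ValueError:
            numbers = None
        if numbers:
            parts.append(f"{name}: number, min {min(numbers):g}, max {max(numbers):g}")
        else:
            parts.append(f"{name}: string, {len(set(values))} unique values")
    return f"{len(rows)} rows; " + "; ".join(parts)


sample = "name,mpg\nCivic,33\nAccord,30\nCivic,35\n"
print(summarize_csv(sample))
# prints: 3 rows; name: string, 2 unique values; mpg: number, min 30, max 35
```

A description like this is far shorter than the raw data, which is what makes it practical to pass to an LLM as context.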

Goal Explorer

Function: Generates a set of potential “goals” based on the dataset context.
Process: Utilizes LLMs to interpret the context and suggest relevant visualization goals.

Viz Generator

Function: Generates, evaluates, repairs, filters, and executes visualization code to meet specified goals.
Process: Leverages LLMs to create visualization code in the appropriate programming language or grammar.

Infographer

Function: Generates stylized infographics based on the visualization and style prompts.
Process: Applies image generation models (IGMs) to transform visualizations into styled infographics.

Now that we are familiar with the building blocks of LIDA and their respective functions, let’s see how these blocks work together in a single workflow:

Dataset Input: The user provides a CSV dataset (e.g., Cars.csv).
Summarization: The Summarizer processes the dataset and generates a natural language context.
Goal Exploration: The Goal Explorer uses the context to suggest possible visualization goals.
Visualization Generation: The Viz Generator creates and executes code to produce visualizations based on the selected goals.
Infographic Creation: The Infographer transforms these visualizations into styled infographics according to user-defined prompts.
Output Delivery: The system outputs a natural language summary, suggested goals, visualization code, and the final stylized infographics.

This integrated approach streamlines the process of data exploration, visualization, and infographic creation, making it efficient and user-friendly.

Building Application for Automatic Insight Generation

Now that we have a fair idea of LIDA and how it works, let’s roll up our sleeves and build a Streamlit application that accepts a CSV dataset as input and uses LIDA to generate data visualizations automatically.

Step 1: Install Python Libraries

First things first, let’s install the Python libraries our application needs. We will create a requirements.txt file with the following set of libraries:

Python Library	Description/Use case
uvicorn	A lightning-fast ASGI server for running Python web applications
streamlit	An open-source app framework for creating and sharing beautiful, custom web apps
pandas	A powerful data manipulation and analysis library providing data structures like DataFrames
lida	A toolkit for generating data visualizations and data-faithful infographics
python-dotenv	Reads key-value pairs from a .env file and loads them as environment variables

Then install all the libraries by running the command “pip install -r requirements.txt”.

Step 2: Integrating LIDA with LLM

Next, we need to integrate LIDA with an LLM that will summarize the dataset, create goals, and finally generate and execute the visualization code. LIDA is flexible here and works with multiple large language model providers, including OpenAI, Azure OpenAI, PaLM, Cohere, and Huggingface. For our application we will use OpenAI's GPT-3.5 Turbo model, which requires an OpenAI API key.

To generate an API key, first create an OpenAI account or sign in. Next, navigate to the API key page and click “Create new secret key”, optionally naming the key. Save the key somewhere safe and do not share it with anyone.

Once you have the API key, create a .env file in the project folder and save the key there under the name OPENAI_API_KEY, which is the variable that app.py reads.
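A .env file is a plain-text file with one NAME=value pair per line, so in this case it would hold a single line of the form OPENAI_API_KEY=your-key. A missing or empty key typically only shows up later as an authentication error, so it can help to check for it up front. The helper below is a hypothetical addition, not part of the original application: the name get_openai_key and its error message are assumptions.

```python
import os


def get_openai_key(env=os.environ) -> str:
    """Return the OpenAI API key, or raise a clear error if it is not set."""
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set; add it to your .env file")
    return key


# A plain dict stands in for the environment here, so no real key is needed
print(get_openai_key({"OPENAI_API_KEY": " sk-example "}))
# prints: sk-example
```

In the application, such a check would be called after load_dotenv() has run, using the default os.environ.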

Step 3: Streamlit Application Logic

Finally, we will create the app.py file containing the Streamlit application logic and LIDA API call.

import streamlit as st
import pandas as pd
from lida import Manager, TextGenerationConfig , llm  
from PIL import Image
from io import BytesIO
import base64
from dotenv import load_dotenv
import os
import openai

# Configuring the OpenAI API Key
load_dotenv()
openai.api_key = os.getenv('OPENAI_API_KEY')

# To convert charts into images, so that they can be displayed on the Streamlit front-end
def base64_to_image(base64_string):
    # Decode the base64 string
    byte_data = base64.b64decode(base64_string)
    # Use BytesIO to convert the byte data to image
    return Image.open(BytesIO(byte_data))

# Streamlit App Code
st.set_page_config(
    page_title="Automatic Insights and Visualization App",
    page_icon="🤖",
    layout="centered",
    initial_sidebar_state="expanded",
)

st.header("Automatic Insights and Visualization 🤖")


menu = st.sidebar.selectbox("Choose an Option", ["Automatic Insights"])

if menu == "Automatic Insights":
    st.subheader("Generate Automatic Insights")
    # Upload CSV dataset as input
    uploaded_file = st.file_uploader("Choose a csv file")
    if uploaded_file is not None:
        dataframe = pd.read_csv(uploaded_file)
        st.write(dataframe)
        btn = st.button("Generate Suggestions", type = "primary")

        if btn:
            # Set up LIDA with OpenAI as the text-generation backend
            lida = Manager(text_gen=llm("openai"))
            textgen_config = TextGenerationConfig(n=1,
                                                  temperature=0.5,
                                                  model="gpt-3.5-turbo",
                                                  use_cache=True)
            # Summarize the dataset, then ask LIDA for visualization goals
            summary = lida.summarize(dataframe,
                                     summary_method="default",
                                     textgen_config=textgen_config)
            goals = lida.goals(summary, n=5, textgen_config=textgen_config)
            if not goals:
                st.warning("No goals were generated for this dataset.")
                st.stop()

            library = "seaborn"
            imgs = []
            # Lower temperature for more deterministic chart code
            textgen_config = TextGenerationConfig(n=1, temperature=0.2, use_cache=True)
            # Create the corresponding data visualization for each goal
            for goal in goals:
                charts = lida.visualize(summary=summary,
                                        goal=goal,
                                        textgen_config=textgen_config,
                                        library=library)
                # The list can be empty if the generated code failed to run
                imgs.append(base64_to_image(charts[0].raster) if charts else None)

            # One tab per goal, so the app also works if fewer than five goals come back
            tabs = st.tabs([f"Goal {n}" for n in range(1, len(goals) + 1)])
            for n, (tab, goal, img) in enumerate(zip(tabs, goals, imgs), start=1):
                with tab:
                    st.header(f"Goal {n}")
                    st.write(goal.question)
                    if img is not None:
                        st.image(img)
                    else:
                        st.warning("No chart could be generated for this goal.")

Once all the files are ready, run the Streamlit application with the command “streamlit run app.py”.
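The base64_to_image helper in app.py assumes that the string it receives decodes to valid image data. As a rough illustration of what that decoding step involves, the sketch below decodes a base64 string with the standard library and checks for the PNG file signature. The function name decode_raster is hypothetical, and the assumption that the chart raster is PNG-encoded is not verified against LIDA here.

```python
import base64

# Every PNG file starts with these eight bytes
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def decode_raster(base64_string: str) -> bytes:
    """Decode a base64 string and check that the result starts like a PNG file."""
    data = base64.b64decode(base64_string)
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("decoded data is not a PNG image")
    return data


# Stand-in payload: a PNG signature followed by placeholder bytes, not a real chart
fake_raster = base64.b64encode(PNG_SIGNATURE + b"placeholder").decode("ascii")
print(len(decode_raster(fake_raster)))
# prints: 19
```

A check like this gives a clearer error than letting the image library fail on unexpected data.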

Conclusion

We explored the challenges associated with manual data exploration and how tools like LIDA streamline the process by providing a flexible, automated solution for data exploration and insight generation. We also looked at the LIDA system architecture and its core capabilities. Lastly, we saw LIDA in action by building an automatic insight generation application using Streamlit.

Here is the link for the video depicting the final application and its working.

Key Takeaways

Whether you’re working with Matplotlib or Seaborn, Python or any other programming language, LIDA fits right into your workflow.
Leverage the latest language models to generate intelligent insights and recommendations for your data.
No steep learning curve here. LIDA is designed to be intuitive and easy to use, so you can focus on what matters to you and the business: making data-driven decisions.
Automating data exploration accelerates the analysis process, ensuring more accurate and comprehensive insights.

Frequently Asked Questions

Q1. Which LLM providers does LIDA support?

A. LIDA supports multiple large language model providers, including OpenAI, Azure OpenAI, PaLM, Cohere, and Huggingface.

Q2. Is an API key required to work with LIDA?

A. LIDA itself is an open-source library that you install and run locally, so it does not require an API key. You may, however, need an API key for the LLM you use with it. For example, you will need an OpenAI API key to use a model like GPT-3.5 Turbo.

Q3. Does LIDA support query-based visualization generation?

A. Yes. Instead of relying on LIDA for goal generation, a user can explicitly provide the query or goal and generate the desired chart. LIDA also supports multilingual input.
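As a rough sketch of the query-based flow, the helper below passes a user-written question as the goal instead of a generated one. The function name chart_for_query is hypothetical, and FakeManager is a stand-in that lets the example run without an API key. With LIDA installed, manager would be the Manager(text_gen=llm("openai")) object from app.py and data the uploaded DataFrame.

```python
def chart_for_query(manager, data, query, library="seaborn"):
    """Summarize the data, then request a chart for a user-written question."""
    summary = manager.summarize(data)
    charts = manager.visualize(summary=summary, goal=query, library=library)
    # Return the first chart, or None if nothing could be generated
    return charts[0] if charts else None


class FakeManager:
    """Stand-in used only to exercise the function without LIDA or an API key."""

    def summarize(self, data):
        return {"rows": len(data)}

    def visualize(self, summary, goal, library):
        return [f"{library} chart for: {goal}"]


print(chart_for_query(FakeManager(), [1, 2, 3], "Average price by car type?"))
# prints: seaborn chart for: Average price by car type?
```

Passing the manager in as an argument keeps the helper easy to test and avoids creating a new LLM client on every call.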

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

Ankit Bajaj

As a dedicated Data Analyst with three years of experience, I am passionate about harnessing the power of data to drive informed business decisions. My expertise spans a diverse array of tools and technologies, including Google Analytics (GA), Google Tag Manager (GTM), Python, SQL, Power BI, and the Azure cloud platform. These skills enable me to effectively collect, analyze, and interpret complex data sets, transforming raw data into actionable insights.
