Automate Data Insights with LIDA’s Intelligent Visualization

Mounish V Last Updated : 10 Oct, 2024

Introduction

Language-Integrated Data Analysis (LIDA) is a powerful tool designed to automate visualization creation, enabling the generation of grammar-agnostic visualizations and infographics. LIDA addresses several critical tasks: interpreting data semantics, identifying appropriate visualization goals, and generating detailed visualization specifications. LIDA conceptualizes visualization generation as a multi-step process and uses well-structured pipelines, which integrate large language models (LLMs) and image generation models (IGMs).

Overview

  1. LIDA automates data visualization by combining large language models (LLMs) and image generation models (IGMs) in a multi-stage process, making it easier to create grammar-agnostic visualizations.
  2. LIDA’s key components include data summarization, goal identification, visualization generation, and infographic creation, facilitating comprehensive data analysis workflows.
  3. The platform supports diverse programming languages like Python, R, and C++, allowing users to create visualizations in various formats without being tied to a specific grammar.
  4. LIDA features a hybrid interface, combining direct manipulation with natural language commands to make data visualization accessible to both technical and non-technical users.
  5. Advanced capabilities like visualization repair, recommendations, and explanation are built-in, enhancing data literacy and enabling users to refine visual outputs through automated evaluation.
  6. LIDA aims to democratize data-driven insights, empowering users to transform complex datasets into meaningful visualizations for better decision-making.

Key Features of LIDA

  1. Grammar-Agnostic Visualizations: Whether you’re using Python, R, or C++, LIDA allows you to produce visual outputs without being locked into a specific coding language. This flexibility makes it easier for users coming from different programming backgrounds.
  2. Multi-Stage Generation Pipeline: LIDA orchestrates a workflow that progresses from data summarization to visualization creation, helping users navigate complex datasets.
  3. Hybrid User Interface: The option for direct manipulation and multilingual natural language interfaces makes LIDA accessible to a broader audience, from data scientists to business analysts. Users can interact through natural language commands, making data visualization intuitive and straightforward.

Language-Integrated Data Analysis (LIDA) Architecture

  1. Summarizer: Converts datasets into concise natural language descriptions, including information such as column names and value distributions.
  2. Goal Explorer: Identifies potential visualization or analytical goals based on the dataset. It generates n goals, where n is a parameter chosen by the user.
  3. Viz Generator: Automatically generates code to create visualizations based on the dataset context and specified goals.
  4. Infographer: Creates, evaluates, refines, and executes visualization code to produce fully styled specifications.
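
The first three modules map onto a handful of calls on LIDA’s Manager object. As a rough sketch, the function below chains the same calls used later in this article (summarize, goals, visualize). The name run_pipeline and its parameters are illustrative assumptions, not part of LIDA, and the manager is passed in as an argument so the flow can be followed without an API key.

```python
def run_pipeline(manager, csv_path, n_goals=3, library="seaborn"):
    """Chain summarizer -> goal explorer -> viz generator.

    `manager` is expected to behave like lida.Manager; only the
    summarize, goals, and visualize calls shown in this article are used.
    """
    # Stage 1: condense the dataset into a compact summary.
    summary = manager.summarize(csv_path)
    # Stage 2: ask for n candidate visualization goals.
    goals = manager.goals(summary=summary, n=n_goals)
    # Stage 3: generate candidate charts for each goal.
    charts_per_goal = [
        manager.visualize(summary=summary, goal=goal, library=library)
        for goal in goals
    ]
    return summary, goals, charts_per_goal
```

With a real Manager this would be called as run_pipeline(lida, "heart.csv", n_goals=5), which needs a configured LLM provider.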

Features of LIDA

Feature | Description
Data Summarization | LIDA compacts large datasets into dense natural language summaries, used as grounding for future operations.
Automated Data Exploration | LIDA offers a fully automated mode for generating meaningful visualization goals based on unfamiliar datasets.
Grammar-Agnostic Visualizations | LIDA generates visualizations in any grammar (Altair, Matplotlib, Seaborn in Python, or R, C++, etc.).
Infographics Generation | Converts data into stylized, engaging infographics using image generation models for personalized stories.
VizOps – Operations on Visualizations | Detailed operations on generated visualizations, enhancing accessibility, data literacy, and debugging.
Visualization Explanation | Provides in-depth descriptions of visualization code, aiding in accessibility, education, and sensemaking.
Self-Evaluation | LLMs are used to generate multi-dimensional evaluation scores for visualizations based on best practices.
Visualization Repair | Automatically improves or repairs visualizations using self-evaluation or user-provided feedback.
Visualization Recommendations | Recommends additional visualizations based on context or existing visualizations for comparison or added perspectives.

Installing LIDA

To use LIDA, you’ll need to install LIDA with the following command:

pip install -U lida

We’ll be using llmx to create LLM text generators with support for multiple LLM providers.

!pip install llmx
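
Before moving on, it can help to confirm that both packages are importable in the current environment. Here is a minimal sketch using only the standard library; the helper name missing_packages is an illustrative assumption, not something LIDA or llmx provides.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level package names that cannot be imported."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(["lida", "llmx"])
    if missing:
        print("Not installed:", ", ".join(missing))
    else:
        print("lida and llmx are both available")
```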

LIDA in Action: Heart Disease Prediction

To explore what drives heart disease, let’s analyze the Heart Attack Analysis & Prediction Dataset, which contains 14 columns covering clinical features like age, cholesterol, and chest pain type. We’ll be working with its heart.csv file throughout this guide.

Setting Up the LIDA Web UI

To use LIDA’s web UI, we first need to set up the OpenAI key:

import os
os.environ['OPENAI_API_KEY']='sk-test'
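
Hard-coding a key in a notebook is convenient for a demo but easy to leak. One hedged alternative, assuming the key has already been exported in the shell or notebook environment, is to read it and fail early with a clear message. The helper require_env below is hypothetical and not part of LIDA or llmx.

```python
import os


def require_env(name):
    """Return the value of an environment variable or raise a clear error."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Set {name} before starting LIDA")
    return value
```

Calling require_env("OPENAI_API_KEY") before launching the UI surfaces a missing key immediately instead of at the first model call.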

Now run the following command and open the URL it prints:

!lida ui  --port=8080 --docs

Then click on the live demo button.

Note: You need to set up your OpenAI key to get the web UI running.

Working with Language Models

“gpt-3.5-turbo-0301” is the model that’s selected by default.

You can click on Generation settings to change the LLM provider, model, and other settings.

Visualizing and Gaining Insights with LIDA Using Python

For the rest of this guide, I’ll focus on visualizing and gaining insights with LIDA from Python.

In this demo, I’ll be using the Cohere LLM provider. You can head over to Cohere’s dashboard and get a trial API key to use Cohere’s models.

from llmx import llm
from llmx.datamodel import TextGenerationConfig
import os

# Replace the placeholder with your Cohere API key
os.environ['COHERE_API_KEY']='Your_API_Key'

messages = [
   {"role": "system", "content": "You are a helpful assistant"},
   {"role": "user", "content": "What is osmosis?"}
]
gen = llm(provider="cohere")
config = TextGenerationConfig(model="command-r-plus-08-2024", max_tokens=50)
response = gen.generate(messages, config=config, use_cache=True)
print(response.text[0].content)

Output

Osmosis is a fundamental process in biology and chemistry where a solvent,
typically water, moves across a semipermeable membrane from a region of
lower solute concentration to a region of higher solute concentration, aiming
to equalize the concentrations on both sides

With the text generator working, we pass it to LIDA’s Manager and summarize the dataset:

from lida import Manager, llm
lida = Manager(text_gen = gen) # using the Cohere text generator created above
summary = lida.summarize("heart.csv")
print(summary)

Output

{'name': 'heart.csv', 'file_name': 'heart.csv', 'dataset_description': '',
'fields': [{'column': 'age', 'properties': {'dtype': 'number', 'std': 9,
'min': 29, 'max': 77, 'samples': [46, 66, 48], 'num_unique_values': 41,
'semantic_type': '', 'description': ''}}, {'column': 'sex', 'properties':
{'dtype': 'number', 'std': 0, 'min': 0, 'max': 1, 'samples': [0, 1],
'num_unique_values': 2, 'semantic_type': '', 'description': ''}}, {'column':
'cp', 'properties': {'dtype': 'number', 'std': 1, 'min': 0, 'max': 3,
'samples': [2, 0], 'num_unique_values': 4, 'semantic_type': '',
'description': ''}}, {'column': 'trtbps', 'properties': {'dtype': 'number',
'std': 17, 'min': 94, 'max': 200, 'samples': [104, 123],
'num_unique_values': 49, 'semantic_type': '', 'description': ''}}, {'column':
'chol', 'properties': {'dtype': 'number', 'std': 51, 'min': 126, 'max': 564,
'samples': [277, 169], 'num_unique_values': 152, 'semantic_type': '',
'description': ''}}, {'column': 'fbs', 'properties': {'dtype': 'number',
'std': 0, 'min': 0, 'max': 1, 'samples': [0, 1], 'num_unique_values': 2,
'semantic_type': '', 'description': ''}}, {'column': 'restecg',
'properties': {'dtype': 'number', 'std': 0, 'min': 0, 'max': 2, 'samples':
[0, 1], 'num_unique_values': 3, 'semantic_type': '', 'description': ''}},
{'column': 'thalachh', 'properties': {'dtype': 'number', 'std': 22, 'min':
71, 'max': 202, 'samples': [159, 152], 'num_unique_values': 91,
'semantic_type': '', 'description': ''}}, {'column': 'exng', 'properties':
{'dtype': 'number', 'std': 0, 'min': 0, 'max': 1, 'samples': [1, 0],
'num_unique_values': 2, 'semantic_type': '', 'description': ''}}, {'column':
'oldpeak', 'properties': {'dtype': 'number', 'std': 1.1610750220686343,
'min': 0.0, 'max': 6.2, 'samples': [1.9, 3.0], 'num_unique_values': 40,
'semantic_type': '', 'description': ''}}, {'column': 'slp', 'properties':
{'dtype': 'number', 'std': 0, 'min': 0, 'max': 2, 'samples': [0, 2],
'num_unique_values': 3, 'semantic_type': '', 'description': ''}}, {'column':
'caa', 'properties': {'dtype': 'number', 'std': 1, 'min': 0, 'max': 4,
'samples': [2, 4], 'num_unique_values': 5, 'semantic_type': '',
'description': ''}}, {'column': 'thall', 'properties': {'dtype': 'number',
'std': 0, 'min': 0, 'max': 3, 'samples': [2, 0], 'num_unique_values': 4,
'semantic_type': '', 'description': ''}}, {'column': 'output', 'properties':
{'dtype': 'number', 'std': 0, 'min': 0, 'max': 1, 'samples': [0, 1],
'num_unique_values': 2, 'semantic_type': '', 'description': ''}}],
'field_names': ['age', 'sex', 'cp', 'trtbps', 'chol', 'fbs', 'restecg',
'thalachh', 'exng', 'oldpeak', 'slp', 'caa', 'thall', 'output']}
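
One thing worth noticing in this summary: every column is reported with dtype 'number', even though columns such as sex or cp hold only a handful of distinct codes. As a small sketch, assuming the summary keeps the dictionary layout printed above, we can list such low-cardinality columns ourselves. The helper low_cardinality_columns and the cutoff of 5 are illustrative choices, not part of LIDA.

```python
def low_cardinality_columns(summary, max_unique=5):
    """List columns whose number of distinct values is at most max_unique."""
    return [
        field["column"]
        for field in summary["fields"]
        if field["properties"]["num_unique_values"] <= max_unique
    ]


# A trimmed copy of the summary printed above (four of the fourteen fields).
sample_summary = {
    "fields": [
        {"column": "age", "properties": {"dtype": "number", "num_unique_values": 41}},
        {"column": "sex", "properties": {"dtype": "number", "num_unique_values": 2}},
        {"column": "cp", "properties": {"dtype": "number", "num_unique_values": 4}},
        {"column": "chol", "properties": {"dtype": "number", "num_unique_values": 152}},
    ]
}

print(low_cardinality_columns(sample_summary))  # ['sex', 'cp']
```

Applied to the full summary with the same cutoff, this would pick out sex, cp, fbs, restecg, exng, slp, caa, thall, and output, which are the columns best read as categories rather than measurements.
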

# generate goals (n is the number of goals)
goals = lida.goals(summary=summary, n=5, persona="A data scientist focused on using predictive analytics to improve early detection and prevention of heart disease.")

5 Goals that We Have Generated

Here, n is the number of goals to generate from the summary. Let’s look at the five goals we got:

goals[0]
Goal 0
Question: How does age impact heart disease risk?

Visualization: Scatter plot with 'age' on the x-axis and 'output' (heart
disease presence) as colored data points

Rationale: This visualization will help us understand if there's a
correlation between age and heart disease risk. By plotting age against the
presence of heart disease, we can identify any trends or patterns that may
indicate higher risk at certain ages, aiding in early detection strategies.
goals[1]
Goal 1
Question: Is there a gender disparity in heart disease occurrence?

Visualization: Stacked bar chart comparing the count of 'sex' (gender) with
'output' (heart disease presence)

Rationale: This chart will reveal any gender disparities in heart disease
cases. By comparing the distribution of males and females with and without
heart disease, we can assess if one gender is more susceptible, which is
crucial for targeted prevention efforts.
goals[2]
Goal 2
Question: How does cholesterol level affect heart health?

Visualization: Box plot of 'chol' (cholesterol) grouped by 'output' (heart
disease presence)

Rationale: This plot will illustrate the distribution of cholesterol levels
in individuals with and without heart disease. We can determine if higher
cholesterol is associated with an increased risk of heart disease, providing
insights for preventive measures.
goals[3]
Goal 3
Question: Are there specific chest pain types linked to heart disease?

Visualization: Violin plot of 'cp' (chest pain type) colored by 'output'
 (heart disease presence)

Rationale: This visualization will help us understand if certain types of
 chest pain are more prevalent in heart disease cases. By examining the
 distribution of chest pain types, we can identify patterns that may aid in
 early diagnosis and treatment planning.
goals[4]
Goal 4
Question: How does resting heart rate relate to heart disease?

Visualization: Scatter plot with 'thalachh' (resting heart rate) on the
y-axis and 'output' (heart disease presence) as colored data points

Rationale: This plot will reveal any relationship between resting heart rate
and heart disease. By visualizing the resting heart rate against the
presence of heart disease, we can determine if higher or lower rates are
associated with increased risk, guiding early intervention strategies.
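
Since the goals are free text written by the model, a quick sanity check is to confirm that the columns they mention actually exist. The sketch below assumes column names appear in single quotes, as they do in the goals above, and compares them against the field_names from the summary. The function unknown_columns is a hypothetical helper, and the second goal text is made up for illustration.

```python
import re

# Column names copied from the 'field_names' entry of the summary.
FIELD_NAMES = ["age", "sex", "cp", "trtbps", "chol", "fbs", "restecg",
               "thalachh", "exng", "oldpeak", "slp", "caa", "thall", "output"]


def unknown_columns(visualization_text, field_names=FIELD_NAMES):
    """Return quoted identifiers in the text that are not dataset columns."""
    quoted = re.findall(r"'([A-Za-z_][A-Za-z0-9_]*)'", visualization_text)
    return [name for name in quoted if name not in field_names]


goal_0 = ("Scatter plot with 'age' on the x-axis and 'output' "
          "(heart disease presence) as colored data points")
print(unknown_columns(goal_0))  # []

# Hypothetical goal text that mentions a column the dataset does not have.
made_up = "Histogram of 'bmi' split by 'output'"
print(unknown_columns(made_up))  # ['bmi']
```

A check like this only catches misspelled or invented column names. It cannot tell whether the model has described the meaning of a real column correctly.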

Generating Charts for Each Goal

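Let’s generate a chart for each goal and gain insights from the visualizations. Keep in mind that the goals are unverified model output: Goal 4, for example, labels 'thalachh' as resting heart rate, whereas in this dataset the column records the maximum heart rate achieved.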

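charts = []
for goal in goals:
   # visualize returns a list of candidate charts for the goal
   charts.append(lida.visualize(summary=summary, goal=goal, library="seaborn"))

In a notebook, evaluating the first candidate for each goal displays its chart:

charts[0][0]
[Figure: chart for goal 0]
charts[1][0]
[Figure: chart for goal 1]
charts[2][0]
[Figure: chart for goal 2]
charts[3][0]
[Figure: chart for goal 3]
charts[4][0]
[Figure: chart for goal 4]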
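
Each chart object carries the code that produced it in its code attribute, which the next sections rely on. If you want to review or version that code outside the notebook, one simple option is to write it to disk. This is a sketch under the assumption that charts is the list of lists built above; the helper save_chart_code and the goal_N.py file naming are my own illustrative choices.

```python
from pathlib import Path


def save_chart_code(charts, out_dir):
    """Write the generated code of each goal's first chart to its own file."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    written = []
    for index, candidates in enumerate(charts):
        if not candidates:  # skip goals for which no chart came back
            continue
        target = out_path / f"goal_{index}.py"
        target.write_text(candidates[0].code, encoding="utf-8")
        written.append(target)
    return written
```

Calling save_chart_code(charts, "generated_charts") would then leave one script per goal in that folder.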

lida.edit Function to Suggest Changes in the Chart

Let’s use the lida.edit function to modify a chart with natural language instructions. Here we change the colour and shorten the title of the chart for goal 4.

# modify chart using natural language
instructions = ["change the color to red", "shorten the title"]
edited_charts = lida.edit(code=charts[4][0].code, summary=summary, instructions=instructions, library='seaborn')
[Figure: edited chart for goal 4]

lida.explain Function to Review and Explain the Code

We can also use the lida.explain function to review a chart’s code and get an explanation of what it does (here, for the chart of goal 0):

explanation = lida.explain(code=charts[0][0].code)
print(explanation[0][0]['explanation'])

This code creates a scatter plot using the Seaborn library, with ‘age’ on the x-axis and ‘output’ (heart disease presence) as coloured data points. The legend is added with the title ‘Heart Disease Presence’ to distinguish between the two possible outputs. The plot’s title provides context, asking about the impact of age on heart disease risk.

LIDA can also evaluate the generated code and score it along several dimensions using lida.evaluate:

evaluations = lida.evaluate(code=charts[4][0].code, goal=goals[4], library='seaborn')
print(evaluations[0][0])

Output

{'dimension': 'bugs', 'score': 8, 'rationale': "The code has no syntax errors
and is mostly bug-free. However, there is a potential issue with the variable
'output' in the scatterplot, as it is not defined in the provided code
snippet. Assuming 'output' is a column in the DataFrame, the code should
work as intended, but this could cause confusion or errors if the column name
is not accurate."}
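
The entry above is one dimension of the evaluation. To get a quick overall picture, we can average the scores and flag the weak dimensions. The sketch below assumes each entry is a dictionary with 'dimension' and 'score' keys, as in the printed output. The function summarize_scores, the threshold of 7, and two of the three sample entries are illustrative and not real LIDA results.

```python
def summarize_scores(dimension_scores, threshold=7):
    """Average the scores and list dimensions scoring below threshold."""
    if not dimension_scores:
        return None, []
    scores = [entry["score"] for entry in dimension_scores]
    average = sum(scores) / len(scores)
    weak = [e["dimension"] for e in dimension_scores if e["score"] < threshold]
    return average, weak


# Illustrative input: the 'bugs' score is from the output above; the other
# two entries are made-up placeholders, not real LIDA results.
sample = [
    {"dimension": "bugs", "score": 8},
    {"dimension": "type", "score": 6},
    {"dimension": "aesthetics", "score": 7},
]
average, weak = summarize_scores(sample)
print(average, weak)  # 7.0 ['type']
```

With the real results, this would be called as summarize_scores(evaluations[0]).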

Given the code of an existing chart, we can ask for further visualizations using lida.recommend:

recommendations = lida.recommend(code=charts[1][0].code, summary=summary, n=2)
[Figure: recommended charts]

References and Resources

  1. Official LIDA Documentation: [LIDA Documentation]
  2. GitHub Repository: [Microsoft LIDA GitHub]

Conclusion

LIDA streamlines data visualization by bringing large language models and image generation models into the process. Its multi-stage pipeline simplifies the creation of meaningful, grammar-agnostic visualizations and infographics, making data insights more accessible even for those without extensive programming skills. Combining natural language interfaces with direct manipulation lets both technical and non-technical users turn complex datasets into clear, visually compelling stories. The built-in features for visualization repair, recommendations, and self-evaluation further support data literacy and help users refine visual outputs. Ultimately, LIDA supports better data-driven decision-making by shortening the path from raw data to actionable insights.

Frequently Asked Questions

Q1. What does the Viz Generator do in LIDA?

Ans. The Viz Generator generates code to create visualizations.

Q2. Which programming languages and libraries does LIDA support?

Ans. LIDA is grammar-agnostic, meaning it can generate visualizations in any visualization grammar like Altair, Matplotlib, ggplot or Seaborn in Python, as well as in other programming languages such as R and C++.

Q3. What is a limitation of LIDA?

Ans. One limitation of LIDA is its reliance on the accuracy of large language models and the quality of the data. If the models generate incorrect goals or summaries, it may lead to suboptimal or misleading visualizations.

I'm a tech enthusiast, graduated from Vellore Institute of Technology. I'm working as a Data Science Trainee right now. I am very much interested in Deep Learning and Generative AI.
