LIDA is a powerful tool designed to automate visualization creation, enabling the generation of grammar-agnostic visualizations and infographics. LIDA addresses several critical tasks: interpreting data semantics, identifying appropriate visualization goals, and generating detailed visualization specifications. LIDA conceptualizes visualization generation as a multi-step process and uses well-structured pipelines, which integrate large language models (LLMs) and image generation models (IGMs).
| Feature | Description |
| --- | --- |
| Data Summarization | LIDA compacts large datasets into dense natural language summaries, used as grounding for future operations. |
| Automated Data Exploration | LIDA offers a fully automated mode for generating meaningful visualization goals based on unfamiliar datasets. |
| Grammar-Agnostic Visualizations | LIDA generates visualizations in any grammar (Altair, Matplotlib, Seaborn in Python, or R, C++, etc.). |
| Infographics Generation | Converts data into stylized, engaging infographics using image generation models for personalized stories. |
| VizOps – Operations on Visualizations | Detailed operations on generated visualizations, enhancing accessibility, data literacy, and debugging. |
| Visualization Explanation | Provides in-depth descriptions of visualization code, aiding in accessibility, education, and sensemaking. |
| Self-Evaluation | LLMs are used to generate multi-dimensional evaluation scores for visualizations based on best practices. |
| Visualization Repair | Automatically improves or repairs visualizations using self-evaluation or user-provided feedback. |
| Visualization Recommendations | Recommends additional visualizations based on context or existing visualizations for comparison or added perspectives. |
To get started, install LIDA with the following command:
pip install -U lida
We’ll be using llmx to create LLM text generators with support for multiple LLM providers.
!pip install llmx
As a working example, we’ll analyze the Heart Attack Analysis & Prediction Dataset, which has 14 columns: clinical features such as age, cholesterol, and chest pain type, plus an output column indicating heart disease presence. We’ll be working with heart.csv from this dataset throughout the guide.
To use LIDA’s web UI, we first need to set the OpenAI key:
import os
os.environ['OPENAI_API_KEY'] = 'sk-test'  # placeholder; replace with your own key
Now run this command and open the URL it prints:
!lida ui --port=8080 --docs
Click on the live demo button:
Note: The web UI will not run unless your OpenAI key is set.
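Rather than hardcoding the key in a notebook cell, one option is to read it from the environment and fail early with a clear message when it is missing. The sketch below is a minimal illustration of that idea; the helper name `require_env` is hypothetical and is not part of LIDA or llmx.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if it is unset or empty.

    Hypothetical helper for illustration; not part of LIDA or llmx.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set. Export it in your shell before launching the UI."
        )
    return value
```

Calling `require_env("OPENAI_API_KEY")` before starting the UI surfaces a missing key immediately instead of partway through a session.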
“gpt-3.5-turbo-0301” is the model that’s selected by default.
You can click on Generation settings to modify the LLM provider, model and other settings.
For the rest of this guide, I’ll focus on using LIDA from Python to create visualizations and gain insights.
In this demo, I’ll be using the Cohere LLM provider. You can head over to Cohere’s dashboard and get a trial API key to use models from Cohere.
from llmx import llm
from llmx.datamodel import TextGenerationConfig
import os
os.environ['COHERE_API_KEY']='Your_API_Key'
messages = [
{"role": "system", "content": "You are a helpful assistant"},
{"role": "user", "content": "What is osmosis?"}
]
gen = llm(provider="cohere")
config = TextGenerationConfig(model="command-r-plus-08-2024", max_tokens=50)
response = gen.generate(messages, config=config, use_cache=True)
print(response.text[0].content)
Osmosis is a fundamental process in biology and chemistry where a solvent,
typically water, moves across a semipermeable membrane from a region of
lower solute concentration to a region of higher solute concentration, aiming
to equalize the concentrations on both sides
from lida import Manager, llm
lida = Manager(text_gen = gen)
summary = lida.summarize("heart.csv")
print(summary)
{'name': 'heart.csv', 'file_name': 'heart.csv', 'dataset_description': '',
'fields': [{'column': 'age', 'properties': {'dtype': 'number', 'std': 9,
'min': 29, 'max': 77, 'samples': [46, 66, 48], 'num_unique_values': 41,
'semantic_type': '', 'description': ''}}, {'column': 'sex', 'properties':
{'dtype': 'number', 'std': 0, 'min': 0, 'max': 1, 'samples': [0, 1],
'num_unique_values': 2, 'semantic_type': '', 'description': ''}}, {'column':
'cp', 'properties': {'dtype': 'number', 'std': 1, 'min': 0, 'max': 3,
'samples': [2, 0], 'num_unique_values': 4, 'semantic_type': '',
'description': ''}}, {'column': 'trtbps', 'properties': {'dtype': 'number',
'std': 17, 'min': 94, 'max': 200, 'samples': [104, 123],
'num_unique_values': 49, 'semantic_type': '', 'description': ''}}, {'column':
'chol', 'properties': {'dtype': 'number', 'std': 51, 'min': 126, 'max': 564,
'samples': [277, 169], 'num_unique_values': 152, 'semantic_type': '',
'description': ''}}, {'column': 'fbs', 'properties': {'dtype': 'number',
'std': 0, 'min': 0, 'max': 1, 'samples': [0, 1], 'num_unique_values': 2,
'semantic_type': '', 'description': ''}}, {'column': 'restecg',
'properties': {'dtype': 'number', 'std': 0, 'min': 0, 'max': 2, 'samples':
[0, 1], 'num_unique_values': 3, 'semantic_type': '', 'description': ''}},
{'column': 'thalachh', 'properties': {'dtype': 'number', 'std': 22, 'min':
71, 'max': 202, 'samples': [159, 152], 'num_unique_values': 91,
'semantic_type': '', 'description': ''}}, {'column': 'exng', 'properties':
{'dtype': 'number', 'std': 0, 'min': 0, 'max': 1, 'samples': [1, 0],
'num_unique_values': 2, 'semantic_type': '', 'description': ''}}, {'column':
'oldpeak', 'properties': {'dtype': 'number', 'std': 1.1610750220686343,
'min': 0.0, 'max': 6.2, 'samples': [1.9, 3.0], 'num_unique_values': 40,
'semantic_type': '', 'description': ''}}, {'column': 'slp', 'properties':
{'dtype': 'number', 'std': 0, 'min': 0, 'max': 2, 'samples': [0, 2],
'num_unique_values': 3, 'semantic_type': '', 'description': ''}}, {'column':
'caa', 'properties': {'dtype': 'number', 'std': 1, 'min': 0, 'max': 4,
'samples': [2, 4], 'num_unique_values': 5, 'semantic_type': '',
'description': ''}}, {'column': 'thall', 'properties': {'dtype': 'number',
'std': 0, 'min': 0, 'max': 3, 'samples': [2, 0], 'num_unique_values': 4,
'semantic_type': '', 'description': ''}}, {'column': 'output', 'properties':
{'dtype': 'number', 'std': 0, 'min': 0, 'max': 1, 'samples': [0, 1],
'num_unique_values': 2, 'semantic_type': '', 'description': ''}}],
'field_names': ['age', 'sex', 'cp', 'trtbps', 'chol', 'fbs', 'restecg',
'thalachh', 'exng', 'oldpeak', 'slp', 'caa', 'thall', 'output']}
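The summary above reports every column as dtype 'number' and leaves semantic_type empty, even though columns such as sex and cp have only a handful of distinct values and are really coded categories. A minimal sketch of how the summary dictionary could be inspected to flag such columns follows. It uses a trimmed copy of the structure printed above (four of the 14 fields); the helper name `low_cardinality_columns` and the threshold of 5 are assumptions made for illustration.

```python
from typing import List

# Trimmed copy of the summary structure printed above (four of the 14 fields).
summary_excerpt = {
    "name": "heart.csv",
    "fields": [
        {"column": "age", "properties": {"dtype": "number", "num_unique_values": 41}},
        {"column": "sex", "properties": {"dtype": "number", "num_unique_values": 2}},
        {"column": "cp", "properties": {"dtype": "number", "num_unique_values": 4}},
        {"column": "chol", "properties": {"dtype": "number", "num_unique_values": 152}},
    ],
}


def low_cardinality_columns(summary: dict, max_unique: int = 5) -> List[str]:
    """Return the columns whose number of unique values is at most `max_unique`."""
    return [
        field["column"]
        for field in summary["fields"]
        if field["properties"]["num_unique_values"] <= max_unique
    ]


print(low_cardinality_columns(summary_excerpt))
```

Columns flagged this way are good candidates for grouping or colouring in a chart rather than for a continuous axis.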
goals = lida.goals(summary=summary, n=5, persona="A data scientist focused on using predictive analytics to improve early detection and prevention of heart disease.") # generate goals (n is no. of goals)
Here, n is the number of goals to generate from the summary, and persona describes the user the goals should be tailored to. Let’s look at the five goals we generated:
goals[0]
Goal 0
Question: How does age impact heart disease risk?
Visualization: Scatter plot with 'age' on the x-axis and 'output' (heart
disease presence) as colored data points
Rationale: This visualization will help us understand if there's a
correlation between age and heart disease risk. By plotting age against the
presence of heart disease, we can identify any trends or patterns that may
indicate higher risk at certain ages, aiding in early detection strategies.
goals[1]
Goal 1
Question: Is there a gender disparity in heart disease occurrence?
Visualization: Stacked bar chart comparing the count of 'sex' (gender) with
'output' (heart disease presence)
Rationale: This chart will reveal any gender disparities in heart disease
cases. By comparing the distribution of males and females with and without
heart disease, we can assess if one gender is more susceptible, which is
crucial for targeted prevention efforts.
goals[2]
Goal 2
Question: How does cholesterol level affect heart health?
Visualization: Box plot of 'chol' (cholesterol) grouped by 'output' (heart
disease presence)
Rationale: This plot will illustrate the distribution of cholesterol levels
in individuals with and without heart disease. We can determine if higher
cholesterol is associated with an increased risk of heart disease, providing
insights for preventive measures.
goals[3]
Goal 3
Question: Are there specific chest pain types linked to heart disease?
Visualization: Violin plot of 'cp' (chest pain type) colored by 'output'
(heart disease presence)
Rationale: This visualization will help us understand if certain types of
chest pain are more prevalent in heart disease cases. By examining the
distribution of chest pain types, we can identify patterns that may aid in
early diagnosis and treatment planning.
goals[4]
Goal 4
Question: How does resting heart rate relate to heart disease?
Visualization: Scatter plot with 'thalachh' (resting heart rate) on the y-
axis and 'output' (heart disease presence) as colored data points
Rationale: This plot will reveal any relationship between resting heart rate
and heart disease. By visualizing the resting heart rate against the
presence of heart disease, we can determine if higher or lower rates are
associated with increased risk, guiding early intervention strategies.
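Instead of inspecting goals one cell at a time, they can be listed in a single loop. The sketch below assumes each goal exposes `question`, `visualization`, and `rationale` attributes, matching the three fields printed above. `GoalStub` is a hypothetical stand-in so the example runs without an API key; with real results you would pass the list returned by lida.goals instead.

```python
from dataclasses import dataclass


@dataclass
class GoalStub:
    """Hypothetical stand-in for a goal, mirroring the fields printed above."""
    question: str
    visualization: str
    rationale: str = ""


def format_goals(goals) -> str:
    """Build a compact, numbered overview of goal questions and chart types."""
    lines = []
    for i, goal in enumerate(goals):
        lines.append(f"[{i}] {goal.question}")
        lines.append(f"    -> {goal.visualization}")
    return "\n".join(lines)


# Two of the goals shown above, copied into the stand-in class.
demo_goals = [
    GoalStub(
        question="How does age impact heart disease risk?",
        visualization="Scatter plot with 'age' on the x-axis and 'output' as colored data points",
    ),
    GoalStub(
        question="Is there a gender disparity in heart disease occurrence?",
        visualization="Stacked bar chart comparing the count of 'sex' with 'output'",
    ),
]

print(format_goals(demo_goals))
```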
Let’s generate charts for each goal and gain insights from the visualizations.
charts = []
for i in range(5):
    charts.append(lida.visualize(summary=summary, goal=goals[i], library="seaborn"))
charts[0][0]
charts[1][0]
charts[2][0]
charts[3][0]
charts[4][0]
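Indexing with `[0]` assumes that every call returned at least one chart. Since the charts come from LLM-generated code, it is plausible that a call yields an empty list when that code fails to run; treat this as an assumption to verify against your LIDA version. A defensive way to pick the first candidate is sketched below, with plain strings standing in for chart objects and a hypothetical helper named `first_or_none`.

```python
from typing import Optional, Sequence, TypeVar

T = TypeVar("T")


def first_or_none(candidates: Sequence[T]) -> Optional[T]:
    """Return the first candidate, or None when the sequence is empty."""
    return candidates[0] if candidates else None


# Stand-in data: strings play the role of chart objects, and the empty list
# models a goal for which no chart was produced.
demo_charts = [["chart for goal 0"], [], ["chart for goal 2"]]

for i, candidates in enumerate(demo_charts):
    chart = first_or_none(candidates)
    if chart is None:
        print(f"goal {i}: no chart returned")
    else:
        print(f"goal {i}: {chart}")
```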
The lida.edit function modifies an existing chart based on natural language instructions. Let’s change the colour and shorten the title of the plot for goal 4.
# modify chart using natural language
instructions = ["change the color to red", "shorten the title"]
edited_charts = lida.edit(code=charts[4][0].code, summary=summary, instructions=instructions, library='seaborn')
We can also use the lida.explain function to get a natural language explanation of the generated code (here, for the chart of goal 0):
explanation = lida.explain(code=charts[0][0].code)
print(explanation[0][0]['explanation'])
This code creates a scatter plot using the Seaborn library, with ‘age’ on the x-axis and ‘output’ (heart disease presence) as coloured data points. The legend is added with the title ‘Heart Disease Presence’ to distinguish between the two possible outputs. The plot’s title provides context, asking about the impact of age on heart disease risk.
LIDA can also evaluate the generated code and score it along several dimensions using lida.evaluate:
evaluations = lida.evaluate(code=charts[4][0].code, goal=goals[4], library='seaborn')
print(evaluations[0][0])
{'dimension': 'bugs', 'score': 8, 'rationale': "The code has no syntax errors
and is mostly bug-free. However, there is a potential issue with the variable
'output' in the scatterplot, as it is not defined in the provided code
snippet. Assuming 'output' is a column in the DataFrame, the code should
work as intended, but this could cause confusion or errors if the column name
is not accurate."}
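Each evaluation entry is a dictionary with a dimension, a score, and a rationale, so the scores are easy to aggregate. The sketch below assumes that `evaluations[0]` is a list of such dictionaries, as the indexing above suggests. Only the 'bugs' score comes from the run in this guide; the other two entries are made-up placeholders, and the rationale field is omitted for brevity.

```python
from statistics import mean

# Entries in the shape printed above. Only the 'bugs' score is taken from this
# guide's output; the other two are illustrative placeholders.
demo_evaluation = [
    {"dimension": "bugs", "score": 8},
    {"dimension": "aesthetics", "score": 6},
    {"dimension": "encoding", "score": 7},
]


def summarize_scores(evaluation):
    """Return (average score, name of the lowest-scoring dimension)."""
    average = mean(entry["score"] for entry in evaluation)
    weakest = min(evaluation, key=lambda entry: entry["score"])
    return average, weakest["dimension"]


average, weakest = summarize_scores(demo_evaluation)
print(f"average score: {average}, weakest dimension: {weakest}")
```

The weakest dimension is a natural starting point when writing instructions for lida.edit.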
Given the code for an existing chart, lida.recommend suggests additional visualizations:
recommendations = lida.recommend(code=charts[1][0].code, summary=summary, n=2)
LIDA automates much of the data visualization workflow by integrating large language models into each stage. Its multi-stage pipeline simplifies the creation of meaningful, grammar-agnostic visualizations and infographics, making data insights more accessible even to those without extensive programming skills. Combining natural language interfaces with direct manipulation lets both technical and non-technical users turn complex datasets into clear visual stories. Built-in features for visualization repair, recommendation, and self-evaluation help users refine visual outputs and build data literacy. Together, these capabilities support better data-driven decision-making by streamlining the path from raw data to actionable insights.
Ans. The Viz Generator generates code to create visualizations.
Ans. LIDA is grammar-agnostic, meaning it can generate visualizations in any visualization grammar like Altair, Matplotlib, ggplot or Seaborn in Python, as well as in other programming languages such as R and C++.
Ans. One limitation of LIDA is its reliance on the accuracy of large language models and the quality of the data. If the models generate incorrect goals or summaries, it may lead to suboptimal or misleading visualizations.