Generative AI enhances data analytics by creating new data and simplifying tasks like coding and analysis. Large language models (LLMs) such as GPT-3.5 power this by generating SQL and Python code, summarizing text, and producing visualizations from data. Yet limitations persist, such as short context windows and errors in generated code. Future improvements target specialized LLMs, multi-modal abilities, and better user interfaces for streamlined data workflows. Initiatives like TalktoData aim to make data analytics more accessible through user-friendly Generative AI platforms. The goal is to simplify and broaden data analysis for everyone.
Generative AI is a subset of AI that excels at generating content, including text, imagery, audio, video, and synthetic data. Unlike traditional AI models that classify or predict based on predefined parameters, Generative AI creates new content. It operates within the realm of deep learning, distinguishing itself by its ability to produce new data based on the input provided.
A striking difference lies in its capacity to handle unstructured data, eliminating the need to mold data to fit pre-defined parameters. Generative AI has vast potential to understand and infer from the given data, making it a groundbreaking innovation in data analytics.
Generative AI, especially through LLMs such as GPT-4 or GPT-3.5, presents numerous applications in data analytics. One of the most impactful use cases is its ability to generate code for data professionals. LLMs trained on publicly available SQL and Python code can generate new code, significantly aiding data analysis tasks.
These models possess reasoning capabilities, enabling them to extract insights and create correlations within data. Furthermore, they can summarize texts, generate visualizations, and even modify graphs, enhancing the analytical process. They not only perform traditional machine learning tasks like regression and classification but also adapt to analyze datasets directly. This makes data analysis more intuitive and efficient.
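As a minimal illustration of this code-generation ability, here is a sketch using the legacy openai (pre-1.0) Python SDK that the walkthrough below also relies on; the question, prompt wording, and column names are illustrative assumptions, not part of the original demo.
# Sketch: ask GPT-3.5 to write pandas code for a plain-language question
# (hypothetical prompt; assumes openai.api_key has been set as shown later)
import openai
prompt = (
    "You are a data analyst. Write pandas code that answers the question "
    "'What is the survival rate by passenger class?' for a DataFrame named df "
    "with columns 'Survived' and 'Pclass'."
)
completion = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
    temperature=0,
)
print(completion.choices[0].message["content"])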
In practice, using LLMs for data analytics involves libraries such as the OpenAI API (GPT-3.5 here) and LlamaIndex to perform data analysis on both CSV files and SQL databases.
# Import OpenAI and set the API key
import os
import openai
from IPython.display import Markdown, display
os.environ["OPENAI_API_KEY"] = 'sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
openai.api_key = os.environ["OPENAI_API_KEY"]
# Import Pandas and the Pandas Query Engine from LlamaIndex
import pandas as pd
from llama_index.query_engine import PandasQueryEngine
# Load a sample CSV file (Titanic dataset)
df = pd.read_csv("titanic.csv")
df.head(5)
Output:
The primary significance lies in the inherent capability of LLMs to generate code from natural language queries, enabling users to seek insights from their data seamlessly. For instance, loading a CSV file into a Pandas query engine allows users to ask questions in plain language, like ‘How many passengers survived?’. The LLM generates the corresponding code and returns an accurate result.
# Create a query engine over the DataFrame (verbose=True prints the generated pandas code)
pd_query_engine = PandasQueryEngine(df=df, verbose=True)
response = pd_query_engine.query(
    "How many passengers survived in total?",
)
display(Markdown(f"<b>{response}</b>"))
Output:
response = pd_query_engine.query(
    "What is the average, maximum and minimum age of male and female population?",
)
display(Markdown(f"<b>{response}</b>"))
Output:
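To see the pandas code the model produced for a query, recent LlamaIndex releases expose it in the response metadata; the key name "pandas_instruction_str" used below is an assumption and may differ across versions.
# Inspect the pandas code generated for the previous query
# (key name is version-dependent; adjust if your llama_index release differs)
print(response.metadata.get("pandas_instruction_str"))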
This seamless interaction extends to SQL databases, where the LLM generates SQL queries based on the metadata provided, allowing complex inquiries like retrieving top-selling albums from specific countries.
Metadata plays a pivotal role in effectively utilizing LLMs for data analysis. Within SQL databases, metadata provides crucial information regarding tables, primary keys, foreign keys, column names, and their respective data types. This metadata acts as a guide for LLMs, allowing them to understand the database structure and generate SQL queries based on these pre-defined parameters.
# Load a SQL database
from sqlalchemy import create_engine, MetaData
# Sample Database
# https://www.sqlitetutorial.net/sqlite-sample-database/
engine = create_engine("sqlite:///Chinook.db")
metadata_obj = MetaData()
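# (Optional sketch, not in the original demo) Reflect the schema to see the metadata
# the LLM will rely on: table names, columns, primary keys, and foreign keys
metadata_obj.reflect(bind=engine)
print(list(metadata_obj.tables.keys()))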
# Let's use the SQL query engine from LlamaIndex
from llama_index import SQLDatabase
sql_database = SQLDatabase(engine)
# Create the query engine
from llama_index.indices.struct_store import NLSQLTableQueryEngine
query_engine = NLSQLTableQueryEngine(
    sql_database=sql_database
)
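# (Optional, hedged sketch) Some llama_index versions accept a 'tables' argument to
# restrict the engine to specific tables so only their metadata is sent to the LLM;
# table names depend on the Chinook variant you downloaded, e.g.:
# query_engine = NLSQLTableQueryEngine(
#     sql_database=sql_database, tables=["albums", "artists"]
# )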
query_str = (
    "What are all the tables in the database?"
)
response = query_engine.query(query_str)
print(response)
Output:
response = query_engine.query("Give me first 5 rows of Album table")
print(response)
Output:
However, limitations exist, such as short context restrictions, potential errors in code generation, and computational overhead. More advanced LLMs like GPT-4 are clearly needed to improve context understanding and the accuracy of generated SQL. Moreover, the future lies in making these AI systems more user-friendly, intuitive, and capable of handling diverse data analysis workflows, potentially revolutionizing how businesses and users interact with analytical tools.
Large language models, especially GPT-3.5, offer a tangible glimpse into the potential of Generative AI in real-world applications. In a practical demonstration using a Colab notebook, it’s evident how LLMs can be used to analyze CSV files and SQL databases, simplifying the data analytics process for common use cases.
By loading a sample CSV file and a public SQL database, these LLMs showcased their ability to generate answers to questions about the data. They exhibited proficiency in interpreting user queries, understanding table structures, and providing accurate responses. However, certain limitations and drawbacks come to light in using LLMs.
LLMs, despite their immense capabilities, are not without limitations. Their primary constraints include short context windows, high error rates, computational overhead, and the lack of an intuitive interface for end users. Providing a large volume of data can overflow the context window, and error rates, especially in general-purpose LLMs, can reach up to 40%.
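A simple guard against context overflow is to count tokens before sending data to the model. The sketch below assumes the tiktoken package is installed and a roughly 4k-token context window for gpt-3.5-turbo.
# Estimate prompt size before sending it to the model (sketch; assumes tiktoken is installed)
import tiktoken
encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")
prompt = df.to_csv(index=False)  # serializing a whole DataFrame quickly gets large
num_tokens = len(encoding.encode(prompt))
print(f"Prompt size: {num_tokens} tokens")
if num_tokens > 4000:  # rough budget for a 4k-token context window
    print("Likely to overflow the context window; sample or summarize the data first.")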
Additionally, the lack of an intuitive user interface limits widespread adoption, especially among business users who may not be comfortable with APIs or coding interfaces. Addressing these limitations requires new solutions and advancements.
The challenges with Generative AI, specifically LLMs, have highlighted the need for refined models and improved methodologies to overcome the existing limitations. Short-context issues, higher error rates, computational overhead, and the lack of intuitive user interfaces call for innovative solutions to optimize LLM performance in data analytics.
The future of Generative AI in data analytics holds promising developments. Enhancements in LLM capabilities, such as GPT-4 and newer models, aim to resolve current limitations. The focus on fine-tuning LLMs for SQL and integrating multi-modal capabilities for text, voice, and image inputs is set to revolutionize data analytics workflows.
Moreover, introducing UI/UX-driven end-user applications will democratize the usage of Generative AI in data analytics, enabling a broader audience to leverage its power.
Addressing the drawbacks of Generative AI requires innovative approaches. At TalktoData, we’re working on a solution tailored to simplify data analytics. The platform offers an intuitive user interface designed specifically for data analytics workflows, catering to the complexities of handling various data sources, including SQL databases and diverse file formats.
The groundbreaking feature of creating a dedicated Jupyter sandbox instance for each query allows users to interact with the platform and receive insights, with code generated and executed within an isolated environment. This eliminates the complexity of the traditional data analytics workflow, simplifying the process and enabling seamless interactions.
The TalktoData solution is poised to revolutionize how data analytics tasks are performed. By combining the power of Generative AI with an intuitive and user-friendly interface, the platform seeks to bridge the gap between the complexities of data analytics and a more user-centric approach. With the ability to simplify interactions, generate code, and execute analytical processes, this solution aims to empower data professionals across industries.
Generative AI, notably LLMs like GPT-3.5, is transforming data analytics, not only by creating new data but also by streamlining complex analysis tasks. While these models exhibit immense potential to revolutionize the field, they have significant limitations, which create the need for improved models and more user-friendly interfaces.
The future of Generative AI in data analytics lies in refining models like GPT-4, adding multi-modal capabilities, and improving user experiences. Initiatives like TalktoData signal a shift toward more accessible data analytics for all, highlighting the pursuit of simpler, broader, user-centric data analysis. As the technology continues to evolve, addressing these challenges will lead to more inclusive, intuitive, and powerful applications of Generative AI in data analytics.
Frequently Asked Questions

Q1. What limitations do LLMs face in data analytics?
Ans. LLMs face constraints with short contexts, high error rates, computational overhead, and lack intuitive interfaces, hampering efficient usage.

Q2. How do LLMs simplify data analysis?
Ans. LLMs, exemplified by GPT-3.5, simplify data analysis by generating code, summarizing texts, and interpreting user queries about the data, easing common data tasks.

Q3. How can the drawbacks of Generative AI in data analytics be addressed?
Ans. Solutions entail refining LLMs, enhancing user interfaces, and developing specialized models, exemplified by TalktoData’s user-centric platform for seamless data analytics.
Vinod Varma is a seasoned data professional with a rich background in data science and analytics. As the Co-Founder of Sager AI since February 2022, he has been instrumental in shaping the company’s vision and driving its growth. Sager AI specializes in the intersection of Generative AI and Data, offering innovative solutions that leverage cutting-edge technologies. Vinod’s extensive experience includes roles as a Data Scientist at HRS Group in Cologne, Germany, where he contributed to data-driven strategies.
DataHour Page: https://community.analyticsvidhya.com/c/datahour/unleashing-generative-ai-in-data-analytics