Imagine you’re on the brink of developing the next big breakthrough in AI technology, like a state-of-the-art chatbot or an advanced recommendation system. However, the journey from a brilliant prototype to a fully operational, reliable application is filled with hurdles. Enter LangSmith, the game-changer that simplifies this transition. Launched in 2023, LangSmith is transforming the landscape of language model development by providing a robust DevOps platform designed specifically for large language models. In this blog, we’ll walk through a complete LangSmith guide and show how LangSmith can turn your AI aspirations into reality, ensuring your models not only meet but exceed expectations.
LangSmith is a state-of-the-art testing framework designed for the evaluation of language models and AI applications, with a particular emphasis on creating production-grade LLM applications. As a comprehensive platform, LangSmith provides tools that extract valuable insights from model responses, enabling developers to refine their models for improved real-world performance.
LangSmith builds on LangChain: LangChain handles prototyping, while LangSmith focuses on production readiness. LangSmith’s tracing tools are indispensable for debugging and understanding an agent’s execution steps, offering a visual representation of the sequence of calls within a workflow. This facilitates a deeper understanding of the model’s decision-making process, thereby fostering greater confidence in its accuracy.
We will explore examples of each of these capabilities, but let’s first start with an overview of the LangSmith platform and set up the environment for LangSmith.
Below is an overview of LangSmith’s web user interface. Interested users first need to sign up at http://smith.langchain.com/ to use the LangSmith services. Once signed up, the UI will look as shown below. The landing page has two main sections: Projects and Datasets & Testing. Both sections are also navigable via the Python SDK, which we will see in the next section.
Managing projects in LangSmith becomes much easier with its Python SDK, which connects to the platform through an API key. To obtain an API key, click on the key icon in the platform and save it securely. Then, set up a new directory with an initialized virtual environment and create a .env file. Inside this file, add the following lines:
LANGCHAIN_API_KEY="USER-LangSmith-API-key"
OPENAI_API_KEY="USER-OPENAI-key"
Next, open your terminal and execute these commands to install LangSmith and python-dotenv for reading environment variables:
pip install -U langsmith
pip install python-dotenv
Now you can start writing the necessary code. Begin by importing the required libraries and functions to manage environment variables and set them up:
import warnings
import os
import uuid
from dotenv import find_dotenv, load_dotenv
from langsmith import Client
# Suppress warnings
warnings.filterwarnings("ignore")
# Load environment variables
load_dotenv(find_dotenv())
os.environ["LANGCHAIN_API_KEY"] = str(os.getenv("LANGCHAIN_API_KEY"))
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"
# Initialize a client
client = Client()
# Generate a unique project name and create the project
uid = uuid.uuid4()
PROJECT_NAME = "Give_a_Demo_Project_Name<e.g:flashcards-generator->" + str(uid)
session = client.create_project(
project_name=PROJECT_NAME,
description="A project that generates flashcards from user input",
)
Setting LANGCHAIN_TRACING_V2 to true enables tracing (logging), which is essential for debugging LLMs. Once you run the create_project command successfully, you will see the project listed in the Projects section of the LangSmith web UI.
Now that we have seen how to create a project, we can move on to the other aspects of LangSmith. The next steps mainly involve getting access to an LLM and using it for inference or serving. Before that, we will briefly look at how to add observability to, and evaluate, an LLM application; these pieces will be important for our final step, where we examine some realistic use cases.
Observability is crucial for any software application, but it’s particularly vital for LLM applications due to their non-deterministic nature, which can lead to unexpected results and make debugging more challenging. LangSmith provides LLM-native observability, offering meaningful insights throughout all stages of application development, from prototyping to production.
from openai import OpenAI
from langsmith.wrappers import wrap_openai
openai_client = wrap_openai(OpenAI())
def retriever(query: str):
    results = ["Harrison worked at Kensho"]
    return results

def rag(question):
    docs = retriever(question)
    system_message = f"Answer the user's question using only the provided information below:\n{docs}"
    return openai_client.chat.completions.create(
        messages=[
            {"role": "system", "content": system_message},
            {"role": "user", "content": question},
        ],
        model="gpt-3.5-turbo",
    )
In the above code, we used GPT-3.5 Turbo as the LLM, but you can experiment with the LLM of your choice. Now, if you call it with rag("where did Harrison work?"), the OpenAI call trace will be visible in the LangSmith UI, as shown below.
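For example, a quick call could look like the following minimal sketch. It assumes PROJECT_NAME from the setup section is still in scope; setting the LANGCHAIN_PROJECT environment variable routes traces to that project, otherwise they land in the default project.
import os

# Route traces to the project we created earlier (assumption: PROJECT_NAME is still defined).
os.environ["LANGCHAIN_PROJECT"] = PROJECT_NAME

response = rag("where did Harrison work?")
print(response.choices[0].message.content)  # the OpenAI call trace now appears in the LangSmith UI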
Alternatively, you can also use the traceable decorator to trace the entire function, providing comprehensive visibility.
from langsmith import traceable
@traceable
def rag(question):
    docs = retriever(question)
    system_message = f"Answer the user's question using only the provided information below:\n{docs}"
    return openai_client.chat.completions.create(
        messages=[
            {"role": "system", "content": system_message},
            {"role": "user", "content": question},
        ],
        model="gpt-3.5-turbo",
    )
This will produce a trace of the entire pipeline (with the OpenAI call as a child run)—it should look something like the one shown below.
During the beta testing stage of LLM application development, you release your application to a select group of initial users. Establishing robust observability is essential, as it helps you gain insights into how users interact with your application, often revealing unexpected usage patterns. Adjusting your tracing setup to capture this data more effectively is advisable. A critical aspect of observability in beta testing is collecting user feedback, which can be as simple as a thumbs up/down. LangSmith simplifies this process by allowing you to log feedback and easily associate it with the specific runs that generated it.
Collect Feedback: Track user feedback by logging it with a run ID. It can be achieved as shown below.
import uuid
from langsmith import Client
ls_client = Client()
run_id = str(uuid.uuid4())
rag("where did harrison work", langsmith_extra={"run_id": run_id})
ls_client.create_feedback(run_id, key="user-score", score=1.0)
After you log feedback for each run, you can view it in the Metadata tab when inspecting each run.
You can also log important metadata, such as LLM versions, to filter and analyze different runs. In the code below, for instance, we log two pieces of information: the LLM used (via the decorator) and a user ID passed dynamically at runtime.
import uuid

run_id = str(uuid.uuid4())

@traceable(metadata={"llm": "gpt-3.5-turbo"})
def rag(question):
    docs = retriever(question)
    system_message = f"Answer the user's question using only the provided information below:\n{docs}"
    return openai_client.chat.completions.create(
        messages=[
            {"role": "system", "content": system_message},
            {"role": "user", "content": question},
        ],
        model="gpt-3.5-turbo",
    )
Now, if we call the rag function as rag("where did harrison work", langsmith_extra={"run_id": run_id, "metadata": {"user_id": "harrison"}}), both pieces of information should be visible in the UI, as shown below.
You can use LangSmith’s monitoring tools to track application performance, including traces, feedback, and response times. Monitoring charts can be grouped by metadata attributes to facilitate A/B testing and performance comparison. If you click on the Monitor tab of your project, you can see a series of charts; an example is shown below. The output may vary based on your scenario.
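For the grouping to be useful, each run needs a metadata key identifying the variant it belongs to. Below is a minimal sketch using the same langsmith_extra mechanism shown earlier; the "variant" key and the variant names are illustrative assumptions, not part of the original code.
# Tag each request with the variant being served so Monitor charts can be grouped by "variant".
rag(
    "where did harrison work",
    langsmith_extra={"metadata": {"variant": "prompt-v1"}},
)
rag(
    "where did harrison work",
    langsmith_extra={"metadata": {"variant": "prompt-v2"}},
)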
Evaluating an LLM application’s performance against custom, user-defined metrics is a difficult task. However, it is a crucial step in the iterative process of developing the LLM application, allowing increased confidence and improvement during development. Below is how LangSmith allows users to evaluate an LLM easily. These steps serve as a demo, so you should adjust the metrics and other parameters to fit your specific needs.
from langsmith import Client
client = Client()
dataset_name = "QA Example Dataset"
dataset = client.create_dataset(dataset_name)
client.create_examples(
    inputs=[
        {"question": "What is LangChain?"},
        {"question": "What is LangSmith?"},
        {"question": "What is OpenAI?"},
        {"question": "What is Google?"},
        {"question": "What is Mistral?"},
    ],
    outputs=[
        {"answer": "A framework for building LLM applications"},
        {"answer": "A platform for observing and evaluating LLM applications"},
        {"answer": "A company that creates Large Language Models"},
        {"answer": "A technology company known for search"},
        {"answer": "A company that creates Large Language Models"},
    ],
    dataset_id=dataset.id,
)
Below is how it would look in the LangSmith UI under the Datasets & Testing page for the prepared Q&A Example Dataset.
Output:
Use an LLM to judge the correctness of outputs and define custom metrics, such as response length.
from langchain_anthropic import ChatAnthropic
from langchain_core.prompts.prompt import PromptTemplate
from langsmith.evaluation import LangChainStringEvaluator
from langsmith.schemas import Run, Example
def evaluate_length(run: Run, example: Example) -> dict:
    prediction = run.outputs.get("output") or ""
    required = example.outputs.get("answer") or ""
    score = int(len(prediction) < 2 * len(required))
    return {"key": "length", "score": score}
_PROMPT_TEMPLATE = """You are an expert professor specialized in grading students'
answers to questions.
You are grading the following question:
{query}
Here is the real answer:
{answer}
You are grading the following predicted answer:
{result}
Respond with CORRECT or INCORRECT:
Grade:
"""
PROMPT = PromptTemplate(
    input_variables=["query", "answer", "result"], template=_PROMPT_TEMPLATE
)
eval_llm = ChatAnthropic(temperature=0.0)  # requires ANTHROPIC_API_KEY to be set
qa_evaluator = LangChainStringEvaluator("qa", config={"llm": eval_llm, "prompt": PROMPT})
Build and evaluate the application using the defined metrics.
from langsmith.evaluation import evaluate
import openai
def langsmith_app(inputs):
    output = my_app(inputs["question"])
    return {"output": output}

openai_client = openai.Client()

def my_app(question):
    return openai_client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,
        messages=[
            {
                "role": "system",
                "content": "Respond to the user's question in a short, concise manner (one short sentence).",
            },
            {
                "role": "user",
                "content": question,
            },
        ],
    ).choices[0].message.content
experiment_results = evaluate(
    langsmith_app,  # Your AI system
    data=dataset_name,  # The data to predict and grade over
    evaluators=[evaluate_length, qa_evaluator],  # The evaluators to score the results
    experiment_prefix="openai-3.5",  # A prefix for your experiment names to easily identify them
)
Running the above code will provide a link; clicking it opens the LangSmith UI for the evaluations. An instance of the LangSmith UI is shown below.
Output:
We have seen how to evaluate LLMs. LangSmith also allows us to compare results across different LLMs. Users can simply change the model parameter in the app function defined above to use another suitable LLM, and then analyze high-level metrics and detailed comparisons across different models and configurations.
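For instance, a second experiment with a different model can be run against the same dataset so the two appear side by side in LangSmith. The sketch below is illustrative only: the GPT-4 model choice and the helper names are assumptions, not part of the original code.
def my_app_gpt4(question):
    return openai_client.chat.completions.create(
        model="gpt-4",  # swap in any model you have access to
        temperature=0,
        messages=[
            {
                "role": "system",
                "content": "Respond to the user's question in a short, concise manner (one short sentence).",
            },
            {"role": "user", "content": question},
        ],
    ).choices[0].message.content

def langsmith_app_gpt4(inputs):
    return {"output": my_app_gpt4(inputs["question"])}

experiment_results_gpt4 = evaluate(
    langsmith_app_gpt4,
    data=dataset_name,  # same dataset, so the experiments are directly comparable
    evaluators=[evaluate_length, qa_evaluator],
    experiment_prefix="openai-4",  # distinct prefix to tell the experiments apart
)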
The image below shows the comparison across different metrics amongst three different LLMs in the LangSmith UI.
So far, we have seen how to set up the LangSmith environment, enable traceability for LLM calls, and evaluate and compare LLM outputs easily under one dashboard. This concludes our current scope for exploring LangSmith for LLM production. Next, we will explore two realistic case studies that combine these elements under one roof.
In this section, we will combine all the scattered knowledge we have learned about LangSmith and examine it from the perspective of two realistic use cases. First, we will fine-tune a LLaMA model and evaluate and visualize the results using LangSmith. Second, we will develop an automated feedback mechanism for language models using LangSmith. While both use cases require some additional technical knowledge, the focus in the subsections below is solely on the LangSmith perspective.
This use case demonstrates the process of fine-tuning the LLaMA2-7b-chat model for a knowledge graph triple extraction task using a single GPU. LangSmith sources the training data, managing and evaluating datasets on its platform. The notebook leverages HuggingFace for the fine-tuning process and utilizes LangSmith to manage and export training data, as well as to evaluate the fine-tuned model’s performance. This showcases a practical application of integrating LangSmith with HuggingFace for efficient LLM fine-tuning and evaluation. The entire notebook can be found here.
Below, we will highlight the major steps, with a focus on the code snippets related to LangSmith.
env LANGCHAIN_API_KEY=<api-key>
pip install --quiet -U langchain langsmith pandas openai xformers transformers huggingface accelerate==0.21.0 peft==0.4.0 bitsandbytes==0.40.2 transformers==4.31.0 trl==0.4.7
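As noted above, the notebook exports its training data from a LangSmith dataset before fine-tuning. Below is a minimal sketch of such an export; the dataset name and the input/output keys are assumptions for illustration, and the notebook uses its own names.
import pandas as pd
from langsmith import Client

client = Client()

# Hypothetical dataset name holding the triplet-extraction training examples.
train_dataset_name = "kg-triplet-extraction-train"

examples = list(client.list_examples(dataset_name=train_dataset_name))
train_df = pd.DataFrame(
    {
        "sentence": [e.inputs.get("sentence") for e in examples],  # assumed input key
        "triplets": [e.outputs.get("output") for e in examples],   # assumed output key
    }
)
print(train_df.head())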
# model_path points to the fine-tuned checkpoint saved earlier in the notebook
model_loaded = AutoModelForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
pipe_llama7b_chat_ft = pipeline(
    task="text-generation", model=model_loaded, tokenizer=tokenizer, max_length=300, device=1
)
# test_prompt is a prompt string defined earlier in the notebook
result = pipe_llama7b_chat_ft(test_prompt)
print(result)
Output:
Running the above code should produce an output from the fine-tuned model; the output may vary with the data used. A sample of the expected output is shown above.
An overview of the evaluation workflow with a focus on LangSmith output is provided below.
from langsmith import Client
from langchain.smith import RunEvalConfig
from langchain.prompts import PromptTemplate
from langchain.llms import HuggingFacePipeline
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer

client = Client()

# Note that "sentence" is the key in the test dataset
prompt = PromptTemplate.from_template(
    "[INST] <<SYS>>\n{system_message}\n<</SYS>>\n\n### Input:{sentence}\n\n[/INST]\n"
).partial(system_message=system_prompt)

# EvaluateTriplets is a custom run evaluator defined earlier in the notebook
config = RunEvalConfig(
    custom_evaluators=[EvaluateTriplets()],
)

# Chat LLM w/ FT
llama_llm_chat_ft = HuggingFacePipeline(pipeline=pipe_llama7b_chat_ft)
llama_chain_chat_ft = prompt | llama_llm_chat_ft
results = await client.arun_on_dataset(validation_dataset_name, llama_chain_chat_ft, evaluation=config)
Output:
Running the above code in Google Colab will provide a link, as shown above, which opens the LangSmith UI showing the model’s performance under the chosen evaluation strategy.
In this use case, we set up an automated feedback pipeline for language models using LangSmith. It enables tracking and evaluating model performance through automated metrics integrated with LangSmith’s dataset management and evaluation capabilities. This blog doesn’t provide a detailed walkthrough of the code, so readers should be familiar with the associated topics. The entire code is available here. We will focus on the LangSmith aspects.
Here are the main steps outlined in the code:
import os
# Update with your API URL if using a hosted instance of Langsmith.
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"
# Update with your API key
os.environ["LANGCHAIN_API_KEY"] = "YOUR_API_KEY"
# Update with your API URL if using a hosted instance of Langsmith.
os.environ["LANGCHAIN_HUB_API_URL"] = "https://api.hub.langchain.com"
# Update with your Hub API key
os.environ["LANGCHAIN_HUB_API_KEY"] = "YOUR_HUB_API_KEY"
project_name = "YOUR_PROJECT_NAME" # Change to your project name
from langsmith import Client
from datetime import datetime
client = Client()
example_data = [
    ("Who trained Llama-v2?", "I'm sorry, but I don't have that information."),
    (
        "When did langchain first announce the hub?",
        "LangChain first announced the LangChain Hub on September 5, 2023.",
    ),
    (
        "What's LangSmith?",
        "LangSmith is a platform developed by LangChain for building production-grade LLM (Language Model) applications. It allows you to debug, test, evaluate, and monitor chains and intelligent agents built on any LLM framework. LangSmith seamlessly integrates with LangChain's open-source framework called LangChain, which is widely used for building applications with LLMs.\n\nLangSmith provides full visibility into model inputs and outputs at every step in the chain of events, making it easier to debug and analyze the behavior of LLM applications. It has been tested with early design partners and on internal workflows, and it has been found to help teams in various ways.\n\nYou can find more information about LangSmith on the official LangSmith documentation [here](https://docs.smith.langchain.com/). Additionally, you can read about the announcement of LangSmith as a unified platform for debugging and testing LLM applications [here](https://blog.langchain.dev/announcing-langsmith/).",
    ),
    (
        "What is the langsmith cookbook?",
        "I'm sorry, but I couldn't find any information about the \"Langsmith Cookbook\". It's possible that it may not be a well-known cookbook or it may not exist. Could you provide more context or clarify the name?",
    ),
    (
        "What is LangChain?",
        "I'm sorry, but I couldn't find any information about \"LangChain\". Could you please provide more context or clarify your question?",
    ),
    ("When was Llama-v2 released?", "Llama-v2 was released on July 18, 2023."),
]
for input_, output_ in example_data:
    client.create_run(
        name="ExampleRun",
        run_type="chain",
        inputs={"input": input_},
        outputs={"output": output_},
        project_name=project_name,
        end_time=datetime.utcnow(),
    )
This code creates a series of example runs with predefined input-output pairs. Each run is logged using the client.create_run method, associating it with a project for easy management and retrieval.
from langchain import hub

prompt = hub.pull(
    "wfh/automated-feedback-example", api_url="https://api.hub.langchain.com"
)

from langchain_core.output_parsers.openai_functions import JsonOutputFunctionsParser
from langchain_core.tracers.context import collect_runs
from langchain_openai import ChatOpenAI
chain = (
    prompt
    | ChatOpenAI(model="gpt-3.5-turbo", temperature=1).bind(
        functions=[
            {
                "name": "submit_scores",
                "description": "Submit the graded scores for a user question and bot response.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "relevance": {
                            "type": "integer",
                            "minimum": 0,
                            "maximum": 5,
                            "description": "Score indicating the relevance of the question to LangChain/LangSmith.",
                        },
                        "difficulty": {
                            "type": "integer",
                            "minimum": 0,
                            "maximum": 5,
                            "description": "Score indicating the complexity or difficulty of the question.",
                        },
                        "verbosity": {
                            "type": "integer",
                            "minimum": 0,
                            "maximum": 5,
                            "description": "Score indicating how verbose the question is.",
                        },
                        "specificity": {
                            "type": "integer",
                            "minimum": 0,
                            "maximum": 5,
                            "description": "Score indicating how specific the question is.",
                        },
                    },
                    "required": ["relevance", "difficulty", "verbosity", "specificity"],
                },
            }
        ]
    )
    | JsonOutputFunctionsParser()
)
def evaluate_run(run):
    try:
        # Only evaluate runs that have both an input and an output,
        # and skip runs that already carry this feedback.
        if "input" not in run.inputs or not run.outputs or "output" not in run.outputs:
            return
        if run.feedback_stats and "specificity" in run.feedback_stats:
            return
        with collect_runs() as cb:
            result = chain.invoke(
                {
                    "question": run.inputs["input"][:3000],
                    "prediction": run.outputs["output"][:3000],
                },
            )
            for feedback_key, value in result.items():
                score = int(value) / 5
                client.create_feedback(
                    run.id,
                    key=feedback_key,
                    score=score,
                    source_run_id=cb.traced_runs[0].id,
                    feedback_source_type="model",
                )
    except Exception as e:
        pass
This code snippet demonstrates AI-assisted feedback, where an LLM (GPT-3.5-turbo) scores each run’s input based on several metrics (relevance, difficulty, verbosity, and specificity). The scores are logged as feedback using client.create_feedback. The evaluate_run function handles the evaluation logic, and RunnableLambda is used for concurrent processing.
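The batch call below expects a runs collection. A minimal sketch of how it might be fetched, assuming we want to evaluate the example runs logged to the project above (the run_type filter is an illustrative choice):
from langsmith import Client

client = Client()

# Pull the runs logged to the project earlier so they can be batch-evaluated.
runs = list(client.list_runs(project_name=project_name, run_type="chain"))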
from langchain_core.runnables import RunnableLambda
wrapped_function = RunnableLambda(evaluate_run)
_ = wrapped_function.batch(runs, {"max_concurrency": 10}, return_exceptions=True)
Feedback is logged concurrently using the RunnableLambda class to batch-process the runs. This ensures efficient handling of multiple evaluations simultaneously.
feedback_stats = client.read_project(project_name=project_name).feedback_stats
print(feedback_stats)
Aggregate feedback statistics are read from the project, showcasing metrics such as readability indices and other evaluation scores. This provides a comprehensive view of the model’s performance based on the feedback received. Below is an image that might be expected as the final output in the LangSmith UI.
LangSmith UI Output:
LangSmith helps take language models from prototype to production by offering a comprehensive suite of tools and features designed to enhance their capabilities. By utilizing LangSmith’s monitoring, evaluation, debugging, testing, tracing, and observability functions, developers and businesses can significantly improve their models’ performance and reliability. LangSmith’s user-friendly interface and robust API integrations streamline the development process, making it easier to achieve high-quality results. Adopting LangSmith can lead to more efficient model iterations and, ultimately, better user experiences. This article provided a complete LangSmith guide in detail.
You can access code links here:
A. LangSmith provides a comprehensive suite of tools including monitoring, evaluation, debugging, testing, tracing, and observability features. These tools help developers enhance the performance and reliability of their language models throughout the development lifecycle.
A. LangSmith streamlines the development process by offering a user-friendly interface and robust API integrations. It ensures efficient model iterations and faster deployment, crucial for moving from prototype stages to full-scale production.
A. Yes, LangSmith’s advanced debugging tools allow developers to identify and resolve issues quickly. They also provide detailed insights into model performance, enabling precise debugging and optimization.
A. Monitoring and evaluation in LangSmith are essential for continuously assessing model performance in real time. These features help developers track model behavior, detect anomalies, and make data-driven improvements.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.