Imagine you’re on the brink of developing the next big breakthrough in AI technology, like a state-of-the-art chatbot or an advanced recommendation system. However, the journey from a brilliant prototype to a fully operational, reliable application is filled with hurdles. Enter LangSmith, the game-changer that simplifies this transition. Launched in 2023, LangSmith is transforming the landscape of language model development by providing a robust DevOps platform designed specifically for large language models. In this blog, we’ll walk through a complete LangSmith guide and show how LangSmith can turn your AI aspirations into reality, ensuring your models not only meet but exceed expectations.
LangSmith is a state-of-the-art testing framework designed for the evaluation of language models and AI applications, with a particular emphasis on creating production-grade LLM applications. As a comprehensive platform, LangSmith provides tools that extract valuable insights from model responses, enabling developers to refine their models for improved real-world performance.
LangSmith builds on LangChain: LangChain handles prototyping, while LangSmith focuses on production readiness. LangSmith’s tracing tools are indispensable for debugging and understanding an agent’s execution steps, offering a visual representation of the sequence of calls within a workflow. This facilitates a deeper understanding of the model’s decision-making process, thereby fostering greater confidence in its accuracy.
We will explore examples of each of these capabilities, but let’s first start with an overview of the LangSmith platform and set up the environment for LangSmith.
Below is an overview of LangSmith’s web user interface. Interested users first need to sign up at http://smith.langchain.com/ to use the LangSmith services. Once signed up, the UI will look as shown below. The landing page has two main sections: Projects and Datasets & Testing. Both sections are also navigable via the Python SDK, which we will see in the next section.
Managing projects in LangSmith becomes much easier with its Python SDK, which connects to the platform through an API key. To obtain an API key, click on the key icon in the platform and save it securely. Then, set up a new directory with an initialized virtual environment and create a .env file. Inside this file, add the following lines:
LANGCHAIN_API_KEY="USER-LangSmith-API-key"
OPENAI_API_KEY="USER-OPENAI-key"
Next, open your terminal and execute these commands to install LangSmith and python-dotenv for reading environment variables:
pip install -U langsmith
pip install python-dotenv
Now you can start writing the necessary code. Begin by importing the required libraries and functions to manage environment variables and set them up:
import warnings
import os
import uuid
from dotenv import find_dotenv, load_dotenv
from langsmith import Client
# Suppress warnings
warnings.filterwarnings("ignore")
# Load environment variables
load_dotenv(find_dotenv())
os.environ["LANGCHAIN_API_KEY"] = str(os.getenv("LANGCHAIN_API_KEY"))
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"
# Initialize a client
client = Client()
# Generate a unique project name and create the project
uid = uuid.uuid4()
PROJECT_NAME = "Give_a_Demo_Project_Name<e.g:flashcards-generator->" + str(uid)
session = client.create_project(
project_name=PROJECT_NAME,
description="A project that generates flashcards from user input",
)
Setting LANGCHAIN_TRACING_V2 to true enables tracing (logging), which is essential for debugging LLMs. Once you run the create_project command successfully, you will see the project listed in the Projects section of the LangSmith web UI.
Now that we have seen how to create a project, we can move on to the other aspects of LangSmith. The next steps mainly involve getting access to an LLM and using it for inference or serving. Before that, we will briefly look at how to add observability to, and evaluate, an LLM application; these pieces will be important for our final step, where we examine some realistic use cases.
Observability is crucial for any software application, but it’s particularly vital for LLM applications due to their non-deterministic nature, which can lead to unexpected results and make debugging more challenging. LangSmith provides LLM-native observability, offering meaningful insights throughout all stages of application development, from prototyping to production.
from openai import OpenAI
from langsmith.wrappers import wrap_openai
openai_client = wrap_openai(OpenAI())
def retriever(query: str):
    results = ["Harrison worked at Kensho"]
    return results

def rag(question):
    docs = retriever(question)
    system_message = f"Answer the user's question using only the provided information below:\n{docs}"
    return openai_client.chat.completions.create(
        messages=[
            {"role": "system", "content": system_message},
            {"role": "user", "content": question},
        ],
        model="gpt-3.5-turbo",
    )
In the above code, we used GPT-3.5 Turbo as the LLM, but you can experiment with the LLM of your choice. Now, if you call it with rag("where did Harrison work?"), the OpenAI call trace will be visible in the LangSmith UI, as shown below.
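For example, a quick call could look like the following minimal sketch. It assumes PROJECT_NAME from the setup section is still in scope; setting the LANGCHAIN_PROJECT environment variable routes traces to that project, otherwise they land in the default project.
import os

# Route traces to the project we created earlier (assumption: PROJECT_NAME is still defined).
os.environ["LANGCHAIN_PROJECT"] = PROJECT_NAME

response = rag("where did Harrison work?")
print(response.choices[0].message.content)  # the OpenAI call trace now appears in the LangSmith UI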
Alternatively, you can also use the traceable decorator to trace the entire function, providing comprehensive visibility.
from langsmith import traceable
@traceable
def rag(question):
    docs = retriever(question)
    system_message = f"Answer the user's question using only the provided information below:\n{docs}"
    return openai_client.chat.completions.create(
        messages=[
            {"role": "system", "content": system_message},
            {"role": "user", "content": question},
        ],
        model="gpt-3.5-turbo",
    )
This will produce a trace of the entire pipeline (with the OpenAI call as a child run)—it should look something like the one shown below.
During the beta testing stage of LLM application development, you release your application to a select group of initial users. Establishing robust observability is essential, as it helps you gain insights into how users interact with your application, often revealing unexpected usage patterns. Adjusting your tracing setup to capture this data more effectively is advisable. A critical aspect of observability in beta testing is collecting user feedback, which can be as simple as a thumbs up/down. LangSmith simplifies this process by allowing you to log feedback and easily associate it with the specific runs that generated it.
Collect Feedback: Track user feedback by logging it with a run ID. It can be achieved as shown below.
import uuid
from langsmith import Client
ls_client = Client()
run_id = str(uuid.uuid4())
rag("where did harrison work", langsmith_extra={"run_id": run_id})
ls_client.create_feedback(run_id, key="user-score", score=1.0)
After you log feedback for each run, you can view it in the Metadata tab when inspecting each run.
You can also log important metadata, such as LLM versions, to filter and analyze different runs. In the code below, for instance, we log two pieces of information: the LLM used (via the decorator) and a user ID passed dynamically at runtime.
import uuid

run_id = str(uuid.uuid4())

@traceable(metadata={"llm": "gpt-3.5-turbo"})
def rag(question):
    docs = retriever(question)
    system_message = f"Answer the user's question using only the provided information below:\n{docs}"
    return openai_client.chat.completions.create(
        messages=[
            {"role": "system", "content": system_message},
            {"role": "user", "content": question},
        ],
        model="gpt-3.5-turbo",
    )
Now, if we call the rag function as rag("where did harrison work", langsmith_extra={"run_id": run_id, "metadata": {"user_id": "harrison"}}), both pieces of information should be visible in the UI, as shown below.
You can use LangSmith’s monitoring tools to track application performance, including traces, feedback, and response times. Monitoring charts can be grouped by metadata attributes to facilitate A/B testing and performance comparison. If you click on the Monitor tab of your project, you can see a series of charts; an example is shown below. The output may vary based on your scenario.
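For the grouping to be useful, each run needs a metadata key identifying the variant it belongs to. Below is a minimal sketch using the same langsmith_extra mechanism shown earlier; the "variant" key and the variant names are illustrative assumptions, not part of the original code.
# Tag each request with the variant being served so Monitor charts can be grouped by "variant".
rag(
    "where did harrison work",
    langsmith_extra={"metadata": {"variant": "prompt-v1"}},
)
rag(
    "where did harrison work",
    langsmith_extra={"metadata": {"variant": "prompt-v2"}},
)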
Evaluating an LLM application’s performance against custom, user-defined metrics is a difficult task. However, it is a crucial step in the iterative process of developing the LLM application, allowing increased confidence and improvement during development. Below is how LangSmith allows users to evaluate an LLM easily. These steps serve as a demo, so you should adjust the metrics and other parameters to fit your specific needs.
from langsmith import Client
client = Client()
dataset_name = "QA Example Dataset"
dataset = client.create_dataset(dataset_name)
client.create_examples(
    inputs=[
        {"question": "What is LangChain?"},
        {"question": "What is LangSmith?"},
        {"question": "What is OpenAI?"},
        {"question": "What is Google?"},
        {"question": "What is Mistral?"},
    ],
    outputs=[
        {"answer": "A framework for building LLM applications"},
        {"answer": "A platform for observing and evaluating LLM applications"},
        {"answer": "A company that creates Large Language Models"},
        {"answer": "A technology company known for search"},
        {"answer": "A company that creates Large Language Models"},
    ],
    dataset_id=dataset.id,
)
Below is how it would look in the LangSmith UI under the Datasets & Testing page for the prepared Q&A Example Dataset.
Output:
Use an LLM to judge the correctness of outputs and define custom metrics, such as response length.
from langchain_anthropic import ChatAnthropic
from langchain_core.prompts.prompt import PromptTemplate
from langsmith.evaluation import LangChainStringEvaluator
from langsmith.schemas import Run, Example
def evaluate_length(run: Run, example: Example) -> dict:
    prediction = run.outputs.get("output") or ""
    required = example.outputs.get("answer") or ""
    score = int(len(prediction) < 2 * len(required))
    return {"key": "length", "score": score}
_PROMPT_TEMPLATE = """You are an expert professor specialized in grading students'
answers to questions.
You are grading the following question:
{query}
Here is the real answer:
{answer}
You are grading the following predicted answer:
{result}
Respond with CORRECT or INCORRECT:
Grade:
"""
PROMPT = PromptTemplate(
    input_variables=["query", "answer", "result"], template=_PROMPT_TEMPLATE
)
eval_llm = ChatAnthropic(temperature=0.0)  # requires ANTHROPIC_API_KEY to be set
qa_evaluator = LangChainStringEvaluator("qa", config={"llm": eval_llm, "prompt": PROMPT})
Build and evaluate the application using the defined metrics.
from langsmith.evaluation import evaluate
import openai
def langsmith_app(inputs):
    output = my_app(inputs["question"])
    return {"output": output}

openai_client = openai.Client()

def my_app(question):
    return openai_client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,
        messages=[
            {
                "role": "system",
                "content": "Respond to the user's question in a short, concise manner (one short sentence).",
            },
            {
                "role": "user",
                "content": question,
            },
        ],
    ).choices[0].message.content
experiment_results = evaluate(
    langsmith_app,  # Your AI system
    data=dataset_name,  # The data to predict and grade over
    evaluators=[evaluate_length, qa_evaluator],  # The evaluators to score the results
    experiment_prefix="openai-3.5",  # A prefix for your experiment names to easily identify them
)
Running the above code will provide a link; clicking it opens the LangSmith UI for the evaluations. An instance of the LangSmith UI is shown below.
Output:
We have seen how to evaluate LLMs. LangSmith also allows us to compare results across different LLMs. Users can simply change the model parameter in the app function defined above to use another suitable LLM, and then analyze high-level metrics and detailed comparisons across different models and configurations.
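For instance, a second experiment with a different model can be run against the same dataset so the two appear side by side in LangSmith. The sketch below is illustrative only: the GPT-4 model choice and the helper names are assumptions, not part of the original code.
def my_app_gpt4(question):
    return openai_client.chat.completions.create(
        model="gpt-4",  # swap in any model you have access to
        temperature=0,
        messages=[
            {
                "role": "system",
                "content": "Respond to the user's question in a short, concise manner (one short sentence).",
            },
            {"role": "user", "content": question},
        ],
    ).choices[0].message.content

def langsmith_app_gpt4(inputs):
    return {"output": my_app_gpt4(inputs["question"])}

experiment_results_gpt4 = evaluate(
    langsmith_app_gpt4,
    data=dataset_name,  # same dataset, so the experiments are directly comparable
    evaluators=[evaluate_length, qa_evaluator],
    experiment_prefix="openai-4",  # distinct prefix to tell the experiments apart
)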
The image below shows the comparison across different metrics amongst three different LLMs in the LangSmith UI.
So far, we have seen how to set up the LangSmith environment, enable traceability for LLM calls, and evaluate and compare LLM outputs easily under one dashboard. This concludes our current scope for exploring LangSmith for LLM production. Next, we will explore two realistic case studies that combine these elements under one roof.
In this section, we will combine all the scattered knowledge we have learned about LangSmith and examine it from the perspective of two realistic use cases. First, we will fine-tune a LLaMA model and evaluate and visualize the results using LangSmith. Second, we will develop an automated feedback mechanism for language models using LangSmith. While both use cases require some additional technical knowledge, the focus in the subsections below is solely on the LangSmith perspective.
This use case demonstrates the process of fine-tuning the LLaMA2-7b-chat model for a knowledge graph triple extraction task using a single GPU. LangSmith sources the training data, managing and evaluating datasets on its platform. The notebook leverages HuggingFace for the fine-tuning process and utilizes LangSmith to manage and export training data, as well as to evaluate the fine-tuned model’s performance. This showcases a practical application of integrating LangSmith with HuggingFace for efficient LLM fine-tuning and evaluation. The entire notebook can be found here.
Below, we will highlight the major steps, with a focus on the code snippets related to LangSmith.
env LANGCHAIN_API_KEY=<api-key>
pip install --quiet -U langchain langsmith pandas openai xformers transformers huggingface accelerate==0.21.0 peft==0.4.0 bitsandbytes==0.40.2 transformers==4.31.0 trl==0.4.7
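As noted above, the notebook exports its training data from a LangSmith dataset before fine-tuning. Below is a minimal sketch of such an export; the dataset name and the input/output keys are assumptions for illustration, and the notebook uses its own names.
import pandas as pd
from langsmith import Client

client = Client()

# Hypothetical dataset name holding the triplet-extraction training examples.
train_dataset_name = "kg-triplet-extraction-train"

examples = list(client.list_examples(dataset_name=train_dataset_name))
train_df = pd.DataFrame(
    {
        "sentence": [e.inputs.get("sentence") for e in examples],  # assumed input key
        "triplets": [e.outputs.get("output") for e in examples],   # assumed output key
    }
)
print(train_df.head())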
# model_path points to the fine-tuned checkpoint saved earlier in the notebook
model_loaded = AutoModelForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
pipe_llama7b_chat_ft = pipeline(
    task="text-generation", model=model_loaded, tokenizer=tokenizer, max_length=300, device=1
)
# test_prompt is a prompt string defined earlier in the notebook
result = pipe_llama7b_chat_ft(test_prompt)
print(result)
Output:
Running the above code should produce an output from the fine-tuned model; the output may vary with the data used. A sample of the expected output is shown above.
An overview of the evaluation workflow with a focus on LangSmith output is provided below.
from langsmith import Client
from langchain.smith import RunEvalConfig
from langchain.prompts import PromptTemplate
from langchain.llms import HuggingFacePipeline
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer

client = Client()

# Note that "sentence" is the key in the test dataset
prompt = PromptTemplate.from_template(
    "[INST] <<SYS>>\n{system_message}\n<</SYS>>\n\n### Input:{sentence}\n\n[/INST]\n"
).partial(system_message=system_prompt)

# EvaluateTriplets is a custom run evaluator defined earlier in the notebook
config = RunEvalConfig(
    custom_evaluators=[EvaluateTriplets()],
)

# Chat LLM w/ FT
llama_llm_chat_ft = HuggingFacePipeline(pipeline=pipe_llama7b_chat_ft)
llama_chain_chat_ft = prompt | llama_llm_chat_ft
results = await client.arun_on_dataset(validation_dataset_name, llama_chain_chat_ft, evaluation=config)
Output:
Running the above code in Google Colab will provide a link, as shown above, which opens the LangSmith UI showing the model’s performance under the chosen evaluation strategy.
In this use case, we set up an automated feedback pipeline for language models using LangSmith. It enables tracking and evaluating model performance through automated metrics integrated with LangSmith’s dataset management and evaluation capabilities. This blog doesn’t provide a detailed walkthrough of the code, so readers should be familiar with the associated topics. The entire code is available here. We will focus on the LangSmith aspects.
Here are the main steps outlined in the code:
import os
# Update with your API URL if using a hosted instance of Langsmith.
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"
# Update with your API key
os.environ["LANGCHAIN_API_KEY"] = "YOUR_API_KEY"
# Update with your API URL if using a hosted instance of Langsmith.
os.environ["LANGCHAIN_HUB_API_URL"] = "https://api.hub.langchain.com"
# Update with your Hub API key
os.environ["LANGCHAIN_HUB_API_KEY"] = "YOUR_HUB_API_KEY"
project_name = "YOUR_PROJECT_NAME" # Change to your project name
from langsmith import Client
from datetime import datetime
client = Client()
example_data = [
    ("Who trained Llama-v2?", "I'm sorry, but I don't have that information."),
    (
        "When did langchain first announce the hub?",
        "LangChain first announced the LangChain Hub on September 5, 2023.",
    ),
    (
        "What's LangSmith?",
        "LangSmith is a platform developed by LangChain for building production-grade LLM (Language Model) applications. It allows you to debug, test, evaluate, and monitor chains and intelligent agents built on any LLM framework. LangSmith seamlessly integrates with LangChain's open-source framework called LangChain, which is widely used for building applications with LLMs.\n\nLangSmith provides full visibility into model inputs and outputs at every step in the chain of events, making it easier to debug and analyze the behavior of LLM applications. It has been tested with early design partners and on internal workflows, and it has been found to help teams in various ways.\n\nYou can find more information about LangSmith on the official LangSmith documentation [here](https://docs.smith.langchain.com/). Additionally, you can read about the announcement of LangSmith as a unified platform for debugging and testing LLM applications [here](https://blog.langchain.dev/announcing-langsmith/).",
    ),
    (
        "What is the langsmith cookbook?",
        "I'm sorry, but I couldn't find any information about the \"Langsmith Cookbook\". It's possible that it may not be a well-known cookbook or it may not exist. Could you provide more context or clarify the name?",
    ),
    (
        "What is LangChain?",
        "I'm sorry, but I couldn't find any information about \"LangChain\". Could you please provide more context or clarify your question?",
    ),
    ("When was Llama-v2 released?", "Llama-v2 was released on July 18, 2023."),
]
for input_, output_ in example_data:
    client.create_run(
        name="ExampleRun",
        run_type="chain",
        inputs={"input": input_},
        outputs={"output": output_},
        project_name=project_name,
        end_time=datetime.utcnow(),
    )
This code creates a series of example runs with predefined input-output pairs. Each run is logged using the client.create_run method, associating it with a project for easy management and retrieval.
from langchain import hub

prompt = hub.pull(
    "wfh/automated-feedback-example", api_url="https://api.hub.langchain.com"
)

from langchain_core.output_parsers.openai_functions import JsonOutputFunctionsParser
from langchain_core.tracers.context import collect_runs
from langchain_openai import ChatOpenAI
chain = (
    prompt
    | ChatOpenAI(model="gpt-3.5-turbo", temperature=1).bind(
        functions=[
            {
                "name": "submit_scores",
                "description": "Submit the graded scores for a user question and bot response.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "relevance": {
                            "type": "integer",
                            "minimum": 0,
                            "maximum": 5,
                            "description": "Score indicating the relevance of the question to LangChain/LangSmith.",
                        },
                        "difficulty": {
                            "type": "integer",
                            "minimum": 0,
                            "maximum": 5,
                            "description": "Score indicating the complexity or difficulty of the question.",
                        },
                        "verbosity": {
                            "type": "integer",
                            "minimum": 0,
                            "maximum": 5,
                            "description": "Score indicating how verbose the question is.",
                        },
                        "specificity": {
                            "type": "integer",
                            "minimum": 0,
                            "maximum": 5,
                            "description": "Score indicating how specific the question is.",
                        },
                    },
                    "required": ["relevance", "difficulty", "verbosity", "specificity"],
                },
            }
        ]
    )
    | JsonOutputFunctionsParser()
)
def evaluate_run(run):
    try:
        # Only evaluate runs that have both an input and an output,
        # and skip runs that already carry this feedback.
        if "input" not in run.inputs or not run.outputs or "output" not in run.outputs:
            return
        if run.feedback_stats and "specificity" in run.feedback_stats:
            return
        with collect_runs() as cb:
            result = chain.invoke(
                {
                    "question": run.inputs["input"][:3000],
                    "prediction": run.outputs["output"][:3000],
                },
            )
            for feedback_key, value in result.items():
                score = int(value) / 5
                client.create_feedback(
                    run.id,
                    key=feedback_key,
                    score=score,
                    source_run_id=cb.traced_runs[0].id,
                    feedback_source_type="model",
                )
    except Exception as e:
        pass
This code snippet demonstrates AI-assisted feedback, where an LLM (GPT-3.5-turbo) scores each run’s input based on several metrics (relevance, difficulty, verbosity, and specificity). The scores are logged as feedback using client.create_feedback. The evaluate_run function handles the evaluation logic, and RunnableLambda is used for concurrent processing.
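The batch call below expects a runs collection. A minimal sketch of how it might be fetched, assuming we want to evaluate the example runs logged to the project above (the run_type filter is an illustrative choice):
from langsmith import Client

client = Client()

# Pull the runs logged to the project earlier so they can be batch-evaluated.
runs = list(client.list_runs(project_name=project_name, run_type="chain"))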
from langchain_core.runnables import RunnableLambda
wrapped_function = RunnableLambda(evaluate_run)
_ = wrapped_function.batch(runs, {"max_concurrency": 10}, return_exceptions=True)
Feedback is logged concurrently using the RunnableLambda class to batch-process the runs. This ensures efficient handling of multiple evaluations simultaneously.
feedback_stats = client.read_project(project_name=project_name).feedback_stats
print(feedback_stats)
Aggregate feedback statistics are read from the project, showcasing metrics such as readability indices and other evaluation scores. This provides a comprehensive view of the model’s performance based on the feedback received. Below is an image that might be expected as the final output in the LangSmith UI.
LangSmith UI Output:
LangSmith helps take language models from prototype to production by offering a comprehensive suite of tools and features designed to enhance their capabilities. By utilizing LangSmith’s monitoring, evaluation, debugging, testing, tracing, and observability functions, developers and businesses can significantly improve their models’ performance and reliability. LangSmith’s user-friendly interface and robust API integrations streamline the development process, making it easier to achieve high-quality results. Adopting LangSmith can lead to more efficient model iterations and, ultimately, better user experiences. This article provided a complete LangSmith guide in detail.
You can access code links here:
A. LangSmith provides a comprehensive suite of tools including monitoring, evaluation, debugging, testing, tracing, and observability features. These tools help developers enhance the performance and reliability of their language models throughout the development lifecycle.
A. LangSmith streamlines the development process by offering a user-friendly interface and robust API integrations. It ensures efficient model iterations and faster deployment, crucial for moving from prototype stages to full-scale production.
A. Yes, LangSmith’s advanced debugging tools allow developers to identify and resolve issues quickly. They also provide detailed insights into model performance, enabling precise debugging and optimization.
A. Monitoring and evaluation in LangSmith are essential for continuously assessing model performance in real time. These features help developers track model behavior, detect anomalies, and make data-driven improvements.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.