Output parsers are essential for converting raw, unstructured text from language models (LLMs) into structured formats such as JSON or Pydantic models, making the output easier to use in downstream tasks. While function or tool calling can automate this transformation in many LLMs, output parsers remain valuable for generating structured data and normalizing model outputs.
LLMs often produce unstructured text, but output parsers can transform it into structured data. Some models natively support structured output; when they don't, parsers step in. Output parsers implement two key methods: get_format_instructions, which returns a string telling the model how to format its response, and parse, which converts the raw model output into the target structure.
Some parsers also include an optional parse_with_prompt method, which takes the original prompt into account — useful when retrying or fixing a flawed response.
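To make the interface concrete, here is a toy parser exposing the same two methods. This is an illustrative sketch only, not LangChain's actual base class:

```python
import json

class MiniJsonParser:
    """Toy parser mirroring the two-method output-parser interface."""

    def get_format_instructions(self) -> str:
        # Text injected into the prompt so the model knows the target format.
        return "Return your answer as a JSON object."

    def parse(self, text: str) -> dict:
        # Convert the raw model string into structured data.
        return json.loads(text)

parser = MiniJsonParser()
raw = '{"setup": "Why did the chicken cross the road?", "punchline": "To get to the other side."}'
print(parser.parse(raw)["punchline"])  # To get to the other side.
```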
The PydanticOutputParser is useful for defining and validating structured outputs. Below is a step-by-step example of how to use it.
import os

from langchain_core.output_parsers import PydanticOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI  # ChatOpenAI wraps OpenAI chat models such as GPT-4o
from pydantic import BaseModel, Field, model_validator

# Ensure the OpenAI API key is set correctly.
# In Google Colab you can load it from user secrets:
# from google.colab import userdata
# os.environ['OPENAI_API_KEY'] = userdata.get('OPENAI_API_KEY')
openai_key = os.getenv('OPENAI_API_KEY')

# Define the model using ChatOpenAI
model = ChatOpenAI(api_key=openai_key, model_name="gpt-4o", temperature=0.0)

# Define a Pydantic model for the output
class Joke(BaseModel):
    setup: str = Field(description="The question in the joke")
    punchline: str = Field(description="The answer in the joke")

    @model_validator(mode="before")
    @classmethod
    def validate_setup(cls, values: dict) -> dict:
        setup = values.get("setup")
        if setup and setup[-1] != "?":
            raise ValueError("Setup must end with a question mark!")
        return values
# Create the parser
parser = PydanticOutputParser(pydantic_object=Joke)
# Define the prompt template
prompt = PromptTemplate(
    template="Answer the user query.\n{format_instructions}\n{query}\n",
    input_variables=["query"],
    partial_variables={"format_instructions": parser.get_format_instructions()},
)
# Combine the model and parser
prompt_and_model = prompt | model
# Generate output
output = prompt_and_model.invoke({"query": "Tell me a joke."})
# Parse the output
parsed_output = parser.invoke(output)
# Print the parsed output
print(parsed_output)
Output:
Output parsers integrate smoothly with LangChain Expression Language (LCEL), enabling complex chaining and streaming of data. For example:
chain = prompt | model | parser
parsed_response = chain.invoke({"query": "Tell me a joke."})
print(parsed_response)
Output:
LangChain’s output parsers also support streaming, allowing partial outputs to be consumed as they are generated.
from langchain.output_parsers.json import SimpleJsonOutputParser
json_prompt = PromptTemplate.from_template(
    "Return a JSON object with an `answer` key that answers the following question: {question}"
)
json_parser = SimpleJsonOutputParser()
json_chain = json_prompt | model | json_parser
for chunk in json_chain.stream({"question": "Who invented the bulb?"}):
    print(chunk)
Output:
for chunk in chain.stream({"query": "Tell me a joke."}):
    print(chunk)
Output:
Output parsers like PydanticOutputParser enable the efficient extraction, validation, and streaming of structured responses from LLMs, improving flexibility in handling model outputs.
While some language models natively support structured outputs, others rely on output parsers for defining and processing structured data. The JsonOutputParser is a powerful tool for parsing JSON schemas, enabling structured information extraction from model responses.
Here’s an example of how to combine JsonOutputParser with Pydantic to parse a language model’s output into a structured format.
from langchain_core.output_parsers import JsonOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI
from pydantic import BaseModel, Field
# Initialize the model
model = ChatOpenAI(temperature=0)
# Define the desired data structure
class MovieQuote(BaseModel):
    character: str = Field(description="The character who said the quote")
    quote: str = Field(description="The quote itself")
# Create the parser with the schema
parser = JsonOutputParser(pydantic_object=MovieQuote)
# Define a query
quote_query = "Give me a famous movie quote with the character name."
# Set up the prompt with formatting instructions
prompt = PromptTemplate(
    template="Answer the user query.\n{format_instructions}\n{query}\n",
    input_variables=["query"],
    partial_variables={"format_instructions": parser.get_format_instructions()},
)
# Combine the prompt, model, and parser
chain = prompt | model | parser
# Invoke the chain
response = chain.invoke({"query": quote_query})
print(response)
Output:
A major advantage of the JsonOutputParser is its ability to stream partial JSON objects, which is useful for real-time applications.
for chunk in chain.stream({"query": quote_query}):
    print(chunk)
Output:
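Conceptually, a streaming JSON parser keeps a buffer of the tokens received so far and re-attempts a parse each time new tokens arrive, emitting each successfully parsed (possibly partial) object. A minimal stdlib sketch of that idea, using hypothetical token chunks rather than real model output:

```python
import json

def stream_partial_json(token_stream):
    # Accumulate tokens and re-parse after each arrival, trying small
    # suffixes that might close an unfinished string or object.
    buffer, last = "", None
    for token in token_stream:
        buffer += token
        for suffix in ("", '"}', "}"):
            try:
                parsed = json.loads(buffer + suffix)
            except json.JSONDecodeError:
                continue
            if parsed != last:
                # Emit only when the partial object has actually grown.
                yield parsed
                last = parsed
            break

tokens = ['{"character": "Yo', 'da", "quote": "Do or do not."', '}']
for partial in stream_partial_json(tokens):
    print(partial)
```

Each print shows a progressively more complete dictionary, which is the behavior the streamed chunks above exhibit.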
For scenarios where strict schema validation is not necessary, you can use JsonOutputParser without defining a Pydantic schema. This approach is simpler but offers less control over the output structure.
# Define a simple query
quote_query = "Tell me a fun fact about movies."
# Initialize the parser without a Pydantic object
parser = JsonOutputParser()
prompt = PromptTemplate(
    template="Answer the user query.\n{format_instructions}\n{query}\n",
    input_variables=["query"],
    partial_variables={"format_instructions": parser.get_format_instructions()},
)
chain = prompt | model | parser
response = chain.invoke({"query": quote_query})
print(response)
Output:
{'response': "Sure! Here is a fun fact about movies: The first movie
ever made is considered to be 'Roundhay Garden Scene,' which was filmed
in 1888 and is only 2.11 seconds long.", 'fun_fact': "The first movie ever
made is considered to be 'Roundhay Garden Scene,' which was filmed in
1888 and is only 2.11 seconds long."}
By integrating JsonOutputParser into your workflow, you can efficiently manage structured data outputs from language models, ensuring clarity and consistency in responses.
LLMs vary in their ability to produce structured data, and XML is a commonly used format for organizing hierarchical data. The XMLOutputParser is a powerful tool for requesting, structuring, and parsing XML outputs from models into a usable format.
Here’s how to prompt a model to generate XML content and parse it into a structured format.
from langchain_core.output_parsers import XMLOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI
# Initialize the model
model = ChatOpenAI(api_key=openai_key, model_name="gpt-4o", temperature=0.0)
# Define a simple query
query = "List popular books written by J.K. Rowling, enclosed in <book></book> tags."
# Model generates XML output
response = model.invoke(query)
print(response.content)
Output:
To convert the generated XML into a structured format like a dictionary, use the XMLOutputParser:
parser = XMLOutputParser()
# Add formatting instructions to the prompt
prompt = PromptTemplate(
    template="{query}\n{format_instructions}",
    input_variables=["query"],
    partial_variables={"format_instructions": parser.get_format_instructions()},
)
# Create a chain of prompt, model, and parser
chain = prompt | model | parser
# Execute the query
parsed_output = chain.invoke({"query": query})
print(parsed_output)
{'books': [{'book': [{'title': "Harry Potter and the Philosopher's Stone"}]},
           {'book': [{'title': 'Harry Potter and the Chamber of Secrets'}]},
           {'book': [{'title': 'Harry Potter and the Prisoner of Azkaban'}]},
           {'book': [{'title': 'Harry Potter and the Goblet of Fire'}]},
           {'book': [{'title': 'Harry Potter and the Order of the Phoenix'}]},
           {'book': [{'title': 'Harry Potter and the Half-Blood Prince'}]},
           {'book': [{'title': 'Harry Potter and the Deathly Hallows'}]},
           {'book': [{'title': 'The Casual Vacancy'}]},
           {'book': [{'title': "The Cuckoo's Calling"}]},
           {'book': [{'title': 'The Silkworm'}]},
           {'book': [{'title': 'Career of Evil'}]},
           {'book': [{'title': 'Lethal White'}]},
           {'book': [{'title': 'Troubled Blood'}]}]}
You can specify custom tags for the output, allowing greater control over the structure of the data.
parser = XMLOutputParser(tags=["author", "book", "genre", "year"])
prompt = PromptTemplate(
    template="{query}\n{format_instructions}",
    input_variables=["query"],
    partial_variables={"format_instructions": parser.get_format_instructions()},
)
chain = prompt | model | parser
query = "Provide a detailed list of books by J.K. Rowling, including genre and publication year."
custom_output = chain.invoke({"query": query})
print(custom_output)
{'author': [{'name': 'J.K. Rowling'},
            {'books': [{'book': [{'title': "Harry Potter and the Philosopher's Stone"}, {'genre': 'Fantasy'}, {'year': '1997'}]},
                       {'book': [{'title': 'Harry Potter and the Chamber of Secrets'}, {'genre': 'Fantasy'}, {'year': '1998'}]},
                       {'book': [{'title': 'Harry Potter and the Prisoner of Azkaban'}, {'genre': 'Fantasy'}, {'year': '1999'}]},
                       {'book': [{'title': 'Harry Potter and the Goblet of Fire'}, {'genre': 'Fantasy'}, {'year': '2000'}]},
                       {'book': [{'title': 'Harry Potter and the Order of the Phoenix'}, {'genre': 'Fantasy'}, {'year': '2003'}]},
                       {'book': [{'title': 'Harry Potter and the Half-Blood Prince'}, {'genre': 'Fantasy'}, {'year': '2005'}]},
                       {'book': [{'title': 'Harry Potter and the Deathly Hallows'}, {'genre': 'Fantasy'}, {'year': '2007'}]},
                       {'book': [{'title': 'The Casual Vacancy'}, {'genre': 'Fiction'}, {'year': '2012'}]},
                       {'book': [{'title': "The Cuckoo's Calling"}, {'genre': 'Crime Fiction'}, {'year': '2013'}]},
                       {'book': [{'title': 'The Silkworm'}, {'genre': 'Crime Fiction'}, {'year': '2014'}]},
                       {'book': [{'title': 'Career of Evil'}, {'genre': 'Crime Fiction'}, {'year': '2015'}]},
                       {'book': [{'title': 'Lethal White'}, {'genre': 'Crime Fiction'}, {'year': '2018'}]},
                       {'book': [{'title': 'Troubled Blood'}, {'genre': 'Crime Fiction'}, {'year': '2020'}]},
                       {'book': [{'title': 'The Ink Black Heart'}, {'genre': 'Crime Fiction'}, {'year': '2022'}]}]}]}
The XMLOutputParser also supports streaming partial results for real-time processing, making it ideal for scenarios where data must be processed as it is generated.
for chunk in chain.stream({"query": query}):
    print(chunk)
By leveraging XMLOutputParser, you can efficiently request, validate, and transform XML-based outputs into structured data formats, allowing you to work with complex hierarchical information effectively.
YAML is a widely used format for representing structured data in a clean, human-readable way. The YamlOutputParser offers a straightforward method to request and parse YAML-formatted outputs from language models, simplifying structured data extraction while adhering to a defined schema.
To get started, define a data structure and use the YamlOutputParser to guide the model’s output.
from langchain.output_parsers import YamlOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI
from pydantic import BaseModel, Field
# Define the desired structure
class Recipe(BaseModel):
    name: str = Field(description="The name of the dish")
    ingredients: list[str] = Field(description="List of ingredients for the recipe")
    steps: list[str] = Field(description="Steps to prepare the recipe")
# Initialize the model
model = ChatOpenAI(temperature=0)
# Define a query to generate structured YAML output
query = "Provide a simple recipe for pancakes."
# Set up the parser
parser = YamlOutputParser(pydantic_object=Recipe)
# Inject parser instructions into the prompt
prompt = PromptTemplate(
    template="Answer the user query.\n{format_instructions}\n{query}\n",
    input_variables=["query"],
    partial_variables={"format_instructions": parser.get_format_instructions()},
)
# Chain the components together
chain = prompt | model | parser
# Execute the query
output = chain.invoke({"query": query})
print(output)
name='Pancakes'
ingredients=['1 cup all-purpose flour', '2 tablespoons sugar', '1 tablespoon baking powder', '1/2 teaspoon salt', '1 cup milk', '2 tablespoons unsalted butter, melted', '1 large egg']
steps=['In a mixing bowl, whisk together flour, sugar, baking powder, and salt.', 'In a separate bowl, mix milk, melted butter, and egg.', 'Combine wet and dry ingredients, stirring until just combined (batter may be lumpy).', 'Heat a non-stick pan over medium heat and pour 1/4 cup of batter for each pancake.', 'Cook until bubbles form on the surface, then flip and cook until golden brown.', 'Serve hot with your favorite toppings.']
Once the YAML output is generated, the YamlOutputParser converts the YAML into a Pydantic object, making the data programmatically accessible.
print(output.name)
print(output.ingredients)
print(output.steps[0])
You can customize the schema to meet specific needs by defining additional fields or constraints.
from langchain.output_parsers import YamlOutputParser
from pydantic import BaseModel, Field
# Define a custom schema for HabitChange
class HabitChange(BaseModel):
    habit: str = Field(description="A daily habit")
    sustainable_alternative: str = Field(description="An environmentally friendly alternative")
# Create the parser using the custom schema
parser = YamlOutputParser(pydantic_object=HabitChange)
# Sample output from your chain
output = """
habit: Using plastic bags for grocery shopping.
sustainable_alternative: Bring reusable bags to the store to reduce plastic waste and protect the environment.
"""
# Parse the output
parsed_output = parser.parse(output)
# Print the parsed output
print(parsed_output)
habit='Using plastic bags for grocery shopping.'
sustainable_alternative='Bring reusable bags to the store to reduce plastic waste and protect the environment.'
You can augment the model’s default YAML instructions with your own to further control the format of the output. By default, the parser’s get_format_instructions method tells the model: “The output should be formatted as a YAML instance that conforms to the following JSON schema:”, followed by the JSON schema generated from your Pydantic model. You can edit the prompt to provide additional context or enforce stricter formatting requirements.
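A hedged sketch of what such augmentation might look like — the custom rule text and the truncated schema string here are invented for illustration; in practice you would append your rules to the string returned by get_format_instructions:

```python
# Hypothetical example: append house rules to the parser's default
# instructions before injecting them into the prompt template.
default_instructions = (
    "The output should be formatted as a YAML instance that conforms "
    "to the following JSON schema:\n{schema}"
)
custom_rules = "Use two-space indentation and quote every string value."
format_instructions = default_instructions + "\n" + custom_rules

# The {schema} placeholder is filled with the model's JSON schema.
print(format_instructions.format(schema='{"title": "HabitChange", "type": "object"}'))
```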
The YamlOutputParser is a powerful tool for leveraging YAML for structured outputs from language models. By combining it with tools like Pydantic, you can ensure that the generated data is both valid and usable, streamlining workflows that rely on structured responses.
Parsing errors can occur when a model’s output does not adhere to the expected structure or when the response is incomplete. In such cases, the RetryOutputParser offers a mechanism to retry parsing by utilizing the original prompt alongside the invalid output. This approach helps improve the final result by asking the model to refine its response.
Consider a scenario where you’re prompting the model to provide structured action information, but the response is incomplete.
from langchain.output_parsers import RetryOutputParser, PydanticOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_openai import OpenAI
from pydantic import BaseModel, Field
# Define the expected structure
class Action(BaseModel):
    action: str = Field(description="The action to be performed")
    action_input: str = Field(description="The input required for the action")
# Initialize the parser
parser = PydanticOutputParser(pydantic_object=Action)
# Define a prompt template
template = """Based on the user's question, provide an Action and its corresponding Action Input.
{format_instructions}
Question: {query}
Response:"""
prompt = PromptTemplate(
    template=template,
    input_variables=["query"],
    partial_variables={"format_instructions": parser.get_format_instructions()},
)
# Prepare the query and bad response
query = "Who is Leo DiCaprio's girlfriend?"
prompt_value = prompt.format_prompt(query=query)
bad_response = '{"action": "search"}'
# Attempt parsing the incomplete response
try:
    parser.parse(bad_response)
except Exception as e:
    print(f"Parsing failed: {e}")
The RetryOutputParser retries parsing by passing the original prompt and incomplete output back to the model, prompting it to provide a more complete answer.
# Initialize a RetryOutputParser
retry_parser = RetryOutputParser.from_llm(parser=parser, llm=OpenAI(temperature=0))
# Retry parsing
retried_output = retry_parser.parse_with_prompt(bad_response, prompt_value)
print(retried_output)
You can integrate the RetryOutputParser into a chain to streamline the handling of parsing errors and automate reprocessing.
from langchain_core.runnables import RunnableLambda, RunnableParallel
# Define a chain for generating and retrying responses
completion_chain = prompt | OpenAI(temperature=0)
main_chain = RunnableParallel(
    completion=completion_chain,
    prompt_value=prompt
) | RunnableLambda(lambda x: retry_parser.parse_with_prompt(**x))
# Invoke the chain
output = main_chain.invoke({"query": query})
print(output)
The RetryOutputParser provides a robust way to handle parsing errors by leveraging the model’s ability to regenerate outputs in light of prior mistakes. This ensures more consistent, complete, and contextually appropriate responses for structured data workflows.
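The retry pattern itself is easy to sketch in plain Python. This is illustrative only: fake_llm is a hypothetical stand-in for a real model call, and the validation logic is hard-coded rather than driven by a schema:

```python
import json

def fake_llm(prompt: str) -> str:
    # Stand-in for a model call; a real chain would invoke an LLM here.
    return '{"action": "search", "action_input": "Leo DiCaprio girlfriend"}'

def parse_with_retry(prompt: str, completion: str, max_retries: int = 2) -> dict:
    for _ in range(max_retries + 1):
        try:
            result = json.loads(completion)
            if "action" in result and "action_input" in result:
                return result  # parsed and complete
        except json.JSONDecodeError:
            pass
        # Retry: hand the model the original prompt plus the failed output.
        completion = fake_llm(f"{prompt}\nYour previous answer was invalid:\n{completion}")
    raise ValueError("Could not obtain a parseable response")

print(parse_with_retry("Provide an Action and Action Input.", '{"action": "search"}'))
```

The key design point mirrored here is that the retry call carries both the original prompt and the bad completion, so the model has enough context to complete the missing field rather than start over.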
The OutputFixingParser is designed to help when another output parser fails due to formatting errors. Instead of simply raising an error, it can attempt to correct the misformatted output by passing it back to a language model for reprocessing. This helps “fix” issues such as incorrect JSON formatting or missing required fields.
Suppose you’re parsing an actor’s filmography, but the data you receive is misformatted and does not comply with the expected schema.
from typing import List
from langchain_core.exceptions import OutputParserException
from langchain_core.output_parsers import PydanticOutputParser
from langchain_openai import ChatOpenAI
from pydantic import BaseModel, Field
# Define the expected structure
class Actor(BaseModel):
    name: str = Field(description="name of the actor")
    film_names: List[str] = Field(description="list of names of films they starred in")
# Define the query
actor_query = "Generate the filmography for a random actor."
# Initialize the parser
parser = PydanticOutputParser(pydantic_object=Actor)
# Misformatted response
misformatted = "{'name': 'Tom Hanks', 'film_names': ['Forrest Gump']}"
# Try to parse the misformatted output
try:
    parser.parse(misformatted)
except OutputParserException as e:
    print(f"Parsing failed: {e}")
To fix the error, we pass the problematic output to the OutputFixingParser, which uses the model to correct it:
from langchain.output_parsers import OutputFixingParser
# Initialize the OutputFixingParser
new_parser = OutputFixingParser.from_llm(parser=parser, llm=ChatOpenAI())
# Use the OutputFixingParser to attempt fixing the response
fixed_output = new_parser.parse(misformatted)
print(fixed_output)
Output:
This technique is particularly useful in data-driven tasks where the model’s output may sometimes be incomplete or poorly formatted but still contains valuable information that can be salvaged with some corrections.
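For intuition, the single-quote failure above can even be repaired deterministically. This sketch mimics the fix step without any model call — the real OutputFixingParser re-prompts an LLM instead of using ast.literal_eval:

```python
import ast
import json

def fix_and_parse(text: str) -> dict:
    # First try strict JSON; fall back to Python-literal parsing, which
    # accepts the single-quoted dict the model produced above.
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return ast.literal_eval(text)

misformatted = "{'name': 'Tom Hanks', 'film_names': ['Forrest Gump']}"
print(fix_and_parse(misformatted))
# {'name': 'Tom Hanks', 'film_names': ['Forrest Gump']}
```

Delegating the fix to an LLM, as OutputFixingParser does, generalizes beyond quoting errors to arbitrary formatting mistakes, at the cost of an extra model call.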
The output parsers—YamlOutputParser, RetryOutputParser, and OutputFixingParser—play crucial roles in handling structured data and addressing parsing errors in language model outputs. The YamlOutputParser is particularly useful for generating and parsing structured YAML outputs, making data human-readable and easily mapped to Python models via Pydantic. It’s ideal when data needs to be readable, validated, and customized.
On the other hand, the RetryOutputParser helps recover from incomplete or invalid responses by utilizing the original prompt and output, allowing the model to refine its response. This ensures more accurate, contextually appropriate outputs when certain fields are missing or the data is incomplete.
Lastly, the OutputFixingParser tackles misformatted outputs, such as incorrect JSON or missing fields, by reprocessing the invalid data. This parser enhances resilience by correcting minor errors without halting the process, ensuring that the system can recover from inconsistencies. Together, these tools enable consistent, structured, and error-free outputs, even when faced with incomplete or misformatted data.
Q1. What is the YamlOutputParser used for?
Ans. The YamlOutputParser is used to parse and generate YAML-formatted outputs from language models. It helps in structuring data in a human-readable way and maps it to Python models using tools like Pydantic. This parser is ideal for tasks where structured and readable outputs are needed.
Q2. When should you use the RetryOutputParser?
Ans. The RetryOutputParser is used when a parsing error occurs, such as incomplete or invalid responses. It retries the parsing process by using the original prompt and the initial invalid output, asking the model to refine and correct its response. This helps to generate more accurate, complete outputs, especially when fields are missing or partially completed.
Q3. When is the OutputFixingParser useful?
Ans. The OutputFixingParser is useful when the output is misformatted or does not adhere to the expected schema, such as when JSON formatting issues arise or required fields are missing. It helps in fixing these errors by passing the misformatted output back to the model for correction, ensuring the data can be parsed correctly.
Q4. Can the YamlOutputParser be customized?
Ans. Yes, the YamlOutputParser can be customized to fit specific needs. You can define custom schemas using Pydantic models and specify additional fields, constraints, or formats to ensure the parsed data matches your requirements.
Q5. What happens if parsing fails even after retrying?
Ans. If parsing fails after retrying with the RetryOutputParser, the system may need further manual intervention. However, this parser significantly reduces the likelihood of failures by leveraging the original prompt and output to refine the response.