Recent advancements in reasoning models, such as OpenAI’s o1 and DeepSeek R1, have propelled LLMs to achieve impressive performance through techniques like Chain of Thought (CoT). However, the verbose nature of CoT leads to increased computational costs and latency. A novel paper published by Zoom Communications presents a new prompting technique called Chain of Draft (CoD). CoD focuses on concise, dense reasoning steps, reducing verbosity while maintaining accuracy. This approach mirrors human reasoning by prioritizing minimal, informative outputs, optimizing efficiency for real-world applications.
In this guide, we will explore this new prompting technique thoroughly and implement it using the Gemini, Groq, and Cohere APIs. We will also look at how Chain of Draft differs from other prompting techniques.
This article was published as a part of the Data Science Blogathon.
Chain of Draft (CoD) prompting is a novel approach to reasoning in large language models (LLMs), inspired by how humans tackle complex tasks. Rather than generating verbose, step-by-step explanations like the Chain of Thought (CoT) method, CoD focuses on producing concise, critical insights at each step. This minimalist approach allows LLMs to advance toward solutions more efficiently, using fewer tokens and reducing latency, all while maintaining or even improving accuracy.
Introduced by researchers at Zoom Communications, CoD has shown significant improvements in cost-effectiveness and speed across tasks like arithmetic, common-sense reasoning, and symbolic problem-solving, making it a practical technique for real-world applications. One can read the published paper in detail here.
Large Language Models (LLMs) have significantly advanced in their ability to perform complex reasoning tasks, owing much of their progress to various structured reasoning frameworks. One foundational method, Chain-of-Thought (CoT) reasoning, encourages models to articulate intermediate steps, thereby enhancing problem-solving capabilities. Building upon this, more sophisticated structures like tree and graph-based reasoning have been developed, allowing LLMs to tackle increasingly intricate problems by representing hierarchical and relational data more effectively.
Additionally, approaches such as self-consistency CoT incorporate verification and reflection mechanisms to bolster reasoning reliability, while ReAct integrates tool usage into the reasoning process, enabling LLMs to access external resources and knowledge. These innovations collectively expand the reasoning capabilities of LLMs across a diverse range of applications.
Chain of Draft (CoD) Prompting is a minimalist reasoning technique designed to optimize the performance of large language models (LLMs) by reducing verbosity during the reasoning process while maintaining accuracy. The core idea behind CoD is inspired by how humans approach problem-solving: instead of articulating every detail in a step-by-step manner, we tend to use concise, shorthand notes or drafts that capture only the most crucial pieces of information. This approach helps to reduce cognitive load and enables faster progress toward a solution.
Concise Intermediate Steps: CoD focuses on generating compact, dense outputs for each reasoning step, which capture only the essential information needed to move forward. This results in minimalistic drafts that help guide the model through problem-solving without unnecessary detail.
Cognitive Scaffolding: Just as humans use shorthand to track their ideas, CoD externalizes critical thoughts while avoiding the verbosity that typically burdens traditional reasoning models. The goal is to maintain the integrity of the reasoning pathway without overloading the model with excessive tokens.
Problem: Jason had 20 lollipops. He gave Denny some. Now he has 12 left. How many did Jason give to Denny?
Response [CoD]: 20 - 12 = 8 → Final Answer: 8.
As we can see above, the response contains very concise, symbolic reasoning steps, similar to the shorthand we jot down when solving a problem ourselves.
Different prompting techniques enhance LLM reasoning in unique ways, from step-by-step logic to external knowledge integration and structured thought processes.
In standard prompting, the LLM generates a direct answer to a query without showing the intermediate reasoning steps. It provides the final output without revealing the thought process behind it.
Although this approach is efficient in terms of token usage, it lacks transparency. Without insight into how the model reached its conclusion, verifying correctness or identifying reasoning errors becomes challenging, particularly for complex problems that require step-by-step reasoning.
With CoT prompting, the model offers an in-depth explanation of its reasoning process.
This response is thorough and transparent, outlining every step of the reasoning process. However, it is overly detailed, including redundant information that does not contribute to reaching the answer. This excess verbosity greatly increases token usage, resulting in higher latency and cost.
With CoD prompting, the model focuses exclusively on the essential reasoning steps, providing only the most critical information. This approach eliminates unnecessary details, ensuring efficiency while maintaining accuracy.
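To make the contrast concrete, here is a small sketch of the three styles expressed as system prompts. The CoD wording is the instruction used later in this article; the standard and CoT wordings are representative phrasings supplied for illustration, not quoted from the paper.

# Representative system prompts for the three prompting styles
STANDARD_PROMPT = "Answer the question directly. Do not return any explanation or reasoning."
COT_PROMPT = "Think step by step to answer the question. Return the answer at the end of the response after a separator ####."
COD_PROMPT = ("Think step by step, but only keep a minimum draft for each thinking step, "
              "with 5 words at most. Return the answer at the end of the response after a separator ####.")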
Below are the key advantages of Chain of Draft prompting:
Lower Token Usage: the paper reports a 68-92% reduction in tokens compared to CoT, with minimal impact on accuracy.
Reduced Latency and Cost: generating fewer tokens directly lowers response time and API spend.
Maintained Accuracy: CoD matches or approaches CoT accuracy on arithmetic, common-sense, and symbolic reasoning tasks.
Human-Like Drafting: the concise steps mirror the shorthand notes humans use while problem-solving.
Now we will see how we can implement the Chain of Draft prompting using different LLMs and methods.
We can implement Chain of Draft in different ways; let us go through them. We will implement it in code using two different LLM APIs: Gemini and Groq.
Let us now implement this prompting technique using Gemini to enhance reasoning, decision-making, and problem-solving capabilities.
To get a Gemini API key, visit the Gemini site and click the "Get an API key" button as shown in the picture below. You will be redirected to Google AI Studio, where you sign in with your Google account and find your generated API key.
We need to install the google-genai library.
pip install google-genai
We import the relevant packages and add the API key as an environment variable.
import base64
import os
from google import genai
from google.genai import types
os.environ["GEMINI_API_KEY"] = "Your Gemini API Key"
Now we define the generate function and configure the model, contents, and generate_content_config.
Note that in generate_content_config we pass the system instruction: "Think step by step, but only keep a minimum draft for each thinking step, with 5 words at most. Return the answer at the end of the response after a separator ####."
def generate_gemini(example, question):
    client = genai.Client(
        api_key=os.environ.get("GEMINI_API_KEY"),
    )

    model = "gemini-2.0-flash"
    contents = [
        types.Content(
            role="user",
            parts=[
                types.Part.from_text(text=example),
                types.Part.from_text(text=question),
            ],
        ),
    ]
    generate_content_config = types.GenerateContentConfig(
        temperature=1,
        top_p=0.95,
        top_k=40,
        max_output_tokens=8192,
        response_mime_type="text/plain",
        system_instruction=[
            types.Part.from_text(text="""Think step by step, but only keep a minimum draft for each thinking step, with 5 words at most. Return the answer at the end of the response after a separator ####."""),
        ],
    )

    # Now pass the parameters to the generate_content_stream function
    for chunk in client.models.generate_content_stream(
        model=model,
        contents=contents,
        config=generate_content_config,
    ):
        print(chunk.text, end="")
Now we can execute the code in two ways: zero-shot, passing only the system instruction and the question, or one-shot, passing an example along with the question and the system instruction.
if __name__ == "__main__":
    example = """"""
    question = """Q: Anita bought 3 apples and 4 oranges. Each apple costs $1.20 and each orange costs $0.80. How much did she spend in total?
A:"""
    generate_gemini(example, question)
Response for Zero-shot CoD prompt from Gemini:
Apples cost: 3 * $1.20
Oranges cost: 4 * $0.80
Total: sum of both
#### $6.80
if __name__ == "__main__":
    example = """Q: Jason had 20 lollipops. He gave Denny some lollipops. Now Jason has 12 lollipops. How many lollipops did Jason give to Denny?
A: 20 - x = 12; x = 8. #### 8"""
    question = """Q: Anita bought 3 apples and 4 oranges. Each apple costs $1.20 and each orange costs $0.80. How much did she spend in total?
A:"""
    generate_gemini(example, question)
Output
Apple cost: 3 * 1.20
Orange cost: 4 * 0.80
Total: apple + orange
Total cost: 3.60 +3.20
Total: 6.80
#### 6.80
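Since the system instruction asks the model to return the final answer after a "####" separator, a small helper (our own illustrative addition, not part of the Gemini or Groq SDKs) can extract just the answer from the generated text:

def extract_final_answer(response_text):
    # Split on the last '####' separator; fall back to the full text if it is absent
    parts = response_text.rsplit("####", 1)
    return parts[1].strip() if len(parts) == 2 else response_text.strip()

print(extract_final_answer("Total: 6.80\n#### 6.80"))  # prints "6.80"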
Now we will use the Groq API, which serves Llama models, to demonstrate the CoD prompting technique.
Similar to Gemini, we first need to create an account on Groq; we can do this by logging in with a Google account (Gmail) on the Groq site. Once logged in, click the "Create an API Key" button, give the key a name, and copy the generated key, as it will not be displayed again.
We need to install the groq library.
!pip install groq --quiet
We import the relevant packages and add the API key as an environment variable.
import os
from groq import Groq

# Configure the client; remember to set your API key as an environment variable
os.environ['GROQ_API_KEY'] = "Your Groq API Key"
Now we create the generate_groq function, passing in the example and the question. We also add the system prompt: "Think step by step, but only keep a minimum draft for each thinking step, with 5 words at most. Return the answer at the end of the response after a separator ####."
def generate_groq(example, question):
    client = Groq()
    completion = client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=[
            {
                "role": "system",
                "content": "Think step by step, but only keep a minimum draft for each thinking step, with 5 words at most. Return the answer at the end of the response after a separator ####."
            },
            {
                "role": "user",
                "content": example + "\n" + question
            },
        ],
        temperature=1,
        max_completion_tokens=1024,
        top_p=1,
        stream=True,
        stop=None,
    )

    for chunk in completion:
        print(chunk.choices[0].delta.content or "", end="")
Now we can execute the code in two ways: zero-shot, passing only the system instruction and the question, or one-shot, passing an example along with the question and the system instruction. Let's see the output for the Groq Llama model.
# One shot
if __name__ == "__main__":
    example = """Q: Jason had 20 lollipops. He gave Denny some lollipops. Now Jason has 12 lollipops. How many lollipops did Jason give to Denny?
A: 20 - x = 12; x = 8. #### 8"""
    question = """Q: Anita bought 3 apples and 4 oranges. Each apple costs $1.20 and each orange costs $0.80. How much did she spend in total?
A:"""
    generate_groq(example, question)
Output
Apples cost $1.20 * 3
Oranges cost $0.80 * 4
Add both costs together
Total cost is $3.60 + $3.20
Equals $6.80
#### $6.8
# Zero shot
if __name__ == "__main__":
    example = """"""
    question = """Q: Anita bought 3 apples and 4 oranges. Each apple costs $1.20 and each orange costs $0.80. How much did she spend in total?
A:"""
    generate_groq(example, question)
Output
Calculate apple cost.
Calculate orange cost.
Add both costs.
#### $7.20
As we can see, in the zero-shot case the Llama model does not arrive at the correct answer, unlike the Gemini model. We will tweak the question prompt by adding a few more words to arrive at the correct answer.
We append the following line to the end of our question: "Verify the answer is correct with steps".
# Tweaked zero shot
if __name__ == "__main__":
    example = """"""
    question = """Q: Anita bought 3 apples and 4 oranges. Each apple costs $1.20 and each orange costs $0.80. How much did she spend in total? Verify the answer is correct with steps
A:"""
    generate_groq(example, question)
Output
Calculate apple cost 3*1.20
Equal 3.60
Calculate orange cost 4 * 0.80
Equal 3.20
Add costs together 3.60 + 3.20
Equal 6.80
#### 6.80
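Earlier we also mentioned the Cohere API. Below is a minimal sketch of the same CoD system prompt applied through Cohere's Python SDK; this assumes the v2 ClientV2 chat interface, and the model name command-r-plus-08-2024 is illustrative rather than taken from this article.

import os
import cohere

os.environ["CO_API_KEY"] = "Your Cohere API Key"

def generate_cohere(example, question):
    # Assumption: ClientV2 chat interface from the cohere Python SDK (v5+)
    co = cohere.ClientV2(api_key=os.environ["CO_API_KEY"])
    response = co.chat(
        model="command-r-plus-08-2024",  # illustrative model name
        messages=[
            {"role": "system", "content": "Think step by step, but only keep a minimum draft for each thinking step, with 5 words at most. Return the answer at the end of the response after a separator ####."},
            {"role": "user", "content": example + "\n" + question},
        ],
    )
    print(response.message.content[0].text)

As with Gemini and Groq, you can call generate_cohere with an empty example string for zero-shot prompting or with the lollipop example for one-shot prompting.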
Let us now look into the limitations of CoD:
Reduced Transparency: the minimalist reasoning steps can make debugging and validating the model's logic more difficult.
Weaker Zero-Shot Performance: CoD can struggle without examples, particularly with smaller models, because CoD-style reasoning is rare in training data.
Chain of Draft (CoD) prompting presents a compelling alternative to traditional reasoning techniques by prioritizing efficiency and conciseness. Its ability to reduce latency and cost while maintaining accuracy makes it a valuable approach for real-world AI applications. However, CoD’s reliance on minimalistic reasoning steps can reduce transparency, making debugging and validation more challenging. Additionally, it struggles in zero-shot scenarios, particularly with smaller models, due to the lack of CoD-style reasoning in training data. Despite these limitations, CoD remains a powerful tool for optimizing LLM performance in constrained environments. Future research and fine-tuning may help address its weaknesses and broaden its applicability.
Q. How does CoD differ from Chain of Thought (CoT) prompting?
A. CoD generates significantly more concise reasoning compared to CoT while preserving accuracy. By eliminating non-essential details and utilizing equations or shorthand notation, it achieves a 68-92% reduction in token usage with minimal impact on accuracy.
Q. How can I implement CoD in my own prompts?
A. To implement CoD in your prompts, you can provide a system directive such as: "Think step by step, but limit each thinking step to a minimal draft of no more than five words. Return the final answer after a separator (####)." Additionally, using one-shot or few-shot examples can improve consistency, especially for models that struggle in zero-shot scenarios.
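For example, a few-shot CoD prompt can simply prepend worked examples in the same terse format before the new question; the snippet below assembles such a prompt from the problems used earlier in this article.

# Few-shot CoD examples assembled from the problems used earlier in this article
FEW_SHOT_COD_EXAMPLES = """Q: Jason had 20 lollipops. He gave Denny some lollipops. Now Jason has 12 lollipops. How many lollipops did Jason give to Denny?
A: 20 - x = 12; x = 8. #### 8

Q: Anita bought 3 apples and 4 oranges. Each apple costs $1.20 and each orange costs $0.80. How much did she spend in total?
A: 3 * 1.20 = 3.60; 4 * 0.80 = 3.20; 3.60 + 3.20 = 6.80. #### 6.80"""

# Pass FEW_SHOT_COD_EXAMPLES as the example argument of generate_gemini or generate_groq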
Q. What types of tasks is CoD best suited for?
A. CoD is most effective for structured reasoning tasks, including mathematical problem-solving, symbolic reasoning, and logic-based challenges. It excels in benchmarks like GSM8k and tasks that require step-by-step logical thinking.
Q. How much can CoD reduce token usage and cost?
A. The paper reports that CoD can reduce token usage by 68-92%, significantly lowering LLM API costs for high-volume applications while maintaining accuracy.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.