Chain of Draft Prompting with Gemini and Groq

Ritika Last Updated : 29 Mar, 2025
12 min read

Recent advancements in reasoning models, such as OpenAI’s o1 and DeepSeek R1, have propelled LLMs to impressive performance through techniques like Chain of Thought (CoT). However, the verbose nature of CoT leads to increased computational cost and latency. A novel paper published by Zoom Communications presents a new prompting technique called Chain of Draft (CoD). CoD focuses on concise, dense reasoning steps, reducing verbosity while maintaining accuracy. This approach mirrors human reasoning by prioritizing minimal, informative outputs, optimizing efficiency for real-world applications.

In this guide, we will explore this new prompting technique thoroughly, implement it using the Gemini and Groq APIs, and understand how Chain of Draft differs from other prompting techniques.

Learning Objectives

  • Gain a comprehensive understanding of the Chain of Draft (CoD) prompting technique.
  • Learn how to implement the CoD technique using the Gemini and Groq APIs.
  • Understand how CoD compares with other prompting techniques.
  • Analyze the advantages and limitations of the CoD prompting technique.

This article was published as a part of the Data Science Blogathon.

Introducing Chain of Draft Prompting 

Chain of Draft (CoD) prompting is a novel approach to reasoning in large language models (LLMs), inspired by how humans tackle complex tasks. Rather than generating verbose, step-by-step explanations like the Chain of Thought (CoT) method, CoD focuses on producing concise, critical insights at each step. This minimalist approach allows LLMs to advance toward solutions more efficiently, using fewer tokens and reducing latency, all while maintaining or even improving accuracy.

Introduced by researchers at Zoom Communications, CoD has shown significant improvements in cost-effectiveness and speed across tasks like arithmetic, common-sense reasoning, and symbolic problem-solving, making it a practical technique for real-world applications. One can read the published paper in detail here.

Background on Other Prompting Techniques

Large Language Models (LLMs) have significantly advanced in their ability to perform complex reasoning tasks, owing much of their progress to various structured reasoning frameworks. One foundational method, Chain-of-Thought (CoT) reasoning, encourages models to articulate intermediate steps, thereby enhancing problem-solving capabilities. Building upon this, more sophisticated structures like tree and graph-based reasoning have been developed, allowing LLMs to tackle increasingly intricate problems by representing hierarchical and relational data more effectively.

Additionally, approaches such as self-consistency CoT incorporate verification and reflection mechanisms to bolster reasoning reliability, while ReAct integrates tool usage into the reasoning process, enabling LLMs to access external resources and knowledge. These innovations collectively expand the reasoning capabilities of LLMs across a diverse range of applications. 

Different Prompting Techniques

  •  Chain-of-Thought (CoT) Prompting: Encourages models to generate intermediate reasoning steps, breaking down complex problems into simpler tasks. This approach improves performance on arithmetic, commonsense, and symbolic reasoning tasks.
  • Self-Consistency CoT: Integrates verification and reflection mechanisms into the reasoning process, allowing models to assess the consistency of their intermediate steps and refine their conclusions, thereby increasing reasoning reliability.
  • ReAct (Reasoning and Acting): Combines reasoning with tool usage, enabling models to access external resources and knowledge bases during the reasoning process. This integration enhances the model’s ability to perform tasks that require external information retrieval.
  • Tree-of-Thought Prompting: An advanced technique that explores multiple reasoning paths simultaneously by generating various approaches at each decision point and evaluating them to find the most promising solutions.
  • Graph of Thought (GoT): An advanced technique designed to enhance the reasoning capabilities of Large Language Models (LLMs) by structuring their thought processes as interconnected graphs. This method addresses the limitations of linear reasoning approaches, such as Chain-of-Thought (CoT) and Tree of Thoughts (ToT), by capturing the non-linear and dynamic nature of human cognition.
  • Skeleton-of-Thought (SoT): Guides models to first generate a skeletal outline of the answer, followed by parallel decoding. This method aims to reduce latency in generating responses while maintaining reasoning quality.

Explaining Chain of Draft Prompting

Chain of Draft (CoD) Prompting is a minimalist reasoning technique designed to optimize the performance of large language models (LLMs) by reducing verbosity during the reasoning process while maintaining accuracy. The core idea behind CoD is inspired by how humans approach problem-solving: instead of articulating every detail in a step-by-step manner, we tend to use concise, shorthand notes or drafts that capture only the most crucial pieces of information. This approach helps to reduce cognitive load and enables faster progress toward a solution.

Human-Centric Inspiration

  • In human problem-solving, whether solving equations, drafting essays, or coding, we rarely articulate every step in great detail. Instead, we often jot down only the most important pieces of information that are essential to advancing the solution. This minimalistic method reduces cognitive load, keeping focus on the core concepts.
  • For example, in mathematics, a person might record only key steps or simplified versions of equations, capturing the essence of the reasoning without excessive elaboration.

Mechanism of CoD

Concise Intermediate Steps: CoD focuses on generating compact, dense outputs for each reasoning step, which capture only the essential information needed to move forward. This results in minimalistic drafts that help guide the model through problem-solving without unnecessary detail.

Cognitive Scaffolding: Just as humans use shorthand to track their ideas, CoD externalizes critical thoughts while avoiding the verbosity that typically burdens traditional reasoning models. The goal is to maintain the integrity of the reasoning pathway without overloading the model with excessive tokens.

Example of CoD

Problem: Jason had 20 lollipops. He gave Denny some. Now he has 12 left. How many did Jason give to Denny?  

Response [CoD]: 20 - 12 = 8 → Final Answer: 8.

As shown above, the response contains only concise, symbolic reasoning steps, much like the shorthand notes we jot down when solving a problem ourselves.
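To get a rough feel for the efficiency gap, here is a small illustrative sketch (word counts as a crude stand-in for tokens; these are not the tokenizer or the figures used in the paper):

# Crude length comparison for the lollipop problem: a CoT-style answer vs a CoD-style draft.
# Word counts are only a rough proxy for token counts.
cot_answer = (
    "Jason started with 20 lollipops. He gave some to Denny and now has 12 left. "
    "The number given away is 20 - 12 = 8, so Jason gave Denny 8 lollipops."
)
cod_answer = "20 - 12 = 8 #### 8"

print(len(cot_answer.split()), "words (CoT-style)")
print(len(cod_answer.split()), "words (CoD-style)")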

Comparison Between Different Prompting Techniques

Different prompting techniques enhance LLM reasoning in unique ways, from step-by-step logic to external knowledge integration and structured thought processes.

Standard Prompting

In standard prompting, the LLM generates a direct answer to a query without showing the intermediate reasoning steps. It provides the final output without revealing the thought process behind it.

Standard Prompting Example
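For illustration (a representative exchange, not taken from the paper), a standard prompt and response could look like this:

Q: Jason had 20 lollipops. He gave Denny some lollipops. Now Jason has 12 lollipops. How many lollipops did Jason give to Denny?
A: 8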

Although this approach is efficient in terms of token usage, it lacks transparency. Without insight into how the model reached its conclusion, verifying correctness or identifying reasoning errors becomes challenging, particularly for complex problems that require step-by-step reasoning.

Chain of Thought (CoT) Prompting

With CoT prompting, the model offers an in-depth explanation of its reasoning process.

Chain of Thought Prompting Example
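A representative CoT response for the same problem might read (illustrative, not taken from the paper):

Q: Jason had 20 lollipops. He gave Denny some lollipops. Now Jason has 12 lollipops. How many lollipops did Jason give to Denny?
A: Jason starts with 20 lollipops. After giving some to Denny, he has 12 left. The number given away is the difference between the two amounts: 20 - 12 = 8. Therefore, Jason gave Denny 8 lollipops. #### 8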

This response is thorough and transparent, outlining every step of the reasoning process. However, it is overly detailed, including redundant information that doesn’t contribute computationally. This excess verbosity greatly increases token usage, resulting in higher latency and cost.

Chain of Draft (CoD) Prompting

With CoD prompting, the model focuses exclusively on the essential reasoning steps, providing only the most critical information. This approach eliminates unnecessary details, ensuring efficiency while maintaining accuracy.

Chain of Draft Prompting Example
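And a CoD-style response for the same problem (illustrative, consistent with the example shown earlier):

Q: Jason had 20 lollipops. He gave Denny some lollipops. Now Jason has 12 lollipops. How many lollipops did Jason give to Denny?
A: 20 - 12 = 8 #### 8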

Advantages of Chain of Draft (CoD) Prompting

Below, we look at the advantages of Chain of Draft prompting:

  • Reduced Latency: CoD enhances response times by 48-76% by reducing the number of tokens generated. This leads to much faster AI-powered applications, particularly in real-time environments like support, education, and conversational AI, where latency can heavily affect user experience.
  • Cost Reduction: By cutting token usage by 70-90% compared to CoT, CoD results in significantly lower inference costs. For an enterprise handling 1 million reasoning queries each month, CoD could reduce costs from $3,800 (CoT) to $760, saving over $3,000 per month (a quick arithmetic check follows after this list); savings grow even more at scale. With its ability to scale efficiently across large workloads, CoD allows businesses to process millions of AI queries without incurring excessive expenses.
  • Easier to Integrate into Systems: Less verbose responses are easier to parse, display, and pass on to downstream components, making them more user friendly.
  • Simplicity of Implementation: Unlike AI techniques that require model retraining or infrastructure changes, CoD is a prompting strategy that can be adopted instantly. Organizations already using CoT can switch to CoD with a simple prompt modification, making it highly accessible. Because CoD requires no fine-tuning, enterprises can seamlessly scale AI reasoning across global deployments without model retraining.
  • No model update required: CoD is compatible with pre-existing LLMs, allowing it to take advantage of advancements in model development without the need for retraining or fine-tuning. This ensures that efficiency improvements remain relevant and continue to grow as AI models progress.
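As flagged in the cost-reduction bullet above, here is a quick arithmetic check of the quoted figures (a back-of-the-envelope sketch; the $3,800 and $760 monthly costs come from the text, everything else is illustrative):

# Back-of-the-envelope check of the monthly cost figures quoted above (1M queries/month).
cot_monthly_cost = 3800   # $ per month with CoT (figure from the text)
cod_monthly_cost = 760    # $ per month with CoD (figure from the text)

saving = cot_monthly_cost - cod_monthly_cost
reduction = saving / cot_monthly_cost
print(f"Monthly saving: ${saving:,} ({reduction:.0%} reduction)")
# Prints: Monthly saving: $3,040 (80% reduction), consistent with the 70-90% token-reduction range.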

Code Implementation of CoD

Now we will see how to implement Chain of Draft prompting using different LLMs and methods.

Methods to Implement CoD

We can implement Chain of Draft in different ways; let us go through them:

  • Using Prompt Instruction: To implement Chain of Draft (CoD) prompting, instruct the model with the following prompt: “Think step by step, but only keep a minimum draft for each thinking step, with 5 words at most.” This guides the model to generate concise, essential reasoning for each step. Once the reasoning steps are complete, ask the model to return the final answer after a separator (####). This ensures minimal token usage while maintaining clarity and accuracy.
  • Using One-shot or Few-shot Examples: We can also make the prompt more robust by adding one or a few examples, enabling the LLM to produce consistent responses and generate intermediate steps as short drafts.

We will now implement this in code using two different LLMs via the Gemini and Groq APIs.

Implementation using Gemini

Let us now implement these prompting techniques using Gemini to enhance reasoning, decision-making, and problem-solving capabilities.

Step 1: Generate Gemini API Key

For the Gemini API key, visit the Gemini site and click on the Get an API Key button as shown below. You will be redirected to Google AI Studio, where you will need to sign in with your Google account and then find your generated API key.

Generate Gemini API Key

Step 2: Install Libraries

We need to install the google-genai library.

pip install google-genai

Step 3: Import Packages and Setup API Key

We import the relevant packages and add the API key as an environment variable.

import os
from google import genai
from google.genai import types

os.environ["GEMINI_API_KEY"] = "Your Gemini API Key"

Step 4: Create Generate Function

Now we define the generate function and configure the model, contents, and generate_content_config.

Note that in generate_content_config we pass the system instruction: “Think step by step, but only keep a minimum draft for each thinking step, with 5 words at most. Return the answer at the end of the response after a separator ####.”

def generate_gemini(example,question):
    client = genai.Client(
        api_key=os.environ.get("GEMINI_API_KEY"),
    )

    model = "gemini-2.0-flash"
    contents = [
        types.Content(
            role="user",
            parts=[
                types.Part.from_text(text=example),
                types.Part.from_text(text=question),
            ],
        ),
    ]
    generate_content_config = types.GenerateContentConfig(
        temperature=1,
        top_p=0.95,
        top_k=40,
        max_output_tokens=8192,
        response_mime_type="text/plain",
        system_instruction=[
            types.Part.from_text(text="""Think step by step, but only keep a minimum draft for each thinking step, with 5 words at most. Return the answer at the end of the response after a separator ####."""),
        ],
    )

    # Now pass the parameters to generate_content_stream
    for chunk in client.models.generate_content_stream(
        model=model,
        contents=contents,
        config=generate_content_config,
    ):
        print(chunk.text, end="")

Step 5: Execute the Code 

Now we can execute the code in two ways: first by passing only the system instruction prompt and the question directly, and second by passing a one-shot example in the prompt along with the question and system instruction.

if __name__ == "__main__":
    example = """"""
    question ="""Q: Anita bought 3 apples and 4 oranges. Each apple costs $1.20 and each orange costs $0.80. How much did she spend in total?
A:"""
    generate_gemini(example,question)

Response for Zero-shot CoD prompt from Gemini:

Apples cost: 3 * $1.20
Oranges cost: 4 * $0.80
Total: sum of both
#### $6.80

Next, we pass a one-shot example in the prompt along with the question:

if __name__ == "__main__":
    example = """Q: Jason had 20 lollipops. He gave Denny some lollipops. Now Jason has 12 lollipops. How many lollipops did Jason give to Denny?
A: 20 - x = 12; x = 8. #### 8"""
    question ="""Q: Anita bought 3 apples and 4 oranges. Each apple costs $1.20 and each orange costs $0.80. How much did she spend in total?
A:"""
    generate_gemini(example,question)

Output


Apple cost: 3 * 1.20 
Orange cost: 4 * 0.80 
Total: apple + orange 
Total cost: 3.60 +3.20
Total: 6.80
#### 6.80

Implementation using Groq

Now we will use the Groq API, which serves Llama models, to demonstrate the CoD prompting technique.

Step 1: Generate Groq API Key

Similar to Gemini, we first need to create an account on Groq; we can do this by logging in with a Google account (Gmail) on the Groq site. Once logged in, click on the “Create an API Key” button, give the key a name, and copy the generated key, as it will not be displayed again.

Creating Groq API Key

Step 2: Install Libraries

We need to install the groq library.

!pip install groq --quiet

Step 3: Import Packages and Setup API Key

We import the relevant packages and add the API key as an environment variable.

import os

from groq import Groq

# Configure the API key as an environment variable
os.environ['GROQ_API_KEY'] = "Your Groq API Key"

Step 4: Create Generate Function

Now we create the generate_groq function, which takes an example and a question. We also add the system prompt: “Think step by step, but only keep a minimum draft for each thinking step, with 5 words at most. Return the answer at the end of the response after a separator ####.”

def generate_groq(example,question):

  client = Groq()
  completion = client.chat.completions.create(
      model="llama-3.3-70b-versatile",
      messages=[
          {
              "role": "system",
              "content": "Think step by step, but only keep a minimum draft for each thinking step, with 5 words at most. Return the answer at the end of the response after a separator ####."
          },
          {
              "role": "user",
              "content": example+"\n"+question
          },
      ],
      temperature=1,
      max_completion_tokens=1024,
      top_p=1,
      stream=True,
      stop=None,
  )

  for chunk in completion:
      print(chunk.choices[0].delta.content or "", end="")

Step 5: Execute the Code 

Now we can execute the code in two ways: first by passing only the system instruction prompt and the question directly, and second by passing a one-shot example in the prompt along with the question and system instruction. Let’s see the output for the Groq Llama model.

#One shot 
if __name__ == "__main__":
    example = """Q: Jason had 20 lollipops. He gave Denny some lollipops. Now Jason has 12 lollipops. How many lollipops did Jason give to Denny?
A: 20 - x = 12; x = 8. #### 8"""
    question ="""Q: Anita bought 3 apples and 4 oranges. Each apple costs $1.20 and each orange costs $0.80. How much did she spend in total?
A:"""
    generate_groq(example,question)

Output

Apples cost $1.20 * 3
Oranges cost $0.80 * 4 
Add both costs together 
Total cost is $3.60 + $3.20 
Equals $6.80
#### $6.8

Next, the zero-shot case:

#zero shot
if __name__ == "__main__":
    example = """"""
    question ="""Q: Anita bought 3 apples and 4 oranges. Each apple costs $1.20 and each orange costs $0.80. How much did she spend in total?
A:"""
    generate_groq(example,question)

Output

Calculate apple cost. 
Calculate orange cost.
Add both costs.
#### $7.20

As we can see, in the zero-shot case the Llama model does not arrive at the correct answer, unlike the Gemini model. We will tweak the question prompt by adding a few more words to arrive at the correct answer.

We append the following line to the end of our question: “Verify the answer is correct with steps”.

 #tweaked Zero shot
if __name__ == "__main__":
    example = """"""
    question ="""Q: Anita bought 3 apples and 4 oranges. Each apple costs $1.20 and each orange costs $0.80. How much did she spend in total?Verify the answer is correct with steps
A:"""
    generate_groq(example,question)

Output

Calculate apple cost 3*1.20
Equal 3.60
Calculate orange cost 4 * 0.80 
Equal 3.20
Add costs together 3.603.20
Equal 6.80
#### 6.80
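Since the system prompt asks the model to return the final answer after a “####” separator, it can be handy to pull that answer out programmatically. Below is a small helper sketch (not part of the original workflow) that assumes this separator convention; to use it with the streaming functions above, first accumulate the streamed chunks into a single string:

# Helper to extract the final answer from a CoD-style response.
# Assumes the "####" separator requested in the system prompt.
def extract_final_answer(response_text: str) -> str:
    if "####" in response_text:
        return response_text.rsplit("####", 1)[1].strip()
    return response_text.strip()

print(extract_final_answer("Apples: 3 * $1.20\nOranges: 4 * $0.80\n#### $6.80"))  # -> $6.80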

Limitations of CoD

Let us now look at the limitations of CoD:

  • Less Transparency: Compared to other prompting techniques such as CoT, CoD offers less transparency since it does not spell out each verbose step, which makes debugging and understanding the reasoning flow harder.
  • Increased likelihood of mistakes in intricate reasoning: Certain problems demand thorough intermediate steps to maintain logical accuracy, which CoD may overlook.
  • CoD’s Dependency on Examples: As we saw above for smaller models the performance drops in zero shot cases. It struggles in zero-shot scenarios, showing a significant drop in accuracy without example prompts. This is likely due to the absence of CoD-style reasoning patterns in training data, making it harder for models to grasp the approach without guidance.

Conclusion

Chain of Draft (CoD) prompting presents a compelling alternative to traditional reasoning techniques by prioritizing efficiency and conciseness. Its ability to reduce latency and cost while maintaining accuracy makes it a valuable approach for real-world AI applications. However, CoD’s reliance on minimalistic reasoning steps can reduce transparency, making debugging and validation more challenging. Additionally, it struggles in zero-shot scenarios, particularly with smaller models, due to the lack of CoD-style reasoning in training data. Despite these limitations, CoD remains a powerful tool for optimizing LLM performance in constrained environments. Future research and fine-tuning may help address its weaknesses and broaden its applicability.

Key Takeaways

  • A new, concise prompting technique from Zoom Communications, CoD reduces verbosity compared to Chain of Thought (CoT), mirroring human reasoning for efficiency.
  • CoD cuts token usage by 70-90% and latency by 48-76%, potentially saving thousands monthly (e.g., $3,000 for a million queries).
  • Easily applied via APIs like Gemini and Groq with minimal prompts, no model retraining needed.
  • Offers less transparency than CoT and may falter in complex reasoning or zero-shot scenarios without examples.

Frequently Asked Questions

Q1. How is CoD different from Chain of Thought (CoT)?

A. CoD generates significantly more concise reasoning compared to CoT while preserving accuracy. By eliminating non-essential details and utilizing equations or shorthand notation, it achieves a 68-92% reduction in token usage with minimal impact on accuracy.

Q2.  How can I apply Chain of Draft (CoD) in my prompts?

A. To implement CoD in your prompts, you can provide a system directive such as: “Think step by step, but limit each thinking step to a minimal draft of no more than five words. Return the final answer after a separator (####).” Additionally, using one-shot or few-shot examples can improve consistency, especially for models that struggle in zero-shot scenarios.

Q3. Which tasks are best suited for Chain of Draft (CoD)?

A. CoD is most effective for structured reasoning tasks, including mathematical problem-solving, symbolic reasoning, and logic-based challenges. It excels in benchmarks like GSM8k and tasks that require step-by-step logical thinking.

Q4. How does Chain of Draft (CoD) impact cost savings compared to Chain of Thought (CoT)?

A. The paper reports that CoD can reduce token usage by 68-92%, significantly lowering LLM API costs for high-volume applications while maintaining accuracy.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

I am a professional working as data scientist after finishing my MBA in Business Analytics and Finance. A keen learner who loves to explore and understand and simplify stuff! I am currently learning about advanced ML and NLP techniques and reading up on various topics related to it including research papers .
