Marco-o1 vs Llama 3.2: Which is Better?

Nibedita Dutta | Last Updated: 31 Dec, 2024
10 min read

OpenAI’s o1 model has generated considerable excitement in the field of large reasoning models (LRMs) due to its advanced capabilities in tackling complex problems. Building on this foundation, Marco-o1 emerges as a new LRM that not only emphasizes traditional disciplines such as mathematics and coding but also prioritizes open-ended problem-solving across a variety of domains. A key focus of Marco-o1 is to explore the extent to which the o1 model can generalize its reasoning abilities to areas that lack clear standards and quantifiable rewards. This exploration is crucial for understanding the potential applications of LRMs in real-world scenarios where conventional metrics may not apply, thereby pushing the boundaries of what these models can achieve.

Learning Objectives

  • Understand the architecture and key techniques behind the Marco-o1 model, including Chain-of-Thought fine-tuning and Monte Carlo Tree Search.
  • Explore how Marco-o1 adapts its reasoning strategies for complex, open-ended problem-solving tasks across various domains.
  • Analyze the role of the reflection mechanism in improving reasoning accuracy by prompting self-evaluation of the model’s outputs.
  • Compare the reasoning capabilities of Marco-o1 and Llama 3.2, focusing on the depth and explanation of their outputs in advanced reasoning scenarios.
  • Examine the practical applications of Marco-o1 in real-world problem-solving, including mathematical, logical, and multilingual tasks.

This article was published as a part of the Data Science Blogathon.

What is Marco-o1?

Marco-o1 is an advanced reasoning model developed by the MarcoPolo Team at Alibaba International Digital Commerce, designed to tackle open-ended problem-solving tasks.

It is built upon the Qwen2 architecture and employs a sophisticated combination of Chain-of-Thought (CoT) fine-tuning and Monte Carlo Tree Search (MCTS) techniques to enhance its reasoning capabilities.

Training Datasets

By fine-tuning Qwen2-7B-Instruct with a combination of the filtered Open-O1 CoT dataset, Marco-o1 CoT dataset, and Marco-o1 Instruction dataset, Marco-o1 improved its handling of complex tasks.

  • Open-O1 CoT Dataset: Refined through heuristic filtering to promote structured reasoning patterns.
  • Marco-o1 CoT Dataset: Generated using MCTS to formulate complex reasoning pathways.
  • Marco-o1 Instruction Dataset: Focused on enhancing instruction-following capabilities across diverse tasks.
Overview of Marco-o1

The image below illustrates the inference process for Marco-o1, detailing the use of datasets like Open-O1 CoT and Marco-o1 CoT. The process involves selecting prompt paths, performing MCTS, and applying supervised fine-tuning for better accuracy. This leads to the generation of a final answer with confidence scores.

Overview of Marco-o1's inference process
Source: HuggingFace

Techniques For Advanced Reasoning

This section focuses on the sophisticated methods that enable the model to handle complex tasks: reasoning through multiple steps, optimizing decision-making, and incorporating uncertainty to produce more accurate predictions and responses.

Monte Carlo Tree Search (MCTS)

MCTS is used to determine the best answer to a user query by exploring possible answers through random sampling. As shown in the figure above, nodes in MCTS represent different reasoning paths; yellow nodes are selected for further exploration, green nodes represent final answers, and arrows such as "Select" and "Backup" show how the system evaluates and refines its choices.
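
To make the idea concrete, below is a minimal, self-contained sketch of MCTS-style answer selection. It is a toy illustration rather than Marco-o1's actual implementation: the candidate reasoning steps, the score_path reward (which in Marco-o1 would come from the model's confidence score), and the exploration constant are all placeholders.

import math
import random

# Toy MCTS over candidate reasoning steps (for illustration only).
CANDIDATE_STEPS = ["restate the problem", "list known facts",
                   "compute an intermediate result", "verify the result",
                   "state the answer"]
MAX_DEPTH = 3

class Node:
    def __init__(self, path, parent=None):
        self.path = path          # reasoning steps chosen so far
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value = 0.0          # accumulated reward from rollouts

    def ucb(self, c=1.4):
        # Upper Confidence Bound: balances exploitation and exploration.
        if self.visits == 0:
            return float("inf")
        return (self.value / self.visits
                + c * math.sqrt(math.log(self.parent.visits) / self.visits))

def score_path(path):
    # Placeholder reward; in Marco-o1 this would be a confidence score
    # derived from the model's own token probabilities.
    return random.random() + 0.1 * len(path)

def mcts(root, iterations=200):
    for _ in range(iterations):
        # 1. Selection: descend by highest UCB until reaching a leaf.
        node = root
        while node.children:
            node = max(node.children, key=Node.ucb)
        # 2. Expansion: add child nodes unless the path is already complete.
        if len(node.path) < MAX_DEPTH:
            node.children = [Node(node.path + [s], parent=node)
                             for s in CANDIDATE_STEPS]
            node = random.choice(node.children)
        # 3. Rollout: finish the path randomly and score it.
        rollout = node.path + random.choices(CANDIDATE_STEPS,
                                             k=MAX_DEPTH - len(node.path))
        reward = score_path(rollout)
        # 4. Backup: propagate the reward back to the root.
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    # The most-visited first move is the most promising reasoning step.
    return max(root.children, key=lambda n: n.visits).path

print(mcts(Node([])))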

Confidence Score

After generating an answer, the system calculates a confidence score from the token probabilities (the formula appears in the figure above) and uses it to refine the final output.
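
The exact formula is shown only as an image in the original post. The sketch below illustrates one plausible reading, assuming (per the Marco-o1 paper's description) that each answer token's confidence is its probability normalized against the top-5 candidate token probabilities at that position, and that the final score is the average over all answer tokens. The log-probability values here are made up purely for illustration.

import math

def token_confidence(top5_logprobs):
    # top5_logprobs: log-probabilities of the five most likely tokens at this
    # position, with the token that was actually generated listed first.
    probs = [math.exp(lp) for lp in top5_logprobs]
    return probs[0] / sum(probs)      # softmax-style normalization over top-5

def answer_confidence(per_token_top5_logprobs):
    # Average the per-token confidences over the whole answer.
    scores = [token_confidence(t) for t in per_token_top5_logprobs]
    return sum(scores) / len(scores)

# Made-up log-probabilities for a three-token answer, just to show the call.
example = [[-0.1, -2.3, -3.0, -4.1, -5.2],
           [-0.5, -1.0, -2.8, -3.5, -4.0],
           [-0.2, -2.0, -2.5, -3.3, -4.4]]
print(round(answer_confidence(example), 3))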

Action Strategy

The model can search at two levels of granularity: the step level, where each node is a complete reasoning step, and the mini-step level, where steps are broken into smaller token chunks.

Different levels of granularity were explored in the MCTS search. To expand the model’s search space and enhance its problem-solving capabilities, steps were divided into smaller units of 64 or 32 tokens, referred to as “mini-step.” This finer granularity allowed the model to explore reasoning paths in greater detail.
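
As a rough illustration, the snippet below splits a generated reasoning trace into fixed-size mini-steps. Marco-o1 operates on model tokens; here a simple whitespace split stands in for a real tokenizer, so the chunk sizes are only approximate.

def split_into_ministeps(text, tokens_per_step=64):
    # Whitespace "tokens" stand in for real model tokens in this sketch.
    tokens = text.split()
    return [" ".join(tokens[i:i + tokens_per_step])
            for i in range(0, len(tokens), tokens_per_step)]

trace = "First, restate the problem and list the known quantities. " * 20
for i, step in enumerate(split_into_ministeps(trace, tokens_per_step=32)):
    print(f"mini-step {i}: {len(step.split())} tokens")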

Reflection after Thinking

A reflection mechanism is introduced by appending the phrase "Wait! Maybe I made some mistakes! I need to rethink from scratch." to the end of each thought process. This prompts the model to self-reflect and reevaluate its reasoning steps. The reflection has yielded significant improvements, especially on difficult problems that the original model initially solved incorrectly.
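
Using the same Ollama setup described later in this article, the effect can be approximated from the outside by adding the trigger phrase to the prompt. This is only a prompt-level imitation of the mechanism; in Marco-o1 the phrase is appended to the model's own thought process.

from langchain_core.prompts import ChatPromptTemplate
from langchain_ollama.llms import OllamaLLM

reflection_template = """Question: {question}

Think step by step. After your first attempt, add the line:
"Wait! Maybe I made some mistakes! I need to rethink from scratch."
and then re-check your reasoning before giving the final answer."""

prompt = ChatPromptTemplate.from_template(reflection_template)
chain = prompt | OllamaLLM(model="marco-o1")

print(chain.invoke({"question": "How many r in strawberry?"}))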

Key Features

  • Open-Ended Reasoning: Unlike traditional models that excel in standard answer domains (like mathematics or coding), Marco-o1 emphasizes open-ended resolutions, making it suitable for a broader range of applications where clear standards are absent.
  • Exploration of Solutions: The MCTS implementation allows the model to explore multiple solution paths, akin to a chess player considering various moves before making a decision. This approach helps in identifying the most promising strategies for problem-solving.
  • Flexible Reasoning Strategies: Marco-o1 adapts its reasoning strategies based on the type of problem it encounters, effectively breaking down complex tasks into manageable steps.

Applications

Marco-o1 is particularly effective for:

  • Complex problem-solving scenarios where traditional answers may not suffice.
  • Mathematical reasoning tasks.
  • Sophisticated translation tasks requiring nuanced understanding.

What is Llama 3.2?

The Llama 3.2 family includes 1-billion (1B) and 3-billion (3B) parameter text models, which are designed for mobile and edge devices and focus on efficient performance for applications like summarization and instruction following.

Model Architecture

Llama 3.2 was pretrained on up to 9 trillion tokens from publicly available sources, incorporating knowledge distillation techniques from larger models (like Llama 3.1) to enhance performance while maintaining a smaller size.

Overview of Llama 3.2 Text Models
Source: Medium
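
For readers unfamiliar with knowledge distillation, the snippet below sketches the standard soft-label distillation loss: a KL-divergence term between temperature-softened teacher and student logits combined with the usual cross-entropy. It is a generic illustration of the technique, not Meta's actual training recipe; the logits, labels, and temperature are arbitrary placeholders.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: KL divergence between temperature-softened distributions.
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)
    # Hard targets: ordinary cross-entropy against the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Dummy batch: 4 examples over a toy vocabulary of 10 tokens.
student = torch.randn(4, 10, requires_grad=True)
teacher = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
print(distillation_loss(student, teacher, labels))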

Key Features

  • Optimized for Edge Devices: The model is designed to be lightweight, making it suitable for deployment on mobile and edge devices.
  • Extended Context Length: Llama 3.2 supports a context length of up to 128K tokens (~96,240 words), which facilitates handling long inputs and maintaining context over extended interactions.
  • Support for Multilingual Dialogue: The model is optimized for multilingual use cases, making it effective in applications that require interaction in multiple languages.

Applications

Llama 3.2 3B demonstrated notable performance in specific areas, particularly in reasoning tasks. In the ARC Challenge, it achieved a score of 78.6, surpassing Gemma's 76.7, while trailing Phi-3.5-mini, which scored 87.4. Likewise, on the HellaSwag benchmark, Llama 3.2 3B scored 69.8, outperforming Gemma and staying competitive with Phi.

Hence, in the hands-on Python implementation that follows, we run a comparative assessment of reasoning-based questions on the two models, Marco-o1 and Llama 3.2 3B. The comparison is primarily meant to check whether Marco-o1's outputs really excel on reasoning-based questions.

Running Models on Google Colab using Ollama

Ollama is an advanced AI tool that allows users to easily set up and run large language models locally (in CPU and GPU modes). We will explore how to run these models on Google Colab using Ollama in the following steps.

Step 1: Installation of Libraries

Below, we install all the required libraries:

!sudo apt update
!sudo apt install -y pciutils                     # used by Ollama to detect GPUs
!pip install langchain-ollama                     # LangChain bindings for Ollama
!curl -fsSL https://ollama.com/install.sh | sh    # install the Ollama server
!pip install ollama==0.4.2                        # Python client for the Ollama API

Step 2: Enabling the Threading Process to Run Ollama on Google Colab

In this step, we use threading to run the Ollama server in the background on Google Colab. Starting ollama serve in a separate thread keeps the server alive while subsequent notebook cells execute, which is essential for running resource-intensive operations seamlessly within the Colab environment.

import threading
import subprocess
import time

def run_ollama_serve():
    # Start the Ollama server as a background process.
    subprocess.Popen(["ollama", "serve"])

# Run the server in a separate thread so later notebook cells can execute.
thread = threading.Thread(target=run_ollama_serve)
thread.start()
time.sleep(5)  # give the server a few seconds to start up
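
The fixed time.sleep(5) is usually enough, but a slightly more robust variant (sketched below) polls Ollama's default local endpoint, http://127.0.0.1:11434, until the server responds, assuming the default port has not been changed.

import time
import urllib.request

def wait_for_ollama(url="http://127.0.0.1:11434", timeout=60):
    # Poll the Ollama server until it responds or the timeout expires.
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            urllib.request.urlopen(url, timeout=2)
            return True
        except Exception:
            time.sleep(1)
    return False

print("Ollama ready:", wait_for_ollama())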

Step 3: Pulling the Ollama Model

!ollama pull marco-o1

We can use the same code to pull the llama3.2 model by replacing marco-o1 with llama3.2, as shown below.
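
For the comparison later in this article, Llama 3.2 can be pulled the same way (on Ollama, the plain llama3.2 tag currently resolves to the 3B instruct model):

!ollama pull llama3.2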

Step 4: Querying the Model

This step involves sending queries to the model to get responses or insights based on the input. It helps in interacting with the model for tasks like generating text or answering questions.

from langchain_core.prompts import ChatPromptTemplate
from langchain_ollama.llms import OllamaLLM
from IPython.display import Markdown, display

# Prompt template with a single {question} placeholder
template = """Question: {question}"""

prompt = ChatPromptTemplate.from_template(template)

# Load the marco-o1 model served by the local Ollama instance
model = OllamaLLM(model="marco-o1")

# Pipe the prompt into the model to form a runnable chain
chain = prompt | model

# Prepare input for invocation
input_data = {
    "question": 'I have 2 apples, then I buy 2 more. I bake a pie with 2 of the apples. After eating half of the pie how many apples do I have left?'}

# Invoke the chain with input data and display the response in Markdown format
response = chain.invoke(input_data)
display(Markdown(response))
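
Since the comparison that follows sends the same question to both models, a small helper like the one below (a convenience sketch, not part of the original code) avoids repeating the chain setup. It assumes both marco-o1 and llama3.2 have already been pulled.

from langchain_core.prompts import ChatPromptTemplate
from langchain_ollama.llms import OllamaLLM

def ask_both(question, models=("marco-o1", "llama3.2")):
    # Send the same question to each model and collect the answers.
    prompt = ChatPromptTemplate.from_template("Question: {question}")
    answers = {}
    for name in models:
        chain = prompt | OllamaLLM(model=name)
        answers[name] = chain.invoke({"question": question})
    return answers

for name, answer in ask_both("How many r in strawberry?").items():
    print(f"--- {name} ---\n{answer}\n")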

Let’s Begin the Comparison: Marco-o1 vs Llama 3.2

In this section, we will compare the outputs of Marco-o1 and Llama 3.2, highlighting their strengths and differences in handling complex reasoning tasks and real-time applications. By examining their responses, we can better understand how each model approaches problem-solving and adapts to different use cases.

Task 1: Logical Reasoning

“I have 2 apples, then I buy 2 more. I bake a pie with 2 of the apples. After eating 
half of the pie how many apples do I have left?”

Output from Marco-o1

Output from Llama 3.2 (3B Model)

Both models provide accurate responses (2 apples remain), but Marco-o1 offers a more detailed explanation than Llama 3.2.

Task 2: Strawberry Test

"How many r in strawberry?”

Output from Marco-o1

Output from Llama 3.2 (3B Model)

As can be seen from the outputs above, the response from the Llama 3.2 model is inaccurate, while the Marco-o1 model correctly counts three r's in "strawberry".

Task 3: Geometry Based Reasoning

“What is the area of a triangle with a base of 10 units and a height of 5 units?”

Output from Marco-o1

Output from Llama 3.2 (3B Model)

As can be seen from the outputs above, both models give accurate responses (½ × 10 × 5 = 25 square units), but Marco-o1's response is a little more detailed than Llama 3.2's.

Task 4: Step-by-Step Reasoning

"If a car costs $20,000 and depreciates by $1,000 each year, how much will it be 
worth after three years?"

Output from Marco-o1

Output from Llama 3.2 (3B Model)

As can be seen from the outputs above, both models give accurate responses ($20,000 − 3 × $1,000 = $17,000), but Marco-o1's response is a little more detailed than Llama 3.2's.

Task 5: Syllogism with Ambiguity

“All birds can fly. Penguins are birds. Can penguins fly?”

Output from Marco-o1

Output from Llama 3.2 (3B Model)

As can be seen from the outputs above, both models give accurate responses, but Marco-o1's response is far more elaborate, presenting numerous arguments and double-checks before arriving at the answer, compared with Llama 3.2.

Task 6: Fragile Mathematical Context

“Oliver picks 44 kiwis on Friday, then 58 on Saturday. On Sunday, he picks double what he did on Friday, but five of them were smaller than average. How many kiwis does Oliver have?”

Output from Marco-o1

Output from Llama 3.2 (3B Model)

As can be seen from the outputs above, the response from Llama 3.2 is inaccurate: it gets confused by the additional information in the query ("but five of them were smaller than average") and subtracts 5 from the actual answer. The output from Marco-o1 is accurate, with a detailed explanation (44 + 58 + 2 × 44 = 190 kiwis; the smaller kiwis still count).

Task 7: Contradictory Information

“John is allergic to peanuts. He ate a peanut butter sandwich and felt fine. What
can we conclude about John's allergy?”

Output from Marco-o1

Output from Llama 3.2 (3B Model)

Marco-o1's response is thorough and elaborate, presenting numerous arguments and double-checks before arriving at an answer. The response from Llama 3.2 does not appear to be fully accurate: the suggestion that "he simply had a stomach upset or an intolerance to the peanut butter" contradicts the information given in the query.

Result: Marco-o1 vs Llama 3.2

| Task | Marco-o1 Performance | Llama 3.2 (3B Model) Performance | Winner |
| --- | --- | --- | --- |
| Task 1: Logical Reasoning | Accurate with detailed explanations | Accurate but less detailed | Marco-o1 |
| Task 2: Strawberry Test | Accurate | Inaccurate | Marco-o1 |
| Task 3: Geometry Reasoning | Accurate with detailed explanations | Accurate but less detailed | Marco-o1 |
| Task 4: Step-by-Step Reasoning | Accurate with detailed explanations | Accurate but less detailed | Marco-o1 |
| Task 5: Syllogism with Ambiguity | Accurate with elaborate explanations and double-checks | Accurate but less detailed | Marco-o1 |
| Task 6: Fragile Mathematical Context | Accurate with detailed explanations | Inaccurate (confused by additional information) | Marco-o1 |
| Task 7: Contradictory Information | Accurate with elaborate explanations and double-checks | Inaccurate (provided contradictory information) | Marco-o1 |

Conclusion

The Marco-o1 model represents a significant advancement in AI's ability to handle complex reasoning tasks, particularly through its innovative use of Monte Carlo Tree Search and Chain-of-Thought fine-tuning. Its versatility across domains such as mathematics, logical reasoning, and multilingual tasks sets it apart from traditional models. Meanwhile, the Llama 3.2 model offers efficient performance for edge devices, excelling in tasks like summarization and instruction-following. Both models showcase the ongoing evolution of AI, each excelling in its own domain, and together they highlight the broad potential of advanced language models in solving real-world challenges.

Key Takeaways

  • Marco-o1 uses Chain-of-Thought fine-tuning and Monte Carlo Tree Search for advanced problem-solving.
  • It adapts reasoning strategies, breaks down challenges, and explores multiple solutions.
  • A reflection mechanism improves accuracy by reevaluating reasoning steps.
  • Llama 3.2 is optimized for mobile/edge devices, excelling in summarization and instruction-following.
  • It supports long inputs with a 128K token context for extended interactions.
  • Marco-o1 delivers detailed, explanatory responses with thorough checks for complex queries.

Frequently Asked Questions

Q1. How does Marco-o1 adapt its reasoning strategies to different tasks?

A. Marco-o1 adjusts its reasoning strategies based on the complexity of the task at hand, breaking down challenges into manageable steps and exploring various solution paths using Monte Carlo Tree Search to find the optimal approach.

Q2. How does Monte Carlo Tree Search (MCTS) enhance the reasoning abilities of Marco-o1?

A. MCTS enables Marco-o1 to explore multiple potential solutions for a given problem, selecting the most promising paths through random sampling, leading to more accurate and efficient problem-solving.

Q3. What is the purpose of the reflection mechanism in Marco-o1?

A. The reflection mechanism allows Marco-o1 to reevaluate its reasoning steps at the end of each process, helping the model improve accuracy and refine its answers, especially for highly complex queries.

Q4. How do Marco-o1 and Llama 3.2 compare in terms of handling complex reasoning tasks?

A. Marco-o1 is specialized for tackling complex reasoning tasks using advanced techniques like Chain-of-Thought fine-tuning and MCTS. Llama 3.2 excels in efficient, real-time applications on mobile and edge devices, with extended context handling.

Q5. What is the significance of the Llama 3.2 model’s lightweight design?

A. The lightweight design of Llama 3.2 makes it ideal for deployment on mobile and edge devices, offering efficient performance while maintaining the ability to handle diverse tasks such as summarization and multilingual interactions.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

Nibedita completed her master’s in Chemical Engineering from IIT Kharagpur in 2014 and is currently working as a Senior Data Scientist. In her current capacity, she works on building intelligent ML-based solutions to improve business processes.
