Marco-o1 vs Llama 3.2: Which is Better?

Nibedita Dutta | Last Updated: 31 Dec, 2024
10 min read

OpenAI’s o1 model has generated considerable excitement in the field of large reasoning models (LRMs) due to its advanced capabilities in tackling complex problems. Building on this foundation, Marco-o1 emerges as a new LRM that not only emphasizes traditional disciplines such as mathematics and coding but also prioritizes open-ended problem-solving across a variety of domains. A key focus of Marco-o1 is to explore the extent to which the o1 model can generalize its reasoning abilities to areas that lack clear standards and quantifiable rewards. This exploration is crucial for understanding the potential applications of LRMs in real-world scenarios where conventional metrics may not apply, thereby pushing the boundaries of what these models can achieve.

Learning Objectives

  • Understand the architecture and key techniques behind the Marco-o1 model, including Chain-of-Thought fine-tuning and Monte Carlo Tree Search.
  • Explore how Marco-o1 adapts its reasoning strategies for complex, open-ended problem-solving tasks across various domains.
  • Analyze the role of the reflection mechanism in improving reasoning accuracy by prompting self-evaluation of the model’s outputs.
  • Compare the reasoning capabilities of Marco-o1 and Llama 3.2, focusing on the depth and explanation of their outputs in advanced reasoning scenarios.
  • Examine the practical applications of Marco-o1 in real-world problem-solving, including mathematical, logical, and multilingual tasks.

This article was published as a part of the Data Science Blogathon.

What is Marco-o1?

Marco-o1 is an advanced reasoning model developed by the MarcoPolo Team at Alibaba International Digital Commerce, designed to tackle open-ended problem-solving tasks.

It is built upon the Qwen2 architecture and employs a sophisticated combination of Chain-of-Thought (CoT) fine-tuning and Monte Carlo Tree Search (MCTS) techniques to enhance its reasoning capabilities.

Training Datasets

By fine-tuning Qwen2-7B-Instruct with a combination of the filtered Open-O1 CoT dataset, Marco-o1 CoT dataset, and Marco-o1 Instruction dataset, Marco-o1 improved its handling of complex tasks.

  • Open-O1 CoT Dataset: Refined through heuristic filtering to promote structured reasoning patterns.
  • Marco-o1 CoT Dataset: Generated using MCTS to formulate complex reasoning pathways.
  • Marco-o1 Instruction Dataset: Focused on enhancing instruction-following capabilities across diverse tasks.
Overview of Marco-o1

The image below illustrates the inference process for Marco-o1, detailing the use of datasets like Open-O1 CoT and Marco-o1 CoT. The process involves selecting prompt paths, performing MCTS, and applying supervised fine-tuning for better accuracy. This leads to the generation of a final answer with confidence scores.

Overview of Marco-o1's inference process
Source: HuggingFace

Techniques For Advanced Reasoning

This section focuses on the sophisticated methods that enable the model to handle complex tasks: reasoning through multiple steps, optimizing decision-making, and incorporating uncertainty to produce more accurate predictions and responses.

Monte Carlo Tree Search (MCTS)

MCTS is used to determine the best answer to a user query by exploring possible answers through random sampling. As shown in the figure above, nodes in MCTS represent different reasoning paths; yellow nodes are selected for further exploration, green nodes represent final answers, and arrows such as "Select" and "Backup" show how the system evaluates and refines its choices.
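
To make the idea concrete, below is a minimal, self-contained sketch of MCTS-style answer selection. It is a toy illustration rather than Marco-o1's actual implementation: the candidate reasoning steps, the score_path reward (which in Marco-o1 would come from the model's confidence score), and the exploration constant are all placeholders.

import math
import random

# Toy MCTS over candidate reasoning steps (for illustration only).
CANDIDATE_STEPS = ["restate the problem", "list known facts",
                   "compute an intermediate result", "verify the result",
                   "state the answer"]
MAX_DEPTH = 3

class Node:
    def __init__(self, path, parent=None):
        self.path = path          # reasoning steps chosen so far
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value = 0.0          # accumulated reward from rollouts

    def ucb(self, c=1.4):
        # Upper Confidence Bound: balances exploitation and exploration.
        if self.visits == 0:
            return float("inf")
        return (self.value / self.visits
                + c * math.sqrt(math.log(self.parent.visits) / self.visits))

def score_path(path):
    # Placeholder reward; in Marco-o1 this would be a confidence score
    # derived from the model's own token probabilities.
    return random.random() + 0.1 * len(path)

def mcts(root, iterations=200):
    for _ in range(iterations):
        # 1. Selection: descend by highest UCB until reaching a leaf.
        node = root
        while node.children:
            node = max(node.children, key=Node.ucb)
        # 2. Expansion: add child nodes unless the path is already complete.
        if len(node.path) < MAX_DEPTH:
            node.children = [Node(node.path + [s], parent=node)
                             for s in CANDIDATE_STEPS]
            node = random.choice(node.children)
        # 3. Rollout: finish the path randomly and score it.
        rollout = node.path + random.choices(CANDIDATE_STEPS,
                                             k=MAX_DEPTH - len(node.path))
        reward = score_path(rollout)
        # 4. Backup: propagate the reward back to the root.
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    # The most-visited first move is the most promising reasoning step.
    return max(root.children, key=lambda n: n.visits).path

print(mcts(Node([])))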

Confidence Score

After generating an answer, the system calculates a confidence score from the token probabilities (the formula appears in the figure above) and uses it to refine the final output.
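
The exact formula is shown only as an image in the original post. The sketch below illustrates one plausible reading, assuming (per the Marco-o1 paper's description) that each answer token's confidence is its probability normalized against the top-5 candidate token probabilities at that position, and that the final score is the average over all answer tokens. The log-probability values here are made up purely for illustration.

import math

def token_confidence(top5_logprobs):
    # top5_logprobs: log-probabilities of the five most likely tokens at this
    # position, with the token that was actually generated listed first.
    probs = [math.exp(lp) for lp in top5_logprobs]
    return probs[0] / sum(probs)      # softmax-style normalization over top-5

def answer_confidence(per_token_top5_logprobs):
    # Average the per-token confidences over the whole answer.
    scores = [token_confidence(t) for t in per_token_top5_logprobs]
    return sum(scores) / len(scores)

# Made-up log-probabilities for a three-token answer, just to show the call.
example = [[-0.1, -2.3, -3.0, -4.1, -5.2],
           [-0.5, -1.0, -2.8, -3.5, -4.0],
           [-0.2, -2.0, -2.5, -3.3, -4.4]]
print(round(answer_confidence(example), 3))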

Action Strategy

The model can search at two levels of granularity: the step level, where each node is a complete reasoning step, and the mini-step level, where steps are broken into smaller token chunks.

Different levels of granularity were explored in the MCTS search. To expand the model’s search space and enhance its problem-solving capabilities, steps were divided into smaller units of 64 or 32 tokens, referred to as “mini-step.” This finer granularity allowed the model to explore reasoning paths in greater detail.
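
As a rough illustration, the snippet below splits a generated reasoning trace into fixed-size mini-steps. Marco-o1 operates on model tokens; here a simple whitespace split stands in for a real tokenizer, so the chunk sizes are only approximate.

def split_into_ministeps(text, tokens_per_step=64):
    # Whitespace "tokens" stand in for real model tokens in this sketch.
    tokens = text.split()
    return [" ".join(tokens[i:i + tokens_per_step])
            for i in range(0, len(tokens), tokens_per_step)]

trace = "First, restate the problem and list the known quantities. " * 20
for i, step in enumerate(split_into_ministeps(trace, tokens_per_step=32)):
    print(f"mini-step {i}: {len(step.split())} tokens")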

Reflection after Thinking

A reflection mechanism is introduced by appending the phrase "Wait! Maybe I made some mistakes! I need to rethink from scratch." to the end of each thought process. This prompts the model to self-reflect and reevaluate its reasoning steps. The reflection has yielded significant improvements, especially on difficult problems that the original model initially solved incorrectly.
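
Using the same Ollama setup described later in this article, the effect can be approximated from the outside by adding the trigger phrase to the prompt. This is only a prompt-level imitation of the mechanism; in Marco-o1 the phrase is appended to the model's own thought process.

from langchain_core.prompts import ChatPromptTemplate
from langchain_ollama.llms import OllamaLLM

reflection_template = """Question: {question}

Think step by step. After your first attempt, add the line:
"Wait! Maybe I made some mistakes! I need to rethink from scratch."
and then re-check your reasoning before giving the final answer."""

prompt = ChatPromptTemplate.from_template(reflection_template)
chain = prompt | OllamaLLM(model="marco-o1")

print(chain.invoke({"question": "How many r in strawberry?"}))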

Key Features

  • Open-Ended Reasoning: Unlike traditional models that excel in standard answer domains (like mathematics or coding), Marco-o1 emphasizes open-ended resolutions, making it suitable for a broader range of applications where clear standards are absent.
  • Exploration of Solutions: The MCTS implementation allows the model to explore multiple solution paths, akin to a chess player considering various moves before making a decision. This approach helps in identifying the most promising strategies for problem-solving.
  • Flexible Reasoning Strategies: Marco-o1 adapts its reasoning strategies based on the type of problem it encounters, effectively breaking down complex tasks into manageable steps.

Applications

Marco-o1 is particularly effective for:

  • Complex problem-solving scenarios where traditional answers may not suffice.
  • Mathematical reasoning tasks.
  • Sophisticated translation tasks requiring nuanced understanding.

What is Llama 3.2?

The Llama 3.2 family includes 1-billion (1B) and 3-billion (3B) parameter text models, which are designed for mobile and edge devices and focus on efficient performance for applications like summarization and instruction following.

Model Architecture

Llama 3.2 was pretrained on up to 9 trillion tokens from publicly available sources, incorporating knowledge distillation techniques from larger models (like Llama 3.1) to enhance performance while maintaining a smaller size.

Overview of Llama 3.2 Text Models
Source: Medium
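
For readers unfamiliar with knowledge distillation, the snippet below sketches the standard soft-label distillation loss: a KL-divergence term between temperature-softened teacher and student logits combined with the usual cross-entropy. It is a generic illustration of the technique, not Meta's actual training recipe; the logits, labels, and temperature are arbitrary placeholders.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: KL divergence between temperature-softened distributions.
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)
    # Hard targets: ordinary cross-entropy against the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Dummy batch: 4 examples over a toy vocabulary of 10 tokens.
student = torch.randn(4, 10, requires_grad=True)
teacher = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
print(distillation_loss(student, teacher, labels))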

Key Features

  • Optimized for Edge Devices: The model is designed to be lightweight, making it suitable for deployment on mobile and edge devices.
  • Extended Context Length: Llama 3.2 supports a context length of up to 128K tokens (~96,240 words), which facilitates handling long inputs and maintaining context over extended interactions.
  • Support for Multilingual Dialogue: The model is optimized for multilingual use cases, making it effective in applications that require interaction in multiple languages.

Applications

Llama 3.2 3B demonstrated notable performance in specific areas, particularly in reasoning tasks. In the ARC Challenge, it achieved a score of 78.6, surpassing Gemma's 76.7, while trailing Phi-3.5-mini, which scored 87.4. Likewise, on the HellaSwag benchmark, Llama 3.2 3B scored 69.8, outperforming Gemma and staying competitive with Phi.

Hence, in the hands-on Python implementation that follows, we run a comparative assessment of reasoning-based questions on the two models, Marco-o1 and Llama 3.2 3B. The comparison is primarily meant to check whether Marco-o1's outputs really excel on reasoning-based questions.

Running Models on Google Colab using Ollama

Ollama is an advanced AI tool that allows users to easily set up and run large language models locally (in CPU and GPU modes). We will explore how to run these models on Google Colab using Ollama in the following steps.

Step 1: Installation of Libraries

Below, we install all the required libraries:

!sudo apt update
!sudo apt install -y pciutils                     # used by Ollama to detect GPUs
!pip install langchain-ollama                     # LangChain bindings for Ollama
!curl -fsSL https://ollama.com/install.sh | sh    # install the Ollama server
!pip install ollama==0.4.2                        # Python client for the Ollama API

Step 2: Enabling the Threading Process to Run Ollama on Google Colab

In this step, we use threading to run the Ollama server in the background on Google Colab. Starting ollama serve in a separate thread keeps the server alive while subsequent notebook cells execute, which is essential for running resource-intensive operations seamlessly within the Colab environment.

import threading
import subprocess
import time

def run_ollama_serve():
    # Start the Ollama server as a background process.
    subprocess.Popen(["ollama", "serve"])

# Run the server in a separate thread so later notebook cells can execute.
thread = threading.Thread(target=run_ollama_serve)
thread.start()
time.sleep(5)  # give the server a few seconds to start up
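
The fixed time.sleep(5) is usually enough, but a slightly more robust variant (sketched below) polls Ollama's default local endpoint, http://127.0.0.1:11434, until the server responds, assuming the default port has not been changed.

import time
import urllib.request

def wait_for_ollama(url="http://127.0.0.1:11434", timeout=60):
    # Poll the Ollama server until it responds or the timeout expires.
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            urllib.request.urlopen(url, timeout=2)
            return True
        except Exception:
            time.sleep(1)
    return False

print("Ollama ready:", wait_for_ollama())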

Step 3: Pulling the Ollama Model

!ollama pull marco-o1

We can use the same code to pull the llama3.2 model by replacing marco-o1 with llama3.2, as shown below.
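
For the comparison later in this article, Llama 3.2 can be pulled the same way (on Ollama, the plain llama3.2 tag currently resolves to the 3B instruct model):

!ollama pull llama3.2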

Step 4: Querying the Model

This step involves sending queries to the model to get responses or insights based on the input. It helps in interacting with the model for tasks like generating text or answering questions.

from langchain_core.prompts import ChatPromptTemplate
from langchain_ollama.llms import OllamaLLM
from IPython.display import Markdown, display

# Prompt template with a single {question} placeholder
template = """Question: {question}"""

prompt = ChatPromptTemplate.from_template(template)

# Load the marco-o1 model served by the local Ollama instance
model = OllamaLLM(model="marco-o1")

# Pipe the prompt into the model to form a runnable chain
chain = prompt | model

# Prepare input for invocation
input_data = {
    "question": 'I have 2 apples, then I buy 2 more. I bake a pie with 2 of the apples. After eating half of the pie how many apples do I have left?'}

# Invoke the chain with input data and display the response in Markdown format
response = chain.invoke(input_data)
display(Markdown(response))
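
Since the comparison that follows sends the same question to both models, a small helper like the one below (a convenience sketch, not part of the original code) avoids repeating the chain setup. It assumes both marco-o1 and llama3.2 have already been pulled.

from langchain_core.prompts import ChatPromptTemplate
from langchain_ollama.llms import OllamaLLM

def ask_both(question, models=("marco-o1", "llama3.2")):
    # Send the same question to each model and collect the answers.
    prompt = ChatPromptTemplate.from_template("Question: {question}")
    answers = {}
    for name in models:
        chain = prompt | OllamaLLM(model=name)
        answers[name] = chain.invoke({"question": question})
    return answers

for name, answer in ask_both("How many r in strawberry?").items():
    print(f"--- {name} ---\n{answer}\n")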

Let’s Begin the Comparison: Marco-o1 vs Llama 3.2

In this section, we will compare the outputs of Marco-o1 and Llama 3.2, highlighting their strengths and differences in handling complex reasoning tasks and real-time applications. By examining their responses, we can better understand how each model approaches problem-solving and adapts to different use cases.

Task 1: Logical Reasoning

“I have 2 apples, then I buy 2 more. I bake a pie with 2 of the apples. After eating 
half of the pie how many apples do I have left?”

Output from Marco-o1

Output from Llama 3.2 (3B Model)

Both models provide accurate responses (2 apples remain), but Marco-o1 offers a more detailed explanation than Llama 3.2.

Task 2: Strawberry Test

"How many r in strawberry?”

Output from Marco-o1

Output from Llama 3.2 (3B Model)

As can be seen from the outputs above, the response from the Llama 3.2 model is inaccurate, while the Marco-o1 model correctly counts three r's in "strawberry".

Task 3: Geometry Based Reasoning

“What is the area of a triangle with a base of 10 units and a height of 5 units?”

Output from Marco-o1

Output from Llama 3.2 (3B Model)

As can be seen from the outputs above, both models give accurate responses (½ × 10 × 5 = 25 square units), but Marco-o1's response is a little more detailed than Llama 3.2's.

Task 4: Step-by-Step Reasoning

"If a car costs $20,000 and depreciates by $1,000 each year, how much will it be 
worth after three years?"

Output from Marco-o1

Output from Llama 3.2 (3B Model)

As can be seen from the outputs above, both models give accurate responses ($20,000 − 3 × $1,000 = $17,000), but Marco-o1's response is a little more detailed than Llama 3.2's.

Task 5: Syllogism with Ambiguity

“All birds can fly. Penguins are birds. Can penguins fly?”

Output from Marco-o1

Output from Llama 3.2 (3B Model)

As can be seen from the outputs above, both models give accurate responses, but Marco-o1's response is far more elaborate, presenting numerous arguments and double-checks before arriving at the answer, compared with Llama 3.2.

Task 6: Fragile Mathematical Context

“Oliver picks 44 kiwis on Friday, then 58 on Saturday. On Sunday, he picks double what he did on Friday, but five of them were smaller than average. How many kiwis does Oliver have?”

Output from Marco-o1

Output from Llama 3.2 (3B Model)

As can be seen from the outputs above, the response from Llama 3.2 is inaccurate: it gets confused by the additional information in the query ("but five of them were smaller than average") and subtracts 5 from the actual answer. The output from Marco-o1 is accurate, with a detailed explanation (44 + 58 + 2 × 44 = 190 kiwis; the smaller kiwis still count).

Task 7: Contradictory Information

“John is allergic to peanuts. He ate a peanut butter sandwich and felt fine. What
can we conclude about John's allergy?”

Output from Marco-o1

Output from Llama 3.2 (3B Model)

Marco-o1's response is thorough and elaborate, presenting numerous arguments and double-checks before arriving at an answer. The response from Llama 3.2 does not appear to be fully accurate: the suggestion that "he simply had a stomach upset or an intolerance to the peanut butter" contradicts the information given in the query.

Result: Marco-o1 vs Llama 3.2

| Task | Marco-o1 Performance | Llama 3.2 (3B Model) Performance | Winner |
| --- | --- | --- | --- |
| Task 1: Logical Reasoning | Accurate with detailed explanations | Accurate but less detailed | Marco-o1 |
| Task 2: Strawberry Test | Accurate | Inaccurate | Marco-o1 |
| Task 3: Geometry Reasoning | Accurate with detailed explanations | Accurate but less detailed | Marco-o1 |
| Task 4: Step-by-Step Reasoning | Accurate with detailed explanations | Accurate but less detailed | Marco-o1 |
| Task 5: Syllogism with Ambiguity | Accurate with elaborate explanations and double-checks | Accurate but less detailed | Marco-o1 |
| Task 6: Fragile Mathematical Context | Accurate with detailed explanations | Inaccurate (confused by additional information) | Marco-o1 |
| Task 7: Contradictory Information | Accurate with elaborate explanations and double-checks | Inaccurate (provided contradictory information) | Marco-o1 |

Conclusion

The Marco-o1 model represents a significant advancement in AI's ability to handle complex reasoning tasks, particularly through its innovative use of Monte Carlo Tree Search and Chain-of-Thought fine-tuning. Its versatility across domains such as mathematics, logical reasoning, and multilingual tasks sets it apart from traditional models. Meanwhile, the Llama 3.2 model offers efficient performance for edge devices, excelling in tasks like summarization and instruction-following. Both models showcase the ongoing evolution of AI, each excelling in its own domain, and together they highlight the broad potential of advanced language models in solving real-world challenges.

Key Takeaways

  • Marco-o1 uses Chain-of-Thought fine-tuning and Monte Carlo Tree Search for advanced problem-solving.
  • It adapts reasoning strategies, breaks down challenges, and explores multiple solutions.
  • A reflection mechanism improves accuracy by reevaluating reasoning steps.
  • Llama 3.2 is optimized for mobile/edge devices, excelling in summarization and instruction-following.
  • It supports long inputs with a 128K token context for extended interactions.
  • Marco-o1 delivers detailed, explanatory responses with thorough checks for complex queries.

Frequently Asked Questions

Q1. How does Marco-o1 adapt its reasoning strategies to different tasks?

A. Marco-o1 adjusts its reasoning strategies based on the complexity of the task at hand, breaking down challenges into manageable steps and exploring various solution paths using Monte Carlo Tree Search to find the optimal approach.

Q2. How does Monte Carlo Tree Search (MCTS) enhance the reasoning abilities of Marco-o1?

A. MCTS enables Marco-o1 to explore multiple potential solutions for a given problem, selecting the most promising paths through random sampling, leading to more accurate and efficient problem-solving.

Q3. What is the purpose of the reflection mechanism in Marco-o1?

A. The reflection mechanism allows Marco-o1 to reevaluate its reasoning steps at the end of each process, helping the model improve accuracy and refine its answers, especially for highly complex queries.

Q4. How do Marco-o1 and Llama 3.2 compare in terms of handling complex reasoning tasks?

A. Marco-o1 is specialized for tackling complex reasoning tasks using advanced techniques like Chain-of-Thought fine-tuning and MCTS. Llama 3.2 excels in efficient, real-time applications on mobile and edge devices, with extended context handling.

Q5. What is the significance of the Llama 3.2 model’s lightweight design?

A. The lightweight design of Llama 3.2 makes it ideal for deployment on mobile and edge devices, offering efficient performance while maintaining the ability to handle diverse tasks such as summarization and multilingual interactions.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

Nibedita completed her master’s in Chemical Engineering from IIT Kharagpur in 2014 and is currently working as a Senior Data Scientist. In her current capacity, she works on building intelligent ML-based solutions to improve business processes.
