OpenAI’s o1 model has generated considerable excitement in the field of large reasoning models (LRMs) due to its advanced capabilities in tackling complex problems. Building on this foundation, Marco-o1 emerges as a new LRM that not only emphasizes traditional disciplines such as mathematics and coding but also prioritizes open-ended problem-solving across a variety of domains. A key focus of Marco-o1 is to explore the extent to which the o1 model can generalize its reasoning abilities to areas that lack clear standards and quantifiable rewards. This exploration is crucial for understanding the potential applications of LRMs in real-world scenarios where conventional metrics may not apply, thereby pushing the boundaries of what these models can achieve.
Marco-o1 is an advanced reasoning model developed by the MarcoPolo Team at Alibaba International Digital Commerce, designed to tackle open-ended problem-solving tasks.
It is built upon the Qwen2 architecture and employs a sophisticated combination of Chain-of-Thought (CoT) fine-tuning and Monte Carlo Tree Search (MCTS) techniques to enhance its reasoning capabilities.
By fine-tuning Qwen2-7B-Instruct with a combination of the filtered Open-O1 CoT dataset, Marco-o1 CoT dataset, and Marco-o1 Instruction dataset, Marco-o1 improved its handling of complex tasks.
The image below illustrates the inference process for Marco-o1, detailing the use of datasets like Open-O1 CoT and Marco-o1 CoT. The process involves selecting prompt paths, performing MCTS, and applying supervised fine-tuning for better accuracy. This leads to the generation of a final answer with confidence scores.
Marco-o1 focuses on sophisticated methods that enable AI models to handle complex tasks, such as reasoning through multiple steps, optimizing decision-making, and incorporating uncertainty for more accurate predictions and responses.
MCTS is used to determine the best answer to a user query by exploring candidate answers through random sampling. As shown in the figure above, nodes represent different reasoning paths; yellow nodes are the ones selected for further exploration, while green nodes represent final answers. Arrows such as "Select" and "Backup" show how the system evaluates and refines its choices.
After generating an answer, the system calculates a confidence score from the token probabilities (shown in the formula in the figure) to refine the final output.
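The exact formula appears in the figure above; as an illustrative sketch (not necessarily Marco-o1's exact implementation), a per-token confidence can be computed as a softmax of the chosen token's log-probability against its top-k alternatives, then averaged over the whole answer:
import math
def token_confidence(candidate_logprobs):
    # Softmax of the chosen token's log-probability against its top-k
    # alternatives; the chosen token is assumed to be the first entry.
    exps = [math.exp(lp) for lp in candidate_logprobs]
    return exps[0] / sum(exps)
def answer_confidence(per_token_candidates):
    # Average the per-token confidences to score a complete answer (rollout).
    scores = [token_confidence(c) for c in per_token_candidates]
    return sum(scores) / len(scores)
# Toy example: two generated tokens, each with top-3 candidate log-probabilities.
print(answer_confidence([[-0.1, -2.3, -3.0], [-0.5, -1.2, -2.8]]))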
The model can work at two levels of granularity: broader step-level reasoning (Step Level) and finer-grained reasoning over short token chunks (Mini-Step Level).
Different levels of granularity were explored in the MCTS search. To expand the model’s search space and enhance its problem-solving capabilities, steps were divided into smaller units of 64 or 32 tokens, referred to as “mini-step.” This finer granularity allowed the model to explore reasoning paths in greater detail.
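As a minimal illustration of this chunking idea (the token IDs below are dummies, not Marco-o1's actual tokenizer output):
def split_into_mini_steps(token_ids, chunk_size=64):
    # Break a generated reasoning sequence into fixed-size "mini-steps"
    # (64 or 32 tokens) that the search can treat as individual nodes.
    return [token_ids[i:i + chunk_size] for i in range(0, len(token_ids), chunk_size)]
# A 200-token trace becomes 4 mini-steps of up to 64 tokens each.
print(len(split_into_mini_steps(list(range(200)), chunk_size=64)))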
The model includes a reflection mechanism: the phrase "Wait! Maybe I made some mistakes! I need to rethink from scratch." is appended at the end of each thought process, prompting the model to self-reflect and reevaluate its reasoning steps. This reflection has yielded significant improvements, especially on difficult problems that the original model initially solved incorrectly. A minimal sketch of the idea follows.
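Conceptually, the mechanism amounts to appending the trigger phrase to each completed thought before the model continues (a simplified sketch, not Marco-o1's training pipeline):
REFLECTION_PROMPT = "Wait! Maybe I made some mistakes! I need to rethink from scratch."
def add_reflection(thought: str) -> str:
    # Append the reflection trigger so the model re-examines its own reasoning
    # before committing to a final answer.
    return f"{thought}\n{REFLECTION_PROMPT}"
print(add_reflection("The area of the triangle is 25 square units."))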
Marco-o1 is particularly effective for open-ended problem-solving and multi-step reasoning tasks across domains such as mathematics, physics, coding, and multilingual applications, especially where clear standards and quantifiable rewards are lacking.
The Llama 3.2 family includes 1 billion (1B) and 3 billion (3B) parameter text models designed for mobile and edge devices, focusing on efficient performance for applications like summarization and instruction following.
Llama 3.2 was pretrained on up to 9 trillion tokens from publicly available sources, incorporating knowledge distillation techniques from larger models (like Llama 3.1) to enhance performance while maintaining a smaller size.
Llama 3.2 3B demonstrated notable performance in specific areas, particularly in reasoning tasks. In the ARC Challenge, it achieved a score of 78.6, surpassing Gemma's 76.7, while being just behind Phi-3.5-mini, which scored 87.4. Likewise, on the HellaSwag benchmark, Llama 3.2 3B scored 69.8, outperforming Gemma and staying competitive with Phi-3.5-mini.
Hence, in the hands-on Python implementation that follows, we carry out a comparative assessment of reasoning-based questions on the two models, Marco-o1 and Llama 3.2 3B. This comparison is primarily done to check whether the outputs from Marco-o1 really excel on reasoning-based questions.
Ollama is an advanced AI tool that allows users to easily set up and run large language models locally (in CPU and GPU modes). We will explore how to run these models on Google Colab using Ollama in the following steps.
Below we will install all needed libraries:
!sudo apt update
!sudo apt install -y pciutils                      # provides lspci, used by Ollama for GPU detection
!pip install langchain-ollama                      # LangChain integration for Ollama
!curl -fsSL https://ollama.com/install.sh | sh     # install the Ollama runtime
!pip install ollama==0.4.2                         # Python client for the Ollama API
In this step, we set up threading to allow Ollama to run efficiently on Google Colab. Threading enables parallel execution of tasks, ensuring smooth performance and faster processing without delays. This setup is crucial for running resource-intensive operations seamlessly within the Colab environment.
import threading
import subprocess
import time

def run_ollama_serve():
    # Start the Ollama server as a background process
    subprocess.Popen(["ollama", "serve"])

# Run the server in a separate thread so the notebook stays responsive
thread = threading.Thread(target=run_ollama_serve)
thread.start()

# Give the server a few seconds to start before pulling models
time.sleep(5)
!ollama pull marco-o1
We can use the same code for pulling the llama3.2 model by replacing marco-o1 with llama3.2.
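For example, to pull the Llama 3.2 model:
!ollama pull llama3.2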
This step involves sending queries to the model to get responses or insights based on the input. It helps in interacting with the model for tasks like generating text or answering questions.
from langchain_core.prompts import ChatPromptTemplate
from langchain_ollama.llms import OllamaLLM
from IPython.display import Markdown, display

template = """Question: {question}"""
prompt = ChatPromptTemplate.from_template(template)

# Load the Marco-o1 model served by Ollama
model = OllamaLLM(model="marco-o1")
chain = prompt | model

# Prepare input for invocation
input_data = {
    "question": "I have 2 apples, then I buy 2 more. I bake a pie with 2 of the apples. After eating half of the pie how many apples do I have left?"
}

# Invoke the chain with input data and display the response in Markdown format
response = chain.invoke(input_data)
display(Markdown(response))
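To compare, we can run the same prompt through Llama 3.2 by simply swapping the model name (assuming the llama3.2 tag pulled earlier):
# Build an identical chain backed by the Llama 3.2 model
llama_model = OllamaLLM(model="llama3.2")
llama_chain = prompt | llama_model
# Invoke it with the same question and render the answer
llama_response = llama_chain.invoke(input_data)
display(Markdown(llama_response))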
In this section, we will compare the outputs of Marco-o1 and Llama 3.2, highlighting their strengths and differences in handling complex reasoning tasks and real-time applications. By examining their responses, we can better understand how each model approaches problem-solving and adapts to different use cases.
“I have 2 apples, then I buy 2 more. I bake a pie with 2 of the apples. After eating half of the pie how many apples do I have left?”
Both models provide accurate responses, but Marco-o1 offers more detailed explanations compared to Llama 3.2.
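For reference, a quick sanity check of the expected arithmetic (eating half the pie does not change the number of whole apples):
apples = 2 + 2   # start with 2 apples, buy 2 more
apples -= 2      # 2 apples go into the pie
print(apples)    # 2 whole apples remain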
"How many r in strawberry?”
As can be seen from the outputs above, the response from the Llama 3.2 model is inaccurate, while the response from the Marco-o1 model is accurate.
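A one-line check of the expected answer:
print("strawberry".count("r"))  # 3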
“What is the area of a triangle with a base of 10 units and a height of 5 units?”
As can be seen from the outputs above, both models give accurate responses, but the response from the Marco-o1 model is somewhat more detailed than that of Llama 3.2.
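For reference, the standard area formula gives:
base, height = 10, 5
print(0.5 * base * height)  # 25.0 square units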
"If a car costs $20,000 and depreciates by $1,000 each year, how much will it be
worth after three years?"
Again, both models give accurate responses, but Marco-o1's explanation is somewhat more detailed than Llama 3.2's.
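The expected value after three years of linear depreciation:
print(20_000 - 3 * 1_000)  # 17000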
“All birds can fly. Penguins are birds. Can penguins fly?”
As can be seen from the outputs above, although both models give accurate responses, the response from the Marco-o1 model is far more elaborate, presenting many arguments and double-checks to arrive at the answer, compared to Llama 3.2.
“Oliver picks 44 kiwis on Friday, then 58 on Saturday. On Sunday, he picks double what he did on Friday, but five of them were smaller than average. How many kiwis does Oliver have?”
As can be seen from the outputs above, the response from Llama 3.2 is inaccurate: it gets confused by the additional information ("but five of them were smaller than average") in the query and subtracts 5 from the actual answer. The output from Marco-o1, however, is accurate, with a detailed explanation.
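A quick check of the arithmetic (the size detail is irrelevant to the count):
friday = 44
saturday = 58
sunday = 2 * friday                # double Friday's pick
print(friday + saturday + sunday)  # 190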
“John is allergic to peanuts. He ate a peanut butter sandwich and felt fine. What can we conclude about John's allergy?”
As can be seen from the response of the Marco-o1 model, it is highly detailed and elaborate, presenting many arguments and double-checks to arrive at the answer. The response from Llama 3.2 does not seem fully accurate: the claim that "he simply had a stomach upset or an intolerance to the peanut butter" contradicts the information given in the query.
| Task | Marco-o1 Performance | Llama 3.2 (3B Model) Performance | Winner |
|---|---|---|---|
| Task 1: Logical Reasoning | Accurate with detailed explanations | Accurate but less detailed | Marco-o1 |
| Task 2: Strawberry Test | Accurate | Inaccurate | Marco-o1 |
| Task 3: Geometry Reasoning | Accurate with detailed explanations | Accurate but less detailed | Marco-o1 |
| Task 4: Step-by-Step Reasoning | Accurate with detailed explanations | Accurate but less detailed | Marco-o1 |
| Task 5: Syllogism with Ambiguity | Accurate with elaborate explanations and double-checks | Accurate but less detailed | Marco-o1 |
| Task 6: Fragile Mathematical Context | Accurate with detailed explanations | Inaccurate (confused by additional information) | Marco-o1 |
| Task 7: Contradictory Information | Accurate with elaborate explanations and double-checks | Inaccurate (provided contradictory information) | Marco-o1 |
The Marco-o1 model represents a significant advancement in AI’s ability to handle complex reasoning tasks, particularly through its innovative use of Monte Carlo Tree Search and Chain-of-Thought fine-tuning. Its versatility across various domains such as mathematics, physics, and multilingual tasks sets it apart from traditional models. Meanwhile, the Llama 3.2 model offers efficient performance for edge devices, excelling in tasks like summarization and instruction-following. Both models showcase the ongoing evolution of AI, each excelling in its own domain, and together they highlight the broad potential of advanced language models in solving real-world challenges.
Q. How does Marco-o1 adapt its reasoning to different tasks?
A. Marco-o1 adjusts its reasoning strategies based on the complexity of the task at hand, breaking down challenges into manageable steps and exploring various solution paths using Monte Carlo Tree Search to find the optimal approach.

Q. What role does MCTS play in Marco-o1?
A. MCTS enables Marco-o1 to explore multiple potential solutions for a given problem, selecting the most promising paths through random sampling, leading to more accurate and efficient problem-solving.

Q. How does the reflection mechanism help Marco-o1?
A. The reflection mechanism allows Marco-o1 to reevaluate its reasoning steps at the end of each process, helping the model improve accuracy and refine its answers, especially for highly complex queries.

Q. How do Marco-o1 and Llama 3.2 differ?
A. Marco-o1 is specialized for tackling complex reasoning tasks using advanced techniques like Chain-of-Thought fine-tuning and MCTS. Llama 3.2 excels in efficient, real-time applications on mobile and edge devices, with extended context handling.

Q. Why is Llama 3.2 well suited for mobile and edge devices?
A. The lightweight design of Llama 3.2 makes it ideal for deployment on mobile and edge devices, offering efficient performance while maintaining the ability to handle diverse tasks such as summarization and multilingual interactions.