AI is a game-changer for any company, but training large language models demands enormous amounts of computational power. That cost can be a daunting barrier to adopting AI, especially for organizations that need the technology to deliver significant impact without spending a great deal of money.
The Mixture of Experts technique offers an accurate and efficient answer to this problem: a large model is split into several specialized sub-networks, or "experts." This way of building AI solutions not only makes more efficient use of resources but also lets businesses tailor high-performance AI tools to their needs, making complex AI more affordable.
Modern deep learning models use artificial neural networks composed of layers of “neurons” or nodes. Each neuron takes input, applies a simple math operation (called an activation function), and sends the result to the next layer. More advanced models, like transformers, have extra features like self-attention, which help them understand more complex patterns in data.
However, using the entire network for every input, as dense models do, can be very resource-heavy. Mixture of Experts (MoE) models solve this with a sparse architecture that activates only the most relevant parts of the network (called "experts") for each input. This makes MoE models efficient: they can handle complex tasks like natural language processing without needing as much computational power.
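To make the dense computation concrete, here is a tiny NumPy sketch (purely illustrative, not taken from any of the models discussed here) of a layer in which every neuron processes every input:

import numpy as np

def relu(x):                      # a simple activation function
    return np.maximum(0, x)

rng = np.random.default_rng(0)
x = rng.normal(size=8)            # one input vector with 8 features
W = rng.normal(size=(16, 8))      # weights for a dense layer of 16 neurons
b = rng.normal(size=16)

layer_output = relu(W @ x + b)    # every neuron is computed for every input
print(layer_output.shape)         # (16,) -- all 16 neurons were used

A sparse MoE layer avoids exactly this "compute everything" pattern by running only a few experts per input, as described in the next sections.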
When working on a group project, the team often consists of smaller subgroups of members who are really good at different specific tasks. A Mixture of Experts (MoE) model works in a similar way: it divides a complicated problem among smaller parts, called "experts," that each specialize in solving one piece of the puzzle.
For example, if you were building a robot to help around the house, one expert might handle cleaning, another might be great at organizing, and a third might cook. Each expert focuses on what they’re best at, making the entire process faster and more accurate.
This way, the group works together efficiently, getting the job done better and faster than one person doing everything alone.
In a Mixture of Experts (MoE) model, two important parts make it work: the experts and the gating network.
The "experts" are like mini neural networks, each trained to handle different tasks or types of data.
Few Active Experts at a Time: only a handful of experts are activated for any given input, while the rest stay idle, which keeps the computation light.
In the context of processing text inputs, experts could, for instance, specialize in handling punctuation, adjectives, or conjunctions (just for illustration).
Given an input text, the system chooses the expert best suited for the task, as shown below. Since most LLMs have several decoder blocks, the text passes through multiple experts in different layers before generation.
In a Mixture of Experts (MoE) model, the “gating network” helps the model decide which experts (mini neural networks) should handle a specific task. Think of it like a smart guide that looks at the input (like a sentence to be translated) and chooses the best experts to work on it.
There are different ways the gating network can choose the experts, which we call "routing algorithms." Two common ones are top-k routing, where the k highest-scoring experts are selected for each token, and expert choice routing, where each expert selects the tokens it is best suited to process.
Once the experts finish their tasks, the model combines their results to make a final decision. Sometimes, more than one expert is needed for complex problems, but the gating network makes sure the right ones are used at the right time.
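As a rough illustration of how a gating network with top-k routing can select experts and combine their outputs, here is a simplified NumPy sketch (the expert and gate weights are random placeholders, not a real model):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(1)
d_model, n_experts, top_k = 8, 4, 2

x = rng.normal(size=d_model)                                   # one token representation
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]
gate_W = rng.normal(size=(n_experts, d_model))                 # gating network weights

gate_scores = softmax(gate_W @ x)                              # how relevant each expert is
chosen = np.argsort(gate_scores)[-top_k:]                      # top-k routing: keep the best 2 experts

# Weighted combination of only the chosen experts' outputs
output = sum(gate_scores[i] * (experts[i] @ x) for i in chosen)
print("experts used:", chosen, "output shape:", output.shape)

The key point is that only the chosen experts are ever computed; the others contribute nothing and cost nothing for this token.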
OLMoE is a new, fully open-source Mixture-of-Experts (MoE) language model developed by researchers from the Allen Institute for AI, Contextual AI, the University of Washington, and Princeton University.
It leverages a sparse architecture, meaning only a small number of “experts” are activated for each input, which helps save computational resources compared to traditional models that use all parameters for every token.
The OLMoE model comes in two versions: OLMoE-1B-7B, which has 7 billion total parameters with 1 billion activated per token, and OLMoE-1B-7B-INSTRUCT, which is fine-tuned for task-specific applications.
OLMoE was trained on a massive dataset of 5 trillion tokens, helping it perform well across many language tasks. During training, special techniques were used, like auxiliary losses and load balancing, to make sure the model uses its resources efficiently and remains stable. This ensures that only the best-performing parts of the model are activated depending on the task, allowing OLMoE to handle different tasks effectively without overloading the system. The use of router z-losses further improves its ability to manage which parts of the model should be used at any time.
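For intuition, here is a simplified sketch of what these auxiliary objectives commonly look like in MoE training, using the widely cited load-balancing loss and router z-loss formulations; the exact losses used for OLMoE may differ in detail:

import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(2)
n_tokens, n_experts = 32, 8
router_logits = rng.normal(size=(n_tokens, n_experts))   # placeholder router outputs
probs = softmax(router_logits, axis=-1)

# Load-balancing loss: encourages tokens to be spread evenly across experts
top1 = probs.argmax(axis=-1)
f = np.bincount(top1, minlength=n_experts) / n_tokens     # fraction of tokens routed to each expert
P = probs.mean(axis=0)                                    # mean router probability per expert
load_balance_loss = n_experts * np.sum(f * P)

# Router z-loss: penalizes large router logits to keep routing numerically stable
z = np.log(np.exp(router_logits).sum(axis=-1))            # logsumexp per token
router_z_loss = np.mean(z ** 2)

print(round(float(load_balance_loss), 3), round(float(router_z_loss), 3))

Both terms are added to the main language-modeling loss with small weights, nudging the router toward balanced, stable expert usage.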
The OLMoE-1B-7B model has been tested against several top-performing models, like Llama2-13B and DeepSeekMoE-16B, as shown in the Figure below, and has shown notable improvements in both efficiency and performance. It excelled in key NLP tests, such as MMLU, GSM8k, and HumanEval, which evaluate a model’s skills in areas like logic, math, and language understanding. These benchmarks are important because they measure how well a model can perform various tasks, proving that OLMoE can compete with larger models while being more efficient.
Ollama is an advanced AI tool that allows users to easily set up and run large language models locally (in CPU and GPU modes). We will explore how to run these small language models on Google Colab using Ollama in the following steps.
# Install system utilities, the LangChain Ollama client, and Ollama itself
!sudo apt update
!sudo apt install -y pciutils
!pip install langchain-ollama
!curl -fsSL https://ollama.com/install.sh | sh
import threading
import subprocess
import time
from langchain_core.prompts import ChatPromptTemplate
from langchain_ollama.llms import OllamaLLM
from IPython.display import Markdown
def run_ollama_serve():
    # Launch the Ollama server as a separate background process
    subprocess.Popen(["ollama", "serve"])

thread = threading.Thread(target=run_ollama_serve)
thread.start()      # run the Ollama service in a background thread
time.sleep(5)       # give the server a few seconds to start before using it
The run_ollama_serve() function launches an external process (ollama serve) using subprocess.Popen().
A new thread is created with the threading package to run the run_ollama_serve() function. Starting the thread runs the Ollama service in the background. The main thread then sleeps for 5 seconds, as specified by the time.sleep(5) command, giving the server time to start up before any further actions are taken.
!ollama pull sam860/olmoe-1b-7b-0924
Running !ollama pull sam860/olmoe-1b-7b-0924 downloads the olmoe-1b-7b language model and prepares it for use.
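Once the pull finishes, you can optionally run ollama list to confirm the model now appears among the locally available models:

!ollama list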
template = """Question: {question}
Answer: Let's think step by step."""
prompt = ChatPromptTemplate.from_template(template)    # wrap the template as a chat prompt
model = OllamaLLM(model="sam860/olmoe-1b-7b-0924")     # point LangChain at the locally served OLMoE model
chain = prompt | model                                  # pipe the formatted prompt into the model
display(Markdown(chain.invoke({"question": """Summarize the following into one sentence: \"Bob was a boy. Bob had a dog. Bob and his dog went for a walk. Bob and his dog walked to the park. At the park, Bob threw a stick and his dog brought it back to him. The dog chased a squirrel, and Bob ran after him. Bob got his dog back and they walked home together.\""""})))
The above code creates a prompt template to format a question, feeds the question to the model, and outputs the response.
Question
"Summarize the following into one sentence: 'Bob was a boy. Bob had a dog. Bob and his dog went for a walk. Bob and his dog walked to the park. At the park, Bob threw a stick and his dog brought it back to him. The dog chased a squirrel, and Bob ran after him. Bob got his dog back and they walked home together.'"
Output from Model:
As we can see, the output is a fairly accurate summary of the paragraph.
Question
“Give me a list of 13 words that have 9 letters.”
Output from Model
As we can see, the output has 13 words but not all words contain 9 letters. So, it is not completely accurate.
Question
“Create a birthday planning checklist.”
Output from Model
As we can see, the model has created a good list for birthday planning.
Question
"Write a Python program to Merge two sorted arrays into a single sorted array.”
Output from Model
The model accurately generated code to merge two sorted arrays into one sorted array.
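For reference, a typical solution to this prompt uses a two-pointer merge like the sketch below (an illustrative version, not the model's verbatim output):

def merge_sorted(a, b):
    """Merge two already-sorted lists into one sorted list."""
    merged, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            merged.append(a[i])
            i += 1
        else:
            merged.append(b[j])
            j += 1
    merged.extend(a[i:])    # append whatever remains in either list
    merged.extend(b[j:])
    return merged

print(merge_sorted([1, 3, 5], [2, 4, 6]))   # [1, 2, 3, 4, 5, 6]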
The Mixture of Experts (MoE) technique breaks complex problems into smaller tasks that are handled by specialized sub-networks called "experts." A router assigns each input to the most suitable experts, and because only the required experts are activated, MoE models save computational resources while tackling diverse challenges effectively. However, MoE models also face challenges such as complex training, overfitting, the need for diverse datasets, and the difficulty of coordinating experts efficiently.
OLMoE, an open-source MoE model, optimizes resource usage with a sparse architecture, activating only eight out of 64 experts at a time. It comes in two versions: OLMoE-1B-7B, with 7 billion total parameters (1 billion active per token), and OLMoE-1B-7B-INSTRUCT, fine-tuned for task-specific applications. These innovations make OLMoE powerful yet computationally efficient.
Q. What are the "experts" in a Mixture of Experts (MoE) model?
A. In an MoE model, experts are small neural networks trained to specialize in specific tasks or data types. For example, they may focus on processing punctuation, adjectives, or conjunctions in text.
Q. Why are MoE models more efficient than dense models?
A. MoE models use a "sparse" design, activating only a few relevant experts at a time based on the task. This approach reduces unnecessary computation, keeps the system focused, and improves speed and efficiency.
Q. What versions of OLMoE are available?
A. OLMoE is available in two versions: OLMoE-1B-7B, with 7 billion total parameters and 1 billion activated per token, and OLMoE-1B-7B-INSTRUCT. The latter is fine-tuned for improved task-specific performance.
Q. How does OLMoE's sparse architecture reduce computational costs?
A. The sparse architecture of OLMoE activates only the necessary experts for each input, minimizing computational costs. This design makes the model more efficient than traditional models that engage all parameters for every input.
Q. How does the gating network decide which experts to use?
A. The gating network selects the best experts for each task using methods like top-k or expert choice routing. This approach enables the model to handle complex tasks efficiently while conserving computational resources.