AI is a game-changer for any company, but training large language models demands enormous amounts of computational power. That cost can be a daunting barrier to adopting AI, especially for organizations that need the technology to deliver significant impact without spending a great deal of money.
The Mixture of Experts technique offers an accurate and efficient answer to this problem: a large model is split into several specialized sub-networks, or "experts." This way of building AI solutions not only makes more efficient use of resources but also lets businesses tailor high-performance AI tools to their needs, making complex AI more affordable.
Modern deep learning models use artificial neural networks composed of layers of “neurons” or nodes. Each neuron takes input, applies a simple math operation (called an activation function), and sends the result to the next layer. More advanced models, like transformers, have extra features like self-attention, which help them understand more complex patterns in data.
However, using the entire network for every input, as dense models do, can be very resource-heavy. Mixture of Experts (MoE) models solve this with a sparse architecture that activates only the most relevant parts of the network (called "experts") for each input. This makes MoE models efficient: they can handle complex tasks like natural language processing without needing as much computational power.
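To make the dense computation concrete, here is a tiny NumPy sketch (purely illustrative, not taken from any of the models discussed here) of a layer in which every neuron processes every input:

import numpy as np

def relu(x):                      # a simple activation function
    return np.maximum(0, x)

rng = np.random.default_rng(0)
x = rng.normal(size=8)            # one input vector with 8 features
W = rng.normal(size=(16, 8))      # weights for a dense layer of 16 neurons
b = rng.normal(size=16)

layer_output = relu(W @ x + b)    # every neuron is computed for every input
print(layer_output.shape)         # (16,) -- all 16 neurons were used

A sparse MoE layer avoids exactly this "compute everything" pattern by running only a few experts per input, as described in the next sections.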
When working on a group project, the team often consists of smaller subgroups of members who are really good at different specific tasks. A Mixture of Experts (MoE) model works in a similar way: it divides a complicated problem among smaller parts, called "experts," that each specialize in solving one piece of the puzzle.
For example, if you were building a robot to help around the house, one expert might handle cleaning, another might be great at organizing, and a third might cook. Each expert focuses on what they’re best at, making the entire process faster and more accurate.
This way, the group works together efficiently, getting the job done better and faster than one person doing everything alone.
In a Mixture of Experts (MoE) model, two important parts make it work: the experts and the gating network.
The "experts" are like mini neural networks, each trained to handle different tasks or types of data.
Few Active Experts at a Time: only a handful of experts are activated for any given input, while the rest stay idle, which keeps the computation light.
In the context of processing text inputs, experts could, for instance, specialize in handling punctuation, adjectives, or conjunctions (just for illustration).
Given an input text, the system chooses the expert best suited for the task, as shown below. Since most LLMs have several decoder blocks, the text passes through multiple experts in different layers before generation.
In a Mixture of Experts (MoE) model, the “gating network” helps the model decide which experts (mini neural networks) should handle a specific task. Think of it like a smart guide that looks at the input (like a sentence to be translated) and chooses the best experts to work on it.
There are different ways the gating network can choose the experts, which we call "routing algorithms." Two common ones are top-k routing, where the k highest-scoring experts are selected for each token, and expert choice routing, where each expert selects the tokens it is best suited to process.
Once the experts finish their tasks, the model combines their results to make a final decision. Sometimes, more than one expert is needed for complex problems, but the gating network makes sure the right ones are used at the right time.
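As a rough illustration of how a gating network with top-k routing can select experts and combine their outputs, here is a simplified NumPy sketch (the expert and gate weights are random placeholders, not a real model):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(1)
d_model, n_experts, top_k = 8, 4, 2

x = rng.normal(size=d_model)                                   # one token representation
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]
gate_W = rng.normal(size=(n_experts, d_model))                 # gating network weights

gate_scores = softmax(gate_W @ x)                              # how relevant each expert is
chosen = np.argsort(gate_scores)[-top_k:]                      # top-k routing: keep the best 2 experts

# Weighted combination of only the chosen experts' outputs
output = sum(gate_scores[i] * (experts[i] @ x) for i in chosen)
print("experts used:", chosen, "output shape:", output.shape)

The key point is that only the chosen experts are ever computed; the others contribute nothing and cost nothing for this token.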
OLMoE is a new, fully open-source Mixture-of-Experts (MoE) language model developed by researchers from the Allen Institute for AI, Contextual AI, the University of Washington, and Princeton University.
It leverages a sparse architecture, meaning only a small number of “experts” are activated for each input, which helps save computational resources compared to traditional models that use all parameters for every token.
The OLMoE model comes in two versions: OLMoE-1B-7B, which has 7 billion total parameters with 1 billion activated per token, and OLMoE-1B-7B-INSTRUCT, which is fine-tuned for task-specific applications.
OLMoE was trained on a massive dataset of 5 trillion tokens, helping it perform well across many language tasks. During training, special techniques were used, like auxiliary losses and load balancing, to make sure the model uses its resources efficiently and remains stable. This ensures that only the best-performing parts of the model are activated depending on the task, allowing OLMoE to handle different tasks effectively without overloading the system. The use of router z-losses further improves its ability to manage which parts of the model should be used at any time.
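For intuition, here is a simplified sketch of what these auxiliary objectives commonly look like in MoE training, using the widely cited load-balancing loss and router z-loss formulations; the exact losses used for OLMoE may differ in detail:

import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(2)
n_tokens, n_experts = 32, 8
router_logits = rng.normal(size=(n_tokens, n_experts))   # placeholder router outputs
probs = softmax(router_logits, axis=-1)

# Load-balancing loss: encourages tokens to be spread evenly across experts
top1 = probs.argmax(axis=-1)
f = np.bincount(top1, minlength=n_experts) / n_tokens     # fraction of tokens routed to each expert
P = probs.mean(axis=0)                                    # mean router probability per expert
load_balance_loss = n_experts * np.sum(f * P)

# Router z-loss: penalizes large router logits to keep routing numerically stable
z = np.log(np.exp(router_logits).sum(axis=-1))            # logsumexp per token
router_z_loss = np.mean(z ** 2)

print(round(float(load_balance_loss), 3), round(float(router_z_loss), 3))

Both terms are added to the main language-modeling loss with small weights, nudging the router toward balanced, stable expert usage.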
The OLMoE-1B-7B model has been tested against several top-performing models, like Llama2-13B and DeepSeekMoE-16B, as shown in the Figure below, and has shown notable improvements in both efficiency and performance. It excelled in key NLP tests, such as MMLU, GSM8k, and HumanEval, which evaluate a model’s skills in areas like logic, math, and language understanding. These benchmarks are important because they measure how well a model can perform various tasks, proving that OLMoE can compete with larger models while being more efficient.
Ollama is an advanced AI tool that allows users to easily set up and run large language models locally (in CPU and GPU modes). We will explore how to run these small language models on Google Colab using Ollama in the following steps.
# Install system utilities, the LangChain Ollama client, and Ollama itself
!sudo apt update
!sudo apt install -y pciutils
!pip install langchain-ollama
!curl -fsSL https://ollama.com/install.sh | sh
import threading
import subprocess
import time
from langchain_core.prompts import ChatPromptTemplate
from langchain_ollama.llms import OllamaLLM
from IPython.display import Markdown
def run_ollama_serve():
    # Launch the Ollama server as a separate background process
    subprocess.Popen(["ollama", "serve"])

thread = threading.Thread(target=run_ollama_serve)
thread.start()      # run the Ollama service in a background thread
time.sleep(5)       # give the server a few seconds to start before using it
The run_ollama_serve() function launches an external process (ollama serve) using subprocess.Popen().
A new thread is created with the threading package to run the run_ollama_serve() function. Starting the thread runs the Ollama service in the background. The main thread then sleeps for 5 seconds, as specified by the time.sleep(5) command, giving the server time to start up before any further actions are taken.
!ollama pull sam860/olmoe-1b-7b-0924
Running !ollama pull sam860/olmoe-1b-7b-0924 downloads the olmoe-1b-7b language model and prepares it for use.
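Once the pull finishes, you can optionally run ollama list to confirm the model now appears among the locally available models:

!ollama list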
template = """Question: {question}
Answer: Let's think step by step."""
prompt = ChatPromptTemplate.from_template(template)    # wrap the template as a chat prompt
model = OllamaLLM(model="sam860/olmoe-1b-7b-0924")     # point LangChain at the locally served OLMoE model
chain = prompt | model                                  # pipe the formatted prompt into the model
display(Markdown(chain.invoke({"question": """Summarize the following into one sentence: \"Bob was a boy. Bob had a dog. Bob and his dog went for a walk. Bob and his dog walked to the park. At the park, Bob threw a stick and his dog brought it back to him. The dog chased a squirrel, and Bob ran after him. Bob got his dog back and they walked home together.\""""})))
The above code creates a prompt template to format a question, feeds the question to the model, and outputs the response.
Question
"Summarize the following into one sentence: 'Bob was a boy. Bob had a dog. Bob and his dog went for a walk. Bob and his dog walked to the park. At the park, Bob threw a stick and his dog brought it back to him. The dog chased a squirrel, and Bob ran after him. Bob got his dog back and they walked home together.'"
Output from Model:
As we can see, the output is a fairly accurate summary of the paragraph.
Question
“Give me a list of 13 words that have 9 letters.”
Output from Model
As we can see, the output has 13 words but not all words contain 9 letters. So, it is not completely accurate.
Question
“Create a birthday planning checklist.”
Output from Model
As we can see, the model has created a good list for birthday planning.
Question
"Write a Python program to Merge two sorted arrays into a single sorted array.”
Output from Model
The model accurately generated code to merge two sorted arrays into one sorted array.
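For reference, a typical solution to this prompt uses a two-pointer merge like the sketch below (an illustrative version, not the model's verbatim output):

def merge_sorted(a, b):
    """Merge two already-sorted lists into one sorted list."""
    merged, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            merged.append(a[i])
            i += 1
        else:
            merged.append(b[j])
            j += 1
    merged.extend(a[i:])    # append whatever remains in either list
    merged.extend(b[j:])
    return merged

print(merge_sorted([1, 3, 5], [2, 4, 6]))   # [1, 2, 3, 4, 5, 6]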
The Mixture of Experts (MoE) technique breaks complex problems into smaller tasks that are handled by specialized sub-networks called "experts." A router assigns each input to the most suitable experts, and because only the required experts are activated, MoE models save computational resources while tackling diverse challenges effectively. However, MoE models also face challenges such as complex training, overfitting, the need for diverse datasets, and the difficulty of coordinating experts efficiently.
OLMoE, an open-source MoE model, optimizes resource usage with a sparse architecture, activating only eight out of 64 experts at a time. It comes in two versions: OLMoE-1B-7B, with 7 billion total parameters (1 billion active per token), and OLMoE-1B-7B-INSTRUCT, fine-tuned for task-specific applications. These innovations make OLMoE powerful yet computationally efficient.
Q. What are the "experts" in a Mixture of Experts (MoE) model?
A. In an MoE model, experts are small neural networks trained to specialize in specific tasks or data types. For example, they may focus on processing punctuation, adjectives, or conjunctions in text.
Q. Why are MoE models more efficient than dense models?
A. MoE models use a "sparse" design, activating only a few relevant experts at a time based on the task. This approach reduces unnecessary computation, keeps the system focused, and improves speed and efficiency.
Q. What versions of OLMoE are available?
A. OLMoE is available in two versions: OLMoE-1B-7B, with 7 billion total parameters and 1 billion activated per token, and OLMoE-1B-7B-INSTRUCT. The latter is fine-tuned for improved task-specific performance.
Q. How does OLMoE's sparse architecture reduce computational costs?
A. The sparse architecture of OLMoE activates only the necessary experts for each input, minimizing computational costs. This design makes the model more efficient than traditional models that engage all parameters for every input.
Q. How does the gating network decide which experts to use?
A. The gating network selects the best experts for each task using methods like top-k or expert choice routing. This approach enables the model to handle complex tasks efficiently while conserving computational resources.