Bilingual Powerhouse EXAONE 3.5 Sets New AI Standards

Nibedita Dutta | Last Updated: 16 Jan, 2025 | 9 min read

EXAONE 3.5 is the latest iteration in a series of large language models developed by LG AI Research, designed to enhance the capabilities and accessibility of artificial intelligence technologies. Released in December 2024, EXAONE 3.5 encompasses three distinct configurations: 2.4 billion, 7.8 billion, and 32 billion parameters. Each model variant is tailored to meet different performance needs, ranging from lightweight applications suitable for mobile devices to high-performance tasks requiring extensive computational resources. With a focus on bilingual proficiency in English and Korean, EXAONE 3.5 aims to set new standards in instruction-following accuracy and long-context understanding, making it an invaluable tool across various sectors.

Learning Objectives

  • Understand the architecture and design choices of EXAONE 3.5, including its decoder-only transformer model and extended context length.
  • Explore the bilingual proficiency of EXAONE 3.5 in English and Korean, and its applications in multilingual scenarios.
  • Learn about the two-stage training process and how fine-tuning enhances instruction-following and long-context understanding.
  • Gain insights into advanced methodologies like the decontamination process and Direct Preference Optimization (DPO) for training LLMs.
  • Evaluate EXAONE 3.5’s performance benchmarks across real-world use cases, long-context processing, and general domain tasks.

This article was published as a part of the Data Science Blogathon.

How Do Reasoning-Based LLMs Work?

Reasoning-based large language models, like EXAONE 3.5, handle complex tasks that require logical thinking, problem-solving, and understanding of intricate patterns. Built on advanced architectures such as transformer networks, these models excel at handling sequential data and long contexts. They are trained on vast datasets to recognize relationships between pieces of information, enabling them to generate accurate responses to queries, reason through problems, and follow instructions effectively.

By leveraging fine-tuning techniques like Supervised Fine-tuning (SFT) and Direct Preference Optimization (DPO), these LLMs refine their ability to mimic human-like reasoning in diverse applications, from simple tasks to complex decision-making scenarios.

EXAONE 3.5 Model Architecture

EXAONE 3.5 utilizes a decoder-only transformer architecture, which has become a standard in modern LLM design due to its efficiency in processing sequential data. The architecture is optimized for instruction-following tasks, allowing it to understand and execute user commands effectively. The key specifications for all three model variants (2.4 billion, 7.8 billion, and 32 billion parameters) are as follows, with a short configuration-inspection snippet after the list:

  • Maximum Context Length: 32,768 tokens
  • Layers: 32
  • Feedforward Dimension: 14,336
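For readers who want to check these settings themselves, the short sketch below loads the published model configuration from Hugging Face and prints it. The repository name LGAI-EXAONE/EXAONE-3.5-7.8B-Instruct and the need for trust_remote_code=True are assumptions about how the checkpoints are hosted; swap in the variant you actually intend to use.

from transformers import AutoConfig

# Load the EXAONE 3.5 configuration (repo name and trust_remote_code are assumptions).
config = AutoConfig.from_pretrained(
    "LGAI-EXAONE/EXAONE-3.5-7.8B-Instruct",
    trust_remote_code=True,
)

# Print the full configuration, then pick out the fields discussed above if they exist.
print(config)
for field in ("max_position_embeddings", "num_layers", "num_hidden_layers", "intermediate_size"):
    if hasattr(config, field):
        print(field, "=", getattr(config, field))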

Architectural Innovations in EXAONE 3.5

EXAONE 3.5 introduces groundbreaking advancements to its architecture, enhancing its ability to process extended contexts and deliver accurate, user-aligned outputs. These innovations set new standards for efficiency and performance in large language models.

  • Extended Context Length: The maximum context length has been significantly increased to accommodate up to 32,768 tokens, enabling effective processing of larger texts without losing coherence.
  • Two-Stage Training Process: EXAONE underwent a two-stage training process consisting of general-domain training followed by fine-tuning for specific tasks related to long-context understanding. In the pre-training phase, the process removes duplicates and personally identifiable information from datasets to improve the models’ performance and reduce infrastructure costs. In the post-training phase, Supervised Fine-tuning (SFT) and Direct Preference Optimization (DPO) methods enhance the models’ instruction-following capabilities and enable them to better reflect user preferences.
  • Decontamination Process: The team applied a rigorous decontamination process to ensure unbiased evaluations by removing contaminated examples from the training set. The method was adopted from a global model known for its rigorous evaluation practice: the training data was compared against the evaluation datasets, and the check was repeated 10 times.

What is Direct Preference Optimization (DPO)?

Direct Preference Optimization (DPO) is an algorithm for fine-tuning large language models by aligning them directly with human preferences, without the complexity of traditional reinforcement learning pipelines. Unlike Reinforcement Learning from Human Feedback (RLHF), which requires intricate reward modeling and sampling, DPO uses a straightforward classification loss to optimize model responses based on user preferences. This makes training stable, computationally lightweight, and easier to implement.

It is important to note that DPO requires a preference dataset: a collection of triplets, each consisting of a prompt, a chosen answer, and a rejected answer.
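To make the objective concrete, here is a minimal DPO loss sketch in PyTorch. It is illustrative only, not EXAONE 3.5's actual training code: it shows how the log-probabilities of the chosen and rejected answers under the policy are compared against a frozen reference model through a simple classification-style loss, with beta controlling how strongly preferences are enforced.

import torch
import torch.nn.functional as F

# Minimal DPO loss sketch (illustrative, not the EXAONE training code).
# Each input is the summed log-probability of an answer under the policy
# being trained or under a frozen reference model.
def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Implicit "rewards" are log-probability ratios against the reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Binary classification-style loss: prefer the chosen answer over the rejected one.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Dummy values for a batch of two preference triplets.
loss = dpo_loss(torch.tensor([-12.0, -9.5]), torch.tensor([-14.0, -11.0]),
                torch.tensor([-12.5, -9.8]), torch.tensor([-13.5, -10.5]))
print(loss)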

What is the Decontamination Process?

Decontamination refers to a rigorous process aimed at enhancing the generalization performance of the models by removing contaminated examples from the training dataset. Since the training data often comes from web crawls, some test-set examples might appear in the training corpus, which can lead to biased evaluations. To address this, EXAONE uses a substring-level matching method to identify and eliminate these contaminated samples.
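The sketch below illustrates the general idea of substring-level decontamination: normalize the text, slide a fixed-length window over every evaluation example, and drop any training document that contains one of those windows. The window length and normalization used here are illustrative assumptions, not the exact parameters of the EXAONE pipeline.

import re

def normalize(text):
    # Lowercase and collapse whitespace so formatting differences do not hide overlaps.
    return re.sub(r"\s+", " ", text.lower()).strip()

def build_eval_substrings(eval_texts, window=50):
    # Collect fixed-length character windows from every evaluation example.
    substrings = set()
    for text in eval_texts:
        t = normalize(text)
        for i in range(max(len(t) - window + 1, 1)):
            substrings.add(t[i:i + window])
    return substrings

def decontaminate(train_texts, eval_texts, window=50):
    eval_subs = build_eval_substrings(eval_texts, window)
    # Keep a training document only if no evaluation window appears inside it.
    return [doc for doc in train_texts
            if not any(sub in normalize(doc) for sub in eval_subs)]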

These architectural enhancements enable EXAONE models to excel in real-world applications while maintaining competitive performance across various benchmarks.

Performance Benchmarks

The evaluation benchmarks of EXAONE 3.5 Models were categorized into three groups:

  • Real-world use cases – evaluated the models’ ability to understand and respond to user queries in practical scenarios
  • Long-context processing – assessed the models’ capability to process and retrieve information from extended textual inputs
  • General domain tasks – tested the models’ proficiency in mathematics, coding, and knowledge-based tasks.

Figures: EXAONE 3.5 benchmark results on real-world use cases and long-context processing.

As seen in the figures above, all three models excelled in real-world use cases and long-context scenarios, often surpassing baseline models of similar size. For example, the 32B model achieved an average score of 74.3 in real-world use cases, significantly outperforming competitors like Qwen 2.5 32B and Gemma 2 27B.


Figure: EXAONE 3.5 results on general domain benchmarks.

EXAONE 3.5 also excels in mathematical and coding tasks. Across nine general-domain benchmarks, the 2.4B model achieved the highest average score, surpassing other global models of the same size. Likewise, the 7.8B and 32B models placed among the top performers, securing impressive average scores.

Running EXAONE 3.5 (7.8 Billion) on Google Colab Using Ollama

Below we will learn how to set up and query the EXAONE 3.5 model (7.8B variant) on Google Colab using Ollama. This guide walks you through the installation, configuration, and testing process so you can evaluate the model's capabilities firsthand.

Step 1: Installation of Libraries

Install necessary libraries and tools, including Langchain and Ollama, to prepare the Colab environment for running the model.

!sudo apt update
!sudo apt install -y pciutils
!pip install langchain-ollama
!curl -fsSL https://ollama.com/install.sh | sh
!pip install ollama==0.4.2

Step 2: Running Ollama in a Background Thread on Google Colab

Start the Ollama server in a background thread so the notebook can continue executing cells while the server runs.

import threading
import subprocess
import time

def run_ollama_serve():
  # Launch the Ollama server as a background process.
  subprocess.Popen(["ollama", "serve"])

# Run the server in a separate thread so the notebook stays responsive,
# then wait a few seconds for it to finish starting up.
thread = threading.Thread(target=run_ollama_serve)
thread.start()
time.sleep(5)

Step 3: Pulling the Ollama Model

Download the EXAONE 3.5 model (7.8B variant) using Ollama to prepare it for querying.

!ollama pull exaone3.5

Step 4: Querying the Model

Define the query using Langchain, invoke the model, and display the response in Markdown format to evaluate the model’s performance.

from langchain_core.prompts import ChatPromptTemplate
from langchain_ollama.llms import OllamaLLM
from IPython.display import Markdown

template = """Question: {question}"""

prompt = ChatPromptTemplate.from_template(template)

model = OllamaLLM(model="exaone3.5")

chain = prompt | model

# Prepare input for invocation
input_data = {
    "question": 'I have 2 apples, then I buy 2 more. I bake a pie with 2 of the apples. After eating half of the pie how many apples do I have left?'}

# Invoke the chain with input data and display the response in Markdown format
response = chain.invoke(input_data)
display(Markdown(response))
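Since the ollama Python client was also installed in Step 1, the running server can be queried without Langchain as well. The snippet below is a minimal sketch that assumes the server from Step 2 is up and the exaone3.5 model from Step 3 has been pulled.

import ollama

# Query the local Ollama server directly through its Python client.
reply = ollama.chat(
    model="exaone3.5",
    messages=[{"role": "user", "content": "Summarize what EXAONE 3.5 is in two sentences."}],
)
print(reply["message"]["content"])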

Testing the Model For Different Prompts

Below we will test the model for different prompts:

Needle in the Haystack Tasks

These tasks test the model's ability to find specific information in very long inputs.

Context: Climate change is causing glaciers to melt at an unprecedented rate, 
leading to rising sea levels. In coastal cities like Miami and New Orleans, this
poses a significant threat to infrastructure and ecosystems. Furthermore,
scientists predict that if current trends continue, sea levels could rise by more
than six feet by the end of the century.
Question: Based on the context, what are two potential impacts of rising sea levels
due to climate change?
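Because the chain defined in Step 4 expects a single question field, the whole context-plus-question block is passed as one string. A minimal sketch of how this prompt can be sent to the model is shown below; needle_prompt simply wraps the text above.

# Feed the long-context prompt through the chain from Step 4.
needle_prompt = """Context: Climate change is causing glaciers to melt at an unprecedented rate,
leading to rising sea levels. In coastal cities like Miami and New Orleans, this
poses a significant threat to infrastructure and ecosystems. Furthermore,
scientists predict that if current trends continue, sea levels could rise by more
than six feet by the end of the century.
Question: Based on the context, what are two potential impacts of rising sea levels
due to climate change?"""

response = chain.invoke({"question": needle_prompt})
display(Markdown(response))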

Output:


As we can see from the output, the model has correctly identified the needed information from the context.

Ancestral Trace Challenge

Context: The Great Wall of China was built over several dynasties, primarily during
the Ming dynasty (1368–1644). It stretches over 13,000 miles and was constructed to
protect against invasions. Today, it stands as a UNESCO World Heritage site and
attracts millions of tourists each year.
Questions:
a) During which dynasty was most of the Great Wall constructed?
b) How long is the Great Wall of China?
c) What designation does it hold today?

Output:


As we can see from the output, the model has correctly identified the needed information from the context.

Real-world Use Case Scenarios

Let us now look into some real-world use cases below:

Customer Support Scenario

“User Query: "I received the wrong item in my order. What should I do?"
Prompt: Given the user's query, provide a clear and actionable response that guides
them through the return process. Include any necessary information about contacting
customer support or initiating a return.”

Output:


As we can see from the output, the model answers the query well from the perspective of a customer support agent.

Educational Assistance

“User Query: "I'm struggling with calculus concepts, especially derivatives. Can you explain it simply?"
Prompt: Explain the concept of derivatives in calculus using simple language and
examples. Include visual aids or analogies if possible to enhance understanding.”

Output:


As we can see from the output, the model answers well from the perspective of an educational counsellor helping the student with the query.

Logical Reasoning Tasks

Below we will look into some logical reasoning tasks:

Fragile Mathematical Context

“Oliver picks 44 kiwis on Friday, then 58 on Saturday. On Sunday, he picks double
what he did on Friday, but five of them were smaller than average. How many kiwis
does Oliver have?”

Output:


The model provides an accurate response to the fragile mathematical context above and is not confused by the irrelevant detail: 44 kiwis on Friday plus 58 on Saturday plus 88 on Sunday gives 190, and the five smaller-than-average kiwis still count.

Contradictory Information

“John is allergic to peanuts. He ate a peanut butter sandwich and felt fine. What
can we conclude about John's allergy?”

Output:

As we can see from the output above, the model handles the contradictory information in the input accurately, laying out the relevant arguments correctly.

Korean Tasks on General Knowledge

"한국의 수도는 무엇이며, 그 도시의 주요 특징은 무엇인가요?"

The English translation of the above query is: “What is the capital of Korea, and what are the main features of that city?”

Output:


As we can see from the output above, the response is accurate with enough details.

Korean Task on General Knowledge with Desired Output in Korean

"인도의 총리는 누구입니까? 한국어로 설명하다"

The English translation of the above query is: “Who is the Prime Minister of India? Explain in Korean.”

Output:


The output shows that, although the model responds in Korean as instructed, the answer is inaccurate. The correct answer would have been “Narendra Modi”.

Conclusion

EXAONE 3.5 by LG AI Research represents a significant advancement in large language models, offering three versatile configurations tailored for diverse applications. With its enhanced architecture, including an extended context length and robust instruction-following capabilities, EXAONE 3.5 excels in real-world tasks and multilingual contexts. Its performance benchmarks demonstrate competitive advantages in long-context processing and general domain tasks, making it a valuable tool for researchers and businesses alike, while adhering to ethical standards in AI development.

Key Takeaways

  • EXAONE 3.5 offers three variants with different parameter counts (2.4 billion, 7.8 billion, and 32 billion), catering to a range of applications, from mobile-friendly solutions to high-performance tasks requiring more computational power.
  • The model supports a maximum context length of 32,768 tokens, allowing it to effectively process longer texts and maintain coherence for tasks requiring in-depth responses.
  • EXAONE 3.5 excels in both English and Korean, making it suitable for a global audience and enabling multilingual use cases.
  • EXAONE 3.5 undergoes a two-stage training process: first, general-domain training, followed by fine-tuning for long-context understanding, optimizing the model’s real-world applicability.
  • A rigorous decontamination process removes test-set overlaps from the training data, ensuring fair and unbiased model evaluations.

Frequently Asked Questions

Q1. How many parameter configurations does EXAONE 3.5 have?

A. EXAONE 3.5 comes in three variants with different parameter counts: 2.4 billion, 7.8 billion, and 32 billion parameters, allowing it to serve different computational needs.

Q2. What languages does EXAONE 3.5 support?

A. EXAONE 3.5 is bilingual, with proficiency in both English and Korean, making it suitable for global and multilingual applications.

Q3. What is the maximum context length supported by EXAONE 3.5?

A. EXAONE 3.5 can handle a maximum context length of 32,768 tokens, enabling it to process longer texts without losing coherence.

Q4. What performance benchmarks were used to evaluate EXAONE 3.5?

A. EXAONE 3.5 was evaluated on three categories of benchmarks: real-world use cases, long-context processing, and general domain tasks such as mathematics, coding, and knowledge-based tasks.

Q5. What is the decontamination process in EXAONE 3.5?

A. EXAONE 3.5 employs a rigorous decontamination process to enhance its generalization performance by removing contaminated examples from the training data. Since the models are trained on web-crawled data, test-set examples that overlap with the training corpus can skew evaluation metrics and compromise reliability.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

Nibedita completed her master’s in Chemical Engineering from IIT Kharagpur in 2014 and is currently working as a Senior Data Scientist. In her current capacity, she works on building intelligent ML-based solutions to improve business processes.
