In Natural Language Processing (NLP), the development of Large Language Models (LLMs) has proven to be a transformative endeavor. These models, with their massive parameter counts and training on extensive datasets, have demonstrated unprecedented proficiency across a wide range of NLP tasks. However, the exorbitant cost of training such models from scratch has prompted researchers to explore alternative strategies. One pioneering strategy for enhancing LLM capabilities is knowledge fusion, a concept explored in depth in the research paper titled “Knowledge Fusion of Large Language Models” by Wan, Huang, Cai, Quan, and others.
This approach addresses the redundancy that arises as newly developed LLMs replicate one another's functionality. The paper delves into the process of merging the knowledge of multiple LLMs, presenting a promising avenue for refining and amplifying the performance of these language models.
The fundamental idea is to combine the strengths of existing LLMs, transcending the limitations of any individual model. By fusing existing pre-trained LLMs, we can create a more powerful model that surpasses the individual strengths of each source model.
The paper begins by highlighting the challenges and costs of training LLMs from scratch. The authors propose knowledge fusion as an efficient and cost-effective alternative. Rather than merging weights directly, the approach focuses on externalizing the collective knowledge of source LLMs and transferring it to a target model. The research introduces FUSELLM, a method that leverages the generative distributions of source LLMs, aiming to enhance the target model’s capabilities beyond any individual source LLM.
The primary objective of LLM fusion is to externalize the knowledge embedded within multiple source LLMs and integrate their capabilities into a target LLM. The paper does this by having each source LLM predict the next token of a given text; the probabilistic distributions generated by the different source LLMs for the same text are then fused into a single representation, creating a unified probabilistic understanding of the text.
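To make this concrete, here is a minimal PyTorch sketch of a training objective in this spirit: the target model keeps its usual causal language-modeling loss and adds a divergence term that pulls its next-token distribution toward the fused distribution precomputed from the source LLMs. The function name, the KL-divergence choice, and the `lambda_clm` weighting here are illustrative assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def fusion_training_loss(target_logits, fused_probs, labels, lambda_clm=0.9):
    """Sketch of a combined objective: causal LM loss + fusion term.

    target_logits: (batch, seq_len, vocab) logits from the target LLM
    fused_probs:   (batch, seq_len, vocab) fused next-token probabilities
                   precomputed from the source LLMs and already aligned
                   to the target vocabulary
    labels:        (batch, seq_len) gold next-token ids
    """
    # Standard next-token prediction loss on the original text.
    clm = F.cross_entropy(
        target_logits.reshape(-1, target_logits.size(-1)),
        labels.reshape(-1),
        ignore_index=-100,
    )
    # Divergence between the target's distribution and the fused distribution.
    log_probs = F.log_softmax(target_logits, dim=-1)
    fusion = F.kl_div(log_probs, fused_probs, reduction="batchmean")
    # Illustrative weighting between the two terms.
    return lambda_clm * clm + (1.0 - lambda_clm) * fusion
```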
The paper introduces two crucial implementation details to ensure effective knowledge fusion: token alignment and fusion strategies.
Token alignment is achieved through a Minimum Edit Distance (MinED) strategy, enhancing the success rate of aligning tokens from different LLMs.
The fusion strategies, MinCE and AvgCE, weigh the quality of the different source LLMs by their cross-entropy scores on the text: MinCE keeps only the distribution matrix of the source with the lowest cross-entropy, while AvgCE takes a cross-entropy-weighted average of all the distribution matrices. Both ideas are sketched below.
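A rough Python sketch of both ideas, under the assumption that each source's distribution matrix has already been projected onto the target vocabulary, might look as follows; the softmax weighting used for AvgCE and the brute-force vocabulary scan for MinED are simplifications for illustration, not the paper's exact implementation.

```python
import editdistance  # pip install editdistance; any edit-distance helper works
import torch

def align_token(source_token, target_vocab):
    """MinED-style alignment: map a source-model token to the target-vocab
    token with the smallest edit distance (exact matches win outright)."""
    if source_token in target_vocab:
        return source_token
    return min(target_vocab, key=lambda t: editdistance.eval(source_token, t))

def fuse_distributions(dist_matrices, cross_entropies, strategy="min_ce"):
    """Fuse per-token distribution matrices from several source LLMs.

    dist_matrices:   list of (seq_len, vocab) tensors, one per source LLM,
                     already aligned to the target vocabulary
    cross_entropies: list of scalars, each source's cross-entropy on the text
                     (lower = better fit, so it gets more weight)
    """
    if strategy == "min_ce":
        # MinCE: keep only the distribution from the best-scoring source.
        best = min(range(len(dist_matrices)), key=lambda i: cross_entropies[i])
        return dist_matrices[best]
    elif strategy == "avg_ce":
        # AvgCE: cross-entropy-weighted average over all sources
        # (softmax over negative CE is an illustrative weighting choice).
        weights = torch.softmax(-torch.tensor(cross_entropies), dim=0)
        stacked = torch.stack(dist_matrices)  # (n_sources, seq_len, vocab)
        return (weights.view(-1, 1, 1) * stacked).sum(dim=0)
    raise ValueError(f"unknown strategy: {strategy}")
```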
The research conducts experiments on a challenging scenario of LLM fusion, in which the source models exhibit minimal commonalities. Three representative open-source models – Llama-2, OpenLLaMA, and MPT – are selected as the source LLMs, with another Llama-2 serving as the target LLM. The experiments span benchmarks assessing reasoning, commonsense, and code-generation capabilities.
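For orientation, the sketch below shows how the distribution-extraction step for such a setup could be wired up with Hugging Face transformers. The checkpoint IDs (the commonly used 7B variants) and the helper function are my assumptions, and loading three 7B models this way is meant to be illustrative rather than efficient.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed 7B checkpoints; the paper may use different variants.
SOURCE_REPOS = [
    "meta-llama/Llama-2-7b-hf",
    "openlm-research/open_llama_7b",
    "mosaicml/mpt-7b",
]

@torch.no_grad()
def next_token_distributions(text, repo_id):
    """Return one source model's per-position next-token probabilities
    for a piece of training text (its own tokenizer, its own vocabulary)."""
    tok = AutoTokenizer.from_pretrained(repo_id)
    model = AutoModelForCausalLM.from_pretrained(
        repo_id, torch_dtype=torch.float16, device_map="auto"
    )
    inputs = tok(text, return_tensors="pt").to(model.device)
    logits = model(**inputs).logits  # (1, seq_len, vocab)
    return torch.softmax(logits.float(), dim=-1)

# Distributions from all three sources; token alignment (previous sketch)
# then maps them onto the target Llama-2 vocabulary before fusion.
dists = {r: next_token_distributions("The quick brown fox", r) for r in SOURCE_REPOS}
```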
The comprehensive evaluation of FUSELLM across various benchmarks provides valuable insight into its efficacy. Table 1 presents the overall results of FUSELLM compared to baseline methods on Big-Bench Hard (BBH). Notably, FUSELLM demonstrates an average relative performance gain of 5.16% over the original Llama-2 across all 27 tasks. Specific tasks, such as Hyperbaton, show substantial improvements, underscoring FUSELLM’s ability to leverage collective knowledge for better performance.
Moving on to the Common Sense (CS) benchmark in Table 2, FUSELLM consistently outperforms baselines across all tasks, achieving a relative performance improvement of 1.25% over Llama-2. This trend holds true even in challenging tasks like ARC-challenge and OpenBookQA, where FUSELLM exhibits significant improvements, highlighting its effectiveness in addressing intricate problems.
In the context of code generation, Table 3 illustrates the zero-shot performance of FUSELLM on the MultiPL-E (ME) benchmark. Outperforming Llama-2 in 9 out of 10 tasks, FUSELLM showcases a notable enhancement in the pass@1 score, particularly for specific programming languages like R. Despite a performance gap compared to OpenLLaMA or MPT, FUSELLM still achieves a remarkable average performance gain of 6.36%, surpassing the 1.37% improvement observed in Llama-2 CLM.
A crucial aspect of FUSELLM’s success lies in its ability to utilize fused probabilistic distributions from multiple LLMs. Figure 2 compares the few-shot Chain-of-Thought (CoT) performance of Llama-2 CLM and FUSELLM at varying scales of training data on BBH. FUSELLM improves exact match (EM) accuracy by 2.5% and reaches the best performance of Llama-2 CLM with only 0.52 billion tokens, a 3.9× reduction in token requirements. This indicates that the probabilistic distributions derived from the LLMs contain knowledge that is more readily learnable than the original text sequences, which accelerates the optimization process.
Delving into the implementation details of FUSELLM reveals critical considerations for its success. The number of source LLMs, token alignment criteria, and the choice of fusion function play pivotal roles in shaping FUSELLM’s performance.
Comparative analyses with traditional techniques like knowledge distillation and ensemble/merging shed light on FUSELLM’s unique strengths.
Also read: Knowledge Distillation: Theory and End to End Case Study
You can find the code, model weights, and data publicly available here: GitHub FUSELLM
The paper concludes with compelling results, showcasing the effectiveness of FUSELLM over the individual source LLMs and established baselines, and it opens a promising avenue for future exploration of LLM fusion. The findings emphasize the potential of combining the diverse capabilities of structurally different LLMs, pointing to a cost-effective and powerful approach to developing large language models.
The knowledge fusion of large language models is an innovative solution in a world where the demand for advanced natural language processing capabilities continues to rise. This research paves the way for future endeavors in creating unified models that harness the collective intelligence of diverse LLMs, pushing the boundaries of what is achievable in the realm of natural language understanding and generation.
I’m eager to know your opinions regarding the Knowledge Fusion of Large Language Models (LLMs). Feel free to share your insights on any other noteworthy and informative papers you may have encountered in the comments section.
Also read: A Comprehensive Guide to Fine-Tuning Large Language Models