Transformers and Large Language Models have taken the world by storm since their introduction in the field of Natural Language Processing (NLP). Since their inception, the field has evolved quickly with innovations and research that make these LLMs more efficient, including LoRA (Low-Rank Adaptation), Flash Attention, quantization, and, more recently, approaches for merging notable LLMs. In this guide, we will look at a new approach to merging LLMs, SOLAR 10.7B, introduced by Upstage AI.
Upstage AI introduced a new 10.7 Billion Parameter model, SOLAR 10.7B. This model is the result of merging two 7 Billion Parameter models, specifically two Llama 2 7B models, which were then continually pretrained to create SOLAR 10.7B. The unique aspect of this merge is the application of a new approach called Depth Up-Scaling (DUS), in contrast to the Mixtral approach, which employs a mixture of experts.
The new 10.7B model outperformed Mistral 7B and Qwen 14B. An Instruct version, SOLAR 10.7B Instruct, has also been released, and upon its release it topped the leaderboard, surpassing both Qwen 72B and the Mixtral 8x7B Large Language Model. Despite being a 10.7 Billion Parameter model, SOLAR was able to outperform LLMs many times its size.
Let’s understand how it all began and how SOLAR 10.7B was formed. It all starts with a single Base Model. Upstage chose Llama 2, containing 32 Transformer Layers, as its Base Model because of its wide Open Source contributor community. Then a copy of this Base Model was created.
We then have two Base Models. As for the weights, Upstage took the pretrained weights from Mistral 7B because it was the best-performing model at the time. Now the depthwise scaling begins. Each of the Base Models contains 32 Layers. From these, we remove m Layers: the final m Layers from the original model and the first m Layers from its copy. With m = 8, this leaves 24 Layers in each of them. Then we merge these two models:
The two Base Models are concatenated to form the scaled model, which now contains 48 Layers. The scaled model performs poorly right after the merge, so it undergoes continued pretraining. This Depthwise Scaling followed by continued pretraining together make up Depth Up-Scaling (DUS).
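To make the layer arithmetic concrete, here is a minimal, purely illustrative sketch of the depthwise scaling step. It is not Upstage's actual code; it assumes a Hugging Face Llama-style model whose transformer blocks live in model.model.layers, and the variable names are mine.

import copy
import torch.nn as nn
from transformers import AutoModelForCausalLM

# Illustrative sketch of Depthwise Scaling (NOT Upstage's actual training code).
# Assumes a Llama-style model whose transformer blocks live in model.model.layers.
n, m = 32, 8                                      # 32 layers per base model, drop m = 8

base  = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # gated repo, needs access
clone = copy.deepcopy(base)

kept_from_original = base.model.layers[: n - m]   # drop the final m layers  -> 24 layers
kept_from_copy     = clone.model.layers[m:]       # drop the first m layers  -> 24 layers

# Concatenate the two 24-layer stacks into a single 48-layer scaled model
base.model.layers = nn.ModuleList(list(kept_from_original) + list(kept_from_copy))
base.config.num_hidden_layers = len(base.model.layers)

print(base.config.num_hidden_layers)   # 48 layers, before continued pretraining

The print statement confirms the depth: 32 - 8 = 24 layers per copy, and 24 + 24 = 48 layers in the scaled model.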
The scaled model needs to be pretrained because of the decrease in performance caused by merging. The makers state that performance rises quickly with continued pretraining. The pretraining / fine-tuning involved two stages.
The first stage was Instruction Fine-Tuning. In this type of fine-tuning, the model is trained on datasets so that it learns to follow instructions. The process involved popular Open Source datasets such as Alpaca-GPT4 and OpenOrca. The paper notes that only a subset of the data was used to fine-tune the merged model. Along with the Open Source data, Upstage also trained it on some closed-source math data.
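The exact prompt template and data mix Upstage used are not reproduced here, but the "### User: / ### Assistant:" format used for inference later in this article gives the general idea. Below is a rough sketch, with a made-up sample record, of how an Alpaca-style instruction pair could be rendered into that template.

# Illustrative only: a made-up Alpaca-style record rendered into the
# "### User: / ### Assistant:" template used for inference later in this article.
sample = {
    "instruction": "Summarize the following sentence in five words.",
    "input": "SOLAR 10.7B was built by depth up-scaling two Llama 2 style models.",
    "output": "SOLAR merges and rescales Llama.",
}

def to_chat_format(record: dict) -> str:
    # Combine the instruction and optional input into a single user turn
    user_turn = record["instruction"]
    if record.get("input"):
        user_turn += "\n" + record["input"]
    return f"### User:\n{user_turn}\n\n### Assistant:\n{record['output']}"

print(to_chat_format(sample))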
In the second stage, Alignment Tuning is performed. Here, the stage-one fine-tuned model is further fine-tuned to be more aligned with humans or with powerful AIs like GPT-4. This was done through the DPOTrainer (Direct Preference Optimization), an RLHF (Reinforcement Learning from Human Feedback)-like technique.
In Direct Preference Optimization, we have a dataset containing three columns: a prompt, a preferred answer, and a rejected answer. This is used to train the scaled model so that it generates the kind of answers we want it to generate. The same datasets that were used for instruction fine-tuning are used here as well.
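As a minimal sketch of that three-column preference format, here is an invented example built with the datasets library. The rows are made up for illustration and are not from Upstage's actual alignment data; a DPO-style trainer (for example, TRL's DPOTrainer) would consume a dataset shaped like this.

from datasets import Dataset

# Invented example of the three-column preference format used by DPO-style training.
preference_rows = {
    "prompt": [
        "### User:\nExplain Depth Up-Scaling in one sentence.\n\n### Assistant:",
    ],
    "chosen": [
        "Depth Up-Scaling stacks two trimmed copies of a base model and then continues pretraining.",
    ],
    "rejected": [
        "It is a way to make the model smaller by deleting layers.",
    ],
}

dpo_dataset = Dataset.from_dict(preference_rows)
print(dpo_dataset)
# The trainer optimizes the model to prefer the "chosen" answer over the
# "rejected" one for each prompt, without training a separate reward model.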
The Hugging Face Open LLM Leaderboard uses several benchmarks to evaluate the capabilities of Large Language Models (LLMs), each assessing a different aspect of an LLM’s performance: ARC (AI2 Reasoning Challenge), HellaSwag, MMLU (Massive Multitask Language Understanding), TruthfulQA, Winogrande, and GSM8K.
The base SOLAR 10.7B model outperformed models like Mistral 7B Instruct v0.2 and Qwen 14B. The Instruct version of SOLAR 10.7B was even able to beat very large models like Mixtral 8x7B, Qwen 72B, Falcon 180B, and other huge Large Language Models. It was ahead of all of them on the ARC and TruthfulQA benchmarks.
The SOLAR 10.7B model is readily available on the Hugging Face Hub and works with the transformers library. Quantized versions of SOLAR 10.7B are available as well. In this section, we will download a quantized version, give the model different tasks, and look at the output it generates.
To test the quantized version of SOLAR 10.7B, we will work with the llama-cpp-python library, which lets us run quantized Large Language Models. For this demo, we will use the free tier of Google Colab.
!CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip3 install llama-cpp-python
!pip3 install huggingface-hub
To work with the SOLAR 10.7B model, we need to first download the quantized version of it. To download it, we will run the following code:
from huggingface_hub import hf_hub_download
# specifying the model name
model_name = "TheBloke/SOLAR-10.7B-Instruct-v1.0-GGUF"
# specifying the type of quantization of the model
model_file = "solar-10.7b-instruct-v1.0.Q2_K.gguf"
# download the model by specifying the model name and quantized model name
model_path = hf_hub_download(model_name, filename=model_file)
Here, we work with the huggingface_hub library to download the quantized model. For this, we import hf_hub_download, which takes in the repository name (model_name) and the quantized model filename (model_file) and returns the local path of the downloaded file.
Now, we can load this model through the llama_cpp_python library. The code for loading the model looks like this:
from llama_cpp import Llama
llm = Llama(
model_path=model_path,
n_ctx=512, # the context length: how many tokens the model can take
n_threads=8, # the number of threads to use
n_gpu_layers=110 # how many layers of the model to offload to the GPU
)
We import the Llama class from llama_cpp, which takes in the model_path, the context length n_ctx, the number of CPU threads n_threads, and n_gpu_layers, the number of layers to offload to the GPU.
Running this code loads the quantized SOLAR 10.7B model onto the GPU and sets the appropriate context length. Now it’s time to run some inference on this model, which we do with the code below.
output = llm(
"### User:\nWho are you?\n\n### Assistant:", # User Prompt
max_tokens=512, # the number of output tokens generated
stop=["</s>"], # the token which tells the LLM to stop
)
print(output['choices'][0]['text']) # llm generated text
To run inference, we pass the following to the LLM: the prompt (in the "### User: ... ### Assistant:" format), max_tokens to cap the length of the output, and the stop token that tells the model when to stop generating.
Running this stores the result in the output variable. The structure of the result is similar to an OpenAI API response, so we can access the generation through the print statement shown above, much as we would access a completion from the OpenAI responses. The output generated can be seen below.
The generated sentence looks good, with no major grammatical mistakes. Let’s test the common-sense side of the model with the following prompts.
output = llm(
"### User:\nHow many eggs can a monkey lay in its lifetime?\n\n### Assistant:",
max_tokens=512,
stop=["</s>"],
)
print(output['choices'][0]['text'])
output = llm(
"### User:\nHow many smartphones can a human eat?\n\n### Assistant:",
max_tokens=512,
stop=["</s>"],
)
print(output['choices'][0]['text'])
Here we see two examples related to common sense, and SOLAR 10.7B handles them surprisingly well. The Large Language Model was able to deliver the right answers along with some useful context. Let’s now test the math and reasoning abilities of the model with the following prompts.
output = llm(
"### User:\nLook at this series: 80, 10, 70, 15, 60, ... \
What number should come next?\n\n### Assistant:",
max_tokens=512,
stop=["</s>"],
)
print(output['choices'][0]['text'])
output = llm(
"### User:\nJohn runs faster than Ken. Magnus runs faster than John. \
Does Ken run faster than Magnus?\n\n### Assistant:",
max_tokens=512,
stop=["</s>"],
)
print(output['choices'][0]['text'])
For the given example prompts, SOLAR 10.7B generated good responses. It answered the mathematical and logical reasoning questions correctly, as well as the common-sense ones. Overall, we can conclude that the SOLAR 10.7B Large Language Model generates good responses.
Mixtral 8x7B MoE was created by Mistral AI using a Mixture of Experts architecture. In brief, in this Mixture of Experts setup, Mistral combines eight 7 Billion Parameter expert models: the feed-forward networks of the transformer layers are replaced by expert layers, which is why Mixtral 8x7B is considered to have 8 experts. Every time the model takes in an input prompt, a gating mechanism selects only 2 of these 8 experts; those 2 experts then process the input and generate the final output tokens. So we can see that there is some complexity involved in this type of merging: we have to replace the feed-forward layers with expert layers and introduce a gating mechanism that selects between them.
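To illustrate the gating idea, here is a toy sketch of top-2 expert routing in PyTorch. It is not Mistral AI's Mixtral implementation: the experts here are plain linear layers purely for illustration, whereas in the real model they are full feed-forward blocks inside every transformer layer.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy sketch of top-2 expert routing, only to illustrate the gating idea.
class ToyTop2Router(nn.Module):
    def __init__(self, hidden: int = 64, n_experts: int = 8):
        super().__init__()
        self.gate = nn.Linear(hidden, n_experts)  # scores each expert per token
        self.experts = nn.ModuleList(nn.Linear(hidden, hidden) for _ in range(n_experts))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        scores = self.gate(x)                       # (tokens, n_experts)
        top_w, top_idx = scores.topk(2, dim=-1)     # pick 2 experts per token
        top_w = F.softmax(top_w, dim=-1)            # normalize the two weights
        out = torch.zeros_like(x)
        for slot in range(2):
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, slot] == e        # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += top_w[mask, slot, None] * expert(x[mask])
        return out

tokens = torch.randn(4, 64)            # 4 tokens, hidden size 64
print(ToyTop2Router()(tokens).shape)   # torch.Size([4, 64])

Each token's output is a weighted combination of only two expert outputs, which is what keeps the per-token compute of a Mixture of Experts model much lower than its total parameter count would suggest.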
The SOLAR 10.7B model from Upstage, on the other hand, leverages the Depth Up-Scaling method. In Depth Up-Scaling, we simply remove a number of final layers from one Base Model and the same number of starting layers from its copy, then merge the two by stacking one on top of the other. With just a few epochs of continued pretraining and fine-tuning, the merged model shows a rapid rise in performance. Here we do not replace existing layers with other layers, nor do we need a gating mechanism. Overall, Depth Up-Scaling is a simple and effective way to merge models without those added complexities.
Comparing performance, even though Depth Up-Scaling just combines two 7 Billion Parameter models, SOLAR 10.7B was able to clearly outperform Mixtral 8x7B, a far larger model. This shows the effectiveness of a simple merging method over a complex one like the Mixture of Experts.
In this guide, we took a look at SOLAR 10.7B, the recently released 10.7 Billion Parameter model from Upstage AI. Upstage AI took a new approach to merging and scaling models. The paper uses a method called Depth Up-Scaling to merge two Llama 2 7 Billion Parameter models by removing some of the final transformer layers from one and some of the starting layers from the other. The merged model was then fine-tuned on Open Source datasets and evaluated on the Open LLM Leaderboard, where it achieved the highest H6 score and topped the leaderboard.
A. SOLAR 10.7B is a 10.7 billion parameter model by Upstage AI, utilizing a unique merging technique called Depth Up-Scaling. It distinguishes itself by outperforming larger LLMs and showcasing advancements in merging models.
A. Depthwise Scaling involves two copies of a base model. Before merging, the final layers are removed from one copy and the initial layers from the other; the trimmed copies are then merged by stacking one on top of the other.
A. SOLAR 10.7B undergoes a two-stage pretraining process. Instruction fine-tuning involves training the model on datasets emphasizing instruction-following. Alignment tuning refines the model’s alignment with human preferences using a technique called Direct Preference Optimization (DPO).
A. SOLAR 10.7B excels across various benchmarks, including ARC (AI2 Reasoning Challenge), MMLU (Massive MultiTask Language Understanding), HellaSwag, Winogrande, TruthfulQA, and GSM8K. It achieves high scores, demonstrating its versatility in handling different language tasks.
A. SOLAR 10.7B surpasses models like Mistral 7B and Qwen 14B, showcasing superior performance despite having fewer parameters. The Instruct version even competes with and outperforms very large models, including Mixtral 8x7B and Qwen 72B, on various benchmarks.