In the rapidly growing landscape of artificial intelligence and machine learning, TinyLlama 1.1B emerges as a noteworthy development. At a time when computational constraints make it hard to run larger, more complex models, TinyLlama stands out by showing just how much performance a compact model can deliver.
This article provides an analysis of TinyLlama 1.1B, a compact large language model. We will delve into its core aspects: how it was trained, how it performs on benchmarks, and how to implement it in practice using the Hugging Face platform. We will even run the model on the free tier of Google Colab and test its math and reasoning abilities.
This article was published as a part of the Data Science Blogathon.
TinyLlama 1.1B, which builds on the same architecture and tokenizer as Llama 2, is a testament to advancements in language modeling. It's a model with 1.1 billion parameters, trained on a staggering 3 trillion tokens, which puts it in a unique position in the AI landscape. Unlike its larger counterparts, TinyLlama 1.1B is designed to be more efficient and manageable, making it a good choice for applications with limited computational resources.
This open-source model democratizes access to state-of-the-art AI technology, allowing many developers and researchers to explore and innovate in the field of natural language processing. It is a model known for its ability to balance performance with resource consumption, a critical consideration in today’s diverse computational environments.
The training process of TinyLlama 1.1B is as fascinating as the model itself. Training took just 90 days on 16 A100-40G GPUs. The pretraining covered 3 trillion tokens, and the TinyLlama team has published intermediate checkpoints at every half-trillion-token mark.
As for the data, SlimPajama and Starcoderdata were used, with a combined dataset size of 950 billion tokens. The natural-language-to-code ratio was kept at 7:3, i.e. 70% of the data was natural language and 30% was code. Thus, to reach the 3-trillion-token mark for pretraining, TinyLlama was trained for roughly three epochs over this dataset.
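As a quick sanity check on those figures (a rough back-of-the-envelope calculation, not the exact accounting from the TinyLlama report), the numbers line up as follows:

# Rough token accounting for TinyLlama pretraining (approximate figures)
dataset_tokens = 950e9                          # SlimPajama + Starcoderdata combined
natural_language_tokens = 0.7 * dataset_tokens  # ~665B tokens of natural language
code_tokens = 0.3 * dataset_tokens              # ~285B tokens of code
epochs = 3
total_tokens_seen = epochs * dataset_tokens     # ~2.85 trillion, i.e. roughly the 3T mark
print(f"{total_tokens_seen / 1e12:.2f} trillion tokens seen during pretraining")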
There is also a chat version of TinyLlama, called TinyLlama-Chat. Initially, this model underwent fine-tuning on the UltraChat dataset, which contains diverse synthetic conversations generated by ChatGPT. This step was crucial in enabling the model to handle different conversational contexts and styles.
Further refinement was achieved using the DPOTrainer on the UltraFeedback dataset. This training phase focused on aligning the model's responses with human-like conversational patterns. The result is a model that not only grasps information on different topics but also interacts in a natural and engaging way.
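For readers curious what this kind of preference training looks like in code, below is a heavily simplified sketch using the trl library. The checkpoint name and the tiny preference pair are invented for illustration (the real training used the much larger UltraFeedback dataset), and argument names differ across trl versions (older releases use tokenizer= instead of processing_class=), so treat this as a sketch rather than the TinyLlama team's actual recipe.

# A toy DPO-style preference-tuning sketch with trl (not the original training setup)
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # assumed Hugging Face checkpoint
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# A made-up preference pair in the prompt / chosen / rejected format DPO expects
prefs = Dataset.from_dict({
    "prompt": ["What is the capital of France?"],
    "chosen": ["The capital of France is Paris."],
    "rejected": ["France is a country in Europe."],
})

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(output_dir="tinyllama-dpo-demo", per_device_train_batch_size=1),
    train_dataset=prefs,
    processing_class=tokenizer,  # older trl versions call this argument "tokenizer"
)
trainer.train()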
Evaluating the performance of TinyLlama 1.1B reveals its capability to deliver high-quality responses swiftly. Its training has given it the ability to cater to multilingual applications, an important feature in our globalized world. Despite its smaller size, TinyLlama 1.1B comes surprisingly close to its larger counterparts in response quality and speed, making it a potent tool for different AI applications.
The benchmarks for TinyLlama 1.1B, while less extensive than those for larger models, still demonstrate its proficiency in handling complex language tasks. Its ability to generate coherent and contextually relevant responses in multiple languages is particularly impressive. The model was tested on benchmarks such as HellaSwag, WinoGrande, ARC, MMLU, and others, and its combined average score came out to 52.99. This is notably better than another 1-billion-parameter model, Pythia 1B, which achieved an average score of 48.3. The table below shows the individual benchmark scores.
| Benchmark | TinyLlama 1.1B Score |
|---|---|
| HellaSwag | 59.2 |
| Obqa | 36.0 |
| WinoGrande | 59.12 |
| ARC_c | 30.12 |
| ARC_e | 55.25 |
| boolq | 57.83 |
| piqa | 73.29 |
| Average | 52.9 |
In this section, we will download the quantized version of TinyLlama 1.1B Chat and run it in Google Colab. Before downloading the model, we need to install the following Python packages:
!CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip3 install llama-cpp-python
!pip3 install huggingface-hub
To test the TinyLlama 1.1B Chat model, we first need to download its quantized version. To download it, we will run the following code:
from huggingface_hub import hf_hub_download
# specifying the model name
model_name = "TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF"
# specifying the type of quantization of the model
model_file = "tinyllama-1.1b-chat-v1.0.Q8_0.gguf"
# download the model by specifying the model name and quantized model name
model_path = hf_hub_download(model_name, filename=model_file)
Here, the huggingface_hub library takes care of downloading the quantized model. For this, we import hf_hub_download, which takes in the following parameters: the repository ID of the model (model_name) and the name of the quantized file to fetch (model_file). It returns the local path of the downloaded model, which we store in model_path.
Now, we can load this model through the llama-cpp-python library. The code for loading the model is shown below.
from llama_cpp import Llama
llm = Llama(
    model_path=model_path,  # path to the downloaded quantized GGUF model
    n_ctx=512,              # the number of input tokens the model can take
    n_threads=8,            # the number of CPU threads to use
    n_gpu_layers=40         # how many layers of the model to offload to the GPU
)
We import the Llama class from llama_cpp, which takes in the following parameters: model_path (the local path of the downloaded GGUF file), n_ctx (the context length, i.e. the number of input tokens the model can take), n_threads (the number of CPU threads to use), and n_gpu_layers (the number of model layers to offload to the GPU).
Running this code loads the TinyLlama 1.1B Chat quantized model onto the GPU and sets the appropriate context length. Now, it's time to run some inference with this model. For this, we work with the code below:
output = llm(
    "<|im_start|>user\nWho are you?<|im_end|>\n<|im_start|>assistant\n",  # user prompt in the chat template
    max_tokens=512,  # maximum number of output tokens to generate
    stop=["</s>"],   # token which tells the LLM to stop
)
print(output['choices'][0]['text']) # Model generated text
To run inference, we pass the following parameters to the LLM: the user prompt wrapped in TinyLlama-Chat's <|im_start|>/<|im_end|> chat template, max_tokens (the maximum number of tokens to generate), and stop (the token at which generation stops).
When we run this, the generated text is stored in the output variable. The result is returned in a format similar to an OpenAI API response, so we can access the generation through the print statement above, much as we would access a completion from the OpenAI client.
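For reference, the dictionary returned by llama-cpp-python has roughly the shape sketched below; the field values are made-up placeholders, not actual model output.

# Illustrative shape of the response object (values are placeholders)
output = {
    "id": "cmpl-...",
    "object": "text_completion",
    "created": 1700000000,
    "model": "/path/to/tinyllama-1.1b-chat-v1.0.Q8_0.gguf",
    "choices": [
        {
            "text": "I am TinyLlama, a small language model ...",
            "index": 0,
            "logprobs": None,
            "finish_reason": "stop",
        }
    ],
    "usage": {"prompt_tokens": 20, "completion_tokens": 45, "total_tokens": 65},
}
# which is why output['choices'][0]['text'] gives us the generated text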
For a model of this size, the generated response is top-notch, which is unexpected: the grammar and tone look perfectly fine, and there is no sign of repeated sentences. Let's try testing the model's reasoning capabilities.
output = llm(
    "<|im_start|>user\nIf all students who study hard get good grades, "
    "and John got good grades, can we conclude that John studied hard?"
    "<|im_end|>\n<|im_start|>assistant\n",
    max_tokens=512,
    stop=["</s>"],
)
print(output['choices'][0]['text'])
output = llm(
    "<|im_start|>user\nHow fast can a snake fly?\n<|im_end|>\n<|im_start|>assistant\n",
    max_tokens=512,
    stop=["</s>"],
)
print(output['choices'][0]['text'])
So far, so good. From the examples we have seen, the model generates good answers. But this may not hold in all cases, because we have only tested it on a limited number of questions. Let's also test the model's mathematical reasoning capabilities.
output = llm(
    "<|im_start|>user\nJohn is twice as old as Sarah, and Sarah is three years "
    "older than Mary. If Mary is 10 years old, how old is John?\n<|im_end|>\n<|im_start|>assistant\n",
    max_tokens=512,
    stop=["</s>"],
)
print(output['choices'][0]['text'])
output = llm(
    "<|im_start|>user\nWhat is the missing number in this pattern: "
    "1, 4, 9, 16, __, 36?\n<|im_end|>\n<|im_start|>assistant\n",
    max_tokens=512,
    stop=["</s>"],
)
print(output['choices'][0]['text'])
From the examples we have seen, it is clear that TinyLlama-Chat performs poorly even on simple math aptitude questions. This is expected, because the model was not pretrained on a dedicated maths dataset, and the quality of its generations could likely be improved by fine-tuning it on math-focused datasets.
Speaking of fine-tuning, TinyLlama is a go-to choice for those who are restricted to limited hardware and wish to fine-tune a large language model on their own dataset.
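To illustrate why its size helps here, below is a minimal sketch of parameter-efficient (LoRA) fine-tuning with the peft library. The checkpoint name and hyperparameters are assumptions chosen for illustration, not a recommended recipe from the TinyLlama authors.

# Minimal LoRA fine-tuning sketch for TinyLlama (assumed checkpoint and settings)
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # assumed Hugging Face checkpoint
model = AutoModelForCausalLM.from_pretrained(base)
tokenizer = AutoTokenizer.from_pretrained(base)

# Attach small LoRA adapters so only a tiny fraction of the weights is trained,
# keeping memory requirements within reach of a single consumer GPU
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # reports trainable vs. total parameter counts

The wrapped model can then be trained with the usual transformers Trainer or TRL's SFTTrainer on your own dataset.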
Given the compact size of TinyLlama, with its 1.1 billion parameters, its applications are mainly suited to environments where larger models are not feasible because of hardware limitations, or where efficiency is the priority. Here are some specific use cases, keeping its size in consideration:
Mobile Applications: TinyLlama’s smaller size makes it a good choice for integrating into mobile apps where on-device processing is necessary. This includes language translation apps, personal assistant features, and chatbots that can operate efficiently on smartphones.
Embedded Systems in IoT Devices: In the Internet of Things (IoT) field, computing resources are often limited. TinyLlama can add intelligent language-processing capabilities to devices like smart home assistants, wearable tech, and other connected equipment.
Edge Computing: For applications that benefit from processing data closer to the source rather than in a centralized cloud environment, TinyLlama can be employed effectively. This includes real-time language processing in automotive systems, manufacturing equipment, and other edge devices.
Low-Resource Language Research: Due to its smaller size and lower computational requirements, TinyLlama can be a valuable tool in linguistic research, especially for under-resourced languages where large-scale model training isn’t feasible.
Educational Tools: In educational settings, especially those with limited access to high-end computing resources, TinyLlama can be used to develop language learning apps, interactive educational tools, and other learning aids.
Content Generation for Small Businesses: Small businesses with limited resources can use TinyLlama for generating content, like product descriptions, marketing copy, and customer correspondence, without the need for extensive computing power.
Prototyping and Experimentation: Developers and researchers who wish to experiment with language models but lack access to high-powered computing resources can use TinyLlama to prototype and develop new NLP applications.
Efficient Data Analysis: TinyLlama can be used for text analysis and data extraction in scenarios where quick and efficient processing is needed, such as analyzing customer feedback, survey responses, or social media interactions (a short sketch of this follows the list).
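As a small illustration of the last point, the quantized model we loaded earlier can be reused for a quick piece of text analysis. The review text below is made up for the example.

# Reusing the llm object loaded earlier for a quick sentiment-style analysis
review = "The delivery was late and the packaging was damaged, but support resolved it quickly."
prompt = (
    "<|im_start|>user\nClassify the sentiment of this customer review as "
    f"positive, negative, or mixed, and give a one-line reason:\n{review}<|im_end|>\n"
    "<|im_start|>assistant\n"
)
output = llm(prompt, max_tokens=128, stop=["</s>"])
print(output['choices'][0]['text'])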
TinyLlama 1.1B is a testament to the advancements in the field of AI and natural language processing. Its development and open availability are an important step toward smaller, more efficient language models with fast inference. By balancing a small parameter footprint with robust performance, TinyLlama 1.1B addresses the critical need for models that are both powerful and practical across a wide array of applications. Its ability to understand and generate language in a human-like manner while being light enough for diverse computing environments makes it a go-to choice for people who struggle to run large language models on their machines. The model can be fine-tuned easily on a custom dataset, even with limited computing resources.
Q. What is TinyLlama 1.1B?
A. TinyLlama 1.1B is a compact, efficient large language model with 1.1 billion parameters, trained on 3 trillion tokens, suitable for applications with limited computational resources.
Q. How was TinyLlama 1.1B trained?
A. It was trained over 90 days using 16 A100-40G GPUs on datasets including SlimPajama and Starcoderdata, with a natural-language-to-code ratio of 7:3.
Q. How does TinyLlama 1.1B perform on benchmarks?
A. TinyLlama 1.1B shows its strength in handling complex language tasks, scoring an average of 52.99 across benchmarks like HellaSwag, MMLU, and WinoGrande.
Q. What are the main use cases of TinyLlama 1.1B?
A. It's suitable for applications where size and speed are important. These include mobile apps, IoT equipment like home automation devices, content generation for small businesses, and efficient data analysis.
Q. Can TinyLlama 1.1B run on limited hardware?
A. Absolutely. It's a good choice for developers and researchers who lack access to high-powered computing resources for prototyping and developing new NLP applications, and the TinyLlama model can even be run on a Raspberry Pi.
Q. What are the limitations of TinyLlama 1.1B?
A. While it performs well on various language tasks, it shows limitations in mathematical reasoning, which can be improved by fine-tuning on relevant datasets.