We’re already into the second month of 2025, and every passing day brings us closer to Artificial General Intelligence (AGI)—AI that can tackle complex problems across multiple sectors at a human level.
Take DeepSeek, for instance. Before 2024, could you have imagined an organization building a cutting-edge generative AI model for just a few million dollars and still going toe-to-toe with OpenAI’s flagship models? Probably not. But it’s happening.
Now, OpenAI has countered with the release of o3-mini, further accelerating AI’s evolution. Its reasoning capabilities are pushing the boundaries of AI development, making the technology more accessible and powerful. This AI war will go on! And as Sam Altman noted recently in his Three Observations blog, the cost of using a given level of AI drops roughly tenfold every 12 months, and with lower prices comes exponentially greater adoption.
At this rate, in a decade, every person on Earth could accomplish more than today’s most impactful individuals, solely because of advancements in AI. This isn’t just progress; it’s a revolution. In this battle of Large Language Models (LLMs), the key to dominance lies in a handful of fundamental stages, and the first of them is pretraining.
In this article, we’ll walk through LLM pretraining as covered in Andrej Karpathy’s “Deep Dive into LLMs like ChatGPT”: what it is, how it works, and why it’s the foundation of modern AI capabilities.
Before diving into the pretraining stage, it helps to look at the bigger picture: how does ChatGPT, Claude, or any other LLM generate its output? For instance, suppose we ask ChatGPT, “Who is your parent company?”
The real question is: how does ChatGPT produce that answer? In other words, what’s happening behind the scenes?
Let’s begin with – What is the LLM Pretraining Stage?
The LLM pretraining stage is the first phase of teaching a large language model (LLM) how to understand and generate text. Think of it as reading a massive number of books, articles, and websites to learn grammar, facts, and common patterns in language. During this stage, the model processes billions of words (data) and repeatedly predicts the next word (token) in a sentence, refining its ability to generate coherent and relevant responses. However, at this point, it doesn’t fully “understand” meaning like a human does; it just recognizes patterns and probabilities.
Pre-trained Large Language Models (LLMs) can perform a wide range of tasks, including text generation, summarization, translation, and sentiment analysis. They assist in code generation, question-answering, and content recommendation. LLMs can extract insights from unstructured data, facilitate chatbots, and automate customer support. They enhance creative writing, provide tutoring, and even generate realistic conversations. Additionally, they assist in data augmentation, legal analysis, and medical research by analyzing vast amounts of information efficiently. Their ability to understand and generate human-like text makes them valuable for various industries, from education and finance to healthcare and entertainment. However, they require fine-tuning for domain-specific accuracy.
Here we will use ChatGPT as the running example to explain the concepts.
There are multiple stages of training an LLM but here we will first talk about the LLM Pretraining stage.
The performance of a large language model (LLM) is deeply influenced by the quality and scale of its pretraining dataset. If the dataset is clean, well-structured, and diverse, the model trained on it will reflect that quality.
However, for many state-of-the-art open LLMs like Llama 3 and Mixtral, the details of their pretraining data remain a mystery—these datasets are not publicly available, and little is known about how they were curated.
To address this gap, Hugging Face collected data from the internet and curated FineWeb, a large-scale dataset (a filtered portion of the text available on the public web) specifically designed for LLM pretraining. This high-quality and diverse dataset contains 15 trillion tokens and occupies 44TB of disk space. FineWeb is built from 96 CommonCrawl snapshots and has been shown to produce better-performing models than other publicly available pretraining datasets.
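For readers who want to poke at the data themselves, here is a minimal sketch (not from the article) of streaming a small FineWeb sample with the Hugging Face datasets library. The dataset id HuggingFaceFW/fineweb and the sample-10BT subset name are assumptions based on the public FineWeb release.

```python
# Minimal sketch: stream a FineWeb sample instead of downloading the full 44TB corpus.
from datasets import load_dataset

fineweb = load_dataset(
    "HuggingFaceFW/fineweb",   # assumed dataset id on the Hugging Face Hub
    name="sample-10BT",        # assumed ~10B-token sample subset
    split="train",
    streaming=True,            # iterate lazily over records
)

for i, doc in enumerate(fineweb):
    print(doc["text"][:200])   # each record carries the cleaned web text
    if i == 2:
        break
```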
What sets FineWeb apart is its transparency:
It meticulously documented every design choice, running detailed ablations on deduplication and filtering strategies to refine the dataset’s quality.
There are two main ways to source this raw web data: crawling the web yourself, or starting from a public repository of already-crawled webpages such as the one maintained by the non-profit CommonCrawl.
For FineWeb, they followed the approach of many LLM teams and used CommonCrawl (CC) as the starting point. CC releases a new dataset every 1-2 months, typically containing 200-400 TiB of text.
For example, the April 2024 crawl includes 2.7 billion web pages with 386 TiB of uncompressed HTML. Since 2013, CC has released 96 crawls, plus 3 older-format crawls from 2008-2012.
This structured approach helps improve the performance of AI models by ensuring that they are trained on high-quality, diverse, and safe textual data.
Once step 1, processing the raw data, is done, the next question is: how do we train a neural network on this data? As mentioned above, FineWeb contains 15 trillion tokens spanning 44TB of disk space, and all of it needs to be fed to the neural network for further processing.
The next essential step is tokenization, a process that prepares the raw text data for training large language models (LLMs). Let’s break down how tokenization works and its significance based on the transcript.
Tokenization is the process of converting large sequences of text into smaller, manageable units called tokens. These tokens are discrete elements that neural networks process during training. But how exactly do we turn a massive text corpus into tokens that a machine can understand and learn from?
Before feeding the data to the neural network, we have to decide how to represent the text. Neural networks do not process raw text directly; instead, they expect input in the form of a finite one-dimensional sequence of symbols.
Computers represent text using binary encoding (zeros and ones). Each character can be encoded into a sequence of 8 bits (1 byte). This forms the basis of how text data is represented internally. Since bytes can take 256 possible values (0–255), we now have a vocabulary of 256 unique symbols, which can be thought of as unique IDs representing each character or combination.
Note: 1 Byte = 8 bits
Since each bit can be 0 or 1, an 8-bit sequence can represent:
2⁸ = 256
This means a single byte can encode 256 unique values, ranging from 0 to 255.
When you encode text in UTF-8, you convert human-readable characters into binary representations (raw bits).
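To make this concrete, here is a tiny Python illustration of what the UTF-8 bytes of a short string look like:

```python
# Illustration: the raw UTF-8 bytes behind a short piece of text.
text = "Hi!"
raw_bytes = text.encode("utf-8")      # binary representation of the string
print(list(raw_bytes))                # [72, 105, 33] -> each value is in the 0-255 range
print(len(raw_bytes) * 8, "bits")     # 3 bytes = 24 bits
```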
Although the binary (byte-based) encoding is efficient, storing long sequences of binary bits would make the input sequences unnecessarily lengthy. To address this, tokenization methods such as Byte Pair Encoding (BPE) are employed to reduce sequence length while increasing the size of the vocabulary.
In practice, state-of-the-art LLMs like GPT-4 use a vocabulary of 100,277 tokens; the iterative merging stops once this predefined vocabulary size is reached. This balance allows shorter sequences to be used for training while maintaining token granularity that captures essential language features. Each token can represent characters, words, spaces, or even common word combinations.
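To get an intuition for how BPE trades vocabulary size for sequence length, here is a toy sketch of a single merge step. Real tokenizers run many thousands of such merges on byte sequences and handle details this sketch ignores.

```python
# Toy sketch of one Byte Pair Encoding (BPE) merge step.
from collections import Counter

def most_frequent_pair(ids):
    """Count adjacent pairs of ids and return the most common one."""
    return Counter(zip(ids, ids[1:])).most_common(1)[0][0]

def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` with a single new token id."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

ids = list("aaabdaaabac".encode("utf-8"))   # start from raw bytes (values 0-255)
pair = most_frequent_pair(ids)              # the most frequent adjacent byte pair
ids = merge(ids, pair, 256)                 # assign it a new id outside the byte range
print(pair, ids)                            # the sequence gets shorter, the vocabulary grows
```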
Using a tokenizer like GPT-4’s base encoding (cl100k_base), the input text is split into tokens based on the model’s predefined vocabulary. For example:
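As a rough illustration, the tiktoken library exposes the cl100k_base encoding, so you can see how a sentence is split into integer token ids and which text chunk each id stands for:

```python
# Sketch: tokenize a sentence with GPT-4's cl100k_base vocabulary via tiktoken.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode("If you are done with step 1")
print(tokens)                              # a list of integer token ids
print([enc.decode([t]) for t in tokens])   # the text chunk each id stands for
```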
The process of converting raw text into such symbols or tokens is called tokenization. Tokenization is crucial because it translates raw text into numerical token IDs that are later mapped to vector embeddings, a format neural networks can efficiently understand and process. It also strikes a trade-off between vocabulary richness and sequence length, which is key to optimizing the training process for large-scale LLMs. This step sets the foundation for the subsequent phases of LLM pretraining, where these tokens become the building blocks of the model’s understanding of language patterns, syntax, and semantics.
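To show how token ids become something a network can compute with, here is a minimal PyTorch sketch of an embedding lookup. The model width of 768 is just an illustrative choice, and the token ids are taken from the example table later in this article.

```python
# Sketch: token ids are turned into vectors via a learned embedding table.
import torch
import torch.nn as nn

vocab_size, d_model = 100_277, 768            # GPT-4-sized vocabulary, illustrative width
embedding = nn.Embedding(vocab_size, d_model)  # a lookup matrix of shape (vocab, width)

token_ids = torch.tensor([[2746, 499, 527, 2884]])  # ids produced by a tokenizer
vectors = embedding(token_ids)                       # shape: (1, 4, 768)
print(vectors.shape)
```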
A neural network is a computational model designed to simulate the way the human brain processes information. It consists of layers of interconnected nodes (neurons) that work together to recognize patterns, make decisions, and solve complex tasks.
A neural network is a powerful AI tool that learns from data and improves over time, enabling machines to make human-like decisions.
Also read: Introduction to Neural Network in Machine Learning
Input: Tokenized Sequences
The input to the neural network consists of sequences of tokens derived from a dataset through tokenization. Tokenization breaks down the text into discrete units, which are assigned unique numerical IDs. In this example, we consider the token sequence for the short text below:
If you are done with step1
| Token ID | Token |
|----------|-------|
| 2746 | “If” |
| 499 | “you” |
| 527 | “are” |
| 2884 | “Done” |
| 449 | “with” |
| 3094 | “step” |
| 16 | “1” |
These tokens are fed into the neural network as context, aiming to predict the next token in the sequence.
Once the token sequence is passed through the neural network, it generates a probability distribution over a vocabulary of possible next tokens. In this case, the vocabulary size of GPT-4 is 100,277 unique tokens. The output is a probability score assigned to each possible token, representing the likelihood of its occurrence as the next token.
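As a rough sketch of what that output looks like, the snippet below uses random numbers in place of the network’s real raw outputs (logits) and applies softmax to turn them into a probability distribution over the 100,277-token vocabulary:

```python
# Sketch: turning raw scores (logits) over the vocabulary into probabilities.
import torch
import torch.nn.functional as F

vocab_size = 100_277
logits = torch.randn(vocab_size)      # stand-in for the network's raw output scores
probs = F.softmax(logits, dim=-1)     # probabilities over all tokens, summing to 1

top = torch.topk(probs, k=5)
print(top.values)                     # probabilities of the 5 most likely next tokens
print(top.indices)                    # their token ids
```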
To correct its predictions, the neural network goes through a mathematical update process. Given the correct next token, the training algorithm adjusts the network’s weights so that the probability assigned to the correct token increases while the probabilities assigned to the other tokens decrease accordingly.
For instance, after an update, the probability of a token may increase from 4% to 6%, while the probabilities of other tokens adjust accordingly. This iterative process occurs across large batches of training data, refining the network’s ability to model the statistical relationships between tokens.
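Here is a minimal, heavily simplified PyTorch sketch of one such update on a next-token prediction objective. The toy model (an embedding plus a linear layer) stands in for a real transformer, and the token ids come from the example above; real LLM training adds batching, learning-rate schedules, and large-scale distributed infrastructure.

```python
# Sketch: one training update that raises the probability of the correct next token.
import torch
import torch.nn as nn

vocab_size, d_model = 100_277, 768
model = nn.Sequential(nn.Embedding(vocab_size, d_model),
                      nn.Linear(d_model, vocab_size))     # toy stand-in for a transformer
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

context = torch.tensor([[2746, 499, 527, 2884]])   # "If you are done"
target = torch.tensor([449])                        # correct next token ("with")

logits = model(context)[:, -1, :]                   # scores for the token after the context
loss = nn.functional.cross_entropy(logits, target)  # lower loss = higher probability on target
loss.backward()                                     # compute gradients of the loss
optimizer.step()                                    # nudge the weights toward the target
optimizer.zero_grad()
```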
Through continuous exposure to data and iterative updates, the neural network improves its predictive capability. By analyzing context windows of tokens and refining probability distributions, it learns to generate text sequences that align with real-world linguistic patterns.
A neural network, particularly modern architectures like Transformers, follows a structured computational process to generate meaningful predictions based on input data. Below is a detailed explanation of its internals, broken down into key stages.
Neural networks process input data in the form of token sequences. Each token is a numerical representation of a word or a subword.
Once token sequences are fed into the network, they are processed mathematically using a large number of parameters (also called weights).
The network itself is a giant mathematical function with a fixed structure: it mixes inputs x1, x2, … with weights w1, w2, … through repeated weighted sums followed by simple non-linear functions.
Even though modern networks contain billions of parameters, at their core, they perform simple mathematical operations repeatedly.
Example: a basic operation in a neural network takes a weighted sum of its inputs, adds a bias, and applies an activation function, roughly output = activation(w1*x1 + w2*x2 + b), as sketched below.
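A tiny Python illustration of that basic operation, with made-up input and weight values:

```python
# Illustration: a single "neuron" = weighted sum of inputs + bias, then an activation.
def neuron(inputs, weights, bias):
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return max(0.0, z)          # ReLU activation; other activations work similarly

print(neuron([0.5, -1.2, 3.0], [0.8, 0.1, 0.4], bias=0.2))
```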
You can know more about it here: {link of article}
Here we are looking at the model nano-GPT, with a mere 85,000 parameters.
We take the sequence C B A B B C and ask the model to sort it into A B B B C C.
Each letter is called a token, and each token has a token index:
| Token | A | B | C |
|-------|---|---|---|
| Index | 0 | 1 | 2 |
After this, embedding happens: in the visualization, each green cell represents a number being processed, and each blue cell is a weight.
The embeddings are then passed through the model, going through a series of layers called transformer blocks, before reaching the output at the bottom.
After processing through multiple layers, the network outputs a probability distribution over possible next tokens.
The training process involves feeding the model token sequences, comparing its predicted next-token probabilities against the actual next tokens, and updating the weights (via backpropagation) to reduce the error.
Training is like tuning a musical instrument—gradually refining parameters to produce meaningful outputs.
Once a model is trained, it enters the inference phase, where it predicts new text based on user-provided input.
While neural networks use biological terminology, they are not equivalent to biological brains. Unlike biological neurons, neural networks operate without memory and process inputs statelessly. Additionally, biological neurons exhibit dynamic and adaptive behaviour beyond mathematical formulas, whereas neural networks, including transformers, remain purely mathematical constructs without sentient cognition.
In large language models (LLMs) like GPT, a base model refers to a pretrained model that has been trained on vast amounts of internet text but has not yet been fine-tuned for specific tasks.
GPT-2, or Generative Pre-trained Transformer 2, is the second iteration of OpenAI’s Transformer-based language model, first released in 2019. It was a significant milestone in the evolution of large-scale natural language models, setting the stage for modern generative AI applications.
GPT-2’s numbers (roughly 1.5 billion parameters, a 1,024-token context window, and about 40 GB of WebText training data), while impressive at the time, are small by today’s standards. For example, Llama 3 (2024) features 405 billion parameters trained on 15 trillion tokens, demonstrating the rapid growth in scale and capability of Transformer-based models.
At inference time, GPT-2 functions as a token-level document simulator: given a prompt, it repeatedly predicts a probability distribution over the next token, samples from it, appends the sampled token, and continues, producing text that reads like a plausible continuation of documents from its training data.
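As a rough sketch, the openly released GPT-2 weights can be sampled from with the Hugging Face transformers library (the model name "gpt2" is the small variant), which makes this document-simulator behaviour easy to see:

```python
# Sketch: sample a continuation from GPT-2, token by token, via a text-generation pipeline.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
out = generator(
    "Large language models are trained by",
    max_new_tokens=40,
    do_sample=True,          # sample from the predicted next-token distribution
    temperature=0.8,         # controls how "random" the sampling is
)
print(out[0]["generated_text"])   # the prompt continued like an internet document
```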
Even though GPT-2 was not explicitly fine-tuned for specific tasks, prompt engineering enables it to perform various applications: for example, a few-shot prompt listing several English-to-French word pairs followed by a new English word nudges the model to continue the pattern and produce a translation.
So, this is the LLM Pretraining stage.
The next stage, post-training, transforms the model from a statistical text generator into a practical AI assistant capable of answering questions effectively.
We will talk about the post-training stage in the next article…
The LLM pretraining stage is the foundation of modern AI development, shaping the capabilities of models like GPT-4 and beyond. As we advance toward Artificial General Intelligence (AGI), pretraining remains a critical component in improving language understanding, efficiency, and reasoning.
This process involves massive datasets, sophisticated filtering mechanisms, and tokenization strategies that refine raw data into meaningful input for neural networks. Through iterative learning, neural networks enhance their predictive accuracy by analyzing patterns in tokenized text and optimizing mathematical relationships.
Despite their impressive abilities, LLMs are not sentient—they rely on statistical probabilities and structured computations rather than true comprehension. As AI models continue to evolve, advancements in pretraining methodologies will play a key role in driving performance improvements, cost reductions, and broader accessibility.
In the ongoing race for AI supremacy, pretraining is not just a technical necessity; it is a strategic battleground where the future of AI is being forged.