A Comprehensive Guide to Pre-training LLMs

Pankaj Singh Last Updated : 12 Feb, 2025
15 min read

We’re already into the second month of 2025, and every passing day brings us closer to Artificial General Intelligence (AGI)—AI that can tackle complex problems across multiple sectors at a human level.

Take DeepSeek, for instance. Before 2024, could you have imagined an organization building a cutting-edge generative AI model for just a few million dollars and still going toe-to-toe with OpenAI’s flagship models? Probably not. But it’s happening.

Now, OpenAI has countered with the release of o3-mini, further accelerating AI’s evolution. Its reasoning capabilities are pushing the boundaries of AI development, making the technology more accessible and powerful. This AI war will go on! Also recently, as Sam Altman noted in his Three Observations blog, the cost of using a given level of AI is dropping tenfold every 12 months, and with lower prices comes exponentially greater adoption.

At this rate, in a decade, every person on Earth could accomplish more than today’s most impactful individuals, solely because of advancements in AI. This isn’t just progress; it’s a revolution. In this battle of Large Language Models (LLMs), dominance rests on a handful of fundamental stages, and pretraining is the first of them.

In this article, we’ll talk about LLM pretraining as covered in Andrej Karpathy’s “Deep Dive into LLMs like ChatGPT” — what it is, how it works, and why it’s the foundation of modern AI capabilities.

What is LLM Pre-training?

Before diving into the pretraining stage of an LLM, the bigger picture is how ChatGPT, Claude, or any other LLM generates its output. For instance, suppose we ask ChatGPT: “Who is your parent company?”

The question then becomes: how is this output generated by ChatGPT? In other words, what is happening behind the scenes?

Let’s begin with – What is the LLM Pretraining Stage?

The LLM pretraining stage is the first phase of teaching a large language model (LLM) how to understand and generate text. Think of it as reading a massive number of books, articles, and websites to learn grammar, facts, and common patterns in language. During this stage, the model processes billions of words (tokens) and repeatedly predicts the next token in a sentence, refining its ability to generate coherent and relevant responses. However, at this point, it doesn’t fully “understand” meaning like a human—it just recognizes patterns and probabilities.

What can a Pre-trained LLM do?

Pre-trained Large Language Models (LLMs) can perform a wide range of tasks, including text generation, summarization, translation, and sentiment analysis. They assist in code generation, question-answering, and content recommendation. LLMs can extract insights from unstructured data, facilitate chatbots, and automate customer support. They enhance creative writing, provide tutoring, and even generate realistic conversations. Additionally, they assist in data augmentation, legal analysis, and medical research by analyzing vast amounts of information efficiently. Their ability to understand and generate human-like text makes them valuable for various industries, from education and finance to healthcare and entertainment. However, they require fine-tuning for domain-specific accuracy.

Here we’ll use ChatGPT to illustrate the concepts.

LLM Pretraining Step 1: Process the Internet Data

There are multiple stages of training an LLM, but here we will focus on the pretraining stage.

The performance of a large language model (LLM) is deeply influenced by the quality and scale of its pretraining dataset. If the dataset is clean, well-structured, and easy to process, the resulting model will perform accordingly.

However, for many state-of-the-art open LLMs like Llama 3 and Mixtral, the details of their pretraining data remain a mystery—these datasets are not publicly available, and little is known about how they were curated.

To address this gap, Hugging Face collected data from the internet and curated FineWeb, a large-scale dataset (a curated portion of the data available on the internet) specifically designed for LLM pretraining. This high-quality and diverse dataset contains 15 trillion tokens and occupies 44TB of disk space. FineWeb is built from 96 CommonCrawl snapshots and has been shown to produce better-performing models than other publicly available pretraining datasets.

What sets FineWeb apart is its transparency:

The FineWeb team meticulously documented every design choice, running detailed ablations on deduplication and filtering strategies to refine the dataset’s quality.

The dataset is published as HuggingFaceFW/fineweb on the Hugging Face Hub.

Where Does the Raw Data Come From?

There are two main sources:

  1. Crawling the web yourself – Used by companies like OpenAI and Anthropic.
  2. Using public repositories – CommonCrawl, a non-profit that has been archiving web data since 2007.

For FineWeb, the team followed the approach of many LLM teams and used CommonCrawl (CC) as the starting point. CC releases a new dataset every 1-2 months, typically containing 200-400 TiB of text.

For example, the April 2024 crawl includes 2.7 billion web pages with 386 TiB of uncompressed HTML. Since 2013, CC has released 96 crawls, plus 3 older-format crawls from 2008-2012.

1. URL Filtering

  • The pipeline begins with URL filtering, where web pages from certain domains or with certain characteristics are blocked based on a pre-defined list.
  • This helps remove adult content, spam, or other undesirable data at the initial stage.

2. Text Extraction

  • Once URLs are filtered, the text is extracted from the web pages.
  • This step removes HTML, JavaScript, and other non-text elements while preserving the meaningful content.

3. Language Filtering

  • The extracted text is then filtered based on language.
  • A fastText classifier is used to detect whether the content is in English.
  • Only texts with a confidence score of ≥ 0.65 are kept.
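To make this concrete, here is a minimal sketch of such a language filter, assuming the publicly available fastText lid.176 language-identification model has been downloaded locally; the exact FineWeb implementation may differ.

```python
import fasttext

# Assumes the pretrained lid.176 language-ID model has been downloaded beforehand.
model = fasttext.load_model("lid.176.bin")

def keep_english(text: str, threshold: float = 0.65) -> bool:
    # fastText expects a single line of text, so strip newlines first
    labels, probs = model.predict(text.replace("\n", " "))
    return labels[0] == "__label__en" and probs[0] >= threshold

docs = ["The quick brown fox jumps over the lazy dog.", "El zorro marrón salta sobre el perro."]
english_only = [d for d in docs if keep_english(d)]
```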

4. Gopher Filtering

  • This is an additional quality filter designed to remove low-quality text.
  • It might include checks for repetitive content, nonsensical text, or harmful content.
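The exact rules come from the Gopher paper; the sketch below only shows the flavour of these heuristics, with illustrative (not FineWeb-exact) thresholds.

```python
def passes_quality_heuristics(text: str) -> bool:
    """Rough, illustrative Gopher-style quality checks."""
    words = text.split()
    if not words:
        return False
    # Extremely short or extremely long documents are dropped
    if not (50 <= len(words) <= 100_000):
        return False
    # Unusual average word length suggests non-prose content (code dumps, gibberish)
    mean_word_len = sum(len(w) for w in words) / len(words)
    if not (3 <= mean_word_len <= 10):
        return False
    # Heavy line-level repetition often indicates boilerplate
    lines = [line for line in text.splitlines() if line.strip()]
    if lines and len(set(lines)) / len(lines) < 0.7:
        return False
    return True
```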

5. MinHash Deduplication

  • This step detects and removes duplicate content using the MinHash technique.
  • MinHash helps efficiently compare large amounts of text to find near-duplicate documents and eliminate redundancy.
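A minimal near-duplicate detector built on the datasketch library is sketched below; FineWeb uses its own MinHash implementation, so treat this purely as an illustration of the idea.

```python
from datasketch import MinHash, MinHashLSH

def minhash_of(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    for token in set(text.lower().split()):
        m.update(token.encode("utf-8"))
    return m

# Documents whose estimated Jaccard similarity exceeds 0.8 are treated as duplicates
lsh = MinHashLSH(threshold=0.8, num_perm=128)

docs = {
    "doc1": "the cat sat on the mat",
    "doc2": "the cat sat on a mat",          # near-duplicate of doc1
    "doc3": "completely unrelated sentence",
}
kept = []
for doc_id, text in docs.items():
    m = minhash_of(text)
    if lsh.query(m):      # a near-duplicate is already indexed -> skip this document
        continue
    lsh.insert(doc_id, m)
    kept.append(doc_id)
```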

6. C4 Filters

  • The filtered data then passes through C4 filters, which further refine the dataset.
  • C4 (Colossal Clean Crawled Corpus) filters typically remove boilerplate content, excessive repetition, and low-quality text.

7. Custom Filters

  • At this stage, additional custom filtering rules are applied.
  • These could involve removing specific patterns, handling formatting issues, or eliminating known sources of noise.

8. PII Removal

  • Finally, the pipeline includes a PII (Personally Identifiable Information) Removal step.
  • This ensures that private or sensitive information (such as names, addresses, emails, and phone numbers) is scrubbed from the dataset.
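A highly simplified, regex-based version of this step is sketched below; production pipelines use far more robust detectors, and these patterns are illustrative only.

```python
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")
IP_ADDR = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def scrub_pii(text: str) -> str:
    # Replace detected spans with placeholder tags rather than deleting them
    text = EMAIL.sub("<EMAIL>", text)
    text = PHONE.sub("<PHONE>", text)
    text = IP_ADDR.sub("<IP_ADDRESS>", text)
    return text

print(scrub_pii("Reach me at jane.doe@example.com or +1 415-555-0199."))
```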

The Outcome of the Process

  • The FineWeb pipeline ensures that the resulting dataset is clean, high-quality, and optimized for training AI models.
  • Data Reduction: After all filtering steps, roughly 15 trillion tokens remain from the original web dumps.

This structured approach helps improve the performance of AI models by ensuring that they are trained on high-quality, diverse, and safe textual data.

LLM Pretraining Step 2: Tokenization


If you are done with step 1 of processing the raw data, the next question is: how do we train a neural network on this data? As mentioned above, FineWeb contains 15 trillion tokens across 44TB of disk space, all of which needs to be fed to the neural network for further processing.

The next essential step is tokenization, a process that prepares the raw text data for training large language models (LLMs). Let’s break down how tokenization works and why it matters.

Tokenization is the process of converting large sequences of text into smaller, manageable units called tokens. These tokens are discrete elements that neural networks process during training. But how exactly do we turn a massive text corpus into tokens that a machine can understand and learn from?

1. From Raw Text to One-Dimensional Sequence

Before feeding the data to the neural network, we have to decide how we are going to represent the text. Neural networks do not process raw text directly; instead, they expect input in the form of a finite one-dimensional sequence of symbols.

2. Binary Representation – Bits and Bytes

  • A long sequence of 0s and 1s would be inefficient for storage and processing in neural networks.
  • Instead of encoding text as a raw sequence of bits, a more efficient approach is to group bits into meaningful symbols.

Computers represent text using binary encoding (zeros and ones). Each character can be encoded into a sequence of 8 bits (1 byte). This forms the basis of how text data is represented internally. Since bytes can take 256 possible values (0–255), we now have a vocabulary of 256 unique symbols, which can be thought of as unique IDs representing each character or combination.

Note: 1 Byte = 8 bits

Since each bit can be 0 or 1, an 8-bit sequence can represent:
2^8 = 256

This means a single byte can encode 256 unique values, ranging from 0 to 255.

  • Each character (or symbol) is stored in 1 byte (8 bits).
  • Each byte can take one of 256 possible values.
  • Thus, the vocabulary size is 256 unique symbols.

3. UTF-8 Encoding – From Characters to Bytes

When you encode text in UTF-8, you convert human-readable characters into binary representations (raw bytes).
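A quick Python check makes this concrete: encoding a string in UTF-8 yields a sequence of integers between 0 and 255, one or more per character.

```python
text = "hello"
raw_bytes = text.encode("utf-8")   # UTF-8 encoding: characters -> bytes
print(list(raw_bytes))             # [104, 101, 108, 108, 111]
print(len(raw_bytes))              # 5 bytes, each a value in the range 0-255
```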

4. Reducing Sequence Length – Beyond Bytes

Although byte-level encoding is simple, it produces unnecessarily long input sequences. To address this, tokenization methods such as Byte Pair Encoding (BPE) are employed to reduce sequence length while increasing the size of the vocabulary.

  • Byte Pair Encoding (BPE): This method groups frequently occurring pairs of symbols (bytes) into new symbols. For instance, if a pair of tokens such as (135, 32) appears repeatedly, it is replaced by a new token with its own ID (e.g., 256). The process iteratively reduces the sequence length while expanding the token vocabulary. A toy example follows below.
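The sketch below performs a single BPE merge step over a byte sequence; real tokenizers repeat this until a target vocabulary size (roughly 100k tokens for GPT-4) is reached.

```python
from collections import Counter

def most_frequent_pair(ids):
    # Count every adjacent pair of token IDs and return the most common one
    return Counter(zip(ids, ids[1:])).most_common(1)[0][0]

def merge(ids, pair, new_id):
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)   # replace the frequent pair with the new token
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

ids = list("hello hello world".encode("utf-8"))
pair = most_frequent_pair(ids)   # the most frequent adjacent byte pair
ids = merge(ids, pair, 256)      # 256 is the first ID beyond the byte range 0-255
print(pair, ids)
```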

5. Vocabulary Size – Trade-off Between Sequence Length and Token Granularity

In practice, state-of-the-art LLMs like GPT-4 use a vocabulary size of 100,277 tokens. This iterative merging stops when a predefined vocabulary size is reached. This balance allows shorter sequences to be used for training while maintaining token granularity that captures essential language features. Each token can represent characters, words, spaces, or even common word combinations.

6. Tokenizing Text – Example and Practical Insights

Using GPT-4’s tokenizer (cl100k_base), the input text is split into tokens based on the model’s predefined vocabulary. For example:

  • The phrase “hello world” is tokenized into two tokens: one for “hello” and one for “space + world.”
  • Adding or removing spaces results in different tokens due to subtle variations in text patterns.
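You can reproduce this with OpenAI’s tiktoken library, which ships the cl100k_base vocabulary used by GPT-4.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode("hello world")
print(tokens)                              # two token IDs
print([enc.decode([t]) for t in tokens])   # ['hello', ' world'] - note the leading space
```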

Why Is This Useful?

  • Optimizing Neural Network Input: Large Language Models (LLMs) like GPT-4 don’t read raw text. Instead, they process tokenized input.
  • Understanding Compression: Some words are split into multiple tokens, while others stay intact.
  • Efficiency in Training: Tokenization allows efficient storage and manipulation of text data.

The process of converting raw text into symbols, or tokens, is called tokenization. Tokenization is crucial because it translates raw text data into a format that neural networks can efficiently understand and process (and from which vector embeddings are later computed). It also strikes a trade-off between vocabulary richness and sequence length, which is key to optimizing the training process for large-scale LLMs. This step sets the foundation for the subsequent phases of LLM pretraining, where these tokens become the building blocks of the model’s understanding of language patterns, syntax, and semantics.

LLM Pretraining Step 3: Neural Network

A neural network is a computational model designed to simulate the way the human brain processes information. It consists of layers of interconnected nodes (neurons) that work together to recognize patterns, make decisions, and solve complex tasks.

Key Characteristics:

  1. Inspired by the Human Brain – Mimics how biological neurons process and transmit information.
  2. Layered Structure – Composed of an input layer, hidden layers, and an output layer.
  3. Learning through Training – Adjusts internal parameters (weights) over multiple iterations to improve accuracy.
  4. Task-Specific Adaptability – Can handle various problems such as classification, pattern recognition, and clustering.

How It Works:

  • Nodes (Neurons): Fundamental units that process data.
  • Connections (Weights): Store learned information and adjust based on input.
  • Training Process: Weights are updated over multiple iterations using training data.
  • Final Model: A trained neural network can efficiently perform the intended task.

A neural network is a powerful AI tool that learns from data and improves over time, enabling machines to make human-like decisions.

Also read: Introduction to Neural Network in Machine Learning

Neural Network I/O

Input: Tokenized Sequences

The input to the neural network consists of sequences of tokens derived from a dataset through tokenization. Tokenization breaks down the text into discrete units, which are assigned unique numerical IDs. In this example, we consider a short sequence of tokens:

If you are done with step1

Token ID | Token
2746     | "If"
499      | "you"
527      | "are"
2884     | "Done"
449      | "with"
3094     | "step"
161      | "1"

These tokens are fed into the neural network as context, aiming to predict the next token in the sequence.

Processing: Probability Distribution Prediction

Once the token sequence is passed through the neural network, it generates a probability distribution over a vocabulary of possible next tokens. In this case, the vocabulary size of GPT-4 is 100,277 unique tokens. The output is a probability score assigned to each possible token, representing the likelihood of its occurrence as the next token.
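The conversion from raw network scores (logits) to such a probability distribution is done with softmax; the sketch below uses a toy four-token vocabulary in place of GPT-4’s 100,277 tokens.

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - np.max(logits))  # subtract the max for numerical stability
    return z / z.sum()

logits = np.array([2.0, 1.0, 0.1, -1.0])  # one raw score per candidate next token
probs = softmax(logits)
print(probs)        # roughly [0.64, 0.23, 0.10, 0.03] - higher score, higher probability
print(probs.sum())  # 1.0
```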


Backpropagation and Adjustment

To correct its predictions, the neural network goes through a mathematical update process:

  1. Calculate Loss – A loss function (like cross-entropy loss) measures how far the predicted probabilities are from the correct probabilities. A lower probability for the correct token results in a higher loss.
  2. Compute Gradients – The network uses gradient descent to determine how to adjust the weights of its neurons.
  3. Update Weights – The model’s internal parameters (weights) are tweaked slightly so that the next time it sees the same sequence, it increases the probability of the correct next token and decreases the probability of incorrect options (a minimal sketch of this update follows below).
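Here is a minimal, hedged sketch of one such update in PyTorch; a single embedding-plus-linear layer stands in for the real Transformer so the mechanics stay visible, and the token IDs are toy values.

```python
import torch
import torch.nn as nn

vocab_size, embed_dim, context_len = 1000, 64, 4

# Toy stand-in for the real network: embed 4 context tokens, predict the next one
model = nn.Sequential(
    nn.Embedding(vocab_size, embed_dim),
    nn.Flatten(),
    nn.Linear(context_len * embed_dim, vocab_size),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

context = torch.tensor([[11, 42, 7, 256]])   # four context token IDs (toy values)
target = torch.tensor([99])                  # the correct next token ID

logits = model(context)          # one raw score per token in the vocabulary
loss = loss_fn(logits, target)   # high when the correct token gets low probability
loss.backward()                  # backpropagation: compute gradients of the loss
optimizer.step()                 # nudge weights to raise the correct token's probability
optimizer.zero_grad()
```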

Training and Refinement

The neural network updates its parameters using a mathematical optimization process. Given the correct token, the training algorithm adjusts the network weights such that:

  • The probability of the correct token increases.
  • The probabilities of incorrect tokens decrease.

For instance, after an update, the probability of a token may increase from 4% to 6%, while the probabilities of other tokens adjust accordingly. This iterative process occurs across large batches of training data, refining the network’s ability to model the statistical relationships between tokens.

Through continuous exposure to data and iterative updates, the neural network improves its predictive capability. By analyzing context windows of tokens and refining probability distributions, it learns to generate text sequences that align with real-world linguistic patterns.

Internal Working of Neural Network

Source: Andrej Karpathy

A neural network, particularly modern architectures like Transformers, follows a structured computational process to generate meaningful predictions based on input data. Below is a detailed explanation of its internals, broken down into key stages.

1. Input Representation: Token Sequences

Neural networks process input data in the form of token sequences. Each token is a numerical representation of a word or a subword.

  • The input length can vary from 0 to 8,000 tokens (depending on the model), but computational constraints limit the maximum context length.
  • Token sequences are the primary data structures that flow through the network.

2. Mathematical Processing with Parameters (Weights)

Once token sequences are fed into the network, they are processed mathematically using a large number of parameters (also called weights).

  • Parameters are initially random, leading to random predictions.
  • Through training, these parameters are adjusted to reflect patterns in the training dataset.

3. The Mathematical Expressions Behind Neural Networks

The network itself is a giant mathematical function with a fixed structure. It mixes inputs x1, x2, … with weights w1, w2, … through:

  • Multiplication
  • Addition
  • Exponentiation
  • Normalization (LayerNorm)
  • Matrix Operations
  • Activation Functions (Softmax, etc.)

Even though modern networks contain billions of parameters, at their core, they perform simple mathematical operations repeatedly.

Example: A basic operation in a neural network may look like:

y = σ(w1·x1 + w2·x2 + … + b), i.e., a weighted sum of the inputs passed through a non-linear activation σ.

You can learn more about it here: {link of article}

4. The Transformer Architecture: The Backbone of Modern Neural Networks

Here we look at nano-GPT, a tiny model with a mere 85,000 parameters.

We take the sequence C B A B B C and train the model to sort it into A B B B C C.

Each letter is a token, with the following token indices:

Token | A | B | C
Index | 0 | 1 | 2

After this, embedding happens: in the visualization, each green cell represents a number being processed, and each blue cell is a weight.

The embedding is then passed through the model, going through a series of Transformer layers, before reaching the output at the bottom of the visualization.

5. Neural Network Output: Prediction Generation

After processing through multiple layers, the network outputs a probability distribution over possible next tokens.

  • The final layer (Logits & Softmax) predicts the next token.
  • The output token is fed back into the network in an autoregressive manner.
  • This process repeats iteratively, generating coherent text.
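Schematically, the decoding loop looks like the sketch below, where `model` is any callable that maps a token sequence to a probability distribution over the next token (a stand-in for the real Transformer).

```python
import numpy as np

def generate(model, tokens, max_new_tokens=50, eos_id=None):
    """Autoregressive decoding: append each predicted token and feed it back in."""
    for _ in range(max_new_tokens):
        probs = model(tokens)                                 # distribution over the vocabulary
        next_id = int(np.random.choice(len(probs), p=probs))  # sample the next token
        tokens = tokens + [next_id]
        if eos_id is not None and next_id == eos_id:
            break                                             # stop at the end-of-sequence token
    return tokens
```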

6. Training the Neural Network: Adjusting Parameters

The training process involves:

  1. Computing the Loss: The difference between the predicted output and the correct output is measured using loss functions (e.g., cross-entropy loss).
  2. Backpropagation: The loss is used to update network parameters via gradient descent.
  3. Optimization (Gradient Descent, Adam, etc.): Parameters are adjusted to minimize prediction errors over many iterations.

Training is like tuning a musical instrument—gradually refining parameters to produce meaningful outputs.

7. Inference: Generating New Predictions


Once a model is trained, it enters the inference phase, where it predicts new text based on user-provided input.

  • The model generates tokens step by step using learned knowledge.
  • It follows statistical patterns from training data.
  • The process repeats until a stopping condition is met (e.g., max length, EOS token).

While neural networks use biological terminology, they are not equivalent to biological brains. Unlike biological neurons, neural networks operate without memory and process inputs statelessly. Additionally, biological neurons exhibit dynamic and adaptive behaviour beyond mathematical formulas, whereas neural networks, including transformers, remain purely mathematical constructs without sentient cognition.

Base Model

A base model in large language models (LLMs), like GPT, refers to a pretrained model that has been trained on vast amounts of internet text data but has not yet been fine-tuned for specific tasks.

Key Points About Base Models:

  1. Token Simulators: A base model essentially predicts the next token (word, subword, or character) given a sequence of previous tokens. It is a statistical pattern recognizer that generates text based on probabilities learned from training data.
  2. Not Directly Useful for Assistants: A base model doesn’t inherently understand user intent or follow conversational instructions. Instead, it generates text in an open-ended way, often producing a remix of internet text.
  3. Limited Releases: Most base models are not publicly released because they are just an intermediate step in developing a useful AI assistant. Companies usually fine-tune these base models before releasing them for public use.
  4. Example – GPT-2:
    • OpenAI released GPT-2 in 2019 with a 1.5 billion parameter base model.
    • It was a raw model trained to predict text sequences but required additional fine-tuning to be used effectively in applications.

GPT-2, or Generative Pre-trained Transformer 2, is the second iteration of OpenAI’s Transformer-based language model, first released in 2019. It was a significant milestone in the evolution of large-scale natural language models, setting the stage for modern generative AI applications.

Key Specifications:

  • Parameters: 1.5 billion
  • Training Tokens: 100 billion
  • Maximum Context Length: 1,024 tokens

These numbers, while impressive at the time, are small by today’s standards. For example, Llama 3 (2024) features 405 billion parameters trained on 15 trillion tokens, demonstrating the rapid growth in scale and capability of Transformer-based models.

Inference: How GPT-2 Generates Text

1. Token-Level Simulation

At inference time, GPT-2 functions as a token-level document simulator:

  • It generates text one token at a time, conditioning each prediction on the previous tokens.
  • The process continues iteratively, producing sequences that resemble human-written text.

2. Prompting and In-Context Learning

Even though GPT-2 was not explicitly fine-tuned for specific tasks, prompt engineering enables it to perform various applications:

  • Translation: A well-constructed few-shot prompt can turn GPT-2 into an English-to-Korean translator.
  • Q&A and Assistant-like Behavior: With the right conversation-style prompt, GPT-2 can mimic a chatbot.
  • Story Generation: By seeding with an opening sentence, GPT-2 can complete a passage in a coherent manner.
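As an illustration, a few-shot prompt like the one below can nudge a base model toward translation. This sketch uses the publicly released GPT-2 weights via the Hugging Face transformers library and an English-to-French pattern for readability (the article’s example was English-to-Korean); output quality will be modest, since GPT-2 is small and not instruction-tuned.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Few-shot prompt: show the pattern, then leave the last answer blank
prompt = (
    "English: good morning\nFrench: bonjour\n"
    "English: thank you\nFrench: merci\n"
    "English: see you tomorrow\nFrench:"
)

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=10, do_sample=False,
                         pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```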

3. Limitations of GPT-2 in Inference

  • Short Context Window: With a maximum of 1,024 tokens, GPT-2 struggles with long-form coherence.
  • Lack of Explicit Memory: Unlike later models with retrieval-augmented generation (RAG), GPT-2 relies entirely on its parameters.
  • Prone to Bias and Regurgitation: Due to the nature of its dataset, GPT-2 can produce biased or even verbatim outputs from training data.

Why Are Base Models Important?

  • They form the foundation for creating useful AI applications.
  • Fine-tuning and reinforcement learning make them more useful for interactive tasks, like chatbots, code assistants, or summarization tools.
  • They enable adaptability, allowing researchers and developers to fine-tune them for specific domains (e.g., medical AI, legal AI).

So, this is the LLM Pretraining stage. 

Key Takeaways from the Pre-training Stage:

  1. Pre-training is about token prediction:
    • We train the model using Internet documents broken down into tokens (small chunks of text).
    • The model learns to predict token sequences based on statistical patterns in the data.
  2. The base model is an “Internet Document Simulator”:
    • It generates text that mimics Internet writing at the token level.
    • It lacks alignment with human intent, meaning it’s not yet useful as an AI assistant.
  3. Base model limitations:
    • It can generate fluent text but doesn’t understand questions or follow instructions well.
    • We need additional steps to make it interactive and aligned with human needs.

Next Stage: Post-training

  • Goal: Improve the base model to function as a useful AI assistant.
  • Approach: Apply post-training techniques to refine responses, making them more accurate, helpful, and aligned with user expectations.

This next stage transforms the model from a statistical text generator into a practical AI assistant capable of answering questions effectively.

We will talk about the post-training stage in the next article…

Conclusion

The LLM pretraining stage is the foundation of modern AI development, shaping the capabilities of models like GPT-4 and beyond. As we advance toward Artificial General Intelligence (AGI), pretraining remains a critical component in improving language understanding, efficiency, and reasoning.

This process involves massive datasets, sophisticated filtering mechanisms, and tokenization strategies that refine raw data into meaningful input for neural networks. Through iterative learning, neural networks enhance their predictive accuracy by analyzing patterns in tokenized text and optimizing mathematical relationships.

Despite their impressive abilities, LLMs are not sentient—they rely on statistical probabilities and structured computations rather than true comprehension. As AI models continue to evolve, advancements in pretraining methodologies will play a key role in driving performance improvements, cost reductions, and broader accessibility.

In the ongoing race for AI supremacy, pretraining is not just a technical necessity; it is a strategic battleground where the future of AI is being forged.

Hi, I am Pankaj Singh Negi - Senior Content Editor | Passionate about storytelling and crafting compelling narratives that transform ideas into impactful content. I love reading about technology revolutionizing our lifestyle.
