A Comprehensive Guide to Pre-training LLMs

Pankaj Singh Last Updated : 12 Feb, 2025
15 min read

We’re already into the second month of 2025, and every passing day brings us closer to Artificial General Intelligence (AGI)—AI that can tackle complex problems across multiple sectors at a human level.

Take DeepSeek, for instance. Before 2024, could you have imagined an organization building a cutting-edge generative AI model for just a few million dollars and still going toe-to-toe with OpenAI’s flagship models? Probably not. But it’s happening.

Now, OpenAI has countered with the release of o3-mini, further accelerating AI’s evolution. Its reasoning capabilities are pushing the boundaries of AI development, making the technology more accessible and powerful. This AI war will go on! Also recently, as Sam Altman noted in his Three Observations blog, the cost of using a given level of AI is dropping tenfold every 12 months, and with lower prices comes exponentially greater adoption.

At this rate, in a decade, every person on Earth could accomplish more than today’s most impactful individuals, solely because of advancements in AI. This isn’t just progress; it’s a revolution. In this battle of Large Language Models (LLMs), dominance rests on a handful of fundamental stages, and pretraining is the first of them.

In this article, we’ll talk about LLM pretraining as covered in Andrej Karpathy’s “Deep Dive into LLMs like ChatGPT” — what it is, how it works, and why it’s the foundation of modern AI capabilities.

What is LLM Pre-training?

Before diving into the pretraining stage of an LLM, the bigger picture is how ChatGPT, Claude, or any other LLM generates its output. For instance, suppose we ask ChatGPT: “Who is your parent company?”

The question then becomes: how is this output generated by ChatGPT? In other words, what is happening behind the scenes?

Let’s begin with – What is the LLM Pretraining Stage?

The LLM pretraining stage is the first phase of teaching a large language model (LLM) how to understand and generate text. Think of it as reading a massive number of books, articles, and websites to learn grammar, facts, and common patterns in language. During this stage, the model processes billions of words (tokens) and repeatedly predicts the next token in a sentence, refining its ability to generate coherent and relevant responses. However, at this point, it doesn’t fully “understand” meaning like a human—it just recognizes patterns and probabilities.

What can a Pre-trained LLM do?

Pre-trained Large Language Models (LLMs) can perform a wide range of tasks, including text generation, summarization, translation, and sentiment analysis. They assist in code generation, question-answering, and content recommendation. LLMs can extract insights from unstructured data, facilitate chatbots, and automate customer support. They enhance creative writing, provide tutoring, and even generate realistic conversations. Additionally, they assist in data augmentation, legal analysis, and medical research by analyzing vast amounts of information efficiently. Their ability to understand and generate human-like text makes them valuable for various industries, from education and finance to healthcare and entertainment. However, they require fine-tuning for domain-specific accuracy.

Here we’ll use ChatGPT to illustrate the concepts.

LLM Pretraining Step 1: Process the Internet Data

There are multiple stages of training an LLM, but here we will focus on the pretraining stage.

The performance of a large language model (LLM) is deeply influenced by the quality and scale of its pretraining dataset. If the dataset is clean, well-structured, and easy to process, the resulting model will perform accordingly.

However, for many state-of-the-art open LLMs like Llama 3 and Mixtral, the details of their pretraining data remain a mystery—these datasets are not publicly available, and little is known about how they were curated.

To address this gap, Hugging Face collected data from the internet and curated FineWeb, a large-scale dataset (a curated portion of the data available on the internet) specifically designed for LLM pretraining. This high-quality and diverse dataset contains 15 trillion tokens and occupies 44TB of disk space. FineWeb is built from 96 CommonCrawl snapshots and has been shown to produce better-performing models than other publicly available pretraining datasets.

What sets FineWeb apart is its transparency:

The FineWeb team meticulously documented every design choice, running detailed ablations on deduplication and filtering strategies to refine the dataset’s quality.

The dataset is published as HuggingFaceFW/fineweb on the Hugging Face Hub.

Where Does the Raw Data Come From?

There are two main sources:

  1. Crawling the web yourself – Used by companies like OpenAI and Anthropic.
  2. Using public repositories – CommonCrawl, a non-profit that has been archiving web data since 2007.

For FineWeb, the team followed the approach of many LLM teams and used CommonCrawl (CC) as the starting point. CC releases a new dataset every 1-2 months, typically containing 200-400 TiB of text.

For example, the April 2024 crawl includes 2.7 billion web pages with 386 TiB of uncompressed HTML. Since 2013, CC has released 96 crawls, plus 3 older-format crawls from 2008-2012.

1. URL Filtering

  • The pipeline begins with URL filtering, where web pages from certain domains or with certain characteristics are blocked based on a pre-defined list.
  • This helps remove adult content, spam, or other undesirable data at the initial stage.

2. Text Extraction

  • Once URLs are filtered, the text is extracted from the web pages.
  • This step removes HTML, JavaScript, and other non-text elements while preserving the meaningful content.

3. Language Filtering

  • The extracted text is then filtered based on language.
  • A fastText classifier is used to detect whether the content is in English.
  • Only texts with a confidence score of ≥ 0.65 are kept.
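To make this concrete, here is a minimal sketch of such a language filter, assuming the publicly available fastText lid.176 language-identification model has been downloaded locally; the exact FineWeb implementation may differ.

```python
import fasttext

# Assumes the pretrained lid.176 language-ID model has been downloaded beforehand.
model = fasttext.load_model("lid.176.bin")

def keep_english(text: str, threshold: float = 0.65) -> bool:
    # fastText expects a single line of text, so strip newlines first
    labels, probs = model.predict(text.replace("\n", " "))
    return labels[0] == "__label__en" and probs[0] >= threshold

docs = ["The quick brown fox jumps over the lazy dog.", "El zorro marrón salta sobre el perro."]
english_only = [d for d in docs if keep_english(d)]
```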

4. Gopher Filtering

  • This is an additional quality filter designed to remove low-quality text.
  • It might include checks for repetitive content, nonsensical text, or harmful content.
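The exact rules come from the Gopher paper; the sketch below only shows the flavour of these heuristics, with illustrative (not FineWeb-exact) thresholds.

```python
def passes_quality_heuristics(text: str) -> bool:
    """Rough, illustrative Gopher-style quality checks."""
    words = text.split()
    if not words:
        return False
    # Extremely short or extremely long documents are dropped
    if not (50 <= len(words) <= 100_000):
        return False
    # Unusual average word length suggests non-prose content (code dumps, gibberish)
    mean_word_len = sum(len(w) for w in words) / len(words)
    if not (3 <= mean_word_len <= 10):
        return False
    # Heavy line-level repetition often indicates boilerplate
    lines = [line for line in text.splitlines() if line.strip()]
    if lines and len(set(lines)) / len(lines) < 0.7:
        return False
    return True
```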

5. MinHash Deduplication

  • This step detects and removes duplicate content using the MinHash technique.
  • MinHash helps efficiently compare large amounts of text to find near-duplicate documents and eliminate redundancy.
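A minimal near-duplicate detector built on the datasketch library is sketched below; FineWeb uses its own MinHash implementation, so treat this purely as an illustration of the idea.

```python
from datasketch import MinHash, MinHashLSH

def minhash_of(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    for token in set(text.lower().split()):
        m.update(token.encode("utf-8"))
    return m

# Documents whose estimated Jaccard similarity exceeds 0.8 are treated as duplicates
lsh = MinHashLSH(threshold=0.8, num_perm=128)

docs = {
    "doc1": "the cat sat on the mat",
    "doc2": "the cat sat on a mat",          # near-duplicate of doc1
    "doc3": "completely unrelated sentence",
}
kept = []
for doc_id, text in docs.items():
    m = minhash_of(text)
    if lsh.query(m):      # a near-duplicate is already indexed -> skip this document
        continue
    lsh.insert(doc_id, m)
    kept.append(doc_id)
```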

6. C4 Filters

  • The filtered data then passes through C4 filters, which further refine the dataset.
  • C4 (Colossal Clean Crawled Corpus) filters typically remove boilerplate content, excessive repetition, and low-quality text.

7. Custom Filters

  • At this stage, additional custom filtering rules are applied.
  • These could involve removing specific patterns, handling formatting issues, or eliminating known sources of noise.

8. PII Removal

  • Finally, the pipeline includes a PII (Personally Identifiable Information) Removal step.
  • This ensures that private or sensitive information (such as names, addresses, emails, and phone numbers) is scrubbed from the dataset.
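A highly simplified, regex-based version of this step is sketched below; production pipelines use far more robust detectors, and these patterns are illustrative only.

```python
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")
IP_ADDR = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def scrub_pii(text: str) -> str:
    # Replace detected spans with placeholder tags rather than deleting them
    text = EMAIL.sub("<EMAIL>", text)
    text = PHONE.sub("<PHONE>", text)
    text = IP_ADDR.sub("<IP_ADDRESS>", text)
    return text

print(scrub_pii("Reach me at jane.doe@example.com or +1 415-555-0199."))
```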

The Outcome of the Process

  • The FineWeb pipeline ensures that the resulting dataset is clean, high-quality, and optimized for training AI models.
  • Data Reduction: After all filtering steps, roughly 15 trillion tokens remain from the original web dumps.

This structured approach helps improve the performance of AI models by ensuring that they are trained on high-quality, diverse, and safe textual data.

LLM Pretraining Step 2: Tokenization


If you are done with step 1 of processing the raw data, the next question is: how do we train a neural network on this data? As mentioned above, FineWeb contains 15 trillion tokens across 44TB of disk space, all of which needs to be fed to the neural network for further processing.

The next essential step is tokenization, a process that prepares the raw text data for training large language models (LLMs). Let’s break down how tokenization works and why it matters.

Tokenization is the process of converting large sequences of text into smaller, manageable units called tokens. These tokens are discrete elements that neural networks process during training. But how exactly do we turn a massive text corpus into tokens that a machine can understand and learn from?

1. From Raw Text to One-Dimensional Sequence

Before feeding the data to the neural network, we have to decide how we are going to represent the text. Neural networks do not process raw text directly; instead, they expect input in the form of a finite one-dimensional sequence of symbols.

2. Binary Representation – Bits and Bytes

  • A long sequence of 0s and 1s would be inefficient for storage and processing in neural networks.
  • Instead of encoding text as a raw sequence of bits, a more efficient approach is to group bits into meaningful symbols.

Computers represent text using binary encoding (zeros and ones). Each character can be encoded into a sequence of 8 bits (1 byte). This forms the basis of how text data is represented internally. Since bytes can take 256 possible values (0–255), we now have a vocabulary of 256 unique symbols, which can be thought of as unique IDs representing each character or combination.

Note: 1 Byte = 8 bits

Since each bit can be 0 or 1, an 8-bit sequence can represent:
2^8 = 256

This means a single byte can encode 256 unique values, ranging from 0 to 255.

  • Each character (or symbol) is stored in 1 byte (8 bits).
  • Each byte can take one of 256 possible values.
  • Thus, the vocabulary size is 256 unique symbols.

3. UTF-8 Encoding – From Characters to Bytes

When you encode text in UTF-8, you convert human-readable characters into binary representations (raw bytes).
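A quick Python check makes this concrete: encoding a string in UTF-8 yields a sequence of integers between 0 and 255, one or more per character.

```python
text = "hello"
raw_bytes = text.encode("utf-8")   # UTF-8 encoding: characters -> bytes
print(list(raw_bytes))             # [104, 101, 108, 108, 111]
print(len(raw_bytes))              # 5 bytes, each a value in the range 0-255
```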

4. Reducing Sequence Length – Beyond Bytes

Although byte-level encoding is simple, it produces unnecessarily long input sequences. To address this, tokenization methods such as Byte Pair Encoding (BPE) are employed to reduce sequence length while increasing the size of the vocabulary.

  • Byte Pair Encoding (BPE): This method groups frequently occurring pairs of symbols (bytes) into new symbols. For instance, if a pair of tokens such as (135, 32) appears repeatedly, it is replaced by a new token with its own ID (e.g., 256). The process iteratively reduces the sequence length while expanding the token vocabulary. A toy example follows below.
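The sketch below performs a single BPE merge step over a byte sequence; real tokenizers repeat this until a target vocabulary size (roughly 100k tokens for GPT-4) is reached.

```python
from collections import Counter

def most_frequent_pair(ids):
    # Count every adjacent pair of token IDs and return the most common one
    return Counter(zip(ids, ids[1:])).most_common(1)[0][0]

def merge(ids, pair, new_id):
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)   # replace the frequent pair with the new token
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

ids = list("hello hello world".encode("utf-8"))
pair = most_frequent_pair(ids)   # the most frequent adjacent byte pair
ids = merge(ids, pair, 256)      # 256 is the first ID beyond the byte range 0-255
print(pair, ids)
```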

5. Vocabulary Size – Trade-off Between Sequence Length and Token Granularity

In practice, state-of-the-art LLMs like GPT-4 use a vocabulary size of 100,277 tokens. This iterative merging stops when a predefined vocabulary size is reached. This balance allows shorter sequences to be used for training while maintaining token granularity that captures essential language features. Each token can represent characters, words, spaces, or even common word combinations.

6. Tokenizing Text – Example and Practical Insights

Using GPT-4’s tokenizer (cl100k_base), the input text is split into tokens based on the model’s predefined vocabulary. For example:

  • The phrase “hello world” is tokenized into two tokens: one for “hello” and one for “space + world.”
  • Adding or removing spaces results in different tokens due to subtle variations in text patterns.
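You can reproduce this with OpenAI’s tiktoken library, which ships the cl100k_base vocabulary used by GPT-4.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode("hello world")
print(tokens)                              # two token IDs
print([enc.decode([t]) for t in tokens])   # ['hello', ' world'] - note the leading space
```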

Why Is This Useful?

  • Optimizing Neural Network Input: Large Language Models (LLMs) like GPT-4 don’t read raw text. Instead, they process tokenized input.
  • Understanding Compression: Some words are split into multiple tokens, while others stay intact.
  • Efficiency in Training: Tokenization allows efficient storage and manipulation of text data.

The process of converting raw text into symbols, or tokens, is called tokenization. Tokenization is crucial because it translates raw text data into a format that neural networks can efficiently understand and process (and from which vector embeddings are later computed). It also strikes a trade-off between vocabulary richness and sequence length, which is key to optimizing the training process for large-scale LLMs. This step sets the foundation for the subsequent phases of LLM pretraining, where these tokens become the building blocks of the model’s understanding of language patterns, syntax, and semantics.

LLM Pretraining Step 3: Neural Network

A neural network is a computational model designed to simulate the way the human brain processes information. It consists of layers of interconnected nodes (neurons) that work together to recognize patterns, make decisions, and solve complex tasks.

Key Characteristics:

  1. Inspired by the Human Brain – Mimics how biological neurons process and transmit information.
  2. Layered Structure – Composed of an input layer, hidden layers, and an output layer.
  3. Learning through Training – Adjusts internal parameters (weights) over multiple iterations to improve accuracy.
  4. Task-Specific Adaptability – Can handle various problems such as classification, pattern recognition, and clustering.

How It Works:

  • Nodes (Neurons): Fundamental units that process data.
  • Connections (Weights): Store learned information and adjust based on input.
  • Training Process: Weights are updated over multiple iterations using training data.
  • Final Model: A trained neural network can efficiently perform the intended task.

A neural network is a powerful AI tool that learns from data and improves over time, enabling machines to make human-like decisions.

Also read: Introduction to Neural Network in Machine Learning

Neural Network I/O

Input: Tokenized Sequences

The input to the neural network consists of sequences of tokens derived from a dataset through tokenization. Tokenization breaks down the text into discrete units, which are assigned unique numerical IDs. In this example, we consider a short sequence of tokens:

If you are done with step1

Token ID | Token
2746     | "If"
499      | "you"
527      | "are"
2884     | "Done"
449      | "with"
3094     | "step"
161      | "1"

These tokens are fed into the neural network as context, aiming to predict the next token in the sequence.

Processing: Probability Distribution Prediction

Once the token sequence is passed through the neural network, it generates a probability distribution over a vocabulary of possible next tokens. In this case, the vocabulary size of GPT-4 is 100,277 unique tokens. The output is a probability score assigned to each possible token, representing the likelihood of its occurrence as the next token.
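The conversion from raw network scores (logits) to such a probability distribution is done with softmax; the sketch below uses a toy four-token vocabulary in place of GPT-4’s 100,277 tokens.

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - np.max(logits))  # subtract the max for numerical stability
    return z / z.sum()

logits = np.array([2.0, 1.0, 0.1, -1.0])  # one raw score per candidate next token
probs = softmax(logits)
print(probs)        # roughly [0.64, 0.23, 0.10, 0.03] - higher score, higher probability
print(probs.sum())  # 1.0
```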


Backpropagation and Adjustment

To correct its predictions, the neural network goes through a mathematical update process:

  1. Calculate Loss – A loss function (like cross-entropy loss) measures how far the predicted probabilities are from the correct probabilities. A lower probability for the correct token results in a higher loss.
  2. Compute Gradients – The network uses gradient descent to determine how to adjust the weights of its neurons.
  3. Update Weights – The model’s internal parameters (weights) are tweaked slightly so that the next time it sees the same sequence, it increases the probability of the correct next token and decreases the probability of incorrect options (a minimal sketch of this update follows below).
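Here is a minimal, hedged sketch of one such update in PyTorch; a single embedding-plus-linear layer stands in for the real Transformer so the mechanics stay visible, and the token IDs are toy values.

```python
import torch
import torch.nn as nn

vocab_size, embed_dim, context_len = 1000, 64, 4

# Toy stand-in for the real network: embed 4 context tokens, predict the next one
model = nn.Sequential(
    nn.Embedding(vocab_size, embed_dim),
    nn.Flatten(),
    nn.Linear(context_len * embed_dim, vocab_size),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

context = torch.tensor([[11, 42, 7, 256]])   # four context token IDs (toy values)
target = torch.tensor([99])                  # the correct next token ID

logits = model(context)          # one raw score per token in the vocabulary
loss = loss_fn(logits, target)   # high when the correct token gets low probability
loss.backward()                  # backpropagation: compute gradients of the loss
optimizer.step()                 # nudge weights to raise the correct token's probability
optimizer.zero_grad()
```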

Training and Refinement

The neural network updates its parameters using a mathematical optimization process. Given the correct token, the training algorithm adjusts the network weights such that:

  • The probability of the correct token increases.
  • The probabilities of incorrect tokens decrease.

For instance, after an update, the probability of a token may increase from 4% to 6%, while the probabilities of other tokens adjust accordingly. This iterative process occurs across large batches of training data, refining the network’s ability to model the statistical relationships between tokens.

Through continuous exposure to data and iterative updates, the neural network improves its predictive capability. By analyzing context windows of tokens and refining probability distributions, it learns to generate text sequences that align with real-world linguistic patterns.

Internal Working of Neural Network

Source: Andrej Karpathy

A neural network, particularly modern architectures like Transformers, follows a structured computational process to generate meaningful predictions based on input data. Below is a detailed explanation of its internals, broken down into key stages.

1. Input Representation: Token Sequences

Neural networks process input data in the form of token sequences. Each token is a numerical representation of a word or a subword.

  • The input length can vary from 0 to 8,000 tokens (depending on the model), but computational constraints limit the maximum context length.
  • Token sequences are the primary data structures that flow through the network.

2. Mathematical Processing with Parameters (Weights)

Once token sequences are fed into the network, they are processed mathematically using a large number of parameters (also called weights).

  • Parameters are initially random, leading to random predictions.
  • Through training, these parameters are adjusted to reflect patterns in the training dataset.

3. The Mathematical Expressions Behind Neural Networks

The network itself is a giant mathematical function with a fixed structure. It mixes inputs x1, x2, … with weights w1, w2, … through:

  • Multiplication
  • Addition
  • Exponentiation
  • Normalization (LayerNorm)
  • Matrix Operations
  • Activation Functions (Softmax, etc.)

Even though modern networks contain billions of parameters, at their core, they perform simple mathematical operations repeatedly.

Example: A basic operation in a neural network may look like:

y = σ(w1·x1 + w2·x2 + … + b), i.e., a weighted sum of the inputs passed through a non-linear activation σ.

You can learn more about it here: {link of article}

4. The Transformer Architecture: The Backbone of Modern Neural Networks

Here we look at nano-GPT, a tiny model with a mere 85,000 parameters.

We take the sequence C B A B B C and train the model to sort it into A B B B C C.

Each letter is a token, with the following token indices:

Token | A | B | C
Index | 0 | 1 | 2

After this, embedding happens: in the visualization, each green cell represents a number being processed, and each blue cell is a weight.

The embedding is then passed through the model, going through a series of Transformer layers, before reaching the output at the bottom of the visualization.

5. Neural Network Output: Prediction Generation

After processing through multiple layers, the network outputs a probability distribution over possible next tokens.

  • The final layer (Logits & Softmax) predicts the next token.
  • The output token is fed back into the network in an autoregressive manner.
  • This process repeats iteratively, generating coherent text.
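Schematically, the decoding loop looks like the sketch below, where `model` is any callable that maps a token sequence to a probability distribution over the next token (a stand-in for the real Transformer).

```python
import numpy as np

def generate(model, tokens, max_new_tokens=50, eos_id=None):
    """Autoregressive decoding: append each predicted token and feed it back in."""
    for _ in range(max_new_tokens):
        probs = model(tokens)                                 # distribution over the vocabulary
        next_id = int(np.random.choice(len(probs), p=probs))  # sample the next token
        tokens = tokens + [next_id]
        if eos_id is not None and next_id == eos_id:
            break                                             # stop at the end-of-sequence token
    return tokens
```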

6. Training the Neural Network: Adjusting Parameters

The training process involves:

  1. Computing the Loss: The difference between the predicted output and the correct output is measured using loss functions (e.g., cross-entropy loss).
  2. Backpropagation: The loss is used to update network parameters via gradient descent.
  3. Optimization (Gradient Descent, Adam, etc.): Parameters are adjusted to minimize prediction errors over many iterations.

Training is like tuning a musical instrument—gradually refining parameters to produce meaningful outputs.

7. Inference: Generating New Predictions


Once a model is trained, it enters the inference phase, where it predicts new text based on user-provided input.

  • The model generates tokens step by step using learned knowledge.
  • It follows statistical patterns from training data.
  • The process repeats until a stopping condition is met (e.g., max length, EOS token).

While neural networks use biological terminology, they are not equivalent to biological brains. Unlike biological neurons, neural networks operate without memory and process inputs statelessly. Additionally, biological neurons exhibit dynamic and adaptive behaviour beyond mathematical formulas, whereas neural networks, including transformers, remain purely mathematical constructs without sentient cognition.

Base Model

A base model in large language models (LLMs), like GPT, refers to a pretrained model that has been trained on vast amounts of internet text data but has not yet been fine-tuned for specific tasks.

Key Points About Base Models:

  1. Token Simulators: A base model essentially predicts the next token (word, subword, or character) given a sequence of previous tokens. It is a statistical pattern recognizer that generates text based on probabilities learned from training data.
  2. Not Directly Useful for Assistants: A base model doesn’t inherently understand user intent or follow conversational instructions. Instead, it generates text in an open-ended way, often producing a remix of internet text.
  3. Limited Releases: Most base models are not publicly released because they are just an intermediate step in developing a useful AI assistant. Companies usually fine-tune these base models before releasing them for public use.
  4. Example – GPT-2:
    • OpenAI released GPT-2 in 2019 with a 1.5 billion parameter base model.
    • It was a raw model trained to predict text sequences but required additional fine-tuning to be used effectively in applications.

GPT-2, or Generative Pre-trained Transformer 2, is the second iteration of OpenAI’s Transformer-based language model, first released in 2019. It was a significant milestone in the evolution of large-scale natural language models, setting the stage for modern generative AI applications.

Key Specifications:

  • Parameters: 1.5 billion
  • Training Tokens: 100 billion
  • Maximum Context Length: 1,024 tokens

These numbers, while impressive at the time, are small by today’s standards. For example, Llama 3 (2024) features 405 billion parameters trained on 15 trillion tokens, demonstrating the rapid growth in scale and capability of Transformer-based models.

Inference: How GPT-2 Generates Text

1. Token-Level Simulation

At inference time, GPT-2 functions as a token-level document simulator:

  • It generates text one token at a time, conditioning each prediction on the previous tokens.
  • The process continues iteratively, producing sequences that resemble human-written text.

2. Prompting and In-Context Learning

Even though GPT-2 was not explicitly fine-tuned for specific tasks, prompt engineering enables it to perform various applications:

  • Translation: A well-constructed few-shot prompt can turn GPT-2 into an English-to-Korean translator.
  • Q&A and Assistant-like Behavior: With the right conversation-style prompt, GPT-2 can mimic a chatbot.
  • Story Generation: By seeding with an opening sentence, GPT-2 can complete a passage in a coherent manner.
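As an illustration, a few-shot prompt like the one below can nudge a base model toward translation. This sketch uses the publicly released GPT-2 weights via the Hugging Face transformers library and an English-to-French pattern for readability (the article’s example was English-to-Korean); output quality will be modest, since GPT-2 is small and not instruction-tuned.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Few-shot prompt: show the pattern, then leave the last answer blank
prompt = (
    "English: good morning\nFrench: bonjour\n"
    "English: thank you\nFrench: merci\n"
    "English: see you tomorrow\nFrench:"
)

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=10, do_sample=False,
                         pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```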

3. Limitations of GPT-2 in Inference

  • Short Context Window: With a maximum of 1,024 tokens, GPT-2 struggles with long-form coherence.
  • Lack of Explicit Memory: Unlike later models with retrieval-augmented generation (RAG), GPT-2 relies entirely on its parameters.
  • Prone to Bias and Regurgitation: Due to the nature of its dataset, GPT-2 can produce biased or even verbatim outputs from training data.

Why Are Base Models Important?

  • They form the foundation for creating useful AI applications.
  • Fine-tuning and reinforcement learning make them more useful for interactive tasks, like chatbots, code assistants, or summarization tools.
  • They enable adaptability, allowing researchers and developers to fine-tune them for specific domains (e.g., medical AI, legal AI).

So, this is the LLM Pretraining stage. 

Key Takeaways from the Pre-training Stage:

  1. Pre-training is about token prediction:
    • We train the model using Internet documents broken down into tokens (small chunks of text).
    • The model learns to predict token sequences based on statistical patterns in the data.
  2. The base model is an “Internet Document Simulator”:
    • It generates text that mimics Internet writing at the token level.
    • It lacks alignment with human intent, meaning it’s not yet useful as an AI assistant.
  3. Base model limitations:
    • It can generate fluent text but doesn’t understand questions or follow instructions well.
    • We need additional steps to make it interactive and aligned with human needs.

Next Stage: Post-training

  • Goal: Improve the base model to function as a useful AI assistant.
  • Approach: Apply post-training techniques to refine responses, making them more accurate, helpful, and aligned with user expectations.

This next stage transforms the model from a statistical text generator into a practical AI assistant capable of answering questions effectively.

We will talk about the post-training stage in the next article…

Conclusion

The LLM pretraining stage is the foundation of modern AI development, shaping the capabilities of models like GPT-4 and beyond. As we advance toward Artificial General Intelligence (AGI), pretraining remains a critical component in improving language understanding, efficiency, and reasoning.

This process involves massive datasets, sophisticated filtering mechanisms, and tokenization strategies that refine raw data into meaningful input for neural networks. Through iterative learning, neural networks enhance their predictive accuracy by analyzing patterns in tokenized text and optimizing mathematical relationships.

Despite their impressive abilities, LLMs are not sentient—they rely on statistical probabilities and structured computations rather than true comprehension. As AI models continue to evolve, advancements in pretraining methodologies will play a key role in driving performance improvements, cost reductions, and broader accessibility.

In the ongoing race for AI supremacy, pretraining is not just a technical necessity; it is a strategic battleground where the future of AI is being forged.

Hi, I am Pankaj Singh Negi - Senior Content Editor | Passionate about storytelling and crafting compelling narratives that transform ideas into impactful content. I love reading about technology revolutionizing our lifestyle.
