A Guide to 400+ Categorized Large Language Model (LLM) Datasets

Pankaj Singh | Last Updated: 10 Nov, 2024
10 min read

You can find useful datasets on countless platforms—Kaggle, Papers with Code, GitHub, and more. But what if I told you there’s a goldmine: a repository packed with more than 400 datasets, meticulously categorized across seven essential dimensions—Pre-training Corpora, Instruction Fine-tuning Datasets, Preference Datasets, Evaluation Datasets, Traditional NLP Datasets, Multi-modal Large Language Models (MLLMs) Datasets, and Retrieval Augmented Generation (RAG) Datasets? And to top it off, this collection receives regular updates. Sounds impressive, right?

These datasets were compiled by Yang Liu, Jiahuan Cao, Chongyu Liu, Kai Ding, and Lianwen Jin in their paper “Datasets for Large Language Models: A Comprehensive Survey,” released in February 2024. It offers a groundbreaking look at the backbone of large language model (LLM) development: datasets.

Note: I am providing you with a brief description of the datasets mentioned in the research paper; you can find all the datasets in the repo.

400+ Datasets for Your GenAI/LLMs Project

Datasets for Your GenAI/LLMs Project: Abstract Overview of the Paper

Figure: Datasets for Large Language Models (Source: “Datasets for Large Language Models: A Comprehensive Survey”)

This paper sets out to navigate the intricate landscape of LLM datasets, which are the cornerstone behind the stellar evolution of these models. Just as the roots of a tree provide the necessary support and nutrients for growth, datasets are fundamental to LLMs. Thus, studying these datasets isn’t just relevant; it’s essential.

Given the current gaps in comprehensive analysis and overview, this survey organizes and categorizes the essential types of LLM datasets into seven primary perspectives:

  1. Pre-training Corpora
  2. Instruction Fine-tuning Datasets
  3. Preference Datasets
  4. Evaluation Datasets
  5. Traditional Natural Language Processing (NLP) Datasets
  6. Multi-modal Large Language Models (MLLMs) Datasets
  7. Retrieval Augmented Generation (RAG) Datasets

The research outlines the key challenges that exist today and suggests potential directions for further exploration. It goes a step beyond mere discussion by compiling a thorough review of available dataset resources: statistics from 444 datasets spanning 32 domains and 8 language categories. This includes extensive data size metrics—more than 774.5 TB for pre-training corpora alone and 700 million instances across other dataset types.

This survey acts as a complete roadmap to guide researchers, serve as an invaluable resource, and inspire future studies in the LLM field.

The figure above captures the overall architecture of the survey.

Also read: 10 Datasets by INDIAai for your Next Data Science Project

LLM Text Datasets Across Seven Dimensions

Here are the key types of LLM text datasets, categorized into seven main dimensions: Pre-training Corpora, Instruction Fine-tuning Datasets, Preference Datasets, Evaluation Datasets, Traditional NLP Datasets, Multi-modal Large Language Models (MLLMs) Datasets, and Retrieval Augmented Generation (RAG) Datasets. These categories are regularly updated for comprehensive coverage.

Note: I am using the same structure mentioned in the repo, and you can refer to the repo for the dataset information format.

It looks like this:

- Dataset name  Release Time | Public or Not | Language | Construction Method | Paper | Github | Dataset | Website
- Publisher:
- Size:
- License:
- Source:

Repo Link: Awesome-LLMs-Datasets
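
To make that template concrete, here is a minimal Python sketch of how one entry could be represented programmatically. The class and field names are my own illustration, not part of the repo; the example values come from the FineWeb entry listed below.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DatasetEntry:
    """One entry in the repo's dataset-information format (class is illustrative)."""
    name: str
    release_time: str          # e.g. "2024-4"
    public: str                # "All", "Partial", or "Not"
    language: str              # e.g. "EN", "ZH", "Multi (419)"
    construction_method: str   # "HG" (human generated), "CI" (collection and improvement), "MC" (model constructed)
    publisher: str
    size: Optional[str] = None
    license: Optional[str] = None
    source: Optional[str] = None

# Example populated from the FineWeb entry in the Webpages category below
fineweb = DatasetEntry(
    name="FineWeb",
    release_time="2024-4",
    public="All",
    language="EN",
    construction_method="CI",
    publisher="HuggingFaceFW",
    size="15 T Tokens",
    license="ODC-BY-1.0",
    source="Common Crawl",
)
print(fineweb.name, fineweb.size)  # FineWeb 15 T Tokens
```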

1. Pre-training Corpora

These are extensive collections of text used during the initial training phase of LLMs.

A. General Pre-training Corpora: Large-scale datasets that include diverse text sources from various domains. They are designed to train foundational models that can perform various tasks due to their broad data coverage.

Webpages

  • MADLAD-400 2023-9 | All | Multi (419) | HG | 
    Paper | Github | Dataset
    • Publisher: Google DeepMind et al.
    • Size: 2.8 T Tokens
    • License: ODL-BY
    • Source: Common Crawl
  • FineWeb 2024-4 | All | EN | CI | 
    Dataset
    • Publisher: HuggingFaceFW
    • Size: 15 T Tokens
    • License: ODC-BY-1.0
    • Source: Common Crawl
  • CCI 2.0 2024-4 | All | ZH | HG | 
    Dataset1 | Dataset2
    • Publisher: BAAI
    • Size: 501 GB
    • License: CCI Usage Agreement
    • Source: Chinese webpages
  • DCLM 2024-6 | All | EN | CI |
    Paper | Github | Dataset | Website
    • Publisher: University of Washington et al.
    • Size: 279.6 TB
    • License: Common Crawl Terms of Use
    • Source: Common Crawl
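
Corpora of this scale are rarely downloaded whole. As a minimal sketch, assuming the Hugging Face `datasets` library and the field names on the FineWeb dataset card, an entry like the FineWeb corpus above can be inspected in streaming mode:

```python
# Streaming avoids fetching the full multi-terabyte corpus up front.
from datasets import load_dataset  # pip install datasets

fw = load_dataset("HuggingFaceFW/fineweb", split="train", streaming=True)
for i, record in enumerate(fw):
    print(record["text"][:200])  # each record carries raw webpage text
    if i == 2:  # peek at the first three records only
        break
```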

Language Texts

  • ANC 2003-X | All | EN | HG | 
    Website
    • Publisher: The US National Science Foundation et al.
    • Size: –
    • License: –
    • Source: American English texts
  • BNC 1994-X | All | EN | HG | 
    Website
    • Publisher: Oxford University Press et al.
    • Size: 4124 Texts
    • License: –
    • Source: British English texts
  • News-crawl 2019-1 | All | Multi (59) | HG | 
    Dataset
    • Publisher: UKRI et al.
    • Size: 110 GB
    • License: CC0
    • Source: Newspapers

Books

  • Anna’s Archive 2023-X | All | Multi | HG |
    Website
    • Publisher: Anna
    • Size: 586.3 TB
    • License: –
    • Source: Sci-Hub, Library Genesis, Z-Library, etc.
  • BookCorpusOpen 2021-5 | All | EN | CI | 
    Paper | Github | Dataset
    • Publisher: Jack Bandy et al.
    • Size: 17,868 Books
    • License: Smashwords Terms of Service
    • Source: Toronto Book Corpus
  • PG-19 2019-11 | All | EN | HG | 
    Paper | Github | Dataset
    • Publisher: DeepMind
    • Size: 11.74 GB
    • License: Apache-2.0
    • Source: Project Gutenberg
  • Project Gutenberg 1971-X | All | Multi | HG |
    Website
    • Publisher: Ibiblio et al.
    • Size: –
    • License: The Project Gutenberg License
    • Source: Ebook data

You can find more categories in this dimension here: General Pre-training Corpora

B. Domain-specific Pre-training Corpora: Customized datasets focused on specific fields or topics, used for targeted, incremental pre-training to enhance performance in specialized domains.

Financial

  • BBT-FinCorpus 2023-2 | Partial | ZH | HG | 
    Paper | Github | Website
    • Publisher: Fudan University et al.
    • Size: 256 GB
    • License: –
    • Source: Company announcements, research reports, financial
    • Category: Multi
    • Domain: Finance
  • FinCorpus 2023-9 | All | ZH | HG | 
    Paper | Github | Dataset
    • Publisher: Du Xiaoman
    • Size: 60.36 GB
    • License: Apache-2.0
    • Source: Company announcements, financial news, financial exam questions
    • Category: Multi
    • Domain: Finance
  • FinGLM 2023-7 | All | ZH | HG | 
    Github
    • Publisher: Knowledge Atlas et al.
    • Size: 69 GB
    • License: Apache-2.0
    • Source: Annual Reports of Listed Companies
    • Category: Language Texts
    • Domain: Finance

Medical

  • Medical-pt 2023-5 | All | ZH | CI | 
    Github | Dataset
    • Publisher: Ming Xu
    • Size: 632.78 MB
    • License: Apache-2.0
    • Source: Medical encyclopedia data, medical textbooks
    • Category: Multi
    • Domain: Medical
  • PubMed Central 2000-2 | All | EN | HG |
    Website
    • Publisher: NCBI
    • Size: –
    • License: PMC Copyright Notice
    • Source: Biomedical scientific literature
    • Category: Academic Materials
    • Domain: Medical

Math

  • Proof-Pile-2 2023-10 | All | EN | HG & CI | 
    Paper | Github | Dataset | Website
    • Publisher: Princeton University et al.
    • Size: 55 B Tokens
    • License: –
    • Source: ArXiv, OpenWebMath, AlgebraicStack
    • Category: Multi
    • Domain: Mathematics
  • MathPile 2023-12 | All | EN | HG | 
    Paper | Github | Dataset
    • Publisher: Shanghai Jiao Tong University et al.
    • Size: 9.5 B Tokens
    • License: CC-BY-NC-SA-4.0
    • Source: Textbooks, Wikipedia, ProofWiki, CommonCrawl, StackExchange, arXiv
    • Category: Multi
    • Domain: Mathematics
  • OpenWebMath 2023-10 | All | EN | HG |
    Paper | Github | Dataset
    • Publisher: University of Toronto et al.
    • Size: 14.7 B Tokens
    • License: ODC-BY-1.0
    • Source: Common Crawl
    • Category: Webpages
    • Domain: Mathematics

You can find more categories in this dimension here: Domain-specific Pre-training Corpora

2. Instruction Fine-tuning Datasets

These datasets consist of pairs of “instruction inputs” (requests made to the model) and corresponding “answer outputs” (model-generated responses).
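
As a concrete illustration, here is a minimal sketch of one such instruction-input/answer-output pair in the JSON structure popularized by Alpaca-style datasets; the content of the example is hypothetical:

```python
import json

# Keys follow the Alpaca_data convention; the values are made up for illustration.
pair = {
    "instruction": "Summarize the following paragraph in one sentence.",
    "input": "Large language models are trained on massive text corpora ...",
    "output": "LLMs learn language patterns from very large text collections.",
}

# Fine-tuning pipelines typically store such pairs as JSON lines.
print(json.dumps(pair, indent=2))
```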

A. General Instruction Fine-tuning Datasets: Include a variety of instruction types without domain limitations. They aim to improve the model’s ability to follow instructions across general tasks.

Human Generated Datasets (HG)

  • databricks-dolly-15K 2023-4 | All | EN | HG | 
    Dataset | Website
    • Publisher: Databricks
    • Size: 15011 instances
    • License: CC-BY-SA-3.0
    • Source: Manually generated based on different instruction categories
    • Instruction Category: Multi
  • InstructionWild_v2 2023-6 | All | EN & ZH | HG | 
    Github
    • Publisher: National University of Singapore
    • Size: 110K instances
    • License: –
    • Source: Collected on the web
    • Instruction Category: Multi
  • LCCC 2020-8 | All | ZH | HG | 
    Paper | Github
    • Publisher: Tsinghua University et al.
    • Size: 12M instances
    • License: MIT
    • Source: Crawl user interactions on social media
    • Instruction Category: Multi

Model Constructed Datasets (MC)

  • Alpaca_data 2023-3 | All | EN | MC | 
    Github
    • Publisher: Stanford Alpaca
    • Size: 52K instances
    • License: Apache-2.0
    • Source: Generated by Text-Davinci-003 with Alpaca_data prompts
    • Instruction Category: Multi
  • BELLE_Generated_Chat 2023-5 | All | ZH | MC |
    Github | Dataset
    • Publisher: BELLE
    • Size: 396004 instances
    • License: GPL-3.0
    • Source: Generated by ChatGPT
    • Instruction Category: Generation
  • BELLE_Multiturn_Chat 2023-5 | All | ZH | MC | 
    Github | Dataset
    • Publisher: BELLE
    • Size: 831036 instances
    • License: GPL-3.0
    • Source: Generated by ChatGPT
    • Instruction Category: Multi

You can find more categories in this dimension here: General Instruction Fine-tuning Datasets

B. Domain-specific Instruction Fine-tuning Datasets: Tailored for specific domains, containing instructions relevant to particular knowledge areas or task types.

Medical

  • ChatDoctor 2023-3 | All | EN | HG & MC | 
    Paper | Github | Dataset
    • Publisher: University of Texas Southwestern Medical Center et al.
    • Size: 115K instances
    • License: Apache-2.0
    • Source: Real conversations between doctors and patients & Generated by ChatGPT
    • Instruction Category: Multi
    • Domain: Medical
  • ChatMed_Consult_Dataset 2023-5 | All | ZH | MC | 
    Github | Dataset
    • Publisher: michael-wzhu
    • Size: 549326 instances
    • License: CC-BY-NC-4.0
    • Source: Generated by GPT-3.5-Turbo
    • Instruction Category: Multi
    • Domain: Medical
  • CMtMedQA 2023-8 | All | ZH | HG | 
    Paper | Github | Dataset
    • Publisher: Zhengzhou University
    • Size: 68023 instances
    • License: MIT
    • Source: Real conversations between doctors and patients
    • Instruction Category: Multi
    • Domain: Medical

Code

  • Code_Alpaca_20K 2023-3 | All | EN & PL | MC | 
    Github | Dataset
    • Publisher: Sahil Chaudhary
    • Size: 20K instances
    • License: Apache-2.0
    • Source: Generated by Text-Davinci-003
    • Instruction Category: Code
    • Domain: Code
  • CodeContest 2022-3 | All | EN & PL | CI | 
    Paper | Github
    • Publisher: DeepMind
    • Size: 13610 instances
    • License: Apache-2.0
    • Source: Collection and improvement of various datasets
    • Instruction Category: Code
    • Domain: Code
  • CommitPackFT 2023-8 | All | EN & PL (277) | HG | 
    Paper | Github | Dataset
    • Publisher: Bigcode
    • Size: 702062 instances
    • License: MIT
    • Source: GitHub Action dump
    • Instruction Category: Code
    • Domain: Code

You can find more categories in this dimension here: Domain-specific Instruction Fine-tuning Datasets

3. Preference Datasets

Preference datasets evaluate and refine model responses by providing comparative feedback on multiple outputs for the same input.
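
For example, each record in the hh-rlhf dataset listed below pairs a preferred (“chosen”) and a dispreferred (“rejected”) continuation of the same dialogue prefix. A minimal sketch, assuming the Hugging Face `datasets` library and the field names on the dataset card:

```python
from datasets import load_dataset  # pip install datasets

hh = load_dataset("Anthropic/hh-rlhf", split="train", streaming=True)
record = next(iter(hh))
print(record["chosen"][:200])    # the continuation human raters preferred
print(record["rejected"][:200])  # the continuation they rejected
```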

A. Preference Evaluation Methods: These can include methods such as voting, sorting, and scoring to establish how model responses align with human preferences.

Vote

  • Chatbot_arena_conversations 2023-6 | All | Multi | HG & MC |
    Paper | Dataset
    • Publisher: UC Berkeley et al.
    • Size: 33000 instances
    • License: CC-BY-4.0 & CC-BY-NC-4.0
    • Domain: General
    • Instruction Category: Multi
    • Preference Evaluation Method: VO-H
    • Source: Generated by twenty LLMs & Manual judgment
  • hh-rlhf 2022-4 | All | EN | HG & MC |
    Paper1 | Paper2 | Github | Dataset
    • Publisher: Anthropic
    • Size: 169352 instances
    • License: MIT
    • Domain: General
    • Instruction Category: Multi
    • Preference Evaluation Method: VO-H
    • Source: Generated by LLMs & Manual judgment
  • MT-Bench_human_judgments 2023-6 | All | EN | HG & MC |
    Paper | Github | Dataset | Website
    • Publisher: UC Berkeley et al.
    • Size: 3.3K instances
    • License: CC-BY-4.0
    • Domain: General
    • Instruction Category: Multi
    • Preference Evaluation Method: VO-H
    • Source: Generated by LLMs & Manual judgment

You can find more categories in this dimension here: Preference Evaluation Methods

4. Evaluation Datasets

These datasets are meticulously curated and annotated to measure the performance of LLMs on various tasks. They are categorized based on the domains they are used to evaluate.
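
Benchmarks such as AlpacaEval (below) report a pairwise win rate: the fraction of prompts on which a judge prefers the candidate model’s answer over a baseline’s, with ties counted as half a win. A minimal sketch with hypothetical judgments:

```python
# Each element records which side a judge preferred on one prompt (toy data).
judgments = ["candidate", "baseline", "candidate", "tie", "candidate"]

wins = sum(1 for j in judgments if j == "candidate")
ties = sum(0.5 for j in judgments if j == "tie")
win_rate = (wins + ties) / len(judgments)
print(f"Win rate: {win_rate:.1%}")  # Win rate: 70.0%
```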

General

  • AlpacaEval 2023-5 | All | EN | CI & MC | 
    Paper | Github | Dataset | Website
    • Publisher: Stanford et al.
    • Size: 805 instances
    • License: Apache-2.0
    • Question Type: SQ
    • Evaluation Method: ME
    • Focus: The performance on open-ended question answering
    • Numbers of Evaluation Categories/Subcategories: 1/-
    • Evaluation Category: Open-ended question answering
  • BayLing-80 2023-6 | All | EN & ZH | HG & CI | 
    Paper | Github | Dataset
    • Publisher: Chinese Academy of Sciences
    • Size: 320 instances
    • License: GPL-3.0
    • Question Type: SQ
    • Evaluation Method: ME
    • Focus: Chinese-English language proficiency and multimodal interaction skills
    • Numbers of Evaluation Categories/Subcategories: 9/-
    • Evaluation Category: Writing, Roleplay, Common-sense, Fermi, Counterfactual, Coding, Math, Generic, Knowledge
  • BELLE_eval 2023-4 | All | ZH | HG & MC | 
    Paper | Github
    • Publisher: BELLE
    • Size: 1000 instances
    • License: Apache-2.0
    • Question Type: SQ
    • Evaluation Method: ME
    • Focus: The performance of Chinese language models in following instructions
    • Numbers of Evaluation Categories/Subcategories: 9/-
    • Evaluation Category: Extract, Closed qa, Rewrite, Summarization, Generation, Classification, Brainstorming, Open qa, Others

You can find more categories in this dimension here: Evaluation Dataset

5. Traditional NLP Datasets

These datasets cover text used for natural language processing tasks prior to the era of LLMs. They are essential for tasks like language modelling, translation, and sentiment analysis in traditional NLP workflows.
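
These benchmarks are straightforward to load today. A minimal sketch for BoolQ (listed below), assuming the Hugging Face `datasets` library and the field names on the dataset card:

```python
from datasets import load_dataset  # pip install datasets

boolq = load_dataset("boolq", split="validation")
row = boolq[0]
print(row["question"])  # a yes/no question about the accompanying passage
print(row["answer"])    # the boolean gold label
```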

Selection & Judgment

  • BoolQ 2019-5 | EN | 
    Paper | Github
    • Publisher: University of Washington et al.
    • Train/Dev/Test/All Size: 9427/3270/3245/15942
    • License: CC-BY-SA-3.0
  • CosmosQA 2019-9 | EN | 
    Paper | Github | Dataset | Website
    • Publisher: University of Illinois Urbana-Champaign et al.
    • Train/Dev/Test/All Size: 25588/3000/7000/35588
    • License: CC-BY-4.0
  • CondaQA 2022-11 | EN | 
    Paper | Github | Dataset
    • Publisher: Carnegie Mellon University et al.
    • Train/Dev/Test/All Size: 5832/1110/7240/14182
    • License: Apache-2.0
  • PubMedQA 2019-9 | EN | 
    Paper | Github | Dataset | Website
    • Publisher: University of Pittsburgh et al.
    • Train/Dev/Test/All Size: -/-/-/273.5K
    • License: MIT
  • MultiRC 2018-6 | EN | 
    Paper | Github | Dataset
    • Publisher: University of Pennsylvania et al.
    • Train/Dev/Test/All Size: -/-/-/9872
    • License: MultiRC License

You can find more categories in this dimension here: Traditional NLP Datasets

6. Multi-modal Large Language Models (MLLMs) Datasets

Datasets in this category integrate multiple data types, such as text and images, to train models capable of processing and generating responses across different modalities.
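
Here is a minimal, hypothetical sketch of what a multi-modal instruction record can look like: a reference to a visual input paired with a text instruction and a target response. The field names are illustrative, not a fixed standard.

```python
# A toy multi-modal instruction record (all values are made up).
record = {
    "image": "images/chart_001.png",  # path or URL to the visual input
    "instruction": "Describe the trend shown in this chart.",
    "response": "The chart shows a steady increase from 2019 to 2023.",
}
print(record["instruction"])
```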

Subcategories here include Documents, plus instruction fine-tuning datasets for areas such as Remote Sensing and Images + Videos.

You can find more categories in this dimension here: Multi-modal Large Language Models (MLLMs) Datasets

7. Retrieval Augmented Generation (RAG) Datasets

These datasets enhance LLMs with retrieval capabilities, enabling models to access and integrate external data sources for more informed and contextually relevant responses.
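
Here is a minimal, hypothetical sketch of the retrieve-then-generate loop these datasets support; the toy corpus and the word-overlap scorer stand in for a real document store and retriever:

```python
# Toy document store (contents taken from facts stated in this article).
corpus = {
    "doc1": "The survey catalogues 444 datasets across 32 domains.",
    "doc2": "Pre-training corpora alone exceed 774.5 TB of data.",
}

def retrieve(query: str, k: int = 1) -> list[str]:
    # Toy lexical scorer: rank documents by word overlap with the query.
    def score(text: str) -> int:
        return len(set(query.lower().split()) & set(text.lower().split()))
    return sorted(corpus.values(), key=score, reverse=True)[:k]

query = "How many datasets does the survey cover?"
context = "\n".join(retrieve(query))
prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
print(prompt)  # this augmented prompt would then be passed to an LLM
```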

You can find more categories in this dimension here: Retrieval Augmented Generation (RAG) Datasets

Conclusion

In conclusion, “Datasets for Large Language Models: A Comprehensive Survey” provides an invaluable roadmap for navigating the diverse and complex world of LLM datasets. This extensive review by Liu, Cao, Liu, Ding, and Jin showcases over 400 datasets, meticulously categorized into critical dimensions such as Pre-training Corpora, Instruction Fine-tuning Datasets, Preference Datasets, and Evaluation Datasets, covering more than 774.5 TB of data and 700 million instances. By breaking down these datasets and their uses—from broad foundational pre-training sets to highly specialized, domain-specific collections—the survey highlights existing resources and maps out current challenges and future research directions in developing and optimizing LLMs. It serves both as a guide for researchers entering the field and as a reference for those aiming to enhance generative AI’s capabilities and application scope.

Also, if you are looking for a Generative AI course online, then explore: GenAI Pinnacle Program

Frequently Asked Questions

Q1. What are the main types of datasets used for training LLMs?

Ans. Datasets for LLMs can be broadly categorized into structured data (e.g., tables, databases), unstructured data (e.g., text documents, books, articles), and semi-structured data (e.g., HTML, JSON). The most common are large-scale, diverse text datasets compiled from sources like websites, encyclopedias, and academic papers.

Q2. How do datasets impact the quality of an LLM?

Ans. The training dataset’s quality, diversity, and size heavily impact an LLM’s performance. A well-curated dataset improves the model’s generalizability, comprehension, and bias reduction, while a poorly curated one can lead to inaccuracies and biased outputs.

Q3. What are common sources for LLM datasets?

Ans. Common sources include web scrapes from platforms like Wikipedia, news sites, books, research journals, and large-scale repositories like Common Crawl. Publicly available datasets such as The Pile or OpenWebText are also frequently used.

Q4. How do you handle data bias in LLM datasets?

Ans. Mitigating data bias involves diversifying data sources, implementing fairness-aware data collection strategies, filtering content to reduce bias, and post-training fine-tuning. Regular audits and ethical reviews help identify and minimize biases during dataset creation.

Hi, I am Pankaj Singh Negi - Senior Content Editor | Passionate about storytelling and crafting compelling narratives that transform ideas into impactful content. I love reading about technology revolutionizing our lifestyle.
