You can find useful datasets on countless platforms—Kaggle, Paperwithcode, GitHub, and more. But what if I tell you there’s a goldmine: a repository packed with over 400+ datasets, meticulously categorised across five essential dimensions—Pre-training Corpora, Fine-tuning Instruction Datasets, Preference Datasets, Evaluation Datasets, and Traditional NLP Datasets and more? And to top it off, this collection receives regular updates. Sounds impressive, right?
These datasets were compiled by Yang Liu, Jiahuan Cao, Chongyu Liu, Kai Ding, and Lianwen Jin in their survey on the paper “Datasets for Large Language Models: A Comprehensive Survey,” which has just been released (February 2024). It offers a groundbreaking look at the backbone of large language model (LLM) development: datasets.
Note: I am providing you with a brief description of the datasets mentioned in the research paper; you can find all the datasets in the repo.
This paper sets out to navigate the intricate landscape of LLM datasets, which are the cornerstone behind the stellar evolution of these models. Just as the roots of a tree provide the necessary support and nutrients for growth, datasets are fundamental to LLMs. Thus, studying these datasets isn’t just relevant; it’s essential.
Given the current gaps in comprehensive analysis and overview, this survey organises and categorises the essential types of LLM datasets into five primary perspectives:
Pre-training Corpora
Instruction Fine-tuning Datasets
Preference Datasets
Evaluation Datasets
Traditional Natural Language Processing (NLP) Datasets
Multi-modal Large Language Models (MLLMs) Datasets
Retrieval Augmented Generation (RAG) Datasets.
The research outlines the key challenges that exist today and suggests potential directions for further exploration. It goes a step beyond mere discussion by compiling a thorough review of available dataset resources: statistics from 444 datasets spanning 32 domains and 8 language categories. This includes extensive data size metrics—more than 774.5 TB for pre-training corpora alone and 700 million instances across other dataset types.
This survey acts as a complete roadmap to guide researchers, serve as an invaluable resource, and inspire future studies in the LLM field.
Here are the key types of LLM text datasets, categorized into seven main dimensions: Pre-training Corpora, Instruction Fine-tuning Datasets, Preference Datasets, Evaluation Datasets, Traditional NLP Datasets, Multi-modal Large Language Models (MLLMs) Datasets, and Retrieval Augmented Generation (RAG) Datasets. These categories are regularly updated for comprehensive coverage.
Note: I am using the same structure mentioned in the repo, and you can refer to the repo for the dataset information format.
It is like this –
- Dataset name Release Time | Public or Not | Language | Construction Method | Paper | Github | Dataset | Website - Publisher: - Size: - License: - Source:
These are extensive collections of text used during the initial training phase of LLMs.
A. General Pre-training Corpora: Large-scale datasets that include diverse text sources from various domains. They are designed to train foundational models that can perform various tasks due to their broad data coverage.
B. Domain-specific Pre-training Corpora: Customized datasets focused on specific fields or topics, used for targeted, incremental pre-training to enhance performance in specialized domains.
These datasets consist of pairs of “instruction inputs” (requests made to the model) and corresponding “answer outputs” (model-generated responses).
A. General Instruction Fine-tuning Datasets: Include a variety of instruction types without domain limitations. They aim to improve the model’s ability to follow instructions across general tasks.
Human Generated Datasets (HG)
databricks-dolly-15K 2023-4 | All | EN | HG | Dataset | Website
Publisher: Databricks
Size: 15011 instances
License: CC-BY-SA-3.0
Source: Manually generated based on different instruction categories
Instruction Category: Multi
InstructionWild_v2 2023-6 | All | EN & ZH | HG | Github
B.Domain-specific Instruction Fine-tuning Datasets: Tailored for specific domains, containing instructions relevant to particular knowledge areas or task types.
Preference datasets evaluate and refine model responses by providing comparative feedback on multiple outputs for the same input.
A. Preference Evaluation Methods: These can include methods such as voting, sorting, and scoring to establish how model responses align with human preferences.
Vote
Chatbot_arena_conversations 2023-6 | All | Multi | HG & MC | Paper | Dataset
Publisher: UC Berkeley et al.
Size: 33000 instances
License: CC-BY-4.0 & CC-BY-NC-4.0
Domain: General
Instruction Category: Multi
Preference Evaluation Method: VO-H
Source: Generated by twenty LLMs & Manual judgment
These datasets are meticulously curated and annotated to measure the performance of LLMs on various tasks. They are categorized based on the domains they are used to evaluate.
These datasets cover text used for natural language processing tasks prior to the era of LLMs. They are essential for tasks like language modelling, translation, and sentiment analysis in traditional NLP workflows.
6. Multi-modal Large Language Models (MLLMs) Datasets
Datasets in this category integrate multiple data types, such as text and images, to train models capable of processing and generating responses across different modalities.
Documents
mOSCAR: A large-scale multilingual and multimodal document-level corpus
These datasets enhance LLMs with retrieval capabilities, enabling models to access and integrate external data sources for more informed and contextually relevant responses.
CRUD-RAG: A comprehensive Chinese benchmark for RAG
In conclusion, the comprehensive survey “Datasets for Large Language Models: A Comprehensive Survey” provides an invaluable roadmap for navigating the diverse and complex world of LLM datasets. This extensive review by Liu, Cao, Liu, Ding, and Jin showcases over 400 datasets, meticulously categorized into critical dimensions such as Pre-training Corpora, Instruction Fine-tuning Datasets, Preference Datasets, Evaluation Datasets, and others, covering over 774.5 TB of data and 700 million instances. By breaking down these datasets and their uses—from broad foundational pre-training sets to highly specialized, domain-specific collections—this survey highlights existing resources and maps out current challenges and future research directions in developing and optimising LLMs. This resource serves as both a guide for researchers entering the field and a reference for those aiming to enhance generative AI’s capabilities and application scopes.
Explore your potential in the world of Generative AI! Dive into our GenAI Pinnacle Program and transform your skills into real-world applications. Don’t miss out—Explore the course now!
Frequently Asked Questions
Q1. What are the main types of datasets used for training LLMs?
Ans. Datasets for LLMs can be broadly categorized into structured data (e.g., tables, databases), unstructured data (e.g., text documents, books, articles), and semi-structured data (e.g., HTML, JSON). The most common are large-scale, diverse text datasets compiled from sources like websites, encyclopedias, and academic papers.
Q2. How do datasets impact the quality of an LLM?
Ans. The training dataset’s quality, diversity, and size heavily impact an LLM’s performance. A well-curated dataset improves the model’s generalizability, comprehension, and bias reduction, while a poorly curated one can lead to inaccuracies and biased outputs.
Q3. What are common sources for LLM datasets?
Ans. Common sources include web scrapes from platforms like Wikipedia, news sites, books, research journals, and large-scale repositories like Common Crawl. Publicly available datasets such as The Pile or OpenWebText are also frequently used.
Q4. How do you handle data bias in LLM datasets?
Ans. Mitigating data bias involves diversifying data sources, implementing fairness-aware data collection strategies, filtering content to reduce bias, and post-training fine-tuning. Regular audits and ethical reviews help identify and minimize biases during dataset creation.
Hi, I am Pankaj Singh Negi - Senior Content Editor | Passionate about storytelling and crafting compelling narratives that transform ideas into impactful content. I love reading about technology revolutionizing our lifestyle.
We use cookies essential for this site to function well. Please click to help us improve its usefulness with additional cookies. Learn about our use of cookies in our Privacy Policy & Cookies Policy.
Show details
Powered By
Cookies
This site uses cookies to ensure that you get the best experience possible. To learn more about how we use cookies, please refer to our Privacy Policy & Cookies Policy.
brahmaid
It is needed for personalizing the website.
csrftoken
This cookie is used to prevent Cross-site request forgery (often abbreviated as CSRF) attacks of the website
Identityid
Preserves the login/logout state of users across the whole site.
sessionid
Preserves users' states across page requests.
g_state
Google One-Tap login adds this g_state cookie to set the user status on how they interact with the One-Tap modal.
MUID
Used by Microsoft Clarity, to store and track visits across websites.
_clck
Used by Microsoft Clarity, Persists the Clarity User ID and preferences, unique to that site, on the browser. This ensures that behavior in subsequent visits to the same site will be attributed to the same user ID.
_clsk
Used by Microsoft Clarity, Connects multiple page views by a user into a single Clarity session recording.
SRM_I
Collects user data is specifically adapted to the user or device. The user can also be followed outside of the loaded website, creating a picture of the visitor's behavior.
SM
Use to measure the use of the website for internal analytics
CLID
The cookie is set by embedded Microsoft Clarity scripts. The purpose of this cookie is for heatmap and session recording.
SRM_B
Collected user data is specifically adapted to the user or device. The user can also be followed outside of the loaded website, creating a picture of the visitor's behavior.
_gid
This cookie is installed by Google Analytics. The cookie is used to store information of how visitors use a website and helps in creating an analytics report of how the website is doing. The data collected includes the number of visitors, the source where they have come from, and the pages visited in an anonymous form.
_ga_#
Used by Google Analytics, to store and count pageviews.
_gat_#
Used by Google Analytics to collect data on the number of times a user has visited the website as well as dates for the first and most recent visit.
collect
Used to send data to Google Analytics about the visitor's device and behavior. Tracks the visitor across devices and marketing channels.
AEC
cookies ensure that requests within a browsing session are made by the user, and not by other sites.
G_ENABLED_IDPS
use the cookie when customers want to make a referral from their gmail contacts; it helps auth the gmail account.
test_cookie
This cookie is set by DoubleClick (which is owned by Google) to determine if the website visitor's browser supports cookies.
_we_us
this is used to send push notification using webengage.
WebKlipperAuth
used by webenage to track auth of webenagage.
ln_or
Linkedin sets this cookie to registers statistical data on users' behavior on the website for internal analytics.
JSESSIONID
Use to maintain an anonymous user session by the server.
li_rm
Used as part of the LinkedIn Remember Me feature and is set when a user clicks Remember Me on the device to make it easier for him or her to sign in to that device.
AnalyticsSyncHistory
Used to store information about the time a sync with the lms_analytics cookie took place for users in the Designated Countries.
lms_analytics
Used to store information about the time a sync with the AnalyticsSyncHistory cookie took place for users in the Designated Countries.
liap
Cookie used for Sign-in with Linkedin and/or to allow for the Linkedin follow feature.
visit
allow for the Linkedin follow feature.
li_at
often used to identify you, including your name, interests, and previous activity.
s_plt
Tracks the time that the previous page took to load
lang
Used to remember a user's language setting to ensure LinkedIn.com displays in the language selected by the user in their settings
s_tp
Tracks percent of page viewed
AMCV_14215E3D5995C57C0A495C55%40AdobeOrg
Indicates the start of a session for Adobe Experience Cloud
s_pltp
Provides page name value (URL) for use by Adobe Analytics
s_tslv
Used to retain and fetch time since last visit in Adobe Analytics
li_theme
Remembers a user's display preference/theme setting
li_theme_set
Remembers which users have updated their display / theme preferences
We do not use cookies of this type.
_gcl_au
Used by Google Adsense, to store and track conversions.
SID
Save certain preferences, for example the number of search results per page or activation of the SafeSearch Filter. Adjusts the ads that appear in Google Search.
SAPISID
Save certain preferences, for example the number of search results per page or activation of the SafeSearch Filter. Adjusts the ads that appear in Google Search.
__Secure-#
Save certain preferences, for example the number of search results per page or activation of the SafeSearch Filter. Adjusts the ads that appear in Google Search.
APISID
Save certain preferences, for example the number of search results per page or activation of the SafeSearch Filter. Adjusts the ads that appear in Google Search.
SSID
Save certain preferences, for example the number of search results per page or activation of the SafeSearch Filter. Adjusts the ads that appear in Google Search.
HSID
Save certain preferences, for example the number of search results per page or activation of the SafeSearch Filter. Adjusts the ads that appear in Google Search.
DV
These cookies are used for the purpose of targeted advertising.
NID
These cookies are used for the purpose of targeted advertising.
1P_JAR
These cookies are used to gather website statistics, and track conversion rates.
OTZ
Aggregate analysis of website visitors
_fbp
This cookie is set by Facebook to deliver advertisements when they are on Facebook or a digital platform powered by Facebook advertising after visiting this website.
fr
Contains a unique browser and user ID, used for targeted advertising.
bscookie
Used by LinkedIn to track the use of embedded services.
lidc
Used by LinkedIn for tracking the use of embedded services.
bcookie
Used by LinkedIn to track the use of embedded services.
aam_uuid
Use these cookies to assign a unique ID when users visit a website.
UserMatchHistory
These cookies are set by LinkedIn for advertising purposes, including: tracking visitors so that more relevant ads can be presented, allowing users to use the 'Apply with LinkedIn' or the 'Sign-in with LinkedIn' functions, collecting information about how visitors use the site, etc.
li_sugr
Used to make a probabilistic match of a user's identity outside the Designated Countries
MR
Used to collect information for analytics purposes.
ANONCHK
Used to store session ID for a users session to ensure that clicks from adverts on the Bing search engine are verified for reporting purposes and for personalisation
We do not use cookies of this type.
Cookie declaration last updated on 24/03/2023 by Analytics Vidhya.
Cookies are small text files that can be used by websites to make a user's experience more efficient. The law states that we can store cookies on your device if they are strictly necessary for the operation of this site. For all other types of cookies, we need your permission. This site uses different types of cookies. Some cookies are placed by third-party services that appear on our pages. Learn more about who we are, how you can contact us, and how we process personal data in our Privacy Policy.