The fields of generative AI (GenAI) and agentic AI are transforming everything from creative content generation to autonomous decision-making. At the heart of these innovations lie vast open-source datasets that fuel model training, testing, and deployment. In this article, we present a curated list of the top open-source datasets for generative and agentic AI that you can use to train your models. These span multiple modalities – from extensive collections of text and richly annotated images to specialized resources for building intelligent agents and solving complex reasoning tasks.
The Pile is an extensive, diverse dataset comprising roughly 800GB of text drawn from sources like ArXiv, GitHub, Wikipedia, and more. It has been meticulously compiled to offer a wide spectrum of writing styles and subject matter, making it ideal for training large-scale language models. Researchers and developers leverage The Pile to improve natural language understanding and generation by exposing models to a broad contextual landscape.
Best For: Training large-scale language models.
Link: EleutherAI – The Pile
Common Crawl aggregates billions of web pages scraped on a monthly basis, offering a true web-scale dataset. Its vast collection captures diverse content from across the internet, making it a foundational resource for training robust language models. The dataset is invaluable for tasks ranging from language modeling to large-scale information retrieval due to its comprehensive and continuously updated nature.
Best For: Web-scale language models and content analysis.
Link: Common Crawl
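Common Crawl distributes its crawls as WARC (Web ARChive) files. The sketch below parses a single WARC record header from raw text to show the format's shape; the sample record is illustrative, and a production reader should use a dedicated library such as `warcio` rather than hand-rolled parsing.

```python
# A minimal, illustrative WARC header parser. Real Common Crawl WARC files
# are gzipped and contain many records; this only shows the header layout.
RAW_RECORD = (
    "WARC/1.0\r\n"
    "WARC-Type: response\r\n"
    "WARC-Target-URI: https://example.com/\r\n"
    "WARC-Date: 2024-01-15T00:00:00Z\r\n"
    "Content-Length: 1234\r\n"
    "\r\n"
)

def parse_warc_header(raw: str) -> dict:
    """Split a WARC header block into a version string and a field dict."""
    lines = raw.split("\r\n")
    version = lines[0]                 # e.g. "WARC/1.0"
    fields = {}
    for line in lines[1:]:
        if not line:                   # blank line terminates the header
            break
        key, _, value = line.partition(": ")
        fields[key] = value
    return {"version": version, **fields}

record = parse_warc_header(RAW_RECORD)
print(record["WARC-Target-URI"])  # https://example.com/
```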
WikiText is an open-source language modeling dataset derived from high-quality Wikipedia articles. It retains the rich structure and linguistic complexity of editorial content, giving models a challenging environment for learning long-range dependencies. Unlike many preprocessed corpora, it also keeps the original case, punctuation, and numbers, resulting in a far larger vocabulary. WikiText-2 is over 2 times larger than the Penn Treebank, and WikiText-103 is over 110 times larger.
Best For: Long-range context modeling and text prediction.
Link: WikiText on Hugging Face
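Because WikiText preserves case, punctuation, and numbers, a vocabulary built from it distinguishes tokens that lowercased, stripped corpora would merge. A toy illustration of that effect, using a made-up two-sentence sample rather than the actual dataset:

```python
from collections import Counter

# Illustrative sample; WikiText itself keeps case, punctuation, and numbers,
# so "the" vs "The", digits, and "." all survive as distinct vocabulary items.
sample = "Paris is the capital of France . In 1889 , the Eiffel Tower opened ."

tokens = sample.split()
vocab = Counter(tokens)

# Case is preserved: "the" appears twice, "The" never appears at all.
print(vocab["the"], vocab["The"])      # 2 0
print("1889" in vocab, vocab["."])     # True 2
```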
OpenWebText is an open-source effort to recreate the WebText dataset originally used by OpenAI for language modeling. Compiled from web pages linked on Reddit, it provides a diverse collection of high-quality internet text. This dataset is especially valuable for training models that require a broad spectrum of language styles and contemporary online discourse, making it ideal for research in large-scale text generation.
Best For: Web-based text generation and summarization.
Link: OpenWebText on GitHub
LAION-5B is an enormous dataset containing 5.85 billion image-text pairs, providing an unprecedented resource for multimodal AI. Its scale and diversity support the training of cutting-edge text-to-image models such as Stable Diffusion. The integration of visual and textual data allows researchers to build systems that effectively translate language into visual content.
Best For: Training multimodal AI and text-to-image models.
Link: LAION-5B
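LAION's metadata includes a CLIP image-text similarity score per pair, which practitioners commonly threshold to keep only well-aligned pairs before training. The sketch below shows that filtering step over an in-memory sample; the field names (`url`, `caption`, `similarity`) are illustrative stand-ins, so check the official parquet schema for the exact column names.

```python
# Illustrative rows; field names are assumptions, not the official schema.
pairs = [
    {"url": "a.jpg", "caption": "a photo of a dog", "similarity": 0.34},
    {"url": "b.jpg", "caption": "page not found",   "similarity": 0.12},
    {"url": "c.jpg", "caption": "a red bicycle",    "similarity": 0.29},
]

def filter_by_similarity(rows, threshold=0.28):
    """Keep only image-text pairs whose CLIP similarity clears the threshold."""
    return [r for r in rows if r["similarity"] >= threshold]

kept = filter_by_similarity(pairs)
print([r["url"] for r in kept])  # ['a.jpg', 'c.jpg']
```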
MS COCO offers a rich collection of images accompanied by detailed annotations for object detection, segmentation, and captioning. The dataset’s complexity challenges models to understand and generate comprehensive descriptions of visual scenes. It is widely used in both academic and industrial settings to drive advancements in image understanding and generation.
Best For: Object detection and image captioning.
Link: MS COCO
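COCO annotations ship as a single JSON file whose parallel `images`, `annotations`, and `categories` lists are linked by IDs, with bounding boxes stored as `[x, y, width, height]`. The sketch below parses a tiny in-memory sample in that shape and groups the boxes by image:

```python
import json
from collections import defaultdict

# A tiny in-memory sample mirroring the COCO annotation JSON layout.
coco = json.loads("""
{
  "images": [{"id": 1, "file_name": "000001.jpg"}],
  "annotations": [
    {"id": 10, "image_id": 1, "category_id": 18, "bbox": [10.0, 20.0, 50.0, 40.0]},
    {"id": 11, "image_id": 1, "category_id": 1,  "bbox": [5.0, 5.0, 30.0, 80.0]}
  ],
  "categories": [{"id": 18, "name": "dog"}, {"id": 1, "name": "person"}]
}
""")

# Resolve category IDs to names, then group annotations by image.
names = {c["id"]: c["name"] for c in coco["categories"]}
by_image = defaultdict(list)
for ann in coco["annotations"]:
    by_image[ann["image_id"]].append((names[ann["category_id"]], ann["bbox"]))

print(by_image[1][0][0])  # dog
```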
The Open Images Dataset is a large-scale, community-driven collection of images annotated with labels, bounding boxes, and segmentation masks. Its extensive coverage and diverse content make it ideal for training general-purpose image generation and recognition models. The dataset supports innovative applications in computer vision by providing detailed visual context across numerous object categories. The V7 version of the dataset has dense annotations for over 1.9M images and labels for over 9M images.
Best For: Image recognition and segmentation research.
Link: Open Images Dataset
RedPajama‑1T is an open-source reproduction of LLaMA’s pretraining dataset, consisting of 1.2 trillion tokens from CommonCrawl, Wikipedia, Books, GitHub, arXiv, C4, and StackExchange. It applies filtering techniques, such as CCNet for web data, to enhance quality. The dataset is fully transparent, with all preprocessing scripts available for reproducibility.
Best For: Large-scale LLM pretraining and dataset curation.
Link: RedPajama-1T
RedPajama‑V2 refines the 1T dataset by focusing on web data, sourced from 84 CommonCrawl snapshots, totaling over 100B text documents. It includes English, French, German, Spanish, and Italian, with 40+ quality annotations for filtering and optimization. This enables dynamic dataset curation for tailored pretraining.
Best For: Multilingual LLM development and dataset filtering.
Link: RedPajama‑V2
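RedPajama-V2's quality annotations are designed to be combined into user-defined filters, so each team can curate its own pretraining subset. The sketch below shows that pattern with a simple rule over two signals; the signal names (`word_count`, `perplexity`) are simplified stand-ins, not the dataset's actual signal keys, which are listed on the official dataset card.

```python
# Illustrative documents carrying simplified quality signals; the real
# RedPajama-V2 schema uses different signal names.
docs = [
    {"text": "A well-formed paragraph about physics.", "word_count": 5, "perplexity": 120.0},
    {"text": "buy now click here buy now",             "word_count": 6, "perplexity": 900.0},
]

def keep(doc, max_ppl=300.0, min_words=5):
    """A simple rule-based filter combining two quality signals."""
    return doc["word_count"] >= min_words and doc["perplexity"] <= max_ppl

curated = [d for d in docs if keep(d)]
print(len(curated))  # 1
```

Because the signals are stored alongside the text rather than baked into a fixed split, the same corpus supports many different curation policies.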
The OpenAI WebGPT Dataset is tailored for training AI agents that interact dynamically with the web. It contains human-annotated data capturing real-world web browsing interactions, which are essential for developing retrieval-augmented generation systems. This resource empowers AI models to understand, navigate, and generate context-aware responses based on live web data.
Best For: Training AI for web browsing and retrieval.
Link: OpenAI WebGPT Dataset
The Obsidian Agent Dataset is a synthetic collection designed to simulate environments for autonomous decision-making. It focuses on agent-based reasoning and equips models with scenarios that test complex planning and decision-making skills. This dataset is pivotal for researchers developing AI agents that must operate autonomously in unpredictable settings.
Best For: AI decision-making and planning simulations.
Link: Obsidian Agent Dataset
The WebShop Dataset is designed specifically for AI agents operating within the e-commerce domain. It features detailed product descriptions, user interaction logs, and browsing patterns that mimic real-world online shopping behavior. This dataset is ideal for developing intelligent agents capable of product research, recommendation, and automated purchase decision-making.
Best For: E-commerce AI and product search optimization.
Link: WebShop Dataset
The Meta EAI Dataset is curated for training AI agents that interact with virtual and real-world environments. It provides detailed simulation scenarios that support the development of embodied AI—particularly for robotics and household task planning. By incorporating realistic interactive challenges, the dataset helps models learn effective planning and execution in dynamic environments.
Best For: Training AI for real-world robotics.
Link: Meta EAI Dataset
MuJoCo is a physics engine renowned for creating highly realistic simulations of physical interactions, particularly in robotics. It offers detailed, physics-based environments that enable AI models to learn complex motion and control tasks. This simulation platform is critical for researchers developing models that require an accurate representation of real-world dynamics.
Best For: Simulating robotic control and physics-based AI.
Link: MuJoCo
Robotics datasets capture real-world sensor data and robot interactions, making them indispensable for embodied AI research. They offer rich, contextual information from varied robotic applications, ranging from industrial automation to service robots. These datasets enable the training of models that can navigate complex, physical environments with high reliability.
Best For: AI for robotic interactions and control.
Link: Robotics Datasets
The Atari Games suite, commonly accessed through the Arcade Learning Environment, is a classic benchmark for reinforcement learning algorithms. It provides a collection of game environments that challenge AI models with sequential decision-making tasks. This benchmark remains a popular tool for testing and advancing AI performance in diverse, dynamic scenarios.
Best For: Benchmarking reinforcement learning in gaming.
Link: Atari Games
Web-crawled interactions consist of large-scale user behavior data extracted from various online platforms. They capture authentic human interaction patterns and engagement metrics, offering valuable insights for training interactive agents. These datasets are particularly useful for developing AI that can understand and predict real-world user behavior on the web.
Best For: Training interactive agents and recommendation AI.
Link: Web-crawled Interactions
The AI2 ARC Dataset is a collection of challenging multiple-choice questions designed to assess an AI’s commonsense reasoning and problem-solving abilities. Its questions span a variety of topics and difficulty levels, making it a rigorous benchmark for reasoning models. Researchers utilize this dataset to push the boundaries of logical inference and to evaluate the depth of understanding in generative AI systems.
Best For: Commonsense reasoning and logical inference.
Link: AI2 ARC Dataset
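Since ARC items are multiple-choice questions with a labeled answer key, evaluation reduces to exact-match accuracy over the predicted choice labels. The sketch below scores predictions against two mock records; the records are simplified to a label-to-text mapping, whereas the official release stores choices as parallel lists.

```python
# Mock ARC-style records (simplified shape; the real dataset stores choices
# as parallel "text"/"label" lists and uses the "answerKey" field).
examples = [
    {"question": "Which gas do plants absorb?",
     "choices": {"A": "Oxygen", "B": "Carbon dioxide", "C": "Nitrogen", "D": "Helium"},
     "answerKey": "B"},
    {"question": "What is H2O?",
     "choices": {"A": "Salt", "B": "Sugar", "C": "Water", "D": "Sand"},
     "answerKey": "C"},
]

predictions = ["B", "A"]  # one correct, one incorrect

def accuracy(preds, records):
    """Fraction of predicted labels matching the answer key."""
    correct = sum(p == r["answerKey"] for p, r in zip(preds, records))
    return correct / len(records)

print(accuracy(predictions, examples))  # 0.5
```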
Microsoft Machine Reading Comprehension (MS MARCO) is a large-scale dataset curated for tasks such as passage ranking, question answering, and information retrieval. It compiles real-world search queries and relevant passages to train and test retrieval-augmented generation systems. The dataset is instrumental in bridging the gap between information retrieval and generative models, leading to more context-aware search and answer generation.
Best For: Information retrieval and question answering.
Link: MS MARCO
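A common starting point with MS MARCO is ranking candidate passages against a query. The sketch below uses a naive term-overlap scorer over made-up passages to show the ranking step; it is a toy baseline, not the official BM25 or neural baselines distributed with the dataset.

```python
# Toy passage ranking by case-insensitive term overlap with the query.
passages = [
    "The capital of France is Paris .",
    "Bananas are rich in potassium .",
    "Paris hosted the 1900 Summer Olympics .",
]

def overlap_score(query: str, passage: str) -> int:
    """Count distinct query terms that appear in the passage."""
    q = set(query.lower().split())
    p = set(passage.lower().split())
    return len(q & p)

query = "capital of France"
ranked = sorted(passages, key=lambda p: overlap_score(query, p), reverse=True)
print(ranked[0])  # The capital of France is Paris .
```

Real retrieval systems replace this scorer with BM25 or a learned dense retriever, but the query-then-rank loop is the same.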
OpenAI Gym is a standardized toolkit featuring a variety of simulated environments for developing and benchmarking reinforcement learning algorithms. It offers a range of scenarios—from simple control tasks to more complex simulations—ideal for training agentic behavior. Its ease of use and broad community support make it a staple in reinforcement learning research.
Best For: Reinforcement learning and AI agent training.
Link: OpenAI Gym
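The Gym API revolves around `reset()` and `step(action)`, with `step` classically returning `(observation, reward, done, info)`. The toy environment below mimics that interface in pure Python so the interaction loop runs without Gym installed; with Gym, you would construct the environment via `gym.make("CartPole-v1")` instead.

```python
import random

class ToyEnv:
    """A stand-in environment mimicking the classic Gym step/reset API.
    Episodes last 5 steps with reward 1 per step, like CartPole survival."""
    def reset(self):
        self.t = 0
        return self.t                      # observation

    def step(self, action):
        self.t += 1
        done = self.t >= 5
        return self.t, 1.0, done, {}       # obs, reward, done, info

env = ToyEnv()
obs, total_reward, done = env.reset(), 0.0, False
while not done:
    action = random.choice([0, 1])         # random policy
    obs, reward, done, info = env.step(action)
    total_reward += reward

print(total_reward)  # 5.0
```

Note that newer versions of the API (Gymnasium) split `done` into `terminated` and `truncated`, so check which convention your installed version uses.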
Here’s a summary table of the open-source datasets discussed above, with approximate sample counts, file sizes, and developers for each.
| No. | Dataset | Number of Samples | Size (Approx.) | Developer | Best Used For |
|-----|---------|-------------------|----------------|-----------|---------------|
| 1 | The Pile | Millions of documents (aggregated from 22 sub-datasets) | ~825 GB | EleutherAI | Training large-scale language models. |
| 2 | Common Crawl | ~2.5 billion web pages | ~60 TB (raw data) | Common Crawl Foundation | Web-scale language models and content analysis. |
| 3 | WikiText | ~28,475 articles | ~500 MB | Salesforce Research | Long-range context modeling and text prediction. |
| 4 | OpenWebText | ~8 million documents | ~38 GB | Open-source community | Web-based text generation and summarization. |
| 5 | LAION-5B | 5.85 billion image-text pairs | ~5 TB | LAION | Training multimodal AI and text-to-image models. |
| 6 | MS COCO | ~330,000 images | ~25 GB | Microsoft | Object detection and image captioning. |
| 7 | Open Images | ~9 million images | ~600 GB | Google | Image recognition and segmentation research. |
| 8 | RedPajama‑1T | 1.2 trillion tokens (aggregated from diverse sources) | ~1 TB | Together (RedPajama) | Large-scale LLM pretraining and dataset curation. |
| 9 | RedPajama‑V2 | Over 100 billion text documents | ~200 GB | Together (RedPajama) | Multilingual LLM development and dataset filtering. |
| 10 | OpenAI WebGPT Dataset | ~10,000 annotated web browsing sessions | ~10 GB | OpenAI | Training AI for web browsing and retrieval. |
| 11 | Obsidian Agent Dataset | 100,000 simulated scenarios | ~5 GB | Obsidian Labs | AI decision-making and planning simulations. |
| 12 | WebShop Dataset | 1 million product interactions | ~20 GB | WebShop Open-Source | E-commerce AI and product search optimization. |
| 13 | Meta EAI Dataset | 10,000 simulation scenarios | ~50 GB | Meta | Training AI for real-world robotics. |
| 14 | MuJoCo | Thousands of simulation episodes | ~1 GB | Roboti LLC / DeepMind | Simulating robotic control and physics-based AI. |
| 15 | Robotics Datasets | Aggregated from various sources (thousands of sensor recordings) | ~100 GB (aggregate) | Various Research Groups | AI for robotic interactions and control. |
| 16 | Atari Games | ~10 million game frames | ~10 GB | Various Academic Sources | Benchmarking reinforcement learning in gaming. |
| 17 | Web-crawled Interactions | Billions of user interaction logs | ~500 GB | Various Research Institutions | Training interactive agents and recommendation AI. |
| 18 | AI2 ARC | 7,787 multiple-choice questions | ~100 MB | Allen Institute for AI | Commonsense reasoning and logical inference. |
| 19 | MS MARCO | Over 1 million passages | ~100 GB | Microsoft | Information retrieval and question answering. |
| 20 | OpenAI Gym | 70+ simulated environments | N/A | OpenAI | Reinforcement learning and AI agent training. |
Note: The number of samples and size of datasets can vary based on the version and preprocessing applied. Please refer to the official documentation via the provided download links for the latest and most precise information.
The open-source datasets highlighted above provide a robust foundation for developing cutting-edge generative and agentic AI systems. Whether you’re working on natural language processing, computer vision, autonomous decision-making, or advanced reasoning, these resources offer the depth and diversity needed to drive innovation. By leveraging these datasets, researchers and developers can accelerate breakthroughs, refine model performance, and explore new frontiers in artificial intelligence.
Q. What are open-source datasets?
A. Open-source datasets are publicly available collections of data that anyone can use for research, development, and training AI models. They enable transparency and collaboration in the AI community by providing free access to high-quality data.
Q. Why are open-source datasets important for generative and agentic AI?
A. They provide the diverse and large-scale data required to train sophisticated models, enhancing their ability to generate creative content and make autonomous decisions. This democratizes AI development, allowing both academic and commercial projects to innovate without prohibitive costs.
Q. Which are the best open-source datasets for text and language data?
A. The Pile, Common Crawl, WikiText, OpenWebText, and IMDB Reviews are some of the best open-source datasets for text and language data. These datasets help in training large-scale language models, enhancing natural language understanding, and fine-tuning domain-specific applications.
Q. Which open-source datasets are best for image data?
A. Open-source image datasets like LAION-5B, ImageNet, MS COCO, Open Images, and CelebA are great options. These datasets are essential for tasks like image classification, object recognition, and text-to-image generation, powering advances in computer vision.
Q. What are agentic AI datasets?
A. Agentic AI datasets, such as RedPajama‑1T, the OpenAI WebGPT Dataset, and the Obsidian Agent Dataset, provide data for training models to perform autonomous decision-making and reasoning tasks. They are pivotal for developing AI agents that can navigate and interact within complex environments.
Q. Where can I access these datasets?
A. Most of these datasets are available through public repositories and official project pages, such as GitHub or Hugging Face. The article includes direct links, so you can download and experiment with the data under open-source licenses.