20 Most Liked Datasets on HuggingFace

Nitika Sharma Last Updated : 27 Dec, 2024

Hugging Face recently released its list of the most liked datasets, each contributing significantly to advancements in AI. These datasets serve diverse purposes, ranging from instruction following to multimodal understanding, and are widely adopted across AI applications. Below is an overview of these datasets. Note that although the list collects the most liked datasets, the entries are ordered by number of downloads, not by likes.
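
Because the list is described as "most liked" but ordered by downloads, the two rankings differ. The minimal Python sketch below re-sorts a subset of the entries using only the like and download counts quoted in this article. The `Entry` tuple and `rank_by` helper are illustrative names, not part of any Hugging Face API.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    name: str
    likes: int
    downloads: int


# Counts copied from this article (a subset of the 20 entries).
ENTRIES = [
    Entry("FineWeb-Edu", 573, 318_907),
    Entry("TxT360", 217, 102_124),
    Entry("FineWeb 2", 363, 88_657),
    Entry("Cosmopedia", 570, 20_840),
    Entry("Infinity Instruct", 574, 5_284),
    Entry("PersonaHub", 475, 3_846),
]


def rank_by(entries, field):
    """Return dataset names sorted by the given numeric field, highest first."""
    ordered = sorted(entries, key=lambda e: getattr(e, field), reverse=True)
    return [e.name for e in ordered]


by_downloads = rank_by(ENTRIES, "downloads")
by_likes = rank_by(ENTRIES, "likes")

if __name__ == "__main__":
    print("By downloads:", by_downloads)
    print("By likes:    ", by_likes)
```

Sorted by likes, Infinity Instruct (574) edges ahead of FineWeb-Edu (573), even though it sits at position 11 in the download-ordered list.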

HuggingFace Datasets

1. FineWeb-Edu by HuggingFaceFW

Likes: 573 | Downloads: 318,907

  • Key Features: Filters high-quality educational web content using an educational classifier developed with annotations scored by Llama3-70B-Instruct. The classifier prioritizes grade-school to middle-school knowledge while retaining some high-level content. This keeps the dataset focused on educational material, balancing technical depth with accessibility.
  • Use Cases: Powers e-learning platforms, enhances course recommendations, and supports educational chatbots. Known for enabling personalized learning pathways and improving real-time problem-solving capabilities in academic contexts.
  • Highlight: Provides premium, educationally rich materials curated for advanced academic and training models.

Click here to access this dataset. 
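
The classifier-based filtering described above can be pictured as a simple threshold on a per-document score. The sketch below is illustrative only: the `score` field name, the 0 to 5 scale, and the cutoff of 3 are assumptions made for the example, not details stated in this article, and the sample records are invented.

```python
def keep_educational(records, min_score=3.0):
    """Keep records whose (assumed) educational score meets the threshold."""
    return [r for r in records if r.get("score", 0.0) >= min_score]


# Hypothetical records; real rows would come from the dataset itself.
sample = [
    {"text": "Photosynthesis converts light into chemical energy.", "score": 3.8},
    {"text": "Buy cheap watches now!!!", "score": 0.4},
    {"text": "A fraction has a numerator and a denominator.", "score": 3.1},
]

kept = keep_educational(sample)

if __name__ == "__main__":
    for record in kept:
        print(record["score"], record["text"])
```

Raising `min_score` trades corpus size for stricter quality; the right cutoff depends on the downstream model and is a design choice rather than a fixed rule.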

2. TxT360 by LLM360

Likes: 217 | Downloads: 102,124

  • Key Features: Filters 99 Common Crawl snapshots for LLM pretraining, emphasizing data quality with advanced deduplication techniques. Incorporates curated and web-based datasets to create a 15T+ token corpus.
  • Use Cases: Supports web-based content generation, SEO optimization, and general-purpose NLP tasks. Facilitates diverse applications, including LLM fine-tuning.
  • Highlight: Offers a scalable pipeline, enhancing data quality for challenging downstream tasks.

Click here to access this dataset.
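
Deduplication can take many forms. As a deliberately simple illustration, and not a description of the TxT360 pipeline itself, the sketch below drops exact duplicates after normalizing case and whitespace, using a SHA-256 hash of each document. The function names are hypothetical.

```python
import hashlib


def normalize(text):
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return " ".join(text.lower().split())


def dedupe(docs):
    """Yield each document the first time its normalized hash is seen."""
    seen = set()
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield doc


docs = ["Hello  World", "hello world", "Something else", "Hello World\n"]
unique = list(dedupe(docs))

if __name__ == "__main__":
    print(unique)
```

Large-scale pipelines typically also need near-duplicate detection (for example, MinHash), which this sketch omits.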

3. FineWeb 2 by HuggingFaceFW

Likes: 363 | Downloads: 88,657

  • Key Features: A multilingual dataset supporting over 1,000 languages and scripts. Built on 96 Common Crawl snapshots spanning 2013 to 2024, it processes 8 terabytes of text data—approximately 3 trillion words.
  • Use Cases: Enhances NLP applications for multilingual models and underrepresented languages. Ideal for research requiring clean, high-quality data.
  • Highlight: Advances global NLP inclusivity with transparent and scalable methodology.

Click here to check out this dataset on HuggingFace.
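
Corpora of this size are rarely downloaded in full just to inspect them. The sketch below shows one way to preview a few rows with the `datasets` library in streaming mode. The repository ID and the language config name in the commented example are assumptions and should be checked against the dataset card; `first_n` and `stream_preview` are hypothetical helper names.

```python
from itertools import islice


def first_n(iterable, n):
    """Return the first n items of any iterable as a list."""
    return list(islice(iterable, n))


def stream_preview(repo_id, config=None, split="train", n=3):
    """Stream a few rows from a Hub dataset without downloading it fully.

    Requires the `datasets` package and network access. The import is lazy
    so that `first_n` stays usable when the package is not installed.
    """
    from datasets import load_dataset

    ds = load_dataset(repo_id, name=config, split=split, streaming=True)
    return first_n(ds, n)


# Example call (repo ID and config name are assumptions, not verified here):
# rows = stream_preview("HuggingFaceFW/fineweb-2", config="hrv_Latn")
# for row in rows:
#     print(row)
```

Streaming returns an iterable rather than an indexable table, which is why the preview uses `islice` instead of slicing.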

4. Common Corpus by PleIAs

Likes: 196 | Downloads: 24,844

  • Key Features: Comprising over 2 trillion tokens from diverse sources, this multilingual dataset emphasizes high-quality and ethical standards through toxicity filtering and content curation.
  • Use Cases: Suited to pretraining language models, whether GPT-style or BERT-style, for tasks such as summarization, translation, and sentiment analysis.
  • Highlight: Benchmark resource for robust, generalized AI model development.

You can explore this dataset here.

5. Cosmopedia by HuggingFaceTB

Likes: 570 | Downloads: 20,840

  • Key Features: A synthetic dataset of 30 million samples generated by Mixtral-8x7B-Instruct-v0.1. It includes educational resources, blog posts, and synthetic instruction datasets.
  • Use Cases: Supports academic learning, creative writing, and commonsense reasoning.
  • Highlight: Pioneers scalable synthetic data generation with refined prompts and decontamination pipelines.

Click here to access this dataset. 

6. HelpSteer2 by Nvidia

Likes: 390 | Downloads: 13,799

  • Key Features: Contains 21,000 samples with detailed annotations, focusing on helpfulness and correctness. Used for preference-based training models.
  • Use Cases: Ideal for customer service bots and content moderation systems.
  • Highlight: Achieved top scores across major benchmarks like RewardBench and AlpacaEval.

Click here to access this dataset on HuggingFace. 
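
Preference-based training usually needs pairs of responses to the same prompt, one preferred over the other. The sketch below shows one way such pairs could be derived from attribute-scored samples. The field names (`prompt`, `response`, `helpfulness`) and the rule of ranking by helpfulness alone are assumptions for illustration, and the rows are invented.

```python
from collections import defaultdict


def build_pairs(rows, key="helpfulness"):
    """Group rows by prompt and emit one chosen/rejected pair per prompt."""
    by_prompt = defaultdict(list)
    for row in rows:
        by_prompt[row["prompt"]].append(row)

    pairs = []
    for prompt, group in by_prompt.items():
        if len(group) < 2:
            continue  # need at least two responses to compare
        ranked = sorted(group, key=lambda r: r[key], reverse=True)
        best, worst = ranked[0], ranked[-1]
        if best[key] == worst[key]:
            continue  # a tie carries no preference signal
        pairs.append(
            {"prompt": prompt, "chosen": best["response"], "rejected": worst["response"]}
        )
    return pairs


# Hypothetical rows for illustration only.
rows = [
    {"prompt": "Define entropy.", "response": "A measure of uncertainty.", "helpfulness": 4},
    {"prompt": "Define entropy.", "response": "It is a thing.", "helpfulness": 1},
    {"prompt": "What is 2+2?", "response": "4", "helpfulness": 3},
]
pairs = build_pairs(rows)

if __name__ == "__main__":
    print(pairs)
```

Prompts with a single response or tied scores are skipped, since they give a preference model nothing to learn from.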

7. Orca-AgentInstruct-1M-v1 by Microsoft

Likes: 404 | Downloads: 12,877

  • Key Features: Contains 1 million synthetically generated instruction pairs. Covers text editing, coding, and comprehension tasks.
  • Use Cases: Enhances LLM instruction tuning and conversational agent training.
  • Highlight: Significant improvements in benchmarks for reasoning and factual correctness.

Click here to check out this dataset.

8. SmolTalk by HuggingFaceTB

Likes: 260 | Downloads: 11,523

  • Key Features: A synthetic dataset for supervised fine-tuning, covering mathematics, coding, and summarization tasks.
  • Use Cases: Powers AI tutors, coding assistants, and reasoning bots.
  • Highlight: Enhances task-specific performance and reasoning capabilities.

Check out this HuggingFace dataset here.

9. FinePersonas by Argilla

Likes: 363 | Downloads: 6,853

  • Key Features: Provides 21 million detailed personas generated for diverse and controllable synthetic text generation, specifically designed to enhance reasoning and creative writing. These personas are grounded in high-quality educational content, primarily derived from the HuggingFaceFW/FineWeb-Edu dataset, with a strong bias toward education and science domains.
  • Use Cases: Ideal for creative storytelling, role-playing games, brand persona development tools, and LLM fine-tuning. This dataset allows researchers to integrate domain-specific attributes into AI models, enabling the generation of nuanced, targeted content.
  • Highlight: Facilitates the creation of rich, diverse, and context-specific synthetic outputs while minimizing the complexity of crafting detailed attributes manually.

Click here to check out this dataset.

10. FineVideo by HuggingFaceFV

Likes: 283 | Downloads: 5,434

  • Key Features: Designed for video understanding, focusing on mood analysis, storytelling, and editing.
  • Use Cases: Enhances video summarization, analytics, and narrative-driven AI tools.
  • Highlight: Powers cutting-edge multimodal research in video content analysis.

Click here to check out this HuggingFace dataset.

11. Infinity Instruct by Beijing Academy of Artificial Intelligence (BAAI)

Likes: 574 | Downloads: 5,284

  • Key Features: Offers a large-scale instruction dataset optimizing task-specific AI models for reasoning, coding, and more.
  • Use Cases: Trains task-specific AI systems and improves instruction-following in open-source models.
  • Highlight: Provides high-quality datasets advancing open-source AI capabilities.

Click here to check out this dataset.

12. PersonaHub by proj-persona

Likes: 475 | Downloads: 3,846

  • Key Features: Offers 1 billion personas curated for synthetic data generation. Supports storytelling and game design.
  • Use Cases: Extensively applied in interactive storytelling and personalized marketing tools.
  • Highlight: Facilitates diverse, context-specific character interactions.

Click here to check out this dataset.

13. Two-Million-Bluesky-Posts by Alpin Dale

Likes: 193 | Downloads: 3,155

  • Key Features: Comprises 2 million public posts from Bluesky Social’s API, enriched with metadata and language labels.
  • Use Cases: Supports NLP tasks, conversational AI, and social media research.
  • Highlight: Explores linguistic trends and community interactions.

Click here to check out this dataset.

14. xlam-function-calling-60k by Salesforce

Likes: 395 | Downloads: 2,567

  • Key Features: Focused on function-calling applications, this dataset ensures correctness with over 95% passing human evaluation. It includes diverse API function calls across 21 categories.
  • Use Cases: Trains AI models for API interactions, enhances coding assistants, and develops task-specific agents.
  • Highlight: Achieved 88.24% accuracy on the Berkeley Function-Calling Leaderboard.

Click here to check out this dataset.
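
A function-calling sample typically pairs a user query with the available tool definitions and the expected calls. The sketch below checks that a call names a known tool and supplies its required arguments. The record layout (`tools` entries with `name` and `parameters`, calls with `name` and `arguments`) and the `required` flag are assumptions for illustration, and the sample is invented.

```python
import json


def validate_call(call, tools):
    """Return a list of problems found in one function call (empty if valid)."""
    specs = {t["name"]: t for t in tools}
    spec = specs.get(call["name"])
    if spec is None:
        return [f"unknown function: {call['name']}"]

    problems = []
    params = spec.get("parameters", {})
    args = call.get("arguments", {})
    for name, meta in params.items():
        if meta.get("required", False) and name not in args:
            problems.append(f"missing required argument: {name}")
    for name in args:
        if name not in params:
            problems.append(f"unexpected argument: {name}")
    return problems


# Hypothetical sample, parsed from JSON strings.
tools = json.loads(
    '[{"name": "get_weather", "parameters": {"city": {"type": "str", "required": true}}}]'
)
good = json.loads('{"name": "get_weather", "arguments": {"city": "Paris"}}')
bad = json.loads('{"name": "get_weather", "arguments": {"town": "Paris"}}')

good_problems = validate_call(good, tools)
bad_problems = validate_call(bad, tools)

if __name__ == "__main__":
    print(good_problems)
    print(bad_problems)
```

A structural check like this catches malformed calls only; whether the call actually answers the query still needs execution or human review.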

15. OpenO1-SFT by O1-OPEN

Likes: 271 | Downloads: 2,171

  • Key Features: Supports Supervised Fine-Tuning (SFT) for Chain-of-Thought (CoT) reasoning. Includes structured responses for coherent reasoning sequences.
  • Use Cases: Enhances reasoning in AI tutoring, educational tools, and advanced question answering.
  • Highlight: Improves self-consistency and accuracy in reasoning tasks.

Click here to access this dataset. 
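
Structured chain-of-thought responses are easiest to use when the reasoning and the final answer can be separated. The sketch below assumes a response format with `<Thought>` and `<Output>` tags. That tag scheme is an assumption for illustration and should be checked against the dataset card; the example string is invented.

```python
import re

THOUGHT_RE = re.compile(r"<Thought>(.*?)</Thought>", re.DOTALL)
OUTPUT_RE = re.compile(r"<Output>(.*?)</Output>", re.DOTALL)


def split_response(text):
    """Split a tagged response into (reasoning, answer); missing parts become ''."""
    thought = THOUGHT_RE.search(text)
    output = OUTPUT_RE.search(text)
    return (
        thought.group(1).strip() if thought else "",
        output.group(1).strip() if output else "",
    )


# Hypothetical response in the assumed tag format.
example = "<Thought>\n12 * 3 = 36, then 36 + 4 = 40.\n</Thought>\n<Output>\n40\n</Output>"
reasoning, answer = split_response(example)

if __name__ == "__main__":
    print("Reasoning:", reasoning)
    print("Answer:", answer)
```

Separating the two parts makes it possible to score only the final answer while keeping the reasoning available for inspection.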

16. MMMLU by OpenAI

Likes: 438 | Downloads: 1,761

  • Key Features: Covers 57 topics translated into 14 languages with high accuracy, particularly for low-resource languages.
  • Use Cases: Benchmarks multilingual AI models for global applications and cross-lingual understanding.
  • Highlight: Sets a high standard for language comprehension and accessibility.

Click here to check out this dataset.

17. FRAMES by Google

Likes: 176 | Downloads: 1,757

  • Key Features: A Retrieval-Augmented Generation (RAG) evaluation dataset with 824 multi-hop questions and diverse reasoning types.
  • Use Cases: Benchmarks search engines, trains knowledge graphs, and refines Q&A systems.
  • Highlight: Tests multi-step retrieval and temporal reasoning strategies.

Click here to access this dataset. 

18. Reasoning-Base-20k by KingNish

Likes: 194 | Downloads: 1,581

  • Key Features: Includes step-by-step explanations for reasoning tasks, enhancing models’ logical problem-solving abilities.
  • Use Cases: Widely used for educational apps, logical reasoning bots, and science or math tutors.
  • Highlight: Improves reasoning accuracy and detailed response quality.

Click here to check out this dataset.

19. arXiver by Neuralwork

Likes: 355 | Downloads: 790

  • Key Features: Consists of 63,357 arXiv papers in multi-markdown format, curated for semantic search and summarization.
  • Use Cases: Enhances academic tools, scientific Q&A systems, and scholarly summarization.
  • Highlight: Streamlines technical content integration for research-oriented AI applications.

Click here to check out this HuggingFace dataset.

20. LLaVA-CoT-o1-Instruct by 5CD-AI

Likes: 64 | Downloads: 598

  • Key Features: Enables Chain-of-Thought reasoning in vision-language models with multimodal sequences and explanations.
  • Use Cases: Ideal for e-learning, interactive AI tools, and multimodal reasoning research.
  • Highlight: Integrates structured outputs for complex decision-making tasks.

Click here to access this dataset. 

Conclusion

This comprehensive collection of cutting-edge datasets empowers researchers and developers to advance AI across diverse domains. From reasoning models to multilingual corpora, each dataset brings unique value to the community. Which of these datasets stands out as your favorite? How do you plan to use them in your projects? Let us know your thoughts in the comment section below.

For more content like this, stay tuned to the Analytics Vidhya blog!

Hello, I am Nitika, a tech-savvy Content Creator and Marketer. Creativity and learning new things come naturally to me. I have expertise in creating result-driven content strategies. I am well versed in SEO Management, Keyword Operations, Web Content Writing, Communication, Content Strategy, Editing, and Writing.
