10 Datasets by INDIAai for Data Science Projects

Pankaj Singh Last Updated : 03 Mar, 2025

India ranks fifth globally in AI investment, with its AI market projected to grow by 28.63% (2024-2030), reaching $28.36 billion by 2030. This growth reflects India’s commitment to advancing AI through initiatives like INDIAai, a knowledge portal, research organization, and ecosystem-building platform. INDIAai fosters collaboration within India’s AI ecosystem and provides resources for it, including datasets for data science projects. For students and researchers, it offers curated datasets across several domains. This article introduces 10 of those datasets that are particularly useful for aspiring data scientists.


Overview of 10 Datasets

The 10 datasets curated by INDIAai draw on different sources and span multiple domains and use cases. They are curated and accessible to researchers, practitioners, and enthusiasts alike. Whether you are interested in natural language processing, computer vision, healthcare analytics, or socioeconomic research, these datasets give you material to explore.

Datasets by INDIAai for Your Data Science Projects

Here are datasets by INDIAai for your data science projects:

Global Youth Tobacco Survey (GYTS-4)

The International Institute for Population Sciences (IIPS), operating under the Ministry of Health and Family Welfare, conducted the Global Youth Tobacco Survey (GYTS-4) in 2019. The survey assessed tobacco use among schoolchildren aged 13-15 across states and union territories (UTs). It breaks the results down by gender, school location (rural or urban), and school administration type (public or private), so that consumption patterns can be compared across these groups.

Download Link: Global Youth Tobacco Survey (GYTS-4)

National Financial and Economic Data

The Department of Economic Affairs compiles national financial and economic data. The collection covers external debt, central government borrowing, monthly economic reports, and national summary data pages, and it can support analysis and planning at both macro and micro levels.

Download Link: National Financial and Economic Data

Indian Census Data

The census digital library offers census tables, reports, and other digital files spanning 1991 to 2011, all available for download in digital format. Researchers and policymakers can use the collection to study demographic trends, carry out historical research, or build data-driven solutions.

Download Link: Indian Census Data

Herbarium Dataset of the Wildlife Institute of India (WII)

The Wildlife Institute of India has released its Wildlife Herbarium Dataset, comprising 4,591 specimens. As a herbarium collection, it consists of preserved plant specimens that have been cataloged and digitized for scientific use. The digital specimens are published through the Global Biodiversity Information Facility (GBIF) network, which makes them accessible to researchers worldwide.

The dataset is a useful resource for conservation and ecological research. Scientists and conservationists can use it to monitor biodiversity trends, track endangered species, identify critical habitats, and plan conservation strategies.

Download Link: Herbarium Dataset of the Wildlife Institute of India (WII)

Voice Call Quality Customer Experience

The Voice Call Quality Customer Experience data, collected by the Ministry of Communications, Department of Telecommunications (DOT), and the Telecom Regulatory Authority of India (TRAI), is an indicator of telecommunications performance in India. The dataset records voice call quality metrics across regions, telecom operators, and network technologies.

The Ministry of Communications and TRAI jointly gather, analyze, and publish the data, which supports transparency and accountability in the telecommunications sector. With parameters such as call drops, call setup success rates, voice clarity, and network coverage, the data helps stakeholders make informed decisions and improve service delivery.
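
As an illustration of the kind of analysis such parameters allow, here is a minimal sketch that computes the share of dropped calls per operator. The field names `operator` and `call_dropped`, the helper name `call_drop_rate_by_operator`, and the sample rows are all hypothetical; check the actual column names in the downloaded file before adapting it.

```python
from collections import defaultdict


def call_drop_rate_by_operator(records):
    """Return {operator: dropped_calls / total_calls}.

    `records` is an iterable of dicts with the hypothetical keys
    "operator" (str) and "call_dropped" (bool).
    """
    totals = defaultdict(int)
    drops = defaultdict(int)
    for row in records:
        operator = row["operator"]
        totals[operator] += 1
        if row["call_dropped"]:
            drops[operator] += 1
    # Every operator in `totals` has at least one call, so no division by zero.
    return {op: drops[op] / totals[op] for op in totals}


# Made-up rows for illustration only; not taken from the real dataset.
sample = [
    {"operator": "A", "call_dropped": True},
    {"operator": "A", "call_dropped": False},
    {"operator": "B", "call_dropped": False},
    {"operator": "B", "call_dropped": False},
]
rates = call_drop_rate_by_operator(sample)
```

The same grouping pattern could be applied per region or per network technology by changing the key used for grouping.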

Download Link: Voice Call Quality Customer Experience

List of MSME Registered Units

The dataset contains information on Micro, Small, and Medium Enterprises (MSMEs) registered under the Udyog Aadhaar Memorandum. Its details on the registered units range from demographic information to operational specifics.

Download Link: MSME Registered Units

Local Government Directory (LGD) – Local Bodies with PIN Codes

The Local Government Directory (LGD) – Urban dataset, provided by the Ministry of Panchayati Raj, is a resource for urban governance. It holds information needed for administration and planning at the local level, focusing on areas within urban jurisdictions.

The dataset covers several facets of urban governance, from administrative structures to demographic profiles. It describes the organizational hierarchy and the roles and responsibilities of the administrative units within urban local bodies. It also provides data on key infrastructure facilities such as healthcare, education, transportation, and sanitation, which matter for sustainable urban development.

Download Link: Local Government Directory (LGD) – Local Bodies with PIN Codes

The Lemur Project: ClueWeb09 Dataset

The ClueWeb09 dataset, created by the Language Technologies Institute at Carnegie Mellon University, supports research in information retrieval and language technologies. It contains about 1 billion web pages gathered in early 2009, covering online content in ten languages. The dataset is widely used in the academic community, including by several tracks of the Text REtrieval Conference (TREC). Its coverage and size make it a valuable tool for researchers working on search technology and related fields.

Download Link: The Lemur Project: ClueWeb09 Dataset

The 20 Newsgroups Datasets

The 20 Newsgroups dataset is a popular benchmark for machine learning experiments on text, such as text classification and clustering. It comprises around 20,000 newsgroup documents, partitioned nearly evenly across 20 different newsgroups, which serve as the categories. The collection is usually credited to Ken Lang, the author of Newsweeder, although he does not explicitly mention this specific collection in that work.
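
The near-even split can be checked once an archive has been unpacked. The following is a minimal sketch, assuming the files are laid out with one subdirectory per newsgroup and one file per document; that layout is an assumption about the download, and `count_documents_per_category` and `balance_ratio` are hypothetical helper names.

```python
from collections import Counter
from pathlib import Path


def count_documents_per_category(root):
    """Count the files in each immediate subdirectory of `root`.

    Assumes one subdirectory per newsgroup and one file per document.
    """
    counts = Counter()
    for category_dir in sorted(Path(root).iterdir()):
        if category_dir.is_dir():
            counts[category_dir.name] = sum(
                1 for entry in category_dir.iterdir() if entry.is_file()
            )
    return counts


def balance_ratio(counts):
    """Return smallest class size / largest class size (1.0 means perfectly even)."""
    if not counts or max(counts.values()) == 0:
        raise ValueError("no documents found")
    return min(counts.values()) / max(counts.values())
```

Calling `balance_ratio(count_documents_per_category(path))` on the unpacked directory gives a single number to judge the balance: values close to 1.0 indicate that the categories are close to evenly sized.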

Download Link: The 20 Newsgroups data sets

Reuters Corpora (RCV1, RCV2, TRC2)

In 2000, Reuters Ltd released the Reuters Corpus, Volume 1 (RCV1), for research in natural language processing and machine learning. This large collection of English-language Reuters news stories surpassed earlier datasets in size and scope and covers a wide range of topics; multilingual stories are provided in the companion corpus, RCV2. RCV1 quickly became a standard resource for researchers and developers working on text classification and analysis, and it remains in use for tasks such as topic modeling. Its longevity shows how much a carefully curated dataset can contribute to natural language processing.

Download Link: Reuters Corpora (RCV1, RCV2, TRC2)

For more datasets, see: Datasets by INDIAai

Conclusion

These 10 datasets curated by INDIAai give researchers, data scientists, and enthusiasts plenty of material for exploration and analysis. They cover domains such as public health, economics, biodiversity, telecommunications, governance, and language technologies. Whether you need a data science project for a college internship or simply want to practice, these datasets are a good place to start.

Hi, I am Pankaj Singh Negi - Senior Content Editor | Passionate about storytelling and crafting compelling narratives that transform ideas into impactful content. I love reading about technology revolutionizing our lifestyle.
