Top 20 Datasets in Machine Learning


Datasets are an integral part of machine learning applications. They are available in many formats, such as .txt and .csv. This article gives a glimpse of the top 20 datasets in machine learning.

Supervised machine learning uses a labelled training dataset, while unsupervised learning needs no labels. If you are a beginner, we recommend reading this article thoroughly. Working with datasets helps you get better at building machine learning models, since there is always room to improve your accuracy and test your knowledge. Let’s now look at several well-known datasets that you can start with.

MNIST Dataset
This dataset contains images of handwritten digits. With it, you can build a model that recognises handwritten digits. It’s a well-known and interesting machine learning dataset that provides 60,000 instances for training and 10,000 for testing.

  • The training set and testing set are disjoint from each other.
  • Get binary images of handwritten digits using NIST’s Special Database 3 and Special Database.
  • This dataset helps you to understand and learn how to use ML techniques and pattern recognition methods on real-world data.

Spam SMS Classifier Dataset
Among the many machine learning applications, spam classification (or spam detection) is an interesting one. It’s also a well-known task for academic projects and machine learning research. If you are a beginner in this field, you can build a spam classifier using this dataset. The SMS Spam dataset is a set of labelled SMS messages collected for SMS spam analysis.

  • This dataset contains 5,574 messages, which are written in English.
  • The file format is CSV.
  • Each line has two columns: one column contains the label (ham or spam), and the other one includes the raw text.
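As a minimal sketch of the two-column layout described above, the following Python parses label/text rows and counts each label. The sample rows, the `load_messages` helper and the tab delimiter are assumptions made for illustration; check the delimiter of the file you actually download.

```python
import csv
import io
from collections import Counter

# Hypothetical sample in the two-column layout: label, then raw text.
# A tab delimiter is assumed here; the real file may use a different one.
SAMPLE = (
    "ham\tAre we still meeting for lunch today?\n"
    "spam\tYou have won a free prize, call now!\n"
    "ham\tI will call you later tonight.\n"
)

def load_messages(handle, delimiter="\t"):
    """Return a list of (label, text) pairs from a two-column file."""
    reader = csv.reader(handle, delimiter=delimiter, quoting=csv.QUOTE_NONE)
    return [(row[0], row[1]) for row in reader if len(row) >= 2]

messages = load_messages(io.StringIO(SAMPLE))
label_counts = Counter(label for label, _ in messages)
print(label_counts)
```

To read a real file, pass an open file object instead of the in-memory sample.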

Iris Flower Dataset
If you are a beginner and want to develop a simple project, you can use the Iris Flower Dataset. It is one of the best-known datasets for pattern recognition. The dataset is small and needs no pre-processing before you use it in a machine learning project. It has numeric attributes, namely the length and width of the sepals and petals.

  • All of the attributes are real.
  • The dataset characteristics are multivariate.
  • There are four attributes, i.e., sepal length in cm, sepal width in cm, petal length in cm and petal width in cm.
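Because every flower is described by just four numbers, even a nearest-neighbour rule is enough to experiment with. The sketch below is one possible approach, not a recommended model: the handful of rows is typed in by hand for illustration, and `nearest_species` is a hypothetical helper name.

```python
import math

# A few hand-typed rows for illustration:
# (sepal length, sepal width, petal length, petal width) in cm, plus species.
TRAINING_ROWS = [
    ((5.1, 3.5, 1.4, 0.2), "setosa"),
    ((4.9, 3.0, 1.4, 0.2), "setosa"),
    ((7.0, 3.2, 4.7, 1.4), "versicolor"),
    ((6.4, 3.2, 4.5, 1.5), "versicolor"),
    ((6.3, 3.3, 6.0, 2.5), "virginica"),
    ((5.8, 2.7, 5.1, 1.9), "virginica"),
]

def nearest_species(measurements, rows=TRAINING_ROWS):
    """Return the species of the closest training row (Euclidean distance)."""
    _, species = min(rows, key=lambda row: math.dist(row[0], measurements))
    return species

print(nearest_species((5.0, 3.4, 1.5, 0.2)))
print(nearest_species((6.5, 3.0, 5.8, 2.2)))
```

With the full dataset you would hold out some rows for testing rather than judging the rule on the rows it was built from.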

Boston House Price Dataset
This dataset was collected from the Boston, Massachusetts area. It is a very basic dataset that beginners use to learn how to improve a model and to try the same problem with a single perceptron. With it, they can get a basic idea of how to make predictions with a regression model, or how to fit a perceptron to the data. Each record provides basic features of a house along with its price.
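To make the regression idea concrete, here is a minimal sketch of ordinary least squares with a single feature. The numbers are made up for illustration and are not taken from the dataset; `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up numbers: rooms per house and price (in thousands).
rooms = [4.0, 5.0, 6.0, 7.0, 8.0]
prices = [10.0, 15.0, 20.0, 25.0, 30.0]

slope, intercept = fit_line(rooms, prices)
predicted = slope * 6.5 + intercept  # price estimate for a 6.5-room house
print(slope, intercept, predicted)
```

The real data has several features per house, so in practice you would extend this to multiple regression or use a library.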

Pima Indians Diabetes Dataset
If you want to apply machine learning in healthcare, you can use the Pima Indians Diabetes dataset. Diabetes is one of the most common dangerous diseases, and you can use this dataset to build a diabetes detection system. The dataset comes from the National Institute of Diabetes and Digestive and Kidney Diseases.

  • The file format of this dataset is CSV.
  • All the patients in this dataset are female and at least 21 years old.
  • It contains 768 data points with nine columns each (eight features plus the outcome label).

COVID-19 Open Research Dataset Challenge (CORD-19)
In response to the COVID-19 pandemic, the White House and a coalition of leading research groups have prepared the COVID-19 Open Research Dataset (CORD-19). CORD-19 is a resource of over 192,000 scholarly articles, including over 84,000 with full text, about COVID-19, SARS-CoV-2 and related coronaviruses. This freely available dataset is provided to the global research community to apply recent advances in natural language processing and other AI techniques to generate new insights in support of the ongoing fight against this infectious disease. These approaches are increasingly urgent because the rapid growth in new coronavirus literature makes it difficult for the medical research community to keep up.

  • The file format of this dataset is CSV.
  • Over 192,000 scholarly articles, including over 84,000 with full text, about COVID-19, SARS-CoV-2, and related coronaviruses.

Google Trends Data Portal
Google Trends data can be examined and analysed visually, and you can download it as CSV files with a single click. It shows what’s trending and what people are searching for. Google Trends can be a game-changer in many fields, whether for politics or for social influencers, because it reveals when large numbers of people are searching for a particular topic. This information can be analysed and used to shape policies that improve the livelihood of people in a region.

GTSRB (German traffic sign recognition benchmark) Dataset
The GTSRB dataset contains around 50,000 images of traffic signs belonging to 43 different classes, along with the bounding box of each sign. The dataset is used for multiclass classification, which has become necessary because of the rapid development of self-driving cars. By training on this set, a machine can get better at recognising traffic signs, which helps prevent accidents and makes self-driving cars more efficient.

IMDB-Wiki Dataset
The IMDB-Wiki dataset is one of the largest open-source datasets of face images labelled with gender and age. The images are collected from IMDB and Wikipedia, and the dataset has more than 500,000 labelled images. You can build a model that detects faces and predicts their gender and age. Ages can be grouped into ranges such as 0-10, 10-20, 30-40, 50-60, etc.

  • Gives better results when used with a CNN model.
  • Labelled face images with age and gender.
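Grouping ages into ranges, as suggested above, turns age prediction into a classification problem. A minimal sketch follows, assuming ten-year ranges in which each range includes its lower bound and excludes its upper bound. `age_bucket` is a hypothetical helper, not part of the dataset’s tooling.

```python
def age_bucket(age, width=10):
    """Map an age in years to a range label such as '20-30'.

    Ranges include the lower bound and exclude the upper bound,
    so an age of 30 falls into '30-40'.
    """
    if age < 0:
        raise ValueError("age must be non-negative")
    lower = (int(age) // width) * width
    return f"{lower}-{lower + width}"

print([age_bucket(a) for a in (7, 25, 30, 59)])
```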

YouTube 8M Dataset
The YouTube 8M dataset is a large-scale labelled video dataset with 6.1 million YouTube video IDs, 350,000 hours of video, 2.6 billion audio/visual features, 3,862 classes and an average of 3 labels per video. It is used for video classification: a model trained on it takes a video’s sequence of features as input and predicts which categories the video belongs to, describing what the video is about.

  • This dataset is helpful for social influencers who want to create specific content for their audience.
  • 237K human-verified segment labels with 1000 classes, thus providing a variety.

Fake News Detection Dataset
It is a CSV file that has 7796 rows and 4 columns. The first column identifies the news item, the second is the title, the third is the news text and the fourth is the label, TRUE or FAKE. You can build a fake news detection model with the Passive-Aggressive Classifier algorithm, which can classify massive streams of data and can be implemented quickly.
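The Passive-Aggressive idea fits in a few lines of plain Python: the model stays passive when an example is already classified with enough margin, and otherwise makes the smallest update that corrects it. The sketch below uses the PA-I variant on bag-of-words counts. The toy headlines, the crude tokenizer and all function names are assumptions for illustration, not the dataset’s contents or any library’s API.

```python
def featurize(text):
    """Bag-of-words counts as a sparse dict {word: count}."""
    counts = {}
    for word in text.lower().split():  # deliberately crude tokenizer
        counts[word] = counts.get(word, 0.0) + 1.0
    return counts

def score(weights, features):
    """Dot product between the weight vector and sparse features."""
    return sum(weights.get(w, 0.0) * v for w, v in features.items())

def train_passive_aggressive(examples, epochs=5, c=1.0):
    """PA-I updates; labels are +1 (FAKE) or -1 (TRUE)."""
    weights = {}
    for _ in range(epochs):
        for text, label in examples:
            features = featurize(text)
            loss = max(0.0, 1.0 - label * score(weights, features))
            if loss == 0.0 or not features:
                continue  # passive: margin already satisfied
            norm_sq = sum(v * v for v in features.values())
            tau = min(c, loss / norm_sq)  # aggressive step, capped by c
            for word, value in features.items():
                weights[word] = weights.get(word, 0.0) + tau * label * value
    return weights

def predict(weights, text):
    return "FAKE" if score(weights, featurize(text)) > 0 else "TRUE"

# Hypothetical toy headlines, not taken from the dataset.
EXAMPLES = [
    ("shocking miracle cure doctors hate", 1),
    ("council approves new budget", -1),
    ("aliens secretly control banks", 1),
    ("city opens new library", -1),
]
weights = train_passive_aggressive(EXAMPLES)
print(predict(weights, "miracle cure shocking"))
print(predict(weights, "council opens library"))
```

Each example is processed once per pass and then discarded, which is why this family of algorithms suits streams of data.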

Titanic Dataset
On 15 April 1912, the supposedly unsinkable Titanic sank, killing 1502 of the 2224 people on board. The dataset contains information such as name, age, sex and number of siblings aboard for 891 passengers in the training set and 418 passengers in the testing set. You can build a fun model to predict whether a person would have survived on the Titanic. Because the outcome is binary, logistic regression is a suitable choice for this purpose.
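Since survival is a yes/no outcome, one common choice is logistic regression, which outputs a probability between 0 and 1. The sketch below trains it with batch gradient descent in plain Python. The six rows and the two derived features (`is_female`, `is_child`) are made up for illustration and are not taken from the dataset.

```python
import math

def survival_probability(features, weights, bias):
    """Logistic model: sigmoid of a weighted sum of the features."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(rows, labels, learning_rate=0.5, epochs=2000):
    """Batch gradient descent on the mean log-loss; returns (weights, bias)."""
    n = len(rows)
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for features, label in zip(rows, labels):
            error = survival_probability(features, weights, bias) - label
            for j, value in enumerate(features):
                grad_w[j] += error * value
            grad_b += error
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias

# Made-up passengers: (is_female, is_child) -> survived (1) or not (0).
ROWS = [(1, 0), (1, 0), (1, 1), (0, 0), (0, 0), (0, 1)]
SURVIVED = [1, 1, 1, 0, 0, 1]

weights, bias = train_logistic(ROWS, SURVIVED)
print(survival_probability((1, 0), weights, bias) > 0.5)  # adult woman
print(survival_probability((0, 0), weights, bias) > 0.5)  # adult man
```

On the real data you would first encode columns such as sex and age as numbers and handle missing values.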

Daily Power Generation in India (2017-2020)
India is the world’s third-largest producer and third-largest consumer of electricity. The national electric grid in India had an installed capacity of 370.106 GW as of 31 March 2020. Renewable power plants, which also include large hydroelectric plants, constitute 35.86% of India’s total installed capacity. The dataset supports a detailed study of the leading power generation methods in India, the region-wise growth of renewable energy, and the states that lead in producing renewable and non-renewable energy.

Hospitals and beds in India (Statewise)
The novel coronavirus (COVID-19) has caused an acute shortage of healthcare infrastructure. This dataset lists the number of beds and hospitals in the different states and union territories. Together with a patient database, it can be crucial for predicting acute shortages of medical infrastructure and for supplementing it wherever necessary.

The Yelp Dataset
Yelp has made its dataset publicly available, but you have to fill in a form before you can access the data. It contains 1.2 million tips by 1.6 million users, and over 1.2 million business attributes and photos, for natural language processing tasks. You can build a model that detects whether a restaurant review is fake or real. With text processing and the additional features in the dataset, you can build an SVM model that classifies reviews as fake or real.

Recommender Systems Dataset
This is a portal to a collection of rich datasets that were used in lab research projects at UCSD. It contains various datasets from popular websites, such as Goodreads book reviews, Amazon product reviews, bartending data and data from social media, that are used to build recommender systems. You can build a product recommendation system like Amazon’s. A recommendation system suggests products, movies, etc. based on your interests and the things you have liked and used earlier.

A good source of economic and financial data, useful for building models that predict economic indicators or stock prices. This dataset can give you hands-on experience with stock-related machine learning models, which in turn helps you improve the accuracy of your own model.

Fashion MNIST
It is a dataset of Zalando’s article images, consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28×28 grayscale image, associated with a label from 10 classes. Zalando intends Fashion-MNIST to serve as a direct drop-in replacement for the original MNIST dataset for benchmarking machine learning algorithms. It shares the same image size and the same structure of training and testing splits.

  • 28×28 grayscale image.
  • A training set of 60,000 examples and a test set of 10,000 examples.
  • 10 classes.
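Many simple models expect a flat vector of numbers rather than a 28×28 grid, so a typical first step is to flatten each image and scale its pixels. This is a minimal sketch, assuming pixel values in the range 0 to 255. The image here is synthetic and `flatten_and_scale` is a hypothetical helper.

```python
def flatten_and_scale(image):
    """Turn a 28x28 grid of 0-255 pixel values into 784 floats in [0, 1]."""
    if len(image) != 28 or any(len(row) != 28 for row in image):
        raise ValueError("expected a 28x28 image")
    return [pixel / 255.0 for row in image for pixel in row]

# Synthetic image for illustration: black background with one white pixel.
image = [[0] * 28 for _ in range(28)]
image[3][5] = 255

vector = flatten_and_scale(image)
print(len(vector), vector[3 * 28 + 5])
```

Because the image size matches the original MNIST dataset, the same step applies to both.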

Hotel Booking Demand
This data set contains booking information for a city hotel and a resort hotel and includes information such as when the booking was made, length of stay, the number of adults, children, and/or babies, and the number of available parking spaces, among other things. All personally identifying information has been removed from the data.

Chest X-Ray Images (Pneumonia)
If you are working on medical solutions for pneumonia, or need data for a project, this dataset is a good basis for a predictive model. It contains 5,863 X-ray images (JPEG) in two categories (Pneumonia/Normal).


By Tusshar Verma