Top 6 Data Science Projects for 2021


These data science projects help you apply your knowledge and give you work to showcase on your resume. Data science is the application of statistical methods and machine-learning practices to extract insights and useful information from raw data.


Data science is a vast field of study and experimentation. The data scientist role has been called the sexiest job of the 21st century. The buzz around it may have led you to study prerequisites such as algebra, calculus, statistics and machine learning. With this much knowledge to consolidate, building projects is a good way to put your understanding of data science into practice.

Data science is all about extracting patterns and information from data using statistical models and machine learning practices. Building projects is a useful way to demonstrate your expertise, and these data science projects will give you a solid base for your job applications as a data scientist.

1. Breast Cancer Detection

Breast cancer is one of the most common cancers among women around the globe. Classifying tumours as malignant (cancerous) or benign (non-cancerous) is an active topic of research. Machine learning can perform this classification from features measured on the tumour cells. The UCI Machine Learning Repository hosts a breast cancer dataset.

Each record in the dataset has an ID number and a diagnosis (M = malignant, B = benign). The features you will use to train your model describe the cell nucleus: radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry and fractal dimension.

You will have to select the features that are most helpful in classifying the cancer cells. The steps involved are data exploration, data modification, data splitting, model selection, training and testing.

You can train several classification models and compare the accuracy each one gives. The models you should consider are logistic regression, the nearest neighbour algorithm, support vector machine, kernel SVM, naive Bayes, random forest and decision tree.
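The split, train and test steps can be sketched in plain Python with a nearest neighbour classifier. This is a minimal illustration, not a reference implementation: the sample values, the two features chosen and the helper names (`train_test_split`, `knn_predict`, `accuracy`) are all assumptions made up for the example, and in practice you would load the real UCI data and probably use a machine learning library.

```python
import math
import random
from collections import Counter

# Hypothetical toy samples: (radius, texture) -> diagnosis. Not real UCI data.
SAMPLES = [
    ((8.0, 10.0), "B"), ((9.0, 12.0), "B"), ((8.5, 11.0), "B"),
    ((9.5, 10.5), "B"), ((10.0, 12.5), "B"), ((8.2, 11.8), "B"),
    ((18.0, 22.0), "M"), ((19.0, 24.0), "M"), ((17.5, 21.0), "M"),
    ((20.0, 23.0), "M"), ((18.5, 25.0), "M"), ((19.5, 22.5), "M"),
]


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and split it into train and test lists."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def knn_predict(train, point, k=3):
    """Predict the majority label among the k closest training points."""
    by_distance = sorted(train, key=lambda row: math.dist(row[0], point))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


def accuracy(train, test, k=3):
    """Fraction of test rows whose predicted label matches the true label."""
    correct = sum(knn_predict(train, x, k) == y for x, y in test)
    return correct / len(test)


train, test = train_test_split(SAMPLES)
print(f"train={len(train)} test={len(test)} accuracy={accuracy(train, test):.2f}")
```

Swapping `knn_predict` for another model while keeping the same split and the same `accuracy` function is one simple way to compare the classifiers listed above on equal terms.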

Dataset: Link

Repository for reference: Link

2. Titanic Dataset

Most of us have seen the movie Titanic and know the story of the tragedy. The ship didn’t have enough lifeboats for everyone, resulting in the death of 1502 of the 2224 passengers and crew. Survival involved some luck, but some groups of people were more likely to survive than others. This is what the Titanic dataset is all about.


You have to build a model that predicts whether a person is likely to survive, based on features such as name, age, gender and socio-economic class. It is a really interesting case study and a good learning resource.
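Before fitting a model, it usually helps to check how survival varies across groups and to turn categorical features into numbers. The sketch below shows both ideas. The rows and the field names (`gender`, `pclass`, `survived`) are hypothetical and chosen for illustration, so adjust them to the columns in the competition files.

```python
from collections import defaultdict

# Hypothetical rows for illustration only; not taken from the Kaggle files.
PASSENGERS = [
    {"gender": "female", "pclass": 1, "survived": 1},
    {"gender": "female", "pclass": 3, "survived": 1},
    {"gender": "female", "pclass": 3, "survived": 0},
    {"gender": "male", "pclass": 1, "survived": 1},
    {"gender": "male", "pclass": 2, "survived": 0},
    {"gender": "male", "pclass": 3, "survived": 0},
    {"gender": "male", "pclass": 3, "survived": 0},
]


def survival_rate_by(rows, feature):
    """Return {feature value: fraction of rows with survived == 1}."""
    totals = defaultdict(int)
    survivors = defaultdict(int)
    for row in rows:
        totals[row[feature]] += 1
        survivors[row[feature]] += row["survived"]
    return {value: survivors[value] / totals[value] for value in totals}


def encode(row):
    """Turn one passenger into a numeric feature vector a model can consume."""
    return [1.0 if row["gender"] == "female" else 0.0, float(row["pclass"])]


print(survival_rate_by(PASSENGERS, "gender"))
print(survival_rate_by(PASSENGERS, "pclass"))
print([encode(p) for p in PASSENGERS[:2]])
```

A large gap between groups in `survival_rate_by` suggests that the feature is worth feeding to the model.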

You can participate in this practice competition on Kaggle to get hands-on learning and experience with the Kaggle environment.

Link to competition

3. House Price Prediction

Buying your dream house is a major financial decision. The price of a house depends on many factors: the area, the number of rooms, the furniture, the street, the location and many more. Predicting it is a real-world application of machine learning.

The dataset you will be using for this project is the Ames Housing dataset compiled by Dean De Cock. There is a beginner competition on Kaggle for this dataset too. You can learn regression techniques using this dataset, along with feature engineering.
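As a first taste of regression, here is a minimal sketch of ordinary least squares with a single feature. The areas and prices are hypothetical numbers, not Ames data, and `fit_line` and `rmse` are helper names invented for this example. The real dataset has many more features, which is where feature engineering comes in.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


def rmse(xs, ys, slope, intercept):
    """Root mean squared error of the fitted line on the given points."""
    squared = [(y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical living areas (square metres) and prices (thousands).
areas = [50, 80, 100, 120, 150]
prices = [115, 168, 212, 247, 310]

slope, intercept = fit_line(areas, prices)
print(f"price = {slope:.2f} * area + {intercept:.2f}")
print(f"RMSE on the training points: {rmse(areas, prices, slope, intercept):.2f}")
```

The error here is measured on the same points used for fitting, so it is optimistic. Holding out a test set gives a fairer estimate.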

Link to Competition

4. MNIST Handwritten Digit Recognition

It is one of the most standard datasets for learning classification algorithms. It contains images of the handwritten digits 0-9 and is widely used to teach the basics of computer vision and deep learning. You can train a neural network to predict handwritten digits. The dataset contains 60,000 images to train and 10,000 images to test. This dataset will help you get started with TensorFlow.
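Whatever framework you train with, the images and labels usually need some preparation first. The sketch below shows two common steps in plain Python: scaling pixel values from 0-255 down to the range 0-1, and one-hot encoding the digit labels. The function names are invented for this example and the tiny 2x2 grid is a stand-in for a real 28x28 MNIST image.

```python
def flatten_and_scale(image):
    """Flatten a 2-D grid of 0-255 pixel values into floats in [0, 1]."""
    return [pixel / 255.0 for row in image for pixel in row]


def one_hot(digit, num_classes=10):
    """Encode a digit label 0-9 as a vector with a single 1.0."""
    if not 0 <= digit < num_classes:
        raise ValueError(f"label {digit} is outside 0..{num_classes - 1}")
    vector = [0.0] * num_classes
    vector[digit] = 1.0
    return vector


# A hypothetical 2x2 "image" stands in for a real 28x28 MNIST digit.
tiny_image = [[0, 255], [51, 102]]
print(flatten_and_scale(tiny_image))
print(one_hot(3))
```

Deep learning libraries typically offer their own utilities for these steps, so treat this as an explanation of what they do and not as a replacement.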


5. Iris Dataset

Iris is one of the most standard and basic datasets for getting started in the world of data science. It is a small dataset covering three varieties of flower: Iris Setosa, Iris Versicolour and Iris Virginica. Each variety has 50 instances with four features: sepal length, sepal width, petal length and petal width. It is a pretty straightforward dataset where you need to predict which of the three varieties a flower belongs to.
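One of the simplest ways to attempt this prediction is a nearest centroid classifier: average the features of each variety, then assign a new flower to the variety whose average is closest. The sketch below assumes only two features (petal length and width), and the measurements are made-up values for illustration, not rows from the real dataset.

```python
import math
from collections import defaultdict

# Hypothetical (petal length, petal width) values in cm; not real Iris rows.
TRAIN = [
    ((1.4, 0.2), "setosa"), ((1.5, 0.3), "setosa"),
    ((4.5, 1.4), "versicolor"), ((4.2, 1.3), "versicolor"),
    ((5.8, 2.2), "virginica"), ((6.0, 2.4), "virginica"),
]


def fit_centroids(rows):
    """Average the feature vectors of each class."""
    grouped = defaultdict(list)
    for features, label in rows:
        grouped[label].append(features)
    return {
        label: tuple(sum(column) / len(column) for column in zip(*vectors))
        for label, vectors in grouped.items()
    }


def predict(centroids, features):
    """Return the label whose centroid is closest to the given features."""
    return min(centroids, key=lambda label: math.dist(centroids[label], features))


centroids = fit_centroids(TRAIN)
print(predict(centroids, (1.3, 0.2)))
```

With the full dataset you would use all four features and hold back some flowers to measure accuracy.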

Dataset: Link

6. Sentiment Analysis

Sentiment carries great value in today’s world of likes, reviews, tweets and Reddit posts. Sentiment analysis can be used in many domains: to filter out abusive tweets, to analyse how much customers like a product, and to gain a better understanding of text data. Some of the most common emotions that can be detected are excited, sad, angry and happy. It can help you learn a different branch of data science, NLP, i.e. natural language processing. There are many popular datasets for practising sentiment analysis, such as Amazon product data, the Stanford Sentiment Treebank and the IMDB movie review dataset.
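A simple baseline for sentiment analysis is a lexicon approach: count the positive and negative words in a text. The sketch below is only an illustration of that idea. The word lists are tiny and hypothetical, the function names are invented, and the method ignores negation ("not good") and sarcasm, which is why trained models are normally used on the datasets above.

```python
import re

# Tiny hypothetical word lists; real sentiment lexicons are far larger.
POSITIVE = {"good", "great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "terrible", "hate", "awful", "sad"}


def sentiment_score(text):
    """Count positive words minus negative words in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


def sentiment_label(text):
    """Map the score to a coarse label: positive, negative or neutral."""
    score = sentiment_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


for review in ["I love this great phone", "Terrible battery, I hate it"]:
    print(review, "->", sentiment_label(review))
```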

Frequently Asked Questions

What are some data science projects?

Some of the most popular data science projects are plant disease detection, covid-19 data analysis, breast cancer detection, housing price prediction, fake news detection and movie recommendation. Many more datasets are available in the public domain and can be used to build data science projects.

How do I start a data science project?

A data science project has several steps, starting with data exploration, where you try different visualisations and learn about the dataset. Data cleaning is another very important step that comes before training the model. Model selection is the next step. After this, you work on the details by testing different algorithms and applying techniques such as hyperparameter optimisation and feature engineering.
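As one small example of the data cleaning step, missing values in a numeric column are often filled with the median of the known values. The sketch below assumes missing entries are represented as `None`. The function name and the sample ages are hypothetical.

```python
from statistics import median


def fill_missing_with_median(values):
    """Replace None entries with the median of the values that are present."""
    present = [v for v in values if v is not None]
    if not present:
        raise ValueError("cannot impute a column with no known values")
    fill = median(present)
    return [fill if v is None else v for v in values]


# Hypothetical age column with gaps.
ages = [22, None, 35, 58, None, 41]
print(fill_missing_with_median(ages))
```

The median is a common choice because it is less sensitive to outliers than the mean, but whether imputation is appropriate at all depends on why the values are missing.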

How do data science projects work?

Data science projects aim to move data through a pipeline that turns it into a meaningful asset for the organisation. Applications range from the series recommendations you get on Netflix to the collages made automatically in Google Photos, credit card fraud detection and much more.

Where can I practice data science?

There are various platforms with active data science and machine learning communities whose members help each other. The competitions on these platforms can help you sharpen your skills and enjoy the process of learning. Kaggle, dock ship and are the popular ones. There are more which you can know from