The latest advancements in data science and machine learning have transformed the way we live and interact with our surroundings.
We generate data at an ever-increasing pace, and the big tech companies mine it for insights into consumer behaviour so that they can tweak their products to become an indispensable part of our lives.
We need large volumes of data to build models which can solve real-world problems as closely as possible. Hence, data is considered the currency of these fields, and it is commonly held that insights and machine learning models are only as good as the data they are built on.
Thus, it is extremely important for professionals in these fields to have access to reliable data in order to build accurate models. Several open datasets are readily available from both private organisations and government agencies, and can be found on government websites, GitHub, Reddit or Kaggle. However, the sheer number and variety of options can be overwhelming for a beginner.
In this blog, we will first list the most popular and reliable crime-related datasets from India and abroad, which can be used to derive meaningful insights and build machine learning models. Later, we will provide some project-specific datasets.
Datasets on Crimes from Around the World
We recommend analysing the crime data yourself first and then reading the analyses published on the source website to cross-check your findings. Such analysis helps law enforcement reduce crime: by studying the types of crime in each district and the modus operandi of the criminals, agencies can tighten rules and surveillance where they are needed.
Crimes in India: In India, the National Crime Records Bureau (NCRB) collects data about various crimes and publishes it on the Open Government Data (OGD) Platform (www.data.gov.in). The dataset contains state-wise information classified by type of crime, number of registered cases and number of people arrested.
Link to the Dataset: https://data.gov.in/keywords/crime-india
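As a starting point, state-wise records of the kind described above can be aggregated with a few lines of Python. The sketch below uses only the standard library. The column names (`state`, `crime_type`, `cases_registered`) and the sample rows are hypothetical placeholders, so match them to the headers of the file you actually download.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample in the shape described above; real files will use
# different column names and contain many more rows.
SAMPLE_CSV = """state,crime_type,cases_registered
State A,Theft,120
State A,Burglary,45
State B,Theft,200
State B,Burglary,30
"""


def cases_by_state(csv_file):
    """Sum registered cases per state from a CSV file-like object."""
    totals = defaultdict(int)
    for row in csv.DictReader(csv_file):
        totals[row["state"]] += int(row["cases_registered"])
    return dict(totals)


totals = cases_by_state(io.StringIO(SAMPLE_CSV))

# List states from most to fewest registered cases.
for state, total in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
    print(state, total)
```

To run this on a downloaded file, pass an opened file object (for example `open(path, newline="")`) instead of the in-memory sample.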
Crimes in Canada: The link below leads to a landing page which includes data on various crimes in different parts of Canada, as well as analyses of that data. You can narrow the results down to the information relevant to you using the filters on the website.
Link to the Dataset: https://www150.statcan.gc.ca/n1/en/subjects/crime_and_justice/crimes_and_offences
Crimes in the United Kingdom: In the UK, different agencies publish their datasets on data.gov.uk, which classifies datasets under various topics on its landing page. You can narrow the results down to the information relevant to you using filters such as publisher, topic, licence type and format.
Link to the Dataset: https://data.gov.uk/search?filters%5Btopic%5D=Crime+and+justice
Crimes in the United States of America: In the USA, law enforcement agencies in different states release their own datasets. data.world is a platform where individuals and organisations upload datasets, which makes it easier for researchers to collate data in one place.
Link to the Dataset: https://data.world/datasets/crime
Security-based Project-Specific Datasets: Apart from analysing crimes in a district, city or country, you can also develop machine learning models to tackle some of the most pressing security problems of our time. Solving these problems requires you to visualise and analyse the data before settling on the most efficient and accurate algorithm.
There are hundreds of solutions posted online for each of the following datasets. Make sure to revise the popular machine learning algorithms before jumping into real-world projects. After you develop your own machine learning model, go through some of the popular implementations to learn how to improve its accuracy.
Spam Email Dataset: Almost all email service providers have embedded spam detection software which parses the contents of each email and classifies it as either spam or legitimate. This is very helpful, since it protects customers from fraudulent activities and saves them the time of going through every email.
We do not need to categorise all promotional emails as spam, but it is essential to flag unsolicited emails sent for the purpose of phishing or spreading malware. The University of California, Irvine maintains a multivariate dataset, Spambase, which contains more than 4.5k emails (instances), each described by over 50 attributes.
Link to the Dataset: https://archive.ics.uci.edu/ml/datasets/Spambase
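To get a feel for this kind of data, here is a minimal baseline sketch. It assumes a comma-separated file in which each row holds numeric attributes followed by a final 0/1 label, with 1 marking spam. The nearest-centroid rule is only an illustrative stand-in for a real classifier, and the toy rows (three attributes each) are invented for the example.

```python
import csv
import io


def load_rows(csv_file):
    """Parse rows of numeric attributes followed by a 0/1 label (1 = spam)."""
    features, labels = [], []
    for row in csv.reader(csv_file):
        if not row:
            continue  # skip blank lines
        *values, label = row
        features.append([float(v) for v in values])
        labels.append(int(label))
    return features, labels


def centroid(rows):
    """Column-wise mean of a non-empty list of feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def predict(x, spam_centre, ham_centre):
    """Nearest-centroid rule: pick the class whose mean is closer to x."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return 1 if sq_dist(x, spam_centre) < sq_dist(x, ham_centre) else 0


# Toy stand-in for the real file; values are illustrative only.
TOY = """0.9,0.8,5.0,1
0.7,0.9,4.0,1
0.0,0.1,1.0,0
0.1,0.0,1.5,0
"""

X, y = load_rows(io.StringIO(TOY))
spam_centre = centroid([x for x, label in zip(X, y) if label == 1])
ham_centre = centroid([x for x, label in zip(X, y) if label == 0])

print(predict([0.8, 0.7, 4.5], spam_centre, ham_centre))
```

On the real data you would hold out a test split and compare this baseline against stronger models before drawing any conclusions.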
Fraudulent Email Dataset: Another project which can be done with emails is an important subset of spam detection: the detection of fraudulent emails. Email is still the most formal channel of communication and is used by billions of individuals and organisations daily. This vast usage makes it very attractive to cybercriminals, since a single security bug can be exploited to affect a large population.
The Enron dataset is provided by the CALO Project (A Cognitive Assistant that Learns and Organises) and contains around 500k emails from about 150 users, most of them senior members of Enron's management. When going through the dataset, be sensitive to the privacy of these people, since the emails are real and might contain confidential information.
Link to the Dataset: https://www.cs.cmu.edu/~enron/
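Before any modelling, each message has to be split into headers and body. The sketch below shows one way to do that with Python's standard `email` package, assuming each message is stored as a plain-text file with standard headers. The sample message and its `example.com` addresses are made up for illustration.

```python
from email import message_from_string

# Hypothetical message in standard header/body form; not taken from the dataset.
RAW = """From: alice@example.com
To: bob@example.com
Subject: Quarterly numbers

Please review the attached figures before Friday.
"""


def summarise(raw_message):
    """Extract the fields most often used as features for email classification."""
    msg = message_from_string(raw_message)
    return {
        "sender": msg["From"],
        "recipient": msg["To"],
        "subject": msg["Subject"],
        # For a simple, non-multipart message the payload is the body text.
        "body": msg.get_payload().strip(),
    }


summary = summarise(RAW)
print(summary["sender"], "|", summary["subject"])
```

Multipart messages return a list of parts from `get_payload()`, so a fuller loader would need to handle that case separately.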
Credit Card Fraud Detection Dataset: As the economy becomes more digital and credit card companies encourage spending by making transactions as seamless as possible, these companies need to be wary of charging customers for transactions that the customers did not make.
This dataset contains information about 250k+ credit card transactions made by European cardholders, of which a little under 500 (0.172%) have been marked as fraudulent. For confidentiality reasons, sensitive information has been filtered out of the dataset while keeping it relevant for modelling.
Link to the Dataset: https://www.kaggle.com/mlg-ulb/creditcardfraud
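With so few fraudulent transactions, plain accuracy is a misleading metric. The sketch below illustrates why, using synthetic labels that match the 0.172% fraud rate quoted above (the total of 100,000 transactions is an arbitrary choice for the example, not the size of the real dataset). A "model" that marks every transaction as legitimate scores very high accuracy while catching no fraud at all.

```python
def evaluate(y_true, y_pred):
    """Return accuracy, precision and recall, with fraud (1) as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when nothing is predicted or labelled as fraud.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Synthetic labels: 172 frauds in 100,000 transactions (0.172%).
TOTAL, FRAUD = 100_000, 172
y_true = [1] * FRAUD + [0] * (TOTAL - FRAUD)

# Trivial baseline that never flags anything.
always_legit = [0] * TOTAL

accuracy, precision, recall = evaluate(y_true, always_legit)
print(f"accuracy={accuracy:.5f} precision={precision:.2f} recall={recall:.2f}")
```

The baseline is over 99.8% accurate yet has zero recall, which is why precision and recall on the fraud class are better guides than accuracy when evaluating models on this dataset.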
Distinguishing Fake Reviews Amidst Authentic Ones
Ratings and reviews are among the main factors that drive purchases. A good-looking product with a terrible review can deter customers from buying it, whereas a simple product can sell well if it has good reviews. This is because reviews act as positive or negative reinforcement in the minds of users. Many agencies try to exploit the power of reviews by generating paid or fake ones, so it is essential for the platform hosting the reviews to ensure that they are authentic in order to remain unbiased towards all the products it hosts.
Yelp has made a dataset of more than 8M reviews publicly available. However, to keep track of who has access to the dataset, it requires you to fill in a form with some basic information about yourself.
Link to the Dataset: https://www.yelp.com/dataset
We hope you find the above datasets useful for getting started on your next project. During your project, did you find an interesting trend? Or does your machine learning model accurately detect fraudulent activity? Do share your insights and code with us in the comments.
By Saarthak Jain