Intro to Titanic Dataset and Data Analysis
Introduction
The Titanic dataset is very popular and widely used in Machine Learning. It contains detailed information about the passengers aboard the ship. The task it poses is to predict the fate of each passenger, i.e., whether they survived or not.
The Titanic dataset consists of a total of 11 features and 1 target variable:
(i) PassengerId
→ Denotes the ID of the passenger.
→ Values start from 1.
(ii) Pclass
→ Denotes the class of the passenger aboard.
→ 1 = 1st ; 2 = 2nd ; 3 = 3rd.
(iii) Name
→ Denotes the name of the passenger.
(iv) Sex
→ Denotes the gender of the passenger.
→ Male/female.
(v) Age
→ Denotes the age of the passenger.
→ Values range from 0 to 80.
(vi) SibSp
→ Denotes the no. of siblings or spouses of a particular passenger.
(vii) Parch
→ Denotes the no. of parents or children of a particular passenger.
(viii) Ticket
→ Denotes the ticket ID of a particular passenger.
(ix) Fare
→ Denotes the passenger fare.
→ Value in pounds.
(x) Cabin
→ Denotes the cabin no. of the passenger.
(xi) Embarked
→ Denotes the embarkation port.
→ C = Cherbourg; Q = Queenstown; S = Southampton
(xii) Survived
→ Target variable.
→ ‘0’= No; ‘1’= Yes
Importing Necessary Libraries
Firstly, we will load some basic libraries:
(i) NumPy - for linear algebra.
(ii) pandas - for data analysis.
(iii) Seaborn - for data visualization.
(iv) Matplotlib - for data visualization.
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
Loading Data
Our data consists of two files - training data and testing data. We will load both using the pandas library.
traindf = pd.read_csv('train.csv')
testdf = pd.read_csv('test.csv')
Visualization
1. Basic Visualization
#info() prints the column dtypes and non-null counts
traindf.info()
Output
We get the following details:
Then, we use the shape attribute to get the dimensions of the dataset and the head method to view the first 5 rows.
print(traindf.shape)
traindf.head()
Output
print(testdf.shape)
testdf.head()
Output
2. Data Imputation
As visible from the above tables, we observe many missing values in the dataset.
#first, check whether any NaN values exist in the dataset
print(traindf.isnull().values.any())  #True
dfdict = {}  #dictionary for the percentage of NaN values in each feature
#finding the percentage of NaN values in each column of the dataframe
for i in traindf.columns:
    #percentage of NaN values in this column
    nan_pct = (traindf[i].isna().sum() / traindf.shape[0]) * 100
    #updating the dictionary
    dfdict[i] = nan_pct
#converting the dictionary to a dataframe for better visualization
new = pd.DataFrame.from_dict(dfdict, orient='index')
new
In the above code block, we first check whether any NaN values exist in our dataset. We then create a dictionary for storing the percentage of missing values in each feature.
We loop over all the columns, compute the percentage of NaN values in each column, and update the dictionary accordingly.
Finally, we convert our dictionary to a data frame for better readability and get the following result:
Output
We notice that the columns 'Age', 'Cabin', and 'Embarked' contain missing values, with 'Cabin' missing a whopping 77% of its entries.
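As an aside, pandas can compute the same per-column percentages in one line; a minimal sketch equivalent to the loop above:

#percentage of missing values per column, straight from pandas
print((traindf.isna().mean() * 100).sort_values(ascending=False))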
For a visual overview of the NaN values, we can plot a heatmap of the null mask, in which cells with missing values are shaded differently from those without.
sns.heatmap(traindf.isnull(), cbar=True).set_title("Missing values Heatmap")
Output
3. Graph visualization
(i) To start visualizing the dataset, we will find the percentage of survivors. We do this by looping over the 'Survived' feature and counting the values, with '0' indicating not survived and '1' indicating survived.
#percentage of survivors
#dictionary for computing the percentage of survivors
Survival = {}
#count of those who did not survive
countnosurv = 0
#count of those who survived
countyessurv = 0
#iterating over the 'Survived' feature
for i in traindf['Survived']:
    #0 indicates the passenger didn't survive
    if i == 0:
        countnosurv = countnosurv + 1
    #1 indicates the passenger survived
    elif i == 1:
        countyessurv = countyessurv + 1
#percentage of survivors
Survival['Survived'] = (countyessurv / traindf.shape[0]) * 100
#percentage of non-survivors
Survival['Not Survived'] = (countnosurv / traindf.shape[0]) * 100
new = pd.DataFrame.from_dict(Survival, orient='index')
new
Around 38% of the people survived, whereas about 62% didn't. We can observe the same from the graph below too.
#bar plot
#creating a list of keys of the Survival dictionary
Status = list(Survival.keys())
#creating a list of values of the Survival dictionary
Value = list(Survival.values())
#setting the dimensions of the figure
fig = plt.figure(figsize=(5, 5))
plt.bar(Status, Value, color='orange', width=0.5)
plt.xlabel("Survival Status")
plt.ylabel("Percentage")
plt.title("Survival Percentage")
plt.show()
Output
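For reference, value_counts can produce the same survival percentages without an explicit loop; a minimal sketch:

#normalize=True returns fractions; multiply by 100 to get percentages
print(traindf['Survived'].value_counts(normalize=True) * 100)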
(ii) Then, we find the proportion of males and females on the ship and the proportion of passengers belonging to different classes.
#number of males and females on the Titanic
sns.set(style="white")
sns.catplot(x='Sex', data=traindf, kind='count')
Output
#passengers belonging to different classes
sns.catplot(x='Pclass', data=traindf, kind='count')
Output
(iii) Now, we look at the distribution of passenger ages.
#histogram of passenger ages
hist = traindf['Age'].hist(bins=10)
Output
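If explicit age categories are needed rather than a histogram, pandas' cut can bin the ages directly. A minimal sketch (the bin edges and labels below are illustrative choices, not from the original analysis):

#bin ages into labelled categories and count the passengers in each
bins = [0, 12, 20, 40, 60, 80]
labels = ['child', 'teen', 'adult', 'middle-aged', 'senior']
print(pd.cut(traindf['Age'], bins=bins, labels=labels).value_counts())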
(iv) Now, we examine how the target 'Survived' relates to features such as 'Age' and 'Sex'.
#relating the features 'Age' and 'Sex' to the target 'Survived'
#setting the grid style
sns.set(style="darkgrid")
#grid map ('size' was renamed 'height' in newer seaborn versions)
grid = sns.FacetGrid(traindf, col='Survived', row='Sex', height=3, aspect=1.5)
grid.map(plt.hist, 'Age', alpha=.5, bins=10)
grid.add_legend()
Output
In the graphs above, we notice that fewer males survived than females. Most of the males who died were 20-40 years old, whereas most of the females who survived were 20-45.
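To attach numbers to this observation, a quick groupby gives the survival rate per gender; a minimal sketch (the mean of the 0/1 'Survived' column is exactly the survival rate):

#survival rate per gender, as a percentage
print(traindf.groupby('Sex')['Survived'].mean() * 100)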
(v) Now, we will make a factor plot to determine the survival percentage of passengers from the different embarkation ports.
#factor plot (sns.factorplot was renamed sns.catplot in newer seaborn versions)
ax = sns.catplot(x='Embarked', y='Survived', data=traindf, kind='point', aspect=2.5)
Output
We notice that around 55% of the passengers from Cherbourg survived, compared to about 38% and 34% in the case of Queenstown and Southampton, respectively.
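These percentages can be verified numerically with the same groupby idiom used earlier; a minimal sketch:

#survival rate per embarkation port, as a percentage
print(traindf.groupby('Embarked')['Survived'].mean() * 100)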
Preprocessing
1. Data Imputation
We drop ‘Cabin’ as it contains too many missing values.
#dropping 'Cabin' as it has too much missing data
traindf = traindf.drop(['Cabin'], axis=1)
traindf.shape  #'Cabin' has been dropped
Output
Also, ‘Embarked’ and ‘Age’ contain NaN values, so we replace those with the mode of the respective columns.
#replacing NaN values in each column with the mode of that column
print(traindf.isnull().values.any())
print(traindf.shape)
modes = {}
for eachcol in traindf.columns:
    #getting the mode of the column
    mode_col = traindf[eachcol].mode()[0]
    #storing the mode of each feature in the dictionary
    modes[eachcol] = mode_col
    traindf[eachcol] = traindf[eachcol].replace(np.nan, mode_col)
print(traindf.isnull().values.any())
print(traindf.shape)
Output
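Mode imputation is only one choice here. A common alternative, shown below as a sketch applied instead of (not after) the loop above, is to fill the numeric 'Age' column with its median, which is less sensitive to outliers, and keep the mode only for the categorical 'Embarked':

#median for the numeric column, mode for the categorical one
traindf['Age'] = traindf['Age'].fillna(traindf['Age'].median())
traindf['Embarked'] = traindf['Embarked'].fillna(traindf['Embarked'].mode()[0])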
The ‘Ticket’ feature is dropped too, as we cannot assess anything useful from it.
#we can drop the 'Ticket' feature as we cannot assess anything from it; not needed for prediction
traindf = traindf.drop(['Ticket'], axis=1)
traindf.shape
Output
traindf.head()
Output
‘Name’ is dropped too, as it is relatively non-standard and does not contribute directly to survival.
#dropping 'Name'
traindf = traindf.drop(['Name'], axis=1)
traindf.shape
Output
2. Label Encoding
We see that there are two categorical features - 'Sex' and 'Embarked'. Since machine learning models require numerical input, we need to perform label encoding to convert these categorical features to numerical features.
from sklearn import preprocessing

#label_encoder object
label_encoder = preprocessing.LabelEncoder()
#encode the categorical labels as integers
traindf['Sex'] = label_encoder.fit_transform(traindf['Sex'])
traindf['Embarked'] = label_encoder.fit_transform(traindf['Embarked'])
traindf.head()
Output
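As an aside, one-hot encoding is a common alternative to label encoding that avoids implying an order between categories such as the embarkation ports. A minimal sketch, applied instead of (not after) the LabelEncoder step above:

#one-hot encode the categorical columns into 0/1 indicator columns
#(creates e.g. Embarked_C, Embarked_Q, Embarked_S)
onehot_df = pd.get_dummies(traindf, columns=['Sex', 'Embarked'])

Label encoding is harmless for a binary feature like 'Sex', but one-hot encoding is usually the safer default for multi-category features.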
3. Some more preprocessing
print(traindf.dtypes)
Output
We observe that out of the 9 remaining features, two, namely 'Age' and 'Fare', are float data types, so we convert them to int.
traindf['Age'] = traindf['Age'].astype(int)
traindf['Fare'] = traindf['Fare'].astype(int)
traindf.head()
Output
traindf.info()
Output
Finally, we separate the features from the target variable, 'Survived,' to prepare our data for training.
X_train = traindf.drop("Survived", axis=1)
Y_train = traindf["Survived"]
With that, we are done with our data visualization and preprocessing, and the dataset is training ready.
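To sanity-check that the data really is training ready, here is a minimal sketch that fits a simple scikit-learn baseline on it (the model choice is illustrative and not part of this tutorial):

from sklearn.linear_model import LogisticRegression

#fit a baseline classifier on the preprocessed features
model = LogisticRegression(max_iter=1000)
model.fit(X_train, Y_train)
print("Training accuracy:", model.score(X_train, Y_train))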
Frequently Asked Questions
- What is the importance of data visualization?
Data visualization provides not just quantitative but qualitative understanding of data, which helps us identify areas that need improvement. It also provides us with an understanding of the correlation between the target variable and the relevant features.
- What are some of the preprocessing techniques?
The main preprocessing techniques include:
(i) Data imputation - filling in missing values.
(ii) Label Encoding - converting categorical data to numerical data.
(iii) Standardization - scaling features for consistency.
(iv) PCA - reducing the dimensions of the dataset while retaining significant information (see the sketch after this list).
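For completeness, a minimal scikit-learn sketch of (iii) and (iv), assuming a numeric feature matrix such as the X_train built above:

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

#scale features to zero mean and unit variance, then project onto 2 components
X_scaled = StandardScaler().fit_transform(X_train)
X_reduced = PCA(n_components=2).fit_transform(X_scaled)
print(X_reduced.shape)  #(n_samples, 2)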
- What is the difference between Data Analytics and Data Science?
Data Science is a broader field that involves collecting data, training ML algorithms, and making predictions. Data analytics is a component of Data Science, involving identifying trends and patterns and drawing conclusions.
Key Takeaways
Congratulations on making it this far. This blog gave a fundamental overview of the famous Titanic dataset!
We learned about Data Loading, Data Visualization, and Data Preprocessing. We learned how to visualize data using various plots and then, based on this EDA, made significant decisions concerning preprocessing to make our model training ready.
If you are preparing for the upcoming Campus Placements, don’t worry. Coding Ninjas has your back. Visit this link for a carefully crafted and designed course on campus placements and interview preparation.