Intro to Titanic Dataset and Data Analysis

Shabeg Singh Gill
Last Updated: May 13, 2022

Introduction

The Titanic dataset is one of the most popular datasets for learning Machine Learning. It contains detailed information about passengers aboard the ship and is typically used to predict their fate, i.e., whether a given passenger survived or not.

 

Link To the Dataset

 

The Titanic dataset consists of a total of 11 features and 1 target variable:

(i) PassengerId
→ Denotes the ID of the passenger.
→ Values start from 1.

(ii) Pclass
→ Denotes the class of the passenger aboard.
→ 1 = 1st; 2 = 2nd; 3 = 3rd.

(iii) Name
→ Denotes the name of the passenger.

(iv) Sex
→ Denotes the gender of the passenger.
→ Male/female.

(v) Age
→ Denotes the age of the passenger.
→ Values range from roughly 0 to 80.

(vi) SibSp
→ Denotes the no. of siblings or spouses of a particular passenger aboard.

(vii) Parch
→ Denotes the no. of parents or children of a particular passenger aboard.

(viii) Ticket
→ Denotes the ticket ID of a particular passenger.

(ix) Fare
→ Denotes the passenger fare.
→ Value in pounds.

(x) Cabin
→ Denotes the cabin no. of the passenger.

(xi) Embarked
→ Denotes the embarkation port.
→ C = Cherbourg; Q = Queenstown; S = Southampton.

(xii) Survived
→ Target variable.
→ ‘0’ = No; ‘1’ = Yes.

Importing Necessary Libraries

First, we will load some basic libraries:

(i) NumPy - for linear algebra.
(ii) Pandas - for data analysis.
(iii) Seaborn - for data visualization.
(iv) Matplotlib - for data visualization.

import numpy as np 
import pandas as pd 
import seaborn as sns
from matplotlib import pyplot as plt

 

 

Loading Data

Our data consists of two files: training data and testing data. We will load both using the pandas library.

traindf= pd.read_csv('train.csv')
testdf = pd.read_csv('test.csv')
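
Before analysis starts, it can help to confirm that a loaded file has the columns described earlier. The following is a minimal sketch, assuming the Kaggle column names; `load_titanic` is a hypothetical helper (not part of pandas), and a tiny made-up CSV stands in for train.csv so that the example runs on its own.

```python
import io
import pandas as pd

# Column names as they appear in the Kaggle train.csv (assumed here).
EXPECTED_COLUMNS = [
    "PassengerId", "Survived", "Pclass", "Name", "Sex", "Age",
    "SibSp", "Parch", "Ticket", "Fare", "Cabin", "Embarked",
]

def load_titanic(source, require_target=True):
    """Hypothetical helper: read a Titanic CSV and check its columns."""
    df = pd.read_csv(source)
    expected = [c for c in EXPECTED_COLUMNS if require_target or c != "Survived"]
    missing = [c for c in expected if c not in df.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return df

# Two made-up rows stand in for train.csv so the sketch is self-contained.
sample_csv = io.StringIO(
    "PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked\n"
    "1,0,3,Passenger A,male,22,1,0,T1,7.25,,S\n"
    "2,1,1,Passenger B,female,38,1,0,T2,71.28,C85,C\n"
)
sample_df = load_titanic(sample_csv)
print(sample_df.shape)
```

With the real files, the calls would look like `load_titanic('train.csv')` and `load_titanic('test.csv', require_target=False)`, since the test file does not contain the 'Survived' column.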

 

Visualization

1. Basic Visualization

# info() prints its summary directly and returns None, so there is nothing to assign
traindf.info()

 

Output

We get the following details :

 

 

Then, we use the shape attribute to get the dimensions of the dataset and the head method to get the first 5 rows.

print(traindf.shape)
traindf.head()

 

 

Output

 

print(testdf.shape)
testdf.head()

 

Output

 

2. Data imputation  

As visible from the above tables, we observe many missing values in the dataset. 

#first check whether the dataset contains any NaN values at all
print(traindf.isnull().values.any()) #True
dfdict = {} #dictionary for the percentage of NaN values in each feature

#finding the percentage of NaN values in each column of df
for i in traindf.columns:
    #share of missing rows in this column, as a percentage
    nan_pct = (traindf[i].isna().sum() / traindf.shape[0]) * 100
    #updating in dictionary
    dfdict[i] = nan_pct

#converting dictionary to dataframe for better visualisation
new = pd.DataFrame.from_dict(dfdict, orient='index')
new

 

 

In the above code block, we first check whether there are any NaN values in our dataset. We create a dictionary for storing the percentage of missing values in each feature.

We run a loop over all the columns, compute the number of NaN values in each column as a percentage of the total number of rows, and store the result in our dictionary.

Finally, we convert our dictionary to a data frame for simplicity and better representation and get the following result:

 

Output

 

We notice that the columns 'Age', 'Cabin', and 'Embarked' contain missing values, with 'Cabin' missing in a whopping 77% of the rows.
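
The same percentages can also be computed without an explicit loop. The sketch below is one possible vectorized version, shown on a small made-up frame (`toy`) that stands in for traindf; the values are illustrative only.

```python
import numpy as np
import pandas as pd

# Small made-up frame standing in for traindf.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, np.nan],
    "Cabin": [np.nan, np.nan, np.nan, "C85"],
    "Fare": [7.25, 71.28, 8.05, 53.10],
})

# isna() gives a boolean frame; the column mean of booleans is the
# fraction of missing entries, so multiplying by 100 gives a percentage.
missing_pct = toy.isna().mean() * 100
print(missing_pct.sort_values(ascending=False))
```

Sorting the result puts the columns with the most missing data at the top, which makes it easier to decide which ones to drop and which ones to impute.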

 

For another view of the NaN values, we can use a heatmap of traindf.isnull(). Each missing entry is drawn in a contrasting shade (the lighter one with seaborn's default colormap), so columns with many gaps stand out.

sns.heatmap(traindf.isnull(), cbar = True).set_title("Missing values Heatmap")

 

Output

 

3. Graph visualization

(i) We will find the percentage of survivors in the dataset. We do this by looping over the 'Survived' feature and counting the values, '0' indicating not survived and '1' indicating survived.

#percentage of survivors

#dictionary for the percentage of survivors
Survival = {}
#count of those who did not survive
countnosurv = 0
#count of those who survived
countyessurv = 0

#iterating over the 'Survived' feature
for i in traindf['Survived']:
    #0 indicates the passenger didn't survive
    if i == 0:
        countnosurv = countnosurv + 1
    #1 indicates the passenger survived
    elif i == 1:
        countyessurv = countyessurv + 1
#percentage of survived
Survival['Survived'] = (countyessurv / traindf.shape[0]) * 100
#percentage of not survived
Survival['Not Survived'] = (countnosurv / traindf.shape[0]) * 100

new = pd.DataFrame.from_dict(Survival, orient='index')
new

 

 

Around 38% of the people survived, whereas about 62% didn't. The bar plot below shows the same split.
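
The counting loop can also be expressed with pandas directly. Here is a minimal sketch using `value_counts(normalize=True)` on a made-up five-row column; the data is illustrative and does not reproduce the real percentages.

```python
import pandas as pd

# Made-up target column standing in for traindf['Survived'].
toy = pd.DataFrame({"Survived": [0, 1, 0, 0, 1]})

# normalize=True returns fractions instead of counts; scale them to
# percentages and give the 0/1 index readable labels.
survival_pct = (
    toy["Survived"]
    .value_counts(normalize=True)
    .mul(100)
    .rename({0: "Not Survived", 1: "Survived"})
)
print(survival_pct)
```

The result is a Series indexed by the two labels, so it can be passed to a bar plot in the same way as the dictionary built by the loop.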

#Bar plot

#creating a list of keys of Survival Dictionary
Status = list(Survival.keys())
#creating a list of values of Survival Dictionary
Value = list(Survival.values())

#setting the dimensions of the figure (width, height in inches)
fig = plt.figure(figsize=(5, 5))
plt.bar(Status, Value, color='orange', width=0.5)

plt.xlabel("Survival Status")
plt.ylabel("Percentage")
plt.title("Survival Percentage")
plt.show()




 

 

Output

 

(ii) Then, we find out the number of males and females on the ship and the number of people belonging to each class.

#no of males and females on the titanic
sns.set(style="white")
#newer seaborn versions require the column to be passed as a keyword argument
sns.catplot(x='Sex', data=traindf, kind='count')


 

Output

#people belonging to different classes
sns.catplot(x='Pclass', data=traindf, kind='count')

 

Output

(iii) Now, we plot the age distribution as a histogram with 10 bins.

#getting age categories 
hist = traindf['Age'].hist(bins=10)

 

 

Output
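
If named age categories are preferred over equal-width histogram bins, `pd.cut` can assign each passenger to a group. This is only a sketch: the cut points and labels below are assumptions chosen for illustration, not values from the original analysis, and the ages are made up.

```python
import numpy as np
import pandas as pd

# Made-up ages standing in for traindf['Age'] (NaN = unknown age).
ages = pd.Series([2, 15, 22, 29, 35, 41, 58, 80, np.nan], name="Age")

# Assumed cut points and labels, chosen purely for illustration.
bins = [0, 12, 18, 40, 60, 80]
labels = ["child", "teen", "adult", "middle-aged", "senior"]

# Each age falls into a right-closed interval such as (18, 40];
# include_lowest=True also keeps an age of exactly 0 in the first group.
age_group = pd.cut(ages, bins=bins, labels=labels, include_lowest=True)

# sort=False keeps the groups in their natural order; NaN ages are not counted.
counts = age_group.value_counts(sort=False)
print(counts)
```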

 

(iv) Now, we look at the relationship between the target 'Survived' and features such as 'Age' and 'Sex'.

# relationship between the features age and sex and the target, survival

#setting grid style
sns.set(style="darkgrid")
#grid map (the 'size' parameter was renamed to 'height' in newer seaborn versions)
grid = sns.FacetGrid(traindf, col='Survived', row='Sex', height=3, aspect=1.5)
grid.map(plt.hist, 'Age', alpha=.5, bins=10)
grid.add_legend()

 

 

Output

 

In the graphs above, we notice that fewer males survived than females. Most of the males who died were aged 20-40, whereas most of the females who survived were aged 20-45.
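
The visual comparison between males and females can be backed by a small table. The following sketch uses `pd.crosstab` with row-wise normalization on made-up data, so the numbers are illustrative only.

```python
import pandas as pd

# Made-up rows standing in for traindf[['Sex', 'Survived']].
toy = pd.DataFrame({
    "Sex": ["male", "male", "male", "female", "female", "male"],
    "Survived": [0, 0, 1, 1, 1, 0],
})

# normalize='index' makes each row sum to 1, so each cell is the share of
# that sex with the given outcome; multiply by 100 for percentages.
table = pd.crosstab(toy["Sex"], toy["Survived"], normalize="index") * 100
print(table)
```

Each row of the table adds up to 100, with column 0 giving the percentage that did not survive and column 1 the percentage that did.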

 

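(v) Now, we will make a point plot (called a factor plot in older seaborn versions) to determine the survival percentage of people from different embarkation ports.

#point plot (sns.factorplot was renamed to sns.catplot in newer seaborn versions)
ax = sns.catplot(x='Embarked', y='Survived', data=traindf, kind='point', aspect=2.5)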

 

Output

We notice that around 55% of the people from Cherbourg survived, compared to about 39% and 34% in the case of Queenstown and Southampton, respectively.
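
Percentages like these can be read off a grouped summary rather than estimated from the plot. Below is a minimal sketch with made-up rows, so the output does not match the real figures; it also reports the group size, which the plot does not show.

```python
import pandas as pd

# Made-up rows standing in for traindf[['Embarked', 'Survived']].
toy = pd.DataFrame({
    "Embarked": ["C", "C", "Q", "S", "S", "S", "S", "Q"],
    "Survived": [1, 0, 0, 1, 0, 0, 0, 1],
})
port_names = {"C": "Cherbourg", "Q": "Queenstown", "S": "Southampton"}

# The mean of a 0/1 column is the survival rate; count gives the group size.
summary = toy.groupby("Embarked")["Survived"].agg(
    passengers="count", survival_pct="mean"
)
summary["survival_pct"] *= 100
summary.index = summary.index.map(port_names)
print(summary)
```

Showing the passenger count next to the rate is a deliberate choice: a high survival percentage based on very few passengers deserves less weight than one based on several hundred.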

Preprocessing

1. Data Imputation

We drop ‘Cabin’ as it contains too many missing values. 

#dropping cabin as too much missing data 
traindf=traindf.drop(['Cabin'], axis=1)

 

 

traindf.shape
#cabin has been dropped 

 

Output

 

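Also, ‘Age’ and ‘Embarked’ still contain NaN values (about 20% and under 1% of the rows, respectively), so we replace them with the mode of each column. The loop below runs over every column, but only columns with missing values are changed.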

#replacing missing values in age and embarked with the mode of the respective cols
print(traindf.isnull().values.any())
print(traindf.shape)
modes = {}
for eachcol in traindf.columns:
    mode_col = traindf[eachcol].mode()[0] #getting mode of column
    modes[eachcol] = mode_col #storing mode of each feature in dictionary

    #columns without missing values are left unchanged
    traindf[eachcol] = traindf[eachcol].fillna(mode_col)
print(traindf.isnull().values.any())
print(traindf.shape)

 

 

Output
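
The mode is one of several reasonable fill values. As an alternative sketch (not the method used above), the example below fills 'Age' with the median and 'Embarked' with the mode, using a single `fillna` call on a made-up frame.

```python
import numpy as np
import pandas as pd

# Made-up frame standing in for traindf after 'Cabin' has been dropped.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 30.0, 54.0, np.nan],
    "Embarked": ["S", "C", np.nan, "S", "Q"],
})

# Alternative to the mode-only approach: the median for the numeric column,
# the most frequent value for the categorical one.
fill_values = {
    "Age": toy["Age"].median(),
    "Embarked": toy["Embarked"].mode()[0],
}
toy = toy.fillna(fill_values)
print(toy)
```

Passing a dictionary to `fillna` touches only the listed columns, which makes it explicit which features were imputed and with what value.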

 

The ‘Ticket’ feature is dropped too, as we cannot assess anything based on it.

#we can drop the ticket feature: we cannot assess anything from it, so it is not needed for prediction
traindf = traindf.drop(['Ticket'], axis=1)

 

traindf.shape

 

Output

traindf.head()

Output

‘Name’ is dropped too, as it is relatively non-standard and does not contribute directly to survival.

#dropping name
traindf = traindf.drop(['Name'], axis=1)
traindf.shape

 

Output

 

2. Label Encoding

We see that there are two categorical features, 'Sex' and 'Embarked'. Most machine learning models expect numerical input, so we perform label encoding to convert these categorical features to numerical features.

from sklearn import preprocessing
# label_encoder object 
label_encoder = preprocessing.LabelEncoder()
# Encode labels 
traindf['Sex']= label_encoder.fit_transform(traindf['Sex'])
traindf['Embarked']= label_encoder.fit_transform(traindf['Embarked'])
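
Because the same `label_encoder` object is fitted twice, it only remembers the classes of the last column it saw. The sketch below is a pandas-only alternative (not the article's method) that records the label-to-code mapping for each column; like LabelEncoder, it assigns codes in sorted order of the labels. The data is made up.

```python
import pandas as pd

# Made-up frame standing in for traindf after imputation (no NaN values).
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "Q", "S"],
})

mappings = {}
for col in ["Sex", "Embarked"]:
    # Sorted unique labels get the codes 0, 1, 2, ...
    categories = sorted(toy[col].unique())
    mappings[col] = {label: code for code, label in enumerate(categories)}
    toy[col] = toy[col].map(mappings[col])

print(mappings)
print(toy)
```

Keeping `mappings` around means the same codes can be applied to the test data later, and the encoded values can be translated back when interpreting results.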

 

traindf.head()

 

Output

 

3. Some more preprocessing

print(traindf.dtypes)

 

Output

 

We observe that, of the 9 remaining columns, two, namely 'Age' and 'Fare', have a float data type. We convert these to int, which discards the fractional part.

traindf['Age']= traindf['Age'].astype(int)
traindf['Fare']= traindf['Fare'].astype(int)
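
It is worth knowing that `astype(int)` truncates rather than rounds. The short sketch below compares the two on a few made-up fares; rounding first is shown only as an option, not as what the code above does.

```python
import pandas as pd

# Made-up fares standing in for traindf['Fare'].
fares = pd.Series([7.25, 7.925, 71.2833, 0.75], name="Fare")

# astype(int) simply drops the fractional part.
truncated = fares.astype(int)

# Rounding before the cast keeps values closer to the originals.
rounded = fares.round().astype(int)

print(pd.DataFrame({"Fare": fares, "truncated": truncated, "rounded": rounded}))
```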

 

traindf.head()

 

Output

 

traindf.info()

 

Output

 

We will separate the target variable, 'Survived', from the features to prepare our data for training.

X_train = traindf.drop("Survived", axis=1)
Y_train = traindf["Survived"]
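
The test data needs the same treatment before a model can use it. The following is a rough sketch of how the steps could be collected into one function; `preprocess` is a hypothetical helper, the rows are made up, and the fill values and mappings are assumed to be learned from the training data only.

```python
import numpy as np
import pandas as pd

def preprocess(df, fill_values, mappings):
    """Hypothetical helper that repeats the preprocessing steps on any split."""
    out = df.drop(columns=["Cabin", "Ticket", "Name"])
    out = out.fillna(fill_values)
    for col, mapping in mappings.items():
        out[col] = out[col].map(mapping)
    return out.astype({"Age": int, "Fare": int})

# Made-up rows standing in for train.csv and test.csv.
train = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Survived": [0, 1, 1, 0],
    "Pclass": [3, 1, 3, 3],
    "Name": ["A", "B", "C", "D"],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, np.nan, 22.0],
    "SibSp": [1, 1, 0, 0],
    "Parch": [0, 0, 0, 0],
    "Ticket": ["T1", "T2", "T3", "T4"],
    "Fare": [7.25, 71.28, 7.92, 8.05],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", "S", np.nan],
})
test = pd.DataFrame({
    "PassengerId": [5], "Pclass": [2], "Name": ["E"], "Sex": ["female"],
    "Age": [np.nan], "SibSp": [0], "Parch": [0], "Ticket": ["T5"],
    "Fare": [np.nan], "Cabin": [np.nan], "Embarked": ["Q"],
})

# Fill values come from the training data only (mode, as in the steps above).
fill_values = {c: train[c].mode()[0] for c in ["Age", "Fare", "Embarked"]}
# Codes follow sorted label order, matching what LabelEncoder would assign.
mappings = {"Sex": {"female": 0, "male": 1},
            "Embarked": {"C": 0, "Q": 1, "S": 2}}

X_train = preprocess(train, fill_values, mappings).drop(columns="Survived")
Y_train = train["Survived"]
X_test = preprocess(test, fill_values, mappings)
print(list(X_train.columns) == list(X_test.columns))
```

Learning the fill values from the training split and reusing them on the test split keeps the two frames consistent and avoids using information from the test data.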

 

Finally, we are done with our data visualization and preprocessing, and the dataset is ready for training.

 

Frequently Asked Questions

  1. What is the importance of data visualization?
    Data visualization provides not just quantitative but qualitative understanding of data, which helps us identify areas that need improvement. It also provides us with an understanding of the correlation between the target variable and the relevant features. 
     
  2. What are some of the preprocessing techniques?
    The main preprocessing techniques include:-
    (i) Data imputation - detecting missing values and filling them in.
    (ii) Label Encoding - converting categorical data to numerical data. 
    (iii) Standardization - involves feature scaling for consistency.
    (iv) PCA- to reduce the dimensions of the dataset while retaining significant information. 
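
Standardization was not applied in this article, so the following is only an illustrative sketch of the idea on made-up numbers: each column is shifted to mean 0 and scaled to standard deviation 1 (using the population standard deviation, ddof=0).

```python
import pandas as pd

# Made-up numeric features.
toy = pd.DataFrame({"Age": [20, 30, 40], "Fare": [10.0, 20.0, 60.0]})

# z-score: subtract the column mean, then divide by the column
# standard deviation (ddof=0 selects the population formula).
standardized = (toy - toy.mean()) / toy.std(ddof=0)
print(standardized)
```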
     
  3. What is the difference between Data Analytics and Data Science?
    Data Science is a broader field that involves collecting data, training ML algorithms, and making predictions. Data analytics is a component of Data Science, involving identifying trends and patterns and drawing conclusions. 

Key Takeaways

Congratulations on making it this far. This blog gave a basic overview of the famous Titanic dataset!

 

We learned about Data Loading, Data Visualisation, and Data Preprocessing. We learned how to visualize data using various plots and then, based on this EDA, took significant decisions concerning preprocessing and making our model training ready.

 

