# Intro to Titanic Dataset and Data Analysis

Shabeg Singh Gill
Last Updated: May 13, 2022

## Introduction

The Titanic dataset is one of the most popular datasets in Machine Learning. It contains detailed information regarding all the passengers aboard the ship and is used to predict the fate of each passenger, i.e., whether they survived or not.

The Titanic dataset consists of a total of 11 features and 1 target variable:-

(i) PassengerID

→ Denotes the ID of the passenger.

→ Values start from 1.

(ii) Pclass

→ Denotes the class of the passenger aboard.

→ 1 = 1st ; 2 = 2nd ; 3 = 3rd.

(iii) Name

→ Denotes the name of the passenger.

(iv) Sex

→ Denotes the gender of the passenger.

→ Male/female.

(v) Age

→ Denotes the age of the passenger.

→ Values range anywhere from 0 to 80.

(vi) SibSp

→ Denotes the no. of siblings/spouses a particular passenger had aboard.

(vii) Parch

→ Denotes the no. of parents/children a particular passenger had aboard.

(viii) Ticket

→ Denotes the ticket ID of a particular passenger.

(ix) Fare

→ Denotes the passenger fare.

→ Value in pounds.

(x) Cabin

→ Denotes the cabin no. of the passenger.

(xi) Embarked

→ Denotes the embarkation port.

→ C = Cherbourg; Q = Queenstown; S = Southampton

(xii) Survived

→ Target variable.

→ ‘0’= No; ‘1’= Yes

## Importing Necessary Libraries

Firstly, we will load some basic libraries:-

(i) Numpy - for linear algebra.

(ii) Pandas - for data analysis.

(iii) Seaborn - for data visualization.

(iv) Matplotlib - for data visualisation.

Our data consists of two files - training data and testing data. We will load both using the pandas library.
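The loading step might look like the sketch below. The file names `train.csv` and `test.csv` are the standard Kaggle names and an assumption here; a tiny inline stand-in (the first two real rows of the training data) keeps the sketch runnable without the files:

```python
import numpy as np                # linear algebra
import pandas as pd               # data analysis
import seaborn as sns             # data visualization
import matplotlib.pyplot as plt  # data visualization
from io import StringIO

# In practice, with the Kaggle files in the working directory:
#   train = pd.read_csv("train.csv")
#   test = pd.read_csv("test.csv")
# Inline stand-in so this sketch runs without the files:
sample = StringIO(
    "PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked\n"
    '1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S\n'
    '2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C\n'
)
train = pd.read_csv(sample)
```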

## Visualization

1. Basic Visualization

We first inspect the loaded data to get basic details. Then, we use the shape attribute to get the dimensions of the dataset and the head function to view the first 5 rows.
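On a tiny stand-in frame (in practice `train` comes from `pd.read_csv`), this inspection step is just:

```python
import pandas as pd

# Stand-in frame with a few Titanic-style columns
train = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Pclass": [3, 1, 3],
    "Sex": ["male", "female", "female"],
    "Age": [22.0, 38.0, 26.0],
})

print(train.shape)   # (number of rows, number of columns)
print(train.head())  # first 5 rows (all 3 here)
```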

2. Data imputation

As visible from the above tables, we observe many missing values in the dataset.

Next, we check whether any NaN values exist in our dataset. We create a dictionary for storing the count of missing values in each feature, run a loop over all the columns to count the NaN values in each one, and update the dictionary as we go. Finally, we convert the dictionary to a data frame for a simpler, cleaner representation.
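The loop described above can be sketched as follows, on a small stand-in frame with some missing values:

```python
import numpy as np
import pandas as pd

# Stand-in frame; in practice use the loaded train frame
train = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", np.nan, "S"],
})

# Dictionary of per-column NaN counts, filled in a loop
missing = {}
for col in train.columns:
    missing[col] = int(train[col].isna().sum())

# Convert to a data frame for a cleaner representation
missing_df = pd.DataFrame(list(missing.items()),
                          columns=["Feature", "MissingCount"])
print(missing_df)
```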


We notice that the columns 'Age', 'Cabin', and 'Embarked' contain missing values, with 'Cabin' missing a whopping 77% of its entries.

For visualizing the NaN values, we can also use a heatmap, where darker shades represent higher values than lighter ones.
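A minimal sketch of the heatmap idea, plotting the boolean missing-value mask of a stand-in frame (the `Agg` backend and output file name are choices made here so the sketch runs headlessly):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; no window needed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

train = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0],
    "Cabin": [np.nan, "C85", np.nan],
    "Embarked": ["S", "C", np.nan],
})

# Heatmap of the boolean mask: missing cells render in a
# different shade than present ones
ax = sns.heatmap(train.isna(), cbar=False)
plt.savefig("nan_heatmap.png")
```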


3. Graph visualization

(i) To begin visualizing the dataset, we find the percentage of survivors. We do this by looping over the 'Survived' feature and counting its values, with '0' indicating not survived and '1' indicating survived.

Around 38% of the people survived, whereas about 62% didn't. We can observe the same in a count plot of the 'Survived' column.
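Instead of an explicit loop, `value_counts` with `normalize=True` gives the same percentages directly; a sketch with a stand-in series (on the full dataset the call yields roughly the 38%/62% split reported above):

```python
import pandas as pd

# Stand-in for train["Survived"]: 0 = did not survive, 1 = survived
survived = pd.Series([0, 0, 0, 1, 1, 0, 1, 0])

# Share of each outcome as a percentage
pct = survived.value_counts(normalize=True) * 100
print(pct)
```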

Output

(ii) Then, we find the proportion of males and females on the ship, and the proportion of passengers belonging to each class.
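Both proportions come from the same `value_counts` call; a sketch on stand-in columns:

```python
import pandas as pd

# Stand-in columns; in practice use the loaded train frame
train = pd.DataFrame({
    "Sex": ["male", "male", "female", "male"],
    "Pclass": [3, 1, 3, 2],
})

sex_share = train["Sex"].value_counts(normalize=True)
class_share = train["Pclass"].value_counts(normalize=True)
print(sex_share)
print(class_share)
```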


(iii) Next, we group the passengers into different age categories.
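One way to bucket ages is `pd.cut`; the bin edges and labels below are an illustrative choice, not the blog's exact ones:

```python
import pandas as pd

# Stand-in for train["Age"]
ages = pd.Series([4, 22, 38, 26, 35, 54, 2, 27, 14, 58, 20, 39, 55, 31, 80])

# Bucket ages into categories; intervals are right-inclusive by default
bins = [0, 12, 18, 40, 60, 80]
labels = ["child", "teen", "adult", "middle-aged", "senior"]
categories = pd.cut(ages, bins=bins, labels=labels)
print(categories.value_counts())
```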


(iv) Now, we look at the correlation between the target 'Survived' and features such as 'Age' and 'Sex'.
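The blog shows this relationship with plots; numerically, the mean of the 0/1 target per group gives the same survival rates. A sketch on stand-in data:

```python
import pandas as pd

# Stand-in rows; in practice use the loaded train frame
train = pd.DataFrame({
    "Sex": ["male", "male", "female", "female", "male", "female"],
    "Age": [22, 35, 26, 38, 28, 19],
    "Survived": [0, 0, 1, 1, 1, 0],
})

# Mean of the binary target per group = survival rate by sex
rate = train.groupby("Sex")["Survived"].mean()
print(rate)
```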


From these plots, we notice that fewer males survived than females. Most of the males who died were 20-40 years old, whereas most of the females who survived were 20-45.

(v) Now, we will make a factor plot to determine the survival percentage of people embarking from different ports.
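In recent seaborn versions, `factorplot` has been renamed `catplot`; a point plot of mean survival per embarkation port reproduces the factor plot. A sketch on stand-in data (the output file name is a choice made here):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend
import pandas as pd
import seaborn as sns

# Stand-in rows; in practice use the loaded train frame
train = pd.DataFrame({
    "Embarked": ["C", "C", "Q", "S", "S", "S"],
    "Survived": [1, 1, 0, 0, 1, 0],
})

# Point plot of mean survival rate per port
g = sns.catplot(data=train, x="Embarked", y="Survived", kind="point")
g.savefig("survival_by_port.png")
```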


We notice that around 55% of the people from Cherbourg survived, compared to about 38% and 34% in the case of Queenstown and Southampton, respectively.

## Preprocessing

1. Data Imputation

We drop ‘Cabin’ as it contains too many missing values.
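Dropping the column is a one-liner; a sketch on a stand-in frame:

```python
import numpy as np
import pandas as pd

# Stand-in frame; in practice use the loaded train frame
train = pd.DataFrame({
    "Cabin": [np.nan, "C85", np.nan],
    "Age": [22.0, 38.0, 26.0],
})

# Too many missing values to impute sensibly
train = train.drop(columns=["Cabin"])
```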


Also, ‘Embarked’ and ‘Age’ contain NaN values, so we replace the missing entries with each column’s mode.
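A sketch of the mode imputation on a stand-in frame (`mode()` returns a Series, so we take its first entry):

```python
import numpy as np
import pandas as pd

# Stand-in frame with missing entries
train = pd.DataFrame({
    "Age": [22.0, np.nan, 22.0, 30.0],
    "Embarked": ["S", "C", np.nan, "S"],
})

# Replace NaNs with each column's mode (most frequent value)
for col in ["Age", "Embarked"]:
    train[col] = train[col].fillna(train[col].mode()[0])
```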


The ‘Ticket’ feature is dropped too, as we cannot assess anything useful based on it.


‘Name’ is dropped too, as it is relatively non-standard and does not contribute directly to survival.
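Both drops can be done in one call; a sketch on a stand-in frame:

```python
import pandas as pd

# Stand-in frame; in practice use the loaded train frame
train = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris"],
    "Ticket": ["A/5 21171"],
    "Fare": [7.25],
})

# Neither column informs survival directly
train = train.drop(columns=["Ticket", "Name"])
```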


2. Label Encoding

We see that there are two categorical features - 'Sex' and 'Embarked'. Since this is a binary classification problem, we need to perform label encoding to convert these categorical features into numerical ones.
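The blog may use sklearn's `LabelEncoder` here; an equivalent pandas sketch with explicit mappings (the particular integer codes are an arbitrary but fixed choice):

```python
import pandas as pd

# Stand-in frame with the two categorical features
train = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
})

# Map each category to an integer code
train["Sex"] = train["Sex"].map({"male": 0, "female": 1})
train["Embarked"] = train["Embarked"].map({"S": 0, "C": 1, "Q": 2})
```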


3. Some more preprocessing


We observe that out of 9 features, two, namely 'Age' and 'Fare', are float data types, so we need to convert them to int.
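The cast is a per-column `astype` call; note that converting to int truncates the fractional part of 'Fare'. A sketch on a stand-in frame:

```python
import pandas as pd

# Stand-in frame with the two float columns
train = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0],
    "Fare": [7.25, 71.2833, 7.925],
})

# Cast each float column to int (fractional part is truncated)
train["Age"] = train["Age"].astype(int)
train["Fare"] = train["Fare"].astype(int)
```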


We will move the target variable, 'Survived,' to the end to prepare our data for training.
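Reordering the columns so the target comes last can be sketched as:

```python
import pandas as pd

# Stand-in frame with the target first
train = pd.DataFrame({
    "Survived": [0, 1],
    "Pclass": [3, 1],
    "Sex": [0, 1],
})

# Move 'Survived' to the end
cols = [c for c in train.columns if c != "Survived"] + ["Survived"]
train = train[cols]
```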

Finally, we are done with our data visualization and preprocessing, and the data is ready for model training.

## Frequently Asked Questions

1. What is the importance of data visualization?
Data visualization provides not just quantitative but qualitative understanding of data, which helps us identify areas that need improvement. It also provides us with an understanding of the correlation between the target variable and the relevant features.

2. What are some of the preprocessing techniques?
The main preprocessing techniques include:-
(i) Data imputation - checking for missing values.
(ii) Label Encoding - converting categorical data to numerical data.
(iii) Standardization - involves feature scaling for consistency.
(iv) PCA- to reduce the dimensions of the dataset while retaining significant information.

3. What is the difference between Data Analytics and Data Science?
Data Science is a broader field that involves collecting data, training ML algorithms, and making predictions. Data analytics is a component of Data Science, involving identifying trends and patterns and drawing conclusions.

## Key Takeaways

Congratulations on making it this far. This blog gave a fundamental overview of the famous Titanic dataset!

We learned about Data Loading, Data Visualisation, and Data Preprocessing. We learned how to visualize data using various plots and then, based on this EDA, took significant decisions concerning preprocessing and making our model training ready.

If you are preparing for the upcoming Campus Placements, don’t worry. Coding Ninjas has your back. Visit this link for a carefully crafted and designed course on campus placements and interview preparation.