Kaggle Prediction Exercise: Titanic
1. Import libraries and data
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set_style('whitegrid')
sns.set(rc = {'figure.figsize':(15, 10)})
import warnings
warnings.filterwarnings(action="ignore", message="internal gelsd")
warnings.simplefilter(action='ignore', category=FutureWarning)
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')
submission_df = pd.read_csv('gender_submission.csv')
2. Some data analysis
print(train_df.shape)
print(test_df.shape)
(891, 12)
(418, 11)
train_df['Survived'].value_counts()
0 549
1 342
Name: Survived, dtype: int64
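About 38% of the training passengers survived. If proportions are easier to read than raw counts, value_counts can normalise them; a quick check:
# Class balance as proportions (roughly 62% died, 38% survived)
train_df['Survived'].value_counts(normalize=True)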
train_df.head()
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
pd.options.display.max_columns = None  # show all columns when displaying wide frames
train_df.describe()
| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
count | 891.000000 | 891.000000 | 891.000000 | 714.000000 | 891.000000 | 891.000000 | 891.000000 |
mean | 446.000000 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
min | 1.000000 | 0.000000 | 1.000000 | 0.420000 | 0.000000 | 0.000000 | 0.000000 |
25% | 223.500000 | 0.000000 | 2.000000 | 20.125000 | 0.000000 | 0.000000 | 7.910400 |
50% | 446.000000 | 0.000000 | 3.000000 | 28.000000 | 0.000000 | 0.000000 | 14.454200 |
75% | 668.500000 | 1.000000 | 3.000000 | 38.000000 | 1.000000 | 0.000000 | 31.000000 |
max | 891.000000 | 1.000000 | 3.000000 | 80.000000 | 8.000000 | 6.000000 | 512.329200 |
train_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
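The info output already hints at the missing data (Age, Cabin and Embarked have fewer than 891 non-null entries); an explicit count per column makes it easier to see:
# Count of missing values per column, largest first
train_df.isnull().sum().sort_values(ascending=False)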
3. Feature Engineering
This is the process of getting the most predictive value out of the data provided.
3.1. Feature Extraction
* Create Title from the Name feature
# First, concatenate the train and test sets; the combined frame is used for this feature-engineering chapter only.
data = pd.concat([train_df, test_df])
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1309 entries, 0 to 417
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 1309 non-null int64
1 Survived 891 non-null float64
2 Pclass 1309 non-null int64
3 Name 1309 non-null object
4 Sex 1309 non-null object
5 Age 1046 non-null float64
6 SibSp 1309 non-null int64
7 Parch 1309 non-null int64
8 Ticket 1309 non-null object
9 Fare 1308 non-null float64
10 Cabin 295 non-null object
11 Embarked 1307 non-null object
dtypes: float64(3), int64(4), object(5)
memory usage: 132.9+ KB
import re
data['Title'] = data.Name.apply(lambda x: re.search(r' ([A-Z][a-z]+)\.', x).group(1))
data.Title.value_counts()
Mr 757
Miss 260
Mrs 197
Master 61
Rev 8
Dr 8
Col 4
Ms 2
Major 2
Mlle 2
Dona 1
Mme 1
Capt 1
Countess 1
Lady 1
Don 1
Sir 1
Jonkheer 1
Name: Title, dtype: int64
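As an aside, pandas' vectorised string methods can do the same extraction without an explicit apply; a small equivalent sketch (titles_alt is just an illustrative name):
# Vectorised alternative to the apply/lambda above; should give the same counts
titles_alt = data['Name'].str.extract(r' ([A-Z][a-z]+)\.', expand=False)
titles_alt.value_counts()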
# We can group these into seven titles: Mr, Mrs, Miss, Master, Other, Military and Nobility
data['Title'] = data['Title'].replace(
    {
        'Ms':'Miss', 'Mlle':'Miss',
        'Mme':'Mrs', 'Dona':'Mrs',
        'Don':'Mr',
        'Jonkheer':'Nobility', 'Lady':'Nobility', 'Sir':'Nobility', 'Countess':'Nobility',
        'Capt':'Military', 'Major':'Military', 'Col':'Military',
        'Rev':'Other', 'Dr':'Other'
    })
data['Title'].value_counts()
Mr 758
Miss 264
Mrs 199
Master 61
Other 16
Military 7
Nobility 4
Name: Title, dtype: int64
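A quick sanity check that the grouped titles carry signal. Only the 891 training rows have a Survived value (the test rows are NaN and are ignored by mean):
# Mean survival rate per grouped title, training rows only
data.groupby('Title')['Survived'].mean().sort_values(ascending=False)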
* Create Family from SibSp and Parch
data['Family'] = data['SibSp'] + data['Parch'] + 1
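Family counts the passenger plus their siblings/spouses (SibSp) and parents/children (Parch). A quick look at how survival varies with family size, again using only the labelled training rows:
# Number of labelled passengers and survival rate per family size
data.groupby('Family')['Survived'].agg(['count', 'mean'])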
* Extract Ticket_1 and Ticket_2 from the Ticket feature
# Ticket_1: the numeric part of the ticket string (raw string avoids escape warnings)
data['Ticket_1'] = data['Ticket'].map(lambda x: re.sub(r'\D', '', x))
data['Ticket_1'] = pd.to_numeric(data['Ticket_1'])
# Ticket_2: how many passengers share the same ticket
ticket = dict(data['Ticket'].value_counts())
data['Ticket_2'] = data['Ticket'].map(ticket)
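Tickets with no digits (such as 'LINE') end up with an empty numeric part, which is why Ticket_1 shows up as a float column with missing values. Ticket_2 is a rough proxy for travelling-group size; a quick look at the most shared tickets shows why that can matter:
# Most frequently shared tickets, i.e. the largest group bookings
data['Ticket'].value_counts().head()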
* Bin Fare and Age
data['Bin_Fare'] = pd.qcut(data.Fare, q=4, labels=False)
data['Bin_Age'] = pd.qcut(data.Age, q=10, labels=False)
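qcut assigns quartile labels for Fare and decile labels for Age; rows with a missing Age simply get a missing bin, which the imputer below fills. To see the actual cut points, qcut can also return the bin edges:
# Decile edges that qcut used for Bin_Age (for inspection only)
_, age_edges = pd.qcut(data.Age, q=10, retbins=True)
age_edges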
* Deck and Has_Cabin could be created from Cabin, but since Cabin is missing for most passengers it is simply dropped below. A quick sample of the engineered features:
data.sample(5)
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Title | Family | Ticket_1 | Ticket_2 | Bin_Fare | Bin_Age |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
684 | 685 | 0.0 | 2 | Brown, Mr. Thomas William Solomon | male | 60.0 | 1 | 1 | 29750 | 39.0000 | NaN | S | Mr | 3 | 29750.0 | 3 | 3.0 | 9.0 |
348 | 349 | 1.0 | 3 | Coutts, Master. William Loch "William" | male | 3.0 | 1 | 1 | C.A. 37671 | 15.9000 | NaN | S | Master | 3 | 37671.0 | 3 | 2.0 | 0.0 |
624 | 625 | 0.0 | 3 | Bowen, Mr. David John "Dai" | male | 21.0 | 0 | 0 | 54636 | 16.1000 | NaN | S | Mr | 1 | 54636.0 | 2 | 2.0 | 2.0 |
339 | 1231 | NaN | 3 | Betros, Master. Seman | male | NaN | 0 | 0 | 2622 | 7.2292 | NaN | C | Master | 1 | 2622.0 | 1 | 0.0 | NaN |
232 | 233 | 0.0 | 2 | Sjostedt, Mr. Ernst Adolf | male | 59.0 | 0 | 0 | 237442 | 13.5000 | NaN | S | Mr | 1 | 237442.0 | 1 | 1.0 | 9.0 |
3.2. Drop Features
data.columns
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked', 'Title', 'Family',
'Ticket_1', 'Ticket_2', 'Bin_Fare', 'Bin_Age'],
dtype='object')
# drop the target, identifiers, and raw columns that are no longer needed
data = data.drop(['Name', 'PassengerId', 'Survived', 'SibSp', 'Parch', 'Ticket', 'Cabin'], axis=1)
# Bin_Fare is dropped after inspecting the feature importances (Section 5.1)
data = data.drop(['Bin_Fare'], axis=1)
3.3. Feature Transformation
# split the combined frame back into train and test sets (concat preserved the row order)
train = data.iloc[:891]
test = data.iloc[891:]
train.head()
| | Pclass | Sex | Age | Fare | Embarked | Title | Family | Ticket_1 | Ticket_2 | Bin_Age |
|---|---|---|---|---|---|---|---|---|---|---|
0 | 3 | male | 22.0 | 7.2500 | S | Mr | 2 | 521171.0 | 1 | 2.0 |
1 | 1 | female | 38.0 | 71.2833 | C | Mrs | 2 | 17599.0 | 2 | 7.0 |
2 | 3 | female | 26.0 | 7.9250 | S | Miss | 1 | 23101282.0 | 1 | 4.0 |
3 | 1 | female | 35.0 | 53.1000 | S | Mrs | 2 | 113803.0 | 2 | 6.0 |
4 | 3 | male | 35.0 | 8.0500 | S | Mr | 1 | 373450.0 | 1 | 6.0 |
# select the numerical predictors
train_num = train.select_dtypes(include=[np.number])
train_num.head()
| | Pclass | Age | Fare | Family | Ticket_1 | Ticket_2 | Bin_Age |
|---|---|---|---|---|---|---|---|
0 | 3 | 22.0 | 7.2500 | 2 | 521171.0 | 1 | 2.0 |
1 | 1 | 38.0 | 71.2833 | 2 | 17599.0 | 2 | 7.0 |
2 | 3 | 26.0 | 7.9250 | 1 | 23101282.0 | 1 | 4.0 |
3 | 1 | 35.0 | 53.1000 | 2 | 113803.0 | 2 | 6.0 |
4 | 3 | 35.0 | 8.0500 | 1 | 373450.0 | 1 | 6.0 |
# Numeric pipeline: impute missing values with the median, then scale with RobustScaler
from sklearn.pipeline import Pipeline as pl
from sklearn.impute import SimpleImputer as si
from sklearn.preprocessing import RobustScaler
num_pipeline = pl([
('imputer', si(strategy='median')),
('scaler', RobustScaler()),
])
train_num_trx = num_pipeline.fit_transform(train_num)
train_num_trx
array([[ 0.00000000e+00, -4.61538462e-01, -3.12010602e-01, ...,
1.23510357e+00, 0.00000000e+00, -5.00000000e-01],
[-2.00000000e+00, 7.69230769e-01, 2.46124229e+00, ...,
-2.90816698e-01, 5.00000000e-01, 7.50000000e-01],
[ 0.00000000e+00, -1.53846154e-01, -2.82776661e-01, ...,
6.96571943e+01, 0.00000000e+00, 0.00000000e+00],
...,
[ 0.00000000e+00, 0.00000000e+00, 3.89603978e-01, ...,
-3.24124577e-01, 1.50000000e+00, 0.00000000e+00],
[-2.00000000e+00, -1.53846154e-01, 6.73281477e-01, ...,
-6.67551483e-03, 0.00000000e+00, 0.00000000e+00],
[ 0.00000000e+00, 3.07692308e-01, -2.90355831e-01, ...,
7.78165642e-01, 0.00000000e+00, 5.00000000e-01]])
# select the categorical (object-type) predictors
train_cat = train.select_dtypes(exclude=[np.number])
train_cat.head()
| | Sex | Embarked | Title |
|---|---|---|---|
0 | male | S | Mr |
1 | female | C | Mrs |
2 | female | S | Miss |
3 | female | S | Mrs |
4 | male | S | Mr |
# encode the categorical predictors: impute the most frequent value, then one-hot encode (dropping the first level)
from sklearn.preprocessing import OneHotEncoder as onehot
cat_pipeline = pl([
('imputer', si(strategy='most_frequent')),
('encoder', onehot(sparse=False, drop='first')),
])
train_cat_trx = cat_pipeline.fit_transform(train_cat)
train_cat_trx
array([[1., 0., 1., ..., 0., 0., 0.],
[0., 0., 0., ..., 1., 0., 0.],
[0., 0., 1., ..., 0., 0., 0.],
...,
[0., 0., 1., ..., 0., 0., 0.],
[1., 0., 0., ..., 0., 0., 0.],
[1., 1., 0., ..., 0., 0., 0.]])
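One version note: the sparse argument used above was renamed to sparse_output in scikit-learn 1.2 and removed in 1.4, so on a recent install the encoder step would be written like this instead (everything else stays the same):
# For scikit-learn >= 1.2, use sparse_output instead of sparse
cat_pipeline = pl([
    ('imputer', si(strategy='most_frequent')),
    ('encoder', onehot(sparse_output=False, drop='first')),
])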
#Combine both pipelines
from sklearn.compose import ColumnTransformer as colt
num_attribs = list(train_num)
cat_attribs = list(train_cat)
full_pipeline = colt([
('num', num_pipeline, num_attribs),
('cat', cat_pipeline, cat_attribs),
])
3.4. Recheck and Rename
print(train_cat_trx.shape)
print(train_num_trx.shape)
(891, 9)
(891, 7)
X_train = full_pipeline.fit_transform(train)
X_test = full_pipeline.transform(test)
y_train = train_df['Survived']
y_test = submission_df['Survived']  # note: these are Kaggle's gender-baseline predictions, not the true test labels
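Since the true test labels are never released, a more honest local estimate comes from holding out part of the labelled training data; a minimal sketch (X_tr, X_val, y_tr and y_val are illustrative names):
from sklearn.model_selection import train_test_split

# Keep 20% of the labelled data aside for local validation
X_tr, X_val, y_tr, y_val = train_test_split(
    X_train, y_train, test_size=0.2, stratify=y_train, random_state=0)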
4. Modelling
Random forest, linear discriminant analysis, gradient boosting, logistic regression, a bagged SVC, and AdaBoost (with logistic regression as the base estimator).
from sklearn.ensemble import RandomForestClassifier as rfc
forest = rfc(random_state=0, n_estimators=500, max_leaf_nodes=16)
forest.fit(X_train, y_train)
forest_ts = forest.predict(X_test)
forest_ts
array([0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1,
1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1,
1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1,
1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1,
1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1,
0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1,
1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1,
0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0,
1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,
0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0,
0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1,
0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0,
1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0,
0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0,
1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1,
0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1],
dtype=int64)
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
lda = LinearDiscriminantAnalysis(solver='eigen', shrinkage='auto')
lda.fit(X_train, y_train)
lda_ts = lda.predict(X_test)
from sklearn.ensemble import GradientBoostingClassifier
gbc = GradientBoostingClassifier(random_state=0, n_estimators=500)
gbc.fit(X_train, y_train)
gbc_ts = gbc.predict(X_test)
from sklearn.linear_model import LogisticRegression
log = LogisticRegression(max_iter=500, random_state=0)
log.fit(X_train, y_train)
log_ts = log.predict(X_test)
from sklearn.svm import SVC
from sklearn.ensemble import BaggingClassifier
svc = BaggingClassifier(
SVC(C=0.7, gamma='auto', random_state=0), random_state=0, n_estimators=500)
svc.fit(X_train, y_train)
svc_ts = svc.predict(X_test)
from sklearn.ensemble import AdaBoostClassifier
ada = AdaBoostClassifier(
LogisticRegression(max_iter=500, random_state=0)
)
ada.fit(X_train, y_train)
AdaBoostClassifier(algorithm='SAMME.R',
base_estimator=LogisticRegression(C=1.0, class_weight=None,
dual=False,
fit_intercept=True,
intercept_scaling=1,
l1_ratio=None,
max_iter=500,
multi_class='auto',
n_jobs=None, penalty='l2',
random_state=0,
solver='lbfgs', tol=0.0001,
verbose=0,
warm_start=False),
learning_rate=1.0, n_estimators=50, random_state=None)
Let us see how each model scores against the gender_submission labels:
from sklearn.metrics import accuracy_score
for clf in (forest, lda, gbc, log, svc, ada):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))
RandomForestClassifier 0.9401913875598086
LinearDiscriminantAnalysis 0.9736842105263158
GradientBoostingClassifier 0.8301435406698564
LogisticRegression 0.937799043062201
BaggingClassifier 0.9425837320574163
AdaBoostClassifier 0.9449760765550239
These scores look too good to be true, and here is why: y_test comes from gender_submission.csv, which is just Kaggle's "all and only females survive" baseline, not the real test labels. The numbers above therefore measure how often each model agrees with that rule, not genuine test accuracy. The quick check below makes this explicit, and the cross-validation in the next section gives a more honest estimate.
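A quick check confirms that gender_submission.csv is exactly the "all and only females survive" rule (this should come out as 1.0 if the file is the standard Kaggle baseline):
# Fraction of rows where gender_submission matches the 'female survives' rule
(submission_df['Survived'] == (test_df['Sex'] == 'female').astype(int)).mean()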
5. Evaluation
Let's build a voting ensemble and estimate how the models actually generalise, using cross-validation on the training data.
from sklearn.ensemble import VotingClassifier
voting_clf = VotingClassifier(
estimators=[('rfc',forest), ('svc', svc), ('gbc', gbc), ('ada', ada)], voting='hard')
voting_clf.fit(X_train, y_train)
VotingClassifier(estimators=[('rfc',
RandomForestClassifier(bootstrap=True,
ccp_alpha=0.0,
class_weight=None,
criterion='gini',
max_depth=None,
max_features='auto',
max_leaf_nodes=16,
max_samples=None,
min_impurity_decrease=0.0,
min_impurity_split=None,
min_samples_leaf=1,
min_samples_split=2,
min_weight_fraction_leaf=0.0,
n_estimators=500,
n_jobs=None,
oob_score=...
base_estimator=LogisticRegression(C=1.0,
class_weight=None,
dual=False,
fit_intercept=True,
intercept_scaling=1,
l1_ratio=None,
max_iter=500,
multi_class='auto',
n_jobs=None,
penalty='l2',
random_state=0,
solver='lbfgs',
tol=0.0001,
verbose=0,
warm_start=False),
learning_rate=1.0,
n_estimators=50,
random_state=None))],
flatten_transform=True, n_jobs=None, voting='hard',
weights=None)
# let's compare the models with 5-fold cross-validation
from sklearn.model_selection import cross_val_score
forest_scores = cross_val_score(forest, X_train, y_train, cv=5)
print(forest_scores.mean())
lda_scores = cross_val_score(lda, X_train, y_train, cv=5)
print(lda_scores.mean())
gbc_scores = cross_val_score(gbc, X_train, y_train, cv=5)
print(gbc_scores.mean())
log_scores = cross_val_score(log, X_train, y_train, cv=5)
print(log_scores.mean())
svc_scores = cross_val_score(svc, X_train, y_train, cv=5)
print(svc_scores.mean())
ada_scores = cross_val_score(ada, X_train, y_train, cv=5)
print(ada_scores.mean())
vot_scores = cross_val_score(voting_clf, X_train, y_train, cv=5)
print(vot_scores.mean())
0.8248948590797817
0.8091959073504489
0.8383654510074697
0.8215366267026551
0.8249199673592367
0.8170547988199109
0.8282719226664993
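The same comparison can be written as one loop that also reports the spread across folds, which makes the models easier to compare at a glance (a compact alternative to the repeated calls above):
# Mean and standard deviation of the 5-fold CV accuracy for every model
for name, clf in [('forest', forest), ('lda', lda), ('gbc', gbc),
                  ('log', log), ('svc', svc), ('ada', ada), ('vote', voting_clf)]:
    scores = cross_val_score(clf, X_train, y_train, cv=5)
    print(f'{name}: {scores.mean():.4f} +/- {scores.std():.4f}')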
5.1. Feature Importance
names = list(train)
print("Features sorted by their score:")
# Caveat: the forest was fitted on the transformed matrix, where the categorical columns are
# one-hot encoded, so these original column names do not line up one-to-one with the importances.
print(sorted(zip(map(lambda x: round(x, 4), forest.feature_importances_), names),
             reverse=True))
Features sorted by their score:
[(0.2211, 'Ticket_1'), (0.0902, 'Pclass'), (0.084, 'Age'), (0.0764, 'Embarked'), (0.0561, 'Title'), (0.0446, 'Fare'), (0.0439, 'Sex'), (0.0226, 'Family'), (0.0086, 'Bin_Age'), (0.0044, 'Ticket_2')]
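Caveat: the forest was fitted on the output of full_pipeline, which has 16 columns (7 numeric plus 9 one-hot encoded), while list(train) has only 10 names, so the zip above silently truncates and misaligns names and importances. On a recent scikit-learn (roughly 1.1 or newer, where every step in these pipelines exposes get_feature_names_out) the transformed column names can be recovered and paired correctly; a sketch:
# Pair each importance with its post-pipeline column name
feature_names = full_pipeline.get_feature_names_out()
for score, name in sorted(zip(forest.feature_importances_, feature_names), reverse=True):
    print(f'{name}: {score:.4f}')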
6. Submission
Select the best-performing model and create the submission file.
y_pred = voting_clf.predict(X_test)
output = pd.DataFrame({'PassengerId':submission_df.PassengerId, 'Survived': y_pred})
output.to_csv('vote4.csv', index=False)
output.head()
| | PassengerId | Survived |
|---|---|---|
0 | 892 | 0 |
1 | 893 | 0 |
2 | 894 | 0 |
3 | 895 | 0 |
4 | 896 | 1 |
Find ways to get a higher score. This was my rank on Kaggle at the time.