Kaggle Prediction Exercise: Titanic


1. Import libraries and data

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set_style('whitegrid')
sns.set(rc = {'figure.figsize':(15, 10)})

import warnings
warnings.filterwarnings(action="ignore", message="internal gelsd")
warnings.simplefilter(action='ignore', category=FutureWarning)

train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')
submission_df = pd.read_csv('gender_submission.csv')

2. Some data analysis

print(train_df.shape)
print(test_df.shape)
(891, 12)
(418, 11)
train_df['Survived'].value_counts()
0    549
1    342
Name: Survived, dtype: int64
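Seaborn is already configured above, so a quick count plot (a minimal example) shows the class balance and how strongly survival splits by Sex:

sns.countplot(data=train_df, x='Survived', hue='Sex')
plt.title('Survival count by sex')
plt.show()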
train_df.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
pd.options.display.max_columns  # check how many columns pandas will display by default
train_df.describe()
PassengerId Survived Pclass Age SibSp Parch Fare
count 891.000000 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000
mean 446.000000 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208
std 257.353842 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429
min 1.000000 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
25% 223.500000 0.000000 2.000000 20.125000 0.000000 0.000000 7.910400
50% 446.000000 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
75% 668.500000 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000
max 891.000000 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200
train_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object
 4   Sex          891 non-null    object
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object
 11  Embarked     889 non-null    object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

3. Feature Engineering

This involves getting the most out of the data we are given.

3.1. Feature Extraction

*Create Title from Name feature

#First, concatenate the train and test sets so the feature engineering is applied to both at once; they are split back apart before modelling.
data = pd.concat([train_df, test_df])
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1309 entries, 0 to 417
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  1309 non-null   int64  
 1   Survived     891 non-null    float64
 2   Pclass       1309 non-null   int64  
 3   Name         1309 non-null   object
 4   Sex          1309 non-null   object
 5   Age          1046 non-null   float64
 6   SibSp        1309 non-null   int64  
 7   Parch        1309 non-null   int64  
 8   Ticket       1309 non-null   object
 9   Fare         1308 non-null   float64
 10  Cabin        295 non-null    object
 11  Embarked     1307 non-null   object
dtypes: float64(3), int64(4), object(5)
memory usage: 132.9+ KB
import re
# extract the title: the word immediately before a period in the name, e.g. 'Braund, Mr. Owen Harris' -> 'Mr'
data['Title'] = data.Name.apply(lambda x: re.search(r' ([A-Z][a-z]+)\.', x).group(1))
data.Title.value_counts()
Mr          757
Miss        260
Mrs         197
Master       61
Rev           8
Dr            8
Col           4
Ms            2
Major         2
Mlle          2
Dona          1
Mme           1
Capt          1
Countess      1
Lady          1
Don           1
Sir           1
Jonkheer      1
Name: Title, dtype: int64
#We can collapse these into seven groups: Mr, Mrs, Miss, Master, Other, Military and Nobility
data['Title'] = data['Title'].replace(
{
    'Ms':'Miss', 'Mlle':'Miss',
    'Mme':'Mrs', 'Dona':'Mrs',
    'Don':'Mr',
    'Jonkheer':'Nobility', 'Lady':'Nobility', 'Sir':'Nobility', 'Countess':'Nobility',
    'Capt':'Military', 'Major':'Military', 'Col':'Military',
    'Rev':'Other', 'Dr':'Other'
})
data['Title'].value_counts()
Mr          758
Miss        264
Mrs         199
Master       61
Other        16
Military      7
Nobility      4
Name: Title, dtype: int64

*Create Family from SibSp and Parch

# family size = siblings/spouses aboard + parents/children aboard + the passenger themselves
data['Family'] = data['SibSp'] + data['Parch'] + 1

*Derive Ticket_1 and Ticket_2 from Ticket

# Ticket_1: the numeric part of the ticket (all non-digit characters stripped)
data['Ticket_1'] = data['Ticket'].map(lambda x: re.sub(r'\D', '', x))
data['Ticket_1'] = pd.to_numeric(data['Ticket_1'])

# Ticket_2: how many passengers share the same ticket (ticket group size)
ticket = dict(data['Ticket'].value_counts())
data['Ticket_2'] = data['Ticket'].map(ticket)

*Binning Fare and Age

# quantile-based bins: quartiles for Fare, deciles for Age (labels=False returns integer bin codes)
data['Bin_Fare'] = pd.qcut(data.Fare, q=4, labels=False)
data['Bin_Age'] = pd.qcut(data.Age, q=10, labels=False)

*Create Deck and Has_Cabin from Cabin

Cabin is missing for most passengers, and in this notebook it is dropped untransformed in the next step, so Deck and Has_Cabin never make it into the final feature set; a rough sketch of how they could be derived is shown below.
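A minimal sketch on a copy of the data, assuming the usual convention that the first character of the cabin label is the deck:

# work on a copy so the working dataframe and the outputs below are unaffected
cabin_demo = data.copy()
# Has_Cabin: 1 if a cabin is recorded for the passenger, else 0
cabin_demo['Has_Cabin'] = cabin_demo['Cabin'].notna().astype(int)
# Deck: first letter of the cabin label, e.g. 'C85' -> 'C' (missing cabins stay NaN)
cabin_demo['Deck'] = cabin_demo['Cabin'].str[0]

A random sample of the features engineered so far: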

data.sample(5)
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked Title Family Ticket_1 Ticket_2 Bin_Fare Bin_Age
684 685 0.0 2 Brown, Mr. Thomas William Solomon male 60.0 1 1 29750 39.0000 NaN S Mr 3 29750.0 3 3.0 9.0
348 349 1.0 3 Coutts, Master. William Loch "William" male 3.0 1 1 C.A. 37671 15.9000 NaN S Master 3 37671.0 3 2.0 0.0
624 625 0.0 3 Bowen, Mr. David John "Dai" male 21.0 0 0 54636 16.1000 NaN S Mr 1 54636.0 2 2.0 2.0
339 1231 NaN 3 Betros, Master. Seman male NaN 0 0 2622 7.2292 NaN C Master 1 2622.0 1 0.0 NaN
232 233 0.0 2 Sjostedt, Mr. Ernst Adolf male 59.0 0 0 237442 13.5000 NaN S Mr 1 237442.0 1 1.0 9.0

3.2. Drop Features

data.columns
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked', 'Title', 'Family',
       'Ticket_1', 'Ticket_2', 'Bin_Fare', 'Bin_Age'],
      dtype='object')
#drop features suspected to be unimportant
data = data.drop(['Name', 'PassengerId', 'Survived', 'SibSp', 'Parch', 'Ticket', 'Cabin'], axis=1)
#drop after inspecting feature importances (section 5.1)
data = data.drop(['Bin_Fare'], axis=1)

3.3. Feature Transform

#split the combined dataset back into train and test sets
train = data.iloc[:891]
test = data.iloc[891:]
train.head()
Pclass Sex Age Fare Embarked Title Family Ticket_1 Ticket_2 Bin_Age
0 3 male 22.0 7.2500 S Mr 2 521171.0 1 2.0
1 1 female 38.0 71.2833 C Mrs 2 17599.0 2 7.0
2 3 female 26.0 7.9250 S Miss 1 23101282.0 1 4.0
3 1 female 35.0 53.1000 S Mrs 2 113803.0 2 6.0
4 3 male 35.0 8.0500 S Mr 1 373450.0 1 6.0
#create a dataframe of the numeric predictors only
train_num = train.select_dtypes(include=[np.number])
train_num.head()
Pclass Age Fare Family Ticket_1 Ticket_2 Bin_Age
0 3 22.0 7.2500 2 521171.0 1 2.0
1 1 38.0 71.2833 2 17599.0 2 7.0
2 3 26.0 7.9250 1 23101282.0 1 4.0
3 1 35.0 53.1000 2 113803.0 2 6.0
4 3 35.0 8.0500 1 373450.0 1 6.0
# I am going to create a numeric pipeline that imputes missing values with the median and then scales the data points with RobustScaler
from sklearn.pipeline import Pipeline as pl
from sklearn.impute import SimpleImputer as si
from sklearn.preprocessing import RobustScaler

num_pipeline = pl([
    ('imputer', si(strategy='median')),
    ('scaler', RobustScaler()),
])

train_num_trx = num_pipeline.fit_transform(train_num)
train_num_trx
array([[ 0.00000000e+00, -4.61538462e-01, -3.12010602e-01, ...,
         1.23510357e+00,  0.00000000e+00, -5.00000000e-01],
       [-2.00000000e+00,  7.69230769e-01,  2.46124229e+00, ...,
        -2.90816698e-01,  5.00000000e-01,  7.50000000e-01],
       [ 0.00000000e+00, -1.53846154e-01, -2.82776661e-01, ...,
         6.96571943e+01,  0.00000000e+00,  0.00000000e+00],
       ...,
       [ 0.00000000e+00,  0.00000000e+00,  3.89603978e-01, ...,
        -3.24124577e-01,  1.50000000e+00,  0.00000000e+00],
       [-2.00000000e+00, -1.53846154e-01,  6.73281477e-01, ...,
        -6.67551483e-03,  0.00000000e+00,  0.00000000e+00],
       [ 0.00000000e+00,  3.07692308e-01, -2.90355831e-01, ...,
         7.78165642e-01,  0.00000000e+00,  5.00000000e-01]])
#create a dataframe of the categorical (object-type) predictors
train_cat = train.select_dtypes(exclude=[np.number])
train_cat.head()
Sex Embarked Title
0 male S Mr
1 female C Mrs
2 female S Miss
3 female S Mrs
4 male S Mr
# impute the most frequent category, then one-hot encode the categorical predictors
from sklearn.preprocessing import OneHotEncoder as onehot

cat_pipeline = pl([
    ('imputer', si(strategy='most_frequent')),
    ('encoder', onehot(sparse=False, drop='first')),
])
train_cat_trx = cat_pipeline.fit_transform(train_cat)
train_cat_trx
array([[1., 0., 1., ..., 0., 0., 0.],
       [0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       ...,
       [0., 0., 1., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.],
       [1., 1., 0., ..., 0., 0., 0.]])
#Combine both pipelines
from sklearn.compose import ColumnTransformer as colt

num_attribs = list(train_num)
cat_attribs = list(train_cat)

full_pipeline = colt([
    ('num', num_pipeline, num_attribs),
    ('cat', cat_pipeline, cat_attribs),
])

3.4. Recheck and Combine

print(train_cat_trx.shape)
print(train_num_trx.shape)
(891, 9)
(891, 7)
X_train = full_pipeline.fit_transform(train)
X_test = full_pipeline.transform(test)
y_train = train_df['Survived']
y_test = submission_df['Survived']  # note: taken from Kaggle's sample submission, not the true test labels
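The combined matrices should stack the 7 numeric and 9 encoded categorical columns printed above; a minimal sanity check:

print(X_train.shape)  # expected: (891, 16)
print(X_test.shape)   # expected: (418, 16)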

4. Modelling

Random forest, linear discriminant analysis, gradient boosting, logistic regression, a bagged SVC, and AdaBoost (with a logistic regression base estimator).

from sklearn.ensemble import RandomForestClassifier as rfc


forest = rfc(random_state=0, n_estimators=500, max_leaf_nodes=16)
forest.fit(X_train, y_train)
forest_ts = forest.predict(X_test)
forest_ts
array([0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1,
       1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1,
       1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1,
       1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1,
       1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1,
       0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1,
       1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1,
       0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0,
       1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
       0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,
       0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1,
       0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0,
       1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0,
       0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0,
       1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1,
       0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1],
      dtype=int64)
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

lda = LinearDiscriminantAnalysis(solver='eigen', shrinkage='auto')
lda.fit(X_train, y_train)
lda_ts = lda.predict(X_test)
from sklearn.ensemble import GradientBoostingClassifier

gbc = GradientBoostingClassifier(random_state=0, n_estimators=500)
gbc.fit(X_train, y_train)
gbc_ts = gbc.predict(X_test)
from sklearn.linear_model import LogisticRegression

log = LogisticRegression(max_iter=500, random_state=0)
log.fit(X_train, y_train)
log_ts = log.predict(X_test)
from sklearn.svm import SVC
from sklearn.ensemble import BaggingClassifier

svc = BaggingClassifier(
    SVC(C=0.7, gamma='auto', random_state=0), random_state=0, n_estimators=500)

svc.fit(X_train, y_train)
svc_ts = svc.predict(X_test)
from sklearn.ensemble import AdaBoostClassifier

ada = AdaBoostClassifier(
     LogisticRegression(max_iter=500, random_state=0)
)
ada.fit(X_train, y_train)
AdaBoostClassifier(algorithm='SAMME.R',
                   base_estimator=LogisticRegression(C=1.0, class_weight=None,
                                                     dual=False,
                                                     fit_intercept=True,
                                                     intercept_scaling=1,
                                                     l1_ratio=None,
                                                     max_iter=500,
                                                     multi_class='auto',
                                                     n_jobs=None, penalty='l2',
                                                     random_state=0,
                                                     solver='lbfgs', tol=0.0001,
                                                     verbose=0,
                                                     warm_start=False),
                   learning_rate=1.0, n_estimators=50, random_state=None)

Let us see how they all performed against y_test.

from sklearn.metrics import accuracy_score

for clf in (forest, lda, gbc, log, svc, ada):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_pred, y_test))
RandomForestClassifier 0.9401913875598086
LinearDiscriminantAnalysis 0.9736842105263158
GradientBoostingClassifier 0.8301435406698564
LogisticRegression 0.937799043062201
BaggingClassifier 0.9425837320574163
AdaBoostClassifier 0.9449760765550239

These scores look suspiciously high, but the reason is not overfitting: y_test comes from gender_submission.csv, Kaggle's sample submission in which every female passenger survives and every male does not. The accuracies above therefore measure how closely each model agrees with that gender-only baseline, not how well it predicts the real (hidden) test labels.
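To see this, a rule that predicts survival purely from Sex matches gender_submission.csv exactly (a small check, assuming test_df and submission_df are row-aligned, as the Kaggle downloads are):

naive_pred = (test_df['Sex'] == 'female').astype(int)
print(accuracy_score(submission_df['Survived'], naive_pred))  # 1.0 by construction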

5. Evaluation

Let's build a voting ensemble and then use cross-validation on the training data for a more realistic estimate of how all the models generalize.

from sklearn.ensemble import VotingClassifier

voting_clf = VotingClassifier(
    estimators=[('rfc',forest), ('svc', svc), ('gbc', gbc), ('ada', ada)], voting='hard')

voting_clf.fit(X_train, y_train)
VotingClassifier(estimators=[('rfc',
                              RandomForestClassifier(bootstrap=True,
                                                     ccp_alpha=0.0,
                                                     class_weight=None,
                                                     criterion='gini',
                                                     max_depth=None,
                                                     max_features='auto',
                                                     max_leaf_nodes=16,
                                                     max_samples=None,
                                                     min_impurity_decrease=0.0,
                                                     min_impurity_split=None,
                                                     min_samples_leaf=1,
                                                     min_samples_split=2,
                                                     min_weight_fraction_leaf=0.0,
                                                     n_estimators=500,
                                                     n_jobs=None,
                                                     oob_score=...
                                                 base_estimator=LogisticRegression(C=1.0,
                                                                                   class_weight=None,
                                                                                   dual=False,
                                                                                   fit_intercept=True,
                                                                                   intercept_scaling=1,
                                                                                   l1_ratio=None,
                                                                                   max_iter=500,
                                                                                   multi_class='auto',
                                                                                   n_jobs=None,
                                                                                   penalty='l2',
                                                                                   random_state=0,
                                                                                   solver='lbfgs',
                                                                                   tol=0.0001,
                                                                                   verbose=0,
                                                                                   warm_start=False),
                                                 learning_rate=1.0,
                                                 n_estimators=50,
                                                 random_state=None))],
                 flatten_transform=True, n_jobs=None, voting='hard',
                 weights=None)
#let's compare all the models with 5-fold cross-validation on the training set
from sklearn.model_selection import cross_val_score

forest_scores = cross_val_score(forest,  X_train, y_train, cv=5)
print(forest_scores.mean())

lda_scores = cross_val_score(lda, X_train, y_train, cv=5)
print(lda_scores.mean())

gbc_scores = cross_val_score(gbc, X_train, y_train, cv=5)
print(gbc_scores.mean())

log_scores = cross_val_score(log,  X_train, y_train, cv=5)
print(log_scores.mean())

svc_scores = cross_val_score(svc,  X_train, y_train, cv=5)
print(svc_scores.mean())

ada_scores = cross_val_score(ada,  X_train, y_train, cv=5)
print(ada_scores.mean())

vot_scores = cross_val_score(voting_clf,  X_train, y_train, cv=5)
print(vot_scores.mean())
0.8248948590797817
0.8091959073504489
0.8383654510074697
0.8215366267026551
0.8249199673592367
0.8170547988199109
0.8282719226664993

5.1. Feature Importance

names = list(train)
print ("Features sorted by their score:")
print (sorted(zip(map(lambda x: round(x, 4), forest.feature_importances_), names),
             reverse=True))
Features sorted by their score:
[(0.2211, 'Ticket_1'), (0.0902, 'Pclass'), (0.084, 'Age'), (0.0764, 'Embarked'), (0.0561, 'Title'), (0.0446, 'Fare'), (0.0439, 'Sex'), (0.0226, 'Family'), (0.0086, 'Bin_Age'), (0.0044, 'Ticket_2')]
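One caveat: forest was fitted on the output of full_pipeline, whose columns are the 7 numeric features followed by the 9 one-hot encoded categorical columns, whereas list(train) gives the original 10 column names in a different order. zip therefore pairs importances with the wrong names and silently drops the last encoded columns, so the ranking above is only indicative. A sketch of a properly aligned name list (get_feature_names is the encoder method in the scikit-learn release this notebook appears to use; newer releases call it get_feature_names_out):

# column order out of full_pipeline: numeric attributes first, then the one-hot columns
encoder = full_pipeline.named_transformers_['cat'].named_steps['encoder']
cat_names = list(encoder.get_feature_names(cat_attribs))  # e.g. 'Sex_male', 'Embarked_Q', ...
aligned_names = num_attribs + cat_names
print("Features sorted by their score:")
print(sorted(zip(map(lambda x: round(x, 4), forest.feature_importances_), aligned_names), reverse=True))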

6. Submission

Pick a final model (here the voting classifier) and submit its predictions.

y_pred = voting_clf.predict(X_test)
output = pd.DataFrame({'PassengerId':submission_df.PassengerId, 'Survived': y_pred})
output.to_csv('vote4.csv', index=False)
output.head()
PassengerId Survived
0 892 0
1 893 0
2 894 0
3 895 0
4 896 1

Find ways to get a higher score. This was my rank on Kaggle.

