Kaggle Prediction Exercise: Titanic
1. Import libraries and data
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set_style('whitegrid')
sns.set(rc = {'figure.figsize':(15, 10)})
import warnings
warnings.filterwarnings(action="ignore", message="internal gelsd")
warnings.simplefilter(action='ignore', category=FutureWarning)
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')
submission_df = pd.read_csv('gender_submission.csv')
2. Some data analysis
print(train_df.shape)
print(test_df.shape)
(891, 12)
(418, 11)
train_df['Survived'].value_counts()
0 549
1 342
Name: Survived, dtype: int64
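About 38% of the training passengers survived. If proportions are easier to read than raw counts, value_counts can normalise them; a quick check:
# Class balance as proportions (roughly 62% died, 38% survived)
train_df['Survived'].value_counts(normalize=True)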
train_df.head()
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
pd.options.display.max_columns = None  # show all columns when displaying wide frames
train_df.describe()
| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
count | 891.000000 | 891.000000 | 891.000000 | 714.000000 | 891.000000 | 891.000000 | 891.000000 |
mean | 446.000000 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
min | 1.000000 | 0.000000 | 1.000000 | 0.420000 | 0.000000 | 0.000000 | 0.000000 |
25% | 223.500000 | 0.000000 | 2.000000 | 20.125000 | 0.000000 | 0.000000 | 7.910400 |
50% | 446.000000 | 0.000000 | 3.000000 | 28.000000 | 0.000000 | 0.000000 | 14.454200 |
75% | 668.500000 | 1.000000 | 3.000000 | 38.000000 | 1.000000 | 0.000000 | 31.000000 |
max | 891.000000 | 1.000000 | 3.000000 | 80.000000 | 8.000000 | 6.000000 | 512.329200 |
train_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
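The info output already hints at the missing data (Age, Cabin and Embarked have fewer than 891 non-null entries); an explicit count per column makes it easier to see:
# Count of missing values per column, largest first
train_df.isnull().sum().sort_values(ascending=False)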
3. Feature Engineering
This is the process of getting the most predictive value out of the data provided.
3.1. Feature Extraction
* Create Title from the Name feature
# First, concatenate the train and test sets; the combined frame is used for this feature-engineering chapter only.
data = pd.concat([train_df, test_df])
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1309 entries, 0 to 417
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 1309 non-null int64
1 Survived 891 non-null float64
2 Pclass 1309 non-null int64
3 Name 1309 non-null object
4 Sex 1309 non-null object
5 Age 1046 non-null float64
6 SibSp 1309 non-null int64
7 Parch 1309 non-null int64
8 Ticket 1309 non-null object
9 Fare 1308 non-null float64
10 Cabin 295 non-null object
11 Embarked 1307 non-null object
dtypes: float64(3), int64(4), object(5)
memory usage: 132.9+ KB
import re
data['Title'] = data.Name.apply(lambda x: re.search(r' ([A-Z][a-z]+)\.', x).group(1))
data.Title.value_counts()
Mr 757
Miss 260
Mrs 197
Master 61
Rev 8
Dr 8
Col 4
Ms 2
Major 2
Mlle 2
Dona 1
Mme 1
Capt 1
Countess 1
Lady 1
Don 1
Sir 1
Jonkheer 1
Name: Title, dtype: int64
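As an aside, pandas' vectorised string methods can do the same extraction without an explicit apply; a small equivalent sketch (titles_alt is just an illustrative name):
# Vectorised alternative to the apply/lambda above; should give the same counts
titles_alt = data['Name'].str.extract(r' ([A-Z][a-z]+)\.', expand=False)
titles_alt.value_counts()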
# We can group these into seven titles: Mr, Mrs, Miss, Master, Other, Military and Nobility
data['Title'] = data['Title'].replace(
    {
        'Ms':'Miss', 'Mlle':'Miss',
        'Mme':'Mrs', 'Dona':'Mrs',
        'Don':'Mr',
        'Jonkheer':'Nobility', 'Lady':'Nobility', 'Sir':'Nobility', 'Countess':'Nobility',
        'Capt':'Military', 'Major':'Military', 'Col':'Military',
        'Rev':'Other', 'Dr':'Other'
    })
data['Title'].value_counts()
Mr 758
Miss 264
Mrs 199
Master 61
Other 16
Military 7
Nobility 4
Name: Title, dtype: int64
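A quick sanity check that the grouped titles carry signal. Only the 891 training rows have a Survived value (the test rows are NaN and are ignored by mean):
# Mean survival rate per grouped title, training rows only
data.groupby('Title')['Survived'].mean().sort_values(ascending=False)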
* Create Family from SibSp and Parch
data['Family'] = data['SibSp'] + data['Parch'] + 1
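Family counts the passenger plus their siblings/spouses (SibSp) and parents/children (Parch). A quick look at how survival varies with family size, again using only the labelled training rows:
# Number of labelled passengers and survival rate per family size
data.groupby('Family')['Survived'].agg(['count', 'mean'])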
* Extract Ticket_1 and Ticket_2 from the Ticket feature
# Ticket_1: the numeric part of the ticket string (raw string avoids escape warnings)
data['Ticket_1'] = data['Ticket'].map(lambda x: re.sub(r'\D', '', x))
data['Ticket_1'] = pd.to_numeric(data['Ticket_1'])
# Ticket_2: how many passengers share the same ticket
ticket = dict(data['Ticket'].value_counts())
data['Ticket_2'] = data['Ticket'].map(ticket)
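Tickets with no digits (such as 'LINE') end up with an empty numeric part, which is why Ticket_1 shows up as a float column with missing values. Ticket_2 is a rough proxy for travelling-group size; a quick look at the most shared tickets shows why that can matter:
# Most frequently shared tickets, i.e. the largest group bookings
data['Ticket'].value_counts().head()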
* Bin Fare and Age
data['Bin_Fare'] = pd.qcut(data.Fare, q=4, labels=False)
data['Bin_Age'] = pd.qcut(data.Age, q=10, labels=False)
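qcut assigns quartile labels for Fare and decile labels for Age; rows with a missing Age simply get a missing bin, which the imputer below fills. To see the actual cut points, qcut can also return the bin edges:
# Decile edges that qcut used for Bin_Age (for inspection only)
_, age_edges = pd.qcut(data.Age, q=10, retbins=True)
age_edges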
* Deck and Has_Cabin could be created from Cabin, but since Cabin is missing for most passengers it is simply dropped below. A quick sample of the engineered features:
data.sample(5)
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Title | Family | Ticket_1 | Ticket_2 | Bin_Fare | Bin_Age |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
684 | 685 | 0.0 | 2 | Brown, Mr. Thomas William Solomon | male | 60.0 | 1 | 1 | 29750 | 39.0000 | NaN | S | Mr | 3 | 29750.0 | 3 | 3.0 | 9.0 |
348 | 349 | 1.0 | 3 | Coutts, Master. William Loch "William" | male | 3.0 | 1 | 1 | C.A. 37671 | 15.9000 | NaN | S | Master | 3 | 37671.0 | 3 | 2.0 | 0.0 |
624 | 625 | 0.0 | 3 | Bowen, Mr. David John "Dai" | male | 21.0 | 0 | 0 | 54636 | 16.1000 | NaN | S | Mr | 1 | 54636.0 | 2 | 2.0 | 2.0 |
339 | 1231 | NaN | 3 | Betros, Master. Seman | male | NaN | 0 | 0 | 2622 | 7.2292 | NaN | C | Master | 1 | 2622.0 | 1 | 0.0 | NaN |
232 | 233 | 0.0 | 2 | Sjostedt, Mr. Ernst Adolf | male | 59.0 | 0 | 0 | 237442 | 13.5000 | NaN | S | Mr | 1 | 237442.0 | 1 | 1.0 | 9.0 |
3.2. Drop Features
data.columns
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked', 'Title', 'Family',
'Ticket_1', 'Ticket_2', 'Bin_Fare', 'Bin_Age'],
dtype='object')
# drop the target, identifiers, and raw columns that are no longer needed
data = data.drop(['Name', 'PassengerId', 'Survived', 'SibSp', 'Parch', 'Ticket', 'Cabin'], axis=1)
# Bin_Fare is dropped after inspecting the feature importances (Section 5.1)
data = data.drop(['Bin_Fare'], axis=1)
3.3. Feature Transformation
# split the combined frame back into train and test sets (concat preserved the row order)
train = data.iloc[:891]
test = data.iloc[891:]
train.head()
| | Pclass | Sex | Age | Fare | Embarked | Title | Family | Ticket_1 | Ticket_2 | Bin_Age |
|---|---|---|---|---|---|---|---|---|---|---|
0 | 3 | male | 22.0 | 7.2500 | S | Mr | 2 | 521171.0 | 1 | 2.0 |
1 | 1 | female | 38.0 | 71.2833 | C | Mrs | 2 | 17599.0 | 2 | 7.0 |
2 | 3 | female | 26.0 | 7.9250 | S | Miss | 1 | 23101282.0 | 1 | 4.0 |
3 | 1 | female | 35.0 | 53.1000 | S | Mrs | 2 | 113803.0 | 2 | 6.0 |
4 | 3 | male | 35.0 | 8.0500 | S | Mr | 1 | 373450.0 | 1 | 6.0 |
# select the numerical predictors
train_num = train.select_dtypes(include=[np.number])
train_num.head()
| | Pclass | Age | Fare | Family | Ticket_1 | Ticket_2 | Bin_Age |
|---|---|---|---|---|---|---|---|
0 | 3 | 22.0 | 7.2500 | 2 | 521171.0 | 1 | 2.0 |
1 | 1 | 38.0 | 71.2833 | 2 | 17599.0 | 2 | 7.0 |
2 | 3 | 26.0 | 7.9250 | 1 | 23101282.0 | 1 | 4.0 |
3 | 1 | 35.0 | 53.1000 | 2 | 113803.0 | 2 | 6.0 |
4 | 3 | 35.0 | 8.0500 | 1 | 373450.0 | 1 | 6.0 |
# Numeric pipeline: impute missing values with the median, then scale with RobustScaler
from sklearn.pipeline import Pipeline as pl
from sklearn.impute import SimpleImputer as si
from sklearn.preprocessing import RobustScaler
num_pipeline = pl([
('imputer', si(strategy='median')),
('scaler', RobustScaler()),
])
train_num_trx = num_pipeline.fit_transform(train_num)
train_num_trx
array([[ 0.00000000e+00, -4.61538462e-01, -3.12010602e-01, ...,
1.23510357e+00, 0.00000000e+00, -5.00000000e-01],
[-2.00000000e+00, 7.69230769e-01, 2.46124229e+00, ...,
-2.90816698e-01, 5.00000000e-01, 7.50000000e-01],
[ 0.00000000e+00, -1.53846154e-01, -2.82776661e-01, ...,
6.96571943e+01, 0.00000000e+00, 0.00000000e+00],
...,
[ 0.00000000e+00, 0.00000000e+00, 3.89603978e-01, ...,
-3.24124577e-01, 1.50000000e+00, 0.00000000e+00],
[-2.00000000e+00, -1.53846154e-01, 6.73281477e-01, ...,
-6.67551483e-03, 0.00000000e+00, 0.00000000e+00],
[ 0.00000000e+00, 3.07692308e-01, -2.90355831e-01, ...,
7.78165642e-01, 0.00000000e+00, 5.00000000e-01]])
# select the categorical (object-type) predictors
train_cat = train.select_dtypes(exclude=[np.number])
train_cat.head()
| | Sex | Embarked | Title |
|---|---|---|---|
0 | male | S | Mr |
1 | female | C | Mrs |
2 | female | S | Miss |
3 | female | S | Mrs |
4 | male | S | Mr |
# encode the categorical predictors: impute the most frequent value, then one-hot encode (dropping the first level)
from sklearn.preprocessing import OneHotEncoder as onehot
cat_pipeline = pl([
('imputer', si(strategy='most_frequent')),
('encoder', onehot(sparse=False, drop='first')),
])
train_cat_trx = cat_pipeline.fit_transform(train_cat)
train_cat_trx
array([[1., 0., 1., ..., 0., 0., 0.],
[0., 0., 0., ..., 1., 0., 0.],
[0., 0., 1., ..., 0., 0., 0.],
...,
[0., 0., 1., ..., 0., 0., 0.],
[1., 0., 0., ..., 0., 0., 0.],
[1., 1., 0., ..., 0., 0., 0.]])
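One version note: the sparse argument used above was renamed to sparse_output in scikit-learn 1.2 and removed in 1.4, so on a recent install the encoder step would be written like this instead (everything else stays the same):
# For scikit-learn >= 1.2, use sparse_output instead of sparse
cat_pipeline = pl([
    ('imputer', si(strategy='most_frequent')),
    ('encoder', onehot(sparse_output=False, drop='first')),
])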
#Combine both pipelines
from sklearn.compose import ColumnTransformer as colt
num_attribs = list(train_num)
cat_attribs = list(train_cat)
full_pipeline = colt([
('num', num_pipeline, num_attribs),
('cat', cat_pipeline, cat_attribs),
])
3.4. Recheck and Rename
print(train_cat_trx.shape)
print(train_num_trx.shape)
(891, 9)
(891, 7)
X_train = full_pipeline.fit_transform(train)
X_test = full_pipeline.transform(test)
y_train = train_df['Survived']
y_test = submission_df['Survived']  # note: these are Kaggle's gender-baseline predictions, not the true test labels
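Since the true test labels are never released, a more honest local estimate comes from holding out part of the labelled training data; a minimal sketch (X_tr, X_val, y_tr and y_val are illustrative names):
from sklearn.model_selection import train_test_split

# Keep 20% of the labelled data aside for local validation
X_tr, X_val, y_tr, y_val = train_test_split(
    X_train, y_train, test_size=0.2, stratify=y_train, random_state=0)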
4. Modelling
Random forest, linear discriminant analysis, gradient boosting, logistic regression, a bagged SVC, and AdaBoost (with logistic regression as the base estimator).
from sklearn.ensemble import RandomForestClassifier as rfc
forest = rfc(random_state=0, n_estimators=500, max_leaf_nodes=16)
forest.fit(X_train, y_train)
forest_ts = forest.predict(X_test)
forest_ts
array([0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1,
1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1,
1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1,
1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1,
1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1,
0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1,
1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1,
0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0,
1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,
0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0,
0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1,
0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0,
1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0,
0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0,
1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1,
0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1],
dtype=int64)
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
lda = LinearDiscriminantAnalysis(solver='eigen', shrinkage='auto')
lda.fit(X_train, y_train)
lda_ts = lda.predict(X_test)
from sklearn.ensemble import GradientBoostingClassifier
gbc = GradientBoostingClassifier(random_state=0, n_estimators=500)
gbc.fit(X_train, y_train)
gbc_ts = gbc.predict(X_test)
from sklearn.linear_model import LogisticRegression
log = LogisticRegression(max_iter=500, random_state=0)
log.fit(X_train, y_train)
log_ts = log.predict(X_test)
from sklearn.svm import SVC
from sklearn.ensemble import BaggingClassifier
svc = BaggingClassifier(
SVC(C=0.7, gamma='auto', random_state=0), random_state=0, n_estimators=500)
svc.fit(X_train, y_train)
svc_ts = svc.predict(X_test)
from sklearn.ensemble import AdaBoostClassifier
ada = AdaBoostClassifier(
LogisticRegression(max_iter=500, random_state=0)
)
ada.fit(X_train, y_train)
AdaBoostClassifier(algorithm='SAMME.R',
base_estimator=LogisticRegression(C=1.0, class_weight=None,
dual=False,
fit_intercept=True,
intercept_scaling=1,
l1_ratio=None,
max_iter=500,
multi_class='auto',
n_jobs=None, penalty='l2',
random_state=0,
solver='lbfgs', tol=0.0001,
verbose=0,
warm_start=False),
learning_rate=1.0, n_estimators=50, random_state=None)
Let us see how each model scores against the gender_submission labels:
from sklearn.metrics import accuracy_score
for clf in (forest, lda, gbc, log, svc, ada):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))
RandomForestClassifier 0.9401913875598086
LinearDiscriminantAnalysis 0.9736842105263158
GradientBoostingClassifier 0.8301435406698564
LogisticRegression 0.937799043062201
BaggingClassifier 0.9425837320574163
AdaBoostClassifier 0.9449760765550239
These scores look too good to be true, and here is why: y_test comes from gender_submission.csv, which is just Kaggle's "all and only females survive" baseline, not the real test labels. The numbers above therefore measure how often each model agrees with that rule, not genuine test accuracy. The quick check below makes this explicit, and the cross-validation in the next section gives a more honest estimate.
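A quick check confirms that gender_submission.csv is exactly the "all and only females survive" rule (this should come out as 1.0 if the file is the standard Kaggle baseline):
# Fraction of rows where gender_submission matches the 'female survives' rule
(submission_df['Survived'] == (test_df['Sex'] == 'female').astype(int)).mean()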
5. Evaluation
Let's build a voting ensemble and estimate how the models actually generalise, using cross-validation on the training data.
from sklearn.ensemble import VotingClassifier
voting_clf = VotingClassifier(
estimators=[('rfc',forest), ('svc', svc), ('gbc', gbc), ('ada', ada)], voting='hard')
voting_clf.fit(X_train, y_train)
VotingClassifier(estimators=[('rfc',
RandomForestClassifier(bootstrap=True,
ccp_alpha=0.0,
class_weight=None,
criterion='gini',
max_depth=None,
max_features='auto',
max_leaf_nodes=16,
max_samples=None,
min_impurity_decrease=0.0,
min_impurity_split=None,
min_samples_leaf=1,
min_samples_split=2,
min_weight_fraction_leaf=0.0,
n_estimators=500,
n_jobs=None,
oob_score=...
base_estimator=LogisticRegression(C=1.0,
class_weight=None,
dual=False,
fit_intercept=True,
intercept_scaling=1,
l1_ratio=None,
max_iter=500,
multi_class='auto',
n_jobs=None,
penalty='l2',
random_state=0,
solver='lbfgs',
tol=0.0001,
verbose=0,
warm_start=False),
learning_rate=1.0,
n_estimators=50,
random_state=None))],
flatten_transform=True, n_jobs=None, voting='hard',
weights=None)
# let's compare the models with 5-fold cross-validation
from sklearn.model_selection import cross_val_score
forest_scores = cross_val_score(forest, X_train, y_train, cv=5)
print(forest_scores.mean())
lda_scores = cross_val_score(lda, X_train, y_train, cv=5)
print(lda_scores.mean())
gbc_scores = cross_val_score(gbc, X_train, y_train, cv=5)
print(gbc_scores.mean())
log_scores = cross_val_score(log, X_train, y_train, cv=5)
print(log_scores.mean())
svc_scores = cross_val_score(svc, X_train, y_train, cv=5)
print(svc_scores.mean())
ada_scores = cross_val_score(ada, X_train, y_train, cv=5)
print(ada_scores.mean())
vot_scores = cross_val_score(voting_clf, X_train, y_train, cv=5)
print(vot_scores.mean())
0.8248948590797817
0.8091959073504489
0.8383654510074697
0.8215366267026551
0.8249199673592367
0.8170547988199109
0.8282719226664993
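The same comparison can be written as one loop that also reports the spread across folds, which makes the models easier to compare at a glance (a compact alternative to the repeated calls above):
# Mean and standard deviation of the 5-fold CV accuracy for every model
for name, clf in [('forest', forest), ('lda', lda), ('gbc', gbc),
                  ('log', log), ('svc', svc), ('ada', ada), ('vote', voting_clf)]:
    scores = cross_val_score(clf, X_train, y_train, cv=5)
    print(f'{name}: {scores.mean():.4f} +/- {scores.std():.4f}')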
5.1. Feature Importance
names = list(train)
print("Features sorted by their score:")
# Caveat: the forest was fitted on the transformed matrix, where the categorical columns are
# one-hot encoded, so these original column names do not line up one-to-one with the importances.
print(sorted(zip(map(lambda x: round(x, 4), forest.feature_importances_), names),
             reverse=True))
Features sorted by their score:
[(0.2211, 'Ticket_1'), (0.0902, 'Pclass'), (0.084, 'Age'), (0.0764, 'Embarked'), (0.0561, 'Title'), (0.0446, 'Fare'), (0.0439, 'Sex'), (0.0226, 'Family'), (0.0086, 'Bin_Age'), (0.0044, 'Ticket_2')]
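Caveat: the forest was fitted on the output of full_pipeline, which has 16 columns (7 numeric plus 9 one-hot encoded), while list(train) has only 10 names, so the zip above silently truncates and misaligns names and importances. On a recent scikit-learn (roughly 1.1 or newer, where every step in these pipelines exposes get_feature_names_out) the transformed column names can be recovered and paired correctly; a sketch:
# Pair each importance with its post-pipeline column name
feature_names = full_pipeline.get_feature_names_out()
for score, name in sorted(zip(forest.feature_importances_, feature_names), reverse=True):
    print(f'{name}: {score:.4f}')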
6. Submission
Select the best-performing model and create the submission file.
y_pred = voting_clf.predict(X_test)
output = pd.DataFrame({'PassengerId':submission_df.PassengerId, 'Survived': y_pred})
output.to_csv('vote4.csv', index=False)
output.head()
| | PassengerId | Survived |
|---|---|---|
0 | 892 | 0 |
1 | 893 | 0 |
2 | 894 | 0 |
3 | 895 | 0 |
4 | 896 | 1 |
Find ways to get a higher score. This was my rank on Kaggle at the time.