How to predict Ghosts?

Yesterday Kaggle finished a competition on recognising Ghosts, Goblins and Ghouls. Kaggle is a platform where data scientists can challenge themselves in competitions, which are set up either by companies that want a real problem solved or by Kaggle itself, just for fun. This competition was just for fun.
It was easy to win: you just had to overfit the model to the leaderboard. But in this post I'll show you how to start working with the data and make a prediction.

Let's start!

What we will use: Python with pandas and scikit-learn (sklearn).
First, read the data:

import pandas as pd
df = pd.read_csv('train.csv')

We can print the head of the DataFrame with df.head() to see what kind of data it holds:

   id  bone_length  rotting_flesh  hair_length  has_soul  color    type
0   0     0.354512       0.350839     0.465761  0.781142  clear   Ghoul
1   1     0.575560       0.425868     0.531401  0.439899  green  Goblin
2   2     0.467875       0.354330     0.811616  0.791225  black   Ghoul
3   4     0.776652       0.508723     0.636766  0.884464  black   Ghoul
4   5     0.566117       0.875862     0.418594  0.636438  green   Ghost
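As a small sketch of this inspection step, the snippet below rebuilds the five rows shown above by hand, so it runs without train.csv, and looks at the column types and the label counts. The variable names `sample` and `numeric_cols` are only illustrative.

```python
import pandas as pd

# The five rows shown above, typed in by hand so the sketch runs without train.csv.
sample = pd.DataFrame({
    "id": [0, 1, 2, 4, 5],
    "bone_length": [0.354512, 0.575560, 0.467875, 0.776652, 0.566117],
    "rotting_flesh": [0.350839, 0.425868, 0.354330, 0.508723, 0.875862],
    "hair_length": [0.465761, 0.531401, 0.811616, 0.636766, 0.418594],
    "has_soul": [0.781142, 0.439899, 0.791225, 0.884464, 0.636438],
    "color": ["clear", "green", "black", "black", "green"],
    "type": ["Ghoul", "Goblin", "Ghoul", "Ghoul", "Ghost"],
})

print(sample.head())  # first rows of the frame

# Columns with a numeric dtype (this includes the id column)
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
print(numeric_cols)

# How many monsters of each type are in the sample
print(sample["type"].value_counts())
```

With the real file, the same three calls on df give a quick picture of which columns are numeric and how balanced the three classes are.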

Our job is to predict the type.
The first step is to split the data into training and test sets.

from sklearn.model_selection import train_test_split # sklearn.cross_validation was removed in newer scikit-learn versions
col_name = ['bone_length', 'rotting_flesh', 'hair_length', 'has_soul'] # we use only numeric data
X_train, X_test, y_train, y_test = train_test_split(df[col_name], df['type']) # try to predict type

Now X_train holds 75% of our set, because train_test_split keeps 25% of the rows for testing by default. I only used the numeric columns.
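The split can also be made explicit and reproducible. The sketch below is not the call used in this post: it runs on random stand-in data, and the row count of 371, the `random_state` value and the use of `stratify` are assumptions made for illustration. The row count is picked so that a 25% test share gives 93 test rows, the support total in the classification reports.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Random stand-in for train.csv (assumed size: 371 rows, 4 numeric features).
rng = np.random.default_rng(0)
col_name = ["bone_length", "rotting_flesh", "hair_length", "has_soul"]
fake_df = pd.DataFrame(rng.random((371, 4)), columns=col_name)
fake_df["type"] = rng.choice(["Ghost", "Ghoul", "Goblin"], size=371)

X_train, X_test, y_train, y_test = train_test_split(
    fake_df[col_name],
    fake_df["type"],
    test_size=0.25,            # the default share when no size is given
    random_state=42,           # fixed seed so the split is repeatable
    stratify=fake_df["type"],  # keep class proportions similar in both parts
)

print(len(X_train), len(X_test))
```

Fixing `random_state` matters here because, with only a few hundred rows, the scores can move noticeably from one random split to the next.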
Now we can build the first model, a baseline. Its score is the one to beat, and after building it I will try to improve on it.

from sklearn.metrics import accuracy_score, classification_report
from sklearn.linear_model import SGDClassifier # import classifier SGD
base_line = SGDClassifier() # create an instance of classifier
base_line.fit(X_train, y_train) # train classifier
predict = base_line.predict(X_test) # make prediction
print(accuracy_score(y_pred=predict, y_true=y_test)) # check the accuracy
print(classification_report(y_pred=predict, y_true=y_test)) # print the report

The most interesting thing is the classification report, which looks like this:

             precision    recall  f1-score   support
      Ghost       0.96      0.76      0.85        29
      Ghoul       0.91      0.31      0.47        32
     Goblin       0.51      0.94      0.66        32
avg / total       0.79      0.67      0.65        93

The accuracy score (this is the score we want to beat) is about 0.67, the same as the support-weighted average recall in the report above.
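The baseline accuracy can be recovered from the report itself, because overall accuracy equals the support-weighted mean of the per-class recalls. The helper name `accuracy_from_report` is made up for this sketch, and the result is only as precise as the two-decimal values in the report.

```python
# (recall, support) per class, copied from the baseline classification report
report = {"Ghost": (0.76, 29), "Ghoul": (0.31, 32), "Goblin": (0.94, 32)}


def accuracy_from_report(report):
    """Support-weighted mean recall, which equals overall accuracy."""
    total = sum(support for _, support in report.values())
    correct = sum(recall * support for recall, support in report.values())
    return correct / total


print(round(accuracy_from_report(report), 2))  # 0.67
```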

To improve this result I decided to use SVM.

from sklearn.svm import SVC # import SVC
svm = SVC() # use default settings
svm.fit(X_train, y_train) # train classifier
predict = svm.predict(X_test)
print(accuracy_score(y_pred=predict, y_true=y_test))
print(classification_report(y_pred=predict, y_true=y_test))

The result:

             precision    recall  f1-score   support
      Ghost       0.75      0.93      0.83        29
      Ghoul       0.79      0.84      0.82        32
     Goblin       0.74      0.53      0.62        32
avg / total       0.76      0.76      0.75        93

The accuracy is about 0.76 (the support-weighted average recall in the report), which is better than the previous one.
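A single random split can flatter one model by chance, so one hedged way to check that SVC really beats the SGD baseline is k-fold cross-validation. The sketch below uses synthetic data from `make_classification` as a stand-in for train.csv (the sample size, seed and fold count are assumptions), so the numbers it prints say nothing about the real competition data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic 3-class problem with 4 numeric features, standing in for the real data.
X, y = make_classification(
    n_samples=371, n_features=4, n_informative=4, n_redundant=0,
    n_classes=3, random_state=0,
)

models = {"SGD": SGDClassifier(random_state=0), "SVC": SVC()}
scores = {}
for name, model in models.items():
    # Mean accuracy over 5 folds instead of a single train/test split
    fold_scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    scores[name] = fold_scores.mean()
    print(name, round(scores[name], 3))
```

On the real data, X and y would be df[col_name] and df['type'].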

To beat this result, we can try a different classifier, e.g. XGBoost, a random forest or a neural network, or we can build one classifier out of a set of classifiers (an ensemble).
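A minimal sketch of the "set of classifiers" idea, assuming scikit-learn's `VotingClassifier` with majority (hard) voting over the two models from this post plus a random forest. The synthetic data, seeds and choice of members are illustrative assumptions, not a tuned solution for the competition.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic 3-class problem with 4 numeric features, standing in for train.csv.
X, y = make_classification(
    n_samples=371, n_features=4, n_informative=4, n_redundant=0,
    n_classes=3, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each member votes for a class; the majority class is the prediction.
ensemble = VotingClassifier(
    estimators=[
        ("sgd", SGDClassifier(random_state=0)),
        ("svm", SVC()),
        ("forest", RandomForestClassifier(n_estimators=100, random_state=0)),
    ],
    voting="hard",
)
ensemble.fit(X_train, y_train)

ensemble_accuracy = accuracy_score(y_true=y_test, y_pred=ensemble.predict(X_test))
print(ensemble_accuracy)
```

Hard voting is used because it only needs class predictions, so any mix of classifiers can take part without having to produce probabilities.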

Rafał Prońko

Data Scientist
