Yesterday Kaggle wrapped up a competition on recognising Ghosts, Goblins, and Ghouls. Kaggle is a platform where data scientists can challenge themselves in competitions, either to solve a real problem someone is experiencing or one set by Kaggle itself, just for fun. This competition was just for fun.
It was easy to win - you just had to overfit your model to the leaderboard. But in this post I'll show you how to start working with the data and make a first prediction.
Let's start!
What we will use: Python with Pandas and Sklearn.
First, read the data:

```python
import pandas as pd

df = pd.read_csv('train.csv')  # the competition's training set
```
We can look at the head of the DataFrame to see what kind of data the dataset contains:

```python
df.head()
```
| | id | bone_length | rotting_flesh | hair_length | has_soul | color | type |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.354512 | 0.350839 | 0.465761 | 0.781142 | clear | Ghoul |
| 1 | 1 | 0.575560 | 0.425868 | 0.531401 | 0.439899 | green | Goblin |
| 2 | 2 | 0.467875 | 0.354330 | 0.811616 | 0.791225 | black | Ghoul |
| 3 | 4 | 0.776652 | 0.508723 | 0.636766 | 0.884464 | black | Ghoul |
| 4 | 5 | 0.566117 | 0.875862 | 0.418594 | 0.636438 | green | Ghost |
Our job is to predict the type.
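Before modelling, a quick look at the target itself is worthwhile. This check is my addition, not part of the original walkthrough; it assumes the `df` loaded above:

```python
# How balanced are the three classes, and is anything missing?
print(df['type'].value_counts())
print(df.isnull().sum())
```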
The first step is to split the data into training and test sets.

```python
from sklearn.model_selection import train_test_split  # sklearn.cross_validation is long deprecated

col_name = ['bone_length', 'rotting_flesh', 'hair_length', 'has_soul']  # we use only the numeric columns
X_train, X_test, y_train, y_test = train_test_split(df[col_name], df['type'])  # 'type' is what we predict
```
Now X_train holds 75% of our set (train_test_split holds out 25% for testing by default). I only used the numeric columns.
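If you want the split to be reproducible between runs, you can pass the test size and a seed explicitly. A minimal sketch (the `random_state` value is arbitrary):

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df[col_name], df['type'],
    test_size=0.25,   # same proportion as the default
    random_state=42,  # arbitrary seed, only for reproducibility
)
```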
Now we can build the first model - the baseline. This gives us a base score, and after building it I will try to improve on that score.
```python
from sklearn.metrics import accuracy_score, classification_report
from sklearn.linear_model import SGDClassifier  # a linear model trained with stochastic gradient descent

base_line = SGDClassifier()          # create an instance of the classifier
base_line.fit(X_train, y_train)      # train the classifier
predict = base_line.predict(X_test)  # make a prediction on the held-out set

print(accuracy_score(y_pred=predict, y_true=y_test))         # check the accuracy
print(classification_report(y_pred=predict, y_true=y_test))  # print the per-class report
```
The most interesting thing is the classification report, which looks like this:
| | precision | recall | f1-score | support |
|---|---|---|---|---|
| Ghost | 0.96 | 0.76 | 0.85 | 29 |
| Ghoul | 0.91 | 0.31 | 0.47 | 32 |
| Goblin | 0.51 | 0.94 | 0.66 | 32 |
| avg / total | 0.79 | 0.67 | 0.65 | 93 |
The accuracy score (this is the score we want to beat) is:
0.6666
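One caveat worth adding (mine, not from the original post): SGDClassifier is sensitive to feature scaling and to its random initialisation, so this baseline score can vary noticeably between runs. A minimal sketch that standardises the features first:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDClassifier

# Scale each feature to zero mean / unit variance before the linear model.
scaled_sgd = make_pipeline(StandardScaler(), SGDClassifier())
scaled_sgd.fit(X_train, y_train)
print(scaled_sgd.score(X_test, y_test))  # accuracy on the held-out set
```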
To improve this result I decided to use an SVM.
```python
from sklearn.svm import SVC  # support vector classifier

svm = SVC()  # use the default settings
svm.fit(X_train, y_train)
predict = svm.predict(X_test)

print(accuracy_score(y_pred=predict, y_true=y_test))
print(classification_report(y_pred=predict, y_true=y_test))
```
The result:
| | precision | recall | f1-score | support |
|---|---|---|---|---|
| Ghost | 0.75 | 0.93 | 0.83 | 29 |
| Ghoul | 0.79 | 0.84 | 0.82 | 32 |
| Goblin | 0.74 | 0.53 | 0.62 | 32 |
| avg / total | 0.76 | 0.76 | 0.75 | 93 |
The accuracy:
0.76
which is better than the previous one.
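Since the SVC above runs with its default settings, one natural next step (my addition, not in the original post) is to tune its hyperparameters with cross-validation. A sketch with GridSearchCV over a small, illustrative grid:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# The grid values are illustrative, not tuned for this dataset.
param_grid = {'C': [0.1, 1, 10, 100], 'gamma': ['scale', 0.01, 0.1, 1]}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X_train, y_train)

print(search.best_params_)           # best combination found
print(search.score(X_test, y_test))  # accuracy of the best model
```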
Beyond tuning, to beat this result we can try a different classifier, e.g. XGBoost, a random forest, or a neural network - or we can build a classifier out of a set of classifiers, i.e. an ensemble.
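For that last idea, scikit-learn offers VotingClassifier. A minimal sketch of a soft-voting ensemble (the choice of base models is mine, not from the original post):

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# A vote over three different model families.
ensemble = VotingClassifier(
    estimators=[
        ('svc', SVC(probability=True)),  # probability=True is required for soft voting
        ('rf', RandomForestClassifier(n_estimators=100)),
        ('lr', LogisticRegression(max_iter=1000)),
    ],
    voting='soft',  # average predicted probabilities instead of hard labels
)
ensemble.fit(X_train, y_train)
print(ensemble.score(X_test, y_test))
```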