Yesterday Kaggle wrapped up a competition on recognising Ghosts, Goblins, and Ghouls. Kaggle is a platform where data scientists challenge themselves in competitions, solving either a real problem someone is experiencing or one posed by Kaggle itself just for fun. This competition was just for fun.
It was easy to win - you just had to overfit your model to the leaderboard. But in this post I'll show you how to start working with the data and make a prediction.

Let's start!

What we will use: Python with Pandas and scikit-learn.
First, read the data:

import pandas as pd
df = pd.read_csv('train.csv')
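
The competition's test set (the creatures we must label for the submission) can be read the same way, assuming the test.csv file from Kaggle sits in the same folder:

df_test = pd.read_csv('test.csv') # the unlabelled set used for the submission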

We can display the head of the data to look closely at what kind of data this dataset contains:

df.head()
   id  bone_length  rotting_flesh  hair_length  has_soul  color    type
0   0     0.354512       0.350839     0.465761  0.781142  clear   Ghoul
1   1     0.575560       0.425868     0.531401  0.439899  green  Goblin
2   2     0.467875       0.354330     0.811616  0.791225  black   Ghoul
3   4     0.776652       0.508723     0.636766  0.884464  black   Ghoul
4   5     0.566117       0.875862     0.418594  0.636438  green   Ghost
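
Before modelling, it's worth a quick look at the data as a whole; a minimal sketch, assuming the same df as above:

print(df.describe()) # summary statistics of the numeric features
print(df['type'].value_counts()) # how balanced are the three classes?
print(df['color'].value_counts()) # which colours appear, and how often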

Our job is to predict the type.
The first step is to split the data into train and test sets.

from sklearn.model_selection import train_test_split # in older sklearn this lived in sklearn.cross_validation
col_name = ['bone_length', 'rotting_flesh', 'hair_length', 'has_soul'] # we use only the numeric columns
X_train, X_test, y_train, y_test = train_test_split(df[col_name], df['type']) # features and the target to predict

Now X_train holds 75% of our set (the default split). I only used the numeric columns.
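
If we wanted to use the categorical color column too, one common option is one-hot encoding with pandas; a minimal sketch (not used further in this post):

color_dummies = pd.get_dummies(df['color'], prefix='color') # one 0/1 column per colour
X_full = pd.concat([df[col_name], color_dummies], axis=1) # numeric features plus encoded colour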
We can build a first model (the baseline); its score is the benchmark I will then try to improve.

from sklearn.metrics import accuracy_score, classification_report
from sklearn.linear_model import SGDClassifier # a linear classifier trained with stochastic gradient descent
base_line = SGDClassifier() # create an instance of the classifier
base_line.fit(X_train, y_train) # train the classifier
predict = base_line.predict(X_test) # make a prediction
print(accuracy_score(y_pred=predict, y_true=y_test)) # check the accuracy
print(classification_report(y_pred=predict, y_true=y_test)) # print the per-class report

The most interesting thing is the classification report, which looks like this:

             precision    recall  f1-score   support

      Ghost       0.96      0.76      0.85        29
      Ghoul       0.91      0.31      0.47        32
     Goblin       0.51      0.94      0.66        32

avg / total       0.79      0.67      0.65        93

The accuracy score (this is the score we want to beat) is:
0.6666
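
One thing worth checking before switching algorithms: SGDClassifier is sensitive to feature scale, so standardising the features sometimes helps. A sketch with a pipeline (the features here already sit in [0, 1], so the gain may be small):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

scaled_sgd = make_pipeline(StandardScaler(), SGDClassifier()) # scale, then classify
scaled_sgd.fit(X_train, y_train)
print(accuracy_score(y_pred=scaled_sgd.predict(X_test), y_true=y_test))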

To improve on this result I decided to try an SVM.

from sklearn.svm import SVC # support vector classifier
svm = SVC() # use the default settings
svm.fit(X_train, y_train)
predict = svm.predict(X_test)
print(accuracy_score(y_pred=predict, y_true=y_test))
print(classification_report(y_pred=predict, y_true=y_test))

The result:

             precision    recall  f1-score   support

      Ghost       0.75      0.93      0.83        29
      Ghoul       0.79      0.84      0.82        32
     Goblin       0.74      0.53      0.62        32

avg / total       0.76      0.76      0.75        93

The accuracy:
0.76
which is better than the previous one.
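
Before reaching for a completely different model, we could also tune the SVM's hyperparameters. A minimal sketch with grid search; the parameter grid below is just an illustrative guess:

from sklearn.model_selection import GridSearchCV

param_grid = {'C': [0.1, 1, 10], 'gamma': ['scale', 0.1, 1]} # illustrative values only
grid = GridSearchCV(SVC(), param_grid, cv=5) # 5-fold cross-validated search
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)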

To beat this result we could try a different classifier, e.g. XGBoost, a random forest, or a neural network - or we could build a classifier from a set of classifiers, i.e. an ensemble.
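
For that last idea, scikit-learn ships a VotingClassifier that combines several models; a minimal sketch (the choice of classifiers and settings is just an example):

from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression

ensemble = VotingClassifier(estimators=[
    ('svm', SVC(probability=True)), # probability=True enables soft voting for SVC
    ('forest', RandomForestClassifier()),
    ('logreg', LogisticRegression()),
], voting='soft') # average predicted probabilities across models
ensemble.fit(X_train, y_train)
print(accuracy_score(y_pred=ensemble.predict(X_test), y_true=y_test))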