Yesterday Kaggle wrapped up a competition on recognising Ghosts, Goblins, and Ghouls. Kaggle is a platform where data scientists can challenge themselves in competitions, either to solve a real problem someone is experiencing or one set by Kaggle itself, just for fun. This competition was just for fun.
It was easy to win - you just had to overfit your model to the leaderboard. But in this post I'll show you how to start working with the data and make a first prediction.
Let's start!
What we will use: Python with Pandas and Sklearn.
First, read the data:

```python
import pandas as pd

df = pd.read_csv('train.csv')  # the competition's training set
```
We can look at the head of the DataFrame to see what kind of data the dataset contains:

```python
df.head()
```
| | id | bone_length | rotting_flesh | hair_length | has_soul | color | type |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.354512 | 0.350839 | 0.465761 | 0.781142 | clear | Ghoul |
| 1 | 1 | 0.575560 | 0.425868 | 0.531401 | 0.439899 | green | Goblin |
| 2 | 2 | 0.467875 | 0.354330 | 0.811616 | 0.791225 | black | Ghoul |
| 3 | 4 | 0.776652 | 0.508723 | 0.636766 | 0.884464 | black | Ghoul |
| 4 | 5 | 0.566117 | 0.875862 | 0.418594 | 0.636438 | green | Ghost |
Our job is to predict the type.
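Before modelling, a quick look at the target itself is worthwhile. This check is my addition, not part of the original walkthrough; it assumes the `df` loaded above:

```python
# How balanced are the three classes, and is anything missing?
print(df['type'].value_counts())
print(df.isnull().sum())
```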
The first step is to split the data into training and test sets.

```python
from sklearn.model_selection import train_test_split  # sklearn.cross_validation is long deprecated

col_name = ['bone_length', 'rotting_flesh', 'hair_length', 'has_soul']  # we use only the numeric columns
X_train, X_test, y_train, y_test = train_test_split(df[col_name], df['type'])  # 'type' is what we predict
```
Now X_train holds 75% of our set (train_test_split holds out 25% for testing by default). I only used the numeric columns.
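If you want the split to be reproducible between runs, you can pass the test size and a seed explicitly. A minimal sketch (the `random_state` value is arbitrary):

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df[col_name], df['type'],
    test_size=0.25,   # same proportion as the default
    random_state=42,  # arbitrary seed, only for reproducibility
)
```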
Now we can build the first model - the baseline. This gives us a base score, and after building it I will try to improve on that score.
```python
from sklearn.metrics import accuracy_score, classification_report
from sklearn.linear_model import SGDClassifier  # a linear model trained with stochastic gradient descent

base_line = SGDClassifier()          # create an instance of the classifier
base_line.fit(X_train, y_train)      # train the classifier
predict = base_line.predict(X_test)  # make a prediction on the held-out set

print(accuracy_score(y_pred=predict, y_true=y_test))         # check the accuracy
print(classification_report(y_pred=predict, y_true=y_test))  # print the per-class report
```
The most interesting thing is the classification report, which looks like this:
| | precision | recall | f1-score | support |
|---|---|---|---|---|
| Ghost | 0.96 | 0.76 | 0.85 | 29 |
| Ghoul | 0.91 | 0.31 | 0.47 | 32 |
| Goblin | 0.51 | 0.94 | 0.66 | 32 |
| avg / total | 0.79 | 0.67 | 0.65 | 93 |
The accuracy score (this is the score we want to beat) is:
0.6666
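One caveat worth adding (mine, not from the original post): SGDClassifier is sensitive to feature scaling and to its random initialisation, so this baseline score can vary noticeably between runs. A minimal sketch that standardises the features first:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDClassifier

# Scale each feature to zero mean / unit variance before the linear model.
scaled_sgd = make_pipeline(StandardScaler(), SGDClassifier())
scaled_sgd.fit(X_train, y_train)
print(scaled_sgd.score(X_test, y_test))  # accuracy on the held-out set
```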
To improve this result I decided to use an SVM.
```python
from sklearn.svm import SVC  # support vector classifier

svm = SVC()  # use the default settings
svm.fit(X_train, y_train)
predict = svm.predict(X_test)

print(accuracy_score(y_pred=predict, y_true=y_test))
print(classification_report(y_pred=predict, y_true=y_test))
```
The result:
| | precision | recall | f1-score | support |
|---|---|---|---|---|
| Ghost | 0.75 | 0.93 | 0.83 | 29 |
| Ghoul | 0.79 | 0.84 | 0.82 | 32 |
| Goblin | 0.74 | 0.53 | 0.62 | 32 |
| avg / total | 0.76 | 0.76 | 0.75 | 93 |
The accuracy:
0.76
which is better than the previous one.
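Since the SVC above runs with its default settings, one natural next step (my addition, not in the original post) is to tune its hyperparameters with cross-validation. A sketch with GridSearchCV over a small, illustrative grid:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# The grid values are illustrative, not tuned for this dataset.
param_grid = {'C': [0.1, 1, 10, 100], 'gamma': ['scale', 0.01, 0.1, 1]}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X_train, y_train)

print(search.best_params_)           # best combination found
print(search.score(X_test, y_test))  # accuracy of the best model
```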
Beyond tuning, to beat this result we can try a different classifier, e.g. XGBoost, a random forest, or a neural network - or we can build a classifier out of a set of classifiers, i.e. an ensemble.
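For that last idea, scikit-learn offers VotingClassifier. A minimal sketch of a soft-voting ensemble (the choice of base models is mine, not from the original post):

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# A vote over three different model families.
ensemble = VotingClassifier(
    estimators=[
        ('svc', SVC(probability=True)),  # probability=True is required for soft voting
        ('rf', RandomForestClassifier(n_estimators=100)),
        ('lr', LogisticRegression(max_iter=1000)),
    ],
    voting='soft',  # average predicted probabilities instead of hard labels
)
ensemble.fit(X_train, y_train)
print(ensemble.score(X_test, y_test))
```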