Yesterday Kaggle wrapped up a competition on recognising Ghosts, Goblins, and Ghouls. Kaggle is a platform where data scientists can challenge themselves in different competitions, either solving real problems posted by companies or playground problems posted by Kaggle itself just for fun. This competition was one of the fun ones.
It was easy to win - you just had to overfit your model to the leaderboard. In this post, though, I'll show you how to start working with the data and make a prediction.
What we will use: Python with pandas and scikit-learn.
First, read in the data:
import pandas as pd

df = pd.read_csv('train.csv')
We can look at the head of the data to see what kind of features the dataset contains:
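For example, a quick way to inspect it (df was loaded above; head() and dtypes are standard pandas calls):

# Peek at the first rows and the column types
print(df.head())
print(df.dtypes)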
Our job is to predict the type.
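Before modelling, it can also help to check how balanced the classes are (a small addition of mine, not in the original walkthrough; value_counts() is a standard pandas method):

# How many examples of each creature type do we have?
print(df['type'].value_counts())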
The first step will be splitting the data into a train set and a test set.
from sklearn.model_selection import train_test_split  # cross_validation was removed in newer scikit-learn

col_name = ['bone_length', 'rotting_flesh', 'hair_length', 'has_soul']  # we use only the numeric features

# Split the features and the 'type' target into train and test sets
X_train, X_test, y_train, y_test = train_test_split(df[col_name], df['type'])
Now X_train holds 75% of the rows (train_test_split's default split). I only used the numeric features.
We can now build a first model (the baseline). Its score will be the base score, and after building it I will try to improve on it.
from sklearn.metrics import accuracy_score, classification_report
from sklearn.linear_model import SGDClassifier  # a simple linear classifier trained with SGD

base_line = SGDClassifier()           # create an instance of the classifier
base_line.fit(X_train, y_train)       # train the classifier
predict = base_line.predict(X_test)   # make predictions on the test set

print(accuracy_score(y_pred=predict, y_true=y_test))         # check the accuracy
print(classification_report(y_pred=predict, y_true=y_test))  # print the report
The most interesting thing is the classification report, which looks like this:
             precision    recall  f1-score   support

avg / total       0.79      0.67      0.65        93
The accuracy score - the score we want to beat - is:
To improve on this result, I decided to use an SVM.
from sklearn.svm import SVC  # support vector classifier

svm = SVC()  # use the default settings
svm.fit(X_train, y_train)
predict = svm.predict(X_test)

print(accuracy_score(y_pred=predict, y_true=y_test))
print(classification_report(y_pred=predict, y_true=y_test))
             precision    recall  f1-score   support

avg / total       0.76      0.76      0.75        93
This is better than the previous result.
To beat this result we could try a different classifier, e.g. xgboost, a random forest, or a neural network, or we could build an ensemble out of a set of classifiers.
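As a rough sketch of those last ideas (not part of the original post; the estimators and their parameters here are just placeholder choices), scikit-learn's RandomForestClassifier and VotingClassifier can be plugged into the same train/test split, reusing accuracy_score imported earlier:

from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Option 1: a single stronger model
forest = RandomForestClassifier(n_estimators=100)
forest.fit(X_train, y_train)
print(accuracy_score(y_true=y_test, y_pred=forest.predict(X_test)))

# Option 2: a simple ensemble that takes a majority vote across several classifiers
ensemble = VotingClassifier(estimators=[
    ('forest', RandomForestClassifier(n_estimators=100)),
    ('logreg', LogisticRegression(max_iter=1000)),
    ('svm', SVC()),
])
ensemble.fit(X_train, y_train)
print(accuracy_score(y_true=y_test, y_pred=ensemble.predict(X_test)))

Whether this actually beats the SVM depends on the split and the hyperparameters, so it is worth checking with cross-validation rather than a single hold-out score.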