Kaggle’s Titanic Challenge: Overview

Introduction

I recently heard from a friend about the Kaggle Titanic challenge. For those of you who don’t know, this is a data science challenge set by Kaggle where the aim is to predict whether or not a passenger of the Titanic survived, based on factors such as sex (male or female), age and fare. Kaggle appears to have set this as an introductory challenge to get people used to their platform and to data science practices and methods.

My motivation for doing this challenge is to brush up on using Python to clean up data and prepare some simple machine learning models. I recently did the Pandas PyCon tutorial by Brandon Rhodes and found it super useful. However, I want to practice the techniques it taught on a different dataset to consolidate what I have learnt.

GitHub Repository

All code that I will be writing can be found on GitHub.

Exploring the Titanic dataset with Pandas

Below are descriptions (with links) of the blog posts that show how the Pandas package can be used to better understand the Kaggle Titanic dataset.

Loading the Titanic dataset into a Pandas DataFrame

The first step in exploring the Titanic dataset is to load it into a Pandas DataFrame. This is a simple task, but modules such as os.path make it easier to build file paths that work across different file systems. I have therefore written a blog post describing how these modules can be used in the context of the Titanic dataset.
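As a rough idea of what that looks like, here is a minimal sketch rather than the exact code from the post. The helper name load_titanic and its data_dir argument are my own placeholders; train.csv is the name Kaggle gives the training file.

```python
import os.path

import pandas as pd


def load_titanic(data_dir, filename="train.csv"):
    """Read one of the Kaggle Titanic CSV files into a DataFrame.

    load_titanic and data_dir are hypothetical names used for illustration.
    """
    # expanduser resolves a leading "~"; join picks the right path
    # separator for the operating system the script is running on.
    csv_path = os.path.join(os.path.expanduser(data_dir), filename)
    if not os.path.isfile(csv_path):
        raise FileNotFoundError(f"No Titanic CSV found at {csv_path}")
    return pd.read_csv(csv_path)
```

Calling load_titanic with whichever folder holds your download should then return the training data as a DataFrame, and passing filename="test.csv" would load the test file the same way.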

Using Pandas groupby(), unstack() and plot() methods to get quick insights into the different data columns

The Pandas package makes it incredibly easy to get information about your dataset using only a few methods. This blog post looks at using the groupby(), unstack() and plot() DataFrame methods to get a better understanding of the Titanic dataset. To keep things simple, I have focused on the ‘Sex’ column.
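To give a flavour of how these methods chain together, here is a small sketch. The rows are made up so the snippet runs on its own; only the column names ‘Sex’ and ‘Survived’ come from the Kaggle data.

```python
import pandas as pd

# Made-up rows for illustration; the real data comes from Kaggle's train.csv.
df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male", "female"],
    "Survived": [0, 1, 1, 0, 1, 0],
})

# Count passengers for every (Sex, Survived) pair, then pivot the
# Survived level into columns so that each row is one sex.
counts = df.groupby(["Sex", "Survived"]).size().unstack(fill_value=0)
print(counts)

# Survived is coded 0/1, so its mean per group is the survival rate.
survival_rate = df.groupby("Sex")["Survived"].mean()
print(survival_rate)

# With matplotlib installed, a bar chart is one more call:
# counts.plot(kind="bar")
```

The unstack() step is what turns the two-level index produced by groupby() into a small table, which is also the shape plot() needs to draw one group of bars per sex.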

Using string processing to get more information about the ‘Cabin’ data column

Sometimes the data we are given is not in the format we need for analysis or model building. This blog post will focus on using some of the string processing capabilities within Pandas to clean up the ‘Cabin’ data column.
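The sketch below shows the sort of thing the .str accessor makes possible. It assumes cabin entries look like a deck letter followed by a number, with several cabins separated by spaces and missing values for passengers with no cabin recorded. The sample values and the new column names (HasCabin, Deck, NumCabins) are my own, chosen for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative values in the style of the 'Cabin' column; not real rows.
df = pd.DataFrame({"Cabin": ["C85", np.nan, "B57 B59 B63", "E46", np.nan]})

# Whether a cabin was recorded at all.
df["HasCabin"] = df["Cabin"].notna()

# The first character is taken as the deck letter; missing stays missing.
df["Deck"] = df["Cabin"].str[0]

# Number of space-separated cabins listed for the passenger, 0 if none.
df["NumCabins"] = df["Cabin"].str.split().str.len().fillna(0).astype(int)

print(df)
```

Each new column is plain text or numbers, which is much easier to group on or feed into a model than the raw strings.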

Building a simple LinearSVM model for classification

This blog post will look at building a simple linear support vector machine classifier to predict whether or not a passenger will survive. The classifier will then be tested with the Kaggle Titanic test dataset.
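As a rough outline of the approach, here is a minimal sketch. It assumes scikit-learn’s LinearSVC as the linear SVM, uses made-up rows instead of the real Kaggle files, and the choice of features (‘Sex’ and ‘Fare’) and the make_features helper are my own, picked only to keep the example short.

```python
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Made-up rows using Kaggle's column names; not the real dataset.
train = pd.DataFrame({
    "Sex": ["male", "female", "female", "male",
            "male", "female", "male", "female"],
    "Fare": [7.25, 71.28, 7.92, 8.05, 51.86, 11.13, 21.08, 30.07],
    "Survived": [0, 1, 1, 0, 0, 1, 0, 1],
})
test = pd.DataFrame({"Sex": ["female", "male"], "Fare": [26.0, 8.0]})


def make_features(df):
    """Turn the raw columns into a purely numeric feature table."""
    return pd.DataFrame({
        "IsFemale": (df["Sex"] == "female").astype(int),
        "Fare": df["Fare"],
    })


# Scaling first puts both features on a comparable footing for the SVM.
model = make_pipeline(StandardScaler(), LinearSVC())
model.fit(make_features(train), train["Survived"])

# One 0/1 prediction per test passenger.
predictions = model.predict(make_features(test))
print(predictions)
```

Kaggle’s test file has no ‘Survived’ column, so predictions like these are scored by submitting them to Kaggle rather than by checking them locally.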