Using string processing to get more information about the ‘Cabin’ data column

Introduction

Data in many datasets often needs to be cleaned up before it is useful, and string processing can help with this. A good example from the Kaggle Titanic dataset is the Cabin column, which contains cabin IDs such as C85 or E45. One could assume that the letter corresponds to the general location of the cabin and the number to the specific cabin within that location. In addition, many passengers have no recorded cabin, presumably because they did not actually have access to one.

It would therefore be useful to see whether the broad location had an impact on survival rate. To do this I will show how Pandas string processing can quickly generate a new column containing the required information.

Using Pandas Series.str.get() to obtain the first character of the Cabin

To operate on Pandas DataFrame columns which contain text/string data, the .str accessor followed by the desired string method can be called on a specific DataFrame column. In this case I want the first character, so I will use .str.get(0):

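# Work on a copy so the original training DataFrame is left unchanged
temp_df = titanic_training_data.copy()
# Take the first character of each cabin ID (missing cabins stay NaN)
temp_df['Processed_Cabin'] = temp_df.Cabin.str.get(0)
temp_df.head()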

Which results in:

training data cabin first letter

There are many string methods which can be called. Further information can be found in the Pandas documentation under the section ‘String handling’.

It would be useful to see all the unique elements in the new ‘Processed_Cabin’ column. This can easily be achieved by calling the unique() method on a Pandas series/column.
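As a small self-contained illustration of the .str accessor, the sketch below builds a tiny DataFrame with made-up cabin values (not rows from the Kaggle file) and takes the first character in three ways that should give the same result. The variable names are only for illustration.

```python
import pandas as pd

# Made-up sample values for illustration; None stands in for a missing cabin
sample = pd.DataFrame({"Cabin": ["C85", None, "E46", "B28"]})

# Three ways to take the first character of each string;
# rows with a missing cabin stay missing in the result
first_get = sample["Cabin"].str.get(0)
first_index = sample["Cabin"].str[0]
first_slice = sample["Cabin"].str.slice(0, 1)

print(first_get.tolist())
```

The .str[0] form is simply indexing shorthand for the same operation, so which one to use is mostly a matter of taste.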

temp_df['Processed_Cabin'].unique()

Which results in:

unique processed cabins
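To make the behaviour of unique() concrete, here is a minimal sketch on a made-up series of cabin letters (not the real data). It also uses value_counts(dropna=False), which is not part of the original walkthrough but is a quick way to see how many entries are missing.

```python
import pandas as pd

# Made-up cabin letters for illustration; None marks a missing cabin
letters = pd.Series(["C", None, "E", "C", None, "B"], name="Processed_Cabin")

# Distinct values in order of first appearance, with the missing value included
distinct = letters.unique()

# Number of rows per value, counting missing entries as their own group
counts = letters.value_counts(dropna=False)

print(distinct)
print(counts)
```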

The next step is to do something with the NaN values. It does not really make sense to remove them, so it is more useful to relabel them as something meaningful, e.g. a passenger who has no cabin. This can be achieved using the .fillna() method on a Pandas series/column:

NaN replaced with NoCabin
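A minimal sketch of this step, assuming the replacement label is the string 'NoCabin' and using made-up sample values rather than the real training data:

```python
import pandas as pd

# Made-up sample values for illustration; None marks a missing cabin
sample = pd.DataFrame({"Cabin": ["C85", None, "E46", None]})

# First letter of the cabin ID, with missing entries relabelled as 'NoCabin'
sample["Processed_Cabin"] = sample["Cabin"].str.get(0).fillna("NoCabin")

print(sample)
```

Note that fillna() returns a new series, so the result has to be assigned back to the column for the change to stick.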

Does having access to a cabin matter?

Now that the Cabin data has been processed into something useful, the question we would like to answer is: does having access to a cabin matter? As in the previous post, the groupby() method can be used to better understand this:

impact of cabin
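The grouping might look something like the sketch below. It assumes the survival flag lives in a 0/1 column called 'Survived', as in the Kaggle file, and uses made-up rows, so the numbers it prints are not the real survival rates.

```python
import pandas as pd

# Made-up rows for illustration; 'Survived' is assumed to be a 0/1 flag
sample = pd.DataFrame({
    "Processed_Cabin": ["NoCabin", "NoCabin", "NoCabin", "B", "B", "C"],
    "Survived": [0, 0, 1, 1, 1, 0],
})

# The mean of a 0/1 column within each group is that group's survival rate
survival_rate = sample.groupby("Processed_Cabin")["Survived"].mean()

print(survival_rate)
```

Because the flag is 0 or 1, the group mean is exactly the fraction of passengers in that group who survived.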

It looks like only about 30% of the passengers without an assigned cabin survived. Some cabin areas also do better than others: B, D and E all have survival rates greater than 70%. So it does seem that having access to a cabin matters.

General Comments

I showed how basic string processing can be used on a Pandas DataFrame to clean up the Cabin column. There are other columns which could have string processing applied to them, such as the name column. In general, having the capability to do string processing is very powerful, and Pandas and Python make it very easy. My next blog post will focus on making a super simple model using a few columns to try and solve the Kaggle Titanic challenge.
