Introduction
In this section I will walk through how the Pandas Python package can be used to quickly get a feel for the data we have available.
For me, the easiest way to get Pandas up and running with Python is to use the Anaconda distribution. For this type of data analysis I like to use Jupyter Notebooks (also included in the Anaconda distribution), as they allow for rapid iteration and troubleshooting of code.
All the code and data referenced in this blog post can be found on my GitHub repository.
Module imports
For this analysis I have made use of Python's built-in os module and Pandas, both imported at the top of the analysis script as shown below.
import os
import pandas as pd
Why use the Operating System (OS) module?
In my scripts I have used the os.path module to define filepaths in an operating-system-independent way. For example, on Windows systems filepaths are separated with backslashes (“C:\Folder1\Folder2”), whilst on Unix systems they are separated with forward slashes and have no drive letter (“/Folder1/Folder2”). It would be annoying to have to switch between these syntaxes every time we switched operating system or shared code with someone who used a different one. The os.path module saves us from this headache by taking care of these differences.
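To make the difference concrete, here is a small sketch that joins the same two folder names using the Windows and Unix flavours of os.path. It leans on the standard library modules ntpath and posixpath, which are not used anywhere else in this post and are only here for illustration; in real code you would simply call os.path.join and let Python pick the right flavour.

```python
import ntpath      # the Windows implementation behind os.path
import os
import posixpath   # the Unix implementation behind os.path

# Same folder names, joined with each operating system's rules.
windows_path = ntpath.join('C:\\Folder1', 'Folder2')
unix_path = posixpath.join('/Folder1', 'Folder2')

# os.path.join uses whichever flavour matches the machine running the code.
native_path = os.path.join('Folder1', 'Folder2')

print(windows_path)  # C:\Folder1\Folder2
print(unix_path)     # /Folder1/Folder2
print(native_path)   # separator depends on your operating system
```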
It is assumed that the Titanic dataset is within a folder called Data which is in the same directory as the Jupyter Notebook or Python file running the analysis.
Finding the working directory
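The os.path module can be used to find the current working directory using the following command. For a Jupyter Notebook this is normally the folder containing the notebook; for a Python file it is the folder the script was launched from, which is not necessarily the folder the file lives in.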
working_directory = os.path.abspath('.')
The os.path.abspath function returns the absolute version of whatever path it is given. Here the single dot (‘.’) refers to the current folder. If we wanted the parent folder we would specify double dots (‘..’) like so:
parent_directory = os.path.abspath('..')
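As a quick sanity check, the sketch below compares these two results against os.getcwd() and os.path.dirname(), two standard library functions that are not used elsewhere in this post. The variable names same_as_getcwd and parent_matches are just for illustration.

```python
import os

working_directory = os.path.abspath('.')
parent_directory = os.path.abspath('..')

# '.' resolves to the process's current working directory.
same_as_getcwd = working_directory == os.getcwd()

# '..' resolves to the folder that contains the working directory.
parent_matches = parent_directory == os.path.dirname(working_directory)

print(working_directory)
print(parent_directory)
print(same_as_getcwd, parent_matches)
```

If the working directory is not the one you expected, the notebook or script was probably started from a different folder.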
Getting the filepath of the training data
It is easy to get the full filepath of the training data now that the working directory has been defined. The training data is within the folder called Data and the name of the file containing the training data is ‘train.csv’.
The os.path.join function can then be used to join these three segments together in a simple way:
training_data_location = os.path.join(working_directory, 'Data', 'train.csv')
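Before reading the file it can be worth confirming that the path actually points at something. The following is a minimal sketch of one way to do that, using os.path.isfile; the check and the printed messages are my own addition rather than part of the original analysis script.

```python
import os

working_directory = os.path.abspath('.')
training_data_location = os.path.join(working_directory, 'Data', 'train.csv')

# isfile() is True only if the path exists and is a regular file.
if os.path.isfile(training_data_location):
    print('Found training data at', training_data_location)
else:
    print('No file found at', training_data_location)
```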
Read data into Pandas DataFrame
The training data is stored in a CSV file, so it is convenient to use the Pandas read_csv() function, which takes the filepath of the desired CSV file as an argument. In this case the filepath is stored in the variable training_data_location.
titanic_training_data = pd.read_csv(training_data_location)
The Pandas read_csv() function parses the CSV file and returns its contents as a Pandas DataFrame object, which is stored in the variable titanic_training_data.
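If you want to try read_csv() without the Titanic file to hand, the sketch below feeds it a tiny CSV held in memory. The column names and the three rows are made up for illustration and are only a stand-in for train.csv, not an extract from it.

```python
import io

import pandas as pd

# A made-up, in-memory CSV: first line is the header, then three rows.
sample_csv = io.StringIO(
    "PassengerId,Survived,Pclass,Age\n"
    "1,0,3,22\n"
    "2,1,1,38\n"
    "3,1,3,26\n"
)

# read_csv() accepts a file-like object as well as a filepath.
sample_data = pd.read_csv(sample_csv)

print(type(sample_data))          # the result is a DataFrame
print(sample_data.shape)          # (3, 4): three rows, four columns
print(list(sample_data.columns))  # column names come from the header line
```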
Viewing the DataFrame
The easiest way to view the Pandas DataFrame is probably the DataFrame.head() method. This method returns the first 5 rows of the DataFrame, which a Jupyter Notebook displays automatically when the call is the last line of a cell. The DataFrame.head() method can take an integer argument if a different number of rows is required.
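Here is a small sketch of head() in action. It uses a throwaway ten-row DataFrame called numbers (a name I have made up for this example) so that it runs without the Titanic data.

```python
import pandas as pd

# A throwaway DataFrame with ten rows, numbered 0 to 9.
numbers = pd.DataFrame({'value': range(10)})

first_five = numbers.head()    # default: first 5 rows
first_three = numbers.head(3)  # integer argument: first 3 rows

print(first_five)
print(first_three)
```

The same calls work on the Titanic data, for example titanic_training_data.head(10) for the first ten passengers.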
