This spring, CapTech hosted an internal training program on machine learning. Through this program, some of the best and brightest in business intelligence and data analytics were introduced to Kaggle, an online data science competition platform, and turned loose to come up with their own answers to the question "who survived the sinking of the Titanic?" This competition is basically Kaggle 101. Kaggle provides two datasets: one listing passengers along with their survival status, and an otherwise identical dataset in which survival status is withheld. The goal is to predict whether each passenger in the second dataset survived based on their given attributes. Here is a sample record from the first dataset.

File Sample Image
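For reference, the header and first data row of the public training file look roughly like this:

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S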

Going into the program, I had some experience in Azure Machine Learning (AML). Kaggle allows you to use whatever tools you have available, so I went with AML. Here is how I approached the problem…

First, I imported the datasets. Working on the free tier of AML, you are limited to 10 GB of data storage, but that's not a problem for this competition.

AML Saved Datasets

Once the datasets are loaded, they can be used as sources for an experiment. Experiments are AML's term for the dataflow from input to predictive model, and they generally adhere to the pattern illustrated below. First, you bring data in as a source. Then you work with the data to inspect it, clean it, enhance it, add features, and identify helpful predictors. After all that, you identify a model that fits the data, pass a portion of the dataset through model training, and feed the remaining portion into model scoring. Lastly, you evaluate the model by comparing its predicted results on the untrained data to the actual values.

Default Data Flow

Diving into the details a little, I started with a data source just like you would in ETL. AML data sources can be saved datasets or Azure-based storage. In this case, I'm using a saved dataset because it is small and I'm not worried about performance impacts or storage costs. In a typical setting, you would probably want to start with data stored in Azure. With the dataset in hand, I was ready to start performing data cleansing and feature engineering, which is where the bulk of my work was done. AML has a very comprehensive set of "modules" for working with data, and I started by whittling down the number of columns I wanted to work with using a "Select Columns in Dataset" module.

Select Columns Module

By using a feature selection module and some manual Excel visualizations, I found that "Cabin" lacked any useful information. Also, "PassengerId" is a surrogate key, and I assume (maybe dangerously) that a surrogate key could only prove to be a false-positive predictor if left in the dataset. Therefore, my Select Columns module removed Cabin and PassengerId.
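If you're following along outside of AML, the same column pruning is a one-liner in pandas. This is only an illustrative equivalent of the Select Columns in Dataset module, and the file path is a placeholder:

import pandas as pd

# Load the Kaggle training file (path is a placeholder)
train = pd.read_csv("train.csv")

# Drop the columns the Select Columns in Dataset module removed
train = train.drop(columns=["Cabin", "PassengerId"])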

Note: Getting rid of unneeded columns is helpful to speed up processing on large datasets, but probably doesn't have a significant impact with the Titanic data. I just wanted to reduce the noise going into the rest of the experiment.

Next, I grouped the passengers by age. I ran some basic visualizations on the data, this time in a simple Python notebook, and combined my observations there with what I know about age groups to identify natural dividing points between ages. I came up with some custom bins {0-3, 3-9, 9-16, 16-26, 26-45, 45+} and applied them using the "Group Data into Bins" module.

Group Data Module

I found this module to be a little lacking because it didn't let me define a default and I couldn't name the bins. To make things a little more usable, I added a SQL module (coded below) to name the bins {Toddler, Child, Teen, Young Adult, Adult, Middle Age, and Elderly}. I apologize to anybody reading this over the age of 45, but life expectancy was 47 years back in 1912.

The primary goal of this step was to test whether a categorical value for age was a better predictor than the continuous value provided. Also, I wanted to treat the absence of a known age as a possible indicator of survival. Model training does not handle null values well in a field that is included as an input, so you have to make them not null somehow. As an alternative to this strategy, I could have added a "known age" indicator column and either predicted or defaulted a value for null ages. But I found that merging the two concepts into a binned field gave me the best correlation.

SQL Code
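In pandas terms, the same binning-plus-null-handling idea looks roughly like the sketch below; the exact label-to-range mapping lives in the SQL above, so the edges and names here are approximations:

import pandas as pd

# Approximate bin edges and labels from the discussion above
bins = [0, 3, 9, 16, 26, 45, 200]
labels = ["Toddler", "Child", "Teen", "Young Adult", "Adult", "Middle Age+"]

# Bin the continuous ages, then treat missing ages as their own category
train["AgeGroup"] = pd.cut(train["Age"], bins=bins, labels=labels)
train["AgeGroup"] = train["AgeGroup"].cat.add_categories("Unknown").fillna("Unknown")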

As the last step in enhancing the data, I ran a simple Python script to extract the title for each passenger and add it as a field to the dataset. AML has an extensive set of built-in modules for data manipulation, but there are still some gaps in functionality. Whenever these turn up, Microsoft has made it as simple as possible to extend the built-in functionality through Python and R scripting modules. With Python, they do the work of importing the pandas library and putting your input dataset(s) into a dataframe. Then all you have to do is write the functions. My function is a tad simplistic, but it works for my purposes. It uses a regular expression to extract the text between the first space and the first period following that space. If you look at the sample record I copied, names are provided as "Last, Title. First Middle". This regular expression works on all but a couple of the given records.

Python Script Module

Python Script Code
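The exact script is in the screenshot above, but AML's Python scripting module always calls the same azureml_main entry point with your dataset(s) already loaded as dataframes, so a minimal sketch of the title extraction might look something like this:

import re

# AML calls this entry point with the upstream dataset(s) already loaded as dataframes
def azureml_main(dataframe1=None, dataframe2=None):
    # Grab the text between the first space and the period that follows it,
    # e.g. "Braund, Mr. Owen Harris" -> "Mr"
    def extract_title(name):
        match = re.search(r' ([^ .]+)\.', str(name))
        return match.group(1) if match else ''

    dataframe1['Title'] = dataframe1['Name'].apply(extract_title)
    return dataframe1,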

After I massaged the data into a format that I thought was ready to analyze, I pulled in some feature selection modules. Specifically, I tried Fisher Linear Discriminant Analysis and Permutation Feature Importance. There are nuances to when each is more appropriate than the other, and Microsoft provides some great help text on both (https://msdn.microsoft.com/en-us/library/azure/dn905912.aspx). I took some of the suggestions that each module produced into consideration when running my data through different machine learning algorithms, but didn't leave either of them in my experiment. Again, this is a small dataset, which allows you to take some liberties like making the model crunch every possible input.
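Outside of AML, scikit-learn offers a similar permutation-based check that is handy for sanity-checking which features matter. This is only an illustrative local sketch building on the pandas dataframe from the earlier snippets, not the AML module itself:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# One-hot encode the categorical columns so a scikit-learn model can consume them
# ("train" is the prepared dataframe from the earlier sketches)
features = pd.get_dummies(train.drop(columns=["Survived", "Name", "Ticket"])).fillna(0)
target = train["Survived"]

model = RandomForestClassifier(random_state=42).fit(features, target)

# Shuffle one column at a time and measure how much the score drops
result = permutation_importance(model, features, target, n_repeats=10, random_state=42)
for name, score in sorted(zip(features.columns, result.importances_mean), key=lambda t: -t[1]):
    print(f"{name}: {score:.3f}")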

Before pushing my data through a model, I split the dataset so that 25% of the records would go straight to scoring the model and 75% would go into training it.

Model Scoring
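Outside of AML, the equivalent of that split is a one-liner with scikit-learn (again just a local sketch, reusing the features and target from the snippet above):

from sklearn.model_selection import train_test_split

# Hold back 25% of the rows for scoring; train on the remaining 75%
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.25, random_state=42)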

I tried 10-12 different algorithms and pitted them against each other by comparing scores or by using AML's "Evaluate Model" module when possible. Given the luxury of having no performance constraints and a small dataset, I was able to settle on a model that gave me an average of ~83% accuracy when predicting various permutations of the control set. I thought that was pretty good!
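As a rough local illustration of that bake-off, here is how a couple of candidate models could be compared on the held-out 25% with scikit-learn (hypothetical stand-ins, not the algorithms I ultimately evaluated in AML):

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Train each candidate on the 75% split and compare accuracy on the held-out 25%
for candidate in (LogisticRegression(max_iter=1000), RandomForestClassifier(random_state=42)):
    candidate.fit(X_train, y_train)
    print(type(candidate).__name__, round(candidate.score(X_test, y_test), 3))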

The last step I performed was turning this model into a web service. This sounds more complicated than it is: there is a button that moves you from a training experiment to a predictive experiment. Your data input is replaced with a web service input, and a new web service output module is added.

Web Service Flow

After creating the web service, I just needed to figure out how to call it with the records in the test dataset. With just a little Python experience, I was able to use AML's guide to call the web service from a Jupyter notebook. I can't begin to explain how cool that is! I created my own notebook, but you can simply click the "Notebooks" tab within AML to launch a new Jupyter notebook filled in with some sample code.

Here is a condensed version of my notebook. The free tier of AML does not allow batch execution, so I had to call the web service in a loop (which is a little ugly and inefficient), but it got the job done for now. As for my final score? 77.5%. That put me right in the middle of the pack, so I'll have to make some tweaks and try again.

import json
import urllib.request
import pandas as pd
import numpy

# Load the Kaggle test set; column names are supplied because this file has no Survived column
test_data = pd.read_csv("C:/Users/… /test.csv",
                        names=["PassengerId", "Pclass", "Name", "Sex", "Age", "SibSp",
                               "Parch", "Ticket", "Fare", "Cabin", "Embarked"])
df = pd.DataFrame(test_data)

# These are the settings given by AML for the web service
url = 'https://ussouthcentral.services.azureml.net/workspaces/24cbacb657d84779977f1c68ce7ecbd3/services/' \
      'dc4be15bb3994a8f8434da8ea8185a5e/execute?api-version=2.0&details=true'
api_key = 'abc1234!@#$=='  # Replace this with the API key for the web service
headers = {'Content-Type': 'application/json', 'Authorization': ('Bearer ' + api_key)}

result_csv = []

# The free tier does not allow batch execution, so call the web service one record at a time
for index in range(df.shape[0] - 1):
    index += 1  # skip row 0, which holds the header text read in as data

    # Build the request payload in the shape the AML web service expects
    data = {}
    Inputs = {}
    input1 = {}
    input1['ColumnNames'] = ['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age',
                             'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked']
    str_array = []
    val_array = []
    str_array.append(str(df['PassengerId'][index]))
    str_array.append("0")  # placeholder for the Survived value the model will predict
    str_array.append(df['Pclass'][index])
    str_array.append(df['Name'][index])
    str_array.append(df['Sex'][index])
    str_array.append(str(df['Age'][index]))
    str_array.append(str(df['SibSp'][index]))
    str_array.append(str(df['Parch'][index]))
    str_array.append(df['Ticket'][index])
    if str(df['Fare'][index]).replace('.', '').isnumeric():
        str_array.append(str(df['Fare'][index]))
    else:
        str_array.append("0")  # default a missing fare to 0
    str_array.append(str(df['Cabin'][index]))
    str_array.append(df['Embarked'][index])
    val_array.append(str_array)

    input1['Values'] = val_array
    Inputs['input1'] = input1
    data['Inputs'] = Inputs
    data['GlobalParameters'] = {}
    body = str.encode(json.dumps(data))

    # Call the web service and keep the PassengerId with the predicted Survived value
    req = urllib.request.Request(url, body, headers)
    response = json.load(urllib.request.urlopen(req))
    result_csv.append([str_array[0], response["Results"]["output1"]["value"]["Values"][0][9]])

# Write the predictions out as a CSV for submission
a = numpy.asarray(result_csv)
numpy.savetxt("c:/… /data.csv", a, delimiter=',', fmt="%s")

In short, Kaggle is a great platform for exposing yourself to real-world machine learning applications, with a great community behind it. AML is an excellent way to get in and learn machine learning strategies and concepts without the technical hurdles of heavy Python and R programming. If you've been holding out on machine learning, this is the perfect entry point. Good luck!