{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"ExecuteTime": {
"end_time": "2018-07-31T02:46:35.163789Z",
"start_time": "2018-07-31T02:46:34.410255Z"
}
},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"import warnings\n",
"import matplotlib.pyplot as plt\n",
"warnings.filterwarnings('ignore')\n",
"%matplotlib inline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Decision Tree and Ensemble Learning"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Decision Tree\n",
"\n",
"A decision tree is a sequence of decisions (categorization steps) that assigns each observation to one of a set of predefined classes, which makes it a natural tool for classification problems.\n",
"\n",
"For example, consider the following data set, which records whether a player played tennis on a given day together with four weather-related parameters: outlook, temperature, humidity and wind. We need to come up with a decision tree that uses these four parameters to decide whether the player plays tennis on a specific day."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"ExecuteTime": {
"end_time": "2018-07-31T02:46:35.202766Z",
"start_time": "2018-07-31T02:46:35.165787Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
" Outlook Temperature Humidity Wind PlayTennis\n",
"0 Sunny Hot High False No\n",
"1 Sunny Hot High True No\n",
"2 Overcast Hot High False Yes\n",
"3 Rainy Mild High False Yes\n",
"4 Rainy Cool Normal False Yes\n",
"5 Rainy Cool Normal True No\n",
"6 Overcast Cool Normal True Yes\n",
"7 Sunny Mild High False No\n",
"8 Sunny Cool Normal False Yes\n",
"9 Rainy Mild Normal False Yes\n",
"10 Sunny Mild Normal True Yes\n",
"11 Overcast Mild High True Yes\n",
"12 Overcast Hot Normal False Yes\n",
"13 Rainy Mild High True No"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data_dict = {\n",
" 'Outlook' : ['Sunny', 'Sunny', 'Overcast', 'Rainy', 'Rainy', 'Rainy', 'Overcast', 'Sunny', 'Sunny','Rainy', 'Sunny', 'Overcast', 'Overcast', 'Rainy']\n",
" ,'Temperature': ['Hot', 'Hot', 'Hot', 'Mild', 'Cool', 'Cool', 'Cool', 'Mild', 'Cool', 'Mild','Mild','Mild', 'Hot', 'Mild']\n",
" ,'Humidity' : ['High', 'High', 'High', 'High', 'Normal', 'Normal', 'Normal', 'High','Normal','Normal', 'Normal', 'High', 'Normal', 'High']\n",
" ,'Wind': ['False', 'True', 'False', 'False', 'False', 'True', 'True', 'False', 'False', 'False', 'True', 'True', 'False', 'True']\n",
" ,'PlayTennis': ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No']\n",
"}\n",
"tennis_data = pd.DataFrame(data_dict, columns=data_dict.keys())\n",
"tennis_data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The following decision tree uses a sequence of decisions to solve our classification problem.\n",
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Order of decisions: why is Outlook the first question/decision?\n",
"\n",
"The main question is why Outlook was chosen as the first decision level. The answer is that the **Information Gain** achieved by this decision is higher than that of the other decisions. This can be quantified using entropy, which is a measure of randomness: higher entropy means higher randomness, and the reduction in entropy achieved by a decision is called information gain."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"ExecuteTime": {
"end_time": "2018-07-31T02:46:35.210759Z",
"start_time": "2018-07-31T02:46:35.204764Z"
}
},
"outputs": [
{
"data": {
"text/latex": [
"\n",
"Entropy = $-\\sum_{i=1}^{n} P_i\\times Log_b(P_i)$"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"%%latex\n",
"\n",
"Entropy = $-\\sum_{i=1}^{n} P_i\\times Log_b(P_i)$"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note: the unit of entropy depends on the base b of the logarithm: bits for b = 2, nats for b = e, and bans for b = 10."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"ExecuteTime": {
"end_time": "2018-07-31T02:46:35.217756Z",
"start_time": "2018-07-31T02:46:35.212759Z"
}
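},
"outputs": [],
"source": [
"def entropy_calculate(prob_list):\n",
"    # Shannon entropy, in bits, of a list of probabilities\n",
"    entropy = 0\n",
"    for item in prob_list:\n",
"        # Skip zero probabilities: p * log2(p) tends to 0 as p tends to 0\n",
"        if item > 0:\n",
"            entropy -= item * np.log2(item)\n",
"    return entropy"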
]
},
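{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Entropy of the entire system\n",
"\n",
"The data set has 14 observations: 9 Yes and 5 No. The entropy of the entire system is therefore $-\\frac{9}{14}\\log_2\\frac{9}{14} - \\frac{5}{14}\\log_2\\frac{5}{14} \\approx 0.940$ bits."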
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"ExecuteTime": {
"end_time": "2018-07-31T02:46:35.235745Z",
"start_time": "2018-07-31T02:46:35.220753Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Probabilities of No and Yes are 0.357, 0.643 respectively\n",
"Entropy of the entire system is 0.940 bits\n"
]
}
],
"source": [
"cases,counts = np.unique(tennis_data.PlayTennis,return_counts=True)\n",
"P = [count/len(tennis_data) for count in counts]\n",
"print('Probabilities of %s and %s are %.3f, %.3f respectively'%(cases[0],cases[1],P[0],P[1]))\n",
"\n",
"entropy_entire = entropy_calculate(P)\n",
"\n",
"print('Entropy of the entire system is %.3f bits'%entropy_entire)"
]
},
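{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Information Gain\n",
"\n",
"Let's calculate the reduction in entropy for each decision. For each decision, the entropy is calculated for each case (branch) under that decision, and the probability-weighted average of these entropies is then taken. The information gain is the entropy of the entire system minus this weighted average: $Gain = Entropy_{system} - \\sum_{v} P_v \\times Entropy_v$, where $P_v$ is the probability of case $v$."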
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Outlook Decision"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"ExecuteTime": {
"end_time": "2018-07-31T02:46:35.243740Z",
"start_time": "2018-07-31T02:46:35.238744Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"For outlook:\n",
"\tProbability of Overcast is 0.286\n",
"\tProbability of Rainy is 0.357\n",
"\tProbability of Sunny is 0.357\n"
]
}
],
"source": [
"cases_outlook,counts_outlook= np.unique(tennis_data.Outlook,return_counts=True)\n",
"P_outlook = [count/len(tennis_data) for count in counts_outlook]\n",
"print('For outlook:')\n",
"for case, prob in zip(cases_outlook,P_outlook):\n",
"    print('\\tProbability of %s is %.3f'%(case, prob))"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"ExecuteTime": {
"end_time": "2018-07-31T02:46:35.260729Z",
"start_time": "2018-07-31T02:46:35.245739Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Entropy for Overcast is 0.00\n",
"Entropy for Rainy is 0.97\n",
"Entropy for Sunny is 0.97\n",
"\n",
"Entropy at Outlook decision level is 0.694\n",
"\n",
"Information gain is 0.247\n"
]
}
],
"source": [
"entropy_outlook={}\n",
"total_entropy_outlook=0\n",
"for case, prob in zip(cases_outlook,P_outlook):\n",
"    # Class distribution of PlayTennis within this Outlook value\n",
"    cases,counts = np.unique(tennis_data.PlayTennis[tennis_data.Outlook==case],return_counts=True)\n",
"    P = [count/len(tennis_data[tennis_data.Outlook==case]) for count in counts]\n",
"    entropy_outlook[case]=entropy_calculate(P)\n",
"    # Weight the branch entropy by the probability of the branch\n",
"    total_entropy_outlook += entropy_outlook[case]*prob\n",
"\n",
"for case, entropy in entropy_outlook.items():\n",
"    print('Entropy for %s is %.2f'%(case,entropy))\n",
"print('\\nEntropy at Outlook decision level is %.3f'%total_entropy_outlook)\n",
"print('\\nInformation gain is %.3f'%(entropy_entire- total_entropy_outlook))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Temperature Decision"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"ExecuteTime": {
"end_time": "2018-07-31T02:46:35.268726Z",
"start_time": "2018-07-31T02:46:35.262729Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"For temperature:\n",
"\tProbability of Cool is 0.286\n",
"\tProbability of Hot is 0.286\n",
"\tProbability of Mild is 0.429\n"
]
}
],
"source": [
"cases_temperature,counts_temperature= np.unique(tennis_data.Temperature,return_counts=True)\n",
"P_temperature = [count/len(tennis_data) for count in counts_temperature]\n",
"print('For temperature:')\n",
"for case, prob in zip(cases_temperature,P_temperature):\n",
"    print('\\tProbability of %s is %.3f'%(case, prob))"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"ExecuteTime": {
"end_time": "2018-07-31T02:46:35.290710Z",
"start_time": "2018-07-31T02:46:35.271723Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Entropy for Cool is 0.81\n",
"Entropy for Hot is 1.00\n",
"Entropy for Mild is 0.92\n",
"\n",
"Entropy at Temperature decision level is 0.911\n",
"\n",
"Information gain is 0.029\n"
]
}
],
"source": [
"entropy_temperature={}\n",
"total_entropy_temperature=0\n",
"for case, prob in zip(cases_temperature,P_temperature):\n",
"    # Class distribution of PlayTennis within this Temperature value\n",
"    cases,counts = np.unique(tennis_data.PlayTennis[tennis_data.Temperature==case],return_counts=True)\n",
"    P = [count/len(tennis_data[tennis_data.Temperature==case]) for count in counts]\n",
"    entropy_temperature[case]=entropy_calculate(P)\n",
"    # Weight the branch entropy by the probability of the branch\n",
"    total_entropy_temperature += entropy_temperature[case]*prob\n",
"\n",
"for case, entropy in entropy_temperature.items():\n",
"    print('Entropy for %s is %.2f'%(case,entropy))\n",
"print('\\nEntropy at Temperature decision level is %.3f'%total_entropy_temperature)\n",
"print('\\nInformation gain is %.3f'%(entropy_entire- total_entropy_temperature))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Wind Decision"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"ExecuteTime": {
"end_time": "2018-07-31T02:46:35.298705Z",
"start_time": "2018-07-31T02:46:35.292709Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"For wind:\n",
"\tProbability of False is 0.571\n",
"\tProbability of True is 0.429\n"
]
}
],
"source": [
"cases_wind,counts_wind= np.unique(tennis_data.Wind,return_counts=True)\n",
"P_wind = [count/len(tennis_data) for count in counts_wind]\n",
"print('For wind:')\n",
"for case, prob in zip(cases_wind,P_wind):\n",
"    print('\\tProbability of %s is %.3f'%(case, prob))"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"ExecuteTime": {
"end_time": "2018-07-31T02:46:35.312698Z",
"start_time": "2018-07-31T02:46:35.300704Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Entropy for False is 0.81\n",
"Entropy for True is 1.00\n",
"\n",
"Entropy at Wind decision level is 0.892\n",
"\n",
"Information gain is 0.048\n"
]
}
],
"source": [
"entropy_wind={}\n",
"total_entropy_wind=0\n",
"for case, prob in zip(cases_wind,P_wind):\n",
"    # Class distribution of PlayTennis within this Wind value\n",
"    cases,counts = np.unique(tennis_data.PlayTennis[tennis_data.Wind==case],return_counts=True)\n",
"    P = [count/len(tennis_data[tennis_data.Wind==case]) for count in counts]\n",
"    entropy_wind[case]=entropy_calculate(P)\n",
"    # Weight the branch entropy by the probability of the branch\n",
"    total_entropy_wind += entropy_wind[case]*prob\n",
"\n",
"for case, entropy in entropy_wind.items():\n",
"    print('Entropy for %s is %.2f'%(case,entropy))\n",
"print('\\nEntropy at Wind decision level is %.3f'%total_entropy_wind)\n",
"print('\\nInformation gain is %.3f'%(entropy_entire- total_entropy_wind))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Humidity Decision"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"ExecuteTime": {
"end_time": "2018-07-31T02:46:35.320692Z",
"start_time": "2018-07-31T02:46:35.314696Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"For humidity:\n",
"\tProbability of High is 0.500\n",
"\tProbability of Normal is 0.500\n"
]
}
],
"source": [
"cases_humidity,counts_humidity= np.unique(tennis_data.Humidity,return_counts=True)\n",
"P_humidity = [count/len(tennis_data) for count in counts_humidity]\n",
"print('For humidity:')\n",
"for case, prob in zip(cases_humidity,P_humidity):\n",
"    print('\\tProbability of %s is %.3f'%(case, prob))"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"ExecuteTime": {
"end_time": "2018-07-31T02:46:35.336683Z",
"start_time": "2018-07-31T02:46:35.322691Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Entropy for High is 0.99\n",
"Entropy for Normal is 0.59\n",
"\n",
"Entropy at Humidity decision level is 0.788\n",
"\n",
"Information gain is 0.152\n"
]
}
],
"source": [
"entropy_humidity={}\n",
"total_entropy_humidity=0\n",
"for case, prob in zip(cases_humidity,P_humidity):\n",
"    # Class distribution of PlayTennis within this Humidity value\n",
"    cases,counts = np.unique(tennis_data.PlayTennis[tennis_data.Humidity==case],return_counts=True)\n",
"    P = [count/len(tennis_data[tennis_data.Humidity==case]) for count in counts]\n",
"    entropy_humidity[case]=entropy_calculate(P)\n",
"    # Weight the branch entropy by the probability of the branch\n",
"    total_entropy_humidity += entropy_humidity[case]*prob\n",
"\n",
"for case, entropy in entropy_humidity.items():\n",
"    print('Entropy for %s is %.2f'%(case,entropy))\n",
"print('\\nEntropy at Humidity decision level is %.3f'%total_entropy_humidity)\n",
"print('\\nInformation gain is %.3f'%(entropy_entire- total_entropy_humidity))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As shown above, choosing Outlook as the first decision/question achieves the highest reduction in entropy (randomness), i.e. the highest **Information Gain**: 0.247 bits, compared with 0.152 for Humidity, 0.048 for Wind and 0.029 for Temperature."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Titanic Example"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"ExecuteTime": {
"end_time": "2018-07-31T02:46:35.577533Z",
"start_time": "2018-07-31T02:46:35.338683Z"
}
},
"outputs": [],
"source": [
"import urllib3\n",
"import pandas as pd\n",
"import certifi\n",
"import re\n",
"from bs4 import BeautifulSoup\n",
"import warnings\n",
"warnings.filterwarnings('ignore')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Loading data using BeautifulSoup, with urllib3 as the HTTP client"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"ExecuteTime": {
"end_time": "2018-07-31T02:46:35.583529Z",
"start_time": "2018-07-31T02:46:35.579533Z"
}
},
"outputs": [],
"source": [
"html_address = 'https://www.encyclopedia-titanica.org/titanic-passengers-and-crew/'"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"ExecuteTime": {
"end_time": "2018-07-31T02:46:44.086026Z",
"start_time": "2018-07-31T02:46:35.587527Z"
}
},
"outputs": [],
"source": [
"http = urllib3.PoolManager(maxsize=10, cert_reqs='CERT_REQUIRED',ca_certs=certifi.where())\n",
"r = http.request('GET', html_address) "
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"ExecuteTime": {
"end_time": "2018-07-31T02:46:49.267820Z",
"start_time": "2018-07-31T02:46:44.088025Z"
}
},
"outputs": [],
"source": [
"soup = BeautifulSoup(r.data, 'html.parser')"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"ExecuteTime": {
"end_time": "2018-07-31T02:46:59.244638Z",
"start_time": "2018-07-31T02:46:49.269816Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
" Name Age Class/Dept Ticket \\\n",
"0 ABBING, Mr Anthony 42 3rd Class Passenger 5547£7 11s \n",
"1 ABBOTT, Mrs Rhoda Mary 'Rosa' 39 3rd Class Passenger CA2673£20 5s \n",
"2 ABBOTT, Mr Rossmore Edward 16 3rd Class Passenger CA2673£20 5s \n",
"3 ABBOTT, Mr Eugene Joseph 13 3rd Class Passenger CA2673£20 5s \n",
"4 ABBOTT, Mr Ernest Owen 21 Victualling Crew NaN \n",
"\n",
" Joined Job Boat [Body] Unnamed: 7 \n",
"0 Southampton Blacksmith NaN NaN \n",
"1 Southampton NaN A NaN \n",
"2 Southampton Jeweller [190] NaN \n",
"3 Southampton Scholar NaN NaN \n",
"4 Southampton Lounge Pantry Steward NaN NaN "
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"table = soup.find('table')\n",
"data = pd.read_html(str(table),flavor='bs4')[0] # Note that the flavor contains information about the encoding as well\n",
"http.clear()\n",
"data.head()"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"ExecuteTime": {
"end_time": "2018-07-31T02:46:59.257632Z",
"start_time": "2018-07-31T02:46:59.246637Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
" Name Age Class/Dept Boat [Body]\n",
"0 ABBING, Mr Anthony 42 3rd Class Passenger NaN\n",
"1 ABBOTT, Mrs Rhoda Mary 'Rosa' 39 3rd Class Passenger A\n",
"2 ABBOTT, Mr Rossmore Edward 16 3rd Class Passenger [190]\n",
"3 ABBOTT, Mr Eugene Joseph 13 3rd Class Passenger NaN\n",
"4 ABBOTT, Mr Ernest Owen 21 Victualling Crew NaN"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Take a copy so that the columns added below do not write into a view of data\n",
"data_trim = data[['Name', 'Age', 'Class/Dept', 'Boat [Body]']].copy()\n",
"data_trim.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Processing data\n",
"\n",
"Special characters are removed and the age is converted to a float. Ages below one year, which are given in months, are converted to a fraction of a year."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"ExecuteTime": {
"end_time": "2018-07-31T02:46:59.609419Z",
"start_time": "2018-07-31T02:46:59.259628Z"
}
},
"outputs": [],
"source": [
"# Drop non-ASCII (special) characters from the text columns\n",
"data_trim['Name'] = data_trim['Name'].map(str).apply(lambda x: x.encode('utf-8').decode('ascii', 'ignore'))\n",
"data_trim['Boat [Body]'] = data_trim['Boat [Body]'].map(str).apply(lambda x: x.encode('utf-8').decode('ascii', 'ignore'))\n",
"def process_age(value):\n",
"    # Ages containing 'm' are taken to be in months and converted to years\n",
"    if 'm' in value:\n",
"        return float(re.findall(r'-?\\d+\\.?\\d*',value)[0])/12\n",
"    else:\n",
"        return float(value)\n",
"data_trim['Age'] = data_trim['Age'].map(str).apply(process_age)"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"ExecuteTime": {
"end_time": "2018-07-31T02:46:59.622405Z",
"start_time": "2018-07-31T02:46:59.611411Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
" Name Age Class/Dept Boat [Body]\n",
"0 ABBING, Mr Anthony 42.0 3rd Class Passenger nan\n",
"1 ABBOTT, Mrs Rhoda Mary 'Rosa' 39.0 3rd Class Passenger A\n",
"2 ABBOTT, Mr Rossmore Edward 16.0 3rd Class Passenger [190]\n",
"3 ABBOTT, Mr Eugene Joseph 13.0 3rd Class Passenger nan\n",
"4 ABBOTT, Mr Ernest Owen 21.0 Victualling Crew nan"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data_trim.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Categorize passengers and crew"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"ExecuteTime": {
"end_time": "2018-07-31T02:46:59.717345Z",
"start_time": "2018-07-31T02:46:59.624404Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
" Name Age Class/Dept Boat [Body] \\\n",
"0 ABBING, Mr Anthony 42.0 3rd Class Passenger nan \n",
"1 ABBOTT, Mrs Rhoda Mary 'Rosa' 39.0 3rd Class Passenger A \n",
"2 ABBOTT, Mr Rossmore Edward 16.0 3rd Class Passenger [190] \n",
"3 ABBOTT, Mr Eugene Joseph 13.0 3rd Class Passenger nan \n",
"4 ABBOTT, Mr Ernest Owen 21.0 Victualling Crew nan \n",
"\n",
" Crew/Passenger \n",
"0 Passenger \n",
"1 Passenger \n",
"2 Passenger \n",
"3 Passenger \n",
"4 Crew "
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def categorizer(value):\n",
"    if 'PASSENGER' in value.upper():\n",
"        return 'Passenger'\n",
"    else:\n",
"        return 'Crew'\n",
"data_trim['Crew/Passenger'] = data_trim['Class/Dept'].map(str).apply(categorizer)\n",
"data_trim.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Check passenger ticket class"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"ExecuteTime": {
"end_time": "2018-07-31T02:46:59.826278Z",
"start_time": "2018-07-31T02:46:59.720346Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
" Name Age Class/Dept Boat [Body] \\\n",
"0 ABBING, Mr Anthony 42.0 3rd Class Passenger nan \n",
"1 ABBOTT, Mrs Rhoda Mary 'Rosa' 39.0 3rd Class Passenger A \n",
"2 ABBOTT, Mr Rossmore Edward 16.0 3rd Class Passenger [190] \n",
"3 ABBOTT, Mr Eugene Joseph 13.0 3rd Class Passenger nan \n",
"4 ABBOTT, Mr Ernest Owen 21.0 Victualling Crew nan \n",
"\n",
" Crew/Passenger Class \n",
"0 Passenger 3rd \n",
"1 Passenger 3rd \n",
"2 Passenger 3rd \n",
"3 Passenger 3rd \n",
"4 Crew Crew "
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def get_passenger_class(value):\n",
"    # For passengers, the class is the first word, e.g. '3rd' in '3rd Class Passenger'\n",
"    if 'PASSENGER' in value.upper():\n",
"        return value.split(' ')[0]\n",
"    else:\n",
"        return 'Crew'\n",
"data_trim['Class'] = data_trim['Class/Dept'].map(str).apply(get_passenger_class)\n",
"data_trim.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Check Adult/Child"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"ExecuteTime": {
"end_time": "2018-07-31T02:46:59.922218Z",
"start_time": "2018-07-31T02:46:59.828277Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
" Name Age Class/Dept Boat [Body] \\\n",
"0 ABBING, Mr Anthony 42.0 3rd Class Passenger nan \n",
"1 ABBOTT, Mrs Rhoda Mary 'Rosa' 39.0 3rd Class Passenger A \n",
"2 ABBOTT, Mr Rossmore Edward 16.0 3rd Class Passenger [190] \n",
"3 ABBOTT, Mr Eugene Joseph 13.0 3rd Class Passenger nan \n",
"4 ABBOTT, Mr Ernest Owen 21.0 Victualling Crew nan \n",
"\n",
" Crew/Passenger Class Adult/Child \n",
"0 Passenger 3rd Adult \n",
"1 Passenger 3rd Adult \n",
"2 Passenger 3rd Child \n",
"3 Passenger 3rd Child \n",
"4 Crew Crew Adult "
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def check_adult(age):\n",
"    # Ages of 18 and below (and missing ages) fall into the 'Child' branch\n",
"    if age > 18:\n",
"        return 'Adult'\n",
"    else:\n",
"        return 'Child'\n",
"data_trim['Adult/Child'] = data_trim['Age'].apply(check_adult)\n",
"data_trim.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Check Gender"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"ExecuteTime": {
"end_time": "2018-07-31T02:47:00.023158Z",
"start_time": "2018-07-31T02:46:59.925218Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
" Name Age Class/Dept Boat [Body] \\\n",
"0 ABBING, Mr Anthony 42.0 3rd Class Passenger nan \n",
"1 ABBOTT, Mrs Rhoda Mary 'Rosa' 39.0 3rd Class Passenger A \n",
"2 ABBOTT, Mr Rossmore Edward 16.0 3rd Class Passenger [190] \n",
"3 ABBOTT, Mr Eugene Joseph 13.0 3rd Class Passenger nan \n",
"4 ABBOTT, Mr Ernest Owen 21.0 Victualling Crew nan \n",
"\n",
" Crew/Passenger Class Adult/Child Gender \n",
"0 Passenger 3rd Adult Male \n",
"1 Passenger 3rd Adult Female \n",
"2 Passenger 3rd Child Male \n",
"3 Passenger 3rd Child Male \n",
"4 Crew Crew Adult Male "
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def check_gender(name):\n",
"    # Names look like 'SURNAME, Title Firstname'; the title follows the comma\n",
"    firstname = name[name.index(',')+2:]\n",
"    salutation = firstname.split(' ')[0]\n",
"    # Note: any title other than Mr/Master (e.g. Dr or Rev) is treated as Female\n",
"    if salutation.upper() in ['MR', 'MASTER']:\n",
"        return 'Male'\n",
"    else:\n",
"        return 'Female'\n",
"\n",
"data_trim['Gender'] = data_trim['Name'].map(str).apply(check_gender)\n",
"data_trim.head()"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {
"ExecuteTime": {
"end_time": "2018-07-31T02:47:00.046143Z",
"start_time": "2018-07-31T02:47:00.025158Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
" Name Age Class/Dept Boat [Body] \\\n",
"0 ABBING, Mr Anthony 42.0 3rd Class Passenger nan \n",
"1 ABBOTT, Mrs Rhoda Mary 'Rosa' 39.0 3rd Class Passenger A \n",
"2 ABBOTT, Mr Rossmore Edward 16.0 3rd Class Passenger [190] \n",
"3 ABBOTT, Mr Eugene Joseph 13.0 3rd Class Passenger nan \n",
"4 ABBOTT, Mr Ernest Owen 21.0 Victualling Crew nan \n",
"\n",
" Crew/Passenger Class Adult/Child Gender Survival \n",
"0 Passenger 3rd Adult Male 0 \n",
"1 Passenger 3rd Adult Female 1 \n",
"2 Passenger 3rd Child Male 0 \n",
"3 Passenger 3rd Child Male 0 \n",
"4 Crew Crew Adult Male 0 "
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def check_survival(value):\n",
"    # No boat entry, or a bracketed entry such as [190], is counted as not survived\n",
"    if value.strip()=='nan' or '[' in value:\n",
"        return 0\n",
"    else:\n",
"        return 1\n",
"data_trim['Survival'] = data_trim['Boat [Body]'].map(str).apply(check_survival)\n",
"data_trim.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Check the importance of each feature"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {
"ExecuteTime": {
"end_time": "2018-07-31T02:47:00.052140Z",
"start_time": "2018-07-31T02:47:00.048142Z"
}
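},
"outputs": [],
"source": [
"def entropy_calculate(prob_list):\n",
"    # Shannon entropy in bits; zero probabilities are skipped because 0*log2(0) is taken as 0\n",
"    entropy = 0\n",
"    for item in prob_list:\n",
"        if item > 0:\n",
"            entropy -= item * np.log2(item)\n",
"    return entropy"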
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {
"ExecuteTime": {
"end_time": "2018-07-31T02:47:00.062132Z",
"start_time": "2018-07-31T02:47:00.055138Z"
}
},
"outputs": [],
"source": [
"def information_gain(feature, data, initial_entropy):\n",
"    print('Information gain analysis for {}'.format(feature))\n",
"    # share of records for each value of the feature\n",
"    probability_l1 = data.groupby([feature])['Survival'].count()/len(data)\n",
"    entropy_feature=0\n",
"    for item, probability in probability_l1.items():\n",
"        print('\\tProbability of {}:'.format(item), probability)\n",
"        # survival distribution within the records that have this value\n",
"        subset = data[data[feature]==item]\n",
"        probability_l2 = subset.groupby(['Survival'])['Survival'].count()/len(subset)\n",
"\n",
"        # weighted sum of the per-value entropies\n",
"        entropy_feature += entropy_calculate(probability_l2) * probability\n",
"    print('\\n\\tEntropy of feature: %.3f'%(entropy_feature))\n",
"    print('\\tInformation gain: %.3f'%(initial_entropy - entropy_feature))"
]
},
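{
"cell_type": "markdown",
"metadata": {},
"source": [
"In standard notation, the two functions above compute the quantities below. This is only a restatement for reference; the symbols $S$ (the data set), $A$ (a feature), $S_v$ (the records where $A$ takes the value $v$) and $p_i$ (the share of class $i$) are introduced here for illustration.\n",
"\n",
"$$H(S) = -\\sum_i p_i \\log_2 p_i$$\n",
"\n",
"$$IG(S, A) = H(S) - \\sum_{v \\in \\mathrm{values}(A)} \\frac{|S_v|}{|S|} \\, H(S_v)$$\n",
"\n",
"For example, with the values reported further down for Gender, an initial entropy of 0.815 and a feature entropy of 0.702 give an information gain of $0.815 - 0.702 = 0.113$."
]
},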
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Information gain analysis"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {
"ExecuteTime": {
"end_time": "2018-07-31T02:47:01.033787Z",
"start_time": "2018-07-31T02:47:00.064131Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Initial entropy: 0.815\n"
]
}
],
"source": [
"probability_initial = data_trim.groupby(['Survival'])['Survival'].count()/len(data_trim)\n",
"initial_entropy = entropy_calculate(probability_initial)\n",
"print('Initial entropy: %.3f'%(initial_entropy))"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {
"ExecuteTime": {
"end_time": "2018-07-31T02:47:01.049778Z",
"start_time": "2018-07-31T02:47:01.035786Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Information gain analysis for Crew/Passenger\n",
"\tProbability of Crew: 0.4543987086359968\n",
"\tProbability of Passenger: 0.5456012913640033\n",
"\n",
"\tEntropy of feature: 0.772\n",
"\tInformation gain: 0.043\n"
]
}
],
"source": [
"information_gain(feature='Crew/Passenger', data=data_trim, initial_entropy=initial_entropy)"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {
"ExecuteTime": {
"end_time": "2018-07-31T02:47:01.073762Z",
"start_time": "2018-07-31T02:47:01.051777Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Information gain analysis for Class\n",
"\tProbability of 1st: 0.14124293785310735\n",
"\tProbability of 2nd: 0.11824051654560129\n",
"\tProbability of 3rd: 0.2861178369652946\n",
"\tProbability of Crew: 0.4543987086359968\n",
"\n",
"\tEntropy of feature: 0.738\n",
"\tInformation gain: 0.077\n"
]
}
],
"source": [
"information_gain(feature='Class', data=data_trim, initial_entropy=initial_entropy)"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {
"ExecuteTime": {
"end_time": "2018-07-31T02:47:01.093749Z",
"start_time": "2018-07-31T02:47:01.075761Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Information gain analysis for Adult/Child\n",
"\tProbability of Adult: 0.880548829701372\n",
"\tProbability of Child: 0.11945117029862792\n",
"\n",
"\tEntropy of feature: 0.813\n",
"\tInformation gain: 0.003\n"
]
}
],
"source": [
"information_gain(feature='Adult/Child', data=data_trim, initial_entropy=initial_entropy)"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {
"ExecuteTime": {
"end_time": "2018-07-31T02:47:01.110740Z",
"start_time": "2018-07-31T02:47:01.095750Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Information gain analysis for Gender\n",
"\tProbability of Female: 0.23284907183212267\n",
"\tProbability of Male: 0.7671509281678773\n",
"\n",
"\tEntropy of feature: 0.702\n",
"\tInformation gain: 0.113\n"
]
}
],
"source": [
"information_gain(feature='Gender', data=data_trim, initial_entropy=initial_entropy)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Pick features and label to construct the decision tree"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {
"ExecuteTime": {
"end_time": "2018-07-31T02:47:01.128729Z",
"start_time": "2018-07-31T02:47:01.112739Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
" Age Crew/Passenger Class Adult/Child Gender Survival\n",
"0 42.0 1 2 0 1 0\n",
"1 39.0 1 2 0 0 1\n",
"2 16.0 1 2 1 1 0\n",
"3 13.0 1 2 1 1 0\n",
"4 21.0 0 3 0 1 0"
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"training_data.head()"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {
"ExecuteTime": {
"end_time": "2018-07-31T02:47:01.548469Z",
"start_time": "2018-07-31T02:47:01.462521Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Total number of valid records: 2449\n"
]
}
],
"source": [
"training_data.dropna(inplace=True)\n",
"training_data.reset_index(drop=True, inplace=True)\n",
"print('Total number of valid records: {}'.format(len(training_data)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Split to train and test data"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {
"ExecuteTime": {
"end_time": "2018-07-31T02:47:02.092501Z",
"start_time": "2018-07-31T02:47:01.550467Z"
}
},
"outputs": [],
"source": [
"from sklearn.model_selection import train_test_split"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {
"ExecuteTime": {
"end_time": "2018-07-31T02:47:02.099493Z",
"start_time": "2018-07-31T02:47:02.094497Z"
}
},
"outputs": [],
"source": [
"train, test = train_test_split(training_data, test_size=0.2)\n",
"train.reset_index(drop=True, inplace=True)\n",
"test.reset_index(drop=True, inplace=True)"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {
"ExecuteTime": {
"end_time": "2018-07-31T02:47:02.107489Z",
"start_time": "2018-07-31T02:47:02.101491Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Total number of records used for training: 1959\n",
"Total number of records used for testing: 490\n"
]
}
],
"source": [
"print('Total number of records used for training: {}\\nTotal number of records used for testing: {}'.format(len(train),len(test)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Build a decision tree"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {
"ExecuteTime": {
"end_time": "2018-07-31T02:47:02.156457Z",
"start_time": "2018-07-31T02:47:02.110487Z"
}
},
"outputs": [],
"source": [
"from sklearn.tree import DecisionTreeClassifier"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {
"ExecuteTime": {
"end_time": "2018-07-31T02:47:02.165453Z",
"start_time": "2018-07-31T02:47:02.158458Z"
}
},
"outputs": [],
"source": [
"X = train[['Age', 'Crew/Passenger', 'Class', 'Adult/Child', 'Gender']]\n",
"y = train[['Survival']]"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {
"ExecuteTime": {
"end_time": "2018-07-31T02:47:02.171451Z",
"start_time": "2018-07-31T02:47:02.167451Z"
}
},
"outputs": [],
"source": [
"clf = DecisionTreeClassifier(max_leaf_nodes=20, criterion='entropy')"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {
"ExecuteTime": {
"end_time": "2018-07-31T02:47:02.183442Z",
"start_time": "2018-07-31T02:47:02.174448Z"
}
},
"outputs": [],
"source": [
"clf = clf.fit(X, y)"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {
"ExecuteTime": {
"end_time": "2018-07-31T02:47:02.192436Z",
"start_time": "2018-07-31T02:47:02.186439Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=None,\n",
" max_features=None, max_leaf_nodes=20,\n",
" min_impurity_decrease=0.0, min_impurity_split=None,\n",
" min_samples_leaf=1, min_samples_split=2,\n",
" min_weight_fraction_leaf=0.0, presort=False, random_state=None,\n",
" splitter='best')"
]
},
"execution_count": 45,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"clf"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- class_weight: A dictionary that provides user-defined weights for each class (not each feature); useful when the classes are imbalanced\n",
"\n",
"- criterion: The function used to measure the quality of a candidate split. Gini impurity ('gini') and information gain ('entropy') are the two main options\n",
"\n",
"- max_depth: The maximum depth (number of levels) from the root of the tree to the deepest leaf\n",
"\n",
"- max_features: The number of features considered when searching for the best split. Limiting it makes each split look at only a random subset of the features\n",
"\n",
"- max_leaf_nodes: The maximum number of end nodes, which is another stopping criterion for the decision tree\n",
"\n",
"- min_impurity_decrease / min_impurity_split: Thresholds that stop the tree from growing. With min_impurity_decrease, a node is split only if the split reduces the impurity (the entropy, in the case of information gain) by at least this amount. min_impurity_split is an older threshold on the impurity of the node itself and is deprecated\n",
"\n",
"- min_samples_leaf: Minimum number of samples that must end up in each subset created by a split. A split that would leave fewer samples in either branch is not considered\n",
"\n",
"- min_samples_split: Minimum number of samples a node must contain to be considered for further splitting\n",
"\n",
"- min_weight_fraction_leaf: Similar to min_samples_leaf but expressed as a fraction of the total sample weight\n",
"\n",
"- random_state, splitter and presort: random_state seeds the random choices made while building the tree and splitter selects between the best split and the best random split, so both can change the resulting tree. presort only affects training speed"
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {
"ExecuteTime": {
"end_time": "2018-07-31T02:47:02.200431Z",
"start_time": "2018-07-31T02:47:02.193434Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"array([0.20502926, 0. , 0.28082862, 0. , 0.51414211])"
]
},
"execution_count": 46,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"clf.feature_importances_"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The importances above are listed in the order Age, Crew/Passenger, Class, Adult/Child, Gender. They confirm the result of the information gain analysis: Gender is the most important feature (about 0.51), followed by Class and Age. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Tree visualization"
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {
"ExecuteTime": {
"end_time": "2018-07-31T02:47:02.205429Z",
"start_time": "2018-07-31T02:47:02.203430Z"
}
},
"outputs": [],
"source": [
"from sklearn import tree"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {
"ExecuteTime": {
"end_time": "2018-07-31T02:47:02.222417Z",
"start_time": "2018-07-31T02:47:02.207426Z"
}
},
"outputs": [],
"source": [
"with open ('Outputs/Titanic.dot', 'w') as f:\n",
"    # export_graphviz writes to the open file and returns None, so its result is not assigned\n",
"    tree.export_graphviz(clf, feature_names=['Age', 'Crew/Passenger', 'Class', 'Adult/Child', 'Gender'], out_file=f)"
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {
"ExecuteTime": {
"end_time": "2018-07-31T02:47:02.249399Z",
"start_time": "2018-07-31T02:47:02.224416Z"
}
},
"outputs": [],
"source": [
"# conda install -c conda-forge python-graphviz\n",
"from graphviz import Source"
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {
"ExecuteTime": {
"end_time": "2018-07-31T02:47:02.449277Z",
"start_time": "2018-07-31T02:47:02.251400Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 50,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"s = Source.from_file('Outputs/Titanic.dot')\n",
"s.render() # saves a pdf file\n",
"s"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Evaluate the tree"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Using train and test data sets"
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {
"ExecuteTime": {
"end_time": "2018-07-31T02:47:02.455273Z",
"start_time": "2018-07-31T02:47:02.451277Z"
}
},
"outputs": [],
"source": [
"from sklearn.metrics import accuracy_score"
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {
"ExecuteTime": {
"end_time": "2018-07-31T02:47:02.463268Z",
"start_time": "2018-07-31T02:47:02.457271Z"
}
},
"outputs": [],
"source": [
"predictions = clf.predict(test.drop(['Survival'], axis=1))"
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {
"ExecuteTime": {
"end_time": "2018-07-31T02:47:02.471264Z",
"start_time": "2018-07-31T02:47:02.465266Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"0.8224489795918367"
]
},
"execution_count": 53,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"accuracy_score(test['Survival'],predictions)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It is possible to tune the decision tree parameters based on the accuracy measure. For example, max_leaf_nodes can be varied to find the value that gives the highest accuracy."
]
},
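{
"cell_type": "markdown",
"metadata": {},
"source": [
"The tuning idea above can be sketched as follows. This is a minimal illustration rather than the notebook's own procedure: it uses synthetic data from `make_classification` as a stand-in for the Titanic records, and the names `X_demo`, `y_demo`, `candidates` and `mean_scores` are hypothetical. Scores come from cross-validation so that the test set is not used for tuning.\n",
"\n",
"```python\n",
"from sklearn.datasets import make_classification\n",
"from sklearn.model_selection import cross_val_score\n",
"from sklearn.tree import DecisionTreeClassifier\n",
"\n",
"# Synthetic stand-in data (an assumption; the notebook itself uses the Titanic records)\n",
"X_demo, y_demo = make_classification(n_samples=500, n_features=5, random_state=0)\n",
"\n",
"candidates = [2, 5, 10, 20, 50]\n",
"mean_scores = {}\n",
"for n_leaves in candidates:\n",
"    model = DecisionTreeClassifier(max_leaf_nodes=n_leaves, criterion='entropy', random_state=0)\n",
"    # mean accuracy over 5 cross-validation folds\n",
"    mean_scores[n_leaves] = cross_val_score(model, X_demo, y_demo, cv=5).mean()\n",
"\n",
"# candidate with the highest mean cross-validation accuracy\n",
"best_leaves = max(mean_scores, key=mean_scores.get)\n",
"print(best_leaves, round(mean_scores[best_leaves], 3))\n",
"```\n",
"\n",
"The same loop could be pointed at the `train` data frame, keeping `test` for the final check."
]
},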
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### K-fold cross validation"
]
},
{
"cell_type": "code",
"execution_count": 54,
"metadata": {
"ExecuteTime": {
"end_time": "2018-07-31T02:47:02.476261Z",
"start_time": "2018-07-31T02:47:02.473263Z"
}
},
"outputs": [],
"source": [
"from sklearn.model_selection import cross_val_predict, cross_val_score"
]
},
{
"cell_type": "code",
"execution_count": 55,
"metadata": {
"ExecuteTime": {
"end_time": "2018-07-31T02:47:02.485255Z",
"start_time": "2018-07-31T02:47:02.478260Z"
}
},
"outputs": [],
"source": [
"X = training_data[['Age', 'Crew/Passenger', 'Class', 'Adult/Child', 'Gender']]\n",
"y = training_data[['Survival']]"
]
},
{
"cell_type": "code",
"execution_count": 56,
"metadata": {
"ExecuteTime": {
"end_time": "2018-07-31T02:47:02.514239Z",
"start_time": "2018-07-31T02:47:02.487255Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"array([0.82244898, 0.81428571, 0.79795918, 0.78367347, 0.81799591])"
]
},
"execution_count": 56,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cross_val_score(clf, X, y, cv=5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Ensemble learning to avoid overfitting"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Ensemble learning relies on using multiple instances of a machine learning algorithm (e.g. a decision tree) that are trained with different subsets of the available training data. The algorithm parameters and the number of features used may also differ between instances.\n",
"\n",
"Each instance might be over- or underfitted, depending on the data it was trained on. However, the ensemble response is determined by a majority rule decision.\n",
"\n",
"In other words, an ensemble may be defined as a collection of models (possibly of different types) that solve the same problem. The final answer is obtained by combining the responses of all the models.\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Random Forest\n",
"\n",
"A collection of decision trees is built and trained independently, and their combined output provides the final response.\n",
"\n",
"- Each tree in the ensemble is built from a different sample of the training set (using the bagging, i.e. bootstrap aggregating, technique)\n",
"\n",
"- Each tree in the ensemble works with random subsets of the features (the random subspace idea). In scikit-learn, a random subset is drawn at every split and its size is controlled by max_features\n"
]
},
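{
"cell_type": "markdown",
"metadata": {},
"source": [
"The two ideas in the list above, bootstrap samples and random feature subsets, can be sketched by hand. This is a simplified illustration under stated assumptions, not how `RandomForestClassifier` is implemented: the data is synthetic, the names `X_demo`, `y_demo`, `trees` and `ensemble_pred` are hypothetical, and the trees are combined with a hard majority vote, whereas scikit-learn averages the predicted class probabilities.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.datasets import make_classification\n",
"from sklearn.tree import DecisionTreeClassifier\n",
"\n",
"# Synthetic stand-in data (an assumption, not the Titanic records)\n",
"X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)\n",
"rng = np.random.RandomState(0)\n",
"\n",
"trees = []\n",
"for _ in range(25):  # odd number of trees, so a 0/1 vote cannot tie\n",
"    # bootstrap: draw row indices with replacement\n",
"    rows = rng.randint(0, len(X_demo), size=len(X_demo))\n",
"    # max_features='sqrt' makes every split consider a random subset of the features\n",
"    tree_i = DecisionTreeClassifier(max_features='sqrt', random_state=rng.randint(10**6))\n",
"    trees.append(tree_i.fit(X_demo[rows], y_demo[rows]))\n",
"\n",
"# majority vote over the individual tree predictions (labels are 0/1)\n",
"votes = np.stack([t.predict(X_demo) for t in trees])\n",
"ensemble_pred = (votes.mean(axis=0) > 0.5).astype(int)\n",
"print('Ensemble training accuracy: %.3f' % (ensemble_pred == y_demo).mean())\n",
"```\n",
"\n",
"The printed value is a training accuracy, so it is optimistic; a held-out set or cross-validation is needed for an honest estimate."
]
},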
{
"cell_type": "code",
"execution_count": 90,
"metadata": {
"ExecuteTime": {
"end_time": "2018-07-31T02:47:02.541219Z",
"start_time": "2018-07-31T02:47:02.516237Z"
}
},
"outputs": [],
"source": [
"from sklearn.ensemble import RandomForestClassifier"
]
},
{
"cell_type": "code",
"execution_count": 91,
"metadata": {
"ExecuteTime": {
"end_time": "2018-07-31T02:47:02.549215Z",
"start_time": "2018-07-31T02:47:02.543219Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',\n",
" max_depth=None, max_features='auto', max_leaf_nodes=None,\n",
" min_impurity_decrease=0.0, min_impurity_split=None,\n",
" min_samples_leaf=1, min_samples_split=2,\n",
" min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,\n",
" oob_score=False, random_state=None, verbose=0,\n",
" warm_start=False)"
]
},
"execution_count": 91,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"clf = RandomForestClassifier()\n",
"clf"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Most of the parameters are the settings of the individual decision trees. The settings related to the ensemble (i.e. the random forest) are explained below.\n",
"\n",
"- bootstrap: Whether each tree is trained on a bootstrap sample (rows drawn with replacement) of the training data\n",
"- n_estimators: Number of trees in the forest"
]
},
{
"cell_type": "code",
"execution_count": 92,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Accuracy : 0.81\n"
]
}
],
"source": [
"clf.fit(X = train[['Age', 'Crew/Passenger', 'Class', 'Adult/Child', 'Gender']], y=train[['Survival']])\n",
"predictions = clf.predict(test.drop(['Survival'], axis=1))\n",
"print('Accuracy : %.2f'%(accuracy_score(test['Survival'], predictions)))"
]
},
{
"cell_type": "code",
"execution_count": 93,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Importance for Age : 0.450\n",
"Importance for Crew/Passenger : 0.052\n",
"Importance for Class : 0.185\n",
"Importance for Adult/Child : 0.009\n",
"Importance for Gender : 0.304\n"
]
}
],
"source": [
"for feature, importance in zip(['Age', 'Crew/Passenger', 'Class', 'Adult/Child', 'Gender'],clf.feature_importances_):\n",
"    print('Importance for {} : {:.3f}'.format(feature,importance))"
]
},
{
"cell_type": "code",
"execution_count": 94,
"metadata": {
"ExecuteTime": {
"end_time": "2018-07-31T02:47:02.557209Z",
"start_time": "2018-07-31T02:47:02.552214Z"
}
},
"outputs": [],
"source": [
"# Mean 5-fold cross-validation accuracy on the training split\n",
"def checkAccuracy(n_estimators=10):\n",
"    clf = RandomForestClassifier(n_estimators=n_estimators, criterion='entropy')\n",
"    X = train[['Age', 'Crew/Passenger', 'Class', 'Adult/Child', 'Gender']]\n",
"    y = train['Survival']\n",
"    return cross_val_score(clf, X, y, cv=5).mean()"
]
},
{
"cell_type": "code",
"execution_count": 95,
"metadata": {
"ExecuteTime": {
"end_time": "2018-07-31T02:47:05.325497Z",
"start_time": "2018-07-31T02:47:02.596187Z"
},
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
" Number of trees AccuracyScore\n",
"0 1 0.773856\n",
"1 11 0.789191\n",
"2 21 0.792767\n",
"3 31 0.789701\n",
"4 41 0.788175"
]
},
"execution_count": 95,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"result ={'Number of trees':[], 'AccuracyScore': []}\n",
"for n_tree in np.arange(1, 500, 10):\n",
"    result['Number of trees'].append(n_tree)\n",
"    result['AccuracyScore'].append(checkAccuracy(n_estimators=n_tree))\n",
"\n",
"result_df = pd.DataFrame.from_dict(result)\n",
"result_df.head()"
]
},
{
"cell_type": "code",
"execution_count": 96,
"metadata": {
"ExecuteTime": {
"end_time": "2018-07-31T02:47:05.557352Z",
"start_time": "2018-07-31T02:47:05.327496Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 96,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "\n",
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"result_df.plot(x='Number of trees', y='AccuracyScore', figsize=(10,5), grid=True, lw=3, color='orange')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As can be seen above, going from a single tree to an ensemble improves the classification accuracy on average. However, this does not mean that adding more trees will always result in a better accuracy: beyond a certain number of trees the score only fluctuates, depending on the cross-validation folds."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Gradient Boosted Trees\n",
"\n",
"Each tree is built based on learnings of the previous tree\n",
"\n",
"Each tree is trained using a different subset of data. However, the Boosting + Gradient Descent techniques are used to pool the subsets. After the bootstrap for one tree, Gradient Descent adjusts the probability of a data point being in the next training set and this continues sequentially. In this context, while the first bootstrap is based on uniform sampling the next one is modified based on what was learned from the previous three classification results. Data points that are classified correctly get lower weight for the next bootstrap. In this way, the likelihood of picking data that are not classified with previous tree increases. Finally, each tree is given a different weight for the final weighting for the majority rule decision.\n",
"\n",
"I random forest, there is a \"democracy\" that all trees have the same voting weights. However, in the gradient boosted approach, trees that have a better performance/accuracy get a higher weight. Gradient descent is one of the technique that can be used to quantify the probability of choosing a data point in the train subset based on the result of the previous tree.\n",
"\n",
"\n",
"For window installation:\n",
"\n",
"```shell\n",
"conda install -c anaconda py-xgboost\n",
"```\n",
"\n",
"A rule of thumb is to use a lower number of tree when dealing with a small training data set. "
]
},
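{
"cell_type": "markdown",
"metadata": {},
"source": [
"The core loop of gradient boosting can be sketched in a few lines. This is a toy illustration under stated assumptions, not XGBoost's actual algorithm (which also uses second-order information and regularization): it is a regression problem with squared-error loss on synthetic data, and the names `X_demo`, `y_demo`, `prediction` and `small_tree` are hypothetical. For squared error, the negative gradient is simply the residual, so each tree is fit to the errors of the ensemble built so far.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.tree import DecisionTreeRegressor\n",
"\n",
"# Synthetic regression data (an assumption, chosen to keep the example small)\n",
"rng = np.random.RandomState(0)\n",
"X_demo = rng.uniform(-3, 3, size=(200, 1))\n",
"y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=200)\n",
"\n",
"learning_rate = 0.1\n",
"prediction = np.full_like(y_demo, y_demo.mean())  # start from a constant model\n",
"initial_mse = np.mean((y_demo - prediction) ** 2)\n",
"\n",
"for _ in range(100):\n",
"    # negative gradient of 0.5 * squared error with respect to the prediction\n",
"    residual = y_demo - prediction\n",
"    small_tree = DecisionTreeRegressor(max_depth=2).fit(X_demo, residual)\n",
"    # add the new tree's output, shrunk by the learning rate\n",
"    prediction += learning_rate * small_tree.predict(X_demo)\n",
"\n",
"final_mse = np.mean((y_demo - prediction) ** 2)\n",
"print('MSE before boosting: %.3f, after: %.3f' % (initial_mse, final_mse))\n",
"```\n",
"\n",
"A smaller `learning_rate` makes each tree correct less of the remaining error, which is why it is usually paired with a larger number of trees."
]
},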
{
"cell_type": "code",
"execution_count": 64,
"metadata": {
"ExecuteTime": {
"end_time": "2018-07-31T02:55:18.294710Z",
"start_time": "2018-07-31T02:55:18.291713Z"
}
},
"outputs": [],
"source": [
"from xgboost.sklearn import XGBClassifier"
]
},
{
"cell_type": "code",
"execution_count": 65,
"metadata": {
"ExecuteTime": {
"end_time": "2018-07-31T02:55:18.905034Z",
"start_time": "2018-07-31T02:55:18.900037Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,\n",
" colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,\n",
" max_depth=3, min_child_weight=1, missing=None, n_estimators=100,\n",
" n_jobs=1, nthread=None, objective='binary:logistic', random_state=0,\n",
" reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,\n",
" silent=True, subsample=1)"
]
},
"execution_count": 65,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"clf = XGBClassifier()\n",
"clf"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- colsample_bylevel: Fraction of features sampled at each depth level of a tree (colsample_bytree is the fraction sampled once per tree)\n",
"\n",
"- learning_rate: Shrinkage factor applied to the contribution of each new tree. Lower values make each tree correct less of the remaining error, which usually requires more trees but reduces the risk of overfitting.\n",
"\n",
"- subsample: Fraction of the provided training rows sampled for each tree. Lowering this number makes the training sets of the individual trees differ more from each other\n",
"\n",
"- gamma: Minimum reduction in loss required to split a leaf further\n",
"\n",
"- n_estimators: Number of trees"
]
},
{
"cell_type": "code",
"execution_count": 66,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Accuracy : 0.82\n"
]
}
],
"source": [
"clf.fit(X = train[['Age', 'Crew/Passenger', 'Class', 'Adult/Child', 'Gender']], y=train[['Survival']])\n",
"predictions = clf.predict(test.drop(['Survival'], axis=1))\n",
"print('Accuracy : %.2f'%(accuracy_score(test['Survival'], predictions)))"
]
},
{
"cell_type": "code",
"execution_count": 67,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Importance for Age : 0.629\n",
"Importance for Crew/Passenger : 0.057\n",
"Importance for Class : 0.221\n",
"Importance for Adult/Child : 0.000\n",
"Importance for Gender : 0.093\n"
]
}
],
"source": [
"for feature, importance in zip(['Age', 'Crew/Passenger', 'Class', 'Adult/Child', 'Gender'],clf.feature_importances_):\n",
"    print('Importance for {} : {:.3f}'.format(feature,importance))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Hyper Parameter Tuning\n",
"\n",
"Considering the parameters that affect the performance of a decision tree and the parameters that control the ensemble learning, there are many combinations to check when looking for the best accuracy. Hyperparameter tuning uses an optimization technique to search for a good combination. Here, the Tree-structured Parzen Estimator from hyperopt (tpe.suggest) minimizes 1 minus the mean cross-validation accuracy on the training data."
]
},
{
"cell_type": "code",
"execution_count": 68,
"metadata": {
"ExecuteTime": {
"end_time": "2018-07-31T02:47:07.350556Z",
"start_time": "2018-07-31T02:47:06.970478Z"
}
},
"outputs": [],
"source": [
"# conda install -c jaikumarm hyperopt\n",
"from hyperopt import fmin, tpe, hp, STATUS_OK, Trials"
]
},
{
"cell_type": "code",
"execution_count": 69,
"metadata": {
"ExecuteTime": {
"end_time": "2018-07-31T02:47:07.360549Z",
"start_time": "2018-07-31T02:47:07.352552Z"
}
},
"outputs": [],
"source": [
"# provide the optimization range/population \n",
"control = {\n",
" 'n_estimators': hp.quniform('n_estimators',100,1000,1)\n",
" ,'learning_rate': hp.quniform('learning_rate', 0.1,0.4,0.1)\n",
" ,'max_depth': hp.quniform('max_depth',1,15,1)\n",
" ,'min_child_weight': hp.quniform('min_child_weight',1,6,1)\n",
" ,'subsample': hp.quniform('subsample',0.5,1,0.05)\n",
" ,'gamma': hp.quniform('gamma', 0.5,1,0.05)\n",
" ,'colsample_bytree': hp.quniform('colsample_bytree',0.5,1,0.05)\n",
" ,'nthread':5\n",
" ,'silent':1 \n",
"}"
]
},
{
"cell_type": "code",
"execution_count": 70,
"metadata": {},
"outputs": [],
"source": [
"# Mean 5-fold cross-validation accuracy of a classifier on the training split\n",
"def scorevalue(clf):\n",
"    X = train[['Age', 'Crew/Passenger', 'Class', 'Adult/Child', 'Gender']]\n",
"    y = train['Survival']\n",
"    return cross_val_score(clf, X, y, cv=5).mean()"
]
},
{
"cell_type": "code",
"execution_count": 71,
"metadata": {
"ExecuteTime": {
"end_time": "2018-07-31T02:47:07.368542Z",
"start_time": "2018-07-31T02:47:07.362545Z"
}
},
"outputs": [],
"source": [
"# Define a score that gets minimized and provides what you need\n",
"def score(params):\n",
"    # hp.quniform returns floats, but these two parameters must be integers\n",
"    params['n_estimators'] = int(params['n_estimators'])\n",
"    params['max_depth'] = int(params['max_depth'])\n",
"    clf = XGBClassifier(**params)\n",
"    # hyperopt minimizes the loss, so accuracy is turned into 1 - accuracy\n",
"    return {'loss':1-scorevalue(clf), 'status':STATUS_OK}"
]
},
{
"cell_type": "code",
"execution_count": 72,
"metadata": {
"ExecuteTime": {
"end_time": "2018-07-31T02:47:07.374538Z",
"start_time": "2018-07-31T02:47:07.370540Z"
}
},
"outputs": [],
"source": [
"trails=Trials()"
]
},
{
"cell_type": "code",
"execution_count": 73,
"metadata": {
"ExecuteTime": {
"end_time": "2018-07-31T02:51:09.629027Z",
"start_time": "2018-07-31T02:47:07.376537Z"
}
},
"outputs": [],
"source": [
"best_param_set = fmin(score,control,algo=tpe.suggest,trials=trails,max_evals=250)"
]
},
{
"cell_type": "code",
"execution_count": 74,
"metadata": {
"ExecuteTime": {
"end_time": "2018-07-31T02:51:09.639022Z",
"start_time": "2018-07-31T02:51:09.632024Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Optimized parameters: \n",
"colsample_bytree = 1.0\n",
"gamma = 1.0\n",
"learning_rate = 0.2\n",
"max_depth = 10.0\n",
"min_child_weight = 5.0\n",
"n_estimators = 583.0\n",
"subsample = 1.0\n"
]
}
],
"source": [
"print('Optimized parameters: ')\n",
"for key, value in best_param_set.items():\n",
"    print('{} = {}'.format(key,value))"
]
},
{
"cell_type": "code",
"execution_count": 75,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Optimized Accuracy: 0.816\n"
]
}
],
"source": [
"print('Optimized Accuracy: {:.3f}'.format(1-score(best_param_set)['loss']))"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
},
"toc": {
"nav_menu": {},
"number_sections": true,
"sideBar": true,
"skip_h1_title": false,
"toc_cell": false,
"toc_position": {
"height": "774px",
"left": "0px",
"right": "1452.91px",
"top": "111px",
"width": "293px"
},
"toc_section_display": true,
"toc_window_display": true
}
},
"nbformat": 4,
"nbformat_minor": 2
}