[1]:
import pandas as pd
import numpy as np
import warnings
import matplotlib.pyplot as plt
warnings.filterwarnings('ignore')
%matplotlib inline

Decision Tree and Ensemble Learning

Decision Tree

A decision tree can be described as a sequence of decisions/categorization steps that classify data into predefined classes, which makes it a natural tool for solving classification problems.

For example, consider the following data set about playing tennis on a given day based on weather conditions. Using the provided data, we need to come up with a decision tree that decides whether the player plays tennis on a specific day based on four weather-related parameters: outlook, temperature, humidity and wind condition.

[2]:
data_dict = {
    'Outlook' : ['Sunny', 'Sunny', 'Overcast', 'Rainy', 'Rainy', 'Rainy', 'Overcast', 'Sunny', 'Sunny','Rainy', 'Sunny', 'Overcast', 'Overcast', 'Rainy']
    ,'Temperature': ['Hot', 'Hot', 'Hot', 'Mild', 'Cool', 'Cool', 'Cool', 'Mild', 'Cool', 'Mild','Mild','Mild', 'Hot', 'Mild']
    ,'Humidity' : ['High', 'High', 'High', 'High', 'Normal', 'Normal', 'Normal', 'High','Normal','Normal', 'Normal', 'High', 'Normal', 'High']
    ,'Wind': ['False', 'True', 'False', 'False', 'False', 'True', 'True', 'False', 'False', 'False', 'True', 'True', 'False', 'True']
    ,'PlayTennis': ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No']
}
tennis_data = pd.DataFrame(data_dict, columns=data_dict.keys())
tennis_data
[2]:
Outlook Temperature Humidity Wind PlayTennis
0 Sunny Hot High False No
1 Sunny Hot High True No
2 Overcast Hot High False Yes
3 Rainy Mild High False Yes
4 Rainy Cool Normal False Yes
5 Rainy Cool Normal True No
6 Overcast Cool Normal True Yes
7 Sunny Mild High False No
8 Sunny Cool Normal False Yes
9 Rainy Mild Normal False Yes
10 Sunny Mild Normal True Yes
11 Overcast Mild High True Yes
12 Overcast Hot Normal False Yes
13 Rainy Mild High True No

The following decision tree uses a sequence of decisions to solve our classification problem. (Figure: decision tree for the PlayTennis data set.)

Order of decisions: why is Outlook the first question/decision?

The main question is why Outlook was chosen as the first decision level. The answer is that the amount of Information Gain achieved by this decision is higher than for the other decisions. This can be quantified using entropy, which is a measure of randomness: higher entropy means higher randomness, and the reduction in randomness achieved by a decision is called information gain.

[3]:
%%latex

Entropy = $-\sum_{i=1}^{n} P_i\times Log_b(P_i)$

Note: the corresponding units of entropy are bits for b = 2, nats for b = e, and bans for b = 10, where b is the base of the logarithm.

[4]:
def entropy_calculate(prob_list):
    # Shannon entropy: -sum(p * log_2(p)) over the given class probabilities (in bits)
    entropy = 0
    for item in prob_list:
        entropy -= item * np.log2(item)
    return entropy
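
As a quick sanity check of this helper, a fair 50/50 split has exactly one bit of entropy, while a more skewed split has less:

print(entropy_calculate([0.5, 0.5]))    # 1.0 bit for a fair 50/50 split
print(entropy_calculate([0.25, 0.75]))  # about 0.811 bits for a 25/75 split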

Entropy of the entire system

Entropy of the entire system: 14 observations: 9 Yes and 5 No

[5]:
cases,counts = np.unique(tennis_data.PlayTennis,return_counts=True)
P = [count/len(tennis_data) for count in counts]
print('Probabilities of %s and %s are %.3f, %.3f respectively'%(cases[0],cases[1],P[0],P[1]))

entropy_entire = entropy_calculate(P)

print('Entire system entropy is %.3f bits'%entropy_entire)
Probabilities of No and Yes are 0.357, 0.643 respectively
Entire system entropy is 0.940 bits

Information Gain

Let's calculate the reduction in entropy for each decision. For each decision, entropy is calculated for each case under that decision and then a probability-weighted average is computed.
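
In formula form, the information gain of a decision $A$ over a data set $S$ is the entropy of $S$ minus the probability-weighted average of the entropies of the subsets $S_v$ produced by each possible case $v$ of that decision:

Information Gain = $Entropy(S) - \sum_{v \in values(A)} \frac{|S_v|}{|S|}\times Entropy(S_v)$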

Outlook Decision

[6]:
cases_outlook,counts_outlook= np.unique(tennis_data.Outlook,return_counts=True)
P_outlook = [count/len(tennis_data) for count in counts_outlook]
print('For outlook:')
for case, prob in zip(cases_outlook,P_outlook):
    print('\tProbability of %s is %.3f'%(case, prob))
For outlook:
        Probability of Overcast is 0.286
        Probability of Rainy is 0.357
        Probability of Sunny is 0.357
[7]:
entropy_outlook={}
total_entropy_outlook=0
for case, prob in zip(cases_outlook,P_outlook):
    cases,counts = np.unique(tennis_data.PlayTennis[tennis_data.Outlook==case],return_counts=True)
    P = [count/len(tennis_data[tennis_data.Outlook==case]) for count in counts]
    entropy_outlook[case]=entropy_calculate(P)
    total_entropy_outlook += entropy_calculate(P)*prob

for case, entropy in entropy_outlook.items():
    print('Entropy for %s is %.2f'%(case,entropy))
print('\nEntropy at Outlook decision level is %.3f'%total_entropy_outlook)
print('\nInformation gain is %.3f'%(entropy_entire- total_entropy_outlook))
Entropy for Overcast is 0.00
Entropy for Rainy is 0.97
Entropy for Sunny is 0.97

Entropy at Outlook decision level is 0.694

Information gain is 0.247

Temperature Decision

[8]:
cases_temperature,counts_temperature= np.unique(tennis_data.Temperature,return_counts=True)
P_temperature = [count/len(tennis_data) for count in counts_temperature]
print('For temperature:')
for case, prob in zip(cases_temperature,P_temperature):
    print('\tProbability of %s is %.3f'%(case, prob))
For temperature:
        Probability of Cool is 0.286
        Probability of Hot is 0.286
        Probability of Mild is 0.429
[9]:
entropy_temperature={}
total_entropy_temperature=0
for case, prob in zip(cases_temperature,P_temperature):
    cases,counts = np.unique(tennis_data.PlayTennis[tennis_data.Temperature==case],return_counts=True)
    P = [count/len(tennis_data[tennis_data.Temperature==case]) for count in counts]
    entropy_temperature[case]=entropy_calculate(P)
    total_entropy_temperature += entropy_calculate(P)*prob

for case, entropy in entropy_temperature.items():
    print('Entropy for %s is %.2f'%(case,entropy))
print('\nEntropy at Temperature decision level is %.3f'%total_entropy_temperature)
print('\nInformation gain is %.3f'%(entropy_entire- total_entropy_temperature))
Entropy for Cool is 0.81
Entropy for Hot is 1.00
Entropy for Mild is 0.92

Entropy at Temperature decision level is 0.911

Information gain is 0.029

Wind Decision

[10]:
cases_wind,counts_wind= np.unique(tennis_data.Wind,return_counts=True)
P_wind = [count/len(tennis_data) for count in counts_wind]
print('For wind:')
for case, prob in zip(cases_wind,P_wind):
    print('\tProbability of %s is %.3f'%(case, prob))
For wind:
        Probability of False is 0.571
        Probability of True is 0.429
[11]:
entropy_wind={}
total_entropy_wind=0
for case, prob in zip(cases_wind,P_wind):
    cases,counts = np.unique(tennis_data.PlayTennis[tennis_data.Wind==case],return_counts=True)
    P = [count/len(tennis_data[tennis_data.Wind==case]) for count in counts]
    entropy_wind[case]=entropy_calculate(P)
    total_entropy_wind += entropy_calculate(P)*prob

for case, entropy in entropy_wind.items():
    print('Entropy for %s is %.2f'%(case,entropy))
print('\nEntropy at Wind decision level is %.3f'%total_entropy_wind)
print('\nInformation gain is %.3f'%(entropy_entire- total_entropy_wind))
Entropy for False is 0.81
Entropy for True is 1.00

Entropy at Wind decision level is 0.892

Information gain is 0.048

Humidity Decision

[12]:
cases_humidity,counts_humidity= np.unique(tennis_data.Humidity,return_counts=True)
P_humidity = [count/len(tennis_data) for count in counts_humidity]
print('For humidity:')
for case, prob in zip(cases_humidity,P_humidity):
    print('\tProbability of %s is %.3f'%(case, prob))
For humidity:
        Probability of High is 0.500
        Probability of Normal is 0.500
[13]:
entropy_humidity={}
total_entropy_humidity=0
for case, prob in zip(cases_humidity,P_humidity):
    cases,counts = np.unique(tennis_data.PlayTennis[tennis_data.Humidity==case],return_counts=True)
    P = [count/len(tennis_data[tennis_data.Humidity==case]) for count in counts]
    entropy_humidity[case]=entropy_calculate(P)
    total_entropy_humidity += entropy_calculate(P)*prob

for case, entropy in entropy_humidity.items():
    print('Entropy for %s is %.2f'%(case,entropy))
print('\nEntropy at Humidity decision level is %.3f'%total_entropy_humidity)
print('\nInformation gain is %.3f'%(entropy_entire- total_entropy_humidity))
Entropy for High is 0.99
Entropy for Normal is 0.59

Entropy at Humidity decision level is 0.788

Information gain is 0.152

As shown above, choosing Outlook as the first decision/question achieves the highest reduction in entropy/randomness, which corresponds to the highest Information Gain.
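
The per-feature calculations above can also be collapsed into a single loop. The sketch below (information_gain_tennis is a hypothetical helper, reusing the tennis_data DataFrame, entropy_calculate and entropy_entire defined earlier) computes the information gain of every candidate decision in one pass:

def information_gain_tennis(feature):
    # Weighted average entropy of the subsets created by splitting on `feature`
    total = 0
    for case in tennis_data[feature].unique():
        subset = tennis_data.PlayTennis[tennis_data[feature] == case]
        _, counts = np.unique(subset, return_counts=True)
        total += entropy_calculate(counts / counts.sum()) * len(subset) / len(tennis_data)
    return entropy_entire - total

for feature in ['Outlook', 'Temperature', 'Humidity', 'Wind']:
    print('Information gain for %s is %.3f' % (feature, information_gain_tennis(feature)))

Outlook should again come out with the largest gain (0.247), matching the step-by-step values above.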

Titanic Example

[14]:
import urllib3
import pandas as pd
import certifi
import re
from bs4 import BeautifulSoup
import warnings
warnings.filterwarnings('ignore')

Loading data using BeautifulSoup and urllib3 as the HTTP client

[15]:
html_address = 'https://www.encyclopedia-titanica.org/titanic-passengers-and-crew/'
[16]:
http = urllib3.PoolManager(maxsize=10, cert_reqs='CERT_REQUIRED',ca_certs=certifi.where())
r = http.request('GET', html_address)
[17]:
soup = BeautifulSoup(r.data, 'html.parser')
[18]:
table = soup.find('table')
data = pd.read_html(str(table),flavor='bs4')[0] # The flavor specifies the HTML parser backend (here BeautifulSoup)
http.clear()
data.head()
[18]:
Name Age Class/Dept Ticket Joined Job Boat [Body] Unnamed: 7
0 ABBING, Mr Anthony 42 3rd Class Passenger 5547£7 11s Southampton Blacksmith NaN NaN
1 ABBOTT, Mrs Rhoda Mary 'Rosa' 39 3rd Class Passenger CA2673£20 5s Southampton NaN A NaN
2 ABBOTT, Mr Rossmore Edward 16 3rd Class Passenger CA2673£20 5s Southampton Jeweller [190] NaN
3 ABBOTT, Mr Eugene Joseph 13 3rd Class Passenger CA2673£20 5s Southampton Scholar NaN NaN
4 ABBOTT, Mr Ernest Owen 21 Victualling Crew NaN Southampton Lounge Pantry Steward NaN NaN
[19]:
data_trim = data[['Name', 'Age', 'Class/Dept', 'Boat [Body]']]
data_trim.head()
[19]:
Name Age Class/Dept Boat [Body]
0 ABBING, Mr Anthony 42 3rd Class Passenger NaN
1 ABBOTT, Mrs Rhoda Mary 'Rosa' 39 3rd Class Passenger A
2 ABBOTT, Mr Rossmore Edward 16 3rd Class Passenger [190]
3 ABBOTT, Mr Eugene Joseph 13 3rd Class Passenger NaN
4 ABBOTT, Mr Ernest Owen 21 Victualling Crew NaN

Processing data

Special characters are removed, then the age is converted to float, and ages below 1 year (given in months) are converted to a representative fractional value.

[20]:
data_trim['Name'] = data_trim['Name'].map(str).apply(lambda x: x.encode('utf-8').decode('ascii', 'ignore'))
data_trim['Boat [Body]'] = data_trim['Boat [Body]'].map(str).apply(lambda x: x.encode('utf-8').decode('ascii', 'ignore'))
def process_age(value):
    # Ages given in months (e.g. '3m') are converted to a fraction of a year
    if 'm' in value:
        return float(re.findall(r'-?\d+\.?\d*',value)[0])/12
    else:
        return float(value)
data_trim['Age'] = data_trim['Age'].map(str).apply(process_age)
[21]:
data_trim.head()
[21]:
Name Age Class/Dept Boat [Body]
0 ABBING, Mr Anthony 42.0 3rd Class Passenger nan
1 ABBOTT, Mrs Rhoda Mary 'Rosa' 39.0 3rd Class Passenger A
2 ABBOTT, Mr Rossmore Edward 16.0 3rd Class Passenger [190]
3 ABBOTT, Mr Eugene Joseph 13.0 3rd Class Passenger nan
4 ABBOTT, Mr Ernest Owen 21.0 Victualling Crew nan

Categorize passengers and crews

[22]:
def categorizer(value):
    if 'PASSENGER' in value.upper():
        return 'Passenger'
    else:
        return 'Crew'
data_trim['Crew/Passenger'] = data_trim['Class/Dept'].map(str).apply(categorizer)
data_trim.head()
[22]:
Name Age Class/Dept Boat [Body] Crew/Passenger
0 ABBING, Mr Anthony 42.0 3rd Class Passenger nan Passenger
1 ABBOTT, Mrs Rhoda Mary 'Rosa' 39.0 3rd Class Passenger A Passenger
2 ABBOTT, Mr Rossmore Edward 16.0 3rd Class Passenger [190] Passenger
3 ABBOTT, Mr Eugene Joseph 13.0 3rd Class Passenger nan Passenger
4 ABBOTT, Mr Ernest Owen 21.0 Victualling Crew nan Crew

Check passenger ticket class

[23]:
def get_passenger_class(value):
    if 'PASSENGER' in value.upper():
        return value.split(' ')[0]
    else:
        return 'Crew'
data_trim['Class'] = data_trim['Class/Dept'].map(str).apply(get_passenger_class)
data_trim.head()
[23]:
Name Age Class/Dept Boat [Body] Crew/Passenger Class
0 ABBING, Mr Anthony 42.0 3rd Class Passenger nan Passenger 3rd
1 ABBOTT, Mrs Rhoda Mary 'Rosa' 39.0 3rd Class Passenger A Passenger 3rd
2 ABBOTT, Mr Rossmore Edward 16.0 3rd Class Passenger [190] Passenger 3rd
3 ABBOTT, Mr Eugene Joseph 13.0 3rd Class Passenger nan Passenger 3rd
4 ABBOTT, Mr Ernest Owen 21.0 Victualling Crew nan Crew Crew

Check Adult/Child

[24]:
def check_adult(age):
    if age > 18:
        return 'Adult'
    else:
        return 'Child'
data_trim['Adult/Child'] = data_trim['Age'].apply(check_adult)
data_trim.head()
[24]:
Name Age Class/Dept Boat [Body] Crew/Passenger Class Adult/Child
0 ABBING, Mr Anthony 42.0 3rd Class Passenger nan Passenger 3rd Adult
1 ABBOTT, Mrs Rhoda Mary 'Rosa' 39.0 3rd Class Passenger A Passenger 3rd Adult
2 ABBOTT, Mr Rossmore Edward 16.0 3rd Class Passenger [190] Passenger 3rd Child
3 ABBOTT, Mr Eugene Joseph 13.0 3rd Class Passenger nan Passenger 3rd Child
4 ABBOTT, Mr Ernest Owen 21.0 Victualling Crew nan Crew Crew Adult

Check Gender

[25]:
def check_gender(name):
    firstname = name[name.index(',')+2:]
    salutation = firstname.split(' ')[0]
    if salutation.upper() in ['MR', 'MASTER']:
        return 'Male'
    else:
        return 'Female'

data_trim['Gender'] = data_trim['Name'].map(str).apply(check_gender)
data_trim.head()
[25]:
Name Age Class/Dept Boat [Body] Crew/Passenger Class Adult/Child Gender
0 ABBING, Mr Anthony 42.0 3rd Class Passenger nan Passenger 3rd Adult Male
1 ABBOTT, Mrs Rhoda Mary 'Rosa' 39.0 3rd Class Passenger A Passenger 3rd Adult Female
2 ABBOTT, Mr Rossmore Edward 16.0 3rd Class Passenger [190] Passenger 3rd Child Male
3 ABBOTT, Mr Eugene Joseph 13.0 3rd Class Passenger nan Passenger 3rd Child Male
4 ABBOTT, Mr Ernest Owen 21.0 Victualling Crew nan Crew Crew Adult Male
[26]:
def check_survival(value):
    # A missing entry or a bracketed body number means the person did not survive;
    # a boat entry means they were rescued.
    if value.strip()=='nan' or '[' in value:
        return 0
    else:
        return 1
data_trim['Survival'] = data_trim['Boat [Body]'].map(str).apply(check_survival)
data_trim.head()
[26]:
Name Age Class/Dept Boat [Body] Crew/Passenger Class Adult/Child Gender Survival
0 ABBING, Mr Anthony 42.0 3rd Class Passenger nan Passenger 3rd Adult Male 0
1 ABBOTT, Mrs Rhoda Mary 'Rosa' 39.0 3rd Class Passenger A Passenger 3rd Adult Female 1
2 ABBOTT, Mr Rossmore Edward 16.0 3rd Class Passenger [190] Passenger 3rd Child Male 0
3 ABBOTT, Mr Eugene Joseph 13.0 3rd Class Passenger nan Passenger 3rd Child Male 0
4 ABBOTT, Mr Ernest Owen 21.0 Victualling Crew nan Crew Crew Adult Male 0

Check the importance of each feature

[27]:
def entropy_calculate(prob_list):

    entropy = 0
    for item in prob_list:
        entropy -= item * np.log2(item)
    return entropy
[28]:
def information_gain(feature, data, initial_entropy):
    print('Information gain analysis for {}'.format(feature))
    probability_l1 = data.groupby([feature])['Survival'].count()/len(data)
    entropy_feature=0
    for item, probability in probability_l1.items():
        print('\tProbability of {}:'.format(item), probability)
        probability_l2 = data[data[feature]==item].groupby(['Survival'])['Survival'].count()/len(data[data[feature]==item])

        entropy_feature += entropy_calculate(probability_l2) * probability
    print('\n\tEntropy of feature: %.3f'%(entropy_feature))
    print('\tInformation gain: %.3f'%(initial_entropy - entropy_feature))

Information gain analysis

[29]:
probability_initial = data_trim.groupby(['Survival'])['Survival'].count()/len(data_trim)
initial_entropy = entropy_calculate(probability_initial)
print('Initial entropy: %.3f'%(initial_entropy))
Initial entropy: 0.815
[30]:
information_gain(feature='Crew/Passenger', data=data_trim, initial_entropy=initial_entropy)
Information gain analysis for Crew/Passenger
        Probability of Crew: 0.4543987086359968
        Probability of Passenger: 0.5456012913640033

        Entropy of feature: 0.772
        Information gain: 0.043
[31]:
information_gain(feature='Class', data=data_trim, initial_entropy=initial_entropy)
Information gain analysis for Class
        Probability of 1st: 0.14124293785310735
        Probability of 2nd: 0.11824051654560129
        Probability of 3rd: 0.2861178369652946
        Probability of Crew: 0.4543987086359968

        Entropy of feature: 0.738
        Information gain: 0.077
[32]:
information_gain(feature='Adult/Child', data=data_trim, initial_entropy=initial_entropy)
Information gain analysis for Adult/Child
        Probability of Adult: 0.880548829701372
        Probability of Child: 0.11945117029862792

        Entropy of feature: 0.813
        Information gain: 0.003
[33]:
information_gain(feature='Gender', data=data_trim, initial_entropy=initial_entropy)
Information gain analysis for Gender
        Probability of Female: 0.23284907183212267
        Probability of Male: 0.7671509281678773

        Entropy of feature: 0.702
        Information gain: 0.113

Pick features and label to construct the decision tree

[34]:
training_data = data_trim[['Age', 'Crew/Passenger', 'Class', 'Adult/Child', 'Gender','Survival']]
training_data.head()
[34]:
Age Crew/Passenger Class Adult/Child Gender Survival
0 42.0 Passenger 3rd Adult Male 0
1 39.0 Passenger 3rd Adult Female 1
2 16.0 Passenger 3rd Child Male 0
3 13.0 Passenger 3rd Child Male 0
4 21.0 Crew Crew Adult Male 0

Convert categorical variables to numerical codes

[35]:
category_map ={}
for column in ['Crew/Passenger','Class','Adult/Child','Gender']:
    category_map[column] = dict( enumerate(training_data[column].astype('category').cat.categories) )
    training_data[column] = training_data[column].astype('category').cat.codes
[36]:
training_data.head()
[36]:
Age Crew/Passenger Class Adult/Child Gender Survival
0 42.0 1 2 0 1 0
1 39.0 1 2 0 0 1
2 16.0 1 2 1 1 0
3 13.0 1 2 1 1 0
4 21.0 0 3 0 1 0
[37]:
training_data.dropna(inplace=True)
training_data.reset_index(drop=True, inplace=True)
print('Total number of valid records: {}'.format(len(training_data)))
Total number of valid records: 2449

Split to train and test data

[38]:
from sklearn.model_selection import train_test_split
[39]:
train, test = train_test_split(training_data, test_size=0.2)
train.reset_index(drop=True, inplace=True)
test.reset_index(drop=True, inplace=True)
[40]:
print('Total number of records used for training: {}\nTotal number of records used for testing: {}'.format(len(train),len(test)))
Total number of records used for training: 1959
Total number of records used for testing: 490

Build a decision tree

[41]:
from sklearn.tree import DecisionTreeClassifier
[42]:
X = train[['Age', 'Crew/Passenger', 'Class', 'Adult/Child', 'Gender']]
y = train[['Survival']]
[43]:
clf = DecisionTreeClassifier(max_leaf_nodes=20, criterion='entropy')
[44]:
clf = clf.fit(X, y)
[45]:
clf
[45]:
DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=None,
            max_features=None, max_leaf_nodes=20,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')
  • class_weight: A dictionary of user-defined weights for each class

  • criterion: The function used to measure the quality of a split at each decision node; Gini impurity and information gain/entropy are the two main options

  • max_depth: The maximum depth of the tree, i.e. the number of levels from the root to the deepest leaf

  • max_features: Limits the number of features considered when searching for the best split. This can act as a dimension-reduction approach that simplifies the tree

  • max_leaf_nodes: The maximum number of end (leaf) nodes, which is another stopping criterion for the decision tree

  • min_impurity_split: The impurity threshold a node must exceed before it is split into another branch. With the entropy criterion, impurity is measured by entropy

  • min_samples_leaf: The minimum number of samples that must remain in each subset created by a decision. If a split would leave fewer samples, the branch is pruned and the previous level of the tree is kept

  • min_samples_split: The minimum number of samples a node must contain before further splitting is considered

  • min_weight_fraction_leaf: Similar to min_samples_leaf, but expressed as a fraction of the whole data set (of the total sample weights)

  • random_state, presort and splitter mainly control reproducibility and computational performance rather than the splitting criterion itself
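
For illustration, a few of these parameters can be passed explicitly when constructing a tree; the values below are arbitrary examples for demonstration, not tuned settings:

clf_example = DecisionTreeClassifier(
    criterion='entropy',       # measure split quality by information gain
    max_depth=5,               # stop after 5 levels from the root
    max_leaf_nodes=20,         # at most 20 end nodes
    min_samples_leaf=10,       # every leaf must keep at least 10 records
    class_weight={0: 1, 1: 2}  # hypothetical: weight the survived class twice as heavily
)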

[46]:
clf.feature_importances_
[46]:
array([0.20502926, 0.        , 0.28082862, 0.        , 0.51414211])

As can be seen above, the importance analysis confirms the result of the information gain analysis: Gender is the most important feature.
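
To make it easier to see which importance value belongs to which column, the array can be paired with the feature names (a small sketch; the exact numbers depend on the random train/test split):

feature_names = ['Age', 'Crew/Passenger', 'Class', 'Adult/Child', 'Gender']
for name, importance in zip(feature_names, clf.feature_importances_):
    print('{}: {:.3f}'.format(name, importance))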

Tree visualization

[47]:
from sklearn import tree
[48]:
with open ('Outputs/Titanic.dot', 'w') as f:
    f = tree.export_graphviz(clf, feature_names=['Age', 'Crew/Passenger', 'Class', 'Adult/Child', 'Gender'], out_file=f)
[49]:
# conda install -c conda-forge python-graphviz
from graphviz import Source
[50]:
s = Source.from_file('Outputs/Titanic.dot')
s.render() # saves a pdf file
s
[50]:
(Output: rendered decision tree graph.)

Evaluate the tree

Using train and test data sets

[51]:
from sklearn.metrics import accuracy_score
[52]:
predictions = clf.predict(test.drop(['Survival'], axis=1))
[53]:
accuracy_score(test['Survival'],predictions)
[53]:
0.8224489795918367

It is possible to tune the decision tree parameters based on the accuracy measure. For example, max_leaf_nodes can be tuned to find the value that gives the maximum accuracy, as in the sketch below.
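
A minimal sketch of such tuning, reusing the X, y training data and the accuracy_score import from the surrounding cells, loops over candidate max_leaf_nodes values and keeps the best one (the candidate list is arbitrary):

best_n, best_acc = None, 0
for n_leaves in [5, 10, 20, 40, 80]:
    candidate = DecisionTreeClassifier(max_leaf_nodes=n_leaves, criterion='entropy')
    candidate.fit(X, y)
    acc = accuracy_score(test['Survival'], candidate.predict(test.drop(['Survival'], axis=1)))
    if acc > best_acc:
        best_n, best_acc = n_leaves, acc
print('Best max_leaf_nodes: {} (test accuracy {:.3f})'.format(best_n, best_acc))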

K-fold cross validation

[54]:
from sklearn.model_selection import cross_val_predict, cross_val_score
[55]:
X = training_data[['Age', 'Crew/Passenger', 'Class', 'Adult/Child', 'Gender']]
y = training_data[['Survival']]
[56]:
cross_val_score(clf, X, y, cv=5)
[56]:
array([0.82244898, 0.81428571, 0.79795918, 0.78367347, 0.81799591])
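
The five fold scores above can be summarized by their mean and spread, for example:

scores = cross_val_score(clf, X, y, cv=5)
print('Mean accuracy: %.3f (+/- %.3f)' % (scores.mean(), scores.std()))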

Ensemble learning to avoid over fitting

Ensemble learning relies on using multiple instances of a machine learning algorithm (e.g. a decision tree) that are trained on different subsets of the available training data. The algorithm parameters and the features used may also differ between instances.

Each individual instance might be over- or under-fitted to the data it was trained on. However, the ensemble response is determined by a majority-rule decision.

In other words, an ensemble can be defined as a collection of models (even of different types) that solve the same problem, where the final answer is obtained by combining the responses of all the models.
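
As a concrete illustration, the sketch below is a simplified hand-rolled bagging ensemble (bagged_majority_vote is a hypothetical helper, not the scikit-learn estimator used in the next section). It reuses the train/test split and imports from the cells above: each tree is trained on a bootstrap sample of the training data and the 0/1 predictions are combined by majority vote.

def bagged_majority_vote(train_df, test_X, features, label, n_trees=25):
    # Train each tree on a bootstrap sample (sampling rows with replacement),
    # then take the majority vote of the trees' 0/1 predictions for each record.
    votes = np.zeros(len(test_X))
    for _ in range(n_trees):
        sample = train_df.sample(frac=1.0, replace=True)
        tree = DecisionTreeClassifier(max_leaf_nodes=20, criterion='entropy')
        tree.fit(sample[features], sample[label])
        votes += tree.predict(test_X)
    return (votes > n_trees / 2).astype(int)   # majority of trees voted "survived"

features = ['Age', 'Crew/Passenger', 'Class', 'Adult/Child', 'Gender']
ensemble_pred = bagged_majority_vote(train, test[features], features, 'Survival')
print('Bagged ensemble accuracy: %.3f' % accuracy_score(test['Survival'], ensemble_pred))

scikit-learn's RandomForestClassifier, introduced below, performs this bagging internally and additionally samples features at each split.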

Random Forest

A collection of decision trees is built and trained independently, and their combined output provides the final response.

  • Each tree in the ensemble is built from a different subset of the training set (using the bagging (bootstrap aggregating) technique)

  • Each tree in the ensemble is built using a different subset of the features (using the random subspace technique)

[90]:
from sklearn.ensemble import RandomForestClassifier
[91]:
clf = RandomForestClassifier()
clf
[91]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

Most of the parameters are related to settings of a decision tree. Settings related to the ensemble learning (i.e. the random forest) are explained below.

  • bootstrap: Using the bootstrap technique to randomly select different subsets of data for each tree

  • n_estimators: number of trees

[92]:
clf.fit(X = train[['Age', 'Crew/Passenger', 'Class', 'Adult/Child', 'Gender']], y=train[['Survival']])
predictions = clf.predict(test.drop(['Survival'], axis=1))
print('Accuracy : %.2f'%(accuracy_score(test['Survival'], predictions)))
Accuracy : 0.81
[93]:
for feature, importance in zip(['Age', 'Crew/Passenger', 'Class', 'Adult/Child', 'Gender'],clf.feature_importances_):
    print('Importance for {} : {:.3f}'.format(feature,importance))
Importance for Age : 0.450
Importance for Crew/Passenger : 0.052
Importance for Class : 0.185
Importance for Adult/Child : 0.009
Importance for Gender : 0.304
[94]:
# Cross-validated accuracy on the training data for a given number of trees
def checkAccuracy(n_estimators=10):
    clf = RandomForestClassifier(n_estimators=n_estimators, criterion='entropy')
    X = train[['Age', 'Crew/Passenger', 'Class', 'Adult/Child', 'Gender']]
    y = train[['Survival']]
    return cross_val_score(clf, X, y, cv=5).mean()
[95]:
result ={'Number of trees':[], 'AccuracyScore': []}
for n_tree in np.arange(1, 500, 10):
    result['Number of trees'].append(n_tree)
    result['AccuracyScore'].append(checkAccuracy(n_estimators=n_tree))

result_df = pd.DataFrame.from_dict(result)
result_df.head()
[95]:
Number of trees AccuracyScore
0 1 0.773856
1 11 0.789191
2 21 0.792767
3 31 0.789701
4 41 0.788175
[96]:
result_df.plot(x='Number of trees', y='AccuracyScore', figsize=(10,5), grid=True, lw=3, color='orange')
[96]:
<matplotlib.axes._subplots.AxesSubplot at 0x14197cd6908>
(Output: plot of accuracy score vs. number of trees.)

As can be seen above, adding more trees provides better classification accuracy on average. However, this does not mean that adding more trees will always result in better accuracy; the outcome also depends on the cross-validation population.
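
For example, the best-scoring tree count in the sweep above can be read off directly from the result table:

best_row = result_df.loc[result_df['AccuracyScore'].idxmax()]
print('Best accuracy %.3f with %d trees' % (best_row['AccuracyScore'], best_row['Number of trees']))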

Gradient Boosted Trees

Each tree is built based on what was learned by the previous tree.

Each tree is trained using a different subset of data. However, boosting and gradient-descent techniques are used to build those subsets. After the bootstrap for one tree, gradient descent adjusts the probability of a data point being included in the next training set, and this continues sequentially. While the first bootstrap is based on uniform sampling, each subsequent one is modified based on what was learned from the previous tree's classification results. Data points that are classified correctly get a lower weight for the next bootstrap, which increases the likelihood of picking data that the previous tree did not classify correctly. Finally, each tree is given its own weight in the final majority-rule decision.

In a random forest there is a "democracy": all trees have the same voting weight. In the gradient boosted approach, however, trees with better performance/accuracy get a higher weight. Gradient descent is one of the techniques that can be used to quantify the probability of choosing a data point for the next training subset based on the result of the previous tree.
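
The reweighting idea described above can be sketched in a few lines. This is a simplified, AdaBoost-style illustration only (boosting_sketch is a hypothetical helper; XGBoost itself fits each new tree to gradients of the loss rather than literally reweighting samples), reusing numpy and DecisionTreeClassifier imported earlier:

def boosting_sketch(X, y, n_rounds=5):
    # y is assumed to be a 0/1 numpy array; start with uniform sample weights.
    y = np.asarray(y)
    sample_weight = np.full(len(y), 1.0 / len(y))
    trees, tree_votes = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=sample_weight)
        pred = stump.predict(X)
        miss = (pred != y)
        err = np.clip(np.sum(sample_weight[miss]), 1e-10, 1 - 1e-10)
        vote = 0.5 * np.log((1 - err) / err)                  # better trees get a larger voting weight
        sample_weight *= np.exp(np.where(miss, vote, -vote))  # up-weight misclassified points
        sample_weight /= sample_weight.sum()
        trees.append(stump)
        tree_votes.append(vote)
    return trees, tree_votes

cols = ['Age', 'Crew/Passenger', 'Class', 'Adult/Child', 'Gender']
trees, votes = boosting_sketch(train[cols], train['Survival'].values, n_rounds=5)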

For Windows installation:

conda install -c anaconda py-xgboost

A rule of thumb is to use a lower number of trees when dealing with a small training data set.

[64]:
from xgboost.sklearn import XGBClassifier
[65]:
clf = XGBClassifier()
clf
[65]:
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='binary:logistic', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1)
  • colsample_bylevel: Fraction of features randomly sampled at each level of a tree (colsample_bytree is the corresponding per-tree fraction)

  • learning_rate: Shrinks the contribution of each new tree; lower values make the boosting more conservative and put less trust in any single tree, but typically require more trees

  • subsample: Fraction of the provided training data set used for each tree. Lowering this number makes the training sets used by the individual trees more distinct from each other

  • gamma: Minimum loss reduction required to split the tree further

  • n_estimators: Number of trees
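
As with the random forest, these settings can be passed explicitly when constructing the classifier; the values below are purely illustrative, not tuned:

clf_xgb = XGBClassifier(
    n_estimators=200,      # number of boosted trees
    learning_rate=0.05,    # shrink each tree's contribution
    subsample=0.8,         # each tree sees 80% of the training rows
    colsample_bylevel=0.8, # 80% of the features sampled at each tree level
    gamma=0.5,             # minimum loss reduction required to split further
    max_depth=4            # depth limit for the individual trees
)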

[66]:
clf.fit(X = train[['Age', 'Crew/Passenger', 'Class', 'Adult/Child', 'Gender']], y=train[['Survival']])
predictions = clf.predict(test.drop(['Survival'], axis=1))
print('Accuracy : %.2f'%(accuracy_score(test['Survival'], predictions)))
Accuracy : 0.82
[67]:
for feature, importance in zip(['Age', 'Crew/Passenger', 'Class', 'Adult/Child', 'Gender'],clf.feature_importances_):
    print('Importance for {} : {:.3f}'.format(feature,importance))
Importance for Age : 0.629
Importance for Crew/Passenger : 0.057
Importance for Class : 0.221
Importance for Adult/Child : 0.000
Importance for Gender : 0.093

Hyperparameter Tuning

Considering the many parameters that affect the performance of an individual decision tree, plus the parameters that control the ensemble learning, there are many combinations to check in order to find the best accuracy on the train/test data set. Hyperparameter tuning uses optimization techniques to find an optimal combination.

[68]:
# conda install -c jaikumarm hyperopt
from hyperopt import fmin, tpe, hp, STATUS_OK, Trials
[69]:
# provide the optimization range/population
control = {
    'n_estimators': hp.quniform('n_estimators',100,1000,1)
    ,'learning_rate': hp.quniform('learning_rate', 0.1,0.4,0.1)
    ,'max_depth': hp.quniform('max_depth',1,15,1)
    ,'min_child_weight': hp.quniform('min_child_weight',1,6,1)
    ,'subsample': hp.quniform('subsample',0.5,1,0.05)
    ,'gamma': hp.quniform('gamma', 0.5,1,0.05)
    ,'colsample_bytree': hp.quniform('colsample_bytree',0.5,1,0.05)
    ,'nthread':5
    ,'silent':1
}
[70]:
def scorevalue(clf):
    # Mean 5-fold cross-validation accuracy on the training data
    X = train[['Age', 'Crew/Passenger', 'Class', 'Adult/Child', 'Gender']]
    y = train[['Survival']]
    return cross_val_score(clf, X, y, cv=5).mean()
[71]:
# Define the loss that hyperopt minimizes: 1 - cross-validated accuracy
def score(params):
    params['n_estimators'] = int(params['n_estimators'])
    params['max_depth'] = int(params['max_depth'])
    clf = XGBClassifier(**params)
    return {'loss':1-scorevalue(clf), 'status':STATUS_OK}
[72]:
trials = Trials()
[73]:
best_param_set = fmin(score, control, algo=tpe.suggest, trials=trials, max_evals=250)
[74]:
print('Optimized parameters: ')
for key, value in best_param_set.items():
    print('{} = {}'.format(key,value))
Optimized parameters:
colsample_bytree = 1.0
gamma = 1.0
learning_rate = 0.2
max_depth = 10.0
min_child_weight = 5.0
n_estimators = 583.0
subsample = 1.0
[75]:
print('Optimized Accuracy: {:.3f}'.format(1-score(best_param_set)['loss']))
Optimized Accuracy: 0.816