Statistical Analysis for Income Data

Author: Mostafa Hadavand
Date: 2018-03-03

This notebook presents an example data exploration for CSD income data that can be downloaded from the following link.

Income Data

[1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
%reload_ext autoreload
%autoreload 2

%pylab inline --no-import-all
Populating the interactive namespace from numpy and matplotlib

Load the data

[2]:
df_CSD = pd.read_csv('Data/Census2016_AB_CSD.csv',encoding='latin1')
df_CSD.head()
[2]:
CENSUS_YEAR GEO_CODE GEO_LEVEL GEO_NAME GNR GNR_LF DATA_QUALITY_FLAG CSD_TYPE_NAME ALT_GEO_CODE Dimension Member_ID Notes Total Male Female
0 2016 4801003 3 Cypress County 4.3 5.6 0 MD 4801003 Population, 2016 (1) 1 1.0 7662 ... ...
1 2016 4801003 3 Cypress County 4.3 5.6 0 MD 4801003 Population, 2011 (2) 2 2.0 7214 ... ...
2 2016 4801003 3 Cypress County 4.3 5.6 0 MD 4801003 Population percentage change, 2011 to 2016 3 NaN 6.2 ... ...
3 2016 4801003 3 Cypress County 4.3 5.6 0 MD 4801003 Total private dwellings (3) 4 3.0 3222 ... ...
4 2016 4801003 3 Cypress County 4.3 5.6 0 MD 4801003 Private dwellings occupied by usual residents (4) 5 4.0 2689 ... ...

Check the Member_IDs corresponding to median values

[3]:
df_CSD[['Dimension','Member_ID','Total']][df_CSD['Dimension'].str.match('Median')]\
.groupby(['Dimension','Member_ID']).count()
[3]:
Total
Dimension Member_ID
Median after-tax income in 2015 among recipients ($) 665 425
Median after-tax income in 2015 for persons aged 15 years and over not in economic families ($) (51) 830 425
Median after-tax income of couple economic families with children in 2015 ($) 810 425
Median after-tax income of couple economic families without children or other relatives in 2015 ($) 806 425
Median after-tax income of economic families in 2015 ($) 802 425
Median after-tax income of households in 2015 ($) 743 425
Median after-tax income of lone-parent economic families in 2015 ($) 814 425
Median after-tax income of one-person households in 2015 ($) 746 425
Median after-tax income of two-or-more-person households in 2015 ($) 749 425
Median age of the population (6) 40 425
Median employment income in 2015 among recipients ($) 671 425
Median employment income in 2015 for full-year full-time workers ($) (34) 685 425
Median government transfers in 2015 among recipients ($) 669 425
Median market income in 2015 among recipients ($) 667 425
Median monthly shelter costs for owned dwellings ($) (170) 1674 425
Median monthly shelter costs for rented dwellings ($) (176) 1681 425
Median total income in 2015 among recipients ($) 663 425
Median total income in 2015 for persons aged 15 years and over not in economic families ($) (50) 829 425
Median total income of couple economic families with children in 2015 ($) 809 425
Median total income of couple economic families without children or other relatives in 2015 ($) 805 425
Median total income of economic families in 2015 ($) 801 425
Median total income of households in 2015 ($) 742 425
Median total income of lone-parent economic families in 2015 ($) 813 425
Median total income of one-person households in 2015 ($) 745 425
Median total income of two-or-more-person households in 2015 ($) 748 425
Median value of dwellings ($) (172) 1676 425

Check Spatial Data

The corresponding shape file is loaded to study the Census Subdivisions. It is required to install geopandas.

[4]:
import geopandas as gpd
shp = 'Data/Alberta_CSD2016/Alberta_CSD2016.shp'
df_spatial = gpd.read_file(shp)
df_spatial.head()
[4]:
CSDUID CSDNAME CSDTYPE PRUID PRNAME CDUID CDNAME CDTYPE CCSUID CCSNAME ... ERNAME SACCODE SACTYPE CMAUID CMAPUID CMANAME CMATYPE Shape_Leng Shape_Area geometry
0 4815023 Canmore T 48 Alberta 4815 Division No. 15 CDR 4815015 Bighorn No. 8 ... Banff--Jasper--Rocky Mountain House 828 3 828 48828 Canmore D 65444.606318 6.938296e+07 POLYGON ((470559.3832999999 5662470.932800001,...
1 4815027 Ghost Lake SV 48 Alberta 4815 Division No. 15 CDR 4815015 Bighorn No. 8 ... Banff--Jasper--Rocky Mountain House 999 7 None None None None 5207.421246 6.681321e+05 POLYGON ((516587.8195000002 5671021.1709, 5173...
2 4812808 White Fish Lake 128 IRI 48 Alberta 4812 Division No. 12 CDR 4812014 St. Paul County No. 19 ... Wood Buffalo--Cold Lake 997 5 None None None None 58687.129982 6.111809e+07 POLYGON ((706392.4102999996 6025299.4767, 7076...
3 4812810 Cold Lake 149 IRI 48 Alberta 4812 Division No. 12 CDR 4812004 Bonnyville No. 87 ... Wood Buffalo--Cold Lake 997 5 None None None None 48773.548098 1.473223e+08 POLYGON ((808718.5225999998 6031058.0693, 8089...
4 4812811 Kehewin 123 IRI 48 Alberta 4812 Division No. 12 CDR 4812004 Bonnyville No. 87 ... Wood Buffalo--Cold Lake 997 5 None None None None 48449.061245 8.484803e+07 POLYGON ((769856.7632999998 6006472.9333, 7698...

5 rows × 21 columns

[5]:
df_spatial_AB = df_spatial[df_spatial.PRUID=='48']
print('Total Census Subdivision = %i'%(len(df_spatial_AB)))
df_spatial_AB['CSDUID']=df_spatial_AB['CSDUID'].astype(int)
df_spatial_AB.head()
Total Census Subdivision = 425
[5]:
CSDUID CSDNAME CSDTYPE PRUID PRNAME CDUID CDNAME CDTYPE CCSUID CCSNAME ... ERNAME SACCODE SACTYPE CMAUID CMAPUID CMANAME CMATYPE Shape_Leng Shape_Area geometry
0 4815023 Canmore T 48 Alberta 4815 Division No. 15 CDR 4815015 Bighorn No. 8 ... Banff--Jasper--Rocky Mountain House 828 3 828 48828 Canmore D 65444.606318 6.938296e+07 POLYGON ((470559.3832999999 5662470.932800001,...
1 4815027 Ghost Lake SV 48 Alberta 4815 Division No. 15 CDR 4815015 Bighorn No. 8 ... Banff--Jasper--Rocky Mountain House 999 7 None None None None 5207.421246 6.681321e+05 POLYGON ((516587.8195000002 5671021.1709, 5173...
2 4812808 White Fish Lake 128 IRI 48 Alberta 4812 Division No. 12 CDR 4812014 St. Paul County No. 19 ... Wood Buffalo--Cold Lake 997 5 None None None None 58687.129982 6.111809e+07 POLYGON ((706392.4102999996 6025299.4767, 7076...
3 4812810 Cold Lake 149 IRI 48 Alberta 4812 Division No. 12 CDR 4812004 Bonnyville No. 87 ... Wood Buffalo--Cold Lake 997 5 None None None None 48773.548098 1.473223e+08 POLYGON ((808718.5225999998 6031058.0693, 8089...
4 4812811 Kehewin 123 IRI 48 Alberta 4812 Division No. 12 CDR 4812004 Bonnyville No. 87 ... Wood Buffalo--Cold Lake 997 5 None None None None 48449.061245 8.484803e+07 POLYGON ((769856.7632999998 6006472.9333, 7698...

5 rows × 21 columns

Extract Median Data

Choose a Subset

df_663 is a subset of data that contains data for median total income in 2015 among recipients.

[6]:
for item in np.unique(df_CSD[['Dimension']][df_CSD['Member_ID']==663]):
    print(item)
Median total income in 2015 among recipients ($)
[7]:
df_663 = df_CSD[['GEO_CODE', 'GEO_NAME', 'Total', 'Male', 'Female', 'Member_ID']][df_CSD['Member_ID']==663]
df_663.head()
[7]:
GEO_CODE GEO_NAME Total Male Female Member_ID
662 4801003 Cypress County 41920 55488 30816 663
2909 4801006 Medicine Hat 36819 49477 28395 663
5156 4801008 Forty Mile County No. 8 35547 44608 25626 663
7403 4801009 Foremost 44096 60608 29824 663
9650 4801014 Bow Island 32064 46592 21056 663

Data quality check

Check if all the rows contain a numeric value

[8]:
df_663[df_663.Total.apply(lambda x: not x.isnumeric())].head()
[8]:
GEO_CODE GEO_NAME Total Male Female Member_ID
16391 4802002 Coutts x x x 663
68072 4803003 Improvement District No. 4 Waterton x x x 663
74813 4803008 Hill Spring x x x 663
81554 4803016 Cowley x x x 663
101777 4803803 Blood 148A x x x 663
[9]:
np.unique(df_663['Total'][df_663.Total.apply(lambda x: not x.isnumeric())])
[9]:
array(['..', 'F', 'x'], dtype=object)
[10]:
df_663_trim = df_663[df_663.Total.apply(lambda x: x.isnumeric())]
df_663_trim.reset_index(drop=True, inplace=True)
df_663_trim[['Total', 'Male', 'Female']] = df_663_trim[['Total', 'Male', 'Female']].astype(float)
df_663_trim.head()
[10]:
GEO_CODE GEO_NAME Total Male Female Member_ID
0 4801003 Cypress County 41920.0 55488.0 30816.0 663
1 4801006 Medicine Hat 36819.0 49477.0 28395.0 663
2 4801008 Forty Mile County No. 8 35547.0 44608.0 25626.0 663
3 4801009 Foremost 44096.0 60608.0 29824.0 663
4 4801014 Bow Island 32064.0 46592.0 21056.0 663

Alternatives for histogram plot

Using pandas built-in plot

[11]:
df_663_trim.Total.plot.hist(bins=20)
[11]:
<matplotlib.axes._subplots.AxesSubplot at 0x1d2c9c517f0>
../_images/notebooks_IncomeDataAnalysis_19_1.png

Using matplotlib

[12]:
fig = plt.figure(figsize=(15,10))

ax = fig.add_subplot(221)
hist1 = ax.hist(df_663_trim['Total'], bins=20, color='green')
ax.set_title('Total', fontsize=18); ax.set_xlabel('Median Total Income (2015)'); ax.set_ylabel('Frequency')

ax = fig.add_subplot(222)
hist2 = ax.hist(df_663_trim['Male'], bins=20, color='blue')
ax.set_title('Female', fontsize=18); ax.set_xlabel('Median Total Income (2015)'); ax.set_ylabel('Frequency')

ax = fig.add_subplot(223)
hist3 = ax.hist(df_663_trim['Female'], bins=20, color='red')
ax.set_title('Male', fontsize=18); ax.set_xlabel('Median Total Income (2015)'); ax.set_ylabel('Frequency')

plt.tight_layout()
../_images/notebooks_IncomeDataAnalysis_21_0.png

Using Bokeh

[13]:
from bokeh.io import output_notebook
from bokeh.layouts import gridplot
from bokeh.plotting import figure, show
from bokeh.sampledata.autompg import autompg as df
output_notebook()
Loading BokehJS ...
[14]:
# Histogram Total
p1 = figure(title="Histogram (Total)",tools="pan,wheel_zoom,box_zoom,reset,save", background_fill_color="#D0D3D4")
hist, edges = np.histogram(df_663_trim.Total, density=False, bins=20)
p1.quad(top=hist, bottom=0, left=edges[:-1], right=edges[1:],
        fill_color="green", line_color="#033649")

# Histogram Male
p2 = figure(title="Histogram (Male)",tools="pan,wheel_zoom,box_zoom,reset", background_fill_color="#D0D3D4")
hist, edges = np.histogram(df_663_trim.Male, density=False, bins=20)
p2.quad(top=hist, bottom=0, left=edges[:-1], right=edges[1:],
        fill_color="blue", line_color="#033649")

# Histogram Female
p3 = figure(title="Histogram (Female)",tools="pan,wheel_zoom,box_zoom,reset", background_fill_color="#D0D3D4")
hist, edges = np.histogram(df_663_trim.Female, density=False, bins=20)
p3.quad(top=hist, bottom=0, left=edges[:-1], right=edges[1:],
        fill_color="red", line_color="#033649")

show(gridplot(p1,p2,p3, ncols=2, plot_width=400, plot_height=400, toolbar_location='right'))

Custom Plots

[15]:
import PythonTools as PT
[16]:
fig = plt.figure(figsize=(15,10))

main_title = df_CSD['Dimension'][df_CSD.Member_ID==663].as_matrix()[0]

ax = fig.add_subplot(221)
PT.pdf_plot(df_663_trim['Total'], bw=-1, ax=ax, line_color='gray', labelfont=14,
            xlabel=main_title+', Total', barcolor='green', kde_color='black')

ax = fig.add_subplot(222)
PT.pdf_plot(df_663_trim['Male'], bw=-1, ax=ax, line_color='gray',labelfont=14,
            xlabel=main_title+', Male', barcolor='blue', kde_color='black')

ax = fig.add_subplot(223)
PT.pdf_plot(df_663_trim['Female'], bw=-1, ax=ax, line_color='gray',labelfont=14,
            xlabel=main_title+', Female', barcolor='red', kde_color='black')

ax = fig.add_subplot(224)
PT.bv_plot(df_663_trim['Male'],df_663_trim['Female'],bw=-1, ax=ax,
           xlabel='Male', label_font_size= 14,
           ylabel='Female', fontsize=14)

plt.tight_layout()
../_images/notebooks_IncomeDataAnalysis_27_0.png

Location Map

[17]:
df_663_trim_spatial = df_spatial_AB.merge(df_663_trim, how='inner', left_on='CSDUID',
                                          right_on='GEO_CODE',suffixes=('_x', '_y'))
[18]:
print('The total numbee of census subdivision with data: %i'%(len(df_663_trim_spatial)))
The total numbee of census subdivision with data: 299
[19]:
fig = plt.figure(figsize=(15,20))

ax = fig.add_subplot(221)
df_663_trim_spatial.plot(column='Total', cmap='hot_r',ax=ax, legend=True)
ax.set_frame_on(False); ax.axes.get_yaxis().set_visible(False); ax.axes.get_xaxis().set_visible(False)
ax.set_title('Total')

ax = fig.add_subplot(222)
df_663_trim_spatial.plot(column='Male', cmap='viridis',ax=ax, legend=True)
ax.set_frame_on(False); ax.axes.get_yaxis().set_visible(False); ax.axes.get_xaxis().set_visible(False)
ax.set_title('Male')

ax = fig.add_subplot(223)
df_663_trim_spatial.plot(column='Female', cmap='inferno',ax=ax, legend=True)
ax.set_frame_on(False); ax.axes.get_yaxis().set_visible(False); ax.axes.get_xaxis().set_visible(False)
ax.set_title('Female')
[19]:
Text(0.5,1,'Female')
../_images/notebooks_IncomeDataAnalysis_31_1.png

Consider writing a wrapper (repeatability)

[24]:
from Accessories import summary_stat as ss
[25]:
ss.summary_stat(df_CSD,663,outfl=True)
../_images/notebooks_IncomeDataAnalysis_34_0.png
[26]:
ss.summary_stat(df_CSD,665,outfl=True)
../_images/notebooks_IncomeDataAnalysis_35_0.png
[27]:
ss.summary_stat(df_CSD, 742,outfl=True)
../_images/notebooks_IncomeDataAnalysis_36_0.png
[28]:
ss.summary_stat(df_CSD, 829,outfl=True)
../_images/notebooks_IncomeDataAnalysis_37_0.png
[29]:
ss.summary_stat(df_CSD, 671,outfl=True)
../_images/notebooks_IncomeDataAnalysis_38_0.png
[ ]: