Statistical Analysis for Income Data¶

Author: Mostafa Hadavand

Date: 2018-03-03

This notebook presents an example data exploration for CSD income data that can be downloaded from the following link.

[1]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
%reload_ext autoreload
%autoreload 2

%pylab inline --no-import-all

Populating the interactive namespace from numpy and matplotlib

Load the data¶

[2]:

df_CSD = pd.read_csv('Data/Census2016_AB_CSD.csv',encoding='latin1')
df_CSD.head()

[2]:

	CENSUS_YEAR	GEO_CODE	GEO_LEVEL	GEO_NAME	GNR	GNR_LF	CSD_TYPE_NAME	ALT_GEO_CODE	Dimension	Member_ID	Notes	Total	Male	Female
0	2016	4801003	3	Cypress County	4.3	5.6	MD	4801003	Population, 2016 (1)	1	1.0	7662	...	...
1	2016	4801003	3	Cypress County	4.3	5.6	MD	4801003	Population, 2011 (2)	2	2.0	7214	...	...
2	2016	4801003	3	Cypress County	4.3	5.6	MD	4801003	Population percentage change, 2011 to 2016	3	NaN	6.2	...	...
3	2016	4801003	3	Cypress County	4.3	5.6	MD	4801003	Total private dwellings (3)	4	3.0	3222	...	...
4	2016	4801003	3	Cypress County	4.3	5.6	MD	4801003	Private dwellings occupied by usual residents (4)	5	4.0	2689	...	...

Check the Member_IDs corresponding to median values¶

[3]:

df_CSD[['Dimension','Member_ID','Total']][df_CSD['Dimension'].str.match('Median')]\
.groupby(['Dimension','Member_ID']).count()

[3]:

		Total
Dimension	Member_ID
Median after-tax income in 2015 among recipients ($)	665	425
Median after-tax income in 2015 for persons aged 15 years and over not in economic families ($) (51)	830	425
Median after-tax income of couple economic families with children in 2015 ($)	810	425
Median after-tax income of couple economic families without children or other relatives in 2015 ($)	806	425
Median after-tax income of economic families in 2015 ($)	802	425
Median after-tax income of households in 2015 ($)	743	425
Median after-tax income of lone-parent economic families in 2015 ($)	814	425
Median after-tax income of one-person households in 2015 ($)	746	425
Median after-tax income of two-or-more-person households in 2015 ($)	749	425
Median age of the population (6)	40	425
Median employment income in 2015 among recipients ($)	671	425
Median employment income in 2015 for full-year full-time workers ($) (34)	685	425
Median government transfers in 2015 among recipients ($)	669	425
Median market income in 2015 among recipients ($)	667	425
Median monthly shelter costs for owned dwellings ($) (170)	1674	425
Median monthly shelter costs for rented dwellings ($) (176)	1681	425
Median total income in 2015 among recipients ($)	663	425
Median total income in 2015 for persons aged 15 years and over not in economic families ($) (50)	829	425
Median total income of couple economic families with children in 2015 ($)	809	425
Median total income of couple economic families without children or other relatives in 2015 ($)	805	425
Median total income of economic families in 2015 ($)	801	425
Median total income of households in 2015 ($)	742	425
Median total income of lone-parent economic families in 2015 ($)	813	425
Median total income of one-person households in 2015 ($)	745	425
Median total income of two-or-more-person households in 2015 ($)	748	425
Median value of dwellings ($) (172)	1676	425

Check Spatial Data¶

The corresponding shape file is loaded to study the Census Subdivisions. It is required to install geopandas.

[4]:

import geopandas as gpd
shp = 'Data/Alberta_CSD2016/Alberta_CSD2016.shp'
df_spatial = gpd.read_file(shp)
df_spatial.head()

[4]:

	CSDUID	CSDNAME	CSDTYPE	PRUID	PRNAME	CDUID	CDNAME	CDTYPE	CCSUID	CCSNAME	...	ERNAME	SACCODE	SACTYPE	CMAUID	CMAPUID	CMANAME	CMATYPE	Shape_Leng	Shape_Area	geometry
0	4815023	Canmore	T	48	Alberta	4815	Division No. 15	CDR	4815015	Bighorn No. 8	...	Banff--Jasper--Rocky Mountain House	828	3	828	48828	Canmore	D	65444.606318	6.938296e+07	POLYGON ((470559.3832999999 5662470.932800001,...
1	4815027	Ghost Lake	SV	48	Alberta	4815	Division No. 15	CDR	4815015	Bighorn No. 8	...	Banff--Jasper--Rocky Mountain House	999	7	None	None	None	None	5207.421246	6.681321e+05	POLYGON ((516587.8195000002 5671021.1709, 5173...
2	4812808	White Fish Lake 128	IRI	48	Alberta	4812	Division No. 12	CDR	4812014	St. Paul County No. 19	...	Wood Buffalo--Cold Lake	997	5	None	None	None	None	58687.129982	6.111809e+07	POLYGON ((706392.4102999996 6025299.4767, 7076...
3	4812810	Cold Lake 149	IRI	48	Alberta	4812	Division No. 12	CDR	4812004	Bonnyville No. 87	...	Wood Buffalo--Cold Lake	997	5	None	None	None	None	48773.548098	1.473223e+08	POLYGON ((808718.5225999998 6031058.0693, 8089...
4	4812811	Kehewin 123	IRI	48	Alberta	4812	Division No. 12	CDR	4812004	Bonnyville No. 87	...	Wood Buffalo--Cold Lake	997	5	None	None	None	None	48449.061245	8.484803e+07	POLYGON ((769856.7632999998 6006472.9333, 7698...

5 rows × 21 columns

[5]:

df_spatial_AB = df_spatial[df_spatial.PRUID=='48']
print('Total Census Subdivision = %i'%(len(df_spatial_AB)))
df_spatial_AB['CSDUID']=df_spatial_AB['CSDUID'].astype(int)
df_spatial_AB.head()

Total Census Subdivision = 425

[5]:

	CSDUID	CSDNAME	CSDTYPE	PRUID	PRNAME	CDUID	CDNAME	CDTYPE	CCSUID	CCSNAME	...	ERNAME	SACCODE	SACTYPE	CMAUID	CMAPUID	CMANAME	CMATYPE	Shape_Leng	Shape_Area	geometry
0	4815023	Canmore	T	48	Alberta	4815	Division No. 15	CDR	4815015	Bighorn No. 8	...	Banff--Jasper--Rocky Mountain House	828	3	828	48828	Canmore	D	65444.606318	6.938296e+07	POLYGON ((470559.3832999999 5662470.932800001,...
1	4815027	Ghost Lake	SV	48	Alberta	4815	Division No. 15	CDR	4815015	Bighorn No. 8	...	Banff--Jasper--Rocky Mountain House	999	7	None	None	None	None	5207.421246	6.681321e+05	POLYGON ((516587.8195000002 5671021.1709, 5173...
2	4812808	White Fish Lake 128	IRI	48	Alberta	4812	Division No. 12	CDR	4812014	St. Paul County No. 19	...	Wood Buffalo--Cold Lake	997	5	None	None	None	None	58687.129982	6.111809e+07	POLYGON ((706392.4102999996 6025299.4767, 7076...
3	4812810	Cold Lake 149	IRI	48	Alberta	4812	Division No. 12	CDR	4812004	Bonnyville No. 87	...	Wood Buffalo--Cold Lake	997	5	None	None	None	None	48773.548098	1.473223e+08	POLYGON ((808718.5225999998 6031058.0693, 8089...
4	4812811	Kehewin 123	IRI	48	Alberta	4812	Division No. 12	CDR	4812004	Bonnyville No. 87	...	Wood Buffalo--Cold Lake	997	5	None	None	None	None	48449.061245	8.484803e+07	POLYGON ((769856.7632999998 6006472.9333, 7698...

5 rows × 21 columns

Extract Median Data¶

Choose a Subset¶

df_663 is a subset of data that contains data for median total income in 2015 among recipients.

[6]:

for item in np.unique(df_CSD[['Dimension']][df_CSD['Member_ID']==663]):
    print(item)

Median total income in 2015 among recipients ($)

[7]:

df_663 = df_CSD[['GEO_CODE', 'GEO_NAME', 'Total', 'Male', 'Female', 'Member_ID']][df_CSD['Member_ID']==663]
df_663.head()

[7]:

	GEO_CODE	GEO_NAME	Total	Male	Female	Member_ID
662	4801003	Cypress County	41920	55488	30816	663
2909	4801006	Medicine Hat	36819	49477	28395	663
5156	4801008	Forty Mile County No. 8	35547	44608	25626	663
7403	4801009	Foremost	44096	60608	29824	663
9650	4801014	Bow Island	32064	46592	21056	663

Data quality check¶

Check if all the rows contain a numeric value

[8]:

df_663[df_663.Total.apply(lambda x: not x.isnumeric())].head()

[8]:

	GEO_CODE	GEO_NAME	Total	Male	Female	Member_ID
16391	4802002	Coutts	x	x	x	663
68072	4803003	Improvement District No. 4 Waterton	x	x	x	663
74813	4803008	Hill Spring	x	x	x	663
81554	4803016	Cowley	x	x	x	663
101777	4803803	Blood 148A	x	x	x	663

[9]:

np.unique(df_663['Total'][df_663.Total.apply(lambda x: not x.isnumeric())])

[9]:

array(['..', 'F', 'x'], dtype=object)

[10]:

df_663_trim = df_663[df_663.Total.apply(lambda x: x.isnumeric())]
df_663_trim.reset_index(drop=True, inplace=True)
df_663_trim[['Total', 'Male', 'Female']] = df_663_trim[['Total', 'Male', 'Female']].astype(float)
df_663_trim.head()

[10]:

	GEO_CODE	GEO_NAME	Total	Male	Female	Member_ID
0	4801003	Cypress County	41920.0	55488.0	30816.0	663
1	4801006	Medicine Hat	36819.0	49477.0	28395.0	663
2	4801008	Forty Mile County No. 8	35547.0	44608.0	25626.0	663
3	4801009	Foremost	44096.0	60608.0	29824.0	663
4	4801014	Bow Island	32064.0	46592.0	21056.0	663

Alternatives for histogram plot¶

Using pandas built-in plot¶

[11]:

df_663_trim.Total.plot.hist(bins=20)

[11]:

<matplotlib.axes._subplots.AxesSubplot at 0x1d2c9c517f0>

../_images/notebooks_IncomeDataAnalysis_19_1.png

Using matplotlib¶

[12]:

fig = plt.figure(figsize=(15,10))

ax = fig.add_subplot(221)
hist1 = ax.hist(df_663_trim['Total'], bins=20, color='green')
ax.set_title('Total', fontsize=18); ax.set_xlabel('Median Total Income (2015)'); ax.set_ylabel('Frequency')

ax = fig.add_subplot(222)
hist2 = ax.hist(df_663_trim['Male'], bins=20, color='blue')
ax.set_title('Female', fontsize=18); ax.set_xlabel('Median Total Income (2015)'); ax.set_ylabel('Frequency')

ax = fig.add_subplot(223)
hist3 = ax.hist(df_663_trim['Female'], bins=20, color='red')
ax.set_title('Male', fontsize=18); ax.set_xlabel('Median Total Income (2015)'); ax.set_ylabel('Frequency')

plt.tight_layout()

../_images/notebooks_IncomeDataAnalysis_21_0.png

Using Bokeh¶

[13]:

from bokeh.io import output_notebook
from bokeh.layouts import gridplot
from bokeh.plotting import figure, show
from bokeh.sampledata.autompg import autompg as df
output_notebook()

Loading BokehJS ...

[14]:

# Histogram Total
p1 = figure(title="Histogram (Total)",tools="pan,wheel_zoom,box_zoom,reset,save", background_fill_color="#D0D3D4")
hist, edges = np.histogram(df_663_trim.Total, density=False, bins=20)
p1.quad(top=hist, bottom=0, left=edges[:-1], right=edges[1:],
        fill_color="green", line_color="#033649")

# Histogram Male
p2 = figure(title="Histogram (Male)",tools="pan,wheel_zoom,box_zoom,reset", background_fill_color="#D0D3D4")
hist, edges = np.histogram(df_663_trim.Male, density=False, bins=20)
p2.quad(top=hist, bottom=0, left=edges[:-1], right=edges[1:],
        fill_color="blue", line_color="#033649")

# Histogram Female
p3 = figure(title="Histogram (Female)",tools="pan,wheel_zoom,box_zoom,reset", background_fill_color="#D0D3D4")
hist, edges = np.histogram(df_663_trim.Female, density=False, bins=20)
p3.quad(top=hist, bottom=0, left=edges[:-1], right=edges[1:],
        fill_color="red", line_color="#033649")

show(gridplot(p1,p2,p3, ncols=2, plot_width=400, plot_height=400, toolbar_location='right'))

Custom Plots¶

[15]:

import PythonTools as PT

[16]:

fig = plt.figure(figsize=(15,10))

main_title = df_CSD['Dimension'][df_CSD.Member_ID==663].as_matrix()[0]

ax = fig.add_subplot(221)
PT.pdf_plot(df_663_trim['Total'], bw=-1, ax=ax, line_color='gray', labelfont=14,
            xlabel=main_title+', Total', barcolor='green', kde_color='black')

ax = fig.add_subplot(222)
PT.pdf_plot(df_663_trim['Male'], bw=-1, ax=ax, line_color='gray',labelfont=14,
            xlabel=main_title+', Male', barcolor='blue', kde_color='black')

ax = fig.add_subplot(223)
PT.pdf_plot(df_663_trim['Female'], bw=-1, ax=ax, line_color='gray',labelfont=14,
            xlabel=main_title+', Female', barcolor='red', kde_color='black')

ax = fig.add_subplot(224)
PT.bv_plot(df_663_trim['Male'],df_663_trim['Female'],bw=-1, ax=ax,
           xlabel='Male', label_font_size= 14,
           ylabel='Female', fontsize=14)

plt.tight_layout()

../_images/notebooks_IncomeDataAnalysis_27_0.png

Location Map¶

[17]:

df_663_trim_spatial = df_spatial_AB.merge(df_663_trim, how='inner', left_on='CSDUID',
                                          right_on='GEO_CODE',suffixes=('_x', '_y'))

[18]:

print('The total numbee of census subdivision with data: %i'%(len(df_663_trim_spatial)))

The total numbee of census subdivision with data: 299

[19]:

fig = plt.figure(figsize=(15,20))

ax = fig.add_subplot(221)
df_663_trim_spatial.plot(column='Total', cmap='hot_r',ax=ax, legend=True)
ax.set_frame_on(False); ax.axes.get_yaxis().set_visible(False); ax.axes.get_xaxis().set_visible(False)
ax.set_title('Total')

ax = fig.add_subplot(222)
df_663_trim_spatial.plot(column='Male', cmap='viridis',ax=ax, legend=True)
ax.set_frame_on(False); ax.axes.get_yaxis().set_visible(False); ax.axes.get_xaxis().set_visible(False)
ax.set_title('Male')

ax = fig.add_subplot(223)
df_663_trim_spatial.plot(column='Female', cmap='inferno',ax=ax, legend=True)
ax.set_frame_on(False); ax.axes.get_yaxis().set_visible(False); ax.axes.get_xaxis().set_visible(False)
ax.set_title('Female')

[19]:

Text(0.5,1,'Female')

../_images/notebooks_IncomeDataAnalysis_31_1.png

Consider writing a wrapper (repeatability)¶

[24]:

from Accessories import summary_stat as ss

[25]:

ss.summary_stat(df_CSD,663,outfl=True)

../_images/notebooks_IncomeDataAnalysis_34_0.png

[26]:

ss.summary_stat(df_CSD,665,outfl=True)

../_images/notebooks_IncomeDataAnalysis_35_0.png

[27]:

ss.summary_stat(df_CSD, 742,outfl=True)

../_images/notebooks_IncomeDataAnalysis_36_0.png

[28]:

ss.summary_stat(df_CSD, 829,outfl=True)

../_images/notebooks_IncomeDataAnalysis_37_0.png

[29]:

ss.summary_stat(df_CSD, 671,outfl=True)

../_images/notebooks_IncomeDataAnalysis_38_0.png

[ ]: