Thursday, May 24, 2018

Pandas best practices by Kevin Markham PYCON 2018 talk


Pandas Best Practices

I tried to follow Kevin Markham's PYCON 2018 talk 'Using pandas for Better( and Worse) Data Science' and recreated this notebook
In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
In [2]:
ri = pd.read_csv("police.csv")
In [3]:
ri.head()
Out[3]:
stop_date stop_time county_name driver_gender driver_age_raw driver_age driver_race violation_raw violation search_conducted search_type stop_outcome is_arrested stop_duration drugs_related_stop
0 2005-01-02 01:55 NaN M 1985.0 20.0 White Speeding Speeding False NaN Citation False 0-15 Min False
1 2005-01-18 08:15 NaN M 1965.0 40.0 White Speeding Speeding False NaN Citation False 0-15 Min False
2 2005-01-23 23:15 NaN M 1972.0 33.0 White Speeding Speeding False NaN Citation False 0-15 Min False
3 2005-02-20 17:15 NaN M 1986.0 19.0 White Call for Service Other False NaN Arrest Driver True 16-30 Min False
4 2005-03-14 10:00 NaN F 1984.0 21.0 White Speeding Speeding False NaN Citation False 0-15 Min False
In [4]:
ri.shape
Out[4]:
(91741, 15)
In [5]:
ri.dtypes
Out[5]:
stop_date              object
stop_time              object
county_name           float64
driver_gender          object
driver_age_raw        float64
driver_age            float64
driver_race            object
violation_raw          object
violation              object
search_conducted         bool
search_type            object
stop_outcome           object
is_arrested            object
stop_duration          object
drugs_related_stop       bool
dtype: object
In [6]:
ri.isnull().sum()
Out[6]:
stop_date                 0
stop_time                 0
county_name           91741
driver_gender          5335
driver_age_raw         5327
driver_age             5621
driver_race            5333
violation_raw          5333
violation              5333
search_conducted          0
search_type           88545
stop_outcome           5333
is_arrested            5333
stop_duration          5333
drugs_related_stop        0
dtype: int64
In [7]:
ri.isnull().sum().sort_values(ascending=False)
Out[7]:
county_name           91741
search_type           88545
driver_age             5621
driver_gender          5335
stop_duration          5333
is_arrested            5333
stop_outcome           5333
violation              5333
violation_raw          5333
driver_race            5333
driver_age_raw         5327
drugs_related_stop        0
search_conducted          0
stop_time                 0
stop_date                 0
dtype: int64

1 Remove the columns that only contain missing values

In [8]:
ri.shape
Out[8]:
(91741, 15)
In [9]:
ri.dropna(axis = 1, how='all').shape
Out[9]:
(91741, 14)
In [10]:
ri.dropna(axis = 1, how='all', inplace=True)

2. Do Men or Women speed more often?

In [11]:
ri[ri.violation=="Speeding"].driver_gender.value_counts()
Out[11]:
M    32979
F    15482
Name: driver_gender, dtype: int64
Ans Men
In [12]:
ri[ri.violation=="Speeding"].driver_gender.value_counts(normalize=True)
Out[12]:
M    0.680527
F    0.319473
Name: driver_gender, dtype: float64

Other way to see this

In [13]:
ri[ri.driver_gender=='M'].violation.value_counts(normalize=True)
Out[13]:
Speeding               0.524350
Moving violation       0.207012
Equipment              0.135671
Other                  0.057668
Registration/plates    0.038461
Seat belt              0.036839
Name: violation, dtype: float64
In [14]:
ri[ri.driver_gender=='F'].violation.value_counts(normalize=True)
Out[14]:
Speeding               0.658500
Moving violation       0.136277
Equipment              0.105780
Registration/plates    0.043086
Other                  0.029348
Seat belt              0.027009
Name: violation, dtype: float64
In [15]:
ri.groupby(["driver_gender"])["violation"].value_counts(normalize=True)
Out[15]:
driver_gender  violation          
F              Speeding               0.658500
               Moving violation       0.136277
               Equipment              0.105780
               Registration/plates    0.043086
               Other                  0.029348
               Seat belt              0.027009
M              Speeding               0.524350
               Moving violation       0.207012
               Equipment              0.135671
               Other                  0.057668
               Registration/plates    0.038461
               Seat belt              0.036839
Name: violation, dtype: float64
In [16]:
ri.groupby(["driver_gender"])["violation"].value_counts(normalize=True).loc[:,'Speeding']
Out[16]:
driver_gender
F    0.65850
M    0.52435
Name: violation, dtype: float64
In [17]:
ri.groupby(["driver_gender"])["violation"].value_counts(normalize=True).unstack()
Out[17]:
violation Equipment Moving violation Other Registration/plates Seat belt Speeding
driver_gender
F 0.105780 0.136277 0.029348 0.043086 0.027009 0.65850
M 0.135671 0.207012 0.057668 0.038461 0.036839 0.52435

3 Does Gender affect who gets searched during a stop?

In [18]:
ri.groupby(["driver_gender"])["search_conducted"].value_counts(normalize=True)
Out[18]:
driver_gender  search_conducted
F              False               0.979967
               True                0.020033
M              False               0.956674
               True                0.043326
Name: search_conducted, dtype: float64
In [19]:
ri.groupby(["driver_gender"]).search_conducted.mean()
Out[19]:
driver_gender
F    0.020033
M    0.043326
Name: search_conducted, dtype: float64
In [20]:
ri.groupby(["violation","driver_gender"]).search_conducted.mean()
Out[20]:
violation            driver_gender
Equipment            F                0.042622
                     M                0.070081
Moving violation     F                0.036205
                     M                0.059831
Other                F                0.056522
                     M                0.047146
Registration/plates  F                0.066140
                     M                0.110376
Seat belt            F                0.012598
                     M                0.037980
Speeding             F                0.008720
                     M                0.024925
Name: search_conducted, dtype: float64

Lesson

  • focus on relationships, not causation

4 Why is search_type missing so often?

In [21]:
ri.isnull().sum()
Out[21]:
stop_date                 0
stop_time                 0
driver_gender          5335
driver_age_raw         5327
driver_age             5621
driver_race            5333
violation_raw          5333
violation              5333
search_conducted          0
search_type           88545
stop_outcome           5333
is_arrested            5333
stop_duration          5333
drugs_related_stop        0
dtype: int64
In [22]:
ri.search_conducted.value_counts()
Out[22]:
False    88545
True      3196
Name: search_conducted, dtype: int64
In [23]:
ri.search_type.value_counts()
Out[23]:
Incident to Arrest                                          1219
Probable Cause                                               891
Inventory                                                    220
Reasonable Suspicion                                         197
Protective Frisk                                             161
Incident to Arrest,Inventory                                 129
Incident to Arrest,Probable Cause                            106
Probable Cause,Reasonable Suspicion                           75
Incident to Arrest,Inventory,Probable Cause                   34
Incident to Arrest,Protective Frisk                           33
Probable Cause,Protective Frisk                               33
Inventory,Probable Cause                                      22
Incident to Arrest,Reasonable Suspicion                       13
Incident to Arrest,Inventory,Protective Frisk                 11
Protective Frisk,Reasonable Suspicion                         11
Inventory,Protective Frisk                                    11
Incident to Arrest,Probable Cause,Protective Frisk            10
Incident to Arrest,Probable Cause,Reasonable Suspicion         6
Inventory,Reasonable Suspicion                                 4
Incident to Arrest,Inventory,Reasonable Suspicion              4
Inventory,Probable Cause,Reasonable Suspicion                  2
Inventory,Probable Cause,Protective Frisk                      2
Incident to Arrest,Protective Frisk,Reasonable Suspicion       1
Probable Cause,Protective Frisk,Reasonable Suspicion           1
Name: search_type, dtype: int64
In [24]:
ri.search_type.value_counts(dropna=False)
Out[24]:
NaN                                                         88545
Incident to Arrest                                           1219
Probable Cause                                                891
Inventory                                                     220
Reasonable Suspicion                                          197
Protective Frisk                                              161
Incident to Arrest,Inventory                                  129
Incident to Arrest,Probable Cause                             106
Probable Cause,Reasonable Suspicion                            75
Incident to Arrest,Inventory,Probable Cause                    34
Incident to Arrest,Protective Frisk                            33
Probable Cause,Protective Frisk                                33
Inventory,Probable Cause                                       22
Incident to Arrest,Reasonable Suspicion                        13
Inventory,Protective Frisk                                     11
Protective Frisk,Reasonable Suspicion                          11
Incident to Arrest,Inventory,Protective Frisk                  11
Incident to Arrest,Probable Cause,Protective Frisk             10
Incident to Arrest,Probable Cause,Reasonable Suspicion          6
Inventory,Reasonable Suspicion                                  4
Incident to Arrest,Inventory,Reasonable Suspicion               4
Inventory,Probable Cause,Reasonable Suspicion                   2
Inventory,Probable Cause,Protective Frisk                       2
Incident to Arrest,Protective Frisk,Reasonable Suspicion        1
Probable Cause,Protective Frisk,Reasonable Suspicion            1
Name: search_type, dtype: int64

5 During a search, how often is the driver frisked?

In [25]:
ri["frisk"] = ri.search_type.str.contains("Protective Frisk")
In [26]:
ri.frisk.value_counts(normalize=True)
Out[26]:
False    0.914268
True     0.085732
Name: frisk, dtype: float64
pandas calculations ignore missing values

6. Which year had the least amount of stops?

In [27]:
pd.to_datetime(ri["stop_date"]).dt.year.value_counts(ascending=True)
Out[27]:
2005     2558
2010     7561
2009     7908
2013     7924
2011     8126
2015     8599
2008     8752
2014     9228
2007     9476
2006    10639
2012    10970
Name: stop_date, dtype: int64
In [28]:
ri.dtypes
Out[28]:
stop_date              object
stop_time              object
driver_gender          object
driver_age_raw        float64
driver_age            float64
driver_race            object
violation_raw          object
violation              object
search_conducted         bool
search_type            object
stop_outcome           object
is_arrested            object
stop_duration          object
drugs_related_stop       bool
frisk                  object
dtype: object
In [29]:
combined = ri.stop_date.str.cat(ri.stop_time, sep = ' ')
In [30]:
combined.head()
Out[30]:
0    2005-01-02 01:55
1    2005-01-18 08:15
2    2005-01-23 23:15
3    2005-02-20 17:15
4    2005-03-14 10:00
Name: stop_date, dtype: object
In [31]:
ri['stop_datetime'] = pd.to_datetime(combined)
In [32]:
ri.stop_datetime.dt.year.value_counts(ascending=True)
Out[32]:
2005     2558
2010     7561
2009     7908
2013     7924
2011     8126
2015     8599
2008     8752
2014     9228
2007     9476
2006    10639
2012    10970
Name: stop_datetime, dtype: int64

7 How does drug activity change by time of day?

In [33]:
ri.drugs_related_stop.value_counts()
Out[33]:
False    90926
True       815
Name: drugs_related_stop, dtype: int64
In [34]:
ri[ri.drugs_related_stop == True].stop_datetime.dt.hour.value_counts().plot(kind='bar')
Out[34]:
<matplotlib.axes._subplots.AxesSubplot at 0x10eb32438>
In [35]:
ri.groupby(ri.stop_datetime.dt.hour).drugs_related_stop.mean().plot()
Out[35]:
<matplotlib.axes._subplots.AxesSubplot at 0x109ce89e8>

8 Do more stops occur at night?

In [36]:
ri.stop_datetime.dt.hour.value_counts().sort_index().plot()
Out[36]:
<matplotlib.axes._subplots.AxesSubplot at 0x10eeef630>

9 Find the bad data in the stop_duration column and fix it

In [37]:
ri.stop_duration.value_counts(dropna=False)
Out[37]:
0-15 Min     69543
16-30 Min    13635
NaN           5333
30+ Min       3228
1                1
2                1
Name: stop_duration, dtype: int64
In [38]:
ri.stop_duration.dtype
Out[38]:
dtype('O')
In [39]:
ri[(ri.stop_duration == '1') | (ri.stop_duration == '2')]
Out[39]:
stop_date stop_time driver_gender driver_age_raw driver_age driver_race violation_raw violation search_conducted search_type stop_outcome is_arrested stop_duration drugs_related_stop frisk stop_datetime
55873 2012-01-23 11:17 M 1960.0 52.0 Black Other Traffic Violation Moving violation False NaN Arrest Passenger True 2 False NaN 2012-01-23 11:17:00
76624 2014-04-13 08:34 F 1965.0 49.0 White Speeding Speeding False NaN Citation False 1 False NaN 2014-04-13 08:34:00
In [40]:
ri[(ri.stop_duration == '1') | (ri.stop_duration == '2')].stop_duration = 'NaN' #Wrong
/Users/arunprakash/anaconda3/lib/python3.6/site-packages/pandas/core/generic.py:3643: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self[name] = value
In [41]:
ri.stop_duration.value_counts(dropna=False) #Nothing changed
Out[41]:
0-15 Min     69543
16-30 Min    13635
NaN           5333
30+ Min       3228
1                1
2                1
Name: stop_duration, dtype: int64
In [42]:
ri.loc[(ri.stop_duration == '1') | (ri.stop_duration == '2'), "stop_duration"] = 'NaN'
In [43]:
ri.stop_duration.value_counts(dropna=False)
Out[43]:
0-15 Min     69543
16-30 Min    13635
NaN           5333
30+ Min       3228
NaN              2
Name: stop_duration, dtype: int64
In [44]:
ri.loc[ri.stop_duration == 'NaN', 'stop_duration'] = np.nan
In [45]:
ri.stop_duration.value_counts(dropna=False)
Out[45]:
0-15 Min     69543
16-30 Min    13635
NaN           5335
30+ Min       3228
Name: stop_duration, dtype: int64
In [46]:
# alternative method
ri.stop_duration.replace(['1', '2'], value=np.nan, inplace=True)

10 What is the mean stop_duration for each violation_raw?

In [47]:
ri.stop_duration.value_counts()
Out[47]:
0-15 Min     69543
16-30 Min    13635
30+ Min       3228
Name: stop_duration, dtype: int64
In [48]:
mapping = {'0-15 Min': 8, '16-30 Min': 23, '30+ Min':45 }
In [49]:
ri['stop_minutes'] = ri.stop_duration.map(mapping)
In [50]:
ri.stop_minutes.value_counts()
Out[50]:
8.0     69543
23.0    13635
45.0     3228
Name: stop_minutes, dtype: int64
In [51]:
ri.violation_raw.value_counts()
Out[51]:
Speeding                            48463
Other Traffic Violation             16224
Equipment/Inspection Violation      11020
Registration Violation               3432
Seatbelt Violation                   2952
Special Detail/Directed Patrol       2455
Call for Service                     1298
Violation of City/Town Ordinance      211
Motorist Assist/Courtesy              203
APB                                    79
Suspicious Person                      56
Warrant                                15
Name: violation_raw, dtype: int64
In [52]:
ri.groupby('violation_raw')['stop_minutes'].mean()
Out[52]:
violation_raw
APB                                 20.987342
Call for Service                    22.034669
Equipment/Inspection Violation      11.460345
Motorist Assist/Courtesy            16.916256
Other Traffic Violation             13.900265
Registration Violation              13.745629
Seatbelt Violation                   9.741531
Special Detail/Directed Patrol      15.061100
Speeding                            10.577690
Suspicious Person                   18.750000
Violation of City/Town Ordinance    13.388626
Warrant                             21.400000
Name: stop_minutes, dtype: float64
In [53]:
ri.groupby('violation_raw')['stop_minutes'].agg(['count', 'mean'])
Out[53]:
count mean
violation_raw
APB 79 20.987342
Call for Service 1298 22.034669
Equipment/Inspection Violation 11020 11.460345
Motorist Assist/Courtesy 203 16.916256
Other Traffic Violation 16223 13.900265
Registration Violation 3432 13.745629
Seatbelt Violation 2952 9.741531
Special Detail/Directed Patrol 2455 15.061100
Speeding 48462 10.577690
Suspicious Person 56 18.750000
Violation of City/Town Ordinance 211 13.388626
Warrant 15 21.400000

11. Plot the results of the first groupby from the previous exercise

In [54]:
ri.groupby('violation_raw').stop_minutes.mean().plot()
Out[54]:
<matplotlib.axes._subplots.AxesSubplot at 0x112ee6b00>
In [55]:
ri.groupby('violation_raw').stop_minutes.mean().plot(kind ='bar')
Out[55]:
<matplotlib.axes._subplots.AxesSubplot at 0x1132425c0>
In [56]:
ri.groupby('violation_raw').stop_minutes.mean().sort_values().plot(kind='barh')
Out[56]:
<matplotlib.axes._subplots.AxesSubplot at 0x11334b0f0>

12. Compare the age distributions for each violation

In [57]:
ri.groupby('violation')['driver_age'].mean()
Out[57]:
violation
Equipment              31.781503
Moving violation       36.120020
Other                  39.536870
Registration/plates    32.803035
Seat belt              32.206301
Speeding               33.530097
Name: driver_age, dtype: float64
In [58]:
ri.groupby('violation')['driver_age'].describe()
Out[58]:
count mean std min 25% 50% 75% max
violation
Equipment 11007.0 31.781503 11.400900 16.0 23.0 28.0 38.0 89.0
Moving violation 16164.0 36.120020 13.185805 15.0 25.0 33.0 46.0 99.0
Other 4204.0 39.536870 13.034639 16.0 28.0 39.0 49.0 87.0
Registration/plates 3427.0 32.803035 11.033675 16.0 24.0 30.0 40.0 74.0
Seat belt 2952.0 32.206301 11.213122 17.0 24.0 29.0 38.0 77.0
Speeding 48361.0 33.530097 12.821847 15.0 23.0 30.0 42.0 90.0
In [59]:
ri.driver_age.plot(kind='hist')
Out[59]:
<matplotlib.axes._subplots.AxesSubplot at 0x113287e10>
In [60]:
ri.driver_age.value_counts().sort_index()
Out[60]:
15.0       5
16.0      45
17.0     585
18.0    1734
19.0    3228
20.0    3794
21.0    4063
22.0    4125
23.0    3986
24.0    3782
25.0    3537
26.0    3335
27.0    3048
28.0    2781
29.0    2541
30.0    2436
31.0    2200
32.0    2186
33.0    1956
34.0    1826
35.0    1868
36.0    1709
37.0    1631
38.0    1602
39.0    1558
40.0    1582
41.0    1496
42.0    1528
43.0    1571
44.0    1445
        ... 
63.0     349
64.0     298
65.0     258
66.0     227
67.0     174
68.0     148
69.0     120
70.0     115
71.0      94
72.0      76
73.0      66
74.0      51
75.0      44
76.0      42
77.0      34
78.0      22
79.0      24
80.0      18
81.0      13
82.0      12
83.0       7
84.0       9
85.0       6
86.0       7
87.0       6
88.0       2
89.0       1
90.0       1
94.0       1
99.0       1
Name: driver_age, Length: 78, dtype: int64
In [61]:
ri.driver_age.value_counts().sort_index().plot()
Out[61]:
<matplotlib.axes._subplots.AxesSubplot at 0x11361b4e0>
In [62]:
# can't use the plot method
ri.hist('driver_age', by='violation', figsize = (10,8));
In [63]:
ri.hist('driver_age', by='violation', sharex= True, figsize = (10,8));
In [64]:
# what changed? how is this better or worse?
ri.hist('driver_age', by='violation', sharex=True, sharey=True, figsize = (10,8));

No comments :

Post a Comment