Pandas Best Practices
I followed Kevin Markham's PyCon 2018 talk, 'Using pandas for Better (and Worse) Data Science', and recreated its exercises in this notebook.
In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
In [2]:
ri = pd.read_csv("police.csv")
In [3]:
ri.head()
Out[3]:
In [4]:
ri.shape
Out[4]:
In [5]:
ri.dtypes
Out[5]:
In [6]:
ri.isnull().sum()
Out[6]:
In [7]:
ri.isnull().sum().sort_values(ascending=False)
Out[7]:
1. Remove the columns that only contain missing values
In [8]:
ri.shape
Out[8]:
In [9]:
ri.dropna(axis = 1, how='all').shape
Out[9]:
In [10]:
ri.dropna(axis = 1, how='all', inplace=True)
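The effect of `how='all'` is easier to see on a small frame. The sketch below uses made-up data (the column names `all_missing`, `partly_missing`, and `complete` are hypothetical, not from police.csv) to contrast it with the default `how='any'`.

```python
import numpy as np
import pandas as pd

# Made-up frame: one column is entirely missing, one is only partly missing.
toy = pd.DataFrame({
    "all_missing": [np.nan, np.nan, np.nan],
    "partly_missing": [21.0, np.nan, 35.0],
    "complete": ["Speeding", "Equipment", "Speeding"],
})

# how="all" drops a column only when every value in it is missing.
cleaned = toy.dropna(axis="columns", how="all")

# how="any" (the default) also drops columns with even one missing value.
strict = toy.dropna(axis="columns", how="any")

print(list(cleaned.columns))
print(list(strict.columns))
```

With `how='all'` the partly missing column survives, which is usually what you want when the goal is only to discard columns that carry no information.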
2. Do men or women speed more often?
In [11]:
ri[ri.violation=="Speeding"].driver_gender.value_counts()
Out[11]:
Answer: men.
In [12]:
ri[ri.violation=="Speeding"].driver_gender.value_counts(normalize=True)
Out[12]:
Another way to look at this
In [13]:
ri[ri.driver_gender=='M'].violation.value_counts(normalize=True)
Out[13]:
In [14]:
ri[ri.driver_gender=='F'].violation.value_counts(normalize=True)
Out[14]:
In [15]:
ri.groupby(["driver_gender"])["violation"].value_counts(normalize=True)
Out[15]:
In [16]:
ri.groupby(["driver_gender"])["violation"].value_counts(normalize=True).loc[:,'Speeding']
Out[16]:
In [17]:
ri.groupby(["driver_gender"])["violation"].value_counts(normalize=True).unstack()
Out[17]:
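As a possible alternative to the groupby/value_counts/unstack chain, `pd.crosstab` with `normalize="index"` produces the same kind of table in one call. This is a sketch on made-up rows, not the real police.csv data.

```python
import pandas as pd

# Made-up stops: 4 men (3 speeding), 2 women (1 speeding).
toy = pd.DataFrame({
    "driver_gender": ["M", "M", "M", "M", "F", "F"],
    "violation": ["Speeding", "Speeding", "Speeding", "Equipment",
                  "Speeding", "Equipment"],
})

# The approach used in the notebook: proportions within each gender, then pivot.
by_group = (
    toy.groupby("driver_gender")["violation"]
    .value_counts(normalize=True)
    .unstack()
)

# Alternative: crosstab, normalizing each row (each gender) to sum to 1.
by_crosstab = pd.crosstab(toy["driver_gender"], toy["violation"], normalize="index")

print(by_group)
print(by_crosstab)
```

Both tables answer "of the stops for each gender, what share were for each violation?", which is a different question from "of the speeding stops, what share were men?".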
3. Does gender affect who gets searched during a stop?
In [18]:
ri.groupby(["driver_gender"])["search_conducted"].value_counts(normalize=True)
Out[18]:
In [19]:
ri.groupby(["driver_gender"]).search_conducted.mean()
Out[19]:
In [20]:
ri.groupby(["violation","driver_gender"]).search_conducted.mean()
Out[20]:
Lesson
- Focus on relationships, not causation: these group means show that search rates differ by gender, not that gender causes the difference.
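The `.mean()` calls in this section work because `search_conducted` is boolean: `True` counts as 1 and `False` as 0, so the mean is the share of stops that involved a search. A minimal sketch on made-up rows:

```python
import pandas as pd

# Made-up stops: 2 of 4 men searched, 1 of 4 women searched.
toy = pd.DataFrame({
    "driver_gender": ["M", "M", "M", "M", "F", "F", "F", "F"],
    "search_conducted": [True, True, False, False, True, False, False, False],
})

# Mean of a boolean column per group = proportion of True values per group.
rate = toy.groupby("driver_gender")["search_conducted"].mean()

# The same number for men, computed by hand as searched / total.
is_male = toy["driver_gender"] == "M"
manual_male_rate = (toy["search_conducted"] & is_male).sum() / is_male.sum()

print(rate)
print(manual_male_rate)
```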
4. Why is search_type missing so often?
In [21]:
ri.isnull().sum()
Out[21]:
In [22]:
ri.search_conducted.value_counts()
Out[22]:
In [23]:
ri.search_type.value_counts()
Out[23]:
In [24]:
ri.search_type.value_counts(dropna=False)
Out[24]:
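One way to confirm the explanation suggested by these counts (that `search_type` is missing exactly when no search was conducted) is to compare the two columns row by row. The sketch below does this on made-up rows; the search type strings are illustrative.

```python
import numpy as np
import pandas as pd

# Made-up stops: search_type is filled in only for the two searched stops.
toy = pd.DataFrame({
    "search_conducted": [False, False, True, False, True],
    "search_type": [np.nan, np.nan, "Incident to Arrest", np.nan, "Protective Frisk"],
})

n_missing = toy["search_type"].isnull().sum()
n_not_searched = (~toy["search_conducted"]).sum()

# True only if every missing search_type lines up with an unsearched stop.
missing_matches_no_search = bool(
    (toy["search_type"].isnull() == ~toy["search_conducted"]).all()
)

print(n_missing, n_not_searched, missing_matches_no_search)
```

Note that `value_counts()` hides missing values by default, which is why `dropna=False` is needed to see them in the first place.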
5. During a search, how often is the driver frisked?
In [25]:
ri["frisk"] = ri.search_type.str.contains("Protective Frisk")
In [26]:
ri.frisk.value_counts(normalize=True)
Out[26]:
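Note: pandas calculations ignore missing values by default. Because frisk is missing wherever search_type is missing, the proportions above are computed only over stops where a search took place.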
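The following sketch shows how missing values flow through `str.contains` and how that changes the denominator. It uses a made-up object-dtype Series (an assumption: with other string dtypes the handling of missing values may differ).

```python
import numpy as np
import pandas as pd

# Made-up search types; NaN stands for stops without a search.
search_type = pd.Series(
    [np.nan, "Protective Frisk", "Incident to Arrest,Protective Frisk",
     "Probable Cause", np.nan],
    dtype=object,
)

# Missing inputs stay missing in the result: NaN, True, True, False, NaN.
frisk = search_type.str.contains("Protective Frisk")

# Dropping the NaNs first gives the frisk rate among searches only.
rate_among_searches = frisk.dropna().astype(bool).mean()

# na=False treats missing as "not frisked", giving the rate among all stops.
frisk_filled = search_type.str.contains("Protective Frisk", na=False)
rate_among_all = frisk_filled.mean()

print(rate_among_searches, rate_among_all)
```

Which rate is right depends on the question: "during a search" calls for the first one.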
6. Which year had the fewest stops?
In [27]:
pd.to_datetime(ri["stop_date"]).dt.year.value_counts(ascending=True)
Out[27]:
In [28]:
ri.dtypes
Out[28]:
In [29]:
combined = ri.stop_date.str.cat(ri.stop_time, sep = ' ')
In [30]:
combined.head()
Out[30]:
In [31]:
ri['stop_datetime'] = pd.to_datetime(combined)
In [32]:
ri.stop_datetime.dt.year.value_counts(ascending=True)
Out[32]:
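The date handling above can be sketched end to end on a few made-up rows. The `format` string is an assumption about how the dates and times are written; `idxmin` is shown as one way to pull out the answer instead of reading it off the sorted counts.

```python
import pandas as pd

# Made-up stops: one in 2005, two in 2006.
toy = pd.DataFrame({
    "stop_date": ["2005-01-02", "2006-03-04", "2006-07-15"],
    "stop_time": ["01:55", "23:10", "08:30"],
})

# Join the two string columns, then parse once into a datetime column.
combined = toy["stop_date"].str.cat(toy["stop_time"], sep=" ")
toy["stop_datetime"] = pd.to_datetime(combined, format="%Y-%m-%d %H:%M")

# Count stops per year and pick the year with the smallest count.
stops_per_year = toy["stop_datetime"].dt.year.value_counts()
quietest_year = stops_per_year.idxmin()

print(toy.dtypes)
print(quietest_year)
```

Storing the parsed column once pays off later, since `.dt.year`, `.dt.hour`, and similar accessors all work on it directly.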
7. How does drug activity change by time of day?
In [33]:
ri.drugs_related_stop.value_counts()
Out[33]:
In [34]:
ri[ri.drugs_related_stop == True].stop_datetime.dt.hour.value_counts().plot(kind='bar')
Out[34]:
In [35]:
ri.groupby(ri.stop_datetime.dt.hour).drugs_related_stop.mean().plot()
Out[35]:
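The two plots above measure different things: the first counts drug-related stops per hour, the second gives the share of each hour's stops that were drug related. A small sketch with made-up timestamps shows how the two can disagree.

```python
import pandas as pd

# Made-up stops: a busy 09:00 hour (4 stops) and a quiet 23:00 hour (2 stops).
toy = pd.DataFrame({
    "stop_datetime": pd.to_datetime([
        "2010-01-01 09:00", "2010-01-01 09:30",
        "2010-01-01 09:45", "2010-01-01 09:50",
        "2010-01-02 23:10", "2010-01-02 23:40",
    ]),
    "drugs_related_stop": [True, False, False, False, True, False],
})

hour = toy["stop_datetime"].dt.hour

# Raw count of drug-related stops per hour, in clock order.
counts = (
    toy.loc[toy["drugs_related_stop"], "stop_datetime"]
    .dt.hour.value_counts()
    .sort_index()
)

# Rate: share of all stops in each hour that were drug related.
rates = toy.groupby(hour)["drugs_related_stop"].mean()

print(counts)
print(rates)
```

Here both hours have one drug-related stop, but the rate at 23:00 is twice the rate at 09:00 because far fewer stops happen then. The rate is usually the better answer to "how does drug activity change", since it is not driven by overall traffic volume.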
8. Do more stops occur at night?
In [36]:
ri.stop_datetime.dt.hour.value_counts().sort_index().plot()
Out[36]:
9. Find the bad data in the stop_duration column and fix it
In [37]:
ri.stop_duration.value_counts(dropna=False)
Out[37]:
In [38]:
ri.stop_duration.dtype
Out[38]:
In [39]:
ri[(ri.stop_duration == '1') | (ri.stop_duration == '2')]
Out[39]:
In [40]:
ri[(ri.stop_duration == '1') | (ri.stop_duration == '2')].stop_duration = 'NaN'  # wrong: assigns to a temporary copy, so ri is not modified
In [41]:
ri.stop_duration.value_counts(dropna=False)  # nothing changed
Out[41]:
In [42]:
# .loc modifies ri itself, but 'NaN' here is a string, not a real missing value
ri.loc[(ri.stop_duration == '1') | (ri.stop_duration == '2'), "stop_duration"] = 'NaN'
In [43]:
ri.stop_duration.value_counts(dropna=False)
Out[43]:
In [44]:
# replace the 'NaN' string with a real missing value
ri.loc[ri.stop_duration == 'NaN', 'stop_duration'] = np.nan
In [45]:
ri.stop_duration.value_counts(dropna=False)
Out[45]:
In [46]:
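# alternative method: assign the result back instead of using inplace=True on a column,
# which may not update ri in newer pandas versions
ri['stop_duration'] = ri.stop_duration.replace(['1', '2'], value=np.nan)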
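The fix in this section can be condensed into a boolean mask plus a single `.loc` assignment. This is a sketch on a made-up column, using `isin` as a shorter way to write the two comparisons.

```python
import numpy as np
import pandas as pd

# Made-up durations, including the two bad values '1' and '2'.
toy = pd.DataFrame({
    "stop_duration": ["0-15 Min", "1", "16-30 Min", "2", "30+ Min"],
})

# Boolean mask marking the rows that hold bad data.
bad = toy["stop_duration"].isin(["1", "2"])

# One .loc call selects rows and column together, so the assignment
# reaches the original frame rather than a temporary copy.
toy.loc[bad, "stop_duration"] = np.nan

print(toy["stop_duration"].value_counts(dropna=False))
```

Assigning `np.nan` directly also avoids the intermediate step of storing the string 'NaN', which pandas does not treat as missing.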
10. What is the mean stop_duration for each violation_raw?
In [47]:
ri.stop_duration.value_counts()
Out[47]:
In [48]:
# rough representative number of minutes for each duration bucket
mapping = {'0-15 Min': 8, '16-30 Min': 23, '30+ Min': 45}
In [49]:
ri['stop_minutes'] = ri.stop_duration.map(mapping)
In [50]:
ri.stop_minutes.value_counts()
Out[50]:
In [51]:
ri.violation_raw.value_counts()
Out[51]:
In [52]:
ri.groupby('violation_raw')['stop_minutes'].mean()
Out[52]:
In [53]:
ri.groupby('violation_raw')['stop_minutes'].agg(['count', 'mean'])
Out[53]:
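A minimal sketch of the map-then-aggregate pattern on made-up rows, using the same bucket-to-minutes mapping as above. It also shows why reporting `count` next to `mean` is useful: values missing from the mapping become NaN and are left out of both.

```python
import numpy as np
import pandas as pd

# Made-up stops; one Speeding stop has no recorded duration.
toy = pd.DataFrame({
    "violation_raw": ["Speeding", "Speeding", "Speeding", "Equipment", "Equipment"],
    "stop_duration": ["0-15 Min", "16-30 Min", np.nan, "30+ Min", "0-15 Min"],
})

mapping = {"0-15 Min": 8, "16-30 Min": 23, "30+ Min": 45}

# map() looks each value up in the dict; anything not found becomes NaN.
toy["stop_minutes"] = toy["stop_duration"].map(mapping)

# count = number of non-missing values behind each mean.
summary = toy.groupby("violation_raw")["stop_minutes"].agg(["count", "mean"])

print(summary)
```

A mean backed by only a handful of rows deserves less trust than one backed by thousands, and the count column makes that visible.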
11. Plot the results of the first groupby from the previous exercise¶
In [54]:
ri.groupby('violation_raw').stop_minutes.mean().plot()
Out[54]:
In [55]:
ri.groupby('violation_raw').stop_minutes.mean().plot(kind ='bar')
Out[55]:
In [56]:
ri.groupby('violation_raw').stop_minutes.mean().sort_values().plot(kind='barh')
Out[56]:
12. Compare the age distributions for each violation
In [57]:
ri.groupby('violation')['driver_age'].mean()
Out[57]:
In [58]:
ri.groupby('violation')['driver_age'].describe()
Out[58]:
In [59]:
ri.driver_age.plot(kind='hist')
Out[59]:
In [60]:
ri.driver_age.value_counts().sort_index()
Out[60]:
In [61]:
ri.driver_age.value_counts().sort_index().plot()
Out[61]:
In [62]:
# can't use the plot method
ri.hist('driver_age', by='violation', figsize = (10,8));
In [63]:
ri.hist('driver_age', by='violation', sharex= True, figsize = (10,8));
In [64]:
# what changed? how is this better or worse?
ri.hist('driver_age', by='violation', sharex=True, sharey=True, figsize = (10,8));
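To see what `sharex` and `sharey` change, the sketch below draws the grouped histograms for made-up ages and checks the axis limits of each panel. The `Agg` backend is an assumption made so the code runs without a display; in a notebook, `%matplotlib inline` takes its place.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import numpy as np
import pandas as pd

# Made-up ages: Speeding has twice as many stops as the other two groups.
toy = pd.DataFrame({
    "violation": ["Speeding"] * 6 + ["Equipment"] * 3 + ["Other"] * 3,
    "driver_age": [20, 22, 25, 25, 30, 41, 19, 33, 50, 28, 45, 60],
})

# One histogram per violation, all panels on the same x and y scales.
axes = toy.hist("driver_age", by="violation",
                sharex=True, sharey=True, figsize=(10, 8))

# Unused grid cells are hidden, so keep only the visible panels.
panels = [ax for ax in np.ravel(axes) if ax.get_visible()]

for ax in panels:
    print(ax.get_title(), ax.get_xlim(), ax.get_ylim())
```

Sharing the x axis makes the age ranges directly comparable. Sharing the y axis makes group sizes comparable too, at the cost of flattening the shape of small groups, so whether it is better or worse depends on whether you care about counts or about shape.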