Sunday, March 5, 2017

Data Visualisation and Machine Learning on Pima Indians Dataset


Introduction

This notebook demonstrates data visualisation and several machine learning classification algorithms on the Pima Indians diabetes dataset.
In [56]:
from IPython.display import YouTubeVideo
YouTubeVideo("pN4HqWRybwk")
Out[56]:
[Embedded YouTube video]
1) Loading Libraries

In [57]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
import warnings
warnings.filterwarnings('ignore')

2) Data

In [73]:
pima = pd.read_csv("diabetes.csv")
In [74]:
pima.head()
Out[74]:
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
0 6 148 72 35 0 33.6 0.627 50 1
1 1 85 66 29 0 26.6 0.351 31 0
2 8 183 64 0 0 23.3 0.672 32 1
3 1 89 66 23 94 28.1 0.167 21 0
4 0 137 40 35 168 43.1 2.288 33 1

Additional details about the attributes

Pregnancies: Number of times pregnant
Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
BloodPressure: Diastolic blood pressure (mm Hg)
SkinThickness: Triceps skin fold thickness (mm)
Insulin: 2-Hour serum insulin (mu U/ml)
BMI: Body mass index (weight in kg/(height in m)^2)
DiabetesPedigreeFunction: Diabetes pedigree function
Age: Age (years)
Outcome: Class variable (0 = no diabetes, 1 = diabetes)
In [75]:
pima.shape
Out[75]:
(768, 9)
In [76]:
pima.describe()
Out[76]:
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
count 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000
mean 3.845052 120.894531 69.105469 20.536458 79.799479 31.992578 0.471876 33.240885 0.348958
std 3.369578 31.972618 19.355807 15.952218 115.244002 7.884160 0.331329 11.760232 0.476951
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.078000 21.000000 0.000000
25% 1.000000 99.000000 62.000000 0.000000 0.000000 27.300000 0.243750 24.000000 0.000000
50% 3.000000 117.000000 72.000000 23.000000 30.500000 32.000000 0.372500 29.000000 0.000000
75% 6.000000 140.250000 80.000000 32.000000 127.250000 36.600000 0.626250 41.000000 1.000000
max 17.000000 199.000000 122.000000 99.000000 846.000000 67.100000 2.420000 81.000000 1.000000
In [77]:
pima.groupby("Outcome").size()
Out[77]:
Outcome
0    500
1    268
dtype: int64
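The classes are imbalanced: 500 non-diabetic vs 268 diabetic samples. Before training anything, it is worth knowing the majority-class baseline that any model has to beat. A minimal sketch using the same dataframe:

# Fraction of samples in each class; the larger value (~0.65) is the accuracy
# a trivial "always predict 0" classifier would already achieve.
print(pima["Outcome"].value_counts(normalize=True))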

3) Data Visualisation

Let's visualise the data with histograms, box plots, and a correlation heatmap.
In [78]:
pima.hist(figsize=(10,8))
Out[78]:
[3x3 grid of histograms, one per attribute]
In [79]:
pima.plot(kind= 'box' , subplots=True, layout=(3,3), sharex=False, sharey=False, figsize=(10,8))
Out[79]:
[3x3 grid of box plots, one per attribute]
In [80]:
column_x = pima.columns[0:len(pima.columns) - 1]
In [81]:
column_x
Out[81]:
Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age'],
      dtype='object')
In [82]:
corr = pima[pima.columns].corr()
In [83]:
sns.heatmap(corr, annot = True)
Out[83]:
[Annotated correlation heatmap of all attributes]
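For exact numbers rather than colours, the correlation of each attribute with the target can be listed directly. A small sketch reusing the corr frame from above:

# Correlation of every column with Outcome, strongest first
print(corr["Outcome"].sort_values(ascending=False))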

4) Feature Extraction

In [69]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
In [84]:
X = pima.iloc[:,0:8]
Y = pima.iloc[:,8]
select_top_4 = SelectKBest(score_func=chi2, k = 4)
In [87]:
fit = select_top_4.fit(X,Y)
features = fit.transform(X)
In [88]:
features[0:5]
Out[88]:
array([[ 148. ,    0. ,   33.6,   50. ],
       [  85. ,    0. ,   26.6,   31. ],
       [ 183. ,    0. ,   23.3,   32. ],
       [  89. ,   94. ,   28.1,   21. ],
       [ 137. ,  168. ,   43.1,   33. ]])
In [89]:
pima.head()
Out[89]:
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
0 6 148 72 35 0 33.6 0.627 50 1
1 1 85 66 29 0 26.6 0.351 31 0
2 8 183 64 0 0 23.3 0.672 32 1
3 1 89 66 23 94 28.1 0.167 21 0
4 0 137 40 35 168 43.1 2.288 33 1
So, according to the chi-squared scores, the top four features are Glucose, Insulin, BMI and Age.
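To double-check which columns SelectKBest picked, the chi-squared scores of the fitted selector can be inspected. A minimal sketch using the fit object from above:

# Pair each column name with its chi-squared score; the four largest scores
# correspond to the selected features.
scores = pd.Series(fit.scores_, index=X.columns).sort_values(ascending=False)
print(scores)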
In [90]:
X_features = pd.DataFrame(data = features, columns = ["Glucose","Insulin","BMI","Age"])
In [91]:
X_features.head()
Out[91]:
Glucose Insulin BMI Age
0 148.0 0.0 33.6 50.0
1 85.0 0.0 26.6 31.0
2 183.0 0.0 23.3 32.0
3 89.0 94.0 28.1 21.0
4 137.0 168.0 43.1 33.0
In [93]:
Y = pima.iloc[:,8]
In [94]:
Y.head()
Out[94]:
0    1
1    0
2    1
3    0
4    1
Name: Outcome, dtype: int64

5) Standardization

Standardization rescales each attribute so that it has mean 0 and standard deviation 1. It is useful when an algorithm assumes the input features are roughly Gaussian-distributed or expects them to be on a comparable scale.
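As a quick sanity check, the transformation applied by StandardScaler (used in the next cell) can be reproduced by hand: subtract each column's mean and divide by its standard deviation. A minimal sketch (note that StandardScaler uses the population standard deviation, i.e. ddof=0):

# Manual standardization of the selected features: (x - mean) / std
manual = (X_features - X_features.mean()) / X_features.std(ddof=0)
print(manual.head())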
In [95]:
from sklearn.preprocessing import StandardScaler
rescaledX = StandardScaler().fit_transform(X_features)
In [97]:
X = pd.DataFrame(data = rescaledX, columns= X_features.columns)
In [98]:
X.head()
Out[98]:
Glucose Insulin BMI Age
0 0.848324 -0.692891 0.204013 1.425995
1 -1.123396 -0.692891 -0.684422 -0.190672
2 1.943724 -0.692891 -1.103255 -0.105584
3 -0.998208 0.123302 -0.494043 -1.041549
4 0.504055 0.765836 1.409746 -0.020496

6) Binary Classification

In [99]:
from sklearn.model_selection import train_test_split
X_train,X_test,Y_train,Y_test = train_test_split(X,Y, random_state = 22, test_size = 0.2)
In [100]:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
In [101]:
models = []
models.append(("LR",LogisticRegression()))
models.append(("NB",GaussianNB()))
models.append(("KNN",KNeighborsClassifier()))
models.append(("DT",DecisionTreeClassifier()))
models.append(("SVM",SVC()))
In [102]:
results = []
names = []
for name,model in models:
    kfold = KFold(n_splits=10, random_state=22)
    cv_result = cross_val_score(model,X_train,Y_train, cv = kfold,scoring = "accuracy")
    names.append(name)
    results.append(cv_result)
for i in range(len(names)):
    print(names[i],results[i].mean())
LR 0.776890534109
NB 0.760497091486
KNN 0.745928080381
DT 0.703648863035
SVM 0.776890534109
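The mean alone hides how much the accuracy varies across folds; printing the standard deviation alongside it gives a fuller picture and ties in with the box plot in the next section. A small sketch using the results list from above:

# Mean and spread of the 10-fold accuracies for each model
for name, result in zip(names, results):
    print("%s: %.3f (+/- %.3f)" % (name, result.mean(), result.std()))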

7) Visualising Results

In [104]:
ax = sns.boxplot(data=results)
ax.set_xticklabels(names)
Out[104]:
[Box plot of the 10-fold cross-validation accuracies for each model]

8) Final Prediction using Test Data

Logistic Regression and SVM give the best cross-validation accuracy, so let's evaluate both on the held-out test data.
In [110]:
lr = LogisticRegression()
lr.fit(X_train,Y_train)
predictions = lr.predict(X_test)
In [111]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
In [112]:
print(accuracy_score(Y_test,predictions))
0.714285714286
In [113]:
svm = SVC()
svm.fit(X_train,Y_train)
predictions = svm.predict(X_test)
In [114]:
print(accuracy_score(Y_test,predictions))
0.733766233766
In [115]:
print(classification_report(Y_test,predictions))
             precision    recall  f1-score   support

          0       0.74      0.92      0.82       100
          1       0.72      0.39      0.51        54

avg / total       0.73      0.73      0.71       154

In [116]:
conf = confusion_matrix(Y_test,predictions)
In [118]:
label = ["0","1"]
sns.heatmap(conf, annot=True, xticklabels=label, yticklabels=label)
Out[118]:
[Confusion matrix heatmap]
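The four cells of the confusion matrix can also be unpacked by name, which makes the link to the precision and recall in the classification report explicit. A minimal sketch using the conf matrix from above:

# For binary labels [0, 1], confusion_matrix returns [[tn, fp], [fn, tp]]
tn, fp, fn, tp = conf.ravel()
print("True negatives:", tn, " False positives:", fp)
print("False negatives:", fn, " True positives:", tp)
print("Recall for class 1:", tp / (tp + fn))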
