Introduction
This notebook demonstrates data visualisation and several machine learning classification algorithms on the Pima Indians Diabetes dataset.
In [56]:
from IPython.display import YouTubeVideo
YouTubeVideo("pN4HqWRybwk")
Out[56]:
1) Loading Libraries
In [57]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
import warnings
warnings.filterwarnings('ignore')
2) Data
In [73]:
pima = pd.read_csv("diabetes.csv")
In [74]:
pima.head()
Out[74]:
Additional details about the attributes
Pregnancies: Number of times pregnant
Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
BloodPressure: Diastolic blood pressure (mm Hg)
SkinThickness: Triceps skin fold thickness (mm)
Insulin: 2-Hour serum insulin (mu U/ml)
BMI: Body mass index (weight in kg/(height in m)^2)
DiabetesPedigreeFunction: Diabetes pedigree function
Age: Age (years)
Outcome: Class variable (0 = non-diabetic, 1 = diabetic)
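One caveat worth knowing about this dataset: several columns use 0 as a placeholder for missing measurements, even though a zero Glucose, BloodPressure, SkinThickness, Insulin or BMI is physiologically implausible. As a quick sanity check (a small sketch, not part of the original workflow), the zeros can be counted per column:

# Count zeros in columns where 0 effectively means "missing"
zero_cols = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]
print((pima[zero_cols] == 0).sum())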
In [75]:
pima.shape
Out[75]:
In [76]:
pima.describe()
Out[76]:
In [77]:
pima.groupby("Outcome").size()
Out[77]:
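The class counts show an imbalance: 500 non-diabetic samples (Outcome 0) against 268 diabetic ones (Outcome 1). A normalised view (a minimal sketch) makes the proportions explicit:

# Relative class frequencies instead of raw counts
print(pima["Outcome"].value_counts(normalize=True))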
3) Data Visualisation
Let's visualise the data with histograms, box plots and a correlation heatmap.
In [78]:
pima.hist(figsize=(10,8))
Out[78]:
In [79]:
pima.plot(kind='box', subplots=True, layout=(3,3), sharex=False, sharey=False, figsize=(10,8))
Out[79]:
In [80]:
column_x = pima.columns[:-1]  # all feature columns, excluding Outcome
In [81]:
column_x
Out[81]:
In [82]:
corr = pima.corr()
In [83]:
sns.heatmap(corr, annot=True)
Out[83]:
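The heatmap is easiest to read pairwise; to rank the features by their linear association with the target, the Outcome column of the correlation matrix can be sorted directly (a short sketch using the corr computed above):

# Correlation of each feature with the target, strongest first
print(corr["Outcome"].drop("Outcome").sort_values(ascending=False))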
4) Feature Selection
In [69]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
In [84]:
X = pima.iloc[:, 0:8]  # all eight feature columns
Y = pima.iloc[:, 8]    # the Outcome target
select_top_4 = SelectKBest(score_func=chi2, k=4)
In [87]:
fit = select_top_4.fit(X,Y)
features = fit.transform(X)
In [88]:
features[0:5]
Out[88]:
In [89]:
pima.head()
Out[89]:
So, according to the chi-squared scores, the top four features are Glucose, Insulin, BMI and Age.
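Rather than matching the selected values against pima.head() by eye, the scores can be read off the fitted selector (a short sketch using the fit object from above):

# Pair each feature with its chi-squared score and sort, highest first
print(pd.Series(fit.scores_, index=X.columns).sort_values(ascending=False))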
In [90]:
X_features = pd.DataFrame(data=features, columns=["Glucose", "Insulin", "BMI", "Age"])
In [91]:
X_features.head()
Out[91]:
In [93]:
Y = pima.iloc[:,8]
In [94]:
Y.head()
Out[94]:
5) Standardization
Standardization rescales each attribute to have mean 0 and standard deviation 1. Note that this linear rescaling does not make a distribution Gaussian; it is useful because many algorithms (for example SVM, KNN and logistic regression) expect the input features to be centred and on a comparable scale.
In [95]:
from sklearn.preprocessing import StandardScaler
rescaledX = StandardScaler().fit_transform(X_features)
In [97]:
X = pd.DataFrame(data = rescaledX, columns= X_features.columns)
In [98]:
X.head()
Out[98]:
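As a quick verification (a small sketch), the rescaled columns should now have mean approximately 0 and standard deviation approximately 1:

# Means should be ~0 and standard deviations ~1 after standardization
print(X.mean().round(3))
print(X.std().round(3))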
6) Binary Classification
In [99]:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=22, test_size=0.2)
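Given the class imbalance noted earlier, a stratified split is a variant worth considering (a sketch, not what this notebook uses downstream): passing stratify=Y keeps the 0/1 ratio the same in the train and test partitions.

# Stratified variant: preserves the class ratio in both partitions
# X_train, X_test, Y_train, Y_test = train_test_split(
#     X, Y, random_state=22, test_size=0.2, stratify=Y)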
In [100]:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
In [101]:
models = []
models.append(("LR",LogisticRegression()))
models.append(("NB",GaussianNB()))
models.append(("KNN",KNeighborsClassifier()))
models.append(("DT",DecisionTreeClassifier()))
models.append(("SVM",SVC()))
In [102]:
results = []
names = []
for name, model in models:
    # shuffle=True is required for random_state to take effect in KFold
    kfold = KFold(n_splits=10, shuffle=True, random_state=22)
    cv_result = cross_val_score(model, X_train, Y_train, cv=kfold, scoring="accuracy")
    names.append(name)
    results.append(cv_result)
for i in range(len(names)):
    print(names[i], results[i].mean())
7) Visualising Results
In [104]:
ax = sns.boxplot(data=results)
ax.set_xticklabels(names)
Out[104]:
8) Final Prediction using Test Data
Logistic Regression and SVM give the highest cross-validation accuracy, so we evaluate both on the held-out test set.
In [110]:
lr = LogisticRegression()
lr.fit(X_train,Y_train)
predictions = lr.predict(X_test)
In [111]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
In [112]:
print(accuracy_score(Y_test,predictions))
In [113]:
svm = SVC()
svm.fit(X_train,Y_train)
predictions = svm.predict(X_test)
In [114]:
print(accuracy_score(Y_test,predictions))
In [115]:
print(classification_report(Y_test,predictions))
In [116]:
conf = confusion_matrix(Y_test,predictions)
In [118]:
label = ["0","1"]
sns.heatmap(conf, annot=True, fmt="d", xticklabels=label, yticklabels=label)  # fmt="d" shows integer counts
Out[118]:
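The per-class metrics in the classification report can also be recovered directly from the confusion matrix entries (a small sketch; with sklearn's convention, ravel() on a 2x2 confusion matrix returns tn, fp, fn, tp):

# Precision and recall for class 1, derived from the confusion matrix
tn, fp, fn, tp = conf.ravel()
print("precision(1) =", tp / (tp + fp))
print("recall(1) =", tp / (tp + fn))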