Data
In [2]:
from sklearn.datasets import load_iris
iris = load_iris()
In [3]:
X = iris.data
Y = iris.target
$k$-NN
Let's use 10-fold cross-validation to predict labels on the iris dataset with the $k$-Nearest Neighbours classifier.
In [4]:
import numpy as np
import matplotlib.pyplot as plt

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
In [5]:
model = KNeighborsClassifier(n_neighbors=1)
# Note: when an integer cv=10 is passed, cross_val_score builds its own (stratified) folds,
# so the KFold object below is not actually used.
kfold = KFold(n_splits=10)
score = cross_val_score(model, X, Y, cv=10, scoring="accuracy")
In [6]:
score
Out[6]:
In [7]:
score.mean()
Out[7]:
The above model uses n_neighbors = 1. Now, let's experiment with a range of n_neighbors values and find which one produces the maximum accuracy.
In [8]:
k_range = np.arange(1, 31)
k_result = []
for k in k_range:
    model = KNeighborsClassifier(n_neighbors=k)
    score = cross_val_score(model, X, Y, cv=10, scoring="accuracy")
    k_result.append(score.mean())
In [9]:
print(k_result)
plt.plot(k_range, k_result)
Out[9]:
The above graph shows that the accuracy peaks when n_neighbors is around {13, 17, 20}. Rather than reading the best value off the plot, we can also extract it programmatically, as sketched below, and then find the same result using GridSearchCV.
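A minimal sketch of picking the best k directly from k_result (assuming numpy is available as np, as above):
best_index = int(np.argmax(k_result))   # np.argmax returns the first maximum, so ties resolve to the smallest k
best_k = k_range[best_index]
print(best_k, k_result[best_index])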
In [10]:
from sklearn.model_selection import GridSearchCV
In [11]:
k_range = np.arange(1, 31)
param_grid = dict(n_neighbors=k_range)
print(param_grid)
In [12]:
knn = KNeighborsClassifier()
grid = GridSearchCV(knn, param_grid=param_grid, scoring="accuracy", cv=10)
In [13]:
grid.fit(X,Y)
Out[13]:
In [14]:
grid.cv_results_  # per-candidate scores (replaces the older grid_scores_ attribute)
Out[14]:
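cv_results_ is a fairly verbose dictionary; one convenient way to inspect it (a small sketch, assuming pandas is available) is to pull the relevant columns into a DataFrame:
import pandas as pd

results = pd.DataFrame(grid.cv_results_)
# mean cross-validated accuracy for each candidate value of n_neighbors
print(results[["param_n_neighbors", "mean_test_score"]])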
In [15]:
grid.best_estimator_
Out[15]:
In [16]:
grid.best_params_
Out[16]:
In [17]:
grid.best_score_
Out[17]:
Okay, now let's run k-Nearest Neighbours with different n_neighbors values combined with different values of the weights parameter.
weights takes two values: "uniform" (all neighbours contribute equally) and "distance" (closer neighbours are weighted more heavily).
In [18]:
k_range = np.arange(1, 31)
weights = ["uniform", "distance"]
param_grid = dict(n_neighbors=k_range, weights=weights)
In [19]:
knn = KNeighborsClassifier()
grid = GridSearchCV(knn, param_grid, scoring="accuracy", cv=10)
In [20]:
grid.fit(X,Y)
Out[20]:
In [21]:
grid.cv_results_
Out[21]:
In [22]:
grid.best_estimator_
Out[22]:
In [23]:
grid.best_params_
Out[23]:
In [24]:
grid.best_score_
Out[24]:
We have tried 60 different combinations (30 values of n_neighbors × 2 values of weights) to find the best parameter values. In this case the dataset is small, so this exhaustive search is computationally cheap. For larger datasets an exhaustive search may not be computationally feasible; to overcome this, we can use RandomizedSearchCV, which lets us limit the number of combinations that are tried.
In [25]:
from sklearn.model_selection import RandomizedSearchCV
In [26]:
k_range = np.arange(1, 31)
weights = ["uniform", "distance"]
param_grid = dict(n_neighbors=k_range, weights=weights)
knn = KNeighborsClassifier()
randomized = RandomizedSearchCV(knn, param_grid, scoring="accuracy", cv=10, n_iter=10)
In [27]:
randomized.fit(X,Y)
Out[27]:
In [28]:
randomized.best_estimator_
Out[28]:
In [29]:
randomized.cv_results_
Out[29]:
In [30]:
randomized.best_params_
Out[30]:
In [31]:
randomized.best_score_
Out[31]:
With just 10 randomly sampled combinations, RandomizedSearchCV was, luckily, able to find the best accuracy.
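Once the search is done, the best estimator is refitted on the full dataset (refit=True is the default), so the fitted search object can be used directly for prediction. A small sketch, using a made-up flower measurement purely for illustration:
sample = [[5.0, 3.4, 1.5, 0.2]]    # hypothetical sepal/petal measurements (cm)
pred = randomized.predict(sample)  # uses the refitted best estimator internally
print(pred, iris.target_names[pred])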