Introduction

The K-Nearest Neighbors (KNN) algorithm is a non-parametric learning algorithm, meaning that it makes no assumptions about the distribution of the underlying data. The idea behind KNN, proximity, is simple, yet it is capable of performing rather complex classifications. Let's talk about its theory.

The theory behind KNN

A simple example of how it works

Say we have two classes of training data (red and blue), with the data points scattered around a graph. A new data point is introduced at some position on the graph. The (Euclidean or Manhattan) distances between the new data point and the training points are calculated. Then an integer k is chosen to represent the number of nearest training points to consider, and the new data point is assigned to the class that makes up the majority of those k nearest points.

An example with k=3:

[Image: KNN classification example with k=3]
Credit: Scott Robinson
  • "X" is the newly introduced data point. Here you can see that with k=3, within the circle, there are 2 red training data points nearest to X and there is only 1 blue training data point. Therefore, the new data point is classified as red.

The good and bad of KNN

The good

  • Easy to understand and implement
  • KNN is a lazy learning algorithm, which means it has no explicit training phase: it simply stores the training data and does all of its work at prediction time. This makes it very quick to set up, and new data can be added seamlessly as a result.
  • Only 2 parameters are required for its implementation (see the sketch after this list):
    • the k-value
    • the distance metric (typically Euclidean or Manhattan)
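
In scikit-learn, these two choices map directly onto the n_neighbors and metric arguments of KNeighborsClassifier. A quick sketch (the specific values here are arbitrary examples, not recommendations):

from sklearn.neighbors import KNeighborsClassifier

# k = 5 neighbours with Euclidean distance (scikit-learn's default metric,
# Minkowski with p=2, is equivalent to Euclidean)
clf_euclidean = KNeighborsClassifier(n_neighbors=5, metric='euclidean')

# the same classifier, but using Manhattan (city-block) distance instead
clf_manhattan = KNeighborsClassifier(n_neighbors=5, metric='manhattan')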

The bad

  • Not good for high-dimensional data. As the number of dimensions grows, computing distances becomes more expensive and the distances themselves become less informative, so it is harder for the algorithm to find genuinely "near" neighbours.
  • High prediction cost for large datasets arising from higher cost of distance calculation.
  • Not good for classifying categorical features. Distance measures other than Euclidean and Manhattan may be needed to apply KNN to categorical features; a brief sketch of one such measure follows this list. There is a discussion about this on Stack Overflow.
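
One distance measure that works for purely categorical features is the Hamming distance, which counts the fraction of features on which two records disagree. A tiny sketch of the idea (the hamming_distance helper is my own illustration, not from the tutorial):

import numpy as np

def hamming_distance(a, b):
    # fraction of positions where the two categorical records differ
    a, b = np.asarray(a), np.asarray(b)
    return float(np.mean(a != b))

print(hamming_distance(['red', 'small', 'round'], ['red', 'large', 'round']))  # 0.333...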

How to use the Python Scikit-learn library for KNN classification

I used a Jupyter Notebook to run the necessary code, following this excellent tutorial, and kept my code comments as concise as possible.

# to import the necessary libraries

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# to set the url of the famous iris dataset source
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"

# to make a list of column names to be assigned to dataframe
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'Class']

# to read dataset to a pandas dataframe
dataset = pd.read_csv(url, names=names)
# to see the top 5 rows of "dataset"
dataset.head()

   sepal-length  sepal-width  petal-length  petal-width        Class
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa
3           4.6          3.1           1.5          0.2  Iris-setosa
4           5.0          3.6           1.4          0.2  Iris-setosa
# now we want to split the dataset into its features and labels
X = dataset.iloc[:, :-1].values # assigns first 4 columns to X
y = dataset.iloc[:, 4].values # assigns the "Class" column to y
# to evaluate the model on data it has not seen (and so catch over-fitting), we divide the dataset into TRAINING and TEST splits
# this gives us a better idea of how the KNN algorithm performs during the TESTING phase

from sklearn.model_selection import train_test_split

# this will split the dataset into 80% training data and 20% test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)
# it doesn't stop here
# feature scaling should be done next so the features can be uniformly evaluated
# feature scaling standardizes the features so that no single feature dominates the distance calculations
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X_train)

X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
# time for training and prediction with KNN!
# 5 is a commonly used default value for k (it is also scikit-learn's default)
from sklearn.neighbors import KNeighborsClassifier

classifier = KNeighborsClassifier(n_neighbors=5)
classifier.fit(X_train, y_train)

# to make predictions on our test data
y_pred = classifier.predict(X_test)
# now we want to evaluate our algorithm's accuracy
from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

# the results show that the KNN algorithm classified 29 of the 30 records
# in the test set correctly, i.e. 97% accuracy.
[[11  0  0]
 [ 0  8  1]
 [ 0  0 10]]
                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        11
Iris-versicolor       1.00      0.89      0.94         9
 Iris-virginica       0.91      1.00      0.95        10

       accuracy                           0.97        30
      macro avg       0.97      0.96      0.96        30
   weighted avg       0.97      0.97      0.97        30
# so what if other k-values would provide us with better accuracy?
# one way to find out the best k-value is to plot the graph of k-value
# and the corresponding error rate

# let's now plot the mean error on the test set predictions for k-values
# from 1 to 39
error = []

# calculating the error for k-values between 1 and 39
for i in range(1, 40):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train, y_train)
    pred_i = knn.predict(X_test)
    error.append(np.mean(pred_i != y_test))
# time to plot the error values against the k-values!
plt.figure(figsize=(12, 6))
plt.plot(range(1, 40), error, color='red', linestyle='dashed', marker='o',
markerfacecolor='blue', markersize=10)
plt.title('Error Rate vs. K Value')
plt.xlabel('K Value')
plt.ylabel('Mean Error')
[Plot: mean error plotted against k-values from 1 to 39]
# we can see that the mean error is zero for k-values 25 to 29!
# let's try k=27 and see what we get!

classifier = KNeighborsClassifier(n_neighbors=27)
classifier.fit(X_train, y_train)

# to make predictions on our test data
y_pred = classifier.predict(X_test)
# will we get a higher accuracy from using k=27?
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
[[11  0  0]
 [ 0  9  0]
 [ 0  0 10]]
                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        11
Iris-versicolor       1.00      1.00      1.00         9
 Iris-virginica       1.00      1.00      1.00        10

       accuracy                           1.00        30
      macro avg       1.00      1.00      1.00        30
   weighted avg       1.00      1.00      1.00        30
# YES!

Conclusion

We now know that while many k-values can deliver a decent result, plotting the mean error against a range of k-values is a simple way to identify the values that perform best, which can lead to an even more accurate classification.

Even though the idea behind KNN is simple, the predictive power it offers is anything but. In fact, making accurate predictions is one of the most challenging aspects of machine learning.

Learning something new by first understanding something simple has definitely helped keep my interest in machine learning alive. I certainly hope you feel the same way. Shoutout to Scott Robinson for his amazing tutorial, which is linked below.

References

  1. K-Nearest Neighbors Algorithm in Python and Scikit-Learn

Possible improvement(s)

  • I will try to type my Jupyter Notebook comments in Markdown format next time! :)