K Nearest Neighbors: an Explanation and Tutorial
Think back to your high school lunchroom. Or better yet, think about the stereotypical Hollywood vision of a high school lunchroom. Notice the cliques that have “organically” formed. You have the ‘jocks’ at their tables, being rambunctious and throwing a ball around. Letterman jackets and team jerseys abound. The ‘theater kids’ at their tables running lines and singing, dressed in whatever period piece costume they had to wear to the latest rehearsal. And of course, the ‘nerds’ either eating quietly or waxing poetic about the latest comic book/movie/game/etc. that they bought. And the ‘goth kids.’ Dressed in all black, looking like they’ve just come back from the angstiest funeral in history. And finally, there are the ‘loners.’ They don’t think they belong to any clique. They just happen to sit at the same table, every day, but definitely aren’t a clique at all. You walk into this lunchroom to look for me and remember that I’m the president of the chess club. Quick, which group do you think I sit with? OK, the thought process that you just went through is K-Nearest Neighbors Classification.
You thought about the information I gave you about each group (their features), compared it to the information you know about me (my features), and decided where I sit (my classification). Well done! (You know, apart from the horrible stereotyping we just did.)
Unfortunately, in the Data Science field, most KNN Classification isn’t so easy, nor are most datasets so visually stark. So, to help all of our non-DS friends better understand how we come up with the results we do, I’ve written a quick tutorial on building a K-Nearest Neighbors Classifier Algorithm.
The plan with any KNN Classifier algorithm is as follows:
- Get a cleaned dataset. We don’t want missing values or mixed-type features.
- Get a point to classify. It has to have the same number of features as the rows in the dataset.
- Find the K closest elements from the dataset to the point in question. K is generally an odd number, so as to avoid ties.
- Classify that point based on the classifications of its ‘neighbors.’
Steps 1 and 2 require ‘feature engineering’ and ‘data cleaning,’ two very important skills that I won’t be getting into here, but which are key to any data science endeavor. For convenience’s sake, we’ll assume that the first two steps have been completed.
Step 3 is where we get to the meat of the issue. Luckily, it is exactly what it sounds like. We need to find the closest elements to the point in question. That is to say, the elements the shortest distance away. There are actually more than a couple of ways to define distance in DS. The most straightforward, I believe, is the Euclidean Distance. That is:
The Square Root of the Sum of the Squares of the Differences between the two points. In symbols, for two points a and b with n features each: d(a, b) = √((a₁ − b₁)² + (a₂ − b₂)² + … + (aₙ − bₙ)²)
See, very straightforward…and I’m kidding…but not really. This is actually the most conceptually difficult step to understand, but most of us already know the concept by a different name: the Pythagorean Theorem. In two dimensions, the distance between (1, 2) and (4, 6) is √((4 − 1)² + (6 − 2)²) = √(9 + 16) = 5, which is exactly the length of the hypotenuse of a right triangle with legs 3 and 4.
So, now that we have a distance-measuring method that we are comfortable with, we can replicate it in Python code:
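A minimal sketch of that function, using plain Python lists and the standard math module (the name euclidean_distance and the variable names are just my choices here), might look like this:

```python
from math import sqrt

def euclidean_distance(point_a, point_b):
    """Return the Euclidean distance between two equal-length feature lists."""
    # Square the difference for each feature, sum them up, then take the square root.
    squared_diffs = [(a - b) ** 2 for a, b in zip(point_a, point_b)]
    return sqrt(sum(squared_diffs))
```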
And let's test it with a simple dataset and a reference point:
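Any small set of numeric feature rows will do; here is a made-up stand-in with two clearly separated clusters:

```python
# A toy dataset of feature rows, and a reference point to measure against.
features = [
    [1.0, 2.0],
    [2.0, 1.5],
    [1.5, 1.8],
    [6.0, 7.0],
    [7.5, 8.0],
    [6.5, 9.0],
]
reference_point = [1.2, 1.9]

for row in features:
    print(round(euclidean_distance(row, reference_point), 3))
# The first three rows come out far closer to the reference point than the last three.
```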
Great! With a working Euclidean Distance function, we can now iterate to find the K-Nearest Neighbors to our point. And here is where KNN becomes your favorite model. With most, if not all, other models, we would need to take some time and effort to fit our data to the model. However, because KNN models require the entire dataset to be stored, our fit method is simply pulling the cleaned data into an array, and moving the classification feature to the last element in each row.
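Written as a standalone function (we'll fold it into a class at the end), and keeping everything in plain Python lists rather than a NumPy array, the fit step could be as small as this sketch:

```python
def fit(features, labels):
    # 'Fitting' a KNN model just means storing the cleaned data,
    # with each row's class label moved to the last element.
    return [list(row) + [label] for row, label in zip(features, labels)]
```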
And let’s test the fit function:
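Sticking with the toy features from above and making up a class label for each row:

```python
# Class labels for the toy dataset: the first three rows are class 0, the rest class 1.
labels = [0, 0, 0, 1, 1, 1]

fitted_data = fit(features, labels)
print(fitted_data[0])  # [1.0, 2.0, 0] -- the label now sits in the last position
```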
Now that the data is fitted, we can find the nearest neighbors.
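One way to sketch that step: measure the distance from every stored row to our reference point, sort by that distance, and keep the closest K (again, the variable names are mine):

```python
k = 3

# Pair every stored row with its distance to the reference point.
# The last element of each row is the label, so we leave it out of the distance.
distances = [
    (euclidean_distance(row[:-1], reference_point), row) for row in fitted_data
]

# Sort by distance and keep the k closest rows.
distances.sort(key=lambda pair: pair[0])
nearest_neighbors = [row for _, row in distances[:k]]
print(nearest_neighbors)
```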
Admittedly, we will not be using the above code as its own function later, but it is easy to test it this way.
With the nearest neighbors in hand, we can move on to Step 4 and make our prediction, based on their classifications. For this model, we’ll be taking the mode of the K-Nearest Neighbors’ classifications.
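A sketch of that step, using collections.Counter from the standard library to take the mode of the neighbors' labels:

```python
from collections import Counter

# The prediction is simply the most common label among the k nearest neighbors.
neighbor_labels = [row[-1] for row in nearest_neighbors]
prediction = Counter(neighbor_labels).most_common(1)[0][0]
print(prediction)  # 0
```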
The 3 nearest neighbors all have classifications of ‘0’. So we will predict that our reference point will also have a classification of ‘0’.
Written up nicely, our KNN Classifier looks like this:
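Here is one way it might all come together in a single class; the class name KNNClassifier and the exact method signatures are my choices for this sketch:

```python
from math import sqrt
from collections import Counter

class KNNClassifier:
    """A bare-bones K-Nearest Neighbors classifier."""

    def __init__(self, k=3):
        self.k = k

    def fit(self, features, labels):
        # Store the cleaned data, with each row's label moved to the last element.
        self.data = [list(row) + [label] for row, label in zip(features, labels)]
        return self

    def predict(self, point):
        # Rank every stored row by its Euclidean distance to the new point...
        ranked = sorted(
            self.data,
            key=lambda row: sqrt(
                sum((a - b) ** 2 for a, b in zip(row[:-1], point))
            ),
        )
        # ...then return the most common label among the k closest rows.
        neighbor_labels = [row[-1] for row in ranked[: self.k]]
        return Counter(neighbor_labels).most_common(1)[0][0]


# Quick check with the toy features, labels, and reference_point from earlier:
model = KNNClassifier(k=3)
model.fit(features, labels)
print(model.predict(reference_point))  # 0
```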
I’ll wrap up this tutorial with some notes of caution. First, beware of the ‘Curse of Dimensionality’ when using a KNN model. KNN models are most appropriate for datasets with only a few features. As the number of features increases, the data becomes exponentially more sparse, meaning that the points become less and less ‘close’ to one another. As a rule, the smaller the dataset, the fewer features we should use. Secondly, we skipped the data cleaning and feature engineering, which are usually integral parts of the process. Especially considering the curse of dimensionality, good feature engineering can be absolutely mandatory in a KNN implementation.
Now you have all the tools to implement a KNN Classifier, and, hopefully, understand each step along the way. Have fun, and KEEP CODING!!