From Least Squares to k-Nearest Neighbor (kNN)

Figure 1 – Linear least squares classification of the training data

The linear model is one of the most widely used tools in data science and one of the most important.  Another basic tool is the k-nearest neighbor (kNN) method.  Both models can be used beyond prediction, for classification.  Feature classes are used by machines to recognize faces in a crowd, to “read” road signs by distinguishing one letter from another, and to set voter registration districts by separating population groups.  This article applies and compares both classification methods.

Introduction by Comparison

A side-by-side comparison of methods reveals their nature:

  • The linear model fit by least squares makes strict assumptions about the structure of the model and the underlying data.  Linear regression is solved with a closed-form matrix equation.  The resulting fit is typically stable, but its predictions tend to be inflexible in shape and possibly inaccurate;
  • The k-nearest neighbor method makes only limited assumptions about the model and data.  The method is solved with an iterative search.  The resulting fit is very flexible in shape and its predictions are usually accurate, but prone to instability.

Least-Squares Classification

Figure 1 shows a scatter plot of training data on the input pair X_{1} and X_{2}, each of which has 100 points. The random data also define the output class variable G, which takes the values \textbf{\textcolor{RoyalBlue}{BLUE}} and \textbf{\textcolor{Orange}{ORANGE}} and sets the color of each training point.

To determine the class of each point on the background grid, a linear least squares model was fit to the training data.  First, the response variable Y is coded as 0 for \textbf{\textcolor{RoyalBlue}{BLUE}} data points and 1 for \textbf{\textcolor{Orange}{ORANGE}} data points.  Second, the fitted values \hat{Y} are used to define the class estimate \hat{G} according to the rule:

(1)   \begin{equation*} \hat{G} = \begin{cases} \mathbf{\textcolor{Orange}{ORANGE}} & \text{if} ~ \hat{Y} > 0.5, \\ \mathbf{\textcolor{RoyalBlue}{BLUE}} & \text{if} ~ \hat{Y} \leq 0.5. \end{cases} \end{equation*}

Hence, in matrix notation, the two-dimensional classification problem is the regression

(2)   \begin{equation*} \mathbf{y} = \mathbf{X}\beta + \mathbf{e} \end{equation*}

where the intercept is absorbed into \mathbf{X} as a constant column and y_i equals 0 or 1 depending on its class. The fitted values are \hat{Y} = x^T\hat{\beta}, and the decision boundary is the line in the input space defined by x^T \hat{\beta} = 0.50. Finally, the background grid in Figure 1 is shaded according to the classification rule in (1).
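
As a rough sketch of this procedure (not the article's actual code), the R snippet below simulates a stand-in training set, fits the least squares model, and classifies a background grid with the rule in (1). The data-generating step, the grid range, and the variable names are assumptions, since the article only describes the training data as random points in two classes.

```r
# Hedged sketch of the least squares classifier; the data-generating step is assumed.
set.seed(1)
n  <- 100                                        # assumed training set size
x1 <- c(rnorm(n / 2, -1, 1), rnorm(n / 2, 1, 1)) # two loose clusters as stand-in "random data"
x2 <- c(rnorm(n / 2, -1, 1), rnorm(n / 2, 1, 1))
y  <- rep(c(0, 1), each = n / 2)                 # 0 = BLUE, 1 = ORANGE, the coding for Y
train <- data.frame(x1, x2, y)

fit <- lm(y ~ x1 + x2, data = train)             # linear least squares fit

# Classify a background grid with rule (1): ORANGE if Y_hat > 0.5, BLUE otherwise
grid <- expand.grid(x1 = seq(-3, 3, length.out = 100),
                    x2 = seq(-3, 3, length.out = 100))
grid$yhat <- predict(fit, newdata = grid)
grid$Ghat <- ifelse(grid$yhat > 0.5, "ORANGE", "BLUE")
```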

It's clear that there are several misclassifications in the data on both sides of the linear decision boundary in Figure 1.  Evidently the decision boundary is too inflexible: the optimal boundary is more likely non-linear and may be disjoint in nature.

k-Nearest Neighbor Method

The k-nearest neighbor method uses the k observations in the training set \mathlarger{\mathlarger{\tau}} closest to a query point x on the background grid to form \hat{Y}.  The kNN fit for \hat{Y} is defined by:

(3)   \begin{equation*} \hat{Y} = \frac{1}{k}\sum_{x_i \in N_k (x)} y_i \end{equation*}

where N_k(x) is the neighborhood of x consisting of the k closest training points x_i in \mathlarger{\mathlarger{\tau}}. Closeness implies a metric, which for simplicity is taken to be Euclidean distance. So, in words, for each grid point x we find the k training observations x_i closest to x in the input space and average their responses.
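
Continuing the hedged sketch from the least squares section (the train and grid objects there are assumptions, not the article's data), a direct implementation of the average in (3) might look like this in R:

```r
# Direct implementation of the kNN average in equation (3)
knn_average <- function(x, train, k) {
  d <- sqrt((train$x1 - x[1])^2 + (train$x2 - x[2])^2)  # Euclidean distance to each training point
  neighbors <- order(d)[1:k]                            # indices of the k closest points, N_k(x)
  mean(train$y[neighbors])                              # average of their 0/1 responses
}

# Classify the background grid with k = 25 neighbors, as in Figure 2
k <- 25
grid$yhat_knn <- apply(grid[, c("x1", "x2")], 1, knn_average, train = train, k = k)
grid$Ghat_knn <- ifelse(grid$yhat_knn > 0.5, "ORANGE", "BLUE")
```

With a 0/1 coding and two classes, thresholding this average at 0.5 is the same as a majority vote among the k neighbors, which is what packaged implementations such as class::knn() compute directly.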

In Figure 2, the same training data are used as in Figure 1, but the feature space is now classified by the average class of the 25 nearest neighbors, with equation (3) as the method of fitting. It is now obvious that the decision boundary separating the \textbf{\textcolor{RoyalBlue}{BLUE}} region from the \textbf{\textcolor{Orange}{ORANGE}} region is far more irregular, responding to local clusters and outliers where one class dominates.

Figure 2 – kNN classification with k = 25

In Figure 2, with k = 25, far fewer training observations are misclassified than in Figure 1.  This doesn't give much comfort, though, since a boundary tuned to the training data can still misclassify new observations.

Figure 3 – kNN classification with k = 8

In Figure 3, with k = 8, we now see fewer errors.  The classification boundary also forms disjoint islands that help separate the data.

Figure 4 – kNN classification with k = 1

In Figure 4, with k = 1, none of the training data are misclassified. A little thought suggests that for k-nearest-neighbor fits, the error on the training data should be roughly an increasing function of k, and will always be 0 for k = 1.
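
To make that concrete, the short sketch below (still using the hypothetical train data and knn_average() helper from the earlier snippets) computes the training misclassification rate for the k values used in Figures 2 through 4; each training point is its own nearest neighbor, so the error at k = 1 is exactly zero.

```r
# Training misclassification rate for the k values in Figures 2-4 (hypothetical data from above)
train_error <- function(k) {
  yhat <- apply(train[, c("x1", "x2")], 1, knn_average, train = train, k = k)
  mean(ifelse(yhat > 0.5, 1, 0) != train$y)
}
sapply(c(25, 8, 1), train_error)  # error generally shrinks as k decreases and is 0 at k = 1
```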

Final Comparison and Takeaways

The k-nearest-neighbor fit has a single tuning parameter, the number of neighbors k, whereas linear least squares relies on p, the number of model parameters.  The effective number of parameters for k-nearest neighbors is N/k, which is generally much bigger than p and decreases with increasing k.  To get an idea of why, note that if the neighborhoods were non-overlapping, there would be N/k neighborhoods and we would fit one parameter (a mean) in each neighborhood.
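
As a back-of-the-envelope illustration, using the N = 100 training points assumed in the sketches above:

```r
# Effective number of parameters N/k for the k values used in Figures 2-4 (N assumed to be 100)
N <- 100
N / c(25, 8, 1)  # 4, 12.5, and 100, versus p = 3 coefficients (intercept plus two slopes) for the linear fit
```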

Finally, unlike linear least squares, the sum-of-squared errors on the training set has no merit as a criterion for picking k in the nearest neighbor method, since we would always pick k = 1 as the best answer.  However, the resulting boundary for any low value of k is unstable, complex, and noisy.
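
A common way around this, sketched below under the same assumptions as the earlier snippets, is to pick k by error on held-out data rather than on the training set; this illustrates the general idea and is not a procedure described in the article.

```r
# Hypothetical hold-out validation for choosing k (not the article's procedure)
set.seed(2)
val <- data.frame(x1 = c(rnorm(25, -1, 1), rnorm(25, 1, 1)),  # fresh points drawn from the same
                  x2 = c(rnorm(25, -1, 1), rnorm(25, 1, 1)),  # assumed clusters as before
                  y  = rep(c(0, 1), each = 25))

val_error <- function(k) {
  yhat <- apply(val[, c("x1", "x2")], 1, knn_average, train = train, k = k)
  mean(ifelse(yhat > 0.5, 1, 0) != val$y)
}

ks   <- c(1, 3, 5, 8, 15, 25, 51)
errs <- sapply(ks, val_error)
ks[which.min(errs)]  # the k with the lowest held-out error, which is usually not k = 1
```

Cross-validation is the more common refinement of this idea, averaging the held-out error over several splits of the training data.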
