The linear model is one of the most widely used and most important data science tools. Another basic tool is the nearest neighbor (NN) method. Both models can be used to go beyond prediction to classification. Feature classes are used by machines to recognize faces within a crowd, to “read” road signs by distinguishing one letter from another, and to set voter registration districts by separating population groups. This article applies and compares both classification methods.

**Introduction by Comparison**


A side-by-side comparison of the two methods reveals their nature:

- The linear model fit by least squares makes strict assumptions about the structure of the model and the underlying data. Linear regression is solved using a closed-form matrix equation. The resulting fit is typically stable, but predictions tend to be inflexible in shape and possibly inaccurate;
- The nearest neighbor method makes only limited assumptions about the model and data. The method is solved using an iterative search. The resulting fit is very flexible in shape, and predictions are usually accurate but prone to instability.

**Least-Squares Classification**


Figure 1 shows a scatter plot of training data on the input pair $X_1$ and $X_2$, each of which has 100 points. The random data can be used to define the output class variable $G$, which sets the color of the background grid with one of two class values.

To determine the class of the background grid space, a linear least-squares model was fit to the training data to define the response variable $Y$. First, the input values of $G$ are coded as $Y = -1$ for data points in one class and $Y = +1$ for data points in the other. Second, the fitted values $\hat{Y}$ are then used to define the class variable $\hat{G}$ given the rule:

$$\hat{G} = \begin{cases} +1 & \text{if } \hat{Y} > 0 \\ -1 & \text{if } \hat{Y} \le 0 \end{cases} \qquad (1)$$

Hence, the two-dimensional classification problem in matrix notation is:

$$\hat{Y} = X^T \hat{\beta} \qquad (2)$$

with no intercept term and $Y$ equal to $-1$ or $+1$ depending on the class. The regression line is the decision boundary defined by $\{x : x^T \hat{\beta} = 0\}$. Finally, the background grid is shaded in Figure 1 given the classification defined by (1).
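As a concrete sketch of this procedure, the snippet below fits an intercept-free least-squares classifier on synthetic two-class data; the Gaussian cluster centers, the seed, and the 50-points-per-class sizes are assumptions, since the article's actual training set is not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic two-class training data: 50 points per class around two
# cluster centers (an assumption standing in for the data of Figure 1).
X_neg = rng.normal(loc=[-1.0, -1.0], scale=0.7, size=(50, 2))
X_pos = rng.normal(loc=[1.0, 1.0], scale=0.7, size=(50, 2))
X = np.vstack([X_neg, X_pos])
y = np.concatenate([-np.ones(50), np.ones(50)])  # classes coded -1 / +1

# Closed-form least-squares fit with no intercept term:
# beta_hat solves the normal equations (X^T X) beta = X^T y.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Classify by the sign of the fitted value; the decision boundary is
# the line where X @ beta_hat changes sign.
g_hat = np.where(X @ beta_hat > 0, 1, -1)
print("training error rate:", np.mean(g_hat != y))
```

Shading the background grid then amounts to evaluating the same sign rule at every grid point.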

It's clear that there are several misclassifications in the data on both sides of the linear decision boundary in Figure 1. As a result, the decision boundary is too inflexible. More likely, the optimal decision boundary is non-linear and may be disjoint in nature.

**$k$-Nearest Neighbor Method**

The nearest neighbor method uses the $k$ observations in the training set $\mathcal{T}$ closest to each point $x$ on the background space grid to form $\hat{Y}$. The $k$NN fit for $\hat{Y}$ is defined by:


$$\hat{Y}(x) = \frac{1}{k} \sum_{x_i \in N_k(x)} y_i \qquad (3)$$

where $N_k(x)$ is the neighborhood of $x$ containing the $k$ closest points $x_i$ in $\mathcal{T}$. Closeness implies a metric, which for simplicity is Euclidean distance. So, in words, we find the $k$ observations closest to the feature grid point $x$ in the input space, and average their responses.
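A minimal implementation of this fit takes only a few lines of NumPy; the helper name `knn_fit` and the toy data points are illustrative assumptions, not from the article:

```python
import numpy as np

def knn_fit(x, X_train, y_train, k):
    """Average the responses of the k training points nearest to x
    in Euclidean distance."""
    dists = np.linalg.norm(X_train - x, axis=1)  # distance to every training point
    neighborhood = np.argsort(dists)[:k]         # indices of the k closest points
    return y_train[neighborhood].mean()          # the local average of equation (3)

# Toy check: two clusters with classes coded -1 and +1 (hypothetical data).
X_train = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
y_train = np.array([-1, -1, 1, 1])
print(knn_fit(np.array([5.0, 4.9]), X_train, y_train, k=2))  # → 1.0
```

Because the query point sits next to the second cluster, its two nearest neighbors both carry class $+1$, so the local average is exactly $1.0$.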

In Figure 2, the same training data is used as in Figure 1, and the feature space is now classified using the average class of the 25 nearest neighbors, with equation (3) as the method of fitting. It is now obvious that the decision boundary separating the two class regions is far more irregular, and responds to local clusters and outliers where one class dominates.

In Figure 2, with $k = 25$, there are far fewer training observations misclassified than in Figure 1. This doesn't give much comfort, though, since the classification boundary is still prone to errors.

In Figure 3, with a smaller $k$, we now see fewer errors. The classification boundary also has disjoint islands that help separate the data.

In Figure 4, *none* of the training data are misclassified with $k = 1$. A little thought suggests that for $k$-nearest-neighbor fits, the error on the training data should be approximately an increasing function of $k$, and will always be 0 for $k = 1$.
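This can be checked directly: classify every training point by its own $k$ nearest neighbors (which always include the point itself) and count the errors. The data and the helper name below are illustrative assumptions:

```python
import numpy as np

def knn_training_error(X, y, k):
    """Fraction of training points misclassified by a k-NN sign rule
    (classes coded -1 / +1; each point is always its own nearest neighbor)."""
    errors = 0
    for i, x in enumerate(X):
        dists = np.linalg.norm(X - x, axis=1)
        neighborhood = np.argsort(dists)[:k]     # includes point i itself (distance 0)
        pred = 1 if y[neighborhood].mean() > 0 else -1
        errors += pred != y[i]
    return errors / len(X)

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 2))          # hypothetical training inputs
y = rng.choice([-1, 1], size=40)      # hypothetical -1/+1 labels

for k in (1, 5, 15):
    print(k, knn_training_error(X, y, k))
# k = 1 always gives a training error of 0: each point votes for itself.
```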

**Final Comparison and Takeaways**


The $k$-nearest-neighbor fits have a single parameter, the number of neighbors $k$. Linear least squares relies on $p$, the number of model parameters. The effective number of parameters for $k$-nearest neighbors is $N/k$. This is generally much bigger than $p$, and decreases with increasing $k$. To get an idea of why, note that if the neighborhoods were non-overlapping, there would be $N/k$ neighborhoods, and we would fit one parameter (a mean) in each neighborhood.
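As a quick numeric illustration (taking $N = 100$ training points as in Figure 1 and $p = 2$ for the intercept-free linear fit, both assumptions about the exact setup):

```python
# Effective number of parameters N/k of k-nearest neighbors, versus the
# p parameters of the linear fit (N = 100 and p = 2 are assumptions).
N, p = 100, 2
for k in (1, 25):
    print(f"k={k}: N/k = {N / k} effective parameters (linear fit: p = {p})")
```

Even at $k = 25$, the nearest-neighbor fit spends $100/25 = 4$ effective parameters, twice the two slopes of the linear model, and at $k = 1$ it spends all 100.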

Finally, unlike with linear least squares, the sum of squared errors on the training set has no merit as a criterion for picking $k$: it would always select $k = 1$, where the training error is zero. Moreover, the decision boundary for any solution with a low $k$ value is unstable, complex, and noisy.