class: center, middle

## IMSE 586
## Big Data Analytics and Visualization
### Classification
### Instructor: Fred Feng

---

# Classification

If the target has .red[two] classes --> .green[logistic regression]

- Does the patient have diabetes?
- Is the email spam?

--

What if the target has .red[more than two] classes?

- .green[Which blood type] does a person have, given the results of various diagnostic tests?
- .green[Which candidate] will a person vote for, given their demographic, social, and economic characteristics?
- Optical character recognition (OCR)

---

# Nearest neighbor

.center[]

---

# *k*-nearest neighbors (*k*-NN)

.center[]

---

# *k*-nearest neighbors (*k*-NN)

.center[]

---

# *k*-nearest neighbors (*k*-NN)

.center[]

---

# *k*-nearest neighbors (*k*-NN)

Where is the model exactly?

.center[]

--

It is a [non-parametric model](https://en.wikipedia.org/wiki/Nonparametric_statistics).

---

# *k*-nearest neighbors (*k*-NN)

About *k*

- *k* should be an odd number
- *k* should not be a multiple of the number of classes

--
*k* is a model parameter that we pre-specify before training the model. Such a parameter is referred to as a [hyperparameter](https://en.wikipedia.org/wiki/Hyperparameter_%28machine_learning%29) in machine learning.

---

# Measuring the distance (similarity)

With two features $X$ and $Y$
.center[]

$$\text{Euclidean distance}=\sqrt{(X_2-X_1)^2 + (Y_2-Y_1)^2}$$

---

# Manhattan distance

.center[]

---

# Manhattan distance

With two features $X$ and $Y$
.center[]

$$\text{Manhattan distance}=|X_2-X_1| + |Y_2-Y_1|$$

---

# Minkowski distance
More generally,

$$\text{Minkowski dist.}=\left[|X_2-X_1|^p + |Y_2-Y_1|^p\right]^{\frac{1}{p}}$$
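As a quick numeric check of the formula (a sketch using NumPy; the points here are made up for illustration):

```python
import numpy as np

def minkowski(a, b, p):
    """Minkowski distance of order p between points a and b."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float((np.abs(b - a) ** p).sum() ** (1 / p))

p1, p2 = (1, 2), (4, 6)
print(minkowski(p1, p2, p=1))  # 7.0 -> Manhattan: |4-1| + |6-2|
print(minkowski(p1, p2, p=2))  # 5.0 -> Euclidean: sqrt(3^2 + 4^2)
```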
$p=1 \rightarrow \text{Manhattan distance}$

$p=2 \rightarrow \text{Euclidean distance}$

---

# Beyond two dimensions

All the distance definitions can be naturally extended to .red[more than two] dimensions.
For example, with .green[three] features $X$, $Y$, and $Z$

$$
\begin{aligned}
&\text{Euclidean distance}= \\
&\sqrt{(X_2-X_1)^2 + (Y_2-Y_1)^2 + (Z_2-Z_1)^2}
\end{aligned}
$$
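---

# *k*-NN in scikit-learn

A minimal sketch of fitting a *k*-NN classifier (assuming scikit-learn is available; the toy data below is made up for illustration):

```python
from sklearn.neighbors import KNeighborsClassifier

# Toy training data: two features per point, two classes
X_train = [[1, 1], [2, 1], [1, 2],   # class 0
           [6, 5], [7, 6], [6, 7]]   # class 1
y_train = [0, 0, 0, 1, 1, 1]

# n_neighbors is the hyperparameter k we pre-specify;
# metric/p choose the distance (p=2 -> Euclidean, p=1 -> Manhattan)
knn = KNeighborsClassifier(n_neighbors=3, metric="minkowski", p=2)
knn.fit(X_train, y_train)

print(knn.predict([[2, 2], [6, 6]]))  # -> [0 1]
```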