class: center, middle

## IMSE 586
## Big Data Analytics and Visualization
### Classification
### Instructor: Fred Feng

---

# Classification

If the target has .red[two] classes --> .green[logistic regression]

- Does the patient have diabetes?
- Is the email spam?

--

What if the target has .red[more than two] classes?

- .green[Which blood type] does a person have, given the results of various diagnostic tests?
- .green[Which candidate] will a person vote for, given their demographic, social, and economic characteristics?
- Optical character recognition (OCR)

---

# Nearest neighbor

.center[]

---

# *k*-nearest neighbors (*k*-NN)

.center[]

---

# *k*-nearest neighbors (*k*-NN)

.center[]

---

# *k*-nearest neighbors (*k*-NN)

.center[]

---

# *k*-nearest neighbors (*k*-NN)

Where is the model exactly?

.center[]

--

It is a [non-parametric model](https://en.wikipedia.org/wiki/Nonparametric_statistics).

---

# *k*-nearest neighbors (*k*-NN)

About *k*

- *k* should be an odd number
- *k* should not be a multiple of the number of classes

--
*k* is a model parameter that we pre-specify before training the model. Such a parameter is referred to as a [hyperparameter](https://en.wikipedia.org/wiki/Hyperparameter_%28machine_learning%29) in machine learning.

---

# Measuring the distance (similarity)

With two features $X$ and $Y$
.center[]

$$\text{Euclidean distance}=\sqrt{(X_2-X_1)^2 + (Y_2-Y_1)^2}$$

---

# Manhattan distance

.center[]

---

# Manhattan distance

With two features $X$ and $Y$
.center[]

$$\text{Manhattan distance}=|X_2-X_1| + |Y_2-Y_1|$$

---

# Minkowski distance
More generally,

$$\text{Minkowski dist.}=\left[|X_2-X_1|^p + |Y_2-Y_1|^p\right]^{\frac{1}{p}}$$
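As a quick numeric check of the formula (a sketch using NumPy; the points here are made up for illustration):

```python
import numpy as np

def minkowski(a, b, p):
    """Minkowski distance of order p between points a and b."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float((np.abs(b - a) ** p).sum() ** (1 / p))

p1, p2 = (1, 2), (4, 6)
print(minkowski(p1, p2, p=1))  # 7.0 -> Manhattan: |4-1| + |6-2|
print(minkowski(p1, p2, p=2))  # 5.0 -> Euclidean: sqrt(3^2 + 4^2)
```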
$p=1 \rightarrow \text{Manhattan distance}$

$p=2 \rightarrow \text{Euclidean distance}$

---

# Beyond two dimensions

All the distance definitions can be naturally extended to .red[more than two] dimensions.
For example, with .green[three] features $X$, $Y$, and $Z$

$$
\begin{aligned}
&\text{Euclidean distance}= \\
&\sqrt{(X_2-X_1)^2 + (Y_2-Y_1)^2 + (Z_2-Z_1)^2}
\end{aligned}
$$
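---

# *k*-NN in scikit-learn

A minimal sketch of fitting a *k*-NN classifier (assuming scikit-learn is available; the toy data below is made up for illustration):

```python
from sklearn.neighbors import KNeighborsClassifier

# Toy training data: two features per point, two classes
X_train = [[1, 1], [2, 1], [1, 2],   # class 0
           [6, 5], [7, 6], [6, 7]]   # class 1
y_train = [0, 0, 0, 1, 1, 1]

# n_neighbors is the hyperparameter k we pre-specify;
# metric/p choose the distance (p=2 -> Euclidean, p=1 -> Manhattan)
knn = KNeighborsClassifier(n_neighbors=3, metric="minkowski", p=2)
knn.fit(X_train, y_train)

print(knn.predict([[2, 2], [6, 6]]))  # -> [0 1]
```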