class: center, middle

## IMSE 586
## Big Data Analytics and Visualization
### Clustering
### Instructor: Fred Feng

---

# Clustering

.center[]

How can we group animals into categories based on their characteristics?

---

# Clustering

.center[]

How can we group animals into categories based on their characteristics?

---

# What is clustering?

Group objects (e.g., animals, customers) based on their characteristics,

--

so that the objects

- .red[within each group] are quite .red[similar] to each other

--

- .green[in different groups] are quite .green[different] from each other

--

Primarily an exploratory technique to discover hidden structure in the data.

---

# Clustering example

.center[]

---

# Why clustering?

Taxonomy description

- Identify natural groups within the data.
- Often used as a prelude to more focused analysis or decision processes (e.g., classification)

--
Data simplification

- The ability to analyze groups of similar observations instead of all individual observations.

---

# Examples of clustering applications

- Market segmentation
    - Identify subgroups of people who might be more receptive to a particular form of advertising

--

- Recommender systems
    - Videos, movies, books, e-commerce, etc.

--

- Credit card fraud detection

--

- Medical image segmentation

---

# .red[Classification] vs. .green[clustering]

.red[Classification]

- Classify objects into .blue[pre-defined] groups.
    - What digit (0-9) or letter (A-Z) is in the image?

--

- Need training data to build the model

--

.green[Clustering]

- Find natural groups in data that are .blue[not pre-defined].
- Often used as a lead-in to classification.
- Do NOT need training data

---

# .red[Supervised] vs. .green[unsupervised] learning

.red[Supervised] learning

- The model is built using labeled data (training data).
- Examples: regression, classification, neural nets

--
.green[Unsupervised] learning

- No training data
- The model is built based on commonalities in the data.
- Examples: clustering, anomaly detection, principal component analysis

---

# Selecting cluster features

- Include only features that characterize the objects being clustered

--
- Irrelevant features cannot be excluded from the analysis once it begins; every included feature affects the distance calculations.

--
- Selecting features requires domain knowledge.

---

# Measuring the distance (similarity)

.center[]

$$
\begin{aligned}
\text{Euclidean distance}&=\sqrt{(X_2-X_1)^2 + (Y_2-Y_1)^2} \\\
\text{Manhattan distance}&=|X_2-X_1| + |Y_2-Y_1| \\\
\end{aligned}
$$

--

$$\text{Minkowski distance}=[|X_2-X_1|^p + |Y_2-Y_1|^p]^{\frac{1}{p}}$$
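A quick numeric check of these formulas, using SciPy's `scipy.spatial.distance` helpers (the two points here are made up for illustration; note that the Minkowski distance reduces to Manhattan at *p* = 1 and to Euclidean at *p* = 2):

```python
import numpy as np
from scipy.spatial import distance

p1 = np.array([1.0, 2.0])   # (X1, Y1)
p2 = np.array([4.0, 6.0])   # (X2, Y2)

print(distance.euclidean(p1, p2))       # sqrt(3^2 + 4^2) = 5.0
print(distance.cityblock(p1, p2))       # |3| + |4| = 7.0 (Manhattan)
print(distance.minkowski(p1, p2, p=3))  # (3^3 + 4^3)^(1/3) ≈ 4.498
```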
---

# Standardizing features

Feature scales can make a big difference in groupings.

Features of varying scales should be standardized before clustering.

The standard score of a sample *x* is calculated as

$$z=\frac{x-u}{s}$$

where *u* is the mean of the samples, and *s* is the standard deviation of the samples.

---

# Standardizing features

| Object | Purchase probability | Viewing time in .red[minutes] | Viewing time in .green[seconds] |
|:---:|:---:|:---:|:---:|
| A | 60 | 3.0 | 180 |
| B | 65 | 3.5 | 210 |
| C | 63 | 4.0 | 240 |

---

# (Euclidean) distances of the pairs

When viewing time is measured in .red[minutes]

| Object pair | Value | Rank |
|:---:|:---:|:---:|
| A-B | 5.025 | 3 |
| A-C | 3.162 | 2 |
| B-C | 2.062 | 1 |

--

When viewing time is measured in .green[seconds]

| Object pair | Value | Rank |
|:---:|:---:|:---:|
| A-B | 30.41 | 2 |
| A-C | 60.07 | 3 |
| B-C | 30.06 | 1 |
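---

The rank flip above, and the standardized distances on the next slide, can be reproduced in a few lines (a sketch with pandas and SciPy; note that the z-scores on the next slide use the sample standard deviation, which is pandas' default):

```python
import pandas as pd
from scipy.spatial.distance import pdist

df = pd.DataFrame({"purchase_prob": [60, 65, 63],
                   "viewing_min": [3.0, 3.5, 4.0]},
                  index=["A", "B", "C"])

# Same data, but with viewing time converted to seconds
df_sec = df.assign(viewing_sec=df["viewing_min"] * 60)[["purchase_prob", "viewing_sec"]]

# Euclidean distances for the pairs A-B, A-C, B-C
print(pdist(df.to_numpy()))      # viewing time in minutes
print(pdist(df_sec.to_numpy()))  # viewing time in seconds

# Standardize each feature (pandas' std() uses the sample standard deviation)
z = (df - df.mean()) / df.std()
print(pdist(z.to_numpy()))       # B-C is now clearly the closest pair
```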
---

# After the standardization

| Object | Purchase probability | Viewing time |
|:---:|:---:|:---:|
| A | -1.06 | -1.0 |
| B | 0.93 | 0.0 |
| C | 0.13 | 1.0 |

--

(Euclidean) distance using the standardized values

| Object pair | Value | Rank |
|:---:|:---:|:---:|
| A-B | 2.22 | 2 |
| A-C | 2.33 | 3 |
| B-C | 1.28 | 1 |

---

# The clustering problem

$$
\begin{aligned}
n \text{ objects: }& x_1, x_2, \cdots, x_n \\\
\\\
k \text{ clusters: }& C=\\{C_1, C_2, \cdots, C_k\\}
\end{aligned}
$$

Objective: minimize the .red[within-cluster sum-of-squares] (often termed .red[inertia]).

$$\min\_C \sum\_{i=1}^k \sum_{x \in C\_i}||x-\mu_i||^2$$

$$\text{where $\mu_i$ is the centroid of cluster $C_i$.}$$

---

$$\min\_C \sum\_{i=1}^k \sum_{x \in C\_i}||x-\mu_i||^2$$

.center[]

---

$$
\begin{aligned}
n \text{ objects: }& x_1, x_2, \cdots, x_n \\\
\\\
k \text{ clusters: }& C=\\{C_1, C_2, \cdots, C_k\\}
\end{aligned}
$$

$$\min\_C \sum\_{i=1}^k \sum_{x \in C\_i}||x-\mu_i||^2$$

- This is a difficult optimization problem. Why?

--

Ways to assign *n* objects to *k* clusters:

$$k^n$$

--

- Fortunately, a simple algorithm can be used to find a local optimum of the problem.

---

*k*-means algorithm demonstration:

At the start, each data point is not assigned to a cluster yet.

.center[]

---

Step 1. We start by randomly selecting *k* objects as the initial centroids (often called "seeds").

.center[]

---

Step 2. Each object is assigned to the cluster whose centroid is closest to it.

.center[]

---

Step 2. Each object is assigned to the cluster whose centroid is closest to it.

.center[]

---

Step 3. We update the cluster centroids using the mean of the objects in each cluster.

.center[]

---

Step 3. We update the cluster centroids using the mean of the objects in each cluster.

.center[]

---

Step 4. We repeat the earlier step, re-assigning each object to the cluster whose (updated) centroid is closest.

.center[]

---

Step 4. We repeat the earlier step, re-assigning each object to the cluster whose (updated) centroid is closest.

.center[]

---

Step 5. After the re-assignments, we update the cluster centroids again.

.center[]

---

Step 5. After the re-assignments, we update the cluster centroids again.

.center[]

---

Step 6. We repeat the process, and the cluster assignments no longer change. We have reached convergence.

.center[]

---

# *k*-means clustering

To group *n* objects into *k* clusters,

1. Randomly assign *k* cluster centroids (i.e., seeds).
2. Iterate until the cluster assignments stop changing:
    - For each object, assign it to the cluster whose centroid is closest.
    - For each cluster, compute the cluster centroid.

A minimal code sketch of this loop follows.
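---

A minimal NumPy sketch of the loop on the previous slide (illustrative only; it does not handle empty clusters, and in practice you would use scikit-learn's `KMeans`):

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    """Plain k-means: X is an (n, d) array; returns labels and centroids."""
    rng = np.random.default_rng(seed)
    # Step 1: randomly pick k objects as the initial centroids (seeds)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        # Step 2: assign each object to the cluster whose centroid is closest
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):  # assignments stopped changing
            break
        labels = new_labels
        # Step 3: update each centroid to the mean of the objects in its cluster
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centroids

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc, 0.5, size=(50, 2)) for loc in (0, 4, 8)])
labels, centroids = k_means(X, k=3)
print(centroids)   # should land near (0, 0), (4, 4), and (8, 8)
```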
---

# Limitations to the *k*-means method

- *k*-means is an example of a [greedy algorithm](https://en.wikipedia.org/wiki/Greedy_algorithm).
    - It makes locally optimal choices at each step in the hope of reaching the globally optimal result.
    - The global optimum may not be reached.

--

- The final result may be sensitive to the initial seed assignments.
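---

A quick way to see the seed sensitivity in scikit-learn: force a single run per random seed (a sketch; `init="random"` mimics the plain random seeding above, and the inertia values you get will vary):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# One k-means run per seed; some seeds may converge to a worse local optimum
for seed in range(5):
    km = KMeans(n_clusters=4, init="random", n_init=1, random_state=seed).fit(X)
    print(seed, round(km.inertia_, 1))
```

The `n_init` parameter automates the multi-start remedy described on the upcoming slide.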
---

What if we use some different initial centroids?

.center[]

---

Assign each object to the cluster whose centroid is closest.

.center[]

---

Assign each object to the cluster whose centroid is closest.

.center[]

---

Update the centroids, and we are stuck...

.center[]

---

# Remedy

In practice, we typically run the algorithm multiple times (e.g., 10) with different initial seed assignments.

The final result is the best output of the runs in terms of minimizing the inertia.

$$\min\_C \sum\_{i=1}^k \sum_{x \in C\_i}||x-\mu_i||^2$$

---

# The number of clusters *k*

*k* is a hyperparameter that must be pre-specified.

How to determine the best value of *k*?

.center[]

Select *k* at the ".red[elbow point]", where the "mountain" ends and the "rubble" begins.

---

# Silhouette analysis

For each object, calculate

- the mean .red[intra]-cluster distance *a*
- the mean .green[nearest]-cluster distance *b*

.center[]

---

For each object, calculate

- the mean .red[intra]-cluster distance *a*
- the mean .green[nearest]-cluster distance *b*

*b - a* measures how similar an object is to its own cluster compared to other clusters. We divide it by max(*a*, *b*) to scale it to between -1 and 1.

$$\text{Silhouette coefficient}=\frac{b-a}{\max(a, b)}$$

Metric: the mean silhouette coefficient over all objects.

We pick the *k* that gives the highest value of this metric.

???

The best score is 1, and the worst score is -1.

---

# Silhouette analysis

.center[]
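---

Both heuristics, the elbow plot and the mean silhouette coefficient, take only a short loop in scikit-learn (a sketch on synthetic blobs; `silhouette_score` returns the mean silhouette coefficient over all objects):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

ks = range(2, 10)
inertias, sil_scores = [], []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)
    sil_scores.append(silhouette_score(X, km.labels_))

plt.plot(ks, inertias, marker="o")   # look for the elbow point
plt.xlabel("k")
plt.ylabel("inertia")
plt.show()

best_score, best_k = max(zip(sil_scores, ks))
print(best_k, best_score)            # k with the highest mean silhouette
```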
---

# The number of clusters *k*

*k* may be constrained by downstream applications.

--

- Customer segmentation for a promotional campaign

--

- How many clothing sizes to offer?

--

.center[]

---

*k*-means partitions the space into [convex sets](https://en.wikipedia.org/wiki/Convex_set). It does not work well with non-convex cluster shapes.

.center[ ]

--

.center[Spectral clustering]

.center[]

---

# Spatial clustering

- Density-Based Spatial Clustering of Applications with Noise (DBSCAN)

--

.center[]

.center[.gray[[image source](https://commons.wikimedia.org/wiki/File:DBSCAN-Illustration.svg)]]

---

# Key parameters of DBSCAN

.center[]

- `eps`: The maximum distance between two samples for them to be considered as in the same neighborhood.
- `min_samples`: The number of samples in a neighborhood for a point to be considered as a core point.

---

class: middle, center

# [Visualizing DBSCAN Clustering](https://www.naftaliharris.com/blog/visualizing-dbscan-clustering/)
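---

A minimal DBSCAN run in scikit-learn (a sketch on the two-moons data, a non-convex shape where *k*-means struggles; the `eps` and `min_samples` values here are illustrative and usually need tuning):

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
X = StandardScaler().fit_transform(X)   # DBSCAN is distance-based, so scale first

db = DBSCAN(eps=0.3, min_samples=5).fit(X)
print(set(db.labels_))   # cluster labels; -1 marks noise points
```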
---

# Further resources

- [An Introduction to Statistical Learning: with Applications in Python](https://www.statlearning.com) (Chapter 12.4)
- [Clustering (Wikipedia)](https://en.wikipedia.org/wiki/Cluster_analysis)
- [Clustering in `scikit-learn`](https://scikit-learn.org/stable/modules/clustering.html)