class: center, middle

## IMSE 586
## Big Data Analytics and Visualization

### Clustering

### Instructor: Fred Feng

---

# Clustering

.center[![:scale 100%](images/clustering_animals.png)]

How can we group animals into categories based on their characteristics?

---

# Clustering

.center[![:scale 100%](images/clustering_animals_2.png)]

How can we group animals into categories based on their characteristics?

---

# What is clustering?

Group objects (e.g., animals, customers) based on their characteristics,

--

so that the objects

- .red[within each group] are quite .red[similar] to each other

--

- .green[in different groups] are quite .green[different] from each other

--

Primarily an exploratory technique to discover hidden structures of the data.

---

# Clustering example

.center[![:scale 70%](images/cluster_example.png)]

---

# Why clustering?

Taxonomy description

- Identify natural groups within the data.
- Often used as a prelude to more focused analysis or decision processes (e.g., classification)

--

Data simplification

- The ability to analyze groups of similar observations instead of all individual observations.

---

# Examples of clustering applications

- Market segmentation
  - Identify subgroups of people who might be more receptive to a particular form of advertising

--

- Recommender systems
  - Videos, movies, books, e-commerce, etc.

--

- Credit card fraud detection

--

- Medical image segmentation

---

# .red[Classification] vs. .green[clustering]

.red[Classification]

- Classify objects into .blue[pre-defined] groups.
  - What digit (0-9) or letter (A-Z) is in the image?

--

- Need training data to build the model

--

.green[Clustering]

- Find natural groups in data that are .blue[not pre-defined].
- Often used as a lead-in to classification.
- Do NOT need training data

---

# .red[Supervised] vs. .green[unsupervised] learning

.red[Supervised] learning

- The model is built using labeled data (training data).
- Examples: regression, classification, neural nets

--

.green[Unsupervised] learning

- No training data
- The model is built based on commonalities in data.
- Examples: clustering, anomaly detection, principal component analysis

---

# Selecting cluster features

- Include only features that characterize the objects being clustered

--

- Irrelevant features cannot be excluded from the analysis once it begins.

--

- Selecting features requires domain knowledge.

---

# Measuring the distance (similarity)

.center[![:scale 70%](images/dist.png)]

$$
\begin{aligned}
\text{Euclidean distance}&=\sqrt{(X_2-X_1)^2 + (Y_2-Y_1)^2} \\\
\text{Manhattan distance}&=|X_2-X_1| + |Y_2-Y_1| \\\
\end{aligned}
$$

--

$$\text{Minkowski dist.}=[(|X_2-X_1|)^p + (|Y_2-Y_1|)^p]^{\frac{1}{p}}$$

---

# Standardizing features

Feature scales can make a big difference in groupings.

Features of varying scales should be standardized before clustering.

The standard score of a sample *x* is calculated as

$$z=\frac{x-u}{s}$$

where *u* is the mean of the samples, and *s* is the standard deviation of the samples.
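
---

# Distances and standardization in code

A minimal sketch (not from the original slides) of the formulas above, using `scipy` for the distances and `scikit-learn` for the z-scores. The two points are made-up values, and the small array `X` mirrors the purchase-probability / viewing-time (minutes) example on the next slide.

```python
import numpy as np
from scipy.spatial.distance import euclidean, cityblock, minkowski
from sklearn.preprocessing import StandardScaler

p1, p2 = np.array([1.0, 2.0]), np.array([4.0, 6.0])
euclidean(p1, p2)         # sqrt((4-1)^2 + (6-2)^2) = 5.0
cityblock(p1, p2)         # |4-1| + |6-2| = 7.0 (Manhattan distance)
minkowski(p1, p2, p=3)    # general Minkowski distance with p = 3

# z = (x - u) / s for each column; note that StandardScaler uses the
# population standard deviation (ddof=0), not the sample one
X = np.array([[60, 3.0], [65, 3.5], [63, 4.0]])
X_scaled = StandardScaler().fit_transform(X)
```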

---

# Standardizing features

| Object | Purchase<br>probability | Viewing time<br>in .red[minutes] | Viewing time<br>in .green[seconds] |
|:---:|:---:|:---:|:---:|
|A|60|3.0|180|
|B|65|3.5|210|
|C|63|4.0|240|

---

# (Euclidean) distances of the pairs

When viewing time is measured in .red[minutes]

| Object pair | Value | Rank |
|:---:|:---:|:---:|
|A-B|5.025|3|
|A-C|3.162|2|
|B-C|2.062|1|

--

When viewing time is measured in .green[seconds]

| Object pair | Value | Rank |
|:---:|:---:|:---:|
|A-B|30.41|2|
|A-C|60.07|3|
|B-C|30.06|1|
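
---

# The unit effect in code

A minimal sketch (not from the original slides) that reproduces the comparison above with `scipy`: the same three objects, with viewing time in minutes versus in seconds.

```python
import numpy as np
from scipy.spatial.distance import pdist

# Columns: purchase probability, viewing time
X_minutes = np.array([[60, 3.0], [65, 3.5], [63, 4.0]])
X_seconds = np.array([[60, 180.0], [65, 210.0], [63, 240.0]])

# Pairwise Euclidean distances, in the order (A-B, A-C, B-C)
print(pdist(X_minutes))   # the ranking of the pairs changes
print(pdist(X_seconds))   # once the viewing-time unit changes
```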

---

# After the standardization

| Object | Purchase<br>probability | Viewing<br>time |
|:---:|:---:|:---:|
|A|-1.06|-1.0|
|B|0.93|0.0|
|C|0.13|1.0|

--

(Euclidean) distance using the standardized values

| Object pair | Value | Rank |
|:---:|:---:|:---:|
|A-B|2.22|2|
|A-C|2.33|3|
|B-C|1.28|1|

---

# The clustering problem

$$
\begin{aligned}
n \text{ objects: }& x_1, x_2, \cdots, x_n \\\
\\\
k \text{ clusters: }& C=\\{C_1, C_2, \cdots, C_k\\}
\end{aligned}
$$

Objective: minimize the .red[within-cluster sum-of-squares] (often termed .red[inertia]).

$$\min\_C \sum\_{i=1}^k \sum_{x \in C\_i}||x-\mu_i||^2$$

$$\text{where $\mu_i$ is the centroid of cluster $C_i$.}$$

---

$$\min\_C \sum\_{i=1}^k \sum_{x \in C\_i}||x-\mu_i||^2$$

.center[![:scale 65%](images/kmeans_inertia.png)]

---

$$
\begin{aligned}
n \text{ objects: }& x_1, x_2, \cdots, x_n \\\
\\\
k \text{ clusters: }& C=\\{C_1, C_2, \cdots, C_k\\}
\end{aligned}
$$

$$\min\_C \sum\_{i=1}^k \sum_{x \in C\_i}||x-\mu_i||^2$$

- This is a difficult optimization problem. Why?

--

Ways to assign *n* objects to *k* clusters:

$$k^n$$

--

- Fortunately, a simple algorithm can be used to provide a local optimum to the problem.

---

*k*-means algorithm demonstration:

At the start, each data point is not assigned to a cluster yet.

.center[![:scale 100%](images/kmeans_0.png)]

---

Step 1. We start by randomly selecting *k* objects as the initial centroids (often called "seeds").

.center[![:scale 100%](images/kmeans_1.png)]

---

Step 2. Each object is assigned to the cluster whose centroid is closest to this object.

.center[![:scale 100%](images/kmeans_1.png)]

---

Step 2. Each object is assigned to the cluster whose centroid is closest to this object.

.center[![:scale 100%](images/kmeans_2.png)]

---

Step 3. We update the cluster centroids using the mean of the objects in each cluster.

.center[![:scale 100%](images/kmeans_2.png)]

---

Step 3. We update the cluster centroids using the mean of the objects in each cluster.

.center[![:scale 100%](images/kmeans_3.png)]

---

Step 4. We repeat the earlier step of re-assigning each object to the cluster whose (updated) centroid is closest.

.center[![:scale 100%](images/kmeans_3.png)]

---

Step 4. We repeat the earlier step of re-assigning each object to the cluster whose (updated) centroid is closest.

.center[![:scale 100%](images/kmeans_4.png)]

---

Step 5. After the re-assignments, we repeat the earlier step of updating the cluster centroids again.

.center[![:scale 100%](images/kmeans_4.png)]

---

Step 5. After the re-assignments, we repeat the earlier step of updating the cluster centroids again.

.center[![:scale 100%](images/kmeans_5.png)]

---

Step 6. When we repeat the process, the cluster assignments no longer change. We have reached convergence.

.center[![:scale 100%](images/kmeans_5.png)]

---

# *k*-means clustering

To group *n* objects into *k* clusters,

1. Randomly assign *k* cluster centroids (i.e., seeds).
2. Iterate until the cluster assignments stop changing:
    - For each object, assign it to the cluster whose centroid is closest.
    - For each cluster, compute the cluster centroid.
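
---

# *k*-means in code

A bare-bones sketch (not from the original slides) of the two-step loop above in NumPy, assuming `X` is a NumPy array of shape (n, d). The function name and arguments are made up, and the empty-cluster corner case is ignored; in practice one would use `sklearn.cluster.KMeans`.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Randomly pick k objects as the initial centroids (seeds)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # 2a. Assign each object to the cluster with the closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 2b. Update each centroid to the mean of the objects in its cluster
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):  # centroids stopped moving
            break
        centroids = new_centroids
    return labels, centroids
```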

---

# Limitations of the *k*-means method

- *k*-means is an example of a [greedy algorithm](https://en.wikipedia.org/wiki/Greedy_algorithm).
  - Making locally optimal choices at each step with the hope of reaching the globally optimal result.
  - The global optimum may not be achieved.

--

- The final result may be sensitive to the initial seed assignments.

---

What if we use some different initial centroids?

.center[![:scale 100%](images/kmeans_stuck_1.png)]

---

Assign each object to the cluster whose centroid is closest.

.center[![:scale 100%](images/kmeans_stuck_2.png)]

---

Assign each object to the cluster whose centroid is closest.

.center[![:scale 100%](images/kmeans_stuck_3.png)]

---

Update the centroids, and we are stuck...

.center[![:scale 100%](images/kmeans_stuck_3.png)]

---

# Remedy

In practice, we typically run the algorithm multiple times (e.g., 10) with different initial seed assignments.

The final result is the best output of these runs in terms of minimizing the inertia.

$$\min\_C \sum\_{i=1}^k \sum_{x \in C\_i}||x-\mu_i||^2$$

---

# The number of clusters *k*

*k* is a hyperparameter that must be pre-specified.

How to determine the best value of *k*?

.center[![:scale 60%](images/elbow.png)]

Select *k* at the ".red[elbow point]", where the "mountain" ends and the "rubble" begins.

---

# Silhouette analysis

For each object, calculate

- mean .red[intra]-cluster distance *a*
- mean .green[nearest]-cluster distance *b*

.center[![:scale 65%](images/kmeans_4.png)]

---

For each object, calculate

- mean .red[intra]-cluster distance *a*
- mean .green[nearest]-cluster distance *b*

*b-a* measures how similar an object is to its own cluster compared to other clusters.

We divide it by max(*a*, *b*) to make it between -1 and 1.

$$\text{Silhouette coefficient}=\frac{b-a}{\max(a, b)}$$

Metric: the mean silhouette coefficient over all objects.

We pick a *k* that gives the highest value of this metric.

???

The best score is 1, and the worst score is −1.
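
---

# Choosing *k* in code

A minimal sketch (not from the original slides) that scans candidate values of *k* with `scikit-learn`, recording the inertia (for the elbow plot) and the mean silhouette coefficient; the data come from `make_blobs` purely for illustration.

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

inertias, silhouettes = [], []
for k in range(2, 11):
    # n_init=10: run k-means with 10 different seed assignments
    # and keep the run with the lowest inertia
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)                          # for the elbow plot
    silhouettes.append(silhouette_score(X, km.labels_))   # higher is better
```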

---

# Silhouette analysis

.center[![:scale 80%](images/silhouette.png)]

---

# The number of clusters *k*

*k* may be constrained by downstream applications.

--

- Customer segmentation for promotional campaign

--

- How many clothing sizes to offer?

--

.center[![:scale 100%](images/wetsuit.jpg)]

---

*k*-means partitions space into [convex sets](https://en.wikipedia.org/wiki/Convex_set).

It does not work well with non-convex shapes.

.center[![:scale 40%](images/kmeans_nonlinear_1.png) ![:scale 40%](images/kmeans_nonlinear_2.png)]

--

.center[Spectral clustering]

.center[![:scale 40%](images/kmeans_nonlinear_3.png)]

---

# Spatial clustering

- Density-Based Spatial Clustering of Applications with Noise (DBSCAN)

--

.center[![:scale 60%](images/dbscan.png)]

.center[.gray[[image source](https://commons.wikimedia.org/wiki/File:DBSCAN-Illustration.svg)]]

---

# Key parameters of DBSCAN

.center[![:scale 45%](images/dbscan.png)]

- `eps`: The max distance between two samples for them to be considered as in the same neighborhood.
- `min_samples`: The # of samples in a neighborhood for a point to be considered as a core point.

---

class: middle, center

# [Visualizing DBSCAN Clustering](https://www.naftaliharris.com/blog/visualizing-dbscan-clustering/)
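
---

# DBSCAN in code

A minimal sketch (not from the original slides) using `sklearn.cluster.DBSCAN` on a made-up non-convex dataset; the `eps` and `min_samples` values here are illustrative, not recommendations.

```python
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN

# Two interleaving half-moons: a non-convex shape where k-means struggles
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5)
labels = db.fit_predict(X)   # cluster labels; noise points are labeled -1
```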

---

# Further resources

- [An Introduction to Statistical Learning: with Applications in Python](https://www.statlearning.com) (Chapter 12.4)
- [Clustering (Wikipedia)](https://en.wikipedia.org/wiki/Cluster_analysis)
- [Clustering in `scikit-learn`](https://scikit-learn.org/stable/modules/clustering.html)