CPSC 330 Lecture 16: DBSCAN, Hierarchical Clustering

Happy Halloween

Announcements

  • HW5 (with the extension): was due yesterday.
  • HW6 is due next Wednesday.
    • Computationally intensive
    • You need to install many packages

Imports

iClicker Exercise 16.1

Select all of the following statements which are TRUE.

    1. Similar to K-nearest neighbours, K-Means is a non-parametric model.
    2. The meaning of K in K-nearest neighbours and K-Means clustering is very similar.
    3. Scaling of input features is crucial in clustering.
    4. In clustering, it’s almost always a good idea to find equal-sized clusters.

K-Means limitations

Shape of clusters

  • Good for spherical clusters of more or less equal sizes

K-Means: failure case 1

  • K-Means performs poorly if the clusters have more complex shapes (e.g., two moons data below).
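A minimal sketch of this failure case, assuming scikit-learn and matplotlib (the plotting code here is illustrative, not the course plotting helper):

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons

# Two interleaving half-moons: clusters that are not spherical
X_moons, _ = make_moons(n_samples=200, noise=0.08, random_state=42)

# K-Means with K=2 draws a straight boundary that cuts across both moons
moon_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X_moons)

plt.scatter(X_moons[:, 0], X_moons[:, 1], c=moon_labels)
plt.title("K-Means on two moons")
plt.show()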

K-Means: failure case 2

  • Again, K-Means is unable to capture complex cluster shapes.

K-Means: failure case 3

  • It assumes that all directions are equally important for each cluster and fails to identify non-spherical clusters.
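A hedged sketch of this failure case: stretching blob data with a linear map produces elongated, non-spherical clusters that K-Means splits poorly (the transformation matrix below is an illustrative choice):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Start with three roughly spherical blobs, then stretch them with a linear map
X_blobs, _ = make_blobs(n_samples=300, centers=3, random_state=42)
transformation = np.array([[0.6, -0.6], [-0.4, 0.8]])  # illustrative stretch
X_aniso = X_blobs @ transformation

# K-Means treats all directions as equally important,
# so the elongated clusters tend to get cut in the wrong places
aniso_labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X_aniso)

plt.scatter(X_aniso[:, 0], X_aniso[:, 1], c=aniso_labels)
plt.title("K-Means on anisotropic clusters")
plt.show()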

Can we do better?

DBSCAN

  • Density-Based Spatial Clustering of Applications with Noise
  • A density-based clustering algorithm
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two-moons data and a density-based clustering of it
X, y = make_moons(n_samples=200, noise=0.08, random_state=42)
dbscan = DBSCAN(eps=0.2)
dbscan.fit(X)
plot_original_clustered(X, dbscan, dbscan.labels_)  # course plotting helper

How does it work?

DBSCAN Analogy

Consider DBSCAN in a social context:

  • Social butterflies (🦋): Core points
  • Friends of social butterflies who are not social butterflies: Border points
  • Lone wolves (🐺): Noise points
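To connect the analogy to code: after fitting, scikit-learn's DBSCAN stores the indices of core points in core_sample_indices_ and labels noise points with -1. A minimal sketch, continuing from the dbscan object fit above:

import numpy as np

# Core points (social butterflies): indices stored by scikit-learn after fitting
core_mask = np.zeros(len(X), dtype=bool)
core_mask[dbscan.core_sample_indices_] = True

# Noise points (lone wolves): assigned the label -1
noise_mask = dbscan.labels_ == -1

# Border points: assigned to a cluster but not core
border_mask = ~core_mask & ~noise_mask

print("core:", core_mask.sum(), "border:", border_mask.sum(), "noise:", noise_mask.sum())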

Two main hyperparameters

  • eps: determines what it means for points to be “close”
  • min_samples: the number of points required within a point’s eps-neighbourhood for that point to be considered a core point (i.e., part of a dense region)
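A quick sketch of how these two hyperparameters change the clustering on the moons data X from above (the specific values swept here are arbitrary illustrations):

from sklearn.cluster import DBSCAN

for eps in [0.05, 0.2, 0.5]:
    for min_samples in [3, 10]:
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
        n_clusters = len(set(labels)) - (1 if -1 in labels else 0)  # ignore the noise label
        n_noise = (labels == -1).sum()
        print(f"eps={eps}, min_samples={min_samples}: {n_clusters} clusters, {n_noise} noise points")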

DBSCAN: failure cases

  • Let’s consider this dataset with three clusters of varying densities.
  • Here K-Means performs better than DBSCAN, but it has the advantage of being given the value of K in advance.
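A sketch of this comparison under assumed settings (three blobs with different cluster_std values; the exact spreads and eps are illustrative):

from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs

# Three clusters of varying densities
X_varied, _ = make_blobs(n_samples=300, centers=3, cluster_std=[1.0, 2.5, 0.5], random_state=42)

# K-Means is told K=3 in advance and recovers three groups
km_labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X_varied)

# A single eps cannot suit all densities at once: a value that works for the
# dense cluster tends to fragment the sparse one or mark it as noise
db_labels = DBSCAN(eps=1.0).fit_predict(X_varied)
print("DBSCAN clusters:", len(set(db_labels)) - (1 if -1 in db_labels else 0),
      "| noise points:", (db_labels == -1).sum())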

Hierarchical clustering

Dendrogram

  • A dendrogram is a tree-like plot.
  • On the x-axis we have data points.
  • On the y-axis we have distances between clusters.
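A minimal sketch of how such a dendrogram can be produced with SciPy (the dataset X and the "ward" method are assumptions here):

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

# Compute the hierarchical merge sequence (the linkage array)
linkage_array = linkage(X, method="ward")

# Leaves on the x-axis are data points; the y-axis shows
# the distance at which two clusters were merged
dendrogram(linkage_array)
plt.xlabel("data points")
plt.ylabel("cluster distance")
plt.show()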

Flat clusters

  • This is good but how can we get cluster labels from a dendrogram?
  • We can bring the clustering to a “flat” format using fcluster.

Flat clusters

from scipy.cluster.hierarchy import fcluster

# Flatten the dendrogram into at most 3 clusters
# (linkage_array is the linkage matrix computed above)
hier_labels1 = fcluster(linkage_array, 3, criterion="maxclust")
plot_dendrogram_clusters(X, linkage_array, hier_labels1, title="flattened with max_clusts=3")  # course plotting helper
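Besides asking for a maximum number of clusters, fcluster can also cut the dendrogram at a distance threshold; a hedged example (the threshold value of 5 is arbitrary):

# Flatten by cutting the dendrogram at a fixed cluster-distance threshold
hier_labels2 = fcluster(linkage_array, 5, criterion="distance")
print("number of flat clusters:", len(set(hier_labels2)))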

Linkage criteria

  • When we create a dendrogram, we need to calculate the distance between clusters. How do we measure distances between clusters?
  • The linkage criterion determines how the distance between two clusters is computed.
  • Some example linkage criteria are:
    • Single linkage: smallest minimum distance between points in the two clusters; leads to loose clusters
    • Complete linkage: smallest maximum distance between points in the two clusters; leads to tight clusters
    • Average linkage: smallest average distance between all pairs of points in the two clusters
    • Ward linkage: smallest increase in within-cluster variance; leads to roughly equally sized clusters
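A small sketch comparing the linkage criteria with scikit-learn's AgglomerativeClustering on the same data X (the choice of 3 clusters is an assumption):

import numpy as np
from sklearn.cluster import AgglomerativeClustering

for linkage_name in ["single", "complete", "average", "ward"]:
    agg = AgglomerativeClustering(n_clusters=3, linkage=linkage_name)
    agg_labels = agg.fit_predict(X)
    # Cluster sizes hint at the behaviour: single linkage often gives one big
    # loose cluster, while ward tends toward more balanced sizes
    print(f"{linkage_name:>8} linkage -> cluster sizes: {np.bincount(agg_labels)}")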

Activity

  • Fill in the table below in this Google doc: https://shorturl.at/3yOdg
Clustering Method | KMeans | DBSCAN | Hierarchical Clustering
------------------|--------|--------|------------------------
Approach          |        |        |
Hyperparameters   |        |        |
Shape of clusters |        |        |
Handling noise    |        |        |
Examples          |        |        |

Class demo