Density-Based Clustering Evaluation with Modified Silhouette Score
Modified Silhouette Score for HDBSCAN:
- The Silhouette Score works well with convex clusters. With a density-based clustering algorithm such as HDBSCAN, however, the traditional Silhouette Score performs poorly because of the irregular cluster shapes and the presence of noise points.
- To address this, a modified version of the Silhouette Score can be applied to HDBSCAN clusters, where the calculation is based on core points and the distance metric reflects the density-based clustering.
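- Core Points: Density-based clustering distinguishes core points (points in high-density regions), border points (points near a cluster boundary), and noise. This terminology comes from DBSCAN; HDBSCAN implementations typically return only a cluster label (or a noise label) and a membership strength per point, so which points count as core has to be decided separately. The modified Silhouette Score focuses on core points, as they represent the true structure of the cluster.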
- Density-based Distance Metric: Computed with a plain distance such as Euclidean, the Silhouette Score favors compact, roughly spherical clusters. For HDBSCAN, a density-aware distance is more appropriate, such as a distance between core points based on density connectivity, or the mutual reachability distance.
- Handling Noise Points: Noise points in HDBSCAN are not assigned to any cluster. The modified Silhouette Score typically ignores these points to avoid skewing the results.
Silhouette Score Formula Recap
For a given data point i:
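a(i): The average distance between point i and all other points in the same cluster (intra-cluster distance).
b(i): The average distance between point i and all points in the nearest neighboring cluster, i.e. the smallest such average over all other clusters (inter-cluster distance).
The Silhouette Score for point i is calculated as:
$$s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}$$
It ranges from -1 to 1; values near 1 mean the point is much closer to its own cluster than to the nearest other cluster.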
The overall score is the average Silhouette Score for all points.
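As a quick check of the recap above, the following minimal sketch computes the per-point and overall scores for a toy one-dimensional dataset, using the absolute difference as the distance. The data, the labels, and the helper name silhouette_samples_simple are illustrative assumptions, not part of any library.

```python
from statistics import mean

def silhouette_samples_simple(points, labels):
    """Per-point silhouette values for 1-D points, using |x - y| as distance."""
    scores = []
    for i, (x, lab) in enumerate(zip(points, labels)):
        # Distances from point i to the other members of its own cluster.
        same = [abs(x - y) for j, (y, l) in enumerate(zip(points, labels))
                if l == lab and j != i]
        if not same:
            # Common convention: a singleton cluster gets a score of 0.
            scores.append(0.0)
            continue
        a = mean(same)
        # b(i): smallest mean distance from point i to any other cluster.
        b = min(
            mean(abs(x - y) for y, l in zip(points, labels) if l == other)
            for other in set(labels) if other != lab
        )
        denom = max(a, b)
        # s(i) = (b - a) / max(a, b); guard against all-duplicate points.
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores

# Toy data (assumed for illustration): two well-separated groups.
points = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
labels = [0, 0, 0, 1, 1, 1]
scores = silhouette_samples_simple(points, labels)
overall = mean(scores)
print([round(s, 3) for s in scores], round(overall, 3))
```

For the point at 1.0, a = 1 and b = 10, so its score is 0.9; the edge points score slightly lower because they sit farther from the rest of their own cluster.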
Distance Metric Adaptation
Instead of the usual Euclidean or Manhattan distance, the mutual reachability distance (or another density-aware distance) between core points is used.
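The mutual reachability distance between two points i and j is defined as:
$$d_{\mathrm{mreach}}(i, j) = \max\{\epsilon(i), \epsilon(j), d(i, j)\}$$
where:
d(i, j) is the actual distance between points i and j,
ϵ(i) and ϵ(j) are the core distances of points i and j, that is, the distance from each point to its k-th nearest neighbor (k is usually tied to min_samples).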
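The mutual reachability distance can be sketched in a few lines of NumPy by taking the element-wise maximum of the two core distances and the raw distance. This is a minimal dense version under stated assumptions: Euclidean distance as the base metric, and a core distance defined as the distance to the k-th nearest neighbor not counting the point itself (libraries differ on that convention). The function name mutual_reachability and the parameter k are illustrative choices.

```python
import numpy as np

def mutual_reachability(X, k=5):
    """Dense mutual reachability matrix for the rows of X.

    Assumes a Euclidean base metric and requires k < number of points.
    Returns the (n, n) matrix and the vector of core distances.
    """
    X = np.asarray(X, dtype=float)
    # Pairwise Euclidean distances d(i, j), shape (n, n).
    diff = X[:, None, :] - X[None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=-1))
    # Core distance: distance to the k-th nearest neighbour.
    # Column 0 of each sorted row is the point itself (distance 0),
    # so column k is the k-th neighbour excluding the point.
    core = np.sort(d, axis=1)[:, k]
    # Element-wise max of eps(i), eps(j) and d(i, j).
    mreach = np.maximum(d, np.maximum(core[:, None], core[None, :]))
    return mreach, core

# Toy data (assumed): three nearby points and one isolated point.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [10.0, 10.0]])
mreach, core = mutual_reachability(X, k=1)
print(core)
print(mreach.round(2))
```

Distances inside a dense region are left mostly unchanged, while any distance involving a sparse point is pushed up to at least that point's core distance. Note that the diagonal of this matrix holds each point's core distance, not zero.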
Python code for modified Silhouette Score
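What follows is a minimal sketch of the idea, not a reference implementation. It assumes that labels come from an HDBSCAN run with noise marked as -1 (the convention used by the hdbscan package and by scikit-learn), that the base metric is Euclidean, and that k stands in for min_samples when computing core distances. The names modified_silhouette and core_mask are hypothetical; core_mask is an optional boolean array for restricting the score to points you consider core. Everything is computed on a dense matrix, so it only suits small datasets.

```python
import numpy as np

def mutual_reachability(X, k):
    """Mutual reachability matrix: max(core(i), core(j), d(i, j))."""
    X = np.asarray(X, dtype=float)
    diff = X[:, None, :] - X[None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=-1))   # Euclidean d(i, j)
    core = np.sort(d, axis=1)[:, k]         # distance to k-th neighbour
    return np.maximum(d, np.maximum(core[:, None], core[None, :]))

def modified_silhouette(X, labels, k=5, core_mask=None):
    """Mean silhouette over non-noise points using mutual reachability."""
    labels = np.asarray(labels)
    # Core distances use every point, noise included, so the density
    # estimate reflects the full dataset.
    D = mutual_reachability(X, k)
    keep = labels != -1                     # drop noise points
    if core_mask is not None:               # optionally keep core points only
        keep &= np.asarray(core_mask, dtype=bool)
    D = D[np.ix_(keep, keep)]
    np.fill_diagonal(D, 0.0)                # a point is at distance 0 from itself
    labels = labels[keep]
    clusters = np.unique(labels)
    if clusters.size < 2:
        raise ValueError("at least two clusters are needed after filtering")
    n = labels.size
    a = np.zeros(n)
    b = np.full(n, np.inf)
    singleton = np.zeros(n, dtype=bool)
    for c in clusters:
        members = labels == c
        size = members.sum()
        sums = D[:, members].sum(axis=1)    # total distance to cluster c
        if size > 1:
            a[members] = sums[members] / (size - 1)
        else:
            singleton[members] = True
        others = ~members
        b[others] = np.minimum(b[others], sums[others] / size)
    denom = np.maximum(a, b)
    denom[denom == 0] = 1.0                 # avoid 0/0 for duplicate points
    s = (b - a) / denom
    s[singleton] = 0.0                      # convention for singleton clusters
    return float(s.mean())

# Synthetic data (assumed): two blobs plus one hand-labelled noise point.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 0.3, size=(20, 2)),
    rng.normal(5.0, 0.3, size=(20, 2)),
    [[2.5, 10.0]],
])
labels = np.array([0] * 20 + [1] * 20 + [-1])
score = modified_silhouette(X, labels, k=5)
print(round(score, 3))
```

In practice the labels would come from the clusterer's output instead of being written by hand. The diagonal is reset to zero because the mutual reachability of a point with itself equals its core distance, which would otherwise inflate a(i).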
Pros of Modified Silhouette Score:
- Works for Non-Convex Clusters: Unlike the traditional Silhouette Score, it handles non-convex and irregularly shaped clusters better.
- Robust to Noise: Since noise points are excluded from the score, it reflects the quality of the true clusters.
- Density-Aware: It uses a distance metric that follows the density structure, so the score is measured on the same notion of closeness that HDBSCAN used to form the clusters.
Cons of Modified Silhouette Score:
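- Computational Complexity: Density-based distances cost more than plain Euclidean distances. Core distances require a k-nearest-neighbor search, and a dense pairwise distance matrix takes time and memory that grow quadratically with the number of points.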
- Implementation Complexity: Requires adapting the Silhouette Score algorithm to work with density metrics and core points.
- Parameter Sensitivity: The score can still be sensitive to parameters like min_samples and min_cluster_size in HDBSCAN.
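To put the computational cost in concrete terms, this small sketch estimates the memory needed for a dense pairwise distance matrix, assuming 8-byte float64 entries. The helper name dense_matrix_gib is illustrative.

```python
def dense_matrix_gib(n_points, bytes_per_value=8):
    """Memory for an n x n matrix of distances, in GiB (float64 by default)."""
    return n_points ** 2 * bytes_per_value / 2 ** 30

for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} points -> {dense_matrix_gib(n):.3f} GiB")
```

At 100,000 points the dense matrix already needs roughly 75 GiB, so larger datasets call for subsampling or a neighbor-based approximation.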