Density-Based Clustering Evaluation with Modified Silhouette Score

 Modified Silhouette Score for HDBSCAN:

  • The Silhouette Score works well with convex clusters. With density-based clustering algorithms such as HDBSCAN, however, the traditional Silhouette Score performs poorly because of irregular cluster shapes and the presence of noise points.
  • To address this, a modified version of the Silhouette Score can be applied to HDBSCAN clusters: the calculation is restricted to core points, and the distance metric reflects the density-based nature of the clustering.
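  • Core Points: In DBSCAN terminology, points are core (high-density points), border (points near the cluster boundary), or noise. HDBSCAN labels noise explicitly and reports a membership strength for each clustered point rather than a core/border flag, so high-membership points can stand in for core points. The modified Silhouette Score focuses on these points, as they represent the true structure of the cluster.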
  • Density-based Distance Metric: Used inside the Silhouette Score, traditional distance metrics (like Euclidean distance) favor compact, roughly spherical clusters. For HDBSCAN, a density-aware distance is more appropriate, such as the mutual reachability distance or another distance between core points based on density connectivity.
  • Handling Noise Points: Noise points in HDBSCAN are not assigned to any cluster. The modified Silhouette Score typically ignores these points to avoid skewing the results.

Silhouette Score Formula Recap 
 
For a given data point i:
  • a(i): The average distance between point i and all other points in the same cluster (intra-cluster distance).

  • b(i): The average distance between point i and all points in the nearest neighboring cluster (inter-cluster distance).

  • The Silhouette Score for point i is calculated as:

    $$s(i) = \frac{b(i) - a(i)}{\max(a(i),\, b(i))}$$

  • The overall score is the average of s(i) over all points. It ranges from -1 to 1, and values near 1 indicate compact, well-separated clusters.
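
As a small worked example, the sketch below computes s(i) = (b(i) - a(i)) / max(a(i), b(i)) by hand for a single point. The toy 1-D data and the helper name `silhouette_of_point` are assumptions made for illustration; they do not come from any library.

```python
from statistics import mean


def silhouette_of_point(i, points, labels):
    """Silhouette value s(i) for one point, using |x - y| as the distance."""
    own = labels[i]
    # a(i): mean distance to the other members of the point's own cluster
    same = [abs(points[i] - p) for j, p in enumerate(points)
            if labels[j] == own and j != i]
    if not same:
        return 0.0  # common convention: a singleton cluster gets s(i) = 0
    a = mean(same)
    # b(i): smallest mean distance to the members of any other cluster
    b = min(
        mean(abs(points[i] - p) for j, p in enumerate(points) if labels[j] == other)
        for other in set(labels) if other != own
    )
    return (b - a) / max(a, b)


# Hypothetical toy data: two tight groups that are far apart
points = [0.0, 1.0, 2.0, 10.0, 11.0]
labels = [0, 0, 0, 1, 1]

s0 = silhouette_of_point(0, points, labels)
print(round(s0, 3))
```

For the first point, a(i) is the mean of 1 and 2 and b(i) is the mean of 10 and 11, so the value is close to 1, as expected for a point sitting well inside its own cluster.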



Distance Metric Adaptation

Instead of using the typical Euclidean or Manhattan distance, the mutual reachability distance (or another density-aware distance) between core points is used.
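The mutual reachability distance between two points i and j is defined as:

$$d_{\mathrm{mreach}}(i, j) = \max\bigl(\epsilon(i),\ \epsilon(j),\ d(i, j)\bigr)$$

where:
            d(i,j) is the actual distance between points i and j,
            ϵ(i) and ϵ(j) are the core distances of points i and j, that is, the distance from each point to its k-th nearest neighbor (k is set by min_samples).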
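
The following sketch shows one way to compute the mutual reachability distance, max(ϵ(i), ϵ(j), d(i,j)), for every pair of points with NumPy. The function name `mutual_reachability`, the parameter `k`, and the toy data are assumptions for illustration. Implementations also differ on whether the k-th neighbor counts the point itself; this sketch excludes it.

```python
import numpy as np


def mutual_reachability(X, k):
    """Pairwise mutual reachability distances and core distances for the rows of X.

    k plays the role of HDBSCAN's min_samples (an assumption of this sketch).
    """
    X = np.asarray(X, dtype=float)
    # d(i, j): plain Euclidean distance between every pair of points
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Core distance: distance to the k-th nearest neighbor. Each sorted row
    # starts with the point itself (distance 0), so column k skips it.
    core = np.sort(d, axis=1)[:, k]
    # Take the largest of the two core distances and the raw distance
    mreach = np.maximum(d, np.maximum(core[:, None], core[None, :]))
    np.fill_diagonal(mreach, 0.0)  # keep a zero self-distance
    return mreach, core


# Hypothetical toy data: three close points and one isolated point
X = [[0.0], [1.0], [2.0], [10.0]]
mreach, core = mutual_reachability(X, k=2)
print(core)
print(mreach)
```

The isolated point has a large core distance, so its mutual reachability distance to every other point is at least that large. This is how the metric pushes sparse points away from dense regions.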

Python code for modified Silhouette Score
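
What follows is a minimal sketch under stated assumptions, not a reference implementation. The function name `modified_silhouette` is hypothetical. Noise is assumed to carry the label -1, as in HDBSCAN output. Core distances use the k-th nearest neighbor, excluding the point itself. Every non-noise point is scored; restricting the score to core points would need an extra filter. The labels in the example are written by hand to mimic HDBSCAN output.

```python
import numpy as np


def modified_silhouette(X, labels, min_samples=5):
    """Mean silhouette over clustered points, using mutual reachability distance.

    Points labelled -1 (assumed to be noise) are dropped before scoring.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)

    # Distances and core distances use the full data set, so noise points
    # still contribute to the density estimate.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    k = min(min_samples, len(X) - 1)
    core = np.sort(d, axis=1)[:, k]
    mreach = np.maximum(d, np.maximum(core[:, None], core[None, :]))
    np.fill_diagonal(mreach, 0.0)

    # Drop noise points from the distance matrix and the labels
    keep = labels != -1
    mreach = mreach[np.ix_(keep, keep)]
    labels = labels[keep]
    clusters = np.unique(labels)
    if len(clusters) < 2:
        raise ValueError("need at least two clusters after removing noise")

    # sums[i, c]: total distance from point i to the members of cluster c
    sums = np.stack([mreach[:, labels == c].sum(axis=1) for c in clusters], axis=1)
    sizes = np.array([(labels == c).sum() for c in clusters])
    own = np.searchsorted(clusters, labels)  # column of each point's own cluster
    rows = np.arange(len(labels))

    # a(i): mean distance to the other members of the point's own cluster
    own_size = sizes[own]
    a = sums[rows, own] / np.maximum(own_size - 1, 1)
    # b(i): smallest mean distance to any other cluster
    means = sums / sizes
    means[rows, own] = np.inf
    b = means.min(axis=1)

    s = (b - a) / np.maximum(a, b)
    s[own_size == 1] = 0.0  # common convention for singleton clusters
    return float(s.mean())


# Hypothetical data and labels: two clusters plus one noise point
X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0],
     [10.0, 10.0], [10.0, 11.0], [11.0, 10.0], [11.0, 11.0],
     [5.0, 20.0]]
labels = [0, 0, 0, 0, 1, 1, 1, 1, -1]

score = modified_silhouette(X, labels, min_samples=2)
print(f"modified silhouette: {score:.3f}")
```

In practice, `labels` would come from a fitted HDBSCAN model's `labels_` attribute. If scikit-learn is available, the manual a(i) and b(i) computation can likely be replaced by calling `sklearn.metrics.silhouette_score` on the filtered mutual reachability matrix with `metric="precomputed"`.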

Pros of Modified Silhouette Score:

  • Works for Non-Convex Clusters: Unlike the traditional Silhouette Score, it handles non-convex and irregularly shaped clusters better.
  • Robust to Noise: Since noise points are excluded from the score, it reflects the quality of the true clusters.
  • Density-Aware: It uses a distance metric that reflects the density structure, providing a more accurate reflection of the clustering quality.

Cons of Modified Silhouette Score:

  • Computational Complexity: Calculating density-based distance measures can be more computationally expensive than using simple Euclidean distances.
  • Implementation Complexity: Requires adapting the Silhouette Score algorithm to work with density metrics and core points.
  • Parameter Sensitivity: The score can still be sensitive to parameters like min_samples and min_cluster_size in HDBSCAN.
