DBSCAN algorithm


Spatial Clustering

Chapter 9 Spatial clustering | 02.522: Urban Data & Methods II: Computational Urban Analysis (02522-cua.github.io)

Spatial clustering refers to clustering methods that group data using spatial information such as density, actual location, and relative path.

DBSCAN

Density-based spatial clustering of applications with noise (DBSCAN) is a spatial clustering algorithm based on the density of data points. The following link gives an interactive view of how the algorithm proceeds. I recommend trying the smiley face dataset to see its advantages and the density bars dataset to see its drawbacks.

Visualizing DBSCAN Clustering (naftaliharris.com)

The algorithm has two important parameters: epsilon and minPoints. If you have tried the visualization, you probably know that epsilon is the radius of the search circle around each point, and minPoints is the minimum number of points that must fall inside that circle for the point to count as a dense (core) point that can start or extend a cluster.
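To make the two parameters concrete, here is a minimal sketch using scikit-learn's DBSCAN class, where epsilon is called eps and minPoints is called min_samples. The dataset (make_moons) and the specific values eps=0.2 and min_samples=5 are assumptions chosen for illustration, not recommendations.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Illustrative data: two interleaved half-circles with a little noise.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# eps is the neighbourhood radius; min_samples is the number of points
# (the point itself included) required within eps for a core point.
model = DBSCAN(eps=0.2, min_samples=5).fit(X)

labels = model.labels_  # one cluster index per point; -1 marks noise
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print("clusters:", n_clusters, "noise points:", n_noise)
```

Rerunning with a smaller eps or a larger min_samples should generally produce more noise points, which is a quick way to get a feel for how sensitive the result is to both values.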

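The algorithm works like this:
1. Pick an unvisited point at random and find its neighbors within radius epsilon. If there are at least minPoints of them, start a cluster and propagate the search to those neighbors, and to their neighbors in turn, until no new points can be reached. Otherwise, mark the point as noise for now.
2. Pick another point that has not been visited and repeat the first step until every point has been visited. Points that end up in no cluster are reported as noise.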
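The procedure can be sketched in plain Python as follows. This is a simplified illustration, not a reference implementation: the names dbscan, region_query and min_points are hypothetical, points are visited in index order rather than at random, and the neighbor search is a brute-force scan.

```python
from math import dist

NOISE = -1
UNVISITED = None

def region_query(points, i, eps):
    """Indices of all points within eps of points[i], including i itself."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

def dbscan(points, eps, min_points):
    """Return one label per point: 0, 1, ... for clusters and -1 for noise."""
    labels = [UNVISITED] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_points:
            labels[i] = NOISE  # may be relabelled later as a border point
            continue
        cluster += 1  # i is a core point, so it starts a new cluster
        labels[i] = cluster
        queue = list(neighbours)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_points:
                queue.extend(j_neighbours)  # j is also core: keep expanding
    return labels

# Two tight groups of four points and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 5)]
print(dbscan(points, eps=1.5, min_points=3))
# → [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

The isolated point at (5, 5) has only itself inside its search circle, so it never reaches min_points and stays labelled as noise.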

Evaluating clustering performance

Silhouette Coefficient

The silhouette value is a measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation). The silhouette ranges from −1 to +1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters. If most objects have a high value, then the clustering configuration is appropriate. If many points have a low or negative value, then the clustering configuration may have too many or too few clusters. — Wikipedia

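For a data point i in cluster C_I, let d(i, j) be the distance between points i and j. Then a(i) is the mean distance between i and all other points in the same cluster:

$$a(i) = \frac{1}{|C_I| - 1} \sum_{j \in C_I,\, j \neq i} d(i, j)$$

b(i) is the smallest mean distance from i to all points in any other cluster C_J:

$$b(i) = \min_{J \neq I} \frac{1}{|C_J|} \sum_{j \in C_J} d(i, j)$$

Silhouette value definition

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}} \quad \text{if } |C_I| > 1, \qquad s(i) = 0 \quad \text{if } |C_I| = 1$$

Also written as

$$s(i) = \begin{cases} 1 - a(i)/b(i), & \text{if } a(i) < b(i) \\ 0, & \text{if } a(i) = b(i) \\ b(i)/a(i) - 1, & \text{if } a(i) > b(i) \end{cases}$$

From the above definition, the silhouette value satisfies $-1 \le s(i) \le 1$.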

As a(i) is a measure of how dissimilar i is to its own cluster, a small value means it is well matched. Furthermore, a large b(i) implies that i is badly matched to its neighbouring cluster. Thus an s(i) close to 1 means that the data is appropriately clustered. If s(i) is close to -1, then by the same logic we see that i would be more appropriate if it was clustered in its neighbouring cluster. An s(i) near zero means that the datum is on the border of two natural clusters.
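A small worked example may help to see these cases numerically. The sketch below computes a(i), b(i) and s(i) directly with Euclidean distance; the function name silhouette_values and the toy data are assumptions for illustration, and the code assumes at least two clusters and no cluster made up entirely of identical points.

```python
from math import dist

def silhouette_values(points, labels):
    """Silhouette value s(i) for every point, using Euclidean distance."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    values = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            values.append(0.0)  # convention: s(i) = 0 for a singleton cluster
            continue
        # Mean distance to the other members of the same cluster
        # (dist(p, p) is 0, so keeping p in the sum changes nothing).
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # Smallest mean distance to the members of any other cluster.
        b = min(sum(dist(p, q) for q in other) / len(other)
                for other_lab, other in clusters.items() if other_lab != lab)
        values.append((b - a) / max(a, b))
    return values

points = [(0.0,), (1.0,), (10.0,), (11.0,)]

# Sensible grouping: every point is close to its own cluster.
good = silhouette_values(points, [0, 0, 1, 1])
print([round(s, 3) for s in good])   # → [0.905, 0.895, 0.895, 0.905]

# The point at 10.0 is forced into the wrong cluster, so its value goes negative.
bad = silhouette_values(points, [0, 0, 0, 1])
print(round(bad[2], 3))              # → -0.895
```

For the point at 0.0 in the first grouping, a = 1 and b = (10 + 11) / 2 = 10.5, so s = 9.5 / 10.5, which is close to 1 as the text above describes.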

Sklearn.metrics

The sklearn.metrics module includes score functions, performance metrics, pairwise metrics, and distance computations. The following documentation pages describe its usage.

3.3. Metrics and scoring: quantifying the quality of predictions — scikit-learn 1.1.2 documentation

6.8. Pairwise metrics, Affinities and Kernels — scikit-learn 1.1.2 documentation
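As a brief illustration of the module, the sketch below uses three of its functions on a toy dataset: pairwise_distances, silhouette_samples and silhouette_score. The data and labels are made up for the example; with labels produced by DBSCAN, note that the noise label -1 is treated like any other cluster label, so filtering out noise points first is often sensible.

```python
import numpy as np
from sklearn.metrics import pairwise_distances, silhouette_samples, silhouette_score

# Toy data: four one-dimensional points in two obvious groups.
X = np.array([[0.0], [1.0], [10.0], [11.0]])
labels = np.array([0, 0, 1, 1])

D = pairwise_distances(X, metric="euclidean")  # 4 x 4 distance matrix
per_point = silhouette_samples(X, labels)      # s(i) for every point
overall = silhouette_score(X, labels)          # mean of the per-point values

print(D[0, 3])
print(per_point)
print(overall)
```

silhouette_score gives a single number for comparing parameter settings, while silhouette_samples shows which individual points are poorly matched to their cluster.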


Author: Wulilichao
Reprint policy: Unless otherwise stated, all articles in this blog are licensed under CC BY 4.0. If you reproduce this article, please credit the source: Wulilichao.