I have a dataset of 6 elements. I computed the distance matrix using Gower distance, which resulted in the following matrix:

By just looking at this matrix, I can tell that element #0 is most similar to elements #4 and #5, so I expected HDBSCAN to cluster those three together and treat the rest as outliers; however, that wasn't the case.

```python
clusterer = hdbscan.HDBSCAN(
    min_cluster_size=2,
    min_samples=3,
    metric='precomputed',
    cluster_selection_epsilon=0.1,
    cluster_selection_method='eom',
).fit(distance_matrix)
```

**Clusters Formed:**

*Cluster 0:* {element #0, element #2}

*Cluster 1:* {element #4, element #5}

*Outliers:* {element #1, element #3}

which is a behavior I don't understand. Also, both parameters `cluster_selection_epsilon` and `cluster_selection_method` don't seem to have an effect on my results at all and I don't understand why.

I tried changing the parameters again to `min_cluster_size=2, min_samples=1`, which gave:

**Clusters Formed:**

*Cluster 0:* {element #0, element #2, element #4, element #5}

*Cluster 1:* {element #1, element #3}

Any other change to the parameters resulted in all points being classified as outliers.

Can someone please help explain this behavior, and explain why `cluster_selection_epsilon` and `cluster_selection_method` don't affect the clusters formed? I thought that by setting `cluster_selection_epsilon` to 0.1, I'd be ensuring that the points inside a cluster would be at most 0.1 apart (so that element #0 and element #2 aren't clustered together, for instance).

Below is a visual representation of both clustering trials:

## Answer

As touched upon in the hdbscan documentation, the core of HDBSCAN is (1) calculating the mutual reachability distance and (2) applying the single linkage algorithm to it. Since you have only a few data points and your distance matrix is precomputed, you can see directly that your clustering is decided by the single linkage tree:

```python
import numpy as np
import hdbscan
import matplotlib.pyplot as plt
import seaborn as sns

# Precomputed Gower distance matrix from the question
x = np.array([[0.0, 0.741, 0.344, 1.0, 0.062, 0.084],
              [0.741, 0.0, 0.648, 0.592, 0.678, 0.657],
              [0.344, 0.648, 0.0, 0.648, 0.282, 0.261],
              [1.0, 0.592, 0.655, 0.0, 0.937, 0.916],
              [0.062, 0.678, 0.282, 0.937, 0.0, 0.107],
              [0.084, 0.65, 0.261, 0.916, 0.107, 0.0]])

clusterer = hdbscan.HDBSCAN(min_cluster_size=2, min_samples=1,
                            metric='precomputed').fit(x)

# Plot the single linkage tree (dendrogram) built by HDBSCAN
clusterer.single_linkage_tree_.plot(cmap='viridis', colorbar=True)
```

The results will be:

```python
print(clusterer.labels_)
# [0 1 0 1 0 0]
```

Because the minimum number of clusters has to be 2, the only way to achieve this is to have elements 0, 2, 4, and 5 together.
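To make the first of the two steps concrete, here is a minimal NumPy sketch of the mutual reachability distance for the matrix above. It assumes the core distance of a point is the distance to its `min_samples`-th nearest neighbour, not counting the point itself, and it rebuilds the posted matrix from its upper triangle so that it is exactly symmetric (two entries differ slightly between the triangles). `mutual_reachability` is a hypothetical helper written for illustration, not part of the hdbscan API.

```python
import numpy as np

x = np.array([[0.0, 0.741, 0.344, 1.0, 0.062, 0.084],
              [0.741, 0.0, 0.648, 0.592, 0.678, 0.657],
              [0.344, 0.648, 0.0, 0.648, 0.282, 0.261],
              [1.0, 0.592, 0.655, 0.0, 0.937, 0.916],
              [0.062, 0.678, 0.282, 0.937, 0.0, 0.107],
              [0.084, 0.65, 0.261, 0.916, 0.107, 0.0]])

# Assumption: keep only the upper triangle so the matrix is exactly symmetric.
upper = np.triu(x, k=1)
d = upper + upper.T


def mutual_reachability(dist, min_samples):
    """Hypothetical helper: max(core(a), core(b), d(a, b)) for every pair."""
    k = min(min_samples, dist.shape[0] - 1)
    # Column 0 of each sorted row is the point itself (distance 0),
    # so column k holds the distance to the k-th nearest neighbour.
    core = np.sort(dist, axis=1)[:, k]
    mreach = np.maximum(dist, np.maximum(core[:, None], core[None, :]))
    np.fill_diagonal(mreach, 0.0)
    return mreach


mr3 = mutual_reachability(d, min_samples=3)
mr1 = mutual_reachability(d, min_samples=1)

print(mr3[0, [2, 4, 5]])       # element 0 versus elements 2, 4 and 5
print(np.array_equal(mr1, d))  # does min_samples=1 leave the distances unchanged?
```

Under these assumptions, with `min_samples=3` element 0 is exactly as far from element 2 as from elements 4 and 5 (0.344 in each case), which is consistent with it landing next to element 2 in the first trial. With `min_samples=1` the mutual reachability distance reduces to the original distances, so the result is plain single linkage on your matrix.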

One quick solution is to simply cut the tree and get the cluster you intended:

```python
print(clusterer.single_linkage_tree_.get_clusters(0.15, min_cluster_size=2))
# [ 0 -1 -1 -1  0  0]
```

Or you can simply use `sklearn.cluster.AgglomerativeClustering`, since you are not relying on hdbscan to compute the distances.