The IDseq heatmap is clustered using average-linkage hierarchical clustering based on Euclidean distances. This document discusses what hierarchical clustering is, how to read a dendrogram, and how the clustering is calculated.
How to read a dendrogram
The heatmap is automatically organized by hierarchical clustering. At a high level, this means similar samples and taxa are organized into groups called clusters. The result is a dendrogram displaying a set of clusters (clades), where each clade is distinct, and the taxa and samples within a clade are broadly similar to each other.
In the image below, you see two visual representations of the same data: a dendrogram (right) and nested clusters (left). In the nested clusters, while J and K are in their own cluster, they also share a broader cluster with I and H, and an even larger cluster with G.
Stephanie Glen. "Hierarchical Clustering / Dendrogram: Simple Definition, Examples" From StatisticsHowTo.com: Elementary Statistics for the rest of us! https://www.statisticshowto.com/hierarchical-clustering/
In the dendrogram, this is represented as clades, where J and K are in their own clade, yet they share a larger clade with I and H, and an even broader clade with G.
Reading the heatmap
There are two dendrograms on the IDseq heatmap. Both are based on the chosen metric, so the clustering may change if the metric is changed, e.g., from total reads to reads per million (rPM).
- Cluster taxa
  - Taxa that are in a cluster are more likely to appear together across samples.
- Cluster samples based on the presence of taxa
  - Samples that are clustered together have similar taxa present.
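One way to picture the two dendrograms is as the same table of values read along its two axes: each taxon is a row of values across samples, and each sample is a column of values across taxa. The sketch below is illustrative only. The names `taxa`, `samples`, and `matrix` and all the numbers are made up, and it is not IDseq's actual code.

```python
import math

# Hypothetical heatmap values: rows are taxa, columns are samples (made-up numbers).
taxa = ["Taxon A", "Taxon B"]
samples = ["Sample 1", "Sample 2", "Sample 3"]
matrix = [
    [10.0, 12.0, 0.0],  # Taxon A across the three samples
    [0.0, 1.0, 9.0],    # Taxon B across the three samples
]

def euclidean(a, b):
    """Square root of the sum of squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Taxa are compared row by row; samples are compared column by column.
taxon_vectors = dict(zip(taxa, matrix))
sample_vectors = dict(zip(samples, (list(col) for col in zip(*matrix))))

d12 = euclidean(sample_vectors["Sample 1"], sample_vectors["Sample 2"])
d13 = euclidean(sample_vectors["Sample 1"], sample_vectors["Sample 3"])

# Samples 1 and 2 have similar taxa, so they are closer than Samples 1 and 3.
print(d12 < d13)
```

With these values, Sample 1 and Sample 2 would end up in the same clade before Sample 3 joins them.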
To visualize a clade, you can hover over a node (represented by the star below) on the dendrogram, and the clade will be highlighted in blue. In the heatmap below, Patient 009 and Patient 010 have more taxa in common with each other than either has with Patient 007.
Hierarchical clustering is not perfect and should only be used to get a general sense of how your samples and taxa are related to each other.
More on average-linkage hierarchical clustering
Before the clustering step, a distance measure is calculated to determine the dissimilarity between samples and between taxa. IDseq uses Euclidean distance, which is the square root of the sum of the squared differences. For the heatmap, the Euclidean distance is calculated from the chosen metric (e.g., total reads, z-score, or rPM), between each pair of samples and between each pair of taxa.
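As a small illustration of the distance calculation, the sketch below computes the Euclidean distance between two samples. The function name `euclidean_distance` and the rPM values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import math

def euclidean_distance(a, b):
    """Square root of the sum of squared differences between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical rPM values for three taxa in two samples (made-up numbers).
sample_a = [3.0, 0.0, 4.0]
sample_b = [0.0, 0.0, 0.0]

# Differences are 3, 0, 4; squares sum to 25; the square root is 5.
print(euclidean_distance(sample_a, sample_b))  # 5.0
```

Because the distance depends directly on the values being compared, switching the metric (for example from total reads to rPM) changes the distances and can therefore change the clustering.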
Once the distances are calculated, average-linkage hierarchical clustering is performed. In average linkage, the distance between two clusters is the average of the distances between every pair of observations with one member in each cluster: the pairwise distances (Euclidean distances of the chosen metric) are added up and divided by the number of pairs. The two clusters with the smallest average distance are merged first, forming a new cluster, and the process repeats until all observations belong to a single cluster. More details on hierarchical clustering can be found here.
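The merge procedure can be sketched in a few lines of plain Python. This is a minimal, unoptimized illustration of average-linkage clustering under the description above, not IDseq's implementation. The function names (`euclidean`, `average_linkage`, `cluster`) and the data are assumptions made for the example.

```python
import math
from itertools import combinations

def euclidean(a, b):
    """Square root of the sum of squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def average_linkage(points, c1, c2):
    """Mean distance over all pairs with one member in c1 and one in c2."""
    total = sum(euclidean(points[i], points[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))

def cluster(points):
    """Return the merges in order as (cluster_a, cluster_b, average_distance)."""
    # Start with every observation in its own cluster (tuples of indices).
    clusters = [(i,) for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        # Pick the pair of clusters with the smallest average pairwise distance.
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: average_linkage(points, pair[0], pair[1]),
        )
        merges.append((a, b, average_linkage(points, a, b)))
        # Replace the two merged clusters with their union.
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges

# Three hypothetical observations with a single made-up value each.
points = [[0.0], [1.0], [5.0]]
merges = cluster(points)
for a, b, d in merges:
    print(a, b, d)
```

Here observations 0 and 1 merge first at distance 1.0. Observation 2 then joins them at (5 + 4) / 2 = 4.5, the average of its distances to the two members of the merged cluster. In a dendrogram, these distances would correspond to the heights at which the branches join.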