Overview
The CZ ID heatmap is clustered using average-linkage hierarchical clustering based on Euclidean distances. This document explains what hierarchical clustering is, how to read a dendrogram, and how the clustering is calculated.
Reading a Dendrogram
The heatmap is automatically organized by hierarchical clustering. At a high level, this means similar samples and taxa are organized into groups called clusters. The result is a dendrogram displaying a set of clusters (clades), where each clade is distinct, and the taxa and samples within a clade are broadly similar to each other.
The image below shows two visual representations of the same data: nested clusters (left) and a dendrogram (right). Note that while "J" and "K" are in their own cluster, they also share a broader cluster with "H" and "I", and an even larger cluster with "G".
Nested clusters (left) represented in a dendrogram (right). Source: https://www.statisticshowto.com/hierarchical-clustering/.
In the dendrogram, the clustering is represented as clades, where "J" and "K" are in their own clade, yet they share a larger clade with "H" and "I", and an even broader clade with "G".
Reading a Heatmap
The CZ ID heatmap has two dendrograms, one for taxa and one for samples. The clustering is based on the selected metric, so it may change if you switch metrics, for example from total reads to reads per million (rPM).
- Clustering for taxa: Taxa that are in a cluster are more likely to appear together across samples.
- Clustering for samples: Samples that are clustered together have similar taxa present.
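The difference between the two dendrograms can be illustrated with a small sketch. It assumes the heatmap values form a matrix with taxa as rows and samples as columns, so taxa are compared row against row and samples column against column. The matrix values, labels, and the helper `pairwise_distances` are hypothetical and chosen only for illustration; this is not CZ ID's own code.

```python
import math

def euclidean_distance(a, b):
    """Square root of the sum of squared differences between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pairwise_distances(labels, vectors):
    """Distance for every unordered pair of labeled vectors (hypothetical helper)."""
    return {
        (labels[i], labels[j]): euclidean_distance(vectors[i], vectors[j])
        for i in range(len(labels))
        for j in range(i + 1, len(labels))
    }

# Hypothetical metric values: rows are taxa, columns are samples.
taxa = ["Taxon A", "Taxon B", "Taxon C"]
samples = ["Sample 1", "Sample 2"]
matrix = [
    [5.0, 5.0],
    [5.0, 1.0],
    [0.0, 1.0],
]

# Taxa dendrogram: each taxon is a row vector of its values across samples.
taxon_distances = pairwise_distances(taxa, matrix)

# Sample dendrogram: each sample is a column vector of its values across taxa.
columns = [list(col) for col in zip(*matrix)]
sample_distances = pairwise_distances(samples, columns)

print(taxon_distances)
print(sample_distances)
```

Because the distances are computed from the metric values themselves, replacing the matrix with values from a different metric would generally produce different distances and therefore a different clustering.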
To visualize a clade, hover over a node on the dendrogram and the clade will be highlighted in blue. In the heatmap below, Patients 009 and 010 have more viruses in common with each other than with the other samples.
Hierarchical clustering is not perfect and should only be used to get a general sense of how your samples and taxa relate to each other.
More on Average-linkage Hierarchical Clustering
Before the clustering step, a distance measure is calculated to determine the dissimilarity between samples and between taxa. CZ ID uses Euclidean distance, which is the square root of the sum of the squared differences. For the heatmap, the Euclidean distance is calculated from the chosen metric (e.g., total reads, Z score, rPM), between each pair of samples and between each pair of taxa. Once the distances are calculated, average-linkage hierarchical clustering is performed.
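As a worked example of the distance step, the sketch below computes the Euclidean distance between two samples. The rPM values are made up for illustration and the function name is an assumption, not part of CZ ID.

```python
import math

def euclidean_distance(a, b):
    """Square root of the sum of squared differences between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical rPM values for the same three taxa in two samples.
sample_a = [10.0, 0.0, 4.0]
sample_b = [7.0, 4.0, 4.0]

# Differences are 3, -4 and 0, so the distance is sqrt(9 + 16 + 0) = 5.0.
print(euclidean_distance(sample_a, sample_b))
```

A distance of 0 means the two samples have identical values for every taxon; larger distances mean the samples are more dissimilar.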
In average linkage, the distance between two clusters is the average of the distances (Euclidean distances of the chosen metric) between every pair of observations that has one observation in each cluster: the pairwise distances are added up and divided by the number of pairs. The clusters with the smallest average distance between them are merged first, forming a new cluster level that, on average, minimizes the pairwise distances between its points. More details on hierarchical clustering can be found here.
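The merge procedure described above can be sketched in a few lines. This is a deliberately naive illustration, not CZ ID's implementation: the sample names, values, and function names are hypothetical, and real tools use faster algorithms.

```python
import math
from itertools import combinations

def euclidean_distance(a, b):
    """Square root of the sum of squared differences between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def average_linkage(cluster_a, cluster_b, points):
    """Mean distance over every pair with one observation in each cluster."""
    total = sum(
        euclidean_distance(points[i], points[j])
        for i in cluster_a
        for j in cluster_b
    )
    return total / (len(cluster_a) * len(cluster_b))

def cluster(points):
    """Merge the two closest clusters until one remains.

    Returns the merges in order as (cluster_a, cluster_b, distance).
    """
    clusters = [(name,) for name in points]  # every observation starts alone
    merges = []
    while len(clusters) > 1:
        # Pick the pair of clusters with the smallest average distance.
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: average_linkage(pair[0], pair[1], points),
        )
        merges.append((a, b, average_linkage(a, b, points)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges

# Hypothetical metric values for four samples across two taxa.
points = {
    "S1": [0.0, 0.0],
    "S2": [0.0, 1.0],
    "S3": [4.0, 0.0],
    "S4": [4.0, 2.0],
}

for a, b, distance in cluster(points):
    print(a, b, round(distance, 2))
```

With these values, S1 and S2 merge first (distance 1.0), then S3 and S4 (distance 2.0), and the two resulting clusters merge last. The order and height of the merges are what a dendrogram draws.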