## Overview

If the data are not suitable for building a phylogenetic tree, a pairwise distance matrix is generated instead to depict relatedness between taxa identified in mNGS samples. Here we explain why some data may not be suitable for tree building, how to interpret the distance matrix, and how to download the matrix data.

## Obtaining a pairwise distance matrix

The tool that underlies CZ ID's phylogenetic pipeline is known as split kmer analysis (SKA) (see pipeline details). SKA relies on kmer matches to compute relative distances between samples. If the assumptions for using SKA are not met, the tree build is stopped deliberately to avoid showing erroneous results, and you will obtain a pairwise distance matrix instead.

*Message indicating that phylogenetic tree build failed and a pairwise matrix was calculated instead.*

The assumptions for SKA become invalid in two different cases:

**Samples are highly divergent from each other**: If the contigs selected for the analysis are reasonably similar (SKA Mash-like distance < 0.15), they will proceed into the phylogenetic analysis. If contigs are relatively divergent (SKA Mash-like distance > 0.15), the pipeline will end here and you will receive a pairwise matrix. The pipeline does not distinguish between truly divergent sequences and sequences that have low coverage (see the low-coverage case below).

**Some of the samples have low coverage**: If the selected contigs cover only a small portion of the reference sequence for the taxon of interest, they may share few kmers with other sequences in the analysis. SKA relies on matching kmers to calculate relative distances. If there are too few matching kmers, relative distances cannot be reliably calculated.

If your analysis results in a pairwise matrix, you can attempt to create a phylogenetic tree by removing outlier samples (due to divergence or low coverage) from the analysis and trying again.
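As a rough illustration of how outlier samples might be spotted, the sketch below flags samples whose Mash-like distance exceeds 0.15 (the threshold mentioned above) for more than half of their comparisons. The "more than half" rule, the function name `flag_outliers`, and the sample names and distance values are all assumptions made for this example. This is not how CZ ID itself selects samples.

```python
from collections import defaultdict

# Threshold quoted in this article for proceeding to the phylogenetic analysis.
THRESHOLD = 0.15


def flag_outliers(pairs, threshold=THRESHOLD):
    """Return samples that exceed `threshold` in more than half of their pairs.

    `pairs` maps (sample_a, sample_b) tuples to a Mash-like distance.
    The majority rule is a heuristic chosen for illustration only.
    """
    exceed = defaultdict(int)  # comparisons above the threshold, per sample
    total = defaultdict(int)   # all comparisons, per sample
    for (a, b), distance in pairs.items():
        if a == b:
            continue  # skip self-comparisons on the matrix diagonal
        for sample in (a, b):
            total[sample] += 1
            if distance > threshold:
                exceed[sample] += 1
    return sorted(s for s in total if exceed[s] > total[s] / 2)


# Made-up distances: sample_D is far from every other sample.
example_pairs = {
    ("sample_A", "sample_B"): 0.02,
    ("sample_A", "sample_C"): 0.04,
    ("sample_B", "sample_C"): 0.03,
    ("sample_A", "sample_D"): 0.40,
    ("sample_B", "sample_D"): 0.38,
    ("sample_C", "sample_D"): 0.41,
}
outliers = flag_outliers(example_pairs)
```

A flagged sample is only a candidate for removal. Because the matrix cannot tell divergence from low coverage, it is worth checking the sample's coverage before excluding it.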

## Interpreting a pairwise distance matrix

The distance matrix shows the Mash-like distance between samples. The Mash-like distance is calculated from two values: the Jaccard Index (j), which is the ratio of split kmers found in both samples to the total number of split kmers found across the two samples, and the split kmer length (k).

**Jaccard Index (j)** = split kmer matches / (split kmer matches + mismatches)

**Mash-like distance** = (-1/(2k+1)) * ln(2j/(1+j)) for 0 < j ≤ 1, or 1 for j = 0
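The two formulas above can be sketched in a few lines of Python. The function names, the choice to raise an error when no split kmers were compared, and the value k = 15 used in the example are assumptions for illustration. The code is a direct transcription of the formulas, not SKA's own implementation.

```python
import math


def jaccard_index(matches: int, mismatches: int) -> float:
    """Jaccard Index j = matches / (matches + mismatches)."""
    total = matches + mismatches
    if total == 0:
        # Assumption: with no split kmers compared, j is undefined.
        raise ValueError("no split kmers were compared")
    return matches / total


def mash_like_distance(j: float, k: int) -> float:
    """Mash-like distance for Jaccard Index j and split kmer length k."""
    if not 0.0 <= j <= 1.0:
        raise ValueError("j must be between 0 and 1")
    if j == 0:
        return 1.0  # defined as 1 when no split kmers match
    if j == 1:
        return 0.0  # ln(1) = 0, so identical samples have distance 0
    return (-1.0 / (2 * k + 1)) * math.log(2 * j / (1 + j))


# Example with made-up counts: 500 matches and 500 mismatches, k = 15.
j = jaccard_index(500, 500)
distance = mash_like_distance(j, k=15)
```

Note that the distance grows as j falls, so samples sharing fewer split kmers appear further apart in the matrix.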

Mash-like distances are depicted in a pairwise matrix. Within the matrix, each sample is compared to itself and to all other samples, and the colors indicate the Mash-like distance range (see scale). Along the diagonal green arrow in the screenshot below, each sample is compared to itself, resulting in dark red squares. This indicates that there are no differences between the sequences.

Samples are also compared to every other sample. For example, to compare sample *AP010960.1* with sample *UnAmbiguouslyMapped_ds.cityparks*, follow the green arrows in the screenshot below to see that the pairwise distance between the two samples is between 0.05 and 0.1. To get exact distance values, you can download a CSV file of the heatmap.

## Downloading pairwise distance matrix

You can download the pairwise distance matrix data using the Download button on the right-hand side of the page.

Downloads for the matrix include:

- Heatmap image (SVG format)
- Heatmap image (PNG format)
- SKA distances (CSV format): Use this comma-delimited file to view mismatches, Mash-like distances, number of SNPs, and SNP distances between samples. You can read about each of the metrics here.
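Once downloaded, the SKA distances file can be read with Python's standard `csv` module to get exact values. The sketch below is a minimal example in which the column headers (`Sample 1`, `Sample 2`, `Mash-like distance`), the sample names, and the values are assumptions. Check the header row of your own file and adjust the arguments accordingly.

```python
import csv
import io


def load_mash_distances(handle,
                        sample_a_col="Sample 1",
                        sample_b_col="Sample 2",
                        dist_col="Mash-like distance"):
    """Read pairwise Mash-like distances from an open CSV file handle.

    The default column names are hypothetical; pass the headers
    that appear in your downloaded file.
    """
    distances = {}
    for row in csv.DictReader(handle):
        pair = (row[sample_a_col], row[sample_b_col])
        distances[pair] = float(row[dist_col])
    return distances


# Stand-in for a downloaded file, with made-up values.
example_csv = io.StringIO(
    "Sample 1,Sample 2,Mash-like distance\n"
    "sample_A,sample_B,0.02\n"
    "sample_A,sample_C,0.31\n"
)
distances = load_mash_distances(example_csv)
```

For a real download, replace the `io.StringIO` object with `open(path, newline="")`, where `path` points to the CSV file you saved.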
