Jump to Section:
- Creating a phylogenetic tree
- Pipeline details
- Receiving a pairwise distance matrix instead of a tree
- Interpreting the pairwise distance matrix
- Interpreting the phylogenetic tree
- Selection of public sequences
- Coloring the tree by metadata
- How is the new CZ ID phylotree module different from the original?
Overview
Creating a phylogenetic tree enables you to compare the genomic relatedness of organisms found across different samples, which is useful for evaluating similarity in the context of mNGS contamination or outbreaks. The CZ ID phylogenetics module allows you to construct phylogenetic trees for organisms found within multiple mNGS samples. This documentation will guide you through building and interpreting a phylogenetic tree in CZ ID.
Creating a phylogenetic tree
- Select a few samples in a project for which you are interested in creating a phylogenetic tree. You might do this by using the heatmap to identify a set of samples that contain the same taxon.
- Select the “phylogenetic tree” button on the project page.
- You will be taken to the phylogenetic tree creation modal. It may take some time to load.
- Select “Create new tree” in the bottom left. Then enter the project name and organism for which you want to create a tree. Select “Continue”.
- Enter a name for the phylogenetic tree in the “Name” text box.
- Select the samples that you would like included in the phylogenetic tree. It is best to use samples with relatively high coverage of the organism of interest, because low-coverage samples can skew the analysis. Use the “Coverage Breadth” column in the modal to decide which samples to add; this value is the coverage breadth of the top accession.
- Once you have selected the samples to include in the analysis, select “Continue”. You will have the option to add samples from public projects on CZ ID. Select “create tree”. You should receive confirmation that your tree is being created. The number of samples and the length of the contigs affect how long the tree takes to build, which can be anywhere from around 10 minutes to 2 hours.
- You can then navigate to view your phylogenetic tree by selecting the “Visualizations” tab within the “My data” page.
Pipeline details
The typical phylogenetic analysis involves the comparison of complete genomes to known reference genomes. However, in the context of mNGS data analysis, it may not be possible to obtain full-length genomes and there may not be a sufficiently related reference genome. To address these challenges, the CZ ID phylogenetic pipeline performs kmer-based reference-free genomic distance estimation. This means that split kmers with matching middle bases are aligned to each other and not to a designated reference sequence. This is important for interpreting the results.
The CZ ID phylogenetic tree module consists of three steps:
- In the data-gathering phase, the pipeline gathers contigs associated with the taxID of interest from the selected samples.
- In the distance estimation phase, SKA is used to generate kmer-based estimates of genomic similarity. Since SKA relies on split kmers, it is intended for use with relatively similar genomes. If the contigs selected for the analysis are reasonably similar (SKA Mash-like distance < 0.15), they will proceed into the phylogenetic analysis. If they are relatively divergent (SKA Mash-like distance > 0.15) due to true divergence or low coverage, then the pipeline will end here and you will receive a heatmap, but no phylogenetic tree.
- In the tree-building phase, SKA is used to create pseudo-alignments by identifying mismatching bases (using split kmers) between samples. These are then used as input to IQtree, which estimates phylogenetic relationships from the pseudo-alignments.
Receiving a pairwise distance matrix instead of a tree
The tool that underlies CZ ID's phylogenetic pipeline (SKA) relies on kmer matches to compute the distance between samples. The assumptions of this tool become invalid in two cases:
- Samples are highly divergent from each other. If the contigs selected for the analysis are reasonably similar (SKA Mash-like distance < 0.15), they will proceed into the phylogenetic analysis. If they are relatively divergent (SKA Mash-like distance > 0.15), the pipeline will end after distance estimation and you will receive a heatmap, but no phylogenetic tree.
- Some of the samples being analyzed have low coverage (and therefore few overlapping kmers with other sequences in the analysis).
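If you receive a heatmap instead of a tree, it can help to see which sample pairs exceed the 0.15 threshold so you can decide which samples to leave out of a new run. The sketch below is only an illustration: the sample names and distance values are made up, the `divergent_pairs` helper is hypothetical, and how the pipeline itself combines the pairwise distances into its decision is not described here.

```python
# Threshold stated in this documentation for the SKA Mash-like distance.
DIVERGENCE_THRESHOLD = 0.15


def divergent_pairs(distances, threshold=DIVERGENCE_THRESHOLD):
    """Return the sample pairs whose Mash-like distance exceeds the threshold.

    `distances` maps a (sample_a, sample_b) tuple to a Mash-like distance.
    This is a hypothetical helper, not part of CZ ID or SKA.
    """
    return sorted(pair for pair, d in distances.items() if d > threshold)


# Made-up example values, e.g. copied by hand from a downloaded distance file.
example = {
    ("sample_A", "sample_B"): 0.02,
    ("sample_A", "sample_C"): 0.31,
    ("sample_B", "sample_C"): 0.28,
}

for a, b in divergent_pairs(example):
    print(f"{a} vs {b}: distance {example[(a, b)]} is above {DIVERGENCE_THRESHOLD}")
```

In this made-up example, sample_C appears in every divergent pair, which would suggest removing it before building the tree again.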
Interpreting the pairwise distance matrix
The distance matrix shows the Mash-like distance between samples. The Mash-like distance is calculated as follows:
- A distance based on the Mash distance calculation using the Jaccard Index (j) and the split kmer length (k): (-1/(2k+1)) * ln(2j/(1+j)) for 0 < j ≤ 1, or 1 for j = 0
- Jaccard Index = ratio of split kmers found in both samples to the total found in the two samples: matches/(matches+mismatches)
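The two definitions above can be written out as a short Python sketch. The function names are hypothetical, and the value k = 15 and the match counts in the example are assumptions chosen only for illustration; this is not the SKA source code.

```python
import math


def jaccard_index(matches, mismatches):
    """Ratio of split kmers found in both samples to the total found in the two samples."""
    total = matches + mismatches
    if total == 0:
        # Assumption: no split kmers at all is treated as no overlap.
        return 0.0
    return matches / total


def mash_like_distance(j, k):
    """Mash-like distance from the Jaccard Index j and the split kmer length k.

    Follows the formula given above:
    (-1/(2k+1)) * ln(2j/(1+j)) for 0 < j <= 1, or 1 for j = 0.
    """
    if not 0.0 <= j <= 1.0:
        raise ValueError("Jaccard Index must be between 0 and 1")
    if j == 0.0:
        return 1.0
    return (-1.0 / (2 * k + 1)) * math.log(2.0 * j / (1.0 + j))


# Illustrative values only: 900 matching and 100 mismatching split kmers, k = 15.
j = jaccard_index(900, 100)
d = mash_like_distance(j, 15)
print(f"Jaccard Index = {j:.2f}, Mash-like distance = {d:.5f}")
```

Note that the distance shrinks toward 0 as the Jaccard Index approaches 1 (identical split kmer sets) and is fixed at 1 when the samples share no split kmers.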
Each sample is compared to every other sample. For example, when comparing sample AP010960.1 to sample UnAmbiguouslyMapped_ds.cityparks, you can follow the green arrows in the screenshot below to see that the pairwise distance between the two samples is between 0.05 and 0.1. To get exact distance values, you can download a CSV file of the heatmap.
Interpreting the phylogenetic tree
The branch lengths are interpreted as the number of nucleotide substitutions per nucleotide site, specifically where kmers overlap. In the example below, sample FG-GS-Boogie and sample FGS-01-Ban have a relative distance of 8, which means that where the kmers overlapped, there were 8 nucleotide substitutions. These two samples are the most divergent from each other.
For the example below, the samples that have a green oval around them are most similar to each other. For instance, FG-G5-Boogie and FG-K7-Boogie have identical contigs. There is no relative distance between the two samples since they are located on the same branch.
Selection of public sequences
Additional sequences are selected automatically based on the NCBI NT reference sequence with the most short-read matches. A maximum of 10 sequences from NCBI will be selected for any set of samples included in a phylogenetic tree analysis. This may result in a different accession being selected in the phylogenetic tree module than what is shown in the coverage visualization on the sample report because the coverage visualization selectively weights contig matches.
Coloring the tree by metadata
The tree is automatically labeled by project name and NCBI database, but this can be changed by clicking on the “color by:” dropdown menu. From here, you can choose which metadata you would like the tree to show. If you choose “location”, the tree will reflect which location the samples were collected from by changing the color of the branch and sample name. If you would like to return to the original tree, change “color by” back to “Project name”.
Downloads
There are downloads available for the phylogenetic tree and the heatmap. To start a download, click the blue “Download” button in the upper right corner.
Downloads available for the phylogenetic tree include:
- Tree file (.nwk) - this is the actual tree. It can be brought into other software (such as MEGA) to edit the tree.
- Tree image (.svg)
- Tree image (.png)
- SKA distance (.tsv) - view mismatches, Mash-like distances, number of SNPs, and SNP distances between samples. You can read about each of the metrics here.
- SKA variants (.aln) - used to view the sequence alignment.
Downloads for the heatmap include:
- Heatmap image (SVG)
- Heatmap image (PNG)
- SKA distances (CSV) - view mismatches, Mash-like distances, number of SNPs, and SNP distances between samples. You can read about each of the metrics here.
How is the new CZ ID phylotree module different from the original?
The original CZ ID phylotree module used a tool called kSNP3 to compute the phylogenetic trees from all reads and contigs that mapped to a taxon of interest. The current CZ ID phylotree module uses a tool called SKA to compute the genomic distance and IQtree to compute the phylogenetic relationships. It does this using only the contigs for a particular taxon of interest (since contigs reduce some of the kmer noise due to errors in raw short reads). Additionally, when the resulting phylogenetic tree would be low quality because samples have low coverage or are extremely divergent, the current CZ ID phylotree module will provide a heatmap of estimated genomic distances.
- Harris SR. 2018. SKA: Split Kmer Analysis Toolkit for Bacterial Genomic Epidemiology. bioRxiv 453142 doi: https://doi.org/10.1101/453142
- B.Q. Minh, H.A. Schmidt, O. Chernomor, D. Schrempf, M.D. Woodhams, A. von Haeseler, R. Lanfear (2020) IQ-TREE 2: New models and efficient methods for phylogenetic inference in the genomic era. Mol. Biol. Evol., 37:1530-1534. https://doi.org/10.1093/molbev/msaa015