Overview
You can compare taxa detected across multiple metagenomic (mNGS) samples using a phylogenetic tree. Here we describe the pipeline used for creating these phylogenetic trees in CZ ID.
Pipeline details
Phylogenetic analysis typically involves comparing sequences to a known reference to assess the number of genetic changes that have occurred relative to that reference. However, sequences obtained through metagenomic analysis may be incomplete, and there may not be a reliable reference sequence against which to compare novel sequences. To address these challenges, the CZ ID phylogenetic pipeline performs a k-mer-based, reference-free genomic distance estimation known as split k-mer analysis (SKA). This means that sequences are compared based on matching short stretches of sequence (i.e., split k-mers) and not to a designated reference sequence. Keep this in mind when interpreting results: the distances and trees describe how similar the selected samples are to one another, not how far each has diverged from a reference.
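To make the idea of a split k-mer concrete, here is a minimal Python sketch. It assumes the usual description of a split k-mer as two flanking k-mers separated by a single middle base. The function names (`split_kmers`, `compare`) and the tiny flank length are hypothetical and chosen for illustration. This is not SKA's implementation: it ignores reverse complements and does not treat repeated flank pairs specially.

```python
def split_kmers(seq: str, k: int = 3) -> dict[tuple[str, str], str]:
    """Map each (left flank, right flank) pair to the base between them.

    Simplification: if the same flank pair occurs more than once,
    the last occurrence wins.
    """
    seq = seq.upper()
    span = 2 * k + 1  # left flank + middle base + right flank
    result = {}
    for i in range(len(seq) - span + 1):
        left = seq[i:i + k]
        middle = seq[i + k]
        right = seq[i + k + 1:i + span]
        result[(left, right)] = middle
    return result


def compare(a: str, b: str, k: int = 3) -> tuple[int, int]:
    """Return (shared split k-mers, shared split k-mers whose middle base differs)."""
    ka, kb = split_kmers(a, k), split_kmers(b, k)
    shared = ka.keys() & kb.keys()
    mismatches = sum(1 for key in shared if ka[key] != kb[key])
    return len(shared), mismatches
```

For example, `compare("AACGTTGCA", "AACGATGCA", k=2)` finds one flank pair shared by both sequences, and the middle base at that position differs between them. No reference sequence is involved at any point; the two inputs are compared only with each other.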
The CZ ID phylogenetic tree module consists of three stages:
- Data-gathering phase: In this initial stage, the pipeline gathers contigs associated with the taxon of interest from the selected samples.
- Distance estimation phase: During the second stage, SKA is used to generate k-mer-based estimates of genomic similarity. Because SKA relies on split k-mers, it is intended for use with relatively similar genomes. If the contigs selected for the analysis are reasonably similar (SKA Mash-like distance < 0.15), they proceed to the phylogenetic analysis. If they are relatively divergent (SKA Mash-like distance > 0.15), whether due to true divergence or low coverage, the pipeline ends here and you obtain a pairwise distance matrix but no phylogenetic tree.
- Tree-building phase: In the final stage, SKA is used to create pseudo-alignments by identifying mismatching bases (using split k-mers) between samples. These pseudo-alignments are then used as input to IQ-TREE, which estimates phylogenetic relationships from them.
Phylogenetic tree pipeline overview
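The distance check and the pseudo-alignment step described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the CZ ID or SKA code. The function names are hypothetical. The text does not say how the 0.15 threshold is applied across many sample pairs, so the sketch assumes every pairwise distance must fall below it. The pseudo-alignment is also simplified: it keeps only split k-mers present in every sample.

```python
DISTANCE_THRESHOLD = 0.15  # threshold quoted in the stage description above


def should_build_tree(distances: dict[tuple[str, str], float],
                      threshold: float = DISTANCE_THRESHOLD) -> bool:
    """Decide whether to continue to tree building.

    Assumption (not stated in the text): all pairwise distances must be
    below the threshold for the samples to proceed.
    """
    return all(d < threshold for d in distances.values())


def pseudo_alignment(
    samples: dict[str, dict[tuple[str, str], str]],
) -> dict[str, str]:
    """Build one alignment column per shared split k-mer whose middle base varies.

    `samples` maps a sample name to its split k-mers, each stored as
    (left flank, right flank) -> middle base. Requires at least one sample.
    """
    shared = set.intersection(*(set(kmers) for kmers in samples.values()))
    variable = sorted(
        key for key in shared
        if len({kmers[key] for kmers in samples.values()}) > 1
    )
    return {
        name: "".join(kmers[key] for key in variable)
        for name, kmers in samples.items()
    }


if __name__ == "__main__":
    distances = {("s1", "s2"): 0.02, ("s1", "s3"): 0.10, ("s2", "s3"): 0.09}
    if should_build_tree(distances):
        print("samples are similar enough: build pseudo-alignment and tree")
    else:
        print("samples too divergent: report the distance matrix only")
```

In the real pipeline, the resulting pseudo-alignment is what gets passed to IQ-TREE. When the distance check fails, only the pairwise distance matrix is returned.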