How is the phylogenetic tree built?
The tree topology is estimated in a reference-free manner using the kSNP3 package. The kSNP method is explained in more detail in this manuscript. It identifies k-mers that are identical across samples except at their central position; each difference at that central position is counted as a SNP separating the samples. We use k=13 for viruses and k=19 for bacteria. Once a distance matrix has been estimated in this way, the phylogenetic tree is built using the principle of maximum parsimony.
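To make the central-character idea concrete, here is a minimal Python sketch. It is not the kSNP3 implementation: the function names are hypothetical, and it assumes one sequence per sample, ignores reverse complements, and skips any flanking context that appears with more than one central base within a sample.

```python
from collections import defaultdict


def central_kmer_map(seq, k):
    """Map each k-mer's flanks (the k-mer minus its central base)
    to the set of central bases observed with those flanks."""
    if k % 2 == 0:
        raise ValueError("k must be odd so that a central base exists")
    half = k // 2
    contexts = defaultdict(set)
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        flanks = (kmer[:half], kmer[half + 1:])
        contexts[flanks].add(kmer[half])
    return contexts


def count_snps(seq_a, seq_b, k):
    """Count flanking contexts shared by both sequences whose
    central bases differ (illustrative simplification)."""
    a = central_kmer_map(seq_a, k)
    b = central_kmer_map(seq_b, k)
    snps = 0
    for flanks in a.keys() & b.keys():
        # Assumption: only contexts with a single, unambiguous
        # central base in each sample are counted.
        if len(a[flanks]) == 1 and len(b[flanks]) == 1 and a[flanks] != b[flanks]:
            snps += 1
    return snps


# Toy example with k=5; the two sequences differ at one position.
print(count_snps("AAACCGGGTTT", "AAACCTGGTTT", 5))  # prints 1
```

Pairwise counts of this kind are what would populate a distance matrix between samples; the pipeline itself uses k=13 or k=19 rather than the toy value above.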
How should branch lengths be interpreted?
It is a peculiarity of the kSNP3 method that branch lengths have only relative meaning, not absolute meaning. Branch lengths should not be used to estimate divergence times, only to estimate tree topology.
How are NCBI references chosen?
For each sample, we recover the NCBI accession that had the most reads mapped to it (within the chosen species/genus). We include that NCBI accession on the tree for context. In principle, one NCBI reference will be added per sample, but it is possible that multiple samples share the same top NCBI match.
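The selection rule can be sketched in a few lines of Python. This is an illustration only: the input layout (a mapping from sample to per-accession read counts, already restricted to the chosen species/genus), the function names, and the accession labels are all hypothetical. Collapsing shared accessions into a single entry is also an assumption of this sketch.

```python
def top_accessions(read_counts):
    """For each sample, return the accession with the most mapped reads.

    read_counts: {sample_name: {accession: mapped_read_count}}
    Samples with no mapped reads are skipped.
    """
    return {
        sample: max(counts, key=counts.get)
        for sample, counts in read_counts.items()
        if counts
    }


def references_for_tree(read_counts):
    """Unique set of top accessions. Samples that share the same
    top match contribute one entry (assumption of this sketch)."""
    return sorted(set(top_accessions(read_counts).values()))


# Hypothetical counts: two samples share the same top match.
counts = {
    "sample_1": {"ACC_A": 120, "ACC_B": 30},
    "sample_2": {"ACC_A": 75, "ACC_C": 10},
    "sample_3": {"ACC_B": 5, "ACC_C": 40},
}
print(top_accessions(counts))
print(references_for_tree(counts))
```

Here three samples yield only two distinct references, because sample_1 and sample_2 share the same top match.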