We are now going to dive deeper into one sample's pipeline results to better understand what the analysis pipeline does and how we report the final set of taxa and contigs.
Select Patient 008 (CSF) in the Medical Detectives project to view that sample’s report page. Patient 008 went to the hospital with an unknown infection that was undiagnosable through traditional methods. The scientists used CZ ID to see if there was an infecting agent their tests had missed.
We will explain a bit of the pipeline steps here but more details can be found on our GitHub wiki and in our Pipeline Documentation.
A pipeline is a set of computational steps that are performed on your data.
Viewing Pipeline Details
To view pipeline details, select Sample Details under the Sample name. The Sample Details Panel will pop out on the right side of the page.
Once the side-panel opens, you will be able to see the Metadata and Pipeline information associated with this specific sample. The side-panel defaults to the Metadata tab; select Pipeline to view the pipeline details.
Host Filtering and Quality Control
The Host Filtering stage is the first step in our pipeline. It performs a series of filtering steps to remove human, selected host, and low-quality sequences. We also want to ensure that, regardless of the selected host, human reads have been filtered out to the best of our ability.
The percentage of reads that are removed during Host Filtering and QC tends to depend on the sample type (tissue of origin). For example, cerebrospinal fluid (CSF) samples often have > 99% of reads removed during the host filtering steps, as one expects the microbial fraction to be minimal compared to the human fraction. Stool samples, on the other hand, may have a much lower percentage of reads removed during host filtering steps, since the bulk of the genetic material is expected to be microbial rather than host-derived.
Host Filtering and Quality Control Filtering Steps
The sequence of steps referred to collectively as "Host Filtering and Quality Control" are as follows:
1. Initial host filtration using STAR
2. Trim sequencing adapters using Trimmomatic
3. Quality filter using PriceSeq
4. Identify duplicate reads using idseq-dedup (note: prior to pipeline version 6.0, CZ ID used CD-HIT-DUP for duplicate identification)
5. Filter out low-complexity sequences using LZW
6. Filter out remaining host sequences using Bowtie2
7. Subsample to 1 million fragments (reads/read-pairs) if > 1M remain after step (6)
8. Filter out human sequences, regardless of host (using STAR, Bowtie2, and GSNAP)
Passed Quality Control (QC) Value
The "Passed Quality Control" percentage represents the proportion of reads that passed sequence quality thresholds imposed during the Trimmomatic and PriceSeqFilter steps, i.e. the reads remaining after step (3) compared to the reads remaining after step (1). For Patient 008, you can see that the percentage of total reads that Passed QC was 46.14%, which is an acceptable value.
Passed Filters Value
The "Passed Filters" percentage refers to the fraction of reads that remained after step (8) compared to the number of initial reads that went in at step (1). This reflects the percentage of original sequencing reads that are sent to downstream analysis after host and quality filtering.
For Patient 008, you can see that 46.14% of reads passed QC, but only 0.82% Passed All Filters. This tells you that nearly half the reads from the sample had high sequencing quality, but only 0.82% of the total reads were non-host and used in the alignment step. This indicates that the sample contained a large majority of human reads.
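As a rough illustration, both percentages are simple ratios over the read counts at the checkpoints described above. The read counts below are hypothetical (chosen to reproduce Patient 008's percentages), not the sample's actual numbers:

```python
# Hypothetical read counts at three checkpoints of the host-filtering stage;
# step numbers refer to the filtering list above.
initial_reads = 10_000_000        # reads entering step (1)
reads_after_qc = 4_614_000        # reads remaining after step (3)
reads_after_all_filters = 82_000  # reads remaining after step (8)

passed_qc = 100 * reads_after_qc / initial_reads
passed_filters = 100 * reads_after_all_filters / initial_reads

print(f"Passed QC: {passed_qc:.2f}%")            # Passed QC: 46.14%
print(f"Passed Filters: {passed_filters:.2f}%")  # Passed Filters: 0.82%
```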
Mean Insert Size Value
The mean insert size value is computed using the Picard CollectInsertSizeMetrics Tool on the host reads, and indicates the mean length of the nucleotide sequence that is inserted between the adapters. This value can be used to provide evidence regarding the quality of the nucleic acid in the final library - short fragment sizes may indicate sample degradation or over-fragmentation during library preparation. This value is only computed for paired-end sequencing libraries generated from human hosts and will appear empty ("--") for all other sample types.
Compression Ratio Value (DCR)
The compression ratio is calculated as the total number of reads after PriceSeq divided by the number of unique reads identified by idseq-dedup. Duplicate sequences are flagged, but these reads remain present for calculation of downstream results.
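To make the formula concrete, here is the same calculation on hypothetical counts (not taken from any real sample):

```python
# Duplicate compression ratio (DCR) as defined above, with made-up counts.
reads_after_priceseq = 2_000_000  # total reads surviving the PriceSeq QC step
unique_reads = 1_250_000          # unique reads reported by idseq-dedup

dcr = reads_after_priceseq / unique_reads
print(f"DCR = {dcr:.2f}")  # DCR = 1.60; a DCR near 1 means few duplicate reads
```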
Interpreting Run Quality
Reasons why a sample may have low % Passed QC (i.e. < 10%)
The "Passed QC" percentage refers to the Trimmomatic (adapter trimming) and PriceSeq (QC) steps. It indicates the fraction of reads that came out of the PriceSeq step, compared to what went in initially. Thus, samples with low % Passed QC are likely samples which contained many reads that were removed due to the presence of adapter sequences (i.e. adapter dimers) or sequences that obtained low quality scores during base-calling.
If many reads are flagged as duplicates, you may see the number indicated in the Reads Remaining table as "(XX unique)" is much lower than the total number of reads remaining. This will correspond to a large DCR and may indicate that there was a large amount of PCR amplification during the library preparation stages, resulting in many duplicate sequences. Note that CZ ID is not intended for use with 16S sequencing libraries.
If many reads are filtered out by the LZW filter step, this would indicate that the sequencing reads are relatively low complexity - containing homopolymer nucleotides or simple sequence repeats.
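CZ ID's actual LZW filter has its own implementation and cutoff, but the underlying idea can be sketched with a toy LZW-style complexity score: low-complexity reads (homopolymers, simple repeats) compress into few codes, while complex reads do not:

```python
def lzw_ratio(seq: str) -> float:
    """Toy LZW-style complexity score: emitted codes / input length.
    Low-complexity sequences compress well and score low; this is an
    illustration, not CZ ID's actual filter or threshold."""
    table = {c: i for i, c in enumerate("ACGT")}  # seed with single bases
    w, codes = "", 0
    for c in seq:
        if w + c in table:
            w += c                   # extend the current phrase
        else:
            codes += 1               # emit a code for the known phrase
            table[w + c] = len(table)  # learn the new phrase
            w = c
    if w:
        codes += 1                   # flush the final phrase
    return codes / len(seq)

print(lzw_ratio("A" * 100))                        # homopolymer: low ratio
print(lzw_ratio("ACGTTGCAAGGCTATCGGACTTGCA" * 4))  # mixed sequence: higher ratio
```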
You can find more information on all of our pipeline steps by selecting the View Pipeline Visualization link at the top of the side-panel, or by selecting View Results Folder, found under Downloads at the bottom of the side-panel. Alternatively, you can navigate to the pipeline steps by selecting the pipeline version above the Sample Name.
Sequence assembly refers to aligning and merging short sequencing reads obtained from a longer DNA sequence in order to reconstruct the original sequence. Through the process of assembly, longer, contiguous sequences (known as contigs) are generated. Each contig is composed of several sequencing reads. CZ ID maps raw sequencing reads back to the assembled contigs to determine which reads are associated with each contig.
A microbe’s genome size can influence the number of contigs identified. For example, suppose that Sample A contained a virus known to have a genome of 5 kbp (kilo-base-pairs) and Sample B contained a bacterium known to have a genome size of 300 kbp. When analyzing your sample results, it is good to keep in mind that fewer contigs would be assembled in Sample A than Sample B, assuming a random distribution of the short reads across the original genome.
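This intuition can be made concrete with the classic Lander-Waterman approximation (a simplification that ignores sequencing errors, repeats, and assembler heuristics, and is not necessarily what CZ ID's assembler computes): with N reads of length L on a genome of size G, coverage is c = NL/G, and the expected number of contigs is roughly N * e^(-c).

```python
import math

def expected_contigs(n_reads: int, read_len: int, genome_size: int) -> float:
    """Lander-Waterman approximation: N * exp(-coverage), floored at 1."""
    coverage = n_reads * read_len / genome_size
    return max(1.0, n_reads * math.exp(-coverage))

# The same 1,000 x 150 bp reads spread randomly over each genome:
print(expected_contigs(1_000, 150, 5_000))    # Sample A virus (5 kbp): ~1 contig
print(expected_contigs(1_000, 150, 300_000))  # Sample B bacterium (300 kbp): hundreds
```

The small viral genome is covered ~30x by the same read set, so the reads merge into essentially one contig, while the bacterial genome at ~0.5x coverage fragments into many short contigs.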