Analyzing quality control (QC) metrics is an essential step in ensuring that your sequencing run was successful and that the pipeline results are dependable. QC metrics such as cluster density, Q-score, and Passed Filter % are provided by your sequencer and should be checked before uploading your samples to CZ ID. CZ ID provides additional metrics that show how the reads moved through the pipeline. While this article provides guidelines for performing QC in CZ ID, the thresholds for “good” or “bad” quality depend on factors such as sample type, collection and storage methods, wet-lab protocols, and sequencer.
Sample QC aims to answer the following questions:
- Do I have enough total reads?
- Do my samples have enough high-quality reads?
- Do my samples have enough sequencing diversity?
- Do my samples have sufficient insert lengths?
- How did my samples go through the pipeline?
To navigate to QC visualization, click the bar chart icon above "sample".
For single-end data, Total Reads is the number of reads uploaded. For paired-end data, each end counts as a read, so R1 and R2 are counted separately.
Do I have enough Total Reads?
The Total Reads histogram shows the distribution of Total Reads across all samples. The chart below shows the maximum number of reads each sequencer is capable of producing and will give you an idea of how many reads you should expect from your run.
Click on the chart to see the samples associated with each column.
The number of reads per sample depends on which sequencer is used and how many samples were run. One way to quickly identify a problem is to look for outliers - if one sample has far fewer reads than the other samples, that sample may have experienced a problem in the pooling stage.
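The outlier check described above can be sketched in a few lines of Python. This is an illustration only, not part of CZ ID: the function names, the sample counts, and the `min_fraction` cutoff of 10% of the run median are all assumptions chosen for the example, and a sensible cutoff will depend on your sequencer and run design.

```python
from statistics import median
from typing import Dict, List


def total_reads(fragments: int, paired_end: bool) -> int:
    """Count reads the way the Total Reads metric is described:
    each end of a paired-end fragment counts as its own read."""
    return fragments * 2 if paired_end else fragments


def low_read_outliers(reads_by_sample: Dict[str, int],
                      min_fraction: float = 0.1) -> List[str]:
    """Return samples whose total reads fall below min_fraction of the
    run median. min_fraction is a hypothetical cutoff, not a CZ ID value."""
    if not reads_by_sample:
        return []
    cutoff = median(reads_by_sample.values()) * min_fraction
    return sorted(name for name, n in reads_by_sample.items() if n < cutoff)


# Hypothetical paired-end run with one under-represented sample.
run = {
    "sample_A": total_reads(5_000_000, paired_end=True),
    "sample_B": total_reads(4_600_000, paired_end=True),
    "sample_C": total_reads(5_300_000, paired_end=True),
    "sample_D": total_reads(120_000, paired_end=True),
}
print(low_read_outliers(run))  # prints ['sample_D']
```

Comparing against the median rather than the mean keeps a single very large or very small sample from shifting the cutoff.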
The "Passed Quality Control" percentage represents the proportion of reads that passed the sequence quality thresholds imposed during the Trimmomatic and PriceSeqFilter steps.
Do my samples have enough high-quality reads?
The Passed QC histogram shows the distribution of Passed QC percentages across all samples.
The Passed QC metric refers to the proportion of reads that have passed the quality control filters. These filters are set to remove reads with low quality scores. During sequencing, quality scores (Q-scores) are assigned to each base to indicate the probability of a sequencing error. The higher the Q-score, the more reliable the base call. The chart below shows each quality score with its corresponding probability of an incorrect base call and its base call accuracy. For example, a Q-score of Q40 represents a 1 in 10,000 chance of a sequencing error. During the QC filter step, which uses PriceSeqFilter (part of PRICE), reads are removed if they have >10% uncalled bases (N’s) and/or if less than 85% of their bases have a call accuracy of at least 0.98. This ensures that only quality reads are analyzed and that low-quality reads do not introduce bias through incorrectly called bases. If a low percentage of reads made it through the QC filter relative to the other samples, virome, microbiome, and quantitative analyses may not be accurate, since the majority of reads were removed.
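The relationship between Q-score and error probability follows the standard Phred definition, P = 10^(-Q/10). The sketch below applies that formula and then mimics the two filter criteria described above. It is not PriceSeqFilter's implementation: the function names are hypothetical, and treating the thresholds as inclusive is an assumption.

```python
from typing import Sequence


def error_probability(q_score: float) -> float:
    """Phred definition: probability that a base call is wrong."""
    return 10 ** (-q_score / 10)


def passes_quality_filter(bases: str,
                          q_scores: Sequence[float],
                          max_n_fraction: float = 0.10,
                          min_good_fraction: float = 0.85,
                          min_accuracy: float = 0.98) -> bool:
    """Keep a read only if at most 10% of its bases are N and at least
    85% of its bases have a call accuracy of 0.98 or higher."""
    if not bases or len(bases) != len(q_scores):
        raise ValueError("bases and q_scores must be non-empty and equal length")
    n_fraction = bases.upper().count("N") / len(bases)
    good = sum(1 for q in q_scores if 1 - error_probability(q) >= min_accuracy)
    good_fraction = good / len(bases)
    return n_fraction <= max_n_fraction and good_fraction >= min_good_fraction


print(f"Q40 error probability: {error_probability(40):.4f}")
print(passes_quality_filter("ACGT" * 5, [30] * 20))             # all bases at Q30
print(passes_quality_filter("ACGT" * 5, [30] * 16 + [10] * 4))  # only 80% good bases
print(passes_quality_filter("NNN" + "A" * 17, [30] * 20))       # 15% uncalled bases
```

An accuracy of 0.98 corresponds to an error probability of 0.02, which is roughly Q17, so in this sketch a base at Q17 or above counts as a good call.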
DCR (duplicate compression ratio)
The Duplicate Compression Ratio (DCR) is the ratio of the total number of sequences present before duplicate identification to the number of unique sequences. Duplicate identification is done with the pipeline tool czid-dedup for pipeline versions above 6.0 and with CD-HIT-DUP for pipeline versions below 6.0.
Are there too many duplicate reads in my library?
The Duplicate Compression Ratio (DCR) indicates the sequence diversity of the library. If the sample contains many duplicate reads, the library is less diverse, possibly indicating bias due to amplification or sequencing. Duplicate sequences may be caused by biased PCR enrichment, or they may reflect a true biological phenomenon. Wet-lab methods will influence the DCR. For instance, if samples were prepared with an enrichment method such as MSSPE, the DCR would be high because specific sequences are enriched. If doing metagenomics without enrichment, a DCR value less than 2 is ideal.
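As a worked example of the ratio, the sketch below computes a DCR from a small list of hypothetical reads. It assumes duplicates are reads with identical sequences. That is a simplification for illustration, and the actual duplicate identification tools may use different criteria.

```python
from typing import Sequence


def duplicate_compression_ratio(sequences: Sequence[str]) -> float:
    """Total number of sequences divided by the number of unique sequences.
    Duplicates are defined here as exact string matches (an assumption)."""
    if not sequences:
        raise ValueError("at least one sequence is required")
    return len(sequences) / len(set(sequences))


# Six hypothetical reads, four of them unique.
reads = ["ACGTACGT", "ACGTACGT", "TTGACCAA", "GGCATCGA", "ACGTACGT", "CCATGGTA"]
dcr = duplicate_compression_ratio(reads)
print(dcr)  # prints 1.5
print("below the guideline of 2" if dcr < 2 else "2 or higher: check for amplification bias")
```

A library with no duplicates has a DCR of exactly 1, and the value grows as more reads collapse onto the same sequence.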
Mean Insert Size
The mean insert size is the average length of the nucleotide sequence that is inserted between adapters (see figure below). It is different from the fragment size, which includes the adapter sequences. The mean insert size value is computed using the Picard CollectInsertSizeMetrics tool on the host reads and is only computed for paired-end sequencing libraries generated from human hosts.
Do my samples have sufficient insert lengths?
This value can be used to provide evidence regarding the quality of the nucleic acid in the final library - short insert sizes may indicate sample degradation or over-fragmentation during library preparation. For sequencing libraries generated from non-human hosts, this value will appear empty.
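To make the distinction between insert size and fragment size concrete, the sketch below subtracts a combined adapter length from a few fragment lengths and averages the result. It is simple arithmetic on hypothetical numbers, not how Picard CollectInsertSizeMetrics derives the value, and the adapter length used is an assumption.

```python
from typing import Sequence


def mean_insert_size(fragment_lengths: Sequence[int], adapter_total: int) -> float:
    """Average insert length, where insert = fragment - combined adapter length.
    adapter_total is the summed length of both adapters (hypothetical value)."""
    if not fragment_lengths:
        raise ValueError("at least one fragment length is required")
    inserts = [length - adapter_total for length in fragment_lengths]
    if any(insert <= 0 for insert in inserts):
        raise ValueError("fragment shorter than or equal to its adapters")
    return sum(inserts) / len(inserts)


# Hypothetical library: three fragments, 120 bp of adapter sequence in total.
print(mean_insert_size([420, 450, 480], adapter_total=120))  # prints 330.0
```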
The steps referred to collectively as "Host Filtering and Quality Control" are as follows:
1. Initial host filtration using STAR
2. Trim sequencing adapters using Trimmomatic
3. Quality filter using PriceSeqFilter
4. Identify duplicate reads using czid-dedup (note: prior to pipeline version XX, CZ ID used CD-HIT-DUP for duplicate identification)
5. Filter out low complexity sequences using LZW
6. Filter out remaining host sequences using Bowtie2
7. Subsample to 1 million fragments (reads/read-pairs) if > 1M remain after step 6
8. Filter out human sequences, regardless of host (using STAR, Bowtie2, GSNAP)
Note: The Host Filtering stage is the first step in our pipeline. It performs a series of filtering steps to remove human, selected host, and low-quality sequences. We also want to ensure that, regardless of the selected host, human reads have been filtered out to the best of our ability.
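The subsampling step in the list above can be illustrated with a short sketch. The article does not state how CZ ID selects which fragments to keep, so uniform random sampling with a fixed seed is an assumption here, as are the function and parameter names. A fragment stands for a single read or a read pair.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")

MAX_FRAGMENTS = 1_000_000  # cap described in the step list


def subsample_fragments(fragments: Sequence[T],
                        max_fragments: int = MAX_FRAGMENTS,
                        seed: int = 0) -> List[T]:
    """Return all fragments if there are max_fragments or fewer; otherwise
    return a uniform random subset of exactly max_fragments, keeping the
    original order. Uniform sampling is an assumption for illustration."""
    if len(fragments) <= max_fragments:
        return list(fragments)
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    keep = sorted(rng.sample(range(len(fragments)), max_fragments))
    return [fragments[i] for i in keep]


# Small demonstration with a cap of 4 instead of 1 million.
demo = [f"fragment_{i}" for i in range(10)]
print(subsample_fragments(demo, max_fragments=4))
print(len(subsample_fragments(demo)))  # under the default cap, so nothing is dropped
```

Sampling whole fragments rather than individual reads keeps R1 and R2 of a pair together.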
How were my samples processed through the pipeline?
The reads lost graph shows at what stage of the pipeline reads are filtered. The “reads remaining” portion of the graph indicates the number of reads that passed host filtering and QC steps and went into the pathogen identification portion of the pipeline. The value for “reads remaining” will depend on numerous variables including (but not limited to):
- The type of sample material - cerebrospinal fluid (CSF) will contain more host reads than feces, resulting in fewer reads remaining after host filtering. For example, CSF samples often have > 99% of reads removed during the host filtering steps, since the microbial fraction is expected to be minimal compared to the human fraction. Stool samples, on the other hand, may have a much lower percentage of reads removed during host filtering, since the bulk of the genetic material is expected to be microbial rather than host-derived.
- Whether or not the host reads were filtered - if the host did not have a sequenced genome and “ERCC only” was selected for the host genome, then no reads would be lost during host filtration, thus resulting in a greater number of reads remaining.
- The quality of the reads - if many of the reads were filtered as low-quality, there would be a smaller number of reads remaining.
- The storage conditions for the sample - if the sample was not stored properly, the nucleic acid may have degraded resulting in many short sequences that may be filtered out during adapter trimming, resulting in fewer reads remaining.
- The number of duplicate reads - if many reads are flagged as duplicates during the "identify duplicates" step, the DCR will be large, which may indicate a large amount of PCR amplification during library preparation. Note that CZ ID is not intended for use with 16S sequencing libraries.
- The sequence complexity of the reads - if many reads are filtered out by the LZW filter step, the sequencing reads are relatively low complexity, containing homopolymer runs or simple sequence repeats.
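The accounting behind a reads lost graph can be sketched as follows: given the number of reads remaining after each stage, the reads lost at a stage is the difference from the previous stage. The stage labels follow the step list in this article, but every count below is hypothetical and the function is an illustration, not CZ ID code.

```python
from typing import List, Sequence, Tuple


def reads_lost_by_stage(total_reads: int,
                        remaining_after: Sequence[Tuple[str, int]]
                        ) -> Tuple[List[Tuple[str, int]], int]:
    """Given ordered (stage, reads remaining after stage) pairs, return the
    reads lost at each stage and the final number of reads remaining."""
    lost = []
    previous = total_reads
    for stage, remaining in remaining_after:
        if remaining > previous:
            raise ValueError(f"{stage}: more reads remaining than entered the stage")
        lost.append((stage, previous - remaining))
        previous = remaining
    return lost, previous


# Hypothetical host-rich sample with 10 million total reads.
stages = [
    ("STAR host filter", 1_200_000),
    ("Trimmomatic", 1_150_000),
    ("PriceSeqFilter", 1_100_000),
    ("Identify duplicates", 900_000),
    ("LZW filter", 880_000),
    ("Bowtie2 host filter", 850_000),
]
lost, reads_remaining = reads_lost_by_stage(10_000_000, stages)
for stage, n in lost:
    print(f"{stage}: {n} reads lost ({n / 10_000_000:.1%} of total)")
print(f"reads remaining: {reads_remaining}")
```

In this made-up sample most reads are lost at the first host filter, which is the pattern the sample material bullet describes for host-rich samples such as CSF.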