Sample QC

Jump to Section:

Evaluating Sample Quality
Total Reads
Passed QC
Duplicate Compression Ratio (DCR)
Mean Insert Size
Reads Lost

Overview

Analyzing quality control (QC) metrics is an essential step to ensure that your sequencing run was successful and that the pipeline results are dependable. QC metrics such as cluster density, Q-score, and Passed Filter % are provided by your sequencer and should be checked before you upload your samples to CZ ID. CZ ID provides additional metrics that show how the reads were processed through the pipeline.

Here we provide guidelines for performing QC in CZ ID. Please keep in mind that the thresholds for “good” or “bad” quality are dependent on things outside of the pipeline (e.g., sample type, collection and storage methods, wet-lab protocols, and sequencer).

After reading this guide, you will:

Learn how to find sample QC information in CZ ID
Understand QC metrics provided by CZ ID and their relevance

Evaluating Sample Quality

Sample QC aims to answer the following questions:

Do I have enough total reads?
Do my samples have enough high-quality reads?
Do my samples have enough sequencing diversity?
Do my samples have sufficient insert lengths?
How did my samples go through the pipeline?

CZ ID provides a Project-level Quality Control (PLQC) Visualization page where you can see various metrics and evaluate the questions listed above. To navigate to the PLQC Visualization page, go to the Project page of interest and click the QC icon (bar chart) below the "Metagenomics" tab.

Clicking the QC icon will lead to the PLQC Visualization page

Example PLQC Visualization page

In the following sections we describe each of the metrics summarized by the graphs on the PLQC Visualization page. The graphs include histograms for Total Reads, Passed QC, Duplicate Compression Ratio (DCR), and Mean Insert Size. Additionally, a stacked bar graph for Reads Lost shows the reads filtered out at each pipeline step. Each graph is interactive: hover over or click a bar to see details about the samples it includes.

To view samples represented within each histogram bar:

Click the bar of interest
See sample details for the selected range on the right-hand panel.

Total Reads

The Total Reads metric represents the number of reads uploaded to the CZ ID pipeline. Note that, for paired-end data, R1 and R2 reads are each counted as one read, so a single read pair contributes two reads to the total.

Do I have enough Total Reads?

The Total Reads histogram shows the distribution of total reads across all samples. You can click each bar of the histogram to see the samples that fall within a given range of total reads.

Example histogram showing the distribution of total reads across samples.

The number of reads per sample will depend on which sequencer is used and how many samples were loaded on a single run. The table below shows the maximum number of reads each sequencer is capable of producing to give you an idea of how many reads you should expect from your run. One way to quickly identify a problem is to look for outliers. If one sample has far fewer reads than the other samples, that sample may have experienced a problem in the pooling stage.

Sequencer	Max Reads (million)	Max Read Length (bp)
MiSeq	22 - 25	2 x 300
MiSeq Micro	4	2 x 150
iSeq 100	4	2 x 150
HiSeq 4000	250 - 400	2 x 150
NovaSeq 6000 SP	325 - 400	2 x 250
NovaSeq 6000 S1	750 - 800	2 x 150
NovaSeq 6000 S2	1650 - 2050	2 x 150
NovaSeq 6000 S4	2000 - 2500	2 x 150
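
The outlier check described above can be sketched in a few lines of Python. This is only an illustration: the function name, the sample names, the read counts, and the cutoff of 10% of the project median are all assumptions, not CZ ID thresholds. A sensible cutoff depends on your sequencer and on how many samples were pooled.

```python
from statistics import median


def flag_low_read_samples(total_reads, fraction_of_median=0.1):
    """Return names of samples whose total reads fall far below the project median.

    total_reads: dict mapping sample name -> total read count.
    fraction_of_median: illustrative cutoff (an assumption, not a CZ ID value).
    """
    if not total_reads:
        return []
    cutoff = median(total_reads.values()) * fraction_of_median
    return sorted(name for name, count in total_reads.items() if count < cutoff)


# Hypothetical read counts for four samples pooled on one run.
example = {
    "sample_A": 21_500_000,
    "sample_B": 19_800_000,
    "sample_C": 22_100_000,
    "sample_D": 350_000,
}
print(flag_low_read_samples(example))
```

Using the median rather than the mean keeps a single failed sample from dragging the reference value down.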

Passed QC

The Passed QC metric represents the percentage of reads remaining after QC filtering using fastp to remove low quality bases, short reads (< 35 bp), and low complexity reads.

Do my samples have enough high-quality reads?

The Passed QC histogram shows the distribution of reads that passed QC across all samples.

Example histogram showing the percentage of reads passing QC across samples.

The Passed QC metric refers to the percentage of reads that passed the quality control filters. During sequencing, Phred quality scores (Q-scores) are assigned to each base to indicate the probability of a sequencing error. The higher the Q-score, the more reliable the base call. The table below shows Phred quality scores with the corresponding probability of an incorrect base call and the base call accuracy. For example, a Q-score of 40 (Q40) represents a 1 in 10,000 chance of an incorrect base call. The QC filtering step using fastp ensures that only good-quality reads are analyzed and minimizes the low-quality reads that may introduce bias and unreliable results. If a sample has a low percentage of reads passing QC filtering relative to the other samples, virome, microbiome, and quantitative analyses may be misleading because many of its reads were removed.

Phred Quality Score	Probability of an Incorrect Basecall	Basecall accuracy (%)
10	1 in 10	90
20	1 in 100	99
30	1 in 1000	99.9
40	1 in 10000	99.99
50	1 in 100000	99.999
60	1 in 1000000	99.9999
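
The values in the table follow the standard Phred relationship Q = -10 * log10(P), where P is the probability of an incorrect base call. The short Python sketch below reproduces the table rows; the function names are chosen here for illustration only.

```python
import math


def phred_to_error_probability(q):
    """Probability of an incorrect base call for Phred score q."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Phred score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


def accuracy_percent(q):
    """Base call accuracy, in percent, for Phred score q."""
    return 100 * (1 - phred_to_error_probability(q))


for q in (10, 20, 30, 40, 50, 60):
    print(q, phred_to_error_probability(q), accuracy_percent(q))
```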

Duplicate Compression Ratio

The Duplicate Compression Ratio (DCR) is the ratio of the number of reads passing QC filtering and host/human read removal to the number of unique reads after duplicate removal.

Are there too many duplicate reads in my library?

The Duplicate Compression Ratio (DCR) is an indicator of sequence diversity in your sample. A sample that contains many duplicate reads has a less diverse library, possibly indicating bias due to amplification or sequencing. Duplicate sequences may result from biased PCR enrichment, or they may reflect a true biological phenomenon. Wet-lab methods will influence the DCR. For instance, if samples were prepared with an enrichment method such as MSSPE, the DCR would be high because specific sequences are enriched. For metagenomics without enrichment, a DCR value less than 2 is ideal.

Example histogram showing DCR distribution across samples.
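
As a worked illustration of the DCR definition above, the following Python sketch divides the number of reads entering duplicate removal by the number of unique reads. The function names are hypothetical, and the second helper treats only exact sequence matches as duplicates, which is a simplifying assumption and may differ from how czid-dedup identifies duplicates.

```python
def duplicate_compression_ratio(reads_before_dedup, unique_reads):
    """DCR = reads passing QC and host removal / unique reads after duplicate removal."""
    if unique_reads <= 0:
        raise ValueError("unique_reads must be positive")
    return reads_before_dedup / unique_reads


def dcr_from_sequences(sequences):
    """Estimate DCR from sequences, counting exact matches as duplicates (assumption)."""
    seqs = list(sequences)
    if not seqs:
        raise ValueError("no sequences provided")
    return len(seqs) / len(set(seqs))


# Hypothetical counts: 1.5 million reads collapse to 1 million unique reads.
print(duplicate_compression_ratio(1_500_000, 1_000_000))
```

A DCR of 1 means no duplicates were found; larger values mean more of the library consists of repeated sequences.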

Mean Insert Size

The mean insert size is the average length of the nucleotide sequence that is inserted between sequencing adapters (see figure below). The insert size is different from the fragment size, which includes the adapter sequences. The mean insert size value is computed using the Picard CollectInsertSizeMetrics Tool on the host reads and is only computed for paired-end sequencing libraries generated from human hosts.

Do my samples have sufficient insert lengths?

Mean insert size can be used as an indicator of nucleic acid quality in the final library. Short insert sizes may indicate sample degradation or over-fragmentation during library preparation. Note that this value is only calculated for samples collected from human hosts. Therefore, the Mean Insert Size graph will be blank for samples from non-human hosts.

Example histogram showing insert size distribution across samples.
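
To make the idea of a mean insert size concrete, the sketch below averages the template length (TLEN, the ninth SAM column) over properly paired alignments. It is a simplified illustration and not the Picard CollectInsertSizeMetrics algorithm, which applies additional handling such as outlier trimming. The function name and the example records are hypothetical.

```python
def mean_insert_size_from_sam(sam_lines):
    """Average TLEN over properly paired primary alignments in SAM text.

    Only records with a positive TLEN are used, so each pair is counted once.
    Returns None when no usable pair is found.
    """
    sizes = []
    for line in sam_lines:
        if line.startswith("@"):  # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:  # SAM records have 11 mandatory columns
            continue
        flag = int(fields[1])
        tlen = int(fields[8])
        is_proper_pair = bool(flag & 0x2)
        is_secondary_or_supplementary = bool(flag & (0x100 | 0x800))
        if is_proper_pair and not is_secondary_or_supplementary and tlen > 0:
            sizes.append(tlen)
    return sum(sizes) / len(sizes) if sizes else None


# Two hypothetical read pairs with template lengths of 200 and 300 bp.
records = [
    "@HD\tVN:1.6",
    "pair1\t99\tchr1\t100\t60\t50M\t=\t250\t200\t*\t*",
    "pair1\t147\tchr1\t250\t60\t50M\t=\t100\t-200\t*\t*",
    "pair2\t99\tchr1\t500\t60\t50M\t=\t750\t300\t*\t*",
    "pair2\t147\tchr1\t750\t60\t50M\t=\t500\t-300\t*\t*",
]
print(mean_insert_size_from_sam(records))
```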

Reads Lost

The first stage of the pipeline includes host filtering and quality control. During this stage, the pipeline performs a series of filtering steps to remove human, selected host, and low-quality sequences. Regardless of the selected host, human reads are filtered out to the best of our ability.

The sequence of steps referred to collectively as "Host Filtering and Quality Control" is as follows for samples uploaded to projects created prior to April 19, 2023:

1. Initial host filtration using STAR

2. Trim sequencing adapters using Trimmomatic

3. Quality filter using PriceSeq

4. Identify duplicate reads using czid-dedup (note: prior to pipeline version 6.0, CZ ID used CD-HIT-DUP for duplicate identification)

5. Filter out low complexity sequences using LZW

6. Filter out remaining host sequences using Bowtie2

7. Subsampling to 1 million fragments (reads/read-pairs) if > 1M remain after step (6)

8. Filter out human sequences, regardless of host (using STAR, Bowtie2, and GSNAP)
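
The subsampling in step 7 operates on fragments, where a fragment is either a single read or a read pair, so the R1 and R2 reads of a pair are kept or dropped together. A minimal Python sketch of that idea follows. The function name, the fixed random seed, and the use of random.sample are assumptions made for illustration; the pipeline's own implementation is not described here.

```python
import random


def subsample_fragments(fragments, max_fragments=1_000_000, seed=0):
    """Randomly keep at most max_fragments fragments, preserving input order.

    Each element of fragments is one unit: a single read, or an (R1, R2) tuple
    for paired-end data, so pairs are never split.
    """
    fragments = list(fragments)
    if len(fragments) <= max_fragments:
        return fragments  # nothing to do when at or below the limit
    rng = random.Random(seed)  # fixed seed makes the sketch reproducible
    keep = sorted(rng.sample(range(len(fragments)), max_fragments))
    return [fragments[i] for i in keep]


# Hypothetical paired-end fragments, subsampled to 3 for demonstration.
pairs = [(f"read{i}/1", f"read{i}/2") for i in range(10)]
print(subsample_fragments(pairs, max_fragments=3))
```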

If samples are uploaded to projects created on or after April 19, 2023, the "Host Filtering and Quality Control" steps will implement tools designed to run faster on cloud instances and incorporate best practices from the peer-reviewed literature. Click here for details.

How were my samples processed through the pipeline?

The Reads Lost graph shows the pipeline stage at which reads are filtered out. The “reads remaining” portion of the graph indicates the number of reads that passed the host filtering and QC steps and went into the analysis and pathogen identification portion of the pipeline. The value for “reads remaining” will depend on numerous variables including (but not limited to):

Type of sample material - For example, samples from cerebrospinal fluid (CSF) will contain more host reads than a fecal sample. Therefore, CSF samples will result in fewer reads remaining after host filtering compared to fecal samples. CSF samples often have > 99% of reads removed during the host filtering steps, as one expects the microbial fraction to be minimal compared to the human fraction. Stool samples, on the other hand, may have a much lower percentage of reads removed during host filtering steps, since the bulk of the genetic material is expected to be microbial rather than host-derived. Remember that, regardless of the selected host, human reads are filtered out.
Whether or not host reads were filtered - If the host did not have a sequenced genome and “ERCC only” was selected for the host genome when uploading the sample to CZ ID, then no reads would be lost during host filtration. No host filtering will result in a greater number of reads remaining.
Read quality - If many of the reads were filtered out due to low quality, fewer reads would remain.
Sample storage conditions - If the sample was not stored properly, the nucleic acid may have degraded resulting in many short sequences that may be filtered out during adapter trimming. This would decrease the number of reads passing QC filtering.
Duplication level - If many reads are flagged as duplicates during the "identify duplicates" step, the DCR will be large. High duplication levels may indicate that there was a large amount of PCR amplification during the library preparation stages, resulting in many duplicate sequences. Note that CZ ID is not intended for use with amplicon libraries (e.g., 16S sequencing).
Read complexity - If many reads are filtered out during low-complexity filtering, this would indicate that the sequencing reads had relatively low complexity (e.g., they contain a high proportion of homopolymer nucleotides or simple sequence repeats).
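
To illustrate why homopolymers and simple repeats count as low complexity, the Python sketch below scores a read by how well it compresses with a basic LZW scheme: the fewer codes needed per base, the more repetitive the read. This is a generic illustration of the LZW idea only. The function names are hypothetical, and the scoring and threshold that CZ ID applies in its LZW step are not reproduced here.

```python
def lzw_code_count(sequence):
    """Number of codes a basic LZW compressor emits for the sequence."""
    # Start with one dictionary entry per distinct character.
    dictionary = {ch: i for i, ch in enumerate(sorted(set(sequence)))}
    codes = 0
    current = ""
    for ch in sequence:
        candidate = current + ch
        if candidate in dictionary:
            current = candidate  # keep extending the known phrase
        else:
            codes += 1  # emit the code for the current phrase
            dictionary[candidate] = len(dictionary)
            current = ch
    if current:
        codes += 1  # emit the final phrase
    return codes


def lzw_complexity(sequence):
    """Codes per base: lower values indicate a more repetitive sequence."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    return lzw_code_count(sequence) / len(sequence)


homopolymer = "A" * 40
mixed = "ACGTTGCAAGCTTAGGCATCGATCCGTAAGTCTGACCATG"
print(lzw_complexity(homopolymer), lzw_complexity(mixed))
```

The homopolymer scores lower than the mixed sequence of the same length, which is the property a complexity filter relies on.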

Example of Reads Lost graph showing how many reads were lost during each step of data preprocessing. In this example, the largest proportion of reads was lost during the host filtering step for most samples.
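
The stacked bars in the Reads Lost graph can be thought of as the differences between the read counts remaining after consecutive steps. The Python sketch below shows that bookkeeping. The step labels and counts are hypothetical values chosen for illustration, not output from the pipeline.

```python
def reads_lost_per_step(total_reads, remaining_after_step):
    """Return [(label, reads)] with reads lost at each step plus the reads remaining.

    remaining_after_step: list of (step_name, reads_remaining) in pipeline order.
    """
    segments = []
    previous = total_reads
    for step, remaining in remaining_after_step:
        if remaining > previous:
            raise ValueError(f"read count increased at step {step!r}")
        segments.append((step, previous - remaining))
        previous = remaining
    segments.append(("reads remaining", previous))
    return segments


# Hypothetical counts for one sample dominated by host reads.
total = 10_000_000
remaining = [
    ("host filtering", 1_200_000),
    ("adapter trimming", 1_150_000),
    ("quality filtering", 1_000_000),
    ("duplicate removal", 800_000),
    ("low-complexity filtering", 780_000),
    ("remaining host filtering", 700_000),
]
for label, reads in reads_lost_per_step(total, remaining):
    print(f"{label}: {reads} ({100 * reads / total:.1f}%)")
```

Because every read is either lost at exactly one step or remains at the end, the segments always sum to the total reads.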
