Verify FASTQ File Downloads

Overview

FASTQ files store raw DNA sequencing reads along with their quality scores. Corruption or truncation can render these files unusable or silently introduce errors into downstream analysis. You can verify that downloaded FASTQ files are neither truncated nor corrupted by using user-friendly software designed for FASTQ file manipulation or simple command-line tools. In this guide, we use SeqKit and a few bash and PowerShell commands to obtain FASTQ file statistics and evaluate file integrity. At the end, we provide a quick checklist for verifying original input FASTQ files downloaded from CZ ID.

After reading this guide, you will learn how to:

Find the number of expected reads for input files
Obtain file statistics using SeqKit
Check files using bash commands

Expected Number of Reads

When checking file integrity, we compare the number of reads we have against the number of reads we should have. The first step is finding the expected number of reads for the downloaded files. For original input files, you can find this number in the Total Reads column on the Project page. It reflects the total number of reads uploaded to CZ ID, including both forward (R1) and reverse (R2) reads. For example, if Total Reads for paired-end data is 50k, each of the R1 and R2 files contains 25k reads. If you don't see a Total Reads column in the Samples table, you can add it by selecting it from the "+" dropdown menu on the right-hand side of the table.

Note: For input FASTQ files containing more than 150M reads, the read count of the original file will not be visible, because CZ ID truncates uploaded files to 150M reads and caps the Total Reads number at 150M.
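The paired-end arithmetic described above can be sketched as a small shell helper. The function name expected_reads_per_file and its "single"/"paired" argument are hypothetical and used for illustration only; they are not part of CZ ID or SeqKit.

```shell
# Expected reads per file, given the Total Reads value shown in CZ ID.
# Hypothetical helper for illustration.
# Usage: expected_reads_per_file TOTAL_READS LAYOUT   (LAYOUT: single | paired)
expected_reads_per_file() {
  if [ "${2:-paired}" = "paired" ]; then
    # Total Reads counts R1 and R2 together, so each file holds half.
    echo $(( $1 / 2 ))
  else
    echo "$1"
  fi
}

expected_reads_per_file 50000 paired   # prints 25000
```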

Obtain File Stats with SeqKit

SeqKit is a fast, easy-to-use toolkit for FASTA/Q file manipulation. It is compatible with multiple operating systems, including Mac, Linux, and Windows, and can be used without additional dependencies or pre-configuration. Installation instructions can be found on SeqKit's Download page.

The easiest way to check for file integrity is to run the seqkit stats command:

seqkit stats sample.fastq.gz

Outputs:

Statistics for intact files: SeqKit will output a clean table containing the sequence format, the total read count (the num_seqs column), and the length distribution. Compare the number of reads in the downloaded file to the expected number of total reads. For paired-end data, check that the downloaded R1 and R2 files have the same number of reads.
Error for truncated files: SeqKit will fail to parse the incomplete record at the end of a truncated file and will print an explicit error message to your terminal: [ERRO] sample.fastq.gz: fastx: bad fastq format.
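If you want to script the comparison against the expected read count, one possible approach is to request tab-separated output with seqkit stats -T and read the num_seqs column. The sketch below assumes SeqKit is on your PATH; the function names num_seqs_from_stats and check_read_count are hypothetical helpers, not SeqKit commands.

```shell
# Extract the num_seqs value from `seqkit stats -T` (tab-separated) output on stdin.
# The column is located by its header name rather than by position.
num_seqs_from_stats() {
  awk -F '\t' '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == "num_seqs") col = i; next }
    col     { gsub(/,/, "", $col); print $col }   # drop thousands separators, if any
  '
}

# Compare a file's read count with the expected value (hypothetical helper).
# Usage: check_read_count FILE EXPECTED_READS
check_read_count() {
  crc_actual=$(seqkit stats -T "$1" | num_seqs_from_stats)
  if [ "$crc_actual" = "$2" ]; then
    echo "OK: $1 has $crc_actual reads"
  else
    echo "MISMATCH: $1 has ${crc_actual:-no parsable} reads, expected $2" >&2
    return 1
  fi
}

# Example (requires SeqKit): check_read_count sample_R1.fastq.gz 25000
```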

If you detect a truncated file and want to salvage the intact reads that came before the corruption point, you can use the seqkit sana command:

seqkit sana sample.fastq.gz -o repaired.fastq.gz

Output:

Repaired files: SeqKit will output a FASTQ file after removing the truncated fragments. Note that this file will be functional but is incomplete relative to the original file.

Bash Check

You can use basic commands in Linux-like and Windows environments to count the number of lines within downloaded files. The number of lines in a given FASTQ file should be divisible by four.

Number of lines divisible by 4: Every sequencing read has its required 4 lines (header, sequence, "+" separator line, and quality scores), so the file structure is likely intact.
Number of lines NOT divisible by 4: The file is truncated or corrupted (i.e., a read was cut off "mid-sentence").

Note that dividing the number of lines by four gives the total number of reads in the file. Compare the number of reads in the downloaded file to the expected number of total reads. For paired-end data, check that the downloaded R1 and R2 files have the same number of reads.
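The divisibility check and the lines-to-reads conversion can be combined in a short shell function. This is a minimal sketch: count_reads is a hypothetical name, and it assumes the file ends with a newline, since wc -l counts newline characters.

```shell
# Print the number of reads in a FASTQ file (plain or gzip-compressed).
# Returns non-zero if the line count is not a multiple of 4.
# Usage: count_reads FILE
count_reads() {
  case "$1" in
    *.gz) cr_lines=$(gzip -cd "$1" 2>/dev/null | wc -l) ;;
    *)    cr_lines=$(wc -l < "$1") ;;
  esac
  cr_lines=$(( cr_lines + 0 ))   # normalize any whitespace padding from wc
  if [ $(( cr_lines % 4 )) -ne 0 ]; then
    echo "$1: $cr_lines lines, not divisible by 4 (truncated or corrupted?)" >&2
    return 1
  fi
  echo $(( cr_lines / 4 ))
}

# Example: count_reads sample.fastq.gz
```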

Linux/Mac

Command for computing number of lines in uncompressed FASTQ file (".fq", ".fastq"):

wc -l sample.fastq

Command for computing number of lines in compressed FASTQ file (".fq.gz", ".fastq.gz"), which decompresses to standard output and pipes the result to wc -l:

gzip -cd sample.fastq.gz | wc -l

Alternatively, you can test the integrity of the compressed file itself to make sure the gzip stream is complete. If the following command prints "unexpected end of file" or any other output, the file is truncated or corrupted. No output means that the file is intact.

gzip -t sample.fastq.gz
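When several files were downloaded, the same test can be looped over all of them. The sketch below relies on the exit status of gzip -t (zero for an intact file, non-zero otherwise); check_gzip_integrity is a hypothetical helper name.

```shell
# Run `gzip -t` on each file given and report which ones fail.
# Returns non-zero if at least one file fails the test.
# Usage: check_gzip_integrity FILE.gz [FILE.gz ...]
check_gzip_integrity() {
  cgi_failed=0
  for cgi_file in "$@"; do
    if gzip -t "$cgi_file" 2>/dev/null; then
      echo "intact: $cgi_file"
    else
      echo "truncated or corrupted: $cgi_file"
      cgi_failed=1
    fi
  done
  return "$cgi_failed"
}

# Example: check_gzip_integrity *.fastq.gz
```

Keep in mind that gzip -t only validates the compression layer. A file that was already incomplete before it was compressed will still pass, so the line-count check remains useful.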

Windows

Use the following PowerShell command to count lines in uncompressed FASTQ files (".fq", ".fastq"):

gc sample.fastq | measure -l

Command for computing number of lines in compressed FASTQ file (".fq.gz", ".fastq.gz"):

tar -xf sample.fastq.gz -O | measure -l

Summary: Quick Checklist

Use this checklist each time you download original input FASTQ files from CZ ID.

Before you start:

Find the expected number of reads in the Total Reads column on your Project page.

When using SeqKit:

Run seqkit stats to obtain a full report.
If the file is truncated, run seqkit sana to salvage the intact reads.

When using Bash commands (Linux/Mac):

Run wc -l sample.fastq (uncompressed files) or gzip -cd sample.fastq.gz | wc -l (compressed files) to confirm the line count is divisible by 4.
Run gzip -t to confirm no truncation errors.

When using PowerShell commands (Windows):

Run gc sample.fastq | measure -l (uncompressed files) or tar -xf sample.fastq.gz -O | measure -l (compressed files) to confirm the line count is divisible by 4.
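For paired-end downloads on Linux/Mac, the checklist can be strung together in one script. This is a sketch under stated assumptions, not an official CZ ID tool: reads_in and verify_pair are hypothetical names, and the expected total is the Total Reads value you looked up beforehand.

```shell
# Count reads in one FASTQ file (plain or .gz).
# Fails if the gzip test fails or the line count is not a multiple of 4.
reads_in() {
  case "$1" in
    *.gz) gzip -t "$1" 2>/dev/null || return 1
          ri_lines=$(gzip -cd "$1" | wc -l) ;;
    *)    ri_lines=$(wc -l < "$1") ;;
  esac
  [ $(( ri_lines % 4 )) -eq 0 ] || return 1
  echo $(( ri_lines / 4 ))
}

# Check both files, compare R1 with R2, then compare the sum with the expected total.
# Usage: verify_pair R1_FILE R2_FILE EXPECTED_TOTAL_READS
verify_pair() {
  vp_r1=$(reads_in "$1") || { echo "FAIL: $1 is truncated or corrupted" >&2; return 1; }
  vp_r2=$(reads_in "$2") || { echo "FAIL: $2 is truncated or corrupted" >&2; return 1; }
  if [ "$vp_r1" -ne "$vp_r2" ]; then
    echo "FAIL: R1 has $vp_r1 reads but R2 has $vp_r2" >&2
    return 1
  fi
  if [ $(( vp_r1 + vp_r2 )) -ne "$3" ]; then
    echo "FAIL: found $(( vp_r1 + vp_r2 )) reads, expected $3" >&2
    return 1
  fi
  echo "PASS: $vp_r1 reads in each file, $3 in total"
}

# Example: verify_pair sample_R1.fastq.gz sample_R2.fastq.gz 50000
```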
