Overview
FASTQ files store raw DNA sequencing reads along with their quality scores. Corruption or truncation can render these files unusable or silently introduce errors into downstream analysis. You can verify that downloaded FASTQ files are neither truncated nor corrupted by using software designed for FASTQ file manipulation or simple bash commands. In this guide, we use SeqKit and list bash command options to obtain FASTQ file statistics and evaluate file integrity. At the end, we provide a quick checklist for verifying original input FASTQ files downloaded from CZ ID.
After reading this guide, you will know how to:
- Find the number of expected reads for input files
- Obtain file statistics using SeqKit
- Check files using bash commands
Expected Number of Reads
When checking file integrity, we compare the reads we have against the reads we should have. The first step is to find the expected number of reads for the downloaded files. For original input files, this number appears in the Total Reads column on the Project page. It reflects the total number of reads uploaded to CZ ID, including forward (R1) and reverse (R2) reads. For example, if the Total Reads number for paired-end data is 50k, there are 25k reads in each of the R1 and R2 files. If you don't see a Total Reads column in the Samples table, you can add it by selecting it from the "+" dropdown menu on the right-hand side of the table.
Note: For input FASTQ files containing >150M reads, the number of reads in the original file will not be visible, because CZ ID truncates uploaded files to 150M reads and the Total Reads number is therefore capped at 150M.
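The paired-end arithmetic above can be sketched as a small shell helper. The function name expected_reads_per_file and its two arguments are hypothetical, chosen only for illustration; the sketch assumes the Total Reads value is a plain integer and that paired-end reads are split evenly between R1 and R2.

```shell
# Hypothetical helper: derive the expected per-file read count from the
# Total Reads value shown in CZ ID.
#   $1 = Total Reads (integer, no thousands separators)
#   $2 = "paired" or "single"
expected_reads_per_file() {
  total=$1
  layout=$2
  if [ "$layout" = "paired" ]; then
    # Total Reads counts R1 and R2 together, so each file holds half.
    echo $(( total / 2 ))
  else
    echo "$total"
  fi
}

# Example from the text: 50k total paired-end reads
expected_reads_per_file 50000 paired   # prints 25000
```

Keep in mind that for uploads above 150M reads the Total Reads value is capped, so this estimate only describes what CZ ID retained, not the original file.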
Obtain File Stats with SeqKit
SeqKit is a fast, easy-to-use toolkit for FASTA/Q file manipulation. This toolkit is compatible with multiple operating systems, including Mac, Linux, and Windows, and can easily be used without dependencies or pre-configurations. Installation instructions can be found in SeqKit's Download page.
The easiest way to check for file integrity is to run the seqkit stats command:
seqkit stats sample.fastq.gz
Outputs:
- Statistics for intact files: SeqKit will output a clean markdown table containing the sequence format, total read count, and length distributions. Compare the number of reads in the downloaded file to the expected number of total reads. For paired-end data, check that downloaded R1 and R2 files have the same number of reads.
- Error for truncated files: SeqKit will fail to parse the incomplete record at the end of a truncated file and will print an explicit error message to your terminal:
[ERRO] sample.fastq.gz: fastx: bad fastq format.
To salvage the intact reads from a truncated file, run the seqkit sana command:
seqkit sana sample.fastq.gz -o repaired.fastq.gz
- Repaired files: SeqKit will output a FASTQ file after removing the truncated fragments. Note that this file will be functional but is incomplete relative to the original file.
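The comparison between the SeqKit read count and the expected count can be scripted. The following is a minimal sketch, not an official CZ ID or SeqKit workflow: it relies on the tab-delimited output of seqkit stats -T and its num_seqs column, and the function names num_seqs_from_stats and check_read_count are hypothetical. It also assumes that SeqKit prints no data row when it cannot parse a file.

```shell
# Read `seqkit stats -T` (tab-delimited) output on stdin and print the
# value of the num_seqs column from the first data row.
num_seqs_from_stats() {
  awk -F '\t' '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == "num_seqs") col = i }
    NR == 2 && col { print $col }
  '
}

# Hypothetical wrapper: compare a file's read count with the expected count.
#   $1 = FASTQ file
#   $2 = expected number of reads in that file
check_read_count() {
  actual=$(seqkit stats -T "$1" | num_seqs_from_stats)
  if [ -z "$actual" ]; then
    # Assumption: no data row means SeqKit failed to parse the file.
    echo "FAIL: could not obtain a read count for $1" >&2
    return 1
  fi
  if [ "$actual" -eq "$2" ]; then
    echo "OK: $1 has $actual reads"
  else
    echo "MISMATCH: $1 has $actual reads, expected $2" >&2
    return 1
  fi
}
```

For paired-end data, you could call check_read_count once for the R1 file and once for the R2 file, each with half of the Total Reads value.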
Bash Check
You can use basic commands in Linux-like and Windows environments to count the number of lines within downloaded files. The number of lines in a given FASTQ file should be divisible by four.
- Number of lines divisible by 4: Every sequencing read has its required 4 lines (header, sequence, "+" separator line, and quality scores) and the file structure is likely intact.
- Number of lines NOT divisible by 4: The file is truncated or corrupted (i.e., a read was cut off "mid-sentence").
Note that by dividing the number of lines by four you are calculating the total number of reads. Compare the number of reads in the downloaded file to the expected number of total reads. For paired-end data, check that downloaded R1 and R2 files have the same number of reads.
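The divisibility rule and the lines-to-reads conversion can be expressed as a short shell function. This is an illustrative sketch; the name reads_from_lines is hypothetical, and the input is assumed to be a line count you have already obtained.

```shell
# Convert a FASTQ line count into a read count.
# Prints the number of reads and returns 0 when the count is divisible by 4;
# prints a warning to stderr and returns 1 otherwise.
reads_from_lines() {
  lines=$1
  if [ $(( lines % 4 )) -ne 0 ]; then
    echo "line count $lines is not divisible by 4: file may be truncated or corrupted" >&2
    return 1
  fi
  echo $(( lines / 4 ))
}

# Example: a file with 100000 lines
reads_from_lines 100000   # prints 25000
```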
Linux/Mac
Command for computing number of lines in uncompressed FASTQ file (".fq", ".fastq"):
wc -l sample.fastq
Command for computing number of lines in compressed FASTQ file (".fq.gz", ".fastq.gz"):
gzip -cd sample.fastq.gz | wc -l
Alternatively, you can check the integrity of the compressed file itself to make sure internal GZIP trailing blocks are not truncated. If the following command prints "unexpected end of file" or any other output, the file is truncated or corrupted. No output means that the file is intact.
gzip -t sample.fastq.gz
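The Linux/Mac commands above can be combined into one helper that handles both compressed and uncompressed files. This is a sketch under stated assumptions: the function names count_fastq_lines and same_line_count are hypothetical, and compressed files are recognized only by a ".gz" suffix.

```shell
# Hypothetical helper: print the number of lines in a FASTQ file, whether
# gzip-compressed or not. Returns non-zero if gzip reports a problem.
count_fastq_lines() {
  case "$1" in
    *.gz)
      # Test the gzip stream first; a truncated archive fails here.
      gzip -t "$1" 2>/dev/null || { echo "gzip test failed: $1" >&2; return 1; }
      gzip -cd "$1" | wc -l | tr -d ' '
      ;;
    *)
      # tr strips the padding that some wc implementations add.
      wc -l < "$1" | tr -d ' '
      ;;
  esac
}

# Hypothetical helper: succeed only if paired R1 and R2 files have the
# same number of lines (and therefore the same number of reads).
same_line_count() {
  r1=$(count_fastq_lines "$1") || return 1
  r2=$(count_fastq_lines "$2") || return 1
  [ "$r1" -eq "$r2" ]
}
```

A call such as same_line_count sample_R1.fastq.gz sample_R2.fastq.gz (file names are placeholders) returns a non-zero status when either file fails the gzip test or the counts differ.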
Windows
Use the following PowerShell command to count lines in uncompressed FASTQ files (".fq", ".fastq"):
gc sample.fastq | measure -l
Command for computing number of lines in compressed FASTQ file (".fq.gz", ".fastq.gz"):
tar -xf sample.fastq.gz -O | measure -l
Summary: Quick Checklist
Use this checklist each time you download original input FASTQ files from CZ ID.
Before you start:
- Find the expected number of reads in the Total Reads column on your Project page.
When using SeqKit:
- Run seqkit stats to obtain a full report.
- If truncated, run seqkit sana to salvage intact reads.
When using Bash commands (Linux/Mac):
- Run wc -l (uncompressed files) or gzip -cd piped into wc -l (compressed files) to confirm the line count is divisible by 4.
- Run gzip -t to confirm no truncation errors.
When using PowerShell commands (Windows):
- Run gc sample.fastq | measure -l (uncompressed files) or tar -xf sample.fastq.gz -O | measure -l (compressed files) to confirm the line count is divisible by 4.