Overview
This guide introduces how to use BLAST to confirm species calls identified by CZ ID. BLAST (Basic Local Alignment Search Tool) is a suite of programs hosted by NCBI that finds regions of similarity between sequences by aligning query sequences against a selected database of known sequences. CZ ID compares query sequences against the nucleotide and protein databases from NCBI using minimap2 and DIAMOND, respectively. BLAST is a useful tool for downstream analysis: it can confirm sequence hits identified by CZ ID and help annotate sequences.
BLAST Types
There are five different types of BLAST searches. The table below lists each BLAST type, its search space (nucleotide vs protein) and general use.
| BLAST Type | Query Sequence | Database | Alignment | General Use |
| --- | --- | --- | --- | --- |
| blastn | nucleotide | nucleotide | nucleotide | Sequence identification; useful for all taxa categories |
| blastx | nucleotide (translated to protein) | protein | protein | Identification of potential proteins encoded by query sequences; detection of novel viruses |
| blastp | protein | protein | protein | Sequence identification and similarity searches |
| tblastx | nucleotide (translated to protein) | nucleotide (translated to protein) | protein | Identification of nucleotide sequences similar to the query based on their coding potential |
| tblastn | protein | nucleotide (translated to protein) | protein | Identification of database sequences encoding proteins similar to the query |
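Several rows in the table describe nucleotide sequences that are "translated to protein" before alignment. The sketch below illustrates the idea with a six-frame translation (three reading frames on each strand) using the standard genetic code. It is a simplified illustration, not the code BLAST itself runs; the function names are hypothetical and ambiguous bases are simply translated to "X".

```python
from itertools import product

# Standard genetic code, listed in TCAG order (TTT, TTC, TTA, TTG, TCT, ...).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons become 'X', stops '*'."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(seq):
    """Translate a nucleotide sequence in all six reading frames."""
    frames = {}
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(strand_seq[offset:])
    return frames


if __name__ == "__main__":
    for frame, protein in six_frame_translations("ATGGCCTAA").items():
        print(frame, protein)
```

In a translated search, each of these six protein strings is a candidate for alignment, which is why a coding region can be detected without knowing its strand or frame in advance.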
Running BLAST
Steps to run a BLAST search:
1) Access BLAST here.
2) Click the button for the type of BLAST you wish to run.
Note: Blastn is usually sufficient to confirm a hit. However, if the sequence represents a divergent virus, blastn may fail to find significant matches or may be inaccurate. Blastx can be used to detect divergent sequences because protein sequences accumulate changes more slowly than their encoding nucleotide sequences. Therefore, use blastx to confirm potentially novel or divergent viral sequences.
3) Paste your query sequence(s) in the provided box or upload a FASTA file.
- Query Sequence Box: Paste query sequence(s) here in FASTA format.
- Choose File: Use this option to upload a FASTA file with query sequence(s).
4) Navigate to the "Program Selection" section by scrolling down the page. Choose a program for search optimization and click the BLAST button to run the search.
- Database: Indicates the selected database. By default, the search will use the NT (nucleotide) or NR (protein) database, which are used when confirming species calls through blastn or blastx, respectively. You can select other databases from the dropdown menu depending on your goal.
- Program Options: There are three nucleotide BLAST options for searching the database: Megablast, Discontiguous megablast, and Blastn. See more details below.
- BLAST Button: Click this button to begin the search.
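Step 3 asks for query sequences in FASTA format. Before pasting or uploading, it can help to check that the text parses cleanly and that each record is the sequence type the chosen BLAST program expects. The sketch below is a hypothetical pre-flight check written for this guide, not part of BLAST or CZ ID; the 0.9 threshold for calling a sequence nucleotide is an arbitrary assumption.

```python
def parse_fasta(text):
    """Parse FASTA text into {identifier: sequence}; raise ValueError if malformed."""
    records = {}
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            if not fields:
                raise ValueError("FASTA header has no identifier")
            current = fields[0]  # identifier is the first word after '>'
            if current in records:
                raise ValueError(f"duplicate identifier: {current}")
            records[current] = []
        elif current is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            records[current].append(line.upper())
    return {name: "".join(parts) for name, parts in records.items()}


def looks_like_nucleotide(seq, threshold=0.9):
    """Heuristic: True if most characters are A, C, G, T, U, or N.

    The threshold is an assumption chosen for illustration only.
    """
    if not seq:
        return False
    nucleotide_chars = sum(seq.count(base) for base in "ACGTUN")
    return nucleotide_chars / len(seq) >= threshold


if __name__ == "__main__":
    example = ">contig_1 sample A\nACGTACGT\nacgtnn\n>protein_1\nMKVLHE\n"
    for name, seq in parse_fasta(example).items():
        kind = "nucleotide" if looks_like_nucleotide(seq) else "protein"
        print(name, len(seq), kind)
```

A nucleotide record would be suitable for blastn or blastx, while a protein record would call for blastp or tblastn.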
Program Options
- Megablast: Default option used for identification of highly similar sequences and intra-species comparisons. Generally used to confirm species calls.
- Discontiguous megablast: Used to identify more dissimilar sequences and for inter-species comparisons. Can be used when querying coding sequences.
- Blastn: Used for finding somewhat similar sequences or related sequences from other species. Use this option if you have short query sequences.
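This guide covers the NCBI web interface, but the same three program options exist in NCBI's BLAST+ command-line tools as values of the `-task` flag of `blastn`. The sketch below only builds the command; it assumes BLAST+ is installed separately, and the function name, default output path, and query file name are hypothetical. The `-remote` flag sends the search to NCBI's servers instead of a local database.

```python
import shlex

# Web "Program Options" mapped to the blastn -task values in BLAST+.
TASK_BY_OPTION = {
    "megablast": "megablast",
    "discontiguous megablast": "dc-megablast",
    "blastn": "blastn",
}


def build_blastn_command(query_path, option="megablast", database="nt",
                         out_path="blast_hits.tsv"):
    """Return the argument list for a remote blastn search (not executed here)."""
    try:
        task = TASK_BY_OPTION[option.lower()]
    except KeyError:
        raise ValueError(f"unknown program option: {option!r}") from None
    return [
        "blastn",
        "-task", task,
        "-db", database,
        "-query", query_path,
        "-remote",          # run against NCBI's servers, no local database
        "-outfmt", "6",     # tab-separated table, one alignment per line
        "-out", out_path,
    ]


if __name__ == "__main__":
    command = build_blastn_command("query.fasta", "Discontiguous megablast")
    print(shlex.join(command))
    # To actually run it (requires BLAST+ and network access):
    # import subprocess; subprocess.run(command, check=True)
```

Keeping the command as a list rather than a single string avoids shell-quoting problems if a file path contains spaces.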
Interpreting BLAST Results
BLAST results show all of the taxa that share sequence similarity with the query sequence based on the selected database. The results page includes a search summary, hit description table, graphic summary, and alignments that can help determine the quality or accuracy of a given hit. Click here for a glossary of BLAST terms and additional information regarding alignment metrics.
Search Summary
- Search Overview: Indicates the database used, query description, and query sequence length.
- Results Tabs: Click tabs to explore results through Descriptions, Graphic Summary, and Alignments.
Descriptions Tab
The Descriptions tab provides a table listing all the sequences sharing similarity with the query and metrics to evaluate the matches. The taxon description link will take you to the alignment between the query and its sequence match. Click the scientific name and/or accession links to find more details about a match of interest.
BLAST metrics include:
- Max Score: Highest bit score calculated from matches and mismatches found in local alignments. The higher the max score, the better the alignment.
- Total Score: Sum of alignment scores for all of the sequence segments or local alignments. The higher the score, the better the alignment. When max and total scores are the same, the query and its database match are joined by a single local alignment. This means that the sequences can be aligned without long insertions or deletions.
- Note: CZ ID uses an algorithm equivalent to "total score" to assign taxon matches. If you find taxon A in CZ ID's Sample Report and run BLAST, it is possible that taxon A would not be the top hit in the BLAST table because BLAST sorts by E-value. Additionally, CZ ID implements a different alignment strategy (minimap2 and DIAMOND).
- Query Coverage: Percent of the query sequence length that is included in alignments against the sequence match.
- E-value: Indicates the number of hits or alignments that are expected to be seen by random chance with the same score or better. The lower the E-value, the more significant the alignment (the closer to 0, the better). E-value is the default metric used to sort the Descriptions table. Click here for a discussion of E-value thresholds.
- Percent Identity: Percent of nucleotides or amino acids that are identical between the aligned query and database sequences. A query sequence can share low percent identity with a sequence and still be a significant hit. It is essential to take the E-value into account and look for similarity between conserved regions (this will be more evident at the amino acid level).
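The metrics above can be reproduced from a table of individual alignments. The sketch below assumes BLAST+ tabular output (`-outfmt 6`) with its default column order and groups alignments by database sequence to derive max score, total score, and query coverage. The three sample rows are invented purely for illustration, and the function names are hypothetical. The example also shows how sorting by E-value and sorting by total score can put different sequences on top, as described in the note on total score.

```python
from collections import defaultdict

# Default column order of BLAST+ tabular output (-outfmt 6).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def parse_outfmt6(text):
    """Parse tab-separated alignment rows into dictionaries."""
    rows = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        row = dict(zip(COLUMNS, line.split("\t")))
        row["qstart"], row["qend"] = int(row["qstart"]), int(row["qend"])
        row["evalue"], row["bitscore"] = float(row["evalue"]), float(row["bitscore"])
        rows.append(row)
    return rows


def covered_length(intervals):
    """Count query positions covered by at least one interval (1-based, inclusive)."""
    total, current_end = 0, 0
    for start, end in sorted(intervals):
        start = max(start, current_end + 1)  # skip positions already counted
        if end >= start:
            total += end - start + 1
            current_end = end
    return total


def summarize(rows, query_length):
    """Compute per-subject max score, total score, query coverage, best E-value."""
    by_subject = defaultdict(list)
    for row in rows:
        by_subject[row["sseqid"]].append(row)
    summary = []
    for subject, alignments in by_subject.items():
        intervals = [(min(a["qstart"], a["qend"]), max(a["qstart"], a["qend"]))
                     for a in alignments]
        summary.append({
            "subject": subject,
            "max_score": max(a["bitscore"] for a in alignments),
            "total_score": sum(a["bitscore"] for a in alignments),
            "query_coverage": 100.0 * covered_length(intervals) / query_length,
            "evalue": min(a["evalue"] for a in alignments),
        })
    return summary


# Invented rows: subject_A has one strong alignment, subject_B two weaker ones.
SAMPLE_ROWS = [
    ("query1", "subject_A", 99.0, 100, 1, 0, 1, 100, 501, 600, 1e-45, 180.0),
    ("query1", "subject_B", 95.71, 70, 3, 0, 1, 70, 10, 79, 1e-28, 120.0),
    ("query1", "subject_B", 94.29, 70, 4, 0, 81, 150, 300, 369, 1e-25, 110.0),
]
SAMPLE_TEXT = "\n".join("\t".join(str(v) for v in row) for row in SAMPLE_ROWS)

summary = summarize(parse_outfmt6(SAMPLE_TEXT), query_length=150)
by_evalue = sorted(summary, key=lambda s: s["evalue"])
by_total = sorted(summary, key=lambda s: s["total_score"], reverse=True)

if __name__ == "__main__":
    print("Top by E-value:    ", by_evalue[0]["subject"])
    print("Top by total score:", by_total[0]["subject"])
```

In this made-up case the single strong alignment wins on E-value, while the sequence with two alignments wins on total score and covers more of the query.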
Graphic Summary Tab
Use the Graphic Summary tab to view a schematic representation of alignments between the query and database sequence matches.
- Alignment Score Scale: Colors reflect the alignment score range. Red bars represent alignments with the highest score (best alignment), while black bars represent low scores for alignments that should be interpreted with caution.
- Query Sequence: The blue bar depicts the query sequence length.
- Sequence Match: Each bar or row represents a sequence sharing similarity with the query. The length of the bar indicates the length of the alignment between the query and sequence match. The color will vary depending on the alignment score (see Alignment Score Scale). Click the bar to view sequence and alignment information.
- Alignment Gaps: Gray lines separating segments of a sequence bar represent alignment gaps.
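The color scale can be thought of as a lookup from alignment score to a color bin. The sketch below assumes the five bins commonly shown in the NCBI color key (below 40, 40 to 50, 50 to 80, 80 to 200, and 200 or above); only the red and black ends are described in this guide, so treat the intermediate cutoffs and color names as assumptions and check them against the key displayed on your results page.

```python
def score_color(bit_score):
    """Map an alignment score to a color bin (cutoffs assumed, see lead-in)."""
    if bit_score < 40:
        return "black"    # lowest scores: interpret with caution
    if bit_score < 50:
        return "blue"
    if bit_score < 80:
        return "green"
    if bit_score < 200:
        return "magenta"
    return "red"          # highest scores: best alignments


if __name__ == "__main__":
    for score in (35, 45, 75, 150, 250):
        print(score, score_color(score))
```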
Alignments Tab
Go to the Alignments tab to view sequence alignments. The alignment type (nucleotide vs amino acid sequences) will depend on the BLAST type.
- Download: Use the download dropdown menu to download the alignment or the matched sequence in various formats.
- Graphics: Use this link to view a graphical representation of the alignment that includes sequence match annotations. This is useful when trying to assess if the query sequence represents a non-coding region, such as bacterial 16S or eukaryotic 18S ribosomal sequences. Query sequences from conserved, non-coding regions often result in matches to multiple taxa and cannot be assigned to a specific species.
Graphics Interface
- Database Sequence: Bar represents the matched sequence.
- Annotations: Specifies which genomic region(s) the query sequence is matching.
- Query Alignments: Bars represent aligned query sequences against database sequence.
General Notes
When confirming species calls:
- Record BLAST information for the best sequence match (usually the top hit listed in the Descriptions table), as it is often reported in publications, including:
  - Accession number
  - E-value
  - Query coverage
- If the best match of a blastn search is a virus, it is a good idea to perform a blastx search to confirm the hit. Divergent viral sequences may yield poor blastn matches to bacterial sequences, but show significant amino acid alignments to viral sequences through blastx.
- Non-coding regions, such as 16S rRNA, will not be identified through blastx.
- It's OK if your BLAST results do not confirm CZ ID species calls. There is noise in mNGS data and BLAST is a useful tool to identify that noise or confirm a species call.
- Once a species call is confirmed as a hit or false positive, you can record this information in the CZ ID sample report using the annotation feature. Simply click the annotation tag by the species of interest and select the appropriate description from the dropdown menu.
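One way to keep the recommended details (accession number, E-value, query coverage) together is a small record per confirmed call, exported as CSV for later reporting. The sketch below is a hypothetical bookkeeping aid, not a CZ ID feature: the field names, status labels, and placeholder values are all assumptions made for illustration. It also encodes the rule of thumb above that a viral best hit from blastn deserves a blastx follow-up.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class BlastConfirmation:
    taxon: str             # species call being checked
    accession: str         # accession of the best sequence match
    evalue: float
    query_coverage: float  # percent
    program: str           # e.g. "blastn" or "blastx"
    status: str            # hypothetical labels: "confirmed" / "false positive"


def needs_blastx_followup(program, best_hit_category):
    """True when a blastn search returned a viral best hit."""
    return program == "blastn" and best_hit_category == "virus"


def to_csv(records):
    """Serialize records to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=[f.name for f in fields(BlastConfirmation)],
        lineterminator="\n",
    )
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))
    return buffer.getvalue()


if __name__ == "__main__":
    # Placeholder values only; replace with details from your own BLAST results.
    record = BlastConfirmation(
        taxon="EXAMPLE_TAXON",
        accession="ACCESSION_PLACEHOLDER",
        evalue=1e-50,
        query_coverage=98.0,
        program="blastn",
        status="confirmed",
    )
    print(to_csv([record]), end="")
    print(needs_blastx_followup(record.program, "virus"))
```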