Comprehensive Guide to BLAST: Basic Local Alignment Search Tool Explained

Introduction to BLAST

BLAST (Basic Local Alignment Search Tool) is a widely used bioinformatics tool developed by the National Center for Biotechnology Information (NCBI) to identify sequence similarity in nucleotide or protein sequences. It functions as a sequence similarity search tool by comparing a user-provided query sequence against large databases.

Learn more about the Comprehensive Insights into EBI and Essential Bioinformatics Tools to understand where BLAST fits in the broader bioinformatics landscape.

Understanding the Query and Database

Query Sequence: The input sequence to be searched.
Database: Large collections of nucleotide or protein sequences against which the query is compared.

For an expanded overview of protein sequence resources, consult the Comprehensive Guide to Protein Databases: Types and Key Examples.

The BLAST Search Analogy

Searching a query sequence is analogous to finding a specific book in a vast library. Just as a librarian narrows down a search by categories (e.g., higher studies → life sciences → bioinformatics → foreign authors), BLAST efficiently narrows down sequence matches using algorithmic steps.

BLAST Algorithm Overview

Step 1: Removing Low Complexity Regions

Identifies repetitive or low-complexity regions in the query (e.g., runs of the same amino acid) and masks them by replacing each residue with a placeholder character (X for proteins, N for nucleotides), so that they are not considered in the search.
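The masking idea can be illustrated with a toy sketch. It is a deliberate simplification: it masks only runs of a single repeated letter, whereas NCBI BLAST relies on dedicated low-complexity filters (SEG for proteins, DUST for nucleotides). The function name and the min_run parameter are hypothetical, chosen only for this illustration; the example sequence is the one used in the lecture.

```python
import re


def mask_low_complexity(seq: str, is_protein: bool = True, min_run: int = 4) -> str:
    """Replace runs of one repeated letter (length >= min_run) with X or N."""
    mask_char = "X" if is_protein else "N"
    # (.)\1{n,} matches a letter followed by at least n more copies of itself.
    pattern = re.compile(r"(.)\1{%d,}" % (min_run - 1))
    return pattern.sub(lambda m: mask_char * len(m.group(0)), seq)


print(mask_low_complexity("LLLLKRKDKLLKKKK"))  # XXXXKRKDKLLXXXX
```

Short repeats such as the two-letter LL in the middle are left alone, which matches the lecture's remark that one or two repeated letters are fine.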

Step 2: Word List Creation and Scoring

The query sequence is parsed into fixed-length, overlapping words: typically 11 letters for nucleotides and 3 letters for proteins. A query of length L yields L - W + 1 words of length W.
Each query word is scored against candidate words using a scoring matrix (BLOSUM62 in the lecture's protein example).
Word pairs scoring at or above a chosen threshold (T value) are kept for further analysis.
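As a minimal sketch of the word-list step, the snippet below slides a window of length W along the query and collects every overlapping word. The function name is hypothetical; the query LPPQGLL is the protein example used in the lecture.

```python
def query_words(query: str, w: int) -> list[str]:
    """Return all overlapping words of length w; there are L - w + 1 of them."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]


words = query_words("LPPQGLL", 3)
print(words)       # ['LPP', 'PPQ', 'PQG', 'QGL', 'GLL']
print(len(words))  # 5, i.e. L - W + 1 = 7 - 3 + 1
```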

Example:

Query word: 'PQG'
Candidate words PQG, PEG, PSG and PQA score 18, 15, 13 and 12 against it.
Threshold value: 13
Words scoring ≥13 are accepted as hits, so PQG, PEG and PSG are kept and PQA (12) is discarded.
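The example scores can be reproduced with a short sketch. The dictionary below is only an excerpt of the BLOSUM62 matrix, limited to the residue pairs that occur in this example, and the function names are hypothetical. Candidate words are PQG, PEG, PSG and PQA, scored against the query word PQG with a threshold of 13.

```python
# Excerpt of BLOSUM62: only the residue pairs needed for this example.
BLOSUM62_EXCERPT = {
    ("P", "P"): 7, ("Q", "Q"): 5, ("G", "G"): 6,
    ("Q", "E"): 2, ("Q", "S"): 0, ("G", "A"): 0,
}


def pair_score(a: str, b: str) -> int:
    """Look up a substitution score; the matrix is symmetric."""
    if (a, b) in BLOSUM62_EXCERPT:
        return BLOSUM62_EXCERPT[(a, b)]
    return BLOSUM62_EXCERPT[(b, a)]


def word_score(query_word: str, candidate: str) -> int:
    """Sum the position-by-position scores of two equal-length words."""
    return sum(pair_score(a, b) for a, b in zip(query_word, candidate))


def accepted_words(query_word: str, candidates: list[str], threshold: int):
    """Keep candidates whose score against the query word is >= threshold."""
    scored = [(c, word_score(query_word, c)) for c in candidates]
    return [(c, s) for c, s in scored if s >= threshold]


print(accepted_words("PQG", ["PQG", "PEG", "PSG", "PQA"], threshold=13))
# [('PQG', 18), ('PEG', 15), ('PSG', 13)]
```

PQA scores 7 + 5 + 0 = 12, which is below the threshold, so it is dropped.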

Step 3: Hit Formation

Each word match that meets or exceeds the threshold becomes a 'hit,' recorded and stored for further extension.
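A hit lookup can be sketched as a scan of the database sequence for any of the accepted words. This is an illustration only: the function name is hypothetical, the accepted word set is taken from the PQG example, and a real implementation would use an index over the database rather than a linear scan.

```python
def find_hits(db_seq: str, accepted: set[str], w: int) -> list[tuple[int, str]]:
    """Return (position, word) for every database word found in the accepted set."""
    hits = []
    for pos in range(len(db_seq) - w + 1):
        word = db_seq[pos:pos + w]
        if word in accepted:
            hits.append((pos, word))
    return hits


# Accepted words for the query word PQG at threshold 13, from the example.
print(find_hits("MPPEGLL", {"PQG", "PEG", "PSG"}, 3))  # [(2, 'PEG')]
```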

Step 4: Extension of Hits

Hits are expanded in both left and right directions to find longer matching sequences, stopping when scores begin to decline.
The extended matching region is termed a High-Scoring Segment Pair (HSP).
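The extension step can be sketched as an ungapped walk to the left and right of the seed. This is a simplified illustration: the x_drop parameter (stop once the running score falls more than x_drop below the best score seen) and all function names are assumptions made for the sketch, and the score table is an excerpt of BLOSUM62 covering only the residue pairs in the lecture's example.

```python
# Excerpt of BLOSUM62 covering only the residue pairs in this example.
SCORES = {("P", "P"): 7, ("Q", "E"): 2, ("G", "G"): 6, ("L", "L"): 4, ("L", "M"): 2}


def score(a: str, b: str) -> int:
    """Look up a symmetric substitution score from the excerpt."""
    return SCORES[(a, b)] if (a, b) in SCORES else SCORES[(b, a)]


def extend_one_way(query, db, q, d, step, x_drop):
    """Walk in one direction; return (best score gain, residues kept)."""
    best = running = 0
    best_len = length = 0
    while 0 <= q < len(query) and 0 <= d < len(db):
        running += score(query[q], db[d])
        length += 1
        if running > best:
            best, best_len = running, length
        elif best - running > x_drop:
            break  # score has fallen too far below the best seen
        q += step
        d += step
    return best, best_len


def extend_hit(query, db, q_pos, d_pos, w, x_drop=5):
    """Extend a seed of length w in both directions and return the HSP."""
    seed = sum(score(query[q_pos + i], db[d_pos + i]) for i in range(w))
    left_gain, left_len = extend_one_way(query, db, q_pos - 1, d_pos - 1, -1, x_drop)
    right_gain, right_len = extend_one_way(query, db, q_pos + w, d_pos + w, +1, x_drop)
    total = seed + left_gain + right_gain
    q_hsp = query[q_pos - left_len:q_pos + w + right_len]
    d_hsp = db[d_pos - left_len:d_pos + w + right_len]
    return total, q_hsp, d_hsp


print(extend_hit("LPPQGLL", "MPPEGLL", q_pos=2, d_pos=2, w=3))
# (32, 'LPPQGLL', 'MPPEGLL')
```

Here the seed PQG/PEG scores 15, the left side adds 7 + 2 and the right side adds 4 + 4, so the HSP spans both sequences with a total score of 32.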

For deeper understanding of sequence alignment algorithms, see Global Sequence Alignment Explained: Needleman-Wunsch Algorithm Guide.

Scoring Systems in BLAST

Raw Score: Sum of individual match scores in an alignment.
Bit Score: Raw score transformed onto a normalized logarithmic scale.
E-value: The number of alignments with an equal or better score expected to occur by chance in a search of this size; a lower E-value indicates a more significant match.
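The relationship between the three scores can be sketched with the standard Karlin-Altschul relations: bit score S' = (lambda * S - ln K) / ln 2 and E = K * m * n * exp(-lambda * S), where S is the raw score, m the query length and n the database length. The source does not give these formulas or any parameter values, so the lambda and K used below are illustrative placeholders, not values to rely on; in practice they depend on the scoring matrix and gap penalties.

```python
import math


def bit_score(raw_score: float, lam: float, k: float) -> float:
    """Normalize a raw score to bits: (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(raw_score: float, lam: float, k: float, query_len: int, db_len: int) -> float:
    """Expected number of chance alignments scoring at least raw_score."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


# Placeholder statistical parameters (assumptions for illustration only).
LAMBDA, K = 0.267, 0.041

for raw in (40, 60, 80):
    bits = bit_score(raw, LAMBDA, K)
    e = e_value(raw, LAMBDA, K, query_len=100, db_len=1_000_000)
    print(f"raw={raw}  bits={bits:.1f}  E={e:.2e}")
```

Running it shows the pattern described above: as the raw score rises, the bit score rises and the E-value falls.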

Types of BLAST Services at NCBI

Nucleotide BLAST: Nucleotide query vs. nucleotide database.
Protein BLAST: Protein query vs. protein database.
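BLASTX: Translates a nucleotide query in all six reading frames and searches a protein database.
TBLASTN: Protein query searched against a nucleotide database translated in six reading frames.
TBLASTX: Nucleotide query translated in six reading frames and searched against a nucleotide database translated in six reading frames.
MegaBLAST: Faster nucleotide searches optimized for long, highly similar sequences.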
PSI-BLAST: Detects distant protein homologs using iterative profile searches.
PHI-BLAST: Searches for protein sequences that contain particular patterns.
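Choosing among the basic programs comes down to the type of the query and the type of the database. The lookup below is a small sketch of that decision; it also includes TBLASTX, which the lecture mentions. The table layout and the function name are hypothetical, while the program names are the ones used by NCBI.

```python
# program name: (query type, database type, what gets translated)
PROGRAMS = {
    "blastn": ("nucleotide", "nucleotide", "nothing"),
    "blastp": ("protein", "protein", "nothing"),
    "blastx": ("nucleotide", "protein", "query"),
    "tblastn": ("protein", "nucleotide", "database"),
    "tblastx": ("nucleotide", "nucleotide", "query and database"),
}


def programs_for(query_type: str, db_type: str) -> list[str]:
    """List the programs that accept this query/database combination."""
    return [
        name
        for name, (q, d, _translated) in PROGRAMS.items()
        if q == query_type and d == db_type
    ]


print(programs_for("nucleotide", "protein"))     # ['blastx']
print(programs_for("nucleotide", "nucleotide"))  # ['blastn', 'tblastx']
```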

For an in-depth exploration of recombinant proteins relevant to BLAST protein searches, reference the Comprehensive Guide to Recombinant Protein Expression and Structural Biology.

BLAST vs. FASTA

Besides BLAST, FASTA is another tool for sequence similarity searches, offering alternative algorithmic approaches.

Practical Considerations

Understanding these steps enhances proper usage and interpretation of BLAST.
Proper threshold and scoring considerations prevent unnecessary computation and enhance result reliability.

By grasping the BLAST workflow, from query processing, word scoring and hit identification to final alignment scoring, researchers can efficiently search large databases for meaningful sequence similarities in both nucleotides and proteins.

Transcript

All right, I'll take a color, and I'm going to talk a little bit about the Basic Local Alignment Search Tool, BLAST. Whenever we talk about bioinformatics, BLAST is something we always need to understand and also need to perform practically. So what is the full form of BLAST? B-L-A-S-T, all caps: Basic Local Alignment Search Tool.

Basically, it's a tool for searching similarity, developed by NCBI, and it is a very popular tool for finding sequence similarity. So we can call it a sequence similarity search tool. What can we search? Either a nucleotide sequence or an amino acid sequence, that is, a protein sequence.

The sequence that we use to search throughout all the databases under NCBI is known as the query sequence. The query sequence is the sequence to be tested, meaning the sequence to be searched in order to find similarity. Then there is the whole sequence database, in which the match is to be found.

Finding a match is like trying to get a bioinformatics book for your preparation. I told you the name of the book, you have the book in your hand, and now you go to a library or a shop where you want to get that same book. So you show it to the librarian and ask them to search the whole library and get that same book out. The librarian will follow a protocol, because the library is filled with thousands of books.

First the librarian checks what kind of book it is, whether it is for lower studies or higher studies. Bioinformatics is for higher studies, and there is a higher studies section in the library, so the search is now limited to that section instead of the whole library. Inside higher studies, what subject is it? Bioinformatics, which comes under life sciences, so the librarian sticks to the life sciences portion, which is even smaller than the total higher education section. In the life sciences textbook section the librarian goes through the different subjects (zoology, botany, physiology, bioinformatics, biochemistry) and finds the place where only the bioinformatics books are kept. Within that bioinformatics unit, the librarian checks whether it is an Indian author's book or a foreign author's book. Let's say it's a foreign author's book: then the librarian goes to the foreign author section of the bioinformatics shelves and searches for that book there. The search becomes very organized when we do it like this.

Now suppose instead you give this bioinformatics book to someone who doesn't know anything about bioinformatics and hasn't even heard of it. That person will run around and look at every book in the library, trying to match the front cover, and then return your query. Who will take more time? Obviously the latter person, and that way of searching is not logical or scientific at all. So there should be a proper approach, and this approach is known as an algorithm.

An algorithm is a process that is fed to a software platform, which the software then runs. Software knows only binary, one or zero. We are not binary; we think in quantitative terms. But for software it is all ones and zeros, so based on those binary values we create an algorithm, the software follows the algorithm blindly, and we know that we are going to get our data. I have just given you an example of an algorithm for searching for a book.

Similarly, you have a query sequence and you need to search for it throughout all the databases under NCBI. NCBI runs multiple databases, they are very popular, and the traffic is also very good. You want to find out whether your query matches any other sequence in the database. That's what we run BLAST for.

So the BLAST algorithm has steps. What are they? Basically, the BLAST algorithm processes the query in proper stages. The input is your query sequence, and the very first step is removing low-complexity areas. Let me write that: removing low-complexity areas.

For example, let's assume we are talking about a protein sequence, say LLLLKRKDKLLKKKK. Low-complexity areas are those that contain repetitive sequence: the LLLL at the start is repetitive, and the KKKK at the end is also repetitive. Repetitive sequences are usually not allowed, so the first thing the BLAST algorithm does is remove them and place a symbol in their position: X for protein and N for nucleotide. So the sequence becomes XXXXKRKDKLLXXXX. X means we are not going to consider those positions because of their low complexity. We only take sequence that is unique; one or two repeated letters are fine, but no long repetitive stretch.

After this is done, a list of words is made from the query, and each word in the query is scored. I'm not going to go through the hard and fast rules of the BLAST algorithm here; I'm going to share it in a way that lets you understand it. That matters because the background of the BLAST algorithm is not the most important thing. What is important is how to run BLAST in practice. Practical knowledge of BLAST is more important, and practical knowledge of FASTA is more important; we'll do that later on. For now, just try to understand the idea.

So we have a scoring system that gives a score for each individual position. The query is broken into words. How long is a word? For a nucleotide sequence we use a word length of 11, and for proteins we generally take a word length of three. That is the fixed length with which we run the sequence similarity search.

Each word comparison gets a score: where there is a match there is one score, and where there is a mismatch there is a different score. So for a stretch of 11 letters, a particular score is given. Now ask yourself: what will be the maximum score? We get the maximum score only if all 11 positions match. But a full match is not always possible. Out of 11 there may be two matches, or eight, or ten, or five; different possibilities exist. We have a maximum score if all positions match, and a minimum score, which basically means no match. Between the minimum score and the maximum score there should be something in the middle that we decide on, so that the search can run.

Remember that the database is huge. Without the algorithm, finding the bioinformatics book in the library would be very difficult, and it is similar here: if our query of only 11 letters were compared against everything in the databases under NCBI, it would take ages to come back to us with the data. So a threshold is always kept, known as the T value (capital T), or threshold value. Only the word pairs with a score equal to or greater than T are chosen. If, and only if, a word from the query sequence matches a database word with a score equal to or more than the threshold value do we keep that comparison; otherwise we discard it.

So let's say this is the query sequence, and its length is capital L. From the query sequence we take overlapping sections, and each of these is an individual word. Every word is taken only from the query sequence, and its score must be greater than or equal to the T value. Only then do we take it; otherwise we don't.

Let's look at this with an example. Let's say the query sequence Q is LPPQGLL, and the database sequence is MPPEGLL.

What kind of comparison do we do? As I mentioned earlier, we take individual words from the query sequence and compare them. The maximum number of words here is L - W + 1, where W is the word length: 11 for nucleotides and 3 for proteins. In this case we are looking at a protein sequence, so W is three, which means three letters at a time. What are the possibilities in the query sequence? The first three letters, LPP; then, moving one letter along, PPQ; then PQG, QGL and GLL. All of these are words from the query, each of length W.

Let's take one of them as an example, PQG, and say the threshold value comes out to be 13. With a threshold of 13, for all five of these words we only take values that are equal to or greater than 13.

So PQG is our query word, and we match it against database words. First, if we compare PQG with PQG there is a 100% match, so we get the maximum score, which is 18 in this case. That is obviously greater than 13, so we take it. Then we compare PQG with PEG: there are two matches, P with P and G with G, while Q and E are a mismatch, and the score we get is 15. Similarly, we compare PQG with PSG: S and Q are different, P and G both match, and we get a score of 13. Then we compare PQG with PQA: P and Q match, G and A mismatch, and the score is 12.

So we have PQG against PQG with the maximum score of 18, PQG against PEG with 15, PQG against PSG with 13, and PQG against PQA with 12. Remember, the threshold is 13 or greater, so we do not take the last value. The other three we take. This is how it is done.

So that is the very first thing, the processing of the query. After the processing of the query, the second step is the creation of hits.

The query word is now represented by a group of words, which we call neighbors: the words that score at or above the threshold against it. These are compared with the word list of the database. If one of the neighbor words is identical to a word in a database sequence, a hit is recorded. So what does a hit mean? Whenever the query word, or one of the neighbors of the query word, matches a word in a database sequence, it is considered a hit, and the hit is recorded. Let's write it down: hit, with a capital H.

We know the threshold value T is 13 in this case, and we have recorded the list of possible matches to PQG, such as PEG and PSG. Once the hits are recorded, we need the third step of the BLAST algorithm, which is the extension of hits.

Once a hit is generated, it needs to be extended to the left and to the right until the score begins to decrease. A decreasing score means we have moved into sequence that is no longer matching, so we stop the extension once no further match is found. The stretch of sequence obtained this way is known as a high-scoring segment pair, HSP. After all the words are tested, the best set of high-scoring segment pairs is chosen for that database sequence.

For example, the database sequence is MPPEGLL and our query sequence is LPPQGLL. We can generate a particular score at the position of the matching word pair, and from that pair we move to the left-hand side as well as to the right-hand side to find the score of the hit. This is how we get the HSP score, the score of the longer stretch.

Remember that here we are matching PQG with PEG; similarly we have PSG, and so on. These are all in the list of words. The words are made of three letters here, in the case of protein, but in the case of nucleotide they would be 11 letters. That is difficult to explain with nucleotides, which is why I chose protein for this explanation.

Based on this we get the value of the HSP. What keeps the score? There is a scoring system, and here we use BLOSUM62. What BLOSUM is, what a PAM matrix is, all of that we have discussed already; you can watch the lecture on the BLOSUM and PAM matrices. Basically it is a matrix that gives the scores to use when we compare two sequences.

Based on this scoring system we get the values 7, 2 and 6 for the three letters of the word in the center, 2 and 7 for the stretch on the left, and 4 and 4 for the stretch on the right. In the same way we count the HSP score, moving one step at a time to the left and to the right until the value gets lower and lower. And remember, after all words are tested, the best set of HSPs is chosen for the database sequence.

So what is the output of BLAST? The output is a list of sequences from the database that match the query sequence, together with some statistical parameters. A score is also given for every match, and it comes in three different forms: the first is the raw score, the second is the bit score, and the third is the E-value.

What is the raw score? It is the sum of the MSPs that make up the alignment. What is the bit score? It is the raw score converted to a log scale, taking the scoring matrix into account. And what is the E-value? The E-value tells us how significant a given sequence alignment is, that is, whether the alignment between the query and the database sequence is meaningful rather than a chance match. For example, you may get a match that looks good, but if the E-value is high, we cannot rely on that match. A lower E-value is what we want; remember that. And finally, we'll conclude with the BLAST services.

What services are available? There are various kinds of BLAST services on the NCBI website, as you can clearly see there. I'll write down some of them, with the query and the database (DB) for each.

Nucleotide BLAST: the query sequence is nucleotide, and the database is also nucleotide.
Protein BLAST: an amino acid query is compared against a protein sequence database.
BLASTX: the query sequence is nucleotide, and it is translated in all reading frames and compared against the protein sequences in the database. So, nucleotide against a protein database.
TBLASTN: the query sequence is protein, and it is compared against a nucleotide database.
TBLASTX: this again uses six-reading-frame translation. A nucleotide query translated in six reading frames is compared against a nucleotide database translated in six reading frames.
MegaBLAST: a large sequence is compared against a nucleotide sequence database.
PHI-BLAST: finds proteins that contain a given pattern and are homologous to a query protein sequence. The database is again of the protein type.
PSI-BLAST: position-specific iterated BLAST, used for finding distantly related members of a protein family.

So these are all the different kinds of BLAST available on the NCBI site.

When we talk about sequence search and alignment tools, there is one more thing we must discuss, and that is FASTA. What is FASTA? It is another sequence similarity search tool, like BLAST.

Heads up!

This summary and transcript were automatically generated using AI with the Free YouTube Transcript Summary Tool by LunaNotes.

Related Summaries

Comprehensive Guide to FASTA: Algorithm, Types, and Comparison with BLAST

Explore how the FASTA algorithm performs sequence similarity searches using k-tuples, dot plots, and local alignment with dynamic programming. Understand different FASTA types like TFAST and FASTX/Y and how they compare protein and nucleotide sequences, highlighting differences from BLAST.

Global Sequence Alignment Explained: Needleman-Wunsch Algorithm Guide

Discover how global sequence alignment works using the Needleman-Wunsch algorithm, including step-by-step procedures for initialization, matrix filling, and traceback. Learn the scoring system, gap handling, and how heuristic methods optimize sequence searches without sacrificing sensitivity or specificity.

Comprehensive Guide to Sequence File Formats in Bioinformatics

This article provides an in-depth overview of primary and secondary sequence data used in bioinformatics, explaining various sequence and molecular file formats. It covers formats like FASTA, GenBank, GCG, EMBL, ClustalW, and UniProt, detailing their structure, usage, and significance in sequence analysis and molecular studies.

Comprehensive Insights into EBI and Essential Bioinformatics Tools

Explore the pivotal role of the European Bioinformatics Institute (EBI) in managing diverse biological databases and discover key bioinformatics tools for sequence analysis, pattern recognition, and structural comparison. Understand the synergy between wet labs and dry labs in modern bioinformatics and how EBI supports genomic and proteomic research.

Comprehensive Guide to Molecular File Formats for Protein 3D Modeling

Explore the essential molecular file formats like PDB, mmCIF, CHARMM, MDL, and Mopac used in protein 3D structure modeling. Understand their specific sections, applications in crystallography and molecular dynamics, and learn about key file conversion tools to integrate diverse data sources effectively.