Understanding Global Sequence Alignment
Global sequence alignment is a method used to compare two nucleotide or protein sequences along their entire length. It differs from local alignment, which focuses on matching subsequences. The Needleman-Wunsch algorithm is the standard approach for global alignment, while the Smith-Waterman algorithm is used for local alignment.
Key Stages of the Needleman-Wunsch Algorithm
- Initialization:
  - Construct a scoring matrix with one sequence laid out along the x-axis and the other along the y-axis.
  - Initialize the first row and column with zeros (as in the zero-penalty model used here) or with cumulative gap penalties.
- Matrix Filling:
  - Evaluate each cell by comparing the corresponding sequence elements.
  - Assign scores based on matches (+1), mismatches (0 or a penalty), and gaps.
  - Calculate candidate scores from three directions: diagonal (match/mismatch), left (gap), and up (gap).
  - Fill the cell with the maximum of these three candidates.
- Traceback:
  - Start from the bottom-right cell, which holds the score of the best global alignment, and trace the path of optimal scores back to the top-left cell.
  - Diagonal moves represent matches or mismatches; horizontal or vertical moves represent gaps.
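The three stages can be condensed into a single recurrence. The notation below is introduced here for illustration and is not part of the original lecture: F(i, j) is the cell for the first i letters of the y-axis sequence and the first j letters of the x-axis sequence, s(a, b) is the match/mismatch score, and g is the score of one gap (zero or negative).

```latex
F(0,0) = 0, \qquad F(i,0) = i\,g, \qquad F(0,j) = j\,g

F(i,j) = \max
\begin{cases}
F(i-1,\,j-1) + s(y_i, x_j) & \text{diagonal: match or mismatch} \\
F(i-1,\,j) + g             & \text{up: gap} \\
F(i,\,j-1) + g             & \text{left: gap}
\end{cases}
```

With s = +1 for a match, s = 0 for a mismatch, and g = 0, this reduces to the simple scoring model used in this lecture, where values "slide" right and down unchanged.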
Scoring Example and Matrix Filling
- Matches score +1, mismatches or gaps score 0 in this model.
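- For instance, when comparing two 'G's, the new cell score is the diagonal neighbor's score + 1.
- Because gaps score 0 here, existing values carry over unchanged to the right and downward.
- On a match, add +1 to the diagonal neighbor's value only, never to the left or upper neighbor; the cell then takes the largest of the three candidates.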
- This process continues until the entire matrix is filled.
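A minimal sketch of the matrix-filling step under the scoring model described above (+1 match, 0 mismatch, 0 gap). The function name `nw_fill` and its parameters are hypothetical, and the two sequences are the ones read out in the transcript (GAATTCAGTTA and GGATCGA), assuming they were transcribed correctly.

```python
def nw_fill(seq_x, seq_y, match=1, mismatch=0, gap=0):
    """Fill a global-alignment score matrix (rows follow seq_y, columns follow seq_x)."""
    rows, cols = len(seq_y) + 1, len(seq_x) + 1
    score = [[0] * cols for _ in range(rows)]

    # Initialization: border cells hold cumulative gap scores (all zeros when gap=0).
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    # Matrix filling: each cell takes the best of its three neighbors.
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if seq_y[i - 1] == seq_x[j - 1] else mismatch
            diagonal = score[i - 1][j - 1] + pair  # match score is added to the diagonal only
            from_up = score[i - 1][j] + gap        # value sliding down
            from_left = score[i][j - 1] + gap      # value sliding right
            score[i][j] = max(diagonal, from_up, from_left)
    return score


matrix = nw_fill("GAATTCAGTTA", "GGATCGA")
for row in matrix:
    print(row)
print("final score:", matrix[-1][-1])  # bottom-right cell; 6 for these two sequences
```

Passing negative values for `mismatch` and `gap` gives the more common penalized variant of the same fill.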
Traceback and Alignment Representation
- Traceback identifies the aligned sequences and gap positions.
- Diagonal arrows indicate matches in the optimal alignment.
- Gaps are represented by horizontal or vertical movements without a match.
- The resulting alignment is displayed with vertical bars for matches and dashes for gaps.
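The traceback and display steps can be sketched as follows. This is an illustrative implementation, not the lecturer's exact hand procedure: `nw_align` is a hypothetical name, the scoring defaults are the ones assumed in this lecture, and ties are broken by preferring the diagonal, then the left move. With a zero gap penalty many alignments share the same optimal score, so the printed alignment may differ from the one drawn in the video while having the same number of matches.

```python
def nw_align(seq_x, seq_y, match=1, mismatch=0, gap=0):
    """Return (top, bars, bottom) for one optimal global alignment."""
    rows, cols = len(seq_y) + 1, len(seq_x) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if seq_y[i - 1] == seq_x[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + pair,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Traceback: walk from the bottom-right cell back to the top-left cell.
    top, bars, bottom = [], [], []
    i, j = rows - 1, cols - 1
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            pair = match if seq_y[i - 1] == seq_x[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + pair:
                # Diagonal move: the two letters are aligned with each other.
                top.append(seq_x[j - 1])
                bottom.append(seq_y[i - 1])
                bars.append("|" if seq_x[j - 1] == seq_y[i - 1] else " ")
                i, j = i - 1, j - 1
                continue
        if j > 0 and score[i][j] == score[i][j - 1] + gap:
            # Horizontal move: gap in the y-axis sequence.
            top.append(seq_x[j - 1])
            bottom.append("-")
            bars.append(" ")
            j -= 1
        else:
            # Vertical move: gap in the x-axis sequence.
            top.append("-")
            bottom.append(seq_y[i - 1])
            bars.append(" ")
            i -= 1
    return ("".join(reversed(top)),
            "".join(reversed(bars)),
            "".join(reversed(bottom)))


top, bars, bottom = nw_align("GAATTCAGTTA", "GGATCGA")
print(top)
print(bars)
print(bottom)
```

The middle line carries a vertical bar under every matched pair, and dashes mark the gap positions, which is the representation described in the list above.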
Heuristic Methods in Sequence Alignment
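- Heuristic methods speed up database searches by first looking for short words (subsequences) shared between the query and the database sequences.
- Instead of computing a full alignment of the whole query against every database entry, they use small query fragments to locate candidate matches and examine only those more closely.
- This approach increases computational efficiency, which is why sensitivity and specificity have to be checked.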
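The word-based idea can be illustrated with a toy index. This is a simplified sketch of the general principle only; it is not how BLAST or FASTA are actually implemented, and the names `build_word_index` and `candidate_hits`, the word length `k=3`, and the tiny database are all assumptions made for the example.

```python
from collections import defaultdict


def build_word_index(database, k=3):
    """Map every k-letter word to the names of the database sequences containing it."""
    index = defaultdict(set)
    for name, seq in database.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].add(name)
    return index


def candidate_hits(query, index, k=3):
    """Count how many distinct query words each database sequence shares."""
    words = {query[pos:pos + k] for pos in range(len(query) - k + 1)}
    counts = defaultdict(int)
    for word in words:
        for name in index.get(word, ()):
            counts[name] += 1
    return dict(counts)


database = {"seq1": "GAATTCAGTTA", "seq2": "CCCCCCCC"}  # hypothetical entries
index = build_word_index(database)
print(candidate_hits("ATTCAG", index))  # only seq1 shares words with the query
```

Only the sequences returned as candidates would then be passed to a full alignment, which is where the speed-up comes from.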
Balancing Sensitivity and Specificity
- Sensitivity: Probability of correctly identifying true positive matches.
- Specificity: Probability of correctly identifying true negative results (non-matches).
- Four possible outcomes in assessments:
  - True Positive (correct match)
  - False Positive (incorrect match)
  - True Negative (correct non-match)
  - False Negative (missed match)
- Both sensitivity and specificity are crucial to ensure reliable and accurate alignment results.
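Given counts of the four outcomes, the two measures defined above can be computed directly. The sketch below uses the standard definitions (sensitivity as the true-positive rate, specificity as the true-negative rate); the function names and the example counts are made up for illustration.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of real matches that the search reported: TP / (TP + FN)."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """Fraction of real non-matches that the search rejected: TN / (TN + FP)."""
    return true_neg / (true_neg + false_pos)


# Hypothetical search: 8 true homologs found, 2 missed,
# 90 unrelated sequences rejected, 10 wrongly reported.
print(sensitivity(true_pos=8, false_neg=2))    # 0.8
print(specificity(true_neg=90, false_pos=10))  # 0.9
```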
Practical Tips for Global Alignment
- Always start with a clear understanding of scoring criteria.
- Use matrix filling and traceback logically to generate final alignments.
- Represent alignments visually using vertical lines for matches and dashes for gaps (insertions/deletions).
- Understand differences between global and local alignment to choose the right method.
By mastering the Needleman-Wunsch algorithm and heuristic approaches, bioinformatics practitioners can effectively perform global sequence alignments, essential for genomic and proteomic analyses.
but now we are going to solve what is known as a global alignment. Remember I talked about global
alignment and local alignment, and there are two different algorithms that we use: the Needleman-Wunsch
algorithm for global alignment and the Smith-Waterman algorithm for local alignment.
These alignments have three different stages: one is initialization, second is
matrix filling, third is traceback. Whichever algorithm we are going
to do, either Smith-Waterman or Needleman-Wunsch, there are three
stages: initialization, filling of the matrix, and traceback,
I-M-T. So this is how we are going to go. Now I'm not going to talk about any theoretical knowledge behind it because
I already told you that there are these two sequences. You can see sequence number one here on
the x-axis and sequence number two on the y-axis.
Now we are aligning these two sequences, so what is the procedure for
aligning them? Try to understand this process very clearly. You can pause this video, and whenever
you find it difficult, rewind that portion until it is clear.
As I mentioned, the process is initialization, matrix filling and traceback. What do we mean by
all this? That is what we are going to discuss.
The very first step is to draw this table. In this table we have an x-axis and a y-axis:
on the x-axis we have sequence one, on the y-axis we have sequence two, and right next to them I
created a small extra column and row. What you need to do is put zeros there.
You can either write the zeros or leave this column and row undrawn, it's up to you. Basically this is the
lowest value that we can get. Remember, when we are trying to align two sequences,
there are two possibilities at each position: either there is a
match, or there is no match (a mismatch or a gap). What is the scoring
scheme? If there is a match we give +1, and if there is no match we give 0, which we also call the
penalty. There are different approaches: in
earlier videos that I made almost 10 years ago, we set the value of no
match to -1 or something like that, but right now we are using +1 for a match and 0 for no
match. That means we always check each pair of letters across the two sequences, for example G with G: if there is a match
the value will be one, if there is a mismatch the value will be zero. We do that for every pair of positions
across the two sequences. This is the first point. The second point is that you also need to
be careful about how you add the values: whatever match or no-match value we are adding,
we are always adding it from the diagonal line remember that so let me tell you this simple idea for example in
the very beginning we have G and G both are a match due to the match we know that it will come down to one okay match
means plus one so we'll start with plus one there zero was the start point and actually we'll move diagonal
diagonal movement remember that diagonal movement so we'll start with one here diagonal movement start with one okay so
generally 0 + 1, the diagonal movement gives you one. Now try to
understand that this value is a free-floating value: the same value can be carried to the right-hand side, it can
be carried to the bottom, or it can flow along the diagonal. So one value that we got can move in three different
directions: to the next column, to the next row, or to the diagonal cell. That is the
possibility, remember that. So now, as per this rule, this one will move in three directions, and we are also
going to check the next place we're going to check the next place so G with another G like that so
we are going to do that always remember that very very important okay and when we are going to compare that g with G if
we have a match Plus One will be added to the diagonal value that is here in this case
zero so let me solve it otherwise you cannot understand so one will migrate to here
okay and what what else we'll get so one migrates there and this one will migrate here but at this point we
know that G and G is another match, so +1 will be added. We already have one here that should migrate across, and +1
should be added, so should we add 1 + 1 = 2? No. We add the match value +1 to
the diagonal value. The diagonal is also zero, so 0 + 1 = 1. We get one, no change in the value. Now
what we are going to see is going to see the rest how we are going to see that let's see that so again this one it will
also migrate to the next it can come at this point so you see a and g no match so again one
T and G no match it will come as one C and G no match it will come one G and G another match remember but again what 0
+ 1 it will be 1 okay so one all the values are one here in this side now in this side we have one this is also one
then a with a another match a with a so as per this rule this one can come to the bottom field but in this case a and
a match so plus one and if there is a plus one then the diagonal value it was one there so 1 + 1 gives us
2 got it if not repeat this place remember a matches with a so what we can put Plus One will be added to who this
one or this one this one diagonal one so get two so now this two will move to the next quite
easily isn't it it plays there like that now we're going to see a with t no match two c with a no match two G with a no
match. A with A is another match, so +1 will be added to this one, which gives us two. Remember, many people
make a mistake here: there was a two in the neighboring cell, so they add +1 to that two to get three. No:
whatever value we give for the match is added to the diagonal value, which was 1, so 1 + 1 = 2. Got it? This is how it works
and now what else we know so this is where AG the value will be
one this is 1 a with a it was already the value is already transferred to be two but still you can see that y it is
two either this two slides there or you can see that a and a match so 1 + 1 two that will also give us two right and
then a with t two a with c 2 a with g 2 and now at the end here you can see that a
with a another match so match means plus one it will be added to this diagonal 2 + 1 will be
3. Got it? Now let's move to the next one. In this case also, G and T gives one here, the normal sliding of one; this is also G and T,
sliding of one; and A and T, also one. T and T is a match here, and you know match means +1
will be added to the diagonal value: 2 + 1 will be three. Then the three migrates: C and T no match, three;
G and T no match, three; A and T no match, three. Next row, again one: G and T no match, one; A and T no match, one. Then T
and T at this point is a match, so +1 will be added here. I first wrote 1 + 1 = 2, but that was a mistake:
the two always slides to the right-hand side, so a two
had already migrated into the diagonal cell. With the match, 2 + 1 is three, not two.
Because of the sliding rule, the three will slide here as
well. So T with C no match, three; T with G no match, three; A with T no match,
three. This row starts with one: C and G no match, so one; A and G no match... but these two won't be
one. Why? Because a two is already there, so the two will slide and the value will be two in all these cases. Got it? Now T with
C, the value we already know. C with C is another match, so +1 is added to this three: 3 + 1 will be
four. Afterwards C with G no match, so it remains four; A and C no match, so 4 + 0 will be
4 end so again we start here then AG no match it will be one two will slide here obviously so a
and a another match you can clearly see that so 1 plus one will be two either you'll get two like this or like this
then from this two uh a and t no match so two but actually three sliding so this value will also be three okay and
then a with C A with C no match but again four we have already received here so this value will also be sliding to
four G with a the value will be four a with a now another match a with
a + one will be added to what value four 4 + 1 will be five five value will come here so one so you can simply write it
like this till the end G with G the value will be two you can see that because G with G one will
be added with this cross so it will be two okay so G with a the value will be two Okay g with t as per our rule the
value is two but three will slide here the value of three will slide here the value of four will be sliding here and
then there is a G and G so 1 plus will be added 4 + 1 will be five and this five value will slide there so at the
end here two will slide the end and here we have t with a so
two two and two no issues and then we have T and T So + 1 2 + 1 will be 3 then this three should slide here t and
c 3 + 0 3 but actually four already received so four will slide here as it is as a four not like so then we have G
and t five will slide there and there is a GNA so five will remain as it is and here we have a T and T
similarity you can see T and T similarity so 2 + 1 will be three right four will slide
here five will slide here five will slide here like that and at the end what else we have we have this value for a
right two earlier so a with t three will slide here four will slide here five will slide here and last we
have a with a so similarity Plus One will be added to five we'll get six so that is how we can fill the
matrix. So initialization is done: we put zeros along both the x-axis and the y-axis next to the sequences. Then
we put in the matrix values. The rule is that whatever value a cell receives will slide to the
right-hand side, one square at a time. We start
our journey from here, zero, which is our start point, and end our journey here, which
is our end point. The maximum value we received at the end is six, and the start point is
zero. The third step of this process is traceback. We trace back from the highest
value that we got at the end to the lowest value that we started with at the beginning.
Whatever value simply slid to the right-hand side is not important; the important cells are those where we added +1
along the diagonal. So where did we add values? We added a value here, where the value
changes to six; we added values here at this point, and here next to it;
then we added a value here. All the cross arrows, or diagonal arrows, are
the ones that we need to find. Going back from the highest value we received
to the zero is known as traceback, and in traceback we always look for the diagonal
Arrow so this is diagonal Arrow number one this is diagonal Arrow number two this is diagonal Arrow number three and
this is diagonal Arrow number four here at this point diagonal Arrow number four uh only this three this is diagonal
Arrow four then we have this two and we have another diagonal Arrow diagonal Arrow four this is diagonal Arrow five
and then what else what else we have this two as a one and we have a diagonal Arrow here at this point okay this
one okay six here here at this point actually here at this point six diagonal AR seven
so total seven diagonal Arrow were drawn not this these ones these ones are not the one because we actually follow the
sliding rule in all these cases these are the diagonal arrow that we form and actually it will also form a diagonal
straight line from zero to the maximum value that is six okay so seven different positions seven different
location seven different location where we have a diagonal diagonal Arrow and we get this value okay so what value
we got here we got this somewhere where between this to a at this point okay we have this somewhere near this G so think
about it we we have this okay we have this different locations so what are the locations we
have a G and G this is one location uh then we have this GA a that Gap there are two a gaps there present
okay and then we have this gap of T then we have a gap of two I mean between C and A there and
then we have this value between gtt there and final value six okay so now we know how to fill it we know how to back
trace it but now it's time to put it into the data set basically how exactly we represent the data so from this table
how would you represent the data it's very important that's what we'll do now all right so this is what we did I
just took a screenshot of that and put it and now I'm going to draw the actual line how to draw it how to draw it so
see basically what we know is that we have the similarity and dissimilarity try to understand it so so we have this
two sequences we need to put two sequences draw two sequences so this is sequence number one on the top
G okay and we have a similarity you can see the G the very first one g with G
there's a similarity so G with G there's a pairing we do a vertical line like this we draw it and then in the bottom
we have another G okay in the bottom we have a g but here we don't have any similarity where do we have similarity
with a okay so there's a gap here here then we have a on the top okay and what we have we have a
here some maybe some uh Gap in the middle but we have a and a another similarity then on the top what else we
have we have what shouldn't draw this so after a so we have repeatedly a a t
t c a g t t a okay and in the bottom you can see that we have a g there's a this this vertical line means there's a
pairing G and G this a with this a okay the second G don't have any pairing so you keep it blank the second G no
pairing so you keep it blank now after that what else we have we have similarity of a with
a this a with a okay so I either this a or that a which one obviously after G there's only
one a and this is the third a so of course this third a must be placed somewhere here and there should be
connection between them not between this a got it and then afterwards what else we know we know that there will be
another similarity between t here there's a gap another similarity between the C next to each other this t
this Tre then G and a then there's a gap G and there's a Gap a similarity similarity similarity
similarity how to know whether there is a gap or not how to know that okay very very important let me
tell you that as well now I'll be telling you that just focus on the box okay starting with this box starting
with this box box number one fill it up the second box is this box box number two third box and third and fourth is
this two then this two remember I already cleared it out why we say this different
boxes this one then this two okay and finally this box number six
These are the locations. This is the sequence alignment rule: after traceback, all you need to do is
highlight these cells so that you know where there's a pairing. Now try to understand:
whenever there is a run of shaded cells, the
start point of the
shaded run is
taken as the
match. So at the very beginning of the traceback, A with A, we put A with A as a match. Then where is the next stretch? As
I mentioned, we take the start point of the shaded cells, and the start point here is G, so upper G pairs with lower
G. Got it? Whatever is present in between is a gap, you don't need to do anything else.
only thing that you need to do here let me eras it and try to give you a clear idea about
it take a eraser let me erase this so that you understand this process quite well because it may be complicated a
little bit okay so let's assume this so first write
down the bigger sequence because obviously the Gap is present in the smaller sequence
so start with this a with a full match so I'll I'll take different color for the smaller sequence so a and a match
okay done a with a as per our rule the second one match with be start point of the Shaded cell so shaded cell start
point is this G so this G with this G so put G with g a match then what again start point of sh
cell this is C with this C so C with C match okay then what t with this t
Okay t with t then a with this a okay then another one g with G first so
basically this G with G or this G with G we start with the last one g with G
okay, G with G. So now we know the alignment: the upper sequence is already written, and the lower sequence is G, A, T, C, G, A,
but there's one more G that should be placed somewhere at this point. There's a gap here, a gap here and a
gap here. These are all gaps, and gaps are allowed, no issues. So this would be our
alignment. Whenever they ask an alignment question, they want an
answer like this, where you have the nucleotide sequences, a vertical line means the two letters are
the same, and a horizontal line means there's a gap in that area. This is how
you should solve this kind of problem. If they ask any problem like that you can solve it like this.
okay so that was global alignment. What happens in local alignment? Local alignment
matches only the most similar regions of the two sequences rather than forcing the whole length to align, so the aligned length will usually be
shorter. The differences are in the matrix filling and traceback steps.
In local alignment, when the scoring scheme has negative penalties, sometimes a cell value becomes negative, and if the value
becomes negative there's another rule: we set it to zero. Then we
continue marking as per the usual matrix filling rule. Traceback then starts from the highest-scoring cell and stops when it reaches a zero. That is the idea. This is how we
solve these global alignment and local alignment problems.
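The "set negatives to zero" rule for local alignment can be sketched as follows. Note the assumptions: with the lecture's 0/0 scoring nothing ever becomes negative, so this example uses -1 for mismatches and gaps, and `sw_fill` is a hypothetical name. It shows only the filling step of a Smith-Waterman style alignment.

```python
def sw_fill(seq_x, seq_y, match=1, mismatch=-1, gap=-1):
    """Fill a local-alignment matrix; negative candidates are clamped to zero."""
    rows, cols = len(seq_y) + 1, len(seq_x) + 1
    score = [[0] * cols for _ in range(rows)]  # first row and column stay zero
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if seq_y[i - 1] == seq_x[j - 1] else mismatch
            score[i][j] = max(0,                           # the extra local-alignment rule
                              score[i - 1][j - 1] + pair,  # diagonal
                              score[i - 1][j] + gap,       # up
                              score[i][j - 1] + gap)       # left
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    return score, best, best_pos


matrix, best, best_pos = sw_fill("TTACGTT", "ACG")
print(best, best_pos)  # the shared region ACG scores 3, ending at row 3, column 5
```

A local traceback would start at `best_pos` instead of the bottom-right cell and stop at the first zero.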
okay so this is how we perform global alignment. There is one more term, known as the
heuristic method. What is the
heuristic method?
A heuristic method relies on short words from the query sequence that you feed in, and that
query is compared against the database to perform the search. When we
search with a query sequence, the whole sequence is not searched for a match at once. Instead, some
fragments of the query sequence, short words, are used to find matches. This is the approach of
searching: whatever matches it finds first with these words, it then digs deeper around them for a
more thorough search. This way the process can be faster and more
efficient. To make the search fast we use heuristic methods. But there's always a question: when we go fast,
is the quality of the work compromised? In heuristic methods we
check two important parameters, the sensitivity of the test and the specificity of the test.
Sensitivity and specificity are both equally important
to check; without knowing them we cannot comment on any technique. To picture a heuristic method, let's say
there is a library and I give you a stretch
of a paragraph and ask which book it probably came from. What you're going to do is
check for some words in the paragraph to find out which genre it belongs to, whether it's
suspense, comedy, romance or
devotional. Based on that you search only
that section, and you get the result faster. That is the idea of the heuristic model. But again, as I told you, we
cannot compromise the quality, and that's why two things are
necessary: sensitivity and specificity. Based on them we have four different outcomes. What are the
four outcomes? First we have true positive: the test
is positive for a condition and that is true, which is desirable. True homologs are examples,
where sequences test positive for homology and really are homologous. Then there are false positives:
the test is positive but the condition is not actually true. Non-homologous sequences reported as matches are of
this type. False positive means the result is suggesting positive but
the sequences do not actually satisfy the criteria for a positive.
Third, true negative: in this case the test is negative and the condition
is really absent, so the negative result is true.
And there is false negative, where the test is negative but
it was supposed to be positive. These terms are not
specific to bioinformatics: for any kind of experiment or any
machine that we design, we always check sensitivity and specificity. So what are these terms?
Both sensitivity and specificity are expressed as
probabilities. Specificity is the probability of correctly predicting a negative
example, and sensitivity is the
probability of correctly predicting a positive example. So if anybody asks
you what sensitivity and specificity are in bioinformatics, you say that sensitivity is
the probability of correctly predicting a positive
example and specificity is the probability of correctly predicting a negative example.
As formulas, sensitivity equals true positives divided by (true positives
plus false negatives), and specificity equals true negatives divided by (true negatives plus
false positives). The related ratio of true positives to (true positives plus false positives), which some bioinformatics texts also label specificity, is more commonly called precision or positive predictive value.
You can memorize these or simply think about why each follows from the definitions we wrote; treat that as a kind of homework
because exams usually don't ask questions on sensitivity and specificity directly. But without
mentioning sensitivity and specificity we cannot properly discuss BLAST, the Basic Local Alignment Search Tool,
and any kind of tool in bioinformatics will give you a confidence value
with its results, because these are all algorithm-based software systems.
They report at what
confidence level they are showing you the data, and based on that we get our
search results in BLAST as well as in FASTA. We'll discuss BLAST and FASTA
in detail in later bioinformatics lectures, but I believe you now have a clear idea about this sequence
alignment process.
Global sequence alignment compares two nucleotide or protein sequences along their entire length to find the best overall match. In contrast, local alignment identifies matching subsequences within larger sequences. The Needleman-Wunsch algorithm is commonly used for global alignment, while Smith-Waterman is used for local alignment.
The Needleman-Wunsch algorithm involves three main steps: initialization of a scoring matrix with gap penalties, matrix filling by scoring matches, mismatches, and gaps (typically +1 for matches and 0 or penalties for mismatches/gaps), and traceback from the bottom-right cell to the top-left to determine the optimal alignment path. This process ensures the entire sequence length is optimally aligned.
A simple scoring scheme assigns +1 for matches and 0 or a penalty for mismatches and gaps. During matrix filling, each cell score is calculated considering scores from diagonal (match/mismatch), left (gap), and up (gap) directions, selecting the maximum to identify optimal alignment paths. This approach emphasizes aligning identical residues while penalizing gaps and mismatches.
Traceback starts at the bottom-right cell of the scoring matrix and moves through the path of optimal scores back to the top-left cell. Diagonal moves represent matches or mismatches, while horizontal or vertical moves signify gaps in either sequence. This path reconstruction reveals the aligned sequences with matched characters and gap positions clearly indicated.
Heuristic methods accelerate sequence searches by focusing on common subsequences or small query fragments (words) instead of aligning the whole query against every database entry. This targeted approach significantly increases computational efficiency and makes large-scale sequence searches practical, but because speed can come at the cost of quality, the sensitivity and specificity of the method must be checked.
Sensitivity measures the probability of correctly identifying true positive matches (correct alignments), while specificity measures the probability of correctly identifying true negatives (non-matches). Balancing both is crucial to minimize false positives (incorrect matches) and false negatives (missed matches), ensuring reliable and accurate alignment outcomes in bioinformatics analyses.
To improve accuracy, start with clear scoring criteria for matches, mismatches, and gaps, use matrix filling and traceback logically to generate alignments, and visually represent results with vertical bars for matches and dashes for gaps. Additionally, understand the distinctions between global and local alignment methods to select the best approach for your specific biological question.
Heads up!
This summary and transcript were automatically generated using AI with the Free YouTube Transcript Summary Tool by LunaNotes.

