Comprehensive Guide to Sequence File Formats in Bioinformatics

Introduction to Sequence Analysis in Bioinformatics

Sequence analysis is a cornerstone of bioinformatics, involving the study of nucleotide (DNA/RNA) and amino acid sequences. Two main types of sequence data exist:

Primary sequence data: Generalized raw sequence data available in public databases like GenBank.
Secondary sequence data: Specialized, specific data sets derived from or related to primary sequences.

Types of Sequence File Formats

Sequence data are stored and shared in various file formats tailored for different applications and types of analysis.

1. Sequence File Formats

These formats store DNA or protein sequences primarily as text files with specific structure rules.

Raw Format: Contains only continuous nucleotide or protein sequences using IUPAC codes, no spaces or digits allowed.
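FASTA Format: Starts with a one-line header beginning with '>' that holds the sequence name and description, followed by the sequence data on the subsequent lines; blank space within the sequence is ignored. FASTA is widely accepted as input for multiple sequence alignment. A trailing '*' is sometimes used to mark the end of a sequence, although many FASTA files omit it. For detailed understanding, see Comprehensive Guide to FASTA: Algorithm, Types, and Comparison with BLAST.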
GenBank Flat File Format: Used by NCBI, divided into three parts:
- Header (metadata like source organism, taxonomy, biological significance, mutations).
- Features section detailing gene/transcriptional units.
- Sequence data in IUPAC codes.
GCG Format (Genetics Computer Group): Begins with annotation lines, including a type marker (NA for nucleic acid, AA for protein), followed by a line that ends in '..' and carries the sequence length, date, and type. The sequence follows that line.
GCG MSF Format (Multiple Sequence Format): The GCG layout for files that hold multiple sequences. The header line must contain 'MSF' together with the sequence length, type, and date information, and the sequences follow GCG conventions.
EMBL Format: Used by the European Molecular Biology Laboratory. Starts with an ID line and ends with a double slash '//'. Contains various descriptive lines (e.g., AC accession number, SV sequence version) each serving different annotation roles. For a broader context on bioinformatics tools associated with EMBL, refer to Comprehensive Insights into EBI and Essential Bioinformatics Tools.
PHYLIP Format: Used by molecular phylogeny software. The file starts with two numbers, the number of sequences and the number of characters in each sequence, and each sequence name is at most 10 characters long. There are two subformats:
- Interleaved: sequences presented in blocks.
- Sequential: sequences presented one after another.
NEXUS Format: Used by software like PAUP and MacClade. Provides metadata such as data type (DNA/protein), gap and missing character symbols, and sequence matrix.
ClustalW Format: Used for multiple sequence alignment. Starts with a header line indicating version, followed by aligned sequences in blocks of 60 residues with symbols indicating conservation:
- * Identical residues (fully conserved)
- : Strongly conserved substitutions (semi-conserved)
- . Weakly conserved substitutions
For more on ClustalW and related multiple sequence alignment formats, see Comprehensive Guide to BLAST: Basic Local Alignment Search Tool Explained, which complements understanding of alignment tools and formats.
PIR/NBRF Format: From the National Biomedical Research Foundation, similar to FASTA. Starts with '>' and a sequence type code (for example P1), then a semicolon and the sequence identifier, followed by a description line and then the sequence. Ends with '*'.
UniProt/Swiss-Prot Format: Annotated protein database entries similar to EMBL format but with additional line types such as GN (gene name) and OG (organelle, the location or origin of the gene), providing detailed gene and protein context. Additional insights can be found in Comprehensive Guide to Protein Databases: Types and Key Examples.

2. Molecular File Formats

These formats are primarily used to store three-dimensional structures of molecules such as proteins, often derived from experimental techniques like X-ray crystallography or NMR spectroscopy. The PDB format is one example; molecular file formats are not covered in detail here.

Key Considerations When Handling Sequence Files

Each format has a specific structure and metadata rules important for correct parsing.
Understanding format-specific symbols and headers is essential for proper interpretation.
Multiple sequence alignments are commonly exchanged in FASTA or ClustalW format; ClustalW output also includes symbols that mark conservation across sequences.
Molecular file formats complement sequence data by providing structural information.
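
The point about format-specific headers can be illustrated with a small first-line check. This is a rough heuristic sketch, not a validator: the function name guess_format is hypothetical, and the '!!' type markers for GCG files are an assumption about how such files commonly begin.

```python
def guess_format(text):
    """Guess a sequence file format from its first non-blank line (heuristic)."""
    first = next((line for line in text.splitlines() if line.strip()), "")
    stripped = first.strip()
    if stripped.startswith(">"):
        # PIR/NBRF headers look like ">P1;NAME": a two-character type code, then ';'.
        return "PIR/NBRF" if stripped[3:4] == ";" else "FASTA"
    if stripped.upper().startswith("#NEXUS"):
        return "NEXUS"
    if stripped.startswith("CLUSTAL"):
        return "ClustalW"
    if stripped.startswith("LOCUS"):
        return "GenBank"
    if stripped.startswith("ID "):
        return "EMBL or UniProt/Swiss-Prot"
    if stripped.startswith("!!"):
        # Assumed GCG-style markers such as "!!NA_SEQUENCE" or "!!AA_MULTIPLE_ALIGNMENT".
        return "GCG MSF" if "MULTIPLE_ALIGNMENT" in stripped else "GCG"
    parts = stripped.split()
    if len(parts) == 2 and all(part.isdigit() for part in parts):
        # Two integers: number of sequences and characters per sequence.
        return "PHYLIP"
    return "unknown"


print(guess_format(">FOSB_HUMAN protein fosB\nMFQAF\n"))
print(guess_format(" 2 12\ncow       ACGTACGTACGT\n"))
```

A first-line check like this cannot tell raw sequence text from arbitrary text, so anything it does not recognize is reported as "unknown".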

Practical Applications

Sequence file formats facilitate storage, exchange, and analysis of bioinformatics data.
Software tools for alignment, phylogenetics, and structural biology rely on specific supported formats. For example, the principles of Global Sequence Alignment Explained: Needleman-Wunsch Algorithm Guide underpin many phylogenetic analyses.
Proper knowledge of sequence formats enhances data integration and interpretation from multiple bioinformatics resources.

Conclusion

Knowledge of diverse sequence and molecular file formats is vital in bioinformatics for effective data management and analysis. Familiarity with formats like FASTA, GenBank, EMBL, and ClustalW enables researchers to utilize sequence data confidently and accurately across various applications.

All right, so now we are going to talk about the sequence part of sequence analysis. This is one of the biggest applications of bioinformatics, because in bioinformatics we are dealing with information, particularly sequences: nucleotide sequences as well as amino acid sequences. So sequence analysis is a very important thing.

Okay, so among the sequences that we get, there are primary sequence data and there are also secondary sequence data. Both are available. We have talked about what primary data is: data available as general sequence records, like GenBank sequences and all GenBank data. Secondary data are specific data, not generalized data. Both are equally important.

Sequence formats are the formats in which the different sequences are available, so we call them sequence file formats. There are different kinds of files, you know: just like an image is a type of file, a video is a type of file, and audio is a type of file, in this case we also have different types of files, sequence files and molecular files. So we have sequence file formats and molecular file formats.

A sequence file format represents DNA sequences and also protein sequences. There are a few examples of sequence file formats that we'll be discussing in a moment.

Molecular file formats are a little different. They are mostly composed of 3D structures of proteins, and the data is basically obtained from X-ray crystallography. We can feed in X-ray crystallography data directly, or we can feed in NMR spectroscopy data, and get a 3D structure of a protein, and those data are stored in molecular file formats. We'll see some examples of molecular file formats later on. Okay, let's first talk about the sequence file formats.

These are all examples of sequence file formats: raw format, FASTA format, GenBank file format, GCG format, GCG MSF format, EMBL format, PHYLIP format, NEXUS format, ClustalW format, PIR/NBRF format, and UniProt/Swiss-Prot format. So many different kinds, and each has its own format. Formatting here is like the case of a video, where we have the file name, then a dot, and then it can be mp4: file name dot mp4, file name dot avi, file name dot mpeg2. Those are file formats, and similarly we have file formats for sequences.

For example, if you start with the raw format: the raw format is just the sequence, either a DNA sequence or a protein sequence, nothing else. In this case only IUPAC characters are allowed. There are no digits and no spaces, so it is basically continuous writing, starting with, say, a small 'a' and whatever follows. These are all nucleotide sequences. Okay, this is the raw format: no spacing, remember that, and no digits.
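
The raw-format rule above (IUPAC letters only, no spaces, no digits) can be sketched as a simple check. This is a minimal sketch: the helper name is_raw_sequence and the two alphabet constants are hypothetical names chosen for illustration, and the protein alphabet assumes the 20 standard one-letter codes plus the ambiguity codes B, Z, and X.

```python
# IUPAC nucleotide letters: the four bases, U for RNA, and the ambiguity codes.
IUPAC_NUCLEOTIDE = frozenset("ACGTURYSWKMBDHVN")
# IUPAC one-letter amino acid codes, including the ambiguity codes B, Z, and X.
IUPAC_PROTEIN = frozenset("ACDEFGHIKLMNPQRSTVWYBZX")


def is_raw_sequence(text, alphabet=IUPAC_NUCLEOTIDE):
    """Return True if text is one unbroken run of letters from the alphabet."""
    # Any space, digit, or line break makes the check fail.
    return bool(text) and all(ch.upper() in alphabet for ch in text)


print(is_raw_sequence("acgtacgtnn"))             # letters only
print(is_raw_sequence("acgt acgt"))              # contains a space
print(is_raw_sequence("MKVLA", IUPAC_PROTEIN))   # protein letters
```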

Then we have the FASTA format, the second type. The FASTA format contains a one-line header. What is this header composed of? It starts with the '>' symbol, and the rest is the name and description of the sequence: FOSB_HUMAN, protein fosB, 338 base pairs. Again, in this case blank space is ignored, so we will not add any blank space in the FASTA format. FASTA is also a format that is accepted for multiple sequence alignment. And the termination of the sequence is marked with an asterisk, like this star mark.
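
A minimal sketch of reading the FASTA layout just described: a '>' header line, sequence lines with blank space ignored, and an optional terminating asterisk. The function name parse_fasta is hypothetical and the residues in the example are illustrative placeholders. In practice many FASTA files carry no asterisk, so the sketch treats it as optional.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            # Blank space inside sequence lines is ignored.
            chunks.append("".join(line.split()))
    if header is not None:
        records.append((header, "".join(chunks)))
    # Drop a terminating '*' if the file uses one.
    return [(name, seq.rstrip("*")) for name, seq in records]


example = """>FOSB_HUMAN protein fosB
MFQAF PGDYD
SGSRC*
"""
print(parse_fasta(example))
```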

Now we have the GenBank file format. It's an NCBI database, and they have their own file format, which is known as the flat file. It is divided into three different sections: header, features, and sequence.

What does the header consist of? A description of the sequence, the source organism, length, scientific name, biological significance, and taxonomy-related information. All these things are present in the header. It also carries information regarding transcriptional units and information regarding mutations, so it's a kind of all-in-one information, plus bibliographic references. The rest is the sequence: the sequence part represents the composition of the sequence in IUPAC symbols. So these are the three components of the flat file format.

There are different field headers, each with a description. For example, LOCUS is where a short name for the sequence is mentioned, and DEFINITION means the description of the entry. There are several such headers, so I'm not going to go through them individually. This is the flat file format, but remember, the flat file is NCBI's own file format, for NCBI's own databases.
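
The three-part layout of the flat file can be sketched as a simple splitter. This sketch assumes the features section opens with a line starting with FEATURES, the sequence section opens with a line starting with ORIGIN, and the record closes with '//'. The function name split_genbank and the tiny test record are hypothetical.

```python
def split_genbank(text):
    """Split one flat-file record into header lines, feature lines, and sequence."""
    header, features, sequence = [], [], []
    section = "header"
    for line in text.splitlines():
        if line.startswith("FEATURES"):
            section = "features"
        elif line.startswith("ORIGIN"):
            section = "sequence"
            continue
        elif line.startswith("//"):
            break
        if section == "header":
            header.append(line)
        elif section == "features":
            features.append(line)
        else:
            # Sequence lines carry position numbers and spaces; keep letters only.
            sequence.append("".join(ch for ch in line if ch.isalpha()))
    return header, features, "".join(sequence)


record = """LOCUS       TEST0001  12 bp  DNA
DEFINITION  Hypothetical test record.
FEATURES             Location/Qualifiers
     source          1..12
ORIGIN
        1 acgtacgtac gt
//
"""
header, features, sequence = split_genbank(record)
print(len(header), len(features), sequence)
```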

Then we have, I think this is the fourth one, the GCG file format. GCG means Genetics Computer Group, so this is the Genetics Computer Group format. What happens in this format is that it begins with annotation. This first line is where the sequence type and version are mentioned: they write either NA, for nucleic acid, or AA, for protein. So if it is NA_ followed by whatever value, it is nucleic acid, and if it is AA_ followed by whatever value, it is for protein.

The next line contains informative text, followed by two dots at the end. These two dots are there by choice; it is not just a dotted line. See, the sequence numbering starts from 1, and this 51 at the dotted line is the ending. For this test sequence the length is given as 4303, the date is April 10, 1997, and the time and type are all listed here.
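
A minimal sketch of reading a GCG-style file as described: annotation first, then a line ending in two dots that carries the length and type, then numbered sequence lines. The function name parse_gcg is hypothetical, and the 'Length:' and 'Type:' labels plus the example record are assumptions about the usual layout. The sketch does not verify any checksum.

```python
import re


def parse_gcg(text):
    """Split GCG text into annotation lines, header fields, and the sequence."""
    lines = text.splitlines()
    for index, line in enumerate(lines):
        if line.rstrip().endswith(".."):
            break
    else:
        raise ValueError("no line ending in '..' found")
    divider = lines[index]
    length = re.search(r"Length:\s*(\d+)", divider)
    seq_type = re.search(r"Type:\s*(\w)", divider)
    # Sequence lines carry position numbers and spaces; keep letters only.
    sequence = "".join(
        ch for row in lines[index + 1:] for ch in row if ch.isalpha()
    )
    return {
        "annotation": lines[:index],
        "length": int(length.group(1)) if length else None,
        "type": seq_type.group(1) if seq_type else None,
        "sequence": sequence,
    }


example = """!!NA_SEQUENCE 1.0
Hypothetical test sequence
test.seq  Length: 12  April 10, 1997 12:00  Type: N  Check: 0  ..

       1  ACGTACGTAC GT
"""
print(parse_gcg(example))
```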

Then we have another file format, GCG MSF. GCG was the Genetics Computer Group, but when a GCG file has multiple sequences in it, it appears in MSF format, the multiple sequence file. So what is the file format? It will be something like NA_ whatever value, with the version at the end, or it can be AA_; NA is for DNA and AA is for protein.

But there is one more catch. In this file format the first line is mandatory, and it must have MSF written there. This first line also indicates the sequence length, type, and date. After that, the next line is mandatorily kept blank. So: start with the MSF line, which includes the sequence length, type, and date information, then a blank line, then the sequences. The sequences start separately on the third line, and they follow the rules of the GCG format.
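
A rough sketch of reading an MSF-style file. It assumes a commonly seen layout that is slightly richer than the one described above: a header line containing 'MSF:' with the length and type, a '//' line separating the header from the alignment, and alignment blocks of name-plus-residues rows. The function name parse_msf and the sample file are hypothetical.

```python
import re


def parse_msf(text):
    """Return (header fields, {name: aligned sequence}) from MSF-style text."""
    lines = text.splitlines()
    header_line = next((line for line in lines if "MSF:" in line), None)
    if header_line is None:
        raise ValueError("no header line containing 'MSF:' found")
    length = re.search(r"MSF:\s*(\d+)", header_line)
    seq_type = re.search(r"Type:\s*(\w)", header_line)
    start = next((i for i, line in enumerate(lines) if line.strip() == "//"), None)
    if start is None:
        raise ValueError("no '//' separator found")
    sequences = {}
    for line in lines[start + 1:]:
        parts = line.split()
        # Skip blank lines and coordinate lines made only of numbers.
        if len(parts) < 2 or parts[0].isdigit():
            continue
        sequences[parts[0]] = sequences.get(parts[0], "") + "".join(parts[1:])
    fields = {
        "length": int(length.group(1)) if length else None,
        "type": seq_type.group(1) if seq_type else None,
    }
    return fields, sequences


example = """!!NA_MULTIPLE_ALIGNMENT 1.0
 test.msf  MSF: 12  Type: N  April 10, 1997 12:00  Check: 0 ..

 Name: seqA  Len: 12  Check: 0  Weight: 1.00
 Name: seqB  Len: 12  Check: 0  Weight: 1.00

//

seqA  ACGTAC
seqB  ACG.AC

seqA  GTACGT
seqB  GTAC.T
"""
print(parse_msf(example))
```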

Now we'll move to the sixth type, EMBL. This format is followed by the European Molecular Biology Laboratory, EMBL, for all the databases under EMBL. It has different types of lines. Each entry begins with an identification line, the ID line, and it is terminated with a double slash. Remember that: start with ID, terminate with a double slash. I'm not showing an example here, because when we do the practical you can see the details, but remember one thing: it starts with an ID line and ends with a double slash.

Multiple lines follow in between. From the ID line on there are all these different lines, XX, AC, SV, DT, and they have different roles to play; each has its own description. It's not that difficult to understand the EMBL file format, it's just that there are so many different lines and each line carries different information. For example, after ID there's an XX line, which is used in place of blank lines. Then we have the AC line, which holds the accession number. Then the SV line contains the nucleotide sequence identifier. There are multiple such sections. It's not feasible or scientific to try to memorize them all; you don't need to. If you go to the EMBL database and work with some sequence, just select individual lines and take the cursor there, and it will automatically show the features of those lines.
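
The line-code structure described above (ID first, XX as spacer, '//' as terminator) can be sketched as follows. This is a minimal sketch that assumes the sequence section begins at an SQ line; the function name parse_embl and the test entry are hypothetical.

```python
from collections import defaultdict


def parse_embl(text):
    """Group one EMBL-style entry by two-letter line code and collect the sequence."""
    fields = defaultdict(list)
    sequence = []
    in_sequence = False
    for line in text.splitlines():
        if line.startswith("//"):
            break                      # end of entry
        if not line.strip():
            continue
        if in_sequence:
            # Sequence lines carry spaces and a trailing count; keep letters only.
            sequence.append("".join(ch for ch in line if ch.isalpha()))
            continue
        code, data = line[:2], line[2:].strip()
        if code == "XX":
            continue                   # spacer line used in place of a blank line
        if code == "SQ":
            in_sequence = True
        fields[code].append(data)
    return dict(fields), "".join(sequence)


entry = """ID   TEST0001; SV 1; linear; genomic DNA; STD; UNC; 12 BP.
XX
AC   TEST0001;
XX
DE   Hypothetical test entry.
XX
SQ   Sequence 12 BP; 3 A; 3 C; 3 G; 3 T; 0 other;
     acgtacgtac gt                                                    12
//
"""
fields, sequence = parse_embl(entry)
print(fields["AC"], sequence)
```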

Next we have the PHYLIP format. This is used by molecular phylogeny software. There are two formats used here, the interleaved format and the sequential format; both are used.

In this particular example you can see it starts with two numbers, then the sequence lines come in, and at the end we have dotted lines. The first number represents the number of sequences in the file, and the second number is the total number of characters present in each sequence. From the next line onwards the sequences themselves are displayed. Each sequence is represented by a sequence title of not more than 10 characters: 1, 2, 3, 4, 5, 6, 7, 8, 9, you can see nine characters in both here. So the sequence title is not more than 10 characters, and then the rest of the sequence is provided.
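
A minimal sketch of the sequential layout just described: a first line with the two numbers, then one record per sequence with the name in the first 10 columns. It is simplified to the case where each sequence fits on a single line, and the function name parse_phylip_sequential is hypothetical.

```python
def parse_phylip_sequential(text):
    """Parse simple sequential PHYLIP text into {name: sequence}."""
    lines = [line for line in text.splitlines() if line.strip()]
    # First line: number of sequences, then number of characters per sequence.
    n_seqs, n_chars = (int(value) for value in lines[0].split())
    records = {}
    for line in lines[1:1 + n_seqs]:
        name = line[:10].strip()                 # title occupies at most 10 columns
        sequence = "".join(line[10:].split())    # spaces inside the sequence are dropped
        if len(sequence) != n_chars:
            raise ValueError(
                f"{name}: expected {n_chars} characters, got {len(sequence)}"
            )
        records[name] = sequence
    if len(records) != n_seqs:
        raise ValueError("fewer sequences than the first line promises")
    return records


example = (
    " 2 12\n"
    + "cow".ljust(10) + "ACGTACGTAC GT\n"
    + "chicken".ljust(10) + "ACGTTCGTAC GA\n"
)
print(parse_phylip_sequential(example))
```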

We have NEXUS as a file format. NEXUS is the file format you get if you retrieve data via the PAUP and MacClade programs; these software programs use the NEXUS file format. It starts with '#NEXUS': that is the term the file starts with. Then comes the beginning of the data block, and you can see the dimensions, where the number of characters is also stated, 60. In the format line the data type is DNA, the gap is denoted with a dash, and missing data is denoted with a question mark. So basically what the symbols mean is already stated at the top. Quite easy, quite simple, straightforward: the information regarding the dimensions, the character count, the type, whether it's DNA or protein data, a dash if a gap is present, a question mark if data is missing, is all stated up front. What else do we have? We have the matrix here: cow, chicken, mouse, human, rat, and for each of them the data is provided. That is NEXUS. NEXUS is not needed for general purposes; it's only needed if you are working with the MacClade or PAUP software.
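
A small sketch of pulling out the settings and the matrix from a NEXUS data block like the one described. It assumes the key=value spelling (ntax, nchar, datatype, gap, missing) and a matrix terminated by a semicolon; the function name parse_nexus and the two-taxon example are hypothetical.

```python
import re


def parse_nexus(text):
    """Return (settings, matrix) from a simple NEXUS data block."""
    pairs = re.findall(
        r"\b(ntax|nchar|datatype|gap|missing)\s*=\s*([^\s;]+)",
        text,
        flags=re.IGNORECASE,
    )
    settings = {key.lower(): value for key, value in pairs}
    matrix = {}
    block = re.search(r"\bmatrix\b(.*?);", text, flags=re.IGNORECASE | re.DOTALL)
    if block:
        for line in block.group(1).splitlines():
            parts = line.split()
            if len(parts) >= 2:
                # The first token is the taxon name, the rest is sequence data.
                matrix[parts[0]] = matrix.get(parts[0], "") + "".join(parts[1:])
    return settings, matrix


example = """#NEXUS
begin data;
  dimensions ntax=2 nchar=12;
  format datatype=DNA gap=- missing=?;
  matrix
    cow      ACGTACGTAC-T
    chicken  ACGT?CGTACGT
  ;
end;
"""
print(parse_nexus(example))
```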

Then comes the ClustalW format. There are different versions of ClustalW; the version mentioned here is 1.82. ClustalW is another format which we can use for multiple sequence alignment, or MSA. There are two formats that we predominantly use for multiple sequence alignments: one is the FASTA format, the other is the ClustalW format. The most recent ClustalW version may have changed from this value, since you are seeing an earlier recorded version here.

So what is this format? There are ClustalW and ClustalX. The file starts with the word CLUSTAL, all caps, followed by the version used in this example. Then the alignment is written in blocks of 60 residues. You can see the alignment data for mouse and human here. Every block starts with the sequence name: the first sequence name in this case is the mouse one, the second is the human one, and then the sequence follows. A count of the total number of residues is shown at the end of the line.

You can see here at the end what we have: four stars, one dot, a star, and two stars. What do they mean? A star means the residues in that column are identical in all sequences. As you can see: identical, identical, identical, identical, identical, then different. So for the different, semi-conserved positions, a dot marks the conserved substitution. These three places are different, and that is already marked here. A star means the residues in the column are identical in all sequences, a conserved position, and this symbol means a semi-conserved substitution: the conserved nature is there, but it is semi-conserved, not totally conserved. Maybe due to some sort of substitution, the conservation is not 100%.
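
A minimal sketch of reading ClustalW blocks and recomputing only the star marks. It assumes sequence rows start in the first column and the conservation row starts with blank space. The function names and the two short sequences are hypothetical, and the ':' and '.' symbols are deliberately not reproduced, since they depend on residue similarity groups that this sketch does not model.

```python
def parse_clustal(text):
    """Parse ClustalW text into {name: aligned sequence}."""
    lines = text.splitlines()
    if not lines or not lines[0].startswith("CLUSTAL"):
        raise ValueError("file does not start with CLUSTAL")
    sequences = {}
    for line in lines[1:]:
        # Skip blank lines and conservation rows (which start with blank space).
        if not line.strip() or line[0].isspace():
            continue
        parts = line.split()
        # parts[0] is the name, parts[1] the residues; a trailing count is ignored.
        sequences[parts[0]] = sequences.get(parts[0], "") + parts[1]
    return sequences


def identity_line(sequences):
    """Mark columns where every sequence has the same residue with '*'."""
    columns = zip(*sequences.values())
    return "".join(
        "*" if len(set(column)) == 1 and column[0] != "-" else " "
        for column in columns
    )


example = """CLUSTAL W (1.82) multiple sequence alignment

seq_mouse  MFQAFPGDYD 10
seq_human  MFQWFPGDYD 10
           *** ******
"""
alignment = parse_clustal(example)
print(identity_line(alignment))
```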

Then what we have is the PIR/NBRF format. That is from the National Biomedical Research Foundation, the NBRF. The PIR format is very similar to the FASTA format, although FASTA is more widely used. It also starts with the '>' sign, like FASTA, and the symbol is followed by the sequence type code. Here the type code is P1, then a semicolon, and then the sequence identifier, CRAB_ANAPL. The next line is the sequence name and description: alpha crystallin B chain, and the description says 200 bases. So the protein chain here is alpha crystallin B, and it's an amino acid sequence that we're looking at. A sequence can obviously be either amino acid or nucleotide; in this case it's an amino acid sequence. The sequence then follows from the next line, and once the sequence ends, a star is placed. So whenever there's a star, we can tell that the sequence has ended. It's very important to mark the start and the end: we mark the start with this symbol and the end with the star. Very similar to the FASTA format.
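
The PIR/NBRF layout just described ('>' plus type code, semicolon, identifier, then a description line, then sequence ending in a star) can be sketched like this. The function name parse_pir is hypothetical and the residues in the example are illustrative placeholders.

```python
def parse_pir(text):
    """Parse PIR/NBRF text into a list of record dictionaries."""
    records = []
    lines = iter(text.splitlines())
    for line in lines:
        if not line.startswith(">"):
            continue
        # Header looks like ">P1;CRAB_ANAPL": type code, ';', identifier.
        type_code, _, name = line[1:].partition(";")
        description = next(lines, "").strip()
        chunks = []
        for seq_line in lines:
            chunks.append("".join(seq_line.split()))
            if seq_line.rstrip().endswith("*"):
                break                      # the star marks the end of the sequence
        records.append({
            "type": type_code,
            "name": name.strip(),
            "description": description,
            "sequence": "".join(chunks).rstrip("*"),
        })
    return records


example = """>P1;CRAB_ANAPL
ALPHA CRYSTALLIN B CHAIN
MDITIHNPLI RRPLFSWLAP
SRIFDQIFGE*
"""
print(parse_pir(example))
```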

What else do we have? The UniProt/Swiss-Prot format. UniProt and Swiss-Prot are protein databases, annotated protein databases, and each entry is composed of lines for standardization purposes. If you look closely, the UniProt/Swiss-Prot format is very similar to the EMBL format. The similarity is there, but there are small differences too. For example, there is a line called the GN line, gene name, which contains the name of the gene, that is, the gene that is translated to the protein. There is also another line, the OG line. GN is the gene name line, and OG is organelle: it indicates the location or origin of the gene, that is, the organelle. So both pieces of information are added on top of the EMBL type of format. It's a similar format, but the gene name line gives the name of the gene coding for the protein, and the OG line gives the origin of the gene, the organelle.
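
A short sketch of picking out the two line types mentioned above from a UniProt/Swiss-Prot style entry. It assumes the EMBL-like convention of a two-letter code at the start of each line; the function name gene_and_organelle_lines and the sample entry are hypothetical.

```python
def gene_and_organelle_lines(text):
    """Collect the contents of GN (gene name) and OG (organelle) lines."""
    gene_names, organelles = [], []
    for line in text.splitlines():
        code, data = line[:2], line[2:].strip()
        if code == "GN":
            gene_names.append(data)
        elif code == "OG":
            organelles.append(data)
    return gene_names, organelles


entry = """ID   TEST_HUMAN  Reviewed;  12 AA.
GN   Name=TESTG;
OG   Mitochondrion.
//
"""
print(gene_and_organelle_lines(entry))
```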

Heads up!

This summary and transcript were automatically generated using AI with the Free YouTube Transcript Summary Tool by LunaNotes.

Related Summaries

Comprehensive Guide to Molecular File Formats for Protein 3D Modeling

Explore the essential molecular file formats like PDB, mmCIF, CHARMM, MDL, and Mopac used in protein 3D structure modeling. Understand their specific sections, applications in crystallography and molecular dynamics, and learn about key file conversion tools to integrate diverse data sources effectively.

Comprehensive Insights into EBI and Essential Bioinformatics Tools

Explore the pivotal role of the European Bioinformatics Institute (EBI) in managing diverse biological databases and discover key bioinformatics tools for sequence analysis, pattern recognition, and structural comparison. Understand the synergy between wet labs and dry labs in modern bioinformatics and how EBI supports genomic and proteomic research.

Comprehensive Guide to Protein Databases: Types and Key Examples

Explore the main types of protein databases including sequence, structure, family/domain, and interaction databases. Learn about essential examples like PRITE, Swiss 2D-PAGE, SugarBindDB, and SwissVar that support protein analysis and research in bioinformatics.

Comprehensive Guide to FASTA: Algorithm, Types, and Comparison with BLAST

Explore how the FASTA algorithm performs sequence similarity searches using k-tuples, dot plots, and local alignment with dynamic programming. Understand different FASTA types like TFAST and FASTX/Y and how they compare protein and nucleotide sequences, highlighting differences from BLAST.

Comprehensive Guide to BLAST: Basic Local Alignment Search Tool Explained

This article provides an in-depth overview of BLAST, the Basic Local Alignment Search Tool developed by NCBI, explaining its algorithm, practical usage, scoring system, and various types of BLAST services. Understand how BLAST processes sequences, filters low complexity regions, scores matches, and identifies significant alignments in nucleotide and protein databases.