Transcript Slide 1
BIOINFORMATICS IN BIOCHEMISTRY
Bioinformatics – a field at the interface of molecular biology,
computer science, and mathematics
Bioinformatics focuses on the analysis of molecular sequences (DNA,
RNA, and proteins)
The National Institutes of Health (NIH) definition of bioinformatics:
“research, development, or application of computational tools and
approaches for expanding the use of biological, medical, behavioral
or health data, including those to acquire, store, organize, analyze,
or visualize such data.”
How is bioinformatics important to biochemistry?
The tools of bioinformatics include algorithms and computer programs for
analysis of molecular sequences that reveal the structure and function of
macromolecules.
Bioinformatics analysis gives valuable information that can guide
experimental work.
AMINO ACID SEQUENCE ALIGNMENT
A way to compare 2 or more sequences;
The sequences are lined up (“aligned”), one above the other, so that each
residue of one sequence can be compared to the corresponding residue of
the other sequence;
Sometimes one sequence must be “cut,” and a gap introduced, in order to
make this sequence align in the optimal way with the other sequence.
An example of a pairwise amino acid sequence alignment (2 sequences):
sequence_1    1 MLFMCHQRVMKKEAEEKLKAEELRRARAAADIPIIWILGGPGCGKGTQCA 50
                             .|||||..:           ||:::||||.||||||.
sequence_2    1              MEEKLKKTK-----------IIFVVGGPGSGKGTQCE 26
All the residues that are identical in the two sequences are indicated with
the “|” symbol between them; residues that are chemically similar are
indicated with the “:” or “.” symbol, such as W and F (both have aromatic
side chains). Note that a gap (----- region) was introduced into
sequence_2 in order to make it align optimally with sequence_1.
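The marker line of such an alignment can be rebuilt from the two aligned sequences. The following is a minimal Python sketch, not the program that produced the slide: the function name `match_line` is hypothetical, and it marks only identical residues with "|". Reproducing the ":" and "." marks for chemically similar residues would require a substitution matrix, which is left out here.

```python
def match_line(seq_a: str, seq_b: str) -> str:
    """Return a marker line with '|' at every identical, non-gap column."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    return "".join(
        "|" if a == b and a != "-" else " "
        for a, b in zip(seq_a, seq_b)
    )


# Aligned region of the example: residues 14-50 of sequence_1 and
# residues 1-26 of sequence_2 (including its 11-position gap).
aligned_1 = "AEEKLKAEELRRARAAADIPIIWILGGPGCGKGTQCA"
aligned_2 = "MEEKLKKTK-----------IIFVVGGPGSGKGTQCE"

markers = match_line(aligned_1, aligned_2)
identities = markers.count("|")
gap_positions = aligned_1.count("-") + aligned_2.count("-")

print(aligned_1)
print(markers)
print(aligned_2)
print(f"identities: {identities}, gap positions: {gap_positions}")
```

Counting the "|" marks gives the number of identical residues, and counting the "-" characters gives the number of gap positions in the aligned region.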
BLAST – Basic Local Alignment Search Tool
A bioinformatics tool that allows users to compare a protein or DNA
sequence to databases of other protein or DNA sequences from many
organisms.
A web-based version is available free of charge at the National Center for
Biotechnology Information (NCBI) website:
http://www.ncbi.nlm.nih.gov/BLAST/
The output from a “BLAST search” is a series of sequence alignments.
EXAMPLE OF A BLAST SEARCH
Suppose you have the sequence of a human protein and want to know if
there is a homologous protein in the fruit fly Drosophila melanogaster.
The amino acid sequence of the human protein will be the “query” for the
BLAST search.
The BLAST algorithm compares the query sequence to all proteins in the
Drosophila genome.
The BLAST output will show a list of the Drosophila proteins that have
statistically significant sequence similarity to the human query protein. These Drosophila
proteins can be referred to as “BLAST hits.” Below this list of BLAST hits,
there will be a series of sequence alignments between the human query
protein and each Drosophila protein that is in the list of BLAST hits. The
first alignment will be between the query and the Drosophila protein that is
most similar in sequence; the second alignment will be between the query
and the Drosophila protein that is the second best match in terms of
sequence similarity… and so on.
The next slide shows just one of these alignments from a BLAST search.
The last 2 slides explain some of the features of the alignment.
Query = a human protein
Subject (sbjct) = the Drosophila protein that is most similar to this human protein
Sample from BLAST output (see explanation on next 2 slides):
>gi|24663208|ref|NP_729792.1| Adenylate kinase-1, [Drosophila melanogaster]
Length = 229
Score = 179 bits (453), Expect = 1e-45
Identities = 96/205 (47%), Positives = 131/205 (64%), Gaps = 15/205 (7%)
Query: 2   EEKLKKTK-----------IIFVVGGPGSGKGTQCEKIVQKYGYTHLSTGDLLRSEVSSG 50
           EEKLK  +           II+++GGPG GKGTQC KIV+KYG+THLS+GDLLR+EV+SG
Sbjct: 15  EEKLKAEELRRARAAADIPIIWILGGPGCGKGTQCAKIVEKYGFTHLSSGDLLRNEVASG 74

Query: 51  SARGKKLSEIMEKGQLVPLETVLDMLRDAMVAKVNTSKGFLIDGYPREVQQGEEFERRIG 110
           S +G++L  +M  G LV  + VL +L DA+     +SKGFLIDGYPR+  QG EFE RI
Sbjct: 75  SDKGRQLQAVMASGGLVSNDEVLSLLNDAITRAKGSSKGFLIDGYPRQKNQGIEFEARIA 134

Query: 111 QPTLLLYVDAGPETMTQRLLKRGETSG--RVDDNEETIKKRLETYYKATEPVIAFYEKRG 168
              L LY +   +TM QR++ R   S   R DDNE+TI+ RL T+ + T  ++  YE +
Sbjct: 135 PADLALYFECSEDTMVQRIMARAAASAVKRDDDNEKTIRARLLTFKQNTNAILELYEPKT 194

Query: 169 IVRKVNAEGSVDSVFSQVCTHLDAL 193
           +   +NAE  VD +F +V   +D +
Sbjct: 195 LT--INAERDVDDIFLEVVQAIDCV 217
First you will see sequence identification information for the subject (Drosophila)
protein in the alignment. This protein is called “Adenylate kinase-1”:
>gi|24663208|ref|NP_729792.1| Adenylate kinase-1, [Drosophila melanogaster]
Next you will see the total length of the subject protein, 229 amino acid residues:
Length = 229
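The identification line packs several fields between "|" separators. As a rough illustration, the Python sketch below splits it into its parts. The function name `parse_defline` and the field names are hypothetical, and the reading of "gi" as a GenInfo identifier and "ref" as a RefSeq accession is an assumption based on the usual NCBI convention; the slides do not explain these fields.

```python
import re

# Pattern for lines shaped like:
# >gi|<number>|<database>|<accession>| <description>, [<organism>]
DEFLINE = re.compile(r">gi\|(\d+)\|(\w+)\|([^|]+)\|\s*(.*?),?\s*\[(.+)\]\s*$")


def parse_defline(line: str) -> dict:
    """Split a BLAST identification line into named fields."""
    match = DEFLINE.match(line)
    if match is None:
        raise ValueError("line does not look like a gi|...|ref|...| header")
    gi, database, accession, description, organism = match.groups()
    return {
        "gi": gi,
        "database": database,
        "accession": accession,
        "description": description,
        "organism": organism,
    }


header = ">gi|24663208|ref|NP_729792.1| Adenylate kinase-1, [Drosophila melanogaster]"
fields = parse_defline(header)
for name, value in fields.items():
    print(f"{name}: {value}")
```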
Looking at the sequence alignment itself, you will see that it wraps around, taking up
3 ½ “rows.” One “row” is shown at the bottom of this slide. Residues 2 to 193 of the
query protein are aligned with residues 15 to 217 of the Drosophila protein (see the
numbers on the right and left sides of the previous slide). The “middle” line of each
row (the line between the query and subject lines) is called the “consensus
sequence.” Whenever a residue is identical in the query protein and the subject
protein, its one-letter code is repeated in this middle line. Whenever the two residues
are chemically similar (a conservative substitution), the position is marked with a ‘+’
symbol; positions that are neither identical nor similar are left blank. If one of the
sequences must be “cut” in order to align it with the other, each missing position is
shown with a “-” symbol in that sequence. This is referred to as a “gap” in the alignment.
Query: 2   EEKLKKTK-----------IIFVVGGPGSGKGTQCEKIVQKYGYTHLSTGDLLRSEVSSG 50
           EEKLK  +           II+++GGPG GKGTQC KIV+KYG+THLS+GDLLR+EV+SG
Sbjct: 15  EEKLKAEELRRARAAADIPIIWILGGPGCGKGTQCAKIVEKYGFTHLSSGDLLRNEVASG 74
Just above the sequence alignment itself you will see statistical information for the
alignment (essentially telling you “how similar” the two sequences are):
Score = 179 bits (453), Expect = 1e-45
Identities = 96/205 (47%), Positives = 131/205 (64%), Gaps = 15/205 (7%)
This tells you that of the 205 aligned positions (columns of the alignment, including
gap positions), 96 are identical between the query protein and the subject protein. Of
the 205 aligned positions, 131 are either identical OR similar (“+” symbol in the middle
line). 15 of the positions are gap positions (“-” symbol), spread over 3 separate gaps.
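These numbers can be recounted directly from the alignment rows. The Python sketch below is only an illustration of the arithmetic, not part of BLAST: the names `ROWS` and `alignment_stats` are hypothetical, and positives are taken as identities plus the "+" marks in the middle line, as described above.

```python
# Each tuple is (query line, middle line, subject line) for one row
# of the sample alignment.
ROWS = [
    ("EEKLKKTK-----------IIFVVGGPGSGKGTQCEKIVQKYGYTHLSTGDLLRSEVSSG",
     "EEKLK  +           II+++GGPG GKGTQC KIV+KYG+THLS+GDLLR+EV+SG",
     "EEKLKAEELRRARAAADIPIIWILGGPGCGKGTQCAKIVEKYGFTHLSSGDLLRNEVASG"),
    ("SARGKKLSEIMEKGQLVPLETVLDMLRDAMVAKVNTSKGFLIDGYPREVQQGEEFERRIG",
     "S +G++L  +M  G LV  + VL +L DA+     +SKGFLIDGYPR+  QG EFE RI",
     "SDKGRQLQAVMASGGLVSNDEVLSLLNDAITRAKGSSKGFLIDGYPRQKNQGIEFEARIA"),
    ("QPTLLLYVDAGPETMTQRLLKRGETSG--RVDDNEETIKKRLETYYKATEPVIAFYEKRG",
     "   L LY +   +TM QR++ R   S   R DDNE+TI+ RL T+ + T  ++  YE +",
     "PADLALYFECSEDTMVQRIMARAAASAVKRDDDNEKTIRARLLTFKQNTNAILELYEPKT"),
    ("IVRKVNAEGSVDSVFSQVCTHLDAL",
     "+   +NAE  VD +F +V   +D +",
     "LT--INAERDVDDIFLEVVQAIDCV"),
]


def alignment_stats(rows):
    """Count aligned positions, identities, positives and gap positions."""
    aligned = identities = similar = gaps = 0
    for query, middle, subject in rows:
        if len(query) != len(subject):
            raise ValueError("query and subject rows must have equal length")
        aligned += len(query)
        identities += sum(
            1 for q, s in zip(query, subject) if q == s and q != "-"
        )
        similar += middle.count("+")  # conservative substitutions
        gaps += query.count("-") + subject.count("-")
    return {
        "aligned": aligned,
        "identities": identities,
        "positives": identities + similar,
        "gaps": gaps,
    }


stats = alignment_stats(ROWS)
for key in ("identities", "positives", "gaps"):
    percent = round(100 * stats[key] / stats["aligned"])
    print(f"{key}: {stats[key]}/{stats['aligned']} ({percent}%)")
```

The percentages in the BLAST output are simply each count divided by the 205 aligned positions, rounded to a whole number.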
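The expect value, or E-value (1 × 10^-45 in this case; a very small number!), is the
number of alignments with a score at least this high that would be expected to occur by
chance in a database of the size that was searched. For very small values it is
essentially the probability that such an alignment arises by chance between two
unrelated sequences. The bottom line: the smaller the E-value, the more significant the
match, and the less likely it is that the similarity between the two sequences is due
to chance.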
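In the Karlin-Altschul statistics that BLAST reports, the E-value E is an expected count of chance hits, the probability of seeing at least one such hit is P = 1 - exp(-E), and E falls by half for every extra bit of score: E = m * n * 2^(-S'), where S' is the bit score. The Python sketch below illustrates both relations. The function names are hypothetical, and the values of m and n (the effective query and database lengths) are made-up placeholders, since the slides do not give them.

```python
import math


def evalue_from_bit_score(bit_score: float, m: float, n: float) -> float:
    """E = m * n * 2**(-S'), with m and n the effective search lengths."""
    return m * n * 2.0 ** (-bit_score)


def p_value_from_evalue(evalue: float) -> float:
    """P(at least one chance hit) = 1 - exp(-E), computed stably."""
    return -math.expm1(-evalue)


# E-value from the sample output: for tiny E, P and E are nearly equal.
p_small = p_value_from_evalue(1e-45)
print(f"E = 1e-45 -> P = {p_small:.3g}")

# For larger E the two diverge, because P can never exceed 1.
p_one = p_value_from_evalue(1.0)
print(f"E = 1     -> P = {p_one:.3f}")

# Hypothetical search space (NOT taken from the slides).
m, n = 200.0, 1.0e7
e_179 = evalue_from_bit_score(179, m, n)
e_180 = evalue_from_bit_score(180, m, n)
print(f"one extra bit of score divides E by {e_179 / e_180:.0f}")
```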