Biology and computers


Module 2
Sequence DBs and Similarity Searches
Learning objectives
Understand how information is stored in GenBank.
Learn how to read a GenBank flat file.
Learn how to search GenBank for information.
Understand the difference between header, features and sequence.
Learn the difference between a primary database and a secondary database.
Principle of similarity searches using the BLAST program.
What is GenBank?
Gene sequence database
Annotated records that represent single contiguous stretches of DNA or RNA; a record may have more than one coding region (limit 350 kb)
Generated from direct submissions to the DNA sequence databases by the authors.
Part of the International Nucleotide Sequence Database Collaboration.
International Nucleotide Sequence Database Collaboration: GenBank (NCBI), EMBL (EBI, United Kingdom) and DDBJ (Japan) exchange information on a daily basis.
History of GenBank
Began with the Atlas of Protein Sequence and Structure (Dayhoff et al., 1965)
In 1986 it began collaborating with EMBL, and in 1987 with DDBJ.
It is a primary database (i.e., experimental data is placed into it)
Examples of secondary databases derived from GenBank/EMBL/DDBJ: Swiss-Prot, PIR.
The GenBank Flat File (GBFF) is a human-readable form of the records.
General Comments on GBFF
Three sections:
1) Header: information about the whole record
2) Features: description of annotations, each represented by a key
3) Nucleotide sequence: each record ends with // on its last line

DNA-centered: the translated sequence is only a feature
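The three-section layout can be illustrated with a minimal Python sketch. It assumes the header runs up to the FEATURES line, the features run up to BASE COUNT or ORIGIN, and the record stops at //. The function name split_flat_file and the shortened example record are illustrative only, not part of any GenBank tool.

```python
def split_flat_file(record_text):
    """Split one GenBank flat file record into header, features and sequence lines."""
    sections = {"header": [], "features": [], "sequence": []}
    current = "header"
    for line in record_text.splitlines():
        if line.startswith("//"):          # end-of-record marker
            break
        if line.startswith("FEATURES"):
            current = "features"
        elif line.startswith(("BASE COUNT", "ORIGIN")):
            current = "sequence"
        sections[current].append(line)
    return sections


# Shortened example record (most lines omitted for brevity).
example = """\
LOCUS       SCU49845     5028 bp    DNA             PLN       21-JUN-1999
ACCESSION   U49845
FEATURES             Location/Qualifiers
     CDS             <1..206
                     /product="TCP1-beta"
ORIGIN
        1 gatcctccat atacaacggt atctccacct caggtttaga tctcaacaac ggaaccattg
//
"""

sections = split_flat_file(example)
for name, lines in sections.items():
    print(name, len(lines))
```

For real work an established parser is a better choice than this sketch; the point here is only that the section boundaries are marked by keywords at the start of a line.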
Feature Keys
Purpose:
1) Indicates biological nature of sequence
2) Supplies information about changes to sequences

Feature Key     Description
conflict        Separate determinations of the same sequence differ
rep_origin      Origin of replication
protein_bind    Protein binding site on DNA
CDS             Protein coding sequence
Feature Keys-Terminology
Feature Key     Location/Qualifiers
CDS             23..400
                /product="alcohol dehydro."
                /gene="adhI"
Interpretation: The feature CDS is a coding sequence beginning at base 23 and ending at base 400, has a product called "alcohol dehydrogenase" and corresponds to the gene called "adhI".
Feature Keys-Terminology (Cont.)
Feature Key     Location/Qualifiers
CDS             join(544..589,688..1032)
                /product="T-cell recep. B-ch."
                /partial
Interpretation: The feature CDS is a partial coding sequence formed by joining the indicated elements to form one contiguous sequence encoding a product called T-cell receptor beta-chain.
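Reading a location can be sketched in a few lines of Python. This is a simplified illustration that handles only the forms shown on these slides (plain ranges, join, complement and the < or > partial markers); the function names parse_location and spliced_length are hypothetical.

```python
import re


def parse_location(location):
    """Return (strand, partial, segments) for a simple feature location string."""
    text = location.replace(" ", "")
    strand = "-" if text.startswith("complement(") else "+"
    partial = "<" in text or ">" in text        # 5' or 3' end not complete
    segments = [(int(start), int(end))
                for start, end in re.findall(r"[<>]?(\d+)\.\.[<>]?(\d+)", text)]
    return strand, partial, segments


def spliced_length(segments):
    """Total number of bases covered once all segments are joined."""
    return sum(end - start + 1 for start, end in segments)


for loc in ["23..400", "join(544..589,688..1032)", "complement(3300..4037)", "<1..206"]:
    strand, partial, segments = parse_location(loc)
    print(loc, strand, partial, segments, spliced_length(segments))
```

Note that in the T-cell receptor example the record flags the feature as partial with a /partial qualifier rather than with a < or > symbol, so this sketch would not report it as partial from the location alone.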
Record from GenBank
LOCUS       SCU49845     5028 bp    DNA             PLN       21-JUN-1999
DEFINITION  Saccharomyces cerevisiae TCP1-beta gene, partial cds, and
            Axl2p (AXL2) and Rev7p (REV7) genes, complete cds.
ACCESSION   U49845
VERSION     U49845.1  GI:1293613
KEYWORDS    .
SOURCE      baker's yeast.
  ORGANISM  Saccharomyces cerevisiae
            Eukaryota; Fungi; Ascomycota; Hemiascomycetes; Saccharomycetales;
            Saccharomycetaceae; Saccharomyces.
Annotations:
PLN: GenBank division (plant, fungal and algal)
21-JUN-1999: modification date
cds: coding region
ACCESSION: unique identifier (never changes)
VERSION: nucleotide sequence identifier in accession.version form (changes when there is a change in sequence)
GI: GeneInfo identifier (changes whenever there is a change)
KEYWORDS: word or phrase describing the sequence (not based on controlled vocabulary). Not used in newer records.
SOURCE: common name for organism
ORGANISM: formal scientific name for the source organism and its lineage, based on the NCBI Taxonomy Database
Record from GenBank (cont.1)
REFERENCE   1  (bases 1 to 5028)
  AUTHORS   Torpey,L.E., Gibbs,P.E., Nelson,J. and Lawrence,C.W.
  TITLE     Cloning and sequence of REV7, a gene whose function is required
            for DNA damage-induced mutagenesis in Saccharomyces cerevisiae
  JOURNAL   Yeast 10 (11), 1503-1509 (1994)
  MEDLINE   95176709
REFERENCE   2  (bases 1 to 5028)
  AUTHORS   Roemer,T., Madden,K., Chang,J. and Snyder,M.
  TITLE     Selection of axial growth sites in yeast requires Axl2p, a
            novel plasma membrane glycoprotein
  JOURNAL   Genes Dev. 10 (7), 777-793 (1996)
  MEDLINE   96194260
REFERENCE   3  (bases 1 to 5028)
  AUTHORS   Roemer,T.
  TITLE     Direct Submission
  JOURNAL   Submitted (22-FEB-1996) Terry Roemer, Biology, Yale University,
            New Haven, CT, USA
Annotations:
REFERENCE 1: oldest reference first
MEDLINE: Medline UID
REFERENCE 3: submitter of sequence (always the last reference)
Record from GenBank (cont.2)
There are three parts to a feature: a key (a keyword that indicates the functional group), a location (instruction for finding the feature), and qualifiers (auxiliary information about the feature)
FEATURES             Location/Qualifiers
     source          1..5028
                     /organism="Saccharomyces cerevisiae"
                     /db_xref="taxon:4932"
                     /chromosome="IX"
                     /map="9"
     CDS             <1..206
                     /codon_start=3
                     /product="TCP1-beta"
                     /protein_id="AAA98665.1"
                     /db_xref="GI:1293614"
                     /translation="SSIYNGISTSGLDLNNGTIADMRQLGIVESYKLKRAVVSSASEA
                     AEVLLRVDNIIRARPRTANRQHM"
Annotations:
source, CDS: keys
1..5028: location
/organism, /db_xref, /chromosome, /map: qualifiers
/db_xref: database cross-refs
<1..206: partial sequence on the 5' end. The 3' end is complete.
/codon_start=3: start of open reading frame
/product: descriptive free text must be in quotations
/protein_id: protein sequence ID #
Text after "=": values
/translation: note that this is only a partial sequence
Record from GenBank (cont.3)
     gene            687..3158
                     /gene="AXL2"
     CDS             687..3158
                     /gene="AXL2"
                     /note="plasma membrane glycoprotein"
                     /codon_start=1
                     /function="required for axial budding pattern of S.
                     cerevisiae"
                     /product="Axl2p"
                     /protein_id="AAA98666.1"
                     /db_xref="GI:1293615"
                     /translation="MTQLQISLLLTATISLLHLVVATPYEAYPIGKQYPPVARVN. . ."
     gene            complement(3300..4037)
                     /gene="REV7"
     CDS             complement(3300..4037)
                     /gene="REV7"
                     /codon_start=1
                     /product="Rev7p"
                     /protein_id="AAA98667.1"
                     /db_xref="GI:1293616"
                     /translation="MNRWVEKWLRVYLKCYINLILFYRNVYPPQSFDYTTYQSFNLPQ . . ."
Annotations:
687..3158 and complement(3300..4037): new locations
. . .: translation cut off
Record from GenBank (cont.4)
BASE COUNT     1510 a   1074 c    835 g   1609 t
ORIGIN
        1 gatcctccat atacaacggt atctccacct caggtttaga tctcaacaac ggaaccattg
       61 ccgacatgag acagttaggt atcgtcgaga gttacaagct aaaacgagca gtagtcagct . . .
//
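As a small illustration of how the sequence section is laid out, the sketch below strips the position numbers and spaces from the two ORIGIN lines shown above and tallies the bases. It also checks that the four BASE COUNT values add up to the 5028 bp given on the LOCUS line. The helper name sequence_from_origin is an assumption made for this example.

```python
from collections import Counter

# The two sequence lines shown on the slide (the rest of the record is cut off).
origin_lines = [
    "        1 gatcctccat atacaacggt atctccacct caggtttaga tctcaacaac ggaaccattg",
    "       61 ccgacatgag acagttaggt atcgtcgaga gttacaagct aaaacgagca gtagtcagct",
]


def sequence_from_origin(lines):
    """Keep only the letters; position numbers and spaces are layout."""
    return "".join(ch for line in lines for ch in line if ch.isalpha())


sequence = sequence_from_origin(origin_lines)
counts = Counter(sequence)
print(len(sequence), dict(counts))

# BASE COUNT values from the record should sum to the LOCUS length.
reported = {"a": 1510, "c": 1074, "g": 835, "t": 1609}
print(sum(reported.values()))
```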
Primary databases contain experimental biological information
GenBank/EMBL/DDBJ
Alu: Alu repeats in human DNA
dbEST: expressed sequence tags, single-pass cDNA sequences (high error frequency)
It is non-redundant
HTGS: high-throughput genomic sequence database (errors!)
PDB: three-dimensional structure coordinates of biological molecules
PROSITE: database of protein domain/function relationships.
Types of secondary databases that contain biological information
dbSTS: non-redundant database of sequence-tagged sites (useful for physical mapping)
Genome databases (there are over 20 genome databases that can be searched)
EPD: eukaryotic promoter database
NR: non-redundant GenBank+EMBL+DDBJ+PDB. Entries with 100% sequence identity are merged as one.
Vector: a subset of GenBank containing vector DNA
ProDom
PRINTS
BLOCKS
Workshop 2A: Look up a GenBank record. Use the annotations to determine the first open reading frame.
Similarity Searching
It is easy to score whether an amino acid is identical to another (the score is 1 if identical and 0 if not). However, it is not easy to give a score for amino acids that are somewhat similar.
[Figure: structures of isoleucine and leucine]
Should they get a 0 (non-identical), a 1 (identical), or something in between?
Purpose of finding differences
and similarities of amino acids.
Infer structural information
Infer functional information
Infer evolutionary relationships
Evolutionary Basis of Sequence
Alignment
1. Similarity: Quantity that relates to how
alike two sequences are.
2. Identity: Quantity that describes how alike
two sequences are in the strictest terms.
3. Homology: a conclusion drawn from data
suggesting that two genes share a common
evolutionary history.
Evolutionary Basis of Sequence
Alignment (Cont. 1)
1. Example: Shown on the next page is a pairwise alignment of
two proteins. One is mouse trypsin and the other is crayfish
trypsin. They are homologous proteins. The sequences
share 41% identity.
2. Underlined residues are identical. Asterisks and diamond
represent those residues that participate in catalysis. Five
gaps are placed to optimize the alignment.
Evolutionary Basis of Sequence
Alignment (Cont. 2)
Why are there regions of identity?
1) Conserved function: residues participate in the reaction.
2) Structural: residues participate in maintaining the structure of the protein. (For example, conserved cysteine residues that form a disulfide linkage)
3) Historical: residues that are conserved solely due to a common ancestor gene.
Evolutionary Basis of Sequence
Alignment (Cont. 3)
Note: It is possible that two proteins share a high degree of
similarity but have two different functions. For example,
human gamma-crystallin is a lens protein that has no known
enzymatic activity. It shares a high percentage of identity with
E. coli quinone oxidoreductase. These proteins likely had a
common ancestor but their functions diverged.
Analogous to a railroad car that has been converted into a diner: the structure is kept but the function has changed.
Modular nature of proteins
The previous alignment was global. However,
many proteins do not display global patterns of
similarity. Instead, they possess local regions of
similarity.
Proteins can be thought of as assemblies of
modular domains. It is thought that this may, in
some cases, be due to a process known as exon
shuffling.
Modular nature of proteins (cont. 1)
Gene A: Exon 1a, Exon 2a
Duplication
Gene B: Exon 1a, Exon 2a, Exon 2a
Exchange
Gene A: Exon 1a, Exon 2a, Exon 3 (Ex. 2b from Gene B)
Gene B: Exon 1b, Exon 2b, Exon 3 (Ex. 2a from Gene A)
Dot Plots
[Dot plot, window = 1: one sequence (A A T G C C T A G) is plotted against the other (T G C C T A G); each * marks a position where the two bases match]
Window = 1
Note that 25% of the table will be filled due to random chance (a 1 in 4 chance at each position).
Dot Plots with window = 2
[Dot plot of the same two sequences with window = 2: a * is placed only where two consecutive bases match]
Window = 2
The larger the window, the more noise can be filtered.
What is the percent chance that you will receive a match randomly? 1/16 * 100 = 6.25%
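A dot plot with a sliding window can be sketched in a few lines of Python. This is a minimal illustration that requires every base in the window to match exactly; the two sequences are read from the axis labels of the plots, and the function names dot_plot and render are chosen for this example.

```python
def dot_plot(seq_a, seq_b, window=1):
    """Return the set of (i, j) where the windows starting at i and j are identical."""
    dots = set()
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            if seq_a[i:i + window] == seq_b[j:j + window]:
                dots.add((i, j))
    return dots


def render(seq_a, seq_b, dots):
    """Draw the plot as text: seq_b across the top, seq_a down the side."""
    lines = ["  " + " ".join(seq_b)]
    for i, base in enumerate(seq_a):
        row = " ".join("*" if (i, j) in dots else "." for j in range(len(seq_b)))
        lines.append(base + " " + row)
    return "\n".join(lines)


seq_a, seq_b = "AATGCCTAG", "TGCCTAG"
for window in (1, 2):
    dots = dot_plot(seq_a, seq_b, window)
    print("Window =", window, "dots =", len(dots),
          "chance of a random match =", 0.25 ** window)
    print(render(seq_a, seq_b, dots))
```

With window = 1 the off-diagonal dots are chance matches; with window = 2 only the diagonal run survives, which is the noise filtering described above.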
Identity Matrix
A   1
C   0   1
I   0   0   1
L   0   0   0   1
    A   C   I   L
Simplest type of scoring matrix
Scoring Matrices
Importance of scoring matrices
Scoring matrices appear in all analyses involving
sequence comparisons.
The choice of matrix can strongly influence the
outcome of the analysis.
Scoring matrices implicitly represent a particular
theory of sequence alignment.
Understanding theories underlying a given scoring
matrix can aid in making the proper choice when
performing sequence alignments.
Scoring Matrices
When we consider scoring matrices, we encounter the convention that matrices have numeric indices corresponding to the rows and columns of the matrix. For example, M11 refers to the entry at the first row and the first column. In general, Mij refers to the entry at the ith row and the jth column. To use this for sequence alignment, we simply associate a numeric index with each letter in the alphabet of the sequence. For example, if the alphabet is {A,C,T,G}, then A = 1, C = 2, T = 3 and G = 4, so an A aligned with an A is scored by entry 1,1, an A aligned with a C by entry 1,2, etc.
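The index convention can be made concrete with a short Python sketch. It uses the identity matrix as the scoring matrix and the alphabet {A,C,T,G}; note that Python counts from 0, so entry 1,1 on the slide is matrix[0][0] here. The names index, score and score_alignment are illustrative.

```python
alphabet = ["A", "C", "T", "G"]
# Map each letter to its row/column position (0-based in Python).
index = {letter: position for position, letter in enumerate(alphabet)}

# Identity matrix: 1 on the diagonal, 0 elsewhere.
identity = [[1 if i == j else 0 for j in range(len(alphabet))]
            for i in range(len(alphabet))]


def score(matrix, a, b):
    """Look up the score for aligning letter a with letter b."""
    return matrix[index[a]][index[b]]


def score_alignment(matrix, seq_a, seq_b):
    """Sum the scores of an ungapped alignment, position by position."""
    return sum(score(matrix, a, b) for a, b in zip(seq_a, seq_b))


print(score(identity, "A", "A"), score(identity, "A", "C"))
print(score_alignment(identity, "ACTG", "ACGG"))
```

Swapping in a different matrix (PAM, BLOSUM) changes only the numbers in the table, not the lookup.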
Steps to building the first PAM
(Point Accepted Mutation)
1. Dayhoff aligned sequences that were at least 85% identical.
2. Reconstructed phylogenetic trees and inferred ancestral sequences. 71 trees containing 1,572 aa exchanges were used.
3. Tallied aa replacements "accepted" by natural selection, in all pair-wise comparisons.
Steps to building PAM (cont. 1)
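4. Computed amino acid mutability, mj (the propensity of a given amino acid, j, to be replaced)
5. Combined data from 3 & 4 to produce a Mutation Probability Matrix (MPM) for one PAM of evolutionary distance, according to the following formulas:
Replacements (aaj replaced by aai, i ≠ j): Mij = mj Aij / (sum over i of Aij), where Aij is the number of accepted replacements of aaj by aai tallied in step 3
No replacement (MPM entry of aaj for aaj): Mjj = 1 - mj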
Steps to building PAM (cont. 2)
6. Took the log odds ratio to obtain each score:
Sij = log (Mij/fi) (Note: this is what you see in the matrix)
where fi is the normalized frequency of aai in the sequences used.
7. Note: the log odds values are multiplied by 10 and rounded to avoid fractions.
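Steps 6 and 7 can be illustrated with a small Python sketch. The two-letter alphabet, the mutation probabilities and the frequencies below are invented toy values, not Dayhoff's data, and the use of a base-10 logarithm with a factor of 10 is an assumption about how the scaling is done.

```python
import math

# Hypothetical toy data: M[i][j] is the probability that letter j is replaced by letter i.
# Each column (fixed j) sums to 1.
M = {
    "X": {"X": 0.98, "Y": 0.03},
    "Y": {"X": 0.02, "Y": 0.97},
}
# Hypothetical normalized frequencies of each letter.
f = {"X": 0.6, "Y": 0.4}


def log_odds(M, f, scale=10):
    """Sij = scale * log10(Mij / fi), rounded to the nearest integer."""
    return {i: {j: round(scale * math.log10(M[i][j] / f[i])) for j in M[i]}
            for i in M}


S = log_odds(M, f)
for i in S:
    print(i, S[i])
```

Identical pairs get positive scores (observed more often than chance) and the rare replacement gets a negative score; the matrix comes out symmetric because the toy values were chosen so that Mij fj = Mji fi.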
Assumptions in the PAM model
1. Replacement at any site depends only
on the amino acid at that site and the
probability given by the table (Markov
model).
2. Sequences that are being compared
have average amino acid composition.
The bottom line on PAM
(Frequencies of alignment) / (Frequencies of occurrence)
The probability that two amino acids, i and j, are aligned by evolutionary descent divided by the probability that they are aligned by chance
Sources of error in PAM model
1. Many sequences depart from average aa composition.
2. Rare replacements were observed too infrequently to resolve relative probabilities accurately (for 36 aa pairs, out of approximately 400 aa pairs, no replacements were observed!).
3. Errors in 1 PAM are magnified in the extrapolation to 250 PAM (the k PAM matrix is the 1 PAM matrix raised to the kth power, M^k).
4. This process (Markov) is an imperfect representation of evolution: distantly related sequences usually have islands (blocks) of conserved residues. This implies that replacement is not equally probable over the entire sequence.
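The extrapolation in point 3 is repeated matrix multiplication, which the following Python sketch shows on a toy example. The 2 x 2 matrix is hypothetical (columns sum to 1, as in a mutation probability matrix) and stands in for the real 20 x 20 one; mat_mul and mat_pow are helper names chosen for this illustration.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, k):
    """Raise a square matrix to the kth power by repeated multiplication."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(k):
        result = mat_mul(result, m)
    return result


# Hypothetical "1 PAM" matrix for a two-letter alphabet.
pam1 = [[0.98, 0.03],
        [0.02, 0.97]]

for k in (1, 50, 250):
    pam_k = mat_pow(pam1, k)
    print(k, [[round(value, 4) for value in row] for row in pam_k])
```

Because every entry of the 250 PAM matrix is built from 250 multiplications, any error in a 1 PAM entry is carried through all of them; the diagonal also shrinks toward the background frequencies as k grows.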
BLOSUM Matrices
BLOSUM is built from distantly related sequences, whereas PAM is built from closely related sequences.
BLOSUM is built from conserved blocks of aligned protein segments found in the BLOCKS database (remember, the BLOCKS database is a secondary database that depends on the PROSITE families).
Gap Penalties
Take into account insertions and deletions.
There can't be too many gaps, or the alignment may become meaningless.
Typically, there is a fixed deduction for introducing a gap plus an additional deduction for the length of the gap.
Gap penalty = G + Ln, where G = gap opening penalty, L = gap extension penalty and n = gap length.
G = 2 to 12, L = 2
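The formula can be tried out with a short Python sketch. The default G = 10 is just one value picked from the 2 to 12 range given above, and the function name gap_penalty is illustrative.

```python
def gap_penalty(n, gap_open=10, gap_extend=2):
    """Gap penalty = G + L*n for a gap of length n (no penalty when there is no gap)."""
    if n <= 0:
        return 0
    return gap_open + gap_extend * n


# One long gap is cheaper than several short gaps covering the same number of positions.
one_gap_of_five = gap_penalty(5)
five_gaps_of_one = 5 * gap_penalty(1)
print(one_gap_of_five, five_gaps_of_one)
```

This is why the opening penalty is usually larger than the extension penalty: a single insertion or deletion event of several residues is treated as more plausible than many independent one-residue events.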
Global Alignment vs. Local Alignment
Global alignment is used when the overall gene sequence is similar to another sequence; often used in multiple sequence alignment.
Clustal W algorithm (Needleman-Wunsch)
Local alignment is used when only a small portion of one gene is similar to a small portion of another gene.
BLAST
FASTA
Smith-Waterman algorithm
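The difference between the two approaches can be seen in a minimal Python sketch of the Needleman-Wunsch (global) and Smith-Waterman (local) scoring recursions. For brevity it uses a simple match/mismatch score and a fixed penalty per gap position instead of a substitution matrix and separate opening and extension penalties, and it returns only the best score, not the alignment itself. The parameter values are arbitrary choices for the example.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best score for aligning all of a against all of b (global)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best score over all pairs of substrings of a and b (local)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)  # never below 0
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


a, b = "AATGCCTAG", "TGCCTAG"
print("global:", needleman_wunsch_score(a, b))
print("local: ", smith_waterman_score(a, b))
```

The only changes between the two functions are the first row and column (gap penalties versus zeros), the floor at 0 in every cell, and where the final score is read from. The global score pays for the two unmatched bases at the start of the longer sequence; the local score ignores them.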
Two proteins that are similar in
certain regions
Tissue plasminogen activator (PLAT)
Coagulation factor 12 (F12).
The Dotter Program
• Program consists of three components:
• Sliding window
• A scoring matrix that gives a score for each amino acid
• A graph that converts the score to a dot of certain pixel density
Region of similarity: a single region on F12 is similar to two regions on PLAT
BLAST
Basic Local Alignment Search Tool
Speed is achieved by:
Pre-indexing the database before the search
Parallel processing
Uses a hash table that contains neighborhood words rather than just identical words.
Neighborhood words
The program declares a hit if a word scores >= T against the word taken from the query sequence when a substitution matrix is used.
This allows the word size W (similar to the ktup value in FASTA) to be kept high (for speed) without sacrificing sensitivity.
If T is increased by the user, the number of background hits is reduced and the program will run faster.
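The effect of the threshold T can be sketched in Python. This toy example uses a four-letter alphabet and an invented scoring rule (4 for identity, 2 among I, L and V, -1 otherwise) in place of a real substitution matrix and the full 20 amino acids; it is an illustration of the neighborhood idea, not the BLAST implementation.

```python
from itertools import product

ALPHABET = "AILV"  # hypothetical reduced alphabet


def pair_score(a, b):
    """Invented toy substitution scores: identical 4, similar (I/L/V) 2, otherwise -1."""
    if a == b:
        return 4
    if a in "ILV" and b in "ILV":
        return 2
    return -1


def neighborhood(word, threshold):
    """All words of the same length that score >= threshold against the query word."""
    hits = []
    for candidate in product(ALPHABET, repeat=len(word)):
        total = sum(pair_score(q, c) for q, c in zip(word, candidate))
        if total >= threshold:
            hits.append("".join(candidate))
    return hits


for T in (8, 6, 4):
    print("T =", T, neighborhood("IL", T))
```

With the highest T only the identical word is kept; lowering T admits similar words, which increases sensitivity but also the number of words that must be looked up in the hash table.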
Workshop for module 2: Use the Dotter program to determine the optimal alignment between two sequences. Perform a BLAST search on a protein sequence.