Finding Regulatory Sites - TAMU Computer Science Faculty Pages

Download Report

Transcript Finding Regulatory Sites - TAMU Computer Science Faculty Pages

Sequence Databases
As DNA and protein sequences accumulate, they are deposited
in public databases.
One of the most popular of these is GenBank, which stores all
types of different sequences in plain text files, where different
features are separated by blank spaces and special symbols.
Each file can be downloaded directly. Sample file:
LOCUS
name of locus
SOURCE
source organism of DNA
FEATURES
information about sequence by base position
ORIGIN
1 gaattcgata aatctctggg ttattgtgac gtttataatg acgttaggca
51 atatcattct atcattaagc
Relational Databases
A more structured way to store data is by using relational
databases. A database is made up of a set of tables which
stores data in a non-redundant manner to facilitate data
updates.
student
uin
123456789
333445555
firstname
Alice
Tom
major
uin
123456789
123456789
333445555
department
BICH
CPSC
BICH
lastname
Smith
Wang
Query Languages
Instead of allowing users to look at the contents of a database
directly, access is through the use of a query language over a
connection to the database.
SQL is one of the most popular query languages. The
following is a sample SQL query:
select firstname, lastname
from student, major
where student.uin = major.uin
and major.department = “BICH”
This approach imposes a structure to the data and allows
updates to be seen instantaneously.
Graphical User Interface
Fortunately, there is no need to learn a query language in
order to be able to access a relational database.
Modern database software packages provide graphical user
interfaces developed on top of the query language level, so
that users don’t need to know anything about query
languages.
Relational databases are very popular within biotech
companies and web interfaces are becoming standard.
Comparing Sequences
Mutation in DNA sequences is modeled by three basic
operations: insertion of a nucleotide, deletion of a nucleotide
and substitution of a nucleotide.
Mutation is tolerated to a certain extent before the function of
a DNA sequence is lost or changed. Thus, similarity between
sequences reflects similarity in function.
As DNA sequences of different organisms accumulate,
comparing a newly discovered sequence to known sequences
helps to identify the function of the new sequence.
Pairwise Alignment
Given two (DNA or protein) sequences, an alignment puts the
two strings against each other so that similar parts are aligned
together.
For example, given two strings ATCTCGAT and TGCATAT,
one alignment is:
AT-C-TCGCT
-TGCAT--AT
where ‘-’ denotes a gap introduced to pad non-similar parts.
Given a scoring scheme, the pairwise alignment problem is
the problem of finding the best alignment with optimal score.
A special form of alignment called local alignment allows
alignments to not cover entire sequences.
Finding Optimal Alignment
Dynamic programming techniques are used to find the
optimal alignment (which use a divide-and-conquer approach
to express the original problem as smaller subproblems).
These algorithms for finding optimal alignment takes time
proportional to the product of the length of input sequences.
In large scale comparisons, heuristics are commonly
employed to find good alignment (which is not necessarily
optimal) between two sequences.
Database Search
As the amount of sequenced DNA increases, it is likely that a
new DNA sequence has similarity to at least one of the known
sequences.
New DNA sequences are annotated as much as possible and
collected in sequence databases such as GENBANK.
When a new DNA sequence is obtained, the first logical step
is to try to search for similar sequences in a database of
known sequences.
If similar sequences are found from database search, sequence
similarity would imply similarity in function.
Filtration Idea
Given a database of sequences {s1, …, sn} and a sequence s,
the first step is to determine which of the pairs (s, si) are
likely to be similar.
There are conflicting objectives:
• Don’t want to lose real similarities.
• Want to cut down the number of possible similarities.
The simplest way to perform such a filtration is to consider a
k-mer appearing in s and check if it appears in each of si.
s1
s2
s
s3
s4
Seed-Extension Approach
The next step is to extend the short similarities (seeds) we
found in the filtration step to see if the gaps between the short
similar areas can be filled.
All hits with strong similarity found are sorted in decreasing
order of score.
It is possible to compute the optimal alignment in detail for a
limited subset of high scoring hits.
BLAST
BLAST is one of the most popular programs people use to
perform database search.
There are various variations of the basic programs to compare
DNA against DNA, DNA against protein, protein against
protein, etc.
In addition to employing the filtration-extension approach,
BLAST employs an additional set of heuristics to optimize its
performance.
BLAST Seed Selection
BLAST uses different seed lengths for different types of
sequences:
• use words of length 11 for DNA sequences.
• use words of length 3 for protein sequences.
• perform a translation from DNA to protein for
comparisons involving DNA and proteins. All three
reading frames are investigated.
For protein sequences, in addition to look for exact matches,
BLAST also looks for approximate matches with score above
a preset threshold.
BLAST Alignment Extension
BLAST tries to extend an alignment from the matching words
in both directions, and continue the extension as long as the
score continues to increase. If this results in a gap between
two matching words being filled, the extension is successful.
×
×
s
×
×
×
si
×
Multiple Alignment
It is frequently very difficult to reveal weak similarities
between two sequences.
These similarities can often be captured by comparing
multiple sequences.
Multiple alignment generalizes the alignment idea to handle
many sequences.
AT-C-TCGAT
-TGCAT--AT
ATCCA-CGCT
ClustalW
ClustalW is a very popular software for multiple alignment.
It uses the progressive alignment idea: treat each sequence as
an alignment initially and iteratively select the next two most
similar alignments to be combined into a bigger alignment by
pairwise alignment techniques.
+

The order to combine the alignments is determined by
following an evolutionary tree generated on the fly between
the sequences.
TCoffee
When computing an initial pairwise alignment, TCoffee
incorporates consistency information from other sequences to
improve its consistency within the final multiple alignment:
• It accepts different kinds of pairwise alignments as
input, and constructs a weighted library to represent the
pairwise relationships.
• For each sequence pair A and B, it extends the original
library by trying to align A and B through each
sequence C in the input set and adding consistency
information to the (A,B) pair.
TCoffee is much more accurate than ClustalW.
Gene Prediction
In eukaryotes, genes are made up of exons and introns.
Introns are cut out of the RNA and exons are concatenated
together to form the mRNA before translation into a protein.
exon
intron
exon intron
exon
gene (RNA)
mRNA
The gene prediction problem is the following problem: Given
a genomic sequence, locate the exon-intron boundaries.
GENSCAN
GENSCAN is one of the most popular software programs
used for gene structure prediction.
It integrates multiple types of information in a hidden markov
model (HMM), including promoter signals, splicing signals,
length distributions for exons and introns, poly-A
transcriptional end signal, etc.
A HMM consists of a set of states and transition probabilities
between the states. Each state represents a modeled feature.
Although the structure of the HMM is pre-determined, it is
necessary to provide estimates of other parameters (such as
transition probabilities) via a set of training examples.
Similarity-based Approaches
As the amount of sequenced DNA increases, it is likely that a
new (unspliced) genomic sequence has similarity to a known
cDNA or protein.
BLAST can be used to find similar sequences from the
database of known sequences. These BLAST hits reveal only
the most significant local alignments between the genomic
sequence and a similar sequence, which usually represent one
or few exons.
The GenomeScan software integrates this information with
the HMM model of GENSCAN.
Regulation of Gene Expression
Gene regulatory proteins bind to specific places (regulatory
sites) on DNA. These sites are usually close to the gene.
off
site
gene
regulatory protein
on
site
gene
Finding Regulatory Sites
Identify a set of genes believed to be controlled by the same
regulatory mechanism (co-regulated genes).
Extract regulatory regions of the genes (usually upstream
sequences) to form a sample of sequences.
Find some way to identify conserved patterns shared by these
sequences, resulting in a list of potential regulatory sites.
The problem of finding patterns shared by a given sample of
sequences is called the motif finding problem.
Motif Finding Problem
sample
gene
site
site
site
gene
gene
gene
site
site
gene
MEME
MEME is one of the most popular programs for motif finding.
It finds a motif, represented by a set of ungapped patterns,
which is the least likely to appear by chance.
It uses the expectation-maximization (EM) approach: first
obtain an initial motif (which may not be very good), then
iteratively obtain a better motif with the following two steps:
• Expectation: compute the statistical composition of the
current motif and find the probability of finding the site
at each position in each sequence.
• Maximization: These probabilities are used to update
the statistical composition.
Finding Regulatory Sites by Alignment
Recall that one way to find regulatory sites is to construct
samples of upstream regions of co-regulated genes and use
motif finding algorithms to look for shared patterns.
These samples of co-regulated genes are constructed from
diverse sources. Regulatory sites can reside on either strand
and they are not at a fixed distance from the transcriptional
start site.
For these samples, alignment algorithms are not useful.
However, alignment approaches can be very useful when we
specifically consider co-regulated genes which are very close
in evolutionary distance.
What to Look For
There are frequently more than one regulatory site for a gene.
When sufficiently close upstream regions of co-regulated
genes are aligned, we expect the sites to be short, well
conserved, in the same order and on the same strand. These
short blocks of highly conserved positions are separated by
less conserved regions.
Although evolutionarily close genes should be used, they
should not be so close that the entire upstream region is
highly conserved, which does not give much information.
Using Closely Related Yeast Species
Align upstream sequences from a few Saccharomyces species
to try to identify regulatory elements:
• Many alignments of S.mikatae and S.bayanus promoter
regions with S.cerevisiae show short runs of conserved
sequences, which predicts regulatory sites.
• The two species S.cariocanus and S.paradoxus are too
closely related to S.cerevisiae. Not much information
is revealed in their alignments.
Inferring Evolution
Using estimates on the evolutionary distance between
multiple organisms, an evolutionary tree can be constructed
predicting the evolutionary relationships.
mouse
rat
human
It is possible that conflicting trees are produced from different
data sources. We are much more confident that a tree is
correct when the same tree is produced from various data
sources.
Evolutionary Distance Estimates
Estimates of evolutionary distance:
• Each multiple alignment gives distance estimates
between species. To obtain more accurate estimates, a
large set of multiple alignments including different
regions or genes should be utilized collectively.
Using Evolutionary Information in Alignment
Progressive multiple alignment algorithms often construct an
evolutionary tree on the fly to determine the order in which
smaller alignments are merged to form bigger ones.
If the sequences to be aligned are from different species, it is
also possible to utilize a pre-determined evolutionary tree.
Since an evolutionary tree can be constructed given a multiple
alignment, it is possible to iteratively improve both the
evolutionary tree and the multiple alignment from crude
initial estimates.
Using Evolutionary Information in Motif Finding
The motif finding problem is to find a pattern shared by a
given set of sequences. In the case when the sequences are
from different species, evolutionary information can be
utilized to get more accurate results.
Given an evolutionary tree of the represented species, a better
way to score a motif is to compute the total number of
mutations along the branches of the tree. It has been shown
that this approach produces better motifs.