Computational Biology, Part 3
Representing and Finding
Sequence Features using
Frequency Matrices
Robert F. Murphy
Copyright 1996-2001.
All rights reserved.
Sequence Analysis Tasks
Calculating the probability of finding a
region with a particular base composition
Statistics of AT- or GC-rich
regions
What is the probability of observing a “run”
of the same nucleotide (e.g., 25 A’s)?
Let p_x be the mononucleotide probability of
nucleotide x
The per-nucleotide probability of a run of N
consecutive x’s is p_x^N
The probability of occurrence in a sequence
of length L (with L much longer than N) is ≈ L p_x^N
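As a rough numerical illustration of the two formulas above, here is a minimal Python sketch. The function names and the example values (a uniform base probability of 0.25 and a sequence of one million bases) are assumptions chosen for illustration, not part of the lecture.

```python
def run_probability(p_x: float, n: int) -> float:
    """Probability that a given position starts a run of n consecutive x's."""
    return p_x ** n


def approx_probability_in_sequence(p_x: float, n: int, length: int) -> float:
    """Approximation L * p_x^N; only meaningful when the result is much less than 1."""
    return length * run_probability(p_x, n)


# Assumed example: 25 A's in a row, with p_A = 0.25, in a sequence of 1,000,000 bases
per_position = run_probability(0.25, 25)
in_sequence = approx_probability_in_sequence(0.25, 25, 1_000_000)
print(per_position, in_sequence)
```

Because the approximation simply multiplies by L, it can exceed 1 for short runs in long sequences, in which case it is better read as an expected number of occurrences.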
Statistics of AT- or GC-rich
regions
What if J “mismatches” are allowed?
Let p_y be the probability of observing a
different nucleotide (normally p_y = 1 - p_x)
The probability of observing N-J of
nucleotide x and J of nucleotide y in a
region of length N is
p_x^(N-J) p_y^J C(N,J)
where
C(N,J) = N! / ( (N-J)! J! )
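The formula for exactly J mismatches can be evaluated directly. The sketch below uses Python's standard `math.comb` for C(N,J); the function name `prob_exact_mismatches` is an assumption made for this example.

```python
from math import comb


def prob_exact_mismatches(p_x: float, n: int, j: int, p_y=None) -> float:
    """Probability of exactly n-j x's and j non-x's in a region of length n."""
    if p_y is None:
        p_y = 1.0 - p_x  # the usual case: every other nucleotide counts as a mismatch
    return comb(n, j) * p_x ** (n - j) * p_y ** j


# With j = 0 this reduces to the plain run probability p_x^N
print(prob_exact_mismatches(0.25, 25, 0))
print(prob_exact_mismatches(0.25, 25, 3))
```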
Statistics of AT- or GC-rich
regions
As before, we can multiply by L to
approximate the probability of observing
that combination in a sequence of length L
Note that this is the probability of observing
exactly N-J matches and exactly J
mismatches. We may also wish to know the
probability of finding at least N-J matches,
which requires summing the probability of
exactly I mismatches for I=0 to I=J.
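The "at least N-J matches" case can be sketched as a sum over I = 0..J of the exact-mismatch probability. This is an illustrative Python sketch; the function name is an assumption, and the final multiplication by L is the same approximation described above.

```python
from math import comb


def prob_at_least_matches(p_x: float, n: int, j_max: int, p_y=None) -> float:
    """Probability of at most j_max mismatches (at least n - j_max matches) in a region of length n."""
    if p_y is None:
        p_y = 1.0 - p_x
    # Sum the probability of exactly i mismatches for i = 0 .. j_max
    return sum(comb(n, i) * p_x ** (n - i) * p_y ** i for i in range(j_max + 1))


# Assumed example: at least 22 A's in a region of 25, then scaled by L = 1,000,000
per_region = prob_at_least_matches(0.25, 25, 3)
print(per_region, 1_000_000 * per_region)
```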
Statistics of AT- or GC-rich
regions
(A4 Enriched seq prob demo)
Sequence Analysis Tasks
Calculating the probability of finding a
sequence pattern
Calculating the probability of finding a
region with a particular base composition
Representing and finding sequence
features/motifs using frequency matrices
Describing features using
frequency matrices
Goal: Describe a sequence feature (or
motif) more quantitatively than possible
using consensus sequences
Need to describe how often particular bases
are found in particular positions in a
sequence feature
Describing features using
frequency matrices
Definition: For a feature of length m using
an alphabet of n characters, a frequency
matrix is an n by m matrix in which each
element contains the frequency at which a
given member of the alphabet is observed at
a given position in an aligned set of
sequences containing the feature
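To make the definition concrete, the following Python sketch builds a frequency matrix from a small aligned set. The example sites are made up for illustration, and storing the matrix as a dictionary of rows (one per alphabet character) is just one possible representation.

```python
def frequency_matrix(aligned, alphabet="ACGT"):
    """Return {character: [frequency at position 0, 1, ..., m-1]} for equal-length sequences."""
    m = len(aligned[0])
    if any(len(seq) != m for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")
    counts = {ch: [0] * m for ch in alphabet}
    for seq in aligned:
        for pos, ch in enumerate(seq):
            counts[ch][pos] += 1
    total = len(aligned)
    # Convert counts to frequencies: an n-by-m matrix (n characters, m positions)
    return {ch: [c / total for c in row] for ch, row in counts.items()}


# Hypothetical aligned sites, invented for this example
sites = ["TATAAT", "TATGAT", "TACAAT", "TATACT"]
for base, row in frequency_matrix(sites).items():
    print(base, row)
```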
Frequency matrices (continued)
Three uses of frequency matrices
Describe a sequence feature
Calculate probability of occurrence of feature in
a random sequence
Calculate degree of match between a new
sequence and a feature
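One simple way to quantify the degree of match, assuming positions are independent, is to multiply the frequencies of the observed bases across the feature. The sketch below illustrates this; the toy matrix values and the function name are assumptions for the example, not data from the lecture.

```python
def match_probability(freqs, seq):
    """Product of the per-position frequencies of the bases in seq."""
    prob = 1.0
    for pos, base in enumerate(seq):
        prob *= freqs[base][pos]
    return prob


# Hypothetical frequency matrix for a feature of length 2 (each column sums to 1)
toy = {
    "A": [0.80, 0.10],
    "C": [0.10, 0.10],
    "G": [0.05, 0.70],
    "T": [0.05, 0.10],
}
print(match_probability(toy, "AG"))  # best-matching sequence under this matrix
print(match_probability(toy, "CC"))  # a poor match
```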
Interactive Demonstration
(A2 Frequency matrix demo)
Frequency Matrices, PSSMs, and
Profiles
A frequency matrix can be converted to a
Position-Specific Scoring Matrix (PSSM)
by converting frequencies to scores (e.g., by
taking logs)
PSSMs are also called Position Weight
Matrices (PWMs) or Profiles
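A common way to "take logs" is a log-odds score: the log of the observed frequency divided by the expected (background) frequency. The Python sketch below assumes that approach. The base-2 logarithm, the `floor` value used to avoid log(0), and the function name are all choices made for illustration.

```python
import math


def to_pssm(freqs, background, floor=0.01):
    """Convert a frequency matrix to log2-odds scores against background frequencies."""
    pssm = {}
    for base, row in freqs.items():
        # floor is an assumed guard so that a frequency of 0 does not give log(0)
        pssm[base] = [math.log2(max(f, floor) / background[base]) for f in row]
    return pssm


# Hypothetical inputs: a one-position matrix and uniform background frequencies
freqs = {"A": [0.5], "C": [0.25], "G": [0.25], "T": [0.0]}
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
print(to_pssm(freqs, background))
```

Positive scores mark bases seen more often than expected at that position, and negative scores mark bases seen less often.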
Finding occurrences of a
sequence feature using a Profile
As with finding occurrences of a consensus
sequence, we consider all positions in the
target sequence as candidate matches
For each candidate position, we calculate a score by
“looking up” the Profile value for the base found at
each position of the feature and combining these
values (summing, when the scores are logs)
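The scan described above can be sketched in a few lines of Python. The toy Profile values are invented for the example, and summing the looked-up values assumes the scores are logs.

```python
def scan(pssm, seq):
    """Score every window of the feature length in seq; returns one score per start position."""
    m = len(next(iter(pssm.values())))  # feature length = number of columns
    scores = []
    for start in range(len(seq) - m + 1):
        window = seq[start:start + m]
        # Look up the score of each base at its position in the feature and sum
        scores.append(sum(pssm[base][i] for i, base in enumerate(window)))
    return scores


# Hypothetical Profile of length 2 that favors "AG"
toy_pssm = {"A": [1, -1], "C": [-1, -1], "G": [-1, 1], "T": [-1, -1]}
print(scan(toy_pssm, "CAGT"))  # [-2, 2, -2]
```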
Interactive Demonstration
(A10 Searching with Profile demo)
Block Diagram for Building a
PSSM
Inputs: Set of Aligned Sequence Features; Expected frequencies of each sequence element
Process: PSSM builder
Output: PSSM
Block Diagram for Searching
with a PSSM
Inputs: PSSM; Threshold; Set of Sequences to search
Process: PSSM search
Outputs: Sequences that match above threshold; Positions and scores of matches
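The search block in this diagram might be sketched as follows in Python. The inputs mirror the diagram (a PSSM, a threshold, and a set of sequences), while the function name, the dictionary-based data layout, and the toy values are assumptions for illustration.

```python
def pssm_search(pssm, threshold, sequences):
    """Return {name: [(position, score), ...]} for sequences with at least one match >= threshold."""
    m = len(next(iter(pssm.values())))  # feature length
    hits = {}
    for name, seq in sequences.items():
        matches = []
        for start in range(len(seq) - m + 1):
            score = sum(pssm[base][i] for i, base in enumerate(seq[start:start + m]))
            if score >= threshold:
                matches.append((start, score))
        if matches:  # keep only sequences that match above threshold
            hits[name] = matches
    return hits


# Hypothetical PSSM of length 2 and two made-up sequences
toy_pssm = {"A": [1, -1], "C": [-1, -1], "G": [-1, 1], "T": [-1, -1]}
print(pssm_search(toy_pssm, 1.0, {"s1": "CAGT", "s2": "TTTT"}))  # {'s1': [(1, 2)]}
```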
Block Diagram for Searching for
sequences related to a family
with a PSSM
Inputs: Set of Aligned Sequence Features; Expected frequencies of each sequence element
Process: PSSM builder
Output: PSSM
Inputs: PSSM; Threshold; Set of Sequences to search
Process: PSSM search
Outputs: Sequences that match above threshold; Positions and scores of matches
Consensus sequences vs.
frequency matrices
Should I use a consensus sequence or a
frequency matrix to describe my site?
If all allowed characters at a given position are
equally "good", use IUB codes to create a
consensus sequence
Example: Restriction enzyme recognition sites
If some allowed characters are "better" than
others, use a frequency matrix
Example: Promoter sequences
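For the consensus-sequence case, IUB codes translate naturally into a regular expression. The Python sketch below illustrates this with the HincII recognition site GTYRAC; the helper names and the test sequence are assumptions made for the example.

```python
import re

# IUB/IUPAC nucleotide codes mapped to regular-expression character classes
IUB = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def consensus_to_regex(consensus):
    """Translate a consensus sequence written in IUB codes into a regex string."""
    return "".join(IUB[code] for code in consensus.upper())


def find_sites(consensus, seq):
    """Start positions (0-based) of all matches, including overlapping ones."""
    pattern = re.compile("(?=" + consensus_to_regex(consensus) + ")")
    return [m.start() for m in pattern.finditer(seq.upper())]


print(consensus_to_regex("GTYRAC"))                # GT[CT][AG]AC
print(find_sites("GTYRAC", "AAGTTAACGGGTCGACTT"))  # [2, 10]
```

Every allowed base at an ambiguous position matches equally well here, which is exactly the information a frequency matrix would refine.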
Consensus sequences vs.
frequency matrices
Advantages of consensus sequences:
smaller description, quicker comparison
Disadvantage: lose quantitative information
on preferences at certain locations
Summary, Part 3
Probability of finding sequences enriched in
one or more bases can be calculated using
probability of consecutive bases multiplied
by number of combinations allowed
Complex sequence features can be
described using frequency matrices
Frequency matrices can be used for
quantitative estimates of the degree to
which a given sequence matches a feature