LecturesPart03


Computational Biology, Part 3
Representing and Finding
Sequence Features using
Frequency Matrices
Robert F. Murphy
Copyright © 1996-2001.
All rights reserved.
Sequence Analysis Tasks
 Calculating the probability of finding a
region with a particular base composition
Statistics of AT- or GC-rich
regions
What is the probability of observing a “run”
of the same nucleotide (e.g., 25 A’s)?
 Let $p_x$ be the mononucleotide probability of
nucleotide x
 The per-nucleotide probability of a run of N
consecutive x’s is $p_x^N$
 The probability of occurrence in a sequence
of length L, where L is much longer than N, is ≈ $L\,p_x^N$
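A minimal numeric sketch of the two formulas above, using the 25 A's example. The function names, the uniform base probability of 0.25, and the sequence length of one million are illustrative assumptions, not part of the lecture.

```python
def run_probability(p_x: float, n: int) -> float:
    """Per-position probability that n consecutive bases are all x (p_x ** n)."""
    return p_x ** n


def approx_occurrence(p_x: float, n: int, seq_len: int) -> float:
    """Approximation L * p_x**N; only meaningful when the result is well below 1."""
    return seq_len * run_probability(p_x, n)


# Assumed inputs: p_A = 0.25, run length N = 25, sequence length L = 1,000,000.
p_run = run_probability(0.25, 25)
p_seq = approx_occurrence(0.25, 25, 1_000_000)
print(f"per-position: {p_run:.3e}, in sequence: {p_seq:.3e}")
```

Because the approximation simply scales the per-position probability by L, it overstates the true probability once L times the per-position value approaches 1.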

Statistics of AT- or GC-rich
regions
What if J “mismatches” are allowed?
 Let $p_y$ be the probability of observing a
different nucleotide (normally $p_y = 1 - p_x$)
 The probability of observing N-J of
nucleotide x and J of nucleotide y in a
region of length N is

$$p_x^{N-J}\, p_y^{J}\, C(N,J)$$

where

$$C(N,J) = \frac{N!}{(N-J)!\,J!}$$
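The formula above can be checked numerically with a short sketch. The function name and the example values are assumptions chosen for illustration; `math.comb` supplies C(N,J).

```python
from math import comb


def prob_exact_mismatches(p_x: float, n: int, j: int) -> float:
    """Probability of exactly n-j copies of x and j mismatches in a window of n.

    Assumes p_y = 1 - p_x, as on the slide.
    """
    p_y = 1.0 - p_x
    return comb(n, j) * p_x ** (n - j) * p_y ** j


# Illustrative values: window of 25 with exactly 2 non-A bases, p_A = 0.25.
print(prob_exact_mismatches(0.25, 25, 2))
```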
Statistics of AT- or GC-rich
regions
As before, we can multiply by L to
approximate the probability of observing
that combination in a sequence of length L
 Note that this is the probability of observing
exactly N-J matches and exactly J
mismatches. We may also wish to know the
probability of finding at least N-J matches,
which requires summing the probability of
exactly I mismatches over I=0 to I=J.
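A sketch of that sum, assuming the same binomial term with p_y = 1 - p_x. The function name and the example numbers are hypothetical.

```python
from math import comb


def prob_at_most_mismatches(p_x: float, n: int, j: int) -> float:
    """Probability of at least n-j matches, i.e. the sum over i = 0..j mismatches."""
    p_y = 1.0 - p_x
    return sum(comb(n, i) * p_x ** (n - i) * p_y ** i for i in range(j + 1))


# Illustrative values: at least 23 A's in a window of 25, p_A = 0.25,
# scaled by an assumed sequence length L = 1,000,000.
per_window = prob_at_most_mismatches(0.25, 25, 2)
print(per_window, 1_000_000 * per_window)
```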

Statistics of AT- or GC-rich
regions

(A4 Enriched seq prob demo)
Sequence Analysis Tasks

Calculating the probability of finding a
sequence pattern

Calculating the probability of finding a
region with a particular base composition
 Representing and finding sequence
features/motifs using frequency matrices
Describing features using
frequency matrices
Goal: Describe a sequence feature (or
motif) more quantitatively than possible
using consensus sequences
 Need to describe how often particular bases
are found in particular positions in a
sequence feature

Describing features using
frequency matrices

Definition: For a feature of length m using
an alphabet of n characters, a frequency
matrix is an n by m matrix in which each
element contains the frequency at which a
given member of the alphabet is observed at
a given position in an aligned set of
sequences containing the feature
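A minimal sketch of the definition above for DNA (n = 4). The aligned sites are made-up example data, and the dict-of-rows layout is just one possible representation of the n by m matrix.

```python
from collections import Counter

ALPHABET = "ACGT"


def frequency_matrix(aligned, alphabet=ALPHABET):
    """Return {character: [frequency at position 0, ..., position m-1]}."""
    m = len(aligned[0])
    if any(len(seq) != m for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")
    matrix = {ch: [0.0] * m for ch in alphabet}
    for pos in range(m):
        counts = Counter(seq[pos] for seq in aligned)
        for ch in alphabet:
            matrix[ch][pos] = counts[ch] / len(aligned)
    return matrix


# Hypothetical aligned occurrences of a 6-base feature.
sites = ["TATAAT", "TATGAT", "TACAAT", "TATAAT"]
fm = frequency_matrix(sites)
for ch in ALPHABET:
    print(ch, fm[ch])
```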
Frequency matrices (continued)

Three uses of frequency matrices
 Describe a sequence feature
 Calculate probability of occurrence of feature in
a random sequence
 Calculate degree of match between a new
sequence and a feature
Interactive Demonstration

(A2 Frequency matrix demo)
Frequency Matrices, PSSMs, and
Profiles
A frequency matrix can be converted to a
Position-Specific Scoring Matrix (PSSM)
by converting frequencies to scores (e.g., by
taking logs)
 PSSMs are also called Position Weight
Matrices (PWMs) or Profiles
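One common way to "take logs" is a log-odds score against expected (background) frequencies. The sketch below assumes that scheme, plus a small pseudocount so that a frequency of zero does not give log(0). The pseudocount value, the function name, and the toy matrix are assumptions, not the lecture's method.

```python
from math import log2


def frequencies_to_pssm(freqs, background, pseudocount=0.01):
    """Convert {char: [freq per position]} to log2-odds scores.

    Each frequency is smoothed as (f + pseudocount) / (1 + n_chars * pseudocount)
    and then divided by the expected frequency of that character.
    """
    n_chars = len(freqs)
    pssm = {}
    for ch, row in freqs.items():
        pssm[ch] = [
            log2(((f + pseudocount) / (1 + n_chars * pseudocount)) / background[ch])
            for f in row
        ]
    return pssm


# Toy 2-position frequency matrix and a uniform background (both made up).
toy_freqs = {"A": [0.25, 1.0], "C": [0.25, 0.0], "G": [0.25, 0.0], "T": [0.25, 0.0]}
uniform = {ch: 0.25 for ch in "ACGT"}
pssm = frequencies_to_pssm(toy_freqs, uniform)
for ch, row in pssm.items():
    print(ch, [round(score, 2) for score in row])
```

With this scoring, a base seen as often as expected scores about 0, an over-represented base scores positive, and an unobserved base scores negative but finite.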

Finding occurrences of a
sequence feature using a Profile
As with finding occurrences of a consensus
sequence, we consider all positions in the
target sequence as candidate matches
 For each position, we calculate a score by
“looking up” the value corresponding to the
base at that position
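The search described above can be sketched as a sliding window. This assumes log-based scores, so the looked-up values for the columns of a window are added. The toy PSSM, the target sequence, and the threshold are made up for illustration.

```python
def scan_with_pssm(sequence, pssm, threshold):
    """Score every window of the target; return (start, score) pairs at or above threshold."""
    m = len(next(iter(pssm.values())))  # feature length = number of columns
    hits = []
    for start in range(len(sequence) - m + 1):
        window = sequence[start:start + m]
        # Look up the score of each base at its position within the window.
        score = sum(pssm[base][i] for i, base in enumerate(window))
        if score >= threshold:
            hits.append((start, score))
    return hits


# Hypothetical 3-column PSSM that favours the pattern "ATA".
toy_pssm = {
    "A": [1.0, -1.0, 1.0],
    "C": [-1.0, -1.0, -1.0],
    "G": [-1.0, -1.0, -1.0],
    "T": [-1.0, 1.0, -1.0],
}
hits = scan_with_pssm("GGATACCATA", toy_pssm, threshold=3.0)
print(hits)
```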

Interactive Demonstration

(A10 Searching with Profile demo)
Block Diagram for Building a PSSM
[Diagram. Inputs: set of aligned sequence features; expected frequencies of each sequence element. Process: PSSM builder. Output: PSSM.]
Block Diagram for Searching with a PSSM
[Diagram. Inputs: PSSM; threshold; set of sequences to search. Process: PSSM search. Outputs: sequences that match above threshold; positions and scores of matches.]
Block Diagram for Searching for sequences related to a family with a PSSM
[Diagram. Inputs: set of aligned sequence features; expected frequencies of each sequence element. Process: PSSM builder. Output: PSSM. The PSSM, a threshold, and a set of sequences to search are then inputs to the PSSM search. Outputs: sequences that match above threshold; positions and scores of matches.]
Consensus sequences vs.
frequency matrices

Should I use a consensus sequence or a
frequency matrix to describe my site?
 If all allowed characters at a given position are
equally "good", use IUB codes to create
consensus sequence
 Example: Restriction enzyme recognition sites
 If some allowed characters are "better" than
others, use frequency matrix
 Example: Promoter sequences
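For the consensus-sequence case, one simple approach is to expand each IUB code into a character class and search with a regular expression. This is a sketch of that idea rather than the lecture's implementation. The example uses the degenerate pattern GTYRAC (the recognition site usually given for HincII), and the target sequence is made up.

```python
import re

# IUB/IUPAC nucleotide codes expanded to regular-expression character classes.
IUB = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def consensus_to_regex(consensus):
    """Translate a consensus written in IUB codes into a regex pattern string."""
    return "".join(IUB[code] for code in consensus.upper())


def find_sites(consensus, sequence):
    """Return start positions of all (possibly overlapping) matches."""
    pattern = consensus_to_regex(consensus)
    # A lookahead is used so that overlapping occurrences are also reported.
    return [m.start() for m in re.finditer(f"(?={pattern})", sequence)]


positions = find_sites("GTYRAC", "AAGTTAACGGGTCGACTT")
print(positions)
```

Every allowed base matches equally well here, which is exactly the information a frequency matrix would refine.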
Consensus sequences vs.
frequency matrices
Advantages of consensus sequences:
smaller description, quicker comparison
 Disadvantage: lose quantitative information
on preferences at certain locations

Summary, Part 3
Probability of finding sequences enriched in
one or more bases can be calculated using
probability of consecutive bases multiplied
by number of combinations allowed
 Complex sequence features can be
described using frequency matrices
 Frequency matrices can be used for
quantitative estimates of the degree to
which a given sequence matches a feature
