Pattern Analysis in Biology
Timothy L. Bailey
Institute for Molecular Bioscience
University of Queensland
Overview
 Pattern Analysis: Converting data into knowledge
 Objectives of Biological Pattern Analysis
 Elements of Pattern Discovery
– Sequence pattern example
Pattern Analysis:
Converting data into knowledge
The purpose of pattern analysis is to convert
data into knowledge.
 Pattern: … order or form discernible in
things, actions, ideas, situations, etc.
(Oxford English Dictionary)
 We analyze patterns to give form and
structure to knowledge and to make
predictions from data.
Discovery and Search
 Pattern discovery involves constructing a
model of a biological signal, process or
interaction from data.
 Pattern search involves looking for data that
fits a given model in order to make predictions.
The Components of Pattern
Analysis
Pattern analysis starts with three major
questions:
 The data:
– What kind of data am I looking for patterns in?
 The pattern language:
– How will I describe or model the patterns?
 The learning algorithm:
– What algorithms exist for searching for patterns
in this language?
Data
Pattern analysis is applied to many types of
biological data:
 Sequence (DNA, RNA, protein)
 Structural (protein 2-D and 3-D)
 Expression (mRNA levels)
 Literature (text)
Pattern Language
Different types of pattern languages are used to
represent patterns of different types:
 Sequence models: regular expressions, hidden
Markov models, stochastic context-free grammars
 Structural models: 3-D coordinates
 Phylogeny models: trees and cladograms
 Network models: boolean networks
 General models: linear and non-linear equations,
artificial neural networks (ANN), support vector
machines (SVM)
Learning algorithms
The process of finding the model that best fits given data or that
optimizes some objective is often referred to as “learning”.
 Optimization algorithms: simulated annealing, genetic
algorithms, backpropagation in neural networks
 Clustering algorithms: k-means clustering, self-organizing
maps
 Statistical learning algorithms: expectation maximization
(EM), forward-backward, Gibbs sampling
 Heuristic search: branch-and-bound, suffix trees, Tabu
search, nearest neighbor
Categories of learning algorithms
Learning algorithms fall into two broad categories:
 Supervised learning: the “training” data is “labeled” with the
features that the model will be used to predict.
– Classification
– Regression
 Unsupervised learning: unlabeled training data is used and
clusters or “surprising” patterns are sought
– Clustering
– Pattern discovery
Objectives of Pattern Analysis
 Patterns in biological data can be used to describe
and predict, among other things, the properties
and relationships of genes, proteins and species.
 These include:
– Evolution
– Structure
– Function
– Regulation
– Interaction networks
Evolution
 How are current species
(or genes or proteins)
related evolutionarily?
 What kind of
reorganizations have
chromosomes undergone
over evolutionary time?
 What speciation and gene
duplication events have
occurred?
Structure
Buffalo Center of Bioinformatics
What is the 3-D structure
of a particular protein?
Function
 What protein-protein
and protein-DNA
interactions does a
protein or DNA
molecule engage in?
 Which amino acid
residues or DNA bases
are involved in the
interactions?
Regulation
 What transcription
factors (proteins etc.)
and DNA binding sites
are involved in the
regulation of the
transcription of a
particular gene?
 How are the signals
arranged along the
chromosome?
Wasserman and Sandelin
Interaction networks
 What enzymes and
substrates are involved
in a particular
metabolic pathway?
 What is the network of
interactions of genes
involved in
development?
Elements of Pattern Discovery
Pattern discovery requires:
 A pattern language
– This defines what kind of patterns you can find.
(Models are described in the pattern language.)
 An objective function
– This defines what makes a pattern “interesting”.
 An algorithm
– This defines how to search among the possible patterns
to find the “interesting” ones.
Pattern search is generally much simpler: it only
requires computing the objective function.
Pattern discovery example:
sequence patterns
We will illustrate pattern discovery using
sequence pattern examples.
 Sequence patterns in protein, DNA and RNA
 Sequence pattern languages
 Objective functions for sequence patterns
 Learning algorithms for sequence patterns
Protein sequence patterns:
the “leucine zipper”
Pattern:
L-X(6)-L-X(6)-L-X(6)-L
The leucine side chains
extending from one
alpha-helix interact with
those from a similar
alpha helix of a second
polypeptide, facilitating
dimerization.
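A pattern such as L-X(6)-L-X(6)-L-X(6)-L can be checked mechanically by translating it into a regular expression. The following is a minimal sketch, assuming PROSITE-style syntax in which X means any residue and (n) or (m,n) is a repeat count. The helper name prosite_to_regex and the test peptide are hypothetical and are not part of the talk.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a simple PROSITE-style pattern into a Python regex string."""
    parts = []
    for element in pattern.split("-"):
        # Separate the symbol from an optional repeat such as "(6)" or "(2,4)".
        m = re.fullmatch(r"([^()]+)(?:\((\d+(?:,\d+)?)\))?", element)
        if m is None:
            raise ValueError(f"unsupported element: {element!r}")
        symbol, repeat = m.groups()
        symbol = "." if symbol.upper() == "X" else symbol  # X = any residue
        parts.append(symbol + (f"{{{repeat}}}" if repeat else ""))
    return "".join(parts)

LEUCINE_ZIPPER = prosite_to_regex("L-X(6)-L-X(6)-L-X(6)-L")

# Hypothetical peptide with a leucine at every seventh position.
peptide = "MK" + "LAAAAAA" * 3 + "LGG"
match = re.search(LEUCINE_ZIPPER, peptide)
print(LEUCINE_ZIPPER, match.start() if match else None)
```

A match only says that the spacing of leucines fits the pattern; it does not by itself show that the region forms a zipper.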
DNA sequence patterns:
a protein-coding gene
Patterns in RNA sequences
Human RNA splice junctions
sequence matrix
http://www-lmmb.ncifcrf.gov/~toms/sequencelogo.html
Higher-order sequence patterns
Cis-regulatory modules often involve clusters of
binding sites for one or more transcription factors
(e.g., the Drosophila EVE gene).
Clusters of 13 or more pattern matches in a window of 700 bp.
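The cluster criterion above (13 or more matches within 700 bp) can be sketched as a scan over sorted match positions. This is only an illustration, assuming that a match is counted by its start coordinate; the function name dense_windows and the example positions are hypothetical.

```python
from bisect import bisect_left

def dense_windows(starts, window=700, min_hits=13):
    """Return (window_start, n_matches) for every window that begins at a
    match and contains at least min_hits match starts within `window` bp."""
    starts = sorted(starts)
    hits = []
    for i, s in enumerate(starts):
        j = bisect_left(starts, s + window)  # index of first match outside the window
        if j - i >= min_hits:
            hits.append((s, j - i))
    return hits

# Hypothetical match positions: 13 matches spaced 50 bp apart, plus two isolated ones.
positions = [1000 + 50 * k for k in range(13)] + [5000, 9000]
print(dense_windows(positions))
```

Overlapping windows are reported separately in this sketch; a real tool would probably merge them into a single predicted module.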
Sequence pattern description
languages
 Regular expressions
 Profiles
 Hidden Markov Models (HMMs)
 Motif-based HMMs
Regular expressions define sets
of sequences that they match
Sp1 binds to DNA via 3
zinc-finger binding
domains:
C-X(2,4)-C-X(3)-[LIVMFYWC]-X(8)-H-X(3,5)-H
These particular domains
recognize Sp1 binding
sites:
GRGGCRGGW
Transcription factor Sp1 binding
to DNA
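A DNA consensus such as GRGGCRGGW is itself a regular expression written with IUPAC ambiguity codes (R = A or G, W = A or T). As a rough sketch, the consensus can be expanded and used to scan a sequence. The function name iupac_to_regex and the DNA fragment are hypothetical, and only the forward strand is scanned here.

```python
import re

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

def iupac_to_regex(consensus: str) -> str:
    """Expand an IUPAC consensus string into a regex character-class string."""
    return "".join(IUPAC[base] for base in consensus.upper())

SP1_SITE = re.compile(iupac_to_regex("GRGGCRGGW"))

# Hypothetical DNA fragment containing one matching site.
dna = "TTACGGGGCGGGATCC"
sites = [(m.start(), m.group()) for m in SP1_SITE.finditer(dna)]
print(sites)
```

Every matching site is treated as equally good, which is exactly the limitation that profiles address.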
Profiles are more powerful than
regular expressions
Regular expressions do not capture the
statistics of the variation in sequence
patterns—they just tell you what letters are
permissible at each position in the pattern.
Profiles capture the frequency of each letter at
each position in the pattern so you can tell
how well a potential site matches the pattern
(the site’s probability).
Profiles are built from multiple
alignments of instances of a pattern
Example: nuclear hormone
receptor transcription
factor binding site profile
derived from
experimentally determined
sites.
Observed counts can be
converted to frequencies
by dividing by the number
of observed instances.
So profiles are probabilistic
models of sequence patterns.
Counts of the number of
times each letter is
observed at each
position in the pattern.
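The conversion from counts to frequencies can be written out in a few lines. The sketch below uses the eight aligned sites shown in the AlignACE example of this talk as input; the function names count_matrix and frequency_matrix are hypothetical, and no pseudocounts are added, so unobserved letters get frequency zero.

```python
from collections import Counter

ALPHABET = "ACGT"

def count_matrix(sites):
    """Count how often each letter occurs at each position of the aligned sites."""
    width = len(sites[0])
    if any(len(s) != width for s in sites):
        raise ValueError("all sites must have the same length")
    columns = [Counter(s[i] for s in sites) for i in range(width)]
    return [{b: col[b] for b in ALPHABET} for col in columns]

def frequency_matrix(counts):
    """Divide each count by the number of observed instances in its column."""
    return [{b: n / sum(col.values()) for b, n in col.items()} for col in counts]

SITES = [
    "AAAAGAGTCA", "AAATGACTCA", "AAGTGAGTCA", "AAAAGAGTCA",
    "GGATGAGTCA", "AAATGAGTCA", "GAATGAGTCA", "AAAAGAGTCA",
]
counts = count_matrix(SITES)
freqs = frequency_matrix(counts)
print(counts[0], freqs[0])
```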
Hidden Markov Models
 HMMs are statistical models (like profiles)
that assign a probability (score) to any
(sub-)sequence they are presented with.
 HMMs can model whole sequences or
domains (e.g., PFAM models of protein
domains).
 HMMs can also be built from one or more
profiles to model groups of interacting
patterns (e.g., cis-regulatory modules).
A “motif” HMM
Each box is a “state” and recognizes letters with
the probabilities in the vertical rectangles.
This simple HMM is equivalent to a profile.
State   1    2    3    4    5
A      .7   .1   .1   .0   .1
C      .1   .0   .1   .9   .0
G      .1   .0   .8   .0   .0
T      .1   .9   .0   .1   .9
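Because this HMM visits states 1 to 5 in order, the probability it assigns to a 5-letter site is the product of one emission probability per state, just as for a profile. The sketch below uses the emission values from the slide, assuming the three listed columns belong to states 1, 2 and 3 in that order and that the third letter listed for state 4 is G. The function names are hypothetical.

```python
# Emission probabilities per state, as read from the slide (see assumptions above).
EMISSIONS = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},  # state 1
    {"A": 0.1, "C": 0.0, "G": 0.0, "T": 0.9},  # state 2
    {"A": 0.1, "C": 0.1, "G": 0.8, "T": 0.0},  # state 3
    {"A": 0.0, "C": 0.9, "G": 0.0, "T": 0.1},  # state 4 (G entry is an assumption)
    {"A": 0.1, "C": 0.0, "G": 0.0, "T": 0.9},  # state 5
]

def motif_probability(site: str) -> float:
    """Probability of a site: one emission per state, multiplied together."""
    if len(site) != len(EMISSIONS):
        raise ValueError("site length must equal the number of states")
    p = 1.0
    for state, letter in zip(EMISSIONS, site):
        p *= state[letter]
    return p

def best_window(sequence: str):
    """Return (start, site, probability) of the highest-scoring window."""
    width = len(EMISSIONS)
    windows = [(i, sequence[i:i + width]) for i in range(len(sequence) - width + 1)]
    start, site = max(windows, key=lambda w: motif_probability(w[1]))
    return start, site, motif_probability(site)

print(motif_probability("ATGCT"))
print(best_window("GGATGCTAA"))  # hypothetical sequence
```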
A motif-based HMM for
recognizing
cis-regulatory modules
 Motif states
 Complemented motifs
 Non-emitting states
 Emitting gap states
 Free transitions
This HMM can recognize (sub-)sequences
consisting of one or more motifs separated
by “gaps” (sequence of unknown function).
[Figure: motif-based HMM with states labeled +1, -1, +2 and -2, a within-cluster gap state and a between-cluster gap state]
Objective functions for Regular
Expression Patterns
 Possible objective functions are:
– Perfect matches only (no mismatches)
– Allow a given number of mismatches
– Allow a given density of mismatches (or
wildcards).
 To be interesting, the pattern must occur a
certain minimum number of times in the
data.
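These objective functions amount to counting occurrences under a mismatch budget and comparing the count with a threshold. A minimal sketch for a fixed-width pattern, assuming mismatches are measured by Hamming distance; the function names, the example sequences and the thresholds are all hypothetical.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def count_occurrences(pattern, sequences, max_mismatches=0):
    """Count windows in the data that match the pattern within the mismatch budget."""
    width = len(pattern)
    total = 0
    for seq in sequences:
        for i in range(len(seq) - width + 1):
            if hamming(pattern, seq[i:i + width]) <= max_mismatches:
                total += 1
    return total

def is_interesting(pattern, sequences, max_mismatches, min_occurrences):
    """A pattern is 'interesting' if it occurs at least min_occurrences times."""
    return count_occurrences(pattern, sequences, max_mismatches) >= min_occurrences

# Hypothetical data set.
SEQUENCES = ["ACGTTGACTCA", "TTGAGTCAGG", "CCCCCCCC"]
print(count_occurrences("TGACTCA", SEQUENCES))                    # perfect matches only
print(count_occurrences("TGACTCA", SEQUENCES, max_mismatches=1))  # one mismatch allowed
```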
Objective functions for profiles
and HMMs
 Profile- and HMM-based patterns are
usually ranked by statistical or information-theoretic measures:
– Likelihood ratio
– Information content
– Maximum a posteriori probability
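As one concrete case, information content can be computed per column as the relative entropy between the profile frequencies and a background distribution, summed over columns. The sketch below assumes a uniform background of 0.25 per base and treats 0 log 0 as 0; the function name and the three-column profile are hypothetical.

```python
import math

def information_content(profile, background=0.25):
    """Sum over columns of sum_b f_b * log2(f_b / background), in bits."""
    total = 0.0
    for column in profile:
        for freq in column.values():
            if freq > 0:  # a zero frequency contributes nothing
                total += freq * math.log2(freq / background)
    return total

# Hypothetical profile: a conserved column, a two-letter column, a uniform column.
profile = [
    {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0},
    {"A": 0.5, "C": 0.0, "G": 0.5, "T": 0.0},
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
]
bits = information_content(profile)
print(bits)
```

With a uniform background a DNA column contributes between 0 bits (no preference) and 2 bits (fully conserved).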
Example for motif HMMs:
the likelihood ratio
 Use the HMM to compute the likelihood of
the data: Pr(data | motif)
 Use a “background” model to compute the
likelihood of the data under the background
model: Pr(data | background)
 The likelihood ratio is:
Pr(data | motif) / Pr(data | background)
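For a single candidate site and a profile-like motif model, the ratio factorizes over positions, so it is usually computed as a sum of logs. A small sketch, assuming a position-independent background model; the three-column motif, the uniform background and the function name are hypothetical, and all motif probabilities are kept non-zero so that the logarithm is defined.

```python
import math

def log_likelihood_ratio(site, motif, background):
    """log2 [ Pr(site | motif) / Pr(site | background) ], summed per position."""
    if len(site) != len(motif):
        raise ValueError("site length must equal motif width")
    return sum(math.log2(col[ch] / background[ch]) for col, ch in zip(motif, site))

# Hypothetical three-column motif and uniform background.
MOTIF = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

print(log_likelihood_ratio("ATG", MOTIF, BACKGROUND))  # positive: motif explains it better
print(log_likelihood_ratio("CCC", MOTIF, BACKGROUND))  # negative: background explains it better
```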
Motif discovery algorithms
 The goal is to find a set of sites (or a motif model)
that maximizes the objective function.
 Motif discovery algorithms for finding sequence
motifs mostly use either EM (Expectation
Maximization) or Gibbs sampling.
 Gibbs sampling is a bit easier to visualize, so the
following slides illustrate it via the AlignACE
algorithm (by G. M. Church).
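To make the idea of Gibbs sampling concrete before the slides, here is a deliberately simplified site sampler. It is not AlignACE: it assumes exactly one site per sequence, a fixed motif width, a uniform background and no column sampling. All function names, the pseudocount and the toy sequences are hypothetical.

```python
import random

ALPHABET = "ACGT"

def build_profile(sites, pseudocount=1.0):
    """Column-wise letter frequencies with pseudocounts (no zero probabilities)."""
    profile = []
    for i in range(len(sites[0])):
        counts = {b: pseudocount for b in ALPHABET}
        for s in sites:
            counts[s[i]] += 1
        total = sum(counts.values())
        profile.append({b: c / total for b, c in counts.items()})
    return profile

def site_weight(site, profile, background=0.25):
    """Likelihood ratio of one candidate site: motif model versus background."""
    w = 1.0
    for col, ch in zip(profile, site):
        w *= col[ch] / background
    return w

def gibbs_site_sampler(sequences, width, iterations=200, seed=0):
    """Return one sampled site start per sequence (needs at least two sequences)."""
    rng = random.Random(seed)
    positions = [rng.randrange(len(s) - width + 1) for s in sequences]  # random seeding
    for _ in range(iterations):
        for i, seq in enumerate(sequences):
            # Build the model from every sequence except the one being resampled.
            others = [s[p:p + width]
                      for j, (s, p) in enumerate(zip(sequences, positions)) if j != i]
            profile = build_profile(others)
            weights = [site_weight(seq[k:k + width], profile)
                       for k in range(len(seq) - width + 1)]
            # Sample a new position in proportion to how well it fits the model.
            positions[i] = rng.choices(range(len(weights)), weights=weights)[0]
    return positions

# Hypothetical toy input.
toy = ["ACGTTGACTCAGG", "TTGACTCATTTAC", "GGCATGACTCAAA", "CCTGACTCACCGT"]
print(gibbs_site_sampler(toy, width=7, iterations=50, seed=1))
```

Because the sampler is stochastic, different seeds can end in different alignments; practical tools run many restarts and keep the best-scoring result.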
AlignACE Example
Input Data Set
5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT …HIS7
5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG …ARO4
5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT …ILV6
5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC …THR4
5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA …ARO1
5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA …HOM2
5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA …PRO3
300-600 bp of upstream sequence
per gene are searched in
Saccharomyces cerevisiae.
AlignACE Example
The Target Motif
5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT …HIS7
5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG …ARO4
5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT …ILV6
5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC …THR4
5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA …ARO1
5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA …HOM2
5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA …PRO3
AAAAGAGTCA
AAATGACTCA
AAGTGAGTCA
AAAAGAGTCA
GGATGAGTCA
AAATGAGTCA
GAATGAGTCA
AAAAGAGTCA
**********
MAP score = 20.37 (maximum)
AlignACE Example
Initial Seeding
5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT …HIS7
5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG …ARO4
5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT …ILV6
5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC …THR4
5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA …ARO1
5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA …HOM2
5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA …PRO3
TGAAAAATTC
GACATCGAAA
GCACTTCGGC
GAGTCATTAC
GTAAATTGTC
CCACAGTCCG
TGTGAAGCAC
**********
MAP score = -10.0
AlignACE Example
Sampling
Add?
5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT …HIS7
5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG …ARO4
5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT …ILV6
5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC …THR4
5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA …ARO1
5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA …HOM2
5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA …PRO3
TGAAAAATTC
GACATCGAAA
GCACTTCGGC
GAGTCATTAC
GTAAATTGTC
CCACAGTCCG
TGTGAAGCAC
**********
How much better is the
alignment with this site
as opposed to without?
TCTCTCTCCA
TGAAAAATTC
GACATCGAAA
GCACTTCGGC
GAGTCATTAC
GTAAATTGTC
CCACAGTCCG
TGTGAAGCAC
**********
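The question on this slide can be made numeric by scoring the alignment with and without the candidate site. The sketch below uses a simple stand-in score (the log-likelihood ratio of the aligned sites under their own smoothed profile versus a uniform background), not AlignACE's MAP score. The function names and the pseudocount are hypothetical; the sites are those shown on the slide.

```python
import math
from collections import Counter

ALPHABET = "ACGT"

def alignment_score(sites, pseudocount=0.5, background=0.25):
    """Stand-in score in bits: aligned sites under their own profile vs background."""
    n = len(sites)
    score = 0.0
    for i in range(len(sites[0])):
        counts = Counter(s[i] for s in sites)
        for base in ALPHABET:
            freq = (counts[base] + pseudocount) / (n + 4 * pseudocount)
            score += counts[base] * math.log2(freq / background)
    return score

def gain_from_adding(sites, candidate):
    """Score of the alignment with the candidate minus the score without it."""
    return alignment_score(sites + [candidate]) - alignment_score(sites)

CURRENT = [
    "TGAAAAATTC", "GACATCGAAA", "GCACTTCGGC", "GAGTCATTAC",
    "GTAAATTGTC", "CCACAGTCCG", "TGTGAAGCAC",
]
CANDIDATE = "TCTCTCTCCA"
gain = gain_from_adding(CURRENT, CANDIDATE)
print(f"change in score: {gain:.2f} bits")
```

In a sampler the candidate would then be added with a probability that grows with this gain, rather than being accepted or rejected outright.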
AlignACE Example
Continued Sampling
Add?
Remove.
5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT …HIS7
5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG …ARO4
5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT …ILV6
5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC …THR4
5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA …ARO1
5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA …HOM2
5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA …PRO3
TGAAAAATTC
GACATCGAAA
GCACTTCGGC
GAGTCATTAC
GTAAATTGTC
CCACAGTCCG
TGTGAAGCAC
**********
How much better is the
alignment with this site
as opposed to without?
ATGAAAAAAT
TGAAAAATTC
GACATCGAAA
GCACTTCGGC
GAGTCATTAC
GTAAATTGTC
CCACAGTCCG
TGTGAAGCAC
**********
AlignACE Example
Continued Sampling
Add?
5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT …HIS7
5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG …ARO4
5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT …ILV6
5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC …THR4
5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA …ARO1
5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA …HOM2
5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA …PRO3
GACATCGAAA
GCACTTCGGC
GAGTCATTAC
GTAAATTGTC
CCACAGTCCG
TGTGAAGCAC
**********
How much better is the
alignment with this site
as opposed to without?
TGAAAAATTC
GACATCGAAA
GCACTTCGGC
GAGTCATTAC
GTAAATTGTC
CCACAGTCCG
TGTGAAGCAC
**********
AlignACE Example
Column Sampling
5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT …HIS7
5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG …ARO4
5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT …ILV6
5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC …THR4
5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA …ARO1
5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA …HOM2
5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA …PRO3
GACATCGAAA
GCACTTCGGC
GAGTCATTAC
GTAAATTGTC
CCACAGTCCG
TGTGAAGCAC
**********
How much better is the
alignment with this new
column structure?
GACATCGAAAC
GCACTTCGGCG
GAGTCATTACA
GTAAATTGTCA
CCACAGTCCGC
TGTGAAGCACA
********* *
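Column sampling can be viewed as comparing two masks over a wider alignment. As a rough sketch, each mask can be scored by summing the information content of its active columns; this is a stand-in for AlignACE's own score, assuming a uniform background and no pseudocounts. The function names are hypothetical, and the 11-letter sites and the new mask are taken from the slide.

```python
import math
from collections import Counter

def column_information(sites, index, background=0.25):
    """Information content (bits) of one alignment column."""
    counts = Counter(s[index] for s in sites)
    n = len(sites)
    return sum((c / n) * math.log2((c / n) / background) for c in counts.values())

def mask_score(sites, mask):
    """Sum the information content over the columns marked '*' in the mask."""
    return sum(column_information(sites, i)
               for i, flag in enumerate(mask) if flag == "*")

WIDE_SITES = [
    "GACATCGAAAC", "GCACTTCGGCG", "GAGTCATTACA",
    "GTAAATTGTCA", "CCACAGTCCGC", "TGTGAAGCACA",
]
old_mask = "********** "   # the original ten contiguous columns
new_mask = "********* *"   # tenth column dropped, eleventh added
delta = mask_score(WIDE_SITES, new_mask) - mask_score(WIDE_SITES, old_mask)
print(f"change in score: {delta:.3f} bits")
```

A positive change favors the new column structure and a negative one favors keeping the old mask.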
AlignACE Example
The Best Motif
5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT …HIS7
5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG …ARO4
5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT …ILV6
5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC …THR4
5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA …ARO1
5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA …HOM2
5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA …PRO3
AAAAGAGTCA
AAATGACTCA
AAGTGAGTCA
AAAAGAGTCA
GGATGAGTCA
AAATGAGTCA
GAATGAGTCA
AAAAGAGTCA
**********
MAP score = 20.37
Conclusion
 Pattern analysis is an important tool for
making sense of the large amounts of data
being generated in biology laboratories.
 As pattern-description languages and
machine learning algorithms improve,
pattern analysis will become increasingly
useful.
Learning methods in today’s talks
 Supervised methods:
– Using support vector machines for classification
and regression
 Unsupervised methods:
– Discovering and using sequence patterns
– Using artificial neural networks and genetic
algorithms to discover patterns
– The generalized Gibbs sampler