Sequence features of DNA binding sites
reveal structural class of associated
transcription factor
Narlikar L and Hartemink AJ. Bioinformatics. 2006 Jan
15;22(2):157-63.
Carol Sniegoski
The Central Dogma of Molecular Biology
Double-stranded
chain of
nucleotide bases
(A-T, C-G)
Single-stranded
chain of
nucleotide bases
(A,U,C,G)
Polypeptide chain
DNA Basics
• Two chains form a double
helix
• Chains have orientation
• 5’ end is “upstream”; 3’ end
is “downstream”
• Sugar-phosphate backbone
provides framework for bases
(A,C,G,T)
• Hydrogen bonds between
complementary base pairs
hold chains together
• A pairs with T, C pairs with G
Protein Basics
• Proteins are folded up polypeptide strings
• Sequence determines form; form determines function
• Function is focused at key domains (active sites, binding sites)
• Predicting form from sequence is an unsolved problem
• Experimental methods: NMR; X-ray crystallography
• Computational methods: predicting de novo; predicting based on
sequence similarity to other known proteins
Ball-and-stick model
Space-filling model
Cartoon model
Protein Structure
Primary protein
structure
The order of amino
acids
Secondary protein
structure
Common repeating
structures, often
formed by hydrogen
bonds
Tertiary protein
structure
The full 3-dimensional
folded structure
Quaternary protein
structure
Proteins composed of
multiple polypeptide
chains
Protein Domains
Structural domains
• Elements of tertiary
structure
• May be composed
of one or more
motifs (secondary
structure)
• Many domains
appear in a variety
of protein families
• Domains are
important to a
protein’s biological
function
Proteins Do (Almost) Everything
Gene Expression Control Points
Activating the gene structure
Initiating transcription of mRNA from DNA
Processing the mRNA transcript
Transporting the processed transcript from
nucleus to cytoplasm
Translating mRNA into protein
Controlling mRNA degradation
Components Needed for Transcription
• RNA polymerase (RNAP)
• Enzyme that transcribes DNA into RNA.
• DNA
• Accessible DNA sequence to be transcribed (gene).
• Various cis-acting DNA regulatory sequences located near the sequence
to be transcribed.
(Cis-acting = part of the DNA sequence; affects one copy of a gene.)
• The regulatory sequences serve as binding sites recognized by
transcription factors.
• Transcription factors (TFs)
• Set of trans-acting accessory proteins required to initiate transcription.
(Trans-acting = freely diffusible; affects both copies of a gene.)
• TFs have binding domains that recognize and bind to specific DNA
sequences.
RNA Polymerase
• The RNA polymerase protein transcribes DNA into RNA.
• It is not responsible for knowing when or where to start transcription.
New RNA
transcript
DNA double
helix
RNA polymerase
DNA Regulatory Sequences
• Characteristic regulatory sequences in DNA are bound by specific
transcription factors.
• Complexes of bound factors both locate and promote gene transcription.
Transcription
startpoint
• Promoter regions are usually located within 200 bp upstream of startpoint.
• Initiator (Inr): consensus sequence “YYAN(T/A)YY”, within 5 bp of startpoint
• TATA box: consensus sequence “TATAAAA”, 25 bp upstream of startpoint
• GC box: consensus sequence “GGGCGG”
• CAAT box: consensus sequence “CCAAT”
• Enhancer regions (not shown) are located farther upstream or downstream.
DNA Regulatory Sequences
• Modular
• Specific to a gene or a set of genes
• Specific to a condition or range of conditions
• Support complex control of gene transcription
[Figure: example DNA sequences for four genes, showing TATA box, CAAT box, GC box, and octamer motif elements upstream of the transcription startpoint; upstream and downstream directions are marked.]
Transcription Factors
Any factor that is needed for the initiation of transcription but is not
part of RNA polymerase
Three operationally defined classes of transcription factors:
• General factors
• Form an initiation complex with RNA polymerase around the
transcription startpoint
• Always required for initiation of transcription
• Unregulated
• Upstream factors
• Bind to specific DNA consensus sequences (promoters and enhancers)
upstream of the startpoint
• Required for adequately efficient initiation of transcription
• Unregulated
• Inducible factors
• Operate like upstream factors
• Highly regulated
• Responsible for controlling transcription patterns in time and space
Activating Inducible TFs (1)
Activating Inducible TFs (2)
Transcription Factors
• Transcription factors bind to DNA and to each other
to form complexes that initiate transcription
TFIIIA binds to a
site within the
promoter region
TFIIIC binds to form
a stable complex
TFIIIB (with 3 subunits)
now binds to its binding
site near the startpoint of
transcription
Finally RNA polymerase
binds and begins
transcribing the gene
Transcription Factors
• Even factors bound to remote enhancers
can contribute to the initiation complex
Enhancer
Gene
Basal transcription
complex
Enhancer-bound
complex
Binding Site Specificity
• Many TFs’ DNA-binding domains use similar types of mechanisms.
• Binding domain structures can be grouped into classes.
• Each class binds particular sets of DNA sequences (binding sites).
• Binding sites are usually somewhat degenerate (variable).
• Two common models for characterizing binding sites:
Regular expressions
• Construct a regular expression that
matches only the sequences at known
binding sites.
• Can match variable-length sequences.
• Does not provide information about
probability or binding affinity.
PSSM (Position-Specific Scoring Matrix)
• Next slide.
PSSM
Position-Specific Scoring Matrix
• Align known binding sites for the TF, all of length n.
• Create a 4xn matrix showing the number of times each base appears at
each position.
• To determine the TF’s binding affinity for sequence S, calculate
$\log \frac{P(S \mid M)}{P(S \mid B)}$
where P(S|M) is the probability of seeing S in the motif and P(S|B) is the probability of seeing S outside the motif.

Position   1   2   3   4   5   6   7   8   9  10
A          3   2   0  12   0   0   0   0   1   3
C          5   2  12   0  12   0   1   0   2   1
G          3   7   0   0   0  12   0   7   5   4
T          1   1   0   0   0   0  11   5   4   4
PSSM matrix built from an alignment of 12
binding sites of length 10 bp for yeast TF Pho4p
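The scoring step can be sketched in Python. This is a minimal illustration, not the method used in the paper: the column order of the count matrix is reconstructed from the slide layout, and the uniform background and the pseudocount of 1 are assumptions added so that bases with a zero count do not produce log(0).

```python
import math

# Pho4p count matrix as read from the slide (12 sites, 10 positions).
# The column order is an assumption reconstructed from the slide layout.
COUNTS = {
    "A": [3, 2, 0, 12, 0, 0, 0, 0, 1, 3],
    "C": [5, 2, 12, 0, 12, 0, 1, 0, 2, 1],
    "G": [3, 7, 0, 0, 0, 12, 0, 7, 5, 4],
    "T": [1, 1, 0, 0, 0, 0, 11, 5, 4, 4],
}
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}  # assumed uniform
PSEUDOCOUNT = 1.0  # assumed; keeps unseen bases from giving log(0)


def log_odds_score(seq, counts=COUNTS, background=BACKGROUND, pseudo=PSEUDOCOUNT):
    """Return log( P(S|M) / P(S|B) ) for a sequence as long as the motif."""
    n = len(counts["A"])
    if len(seq) != n:
        raise ValueError(f"sequence must have length {n}")
    score = 0.0
    for pos, base in enumerate(seq):
        # Column total including the pseudocount for each of the 4 bases.
        col_total = sum(counts[b][pos] for b in "ACGT") + 4 * pseudo
        p_motif = (counts[base][pos] + pseudo) / col_total
        score += math.log(p_motif / background[base])
    return score
```

A sequence that follows the most frequent base in each column, such as CGCACGTGGG, scores higher than a sequence that does not, such as AAAAAAAAAA.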
The Experiment
Goal: Predict the type of DNA-binding domain that a TF has based on features of
the DNA sequences to which it binds.
Data: Encoded data about TFs’ classes and the sequences to which they
bind, as taken from the TRANSFAC database.
TRANSFAC Database
TRANSFAC® is a database on eukaryotic cis-acting regulatory DNA elements
and trans-acting factors. It covers the whole range from yeast to human. It
started in 1988 with a printed compilation and was transferred into computer-readable format in 1990.
The FACTOR table contains 6133 entries in 50 classes, but this figure does not reflect
the number of independent transcription factors. Homologous factors from different
species such as human and mouse SRF are given different entries since they may differ
in some molecular aspects. Factors originally described by different research groups as
binding to different genes may turn out identical when cloned. Also, more factors are
recognized as representatives of whole TF families that are products of distinct but
similar genes or alternative splice products. We have in general not entered proteins
just because of the presence of a putative DNA-binding motif. Thus there are many
more zinc finger or homeo domain proteins known than are included in FACTOR, but for
many no data about DNA-binding specificity or other gene regulatory features are
available.
The SITE table gives information on individual (putatively) regulatory protein binding
sites. It contains 7915 entries. 6360 of them refer to sites within 1504 eukaryotic genes.
1295 are artificial sequences. 260 have consensus binding sequences given in the IUPAC
code.
TRANSFAC Classes
1 Superclass: Basic Domains
*1.1 Class: Leucine zipper factors (bZIP). (IV)
*1.2 Class: Helix-loop-helix factors (bHLH). (III)
1.3 Class: Helix-loop-helix / leucine zipper factors (bHLH-ZIP).
1.4 Class: NF-1
1.5 Class: RF-X
1.6 Class: bHSH
2 Superclass: Zinc-coordinating DNA-binding domains
*2.1 Class: Cys4 zinc finger of nuclear receptor type. (II)
2.2 Class: diverse Cys4 zinc fingers.
*2.3 Class: Cys2His2 zinc finger domain. (I)
2.4 Class: Cys6 cysteine-zinc cluster.
2.5 Class: Zinc fingers of alternating composition
3 Superclass: Helix-turn-helix
*3.1 Class: Homeo domain. (VI)
3.2 Class: Paired box.
*3.3 Class: Fork head / winged helix. (V)
3.4 Class: Heat shock factors
3.5 Class: Tryptophan clusters.
3.6 Class: TEA domain.
4 Superclass: beta-Scaffold Factors with Minor Groove Contacts
4.1 Class: RHR (Rel homology region).
4.2 Class: STAT
4.3 Class: p53
4.4 Class: MADS box.
4.5 Class: beta-Barrel alpha-helix transcription factors
4.6 Class: TATA-binding proteins
etc.
TRANSFAC Class Hierarchy
Transcription Factor Classification
Last modified 2002-10-01
1 Superclass: Basic Domains
1.1 Class: Leucine zipper factors (bZIP).
1.1.1 Family: AP-1(-like) components
1.1.1.1 Subfamily: Jun
1.1.1.1.1 XBP-1 (human).
1.1.1.1.2 v-Jun (ASV).
1.1.1.1.3 c-Jun (mouse); c-Jun (rat); c-Jun (human); c-Jun (chick).
1.1.1.1.4 JunB (mouse).
1.1.1.1.5 JunD (mouse).
1.1.1.1.6 dJRA
1.1.1.2 Subfamily: Fos
1.1.1.2.1 v-Fos (FBR MuLV); v-Fos (FBJ MuLV); v-Fos (NK24).
1.1.1.2.2 c-Fos (mouse); c-Fos (human); c-Fos (rat); c-Fos (chick).
1.1.1.2.3 FosB (mouse).
1.1.1.2.3.1 FosB1
1.1.1.2.3.2 FosB2
1.1.1.2.4 Fra-1 (mouse); Fra-1 (rat).
1.1.1.2.5 Fra-2 (chick); Fra-2 (human).
etc.
TRANSFAC Factors
Drilldown on 1.1 Class: Leucine zipper factors (bZIP) lists factors in the class:
CL basic region + leucine zipper; 1.1.
CC A DNA-binding basic region is followed by a leucine zipper. The leucine zipper consists of
repeated leucine residues at every seventh position and mediates protein dimerization as a
prerequisite for DNA-binding. The leucines are directed towards one side of an alpha-helix. The
leucine side chains of two polypeptides are thought to interdigitate upon dimerization (knobs-into-holes
model). The leucine zipper dictates dimerization specificity. Upon DNA-binding of the dimer,
the basic regions adopt alpha-helical conformation as well. Possibly, a sharp angulation point
separates two alpha-helices of the subregions A and B leading to the scissors grip model for the
bZIP-DNA complex. The DNA is contacted through the major groove over a whole turn.
BF T03820 ABF1; Species: thale cress, Arabidopsis thaliana.
BF T03823 ABF2; Species: thale cress, Arabidopsis thaliana.
BF T03824 ABF3; Species: thale cress, Arabidopsis thaliana.
BF T03825 ABF4; Species: thale cress, Arabidopsis thaliana.
BF T04543 ABI5; Species: thale cress, Arabidopsis thaliana.
BF T04565 ACA1; Species: yeast, Saccharomyces cerevisiae.
BF T00027 AP-1; Species: clawed frog, Xenopus.
BF T00029 AP-1; Species: human, Homo sapiens.
BF T00030 AP-1; Species: monkey, Cercopithecus aethiops.
BF T00031 AP-1; Species: rat, Rattus norvegicus.
BF T00032 AP-1; Species: mouse, Mus musculus.
BF T03199 ARR1; Species: yeast, Saccharomyces cerevisiae.
BF T02783 ATB-2; Species: thale cress, Arabidopsis thaliana.
etc.
TRANSFAC Sites
Drilldown on factor ABF1 lists the sequences to which it binds:
SQ GGACGCGTGGC.
SQ TGTCGTGGGGACACGTGGCATACGAGGC.
SQ TGTCGGGGACACGTGGCGCTAACGAGGC.
SQ TGTCGGGACACGTGGCGCAACACGAGGC.
SQ TGTCGGGACACGTGGCCCACCCGGAGGC.
SQ TGTCGGGACACGTGGCACAAATAGAGGC.
SQ TGTCGTCAATGGACACGTGGCTAGAGGC.
SQ TGTCGTCGGACACGTGGCACGAAGAGGC.
SQ GCCTCGACAGGACACGTGGCACGCGACA.
SQ TGTCGATCAATGGACACGTGGCAGAGGC.
SQ GCCTCGGTGACACGTGGCTTGACCGACA.
SQ TGTCGGAAGTGGTGACACGTGGCGAGGC.
etc.
Feature Encoding (1)
• Encode each TF as a 1390-length feature vector.
• Don’t worry about too many features; the classifier will identify the
important ones.
• For 1387 features, calculate the arithmetic mean of the feature vectors for the
sequences the TF binds.
• Add 3 extra binary features indicating whether the TF is plant, animal, or
fungus.
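The averaging step can be sketched as follows. The function name and the order of the three kingdom flags (plant, animal, fungus) are illustrative assumptions; the slide only says that three binary features are added.

```python
def tf_feature_vector(site_vectors, kingdom):
    """Average the per-site feature vectors and append three kingdom flags.

    site_vectors: list of equal-length numeric lists, one per binding site.
    kingdom: "plant", "animal", or "fungus" (flag order is an assumption).
    """
    if not site_vectors:
        raise ValueError("need at least one binding-site vector")
    n_sites = len(site_vectors)
    length = len(site_vectors[0])
    # Arithmetic mean of each feature across the TF's binding sites.
    mean = [sum(v[i] for v in site_vectors) / n_sites for i in range(length)]
    flags = [1.0 if kingdom == k else 0.0 for k in ("plant", "animal", "fungus")]
    return mean + flags
```

With 1387-length site vectors, the result has 1387 + 3 = 1390 entries.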
Feature Encoding (2)
• Encode each binding site as a 1387-length feature vector.
1364 integer features encoding subsequence frequency for subsequences up to length 5:
4^1 = 4 features for subsequences of length 1 (A, T, C, G)
4^2 = 16 for subsequences of length 2 (AA, AT, AC, AG, TA, TT, TC, TG, …)
4^3 = 64 for subsequences of length 3
4^4 = 256 for subsequences of length 4
4^5 = 1024 for subsequences of length 5
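A minimal sketch of the subsequence counting, assuming sequences contain only A, C, G, and T. The feature order used here (alphabetical within each length) is an arbitrary choice for illustration, not necessarily the order the authors used.

```python
from itertools import product

ALPHABET = "ACGT"
MAX_K = 5
# All subsequences of length 1 to 5: 4 + 16 + 64 + 256 + 1024 = 1364.
KMERS = ["".join(p) for k in range(1, MAX_K + 1)
         for p in product(ALPHABET, repeat=k)]


def kmer_counts(seq):
    """Count every overlapping subsequence of length 1..5 in seq."""
    counts = dict.fromkeys(KMERS, 0)
    for k in range(1, MAX_K + 1):
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return [counts[kmer] for kmer in KMERS]
```

For example, `dict(zip(KMERS, kmer_counts("GGACGCGTGGC")))` maps each subsequence to its count in one of the ABF1 binding sites.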
Feature Encoding (3)
8 binary features encoding the presence or absence of an ungapped palindrome of
half-length 3, 4, 5, or 6, either spanning the whole sequence or not.
• A palindromic sequence is equal to its complementary sequence read backwards.
• A and T, C and G are complementary bases.
1 for a palindrome of half-length 3, spanning (e.g., ACG CGT)
1 for a palindrome of half-length 3, not spanning (e.g., … ACG CGT …)
1 for a palindrome of half-length 4, spanning (e.g., ACGC GCGT)
1 for a palindrome of half-length 4, not spanning (e.g., … ACGC GCGT …)
etc.
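The palindrome features can be sketched as below. The reading of "spanning" (the palindrome is the entire sequence) and "not spanning" (the palindrome is a proper substring) is an assumption drawn from the examples on the slide, and the function names are illustrative.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Complement each base, then read the result backwards."""
    return seq.translate(COMPLEMENT)[::-1]


def is_palindrome(seq):
    """A palindromic sequence equals its own reverse complement."""
    return seq == reverse_complement(seq)


def palindrome_features(seq, half_lengths=(3, 4, 5, 6)):
    """Return 8 binary features: (spanning, not spanning) per half-length."""
    features = []
    for h in half_lengths:
        width = 2 * h
        # Spanning: the whole sequence is one palindrome of this width.
        spanning = len(seq) == width and is_palindrome(seq)
        # Not spanning: a palindrome of this width sits inside a longer sequence.
        inside = len(seq) > width and any(
            is_palindrome(seq[i:i + width])
            for i in range(len(seq) - width + 1)
        )
        features += [int(spanning), int(inside)]
    return features
```

For instance, `palindrome_features("ACGCGT")` sets only the first feature, because the whole sequence is a palindrome of half-length 3.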
Feature Encoding (4)
8 binary features encoding the presence or absence of a gapped palindrome of
half-length 3, 4, 5, or 6, either spanning the whole sequence or not.
• A gapped palindrome is a palindrome with a non-palindromic insertion in the
exact middle.
1 for a gapped palindrome of half-length 3, spanning (e.g., ACG … CGT)
1 for a gapped palindrome of half-length 3, not spanning (e.g., … ACG … CGT …)
1 for a gapped palindrome of half-length 4, spanning (e.g., ACGC … GCGT)
1 for a gapped palindrome of half-length 4, not spanning (e.g., … ACGC … GCGT …)
etc.
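A sketch of the gapped case under stated assumptions: a gapped palindrome is taken to be a half of length h, a gap of at least one base, and then the reverse complement of the first half. The content of the gap is not checked, and "spanning" again means the two halves sit at the two ends of the sequence. The authors' exact rules may differ.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Complement each base, then read the result backwards."""
    return seq.translate(COMPLEMENT)[::-1]


def gapped_palindrome_features(seq, half_lengths=(3, 4, 5, 6)):
    """Return 8 binary features: (spanning, not spanning) per half-length."""
    n = len(seq)
    features = []
    for h in half_lengths:
        spanning = False
        inside = False
        # Left half starts at `start`; leave room for a gap of at least 1.
        for start in range(n - 2 * h):
            left = seq[start:start + h]
            for gap in range(1, n - start - 2 * h + 1):
                right_start = start + h + gap
                right = seq[right_start:right_start + h]
                if left == reverse_complement(right):
                    if start == 0 and right_start + h == n:
                        spanning = True
                    else:
                        inside = True
        features += [int(spanning), int(inside)]
    return features
```

For example, ACGTTCGT is ACG, a two-base gap TT, then CGT, so only the first feature is set.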
Feature Encoding (5)
7 binary features encoding the presence or absence of a special sequence
identified in the literature as over-represented in the binding sites of certain
classes of TF.
Sequence                      Class
G..G                          Cys2His2 (I)
G..G..G                       Cys2His2 (I)
[GC]..[GC]..[GC]              Cys2His2 (I)
AGGTCA | TGACCT               Cys4 (II)
CA..TG                        bHLH (III)
TGA.*TCA                      bZip (IV)
TAAT | ATTA                   Homeodomain (VI)
Regular expression representation:
.    Any single character.
[]   Any single character inside the brackets.
|    Either the expression preceding or the expression following.
*    Zero or more of the preceding expression.
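The seven special-sequence features map directly onto Python's `re` module. The sketch below assumes a feature is 1 when the pattern matches anywhere in the binding site; the list and function names are illustrative.

```python
import re

# (pattern, associated class) pairs taken from the table on the slide.
SPECIAL_PATTERNS = [
    ("G..G", "Cys2His2 (I)"),
    ("G..G..G", "Cys2His2 (I)"),
    ("[GC]..[GC]..[GC]", "Cys2His2 (I)"),
    ("AGGTCA|TGACCT", "Cys4 (II)"),
    ("CA..TG", "bHLH (III)"),
    ("TGA.*TCA", "bZip (IV)"),
    ("TAAT|ATTA", "Homeodomain (VI)"),
]


def special_sequence_features(seq):
    """Return 7 binary features, one per pattern, 1 if it occurs in seq."""
    return [int(re.search(pattern, seq) is not None)
            for pattern, _ in SPECIAL_PATTERNS]
```

For example, the bHLH pattern CA..TG matches the sequence CACGTG, so the fifth feature is 1 for that sequence.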
Encoding Example
Encode sequence GGACGCGTGGC.
Length 1 subsequence: A=1, C=3, G=6, T=1
Length 2 subsequence: 7 features = 1 or 2; 9 features = 0
Length 3 subsequence: 9 features = 1; 55 features = 0
Length 4 subsequence: 8 features = 1; 248 features = 0
Length 5 subsequence: 7 features = 1; 1017 features = 0
Palindromes: 1 feature = 1; 7 features = 0
Gapped palindromes: 8 features = 0
Special sequences: ?
At least 1344 of the 1387 features for this binding sequence are zero-valued.
Dataset
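n = 587 columns, one for each TF
d = 1390 rows, one for each feature:
$X = \begin{bmatrix} x_{1,1} & x_{1,2} & \cdots & x_{1,587} \\ x_{2,1} & \cdots & & \\ \vdots & & & \\ x_{1390,1} & \cdots & & \end{bmatrix}$
1-of-m class encoding, one column per TF:
$Y = \begin{bmatrix} y_{1,1} & \cdots \\ \vdots & \\ y_{6,1} & \cdots \end{bmatrix}$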
SMLR Algorithm
Sparse Multinomial Logistic Regression
• Learns a multi-class classifier
• Simultaneously performs feature selection
• Reports the probabilities of a sample belonging to each of the m
classes, given m sets of feature weights, one for each class.
Linear Regression
Model/predict a dependent variable as
a linear function of independent
variables:
$y_i = b_1 x_{i1} + b_2 x_{i2} + \dots + b_n x_{in} + \varepsilon_i$
Find the best-fit line (i.e., estimate
the $b_i$’s) by minimizing the sum of the
squares of the vertical deviations from
each data point to the line:
$R^2 = \sum_i [\, y_i - f(x_i; b_1, b_2, \dots, b_n) \,]^2$
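A minimal pure-Python sketch of the least-squares fit, assuming a small, well-conditioned problem. It solves the normal equations by Gauss-Jordan elimination; a numerical library would normally be used instead, and the function names are illustrative.

```python
def fit_least_squares(X, y):
    """Solve (X^T X) b = X^T y for the coefficients b.

    X: list of rows, each a list of n independent-variable values.
    y: list of dependent-variable values, one per row.
    """
    n = len(X[0])
    # Augmented matrix [X^T X | X^T y].
    A = [[sum(row[i] * row[j] for row in X) for j in range(n)]
         + [sum(row[i] * yi for row, yi in zip(X, y))]
         for i in range(n)]
    for col in range(n):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular")
        A[col], A[pivot] = A[pivot], A[col]
        pivot_value = A[col][col]
        A[col] = [v / pivot_value for v in A[col]]
        for r in range(n):
            if r != col:
                factor = A[r][col]
                A[r] = [vr - factor * vc for vr, vc in zip(A[r], A[col])]
    return [A[i][n] for i in range(n)]


def sum_squared_residuals(X, y, b):
    """Sum of squared vertical deviations between the data and the fit."""
    return sum((yi - sum(bj * xj for bj, xj in zip(b, row))) ** 2
               for row, yi in zip(X, y))
```

For data generated exactly by y = 2·x1 + 3·x2, the fit recovers b = [2, 3] and the sum of squared residuals is zero up to rounding.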
Logistic Regression
Used when dependent variable y is binary.
Logit function of p is expressed as a linear
combination of the $x_i$.
$\operatorname{logit}(p) = \log \frac{p}{1-p} = w_0 + w_1 x_1 + \dots + w_n x_n = w^T x$
$p = P(y = 1 \mid x, w) = \frac{e^{w^T x}}{1 + e^{w^T x}}$
= probability that x belongs to class y, given x and w
$w = [\, w_0 \; w_1 \; \dots \; w_n \,]^T$: single weight vector of length d
$x = [\, x_0 \; x_1 \; \dots \; x_n \,]^T$: d feature values for one sample
[Figure: plot of p against x.]
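The relationship between the logit and the probability can be checked with a short sketch. The function names are illustrative, and the two-branch form of the logistic function is only there to avoid overflow for large negative inputs.

```python
import math


def sigmoid(z):
    """Logistic function e^z / (1 + e^z), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(w, x):
    """P(y = 1 | x, w) for weight vector w and feature vector x."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    return sigmoid(z)


def logit(p):
    """Log-odds of p; the inverse of the logistic function."""
    return math.log(p / (1.0 - p))
```

Applying `logit` to `predict_proba(w, x)` recovers the linear combination w^T x, which is the identity stated above.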
Multinomial Logistic Regression
Generalization of logistic regression.
Used when dependent variable y is multiclass.
$p = P(y^{(i)} = 1 \mid x, w) = \frac{e^{w^{(i)T} x}}{\sum_{j=1}^{m} e^{w^{(j)T} x}}$
= probability that x belongs to the
class encoded by $y^{(i)} = 1$, given w
$w = [\, w^{(1)T} \; w^{(2)T} \; \dots \; w^{(m)T} \,]^T$: weight vectors of length d for each of m classes
$x = [\, x_0 \; x_1 \; \dots \; x_d \,]^T$: d feature values for one sample
$y = [\, y^{(1)} \; y^{(2)} \; \dots \; y^{(m)} \,]^T$: one-of-m class encoding
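A sketch of the class-probability computation, with one weight vector per class. Subtracting the largest score before exponentiating is a standard numerical-stability step that does not change the result; the function name is illustrative.

```python
import math


def class_probabilities(W, x):
    """Return P(y^(i) = 1 | x, w) for each class i.

    W: list of m weight vectors, one per class.
    x: feature vector for one sample.
    """
    scores = [sum(wi * xi for wi, xi in zip(w, x)) for w in W]
    top = max(scores)
    # exp(s - top) / sum(exp(s - top)) equals exp(s) / sum(exp(s)).
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

The returned values are non-negative and sum to 1, and with m = 2 and one weight vector set to zero the first probability reduces to the binary logistic formula.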
Estimating w
In logistic regression, w is usually estimated using maximum likelihood (ML).
Want to find w that maximizes the probability of classifying samples correctly.
$P(y_j \mid x_j, w)$ = probability of classifying sample $x_j$ correctly, given the
values of w.
log-likelihood:
$l(w) = \sum_{j=1}^{n} \log P(y_j \mid x_j, w)$
$= \sum_{j=1}^{n} \log \frac{e^{w_j^T x_j}}{\sum_{i=1}^{m} e^{w^{(i)T} x_j}}$
($w_j$ indicates the weight vector for the class to which $x_j$ belongs)
$= \sum_{j=1}^{n} \left[ w_j^T x_j - \log \sum_{i=1}^{m} e^{w^{(i)T} x_j} \right]$
$= \sum_{j=1}^{n} \left[ \sum_{i=1}^{m} y_j^{(i)} w^{(i)T} x_j - \log \sum_{i=1}^{m} e^{w^{(i)T} x_j} \right]$
($y_j^{(i)}$ is 1 only when $x_j$ is in class i, 0 otherwise)
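The final form of the log-likelihood can be evaluated directly. In this sketch, class membership is given as an integer index per sample rather than a one-of-m vector, which is an equivalent encoding chosen for brevity.

```python
import math


def log_likelihood(W, X, labels):
    """Multinomial logistic log-likelihood l(w).

    W: list of m weight vectors, one per class.
    X: list of n feature vectors, one per sample.
    labels: list of n class indices in range(m).
    """
    total = 0.0
    for x, label in zip(X, labels):
        scores = [sum(wi * xi for wi, xi in zip(w, x)) for w in W]
        # log of sum_i exp(score_i), computed stably.
        top = max(scores)
        log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
        # Score of the true class minus the log normalizer.
        total += scores[label] - log_norm
    return total
```

With all weights set to zero, every class has probability 1/m, so the log-likelihood of n samples is n·log(1/m).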
Estimating a Sparse w
We want w to be sparse, with many zero values, deselecting many features.
Use the maximum a posteriori (MAP) method:
Penalize the ML estimate by placing a prior p(w) on the parameters w.
Choose a prior distribution that induces sparsity: the Laplace distribution.
$\hat{w}_{MAP} = \arg\max_w L(w) = \arg\max_w \left( l(w) + \log p(w) \right)$
where l(w) is the sum of log-likelihoods of the xi being classified correctly, given xi and w,
and p(w) is the probability that w comes from a Laplace distribution.
Laplace Distribution
$p(x) = \frac{1}{2b} e^{-|x - \mu| / b}$
$p(w) \propto e^{-\lambda \|w\|_1} = e^{-\lambda \sum_j |w_j|}$
• Remember ln p(w) is the MAP penalty function.
Larger $|w_j|$ → smaller p(w) → very negative ln p(w)
Smaller $|w_j|$ → larger p(w) → less negative ln p(w)
ln p(w) is at its max at ln p(w) = 0, where p(w) = 1 (ignoring the normalizing constant):
$e^{-\lambda \sum_j |w_j|} = e^{0} = 1$
• The λ parameter needs to be set appropriately.
Larger λ → greater sparsity, fewer features selected.
Authors chose λ=1 using cross-validation.
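Because the log of the Laplace prior is -λ times the L1 norm of w plus a constant, the MAP objective is the log-likelihood minus an L1 penalty. The sketch below shows only that bookkeeping; it takes an already computed log-likelihood value and is not the authors' optimizer. The function names are illustrative.

```python
def l1_norm(W):
    """Sum of absolute values of every weight in every class vector."""
    return sum(abs(wi) for w in W for wi in w)


def map_objective(log_lik, W, lam=1.0):
    """Penalized objective l(w) - lam * ||w||_1.

    The constant from the Laplace normalizer is dropped because it does
    not affect which w maximizes the objective.
    """
    return log_lik - lam * l1_norm(W)
```

An all-zero weight matrix incurs no penalty, and doubling λ doubles the cost of keeping any nonzero weight, which is why larger λ gives sparser solutions.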
Results
• 77 TFs misclassified during LOOCV, for 87% accuracy.
• 20% accuracy during LOOCV after permuting class labels
(28% accuracy expected).
(%error)(#TFs)
= #TFs misclassified
.23(97) = 22.31
.09(97) = 8.73
.11(61) = 6.71
.08(165) = 13.2
.17(52) = 8.84
.15(115) = 17.25
--------------------
Total = 77.04 ≈ .13(587)
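The arithmetic in this table can be reproduced in a few lines. The per-class error rates and class sizes are copied from the slide; the variable names are illustrative. The last line computes the share of the largest class, which happens to be about 28%; whether that is how the "expected" figure was derived is not stated on the slide.

```python
# (error rate, number of TFs) for each of the six classes, from the slide.
PER_CLASS = [(0.23, 97), (0.09, 97), (0.11, 61),
             (0.08, 165), (0.17, 52), (0.15, 115)]

total_tfs = sum(n for _, n in PER_CLASS)
misclassified = sum(rate * n for rate, n in PER_CLASS)

# Overall accuracy using the rounded count of misclassified TFs.
accuracy = 1 - round(misclassified) / total_tfs

# Share of the largest class (accuracy of always guessing that class).
largest_class_share = max(n for _, n in PER_CLASS) / total_tfs
```

The class sizes sum to 587, the expected misclassifications sum to about 77, and the resulting accuracy rounds to 87%.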
Results
• Analyzed feature
selection consistency
across LOOCV trials.
• Most features were
selected either very
infrequently (1047
features were selected
in < 10% of trials)
or very frequently (290
features were selected
in > 90% of trials).
This leaves 53 features
selected inconsistently.
Results
• Used trained classifier to
predict TF class based on
experimentally determined
binding site motifs.
• Used 14 TFs in TRANSFAC but
not in training set.
• TF binding sites were
experimentally determined.
• Motifs were extracted from the
binding sites using PSSM.
• Other potential binding sites with
the same motifs were located
using PSSM methods.
• These binding sites formed the
input data.
• Class was predicted
correctly for 12 of 14 TFs.
Conclusions
• The authors have developed a multiclass classifier that assigns TFs DNA-binding
domain classes based on features in their binding site sequences.
• They argue that this capability demonstrates that DNA binding sites contain
significant predictive information about TFs’ binding mechanisms.
• They note that their classifier consistently selects certain features and argue
for their biological plausibility.
• Nearly 1/3 of features are predictors of Class I, zinc finger proteins with poor
sequence specificity.
• Palindromic features are predictors of Class II, zinc finger proteins that form dimers.
• They argue that their method has implications for how TF binding sites should
be modeled.
• Regular expression models are not probabilistic
• PSSM models are length invariant
• They note that their classifier might be useful to biologists.
• Help to engineer proteins that bind to specific DNA sequences
• Predict which class of TF binds to sites found using conventional motif finding algorithms
Cell-Signaling Pathways