Detecting Peaks in Genomewide Location Analysis Data Using

Transcript Detecting Peaks in Genomewide Location Analysis Data Using

ChIP-chip
Data, Model and Analysis
Ying Nian Wu
Dept. Of Statistics
UCLA
Joint with Ming Zheng, Leah Barrera, Bing Ren
ChIP-chip

A technology for isolation and
identification of the DNA sequences
occupied by specific DNA binding proteins
(regulatory sequences) in living cells.

Chromatin-immunoprecipitation and
microarray analysis (chip) are combined
to study protein-DNA interaction in vivo.

Also known as “genome-wide location
analysis”.
ChIP-chip process
Step 1: Bound transcription factors are crosslinked to DNA with formaldehyde
ChIP-chip process (cont’d)
Step 2: sonication is used to break genomic DNA
to small DNA fragments (various lengths,
difficult to measure, 1-2kb)
ChIP-chip process (cont’d)
Step 3: Special antibody is added to immunoprecipitate DNA segments crossed-linked
with target protein
ChIP-chip process (cont’d)
Step 4.1: the cross-linking between DNA and protein is
reversed and DNA is amplified by LM-PCR and
labeled with a fluorescent dye Cy5.
ChIP-chip process (cont’d)
Step 4.2: As a negative control, a sample of DNA
which is not enriched by the immunoprecipitation process are also amplified by
LM-PCR and labeled with another dye Cy3.
ChIP-chip process (cont’d)
Step 5: Both IP-enriched and IP-unenriched
samples are hybridized to the same
oligonucleotide array.
ChIP-chip process (cont’d)
Step 6: The microarray is scanned, Cy5 and Cy3 signal
strengths are extracted, and log(Cy5/Cy3) is
calculated after normalization.
Ren, B. UCSD
Summary of
ChIP-chip
•Protein bound to DNA
•Sonication
•Immunoprecipitation
•Amplify DNA and add control
•Hybridize to probes
•Microarray analysis
Ren, B. UCSD
ChIP-chip data
SignalMap, NimbleGen Inc.
One probe is one data point in the dataset.
The x-axis represents the genomic position of the probe.
The y-axis (the height) denotes the signal strength log(Cy5/Cy3) of
each probe.
A closer look
SignalMap, NimbleGen Inc.
Cy5 signal

The Cy5 signal strength at a point P0 should be
proportional to the probability that an IPenriched segment contains that point.
Single binding site scenario



Assume there is only one binding site at
the origin.
To contribute to the signal at P0 :
1) this binding site is bound by protein
2) no cut should occur between 0 and P0
Signal at P0 is proportional to (approx):
P0
qB  exp{  ( s)ds}
0
Model derivation


Assume  (s) to be constant around the
binding site. Therefore, the Cy5 signal
strength should decrease exponentially
from the binding site.
Log(Cy5/Cy3) decreases linearly from the
binding site: triangular shape.
Cy5


log(
)  intercept slope1  P0   slope2 P0 
Cy3
Two binding sites scenario
Consider two BSs : B1 , B2 , and a point P0 in between.
Event A : B1 is bound and no cut occurs between P0 and B1.
Event B : B2 is bound and no cut occurs between P0 and B2 .
Signal at P0 is proportion
al to:
Pr(A)  Pr(B)  Pr(A  B) 
P0
q1 exp{   ( s)ds}
B1
B2
B2
P0
B1
 q2 exp{   ( s)ds}  q1q2 exp{   ( s)ds}
General scenario
Event A : no cut between P0 and nearest bound BS to theleft.
Event B : no cut between P0 and nearest bound BS to theRight.
Signal at P0 is proportion
al to: Pr(A)  Pr(B)  Pr(A  B)
General scenario
P r(A) 
m
 P r(nearestgood BS is B )  P r(nocut between B and P | B is good)
i
i 1
 m

P0
  [  (1  q j )]  qi   exp{   ( s )ds}
Bi
i 1  j i 1

m
i
0
i
Regression to fit triangle
A simple case: probes are
evenly spaced.

Y  ( yLL , yLL 1 ,, yL1 , y0 , y1R , , y RR1 , y RR )

L L 1
1
X 1  (  ,
,, ,0,0,,0,0)
L
L
L

1
R 1 R
X 2  (0,0, 0,0, , ,
, )
R
R
R
Best fitted triangle

Fix left boundary and the right boundary,
we can identify the slopes and intercept.

For different combinations of left and
right boundary, find the best one with the
minimum variance of residual.

This is the best fitted triangle centered at
the probe we are considering.
Mpeak process

Arrange local maxima
by their signal strength.

For the first local maximum, find the best fitted
triangle in a small neighborhood and identify the
center as peak.
Mpeak process

For any local maximum in the range of this
triangle, if the difference between two fitted
values is small, mark it as non-peak.

Continue this process until every local maximum
has been considered or smaller than a threshold.
P value of peaks



Null hypothesis: background signal in ChIP-chip
data follows normal distribution with mean 0.
n  Y  Yi / n is used as the statistic for testing:
it is zero-mean and variance stabilized.
Background signals are not independent: probes
close to each other tend to be included in the
same segment simultaneously.
Variance approximation
Assume thatYi correlateswith Y j when | Pi  Pj | m.
Var( Yi / n )  1 / n cov(Yi , Y j )  1 / n
| Pi  Pj | m
i, j
 var(Yi )(1 
 cov(Y , Y )
i
j
 cov(Y , Y ) / var(Y ))
| Pi  Pj | m
i
j
i
  2 (1  f )
T hemarginalvarianceof noise 2 and theauto - correlation
factor f can both be estimatedfrominput data.
Result
SignalMap, NimbleGen Inc.
Result
SignalMap, NimbleGen Inc.
Result
SignalMap, NimbleGen Inc.
Result
SignalMap, NimbleGen Inc.
How good the fit is?
Result
9,328 promoters for known
transcripts
1,196 putative promoters for
unknown transcripts
Ren, B. UCSD
Kim, T.H. et al. A high-resolution map of active promoters in
the human genome. Nature, 436, 876-880
Comparison with kernel smoothing
Multi-resolution
Peak tree
Why use model?

A promoter is characterized
not only by a large probe
signal, but also a truncated
triangle shape

Identify the neighboring
probes that are caused by
the same promoter to pool
the info for ranking the
potential binding sites
SignalMap, NimbleGen Inc.
Model justification

Intuitively, human vision recognizes the local
shape, instead of a single probe, to detect peaks.

Model fitting improves detection: 1) largest signal
may not always be the tip of the best fitted
triangle, 2) we can handle outliers caused by
probe malfunctioning.

For window smoothing, if the window size is not
chosen well, a local maximum of the window
average can well be the bottom of a valley.
Model justification

The model gives us a sensible way to choose the
range:


this enables us to pool many weak signals
together if they form a good triangle. So that
we can reduce the chance of false negative.
this prevents us from pooling too many weak
signals together if they do not form a good
triangle. So that we can reduce the chance of
false positive.
Model justification

Probabilistic approx: Poisson process
Fact: two different slopes around the nondifferential tip
Functional approx: line segments locally

Gives reasonable fit to data

Not enough data for more complex model
Not enough computational power to fit more
complex model within minutes



Software

Fast: ~10 seconds for ~400,000 probes
with a regular PC.

Robust to noise (data shown later).

Software and source code publicly available:
www.stat.ucla.edu/~zmdl/Mpeak
Chromosome structure
Lodish, H. et al. Molecular Cell Biology.
Histone and transcription
LS3 class note, UCLA

Histone proteins need to be modified and
DNA needs to be released for
transcription to take place.
Histone and transcription
Ren, B., UCSD
Twin-peak phenomenon
The promoter region is
in between two binding
sites of the modified
histone protein, e.g.,
Acetylated histone H3
(AcH3).
LS3
class
note,
UCLA
ChIP-chip data for AcH3
show a twin-peak
phenomenon, with a
valley corresponding to
promoter region.
SignalMap, NimbleGen Inc.
Possible solutions

Fit twin-peak shape to data based on the
probability model for two binding site
scenario.

Use Witkin’s scale-space filtering to detect
peaks and twin-peaks.

Detecting Peaks in Genomewide Location Analysis Data Using

Transcript Detecting Peaks in Genomewide Location Analysis Data Using

Directory