Finding sequence motifs in PBM data workshop project

Download Report

Transcript Finding sequence motifs in PBM data workshop project

Finding sequence motifs in PBM
data
Workshop Project
Yaron Orenstein
October 2010
MF workshop 10 © Yaron Orenstein
1
Outline
1. Some background again…
2. The project
MF workshop 10 © Yaron Orenstein
2
1. Background
Slides with Ron Shamir and Chaim Linhart
MF workshop 10 © Yaron Orenstein
3
Gene: from DNA to protein
DNA
PremRNA
transcription
MF workshop 10 © Yaron Orenstein
Mature
mRNA
splicing
protein
translation
4
DNA
• DNA: a “string” over the alphabet of 4 bases
(nucleotides): { A, C, G, T }
• Resides in chromosomes
• Complementary strands: A-T ; C-G
Forward/sense strand:
AACTTGCG
Reverse-complement/anti-sense strand:
TTGAACGC
• Directional: from 5’ to 3’:
(upstream)
AACTTGCGATACTCCTA
5’ end
MF workshop 10 © Yaron Orenstein
(downstream)
3’ end
5
Gene structure (eukaryotes)
Promoter
DNA
Coding strand
Transcription
start site (TSS)
Transcription
Pre-mRNA
Exon
Intron
Splicing
Mature mRNA
5’ UTR
Start codon
(RNA polymerase)
Exon
(spliceosome)
3’ UTR
Coding region
Translation
MF workshop 10 © Yaron Orenstein
Protein
Stop codon
(ribosome)
6
Translation
• Codon - a triplet of bases, codes a specific amino
acid (except the stop codons); many-to-1 relation
• Stop codons - signal termination of the protein
synthesis process
MF workshop 10 © Yaron Orenstein
http://ntri.tamuk.edu/cell/ribosomes.html
7
Genome sequences
• Many genomes have been sequences,
including those of viruses, microbes, plants
and animals.
• Human:
– 23 pairs of chromosomes
– 3+ Gbps (bps = base pairs) , only ~3% are genes
– ~25,000 genes
• Yeast:
– 16 chromosomes
– 20 Mbps
– 6,500 genes
MF workshop 10 © Yaron Orenstein
8
Regulation of Expression
• Each cell contains an identical copy of the
whole genome - but utilizes only a subset
of the genes to perform diverse, unique
tasks
• Most genes are highly regulated –
their expression is limited to specific
tissues, developmental stages,
physiological condition
• Main regulatory mechanism –
transcriptional regulation
9
MF workshop 10 © Yaron Orenstein
Transcriptional regulation
• Transcription is regulated primarily by transcription
factors (TFs) – proteins that bind to DNA
subsequences, called binding sites (BSs)
• TFBSs are located mainly (not always!) in the gene’s
promoter – the DNA sequence upstream the gene’s
transcription start site (TSS)
• BSs of a particular TF share a common pattern, or
motif
• Some TFs operate together – TF modules
TF
5’
BS
MF workshop 10 © Yaron Orenstein
TF
Gene
BS
TSS
3’
10
TFBS motif models
• Consensus (“degenerate”) string:
A ACT CT
C
G
AACTGT
CACTGT
CACTCT
CACTGT
AACTGT
gene 1
gene 2
gene 3
gene 4
gene 5
gene 6
gene 7
gene 8
gene 9
gene 10
• Statistical models…
• Motif logo representation
MF workshop 10 © Yaron Orenstein
11
Human G2+M cell-cycle genes:
The CHR – NF-Y module
CDCA3 (trigger of mitotic entry 1)
CTCAGCCAATAGGGTCAGGGCAGGGGGCGTGGCGGGAAGTTTGAAACT -18
CDCA8 (cell division cycle associated 8)
TTGTGATTGGATGTTGTGGGA…[25bp]…TGACTGTGGAGTTTGAATTGG +23
CDC2 (cell division control protein 2 homolog)
CTCTGATTGGCTGCTTTGAAAGTCTACGGGCTACCCGATTGGTGAATCCGGG
GCCCTTTAGCGCGGTGAGTTTGAAACTGCT 0
CDC42EP4 (cdc42 effector protein 4)
GCTTTCAGTTTGAACCGAGGA…[25bp]…CGACGGCCATTGGCTGCTGC -110
CCNB1 (G2/mitotic-specific cyclin B1)
AGCCGCCAATGGGAAGGGAG…[30bp]…AGCAGTGCGGGGTTTAAATCT +45
CCNB2 (G2/mitotic-specific cyclin B2)
TTCAGCCAATGAGAGT…[15bp]…GTGTTGGCCAATGAGAAC…[15bp]…GGGC
CGCCCAATGGGGCGCAAGCGACGCGGTATTTGAATCCTGGA +10
BS’s are short, non-specific, hiding in both strands and at various
TFs: NF-Y , CHR
12
locations
along
the
promoters
MF workshop 10 © Yaron Orenstein
Protein Binding Microarrays
Berger et al, Nat. Biotech 2006
• Generate an array of double-stranded
DNA with all possible k-mers
• Detect TF binding to specific k-mers
MF workshop 10 © Yaron Orenstein
13 13
PBM (2)
MF workshop 10 © Yaron Orenstein
14 14
PBM - implementation
• Use 60-mers (Agilent): 25nt constant
primer + 35nt variable region
• De Bruijn seq of all 10-mers (410 long)
split into 35nt long fragments with
9nt overlap
• ~40K probes
• For each 8-mer, combine signals from
all probes that contain it (or differ in
1nt) to obtain its binding score
MF workshop 10 © Yaron Orenstein
15 15
The computational challenge
• Input: PBM data (sequences and
binding scores) of one TF.
• Goal: Find a motif (PWM) that is the
binding site of that TF.
• Intuition: sequences that match the
motif (on one of the two possible
strands!) are expected to have high
binding scores.
MF workshop 10 © Yaron Orenstein
16
2. The project
MF workshop 10 © Yaron Orenstein
17
General goals
• Research
- Learn about known solutions
- Trial and error with training data
• Develop software from A-Z:
– Design
– Implementation (Optimization)
– Execution & analysis of test data
• A taste of bioinformatics
• Have fun
• Get credit…
MF workshop 10 © Yaron Orenstein
18
The computational task
• Given a set of PBM data of different
TFs.
• Find the binding site motif in PWM
format of each TF.
• Main challenges:
– Performance (time, memory)
– Accuracy
MF workshop 10 © Yaron Orenstein
19
Input
File with 41,923 lines, each containing a
probe sequence of length 35 and binding
intensity.
<sequence 35bp> \t <intensity> \n
MF workshop 10 © Yaron Orenstein
20
Input (II)
• For the training data, an additional
PWM file will be supplied for each
PBM data set.
A:
C:
G:
T:
<freq1> <freq2> … <freq10>
<freq1> …
<freq10>
…
…
• Separated by \t and \n.
• All lines must contain same number of
frequencies (10 is just an example).
MF workshop 10 © Yaron Orenstein
21
Input (III)
You will be given:
1. 10 training sets (PBM data + PWM)
2. 4 test sets (PBM data). You have to
provide the PWM.
3. In the final project presentation, you will
be given an online test set (PBM data) and
your software will be applied to it.
MF workshop 10 © Yaron Orenstein
22
Output
1. A PWM file describing the binding
site found in the given PBM file.
2. The PWM in motif logo format (i.e.
displayed on the screen).
bits = 2 - entropy
The file logo.zip contains a java package
with the code that will easily display
your motif.
MF workshop 10 © Yaron Orenstein
23
Output (II)
3. Show graphically how well your motif
predicts the binding intensity.
• One example (note it’s not PWM):
MF workshop 10 © Yaron Orenstein
24
Ranking 8-mers
• One possible way to start: rank the 8mers in some way. Scores for example:
1. Signal average.
2. Signal median.
• You can think of other scores that
incorporate more information, e.g.
position in probe sequence.
• This is just an example. You can think of
other ways to start.
MF workshop 10 © Yaron Orenstein
25
Alignment procedure
• Then, you can align the
significant 8-mers.
• You may take into account
the relative score.
• Don’t forget about the
reverse complement!
• Example: Cebpb TF
MF workshop 10 © Yaron Orenstein
26
Enrichment scores
• To test how good your motif is, you
can use an enrichment score.
• An enrichment score tests how good
the motif distinguishes between highranking probes and the rest of the
probes.
MF workshop 10 © Yaron Orenstein
27
Hypergeometric probability
drawn
not drawn
total
white
k
m−k
m
black
n−k
N+k−n−m
N−m
total
n
N−n
N
MF workshop 10 © Yaron Orenstein
28
Hypergeometric enrichment
score
• Let B and T (T  B) denote the BG and
target sets, respectively, and let b
and t denote the subset of probes
from the BG and target set,
respectively, that contain at least one
occurrence of the motif.
 | b |  | B |  | b | 


min(|T |,|b|) 
i
|
T
|

i

HG tail ( | B |,| T |,| b |,| t |)    
| B |
i |t|


|
T
|


MF workshop 10 © Yaron Orenstein
29
Hypergeometric score (2)
• The HG enrichment score computes
the probability of observing at least
|t| target sequences with a motif
occurrence, under the null hypothesis
that the probes in the target set
were drawn randomly, independently,
and without replacement from the BG
set.
• Code is provided in math.zip
MF workshop 10 © Yaron Orenstein
30
Wilcoxon-Mann-Whitney
(WMW) enrichment score
• Foreground probes are all those containing
a match, background are all the others.
• B and F are the sizes of background and
foreground, respectively.
• ρB and ρF are the sums of the background
and foreground ranks.
• Read more in supplementary info (Berger06).
1  B F 
area 


BF  B
F 
MF workshop 10 © Yaron Orenstein
31
Deciding the length of the
motif
• Another challenge is to decide the
length of the motif.
• Most binding site are 6-12 bp long.
• You should consider the information
each position contains and decide on
the length accordingly.
MF workshop 10 © Yaron Orenstein
32
Scoring your PWM
• One way to score your motif is by
ranking the probe sequences
according to a match score.
• You may use the given code for match
score.
• Compare the ranking of the probes
you got to the ranking according to
binding intensities. There are
different correlation score for that.
MF workshop 10 © Yaron Orenstein
33
Match Score between PWMs
• Already implemented for you:
1. Euclidian Distance:
2. Pearson Correlation Coefficient
3. KL Divergence
MF workshop 10 © Yaron Orenstein
34
Implementation
• Java (Eclipse) ; Linux (Other languages are
possible, but will not participate in bonus).
• Input: one single argument PBM filename
• Output: PWM file, PWM presented in logo
and graphical presentation of PWM
matching distribution among probes.
• Packages for motif logo and statistical
scores will be supplied
• Time performance will be measured
• Reasonable documentation
• Separate packages for data-structures,
scores, GUI, I/O, etc.
MF workshop 10 © Yaron Orenstein
35
Submission
• Printed design document.
• Printed code – for comments and remarks.
• Printed results document – for each test
set PWM logo + how good your result in
terms of correlation to the probes ranks.
• 4 PWM files, e.g. Test_1.pwm (submitted
by email).
• Executable for the online test.
MF workshop 10 © Yaron Orenstein
36
Grade
• 20% for the design
• 30% for the implementation (20% for modularity,
clarity, documentation, 10% for efficiency)
• 30% for the performance and experimental
results (20% for the accuracy on the 4 test
queries and 10% for the accuracy on the online
test query)
• 20% for the final report and presentation
• 10% bonus to the group with the most accurate
results
• 10% bonus for the group with the fastest
implementation
MF workshop 10 © Yaron Orenstein
37
Bonus grading
• Accuracy will be determined using the
provided code that compares two
PWMs.
• We will take the average of runs on
several different PBM data sets.
• Running time will be measured in java
implementation, and the average will
be taken.
MF workshop 10 © Yaron Orenstein
38
Schedule
1.First progress report 23/11
2.Design document 21/12
3.Final presentation 16/2
•
•
•
We shall meet with each group on each of
these dates – mark your calendars!
Schedule can be made earlier if you are
ready.
You are always welcome to meet us.
Contact us by email.
MF workshop 10 © Yaron Orenstein
39
Design document
• Due in week 12 (21/12).
• 3-5 pages (Word), Hebrew/English
• Briefly describe main goal, input and
output of program
• Describe main data structures,
algorithms, and scores.
• Meet with me before submission.
MF workshop 10 © Yaron Orenstein
40
Reference
• Berger MF, Philippakis AA, Quershi AM, He FS, EstepIII
PW, Bulyk ML. Compact, universal DNA microarrays to
comprehensively determine transcription-factor binding site
specificities. Nature biotechnology. 2006;338:1429-1435.
Very important! Read:
the_brain.bwh.harvard.edu/UPBMseqn/suppl_methods.doc
• Chen X, Hughes TR, Morris Q. RankMotif++: a motif-search
algorithm that accounts for relative ranks of K-mers in
binding transcription factors. Bioinformatics. 2007 Jul
1;23(13):i72-79.
MF workshop 10 © Yaron Orenstein
41
Fin
MF workshop 10 © Yaron Orenstein
42