Walk-thru of CAGE exercise

• Also at
http://people.binf.ku.dk/albin/teaching/htbinf/tag_analysis/
• …together with updated slides
• And linked from web page
Interlude: a logistics problem
• The largest cDNA project so far
made 102,000 cDNAs
• If you publish, you need to be
able to ship these to the
people asking for them
• This would take >50kg of dry
ice! Expensive and a logistics
nightmare since you need to
keep track of the 102,000
tubes
• How can we transfer DNA?
RNA-seq
• With a high-throughput tag sequencer, we can
also do the brute force approach – fragment
all mRNAs in a cell and sequence the pieces
(or part of the pieces)
• This is commonly referred to as RNA-seq
Compared to SAGE, CAGE
• Sequence the whole mRNA – not just the end
or the start
• Can give connectivity, so that we know which
exons are used, and which isoforms
• Is actually bad at capturing 5’ and 3’ edges,
due to statistical issues (white board demo)
Typical protocol
• Isolate mRNA
• Break up mRNAs
• Make cDNAs of RNA fragments
• Add adapters, amplify and sequence
We sequence 25-35 bp
reads…randomly selected from each
side of the fragment
Mapping tags
Challenge: What do we get (pros and cons) if we
map the tags
a) To the genome
b) To the transcriptome (for example, all RefSeq
transcripts)
Genome: unbiased – we could hit any
transcript. Tags that span splice junctions are hard to map,
as are, possibly, mRNAs that get modified…
Transcriptome: We hit annotated genes, and
splice sites are not a problem. On the other
hand, we cannot find new things
Going from tags to wigs
Showing all tags as blocks in the browser is
possible, but dumb – because there are
potentially thousands in the window of
interest, and we go blind
An easy way to summarize is to make nucleotide
histograms – whiteboard demo
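A nucleotide histogram can be sketched in a few lines of Python. This is only an illustration, not the code behind the course files: the tag coordinates are made up, the function names are hypothetical, and tags are assumed to be 1-based, inclusive (start, end) pairs on a single chromosome. The output uses the variableStep flavour of the wiggle format.

```python
from collections import Counter

def nucleotide_histogram(tags):
    """Count, for every genomic position, how many tags cover it.

    `tags` is an iterable of (start, end) pairs on one chromosome,
    1-based and inclusive (a convention assumed for this sketch).
    """
    coverage = Counter()
    for start, end in tags:
        for pos in range(start, end + 1):
            coverage[pos] += 1
    return coverage

def to_variable_step_wig(chrom, coverage):
    """Render the histogram as variableStep wiggle text."""
    lines = [f"variableStep chrom={chrom}"]
    for pos in sorted(coverage):
        lines.append(f"{pos} {coverage[pos]}")
    return "\n".join(lines)

# Three hypothetical 5-bp tags, two of them overlapping.
example_tags = [(100, 104), (102, 106), (200, 204)]
example_coverage = nucleotide_histogram(example_tags)
print(to_variable_step_wig("chr1", example_coverage))
```

Positions covered by both of the first two tags get a height of 2, so thousands of stacked blocks collapse into one curve per window.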
Looking at RNA-seq data
• At the tag_analysis web directory, there is a
wig file, mm9_brain.wig, showing tags from an
RNA-seq experiment from mouse brains.
Upload this to the browser and look at the
two genes below – are they expressed, and
how much?
• Kcnc3
• Hoxa5
Thought challenge: from tags to
expression
• We have a wig file showing where all the tags
match on the genome
• We have the UCSC annotation for all known
genes
• We want something like a microarray, saying
– Gene X has an expression of Y
– How can we do this? (2 minutes with your
neighbour)
“Naïve solution”
• For each gene, count the tags that overlap it
– Gene X has 45 tags
– Gene Y has 4578 tags
– Etc
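The naïve count could look roughly like the sketch below. Gene names, coordinates and tags are invented for illustration, and each gene is simplified to a single (start, end) interval, ignoring exon structure and strand.

```python
def count_tags_per_gene(genes, tags):
    """Count the tags that overlap each gene by at least one base.

    `genes` maps a gene name to a (start, end) interval and `tags` is a
    list of (start, end) intervals, all on the same chromosome and all
    inclusive (assumptions made for this sketch).
    """
    counts = {name: 0 for name in genes}
    for tag_start, tag_end in tags:
        for name, (gene_start, gene_end) in genes.items():
            # Two intervals overlap unless one ends before the other begins.
            if tag_start <= gene_end and tag_end >= gene_start:
                counts[name] += 1
    return counts

# Hypothetical annotation and tags.
genes = {"geneX": (1000, 1999), "geneY": (5000, 9999)}
tags = [(990, 1014), (1500, 1524), (5200, 5224),
        (5300, 5324), (9990, 10014), (3000, 3024)]
naive_counts = count_tags_per_gene(genes, tags)
print(naive_counts)
```

The nested loop is fine for a toy example; with millions of tags a sorted or indexed lookup would be needed.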
Problems with this?
Length of transcripts will have an
effect!
• A long transcript gives more tags when broken
up, and can be captured more easily
• So, the number of tags from a transcript
depends on
– Actual expression (number of RNA molecules)
– Length of the RNAs
Normalizing for length – not that hard
• For each gene, count the tags that overlap it,
and divide by gene length
– Gene X has 45/(length of X) tags
– Gene Y has 4578/(length of Y) tags
– Etc
What if we want to compare two experiments?
We also need to normalize for sample
size, just as in SAGE, CAGE and ESTs
• Recap: TPM (tags per million) is a normalization that rescales the
tag count to what we would get if the sample had
exactly one million tags
• …so, TPM = 10^6 * (#tags in my gene)/(total tags)
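As a small worked example of the TPM formula, the sketch below compares one gene across two samples of different size. The sample names and counts are made up, and "rest" is a hypothetical lump entry standing in for all other genes.

```python
def tags_per_million(tag_counts):
    """Rescale raw tag counts to tags per million (TPM).

    `tag_counts` maps gene name -> raw tag count for ONE sample.
    """
    total = sum(tag_counts.values())
    if total == 0:
        raise ValueError("sample contains no tags")
    return {gene: 1e6 * count / total for gene, count in tag_counts.items()}

# Two hypothetical samples; sample B was sequenced four times deeper.
sample_a = {"geneX": 45, "rest": 499_955}      # 500,000 tags in total
sample_b = {"geneX": 90, "rest": 1_999_910}    # 2,000,000 tags in total

tpm_a = tags_per_million(sample_a)
tpm_b = tags_per_million(sample_b)
print(tpm_a["geneX"], tpm_b["geneX"])
```

Sample B has twice as many raw tags for geneX, but once sample size is taken into account its TPM is lower than in sample A.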
Combining the two
• Normalize by gene length AND sample size
• Gene X has an expression of
– Z TPMs/(N)
– Where N is the RNA length.
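Putting the two normalizations together, a minimal sketch could look like this. The tag counts 45 and 4578 are the ones used on the earlier slides; the total tag count and the RNA lengths are invented so that the arithmetic is easy to follow, and the function name is hypothetical.

```python
def normalized_expression(tag_count, total_tags, rna_length):
    """Expression as TPM divided by RNA length (in nucleotides).

    tag_count  : tags overlapping the gene
    total_tags : all tags in the sample
    rna_length : length N of the RNA
    """
    tpm = 1e6 * tag_count / total_tags
    return tpm / rna_length

# Hypothetical sample with exactly one million tags.
total = 1_000_000
expr_x = normalized_expression(45, total, rna_length=1_500)
expr_y = normalized_expression(4578, total, rna_length=152_600)
print(expr_x, expr_y)
```

With these made-up lengths, both genes end up at about 0.03 TPM per nucleotide: Gene Y has roughly a hundred times more tags only because its RNA is roughly a hundred times longer.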
Summary of tag technologies
• ESTs: old, expensive, long tags. Biased to 5’ and 3’ of genes. Can be used for
exploration
• SAGE: 3’ end tags. Only gene expression, no functional data. Limited for
exploration
• CAGE/5’SAGE: 5’ end tags. Promoter expression and location. Can be used
for exploration
• RNA-seq: “Random” tags over the whole mRNA. Expression and location –
can be used for both expression and exploration