sequencing all mRNAs

Tag-based expression/function analysis
Data files are on the course webpage (link under today's date), and also at:
http://people.binf.ku.dk/albin/teaching/htbinf/tag_analysis/
Where are we now?
• R to do statistics
• Genome browsers and Galaxy to visualize
genes and genomics data
• Analyzing expression with microarrays, R and
Bioconductor
• Tag analysis
• Proteomics
What we want in transcriptomics
• Know which transcripts are transcribed,
and how much each is transcribed
– Implicitly also which transcripts exist in the
cell, and what they look like!
• Intuitively, we could get all this information
by sequencing all mRNAs in one cell
General problems with cDNA
sequencing:
Reverse transcriptase falls off
Hard to sequence long transcripts
Many cDNAs are identical, but some occur only
once per cell (or less!). We need to sequence
MANY cDNAs
Very expensive if you want to sequence all
molecules
Solutions:
1) Do not sequence: use probes and
hybridization: microarrays and tiling arrays
(this is where we are now!)
2) Only sequence parts of transcripts: tag
sequencing (this is where we are heading)
Thought exercise
• What are the pros/cons of hybridization
(micro/tiling arrays) vs sequencing? 2 minutes
with your neighbor
Albin’s take
Hybridization
• + Cheap (per “gene”)
• + Mature methods
• + Standardized
• - Complex normalization needed
• - Cross-hybridization
• - Highly dependent on annotation of probes
• - Dependent on designed probes for genes
• - Cannot deal with repeats
• +/- Integrative signal (more on next slide)
Sequencing
• - Expensive (now, but changing)
• + “Unbiased”: no designed probes
• - Non-standard computational methods
• - More demanding processing (now)
• + Much easier statistics in the end
• + Less noisy
• + Much higher resolution, up to nucleotide level
• + Location information
• +/- Sampled signal (more on next slides)
Hybridization: integrative
We have many identical probes. Each time a probe gets a
hybridization event, we add a little to the signal.
This includes non-optimal hybridization events - just
something labeled that hybridizes will give some signal
Sequencing: sampling
The number of cDNAs in a library
is VERY LARGE
We pick only some of them to do
sequencing, randomly
Blind sampling (does not know
anything about RNAs)
We map sequences back to the
genome (a kind of quality
check)
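The sampling view can be made concrete with a small simulation. This is only a sketch in R: the pool of four transcripts and their abundances are invented for illustration, not taken from any real library. It draws a fixed number of tags blindly from the pool at two sequencing depths.

```r
# Hypothetical transcript pool: abundances are invented for illustration
set.seed(1)
pool <- c(geneA = 10000, geneB = 500, geneC = 20, geneD = 1)
prob <- pool / sum(pool)

# Draw 'depth' tags blindly from the pool (one multinomial sample)
sample_tags <- function(depth) {
  counts <- as.vector(rmultinom(1, size = depth, prob = prob))
  names(counts) <- names(pool)
  counts
}

shallow <- sample_tags(1000)     # a small library
deep    <- sample_tags(1000000)  # a large library

print(shallow)
print(deep)
```

With the small library the rarest transcript will usually get no tags at all, which is the reason we need to sequence MANY cDNAs.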
Why is this interesting?
• Sequencing approaches are generally
better than hybridization in quality and you
can also do more diverse experiments
• New sequencers make it possible to do this
almost as cheaply as with hybridization –
normal research groups can now buy the
capacity of an old sequencing centre
• It is basically the technology of the future
5 types of sequencing data for
expression (and functional) studies
• Non-subtracted cDNA
• ESTs
• SAGE
• CAGE
• RNA-seq
Why so many techniques?
• Historical reasons – technology development
over time
• Some of these technologies are only for
expression – others also give other
information (and different information)
• Difference in costs - efficiency
Non-subtracted cDNA
• Theoretically possible to sequence all
cDNAs in a cell
• Very, very expensive!
• Hard to get true expression, since
amplification is length-dependent
• Not very necessary to have the whole cDNA
for expression?
Expressed sequence tags (ESTs)
Sequence from 5’ and 3’
ends – until the reverse
transcriptase falls off
Cheaper than full-length
cDNAs
Problems:
many ESTs are simply
trash – the result of overenthusiastic sequencing
For longer genes, no
coverage of the middle
part
How can we use ESTs?
• View the EST as a random sample from a
pool of transcripts:
– The number of ESTs found from a transcript
should be proportional to the concentration of
that transcript in the cell = the expression
• How do we know which
transcript an EST comes from?
UniGene: clustering ESTs to
“genes”
Back in the 90s, the idea was to use a lot of ESTs to
find, and puzzle together, genes
The UniGene database is one of the outcomes of
this. Slightly obsolete, but useful at times
Basically, it tries to cluster ESTs and cDNAs into
functional units: “genes”
Bonus: we can use this to look at expression of
these genes – because we can count ESTs from
different libraries
Thought exercise: How?
• Say that we have two lung EST libraries (= two
collections of tags) from two patients, one of whom
has lung cancer
• How can we show that the expression of a given gene, like RARA, is
significantly altered in lung cancer?
• Think R! What do we need, and what tests should
we use?
• 2 minutes with your neighbor
“Electronic Northern blot”
• In a nutshell: Fill in the following
contingency table for a given gene

               ESTs from tissue A   ESTs from tissue B
RARA
Rest of ESTs

Fisher exact test situation!
We can do this within UniGene for single genes
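The contingency table can be wrapped in a small helper. This is a minimal sketch in R: the function name `electronic_northern`, its argument names and the counts are all invented for illustration, and only `fisher.test` is standard R.

```r
# Build the 2x2 table for one gene and run Fisher's exact test.
# Function and argument names are hypothetical, introduced for this sketch.
electronic_northern <- function(gene_a, total_a, gene_b, total_b) {
  tab <- matrix(c(gene_a,           gene_b,
                  total_a - gene_a, total_b - gene_b),
                nrow = 2, byrow = TRUE,
                dimnames = list(c("gene", "rest"),
                                c("tissueA", "tissueB")))
  fisher.test(tab)
}

# Invented counts: 30 of 100000 ESTs in tissue A, 5 of 120000 in tissue B
res <- electronic_northern(30, 100000, 5, 120000)
print(res$p.value)
```

The "rest" row is the library size minus the gene's own ESTs, so each EST is counted exactly once in the table.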
Side-story for non-life-scientists:
Northern what?
• Northern blot is classical method for
detecting RNA molecules
• Related to Southern and Western blot (DNA
and protein detection methods)
However…
• An electronic Northern is just a clever name,
although it has the same goal: finding RNAs
• It is nothing more than a statistical over-representation test of mRNAs, using ESTs
Unigene:
• http://www.ncbi.nlm.nih.gov/sites/entrez?db=unigene
• …or just google for unigene
Let’s look at the tissue constraints of human RARA…
[Figure callouts: EST hits from different tissues; public microarray data (nice for comparison, but not important now)]
Note that the sample sizes are very different!
1 tag out of 282332 is not the same as 1 tag out of 131488
What is TPM?
TPM = Tags per million
A normalization to be able to compare libraries of different sizes. Used very
often for tag-based expression.
“How many tags would my gene have if the sample size were 1 million?”
…so, TPM = 10^6 * (#tags in my gene) / (#total tags)
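The TPM formula is a one-liner in R. A minimal sketch, where the function name `tpm` is just a label chosen here, applied to the two library sizes mentioned above:

```r
# Tags per million: scale a raw tag count by the library size
tpm <- function(tags_in_gene, total_tags) {
  1e6 * tags_in_gene / total_tags
}

# One tag in each of the two libraries
print(tpm(1, 282332))
print(tpm(1, 131488))
```

A single tag is worth roughly 3.5 TPM in the larger library and roughly 7.6 TPM in the smaller one, which is why raw counts cannot be compared directly.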
Challenge
• Is the RARA gene significantly different in
expression in eye vs blood?

               ESTs from blood   ESTs from eye
Gene X         12                12
Rest of ESTs   124139 - 12       210756 - 12

> # rows: gene X, rest of ESTs; columns: blood, eye
> a <- matrix(c(12, 12, 124139 - 12, 210756 - 12),
+             nrow = 2, byrow = TRUE)
> fisher.test(a)
Fisher's Exact Test for Count Data
data: a
p-value = 0.2078
# so, despite nearly twice the TPM value in blood, not significant
So ESTs are fantastic?
…not really!
Sometimes useful, but
there are too few of them, and the libraries are
very diverse
…and way too expensive to make routinely in
a normal lab
Basically, ESTs are rarely used now, but the
data are still worth considering
Modern tag sequencing
• SAGE, CAGE and RNA-seq
Underlying idea:
• Only sequence as much as you need: 5', 3' or
whole cDNA (in pieces)
• Map tags to known cDNAs or the genome
(Thought exercise: what is the difference?)
SAGE
• After sequencing:
– Mask out adapters and primers
– Make a database of all possible hits in mRNAs following
the restriction site (white board demo)
– Map tags to this database, or the genome
• Mapping is surprisingly tricky
– We cannot use BLAST or BLAT alignments (the
sequences are too short)
– Sequencing errors exist, as well as RNA editing
– Some species have very few known mRNAs
Common approach
First identify all unique tags, and how many times we have seen them
AAAGATGCTGC    67
CAGTCGATCGAT   192
…
Correlate these tags with our gene database.
Sum up all the tags for each gene
Do the expression analysis!
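The steps above can be sketched in a few lines of R. Everything here is toy data: the tag reads, the `tag2gene` table and the gene names are invented, and a real tag-to-gene database would be much larger and messier.

```r
# Invented raw tag reads (after adapter masking); not real data
tags <- c("AAAGATGCTGC", "CAGTCGATCGAT", "AAAGATGCTGC", "CAGTCGATCGAT",
          "CAGTCGATCGAT", "CAGTCGATCGAT", "GGGTTTAAACC", "TTTTTTTTTTT")

# Step 1: unique tags and how often each was seen
tag_counts <- as.data.frame(table(tag = tags), stringsAsFactors = FALSE)
names(tag_counts) <- c("tag", "count")

# Step 2: hypothetical tag-to-gene database (several tags may hit one gene)
tag2gene <- data.frame(
  tag  = c("AAAGATGCTGC", "CAGTCGATCGAT", "GGGTTTAAACC"),
  gene = c("geneA", "geneB", "geneA"),
  stringsAsFactors = FALSE
)

# Step 3: attach genes to tags; tags without a known gene are dropped
mapped <- merge(tag_counts, tag2gene, by = "tag")

# Step 4: sum the tag counts for each gene
gene_counts <- aggregate(count ~ gene, data = mapped, FUN = sum)
print(gene_counts)
```

The tag `TTTTTTTTTTT` has no entry in the toy database and silently disappears in the merge. In real data it is worth checking how many tags are lost this way.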
How can we analyze count data?
• The difference from microarrays is that we deal with
integers
• The more counts for a gene, the more expressed
it is - theoretically a linear relation. We are
theoretically counting actual RNA molecules
• Very much like the EST case, we can make
statistics based on contingency tables if we have
two samples
Data flow for tags
…is a bit too complex for this course to do in real life
- takes time and requires programming (and a big
computer)
Mapping of tags to genes is complex, and no
standard solutions are adopted (yet)
Statistical analysis often involves making multiple
Fisher exact tests - this involves some R
programming
To get a feeling for the data, we will instead use a
website to do these things for us
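For those who want to see what "multiple Fisher exact tests" looks like in R, here is a minimal sketch. The gene names, counts and library sizes are invented. The multiple-testing correction at the end (Benjamini-Hochberg via `p.adjust`) is an addition made here as one common choice, not something the exercise prescribes.

```r
# Invented per-gene tag counts in two libraries (rows = genes)
counts <- data.frame(
  gene   = c("geneA", "geneB", "geneC"),
  normal = c(50, 10, 200),
  cancer = c(5, 12, 210)
)
total_normal <- 50000  # hypothetical library sizes
total_cancer <- 52000

# One Fisher exact test per gene: gene vs rest of tags, normal vs cancer
pvals <- mapply(function(n, cnc) {
  tab <- matrix(c(n, cnc, total_normal - n, total_cancer - cnc),
                nrow = 2, byrow = TRUE)
  fisher.test(tab)$p.value
}, counts$normal, counts$cancer)

# Many tests are run, so adjust the p-values (Benjamini-Hochberg here)
counts$p    <- pvals
counts$padj <- p.adjust(pvals, method = "BH")
print(counts)
```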
Typical data after mapping:
Tag Frequency
AAAAAAAAAA 173
AAAAAAAAAG 1
AAAAAAAAAT 1
AAAAAAAATA 2
AAAAAAACAA 1
AAAAAAACTA 2
AAAAAAATAA 1
We want to go from here to actual counts per
gene: we will let a web system do this for us
• In the data directory, I have collected two such
files: SAGE_Colon…, corresponding to normal and
cancer colon
• These are linked in the web page, also here:
http://people.binf.ku.dk/albin/teaching/htbinf/tag_analysis/
• Then, go to http://cgap.nci.nih.gov/SAGE/
• This page has many SAGE-related analyses. We
will try Digital Gene Expression Displayer (DGED)
Challenge
• Using DGED
• Use the “Two of your files” option to use
the two colon samples. Select “short tags”
• Try to understand what the statistical test
does (accept defaults)
• What types of genes are “over-expressed”
in colon i) cancer tissue vs normal tissue, ii)
normal tissue vs cancer tissue
Thought exercise
• What are the limitations with SAGE?
Albin’s take
• We can only measure expression – the location of
tags in genes has no functional meaning
• Dependent on gene annotation - we can map to
the genome, but hard to interpret such data
(what genes?)
• Compared to array data: very few standard
analysis methods
• Limited sequencing depth
5’ tagging
• Three methods that really do the same thing.
Difference lies in chemistry and throughput
and length of tags
– CAGE
– 5’SAGE
– 5’ Oligo-capping
• We will use CAGE as an example (“Cap
Analysis of Gene Expression”)
CAGE
Sequencing and mapping to the genome
CAGE vs …
• SAGE
– Conceptually the same thing, but you catch the 5’
end of the gene: the transcription start site and
thereby the promoter – which is a functional
entity
– Higher number of tags
– 5’ ends give functional data apart from
expression
Issues
• Only capped transcripts
– Some real transcripts are not capped
– Some capped transcripts are not full-length
• Associating 5’ ends with gene products is
sometimes problematic
– We only know starts of genes, not the length
• Tag length is borderline for mapping - 20-21 bp
• Not clear how to define cutoffs - how many tags
make a “real biological promoter”?
• Under-sampling: we miss a lot of promoters
because there are so many of them
Strengths
We are actually looking at promoters, not genes
Find novel promoters - sometimes within known
genes
We can look at expression at promoter level - for
instance define “tissue-specific” promoters
We can get a first unbiased look at where promoters
are, and how much they are used in a given cell
CAGE concepts
• The atomic unit in CAGE is the tag, mapped to the genome. The tag
comes from a given experiment (and has a label)
• What positional information is the most relevant for analysis?
[Figure: the tag, 20-21 bp]
Only 5’ ends are interesting!
• …since the 20 bp length is only for mapping
purposes.
• What if we have many tags overlapping one
another? How can we represent this?
Some soon-to-be-outdated terminology
So…
• Unlike SAGE, CAGE can be viewed as a “barplot”
on the genome, on nucleotide level
• How to cluster nearby CAGE tags to a meaningful
“promoter” is an open problem
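The "barplot on the genome" idea can be sketched in R as follows. The tag coordinates are invented. The gap-based clustering at the end is only one naive possibility, with an arbitrary `max_gap` chosen here; it is not an established answer to the open problem.

```r
# Invented mapped CAGE tags on one chromosome (1-based, inclusive coordinates)
tags <- data.frame(
  start  = c(100, 100, 102, 105, 300, 480, 481),
  end    = c(120, 120, 122, 125, 320, 500, 501),
  strand = c("+", "+", "+", "+", "+", "-", "-"),
  stringsAsFactors = FALSE
)

# The 5' end is the start on the plus strand and the end on the minus strand
tags$five_prime <- ifelse(tags$strand == "+", tags$start, tags$end)

# "Barplot" on the genome: number of tags per 5' position and strand
bars <- aggregate(count ~ five_prime + strand,
                  data = transform(tags, count = 1), FUN = sum)
bars <- bars[order(bars$strand, bars$five_prime), ]

# Naive clustering: start a new cluster when the strand changes or the gap
# to the previous 5' position exceeds max_gap (an arbitrary choice)
max_gap <- 20
new_cluster <- c(TRUE, diff(bars$five_prime) > max_gap |
                       bars$strand[-1] != bars$strand[-nrow(bars)])
bars$cluster <- cumsum(new_cluster)
print(bars)
```

Overlapping tags collapse into counts at single nucleotide positions, so the 20 bp tag length no longer matters once the tags are mapped.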
Within a promoter…
• …we can do exactly the same Fisher exact
tests as before (as we did with SAGE or ESTs for
whole genes)
• What is the advantage/disadvantage of
doing this on promoters instead of genes?
(2min)
The big answer: alternative promoters
with different tissue usage
CAGE resources
• Genomic element viewer (very similar to
UCSC browser)
– CAGE tags and cDNA landscapes
– Easiest by the links on fantom.gsc.riken.jp/3
Clicking on CAGE clusters gives two
options:
CAGE analysis viewer
CAGE basic viewer
CAGE resources
• Basic CAGE viewer
– Comprehensive browser of CAGE tags and CAGE
tag clusters, and library information
Challenge
• Look at the RARA gene in the MM5
assembly in the genomic elements
viewer (browser) (so, NOT UCSC).
• How many alternative promoters does it
have?
• Are any of these biased towards certain
tissues?
Some points
• Not that easy to say which of these promoters
are “significant”
• Easy to get overwhelmed by numbers when
counting tags
Back to work…
• We can treat CAGE tag counts, or really
TPMs, in a promoter as expression
• We can do the same analyses as in
microarrays - including the typical heatmap
• We will do a small exploratory study of
some CAGE data
• http://people.binf.ku.dk/albin/teaching/htbinf/tag_analysis/
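Before opening the exercise, here is a sketch in R of what "treating TPMs as expression" can look like. The promoter names, tissues and counts are invented. Using the column sum as the library size is an assumption of this toy example, since a real library contains many more promoters.

```r
# Invented tag counts: rows = promoters, columns = CAGE libraries (tissues)
counts <- matrix(c(120,  3, 80,
                     5, 60,  4,
                    40, 45, 50,
                     0, 30,  1),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(paste0("promoter", 1:4),
                                 c("liver", "brain", "lung")))

# Normalize each library to tags per million
# (the column sum stands in for the library size here)
tpm <- sweep(counts, 2, colSums(counts), "/") * 1e6

# Log-transform with a pseudocount so that zero counts are defined
log_tpm <- log2(tpm + 1)

# Cluster libraries by the correlation of their promoter profiles
lib_cor  <- cor(log_tpm)
lib_tree <- hclust(as.dist(1 - lib_cor))
print(round(lib_cor, 2))

# heatmap(log_tpm) would draw the typical heatmap from the same matrix
```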
Walk-through of CAGE exercise
• Also at
http://people.binf.ku.dk/albin/teaching/htbinf/tag_analysis/
• …together with updated slides
• And linked from the web page