Gene expression pipelining, applications and the wisdom



Genomic sequence of the pathogenic
and allergenic filamentous fungus
Aspergillus fumigatus
Nierman et al., Nature , 2005
Carlos De Niz
Gene expression pipelining,
applications and the wisdom of crowds
Carlos De Niz
Image taken from Konrad J. Karczewski’s website
What is gene expression?
Outline
Background
Basic steps of a pipeline
Transcriptome normalization and obtaining missing data
The wisdom of crowds
Examples and applications
Background
 The Cancer Genome Atlas (TCGA) kicked off in 2005 as a $100M pilot project of the National Cancer Institute (NCI) and the National Human Genome Research Institute (NHGRI)
• Create an atlas of changes for specific cancer types
• Pool the results
• World wide freely available data
 DNA=A,G,C and T -> These 4 “letters” give us a lot of headaches
 Cancer is a disease of the Genome
Hallmarks of cancer (Gutschner):
(1) sustaining proliferative signaling
(2) evading growth suppressors
(3) enabling replicative immortality
(4) activating invasion and metastasis
(5) inducing angiogenesis
(6) resisting cell death
Image: TCGAGenomics Brochure
• Transcript abundance: RNASeq
• For the TCGA consortium, the Cancer Genomics Hub (CGHub, UCSC) is a repository for storing, cataloging, and accessing cancer genome sequences, alignments, and mutations
• How to find data:
Barcode
Universal Unique Identifier (UUID)
TCGA-D8-A1JJ-01A-31R-A14M-07 -> 0307bd0b-b59a-4996-b89d-612e72652890
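A barcode like the one above can be split into its parts programmatically. The sketch below assumes the usual TCGA aliquot barcode layout (project, tissue source site, participant, sample plus vial, portion plus analyte, plate, center); the field names and the `parse_tcga_barcode` helper are illustrative and not part of the original slides. The UUID cannot be derived from the barcode, so it has to be looked up in the repository.

```python
from typing import NamedTuple


class TcgaBarcode(NamedTuple):
    """Fields of an aliquot-level TCGA barcode (field names are illustrative)."""
    project: str
    tss: str          # tissue source site
    participant: str
    sample: str       # two-digit sample type code
    vial: str
    portion: str
    analyte: str      # single-letter analyte code
    plate: str
    center: str


def parse_tcga_barcode(barcode: str) -> TcgaBarcode:
    """Split a full aliquot barcode into its seven dash-separated groups."""
    parts = barcode.strip().split("-")
    if len(parts) != 7:
        raise ValueError(f"expected 7 dash-separated groups, got {len(parts)}")
    project, tss, participant, sample_vial, portion_analyte, plate, center = parts
    return TcgaBarcode(
        project, tss, participant,
        sample_vial[:2], sample_vial[2:],          # e.g. "01A" -> "01", "A"
        portion_analyte[:2], portion_analyte[2:],  # e.g. "31R" -> "31", "R"
        plate, center,
    )


parsed = parse_tcga_barcode("TCGA-D8-A1JJ-01A-31R-A14M-07")
print(parsed)
```

For the barcode on the slide, the sample code is `01` and the analyte code is `R`, which in the TCGA code tables correspond to a primary solid tumor and RNA.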
RNASeq is now becoming more popular in clinical use
Transcript Quantification from RNASeq – Data Analysis Pipeline
 RNASeq produces millions of reads (ranging from 30-400 bp up to 10-15 kb) by sampling fragments of RNA, on platforms such as Illumina, SOLiD, 454, PacBio, etc.
 Once the sequencing is done, the tasks to achieve are:
 Map/align the reads to a reference genome or transcriptome (or assemble them de novo if there isn’t one)
 Estimate abundance at the gene/isoform level
 Detect differential expression, mutations, SNVs, gene fusions, SNPs, TE ID, etc.
Pipeline diagram: Reference genome (FASTA, Bowtie index) + RNASeq reads (FASTQ) -> Alignment (Bowtie, MapSplice) -> Reads aligned to the genome (SAM/BAM) + Reference transcriptome (GFF/GTF) -> Abundance estimation (RSEM, Cufflinks) -> Differential expression (Cuffdiff)
Each of these stages has to take place step by step, with the output of one stage feeding the next; that is why the overall process is called a pipeline
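The step-by-step dependency between stages can be sketched in a few lines. The example below is purely illustrative: the stage names and file names are placeholders chosen to mirror the diagram, and no real aligner or quantifier is run. It only checks that each stage finds the files produced by the stages before it.

```python
# Each stage is (name, required input files, produced output file).
# Names are hypothetical placeholders that mirror the diagram above.
STAGES = [
    ("index", {"genome.fasta"}, "genome.index"),
    ("align", {"genome.index", "reads.fastq"}, "aligned.bam"),
    ("quantify", {"aligned.bam", "annotation.gtf"}, "abundance.txt"),
    ("diff_expr", {"abundance.txt"}, "diff_expression.txt"),
]


def run_pipeline(stages, initial_files):
    """Walk the stages in order, failing as soon as an input is missing."""
    available = set(initial_files)
    order = []
    for name, inputs, output in stages:
        missing = inputs - available
        if missing:
            raise RuntimeError(f"stage {name!r} is missing inputs: {sorted(missing)}")
        # A real pipeline would launch the external tool here.
        available.add(output)
        order.append(name)
    return order, available


order, files = run_pipeline(STAGES, {"genome.fasta", "reads.fastq", "annotation.gtf"})
print(" -> ".join(order))
```

If the FASTQ reads are left out of the initial files, the sketch stops at the alignment stage, which is the behaviour one wants from a pipeline: a later stage never runs on incomplete input.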
Mapping and Gene Expression
 De Novo Splice Aligners (~6hrs) - Mapping
• TopHat
• MapSplice
• Subread
• STAR
 Quantitative Analysis and Differential Expression – Gene Expression (GExp)
• RSEM ~ 5 hrs
• Cufflinks ~ 4 hrs
Unix based
• eXpress ~ 7 hrs
• Matlab and R-based
• Sailfish – k-mer approach (no mapping required) ~ 0.09 hrs
The core of most of these programs is based on the Burrows-Wheeler Transform (mapping) and Expectation Maximization (GExp)
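To make the Burrows-Wheeler Transform concrete, here is a minimal sketch that builds it from sorted rotations and then inverts it. This is the textbook construction, far too slow for a genome; real aligners use compressed index structures instead. The function names and the `$` end marker are choices made for this example.

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Burrows-Wheeler Transform via sorted rotations (naive construction)."""
    if sentinel in text:
        raise ValueError("text must not contain the sentinel character")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The transform is the last column of the sorted rotation matrix.
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(transformed: str, sentinel: str = "$") -> str:
    """Rebuild the original text by repeatedly prepending and sorting."""
    table = [""] * len(transformed)
    for _ in range(len(transformed)):
        table = sorted(ch + row for ch, row in zip(transformed, table))
    original = next(row for row in table if row.endswith(sentinel))
    return original[:-1]


encoded = bwt("GATTACA")
print(encoded, inverse_bwt(encoded))
```

The transform is reversible and tends to group identical letters together, which is what makes it useful for compact, searchable indexes of a reference genome.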
RNASeq – Normalized measuring units
FPKM/RPKM for gene i:
$$\mathrm{FPKM/RPKM}_i = 10^{9} \times \frac{c_i}{l_i' \, N}$$
where:
RPKM = Reads Per Kilobase per Million mapped reads
FPKM = Fragments Per Kilobase per Million mapped reads
c_i = number of reads mapping to transcript i
N = total number of mappable reads
l_i' = length of transcript i
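The formula can be checked with a short calculation. The sketch below assumes the length is given in bases, which is what the factor 10^9 (10^3 for kilobases times 10^6 for millions of reads) implies; the function name and the numbers are made up for illustration.

```python
def fpkm(count: float, length_bp: float, total_mapped: float) -> float:
    """FPKM/RPKM = 1e9 * c_i / (l_i' * N), with the length given in bases."""
    if length_bp <= 0 or total_mapped <= 0:
        raise ValueError("length and total mapped reads must be positive")
    return 1e9 * count / (length_bp * total_mapped)


# Hypothetical example: 100 reads on a 2,000-base transcript,
# out of 1,000,000 mapped reads in the whole sample.
value = fpkm(count=100, length_bp=2000, total_mapped=1_000_000)
print(value)
```

Doubling the transcript length or the sequencing depth while keeping the read count fixed halves the value, which is the point of the normalization.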
What happens when we have missing data?
Expectation Maximization (EM)
• EM is a method to find the maximum likelihood estimator of a parameter θ of a probability distribution
• Putting this into a practical context:
o Let’s say the temperature outside your room’s window over the 24 hours of the day, x ∈ ℝ^24, depends on the season θ ∈ Θ = {summer, fall, winter, spring}, and that we know the seasonal temperature distribution p(x | θ) (with some West Texas exceptions of course, because #Lubbock)
o But let’s assume we can only measure the average temperature y = x̄ for the day:
 TASK: we want to guess what season θ it is
 The maximum likelihood estimate of θ MAXIMIZES p(y | θ). In some cases it may be hard to find
• That’s when EM is useful! EM takes the observed data y, iteratively makes guesses about the complete data x, and then
finds the θ that maximizes p(x | θ) over Θ
• EM tries to find the maximum likelihood estimate of θ given y
• EM doesn’t actually promise to find you the θ that maximizes p(y | θ), but there are some theoretical guarantees, and it
often does a good job in practice. However, it may need a little help in the form of multiple random starts
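Expectation: $Q(\theta \mid \theta^{(m)}) = E\left[\log p(x \mid \theta) \mid y, \theta^{(m)}\right]$
Maximization: $\theta^{(m+1)} = \arg\max_{\theta \in \Theta} Q(\theta \mid \theta^{(m)})$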
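In gene expression, the missing data are typically the true origins of reads that map to more than one transcript. The sketch below shows how EM can handle that in a deliberately simplified setting: it ignores transcript lengths and sequencing errors, and the data layout (one set of compatible transcript indices per read) is an assumption made for this example, not the model of any particular program.

```python
def em_abundance(read_compat, n_transcripts, n_iter=100):
    """Estimate transcript proportions from ambiguously mapped reads.

    read_compat: one set of compatible transcript indices per read.
    Returns a list of proportions that sums to 1.
    """
    theta = [1.0 / n_transcripts] * n_transcripts  # uniform starting guess
    for _ in range(n_iter):
        # Expectation: split each read among its compatible transcripts
        # in proportion to the current abundance estimates.
        expected = [0.0] * n_transcripts
        for compat in read_compat:
            total = sum(theta[t] for t in compat)
            for t in compat:
                expected[t] += theta[t] / total
        # Maximization: the new proportions are the normalized expected counts.
        theta = [count / len(read_compat) for count in expected]
    return theta


# Two reads unique to transcript 0, one unique to transcript 1,
# and one read compatible with both.
reads = [{0}, {0}, {0, 1}, {1}]
print(em_abundance(reads, n_transcripts=2))
```

In this toy case the estimates settle near 2/3 and 1/3: the ambiguous read is shared according to the evidence from the uniquely mapped reads.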
Required Data and Results
 FASTQ: a text-based format for storing both a biological sequence (usually a nucleotide sequence) and its corresponding quality scores (3-10 GB)
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCA
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
 FASTA: a format for representing either nucleotide sequences or peptide sequences, in which nucleotides or amino acids are represented using single-letter codes (3 GB)
NM_207127 2
CCTCGCTCCGCCTCCGGCCTCCTCCGAGAGCTCCAGACCTCCCGGCTACTCAGAAGCCCTCGGACTGCCCGGACCGCGC
 GTF/GFF: contains 9 columns of data; each line describes one feature. Version 2 spec (20-100 MB)
chr20	scripture	exon	61747569	61747837	.	+	.	gene_id "XLOC_013608.1"; transcript_id "TCONS_00028587.1"; exon_number "1"; oId "TCONS_00024272"; linc_name "linc-BIRC7-2"; tss_id "TSS21239"; class_code "u"; gene_name "linc-BIRC7-2";
 Annotation txt (for RSEM only): lists each gene ID together with its associated transcript ID
XLOC_013608.1 TCONS_00028587.1
XLOC_013608.2 TCONS_00028587.2
 BAM: A BAM file (.bam) is the binary version of a SAM file. A SAM file (.sam) is a tab-delimited text file that contains sequence alignment data
UNC11-SN627_66:4:47:2750:9058/1 339 chr1 10061 69 50M = 10179 168 TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCAA DFAFFFDF@FHFHGFHHHFHEHHHHGCGGGHHFHHHHHHHHHHHHHHHHH RG:Z:110302_UNC11-SN627_0066_AB047KABXX_4_ IH:i:3 HI:i:3 NM:i:0
 Abundance reads: contains the transcript abundance per gene (~1 MB)
gene_id  transcript_id(s)       length   effective_length  expected_count  TPM   FPKM
A1BG|1   uc002qsd.3,uc002qsf.1  1976.30  1854.25           46.00           0.73  0.48
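Two of the text formats above are simple enough to read by hand. The sketch below parses one FASTQ record and one GTF line. It assumes Phred+33 quality encoding (the slides do not say which encoding is used), tab-separated GTF columns, and attribute values that contain no semicolons; the helper names are made up for this example.

```python
def phred_scores(quality: str, offset: int = 33) -> list:
    """Decode a FASTQ quality string, assuming Phred+33 encoding by default."""
    return [ord(ch) - offset for ch in quality]


def parse_fastq_record(lines):
    """Parse the four lines of a single FASTQ record."""
    header, sequence, separator, quality = (line.rstrip("\n") for line in lines)
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a FASTQ record")
    return {"id": header[1:], "sequence": sequence, "quality": phred_scores(quality)}


def parse_gtf_line(line: str) -> dict:
    """Parse one tab-separated GTF line into its 9 columns."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) != 9:
        raise ValueError(f"expected 9 columns, got {len(columns)}")
    seqname, source, feature, start, end, score, strand, frame, attributes = columns
    parsed = {}
    for item in attributes.split(";"):
        item = item.strip()
        if item:
            key, _, value = item.partition(" ")
            parsed[key] = value.strip().strip('"')
    return {
        "seqname": seqname, "source": source, "feature": feature,
        "start": int(start), "end": int(end), "score": score,
        "strand": strand, "frame": frame, "attributes": parsed,
    }


record = parse_fastq_record(["@SEQ_ID", "GATTTGGGGT", "+", "!''*((((**"])
gtf = parse_gtf_line(
    'chr20\tscripture\texon\t61747569\t61747837\t.\t+\t.\t'
    'gene_id "XLOC_013608.1"; transcript_id "TCONS_00028587.1";'
)
print(record["quality"], gtf["attributes"]["gene_id"])
```

For real data, established libraries are a safer choice than hand-written parsers, since they also handle compressed files, multi-line records and malformed input.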
The Wisdom of Crowds (WOC)
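Figure: Francis Galton's ox-weighing contest. Individual guesses vary widely (examples shown: 1390, 987, 874, 1278, 977), yet the mean of the guesses (1197 lb) is almost exactly the real weight (1198 lb)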
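The effect is easy to reproduce with simulated guesses. The sketch below draws noisy, unbiased guesses around a known value and compares the error of the crowd mean with the average error of the individuals. The noise level, the number of guessers and the seed are arbitrary assumptions, and the simulation says nothing about biased crowds, where averaging helps much less.

```python
import random


def crowd_vs_individuals(true_value, n_guessers, noise_sd, seed=0):
    """Simulate independent, unbiased guesses and compare error levels."""
    rng = random.Random(seed)
    guesses = [rng.gauss(true_value, noise_sd) for _ in range(n_guessers)]
    crowd_estimate = sum(guesses) / len(guesses)
    crowd_error = abs(crowd_estimate - true_value)
    mean_individual_error = sum(abs(g - true_value) for g in guesses) / len(guesses)
    return crowd_estimate, crowd_error, mean_individual_error


estimate, crowd_err, individual_err = crowd_vs_individuals(
    true_value=1198, n_guessers=800, noise_sd=100
)
print(f"crowd estimate: {estimate:.1f}")
print(f"crowd error: {crowd_err:.1f}, mean individual error: {individual_err:.1f}")
```

The error of the mean can never exceed the mean of the individual errors (triangle inequality), and with independent guesses it is usually far smaller.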
MAQC/SEQC Consortium Data
• The Sequencing Quality Control (SEQC/MAQC) Project, along with the FDA:
o Examined Illumina HiSeq, Life Technologies SOLiD and Roche 454 platforms at multiple laboratory sites using reference RNA samples with built-in controls, plus TaqMan and PrimePCR verification (for 843 selected genes):
 Sample A - Universal Human Reference RNA
 Sample B - Human Brain Reference RNA
• The images below show the correlation between some of the RNA-SEQ technologies, in order to
compare gene expression consistency among them
For Sample A, approximately 400 different samples were averaged for Illumina and 190 for Life Technologies
Each plot also includes a model fitted by linear regression (red line), which can help predict one platform's measurements from another's
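A fit of this kind can be sketched with ordinary least squares. The code below computes the slope, intercept and Pearson correlation for paired measurements; the two short lists are invented numbers standing in for the same genes measured on two platforms, not SEQC data.

```python
import math


def linear_fit(xs, ys):
    """Ordinary least squares fit y = slope * x + intercept, plus Pearson r."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long sequences with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r


# Hypothetical log-expression values for five genes on two platforms.
platform_a = [1.0, 2.0, 3.0, 4.0, 5.0]
platform_b = [1.2, 1.9, 3.2, 3.8, 5.1]
slope, intercept, r = linear_fit(platform_a, platform_b)
print(f"slope={slope:.3f} intercept={intercept:.3f} r={r:.3f}")
```

A slope near 1 with a correlation near 1 would indicate that the two platforms report consistent expression levels for these genes.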
DREAM 5 – WOC (Dialogue on Reverse Engineering Assessment and Methods)
Aggregation is robust and often better than the best performer (transcriptional gene regulatory networks)
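One simple way to aggregate network predictions is to average the rank each method gives to every candidate edge. The sketch below illustrates that idea on three made-up rankings; the edge labels and the tie-breaking rule are assumptions for the example, and it is not a reproduction of the DREAM 5 analysis.

```python
def average_rank(rankings):
    """Combine several rankings of the same items by their mean rank.

    rankings: list of lists, each ordered from most to least confident.
    Returns the items sorted by average rank (ties broken alphabetically).
    """
    items = set(rankings[0])
    if any(set(ranking) != items for ranking in rankings):
        raise ValueError("all rankings must contain the same items")
    totals = {item: 0 for item in items}
    for ranking in rankings:
        for position, item in enumerate(ranking, start=1):
            totals[item] += position
    n = len(rankings)
    return sorted(items, key=lambda item: (totals[item] / n, item))


# Three hypothetical inference methods ranking the same candidate edges.
method_1 = ["A->B", "B->C", "A->C"]
method_2 = ["B->C", "A->B", "A->C"]
method_3 = ["A->B", "A->C", "B->C"]
print(average_rank([method_1, method_2, method_3]))
```

An edge that most methods rank highly ends up near the top of the combined list even if a single method misplaces it, which is where the robustness comes from.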
BioViva - First gene therapy successful against
human aging (April 21, 2016)
• Elizabeth Parrish, CEO of BioViva USA
• The first human to be ’successfully rejuvenated’ by gene therapy, after her own company’s experimental therapies reversed 20 years of normal telomere shortening
• Telomere score is calculated according to the telomere length of white blood cells (T-lymphocytes)
• The higher the telomere score, the ’younger’ the cells
• Telomeres are short segments of DNA which cap the ends of every chromosome, acting as
‘buffers’ against wear and tear. They shorten with every cell division, eventually getting
too short to protect the chromosome, causing the cell to malfunction and the body to age
 Her telomeres had lengthened by the equivalent of ~20 years, from 6.71 kb to 7.33 kb (the therapies were intended to protect against loss of muscle mass and to battle stem cell depletion)
Takeaways
• Genomic data such as gene expression (transcriptome counts) are, among their many applications, becoming a helpful and popular clinical tool
• To obtain transcriptome counts, it is necessary to take the raw data from the sequencers (FASTQ files) and pipeline them through a series of additional steps to assemble them and obtain gene expression
o There are many programs available on different computational platforms that can be used
o The results from the different programs depend on the assumptions they make, such as how they estimate missing data or which underlying probability distribution they assume depending on whether the data come from technical or biological replicates (which can generate bias), among many others
• The wisdom of crowds, or crowdsourcing, is an effective process that has proven to produce accurate results in many fields, based on the simple idea that the collective knowledge of a community is greater than the knowledge of any individual
References
1. The Cancer Genome Atlas (http://cancergenome.nih.gov/ , http://cancergenome.nih.gov/newsevents/newsannouncements/news_12_13_2005)
2. SAMtools (http://samtools.sourceforge.net/)
3. Bowtie: An ultrafast memory-efficient short read aligner (http://bowtie-bio.sourceforge.net/index.shtml)
4. TopHat: A spliced read mapper for RNA-Seq (http://ccb.jhu.edu/software/tophat/index.shtml)
5. RNA-Seq Tutorial 1 (https://www.msi.umn.edu/sites/default/files/RNA-Seq%20Module%201.pdf)
6. UUID (https://wiki.nci.nih.gov/display/TCGA/Universally+Unique+Identifier)
7. RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome (http://www.biomedcentral.com/1471-2105/12/323, http://deweylab.biostat.wisc.edu/rsem/README.html)
8. The hallmarks of cancer: a long non-coding RNA point of view. Gutschner T, Diederichs S (http://www.ncbi.nlm.nih.gov/pubmed/22664915)
9. A survey of best practices for RNA-seq data analysis. Conesa A, et al. (http://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0881-8)
10. Large Scale Comparison of Gene Expression Levels by Microarrays and RNAseq Using TCGA Data. Yan Guo, Quanhu Sheng, 2013
11. RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. Bo Li and Colin N Dewey, 2011
12. Benchmarking RNA-Seq Quantification Tools. Raghu Chandramohan, Po-Yen Wu, 2013
13. http://bioviva-science.com/2016/04/21/first-gene-therapy-successful-against-human-aging/
14. A comprehensive assessment of RNA-seq accuracy, reproducibility and information content by the Sequencing Quality Control Consortium (http://www.nature.com/nbt/journal/v32/n9/full/nbt.2957.html)
15. Wisdom of crowds for robust gene network inference (http://www.nature.com/nmeth/journal/v9/n8/full/nmeth.2016.html)
16. EM Demystified: An Expectation-Maximization Tutorial (https://www.ee.washington.edu/techsite/papers/documents/UWEETR-2010-0002.pdf)