miRNA analysis
Agenda
Friday, August 20th, 2010
9:00 – 10:30    RNA-seq
10:30 – 10:45   Morning Break
10:45 – 12:00   RNA-seq Hands On
12:00 – 1:00    Lunch Break
1:00 – 2:30     ChIP-seq analysis
2:30 – 2:45     Afternoon Break
2:45 – 4:00     ChIP-seq Hands on Demo
Copyright © Partek Inc.
Whole Transcriptome Analysis
With RNA-Seq
Ryan Peters
Field Application Specialist
Who is Partek?
• Founded in 1993
• Based in St. Louis, MO USA
• Focused on Genomics
• Thousands of customers worldwide
• Building tools for both biologists and bioinformaticians
What is Partek Genomics Suite?
• Desktop software - no server
required
• Supports multiple assays
• Supports multiple assay
providers
• Enables Integrated Genomics
• Competitively priced
Partek GS™ for Integrated Genomics
Microarray & Next Generation Sequencing
Genome
• Copy Number
• Total & Allele Specific
• Association
• Loss of Heterozygosity
Transcriptome
• Gene Expression
• Exon/Alternative Splicing
• DGE & mRNA-Seq
Regulation
• ChIP-Chip
• ChIP-Seq
• microRNA
Partek GS™ fits right into
your Next Generation Sequencing Pipeline
Data Acquisition
Powerful Data Analysis and Visualization
System software and alignment tools: ELAND, SHRiMP, Corona, BWA, MAQ, *Bowtie
Publication
RNA-seq, SmallRNA-seq, ChIP-seq, DNA-seq
Partek Recordings
Whole Transcriptome Analysis
• Detect all known and novel RNAs in a transcriptome
• Differential expression of mRNAs
• Identification of alternative splicing events
• Differential expression of non-coding RNAs
• Coding SNP discovery
Import
Multiple Directories Soon
1 million reads/minute
One format at a time
*Quality Score option
Now (.BAM) = Binary .SAM
MAQ/Corona = Color space -> base space
Technical Replicates
Import
.pdata
Import Next Gen Data
• Select columns:
Import Next Gen Data
Chromosome Alias File
chr1    1
chr2    2
chr3    3
chr4    4
gi|224589800|ref|NC_000001.10| Homo sapiens chromosome 1, GRCh37 primary reference assembly    1
gi|224589811|ref|NC_000002.11| Homo sapiens chromosome 2, GRCh37 primary reference assembly    2
gi|224589815|ref|NC_000003.11| Homo sapiens chromosome 3, GRCh37 primary reference assembly    3
gi|224589816|ref|NC_000004.11| Homo sapiens chromosome 4, GRCh37 primary reference assembly    4
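A chromosome alias file maps the names used by an aligner or reference file onto one canonical name per chromosome. The sketch below assumes a hypothetical two-column, tab-separated layout (alias, then canonical name); the actual file layout expected by Partek GS is not stated here, and the names `load_aliases` and `normalize_chromosome` are illustrative only.

```python
def load_aliases(lines):
    """Build an alias -> canonical-name map from tab-separated lines (assumed layout)."""
    aliases = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        # Split on the LAST tab so that aliases may contain spaces or commas.
        alias, canonical = line.rsplit("\t", 1)
        aliases[alias.strip()] = canonical.strip()
    return aliases


def normalize_chromosome(name, aliases):
    """Return the canonical chromosome name, or the input if no alias is known."""
    return aliases.get(name, name)


long_name = ("gi|224589800|ref|NC_000001.10| Homo sapiens chromosome 1, "
             "GRCh37 primary reference assembly")
example_lines = ["chr1\t1", "chr2\t2", long_name + "\t1"]
aliases = load_aliases(example_lines)
print(normalize_chromosome("chr2", aliases))
print(normalize_chromosome(long_name, aliases))
```

Names without an entry pass through unchanged, so reads aligned to an unlisted contig keep their original label.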
Import Next Gen Data
.2bit file
hg18.2bit/hg19.2bit
Append already imported data
Merge .pdata tool
RNA-seq workflow in Partek GS
RNA-seq workflow in Partek GS
mRNA-seq Data Loaded into Partek
• Sample attributes can be easily added to table to group
biological replicates (if available)
Analyze Known Transcripts
•Junction reads
•Paired or single end reads
•Multiple aligned reads
•Strand-specific reads
• Two main steps
1. Assign the reads to isoforms using modified E/M algorithm
(Xing, Y. et al. Nucl. Acids Res. 2006 34:3150-3160)
2. Statistics to calculate p-values for differential expression and alternative splicing
Transcript Level Mapping
Create Own Annotation
/(.gff3)
.pannot
E/M assignment of reads
1. Assume all isoforms are in equal abundance
2. Distribute exon reads between isoforms based on
abundance
3. Recalculate isoform abundance based on read counts
4. Stop if isoform abundance is constant; otherwise, return to step 2
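The four steps above can be sketched as a small loop. This is a minimal illustration, not Partek's modified E/M algorithm: the isoform names (`orange`, `green`), the exon names, and the exon lengths (2 kb, 1 kb, 1 kb) are assumptions made for the example, and step 3 is assumed to normalize read counts by isoform length (reads per kilobase).

```python
def em_isoform_abundance(exon_reads, exon_len_kb, isoforms, tol=1e-12, max_iter=10000):
    """Toy E/M: split exon reads between isoforms until abundances stop changing."""
    names = list(isoforms)
    iso_len = {n: sum(exon_len_kb[e] for e in isoforms[n]) for n in names}
    # Step 1: assume all isoforms are in equal abundance.
    abundance = {n: 1.0 / len(names) for n in names}
    for _ in range(max_iter):
        # Step 2: distribute each exon's reads between the isoforms that
        # contain it, in proportion to the current abundance.
        assigned = {n: 0.0 for n in names}
        for exon, reads in exon_reads.items():
            owners = [n for n in names if exon in isoforms[n]]
            total = sum(abundance[n] for n in owners)
            if total == 0:
                continue  # no isoform with support contains this exon
            for n in owners:
                assigned[n] += reads * abundance[n] / total
        # Step 3: recalculate abundance from the assigned reads
        # (assumption: reads per kilobase, rescaled to sum to 1).
        rpk = {n: assigned[n] / iso_len[n] for n in names}
        norm = sum(rpk.values())
        updated = {n: rpk[n] / norm for n in names}
        # Step 4: stop when the abundances are constant, else repeat.
        change = max(abs(updated[n] - abundance[n]) for n in names)
        abundance = updated
        if change < tol:
            break
    return abundance, assigned, rpk


exon_reads = {"e1": 16, "e2": 6, "e3": 8}          # raw reads per exon
exon_len_kb = {"e1": 2.0, "e2": 1.0, "e3": 1.0}    # assumed exon lengths
isoforms = {"orange": {"e1", "e2", "e3"}, "green": {"e1", "e3"}}
abundance, assigned, rpk = em_isoform_abundance(exon_reads, exon_len_kb, isoforms)
print(abundance, rpk)
```

Under these assumed lengths the loop settles at roughly 75% / 25% relative abundance, which corresponds to about 6 and 2 reads per kilobase for the two isoforms.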
1st step E/M algorithm
Figure: isoforms, raw reads per exon (16, 6, 8); relative isoform abundance 50% / 50%; exon read distribution (8, 8, 4, 6, 4)
2nd step E/M algorithm
Figure: isoforms, raw reads per exon (16, 6, 8); relative isoform abundance 40% / 60%; exon read distribution (8, 8, 4, 6, 4)
E/M algorithm
Figure: isoforms, raw reads per exon (16, 6, 8), exon read distribution, and relative isoform abundance
Completed E/M
Figure: isoforms, raw reads per exon (16, 6, 8); relative isoform abundance 25% / 75%; exon read distribution (4, 12, 2, 6, 6)
Reads per kilobase for each isoform: Orange: 6, Green: 2
Help > Online Tutorials > White Paper > RNA-Seq
Caveats
• This is actually done across sets of overlapping transcripts; different
genes sometimes share reads
• e.g., Genes on different strands on assays which are not strand specific
• This requires known isoforms
• novel splicing cannot be quantified
• Genes with few reads are not estimated as accurately
• Simulation data in Xing, Y. et al. Nucl. Acids Res. 2006 34:3150-3160 show increased coverage leading to increased accuracy
Read Summary
Transcript- & Gene-focused Data Views
A Transcript-focused Data View
• Organized by NCBI mRNA identifiers (e.g., NM_080702)
• Probability of differential transcript expression across groups
• Probability of alternative splicing within a gene
• Both Raw & Normalized read counts per sample
• Log Likelihood test
Transcript level
Gene level
PCA and ANOVA
*Biological Replicates
RNA-Seq Reads Distribution
RPKM: Reads Per Kilobase of exon length per Million mapped reads
RPKM = (16.9316 / 1.766) x (1,000,000 / 11,282,682) = 0.849757
Other Transformations?
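The RPKM arithmetic on this slide can be checked with a few lines. The sketch assumes that 16.9316 is the (fractional, E/M-assigned) read count, 1.766 is the exon length in kilobases, and 11,282,682 is the total number of mapped reads; the function name `rpkm` is illustrative.

```python
def rpkm(reads, exon_length_kb, total_mapped_reads):
    """Reads per kilobase of exon length per million mapped reads."""
    reads_per_kb = reads / exon_length_kb
    per_million = 1_000_000 / total_mapped_reads
    return reads_per_kb * per_million


value = rpkm(16.9316, 1.766, 11_282_682)
print(round(value, 6))
```

Dividing by length and by library size makes values comparable between transcripts of different lengths and samples of different sequencing depth.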
New Workflow Features
• Exon Level Mapping
• Alternative Splicing
Alternative Splicing
*Biological Replicates
Soon to come - All in one
• Read Summary
• Transcript Results
• Transcript Focused View
• Gene Focused View
• Exon Focused View
• Exon Results
Message to describe each spreadsheet and appropriate analysis
Integration Between Data Tables & Visualization
Choice 1: From the workflow
Choice 2: Row Header
Visualization of Differential Expression & Alternative
Splicing
Strand-Specific Visualization
Separate Forward and Reverse Reads
Creating Lists of Affected Transcripts
Biological Interpretation:
Monitor Biological Trends with GO Enrichment
Biological Interpretation:
Up-/Down-regulation of Biological Processes
Differential Expression of Non-coding RNAs
snoRNA, siRNA, miRNA, long non-coding RNA…
noncode.org; mirbase.org; convert using ‘Manage Available Annotations’
Differential Expression of Non-coding RNAs
SNP Discovery
• .2bit file
• Computationally intense
• Log Odds ratio > 5.0
Variations Against Reference
• The position is 10,000 times more likely to be different from the reference than to be the same as the reference
Variations Across Samples
• At least one sample having a different genotype call is 10,000 times more likely than all the samples having the same genotype call
Overlap with known genes/features
Compare Detected cSNPs Against Known SNPs
SNP Proportion – Detected SNPs
SNP Proportion
Unexplained Peaks
• Find locations on the genome that have reads mapped to them but are not in the chosen annotation (e.g., RefSeq)
• Find potential novel transcripts and exons
• Set a threshold for the minimum number of reads
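The idea behind unexplained peaks can be sketched as: discard reads that fall inside annotated intervals, then report fixed-size windows whose remaining read count meets the threshold. This is a simplified illustration, not the method used in Partek GS; the window size, the half-open interval convention, and the name `unexplained_windows` are assumptions.

```python
from collections import Counter


def unexplained_windows(read_positions, annotated, window=100, min_reads=3):
    """Return (start, end, count) for windows with enough unannotated reads."""
    def is_annotated(pos):
        # Intervals are half-open: start <= pos < end.
        return any(start <= pos < end for start, end in annotated)

    counts = Counter(
        pos // window for pos in read_positions if not is_annotated(pos)
    )
    return sorted(
        (w * window, (w + 1) * window, n)
        for w, n in counts.items()
        if n >= min_reads
    )


reads = [10, 20, 30, 150, 510, 520, 530, 540]   # toy read start positions
annotation = [(0, 100)]                         # toy annotated region
peaks = unexplained_windows(reads, annotation, window=100, min_reads=3)
print(peaks)
```

Raising `min_reads` trades sensitivity for fewer spurious regions, which is the role of the threshold mentioned above.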
Unexplained Peaks Result
Discover Novel Exons & Transcripts
Novel Exon in Intronic Region
Allele Specific Expression
Use Analysis of Variance to study allele specific expression based on the
interaction of allele (A, T, G, C) counts and sample groups.
Integrated Genomics
A few examples
RNA-seq data and Exon array data
Integration of ChIP-seq & RNA-Seq data
Combine Next-Generation ChIP-Seq and RNA-Seq data into one view.
RNA-Seq Hands On Data
4 Samples
• Brain
• Skeletal Muscle
• Liver
• Heart
• Illumina Genome Analyzer
• Aligned Using ELAND aligner allowing for up to 2
mismatches
• Tutorial & Data
Help > Online Tutorials > ‘Next Generation Sequencing’ tab
ChIP-Seq Analysis in Partek Genomics Suite
What is ChIP-Seq?
• ChIP – Seq = Chromatin Immunoprecipitation Sequencing
• The sequencing of genomic DNA fragments that coprecipitate with a DNA-binding protein under study
• ‘Unbiased’ – doesn’t rely on prior knowledge of precise DNA binding sites (as ChIP-chip does)
• Results
• The DNA sequence motif that is recognized by the
binding protein
• The regulatory sites for any transcription factor
• Direct downstream targets of any transcription factor
Transcription Factors
• DNA binding proteins that attach themselves to the
genome with an affinity for a specific DNA sequence
• Function: Bind to specific sites in the genome, recruit
cofactors, and regulate transcription
• ChIP-seq – identify transcription factor binding sites across entire genomes
Summary of ChIP Seq Assay
1. Collect and fractionate DNA
2-3. Enrich binding sites using IP
3b. PCR (not shown)
4. Sequence short reads
5. Align and detect peaks
Photos: U.S. Department of Energy Genome Programs
ChIP-Seq Flow Chart
Sequence Reads
Align Reads to Reference Genome
GAGGTTGCAGTTTG     chr1     243919543    R
ACTGCTCCGCCTCA     chr16    49094914     F
GAATAAAAAATCCA     chr13    55882620     F
CGTCCTTCACCCTCT    chr13    110085165    R
CCTTAAGGAAAGGA     chr18    72273046     F
CAGCTAGGGTTGCC     chr2     120786940    R
CTGCTGGTGCTGCG     chr10    73237323     F
Import
Detect peaks
Detect motifs
ChIP-Seq Workflow in Partek
Sample Data Set
The study mapped the binding sites of the NRSF transcription factor across the entire genome.
Two samples: an NRSF-enriched ChIP sample (chip.txt) and a control sample (mock.txt) of DNA immunoprecipitated by a non-specific control antibody.
Johnson et al., Genome-Wide Mapping of in Vivo Protein-DNA Interactions. Science, Vol. 316 (2007).
Other experimental settings are also supported by Partek:
• Multiple samples
• Technical replicates
• Biological replicates
Goals
• Import ChIP-seq data
• Calculate average fragment length of the IP samples
• Detect and Visualize enriched regions in the genome
• Discover motif binding sites
• Annotate enriched regions with overlapping genes
• Look for enriched functional groups using GO
Enrichment
Import
1 million reads/minute
Multiple Directories Soon
One format at a time
*Quality Score
Now (.BAM)
Import
hg18.2bit/hg19.2bit
Imported ChIP-Seq Data
QA/QC--Fragment Length Analysis
• Single end reads –
phase shift between the
forward and reverse
reads
• Maximum
• Only on IP samples
• Paired-end reads –
distribution of fragment
lengths between paired
end fragments
PCR Artefacts/Alignment Bias
Cross Correlation Fragment Length Estimation
Figure: forward (F) and reverse (R) read counts on either side of a probable binding site
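Cross-correlation estimates the offset between the forward-strand and reverse-strand read pile-ups by sliding one profile against the other and keeping the shift with the highest overlap score. The sketch below uses toy per-position read counts and a plain dot product; the function name `best_shift` and the data are illustrative, and how the shift is converted into a fragment length (for example, whether read length is added) is left out as an assumption-dependent detail.

```python
def best_shift(forward, reverse, max_shift):
    """Return the shift d that maximizes sum(forward[i] * reverse[i + d])."""
    best_d, best_score = 0, -1
    for d in range(max_shift + 1):
        score = sum(
            f * reverse[i + d]
            for i, f in enumerate(forward)
            if i + d < len(reverse)
        )
        if score > best_score:
            best_d, best_score = d, score
    return best_d


# Toy coverage: a forward-strand pile-up upstream of a reverse-strand pile-up.
forward = [0] * 40
reverse = [0] * 40
forward[5:10] = [3, 2, 1, 2, 2]
reverse[20:25] = [3, 3, 2, 1, 2]

shift = best_shift(forward, reverse, max_shift=30)
print(shift)
```

With real data the profiles would be built per chromosome from read start positions, and the scores across shifts are usually inspected as a curve rather than reduced to a single maximum.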
Detect Peaks
• Set average fragment length (read extension length)
• Window size
• Merge (Methyl., Histone)
• FDR cutoff
• Reference sample (needed for SFC & binomial p-value)
Peak Detection Rate ~ 1 minute / 4 million reads
Peak Detection
Forward Read
Reverse Read
Midpoint
100bp*
1) Extend reads by the estimated fragment length (single-end reads)
2) Find midpoints
3) Divide the genome into windows of the estimated fragment length or 100 bp
4) Count the number of reads in each window
5) Fit to a ZTNB (zero-truncated negative binomial) distribution
Figure labels: Single End Reads, Paired End Reads, Chromosome
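Steps 1 to 4 for single-end reads can be sketched as follows. Reads are represented here as a hypothetical `(five_prime_position, strand)` tuple; the fragment is assumed to extend downstream from a forward read and upstream from a reverse read, and the distribution fit in step 5 is not shown.

```python
from collections import Counter


def window_counts(reads, fragment_length, window_size):
    """Count fragment midpoints per fixed-size window (window index -> count)."""
    counts = Counter()
    half = fragment_length // 2
    for five_prime, strand in reads:
        # Steps 1-2: extend the read to the fragment length and take its midpoint.
        midpoint = five_prime + half if strand == "F" else five_prime - half
        # Steps 3-4: assign the midpoint to a window and count it.
        counts[midpoint // window_size] += 1
    return counts


toy_reads = [(100, "F"), (120, "F"), (240, "R"), (260, "R"), (900, "F")]
counts = window_counts(toy_reads, fragment_length=100, window_size=100)
print(sorted(counts.items()))
```

Forward and reverse reads from the same binding event end up with nearby midpoints, so they reinforce the same window instead of forming two offset pile-ups.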
Peak Detection – Read Distribution
Figure: histogram of # Windows (y-axis) against # of Midpoints per window (x-axis: 0, 1, 2, 3, 4, 5, 6, 7, …)
Detect Peaks Results
Peaks are detected in each sample separately and reported one peak at a time
Mann-Whitney – Separation of forward and reverse reads
Lower p-value = greater separation
Scaled Fold Change
Scaled Fold Change(ChIP vs. Mock) = (1+ChIP)/(1+Alpha*Mock)
Alpha: Scaling Factor
Figure: scatter plot of ChIP Sample (y-axis) against Mock (Reference Sample) (x-axis), with Best Fit Line Slope (x2)
Higher Scaled Fold Change = more enriched
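The scaled fold change formula translates directly into code. In this sketch `chip` and `mock` are read counts in the same region and `alpha` is the scaling factor; the value 2.0 used in the example is an assumption chosen for illustration.

```python
def scaled_fold_change(chip, mock, alpha):
    """(1 + ChIP) / (1 + Alpha * Mock); the +1 terms avoid division by zero."""
    return (1 + chip) / (1 + alpha * mock)


print(scaled_fold_change(chip=9, mock=2, alpha=2.0))
print(scaled_fold_change(chip=0, mock=0, alpha=2.0))
```

A region with no reads in either sample scores exactly 1, so values well above 1 indicate enrichment in the ChIP sample relative to the scaled reference.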
ChIP-seq considerations
• Peaks are detected on a per sample basis
• Control samples are not required, but encouraged
• # reads for each sample don’t have to match
• Antibody selection
• Must have specificity for the protein
• Must be able to immunoprecipitate with target protein; even if
they do, they may not do well with ChIP-seq
• Sequencing – platform dependent bias, error rates
• Algorithm – short tags ambiguous in repeat regions, account
for sequence errors
Detected peaks
Chromosome Browser
Help > Online Tutorials > ‘Chromosome Viewer User Guide’
Create a List of Enriched Regions
• Regions of DNA which have many reads
mapped to them
• They will occur only in our protein
bound sample
Detecting motifs—Discover de novo motifs
Height = binding importance; how well a
base is preserved
Gibbs Motif Sampler
Search for instances of Motif in Sequences
Create new Motif out of discovered instances
1. Randomly choose instances
Region 1: …….ACGTCGTACGTACGACCCTGGAGCCTGA….
Region 2: …….AATGCCCCTGGATTTACGTTGGACGTAGA…..
Region 3: …….TCCGACCCCTGGAGGGGACCCTAGACGA…..
CGTCGT
GACGTA
GGAGGG
2. Create Count Matrix
Region 1: …….ACGTCGTACGTACGACCCTGGAGCCTGA….
Region 2: …….AATGCCCCTGGATTTACGTTGGACGTAGA…..
Region 3: …….TCCGACCCCTGGAGGGGACCCTAGACGA…..
CGTCGT
GACGTA
GGAGGG
A  0 1 1 0 0 1
C  1 0 1 1 0 0
G  2 2 0 2 2 1
T  0 0 1 0 1 1
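The count matrix records, for each position of the motif, how many of the chosen instances carry each base. A minimal sketch using the three instances from this step (the function name `count_matrix` is illustrative):

```python
def count_matrix(instances):
    """Per-position base counts for equal-length motif instances."""
    width = len(instances[0])
    counts = {base: [0] * width for base in "ACGT"}
    for instance in instances:
        for position, base in enumerate(instance):
            counts[base][position] += 1
    return counts


matrix = count_matrix(["CGTCGT", "GACGTA", "GGAGGG"])
for base in "ACGT":
    print(base, *matrix[base])
```

Every column sums to the number of instances, which is a quick consistency check on any count matrix.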
3. Find Motif Instances
Region 1: …….ACGTCGTACGTACGACCCTGGAGCCTGA….
Region 2: …….AATGCCCCTGGATTTACGTTGGACGTAGA…..
Region 3: …….TCCGACCCCTGGAGGGGACCCTAGACGA…..
GACCCT
GGATTT
GGAGGG
A  0 1 1 0 0 1
C  1 0 1 1 0 0
G  2 2 0 2 2 1
T  0 0 1 0 1 1
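Finding motif instances means scoring every window of a region against the current count matrix. The sketch below is a deterministic simplification: it multiplies pseudocount-adjusted counts and keeps the top-scoring window, whereas a Gibbs sampler would draw a window at random in proportion to such scores. The pseudocount of 1, the example matrix, and the name `best_instance` are assumptions.

```python
def best_instance(sequence, counts, pseudocount=1):
    """Return the window of `sequence` that scores highest under `counts`."""
    width = len(counts["A"])
    best, best_score = None, -1.0
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = 1.0
        for position, base in enumerate(window):
            # The pseudocount keeps unseen bases from zeroing the score.
            score *= counts[base][position] + pseudocount
        if score > best_score:
            best, best_score = window, score
    return best


# Hypothetical count matrix built from three copies of GGACCT.
consensus_counts = {
    "A": [0, 0, 3, 0, 0, 0],
    "C": [0, 0, 0, 3, 3, 0],
    "G": [3, 3, 0, 0, 0, 0],
    "T": [0, 0, 0, 0, 0, 3],
}
hit = best_instance("TTGGACCTAA", consensus_counts)
print(hit)
```

Sampling instead of always taking the maximum is what lets the Gibbs approach escape a poor random start.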
4. Update Count Matrix
Region 1: …….ACGTCGTACGTACGACCCTGGAGCCTGA….
Region 2: …….AATGCCCCTGGATTTACGTTGGACGTAGA…..
Region 3: …….TCCGACCCCTGGAGGGGACCCTAGACGA…..
GACCCT
GGATTT
GGAGGG
A  0 1 2 0 0 0
C  0 0 1 1 1 0
G  3 2 0 1 1 1
T  0 0 0 1 1 2
5. Repeat Until Convergence
Region 1: …….ACGTCGTACGTACGACCCTGGAGCCTGA….
Region 2: …….AATGCCCCTGGATTTACGTTGGACGTAGA…..
Region 3: …….TCCGACCCCTGGAGGGGACCCTAGACGA…..
GGACCT
GGACCT
GGACCT
A  0 0 3 0 0 0
C  0 0 0 3 3 0
G  3 3 0 0 0 0
T  0 0 0 0 0 3
Motif Instances
LR = likelihood ratio (probability of the sequence under the motif vs. under the background distribution)
Detect Known Motifs
IUPAC nucleotide code    Base
A                        Adenine
C                        Cytosine
G                        Guanine
T (or U)                 Thymine (or Uracil)
R                        A or G
Y                        C or T
S                        G or C
W                        A or T
K                        G or T
…                        …
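A known motif written in IUPAC codes can be searched for by expanding each code into a character class. The sketch below uses Python's standard `re` module and the full standard IUPAC table (including the codes elided above); the function names and the example motif `GGAY` are illustrative, and this exact-match search is not necessarily how Partek GS scores known motifs.

```python
import re

# IUPAC nucleotide codes expanded to DNA character classes (U is treated as T).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[GC]", "W": "[AT]", "K": "[GT]",
    "M": "[AC]", "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]",
    "N": "[ACGT]",
}


def iupac_to_regex(motif):
    """Translate an IUPAC motif string into a regular expression."""
    return "".join(IUPAC[code] for code in motif.upper())


def find_motif(motif, sequence):
    """Return (start, match) for every occurrence, including overlapping ones."""
    # A lookahead makes each match zero-width, so overlapping hits are reported.
    pattern = re.compile("(?=(" + iupac_to_regex(motif) + "))")
    return [(m.start(), m.group(1)) for m in pattern.finditer(sequence)]


hits = find_motif("GGAY", "TTGGACCTGGATG")
print(hits)
```

To cover both strands, the same search can be repeated on the reverse complement of the sequence.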
Detecting Motifs -- Find Known Motif
REST – another name for NRSF
The probability of occurrence can be thought of as follows:
(1) Shuffle the bases in all the sequences.
(2) Count the number of locations in the shuffled sequences that score above your threshold.
JASPAR Spreadsheet
Overlap with Databases
• Databases of ChIP binding available from UCSC such as ORegAnno
• Databases of known genes such as RefFlat
• Genes which overlap with peaks, nearby to motif instance
Find Overlapping Genes
PAZAR (soon): public database of regulatory sequences and transcription factors
Overlapping Genes
Biological Interpretation
(Overlapping genes in a category / All overlapping genes)
vs
(All genes in the category / All genes on genome)
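The comparison above can be made concrete with two proportions and a tail probability. The sketch computes the ratio of the two fractions and, as one common choice, a hypergeometric p-value for seeing at least that many overlapping genes in the category by chance; the test actually used by GO Enrichment in Partek GS is not stated here, and the gene counts are made up for illustration.

```python
from math import comb


def enrichment(in_category_overlap, all_overlap, in_category, all_genes):
    """Return (ratio of proportions, hypergeometric upper-tail p-value)."""
    k, n, K, N = in_category_overlap, all_overlap, in_category, all_genes
    ratio = (k / n) / (K / N)
    # P(X >= k) when n genes are drawn from N, of which K are in the category.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return ratio, tail / comb(N, n)


ratio, p_value = enrichment(
    in_category_overlap=5, all_overlap=20, in_category=10, all_genes=100
)
print(ratio, p_value)
```

A ratio above 1 means the category is over-represented among the overlapping genes; the p-value indicates how surprising that is given the category size.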
GO Browser: NRSF ChIP-Seq analysis
Hands on ChIP-seq Data
Study mapped the genomic binding sites of the NRSF
transcription factor across the entire genome
Two samples:
1. NRSF-enriched ChIP sample (chip.txt)
2. control sample without immuno-enrichment
(mock.txt)
Johnson et al., Genome-Wide Mapping of in Vivo Protein-DNA Interactions. Science, Vol. 316 (2007).
Data and Tutorial Available for download: Help > Online
Tutorials
Sneak Preview
Alignment/Import & Analysis
Aligner
Desktop/ Laptop
Sneak preview
Histone Modification
Methylation Workflow – current (Affy-tiling;Illumina-GX)
Sneak preview
Partek® Genomics Suite ™
• Powerful Statistics with Interactive Visualization
• Fast*, Memory-efficient
• Easy to Use
• Enables integration of analysis, even between vendors
• Integrated Genomics can enhance your research pipeline
• Integrated with Public Genomic Resources: NCBI GEO, UCSC, Ensembl, Gene Ontology, KEGG…
Get your FREE trial today!
Email www.partek.com