Slides - Computational Bioscience Program

PHANG LAB TALK
Tzu L Phang Ph.D.
Assistant Professor
Department of Medicine
Division of Pulmonary Sciences & Critical Care Medicine
What I do:
• Perform high-throughput data analysis for the scientific community;
microarray and Next Generation Sequencing datasets
• Provide analysis solutions for expert and novice users alike
• Develop multi-media approaches to disseminate translational
science education
• Study the role of long non-coding RNA (second talk)
• Establish the Bioinformatics Consultation and Analysis Core to
help researchers and scientists design, analyze and interpret their
experiments.
Today’s Talk Layout
• The center of my universe:
– R and Bioconductor
• Collaboration with Biologists
• 5x5; simple way to teach and contribute
• Next Generation Sequencing (NGS)
R
r-project.org
R is hot
http://blog.revolutionanalytics.com/r-is-hot/
R in the media
Bioconductor
• www.bioconductor.org
• Statistical tools in R for high-throughput data analysis
• 6-month release cycle. Release 2.10 has 554 software
packages (45 new)
• Analysis workflow
– Oligonucleotide Arrays
– Sequence Analysis
– Variants
– Accessing Annotation Data
– High-throughput Assays
The Website
www.bioconductor.org
Categories
Categories (cont.)
• Typical Analysis Routine
R is easy
Result output
Other Resources
http://www.rseek.org/
http://crantastic.org/
http://www.statmethods.net/
http://stackoverflow.com/
Today’s Talk Layout
• The center of my universe:
– R and Bioconductor
• Collaboration with Biologists
• 5x5; simple way to teach and contribute
• Next Generation Sequencing (NGS)
Collaboration
• >1000 microarray chips / year
• Affymetrix & Illumina platforms
• Next Generation Sequencing: 25 free pilot
projects
• Serve the Rocky Mountain region scientific
community
Collaboration - tips
• Don’t be a data analyst – be a co-investigator
• Suggest analysis approaches that are not
obvious
• Focus on the result, not method
• Always look for grant-writing opportunities
• Understand the technical & biological system
as thoroughly as possible – you will be
surprised by what biologists miss informatically
Example 1: Classification of Pituitary
Tumors
• Pituitary tumors are the most common type of brain tumor, found
in 20% at autopsy and in 1/10,000 persons clinically. Based
upon 2010 figures of a veteran population of 22.7 million,
this translates into >225,000 veterans with pituitary
tumors.
• Currently no medical therapies exist for these tumors and
surgical resection is the treatment of choice. Recurrence
rates approach 40%.
• Understanding of the pathways to tumorigenesis and
markers of aggressiveness and risk of recurrence would
alter the intensity and cost of clinical care and may provide
novel candidates and pathways to explore for new
treatment options for these patients
Principal Component Analysis
Potential markers
Outputs
Example 2: Explore the artistic side!
Example 3: Unconventional Usage
Introduction
• Crohn’s Disease (CD) is an Inflammatory Bowel
Disease (IBD) that affects up to one million
Americans (ages 15 to 30).
• Discordance between monozygotic twins affected
by CD provides evidence for an epigenetic role in
the etiology of the disease.
• We combined 2 microarray technologies to study
these roles
– CHARM array (Comprehensive High-throughput Array
for Relative Methylation)
– Gene Expression (Affymetrix Gene 1.0 ST)
Research Informatics Integrated
Core (RIIC)
Michael G. Kahn MD, PhD
CCTSI Co-Director & RIIC Core Director
RIIC Organizational Model
Michael Kahn
Thomas Yaeger
Jessica Bondy (Cancer Center Informatics Core Director)
REDCap, REDCap Survey
Third Thursday @ Three Thirty Three Informatics Seminar Series
Data Management Best Practices
Secondary database and analysis service
Web site
Portal applications
Virtual server farm
Research LIS implementation
Desktop support
Michael Kahn
Tzu Phang
Steve Ross
Community Engagement
Informatics 5x5s
Video Tutorials
Bioinformatics Tools Tutorials
Liaison
http://cctsi.ucdenver.edu/RIIC/Pages/ConsultationDataAnalysis.aspx
5X5
http://cctsi.ucdenver.edu/5x5
Demonstration
http://gcrc.ucdenver.edu/Videos/Informatics/5x5/SocialNetworking5x5.wmv
Tools
Podcast
TIES – Translational Informatics
Education Support
• Bridging the gap in translational research
through education
• Train biologists in informatics
• Enhance collaboration through education and
knowledge exchange
• Raise awareness of the latest technical advances
• Disseminate knowledge through innovation
Next Generation Sequencing
The future is here ….
High Throughput Parallel Sequencing
• http://www.youtube.com/watch?v=77r5p8IBwJk
Paradigm Shift
• Standard “Sanger” sequencing
– 96 samples/day
– Read length ~650 bp
– Total = 450,000 bases of sequence data
• 454 – the game changer!
– ~400,000 different templates (reads)/day
– Read length ~250 bp (at that time)
– Total = 100,000,000 bases of sequence data
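The scale of the shift can be checked with a few lines of arithmetic. This is a minimal sketch using only the figures stated on the slide; the variable and function names are illustrative, and the Sanger daily total is taken as stated rather than recomputed.

```python
# Figures copied from the slide; names are illustrative only.
SANGER_TOTAL_BASES_PER_DAY = 450_000   # stated daily total for Sanger
READS_454_PER_DAY = 400_000            # ~400,000 templates (reads) per day
READ_LENGTH_454_BP = 250               # ~250 bp read length at that time


def bases_per_day(reads: int, read_length_bp: int) -> int:
    """Total bases produced per day = number of reads x read length."""
    return reads * read_length_bp


total_454 = bases_per_day(READS_454_PER_DAY, READ_LENGTH_454_BP)
fold_increase = total_454 / SANGER_TOTAL_BASES_PER_DAY

print(f"454 bases/day: {total_454:,}")
print(f"Fold increase over the stated Sanger total: ~{fold_increase:.0f}x")
```

On these numbers, one day of 454 output corresponds to roughly two hundred days of the stated Sanger output.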
The second generation
Roche (454) http://454.com/
– First on the market
– Emulsion PCR and pyrosequencing
Illumina (Solexa) http://www.illumina.com/
– Second on the market
– Bridge PCR and polymerase based SBS
ABI (SOLiD) http://solid.appliedbiosystems.com/
– Third on the market
– Emulsion PCR and ligase based sequencing
Single molecule sequencing
Helicos Biosciences
http://helicosbio.com
true Single Molecule Sequencing technology
Pacific Biosciences
http://www.pacificbiosciences.com
Single Molecule Real Time sequencing
Portable Sequencer
• Ion Torrent
Others
Polonator http://www.polonator.org
Emulsion PCR and ligase based sequencing
Used in the Personal Genome Project
Open platform, open source
Cheap/affordable
Complete Genomics http://www.completegenomics.com
Specializing in human genome sequencing
Type of read data
• Base Space or Color Space
• Paired end or single end
• Stranded or Unstranded
Short Reads
• Short reads from NGS are challenging (Solexa
~36 bp, now HiSeq 100 bp single pass)
– Very hard to assemble whole genome
– Especially on repeat regions
• Requires many fold coverage
• New and faster algorithms needed for many traditional
bioinformatics operations
• Reads are getting longer – another moving
target. (2x250)
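"Many fold coverage" can be made concrete with the usual approximation: coverage = number of reads x read length / genome size. The sketch below assumes uniformly distributed reads and uses a hypothetical 5 Mb genome; the function names are illustrative and do not come from any particular package.

```python
import math


def fold_coverage(num_reads: int, read_length_bp: int, genome_size_bp: int) -> float:
    """Average depth: total sequenced bases divided by genome size."""
    return num_reads * read_length_bp / genome_size_bp


def reads_needed(target_coverage: float, read_length_bp: int, genome_size_bp: int) -> int:
    """Smallest number of reads that reaches the target average depth."""
    return math.ceil(target_coverage * genome_size_bp / read_length_bp)


# Hypothetical 5 Mb genome sequenced to 30x, comparing the two read
# lengths mentioned on the slide (36 bp and 100 bp).
GENOME_SIZE_BP = 5_000_000
for read_length in (36, 100):
    n = reads_needed(30, read_length, GENOME_SIZE_BP)
    print(f"{read_length} bp reads: {n:,} reads for 30x")
```

Shorter reads need proportionally more reads for the same depth, and depth alone does not resolve repeats longer than the read.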
Applications
• An explosion of scientific innovation!!
• New usages not directly foreseen by the
original developers of the technology
• Some envision the beginning of the next
revolution, as with PCR: an NGS machine in
every lab!!
• Cheap high-volume sequencing means revisiting
data collection and management systems
RNA Sequencing
• “Digital Gene Expression” or “RNA-Seq”
• Truly accurate gene expression measurements
– Can replace gene expression microarrays
• 25% more sensitive
• Does not rely on hybridization (no %GC bias, no cross-hybridization between related genes)
• Discover novel genes (and other kinds of RNA molecules)
– one experiment found that 34% of human transcripts were
not from known genes
• Sultan et al, Science. 2008 Aug 15;321(5891):956-60.
Why is RNA-Seq better than microarray?
• No predefined gene annotation needed, which makes
discovering novel transcripts possible
• Low, if any, background
• Large dynamic range of expression levels, no
upper limit for quantification
• Reveals sequence variation, such as SNPs, in the
transcript region
• Helicos single molecule sequencing has no
PCR step, which removes amplification bias
More information from RNA
• Can capture true alternative splicing
information
– Sequence of splice-junctions
• One study found 4,096 previously unknown splice
junctions in 3,106 human genes
– Different transcription start and end points for
RNA molecules
• Allelic variation (SNPs)
• Small RNAs
Bottleneck: Data Analysis
Informatics is the Bottleneck
• Scientists are currently able to generate
sequence data much faster/more easily than
they are able to analyze it
• Customized analysis / Bioinformatics
consulting is needed for every project
Bioinformatics Challenges
• Need for large amount of CPU power
– Informatics groups must manage compute clusters
– Challenges in parallelizing existing software or redesign of
algorithms to work in a parallel environment
– Another level of software complexity and challenges to
interoperability
• VERY large text files (millions of lines long)
– Can’t do ‘business as usual’ with familiar tools such as
Microsoft Excel.
– Impossible memory usage and execution time
• Sequence Quality filtering
Auer P. Statistical design and analysis of RNA sequencing data. Genetics. 2010.
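One common way around the large-file and quality-filtering problems above is to stream the file one record at a time instead of loading it whole. The following is a minimal sketch, not a production parser: it assumes four-line FASTQ records (no wrapped sequences) and Phred+33 quality encoding (some older Illumina files use an offset of 64), and the function names and the threshold of 20 are illustrative.

```python
import io
from typing import Iterator, TextIO, Tuple

Record = Tuple[str, str, str]  # (read name, sequence, quality string)


def read_fastq(handle: TextIO) -> Iterator[Record]:
    """Yield one record at a time so memory use stays constant."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        yield header[1:], seq, qual  # drop the leading '@'


def mean_quality(qual: str, offset: int = 33) -> float:
    """Mean Phred score of a quality string (offset is an assumption)."""
    return sum(ord(c) - offset for c in qual) / len(qual)


def filter_reads(handle: TextIO, min_mean_quality: float = 20.0,
                 offset: int = 33) -> Iterator[Record]:
    """Keep only reads whose mean quality meets the threshold."""
    for name, seq, qual in read_fastq(handle):
        if qual and mean_quality(qual, offset) >= min_mean_quality:
            yield name, seq, qual


# Tiny in-memory example; a real run would pass open("reads.fastq").
demo = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nACGT\n+\n!!!!\n")
kept = [name for name, _, _ in filter_reads(demo)]
print(kept)
```

Because both functions are generators, the same code works unchanged on a multi-gigabyte file.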
Data formats
• Images
• “raw” basecalls with quality scores
• Sequence reads aligned to reference genomes
• Assemblies (contigs)
• Variants (SNPs, indels, copy number variants)
Hexadecimal mode
Decimal mode
Raw
FASTQ
Example
SAM format
Pileup format
QNAME
FLAG
RNAME
POS
MAPQ
CIGAR
MRNM
MPOS
ISIZE
SEQ
QUAL
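A SAM alignment line is tab-separated, with eleven mandatory fields in the order QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, MRNM, MPOS, ISIZE, SEQ, QUAL (newer versions of the specification rename the three mate fields to RNEXT, PNEXT and TLEN). The sketch below splits one line into those fields and decodes the FLAG bit field; the example line is made up, and the short flag names are illustrative labels rather than official ones.

```python
from typing import Dict, List

SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "MRNM", "MPOS", "ISIZE", "SEQ", "QUAL"]
INT_FIELDS = ("FLAG", "POS", "MAPQ", "MPOS", "ISIZE")

# FLAG is a bit field; each bit answers one yes/no question.
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
}


def parse_sam_line(line: str) -> Dict[str, object]:
    """Split one alignment line into named fields; extra columns go to TAGS."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < len(SAM_FIELDS):
        raise ValueError("expected at least 11 tab-separated fields")
    record: Dict[str, object] = dict(zip(SAM_FIELDS, cols))
    for key in INT_FIELDS:
        record[key] = int(cols[SAM_FIELDS.index(key)])
    record["TAGS"] = cols[len(SAM_FIELDS):]
    return record


def decode_flag(flag: int) -> List[str]:
    """List the names of all bits set in FLAG."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


# Made-up example line: a 36 bp read, first of a properly paired mate pair.
DEMO_LINE = "\t".join(["read1", "99", "chr1", "100", "60", "36M",
                       "=", "250", "186", "A" * 36, "I" * 36])
record = parse_sam_line(DEMO_LINE)
print(record["RNAME"], record["POS"], hex(record["FLAG"]),
      decode_flag(record["FLAG"]))
```

The same FLAG value can be written in decimal (99) or hexadecimal (0x63); the bits set are identical either way.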
CIGAR
• M : match/mismatch
• I : Insertion compared with reference
• D : Deletion compared with reference
• N : Skipped bases on reference
• S : soft clipping (unaligned)
• H : hard clipping
• P : padding
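A CIGAR string is a sequence of length/operation pairs using the letters above. As a minimal sketch (function names are illustrative; the later "=" and "X" operations and the "*" placeholder for a missing CIGAR are deliberately not handled), it can be parsed with a regular expression and used to compute how many reference bases an alignment spans and how long the stored read is:

```python
import re
from typing import List, Tuple

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP])")
REF_CONSUMING = set("MDN")    # operations that advance along the reference
QUERY_CONSUMING = set("MIS")  # operations that advance along the read


def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    """Split e.g. '20M2I10M' into [(20, 'M'), (2, 'I'), (10, 'M')]."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Rebuilding the string catches any characters the regex skipped.
    if "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"unsupported or malformed CIGAR: {cigar!r}")
    return ops


def reference_span(cigar: str) -> int:
    """Number of reference bases covered by the alignment."""
    return sum(n for n, op in parse_cigar(cigar) if op in REF_CONSUMING)


def query_length(cigar: str) -> int:
    """Length of the read sequence as stored in SEQ."""
    return sum(n for n, op in parse_cigar(cigar) if op in QUERY_CONSUMING)


example = "5S20M2I10M3D4M"
print(parse_cigar(example))
print(reference_span(example), query_length(example))
```

Hard clipping (H) and padding (P) consume neither reference nor read bases, so they do not contribute to either total.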
File Size
• s_1_ILS4_sequence.txt [5.2 GB]
• s_1_ILS4_sequence.fastq [3.3 GB]
• s_1_ILS4_sequence.sam [4.5 GB]
• s_1_ILS4_sequence.bam [995 MB]
• s_1_ILS4_sequence.sorted.bam [696 MB]
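The saving from binary and sorted formats can be read off these sizes directly. A small sketch, assuming 1 GB = 1000 MB for the conversion and using only the sizes listed on the slide:

```python
# File sizes from the slide, converted to MB (assuming 1 GB = 1000 MB).
SIZES_MB = {
    "s_1_ILS4_sequence.txt": 5200,
    "s_1_ILS4_sequence.fastq": 3300,
    "s_1_ILS4_sequence.sam": 4500,
    "s_1_ILS4_sequence.bam": 995,
    "s_1_ILS4_sequence.sorted.bam": 696,
}


def size_ratio(before: str, after: str) -> float:
    """How many times smaller `after` is than `before`."""
    return SIZES_MB[before] / SIZES_MB[after]


sam_to_bam = size_ratio("s_1_ILS4_sequence.sam", "s_1_ILS4_sequence.bam")
sam_to_sorted = size_ratio("s_1_ILS4_sequence.sam", "s_1_ILS4_sequence.sorted.bam")
print(f"SAM -> BAM: {sam_to_bam:.1f}x smaller")
print(f"SAM -> sorted BAM: {sam_to_sorted:.1f}x smaller")
```

For this one lane, BAM is about a fifth of the SAM size, and sorting shrinks it further.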
The Bible
Utility Tools
• SAMtools
• Picard
• USeq
• Etc …
Bioconductor Solution
Secondary Tools
• Laboratory Management
• Data mining and visualization
• Project management for genome assembly
• Pathway mapping (functional analysis of
groups of genes)
• Motif finding (for ChIP-Seq)
Integration
• Integrate information from different
technologies on a single genome map
– Genetic variation
– Gene expression (mRNA levels)
– Alternative splicing
– Transcription factor binding
– Methylation/histone status
– Small RNA levels (gene regulatory molecules)
– Non-coding RNA levels!
Speed/Efficiency
• New emphasis on efficient data structures and
algorithms
• Use of “old style” tools such as grep/sed/awk
• Machine language programming
• Currently a huge burst of programming creativity in
an “anything goes” environment
• A desperate scramble for tools that work
• Huge duplication of effort in programming, but also
in evaluating new software
Amazon Web Services
http://aws.amazon.com/education/
Future Directions
• Sequencing will continue to get much faster
and cheaper, by 4-10x per year for several
more years.
• Affordable complete human genome
sequencing will be available as a clinical
diagnostic tool within 2-3 years.
• Data storage and analysis bottleneck
• Data security/privacy issues
Move to 1:52
Field Trip