VanBUG_quackenbush
Meeting the Bioinformatics
Challenges of Functional
Genomics
VanBUG
11 September 2003
Acknowledgments
<[email protected]>
TIGR Human/Mouse/Arabidopsis
H. Lee Moffitt Center/USF
Expression Team
Timothy J. Yeatman
Emily Chen
Greg Bloom
Bryan Frank
Renee Gaspard
PGA Collaborators
Jeremy Hasseman
Gary Churchill (TJL)
Lara Linford
Greg Evans (NHLBI)
Fenglong Liu
Harry Gavras (BU)
Simon Kwong
Howard Jacob (MCW)
John Quackenbush
Anne Kwitek (MCW)
Shuibang Wang
Allan Pack (Penn)
Yonghong Wang
Emeritus
Beverly Paigen (TJL)
Ivana Yang
Jennifer Cho (TGI)
Luanne Peters (TJL)
Yan Yu
Ingeborg Holt (TGI)
David Schwartz (Duke)
Array Software Hit Team
Feng Liang (TGI)
Nirmal Bhagabati
Kristie Abernathy (mA)
TIGR PGA Collaborators
John Braisted
Sonia Dharap (mA)
Norman Lee
Tracey Currier
Julie Earle-Hughes (mA)
Renae Malek
Jerry Li
Cheryl Gay (mA)
Hong-Ying Wang
Wei Liang
Priti Hegde (mA)
Truong Luu
John Quackenbush
Rong Qi (mA)
Bobby Behbahani
Alexander I. Saeed
Erik Snesrud (mA)
Vasily Sharov
Heenam Kim (mA)
Mathangi Thiagarajan
Funding provided by the Department of Energy
Joseph White
and the National Science Foundation
Assistant
Funding provided by the National Cancer Institute,
Sue Mineo
the National Heart, Lung, Blood Institute,
and the National Science Foundation
The TIGR Gene Index Team
Foo Cheung
Svetlana Karamycheva
Yudan Lee
Babak Parvizi
Geo Pertea
Razvan Sultana
Jennifer Tsai
John Quackenbush
Joseph White
TIGR Faculty, IT Group, and Staff
Acknowledgments
Thanks to Syntek, Inc.
<http://www.syntek.com>
for GeneShaving MeV module and assistance
with MyMADAM
Thanks to DataNaut, Inc.
<http://www.datanaut.com>
for RelNet and Terrain Map modules and
assistance with Client/Server MeV
Science is built with facts as a house is with
stones – but a collection of facts is no more a
science than a heap of stones is a house.
– Jules Henri Poincaré
There are 10^11 stars in the galaxy. That
used to be a huge number. But it's only a
hundred billion. It's less than the national
deficit! We used to call them astronomical
numbers. Now we should call them
economical numbers.
- Richard Feynman, physicist, Nobel laureate
(1918-1988)
Microarray Analysis at TIGR
Step 1: Experimental Design
Step 2: Data Collection
Step 3: Data Analysis
Step 4: Consulting with the ArraySW gang in the trailer
Step 5: Sharing data with our collaborators
Steps in the Process
Select array elements and annotate them
Build a database to manage stuff
Print arrays and manage the lab
Hybridize and analyze images; manage data
Analyze hybridization data and get results
TIGR Gene Indices
home page
www.tigr.org/tdb/tgi
~60 species
>16,000,000 sequences
TGICL Tools are available – with more coming
Geo Pertea
Razvan Sultana
Valentin Antonescu
Available with source
Gene Index Assembly process
ESTs from
GenBank
(dbEST)
Expressed Transcripts (ET)
from GenBank CDS
TIGR ESTs
reduce
redundancy
remove vector, poly-A,
adapter, mitochondrial,
and ribosomal sequence
High stringency pairwise comparisons to
build Clusters
Each cluster is
assembled to obtain
Tentative Consensus
sequences (TCs)
Annotate TCs
and release
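The clustering step above can be sketched as a transitive closure over pairwise matches. This is a minimal illustration, not the TGICL implementation: the function name `cluster_sequences`, the sequence IDs, and the idea that matches arrive as ready-made ID pairs are all assumptions, and the high-stringency comparison itself is not modeled.

```python
from collections import defaultdict


def cluster_sequences(seq_ids, matches):
    """Group sequences into clusters by transitive closure over pairwise matches.

    `matches` is an iterable of (id_a, id_b) pairs that passed a
    high-stringency comparison; the comparison itself is not modeled here.
    """
    parent = {s: s for s in seq_ids}

    def find(s):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s

    for a, b in matches:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = defaultdict(list)
    for s in seq_ids:
        clusters[find(s)].append(s)
    # Sort members and clusters so the result is deterministic.
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical IDs: three ESTs and one ET end up in one cluster, two stay alone.
clusters = cluster_sequences(
    ["EST1", "EST2", "EST3", "ET1", "EST4"],
    [("EST1", "EST2"), ("EST2", "ET1")],
)
```

Each resulting cluster would then be handed to an assembler to produce a TC; sequences that match nothing remain as singletons.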
The Mouse Gene Index <http://www.tigr.org/tdb/mgi>
A TC Example
GO Terms
and EC Numbers
Babak Parvizi
The TIGR Gene Indices <http://www.tigr.org/tdb/tgi>
Dan Lee, Ingeborg Holt
Building TOGs: Reflexive, Transitive Closure
And Paralogues
Tentative Orthologues
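One way to read "reflexive" here is as a reciprocal best match between sequences from different species. The sketch below shows only that step, under that assumption; the function name, the TC identifiers, and the single best-hit table are hypothetical, and how best hits are scored is left out.

```python
def reciprocal_best_hits(best_hit):
    """Return sorted pairs (a, b) where a's best hit is b and b's best hit is a.

    `best_hit` maps a sequence ID to its best-scoring match in another
    species; how those best hits are scored is outside this sketch.
    """
    pairs = set()
    for a, b in best_hit.items():
        if best_hit.get(b) == a:
            pairs.add(tuple(sorted((a, b))))
    return sorted(pairs)


# Hypothetical TC identifiers for three species.
best_hit = {
    "human_TC1": "mouse_TC7",
    "mouse_TC7": "human_TC1",
    "rat_TC3": "mouse_TC7",   # one-way hit: not reciprocal, so not kept
    "human_TC2": "rat_TC9",
    "rat_TC9": "human_TC2",
}
pairs = reciprocal_best_hits(best_hit)
```

Taking the transitive closure of these pairs across species would then group them into candidate orthologue sets.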
Thanks to Woytek Makałowski and Mark Boguski
TOGA: A Sample Alignment: bithoraxoid-like protein
Gene Finding in Humans is easy!
Razvan Sultana
Gene Finding in Humans is easy?
Razvan Sultana
Gene Finding in Humans is difficult?
Razvan Sultana
Gene Finding in Humans is difficult?
A genome and its annotation is only a
hypothesis that must be tested.
Razvan Sultana
RESOURCERER
Jennifer Tsai
http://pga.tigr.org/tools.shtml
RESOURCERER: An Example
RESOURCERER: Using Genetic Markers
Just added: Integrated QTLs
Steps in the Process
Select array elements and annotate them
Build a database to manage stuff
Print arrays and manage the lab
Hybridize and analyze images; manage data
Analyze hybridization data and get results
SOPs are available
Coming: Data QC SOP
cDNA/template prep
PCR purification
Printing
RNA labeling
Hybridization
<http://pga.tigr.org/tools.shtml>
What data should we collect?
Nature Genetics 29, December 2001
MAGE-ML – XML-based data exchange format
<http://www.mged.org>
EVERYTHING
MIAME Relational Schema
What’s Wrong with MIAME?
MIAME was designed as a model for capturing information
necessary to create public databases.
MIAME-based databases lack LIMS capabilities, which are
necessary for large-scale studies.
We do not want to store images in our database for
practical reasons – limited space.
We needed to develop a variety of tools adapted to our
existing infrastructure and legacy data and databases.
Probes are labeled and applied to the arrays
An “experiment” is a hybridization
A “study” is a collection of hybridization experiments
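The vocabulary above (probe, experiment, study) could be modeled roughly as follows. This is an illustrative sketch only: the class and field names are hypothetical and are not taken from the MAD schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Hybridization:
    """One "experiment": a pair of labeled probes applied to one slide."""
    slide_id: str
    cy3_probe: str
    cy5_probe: str


@dataclass
class Study:
    """A named collection of hybridization experiments."""
    name: str
    experiments: List[Hybridization] = field(default_factory=list)


# Hypothetical study with two hybridizations, the second a dye swap.
study = Study("sleep deprivation")
study.experiments.append(Hybridization("S001", "control", "test"))
study.experiments.append(Hybridization("S002", "test", "control"))
```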
MAD Microarray Database Schema
Conceptual Schema: MAD
Clone
Slide
Slide_type
Spot
New_plate
Gene
Hyb
Study
Experiment
Expression
Expt_probe
Probe
Probe_source
PCR
Protocol
Primer_pair
Scan
Analysis
Normalize
Primer
MADAM: Microarray Data Manager
Marie-Michelle Cordonnier-Pratt, UGA
converted MySQL to Oracle and made
MADAM work!
Available with source and MySQL
ExpDesigner
Steps in the Process
Select array elements and annotate them
Build a database to manage stuff
Print arrays and manage the lab
Hybridize and analyze images; manage data
Analyze hybridization data and get results
Microarray Overview I
Microtiter Plate
Microbial
ORFs
Design PCR Primers
Microarray Slide
(with 60,000 or more
spotted genes)
+
PCR Products
Eukaryotic
Genes
Select cDNA clones
PCR Products
Many different plates
containing different genes
For each plate set,
many identical replicas
Microarray Overview
Selected Genes
PCR Scorer
Reads/loads primer data file
to MAD and allows PCR data entry
and translation of 96-well to 384-well plate formats.
(Alex Saeed, developer and maintainer;
enhancements: Wedge Smith)
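A 96-to-384 translation of well positions might look like the sketch below. It assumes the common interleaved layout, in which four 96-well plates fill alternating rows and columns of one 384-well plate; the layout PCR Scorer actually uses is not stated in the slides, and the function name and quadrant numbering are hypothetical.

```python
import string


def well_96_to_384(well, quadrant):
    """Map a 96-well position such as "A1" to its 384-well position.

    Assumes the common interleaved layout: four 96-well plates
    (quadrant 0-3) fill alternating rows and columns of one 384-well plate.
    """
    if quadrant not in range(4):
        raise ValueError("quadrant must be 0, 1, 2, or 3")
    row = string.ascii_uppercase.index(well[0].upper())  # A-H -> 0-7
    col = int(well[1:]) - 1                              # 1-12 -> 0-11
    if not (0 <= row < 8 and 0 <= col < 12):
        raise ValueError(f"not a 96-well position: {well}")
    row_384 = 2 * row + quadrant // 2   # rows A-P, indexed 0-15
    col_384 = 2 * col + quadrant % 2    # columns 1-24, indexed 0-23
    return f"{string.ascii_uppercase[row_384]}{col_384 + 1}"


first = well_96_to_384("A1", 0)
last = well_96_to_384("H12", 3)
```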
Primer Design
Clone Selection
Primer Synthesis
PCR Amplification
Gel-based Scoring
MAD
The Beast: Microarray Robot from Intelligent Automation
<http://www.ias.com>
Additional Software for Arrays: Scheduler
Microarray Scheduler
Allows scheduling
of all instruments
Designed and maintained
by Jerry Li
Available with source
Microarray Overview
Amplified/Purified Genes
Loaded in Arrayer
Run Parameters Set
Slides Printed
SliTrack/Controller
Takes Slide Order
and Run parameters,
generates spot order,
IAS control file,
launches IAS run software,
loads database.
(J. Li, developer and maintainer)
MAD
Steps in the Process
Select array elements and annotate them
Build a database to manage stuff
Print arrays and manage the lab
Hybridize and analyze images; manage data
Analyze hybridization data and get results
Microarray Overview II
Measure
Fluorescence
in 2 channels
red/green
Control
Test
Prepare Fluorescently
Labeled Probes
Hybridize,
Wash
Analyze the data
to identify
patterns of
gene expression
Microarray Overview II
Measure
Fluorescence
in 2 channels
red/green
Weed
Control
Test
Prepare Fluorescently
Labeled Probes
Bush
Hybridize,
Wash
Analyze the data
to identify
patterns of
gene expression
Microarray Overview II
Measure
Fluorescence
in 2 channels
red/green
Control
Test
Prepare Fluorescently
Labeled Probes
Obtain RNA Samples
Hybridize,
Wash
Analyze the data
to identify
differentially
expressed genes
Microarray Overview
Control
MADAM
Allows data entry
(J. Li & J. White, web prototype;
A. Saeed, J. White, J. Li,
& V. Sharov, developers)
Test
Obtain RNA Samples
Prepare Fluorescently
Labeled Probes
Hybridize,
Wash
MAD
Microarray Overview
Control
MABCOS
Uses Bar Codes to track samples
(J. Li developer)
Test
Obtain RNA Samples
Prepare Fluorescently
Labeled Probes
Hybridize,
Wash
MAD
Available with source
MADAM + mMAP
Allows data entry,
moves/renames paired TIFF image files to
long-term storage
(A. Saeed, J. White, J. Li,
& V. Sharov, developers)
Microarray Overview
NetAPP
MAD
Microarray Overview
Spotfinder
Provides Image Analysis,
writes data to
flat files or directly to db
(V. Sharov, developer and maintainer)
NetAPP
Available as Executable for Windows;
device-independent C/C++ coming
MAD
The TIGR Array Software System
PCRSCORE
SpotFinder
SLITRACK
MADAM
MAD
McCoder
MABCOS
ExpDesigner
MIDAS
MeV
Data Normalization and Filtering
Lowess Normalization
Why LOWESS?
[Plot A; SD = 0.346]
Observations
1. Intensity-dependent structure
2. Data not mean centered at log2(ratio) = 0
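LOWESS (Cont’d)
Local linear regression model:
$$y_i = \alpha + \beta x_i + \varepsilon_i$$
Tri-cube weight function:
$$w(u) = (1 - |u|^3)^3 \ \text{for } |u| < 1, \qquad w(u) = 0 \ \text{otherwise}$$
Least squares: choose $\alpha$ and $\beta$ to minimize
$$\sum_i w(x_i)\,(y_i - \alpha - \beta x_i)^2$$
which gives
$$(\hat{\alpha}, \hat{\beta})' = (X'WX)^{-1} X'WY$$
Estimated values of log2(Cy5/Cy3) as a function of log10(Cy3*Cy5)
[Plot A; SD = 0.346]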
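The three ingredients named above (local linear model, tri-cube weights, weighted least squares) can be combined as in the following sketch. It is a plain, unoptimized illustration and not the MIDAS code; the function name, the `frac` smoothing parameter, and the synthetic data are assumptions.

```python
import numpy as np


def lowess(x, y, frac=0.4):
    """Fit a local linear regression at each x and return the fitted y values.

    For each point, the nearest `frac` of the data is weighted with the
    tri-cube function and a weighted least-squares line is fitted.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    k = max(2, int(np.ceil(frac * n)))        # points in each local window
    X = np.column_stack([np.ones(n), x])      # design matrix: intercept and slope
    fitted = np.empty(n)
    for i in range(n):
        dist = np.abs(x - x[i])
        h = np.sort(dist)[k - 1]              # distance to the k-th nearest point
        u = dist / h if h > 0 else np.zeros(n)
        w = np.where(u < 1, (1 - u**3) ** 3, 0.0)   # tri-cube weights
        XtW = X.T * w                         # X'W without building an n x n matrix
        # Solve (X'WX) beta = X'Wy; lstsq tolerates a singular local system.
        beta = np.linalg.lstsq(XtW @ X, XtW @ y, rcond=None)[0]
        fitted[i] = beta[0] + beta[1] * x[i]
    return fitted


# Synthetic example: log ratios with a purely intensity-dependent bias.
a = np.linspace(6, 10, 50)            # stands in for log10(Cy3*Cy5)
m = 0.5 * (a - 8) + 0.3               # stands in for log2(Cy5/Cy3)
m_normalized = m - lowess(a, m)       # subtract the fitted trend
```

Subtracting the fitted curve removes the intensity-dependent structure and centers the log ratios near zero, which addresses both observations listed on the "Why LOWESS?" slide.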
LOWESS Results
“Slice Analysis” (Intensity-dependent Z-score)
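An intensity-dependent Z-score can be sketched as below: each spot's log ratio is compared with the mean and standard deviation of spots of similar intensity. The sliding-window scheme, the `window` size, and the function name are assumptions made for illustration, not a description of how MIDAS defines its slices.

```python
import numpy as np


def slice_z_scores(log_intensity, log_ratio, window=50):
    """Z-score each log ratio against spots of similar intensity.

    Spots are ordered by intensity; each spot is compared with the mean and
    standard deviation of the `window` spots around it in that ordering.
    """
    a = np.asarray(log_intensity, dtype=float)
    m = np.asarray(log_ratio, dtype=float)
    order = np.argsort(a)
    m_sorted = m[order]
    n = len(m)
    half = window // 2
    z_sorted = np.empty(n)
    for i in range(n):
        lo = max(0, i - half)
        hi = min(n, i + half + 1)
        local = m_sorted[lo:hi]
        sd = local.std(ddof=1)
        z_sorted[i] = (m_sorted[i] - local.mean()) / sd if sd > 0 else 0.0
    z = np.empty(n)
    z[order] = z_sorted          # restore the original spot order
    return z


# Synthetic data: 500 spots of noise plus one strongly shifted spot.
rng = np.random.default_rng(1)
a = rng.uniform(6, 10, 500)
m = rng.normal(0, 0.1, 500)
m[0] = 2.0
z = slice_z_scores(a, m)
```

Spots whose absolute Z-score exceeds a chosen cutoff would then be flagged as differentially expressed relative to spots of comparable intensity.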
MIDAS: Data Analysis
Wei Liang
Adding Error Models,
MAANOVA,
Automated Reporting
Available with OSI source
Microarray Overview
MIDAS Performs data normalization
and filtering, including, soon, ANOVA
MAD
MIDAS
MAD
Steps in the Process
Select array elements and annotate them
Build a database to manage stuff
Print arrays and manage the lab
Hybridize and analyze images; manage data
Analyze hybridization data and get results
MeV: Data Mining Tools
Available with OSI source
Alexander Saeed
Alexander Sturn
Nirmal Bhagabati
John Braisted
Syntek Inc.
Datanaut, Inc.
MeV: Metabolic pathway analysis is coming
Maria Klapa and Chris Koenig
Analyses available in MeV...
Hierarchical clustering (HCL)
Bootstrapped/Jackknifed HCL
k-means clustering (KMC)
k-means support (iterative KMC)
Self-Organizing Maps (SOMs)
Cluster Affinity Search Technique (CAST)
Figure of Merit for CAST and KMC (soon SOM)
QT-clust (Heyer Jackknife)
Principal component analysis (PCA)
Gene Shaving
Relevance Networks
Support Vector Machines (SVM)
Self-Organizing Trees
Classification approaches, including Template Matching
t-tests
Significance Analysis of Microarrays (SAM)
ANOVA tools
GO, Metabolic Pathway, and Genome Localization annotation/clustering
Client-server mode with well-defined API
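As an illustration of one entry in this list, k-means clustering of expression profiles can be sketched as follows. This is a generic version of Lloyd's algorithm with Euclidean distance, not MeV's code; the function name, the farthest-first seeding (chosen here only to make the result reproducible), and the toy profiles are assumptions.

```python
import numpy as np


def k_means(profiles, k, n_iter=100):
    """Cluster rows of `profiles` (genes x experiments) into k groups.

    Uses Lloyd's algorithm with Euclidean distance. Centroids are seeded by
    farthest-first traversal so the result is deterministic.
    """
    data = np.asarray(profiles, dtype=float)
    seeds = [data[0]]
    while len(seeds) < k:
        # Distance from every row to its nearest seed chosen so far.
        d = np.min([np.linalg.norm(data - c, axis=1) for c in seeds], axis=0)
        seeds.append(data[np.argmax(d)])
    centroids = np.array(seeds)

    labels = None
    for _step in range(n_iter):
        dist = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dist.argmin(axis=1)
        for j in range(k):
            members = data[new_labels == j]
            if len(members):                 # leave an empty cluster's centroid as is
                centroids[j] = members.mean(axis=0)
        if labels is not None and np.array_equal(new_labels, labels):
            break                            # assignments stopped changing
        labels = new_labels
    return labels, centroids


# Toy profiles: two rising and two falling expression patterns.
profiles = [
    [1.0, 2.0, 3.0],
    [1.1, 2.1, 2.9],
    [3.0, 2.0, 1.0],
    [2.9, 2.1, 1.1],
]
labels, centroids = k_means(profiles, k=2)
```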
Missing from MeV...
MAGE-ML output for direct submission to
databases ... Coming in the next MADAM release.
Links to BioConductor … are coming.
Array CGH module from Barb Weber and Adam
Margolin ... is coming.
EASE module from Doug Hosack ... is coming
Lots of stuff we are not smart enough to think
about.
Sleep Deprivation Studies in Mouse
[Figure: experimental time course; axis ticks at 0, 3, 6, 9, and 12]
Experimental Paradigm
Compare gene expression between sleeping and
sleep-deprived mice in cortex and hypothalamus
Perform 3 biological replicates
Normalize and filter data and use data mining techniques
to select distinct patterns of gene expression
Use Gene Ontology (GO) assignments to classify genes
by cellular localization, molecular function, biological
process
Use GO analysis to develop an understanding of response
Differential Expression in Cortex
Stress Response
Intermediate
Metabolism and
Signal Transduction
Energy Metabolism
Transcription;
Mitochondrial and
Ribosomal Proteins
Differential Expression in Hypothalamus
Sleep signaling
EASE Analysis of GO terms

Cortex – Up-regulated Genes
GO Class                 GO Category                                                p-value
GO Cellular Component    endoplasmic reticulum                                      6.06 × 10^-03
GO Molecular Function    heat shock protein activity                                8.78 × 10^-04
GO Molecular Function    pyruvate dehydrogenase (lipoamide) phosphatase activity    3.17 × 10^-03
GO Molecular Function    chaperone activity                                         7.38 × 10^-03

Cortex – Down-regulated Genes
GO Class                 Gene Category                                              p-value
GO Biological Process    protein biosynthesis                                       2.85 × 10^-25
GO Biological Process    protein metabolism                                         1.00 × 10^-11
GO Biological Process    electron transport                                         6.04 × 10^-03
GO Cellular Component    ribosome                                                   5.95 × 10^-37
GO Cellular Component    ribonucleoprotein complex                                  1.17 × 10^-32
GO Cellular Component    eukaryotic 48S initiation complex                          9.74 × 10^-18
GO Cellular Component    eukaryotic 43S pre-initiation complex                      2.68 × 10^-15
GO Cellular Component    mitochondrial inner membrane                               3.70 × 10^-03
GO Molecular Function    structural constituent of ribosome                         6.46 × 10^-39
GO Molecular Function    RNA binding activity                                       4.83 × 10^-21
GO Molecular Function    cytochrome c oxidase activity                              9.79 × 10^-04
GO Molecular Function    hydrogen ion transporter activity                          1.88 × 10^-03

Themes:
General biological trends based on representation of
functional roles on the array
Problem:
Requirement of functional class assignment limits utility
for discovery of new functional networks
Hosack, et al. 2003
Thanks to Doug Hosack, NIAID
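The p-values above come from EASE, which scores over-representation of a category with a conservative variant of the one-tailed Fisher exact test: one gene in the category is removed from the list before the probability is computed. The sketch below follows that idea under stated assumptions; the function names are hypothetical, the removed gene is taken out of both the hit count and the list total, and real implementations may build the contingency table differently.

```python
from math import comb


def upper_tail(list_hits, list_total, pop_hits, pop_total):
    """P(X >= list_hits) when list_total genes are drawn from the population."""
    max_hits = min(list_total, pop_hits)
    numerator = sum(
        comb(pop_hits, k) * comb(pop_total - pop_hits, list_total - k)
        for k in range(list_hits, max_hits + 1)
    )
    return numerator / comb(pop_total, list_total)


def ease_score(list_hits, list_total, pop_hits, pop_total):
    """Upper-tail probability after removing one category gene from the list.

    Assumption: the removed gene lowers both the hit count and the list size.
    """
    if list_hits == 0:
        return 1.0
    return upper_tail(list_hits - 1, list_total - 1, pop_hits, pop_total)


# Toy numbers: 3 of 5 list genes fall in a category holding 3 of 10 array genes.
fisher_p = upper_tail(3, 5, 3, 10)
ease_p = ease_score(3, 5, 3, 10)
```

Because one supporting gene is removed, the adjusted score is never smaller than the plain Fisher probability here, which penalizes categories supported by very few genes.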
Now available...
The TGI databases, including RESOURCERER
The TGICL Gene Index Clustering and Assembly Tools
A freely-available MySQL version of our MIAME-supportive database
A freely-available, open source, Java-based set of tools:
MADAM: Microarray Data Manager
MIDAS: Microarray Data Analysis System
MeV: Multiexperiment Viewer
A freely-available, image processing software system
linked to the database: TIGR Spotfinder
Nobody in the game of football
should be called a genius.
A genius is somebody like Norman Einstein.
- Joe Theismann, Former quarterback
A theory has only the possibility of being
right or wrong. A model has a third
possibility; it may be right but irrelevant.
– Manfred Eigen
Unless a reviewer has the courage
to give you unqualified praise, I say
ignore the bastard.
- John Steinbeck