Transcript Slide 1

Making Sense of Public Domain Expression Data: GeneVestigator
Metsada Pasmanik-Chor, TAU
Bioinformatics Unit, 19/3/09
On the Agenda
Microarray databases: characteristics, pros and cons
Examples:
• GEO and ArrayExpress
• GeneVestigator - meta-analytical approach
Meta-data in Microarray Experiments
Gene expression studies generate large amounts of data!
http://titan.biotec.uiuc.edu/cs491jh/slides/cs491jh-Yong.ppt#268,6,Capturing Data and Meta-data in Microarray Experiments
Properties of High-throughput Data
Microarray databases can accept, store and export (share) large quantities of data.
Stored data contain:
• Many genes
• Many samples
• Various organisms/tissues
• A variety of biological phenomena
• Time courses
• Replicates
• Different technologies: various data formats
Data retrieval:
• User-friendly, web-based interfaces
• Links to analysis tools
Gene Expression Matrix
[Figure: raw images and their spots are turned into spot/image quantitations, and then into the gene expression matrix (genes × samples) of gene expression levels]
The final gene expression matrix (on the right) is needed for higher-level analysis and mining.
http://titan.biotec.uiuc.edu/cs491jh/slides/cs491jh-Yong.ppt#271,8,Gene Expression Matrix
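As a rough illustration of how spot-level quantitations might be collapsed into a genes × samples expression matrix, here is a minimal Python sketch. The records, the names (`spot_quantitations`, `build_expression_matrix`) and the choice of averaging replicate spots are assumptions made for illustration; they are not taken from the slides.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical spot-level quantitations: (sample, gene, intensity).
# A gene may be measured by several spots on the same array.
spot_quantitations = [
    ("sample_1", "geneA", 100.0),
    ("sample_1", "geneA", 120.0),
    ("sample_1", "geneB", 40.0),
    ("sample_2", "geneA", 210.0),
    ("sample_2", "geneB", 35.0),
    ("sample_2", "geneB", 45.0),
]


def build_expression_matrix(records):
    """Return (genes, samples, matrix); matrix[i][j] is the mean
    intensity of genes[i] in samples[j], or None if never measured."""
    grouped = defaultdict(list)
    for sample, gene, intensity in records:
        grouped[(gene, sample)].append(intensity)
    genes = sorted({gene for gene, _ in grouped})
    samples = sorted({sample for _, sample in grouped})
    matrix = [
        [mean(grouped[(g, s)]) if (g, s) in grouped else None for s in samples]
        for g in genes
    ]
    return genes, samples, matrix


genes, samples, matrix = build_expression_matrix(spot_quantitations)
for gene, row in zip(genes, matrix):
    print(gene, row)
```

Rows are genes and columns are samples, which is the shape that clustering and other higher-level analyses usually expect.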
Microarray Data Precision and Loss
[Figure labels: "Electron microscopy"; "Only provided in 0.1% of public experiments"]
Processed data loses precision!
90% of CEL files generated from microarray experiments have never been deposited in any repository (Stokes et al. BMC Bioinformatics 2008 9(Suppl 6):S18).
http://www.bio-miblab.org/arraywiki
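To make the loss of precision concrete, the following Python sketch summarizes two made-up probe sets to a single value each. The intensities are invented and the median is only an assumed summary method; real summarization algorithms are more elaborate.

```python
from statistics import median

# Hypothetical probe-level intensities for two probe sets (raw data).
probe_level = {
    "probeset_1": [98.0, 100.0, 102.0, 101.0, 99.0],
    "probeset_2": [20.0, 100.0, 400.0, 15.0, 350.0],
}

# "Processed" data: one summary value per probe set
# (the median is an assumed, simplified summary).
processed = {name: median(values) for name, values in probe_level.items()}

# The spread among probes is only visible in the raw data.
spread = {name: max(values) - min(values) for name, values in probe_level.items()}

print(processed)
print(spread)
```

Both probe sets end up with the same processed value, although their probe-level behaviour is very different; only someone holding the raw data can tell them apart.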
Microarray Data Formats
A. Raw image data: the intensity of the signal at each spot is proportional to the expression level of the gene under test. Image intensities are quantified using image analysis software.
B. Raw numerical data (signal intensities).
C. Processed data.
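The three formats can be sketched in a few lines of Python. The pixel values, the fixed one-pixel background border, the median-based background correction and the log2 transform are all simplifying assumptions for illustration; actual image analysis software uses more careful spot segmentation.

```python
import math
from statistics import median

# A. Hypothetical 5x5 pixel patch around one spot (raw image data).
patch = [
    [10, 12, 11, 10, 12],
    [11, 90, 95, 92, 10],
    [12, 94, 99, 96, 11],
    [10, 91, 97, 93, 12],
    [11, 10, 12, 11, 10],
]


def quantify_spot(pixels, border=1):
    """Median of the inner (foreground) pixels minus the median of the
    outer border (background) pixels."""
    n_rows, n_cols = len(pixels), len(pixels[0])
    foreground, background = [], []
    for r, row in enumerate(pixels):
        for c, value in enumerate(row):
            inside = border <= r < n_rows - border and border <= c < n_cols - border
            (foreground if inside else background).append(value)
    return median(foreground) - median(background)


# B. Raw numerical data: one signal intensity per spot.
raw_signal = quantify_spot(patch)

# C. Processed data: here simply a log2 transform of the raw signal.
processed_signal = math.log2(raw_signal)

print(raw_signal, round(processed_signal, 3))
```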
• A complete description of complex experiments is desired.
• We don't always know what's important:
  "Noise" probes could end up being informative (e.g. detection of a splice variant).
• Different labs have different needs: a central system is needed!
• The future:
  Better (more accurate) summarization algorithms will emerge.
  New uses for raw data may emerge.
• Challenge: store the raw data in an accessible form.
Complexity and Categories of Data and MIAME 6 Parts
The MIAME (Minimum Information About a Microarray Experiment) guidelines contain standards for publication of information. Brazma et al. (2001), Nature Genetics 29(4), 365-71.
[Figure labels: Publication; Experimental design; Source (e.g., Taxonomy); Sample (source & treatment, prep. & labelling); Normalization; Hybridisation; Array design; Gene (e.g., EMBL); Data measurements]
http://www.ict.ox.ac.uk/odit/projects/digitalrepository/docs/workshop/Helen_Parkinson-RDMW0608.ppt#429,18,Slide 18
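A submission could be checked against the six MIAME parts with something like the sketch below. The part names follow Brazma et al. (2001); the dictionary keys, the function name `missing_miame_parts` and the example submission are hypothetical and do not correspond to any repository's actual submission format.

```python
# The six MIAME parts as named in Brazma et al. (2001).
# The key spellings are hypothetical, chosen for this sketch only.
MIAME_PARTS = (
    "experimental_design",
    "array_design",
    "samples",
    "hybridizations",
    "measurements",
    "normalization_controls",
)


def missing_miame_parts(submission):
    """Return the MIAME parts that are absent or empty in a submission dict."""
    return [part for part in MIAME_PARTS if not submission.get(part)]


# Invented example: one part left empty, one part not provided at all.
example_submission = {
    "experimental_design": "time course, 3 replicates",
    "array_design": "Affymetrix GeneChip",
    "samples": "skeletal muscle biopsies",
    "hybridizations": "",
    "measurements": "CEL files and processed matrix",
}

missing = missing_miame_parts(example_submission)
print(missing)
```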
Microarray Database Repositories are Biased
The relative size of each pie corresponds to the number of experiments contained in each repository.
[Figure: pie charts per repository, labelled: All human data; Mostly custom arrays; Mostly human data; Mostly old data; Mainly Affy chips]
Stokes et al. BMC Bioinformatics 2008 9 (Suppl 6): S18
http://www.biomedcentral.com/1471-2105/9/S6/S18
Overlaps of Data Between Repositories
Stokes et al. BMC Bioinformatics 2008 9 (Suppl 6): S18
http://www.biomedcentral.com/1471-2105/9/S6/S18
Total Experiments: 2376
August 2005 – June 2006
User-Friendly Microarray Databases
• Many gene expression databases exist, both commercial and non-commercial.
• Most focus on a particular technology, a particular organism, or both.
• We will discuss the most promising ones:
  • ArrayExpress (AE; EBI)
  • The Gene Expression Omnibus (GEO; NCBI)
  • GeneVestigator
http://www.ncbi.nlm.nih.gov/geo/
The Gene Expression Omnibus (GEO) is a public repository in the Entrez database system that holds high-throughput gene expression data, hosted at the National Library of Medicine (NIH).
GEO was designed to accommodate diverse types of data.
Gene Expression Omnibus - Experiment-centered view (GDS)
Gene Expression Omnibus - Gene-centered view
Example: GDS563
Expression profile of the Dystrophin gene in a DataSet examining skeletal muscle biopsies from 12 Duchenne muscular dystrophy patients and 12 normal subjects.
Experimental design: Normal, Duchenne.
Red bars: level of abundance of an individual transcript across the Samples that make up a DataSet. Values are presented as arbitrary units. Single channel: Values are normalized signal count data. Dual channel: submitted Values are normalized log ratios.
Faded bars/squares: these correspond to Affymetrix 'Detection call' = Absent.
Blue squares: rank order; they give an indication of where the expression of that gene falls with respect to all other genes on that array (enrichment).
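The rank-order idea (where a gene falls relative to all other genes on the same array) can be sketched as a percentile rank. The function `percentile_ranks` and the expression values are invented for illustration, ties are ignored, and this is not claimed to be GEO's exact computation.

```python
def percentile_ranks(sample_values):
    """Map gene -> percentile rank (0-100] of its value among all genes
    measured in one sample. Assumes all values are distinct."""
    ordered = sorted(sample_values, key=sample_values.get)
    n = len(ordered)
    return {gene: 100.0 * (position + 1) / n for position, gene in enumerate(ordered)}


# Hypothetical expression values for four genes on a single array.
one_array = {"DMD": 50.0, "ACTB": 900.0, "GAPDH": 1200.0, "MYOG": 10.0}

ranks = percentile_ranks(one_array)
print(ranks["DMD"])
```

A gene with a low percentile rank is weakly expressed relative to the rest of that array, regardless of the arbitrary units of the value itself.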
http://www.ebi.ac.uk/microarray-as/ae/
Query ArrayExpress
[Screenshot labels: Gene name; Species; Condition; Annotations; Experiments and description; Click]
Results: a list of all experiments, ordered by p-value.
For each experiment: a short description, experimental factors and gene expression.
Query ArrayExpress - similarly expressed genes
Select the 'find 3 closest genes' option.
IER2, FOS and JUN have expression similar to nfkbia.
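One plausible way to find the closest genes is to rank all other genes by the correlation of their expression profiles with the query gene. The sketch below uses Pearson correlation; the similarity measure actually used by ArrayExpress is not stated in the slides, and the profiles are invented numbers attached to the gene names from the example (GENE_X and GENE_Y are placeholders).

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def closest_genes(query, profiles, k=3):
    """Return the k genes whose profiles correlate best with the query gene."""
    others = [gene for gene in profiles if gene != query]
    others.sort(key=lambda gene: pearson(profiles[query], profiles[gene]), reverse=True)
    return others[:k]


# Invented expression profiles across four conditions (not real data).
profiles = {
    "NFKBIA": [1.0, 2.0, 8.0, 9.0],
    "IER2": [1.1, 2.2, 7.5, 9.3],
    "FOS": [0.5, 1.5, 9.0, 8.0],
    "JUN": [2.0, 3.0, 7.0, 10.0],
    "GENE_X": [9.0, 8.0, 2.0, 1.0],
    "GENE_Y": [5.0, 5.1, 4.9, 5.0],
}

print(closest_genes("NFKBIA", profiles))
```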
HeatMap Atlas
[Figure labels: Output; Experimental condition; Number of up/down regulated genes]
http://www.ebi.ac.uk/microarrayas/atlas/qr?q_gene=saa4&q_updn=updn&q_orgn=MUS+MUSCULUS&q_expt=%28all+conditions%29&view=heatmap&view=
https://www.genevestigator.com/gv/index.jsp
GeneVestigator: a reference expression database and meta-analysis system
Genevestigator: a system for the meta-analysis of microarray data
A database and web-browser data mining interface for Affymetrix GeneChip data, based on the new concept of "Meta-Profiles", relying on reference expression databases.
Allows biologists to study the expression and regulation of genes in a broad variety of contexts by summarizing information from hundreds of manually curated microarray experiments.
Workspaces and views can be stored in files (*.gvw, which stands for Genevestigator Workspace) and re-opened for another analysis session.
[Figure labels: Application server; Java application; Analysis output]
http://bar.utoronto.ca/ICAR19/ICAR19_BioinfoWorkshop%20-%20Genevestigator.ppt#257,2,Overview of the Genevestigator system
Database Content and Quality
The database consists of a large and varied collection of manually curated and quality-controlled Affymetrix chips.
Quality control of EACH experiment is done manually by Genevestigator curators, using a pipeline of Bioconductor packages that perform normalization and probe-level analysis.
Low-quality arrays are those that:
• fall out of range relative to the other arrays from the same experiment,
• exhibit higher RNA degradation,
• are particularly noisy,
• do not correlate with replicate samples.
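The last criterion (no correlation with replicate samples) can be sketched as follows. This is not the Genevestigator curation pipeline, which uses Bioconductor packages; the replicate values, the function names and the 0.8 threshold are assumptions for illustration only.

```python
import math
from statistics import median


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def flag_low_quality(replicates, threshold=0.8):
    """Flag arrays whose median correlation with the other replicates
    falls below the (assumed) threshold."""
    flagged = []
    for name, values in replicates.items():
        correlations = [
            pearson(values, other)
            for other_name, other in replicates.items()
            if other_name != name
        ]
        if median(correlations) < threshold:
            flagged.append(name)
    return flagged


# Invented log-scale expression values for four replicate arrays.
replicates = {
    "rep_1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "rep_2": [1.1, 2.1, 2.9, 4.2, 5.1],
    "rep_3": [0.9, 2.0, 3.2, 3.9, 5.0],
    "rep_bad": [5.0, 1.0, 4.0, 2.0, 3.0],
}

print(flag_low_quality(replicates))
```

The median is used instead of the mean so that a single bad array does not drag down the score of the good replicates it is compared against.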
User Hardware Requirements
Genevestigator is a web-based application running in Java.
The Java applet provides several advantages:
• users don't have to install any software
• users always work with the latest software release
• Java is more powerful than HTML/Javascript for data manipulation
To run the application, client machines must have the Java Runtime Environment (JRE; version 1.4.2 or higher) installed (usually available by default on PCs).
The JRE is freely available for download from Sun Microsystems (http://www.Java.com).
To work optimally with the Genevestigator application, we recommend:
• screen resolution: 1024 x 768 or higher
• memory: preferably 512 MB RAM or more
GeneVestigator Species Availability
[Table: Species, Arrays and Number of arrays, for mammals and plants; cell values listed in extracted order]
Species [Mammals]: Human, Mouse, Rat
Species [Plants]: Arabidopsis, Barley, Rice, Soybean
Array names: Human 133_2 & Human Genome; Mouse Genome; Arabidopsis Genome; Rat Genome; Barley Genome; Rice Genome
Array sizes: 10k, 20k, 47k; 22k; 12k; 40k; 8k; 31k; 22k; 22k
Number of arrays: 1109, 3786, 2782; 3071, 1967; 2146, 858; 706; -; 3110
Data Sources and Referencing
The Genevestigator analysis platform comprises a large database of
manually curated microarray experiments collected from the public domain
or from individual contributors. The array annotations necessary for data
analysis were retrieved from public repositories and/or, if insufficiently
available, from the authors themselves.
Genevestigator contains data from the following repositories and databases:
Database and link:
Gene Expression Omnibus (GEO): http://www.ncbi.nlm.nih.gov/geo/
ArrayExpress: http://www.ebi.ac.uk/arrayexpress/
ChipperDB: http://chipperdb.chip.org/adb/adb-home
The Arabidopsis Information Resource (TAIR): http://www.arabidopsis.org/
MUSC Microarray Database: http:proteogenomics.musc.eduma
Public Expression Profiling Resource (PEPR): http://pepr.cnmcresearch.org
NASC Microarray Database (NASCArrays): http://affymetrix.arabidopsis.info/narrays/experimentbrowse.pl
NIH Neuroscience Microarray Consortium: http://arrayconsortium.tgen.org/np2/home.do
Gene Expression Open Source System (GEOSS): https://genes.med.virginia.edu/intro to geoss.html
RNA Abundance Database (RAD): http://www.cbil.upenn.edu/RAD/php/index.php
GeneVestigator focuses on gene expression in the context of:
1. Time (gene expression during stages of development/life cycle).
2. Space (tissue-specific expression).
3. Response (expression caused by stimuli: biotic stress, abiotic stress, chemical, hormone, light, drug treatment, disease).
Users can query the database to retrieve the expression patterns of individual genes throughout chosen environmental conditions, growth stages, or organs.
Conversely, mining tools allow users to identify genes specifically expressed during selected stresses, growth stages, or in particular organs.
Access: Free / By license
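The two directions of querying (gene to expression pattern, and context to specific genes) can be sketched in Python as below. This is a simplified stand-in, not Genevestigator's actual method: the measurements, the function names and the 3-fold specificity rule are all assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical annotated measurements: (gene, organ, expression level).
measurements = [
    ("gene_1", "root", 9.0), ("gene_1", "root", 11.0),
    ("gene_1", "leaf", 1.0), ("gene_1", "flower", 2.0),
    ("gene_2", "root", 5.0), ("gene_2", "leaf", 6.0), ("gene_2", "flower", 5.5),
    ("gene_3", "root", 0.5), ("gene_3", "leaf", 8.0),
    ("gene_3", "leaf", 10.0), ("gene_3", "flower", 1.5),
]


def meta_profiles(records):
    """Gene -> {category: mean expression over all arrays in that category}."""
    grouped = defaultdict(lambda: defaultdict(list))
    for gene, category, value in records:
        grouped[gene][category].append(value)
    return {
        gene: {category: mean(values) for category, values in categories.items()}
        for gene, categories in grouped.items()
    }


def genes_specific_to(profiles, category, fold=3.0):
    """Genes whose mean expression in `category` is at least `fold` times
    their highest mean in any other category (an assumed rule)."""
    hits = []
    for gene, profile in profiles.items():
        others = [value for cat, value in profile.items() if cat != category]
        if category in profile and others and profile[category] >= fold * max(others):
            hits.append(gene)
    return sorted(hits)


profiles = meta_profiles(measurements)
print(profiles["gene_1"])                  # query: where is gene_1 expressed?
print(genes_specific_to(profiles, "leaf"))  # mining: which genes are leaf-specific?
```

The same structure would apply to time (developmental stages) or response (stimuli) by swapping the organ annotation for the corresponding category.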
http://sbw.kgi.edu/
Dr. Metsada Pasmanik-Chor
Bioinformatics Unit,
Life Science, TAU
Tel: x 6992
E-mail: [email protected]
Bioinfo. Unit webpage: http://bioinfo.tau.ac.il