Microarray Gene Expression Data
MIAME and ArrayExpress –
a standard for microarray
gene expression data
and the public database at EBI
Susanna-Assunta Sansone
(Toxicogenomics project coordinator)
Microarray Informatics Team
EMBL-EBI (European Bioinformatics Institute)
Transcriptome Symposium, April 2002
CHU Pitié-Salpêtrière, Université Paris VI
Why have a public database?
EMBL-EBI is a centre for research and services in
bioinformatics that builds and maintains public databases:
• EMBL Nucleotide Sequence, SWISS-PROT, Ensembl, MSD, etc.
Practical reasons:
• Easy data access
• Resolves local storage issues
• Common data exchange formats can be developed
Scientific reasons:
• Curation can be applied
• Annotation can be controlled
• Additional info can be stored that is missing in publications
• Public standard can be applied
Improve data comparison!
Talk structure
MIAME standard
MIAME annotation challenge:
• MGED BioMaterial Ontology
Uses of MIAME concepts:
• ArrayExpress:
a public repository for gene expression data
• MIAMExpress
submission and annotation tool
Standard for microarray data: why?
Size of dataset
Different platforms - nylon, glass
Different technologies - oligos, spotted
References to external databases are not stable!
Array annotation
Sample annotation
Data sharing needs standardized way to
annotate and record the information!
Standard for microarray data: MGED Group
Microarray Gene Expression Data Group:
EBI + world’s largest microarray labs and companies
(Sanger, Stanford, TIGR, Université d'Aix-Marseille II,
Affymetrix, Agilent, NCBI, DDBJ, etc.)
MGED Group aims to
• Facilitate adoption of standards for:
– Experiment annotation
– Data representation
• Introduce standard for:
– Experimental controls
– Data normalization methods
General MIAME principles
Minimum information about a microarray experiment
NOT a formal specification BUT a set of guidelines
Sufficient information must be recorded to:
• Correctly interpret and verify the results
• Replicate the experiments
Structured information must be recorded to:
• Query and correctly retrieve the data
• Analyse the data
MIAME (Brazma et al., Nature Genetics, 2001)
Sample:
• Sample source
• Sample treatments
• Extraction protocol
• Labeling protocol
Hybridisation:
• Hybridisation protocol
Array:
• Array design information
• Location of each element
• Description of each element
• Image
• Scanning protocol
• Software specifications
• Quantification matrix
• Analysis protocol
• Software specifications
MIAME: 6 parts of a microarray experiment
[Diagram: Experiment, containing repeated Sample / Hybridisation / Array groups, plus Normalisation and Final data]
• Strategy
• Algorithm
• Control array elements
• 3 data processing levels
• Lack of gene expression measurement units!
MIAME – Annotation challenge
Annotation implementations are required !
• Avoid/reduce free text descriptions
• Use of controlled terms
• Definitions and sources for each term
• Removal of synonyms, or use of synonym mappings
• Data curation at source (LIMS)
• Integration of controlled terms in query interfaces
Facilitate data queries and analysis
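A minimal sketch of the synonym-mapping idea, assuming a small hand-made vocabulary. The terms, the synonyms and the function name normalise_term are illustrative assumptions, not part of any MGED vocabulary:

```python
# Hypothetical controlled vocabulary and synonym map, for illustration only.
CONTROLLED_TERMS = {"liver", "kidney", "heart"}
SYNONYMS = {"hepatic tissue": "liver", "renal tissue": "kidney"}


def normalise_term(free_text):
    """Map a free-text value to a controlled term, or return None if unknown."""
    key = free_text.strip().lower()
    if key in CONTROLLED_TERMS:
        return key
    # Fall back to the synonym map; unknown values yield None.
    return SYNONYMS.get(key)


print(normalise_term("  Liver "))        # liver
print(normalise_term("Hepatic tissue"))  # liver
print(normalise_term("spleen"))          # None
```

Values that come back as None are the ones a curator, or a LIMS entry form, would need to resolve before submission.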
A gene expression database from
the data analyst’s point of view
[Diagram: a gene expression matrix of gene expression levels, with genes and transcription units along one axis and samples along the other]
• Array description:
- Gene annotations
• Sample annotations:
- Source
- Treatment
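The analyst's view can be sketched as a matrix that carries annotations on both axes. This is only an illustration: the class name ExpressionMatrix, the choice of genes as rows and samples as columns, the identifiers and the numeric levels are all assumptions made for the example, not ArrayExpress structures or real data.

```python
from dataclasses import dataclass


@dataclass
class ExpressionMatrix:
    genes: list    # one annotation dict per row (assumed: rows are genes)
    samples: list  # one annotation dict per column (assumed: columns are samples)
    levels: list   # levels[i][j] = expression level of gene i in sample j

    def level(self, gene_id, sample_id):
        """Look up one expression level by gene and sample identifier."""
        i = next(k for k, g in enumerate(self.genes) if g["id"] == gene_id)
        j = next(k for k, s in enumerate(self.samples) if s["id"] == sample_id)
        return self.levels[i][j]

    def samples_where(self, key, value):
        """Return the ids of samples whose annotation matches key == value."""
        return [s["id"] for s in self.samples if s.get(key) == value]


# Placeholder identifiers and arbitrary numbers, for illustration only.
matrix = ExpressionMatrix(
    genes=[{"id": "gene_1"}, {"id": "gene_2"}],
    samples=[
        {"id": "s1", "source": "liver", "treatment": "fenofibrate"},
        {"id": "s2", "source": "liver", "treatment": "none"},
    ],
    levels=[[2.5, 1.0], [0.5, 0.75]],
)

print(matrix.samples_where("treatment", "fenofibrate"))
print(matrix.level("gene_2", "s1"))
```

Querying by sample annotation only works if the annotation values are recorded consistently, which is the point of the structured descriptions.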
MIAME - Gene annotation
Unambiguous identification
Synonyms !
• Community approved names
• Alternative to gene names
Usable external sources e.g.:
• EMBL-GenBank - sequence accession numbers
• Jackson Lab - approved mouse gene names
• HUGO - approved human gene names
• GO categories - function, process, location
MIAME - Sample annotation
Gene expression data only have a meaning in
the context of detailed sample descriptions !
Usable external sources e.g.:
• NCBI Taxonomy - organisms
• Jackson Lab - mouse strain names
• Mouse Anatomical Dictionary – mouse anatomy
• ChemID – compounds
• ICD-9 – disease classification
More is needed…
Annotation –
implementations required!
Need an ontology to describe the sample:
• Defining controlled vocabularies
• Using existing external ontologies
Integrate the ontology in LIMS and databases:
• Develop browser or interface for the ontology
• Develop internal editing tools for the ontology
However, some free-text description is
unavoidable
Talk structure
MIAME standard
MIAME annotation challenge:
• MGED BioMaterial Ontology
What are CVs and ontologies?
Controlled Vocabulary (CV):
• Set of restrictive terms used to describe
something; in the simplest case it could be a list
An ontology is more than a CV:
• Describes the relationship between the terms in
a structured way, provides semantics and
constraints
• Captures knowledge and makes it machine
processable
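The difference can be sketched in a few lines. The terms below are illustrative assumptions and are not taken from the MGED ontology; the single "is_a" relationship stands in for the richer relationships and constraints a real ontology provides.

```python
# A controlled vocabulary in its simplest form: a flat list of allowed terms.
SEX_CV = ["female", "male", "unknown"]

# A toy ontology: each term points to a broader term via an "is_a" link.
# Term names are illustrative only.
IS_A = {
    "liver": "organ",
    "kidney": "organ",
    "organ": "organism part",
}


def ancestors(term):
    """Follow is_a links upwards and return all broader terms, nearest first."""
    result = []
    while term in IS_A:
        term = IS_A[term]
        result.append(term)
    return result


def is_kind_of(term, broader):
    """True if `broader` is reachable from `term` through is_a links."""
    return broader in ancestors(term)


print("female" in SEX_CV)                   # True: a CV only supports membership
print(ancestors("liver"))                   # ['organ', 'organism part']
print(is_kind_of("kidney", "organism part"))  # True: inferred from the structure
```

Because the relationships are explicit, a query for "organ" can retrieve samples annotated as "liver" without anyone listing them by hand.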
Sample annotation –
MGED BioMaterial Ontology
Under construction by Chris Stoeckert (Univ. of
Penn.) and MGED members
Uses OilEd (RDF, DAML and HTML files available)
Motivated by MIAME and guided by ‘case scenarios’
Defines terms, provides constraints, develops CVs for
sample annotation
Links also to external CVs and ontologies
Will be extended to other parts of a microarray
experiment that need to be described
Sample annotation –
MGED BioMaterial Ontology
an example
Sample source and treatment description,
and its correct annotation using
the MGED BioMaterial Ontology
classes and corresponding external references:
“Seven week old C57BL/6N mice
were treated with fenofibrate.
Liver was dissected out, RNA prepared………”
MGED BioMaterial Ontology class | External reference | Instance
BioMaterialDescription | - | -
BiosourceProperty | - | -
Organism | NCBI Taxonomy | Mus musculus musculus, id: 39442
Age | - | 7 weeks after birth
DevelopmentStage | Mouse Anatomical Dictionary | Stage 28
Sex | - | Female
StrainOrLine | International Committee on Standardized Genetic Nomenclature for Mice | C57BL/6
BiosourceProvider | - | Charles River, Japan
OrganismPart | Mouse Anatomical Dictionary | Liver
BioMaterialManipulation | - | -
EnvironmentalHistory | - | -
CultureCondition | - | -
Temperature | - | 22 ± 2 °C
Humidity | - | 55 ± 5%
Light | - | 12 hours light/dark cycle
PathogenTests | - | Specified pathogen free conditions
Water | - | ad libitum
Nutrients | - | MF, Oriental Yeast, Tokyo, Japan
Treatment | - | -
CompoundBasedTreatment | - | -
(Compound) | ChemIDplus | Fenofibrate, CAS 49562-28-9
(Treatment_application) | - | in vivo, oral gavage
(Measurement) | - | 100 mg/kg body weight
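Part of the annotation above could be held as a nested record, sketched here under stated assumptions. The keys follow the class names on the slide and the values are the instances shown there, but the dictionary layout ("value", "source", "id") and the helper external_references are hypothetical, not an MGED or ArrayExpress format.

```python
# Hypothetical record layout; class names and instance values come from the slide.
sample_annotation = {
    "Organism": {
        "value": "Mus musculus musculus",
        "source": "NCBI Taxonomy",
        "id": "39442",
    },
    "Age": {"value": "7 weeks after birth"},
    "Sex": {"value": "Female"},
    "StrainOrLine": {
        "value": "C57BL/6",
        "source": "International Committee on Standardized "
                  "Genetic Nomenclature for Mice",
    },
    "OrganismPart": {"value": "Liver", "source": "Mouse Anatomical Dictionary"},
    "CompoundBasedTreatment": {
        "Compound": {
            "value": "Fenofibrate",
            "source": "ChemIDplus",
            "id": "CAS 49562-28-9",
        },
        "Treatment_application": {"value": "in vivo, oral gavage"},
        "Measurement": {"value": "100mg/kg body weight"},
    },
}


def external_references(annotation):
    """Collect the external sources cited anywhere in a nested annotation."""
    refs = set()
    for entry in annotation.values():
        if "value" in entry:
            # Leaf entry: record its source if it cites one.
            if "source" in entry:
                refs.add(entry["source"])
        else:
            # Grouping entry: descend into its children.
            refs |= external_references(entry)
    return refs


print(sorted(external_references(sample_annotation)))
```

Keeping the source next to each value is what lets a database check a term against its external vocabulary instead of treating it as free text.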
Talk structure
MIAME standard
Sample annotation:
• MGED BioMaterial Ontology
Uses of MIAME concepts:
• ArrayExpress
a public repository for gene expression data
• MIAMExpress
submission and annotation tool
Uses of MIAME concepts
Specifies the content of the information:
• Sufficient
• Structured
Uses:
• Creation of MIAME-compliant LIMS or databases
e.g.: ArrayExpress
• Development of submission/annotation tool for
generating MIAME-compliant information
e.g.: MIAMExpress
ArrayExpress – data flow
[Diagram of the data flow at EBI. Labels: Users; Web server; Submission (twice); LIMS; MIAMExpress; Browse-Query; ArrayExpress; Curation database; Output; MAGE-ML; Update; Image server; Central database; Data warehouse]
ArrayExpress - details
Implementation in ORACLE of the MAGE-OM model:
• Microarray Gene Expression - Object Model
• OMG approved standard (MGED and Rosetta, 2001)
• Model developed in UML
Object model-based query mechanism:
• Automatic mapping to SQL
Independent of:
• Experimental platform
• Image analysis method
• Normalization method
MAGE-ML data loader:
• Microarray Gene Expression - Markup Language generated from model
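The idea of deriving SQL from an object model can be sketched as follows. This is not the ArrayExpress implementation: it uses Python's built-in SQLite instead of ORACLE, and the class ArrayDesign, its two fields and the helper functions are hypothetical stand-ins for a MAGE-OM class.

```python
import sqlite3
from dataclasses import dataclass, fields


@dataclass
class ArrayDesign:
    # Hypothetical class with two illustrative attributes.
    name: str
    platform: str


SQL_TYPES = {str: "TEXT", int: "INTEGER", float: "REAL"}


def create_table_sql(cls):
    """Derive a CREATE TABLE statement from the class definition."""
    cols = ", ".join(f"{f.name} {SQL_TYPES[f.type]}" for f in fields(cls))
    return f"CREATE TABLE {cls.__name__} ({cols})"


def insert(conn, obj):
    """Store one object as a row in the table named after its class."""
    names = [f.name for f in fields(obj)]
    marks = ", ".join("?" for _ in names)
    conn.execute(
        f"INSERT INTO {type(obj).__name__} ({', '.join(names)}) VALUES ({marks})",
        [getattr(obj, n) for n in names],
    )


def query(conn, cls, **criteria):
    """Translate attribute criteria into a SELECT and rebuild objects."""
    where = " AND ".join(f"{k} = ?" for k in criteria)
    sql = f"SELECT * FROM {cls.__name__}" + (f" WHERE {where}" if where else "")
    return [cls(*row) for row in conn.execute(sql, list(criteria.values()))]


conn = sqlite3.connect(":memory:")
conn.execute(create_table_sql(ArrayDesign))
insert(conn, ArrayDesign("array_1", "glass"))
insert(conn, ArrayDesign("array_2", "nylon"))
print(query(conn, ArrayDesign, platform="glass"))
```

The caller works only with classes and attributes; the table layout and the SQL text follow mechanically from the model, which is what makes the storage independent of any one platform or analysis method.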
ArrayExpress – conceptual model
[Diagram: MIAME 6 parts of a microarray experiment. Experiment, containing repeated Sample / Hybridisation / Array groups, plus Normalisation and Final data]
ArrayExpress – simplified model
• Classes are represented by boxes
• Classes describe objects
• Related classes are grouped together in
packages
• MAGE-OM has 16 packages, ~ 150 tables
ArrayExpress data (via MAGE-ML)
Currently:
• Human data - EMBL (ironchip)
• Yeast data - EMBL
• S. pombe - Sanger Institute
• Available as example annotated and curated data sets
Near future:
• Array descriptions - TIGR
• Array description - Affymetrix
• Mouse data - TIGR and HGMP
• Anopheles data - EMBL
• Direct pipeline - Sanger Institute LIMS
• Data - DESPRAD partners
• Toxicogenomics data - ILSI HESI
ArrayExpress – query interface
First release 12 January 2002
ArrayExpress –
link to Expression Profiler
[Diagram of Expression Profiler components. Labels: Expression data; EPCLUST; EP:GO (GeneOntology); EP:PPI (protein-protein interactions); GENOMES; URLMAP; provide links; sequence, function, annotation; SEQLOGO; SPEXS; PATMATCH; visualise patterns; discover patterns; External data, tools (pathways, function, etc.)]
ArrayExpress – curation effort
User support and help documentation:
• Ontologies and CVs
• Minimize free text, removal of synonyms
• Help on MAGE-ML format and MAGE-OM
MIAME compliance-check
Curation at source (LIMS)
To provide high-quality, well-annotated data
and allow automated data analysis
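A MIAME compliance check could, in its simplest form, verify that required fields are present and non-empty. The sketch below assumes a flat submission record; the field names are adapted from the sample section of MIAME listed earlier, and both the names and the function missing_fields are hypothetical, not the check ArrayExpress curators run.

```python
# Hypothetical required fields, adapted from the MIAME sample description.
REQUIRED_SAMPLE_FIELDS = [
    "sample_source",
    "sample_treatments",
    "extraction_protocol",
    "labeling_protocol",
]


def missing_fields(record, required=REQUIRED_SAMPLE_FIELDS):
    """Return the required fields that are absent or empty in a submission."""
    missing = []
    for name in required:
        value = record.get(name)
        if value is None or not str(value).strip():
            missing.append(name)
    return missing


submission = {
    "sample_source": "Liver",
    "sample_treatments": "fenofibrate, oral gavage",
    "extraction_protocol": "   ",  # whitespace only counts as missing
}

print(missing_fields(submission))
# ['extraction_protocol', 'labeling_protocol']
```

A check like this only establishes that information is sufficient in form; whether a value is a valid controlled term still needs the vocabulary lookups and curation described above.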
MIAMExpress - details
Submission and annotation tool:
• Curators will monitor the submissions
Based on MIAME concepts:
• Experiment, Array and Protocol submissions
• Generates MIAME-compliant information
Uses MGED BioMaterial Ontology terms:
• Terms and required fields are explained
Allows user driven ontology development:
• User can provide new terms and their sources
Allows browsing:
• Array descriptions
• Protocols
MIAMExpress - details
Version 1 launch in December 2002
Expected users:
• Limited local bioinformatics support
• No LIMS on site
• Small scale users with custom made arrays
Can be installed as local version:
• As a lab-book to annotate your experiment
• As part of a LIMS
Interfaces:
• Version 1 is general
• Future versions, application specific interfaces
- Species specific
- Toxicogenomics specific (ILSI- HESI)
ArrayExpress - future
Load public data into ArrayExpress:
• TIGR, EMBL, ILSI HESI, DESPRAD partners
Improve query interfaces
Launch MIAMExpress v.1 (Dec. 2002)
MIAMExpress v.2:
• Extended according to the user needs
• Integrated MGED ontology
• Increased usability, flexibility and scalability
Develop curation tools
Acknowledgments
Microarray Informatics Team at EBI (19 members):
• Alvis Brazma (Team Leader and MGED President)
• Helen Parkinson (Curation Coordinator)
• Mohammad Shojatalab (MIAMExpress Database Programmer)
• Ugis Sarkans (ArrayExpress Database development coordinator)
• Jaak Vilo (Expression Profiler)
• Curators and Programmers
MGED members and working groups:
• Alvis Brazma (MGED President, MIAME)
• Chris Stoeckert, U. Penn. (MGED Ontology Working Group)
Resources and messages
Open source resources:
• ArrayExpress and MIAMExpress schema and access to code
• MIAME document and glossary
• MAGE-ML DTD and annotation examples
• MGED Ontology and other resources…
www.mged.org / www.ebi.ac.uk/microarray
Be aware of MIAME !
• Nature, Lancet and have already expressed their interest
• Funding agencies
Join MGED meetings, tutorials and mailing lists:
• MGED-5 meeting in Japan (Sept. 2002)
• Ontology for BioSample description, EBI (Nov. 2002)