ArrayExpress and Gene Expression Atlas:
Mining Functional Genomics data
Gabriella Rustici, PhD
Functional Genomics Team
EBI-EMBL
[email protected]
Talk structure
• Why do we need a database for functional genomics data?
• ArrayExpress database
  • Archive
  • Gene Expression Atlas
• Database content
• Query the database
• Data download
• Data submission
ArrayExpress
Components of a functional genomics experiment
Microarray experiment
Sample
• Sample source
• Sample treatments
• RNA extraction protocol
• Labelling protocol
Array
• Array design information
• Location of each element
• Description of each element
• Hybridization protocol
Image
• Scanning protocol
• Software specifications
Raw data
• Quantification matrix
• Software specifications
Normalized data
• Control array elements
• Normalization method
Data analysis
Sequencing experiment
Sample
• Sample source
• Sample treatments
• Template preparation
Library
• Library preparation
Chip
• Cluster amplification
• Sequencing and imaging
Data analysis
• From images to sequences
• Quality control
• Sequence alignment
• Assembly
• Specific steps depending on the application
ArrayExpress
www.ebi.ac.uk/arrayexpress/
• Is a public repository for functional genomics data, mostly generated using microarray or high-throughput sequencing (HTS) assays
• Serves the scientific community as an archive for data supporting publications, together with GEO at NCBI and CIBEX at DDBJ
• Provides easy access to well-annotated microarray data in a structured and standardized format
• Facilitates the sharing of microarray designs and experimental protocols
• Based on FGED standards: MIAME checklist, MAGE-TAB format and MGED Ontology (MO)
• MINSEQE checklist for HTS data (http://www.mged.org/minseqe/)
Reporting standards for microarrays
MIAME checklist
• Minimum Information About a Microarray Experiment
• The 6 most critical elements contributing towards MIAME are:
1. Essential sample annotation, including experimental factors and their values (e.g. compound and dose)
2. Experimental design, including sample-data relationships (e.g. which raw data file relates to which sample)
3. Sufficient array annotation (e.g. gene identifiers, genomic coordinates, probe sequences or array catalog number)
4. Essential laboratory and data processing protocols (e.g. normalization method used)
5. Raw data for each hybridization (e.g. CEL or GPR files)
6. Final normalized data for the set of hybridizations in the experiment
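The checklist above can be read as a simple completeness test on a submission. The following is a minimal sketch of that idea, not an official validator: the dictionary keys and the example values are hypothetical labels chosen for illustration.

```python
# The six MIAME elements named on the slide. The keys are hypothetical
# labels chosen for this sketch, not an official schema.
MIAME_ELEMENTS = {
    "sample_annotation": "Sample annotation with experimental factors and values",
    "experimental_design": "Experimental design with sample-data relationships",
    "array_annotation": "Array annotation",
    "protocols": "Laboratory and data processing protocols",
    "raw_data": "Raw data for each hybridization",
    "normalized_data": "Final normalized data",
}


def missing_miame_elements(submission):
    """Return the keys of MIAME elements that are absent or empty."""
    return [key for key in MIAME_ELEMENTS if not submission.get(key)]


# Illustrative submission with two elements left empty.
example = {
    "sample_annotation": {"compound": "compound X", "dose": "low"},
    "experimental_design": {"sample 1": "sample1.CEL"},
    "array_annotation": "array catalog number",
    "protocols": ["normalization method used"],
    "raw_data": [],
    "normalized_data": None,
}

print(missing_miame_elements(example))
```

An empty result would mean that all six elements are present in this simplified model.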
Reporting standards for sequencing
MINSEQE checklist
• Minimum Information about a high-throughput Nucleotide SEQuencing Experiment
• The proposed guidelines for MINSEQE are (still work in progress):
1. General information about the experiment
2. Essential sample annotation, including experimental factors and their values (e.g. compound and dose)
3. Experimental design, including sample-data relationships (e.g. which raw data file relates to which sample)
4. Essential experimental and data processing protocols
5. Sequence read data with quality scores, raw intensities and processing parameters for the instrument
6. Final processed data for the set of assays in the experiment
Reporting standards for microarrays
MAGE-TAB format
MAGE-TAB is a simple spreadsheet format that uses a number of different files to capture information about a microarray experiment:
IDF
Investigation Description Format file. Contains top-level information about the experiment, including title, description, submitter contact details and protocols.
SDRF
Sample and Data Relationship Format file. Contains the relationships between samples and arrays, as well as sample properties and experimental factors, as provided by the data submitter.
ADF
Array Design Format file. Describes the design of an array, i.e. the sequence located at each feature on the array and the annotation of the sequences.
Data files
Raw and processed data files. The ‘raw’ data files are the files produced by the microarray image analysis software, such as CEL files for Affymetrix or GPR files from GenePix. The processed data file is a ‘data matrix’ file containing processed values, as provided by the data submitter.
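Because the SDRF is a tab-delimited spreadsheet, the sample-to-file relationships it records can be read with standard tools. The sketch below assumes a tiny made-up SDRF; the rows and file names are invented for illustration, and real files carry many more columns.

```python
import csv
import io

# Made-up SDRF content for illustration; real files are supplied by submitters.
SDRF_TEXT = (
    "Source Name\tCharacteristics[organism]\tFactor Value[compound]\tArray Data File\n"
    "sample 1\tMus musculus\tcontrol\tsample1.CEL\n"
    "sample 2\tMus musculus\ttreated\tsample2.CEL\n"
)


def raw_file_by_sample(sdrf_text):
    """Map each sample name to the raw data file listed on the same row."""
    reader = csv.DictReader(io.StringIO(sdrf_text), delimiter="\t")
    return {row["Source Name"]: row["Array Data File"] for row in reader}


def factor_values(sdrf_text, factor):
    """Map each sample name to its value for one experimental factor."""
    column = f"Factor Value[{factor}]"
    reader = csv.DictReader(io.StringIO(sdrf_text), delimiter="\t")
    return {row["Source Name"]: row[column] for row in reader}


print(raw_file_by_sample(SDRF_TEXT))
print(factor_values(SDRF_TEXT, "compound"))
```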
Reporting standards
What semantics (or ontology) should we use to best describe the annotation of an experiment?

An ontology is a formal specification of terms in a particular subject area and the relations among them.

Its purpose is to provide a basic, stable and unambiguous description of such terms and relations, in order to avoid improper and inconsistent use of the terminology pertaining to a given domain.

Thus far, Gene Ontology (GO) has been the most successful ontology initiative. GO is a controlled vocabulary used to describe the biology of a gene product in any organism.
Reporting standards for microarrays
MGED ontology (MO)

The MO provides terms for annotating all aspects of a microarray experiment, from the design of the experiment and array layout through to the preparation of the biological sample and the protocols used to hybridize the RNA and analyze the data.

The MO was developed to provide terms for annotating experiments in line with the MIAME guidelines, i.e. to provide the semantics to describe a microarray experiment according to the concepts specified in MIAME.

Also check the Open Biomedical Ontologies (OBO) initiative (www.obofoundry.org) for the development of life-science ontologies.
ArrayExpress – two databases
How to query AE and Atlas?
AE Archive
• Query by experiment, sample and experimental
factor annotations
• Filter on species, array platform, molecule assayed
and technology used
Gene Expression Atlas
• Gene and/or condition queries
• Query across experiments and across platforms
ArrayExpress – two databases
How much data in AE Archive?
Archive by species
Browsing the AE Archive
• AE unique experiment ID
• Curated title of experiment
• Number of assays
• The date when the data were loaded in the Archive
• Species investigated
• 'Loaded in Atlas' flag
• Raw sequencing data available in ENA
The list of experiments retrieved can be printed, saved in tab-delimited format, or exported to Excel or as an RSS feed.
Browsing the AE Archive
• The total number of experiments and assays retrieved
• The direct link to raw and processed data. An icon indicates that this type of data is available.
Experimental factor ontology (EFO)
http://www.ebi.ac.uk/efo
• Application-focused ontology modeling experimental factors (EFs) in AE
• Developed to:
  • increase the richness of annotations that are currently made in AE Archive
  • promote consistency
  • facilitate automatic annotation and integrate external data
• EFs are transformed into an ontological representation, forming classes and relationships between those classes
• EFO terms map to multiple existing domain-specific ontologies, such as the Disease Ontology and Cell Type Ontology
Experimental factor ontology (EFO)
An example
Searching AE Archive
Simple query - EFO
Searching AE Archive
Simple query
• Search across all fields:
  • AE accession number, e.g. E-MEXP-568
  • Secondary accession numbers, e.g. GEO series accession GSE5389
  • Experiment name
  • Submitter's experiment description
  • Sample attributes, experimental factors and values, including species (e.g. GeneticModification, Mus musculus, DREB2C over-expression)
  • Publication title, authors and journal name, PubMed ID
• Synonyms for terms are always included in searches, e.g. 'human' and 'Homo sapiens'
AE Archive query output
• Matches to exact terms are highlighted in yellow
• Matches to synonyms are highlighted in green
• Matches to child terms in the EFO are highlighted in pink
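The three highlight colours correspond to three kinds of match: the exact term, a synonym, or a child term in the EFO. The sketch below illustrates that classification with a toy vocabulary; the synonym and child-term entries are invented for the example and are not taken from EFO, and the real search engine is certainly more elaborate.

```python
# Toy vocabulary for illustration only; these entries are not taken from EFO.
SYNONYMS = {"human": {"homo sapiens"}}
CHILD_TERMS = {"cancer": {"leukemia", "carcinoma"}}


def match_colour(query, text):
    """Return the highlight colour for a hit, or None when nothing matches."""
    query, text = query.lower(), text.lower()
    if query in text:
        return "yellow"  # exact term
    if any(synonym in text for synonym in SYNONYMS.get(query, ())):
        return "green"   # synonym of the query term
    if any(child in text for child in CHILD_TERMS.get(query, ())):
        return "pink"    # child term of the query in the ontology
    return None


print(match_colour("cancer", "kidney cancer"))
print(match_colour("human", "Homo sapiens liver"))
print(match_colour("cancer", "acute leukemia samples"))
```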
AE Archive – experiment view
How does processed data look?
• Sample annotation
• Samples
• Genes
• Gene annotations
• Gene expression levels or count level data
AE Archive – SDRF file
SDRF file – sample & data relationship
AE Archive – ADF file
AE Archive – Old interface
AE Archive – all files
Searching AE Archive
Advanced query
• Combine search terms
  • Enter two or more keywords in the search box with the operators AND, OR or NOT. AND is the default operator; a search for 'kidney cancer' will return hits with a match to 'kidney' AND 'cancer'
  • Search terms of more than one word must be entered inside quotes, otherwise only the first word will be searched for, e.g. "kidney cancer"
• Specify fields for searches
  • Particular fields for searching can also be specified in the format fieldname:value
Searching AE Archive
Advanced query - fieldnames
Field name | Searches | Example
accession | Experiment primary or secondary accession | accession:E-MEXP-568
array | Array design accession or name | array:AFFY-2 OR array:Agilent*
ef | Experimental factor, the name of the main variables in an experiment | ef:celltype OR ef:compound
efv | Experimental factor value. Has EFO expansion. | efv:fibroblast
expdesign | Experiment design type | expdesign:"dose response"
exptype | Experiment type. Has EFO expansion. | exptype:RNA-seq
gxa | Presence in the Gene Expression Atlas. Only value is gxa:true. | ef:compound AND gxa:true
pmid | PubMed identifier | pmid:16553887
sa | Sample attribute values. Has EFO expansion. | sa:wild_type
species | Species of the samples. Has EFO expansion. | species:"homo sapiens" AND ef:cellline
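Queries in the fieldname:value format can be assembled programmatically. The helper names below (`field_term`, `combine`) are hypothetical and only build the query text to be typed into the search box; they do not call any ArrayExpress service.

```python
def field_term(field, value):
    """Format one fieldname:value term, quoting values that contain spaces."""
    if " " in value:
        value = f'"{value}"'
    return f"{field}:{value}"


def combine(operator, *terms):
    """Join terms with one of the operators AND, OR or NOT."""
    if operator not in {"AND", "OR", "NOT"}:
        raise ValueError(f"unsupported operator: {operator}")
    return f" {operator} ".join(terms)


query = combine(
    "AND",
    field_term("species", "homo sapiens"),
    field_term("ef", "cellline"),
)
print(query)  # species:"homo sapiens" AND ef:cellline
```

Quoting multi-word values inside the helper follows the rule that search terms of more than one word must be entered inside quotes.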
Searching AE Archive
Advanced query
• Filtering experiments by counts of a particular attribute
  • Experiments fulfilling certain count criteria can also be searched for, e.g. having more than 10 assays (hybridizations)
Filter | What is filtered
assaycount:[x TO y] | Filter on the number of assays, where x <= y and both values are between 0 and 99,999 (inclusive). To exclude the given values, use curly brackets, e.g. assaycount:{1 TO 5} will find experiments with 2-4 assays. Single numbers may also be given, e.g. assaycount:10 will find experiments with 10 assays.
efcount:[x TO y] | Filter on the number of experimental factors
samplecount:[x TO y] | Filter on the number of samples
sacount:[x TO y] | Filter on the number of sample attribute categories
rawcount:[x TO y] | Filter on the number of raw files
fgemcount:[x TO y] | Filter on the number of final gene expression matrix (processed data) files
miamescore:[x TO y] | Filter on the MIAME compliance score (maximum score is 5)
date:yyyy-mm-dd | Filter by release date
• date:2009-12-01 will search for experiments released on 1 December 2009
• date:2009* will search for experiments released in 2009
• date:[2008-01-01 2008-05-31] will search for experiments released between 1 January and the end of May 2008
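The difference between square and curly brackets in the count filters is the usual inclusive versus exclusive range. The sketch below re-implements that rule locally to make it concrete; `count_matches` is a hypothetical helper, not part of ArrayExpress, and it does not enforce the 0 to 99,999 limit.

```python
import re

# Patterns for the two filter shapes described above.
RANGE = re.compile(r"^(\w+):([\[{])(\d+) TO (\d+)([\]}])$")
SINGLE = re.compile(r"^(\w+):(\d+)$")


def count_matches(filter_text, count):
    """Return True when `count` satisfies a filter such as assaycount:[1 TO 5]."""
    single = SINGLE.match(filter_text)
    if single:
        return count == int(single.group(2))
    ranged = RANGE.match(filter_text)
    if ranged is None:
        raise ValueError(f"unrecognised filter: {filter_text}")
    brackets = (ranged.group(2), ranged.group(5))
    low, high = int(ranged.group(3)), int(ranged.group(4))
    if brackets == ("{", "}"):   # curly brackets exclude the bounds
        return low < count < high
    if brackets == ("[", "]"):   # square brackets include the bounds
        return low <= count <= high
    raise ValueError(f"mismatched brackets: {filter_text}")


# assaycount:{1 TO 5} keeps experiments with 2, 3 or 4 assays.
print([n for n in range(0, 7) if count_matches("assaycount:{1 TO 5}", n)])
```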
Searching AE Archive
Advanced query – an example
Exercise 1
ArrayExpress – two databases
Gene Expression Atlas
Experiment selection criteria
• The criteria we use for selecting experiments for inclusion in the Atlas are as follows:
  • Array designs relating to the experiment must be provided to enable re-annotation using Ensembl or UniProt (or have the potential for this to be done)
  • High MIAME scores
  • Experiment must have 6 or more hybridizations
  • Sufficient replication and large sample size
  • EF and EFV must be well annotated
  • Adequate sample annotation must be provided
  • Processed data must be provided, or raw data which can be renormalized must be available
Gene Expression Atlas
Atlas construction
• New meta-analytical tool for searching gene expression profiles across experiments in AE
• Data is taken as normalized by the submitter
• Gene-wise linear models (limma) and t-statistics are applied to calculate the strength of genes' differential expression across conditions across experiments
• The result is a two-dimensional matrix where rows correspond to genes and columns correspond to conditions, rather than samples
• The matrix entries are p-values together with a sign, indicating the significance and direction of differential expression
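The gene-by-condition matrix of signed p-values can be pictured with a small example. The sketch below does not reproduce limma: it assumes the t-statistics and p-values are already computed, and the gene names, conditions, numbers and the 0.05 cut-off are all illustrative assumptions.

```python
# Illustrative inputs only: (t-statistic, p-value) per gene and condition.
STATS = {
    "geneA": {"liver": (4.2, 0.001), "kidney": (-0.3, 0.80)},
    "geneB": {"liver": (-3.1, 0.01), "kidney": (2.9, 0.02)},
}
ALPHA = 0.05  # assumed significance cut-off; the slides do not state one


def signed_p(t_stat, p_value):
    """Attach the direction of change to a p-value (negative means down)."""
    return p_value if t_stat >= 0 else -p_value


def direction(t_stat, p_value, alpha=ALPHA):
    """Label an entry as up-regulated, down-regulated or no change."""
    if p_value >= alpha:
        return "no change"
    return "up-regulated" if t_stat > 0 else "down-regulated"


def build(stats, entry):
    """Build a genes x conditions matrix by applying `entry` to each cell."""
    return {
        gene: {cond: entry(t, p) for cond, (t, p) in conds.items()}
        for gene, conds in stats.items()
    }


signed_matrix = build(STATS, signed_p)
call_matrix = build(STATS, direction)
print(signed_matrix)
print(call_matrix)
```

Rows are genes and columns are conditions, so one cell summarizes a gene's behaviour in a condition rather than in an individual sample.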
Gene Expression Atlas
Atlas construction
• up-regulated
• down-regulated
• no change
Gene Expression Atlas
Atlas home page
http://www.ebi.ac.uk/gxa/
• Query for gene(s)
• Restrict search by direction of differential expression
• Query for condition(s)
• The ‘advanced search’ option allows building more complex queries
Atlas home page
The ‘Genes’ search box & auto-complete function
Atlas home page
The ‘Conditions’ search box & ontology browsing
Atlas home page
A single gene query
Atlas gene summary page
Atlas experiment page
• Experimental factors list
• Expression plot
• Table containing gene information and drop-down menus for searching within the experiment
Atlas experiment page – HTS data
Atlas home page
A ‘Conditions’ only query
Atlas heatmap view
Atlas list view
Click the ‘expression profile’ link to view the experiment page
Atlas data download
Atlas gene-condition query
Atlas query refining
Atlas gene-condition query
Atlas query refining
Exercises 2, 3 & 4
Data submission to AE
www.ebi.ac.uk/microarray/submissions.html
Submission of HTS gene expression data
• Submit via the MAGE-TAB submission route
• Submit:
  • MAGE-TAB spreadsheet containing details of the samples and protocols used
  • Trace data files for each sample (in SRF, FASTQ or SFF format)
  • Processed data files
• For non-human species we will supply your SRF or FASTQ files to the European Nucleotide Archive (ENA).
• If you have human identifiable sequencing data, you need to submit to the European Genome-phenome Archive and not ArrayExpress. They will supply you with a suitable template for submission and store human identifiable data securely.
Types of data that can be submitted
What happens after submission?
• Email confirmation
• Curation
• The curation team will review your submission and will email you with any questions.
• Possible reopening for editing
• We will send you an accession number when all the required information has been provided.
• We will load your experiment into ArrayExpress and provide you with a reviewer login for viewing the data before it is made public.