ArrayExpress and Gene Expression Atlas:
Mining Functional Genomics data
Emma Hastings
Functional Genomics Team
EBI-EMBL
[email protected]
http://www.ebi.ac.uk/~emma/Zagreb/
Services available from EBI
Literature and ontologies
Genomes
Protein sequence
DNA & RNA sequence
Protein structure
Gene expression
Chemical entities
Protein families, motifs and domains
Protein interactions
Pathways
Systems
www.ebi.ac.uk
Talk structure
• Why do we need a database for functional genomics data?
• ArrayExpress database
  • Archive
  • Gene Expression Atlas
• Database content
• Query the database
• Data download
• Data submission
ArrayExpress
What is functional genomics?
Functional genomics is a field of molecular biology that attempts
to make use of the vast wealth of data produced by genomic
projects (such as genome sequencing projects) to describe gene
(and protein) functions and interactions. Unlike genomics and
proteomics, functional genomics focuses on the dynamic aspects
such as gene transcription, translation, and protein-protein
interactions, as opposed to the static aspects of the genomic
information such as DNA sequence or structures. Functional
genomics attempts to answer questions about the function of DNA
at the levels of genes, RNA transcripts, and protein products.
ArrayExpress & Atlas
Components of a functional genomics experiment
Microarray experiment (stages: Sample, Array, Raw data, Normalized data, Data analysis)
• Sample source
• Sample treatments
• RNA extraction protocol
• Labelling protocol
• Array design information
• Location of each element
• Description of each element
• Hybridization protocol
• Image
• Scanning protocol
• Software specifications
• Control array elements
• Normalization method
Sequencing experiment (stages: Sample, Library, Chip, Data analysis)
• Sample source
• Sample treatments
• Template preparation
• Library preparation
• Cluster amplification
• Sequencing and imaging
• From images to sequences
• Quality Control
• Sequence alignment
• Assembly
• Specific steps depending on the application
• Quantification matrix
• Software specifications
Why do we need a database for functional
genomics data?
E-MEXP-2451
Transcription profiling of grape berries during ripening
Grape berries undergo considerable physical and biochemical
changes during the ripening process. Ripening is characterized by a
number of changes, including the degradation of chlorophyll, an
increase in berry deformability, a rapid increase in the level of
hexoses in the berry vacuole, an increase in berry volume, the
catabolism of organic acids, the development of skin colour, and the
formation of compounds that influence flavour, aroma, and therefore,
wine quality. The aim of this work is to identify differentially
expressed genes during grape ripening by microarray and real-time PCR techniques. Using a custom array of new generation, we
analysed the expression of 6000 grape genes from pre-veraison to
full maturity, in Vitis vinifera cultivar Muscat of Hamburg, in two
different years (2006 and 2007). Five time points per year and two
biological replicates per stage were considered. To reduce intra-plant and inter-plant biological variability, for each ripening stage
we collected around a hundred berries from several grape bunches of
five plants of V. vinifera cv Muscat of Hamburg. We will use the real-time
PCR technique to validate microarray data.
ArrayExpress
www.ebi.ac.uk/arrayexpress/
• Is a public repository for functional genomics data, mostly generated using microarray or high-throughput sequencing (HTS) assays
• Serves the scientific community as an archive for data supporting publications, together with GEO at NCBI and CIBEX at DDBJ
• Provides easy access to well-annotated microarray data in a structured and standardized format
• Facilitates the sharing of microarray designs and experimental protocols
• Based on FGED standards: MIAME checklist, MAGE-TAB format and MGED Ontology (MO)
• MINSEQE checklist for HTS data (http://www.mged.org/minseqe/)
Reporting standards
Data standardization efforts focus on answering 3 major
questions:
1. What information do we need to capture?
2. What syntax (or file format) should we use to exchange
data?
3. What semantics (or ontology) should we use to best
describe its annotation?
What information do we need to capture?
MIAME checklist
• Minimal Information About a Microarray Experiment
• The 6 most critical elements contributing towards MIAME are:
  1. Essential sample annotation including experimental factors and their values (e.g. compound and dose)
  2. Experimental design including sample data relationships (e.g. which raw data file relates to which sample, ...)
  3. Sufficient array annotation (e.g. gene identifiers, genomic coordinates, probe sequences or array catalog number)
  4. Essential laboratory and data processing protocols (e.g. normalization method used)
  5. Raw data for each hybridization (e.g. CEL or GPR files)
  6. Final normalized data for the set of hybridizations in the experiment
What information do we need to capture?
MINSEQE checklist
• Minimal Information about a high-throughput Nucleotide SEQuencing Experiment
• The proposed guidelines for MINSEQE are (still a work in progress):
  1. General information about the experiment
  2. Essential sample annotation including experimental factors and their values (e.g. compound and dose)
  3. Experimental design including sample data relationships (e.g. which raw data file relates to which sample, ...)
  4. Essential experimental and data processing protocols
  5. Sequence read data with quality scores, raw intensities and processing parameters for the instrument
  6. Final processed data for the set of assays in the experiment
What syntax (or file format) should we use to
exchange data?
MAGE-TAB is a simple spreadsheet format that uses a number of different files to capture information about a microarray experiment:
• IDF: Investigation Description Format file. Contains top-level information about the experiment, including title, description, submitter contact details and protocols.
• SDRF: Sample and Data Relationship Format file. Contains the relationships between samples and arrays, as well as sample properties and experimental factors, as provided by the data submitter.
• ADF: Array Design Format file. Describes the design of an array, i.e. the sequence located at each feature on the array and the annotation of the sequences.
• Data files: raw and processed data files. The ‘raw’ data files are the files produced by the microarray image analysis software, such as CEL files for Affymetrix or GPR files from GenePix. The processed data file is a ‘data matrix’ file containing processed values, as provided by the data submitter.
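As a rough illustration of how an SDRF ties samples to data files, the sketch below reads a tiny tab-delimited SDRF-like text with Python's csv module. The three columns are a heavily trimmed subset and the sample and file names are hypothetical; real SDRF files carry many more columns.

```python
import csv
import io

# Hypothetical, heavily trimmed SDRF content (tab-delimited).
SDRF_TEXT = (
    "Source Name\tCharacteristics[Organism]\tArray Data File\n"
    "sample 1\tMus musculus\tsample1.CEL\n"
    "sample 2\tMus musculus\tsample2.CEL\n"
)


def sample_to_data_file(sdrf_text):
    """Map each Source Name to its raw data file, assuming one SDRF row per sample."""
    reader = csv.DictReader(io.StringIO(sdrf_text), delimiter="\t")
    return {row["Source Name"]: row["Array Data File"] for row in reader}


mapping = sample_to_data_file(SDRF_TEXT)
print(mapping)
```

Because the file is plain tab-delimited text, the same approach works on a file opened from disk instead of the in-memory string used here.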
What semantics (or ontology) should we use to
best describe its annotation?
An ontology is a formal specification of terms in a particular subject area and the relations among them.

Its purpose is to provide a basic, stable and unambiguous description of such terms and relations in order to avoid improper and inconsistent use of the terminology pertaining to a given domain.
Experimental factor ontology (EFO)
http://www.ebi.ac.uk/efo
• Application-focused ontology modeling experimental factors (EFs) in AE
• Developed for:
  • Consistent curation: ensures that curators use the same vocabulary
  • Query support (e.g. a query for 'cancer' also returns 'leukemia')
• EFs are transformed into an ontological representation, forming classes and relationships between those classes
• EFO terms map to multiple existing domain-specific ontologies, such as the Disease Ontology and Cell Type Ontology
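The query support described above (a query for 'cancer' also finding 'leukemia') can be sketched as a walk over a parent-to-children hierarchy. The hierarchy fragment, experiment IDs and annotations below are hypothetical and are not taken from the real EFO; this is only a minimal sketch of the idea of query expansion.

```python
# Toy parent -> children hierarchy; a hypothetical fragment, not the real EFO.
CHILDREN = {
    "cancer": ["leukemia", "carcinoma"],
    "carcinoma": ["breast carcinoma", "renal carcinoma"],
}


def expand_term(term, children=CHILDREN):
    """Return the term followed by all of its descendants (depth-first)."""
    expanded = [term]
    for child in children.get(term, []):
        expanded.extend(expand_term(child, children))
    return expanded


def search(annotations, term):
    """Return IDs of experiments annotated with the term or any descendant."""
    wanted = set(expand_term(term))
    return sorted(exp_id for exp_id, value in annotations.items() if value in wanted)


# Hypothetical annotations: experiment ID -> experimental factor value.
ANNOTATIONS = {"EXP-1": "leukemia", "EXP-2": "fibroblast", "EXP-3": "renal carcinoma"}
hits = search(ANNOTATIONS, "cancer")
print(hits)
```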
ArrayExpress – two databases
ArrayExpress Archive
22281
...and counting
How much data in AE Archive?
Archive by species
Browsing the AE Archive
• AE unique experiment ID
• Curated title of experiment
• Number of assays
• The date when the data were released
• Species investigated
• "Loaded in Atlas" flag
• Raw sequencing data available in ENA
• The direct link to raw and processed data. An icon indicates that this type of data is available.
• The total number of experiments and assays retrieved
• The list of experiments retrieved can be printed, saved in tab-delimited format, exported to Excel or followed as an RSS feed
Experimental factor ontology (EFO)
Example:
acinar cell carcinoma
adrenocortical carcinoma
bladder carcinoma
breast carcinoma
cervical carcinoma
esophageal carcinoma
follicular thyroid carcinoma
hepatocellular carcinoma
carcinoma
pancreatic carcinoma
papillary thyroid carcinoma
renal carcinoma
signet ring cell carcinoma
uterine carcinoma
adenocarcinoma
adenoid cystic carcinoma
endometrioid carcinoma
gastric carcinoma
hereditary leiomyomatosis and renal cell cancer
hypopharyngeal carcinoma
Searching AE Archive
Simple query - EFO
•Matches to exact terms are highlighted in yellow
•Matches to child terms in the EFO are highlighted in pink
Searching AE Archive
Simple query
• Search across all fields:
  • AE accession number e.g. E-MEXP-568
  • Secondary accession numbers e.g. GEO series accession GSE5389
  • Experiment name
  • Submitter's experiment description
  • Sample attributes, experimental factors and values, including species (e.g. GeneticModification, Mus musculus, DREB2C over-expression)
  • Publication title, authors and journal name, PubMed ID
• Synonyms for terms are always included in searches e.g. 'human' and 'Homo sapiens'
AE Archive query output
• Matches to exact terms are highlighted in yellow
AE Archive – experiment view
• MIAME or MINSEQE scores show how far the experiment complies with the standard
• Links to the available files. These vary between sequencing and microarray data. For microarray experiments there is also an array design file.
• Experimental factor(s) and their values
AE Archive – SDRF file
SDRF file – sample & data relationship
AE Archive – ADF file
AE Archive – all files
Searching AE Archive
Advanced query
• Combine search terms
  • Enter two or more keywords in the search box with the operators AND, OR or NOT. AND is the default operator; a search for kidney cancer will return hits with a match to 'kidney' AND 'cancer'
  • Search terms of more than one word must be entered inside quotes, otherwise only the first word will be searched for, e.g. "kidney cancer"
• Specify fields for searches
  • Particular fields for searching can also be specified in the format fieldname:value
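The rules above (quote multi-word terms, join terms with AND, OR or NOT, and prefix a field name) can be captured in a few helper functions. The helper names are hypothetical and the sketch only builds the query text; it does not contact ArrayExpress.

```python
def quote(value):
    """Wrap multi-word values in double quotes so they are searched as one phrase."""
    return f'"{value}"' if " " in value else value


def field_query(fieldname, value):
    """Build a fieldname:value term."""
    return f"{fieldname}:{quote(value)}"


def combine(terms, operator="AND"):
    """Join terms with one of the operators AND, OR or NOT."""
    if operator not in {"AND", "OR", "NOT"}:
        raise ValueError(f"unsupported operator: {operator}")
    return f" {operator} ".join(terms)


query = combine([field_query("species", "homo sapiens"), field_query("ef", "cellline")])
print(query)
```

The resulting string is what would be typed into the search box.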
Searching AE Archive
Advanced query - fieldnames
Field name | Searches | Example
accession | Experiment primary or secondary accession | accession:E-MEXP-568
array | Array design accession or name | array:AFFY-2 OR array:Agilent*
ef | Experimental factor, the name of the main variables in an experiment | ef:celltype OR ef:compound
efv | Experimental factor value. Has EFO expansion. | efv:fibroblast
expdesign | Experiment design type | expdesign:"dose response"
exptype | Experiment type. Has EFO expansion. | exptype:RNA-seq
gxa | Presence in the Gene Expression Atlas. Only value is gxa:true. | ef:compound AND gxa:true
pmid | PubMed identifier | pmid:16553887
sa | Sample attribute values. Has EFO expansion. | sa:wild_type
species | Species of the samples. Has EFO expansion. | species:"homo sapiens" AND ef:cellline
Searching AE Archive
Advanced query
• Filtering experiments by counts of a particular attribute
  • Experiments fulfilling certain count criteria can also be searched for, e.g. those having more than 10 assays (hybridizations)

Filter | What is filtered
assaycount:[x TO y] | Filter on the number of assays, where x <= y and both values are between 0 and 99,999 (inclusive). To count excluding the values given, use curly brackets, e.g. assaycount:{1 TO 5} will find experiments with 2-4 assays. Single numbers may also be given, e.g. assaycount:10 will find experiments with 10 assays.
efcount:[x TO y] | Filter on the number of experimental factors
samplecount:[x TO y] | Filter on the number of samples
sacount:[x TO y] | Filter on the number of sample attribute categories
rawcount:[x TO y] | Filter on the number of raw files
fgemcount:[x TO y] | Filter on the number of final gene expression matrix (processed data) files
miamescore:[x TO y] | Filter on the MIAME compliance score (maximum score is 5)
date:yyyy-mm-dd | Filter by release date
  • date:2009-12-01 will search for experiments released on 1st of Dec 2009
  • date:2009* will search for experiments released in 2009
  • date:[2008-01-01 2008-05-31] will search for experiments released between 1st of Jan and end of May 2008
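The difference between square and curly brackets in the count filters can be made concrete with a small parser. This is only a sketch of the semantics described for assaycount; the function names are hypothetical, and allowing a mix of bracket types in one filter is an assumption of this sketch, not something the slides state.

```python
import re

# field:[x TO y] or field:{x TO y}; mixed brackets are accepted here by assumption.
RANGE = re.compile(r"^(\w+):([\[{])(\d+) TO (\d+)([\]}])$")


def parse_count_filter(text):
    """Parse e.g. 'assaycount:{1 TO 5}' into (field, lowest, highest), both inclusive."""
    match = RANGE.match(text)
    if match is None:
        raise ValueError(f"not a range filter: {text}")
    field, opener, low, high, closer = match.groups()
    low, high = int(low), int(high)
    # Curly brackets exclude the stated bound, square brackets include it.
    if opener == "{":
        low += 1
    if closer == "}":
        high -= 1
    return field, low, high


def matches(count, text):
    """Check whether a count satisfies the range filter."""
    _, low, high = parse_count_filter(text)
    return low <= count <= high


print(parse_count_filter("assaycount:{1 TO 5}"))
```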
Searching AE Archive
Advanced query – Examples
Exercise 1
Gene Expression Atlas
Experiment selection criteria
• The criteria we use for selecting experiments for inclusion in the Atlas are as follows:
  • Array designs relating to the experiment must be provided to enable re-annotation using Ensembl or UniProt (or have the potential for this to be done)
  • High MIAME scores
  • Experiment must have 6 or more hybridizations
  • Sufficient replication and large sample size
  • EF and EFV must be well annotated
  • Adequate sample annotation must be provided
  • Processed data must be provided, or raw data which can be renormalized must be available
Gene Expression Atlas
Atlas construction
• New meta-analytical tool for searching gene expression profiles across experiments in AE
• Data are taken as normalized by the submitter
• Gene-wise linear models (limma) and t-statistics are applied to calculate the strength of genes' differential expression across conditions across experiments
• The result is a two-dimensional matrix where rows correspond to genes and columns correspond to conditions, rather than samples
• The matrix entries are p-values together with a sign, indicating the significance and direction of differential expression
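To give a feel for the genes-by-conditions matrix, the sketch below computes a signed Welch t-statistic for each gene and condition, comparing that condition's samples with all remaining samples. This is a deliberately simplified stand-in and not the Atlas pipeline: it does not use limma, it does not moderate variances, and it reports t-statistics rather than p-values. The one-versus-rest comparison, the gene names and the expression values are all assumptions made for illustration.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(group, rest):
    """Welch t-statistic comparing one condition's samples with the remaining samples."""
    standard_error = sqrt(variance(group) / len(group) + variance(rest) / len(rest))
    return (mean(group) - mean(rest)) / standard_error


def condition_matrix(expression):
    """Return {gene: {condition: signed t-statistic}}: rows are genes, columns conditions."""
    matrix = {}
    for gene, by_condition in expression.items():
        matrix[gene] = {}
        for condition, values in by_condition.items():
            rest = [
                value
                for other, other_values in by_condition.items()
                if other != condition
                for value in other_values
            ]
            matrix[gene][condition] = welch_t(values, rest)
    return matrix


# Hypothetical normalized values: gene -> condition -> replicate measurements.
EXPRESSION = {
    "geneA": {"control": [5.0, 5.2, 4.8], "treated": [8.0, 8.3, 7.9]},
    "geneB": {"control": [6.1, 6.0, 5.9], "treated": [5.5, 5.7, 5.3]},
}
matrix = condition_matrix(EXPRESSION)
for gene, row in matrix.items():
    print(gene, {c: ("up" if t > 0 else "down") for c, t in row.items()})
```

The sign of each entry gives the direction of differential expression; turning the statistic into a p-value would need a t-distribution, which is left out here to keep the sketch within the standard library.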
Atlas home page
http://www.ebi.ac.uk/gxa/
• Query for genes
• Restrict query by direction of differential expression
• Query for conditions
• The ‘advanced query’ option allows building more complex queries
Atlas home page
The ‘Genes’ search box
Atlas home page
The ‘Conditions’ search box
Atlas home page
A single gene query
Atlas gene summary page
Atlas home page
A ‘Conditions’ only query
Atlas heatmap view
Atlas list view
Click the ‘expression profile’ link to view the experiment page
Atlas data download
Atlas experiment page
• Box plot showing differential expression of Mt2, Agxt2l1 and Insig2 across the growth conditions studied. Data are available for 3 separate array probes.
• Search for a specific gene, experimental factor or expression level (up/down)
Atlas experiment page – HTS data
Atlas gene-condition query
Atlas query refining
Exercises 2 & 3
Data submission to AE
www.ebi.ac.uk/microarray/submissions.html
MIAMExpress - Overview
• MIAMExpress is a web-based tool for submitting microarray data and microarray designs to the ArrayExpress database
• MIAMExpress can be used for experiments of up to 50 hybridizations in size
MIAMExpress - Getting Started
[Screenshot labels: Protocol, Array design (custom), Experiment]
MIAMExpress- Protocol Submission
Change the automatically generated protocol name to something meaningful, like Sanger Lab Growth Protocol
MIAMExpress- Experiment Submission
• Batchloader interface, where you fill in your experiment information in a spreadsheet-like format
• Web page interface, in which a series of web forms are filled out
MIAMExpress-Batch Upload Tool
[Screenshot tabs: Experiment description, Samples, Extracts, Labeled extracts, Hybs, Upload raw and norm data, Upload combined data files, HELP]
• Toolbar: click multiply to create as many rows as you need. Hint: if you complete all the values in the first row before doing this, it will save you time.
• Cross document functionality: double click and select one or more samples per extract.
• Affymetrix users submit the .CEL file here (and .EXP if available). For all other submissions the raw data must be a single .txt or .gpr file. Use the folder icon to select the required data files.
• If you are supplying a Final Gene Expression Data File you must supply a transformation protocol.
MIAMExpress-Data File Formats for Submission
• Submitted normalized data files must contain data from a single hybridization only. If your normalization procedure creates a file containing data from all your hybs then you can submit this as a final gene expression matrix (FGEM).
• A normalization protocol should be submitted along with your normalized data files; be precise when describing how the data were calculated.
• A final gene expression matrix (FGEM) or combined data file is a file containing data from several hybridizations.
• The creation of your FGEM must be described in your transformation protocol.
• The format of the FGEM is as follows:
  • each line corresponds to an array element
  • each column corresponds to your calculated value
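The FGEM layout described above (one line per array element, one column per calculated value) can be checked with a short reader. The column headings, element names and values below are hypothetical, and tab-delimited text is an assumption; the sketch only shows the row and column structure.

```python
import csv
import io

# Hypothetical FGEM content: one row per array element, one column per hybridization.
FGEM_TEXT = (
    "Element\thyb_1\thyb_2\n"
    "probe_a\t9.24\t9.10\n"
    "probe_b\t5.90\t6.05\n"
)


def read_fgem(text):
    """Return (hybridization names, {array element: [values]})."""
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    header, body = rows[0], rows[1:]
    hybs = header[1:]
    values = {}
    for row in body:
        # Every array element needs exactly one value per hybridization.
        if len(row) != len(header):
            raise ValueError(f"{row[0]}: {len(row) - 1} values, expected {len(hybs)}")
        values[row[0]] = [float(cell) for cell in row[1:]]
    return hybs, values


hybs, values = read_fgem(FGEM_TEXT)
print(hybs, values)
```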
Array Submission - Background
• An array design describes how a microarray was manufactured, what was printed/synthesized at each position on the array and what biological sequences these represent.
• If the array design has already been submitted to ArrayExpress, you do not need to re-submit it.
• If you have used a custom array design you will need to submit the array design as an Array Design Format (ADF) file.
• After submitting an array design you can immediately continue with your experiment submission.
• The ADF file can be created in any spreadsheet application but must be saved as a tab-delimited text file.
MIAMExpress-Array Submission
• Enter some general information about your array
• Checker: this tool will check for errors in the format and content of an ADF file.
MIAMExpress-Array Submission
• Location and name of each reporter (probe) on the array
• Annotation such as the nucleotide sequence or database accession associated with the reporter e.g. RefSeq
• Type of reporter (cDNA, oligonucleotide, RNA, DNA etc) that is present
• Group of reporter: which are the control and which are the experimental reporters
• Optional additional information: this can be used to show that several reporters are associated with the same gene for example, or to add a comment about a reporter.

Reporter BioSequence Database Entry [unigene] | Reporter BioSequence Database Entry [refseq]
Mm.4946 | NM_008387
Mm.30140 | NM_011392
Mm.4010 | NM_009509
Mm.31625 | NM_008600
Mm.3862 | NM_010514
Mm.4236 | NM_007530

Control types:
• control_biosequence - for example a spike
• control_buffer - buffer spotted on the array
• control_empty - nothing spotted on the array
• control_genomic_DNA - e.g. salmon sperm DNA
• control_label - landing lights
• control_reporter_size - size standard
• control_spike_calibration - spike at varying concentrations
• control_unknown_type

CompositeSequence Identifier | RMA Normalized(LEI_CY_1)
1769308_at | 9.244743
1769309_at | 5.896561
1769310_at | 5.251203
1769311_at | 11.004683
1769312_at | 9.974109
1769313_at | 11.046103
MAGE-TAB Example: IDF
MAGE-TAB Example: SDRF
MAGE-TAB Submission
• Indicate submission type
• Ontology Link
• Submitter is directed to either submit an experiment or download a template to the desktop
What happens after submission?
• Email confirmation
• Curation
• The curation team will review your submission and will
email you with any questions.
• Possible reopening for editing
• We will send you an accession number when all the
required information has been provided.
• We will load your experiment into ArrayExpress and
provide you with a reviewer login for viewing the data
before it is made public.
Submission of High-throughput
sequencing gene expression data
MAGE-TAB Example: SDRF
Source Name | Material Type | Term Source REF | Characteristics[Organism] | Characteristics[Sex] | Characteristics[StrainOrLine] | Protocol REF
finch 1 | whole_organism | MGED Ontology | Geospiza fortis | male | Pinta | EXTRACTION
finch 2 | whole_organism | MGED Ontology | Geospiza fortis | male | Pinta | EXTRACTION
finch 3 | whole_organism | MGED Ontology | Geospiza fortis | male | Marchesa | EXTRACTION
finch 4 | whole_organism | MGED Ontology | Geospiza fortis | male | Marchesa | EXTRACTION
finch 5 | whole_organism | MGED Ontology | Geospiza fortis | male | Santiago | EXTRACTION
finch 6 | whole_organism | MGED Ontology | Geospiza fortis | male | Santiago | EXTRACTION
finch 7 | whole_organism | MGED Ontology | Geospiza fortis | male | Floreana | EXTRACTION
finch 8 | whole_organism | MGED Ontology | Geospiza fortis | male | Floreana | EXTRACTION
finch 9 | whole_organism | MGED Ontology | Geospiza fortis | male | Pinzon | EXTRACTION
finch 10 | whole_organism | MGED Ontology | Geospiza fortis | male | Pinzon | EXTRACTION
What needs to be included in the Spreadsheet?
• Include Assay Name and Technology Type columns
• Raw files must go in the Array Data File column
• A sequencing protocol must be provided.
  • The sequencing protocol should have a performer; this is used as the run center name.
  • This protocol must have a Protocol Hardware value saying which sequencing instrument was used (e.g. 454 GS, Illumina Genome Analyzer, AB SOLiD System)
  • Reference this in the Protocol REF column before the Assay Name column.
• These 4 extra Comment[] columns should be added after Extract Name to provide information about how the library was prepared:
  1. Comment[LIBRARY_LAYOUT] - Specifies whether to expect single or paired reads. The column should contain one of the following: SINGLE or PAIRED
  2. Comment[LIBRARY_SOURCE] - Specifies the type of source material that is being sequenced. The column should contain one of the following: TRANSCRIPTOMIC, METAGENOMIC, SYNTHETIC, VIRAL RNA, OTHER *
  3. Comment[LIBRARY_STRATEGY] - Sequencing technique intended for the library. The column should contain one of the following: WGS, WCS, WXS, CLONE, CLONEEND, POOLCLONE, FINISHING, AMPLICON, RNA-Seq, EST, FL-cDNA, CTS, ChIP-Seq, MNase-Seq, DNase-Hypersensitivity, Bisulfite-Seq, MRE-Seq, MeDIP-Seq, MBD-Seq, OTHER *
  4. Comment[LIBRARY_SELECTION] - Method used to select and/or enrich the material being sequenced. The column should contain one of the following: RANDOM, PCR, RANDOM PCR, RT-PCR, cDNA, CAGE, RACE, ChIP, MNase, DNAse, HMPR, MF, MSLL, 5-methylcytidine antibody, MBD2 protein methyl-CpG binding domain, Hybrid Selection, Reduced Representation, Restriction Digest, size fractionation, CF-S, CF-M, CF-H, CF-T, other, unspecified *
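Before submitting, the Comment[] columns can be checked against the permitted values listed above. The sketch below covers only LIBRARY_LAYOUT and LIBRARY_SOURCE, using the values given on the slide; the function name and the example rows are hypothetical, and this is not an official validator.

```python
# Permitted values as listed on the slide; only two of the four columns are checked.
ALLOWED = {
    "Comment[LIBRARY_LAYOUT]": {"SINGLE", "PAIRED"},
    "Comment[LIBRARY_SOURCE]": {
        "TRANSCRIPTOMIC", "METAGENOMIC", "SYNTHETIC", "VIRAL RNA", "OTHER",
    },
}


def check_row(row):
    """Return a list of problems found in one SDRF row (a dict of column -> value)."""
    problems = []
    for column, allowed in ALLOWED.items():
        value = row.get(column)
        if value is None:
            problems.append(f"missing column {column}")
        elif value not in allowed:
            problems.append(f"{column}: unexpected value {value!r}")
    return problems


# Hypothetical rows: the first is valid, the second has a lower-case layout value.
good = {"Comment[LIBRARY_LAYOUT]": "PAIRED", "Comment[LIBRARY_SOURCE]": "TRANSCRIPTOMIC"}
bad = {"Comment[LIBRARY_LAYOUT]": "paired", "Comment[LIBRARY_SOURCE]": "TRANSCRIPTOMIC"}
print(check_row(good), check_row(bad))
```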
Submissions- Key Point
Don’t be afraid to ask
[email protected]
Summary: ArrayExpress
• Search by keyword
• Search by gene name, species and experimental condition
• View experiment
• Browse results summary
• Link to sample properties and experiment design
• Search by experiment
• View expression under different conditions and profiles
• Search by gene across experiments
Acknowledgements
ArrayExpress/Atlas training is funded by the European Commission
under SLING, grant agreement number 226073 (Integrating Activity) within
Research Infrastructures Action of the FP7 Capacities Specific
Programme.