ArrayExpress and Gene Expression Atlas: Mining Functional

ArrayExpress and Gene Expression Atlas:
Mining Functional Genomics data
Amy Tang PhD
ArrayExpress Production Team
Functional Genomics Group
EMBL-EBI
[email protected]
What’s covered this morning?
• Why do we need a database for functional genomics data?
• ArrayExpress databases:
  • Archive
  • Gene Expression Atlas
• What’s in each database, and how to browse, search, interpret and download data
• Hands-on exercises
• (How to submit data to ArrayExpress?)
ArrayExpress
Functional genomics (FG) data
• The aim of FG is to understand the function of genes and other (non-genic) parts of the genome
• Often involves high-throughput technologies (microarrays, high-throughput sequencing [HTS])
• Questions addressed:
  • Gene expression - when? where? how much? changes?
  • Gene function - roles of different genes in cellular processes and pathways
  • Gene regulation - e.g. epigenetic modifications of histones or DNA
ArrayExpress
www.ebi.ac.uk/arrayexpress
• Public repository for functional genomics data (both microarray and sequencing)
• Together with GEO at NCBI and CIBEX at DDBJ, serves the scientific community as an archive for data supporting publications
• Provides access to curated data in a structured and standardised format, which facilitates the sharing of experimental information
• Submissions are curated based on community standards:
  • MIAME guidelines & MAGE-TAB format for microarray data
  • MINSEQE guidelines & MAGE-TAB format for HTS data
Community standards for data requirements
• MIAME = Minimum Information About a Microarray Experiment
  (http://www.mged.org/Workgroups/MIAME/miame_2.0.html)
• MINSEQE = Minimum Information about a high-throughput Nucleotide SEQuencing Experiment
  (http://www.mged.org/minseqe)
• The checklist:

Requirements                                              MIAME   MINSEQE
1. Experiment design / background description              ✓       ✓
2. Sample annotation and experimental factor               ✓       ✓
3. Array design annotation (e.g. probe sequence)           ✓       -
4. All protocols (wet-lab bench and data processing)       ✓       ✓
5. Raw data files (from scanner or sequencing machine)     ✓       ✓
6. Processed data files (normalised and/or transformed)    ✓       ✓
What is an experimental factor?
• The main variable(s) studied in the experiment
• It is often the independent variable of the microarray or HTS experiment. Values of the factor (“factor values”) should vary.
• Examples:

Experiment design: human blood samples vs mouse blood samples
  Factor: organism
  Factor values: Homo sapiens, Mus musculus
  Not a factor: organism part (blood only)

Experiment design: lung samples from male C57BL/6 mice vs lung samples from male 129 mice
  Factor: strain
  Factor values: C57BL/6, 129
  Not a factor: organism part (lung only), sex (male only)
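The rule of thumb above (factor values vary, other attributes stay constant) can be sketched in a few lines of Python. This is only an illustration: the function name `split_factors` and the sample dictionaries are assumptions made for this example, and a varying attribute is a candidate factor rather than proof of what the experimenter set out to study.

```python
def split_factors(samples):
    """Separate attributes that vary across samples from those that do not.

    `samples` is a list of dicts sharing the same keys (a hypothetical,
    simplified stand-in for curated sample annotation).
    """
    factors = {}
    constants = {}
    for attribute in samples[0].keys():
        values = sorted({sample[attribute] for sample in samples})
        if len(values) > 1:
            factors[attribute] = values        # varies: candidate factor
        else:
            constants[attribute] = values[0]   # constant: not a factor
    return factors, constants


# Second example from the slide: lungs from male C57BL/6 vs male 129 mice.
samples = [
    {"organism": "Mus musculus", "organism part": "lung", "sex": "male", "strain": "C57BL/6"},
    {"organism": "Mus musculus", "organism part": "lung", "sex": "male", "strain": "129"},
]
factors, constants = split_factors(samples)
print(factors)    # the only varying attribute is "strain"
print(constants)  # organism, organism part and sex are constant
```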
Reporting standards - MAGE-TAB format
MAGE-TAB is a simple spreadsheet format that uses a number of different files to capture information about a microarray or sequencing experiment.
• IDF (Investigation Description Format file): experiment title and background, investigators’ contact details, definition of protocols
• SDRF (Sample and Data Relationship Format file): captures the chronological flow of the experiment from source materials to data files, and shows the relationship between samples, data files and experimental factors
• ADF (Array Design Format file, for array data only): describes the probes on an array, e.g. sequence, genomic mapping location
• Data files: raw and processed data files
  • Raw = unmodified output from the microarray scanner (e.g. CEL for Affymetrix, GPR for GenePix), or sequence read files (fastq and bam) for sequencing
  • Processed = data normalised/transformed from the raw data
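Because an SDRF is a tab-delimited table, it can be read with standard tools. The sketch below uses Python's `csv` module on a tiny made-up SDRF; the column headings follow the usual MAGE-TAB pattern ("Source Name", "Characteristics[...]", "Factor Value[...]", "Array Data File"), but the rows, file names and helper functions are invented for illustration and a real SDRF has many more columns.

```python
import csv
import io

# A made-up, minimal SDRF: one row per sample, tab-separated.
SDRF_TEXT = (
    "Source Name\tCharacteristics[organism]\tFactor Value[strain]\tArray Data File\n"
    "mouse 1\tMus musculus\tC57BL/6\tmouse1.CEL\n"
    "mouse 2\tMus musculus\t129\tmouse2.CEL\n"
)


def read_sdrf(handle):
    """Read an SDRF-like table into a list of dicts keyed by column heading."""
    return list(csv.DictReader(handle, delimiter="\t"))


def factor_values(rows):
    """Collect the set of values seen in each 'Factor Value[...]' column."""
    result = {}
    for row in rows:
        for column, value in row.items():
            if column.startswith("Factor Value"):
                # Take the factor name from between the square brackets.
                name = column[column.index("[") + 1:column.rindex("]")].strip()
                result.setdefault(name, set()).add(value)
    return result


rows = read_sdrf(io.StringIO(SDRF_TEXT))
files_by_sample = {row["Source Name"]: row["Array Data File"] for row in rows}
sdrf_factors = factor_values(rows)
print(files_by_sample)
print(sdrf_factors)
```

The sample-to-file mapping is the "relationship" part of the SDRF: each raw data file can be traced back to the sample it came from.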
MAGE-TAB Example: IDF
ArrayExpress & Atlas
MAGE-TAB Example: SDRF
ArrayExpress – two databases
What is the difference between them?
ArrayExpress Archive
• Central object: experiment
• Contains both microarray and HTS experiments
• Query to retrieve experimental information and
associated data
Expression Atlas
• Central object: gene/condition
• Contains data from mainly microarray experiments
(HTS coming very soon!)
• Query for up/downregulated genes across experiments
and across platforms
ArrayExpress Archive – when to use it?
• Find FG experiments that might be relevant to your
research
• Download data and re-analyse it yourself.
Data deposited in public repositories may shed light on biological
questions different from the one asked in the original
experiments.
• Submit microarray or HTS data that you want to publish.
Major journals will require data to be submitted to a public
repository like ArrayExpress as part of the peer-review process.
How much data in AE Archive?
(as of September 2012)
[Figure: amount of data in the AE Archive, up to September 2012]
HTS data in AE Archive
(as of mid-September 2012)
[Figure: microarray vs HTS data, with a breakdown of HTS data into RNA-seq, DNA-seq and ChIP-seq]
Browsing the AE Archive
www.ebi.ac.uk/arrayexpress
Screenshot annotations:
• AE unique experiment ID
• Curated title of experiment
• Number of assays
• Species investigated
• The date when the data were loaded into the Archive
• “Loaded in Atlas” flag
• Raw sequencing data available in ENA
• The direct link to raw and processed data. An icon indicates that this type of data is available.
• The total number of experiments and assays retrieved
• The list of experiments retrieved can be printed, saved in tab-delimited format, or exported to Excel or as an RSS feed
Experimental factor ontology (EFO)
http://www.ebi.ac.uk/efo
• An ontology modeling the relationship between experimental factors (EFs) and other data elements
• Used in EBI databases and in external projects (e.g. NHGRI GWAS Catalogue)
• Combines terms from a subset of well-maintained and compatible ontologies, e.g. Gene Ontology (cellular component + biological process terms), NCBI Taxonomy
EFO was developed to:
• increase the richness of annotations in databases
• expand on search terms when querying ArrayExpress and Gene Expression Atlas
  • using synonyms (e.g. “cerebral cortex” = “adult brain cortex”)
  • using child terms (e.g. “bone” includes “rib” and “vertebra”)
• promote consistency (e.g. F/female, 1 day/24 hours)
• facilitate automatic annotation and integration of external data (e.g. changing “gender” to “sex” automatically)
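Query expansion with synonyms and child terms can be pictured as a lookup in two small tables. The sketch below is a toy model, not the EFO implementation: the dictionaries `SYNONYMS` and `CHILDREN` and the function `expand_term` are assumptions for illustration, filled only with the two examples from the slide.

```python
# Toy ontology fragments, taken from the examples on the slide.
SYNONYMS = {"cerebral cortex": ["adult brain cortex"]}
CHILDREN = {"bone": ["rib", "vertebra"]}


def expand_term(term):
    """Return (term, match type) pairs for a search term.

    The match type is "exact", "synonym" or "child"; child terms are
    followed recursively so that grandchildren are included as well.
    """
    expanded = [(term, "exact")]
    for synonym in SYNONYMS.get(term, []):
        expanded.append((synonym, "synonym"))
    queue = list(CHILDREN.get(term, []))
    seen = set()
    while queue:
        child = queue.pop(0)
        if child in seen:
            continue
        seen.add(child)
        expanded.append((child, "child"))
        queue.extend(CHILDREN.get(child, []))
    return expanded


print(expand_term("bone"))
print(expand_term("cerebral cortex"))
```

A search for "bone" would then also retrieve experiments annotated only with "rib" or "vertebra".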
Building EFO
An example
1. Take all experimental factors: disease, neoplasm, cancer, sarcoma, Kaposi’s sarcoma
2. Find the logical connection between them:
   • disease is the parent term
   • neoplasm is a type of disease
   • cancer is a synonym of neoplasm
   • sarcoma is a type of cancer
   • Kaposi’s sarcoma is a type of sarcoma
3. Organize them in an ontology:
   disease > neoplasm (synonym: cancer) > sarcoma > Kaposi’s sarcoma
Exploring EFO
An example
Searching AE Archive
Simple query
• “Auto-complete” with suggestions (like Google search)
• Avoid acronyms as search terms
• Filter your search results by:
  • species of interest
  • array design (platform)
  • molecule (DNA, RNA, protein, etc.)
  • technology (microarray or HTS)
• Search across all fields:
  • AE accession number, e.g. E-MEXP-568
  • Secondary accession numbers, e.g. GEO series accession GSE5389
  • Experiment title, submitter’s experiment description
  • Submitter's email address
  • Sample attributes, experimental factors and their values, including species (e.g. GeneticModification, Mus musculus, DREB2C over-expression)
  • Publication title, authors and journal name, PubMed ID
• Synonyms for terms are always included in searches, e.g. 'human' and 'Homo sapiens'
AE Archive query output
• Matches to exact terms are highlighted in yellow
• Matches to synonyms are highlighted in green
• Matches to child terms in the EFO are highlighted in pink
AE Archive – experiment view
• Experimental factor(s) and their values
• MIAME or MINSEQE scores show how far the experiment complies with the standard (* = compliant)
• Links to the available files. These vary between sequencing and microarray data; for microarray experiments there is also an array design file (ADF).
SDRF file – sample & data relationship
Searching AE Archive
Advanced query
• Combine search terms
  • Join two or more keywords in the search box with the operators AND, OR or NOT (in capitals), e.g.
    brain OR prostate NOT mouse
  • Search terms of more than one word must be entered inside quotes, otherwise only the first word will be searched for, e.g. “kidney cancer”
• Specify fields for searches
  • E.g. search only for human assays on Agilent microarrays:
    species:“homo sapiens” AND array:Agilent*
* For more details and examples, see http://www.ebi.ac.uk/fg/doc/help/ae_help.html
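The query rules above (capitalised operators, quotes around multi-word terms, `field:value` pairs) are easy to get wrong by hand, so a small helper can assemble the string. This is a sketch only: the helper names `quote`, `field` and `combine` are invented here, the output is simply text to paste into the search box, and the field names `species` and `array` are the ones shown on the slide.

```python
def quote(term):
    """Wrap multi-word terms in double quotes so the whole phrase is searched."""
    return f'"{term}"' if " " in term else term


def field(name, value):
    """Restrict a term to one field, e.g. species:"homo sapiens"."""
    return f"{name}:{quote(value)}"


def combine(operator, *parts):
    """Join query parts with AND, OR or NOT (operators must be in capitals)."""
    if operator not in ("AND", "OR", "NOT"):
        raise ValueError("operator must be AND, OR or NOT, in capitals")
    return f" {operator} ".join(parts)


human_agilent = combine("AND", field("species", "homo sapiens"), field("array", "Agilent"))
brain_or_prostate = combine("NOT", combine("OR", "brain", "prostate"), "mouse")
print(human_agilent)
print(brain_or_prostate)
print(quote("kidney cancer"))
```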
Hands-on exercise 1
Find RNA-seq assays studying human
prostate adenocarcinoma
Expression Atlas – when to use it?
• Find out if the expression of a gene (or a group of genes
with a common gene attribute, e.g. GO term) change(s)
across all the experiments available in the Expression
Atlas;
• Discover which genes are differentially expressed in a
particular biological condition that you are interested in.
• Experiments in Archive are curated before being
introduced into the Atlas
Expression Atlas construction
Experiment selection criteria during curation
• Array (platform) designs relating to the experiment must be provided. Probe annotation must be adequate to enable re-annotation of external references (e.g. Ensembl gene ID, UniProt ID)
• At least 3 replicates for each value of the experimental factor
• Maximum 4 experimental factors
• Adequate sample annotation using EFO terms
• Presence of data files: CEL raw data files for Affymetrix assays,
processed data files for non-Affymetrix ones
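Two of the criteria above are simple counts (at least 3 replicates per factor value, at most 4 factors) and can be checked before submission. The sketch below covers only those two; the function `check_atlas_eligibility`, the sample layout and the example annotation are assumptions for illustration, and the other criteria (array design, EFO annotation, data files) still need a curator's eye.

```python
from collections import Counter

MIN_REPLICATES = 3   # at least 3 replicates for each factor value
MAX_FACTORS = 4      # maximum 4 experimental factors


def check_atlas_eligibility(samples, factor_names):
    """Return a list of problems; an empty list means both checks passed."""
    problems = []
    if len(factor_names) > MAX_FACTORS:
        problems.append(f"{len(factor_names)} factors (maximum is {MAX_FACTORS})")
    for factor in factor_names:
        counts = Counter(sample[factor] for sample in samples)
        for value, n in sorted(counts.items()):
            if n < MIN_REPLICATES:
                problems.append(f"{factor}={value}: only {n} replicate(s)")
    return problems


# Hypothetical experiment: 3 wild type samples but only 2 knock out samples.
example_samples = (
    [{"genotype": "wild type"}] * 3 + [{"genotype": "knock out"}] * 2
)
problems = check_atlas_eligibility(example_samples, ["genotype"])
print(problems)
```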
Expression Atlas construction
Analysis pipeline
• Input data (Affymetrix CEL files; processed data for non-Affymetrix platforms)
• Linear model* (Bioconductor limma)
• Output: a 2-D matrix of genes × conditions (Cond.1, Cond.2, Cond.3 in a dummy example), where 1 = differentially expressed and 0 = not differentially expressed
* More information about the statistical methodology:
http://nar.oxfordjournals.org/content/38/suppl_1/D690.full
“Is gene X differentially expressed in condition 1 in this experiment?”
[Figure: each point is a single expression value for gene X. The mean of each condition (Cond.1, Cond.2, Cond.3) is compared with the mean of all samples, and a statistic is calculated.]
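The comparison in the figure (condition mean against the mean of all samples) can be illustrated with plain Python. Note that this is a deliberately simplified statistic, not the limma linear model the Atlas pipeline uses: it divides the difference in means by a standard error built from the pooled within-condition standard deviation. The function name and the expression values are made up.

```python
import math
from statistics import mean


def condition_vs_overall(values_by_condition):
    """For one gene, compare each condition mean with the mean of all samples.

    Returns {condition: (difference in means, simplified t-like statistic)}.
    Assumes at least two conditions with non-identical replicate values.
    """
    all_values = [v for values in values_by_condition.values() for v in values]
    overall_mean = mean(all_values)

    # Pooled within-condition variance: squared deviations from each
    # condition's own mean, divided by (samples - conditions).
    sum_squares = sum(
        (v - mean(values)) ** 2
        for values in values_by_condition.values()
        for v in values
    )
    pooled_sd = math.sqrt(sum_squares / (len(all_values) - len(values_by_condition)))

    results = {}
    for condition, values in values_by_condition.items():
        difference = mean(values) - overall_mean
        standard_error = pooled_sd / math.sqrt(len(values))
        results[condition] = (difference, difference / standard_error)
    return results


# Made-up expression values for gene X, three replicates per condition.
gene_x = {
    "Cond.1": [9.0, 10.0, 11.0],
    "Cond.2": [4.0, 5.0, 6.0],
    "Cond.3": [4.0, 5.0, 6.0],
}
stats = condition_vs_overall(gene_x)
for condition, (difference, statistic) in stats.items():
    print(condition, round(difference, 2), round(statistic, 2))
```

Here Cond.1 sits well above the overall mean (positive statistic) while Cond.2 and Cond.3 sit below it; a real pipeline would turn such statistics into p-values and correct for multiple testing.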
[Figure: Exp. 1 (Cond.1, Cond.2, Cond.3), Exp. 2 (Cond.4, Cond.5, Cond.6) ... Exp. n (Cond.X, Cond.Y, Cond.Z). A statistical test is run on the genes × conditions data of each experiment separately.]
Each experiment has its own “verdict” or “vote” on whether a gene is differentially expressed or not under a certain condition.
Expression Atlas construction
Summary of the “verdicts” from different experiments
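Summarising the verdicts amounts to counting votes per gene and condition. The sketch below assumes each experiment reports +1 (up), -1 (down) or 0 (not differentially expressed) for a gene in a condition; that encoding, the function `summarise_verdicts`, and the experiment and gene names are all hypothetical.

```python
from collections import defaultdict


def summarise_verdicts(verdicts):
    """Count up/down/no-change votes per (gene, condition) across experiments.

    `verdicts` is an iterable of (experiment, gene, condition, call) tuples,
    where call is +1 (up), -1 (down) or 0 (not differentially expressed).
    """
    summary = defaultdict(lambda: {"up": 0, "down": 0, "no change": 0})
    for _experiment, gene, condition, call in verdicts:
        if call > 0:
            summary[(gene, condition)]["up"] += 1
        elif call < 0:
            summary[(gene, condition)]["down"] += 1
        else:
            summary[(gene, condition)]["no change"] += 1
    return dict(summary)


# Hypothetical verdicts from three experiments.
verdicts = [
    ("experiment 1", "geneX", "prostate carcinoma", 1),
    ("experiment 2", "geneX", "prostate carcinoma", 1),
    ("experiment 3", "geneX", "prostate carcinoma", 0),
    ("experiment 3", "geneX", "kidney", -1),
]
vote_summary = summarise_verdicts(verdicts)
print(vote_summary)
```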
Expression Atlas
Atlas home page
http://www.ebi.ac.uk/gxa
• Query for genes
• Restrict query by direction of differential expression
• Query for conditions
• The ‘advanced query’ option allows building more complex queries
Atlas home page
The ‘Genes’ and ‘Conditions’ search boxes
Atlas single gene query
gene summary page
Atlas single gene query (cont’d)
experiment page
Atlas single gene query
gene summary page – jump to orthologs
Orthology comes from the Ensembl Compara database
Atlas single gene query
compare orthologs – heatmap view
Atlas ‘condition-only’ query
Atlas ‘condition-only’ query (cont’d)
heatmap view
Atlas gene + condition query
Atlas query refining (method 1)
What if there are no terms in the “REFINE YOUR QUERY” box which fit my biological question?
Atlas query refining (method 2)
Hands-on exercise 2
Find genes in the “androgen receptor
signaling pathway” which are (i) expressed in
prostate carcinoma and (ii) involved in
regulation of transcription from RNA Pol II
Hands-on exercise 3
Find information on Tbx5 expression in
mouse in relation to Holt-Oram syndrome
ArrayExpress-Atlas
Crossword
A glimpse of what’s coming…
“Differential atlas”
“Is gene X differentially expressed in condition 1 in this experiment?”
[Figure: each point is a single expression value for gene X. The mean of each condition (Cond.1, Cond.2, Cond.3) is compared with the mean of all samples, and a statistic is calculated.]
“Differential atlas” mock-up (1)
“Differential atlas” mock-up (2)
A glimpse of what’s coming…
“Baseline atlas”
• Gene expression in normal tissues, not looking for differentially
expressed genes based on different conditions
• E.g. “Give me all the genes expressed in normal human kidney”
• Can also filter genes by expression level (e.g. FPKM values)
• Start with Illumina Body Map 2.0 RNA-seq data
• 16 tissues: adrenal, adipose, brain, breast, colon, heart, kidney, liver,
lung, lymph, ovary, prostate, skeletal muscle, testes, thyroid, and white
blood cells
• We are working on something similar for mouse
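A baseline query such as “all the genes expressed in normal human kidney” boils down to a threshold on expression level. The sketch below is illustrative only: the FPKM numbers, gene names, the 1.0 FPKM cut-off and the function `expressed_genes` are invented, not Body Map values or the Atlas interface.

```python
def expressed_genes(fpkm_by_gene, tissue, min_fpkm=1.0):
    """Return the genes whose FPKM in `tissue` is at least `min_fpkm`.

    `fpkm_by_gene` maps gene -> {tissue: FPKM}; a missing tissue counts as 0.
    """
    return sorted(
        gene
        for gene, fpkm_by_tissue in fpkm_by_gene.items()
        if fpkm_by_tissue.get(tissue, 0.0) >= min_fpkm
    )


# Made-up FPKM values for three hypothetical genes.
baseline = {
    "geneA": {"kidney": 25.0, "liver": 0.2},
    "geneB": {"kidney": 0.5, "liver": 40.0},
    "geneC": {"kidney": 3.0},
}
print(expressed_genes(baseline, "kidney"))                 # default cut-off
print(expressed_genes(baseline, "kidney", min_fpkm=10.0))  # stricter cut-off
```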
A glimpse of what’s coming…
“Baseline atlas” mock-up display
Find out more about Archive and Atlas
• Visit our eLearning portal, Train online, at
http://www.ebi.ac.uk/training/online/ for tutorials on
ArrayExpress and Atlas
• ArrayExpress Bioconductor R package:
http://bioconductor.org/packages/release/bioc/html/ArrayExpress.html
• Try the ArrayExpress help page: www.ebi.ac.uk/fg/doc
• Email us at: [email protected]
• Atlas mailing list: [email protected]
Data submission to
ArrayExpress Archive
Data submission to AE
www.ebi.ac.uk/microarray/submissions.html
• MIAMExpress was originally designed mainly for simple Affymetrix and Agilent two-colour microarray submissions.
• The MAGE-TAB route is recommended for large or complicated experiments.
• HTS experiments must be submitted via the MAGE-TAB route.
• If you follow the MAGE-TAB submission tool, you get a MAGE-TAB spreadsheet (IDF and SDRF) tailor-made for your experiment (i.e. with all mandatory column headings present).
Submission of HTS data
• ArrayExpress acts as a “broker” for the submitter.
  • Meta-data and processed data: ArrayExpress
  • Raw sequence reads* (e.g. fastq, bam): ENA
*See http://www.ebi.ac.uk/ena/about/sra_data_format for accepted read file formats
What happens after submission?
• Email confirmation
• Curation:
  • The submission is ‘closed’, so no more editing on your end
  • We will email you with any questions
  • We may ‘re-open’ the submission for you to make changes
• You can keep data private until publication. We will provide login account details to you and the reviewer for private data access.
Get your submission in the best possible shape to shorten curation and processing time!
Submission checklist
Microarrays
1. Is your array design already accessioned in ArrayExpress? (Check: http://www.ebi.ac.uk/arrayexpress/arrays/browse.html?directsub=on. If your array design is not represented, you will have to submit the array design to us before submitting any experimental data, because all data points in your raw/processed files refer back to the array design file.)
2. Do you have all the data files ready in the required formats?
HTS
1. Are your read files in a format accepted by the SRA? (Check here: http://www.ebi.ac.uk/ena/about/sra_data_format)
2. If yes, have you dropped the files on the private ArrayExpress FTP site and emailed us about them?
Microarrays and HTS
3. Have you filled in the MAGE-TAB spreadsheet with adequate meta-data?
Need help with submitting your data?
• Visit our eLearning portal, Train online, at www.ebi.ac.uk/training/online/course/arrayexpress-submitting-data-using-mage-tab for the specific tutorial on how to submit data using MAGE-TAB
• Watch this short YouTube video on how to navigate the MAGE-TAB submission tool: http://youtu.be/KVpCVGpjw2Y
• Email curators at: [email protected]