ppt - University of Pennsylvania

Download Report

Transcript ppt - University of Pennsylvania

Bioinformatics: Making sense of
functional genomics data
Chris Stoeckert, Ph.D.
Dept of Genetics and Penn Center for Bioinformatics,
University of Pennsylvania School of Medicine
GoldenHelix Symposia, Athens, Greece
December 3, 2010
What is Functional Genomics?
From http://en.wikipedia.org/wiki/Functional_genomics
• Functional genomics is a field of molecular biology that attempts to
make use of the vast wealth of data produced by genomic projects (such
as genome sequencing projects) to describe gene (and protein) functions
and interactions. Unlike genomics and proteomics, functional genomics
focuses on the dynamic aspects such as gene transcription, translation,
and protein-protein interactions, as opposed to the static aspects of the
genomic information such as DNA sequence or structures. Functional
genomics attempts to answer questions about the function of DNA at
the levels of genes, RNA transcripts, and protein products. A key
characteristic of functional genomics studies is their genome-wide
approach to these questions, generally involving high-throughput
methods rather than a more traditional “gene-by-gene” approach.
The Impact of Functional Genomics
• More data than can be fit in a publication, or a
lab book, and often not generated by a single
lab.
• Microarray-based datasets are an example.
• High throughput sequencing (HTS)-based
datasets are becoming the “new” microarrays.
What is Needed for Reproducible
Research
• Public archives
From July 21, of
2010data
N.Y. Times:
“Last year, two biostatisticians at the University
• Minimum
information
about
of Texas
MD Anderson Cancer
Center experiments so
an
article
in the scientific
journal
that thepublished
steps
to
generate
the
results can be
Annals of Applied Statistics in which they
followedidentified errors in Duke’s data analysis and
said they had not been able to reproduce
• An example
of what can go wrong:
Duke’s results.”
Potti A, Dressman HK, Bild A, Riedel RF, Chan G, Sayer R, Cragun J,
Cottrill H, Kelley MJ, Petersen R, Harpole D, Marks J, Berchuck A,
GinsburgKeith
GS,A.Febbo
P, Lancaster
Nevins JR. Genomic signatures
Baggerly
and Kevin R.J,Coombes
from cell lines:
to guideDeriving
the usechemosensitivity
of chemotherapeutics.
Nat Med. 2006
Forensic bioinformatics
reproducible
Nov;12(11):1294-300.
Epub and
2006
Oct 22. Erratum in: Nat Med.
research in high-throughput
biology
2008 Aug;14(8):889.
Nat Med. 2007
Nov;13(11):1388.
Ann. Appl. Stat. Volume 3, Number 4 (2009),
1309-1334.
Recent examples of discoveries from integrating
usable functional genomics datasets
• A global map of human gene
expression. Lukk et al. Nature
Biotech. 2010
• Differentially Expressed RNA
from Public Microarray Data
Identifies Serum Protein
Biomarkers for Cross-Organ
Transplant Rejection and Other
Conditions. R. Chen et al. PLoS
Comp. Biol. 2010
In the beginning there were
microarrays…
• And they were developed for gene expression profiling.
• Microarray studies were published but only gene lists were
provided.
– No details, no place to get the data
• In 1999, the Microarray Gene Expression Data Society
(MGED) began to address the lack
of verifiable and reproducible
datasets by developing and
promoting standards.
MGED Standards
• What information is needed for a microarray
experiment?
– MIAME: Minimal Information About a Microarray
Experiment. Brazma et al., Nature Genetics 2001
• How do you “code up” microarray data?
– MAGE-OM: MicroArray Gene Expression Object Model.
Spellman et al., Genome Biology 2002
– MAGE-TAB Rayner et al., BMC Bioinformatics 2006
• What words do you use to describe a microarray
experiment?
– MO: MGED Ontology. Whetzel et al. Bioinformatics 2006
MIAME in a nutshell (ala Alvis Brazma)
Sample
Sample
Sample
Sample
Sample
Experiment
Array design
RNA
extract
RNA
extract
RNA
extract
RNA
RNAextract
extract
labelled
labelled
labelled
labelled
nucleic
acid
labelled
nucleic
acid
nucleic
acid
nucleic
nucleicacid
acid
genes
hybridisation
hybridisation
hybridisation
hybridisation
hybridisation
array
array
array
array
Microarray
Gene expression
data matrix
Protocol
Protocol
Protocol
Protocol
Protocol
Protocol
normalization
Stoeckert et al.
Drug Discovery Today TARGETS 2004
integration
Sequencing is replacing array technology
Sample
Sample
Sample
Sample
Sample
RNA
extract
RNA
extract
RNA
extract
RNA
RNAextract
extract
labelled
labelled
labelled
labelled
nucleic
acid
labelled
nucleic
acid
nucleic
acid
nucleic
nucleicacid
acid
Protocol
Protocol
Protocol
Protocol
Protocol
Protocol
genes
Experiment
Array design
@HWI-EAS266_0011:8:1:6:969#0/1
GTTTGCCNGTGTGTACGCTACCCCCTTCTTGTGTGTGTGTGTCT
+HWI-EAS266_0011:8:1:6:969#0/1
_abb`a[DZ`aabaa_a`b]___^^aa_`aa_a^a[\\aZTZVY
@HWI-EAS266_0011:8:1:7:1688#0/1
AAGATGANGGCAGGGTGCAAGATGGCAGGATGCAAGATGGCAGG
+HWI-EAS266_0011:8:1:7:1688#0/1
a`^ab`^D\a]a`b``b_bbbaabb^abaa``^a_^_aa\]_VR
@HWI-EAS266_0011:8:1:7:593#0/1
CAGTTCANTTCTCAGCACCACACTGGGATGCTCACACATGCCTG
+HWI-EAS266_0011:8:1:7:593#0/1
abbbb_VD[bbbba_`bbbbbbbbbbbaa_`bbaabaabb_aa_
@HWI-EAS266_0011:8:1:7:139#0/1
CATGGGGNATAATTGCAATCCCCGATCCCCATCACGAATGGGGT
+HWI-EAS266_0011:8:1:7:139#0/1
aab`[^YDY]Z\baa`aabaaaa`aa`a]aa```\aY]^\]ZVX
@HWI-EAS266_0011:8:1:7:1390#0/1
GAATAATNGAATAGGACCGCGGTTCTATTTTGTTGGTTTTCGGA
+HWI-EAS266_0011:8:1:7:1390#0/1
_U^b_`]D\__a_a`S```Y[a__]a\aa_`]`aTVZ__\HYVX
@HWI-EAS266_0011:8:1:7:1663#0/1
TGATGTTNGTGGCAATAATGGGGGTAGCGGCAATGGTGGCGGGG
+HWI-EAS266_0011:8:1:7:1663#0/1
a`[_X]\DQTZ[^YYa[[aXV[PZUUYSYBBBBBBBBBBBBBBB
hybridisation
hybridisation
hybridisation
hybridisation
hybridisation
array
array
array
array
Microarray
normalization
integration
Gene expression
data matrix
Sequencing is replacing array technology
Sample
Sample
Sample
Sample
Sample
RNA
extract
RNA
extract
RNA
Chromatin,
RNAextract
extract
DNA extract
labelled
labelled
labelled
labelled
nucleic
acid
nucleic
acid
nucleic
nucleic
acid
nucleicacid
acid
Protocol
Protocol
Protocol
Protocol
Protocol
Protocol
genes
Experiment
Array design
@HWI-EAS266_0011:8:1:6:969#0/1
GTTTGCCNGTGTGTACGCTACCCCCTTCTTGTGTGTGTGTGTCT
+HWI-EAS266_0011:8:1:6:969#0/1
_abb`a[DZ`aabaa_a`b]___^^aa_`aa_a^a[\\aZTZVY
@HWI-EAS266_0011:8:1:7:1688#0/1
AAGATGANGGCAGGGTGCAAGATGGCAGGATGCAAGATGGCAGG
+HWI-EAS266_0011:8:1:7:1688#0/1
a`^ab`^D\a]a`b``b_bbbaabb^abaa``^a_^_aa\]_VR
@HWI-EAS266_0011:8:1:7:593#0/1
CAGTTCANTTCTCAGCACCACACTGGGATGCTCACACATGCCTG
+HWI-EAS266_0011:8:1:7:593#0/1
abbbb_VD[bbbba_`bbbbbbbbbbbaa_`bbaabaabb_aa_
@HWI-EAS266_0011:8:1:7:139#0/1
CATGGGGNATAATTGCAATCCCCGATCCCCATCACGAATGGGGT
+HWI-EAS266_0011:8:1:7:139#0/1
aab`[^YDY]Z\baa`aabaaaa`aa`a]aa```\aY]^\]ZVX
@HWI-EAS266_0011:8:1:7:1390#0/1
GAATAATNGAATAGGACCGCGGTTCTATTTTGTTGGTTTTCGGA
+HWI-EAS266_0011:8:1:7:1390#0/1
_U^b_`]D\__a_a`S```Y[a__]a\aa_`]`aTVZ__\HYVX
@HWI-EAS266_0011:8:1:7:1663#0/1
TGATGTTNGTGGCAATAATGGGGGTAGCGGCAATGGTGGCGGGG
+HWI-EAS266_0011:8:1:7:1663#0/1
a`[_X]\DQTZ[^YYa[[aXV[PZUUYSYBBBBBBBBBBBBBBB
hybridisation
hybridisation
hybridisation
hybridisation
hybridisation
array
array
array
array
Microarray
normalization
integration
ChiP-Seq
MeDIP-Seq
Etc.
MGED is now the
Functional Genomics Data Society
•
•
•
•
The Functional Genomics Data Society - FGED (eff – jed) Society, founded
in1999 as the MGED Society, advocates for open access to genomic data
sets and works towards providing concrete solutions to achieve this. Our
goal is to assure that investment in functional genomics data generates
the maximum public benefit. Our work on defining minimum information
specifications for reporting data in functional genomics papers have
already enabled large data sets to be used and reused to their greater
potential in biological and medical research.
We work with other organizations to develop standards for biological
research data quality, annotation and exchange.
We facilitate the creation and use of software tools that build on these
standards and allow researchers to annotate and share their data easily.
We promote scientific discovery that is driven by genome wide and other
biological research data integration and meta-analysis.
The Functional Genomics Data Society
For more information see http://mged.org or http://fged.org
Look here for information on a new site in the coming year …
New Directions for FGED Standards
• What information is needed for a UHTS experiment?
– MINSEQE: Minimal Information about a high throughput
SEQuencing Experiment.
– http://www.mged.org/minseqe/
• How do you annotate functional genomics data?
– Annotare: Tool to create MAGE-TAB.
– http://code.google.com/p/annotare/
• What words do you use to describe an investigation?
– OBI: Ontology for Biomedical Investigations.
– http://obi-ontology.org/
Minimum Information about a high-throughput
Nucleotide SeQuencing Experiment – MINSEQE
(April, 2008)
• The description of the biological system and the particular
states that are studied
• The sequence read data for each assay
• The 'final' processed (or summary) data for the set of
assays in the study
• The experiment design including sample data
relationships
• General information about the experiment
• Essential experimental and data processing protocols
Minimal Information for
Biological and Biomedical
Investigations
33 Projects registered as
of November, 2010
Related standards for exchange of
phenotype data for reuse in meta analysis.
• statement of informed consent that the data
may (not) be used in additional analysis.
• enrollment criteria / exclusion criteria
• Population stratification: race/ethnicity/
strain, gender, age
• tissue or cellular makeup of specimen
• non-omics phenotype for each participant
(subject or group)
Jennifer Fostel (NIEHS): Leader Phenotypes Working group
FGED design-specific criteria
Different study or trial designs would have additional reporting requirements:
(A) study involving a disease
(A1) type and stage of disease
(A2) pertinent genotype / genomic conditions
(B) study involving chemical exposure
(B1) dose administered, dosing frequency and duration, route of administration
(B2) ID to permit unambiguous identification of chemical, i.e. Cas No
(C) study involving surgical intervention
(C1) details of intervention
(C2) timing with respect to pertinent events
(D) Observational or epidemiological study
(D1) demographic factors
(D2) pertinent medical history
Jennifer Fostel (NIEHS): Leader Phenotypes Working group
Annotare - An open source
standalone MAGE-TAB editor
Shankar R, Parkinson H, Burdett T, Hastings E, Liu J, Miller M,
Srinivasa R, White J, Brazma A, Sherlock G, Stoeckert CJ Jr, Ball CA.
Annotare - a tool for annotating high-throughput biomedical
investigations and resulting data.
Bioinformatics. 2010 Aug 23.
MAGE-TAB Format
What’s MAGE-TAB?
•
•
•
•
•
MAGE-TAB is a simple spreadsheet view which has two files
IDF - describing the experiment design, contact details, variables and
protocols
SDRF - a spreadsheet with columns that describe samples, annotations,
protocol references, hybridizations and data
Linked data files, e.g. CEL files, these are referenced by the SDRF
For single channel data one row in the SDRF = 1 hybridization, for two channel
data one row = 1 channel
MAGE-TAB can also be used to annotate Next Gen Sequencing data
Where can I get MAGE-TAB from?
•
•
~10,000 MAGE-TAB files are available for download from ArrayExpress
(includes GEO derived and ArrayExpress data)
caArray also provides MAGE-TAB files for download.
Who’s using MAGE-TAB?
•
•
•
BioConductor
GenePattern
MeV
IDF file for E-TABM-34
IDF = Investigation Description Format
SDRF file for E-TABM-34
SDRF = Sample and Data Relationship Format
Annotare - an open source MAGE-TAB Editor
Annotare is an annotation tool for high throughput gene
expression experiments in MAGE-TAB format. Researchers can
describe their investigations with the investigators’ contact
details, experimental design, protocols that were employed,
references to publications, details of biological samples,
arrays, and experimental data produced in the investigation.
Annotare Features
• Intuitive graphical user interface forms for
editing
• Ontology support, an inbuilt ontology and
web services connectivity to bioportal
• Searchable standard templates
• Design wizard
• Validation module
• Mac and Windows Support
http://code.google.com/p/annotare/
MAGE-TAB is used to load datasets into Beta Cell Genomics
University of Pennsylvania
Chris Stoeckert
Elisabetta Manduchi
Chiao-Feng Lin
Emily Allen
Junmin Liu
Vanderbilt University
Mark Magnuson
Jean-Philippe Cartailler
Thomas Houfek
Josh Norman
Chris Howard
http://genomics.betacell.org supporting the Beta Cell Biology Consortium
http://genomics.betacell.org supporting the Beta Cell Biology Consortium
OBI – Ontology for Biomedical
Investigations
Brinkman RR, Courtot M, Derom D, Fostel JM, He Y, Lord P, Malone
J, Parkinson H, Peters B, Rocca-Serra P, Ruttenberg A, Sansone SA,
Soldatova LN, Stoeckert CJ Jr, Turner JA, Zheng J; OBI consortium.
Modeling biomedical experimental processes with OBI.
J Biomed Semantics. 2010 Jun 22;1 Suppl 1:S7.
Ontology
• A set of concepts within a domain and the
relationships between those concepts
Unambiguous description of a domain
Controlled vocabulary
Relations (logical constraints among terms)
Human readable and machine interpretable
• Uses
Annotation
Text mining
Semantic web
Reasoning
Gene Ontology (GO)
• The Gene ontology has been
developed to provide terms to
consistently describe the gene
and gene product attributes
across species and databases
• Standardized annotation
facilitate data integration cross
various databases and enable
the ability to query and retrieve
genes or proteins based on their
shared biology
(generated by NCBO BioPortal)
Explosion of biomedical ontologies available
http://bioportal.bioontology.org/
OBI – Ontology for Biomedical
Investigations
• FGED is one of many communities contributing to OBI
• Whereas the MGED Ontology is primarily a controlled
vocabulary for use with MAGE, OBI is a well-founded
ontology with logical definitions and restrictions to be
used for multiple purposes (e.g., database models, text
mining, file annotation)
• OBI intends to be part of the OBO Foundry
– Interoperable with Gene Ontology, CheBI, Phenotypic qualities
(PATO), Cell Type (CL)…
• OBI is available through browsers like the NCBO BioPortal
Partial high level structure of OBI classes
OBI and IAO (Information Artifact Ontology) classes are shown in blue. Classes
imported from other external ontologies are shown in red. Some example
subclasses, such as PCR product and cell culture are included to illustrate the use
of the class processed material.
Measuring the glucose concentration in blood
From The OBI Consortium, The Ontology for Biomedical Investigations, under revision
Using OBI and other OBO Foundry
ontologies to capture phenotype data
Jie Zheng, Omar Harb
Enhance data mining in functional genomics
resources
EuPathDB: a portal to eukaryotic pathogen databases.Aurrecoechea C, Brestelli J, Brunk BP, Fischer S, Gajria B, Gao X, Gingle A, Grant G, Harb OS,
Heiges M, Innamorato F, Iodice J, Kissinger JC, Kraemer ET, Li W, Miller JA, Nayak V, Pennington C, Pinney DF, Roos DS, Ross C, Srinivasamoorthy
G, Stoeckert CJ Jr, Thibodeau R, Treatman C, Wang H.Nucleic Acids Res. 2010
Find T. brucei genes by phenotype based on RNAi
knockdown and insufficiency studies.
TriTrypDB: a functional genomic resource for the Trypanosomatidae.Aslett M, Aurrecoechea C, Berriman M, Brestelli J, Brunk BP, Carrington
M, Depledge DP, Fischer S, Gajria B, Gao X, Gardner MJ, Gingle A, Grant G, Harb OS, Heiges M, Hertz-Fowler C, Houston R, Innamorato F,
Iodice J, Kissinger JC, Kraemer E, Li W, Logan FJ, Miller JA, Mitra S, Myler PJ, Nayak V, Pennington C, Phan I, Pinney DF, Ramasamy G, Rogers
MB, Roos DS, Ross C, Sivam D, Smith DF, Srinivasamoorthy G, Stoeckert CJ Jr, Subramanian S, Thibodeau R, Tivey A, Treatman C, Velarde G,
Wang H.Nucleic Acids Res. 2010
Making sense of functional genomics
data: Standards & Ontologies
• Enhance sharing data and understanding what
is shared by computers and humans.
• Applications include:
– Data mining of databases
– Meta-analysis of multiple datasets