Challenges in data sourcing and integration in MGI

Managing Big Scientific Data
Capturing, Integrating and Presenting
Mouse Data at MGI
Mouse Genome Informatics
www.informatics.jax.org
Cynthia Smith
Canberra
April 2010
Mouse Genome Informatics (MGI) program goal
…to facilitate the use of the mouse as a model for
heritable human diseases and normal human biology.
Achondroplasia
Homozygous achondroplasia
mouse mutant and control
• short domed skull
• short-limbed dwarfism
• malocclusion
• bulging abdomen as adults
• respiratory problems
• shortened lifespan
…to accomplish MGI’s mission, we provide
integrated access to the genetics, genomics,
and biology of the laboratory mouse.
natural
variation
expression
genome location
sequence
gene function
strain genealogy
tumors
Hermansky-Pudlak
syndrome
orthologies
Information content spans from sequence to phenotype/disease
MGI Data Content, a few numbers
April, 2010
Genes (including uncloned mutants)
36,691
Genes w/ nucleotide sequence
29,108
Genes annotated to GO
25,620
Total mouse GO annotations
223,558
Mouse/human orthologs
17,846
Mouse/rat orthologs
16,776
Phenotypic alleles in mice
24,007
Genes with mutant alleles in mice
12,427
Mutant alleles in cell lines only
541,172
Total phenotype annotations (Mammalian Phenotype, MP)
182,139
QTL
4,436
Human diseases w/ one or more mouse models
1,005
Gene Expression Assays
37,584
Integrated mouse nucleotide sequences + ESTs
>9,994,000
refSNPs
>10,089,000
References
153,161
…plus strains, expression and phenotype images, tumor records, etc.
Integration in MGI
• Identify objects.
• Resolve discrepancies.
Integration is key to
knowledge discovery
Integration is hard…not just a matter of
combining data sources…
• Data from multiple sources can be of differing quality
• The same data can enter the system via various paths
• Naming conventions may or may not be to standards
• Some data sources don’t maintain unique accession numbers (or allow
them to change)
• Periodic updates from data sources can cause problems
• if objects have disappeared… (or reappeared)
• if objects have split in two
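The update problem above can be made concrete with a small sketch. It compares the accession IDs seen in two successive loads of one source and reports which objects disappeared, which are new, and which reappeared after being retired. The function name, the `retired` set, and the `ACC:` identifiers are hypothetical and only illustrate the idea; this is not MGI's load code.

```python
def diff_snapshots(previous, current, retired=frozenset()):
    """Compare accession IDs from two loads of the same data source."""
    previous, current, retired = set(previous), set(current), set(retired)
    added = current - previous
    return {
        # Present in the last load, missing from this one.
        "disappeared": sorted(previous - current),
        # Never seen before.
        "new": sorted(added - retired),
        # Retired in an earlier load, now back again.
        "reappeared": sorted(added & retired),
    }


# Hypothetical accession IDs, for illustration only.
report = diff_snapshots(
    previous={"ACC:1", "ACC:2", "ACC:3"},
    current={"ACC:1", "ACC:3", "ACC:4", "ACC:9"},
    retired={"ACC:9"},
)
print(report)
```

Detecting a split needs more than ID sets, since it depends on how the source reports the old-to-new mapping, so it is left out of this sketch.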
Literature &
Loads
New Gene,
Strain or
Sequence?
Controlled
Vocabularies
Evidence &
Citation
Co-curation of shared
objects and concepts
Annotation Pipeline
• Data Acquisition
• Object Identity
• Standardizations
• Data Associations
• Integration with other
bioinformatics resources
Data integration is hard
• “Bucketizing”
establishes types of
correspondence between
objects in the input sets.
• Allows immediate
incorporation of 1:1
corresponding data.
• Sorts conflicting data
into bins that allow
prioritization for curator
resolution.
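One possible reading of "bucketizing" is sketched below, assuming two input sets A and B and a list of asserted links between their records. Records are grouped into connected components, and each component is labelled by how many records it holds on each side (1:1, 1:n, n:1, n:m, 1:0, 0:1). The function names and the IDs are hypothetical; the actual MGI algorithm is not described on the slide.

```python
from collections import defaultdict


def _size(n):
    """Summarize a count as '0', '1' or 'n'."""
    return "0" if n == 0 else "1" if n == 1 else "n"


def bucketize(a_ids, b_ids, links):
    """Sort records from two input sets into correspondence buckets.

    links is an iterable of (a_id, b_id) pairs asserting that the two
    records describe the same object.
    """
    neighbours = defaultdict(set)
    for a, b in links:
        neighbours[("A", a)].add(("B", b))
        neighbours[("B", b)].add(("A", a))

    buckets = defaultdict(list)
    seen = set()
    nodes = [("A", a) for a in sorted(a_ids)] + [("B", b) for b in sorted(b_ids)]
    for start in nodes:
        if start in seen:
            continue
        # Collect every record reachable from this one through links.
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(neighbours[node] - component)
        seen |= component
        a_side = sorted(i for side, i in component if side == "A")
        b_side = sorted(i for side, i in component if side == "B")
        label = f"{_size(len(a_side))}:{_size(len(b_side))}"
        if label == "n:n":
            label = "n:m"
        buckets[label].append((a_side, b_side))
    return dict(buckets)


# Hypothetical record IDs, for illustration only.
buckets = bucketize(
    a_ids={"a1", "a2", "a3"},
    b_ids={"b1", "b2", "b3", "b4"},
    links=[("a1", "b1"), ("a2", "b2"), ("a2", "b3")],
)
print(buckets)
```

In this sketch only the "1:1" bucket would be loaded automatically; every other bucket is a work list for curators.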
VEGA annotated three distinct genes instead of multiple
transcripts for a single gene (Mvk)
chr5:114705285-114721583
Why resolve and integrate data?
1. Allows you to find all the data:
Example: I want all the sequences from GenBank that are from C57BL/6
There are >100 different versions of this strain name in GenBank files, e.g.
B6
BL/6
C57BL76J
57BL/6J
Black-6
JB6
C57Black/6
black six …..ETC…
Example: You find several papers describing different phenotypes of
knockouts of the Fgfr2 gene. The knockout alleles are just called Fgfr2-/-. Help!
There are 14 different targeted alleles of Fgfr2 (knockout/knockin). Each has a
unique symbol, an MGI ID, and its own phenotype annotations, and the alleles model
different human diseases. All are associated with their respective references.
MGI has curated these data. You can ask these questions!
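The strain-name example can be sketched as a curated synonym table plus light text normalization. The table below holds only a handful of the variants listed on the slide; the function name and the choice to return None for unrecognized text are assumptions made for illustration.

```python
# A few of the variants listed above, mapped to the official strain name.
# A real table would hold all of the >100 known variants.
STRAIN_SYNONYMS = {
    "c57bl/6": "C57BL/6",
    "b6": "C57BL/6",
    "bl/6": "C57BL/6",
    "black-6": "C57BL/6",
    "c57black/6": "C57BL/6",
    "black six": "C57BL/6",
}


def normalize_strain(raw):
    """Return the official strain name, or None if the text is unrecognized."""
    # Collapse runs of whitespace and ignore case before the lookup.
    key = " ".join(raw.split()).lower()
    return STRAIN_SYNONYMS.get(key)


print(normalize_strain("  Black  Six "))
print(normalize_strain("not a strain"))
```

Returning None rather than guessing keeps unrecognized names visible, so they can be queued for a curator instead of being silently mis-assigned.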
Why resolve and integrate data?
2. Allows you to discriminate ambiguous data
Example: I want information for mouse gene Tap
Which gene? There are 5 genes published as Tap. Each of these genes has Tap
as a synonym.
Chr 15 Ly6a , lymphocyte antigen 6 complex, locus A
Chr 19 Nxf1, nuclear RNA export factor 1 homolog (S. cerevisiae)
Chr 11 Sec14l2, SEC14-like 2 (S. cerevisiae)
Chr 17 Tap1, transporter 1, ATP-binding cassette, sub-family B (MDR/TAP)
Chr 5 Uso1, USO1 homolog, vesicle docking protein (yeast)
P.S. Gene Gnas has 20 synonyms
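The Tap example amounts to a synonym index in which one name can point to several genes. The sketch below builds such an index from the five genes listed above; the variable and function names are hypothetical.

```python
from collections import defaultdict

# (chromosome, official symbol) pairs from the example above.
# Each of these genes has "Tap" as a synonym.
GENES_WITH_TAP_SYNONYM = [
    ("15", "Ly6a"),
    ("19", "Nxf1"),
    ("11", "Sec14l2"),
    ("17", "Tap1"),
    ("5", "Uso1"),
]

synonym_index = defaultdict(list)
for chromosome, symbol in GENES_WITH_TAP_SYNONYM:
    # Index each gene under its official symbol and under the shared synonym.
    synonym_index[symbol.lower()].append((symbol, chromosome))
    synonym_index["tap"].append((symbol, chromosome))


def lookup(name):
    """Return every (symbol, chromosome) a name could refer to."""
    return synonym_index.get(name.lower(), [])


print(lookup("Tap"))   # ambiguous: five candidate genes
print(lookup("Tap1"))  # unambiguous: one gene
```

A lookup that returns a list, not a single hit, makes the ambiguity explicit to the user.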
Why resolve and integrate data?
3. In addition to object identification issues,
integration allows you to ask complex questions that
span data sets and data types from different
sources:
Example:
What genes on Chromosome 11 have mutant alleles that display
phenotypes of hydrocephaly and hypertension?
Example:
Provide me with a list of RefSeq IDs where the gene corresponding to
the sequence shows expression in embryos at 13.5-15 days and is
involved in the biological process (GO) of apoptosis.
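Once objects share identifiers, a question like the Chromosome 11 example becomes a join across data sets. The sketch below uses an in-memory SQLite database with two toy tables. The schema, the gene symbols (GeneA, GeneB, GeneC), and the rows are invented for illustration and do not reflect MGI's schema or data; phenotypes are also attached directly to genes here, where MGI annotates them to alleles and genotypes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE gene (symbol TEXT PRIMARY KEY, chromosome TEXT);
    CREATE TABLE phenotype (symbol TEXT, term TEXT);
""")
# Invented rows, for illustration only.
conn.executemany("INSERT INTO gene VALUES (?, ?)", [
    ("GeneA", "11"), ("GeneB", "11"), ("GeneC", "2"),
])
conn.executemany("INSERT INTO phenotype VALUES (?, ?)", [
    ("GeneA", "hydrocephaly"), ("GeneA", "hypertension"),
    ("GeneB", "hydrocephaly"),
    ("GeneC", "hydrocephaly"), ("GeneC", "hypertension"),
])

# Genes on Chromosome 11 annotated with BOTH phenotype terms.
rows = conn.execute("""
    SELECT g.symbol
    FROM gene AS g
    JOIN phenotype AS p ON p.symbol = g.symbol
    WHERE g.chromosome = '11'
      AND p.term IN ('hydrocephaly', 'hypertension')
    GROUP BY g.symbol
    HAVING COUNT(DISTINCT p.term) = 2
    ORDER BY g.symbol
""").fetchall()
matches = [symbol for (symbol,) in rows]
print(matches)
```

The join only works because both tables use the same gene identifier, which is the point of resolving object identity first.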
Integration requires consistent semantics
Controlled vocabularies/nomenclatures
• Strains
• Genes
• Alleles (phenotypic or variant)
• Classes of genetic markers
• Types of mutations
• Types of assays
• Developmental stages
• Tissues
• Clone libraries
• ES cell lines
….. organized as lists or simple hierarchies
Assay Type
Gene nomenclature
Strain
Age
Results
Ldb1 (LIM domain
binding 1) gene
expression in CD-1 mice
Semantics plus relationship data
Ontologies/structured vocabularies
• Gene Ontology (GO)
• Molecular function
• Biological process
• Cellular component
• Mouse Anatomy (MA)
• Embryonic
• Adult
• Mammalian Phenotype (MP)
• Sequence Ontology (SO)
….. organized as directed acyclic graphs (DAGs)
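What a DAG adds over a list or simple hierarchy is that a term may have more than one parent, so a query for a general term can also pick up annotations made to any of its descendants. The sketch below walks a toy DAG to collect all ancestors of a term. The term names are placeholders, not real GO, MA, MP or SO identifiers.

```python
# Toy DAG: each term maps to its parents. "term:C" has two parents,
# which a simple hierarchy could not express.
PARENTS = {
    "term:root": [],
    "term:A": ["term:root"],
    "term:B": ["term:root"],
    "term:C": ["term:A", "term:B"],
    "term:D": ["term:C"],
}


def ancestors(term, parents=PARENTS):
    """Return every term reachable by following parent links upward."""
    found, stack = set(), list(parents[term])
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(parents[current])
    return found


print(sorted(ancestors("term:D")))
```

An annotation to "term:D" would then also count when a user searches for "term:A", "term:B", "term:C" or the root.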
Mammalian
Phenotype
Ontology
• Structured as DAG
• Over 7324 terms
covering physiological
systems, behavior,
development and
survival
• Available in browser
and in OBO file formats
from MGI ftp and OBO
Foundry sites
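For readers who want to work with the OBO file directly, the sketch below reads the `id`, `name` and `is_a` tags from `[Term]` stanzas. It is a minimal reader under simplifying assumptions: it ignores the file header, other stanza types, and all other tags, and the sample text uses made-up `EX:` identifiers, not real MP terms.

```python
SAMPLE_OBO = """\
format-version: 1.2

[Term]
id: EX:0000001
name: example root term

[Term]
id: EX:0000002
name: example child term
is_a: EX:0000001 ! example root term
"""


def parse_terms(text):
    """Return {term_id: {"name": ..., "is_a": [parent ids]}} for [Term] stanzas."""
    terms, current = {}, None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("["):
            # A new stanza begins; only [Term] stanzas are kept.
            current = {"is_a": []} if line == "[Term]" else None
        elif current is not None and ": " in line:
            tag, value = line.split(": ", 1)
            if tag == "id":
                terms[value] = current
            elif tag == "name":
                current["name"] = value
            elif tag == "is_a":
                # Drop the trailing "! comment" if present.
                current["is_a"].append(value.split("!")[0].strip())
    return terms


terms = parse_terms(SAMPLE_OBO)
print(terms)
```

The `is_a` lists give the parent links of the DAG, so the result can feed directly into a graph traversal.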
Annotating Gene Products using GO
Gene Product: P05147
GO Term: GO:0047519
Evidence: IDA
Reference: PMID:2976880
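The four parts of a GO annotation shown above map naturally onto a small record type. The sketch below uses the values from the slide; the class name, field names, and the rule that no field may be empty are assumptions for illustration. IDA is the GO evidence code "Inferred from Direct Assay".

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class GoAnnotation:
    gene_product: str  # e.g. a protein accession
    go_term: str       # GO term ID
    evidence: str      # GO evidence code
    reference: str     # citation supporting the annotation

    def __post_init__(self):
        # Every annotation must carry all four parts, including its
        # evidence code and citation.
        for field in fields(self):
            if not getattr(self, field.name):
                raise ValueError(f"missing {field.name}")


annotation = GoAnnotation(
    gene_product="P05147",
    go_term="GO:0047519",
    evidence="IDA",
    reference="PMID:2976880",
)
print(annotation)
```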
Data sources
Primary literature
Centers: mutagenesis, gene trap, etc.
Data Loads: GenBank, SNPs, clone collections, UniProt, RIKEN, IKMC, etc.
Electronic Submissions (individual labs)
Processing, QC, and curation
• Gather data from multiple sources
• Factor out common objects
• Assemble integrated objects
Data sourcing for MGI
• Data from major providers (e.g. Ensembl, UniProt)
and from data project Centers (e.g. gene trap, ENU
mutagenesis centers) are generally reliably
formatted, though data may still have QC issues.
Occasional changes in format can be frustrating.
• Data from individual research labs vary greatly in
file formats and adherence to nomenclature &
usually are handled on a case-by-case basis.
• Scientific literature is a reflection of individual labs
(largely), & must be treated as using non-standard
nomenclatures – but awareness is improving!
Data sourcing for MGI (…wishes)
• more user contributions
• pre-publication nomenclature assignments
• data submissions
(data can be held private until publication)
• journal permissions for images - have some
• in progress (collaborations on raw phenotype data
exchange with European and Japanese mouse
mutagenesis and knockout groups)
Building a mouse phenotyping data resource
• Large scale ENU mutagenesis programs worldwide - continuing
• Large scale gene trap programs (International Gene Trap Consortium)
www.genetrap.org - gene trap cell lines loaded, with Lexicon
• International Knockout Mouse Consortium (IKMC)
• KOMP – Knockout Mouse Project (USA) www.knockoutmouse.org
• EUCOMM – European Conditional Mouse Mutagenesis www.eucomm.org
• NorCOMM – North American Conditional Mouse Mutagenesis
http://norcomm.phenogenomics.ca
• Texas Institute for Genomic Medicine Knockouts www.tigm.org
• Collaborative Cross www.complextrait.org
• Literature and lab submissions
• New recombinase (cre, flp, etc) and reporter database is online and
data is being populated
BREADTH: Large
scale screen for
potential phenotypic
outliers
DEPTH: Phenotypic
description of mutant
genotype(s)
SUMMARY
Integration in MGI
• is accomplished through a combination of
automatic & semi-automatic loads & QC
processing, followed by manual curation.
• requires applying semantic consistency using
standard nomenclatures, ontologies and
structured vocabularies.
• provides users with the ability to find data that
would otherwise be missed or remain ambiguous.
• allows complex questions spanning different data
sets and data areas to be asked.
SUMMARY
Data Sourcing in MGI
• includes data from major genome resources and mouse
centers, as well as individual lab submissions and
curated information from scientific literature.
• requires QC processing for format consistency; some
individual labs need case-by-case assistance.
• for new large-scale phenotyping activities, integrate
data with common curation of MP ontology; connect
with raw data (international collaboration).
• continue to work with community and journals to allow
easier data access.
Bar Harbor, Maine
MGI is funded by:
NHGRI grants HG000330, HG002273, HG003622
NICHD grant HD033745
NCI grant CA089713
www.informatics.jax.org