Some computing solutions to your data problems

Data integration
via XML
Ela Hunt
John Wilson
Vangelis Pafilis
Inga Tulloch
http://xtect.cis.strath.ac.uk/
Overview
• Four biological scenarios of data integration
• Data integration - problem definition
• XTECT indexing approach
• Literature review
• Current status and further work
Hunt, Wilson, Pafilis and Tulloch, Glasgow
Scenario 1:
Cardiovascular Functional Genomics
• AIM: discover genes causing hypertension
• Rat animal models of hypertension (rat strains which
suffer from stroke)
• Microarrays are used to compare gene expression in sick
and healthy rats; typically 100-400 genes are differentially
expressed
• Microarray results are visualised on maps, and data are
interpreted using public web databases (browsing and
querying)
SyntenyVista
Scenario 2:
Mouse mammary gland development as a
model of cancer proliferation
• AIM: find genes active in cancer growth
• Take mouse samples and apply to a microarray slide
• Measure trends in gene expression, identify 400 genes
of interest
• Use public web databases to interpret information on
400 genes (interpreting 100 genes took 6 months, and
that information is now out of date)
Scenario 3:
Rat model of schizophrenia
• AIM: understand which genes are expressed during
schizophrenia
• Rats have symptoms of schizophrenia after a chemical
treatment (2 models are used)
• Measure gene expression in two models
• Interpret data on 250 genes: find if microarray probes
correspond to genes by using BLAST (DNA sequence
comparison) and PubMed (bibliographic database)
• Gather DNA sequences for real genes from Ensembl
(BLAST hits), design probes
Scenario 4:
Proteomics
• AIM: understand and record protein functions
• Case 1: study the proteome of Trypanosoma brucei. For
all proteins identified, find information on the web which
might shed light on their function
• Case 2: interpret data on human proteins differentially
expressed in human cells invaded by Toxoplasma
gondii.
• Compare protein and gene expression
• Use SwissProt, PubMed, GeneOntology and any other
web resources
Problem definition
• Given a large microarray or proteomics experiment (a
list of gene names or peptide masses)
• Find all known information about those genes or proteins
on the web
• Make this information accessible
What we expect to achieve
Query: table of names
Result 1: table of integrated information
Result 2: map of probes and synteny
Result 3: clusters based on the number of relevant query
terms found
• Start with item matching on XML leaves
• Match starting from leaves and extend towards the
schemas expressed as paths
• Use database techniques - indexing
• Use data mining techniques – get statistics on data
More detail
• Index all paths and leaves in XML trees for a
representative set of biological databases
• Relational technology
• Warehouse
• Match leaves (data values)
• Find path overlaps => remove redundancies in data
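Indexing all paths and leaves, as listed above, can be sketched as follows. This is a minimal illustration and not the XTECT implementation: the sample record, its tag names and the helper `leaf_paths` are hypothetical, and only Python's standard `xml.etree.ElementTree` module is used.

```python
import xml.etree.ElementTree as ET

# Hypothetical record; the tag names are illustrative, not a real database schema.
SAMPLE = """
<entry>
  <gene><name>socs3</name><synonym>cis3</synonym></gene>
  <reference><pubmed_id>12345</pubmed_id></reference>
</entry>
""".strip()

def leaf_paths(element, prefix=""):
    """Yield (path, value) for every non-empty leaf element, in document order."""
    path = f"{prefix}/{element.tag}"
    children = list(element)
    if not children:
        text = (element.text or "").strip()
        if text:
            yield path, text
    for child in children:
        yield from leaf_paths(child, path)

index = list(leaf_paths(ET.fromstring(SAMPLE)))
for path, value in index:
    print(path, "=>", value)
```

Each (path, value) pair could then become one row in a relational table, which is what makes leaf matching across databases a plain lookup.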
First problem solved:
query expansion
• 30K human, 30K rat, and 30K mouse genes, some of
which have synonyms
• Query expansion to include the synonyms
• Prototype in Java, 300 ms for synonym lookup
• Same idea as in GeneCards which focuses on human
data
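Query expansion with synonyms can be illustrated with a small in-memory lookup. The prototype described above was written in Java; the Python sketch below is only an illustration of the idea, and the synonym groups and the function names `build_lookup` and `expand_query` are hypothetical.

```python
from collections import defaultdict

# Hypothetical synonym groups; real groups would come from gene nomenclature data.
SYNONYM_GROUPS = [
    {"socs3", "cis3", "ssi-3"},
    {"agene1", "ag1"},
]

def build_lookup(groups):
    """Map every lower-cased name to the full set of names in its group."""
    lookup = defaultdict(set)
    for group in groups:
        names = {name.lower() for name in group}
        for name in names:
            lookup[name] |= names
    return lookup

def expand_query(names, lookup):
    """Return the query names plus all known synonyms, sorted for stable output."""
    expanded = set()
    for name in names:
        key = name.lower()
        # Names without a known group are kept unchanged.
        expanded |= lookup.get(key, {key})
    return sorted(expanded)

LOOKUP = build_lookup(SYNONYM_GROUPS)
print(expand_query(["SOCS3", "unknown1"], LOOKUP))
```

Building the lookup once and answering each name with a dictionary access keeps per-name cost constant, which matters when a query is a list of several hundred gene names.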
Second – indexing XML
• Medline (40 GB) in XML (bibliographic)
• SwissProt + Trembl, 1 GB in XML (proteins)
• OMIM and HUGO databases of genes, small (human
diseases and human genes)
• Affymetrix microarray files for the mouse, small, XML
• Ensembl – no XML files, access via MySQL (human,
mouse, rat genomes and predicted genes)
• Mouse Genome MGD – direct access to Sybase, no
XML
• Rat database RGD – stores little data!
• Gene Ontology – around 1GB in XML
• Paths and tags indexed using integer encoding,
preserving XML order
• Indexing of Medline and OMIM needs to be resolved
(text + XML)
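One possible reading of integer encoding that preserves XML order is sketched below. The scheme is an assumption made for illustration, not the encoding used in XTECT: each distinct tag receives a small integer, a path becomes a tuple of tag integers, and each node receives its pre-order position so that sorting by position restores document order. The function `encode` and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

def encode(root):
    """Return (tag_ids, rows); rows are (position, path_as_tag_ids, leaf_text)."""
    tag_ids = {}   # tag name -> integer, assigned when the tag is first seen
    rows = []
    counter = 0    # pre-order position of the node being visited

    def visit(node, prefix):
        nonlocal counter
        tag_id = tag_ids.setdefault(node.tag, len(tag_ids) + 1)
        path = prefix + (tag_id,)
        position = counter
        counter += 1
        text = (node.text or "").strip()
        if len(node) == 0 and text:
            rows.append((position, path, text))
        for child in node:
            visit(child, path)

    visit(root, ())
    return tag_ids, rows

doc = ET.fromstring(
    "<entry><gene><name>socs3</name></gene><gene><name>agene1</name></gene></entry>"
)
tag_ids, rows = encode(doc)
for row in rows:
    print(row)
```

Both leaves share the same encoded path, so a path index needs one entry per distinct path, while the position column keeps the two occurrences apart and in order.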
How the index will work
• PubMed: accession = 12345; abstract = ".. interactions of
agene1 with agene2 ..."
• Swiss-Prot: PubMedID = 12345; GeneName = agene1
• Swiss-Prot/PubMedID ~ PubMed/accession
• Swiss-Prot/GeneName ~ PubMed/abstract
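The Swiss-Prot and PubMed example on this slide can be reproduced with a small relational index. The sketch below uses Python's standard `sqlite3` module as a stand-in for the warehouse database; the table layout, the helper `add_leaf` and the choice to split text into word tokens are assumptions made for illustration, since the slides leave the treatment of free text open.

```python
import re
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE leaf (db TEXT, path TEXT, token TEXT)")
conn.execute("CREATE INDEX leaf_token ON leaf (token)")

def add_leaf(db, path, value):
    """Store one row per lower-cased word, so free text can match single names."""
    for token in set(re.findall(r"\w+", value.lower())):
        conn.execute("INSERT INTO leaf VALUES (?, ?, ?)", (db, path, token))

# Values taken from the slide's example.
add_leaf("Swiss-Prot", "PubMedID", "12345")
add_leaf("Swiss-Prot", "GeneName", "agene1")
add_leaf("PubMed", "accession", "12345")
add_leaf("PubMed", "abstract", ".. interactions of agene1 with agene2 ...")

# A self-join on the token column finds paths in different databases
# that share a value.
matches = conn.execute(
    """
    SELECT a.db || '/' || a.path, b.db || '/' || b.path, a.token
    FROM leaf AS a JOIN leaf AS b ON a.token = b.token
    WHERE a.db = 'Swiss-Prot' AND b.db = 'PubMed'
    ORDER BY a.path, b.path
    """
).fetchall()
for left, right, token in matches:
    print(left, "~", right, "via", token)
```

The join returns the two correspondences shown on the slide: the gene name is found inside the abstract, and the PubMed identifier equals the accession.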
Matching
• Db1/path1/socs3 and Db2/path2/socs3 =>
synonymous paths
• Get statistics for full and partial path matches and
postulate schema matches
• Manually inspect the matched paths, and examine
support for each path match
• Automate the procedure
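Gathering statistics on path matches could look like the sketch below. The slides do not name a statistic, so the Jaccard overlap of leaf values and the 0.5 threshold are assumptions, and the leaves, database names and the function `path_match_support` are hypothetical.

```python
from collections import defaultdict
from itertools import product

# Hypothetical (database, path, value) leaves.
LEAVES = [
    ("Db1", "path1", "socs3"), ("Db1", "path1", "agene1"), ("Db1", "path1", "agene2"),
    ("Db2", "path2", "socs3"), ("Db2", "path2", "agene1"),
    ("Db2", "path3", "12345"),
]

def path_match_support(leaves, db_a, db_b):
    """Score each cross-database path pair by the Jaccard overlap of its values."""
    values = defaultdict(set)
    for db, path, value in leaves:
        values[(db, path)].add(value)
    paths_a = [key for key in values if key[0] == db_a]
    paths_b = [key for key in values if key[0] == db_b]
    scores = {}
    for a, b in product(paths_a, paths_b):
        shared = values[a] & values[b]
        if shared:
            scores[(a[1], b[1])] = len(shared) / len(values[a] | values[b])
    return scores

scores = path_match_support(LEAVES, "Db1", "Db2")
# Path pairs above the (arbitrary) threshold become candidates for manual inspection.
candidates = [pair for pair, score in scores.items() if score >= 0.5]
print(scores, candidates)
```

Here `path1` and `path2` share two of three distinct values and are postulated as a match, while `path3` shares nothing with `path1` and is never scored.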
Architecture
[Figure: architecture diagram with three labelled layers:
INTERACTION, PROCESSING LAYER and WAREHOUSE. Components shown:
microarray experiment, proteomics experiment, list of names,
synonym expander, XML tree finder, XML tree merger,
visualisation, index, data replicas, gene trees, mapping
generation and lookup. XML sources: PubMed, Sprot, Affy, OMIM,
Hugo.]
Status
• Mirroring external XML data
• Query expansion is implemented
• Software to XMLise OMIM and some of the
MGD
• Testing indexing software for loading into Oracle
• Designing an algorithm for data mining
• Developing ideas on adding sequence
comparison and text retrieval, and connecting to
visualisation tools (collaboration with e-Science
project BRIDGES)
The vision
• To tabular summaries
• To sequence
• To multiple alignment
Other work
• Schema-based approaches: look at the schemas to find
mappings between them
– use constraints, tree shape, some data
– involve the user/programmer: YATL, Clio, REVERE
• Data-based approaches: look at data values in order to
find mappings between attributes
– machine learning (ML) approaches are inefficient: they
compare all-against-all
• Problems:
– Expensive in terms of labour (programmer or user)
– Only very similar schemas can be matched
– Not scalable
Recent papers
• Kurgan et al., 2002, machine learning for schema
matching (2 very similar schemas)
• Doan et al., VLDBJ03, machine learning, 2 semistructured schemas (ontologies), schemas + some data
• Chua et al., VLDBJ03, (RDBMS) given entity matches
(table names), match attributes (values), based on a
variety of statistical tests
• Halevy et al., CIDR-2003, user-driven schema matching
by example, and mapping by transitivity (no algorithm
is given)
Summary
• Aim: overcome the problems of manual or schema-based
mapping approaches, which are expensive
• Scale up, take into account data values
• Provide a digest of information for a list of
gene/protein names of interest
• Using XML and relational indexes
Collaborators at Glasgow
Barry Gusterson
Vangelis Pafilis
Andy Jones
Torsten Stein
Inga Tulloch
Catherine Winchester
Anna F. Dominiczak
Neil Hanlon
BRIDGES project (uses DB2)
FUNDING:
Carnegie Trust for the Universities of Scotland
Medical Research Council (UK)
Royal Society
Synergy
John Wilson