Creation and Maintenance of GeneKeyDB


Creation and Maintenance of GeneKeyDB
Research being conducted by Kevin Kastner
Under the direction of Dr. Erich Baker
The Problem
• There exist thousands of biomedical data sources.
• In 2006, there were ~557 relevant public resources in molecular biology.
• This number is growing rapidly.
  - 203 sources in 1999
  - 226 sources in 2000
  - 277 sources in 2001
The Problem
• Traditional database approaches are too structured.
• Scientific objects change identifiers over time.
  - Gene names change over time.
• The Human Genome Nomenclature Database (HUGO) contains 13,594 active symbols, 9,635 literature aliases, and 2,739 withdrawn symbols.
  - SIR2L1 (withdrawn) is a synonym for SIRT1 and sir2-like 1.
Scientific Object Identities
Hugo Name   GDB       GenAtlas  OMIM  GeneCards  LocusLink
TP53        1         33        52    22         13
P53         1 (same)  17        188   69         63
SIRT1       1         0         5     1          2
SIR2L1      0         0         1     1 (same)   1 (same)
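
The identifier drift shown above (the withdrawn symbol SIR2L1 versus the active SIRT1, or the alias P53 versus TP53) can be sketched as a small lookup step applied before any database search. This is a minimal illustration, not GeneKeyDB's own mechanism: the `ALIASES` table and the `resolve_symbol` function are hypothetical names, and only the symbols themselves come from the slides.

```python
# Hypothetical alias table. The symbols come from the slides; the structure
# and the function name are assumptions made for illustration.
ALIASES = {
    "SIR2L1": "SIRT1",  # withdrawn symbol -> active symbol
    "P53": "TP53",      # literature alias -> active symbol
}


def resolve_symbol(symbol: str) -> str:
    """Return the active symbol for a withdrawn symbol or alias.

    Matching is case-insensitive; unknown symbols are returned upper-cased
    and otherwise unchanged.
    """
    key = symbol.strip().upper()
    return ALIASES.get(key, key)


if __name__ == "__main__":
    for name in ("SIR2L1", "p53", "SIRT1"):
        print(name, "->", resolve_symbol(name))
```

Resolving names first means that a search for a withdrawn symbol and a search for its active symbol reach the same record, which is the behavior the "(same)" entries in the table point at.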
The Solution
• GeneKeyDB
  - A gene-centered relational database developed to enhance data mining in biological data sets.
  - GeneKeyDB relies primarily on existing database identifiers derived from community databases (NCBI, GO, Ensembl, and others) as well as the known relationships among those identifiers.
• Version 1 is already out!
  - http://www.biomedcentral.com/1471-2105/6/72
Weaknesses of Version 1
• It can no longer be updated.
• Obtaining the desired information requires complex queries against the database.
Complex Queries
SELECT ll_xp_cdd.cdd_name, ll_np_cdd.cdd_name, organism
FROM ll_xp_cdd, ll_np_cdd, ll_locus
WHERE ll_xp_cdd.cdd_score = ll_np_cdd.cdd_score
  AND ll_id IN
      (SELECT ll_id
       FROM ll_refseq_xm
       WHERE ll_refseq_xm_id IN
             (SELECT ll_refseq_xm_id
              FROM ll_xp_cdd, ll_np_cdd
              WHERE ll_xp_cdd.cdd_score = ll_np_cdd.cdd_score))
  AND ll_id IN
      (SELECT ll_id
       FROM ll_refseq_nm
       WHERE ll_refseq_nm_id IN
             (SELECT ll_refseq_nm_id
              FROM ll_xp_cdd, ll_np_cdd
              WHERE ll_xp_cdd.cdd_score = ll_np_cdd.cdd_score));
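
To make the query above concrete, the sketch below runs it against a toy in-memory SQLite database. The schema is an assumption: which table holds which column is inferred from the query text alone, not from the real GeneKeyDB schema, and every row is made up. The helper name `run_slide_query` is likewise hypothetical.

```python
import sqlite3

# Assumed minimal schema, inferred from the query on the slide.
SCHEMA = """
CREATE TABLE ll_locus     (ll_id INTEGER PRIMARY KEY, organism TEXT);
CREATE TABLE ll_refseq_xm (ll_id INTEGER, ll_refseq_xm_id TEXT);
CREATE TABLE ll_refseq_nm (ll_id INTEGER, ll_refseq_nm_id TEXT);
CREATE TABLE ll_xp_cdd    (ll_refseq_xm_id TEXT, cdd_name TEXT, cdd_score REAL);
CREATE TABLE ll_np_cdd    (ll_refseq_nm_id TEXT, cdd_name TEXT, cdd_score REAL);
"""

# Made-up rows for illustration only.
TOY_ROWS = {
    "ll_locus": [(1, "Homo sapiens"), (2, "Mus musculus")],
    "ll_refseq_xm": [(1, "XM_1"), (2, "XM_2")],
    "ll_refseq_nm": [(1, "NM_1")],
    "ll_xp_cdd": [("XM_1", "domA", 42.0), ("XM_2", "domC", 7.0)],
    "ll_np_cdd": [("NM_1", "domB", 42.0)],
}

# The query from the slide, unchanged apart from layout.
QUERY = """
SELECT ll_xp_cdd.cdd_name, ll_np_cdd.cdd_name, organism
FROM ll_xp_cdd, ll_np_cdd, ll_locus
WHERE ll_xp_cdd.cdd_score = ll_np_cdd.cdd_score
  AND ll_id IN
      (SELECT ll_id FROM ll_refseq_xm
       WHERE ll_refseq_xm_id IN
             (SELECT ll_refseq_xm_id FROM ll_xp_cdd, ll_np_cdd
              WHERE ll_xp_cdd.cdd_score = ll_np_cdd.cdd_score))
  AND ll_id IN
      (SELECT ll_id FROM ll_refseq_nm
       WHERE ll_refseq_nm_id IN
             (SELECT ll_refseq_nm_id FROM ll_xp_cdd, ll_np_cdd
              WHERE ll_xp_cdd.cdd_score = ll_np_cdd.cdd_score))
"""


def run_slide_query():
    """Build the toy database, run the slide's query, and return its rows."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(SCHEMA)
        for table, rows in TOY_ROWS.items():
            placeholders = ", ".join("?" * len(rows[0]))
            conn.executemany(f"INSERT INTO {table} VALUES ({placeholders})", rows)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for row in run_slide_query():
        print(row)
```

Even with five tiny tables, the user has to know every table name and how the identifiers chain together, which is the weakness this slide illustrates.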
Current Research
• Creation of APIs that validate the data in the database and make querying much easier for the user.
• One-step updating of the database and the information it contains.
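
The slides do not say which checks the validation APIs perform. As one plausible example, the sketch below looks for rows whose `ll_id` has no matching entry in `ll_locus`. The function names `find_orphans` and `make_demo_db`, the two-table schema, and the rows are all assumptions made for illustration.

```python
import sqlite3


def find_orphans(conn, child_table, parent_table="ll_locus", key="ll_id"):
    """Return sorted key values in child_table that have no row in parent_table.

    Table and column names are interpolated into the SQL text, so pass only
    trusted, hard-coded names.
    """
    sql = (
        f"SELECT DISTINCT c.{key} FROM {child_table} AS c "
        f"LEFT JOIN {parent_table} AS p ON p.{key} = c.{key} "
        f"WHERE p.{key} IS NULL ORDER BY c.{key}"
    )
    return [row[0] for row in conn.execute(sql)]


def make_demo_db():
    """Create an in-memory database containing one deliberately orphaned row."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE ll_locus     (ll_id INTEGER PRIMARY KEY, organism TEXT);
        CREATE TABLE ll_refseq_nm (ll_id INTEGER, ll_refseq_nm_id TEXT);
        INSERT INTO ll_locus VALUES (1, 'Homo sapiens');
        INSERT INTO ll_refseq_nm VALUES (1, 'NM_1');
        INSERT INTO ll_refseq_nm VALUES (99, 'NM_99');
        """
    )
    return conn


if __name__ == "__main__":
    demo = make_demo_db()
    print("orphaned ll_id values:", find_orphans(demo, "ll_refseq_nm"))
    demo.close()
```

A check like this could run after each update, so that a broken load is caught before users query the data.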
API Alternative
// fxn(search_params, desired_info), returns ll_id
curated.cdd(score[ ], null)
curated_score[ ] ← score[ ]
locus_id1[ ] ← gaa.cdd((name[ ], score[ ]), score[ ])
gaa_name[ ] ← name[ ]
gaa_score[ ] ← score[ ]
locus_id2[ ] ← curated.cdd(name[ ], score[ ])
curated_name[ ] ← name[ ]
locus_id[ ] ← intersect(locus_id1[ ], locus_id2[ ])
locus(organism[ ], locus_id[ ])
print(gaa_name[ ], curated_name[ ], organism[ ])
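
One way to read the pseudocode above is sketched below in Python, chosen for brevity even though the slides plan the real APIs in Java. Everything here is an interpretation: `CddHit`, `CddSource`, and `shared_domain_report` are hypothetical names, the in-memory lists stand in for database tables, and pairing hits by locus and score is an assumption the pseudocode does not spell out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CddHit:
    """One conserved-domain hit attached to a locus (hypothetical record type)."""
    locus_id: int
    name: str
    score: float


class CddSource:
    """Hypothetical stand-in for the `curated` and `gaa` objects in the pseudocode."""

    def __init__(self, hits):
        self._hits = list(hits)

    def cdd(self, scores=None):
        """Return all hits, or only the hits whose score is in `scores`."""
        if scores is None:
            return list(self._hits)
        wanted = set(scores)
        return [hit for hit in self._hits if hit.score in wanted]


def intersect(first, second):
    """Return the sorted values present in both iterables."""
    return sorted(set(first) & set(second))


def shared_domain_report(curated, gaa, organisms):
    """Follow the pseudocode: match scores, intersect loci, attach the organism."""
    curated_scores = {hit.score for hit in curated.cdd()}
    gaa_hits = gaa.cdd(curated_scores)
    curated_hits = curated.cdd({hit.score for hit in gaa_hits})
    locus_ids = intersect(
        (hit.locus_id for hit in gaa_hits),
        (hit.locus_id for hit in curated_hits),
    )
    return [
        (g.name, c.name, organisms[locus_id])
        for locus_id in locus_ids
        for g in gaa_hits
        for c in curated_hits
        if g.locus_id == locus_id and c.locus_id == locus_id and g.score == c.score
    ]


if __name__ == "__main__":
    curated = CddSource([CddHit(1, "domB", 42.0), CddHit(3, "domD", 5.0)])
    gaa = CddSource([CddHit(1, "domA", 42.0), CddHit(2, "domC", 7.0)])
    organisms = {1: "Homo sapiens", 2: "Mus musculus", 3: "Rattus norvegicus"}
    for row in shared_domain_report(curated, gaa, organisms):
        print(row)
```

The caller never names a table or writes a join. That bookkeeping lives inside the API objects, which is the simplification the slide argues for.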
External Implementations
• Some other databases have APIs as well.
  - Ensembl, whose APIs are written in Perl.
• APIs for GeneKeyDB will be written in Java.
  - More structured language.
  - Easier to read.
The Future of GeneKeyDB
• GeneKeyDB will join together even more widely used external databases.
• Code for updating GeneKeyDB will tie into database information that changes in expected ways.
  - Lowers the required number of code rewrites.
• GeneKeyDB will be dynamically updated.
The Future of GeneKeyDB
• APIs will also be written in Perl.
  - Perl is used often, almost exclusively, by biologists.
  - The Perl APIs can tie into the Java APIs, rather than being written from scratch.
Comments? Questions?
• http://genereg.ornl.gov/gkdb/