Curated Databases

Download Report

Transcript Curated Databases

Curated Databases
Peter Buneman
School of Informatics
University of Edinburgh
The Population of Corfu
113.479 (2001)
http://www.corfunext.com/corfu_geography.htm
107,879 (as of 2001 )
http://en.wikipedia.org/wiki/Corfu ***
93,000
http://www.corfunet.com/corfu/
109,512
www.agni.gr/
110,000
www.corfuvisit.net
70,000
http://www.newadvent.org/cathen/04362a.htm
107,600
http://www.greek-hotels.com/
105,043
http://www.merriam-webster.com/dictionary/corfu
approximately 110,000
www.kassiopi.com/MenuContent.aspx?MenuId=6
approximately 120.000
http://www.gardeno-corfu.com/
115,200 (2003 est)
http://encyclopedia.farlex.com/Corfu
around 110,000
http://www.sunshinetravel.gr/CORFUGUIDE/CORFU_TRAVEL_GUIDE 0-1.htm
110.000
http://www.dialashop.com/travel/corfu.html
about 110,00
http://www.argobenitses.gr/greece.php
97,102 in 1981
http://geography.howstuffworks.com/europe/corfu.htm
107,880
http://catalogue.horse21.net/greece+hotels/corfu+hotels/hotels5/luxury
109,512
http://www.corfu-property.gr/content/view/14/38/lang,en/
about 100,000
http://members.virtualtourist.com/m/6ce90/67541/
110,000 approximately
http://www.corfu-island.org/features.htm
107,000
http://www.nytimes.com/2009/09/11/greathomesanddestinations/11iht-recorfu.html
*** The only site to give attribution: http://www.statistics.gr/portal/page/portal/ESYE
ECDL
2
These are both curated databases
ECDL
3
What is a curated database?
• A curated database is one that is maintained with a lot of
human effort
• Curare: Latin “to care for”
• Prime concern is quality of data
ECDL
4
What is a database?
(for the purposes of this talk)
• Any structured collection of data that is subject to
change/revision
–
–
–
–
ECDL
Ontologies
XML and other structured text files
Structured wikis
Standard relational and object-oriented databases
5
Curated databases have interesting properties…
• A digital reference work. Traditional dictionaries, gazetteers,
encyclopedia have been replaced by curated databases.
• Value lies in the organization and annotation of data
• Commonly constructed by copying parts of other (curated) databases.
• Rapidly increasing in scientific research. (> 1000 in molecular biology)
• Constantly checked/verified. Data quality and timeliness are
important.
• Often group efforts. Produced by a dedicated organization or
collaboration.
• Increasingly seen as “publications” by scientists. (You get kudos if
someone uses your database – like a citation.)
ECDL
6
... and they are very expensive
In $/€/£ per byte
“Reliable” code / Curated data
“Production” code/Curated data
10
1
Book
[Movie]
10-1
10-3
Big physics (LHC) data
10-7
A change for the better?
Storage:
• Redundant
• Persistent
• Distributed
• Readable by people
Clear standards for citation
Historical record (old data is useful)
Well understood ownership/IP
Storage:
• Single-source
• Volatile
• Centralised
• Internal DBMS format
No standards for citation
No historical record
Mind-boggling legal issues
20th century libraries did some things better!
ECDL
8
Some computer science issues
•
•
•
•
Archiving (CS usage)
Provenance
Annotation/citation
Data cleaning
All of these are intimately connected.
For example, if you cite some part of a curated database,
the version you cited should be available (archiving)
ECDL
9
Some well-known curated databases
ID
AC
DT
DT
DT
DE
OS
OC
OC
RN
RP
RC
RX
RA
RL
RN
RP
RA
RL
CC
CC
CC
CC
CC
DR
DR
DR
KW
FT
FT
FT
FT
FT
FT
FT
FT
SQ
11SB_CUCMA
STANDARD;
PRT;
480 AA.
P13744;
01-JAN-1990 (REL. 13, CREATED)
01-JAN-1990 (REL. 13, LAST SEQUENCE UPDATE)
01-NOV-1990 (REL. 16, LAST ANNOTATION UPDATE)
11S GLOBULIN BETA SUBUNIT PRECURSOR.
CUCURBITA MAXIMA (PUMPKIN) (WINTER SQUASH).
EUKARYOTA; PLANTA; EMBRYOPHYTA; ANGIOSPERMAE; DICOTYLEDONEAE;
VIOLALES; CUCURBITACEAE.
[1]
SEQUENCE FROM N.A.
STRAIN=CV. KUROKAWA AMAKURI NANKIN;
MEDLINE; 88166744.
HAYASHI M., MORI H., NISHIMURA M., AKAZAWA T., HARA-NISHIMURA I.;
EUR. J. BIOCHEM. 172:627-632(1988).
[2]
SEQUENCE OF 22-30 ND 297-302.
OHMIYA M., HARA I., MASTUBARA H.;
PLANT CELL PHYSIOL. 21:157-167(1980).
-!- FUNCTION: THIS IS A SEED STORAGE PROTEIN.
-!- SUBUNIT: HEXAMER; EACH SUBUNIT IS COMPOSED OF AN ACIDIC AND A
BASIC CHAIN DERIVED FROM A SINGLE PRECURSOR AND LINKED BY A
DISULFIDE BOND.
-!- SIMILARITY: TO OTHER 11S SEED STORAGE PROTEINS (GLOBULINS).
EMBL; M36407; G167492; -.
PIR; S00366; FWPU1B.
PROSITE; PS00305; 11S_SEED_STORAGE; 1.
SEED STORAGE PROTEIN; SIGNAL.
SIGNAL
1
21
CHAIN
22
480
11S GLOBULIN BETA SUBUNIT.
CHAIN
22
296
GAMMA CHAIN (ACIDIC).
CHAIN
297
480
DELTA CHAIN (BASIC).
MOD_RES
22
22
PYRROLIDONE CARBOXYLIC ACID.
DISULFID
124
303
INTERCHAIN (GAMMA-DELTA) (POTENTIAL).
CONFLICT
27
27
S -> E (IN REF. 2).
CONFLICT
30
30
E -> S (IN REF. 2).
SEQUENCE
480 AA; 54625 MW; D515DD6E CRC32;
MARSSLFTFL CLAVFINGCL SQIEQQSPWE FQGSEVWQQH RYQSPRACRL ENLRAQDPVR
RAEAEAIFTE VWDQDNDEFQ CAGVNMIRHT IRPKGLLLPG FSNAPKLIFV AQGFGIRGIA
IPGCAETYQT DLRRSQSAGS AFKDQHQKIR PFREGDLLVV PAGVSHWMYN RGQSDLVLIV
FADTRNVANQ IDPYLRKFYL AGRPEQVERG VEEWERSSRK GSSGEKSGNI FSGFADEFLE
EAFQIDGGLV RKLKGEDDER DRIVQVDEDF EVLLPEKDEE ERSRGRYIES ESESENGLEE
TICTLRLKQN IGRSVRADVF NPRGGRISTA NYHTLPILRQ VRLSAERGVL YSNAMVAPHY
TVNSHSVMYA TRGNARVQVV DNFGQSVFDG EVREGQVLMI PQNFVVIKRA SDRGFEWIAF
KTNDNAITNL LAGRVSQMRM LPLGVLSNMY RISREEAQRL KYGQQEMRVL SPGRSQGRRE
//
CIA World Factbook
ECDL
Uniprot
10
Archiving / Database Preservation
• How do we preserve something that evolves (both in
content and structure)
• Keep snapshots?
– frequent: space consuming
– infrequent: lose “history”
Most curated databases have a hierarchical
structure that we can exploit…
ECDL
11
A Sequence of Versions
ECDL
12
Pushing time down
This relies on a deterministic / keyed model – there’s a
unique path to every data item.
[B., Khanna, Tajima, Tan, TODS 27,2 (2004)]
ECDL
13
An initial experiment
• Grabbed the last 20 available versions of Swissprot
• XML-ized all of them
• Also recorded all OMIM versions for about 14 weeks
(100 of them)
• Combined into archive XML format file by pushing
time down.
ECDL
14
Uncompressed
•
Archive size is
–
 1.01 times diff repository size
 1.04 times size of largest
version
Compressed
• archive size is between 0.94 and 1
times compressed diff repository size
• gzip - unix compression tool
• XMill - XML compression tool
–
Size (bytes) x 106
100 days of
OMIM
Legend
•archive
•inc diff
•version
•compressed inc diff
•compressed archive
gzip(inc diff)
XMill(archive)
ECDL
15
Uncompressed
• Archive size is
–
 1.08 times diff repository size
Size (bytes) x 106
~ 5 years of
UniProt
Legend
•archive
•inc diff
•version
•compressed inc diff
•compressed archive
 1.92 times size of largest
version
Compressed
archive size is between 0.59 and 1
times compressed diff repository size
–
•
•
ECDL
16
Snapshots are immediate.
Longitudinal/temporal queries are also easy
[1990-2006]
Factbook
*
*
Andorra
Liechtenstein
China
*
*
Demography
Economy
*
Population
[1990]
28,292
[1991]
28,476
…
*
[2006]
34,247
Plot, by year, the population of
Liechtenstein since 1990
ECDL
17
A Working System
• Implemented by Heiko Müller
• For scale, we require external sorting of large XML
files
•Designed and implemented by Ioannis Koltsidas
Heiko Müller and Stratis Viglas
• Has a simple temporal query language
• Experimented with recent (HTML) versions of CIA
world factbook
ECDL
18
What the archive looks like
<T t="2002-2007">
<FACTBOOK>
<COUNTRY>
<NAME>Afghanistan</NAME>
<CATEGORY>
<NAME>Communications</NAME>
<PROPERTY>
<NAME>Internet users</NAME>
<TEXT>
<T t="2004-2005">1,000 (2002)</T>
<T t="2006-2007">30,000 (2005)</T>
<T t="2002-2003">NA</T>
</TEXT>
</PROPERTY>
<PROPERTY>
<NAME>Radios</NAME>
<TEXT>167,000 (1999)</TEXT>
</PROPERTY>
<PROPERTY>
<NAME>Telephones - main lines in use</NAME>
<TEXT>
<T t="2006">100,000 (2005)</T>
<T t="2007">280,000 (2005)</T>
<T t="2002-2003">29,000 (1998)</T>
<T t="2004-2005">33,100 (2002)</T>
…
ECDL
19
How did the population of China
change from 2002-2007?
<T t="2002-2007">
<FACTBOOK>
<COUNTRY>
<CATEGORY>
<PROPERTY>
<NAME>Population</NAME>
<TEXT>
<T t="2002">1,284,303,705
<T t="2003">1,286,975,468
<T t="2004">1,298,847,624
<T t="2005">1,306,313,812
<T t="2006">1,313,973,713
<T t="2007">1,321,851,888
</TEXT>
</PROPERTY>
</CATEGORY>
</COUNTRY>
</FACTBOOK>
</T>
ECDL
(July
(July
(July
(July
(July
(July
2002
2003
2004
2005
2006
2007
est.)</T>
est.)</T>
est.)</T>
est.)</T>
est.)</T>
est.)</T>
20
How did land area of countries change in 2002-2007?
<T t="2002-2007">
<FACTBOOK KEY="">
…
<COUNTRY KEY="NAME Austria">
<CATEGORY KEY="NAME Geography">
<PROPERTY KEY="NAME Area">
<SUBPROP>
<NAME>land</NAME>
<TEXT>
<T t="2004-2007">82,444 sq km</T>
<T t="2002-2003">82,738 sq km</T>
</TEXT>
</SUBPROP>
</PROPERTY>
</CATEGORY>
</COUNTRY>
…
<COUNTRY KEY="NAME France">
<CATEGORY KEY="NAME Geography">
<PROPERTY KEY="NAME Area">
<SUBPROP>
<NAME>land</NAME>
<TEXT>
<T t="2002-2006">545,630 sq km</T>
<T t="2007">640,053 sq km; 545,630 sq km (metropolitan France)</T>
</TEXT>
… ECDL
21
What are the differences between the factbooks
on 21/08/2007 and 10/09/2007?
<T t="21/08/2007-10/09/2007">
<CIAWFB KEY="">
<COUNTRY KEY="NAME Afghanistan">
<CATEGORY KEY="NAME Communications">
<PROPERTY KEY="NAME Internet users">
<T t="21/08/2007">
<TEXT>30,000 (2005)</TEXT>
</T>
<T t="10/09/2007">
<TEXT>535,000 (2006)</TEXT>
</T>
</PROPERTY>
<PROPERTY KEY="NAME Telephones - mobile cellular">
<T t="21/08/2007">
<TEXT>1.4 million (2005)</TEXT>
</T>
<T t="10/09/2007">
<TEXT>2.52 million (2006)</TEXT>
</T>
…
ECDL
22
http://homepages.inf.ed.ac.uk/hmueller/xarch/download.html
Heiko Müller’s Xarch
• Examples of use with
⁻ Ontologies
⁻ XML files
⁻ Relational databases
• Automatically converts
RDBs into XML
• Efficiently extracts
snapshots
• Simple temporal query
language
ECDL
23
Provenance – a huge issue
•
•
•
•
Where did this data come from?
How did it get here?
How was it constructed?
...
Two schools of research:
• Workflow (coarse-grain) provenance – a complete
record of how some large scientific
analysis/simulation was performed.
• Data (fine-grain) a record of how some small piece of
data (in a larger databases) was produced
ECDL
24
Data provenance:
an example
Copy-paste, or
<cntrl>C <cntrl>V
113.479 (2001)
http://www.corfunext.com/corfu_geography.htm
107,879 (as of 2001 )
http://en.wikipedia.org/wiki/Corfu ***
109,512
www.agni.gr/
105,043
http://www.merriam-webster.com/dictionary/corfu
115,200 (2003 est)
http://encyclopedia.farlex.com/Corfu
97,102 in 1981
http://geography.howstuffworks.com/europe/corfu.htm
107,880
http://catalogue.horse21.net/greece+hotels/corfu+hotels/hotels5/luxury
ECDL
25
“Where provenance”
Possible explanations of how something was copied:
This data item was extracted from location L1 in
document D1 and placed in location L2 in document D2
or
This data item was extracted from database D1 by
query Q1 and placed in database D2 by update U2
(or some combination of the two)
ECDL
26
ID
11SB_CUCMA
STANDARD;
PRT;
480 AA.
AC
P13744;
DT
01-JAN-1990 (REL. 13, CREATED)
DT
01-JAN-1990 (REL. 13, LAST SEQUENCE UPDATE)
DT
01-NOV-1990 (REL. 16, LAST ANNOTATION UPDATE)
DE
11S GLOBULIN BETA SUBUNIT PRECURSOR.
OS
CUCURBITA MAXIMA (PUMPKIN) (WINTER SQUASH).
OC
EUKARYOTA;
PLANTA; EMBRYOPHYTA;
ANGIOSPERMAE; DICOTYLEDONEAE;
DE
11S GLOBULIN
BETA SUBUNIT
PRECURSOR.
OC
VIOLALES;
CUCURBITACEAE.
OS
CUCURBITA
MAXIMA
(PUMPKIN) (WINTER SQUASH).
RN
[1]
OC
EUKARYOTA;
PLANTA;
EMBRYOPHYTA;
ANGIOSPERMAE; DICOTYLEDONEAE;
RP
SEQUENCE FROM
N.A.
RC
STRAIN=CV. KUROKAWA AMAKURI NANKIN;
OC
VIOLALES;
CUCURBITACEAE.
RX
MEDLINE; 88166744.
RA
HAYASHI M., MORI H., NISHIMURA M., AKAZAWA T., HARA-NISHIMURA I.;
RL
EUR. J. BIOCHEM. 172:627-632(1988).
-!- FUNCTION: THIS IS
SEED STORAGE PROTEIN.
RN A [2]
RP EACH
SEQUENCE
OF 22-30
297-302. OF AN ACIDIC AND A
-!- SUBUNIT: HEXAMER;
SUBUNIT
ISAND
COMPOSED
RA
OHMIYA M., HARA I., MASTUBARA H.;
BASIC CHAIN DERIVED
FROMCELL
A SINGLE
AND LINKED BY A
RL
PLANT
PHYSIOL. PRECURSOR
21:157-167(1980).
DISULFIDE BOND. CC -!- FUNCTION: THIS IS A SEED STORAGE PROTEIN.
CC
-!SUBUNIT:
EACH
SUBUNIT IS(GLOBULINS).
COMPOSED OF AN ACIDIC AND A
-!- SIMILARITY: TO OTHER
11S
SEED HEXAMER;
STORAGE
PROTEINS
CC
BASIC CHAIN DERIVED FROM A SINGLE PRECURSOR AND LINKED BY A
CC
DISULFIDE BOND.
CC
-!- SIMILARITY: TO OTHER 11S SEED STORAGE PROTEINS (GLOBULINS).
DR
EMBL; M36407; G167492; -.
PIR; S00366;11S
FWPU1B.
FT
CHAIN
22 DR 480
GLOBULIN BETA SUBUNIT.
DR
PROSITE; PS00305; 11S_SEED_STORAGE; 1.
FT
CHAIN
22 KW 296
CHAIN
(ACIDIC).
SEED STORAGEGAMMA
PROTEIN;
SIGNAL.
SIGNAL
1
21
FT
CHAIN
297 FT 480
DELTA
CHAIN
(BASIC).
FT
CHAIN
22
480
11S GLOBULIN BETA
SUBUNIT.
FT
MOD_RES
22
22
PYRROLIDONE CARBOXYLIC
ACID.
FT
CHAIN
22
296
GAMMA CHAIN (ACIDIC).
FT
DISULFID
124 FT 303
INTERCHAIN
(GAMMA-DELTA)
(POTENTIAL).
CHAIN
297
480
DELTA CHAIN (BASIC).
FT
MOD_RES
22
22
PYRROLIDONE CARBOXYLIC ACID.
FT
DISULFID
124
303
INTERCHAIN (GAMMA-DELTA) (POTENTIAL).
FT
CONFLICT
27
27
S -> E (IN REF. 2).
FT
CONFLICT
30
30
E -> S (IN REF. 2).
SQ
SEQUENCE
480 AA; 54625 MW; D515DD6E CRC32;
MARSSLFTFL CLAVFINGCL SQIEQQSPWE FQGSEVWQQH RYQSPRACRL ENLRAQDPVR
RAEAEAIFTE VWDQDNDEFQ CAGVNMIRHT IRPKGLLLPG FSNAPKLIFV AQGFGIRGIA
IPGCAETYQT DLRRSQSAGS AFKDQHQKIR PFREGDLLVV PAGVSHWMYN RGQSDLVLIV
FADTRNVANQ IDPYLRKFYL AGRPEQVERG VEEWERSSRK GSSGEKSGNI FSGFADEFLE
EAFQIDGGLV RKLKGEDDER DRIVQVDEDF EVLLPEKDEE ERSRGRYIES ESESENGLEE
TICTLRLKQN IGRSVRADVF NPRGGRISTA NYHTLPILRQ VRLSAERGVL YSNAMVAPHY
TVNSHSVMYA TRGNARVQVV DNFGQSVFDG EVREGQVLMI PQNFVVIKRA SDRGFEWIAF
KTNDNAITNL LAGRVSQMRM LPLGVLSNMY RISREEAQRL KYGQQEMRVL SPGRSQGRRE
//
Where Provenance
CC
CC
CC
CC
CC
Where does this information
come from? Which curator?
Or was it the cited papers?
Was it copied from some
other DB?
ECDL
27
Copy-paste model of curated DBs
Curated databases are not views!!
(a) A biologist copies some UniProt records into her DB.
(b) She fixes entries so that UniProt PTMs are not confused with hers.
(c) She copies in some publication details from OMIM
(d) She corrects a mistake in a PubMed publication number.
[B. Chapman, Cheney, Sigmod ’06]
ECDL
28
A very simple copy-paste language
(uses a “deterministic” tree model)
(1) delete c5 from T;
(2) copy S1/a1/y into T/c1/y;
(3) insert {c2 : {}} into T;
(4) copy S1/a2 into T/c2;
(5) insert {y : {}} into T/c2;
(6) copy S2/b3/y into T/c2/y;
(7) copy S1/a3 into T/c3;
(8) insert {c4 : {}} into T;
(9) copy S2/b2 into T/c4;
(10) insert {y : 12} into T/c4;
How costly is it to record all this?
ECDL
29
How to reduce space
• Complete provenance: Record every update.
• Transactional provenance: Record the links at the end of some
user-defined transaction (sequence of updates)
• Hierarchical (inferred) provenance. Only record a link if it
cannot be inferred from the provenance of a higher node
Taken together these provide a substantial saving on
storage. Overhead comparable with the size of the DB
in some realistic simulations
ECDL
30
Query languages and where provenance
(select A, 5 as B from R where A = 1)
union
(select * from R where A <> 1)
delete from R where A = 1;
insert into R values (1,5)
update R set B = 5 where A = 1
ECDL
[B., Cheney, Vansummeren, TODS 33,4, 2008]
A B
A B
1
3
1
5
6
7
6
7
1
3
1
5
6
7
6
7
1
3
1
5
6
7
6
7
31
Other forms of provenance in query languages
• Why-provenance: why is a tuple in the output, or what parts
of the input “contributed to” the tuple? [Widom et al]
• How-provenance: how (by what process) was this tuple
constructed. [Tannen et al]
Complex program or process
Large, heterogeneous
source
Small part of
source
Database
Simpler program/process
“Piece” of data: data
value, tuple.etc
Taken together, these are the “explanation”.
ECDL
32
Workflow provenance
Taken from [Cohen, et al DILS 2006]
• Each step S1. . . S4 is itself a workflow.
• How does one record an “enactment” of the workflow?
• How much “context” does one record?
– from people
– from databases that change
• Recent attempts to produce a general model
– Open Provenance Model [Moreau et al. 2007]
– Petri Net + Complex Object [Hidders et al.Inf Syst 2008]
ECDL
33
Provenance is very general issue
• Intrinsic to data quality.
• It is starting to be used in several areas of CS:
–
–
–
–
–
–
–
Semantics of update languages.
Probabilistic databases
Data integration
Debugging schema transformations
File/data synchronization
Program debugging (program slicing)
Security
• The fundamental problem is finding the right
model/models – can we combine data and
workflow models?
ECDL
34
Annotation – closely related to provenance
• Much of the activity of curators is the annotation of
existing data.
• When we copy that data, we should also copy its
annotations
• The propagation of annotation follows (where-)
provenance
• But the story is more complicated because we often
annotate views
ECDL
35
The Distributed Annotation Server (DAS)
ECDL
36
ECDL
37
Annotating Databases
Polygen [Wang & Madnick VLDB 1990], DBNotes [Bhagwat et al,
VLDB 2004] Concern is propagation of annotations from views
to source and back. Again, there is an interesting theory
Not strong
Stijn says this is
not a beer
Guinness
Stijn says this is
not a beer
Guinness
Stout
Heineken
Pilsner
Old Jock
Ale
Fischer
Blonde
Stout
5.0
Eire
Heineken
Pilsner
5.0
Netherlands
Old Jock
Ale
6.7
Scotland
Guinness
Stout
7.5
Nigeria
Fischer
Blonde
6.0
France
π
Not strong
σ
Not strong
Stijn says this is
not a beer
Guinness
Stout
5.0
Eire
Heineken
Pilsner
5.0
Netherlands
Fischer
Blonde
6.0
France
How do you cite something in a database?
Many scientific databases ask you to cite them, but..
• they don’t tell you how, or
• they tell you to give the URL, or
• they tell you to cite a paper about the database.
Nutrition Education for Diverse Audiences [Internet]. Urbana (IL): University of
Illinois Cooperative Extension Service, Illinet Department; [updated 2000 Nov
28; cited 2001 Apr 25]. Diabetes mellitus lesson; [about 1 screen]. Available
from
http://www.aces.uiuc.edu/~necd/inter2_search.cgi?ind=854148396
NLM Recommended Formats for Bibliographic Citation.
Internet Supplement. NLM Technical report Bethesda, MD 20894, July 2001.
ECDL
39
What is a citation?
Bard JB and Davies JA. Development, Databases and
the Internet. Bioessays. 1995 Nov; 17(11):999-1001.
[Location and descriptive information]
Ann. Phys., Lpz 18 639-641
Nature, 171,737-738
(We often want more than location)
ECDL
40
Automatically generating citations
A rule:
{ DB=IUPHAR, Version=$v, Family=$f Receptor=$r, Contributors=$a,
Editor=$e, Date=$d, DOI=$i}
←
/Root[ ]/Version[Number=$’v, Editor=$?e, DOI=$.i, Date=$.d] /Data[ ]
/Family[FamilyName=$’f] /Contributor-list/Contributor=$+a]
/Receptor[ReceptorName=$’r]
What gets generated (example):
{ DB=IUPHAR, Version=11, Family=Calcitonin,
Receptor=CALCR, Contributors={Debbie Hay, David R. Poyner},
Editor=Tony Harmar, Date=Jan 2006, DOI=10.1234 }
ECDL
41
Other topics: Data quality and data cleaning
• Published data often looks clean but is intrinsically
messy
– “Dead” fields in the underlying data
– Multiple syntactic conventions
– Abuse of / confusion over formats & schema
• Human errors require human correction
– Automate error detection rather than error correction
• Cleaning is an essential prerequisite in any
integration or preservation task.
ECDL
42
Other topics: Evolution of Structure
• Curated DBs evolve from humble origins. Schemas
are often wrong; they are
– designed by people who don’t understand schemas
– designed before the domain is fully understood
• Do ontologies help (you can build an ontology
without worrying much about the schema) or do
they defer the problem and make it worse?
ECDL
43
The larger (economic and social) issues
• Who will archive/curate curated databases?
• Should they be open-access?
– who pays for their maintenance?
• What are the legal/IP issues?
ECDL
44
A case study: IUPHAR database
(curated by Tony Harmar and team)
• “Standard” curated database
• Labour-intensive (hundreds of
contributors)
IUPHAR
DCC
• Valuable (supported by drug
companies)
• Simple, clean structure – as seen
by users
50m
ECDL
46
ECDL
47
ECDL
48
ECDL
49
We wanted to use our archiver
• Our first task was to convert the database into a hierarchical
structure (following the web presentation) so that we could
archive it.
• We used the Prata XML (Fan et al) publishing software
• This had some unexpected benefits…
ECDL
50
…
…
ECDL
<transduction>
<secondary>Y</secondary>
<transcomments/>
<transcites/>
</transduction>
</transductions>
<ligandTypes>
<ligandType>
<typeName>Agonist</typeName>
<ligandComment>empty</ligandComment>
<ligands>
<ligand>
<radioactive>NO</radioactive>
<endogenous>YES</endogenous>
<alternative>NO</alternative>
<ligandSpeices>Human</ligandSpeices>
<ligandName>oxytocin</ligandName>
<affinity>9.1-8.8</affinity>
<ligandAction>Full Agonist</ligandAction>
<ligandUnits>p<i>K</i><sub>d</sub></ligandUnits>
<ligandCites>9</ligandCites>
</ligand>
<ligand>
<radioactive>YES</radioactive>
<endogenous>NO</endogenous>
<alternative>NO</alternative>
<ligandSpeices>Human</ligandSpeices>
<ligandName>[<sup>3</sup>H]-oxytocin</ligandName>
<affinity>9.1-8.8</affinity>
<ligandAction>Full Agonist</ligandAction>
<ligandUnits>p<i>K</i><sub>d</sub></ligandUnits>
<ligandCites>9, 42</ligandCites>
</ligand>
51
• We can preserve all versions of the data (as intended)
• We can generate static web pages (less software, more
efficient)
• We can make the database citable
• Tony can trace the history of entries
• Tony can generate an old-fashioned book (yes, he wants to do
this!)
• We have a “community model” for data exchange
• The data got cleaned up in the process
• The representation information (required by archivists) is
greatly simplified
ECDL
52
Selected pages from the book
– generated by a 100-line
style sheet
53
Our library will “host” the book, but
not the database!
54
Centralized vs. distributed publishing
20th century libraries provided robust, distributed dissemination and
preservation of reference material
Valuable information was lost in earlier “data centers” .
Is this still
happening?
Replication and distribution has always been the best guarantee of
preservation. We should do the same for curated databases – a database
LOCKSS ?
ECDL
55
Many of the issues are non “technical”
• A good economic model for sustainability
– Open access works for journal papers
– Can it work for curated DBs? They require long-term
support. And people who write reference manuals
sometimes expect to make money out of them.
• Intellectual property in curated databases is a
nightmare
– legislation still largely based on the notion of copying.
• We can still help by providing good models of the
processes in curating and publishing databases
ECDL
56