MEDLINE in Oracle XML

Download Report

Transcript MEDLINE in Oracle XML

MEDLINE in Oracle XML-DB and
Oracle Text
Peter Stoehr
Head of Database Operations
European Bioinformatics Institute (EBI)
www.ebi.ac.uk
Oracle Life Sciences, OracleWorld, San Francisco, Sep 10 2003
European Molecular Biology Laboratory (EMBL)
International network of research institutes dedicated to
research in molecular biology
Treaty organisation funded by 16 member states
Headquarters established in Heidelberg in 1974
research programmes in cell biology, developmental
biology, instrumentation, gene expression etc
Outstations
Hamburg and Grenoble: structural biology
Monterotondo: mouse genetics
Hinxton, EBI: bioinformatics
European Bioinformatics Institute (EBI)
Mandate
to ensure the growing body of data and information from
molecular biology and genome research is placed in the
public domain and is accessible freely to all facets of the
scientific community in ways that promote scientific progress
and global competitiveness
to support academic research as well as biotech, agricultural,
chemical and pharmaceutical industries
The EBI builds, develops and publishes databases and
information services relevant to molecular biology, as well as
conducting research in bioinformatics.
About EBI
 Located in Hinxton, Cambridge, England (since
1993)
 Non-profit organisation - part of EMBL
 Started as EMBL Data Library in 1980
 Centre for research and services in
bioinformatics
 Three branches: Services, Research, Industry
 Funding mainly from EMBL and EU
EBI Resources
Personnel
- 220 people
- ~100 Database developers, software engineers
- DBA - 4
- Systems - 6
Environment
- OS: Tru64 ES45s, Solaris
- Storage: SAN storage, NetApp NFS
- 350-cpu linux compute farm
Major public databases at the EBI






DNA sequences
protein sequences
genome annotation
protein structures
microarrays
literature
- EMBL Nucleotide Sequence Database
- SWISS-PROT, TrEMBL, Interpro, CluSTr
- Ensembl
- MSD
- ArrayExpress
- MEDLINE, patents, fulltext
 enzymes
- IntEnz
 protein interactions - IntAct
 immunogenetics
- IMGT, HLA
 integration
- Integr8, EnsMart
Oracle Production Instances
Database
Instance
cpu Oracle
size tables
EMBL
PRDB1
4
8.1.7.2
350
300
SVA
ERD
4
8.1.7.2
550
6
SWISS-PROT
PRDB1
4
8.1.7.2
8
125
Interpro
IPRO
4
8.1.7.2
8
130
Clustr
CLUSTR
4
8.1.7.2
40
24
MSD
MSD
280
475
MEDLINE
MEDLINE
4
9.2.0.3
120
10
GO
IPRO
4
8.1.7.2
1
15
+10
…
…
…
…
…
Statistics
26 million nucleotide sequences (25b bases)
1 million protein sequences
200 complete genomes (+viruses, organelles)
28,000 genes in human genome
10,000 protein 3D structures
2500 journals with sequences
12 million MEDLINE citations, 4500 journals
130,000 biotech patent documents
550,000 web hits per day, www.ebi.ac.uk
EBI interest in text resources
• provide links from factual databases to full-text
literature (journals, patents…)
• mine for information relevant to factual database
annotation.
• most scientific information buried in free text
resources 
• enable indexing and searching of full-text literature
ID
XX
AC
XX
SV
XX
DT
DT
XX
DE
XX
KW
XX
OS
OC
OC
XX
RN
RA
RT
RL
RL
XX
FH
FH
FT
FT
FT
AX067464
standard; DNA; PRO; 100848 BP.
espacenet
AX067464;
AX067464.1
24-JAN-2001 (Rel. 66, Created)
24-JAN-2001 (Rel. 66, Last updated, Version 1)
Sequence 39 from Patent WO0078968.
.
Moraxella catarrhalis
Bacteria; Proteobacteria; gamma subdivision;
Moraxellaceae; Moraxella.
[1]
Lagace R.E., Patterson C., Berg K.L.;
"Nucleotide sequences of moraxella catarrhalis genome";
Patent number WO0078968-A/39, 28-DEC-2000.
Incyte Genomics, Inc. (US).
Key
Location/Qualifiers
source
1..100848
/db_xref="taxon:480"
/organism="Moraxella catarrhalis"
Searches based on bibliographic data in patent documents
Full text documents - PDF files
EBI interest in text resources
• provide access from factual databases to full-text
literature (journals, patents…)
• mine for information relevant to factual database
annotation.
• most scientific information buried in free text
resources 
• enable indexing and intelligent searching of fulltext literature
Areas of improvement for public text resources
• improve text retrieval functionality
• improve and add text corpora
• use of thesauri and ontologies (UMLS, SNOMED,
GO, GOBO)
• interfaces
Text corpora
• MEDLINE
• Full-text literature
• AGRICOLA
• Biotech patent abstracts
• Biotech patent full-text
• OMIM
• The web
• => public searchable services
Search engine evaluation criteria
Speed of searches
Speed of indexing
Ability to search multiple data sources and formats,
MEDLINE in XML
EMBL/SWISS-PROT-type structured files
websites
Word, PDF and text, email/jitterbug files
RDBMS (ORACLE and MySQL, Postgres)
Ability to handle large database/collections
Text query functions
Natural language
Boolean operators
Phrase searches
Proximity searches
Use of synonyms, ontologies, thesauri
esp. UMLS/MeSH, GO
Stemming and wild-carding
Multiple language support (for patent literature)
Document clustering functionality
Search refinenement, set operators
Ranking of results (relevance, date)
Weighting of search
Scaleability of indexing, searching
Load balancing on multiple nodes
Parallel processing
Incremental and off-line indexing
Ease of use of APIs, documentation
API languages, C/Java/Perl
Interoperabilty with SRS
Market strength of vendor
Availability on multiple unix platforms
esp. Tru64, Linux and Solaris
Technical support
Licence costs and flexibility
Text search/extraction systems
Altavista
Verity
Inktomi
ASPseek
Google
Autonomy
Thunderstone
Excalibur
Fulcrum
SPSS/LexiQuest
Stratify/”Purple Yogi”
Dolphinsearch
Quiver
Oracle Text
X-Mine - “Opus”
Diogene
incellico - “Cell Entity Browser”
Collexis
Alma
PharmDM
Linguamatics
Inxight
ClearForest
APRSmartLogik
Search engine implementation
Verity K2
- Red Hat Linux, 200-cpu pc farm
- MEDLINE XML parser built
Oracle Text
- Oracle 9iR2 implementation
- text indexing of titles and abstract
- XML DB
- Oracle Life Sciences Initiative
- MEDLINE + weekly updates implemented
- tuning, performance analysis
- in use for internal sequence DB maintenance
MEDLINE
• National Library of Medicine (NLM) Bethesda
• 530 XML files, following NLM DTD
• ~ 12 million citations published in over 4500
biomedical journals
• Daily updates
MEDLINE record view in SRS
Patent abstract record view in SRS
Patent abstract in XML
Medline in XML
<!DOCTYPE MedlineCitationSet
PUBLIC "-//NLM//DTD NLM Medline, 1st November 2002//EN”
"http://www.nlm.nih.gov/databases/dtd/nlmmedline_021101.dtd>
<MedlineCitationSet>
<MedlineCitation Owner="NLM” Status="Completed">
<MedlineID>94033980</MedlineID>
<PMID>8219565</PMID>
…...
<Journal>
<ISSN>1051-0443</ISSN>
<JournalIssue>
<Volume>4</Volume>
<Issue>5</Issue>
<PubDate>
<MedlineDate>1993 Sep-Oct</MedlineDate>
</PubDate>
</JournalIssue>
</Journal>
…
</MedlineCitation>
…………..
<MedlineCitationSet>
Possible Approaches
MEDLINE/Patent XML
SRS
Fast,
efficient,
Domain
standard
Lack of
text
query
functions
Verity
K2
Efficient text
query,
scaleable,
industry
standard
Return data
as XML,
postprocessing
required
Normalised
relational
tables
Fast, efficient
for simple
query
Pre-processing
required, hard
to maintain,
lack of text
query functions,
requesting
multiple joins
for more info
Oracle
Text
Efficient,
support text
query
Return data
as XML,
postprocessing
required
Oracle
XML-DB
Why Oracle XML DB for Medline ?
Oracle 9iR2 embedded XML features with DBMS
• XMLType datatype
LOB storage
- maintains original XML byte for byte
- can use an Oracle text index, support Xpath queries
- flexible when schemas change
object-relational storage
- better performance, index specific fields
- access to SQL features (constraints, indices etc)
- DOM fidelity (ordering, namespaces, inhertitance…)
- piecewise XML element update
• XMLSchema support
Why Oracle Text for Medline ?
Oracle 9i embedded Text features with DBMS
• Powerful and extensive text functions
- wildcards, boolean, stemming, proximity searches,
NLP linguistic features, pattern matching, ‘soundex’
• XML specific operators, HASPATH, INPATH etc, to support
XPATH like expressions
• Relevance ranking
• Multi-lingual features
• Extensions to SQL*Plus
• Management of thesauri
• Classification (CTXRULE indextype)
• Unsupervised document Clustering
• Documentation pretty good
Oracle RDBMS, XML DB and Text
One product range
- Cost (already purchased and used RDBMS)
- lower complexity – common administration, training, backup,
replication, RAC etc
- lower latency of development/deployment
- no incompatible product updates, gateways etc
- greater performance for mixed queries
Prepare XMLSchema from DTD
• NLM MEDLINE DTD
• XML-Spy
• To use Oracle CLOB type:
<xs:schema
xmlns:xs=http://www.w3.org/2001/XMLSchema
xmlns:xdb=http://xmlns.oracle.com/xdb
elementFormDefault="qualified"
xdb:storeVarrayAsTable="true">
<xs:element name="Abstract"
xdb:SQLType="CLOB"/>
• => XMLSchema
Register XMLSchema, create table
begin
dbms_xmlschema.registerschema(
’http://www3.ebi.ac.uk/internal/Services/medline/
medlinecitation_Types.xsd’,xdburitype(’/public/
medlinecitation_Types.xsd’).getClob(),
TRUE,TRUE,FALSE,TRUE);
end;
Load data
• SQL*Loader
• We use a Java application, JDBC
- need to control updates, deletions.
• synchronize context index
exec ctx_ddl.sync_index(‘title_ind’,’40M’);
• Complete MEDLINE:
- 1 day to load, 1 day for context indexing
• Updates
- 10 mins
Actual MEDLINE instances @EBI
• 9iR2 in Production
- MEDLINE + patent abstracts- updated twice per week
- used for in-house reference
- CLOB storage of XMLType field
- partitioned (by date) context index of XMLType
• 9iR2 in development
- using structured object-relational storage
- indexing fields, inc. context indextype for titles, abstracts
- no partioning
MED production schema
Auxiliary tables
describes
MedlineCitationSetFrame
MedlineCitationSet
consists of
(0..n)
MedlineCitation
#PMID
number
MedlineID
number
Heading
Start_tag
MedlineCitation
DeleteCitation
End_tag
MedlineCitation XMLType
MedlineCitation
PubYear
number
varchr2(256)
varchar2(50)
XMLType null
XMLType null
varchr2(50)
includes
DeleteCitation
comply with
DeleteCitation_Id
#PMID
Undeleted
DTD
#Name
NLMURL
InternalLocation
Doc
number
char(1)
Control vocabularies
Mesh_Tree_CV
controlled by
#Mesh_Id
Parent_Id
Descriptor
Rank
varchar2 (40)
varchar2 (40)
varchar2 (300)
number (2)
registered
with
varchar2(56)
varchar2 (150)
varchar2(150)
clob
converted to
XMLSchema
#Name
varchar2 (56)
URL
varchar2 (150)
Doc
clob
refereed to
Qualifier_CV
#Qualifier_id
Qualifier
Qualifier_Abbr
varchar2 (15)
varchar2 (40)
varchar2 (5)
PubMed_Journal_CV
#JrId
JournalTitle
MedAbbr
ISSN
ESSN
IsoAbbr
NlmId
number,
varchar2(500),
varchar2(255),
varchar2(9),
varchar2(9),
varchar2(255),
varchar2(25));
Language_CV
#Abbr
Name
varchar2 (3)
varchar2 (255)
UpdateCitation_Info
PMID
TimeUpdated
TimeDeleted
number
timestamp(0)
timestamp(0)
Legend:
#-- primary key
included in
Entrez_Journal_CV
#JrId
JournalTitle
MedAbbr
ISSN
ESSN
IsoAbbr
NlmId
number,
varchar2(500),
varchar2(255),
varchar2(9),
varchar2(9),
varchar2(255),
varchar2(25));
PublicationType_CV
#ID number
Name varchar2 (255)
CitationSubSet_CV
#ID varchar2 (3)
Name varchar2 (255)
Main Table: MedlineCitation
MedlineCitation
#PMID
MedlineID
number
number
MedlineCitation XMLType
PubYear
dummy
number
clob
Table is partitioned
into 8
XMLtype Column is
registered with
XMLSchema, locally
context type indexed
select m.MedlineCitation.getClobVal() AS MedXML
from Medlinecitation m where pmid=8219565;
--return a full XML document:
<MedlineCitation
Owner="NLM” Status="Completed">
<MedlineID>94033980</MedlineID>
<PMID>8219565</PMID>
…...
<Article>
<Journal>
<ISSN>1051-0443</ISSN>
<JournalIssue>
<Volume>4</Volume>
<Issue>5</Issue>
<PubDate>
<MedlineDate>1993 Sep-Oct</MedlineDate>
</PubDate>
</JournalIssue>
</Journal>
<ArticleTitle>Transcatheter manipulation of asymmetrically opened titanium Green field
filters.</ArticleTitle>
<Pagination>
<MedlinePgn>687-90</MedlinePgn>
</Pagination>……
<Article>
…
</MedlineCitation>
Select a part of XML document
select extract (MedlineCitation,
'/MedlineCitation/Article/Journal/JournalIssue').getStringVal()
”JournalIssue” from Medlinecitation where pmid= 11194419;
<JournalIssue>
<Volume>40</Volume>
<Issue>4</Issue>
<PubDate>
<Year>2000</Year>
<Month>Nov</Month>
</PubDate>
</JournalIssue>
Select just Abstract text
select extractValue (MedlineCitation,
'/MedlineCitation/Article/Abstract/AbstractText').getStringVal()
“AbstractText” from Medlinecitation where pmid=8219565;
AbstractText
-------------------------------------------------------------------------------A case of amebic meningoencephalitis recognized in an adult Zambian is
described. This is the first authenticated case from Africa. The morphologic
features of the organism, its ability to form cysts in tissue, and the
granulomatous tissue response denote that the ameba is an hartmannellid rather
than a Naegleria. Free -living amebas of the family Hartmannellidae have not
been incriminated before as a cause of primary amebic meningoencephalitis in
man. To our knowledge this is the only case where such an ameba was
responsible for fulminating meningoencephalitis. The presence of the amebas in
a cellulocutaneous abdominal lesion sugges ts hematogenous dissemination.
Improvements - can create additional relational tables
Citation
PMID
number primary key,
MedlineID number,
ArticleTitle VARCHAR2(1500),
Volume
VARCHAR2(55),
Issue
VARCHAR2(55),
StartPage VARCHAR2(55),
EndPage
VARCHAR2(55),
MedlinePgn VARCHAR2(100),
PubYear
number,
ISSN
VARCHAR2(9),
NlmUniqueId VARCHAR2(25),
AuthorListCompleteYN
VARCHAR2(1)
Author
PMID
number foreign key,
LastName
VARCHAR2(255),
Initials
VARCHAR2(255),
Suffix
VARCHAR2(25),
CollectiveName
VARCHAR2(1000),
Affiliation
VARCHAR2(1000),
Rank
number
Improvements – use O-R storage
Can ‘context’ index whole XML table …
create index MEDLINE_CITATION_INDEX on
MEDLINE_CITATION_TABLE x
(value(x)) indextype is ctxsys.context
parameters(’storage med_storage lexer med_lex
stoplist med_STOPLIST section group autogroup
memory 400M’);
or just text fields:
Create index Journal_Abstract_Index
on MEDLINE_CITATION_TABLE c
(extractValue(val(m),
’/MedlineCitation/Article/Abstract/AbstractText’))
indextype is ctxsys.context;
Improvements S-R storage - index specific fields
create unique index MEDLINE_PMID_INDEX on MEDLINE_CITATION_TABLE m
(extractValue(value(m),'/MedlineCitation/PMID'))
create index AUTHOR_LASTNAME_INDEX on ARTICLE_AUTHOR_TABLE a
(a."LastName")
create index JOURNAL_ISSN_INDEX on MEDLINE_CITATION_TABLE c
( extractValue(value(c),'/MedlineCitation/Article/Journal/ISSN') )
create index JOURNAL_VOLUME_INDEX on MEDLINE_CITATION_TABLE c
(extractValue(value(c),'/MedlineCitation/Article/Journal/JournalIssue/Volume'),
extractValue(value(c),'/MedlineCitation/Article/Journal/JournalIssue/Issue'))
create index PAGINATION_INDEX on MEDLINE_CITATION_TABLE c
( extractValue(value(c),'/MedlineCitation/Article/Pagination/MedlinePgn'))
Improvements - create a view
create or replace view Citation
as select
to_Number(extractValue(value(m),'/MedlineCitation/PMID')) PMID,
to_Number(extractValue(value(m),'/MedlineCitation/MedlineID')) MedlineId,
extractValue(value(m),'/MedlineCitation/Article/Journal/ISSN') ISSN,
extractValue(value(m),'/MedlineCitation/Article/Journal/JournalIssue/Volume') Volume,
extractValue(value(m),'/MedlineCitation/Article/Journal/JournalIssue/Issue') Issue,
extractValue(value(m),'/MedlineCitation/Article/Pagination/MedlinePgn') MedlinePgn
from MEDLINE_CITATION_TABLE m;
Typical query becomes simple & fast
select PMID, MedlineId from Citation where Volume='12'
and ISSN='1040-4651' and MedlinePgn like '1-%';
PMID
MEDLINEID
---------- ---------10634903 20102627
Elapsed: 00:00:00.35
Text query examples
SELECT score, pmid, title FROM citation
WHERE CONTAINS(abstract,’gene NEAR
expression’,1) >0
ORDER BY score(1) DESC;
SELECT pmid, title FROM citation
WHERE CONTAINS(abstract,’Drosophila
AND ABOUT(adh)’)>0;
Next steps
• lexical functionality of Oracle Text
• thesauri (UMLS, GO, SNOMED)
• scaleability (linux RAC ?)
• concept extraction, classification, clustering
• application to database curation
• interfaces: web services, GUI
Acknowledgements
Leader:
Peter Stoehr, Weimin Zhu
DBA:
Muruli Rao, Olalekan Oyewole
Developer:
Lichun Wang
Oracle Support:
Mark Drake (XML)
Roger Ford (Text)