Extending SDARTS:
Extracting Metadata from Web Databases
and Interfacing with Open Archives Initiative
Panagiotis G. Ipeirotis
Tom Barry
Luis Gravano
Computer Science Dept., Columbia University
Metasearching? Why?
“Surface” Web vs. “Hidden” Web
“Surface” Web
– Link structure
– Crawlable
“Hidden” Web
– Documents “hidden” in databases
– No link structure
– Search engines do not index them
– Need to query each collection individually
[Figure: a search form with a “Keywords” field and SUBMIT / CLEAR buttons]
7/17/2015
Metasearching Challenges
Select good databases for a given query
– “Content summaries” of databases (frequencies of words)
Evaluate the query at these databases
– Uniform interfaces
Merge the results from these databases
[Figure: a metasearcher over Hidden Web sources (non-indexed documents; relational database / library / etc.; existing Web database), with example content summaries, in slide order: wireless: 2,000, network: 8,000; wireless: 0, network: 10; wireless: 5, network: 40]
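To make the role of content summaries concrete, here is a minimal sketch of database selection. It is not the selection algorithm used by SDARTS: the scoring rule (the smallest document frequency among the query words) is a deliberately naive assumption, and the database names are made up. Only the word counts come from the figure.

```python
from typing import Dict, List, Tuple

# Hypothetical content summaries: word -> number of documents containing it.
# The counts echo the figure; the database names are invented for illustration.
SUMMARIES: Dict[str, Dict[str, int]] = {
    "db_a": {"wireless": 2000, "network": 8000},
    "db_b": {"wireless": 0, "network": 10},
    "db_c": {"wireless": 5, "network": 40},
}


def rank_databases(
    query: List[str], summaries: Dict[str, Dict[str, int]]
) -> List[Tuple[str, int]]:
    """Rank databases by the smallest document frequency among the query words.

    This is a naive heuristic (an assumption of this sketch): a database cannot
    hold more documents matching all words than it holds for its rarest word.
    """
    scored = []
    for name, summary in summaries.items():
        score = min((summary.get(word, 0) for word in query), default=0)
        scored.append((name, score))
    # Highest score first; ties keep their original order.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranking = rank_databases(["wireless", "network"], SUMMARIES)
print(ranking)
```

A metasearcher would then send the query only to the top-ranked databases instead of to all of them.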
Outline
Background: SDARTS, SDLIP, STARTS
Extracting content summaries from remote web databases
Interfacing with Open Archives Initiative
SDARTS: SDLIP + STARTS
NOT yet another protocol
SDLIP interfaces
STARTS metadata
[Figure: a metasearcher reaching heterogeneous sources (labeled “grep cat”, “select”, and “http://….”) through wrappers that each expose S and M interfaces; S = Search, M = Metadata]
STARTS: A Metasearching Protocol
Defines:
– Query language
– Results format
– Metadata for the collection
Complements SDLIP for metasearching purposes
Provides metadata for individual documents
Provides content summaries for databases
PubMed content summary
number of documents = 3,868,552
…
cancer 1,398,178
heart 281,506
hepatitis 23,481
basketball 907
SDARTS: The Toolkit
SDARTS architecture makes new-wrapper implementation easy
SDARTS toolkit includes reference implementations for common types of text databases:
– Local text databases
– Local XML databases
– Remote web databases
Customization requires just editing configuration files, no programming
SDARTS Content Summaries
Detailed content summaries easily extracted from locally available (plain-text or XML) databases
Detailed content summaries so far not available for remote web databases
– No access to full contents
Extracting Content Summaries from Remote Web Databases
No direct access to remote documents
Resort to document sampling (VLDB 2002):
– Send queries to the database
– Retrieve a representative document sample
– Use the sample to create an approximation of the content summary
Database selection algorithms work well even with approximate content summaries
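The last sampling step, turning a document sample into an approximate content summary, can be sketched as follows. This is an illustrative sketch only: the tokenizer (lowercase runs of letters and digits) and the three sample documents are assumptions, not part of the toolkit.

```python
import re
from collections import Counter
from typing import Iterable


def sample_document_frequencies(documents: Iterable[str]) -> Counter:
    """Count, for each word, how many sampled documents contain it."""
    df: Counter = Counter()
    for text in documents:
        # Use a set so a word counts once per document, however often it repeats.
        words = set(re.findall(r"[a-z0-9]+", text.lower()))
        df.update(words)
    return df


# Hypothetical sample retrieved from a remote database.
sample = [
    "Hepatitis damages the liver",
    "Liver and kidney transplants",
    "Kidney stones",
]
df = sample_document_frequencies(sample)
print(df["liver"], df["kidney"], df["hepatitis"])
```

These counts are relative to the sample, not to the full database; scaling them to absolute frequencies is a separate step.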
Topic-based Sampling: Training
Start with a predefined hierarchy and associated, pre-classified documents
Train rule-based document classifiers for each node
The output is a set of rules like:
Root node:
– ibm AND computers → Computers
– lung AND cancer → Health
– …
Health node:
– hepatitis AND liver → Hepatitis
– angina → Heart
– …
[Figure: topic hierarchy with Root at the top; Computers, Health, ... below Root; Heart, Hepatitis, ... below Health]
Topic-based Sampling: Probing
Transform each rule into a query
For each query:
– Send query to database
– Record number of matches
– Retrieve top-k documents for query
At the end of the round:
– Analyze matches for each category
– Choose category to focus on
Sampling proceeds in rounds: In each round, the rules associated with each node are turned into queries to the database
The result is a representative document sample
[Figure: probe queries for the Root and Health nodes (metallurgy, aids, polo, oncology, football, liver, angina, keyboard, cancer, chf, dna, psa, ram, hiv, “safe AND sex”), each annotated with its number of matches; categories shown: Sports, Heart, Health, Cancer, Computers, Science, Hepatitis, AIDS]
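One probing round can be sketched as below. This is a simplified illustration, not the toolkit's code: `fake_search` stands in for a remote search interface and returns made-up counts and document ids, and picking the category with the most matches is an assumed stand-in for the actual "choose category to focus on" criterion.

```python
from typing import Callable, Dict, List, Tuple

# A rule is (words ANDed together, category), as in the training output.
Rule = Tuple[List[str], str]
SearchFn = Callable[[str, int], Tuple[int, List[str]]]


def probe_round(
    rules: List[Rule], search: SearchFn, k: int = 4
) -> Tuple[Dict[str, int], List[str]]:
    """Send one query per rule, tally matches per category, collect top-k documents."""
    matches: Dict[str, int] = {}
    sample: List[str] = []
    for words, category in rules:
        query = " AND ".join(words)
        count, top_docs = search(query, k)
        matches[category] = matches.get(category, 0) + count
        # Keep each retrieved document once, in retrieval order.
        sample.extend(doc for doc in top_docs if doc not in sample)
    return matches, sample


def fake_search(query: str, k: int) -> Tuple[int, List[str]]:
    """Hypothetical database: returns (number of matches, top-k document ids)."""
    index = {
        "ibm AND computers": (30, ["doc1"]),
        "lung AND cancer": (7000, ["doc2", "doc3"]),
    }
    count, docs = index.get(query, (0, []))
    return count, docs[:k]


root_rules: List[Rule] = [
    (["ibm", "computers"], "Computers"),
    (["lung", "cancer"], "Health"),
]
matches, sample = probe_round(root_rules, fake_search)
focus = max(matches, key=matches.get)  # assumed criterion: most matches wins
print(matches, focus, sample)
```

In the next round, the rules attached to the chosen category (here Health) would be probed in the same way.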
Sample Contains “Relative” Word Frequencies
“Liver” appears in 200 out of 300 documents in sample
“Kidney” appears in 100 out of 300 documents in sample
“Hepatitis” appears in 30 out of 300 documents in sample
Document frequencies in actual database?
Query “liver” returned 140,000 matches
Query “hepatitis” returned 20,000 matches
“kidney” was not a query probe…
Can exploit number of matches from one-word queries
Adjusting Document Frequencies
We know absolute document frequency f of words from one-word queries
We know ranking r of words according to document frequency in sample
Mandelbrot’s formula connects word frequency f and ranking r:
f = P (r + p)^{-B}
We use curve-fitting to estimate the absolute frequency of all words in sample
[Figure: words ordered by frequency in sample (always known): ..., cancer, ..., liver, ..., kidneys, ..., stomach, hepatitis, ...; known frequencies from one-word queries (140,000 matches, 60,000 matches, 20,000 matches) are plotted against these ranks, and unknown frequencies are marked “?”]
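The curve-fitting step can be sketched as follows. The fitting procedure is an assumption, since the slides do not say how the curve is fitted: for each candidate p on a small grid, log f is linear in log(r + p), so P and B follow from ordinary least squares, and the p with the smallest squared error is kept. The probe data is synthetic.

```python
import math
from typing import List, Tuple


def fit_mandelbrot(known: List[Tuple[int, float]]) -> Tuple[float, float, float]:
    """Fit f = P * (r + p) ** (-B) to (rank, frequency) pairs; returns (P, p, B).

    Needs at least three points with distinct ranks >= 1 for p to be meaningful.
    """
    ys = [math.log(f) for _, f in known]
    n = len(known)
    my = sum(ys) / n
    best = None
    for step in range(0, 101):  # candidate p in [0, 10]; the grid is an assumption
        p = step / 10.0
        xs = [math.log(r + p) for r, _ in known]
        mx = sum(xs) / n
        sxx = sum((x - mx) ** 2 for x in xs)
        slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
        intercept = my - slope * mx
        err = sum((intercept + slope * x - y) ** 2 for x, y in zip(xs, ys))
        if best is None or err < best[0]:
            best = (err, math.exp(intercept), p, -slope)
    _, P, p, B = best
    return P, p, B


def estimate_frequency(rank: int, P: float, p: float, B: float) -> float:
    """Estimated absolute document frequency of the word at the given sample rank."""
    return P * (rank + p) ** (-B)


# Synthetic probe words: (rank in sample, matches reported by the database).
true_P, true_p, true_B = 1000.0, 2.0, 1.2
known = [(r, true_P * (r + true_p) ** (-true_B)) for r in (1, 3, 5, 8, 12)]
P, p, B = fit_mandelbrot(known)
# Estimate a word that was never sent as a query, e.g. the one ranked 4th.
print(round(estimate_frequency(4, P, p, B), 1))
```

The words that were sent as one-word queries supply the (rank, frequency) pairs; every other word in the sample gets its estimate from its rank alone.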
Implementing Content-Summary Extraction in SDARTS Toolkit
Implemented content-summary extraction module as J2EE-compliant servlet
– First, build SDARTS wrapper for remote web database
– Then, trigger extraction process to generate content summary automatically
Module customizable with any classification scheme
– Toolkit provides 72-node hierarchical scheme and associated classifiers
– To add a new scheme, define the hierarchy and provide classifiers for the internal nodes
Fraction of PubMed Content Summary
PubMed content summary
number of documents = 3,868,552
…
cancer 1,398,178
aids 106,512
heart 281,506
angina 26,775
hepatitis 23,481
…
basketball 907
cpu 487
Extracted automatically
– ~ 27,500 words in the extracted content summary
– Fewer than 200 queries sent
– Retrieved 4 documents per query
The extracted content summary accurately represents size and contents of the database
Topic-based Sampling: Conclusions
SDARTS now supports extraction of detailed content summaries from any database, local or remote
Sophisticated database selection algorithms can now be implemented on top of SDARTS
Implemented and available for download:
– Database Selection Module
– SDARTS Client with Database Selection
Interfacing with Open Archives Initiative (OAI)
“No man is an island, entire of itself; every man is a piece of the continent, a part of the main...” (John Donne)
Export SDARTS metadata under OAI
[Figure: SDARTS/SDLIP Server connected to an OAI Service Provider]
Access transparently any OAI collection through SDARTS
[Figure: OAI Data Provider connected to an SDARTS Client]
Exporting SDARTS Metadata under OAI
SDARTS supports detailed, record-level metadata for each document, for XML and plain-text collections
– Easy mapping to Dublin Core
SDARTS also exports content summaries under OAI
– Each SDARTS collection is mapped to an OAI set
– We export the content summaries under OAI, as metadata about the set
Example record:
<PAPER>
  <TITLE>The threat of vancomycin resistance</TITLE>
  <AUTHORS>Trish M. Perl MD, MSc</AUTHORS>
  <FILENO>ajm_106_05_0489</FILENO>
  <APPEARED>
    <JRNL>American Journal of Medicine</JRNL>
    <VOL>106</VOL><ISS>5</ISS>
    <DATE>3 May </DATE> <YEAR>1999</YEAR>
  </APPEARED>
  <ABSTRACT> … </ABSTRACT>
  <BODY> … </BODY>
</PAPER>
[Figure: collection list of the COLUMBIA SDARTS Server: PubMed Publications; Aides Medical Collection; NOAH: New York Online Access to Health; Cardiovascular Institute of the South; Columbia's DLI2 Medical Corpus; Harrisons Online]
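A mapping from a record like the PAPER example to Dublin Core could look like the sketch below. The field mapping in `FIELD_MAP` is an assumption made for illustration; the slides only state that the mapping is easy, not which fields map to which elements. The Dublin Core element names themselves (title, creator, identifier, source, date) are standard.

```python
import xml.etree.ElementTree as ET
from typing import Dict

RECORD = """<PAPER>
  <TITLE>The threat of vancomycin resistance</TITLE>
  <AUTHORS>Trish M. Perl MD, MSc</AUTHORS>
  <FILENO>ajm_106_05_0489</FILENO>
  <APPEARED>
    <JRNL>American Journal of Medicine</JRNL>
    <VOL>106</VOL><ISS>5</ISS>
    <DATE>3 May </DATE> <YEAR>1999</YEAR>
  </APPEARED>
</PAPER>"""

# Assumed mapping (not taken from the SDARTS toolkit):
# path inside the record -> Dublin Core element name.
FIELD_MAP: Dict[str, str] = {
    "TITLE": "title",
    "AUTHORS": "creator",
    "FILENO": "identifier",
    "APPEARED/JRNL": "source",
    "APPEARED/YEAR": "date",
}


def to_dublin_core(xml_text: str) -> Dict[str, str]:
    """Extract the mapped fields from one record into a flat Dublin Core dict."""
    root = ET.fromstring(xml_text)
    dc: Dict[str, str] = {}
    for path, element in FIELD_MAP.items():
        node = root.find(path)
        if node is not None and node.text:
            dc[element] = node.text.strip()
    return dc


dc = to_dublin_core(RECORD)
print(dc)
```

Fields missing from a record are simply skipped, so the same mapping can be applied to records that lack a journal or year.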
SDARTS OAI Server: Details
Uses OCLC OAI Server
Uses MySQL, via JDBC, to store OAI records
Records materialized after first request for space efficiency
Distributed as WAR file
Simple configuration: Specify SDARTS/MySQL address
[Figure: OAI Service Provider connected to the SDARTS OAI Interface, which talks to the SDARTS Server and, via JDBC, to a MySQL RDBMS]
Searching OAI Collections
OAI is not designed for searching
– Requests can be restricted only by “Date” and “Set”
Need to search OAI collections
– Users want to specify “Title”, “Author”, etc.
[Figure: a user issues the query Author = “F. Douglass”; the query passes through an OAI Service Provider toward an OAI Data Provider (e.g., Library of Congress)]
Harvesting and Searching OAI within SDARTS
OAI exports metadata records in XML
SDARTS can index and search XML collections
Solution:
– Harvest OAI records (by “Date”, “Set”)
– Store records locally as XML documents
– Use SDARTS XML wrapper to index them
The OAI collection is searchable as an SDARTS XML database
[Figure: OAI Data Provider (e.g., Library of Congress) → harvest OAI/XML records → index OAI/XML records → SDARTS/SDLIP Server]
Adding an OAI Collection in SDARTS
http://memory.loc.gov/cgi-bin/oai
loc
2002-01-01
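A harvest restricted by date and set could be requested with a URL like the one built below. This sketch assumes the OAI-PMH `ListRecords` verb with the `metadataPrefix`, `from`, and `set` arguments and the `oai_dc` metadata format. The function name `list_records_url` is hypothetical, and nothing here is taken from the SDARTS-OAI tools. The slide does not say what the value “loc” denotes, so it is left out of the request.

```python
from typing import Optional
from urllib.parse import urlencode


def list_records_url(
    base_url: str,
    metadata_prefix: str = "oai_dc",
    from_date: Optional[str] = None,
    set_spec: Optional[str] = None,
) -> str:
    """Build a ListRecords request URL, optionally restricted by date and set."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date  # harvest records changed on or after this date
    if set_spec:
        params["set"] = set_spec  # harvest only records in this set
    return base_url + "?" + urlencode(params)


# Base URL and date as shown on the slide.
url = list_records_url("http://memory.loc.gov/cgi-bin/oai", from_date="2002-01-01")
print(url)
```

The XML returned for such a request would then be stored locally and indexed with the SDARTS XML wrapper.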
Distributed Search over OAI
SDARTS treats OAI collections as simple, local XML databases
Exact content summaries are exported for OAI collections
Possible to build sophisticated distributed search over OAI using SDARTS
SDARTS Content Summary for an OAI collection:
VT Electronic Thesis & Dissertation
number of documents = 2,948
…
study 1,479
thesis 493
…
cancer 13
basketball 2
…
Conclusions
SDARTS can now extract rich content summaries from:
– Local text and XML databases
– Remote web databases
– OAI-compliant collections
SDARTS is now OAI-compliant
SDARTS allows easy integration of any OAI collection into SDARTS
SDARTS supports searching transparently over a wide range of heterogeneous collections
No programming required for any of the tasks
We are on the Web :-)
SDARTS executables and documentation
SDARTS source code with documentation
SDARTS web client
SDARTS database selection module
SDARTS-OAI interface tools
Sample SDARTS-compliant databases
http://sdarts.cs.columbia.edu/