CHESHIRE II - University of California, Berkeley

Download Report

Transcript CHESHIRE II - University of California, Berkeley

Cheshire II: Recent Additions
&
Cheshire III:
Design and System Overview
Ray R. Larson
School of Information Management and
Systems
University of California, Berkeley
October 3, 2003
CDL -- Cheshire II & III -- Ray R. Larson
Overview
• Cheshire II
– Feature overview
– Current usage
– Recent Additions
•
•
•
•
•
•
Distributed Search and Indexing
Geographic Operators and Search Ranking
XML Schemas and Element Retrieval
MySQL and PostgreSQL interfaces
CORI, Okapi BM-25 ranking algorithms
Result Set sorting, merging and ranking operators, bitmapped
indexes
• Cheshire III Design and Development
October 3, 2003
CDL -- Cheshire II & III -- Ray R. Larson
Overview of Cheshire II
• It supports SGML and XML
• It is a client/server application
• Uses the Z39.50 Information Retrieval Protocol, support for SRW,
OAI, SOAP, SDLIP also implemented
• Server supports a Relational Database Gateway
• Supports Boolean searching of all servers
• Supports probabilistic ranked retrieval in the Cheshire search engine as
well as Boolean and proximity search
• Search engine supports ``nearest neighbor'' searches and relevance
feedback
• GUI interface on X window displays and Windows NT
• WWW/CGI forms interface for DL, using combined client/server CGI
scripting via WebCheshire
• Scriptable clients using Tcl and (new) Python
• Store SGML/XML as files or “Datastore” database
October 3, 2003
CDL -- Cheshire II & III -- Ray R. Larson
Current Usage
• Over 100 Databases in the UK, including
–
–
–
–
AHDS/History Data Service
Mersey Libraries
ZETOC
Archives Hub
• Distributed Archives Hub
– JISC Resource Discovery Network (RDN)
• (OAI-MHP Harvesting with Cheshire Search)
– Planned use with TEL being developed by the BL
• Also being used at Harvard and Berkeley
• California Sheet Music Project
• Los Alamos National Lab (genomics metadata)
October 3, 2003
CDL -- Cheshire II & III -- Ray R. Larson
Distributed Search
October 3, 2003
CDL -- Cheshire II & III -- Ray R. Larson
The Problem
• The Digital Library vision -- Access to everyone
for “all human knowledge”
• Lyman and Varian’s estimates of the “Dark Web”
• Hundreds or Thousands of servers with databases
ranging widely in content, topic, format
– Broadcast search is expensive in terms of bandwidth
and in processing too many irrelevant results
– How to select the “best” ones to search?
• Which resource to search first?
• Which to search next if more is wanted?
– Topical /domain constraints on the search selections
– Variable contents of database (metadata only, full text,
multimedia…)
October 3, 2003
CDL -- Cheshire II & III -- Ray R. Larson
Distributed Search Tasks
• Resource Description
– How to collect metadata about digital libraries and their
collections or databases
• Resource Selection
– How to select relevant digital library collections or
databases from a large number of databases
• Distributed Search
– How to perform parallel or sequential searching over the
selected digital library databases
• Data Fusion
– How to merge query results from different digital
libraries with their different search engines, differing
record structures,CDLetc.
-- Cheshire II & III -- Ray R. Larson
October 3, 2003
An Approach for Distributed
Resource Discovery
• Distributed resource representation and discovery
– New approach to building resource descriptions based on
Z39.50
– Instead of using broadcast search across resources we are
using two Z39.50 Services
• Identification of database metadata using Z39.50 Explain
• Extraction of distributed indexes using Z39.50 SCAN
• Evaluation
– How efficiently can we build distributed indexes?
– How effectively can we choose databases using the index?
– How effective is merging search results from multiple
sources?
– Can we build hierarchies of servers (general/metatopical/individual)?
October 3, 2003
CDL -- Cheshire II & III -- Ray R. Larson
Z39.50 Explain
• Explain supports searches for
– Server-Level metadata
• Server Name
• IP Addresses
• Ports
– Database-Level metadata
• Database name
• Search attributes (indexes and combinations)
– Support metadata (record syntaxes, etc)
October 3, 2003
CDL -- Cheshire II & III -- Ray R. Larson
Z39.50 SCAN
• Originally intended to support Browsing
• Query for
–
–
–
–
–
Database
Attributes plus Term (i.e., index and start point)
Step Size
Number of terms to retrieve
Position in Response set
• Results
– Number of terms returned
– List of Terms and their frequency in the database (for
the given attribute combination)
October 3, 2003
CDL -- Cheshire II & III -- Ray R. Larson
Z39.50 SCAN Results
Syntax: zscan indexname1 term stepsize number_of_terms pref_pos
zscan topic cat 1 20 1
% zscan title cat 1 20 1
{SCAN {Status 0}
{SCAN {Status 0}
{Terms 20}
{Terms 20}
{StepSize 1}
{StepSize 1}
{Position 1}}
{Position 1}}
{cat 706}
{cat 27}
{cat-and-mouse 19}
{cat-fight 1}
{cat-burglar 1}
{catalan 19}
{cat-carrying 1}
{cat-egory 1}
{catalogu 37}
{cat-fight 1}
{catalonia 8}
{cat-gut 1}
{catalyt 2}
{cat-litter 1}
{catania 1}
{cat-lovers 2}
{cataract 1}
{cat-pee 1}
{catch 173}
{cat-run 1}
{catch-all 3}
{cat-scanners 1} …
{catch-up 2} …
CDL -- Cheshire II & III -- Ray R. Larson
October 3, 2003
Resource Index Creation
• For all servers, or a topical subset…
– Get Explain information
– For each index
• Use SCAN to extract terms and frequency
• Add term + freq + source index + database metadata
to the XML “Collection Document” for the resource
– Planned extensions:
• Post-Process indexes (especially Geo Names, etc)
for special types of data
– e.g. create “geographical coverage” indexes
October 3, 2003
CDL -- Cheshire II & III -- Ray R. Larson
MetaSearch Approach
Map
Query
Map Explain
And Scan
Queries
Search
Engine
Map
Results
MetaSearch
Server
Internet
DB 1
Map
Results
Distributed
Index
October 3, 2003
DB2
Map
Query
Search
Engine
Db 5 Db 6
CDL -- Cheshire II & III -- Ray R. Larson
Search
Engine
Map
Results
DB 3 DB 4
Known Issues and Problems
• Not all Z39.50 Servers support SCAN or Explain
• Solutions that appear to work well:
– Probing for attributes instead of explain (e.g. DC
attributes or analogs)
– We also support OAI and can extract OAI metadata for
servers that support OAI
– Query-based sampling (Callan)
• Collection Documents are static and need to be
replaced when the associated collection changes
October 3, 2003
CDL -- Cheshire II & III -- Ray R. Larson
Evaluation
• Test Environment
– TREC Tipster data (approx. 3 GB)
– Partitioned into 236 smaller collections based on source
and date by month (no DOE)
• High size variability (from 1 to thousands of records)
• Same database as used in other distributed search studies by J.
French and J. Callan among others
– Used TREC topics 51-150 for evaluation (these are the
only topics with relevance judgements for all 3
TIPSTER disks
October 3, 2003
CDL -- Cheshire II & III -- Ray R. Larson
Harvesting Efficiency
• Tested using the databases on the previous slide +
the full FT database (210,158 records ~ 600 Mb)
• Average of 23.07 seconds per database to SCAN
each database (3.4 indexes on average) and create
a collection representative, over the network
• Average of 14.07 seconds
• Also tested larger databases (E.g. TREC FT
database ~600 Mb with 7 indexes was harvested in
131 seconds.
October 3, 2003
CDL -- Cheshire II & III -- Ray R. Larson
Our Collection Ranking
Approach
• We attempt to estimate the probability of
relevance for a given collection with respect to
a query using the Logistic Regression method
developed at Berkeley (W. Cooper, F. Gey, D.
Dabney, A. Chen) with new algorithm for
weight calculation at retrieval time
• Estimates from multiple extracted indexes are
combined to provide an overall ranking score
for a given resource (I.e., fusion of multiple
query results)
October 3, 2003
CDL -- Cheshire II & III -- Ray R. Larson
Probabilistic Retrieval: Logistic
Regression
Probability of relevance for a given index is based on
logistic regression from a sample set documents
to determine values of the coefficients (TREC).
At retrieval the probability estimate is obtained by:
6
P( R | Q, C )  c0   ci X i
i 1
October 3, 2003
CDL -- Cheshire II & III -- Ray R. Larson
Probabilistic Retrieval: Logistic
Regression attributes
1
X1 
M
M
 log QAF
tj
1
Query Length
X 2  QL
1
X3 
M
Average Absolute Query Frequency
M
 log CAF
Average Absolute Collection Frequency
tj
1
CL
10
1 M
X5 
log ICFt j

M 1
N  nt j
ICF 
nt j
Collection size estimate
X4 
X 6October
 log
M
3, 2003
Average Inverse Collection Frequency
Inverse Document Frequency (N =
Number of collections
M=
Number of Terms in common
CDL -- Cheshire II & III -- Ray R. Larson
between query and document
Evaluation
• Effectiveness
– Tested using the collection representatives described
above (as harvested from over the network) and the
TIPSTER relevance judgements
– Testing by comparing our approach to known
algorithms for ranking collections
– Results were measured against reported results for the
Ideal and CORI algorithms and against the optimal
“Relevance Based Ranking” (MAX)
– Recall analog (How many of the Rel docs occurred in
the top n databases – averaged)
October 3, 2003
CDL -- Cheshire II & III -- Ray R. Larson
Titles only (short query)
Rˆ
October 3, 2003
CDL -- Cheshire II & III -- Ray R. Larson
Future
• Logically Clustering servers by topic
• Meta-Meta Servers (treating the
MetaSearch database as just another
database)
October 3, 2003
CDL -- Cheshire II & III -- Ray R. Larson
Distributed Metadata Servers
General Servers
October 3, 2003
Meta-Topical
Servers
Replicated
servers
CDL -- Cheshire II & III -- Ray R. Larson
Database
Servers
Geographic Operators and Search
Ranking
October 3, 2003
CDL -- Cheshire II & III -- Ray R. Larson
The GEO Operations
• Operators established for the GEO Z39.50 profile
• Implemented using special operations on indexes
• Indexing allows extraction of geographic
coordinates and dates from SGML/XML data in a
variety of formats
• Normalized internal representation in indexes
• Search using geographic and time elements as
primary or limiting search elements
October 3, 2003
CDL -- Cheshire II & III -- Ray R. Larson
The GEO Operations
• X-based interfaces permit (simple) map
drawing and search
• Interface to MapServer for web-based map
searching
October 3, 2003
CDL -- Cheshire II & III -- Ray R. Larson
GEO Geographic operators
>=<
>#<
<#>
<>#
++
:<:
:<=:
Overlap
Fully Enclosed
Encloses
Fully Outside
Near
Before
Before or
During
:>=: During or
After
:>: After
October 3, 2003
Search region and data Overlap
Data fully enclosed in search reg.
Data fully encloses search region
Data outside of search region
Data is near search region
Data date is before search date
Data date is before or during
search date
Data date is during or after search
date
Data date is after search date
CDL -- Cheshire II & III -- Ray R. Larson
Overlaps search
October 3, 2003
CDL -- Cheshire II & III -- Ray R. Larson
Fully Enclosed Search
October 3, 2003
CDL -- Cheshire II & III -- Ray R. Larson
Map-Based Search
October 3, 2003
CDL -- Cheshire II & III -- Ray R. Larson
GeoSearch Web Interface
October 3, 2003
CDL -- Cheshire II & III -- Ray R. Larson
XML Schemas and Element
Retrieval
October 3, 2003
CDL -- Cheshire II & III -- Ray R. Larson
XML Schema Support
• XML Schemas can now be used to define
the data contents
• Tested with a wide variety of schemas
including METS (with various supporting
schemas)
October 3, 2003
CDL -- Cheshire II & III -- Ray R. Larson
XML Element Extraction
• A new search “ElementSetName” is
XML_ELEMENT_
• Any Xpath, element name, or regular
expression can be included following the
final underscore when submitting a present
request
• The matching elements are extracted from
the records matching the search and
delivered in a simple format..
October 3, 2003
CDL -- Cheshire II & III -- Ray R. Larson
XML Extraction
% zselect sherlock
372 {Connection with SHERLOCK (sherlock.berkeley.edu) database 'bibfile' at port
2100 is open as connection #372}
% zfind topic mathematics
{OK {Status 1} {Hits 26} {Received 0} {Set Default} {RecordSyntax UNKNOWN}}
% zset recsyntax XML
% zset elementset XML_ELEMENT_Fld245
% zdisplay
{OK {Status 0} {Received 10} {Position 1} {Set Default} {NextPosition 11}
{RecordSyntax XML 1.2.840.10003.5.109.10}} {
<RESULT_DATA DOCID="1">
<ITEM XPATH="/USMARC[1]/VarFlds[1]/VarDFlds[1]/Titles[1]/Fld245[1]">
<Fld245 AddEnty="No" NFChars="0"><a>Singularitâes áa Cargáese</a></Fld245>
</ITEM>
<RESULT_DATA> … etc…
October 3, 2003
CDL -- Cheshire II & III -- Ray R. Larson
MySQL and PostgreSQL
October 3, 2003
CDL -- Cheshire II & III -- Ray R. Larson
RDBMS Support
• There are two reasons for RDBMS support
– IR systems are not meant for LOTS of update
transactions
– Some application need to have access to both relational
data and text data via Z39.50
• Both MySQL and PostgreSQL are popular open
source RDBMS and now either can now be used
via Cheshire
– Z39.50 mappings to RDBMS columns
– “ZQL” submission of SQL as Z39.50 Type 0 query
October 3, 2003
CDL -- Cheshire II & III -- Ray R. Larson
Protocol Support
October 3, 2003
CDL -- Cheshire II & III -- Ray R. Larson
Protocols
• In Cheshire II most protocols (except
Z39.50) are implemented using scripting
• Example scripts to support the following are
included in the distribution
–
–
–
–
October 3, 2003
OAI
SRW (Python version)
SOAP
SDLIP
CDL -- Cheshire II & III -- Ray R. Larson
CORI, Okapi BM-25 ranking
algorithms
October 3, 2003
CDL -- Cheshire II & III -- Ray R. Larson
Why additional ranking methods
• CORI is extremely hard to beat as a
distributed search method
• OKAPI BM-25 is now the “default”
retrieval algorithm in experimental IR
• New operators (later) let us mix and match
ranking methods and Boolean operations
October 3, 2003
CDL -- Cheshire II & III -- Ray R. Larson
CORI ranking
df
T
df  50  150 cw / cw
 | DB | 0.5 

log
cf


I
log| DB | 1.0 
p (rk | dbi )  0.4  0.6  T  I
where :
df is number of document scont ainingrk
cf is number of dat abasescont ainingrk
| DB | is t henumber of dat abases being ranked
cw is t henumber of words in dbi
CDL --of
Cheshire
II & dat
III -- Ray
R. Larson
cw is t heaveragecw
t he
abases
being ranked
October 3, 2003
Okapi BM25
(k1  1)tf (k3  1)qtf
w

K  tf
k3  qtf
T Q
(1)
•
•
•
•
•
•
Where:
Q is a query containing terms T
K is k1((1-b) + b.dl/avdl)
k1, b and k3 are parameters , usually 1.2, 0.75 and 7-1000
tf is the frequency of the term in a specific document
qtf is the frequency of the term in a topic from which Q was
derived
• dl and avdl are the document length and the average document
 r  0.5 
length measured in some convenient unit


R

r

0
.
5
(1)


(1)
• w is the Robertson-Sparck Jones weight. w  log
October 3, 2003
CDL -- Cheshire II & III -- Ray R. Larson
n  r  0.5




 N  n  R  r  0.5 
Result Set sorting, merging and
ranking operators, bitmapped
indexes
October 3, 2003
CDL -- Cheshire II & III -- Ray R. Larson
Sorting
• Support for Z39.50 Sort functions
• Merge multiple resultsets and sort new set
– Sort by index name/key (ATTRIBUTE)
– Sort by rank (ELEMENTS)
• Merges ranked results and Boolean results
– Sort by XML/SGML Tag contents (TAG)
October 3, 2003
CDL -- Cheshire II & III -- Ray R. Larson
Merging and Ranking Operators
• Extends the capabilities of merging to include
merger operations in queries like Boolean
operators
• Fuzzy Logic Operators
– !FUZZY_AND
– !FUZZY_OR
– !FUZZY_NOT
• Restrict components to particular parents
• Merge Operators
– !MERGE_SUM
– !MERGE_MEAN
October 3, 2003
CDL -- Cheshire II & III -- Ray R. Larson
Bitmapped Indexes
• Bitmap indexes can be used for Boolean
operations where the data has only a few
values and very large numbers of items with
each value
• Only one bit per record stored in the index
• Processed on a demand basis so only blocks
with the bits needed to resolve a query are
fetched
October 3, 2003
CDL -- Cheshire II & III -- Ray R. Larson
Cheshire III Design and
Development
October 3, 2003
CDL -- Cheshire II & III -- Ray R. Larson
Cheshire III Goals
• Retain or reproduce (and refine) all Cheshire II
features
– “Spring cleaning” of code base
– Add Full Unicode Support
– Store most system and content data in the database
• Permit easy and efficient integration in Web
Services
• Use threaded server for economy of resource
usage
• Enhanced Multiprotocol support
• Support for distributed processing (I.e. GRID
clusters)
• Enhance expandability and “drop in’ functionality
• Interfaces and/or APIs for Java, Python, C/C++
October 3, 2003
CDL -- Cheshire II & III -- Ray R. Larson
Cheshire II Design Overview
Z
SERVER
CONT
CONFIG
INDEX
CLUSTER
BUILD
ASSOC
XML DOCS
INDEX
CHESHIRE
XML
DIRECTORY
October 3, 2003
CDL -- Cheshire II & III -- Ray R. Larson
Cheshire III Server Overview
Cheshire III SERVER
C
O
N
F Normalization
I
C
G I L
S
&
N U
S
C
E
O D S A C
N E T R A
N
T X E
C
R I R
O N I H
L G N
G
SERVER
CONTROL
API
A
U
T
H
E
N
T
I
C
A
T
I
O
N
X
S
L
T
R
E
C
O
R
D
T
R
A
N
S
F
O
R
M
S
DB API
USER
INFO
A
P
Native calls
A
C
H
P H Z39.50
R A SOAP E
O N OAI I
SRW
T D Fetch ID N
O L Put ID T
E
C EOpenURLR
O R UDDI F
WSRP A
L
OGIS C
E
JDBC
N
E
T
W
O
R
K
STAFF UI
CONFIG
User/
Client
Clients
REMOTE
SYSTEMS
(any protocol)
LOCAL DB
XML
October 3, 2003
CONFIG
RESULT
INDEXES & Metadata
SETS
CDL -- Cheshire II & III -- Ray R. Larson
INFO
ACCESS
INFO
Cheshire III SERVER
C
O
N
F
I
G
&
C
O
N
T
R
O
L
SERVER
CONTROL
API
Normalization
I
N
D
E
X
I
N
G
C
L
U
S
T
E
R
I
N
G
S
E
A
R
C
H
A
U
T
S H
E
C N
A T
N I
C
A
T
I
O
N
X
S
L
T
T
R R
E A
C N
O S
R F
D O
R
M
S
USER
INFO
Native calls
P
R
O
T
O
C
O
L
H
A
N
D
L
E
R
Z39.50
SOAP
OAI
SRW
Fetch ID
Put ID
OpenURL
UDDI
WSRP
OGIS
A
P
A
C
H
E
I
N
T
E
R
F
A
C
E
JDBC
DB API
N
E
T
W
O
R
K
STAFF UI
CONFIG
User/
Client
Clients
REMOTE
SYSTEMS
(any protocol)
LOCAL DB
XML
October 3, 2003
INDEXES
CONFIG
RESULT
& Metadata
SETS
CDL -- Cheshire
II & III -- Ray R. Larson
INFO
ACCESS
INFO
Retain Features
• The intent is to permit all of the types of in
indexing, searching and record formatting
available now, while making it easier to add
new capabilities
• The new system will also support full
UNICODE for content and for metadata
• Store metadata and content in the database
(including config information, etc.)
October 3, 2003
CDL -- Cheshire II & III -- Ray R. Larson
Permit easy integration of Web
Services
• The assumption is that the web server will
be the central server mechanism in the
future.
• The new design relies on the session
handling, threading and load management
tools available in Apache (2.0.40+)
• The Cheshire server is dynamically loaded
as part of the Web Server
October 3, 2003
CDL -- Cheshire II & III -- Ray R. Larson
Multiprotocol Support
• The Web server handles the network issues
and passes requests in various protocols
along to the Cheshire Server.
• Individual Protocol “plugins” and the
Protocol Handler convert search, display,
and metadata requests in a particular
protocol to the internal Cheshire III control
language, and convert outgoing message
and data to the appropriate protocol form
October 3, 2003
CDL -- Cheshire II & III -- Ray R. Larson
•
Distributed Processing
(RESEARCH)
The server will support protocols for interchange
of partial results and collection statistics with a
single “Master” controlling the actions of a large
number of “Slave” servers
• These will run in parallel in a GRID environment
• This is still “research” but will probably be using
“Storage Grid” technology from SDSC with our
own applications
• Non-Grid use of the same protocols, etc will be
possible (but definitely slower)
October 3, 2003
CDL -- Cheshire II & III -- Ray R. Larson
Enhanced Expanability
• Clearly defined APIs for interacting with
the server will permit easy addition of new
functionality, or to replace or upgrade
existing functionality
• Interactive user interface for database
configuration and setup
– We want to make it easier for a
user/administrator to create and manage the
database
October 3, 2003
CDL -- Cheshire II & III -- Ray R. Larson
Multilingual APIs
• The system is being developed in a
multilingual environment.
• We will include the ability to interface with
(at a minimum) Java, Python and C/C++
applications.
• APIs for developing new functions will be
available in these languages as well
October 3, 2003
CDL -- Cheshire II & III -- Ray R. Larson
Development
• Currently work is going on here (RRL) and
(primarily) in the UK
• We have incomplete (Alpha) versions of the
system, but haven’t been distributing it in
the current form (changing constantly)
• First release version is expected in mid-’04
October 3, 2003
CDL -- Cheshire II & III -- Ray R. Larson
For More Information
• http://Cheshire.berkeley.edu
• ftp://Cheshire.berkeley.edu for source
• Contact [email protected]
October 3, 2003
CDL -- Cheshire II & III -- Ray R. Larson