CHESHIRE II - University of California, Berkeley


Distributed IR for Digital Libraries
Ray R. Larson
School of Information Management & Systems
University of California, Berkeley
[email protected]
August 20, 2003
ECDL 2003, Trondheim -- Ray R. Larson
Overview
• The problem area
• Distributed searching tasks and issues
• Our approach to resource characterization
and search
• Experimental evaluation of the approach
• Application and use of this method in
working systems
The Problem
• Prof. Casarosa’s definition of the Digital Library
vision in yesterday afternoon’s plenary session: access
for everyone to “all human knowledge”
• Lyman and Varian’s estimates of the “Dark Web”
• Hundreds or Thousands of servers with databases
ranging widely in content, topic, format
– Broadcast search is expensive, both in bandwidth
and in processing too many irrelevant results
– How to select the “best” ones to search?
• Which resource to search first?
• Which to search next if more is wanted?
– Topical/domain constraints on the search selections
– Variable contents of databases (metadata only, full text,
multimedia…)
Distributed Search Tasks
• Resource Description
– How to collect metadata about digital libraries and their
collections or databases
• Resource Selection
– How to select relevant digital library collections or
databases from a large number of databases
• Distributed Search
– How to perform parallel or sequential searching over the
selected digital library databases
• Data Fusion
– How to merge query results from different digital
libraries, with their different search engines, differing
record structures, etc.
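Conceptually these four tasks chain into a single metasearch pipeline. A minimal sketch of that flow in Python follows; every function name and type here is an illustrative assumption, not a Cheshire II interface.

    # Sketch of the four-stage metasearch flow; all names and types here are
    # illustrative assumptions, not Cheshire II interfaces.
    from typing import Dict, List, Tuple

    Hit = Tuple[str, float]  # (record identifier, score)

    def describe_resources(servers: List[str]) -> Dict[str, dict]:
        """Resource description: harvest a collection representative per server."""
        return {s: {} for s in servers}  # placeholder collection documents

    def select_resources(query: str, descriptions: Dict[str, dict], k: int) -> List[str]:
        """Resource selection: rank collections for the query, keep the top k."""
        return sorted(descriptions)[:k]  # scoring stub

    def search_resource(server: str, query: str) -> List[Hit]:
        """Distributed search: query one selected resource (e.g. via Z39.50)."""
        return []

    def fuse_results(result_lists: List[List[Hit]]) -> List[Hit]:
        """Data fusion: merge per-server result lists into one ranking."""
        return sorted((h for hits in result_lists for h in hits),
                      key=lambda h: h[1], reverse=True)

    def metasearch(query: str, servers: List[str], k: int = 5) -> List[Hit]:
        selected = select_resources(query, describe_resources(servers), k)
        return fuse_results([search_resource(s, query) for s in selected])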
An Approach for Distributed
Resource Discovery
• Distributed resource representation and discovery
– New approach to building resource descriptions based on
Z39.50
– Instead of using broadcast search across resources we are
using two Z39.50 Services
• Identification of database metadata using Z39.50 Explain
• Extraction of distributed indexes using Z39.50 SCAN
• Evaluation
– How efficiently can we build distributed indexes?
– How effectively can we choose databases using the index?
– How effective is merging search results from multiple
sources?
– Can we build hierarchies of servers (general / meta-topical / individual)?
Z39.50 Overview
[Diagram: Z39.50 architecture. A user interface maps queries into the protocol and sends them over the Internet; the server maps them to its local search engine, and the results are mapped back to the client UI.]
Z39.50 Explain
• Explain supports searches for
– Server-Level metadata
• Server Name
• IP Addresses
• Ports
– Database-Level metadata
• Database name
• Search attributes (indexes and combinations)
– Support metadata (record syntaxes, etc)
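As an illustration, the metadata harvested through Explain could be held in a simple record such as the following sketch; the field set mirrors the bullets above, and the example values are invented.

    from dataclasses import dataclass, field
    from typing import Dict, List

    @dataclass
    class ExplainInfo:
        """Server- and database-level metadata gathered via Z39.50 Explain."""
        server_name: str
        ip_address: str
        port: int
        database_name: str
        # Search attributes: index name -> supported attribute combinations
        search_attributes: Dict[str, List[str]] = field(default_factory=dict)
        # Support metadata, e.g. record syntaxes the server can return
        record_syntaxes: List[str] = field(default_factory=list)

    # Example of what one harvested description might look like (values invented):
    example = ExplainInfo(
        server_name="example-z3950-server",
        ip_address="192.0.2.10",
        port=2100,
        database_name="bibliographic",
        search_attributes={"title": ["1=4"], "topic": ["1=1010"]},
        record_syntaxes=["SUTRS", "XML"],
    )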
Z39.50 SCAN
• Originally intended to support Browsing
• Query for
– Database
– Attributes plus Term (i.e., index and start point)
– Step Size
– Number of terms to retrieve
– Position in Response set
• Results
– Number of terms returned
– List of Terms and their frequency in the database (for
the given attribute combination)
Z39.50 SCAN Results
Syntax: zscan indexname term stepsize number_of_terms pref_pos

% zscan topic cat 1 20 1
{SCAN {Status 0} {Terms 20} {StepSize 1} {Position 1}}
{cat 706} {cat-fight 1} {catalan 19} {catalogu 37} {catalonia 8}
{catalyt 2} {catania 1} {cataract 1} {catch 173} {catch-all 3}
{catch-up 2} …

% zscan title cat 1 20 1
{SCAN {Status 0} {Terms 20} {StepSize 1} {Position 1}}
{cat 27} {cat-and-mouse 19} {cat-burglar 1} {cat-carrying 1} {cat-egory 1}
{cat-fight 1} {cat-gut 1} {cat-litter 1} {cat-lovers 2} {cat-pee 1}
{cat-run 1} {cat-scanners 1} …
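A small sketch of turning SCAN output like the above into (term, frequency) pairs; it assumes the braced {term freq} layout shown and skips the status block.

    import re
    from typing import List, Tuple

    STATUS_KEYS = {"SCAN", "Status", "Terms", "StepSize", "Position"}

    def parse_scan_terms(scan_output: str) -> List[Tuple[str, int]]:
        """Extract (term, frequency) pairs from braced SCAN output like '{cat 706}'."""
        pairs = []
        for term, freq in re.findall(r"\{(\S+)\s+(\d+)\}", scan_output):
            if term not in STATUS_KEYS:  # skip the status block entries
                pairs.append((term, int(freq)))
        return pairs

    sample = "{SCAN {Status 0} {Terms 20} {StepSize 1} {Position 1}} {cat 706} {catalan 19}"
    print(parse_scan_terms(sample))  # [('cat', 706), ('catalan', 19)]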
Resource Index Creation
• For all servers, or a topical subset…
– Get Explain information
– For each index
• Use SCAN to extract terms and frequency
• Add term + freq + source index + database
metadata to the XML “Collection
Document” for the resource
– Planned extensions:
• Post-Process indexes (especially Geo
Names, etc) for special types of data
– e.g. create “geographical coverage” indexes
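A rough sketch of this harvesting loop, assuming hypothetical z3950_explain() and z3950_scan() helpers (Cheshire's actual client calls are not shown) and writing the collection document with the standard library's ElementTree.

    import xml.etree.ElementTree as ET

    def z3950_explain(host, port):
        """Hypothetical helper: return {database name: [index names]} via Explain."""
        raise NotImplementedError

    def z3950_scan(host, port, database, index, start_term="a", count=1000):
        """Hypothetical helper: return a list of (term, frequency) pairs via SCAN."""
        raise NotImplementedError

    def build_collection_document(host, port):
        """Harvest Explain + SCAN data into an XML 'collection document'."""
        root = ET.Element("collection", host=host, port=str(port))
        for database, indexes in z3950_explain(host, port).items():
            db_el = ET.SubElement(root, "database", name=database)
            for index in indexes:
                idx_el = ET.SubElement(db_el, "index", name=index)
                for term, freq in z3950_scan(host, port, database, index):
                    ET.SubElement(idx_el, "term", freq=str(freq)).text = term
        return ET.tostring(root, encoding="unicode")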
MetaSearch Approach
[Diagram: the MetaSearch server builds a Distributed Index by sending Explain and SCAN queries to the databases (DB 1 - DB 6); user queries are mapped against that index, forwarded over the Internet to the selected databases' search engines, and the results are mapped back and merged.]
Known Issues and Problems
• Not all Z39.50 Servers support SCAN or
Explain
• Solutions that appear to work well:
– Probing for attributes instead of Explain (e.g.,
DC attributes or analogs)
– We also support OAI and can extract OAI
metadata for servers that support OAI
– Query-based sampling (Callan)
• Collection Documents are static and need to
be replaced when the associated collection
changes
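Attribute probing can be as simple as issuing a test search for each candidate attribute and keeping those the server accepts. A sketch, again with a hypothetical z3950_search() helper and illustrative Bib-1 use-attribute values.

    def z3950_search(host, port, database, use_attribute, term):
        """Hypothetical helper: run a Z39.50 search; raise on unsupported attributes."""
        raise NotImplementedError

    # Candidate use attributes to probe (DC-like fields); values are illustrative.
    CANDIDATE_ATTRIBUTES = {"title": 4, "author": 1003, "subject": 21, "date": 30}

    def probe_attributes(host, port, database, probe_term="test"):
        """Return the subset of candidate attributes the server actually supports."""
        supported = {}
        for name, use_attr in CANDIDATE_ATTRIBUTES.items():
            try:
                z3950_search(host, port, database, use_attr, probe_term)
                supported[name] = use_attr
            except Exception:  # unsupported attribute or other search error
                continue
        return supported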
Evaluation
• Test Environment
– TREC Tipster data (approx. 3 GB)
– Partitioned into 236 smaller collections based
on source and date by month (no DOE)
• High size variability (from 1 to thousands of
records)
• Same database as used in other distributed search
studies by J. French and J. Callan among others
– Used TREC topics 51-150 for evaluation (these
are the only topics with relevance judgements
for all 3 TIPSTER disks)
Test Database Characteristics

TREC Disk  Source        Size (MB)  Size (docs)
1          WSJ (86-89)   270        98,732
1          AP (89)       259        84,678
1          ZIFF          245        75,180
1          FR (89)       262        25,960
2          WSJ (90-92)   247        74,520
2          AP (88)       241        79,919
2          ZIFF          178        56,920
2          FR (88)       211        19,860
3          AP (90)       242        78,321
3          SJMN (91)     290        90,257
3          PAT           245        6,711
           Totals        2,690      691,058
Test Database Characteristics

TREC Disk  Source        Num DB       Total DB
1          WSJ (86-89)   29
1          AP (89)       12
1          ZIFF          14
1          FR (89)       12           Disk 1: 67
2          WSJ (90-92)   22
2          AP (88)       11
2          ZIFF          11 (1 dup)
2          FR (88)       10           Disk 2: 54
3          AP (90)       12
3          SJMN (91)     12
3          PAT           92           Disk 3: 116
           Totals                     237 - 1 duplicate = 236
Harvesting Efficiency
• Tested using the databases on the previous
slide plus the full FT database (210,158 records,
~600 MB)
• Average of 23.07 seconds per database to
SCAN each database (3.4 indexes on
average) and create a collection
representative, over the network
• Average of 14.07 seconds
• Also tested larger databases (e.g., the TREC FT
database, ~600 MB with 7 indexes, was
harvested in 131 seconds)
Our Collection Ranking
Approach
• We attempt to estimate the probability of
relevance for a given collection with respect to
a query, using the Logistic Regression method
developed at Berkeley (W. Cooper, F. Gey, D.
Dabney, A. Chen), with a new algorithm for
weight calculation at retrieval time
• Estimates from multiple extracted indexes are
combined to provide an overall ranking score
for a given resource (i.e., fusion of multiple
query results)
Probabilistic Retrieval: Logistic
Regression
The probability of relevance for a given index is based on
logistic regression over a sample set of documents (TREC),
used to determine the values of the coefficients.
At retrieval time the probability estimate is obtained by:

$P(R \mid Q, C) = c_0 + \sum_{i=1}^{6} c_i X_i$
Statistics Used for Regression
Variables
• Average Absolute Query Frequency
• Query Length
• Average Absolute Collection Frequency
• Collection size estimate
• Average Inverse Collection Frequency
• Number of terms in common between query and
collection representative
(Details in the proceedings)
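A sketch of computing these six statistics for one extracted index and applying the regression equation above; the coefficients are placeholders rather than the trained Berkeley values, and X2 and X4 follow the definitions given on the attribute slide at the end of this deck.

    import math

    def collection_score(query_terms, query_freqs, coll_freqs, coll_length,
                         icf, coefficients):
        """Estimate the relevance of one collection index for a query.

        query_freqs:  term -> frequency in the query
        coll_freqs:   term -> frequency in the collection representative (from SCAN)
        coll_length:  collection size estimate (CL)
        icf:          term -> inverse collection frequency, (N - n_t) / n_t
        coefficients: c0..c6 from the trained logistic regression (placeholders here)
        """
        common = [t for t in query_terms if t in coll_freqs]
        M = len(common)
        if M == 0:
            return 0.0
        x1 = sum(math.log(query_freqs[t]) for t in common) / M        # avg abs query freq
        x2 = len(query_terms)                                         # query length
        x3 = sum(math.log(coll_freqs[t]) for t in common) / M         # avg abs collection freq
        x4 = coll_length / 10                                         # collection size estimate
        x5 = sum(math.log(max(icf[t], 1e-9)) for t in common) / M     # avg inverse collection freq
        x6 = math.log(M)                                              # terms in common
        c = coefficients
        return c[0] + sum(ci * xi for ci, xi in zip(c[1:], (x1, x2, x3, x4, x5, x6)))

Per-index estimates computed this way would then be fused into an overall score for the resource.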
Other Approaches
• GlOSS – Developed by the Digital Library project at
Stanford University. Uses a fairly conventional
TF-IDF ranking
• CORI – Developed by J. Callan and
students at CIIR. Uses a ranking that
exploits some of the features of the
INQUERY system in merging evidence
Evaluation
• Effectiveness
– Tested using the collection representatives
described above (as harvested from over the
network) and the TIPSTER relevance
judgements
– Testing by comparing our approach to known
algorithms for ranking collections
– Results were measured against reported results
for the Ideal and CORI algorithms and against
the optimal “Relevance Based Ranking”
(MAX)
– Recall analog (how many of the relevant documents
occurred in the top n databases, averaged over queries)
Titles only (short query)
[Chart: R̂ (recall analog) vs. number of collections selected, for title-only (short) queries]
Long Queries
[Chart: R̂ (recall analog) vs. number of collections selected, for long queries]
Very Long Queries
[Chart: R̂ (recall analog) vs. number of collections selected, for very long queries]
Current Usage
• Mersey Libraries
• Distributed Archives Hub
• Related approaches
– JISC Resource Discovery Network
• (OAI-PMH harvesting with Cheshire Search)
– Planned use with TEL by the BL
Future
• Logically Clustering servers by topic
• Meta-Meta Servers (treating the
MetaSearch database as just another
database)
Distributed Metadata Servers
[Diagram: a hierarchy of distributed metadata servers: general servers at the top, meta-topical servers (with replicated servers) below them, sitting above the individual database servers]
Conclusion
• A practical method for metadata harvesting and an
effective algorithm for distributed resource
discovery
• Further research
– Continuing development of the Cheshire III system
– Applicability of language modelling methods to
resource discovery
– Developing and evaluating methods for merging cross-domain
results, such as text and image or text and GIS
datasets (or, perhaps, when to keep them separate)
Further Information
• Full Cheshire II client and server source is
available
ftp://cheshire.berkeley.edu/pub/cheshire/
– Includes HTML documentation
• Project Web Site
http://cheshire.berkeley.edu/
Probabilistic Retrieval: Logistic
Regression attributes

$X_1 = \frac{1}{M}\sum_{j=1}^{M} \log QAF_{t_j}$ : Average Absolute Query Frequency

$X_2 = QL$ : Query Length

$X_3 = \frac{1}{M}\sum_{j=1}^{M} \log CAF_{t_j}$ : Average Absolute Collection Frequency

$X_4 = \frac{CL}{10}$ : Collection size estimate

$X_5 = \frac{1}{M}\sum_{j=1}^{M} \log ICF_{t_j}$ : Average Inverse Collection Frequency

$ICF_{t_j} = \frac{N - n_{t_j}}{n_{t_j}}$ : Inverse Collection Frequency (N = number of collections, $n_{t_j}$ = number of collections containing term $t_j$)

$X_6 = \log M$ : M = number of terms in common between the query and the collection representative
CORI ranking

$T = \dfrac{df}{df + 50 + 150 \cdot cw/\overline{cw}}$

$I = \dfrac{\log\left(\frac{|DB| + 0.5}{cf}\right)}{\log(|DB| + 1.0)}$

$p(r_k \mid db_i) = 0.4 + 0.6 \cdot T \cdot I$

where:
df is the number of documents in $db_i$ containing $r_k$
cf is the number of databases containing $r_k$
$|DB|$ is the number of databases being ranked
cw is the number of words in $db_i$
$\overline{cw}$ is the average cw of the databases being ranked
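For comparison, the CORI belief formula above translates almost directly into code (a sketch; the 50/150 and 0.4/0.6 constants are those shown in the formula).

    import math

    def cori_belief(df, cf, cw, avg_cw, num_dbs):
        """CORI belief p(r_k | db_i) for one query term in one database.

        df: documents in db_i containing the term
        cf: number of databases containing the term
        cw: number of words in db_i; avg_cw: average cw over the ranked databases
        num_dbs: |DB|, the number of databases being ranked
        """
        T = df / (df + 50 + 150 * cw / avg_cw)
        I = math.log((num_dbs + 0.5) / cf) / math.log(num_dbs + 1.0)
        return 0.4 + 0.6 * T * I

Per-term beliefs are typically combined (e.g., averaged) across the query terms to rank the databases.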
Measures for Evaluation
• Assume each database has some merit for a
given query, q
• Given a Baseline ranking B and an estimated
(test) ranking E for
• Let dbbi and dbei denote the database in the ith ranked position of rankings B and E
• Let Bi = merit(q, dbbi) and Ei = merit(q, dbei)
• We can define some measures:
Measures for Evaluation – Recall Analogs

$R_n = \dfrac{\sum_{i=1}^{n} E_i}{\sum_{i=1}^{n} B_i}$

$\hat{R}_n = \dfrac{\sum_{i=1}^{n} E_i}{\sum_{i=1}^{n^{*}} B_i}$, where $n^{*} = \max\{k : B_k \neq 0\}$
Measures for Evaluation – Precision Analog

$P_n = \dfrac{|\{\,db \in Top_n(E) : merit(q, db) > 0\,\}|}{|Top_n(E)|}$

i.e., the fraction of the top n databases in the
estimated ranking with non-zero merit.
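These measures are straightforward to compute once each database's merit for the query is known (merit is often taken to be the number of relevant documents the database holds). A sketch:

    def recall_precision_analogs(baseline_merits, estimated_merits, n):
        """Compute R_n, R-hat_n and P_n for the top-n databases of a test ranking.

        baseline_merits:  merits B_i in baseline (best) ranking order
        estimated_merits: merits E_i in the estimated ranking order
        """
        B, E = baseline_merits, estimated_merits
        n_star = max((k + 1 for k, b in enumerate(B) if b > 0), default=0)
        R_n = sum(E[:n]) / sum(B[:n]) if sum(B[:n]) else 0.0
        R_hat_n = sum(E[:n]) / sum(B[:n_star]) if sum(B[:n_star]) else 0.0
        P_n = sum(1 for e in E[:n] if e > 0) / n
        return R_n, R_hat_n, P_n

    # Example: 5 databases, ranked by a baseline and by an estimated ranking.
    print(recall_precision_analogs([9, 5, 3, 0, 0], [5, 3, 0, 9, 0], n=3))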