A Logistic Regression Approach to Distributed IR
Ray R. Larson : School of Information Management & Systems, University of California, Berkeley -- [email protected]
Distributed IR Tasks
• Resource Description
  – How to collect metadata about digital libraries and their collections or databases
• Resource Selection
  – How to select relevant digital library collections or databases from a large number of databases
• Distributed Search
  – How to perform parallel or sequential searching over the selected digital library databases
• Data Fusion
  – How to merge query results from different digital libraries with their different search engines, differing record structures, etc.

The Problem
• Hundreds or thousands of servers with databases ranging widely in content, topic, and format
• Broadcast search is expensive in terms of bandwidth and in processing too many irrelevant results
• How to select the "best" ones to search?
  – What to search first
  – Which to search next
• Topical/domain constraints on the search selections
• Variable contents of databases (metadata only, full text, …)

Our Approach Using Z39.50
• A new approach to building metasearch based on Z39.50: instead of using broadcast search we are using two Z39.50 services
  – Identification of database metadata using Z39.50 Explain
  – Extraction of distributed indexes using Z39.50 SCAN
  – Creation of "Collection Documents" using the index contents
• Evaluation questions
  – How efficiently can we build distributed indexes?
  – How effectively can we choose databases using the index?
  – How effective is merging search results from multiple sources?
  – Do hierarchies of servers (general/meta-topical/individual) work?
[Figure: MetaSearch architecture. The MetaSearch server sends Explain and SCAN queries to the individual database servers (DB 1-DB 6) over the Internet, builds a distributed index from the results, and maps user queries and result sets to and from the selected search engines. A query-routing sketch follows.]
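The routing model just described can be made concrete with a short sketch. Everything here is hypothetical scaffolding rather than the poster's software: score_collection stands in for the collection ranking described in the next section, search_server stands in for a search against one remote database, and the merge step is a naive score sort.

```python
# Hypothetical sketch of metasearch query routing (not the poster's actual code).
# Collections are ranked with a scoring function (e.g., the logistic regression
# ranking described below), the top-k servers are searched, and results merged.

def route_query(query, collections, score_collection, search_server, k=10):
    """Rank harvested collection documents, search the top k servers,
    and merge their results by (score, collection, docid)."""
    ranked = sorted(collections,
                    key=lambda c: score_collection(query, c),
                    reverse=True)
    merged = []
    for coll in ranked[:k]:                 # search only the most promising servers
        for docid, score in search_server(coll, query):
            merged.append((score, coll["name"], docid))
    merged.sort(reverse=True)               # naive score-based result merging
    return merged
```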
Probabilistic Retrieval Using Logistic Regression
We attempt to estimate the probability of relevance for a given collection with respect to a query using the logistic regression method developed at Berkeley (W. Cooper, F. Gey, D. Dabney, A. Chen), with a new algorithm for weight calculation at retrieval time. We calculate the probability of relevance using logistic regression over a sample set of documents to determine the values of the coefficients. At retrieval time the probability of relevance P(R | Q, C) for a particular query Q and collection C is estimated from the log odds

log O(R | Q, C) = c0 + sum_{i=1..6} ci × Xi

for the six X attribute measures:

X1 = (1/M) sum_{j=1..M} log QAF(tj)   (Average Absolute Query Frequency)
X2 = sqrt(QL)                         (Query Length)
X3 = (1/M) sum_{j=1..M} log CAF(tj)   (Average Absolute Collection Frequency)
X4 = sqrt(CL)                         (Collection Length)
X5 = (1/M) sum_{j=1..M} log ICF(tj)   (Average Inverse Collection Frequency)
X6 = log M

where M is the number of terms in common between the query and the collection document, QL is the query length, CL is the collection length, ICF(tj) = (N - n(tj)) / n(tj) is the inverse collection frequency, N is the number of collections, and n(tj) is the number of collections containing term tj. The scores are calculated as log odds and then converted to probabilities of relevance for ranking the collections.
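As an illustration, here is a small sketch of how the six collection features and the log-odds score could be computed from harvested collection statistics. The data structures (term-frequency dictionaries, etc.) are hypothetical stand-ins for the poster's "collection documents", and the square roots and the ICF form follow the reconstruction above.

```python
import math

def collection_features(query_terms, coll_tf, coll_len, num_colls, colls_with_term):
    """Compute X1..X6 for one collection.
    query_terms: list of query terms (with repeats); coll_tf: term -> frequency in
    the collection document; coll_len: CL, words in the collection; num_colls: N;
    colls_with_term: term -> number of collections containing the term."""
    qtf = {}
    for t in query_terms:
        qtf[t] = qtf.get(t, 0) + 1
    common = [t for t in qtf if coll_tf.get(t, 0) > 0]   # terms in query AND collection
    M = len(common)
    if M == 0:
        return None                                      # no overlap: no evidence
    X1 = sum(math.log(qtf[t]) for t in common) / M       # avg log QAF
    X2 = math.sqrt(len(query_terms))                     # sqrt of query length QL
    X3 = sum(math.log(coll_tf[t]) for t in common) / M   # avg log CAF
    X4 = math.sqrt(coll_len)                             # sqrt of collection length CL
    X5 = sum(math.log(max(num_colls - colls_with_term[t], 1) / colls_with_term[t])
             for t in common) / M                        # avg log ICF, floored to avoid log(0)
    X6 = math.log(M)
    return (X1, X2, X3, X4, X5, X6)

def log_odds_score(features, coeffs):
    """log O(R|Q,C) = c0 + sum(ci * Xi) for a fitted coefficient vector (c0, ..., c6)."""
    return coeffs[0] + sum(c * x for c, x in zip(coeffs[1:], features))
```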
Distributed Retrieval Testing and Results
We used collections formed by dividing the documents on TIPSTER disks 1, 2, and 3 into 236 sets based on source and month (using the same contents as in the evaluations by Powell & French and by Callan). The query set used was TREC queries 51-150. Collection relevance information was based on whether any documents in the collection were relevant according to the relevance judgements for TREC queries 51-150. The relevance information was used both for estimating the logistic regression coefficients (using a sample of the data) and for the evaluation (with the full data).

Comparative Evaluation
– Tested using the collection representatives as harvested from over the network and the TIPSTER relevance judgements
– Testing by comparing our approach to known algorithms for ranking collections
– Results (preliminary) were measured against reported results for the Ideal and CORI algorithms and against the optimal "Relevance Based Ranking" (MAX)
– Recall analog (how many of the relevant documents occurred in the top n databases, averaged over queries)

CORI Ranking
The CORI ranking used for comparison scores each database dbi for a query term rk by

T = df / (df + 50 + 150 × cw / avg_cw)
I = log((|DB| + 0.5) / cf) / log(|DB| + 1.0)
p(rk | dbi) = 0.4 + 0.6 × T × I

where:
df is the number of documents containing rk
cf is the number of databases containing rk
|DB| is the number of databases being ranked
cw is the number of words in dbi
avg_cw is the average cw of the databases being ranked
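For concreteness, here is a minimal sketch of the CORI belief computation shown above. It assumes simple in-memory collection statistics (a hypothetical CollectionStats structure, not from the poster), and it combines term beliefs by plain summation, which is a simplification of the full inference-network combination.

```python
import math
from dataclasses import dataclass

@dataclass
class CollectionStats:
    """Hypothetical per-database statistics for the CORI formula."""
    df: dict      # term -> number of documents in this database containing the term
    cw: int       # number of words in this database

def cori_score(query_terms, db, all_dbs, cf):
    """Sum of CORI beliefs p(rk|db) = 0.4 + 0.6*T*I over the query terms.
    cf[t] is the number of databases containing term t; all_dbs is the list being ranked."""
    num_dbs = len(all_dbs)
    avg_cw = sum(d.cw for d in all_dbs) / num_dbs
    score = 0.0
    for t in query_terms:
        df = db.df.get(t, 0)
        T = df / (df + 50 + 150 * db.cw / avg_cw)
        I = math.log((num_dbs + 0.5) / cf.get(t, 1)) / math.log(num_dbs + 1.0)
        score += 0.4 + 0.6 * T * I
    return score
```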
Test Database Characteristics

TREC Disk | Source       | Size (MB) | Size (docs) | Num DB
1         | WSJ (86-89)  |       270 |      98,732 | 29
1         | AP (89)      |       259 |      84,678 | 12
1         | ZIFF         |       245 |      75,180 | 14
1         | FR (89)      |       262 |      25,960 | 12
Disk 1    |              |           |             | 67
2         | WSJ (90-92)  |       247 |      74,520 | 22
2         | AP (88)      |       241 |      79,919 | 11
2         | ZIFF         |       178 |      56,920 | 11 (1 dup)
2         | FR (88)      |       211 |      19,860 | 10
Disk 2    |              |           |             | 54
3         | AP (90)      |       242 |      78,321 | 12
3         | SJMN (91)    |       290 |      90,257 | 12
3         | PAT          |       245 |       6,711 | 92
Disk 3    |              |           |             | 116
Totals    |              |     2,690 |     691,058 | 237 - 1 dup = 236
Effectiveness Measures
Assume each database has some merit for a given query q. Given a baseline ranking B and an estimated (test) ranking E for the evaluated system, let db_bi and db_ei denote the database in the i-th ranked position of rankings B and E, and let Bi = merit(q, db_bi) and Ei = merit(q, db_ei). We can then define recall analogs:

Rn = (sum_{i=1..n} Ei) / (sum_{i=1..n} Bi)

R̂n = (sum_{i=1..n} Ei) / (sum_{i=1..n*} Bi),   where n* = max k such that Bk > 0

and a precision analog:

Pn = |{db in Top_n(E) : merit(q, db) > 0}| / |Top_n(E)|

i.e., the fraction of the top n databases in the estimated ranking with non-zero merit.
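A small sketch of these measures, assuming merit is simply the count of relevant documents each database holds for the query; the merit lists below are invented inputs for illustration, not data from the poster.

```python
def recall_analog(E_merits, B_merits, n):
    """R-hat_n: merit accumulated by the top-n estimated ranking divided by the
    total merit available (the baseline summed out to its last non-zero entry)."""
    n_star = max((k + 1 for k, b in enumerate(B_merits) if b > 0), default=0)
    denom = sum(B_merits[:n_star])
    return sum(E_merits[:n]) / denom if denom else 0.0

def precision_analog(E_merits, n):
    """P_n: fraction of the top-n estimated databases with non-zero merit."""
    top = E_merits[:n]
    return sum(1 for m in top if m > 0) / len(top) if top else 0.0

# Example: baseline B sorted by true merit, estimate E in the system's own order.
B = [30, 12, 7, 3, 0, 0]
E = [12, 0, 30, 3, 7, 0]
print(recall_analog(E, B, 3), precision_analog(E, 3))  # 42/52 ~= 0.81, 2/3 ~= 0.67
```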
Results and Discussion
The figures for title, long, and very long queries summarize our results from the preliminary evaluation. The X axis is the number of collections in the ranking and the Y axis, R̂, is a recall analog that measures the proportion of the total possible relevant documents that have been accumulated in the top n databases, averaged across all of the queries.
The Max line is the optimal result, where the collections are ranked in order of the number of relevant documents they contain. Ideal(0) is an implementation of the GlOSS "Ideal" algorithm and CORI is an implementation of Callan's inference network approach. The Prob line is the logistic regression method described above. For title queries the described method performs slightly better than the CORI algorithm for up to about 100 collections, where CORI exceeds it. For long queries our method is virtually identical to CORI, and CORI performs better for very long queries. Both CORI and the logistic regression method outperform the Ideal(0) implementation.
[Figures: Title Queries, Long Queries, and Very Long Queries – R̂ versus number of collections for the Max, Prob, CORI, and Ideal(0) rankings.]
The ci coefficients were estimated separately for three query types; during retrieval the length of the query was used to differentiate these.

For short (title only) queries:
log O(R | Q, C) = 3.70 + (1.269 × X1) + (0.310 × X2) + (0.679 × X3) + K + (0.223 × X5) + (2.01 × X6)
where K is a constant (316.2277).

For long (title and description) queries:
log O(R | Q, C) = 7.0103 + (2.3188 × X1) + (1.8669 × X2) + (1.0695 × X3) + (0.0029 × X4) + (5.9174 × X5) + (2.3612 × X6)

For very long (full TREC query) queries:
log O(R | Q, C) = 20.9850 + (9.6801 × X1) + (1.8669 × X2) + (1.1921 × X3) + (0.0053 × X4) + (6.2501 × X5) + (7.5491 × X6)
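A minimal sketch of how these per-query-type coefficient sets might be applied at retrieval time. The length cutoffs used to classify a query as short, long, or very long are not given on the poster, so the thresholds below are placeholders; the coefficient ordering (c0, c1, ..., c6) follows the equations above.

```python
# Hypothetical coefficient table keyed by query type; values transcribed from the
# equations above. For short queries no X4 coefficient is given; the poster's
# constant K = 316.2277 is noted but omitted here, since adding a constant does
# not change the relative ranking of collections.
COEFFS = {
    "short":     (3.70,    1.269,  0.310,  0.679,  0.0,    0.223,  2.01),
    "long":      (7.0103,  2.3188, 1.8669, 1.0695, 0.0029, 5.9174, 2.3612),
    "very_long": (20.9850, 9.6801, 1.8669, 1.1921, 0.0053, 6.2501, 7.5491),
}

def query_type(query_terms, short_max=5, long_max=25):
    # Placeholder cutoffs: the poster only says that query length selects the set.
    n = len(query_terms)
    return "short" if n <= short_max else "long" if n <= long_max else "very_long"

def collection_log_odds(X, query_terms):
    """X = (X1, ..., X6) collection features; returns log O(R|Q,C)."""
    c = COEFFS[query_type(query_terms)]
    return c[0] + sum(ci * xi for ci, xi in zip(c[1:], X))
```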
Application and Further Research
The method described here is being applied to two distributed systems of servers in the UK. The first, the Distributed Archives Hub, will be made up of individual servers containing archival descriptions in the EAD (Encoded Archival Description) DTD. The second, MerseyLibraries.org, is a consortium of university and public libraries in the Merseyside area. In both cases the method described here is being used to build a central index to provide efficient distributed search over the various servers. The basic model is shown below – individual database servers will be harvested to create (potentially) a hierarchy of servers used to intelligently route queries to the databases most likely to contain relevant materials.
We are also continuing to refine both the harvesting method and the collection ranking algorithm. We believe that additional collection and collection document statistics may provide a better ranking of results and thus more effective routing of queries.
[Figure: individual database servers harvested into meta-topical servers and general servers, with replicated servers.]
Data Harvesting and Collection Document Creation
• For all servers (could be a topical subset):
  – Get Explain information to find which indexes are supported and the collection statistics.
  – For each index:
    • Use SCAN to extract terms and frequency information.
    • Add term + frequency + source index + database metadata to the metasearch XML "Collection Document".
  – Index the collection documents for retrieval by the algorithm described above (a sketch of the collection document creation step follows).
• Planned Extensions
  – Post-process indexes (especially geographic names, etc.) for special types of data, e.g. create "geographical coverage" indexes.
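A rough sketch of the collection-document creation step, assuming the term and frequency pairs have already been harvested (e.g., via Z39.50 SCAN) and the database metadata obtained (e.g., via Explain). The XML element names and the example data are hypothetical; the poster does not specify the actual collection document format.

```python
from xml.etree.ElementTree import Element, SubElement, tostring

def build_collection_document(db_metadata, index_terms):
    """Build an XML 'collection document' from harvested data.
    db_metadata: dict of database-level metadata (e.g., from Z39.50 Explain);
    index_terms: dict mapping index name -> list of (term, frequency) pairs
    (e.g., extracted with Z39.50 SCAN)."""
    root = Element("collectionDocument")
    meta = SubElement(root, "database")
    for key, value in db_metadata.items():
        SubElement(meta, key).text = str(value)
    for index_name, terms in index_terms.items():
        idx = SubElement(root, "index", name=index_name)
        for term, freq in terms:
            SubElement(idx, "term", freq=str(freq)).text = term
    return tostring(root, encoding="unicode")

# Example usage with made-up harvested data:
doc = build_collection_document(
    {"title": "Example Archive Server", "host": "z3950.example.org"},
    {"subject": [("archives", 120), ("liverpool", 45)]},
)
```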
Acknowledgements
This research was sponsored at U.C. Berkeley and the University of Liverpool
by the National Science Foundation and the Joint Information Systems Committee (UK)
under the International Digital Libraries Program award #IIS-99755164
James French and Allison Powell kindly provided the CORI and Ideal(0) results used in
the evaluation.