
DISTRIBUTED INFORMATION RETRIEVAL
2003. 07. 23
Lee Won Hee
Abstract
 A multi-database model of distributed information retrieval
 Full-text distributed information retrieval consists of four tasks
- Discovering database contents
- Ranking databases by their expected ability to satisfy the query
- Searching a small number of databases
- Merging results returned by different databases
 This paper presents algorithms for each task
2
Introduction
 Multi-database model of distributed information retrieval
- Reflects the distributed location and control of information in a wide-area
computer network
1) Resource description
- The contents of each text database must be described
2) Resource selection
- Given an information need and a set of resource descriptions, a decision
must be made about which database(s) to search
3) Results merging
- Integrating the ranked lists returned by each database into a single,
coherent ranked list
3
Multi-database Testbeds
 Marcus, 1983
 Addressed resource description and selection in the EXPERT CONIT system
 The creation of the TREC corpora
 The text collections created by the U.S. National Institute of Standards and
Technology (NIST) for its TREC conferences
 Sufficiently large and varied
 Can be divided into smaller databases
 Summary statistics for the three distributed IR testbeds (table not
reproduced in this transcript)
4
Resource Description
 Unigram language model
 Gravano et al., 1994; Gravano and Garcia-Molina, 1995; Callan et al., 1995
- Represent each database by a description consisting of the words that occur
in the database and their frequencies of occurrence
 Compact and can be obtained automatically by examining the documents in a
database or the document indexes
 Can be extended easily to include phrases, proper names, and other text
features
 A resource description based on terms and frequencies is a small fraction of
the size of the original text database (see the sketch below)
 Such descriptions can be acquired with a technique called query-based
sampling
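As a rough illustration of this representation (a sketch, not code from the paper; names and the pre-tokenized input format are assumptions), a unigram resource description is just a count of each term's occurrences across the database's documents:

```python
from collections import Counter

def build_resource_description(documents):
    """Unigram resource description: each word that occurs in the
    database, mapped to its frequency of occurrence.
    `documents` is assumed to be an iterable of token lists."""
    description = Counter()
    for tokens in documents:
        description.update(tokens)
    return description

# Toy two-document "database"
db = [["distributed", "retrieval", "retrieval"], ["database", "retrieval"]]
print(build_resource_description(db))
# Counter({'retrieval': 3, 'distributed': 1, 'database': 1})
```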
5
Resource Selection(1/4)
 Distributed Information Retrieval System
 Resource Selection
 Process of selecting databases relevant to the query
 Collections are treated analogously to documents in a database
 CORI database selection algorithm is used
6
Resource Selection(2/4)
 The CORI Algorithm (Callan et al., 1995)
$$T = \frac{df}{df + 50 + 150 \cdot cw / avg\_cw}$$
$$I = \frac{\log\left(\frac{C + 0.5}{cf}\right)}{\log(C + 1.0)}$$
$$p(r_k \mid c_i) = b + (1 - b) \cdot T \cdot I$$
- df : the number of documents in resource Ri containing rk
- cw : the number of indexing terms in resource Ri
- avg_cw : the average number of indexing terms in each resource
- C : the number of resources
- cf : the number of resources containing term rk
- b : the minimum belief component (usually 0.4)
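A minimal sketch of the CORI scoring formula above in Python (the function and argument names are illustrative, not from the paper):

```python
import math

def cori_belief(df, cf, cw, avg_cw, C, b=0.4):
    """p(rk | ci): belief that resource ci satisfies query term rk,
    computed from the term-frequency component T and the scaled
    inverse collection frequency I defined on this slide."""
    T = df / (df + 50 + 150 * cw / avg_cw)
    I = math.log((C + 0.5) / cf) / math.log(C + 1.0)
    return b + (1 - b) * T * I
```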
7
Resource Selection(3/4)
 INQUERY query operator (Turtle, 1990; Turtle and Croft,
1991)
 Can be used for ranking databases and documents
$$bel_{sum}(Q) = \frac{p_1 + p_2 + \dots + p_n}{n}$$
$$bel_{wsum}(Q) = \frac{w_1 p_1 + w_2 p_2 + \dots + w_n p_n}{w_1 + w_2 + \dots + w_n}$$
$$bel_{not}(Q) = 1 - p_1$$
$$bel_{or}(Q) = 1 - (1 - p_1) \times \dots \times (1 - p_n)$$
$$bel_{and}(Q) = p_1 \times p_2 \times \dots \times p_n$$
- pj : p(rj | Ri), the belief in query term rj given resource Ri
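The belief operators could be sketched as follows (a hypothetical rendering of the formulas above, not INQUERY's actual implementation):

```python
def bel_sum(ps):
    """#sum: average of the term beliefs."""
    return sum(ps) / len(ps)

def bel_wsum(ps, ws):
    """#wsum: weighted average of the term beliefs."""
    return sum(w * p for w, p in zip(ws, ps)) / sum(ws)

def bel_not(p):
    """#not: complement of a single belief."""
    return 1 - p

def bel_or(ps):
    """#or: probabilistic OR of the term beliefs."""
    result = 1.0
    for p in ps:
        result *= 1 - p
    return 1 - result

def bel_and(ps):
    """#and: product of the term beliefs."""
    result = 1.0
    for p in ps:
        result *= p
    return result
```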
8
Resource Selection(4/4)
 Effectiveness of a resource ranking algorithm
 Compares a given database ranking at rank n to a desired database ranking at
rank n
$$R(n) = \frac{\sum_{i=1}^{n} rg_i}{\sum_{i=1}^{n} rd_i}$$
- rgi : the number of relevant documents in the i-th ranked database under the
given ranking
- rdi : the number of relevant documents in the i-th ranked database under a
desired ranking, in which databases are ordered by the number of relevant
documents they contain
9
Merging Document Ranking(1/2)
 After a set of databases is searched
- The ranked results from each database must be merged into a single ranking
- Difficult when individual databases are not cooperative: each database may
use different corpus statistics, representations, and/or retrieval algorithms
 Result-merging techniques
 Cooperative approach
- Use of global idf or the same ranking algorithm
- Recomputing document scores at the search client
 Non-cooperative approach
- Estimate normalized document scores as a combination of the score of the
database and the score of the document
10
Merging Document Ranking(2/2)
 Estimating normalized document scores
$$D'' = D \times \left(1 + N \times \frac{R_i - Avg\_R}{Avg\_R}\right)$$
- N : the number of resources searched
- D : the unnormalized document score
- D'' : the normalized document score
- Ri : the score of database i
- Avg_R : the average database score
 A heuristic alternative weights D by the normalized database score
$$R_i' = \frac{R_i - R_{min}}{R_{max} - R_{min}}$$
$$D'' = \frac{D + 0.4 \times D \times R_i'}{1.4}$$
 A variant that also min-max normalizes the document score within its result
list (see the merge sketch below)
$$D' = \frac{D - D_{min_i}}{D_{max_i} - D_{min_i}}$$
$$D'' = \frac{D' + 0.4 \times D' \times R_i'}{1.4}$$
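A sketch of the heuristic merge using the second formula above (the data layout and names are assumptions for illustration, not from the paper):

```python
def merge_rankings(results_per_db, db_scores):
    """Merge per-database results into one ranking using
    D'' = (D + 0.4 * D * Ri') / 1.4.

    results_per_db -- {db: [(doc_id, unnormalized score D), ...]}
    db_scores      -- {db: database score Ri}
    """
    r_min, r_max = min(db_scores.values()), max(db_scores.values())
    spread = (r_max - r_min) or 1.0  # guard: only one distinct database score
    merged = []
    for db, docs in results_per_db.items():
        r_norm = (db_scores[db] - r_min) / spread  # Ri'
        for doc_id, d in docs:
            merged.append((doc_id, (d + 0.4 * d * r_norm) / 1.4))
    merged.sort(key=lambda pair: pair[1], reverse=True)
    return merged
```

The third variant on the slide would additionally min-max normalize each document score within its own result list before the weighted combination.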
11
Acquiring Resource Descriptions(1/2)
 Query-based sampling (Callan et al., 1999; Callan & Connell, 2001)
- Does not require cooperation of the databases
- The process of querying the database using random word queries
- The initial query is selected from a large dictionary of terms
- Subsequent queries use terms from documents sampled from the database
12
Acquiring Resource Descriptions(2/2)
 Query-based sampling algorithm (a Python sketch follows)
1. Select an initial query term
2. Run a one-term query on the database
3. Retrieve the top N documents returned by the database
4. Update the resource description based on the characteristics of the
retrieved documents
- Extract words and their frequencies from the top N documents returned by
the database
- Add the words and their frequencies to the learned resource description
5. If a stopping criterion has not yet been reached:
- Select a new query term
- Go to Step 2
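A minimal sketch of this loop, assuming a hypothetical `search(term, n)` callable that returns up to n documents as token lists (the interface and parameter values are illustrative):

```python
import random
from collections import Counter

def query_based_sampling(search, seed_terms, n=4, max_docs=300, max_queries=500):
    """Learn a resource description from an uncooperative database."""
    description = Counter()
    term = random.choice(seed_terms)      # Step 1: initial query term
    docs_seen = 0
    for _ in range(max_queries):          # also bounds queries that return nothing
        for tokens in search(term, n):    # Steps 2-3: run query, keep top n docs
            description.update(tokens)    # Step 4: update the description
            docs_seen += 1
        if docs_seen >= max_docs:         # Step 5: stopping criterion
            break
        # Subsequent query terms come from the documents sampled so far,
        # falling back to the seed dictionary if nothing has been retrieved.
        pool = list(description) or seed_terms
        term = random.choice(pool)
    return description
```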
13
Accuracy of Unigram Language Models(1/3)
 Test corpora for query-based sampling experiments
 Ctf ratio
- Measures how well the learned vocabulary matches the actual vocabulary
$$ctf\ ratio = \frac{\sum_{i \in V'} ctf_i}{\sum_{i \in V} ctf_i}$$
- V' : the learned vocabulary
- V : the actual vocabulary
- ctfi : the number of times term i occurs in the database
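A direct transcription of the ratio (names and data layout are illustrative):

```python
def ctf_ratio(learned_vocab, ctf):
    """Proportion of the database's term occurrences covered by the
    learned vocabulary.
    learned_vocab -- set of terms V' in the learned description
    ctf           -- {term: collection term frequency ctf_i} over the
                     actual vocabulary V
    """
    covered = sum(f for term, f in ctf.items() if term in learned_vocab)
    return covered / sum(ctf.values())

# Example: the learned vocabulary misses only the rare term "zyzzyva"
print(ctf_ratio({"retrieval", "database"},
                {"retrieval": 90, "database": 9, "zyzzyva": 1}))  # 0.99
```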
14
Accuracy of Unigram Language Models(2/3)
 Spearman Rank Correlation Coefficient
 How well the learned term frequencies indicate the frequency of each term in
the database
 The rank correlation coefficient
- 1 : two orderings are identical
- 0 : they are uncorrelated
- -1 : they are in reverse order
$$R = \frac{1 - \dfrac{6}{n^3 - n}\left(\sum_i d_i^2 + \dfrac{1}{12}\sum_k (f_k^3 - f_k) + \dfrac{1}{12}\sum_m (g_m^3 - g_m)\right)}{\sqrt{1 - \dfrac{\sum_k (f_k^3 - f_k)}{n^3 - n}} \cdot \sqrt{1 - \dfrac{\sum_m (g_m^3 - g_m)}{n^3 - n}}}$$
- di : the rank difference of common term i
- n : the number of terms
- fk : the number of ties in the k-th group of ties in the learned resource
description
- gm : the number of ties in the m-th group of ties in the actual resource
description
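A sketch of the tie-corrected coefficient as reconstructed above (argument names are illustrative; with no ties it reduces to the familiar 1 − 6Σd²/(n³−n)):

```python
import math

def spearman_with_ties(d, f_groups, g_groups, n):
    """Tie-corrected Spearman rank correlation.
    d        -- rank differences d_i for the n common terms
    f_groups -- tie-group sizes in the learned resource description
    g_groups -- tie-group sizes in the actual resource description
    """
    nn = n**3 - n
    F = sum(f**3 - f for f in f_groups)  # correction for ties in learned ranks
    G = sum(g**3 - g for g in g_groups)  # correction for ties in actual ranks
    numerator = 1 - (6 / nn) * (sum(di**2 for di in d) + F / 12 + G / 12)
    denominator = math.sqrt((1 - F / nn) * (1 - G / nn))
    return numerator / denominator
```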
15
Accuracy of Unigram Language Models(3/3)
 Experiment (results figure not reproduced in this transcript)
16
Accuracy of Resource Rankings
 Experiment (results figure not reproduced in this transcript)
17
Accuracy of Document Rankings
 Experiment (results figure not reproduced in this transcript)
18
Summary and Conclusions
 Techniques for acquiring descriptions of resources controlled
by uncooperative parties
 Using resource descriptions to rank text databases by their likelihood of
satisfying a query
 Merging the document rankings returned by different text databases
 The major remaining weakness
- The algorithm for merging document rankings produced by different databases
- Its computational cost comes from parsing and reranking the documents
 Many of the traditional IR tools, such as relevance feedback, have yet
to be applied to multi-database environments
19