DISTRIBUTED INFORMATION RETRIEVAL
2003. 07. 23
Lee Won Hee
Abstract
A multi-database model of distributed information retrieval
Full-text information retrieval consists of discovering database contents
Ranking databases by their expected ability to satisfy the query
Searching a small number of databases
Merging results returned by different databases
This paper
Presents algorithms for each task
2
Introduction
Multi-database model of distributed information retrieval
Reflects the distributed location and control of information in a wide area
computer network
1) Resource description
- The contents of each text database must be described
2) Resource selection
- Given an information need and a set of resource descriptions, a decision
must be made about which database(s) to search
3) Results merging
- Integrating the ranked lists returned by each database into a single,
coherent ranked list
3
Multi-database Testbeds
Marcus, 1983
Addressed resource description and selection in the EXPERT CONIT system
The creation of the TREC corpora
The text collections created by the U.S. National Institute of Standards and
Technology (NIST) for its TREC conferences
Sufficiently large and varied
Could be divided into smaller databases
The summary statistics for three distributed IR testbeds
4
Resource Description
Unigram language model
Gravano et al., 1994; Gravano and Garcia-Molina, 1995; Callan et al., 1995
- Represent each database by a description consisting of the words that occur
in the database, and their frequencies of occurrence
Compact and can be obtained automatically by examining the documents in a
database or the document indexes
Can be extended easily to include phrases, proper names, and other text
features
Resource description based on terms and frequencies
A small fraction of the size of the original text database
Resource descriptions of this kind lead to a technique called query-based
sampling
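A minimal sketch (Python, assuming direct access to the raw documents; names are illustrative) of how such a term-frequency resource description could be built:

```python
from collections import Counter
import re

def build_resource_description(documents):
    """Build a unigram resource description: the words that occur in a
    database and their frequencies of occurrence."""
    description = Counter()
    for text in documents:
        description.update(re.findall(r"[a-z0-9]+", text.lower()))  # crude tokenizer
    return description  # {term: frequency}; far smaller than the database itself

# toy example
docs = ["Distributed information retrieval ranks databases.",
        "Each database is described by its terms and their frequencies."]
print(build_resource_description(docs).most_common(3))
```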
5
Resource Selection(1/4)
Distributed Information Retrieval System
Resource Selection
Process of selecting databases relative to the query
Collections are treated analogously to documents in a database
The CORI database selection algorithm is used
6
Resource Selection(2/4)
The CORI Algorithm (Callan et al., 1995)
T = df / (df + 50 + 150 · cw / avg_cw)
I = log((|C| + 0.5) / cf) / log(|C| + 1.0)
p(rk | ci) = b + (1 - b) · T · I
- df : the number of documents in Ri containing rk
- cw : the number of indexing terms in resource Ri
- avg_cw : the average number of indexing terms in each resource
- |C| : the number of resources
- cf : the number of resources containing term rk
- b : the minimum belief component (usually 0.4)
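A hedged Python sketch of the CORI belief computation above (function and parameter names are mine, not from the paper):

```python
import math

def cori_belief(df, cf, cw, avg_cw, num_resources, b=0.4):
    """Belief p(rk | ci) that resource ci can satisfy query term rk (CORI).
    df: documents in the resource containing the term
    cf: number of resources containing the term
    cw: number of indexing terms in the resource
    avg_cw: average number of indexing terms per resource
    b: minimum belief component (usually 0.4)"""
    T = df / (df + 50 + 150 * cw / avg_cw)
    I = math.log((num_resources + 0.5) / cf) / math.log(num_resources + 1.0)
    return b + (1 - b) * T * I
```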
7
Resource Selection(3/4)
INQUERY query operators (Turtle, 1990; Turtle and Croft, 1991)
Can be used for ranking databases and documents
bel_sum(Q) = (p1 + p2 + ... + pn) / n
bel_wsum(Q) = (w1·p1 + w2·p2 + ... + wn·pn) / (w1 + w2 + ... + wn)
bel_not(Q) = 1 - p1
bel_or(Q) = 1 - (1 - p1) · ... · (1 - pn)
bel_and(Q) = p1 · p2 · ... · pn
- pj : p(rj | Ri)
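A minimal sketch of these belief operators in Python (illustrative only, not the INQUERY implementation):

```python
import math

def bel_sum(ps):        # average of term beliefs
    return sum(ps) / len(ps)

def bel_wsum(ps, ws):   # weighted average of term beliefs
    return sum(w * p for w, p in zip(ws, ps)) / sum(ws)

def bel_and(ps):        # product of term beliefs
    return math.prod(ps)

def bel_or(ps):         # complement of the product of complements
    return 1 - math.prod(1 - p for p in ps)

def bel_not(p):
    return 1 - p
```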
8
Resource Selection(4/4)
Effectiveness of a resource ranking algorithm
Compares a given database ranking at rank n to a desired database ranking at
rank n
R(n) = Σ_{i=1..n} rgi / Σ_{i=1..n} rdi
- rgi : number of relevant documents in the i-th ranked database under the
given ranking
- rdi : number of relevant documents in the i-th ranked database under a
desired ranking, in which databases are ordered by the number of relevant
documents they contain
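As a sketch (my own naming), R(n) can be computed from the per-database counts of relevant documents, ordered as the given ranking orders the databases:

```python
def recall_at_n(relevant_counts_in_given_order, n):
    """R(n): relevant documents reachable via the given database ranking,
    relative to a desired ranking sorted by relevant-document count."""
    desired = sorted(relevant_counts_in_given_order, reverse=True)
    return sum(relevant_counts_in_given_order[:n]) / sum(desired[:n])

# Example: the given ranking's top-3 databases hold 3, 0, and 5 relevant documents
print(recall_at_n([3, 0, 5], 3))  # 1.0: all relevant documents are covered at n=3
```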
9
Merging Document Ranking(1/2)
After a set of databases is searched
The ranked results from each database must be merged into a single ranking
Difficult when individual databases are not cooperative
- Each database may be based on different corpus statistics, representations,
and/or retrieval algorithms
Result merging techniques
Cooperative approach
- Use of global idf or the same ranking algorithm
- Recomputing document scores at the search client
Non-cooperative approach
- Estimate normalized document scores: a combination of the score of the
database and the score of the document
10
Merging Document Ranking(2/2)
Estimating normalized document scores
D'' = D · (1 + N · (Ri - Avg_R) / Avg_R)
- N : the number of resources searched
- D : the unnormalized document score
- D'' : the normalized document score
- Ri : the score of database Ri
- Avg_R : the average database score
A variant based on min-max normalized database scores:
Ri' = (Ri - Rmin) / (Rmax - Rmin)
D'' = (D + 0.4 · D · Ri') / 1.4
or, when document scores are first normalized within each database:
Ri' = (Ri - Rmin) / (Rmax - Rmin)
D' = (D - Dmin_i) / (Dmax_i - Dmin_i)
D'' = (D' + 0.4 · D' · Ri') / 1.4
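A hedged Python sketch of the min-max variant of this merging step (data layout and names are mine):

```python
def merge_results(db_results, db_scores):
    """Merge per-database document rankings into a single ranking using
    normalized scores D'' = (D' + 0.4 * D' * Ri') / 1.4.
    db_results: {db_id: [(doc_id, score), ...]}; db_scores: {db_id: Ri}."""
    r_min, r_max = min(db_scores.values()), max(db_scores.values())
    merged = []
    for db_id, results in db_results.items():
        ri = (db_scores[db_id] - r_min) / ((r_max - r_min) or 1.0)       # Ri'
        d_min = min(score for _, score in results)
        d_max = max(score for _, score in results)
        for doc_id, d in results:
            d_norm = (d - d_min) / ((d_max - d_min) or 1.0)              # D'
            merged.append((doc_id, (d_norm + 0.4 * d_norm * ri) / 1.4))  # D''
    return sorted(merged, key=lambda item: item[1], reverse=True)
```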
11
Acquiring Resource Descriptions(1/2)
Query-based sampling (Callan et al., 1999; Callan & Connell, 2001)
Does not require the cooperation of the databases
Process of querying the database using random word queries
The initial query term is selected from a large dictionary of terms
Subsequent query terms are selected from documents sampled from the database
12
Acquiring Resource Descriptions(2/2)
Query-based sampling algorithm
1. Select an initial query term
2. Run a one-term query on the database
3. Retrieve the top N documents returned by the database
4. Update the resource description based on characteristics of the retrieved documents
- Extract words and their frequencies from the top N documents returned by the
database
- Add the words and their frequencies to the learned resource description
5. If a stopping criterion has not yet been reached
- Select a new query term
- Go to Step 2
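A minimal sketch of this loop in Python, assuming a `search(term, n)` callable that returns the database's top-n (doc_id, text) pairs for a one-term query (the function, parameters, and stopping criterion here are illustrative assumptions):

```python
import random
from collections import Counter

def query_based_sample(search, seed_terms, docs_per_query=4,
                       max_docs=300, max_queries=500):
    """Learn a resource description by probing a database with one-term queries."""
    description = Counter()
    seen = set()
    term = random.choice(seed_terms)                       # Step 1: initial query term
    for _ in range(max_queries):
        if len(seen) >= max_docs:                          # Step 5: stopping criterion
            break
        for doc_id, text in search(term, docs_per_query):  # Steps 2-3: query, top N docs
            if doc_id not in seen:
                seen.add(doc_id)
                description.update(text.lower().split())   # Step 4: update description
        # next query term: drawn from sampled documents if possible, else another seed
        term = random.choice(list(description) if description else seed_terms)
    return description
```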
13
Accuracy of Unigram Language Models(1/3)
Test corpora for query-based sampling experiments
Ctf ratio
How well the learned vocabulary matches the actual vocabulary
ctf ratio = Σ_{i ∈ V'} ctfi / Σ_{i ∈ V} ctfi
- V' : the learned vocabulary
- V : the actual vocabulary
- ctfi : the number of times term i occurs in the database
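A sketch of the ctf ratio computation (assuming the learned and actual descriptions are term-to-frequency dictionaries; names are mine):

```python
def ctf_ratio(learned, actual):
    """Fraction of database term occurrences covered by the learned vocabulary.
    learned, actual: dicts mapping term -> collection term frequency (ctf),
    where the frequencies are taken from the actual database."""
    covered = sum(actual[t] for t in learned if t in actual)
    return covered / sum(actual.values())

print(ctf_ratio({"apple": 2, "pear": 1},
                {"apple": 10, "pear": 5, "plum": 5}))  # 0.75
```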
14
Accuracy of Unigram Language Models(2/3)
Spearman Rank Correlation Coefficient
How well the learned term frequencies indicate the frequency of each term
in the database
The rank correlation coefficient
- 1 : two orderings are identical
- 0 : they are uncorrelated
- -1 : they are in reverse order
R = [1 - 6/(n³ - n) · (Σ di² + Σ (fk³ - fk)/12 + Σ (gm³ - gm)/12)]
    / sqrt[ (1 - Σ (fk³ - fk)/(n³ - n)) · (1 - Σ (gm³ - gm)/(n³ - n)) ]
- di : the rank difference of common term i
- n : the number of terms
- fk : the number of ties in the k-th group of ties in the learned
resource description
- gm : the number of ties in the m-th group of ties in the actual
resource description
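A sketch of this tie-corrected computation in Python (using scipy's rank helper for the tied ranks; the two frequency lists are assumed to be aligned over the same set of terms):

```python
from collections import Counter
from scipy.stats import rankdata   # average ranks; ties share the mean rank

def spearman_with_ties(learned_freqs, actual_freqs):
    """Tie-corrected Spearman rank correlation between learned and actual
    term frequencies, following the formula above."""
    n = len(learned_freqs)
    d2 = sum((a - b) ** 2
             for a, b in zip(rankdata(learned_freqs), rankdata(actual_freqs)))
    tf = sum(t**3 - t for t in Counter(learned_freqs).values())  # tie groups, learned
    tg = sum(t**3 - t for t in Counter(actual_freqs).values())   # tie groups, actual
    num = 1 - 6 / (n**3 - n) * (d2 + tf / 12 + tg / 12)
    den = ((1 - tf / (n**3 - n)) * (1 - tg / (n**3 - n))) ** 0.5
    return num / den
```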
15
Accuracy of Unigram Language Models(3/3)
Experiment
16
Accuracy of Resource Rankings
Experiment
17
Accuracy of Document Rankings
Experiment
18
Summary and Conclusions
Techniques for acquiring descriptions of resources controlled
by uncooperative parties
Using resource descriptions to rank text databases by their likelihood of
satisfying a query
Merging the document rankings returned by different text databases
The major remaining weakness
The algorithm for merging document rankings produced by different
databases
Computational cost of parsing and reranking the documents
Many of the traditional IR tools, such as relevance feedback, have yet
to be applied to multi-database environments
19