13 - School of Computing

Download Report

Transcript 13 - School of Computing

COMP3740 CR32:
Knowledge Management
and Adaptive Systems
Inverted Files and Signature Files for IR
By Eric Atwell, School of Computing,
University of Leeds
(including re-use of teaching resources from other sources,
esp. Stuart Roberts, School of Computing, Univ of Leeds)
Module Objectives include …
“On completion of this module, students should be
able to:
… describe classical and emerging information
retrieval techniques, and their relevance to
knowledge management; …”
Today’s objectives
By the end of this lecture you should understand:
• why relational databases techniques eg BTree
indexing are no use for ‘IR’ queries;
• how ‘inverted file’ structures work to provide
efficient query processing.
• An alternative approach provided by signature files
The relational problem
• the simple approach to searching for a
keyword uses leading (and trailing)
wildcards: eg ‘%graphics%’
• there is no way other than ‘brute force scan’
to match such a condition with the data
records held in a traditional relational
database.
The relational problem
• Rather than hold full text, why not do content
analysis, extract the index terms (keywords), and
hold these in a relational database?
Indexed by
Module
module(mod_code, title, semester, …)
term(term_id, value)
index(mod_code, term_id)
Index term
Sample SQL query: OR
find all modules matching:
‘database’ or ‘AI’ or ‘knowledge base’
select distinct m.* from
module m inner join index i on m.code = i.mod_code inner
join term t on t.term_id = i.term_id
where
t.value = ‘database’ OR t.value = ‘AI’ OR t.value =
‘knowledge base’;
Another sample query: AND (?)
find all modules matching:
‘database’ and ‘AI’ and ‘knowledge base’
select distinct m.* from
module m inner join index i on m.code = i.mod_code inner
join term t on t.term_id = i.term_id
where
t.value = ‘database’ AND t.value = ‘AI’ AND t.value =
‘knowledge base’;
This SQL query will not match any record; t.value cannot be
simultaneously equal to ‘database’, ‘AI’ and ‘knowledge base’.
We cannot simply replace the ‘OR’s of the last SQL query with
‘AND’s.
Corrected sample query: AND
find all modules matching:
‘database’ and ‘AI’
select distinct m.* from
module m inner join index i1 on m.code = i1.mod_code inner
join term t1 on t1.term_id = i1.term_id
inner join index i2 on m.code = i2.mod_code inner join term
t2 on t2.term_id = i2.term_id
where
t1.value = ‘database’ and t2.value = ‘AI’;
Both tables ‘index’ and ‘term’ must be searched twice in order to
establish whether, for each module, it is attached to both terms
‘database’ and ‘AI’.
If the query is a conjunction of N terms, the SQL would have 2N
inner joins. AND is more complicated than OR (but common in IR)
Inverted file
• Non-DB structure, so not suitable for standard SQL
• each index term entry ‘points’ to a list of document
record identifiers (RIDs)
• standard indexing method for IR systems
• widely used for search engines
• can be extended to allow for positional (context)
searches
Inverted file structure
The idea of an inverted file is, as well as storing a
document with its list of terms that are used to index
it, we store the list of terms used in the whole
collection of documents, and for each term point to
the list of documents that are indexed by the term.
So we have ‘inverted’ the structure:
D1: T11, T12, …, T1k
D2: T21, T22, …, T2l
…
to give:
T1: D11, D12, …, D1m
T2: D21, D22, …, D2n
…
Inverted file structure
dictionary
Inverted or postings file
Data file
1
Term 1 (2)
1
Term 2 (3)
3
Term 3 (1)
6
Term 4 (3)
7
Term 5 (4)
.
.
9
.
.
2
1
2
3
2
2
3
4
.
.
Doc 1
Doc2
Doc3
Doc4
Doc5
Doc6
.
.
Inverted file structure
dictionary
Inverted or postings file
Data file
1
Term 1 (2)
1
Term 2 (3)
3
Term 3 (1)
6
Term 4 (3)
7
Term 5 (4)
.
.
9
.
.
2
1
2
3
2
2
3
4
.
.
Doc 1
Doc2
Doc3
Doc4
Doc5
Doc6
.
.
Inverted file structure
dictionary
Inverted or postings file
Data file
1
Term 1 (2)
1
Term 2 (3)
3
Term 3 (1)
6
Term 4 (3)
7
Term 5 (4)
.
.
9
.
.
2
1
2
3
2
2
3
4
.
.
Doc 1
Doc2
Doc3
Doc4
Doc5
Doc6
.
.
Inverted file structure
dictionary
Inverted or postings file
Data file
1
Term 1 (2)
1
Term 2 (3)
3
Term 3 (1)
6
Term 4 (3)
7
Term 5 (4)
.
.
9
.
.
2
1
2
3
2
2
3
4
.
.
Doc 1
Doc2
Doc3
Doc4
Doc5
Doc6
.
.
Dictionary (in IR)
• list of terms including ‘normalised’ keywords or
stems plus object descriptors (eg author name)
• frequency with which that term occurs in the
collection
• pointer to the inverted file
• access to dictionary is by standard file access method
(eg binary search or Btree or hashing algorithm)
Inverted file
• for each entry in the dictionary:
– a list of pointers into the data file (or object-ids, or URLs..)
– identifying those objects indexed by the dictionary term
• inverted file may also contain:
– positional information within each document – where?
– term frequency (or weight) within each document – how
many?
Use of inverted file
• Boolean query: (A or B) and C
– disjunctive normal form: (A and C) or (B and C) or (A and B and C)
(1, 0, 1) OR (0, 1, 1) OR (1, 1, 1)
– retrieve lists of document ids from inverted file corresponding to A,
B and C
doc1
doc3
doc4
doc7
doc8
doc10
doc2
doc3
doc5
doc6
doc8
doc12
doc1
doc2
doc4
doc9
doc11
doc12
Use of inverted file
• Boolean query: (A or B) and C
– disjunctive normal form:
(1, 0, 1) OR (0, 1, 1) OR (1, 1, 1)
– retrieve lists of document ids from inverted file corresponding to A,
B and C
doc1
doc3
doc4
doc7
doc8
doc10
doc2
doc3
doc5
doc6
doc8
doc12
doc1
doc2
doc4
doc9
doc11
doc12
doc1: (1, 0, 1)
Use of inverted file
• Boolean query: (A or B) and C
– disjunctive normal form:
(1, 0, 1) OR (0, 1, 1) OR (1, 1, 1)
– retrieve lists of document ids from inverted file corresponding to A,
B and C
doc3
doc4
doc7
doc8
doc10
doc2
doc3
doc5
doc6
doc8
doc12
doc2
doc4
doc9
doc11
doc12
doc1: (1, 0, 1)
doc2: (0, 1, 1)
Use of inverted file
• Boolean query: (A or B) and C
– disjunctive normal form:
(1, 0, 1) OR (0, 1, 1) OR (1, 1, 1)
– retrieve lists of document ids from inverted file corresponding to A,
B and C
doc3
doc4
doc7
doc8
doc10
doc3
doc5
doc6
doc8
doc12
doc4
doc9
doc11
doc12
doc1: (1, 0, 1)
doc2: (0, 1, 1)
doc3: (1, 1, 0)
Use of inverted file
• Boolean query: (A or B) and C
– disjunctive normal form:
(1, 0, 1) OR (0, 1, 1) OR (1, 1, 1)
– retrieve lists of document ids from inverted file corresponding to A,
B and C
doc4
doc7
doc8
doc10
doc5
doc6
doc8
doc12
doc4
doc9
doc11
doc12
doc1:
doc2:
doc3:
doc4:
(1, 0, 1)
(0, 1, 1)
(1, 1, 0)
(1, 0, 1)
Use of inverted file
• Boolean query: (A or B) and C
– disjunctive normal form:
(1, 0, 1) OR (0, 1, 1) OR (1, 1, 1)
– retrieve lists of document ids from inverted file corresponding to A,
B and C
doc7
doc8
doc10
doc5
doc6
doc8
doc12
doc9
doc11
doc12
doc1:
doc2:
doc3:
doc4:
doc5:
(1, 0, 1)
(0, 1, 1)
(1, 1, 0)
(1, 0, 1)
(0, 1, 0)
Use of inverted file
• Boolean query: (A or B) and C
– disjunctive normal form:
(1, 0, 1) OR (0, 1, 1) OR (1, 1, 1)
– retrieve lists of document ids from inverted file corresponding to A,
B and C
doc7
doc8
doc10
doc6
doc8
doc12
doc9
doc11
doc12
doc1:
doc2:
doc3:
doc4:
doc5:
doc6:
(1, 0, 1)
(0, 1, 1)
(1, 1, 0)
(1, 0, 1)
(0, 1, 0)
(0, 1, 0)
Use of inverted file
• Boolean query: (A or B) and C
– disjunctive normal form:
(1, 0, 1) OR (0, 1, 1) OR (1, 1, 1)
– retrieve lists of document ids from inverted file corresponding to A,
B and C
doc7
doc8
doc10
doc8
doc12
doc9
doc11
doc12
doc1:
doc2:
doc3:
doc4:
doc5:
doc6:
doc7:
(1, 0, 1)
(0, 1, 1)
(1, 1, 0)
(1, 0, 1)
(0, 1, 0)
(0, 1, 0)
(1, 0, 0)
Use of inverted file
• Boolean query: (A or B) and C
– disjunctive normal form:
(1, 0, 1) OR (0, 1, 1) OR (1, 1, 1)
– retrieve lists of document ids from inverted file corresponding to A,
B and C
doc8
doc10
doc8
doc12
doc9
doc11
doc12
doc1:
doc2:
doc3:
doc4:
doc5:
doc6:
doc7:
doc8:
(1, 0, 1)
(0, 1, 1)
(1, 1, 0)
(1, 0, 1)
(0, 1, 0)
(0, 1, 0)
(1, 0, 0)
(1, 1, 0)
Use of inverted file
• Boolean query: (A or B) and C
– disjunctive normal form:
(1, 0, 1) OR (0, 1, 1) OR (1, 1, 1)
– retrieve lists of document ids from inverted file corresponding to A,
B and C
doc10
doc12
doc9
doc11
doc12
doc1:
doc2:
doc3:
doc4:
doc5:
doc6:
doc7:
doc8:
doc9:
(1, 0, 1)
(0, 1, 1)
(1, 1, 0)
(1, 0, 1)
(0, 1, 0)
(0, 1, 0)
(1, 0, 0)
(1, 1, 0)
(0, 0, 1)
Use of inverted file
• Boolean query: (A or B) and C
– disjunctive normal form:
(1, 0, 1) OR (0, 1, 1) OR (1, 1, 1)
– retrieve lists of document ids from inverted file corresponding to A,
doc1: (1, 0, 1)
B and C
doc10
doc12
doc11
doc12
doc2: (0, 1, 1)
doc3: (1, 1, 0)
doc4: (1, 0, 1)
doc5: (0, 1, 0)
doc6: (0, 1, 0)
doc7: (1, 0, 0)
doc8: (1, 1, 0)
doc9: (0, 0, 1)
doc10: (1, 0, 0)
Use of inverted file
• Boolean query: (A or B) and C
– disjunctive normal form:
(1, 0, 1) OR (0, 1, 1) OR (1, 1, 1)
– retrieve lists of document ids from inverted file
corresponding
to A,
doc1:
(1, 0, 1)
doc2: (0, 1, 1)
B and C
doc12
doc11
doc12
doc3: (1, 1, 0)
doc4: (1, 0, 1)
doc5: (0, 1, 0)
doc6: (0, 1, 0)
doc7: (1, 0, 0)
doc8: (1, 1, 0)
doc9: (0, 0, 1)
doc10: (1, 0, 0)
doc11: (0, 0, 1)
Use of inverted file
• Boolean query: (A or B) and C
– disjunctive normal form:
(1, 0, 1) OR (0, 1, 1) OR (1, 1, 1)
– retrieve lists of document ids from inverted file corresponding to A,
B and C
doc1: (1, 0, 1)
doc12
doc12
doc2: (0, 1, 1)
doc3: (1, 1, 0)
doc4: (1, 0, 1)
doc5: (0, 1, 0)
doc6: (0, 1, 0)
doc7: (1, 0, 0)
doc8: (1, 1, 0)
doc9: (0, 0, 1)
doc10: (1, 0, 0)
doc11: (0, 0, 1)
doc12: (0, 1, 1)
Use of inverted file
• Boolean query: (A or B) and C
– disjunctive normal form:
(1, 0, 1) OR (0, 1, 1) OR (1, 1, 1)
doc1: (1, 0, 1)
doc2: (0, 1, 1)
doc3: (1, 1, 0)
doc4: (1, 0, 1)
doc5: (0, 1, 0)
doc6: (0, 1, 0)
doc7: (1, 0, 0)
doc8: (1, 1, 0)
doc9: (0, 0, 1)
doc10: (1, 0, 0)
doc11: (0, 0, 1)
doc12: (0, 1, 1)
Use of inverted file
• Boolean query: (A or B) and C
– disjunctive normal form:
(1, 0, 1) OR (0, 1, 1) OR (1, 1, 1)
doc1: (1, 0, 1)
doc2: (0, 1, 1)
doc4: (1, 0, 1)
doc4: (1,
doc12:
(0,0,1,1)1)
report number of hits to user (4)
(Note: can be done before any ‘hits’ are retrieved
retrieve all objects using ‘pointers’:
doc1, doc2, doc4 and doc12
doc12: (0, 1, 1)
Use of inverted file with weighted terms
weighted query:
A0.5, B0.7, C1.0
form weighted vector:
(0.5, 0.7, 1.0)
retrieve lists of document ids from inverted file corresponding to
A, B and C with weights
doc1 (.2)
doc3 (.6)
doc4 (.7)
doc7 (.3)
doc8 (.5)
doc10 (.5)
doc2 (.6)
doc3 (.8)
doc5 (.9)
doc6 (.3)
doc8 (.5)
doc12 (.2)
doc1 (.4)
doc2 (.4)
doc4 (.7)
doc9 (.6)
doc11 (.3)
doc12 (.6)
Assumes sim(Query,Document) function comparing 2 vectors
Use of inverted file
weighted query:
A0.5, B0.7, C1.0
form weighted vector:
(0.5, 0.7, 1.0)
retrieve lists of document ids from inverted file corresponding to
A, B and C with weights
doc1 (.2)
doc3 (.6)
doc4 (.7)
doc7 (.3)
doc8 (.5)
doc10 (.5)
doc2 (.6)
doc3 (.8)
doc5 (.9)
doc6 (.3)
doc8 (.5)
doc12 (.2)
doc1 (.4)
doc2 (.4)
doc4 (.7)
doc9 (.6)
doc11 (.3)
doc12 (.6)
sim((0.5, 0.7, 1.0), (0.2, 0.0, 0.4)) = 0.85
doc1: 0.85
Use of inverted file
weighted query:
A0.5, B0.7, C1.0
form weighted vector:
(0.5, 0.7, 1.0)
retrieve lists of document ids from inverted file corresponding to
A, B and C with weights
doc3 (.6)
doc4 (.7)
doc7 (.3)
doc8 (.5)
doc10 (.5)
doc2 (.6)
doc3 (.8)
doc5 (.9)
doc6 (.3)
doc8 (.5)
doc12 (.2)
doc2 (.4)
doc4 (.7)
doc9 (.6)
doc11 (.3)
doc12 (.6)
sim((0.5, 0.7, 1.0), (0.0, 0.6, 0.4)) = 0.86
doc1: 0.85
doc2: 0.86
Use of inverted file
•
•
•
•
•
sort (rank) list according to similarity coefficient.
retrieve first ‘N’ ranked objects.
present ranked list to user.
offer to retrieve next ‘N’.
Note that so far we have not retrieved any
documents; this is particularly important if the ids are
URLS - we don’t need to start downloading web
pages in order to rank them.
Use of inverted file
• proximity queries eg Q1: “A B” Q2: “A(3)B” (A…B)
–
–
–
–
postings file holds positional information
proceed as for ‘A and B’
keep positional information in (AB) list
filter (AB) list:
for Q1 pos(A) = pos(B) -1
for Q2 |pos(B) - pos(A)| < 3
• now we can distinguish ‘Venetian blind’ from ‘blind
Venetian’
• in principle this should help precision without affecting recall
too much
Pros and cons of inverted file
• can be used for Boolean, weighted and positional
queries
• query processing can be completed without accessing
data file
• number of hits for single term is available from
dictionary
• expensive to update if information objects change
content.
• demanding storage requirements (dictionary+inverted
file approx same size as original data)
An alternative: Text signatures
• use hash algorithm to map a keyword onto one or
more bits in a bit string: like Hashing (DB21)
• Simplest example: use one bit:
– ‘Bath’ = [66, 97, 116, 104]
mod32 (66+97+116+104) = mod32(383) = 31
so represent ‘Bath’ by setting bit 31 in 0-31 bits:
000000000000000000000000000000000001
Text signatures
• Or use several bits:
– ‘Bath’ = [66, 97, 116, 104]
‘ Ba’  mod32(66+97) = 3,
‘Bat’  mod32(66+97+116) = 23
‘ath’  mod32(97+116+104) = 29
‘th ’  mod32(116+104) = 28
represent ‘Bath’ by:
001000000000000000000000000100001100
• This may allow wildcards, eg Bat?
Document signatures
• superimpose keyword signatures
Bath
0000000000000000000000000000001
tub
0000000000100000000000000000000
0000000000100000000000000000001
• if each document has 6 keywords, there would be
comb(32, 6) = 906192 different document signatures.
• Document signatures can be mapped onto numbers
between 1 and 906192
Using signature file
• Boolean query: (A or B) and C
– superimpose signatures of A and C
– superimpose signatures of B and C
– for each signature, S, in the file:
if either all bits of A&C are set in S or all bits in B&C are
set, retrieve the document with signature S.
– check document to see if it is a ‘hit’
– bit comparisons are very fast compared to string
comparisons.
Pros and cons of signature files
• Needs less space than inverted
• easier to update as documents change
• fast for queries with many keywords
• probabilistic - will return false hits
• cannot filter on positional information
• cannot hold keyword weights (or other weights)
these last three points imply that further
processing is required to filter retrieved documents.
Summary of key points
• standard relational databases do not provide suitable
indexing for handling index terms.
• standard SQL is not good at expressing ‘search-engine’ type
queries
• inverted file structures are purpose made for these types of
system
• storing frequencies/weights in the dictionary and inverted
file allows for vector model queries
• storing positional information allows proximity queries,
“Knowledge Management” v “MK”
• Signature files give faster matches but with limitations
Questions to think about
• Explain why the relational model is not good for IR.
• How is it that, using an inverted file, the number of
hits can be reported without retrieving anything from
the data file?
• Could this be achieved using signature files?
• What are proximity queries, and how can inverted
file technology be used to deal with them?
• How can signature files be used for proximity
queries?