File Structures
Download
Report
Transcript File Structures
File Structures
Information Retrieval: Data Structures and Algorithms
by W.B. Frakes and R. Baeza-Yates (Eds.)
Englewood Cliffs, NJ: Prentice Hall, 1992.
(Chapters 3-5)
1
File Structures for IR
lexicographical indices
» indices that are sorted
» e.g. inverted files
» e.g. Patricia (PAT) trees
cluster file structures
indices based on hashing
» signature files
2
Inverted Files
Information Retrieval: Data Structures and Algorithms
by W.B. Frakes and R. Baeza-Yates (Eds.)
Englewood Cliffs, NJ: Prentice Hall, 1992.
(Chapters 3)
3
Inverted Files
Each document is assigned a list of keywords or attributes.
Each keyword (attribute) is associated with operational
relevance weights.
An inverted file is the sorted list of keywords (attributes), with
each keyword having links to the documents containing that
keyword.
Penalty
» the size of inverted files ranges from 10% to 100%
of more of the size of the text itself
» need to update the index as the data set changes
4
Indexing Restrications
A controlled vocabulary which is the collection of
keywords that will be indexed. Words in the text that
are not in the vocabulary will not be indexed
A list of stopwords that for reasons of volume will not
be included in the index
A set of rules that decide the beginning of a word or a
piece of text that is indexable
A list of character sequences to be indexed (or not
indexed)
5
Sorted array implementation of an inverted file
Structures used in Inverted Files
Sorted Arrays
»
»
»
»
store the list of keywords in a sorted array
using a standard binary search
advantage: easy to implement
disadvantage: updating the index is expensive
Hashing Structures
Tries (digital search trees)
Combinations of these structures
7
Sorted Arrays
1. The input text is parsed into a list of words along with their
location in the text. (time and storage consuming operation)
2. This list is inverted from a list of terms in location order to
a list of terms in alphabetical order.
3. Add term weights, or reorganize or compress the files.
8
Inversion of Word List
Dictionary and postings file
Idea: the file to be searched should be as short as possible
split a single file into two pieces
e.g. data set: 38,304 records, 250,000 unique terms
(document #, frequency)
10
Producing an Inverted File for Large Data Sets without Sorting
Idea: avoid the use of an explicit
sort by using a right-threaded
binary tree
current number of term postings &
the storage location of postings list
traverse the binary tree and the
linked postings list
A Fast Inversion Algorithm
Principle 1
the large primary memories are available
If databases can be split into memory loads that can be rapidly
processed and then combined, the overall cost will be
minimized.
Principle 2
the inherent order of the input data
It is very expensive to use polynomial or even nlogn sorting
algorithms for large files
12
FAST-INV algorithm
concept postings/
pointers
See p. 13.
Sample document vector
document number
concept number (one concept number
for each unique word)
Similar to the documentword list shown in p. 7.
The concept numbers are
sorted within document
numbers, and document
numbers are sorted within
collection
Preparation
Terminology
» HCN= highest concept number in dictionary, or the number
of words to be indexed
» L= number of document/concept pairs in the collection
» M= available primary memory size
Assumption
» M>>HCN
» M<L
15
: the range of concepts
for each primary load
讀入(Doc,Con)
依Con去查Load
表,確定這個
配對該落在那
個Load
依序將每個Load
File反轉。CONPTR
表中的Offset顯示每
筆資料該填入的位
置。
Preparation
1. Allocate an array, con_entries_cnt, of size HCN.
2. For each <doc#, con#> entry in the document vector file:
increment con_entries_cnt[con#]
……………………0
(1,1), (1,4)……….. 2
(2,3) …………….. 3
(3,1), (3,2), (3,5) ... 6
(4,2), (4,3) ………. 8
…
(con#, doc#)
Preparation (continued)
5. For each <con#,count> pair obtained from con_entries_cnt:
if there is no room for documents with this concept to fit in
the current load, then created an entry in the load table and
initialize the next load entry; otherwise update information
for the current load table entry.
Building Load Table
Terminology
»
»
»
»
LL= length of current load
S= spread of concept numbers in the current load
8 bytes = space needed for each concept/weight pair
4 bytes = space needed for each concept to store count of
postings for it
Constraints
» 8*LL+4*S<M
19
: the range of concepts
for each primary load
讀入(Doc,Con)
依Con去查Load
表,確定這個
配對該落在那
個Load
依序將每個Load
File反轉。CONPTR
表中的Offset顯示每
筆資料該填入的位
置。