Text Processing, Indexes


IN350: Text Properties, Zipf’s Law, and Heaps’ Law
Judith A. Molka-Danielsen
September 12, 2002

Notes are based on Chapter 6 of the Article Collection.
More notes to follow in class.
Review: Michael Spring’s comments

The Document Processing Revolution by MBS:
- How do you define a document?
- Revolutions: reprographics, communications.
- Transition: create, compose, render; WWW/XML.
- New processing model for e-docs, forms.

XML – a first look:
- Namespace rules allow for a modular document and reuse.
- DTDs are historical; Schema is the future.
- XPath and pointers: used for accessing the document as a tree of nodes (see the sketch below).
- XSL style – rendering; XSLT uses XSL and FO.
- XSLT transformations of docs to multiple forms.

E-business: B2B applications will use concepts, specifications, and tools based on the XML family.
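As a minimal illustration of treating a document as a tree of nodes, the sketch below uses Python's standard xml.etree.ElementTree module, which supports a limited subset of XPath; the document and element names are invented for the example.

```python
import xml.etree.ElementTree as ET

# A tiny invented XML document for illustration.
doc = """
<catalog>
  <book lang="en"><title>Modern Information Retrieval</title></book>
  <book lang="no"><title>Tekstbehandling</title></book>
</catalog>
"""

root = ET.fromstring(doc)

# ElementTree's limited XPath: select every <title> element
# that is a child of a <book> anywhere in the tree.
for title in root.findall(".//book/title"):
    print(title.text)
```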
Modeling Natural Languages and Compression

Information theory - the amount of information in a text is quantified by its entropy, i.e. its information uncertainty. If one symbol appears all the time, it does not convey much information. Higher-entropy text cannot be compressed as much as lower-entropy text.
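As a quick illustration (not from the readings), a character-level entropy estimate can be computed directly from symbol frequencies; the sample strings are invented:

```python
import math
from collections import Counter

def entropy(text: str) -> float:
    """Shannon entropy in bits per symbol, estimated from frequencies."""
    counts = Counter(text)
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

print(entropy("aaaaaaaa"))  # 0.0 bits: one symbol, no uncertainty
print(entropy("abcdefgh"))  # 3.0 bits: eight equally likely symbols
```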

Symbols - There are 26 in the English alphabet and 29 in
Norwegian.

Frequency of occurrence - the frequency of the symbols in text differs between languages. In English, ’e’ has the highest frequency of occurrence. Variable-length coding schemes such as Huffman coding can be used to represent the symbols based on frequency of occurrence, giving frequent symbols shorter codes.

Compression for transferring data can be based on the frequency of symbols. More on this in another lecture.
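A minimal sketch of Huffman coding (my own illustration, not code from the lecture): build a prefix-code table from symbol frequencies with a heap, so that frequent symbols end up with short codes.

```python
import heapq
from collections import Counter

def huffman_codes(text: str) -> dict[str, str]:
    """Build a Huffman code table: frequent symbols get shorter codes."""
    # Heap entries: (frequency, tie-breaker, {symbol: code-so-far}).
    heap = [(freq, i, {sym: ""})
            for i, (sym, freq) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)
        f2, _, t2 = heapq.heappop(heap)
        # Prepend a bit to every code in each merged subtree.
        merged = {s: "0" + c for s, c in t1.items()}
        merged.update({s: "1" + c for s, c in t2.items()})
        heapq.heappush(heap, (f1 + f2, counter, merged))
        counter += 1
    return heap[0][2]

codes = huffman_codes("this is an example of a huffman tree")
for sym, code in sorted(codes.items(), key=lambda kv: len(kv[1])):
    print(repr(sym), code)
```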
Modeling Natural Languages and Compression

Creation of indices is often based on the frequency of occurrence of words within a text.
Zipf’s Law - named after the Harvard linguistics professor George Kingsley Zipf (1902-1950), this is the observation that the frequency of occurrence of some event (P), as a function of its rank (i) when the rank is determined by that frequency of occurrence, is a power-law function P_i ~ 1/i^a with the exponent a close to unity.

Zipf’s distribution - the frequencies of words in a document approximately follow this distribution. In the English language, the probability of encountering the i-th most common word is given roughly by Zipf’s law for i up to 1000 or so. The law breaks down for less frequent words, since the harmonic series diverges.

Stopwords - a few hundred words take up 50% of the text. These can be disregarded to reduce the space of indices.
Zipf’s and Heaps’ distributions

[Figure: (a) distribution of sorted word frequencies (Zipf’s law); (b) distribution of the size of the vocabulary.]

Sample Word Frequency Data
(from B. Croft, UMass)

[Table not reproduced.]
Zipf’s Law

- Rank (r): the numerical position of a word in a list sorted by decreasing frequency (f).
- Zipf (1949) “discovered” that: f · r = k (for constant k), i.e. f = k / r.
- If the probability of the word of rank r is p_r and N is the total number of word occurrences, then p_r = f / N = A / r, for a corpus-independent constant A ≈ 0.1.
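As a rough empirical check (my own sketch, not from the slides), one can count word frequencies in any sizable text and compare the observed p_r with A/r for A = 0.1; the file name below is a placeholder:

```python
from collections import Counter

# Placeholder file name: substitute any large plain-text corpus.
words = open("corpus.txt", encoding="utf-8").read().lower().split()
N = len(words)
freqs = [f for _, f in Counter(words).most_common()]

# Compare the observed probability at rank r with Zipf's A/r, A = 0.1.
for r in (1, 10, 100, 1000):
    if r <= len(freqs):
        print(f"r={r:5d}  observed p_r={freqs[r-1]/N:.5f}  A/r={0.1/r:.5f}")
```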
Occurrence Frequency Data
(from B. Croft, UMass)

[Table not reproduced.]
Does Real Data Fit Zipf’s Law?

- A law of the form y = k x^c is called a power law.
- Zipf’s law is a power law with c = -1.
- On a log-log plot, power laws give a straight line with slope c: log(y) = log(k x^c) = log(k) + c log(x).
- Zipf is quite accurate except for very high and low rank.
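To see the straight-line behaviour numerically, a least-squares fit of the slope on log-log data can be done with numpy; this sketch uses synthetic power-law data standing in for real rank/frequency pairs:

```python
import numpy as np

# Synthetic power-law data y = k * x^c with k = 100000, c = -1.
x = np.arange(1, 1001)
y = 100000 / x

# On a log-log scale a power law is a line; polyfit recovers the slope.
slope, intercept = np.polyfit(np.log(x), np.log(y), 1)
print(f"slope c ~= {slope:.3f}, k ~= {np.exp(intercept):.0f}")
```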
Fit to Zipf for Brown Corpus

[Figure: Zipf fit to the Brown Corpus, k = 100,000.]
Mandelbrot distribution

- Mandelbrot distribution: experimental data suggest that a better model is f_i = k / (c + i), where c is an additional parameter and k is such that all frequencies add up to n.
- Since the distribution of words is very skewed (that is, there are a few hundred words which take up 50% of the text), words that are too frequent, such as stopwords, can be disregarded.
- A stopword is a word which does not carry meaning in natural language and therefore can be ignored (made not searchable), such as 'a', 'the', and 'by'.
- The most frequent words are stopwords. Disregarding them allows a significant reduction in the space overhead of indices for natural language texts (a filtering sketch follows below).
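A minimal stopword-filtering sketch (the tiny stopword list and sample sentence are invented for illustration; real systems use a few hundred stopwords):

```python
# A tiny invented stopword list; real systems use a few hundred words.
STOPWORDS = {"a", "an", "the", "by", "of", "in", "is", "to", "and"}

def index_terms(text: str) -> list[str]:
    """Keep only the terms worth indexing: lowercased, minus stopwords."""
    return [w for w in text.lower().split() if w not in STOPWORDS]

text = "The frequency of a word in the text is the key to the index"
print(index_terms(text))  # stopwords removed before indexing
```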
Mandelbrot (1954) Correction

The following more general form gives a bit better fit:

f = P (r + ρ)^(-B), for constants P, B, ρ.
Mandelbrot Fit

[Figure: Mandelbrot fit to the Brown Corpus, with P = 10^5.4, B = 1.15, ρ = 100.]
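Plugging the quoted fit constants into the correction gives predicted frequencies by rank (a sketch; the constants are the ones shown above):

```python
# Mandelbrot correction f = P * (r + rho) ** (-B) with the quoted fit.
P, B, rho = 10 ** 5.4, 1.15, 100

for r in (1, 10, 100, 1000, 10000):
    f = P * (r + rho) ** (-B)
    print(f"rank {r:6d}: predicted frequency {f:10.1f}")
```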
Explanations for Zipf’s Law

- Zipf’s explanation was his “principle of least effort”: a balance between the speaker’s desire for a small vocabulary and the hearer’s desire for a large one.
- Debate (1955-61) between Mandelbrot and H. Simon over the explanation.
- Li (1992) shows that just random typing of letters, including a space, will generate “words” with a Zipfian distribution (simulated below).
  http://linkage.rockefeller.edu/wli/zipf/
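A quick simulation of Li's random-typing observation (my own sketch, not Li's code): type random characters from a small alphabet plus space, split the result into "words", and inspect the rank-frequency curve.

```python
import random
from collections import Counter

random.seed(42)

# "Monkey typing": four letters plus a space, one million keystrokes.
alphabet = "abcd "
text = "".join(random.choice(alphabet) for _ in range(1_000_000))
freqs = [f for _, f in Counter(text.split()).most_common()]

# Frequencies fall off roughly as a power of the rank,
# i.e. the random "words" are approximately Zipfian.
for r in (1, 2, 4, 8, 16, 32):
    print(f"rank {r:3d}: frequency {freqs[r-1]}")
```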
Zipf’s Law Impact on IR

- Good news: stopwords will account for a large fraction of text, so eliminating them greatly reduces inverted-index storage costs.
- Bad news: for most words, gathering sufficient data for meaningful statistical analysis (e.g. for correlation analysis for query expansion) is difficult, since they are extremely rare.
Vocabulary Growth

- How does the size of the overall vocabulary (number of unique words) grow with the size of the corpus?
- This determines how the size of the inverted index will scale with the size of the corpus.
- The vocabulary is not really upper-bounded, due to proper names, typos, etc.
Heaps’ Law

Heaps’ law predicts the growth of the vocabulary size V as a function of the text size n (in words):

V = K n^β, with constants K and 0 < β < 1.

Typical constants:
- K ≈ 10-100
- β ≈ 0.4-0.6 (approx. square root)
Heaps’ Law Data

[Figure not reproduced.]
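A sketch for producing such vocabulary-growth data from a corpus (the file name is a placeholder):

```python
# Track how many distinct words have been seen after n running words.
seen: set[str] = set()
sizes = []

# Placeholder file name: substitute any large plain-text corpus.
words = open("corpus.txt", encoding="utf-8").read().lower().split()
for n, w in enumerate(words, start=1):
    seen.add(w)
    if n % 10_000 == 0:
        sizes.append((n, len(seen)))

# Vocabulary V should grow roughly like K * n**beta, beta ~ 0.4-0.6.
for n, V in sizes[:10]:
    print(f"n={n:8d}  V={V:6d}")
```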
Modeling Natural Languages and Compression

Document vocabulary - the number of distinct (different) words within a document.

Heaps’ Law - shown in the right-hand graph of Figure 6.2 in the readings. In general, the number of new words found in a document increases sublinearly with the text size, following the power law above. So, if you have an encyclopedia, probably most of the words are found in the first volume.

Heaps’ Law also implies that the length of the words in the vocabulary increases logarithmically with the text size, but the average word length in the text is constant. That is because there is a greater occurrence of the shorter words.
Relate text size to Indexing

- Review: what is the purpose of an index and how is it made? Based on: ”Appendix A: File Organization and Storage Structures”.
- Text processing: large text collections are often indexed using inverted files. A vector of distinct words forms a vocabulary, with pointers to lists of all the documents that contain each word (see the sketch below).
- Indexes can be compressed to allow quicker access. (Discuss this with Ch. 7.)
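A minimal inverted-file sketch (the documents are invented; a real system would also store term positions and compress the posting lists):

```python
from collections import defaultdict

# Invented toy collection: doc id -> text.
docs = {
    1: "text processing with inverted files",
    2: "inverted files map words to documents",
    3: "index compression allows quicker access",
}

# Vocabulary of distinct words, each pointing to its posting list
# (the list of ids of the documents containing that word).
index: dict[str, list[int]] = defaultdict(list)
for doc_id, text in docs.items():
    for word in set(text.lower().split()):
        index[word].append(doc_id)

print(index["inverted"])  # -> [1, 2]
print(index["access"])    # -> [3]
```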