Transcript Document

Multimedia Computing Institute of Nanjing University (NJU)
Text Information Retrieval: Text Operations
Wu Gangshan
Tel: 83594243
Office: Meng Minwei Building, Room 608B
Email: [email protected]
Architecture of an Information Retrieval System (CLIR, QA, Web)
[Architecture diagram: the user's information need enters through the user interface; query-language handling, query processing, and user feedback produce a query for the specific application; documents go through text/document processing into a logical view; indexing, under text-database management, builds an inverted file; search and ranking over the index return the ranked, retrieved documents.]
Outline
- Document preprocessing
- Document classification
- Document clustering
- Document summarization
- Document compression
Document Preprocessing
- Tokenization
- Stopword removal
- Lemmatization
- Stemming
- Metadata and markup languages
Simple Tokenization
- Analyze text into a sequence of discrete tokens (words).
- Sometimes punctuation (e-mail), numbers (1999), and case (Republican vs. republican) can be a meaningful part of a token.
- However, frequently they are not.
- The simplest approach is to ignore all numbers and punctuation and use only case-insensitive, unbroken strings of alphabetic characters as tokens (a minimal sketch follows below).
- A more careful approach:
  - Separate ? ! ; : " ' [ ] ( ) < >
  - Take care with . and - : why? when?
  - Take care with ...
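A minimal sketch of the simplest approach above, in Python (the function name and regular expression are illustrative, not part of the lecture): lowercase the text and keep only unbroken runs of alphabetic characters.

```python
import re

def simple_tokenize(text):
    """Ignore numbers and punctuation; return lowercase alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

print(simple_tokenize("Republican e-mail in 1999, republican!"))
# -> ['republican', 'e', 'mail', 'in', 'republican']
# '1999' disappears and 'e-mail' is split in two, which is exactly why the
# slide asks for more care with '.', '-', and '...'.
```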
Punctuation
- Ne'er: use language-specific mappings to normalize.
- State-of-the-art: break up the hyphenated sequence.
- U.S.A. vs. USA
- a.out
Numbers
- 3/12/91
- Mar. 12, 1991
- 55 B.C.
- B-52
- 100.2.86.144
- Generally, don't index numbers as text.
Case folding
- Reduce all letters to lower case.
- Exception: upper case in mid-sentence
  - e.g., General Motors
  - Fed vs. fed
  - SAIL vs. sail
Tokenizing HTML
- Should text in HTML commands not typically seen by the user be included as tokens?
  - Words appearing in URLs.
  - Words appearing in the "meta text" of images.
- The simplest approach is to exclude all HTML tag information (between "<" and ">") from tokenization, as in the sketch below.
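A sketch of that simplest approach, assuming tags can be dropped with a regular expression before tokenization; real pages with scripts, comments, and entities need a proper HTML parser, and the function names here are illustrative.

```python
import re

def strip_html_tags(html):
    """Drop everything between '<' and '>' (the simplest approach above)."""
    return re.sub(r"<[^>]*>", " ", html)

def tokenize_html(html):
    """Strip the tags, then tokenize as lowercase alphabetic strings."""
    return re.findall(r"[a-z]+", strip_html_tags(html).lower())

page = "<html><body><p>Hello <b>IR</b> world!</p></body></html>"
print(tokenize_html(page))   # -> ['hello', 'ir', 'world']
```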
Stopwords
- It is typical to exclude high-frequency words (e.g. function words: "a", "the", "in", "to"; pronouns: "I", "he", "she", "it").
- Stopwords are language dependent.
- For efficiency, store the stopword strings in a hashtable so that they can be recognized in constant time (see the sketch after this list).
  - A simple Perl hashtable works for Perl-based implementations.
- How to determine a list of stopwords?
  - For English, one may use existing lists of stopwords:
    - E.g. SMART's common-word list (~400 words)
    - The WordNet stopword list
  - For Spanish? Bulgarian?
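The sketch below uses Python's built-in set, which is hash-based, so each membership test is constant time on average; the stopword list shown is a tiny illustrative subset, not the SMART list.

```python
# A tiny illustrative stopword list; a real system would load e.g. the SMART list.
STOPWORDS = {"a", "the", "in", "to", "of", "i", "he", "she", "it", "is", "are"}

def remove_stopwords(tokens):
    """Keep only tokens that are not stopwords; each lookup is O(1) on average."""
    return [t for t in tokens if t not in STOPWORDS]

print(remove_stopwords("the boy is in the garden".split()))
# -> ['boy', 'garden']
```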
Lemmatization
- Reduce inflectional/variant forms to the base form.
- Direct impact on VOCABULARY size.
- E.g.,
  - am, are, is → be
  - car, cars, car's, cars' → car
  - the boy's cars are different colors → the boy car be different color
- How to do this? (see the sketch after this list)
  - Need a list of grammatical rules + a list of irregular words
    - children → child, spoken → speak, ...
  - Practical implementation: use WordNet's morphstr function
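The lecture points to WordNet's morphstr function from the WordNet C library. A comparable route in Python, assuming NLTK and its WordNet data are installed, is nltk's WordNetLemmatizer, which wraps WordNet's morphy lookup.

```python
# Requires: pip install nltk, then nltk.download("wordnet") once.
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# The part of speech matters: nouns and verbs use different exception lists.
print(lemmatizer.lemmatize("cars", pos="n"))      # -> car
print(lemmatizer.lemmatize("children", pos="n"))  # -> child
print(lemmatizer.lemmatize("spoken", pos="v"))    # -> speak
print(lemmatizer.lemmatize("are", pos="v"))       # -> be
```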
Stemming
- Reduce tokens to the "root" form of words to recognize morphological variation.
  - "computer", "computational", "computation" all reduced to the same token "compute"
- Correct morphological analysis is language specific and can be complex.
- Stemming "blindly" strips off known affixes (prefixes and suffixes) in an iterative fashion.
  - Input:  for example compressed and compression are both accepted as equivalent to compress.
  - Output: for exampl compres and compres are both accept as equival to compres.
Porter Stemmer
- A simple procedure for removing known affixes in English without using a dictionary.
- Can produce unusual stems that are not English words:
  - "computer", "computational", "computation" all reduced to the same token "comput"
- May conflate (reduce to the same token) words that are actually distinct.
- Does not recognize all morphological derivations.
Typical rules in Porter
- sses → ss
- ies → i
- ational → ate
- tional → tion
- See the class website for a link to the "official" Porter stemmer site.
  - It provides ready-to-use Perl and C implementations (a minimal sketch of these rules follows below).
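The sketch below applies only the four rewrite rules listed on this slide; the full Porter algorithm runs several ordered steps and checks a "measure" condition on the remaining stem, so for real use prefer the official implementations linked above (or, for example, nltk.stem.PorterStemmer).

```python
# Only the four rules shown above, tried longest suffix first.
RULES = [("ational", "ate"), ("tional", "tion"), ("sses", "ss"), ("ies", "i")]

def apply_porter_rules(word):
    """Rewrite the first (longest) matching suffix; otherwise leave the word alone."""
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            return word[: -len(suffix)] + replacement
    return word

for w in ["caresses", "ponies", "relational", "conditional"]:
    print(w, "->", apply_porter_rules(w))
# caresses -> caress, ponies -> poni, relational -> relate, conditional -> condition
```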
Porter Stemmer Errors
- Errors of "commission" (distinct words conflated):
  - organization, organ → organ
  - police, policy → polic
  - arm, army → arm
- Errors of "omission" (related words not conflated):
  - cylinder, cylindrical
  - create, creation
  - Europe, European
Other stemmers
- Other stemmers exist, e.g., the Lovins stemmer:
  http://www.comp.lancs.ac.uk/computing/research/stemming/general/lovins.htm
  - Single-pass, longest-suffix removal (about 250 rules)
  - Motivated by linguistics as well as IR
- Full morphological analysis brings only modest benefits for retrieval.
Metadata
- On metadata:
  - Often included in Web pages
  - Hidden from the browser, but useful for indexing
- Metadata is information about a document that may not be a part of the document itself (data about data).
- Descriptive metadata is external to the meaning of the document:
  - Author
  - Title
  - Source (book, magazine, newspaper, journal)
  - Date
  - ISBN
  - Publisher
  - Length
Properties of Text
- Text properties
  - The distribution of words in language
  - Why are they important?
- Zipf's Law
- Heaps' Law
Statistical Properties of Text
- How is the frequency of different words distributed?
- How fast does vocabulary size grow with the size of a corpus?
- Such factors affect the performance of information retrieval and can be used to select appropriate term weights and other aspects of an IR system.
Word Frequency
- A few words are very common.
  - The 2 most frequent words (e.g. "the", "of") can account for about 10% of word occurrences.
- Most words are very rare.
  - Half the words in a corpus appear only once; these are called hapax legomena (Greek for "read only once").
- This is called a "heavy-tailed" distribution, since most of the probability mass is in the "tail".
Sample Word Frequency Data
[Table of sample word-frequency counts omitted] (from B. Croft, UMass)
Zipf's Law
- Rank (r): the numerical position of a word in a list sorted by decreasing frequency (f).
- Zipf (1949) "discovered" that:
  f · r = k   (for constant k)
- If the probability of the word of rank r is p_r and N is the total number of word occurrences:
  p_r = f / N = A / r   for a corpus-independent constant A ≈ 0.1
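One quick way to eyeball Zipf's law on any tokenized corpus is to sort words by decreasing frequency and check that f · r stays roughly constant (equivalently, that p_r · r ≈ A). A sketch, with a toy corpus standing in for real data:

```python
from collections import Counter

def zipf_table(tokens, top=10):
    """For the `top` most frequent words, report rank r, frequency f, f*r,
    and the estimated constant A = p_r * r = (f / N) * r."""
    counts = Counter(tokens)
    N = sum(counts.values())
    rows = []
    for r, (word, f) in enumerate(counts.most_common(top), start=1):
        rows.append((r, word, f, f * r, f * r / N))
    return rows

# Toy corpus; on a large corpus the f*r column is roughly constant and A ~ 0.1.
tokens = "the cat sat on the mat and the dog sat on the rug".split()
for row in zipf_table(tokens):
    print(row)
```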
Zipf and Term Weighting
- Luhn (1958) suggested that both extremely common and extremely uncommon words are not very useful for indexing.
Predicting Occurrence Frequencies
- By Zipf, a word appearing n times has rank r_n = AN/n.
- Several words may occur n times; assume rank r_n applies to the last of these.
- Therefore, r_n words occur n or more times and r_{n+1} words occur n+1 or more times.
- So, the number of words appearing exactly n times is:
  I_n = r_n - r_{n+1} = AN/n - AN/(n+1) = AN / (n(n+1))
Predicting Word Frequencies (cont'd)
- Assume the highest-ranked term occurs once and therefore has rank D = AN/1.
- The fraction of words with frequency n is:
  I_n / D = 1 / (n(n+1))
- The fraction of words appearing only once is therefore 1/2.
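A quick numerical check of this prediction: the fractions 1/(n(n+1)) telescope to 1 when summed over all n, and the n = 1 term alone is 1/2.

```python
# Fraction of vocabulary words predicted to occur exactly n times: 1 / (n(n+1)).
fractions = [1.0 / (n * (n + 1)) for n in range(1, 10_001)]
print(fractions[0])    # 0.5    -> half the vocabulary should be hapax legomena
print(sum(fractions))  # ~0.9999 -> the telescoping sum approaches 1
```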
Occurrence Frequency Data
[Table of predicted vs. observed occurrence frequencies omitted] (from B. Croft, UMass)
Does Real Data Fit Zipf's Law?
- A law of the form y = k·x^c is called a power law.
- Zipf's law is a power law with c = -1.
- On a log-log plot, power laws give a straight line with slope c:
  log(y) = log(k·x^c) = log k + c·log(x)
- Zipf is quite accurate except for very high and very low ranks.
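Since a power law is a straight line on a log-log plot, the exponent c can be estimated with an ordinary least-squares fit of log f against log r. A sketch using numpy; "corpus.txt" is a placeholder for any large plain-text file, and for Zipf the fitted slope should come out near -1.

```python
import numpy as np
from collections import Counter

def zipf_fit(tokens):
    """Fit log(f) = log(k) + c*log(r) and return (c, k)."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    ranks = np.arange(1, len(freqs) + 1)
    slope, intercept = np.polyfit(np.log(ranks), np.log(freqs), 1)
    return slope, np.exp(intercept)

tokens = open("corpus.txt").read().lower().split()  # any large text file
print(zipf_fit(tokens))  # expect a slope near -1, except at extreme ranks
```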
Fit to Zipf for Brown Corpus
[Log-log plot of rank vs. frequency for the Brown Corpus omitted] k = 100,000
Mandelbrot (1954) Correction
- The following more general form gives a somewhat better fit:
  f = P·(r + ρ)^(-B)   for constants P, B, ρ
Mandelbrot Fit
[Log-log plot of the Mandelbrot fit to the Brown Corpus omitted] P = 10^5.4, B = 1.15, ρ = 100
Explanations for Zipf's Law
- Zipf's explanation was his "principle of least effort": a balance between the speaker's desire for a small vocabulary and the hearer's desire for a large one.
- Li (1992) shows that just random typing of letters, including a space, will generate "words" with a Zipfian distribution (see the sketch below).
  - http://linkage.rockefeller.edu/wli/zipf/
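A sketch of that random-typing observation: drawing letters and the space uniformly at random already produces "words" whose rank-frequency curve is roughly a power law. The alphabet size and text length below are arbitrary choices, not values from Li's paper.

```python
import random
from collections import Counter

random.seed(0)
alphabet = "abcde "          # five letters plus the space as a word separator
text = "".join(random.choice(alphabet) for _ in range(200_000))
words = [w for w in text.split() if w]

counts = Counter(words)
# Short "words" are exponentially more likely, which yields an approximately
# Zipf-like curve: rank * frequency stays within the same order of magnitude.
for rank, (word, freq) in enumerate(counts.most_common(10), start=1):
    print(rank, word, freq, rank * freq)
```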
Zipf's Law: Impact on IR
- Good news: stopwords account for a large fraction of the text, so eliminating them greatly reduces inverted-index storage costs.
- Bad news: for most words, gathering sufficient data for meaningful statistical analysis (e.g. correlation analysis for query expansion) is difficult, since they are extremely rare.
Vocabulary Growth
- How does the size of the overall vocabulary (the number of unique words) grow with the size of the corpus?
- This determines how the size of the inverted index will scale with the size of the corpus.
- The vocabulary is not really upper-bounded, due to proper names, typos, etc.
Heaps' Law
- If V is the size of the vocabulary and n is the length of the corpus in words:
  V = K·n^β   with constants K and 0 < β < 1
- Typical constants:
  - K ≈ 10–100
  - β ≈ 0.4–0.6 (approximately square-root growth)
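A sketch for estimating the Heaps' law constants on a corpus: record the vocabulary size V after every block of tokens, then fit log V = log K + β·log n. "corpus.txt" is again a placeholder for any large plain-text file.

```python
import numpy as np

def heaps_curve(tokens, step=1000):
    """Return (corpus length n, vocabulary size V) samples taken every `step` tokens."""
    seen, points = set(), []
    for i, tok in enumerate(tokens, start=1):
        seen.add(tok)
        if i % step == 0:
            points.append((i, len(seen)))
    return points

tokens = open("corpus.txt").read().lower().split()
points = heaps_curve(tokens)
n = np.array([p[0] for p in points], dtype=float)
V = np.array([p[1] for p in points], dtype=float)
beta, logK = np.polyfit(np.log(n), np.log(V), 1)
print("K =", np.exp(logK), "beta =", beta)   # expect K ~ 10-100, beta ~ 0.4-0.6
```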
Heaps' Law Data
[Plot of vocabulary size vs. corpus size omitted]
Explanation for Heaps' Law
- Heaps' law can be derived from Zipf's law by assuming documents are generated by randomly sampling words from a Zipfian distribution.
- Heaps' law also holds for the distribution of other data:
  - Our own experiments on the types of questions asked by users show similar behavior.