Transcript Document
南京大学多媒体研究所
Multimedia Computing Institute of NJU
Text Information Retrieval: Text Operations
Wu Gangshan
Tel : 83594243
Office: Meng Minwei Building 608B
Email : [email protected]
Architecture of an Information Retrieval System
(CLIR, QA, Web)
[Architecture diagram: the user's information need enters through the user interface; the query language and query processing turn it into a query, with user feedback; documents go through text processing to a logical view, and indexing under database management builds the inverted file in the text database; searching and ranking over the index return the retrieved documents, ranked, to the specific application (CLIR, QA, Web).]
Outline
Document preprocessing
Document classification
Document clustering
Document summarization
Document compression
Document Preprocessing
Tokenization
Stopword removal
Lemmatization
Stemming
Metadata and markup languages
Simple Tokenization
Analyze text into a sequence of discrete tokens
(words).
Sometimes punctuation (e-mail), numbers (1999),
and case (Republican vs. republican) can be a
meaningful part of a token.
However, frequently they are not.
Simplest approach is to ignore all numbers and
punctuation and use only case-insensitive unbroken
strings of alphabetic characters as tokens.
More careful approach:
Separate ? ! ; : “ ‘ [ ] ( ) < >
Care with . - why? when?
Care with …
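As a rough illustration of that simplest approach, here is a minimal sketch in Python (the function name and regex are illustrative, not from the slides); note how it also exposes the hyphen problem mentioned above:
```python
import re

def simple_tokenize(text):
    """Keep only case-folded, unbroken runs of alphabetic characters;
    numbers and punctuation are ignored entirely."""
    return re.findall(r"[a-z]+", text.lower())

print(simple_tokenize("Republican e-mail, 1999!"))
# ['republican', 'e', 'mail']  -- the hyphenated token is broken apart
```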
Punctuation(标点符号)
Ne’er: use language-specific mappings to
normalize
State-of-the-art: break up hyphenated
sequence.
U.S.A. vs. USA
a.out
Numbers
3/12/91
Mar. 12, 1991
55 B.C.
B-52
100.2.86.144
Generally, don’t index as text
Case folding
Reduce all letters to lower case
exception: upper case in mid-sentence
e.g., General Motors
Fed vs. fed
SAIL vs. sail
Tokenizing HTML
Should text in HTML commands not typically
seen by the user be included as tokens?
Words appearing in URLs.
Words appearing in “meta text” of images.
Simplest approach is to exclude all HTML tag information (between "<" and ">") from tokenization.
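A rough sketch of that simplest approach, assuming Python; the regex-based tag stripping is only illustrative and would not cope robustly with comments or script blocks:
```python
import re

def tokenize_html(html):
    """Drop everything between '<' and '>' and tokenize the remaining text."""
    text = re.sub(r"<[^>]*>", " ", html)   # exclude all HTML tag information
    return re.findall(r"[a-z]+", text.lower())

print(tokenize_html('<a href="http://www.nju.edu.cn">NJU <b>Multimedia</b></a>'))
# ['nju', 'multimedia']  -- words in the URL and tags are not indexed
```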
Stopwords
It is typical to exclude high-frequency words (e.g. function
words: “a”, “the”, “in”, “to”; pronouns: “I”, “he”, “she”, “it”).
Stopwords are language dependent
For efficiency, store strings for stopwords in a hashtable to
recognize them in constant time.
Simple Perl hashtable for Perl-based implementations
How to determine a list of stopwords?
For English? – may use existing lists of stopwords
E.g. SMART’s commonword list (~ 400)
WordNet stopword list
For Spanish? Bulgarian?
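A minimal sketch of the constant-time lookup idea, using a Python set as the hash table; the short stopword list here is illustrative, not the SMART or WordNet list:
```python
# Illustrative subset; a real system would load a full list (e.g. ~400 SMART common words).
STOPWORDS = {"a", "an", "the", "in", "to", "of", "i", "he", "she", "it"}

def remove_stopwords(tokens):
    """Drop tokens found in the stopword hash table (O(1) membership test each)."""
    return [t for t in tokens if t not in STOPWORDS]

print(remove_stopwords(["the", "boy", "in", "the", "car"]))
# ['boy', 'car']
```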
Lemmatization (same meaning, different surface forms)
Reduce inflectional/variant forms to base form
Direct impact on VOCABULARY size
E.g.,
am, are, is → be
car, cars, car's, cars' → car
the boy's cars are different colors → the boy car be different color
How to do this?
Need a list of grammatical rules + a list of irregular words
children → child, spoken → speak, …
Practical implementation: use WordNet’s morphstr function
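A small sketch using NLTK's interface to WordNet's morphological processor (morphy), which plays roughly the role of the morphstr function mentioned above; it assumes NLTK and its WordNet data are installed:
```python
from nltk.corpus import wordnet as wn

# morphy combines rules for regular inflections with WordNet's exception
# lists for irregular forms ("children" -> "child", "spoken" -> "speak").
for word, pos in [("cars", wn.NOUN), ("children", wn.NOUN), ("spoken", wn.VERB)]:
    print(word, "->", wn.morphy(word, pos))
# cars -> car, children -> child, spoken -> speak
```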
Stemming
Reduce tokens to “root” form of words to recognize
morphological variation.
“computer”, “computational”, “computation” all reduced to
same token “compute”
Correct morphological analysis is language specific
and can be complex.
Stemming “blindly” strips off known affixes
(prefixes and suffixes) in an iterative fashion.
Original text: "for example compressed and compression are both accepted as equivalent to compress."
After stemming: "for exampl compres and compres are both accept as equival to compres."
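A quick sketch reproducing this example with NLTK's implementation of the Porter stemmer (one of several freely available ones); it assumes NLTK is installed:
```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
text = "for example compressed and compression are both accepted as equivalent to compress"
print(" ".join(stemmer.stem(w) for w in text.split()))
# roughly: for exampl compress and compress are both accept as equival to compress
```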
Porter Stemmer
Simple procedure for removing known
affixes in English without using a dictionary.
Can produce unusual stems that are not
English words:
“computer”, “computational”, “computation” all
reduced to same token “comput”
May conflate (reduce to the same token)
words that are actually distinct.
Does not recognize all morphological derivations.
Typical rules in Porter
sses → ss
ies → i
ational → ate
tional → tion
See class website for link to “official” Porter
stemmer site
Provides ready-to-use Perl and C implementations
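A toy sketch of how such rewrite rules are applied, checking the longest matching suffix first; it only illustrates the rule format, not the full Porter algorithm with its measure conditions and multiple passes:
```python
# (suffix, replacement) pairs from the slide, ordered longest first.
RULES = [("ational", "ate"), ("tional", "tion"), ("sses", "ss"), ("ies", "i")]

def apply_one_step(word):
    """Rewrite the first matching suffix, as a single Porter-style step would."""
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            return word[: -len(suffix)] + replacement
    return word

for w in ["relational", "conditional", "caresses", "ponies"]:
    print(w, "->", apply_one_step(w))
# relational -> relate, conditional -> condition, caresses -> caress, ponies -> poni
```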
Porter Stemmer Errors
Errors of "commission":
organization, organ → organ
police, policy → polic
arm, army → arm
Errors of “omission”:
cylinder, cylindrical
create, creation
Europe, European
Other stemmers
Other stemmers exist, e.g., Lovins stemmer
http://www.comp.lancs.ac.uk/computing/research/stemming/general/lovins.htm
Single-pass, longest suffix removal (about 250
rules)
Motivated by Linguistics as well as IR
Full morphological analysis - modest benefits
for retrieval
Metadata
On Metadata
Often included in Web pages
Hidden from the browser, but useful for indexing
Information about a document that may not be a part of the
document itself (data about data).
Descriptive metadata is external to the meaning of the
document:
Author
Title
Source (book, magazine, newspaper, journal)
Date
ISBN
Publisher
Length
Text Properties
Distribution of words in language
Why are they important?
Zipf’s Law
Heaps' Law
Statistical Properties of Text
How is the frequency of different words
distributed?
How fast does vocabulary size grow with the
size of a corpus?
Such factors affect the performance of
information retrieval and can be used to select
appropriate term weights and other aspects of
an IR system.
Word Frequency
A few words are very common.
Most words are very rare.
The two most frequent words (e.g., "the", "of") can account for about 10% of word occurrences.
Half the words in a corpus appear only once,
called hapax legomena (Greek for “read only
once”)
Called a “heavy tailed” distribution, since
most of the probability mass is in the “tail”
Sample Word Frequency Data
(from B. Croft, UMass)
Zipf’s Law
Rank (r): The numerical position of a word in
a list sorted by decreasing frequency (f ).
Zipf (1949) “discovered” that:
f · r = k (for constant k)
If the probability of the word of rank r is p_r and N is the total number of word occurrences:
p_r = f / N ≈ A / r    for corpus-independent constant A ≈ 0.1
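A rough sketch of checking Zipf's law on a token stream, assuming Python; the corpus file and the simple_tokenize helper from the tokenization sketch are illustrative assumptions:
```python
from collections import Counter

def zipf_table(tokens, top=10):
    """Print rank r, frequency f, and f*r/N for the most frequent words;
    under Zipf's law f*r/N should stay near a constant A of about 0.1."""
    counts = Counter(tokens)
    n = sum(counts.values())
    for rank, (word, freq) in enumerate(counts.most_common(top), start=1):
        print(rank, word, freq, round(freq * rank / n, 3))

# zipf_table(simple_tokenize(open("corpus.txt").read()))  # hypothetical corpus file
```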
Zipf and Term Weighting
Luhn (1958) suggested that both extremely common
and extremely uncommon words were not very useful
for indexing.
Predicting Occurrence Frequencies
By Zipf, a word appearing n times has rank r_n = AN/n
Several words may occur n times; assume rank r_n applies to the last of these.
Therefore, r_n words occur n or more times and r_{n+1} words occur n+1 or more times.
So, the number of words appearing exactly n times is:
I_n = r_n - r_{n+1} = AN/n - AN/(n+1) = AN / (n(n+1))
Predicting Word Frequencies
(cont’d)
Assume the highest-ranking term occurs once and therefore has rank D = AN/1 = AN
Fraction of words with frequency n is:
I_n / D = 1 / (n(n+1))
Fraction of words appearing only once is therefore 1/2.
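A quick numerical check of this prediction in Python; it just evaluates 1/(n(n+1)) for small n:
```python
# Predicted fraction of the vocabulary occurring exactly n times: 1/(n*(n+1)).
for n in range(1, 5):
    print(n, round(1 / (n * (n + 1)), 3))    # 1: 0.5, 2: 0.167, 3: 0.083, 4: 0.05
print(round(sum(1 / (n * (n + 1)) for n in range(1, 5)), 3))
# 0.8 -- words seen at most four times should make up about 80% of the vocabulary
```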
Occurrence Frequency Data
(from B. Croft, UMass)
Does Real Data Fit Zipf’s Law?
A law of the form y = k·x^c is called a power law.
Zipf's law is a power law with c = -1.
On a log-log plot, power laws give a straight line with slope c:
log(y) = log(k·x^c) = log(k) + c·log(x)
Zipf is quite accurate except for very high and low
rank.
Fit to Zipf for Brown Corpus
k = 100,000
Mandelbrot (1954) Correction
The following more general form gives a bit
better fit:
f = P / (r + ρ)^B
for constants P, B, ρ
Mandelbrot Fit
P = 10^5.4, B = 1.15, ρ = 100
Explanations for Zipf’s Law
Zipf’s explanation was his “principle of least
effort.” Balance between speaker’s desire for
a small vocabulary and hearer’s desire for a
large one.
Li (1992) shows that just random typing of
letters including a space will generate “words”
with a Zipfian distribution.
http://linkage.rockefeller.edu/wli/zipf/
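A small sketch of that random-typing argument, assuming Python; it generates random letters and spaces and ranks the resulting "words", whose frequencies fall off roughly as a power law:
```python
import random
from collections import Counter

random.seed(0)
alphabet = "abcd "                      # four letters plus the space character
text = "".join(random.choice(alphabet) for _ in range(200_000))
counts = Counter(w for w in text.split() if w)

ranked = counts.most_common()
for rank in [1, 2, 5, 10, 50, 200, 1000]:
    if rank <= len(ranked):
        word, freq = ranked[rank - 1]
        print(rank, word, freq)
# On a log-log plot, frequency vs. rank for this "monkey text" is close to a
# straight line with slope near -1, i.e. a Zipf-like distribution.
```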
Zipf’s Law Impact on IR
Good News: Stopwords will account for a
large fraction of text so eliminating them
greatly reduces inverted-index storage costs.
Bad News: For most words, gathering
sufficient data for meaningful statistical
analysis (e.g. for correlation analysis for
query expansion) is difficult since they are
extremely rare.
Vocabulary Growth
How does the size of the overall vocabulary
(number of unique words) grow with the size
of the corpus?
This determines how the size of the inverted
index will scale with the size of the corpus.
Vocabulary is not really upper-bounded, due to proper names, typos, etc.
Heaps’ Law
If V is the size of the vocabulary and n is the length of the corpus in words:
V = K·n^β    with constants K and β, 0 < β < 1
Typical constants:
K ≈ 10-100
β ≈ 0.4-0.6 (approximately square-root growth)
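A rough sketch of measuring vocabulary growth on a token stream, assuming Python; the corpus file and the simple_tokenize helper are again illustrative assumptions:
```python
def vocabulary_growth(tokens, checkpoints=(1_000, 10_000, 100_000, 1_000_000)):
    """Report the number of distinct words V at several corpus sizes n;
    under Heaps' law V grows roughly like K * n**beta, beta around 0.4-0.6."""
    seen = set()
    for i, tok in enumerate(tokens, start=1):
        seen.add(tok)
        if i in checkpoints:
            print(f"n={i}  V={len(seen)}")

# vocabulary_growth(simple_tokenize(open("corpus.txt").read()))  # hypothetical corpus file
```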
Heaps’ Law Data
Explanation for Heaps’ Law
Can be derived from Zipf’s law by assuming
documents are generated by randomly
sampling words from a Zipfian distribution.
Heaps' Law also holds for the distribution of other kinds of data
Our own experiments on the types of questions asked by users show similar behavior