Transcript Lecture 10

Information Retrieval (IR)
• Information retrieval generally refers to the process of extracting the documents that are relevant to the information need specified in a query.
• IRS vs. DBMS

  IRS                     DBMS
  unstructured            structured
  casual query style      query formulation
  non-deterministic       deterministic
  most relevant data      all relevant data
  relevance feedback      one-off query
Basic Components of IRS

[Diagram: input documents from text editors, file systems, and internet files, together with linguistic information, feed a knowledge base; the IRS core draws on the knowledge base, receives queries from users, and returns relevant documents.]
Information Retrieval
• The IR technology:
  – Knowledge base: dictionary and rules
  – Basic information representation model
  – Indexing of documents for retrieval
  – Relevance calculation
• Oriental languages vs. English in IR
  – The main difference is in what is considered useful information in each language
  – Different NLP knowledge and variants of common methods need to be used
Vector Space Model for Document Representation
• Document D: articles in text form
• Terms T: basic language units, such as words and phrases
  D(T1; T2; …; Ti; …; Tn)
  Ti and Tj may refer to the same word appearing in different places, and the order in which a term appears is also relevant.
• Term weight: Ti has an associated weight value Wi to indicate the importance of Ti to D:
  D = D(T1 W1; T2 W2; …; Tn Wn)
• For a given D, if we do not consider word repetition and order, and terms are taken against a known set T = (t1; t2; …; tK), where K is the number of words in the vocabulary, then D(T1 W1; T2 W2; …; Tn Wn) can be represented by the Vector Space Model: D = D(W1; W2; …; WK)
• Vector Space Model for document representation:
  (W1; W2; …; WK) can be considered as a vector.
  (t1; t2; …; tK) defines the K-dimensional coordinate system, where K is a fixed number.
  Each coordinate indicates the weight of term ti.
  Different documents are then considered as different vectors in the VSM.

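As an illustration, here is a minimal Python sketch of this representation; the toy vocabulary and the use of raw counts as weights are assumptions for the example, not part of the lecture:

    # Fixed vocabulary (t1, ..., tK) defining the K-dimensional space.
    VOCAB = ["information", "retrieval", "query", "document", "index"]

    def to_vector(tokens):
        """Represent a tokenized document as D = (W1, ..., WK), where Wk
        is the raw count of vocabulary term tk (an assumed weighting;
        later slides refine how weights are assigned)."""
        return [tokens.count(t) for t in VOCAB]

    doc = "query terms index the document for retrieval".split()
    print(to_vector(doc))   # [0, 1, 1, 1, 1]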
• Similarity: the degree of relevance of two documents.
• The degree of relevance of D1 and D2 can be described by a so-called similarity function, Sim(D1, D2), which describes their distance.
• There are many different definitions of Sim(D1, D2):
  – One simple definition (inner product):
    Sim(D1, D2) = Σ_{k=1..n} w1k · w2k
  – Example: D1 = (1, 0, 0, 1, 1, 1), D2 = (0, 1, 1, 0, 1, 0)
    Sim(D1, D2) = 1×0 + 0×1 + 0×1 + 1×0 + 1×1 + 1×0 = 1
  – Another definition (cosine):
    Sim(D1, D2) = cos θ = (Σ_{k=1..n} w1k · w2k) / sqrt((Σ_{k=1..n} w1k²) · (Σ_{k=1..n} w2k²))
• For information retrieval, D2 can be a query Q.
• Suppose there are I documents Di, where i = 1 to I.
• Rank by Sim(Di, Q): the higher the value, the more relevant Di is to Q.
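A minimal Python sketch of both similarity functions and the ranking step; the vectors are the worked example above, and the document names are illustrative:

    import math

    def sim_inner(d1, d2):
        """Inner-product similarity: sum of pairwise weight products."""
        return sum(w1 * w2 for w1, w2 in zip(d1, d2))

    def sim_cosine(d1, d2):
        """Cosine similarity: inner product divided by the two norms."""
        return sim_inner(d1, d2) / math.sqrt(sim_inner(d1, d1) * sim_inner(d2, d2))

    D1 = (1, 0, 0, 1, 1, 1)
    D2 = (0, 1, 1, 0, 1, 0)
    print(sim_inner(D1, D2))    # 1, matching the worked example
    print(sim_cosine(D1, D2))   # ~0.289

    # Ranking: sort documents by similarity to a query vector Q.
    Q = (0, 1, 1, 0, 1, 0)
    docs = {"D1": D1, "D2": D2}
    print(sorted(docs, key=lambda d: sim_cosine(docs[d], Q), reverse=True))
    # ['D2', 'D1']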
Term Selection for Indexing
• T can be considered as all the terms that can be indexed:
  – Approach 1: simply use a word dictionary
  – Approach 2: terms in a dictionary + terms segmented dynamically => T is not necessarily static
• Every document Di needs to be segmented
  – The vocabulary for indexing is normally much smaller than the vocabulary of the documents.
• Not every word Tk in Di that is in T will be indexed
• A Tk in Di that is in T but not indexed is considered to have weight wik = 0
  – In other words, all indexed terms for Di are considered to have weight greater than zero
• The process of selecting terms in a Di for indexing is called term selection.
• Word frequency in documents is related to the information the articles intend to convey; thus word frequency is often used in earlier term selection and weight assignment algorithms.
• The Zipf law in information theory: for a given document set, rank the terms according to their frequency; then
  Freq(ti) × rank(ti) ≈ constant
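A quick Python sketch of checking the Zipf law on a token list; the corpus file name is an assumption for illustration:

    from collections import Counter

    tokens = open("corpus.txt").read().split()   # assumed plain-text corpus
    freqs = Counter(tokens)

    # Rank terms by descending frequency; freq * rank should stay
    # roughly constant for a natural-language document set.
    for rank, (term, freq) in enumerate(freqs.most_common(10), start=1):
        print(term, freq, freq * rank)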
The H.P. Luhn Method (term selection)
• Suppose N documents form a document set Dset = {Di, 1 ≤ i ≤ N}
• (1) Freqik: the frequency of Tk in Di.
      TotFreqk: the frequency of Tk in Dset.
• (2) Then,
      TotFreqk = Σ_{i=1..N} Freqik
• (3) Sort the TotFreqk in descending order. Select an upper bound and a lower bound, Cu-b and Cl-b, respectively.
• Index only the terms between Cu-b and Cl-b
• Open issues: the use of absolute frequency, and the choice of Cu-b and Cl-b
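A minimal Python sketch of the selection step; the cutoff values are arbitrary assumptions:

    from collections import Counter

    def luhn_select(docs, c_ub=100, c_lb=3):
        """Keep only terms whose total frequency across the document set
        falls between the lower bound Cl-b and the upper bound Cu-b
        (both bound values here are assumed for illustration)."""
        tot_freq = Counter()
        for tokens in docs:                 # docs: a list of token lists
            tot_freq.update(tokens)
        return {t for t, f in tot_freq.items() if c_lb <= f <= c_ub}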
• H.P. Luhn's method is a very rough association of term frequency with the information to be conveyed in a document.
  – Some low-frequency terms may be very important to a particular article (document), and that may be exactly the reason they do not appear so often in other articles.
• Keys in term selection for indexing: completeness and accuracy
  – Related to the article so that it can be indexed for retrieval (completeness)
  – Distinguishing one article from other articles (accuracy and representativeness)
• Example: the term "電腦" ("computer") is not an important term in a document set about computers; however, it is probably important in a "hardware devices" set.
• Relative frequency: weigh a term's frequency against the whole document set, as in the weight assignment algorithm below.
Weight Assignment Algorithm
Assuming:
• Freqik ↑ in Di ⇒ importance of tk in Di ↑ (∝ Freqik)
• TotFreqk (frequency of tk in Dset) ↑ ⇒ importance of tk in Di ↓ (⇒ factor log2(N / TotFreqk))
• The weight should be assigned based on these assumptions:
  Wik = Freqik + Freqik · (log2 N − log2 TotFreqk)
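A direct Python transcription of the slide's weight formula; the document set and its tokenization are assumed inputs:

    import math
    from collections import Counter

    def assign_weights(docs):
        """Compute Wik = Freqik + Freqik * (log2 N - log2 TotFreqk)
        for every document i and term k, as in the formula above."""
        n = len(docs)
        tot_freq = Counter()
        for tokens in docs:                 # docs: a list of token lists
            tot_freq.update(tokens)
        weights = []
        for tokens in docs:
            freq = Counter(tokens)
            weights.append({t: f + f * (math.log2(n) - math.log2(tot_freq[t]))
                            for t, f in freq.items()})
        return weights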
More Considerations in Relevance
• Long articles are more likely to be retrieved ⇒ discount the length factor
• Document feature terms: the most frequently used terms tend to appear in more articles, so they do not serve to distinguish one article from others.
• Expressiveness of terms: high-frequency, low-frequency, normal-frequency. Normal-frequency and low-frequency terms convey more about the article's features/theme.
• Word class: nouns convey more information (實詞 "content words" vs. 虛詞 "function words")
  – Use of PoS tags and also a stoplist (Slist)
• Syntactic word classification information: nouns are more related to concepts.
  Example sentence: 人口 的 自然 增長 是 由 出生 和 死亡 之間 的 差額 所 形成 的
  ("Natural population growth is formed by the difference between births and deaths.")
  Class: 人口/名 的/助 自然/副 增長/名 是/動 由/介 出生/名 和/連 死亡/名 之間/名 的/助 差額/名 所/助 形成/動 的/助
  (名 = noun, 助 = particle, 副 = adverb, 動 = verb, 介 = preposition, 連 = conjunction)
  Slist: elimination of words that cannot be filtered out by word class alone, such as 是 ("is").
  Only those terms not on the stop list will be used in the frequency calculation, as in the sketch below.
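A small Python sketch of this filtering step; the tag choices and stop list are toy assumptions:

    STOPLIST = {"是", "的"}            # assumed stop words
    CONTENT_TAGS = {"名", "動"}        # keep nouns and verbs (a toy choice)

    def index_terms(tagged_tokens):
        """Keep tokens whose PoS tag marks a content word and which are
        not on the stop list; only these enter the frequency counts."""
        return [w for w, tag in tagged_tokens
                if tag in CONTENT_TAGS and w not in STOPLIST]

    sent = [("人口", "名"), ("的", "助"), ("增長", "名"), ("是", "動"), ("形成", "動")]
    print(index_terms(sent))   # ['人口', '增長', '形成'] -- 是 is a verb by
                               # class, so only the stop list removes it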
• Semantic word classification: extracting concepts
  For the same sentence, the content words map to broader concept terms:
  人口→人, 自然→必然, 增長→增多, 出生→誕生, 死亡→死, 之間→期間, 差額→數量, 形成→興起
  (e.g. "population"→"person", "growth"→"increase")
• Thesaurus: co-occurrence of related terms
• Indexing of phrases (grammatical analysis):
  – Example: for "artificial intelligence" (人工智能), it is more relevant to index "artificial intelligence" (人工智能) as one unit rather than as two independent terms (see the sketch below).
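A minimal sketch of phrase-aware term extraction, assuming a small hand-built phrase list:

    PHRASES = {("artificial", "intelligence")}   # assumed phrase lexicon

    def phrase_terms(tokens):
        """Merge adjacent tokens that form a known phrase into a single
        indexing unit; other tokens pass through unchanged."""
        out, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in PHRASES:
                out.append(tokens[i] + " " + tokens[i + 1])
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        return out

    print(phrase_terms("research on artificial intelligence".split()))
    # ['research', 'on', 'artificial intelligence']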
Information Retrieval System Architecture

[Diagram: document sources feed an indexing engine, which builds an indexed document repository; customer inquiries are turned into an indexed query expression; relevance calculation matches the query against the repository; the retrieved documents are evaluated and returned by ranking, with query optimization and relevance feedback refining the query before the final result is returned.]
Bilingual (Multi-lingual) IR
• Retargetable IR approach: monolingual, but the IR system can be configured for different languages.
• Why bi- or multi-lingual IR?
  – Bi- and multi-lingual communities that read text in more than one language want to find text in more than one language!
  – Retrieval of legal information in Chinese and English.
  – Retrieval of reports about a person in newspapers written in different languages.
• Dictionary approach
  – Normalize indexing and searching into one language (saves storage)
  – Translation equivalence (among multiple translations) must be determined during indexing and during extraction of the query's term vector (increases indexing time)
    • It is not easy to obtain good translation equivalences
    • Many proper nouns are not in the translation dictionary and need to be found and mapped to their corresponding target translations
  – Inflexible: cannot use user-specified translation equivalences
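A toy Python sketch of the normalization step, assuming a tiny bilingual dictionary (all entries are illustrative):

    # Assumed bilingual dictionary; real entries often have multiple
    # translation equivalents, from which one must be chosen.
    BILINGUAL_DICT = {"電腦": ["computer"], "智能": ["intelligence", "intellect"]}

    def normalize_terms(terms, pick=lambda candidates: candidates[0]):
        """Map source-language terms into the single indexing language.
        `pick` chooses among multiple equivalents; terms missing from the
        dictionary (e.g. proper nouns) are left untranslated."""
        return [pick(BILINGUAL_DICT[t]) if t in BILINGUAL_DICT else t
                for t in terms]

    print(normalize_terms(["電腦", "智能", "香港"]))
    # ['computer', 'intelligence', '香港'] -- the proper noun is unmapped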
• Multi-lingual indexing approach
  – Indexing for all the different languages
    • Higher storage cost
    • Different indexing techniques for different languages (e.g. English and Chinese)
  – Flexible (can use system-supplied or user-supplied translation equivalences)
  – Supports exact match in different languages