Transcript 지능형정보검색입문
Preliminary Lecture: Advanced
Information Retrieval
Han-joon Kim
Data Mining Lab., Univ. of Seoul
1
교재: Introduction to Information
Retrieval
•
강의홈페이지
•
http://dmlab.uos.ac.kr/dmlab/TextMining(2010-1)
Data Mining Lab., Univ. of Seoul
2
(지능형) 정보검색시스템의 예
Data Mining Lab., Univ. of Seoul
3
Evaluation
평가비율
• 발표 (1~2건): 20%
• 과제: 30%
• 검색엔진 활용
• 기말시험: 50%
Data Mining Lab., Univ. of Seoul
4
Information Retrieval
(IR)
The indexing and retrieval of textual
documents.
Searching for pages on the World Wide
Web is the most recent “killer app.”
Concerned firstly with retrieving relevant
documents to a query.
Concerned secondly with retrieving from
large sets of documents efficiently.
5
Typical IR Task
Given:
• A corpus of textual natural-language
•
documents.
A user query in the form of a textual string.
Find:
• A ranked set of documents that are relevant
to the query.
6
IR System
Document
corpus
Quer
y
Strin
g
IR
System
Ranked
Documents
1. Doc1
2. Doc2
3. Doc3
.
.
7
Relevance
Relevance is a subjective judgment and
may include:
• Being on the proper subject.
• Being timely (recent information).
• Being authoritative (from a trusted source).
• Satisfying the goals of the user and his/her
intended use of the information (information
need).
8
Keyword Search
Simplest notion of relevance is that the
query string appears verbatim in the
document.
Slightly less strict notion is that the words
in the query appear frequently in the
document, in any order (bag of words).
9
Problems with Keywords
May not retrieve relevant documents
that include synonymous terms.
• “restaurant” vs. “café”
• “PRC” vs. “China”
May retrieve irrelevant documents that
include ambiguous terms.
• “bat” (baseball vs. mammal)
• “Apple” (company vs. fruit)
• “bit” (unit of data vs. act of eating)
10
Beyond Keywords
We will cover the basics of keywordbased IR, but…
We will focus on extensions and recent
developments that go beyond keywords.
We will cover the basics of building an
efficient IR system, but…
We will focus on basic capabilities and
algorithms rather than system’s issues
that allow scaling to industrial size
databases.
11
Intelligent IR
Taking into account the meaning of the
words used.
Taking into account the order of words in
the query.
Adapting to the user based on direct or
indirect feedback.
Taking into account the authority of the
source.
12
IR System Architecture
User Interface
User
Need
Text
Text Operations
Logical View
User
Query
Feedback Operations
Query
Ranked
Docs
Searching
Ranking
Indexing
Database
Manager
Inverted
file
Index
Retrieved
Docs
Text
Database
13
IR System Components
Text Operations forms index words
(tokens).
• Stopword removal
• Stemming
Indexing constructs an inverted index of
word to document pointers.
Searching retrieves documents that contain
a given query token from the inverted index.
Ranking scores all retrieved documents
according to a relevance metric.
14
IR System Components (continued)
User Interface manages interaction with
the user:
• Query input and document output.
• Relevance feedback.
• Visualization of results.
Query Operations transform the query to
improve retrieval:
• Query expansion using a thesaurus.
• Query transformation using relevance
feedback.
15
Web Search
Application of IR to HTML documents
on the World Wide Web.
Differences:
• Must assemble document corpus by
•
•
•
spidering the web.
Can exploit the structural layout
information in HTML (XML).
Documents change uncontrollably.
Can exploit the link structure of the web.
16
Web Search System
Web
Spider
Quer
y
Strin
g
Document
corpus
IR
System
1. Page1
2. Page2
3. Page3
.
.
Ranked
Documents
17
Other IR-Related Tasks
Automated document categorization
Information filtering (spam filtering)
Information routing
Automated document clustering
Recommending information or
products
Information extraction
Information integration
Question answering
18
History of IR
1960-70’s:
• Initial exploration of text retrieval systems for
•
•
“small” corpora of scientific abstracts, and law
and business documents.
Development of the basic Boolean and vectorspace models of retrieval.
Prof. Salton and his students at Cornell
University are the leading researchers in the
area.
19
IR History Continued
1980’s:
• Large document database systems, many run
by companies:
• Lexis-Nexis
• Dialog
• MEDLINE
20
IR History Continued
1990’s:
• Searching FTPable documents on the Internet
• Archie
• WAIS
• Searching the World Wide Web
• Lycos
• Yahoo
• Altavista
21
IR History Continued
1990’s continued:
• Organized Competitions
• NIST TREC
• Recommender Systems
• Ringo
• Amazon
• NetPerceptions
• Automated Text Categorization & Clustering
22
Recent IR History
2000’s
• Link analysis for Web Search
• Google
• Automated Information Extraction
• Whizbang
• Fetch
• Burning Glass
• Question Answering
• TREC Q/A track
23
Recent IR History
2000’s continued:
• Multimedia IR
• Image
• Video
• Audio and music
• Cross-Language IR
• DARPA Tides
• Document Summarization
24
Related Areas
Database Management
Library and Information Science
Artificial Intelligence
Natural Language Processing
Machine Learning
25
Database Management
Focused on structured data stored in
relational tables rather than free-form text.
Focused on efficient processing of welldefined queries in a formal language (SQL).
Clearer semantics for both data and queries.
Recent move towards semi-structured data
(XML) brings it closer to IR.
26
Library and Information Science
Focused on the human user aspects of
information retrieval (human-computer
interaction, user interface, visualization).
Concerned with effective categorization of
human knowledge.
Concerned with citation analysis and
bibliometrics (structure of information).
Recent work on digital libraries brings it
closer to CS & IR.
27
Artificial Intelligence
Focused on the representation of
knowledge, reasoning, and intelligent
action.
Formalisms for representing knowledge
and queries:
• First-order Predicate Logic
• Bayesian Networks
Recent work on web ontologies and
intelligent information agents brings it
closer to IR.
28
Natural Language Processing
Focused on the syntactic, semantic, and
pragmatic analysis of natural language text
and discourse.
Ability to analyze syntax (phrase structure)
and semantics could allow retrieval based
on meaning rather than keywords.
29
Natural Language Processing:
IR Directions
Methods for determining the sense of an
ambiguous word based on context (word
sense disambiguation).
Methods for identifying specific pieces of
information in a document (information
extraction).
Methods for answering specific NL
questions from document corpora.
30
Machine Learning
Focused on the development of
computational systems that improve
their performance with experience.
Automated classification of examples
based on learning concepts from
labeled training examples (supervised
learning).
Automated methods for clustering
unlabeled examples into meaningful
groups (unsupervised learning).
31
Machine Learning:
IR Directions
Text Categorization
• Automatic hierarchical classification (Yahoo).
• Adaptive filtering/routing/recommending.
• Automated spam filtering.
Text Clustering
• Clustering of IR query results.
• Automatic formation of hierarchies (Yahoo).
Learning for Information Extraction
Text Mining
32
46
Data Mining
Knowledge Discovery in large
Databases
대량의 데이타로부터
이전에 알려지지는 않은,
묵시적이고,
잠재적으로 유용한 정보를 탐사하는 작업
Data Mining Lab., Univ. of Seoul
33
Data Mining - 구매패턴의 발견
Data Mining Lab., Univ. of Seoul
34
Data Mining - 구매패턴의 발견
추
천
Data Mining Lab., Univ. of Seoul
35
Data Mining - 분류패턴의 발견
Data Mining Lab., Univ. of Seoul
36
Data Mining - 자동문서분류
Automatic
Manual
Entertainment (Yahoo)
Comic&Animation
Movie&Film
Editorial
Cartoons
Comic
Books
Animatoin
Comic Strip
News&Media
FilmMaking
Film
Festival
Cartoonist
Review
Magna
History
Animated
Gifs
Magazine
Conventions
Magazine
Festival
Anime
Computer
Animation
Magazine
Screen
Short
Films
Writing
37
Data Mining Lab., Univ. of Seoul
Data Mining 응용분야
Retail/Marketing
• 구매자의 성향, 구매패턴, 성향들사이의 관계 등을 판독
• shelf planning, supermarket inventory planning 등에 활용
Banking
• 위조 신용카드사용의 패턴을 추적
• "loyal" 고객을 identify
• 신용카드 가입을 변경시킬 것으로 판단되는 고객을 미리 에측
• 여러 가지 재정 지표들간의 숨겨진 상관관계 판독
Insurance
• Claim Analysis
• 새로운 상품에 대한 고객 수요 예측
• risky customer의 행동 패턴을 identify
• 위조행위를 identify
Medicine
• 환자 history 데이타의 분석
• 성공적인 의료 요법을 identify하는데 이용
• 특정 환자에 대한 수술 여부 판단
Data Mining Lab., Univ. of Seoul
38
Data Mining 응용분야
화학/약학 정보 데이타 관리
• 새로운 화학 구조식의 발견, 새로운 촉매의 발견
석유 탐사
• 석유의 품질에 관한 정보와 지형 데이타상에서 DM
• 석유 생산량과 석유의 품질 예측
Data Mining Lab., Univ. of Seoul
39
Data Mining 과정
Pattern Evaluation
Data Mining
Task-relevant Data
Selection
Data Warehouse
Data Cleaning
Data Integration
Databases
Data Mining Lab., Univ. of Seoul
40
Data Mining을 위해 필요한 기술
Database
Technology
Machine
Learning
Statistics
Data Mining
Information
Science
Visualization
Other
Disciplines
Data Mining Lab., Univ. of Seoul
41
Data Mining 방법론
Classification
Learning
Model
학습 데이터
profitable common
Least
loyal
未知 데이터
Categorization
Clustering
Teenager
having a computer
Young urban
고객데이타 =
인구학적정보, 구매정보 등으로 표현
Data Mining Lab., Univ. of Seoul
career women
42
Data Mining 방법론
Association Rules Mining
- 장바구니 분석
Data Mining Lab., Univ. of Seoul
43
Data Mining 방법론
Classification
Learning
학습 데이터
Model
profitable
common
Least
loyal
未知 데이터
Categorization
Data Mining Lab., Univ. of Seoul
44
학습(Learning)의 원리
Pattern, Model
(Intelligence)
학습 데이터
profitable
common
Least
loyal
未知 데이터
과거 데이타
미래 예측
Data Mining Lab., Univ. of Seoul
45
Classification 예: e-mail 분류
1. 학습
학습문서
2. 자동분류
Classifier
미분류문서
A. fun
Data Mining Lab., Univ. of Seoul
B. business C.private
46
Machine Learning algorithms
Decision
Tree Learning
Neural Networks
Instance-based Learning
• K-Nearest Neighbor
Statistical
method
• Bayesian classification
Data Mining Lab., Univ. of Seoul
47
Classification : k-Nearest Neighbor
+
-
-
-
q
+
-
+
-
Lazy learning
Data Mining Lab., Univ. of Seoul
48
Classification : Decision Trees
Credit Analysis
s a la ry
e d u c a t io n
la b e l
10000
h ig h s c h o o l
re je c t
40000
u n d e r g ra d u a t e
ac c ept
15000
u n d e r g ra d u a t e
re je c t
75000
g ra d u a t e
ac c ept
18000
g ra d u a t e
ac c ept
salary < 20000
Training data
yes
no
education in graduate
학습
yes
accept
Data Mining Lab., Univ. of Seoul
accept
no
reject
49
gain(outlook, T) = 0.94 - 0.694 = 0.246
gain(windy, T) = 0.94 - 0.892 = 0.048
gain(temperature, T) = 0.94 - 0.924 = 0.016
gain(humidity, T) = 0.94 - 0.925 = 0.015
outlook으로 분기
windy로 분기
humidity로 분기
Data Mining Lab., Univ. of Seoul
50
Association Rule Mining
Given:
• 상품 구매 기록으로부터 상품간의 연곽성을 측정
하여 함께 거래될 가능성을 규칙으로 표현
일명: 장바구니 분석
Data Mining Lab., Univ. of Seoul
51
Clustering (Segmentation)
Teenager
having a compute
Young urban
Age, income, address, career, …
career women
Data Mining Lab., Univ. of Seoul
52
Clustering (Segmentation)
Given:
• Data points and number of desired clusters
K
Group the data points into K clusters
• Data points within clusters are more similar
than across clusters
Data Mining Lab., Univ. of Seoul
53
Clustering (Segmentation)
wk
wk
d*
d5
d5
d3
d2
d3
d8
d4
d1 d
15
d*
클러스터링
d11
d2
d8
d4
d1 d
15
d9
d11
d9
wj
wj
wi
wi
클러스터내 데이타
“서로 비슷한 데이타”
Data Mining Lab., Univ. of Seoul
54
Clustering methods
Partitioning methods
Hierarchical methods
Density-based methods
Grid-based methods
Model-based methods
Data Mining Lab., Univ. of Seoul
55