Chapter 1 Introduction

Download Report

Transcript Chapter 1 Introduction

Information Retrieval and Extraction
Hsin-Hsi Chen
Department of Computer Science and
Information Engineering
National Taiwan University
Hsin-Hsi Chen
1-1
Chapter 1 Introduction
Hsin-Hsi Chen (陳信希)
國立台灣大學資訊程學系
Hsin-Hsi Chen
1-2
Motivation
• Information retrieval
– To retrieve information which might be useful or
relevant to the user
– Representation, Storage, Organization, Access
• Information need (vs query)
– Find all the pages containing information on college
tennis teams which
• (1) are maintained by an university in the USA and
• (2) participate in the NCAA tennis tournament.
– To be relevant, the page must include information on
the national ranking of the team in the last three years
and the email or phone number of the team coach.
Hsin-Hsi Chen
1-3
資訊需求
Information Need
• 找尋金馬獎/諾貝爾獎相關報導
獎
金馬獎/諾貝爾獎
金馬獎/諾貝爾獎得主
今年金馬獎/諾貝爾獎得主
今年金馬獎最佳影片得主
Hsin-Hsi Chen
1-4
檢索:今年金馬獎得主 (2000)
Hsin-Hsi Chen
1-5
檢索:今年金馬獎得主 (2008)
Hsin-Hsi Chen
1-6
檢索:今年諾貝爾獎得主 (2000)
Hsin-Hsi Chen
1-7
Hsin-Hsi Chen
1-8
檢索:今年諾貝爾獎得主 (2008)
Hsin-Hsi Chen
1-9
尋找“門聯”資料 (2000)
Hsin-Hsi Chen
1-10
尋找“門聯”資料 (2008)
Hsin-Hsi Chen
1-11
Information versus Data Retrieval
• Data retrieval
– Determine which documents of a collection
contain the keywords in the user query
– Retrieve all objects which satisfy clearly defined
conditions in regular expression or relational
algebra expression
– Data has a well defined structure and semantics
– Solution to the user of a database system
• Information retrieval
Hsin-Hsi Chen
1-12
Database Management
• A specified set of attributes is used to characterize
each item.
EMPLOYEE(NAME, SSN, BDATE, ADDR, SEX,
SALARY, DNO)
• Exact match between the attributes used in
query formulations and those attached to the
record.
SELECT BDATE, ADDR
FROM EMPLOYEE
WHERE NAME = ‘John Smith’
Hsin-Hsi Chen
1-13
Basic Concepts
• Content identifiers (keywords, index terms,
descriptors) characterize the stored texts.
• degrees of coincidence between the sets of
identifiers attached to queries and documents
User task
query formulation
Hsin-Hsi Chen
Logical view
of the documents
content analysis
1-14
The User Task
• Convey the semantics of information need
• Retrieval and browsing
Retrieval
Database
Browsing
Hsin-Hsi Chen
1-15
Logical View of Documents
• Full text representation
• A set of index terms
–
–
–
–
Elimination of stop-words
The use of stemming
The identification of noun groups
…
Hsin-Hsi Chen
1-16
From full text to a set of index terms
automatic
or manual
indexing
document
text+
structure
structure
recognition
structure
Hsin-Hsi Chen
accents,
spacing,
etc.
text
stopwords
noun
groups
stemming
index
terms
full text
1-17
Indexing
• indexing: assign identifiers to text items.
• assign: manual vs. automatic indexing
• identifiers:
– objective vs. nonobjective text identifiers
cataloging rules define, e.g., author names, publisher
names, dates of publications, …
– controlled vs. uncontrolled vocabularies
instruction manuals, terminological schedules, …
– single-term vs. term phrase
Hsin-Hsi Chen
1-18
The retrieval process
Text
User Interface
user need
Text
Text Operations
logical view
logical view
user
feedback
Query
Operations
query
Indexing
Searching
Index
retrieved documents
ranked
documents
Ranking
Hsin-Hsi Chen
DB Manager
Module
Text
Database
1-19
Information Retrieval
• generic information retrieval system
select and return to the user desired documents from a
large set of documents in accordance with criteria
specified by the user
• functions
– document search
the selection of documents from an existing collection
of documents
– document routing
the dissemination of incoming documents to
appropriate users on the basis of user interest profiles
Hsin-Hsi Chen
1-20
Detection Need
• Definition
a set of criteria specified by the user which describes the
kind of information desired.
– queries in document search task
– profiles in routing task
• forms
–
–
–
–
–
keywords
keywords with Boolean operators
free text
example documents
...
Hsin-Hsi Chen
1-21
Example
<head> Tipster Topic Description
<num> Number: 033
<dom> Domain: Science and Technology
<title> Topic: Companies Capable of Producing Document
Management
<des> Description:
Document must identify a company who has the capability to
produce document management system by obtaining a turnkeysystem or by obtaining and integrating the basic components.
<narr> Narrative:
To be relevant, the document must identify a turnkey document
management system or components which could be integrated
to form a document management system and the name of either
the company developing the system or the company using the
system. These components are: a computer, image scanner or
optical
character recognition system, and an information retrieval
Hsin-Hsi Chen
1-22
or text management system.
Example (Continued)
<con> Concepts:
1. document management, document processing, office automation
electronic imaging
2. image scanner, optical character recognition (OCR)
3. text management, text retrieval, text database
4. optical disk
<fac> Factors:
<def> Definitions
Document Management-The creation, storage and retrieval of
documents containing, text, images, and graphics.
Image Scanner-A device that converts a printed image into a video
image, without recognizing the actual content of the text or pictures.
Optical Disk-A disk that is written and read by light, and are
sometimes associated with the storage of digital images because of
their high storage capacity.
Hsin-Hsi Chen
1-23
search vs. routing
• The search process matches a single Detection
Need against the stored corpus to return a subset
of documents.
• Routing matches a single document against a
group of Profiles to determine which users are
interested in the document.
• Profiles stand long-term expressions of user needs.
• Search queries are ad hoc in nature.
• A generic detection architecture can be used for
both the search and routing.
Hsin-Hsi Chen
1-24
Search
• retrieval of desired documents from an existing corpus
• Retrospective search is frequently interactive.
• Methods
– indexing the corpus by keyword, stem and/or phrase
– apply statistical and/or learning techniques to better
understand the content of the corpus
– analyze free text Detection Needs to compare with the
indexed corpus or a single document
– ...
Hsin-Hsi Chen
1-25
Document Detection: Search
Hsin-Hsi Chen
1-26
Document Detection: Search(Continued)
• Document Corpus
– the content of the corpus may have significant
the performance in some applications
• Preprocessing of Document Corpus
–
–
–
–
stemming
a list of stop words
phrases, multi-term items
...
Hsin-Hsi Chen
1-27
Document Detection: Search(Continued)
• Building Index from Stems
– key place for optimizing run-time performance
– cost to build the index for a large corpus
• Document Index
– a list of terms, stems, phrases, etc.
– frequency of terms in the document and corpus
– frequency of the co-occurrence of terms within the
corpus
– index may be as large as the original document corpus
Hsin-Hsi Chen
1-28
Document Detection: Search(Continued)
• Detection Need
– the user’s criteria for a relevant document
• Convert Detection Need to System Specific Query
– first transformed into a detection query, and then a
retrieval query.
– detection query: specific to the retrieval engine, but
independent of the corpus
– retrieval query: specific to the retrieval engine, and to
the corpus
Hsin-Hsi Chen
1-29
Document Detection: Search(Continued)
• Compare Query with Index
• Resultant Rank Ordered List of Documents
– Return the top ‘N’ documents
– Rank the list of relevant documents from the
most relevant to the query to the least relevant
Hsin-Hsi Chen
1-30
Routing
Hsin-Hsi Chen
1-31
Routing (Continued)
• Profile of Multiple Detection Needs
– A Profile is a group of individual Detection
Needs that describes a user’s areas of interest.
– All Profiles will be compared to each incoming
document (via the Profile index).
– If a document matches a Profile the user is
notified about the existence of a relevant
document.
Hsin-Hsi Chen
1-32
Routing (Continued)
• Convert Detection Need to System Specific
Query
• Building Index from Queries
– similar to build the corpus index for searching
– the quantify of source data (Profiles) is usually
much less than a document corpus
– Profiles may have more specific, structured
data in the form of SGML tagged fields
Hsin-Hsi Chen
1-33
Routing (Continued)
• Routing Profile Index
– The index will be system specific and will make use of
all the preprocessing techniques employed by a
particular detection system.
• Document to be routed
– A stream of incoming documents is handled one at a
time to determine where each should be directed.
– Routing implementation may handle multiple document
streams and multiple Profiles.
Hsin-Hsi Chen
1-34
Routing (Continued)
• Preprocessing of Document
– A document is preprocessed in the same manner that a
query would be set-up in a search
– The document and query roles are reversed compared
with the search process
• Compare Document with Index
– Identify which Profiles are relevant to the document
– Given a document, which of the indexed profiles match
it?
Hsin-Hsi Chen
1-35
Routing (Continued)
• Resultant List of Profiles
– The list of Profiles identify which user should
receive the document
Hsin-Hsi Chen
1-36
Summary
• Generate a representation of the meaning or
content of each object based on its
description.
• Generate a representation of the meaning of
the information need.
• Compare these two representations to select
those objects that are most likely to match
the information need.
Hsin-Hsi Chen
1-37
Basic Architecture of
an Information Retrieval System
Documents*
Queries
Document
Representation
Query
Representation
Comparison
Hsin-Hsi Chen
*此模型可以延伸到其他媒體所呈現的資訊
1-38
Research Issues
• Given a set of description for objects in the
collection and a description of an information
need, we must consider
• Issue 1
– What makes a good document representation?
– What are retrievable units and how are they
organized?
– How can a representation be generated from a
description of the document?
Hsin-Hsi Chen
1-39
Research Issues (Continued)
• Issue 2
How can we represent the information need and
how can we acquire this representation either from
a description of the information need or through
interaction with the user?
• Issue 3
How can we compare representations to judge
likelihood that a document matches an information
need?
Hsin-Hsi Chen
1-40
Research Issues (Continued)
• Issue 4
How can we evaluate the effectiveness of the
retrieval process?
Hsin-Hsi Chen
1-41
Text Data Mining Tasks
•
•
•
•
•
•
Information extraction -- facts, fill database
Summarization
Categorization
Clustering
Associations
Temporal analysis of document collection
Information Extraction:
Beyond Document Retrieval
• Question and Answering
– Q: Who is the author of the book, "The Iron
Lady: A Biography of Margaret Thatcher"?
A: Hugo Young
– Q: What was the monetary value of the Nobel
Peace Prize in 1989?
A: $469,000
Hsin-Hsi Chen
1-43
Information Extraction
• Generic Information Extraction System
An information extraction system is a cascade of
transducers or modules that at each step add structure and
often lose information, hopefully irrelevant, by applying
rules that are acquired manually and/or automatically.
Hsin-Hsi Chen
1-44
Information Extraction (Continued)
•
•
•
•
•
•
•
What are the transducers or modules?
What are their input and output?
What structure is added?
What information is lost?
What is the form of the rules?
How are the rules applied?
How are the rules acquired?
Hsin-Hsi Chen
1-45
Example: Parser
•
•
•
•
•
•
•
•
transducer: parser
input: the sequence of words or lexical items
output: a parse tree
information added: predicate-argument and
modification relations
information lost: no
rule form: unification grammars
application method: chart parser
acquisition method: manually
Hsin-Hsi Chen
1-46
Modules
• Text Zoner
turn a text into a set of text segments
• Preprocessor
turn a text or text segment into a sequence of sentences,
each of which is a sequence of lexical items, where a
lexical item is a word together with its lexical attributes
• Filter
turn a set of sentences into a smaller set of sentences by
filtering out the irrelevant ones
• Preparser
take a sequence of lexical items and try to identify various
reliably determinable, small-scale structures
Hsin-Hsi Chen
1-47
Modules (Continued)
• Parser
input a sequence of lexical items and perhaps small-scale
structures (phrases) and output a set of parse tree
fragments, possibly complete
• Fragment Combiner
turn a set of parse tree or logical form fragments into a
parse tree or logical form for the whole sentence
• Semantic Interpreter
generate a semantic structure or logical form from a parse
tree or from parse tree fragments
Hsin-Hsi Chen
1-48
Modules (Continued)
• Lexical Disambiguation
turn a semantic structure with general or ambiguous
predicates into a semantic structure with specific,
unambiguous predicates
• Co-reference Resolution, or Discourse Processing
turn a tree-like structure into a network-like structure by
identifying different descriptions of the same entity in
different parts of the text
• Template Generator
derive the templates from the semantic structures
Hsin-Hsi Chen
1-49
<DOC>
<DOCID> NTU-AIR_LAUNCH-中國時報-19970612-002 </DOCID>
<DATASET> Air Vehicle Launch </DATASET>
<DD> 1997/06/12 </DD>
<DOCTYPE> 報紙報導 </DOCTYPE>
<DOCSRC> 中國時報 </DOCSRC>
<TEXT>
【本報綜合紐約、華盛頓十一日外電報導】在華盛頓宣布首度
出售「刺針」肩射防空飛彈 給南韓的第二天,美國與北韓今天
在紐約恢復延擱已久的會談,這項預定三天的會談將以北韓的
飛彈發展為重點,包括北韓準備部署射程 可涵蓋幾乎日本全境
的「蘆洞」一號長程飛彈 的報導。
美國國務院發言人柏恩斯說:「在有關北韓 飛彈擴散問題上
,美方的確有多項關切之處。」美國官員也長期懷疑北韓正對
伊朗和敘利亞輸出飛彈,並希望平壤加入禁止擴散此種武器的
Hsin-Hsi Chen
1-50
國際公約。美國官員已知會北韓說,倘若北韓希望與美國建立
正常的外交關係,就必須減少飛彈輸出。
這項有關北韓飛彈計劃的會談是雙方於一九九六年四月在德國
柏林舉行的首度會談的後續談判。美國在該次會談中要求北韓停
止生產、測試及出售飛彈給他國,尤其是敘利亞和伊朗兩國。
美國副助理國務卿艾恩宏和北韓外交部對外事務局局長李衡哲
分別為雙方的談判代表,會談預定在十三日結束。
柏恩斯說:「美方非常關心所有北韓本身,或是北韓與中共、
伊朗或其他國家的飛彈問題。我們認為就此與他們舉行會談是甚
為重要。」
而為提昇南韓陸軍的自衛能力,美國於昨天宣布準備出售價值
三億零七百萬美元的一千零六十五枚刺針飛彈與其他武器給南韓,
Hsin-Hsi Chen
1-51
它說,這項交易不會使朝鮮半島的緊張局勢惡化。
五角大廈說:「這項設備與支援的銷售不會影響該區基本軍事
均勢。」
國務院也表示全力支持此項包含兩百一十三座發射台、支援
設備、零件與訓練的交易。
柏恩斯說:「這項交易獲得政府內每一個人的全力支持,它
符合我們在朝鮮半島的政策。」他強調:「我們的第一優先是
防衛南韓。」
如果國會同意,這將是華府對南韓出售防空 飛彈的第一筆交易。
</TEXT>
</DOC>
Hsin-Hsi Chen
1-52
<ID="3">十一日
<ID="4" REF="3" >今天
<ID="5“ REF="3">出售「刺針」肩射防空飛彈 給南韓的第二天
<ID="63" >延擱已久的會談
<ID=“66” REF=“63”>一九九六年四月在德國柏林舉行的首度會談
的後續談判
<ID="65" REF="63">這項有關北韓飛彈計劃的會談
<ID="70" REF="65">會談
<ID="69" REF="65">會談
<ID="64" REF="63">這項預定三天的會談
Hsin-Hsi Chen
1-53
The Advanced Research and
Development Activity (ARDA)
• a joint activity of the Intelligence Community (IC)
and the Department of Defense (DOD) in late
November 1998
• intelligence community's (IC) center for
conducting advanced research and development
related extracting intelligence from and providing
security for information stored, transmitted, or
manipulated by electronic means
情報
Hsin-Hsi Chen
1-54
Hsin-Hsi Chen
1-55
ARDA R&D Programs
• Information Exploitation
– Pulling Information
– Pushing Information
– Visualizing and Navigating Information
• Quantum Information Science & Photonics
• Digital Network Intelligence
Hsin-Hsi Chen
1-56
Pulling Information
• Providing answers to complex, multifaceted
questions that analysts pose
• The analyst seeks to "pull" the answer out
of multiple, very large, heterogeneous data
sources that may physically reside in
diverse locations
Hsin-Hsi Chen
1-57
Pulling Information (Continued)
• Accepting complex questions in a form natural to the
analyst.
• Questions may include judgment terms and an acceptable
answer may need to be based upon conclusions and
decisions reached by the system and may require the
summarization, fusion, and synthesis of information drawn
from multiple sources.
• Translating analytic questions into multiple queries
appropriate to the various data sets to be searched.
• Finding relevant information in distributed, multimedia,
multilingual, multi-agency data sets.
• Analyzing, fusing and summarizing information into a
coherent answer.
• Providing the answer to the analyst in the form that he/she
want
Hsin-Hsi Chen
1-58
Pushing Information
• Providing information from multiple, very
large, heterogeneous data sources that
analysts do not ask
• The system discovers information in some
profiling, clustering, pattern recognition,
data mining, or other fashion and "pushes"
this information to analysts that the system
determines might have an interest.
Hsin-Hsi Chen
1-59
Pushing Information (Continued)
• Profiling and blind clustering of new data.
• Detecting anomalies, patterns and changes
in large volumes of data.
• Analyzing the nature and description of the
anomalies, patterns, and changes.
• Alerting the appropriate analyst(s) of the
newly discovered information.
Hsin-Hsi Chen
1-60
Topics
• Introduction to Information Retrieval and
Extraction
• Modeling
• Retrieval Evaluation
• Query Languages
• Query Operations
• Text and Multimedia Languages and Properties
• Text Operations
• Indexing and Searching
Hsin-Hsi Chen
1-61
Topics (Continued)
•
•
•
•
•
•
•
User Interfaces and Visualization
Multimedia IR: Models and Languages
Multimedia IR: Indexing and Searching
Searching the Web
Digital Libraries
Information Extraction (IJCAI 1999)
Text Data Mining (ACL 1999)
• Text Web Mining (AIRS 2004)
Hsin-Hsi Chen
1-62
Text
IR
Retrieval Models
and Evaluation
Improvements
On Retrieval
Efficient
Processing
Hsin-Hsi Chen
Human-Computer
Interaction for IR
Interfaces &
Visualization
Multimedia
IR
Multimedia
Modeling
& Searching
Applications
for IR
Bibliographic
Systems
The
Web
Digital
Libraries
1-63
Information Sources
• http://wwwcsli.stanford.edu/~schuetze/informationretrieval.html
• Books
– Ricardo Baeza-Yates and Berthier Riberiro-Neto (1999)
Modern Information Retrieval, Addison-Wesley.
台灣進口商為 “華通書坊” 電話 : (03)5720317
– Salton, G. (1989) Automatic Text Processing. The
Transformation, Analysis and Retrieval of Information
by Computer. Reading, MA: Addison-Wesley.
– Frakes, W.B. and Baeza-Yates, R. (Eds.) (1992)
Information Retrieval: Data Structures and Algorithms.
Englewood Cliffs, NJ: Prentice Hall.
– Cheong, F. (1996) Internet Agents: Spiders, Wanderers,
Hsin-Hsi Chen
1-64
Brokers, and Bots. Indianapolis, IN: New Riders, 1996.
Information Sources
• Karen Sparck Jones and Peter Willett (1997) Readings in
Information Retrieval, CA: Morgan Kaufmann Publishers.
• Christopher D. Manning, et al. Introduction to Information
Retrieval, Cambridge University Press. 2007.
http://www-csli.stanford.edu/~schuetze/informationretrieval-book.html.
• Sholom M. Weiss, Nitin Indurkhya, Tong Zhang, Fred J.
Damerau, Text Mining: Predictive Methods for Analyzing
Unstructured Information, Springer, 2005.
Hsin-Hsi Chen
1-65
Information Sources
• Conference Proceedings
– ACM SIGIR Annual International Conference on
Research and Development in Information
Retrieval (1978-)
– ACM International Conference on Digital
Libraries
– ACM Conference on Information and Knowledge
Management
– Text Retrieval Conference
Hsin-Hsi Chen
1-66
Information Sources
(Continued)
• Journals
– ACM Transactions on Information Systems
– Information Processing and Management
– Journal of the American Society for Information
Science and Technology
– Journal of Documentation
– Information Systems
– Information Retrieval
– Knowledge and Information Systems
Hsin-Hsi Chen
1-67