Introduction to Information Extraction

Chia-Hui Chang
Dept. of Computer Science and Information
Engineering, National Central University, Taiwan
Problem Definition

Information Extraction (IE) identifies relevant
information in documents, pulling it from a variety
of sources and aggregating it into a homogeneous form.
Input → extractor → structured output

The output template of the IE task


Several fields (slots)
Several instances of a field
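
A minimal sketch of such an output template, assuming a simple in-memory representation: the class name `Template` and the slot names are hypothetical, chosen only for illustration, and are not part of any MUC specification. Each slot (field) holds zero or more instances.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Template:
    """Hypothetical IE output template: named slots, each with 0..n instances."""
    slots: Dict[str, List[str]] = field(default_factory=dict)

    def add(self, slot: str, value: str) -> None:
        # A slot may receive several instances, so values accumulate in a list.
        self.slots.setdefault(slot, []).append(value)

    def get(self, slot: str) -> List[str]:
        # An unfilled slot is represented by an empty list.
        return self.slots.get(slot, [])


template = Template()
template.add("person", "John Rollwagen")
template.add("position", "chairman")
template.add("position", "CEO")  # second instance of the same field
```

Here `template.get("position")` holds two instances, while a slot that was never filled, such as `template.get("location")`, comes back empty.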
Difficulty of IE tasks depends on …
Text type
• From plain text to semi-structured Web pages
• e.g. Wall Street Journal articles, email messages, or HTML documents
Domain
• From financial news or tourist information to various languages
Scenario
Various IE Tasks

Free-text IE:
• For MUC (Message Understanding Conference)
• E.g. terrorist activities, corporate joint ventures
Semi-structured IE:
• E.g. meta-search engines, shopping agents, bio-integration systems
Types of IE from MUC

Named Entity recognition (NE)
• Finds and classifies names, places, etc.
Coreference Resolution (CO)
• Identifies identity relations between entities in texts.
Template Element construction (TE)
• Adds descriptive information to NE results.
Scenario Template production (ST)
• Fits TE results into specified event scenarios.
Named Entity Recognition
http://www.cs.nyu.edu/cs/faculty/grishman/NEtask20.book_3.html
NE Recognition (Cont.)



Spanish: 93%
Japanese: 92%
Chinese: 84.51%
Coreference Resolution


Coreference resolution (CO) involves identifying
identity relations between entities in texts.
For example, in
"Alas, poor Yorick, I knew him well."
tie "Yorick" to "him".
The Sheffield system scored 51% recall
and 71% precision.
http://www.cs.nyu.edu/cs/faculty/grishman/COtask21.book_4.html
Template Element Production


Adds descriptive information to named entities
The Sheffield system scored 71%
Scenario Template Extraction



STs are the prototypical outputs of IE systems.
They tie together TE entities into event and relation descriptions.
Performance for Sheffield: 49%
http://www.cs.nyu.edu/cs/faculty/grishman/IEtask15.book_2.html
Example

User interests center on the operational domains of
drug enforcement, money laundering, organized crime,
terrorism, ….
1. Input: texts dealing with drug enforcement, money
laundering, organized crime, terrorism, and legislation;
2. NE: recognizes entities in those texts and assigns them to
one of a number of categories drawn from the set of
entities of interest (person, company, . . . );
3. TE: associates certain types of descriptive information with
these entities, e.g. the location of companies;
4. ST: identifies a set (relatively small to begin with) of
events of interest by tying entities together into event
relations.
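
The NE, TE and ST steps above can be sketched as a toy pipeline. This is only an illustration under strong assumptions: the gazetteer, the function names, the description table and the event label are all hypothetical, and real MUC systems use far richer linguistic processing than substring lookup.

```python
from typing import Dict, List, Tuple

# Hypothetical toy gazetteer; a real NE component would be far richer.
GAZETTEER: Dict[str, str] = {
    "Cray Research Inc.": "company",
    "John Rollwagen": "person",
}


def named_entities(text: str) -> List[Tuple[str, str]]:
    """NE step: find known names in the text and assign each a category."""
    return [(name, cat) for name, cat in GAZETTEER.items() if name in text]


def template_elements(
    entities: List[Tuple[str, str]],
    descriptions: Dict[str, Dict[str, str]],
) -> List[Dict[str, str]]:
    """TE step: attach descriptive information (e.g. location) to entities."""
    return [
        {"name": name, "category": cat, **descriptions.get(name, {})}
        for name, cat in entities
    ]


def scenario_template(elements: List[Dict[str, str]], event: str) -> Dict[str, object]:
    """ST step: tie the entities together into one event relation."""
    return {"event": event, "participants": [e["name"] for e in elements]}


# Toy input sentence, written for this sketch.
text = "John Rollwagen, the chairman and CEO of Cray Research Inc., was nominated."
ne = named_entities(text)
te = template_elements(ne, {"Cray Research Inc.": {"location": "Eagan, Minn."}})
st = scenario_template(te, "management_succession")
```

Each stage consumes only the output of the previous one, which mirrors how TE builds on NE results and ST builds on TE results.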
Example Text
Output Example (NE, TE)
Output (STs)
Another IE Example


Corporate Management Changes
Purpose
• which positions in which organizations are changing hands?
• who is leaving a position and where the person is going to?
• who is appointed to a position and where the person is coming from?
• the locations and types of the organizations involved in the succession events;
• the names and titles of the persons involved in the succession events
http://www.cs.umanitoba.ca/~lindek/ie-ex.htm
Input Text
President Clinton nominated John Rollwagen, the chairman
and CEO of Cray Research Inc., as the No. 2 Commerce
Department official. Mr. Rollwagen said he wants to push
the Clinton administration to aggressively confront U.S.
trading partners such as Japan to open their markets,
particularly for high-tech industries. In a letter sent
throughout the Eagan, Minn.-based company on Friday, Mr.
Rollwagen warned: "Whether we like it or not, our country
is in an economic war; and we are at a key turning point in
that war." ......
Cray said it has appointed John F. Carlson, its president and
chief operating officer, to succeed him. ......
Extraction Result
Corporate Management Database
Person            Organization         Position   Transition
John Rollwagen    Cray Research Inc.   chairman   out
John Rollwagen    Cray Research Inc.   CEO        out
John F. Carlson   Cray Research Inc.   chairman   in
John F. Carlson   Cray Research Inc.   CEO        in

Organization Database
Name                  Location       Alias   Type
Cray Research Inc.    Eagan, Minn.   Cray    COMPANY
Commerce Department                          GOVERNMENT
MUC

Data sets for
• MET2: http://www.itl.nist.gov/iaui/894.02/related_projects/muc/met2/met2package.tar.gz
• MUC3&4: http://www.itl.nist.gov/iaui/894.02/related_projects/muc/muc_data/muc34.tar.gz
• MUC6&7: from LDC, http://www.ldc.upenn.edu/
MUC-6: http://www.cs.nyu.edu/cs/faculty/grishman/muc6.html
MUC-7: http://www.itl.nist.gov/iaui/894.02/related_projects/muc/proceedings/muc_7_toc.html
Summary

Evaluation
• Precision = (# of correctly extracted fields) / (# of extracted fields)
• Recall = (# of correctly extracted fields) / (# of fields to be extracted)
Design Methodology for Text IE
• Natural Language Processing
• Machine Learning
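
The field-level precision and recall measures in this summary can be computed as in the following sketch. It assumes that each extracted field is represented as a (slot, value) pair and that a field counts as correct only on an exact match with the answer key. The function name, the sample data, and the choice to return 0.0 for an empty denominator are assumptions made for illustration.

```python
from typing import Set, Tuple

Field = Tuple[str, str]  # (slot name, extracted value)


def precision_recall(extracted: Set[Field], gold: Set[Field]) -> Tuple[float, float]:
    """Return (precision, recall) for a set of extracted fields."""
    correct = len(extracted & gold)  # number of correctly extracted fields
    # Assumed convention: an empty denominator yields 0.0.
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical answer key and system output.
gold = {
    ("person", "John Rollwagen"),
    ("position", "chairman"),
    ("position", "CEO"),
    ("organization", "Cray Research Inc."),
}
extracted = {
    ("person", "John Rollwagen"),
    ("position", "chairman"),
    ("position", "president"),  # a wrong extraction
}
p, r = precision_recall(extracted, gold)
```

With two of three extracted fields correct and two of four key fields found, precision is 2/3 and recall is 1/2.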
IE from Web pages

Output Template: k-tuple


Multiple instances of a field
Missing data
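
A minimal sketch of mapping one extracted record onto a k-tuple, assuming a hypothetical three-field schema (`title`, `author`, `price`). Missing data is represented as `None` and multiple instances of a field as a list. Both conventions are assumptions of this sketch, not something the slides prescribe.

```python
from typing import Dict, List, Tuple, Union

Value = Union[None, str, List[str]]

SCHEMA: Tuple[str, ...] = ("title", "author", "price")  # hypothetical, k = 3


def to_k_tuple(record: Dict[str, List[str]]) -> Tuple[Value, ...]:
    """Map raw field values onto a fixed-order k-tuple."""
    row: List[Value] = []
    for name in SCHEMA:
        values = record.get(name, [])
        if not values:
            row.append(None)          # missing data
        elif len(values) == 1:
            row.append(values[0])     # single instance
        else:
            row.append(list(values))  # multiple instances of a field
    return tuple(row)


row = to_k_tuple({"title": ["Book A"], "author": ["X", "Y"]})
```

Keeping the field order fixed means every record yields a tuple of the same arity, whatever the source page left out.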
Web data extraction

Various Web pages


Multiple-record page extraction
One-record (singular) page extraction
Multiple-record page extraction
One-record (singular) page extraction
Applications

Information integration



Meta Search Engines
Shopping agents
Travel agents
Information Integration Systems
[Figure: information integration architecture, ranging from abstracted information to unprocessed, unintegrated details. Labels: human and computer users; user services (query, monitor, update); information integration service; mediation (mediators; semantic integration; agent/module coordination); translation and wrapping (wrappers); heterogeneous data sources (text, images/video, and spreadsheets; hierarchical and network databases; SQL relational databases; ORB object and knowledge bases).]
Web Wrappers

What is a wrapper?
• A program that extracts desired information from Web pages.
• Web pages → wrapper → structured information
Web wrappers wrap...
• "Query-able" or "search-able" Web sites
• Web pages with large itemized lists
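
As a hedged illustration of a hand-written wrapper for a page with an itemized list, the sketch below uses Python's standard `re` module. The page fragment, the field names, and the delimiter-based extraction rule are all hypothetical. A wrapper for a real site would need a rule fitted to that site's markup.

```python
import re
from typing import Dict, List

# Hypothetical page fragment with an itemized list of records.
PAGE = """
<ul>
  <li><b>Laptop</b> <i>$999</i></li>
  <li><b>Phone</b> <i>$499</i></li>
</ul>
"""

# Extraction rule: each field is the text between a left and a right delimiter.
RECORD_RULE = re.compile(
    r"<li><b>(?P<name>.*?)</b>\s*<i>(?P<price>.*?)</i></li>"
)


def wrapper(page: str) -> List[Dict[str, str]]:
    """Return one {field: value} record per list item matched by the rule."""
    return [m.groupdict() for m in RECORD_RULE.finditer(page)]


records = wrapper(PAGE)
```

The rule is tied to this exact markup: if the site changes its tags, the wrapper stops matching.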
Summary

Evaluation
• Precision = (# of correctly extracted records) / (# of extracted records)
• Recall = (# of correctly extracted records) / (# of records to be extracted)
Methodology for Web IE
• Programming package
• Machine Learning
• Pattern Mining
Type III: News Group IE

Example: Computer-Related Jobs
Output Template


Between free-text IE and semi-structured IE
[CaliffRapier 99]
Wrapper Induction Systems


Wrapper induction (WI) or information
extraction (IE) systems are software tools
designed to generate wrappers.
Taxonomy of Web IE systems by
Task domain
• free text vs semi-structured pages
Automation degree
• supervised vs unsupervised
Techniques applied
• machine learning vs pattern mining
Task Domain


Document type
Extraction level
• Field-level, record-level, page-level
Extraction target variation
• Missing Attributes
• Multi-valued Attributes
• Multi-order attribute Permutations
• Nested Data Objects
Template variation
• Various Templates for an attribute
• Common Templates for various attributes
• Untokenized Attributes
Automation Degree
Page-fetching Support
Annotation Requirement
Output Support
API Support
Techniques Applied
Scan passes
Extraction rule types
Learning algorithms
Tokenization schemes
Features used
Conclusion
Define the IE problem
Specify the input: training examples
• with annotation, or
• without annotation
Depict the extraction rule
Use necessary background knowledge
References




*H. Cunningham, Information Extraction – a User Guide, http://www.dcs.shef.ac.uk
*MUC-6, http://www.cs.nyu.edu/cs/faculty/grishman/muc6.html
*I. Muslea, Extraction Patterns for Information Extraction Tasks: A Survey, The AAAI-99 Workshop on Machine Learning for Information Extraction.
Califf, Relational Learning of Pattern-Match Rules for Information Extraction, AAAI-99.