Text Mining I -- Extraction Web-Based Information Architectures
Download
Report
Transcript Text Mining I -- Extraction Web-Based Information Architectures
Text Mining -- Extraction
Web-Based Information Architectures
MSEC 20-760
Mini II
Jaime Carbonell
General Topic: Text Extraction
•
•
•
•
•
Motivation: Text Mining
Context-Free Entity Extraction
Role-based Entity Extraction
Relational Extraction
eBusiness Applications
Text Mining (1)
The Need to Process Text Automatically
• Text is meant to be read by humans, not programs.
• Most useful information is stored as text.
(100 times as much online text as online DBs)
• HTML web pages are text (with structuring tags).
• Data Mining (covered later) operates on data
tables (i.e. numbers, fixed fields, adherence to data
models).
Text Mining (2)
The Need to Process Text Automatically
• We need text => data table transducers.
• General Natural Language Understanding is still
too hard.
• But, can we solve simpler but useful subproblems?
• Yes – categorization of text by topic and extraction
of certain kinds of information from free text or
HTML-structured text is possible.
Text Mining (3)
Components of Text Mining
• Categorization by topic or Genre
Introduced here, see Prof Yang’s lecture
• Fact extraction from text
Topic of this class
• Data Mining from DBs or extracted facts
Later lecture on Data Mining
Text Categorization (1)
Definition
• Assign labels to each document or web-page
• Labels may be topics such as Yahoo-categories
e.g. "finance," "sports," "news>world>asia>business"
• Labels may be genres
e.g. "editorials" "movie-reviews" "news"
• Labels may be binary
e.g. "interesting-to-me" "not-interesting-to-me"
Text Categorization (2)
Methods
• Manual assignment (as in Yahoo)
• Hand-coded rule based (as in Reuters)
(Usually If the document contains a given
boolean combination of words, then assign
it a specified category.)
Text Categorization (3)
Methods
• Learning of document-label assignment function
– Most new applications rely on machine learning
– k-Nearest Neighbors (simple, powerful)
See Prof. Yang’s lecture
– Decision-tree induction (most common method)
– Support-vector machines (newest method)
Named Entity Identification I (1)
Purpose
To answer questions such as:
• Who is mentioned in these 100 Society article?
• What locations are listed in these 2000 web pages?
• What companies are mentioned in these patent
forms?
• What products were evaluated by Consumer
Reports this year?
Named Entity Identification I (2)
Example
President Clinton decided to send special trade
envoy Mickey Kantor to the special Asian
economic meeting in Singapore this week. Ms.
Xuemei Peng, trade minister from China, and Mr.
Hideto Suzuki from Japan’s Ministry of Trade and
Industry will also attend. Singapore, who is
hosting the meeting, will probably be represented
by its foreign and economic ministers. The
Australian representative, Mr. Langford, will not
attend, though no reason has been given. The
parties hope to reach a framework for currency
stabilization.
Named Entity Identification I (3)
Extracted Named Entities (NEs)
PEOPLE
PLACES
__________________________________________
President Clinton
Singapore
Mickey Kantor
Japan
Ms. Xuemei Peng
China
Mr. Hideto Suzuki
Australia
Mr. Langford
Named Entity Identification II
Finite-State Machines (1)
Definition of Finite State Acceptor (FSA)
• A FSA is a directed graph
• With a "start" node
• With one or more "accepting" nodes
Named Entity Identification II
Finite-State Machines (2)
Definition of Finite State Acceptor (FSA)
• With link-labels matching input items
– exact-match links labels
e.g. "China" matching only "China"
– wildcard (?) match
e.g. "?" matches "100" or "China" or ...
– feature-match
e.g. CAP matches any capitalized word
– list-membership match
e.g. if HON-LIST := (Mr, Ms, Dr, President, ...)
it would match any of those words in the input
Named Entity Identification II
Finite-State Machines (3)
Definition of Finite State Acceptor (FSA)
• With an input source (e.g. string of words)
• Outputs "YES" or "NO"
Named Entity Identification III
Finite-State Machines
Definition of A Finite State Transducer (FST)
• An FSA with variable binding
• Outputs "NO" or "YES"+variable-bindings
• Variable bindings encode recognized entity
e.g. "YES <firstname Hideto> <lastname
Suzuki>"
Finite State Acceptor (FSA)
Start
State
HON-LIST
Accept
State
CAP
CAP
Finite State Transducer (FST)
HON-LIST
CAP
CAP
HON :=
FirstName :=
LastName :=
Role-Situated Named Entities (1)
Motivation
• It is useful to know roles of NE’s, e.g.:
• Who participated in the economic meeting?
• Who hosted the economic meeting?
• Who was discussed in the economic
meeting?
• Who was absent from the the economic
meeting?
Role-Situated Named Entities (2)
How do we Assign Roles to Entities?
• Instead of one FSM, use a trio of 3 FSMs
– <left-context-FSA><entity-FSM><right-context-FSA>
• Where left and right context help assign role
Role-Situated Named Entities (3)
Example
If <right-context> =
<? "not" ("attend" | "participate")>
Then entity.role = ABSENT
If <left-context> =
<("meet" | "meeting") ("in" | "at")>
Then entity.role = HOST
Relational
Information Extraction (1)
Motivation
It useful to know who is doing what to whom
Relational
Information Extraction (2)
Example
"John Snell reporting for Wall Street. Today
Flexicon Inc. announced a tender offer for
Supplyhouse Ltd. for $30 per share,
representing a 30% premium over Friday’s
closing price. Flexicon expects to acquire
Supplyhouse by Q4 2001 without problems
from federal regulators"
Relational
Information Extraction (3)
Extraction System is Template of FSMs
[Corporate-acquisition
[acquirer <company-FSM> <r-acquirer-FSM>]
[acquiree <l-acquiree-FSM> <company-FSM)]
[share-price <money-FSM> <r-stock-FSM>]
[date <l-event-date-FSM> <date-FSM>]
]
Relational
Information Extraction (4)
Output is Instantiated FSM
[Corporate-acquisition
[acquirer "Flexicon Inc."]
[acquiree "Supplyhouse Ltd."]
[share-price "30 USD"]
[date "Q4 2001"]
]
Fact Extraction:
State of the Art (1)
Observations
• Entity => entity+roles => relation templates
Increasing richness of information extracted
• But not equivalent to language
understanding
Only pre-determined info types extracted
Fact Extraction:
State of the Art (2)
Observations
• Useful for relational DB filling
Acquirer Acquiree Sh.price Year
__________________________________
Flexicon Logi-truck 18
1999
Flexicon Supplyhouse 30
2001
buy.com reel.com
10
2000
...
...
...
...
Fact Extraction:
State of the Art (3)
Technical Approaches
• Manually-built ad-hoc extraction "rules"
• Manually-built FSTs
• Feature-based training from labeled instances
(Naive Bayes, Decision Trees)
• Hidden Markoff Models
• FSTs with feedback-driven turning
Applications of Text Extraction I (1)
Financial
• Email auto-response
– e.g. "What is the balance of account
N007623013?"
– First categorize as balance-request
– Then extract account number
Applications of Text Extraction I (2)
Financial
• Template filling from bank order
– e.g. "Please transfer 100,000 USD from
N007623013 to checking account A011129081
tomorrow“
– First categorize as transfer
Applications of Text Extraction I (3)
Financial
• Template filling from bank order
– Then extract:
[account-transfer
<from N00762301>
<to A01112908>
<amount 100,000>
<date ??>]
– Then employee checks template and adds/corrects
information such as missing date (e.g. if the system
cannot interpret "tomorrow")
Applications of Text Extraction II (1)
Informational
• For all seminar announcements in BB
extract time/title/speaker/location
• From email messages about proposed
meetings
extract time/participants/location
Applications of Text Extraction II (2)
Large-scale Wed applications
• Build DB of all job openings
– Categorize web pages as job descriptions
– Extract company/date/salary/level/...
– fill in relational DB with extracted info
• Whizbang! (a Pittsburgh eCompany) is doing just
this via its flipdog.com site
• Build DB of all web-posted resumes,
first categorizing pages as resumes,
then extracting key fields name/expertise/...
Applications of Text Extraction II (3)
Corporate Intelligence
• Extract key facts about competition web
sites
– New products offered
– Any changes to prices, sales, etc.
• Extract key facts about customers of
competitors