Text Mining: Finding Nuggets in Mountains of Textual Data
Jochen Dörre, Peter Gerstl, and Roland Seiffert
Overview
Introduction to Mining Text
How Text Mining Differs from Data Mining
Mining Within a Document: Feature Extraction
Mining in Collections of Documents: Clustering and Categorization
Text Mining Applications
Exam Questions/Answers
Introduction to Mining Text
Reasons for Text Mining
[Bar chart: percentage of corporate knowledge stored as collections of text vs. as structured data]
Corporate Knowledge “Ore”
Email
Insurance claims
News articles
Web pages
Patent portfolios
Customer complaint letters
Contracts
Transcripts of phone calls with customers
Technical documents
Challenges in Text Mining
Information is in unstructured textual form.
Not readily accessible for use by computers.
Dealing with huge collections of documents.
Two Mining Phases
Knowledge Discovery: extraction of codified information (features)
Information Distillation: analysis of the feature distribution
How Text Mining Differs from
Data Mining
Comparison of Procedures
Data Mining
Identify data sets
Select features
Prepare data
Analyze distribution
Text Mining
Identify documents
Extract features
Select features by algorithm
Prepare data
Analyze distribution
IBM Intelligent Miner for Text
SDK: Software Development Kit
Contains necessary components for “real text mining”
Also contains more traditional components:
IBM Text Search Engine
IBM Web Crawler
Drop-in intranet search solutions
Mining Within a Document: Feature Extraction
Feature Extraction
To recognize and classify significant vocabulary items in unrestricted natural language texts.
Let’s see an example…
Example of Vocabulary found
Certificate of deposit
CMOs
Commercial bank
Commercial paper
Commercial Union Assurance
Commodity Futures Trading Commission
Consul Restaurant
Convertible bond
Credit facility
Credit line
Debt security
Debtor country
Detroit Edison
Digital Equipment
Dollars of debt
End-March
Enserch
Equity warrant
Eurodollar
…
Implementation of Feature Extraction relies on
Linguistically motivated heuristics
Pattern matching
Limited amounts of lexical information, such as part-of-speech information
Not used: huge amounts of lexicalized information
Not used: in-depth syntactic and semantic analyses of texts
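To make the pattern-matching idea concrete, here is a toy extractor that treats runs of capitalized words as candidate names and multiword terms. The regex and the stopword list are assumptions of this sketch, not the actual Intelligent Miner heuristics:

```python
import re

# Sentence-initial function words to strip from candidates (assumed list).
LEADING_STOPWORDS = {"The", "A", "An", "In", "On", "At", "By"}

def extract_candidate_terms(text):
    """Toy heuristic: runs of capitalized words become candidate
    names/multiword terms. No lexicon, no syntactic analysis."""
    pattern = r"[A-Z][A-Za-z]*(?:\s+[A-Z][A-Za-z]*)*"
    candidates = []
    for match in re.finditer(pattern, text):
        words = match.group().split()
        # Drop a capitalized sentence-initial stopword like "The".
        while words and words[0] in LEADING_STOPWORDS:
            words = words[1:]
        if words:
            candidates.append(" ".join(words))
    return candidates

print(extract_candidate_terms(
    "The Commodity Futures Trading Commission fined Digital Equipment."))
# ['Commodity Futures Trading Commission', 'Digital Equipment']
```

Even this crude rule recovers vocabulary items like those on the previous slide, which is why shallow heuristics can be fast enough for mass data.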
Goals of Feature Extraction
Very fast processing to be able to deal with mass data
Domain independence for general applicability
Extracted information categories
Names of persons, organizations, and places
Multiword terms
Abbreviations
Relations
Other useful stuff
Canonical Forms
Normalized forms of dates, numbers, …
Allows applications to use information very easily
Abstracts from the different morphological variants of a single term
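A minimal sketch of what normalizing dates might look like; the two surface formats covered and the function name are assumptions of this example, not the toolkit's actual normalizer:

```python
import re

MONTHS = {m: i + 1 for i, m in enumerate(
    ["January", "February", "March", "April", "May", "June", "July",
     "August", "September", "October", "November", "December"])}

def canonical_date(text):
    """Normalize 'March 12, 1998' or '12 March 1998' to '1998-03-12'.
    Unrecognized strings are returned unchanged."""
    m = re.fullmatch(r"(\w+) (\d{1,2}), (\d{4})", text)
    if m and m.group(1) in MONTHS:
        return f"{m.group(3)}-{MONTHS[m.group(1)]:02d}-{int(m.group(2)):02d}"
    m = re.fullmatch(r"(\d{1,2}) (\w+) (\d{4})", text)
    if m and m.group(2) in MONTHS:
        return f"{m.group(3)}-{MONTHS[m.group(2)]:02d}-{int(m.group(1)):02d}"
    return text

print(canonical_date("March 12, 1998"))  # 1998-03-12
print(canonical_date("12 March 1998"))   # 1998-03-12
```

Once both variants map to the same string, an application can count or compare dates without knowing which surface form each document used.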
Canonical Names
Variants found: President Bush, Mr. Bush, George Bush
Canonical name: George Bush
The canonical name is the most explicit, least ambiguous name constructed from the different variants found in the document.
Reduces the ambiguity of variants.
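The variant-merging idea can be sketched as "pick the most explicit variant"; the title list and the scoring rule below are illustrative assumptions, not the Nominator algorithm:

```python
# Titles to ignore when comparing variants (assumed, illustrative list).
TITLES = {"Mr.", "Mrs.", "Ms.", "Dr.", "President", "Senator"}

def canonical_name(variants):
    """Pick the most explicit variant: the one with the most
    non-title words, with titles stripped from the result."""
    def content_words(variant):
        return [w for w in variant.split() if w not in TITLES]
    best = max(variants, key=lambda v: len(content_words(v)))
    return " ".join(content_words(best))

print(canonical_name(["President Bush", "Mr. Bush", "George Bush"]))
# George Bush
```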
Disambiguating Proper Names: The Nominator Program
Principles of Nominator Design
Apply heuristics to strings, instead of interpreting semantics.
The unit of context for extraction is a document.
The unit of context for aggregation is a corpus.
The heuristics represent English naming conventions.
Mining in Collections of Documents: Clustering and Categorization
1. Clustering
Partitions a given collection into groups of documents similar in content, i.e., in their feature vectors.
Two clustering engines:
Hierarchical Clustering tool
Binary Relational Clustering tool
Both tools help to identify the topic of a group by listing terms or words that are common in the documents of that group.
Thus, clustering provides an overview of the contents of a collection of documents.
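Neither engine is described in detail here, but the core idea of grouping documents by feature-vector similarity can be sketched with a simple single-pass algorithm; the cosine measure and the threshold are choices of this example, not of the toolkit:

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity of two sparse feature vectors (Counters)."""
    dot = sum(weight * b.get(feat, 0) for feat, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def cluster(docs, threshold=0.3):
    """Single-pass clustering: each document (a list of extracted
    features) joins the first cluster whose centroid is similar
    enough, otherwise it starts a new cluster."""
    clusters = []  # list of (centroid Counter, member indices)
    for i, features in enumerate(docs):
        vec = Counter(features)
        for centroid, members in clusters:
            if cosine(vec, centroid) >= threshold:
                centroid.update(vec)  # fold the document into the centroid
                members.append(i)
                break
        else:
            clusters.append((Counter(vec), [i]))
    return [members for _, members in clusters]

docs = [["stock", "bond"], ["bond", "stock", "yield"], ["soccer", "goal"]]
print(cluster(docs))  # [[0, 1], [2]]
```

Listing the most frequent features in each cluster's centroid is one way to label a group's topic, as the slide describes.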
2. Categorization
Topic Categorization Tool
Assigns documents to preexisting categories (“topics” or “themes”)
Categories are chosen to match the intended use of the collection
Categories are defined by providing a set of sample documents for each category
2. Categorization (cont.)
This “training” phase produces a special index, called the categorization schema.
The categorization tool returns a list of category names and confidence levels for each document.
If the confidence level is low, the document is set aside for a human categorizer.
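The train-then-categorize flow with a confidence cutoff can be sketched as a nearest-centroid classifier; the similarity measure, threshold, and function names are assumptions of this sketch, not the Topic Categorization tool's actual method:

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity of two sparse feature vectors (Counters)."""
    dot = sum(w * b.get(f, 0) for f, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def train(samples):
    """Build a 'categorization schema': one feature centroid per
    category, from that category's sample documents."""
    return {cat: Counter(f for doc in docs for f in doc)
            for cat, docs in samples.items()}

def categorize(schema, features, threshold=0.2):
    """Return (category, confidence), or None to route the document
    to a human categorizer when confidence is too low."""
    vec = Counter(features)
    score, cat = max((cosine(vec, c), name) for name, c in schema.items())
    return (cat, score) if score >= threshold else None

schema = train({"finance": [["stock", "bond"], ["bond", "yield"]],
                "sports": [["goal", "soccer"]]})
print(categorize(schema, ["stock", "yield"]))  # ('finance', ~0.58)
print(categorize(schema, ["weather"]))         # None -> human review
```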
2. Categorization (cont.)
Effectiveness: tests have shown that the Topic Categorization tool agrees with human categorizers to the same degree as human categorizers agree with one another.
[Diagram: a set of sample documents feeds the training phase, which produces the special index used to categorize new documents, returning a list of category names and confidence levels for each document.]
Text Mining Applications
Main advantages of mining technology over the traditional “information broker” business:
Ability to quickly process large amounts of textual data
“Objectivity” and customizability
Automation
Applications are used to:
Gain insights about trends and relations between people/places/organizations
Classify and organize documents according to their content
Organize repositories of document-related meta-information for search and retrieval
Retrieve documents
Main Applications
Knowledge Discovery
Information Distillation
CRI: Customer Relationship Intelligence
Appropriate documents are selected
Converted to a common format
Feature extraction and clustering tools are used to create a database
The user may select parameters for the preprocessing and clustering steps
Clustering produces groups of feedback that share important linguistic elements
The categorization tool is used to assign new incoming feedback to the identified categories.
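The steps above can be sketched end to end; the normalization, the tokenizer, and the overlap rule below are stand-ins invented for this example, not the actual CRI components:

```python
def cri_pipeline(raw_feedback):
    """Toy end-to-end sketch of the CRI flow described above."""
    # 1. Convert documents to a common format (here: lowercased text).
    docs = [text.lower() for text in raw_feedback]
    # 2. Extract features (here: naive whitespace tokens).
    features = [set(doc.split()) for doc in docs]
    # 3. Cluster: group feedback sharing linguistic elements
    #    (toy rule: any shared token with the cluster's first member).
    clusters = []
    for i, feats in enumerate(features):
        for members in clusters:
            if features[members[0]] & feats:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters

print(cri_pipeline(["Billing error on invoice",
                    "Invoice was billed twice",
                    "Great service"]))  # [[0, 1], [2]]
```

In the real product each stage is far richer (feature extraction, tunable clustering, then categorization of new feedback), but the data flow is the same: normalize, extract, group, label.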
CRI (continued)
Knowledge Discovery
Clustering is used to create a structure that can be interpreted
Information Distillation
Refinement and extension of the clustering results:
Interpreting the results
Tuning the clustering process
Selecting meaningful clusters
Exam Question #1
Name an example of each of the two main classes of applications of text mining.
Knowledge Discovery: discovering a common customer complaint among a large volume of feedback.
Information Distillation: filtering future comments into predefined categories.
Exam Question #2
How does the procedure for text mining differ from the procedure for data mining?
Text mining adds a feature extraction function.
It is not feasible to have humans select features.
Its feature vectors are highly dimensional and sparsely populated.
Exam Question #3
In the Nominator program of IBM’s Intelligent Miner for Text, an objective of the design is to enable rapid extraction of names from large amounts of text. How does this decision affect the ability of the program to interpret the semantics of text?
It does not perform in-depth syntactic or semantic analyses of texts.
THE END
http://www-3.ibm.com/software/data/iminer/fortext/