Text Mining: Finding Nuggets in Mountains of Textual Data

Jochen Doerre, Peter Gerstl, Roland Seiffert
IBM Germany, August 1999
Presenter: Tyler Carr
April 22, 2004
Outline
Motivation
Methodology
Feature Extraction
Clustering and Categorizing
Applications
Exam Questions

Motivation
90% of a company’s data cannot be analyzed with standard data mining:
Customer Letters
E-Mail Correspondence
Phone Call Recordings
Contracts
Technical Documentation
Patents
News Articles
Web Pages

Value of Text Mining
Rapid digestion of large document collections
Faster than human knowledge brokers
Objective and customizable analysis
Automation of tasks

Typical Applications
Summarizing documents
Monitoring relations among people, places, and organizations
Organizing documents by content
Organizing indices for search and retrieval (keyword finding)
Retrieving documents by content

Challenges in Text Mining
Information is in unstructured textual form
Natural Language (NL) interpretation is years away for computers
Text Mining deals with huge collections of documents

Two Text Mining Approaches
Knowledge Discovery
  Extraction of codified information (features)
Information Distillation
  Analysis of the feature distribution

Comparison with Data Mining
Data Mining
  Identify data sets
  Select features manually
  Prepare data
  Analyze distribution
Text Mining
  Identify documents
  Extract features
  Select features by algorithm
  Prepare data
  Analyze distribution

Feature Extraction
“To recognize and classify significant vocabulary items in unrestricted natural language texts.”
Classes of Vocabulary
  Proper names
  Technical phrases
  Abbreviations and acronyms
  …

Canonical Forms
Numbers convert to normal form
  Four ==> 4
Dates convert to normal form
Inflected forms convert to common form
  Sings, Sang, Sung ==> Sing
Alternative names convert to explicit form
  Mr. Carr, Tyler, Presenter ==> Tyler Carr

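The conversions on this slide can be sketched in a few lines of Python. This is only an illustration, not how Intelligent Miner for Text does it: the lookup tables, the accepted date formats, and the function name canonical_form are all assumptions made for the example, and lemmas are returned in lowercase.

```python
from datetime import datetime

# Hypothetical lookup tables for illustration only; a real extractor would
# rely on much larger lexicons and morphological rules.
NUMBER_WORDS = {"one": "1", "two": "2", "three": "3", "four": "4"}
LEMMAS = {"sings": "sing", "sang": "sing", "sung": "sing"}
NAME_VARIANTS = {
    "mr. carr": "Tyler Carr",
    "tyler": "Tyler Carr",
    "presenter": "Tyler Carr",
}
# Assumed input date styles; %B expects English month names in the default locale.
DATE_FORMATS = ("%B %d, %Y", "%d %B %Y", "%m/%d/%Y")


def canonical_form(term: str) -> str:
    """Map a surface form to its canonical form, or return it unchanged."""
    text = term.strip()
    key = text.lower()
    # Numbers, inflected forms, and alternative names are simple table lookups.
    for table in (NUMBER_WORDS, LEMMAS, NAME_VARIANTS):
        if key in table:
            return table[key]
    # Dates are normalized to ISO 8601 when one of the assumed formats matches.
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return text
```

With these tables, canonical_form("Four") gives "4" and canonical_form("Mr. Carr") gives "Tyler Carr"; a term that matches nothing is passed through unchanged.
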
Feature Extraction Tools
Linguistically motivated heuristics
Pattern matching
Limited amounts of lexical information
  Part-of-speech information (subject, verb)
Avoid analyzing too deeply (for speed)
  Does not use huge amounts of lexical information
  No in-depth syntactic and semantic analysis

Feature Extraction Example
Disambiguating Proper Names (Nominator Program)
Apply heuristics to strings, instead of interpreting semantics.
The unit of context for extraction is a document.
The heuristics represent English naming conventions.

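To make the idea of string heuristics concrete, here is a small Python sketch. It is not the Nominator algorithm; the two regular expressions and the function name resolve_names are assumptions chosen for the example. It treats one document as the unit of context and links a titled surname such as "Mr. Carr" to a full name found in the same document.

```python
import re

# Assumed naming conventions: a full name is a run of two or more capitalized
# words; a titled name is a title followed by one capitalized surname.
FULL_NAME = re.compile(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)+\b")
TITLED_NAME = re.compile(r"\b(?:Mr|Mrs|Ms|Dr)\. ([A-Z][a-z]+)")


def resolve_names(document: str) -> dict[str, str]:
    """Map each name variant found in one document to a canonical full name."""
    full_names = set(FULL_NAME.findall(document))
    resolved = {name: name for name in full_names}
    for match in TITLED_NAME.finditer(document):
        surname = match.group(1)
        # Link the variant only when exactly one full name shares the surname.
        candidates = [n for n in full_names if n.split()[-1] == surname]
        if len(candidates) == 1:
            resolved[match.group(0)] = candidates[0]
    return resolved


text = "Tyler Carr presented the paper. Later, Mr. Carr answered questions."
names = resolve_names(text)
```

No meaning is interpreted here: the code only matches capitalization and title patterns, which is why this kind of processing is fast. The same property makes it fragile, for example with capitalized words at the start of a sentence.
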
Feature Extraction Goals
Very fast processing to deal with huge amounts of data
Domain independence for general applicability

Clustering
Also called Knowledge Discovery
Fully automatic process
Partitions a given collection into groups of documents with similar content
Clusters identifiable by feature vectors
  Provides a set of keywords for each cluster

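A rough Python sketch of the idea on this slide: documents become word-count feature vectors, similar documents are grouped, and each group is labelled with its most frequent terms. The single-pass grouping, the cosine similarity measure, and the 0.3 threshold are assumptions made for illustration and are not the method used by the IBM clustering tools.

```python
import math
from collections import Counter


def vectorize(text: str) -> Counter:
    """Represent a document as a sparse bag-of-words feature vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def cluster(docs: list[str], threshold: float = 0.3) -> list[dict]:
    """Single-pass clustering: join the most similar cluster or start a new one."""
    clusters: list[dict] = []
    for index, text in enumerate(docs):
        vec = vectorize(text)
        best = max(clusters, key=lambda c: cosine(vec, c["centroid"]), default=None)
        if best is not None and cosine(vec, best["centroid"]) >= threshold:
            best["members"].append(index)
            best["centroid"].update(vec)  # summed counts act as the centroid
        else:
            clusters.append({"centroid": Counter(vec), "members": [index]})
    return clusters


def keywords(cluster_info: dict, n: int = 3) -> list[str]:
    """Describe a cluster by its n most frequent terms."""
    return [term for term, _ in cluster_info["centroid"].most_common(n)]
```

No categories are supplied in advance; the groups and their keyword labels emerge from the collection itself, which is the sense in which the process is fully automatic.
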
Two Clustering Engines
Hierarchical Clustering tool
  Orders the clusters into a tree reflecting various levels of similarity.
Binary Relational Clustering tool
  Produces a flat clustering together with relationships of different strength between the clusters
  Relationships reflect inter-cluster similarities

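The tree produced by a hierarchical clustering tool can be illustrated with a generic agglomerative procedure. The sketch below uses average-linkage merging over cosine similarity; both choices, and the function name hierarchical_cluster, are assumptions for the example rather than a description of the IBM tool. It expects at least one document.

```python
import math
from collections import Counter
from itertools import combinations


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def hierarchical_cluster(docs: list[str]):
    """Merge the two most similar clusters until one tree remains.

    Leaves are document indices; inner nodes are (left, right) tuples.
    """
    vectors = [Counter(d.lower().split()) for d in docs]
    # Each cluster is a pair: (subtree, list of member document indices).
    clusters = [(i, [i]) for i in range(len(docs))]

    def linkage(c1, c2) -> float:
        # Average pairwise similarity between the members of two clusters.
        pairs = [(i, j) for i in c1[1] for j in c2[1]]
        return sum(cosine(vectors[i], vectors[j]) for i, j in pairs) / len(pairs)

    while len(clusters) > 1:
        a, b = max(combinations(clusters, 2), key=lambda pair: linkage(*pair))
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(((a[0], b[0]), a[1] + b[1]))
    return clusters[0][0]
```

Documents merged early sit deep in the tree and are the most similar; merges near the root join loosely related groups, which gives the "various levels of similarity" mentioned above.
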
[Figure: Clustering Model]

Categorization
Also called Information Distillation
Topic Categorization Tool
Assigns documents to pre-existing categories (“topics” or “themes”)
Categories are chosen to match the intended use of the collection

Categorization
Categories are defined by providing a set of sample documents for each category
Training phase produces a special index, called the categorization schema
Categorization tool returns a set of category names and confidence levels for each document

Categorization
If confidence is below some threshold, the document is set aside for a human categorizer
Tests have shown the Topic Categorization Tool agrees with human categorizers to the same degree as human categorizers agree with one another.

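The training and categorization steps described on these slides can be sketched as follows. This is a simplified stand-in, not the Topic Categorization Tool: the "schema" here is just one summed word-count vector per category, the confidence level is taken to be cosine similarity, and the 0.2 threshold and all function names are assumptions for the example.

```python
import math
from collections import Counter


def vectorize(text: str) -> Counter:
    """Represent a document as a sparse bag-of-words feature vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def train(samples: dict[str, list[str]]) -> dict[str, Counter]:
    """Training phase: build a schema with one summed vector per category."""
    schema = {}
    for category, docs in samples.items():
        centroid = Counter()
        for doc in docs:
            centroid.update(vectorize(doc))
        schema[category] = centroid
    return schema


def categorize(text: str, schema: dict[str, Counter], threshold: float = 0.2):
    """Return ranked (category, confidence) pairs and a human-review flag."""
    vec = vectorize(text)
    scores = sorted(
        ((cosine(vec, centroid), name) for name, centroid in schema.items()),
        reverse=True,
    )
    ranked = [(name, round(score, 2)) for score, name in scores]
    # Low confidence in the best category: set the document aside for a person.
    needs_human = not scores or scores[0][0] < threshold
    return ranked, needs_human
```

A document that shares no vocabulary with any sample set scores 0.0 everywhere and is flagged for a human categorizer, mirroring the threshold rule above.
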
[Figure: Categorization Model]

IBM Intelligent Miner for Text
Software Development Kit (not a full application)
Contains necessary components for “real text mining”
Also contains more traditional components:
  IBM Text Search Engine
  IBM Web Crawler
  Drop-in Intranet search solutions

Applications
Customer Relationship Management application provided by IBM Intelligent Miner for Text, called Customer Relationship Intelligence (CRI)
  “Help companies better understand what their customers want and what they think about the company itself.”

Customer Intelligence Process
Take the body of communications with customers as input.
Cluster the documents to identify issues.
Characterize the clusters to identify the conditions for problems.
Assign new messages to the appropriate clusters.

Customer Intelligence Usage
Knowledge Discovery
  Clustering used to create a structure that can be interpreted
Information Distillation
  Refinement and extension of clustering results
    Interpreting the results
    Tuning of the clustering process
    Selecting meaningful clusters

Exam Question #1
Name an example of each of the two main classes of applications of text mining.
  Knowledge Discovery: Discovering a common customer complaint among much feedback.
  Information Distillation: Filtering future comments into pre-defined categories.

Exam Question #2
How does the procedure for text mining differ from the procedure for data mining?
  Adds feature extraction function
  Not feasible to have humans select features
  High-dimensional, sparsely populated feature vectors

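The last point of the answer can be made tangible with a toy measurement. The sketch below counts how many dimensions a bag-of-words representation needs and what fraction of the document-by-term matrix is actually non-zero. The whitespace tokenization and the function name sparsity are assumptions for the example.

```python
def sparsity(docs: list[str]) -> tuple[int, float]:
    """Return (number of dimensions, fraction of non-zero matrix entries)."""
    tokenized = [doc.lower().split() for doc in docs]
    # One dimension per distinct term in the whole collection.
    vocabulary = {term for tokens in tokenized for term in tokens}
    # Each document only populates the dimensions of the terms it contains.
    nonzero = sum(len(set(tokens)) for tokens in tokenized)
    total = len(docs) * len(vocabulary)
    return len(vocabulary), (nonzero / total if total else 0.0)


docs = [
    "late delivery of order",
    "refund for damaged item",
    "password reset link broken",
]
dimensions, fill = sparsity(docs)
```

Even for these three short messages, every document uses only a third of the 12 dimensions. The vocabulary keeps growing as documents are added while each document still touches only a handful of terms, which is why manual feature selection does not scale.
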
Exam Question #3
In the Nominator program of IBM’s Intelligent Miner for Text, an objective of the design is to enable rapid extraction of names from large amounts of text. How does this decision affect the ability of the program to interpret the semantics of text?
  Does not perform in-depth syntactic or semantic analysis of texts

Thank You
Any Questions?