Text Mining: Finding Nuggets in Mountains of Textual Data

Text Mining:
Finding Nuggets in Mountains of Textual Data
Authors: Jochen Doerre, Peter Gerstl, Roland Seiffert
Adapted from slides by: Trevor Crum
Presenter: Caitlin Baker
1
Outline
● Definition and Paper Overview
● Motivation
● Methodology
● Software Packages
● Feature Extraction
● Clustering and Categorizing
● Some Applications
● Comparison with Data Mining
● Conclusion & Exam Questions
2
Definition
● Text Mining:
○ The discovery by computer of new, previously
unknown information, by automatically extracting
information from different unstructured textual
documents.
○ Also referred to as text data mining; roughly
equivalent to text analytics, which refers more
specifically to problems in a business setting.
3
Paper Overview
● This paper introduced text mining and how it
differs from data mining proper.
● Focused on the tasks of feature extraction
and clustering/categorization
● Presented an overview of the tools/methods
of IBM’s Intelligent Miner for Text
4
Motivation
● A large portion of a company’s data is
unstructured or semi-structured – about 90%
in 1999!
• Letters
• Emails
• Phone transcripts
• Contracts
• Technical documents
• Patents
• Web pages
• Articles
6
Unstructured Data
Chapter   Date      Problem
32-31-01  1/1/1999  Water dripping on right hand lg. tom 9275-412
32-31-01  2/3/1999  Phil, rough landing lg seems to have a crack
32-31-01  4/1/1999  Saw leaking in the rh landing g. apr 1999
7
Text Mining Benefits
● Ability to quickly process large amounts of
textual data
● “Objectivity” and customizability of the
process
● Possibility to automate labor-intensive
routine tasks
8
Typical Applications
● Summarizing documents
● Discovering/monitoring relations among
people, places, organizations, etc.
● Customer profile analysis
● Trend analysis
● Spam identification
● Public health early warning
● Event tracks
● Predictive analytics
9
Methodology: Challenges
● Information is in unstructured textual form
● Natural language interpretation is a difficult and
complex task (not fully possible)
○ Google and Watson are a step closer
● Text mining deals with huge collections of
documents
○ Impossible for human examination
11
Google vs Watson
● Google justifies the answer by returning the text
documents where it found the evidence.
● Google finds documents that are most suitable to a
given keyword.
● Watson tries to understand the semantics behind a
given key phrase or question.
● Then Watson will use its huge knowledge base to
find the correct answer.
12
Methodology: Two Aspects
● Knowledge Discovery
○ Feature Extraction
○ Mining proper – determining some structure
● Information Distillation
○ Analysis of feature distribution
○ Mining on the basis of some pre-established structure
13
Two Text Mining
Approaches
● Extraction
○ Extraction of codified information from single
documents
● Analysis
○ Analysis of the features to detect patterns, trends, and
other similarities over whole collections of documents
14
IBM Intelligent Miner for
Text
● IBM introduced Intelligent Miner for Text in
1998
● SDK with feature extraction, clustering,
categorization, and more
● Traditional components (search engine, etc.)
16
IBM SPSS Text Analytics
● Clustering/categorization
● Extraction of words with ranking
● Produces graphical output
17
Advantages to IBM’s
approach
● Processing is very fast (helps when dealing
with huge amounts of data)
● Heuristics work reasonably well
● Generally applicable to any domain
18
SAS Text Miner
● Term profiling and trending
● Document theme discovery
● Visual integration of results
19
Feature Extraction
● Recognize and classify “significant”
vocabulary items from the text
● Categories of vocabulary
21
Extracted Information
Classified into Categories
● Names of persons, organizations, and places
● Multiword terms
● Abbreviations
● Relations
● Other useful stuff: numerical or textual forms
of numbers, percentages, dates, currency
amounts, etc.
22
Canonical Form Examples
● Normalize numbers, money
○ Four = 4, five-hundred dollars = $500
● Conversion of date to normal form
○ 8/17/1992 = August 17 1992
● Morphological variants
○ Drive, drove, driven = drive
● Proper names and other forms
○ Mr. Johnson, Bob Johnson, The author = Bob
Johnson
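A minimal sketch of how such canonical forms might be computed, covering only the number, date, and morphological-variant cases. The lookup tables and function names are hypothetical and tiny on purpose; this is not how Intelligent Miner for Text implements normalization.

```python
import re
from datetime import date

# Hypothetical lookup tables; a real system would rely on much larger lexicons.
NUMBER_WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}
LEMMAS = {"drove": "drive", "driven": "drive", "drives": "drive"}
MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

def canonical_number(token):
    """Map a spelled-out number to digits; leave other tokens unchanged."""
    value = NUMBER_WORDS.get(token.lower())
    return str(value) if value is not None else token

def canonical_date(text):
    """Rewrite a US-style M/D/YYYY date as 'Month D YYYY'."""
    match = re.fullmatch(r"(\d{1,2})/(\d{1,2})/(\d{4})", text)
    if match is None:
        raise ValueError(f"not an M/D/YYYY date: {text!r}")
    month, day, year = (int(part) for part in match.groups())
    date(year, month, day)  # raises ValueError for impossible dates
    return f"{MONTHS[month - 1]} {day} {year}"

def canonical_word(token):
    """Map a morphological variant to its base form (lowercased)."""
    lowered = token.lower()
    return LEMMAS.get(lowered, lowered)
```

For example, `canonical_date("8/17/1992")` yields the date form shown above, and `canonical_word("driven")` maps to `drive`.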
23
Feature Extraction
Approach
● Linguistically motivated heuristics
● Pattern matching
● Limited lexical information (part-of-speech)
● Avoid analyzing with too much depth
○ Does not use too much lexical information
○ No in-depth syntactic or semantic analysis
24
Feature Extraction Ex.
Chapter   Date      Problem
32-31-01  1/1/1999  Water dripping on right hand lg. tom 9275-412
32-31-01  2/3/1999  Phil, rough landing lg seems to have a crack
32-31-01  4/1/1999  Saw leaking in the rh landing g. apr 1999
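The kind of shallow, pattern-based extraction described earlier could be sketched on these records as follows. Everything specific here is an assumption made for illustration: the abbreviation expansions ("lg", "rh"), the symptom vocabulary, and the guess that a token shaped like 9275-412 is some kind of reference code.

```python
import re

# Hypothetical abbreviation table; the expansions are guesses, not from the paper.
ABBREVIATIONS = {"lg": "landing gear", "rh": "right hand"}
# Assumed shape of a reference code such as 9275-412.
REFERENCE_CODE = re.compile(r"\b\d{4}-\d{3}\b")
# Assumed symptom vocabulary.
SYMPTOMS = ("dripping", "leaking", "crack")

def extract_features(problem):
    """Pull a few shallow features out of one free-text problem report."""
    tokens = re.findall(r"[a-z]+", problem.lower())
    return {
        "reference_codes": REFERENCE_CODE.findall(problem),
        "components": sorted({ABBREVIATIONS[t] for t in tokens if t in ABBREVIATIONS}),
        "symptoms": [s for s in SYMPTOMS if s in tokens],
    }

reports = [
    "Water dripping on right hand lg. tom 9275-412",
    "Phil, rough landing lg seems to have a crack",
    "Saw leaking in the rh landing g. apr 1999",
]
features = [extract_features(report) for report in reports]
```

No syntactic or semantic analysis is involved: the sketch only matches tokens and patterns, which is why it is fast but misses variants it has no rule for (for example, "right hand" written out in full).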
25
Clustering
● Fully automatic process
● Documents are grouped according to
similarity of their feature vectors
● Each cluster is labeled by a listing of the
common terms/keywords
● Good for getting an overview of a document
collection
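A rough sketch of the idea, assuming bag-of-words feature vectors, cosine similarity, and a simple greedy grouping rule. The threshold and the grouping strategy are illustrative choices, not the algorithm used by the IBM tools.

```python
import math
from collections import Counter

def feature_vector(text):
    """Bag-of-words feature vector: term -> count."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def cluster(docs, threshold=0.3):
    """Greedy grouping: a document joins the first cluster whose seed is similar enough."""
    vectors = [feature_vector(d) for d in docs]
    clusters = []  # each cluster is a list of document indices
    for i, vec in enumerate(vectors):
        for members in clusters:
            if cosine(vec, vectors[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters

def label(docs, members):
    """Label a cluster with the terms shared by all of its documents."""
    term_sets = [set(docs[i].lower().split()) for i in members]
    return sorted(set.intersection(*term_sets))

docs = [
    "landing gear leaking fluid",
    "fluid leaking from landing gear",
    "engine noise on takeoff",
    "loud engine noise",
]
groups = cluster(docs)
labels = [label(docs, members) for members in groups]
```

Here the two landing-gear reports end up in one cluster and the two engine reports in another, each labeled by the terms its members have in common.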
27
Two Clustering Engines
● Hierarchical clustering
○ Orders the clusters into a tree reflecting various levels
of similarity
● Binary relational clustering
○ Flat clustering
○ Relationships of different strengths between clusters,
reflecting similarity
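To make the tree idea concrete, here is a small agglomerative sketch: it repeatedly merges the two most similar clusters, comparing clusters by the Jaccard overlap of their pooled term sets. The similarity measure and merge rule are assumptions for illustration and are not taken from the IBM engine.

```python
def jaccard(a, b):
    """Jaccard overlap between two term sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def build_tree(docs):
    """Return a nested tuple of document indices describing the merge order."""
    if not docs:
        raise ValueError("need at least one document")
    # Each node is (subtree, pooled term set); leaves are document indices.
    nodes = [(i, set(d.lower().split())) for i, d in enumerate(docs)]
    while len(nodes) > 1:
        best = None
        for x in range(len(nodes)):
            for y in range(x + 1, len(nodes)):
                score = jaccard(nodes[x][1], nodes[y][1])
                if best is None or score > best[0]:
                    best = (score, x, y)
        _, x, y = best
        merged = ((nodes[x][0], nodes[y][0]), nodes[x][1] | nodes[y][1])
        nodes = [n for k, n in enumerate(nodes) if k not in (x, y)] + [merged]
    return nodes[0][0]

tree = build_tree([
    "landing gear leaking fluid",
    "fluid leaking from landing gear",
    "engine noise on takeoff",
    "loud engine noise",
])
```

The nesting of the result reflects the levels of similarity: the most similar documents are paired first, and unrelated groups are joined only at the root.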
28
Clustering Model
29
Categorization
● Assigns documents to preexisting categories
● Classes of documents are defined by
providing a set of sample documents.
● Training phase produces “categorization
schema”
● Documents can be assigned to more than
one category
● If confidence is low, document is set aside
for human intervention
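The steps above can be sketched as follows, assuming a very simple "categorization schema": one summed term-count vector per category, with cosine similarity as the confidence score. The threshold, sample data, and function names are hypothetical; the actual training method of the IBM categorizer is not described here.

```python
import math
from collections import Counter

def vectorize(text):
    """Bag-of-words feature vector: term -> count."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def train(samples):
    """Training phase: build one summed term vector per category from sample documents."""
    schema = {}
    for category, docs in samples.items():
        total = Counter()
        for doc in docs:
            total.update(vectorize(doc))
        schema[category] = total
    return schema

def categorize(text, schema, threshold=0.3):
    """Return every category scoring at or above the threshold.

    An empty list means confidence is low everywhere, so the document
    would be set aside for a human to review.
    """
    vec = vectorize(text)
    return sorted(c for c, centroid in schema.items() if cosine(vec, centroid) >= threshold)

schema = train({
    "billing": ["invoice charged twice", "refund for wrong invoice"],
    "delivery": ["package arrived late", "late delivery of package"],
})
both = categorize("refund for late package", schema)
unknown = categorize("hello there", schema)
```

The first message is assigned to both categories, while the second matches neither and would go to a human reviewer.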
30
Categorization Model
31
Applications
● Aircraft Faults using IBM SPSS Text
Analytics
● Customer Relationship Management
application provided by IBM Intelligent Miner
for Text called “Customer Relationship
Intelligence” or CRI
○ “Help companies better understand what their
customers want and what they think about the
company itself”
33
Aircraft Faults
● Take as input free-form text from operators
and aircraft mechanics
● Cluster the documents to identify faults
● Characterize the clusters to identify the
conditions for faults
● Determine the most common fault for a certain
component
34
Customer Intelligence
Process
● Take as input the body of communications with
customers
● Cluster the documents to identify issues
● Characterize the clusters to identify the
conditions for problems
● Assign new messages to appropriate
clusters
35
Applications Summary
● Knowledge Discovery
○ Clustering used to create a structure that can be
interpreted
● Information Distillation
○ Refinement and extension of clustering results
■ Interpreting the results
■ Tuning of the clustering process
■ Selecting meaningful clusters
36
Comparison with Data
Mining
● Data mining
○ Discovers hidden models.
○ Tries to generalize all of the data into a single
model.
○ Marketing, medicine, health care
● Text mining
○ Discovers hidden facts.
○ Tries to understand the details and cross-reference
between individual instances.
○ Biosciences, customer profile analysis
38
Conclusion
● Text mining can be used as an effective
business tool that supports
○ Creation of knowledge by preparing and organizing
unstructured textual data [Knowledge Discovery]
○ Extraction of relevant information from large amounts
of unstructured textual data through automatic
preselection based on user-defined criteria
[Information Distillation]
40
Exam Question #1
● How does the procedure for text mining differ
from the procedure for data mining?
○ Adds feature extraction phase
○ Infeasible for humans to select features manually
○ The feature vectors are, in general, highly
dimensional and sparse
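A small illustration of why the feature vectors are high-dimensional and sparse, assuming one dimension per distinct term in the collection. The `sparsity` helper and the sample reports are hypothetical.

```python
def sparsity(docs):
    """Fraction of zero entries in the document-by-term matrix."""
    term_sets = [set(d.lower().split()) for d in docs]
    vocabulary = set().union(*term_sets)
    cells = len(docs) * len(vocabulary)
    nonzero = sum(len(terms) for terms in term_sets)
    return 1 - nonzero / cells if cells else 0.0

reports = [
    "water dripping on gear",
    "rough landing crack",
    "leaking in gear",
]
share_of_zeros = sparsity(reports)
```

Even with three short reports, most cells of the matrix are zero; each new document tends to add new terms (dimensions) that the other documents do not use, so the effect grows with collection size.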
41
Exam Question #2
● What is one application of text mining and
why would that application be beneficial?
○ Customer Relationship Management application
provided by IBM Intelligent Miner for Text called
“Customer Relationship Intelligence” or CRI
○ “Help companies better understand what their
customers want and what they think about the
company itself”
42
Exam Question #3
● What are three benefits of text mining?
○ Efficiency
○ Customizability
○ Automation of tasks
43
Questions?
44