An Introduction to Text Mining - Information Resource Management
An Introduction to Text Mining
Tim Daciuk
SPSS, Inc.
Services Manager, Canada
Copyright 2003-4, SPSS Inc.
Agenda
Introductions
An Overview of Document Warehousing
Understanding Unstructured Text
Concept Extraction
Text Mining
Data Mining
Demonstration
Tim Daciuk
Background
SPSS
Social research
Survey research
25 years working with the product
12 years working with the company
5 years working with text analysis
Prior history
Consulting
Education
Predictive Analytics: Defined
Predictive analysis helps connect data to effective action by drawing reliable conclusions about current conditions and future events.
— Gareth Herschel, Research Director, Gartner Group
SPSS At A Glance
Leadership
Stability
Founded in 1968
30+ year heritage in analytic technologies
Proven track record
Market leader in Predictive Analytics
Focus on online & offline customer data acquisition and analysis
250,000+ customers worldwide
NASDAQ: SPSS
Analytics standard
80% of Fortune 500 are SPSS customers
80% plus market share in Survey & Market Research sector
Ranked #1 Data Mining solution by KDnuggets
Some of Our Brands
Unstructured Data Management
Text Mining is a subset of Unstructured Data Management.
UDM can be broken down into:
Content and Document Management
Search and Retrieval
XML database and tools
Categorization, Classification, and Visualization
80% of Data is Unstructured
Database notes:
Call center transcripts
Other CRM
Email
Open-ended survey responses
Web pages
NewsGroups
Documents themselves
Competitive information
Applications for Text Analysis
Surveys
‘Reading’ email
Call centre data
Comment data
Abstracts
Document management
Corporate history
Thematic understanding of website
Data Warehouse vs. Document Warehouse
Data warehouse
Who, what, when, where, how much
Internally focused
Operational information
Rarely include external information
Document warehouse
Why
May not be internally focused
May contain a range of information
Often integrate external information
Document Warehouse Features
There is no single document structure or document type
Documents are drawn from multiple sources
Essential features of documents are automatically extracted and explicitly stored in the document warehouse
Document warehouses are designed to integrate semantically related documents
Building the Document Warehouse
Identify Sources
Retrieve Document
Pre-process Document
Text Analysis
Compile Metadata
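The five stages on this slide can be sketched as a small pipeline. This is only an illustration under stated assumptions: the in-memory `SOURCES` dictionary, the function names, and the word-count "analysis" are all hypothetical stand-ins, not part of any SPSS product.

```python
import re
from collections import Counter

# Hypothetical in-memory sources; a real warehouse would read files, mail, or web pages.
SOURCES = {
    "memo-1": "<p>Interest rates rose.  The Bank of Canada acted.</p>",
    "memo-2": "Call centre notes: customer asked about interest rates.",
}

def identify_sources():
    """Stage 1: list the document identifiers we know about."""
    return sorted(SOURCES)

def retrieve(doc_id):
    """Stage 2: fetch the raw document."""
    return SOURCES[doc_id]

def preprocess(raw):
    """Stage 3: strip markup and normalise whitespace."""
    text = re.sub(r"<[^>]+>", " ", raw)
    return re.sub(r"\s+", " ", text).strip()

def analyse(text):
    """Stage 4: a stand-in for text analysis (plain word counts)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def compile_metadata(doc_id, text, counts):
    """Stage 5: store extracted features explicitly alongside the document id."""
    return {"id": doc_id, "length": len(text), "term_counts": counts}

def build_warehouse():
    warehouse = []
    for doc_id in identify_sources():
        text = preprocess(retrieve(doc_id))
        warehouse.append(compile_metadata(doc_id, text, analyse(text)))
    return warehouse

for entry in build_warehouse():
    print(entry["id"], entry["length"], entry["term_counts"]["rates"])
```

Each stage is a separate function so that a different retrieval or analysis step could be swapped in without touching the others.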
Predict, Impact, Deploy
[Diagram labels: Data Collection, Customer Data, Surveys, Web, Text, Channel; NLP, Information Extraction, Categorization, Clustering, Concept Maps, Trending, Prediction; Concepts, Attitudes, Actions, Attributes, Outcomes; Attract, Grow, Retain, Fraud; Business UI, Expert UI, Operational Systems, Business User]
The Building Blocks of Language
Morphology
Syntax
Semantics
Phonology
Pragmatics
Morphology
Understanding words
Noun
Stems
Affixes
Prefix
Suffix
Inflectional elements
Reduces complexity of analysis
Reduces complexity of representation
Supports text mining
Example: in- (prefix) + dispute (noun stem) + -able (suffix)
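As a rough illustration of splitting a word into prefix, stem, and suffix, here is a toy affix stripper. The `PREFIXES` and `SUFFIXES` lists and the `split_affixes` function are assumptions made for this sketch. A real morphological analyser uses a lexicon and spelling rules; this one does not restore the dropped "e" in "dispute" and would wrongly strip "in" from a word such as "interest".

```python
# Hypothetical, deliberately tiny affix lists for illustration only.
PREFIXES = ("in", "un", "re")
SUFFIXES = ("able", "ing", "s")

def split_affixes(word):
    """Return (prefix, stem, suffix); an empty string means no affix matched."""
    word = word.lower()
    # Require at least three characters to remain after removing an affix.
    prefix = next(
        (p for p in PREFIXES if word.startswith(p) and len(word) > len(p) + 2), ""
    )
    rest = word[len(prefix):]
    suffix = next(
        (s for s in SUFFIXES if rest.endswith(s) and len(rest) > len(s) + 2), ""
    )
    stem = rest[: len(rest) - len(suffix)] if suffix else rest
    return prefix, stem, suffix

print(split_affixes("indisputable"))  # ('in', 'disput', 'able')
print(split_affixes("rates"))         # ('', 'rate', 's')
```

Mapping many surface forms to one stem is what reduces the number of distinct items the later analysis has to represent.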
Syntax
The Bank of Canada will curb inflation with higher interest rates
Sentence
Noun phrase: The (adjective), Bank of Canada (noun)
Verb phrase: will (aux), curb (verb), inflation (noun)
Prepositional phrase: with, followed by a noun phrase
Noun phrase: higher (adjective), interest rates (noun)
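One way to hold a parse like this in code is a nested (label, children) structure. The sketch below is an assumption about representation only: the slide does not show where the prepositional phrase attaches, so it is placed directly under the sentence here, and the "Preposition" label is added for illustration.

```python
# Phrases are (label, [children]); leaves are (label, "words").
TREE = ("Sentence", [
    ("Noun phrase", [("Adjective", "The"), ("Noun", "Bank of Canada")]),
    ("Verb phrase", [("Aux", "will"), ("Verb", "curb"), ("Noun", "inflation")]),
    ("Prepositional phrase", [
        ("Preposition", "with"),  # label assumed; the slide leaves "with" unlabelled
        ("Noun phrase", [("Adjective", "higher"), ("Noun", "interest rates")]),
    ]),
])

def leaves(node):
    """Return the words under a node, left to right."""
    _label, body = node
    if isinstance(body, str):
        return [body]
    return [word for child in body for word in leaves(child)]

def phrases(node, wanted):
    """Collect the text of every phrase carrying the wanted label."""
    label, body = node
    if isinstance(body, str):
        return []
    found = [" ".join(leaves(node))] if label == wanted else []
    for child in body:
        found.extend(phrases(child, wanted))
    return found

print(" ".join(leaves(TREE)))
print(phrases(TREE, "Noun phrase"))  # ['The Bank of Canada', 'higher interest rates']
```

Pulling out noun phrases this way is the kind of step that makes syntax useful for concept extraction.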
Semantics
The meaning of it all
Approaches to meaning
Semantic networks
Deductive logic
Rule-based systems
Useful for classification
Problems with NLP
Limitations of Natural Language Processing
Correctly identifying the role of noun phrases
Representing abstract concepts
Classifying synonyms
Representing the number of concepts
Problems with NLP
Limitations of technology
Language specific designs are required
Classification speed
Classifying hybrid words and sentences
Underlying Technology is Based on Linguistics
Text is unstructured, ambiguous, and language dependent.
The Linguistic Approach:
Does not treat a document as a bag of words
Removes ambiguity by extracting structured concepts
Concepts are the DNA of text.
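To see why a bag of words loses information, consider this minimal sketch. The two sentences are invented for illustration: they contain the same words in a different order, so their bags are identical even though their structure differs.

```python
from collections import Counter

def bag_of_words(text):
    """Count words, ignoring order and structure."""
    return Counter(text.lower().split())

a = "the bank raised rates"
b = "rates raised the bank"

print(bag_of_words(a) == bag_of_words(b))  # True: the bags cannot tell them apart
print(a.split() == b.split())              # False: word order still differs
```

A linguistic approach keeps the structure that the counts throw away.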
From Text to Concepts
Linguistic Terminology Extractor: Morphology, Syntax, Semantics, Statistics
Accurate
Compound words: Inserm; merck & co…
Proper nouns: tnp-470; glut-4…
Terms: factor receptor; Inhibitory effect;
Figures
Named entities: D. John Paganoni, ..; London, Paris…
Positive/Negative opinion…
Scalable
Speed: 1GB/hour
Multiple formats: PDF, MS Office, text…
Multiple languages: English, French, German, Spanish, Italian, Dutch, Japanese
Domain specifics
Customizable
SPSS dictionaries: Names, Orgs…, MeSH, genes...
User dictionaries: Synonyms, stop words..
Extraction patterns
Extraction rules: Predicates
Discovery-Oriented
Known terms
Unknown terms
New terms
Trends
From Concepts to Predictive Analytics Components
Linguistic Terminology Extractor
LexiQuest Mine: Discover concepts, relationships and trends
LexiQuest Categorize: Understand documents and assign them to pre-defined categories
Text Mining for Clementine: Add text fields to data mining for better prediction
Concept Extraction Engine
The extractor turns unstructured text into concepts:
Visualization
LexiQuest
Mine
Probabilities
Clementine
LexiQuest
Categorize
LexiQuest Extractor Engine
Linguistic Processor
Part-of-Speech Tagging
a: adjective
b: adverb
c: preposition
d: determiner
n: noun
v: verb
o: coordination
p: participle
s: stop word
How is a Concept Extracted?
Step 1: Part-of-Speech Tagging
Using/V a/P tool/N like/A LexiQuest/N Mine/N is/V a/P great/A
idea/N for/P any/A organization/N that/P is/V interested/V in/P maintaining/V
information/N on/P competitive/N intelligence./N
How is a Concept Extracted?
Step 2: Matching to Known Patterns
This:
V P N A N N V P A N P A N P V V P V N P N N
Looks Most Like:
NCDNN
(32 Known patterns for English)
How is the Concept Extracted?
The extractor looks at this sentence:
Using a tool like LexiQuest Mine is a great idea for any organization that is interested in maintaining information on competitive intelligence.
And extracts the concept:
Competitive Intelligence
Concepts are:
Noun based
Can be longer than one word
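A toy version of the idea can be sketched by collecting runs of adjacent noun tags. This is not the LexiQuest algorithm, whose pattern set is not given here. The tags are the ones shown in Step 1, and the rule "keep runs of two or more N tags" is an assumption made for illustration.

```python
# Word/tag pairs as shown in Step 1 of the slides.
TAGGED = [
    ("Using", "V"), ("a", "P"), ("tool", "N"), ("like", "A"),
    ("LexiQuest", "N"), ("Mine", "N"), ("is", "V"), ("a", "P"), ("great", "A"),
    ("idea", "N"), ("for", "P"), ("any", "A"), ("organization", "N"),
    ("that", "P"), ("is", "V"), ("interested", "V"), ("in", "P"),
    ("maintaining", "V"), ("information", "N"), ("on", "P"),
    ("competitive", "N"), ("intelligence.", "N"),
]

def noun_runs(tagged, min_len=2):
    """Return maximal runs of consecutive N-tagged words of at least min_len words."""
    runs, current = [], []
    for word, tag in tagged:
        if tag == "N":
            current.append(word.strip(".,;"))
            continue
        if len(current) >= min_len:
            runs.append(" ".join(current))
        current = []
    if len(current) >= min_len:  # flush a run that ends the sentence
        runs.append(" ".join(current))
    return runs

print(noun_runs(TAGGED))  # ['LexiQuest Mine', 'competitive intelligence']
```

The sketch also returns the product name, which the slide does not list as a concept. A real extractor would match against its known patterns and dictionaries rather than a single rule.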
Example: Categorization
The Issue of Language
NLP requires separate language understanding
Clementine text mining
French
English
English/French
German
Spanish
Dutch
Japanese
Italian
MeSH (Medical Subject Headings): http://www.nlm.nih.gov/mesh/meshhome.html
Data Mining Defined
“The process of discovering meaningful new relationships, patterns and trends by sifting through data using pattern recognition technologies as well as statistical and mathematical techniques.”
- The Gartner group.
Why data mining?
Data mining software generally employs modeling algorithms designed to handle non-linearities and unusual patterns in data
Classical linear models (e.g., linear regression) are less capable of this
A related issue is ‘noise’ in the data: for example, two seemingly similar sets of inputs can yield different outputs
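A small worked example shows what a straight-line model misses. The data are invented: y is exactly x squared on a symmetric range, so an ordinary least-squares line through the raw inputs has zero slope, while the same fit on a squared feature recovers the pattern. The `fit_line` helper is written out here for illustration and is not taken from any product.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one input: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    return slope, mean_y - slope * mean_x

xs = [-2, -1, 0, 1, 2]
ys = [x * x for x in xs]  # a purely non-linear relationship

slope_raw, intercept_raw = fit_line(xs, ys)
slope_sq, intercept_sq = fit_line([x * x for x in xs], ys)

print(slope_raw, intercept_raw)  # 0.0 2.0: the line predicts 2 for every input
print(slope_sq, intercept_sq)    # 1.0 0.0: the squared feature fits exactly
```

Data mining algorithms aim to find this kind of structure without the analyst having to guess the right transformation in advance.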
A Data Mining Methodology
Use the Cross-Industry Standard Process for Data Mining (CRISP-DM)
Based on real-world lessons:
Focus on business issues
User-centric & interactive
Full process
Results are used
Data Mining is not…
Keep in mind that data mining is not…
“Blind” application of analysis/modeling algorithms
Brute-force crunching of bulk data
Black box technology
Magic
Back to the Process
Text Mining
Understanding
Business Understanding
Determine objective
Assess situation
Determine data mining goals
Produce project plan
Data Understanding
Collect initial data
Describe data
Explore data
Verify data quality
Data Preparation
Data
Data set
Data set description
Select data
Clean data
Construct data set / Integrate data
Format data
Text
Concept extraction
Concept combination
Concept assessment
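The text side of data preparation can be sketched as turning a free-text field into concept flags that sit beside the structured fields. Everything named here is hypothetical: the `CONCEPT_SYNONYMS` dictionary, the sample records, and the simple word lookup, which stands in for a real concept extractor.

```python
# Hypothetical concept dictionary: several surface words combine into one concept.
CONCEPT_SYNONYMS = {
    "billing": {"bill", "invoice", "charge"},
    "service": {"service", "support"},
}

def concept_flags(text):
    """Return a 0/1 flag per concept, based on which words appear in the text."""
    words = {w.strip(".,!?").lower() for w in text.split()}
    return {
        concept: int(bool(words & synonyms))
        for concept, synonyms in CONCEPT_SYNONYMS.items()
    }

records = [
    {"customer_id": 1, "tenure_months": 24, "comment": "The invoice was wrong again."},
    {"customer_id": 2, "tenure_months": 3, "comment": "Great support, thanks!"},
]

# Replace the raw comment with concept columns so a model can use them as inputs.
prepared = [
    {**{k: v for k, v in r.items() if k != "comment"}, **concept_flags(r["comment"])}
    for r in records
]

for row in prepared:
    print(row)
```

Once the concepts are columns, the text contributes to modeling in the same way as any other field.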
Modeling
Select modeling technique
Universe of techniques
Appropriate techniques
Data
Text
Requirements
Constraints
Selected tools
Generate test design
Run model(s)
Assess model(s)
Evaluation
Results = Models + Findings
Evaluate results
Review process
Determine next steps
Deployment
Plan deployment
Plan monitoring and maintenance
Final report
Project review
Data Mining Approaches
Unsupervised methods:
Group patients by drugs and demographic information and try to find unusual patients
Supervised methods:
Attempt to predict amount due and find sets of cases where the amount due is very different from the predicted amount
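The supervised approach can be sketched with a one-input linear model: predict the amount due, then flag cases whose actual amount is far from the prediction. The case data, the single "services" predictor, and the fixed threshold of 20 are all invented for illustration. A real project would use more inputs and a model chosen during the modeling phase.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one input: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    return slope, mean_y - slope * mean_x

def flag_outliers(case_ids, xs, ys, threshold):
    """Return ids of cases whose actual value differs from the prediction by more than threshold."""
    slope, intercept = fit_line(xs, ys)
    flagged = []
    for case_id, x, y in zip(case_ids, xs, ys):
        residual = y - (slope * x + intercept)
        if abs(residual) > threshold:
            flagged.append(case_id)
    return flagged

case_ids = ["A", "B", "C", "D", "E", "F"]
services = [1, 2, 3, 4, 5, 6]            # hypothetical predictor
amount_due = [10, 20, 30, 80, 50, 60]    # case D is billed far more than its peers

print(flag_outliers(case_ids, services, amount_due, threshold=20))  # ['D']
```

The flagged cases are candidates for review, not conclusions: the model only says the amount is unexpected given the inputs.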
What Does Data Mining Do?
Data mining uses existing data to:
Predict
Category membership
Numeric Value
e.g. Credit risk
Group
Cluster (group) things together based on their characteristics
e.g. Different types of TV viewers
Associate
Find events that occur together, or in a sequence
e.g. Beer and diapers
Find outliers
Identify cases that don’t follow expected behavior
e.g. Fraudulent behaviour
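The "associate" task can be illustrated by counting which items occur together. The baskets below are invented, and the sketch computes only pair counts, support, and confidence. It is not a full association-rule algorithm.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets.
baskets = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"milk", "bread"},
    {"diapers", "milk"},
    {"beer", "chips"},
]

item_counts = Counter(item for basket in baskets for item in basket)
pair_counts = Counter(
    pair for basket in baskets for pair in combinations(sorted(basket), 2)
)

pair = ("beer", "diapers")
support = pair_counts[pair] / len(baskets)                 # share of baskets with both items
confidence = pair_counts[pair] / item_counts["diapers"]    # share of diaper baskets that also have beer

print(pair_counts[pair], support, round(confidence, 2))
```

Sorting each basket before pairing means every pair is counted under one consistent key.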
Benefits of Document Warehousing
Richer operational business intelligence
Knowing your customers
Macroenvironmental monitoring
Technology assessment
Conclusions
Text mining is
More than word counts
Linguistically based
Concept extraction
Data mining is
Advanced analytics applied to datasets
A family of techniques
Supervised or unsupervised
Conclusions
Text and data mining
Add dimensionality to the data
Allow for automation of the text analysis event
Create 360 degree view
Applications
Websites
Surveys
Email
Call centre
Documentation
?
So How Do I Get Started?
Document Warehousing and Text Mining
Dan Sullivan, Wiley, 2001
Survey of Text Mining: Clustering, Classification and Retrieval
Michael W. Berry (ed.), Springer, 2003
Natural Language Processing for Online Applications: Text Retrieval, Extraction and Categorization
P. Jackson and I. Moulinier, John Benjamins, 2002
SPSS Canada
Tim Daciuk
Services Manager, Canada
416-410-7921
800-543-6607 ext. 5156
Hugh Rooney
SPSS Sales Canada
416-410-7921
905-886-4322
www.spss.com