An Introduction to Text Mining - Information Resource Management

An Introduction to Text Mining
Tim Daciuk
SPSS, Inc.
Services Manager, Canada
Copyright 2003-4, SPSS Inc.
Agenda

Introductions

An Overview of Document Warehousing

Understanding Unstructured Text

Concept Extraction

Text Mining

Data Mining

Demonstration
Tim Daciuk

Background
 Social research
 Survey research

SPSS
 25 years working with the product
 12 years working with the company
 5 years working with text analysis

Prior history
 Consulting
 Education
Predictive Analytics: Defined
Predictive analysis helps connect data to effective
action by drawing reliable conclusions about
current conditions and future events.
— Gareth Herschel, Research Director, Gartner Group
SPSS At A Glance

Leadership
 Market leader in Predictive Analytics
 Focus on online & offline customer data acquisition and analysis
 250,000+ customers worldwide
 NASDAQ: SPSS

Stability
 Founded in 1968
 30+ year heritage in analytic technologies
 Proven track record

Analytics standard
 80% of Fortune 500 are SPSS customers
 80%+ market share in the Survey & Market Research sector
 Ranked #1 Data Mining solution by KDnuggets
Some of Our Brands
Unstructured Data Management
Text Mining is a subset of Unstructured Data
Management.
UDM can be broken down into:

Content and Document Management

Search and Retrieval

XML database and tools

Categorization, Classification, and Visualization
80% of Data is Unstructured

Database notes:
 Call center transcripts
 Other CRM

Email

Open-ended survey responses

Web pages

Newsgroups

Documents themselves

Competitive information
Applications for Text Analysis

Surveys

‘Reading’ email

Call centre data

Comment data

Abstracts

Document management

Corporate history

Thematic understanding of website
Data Warehouse vs. Document
Warehouse

Data warehouse

Who, what, when, where, how much
 Internally focused
 Operational information
 Rarely include external information

Document warehouse

Why
 May not be internally focused
 May contain a range of information
 Often integrate external information
Document Warehouse Features

There is no single document structure or document
type

Documents are drawn from multiple sources

Essential features of documents are automatically
extracted and explicitly stored in the document
warehouse

Document warehouses are designed to integrate
semantically related documents
Building the Document Warehouse
Identify Sources -> Retrieve Document -> Pre-process Document -> Text Analysis -> Compile Metadata
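The five stages named on this slide can be read as a simple pipeline. The following is a minimal sketch under stated assumptions, not the SPSS implementation: the in-memory `store`, every function name, and the use of a unique-word list as a stand-in for "text analysis" are all hypothetical.

```python
def identify_sources(store):
    """Stage 1: list the documents available in the (hypothetical) store."""
    return sorted(store)

def retrieve_document(store, source):
    """Stage 2: fetch the raw text for one source."""
    return store[source]

def preprocess_document(raw_text):
    """Stage 3: collapse whitespace and lower-case the text."""
    return " ".join(raw_text.split()).lower()

def analyze_text(clean_text):
    """Stage 4: placeholder analysis; here just the sorted unique words."""
    return sorted(set(clean_text.split()))

def compile_metadata(source, clean_text, terms):
    """Stage 5: store extracted features explicitly alongside the source."""
    return {"source": source, "length": len(clean_text), "terms": terms}

def build_warehouse(store):
    """Run every source through the five stages in order."""
    warehouse = []
    for source in identify_sources(store):
        raw = retrieve_document(store, source)
        clean = preprocess_document(raw)
        terms = analyze_text(clean)
        warehouse.append(compile_metadata(source, clean, terms))
    return warehouse

# Invented toy documents, for illustration only.
store = {"memo.txt": "  Rates   RISE again ", "note.txt": "Rates fall"}
records = build_warehouse(store)
```

Each record keeps the extracted features next to the source identifier, which mirrors the earlier point that essential features are extracted and stored explicitly in the warehouse.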
Predict, Impact, Deploy
[Diagram. Surviving labels: Data Collection (Surveys, Web, Channel, Operational Systems, Customer Data); NLP, Information Extraction, Categorization, Clustering, Trending, Concept Maps; Text, Concepts, Attitudes, Actions, Attributes, Outcomes; Prediction; Attract, Grow, Retain, Fraud; Business UI, Expert UI, Business User.]
The Building Blocks of Language

Morphology

Syntax

Semantics

Phonology

Pragmatics
Morphology

Understanding words
 Stems
 Affixes
  Prefix
  Suffix
 Inflectional elements

Reducing complexity of analysis
 Reduces complexity of representation
 Supports text mining

Example: prefix "in-" + noun stem "dispute" + suffix "-able"
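The prefix/stem/suffix split on this slide can be illustrated with a deliberately naive affix stripper. This is a sketch only: the affix lists are hypothetical, and a real morphological analyser would rely on dictionaries and spelling rules (this one does not restore the "e" of "dispute", and it would wrongly strip "in" from a word such as "interest").

```python
# Hypothetical, tiny affix lists; a real analyser would use a full lexicon.
PREFIXES = ("in", "un", "re")
SUFFIXES = ("able", "ing", "ed")

def split_affixes(word):
    """Return (prefix, stem, suffix) using naive string matching."""
    word = word.lower()
    prefix = next((p for p in PREFIXES if word.startswith(p)), "")
    rest = word[len(prefix):]
    # Require a non-empty stem before accepting a suffix.
    suffix = next(
        (s for s in SUFFIXES if rest.endswith(s) and len(rest) > len(s)), ""
    )
    stem = rest[: len(rest) - len(suffix)] if suffix else rest
    return prefix, stem, suffix

parts = split_affixes("indisputable")
```

Reducing many surface forms to one stem is what lets later analysis treat them as a single item, which is the "reduces complexity of representation" point above.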
Syntax

The Bank of Canada will curb inflation with higher interest rates

Sentence
 Noun phrase
  Adjective: The
  Noun: Bank of Canada
 Verb phrase
  Aux: will
  Verb: curb
  Noun: inflation
 Prepositional phrase
  with
  Noun phrase
   Adjective: higher
   Noun: interest rates
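A parse tree like the one on this slide is naturally represented as a nested structure. The sketch below uses nested tuples; the labels follow the slide, except that the "Preposition" label for "with" and the attachment of the prepositional phrase directly under the sentence are assumptions, since the slide layout did not survive extraction.

```python
# (label, child, child, ...) for phrases; (label, "text") for leaves.
TREE = (
    "Sentence",
    ("Noun phrase", ("Adjective", "The"), ("Noun", "Bank of Canada")),
    ("Verb phrase", ("Aux", "will"), ("Verb", "curb"), ("Noun", "inflation")),
    (
        "Prepositional phrase",
        ("Preposition", "with"),  # label assumed, not shown on the slide
        ("Noun phrase", ("Adjective", "higher"), ("Noun", "interest rates")),
    ),
)

def leaves(node):
    """Collect the words of the tree from left to right."""
    if len(node) == 2 and isinstance(node[1], str):
        return [node[1]]
    words = []
    for child in node[1:]:
        words.extend(leaves(child))
    return words

def phrases(node, label):
    """Return the text of every sub-tree carrying the given label."""
    found = []
    if node[0] == label:
        found.append(" ".join(leaves(node)))
    for child in node[1:]:
        if isinstance(child, tuple):
            found.extend(phrases(child, label))
    return found

sentence = " ".join(leaves(TREE))
noun_phrases = phrases(TREE, "Noun phrase")
```

Walking the tree for noun phrases is the kind of operation that concept extraction builds on later in the deck.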
Semantics

The meaning of it all

Approaches to meaning
 Semantic networks
 Deductive logic
 Rule-based systems

Useful for classification
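As one illustration of how a semantic network can support classification, the sketch below stores "is-a" links in a dictionary and classifies a concept by walking up those links. The network contents and the function names are invented for this example and are not taken from any SPSS product.

```python
# Hypothetical is-a links: child concept -> parent concept.
IS_A = {
    "interest rate": "economic indicator",
    "inflation": "economic indicator",
    "economic indicator": "economics",
}

def ancestors(concept):
    """Follow is-a links upward and return the chain of parents."""
    chain = []
    while concept in IS_A:
        concept = IS_A[concept]
        chain.append(concept)
    return chain

def classify(concept, categories):
    """Return the nearest ancestor that is a known category, if any."""
    for parent in ancestors(concept):
        if parent in categories:
            return parent
    return None

category = classify("interest rate", {"economics"})
```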
Problems with NLP

Limitations of Natural Language Processing

Correctly identifying the role of noun phrases
 Representing abstract concepts
 Classifying synonyms
 Representing the number of concepts
Problems with NLP

Limitations of technology
 Language specific designs are required
 Classification speed
 Classifying hybrid words and sentences
Underlying Technology is Based on
Linguistics
Text is unstructured, ambiguous, and language
dependent.
The Linguistic Approach:

Does not treat a document as a bag of words

Removes ambiguity by extracting structured concepts
Concepts are the DNA of text.
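To see why the linguistic approach avoids treating a document as a bag of words, consider what a bag of words discards. The short sketch below (example sentences invented for illustration) shows two sentences with opposite meanings that produce identical word counts.

```python
from collections import Counter

def bag_of_words(text):
    """Count words, ignoring order and case."""
    return Counter(text.lower().split())

first = "dog bites man"
second = "man bites dog"

same_bag = bag_of_words(first) == bag_of_words(second)
same_text = first == second
```

Here `same_bag` is true while `same_text` is false: word order, and with it who did what to whom, is lost.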
From Text to Concepts
[Diagram: the Linguistic Terminology Extractor, drawing on morphology, syntax, semantics, and statistics. Surviving labels, grouped by the four properties shown:]

Accurate
 Compound words, proper nouns, figures, named entities, domain specifics, positive/negative opinion
 Examples shown: Inserm; Merck & Co…; tnp-470; glut-4…; factor receptor; inhibitory effect; D. John Paganoni; London, Paris…; names, orgs…; MeSH, genes...

Scalable
 Speed: 1GB/hour
 Multiple formats: PDF, MS Office, text…
 Multiple languages: English, French, German, Spanish, Italian, Dutch, Japanese

Customizable
 SPSS dictionaries
 User dictionaries
 Extraction patterns
 Extraction rules
 Examples shown: predicates; synonyms, stop words..

Discovery-oriented
 Trends
 Known terms
 Unknown terms
 New terms
From Concepts to Predictive Analytics Components

The Linguistic Terminology Extractor feeds three components:

LexiQuest Mine
 Discover concepts, relationships and trends

LexiQuest Categorize
 Understand documents and assign them to pre-defined categories

Text Mining for Clementine
 Add text fields to data mining for better prediction
Concept Extraction Engine
The extractor turns unstructured text into concepts:
[Diagram. Surviving labels, top to bottom: Visualization (LexiQuest Mine), Probabilities (Clementine), LexiQuest Categorize, LexiQuest Extractor Engine, Linguistic Processor.]
Part-of-Speech Tagging
a: adjective
b: adverb
c: preposition
d: determiner
n: noun
v: verb
o: coordination
p: participle
s: stop word
How is a Concept Extracted?
Step 1: Part-of-Speech Tagging
Using/V a/P tool/N like/A LexiQuest/N Mine/N is/V a/P great/A
idea/N for/P any/A organization/N that/P is/V interested/V in/P maintaining/V
information/N on/P competitive/N intelligence./N
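Step 1 can be mimicked with a dictionary lookup. The sketch below is a toy: the lexicon simply copies the word/tag pairs from this slide, and the fallback of tagging unknown words as nouns is an assumption made for the example. A production tagger would use a full dictionary together with context rules.

```python
# Word -> tag pairs copied from the slide's example sentence.
LEXICON = {
    "using": "V", "a": "P", "tool": "N", "like": "A", "lexiquest": "N",
    "mine": "N", "is": "V", "great": "A", "idea": "N", "for": "P",
    "any": "A", "organization": "N", "that": "P", "interested": "V",
    "in": "P", "maintaining": "V", "information": "N", "on": "P",
    "competitive": "N", "intelligence": "N",
}

def tag(sentence, default="N"):
    """Tag each word by lookup; unknown words get the assumed default tag."""
    words = sentence.rstrip(".").split()
    return [(word, LEXICON.get(word.lower(), default)) for word in words]

SENTENCE = (
    "Using a tool like LexiQuest Mine is a great idea for any organization "
    "that is interested in maintaining information on competitive intelligence."
)
tagged = tag(SENTENCE)
tag_string = " ".join(t for _, t in tagged)
```

The resulting `tag_string` is the sequence that the next step compares against known patterns.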
How is a Concept Extracted?
Step 2: Matching to Known Patterns
This:
V P N A N N V P A N P A N P V V P V N P N N
Looks Most Like:
NCDNN
(32 Known patterns for English)
How is the Concept Extracted?
The extractor looks at this sentence:
Using a tool like LexiQuest Mine is a great idea for any
organization that is interested in maintaining information on
competitive intelligence.
And extracts the concept:
Competitive Intelligence
Concepts are:
 Noun based
 Can be longer than one word
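The idea of matching a tag sequence against known patterns to pull out noun-based, multi-word concepts can be sketched as follows. The single pattern used here (two consecutive nouns) is hypothetical; the slides do not list the 32 English patterns, and this sketch makes no attempt to reproduce the product's ranking or filtering, so it returns every match rather than only "Competitive Intelligence".

```python
# Word/tag pairs as shown on the part-of-speech tagging slide.
TAGGED = [
    ("Using", "V"), ("a", "P"), ("tool", "N"), ("like", "A"),
    ("LexiQuest", "N"), ("Mine", "N"), ("is", "V"), ("a", "P"),
    ("great", "A"), ("idea", "N"), ("for", "P"), ("any", "A"),
    ("organization", "N"), ("that", "P"), ("is", "V"), ("interested", "V"),
    ("in", "P"), ("maintaining", "V"), ("information", "N"), ("on", "P"),
    ("competitive", "N"), ("intelligence", "N"),
]

# Hypothetical pattern list: a concept is two nouns in a row.
PATTERNS = [("N", "N")]

def extract_concepts(tagged, patterns):
    """Scan left to right, preferring longer patterns at each position."""
    ordered = sorted(patterns, key=len, reverse=True)
    found, i = [], 0
    while i < len(tagged):
        for pattern in ordered:
            window = tagged[i : i + len(pattern)]
            if tuple(t for _, t in window) == pattern:
                found.append(" ".join(w for w, _ in window))
                i += len(pattern)
                break
        else:
            i += 1  # no pattern starts here; move on one word
    return found

concepts = extract_concepts(TAGGED, PATTERNS)
```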
Example: Categorization
The Issue of Language

NLP requires separate language understanding

Clementine text mining
 French
 English
 English/French
 German
 Spanish
 Dutch
 Japanese
 Italian
 MeSH (Medical Subject Headings): http://www.nlm.nih.gov/mesh/meshhome.html
Data Mining Defined
“The process of discovering meaningful
new relationships, patterns and trends by
sifting through data using pattern
recognition technologies as well as
statistical and mathematical techniques.”
- Gartner Group
Why data mining?

Data mining software generally employs modeling algorithms designed to handle non-linearities and unusual patterns in data
 As opposed to classical linear models (e.g., linear regression), which are less capable of handling them

A related issue is ‘noise’ in the data: where, for example, two seemingly similar sets of inputs yield different outputs
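The limitation of a straight-line model on a non-linear relationship can be shown with a few lines of arithmetic. The sketch below is not a data mining algorithm; it only fits ordinary least squares to invented U-shaped data (y = x squared) and then refits after adding the squared term as the predictor.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def r_squared(xs, ys, intercept, slope):
    """Share of the variance in ys explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x * x for x in xs]          # invented U-shaped relationship

# A straight line in x cannot follow the curve.
a, b = fit_line(xs, ys)
linear_r2 = r_squared(xs, ys, a, b)

# The same data with x squared as the predictor is fitted exactly.
squared = [x * x for x in xs]
a2, b2 = fit_line(squared, ys)
squared_r2 = r_squared(squared, ys, a2, b2)
```

On this data the straight line explains none of the variance while the model with the squared term explains all of it, which is the kind of structure the non-linear algorithms mentioned above are designed to find without the analyst supplying the transformation by hand.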
A Data Mining Methodology

Use the Cross-Industry Standard Process for Data Mining (CRISP-DM)

Based on real-world lessons:
 Focus on business issues
 User-centric & interactive
 Full process
 Results are used
Data Mining is not…

Keep in mind that data mining is not…
“Blind” application of analysis/modeling algorithms
 Brute-force crunching of bulk data
 Black box technology
 Magic

Back to the Process
Text
Mining
Understanding

Business Understanding

Determine objective
 Assess situation
 Determine data mining goals
 Produce project plan

Data Understanding
 Collect initial data
 Describe data
 Explore data
 Verify data quality
Data Preparation

Data

Data set
 Data set description
 Select data
 Clean data
 Construct data set / Integrate data
 Format data

Text
 Concept extraction
 Concept combination
 Concept assessment
Modeling

Select modeling technique
 Universe of techniques
 Appropriate techniques
  Data
  Text
  Requirements
  Constraints
  Selected tools

Generate test design

Run model(s)

Assess model(s)
Evaluation

Results = Models + Findings

Evaluate results

Review process

Determine next steps
Deployment

Plan deployment

Plan monitoring and maintenance

Final report

Project review
Data Mining Approaches

Unsupervised methods:
 Group patients by drugs and demographic information and try to find unusual patients

Supervised methods:
 Attempt to predict amount due and find sets of cases where the amount due is very different from the predicted amount
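The supervised approach described above (predict the amount due, then look for cases far from the prediction) can be sketched with a one-variable regression. Everything specific here is an assumption made for illustration: the claim values are invented, "units billed" is a hypothetical predictor, and the cut-off of 1.5 standard deviations of the residuals is an arbitrary choice.

```python
from statistics import mean, pstdev

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    mean_x, mean_y = mean(xs), mean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Invented toy claims: (units_billed, amount_due).
claims = [(1, 100.0), (2, 205.0), (3, 290.0), (4, 410.0), (5, 900.0), (6, 600.0)]

xs = [units for units, _ in claims]
ys = [amount for _, amount in claims]
intercept, slope = fit_line(xs, ys)

# Residual = actual amount due minus predicted amount due.
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
threshold = 1.5 * pstdev(residuals)   # arbitrary cut-off for this sketch

flagged = [claim for claim, r in zip(claims, residuals) if abs(r) > threshold]
```

Only the claim whose amount is far above its predicted value is flagged for review; the model does not say the claim is wrong, only that it is unusual given the other cases.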
What Does Data Mining Do?

Data mining uses existing data to:

Predict
 Category membership
 Numeric value
 e.g., credit risk

Group
 Cluster (group) things together based on their characteristics
 e.g., different types of TV viewers

Associate
 Find events that occur together, or in a sequence
 e.g., beer and diapers

Find outliers
 Identify cases that don’t follow expected behavior
 e.g., fraudulent behaviour
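The "associate" task can be made concrete with the two standard measures used for association rules, support and confidence. The baskets below are invented toy data, and the function names are chosen for this sketch.

```python
# Invented shopping baskets, for illustration only.
BASKETS = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"diapers", "milk"},
    {"beer", "chips"},
    {"milk", "bread"},
]

def support(items, baskets):
    """Fraction of baskets that contain every item in `items`."""
    hits = sum(1 for basket in baskets if items <= basket)
    return hits / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Of the baskets with the antecedent, the share that also have the consequent."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)

pair_support = support({"beer", "diapers"}, BASKETS)
rule_confidence = confidence({"diapers"}, {"beer"}, BASKETS)
```

In this toy data beer and diapers appear together in two of five baskets, and two of the three baskets containing diapers also contain beer.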
Benefits of Document Warehousing

Richer operational business intelligence

Knowing your customers

Macroenvironmental monitoring

Technology assessment
Conclusions

Text mining is
 More than word counts
 Linguistically based
 Concept extraction

Data mining is
 Advanced analytics applied to datasets
 A family of techniques
 Supervised or unsupervised
Conclusions

Text and data mining
 Add dimensionality to the data
 Allow for automation of the text analysis event
 Create 360 degree view

Applications
 Websites
 Surveys
 Email
 Call centre
 Documentation
?
So How Do I Get Started?

Document Warehousing and Text Mining
 Dan Sullivan, Wiley, 2001

Survey of Text Mining: Clustering, Classification and Retrieval
 Michael W. Berry (ed.), Springer, 2003

Natural Language Processing for Online Applications: Text Retrieval, Extraction and Categorization
 P. Jackson and I. Moulinier, John Benjamins, 2002
SPSS Canada

Tim Daciuk

Services Manager, Canada
 416-410-7921
 800-543-6607 ext. 5156
 [email protected]

Hugh Rooney

SPSS Sales Canada
 416-410-7921
 905-886-4322
 [email protected]
www.spss.com