Introduction to Data Mining

Supercomputing 2002
Peter Bajcsy, Ph.D.
Research Scientist
Adjunct Assistant Professor, CS Department, UIUC
Automated Learning Group
National Center for Supercomputing Applications
University of Illinois
Course Overview
• Introduction to Knowledge Discovery in Databases and Data Mining
– Why Data Mining? What is Data Mining? On What Kind of Data?
• Applications of Data Mining
– Application Domains and Examples
• Knowledge Discovery in Databases and Data Mining Process
– Processing Steps
– Data Quality, Preparation, and Transformations
• Data Mining Tools
– D2K, SAS, Clementine, Intelligent Miner, Insightful Miner, K-Wiz
• Data Mining Methods
– Association Rules
– Decision Trees
– Information Visualization
• Summary
alg | Automated Learning Group
Acknowledgement
• Contributions:
– Michael Welge, Loretta Auvil, Lisa Gatzke, Automated Learning Group, National Center for Supercomputing Applications (NCSA), University of Illinois at Urbana-Champaign
– Jiawei Han, Computer Science, University of Illinois at Urbana-Champaign
Literature
• Data Mining: Concepts and Techniques, by J. Han & M. Kamber, Morgan Kaufmann Publishers, 2001
• Pattern Classification, by R. Duda, P. Hart, and D. Stork, 2nd edition, John Wiley & Sons, 2001
Introduction to Knowledge Discovery in Databases and Data Mining
Computational Knowledge Discovery
Terminology
• Data Mining
A step in the knowledge discovery process consisting of particular algorithms (methods) that, under some acceptable objective, produce a particular enumeration of patterns (models) over the data.
• Knowledge Discovery Process
The process of using data mining methods (algorithms) to extract (identify) what is deemed knowledge according to the specifications of measures and thresholds, using a database along with any necessary preprocessing or transformations.
Terminology - A Working Definition
• Data Mining is a “decision support” process in which we search for patterns of information in data.
• Data Mining is a process of discovering advantageous patterns in data.
• A pattern is a conservative statement about a probability distribution.
• Webster: A pattern is (a) a natural or chance configuration, (b) a reliable sample of traits, acts, tendencies, or other observable characteristics of a person, group, or institution
Data Mining: On What Kind of Data?
• Relational Databases
• Data Warehouses
• Transactional Databases
• Advanced Database Systems
– Object-Relational
– Spatial and Temporal
– Time-Series
– Multimedia
– Text
– Heterogeneous, Legacy, and Distributed
– WWW
[Figure: data examples: Structure (3D anatomy), Function (1D signal), and Metadata (annotation). The annotation example is a GeneFilter Comparison Report for the filters "O2#1 8-20-99adjfinal" and "N2#1finaladj", with columns ORF NAME, GENE NAME, CHRM, F, G, R, and raw and normalized intensities for GF1 and GF2; the sample rows run from ORF YAL001C (TFC3) to YDL028C (MPS1).]
Data Mining: Confluence of Multiple Disciplines
20x20 ~ 2^400 ≈ 10^120 patterns
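The order of magnitude can be checked with exact integer arithmetic. This is a minimal sketch that assumes the count comes from a 20 x 20 grid in which each cell is either on or off; the slide gives only the numbers, so the binary-grid reading is an assumption.

```python
import math

cells = 20 * 20                    # assumed: one binary cell per grid position
patterns = 2 ** cells              # each cell is either on or off
magnitude = math.log10(patterns)   # math.log10 accepts arbitrarily large ints

print(f"2^{cells} is about 10^{magnitude:.1f}")
```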
Why Do We Need Data Mining?
• Data volumes are too large for classical analysis approaches:
– Large number of records (10^8 to 10^12 bytes)
– High dimensional data (10^2 to 10^4 attributes)
• How do you explore millions of records, tens or hundreds of fields, and find patterns?
Why Do We Need Data Mining?
• Leverage the organization’s data assets
– Only a small portion (typically 5%-10%) of the collected data is ever analyzed
– Data that may never be analyzed continues to be collected, at great expense, out of fear that something which may prove important in the future is missing.
– Growth rates of data preclude the traditional “manually intensive” approach
Why Do We Need Data Mining?
• As databases grow, supporting the decision support process with traditional query languages becomes infeasible
• Many queries of interest are difficult to state in a query language (query formulation problem)
– “find all cases of fraud”
– “find all individuals likely to buy a FORD expedition”
– “find all documents that are similar to this customer’s problem”
[Figure: a QUERY relating (Latitude, Longitude)1 and (Latitude, Longitude)2, and its RESULT]
What is It?
Knowledge Discovery in Databases is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.
• The understandable patterns are used to:
– Make predictions or classifications about new data
– Explain existing data
– Summarize the contents of a large database to support decision making
– Create graphical data visualizations that aid humans in discovering deeper patterns
Applications of Data Mining
Data Mining Applications
• Market analysis
• Risk analysis and management
• Fraud detection and detection of unusual patterns (outliers)
• Text mining (news group, email, documents) and Web mining
• Stream data mining
• DNA and bio-data analysis
Market Analysis
• Where does the data come from?
– Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies
• Target marketing
– Find clusters of “model” customers who share the same characteristics: interest, income level, spending habits, etc.
– Determine customer purchasing patterns over time
• Cross-market analysis
– Associations/co-relations between product sales, and prediction based on such associations
• Customer profiling
– What types of customers buy what products (clustering or classification)
• Customer requirement analysis
– Identify the best products for different customers
– Predict what factors will attract new customers
Corporate Analysis & Risk Management
• Finance planning and asset evaluation
– Cash flow analysis and prediction
– Contingent claim analysis to evaluate assets
– Cross-sectional and time series analysis (financial-ratio, trend analysis, etc.)
• Resource planning
– Summarize and compare the resources and spending
• Competition
– Monitor competitors and market directions
– Group customers into classes and a class-based pricing procedure
– Set pricing strategy in a highly competitive market
Fraud Detection & Mining Unusual Patterns
• Approaches: Clustering & model construction for frauds, outlier analysis
• Applications: Health care, retail, credit card service, telecomm.
• Auto insurance: ring of collisions
• Money laundering: suspicious monetary transactions
• Medical insurance
– Professional patients, ring of doctors, and ring of references
– Unnecessary or correlated screening tests
• Telecommunications: phone-call fraud
– Phone call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm
• Retail industry
– Analysts estimate that 38% of retail shrink is due to dishonest employees
• Anti-terrorism
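One simple way to look for patterns that deviate from an expected norm is to flag values that lie far from the mean. The sketch below is illustrative only: the call durations are hypothetical, the function name `flag_deviations` is made up for this example, and the threshold of two standard deviations is an assumed choice rather than anything the slides prescribe.

```python
from statistics import mean, pstdev

def flag_deviations(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:          # all values identical: nothing deviates
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical call durations in minutes; the last call is unusually long.
call_minutes = [3, 4, 5, 4, 3, 5, 4, 60]
flagged = flag_deviations(call_minutes)
print(flagged)
```

A real phone-call model would combine several attributes (destination, duration, time of day or week) instead of a single number.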
Data Mining and Business Intelligence
Knowledge Discovery in Databases Process
KDD Process
• Develop an understanding of the application domain
– Relevant prior knowledge, problem objectives, success criteria, current solution, inventory resources, constraints, terminology, cost and benefits
• Create target data set
– Collect initial data, describe, focus on a subset of variables, verify data quality
• Data cleaning and preprocessing
– Remove noise and outliers, handle missing fields, account for time sequence information and known trends, integrate data
• Data reduction and projection
– Feature subset selection, feature construction, discretizations, aggregations
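Two of these steps, cleaning out records with missing fields and discretizing a numeric attribute, can be sketched in a few lines. The records, field names, bin boundaries, and labels below are all hypothetical and chosen only for illustration.

```python
# Hypothetical records; None marks a missing field.
records = [
    {"age": 23, "income": 31000},
    {"age": None, "income": 52000},   # missing field: dropped during cleaning
    {"age": 37, "income": 48000},
    {"age": 58, "income": 75000},
]

def clean(rows):
    """Data cleaning: keep only the records that have no missing fields."""
    return [r for r in rows if all(v is not None for v in r.values())]

def discretize(value, edges, labels):
    """Data reduction: map a numeric value to the label of the bin it falls in."""
    for edge, label in zip(edges, labels):
        if value < edge:
            return label
    return labels[-1]                 # at or above the last edge

age_edges = [30, 50]                  # assumed bin boundaries
age_labels = ["young", "middle", "senior"]

target = [
    {"age_group": discretize(r["age"], age_edges, age_labels), "income": r["income"]}
    for r in clean(records)
]
print(target)
```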
KDD Process
• Selection of data mining task
– Classification, segmentation, deviation detection, link analysis
• Select data mining approach
• Data mining to extract patterns or models
• Interpretation and evaluation of patterns/models
• Consolidating discovered knowledge
Knowledge Discovery
Required effort for each KDD Step
• Arrows indicate the direction we hope the effort should go.
Data Mining Tools
Commercial and Research Tools
Data To Knowledge
http://www.ncsa.uiuc.edu/Divisions/DMV/ALG/d2k/
SAS
http://www.sas.com/
Clementine
http://www.spss.com/spssbi/clementine/
Intelligent Miner
http://www-3.ibm.com/software/data/iminer/
Insightful Miner
http://www.insightful.com/products/product.asp?PID=26
K-Wiz
http://www.thinkanalytics.com/products/factsheets/Kwiz_product_brief.htm
Software Engineering in Data Mining
Conceptual Software Hierarchy
• Operating System (Windows, Mac OS, UNIX, Linux)
• Programming Language (Java)
• Modules = Sequences of Programming Language Commands
• Itineraries = Linked Modules
• Streamlines = Linked Itineraries
Software for
• Users with Various Levels of Programming Skills
• Collaborating Users
D2K - Software Environment for Data Mining
• Visual programming system employing a scalable framework
• Robust computational infrastructure
– Enable processor intensive apps, support distributed computing
– Enable data intensive apps, support multi-processor, shared memory architectures, thread pooling
– Very low granularity, fast data flow paradigm, integrated control flow
• Reduction of development time
– Increase code reuse and sharing
– Expedite custom software developments
– Relieve distributed computing burden
• Flexible and extensible architecture
– Create plug and play subsystem architectures, and standard APIs
• Rapid application development (RAD) environment
• Integrated environment for models and visualization
D2K Architecture
• D2K Infrastructure
– Defines the D2K API
• D2K Modules
– Computational unit written in Java that follows the D2K API
• D2K Itineraries
– A group of modules that are connected to form an application
• D2K ToolKit
– User interface
• D2K Driven Applications
– Applications that use D2K modules
– D2K SL
Data Flow Programming Environment: D2K
[Screenshot callouts: Tool Menu, Tool Bar, Side Tab Panes, Workspace, Jump Up Panes]
D2K Programming and Runtime Environment
Streamlined Data Mining Environment: D2K SL
[Screenshot callouts: KDD Steps, Workspace, KDD Options, Session]
Data Mining Techniques in D2K
• Discovery
– Association Rules, Link Analysis, Self Organizing Maps
• Predictive Modeling
– Classification: Naive Bayesian, Neural Networks, Decision Trees
– Regression: Neural Networks, Regression Trees
• Deviation Detection
• Visualization
• Text To Knowledge (T2K)
• Image To Knowledge (I2K)
----------------------
• Audio, Touch, Scent and Savor To Knowledge
• Knowledge To Wisdom (K2W)
Data Mining at Work
[Figure: example projects arranged by Data Sources (Single, Multiple, Numerous) and Project Objectives (Diagnostics, Decision Support, Automation): Functional Foods; Territorial Ratemaking; Transaction Management; Heterogeneous Data Visualization; Precision Farming; Bio-Informatics; Effluent Quality Control; Web Information Retrieval, Archival and Clustering; Crime Data Analysis; Data Fusion and Visualization; Target Marketing; Survey Study of Disability; Warranty Clustering; Auto Loss Ratio Predictions; Cost Prediction (Warranty, Insurance Claims)]
Examples of Data Mining Methods
Three Primary Data Mining Paradigms
• Discovery
– Example: Association Rules
• Predictive Modeling
– Classification Example: Decision Trees
• Deviation Detection
• Visualization
Association Rules and Market Basket Analysis
What is Market Basket Analysis?
• Customer Analysis
– Market Basket Analysis uses the information about what a customer purchases to give us insight into who they are and why they make certain purchases.
• Product Analysis
– Market Basket Analysis gives us insight into the merchandise by telling us which products tend to be purchased together and which are most amenable to purchase.
Market Basket Example
• Where should detergents be placed in the store to maximize their sales?
• Are window cleaning products purchased when detergents and orange juice are bought together?
• Is soda typically purchased with bananas? Does the brand of soda make a difference?
• How are the demographics of the neighborhood affecting what customers are buying?
Association Rules
• There has been a considerable amount of research in the area of Market Basket Analysis. Its appeal comes from the clarity and utility of its results, which are expressed in the form of association rules.
• Given
– A database of transactions
– Each transaction contains a set of items
• Find all rules X->Y that correlate the presence of one set of items X with another set of items Y
– Example: When a customer buys bread and butter, they buy milk 85% of the time
Results: Useful, Trivial, or Inexplicable?
• While association rules are easy to understand, they are not always useful.
Useful: On Fridays convenience store customers often purchase diapers and beer together.
Trivial: Customers who purchase maintenance agreements are very likely to purchase large appliances.
Inexplicable: When a new Super Store opens, one of the most commonly sold items is light bulbs.
How Does It Work?
Grocery Point-of-Sale Transactions
Customer  Items
1         Orange Juice, Soda
2         Milk, Orange Juice, Window Cleaner
3         Orange Juice, Detergent
4         Orange Juice, Detergent, Soda
5         Window Cleaner, Soda

Co-Occurrence of Products
                OJ  Window Cleaner  Milk  Soda  Detergent
OJ              4   1               1     2     2
Window Cleaner  1   2               1     1     0
Milk            1   1               1     0     0
Soda            2   1               0     3     1
Detergent       2   0               0     1     2
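A co-occurrence table of this kind can be produced by counting, for each transaction, every item and every pair of items it contains. The sketch below uses the five point-of-sale transactions from this slide; the function name `co_occurrence` is illustrative.

```python
from collections import Counter
from itertools import combinations

transactions = [
    {"OJ", "Soda"},
    {"Milk", "OJ", "Window Cleaner"},
    {"OJ", "Detergent"},
    {"OJ", "Detergent", "Soda"},
    {"Window Cleaner", "Soda"},
]

item_counts = Counter()   # diagonal of the table
pair_counts = Counter()   # off-diagonal entries, keyed by an alphabetically sorted pair

for basket in transactions:
    item_counts.update(basket)
    pair_counts.update(combinations(sorted(basket), 2))

def co_occurrence(a, b):
    """Number of transactions that contain both items (or item `a` alone if a == b)."""
    if a == b:
        return item_counts[a]
    return pair_counts[tuple(sorted((a, b)))]

print(co_occurrence("OJ", "Soda"))
print(co_occurrence("Detergent", "Milk"))
```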
How Does It Work?
• The co-occurrence table contains some simple patterns
– Orange juice and soda are purchased together in two of the five transactions, as often as any other pair of items
– Detergent is never purchased with window cleaner or milk
– Milk is never purchased with soda or detergent
• These simple observations are examples of associations and may suggest a formal rule like:
– IF a customer purchases soda, THEN the customer also purchases orange juice
How Good Are the Rules?
• In the data, two of five transactions include both soda and orange juice. These two transactions support the rule. The support for the rule is two out of five, or 40%.
• Three transactions contain soda, and two of them also contain orange juice. So the rule IF soda, THEN orange juice has a confidence of two out of three, or about 67%.
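Support and confidence can be counted directly from the five point-of-sale transactions, in which soda appears in three baskets and orange juice in two of those. This is a minimal sketch; the helper names `support` and `confidence` are illustrative and not taken from the slides.

```python
transactions = [
    {"OJ", "Soda"},
    {"Milk", "OJ", "Window Cleaner"},
    {"OJ", "Detergent"},
    {"OJ", "Detergent", "Soda"},
    {"Window Cleaner", "Soda"},
]

def support(itemset):
    """Fraction of all transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in transactions) / len(transactions)

def confidence(condition, result):
    """Among transactions containing `condition`, the fraction that also contain `result`."""
    return support(set(condition) | set(result)) / support(condition)

print(f"support(soda, OJ)     = {support({'Soda', 'OJ'}):.0%}")
print(f"confidence(soda → OJ) = {confidence({'Soda'}, {'OJ'}):.0%}")
print(f"confidence(OJ → soda) = {confidence({'OJ'}, {'Soda'}):.0%}")
```

The last line shows that confidence is directional: a rule and its reverse share the same support but generally differ in confidence.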
Confidence and Support - How Good Are the Rules
• A rule must have some minimum user-specified confidence
– 1 & 2 -> 3 has a 90% confidence if, when a customer bought 1 and 2, the customer also bought 3 in 90% of the cases.
• A rule must have some minimum user-specified support
– 1 & 2 -> 3 should hold in some minimum percentage of transactions to have value.
Association Examples
•
Find all rules that have “Diet Coke” as a result. These rules
may help plan what the store should do to boost the sales of
Diet Coke.
•
Find all rules that have “Yogurt” in the condition. These rules
may help determine what products may be impacted if the
store discontinues selling “Yogurt”.
•
Find all rules that have “Brats” in the condition and
“mustard” in the result. These rules may help in determining
the additional items that have to be sold together to make it
highly likely that mustard will also be sold.
•
Find the best k rules that have “Yogurt” in the result.
The Basic Process
• Choosing the right set of items
– Taxonomies
• Generation of rules
– If condition Then result
– Negation
• Overcoming the practical limits imposed by thousands or tens of thousands of products
– Minimum Support Pruning
Choosing the Right Set of Items
Partial Product Taxonomy (from General to Specific)
[Figure: taxonomy tree. Top level: Frozen Foods. Second level: Frozen Desserts, Frozen Vegetables, Frozen Dinners. Third level: Frozen Yogurt, Ice Cream, Frozen Fruit Bars, Peas, Carrots, Mixed, Other. Most specific level: Chocolate, Strawberry, Vanilla, Rocky Road, Cherry Garcia, Other.]
Example - Minimum Support Pruning / Rule Generation
Pass 1: Scan Database, Find Level of Support, Prune

Transaction ID #  Items
1                 { 1, 3, 4 }
2                 { 2, 3, 5 }
3                 { 1, 2, 3, 5 }
4                 { 2, 5 }

All items           After pruning
Itemset  Support    Itemset  Support
{1}      2          {2}      3
{2}      3          {3}      3
{3}      3          {5}      3
{4}      1
{5}      3

Pass 2: Scan Database, Find Pairings, Find Level of Support, Prune

Pairings of {2}, {3}, {5}    After pruning
Itemset    Support           Itemset    Support
{ 2, 3 }   2                 { 2, 5 }   3
{ 2, 5 }   3
{ 3, 5 }   2

Two rules with the highest support for a two-item set: 2->5 and 5->2
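The level-wise pruning in this example can be sketched as follows. The minimum support of 3 transactions is an assumption inferred from which itemsets survive on the slide ({1} and the pairs with support 2 are dropped), and the candidate join is deliberately simplified compared with a full Apriori implementation.

```python
from itertools import combinations

transactions = {1: {1, 3, 4}, 2: {2, 3, 5}, 3: {1, 2, 3, 5}, 4: {2, 5}}
MIN_SUPPORT = 3   # assumed threshold, inferred from the itemsets that survive

def support_count(itemset):
    """Number of transactions that contain every item in `itemset`."""
    return sum(itemset <= items for items in transactions.values())

def frequent_itemsets():
    """Level-wise search: size-k candidates are built only from frequent size-(k-1) sets."""
    items = sorted(set().union(*transactions.values()))
    level = [frozenset([i]) for i in items]
    result = {}
    while level:
        counts = {c: support_count(c) for c in level}
        survivors = {c: n for c, n in counts.items() if n >= MIN_SUPPORT}
        result.update(survivors)
        # Simplified join: merge surviving itemsets that differ by exactly one item.
        size = len(next(iter(level))) + 1
        level = list({a | b for a, b in combinations(survivors, 2) if len(a | b) == size})
    return result

for itemset, count in frequent_itemsets().items():
    print(sorted(itemset), count)
```

Because {1} and {4} are pruned in the first pass, no pair containing them is ever counted, which is what keeps the search tractable when there are thousands of products.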
Other Association Rule Applications
• Quantitative Association Rules
– Age[35..40] and Married[Yes] -> NumCars[2]
• Association Rules with Constraints
– Find all association rules where the prices of items are > 100 dollars
• Temporal Association Rules
– Diaper -> Beer (1% support, 80% confidence)
– Diaper -> Beer (20% support) 7:00-9:00 PM weekdays
• Optimized Association Rules
– Given a rule (l < A < u) and X -> Y, find values for l and u such that the support exceeds a certain threshold and the rule maximizes support and confidence.
– Check Balance [$30,000 .. $50,000] -> Certificate of Deposit (CD) = Yes
Strengths of Market Basket Analysis
• It produces easy to understand results
• It supports undirected data mining
• It works on variable length data
• Rules are relatively easy to compute
Weaknesses of Market Basket Analysis
• Its computational cost grows exponentially with the number of items
• It is difficult to determine the optimal number of items
• It discounts rare items
• It provides only limited support for attributes
Decision Tree Learning
Example: Supervised Learning with Decision Trees
Decision Tree Learning
• Start with data at the root node
• Select an attribute and form a logical test on that attribute
• Branch on each outcome of the test, and move the subset of examples satisfying that outcome to the corresponding child node
• Recurse on each child node
• A termination rule specifies when to declare a node a leaf node
Note: this is a one-step look ahead, non-backtracking search through the space of all decision trees
Critical Steps
• Formulation of good logical tests
• Selection measure for attributes
Decision Trees
• Classifiers
– Instances (unlabeled examples): represented as attribute (“feature”) vectors
• Internal Nodes: Tests for Attribute Values
– Typical: equality test (e.g., “Wind = ?”)
– Inequality, other tests possible
• Branches: Attribute Values
– One-to-one correspondence (e.g., “Wind = Strong”, “Wind = Light”)
• Leaves: Assigned Classifications (Class Labels)
Decision Tree for Concept: PlayTennis
Outlook?
  Sunny -> Humidity?
    High -> No
    Normal -> Yes
  Overcast -> Yes
  Rain -> Wind?
    Strong -> No
    Light -> Yes
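The PlayTennis tree on this slide can be written down as a nested structure and used to classify an instance by following one branch per test. This is a sketch under an assumed representation (tuples for internal nodes, strings for leaves); the names `play_tennis_tree` and `classify` are illustrative.

```python
# Internal nodes are (attribute, {value: subtree}) pairs; leaves are class labels.
play_tennis_tree = (
    "Outlook",
    {
        "Sunny": ("Humidity", {"High": "No", "Normal": "Yes"}),
        "Overcast": "Yes",
        "Rain": ("Wind", {"Strong": "No", "Light": "Yes"}),
    },
)

def classify(tree, instance):
    """Follow the branch matching each tested attribute until a leaf is reached."""
    while isinstance(tree, tuple):
        attribute, branches = tree
        tree = branches[instance[attribute]]
    return tree

day = {"Outlook": "Sunny", "Temperature": "Cool", "Humidity": "Normal", "Wind": "Light"}
print(classify(play_tennis_tree, day))
```

Note that Temperature is never tested: the tree uses only the attributes that appear on the path it follows.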
Decision Trees and Decision Boundaries
How to Visualize Decision Trees?
Example: Dividing Instance Space into Axis-Parallel Rectangles
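[Figure: a decision tree with the tests x < 3?, y > 7?, y < 5?, and x < 1? (Yes/No branches), shown next to the corresponding partition of the x-y plane, with x ticks at 1 and 3 and y ticks at 5 and 7, into axis-parallel rectangles labeled + and -]
More than two variables?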
An Illustrative Example
Training Examples for Concept PlayTennis
Day  Outlook   Temperature  Humidity  Wind    PlayTennis?
1    Sunny     Hot          High      Light   No
2    Sunny     Hot          High      Strong  No
3    Overcast  Hot          High      Light   Yes
4    Rain      Mild         High      Light   Yes
5    Rain      Cool         Normal    Light   Yes
6    Rain      Cool         Normal    Strong  No
7    Overcast  Cool         Normal    Strong  Yes
8    Sunny     Mild         High      Light   No
9    Sunny     Cool         Normal    Light   Yes
10   Rain      Mild         Normal    Light   Yes
11   Sunny     Mild         Normal    Strong  Yes
12   Overcast  Mild         High      Strong  Yes
13   Overcast  Hot          Normal    Light   Yes
14   Rain      Mild         High      Strong  No
Constructing a Decision Tree for PlayTennis
The Initial Decision Tree with One Leaf
[Table: the 14 PlayTennis training examples, repeated from the previous slide]
[9+, 5-]
E(D) = min(9/14, 5/14) = 5/14 = 36%
Question: What attribute A and what value of A should we split on?
Goal: maximize error reduction E, where the error reduction relative to attribute A is the expected reduction in error due to splitting on A:
$E(\mathrm{Split}/A) = E(D) - \sum_{v \in \mathrm{Values}(A)} \frac{|D_v|}{|D|} \, E(D_v)$
where D_v is the subset of D for which attribute A has value v.
Constructing a Decision Tree for PlayTennis
Potential Splits of Root Node
Outlook [9+, 5-]: Sunny [2+, 3-], Overcast [4+, 0-], Rain [3+, 2-]
Temperature [9+, 5-]: Cool [3+, 1-], Mild [4+, 2-], Hot [2+, 2-]
Humidity [9+, 5-]: High [3+, 4-], Normal [6+, 1-]
Wind [9+, 5-]: Light [6+, 2-], Strong [3+, 3-]
E(Split/Outlook) = (5/14) - ((5/14)(min(2/5,3/5)) + (4/14)(min(4/4,0/4)) + (5/14)(min(3/5,2/5))) = 7%
E(Split/Temperature) = (5/14) - ((4/14)(min(3/4,1/4)) + (6/14)(min(4/6,2/6)) + (4/14)(min(2/4,2/4))) = 0%
E(Split/Humidity) = (5/14) - ((7/14)(min(3/7,4/7)) + (7/14)(min(6/7,1/7))) = 7%
E(Split/Wind) = (5/14) - ((8/14)(min(6/8,2/8)) + (6/14)(min(3/6,3/6))) = 0%
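The four error reductions can be recomputed from the 14 PlayTennis training examples. The sketch below uses exact fractions so that 1/14 (about 7%) and 0 come out without rounding noise; the function names `error` and `error_reduction` are illustrative.

```python
from collections import Counter, defaultdict
from fractions import Fraction

ATTRIBUTES = ("Outlook", "Temperature", "Humidity", "Wind")
# One row per day: Outlook, Temperature, Humidity, Wind, PlayTennis
DATA = [
    ("Sunny", "Hot", "High", "Light", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Light", "Yes"),
    ("Rain", "Mild", "High", "Light", "Yes"),
    ("Rain", "Cool", "Normal", "Light", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Light", "No"),
    ("Sunny", "Cool", "Normal", "Light", "Yes"),
    ("Rain", "Mild", "Normal", "Light", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Light", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]

def error(labels):
    """Misclassification error: the share of examples outside the majority class."""
    counts = Counter(labels)
    return Fraction(len(labels) - max(counts.values()), len(labels))

def error_reduction(rows, column):
    """E(D) minus the size-weighted error of the subsets produced by the split."""
    subsets = defaultdict(list)
    for row in rows:
        subsets[row[column]].append(row[-1])
    weighted = sum(Fraction(len(s), len(rows)) * error(s) for s in subsets.values())
    return error([row[-1] for row in rows]) - weighted

for column, name in enumerate(ATTRIBUTES):
    print(f"E(Split/{name}) = {float(error_reduction(DATA, column)):.0%}")
```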
Constructing a Decision Tree for PlayTennis
• Top-Down Induction
– For discrete-valued attributes, terminates in O(n) splits
– Makes at most one pass through data set at each level (why?)

Outlook? 1,2,3,4,5,6,7,8,9,10,11,12,13,14 [9+,5-]
  Sunny: 1,2,8,9,11 [2+,3-] -> Humidity?
    High: 1,2,8 [0+,3-] -> No
    Normal: 9,11 [2+,0-] -> Yes
  Overcast: 3,7,12,13 [4+,0-] -> Yes
  Rain: 4,5,6,10,14 [3+,2-] -> Wind?
    Strong: 6,14 [0+,2-] -> No
    Light: 4,5,10 [3+,0-] -> Yes
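Top-down induction over the 14 training examples can be sketched as a short recursive function that picks the split with the largest error reduction and recurses on each child. Two choices here are assumptions rather than anything stated on the slides: ties between attributes (Outlook and Humidity tie at the root) go to the earlier column, and a node becomes a leaf when it is pure or no attributes remain.

```python
from collections import Counter, defaultdict
from fractions import Fraction

ATTRIBUTES = ("Outlook", "Temperature", "Humidity", "Wind")
DATA = [
    ("Sunny", "Hot", "High", "Light", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Light", "Yes"),
    ("Rain", "Mild", "High", "Light", "Yes"),
    ("Rain", "Cool", "Normal", "Light", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Light", "No"),
    ("Sunny", "Cool", "Normal", "Light", "Yes"),
    ("Rain", "Mild", "Normal", "Light", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Light", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]

def error(rows):
    """Misclassification error of a node: share of examples outside the majority class."""
    counts = Counter(row[-1] for row in rows)
    return Fraction(len(rows) - max(counts.values()), len(rows))

def partition(rows, column):
    """Group the rows by their value in the given column."""
    subsets = defaultdict(list)
    for row in rows:
        subsets[row[column]].append(row)
    return subsets

def error_reduction(rows, column):
    parts = partition(rows, column).values()
    return error(rows) - sum(Fraction(len(p), len(rows)) * error(p) for p in parts)

def build(rows, columns):
    """Greedy, non-backtracking induction: choose the best split, then recurse."""
    labels = Counter(row[-1] for row in rows)
    if len(labels) == 1 or not columns:       # termination rule (assumed)
        return labels.most_common(1)[0][0]
    best = max(columns, key=lambda c: error_reduction(rows, c))  # first maximum wins ties
    rest = [c for c in columns if c != best]
    children = {value: build(subset, rest) for value, subset in partition(rows, best).items()}
    return (ATTRIBUTES[best], children)

tree = build(DATA, [0, 1, 2, 3])
print(tree)
```

Each level touches every example at most once, because the subsets handed to the children of a node are disjoint.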
Strengths Of Decision Trees
•
Decision trees are able to generate understandable results
•
Decision trees perform classification without requiring much
computation
•
Decision trees can handle both continuous and categorical variables
•
Decision trees provide a clear indication of which attributes are most
important for prediction or classification
Weaknesses Of Decision Trees
•
Error-prone with too many classes
•
Quick partitioning of data results in fast deterioration in attribute
selection quality
•
Trouble with non-rectangular regions
Visualization
Visualization Example: Naïve Bayesian
Three Flower Types; Petal and Sepal Based Classification
Naïve Bayesian Visualization
• The right hand pane shows the distribution of the classes.
• The left hand pane shows the attributes and each of their values. They are listed by order of significance.
• The message box shows details about each pie chart when brushed.
• Clicking on a pie chart shows how knowing this information can change the overall class prediction.
• Clicking on multiple pie charts calculates conditional probabilities.
• Zoom in and out using the right mouse button.
Notice Iris-versicolor has a 33% likelihood
Rule Association Visualization
• Read rules down the column
• Example - the rule in the column labeled as 2 is
– if petal-width Binned=(…, 2.) then flower-type=Iris-setosa
– Support = 25%
– Confidence = 100%
Discovery Using Rule Association
• What services are purchased together?
• What products or transactions are executed by customers on a single visit to your website?
• What are the relationships in the data?
Parallel Coordinates - Visualization
• Each vertical line represents a field with the minimum and maximum values represented at bottom and top.
• Each record has a line that connects its values at each field
• Lines are colored based on the output field
• Clicking on the label boxes allows the lines to be rearranged
• Zooming is accomplished by dragging a box over the desired area. Clicking returns to the original view.
Scatterplots - Visualization
Image To Knowledge (I2K): Data Visualization
•
Hyperspectral image with 120 bands
Image To Knowledge (I2K): Visualization of Results
• Classification Results
– Class labels per pixel
– Class labels per geographical entity
– Class labels of aggregations
• Alignment Results
– Overlays
– Summary Charts
• Image Operations
– Enhancements
– Image Restoration
– Filtering
T2K - Text to Knowledge: Topic Evolution
Any chronologically ordered text
• News feeds
• Email
Protein Consumption Dynamics
• Objective
– To understand, through database visualization, global protein consumption patterns by providing a means to directly compare historical and simulated data.
• Presented at the Global Soy Forum 1999
Data Comparison, Reduction & Synthesis
• Goal
– Development of a 3D visualization tool for multichannel on-board sensor data. This tool allows for multiple time series comparison, reduction and synthesis.
• Related Projects
– Derivative Monitoring
– Real-time System Monitoring
Summary
• Curious? Puzzled?
– Learn!
– Introduction to Data Mining
– Become Familiar with Data Mining Terminology
• Found Application? Domain Specific Questions?
– Look For Tools
– Apply Data Mining Techniques to Problems
– Ask For Help