Association Rules
Data Mining and Information Visualization
Yan Liu, PhD
Assistant Professor
Department of Biomedical, Industrial and Human Factors Engineering
Wright State University
Outline
Data Mining (DM)
Definition and Usefulness
DM Process
DM Modeling Techniques
Information Visualization
Definition and Usefulness
Multivariate Data Visualization Techniques
Data Mining (DM): What and Why
What Is DM
A synonym for knowledge discovery in databases (KDD)
Nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data (Fayyad et al., 1996)
Lies at the interface of database management, machine learning, pattern recognition, statistics, and visualization
Why Is DM Useful
Rapid development in information technologies produces vast amounts of data
Knowledge discovered from data can be used for competitive advantage
Classification, prediction, association, clustering, etc.
Data Mining Process
[Figure: the CRISP-DM (CRoss Industry Standard Process for Data Mining) cycle (Holsheimer, 1999), with the phases Problem Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment arranged around the data]
Data Mining Process (Cont’d)
Problem Understanding
Understand the objectives
Define performance criteria
Objective or subjective
Assess current situations of the organization
Background knowledge, data sources, resources, etc.
Data Understanding
Collect data
From scratch or existing databases
Describe data
Volume, identities of attributes, format, etc.
Explore/survey data
Distributions of attributes, relations among a small number of attributes, results of simple aggregations, etc.
Statistical analyses, data visualization, and database queries can be useful tools
Verify data quality
Incomplete data, missing values, errors, etc.
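The describe and verify steps above can be sketched in a few lines. This is only a minimal illustration on made-up records; the field names and the `profile` helper are hypothetical and not part of the original slides.

```python
# Made-up records for illustration; None marks a missing value.
records = [
    {"age": 34, "income": 52000},
    {"age": None, "income": 61000},
    {"age": 45, "income": None},
    {"age": 29, "income": 48000},
]

def profile(rows):
    """Return, per attribute, the record count, missing count, and value range."""
    summary = {}
    for name in rows[0]:
        present = [r[name] for r in rows if r[name] is not None]
        summary[name] = {
            "count": len(rows),
            "missing": len(rows) - len(present),
            "min": min(present),
            "max": max(present),
        }
    return summary

report = profile(records)
```

A report like this covers the "describe data" step (volume, attributes) and gives a first look at data quality (missing values) before any modeling is attempted.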
Data Mining Process (Cont’d)
Data Preparation
“Garbage in, garbage out”
Select data
Based on relevance, technical constraints
Clean data
Remove errors, fill in missing data with default values or estimates by modeling
Construct data
Generate new attributes (records), merge tables, transform data, etc.
Reduce data
Obtain a dataset much smaller yet retaining enough important information
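As a rough sketch of the "clean data" step, the snippet below fills missing values either with a supplied default or with the mean of the observed values. The `fill_missing` helper and the sample ages are assumptions made for illustration; using the mean is just one simple way to estimate a missing value.

```python
from statistics import mean

def fill_missing(values, default=None):
    """Replace None with `default`, or with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = default if default is not None else mean(observed)
    return [fill if v is None else v for v in values]

ages = [34, None, 45, 29]      # made-up data
cleaned = fill_missing(ages)   # the missing age becomes the observed mean
```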
Data Mining Process (Cont’d)
Modeling
Select appropriate modeling techniques
Generate test design
Test models’ quality and validity
Build models
Assess models
According to domain knowledge, success criteria and test design
Evaluation
Evaluate results
With respect to the project objectives
Review process
Overlooked important factors or tasks
Deployment
Plan deployment
Plan monitoring and maintenance
Produce final result
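One common way to realize the test design and model assessment steps under Modeling is a holdout split: part of the data is set aside and used only to measure accuracy. The sketch below assumes this approach (the slides do not prescribe one); the labels are made up and the "model" is a trivial majority-class baseline.

```python
import random

def holdout_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and split it into training and test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)

labels = ["good"] * 6 + ["bad"] * 2          # made-up class labels
train, test = holdout_split(labels)
majority = max(set(train), key=train.count)  # trivial baseline model
score = accuracy([majority] * len(test), test)
```

Any real model should at least beat this baseline score on the held-out records.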
Class Description
Classes
e.g. customers of a bank can be classified into those with “good credit” and “bad credit”; grades of students in a class include “A”, “B”, “C”, and “D”
Data Characterization
Summarize the data in each class
e.g. summarize the distributions of age, educational level, and household income of customers that have “good credit” or “bad credit”
Data Discrimination
Compare data in different classes
e.g. compare customers with “good credit” and those with “bad credit” in their distributions of age, educational level, and household income
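Characterization and discrimination can be illustrated with a small grouping routine. The customer records below are invented, and the `characterize` helper is a hypothetical name; the mean is used as a stand-in for a fuller description of each distribution.

```python
from collections import defaultdict
from statistics import mean

customers = [  # made-up records
    {"credit": "good", "age": 45, "income": 72000},
    {"credit": "good", "age": 39, "income": 64000},
    {"credit": "bad", "age": 27, "income": 31000},
    {"credit": "bad", "age": 33, "income": 35000},
]

def characterize(rows, class_attr, attrs):
    """Data characterization: mean of each attribute within each class."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[class_attr]].append(row)
    return {
        label: {a: mean(r[a] for r in members) for a in attrs}
        for label, members in groups.items()
    }

summary = characterize(customers, "credit", ["age", "income"])
# Data discrimination: contrast the two class summaries attribute by attribute.
gap = {a: summary["good"][a] - summary["bad"][a] for a in ["age", "income"]}
```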
Mining Frequent Patterns, Associations, and Correlations
Frequent Patterns
Patterns that occur frequently in data
Itemsets: a set of items that frequently appear together in a transactional dataset
Subsequences: a set of events that frequently occur in a particular sequence
Substructures: a set of structures (such as graphs, trees, lattices) that appear
frequently
Association Mining
Discovery of frequent patterns, associations and correlations
Association Rules
Computer => Software (support=1%, confidence=50%)
Age(20,29] and Income(20K, 29K] => CD Player (support=2%, confidence=60%)
Classification and Prediction
Classification
Process of finding a model that describes and distinguishes data classes, for the
purpose of being able to use the model to predict the class of objects whose class
label (categorical, unordered) is unknown
Numeric Prediction
Models continuous-valued functions to predict the missing or unavailable
numerical data values
Cluster Analysis
Functions
Analyze data without consulting a known class label
Divide data into groups (clusters) so that objects within the same cluster are similar while those belonging to different clusters differ substantially
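The slide does not name a particular clustering algorithm. As one possible illustration, the sketch below uses plain k-means on one-dimensional made-up values: each value is assigned to its nearest center, then the centers are recomputed. The starting centers are an arbitrary assumption.

```python
def kmeans_1d(values, centers, iterations=10):
    """Plain k-means on numbers: assign to nearest center, then recompute centers."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Keep the old center if a cluster ends up empty.
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters

values = [1.0, 1.5, 2.0, 9.0, 10.0, 11.0]   # made-up data
centers, clusters = kmeans_1d(values, centers=[0.0, 5.0])
```

No class labels are consulted at any point; the grouping comes only from distances between the values.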
Outlier Analysis
Function
Identify objects that do not comply with the general pattern of the data
Outlier analysis may uncover fraudulent usage of credit cards by detecting purchases
of extremely large amounts for a given account number in comparison to regular
charges incurred by the same account
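The credit card example can be sketched with a simple robust rule: flag any charge that lies many median absolute deviations away from the account's median charge. This rule, the threshold of 5, and the sample charges are all assumptions chosen for illustration, not a method given in the slides.

```python
from statistics import median

def flag_outliers(amounts, threshold=5.0):
    """Flag amounts far from the median, measured in median absolute deviations."""
    center = median(amounts)
    mad = median(abs(a - center) for a in amounts)
    # If mad is 0 (most amounts identical), nothing is flagged.
    return [a for a in amounts if mad and abs(a - center) / mad > threshold]

charges = [42.0, 38.5, 51.0, 45.2, 40.0, 2400.0]   # made-up charges for one account
suspicious = flag_outliers(charges)
```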
Evolution Analysis
Function
Describes and models regularities or trends for objects whose behavior changes
over time
Suppose you have the major stock market (time-series) data of the last several years
available from the New York Stock Exchange and you would like to invest in shares
of high-tech industrial companies. A data mining study of stock exchange data may
identify stock evolution regularities for overall stocks and for the stocks of particular
companies. Such regularities may help predict future trends in stock market prices,
contributing to your decision making regarding stock investments
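A very simple way to expose a trend in time-series data is a moving average, which smooths out short-term fluctuations. The sketch below assumes this technique and uses invented prices; real evolution analysis would typically involve far richer models.

```python
def moving_average(series, window):
    """Average of each run of `window` consecutive values."""
    return [
        sum(series[i:i + window]) / window
        for i in range(len(series) - window + 1)
    ]

prices = [10.0, 11.0, 12.0, 11.0, 13.0, 14.0]   # made-up closing prices
trend = moving_average(prices, window=3)
rising = trend[-1] > trend[0]   # crude check for an upward trend
```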
Decision Tree
Predictive model in a Tree Structure
Decision nodes (splitting attributes) and leaf nodes
[Figure: example decision tree with its decision nodes and leaf nodes labeled]
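To make the roles of decision nodes and leaf nodes concrete, here is a minimal hand-built tree and a routine that classifies a record by walking it. The tree, its attributes, and its thresholds are invented for the bank-credit example; in practice the tree would be learned from data rather than written by hand.

```python
# Hypothetical tree: decision nodes carry a splitting attribute and threshold,
# leaf nodes carry a class label.
tree = {
    "attribute": "income",
    "threshold": 40000,
    "below": {"label": "bad credit"},
    "at_or_above": {
        "attribute": "age",
        "threshold": 25,
        "below": {"label": "bad credit"},
        "at_or_above": {"label": "good credit"},
    },
}

def classify(node, record):
    """Walk from the root through decision nodes until a leaf node is reached."""
    while "label" not in node:
        branch = "below" if record[node["attribute"]] < node["threshold"] else "at_or_above"
        node = node[branch]
    return node["label"]

prediction = classify(tree, {"income": 55000, "age": 31})
```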
Association Rules
Association Rules Modeling
Finds interesting associations or correlation
relationships among items (binary attributes)
In the form of “if-then” statements
Measures
Support (A=>B) = Pr (A and B)
Confidence (A=>B) = Pr (B|A)
Antecedent => Consequent
[Figure: pictorial example of a rule, labeled “Thursdays”; the item images are not recoverable]
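The two measures defined above, Support (A=>B) = Pr (A and B) and Confidence (A=>B) = Pr (B|A), can be computed directly from a list of transactions. The baskets below are made up, and the function names are illustrative only.

```python
def support(transactions, items):
    """Fraction of transactions containing every item in `items`: Pr(A and B)."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Pr(B | A) = support(A and B) / support(A)."""
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)

baskets = [  # made-up transactions
    {"computer", "software"},
    {"computer"},
    {"computer", "software", "printer"},
    {"printer"},
]
s = support(baskets, {"computer", "software"})
c = confidence(baskets, {"computer"}, {"software"})
```

Here two of the four baskets contain both items, and two of the three baskets with a computer also contain software, so the rule Computer => Software has support 2/4 and confidence 2/3 on this toy data.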
Information Visualization: What and Why
What Is Information Visualization
Use of computer-supported, interactive, visual representations of abstract data to amplify cognition (Card, 1999)
Why Is Information Visualization Useful
Takes advantage of the powerful processing capacity of the human visual perception system
Three Types of Usages
Exploratory analysis: searching for interesting phenomena in data
Confirmatory analysis: validating some hypothetical features in data
Presentation: demonstrating known information
Multivariate Data Visualization
Multivariate Data Visualization Methods
Scatterplot matrix
Trellis display
Parallel coordinates
Mosaic display
…
Datasets
Auto-Mpg Dataset
Retrieved from the UCI machine learning repository
Attributes: “mpg (continuous)”, “cylinders (3/4/5/6/8)”, “horsepower (continuous)”, “weight (continuous)”, “origin (American/European/Japanese)”
392 records
Titanic Survival Dataset
Retrieved from Friendly (1994)
Attributes: “booking class (first/second/third/crew)”, “gender (male/female)”,
“age (adult/child)”, “survival (yes/no)”
Scatterplot Matrix
Organizes all the pairwise scatterplots in a matrix format
Each display panel in the matrix is identified by its row and column
coordinates
The panel at the ith row and jth column is a scatterplot of Xi versus Xj
• The panel at the 3rd row (the top row) and 1st column is a scatterplot of Z versus X
• Panels that are symmetric with respect to the XYZ diagonal have the same variables as their coordinates, rotated 90°
• The redundancy is designed to improve visual linking
• Patterns can be detected in both horizontal and vertical directions
• Each panel can only show the correlation between two variables unless retinal visual elements (such as color or shape) are added
[Figure: Scatterplot matrix with three variables X, Y, and Z]
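The panel layout can be written down as a small lookup. The sketch below follows the slide's Z-versus-X example, reading "Z versus X" as Z on the vertical axis and X on the horizontal axis, with the highest row number as the top row. The function name is hypothetical.

```python
def scatterplot_matrix_panels(variables):
    """Map (row, column) to (vertical-axis variable, horizontal-axis variable).

    Rows and columns are 1-based; the highest row number is the top row.
    Diagonal panels pair a variable with itself and usually just show its name.
    """
    return {
        (i + 1, j + 1): (row_var, col_var)
        for i, row_var in enumerate(variables)
        for j, col_var in enumerate(variables)
    }

panels = scatterplot_matrix_panels(["X", "Y", "Z"])
```

Looking up `panels[(3, 1)]` and `panels[(1, 3)]` shows the redundancy noted above: the two panels hold the same pair of variables with the axes swapped.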
Scatterplot Matrix of the Auto-Mpg Dataset
[Figure: scatterplot matrix of the Auto-Mpg dataset; legend: American, European, Japanese]
Trellis Display
Overview (Becker and Cleveland, 1996)
Displays any one of a large variety of 1-D, 2-D and 3-D plot types in a trellis layout of panels, where each panel displays the selected plot type for a level or interval of additional discrete or continuous conditioning variables
Panels are laid out into columns, rows and pages
Mapping of Variables and Data Records
Axis variable
Mapped to one of the coordinates in the panels
Conditioning variable
Mapped to a horizontal bar at the top of each panel, representing one of its levels (discrete variable) or intervals (continuous variable)
Superpose variable
Mapped to colors or symbols of points in the panels
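The mapping of a conditioning variable to panels amounts to grouping the records by that variable's levels. The sketch below shows this for a discrete conditioning variable with two axis variables; the car records are invented (loosely modeled on the Auto-Mpg attributes) and `trellis_panels` is a hypothetical helper.

```python
from collections import defaultdict

cars = [  # made-up records
    {"origin": "American", "weight": 3500, "mpg": 18.0},
    {"origin": "Japanese", "weight": 2100, "mpg": 32.0},
    {"origin": "American", "weight": 4100, "mpg": 14.0},
    {"origin": "European", "weight": 2600, "mpg": 26.0},
]

def trellis_panels(rows, conditioning, x, y):
    """One panel per level of the conditioning variable, holding (x, y) points."""
    panels = defaultdict(list)
    for row in rows:
        panels[row[conditioning]].append((row[x], row[y]))
    return dict(panels)

panels = trellis_panels(cars, conditioning="origin", x="weight", y="mpg")
```

Each key would become the label in the bar at the top of a panel, and each list of points would be drawn with the same plot type and axes.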
Trellis Display of the Auto-Mpg Dataset
[Figure: trellis display of the Auto-Mpg dataset; panels or legend: American, European, Japanese]
Parallel Coordinates
Overview (Inselberg, 1985)
Each variable is represented by a vertical axis and m variables are organized as
uniformly spaced vertical lines
A data record in an m-D space is manifested as a connected set of points, one on each axis
Mapping of Variables and Data Records
Variable Xi is represented as the ith vertical axis in a 2-D space
Values of Xi are scaled so that its maximum and minimum values correspond to
the top and bottom points on its axis, respectively
A data record with m variables is represented as a set of m-1 connected line
segments which connect to vertical lines at the corresponding variables’ values
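The mapping described above can be sketched as follows: each value is scaled to a height between 0 (the variable's minimum, bottom of the axis) and 1 (its maximum, top of the axis), and consecutive points are joined into line segments. The variable ranges and the sample record are made up, and `to_polyline` is a hypothetical name.

```python
def to_polyline(record, variables, ranges):
    """Return one (axis index, scaled height) point per variable.

    Heights run from 0.0 (bottom, the variable's minimum) to 1.0 (top, its maximum).
    """
    points = []
    for axis, name in enumerate(variables):
        low, high = ranges[name]
        points.append((axis, (record[name] - low) / (high - low)))
    return points

variables = ["mpg", "horsepower", "weight"]
# Made-up (minimum, maximum) ranges for each variable.
ranges = {"mpg": (10.0, 50.0), "horsepower": (50.0, 250.0), "weight": (1500.0, 5500.0)}
line = to_polyline({"mpg": 30.0, "horsepower": 100.0, "weight": 2500.0}, variables, ranges)
segments = list(zip(line, line[1:]))   # m points give m - 1 connected segments
```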
Parallel Coordinates of the Auto-Mpg Dataset
[Figure: parallel coordinates plot with axes Origin, Cylinders, mpg, Horsepower, and Weight; legend: American, European, Japanese]
Mosaic Display
Overview
Well-recognized visualization method for categorical variables (Friendly, 1994)
Shows the frequencies in an m-way contingency table by nested rectangles
whose areas are proportional to the frequency in cells or marginal subtables
For two or more variables, the levels of sub-division are spaced with larger gaps
at the earlier levels to allow easier perception of the groupings at various levels
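For a two-way table, the nested rectangles can be computed by splitting a unit square into vertical strips by the first variable's marginal proportions, then splitting each strip by the second variable's conditional proportions. The sketch below assumes this two-variable case, omits the gaps between tiles, and uses invented counts rather than the Titanic data.

```python
def mosaic_rectangles(table):
    """Split a unit square by the first variable, then each strip by the second.

    `table` maps first-variable level -> {second-variable level: count}.
    Returns {(level1, level2): (x, y, width, height)}; gaps between tiles are
    omitted, and every first-variable level is assumed to have a nonzero total.
    """
    total = sum(sum(row.values()) for row in table.values())
    rects, x = {}, 0.0
    for level1, row in table.items():
        row_total = sum(row.values())
        width = row_total / total            # marginal proportion
        y = 0.0
        for level2, count in row.items():
            height = count / row_total       # conditional proportion
            rects[(level1, level2)] = (x, y, width, height)
            y += height
        x += width
    return rects

# Made-up counts for a 2 x 2 contingency table.
counts = {"group A": {"yes": 30, "no": 10}, "group B": {"yes": 20, "no": 40}}
rects = mosaic_rectangles(counts)
```

Because width times height equals the cell count divided by the grand total, each tile's area is proportional to its cell frequency.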
Mosaic Display of the Titanic Survival Dataset
[Figure: mosaic display of the Titanic survival dataset; legend: survived, not survived]