Data Mining and Data Visualization
SOM 485
Fall 2007
Getting Started
What is Data Mining?
Online Analytical Processing
Data Mining Techniques
Market Basket Analysis
Limitations and Challenges to Data Mining
Data Visualization
Siftware Technologies
What is Data Mining (DM)?
Group of activities used to find different patterns in data
Information provided through a Data Warehouse
Provides valuable information for different types of research
Applications of DM
Customer Relationship Management (CRM) software is an application that can benefit from DM
Activities of CRM
One-to-One Marketing
Sales Force Automation
Sales Campaign Management
Marketing Encyclopedia
Call Center Automation
Verification of DM
Requires a lot of prior knowledge on the decision maker’s part
Used mainly in casinos
e.g., can determine whether a new customer is a high roller, a souvenir buyer, a ticket purchaser, etc.
Uses Siftware to help discover new patterns of customer spending habits
Allows effective targeting of a specific group of customers
Online Analytical Processing
Online Analytical Processing (OLAP) was introduced by E. F. Codd in 1993
OLAP: a computer process that allows a user to extract data from different viewpoints
Scientific and academic organizations store about 1 terabyte (1 trillion bytes) of new data each day
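As a rough illustration of extracting data from different viewpoints, the sketch below rolls up the same small set of sales facts along different dimensions. The records, the dimension names (region, product, quarter), and the rollup helper are hypothetical and not from the slides; a real OLAP server would run this against a multidimensional or relational store.

```python
from collections import defaultdict

# Hypothetical fact records: three dimensions and one measure per row.
SALES = [
    {"region": "East", "product": "Pizza", "quarter": "Q1", "units": 10},
    {"region": "East", "product": "Cola", "quarter": "Q1", "units": 4},
    {"region": "West", "product": "Pizza", "quarter": "Q1", "units": 7},
    {"region": "West", "product": "Cola", "quarter": "Q2", "units": 5},
    {"region": "East", "product": "Pizza", "quarter": "Q2", "units": 3},
]


def rollup(rows, dimensions, measure="units"):
    """Sum the measure for every combination of the chosen dimensions."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[d] for d in dimensions)
        totals[key] += row[measure]
    return dict(totals)


# The same facts seen from two viewpoints.
by_region = rollup(SALES, ["region"])
by_product_quarter = rollup(SALES, ["product", "quarter"])
```

Changing the list of dimensions passed to rollup changes the viewpoint without touching the underlying facts.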
OLAP continued…
Codd’s 12 Rules for OLAP
1. Multidimensional View
2. Transparent to the User
3. Accessible
4. Consistent Reporting
5. Client-Server architecture
6. Generic Dimensionality
7. Dynamic Sparse Matrix Handling
8. Multi-user Support
9. Cross-Dimensional Operations
10. Intuitive Data Manipulation
11. Flexible Reporting
12. Infinite Levels of Dimension and Aggregation
OLAP: MOLAP & ROLAP
OLAP data is stored in a Multidimensional Database (MDB)
MOLAP: OLAP application that accesses data from a multidimensional database
MDBs are frequently created using input from an existing Relational Database
ROLAP: Relational Database server that can work with SQL for portability and scalability
DATA MINING TECHNIQUES
FOUR MAJOR CATEGORIES
1. Classification
2. Association
3. Sequence
4. Cluster
CLASSIFICATION
- Mining processes intended to discover rules that define whether an item belongs to a particular class of data
- Two Sub-processes:
1) Building a Model
2) Predicting Classifications
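The two sub-processes can be sketched with a nearest-centroid classifier, chosen here only because it is short; the slides do not name a specific method. The training data, the two features (average bet, visits per month), and the class labels are hypothetical.

```python
from collections import defaultdict

# Hypothetical training data: (average bet in dollars, visits per month) -> class.
TRAINING = [
    ((500.0, 8.0), "high roller"),
    ((450.0, 6.0), "high roller"),
    ((20.0, 1.0), "souvenir buyer"),
    ((30.0, 2.0), "souvenir buyer"),
]


def build_model(examples):
    """Sub-process 1: summarize each class by the mean of its feature vectors."""
    grouped = defaultdict(list)
    for features, label in examples:
        grouped[label].append(features)
    return {
        label: tuple(sum(col) / len(rows) for col in zip(*rows))
        for label, rows in grouped.items()
    }


def predict(model, features):
    """Sub-process 2: assign the class whose centroid is closest."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(model, key=lambda label: squared_distance(model[label]))


model = build_model(TRAINING)
new_customer_class = predict(model, (400.0, 5.0))
```

In practice the features would be scaled first so that one large-valued feature does not dominate the distance.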
ASSOCIATION
Techniques that employ association search all details from operational systems for patterns with a high probability of repetition
Example: Market Basket Analysis
SEQUENCE
Time series analysis methods relate events in time based on a series of preceding events
Through analysis, various hidden trends, often highly predictive of future events, can be discovered
Example: Mail Industry
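A very small stand-in for relating an event to the events that precede it is to count which event most often follows each event. This is a first-order transition count, far simpler than full time series analysis, and the event names and history below are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical event history for one customer, oldest first.
EVENTS = ["catalog", "order", "catalog", "order",
          "catalog", "return", "catalog", "order"]


def transition_counts(events):
    """Count how often each event is immediately followed by each other event."""
    counts = defaultdict(Counter)
    for current, following in zip(events, events[1:]):
        counts[current][following] += 1
    return counts


def most_likely_next(counts, event):
    """Return the most frequent follower of `event`, or None if it was never seen."""
    if event not in counts:
        return None
    return counts[event].most_common(1)[0][0]


counts = transition_counts(EVENTS)
predicted = most_likely_next(counts, "catalog")
```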
CLUSTER
Creates partitions so that all members of each set are similar according to some metric
A cluster is simply a set of objects grouped together by virtue of their similarity or proximity to each other
Example: Credit Card Transactions
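One way to form such partitions is k-means, sketched here in one dimension with absolute difference as the metric. The slides do not name an algorithm, so k-means is an assumption, as are the transaction amounts and the starting centroids.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Assign each value to its nearest centroid, then move centroids to group means."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # An empty cluster keeps its previous centroid.
        updated = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if updated == centroids:
            break  # assignments have stabilized
        centroids = updated
    return centroids, clusters


# Hypothetical credit card transaction amounts in dollars.
AMOUNTS = [12.0, 15.0, 9.0, 480.0, 510.0, 14.0, 495.0]
centers, groups = kmeans_1d(AMOUNTS, centroids=[0.0, 100.0])
```

With these inputs the small everyday purchases and the large purchases end up in separate groups.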
DATA MINING TECHNOLOGIES
Providing new answers to old questions
Developing new knowledge and understanding through discovery
Statistical Analysis – statistically evaluating products and making a decision based on logical reasoning
Neural Networks – attempt to mirror the way the human brain works in recognizing patterns by developing mathematical structures with the ability to learn
DATA MINING TECHNOLOGIES (CONT’D)
Genetic Algorithms and Fuzzy Logic – machine learning techniques that derive meaning from complicated and imprecise data and can extract patterns and detect trends within the data that are far too complex to be noticed by humans
Decision Trees – assist data mining applications by classifying items or events contained within the warehouse
NEW APPLICATIONS FOR DATA MINING
Two new categories of applications
1) Text Mining – summarizes, navigates, and clusters documents contained in a database
2) Web Mining – integrates data and text mining within a Web site; enhances the Web site with intelligent behavior, such as suggesting related links or recommending new products to the consumer
Market Basket Analysis
• Market Basket Analysis (MBA) is an algorithm that examines a long list of transactions in order to determine which items are most frequently purchased together.
• It takes its name from the idea of a person in a supermarket throwing all of their items into a shopping cart (a "market basket").
• Market basket analysis is one of the most common and useful types of data analysis for marketing.
• With the data gathered from MBA, marketers can identify products that customers like and group them together.
• Market basket analysis can improve the effectiveness of marketing and sales tactics.
Benefits of Market Basket Analysis:
• A good indication of consumer behavior
• Increase in sales
• Improves customer satisfaction
• Tracks what types of products interest the consumer and finds related alternatives to introduce to the consumer.
ASSOCIATION RULES for MBA
• Support
• Confidence
• Lift
• Method
Association rules are a common undirected data mining technique and complement market basket analysis.
These rules are unidirectional:
Left-hand side IMPLIES Right-hand side
ex. Pasta IMPLIES Wine, but Wine IMPLIES Pasta may not hold
40% of transactions that contain Pasta also contain Wine. 4% of transactions contain both of these items.
Support: the percentage of baskets in which the association rule is true, i.e. baskets that contain both the left-hand side and the right-hand side.
ex. 4% of transactions contain both
Confidence: the probability that the right-hand side item is present given that the left-hand side item is present.
ex. 40% of transactions that contain Pasta… p = .40
Lift: compares the confidence of the rule with the likelihood of finding the right-hand side item in any random basket. It measures how well an association rule performs by comparing it with how well the right-hand side item sells without the other item (improvement).
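The three measures can be computed directly from transaction counts. The sketch below uses counts chosen to match the Pasta IMPLIES Wine figures (4% support, 40% confidence); the total of 100 transactions and the 20 transactions containing Wine are assumptions, since the slides do not say how often Wine sells overall.

```python
def rule_metrics(n_total, n_lhs, n_rhs, n_both):
    """Support, confidence, and lift for the rule LHS IMPLIES RHS.

    n_total: all transactions
    n_lhs:   transactions containing the left-hand side item
    n_rhs:   transactions containing the right-hand side item
    n_both:  transactions containing both items
    """
    support = n_both / n_total             # share of baskets with both items
    confidence = n_both / n_lhs            # P(RHS present | LHS present)
    lift = confidence / (n_rhs / n_total)  # confidence relative to P(RHS)
    return support, confidence, lift


# Pasta IMPLIES Wine, with assumed counts: 100 baskets, 10 with Pasta,
# 20 with Wine (hypothetical), 4 with both.
support, confidence, lift = rule_metrics(n_total=100, n_lhs=10, n_rhs=20, n_both=4)
```

Under these assumed counts the lift is above 1, meaning Wine appears more often in baskets with Pasta than in baskets overall; a lift near 1 would suggest the rule adds little.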
Method

               Frozen Pizza   Milk   Cola   Potato Chips   Pretzels
Frozen Pizza        2          1      2          0            0
Milk                1          3      1          1            1
Cola                2          1      3          0            1
Potato Chips        0          1      0          1            0
Pretzels            0          1      1          0            2
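The counts in the Method table above can be reproduced by counting, for every pair of items, the baskets that contain both. The five baskets below are an assumption: they are one set of transactions consistent with the table, not data given in the slides.

```python
ITEMS = ["Frozen Pizza", "Milk", "Cola", "Potato Chips", "Pretzels"]

# Assumed baskets: one set of five transactions that matches the table's counts.
BASKETS = [
    {"Frozen Pizza", "Milk", "Cola"},
    {"Milk", "Potato Chips"},
    {"Frozen Pizza", "Cola"},
    {"Milk", "Pretzels"},
    {"Cola", "Pretzels"},
]


def cooccurrence(baskets, items):
    """Return matrix[i][j] = number of baskets containing both items[i] and items[j]."""
    return [
        [sum(1 for basket in baskets if a in basket and b in basket) for b in items]
        for a in items
    ]


matrix = cooccurrence(BASKETS, ITEMS)
```

The diagonal holds how many baskets contain each item on its own, and the matrix is symmetric because co-occurrence has no direction.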
Market Basket Analysis
Market basket analysis: determines what products customers purchase together
Limits to Market Basket Analysis
• A large amount of data is required to obtain meaningful results, but accuracy is compromised if the products do not all occur with similar frequency.
• ex. Milk sells in almost every transaction, but Elmer’s glue sells sporadically, so it is not effective to put them in the same basket analysis.
• Sometimes presents results that are actually due to the success of previous marketing campaigns.
• ex. Discounted price of cola with purchase of pizza.
Using Data from MBA
Once information has been gathered about different items and how they sell with respect to other items, a store may want to change its layout of items to improve its profits.
ex. Lunchboxes and School Supplies
A business without an actual storefront may want to offer promotions for products that sell together, increasing sales.
MARKET BASKET ANALYSIS in a Nutshell
Current Limitations and Challenges to Data Mining
New and underdeveloped field
Identification of missing information
Most companies run legacy systems
Not DW (data warehouse) friendly
DW designers have to convert existing ODSs (operational data stores) to the homogeneous form of the DW
Not all knowledge about application domains is present in the data
ODSs are normally limited to the data needed by the operational application associated with that DB
Data warehouse designers need to include mechanisms for “inventorying” data
Data noise & missing values
Most operational databases contain data errors in their values and/or classification
Errors lead to misclassification
Future data mining systems must incorporate more sophisticated mechanisms for treating “noisy data”
Bayesian technique – a statistical technique
Large Databases & High Dimensionality
Databases are large & dynamic
Contents are always changing
Data patterns must be constantly updated
New discovery applications have to partition problems into smaller chunks of manageable data without losing any essential attributes of the data
Data Visualization
Process by which numerical data are converted into meaningful 3-D images
Example
Intended to analyze complex data
Data from: satellite photos, sonar measurements, surveys, or computer simulations
History of Data Visualization
Originated from statistics and science
Example of 2-D
Advancement credited to NCSA (National Center for Supercomputing Applications)
Newest developments by Xerox PARC in virtual reality
Human Visual Perception
Human visual cortex dominates our perception
Accelerates the identification of hidden patterns in data
“A picture is worth a thousand words”
Geographical Information Systems (GIS)
A special-purpose DB in which a common spatial coordinate system is the primary means of reference
Requires:
1. Data input
2. Data storage, retrieval, and query
3. Data transformation, analysis, and modeling
4. Data reporting
Integrates information and aids in decision making
GIS continued
Spatial Data – elements stored in map form
Contain three basic components:
1. Points
2. Lines
3. Polygons
Attribute Data – describes spatial data
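A minimal sketch of how spatial data and attribute data might sit together in code: each feature carries its coordinates plus a dictionary of descriptive attributes. The Feature class, the sample features, and the area helper are hypothetical illustrations, not the data model of any particular GIS product.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Coordinate = Tuple[float, float]  # (x, y) in the shared coordinate system


@dataclass
class Feature:
    """One spatial element plus the attribute data that describes it."""
    kind: str  # "point", "line", or "polygon"
    coordinates: List[Coordinate]
    attributes: Dict[str, str] = field(default_factory=dict)


def polygon_area(feature: Feature) -> float:
    """Area of a simple polygon via the shoelace formula."""
    if feature.kind != "polygon":
        raise ValueError("area is only defined for polygons")
    pts = feature.coordinates
    twice_area = sum(
        x1 * y2 - x2 * y1
        for (x1, y1), (x2, y2) in zip(pts, pts[1:] + pts[:1])
    )
    return abs(twice_area) / 2.0


# Hypothetical features, one of each basic component.
store = Feature("point", [(2.0, 3.0)], {"name": "Store 12"})
road = Feature("line", [(0.0, 0.0), (4.0, 0.0)], {"name": "Main St"})
district = Feature(
    "polygon",
    [(0.0, 0.0), (4.0, 0.0), (4.0, 3.0), (0.0, 3.0)],
    {"name": "Sales district A"},
)
district_area = polygon_area(district)
```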
Example of GIS
Applications of Data Visualization Techniques
Retail Banking
Government
Insurance
Health Care and Medicine
Telecommunications
Transportation
Capital Markets
Asset Management
Siftware Technologies
IBM
Informix
Red Brick
DB2
Oracle
Silicon Graphics
Sybase
Offers several data mining solutions, depending on the user’s needs
IBM Information Warehouse Solutions
IBM Visualizer
Red Brick
Informix
Three-tier model
Tier 1: “Client” presentation layer
Tier 2: Hewlett-Packard hardware
Tier 3: Data layer (INFORMIX-OnLine database)
Sybase Warehouse WORKS
Assemble data from many sources
Transform data for a consistent and understandable view
Distribute data where needed
Provide high-speed access to the data
Leading company for large-scale data mining
Data spread across multiple databases
Data spread across processors for faster queries
Discover new patterns and trends that may not be realized using traditional SQL
Three-dimensional Visualization
Visual models can save days and even months from the review process
Review
Data mining (DM)
Techniques used to mine data
Market Basket Analysis: The King of DM Algorithms
Review continued…
Current Limitations and Challenges to Data Mining
Data Visualization
Siftware Technologies