Data Mining and Data Visualization

SOM 485
Fall 2007
Getting Started
 What is Data Mining?
 Online Analytical Processing
 Data Mining Techniques
 Market Basket Analysis
 Limitations and Challenges to Data Mining
 Data Visualization
 Siftware Technologies
What is Data Mining (DM)?
 Group of activities used to find different patterns
in data
 Information provided through a Data Warehouse
 Provides valuable information for different types
of research.
Applications of DM
Customer Relationship
Management (CRM)
software is an
application that can
benefit from DM
Activities of CRM
One-to-One Marketing
Sales Force Automation
Sales Campaign
Management
Marketing Encyclopedia
Call Center Automation
Verification of DM
 Requires a lot of prior knowledge on the
decision maker’s part
 Used mainly in casinos
 e.g., can determine whether a new customer is a high roller, a souvenir
buyer, a ticket purchaser, etc.
 Uses Siftware to help discover new
patterns of customer spending habits
 Allows effective targeting to a specific group of customers
Online Analytical Processing
 Online Analytical Processing (OLAP) was
introduced by E. F. Codd in 1993
 OLAP: a computer process that allows a
user to extract and view data from different
viewpoints
 Scientific and Academic organizations
store about 1 terabyte (1 trillion bytes) of
new data each day.
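The idea of looking at the same data from different viewpoints can be sketched in a few lines of Python. This is only an illustration, not an OLAP product: the sales rows, the dimension names, and the `roll_up` helper are all hypothetical.

```python
from collections import defaultdict

# Hypothetical fact rows: (region, product, quarter, sales). Not from the slides.
SALES = [
    ("East", "Cola", "Q1", 120),
    ("East", "Pizza", "Q1", 80),
    ("West", "Cola", "Q1", 100),
    ("West", "Cola", "Q2", 150),
    ("East", "Pizza", "Q2", 90),
]
# Position of each dimension inside a row.
DIMENSIONS = {"region": 0, "product": 1, "quarter": 2}


def roll_up(rows, *dims):
    """Total sales grouped by the chosen dimensions (one 'viewpoint')."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[DIMENSIONS[d]] for d in dims)
        totals[key] += row[3]
    return dict(totals)


# The same rows summarized from two different viewpoints.
by_region = roll_up(SALES, "region")
by_product_quarter = roll_up(SALES, "product", "quarter")
print(by_region)
print(by_product_quarter)
```

Each call to `roll_up` groups the same rows by a different set of dimensions, which is the basic operation behind a multidimensional view.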
OLAP continued…
Codd’s 12 Rules for OLAP
1. Multidimensional View
2. Transparent to the User
3. Accessible
4. Consistent Reporting
5. Client-Server architecture
6. Generic Dimensionality
7. Dynamic Sparse Matrix Handling
8. Multi-user Support
9. Cross-Dimensional Operations
10. Intuitive Data Manipulation
11. Flexible Reporting
12. Infinite Levels of Dimension and Aggregation
OLAP: MOLAP & ROLAP
 OLAP data is stored in a Multidimensional
Database (MDB)
 MOLAP: OLAP application that accesses
data from a multidimensional database
 MDBs are frequently created using input
from an existing Relational Database
 ROLAP: OLAP application that works against a
Relational Database server using SQL, for portability and
scalability.
DATA MINING
TECHNIQUES
FOUR MAJOR
CATEGORIES
1. Classification
2. Association
3. Sequence
4. Cluster
CLASSIFICATION
- Mining processes
intended to discover
rules that define
whether an item
belongs to a particular
class of data
- Two Sub-processes:
1) Building a Model
2) Predicting
Classifications
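The two sub-processes can be sketched with a deliberately simple one-rule classifier. Everything here is an assumption made for illustration: the training data, the "high roller" labels (borrowed from the casino example earlier), and the choice of a single spending threshold as the model.

```python
def build_model(examples):
    """Sub-process 1: learn the threshold that classifies the most examples correctly.

    examples: list of (average_bet, label), label is 'high roller' or 'other'.
    """
    candidates = sorted(bet for bet, _ in examples)
    best_threshold, best_correct = candidates[0], -1
    for t in candidates:
        # Count examples where "bet >= t" agrees with the true label.
        correct = sum(
            (bet >= t) == (label == "high roller") for bet, label in examples
        )
        if correct > best_correct:
            best_threshold, best_correct = t, correct
    return best_threshold


def predict(threshold, average_bet):
    """Sub-process 2: assign a class to a new, unseen customer."""
    return "high roller" if average_bet >= threshold else "other"


# Hypothetical training data, not from the slides.
TRAINING = [
    (5, "other"),
    (10, "other"),
    (20, "other"),
    (150, "high roller"),
    (300, "high roller"),
]

threshold = build_model(TRAINING)
print(threshold, predict(threshold, 200), predict(threshold, 15))
```

Real classification tools learn far richer rules, but the split between building a model from known cases and then applying it to new ones is the same.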
ASSOCIATION
 Techniques that employ association
search all details from operational systems
for patterns with a high probability of
repetition
 Example: Market Basket Analysis
SEQUENCE
 Time series analysis methods relate
events in time based on a series of
preceding events
 Through analysis, various hidden trends,
often highly predictive of future events,
can be discovered.
 Example: Mail Industry
CLUSTER
 To create partitions so that all members of
each set are similar according to some
metric
 Simply a set of objects grouped together
by virtue of their similarity or proximity to
each other
 Example: Credit Card Transactions
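Grouping by proximity can be sketched with a one-dimensional k-means loop. The slides do not name a clustering algorithm, so k-means is a swapped-in choice here, and the transaction amounts and starting centers are hypothetical.

```python
def k_means_1d(values, centers, iterations=10):
    """Partition numbers into groups by proximity to the nearest center."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            # Assign each value to the closest current center.
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Move each center to the mean of its members; keep it if the cluster is empty.
        centers = [
            sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
    return centers, clusters


# Hypothetical credit card transaction amounts, not from the slides.
AMOUNTS = [12, 15, 18, 20, 950, 1000, 1100]

centers, clusters = k_means_1d(AMOUNTS, centers=[min(AMOUNTS), max(AMOUNTS)])
print(centers)
print(clusters)
```

With these amounts the small everyday purchases end up in one group and the unusually large ones in another, which is the kind of partition the credit card example suggests.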
DATA MINING
TECHNOLOGIES
 Providing new answers to old questions
 Developing new knowledge and understanding
through discovery
 Statistical Analysis – statistically evaluating
products and making a decision based on logical
reasoning
 Neural Networks – attempt to mirror the way
the human brain works in recognizing patterns
by developing mathematical structures with the
ability to learn
DATA MINING
TECHNOLOGIES CONT’D
 Genetic Algorithms and Fuzzy Logic – machine
learning techniques that derive meaning from
complicated and imprecise data and can extract
patterns from and detect trends within the data
that are far too complex to be noticed by
humans
 Decision Trees – assist in data mining
applications by classifying items or
events contained within the warehouse
NEW APPLICATIONS FOR
DATA MINING
 Two new categories of applications
1) Text Mining – summarizes, navigates, and
clusters documents contained in a database
2) Web Mining – integrates data and text mining
within a Web site; enhances the Web site with
intelligent behavior, such as suggesting related
links or recommending new products to the
consumer
Market Basket Analysis
• Market Basket Analysis is an algorithm that
examines a long list of transactions in order to
determine which items are most frequently
purchased together.
• It takes its name from the idea of a person in a
supermarket throwing all of their items into a
shopping cart (a "market basket").
• Market basket analysis is one of the most
common and useful types of data analysis for
marketing.
• With the data gathered from MBA, marketers
can identify products that customers like and group
them together.
• Market basket analysis can improve the
effectiveness of marketing and sales tactics.
Benefits of Market Basket Analysis:
• A good indication of consumer behavior
• Increase in sales
• Improves customer satisfaction
• Tracks what types of products interest
consumers and finds related alternatives to
introduce to them.
ASSOCIATION RULES for MBA
• Support
• Confidence
• Lift
• Method
Association rules are a common undirected data mining
technique and complement market basket analysis.
These rules are unidirectional:
Left-hand side IMPLIES right-hand side
ex. Pasta IMPLIES Wine, but Wine IMPLIES Pasta may not hold
40% of transactions that contain Pasta also
contain Wine. 4% of transactions contain both
of these items.
Support: the percentage of baskets in which the association rule holds, i.e. that
contain both the left-hand side and the right-hand side.
ex. 4% of transactions contain both
Confidence: the probability that the right-hand side item is present
given that the left-hand side item is present.
ex. 40% of transactions that contain Pasta… p=.40
Lift: compares the rule's confidence with the likelihood of finding the right-hand side item in
any random basket. It measures how well an association rule
performs by comparing it with how well the item sells without the other
item (improvement).
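The three measures can be worked through with the Pasta and Wine figures. This is a sketch under stated assumptions: the total of 1,000 transactions is chosen only to make the percentages easy to read, and the count of baskets containing Wine (200) is hypothetical, since the slides do not give it. Lift is taken here as confidence divided by the share of all baskets that contain the right-hand side item, which is a common definition.

```python
def rule_metrics(n_total, n_lhs, n_rhs, n_both):
    """Support, confidence, and lift for the rule LHS IMPLIES RHS, from basket counts."""
    support = n_both / n_total            # share of all baskets containing both items
    confidence = n_both / n_lhs           # share of LHS baskets that also contain RHS
    lift = confidence / (n_rhs / n_total) # confidence relative to RHS's baseline share
    return support, confidence, lift


# Pasta IMPLIES Wine: 4% of baskets contain both, 40% of Pasta baskets contain Wine.
# n_rhs=200 (Wine in 20% of all baskets) is an assumed figure for illustration.
support, confidence, lift = rule_metrics(n_total=1000, n_lhs=100, n_rhs=200, n_both=40)
print(support, confidence, lift)
```

Under these assumed counts the lift is 2, meaning a basket with Pasta is twice as likely to contain Wine as a randomly chosen basket. A lift near 1 would mean the rule adds little over simply knowing how often Wine sells.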
Method
               Frozen Pizza   Milk   Cola   Potato Chips   Pretzels
Frozen Pizza         2          1      2          0            0
Milk                 1          3      1          1            1
Cola                 2          1      3          0            1
Potato Chips         0          1      0          1            0
Pretzels             0          1      1          0            2
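The counts listed above form a co-occurrence matrix: each cell is the number of transactions containing both the row item and the column item, and the diagonal is how often each item was bought at all. A minimal sketch of how such a matrix can be built follows. The five baskets are an assumption: they are one set of transactions that reproduces these counts, and the slides do not list the underlying transactions.

```python
ITEMS = ["Frozen Pizza", "Milk", "Cola", "Potato Chips", "Pretzels"]

# Assumed baskets: one set of five transactions consistent with the counts above.
BASKETS = [
    {"Frozen Pizza", "Cola", "Milk"},
    {"Milk", "Potato Chips"},
    {"Cola", "Frozen Pizza"},
    {"Milk", "Pretzels"},
    {"Cola", "Pretzels"},
]


def co_occurrence(baskets, items):
    """Count, for every pair of items, the baskets that contain both."""
    counts = {a: {b: 0 for b in items} for a in items}
    for basket in baskets:
        for a in basket:
            for b in basket:
                counts[a][b] += 1  # a == b fills the diagonal (item frequency)
    return counts


counts = co_occurrence(BASKETS, ITEMS)
matrix = [[counts[a][b] for b in ITEMS] for a in ITEMS]
for name, row in zip(ITEMS, matrix):
    print(f"{name:<13}", row)
```

Reading across a row shows which items sell with that product; for example, Frozen Pizza and Cola appear together in two baskets, the strongest pairing in this small sample.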
Market Basket Analysis
Market basket analysis determines what products
customers purchase together
Limits to Market Basket Analysis
• A large amount of data is required to obtain meaningful
results, but accuracy is compromised if the
products do not all occur with similar frequency.
• ex. Milk sells in almost every transaction, but
Elmer’s glue sells
sporadically, so it is not effective to put them in the same basket analysis.
• Sometimes presents results that are actually due to
the success of previous marketing campaigns.
• ex. Discounted price of cola with purchase of pizza.
Using Data from MBA
 Once information has been gathered about different
items and how they sell with respect to other items,
a store may want to change their layout of items to
improve their profits.
 ex. Lunchboxes and School Supplies
 A business without an actual storefront may want
to offer promotions for products that sell together, increasing sales.
MARKET BASKET ANALYSIS In a
Nutshell
Current Limitations and
Challenges to Data Mining
 New and underdeveloped field
 Identification of missing information
 Most companies run legacy systems
 Not DW (data warehouse) friendly
 DW designers have to convert existing ODSs
(operational data stores) to the homogeneous form
of the DW
 Not all knowledge about an application
domain is present in the data
 ODSs are normally limited to the data
needed by the operational application
associated with that DB
 Data warehouse designers need to include
mechanisms for “inventorying” data
Data noise & missing values
 Most operational databases contain data
errors in their values and/or classification
 Errors lead to misclassification
 Future data mining systems must incorporate
more sophisticated mechanisms for treating
“noisy data”
 Bayesian technique – a statistical technique
Large Databases & high
dimensionality
 Databases are large & dynamic
 Contents are always changing
 Data patterns must be constantly updated
 New discovery applications have to portion
problems into smaller chunks of manageable
data without losing any essential attributes of
the data
Data Visualization
 Process by which numerical data are
converted into meaningful 3-D images
 Example
 Intended to analyze complex data
 Data from: satellite photos, sonar
measurements, surveys, or computer
simulations
History of Data Visualization
 Originated from statistics and science
 Example of 2-D
 Advancement credited to NCSA
 National Center for Supercomputing
Applications
 Newest developments by Xerox PARC in
virtual reality
Human Visual Perception
 Human visual cortex dominates our
perception
 Accelerates the identification of hidden
patterns in data
 “A picture is worth a thousand words”
Geographical Information Systems
(GIS)
 A special-purpose DB in which a common spatial
coordinate system is the primary means of
reference
 Requires:
1. Data input
2. Data storage, retrieval, and query
3. Data transformation, analysis, and modeling
4. Data reporting
 Integrates info. and aids in decision making
GIS continued
 Spatial Data – elements stored in map
form
• Contain three basic components:
1. Points
2. Lines
3. Polygons
 Attribute Data – describes spatial data
 Example of GIS
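The split between spatial data and attribute data can be sketched as a small data structure. This is an illustration only: the `Feature` class, the `within` helper, the coordinates, and the names are all hypothetical and not taken from any GIS product.

```python
from dataclasses import dataclass, field


@dataclass
class Feature:
    kind: str    # "point", "line", or "polygon"
    coords: list # (x, y) pairs in one shared coordinate system (spatial data)
    attributes: dict = field(default_factory=dict)  # attribute data describing the shape


def within(feature, x_min, y_min, x_max, y_max):
    """True if every vertex of the feature falls inside the bounding box."""
    return all(
        x_min <= x <= x_max and y_min <= y <= y_max for x, y in feature.coords
    )


# Hypothetical map features, not from the slides.
FEATURES = [
    Feature("point", [(2, 3)], {"name": "Store A"}),
    Feature("line", [(0, 0), (10, 0)], {"name": "Main St"}),
    Feature("polygon", [(1, 1), (4, 1), (4, 4), (1, 4)], {"name": "Sales zone"}),
]

# A simple spatial query: which features lie entirely inside the box (0,0)-(5,5)?
names_in_box = [f.attributes["name"] for f in FEATURES if within(f, 0, 0, 5, 5)]
print(names_in_box)
```

Because every feature shares one coordinate system, a single spatial query can combine points, lines, and polygons and then report on their attributes.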
Applications of Data Visualization
Techniques
 Retail Banking
 Government
 Insurance
 Health Care and Medicine
 Telecommunications
 Transportation
 Capital Markets
 Asset Management
Siftware Technologies
 IBM
 Informix
 Red Brick
 DB2
 Oracle
 Silicon Graphics
 Sybase
 Offers several Data Mining solutions, depending
on the user's needs.
 IBM Information Warehouse Solutions
 IBM Visualizer
 Red Brick
Informix
 Three-tier model
 Tier 1: “Client” presentation layer
 Tier 2: Hewlett-Packard hardware
 Tier 3: Data layer, INFORMIX-OnLine
database
 Sybase Warehouse WORKS
 Assemble data from many sources
 Transform data for a consistent and understandable
view
 Distribute data where needed
 Provide high-speed access to the data
 Leading company for large-scale data mining
 Data spread across multiple databases
 Data spread across processors for faster
queries
 Discover new patterns and trends that may not
be realized using traditional SQL
 Three-dimensional Visualization
 Visual models can save days and even months
from the review process
Review
 Data mining (DM)
 Techniques used to mine data
 Market Basket Analysis: The King of DM
Algorithms
Review continued…
 Current Limitations and Challenges to
Data Mining
 Data Visualization
 Siftware Technologies