Data Mining - Department of Computer Engineering


Supply Chain Management
Data Warehousing
Data Mining
CPE 665 Enterprise Computing
www.cpe.kmutt.ac.th/~suthep/cpe665
Assoc. Prof. Suthep Madarasmi, Ph.D.
Enterprise Data
Modules of enterprise data, linked through an ideal MRP core:
• Sales & A/R
• Human Resource
• Accounts & Finance
• Asset, Costing
• Production Mgmt.
• Supply Chain/MRP
• Purchasing & A/P
• Inventory Control
• Planning
What is SCM? (Supply Chain Mgmt.)
Use of IT To Answer:
• What are we producing?
• What resources are needed?
• What do we have, what do we need, and when?
• How much capacity do we need, and when?
• What are pending issues and deliveries for
orders?
• Why was an order late?
• What future production problems will we have?
Ideal Information System
A three-level pyramid over a shared database:
• Top: Executive Information System (EIS)
• Middle: Management Information System (MIS)
• Bottom: Transaction Processing System (TPS)
• Foundation: the database
MRP (Material Requirements Planning)
Order abbreviations:
SO – Sales Order
MO – Manufacturing Order
WO – Work Order
PO – Purchase Order

Document flow for a product P1: a sales order (SO) triggers a
manufacturing order (MO) or work order (WO); for materials not in
stock, the MO/WO in turn triggers purchase orders (PO). Raw
materials are received into stock against the PO, issued to
production against the MO/WO, and the produced goods are received
back into stock as finished goods (FG) for delivery against the SO.
MRP: Inventory System Links All
Information on goods moving in/out of the warehouse, entered by
the stores department, links all MRP information. Open orders
affect the future stock balance. Flows are linked to:
- Sales Order SO (remaining deliveries)
- Purchase Order PO (remaining receipts)
- Manufacturing Order MO (remaining raw material issues,
  remaining finished goods receipts)
- Work Order WO (remaining raw material issues, remaining
  finished goods receipts)
MRP
 Can see Future Stock Balance
 Can see Future Stock Card
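The future stock balance described above can be sketched in a few lines of Python. The order references, dates, and quantities below are hypothetical; a real MRP system would track many more attributes per order.

```python
from datetime import date

# Hypothetical open orders against one stock item: positive qty is an
# expected receipt (PO receiving remain, MO/WO finished-goods remain),
# negative qty is a committed issue (SO delivery remain, raw material
# issue remain).
open_orders = [
    (date(2024, 1, 5),  "PO-001", +100),
    (date(2024, 1, 10), "SO-042", -80),
    (date(2024, 1, 20), "MO-007", +50),
]

def future_stock_balance(on_hand, orders):
    """Project the running stock balance as open orders fall due."""
    balance, projection = on_hand, []
    for due, ref, qty in sorted(orders):
        balance += qty
        projection.append((due, ref, qty, balance))
    return projection

# Starting from 30 units on hand, print the future stock card
for due, ref, qty, balance in future_stock_balance(30, open_orders):
    print(due, ref, qty, balance)
```

Each projected row is one line of the future stock card; the last column is the future stock balance.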
Example Orders Data for MRP
Stores Data with Order References
Stock Balance Report Example
Stock Card Report Example
MRP: Future Stock Balance
Data Mining:
Concepts and Techniques
— Slides for Textbook —
— Chapter 2 —
© Jiawei Han and Micheline Kamber
Multi-Tiered Architecture of Data
Warehousing and Data Mining
Data Sources: operational DBs and other sources, feeding
Extract / Transform / Load / Refresh through a Monitor &
Integrator, with Metadata maintained alongside.
Data Storage: the Data Warehouse and Data Marts.
OLAP Engine: an OLAP server that serves queries over the storage.
Front-End Tools: Analysis, Query, Reports, Data mining.
What Data to put in Warehouse?
What Analysis/Query/Reports are Needed?
What Data Mining Functions may be
done?
Example Case Study
Tesco Lotus Retail Outlet Data
 Head: Central Finance System
 Head & Outlet: Inventory management per
warehouse
 Head: Distribution Plan
 Head: Supply Chain and Logistics Management
 Head: Orders Management
 Outlet: POS (Point of Sales) at Outlet
 Outlet: Returns Handling
 Head & Outlet: Human Resource Management
 Outlet: Customer Satisfaction Survey
What Analysis Reports Needed?
 Sales Analysis
 Sales by Outlet, Zone, Product Code, Actual
Product, Product Category
 Profit by Supplier, Product, Category
 Return by Product, Supplier, Sales Zone
 Payment Methods Used by Customer
 Trends per Season. Comparison across years.
 Performance
 Stores below minimum stock for over ___
days
 Delays in delivery
 Sales by Promotion Type, Sales person
 Goods Lost / Stolen by product, outlet, zone
What Data Mining Needed?
 Product Correlations by Product Code, Actual
Product, Product Categories, Outlet, Zone
 Products with high return chances
 Credit Card fraud cases
 Employee Theft cases
 Member Purchase Patterns
Chapter 2: Data Warehousing and OLAP
Technology for Data Mining
 What is a data warehouse?
 A multi-dimensional data model
 Data warehouse architecture
 Data warehouse implementation
 Further development of data cube technology
 From data warehousing to data mining
What is a Data Warehouse?
 Defined in many different ways, but not
rigorously.
 A decision support database that is maintained
separately from the organization’s operational
database
 Support information processing by providing a solid
platform of consolidated, historical data for analysis.
 “A data warehouse is a subject-oriented,
integrated, time-variant, and nonvolatile
collection of data in support of management’s
decision-making process.”—W. H. Inmon
 Data warehousing:
 The process of constructing and using data
warehouses
Data Warehouse: Subject-Oriented
 Organized around major subjects, such as
customer, product, sales.
 Focusing on the modeling and analysis of data
for decision makers, not on daily operations or
transaction processing.
 Provide a simple and concise view around
particular subject issues by excluding data that
are not useful in the decision support process.
Data Warehouse: Integrated
Constructed by integrating multiple,
heterogeneous data sources
 relational databases, flat files, on-line
transaction records
Data cleaning and data integration
techniques are applied.
 Ensure consistency in naming conventions,
encoding structures, attribute measures, etc.
among different data sources
• E.g., Hotel price: currency, tax, breakfast
covered, etc.
 When data is moved to the warehouse, it is
converted.
Data Warehouse: Time Variant
 The time horizon for the data warehouse is
significantly longer than that of operational
systems.
 Operational database: current value data.
 Data warehouse data: provide information from a
historical perspective (e.g., past 5-10 years)
 Every key structure in the data warehouse
 Contains an element of time, explicitly or implicitly
 But the key of operational data may or may not
contain “time element”.
Data Warehouse: Non-Volatile
A physically separate store of data transformed
from the operational environment.
Operational update of data does not occur in
the data warehouse environment.
 Does not require transaction processing, recovery,
and concurrency control mechanisms
 Requires only two operations in data accessing:
• initial loading of data and access of data.
Data Warehouse vs. Heterogeneous DBMS
Traditional heterogeneous DB integration:
 Build wrappers/mediators on top of heterogeneous databases
 Query driven approach
• When a query is posed to a client site, a meta-dictionary is
used to translate the query into queries appropriate for
individual heterogeneous sites involved, and the results are
integrated into a global answer set
• Complex information filtering, compete for resources
Data warehouse: update-driven, high
performance
 Information from heterogeneous sources is integrated in advance
and stored in warehouses for direct query and analysis
Data Warehouse vs. Operational DBMS
 OLTP (on-line transaction processing)
 Major task of traditional relational DBMS
 Day-to-day operations: purchasing, inventory, banking,
manufacturing, payroll, registration, accounting, etc.
 OLAP (on-line analytical processing)
 Major task of data warehouse system
 Data analysis and decision making
 Distinct features (OLTP vs. OLAP):
 User and system orientation: customer vs. market
 Data contents: current, detailed vs. historical, consolidated
 Database design: ER + application vs. star + subject
 View: current, local vs. evolutionary, integrated
 Access patterns: update vs. read-only but complex queries
OLTP vs. OLAP

                    OLTP                         OLAP
users               clerk, IT professional       knowledge worker
function            day-to-day operations        decision support
DB design           application-oriented         subject-oriented
data                current, up-to-date,         historical, summarized,
                    detailed, flat relational,   multidimensional,
                    isolated                     integrated, consolidated
usage               repetitive                   ad-hoc
access              read/write,                  lots of scans
                    index/hash on prim. key
unit of work        short, simple transaction    complex query
# records accessed  tens                         millions
# users             thousands                    hundreds
DB size             100MB-GB                     100GB-TB
metric              transaction throughput       query throughput, response
Why Separate Data Warehouse?
High performance for both systems
 DBMS - tuned for OLTP: access methods, indexing,
concurrency control, recovery
 Warehouse - tuned for OLAP: complex OLAP queries,
multidimensional view, consolidation.
Different functions and different data:
 missing data: Decision support requires historical data
which operational DBs do not typically maintain
 data consolidation: DS requires consolidation
(aggregation, summarization) of data from
heterogeneous sources
 data quality: different sources typically use
inconsistent data representations, codes and formats
which have to be reconciled
Data Warehousing and OLAP
Technology for Data Mining
 What is a data warehouse?
 A multi-dimensional data model
 Data warehouse architecture
 Data warehouse implementation
 Further development of data cube technology
 From data warehousing to data mining
From Tables and Spreadsheets to
Data Cubes
 A data warehouse is based on a multidimensional data
model which views data in the form of a data cube
 A data cube, such as sales, allows data to be modeled
and viewed in multiple dimensions
 Dimension tables, such as item (item_name, brand, type), or
time(day, week, month, quarter, year)
 Fact table contains measures (such as dollars_sold) and keys to
each of the related dimension tables
 In data warehousing literature, an n-D base cube is
called a base cuboid. The top most 0-D cuboid, which
holds the highest-level of summarization, is called the
apex cuboid. The lattice of cuboids forms a data cube.
Cube: A Lattice of Cuboids
0-D (apex) cuboid: all
1-D cuboids: time, item, location, supplier
2-D cuboids: (time,item), (time,location), (time,supplier),
(item,location), (item,supplier), (location,supplier)
3-D cuboids: (time,item,location), (time,item,supplier),
(time,location,supplier), (item,location,supplier)
4-D (base) cuboid: (time, item, location, supplier)
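The lattice of cuboids is just the power set of the dimensions, so it can be enumerated directly. A minimal sketch in Python:

```python
from itertools import combinations

def cuboid_lattice(dimensions):
    """Enumerate every cuboid (subset of dimensions) of the base
    cuboid, grouped by dimensionality k = 0 .. n."""
    return {k: list(combinations(dimensions, k))
            for k in range(len(dimensions) + 1)}

dims = ("time", "item", "location", "supplier")
lattice = cuboid_lattice(dims)
total = sum(len(level) for level in lattice.values())
print(lattice[0])       # [()] -- the apex cuboid
print(len(lattice[2]))  # 6 two-dimensional cuboids
print(total)            # 16 cuboids in all (2^4)
```

For n dimensions without hierarchies the cube has 2^n cuboids; with concept hierarchies the count grows to prod(L_i + 1) over the hierarchy levels.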
Conceptual Modeling of Data Warehouses
Modeling data warehouses: dimensions & measures
 Star schema: A fact table in the middle connected to a set
of dimension tables
 Snowflake schema: A refinement of star schema where
some dimensional hierarchy is normalized into a set of
smaller dimension tables, forming a shape similar to
snowflake
 Fact constellations: Multiple fact tables share dimension
tables, viewed as a collection of stars, therefore called
galaxy schema or fact constellation
Example of Star Schema
Sales Fact Table: time_key, item_key, branch_key, location_key,
plus the measures units_sold, dollars_sold, avg_sales.
Dimension tables:
- time (time_key, day, day_of_the_week, month, quarter, year)
- item (item_key, item_name, brand, type, supplier_type)
- branch (branch_key, branch_name, branch_type)
- location (location_key, street, city, province_or_state, country)
Example of Snowflake Schema
Sales Fact Table: time_key, item_key, branch_key, location_key,
plus the measures units_sold, dollars_sold, avg_sales.
Dimension tables (with normalized hierarchies):
- time (time_key, day, day_of_the_week, month, quarter, year)
- item (item_key, item_name, brand, type, supplier_key)
  - supplier (supplier_key, supplier_type)
- branch (branch_key, branch_name, branch_type)
- location (location_key, street, city_key)
  - city (city_key, city, province_or_state, country)
Example of Fact Constellation
Sales Fact Table: time_key, item_key, branch_key, location_key,
plus the measures units_sold, dollars_sold, avg_sales.
Shipping Fact Table: time_key, item_key, shipper_key,
from_location, to_location, plus the measures dollars_cost,
units_shipped.
Shared dimension tables:
- time (time_key, day, day_of_the_week, month, quarter, year)
- item (item_key, item_name, brand, type, supplier_type)
- branch (branch_key, branch_name, branch_type)
- location (location_key, street, city, province_or_state, country)
- shipper (shipper_key, shipper_name, location_key, shipper_type)
A Data Mining Query Language,
DMQL: Language Primitives
 Cube Definition (Fact Table)
define cube <cube_name> [<dimension_list>]:
<measure_list>
 Dimension Definition (Dimension Table)
define dimension <dimension_name> as
(<attribute_or_subdimension_list>)
 Special Case (Shared Dimension Tables)
 First time as “cube definition”
 define dimension <dimension_name> as
<dimension_name_first_time> in cube
<cube_name_first_time>
Defining a Star Schema in DMQL
define cube sales_star [time, item, branch, location]:
  dollars_sold = sum(sales_in_dollars),
  avg_sales = avg(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month,
  quarter, year)
define dimension item as (item_key, item_name, brand, type,
  supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city,
  province_or_state, country)
A Concept Hierarchy: Dimension (location)
Levels: all → region → country → city → office, e.g.:
all
- Europe: Germany (Frankfurt, ...), Spain, ...
- North_America: Canada (Vancouver, Toronto, ...), Mexico, ...
Offices under Vancouver: L. Chan, ..., M. Wind
Multidimensional Data
Sales volume as a function of product, month, and region.
Dimensions: Product, Location, Time
Hierarchical summarization paths:
- Product: Industry → Category → Product
- Location: Region → Country → City → Office
- Time: Year → Quarter → Month / Week → Day
A Sample Data Cube
A 3-D sales cube with dimensions Date (1Qtr ... 4Qtr, sum),
Product (TV, PC, VCR, sum), and Country (U.S.A, Canada, Mexico,
sum); e.g., one aggregate cell holds the total annual sales of
TV in U.S.A.
Browsing a Data Cube
 Visualization
 OLAP capabilities
 Interactive manipulation
Typical OLAP Operations
 Roll up (drill-up): summarize data
 by climbing up hierarchy or by dimension reduction
 Drill down (roll down): reverse of roll-up
 from higher level summary to lower level summary or
detailed data, or introducing new dimensions
 Slice and dice:
 project and select
 Pivot (rotate):
 reorient the cube, visualization, 3D to series of 2D planes.
 Other operations
 drill across: involving (across) more than one fact table
 drill through: through the bottom level of the cube to its
back-end relational tables (using SQL)
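Roll-up (dimension reduction) and slice can be sketched over a flat fact list in plain Python. The sales facts below are hypothetical, echoing the sample cube's dimensions:

```python
from collections import defaultdict

# Hypothetical sales facts: (quarter, product, country, units)
facts = [
    ("1Qtr", "TV", "U.S.A", 120), ("1Qtr", "PC", "Canada", 80),
    ("2Qtr", "TV", "U.S.A", 150), ("2Qtr", "TV", "Mexico", 60),
]
DIMS = {"quarter": 0, "product": 1, "country": 2}

def roll_up(facts, keep):
    """Roll up by dimension reduction: aggregate away every
    dimension not listed in `keep`."""
    totals = defaultdict(int)
    for fact in facts:
        totals[tuple(fact[DIMS[d]] for d in keep)] += fact[3]
    return dict(totals)

def slice_cube(facts, dim, value):
    """Slice: select the sub-cube where one dimension equals `value`."""
    return [f for f in facts if f[DIMS[dim]] == value]

print(roll_up(facts, ["product"]))                 # {('TV',): 330, ('PC',): 80}
print(len(slice_cube(facts, "country", "U.S.A")))  # 2 facts in the U.S.A slice
```

Drill-down is the reverse direction: keeping more dimensions (or finer hierarchy levels) in `keep`.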
Multi-Tiered Architecture
Data Sources: operational DBs and other sources, feeding
Extract / Transform / Load / Refresh through a Monitor &
Integrator, with Metadata maintained alongside.
Data Storage: the Data Warehouse and Data Marts.
OLAP Engine: an OLAP server that serves queries over the storage.
Front-End Tools: Analysis, Query, Reports, Data mining.
Three Data Warehouse Models
 Enterprise warehouse
 collects all of the information about subjects spanning
the entire organization
 Data Mart
 a subset of corporate-wide data that is of value to a
specific group of users. Its scope is confined to specific,
selected subjects, such as a marketing data mart
• Independent vs. dependent (directly from warehouse) data mart
 Virtual warehouse
 A set of views over operational databases
 Only some of the possible summary views may be
materialized
Cube Operation
 Cube definition and computation in DMQL
define cube sales[item, city, year]: sum(sales_in_dollars)
compute cube sales
 Transform it into a SQL-like language (with a new operator
cube by, introduced by Gray et al.’96)
SELECT item, city, year, SUM (amount)
FROM SALES
CUBE BY item, city, year
 Need to compute the following group-bys:
(item, city, year),
(item, city), (item, year), (city, year),
(item), (city), (year),
()
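The CUBE BY computation above amounts to aggregating the measure over every subset of the grouping dimensions. A minimal sketch, using hypothetical sales rows:

```python
from itertools import combinations
from collections import defaultdict

# Hypothetical sales rows: (item, city, year, dollars)
rows = [
    ("TV", "Vancouver", 2023, 1000),
    ("TV", "Toronto",   2023, 600),
    ("PC", "Vancouver", 2024, 800),
]

def compute_cube(rows, dims=("item", "city", "year")):
    """Aggregate SUM(dollars) for every group-by of the cube:
    2^n cuboids for n dimensions, like SQL's CUBE BY."""
    cube = {}
    for k in range(len(dims) + 1):
        for group in combinations(range(len(dims)), k):
            agg = defaultdict(int)
            for row in rows:
                agg[tuple(row[i] for i in group)] += row[3]
            cube[tuple(dims[i] for i in group)] = dict(agg)
    return cube

cube = compute_cube(rows)
print(len(cube))        # 8 group-bys for 3 dimensions
print(cube[()][()])     # grand total: 2400
print(cube[("item",)])  # {('TV',): 1600, ('PC',): 800}
```

This naive version scans the rows once per cuboid; efficient cube algorithms share work between parent and child cuboids instead.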
Data Warehouse Back-End Tools and
Utilities
 Data extraction:
 get data from multiple, heterogeneous, and external
sources
 Data cleaning:
 detect errors in the data and rectify them when
possible
 Data transformation:
 convert data from legacy or host format to warehouse
format
 Load:
 sort, summarize, consolidate, compute views, check
integrity, and build indices and partitions
 Refresh
 propagate the updates from the data sources to the
warehouse
Data Warehousing and OLAP
Technology for Data Mining
 What is a data warehouse?
 A multi-dimensional data model
 Data warehouse architecture
 Data warehouse implementation
 Further development of data cube technology
 From data warehousing to data mining
Data Warehouse Usage
 Three kinds of data warehouse applications
 Information processing
• supports querying, basic statistical analysis, and reporting
using crosstabs, tables, charts and graphs
 Analytical processing
• multidimensional analysis of data warehouse data
• supports basic OLAP operations, slice-dice, drilling, pivoting
 Data mining
• knowledge discovery from hidden patterns
• supports associations, constructing analytical models,
performing classification and prediction, and presenting the
mining results using visualization tools.
 Differences among the three tasks
From On-Line Analytical Processing to
On Line Analytical Mining (OLAM)
 Why online analytical mining?
 High quality of data in data warehouses
• DW contains integrated, consistent, cleaned data
 Available information processing structure surrounding data
warehouses
• ODBC, OLEDB, Web accessing, service facilities, reporting
and OLAP tools
 OLAP-based exploratory data analysis
• mining with drilling, dicing, pivoting, etc.
 On-line selection of data mining functions
• integration and swapping of multiple mining functions,
algorithms, and tasks.
 Architecture of OLAM
An OLAM Architecture
Layer 4 (User Interface): user GUI API; mining queries go down,
mining results come back up.
Layer 3 (OLAP/OLAM): the OLAM engine and OLAP engine side by
side, over a Data Cube API.
Layer 2 (MDDB): multidimensional databases plus metadata.
Layer 1 (Data Repository): databases and the data warehouse,
reached through a database API with filtering & integration
(data cleaning, data integration).
Summary
 Data warehouse
 A subject-oriented, integrated, time-variant, and nonvolatile collection of data
in support of management’s decision-making process
 A multi-dimensional model of a data warehouse
 Star schema, snowflake schema, fact constellations
 A data cube consists of dimensions & measures
 OLAP operations: drilling, rolling, slicing, dicing and pivoting
 OLAP servers: ROLAP, MOLAP, HOLAP
 Efficient computation of data cubes
 Partial vs. full vs. no materialization
 Multiway array aggregation
 Bitmap index and join index implementations
 Further development of data cube technology
 Discovery-driven and multi-feature cubes
 From OLAP to OLAM (on-line analytical mining)
Data Mining: Introduction
Lecture Notes for Chapter 1
Introduction to Data Mining
by
Tan, Steinbach, Kumar
Modified by
Songrit Maneewongvatana
& Suthep Madarasmi
Why Mine Data? Commercial Viewpoint
 Lots of data is being collected
and warehoused
 Web data, e-commerce
 purchases at department/
grocery stores
 Bank/Credit Card
transactions
 Computers have become cheaper and more
powerful
 Competitive Pressure is Strong
 Provide better, customized services for an edge (e.g. in
Customer Relationship Management)
Why Mine Data? Scientific Viewpoint
 Data collected and stored at
enormous speeds (GB/hour)
 remote sensors on a satellite
 telescopes scanning the skies
 microarrays generating gene
expression data
 scientific simulations
generating terabytes of data
 Traditional techniques infeasible for raw data
 Data mining may help scientists
 in classifying and segmenting data
 in Hypothesis Formation
Mining Large Data Sets - Motivation
 There is often information “hidden” in the data that is
not readily evident
 Human analysts may take weeks to discover useful
information
 Much of the data is never analyzed at all
[Chart: "The Data Gap" — total new disk storage (TB) since 1995
grows steeply toward 3,500,000+ TB by 1999, while the number of
analysts stays nearly flat.]
From: R. Grossman, C. Kamath, V. Kumar, “Data Mining for Scientific and Engineering Applications”
What is Data Mining?
Many Definitions
 Non-trivial extraction of implicit, previously
unknown and potentially useful information from
data
 Exploration & analysis, by automatic or semi-automatic
means, of large quantities of data in order to discover
meaningful patterns.
What is (not) Data Mining?
What is NOT Data Mining?
– Look up a phone number in a phone directory
– Query a Web search engine for information about "Amazon"
What IS Data Mining?
– Discovering that certain names are more widespread in certain
US locations (O'Brien, O'Rourke, O'Reilly, ... in the Boston area)
– Grouping together similar documents returned by a search
engine according to their context (e.g., Amazon rainforest,
Amazon.com)
Origins of Data Mining
 Draws ideas from machine learning/AI, pattern
recognition, statistics, and database systems
 Traditional Techniques may be unsuitable due to
 Enormity of data
 High dimensionality of data
 Heterogeneous, distributed
nature of data
[Venn diagram: Data Mining at the intersection of Statistics/AI,
Machine Learning/Pattern Recognition, and Database systems.]
Data Mining Tasks
Prediction Methods
 Use some variables to predict
unknown or future values of other
variables.
Description Methods
 Find human-interpretable patterns
that describe the data.
From [Fayyad, et.al.] Advances in Knowledge Discovery and Data Mining, 1996
Data Mining Tasks...
 Classification
 Clustering
 Association Rule Discovery
 Sequential Pattern Discovery
 Regression
 Deviation Detection
Data mining task taxonomy:
- Predictive: Classification, Regression, Deviation Detection
- Descriptive: Clustering, Sequence Pattern Discovery,
  Association rules
Classification: Definition
Given a collection of records (training set)
 Each record contains a set of attributes; one of the
attributes is the class.
Find a model for the class attribute as a function of the
values of the other attributes.
Goal: previously unseen records should be assigned a class as
accurately as possible.
 A test set is used to determine the accuracy of the
model. Usually, the given data set is divided into
training and test sets, with training set used to
build the model and test set used to validate it.
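One concrete (if deliberately simplistic) learner that fits this definition is the 1R rule learner, sketched below on hypothetical records shaped like a tax-audit example (categorical attributes only; a continuous attribute such as income would need discretization first):

```python
from collections import Counter, defaultdict

# Hypothetical training records: (refund, marital_status, class)
train = [
    ("Yes", "Single", "No"),    ("No", "Married", "No"),
    ("No", "Single", "No"),     ("Yes", "Married", "No"),
    ("No", "Divorced", "Yes"),  ("No", "Married", "No"),
    ("Yes", "Divorced", "No"),  ("No", "Single", "Yes"),
    ("No", "Married", "No"),    ("No", "Single", "Yes"),
]

def one_r(records, attrs=(0, 1)):
    """1R: pick the single attribute whose per-value majority class
    makes the fewest training errors; predict with that rule."""
    majority = Counter(r[-1] for r in records).most_common(1)[0][0]
    best = None
    for a in attrs:
        by_value = defaultdict(Counter)
        for rec in records:
            by_value[rec[a]][rec[-1]] += 1
        errors = sum(sum(c.values()) - max(c.values())
                     for c in by_value.values())
        rule = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
        if best is None or errors < best[0]:
            best = (errors, a, rule)
    _, a, rule = best
    # Unseen attribute values fall back to the overall majority class
    return lambda rec: rule.get(rec[a], majority)

model = one_r(train)
print(model(("Yes", "Married")))  # predicts the majority class "No"
```

Accuracy would then be measured by applying `model` to a held-out test set, exactly as the slide describes.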
Classification Example

Training Set:
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Test Set:
Refund  Marital Status  Taxable Income  Cheat
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?

Training Set → Learn Classifier → Model; the model is then
applied to the Test Set.
Neural Networks for Data Mining
Computational Neural Networks: a computational approach inspired
by the architecture of the biological nervous system.
[Figure: an actual neuron vs. a crude model of a neuron, with
soma, dendrites, axon, and synapse labeled.]
Cat Neural Probe to Study Response
[Figure: a probe into a cat neuron records the neural response on
an oscilloscope. Stimulus: light intensity over a time axis
(light shone, then no light); response: spikes in millivolts.]
The Perceptron Model
[Figure: inputs I1..I4 with weights w1..w4 feed a weighted Sum,
followed by a Threshold, producing output O.]
Example Weights: And & Or Problems
A single perceptron outputs O = 1 if I1·w1 + I2·w2 > T, else 0.
Equivalently, I1·w1 + I2·w2 + 1·(−T) > 0, treating the threshold
as a bias weight.

I1  I2 | AND  OR  NOT I1
0   0  |  0    0    1
0   1  |  0    1    1
1   0  |  0    1    0
1   1  |  1    1    0

Example weights solving each problem:
         w1   w2   threshold T
AND       1    1    1.5
OR        1    1    0.5
NOT I1   -1    0   -0.5
Weight Adjustments
Write the threshold as a bias weight: V = I1·w1 + I2·w2 + 1·w3.
For the AND table, the weights must satisfy:
I1  I2  O
0   0   0:  0·w1 + 0·w2 + 1·w3 < 0  ⇒  w3 < 0
0   1   0:  0·w1 + 1·w2 + 1·w3 < 0  ⇒  w2 + w3 < 0
1   0   0:  1·w1 + 0·w2 + 1·w3 < 0  ⇒  w1 + w3 < 0
1   1   1:  1·w1 + 1·w2 + 1·w3 ≥ 0  ⇒  w1 + w2 + w3 ≥ 0
3-D and 2-D Plot of AND Table
The problem is to find a plane that separates the "on" circle
(output 1) from the "off" circles (output 0).
Training Procedure
1. First assign any values to w1, w2 and w3.
2. Using the current weight values w1, w2, w3 and the next
training item's inputs I1 and I2, compute:
V = I1·w1 + I2·w2 + 1·w3
3. If V ≥ 0, set the computed output C to 1, else to 0.
4. If the computed output C is not the same as the current
training item's output O, adjust the weights.
5. Repeat steps 2-4. If you run out of training items, start
again with the first training item. Stop when no weight changes
through one complete training cycle.
Gradient Descent Algorithm
w1_next = w1_current − I1·(C − O)
w2_next = w2_current − I2·(C − O)
w3_next = w3_current − (C − O)
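The training procedure and update rule above can be sketched in Python for the AND problem (learning rate 1, bias weight w3 standing in for the threshold):

```python
def train_perceptron(samples, epochs=20):
    """Perceptron training: compute V, threshold it to C, and apply
    w <- w - I*(C - O) whenever C disagrees with the target O."""
    w1 = w2 = w3 = 0.0
    for _ in range(epochs):
        changed = False
        for i1, i2, o in samples:
            v = i1 * w1 + i2 * w2 + 1 * w3
            c = 1 if v >= 0 else 0
            if c != o:
                w1 -= i1 * (c - o)
                w2 -= i2 * (c - o)
                w3 -= (c - o)
                changed = True
        if not changed:   # one full cycle with no weight change
            break
    return w1, w2, w3

AND = [(0, 0, 0), (0, 1, 0), (1, 0, 0), (1, 1, 1)]
w1, w2, w3 = train_perceptron(AND)
print(w1, w2, w3)  # one valid separating plane for AND
```

Because AND is linearly separable, the perceptron convergence theorem guarantees this loop terminates with weights that classify all four rows correctly.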
Linearly vs. Non-Linearly Separable
The XOR Problem is Linearly Nonseparable
The Back Propagation Model
[Figure: one backprop unit is the same as a perceptron — inputs
I1..I4, weights w1..w4, a weighted Sum, a Threshold, and output O.
Stacking such units in layers (e.g., five 6-input units, four
5-input units, three 4-input units) gives a backpropagation
network with input, hidden, and output layers.]
Advantage of Backprop over Perceptron
[Figure: clusters of character samples plotted against two
features. Input: clusters based on the two features; Layer 1
draws decision boundaries; Layer 2 determines decision regions;
Layer 3 groups the decision regions.]
Backprop Learning Algorithm
1. Assign random values to all the weights
2. Choose a pattern from the training set (similar to perceptron).
3. Propagate the signal through to get final output (similar to
perceptron).
4. Compute the error for the output layer (similar to the
perceptron).
5. Compute the errors in the preceding layers by propagating the
error backwards.
6. Change the weight between neuron A and each neuron B in
another layer by an amount proportional to the observed output
of B and the error of A.
7. Repeat step 2 for next training sample.
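The steps above can be sketched as a tiny batch-trained network on XOR (which a single perceptron cannot learn). This assumes NumPy is available; the layer sizes, learning rate, and epoch count are illustrative choices, not fixed by the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

# XOR inputs and targets
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], float)
T = np.array([[0], [1], [1], [0]], float)

# 2-4-1 network with sigmoid units
W1 = rng.normal(0, 1, (2, 4)); b1 = np.zeros(4)
W2 = rng.normal(0, 1, (4, 1)); b2 = np.zeros(1)
sig = lambda z: 1 / (1 + np.exp(-z))

losses = []
for _ in range(5000):
    H = sig(X @ W1 + b1)              # step 3: forward propagate
    Y = sig(H @ W2 + b2)
    err = Y - T                       # step 4: output-layer error
    losses.append(float((err ** 2).mean()))
    dZ2 = err * Y * (1 - Y)           # gradient through output sigmoid
    dZ1 = (dZ2 @ W2.T) * H * (1 - H)  # step 5: propagate error backwards
    W2 -= 0.5 * H.T @ dZ2; b2 -= 0.5 * dZ2.sum(0)   # step 6: weight updates
    W1 -= 0.5 * X.T @ dZ1; b1 -= 0.5 * dZ1.sum(0)

print(round(losses[0], 3), "->", round(losses[-1], 3))
```

The printed pair shows the mean squared error before and after training; it should fall as the hidden layer carves out the two XOR decision regions.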
Application: Needs Enough Training
[Figure: with a small training set, many decision boundaries are
consistent with the data; a large training set constrains the
decision boundary much more.]
Speech Recognition
Data Mining using Neural Networks
Stock Market Predictor N-Net
Classification: Application 1
 Direct Marketing
 Goal: Reduce cost of mailing by targeting a set of
consumers likely to buy a new cell-phone product.
 Approach:
• Use the data for a similar product introduced before.
• We know which customers decided to buy and which
decided otherwise. This {buy, don’t buy} decision forms the
class attribute.
• Collect various demographic, lifestyle, and company-interaction related information about all such customers.
– Type of business, where they stay, how much they earn, etc.
• Use this information as input attributes to learn a classifier
model.
From [Berry & Linoff] Data Mining Techniques, 1997
Classification: Application 2
Fraud Detection
 Goal: Predict fraudulent cases in credit card
transactions.
 Approach:
• Use credit card transactions and the information
on its account-holder as attributes.
– When does a customer buy, what does he buy, how
often he pays on time, etc
• Label past transactions as fraud or fair
transactions. This forms the class attribute.
• Learn a model for the class of the transactions.
• Use this model to detect fraud by observing credit
card transactions on an account.
Clustering Definition
Given a set of data points, each having a
set of attributes, and a similarity measure
among them, find clusters such that
 Data points in one cluster are more similar to
one another.
 Data points in separate clusters are less
similar to one another.
Similarity Measures:
 Euclidean Distance if attributes are
continuous.
 Other Problem-specific Measures.
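A common way to realize this definition with Euclidean distance is plain k-means, sketched below on hypothetical 2-D points (the initialization and iteration count are illustrative):

```python
import random

def kmeans(points, k, iters=20, seed=1):
    """Plain k-means: assign each point to its nearest center by
    squared Euclidean distance, then move centers to cluster means."""
    random.seed(seed)
    centers = random.sample(points, k)
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda c: sum((a - b) ** 2
                                            for a, b in zip(p, centers[c])))
            clusters[nearest].append(p)
        # Recompute each center as the mean of its cluster
        centers = [tuple(sum(xs) / len(cl) for xs in zip(*cl)) if cl
                   else centers[c]
                   for c, cl in enumerate(clusters)]
    return centers, clusters

pts = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
centers, clusters = kmeans(pts, 2)
print(sorted(len(c) for c in clusters))  # [3, 3]
```

On this data the algorithm minimizes intracluster distances and maximizes intercluster separation by splitting the points into the two obvious groups.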
Illustrating Clustering
Euclidean Distance Based Clustering in 3-D space.
Intracluster distances
are minimized
Intercluster distances
are maximized
Clustering: Application 1
Market Segmentation:
 Goal: subdivide a market into distinct subsets
of customers where any subset may
conceivably be selected as a market target to
be reached with a distinct marketing mix.
 Approach:
• Collect different attributes of customers based on
their geographical and lifestyle related
information.
• Find clusters of similar customers.
• Measure the clustering quality by observing
buying patterns of customers in same cluster vs.
those from different clusters.
Clustering: Application 2
Document Clustering:
 Goal: To find groups of documents that are
similar to each other based on the important
terms appearing in them.
 Approach: To identify frequently occurring
terms in each document. Form a similarity
measure based on the frequencies of different
terms. Use it to cluster.
 Gain: Information Retrieval can utilize the
clusters to relate a new document or search
term to clustered documents.
Illustrating Document Clustering
Clustering Points: 3204 Articles of Los Angeles Times.
Similarity Measure: How many words are common in these documents
(after some word filtering).

Category        Total Articles  Correctly Placed
Financial       555             364
Foreign         341             260
National        273             36
Metro           943             746
Sports          738             573
Entertainment   354             278
Association Rule Discovery: Definition
Given a set of records, each of which contains some number of
items from a given collection:
 Produce dependency rules which will predict
occurrence of an item based on occurrences of other
items.
 Can identify potential cross-selling opportunities
TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
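The support and confidence behind such rules can be computed directly from the five baskets in the table:

```python
baskets = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

def support(itemset):
    """Fraction of baskets containing every item of `itemset`."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent):
    """Estimated P(consequent | antecedent) over the baskets."""
    return support(antecedent | consequent) / support(antecedent)

print(confidence({"Milk"}, {"Coke"}))            # ≈ 0.75
print(confidence({"Diaper", "Milk"}, {"Beer"}))  # ≈ 0.667
```

A full miner such as Apriori enumerates candidate itemsets and keeps only rules above minimum support and confidence thresholds; this sketch just scores the two rules shown on the slide.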
Association Rule Discovery: Application 1
Marketing and Sales Promotion:
 Let the rule discovered be
{Bagels, … } --> {Potato Chips}
 Potato Chips as consequent => Can be used
to determine what should be done to boost
its sales.
 Bagels in the antecedent => Can be used to
see which products would be affected if the
store discontinues selling bagels.
 Bagels in antecedent and Potato chips in
consequent => Can be used to see what
products should be sold with Bagels to
promote sale of Potato chips!
Association Rule Discovery: Application 2
Supermarket shelf management.
 Goal: To identify items that are bought
together by sufficiently many customers.
 Approach: Process the point-of-sale data
collected with barcode scanners to find
dependencies among items.
 A classic rule -• If a customer buys diaper and milk, then he is very
likely to buy beer.
• So, don’t be surprised if you find six-packs stacked
next to diapers!
Regression
 Predict a value of a given continuous valued
variable based on the values of other variables,
assuming a linear or nonlinear model of
dependency.
 Greatly studied in statistics, neural network
fields.
 Examples:
 Predicting sales amounts of new product based on
advertising expenditure.
 Predicting wind velocities as a function of
temperature, humidity, air pressure, etc.
 Time series prediction of stock market indices.
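For the simplest linear case, ordinary least squares has a closed form; a minimal sketch, with hypothetical advertising/sales figures:

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b: slope a is the
    covariance of x and y divided by the variance of x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return a, my - a * mx

# Hypothetical: advertising spend (k$) vs. sales (k units)
ad = [10, 20, 30, 40]
sales = [25, 45, 65, 85]
a, b = fit_line(ad, sales)
print(a, b)         # 2.0 5.0 on this exactly linear toy data
print(a * 50 + b)   # predicted sales at 50k spend: 105.0
```

Nonlinear dependency models (polynomials, neural networks) generalize the same idea of minimizing squared prediction error.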
Deviation/Anomaly Detection
Detect significant deviations from normal
behavior
Applications:
 Credit Card Fraud Detection
 Network Intrusion
Detection
Typical network traffic at University level may reach over 100 million
connections per day
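A bare-bones deviation detector flags values far from the mean in standard-deviation units; the traffic counts below are hypothetical:

```python
import math

def z_score_outliers(values, threshold=2.0):
    """Flag values whose distance from the mean exceeds `threshold`
    standard deviations. (A robust detector would use median/MAD,
    since an extreme outlier inflates the mean and std itself.)"""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [v for v in values if abs(v - mean) > threshold * std]

# Hypothetical daily connection counts with one anomalous burst
traffic = [98, 102, 101, 99, 100, 97, 103, 100, 990]
print(z_score_outliers(traffic))  # [990]
```

At University-scale traffic volumes the same idea is applied per host, per port, or per time window rather than to one global series.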
Challenges of Data Mining
Scalability
Dimensionality
Complex and Heterogeneous Data
Data Quality
Data Ownership and Distribution
Privacy Preservation
Streaming Data