Transcript Data Mining

DATA MINING:
INTRODUCTION
Instructor: Dr. Chun Yu
School of Statistics
Jiangxi University of Finance and Economics
Fall 2016
Syllabus
• Textbook: P. Tan, M. Steinbach, and V. Kumar,
“Introduction to Data Mining”. 2nd edition, 2010
• Chapters covered:
Chapter 1: Introduction
Chapter 2: Data
Chapter 3: Exploring Data
Chapter 4: Classification
Chapter 6: Association Analysis
Chapter 8: Cluster Analysis
Chapter 10: Anomaly Detection (Outlier Detection)
Additional topic: Regression
• Statistical computing software: Usage of R will be
introduced in class
Class Schedule and Grading Rule
Week  Chapter and topic
1     Introduction
2     Data
3     Exploring data: summary
4     Exploring data: visualization
5     Simple linear regression
6     Multiple linear regression
7     Classification: basic concepts
8     Classification: decision tree
9     Association rule: Apriori
10    Association rule: rule generation
11    Clustering: K-means
12    Clustering: Hierarchical
13    Clustering: DBSCAN
14    Outlier detection: Statistical
15    Outlier detection: other methods
16    Project presentation
17    Final exam
Grading rule: homework 20% + project 10% + exam 70%
Outline
• Why data mining?
• What is data mining?
• Data mining: on what kinds
of data?
• Data mining tasks
• About R
Why Data Mining? Commercial Viewpoint
• Lots of data is being collected
and warehoused
• Web data, e-commerce
• Purchases at department/
grocery stores
• Bank/Credit Card transactions
• Computers have become cheaper and more
powerful
• Competitive pressure is strong:
• Provide better, customized services
Why Data Mining? Scientific Viewpoint
• Data collected and stored at
enormous speeds (GB/hour)
• remote sensors on a satellite
• telescopes scanning the skies
• microarrays generating gene
expression data
• scientific simulations
generating terabytes of data
• Traditional techniques infeasible for
raw data
• Data mining may help scientists
• in classifying and partitioning data
• in Hypothesis Formation
Mining Large Data Sets - Motivation
• There is often information “hidden” in the data that is
not readily evident
• Human analysts may take
weeks to discover useful
information
• Much of the data is never
analyzed at all
What Is Data Mining?
• Data Mining: Extraction of interesting patterns or knowledge
from huge amounts of data
• Alternative names
• knowledge extraction
• data/pattern analysis
• information harvesting
• business intelligence
• Exploration & analysis,
by automatic or
semi-automatic means, of
large quantities of data
in order to discover
meaningful patterns
What is (not) Data Mining?
• 1. Look up a phone number in a phone directory
• 2. Certain names are more prevalent in certain US
locations (O’Brien, O’Rourke, O’Reilly… in the Boston area)
• 3. Query a Web search engine for information about
“Amazon”
• 4. Group together similar documents returned by a search
engine according to their context (e.g. Amazon rainforest,
Amazon.com)
What is (not) Data Mining?
• What is not data mining?
• Look up a phone number in a phone directory
• Query a Web search engine for information about “Amazon”
• What is data mining?
• Certain names are more prevalent in certain US locations
(O’Brien, O’Rourke, O’Reilly… in the Boston area)
• Group together similar documents returned by a search
engine according to their context (e.g. Amazon rainforest,
Amazon.com)
Origins of Data Mining
• Draws ideas from machine learning/pattern recognition,
statistics, and database systems
• Traditional Techniques
may be unsuitable due to
• Enormity of data
• High dimensionality
of data
• Heterogeneous,
distributed nature of data
[Figure: Data Mining shown at the overlap of Statistics, Machine Learning/Pattern Recognition, and Database systems]
Data Mining: On What Kinds of Data?
• Database data
• Data warehouses
• Transactional data
• Other kinds of data
• Data streams
• Time-related or sequence data
• Structure data, graphs, and social networks
• Spatial data
• Multimedia database
• The World-Wide Web (Internet)
Relational Database
• A relational database is a collection of tables, each of
which is assigned a unique name
• Each table consists of a set of attributes (columns or
fields) and usually stores a large set of tuples (records or
rows)
• Each tuple in a relational table represents an object
identified by a unique key and described by a set of
attribute values
Relational Database Example
Customer
cust_ID  name   address          age  income  credit_inf  category  …
C1       Smith  1223 Lake Ave,…  31   $78000  1           3         …
…        …      …                …    …       …           …         …

Item
item_id  name    brand   category         type    price  made  cost
13       HDTV    Vizion  high resolution  TV      $988   USA   $600
18       laptop  Dell    computer         laptop  $1369  USA   $983
…        …       …       …                …       …      …     …

Purchases
trans_id  customer_id  employ_id  date        meth_paid  amount
T100      C1           E55        03/21/2014  Visa       $1357
…         …            …          …           …          …
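The link between the tables can be sketched in code. The following is a minimal illustration, assuming Python's built-in sqlite3 module and a cut-down version of the Customer and Purchases tables (only some columns are kept, and the column types are assumptions). It shows how the key C1 ties a purchase record back to the customer it belongs to.

```python
import sqlite3

# In-memory database; table and column names follow the example tables,
# but the column types and the reduced column set are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer ("
    "cust_ID TEXT PRIMARY KEY, name TEXT, age INTEGER, income INTEGER)"
)
conn.execute(
    "CREATE TABLE purchases ("
    "trans_id TEXT PRIMARY KEY, "
    "customer_id TEXT REFERENCES customer(cust_ID), "
    "employ_id TEXT, meth_paid TEXT, amount INTEGER)"
)

# One tuple (row) per table, taken from the example.
conn.execute("INSERT INTO customer VALUES ('C1', 'Smith', 31, 78000)")
conn.execute("INSERT INTO purchases VALUES ('T100', 'C1', 'E55', 'Visa', 1357)")

# Join the two tables on the customer key.
rows = conn.execute(
    "SELECT c.name, p.trans_id, p.amount "
    "FROM purchases AS p JOIN customer AS c ON c.cust_ID = p.customer_id"
).fetchall()
print(rows)
```

Each returned row combines attribute values from both tables for the same customer key.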
Data Warehouses and Example
• A data warehouse is usually modeled by a
multidimensional data structure, called a data cube
• Example: A data cube for a company
• The cube has three dimensions:
• address (with city values Chicago, New York, Toronto,
Vancouver)
• time (with quarter values Q1, Q2, Q3, Q4)
• item (with item type values home entertainment,
computer, phone, security)
A multidimensional data cube
A multidimensional data cube: drill down
and roll up
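Roll up and drill down can be sketched on a tiny cube. The example below is an illustration only: the sales figures are hypothetical, the cube is stored as a plain dictionary, and `roll_up` is a helper name introduced here, not part of any library. Rolling up sums the cells over the dimensions that are dropped; drilling down is the reverse step of going back to a view that keeps more dimensions.

```python
from collections import defaultdict

# Hypothetical sales figures keyed by (address, time, item).
cube = {
    ("Chicago", "Q1", "computer"): 10,
    ("Chicago", "Q2", "computer"): 12,
    ("Chicago", "Q1", "phone"): 4,
    ("Toronto", "Q1", "computer"): 7,
    ("Toronto", "Q2", "phone"): 3,
}
DIMENSIONS = ("address", "time", "item")


def roll_up(cells, keep):
    """Sum the cube over every dimension that is not listed in `keep`."""
    positions = [DIMENSIONS.index(name) for name in keep]
    totals = defaultdict(int)
    for key, value in cells.items():
        totals[tuple(key[i] for i in positions)] += value
    return dict(totals)


# Coarse view: total per city (rolled up over time and item).
by_city = roll_up(cube, keep=("address",))
# Finer view: total per city and quarter (a drill down from by_city).
by_city_quarter = roll_up(cube, keep=("address", "time"))

print(by_city)
print(by_city_quarter)
```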
Transactional or Market Basket Data
• Each record in a transactional database captures a
transaction, such as a customer’s purchase, a flight
booking, or a user’s clicks on a web page
• A transaction typically includes a unique transaction
identity number (trans_ID) and a list of the items making
up the transaction, such as the items purchased in the
transaction
• Transactions can be stored in a table, with one record per
transaction.
Transactional Database and Market
Basket Data Analysis
trans_id  items
1         Bread, Soda, Milk
2         Beer, Bread
3         Beer, Soda, Diaper, Milk
4         Beer, Bread, Diaper, Milk
5         Soda, Diaper, Milk
…         …
• Market basket data analysis would enable you to bundle
groups of items together as a strategy for boosting sales
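As a rough sketch of how bundling candidates might be found, the code below counts how often each pair of items appears in the same transaction, using the five transactions listed above. Counting pairs is only one simple starting point chosen for illustration; it is an assumption here, not the method prescribed by the course.

```python
from collections import Counter
from itertools import combinations

# The five transactions from the table above.
transactions = [
    {"Bread", "Soda", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Soda", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Soda", "Diaper", "Milk"},
]

# Count every unordered pair of items bought together.
pair_counts = Counter()
for items in transactions:
    for pair in combinations(sorted(items), 2):
        pair_counts[pair] += 1

# Pairs bought together most often are natural bundling candidates.
for pair, count in pair_counts.most_common(3):
    print(pair, count)
```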
Time-related or Sequence Data
• Historical records
• Stock exchange data
• Time-series data
• Biological sequence data
Spatial Data
• Maps
• Weather data (precipitation,
temperature, pressure)
• The design of buildings
Multimedia Data
• Text
• Image
• Video data
• Audio data
Data Mining Tasks
• Prediction Tasks
• Use some variables to predict unknown or future values
of other variables.
• Description Tasks
• Find human-interpretable patterns that describe the
data.
Data Mining Tasks...
• Classification [Predictive]
• Clustering [Descriptive]
• Association Rule Discovery [Descriptive]
• Regression [Predictive]
• Outlier Detection [Predictive]
Classification: Definition
• Given a collection of records (training set)
• Each record contains a set of attributes; one of the
attributes is the class.
• Find a model for class attribute as a function of the
values of other attributes.
• Goal: previously unseen records should be assigned a
class as accurately as possible.
• A test set is used to determine the accuracy of the
model. Usually, the given data set is divided into
training and test sets, with training set used to build
the model and test set used to validate it.
Classification Example
Training Set
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Test Set
Refund  Marital Status  Taxable Income  Cheat
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?

[Figure: Training Set -> Learn Classifier -> Model; the model is then applied to the Test Set]
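The build-then-validate idea can be sketched with the ten labelled records of this example. Because the test records on the slide have unknown classes, this sketch holds out the last three labelled records instead, and it uses a 1-nearest-neighbour rule on taxable income as a stand-in model. Both choices are assumptions made for illustration; the course itself covers decision trees.

```python
# The ten labelled records: (refund, marital status, income in K, cheat).
records = [
    ("Yes", "Single", 125, "No"),
    ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"),
    ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"),
    ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"),
    ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"),
    ("No", "Single", 90, "Yes"),
]

# Hold out the last three records to validate the model (an assumed split).
train, test = records[:7], records[7:]


def predict(income, training):
    """Return the class of the training record with the closest income."""
    nearest = min(training, key=lambda rec: abs(rec[2] - income))
    return nearest[3]


# Accuracy = fraction of held-out records whose class is predicted correctly.
correct = sum(predict(rec[2], train) == rec[3] for rec in test)
accuracy = correct / len(test)
print(f"accuracy on held-out records: {accuracy:.2f}")
```

With only three held-out records the accuracy figure says very little; the point is the workflow of learning on one part of the data and measuring on another.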
Classification: Application
• Fraud Detection
• Goal: Predict fraudulent cases in credit card
transactions.
• Approach:
• Use credit card transactions and the information on the
account holder as attributes.
• When does a customer buy, what does he buy, how
often does he pay on time, etc.
• Label past transactions as fraud or fair transactions.
This forms the class attribute.
• Learn a model for the class of the transactions.
• Use this model to detect fraud by observing credit card
transactions on an account.
Clustering Definition
• Given a set of data points, each having a set of
attributes, and a similarity measure among them, find
clusters such that
• Data points in one cluster are more similar to one
another.
• Data points in separate clusters are less similar to
one another.
• Similarity Measures:
• Euclidean Distance if attributes are continuous.
• Other Problem-specific Measures.
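A small sketch of the Euclidean measure for continuous attributes follows. The two clusters are hypothetical points in 3-D space, and `mean_intra` and `mean_inter` are helper names introduced here. A good clustering should give a small average distance inside a cluster and a large average distance between clusters.

```python
import math
from itertools import combinations, product

# Two hypothetical clusters of points in 3-D space.
cluster_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
cluster_b = [(5.0, 5.0, 5.0), (6.0, 5.0, 5.0), (5.0, 6.0, 5.0)]


def mean_intra(cluster):
    """Average Euclidean distance between points of the same cluster."""
    pairs = list(combinations(cluster, 2))
    return sum(math.dist(p, q) for p, q in pairs) / len(pairs)


def mean_inter(first, second):
    """Average Euclidean distance between points of two different clusters."""
    pairs = list(product(first, second))
    return sum(math.dist(p, q) for p, q in pairs) / len(pairs)


print("intra A:", mean_intra(cluster_a))
print("intra B:", mean_intra(cluster_b))
print("inter A-B:", mean_inter(cluster_a, cluster_b))
```

`math.dist` requires Python 3.8 or later.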
Illustrating Clustering
• Euclidean Distance Based Clustering in 3-D space
Intracluster distances
are minimized
Intercluster distances
are maximized
Association Rule Discovery: Definition
• Given a set of records each of which contains some
number of items from a given collection
• Produce dependency rules
which will predict occurrence
of an item based on occurrences
of other items.
TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk
Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
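How strong are these two rules? A common way to measure this uses support and confidence. These two measures are not defined on this slide, so the definitions in the comments below are stated as assumptions; they are the standard ones. The sketch evaluates both rules on the five transactions above.

```python
# The five transactions from the table above.
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]


def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(lhs, rhs):
    """Of the transactions containing `lhs`, the fraction also containing `rhs`."""
    return support(lhs | rhs) / support(lhs)


for lhs, rhs in [({"Milk"}, {"Coke"}), ({"Diaper", "Milk"}, {"Beer"})]:
    print(sorted(lhs), "-->", sorted(rhs),
          "support", support(lhs | rhs),
          "confidence", round(confidence(lhs, rhs), 2))
```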
Association Rule Discovery: Application
• Supermarket shelf management.
• Goal: To identify items that are bought together by
sufficiently many customers.
• Approach: Process the point-of-sale data collected with
barcode scanners to find dependencies among items.
• A classic rule:
• If a customer buys diaper and milk, then he is very
likely to buy beer.
• So, don’t be surprised if you find six-packs stacked next
to diapers!
Regression
• Predict a value of a given continuous valued variable
based on the values of other variables, assuming a
linear or nonlinear model of dependency.
• Widely studied in statistics and in the neural network field.
• Examples:
• Predicting sales amounts of new product based on
advertising expenditure.
• Predicting wind velocities as a function of
temperature, humidity, air pressure, etc.
• Time series prediction of stock market indices.
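For the linear case with a single predictor, a least-squares fit can be sketched in a few lines. The advertising and sales numbers below are hypothetical, and `fit_line` is a helper name introduced for this illustration. It computes the usual least-squares slope and intercept of the line y = intercept + slope * x.

```python
# Hypothetical data: advertising expenditure (x) and sales amount (y).
advertising = [1.0, 2.0, 3.0, 4.0]
sales = [3.0, 5.0, 7.0, 9.0]


def fit_line(xs, ys):
    """Least-squares estimates for the line y = intercept + slope * x."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


intercept, slope = fit_line(advertising, sales)

# Predict sales for a new, unseen advertising budget.
new_budget = 5.0
print("predicted sales:", intercept + slope * new_budget)
```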
Outlier Detection
• Detect significant deviations from normal behavior
• Applications:
• Credit Card Fraud Detection
• Network Intrusion Detection
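One simple statistical way to flag a "significant deviation" is the z-score, sketched below. The transaction amounts and the cut-off of 2 standard deviations are assumptions chosen for illustration; with so few values a stricter cut-off such as 3 could never be reached.

```python
import statistics

# Hypothetical transaction amounts; one value is far from the rest.
amounts = [52, 48, 50, 51, 49, 50, 47, 500]

mean = statistics.mean(amounts)
spread = statistics.stdev(amounts)  # sample standard deviation

# Assumed cut-off: flag values more than 2 standard deviations from the mean.
THRESHOLD = 2.0
outliers = [a for a in amounts if abs(a - mean) / spread > THRESHOLD]
print(outliers)
```

Note that the extreme value itself inflates the mean and the standard deviation, which is one reason other outlier detection methods exist.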
What is R?
• R is a free software environment for
statistical computing and graphics
(http://www.r-project.org/)
• R can be easily extended with more than 4,000 packages
available on CRAN (http://cran.r-project.org/)
• R manuals on CRAN
• An Introduction to R
• The R Language Definition
• R Data Import/Export
Thank You!