CS186: Introduction to Database Systems

Download Report

Transcript CS186: Introduction to Database Systems

EECS 647: Introduction to
Database Systems
Instructor: Luke Huan
Spring 2007
Administrative

Final project is due May 9th


I will not accept late hand-in for the final project
Start early



you need to take quick action if you haven’t divided the
tasks yet
For those in the implementation phase (this is the right phase
that you should be in), thinking about testing plans
Class review is April 30th.


Provide your feedbacks
If you like the course, recommend it to other students
7/7/2015
Luke Huan Univ. of Kansas
2
Review What Is Data Mining?

Data mining (knowledge discovery from data)

Extraction of interesting (non-trivial, implicit, previously
unknown and potentially useful) patterns or knowledge from
huge amount of data

What is classification?


Predict the value of unseen data
What is clustering

Grouping similar objects into groups
7/7/2015
Luke Huan Univ. of Kansas
3
Today’s Topic

Continue the introduction to data mining

Sequential patterns

Regression

Knowing the nature of your data

Discover association in your data
7/7/2015
Luke Huan Univ. of Kansas
4
Sequential Pattern Discovery: Definition

Given is a set of objects, with each object associated with its own timeline of events,
find rules that predict strong sequential dependencies among different events.
(A B)

(C)
(D E)
Rules are formed by first disovering patterns. Event occurrences in the patterns are
governed by timing constraints.
(A B)
<= xg
(C)
(D E)
>ng
<= ws
<= ms
7/7/2015
Luke Huan Univ. of Kansas
5
Sequential Pattern Discovery: Examples


In telecommunications alarm logs,
 (Inverter_Problem Excessive_Line_Current)
(Rectifier_Alarm) --> (Fire_Alarm)
In point-of-sale transaction sequences,
 Computer Bookstore:
(Intro_To_Visual_C) (C++_Primer) -->
(Perl_for_dummies,Tcl_Tk)
 Athletic Apparel Store:
(Shoes) (Racket, Racketball) --> (Sports_Jacket)
7/7/2015
Luke Huan Univ. of Kansas
6
Regression



Predict a value of a given continuous valued variable based on
the values of other variables, assuming a linear or nonlinear
model of dependency.
Greatly studied in statistics, neural network fields.
Examples:



7/7/2015
Predicting sales amounts of new product based on advetising
expenditure.
Predicting wind velocities as a function of temperature,
humidity, air pressure, etc.
Time series prediction of stock market indices.
Luke Huan Univ. of Kansas
7
Deviation/Anomaly Detection


Detect significant deviations from normal behavior
Applications:

Credit Card Fraud Detection

Network Intrusion
Detection
Typical network traffic at University level may reach over 100 million connections per day
7/7/2015
Luke Huan Univ. of Kansas
8
Challenges of Data Mining







Scalability
Dimensionality
Complex and Heterogeneous Data
Data Quality
Data Ownership and Distribution
Privacy Preservation
Streaming Data
7/7/2015
Luke Huan Univ. of Kansas
9
Knowing the Nature of Your Data



Data types: nominal, ordinal, interval, ratio.
Data quality
Data preprocessing
7/7/2015
Luke Huan Univ. of Kansas
10
What is Data?



Collection of data objects and
their attributes
An attribute is a property or
characteristic of an object
 Examples: eye color of a
person, temperature, etc.
 Attribute is also known as
variable, field,
Objects
characteristic, or feature
A collection of attributes
describe an object
 Object is also known as
record, point, case, sample,
entity, or instance
7/7/2015
Luke Huan Univ. of Kansas
Attributes
Tid Refund Marital
Status
Taxable
Income Cheat
1
Yes
Single
125K
No
2
No
Married
100K
No
3
No
Single
70K
No
4
Yes
Married
120K
No
5
No
Divorced 95K
6
No
Married
7
Yes
Divorced 220K
8
No
Single
85K
Yes
9
No
Married
75K
No
10
No
Single
90K
Yes
60K
Yes
No
No
10
11
Attribute Values

Attribute values are numbers or symbols assigned to an
attribute

Distinction between attributes and attribute values

Same attribute can be mapped to different attribute
values


Example: height can be measured in feet or meters
Different attributes can be mapped to the same set of
values
Example: Attribute values for ID and age are integers
 But properties of attribute values can be different


7/7/2015
ID has no limit but age has a maximum and minimum value
Luke Huan Univ. of Kansas
12
Types of Attributes

There are different types of attributes

Nominal


Ordinal


Examples: calendar dates, temperatures in Celsius or
Fahrenheit.
Ratio

7/7/2015
Examples: rankings (e.g., taste of potato chips on a scale
from 1-10), grades, height in {tall, medium, short}
Interval


Examples: ID numbers, eye color, zip codes
Examples: temperature in Kelvin, length, time, counts
Luke Huan Univ. of Kansas
13
Properties of Attribute Values

The type of an attribute depends on which of the
following properties it possesses:








7/7/2015
Distinctness:
Order:
Addition:
Multiplication:
= 
< >
+ */
Nominal attribute: distinctness
Ordinal attribute: distinctness & order
Interval attribute: distinctness, order & addition
Ratio attribute: all 4 properties
Luke Huan Univ. of Kansas
14
Properties of Attribute Values
Attribute Type
Description
Examples
Nominal
The values of a nominal attribute are just
different names, i.e., nominal attributes
provide only enough information to
distinguish one object from another. (=, )
zip codes, employee ID
numbers, eye color, sex:
{male, female}
Ordinal
The values of an ordinal attribute provide
enough information to order objects. (<, >)
hardness of minerals,
{good, better, best},
grades, street numbers
Interval
For interval attributes, the differences
between values are meaningful, i.e., a unit
of measurement exists.
(+, - )
calendar dates, temperature
in Celsius or Fahrenheit
For ratio variables, both differences and
ratios are meaningful. (*, /)
temperature in Kelvin,
monetary quantities, counts,
age, mass, length, electrical
current
Ratio
7/7/2015
Luke Huan Univ. of Kansas
15
Discrete and Continuous Attributes

Discrete Attribute





Has only a finite or countablely infinite set of values
Examples: zip codes, counts, or the set of words in a collection of
documents
Often represented as integer variables.
Note: binary attributes are a special case of discrete attributes
Continuous Attribute




7/7/2015
Has real numbers as attribute values
Examples: temperature, height, or weight.
Practically, real values can only be measured and represented using
a finite number of digits.
Continuous attributes are typically represented as floating-point
variables.
Luke Huan Univ. of Kansas
16
Structured vs Unstructured Data

Structured Data


Semi-structured data


Data in a relational database
Graphs, trees, sequencs
Un-structured data

Image, text
7/7/2015
Luke Huan Univ. of Kansas
17
Important Characteristics Data
 Dimensionality

Curse of Dimensionality
 Sparsity

Only presence counts
 Resolution

7/7/2015
Patterns depend on the scale
Luke Huan Univ. of Kansas
18
Record Data

Data that consists of a collection of records, each of
which consists of a fixed set of attributes
Tid Refund Marital
Status
Taxable
Income Cheat
1
Yes
Single
125K
No
2
No
Married
100K
No
3
No
Single
70K
No
4
Yes
Married
120K
No
5
No
Divorced 95K
Yes
6
No
Married
No
7
Yes
Divorced 220K
No
8
No
Single
85K
Yes
9
No
Married
75K
No
10
No
Single
90K
Yes
60K
10
7/7/2015
Luke Huan Univ. of Kansas
19
Data Matrix

If data objects have the same fixed set of numeric attributes, then
the data objects can be thought of as points in a multi-dimensional
space, where each dimension represents a distinct attribute

Such data set can be represented by an m by n matrix, where there
are m rows, one for each object, and n columns, one for each
attribute
Projection
of x Load
7/7/2015
Projection
of y load
Distance
Load
Thickness
10.23
5.27
15.22
2.7
1.2
12.65
6.25
16.22
2.2
1.1
Luke Huan Univ. of Kansas
20
Document Data

Each document becomes a `term' vector,


team
coach
pla
y
ball
score
game
wi
n
lost
timeout
season
7/7/2015
each term is a component (attribute) of the vector,
the value of each component is the number of times the
corresponding term occurs in the document.
Document 1
3
0
5
0
2
6
0
2
0
2
Document 2
0
7
0
2
1
0
0
3
0
0
Document 3
0
1
0
0
1
2
2
0
3
0
Luke Huan Univ. of Kansas
21
Transaction Data

A special type of record data, where


7/7/2015
each record (transaction) involves a set of items.
For example, consider a grocery store. The set of
products purchased by a customer during one shopping
trip constitute a transaction, while the individual products
that were purchased are the items.
TID
Items
1
Bread, Coke, Milk
2
3
4
5
Beer, Bread
Beer, Coke, Diaper, Milk
Beer, Bread, Diaper, Milk
Coke, Diaper, Milk
Luke Huan Univ. of Kansas
22
Data Quality




What kinds of data quality problems?
How can we detect problems with the data?
What can we do about these problems?
Examples of data quality problems:


7/7/2015
Noise and outliers
missing and duplicated data
Luke Huan Univ. of Kansas
23
Noise

Noise refers to modification of original values

Examples: distortion of a person’s voice when talking on
a poor phone and “snow” on television screen
Two Sine Waves
7/7/2015
Two Sine Waves + Noise
Luke Huan Univ. of Kansas
24
Mapping Data to a New Space
Fourier transform
 Wavelet transform

Two Sine Waves
7/7/2015
Two Sine Waves + Noise
Luke Huan Univ. of Kansas
Frequency
25
Outliers

Outliers are data objects with characteristics that are
considerably different than most of the other data
objects in the data set

One person’s outlier can be another one’s treasure!!
7/7/2015
Luke Huan Univ. of Kansas
26
Missing Values

Reasons for missing values



Information is not collected
(e.g., people decline to give their age and weight)
Attributes may not be applicable to all cases
(e.g., annual income is not applicable to children)
Handling missing values




7/7/2015
Eliminate Data Objects
Estimate Missing Values
Ignore the Missing Value During Analysis
Replace with all possible values (weighted by their
probabilities)
Luke Huan Univ. of Kansas
27
Duplicate Data

Data set may include data objects that are duplicates, or
almost duplicates of one another


Examples:


Major issue when merging data from heterogeous
sources
Same person with multiple email addresses
Data cleaning

7/7/2015
Process of dealing with duplicate data issues
Luke Huan Univ. of Kansas
28
EDA: Exploratory Data Analysis




Histogram
Box plot
Scatter plot
Correlation
7/7/2015
Luke Huan Univ. of Kansas
29
Visualization Techniques: Histograms

Histogram





Usually shows the distribution of values of a single variable
Divide the values into bins and show a bar plot of the number of
objects in each bin.
The height of each bar indicates the number of objects
Shape of histogram depends on the number of bins
Example: Petal Width (10 and 20 bins, respectively)
7/7/2015
Luke Huan Univ. of Kansas
30
Two-Dimensional Histograms


Show the joint distribution of the values of two
attributes
Example: petal width and petal length

7/7/2015
What does this tell us?
Luke Huan Univ. of Kansas
31
Visualization Techniques: Box Plots

Box Plots



Invented by J. Tukey
Another way of displaying the distribution of data
Following figure shows the basic part of a box plot
outlier
10th percentile
75th percentile
50th percentile
25th percentile
10th percentile
7/7/2015
Luke Huan Univ. of Kansas
32
Example of Box Plots

Box plots can be used to compare attributes
7/7/2015
Luke Huan Univ. of Kansas
33
Scatter Plot Array of Iris Attributes
7/7/2015
Luke Huan Univ. of Kansas
34
Correlation


Correlation measures the linear relationship between
objects
To compute correlation, we standardize data objects, p
and q, and then take their dot product
pk  ( pk  mean( p)) / std ( p)
qk  (qk  mean(q)) / std (q)
correlation( p, q)  p  q
7/7/2015
Luke Huan Univ. of Kansas
35
Visually Evaluating Correlation
Scatter plots
showing the
similarity from
–1 to 1.
7/7/2015
Luke Huan Univ. of Kansas
36
Discover Association Rules

Apriori Algorithm
7/7/2015
Luke Huan Univ. of Kansas
37
Association Rule Mining

Given a set of transactions, find rules that will predict the
occurrence of an item based on the occurrences of other items
in the transaction
Market-Basket transactions
Example of Association Rules
TID
Items
1
Bread, Milk
2
3
4
5
Bread, Diaper, Beer, Eggs
Milk, Diaper, Beer, Coke
Bread, Milk, Diaper, Beer
Bread, Milk, Diaper, Coke
7/7/2015
{Diaper}  {Beer},
{Milk, Bread}  {Eggs,Coke},
{Beer, Bread}  {Milk},
Implication means co-occurrence,
not causality!
Luke Huan Univ. of Kansas
38
Definition: Frequent Itemset




Itemset
 A collection of one or more items
 Example: {Milk, Bread, Diaper}
 k-itemset
 An itemset that contains k items
Support count ()
 Frequency of occurrence of an itemset
 E.g. ({Milk, Bread,Diaper}) = 2
Support
 Fraction of transactions that contain an
itemset
 E.g. s({Milk, Bread, Diaper}) = 2/5
Frequent Itemset
 An itemset whose support is greater
than or equal to a minsup threshold
7/7/2015
Luke Huan Univ. of Kansas
TID
Items
1
Bread, Milk
2
3
4
5
Bread, Diaper, Beer, Eggs
Milk, Diaper, Beer, Coke
Bread, Milk, Diaper, Beer
Bread, Milk, Diaper, Coke
39
Definition: Association Rule


Association Rule
 An implication expression of the form
X  Y, where X and Y are itemsets
 Example:
{Milk, Diaper}  {Beer}
TID
Items
1
Bread, Milk
2
3
4
5
Bread, Diaper, Beer, Eggs
Milk, Diaper, Beer, Coke
Bread, Milk, Diaper, Beer
Bread, Milk, Diaper, Coke
Rule Evaluation Metrics
 Support (s)
Example:
 Fraction of transactions that
{Milk, Diaper}  Beer
contain both X and Y
 Confidence (c)
 (Milk , Diaper, Beer ) 2
s

  0.4
 Measures how often items in Y
|T|
5
appear in transactions that
 (Milk, Diaper, Beer ) 2
contain X
c
  0.67
 (Milk , Diaper )
7/7/2015
Luke Huan Univ. of Kansas
3
40
Mining Association Rules
TID
Items
1
Bread, Milk
2
3
4
5
Bread, Diaper, Beer, Eggs
Milk, Diaper, Beer, Coke
Bread, Milk, Diaper, Beer
Bread, Milk, Diaper, Coke
Example of Rules:
{Milk,Diaper}  {Beer} (s=0.4, c=0.67)
{Milk,Beer}  {Diaper} (s=0.4, c=1.0)
{Diaper,Beer}  {Milk} (s=0.4, c=0.67)
{Beer}  {Milk,Diaper} (s=0.4, c=0.67)
{Diaper}  {Milk,Beer} (s=0.4, c=0.5)
{Milk}  {Diaper,Beer} (s=0.4, c=0.5)
Observations:
• All the above rules are binary partitions of the same itemset:
{Milk, Diaper, Beer}
• Rules originating from the same itemset have identical support but
can have different confidence
• Thus, we may decouple the support and confidence requirements
7/7/2015
Luke Huan Univ. of Kansas
41
An Exercise
Transaction-id
Items bought
100
f, a, c, d, g, I, m, p
200
a, b, c, f, l,m, o
300
b, f, h, j, o
400
b, c, k, s, p
500
a, f, c, e, l, p, m, n



The support value of pattern
Transaction
database TDB
{acm}
is
 Sup(acm)=3
The support of pattern {ac}
is
 Sup(ac)=3
Given min_sup=3, acm is


The confidence of the rule:
{ac} => {m} is

7/7/2015
Frequent
100%
Luke Huan Univ. of Kansas
42
Mining Association Rules

Two-step approach:
1.
Frequent Itemset Generation
–
2.
Rule Generation
–

Generate all itemsets whose support  minsup
Generate high confidence rules from each frequent
itemset, where each rule is a binary partitioning of a
frequent itemset
Frequent itemset generation is still computationally
expensive
7/7/2015
Luke Huan Univ. of Kansas
43
Frequent Itemset Generation
null
A
B
C
D
E
AB
AC
AD
AE
BC
BD
BE
CD
CE
DE
ABC
ABD
ABE
ACD
ACE
ADE
BCD
BCE
BDE
CDE
ABCD
ABCE
ABDE
ACDE
ABCDE
7/7/2015
Luke Huan Univ. of Kansas
BCDE
Given d items, there
are 2d possible
candidate itemsets
44
Apriori Algorithm

A level-wise, candidate-generation-and-test approach
(Agrawal & Srikant 1994)
Data base D
TID
Items
10
a, c, d
20
b, c, e
30
40
1-candidates
2-candidates
Itemset
Sup
Itemset
Sup
Itemset
a
2
a
2
ab
b
3
b
3
ac
a, b, c, e
c
3
c
3
ae
b, e
d
1
e
3
bc
e
3
Scan D
Min_sup=2
3-candidates
be
Freq 2-itemsets
Counting
Itemset
Itemset
Sup
Itemset
Sup
bce
ac
2
ab
1
bc
2
ac
2
be
3
ae
1
ce
2
bc
2
be
3
ce
2
Scan D Freq 3-itemsets
7/7/2015
Freq 1-itemsets
Itemset
Sup
bce
2
Luke Huan Univ. of Kansas
ce
Scan D
45
Summary


Nature of the data
Data types:





SSN
Grade
Temperature (degree)
Length
Nominal
Ordinal
Interval
Ratio
Data Quality



Noise
Outlier
Missing/duplicated data
7/7/2015
Luke Huan Univ. of Kansas
46
Summary

Common tools for exploratory data analysis





Histogram
Box plot
Scatter plot
Correlation
Association


Each rule: L => R has two parts: L, the left hand item set
and R the right hand item set
Each rule is measured by two parameters:


7/7/2015
Support
Confidence
Luke Huan Univ. of Kansas
47