Ceng770-Introduction

Download Report

Transcript Ceng770-Introduction

CENG 770
• Data mining (knowledge discovery from data)
– Extraction of interesting (non-trivial, implicit, previously unknown
and potentially useful) patterns or knowledge from huge amount of
data
• Alternative names
– Knowledge discovery (mining) in databases (KDD), knowledge
extraction, data/pattern analysis, data archeology, data dredging,
information harvesting, business intelligence, etc.
Definition by Gartner Group
• “Data mining is the process of discovering
meaningful new correlations, patterns and
trends by sifting through large amounts of
data stored in repositories, using pattern
recognition technologies as well as
statistical and mathematical techniques.”
• (Deductive) query processing
• Expert systems or small ML/statistical programs
Mining Large Data Sets - Motivation
• There is often information “hidden” in the data that is
not readily evident
• Human analysts may take weeks to discover useful information
• Much of the data is never analyzed at all
4,000,000
3,500,000
The Data Gap
3,000,000
2,500,000
2,000,000
1,500,000
Total new disk (TB) since 1995
1,000,000
Number of
analysts
500,000
0
1995
1996
1997
1998
1999
From: R. Grossman, C. Kamath, V. Kumar, “Data Mining for Scientific and Engineering Applications”
Artificial
Intelligence
Machine
Learning
Database
Management
Statistics
Visualization
Algorithms
Data
Mining
Data Mining:
History of the Field
• Knowledge Discovery in Databases workshops started ‘89
– Now a conference under the auspices of ACM SIGKDD
– IEEE conference series started 2001
7
A Brief History of Data Mining Society
• 1989 IJCAI Workshop on Knowledge Discovery in Databases (PiatetskyShapiro)
– Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley,
1991)
• 1991-1994 Workshops on Knowledge Discovery in Databases
– Advances in Knowledge Discovery and Data Mining (U. Fayyad, G.
Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996)
• 1995-1998 International Conferences on Knowledge Discovery in
Databases and Data Mining (KDD’95-98)
– Journal of Data Mining and Knowledge Discovery (1997)
• 1998 ACM SIGKDD, SIGKDD’1999-2001 conferences, and SIGKDD
Explorations
• More conferences on data mining
– PAKDD (1997), PKDD (1997), SIAM-Data Mining (2001), (IEEE) ICDM
(2001), etc.
CS490D
8
• Market Analysis, Customer Relationships Management (CRM)
• Churn Analysis
• Risk Analysis and Management
• Fraud Detection, Counter Terrorism
• Network Intrusion Detection
• Web Site Restructuring
• Recommendation
• Scientific Applications
What Can Data Mining Do?
• Cluster
• Classify
– Categorical, Regression
• Summarize
– Summary statistics, Summary rules
• Link Analysis / Model Dependencies
– Association rules
• Sequence analysis
– Time-series analysis, Sequential associations
• Detect Deviations
10
Clustering
• Find groups of similar data items
• Statistical techniques require
some definition of “distance”
(e.g. between travel profiles)
while conceptual techniques use
background concepts and logical
descriptions
“Group people with
similar travel profiles”
– George, Patricia
– Jeff, Evelyn, Chris
– Rob
Clusters
11
Classification
• Find ways to separate data items
into pre-defined groups
• Requires “training data”: Data
items where group is known
“Route documents to
most likely interested
parties”
– English or non-english?
– Domestic or Foreign?
Training Data
tool produces
Groups
classifier
12
Association Rules
• Identify dependencies in the
data:
– X makes Y likely
• Indicate significance of each
dependency
“Find groups of items
commonly purchased
together”
– People who purchase fish
are extraordinarily likely to
purchase wine
– People who purchase
Turkey are extraordinarily
likely to purchase
cranberries
D
a
t
e
/
T
i
m
e
/
R
e
g
i
s
t
e
r
F
i
s
h
T
u
r
k
e
y
C
r
a
n
b
e
r
r
i
e
s
W
i
n
e
…
YYY
…
1
2
/
6
1
3
:
1
5
2 N
NNY
…
1
2
/
6
1
3
:
1
6
3 Y
13
Sequential Associations
• Find event sequences that are
unusually likely
“Find common sequences of
warnings/faults within 10
minute periods”
– Warn 2 on Switch C
preceded by Fault 21 on
Switch B
– Fault 17 on any switch
preceded by Warn 2 on any
switch
T
im
e S
w
itc
hE
v
e
n
t
B F
a
u
lt2
1
2
1
:1
0
A W
a
rn2
2
1
:1
1
C W
a
rn2
2
1
:1
3
A F
a
u
lt1
7
2
1
:2
0
14
Recommendation Techniques
• Given database of user preferences, predict preference of new
user
• Example:
– Predict what new movies you will like based on
• your past preferences
• others with similar past preferences
• their preferences for the new movies
– Predict what books/CDs a person may want to
buy
(and suggest it, or give discounts to tempt customer)
15
Knowledge Discovery in Databases:
Process
Interpretation/
Evaluation
Data Mining
Knowledge
Preprocessing
Patterns
Selection
Preprocessed
Data
Data
Target
Data
adapted from:
U. Fayyad, et al. (1995), “From Knowledge Discovery to Data
Mining: An Overview,” Advances in Knowledge Discovery and
Data Mining, U. Fayyad et al. (Eds.), AAAI/MIT Press
16
Data Mining and Business Intelligence
Increasing potential
to support
business decisions
Making
Decisions
Data Presentation
Visualization Techniques
Data Mining
Information Discovery
End User
Business
Analyst
Data
Analyst
Data Exploration
Statistical Analysis, Querying and Reporting
Data Warehouses / Data Marts
OLAP, MDA
Data Sources
Paper, Files, Information Providers, Database Systems, OLTP
DBA
•
Learning the application domain
– relevant prior knowledge and goals of application
•
•
•
Creating a target data set: data selection
Data cleaning and preprocessing: (may take 60% of effort!)
Data reduction and transformation
– Find useful features, dimensionality/variable reduction, invariant
representation
•
Choosing functions of data mining
–
•
•
•
summarization, classification, regression, association, clustering
Choosing the mining algorithm(s)
Data mining: search for patterns of interest
Pattern evaluation and knowledge presentation
– visualization, transformation, removing redundant patterns, etc.
•
Use of discovered knowledge
What is Data?
Attributes
• Collection of data objects and
their attributes
• An attribute is a property or
characteristic of an object
– Examples: eye color of a person,
temperature, etc.
– Attribute is also known as
variable, field, characteristic, or Objects
feature
• A collection of attributes describe
an object
– Object is also known as record,
point, case, sample, entity, or
instance
age
<=30
<=30
31…40
>40
>40
>40
31…40
<=30
<=30
>40
<=30
31…40
31…40
>40
income
high
high
high
medium
low
low
low
medium
low
medium
medium
medium
high
medium
student credit_rating buys_computer
no fair
no
no excellent
no
no fair
yes
no fair
yes
yes fair
yes
yes excellent
no
yes excellent
yes
no fair
no
yes fair
yes
yes fair
yes
yes excellent
yes
no excellent
yes
yes fair
yes
no excellent
no
Attribute Values
• Attribute values are numbers or symbols assigned to an attribute
• Distinction between attributes and attribute values
– Same attribute can be mapped to different attribute values
• Example: height can be measured in feet or meters
– Different attributes can be mapped to the same set of values
• Example: Attribute values for ID and age are integers
• But properties of attribute values can be different
– ID has no limit but age has a maximum and minimum value
Types of data sets
• Record
– Data Matrix
– Document Data
– Transaction Data
• Graph
– World Wide Web
– Molecular Structures
• Ordered
–
–
–
–
Spatial Data
Temporal Data
Sequential Data
Genetic Sequence Data
Record Data
• Data that consists of a collection of records, each of
which consists of a fixed set of attributes
age
<=30
<=30
31…40
>40
>40
>40
31…40
<=30
<=30
>40
<=30
31…40
31…40
>40
income
high
high
high
medium
low
low
low
medium
low
medium
medium
medium
high
medium
student credit_rating buys_computer
no fair
no
no excellent
no
no fair
yes
no fair
yes
yes fair
yes
yes excellent
no
yes excellent
yes
no fair
no
yes fair
yes
yes fair
yes
yes excellent
yes
no excellent
yes
yes fair
yes
no excellent
no
Data Matrix
• If data objects have the same fixed set of numeric attributes,
then the data objects can be thought of as points in a multidimensional space, where each dimension represents a
distinct attribute
• Such data set can be represented by an m by n matrix, where
there are m rows, one for each object, and n columns, one for
each attribute
Document Data
• Each document becomes a `term' vector,
– each term is a component (attribute) of the vector,
– the value of each component is the number of times the
corresponding term occurs in the document.
team
coach
pla
y
ball
score
game
wi
n
lost
timeout
season
Document 1
3
0
5
0
2
6
0
2
0
2
Document 2
0
7
0
2
1
0
0
3
0
0
Document 3
0
1
0
0
1
2
2
0
3
0
Transaction Data
• A special type of record data, where
– each record (transaction) involves a set of items.
– For example, consider a grocery store. The set of
products purchased by a customer during one shopping
trip constitute a transaction, while the individual
products that were purchased are the items.
TID
Items
1
Bread, Coke, Milk
2
3
4
5
Beer, Bread
Beer, Coke, Diaper, Milk
Beer, Bread, Diaper, Milk
Coke, Diaper, Milk
Graph Data
• Examples: Generic graph and HTML Links
2
1
5
2
5
<a href="papers/papers.html#bbbb">
Data Mining </a>
<li>
<a href="papers/papers.html#aaaa">
Graph Partitioning </a>
<li>
<a href="papers/papers.html#aaaa">
Parallel Solution of Sparse Linear System of Equations </a>
<li>
<a href="papers/papers.html#ffff">
N-Body Computation and Dense Linear System Solvers
Ordered Data
• Sequences of transactions
Items/Events
An element of
the sequence
Discrete and Continuous Attributes
• Discrete Attribute
– Has only a finite or countably infinite set of values
– Examples: zip codes, counts, or the set of words in a collection of
documents
– Often represented as integer variables.
– Note: binary attributes are a special case of discrete attributes
• Continuous Attribute
– Has real numbers as attribute values
– Examples: temperature, height, or weight.
– Practically, real values can only be measured and represented using
a finite number of digits.
– Continuous attributes are typically represented as floating-point
variables.
Important Characteristics of Structured Data
– Dimensionality
• Curse of Dimensionality
– Sparsity
• Only presence counts
– Resolution
• Patterns depend on the scale