Knowledge Discovery in Databases

ITCS 6162
KDD Class
Fall 2007
Transparencies made by
Ho Tu Bao [JAIST]
1
Outline of the presentation
• Objectives, Prerequisite and Content
• Brief Introduction to Lectures
• Discussion and Conclusion
This presentation summarizes the content and organization
of lectures in module “Knowledge Discovery and Data Mining”
2
Objectives
This course provides:
• fundamental techniques of knowledge
discovery and data mining (KDD)
• issues in KDD practical use and tools
• case-studies of KDD application
3
Prerequisite for the course
Nothing special, but the following are expected:
• experience in using computers
• basics of databases and statistics
• programming skills for advanced levels
4
Content of the course
Lecture 1: Overview of KDD
Lecture 2: Preparing data
Lecture 3: Decision tree induction
Lecture 4: Mining association rules
Lecture 5: Automatic cluster detection
Lecture 6: Artificial neural networks
Lecture 7: Evaluation of discovered knowledge
5
Brief introduction to lectures
Lecture 1: Overview of KDD
Lecture 2: Preparing data
Lecture 3: Decision tree induction
Lecture 4: Mining association rules
Lecture 5: Automatic cluster detection
Lecture 6: Artificial neural networks
Lecture 7: Evaluation of discovered knowledge
7
Lecture 1: Overview of KDD
1. What is KDD and why?
2. The KDD Process
3. KDD Applications
4. Data Mining Methods
5. Challenges for KDD
8
KDD: A Definition
KDD is the automatic extraction of non-obvious,
hidden knowledge from large volumes of data.
10^6 to 10^12 bytes: we never see the whole data set or put it in the memory of computers.
Data mining algorithms?
What knowledge? How to represent and use it?
9
Data, Information, Knowledge
We often see data as a string of bits, or numbers and
symbols, or “objects” which we collect daily.
Information is data stripped of redundancy, and reduced
to the minimum necessary to characterize the data.
Knowledge is integrated information, including facts and
their relations, which have been perceived, discovered,
or learned as our “mental pictures”.
Knowledge can be considered data at
a high level of abstraction and generalization.
10
From Data to Knowledge
Medical Data by Dr. Tsumoto, Tokyo Med. & Dent. Univ., 38 attributes
...
10, M, 0, 10, 10, 0, 0, 0, SUBACUTE, 37, 2, 1, 0,15,-,-, 6000, 2, 0, abnormal, abnormal,-, 2852, 2148, 712, 97, 49, F,-,multiple,,2137, negative, n, n, ABSCESS,VIRUS
12, M, 0, 5, 5, 0, 0, 0, ACUTE, 38.5, 2, 1, 0,15, -,-, 10700,4,0,normal, abnormal, +, 1080, 680, 400, 71, 59, F,,ABPC+CZX,, 70, negative, n, n, n, BACTERIA, BACTERIA
15, M, 0, 3, 2, 3, 0, 0, ACUTE, 39.3, 3, 1, 0,15, -, -, 6000, 0,0, normal, abnormal, +, 1124, 622, 502, 47, 63, F, ,FMOX+AMK, , 48, negative, n, n, n, BACTE(E), BACTERIA
16, M, 0, 32, 32, 0, 0, 0, SUBACUTE, 38, 2, 0, 0, 15, -, +, 12600, 4, 0,abnormal, abnormal, +, 41, 39, 2, 44, 57, F, -, ABPC+CZX, ?, ? ,negative, ?, n, n, ABSCESS, VIRUS
...
The data contains: numerical attributes, categorical attributes, missing values, class labels.
Example of a discovered rule:
IF cell_poly <= 220 AND Risk = n AND Loc_dat = + AND Nausea > 15
THEN Prediction = VIRUS [87.5%]
[confidence, predictive accuracy]
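The IF ... THEN rule above can be read as a simple predicate over one record. The following is a minimal sketch, assuming each record is a dictionary keyed by the attribute names that appear in the rule; the three sample records are invented for illustration and are not taken from the medical data set.

```python
def predicts_virus(record):
    """Return True when the record satisfies every condition of the rule."""
    return (
        record["cell_poly"] <= 220
        and record["Risk"] == "n"
        and record["Loc_dat"] == "+"
        and record["Nausea"] > 15
    )


# Hypothetical records (values are made up, only the attribute names
# come from the rule on the slide).
records = [
    {"cell_poly": 150, "Risk": "n", "Loc_dat": "+", "Nausea": 20},
    {"cell_poly": 900, "Risk": "n", "Loc_dat": "+", "Nausea": 20},
    {"cell_poly": 150, "Risk": "p", "Loc_dat": "-", "Nausea": 3},
]

# One boolean per record: does the rule fire?
matches = [predicts_virus(r) for r in records]
print(matches)
```

Only the first record meets all four conditions, so the rule would predict VIRUS for that record alone.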
11
Data Rich Knowledge Poor
People have gathered and stored so much data because they think some valuable assets are implicitly coded within it.
Raw data is rarely of direct benefit. Its true value depends on the ability to extract information useful for decision support.
How to acquire knowledge for knowledge-based systems (knowledge base, inference engine) remains the main difficult and crucial problem.
• Tradition: via knowledge engineers
• Manual data analysis is impractical
• New trend: via automatic programs
12
Benefits of Knowledge Discovery
[Figure: value plotted against volume for EDP, MIS, and DSS, with the labels Generate, Disseminate, and Rapid Response]
EDP: Electronic Data Processing
MIS: Management Information Systems
DSS: Decision Support Systems
13
The KDD process
The non-trivial process of identifying valid, novel,
potentially useful, and ultimately understandable
patterns in data - Fayyad, Piatetsky-Shapiro, Smyth (1996)
• non-trivial process: multiple process
• valid: justified patterns/models
• novel: previously unknown
• useful: can be used
• understandable: by human and machine
15
The Knowledge Discovery Process
1. Understand the domain and define problems
2. Collect and preprocess data
3. Data mining: extract patterns/models
4. Interpret and evaluate discovered knowledge
5. Put the results to practical use
Data mining is a step in the KDD process consisting of methods that produce useful patterns or models from the data, under some acceptable computational efficiency limitations.
KDD is inherently interactive and iterative.
16
The KDD Process
[Flowchart of the five steps; boxes listed in slide order]
• Data organized by function
• Create/select target database
• Data warehousing
• Select sampling technique and sample data
• Supply missing values
• Eliminate noisy data
• Normalize values
• Transform values
• Create derived attributes
• Find important attributes & value ranges
• Select DM task(s)
• Transform to different representation
• Select DM method(s)
• Extract knowledge
• Test knowledge
• Refine knowledge
• Query & report generation
• Aggregation & sequences
• Advanced methods
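Two of the preprocessing boxes on this slide, supplying missing values and normalizing values, can be sketched in a few lines. This is only an illustration under stated assumptions: the slide does not say which techniques to use, so mean imputation and min-max scaling are chosen here as common examples, and the temperature values are invented.

```python
def fill_missing(values):
    """Replace None entries with the mean of the known values (mean imputation)."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]


def min_max_normalize(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical numerical attribute with one missing value.
temperatures = [37.0, None, 39.0, 38.0]

filled = fill_missing(temperatures)
normalized = min_max_normalize(filled)
print(filled)
print(normalized)
```

Other choices, such as filling with the median or the most frequent value, or scaling to zero mean and unit variance, fit the same two boxes equally well.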
17
Main Contributing Areas of KDD
• Databases: store, access, search, update data (deduction)
  [data warehouses: integrated data]
  [OLAP: On-Line Analytical Processing]
• Statistics: infer info from data (deduction & induction, mainly numeric data)
• Machine Learning: computer algorithms that improve automatically through experience (mainly induction, symbolic data)
These three areas contribute to KDD.
18
Potential Applications
Business information
- Marketing and sales data analysis
- Investment analysis
- Loan approval
- Fraud detection
- etc.
Manufacturing information
- Controlling and scheduling
- Network management
- Experiment result analysis
- etc.
Scientific information
- Sky survey cataloging
- Biosequence Databases
- Geosciences: Quakefinder
- etc.
Personal information
20
KDD: Opportunity and Challenges
Factors converging on KDD:
• Competitive pressure
• Data rich, knowledge poor (the resource)
• Data mining technology mature
• Enabling technology (Interactive MIS, OLAP, parallel computing, Web, etc.)
21
KDD: A New and Fast Growing Area
KDD workshops: since 1989.
International conferences: KDD (USA), first in 1995;
PAKDD (Asia), first in 1997; PKDD (Europe), first in 1997.
ML’04/PKDD’04 (in Pisa, Italy)
Industry interests and competition: IBM, Microsoft,
Silicon Graphics, Sun, Boeing, NASA, SAS, SPSS, …
About 80% of the Fortune 500 companies are involved in
data mining projects or using data mining systems.
JAPAN: FGCS Project (logic programming and reasoning).
“Knowledge Discovery is the most desirable end-product of computing”.
Wiederhold, Stanford Univ.
22
Primary Tasks of Data Mining
• Classification: finding the description of several predefined classes and classifying a data item into one of them.
• Regression: mapping a data item to a real-valued prediction variable.
• Deviation and change detection: discovering the most significant changes in the data.
• Clustering: identifying a finite set of categories or clusters to describe the data.
• Dependency modeling: finding a model which describes significant dependencies between variables.
• Summarization: finding a compact description for a subset of data.
24
Classification
“What factors determine cancerous cells?”
Examples (Cancerous Cell Data) -> Data Mining Algorithm (Classification Algorithm) -> General patterns:
- Rule Induction
- Decision tree
- Neural Network
25
Classification: Rule Induction
“What factors determine a cell is cancerous?”
If Color = light and Tails = 1 and Nuclei = 2 Then Healthy Cell (certainty = 92%)
If Color = dark and Tails = 2 and Nuclei = 2 Then Cancerous Cell (certainty = 87%)
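The two induced rules can be stored as data and matched against a cell description. This is a minimal sketch, assuming the first certainty (92%) belongs to the healthy rule and the second (87%) to the cancerous rule, which is how the slide layout appears to pair them. The names `RULES` and `classify_by_rules` are hypothetical.

```python
# Each rule: (conditions that must all hold, predicted class, certainty).
# The pairing of certainties with rules is an assumption about the slide.
RULES = [
    ({"color": "light", "tails": 1, "nuclei": 2}, "healthy", 0.92),
    ({"color": "dark", "tails": 2, "nuclei": 2}, "cancerous", 0.87),
]


def classify_by_rules(cell):
    """Return (label, certainty) of the first rule that fires, else None."""
    for conditions, label, certainty in RULES:
        if all(cell.get(attr) == value for attr, value in conditions.items()):
            return label, certainty
    return None  # No rule covers this cell.


result_light = classify_by_rules({"color": "light", "tails": 1, "nuclei": 2})
result_dark = classify_by_rules({"color": "dark", "tails": 2, "nuclei": 2})
result_uncovered = classify_by_rules({"color": "dark", "tails": 1, "nuclei": 1})
print(result_light, result_dark, result_uncovered)
```

A rule set like this may leave some cells uncovered, which is why the function returns `None` when nothing matches.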
26
Classification: Decision Trees
Color = dark
    #nuclei=1
        #tails=1: healthy
        #tails=2: cancerous
    #nuclei=2: cancerous
Color = light
    #nuclei=1: healthy
    #nuclei=2
        #tails=1: healthy
        #tails=2: cancerous
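A decision tree of this kind can be written as a nested structure and walked from the root to a leaf. The sketch below assumes one reading of the flattened slide: dark cells with two nuclei are cancerous, light cells with one nucleus are healthy, and the remaining branches are decided by the number of tails. The names `TREE` and `classify_by_tree` are hypothetical.

```python
# Internal node: (attribute to test, {attribute value: subtree or leaf}).
# Leaf: a class label string. The branch layout is an assumed reading
# of the slide, not a confirmed reconstruction.
TREE = ("color", {
    "dark": ("nuclei", {
        1: ("tails", {1: "healthy", 2: "cancerous"}),
        2: "cancerous",
    }),
    "light": ("nuclei", {
        1: "healthy",
        2: ("tails", {1: "healthy", 2: "cancerous"}),
    }),
})


def classify_by_tree(cell, node=TREE):
    """Follow the branch matching each tested attribute until a leaf is reached."""
    while not isinstance(node, str):
        attribute, branches = node
        node = branches[cell[attribute]]
    return node


label_a = classify_by_tree({"color": "dark", "nuclei": 1, "tails": 2})
label_b = classify_by_tree({"color": "light", "nuclei": 2, "tails": 1})
label_c = classify_by_tree({"color": "dark", "nuclei": 2, "tails": 2})
print(label_a, label_b, label_c)
```

Unlike a rule list, the tree assigns a class to every combination of the attribute values it tests.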
27
Classification: Neural Networks
“What factors determine a cell is cancerous?”
Inputs: Color = dark, # nuclei = 1, …, # tails = 2
Outputs: Healthy, Cancerous
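To show how a network turns such inputs into the two outputs, here is a minimal sketch of a single forward pass. Everything specific in it is an assumption: the slide gives no architecture, so a single layer with a softmax output is used, color is encoded as 1 for dark and 0 for light, and the weights are hand-picked for illustration rather than learned from data.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(scores)  # Subtracting the maximum keeps exp() stable.
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Input vector: [color (1 = dark, 0 = light), number of nuclei, number of tails].
x = [1.0, 1.0, 2.0]

# Hypothetical weights, one row per output unit: [healthy, cancerous].
weights = [
    [-1.0, 0.5, -0.5],
    [1.0, -0.5, 0.5],
]
biases = [0.0, 0.0]

# Weighted sum of the inputs for each output unit, then softmax.
scores = [
    sum(w * xi for w, xi in zip(row, x)) + b
    for row, b in zip(weights, biases)
]
probs = softmax(scores)
labels = ["Healthy", "Cancerous"]
prediction = labels[probs.index(max(probs))]
print(prediction)
```

In practice the weights would be fitted to the training examples, and the network would usually have one or more hidden layers between inputs and outputs.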
28