cs412slides


Course 1
Introduction
Data Mining
Department of Information Management, National United University (國立聯合大學資訊管理學系)
Instructor: 陳士杰
Data Mining Course
 Outline

Why do we need data mining? (Motivation)

What is data mining?

Data mining: on what kind of data?

What functionality should data mining provide? (Data mining functionality)

Are all the patterns found by data mining interesting?

Classification of data mining systems

Data mining task primitives

Major issues in data mining
 Motivation: “Necessity is the Mother of Invention”

Data explosion problem
  Automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases, data warehouses, and other information repositories.
We are drowning in data, but starving for knowledge!
Solution: Data warehousing and data mining

Data warehousing and on-line analytical processing

Extraction of interesting knowledge (rules, regularities, patterns,
constraints) from data in large databases
 Evolution of Database Technology

1960s:
  Data collection, database creation, IMS and network DBMS
  Hierarchical and network database systems
1970s:
  Relational data model, relational DBMS implementation
1980s:
  RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
  Application-oriented DBMS (spatial, temporal, engineering, etc.)
1990s:
  Data mining, data warehousing, multimedia databases, and Web databases
2000s:
  Stream data management and mining
  Data mining and its applications
  Web technology (XML, data integration) and global information systems
 What Is Data Mining?

Data mining (knowledge discovery from data)
  Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data
Data mining: a misnomer?
  What data mining digs out is not merely data but knowledge!
Alternative names
  Knowledge discovery (mining) in databases (KDD), knowledge extraction, business intelligence, data/pattern analysis, data archeology, data dredging, information harvesting, etc.
Many people treat data mining as a synonym for another popularly used term, Knowledge Discovery from Data (KDD). This is data mining in the broad sense.

Alternatively, others view data mining as simply an essential step in the process of knowledge discovery. This is data mining in the narrow sense.
Knowledge Discovery (KDD) Process
[Figure: the KDD process as a chain. Databases -> Data Cleaning and Data Integration -> Data Warehouse -> Selection and Transformation -> Task-relevant Data -> Data Mining -> Patterns -> Evaluation and Presentation]
Data mining: core of the knowledge discovery process
KDD Process: Several Key Steps
1. Data cleaning
   Remove noise and inconsistent data.
   May take 60% of the effort!
2. Data integration
   Multiple data sources may be combined.
3. Data selection
   Data relevant to the analysis task are retrieved from the database.
4. Data transformation
   Data are transformed or consolidated into forms appropriate for mining, for instance by performing summary or aggregation operations.
5. Data mining
   Intelligent methods are applied in order to extract data patterns.
   This includes choosing the mining algorithm(s) for searching patterns of interest.
6. Pattern evaluation
   Identify the truly interesting patterns representing knowledge, based on some interestingness measures.
7. Knowledge presentation
   Visualization and knowledge representation techniques are used to present the mined knowledge to the user.
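
The seven steps can be sketched end to end on a toy data set. This is only an illustration under stated assumptions, not a real mining system: the two source lists, the field names, and the `min_count` threshold are made up, and the "mining" step is plain frequency counting.

```python
from collections import Counter

# Hypothetical raw records from two sources; None marks a missing value.
store_a = [{"customer": "c1", "item": "computer"},
           {"customer": "c2", "item": None},
           {"customer": "c3", "item": "software"}]
store_b = [{"customer": "c4", "item": "Computer "},
           {"customer": "c5", "item": "printer"}]

def clean(records):
    """Step 1: drop records with missing items and normalize spelling."""
    return [{**r, "item": r["item"].strip().lower()}
            for r in records if r["item"]]

def integrate(*sources):
    """Step 2: combine multiple data sources into one list."""
    return [r for source in sources for r in source]

def select(records, field):
    """Step 3: keep only the field relevant to the analysis task."""
    return [r[field] for r in records]

def transform(values):
    """Step 4: aggregate values into counts, a form suitable for mining."""
    return Counter(values)

def mine(counts, min_count):
    """Step 5: stand-in for mining; keep items seen at least min_count times."""
    return {item: n for item, n in counts.items() if n >= min_count}

def evaluate(patterns, total):
    """Step 6: attach a simple interestingness measure (relative frequency)."""
    return {item: n / total for item, n in patterns.items()}

def present(scored):
    """Step 7: render the result for the user."""
    return [f"{item}: {share:.0%} of records"
            for item, share in sorted(scored.items())]

records = integrate(clean(store_a), clean(store_b))
items = select(records, "item")
patterns = mine(transform(items), min_count=2)
report = present(evaluate(patterns, total=len(items)))
print(report)
```

Each function maps to one numbered step, so a single stage can be swapped out (for example, a real mining algorithm in `mine`) without touching the rest.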
We adopt a broad view of data mining functionality:

Data mining is the process of discovering interesting
knowledge from large amounts of data stored in
databases, data warehouses, or other information
repositories.
Architecture: Typical Data Mining System
[Figure: layered architecture, top to bottom: Graphical User Interface; Pattern Evaluation; Data Mining Engine; Database or Data Warehouse Server; data cleaning, integration, and selection; data sources (Database, Data Warehouse, World-Wide Web, Other Info Repositories). A Knowledge Base sits beside the Pattern Evaluation and Data Mining Engine layers.]
OLAP: On-Line Analytical Processing
Data Mining and Business Intelligence
[Figure: pyramid; the potential to support business decisions increases toward the top. Layers from top to bottom, with the typical role in parentheses:]
  Decision Making (End User)
  Data Presentation: Visualization Techniques (Business Analyst)
  Data Mining: Information Discovery (Data Analyst)
  Data Exploration: Statistical Summary, Querying, and Reporting (Data Analyst)
  Data Preprocessing/Integration, Data Warehouses (DBA)
  Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems (DBA)
Watch out: Is everything “data mining”?

Although there are many “data mining systems” on the market, not all of them can perform true data mining:

Machine learning systems and statistical data analysis tools
  Do not handle large amounts of data.
OLAP, database systems, and information retrieval systems
  Can only perform data or information retrieval, including finding aggregate values or answering deductive queries in large databases.
 Data Mining: On What Kind of Data?




Relational databases
Data warehouses
Transactional databases
Advanced DB and information repositories
  Object-oriented and object-relational databases
  Spatial and spatiotemporal databases
  Temporal, sequence, and time-series databases
  Text databases and multimedia databases
  Heterogeneous and legacy databases
  Data streams
  WWW
Relational databases
Data warehouses
Transactional databases
Object-oriented and object-relational databases

Object-oriented database
  Each entity is considered as an object.
  For instance, an employee class can contain variables like name, address, and birthday.
  Suppose that the class sales_person is a subclass of the class employee. It would inherit all of the variables pertaining to its superclass, employee.
Object-relational database
  Inherits the essential concepts of object-oriented databases.
  This model extends the relational model by providing a rich data type for handling complex objects and object orientation.
For data mining in object-oriented or object-relational systems, techniques need to be developed for handling:
  Complex object structure
  Complex data types
  Class and subclass hierarchies
  Property inheritance
  Methods and procedures
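
The employee / sales_person relationship can be written as a small class hierarchy. This is a sketch for illustration only: the `commission_rate` attribute and the sample values are assumptions, not part of the slide.

```python
from dataclasses import dataclass, fields

@dataclass
class Employee:
    """Superclass holding the variables named in the slide."""
    name: str
    address: str
    birthday: str

@dataclass
class SalesPerson(Employee):
    """Subclass: inherits name, address, and birthday from Employee."""
    # Hypothetical extra attribute, added only to show subclass-specific data.
    commission_rate: float = 0.0

sp = SalesPerson(name="Ann", address="1 Main St", birthday="1980-01-01",
                 commission_rate=0.05)

# The subclass exposes the inherited variables plus its own.
print([f.name for f in fields(SalesPerson)])
print(isinstance(sp, Employee))
```

A mining technique for such data has to follow this hierarchy, since a pattern over `Employee` objects also covers every `SalesPerson`.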
Spatial and Spatiotemporal Databases

Spatial database
  Contains spatial-related information:
    Spatial topological features
    Spatial and nonspatial attribute features
    Changes of objects over time
  Examples include geographic databases, VLSI databases, and medical and satellite image databases.
  Maps can be represented in vector format.
Spatiotemporal database
  Stores spatial objects that change with time.
  Group the trends of moving objects and identify some strangely moving vehicles.
  Distinguish a bioterrorist attack from a normal outbreak of the flu based on the geographic spread of a disease with time.
Temporal, Sequence, and Time-Series Databases

Temporal database
  Stores relational data that include time-related attributes.
Sequence database
  Stores sequences of ordered events, with or without a concrete notion of time.
  Examples: customer shopping sequences, Web click streams, and biological sequences.
Time-series database
  Stores sequences of values or events obtained over repeated measurements of time.
  Examples: the stock exchange, inventory control, and the observation of natural phenomena.
Data mining techniques can be used to find the characteristics of object evolution, or the trend of changes for objects in the database.
Text databases and multimedia databases

Text database
  A database that contains word descriptions for objects.
  These word descriptions are usually not simple keywords but rather long sentences or paragraphs, such as product specifications, error or bug reports, warning messages, summary reports, notes, or other documents.
  Text databases vary in how structured they are:
    Highly unstructured (Web pages)
    Semistructured (e-mail messages, XML Web pages)
    Well structured (library catalogue databases)
  Highly regular structures typically can be implemented using relational database systems.
Multimedia database
  Stores image, audio, and video data.
  Specialized storage and search techniques are required.
  Storage and search techniques need to be integrated with standard data mining methods.
Heterogeneous and legacy databases

Heterogeneous database
  Consists of a set of interconnected, autonomous component databases.
  Objects in one component database may differ greatly from objects in other component databases, making it difficult to assimilate their semantics into the overall heterogeneous database.
Legacy database
  Many enterprises acquire legacy databases as a result of the long history of information technology development.
  A legacy database is a group of heterogeneous databases.
Information exchange across such databases is difficult because it would require precise transformation rules from one representation to another, considering diverse semantics.
Data Streams

A new kind of data:
  Data flow in and out of an observation platform dynamically.
Unique features:
  Huge or possibly infinite volume
  Dynamically changing
  Flowing in and out in a fixed order
  Demanding fast response time
  Allowing only one or a small number of scans
Main applications: data produced in dynamic environments.
  Video surveillance
  Network traffic
  Stock exchange
  Weather or environment monitoring, and so on
Because data streams are normally not stored in any kind of data repository, effective and efficient management and analysis of stream data pose great challenges to researchers.
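
The "only one scan" constraint means a statistic has to be updated as each value arrives, without keeping the stream in memory. A minimal sketch using Welford's running mean and variance (a technique chosen here for illustration; the readings are made-up values):

```python
class RunningStats:
    """Mean and variance over a stream, updated one value at a time."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        """Fold one new value into the statistics (Welford's update)."""
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Population variance of the values seen so far."""
        return self._m2 / self.count if self.count else 0.0

def sensor_readings():
    """Hypothetical stream; a generator can be consumed only once."""
    yield from [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

stats = RunningStats()
for reading in sensor_readings():
    stats.update(reading)  # constant memory, single pass

print(stats.count, stats.mean, stats.variance)
```

Memory use stays constant no matter how long the stream runs, which is the property that stored-data algorithms usually lack.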
WWW

WWW and its associated distributed information services provide rich, worldwide, on-line information services, where data objects are linked together to facilitate interactive access.

Although Web pages may appear fancy and informative to human readers, they can be highly unstructured and lack a predefined schema, type, or pattern.
  Web services that provide keyword-based searches without understanding the context behind the Web pages can only offer limited help to users.
What is mined on the Web:
  Content retrieval (text retrieval)
  Retrieval of Web access patterns
 Data Mining Functionalities:
What kinds of patterns can be mined?

Data mining functionalities are used to specify the kinds of patterns to be found in data mining tasks.

Data mining tasks can be classified into two categories:
  Descriptive:
    Characterize the general properties of the data in the database.
  Predictive:
    Perform inference on the current data in order to make predictions.

In some cases, users may have no idea regarding what kinds of patterns in their data may be interesting, and hence may like to search for several different kinds of patterns in parallel.

Thus it is important to have a data mining system that can mine multiple kinds of patterns to accommodate different user expectations or applications.
Data mining functionalities, and the kinds of patterns they can discover, are described below:

Concept description: characterization and discrimination

Association analysis

Classification and prediction

Cluster analysis

Outlier analysis

Trend and evolution analysis
Concept Description:
Characterization and Discrimination

Concept Description (or Class Description):
  Describe a set of data as distinct classes or concepts in summarized, concise, and yet precise terms.
  For example, in the AllElectronics store:
    Items for sale can be classified into computers and printers.
    Concepts of customers include bigSpenders and budgetSpenders.
These descriptions can be derived via:
  Data characterization
  Data discrimination
  Both data characterization and discrimination
Chapter 4
Data Characterization

Summarization of the general characteristics or features of a target class of data.

Example: a data mining system should be able to summarize the characteristics of AllElectronics customers who spend more than $1,000 (big spenders):
  Aged 40 to 50
  Employed
  Good credit rating
The output of data characterization can be presented in various forms:
  Pie charts
  Bar charts
  Curves
  …
Chapter 4
Data Discrimination

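Comparison of the general features of target class data objects with the general features of objects from one or a set of contrasting classes.

Example: a data mining system should be able to compare two groups of AllElectronics customers: those who buy computer products regularly (more than twice a month) and those who buy them rarely (fewer than three times a year):
  Of the frequent buyers, 80% are between 20 and 40 years old and have a university education.
  Of the occasional buyers, 60% are either older or younger than that and have no university degree.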
Association Analysis


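From the large number of data items stored in transactional databases, relational databases, or other information repositories, discover interesting, frequently occurring patterns (frequent patterns), and analyze the interesting associations and correlations that exist among the data items in those patterns.
  Such associations are not represented explicitly in the data.
  The best-known application is determining association rules.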
Chapter 5
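Example: the marketing manager of AllElectronics wants to determine which items are frequently purchased together within the same transaction. Suppose that in the AllElectronics daily transaction database:
  2 transactions include computer, and 1 of them also includes software
  98 transactions include software, and 1 of them also includes computer
The data mining system then mines the following association rule for the company:
buys(X, “computer”) => buys(X, “software”)
[support=1%, confidence=50%]
  X: a variable representing a customer
  Confidence (also called certainty): if a customer buys a computer, there is a 50% chance that the customer will also buy software.
  Support: of all the transactions under analysis (those that include computer or software), only about 1% include both computer and software.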
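
The two measures in the rule can be recomputed from the counts in the example. The sketch below assumes the example describes 99 transactions in total (1 with both items, 1 with computer only, 97 with software only); the function names are illustrative.

```python
# 99 hypothetical transactions matching the counts in the example:
# 2 contain computer (1 of them with software),
# 98 contain software (1 of them with computer).
transactions = ([{"computer", "software"}]
                + [{"computer"}]
                + [{"software"}] * 97)

def support(itemset, transactions):
    """Fraction of all transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Among transactions containing antecedent, the share that also
    contain consequent."""
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))

s = support({"computer", "software"}, transactions)
c = confidence({"computer"}, {"software"}, transactions)
print(f"support={s:.1%}, confidence={c:.0%}")
```

Note that support is measured against all transactions, while confidence is measured only against those that contain the rule's left-hand side.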

Frequent patterns: patterns that occur frequently in data.

Some kinds of frequent patterns:
  Frequent itemset:
    A set of items that frequently appear together in a transactional data set.
  Frequent sequential pattern:
    A frequently occurring subsequence, such as the pattern that customers tend to purchase first a PC, followed by a digital camera, and then a memory card.
  Frequent structured pattern:
    A substructure can refer to different structural forms, such as graphs, trees, or lattices, which may be combined with itemsets or subsequences.
    If a substructure occurs frequently, it is called a frequent structured pattern.
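
Frequent itemsets can be found on small data by counting every candidate. The sketch below is a brute-force illustration, not an efficient algorithm such as Apriori; the transactions and the 60% threshold are made up.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions for illustration.
transactions = [
    {"PC", "digital camera", "memory card"},
    {"PC", "digital camera"},
    {"PC", "memory card"},
    {"digital camera", "memory card"},
    {"PC", "digital camera", "memory card"},
]

def frequent_itemsets(transactions, min_support, max_size=2):
    """Count every itemset up to max_size items and keep those whose
    support (share of transactions) is at least min_support."""
    counts = Counter()
    for t in transactions:
        for size in range(1, max_size + 1):
            # sorted() gives each itemset one canonical tuple form
            for itemset in combinations(sorted(t), size):
                counts[itemset] += 1
    n = len(transactions)
    return {itemset: c / n for itemset, c in counts.items()
            if c / n >= min_support}

result = frequent_itemsets(transactions, min_support=0.6)
for itemset, share in sorted(result.items()):
    print(itemset, share)
```

Enumerating all combinations grows exponentially with transaction size, which is why practical miners prune candidates instead.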
Classification and Prediction

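Classification:
  The process of finding a model (or function) that describes and distinguishes data classes or concepts.
  The model can then be used to predict the class of objects whose class label is unknown.
  Example: to identify whether a passenger is a potential terrorist or criminal, an airport security camera station scans passengers' faces and recognizes basic facial patterns (such as the distance between the eyes and the size and shape of the mouth). The resulting pattern is then compared one by one with the patterns of known terrorists or criminals in the database to see whether it matches any of them.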
Example: Table 6.1 shows that AllElectronics customers can be divided into two classes: those who buy a computer and those who do not.

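Whereas classification predicts categorical (discrete, unordered) labels, prediction models continuous-valued functions.

Although the term prediction may refer to both numeric prediction and class label prediction, in this book we use it to refer primarily to numeric prediction.

Prediction can be viewed as a kind of classification; the difference is that prediction mainly predicts the future state of the data rather than its current state.

Because the classes are already determined before the test data are analyzed, classification is usually called supervised learning.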

Chapter 6
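
Supervised learning can be illustrated with the simplest possible model, a nearest-neighbor classifier: a new object gets the label of the most similar labeled example. This is one of many model types and is not claimed to be the method used in the course; the training data are invented.

```python
import math

# Hypothetical labeled training data:
# (age, income in $1000s) -> does the customer buy a computer?
training = [
    ((25, 30), "yes"),
    ((30, 45), "yes"),
    ((35, 60), "yes"),
    ((55, 20), "no"),
    ((60, 25), "no"),
    ((65, 40), "no"),
]

def classify(sample, training):
    """Return the class label of the training example nearest to sample
    (Euclidean distance)."""
    nearest = min(training, key=lambda pair: math.dist(pair[0], sample))
    return nearest[1]

# Objects whose class label is unknown.
print(classify((28, 40), training))
print(classify((58, 30), training))
```

The labels in `training` are what makes this supervised: the classes exist before any new object is examined.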
Cluster Analysis

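Unlike classification and prediction, which analyze class-labeled data objects, clustering analyzes data objects without consulting a known class label.
  Clustering is similar to classification, except that the classes in the training data are not predefined but are determined by the data.
  Given certain attributes of the data, clustering is carried out according to similarity on those attributes. The most similar data objects are gathered into one cluster.
  Because clusters are not predefined, a domain expert is usually needed to interpret the meaning of the resulting clusters.
Because the classes are unknown when the data are analyzed, clustering is also called unsupervised learning.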
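
A minimal k-means sketch on one numeric attribute shows how groups emerge from similarity alone. The spending figures and the starting centers are assumptions chosen for illustration.

```python
def kmeans_1d(values, centers, iterations=10):
    """Simple k-means on numbers: assign each value to its nearest
    center, then move each center to the mean of its cluster."""
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)),
                          key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Keep the old center if a cluster ends up empty.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Hypothetical yearly spending per customer; no class labels are given.
spending = [120, 150, 130, 900, 950, 1000]
centers, clusters = kmeans_1d(spending, centers=[0, 500])
print(centers)
print(clusters)
```

The algorithm only returns two groups of numbers; deciding that they mean something like "budget" and "big" spenders is the interpretation step left to a domain expert.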

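Example: cluster analysis can be performed on AllElectronics customer data to identify homogeneous subgroups of customers. Each of these clusters may represent a target group for marketing.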

Chapter 7.
Outlier Analysis

A database may contain data objects that do not comply with the general behavior or model of the data. These data objects are outliers.
  Most data mining methods discard outliers as noise or exceptions.
  However, in some applications such as fraud detection, the rare events can be more interesting than the more regularly occurring ones.
  The analysis of outlier data is referred to as outlier mining.
Applications
  Credit card fraud detection
  Mobile phone fraud detection
  Customer segmentation
  Medical analysis (anomalies)
Chapter 7.
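
One simple statistical way to flag outliers is to measure how many standard deviations a value lies from the mean. This is only one of several possible approaches; the charges and the threshold of 2 are assumptions for illustration.

```python
import statistics

def find_outliers(values, threshold=2.0):
    """Return values whose distance from the mean exceeds
    `threshold` population standard deviations."""
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)
    if spread == 0:
        return []  # all values identical: nothing stands out
    return [v for v in values if abs(v - mean) / spread > threshold]

# Hypothetical card charges in dollars; one is far from the rest.
charges = [20, 22, 19, 21, 23, 20, 500]
print(find_outliers(charges))
```

In fraud detection the flagged value is the interesting result, whereas a method that treats outliers as noise would discard it.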
Evolution Analysis

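Data evolution analysis describes and models regularities or trends for objects whose behavior changes over time.

It may include characterization, discrimination, association, classification, or prediction of time-related data.

Example: suppose you have the major stock market (time-series) data of the New York Stock Exchange for the past several years and would like to invest in shares of high-tech industrial companies. A data mining study of the stock exchange data may identify evolution regularities for the overall stock market and for particular companies. Such regularities may help predict future trends in stock market prices and support your investment decisions.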

Chapter 8.
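
A moving average is a basic way to expose a trend in a time series by smoothing out short-term fluctuation. The prices and the window size below are invented for illustration; this is not a forecasting method.

```python
def moving_average(series, window):
    """Average of each run of `window` consecutive values."""
    return [sum(series[i:i + window]) / window
            for i in range(len(series) - window + 1)]

# Hypothetical daily closing prices.
prices = [10, 11, 12, 11, 13, 14, 15]
smoothed = moving_average(prices, window=3)

# Compare the first and last smoothed values to describe the trend.
trend = "up" if smoothed[-1] > smoothed[0] else "down or flat"
print(smoothed, trend)
```

The dip on the fourth day disappears in the smoothed series, leaving the overall direction visible.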
 Why Data Mining?—Potential Applications

Data analysis and decision support
  Market analysis and management
    Target marketing, customer relationship management (CRM), market basket analysis, cross selling, market segmentation
  Risk analysis and management
    Forecasting, customer retention, improved underwriting, quality control, competitive analysis
  Fraud detection and detection of unusual patterns (outliers)
Other applications
  Text mining (newsgroups, e-mail, documents)
  Web mining
  Bioinformatics and bio-data analysis
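Market Analysis and Management

Where does the data come from?
  Credit card transactions, loyalty cards, merchants' discount coupons, customer complaint calls, and public lifestyle studies
Target marketing
  Build a set of “customer group models” whose members share the same characteristics: interests, income level, spending habits, and so on
  Determine customer purchasing patterns
Cross-market analysis
  Associations and correlations between product sales, and predictions based on such associations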
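Customer profiling
  Which types of customers buy which products (clustering or classification)
Customer requirement analysis
  Identify the best products for different customers
  Predict what factors will attract new customers
Provision of summary information
  Multidimensional summary reports
  Statistical summary information (central tendency and variation of the data)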
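Corporate Analysis and Risk Management

Finance planning
  Cash flow analysis and prediction
  Cross-sectional and time-series analysis (financial ratios, trend analysis, etc.)
Resource planning
  Summarize and compare resources and spending
Competition
  Monitor competitors and market trends
  Group customers into classes and set class-based pricing
  Apply pricing strategies in highly competitive markets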
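Fraud Detection and Discovery of Unusual Patterns

Approach: clustering and model construction for fraudulent behavior, together with outlier analysis

Applications: health care, retail, credit services, telecommunications, etc.
  Auto insurance: analysis of collisions
  Money laundering: detection of suspicious monetary transactions
  Medical insurance
    Analysis of professional patients, doctors, and related data
    Unnecessary or correlated tests
  Telecommunications: phone call fraud
    Phone call model: destination of the call, duration, and number of calls per day or week. Analyze the model to find deviations from the expected norm.
  Retail industry
    Analysts estimate that 38% of retail shrink is due to dishonest employees.
  Anti-terrorism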
 Are All the “Discovered” Patterns Interesting?

Data mining may generate thousands of patterns: not all of them are interesting.

Some serious questions:
  What makes a pattern interesting?
  Can a data mining system generate all of the interesting patterns?
  Can a data mining system generate only interesting patterns?
The answer to the first question:

Interestingness measures
  A pattern is interesting if it is:
  1. Easily understood by humans,
  2. Valid on new or test data with some degree of certainty,
  3. Potentially useful,
  4. Novel, or validates some hypothesis that a user seeks to confirm

Objective vs. subjective interestingness measures
  Objective: based on statistics and structures of patterns, e.g., support, confidence, etc.
  Subjective: based on the user's belief in the data, e.g., unexpectedness, novelty, actionability, etc.

The answer to the second question:
  Find all the interesting patterns: completeness
  Can a data mining system find all the interesting patterns? Do we need to find all of the interesting patterns?
  Heuristic vs. exhaustive search
  Association vs. classification vs. clustering
The answer to the third question:
  Search for only interesting patterns: an optimization problem
  Can a data mining system find only the interesting patterns?
  Approaches
    First generate all the patterns and then filter out the uninteresting ones
    Generate only the interesting patterns: mining query optimization
 Data Mining:
Confluence of Multiple Disciplines

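Data mining is an interdisciplinary field, the confluence of a set of disciplines.
[Figure: data mining at the center, surrounded by database systems, machine learning, algorithms, statistics, visualization, and other disciplines (information retrieval, …)]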

Because of the diversity of disciplines contributing to
data mining, data mining research is expected to
generate a large variety of data mining systems.

Different views lead to different classifications

Data view: Kinds of data to be mined

Knowledge view: Kinds of knowledge to be discovered

Method view: Kinds of techniques utilized

Application view: Kinds of applications adapted

Kinds of databases mined:
  Relational, data warehouse, transactional, stream, object-oriented/relational, active, spatial, time-series, text, multimedia, heterogeneous, legacy, WWW
Kinds of knowledge mined:
  Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc.
  Multiple/integrated functions and mining at multiple levels
Techniques utilized:
  Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, etc.
Applications adapted:
  Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text mining, Web mining, etc.
 Primitives that Define a Data Mining Task

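A mistaken view of data mining:
“A data mining system is expected to dig out, automatically, all of the valuable knowledge buried in a given large database, without human intervention or guidance.”
  It would generate a huge number of patterns (burying the knowledge all over again).
  It would cover all of the data, making mining inefficient.
  Most of the valuable patterns might be overlooked.
  The mined patterns might be hard to understand and lack validity, novelty, and usefulness, making them uninteresting.

Without precise instructions and rules, a data mining system cannot be used.

Data mining primitives and a query language are used to guide data mining.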
Each user will have a data mining task in mind, that is, some
form of data analysis that he or she would like to have
performed.

A data mining task can be specified in the form of a data
mining query (Data Mining Query Language, DMQL), which
is input to the data mining system.

A data mining query is defined in terms of data mining task
primitives.

These primitives allow the user to interactively communicate with the
data mining system during discovery in order to direct the mining
process, or examine the findings from different angles or depths.
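The data mining primitives:
  The set of task-relevant data to be mined
    Specifies the portions of the database or data set in which the user is interested.
  The kind of knowledge to be mined
    Specifies the data mining functions to be performed.
  The background knowledge to be used in the discovery process
    Background knowledge about the domain to be mined is useful for guiding the knowledge discovery process and for evaluating the patterns found.
    A way to express background knowledge: concept hierarchies.
  The interestingness measures and thresholds for pattern evaluation
    Used to guide the mining process or, after mining, to evaluate the discovered patterns.
    Separate uninteresting patterns from knowledge.
  The expected representation for visualizing the discovered patterns
    Refers to the form in which the discovered patterns are displayed.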
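
The five primitives can be thought of as the fields of a task specification. The sketch below is a hypothetical Python structure, not DMQL syntax: every field name, value, and threshold is an assumption, and the concept hierarchy is a plain child-to-parent mapping.

```python
from dataclasses import dataclass, field

@dataclass
class MiningTask:
    """One field per primitive; all names are illustrative."""
    task_relevant_data: str   # which part of the data to mine
    kind_of_knowledge: str    # which mining function to perform
    background_knowledge: dict = field(default_factory=dict)  # e.g. concept hierarchies
    interestingness: dict = field(default_factory=dict)       # measures and thresholds
    presentation: str = "table"  # how to display discovered patterns

# A concept hierarchy expressed as child -> parent.
location_hierarchy = {"Vancouver": "British Columbia",
                      "British Columbia": "Canada"}

def generalize(value, hierarchy):
    """Climb a concept hierarchy to its most general concept."""
    while value in hierarchy:
        value = hierarchy[value]
    return value

task = MiningTask(
    task_relevant_data="customers who bought items priced over $100",
    kind_of_knowledge="association",
    background_knowledge={"location": location_hierarchy},
    interestingness={"min_support": 0.05, "min_confidence": 0.7},
)
print(generalize("Vancouver", task.background_knowledge["location"]))
```

Writing the task down this way makes explicit what the user must decide before mining starts, which is the point of the primitives.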

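Without interestingness measures, the useful patterns that are mined may well be buried among patterns the user is not interested in.

Objective measures of interestingness:
  Based on the structure and statistics of a pattern, a threshold is used to judge whether the pattern is interesting to the user.
Four commonly used objective measures of interestingness:
  Simplicity
  Certainty
  Utility
  Novelty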
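Simplicity and Certainty

Simplicity
  Whether a pattern is easily understood by humans
  Can be defined as a function of the pattern structure
  Pattern length, number of attributes, number of symbols
  e.g., rule length or the number of nodes in a decision tree
Certainty
  Indicates the probability with which a pattern is valid
  Confidence
  e.g., buys(X, “computer”) => buys(X, “software”) [30%, 80%]
  100% confidence: the rule is exact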
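Utility and Novelty

Utility
  Can be measured by support:
  e.g., buys(X, “computer”) => buys(X, “software”) [30%, 80%]
  Association rules that satisfy both a minimum confidence threshold and a minimum support threshold are called strong association rules.
Novelty
  Patterns that provide new information or improve the performance of a given pattern set
  Novelty is detected by removing redundant patterns (a pattern that is already implied by another pattern)
  Location(X, “Canada”) => buys(X, “Sony_TV”) [8%, 70%]
  Location(X, “Vancouver”) => buys(X, “Sony_TV”) [2%, 70%]
  The first rule is more general than the second, so we would expect it to occur more often. The second rule, with the same confidence, adds no new information and is redundant.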
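
Selecting strong association rules is a matter of applying both thresholds. The rule values below come from the examples on these slides; the threshold values and the tuple layout are assumptions for illustration.

```python
# Rules as (antecedent, consequent, support, confidence).
rules = [
    ("buys computer", "buys software", 0.30, 0.80),
    ("located in Canada", "buys Sony_TV", 0.08, 0.70),
    ("located in Vancouver", "buys Sony_TV", 0.02, 0.70),
]

def strong_rules(rules, min_support, min_confidence):
    """Keep rules that meet both the minimum support and the
    minimum confidence threshold."""
    return [r for r in rules
            if r[2] >= min_support and r[3] >= min_confidence]

# Thresholds chosen only for this illustration.
strong = strong_rules(rules, min_support=0.05, min_confidence=0.70)
for antecedent, consequent, s, c in strong:
    print(f"{antecedent} => {consequent} [{s:.0%}, {c:.0%}]")
```

With these thresholds the Vancouver rule is dropped for low support, which here happens to coincide with it being the redundant, more specific rule.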
 Integration of Data Mining and Data Warehousing

A good system architecture helps ensure that a data mining system does well in terms of performance, interactivity, usability, and scalability.

Most data today are stored in databases or data warehouses, on top of which integrated information processing and analysis functions are often built.

A critical question in the design of a data mining system is how to integrate or couple the DM system with a database system and/or a data warehouse system.
  No coupling
  Loose coupling
  Semitight coupling
  Tight coupling

No coupling:
  The DM system will not utilize any function of a DB or DW system.
  Simple
  Drawbacks:
    The DM system may spend a substantial amount of time finding, collecting, cleaning, and transforming data.
    The DM system will need to use other tools to extract data, making it difficult to integrate such a system into an information processing environment.
Loose coupling:
  The DM system will use some facilities of a DB or DW system.
  Better than no coupling.
  Drawbacks:
    Because mining does not explore data structures and query optimization methods provided by DB or DW systems, it is difficult for loose coupling to achieve high scalability and good performance with large data sets.
Semitight coupling:
  Besides linking a DM system to a DB/DW system, efficient implementations of a few essential data mining primitives can be provided in the DB/DW system.
  Some frequently used intermediate mining results can be precomputed and stored in the DB/DW system. This design enhances the performance of the DM system.
Tight coupling:
  The DM system is smoothly integrated into the DB/DW system. The data mining subsystem is treated as one functional component of an information system.
  Data mining queries and functions are optimized based on mining query analysis, data structures, indexing schemes, and query processing methods of a DB or DW system.
  This provides a uniform information processing environment.
 Major Issues in Data Mining

Mining methodology and user interaction

Mining different kinds of knowledge in databases

Interactive mining of knowledge at multiple levels of
abstraction

Incorporation of background knowledge

Data mining query languages and ad-hoc data mining

Expression and visualization of data mining results

Handling noise and incomplete data

Pattern evaluation: the interestingness problem
Performance issue

Efficiency and scalability of data mining algorithms

Parallel, distributed and incremental mining methods
Issues relating to the diversity of data types

Handling relational and complex types of data

Mining information from heterogeneous databases and
global information systems (WWW)
 Summary

Data mining: Discovering interesting patterns from large amounts of data

A natural evolution of database technology, in great demand, with wide
applications

A KDD process includes data cleaning, data integration, data selection,
transformation, data mining, pattern evaluation, and knowledge
presentation

Mining can be performed in a variety of information repositories

Data mining functionalities: characterization, discrimination, association,
classification, clustering, outlier and trend analysis, etc.

Data mining systems and architectures

Major issues in data mining