Recent issues in data mining

Download Report

Transcript Recent issues in data mining

Data Mining: Extracting Knowledge
from Past Data
Ming-Syan Chen
Network Database Laboratory
Electrical Engineering Department
National Taiwan University
Outline
• An introduction to data mining
• Challenging issues on data mining
M.-S. Chen
NTU
2
Data Mining
• Data mining: Knowledge discovery in
databases
– extraction of interesting knowledge (rules,
regularities, patterns, constraints) from data in
large databases
– Relevant fields: AI, database, statistics
• We are buried in data, but looking for
knowledge
M.-S. Chen
NTU
3
Knowledge Discovering Process
Interpretation/
Evaluation
Data Mining
Knowledge
Transformation
Patterns
Preprocessing
Selection
………………
………………
………………
Transformed
Data
Preprocessed
Data
Data
M.-S. Chen
Target Data
NTU
4
Mining Capabilities
•
•
•
•
•
•
Association
Classification
Clustering
Traversal patterns
Sequential patterns
and many others
M.-S. Chen
NTU
5
E.g., Mining Association Rules
• Transaction data analysis: Mining association
rules
– Given: (1) a database of transactions
(2) each tx has a list of items purchased
• Find all asso. rules: the presence of one set of
items implies the presence of another set of
items in the same tx
• Two primary approaches
(1) Apriori-Based
(2)
FP-Tree-Based
M.-S. Chen
NTU
6
Two Parameters
• Confidence (how true)
– the rule X&Y => Z has 90% conf. means 90%
of customers who bought X and Y also bought
Z
• Support (how useful the rule is)
– useful rules should have some minimum tx
support
M.-S. Chen
NTU
7
Applications
• 依據不同產業需求提出產業別應用
金融
保險業
零售業
製造業
信用評 即時輔 生產過
等、客 助購買 程中作
製化金 決策之 為最佳
融服務、 依據, 化生產
授信、 並且提 因素決
客戶之 供貨品、 定之專
資產管 架位、 家輔助
理、壞 物流整 決策系
帳分析、 合及配 統,並
道德危 置之輔 且提供
機分析、 助決策 最佳化
逆向選 支援系 之存貨
擇風險 統
控管與
分析、
供應鏈
潛在客
暨顧客
戶名單
利潤率
分析
分析
M.-S. Chen
連鎖業
醫療業
電信業
生技業
作為展 作業成 提供最 提供研
店店址 本管理 佳化之 發平台
之選擇, 之動因 網路交 以及分
以及分 分析、 通配置, 析所需
店貨品 作為顧 暨、客 工具,
品項選 客利潤 製化服 加速累
擇,並 率分析、 務,並 積研發
且作為 或客戶 且提供 能量
物流倉 客製化 即時之
庫位址 服務之 線上客
決策輔 來源
製化輔
助工具,
助資訊
以及物
系統、
流產能
客製化
輔助配
之入口
置之依
網站及
據
輔助促
銷功能
NTU
教育業
廣告業
非營利
組織
作為潛 廣告點
在學生 閱來源
之來源 分析、
名單分 回應率
析,並 分析、
且運用 行銷策
資訊勘 略提供
測作為
入學申
請暨獎
學金申
請評等
之分析,
及學生
課程規
劃與職
涯規劃
之依據
作為勸
募捐款
信函與
通信之
聯繫名
單方式
8
Remarks
• Data mining is very application dependent
– Small team with good skill and domain knowledge
• Lots of work has been done in other areas
• Emerging issues:
– Journals, ACM TODS, ACM TKDD (from 2007), IEEE
TKDE, DMKD, KAIS, IS, Pattern Recognition
– SIGKDD, ICDM (from 2001), SIAM-SDM (from
2001), SIGMOD, ICDE, VLDB, CIKM, ICML, SIGIR,
WI, PAKDD, etc.
M.-S. Chen
NTU
9
What is the Next for Data Mining
•
•
•
•
Privacy-preserving mining
Data stream mining
Mining for bioinformatics
Mining to assist content-based data
management
M.-S. Chen
NTU
10
Data Streams: Computation Model
Synopsis in Memory
Data Streams
Stream
Processing
Engine
(Approximate)
Answer
• Stream processing requirements
– Single pass: Each record is examined at most once
– Bounded storage: Limited Memory for storing synopsis
– Real-time: Per record processing time must be low
M.-S. Chen
NTU
11
Outline
• An introduction to data mining
• Challenging Issues on data mining
M.-S. Chen
NTU
12
Challenging Issues for Data Mining
• Identifying data source for desired knowledge
– Mining purposes: knowledge or auxiliary meta data
• Data collection methods (in Web, wireless, tx)
– Different types of data from different environment
• Usefulness and certainty of mining results
– Support and confidence
• Interactive mining with different data granularities
– e.g., generalized association rules
M.-S. Chen
NTU
13
Issues (cont’d)
• Mining in data streaming environments
– Look at data only once; the amount of data is huge
– incremental mining (temporal and spatial)
• Efficiency and scalability of mining algorithms
– Sampling methods (frequency tuned wrt data or wrt
result accuracy)
• Hardware-enhanced mining
– E.g., PDA, STB, devices for LBS
M.-S. Chen
NTU
14
Issues (cont’d)
• Interestingness of mining results
– Have to know the original likelihood
• Evaluation of mining results
– How to measure the advantage gained
• Expression of various kinds of mining
results
• Protection of privacy and data security
– Data hiding
M.-S. Chen
NTU
15
Ongoing Works in NetDB Lab
•
•
•
•
Web usage mining
Web content mining
Mining in mobile environments
Scalable clustering techniques tuned with
domain knowledge
• Incremental mining (temporal and spatial)
• Hardware-enhanced mining
M.-S. Chen
NTU
16
Summary
• Data mining is an area of growing importance
– Increasing demand for intelligence
– Fast advance in IT techniques
• Mining will be of increasing impact to Web and
wireless applications.
– Huge amount of digital data
– Nature of applications and their users
M.-S. Chen
NTU
17
Graphical user interface
Pattern evaluation
Knowledge
base
Data mining engine
Database or
data warehouse server
Data cleaning
Data integration
Database
M.-S. Chen
Filtering
Data
warehouse
NTU
18
Incremental Mining
• Due to the increasing use of the recordbased databases, recent important
applications have called for the need of
incremental mining
– Such applications include Web log records,
stock market data, grocery sales data,
transactions in electronic commerce, and daily
weather/traffic records, to name a few
M.-S. Chen
NTU
19
Incremental Mining
• To mine the transaction database
for a fixed amount of most recent
data (say, data in the last 12
months)
• One has to not only include new
data (i.e., data in the new month)
into, but also remove the old data
(i.e., data in the most obsolete
month) from the mining process.
M.-S. Chen
NTU
data for 1/2000
Pi
data for 2/2000
Pi+1
dbi, j
dbi+1, j+1
data for 12/2000
Pj
data for 1/2001
Pj+1
20
E.g., Redundant Rules
• For the same support and confidence, if we
have a rule {a,d}=>{c,e,f,g}, what do we
have
–
–
–
–
{a,d}=>{c,e,f}
{a}=>{c,e,f,g}
{a,d,c}=>{e,f,g}
{a}=>{d,c,e,f,g}
M.-S. Chen
NTU
21
E.g., Generalized Asso. Rules
• Which data granularities should be used for
data mining
• To mine meaningful rules (proper data units)
and be as specific as possible
– similar dilemma for other mining capabilities
M.-S. Chen
NTU
22
Clothes
Outerwear
Jackets
Shirts
Hiking Boots
Shoes
Ski Pants
Database
Tx
100
200
300
400
500
600
M.-S. Chen
Freg.
Itemset
Footwear
Itemset
support
Jacket
Outerwear
Clothes
Shoes
Hiking Boots
Footwear
Outerwear, Hiking Boots
Clothes, Hiking Boots
Outerwear, Footwear
Clothes, Footwear
2
3
4
2
2
4
2
2
2
2
Items bought
Shirt
Jacket, Hiking Boots
Ski Pants, Hiking Boots
Shoes
Shoes
Jacket
sup(30%) conf(60%)
Outerwear → Hiking Boots
33%
66%
Outerwear → Footwear
33%
66%
Hiking Boots → Outwear
33%
100%
Hiking Boots → Clothes
33%
100%
However,
Jacket → Hiking Boots
16%
50%
Ski Pants → Hiking Boots
16%
100%
NTU
23
E.g., Interestingness of Rules
• In a school of 5000 students
– 60% (3000) play basketball and 75% (3750) eat
cereal; and 40% (2000) do both
• Say, minimal sup is 2000 and min conf is
60%, one gets the rule
– “play basketball => eat cereal” so ... does that
mean promoting the basketball activities will
help the sales of cereal?
M.-S. Chen
NTU
24
Interestingness (Cont’d)
• In fact, P(A and B)/P(A) should be greater
than P(B) to make the rule “A=>B” be
interesting
– how about for the rule {A,K,}=>{B,L,V} to be
interesting
M.-S. Chen
NTU
25
Related Training
• Database
• AI: machine learning
• Statistics
M.-S. Chen
NTU
26