Data mining
1
Outline

Definition
Techniques
Hypothesis Verification
Knowledge Discovery
Online Analytical Processing
Location Considerations
Benefits and Challenges
2
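Data Mining

Analyzing the data in a data warehouse to reveal hidden patterns, relationships, and trends in historical business activity
One of the main uses of a data warehouse: data mining attempts to find latent trends, relationships, or models in the business activity records stored in the database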
3
Data Mining
Selection
Transformation
Mining (patterns)
Interpretation and evaluation
4
Mining for Decision Support
5
Data Mining Functions

Determining classifications for the data
Forming clusters from the data
Determining whether certain associations exist in the data
Determining whether any pattern or sequence exists in the data
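
As a rough illustration of the association function listed above, the sketch below measures how strongly one item implies another using support and confidence. The basket data and the names `TRANSACTIONS`, `support`, and `confidence` are invented for this example and do not come from the slides.

```python
from typing import FrozenSet, List

# Hypothetical market baskets, invented for illustration only.
TRANSACTIONS: List[FrozenSet[str]] = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "diapers", "beer"}),
    frozenset({"milk", "diapers", "beer"}),
    frozenset({"bread", "milk", "diapers"}),
    frozenset({"bread", "milk", "beer"}),
]


def support(itemset: FrozenSet[str], transactions: List[FrozenSet[str]]) -> float:
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent: FrozenSet[str], consequent: FrozenSet[str],
               transactions: List[FrozenSet[str]]) -> float:
    """Share of transactions containing `antecedent` that also contain `consequent`."""
    base = support(antecedent, transactions)
    if base == 0:
        return 0.0  # the rule never fires, so report no confidence
    return support(antecedent | consequent, transactions) / base


if __name__ == "__main__":
    bread, milk = frozenset({"bread"}), frozenset({"milk"})
    print("support(bread, milk):", support(bread | milk, TRANSACTIONS))
    print("confidence(bread -> milk):", confidence(bread, milk, TRANSACTIONS))
```

A rule is usually considered interesting only when both numbers clear thresholds chosen by the analyst; those thresholds are a judgment call and are not specified in the slides.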
6
Two Types of Mining Approaches

Hypothesis Verification
  Querying
  Modeling
  User-driven
Knowledge Discovery
  Patterns
  System-driven
7
One Example: Harrah’s Casino

Uses a magnetic-stripe loyalty card called “Total Rewards”
30% of customers spent $100 to $300 and accounted for 80% of revenue and 100% of profit
90 demographic segments
Age and distance are two factors in becoming a repeat customer
8
Data Mining Techniques

Recency, frequency, monetary (RFM)
  Thirty-one permutations of sorting four variables (customer number, recency, frequency, monetary)
  Inexpensive; easy to perform
Decision trees
  More complex than RFM
  Helps turn complex data representation into a much easier structure
Cluster analysis
  Place customers/prospects into groups such that everyone in the group has similar traits
  Categories include demographics, psychographics, behavioral, geographic
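
To make the RFM technique above more concrete, here is a minimal sketch that ranks customers on each of the three dimensions and concatenates the scores into an RFM cell. The `Customer` record, the rank-based binning, and the three-bin default are assumptions for illustration; the slides do not prescribe a scoring scheme.

```python
from datetime import date
from typing import Dict, List, NamedTuple


class Customer(NamedTuple):
    customer_id: str
    last_purchase: date   # source of the recency score
    orders: int           # frequency
    total_spent: float    # monetary


def rank_scores(values: List[float], bins: int = 3) -> List[int]:
    """Score each value from 1 to `bins` by rank; larger values score higher.

    Ties are broken by input order, which is a simplification.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    scores = [0] * len(values)
    for rank, idx in enumerate(order):
        scores[idx] = 1 + (rank * bins) // len(values)
    return scores


def rfm_scores(customers: List[Customer], today: date, bins: int = 3) -> Dict[str, str]:
    """Return an RFM cell such as '321' for every customer id."""
    # Negate days since last purchase so that recent buyers rank highest.
    recency = rank_scores([-(today - c.last_purchase).days for c in customers], bins)
    frequency = rank_scores([c.orders for c in customers], bins)
    monetary = rank_scores([c.total_spent for c in customers], bins)
    return {c.customer_id: f"{r}{f}{m}"
            for c, r, f, m in zip(customers, recency, frequency, monetary)}


if __name__ == "__main__":
    sample = [
        Customer("A", date(2024, 6, 1), 10, 900.0),
        Customer("B", date(2024, 1, 15), 2, 80.0),
        Customer("C", date(2024, 4, 10), 5, 300.0),
    ]
    print(rfm_scores(sample, today=date(2024, 6, 30)))
```

Customers in the highest cell are the most recent, most frequent, and highest-spending buyers, which is why RFM is cheap to run yet still useful as a first segmentation.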
9
Other Data Mining Techniques
Artificial neural networks,
business intelligence (BI),
data stream mining,
fuzzy logic,
nearest neighbor algorithm,
pattern recognition,
relational data mining,
text mining,
chi-square, t-test, regression, correlation
10
A Decision Tree
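
The slide's tree diagram is not reproduced in this transcript. As a stand-in, the sketch below shows how a single node of a decision tree might pick its test: it tries each candidate threshold and keeps the one that leaves the purest groups, measured by Gini impurity. The toy data, the choice of Gini impurity, and the function names are assumptions; a full tree would repeat this step recursively on each resulting group.

```python
from collections import Counter
from typing import List, Tuple


def gini(labels: List[str]) -> float:
    """Gini impurity: 0.0 when every label in the group is the same."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(rows: List[Tuple[float, str]]) -> Tuple[float, float]:
    """Return (threshold, weighted impurity) of the best `value <= threshold` test."""
    best = (rows[0][0], float("inf"))
    for threshold in sorted({value for value, _ in rows}):
        left = [label for value, label in rows if value <= threshold]
        right = [label for value, label in rows if value > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
        if score < best[1]:
            best = (threshold, score)
    return best


if __name__ == "__main__":
    # Hypothetical rows: (visits per month, responded to the last promotion?)
    rows = [(1, "no"), (2, "no"), (3, "no"), (6, "yes"), (7, "yes"), (9, "yes")]
    print(best_split(rows))
```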
11
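Genetic Algorithms

Software that uses Darwinian (survival of the fittest), randomizing, and other mathematical functions to simulate an evolutionary process that can yield increasingly better solutions to a problem
Especially useful when thousands of solutions are possible but a single best one must be produced
Uses sets of mathematical process rules that specify how process components or steps are combined; the good parts are recombined through random processes, the better sets are kept and the poorer ones discarded, to arrive at the best solution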
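
The select, recombine, and discard cycle described above can be sketched on a toy problem. The objective here (maximize the number of 1-bits in a bit string), the parameter values, and all function names are assumptions chosen to keep the example short; they are not taken from the slides.

```python
import random
from typing import List, Tuple

Genome = List[int]


def fitness(genome: Genome) -> int:
    """Toy objective: the count of 1-bits; higher is better."""
    return sum(genome)


def crossover(a: Genome, b: Genome, rng: random.Random) -> Genome:
    """Single-point crossover: head of one parent joined to the tail of the other."""
    point = rng.randint(1, len(a) - 1)
    return a[:point] + b[point:]


def mutate(genome: Genome, rate: float, rng: random.Random) -> Genome:
    """Flip each bit independently with probability `rate`."""
    return [1 - bit if rng.random() < rate else bit for bit in genome]


def evolve(length: int = 20, pop_size: int = 30, generations: int = 40,
           mutation_rate: float = 0.02, seed: int = 0) -> Tuple[Genome, List[int]]:
    """Return the best genome found and the best fitness seen at each generation."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    history: List[int] = []
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        survivors = population[: pop_size // 2]   # discard the weaker half
        children = [survivors[0]]                 # carry the best forward unchanged
        while len(children) < pop_size:
            mom, dad = rng.sample(survivors, 2)
            children.append(mutate(crossover(mom, dad, rng), mutation_rate, rng))
        population = children
    best = max(population, key=fitness)
    history.append(fitness(best))
    return best, history


if __name__ == "__main__":
    best, history = evolve()
    print("best fitness per generation:", history)
```

Because the best genome is always carried into the next generation, the recorded best fitness never decreases, which matches the slide's idea of increasingly better solutions.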
12
13
Memory-Based Reasoning

Data records for entities that have a known behavior pattern are grouped to form a test case.
Records for customers with unknown purchase habits can be matched to the profile.
RFM (Recency, Frequency, Monetary level) is an example used as a predictor of future behavior.
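
One common way to do the matching described above is a nearest-neighbor vote: find the known records most similar to the new customer and adopt their majority label. The sketch below assumes each customer is described by three RFM scores on a 1 to 5 scale; the records, labels, and function names are invented for illustration.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple

Record = Tuple[Sequence[float], str]  # (RFM scores, known behavior label)


def distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def predict(known: List[Record], unknown: Sequence[float], k: int = 3) -> str:
    """Label an unknown customer by majority vote among the k closest known records."""
    nearest = sorted(known, key=lambda rec: distance(rec[0], unknown))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical customers with known behavior.
KNOWN: List[Record] = [
    ((5, 5, 4), "repeat buyer"),
    ((4, 5, 5), "repeat buyer"),
    ((5, 4, 5), "repeat buyer"),
    ((1, 1, 2), "lapsed"),
    ((2, 1, 1), "lapsed"),
    ((1, 2, 1), "lapsed"),
]

if __name__ == "__main__":
    print(predict(KNOWN, (4, 4, 4)))
```

The features here share one scale. With raw values such as days and dollars, they would normally be rescaled first so that no single dimension dominates the distance.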
14
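Neural Networks
Definition:
Computing systems modeled after the brain’s mesh-like network of interconnected processing elements (PEs), called neurons
Can recognize patterns and relationships in the data they process; the more data examples they receive, the better they learn

A network has one input layer, one output layer, and several hidden processing layers.

The influence of each PE node on another node depends on the weight of the connection between them.

Inputs represent the attributes of the problem and are multiplied by weights that express how strongly they influence the next PE, so different weights produce different outputs.

From the past cases fed in, the best structure and the weight of each path are determined from the relationship between the outputs and the input characteristics; this is what is meant by machine learning.

When a new case to be judged is fed in, the network automatically predicts the likely result.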
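
The weighted-input idea above can be sketched as a forward pass through one hidden layer. The sigmoid activation, the bias terms, and the hand-picked weights are assumptions for illustration; the training step that would learn the weights from past cases is deliberately left out.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def layer(inputs: List[float], weights: List[List[float]],
          biases: List[float]) -> List[float]:
    """Each PE sums its weighted inputs, adds a bias, and applies the activation."""
    return [
        sigmoid(sum(w * x for w, x in zip(pe_weights, inputs)) + bias)
        for pe_weights, bias in zip(weights, biases)
    ]


def forward(inputs: List[float],
            hidden_w: List[List[float]], hidden_b: List[float],
            output_w: List[List[float]], output_b: List[float]) -> List[float]:
    """Pass the inputs through the hidden layer, then through the output layer."""
    hidden = layer(inputs, hidden_w, hidden_b)
    return layer(hidden, output_w, output_b)


if __name__ == "__main__":
    inputs = [1.0, 2.0]  # two attributes of a hypothetical case
    # Same inputs, two different weight settings: the outputs differ.
    print(forward(inputs, [[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0], [[0.0, 0.0]], [0.0]))
    print(forward(inputs, [[1.0, 1.0], [1.0, 1.0]], [0.0, 0.0], [[1.0, 1.0]], [0.0]))
```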
15
A neural network
16
Basic Concepts and Architecture of Neural Networks
17
18
Online Analytical Processing (OLAP)

Enables managers and analysts to interactively examine and manipulate large amounts of detailed and consolidated data from many perspectives
19
20
Analytical Operations

Consolidation – aggregation of data, ranging from simple roll-ups to complex groupings of interrelated data
Drill-down – the reverse direction (top down): displaying the detail data that comprise consolidated data
Slice and Dice – ability to look at the database from different viewpoints
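
The three operations above can be sketched over a small list of sales facts. The rows, dimension names, and helper functions are invented for illustration; a real OLAP server would work against a cube or warehouse rather than in-memory dictionaries.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical fact rows, invented for illustration only.
SALES: List[dict] = [
    {"region": "North", "store": "N1", "product": "tea", "amount": 100},
    {"region": "North", "store": "N2", "product": "tea", "amount": 50},
    {"region": "North", "store": "N1", "product": "coffee", "amount": 70},
    {"region": "South", "store": "S1", "product": "tea", "amount": 30},
    {"region": "South", "store": "S1", "product": "coffee", "amount": 90},
]


def roll_up(rows: List[dict], dims: List[str]) -> Dict[Tuple[str, ...], int]:
    """Consolidation: total `amount` grouped by the given dimensions."""
    totals: Dict[Tuple[str, ...], int] = defaultdict(int)
    for row in rows:
        totals[tuple(row[d] for d in dims)] += row["amount"]
    return dict(totals)


def slice_rows(rows: List[dict], **fixed: str) -> List[dict]:
    """Slice: keep only the rows where each named dimension has the given value."""
    return [r for r in rows if all(r[k] == v for k, v in fixed.items())]


if __name__ == "__main__":
    # Consolidation: totals per region.
    print(roll_up(SALES, ["region"]))
    # Drill-down: open up one region to see its stores.
    print(roll_up(slice_rows(SALES, region="North"), ["store"]))
    # Slice and dice: the same facts viewed by product instead of by place.
    print(roll_up(SALES, ["product"]))
```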
21
OLAP Technology
22
23
Types of data mining system environments

Decision Support Systems (DSS)
  “List current inventory, predict sales of products to be promoted, and list inventory requirements by store”
  “Determine who are responders and nonresponders for the last promotion”
  “Identify nonresponders from the last promotion and send them a second promotional offer using a different advertising copy”
24
Types of data mining system environments

Executive Information Systems (EIS) – Dashboards
  “Provide ROI results for all sales promotions for the last sixty days”
  “Populate a spreadsheet with sales by product category from the Web, catalogue, and retail. Allow for simple data manipulation for the purpose of creating trend reports”
25
Types of data mining system environments

Enterprise Resource Planning (ERP)
  “Process all online orders within twelve hours and send an alert to quality and control when the time limit is exceeded”
  “Automatically notify the supplier to restock when inventory depletes to a certain level”
  “Update customer service ODS with current customer order status information”
26
Types of data mining system environments

Customer Relationship Management (CRM)
  “Identify the most profitable customers by household level for the last twenty-four months and create a recognition strategy at different incremental levels based on profitability level”
  “Determine which customers have purchased for their own consumer needs versus on behalf of the company they work for and create a profitability index for each”
  “Examine customer purchase history and build a channel preference profile for each customer including time variations such as ‘snowbirds’”
27
Data Mining Location and Access Considerations

Operational Data Store (ODS)
  Dynamic data repository
  Tactical and decision report applications
  Data limited to current operational needs
Data warehouse (DW)
  More static than ODS
  Large depth and breadth of information
  Data transformed into knowledge
  Analysis, strategy, and planning applications
Data marts (DM)
  Receive data from DW or ODS, but usually the former
  Limited but concentrated information
  Data transformed into knowledge
  Analysis, strategy, and planning applications
  Usually designed for use as a narrow application
  Data mining and statistics
28
Data Mining Benefits

Better understanding of customers and prospects supports relationship building efforts
Measurable
Fatigue prevention
Precipitate new opportunities
Fraud detection and identification of unfavorable behavior
29
Data Mining Challenges

Organizational obstacles to attaining data
Cost versus benefit
Ability to capture data
Giving customers/prospects a perception of invasiveness
Privacy issues
Sustained secondary availability
Ability to perform data and information transformation
Technology and analytical expertise
“Analysis Paralysis”
30