Data Mining
Han-Joon Kim (김한준)
University of Seoul
Data Mining Lab., Univ. of Seoul, Copyright © 2008
Data Mining
Knowledge Discovery in Large Databases (KDD)
The task of discovering previously unknown, implicit, and potentially useful information from large volumes of data
Data Mining Applications
Advertising
Bioinformatics
Customer Relationship Management (CRM)
Database Marketing
Fraud Detection
eCommerce
Health Care
Investment/Securities
Manufacturing, Process Control
Sports and Entertainment
Telecommunications
Web
Data Mining - Discovering Purchase Patterns
Data Mining - Discovering Purchase Patterns
Recommendation
Data Mining - Document Patterns
[Figure: a Yahoo!-style 'Entertainment' directory organized manually vs. automatically, with categories such as Comics & Animation, Movies & Film, and News & Media, and subtopics including editorial cartoons, comic books, comic strips, animation, anime, animated GIFs, manga, cartoonists, conventions, magazines, filmmaking, film festivals, reviews, history, computer animation, short films, screenwriting, and writing]
Data Mining - Understanding Data Characteristics
Data Mining Methodologies
1. Classification
2. Clustering
   • Outlier Detection
3. Association Rule Mining
   • Sequential Pattern Mining
Patterns by Data Mining Methodology
Classification (prediction model): IF "age < 25" AND "salary > 40,000" THEN prefers "sports cars"
Clustering (description model): Group A: age = 30s AND job = 'IT' AND address = 'Seoul'
Association rules (description model): 98% of people who purchase diapers also buy beer
Sequential pattern (description model): <Star Wars> <Empire Strikes Back> <Return of the Jedi>
Data Mining Methodologies
Classification (Categorization)
[Figure: training data → learning → model; the learned model then classifies unknown data into classes such as 'profitable', 'common', and 'least loyal']
Classification
Machine Learning
[Figure: machine learning derives a pattern/model ('intelligence') from training data (past data) and applies it to unknown data for future prediction, again with classes such as 'profitable', 'common', and 'least loyal']
Classification
Machine Learning
  Classification
    Decision tree induction
    Neural network
  (Value) Prediction
    Linear regression
    Nonlinear regression
Machine Learning Algorithms
Decision Tree
Neural Networks
Bayesian Statistics
  Naïve Bayes
  Bayesian Networks
Instance-based Learning
  K-Nearest Neighbor
Support Vector Machine
Rough Set Theory
Ensemble (Committee)
  Bagging, Boosting
Classification
Decision Tree
Credit Analysis
Training data:
  salary   education       label
  10000    high school     reject
  40000    undergraduate   accept
  15000    undergraduate   reject
  75000    graduate        accept
  18000    graduate        accept
Decision tree learned by training:
  salary < 20000?
    yes → education in graduate?
      yes → accept
      no  → reject
    no  → accept
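For illustration, the toy credit data above can be fed to an off-the-shelf decision-tree learner. This is a minimal sketch assuming scikit-learn is available; the ordinal encoding of the education column is my own choice, not part of the lecture.

```python
# Illustrative sketch: fit a decision tree to the slide's toy credit data.
# Assumes scikit-learn; the ordinal encoding of 'education' is an arbitrary choice.
from sklearn.tree import DecisionTreeClassifier, export_text

education = {"high school": 0, "undergraduate": 1, "graduate": 2}
X = [[10000, education["high school"]],
     [40000, education["undergraduate"]],
     [15000, education["undergraduate"]],
     [75000, education["graduate"]],
     [18000, education["graduate"]]]
y = ["reject", "accept", "reject", "accept", "accept"]

tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
print(export_text(tree, feature_names=["salary", "education"]))

# Classify an unknown applicant (salary 30000, undergraduate).
print(tree.predict([[30000, education["undergraduate"]]]))
```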
Classification
Neural Networks
[Figure: a neural network being trained on labeled data]
Classification
k-Nearest Neighbors
[Figure: k-Nearest Neighbors — training data labeled + and -, plus unknown data points (?); each unknown point is classified by voting among its k nearest neighbors]
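A minimal sketch of k-NN with majority voting as in the figure; the 2-D points and labels are invented for illustration, and only the Python standard library is used.

```python
# Minimal k-NN with majority voting; toy 2-D points, purely illustrative.
from collections import Counter
import math

def knn_classify(query, training, k=3):
    """training: list of ((x, y), label); return the majority label of the k nearest points."""
    by_distance = sorted(training, key=lambda item: math.dist(query, item[0]))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

training = [((1, 1), "+"), ((2, 1), "+"), ((1, 2), "+"),
            ((6, 6), "-"), ((7, 5), "-"), ((6, 7), "-")]
print(knn_classify((2, 2), training, k=3))   # '+'
print(knn_classify((6, 5), training, k=3))   # '-'
```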
Classification
Bayesian Statistics
Given training data D, the posterior probability of a hypothesis h, P(h|D), follows Bayes' theorem:
$$P(h \mid D) = \frac{P(D \mid h)\,P(h)}{P(D)}$$
MAP (maximum a posteriori) hypothesis: the maximally probable hypothesis
$$h_{MAP} = \arg\max_{h \in H} P(h \mid D) = \arg\max_{h \in H} P(D \mid h)\,P(h)$$
Classification
Bayesian Statistics
$$\mathrm{classifier}(d) = \arg\max_{c} \Pr(c \mid d) = \arg\max_{c} \Pr(c) \prod_{w \in d} \Pr(w \mid c)$$
(using estimated probability distributions)
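The formula above is the Naïve Bayes style text classifier (Naïve Bayes is listed among the algorithms earlier). Below is a small sketch that estimates Pr(c) and Pr(w|c) with Laplace smoothing from an invented toy corpus; the data and smoothing choice are illustrative only.

```python
# Naive Bayes text classifier sketch: classifier(d) = argmax_c Pr(c) * prod_{w in d} Pr(w|c).
# The toy corpus and Laplace smoothing are illustrative assumptions.
import math
from collections import Counter, defaultdict

docs = [("buy cheap camera deal", "spam"),
        ("cheap deal today", "spam"),
        ("meeting agenda for project", "ham"),
        ("project camera review meeting", "ham")]

class_counts = Counter(c for _, c in docs)
word_counts = defaultdict(Counter)
for text, c in docs:
    word_counts[c].update(text.split())
vocab = {w for text, _ in docs for w in text.split()}

def classify(text):
    best_class, best_score = None, float("-inf")
    for c in class_counts:
        score = math.log(class_counts[c] / len(docs))       # log Pr(c)
        total = sum(word_counts[c].values())
        for w in text.split():                               # log Pr(w|c) with Laplace smoothing
            score += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best_class, best_score = c, score
    return best_class

print(classify("cheap camera deal"))  # expected: 'spam'
```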
Classification
Support Vector Machine
The hyperplane that separates positive and negative training data is $\mathbf{w} \cdot \mathbf{x} + b = 0$.
CS583, Bing Liu, UIC
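A sketch of fitting a linear SVM and reading off the learned hyperplane w·x + b = 0; scikit-learn is assumed and the toy points are invented.

```python
# Linear SVM sketch: learn a separating hyperplane w.x + b = 0 on toy 2-D data.
# scikit-learn assumed; the data points are invented for illustration.
from sklearn.svm import SVC

X = [[1, 1], [2, 1], [1, 2],      # positive examples
     [6, 6], [7, 5], [6, 7]]      # negative examples
y = [1, 1, 1, -1, -1, -1]

model = SVC(kernel="linear").fit(X, y)
w, b = model.coef_[0], model.intercept_[0]
print("hyperplane: %.2f*x1 + %.2f*x2 + %.2f = 0" % (w[0], w[1], b))
print(model.predict([[2, 2], [6, 5]]))   # expected: 1 for the first point, -1 for the second
```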
Applications of Classification
Document classification
Automatic classification of web documents in a web-directory-based search engine
Applications of Classification
CRM (Customer Relationship Management)
Customer classification
  Premium customers
  Churning customers
  Regular customers
  Delinquent customers
Differentiated services for each customer class
  Direct mail campaigns
  Targeted marketing
  Customer churn prevention
Applications of Classification
Digital camera review analysis report
  Price range: 450,000–545,000 KRW
  Easy to operate, good grip, battery life somewhat longer than Canon's, …
Per-attribute review analysis
  Ease of use
  Picture quality
  Battery life
Classification
Data Mining Methodologies
1. Classification
   • Meta-learning
2. Clustering
   • Outlier Detection
3. Association Rule Mining
   • Sequential Pattern Mining
Classification
Meta-learning
Stage 1 – Generate multiple datasets: ① split the initial training data D into t training subsets D1, D2, …, Dt-1, Dt according to an appropriate principle.
Stage 2 – Generate multiple prediction models: ② build a prediction model C1, C2, …, Ct-1, Ct from each training subset (using a machine-learning algorithm).
Stage 3 – Complete the final prediction model: ③ combine the generated models into a final model C (related techniques: Bagging and Boosting meta-learning algorithms).
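The three stages above are essentially what bagging implements. A minimal sketch assuming scikit-learn, with a synthetic dataset standing in for the initial training data; the parameters are illustrative.

```python
# Bagging sketch: t bootstrap training subsets -> t models -> combined (voted) model.
# scikit-learn assumed; the synthetic dataset is illustrative only.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# n_estimators = t: number of bootstrap subsets / base models (decision trees by default).
ensemble = BaggingClassifier(n_estimators=10, random_state=0).fit(X, y)
print(ensemble.score(X, y))
```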
Classification
Meta-learning
Semi-supervised learning
Supervised learning + clustering
Clustering techniques are used to generate the training data
Data Mining Methodologies
1. Classification
2. Clustering
   • Outlier Detection
3. Association Rule Mining
   • Sequential Pattern Mining
Clustering
Customer data = represented by demographic information, purchase information, etc.
[Figure: example clusters such as 'teenagers having a computer' and 'young urban career women']
Uses of Clustering
Summarization of large data: understand the large customer data
Data organization: manage the large customer data
Outlier detection: find unusual customer data
A preliminary step for Classification / Association Rule Mining
Uses of Clustering
A preliminary step for Classification / Association Rule Mining
  Derive classes from meaningful clusters
  Perform Association Rule Mining on the data within each cluster
Uses of Clustering
Information Retrieval
  Obtaining search results
  Re-organizing search results
  Discovering new topics (categories)
Documents within the same cluster are regarded as "relevant" documents
Uses of Clustering
Clustering of search results
Uses of Clustering
Clusty.com
Vivisimo Inc.
Uses of Clustering
Discovering new topics
[Figure: as concepts drift over time, clustering the documents of the 'Movie & Film' category discovers new categories such as 'Film Festivals', 'Screen Plays', and 'Plays']
Clustering Algorithms
Partitioning methods
Hierarchical methods
Density-based methods
Grid-based methods
Model-based methods
Clustering
Partitioning methods
[Figure: iterative partitioning of the data points (iteration I → II); each cluster is represented by its centroid]
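A partitioning-method sketch in the k-means style, where each cluster is represented by its centroid; scikit-learn and synthetic blobs are assumed, and all settings are illustrative.

```python
# Partitioning (k-means) sketch: assign points to k clusters, each represented by a centroid.
# scikit-learn assumed; synthetic 2-D blobs are used for illustration.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=0)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.cluster_centers_)      # one centroid per cluster
print(kmeans.labels_[:10])          # cluster assignment of the first 10 points
```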
Clustering Algorithms
Partitioning methods
Hierarchical methods
Density-based methods
Grid-based methods
Model-based methods
Clustering
Hierarchical methods
[Figure: agglomerative clustering of items a, b, c, d, e over steps 0–4 (a+b → ab, d+e → de, c+de → cde, ab+cde → abcde); divisive clustering performs the same steps in reverse order]
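An agglomerative-clustering sketch for five items a–e using SciPy (an assumed library); the 2-D coordinates are invented so that the single-linkage merge order roughly mirrors the figure.

```python
# Agglomerative (bottom-up) clustering sketch for five items a..e.
# SciPy assumed; coordinates are invented so the merges roughly follow a+b, d+e, c+de, ab+cde.
from scipy.cluster.hierarchy import linkage, fcluster

points = [[0.0, 0.0],   # a
          [0.5, 0.0],   # b
          [3.0, 0.0],   # c
          [5.0, 0.0],   # d
          [5.4, 0.0]]   # e
Z = linkage(points, method="single")         # each row: (cluster_i, cluster_j, distance, size)
print(Z)
print(fcluster(Z, t=2, criterion="maxclust"))  # cut the tree into 2 clusters
```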
Clustering Algorithms
Partitioning methods
Hierarchical methods
Density-based methods
Grid-based methods
Model-based methods
Clustering Density-based methods
Clusters: density-connected sets
DBSCAN algorithm
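A DBSCAN sketch assuming scikit-learn; the eps and min_samples values are arbitrary illustrative settings, and points that fall in no density-connected set are labeled -1 (noise/outliers).

```python
# DBSCAN sketch: clusters are density-connected sets; sparse points become noise (-1).
# scikit-learn assumed; the data and eps/min_samples settings are illustrative.
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
print(set(labels))          # cluster ids; -1 marks noise points, if any
```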
Clustering
Density-based methods
Based on a set of density distribution functions
Density function
Clustering Algorithms
Partitioning method: K-means, K-medoids, …
Hierarchical method: agglomerative and divisive hierarchical clustering (complete-linkage, single-linkage, group-average, Ward's method); BIRCH, CURE, …
Density-based method: DBSCAN, OPTICS, …
Grid-based method: STING, WaveCluster, CLIQUE, …
Model-based method: statistical approach, …
Clustering
Distance Measure
Distance = dissimilarity
Manhattan distance
Euclidean distance
Minkowski metric
Mahalanobis distance
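For reference, the four measures listed above can be written out as follows; these are the standard textbook definitions (with covariance matrix Σ for the Mahalanobis distance), not formulas reproduced from the slide.

```latex
% Standard definitions of the listed distance measures (textbook forms).
\begin{align*}
d_{\text{Manhattan}}(x, y) &= \sum_{i=1}^{n} |x_i - y_i| \\
d_{\text{Euclidean}}(x, y) &= \Big(\sum_{i=1}^{n} (x_i - y_i)^2\Big)^{1/2} \\
d_{\text{Minkowski}}(x, y) &= \Big(\sum_{i=1}^{n} |x_i - y_i|^p\Big)^{1/p} \\
d_{\text{Mahalanobis}}(x, y) &= \sqrt{(x - y)^{\top} \Sigma^{-1} (x - y)}
\end{align*}
```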
Data Mining Methodologies
1. Classification
2. Clustering
   • Outlier Detection
3. Association Rule Mining
   • Sequential Pattern Mining
Association (Rule) Mining
Basket Analysis
Association Rules
Example:
  Transaction Id   Purchased Items
  1                {1, 2, 3}
  2                {1, 4}
  3                {1, 3}
  4                {2, 5, 6}
Association Rules:
  1 => 3 with 50% support and 66% confidence
  3 => 1 with 50% support and 100% confidence
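The support and confidence figures above can be checked with a few lines of Python; this is a sketch over the four example transactions, not an Apriori implementation.

```python
# Checking support/confidence of the example rules over the four transactions.
transactions = [{1, 2, 3}, {1, 4}, {1, 3}, {2, 5, 6}]

def support(itemset):
    """Fraction of transactions containing all items of the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    """support(lhs ∪ rhs) / support(lhs)."""
    return support(lhs | rhs) / support(lhs)

print(support({1, 3}), confidence({1}, {3}))   # 0.5, 0.666...  (rule 1 => 3)
print(support({1, 3}), confidence({3}, {1}))   # 0.5, 1.0       (rule 3 => 1)
```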
Extensions of Association Mining
How does association mining differ from the (prediction) models of machine learning?
Data Mining: Confluence of Multiple Disciplines
[Figure: data mining draws on database technology, machine learning, statistics, information science, visualization, and other disciplines]
Data Mining and Text Mining
Text Mining
Differences from data mining
Analyzes both raw data and textual information at the same time
Requires sophisticated feature selection techniques
May include linguistic, lexical, and contextual techniques
Text Mining
Feature Selection
Stopword removal
Zipf's Law
DF (document frequency)-based
χ² statistics-based
Mutual Information
Term Strength
etc.
Text Mining Feature Selection
Zipf’s Law
Text Mining Feature Selection
x2 statistics-based
t
/t
C
p
q
/C
r
s
n ( ps qr )
(c, t )
( p r ) ( p q) (q s ) (r s )
2
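A direct transcription of the formula above into code; the contingency counts in the example call are invented purely to illustrate usage.

```python
# Chi-square feature selection score, transcribed from the formula above.
def chi_square(p, q, r, s):
    """p, q, r, s: counts for (c, t), (c, not t), (not c, t), (not c, not t)."""
    n = p + q + r + s
    return n * (p * s - q * r) ** 2 / ((p + r) * (p + q) * (q + s) * (r + s))

# Invented counts: the term occurs in 40 of 50 docs of class c, and in 10 of 150 docs outside c.
print(chi_square(p=40, q=10, r=10, s=140))
```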
Text Clustering
Clustering of search results
Association Rules (Sequential Patterns)
Example news article (published August 31, 2006): "Lenovo Group has hired a former Dell executive as senior vice president of its global supply chain, the sixth Dell executive it has tapped in the past month."
Extracted pattern: (ORG: Lenovo Group) [Verb: hired] (ORG: Dell) (POS: former executive) (POS: senior vice president)
Association Rules (Sequential Patterns)
Verbs      Patterns
HIRE       PER|ORG – V – PER-POS
RELEASE    ORG – V – PRD
MERGE      ORG – V – ORG
Opinion Mining
Introduction – Opinion Mining
Summary Report
Introduction – Opinion Mining
Opinion: user-generated media
Opinions on the Web:
◦ contain personal experiences and opinions (sentiments) on almost anything, at review sites, forums, discussion groups, blogs, ...
◦ "subjective" opinions on products, people, organizations, topics, etc.
Opinion Mining
◦ mining the opinions expressed in user-generated content to produce summarized information
Examples of Product Reviews
http://amazon.com
Introduction – Opinion Mining
Motivation
Two main types of information on the Web: "facts" and "opinions"
◦ Current search engines search for "facts"
◦ Search engines do not search for opinions
◦ The current search ranking strategy is not appropriate for 'opinion search'
Opinion Search
customer (detailed) opinions on a digital camera
Introduction – Opinion Mining
Tasks
Review classification (semantic orientation recognition)
  Ex) Is a review positive or negative toward the Nikon D40X?
Product review mining
  Ex) What features of the Nikon D40X do customers like and which do they dislike?
Temporal tracking of sentiments
  Ex) Is the sentiment toward product X rising or cooling down?
Taxonomy of Opinion Mining
Mining scope
  Document-level opinion mining
  Sentence-level opinion mining
Approaches
  Sentiment classification
  Feature-based opinion mining (summarization)
  Comparative sentence and relation mining
  Phrase (word) sentiment recognition
Semantic Orientation
Point-wise Mutual Information (PMI)
A form of correlation measure
Positive when words co-occur more often than expected by chance, negative otherwise
Given two words, w1 and w2:
$$PMI(w_1, w_2) = \log_2 \frac{\Pr(w_1 \,\&\, w_2)}{\Pr(w_1)\,\Pr(w_2)}$$
$$SO(word) = \sum_{pword \in Pwords} PMI(word, pword) \;-\; \sum_{nword \in Nwords} PMI(word, nword)$$
Pwords = {good, nice, excellent, positive, correct, superior, …}
Nwords = {bad, nasty, poor, negative, wrong, inferior, …}.
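A sketch of the SO-PMI computation above. In practice the probabilities come from corpus or web hit counts; here a tiny invented co-occurrence table stands in for them, so the numbers are illustrative only.

```python
# PMI-based semantic orientation sketch.
# pr() values would normally be estimated from corpus or web hit counts; here a tiny
# invented co-occurrence table stands in for them, purely for illustration.
import math

POS_WORDS = {"good", "nice", "excellent"}
NEG_WORDS = {"bad", "poor", "nasty"}

cooccur = {("reliable", "good"): 0.020, ("reliable", "excellent"): 0.010,
           ("reliable", "bad"): 0.002}
unigram = {"reliable": 0.05, "good": 0.30, "nice": 0.20, "excellent": 0.10,
           "bad": 0.25, "poor": 0.15, "nasty": 0.05}

def pmi(w1, w2):
    joint = cooccur.get((w1, w2), 1e-6)          # smoothing for unseen pairs
    return math.log2(joint / (unigram[w1] * unigram[w2]))

def semantic_orientation(word):
    return (sum(pmi(word, p) for p in POS_WORDS)
            - sum(pmi(word, n) for n in NEG_WORDS))

print(semantic_orientation("reliable"))          # > 0 suggests a positive orientation
```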
Sentiment Classification
Uses text classification techniques
  Machine learning: Naïve Bayes, Support Vector Machine
Build a binary opinion classifier
  Requires a labeled database of opinions
Preparing training data
  1. Download ratings from Amazon.com, epinions.com, etc.
  2. Convert ratings to labels: 1 and 2 stars -> negative; 3, 4, and 5 stars -> positive
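A minimal sketch of the training-data preparation described above, assuming scikit-learn; the example reviews and star ratings are invented, and the 1–2 vs. 3–5 star split follows the slide.

```python
# Sketch: turn star ratings into sentiment labels and train a classifier
# (1-2 stars -> negative, 3-5 stars -> positive, as on the slide).
# scikit-learn assumed; the reviews themselves are invented examples.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

reviews = [("terrible battery, returned it", 1),
           ("blurry pictures, not worth the price", 2),
           ("decent camera for the money", 3),
           ("great pictures and easy to use", 5)]

texts = [text for text, _ in reviews]
labels = ["negative" if stars <= 2 else "positive" for _, stars in reviews]

model = make_pipeline(CountVectorizer(), MultinomialNB()).fit(texts, labels)
print(model.predict(["easy to use and great battery"]))   # expected: ['positive']
```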
Feature-based Opinion Mining
Summarize opinions by feature
Example
Feature1: picture
Positive: 12
•The pictures coming out of this camera are amazing.
•Overall this is a good camera with a really good picture clarity.
…
Negative: 2
•The pictures come out hazy if your hands shake even for a moment during the entire process of taking a picture.
•Focusing on a display rack about 20 feet away in a brightly lit room during day time, pictures produced by this camera were blurry and in a shade of orange.
Feature2: battery life
…
Topic Discovery
[Figure: documents classified into the existing 'Entertainment' > 'Movie & Film' category are re-organized by clustering, which discovers new categories such as 'Film Festivals', 'Screen Plays', and 'Plays']
Clustering (Topic Discovery)
Conventional clustering methods
  Hierarchical clustering methods
  Depend on the standard Euclidean distance metric
  – Not suitable for topic discovery
Goal
  – Generate user-specific clusters!
[Figure: documents d1, d2, … plotted in a term space with axes wi, wj, wk]
Supervised Clustering (Topic Discovery)
[Figure: a document collection is clustered with human knowledge; the resulting clusters A', B', C', D', E' become topics (categories)]
Concept (Topic, Category, Keyword) Relationship
Concept Association
  Similarity: # of docs in which the companies co-occur
Keyword Association
Category Subsumption Relation
[Figure: a general Category A with terms ti0, ti1, …, tip and a specific Category B with terms tj0, tj1, …, tjq; a characteristic function χc(tjq) relates terms to categories]
Term Subsumption Relation
[Figure: document sets Dti and Dtj for terms ti and tj; if Dti contains Dtj, then ti is the more general term]
Category Hierarchy
Using category subsumption relation
[Figure: flat categories Ca, Cb, Cc, Cd, Ce are organized into a hierarchy using the category subsumption relation]
Ontology Construction
Ontology
  Concepts + Relationships
Ontology construction
  Raw text documents => { concept, relationship }
Solutions
  Manual
  Machine learning
  Inference
  Mixed
Anomaly Detection
Anomaly/Outlier Detection
What are anomalies/outliers?
The set of data points that are considerably different from the remainder of the data
Applications:
Credit card fraud detection, telecommunication fraud detection, network intrusion detection, fault detection
Anomaly Detection Schemes
General Steps
Build a profile of the “normal” behavior
Use the “normal” profile to detect anomalies
Profile can be patterns or summary statistics for the overall population
Anomalies are observations whose characteristics differ significantly from the normal profile
Types of anomaly detection schemes
Graphical & Statistical-based
Distance-based
Model-based
Distance-based Approaches
Data is represented as a vector of features
Three major approaches
Nearest-neighbor based
Density based
Clustering based
Nearest-Neighbor Based Approach
Approach:
Compute the distance between every pair of data points
There are various ways to define outliers:
Data points for which there are fewer than p neighboring points within a distance D
The top n data points whose distance to the kth nearest neighbor is greatest
The top n data points whose average distance to the k nearest neighbors is greatest
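A sketch of the second criterion above (rank points by the distance to their k-th nearest neighbor), using only the standard library on invented 2-D points.

```python
# Distance-based outlier sketch: rank points by the distance to their k-th nearest neighbor.
# Pure standard library; the 2-D points are invented, with (10, 10) as an obvious outlier.
import math

points = [(1, 1), (1, 2), (2, 1), (2, 2), (1.5, 1.5), (10, 10)]

def kth_nn_distance(i, k=2):
    dists = sorted(math.dist(points[i], p) for j, p in enumerate(points) if j != i)
    return dists[k - 1]

ranked = sorted(range(len(points)), key=kth_nn_distance, reverse=True)
print([points[i] for i in ranked[:1]])   # top-1 outlier candidate: (10, 10)
```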
Clustering-Based Approach
Basic idea:
Cluster the data into groups of different density
Choose points in small clusters as candidate outliers
Compute the distance between candidate points and non-candidate clusters.
If candidate points are far from all other non-candidate points, they are outliers
Information Extraction
What is Information Extraction?
Recovering structured data from formatted text
Identifying fields (e.g. named entity recognition)
Understanding relations between fields (e.g. record association)
Normalization and deduplication
Extracting Job Openings from the Web
Title: Ice Cream Guru
Description: If you dream of cold creamy…
Contact: [email protected]
Category: Travel/Hospitality
Function: Food Services
Using Machine Learning
Training data: documents marked up with ground truth
In contrast to text classification, local features are crucial.
Features of:
  Contents
  Text just before item
  Text just after item
  Begin/end boundaries
[Figure: example — "… 00 : pm Place : Wean Hall Rm 5409 Speaker : Sebastian Thrun …", segmented into prefix, contents, and suffix around the extracted item]
Related Applications
www.textmap.org
Summarization
Summarization Example: MS-Word
The patient lives alone. Her son lives in town. She has one son. She is a widow. She does not
smoke, quit 20 years ago, and admits drinking alcohol socially. She is retired. She used to work as
salesman in real estate department. She is able to perform activities of daily living, but it is getting
harder. She does not have any hired help. The patient is an 86-year-old woman with a 100-pack-year history of smoking. She presents with a mass in the right middle lobe of the
lung, which has been slowly enlarging since last year. This is an 82-year-old white female
who came in to see Dr. James Kang and underwent lumbar laminectomy. She had fusion
of L3 through L5 with revision of previous laminectomy effusion. This was done on
06/06/2002. She immediately noticed in the postoperatively period some swelling of her glands, no
pain with this. The swelling persisted but this was gradually subsiding and I was asked to
see her about this swelling. They also asked me to address her complaint of difficulty
emptying her bladder. She is to go to rehabilitation today. Her past medical history is extensive
and it involves multiple surgeries. She also has a history of type II diabetes, diabetic neuropathy, no
known knowledge of diabetic nephropathy and denies knowledge of diabetic retinopathy. She does
have gastric emptying dysfunction and diabetic neuropathy in the peripheral extremities. She has
some bladder dysfunction with difficulty with retention of her urine, unable to initiate her urine
predating her surgery.
Summarization Example: MEAD
The patient lives alone. Her son lives in town. She has one son. She is a widow. She does not
smoke, quit 20 years ago, and admits drinking alcohol socially. She is retired. She used to
work as salesman in real estate department. She is able to perform activities of daily living, but it is
getting harder. She does not have any hired help. The patient is an 86-year-old woman with a 100-pack-year history of smoking. She presents with a mass in the right middle lobe of the lung,
which has been slowly enlarging since last year. This is an 82-year-old white female who came
in to see Dr. James Kang and underwent lumbar laminectomy. She had fusion of L3 through L5 with
revision of previous laminectomy effusion. This was done on 06/06/2002. She immediately
noticed in the postoperatively period some swelling of her glands, no pain with this. The
swelling persisted but this was gradually subsiding and I was asked to see her about this swelling.
They also asked me to address her complaint of difficulty emptying her bladder. She is to
go to rehabilitation today. Her past medical history is extensive and it involves multiple surgeries.
She also has a history of type II diabetes, diabetic neuropathy, no known knowledge of
diabetic nephropathy and denies knowledge of diabetic retinopathy. She does have
gastric emptying dysfunction and diabetic neuropathy in the peripheral extremities.
She has some bladder dysfunction with difficulty with retention of her urine, unable to
initiate her urine predating her surgery.
Summarization: Method 1
1. Sentence clustering
2. Summary layout
   1. Group the sentences by cluster
   2. Select the sentences closest to each of the centroids, or select the sentence(s) that best subsume other sentences
Summarization: Method 2
Clustering medical concepts: concept clustering
Measuring relatedness of concepts
Ignore the value component of the concept
Extracting representative concepts
Necessary to define the degree of importance
An importance measure = the # of links
concept := <ID, type, keyphrase, value>
Summarization: Method 3
Using the TF*IDF weighting scheme
Term frequency (TF) -> concept frequency (CF)
A concept with a high CF*IDF value may be significant.
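A sketch of the CF*IDF idea: since concept extraction itself is out of scope here, each document's concepts are simply given as a list, and the example concepts are invented.

```python
# CF*IDF sketch: concept frequency (CF) within a document times inverse document frequency.
# Concepts per document are given directly, since concept extraction is out of scope here.
import math
from collections import Counter

docs_concepts = [["diabetes", "neuropathy", "smoking"],
                 ["smoking", "lung mass", "smoking"],
                 ["laminectomy", "diabetes", "bladder dysfunction"]]

df = Counter(c for doc in docs_concepts for c in set(doc))   # document frequency per concept
n_docs = len(docs_concepts)

def cf_idf(doc):
    cf = Counter(doc)
    return {c: cf[c] * math.log(n_docs / df[c]) for c in cf}

print(cf_idf(docs_concepts[1]))   # 'smoking' has high CF but lower IDF; 'lung mass' is rare
```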
Text Mining Applications
Vivisimo
GATE
Textmap
SAS Text Miner
Alibaba
Intellexer