Density-based method


Data Mining
Han-joon Kim
University of Seoul
Data Mining Lab., Univ. of Seoul, Copyright © 2008
Data Mining

Knowledge Discovery in Large Databases

- The task of discovering previously unknown, implicit, and potentially useful information from large volumes of data
Data Mining Applications

Advertising
Bioinformatics
Customer Relationship Management (CRM)
Database Marketing
Fraud Detection
eCommerce
Health Care
Investment/Securities
Manufacturing, Process Control
Sports and Entertainment
Telecommunications
Web
Data Mining - Discovering Purchase Patterns

[Figure: purchase patterns discovered from transaction data, feeding product recommendation]
Data Mining - Document Patterns
[Figure: manual vs. automatic document categorization, illustrated with part of Yahoo!'s "Entertainment" directory - categories such as Comics & Animation and Movies & Film, with subtopics like Cartoons, Comic Strips, Anime, Animated GIFs, Film Festivals, Film Making, Reviews, Magazines, Conventions, Screenwriting, Short Films]
Data Mining - Understanding Data Characteristics
Data Mining Methodologies

1. Classification
2. Clustering
   - Outlier Detection
3. Association Rule Mining
   - Sequential Pattern Mining
Patterns by Data Mining Methodology

Patterns fall into two kinds: description models and prediction models.

- Classification: IF "age < 25" and "salary > 40,000" THEN prefers "sports cars"
- Clustering: Group A: age = 30s and job = 'IT' and address = 'Seoul'
- Association rules: 98% of people who purchase diapers also buy beer
- Sequential patterns: <Star Wars> <Empire Strikes Back> <Return of the Jedi>
Data Mining Methodologies

Classification (Categorization)

[Figure: training data (classes: profitable / common / least loyal) → Learning → Model; the model then classifies unknown data]
Classification
Machine Learning

[Figure: past data (training data labeled profitable / common / least loyal) is learned into a pattern/model ("intelligence"), which then predicts the future for unknown data]
Classification
Machine Learning

- Classification
  - Decision tree induction
  - Neural network
- Value Prediction
  - Linear regression
  - Nonlinear regression
Machine Learning Algorithms

- Decision Tree
- Neural Networks
- Bayesian Statistics
  - Naïve Bayes
  - Bayesian Networks
- Instance-based Learning
  - K-Nearest Neighbor
- Support Vector Machine
- Rough Set Theory
- Ensemble (Committee)
  - Bagging, Boosting
Classification
Decision Tree

Credit Analysis - training data:

    salary   education      label
    10000    high school    reject
    40000    undergraduate  accept
    15000    undergraduate  reject
    75000    graduate       accept
    18000    graduate       accept

Learned tree:

    salary < 20000?
    - no  -> accept
    - yes -> education in graduate?
             - yes -> accept
             - no  -> reject
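As a sketch, the learned tree above can be written directly as a small function (illustrative only; this is not code from the slides):

```python
# A minimal sketch of the credit-analysis decision tree above
# (illustrative; not code from the slides).

def classify(salary, education):
    """Apply the learned credit-analysis decision tree."""
    if salary < 20000:
        # Low-salary applicants are accepted only with a graduate education.
        return "accept" if education == "graduate" else "reject"
    return "accept"  # salary >= 20000 is always accepted in the training data

training = [
    (10000, "high school", "reject"),
    (40000, "undergraduate", "accept"),
    (15000, "undergraduate", "reject"),
    (75000, "graduate", "accept"),
    (18000, "graduate", "accept"),
]
# The tree reproduces every training label.
assert all(classify(s, e) == y for s, e, y in training)
```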
Classification
Neural Networks

[Figure: a neural network being trained on the labeled data]
Classification
k-Nearest Neighbors

[Figure: training data labeled + and -; each unknown point "?" is labeled by voting among its k nearest neighbors]
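A minimal k-NN sketch of the voting idea above (the point coordinates are made up, not from the slides):

```python
# k-NN with majority voting over the k nearest training points.
from collections import Counter
import math

def knn_classify(query, training, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    neighbors = sorted(training, key=lambda p: math.dist(query, p[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

training = [((0, 0), "-"), ((1, 0), "-"), ((5, 5), "+"), ((6, 5), "+"), ((5, 6), "+")]
print(knn_classify((5, 4), training))  # "+" - the three nearest points are all "+"
```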
Classification
Bayesian Statistics

- Given training data D, the posterior probability of a hypothesis h, P(h|D), follows Bayes' theorem:

  $P(h \mid D) = \frac{P(D \mid h)\,P(h)}{P(D)}$

- h_MAP: the maximally probable hypothesis (MAP: maximum a posteriori):

  $h_{MAP} = \arg\max_{h \in H} P(h \mid D) = \arg\max_{h \in H} P(D \mid h)\,P(h)$
Classification
Bayesian Statistics

$\mathrm{classifier}(d) = \arg\max_{c} \Pr(c \mid d) = \arg\max_{c} \Pr(c) \prod_{w \in d} \Pr(w \mid c)$

(Pr(c) and Pr(w|c) are estimated probability distributions.)
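A minimal naive Bayes sketch of this argmax, assuming a toy corpus and add-one (Laplace) smoothing - neither is from the slides:

```python
# Naive Bayes text classifier: argmax_c  log Pr(c) + sum_{w in d} log Pr(w|c).
import math
from collections import Counter, defaultdict

docs = [("buy cheap pills", "spam"), ("cheap pills now", "spam"),
        ("meeting at noon", "ham"), ("lunch meeting today", "ham")]

prior = Counter(c for _, c in docs)
word_counts = defaultdict(Counter)
for text, c in docs:
    word_counts[c].update(text.split())
vocab = {w for text, _ in docs for w in text.split()}

def classify(d):
    def score(c):
        total = sum(word_counts[c].values())
        # Add-one smoothing so unseen words do not zero out the product.
        return (math.log(prior[c] / len(docs))
                + sum(math.log((word_counts[c][w] + 1) / (total + len(vocab)))
                      for w in d.split()))
    return max(prior, key=score)

print(classify("cheap pills"))  # spam
```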
Classification
Support Vector Machine

- The hyperplane that separates positive and negative training data is

  $\langle \mathbf{w} \cdot \mathbf{x} \rangle + b = 0$

(Source: CS583, Bing Liu, UIC)
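A tiny sketch of classifying with such a hyperplane; the weight vector and bias below are made-up values, not a trained SVM:

```python
# Classify by which side of the hyperplane w·x + b = 0 a point falls on.
def svm_predict(w, b, x):
    """Positive side of the hyperplane -> +1, negative side -> -1."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1

w, b = [2.0, -1.0], -1.0              # hypothetical learned parameters
print(svm_predict(w, b, [1.5, 0.5]))  # +1: 2*1.5 - 0.5 - 1 = 1.5 >= 0
```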
Applications of Classification

- Document classification
  - Automatic classification of web documents in a web-directory-based search engine
Applications of Classification

- CRM (Customer Relationship Management)
  - Customer segmentation
    - Premium customers
    - Churning customers
    - Regular customers
    - Delinquent customers
  - Differentiated services for each customer segment
    - Direct-mail campaigns
    - Targeted (differentiated) marketing
    - Churn prevention
Applications of Classification

Digital camera review analysis report
- Price range: 450,000 - 545,000 won
- "Easy to operate; comfortable grip; battery life somewhat longer than Canon's; ..."
- Review analysis by attribute (via classification):
  - Ease of use
  - Picture quality
  - Battery life
Data Mining Methodologies

1. Classification
   - Meta-learning
2. Clustering
   - Outlier Detection
3. Association Rule Mining
   - Sequential Pattern Mining
Classification
Meta-learning

- Step 1 - generate multiple datasets: split the initial training data D into t training sets D1, D2, ..., Dt-1, Dt according to an appropriate principle.
- Step 2 - generate multiple prediction models: learn a prediction model C1, C2, ..., Ct-1, Ct from each training set with a machine-learning algorithm.
- Step 3 - complete the final prediction model: combine the generated models into one final model C (related techniques: Bagging and Boosting meta-learning algorithms).
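A minimal bagging sketch of these three steps, assuming a 1-nearest-neighbor base learner and toy data (both my assumptions, not from the slides):

```python
# Bagging: bootstrap samples -> one model each -> majority-vote ensemble.
import math, random
from collections import Counter

def learn_1nn(sample):
    """Base learner: a 1-NN model over one bootstrap sample."""
    return lambda x: min(sample, key=lambda p: math.dist(x, p[0]))[1]

def bagging_predict(train, x, t=5, seed=0):
    rng = random.Random(seed)
    # Step 1: t bootstrap samples D1..Dt;  Step 2: one model Ci per sample.
    models = [learn_1nn([rng.choice(train) for _ in train]) for _ in range(t)]
    # Step 3: combine the models by majority vote into the final model C.
    return Counter(m(x) for m in models).most_common(1)[0][0]

train = [((0, 0), "reject"), ((1, 1), "reject"), ((5, 5), "accept"), ((6, 6), "accept")]
print(bagging_predict(train, (5, 4)))  # accept
```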
Classification
Meta-learning

- Semi-supervised learning
  - Supervised learning + clustering
  - Uses clustering techniques to generate training data
Data Mining Methodologies

1. Classification
2. Clustering
   - Outlier Detection
3. Association Rule Mining
   - Sequential Pattern Mining
Clustering

- Customer data = represented by demographic information, purchase information, etc.

[Figure: customers grouped into clusters such as "teenagers having a computer" and "young urban career women"]
Uses of Clustering

- Summarization of large data
  - Understand large customer data
- Data organization
  - Manage large customer data
- Outlier detection
  - Find unusual customer data
- A preliminary step for Classification / Association Rule Mining
Uses of Clustering

- A preliminary step for Classification / Association Rule Mining
  - Derive classes from meaningful clusters
  - Run Association Rule Mining on the data inside each cluster
Uses of Clustering

- Information Retrieval
  - Obtaining search results: documents in the same cluster are treated as "relevant" documents
  - Re-organizing search results
  - Discovering new topics (categories)
Uses of Clustering

- Clustering of search results
Uses of Clustering

- Clusty.com
  - Vivisimo Inc.
Uses of Clustering

- Discovering new topics

[Figure: documents under "Movies & Film" change over time (concept drift); clustering them uncovers new categories such as "Film Festivals" and "Screen Plays" alongside "Plays"]
Clustering Algorithms

- Partitioning methods
- Hierarchical methods
- Density-based methods
- Grid-based methods
- Model-based methods
Clustering
Partitioning methods

[Figure: data partitioned into clusters I and II, each represented by its centroid]
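A minimal k-means sketch of this centroid-based partitioning, with made-up points and initial centroids (not from the slides):

```python
# k-means: alternate nearest-centroid assignment and centroid update.
import math

def kmeans(points, centroids, iters=10):
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for p in points:
            i = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[i].append(p)
        # Recompute each centroid as the mean of its cluster (keep it if empty).
        centroids = [tuple(sum(x) / len(c) for x in zip(*c)) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, clusters = kmeans(points, centroids=[(0, 0), (10, 10)])
print(centroids)  # roughly [(1.33, 1.33), (8.33, 8.33)]
```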
Clustering Algorithms

- Partitioning methods
- Hierarchical methods
- Density-based methods
- Grid-based methods
- Model-based methods
Clustering
Hierarchical methods

[Figure: agglomerative clustering of objects a, b, c, d, e - step 0: {a} {b} {c} {d} {e}; step 1: merge into {a,b}; step 2: {d,e}; step 3: {c,d,e}; step 4: {a,b,c,d,e}. Divisive clustering runs the same steps in reverse, from step 4 down to step 0.]
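A minimal sketch reproducing that merge sequence with SciPy's agglomerative linkage (SciPy and the point coordinates are my assumptions; the slides only show the diagram):

```python
# Single-linkage agglomerative clustering of five points a..e.
from scipy.cluster.hierarchy import linkage

points = [(0, 0), (0, 1), (4, 0), (7, 0), (7, 1)]  # a, b, c, d, e
# Single-linkage merges the closest pair first: (a,b), then (d,e),
# then c joins (d,e), and finally everything merges into one cluster.
Z = linkage(points, method="single")
print(Z)  # each row: the two clusters merged, their distance, the new size
```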
Clustering Algorithms

- Partitioning methods
- Hierarchical methods
- Density-based methods
- Grid-based methods
- Model-based methods
Clustering
Density-based methods

- Clusters: density-connected sets
- DBSCAN algorithm
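A minimal sketch of DBSCAN's core idea - clusters grow from "core" points that have at least min_pts neighbors within eps. The parameter values and points are made up; the slides name the algorithm but give no code:

```python
# Simplified DBSCAN: expand clusters from core points; sparse points are noise.
import math

def region_query(points, p, eps):
    return [q for q in points if math.dist(p, q) <= eps]

def dbscan(points, eps=1.5, min_pts=3):
    labels, cluster_id = {}, 0
    for p in points:
        if p in labels:
            continue
        neighbors = region_query(points, p, eps)
        if len(neighbors) < min_pts:
            labels[p] = -1                      # noise (may be relabeled later)
            continue
        cluster_id += 1
        labels[p] = cluster_id
        seeds = list(neighbors)
        while seeds:                            # expand the density-connected set
            q = seeds.pop()
            if labels.get(q, -1) <= 0:          # unvisited or currently noise
                labels[q] = cluster_id
                more = region_query(points, q, eps)
                if len(more) >= min_pts:        # q is also a core point
                    seeds.extend(m for m in more if m not in labels)
    return labels

pts = [(0, 0), (0, 1), (1, 0), (1, 1), (8, 8), (8, 9), (9, 8), (20, 20)]
print(dbscan(pts))  # two clusters; (20, 20) stays labeled -1 as noise
```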
Clustering
Density-based methods

- Based on a set of density distribution functions

[Figure: a density function over the data space]
Clustering Algorithms

- Partitioning methods
  - K-means, K-medoids, ...
- Hierarchical methods
  - Agglomerative and divisive hierarchical clustering
  - Complete-linkage, single-linkage, group-average, Ward's method
  - BIRCH, CURE, ...
- Density-based methods
  - DBSCAN, OPTICS, ...
- Grid-based methods
  - STING, WaveCluster, CLIQUE, ...
- Model-based methods
  - Statistical approach, ...
Clustering
Distance Measures

- Distance = dissimilarity
- Manhattan distance
- Euclidean distance
- Minkowski metric
- Mahalanobis distance
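A minimal sketch of the first three measures (Mahalanobis is omitted since it needs a covariance matrix); the sample points are made up:

```python
# Minkowski metric; p = 1 gives Manhattan distance, p = 2 gives Euclidean.
def minkowski(x, y, p):
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

x, y = (1, 2), (4, 6)
print(minkowski(x, y, 1))  # Manhattan: |1-4| + |2-6| = 7
print(minkowski(x, y, 2))  # Euclidean: sqrt(9 + 16) = 5.0
```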
Data Mining Methodologies

1. Classification
2. Clustering
   - Outlier Detection
3. Association Rule Mining
   - Sequential Pattern Mining
Association (Rule) Mining

- Basket Analysis
Association Rules

Example:

    Transaction Id   Purchased Items
    1                {1, 2, 3}
    2                {1, 4}
    3                {1, 3}
    4                {2, 5, 6}

Association Rules:
- 1 => 3 with 50% support and 66% confidence
- 3 => 1 with 50% support and 100% confidence
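A minimal sketch that recomputes these support and confidence values from the four example transactions:

```python
# Support = fraction of transactions containing the itemset;
# confidence(lhs => rhs) = support(lhs ∪ rhs) / support(lhs).
transactions = [{1, 2, 3}, {1, 4}, {1, 3}, {2, 5, 6}]

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    return support(lhs | rhs) / support(lhs)

print(support({1, 3}))        # 0.5      -> 50% support for both rules
print(confidence({1}, {3}))   # 0.666... -> 66% confidence for 1 => 3
print(confidence({3}, {1}))   # 1.0      -> 100% confidence for 3 => 1
```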
Extensions of Association Mining

- How do association rules differ from (predictive) models in machine learning?
Data Mining: Confluence of Multiple Disciplines

[Figure: Data Mining at the intersection of Database Technology, Machine Learning, Statistics, Information Science, Visualization, and other disciplines]

Data Mining and Text Mining
Text Mining

- Differences from data mining:
  - Analyzes both raw data and textual information at the same time
  - Requires complicated FEATURE SELECTION technologies
  - May include linguistic, lexical, and contextual techniques
Text Mining

- Feature Selection
  - Stopword removal
  - Zipf's Law
  - DF (document frequency)-based
  - χ² statistics-based
  - Mutual Information
  - Term Strength
  - etc.
Text Mining - Feature Selection

- Zipf's Law

[Figure: rank-frequency plot; a word's frequency is roughly inversely proportional to its rank]
Text Mining - Feature Selection

- χ² statistics-based

Contingency table for term t and category c:

         t    ¬t
    c    p    q
    ¬c   r    s

$\chi^2(c, t) = \frac{n \cdot (ps - qr)^2}{(p+r)(p+q)(q+s)(r+s)}$

where n = p + q + r + s.
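A minimal sketch of this score, with p, q, r, s as the contingency-table cell counts above (the example counts are made up):

```python
# Chi-square feature score for a term/category pair.
def chi_square(p, q, r, s):
    n = p + q + r + s
    return n * (p * s - q * r) ** 2 / ((p + r) * (p + q) * (q + s) * (r + s))

# Hypothetical counts: a term concentrated in category c scores high,
# while an evenly spread (independent) term scores zero.
print(chi_square(40, 10, 5, 45))   # strongly category-correlated term
print(chi_square(25, 25, 25, 25))  # independent term -> 0.0
```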
Text Clustering

- Clustering of search results
Association Rules (Sequential Patterns)

Example news snippet (published August 31, 2006): "Lenovo Group has hired a former Dell executive as senior vice president of its global supply chain, the sixth Dell executive it has tapped in the past month."

Extracted: (ORG:Lenovo Group) [Verb:hired] (ORG:Dell) (POS:former executive) (POS:senior vice president)

    Verb      Pattern
    HIRE      PER|ORG - V - PER-POS
    RELEASE   ORG - V - PRD
    MERGE     ORG - V - ORG
Opinion Mining

Introduction - Opinion Mining

- Opinion: user-generated media
- Opinions on the Web:
  - contain personal experiences and opinions (sentiments) on almost anything
  - appear at review sites, forums, discussion groups, blogs, ...
  - "subjective" opinions on products, people, organizations, topics, etc.
- Opinion Mining
  - mines the opinions expressed in user-generated content to produce summarized information

[Figure: summary report]
Examples of Product Reviews
http://amazon.com
Introduction - Opinion Mining

- Motivation
  - Two main types of information on the Web: "facts" and "opinions"
  - Current search engines search for "facts" but do not search for opinions
  - The current search-ranking strategy is not appropriate for "opinion search"
Opinion Search

- Customers' (detailed) opinions on a digital camera
Introduction - Opinion Mining

Tasks:
- Review classification (semantic-orientation recognition)
  - e.g., Is a review positive or negative toward the Nikon D40X?
- Product review mining
  - e.g., Which features of the Nikon D40X do customers like, and which do they dislike?
- Temporal tracking of sentiments
  - e.g., Is the sentiment toward product X heating up or cooling down?
Classification of Opinion Mining

- Mining scope
  - Document-level opinion mining
  - Sentence-level opinion mining
- Approaches
  - Sentiment classification
  - Feature-based opinion mining (summarization)
  - Comparative sentence and relation mining

Phrase (Word) Sentiment Recognition

- Semantic Orientation
- Point-wise Mutual Information (PMI)
  - A form of correlation measure
  - Positive when words co-occur and negative otherwise
  - Given two words, w1 and w2:

$PMI(w_1, w_2) = \log_2\!\left(\frac{\Pr(w_1 \,\&\, w_2)}{\Pr(w_1)\,\Pr(w_2)}\right)$

$SO(word) = \sum_{pword \in Pwords} PMI(word, pword) \;-\; \sum_{nword \in Nwords} PMI(word, nword)$

Pwords = {good, nice, excellent, positive, correct, superior, ...}
Nwords = {bad, nasty, poor, negative, wrong, inferior, ...}
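A minimal sketch of SO(word); real systems estimate these probabilities from search-engine hit counts or a large corpus, so the tiny corpus here is purely illustrative:

```python
# Semantic orientation via PMI against positive and negative seed words.
import math

corpus = ["excellent camera good pictures", "good grip nice screen",
          "poor battery bad menus", "blurry pictures bad lens"]
PWORDS = {"good", "nice", "excellent"}
NWORDS = {"bad", "nasty", "poor"}

def pr(*words):
    """Fraction of documents containing all the given words."""
    return sum(all(w in doc.split() for w in words) for doc in corpus) / len(corpus)

def pmi(w1, w2):
    joint = pr(w1, w2)
    # Treat never-co-occurring pairs as contributing 0 (a simplification).
    return math.log2(joint / (pr(w1) * pr(w2))) if joint else 0.0

def so(word):
    return sum(pmi(word, p) for p in PWORDS) - sum(pmi(word, n) for n in NWORDS)

print(so("pictures"))  # the sign shows the word's net sentiment leaning
```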
Sentiment Classification

- Uses text-classification techniques
  - Machine learning: Naïve Bayes, Support Vector Machine
- Build a binary opinion classifier
  - Requires a labeled database of opinions
- Preparing training data:
  1. Download ratings from Amazon.com, epinions.com, etc.
  2. Derive labels from the ratings: 1 and 2 stars -> negative; 3, 4, and 5 stars -> positive
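A minimal sketch of that star-rating labeling rule (the sample reviews are made up):

```python
# Convert star ratings into binary sentiment labels for training data.
def label(stars):
    """1-2 stars -> negative; 3-5 stars -> positive."""
    return "negative" if stars <= 2 else "positive"

reviews = [(5, "amazing pictures"), (1, "battery died fast"), (3, "decent camera")]
training_data = [(text, label(stars)) for stars, text in reviews]
print(training_data)
```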
Feature-based Opinion Mining

- Summarize opinions by feature

Example:

Feature 1: picture
- Positive: 12
  - "The pictures coming out of this camera are amazing."
  - "Overall this is a good camera with a really good picture clarity."
  - ...
- Negative: 2
  - "The pictures come out hazy if your hands shake even for a moment during the entire process of taking a picture."
  - "Focusing on a display rack about 20 feet away in a brightly lit room during day time, pictures produced by this camera were blurry and in a shade of orange."
Feature 2: battery life
- ...
Topic Discovery

[Figure: documents classified under "Entertainment" / "Movies & Film" are re-organized by clustering into new topics such as "Film Festivals", "Screen Plays", and "Plays"]
Clustering (Topic Discovery)

- Conventional clustering methods
  - Hierarchical clustering methods
  - Depend on the standard Euclidean distance metric
    - Not suitable for topic discovery
- Goal: generate user-specific clusters!

[Figure: documents d1-d10 plotted in a term space with axes wi, wj, wk]
Supervised Clustering (Topic Discovery)

[Figure: a document collection is clustered with human knowledge into clusters A', B', C', D', E', which become topics (categories)]
Concept (Topic, Category, Keyword) Relationships

- Concept Association
  - Keyword association - similarity: the number of documents in which the keywords (e.g., company names) co-occur
- Category Subsumption Relation
Category Subsumption Relation

[Figure: Category A, with terms t_i0, t_i1, ..., t_ip, is more general than Category B, with terms t_j0, t_j1, ..., t_jq; the relation is scored with a characteristic function χ_c(t_jq) summed over each category's terms]
Term Subsumption Relation

- If $D_{t_i} \supseteq D_{t_j}$, then $t_i$ is more general than $t_j$ (where $D_t$ is the set of documents containing term $t$).
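A minimal sketch of this subsumption test - t_i subsumes t_j when every document containing t_j also contains t_i (the toy documents are made up):

```python
# Term subsumption: D_ti must be a superset of D_tj.
docs = ["animation film festival", "film review", "film festival schedule"]

def doc_set(term):
    return {i for i, d in enumerate(docs) if term in d.split()}

def more_general(ti, tj):
    return doc_set(ti) >= doc_set(tj)   # D_ti is a superset of D_tj

print(more_general("film", "festival"))  # True: "film" subsumes "festival"
```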
Category Hierarchy

- Built using the category subsumption relation

[Figure: flat categories Ca, Cb, Cc, Cd, Ce re-arranged into a hierarchy according to subsumption]
Ontology Construction

- Ontology
  - Concepts + Relationships
- Ontology construction
  - Raw text documents => { concept, relationship }
- Solutions
  - Manual
  - Machine learning
  - Inference
  - Mixed
Anomaly Detection

Anomaly/Outlier Detection

- What are anomalies/outliers?
  - The set of data points that are considerably different from the remainder of the data
- Applications:
  - Credit card fraud detection, telecommunication fraud detection, network intrusion detection, fault detection
Anomaly Detection Schemes

- General steps
  - Build a profile of the "normal" behavior
    - The profile can be patterns or summary statistics for the overall population
  - Use the "normal" profile to detect anomalies
    - Anomalies are observations whose characteristics differ significantly from the normal profile
- Types of anomaly detection schemes
  - Graphical & statistical-based
  - Distance-based
  - Model-based
Distance-based Approaches

- Data is represented as a vector of features
- Three major approaches:
  - Nearest-neighbor based
  - Density based
  - Clustering based
Nearest-Neighbor Based Approach

- Approach: compute the distance between every pair of data points
- There are various ways to define outliers:
  - Data points for which there are fewer than p neighboring points within a distance D
  - The top n data points whose distance to the kth nearest neighbor is greatest
  - The top n data points whose average distance to the k nearest neighbors is greatest
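A minimal sketch of the second definition above - rank points by their distance to the kth nearest neighbor (the points, k, and n are made up):

```python
# kth-nearest-neighbor distance as an outlier score.
import math

def knn_outliers(points, k=2, n=1):
    def kth_nn_dist(p):
        dists = sorted(math.dist(p, q) for q in points if q != p)
        return dists[k - 1]
    return sorted(points, key=kth_nn_dist, reverse=True)[:n]

pts = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
print(knn_outliers(pts))  # [(10, 10)] - farthest from its 2nd nearest neighbor
```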
Clustering-Based Approach

- Basic idea:
  - Cluster the data into groups of different density
  - Choose points in small clusters as candidate outliers
  - Compute the distance between candidate points and non-candidate clusters
    - If candidate points are far from all other non-candidate points, they are outliers
Information Extraction

What is Information Extraction?

- Recovering structured data from formatted text
  - Identifying fields (e.g., named entity recognition)
  - Understanding relations between fields (e.g., record association)
  - Normalization and deduplication
Extracting Job Openings from the Web
Title: Ice Cream Guru
Description: If you dream of cold creamy…
Contact: [email protected]
Category: Travel/Hospitality
Function: Food Services
Using Machine Learning

- Training data: documents marked up with ground truth
- In contrast to text classification, local features are crucial
- Features of:
  - Contents
  - Text just before the item
  - Text just after the item
  - Begin/end boundaries

Example: "... 00 : pm Place : Wean Hall Rm 5409 Speaker : Sebastian Thrun ..."
- contents: "Sebastian Thrun"; prefix: the text just before (e.g., "Speaker :"); suffix: the text just after
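A minimal sketch of extracting prefix/contents/suffix features for one candidate span (the window size and whitespace tokenization are my assumptions):

```python
# Local window features around a candidate span, as in the example above.
def window_features(tokens, start, end, w=2):
    """Features of the span tokens[start:end] and a w-token window around it."""
    return {
        "contents": tokens[start:end],
        "prefix": tokens[max(0, start - w):start],
        "suffix": tokens[end:end + w],
    }

tokens = "00 : pm Place : Wean Hall Rm 5409 Speaker : Sebastian Thrun".split()
print(window_features(tokens, start=11, end=13))
# {'contents': ['Sebastian', 'Thrun'], 'prefix': ['Speaker', ':'], 'suffix': []}
```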
Related Applications

www.textmap.org
Summarization
Summarization Example: MS-Word

The patient lives alone. Her son lives in town. She has one son. She is a widow. She does not smoke, quit 20 years ago, and admits drinking alcohol socially. She is retired. She used to work as salesman in real estate department. She is able to perform activities of daily living, but it is getting harder. She does not have any hired help. The patient is an 86-year-old woman with a 100-pack-year history of smoking. She presents with a mass in the right middle lobe of the lung, which has been slowly enlarging since last year. This is an 82-year-old white female who came in to see Dr. James Kang and underwent lumbar laminectomy. She had fusion of L3 through L5 with revision of previous laminectomy effusion. This was done on 06/06/2002. She immediately noticed in the postoperatively period some swelling of her glands, no pain with this. The swelling persisted but this was gradually subsiding and I was asked to see her about this swelling. They also asked me to address her complaint of difficulty emptying her bladder. She is to go to rehabilitation today. Her past medical history is extensive and it involves multiple surgeries. She also has a history of type II diabetes, diabetic neuropathy, no known knowledge of diabetic nephropathy and denies knowledge of diabetic retinopathy. She does have gastric emptying dysfunction and diabetic neuropathy in the peripheral extremities. She has some bladder dysfunction with difficulty with retention of her urine, unable to initiate her urine predating her surgery.
Summarization Example: MEAD

(The same patient note as in the MS-Word example above, with MEAD selecting a different set of sentences as the summary.)
Summarization: Method 1

1. Sentence clustering
2. Summary layout
   1. Group the sentences by cluster
   2. Select the sentence closest to each of the centroids
      - Or select the sentence(s) that best subsume(s) the other sentences
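A minimal sketch of step 2 - picking, per cluster, the sentence closest to the cluster centroid (sentences as bag-of-words vectors; the vectors are toy data):

```python
# Select the sentence vector nearest to its cluster's centroid.
import math

def closest_to_centroid(vectors):
    dim = len(vectors[0])
    centroid = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    return min(vectors, key=lambda v: math.dist(v, centroid))

cluster = [(1, 0, 2), (1, 1, 2), (0, 0, 3)]
print(closest_to_centroid(cluster))  # (1, 0, 2) is nearest to the centroid
```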
Summarization: Method 2

- Clustering medical concepts: concept clustering
  - concept := <ID, type, keyphrase, value>
  - Measure the relatedness of concepts
  - Ignore the value component of the concept
- Extracting representative concepts
  - Necessary to define a degree of importance
  - One importance measure: the number of links
Summarization: Method 3

- Using a TF*IDF weighting scheme
  - Term frequency (TF) -> concept frequency (CF)
  - Concepts with high CF*IDF values may be significant
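A minimal sketch of the CF*IDF idea - term frequency replaced by concept frequency over concept-annotated documents (the concepts and documents are made up):

```python
# CF*IDF: concept frequency in a document, weighted by inverse document frequency.
import math

docs = [["diabetes", "neuropathy", "diabetes"],   # concepts per document
        ["smoking", "lung_mass"],
        ["diabetes", "bladder_dysfunction"]]

def cf_idf(concept, doc):
    cf = doc.count(concept)
    df = sum(concept in d for d in docs)
    return cf * math.log(len(docs) / df) if df else 0.0

print(cf_idf("diabetes", docs[0]))   # frequent but common -> moderate weight
print(cf_idf("lung_mass", docs[1]))  # rare concept -> higher IDF weight
```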
Text Mining Applications

- Vivisimo
- GATE
- Textmap
- SAS Text Miner
- Alibaba
- Intellexer