PowerPoint-Präsentation

Download Report

Transcript PowerPoint-Präsentation

technische universität
dortmund
Fakultät für Informatik
LS 8
The Challenge of
Heterogeneity
Prof. Dr. Katharina Morik
technische universität
dortmund
Faculty Computer Science
LS 8
Overview




Heterogeneity in Data
Distributed Data
Web 2.0
Heterogeneity of Users
 Structuring music collections
 Structuring tag collections
Prof. Dr. Katharina Morik | Heterogeneity | Hongkong 29.5.2008
2
technische universität
dortmund
Faculty Computer Science
LS 8
Heterogeneity in Data

Databases





Preparation for mining






Fixed set of attributes
Declared data types
Multi-relational
Very large number of records
Extract, Transform, Load
Select attributes
Declare label for learning
Handle missing values
Compose new attributes
Schema-mapping for re-use of DM
Prof. Dr. Katharina Morik | Heterogeneity | Hongkong 29.5.2008
MiningMart application to customer churn -Telecom Italia
3
technische universität
dortmund
Faculty Computer Science
LS 8
Heterogeneity in Data
 Time series data
 Measurements over time
 Business
 Medicine
 Production
 Hand writing
 Pictures
 Music




Prediction
Classification
Clustering
Signal to Symbol
Prof. Dr. Katharina Morik | Heterogeneity | Hongkong 29.5.2008
4
technische universität
dortmund
Faculty Computer Science
LS 8
Heterogeneity in Data
 Texts
 High dimensional vectors
 Sparse word vectors
 Texts of the same class
need not share a word!
 Syntactic, semantic
structures
 Classification
 Clustering
 Named Entity Recognition,
Information Extraction
Prof. Dr. Katharina Morik | Heterogeneity | Hongkong 29.5.2008
5
technische universität
dortmund
Faculty Computer Science
LS 8
Distributed Data
 Distributed databases of the
same schema
 Distributed databases of
different schemas
 Low-level, low capacity
sensors
 Peer-to-peer networks
Prof. Dr. Katharina Morik | Heterogeneity | Hongkong 29.5.2008
6
technische universität
dortmund
Faculty Computer Science
LS 8
Heterogeneity of Users
 The same label name does
not necessarily mean the
same concept.
 Different names may refer to
the same set of items.
 Users apply diverse aspects,
e.g., genre, time of day,
episodes (summer 99),...
 Users share some set of items
(possibly under different
names).
Prof. Dr. Katharina Morik | Heterogeneity | Hongkong 29.5.2008
favourites
blues
jazz pop
classic
alternative
modern
metal pop
hip hop
death metal true metal
classic guitar
classic
piano pop
jazz
hip hop
home
office
work
plane
7
technische universität
dortmund
Faculty Computer Science
LS 8
Web 2.0
 Organizing large data
collections
requires semantic annotations.
 Users annotate items with
arbitrary tags.
 No common ontology is
required (“folksonomies”).
 Users want to keep their tags,
but like to benefit from efforts
of others.
Prof. Dr. Katharina Morik | Heterogeneity | Hongkong 29.5.2008
8
technische universität
dortmund
Faculty Computer Science
LS 8
Structuring Music Collections
 A concept’s meaning is its
extension, e.g., some music.
 A concept’s meaning can be
expressed by a classifier.
 A concept hierarchy for each
aspect --> hierarchical
classification.
 Acquiring the hierarchy by
clustering under the
assumption that user-given
taggings are kept.
Prof. Dr. Katharina Morik | Heterogeneity | Hongkong 29.5.2008
pop
bad
rock
a
metal
d
blues
good
b
aggressive
e f
9
technische universität
dortmund
Faculty Computer Science
LS 8
Localized Alternative Cluster
Ensembles (ECML 2006)
 Acquiring hierarchical
clusterings from
 Own partial clusterings
 Clusterings of other peers




Preserve taggings of users
Produce several alternative
Exploit input clusterings
Consider locality instead of global
consensus
Prof. Dr. Katharina Morik | Heterogeneity | Hongkong 29.5.2008
favourites
jazz pop
classic
alternative
blues
metal pop
hip hop
modern
death metal true metal
classic guitar
classic
piano pop
jazz
hip hop
home
office
work
plane
10
technische universität
dortmund
Faculty Computer Science
LS 8
LACE Algorithm
Items are represented by Ids.
alternative
metal
a
death metal
c
a
b
true metal
b
c
11
f
d
e
g
pop
d
hip hop
f
12
Prof. Dr. Katharina Morik | Heterogeneity | Hongkong 29.5.2008
11
technische universität
dortmund
Faculty Computer Science
LS 8
LACE Algorithm
Best matching cluster node is
selected by f-measure.
alternative
metal
a
death metal
c
a
b
true metal
b
c
11
f
d
e
g
pop
d
hip hop
f
12
Prof. Dr. Katharina Morik | Heterogeneity | Hongkong 29.5.2008
12
technische universität
dortmund
Faculty Computer Science
LS 8
LACE Algorithm
Items that are sufficiently similar to
items in the best matching clustering
are deleted from the query set.
alternative
metal
a
death metal
f
b
d
e
alternative
true metal
c
11
g
metal
a
death metal
b
true metal
c
11
Prof. Dr. Katharina Morik | Heterogeneity | Hongkong 29.5.2008
pop
d
hip hop
f
12
13
technische universität
dortmund
Faculty Computer Science
LS 8
LACE Algorithm
A new query is posed containing
the remaining items. Only tags not
used yet are considered.
alternative
metal
a
death metal
f
b
d
e
alternative
true metal
c
11
g
metal
a
death metal
b
true metal
c
11
Prof. Dr. Katharina Morik | Heterogeneity | Hongkong 29.5.2008
pop
d
hip hop
f
12
14
technische universität
dortmund
Faculty Computer Science
LS 8
LACE Algorithm
The process continues until all items are
covered, no additional match is possible or a
maximal number of rounds is reached.
death metal
b
true metal
c
true metal
c
11
g
hip hop
pop
d
a
a
b
e
metal
metal
death metal
1
alternative
alternative
f
12
11
Prof. Dr. Katharina Morik | Heterogeneity | Hongkong 29.5.2008
pop
d
hip hop
f
12
15
technische universität
dortmund
Faculty Computer Science
LS 8
LACE Algorithm
Remaining items are added by
classification (kNN).
alternative
metal
a
death metal
1
b
true metal
c
11
alternative
metal
a
death metal
b
hip hop
pop
d
e
true metal
c
f
12’
11
Prof. Dr. Katharina Morik | Heterogeneity | Hongkong 29.5.2008
g
pop
d
hip hop
f
12
16
technische universität
dortmund
Faculty Computer Science
LS 8
LACE Algorithm
Process starts anew until no more
matches are possible or the maximal
number of results is reached.
alternative
metal
a
death metal
b
true metal
c
11
alternative
metalpop
hip hop
pop
d
hip hop
f
death metal true metal
1
Prof. Dr. Katharina Morik | Heterogeneity | Hongkong 29.5.2008
12
17
technische universität
dortmund
Faculty Computer Science
LS 8
LACE Algorithm
Process starts anew until no more
matches are possible or the maximal
number of results is reached.
alternative
metal
a
death metal
b
true metal
c
11
alternative
metalpop
death metal true metal
1
hip hop
home
pop
d
work
office
2
plane
3 … k
Prof. Dr. Katharina Morik | Heterogeneity | Hongkong 29.5.2008
hip hop
f
12
18
technische universität
dortmund
Faculty Computer Science
LS 8
LACE Algorithm
Ad hoc peer-to-peer network.
alternative
metal
a
death metal
b
true metal
c
11
alternative
metalpop
death metal true metal
1
hip hop
home
pop
d
work
office
2
plane
3 … k
Prof. Dr. Katharina Morik | Heterogeneity | Hongkong 29.5.2008
hip hop
f
12
19
technische universität
dortmund
Faculty Computer Science
LS 8
Structuring Music Collections
Challenge of music data:
 There is no perfect feature set for
all mining tasks.
 Learning feature extraction for a
classification task
Mierswa/Morik MLJ 2005
 Structuring music collections
Wurst/Morik/Mierswa ECML 2006
 User views are local models no global consensus wanted!
Mierswa/Morik/Wurst, In:
Masseglia, Poncelet, l and
Teisserie(editors), Successes and
New Directions in Data Mining,
2007
Prof. Dr. Katharina Morik | Heterogeneity | Hongkong 29.5.2008
20
technische universität
dortmund
Faculty Computer Science
LS 8
Structuring Tag Collections
 Users annotate resources
with arbitrary tags.
 Frequency of tags is shown
by the tag cloud.
 Tags structure the collection.
Prof. Dr. Katharina Morik | Heterogeneity | Hongkong 29.5.2008
21
technische universität
dortmund
Faculty Computer Science
LS 8
Navigation
 User may select a tag and
sees the resources.
 User may follow related
tags.
 Problem:
 No hierarchical
structure.
 Restricted navigation to
given tags.
 No navigation according
to subsets.
 Photography and art
cannot be found!
Prof. Dr. Katharina Morik | Heterogeneity | Hongkong 29.5.2008
22
technische universität
dortmund
Faculty Computer Science
LS 8
Given: Folksonomy
 A Folksonomy (U,T,R,Y), with





U Users
T tags
R Resources
Y U  T  R
a record (u,t,r)  Y means
that user u has annotated
resource r with tag t.
Prof. Dr. Katharina Morik | Heterogeneity | Hongkong 29.5.2008
23
technische universität
dortmund
Faculty Computer Science
LS 8
Wanted: Tagset clustering
 Hierarchical clustering of tags
for navigation,
 based on frequency:
how many users used tag t?
supp: P(T) --> 
suppU(T)=
|{uU| t T:  r R:
(u,t,r) Y}|
 Subset of the lattice of
frequent tag sets that
optimizes clustering criteria.
Prof. Dr. Katharina Morik | Heterogeneity | Hongkong 29.5.2008
24
technische universität
dortmund
Faculty Computer Science
LS 8
Starting Point: Termset Clustering
 Termset clustering: how many
resources support a term?
 Given frequent term sets form
a clustering with small overlap
and large coverage.
Beil, Ester, Xu (2002) Frequent TermBased Text Clustering, in KDD 2002
Fung, Wang, Ester (2003)
Hierarchical Document Clustering
Using Frequent Itemsets, in SDM
2003
 Heuristics for minimizing
overlap, maximizing coverage.
Prof. Dr. Katharina Morik | Heterogeneity | Hongkong 29.5.2008
{ } D1, ..., D16
...{sun}
{beach}
D1, D4, D5, D6,
D2, D9, D13
D8, D10, D11, D15
{sun,fun}
{fun, beach}
D1,D4,D6,D8
D10, D11, D13
...
D7, D14
D2, D9, D13
D8, D10, D11, D15
{sun,beach}
D2, D8, D9, D10
D11, D15
{sun, fun, beach}
D8, D10, D11, D15
25
technische universität
dortmund
Faculty Computer Science
LS 8
Heterogeneous Preferences
Child-count vs. completeness (left); coverage vs. overlap (right)
Prof. Dr. Katharina Morik | Heterogeneity | Hongkong 29.5.2008
26
technische universität
dortmund
Faculty Computer Science
LS 8
Multi-objective Optimization






Given frequent tag sets
Find all optimal clusterings
according to two
orthogonal criteria.
Orthogonal criteria can
only be determined
empirically.
Childcount: number of
successors of a cluster
Overlap: average overlap
of clusters at each level.
Completeness: how much
of the lattice is retained?
Prof. Dr. Katharina Morik | Heterogeneity | Hongkong 29.5.2008
+
+
+
+
+
+
+
+
+
+
+
+
27
technische universität
dortmund
Faculty Computer Science
LS 8
GA for Optimization
 NSGA II algorithm
Deb, Agrawal,Pratab,
Meyarivan (2000) in Procs.
Parallel Problem Solving from
Nature
 Delivers all Pareto-optimal
clusterings to a partial lattice
of frequent tag sets.
Initial
population
Fitness
Mutation
Output
Stop?
Selection
Crossover
Prof. Dr. Katharina Morik | Heterogeneity | Hongkong 29.5.2008
28
technische universität
dortmund
Faculty Computer Science
LS 8
Encoding Frequent Tag Sets
 Given the lattice of possibly
frequent tag sets,
 a Binary vector indicates the
inclusion of a tag set into the
clustering.
 A vector can be mutated by
flipping bits.
 Two vectors can be combined
to a new one by crossover.
Prof. Dr. Katharina Morik | Heterogeneity | Hongkong 29.5.2008
29
technische universität
dortmund
Faculty Computer Science
LS 8
Result: Points of Pareto-front



Childcount vs.
Completeness
Pareto-front for
different minimal
support
Instances
Prof. Dr. Katharina Morik | Heterogeneity | Hongkong 29.5.2008
30
technische universität
dortmund
Faculty Computer Science
LS 8
Application




Bibsonomy social bookmark system: Hotho, Jäschke, Schmitz, Stumme 2006
780 users, 59.000 resources, 25.000 tags
4000 frequent tag sets
Optimization according to Childcount vs. Completeness and
Overlap vs. Coverage
Prof. Dr. Katharina Morik | Heterogeneity | Hongkong 29.5.2008
31
technische universität
dortmund
Faculty Computer Science
LS 8
Multi-objective Tagset Clustering
 Multi-objective optimization
allows the user to select
among equally good
clusterings -->
heterogeneity of users is
respected
 High scalability, high
dimensionality
 Understandable labels (tags)
 Hierarchical structure for
navigation.
Prof. Dr. Katharina Morik | Heterogeneity | Hongkong 29.5.2008
32
technische universität
dortmund
Faculty Computer Science
LS 8
Challenges for Data Mining







High dimensional data
High throughput data
Distributed Data
P2P networks
Web 2.0
Diverse user preferences
Service for end-user systems,
e.g. mobile “phones”
Prof. Dr. Katharina Morik | Heterogeneity | Hongkong 29.5.2008
33