C. Faloutsos - Carnegie Mellon School of Computer Science

Download Report

Transcript C. Faloutsos - Carnegie Mellon School of Computer Science

CMU SCS
Thank you!
C. Faloutsos
CMU
CMU SCS
Large Graph Mining
C. Faloutsos
CMU
CMU SCS
Large Graph Mining
Data Mining for fun (and profit)
C. Faloutsos
CMU
CMU SCS
Outline
• Credit where credit is due
• Technical part – Data mining
– Can it be automated?
– Research challenges
• Non-technical part: `Listen’
– To the data
– To non-experts
KDD'10
C. Faloutsos
4
CMU SCS
Nominator
• Jian Pei
KDD'10
C. Faloutsos
5
CMU SCS
Endorsers
•
•
•
•
•
Charu C. Aggarwal (IBM Research)
Ricardo Baeza-Yates (Yahoo! Research)
Albert-Laszlo Barabasi (Northeastern University)
Denilson Barbosa (University of Alberta)
Yixin Chen (Washington University at St. Louis)
KDD'10
C. Faloutsos
6
CMU SCS
Endorsers, cont’d
•
•
•
•
•
William Cohen (Carnegie Mellon University)
Diane J. Cook (Washington State University)
Gautam Das (University of Texas at Arlington)
Inderjit S. Dhillon (University of Texas at Austin)
Chris H. Q. Ding (University of Texas at
Arlington)
KDD'10
C. Faloutsos
7
CMU SCS
Endorsers, cont’d
• Petros Drineas (Rensselaer Polytechnic Institute)
• Tina Eliassi-Rad (Lawrence Livermore National
Laboratory)
• Greg Ganger (Carnegie Mellon University)
• Minos Garofalakis (Technical University of Crete)
• James Garrett (Carnegie Mellon University)
KDD'10
C. Faloutsos
8
CMU SCS
Endorsers, cont’d
•
•
•
•
•
•
Dimitrios Gunopulos (University of Athens)
Xiaofei He (Zhejiang University)
Panagiotis G. Ipeirotis (New York University)
Eamonn Keogh (UCR)
Hiroyuki Kitagawa (University of Tsukuba)
Tamara Kolda (Sandia Nat. Labs)
KDD'10
C. Faloutsos
9
CMU SCS
Endorsers, cont’d
•
•
•
•
•
•
Flip Korn (AT&T Research)
Nick Koudas (University of Toronto)
Hans-Peter Kriegel
Ravi Kumar (Yahoo! Research)
Laks Lakshmanan (UBC)
Jure Leskovec (Stanford University)
KDD'10
C. Faloutsos
10
CMU SCS
Endorsers, cont’d
•
•
•
•
•
•
Nikos Mamoulis (Hong Kong University)
Heikki Manilla (Aalto University,
Dharmendra S. Modha (IBM Research)
Mario Nascimento (University of Alberta)
Jennifer Neville (Purdue University)
Beng Chin Ooi (National University of Singapore)
KDD'10
C. Faloutsos
11
CMU SCS
Endorsers, cont’d
• Dimitris Papadias (Hong Kong University of
Science and Technology)
• Spiros Papadimitriou (IBM Research)
• Jian Pei (Simon Fraser University)
• Foster Provost (New York University)
• Oliver Schulte (Simon Fraser University)
• Dennis Shasha (New York University)
• Srinivasan Parthasarathy (OSU)
KDD'10
C. Faloutsos
12
CMU SCS
Endorsers, cont’d
•
•
•
•
•
•
Jimeng Sun (IBM Research)
Dacheng Tao (Nanyang University of Technology)
Yufei Tao (The Chinese University of Hong Kong)
Evimaria Terzi (Boston University)
Alex Thomo (University of Victoria)
Andrew Tomkins (Google Research)
KDD'10
C. Faloutsos
13
CMU SCS
Endorsers, cont’d
•
•
•
•
Caetano Traina (University of Sao Paulo)
Vassilis Tsotras (University of California, Riverside)
Alex Tuzhilin (New York University)
Haixun Wang (Microsoft Research)
KDD'10
C. Faloutsos
14
CMU SCS
Endorsers, cont’d
• Wei Wang (University of North Carolina at
Chapel Hill)
• Philip S. Yu (University of Illinois, Chicago)
• Zhongfei Zhang (Binghamton University, State
University of New York)
KDD'10
C. Faloutsos
15
CMU SCS
KDD committee
• Ramasamy Uthurusamy, Chair
• Robert Grossman (University of Illinois at
Chicago)
• Jiawei Han (University of Illinois at
Urbana-Champaign)
• Tom Mitchell (Carnegie Mellon University)
• Gregory Piatetsky-Shapiro (KDnuggets)
KDD'10
C. Faloutsos
16
CMU SCS
KDD committee cnt’d
• Raghu Ramakrishnan (Yahoo! Research)
• Sunita Sarawagi (Indian Institute of
Technology, Bombay)
• Padhraic Smyth (University of California at
Irvine)
• Ramakrishnan Srikant (Google Research)
KDD'10
C. Faloutsos
17
CMU SCS
KDD committee cnt’d
• Xindong Wu (University of Vermont)
• Mohammed J. Zaki (Rensselaer Polytechnic
Institute)
KDD'10
C. Faloutsos
18
CMU SCS
Family
• Parents Nikos & Sophia
• Siblings Michalis*, Petros*, Maria
• Wife Christina#
(*)
: and co-authors
(#) : and research impact evaluator (‘grandpa’ test - see later…)
KDD'10
C. Faloutsos
19
CMU SCS
Academic ‘parents’
• Christodoulakis, Stavros (T.U.C.)
• Sevcik, Ken (U of T)
• Roussopoulos, Nick (UMD)
KDD'10
C. Faloutsos
20
CMU SCS
Academic ‘children’
•
•
•
•
•
•
KDD'10
King-Ip (David) Lin
Ibrahim Kamel
Flip Korn
Byoung-Kee Yi
Leejay Wu
Deepayan Chakrabarti
C. Faloutsos
21
CMU SCS
Academic ‘children’
•
•
•
•
•
KDD'10
Jia-Yu (Tim) Pan
Spiros Papadimitriou
Jimeng Sun
Jure Leskovec
Hanghang Tong
C. Faloutsos
22
CMU SCS
Academic ‘children’
•
•
•
•
•
•
•
KDD'10
Mary McGlohon
Fan Guo
Lei Li
Leman Akoglu
Dueng Horng (Polo) Chau
Aditya Prakash
U Kang
C. Faloutsos
23
CMU SCS
CMU colleagues
•
•
•
•
•
•
•
Tom Mitchell
Garth Gibson
Greg Ganger
M. (Satya) Satyanarayanan
Howard Wactlar
Jeannette Wing
++
KDD'10
C. Faloutsos
24
CMU SCS
Co-authors
• [dblp 7/2010:] All 300 of you
• Agma J. M. Traina (22)
• Caetano Traina Jr. (20)
• …
KDD'10
C. Faloutsos
25
CMU SCS
Funding agencies
• NSF (Maria Zemankova, Frank Olken, ++)
• DARPA, LLNL, PITA
• IBM, MS, HP, INTEL, Y!, Google,
Symantec, Sony, Fujitsu, …
KDD'10
C. Faloutsos
26
CMU SCS
Outline
• Credit where credit is due
• Technical part – Data mining
– Can it be automated?
– Research challenges
• Non-technical part: `Listen’
– To the data
– To non-experts
KDD'10
C. Faloutsos
27
CMU SCS
Data mining = compression & …
Christos Faloutsos, Vasileios Megalooikonomou: On data
mining,
compression,
and
Kolmogorov
complexity.
Data
Min.
KDD'10
C. Faloutsos
28
Knowl. Discov. 15(1): 3-20 (2007)
CMU SCS
Data mining = compression & …
Christos Faloutsos, Vasileios Megalooikonomou: On data
mining,
compression,
and
Kolmogorov
complexity.
Data
Min.
KDD'10
C. Faloutsos
29
Knowl. Discov. 15(1): 3-20 (2007)
CMU SCS
Data mining = compression & …
But: how can compression
• do forecasting?
• spot outliers?
• do classification?
KDD'10
C. Faloutsos
30
CMU SCS
Data mining = compression & …
OK – then, isn’t compression a solved problem (gzip, LZ)?
KDD'10
C. Faloutsos
31
CMU SCS
… compression is undecidable!
Theorem*: for an arbitrary string x, computing
its Kolmogorov complexity K(x) is
undecidable
EVEN WORSE
than NP-hard!
A.N. Kolmogorov
(*) E.g.,
[T.
M.
Cover
and
J.
A.
Thomas.
Elements
of
KDD'10
C. Faloutsos
32
Information Theory. John Wiley and Sons,1991, section 7.7]
CMU SCS
… compression is undecidable!
…which means there will always be better data
mining tools/models/patterns to be discovered
-> job security 
-> job satisfaction 
KDD'10
C. Faloutsos
33
CMU SCS
Let’s see some examples of
Response
models
to new drug
KDD'10
C. Faloutsos
Body weight
34
CMU SCS
Let’s see some examples of
models
$ spent
KDD'10
C. Faloutsos
income
35
CMU SCS
Let’s see some examples of
models
$ spent
KDD'10
C. Faloutsos
income
36
CMU SCS
Let’s see some examples of
models
$ spent
KDD'10
C. Faloutsos
income
37
CMU SCS
Let’s see some examples of
models
$ spent
3/4
KDD'10
C. Faloutsos
income
38
CMU SCS
Let’s see some examples of
Metabolic
models
rate
3/4
mass
KDD'10
C. Faloutsos
http://universe-review.ca /R10-35-metabolic.htm
39
CMU SCS
Metabolic
rate
Kleiberg’s law
3/4
mass
KDD'10
C. Faloutsos
http://universe-review.ca /R10-35-metabolic.htm
40
CMU SCS
Outline
• Credit where credit is due
• Technical part – Data mining
– Can it be automated? NO!
• Always room for better models
– Research challenges
• Non-technical part: `Listen’
– To the data
– To non-experts
KDD'10
C. Faloutsos
41
CMU SCS
Always room for better models
• Eg.: clustering – k-means (or our favorite
clustering algo)
• How many clusters are in the Sierpinski
triangle?
…
KDD'10
C. Faloutsos
42
CMU SCS
Always room for better models
KDD'10
C. Faloutsos
43
CMU SCS
Always room for better models
K=3 clusters?
KDD'10
C. Faloutsos
44
CMU SCS
Always room for better models
K=3 clusters?
K=9 clusters?
KDD'10
C. Faloutsos
45
CMU SCS
Always room for better models
Piece-wise
flat
Mixture
of (Gaussian)
clusters
KDD'10
C. Faloutsos
46
CMU SCS
Always room for better models
Piece-wise
flat
¾ Power
law
Mixture
of (Gaussian)
clusters
KDD'10
??
C. Faloutsos
47
CMU SCS
Always room for better models
Piece-wise
flat
¾ Power
law
Mixture
of (Gaussian)
clusters
KDD'10
ONE, but
Self-similar
‘cluster’
C. Faloutsos
48
CMU SCS
Always room for better models
• Barnsley’s method of IFS (iterated function
systems) can easily generate it
[Barnsley+Sloan, BYTE, 1988]
ONE, but
Self-similar
‘cluster’
~100 lines of C code: www.cs.cmu.edu~/christos/www/SRC/ifs.tar
KDD'10
C. Faloutsos
49
CMU SCS
Always room for better models
• But, does self-similarity appear in real life?
KDD'10
C. Faloutsos
50
CMU SCS
Real, self similar dataset
KDD'10
C. Faloutsos
51
CMU SCS
Real, self similar dataset
KDD'10
C. Faloutsos
52
CMU SCS
Real, self similar dataset
KDD'10
C. Faloutsos
53
CMU SCS
Real, self similar dataset
KDD'10
C. Faloutsos
54
CMU SCS
• the red is true
• origin: Norway
•but most other coastlines
are ‘self-similar’, too!
KDD'10
C. Faloutsos
55
CMU SCS
How can we find better models?
• Obviously, an art (‘undecidable’!)
• Helps if we
– Listen to domain experts and
– Listen to the data (next)
KDD'10
C. Faloutsos
56
CMU SCS
Outline
• Credit where credit is due
• Technical part – Data mining
– Can it be automated? NO!
– Research challenges
• Listen to the data (the more, the better!)
• Non-technical part: `Listen’
– To the data
– To non-experts
KDD'10
C. Faloutsos
57
CMU SCS
Scalability
• Google: > 450,000 processors in clusters of
~2000 processors each [Barroso, Dean, Hölzle,
“Web Search for a Planet: The Google Cluster
Architecture” IEEE Micro 2003]
• Yahoo: ~5Pb of data [Fayyad’07]
• ‘M45’: 4K proc’s, 3Tb RAM, 1.5 Pb disk
KDD'10
C. Faloutsos
58
CMU SCS
Promising research direction:
scalability
• challenges
– Vast amounts of data; storing; cooling (!); …
• … and opportunities:
– DATA: Easier to collect (clickstreams, sensors
etc)
– S/W: Hadoop, hbase, pig, … : open source
– H/W: 1Tb disk: ~ US$ 100
KDD'10
C. Faloutsos
59
CMU SCS
Promising research direction
• The more data, the more subtle patterns we
may discover
• Examples of subtle patterns:
KDD'10
C. Faloutsos
60
CMU SCS
More data, more subtle patterns
PDF: fraction
of customers
(log scale)
Duration (log scale)
Mukund Seshadri, Sridhar Machiraju, Ashwin Sridharan, Jean Bolot, Christos
Faloutsos,
Jure Leskovec: Mobile call
graphs: beyond power-law and lognormal
KDD'10
C. Faloutsos
61
distributions. KDD 2008: 596-604
CMU SCS
More data, more subtle patterns
PDF: fraction
of customers
(log scale)
(mixture of)
Gaussians
Duration (log scale)
Mukund Seshadri, Sridhar Machiraju, Ashwin Sridharan, Jean Bolot, Christos
Faloutsos,
Jure Leskovec: Mobile call
graphs: beyond power-law and lognormal
KDD'10
C. Faloutsos
62
distributions. KDD 2008: 596-604
CMU SCS
More data, more subtle patterns
Mukund Seshadri, Sridhar Machiraju, Ashwin Sridharan, Jean Bolot, Christos
Faloutsos,
Jure Leskovec: Mobile call
graphs: beyond power-law and lognormal
KDD'10
C. Faloutsos
63
distributions. KDD 2008: 596-604
CMU SCS
More data, more subtle patterns
Zipf
(Pareto,
Power-law)
Mukund Seshadri, Sridhar Machiraju, Ashwin Sridharan, Jean Bolot, Christos
Faloutsos,
Jure Leskovec: Mobile call
graphs: beyond power-law and lognormal
KDD'10
C. Faloutsos
64
distributions. KDD 2008: 596-604
CMU SCS
More data, more subtle patterns
Mukund Seshadri, Sridhar Machiraju, Ashwin Sridharan, Jean Bolot, Christos
Faloutsos,
Jure Leskovec: Mobile call
graphs: beyond power-law and lognormal
KDD'10
C. Faloutsos
65
distributions. KDD 2008: 596-604
CMU SCS
More data, more subtle patterns
lognormal
Mukund Seshadri, Sridhar Machiraju, Ashwin Sridharan, Jean Bolot, Christos
Faloutsos,
Jure Leskovec: Mobile call
graphs: beyond power-law and lognormal
KDD'10
C. Faloutsos
66
distributions. KDD 2008: 596-604
CMU SCS
More data, more subtle patterns
Mukund Seshadri, Sridhar Machiraju, Ashwin Sridharan, Jean Bolot, Christos
Faloutsos,
Jure Leskovec: Mobile call
graphs: beyond power-law and lognormal
KDD'10
C. Faloutsos
67
distributions. KDD 2008: 596-604
CMU SCS
More data, more subtle patterns
dPln
(=doubly
Pareto
Lognormal)
Mukund Seshadri, Sridhar Machiraju, Ashwin Sridharan, Jean Bolot, Christos
Faloutsos,
Jure Leskovec: Mobile call
graphs: beyond power-law and lognormal
KDD'10
C. Faloutsos
68
distributions. KDD 2008: 596-604
CMU SCS
So, dPln is the answer?
KDD'10
C. Faloutsos
69
CMU SCS
So, dPln is the answer?
Yes, for the moment…
KDD'10
C. Faloutsos
70
CMU SCS
So, dPln is the answer?
With more data, who knows?!
KDD'10
C. Faloutsos
71
CMU SCS
Outline
• Credit where credit is due
• Technical part – Data mining
– Can it be automated? NO!
– Research challenges
• Listen to the data (the more, the better!)
• Non-technical part: ‘Listen’
– To the data
– To non-experts
KDD'10
C. Faloutsos
72
CMU SCS
Listen to non-experts
• Explain ‘why’, to a non-expert (‘grandpa’)
• (and, even harder, explain ‘how’ – e.g.:
– Frobenious Perron for irreducible MC
KDD'10
C. Faloutsos
73
CMU SCS
Listen to non-experts
• Explain ‘why’, to a non-expert (‘grandpa’)
• (and, even harder, explain ‘how’ – e.g.:
– Frobenious Perron for irreducible MC ->
pageRank -> random surfer
KDD'10
C. Faloutsos
74
CMU SCS
Summary
• Data mining = compression = undecidable =
job security 
• Hence: always room for better
models/patterns
– Listen to the data (Gb, Tb and Pb of them!)
– Listen to domain experts (e.g., ¾ Kleiberg’s
law)
• Listen to non-experts (‘explain to grandpa’)
KDD'10
C. Faloutsos
75
CMU SCS
Compression, fun, recursion
• The shortest, recursive joke:
• There are 3 types of data miners
KDD'10
C. Faloutsos
76
CMU SCS
Compression, fun, recursion
• The shortest, recursive joke:
• There are 3 types of data miners
– Those who can count
KDD'10
C. Faloutsos
77
CMU SCS
Compression, fun, recursion
• The shortest, recursive joke:
• There are 3 types of data miners
– Those who can count
– And those who can not
KDD'10
C. Faloutsos
78
CMU SCS
Thank you!
For
the
honor,
and for making this wonderful
research community
KDD'10
C. Faloutsos
79