0512_ch0_DMG_CurrentResearch

Current Research in the Data Mining Research Group
Jiawei Han
Data Mining Research Group
Department of Computer Science
University of Illinois at Urbana-Champaign
Acknowledgements: NSF, ARL, NASA, AFOSR (MURI), DHS, Microsoft, IBM, Yahoo!, HP Lab & Boeing
April 8, 2015
Outline

An Introduction to Data Mining Research Group

Mining and OLAPing Information Networks

Mining Heterogeneous Information Networks

Mining Text-Rich Information Networks

OLAPing (Multi-dimensional analysis) of information
networks: TextCube, OLAP heterogeneous networks

Taming the Web: WINACS (Integrated mining of Web structures
and contents)

Mining Cyber-Physical Systems and Networks

Conclusions
Data Mining and Data Warehousing
Jiawei Han’s Group at CS, UIUC



Mining patterns and knowledge discovery from massive data
Data mining in heterogeneous information networks
Exploring broad applications of data mining

Developed many effective data mining algorithms, e.g., FP-growth,
PrefixSpan, gSpan, StarCubing, CrossMine, RankingCube, CrossClus,
RankClus, and NetClus

600+ research papers in conferences and journals

Fellow of ACM, Fellow of IEEE, ACM SIGKDD Innovation Award, W. Wallace
McDowell Award, Daniel C. Drucker Eminent Faculty Award

Textbook, "Data Mining: Concepts and Techniques," adopted
worldwide

Project lead for NASA EventCube for Aviation Safety [2008-2012]

Director of the Information Network Academic Research Center, funded
by the Army Research Lab (ARL) [2009-2014]
Data Mining Research Group at CS, UIUC
New Books on Data Mining & Link Mining

Han, Kamber and Pei, Data Mining, 3rd ed. 2011

Yu, Han and Faloutsos (eds.), Link Mining, 2010

Sun and Han, Mining Heterogeneous Information Networks, 2012
Mining Heterogeneous Information Networks
RankClus/NetClus
RankCompete: A Competing Random Walk Model for Rank-Based Clustering
[Figure: a network linking conferences (SIGMOD, VLDB, EDBT, KDD, ICDM, SDM, AAAI, ICML) to objects (Alice, Tom, Mary, Bob, Cindy, Tracy, Jack, Mike, Lucy, Jim), with a ranking of the conferences.]
RankClass [KDD11]
Knowledge Propagation in Heterogeneous Network

                     Database   Data Mining      AI          IR
Top-5 ranked         VLDB       KDD              IJCAI       SIGIR
conferences          SIGMOD     SDM              AAAI        ECIR
                     ICDE       ICDM             ICML        CIKM
                     PODS       PKDD             CVPR        WWW
                     EDBT       PAKDD            ECML        WSDM

Top-5 ranked         data       mining           learning    retrieval
terms                database   data             knowledge   information
                     query      clustering       reasoning   web
                     system     classification   logic       search
                     xml        frequent         cognition   text
Similarity Search and Role Discovery in Information Networks

PathSim [VLDB11]: Meta Path-Guided Similarity Search in Networks
Which images are most similar to me in Flickr?
[Figure: similarity search results on Flickr images under two meta paths, Path: ITI and Path: ITIGITI.]

Role Discovery in Information Networks [KDD'10]
[Figure: a "dirty" information network (imaginary) is used to automatically infer a cleaned/inferred adversarial network with the roles Chief, Cell Lead, and Insurgent.]

Advisee          Top Ranked Advisor      Time    Note
David M. Blei    1. Michael I. Jordan    01-03   PhD advisor, 2004
                 2. John D. Lafferty     05-06   Postdoc, 2006
Hong Cheng       1. Qiang Yang           02-03   MS advisor, 2003
                 2. Jiawei Han           04-08   PhD advisor, 2008
Sergey Brin      1. Rajeev Motwani       97-98   Unofficial advisor
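
The PathSim measure named on this slide can be illustrated with a small sketch. For a symmetric meta path, PathSim scores a pair (x, y) as twice the number of path instances from x to y, divided by the number of path instances from x to itself plus those from y to itself. The toy image-tag data and the helper names (`path_counts`, `pathsim`) are hypothetical and only serve to make the formula concrete for a path of the form ITI.

```python
from collections import defaultdict

def path_counts(adjacency_chain, start):
    """Count path instances from `start` along a chain of adjacency maps."""
    counts = {start: 1}
    for adjacency in adjacency_chain:
        following = defaultdict(int)
        for node, count in counts.items():
            for neighbor in adjacency.get(node, ()):
                following[neighbor] += count
        counts = dict(following)
    return counts

def pathsim(adjacency_chain, x, y):
    """PathSim for a symmetric meta path given as a chain of adjacency maps."""
    from_x = path_counts(adjacency_chain, x)
    from_y = path_counts(adjacency_chain, y)
    denominator = from_x.get(x, 0) + from_y.get(y, 0)
    return 2 * from_x.get(y, 0) / denominator if denominator else 0.0

# Hypothetical toy data: which tags each image carries.
image_tags = {
    "img1": ["sunset", "beach"],
    "img2": ["sunset", "beach"],
    "img3": ["beach", "city"],
}
# Invert the map to get tag -> images, the second hop of the ITI path.
tag_images = defaultdict(list)
for image, tags in image_tags.items():
    for tag in tags:
        tag_images[tag].append(image)

iti = [image_tags, dict(tag_images)]  # Image -> Tag -> Image
print(pathsim(iti, "img1", "img2"))
print(pathsim(iti, "img1", "img3"))
```

Longer symmetric paths such as ITIGITI would simply use a longer chain of adjacency maps; the scoring function stays the same.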
Meta-Paths & Their Prediction Power

List all the meta-paths in the bibliographic network up to length 4

Investigate their respective power for coauthor relationship
prediction
- Which meta-path has more prediction power?
- How to combine them to achieve the best quality of prediction?
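
Listing meta-paths up to a length bound can be sketched as a walk over the network schema. The schema below is an assumption: it uses author (A), paper (P), venue (V), and topic (T) node types, treats every relation as traversable in both directions, and does not distinguish cite from its inverse. The function name `enumerate_meta_paths` is hypothetical.

```python
# Assumed, simplified schema: which node types are adjacent to which.
SCHEMA = {
    "A": ["P"],                 # author - paper (write)
    "P": ["A", "P", "V", "T"],  # paper - author / paper (cite) / venue / topic
    "V": ["P"],                 # venue - paper (publish)
    "T": ["P"],                 # topic - paper (mention)
}

def enumerate_meta_paths(schema, start, end, max_len):
    """Return all type sequences from `start` to `end` with 1..max_len relations."""
    found = []
    stack = [(start,)]
    while stack:
        path = stack.pop()
        if len(path) > 1 and path[-1] == end:
            found.append("-".join(path))
        if len(path) - 1 < max_len:  # number of relations used so far
            for next_type in schema[path[-1]]:
                stack.append(path + (next_type,))
    return sorted(found, key=lambda p: (len(p), p))

author_paths = enumerate_meta_paths(SCHEMA, "A", "A", 4)
for meta_path in author_paths:
    print(meta_path)
```

Under these assumptions the enumeration yields six author-to-author meta-paths; a directed schema would split the cite-based ones further.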
Relationship Prediction in Heterogeneous Info Networks

Why Prediction of Co-Author Relationship in DBLP?

Traditional link prediction: homogeneous networks
- Co-author networks in DBLP, friendship networks in Facebook

Relationship prediction: prediction of relationships between different types of nodes
in heterogeneous networks
- E.g., what papers should Faloutsos write?

Study the roles of topological features in heterogeneous
networks in predicting the co-author relationship building

Meta-path guided prediction!

Y. Sun, et al., "Co-Author Relationship Prediction in Heterog.
Bibliographic Networks", ASONAM'11, July 2011
Guidance: Meta Path in Bibliographic Network

Relationship prediction: meta path-guided prediction

Meta path relationships among similar typed links share similar
semantics and are comparable and inferable

[Figure: network schema with node types venue, paper, topic, and author, and relation labels publish/publish-1, mention/mention-1, write/write-1, cite/cite-1, and contain/contain-1.]

Co-author prediction (A-P-A) using topological features also
encoded by meta paths, e.g., citation relations between
authors (A-P→P-A)
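
One way to turn a meta path into a topological feature is to count its path instances, which reduces to multiplying the adjacency matrices of the relations along the path. The sketch below is a minimal illustration with hypothetical toy matrices (`writes`, `cites`); it computes path counts for A-P-A and A-P→P-A and is not meant to reproduce the measures used in the cited paper.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def transpose(m):
    """Transpose a matrix given as a list of rows."""
    return [list(row) for row in zip(*m)]

# Hypothetical toy data: 3 authors (rows) x 3 papers (columns).
writes = [
    [1, 0, 0],  # a0 wrote p0
    [0, 1, 0],  # a1 wrote p1
    [0, 1, 1],  # a2 wrote p1 and p2
]
# Paper x paper citation matrix: p0 cites p1 and p2.
cites = [
    [0, 1, 1],
    [0, 0, 0],
    [0, 0, 0],
]

# A-P-A: number of papers two authors share.
apa = matmul(writes, transpose(writes))
# A-P->P-A: how often papers of the row author cite papers of the column author.
appa = matmul(matmul(writes, cites), transpose(writes))

print(apa)
print(appa)
```

Here `appa[0][2]` counts two instances because a0's paper cites two papers written by a2.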
Case Study in CS Bibliographic Network

The learned significance for each meta path under the measure
"normalized path count" for the HP-3hop dataset
Case Study: Predicting Concrete Co-Authors

High quality predictive power for such a difficult task

Using data in T0 = [1989; 1995] and T1 = [1996; 2002],
predict new coauthor relationships in T2 = [2003; 2009]
iTopicModel: Model Set-Up & Objective Function

Graphical model: θi = (θi1, θi2, …, θiT): topic distribution for document xi
Structural Layer: follows the same topology as the document network
Text Layer: follows PLSA, i.e., for each word, pick a topic z ~ multi(θi), then
pick a word w ~ multi(βz)

Objective function: joint probability, with a structure part and a text part
(can model them separately!)
X: observed text information
G: document network
Parameters
θ: topic distribution
β: word distribution
θ is the most critical: it needs to be consistent with the text as well
as the network structure
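
The text layer described above follows the PLSA generative step: for each word, draw a topic from the document's topic distribution, then draw a word from that topic's word distribution. A minimal sketch of just this sampling step follows; the vocabulary, the two topics, and the function name `generate_document` are hypothetical, and the structural layer is not modeled here.

```python
import random

def generate_document(theta_i, beta, vocab, n_words, rng):
    """Sample words PLSA-style: z ~ multi(theta_i), then w ~ multi(beta[z])."""
    topics = list(range(len(theta_i)))
    words = []
    for _ in range(n_words):
        z = rng.choices(topics, weights=theta_i, k=1)[0]  # pick a topic
        w = rng.choices(vocab, weights=beta[z], k=1)[0]   # pick a word from it
        words.append(w)
    return words

# Hypothetical vocabulary and per-topic word distributions (rows sum to 1).
vocab = ["data", "query", "learning", "logic"]
beta = [
    [0.5, 0.5, 0.0, 0.0],  # topic 0: database-like words
    [0.0, 0.0, 0.6, 0.4],  # topic 1: AI-like words
]

rng = random.Random(0)
print(generate_document([0.7, 0.3], beta, vocab, 10, rng))
```

Setting the topic distribution to `[1.0, 0.0]` restricts the output to words of topic 0, which is a quick sanity check of the two-stage sampling.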
Case Study: Topic Hierarchy Building for DBLP
Probabilistic Topic Models with Network-Based Biased Propagation

Text-rich heterogeneous information network
- Ubiquitous textual documents (news, papers)
- Connect with users and other objects: topic propagation

How to discover latent topics and identify clusters of multi-typed objects
simultaneously?

How can text data and heterogeneous information network mutually enhance
each other in topic modeling and other text mining tasks?

Deng, Han et al., "Probabilistic Topic Models with Biased
Propagation on Heterogeneous Information Networks", KDD'11
Biased Topic Propagation
Intuition:
InfoNet provides valuable information
Different objects have their own inherent
information (e.g., D with rich text and U without
explicit text)
Treat documents with rich text and other objects
without explicit text in different ways
Topic(D) ∝ inherent text + connected U
Topic(U) ∝ connected D
Basic Criterion: (Biased Topic Propagation)

The topic of an object without explicit text depends on the topics of the
documents it connects to
The topic of a document is correlated with its objects to some extent, and
should be principally determined by the inherent content of its text
A simple and unbiased topic propagation does not make much sense
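
The biased propagation criterion can be sketched as a simple iterative update. This is an illustrative simplification, not the model from the cited paper: it assumes each document already has a topic distribution estimated from its own text, and it introduces a hypothetical weight `xi` that controls how much a document's topic is pulled toward its connected objects. An object without text takes the average topic of its documents.

```python
def mean_dist(dists):
    """Average several topic distributions component-wise."""
    return [sum(column) / len(dists) for column in zip(*dists)]

def biased_propagation(text_topics, doc_users, xi=0.3, n_iter=5):
    """Propagate topics between documents (with text) and users (without)."""
    user_docs = {}
    for doc, users in doc_users.items():
        for user in users:
            user_docs.setdefault(user, []).append(doc)

    def update_users(doc_topics):
        # Topic(U) depends only on the documents the user connects to.
        return {u: mean_dist([doc_topics[d] for d in docs])
                for u, docs in user_docs.items()}

    doc_topics = {d: list(t) for d, t in text_topics.items()}
    for _ in range(n_iter):
        user_topics = update_users(doc_topics)
        updated = {}
        for doc, own in text_topics.items():
            users = doc_users.get(doc, [])
            if users:
                neighbors = mean_dist([user_topics[u] for u in users])
                # Topic(D) is mainly its own text, partly its connected users.
                updated[doc] = [(1 - xi) * o + xi * n
                                for o, n in zip(own, neighbors)]
            else:
                updated[doc] = list(own)
        doc_topics = updated
    return doc_topics, update_users(doc_topics)

# Hypothetical toy input: two documents, two users, two topics.
text_topics = {"d1": [0.9, 0.1], "d2": [0.2, 0.8]}
doc_users = {"d1": ["u1"], "d2": ["u1", "u2"]}
doc_topics, user_topics = biased_propagation(text_topics, doc_users)
print(doc_topics)
print(user_topics)
```

Keeping `xi` well below 0.5 reflects the slide's point that a document's topic should be principally determined by its own text.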
Incorporating Heterogeneous Info. Network
R(G): Biased propagation
L(C): Topic model
Experiments: DBLP & NSF Awards

Data Collection
- DBLP
- NSF-Awards
Metrics
- Accuracy (AC)
- Normalized mutual information (NMI)
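
Normalized mutual information compares a predicted clustering with ground-truth labels independently of how the clusters are named. The sketch below assumes the common normalization by the geometric mean of the two entropies; the slide does not say which normalization was used, so treat that choice and the function name `nmi` as assumptions.

```python
import math
from collections import Counter

def nmi(labels_true, labels_pred):
    """Mutual information divided by sqrt(H(true) * H(pred))."""
    n = len(labels_true)
    count_true = Counter(labels_true)
    count_pred = Counter(labels_pred)
    joint = Counter(zip(labels_true, labels_pred))

    mutual_info = sum(
        (c / n) * math.log((c * n) / (count_true[a] * count_pred[b]))
        for (a, b), c in joint.items()
    )
    h_true = -sum((c / n) * math.log(c / n) for c in count_true.values())
    h_pred = -sum((c / n) * math.log(c / n) for c in count_pred.values())
    if h_true == 0 or h_pred == 0:
        # Degenerate case: at least one side has a single cluster.
        return 1.0 if h_true == h_pred else 0.0
    return mutual_info / math.sqrt(h_true * h_pred)

print(nmi([0, 0, 1, 1], [1, 1, 0, 0]))  # same partition, renamed clusters
print(nmi([0, 0, 1, 1], [0, 1, 0, 1]))  # statistically independent partitions
```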
Results
Event Cube: An Overview
Funded by NASA (2008-2010)
[Figure: a multidimensional text database feeds the Event Cube representation, which gives the analyst analysis support: multidimensional OLAP, ranking, cause analysis, topic summarization/comparison, …… Cube axes shown: Topic (turbulence, birds, undershoot, overshoot; Encounter, Deviation), Location (LAX, SJC, MIA, AUS; roll-up to CA, FL, TX), and time periods (98.01, 98.02, 99.01, 99.02 at drill-down; 1998, 1999 at roll-up).]
Event Cube: An Organized Approach for Mining and Understanding Anomalous Aviation Events
Text/Topic Cube: General Idea

Heterogeneous: categorical attributes + unstructured text
Categorical attributes: ACN, Time, Location, Place, Environment, ……
Text data: Event Report

How to combine?
Our solution:
Cube: Categorical Attributes
Text/Topic Model: Unstructured Text
Measure:

Term/Topic   Weight
T1           W1
T2           W2
T3           W3
…            …
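
The idea of a cube over categorical attributes whose measure is a term/weight list can be sketched as follows. Everything here is a hypothetical illustration: the toy reports, the use of raw term counts as the weight, and the helper `build_cells` are assumptions rather than the system's actual measure. Rolling up from (airport, period) to (state, year) merges the term counts of the finer cells.

```python
from collections import Counter, defaultdict

# Roll-up hierarchy for the Location dimension (airport -> state).
AIRPORT_STATE = {"LAX": "CA", "SJC": "CA", "MIA": "FL", "AUS": "TX"}

# Hypothetical reports: (airport, period, report text).
reports = [
    ("LAX", "98.01", "turbulence during descent"),
    ("SJC", "98.02", "birds and turbulence near runway"),
    ("MIA", "99.01", "overshoot on landing"),
    ("AUS", "99.02", "birds near runway"),
]

def build_cells(reports, cell_key):
    """Aggregate term counts per cube cell; `cell_key` picks the granularity."""
    cells = defaultdict(Counter)
    for airport, period, text in reports:
        cells[cell_key(airport, period)].update(text.lower().split())
    return cells

# Base cuboid: one cell per (airport, period).
base = build_cells(reports, lambda airport, period: (airport, period))
# Rolled-up cuboid: one cell per (state, year), e.g. "98.01" -> "1998".
rolled = build_cells(
    reports,
    lambda airport, period: (AIRPORT_STATE[airport], "19" + period.split(".")[0]),
)

print(base[("LAX", "98.01")].most_common(3))
print(rolled[("CA", "1998")].most_common(3))
```

Because the measure is additive in this sketch, a rolled-up cell could equally be computed by summing the counters of its child cells instead of rescanning the reports.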
Effective Keyword Search

TopCells (ICDE'10): Ranking aggregated cells (objects) in
TextCube.
[Figure: example query "Healthcare Reform" …]
Effective OLAP Exploration

TEXplorer (submitted): Integrating keyword-based ranking
and OLAP exploration
[Figure: example query "Healthcare Reform"]
Effective Event Tracking

PET (KDD'10): tracking popularity and textual representation
of events in social communities (Twitter)
[Figure: example event "Healthcare Reform" with three sets of representative terms: (debate, cost, senate, …), (pass, success, law, …), (benefit, profit, effective, …)]
Growing Parallel Paths
(WWW 2011)
[Figure: DOM tag paths (HTML, DIV, UL, LI, P, TABLE, TR, TD) from Pages A-F through their anchors to linked pages U-Z; the result lists the grown parallel paths, numbered 1-6.]
Mapping Pages to Records (CIKM'10)

Structured Data → Mappings → Web Pages

Name               Zipcode   URL
Tarek Abdelzaher
Sarita Adve                  rsim.cs.illinois.edu/~sadve/
Vikram Adve                  llvm.cs.uiuc.edu/~vadve/Home.html
Gul Agha
Eyal Amir
Dan Roth                     l2r.cs.uiuc.edu/~danr/
Jiawei Han                   www.cs.illinois.edu/homes/hanj/

Database records can be found on link paths!

/ (root) [cs.illinois.edu]
  People: /people
    Faculty: /people/faculty
      Vikram Adve: /people/faculty/vikramadve → Personal Site: llvm.cs.uiuc.edu/~vadve/Home.html
      Dan Roth: /people/faculty/dan-roth → Personal Site: l2r.cs.uiuc.edu/~danr/
      Jiawei Han: /people/faculty/jiawei-han → Personal Site: www.cs.illinois.edu/homes/hanj/
  Research: /research
    Data Mining: /research/areas/data (Dan Roth, Jiawei Han)
WinaCS: Web Information Network Analysis for Computer Science
Integration of Web structure mining and information network analysis
Tim Weninger, Marina Danilevsky, et al., "WinaCS: Construction and Analysis
of Web-Based Computer Science Information Networks", ACM SIGMOD'11
(system demo), Athens, Greece, June 2011.
Discovery of Swarms and Periodic Patterns in Moving Object Data

A system that mines moving object patterns: Z. Li, et al., "MoveMine: Mining Moving Object
Databases", SIGMOD'10 (system demo)

Z. Li, B. Ding, J. Han, and R. Kays, "Mining Hidden Periodic Behaviors for Moving Objects", KDD'10
(sub)
[Figure: bird flying paths shown on Google Earth (left); periodic patterns mined by our new method (right).]

Z. Li, B. Ding, J. Han, and R. Kays, "Swarm: Mining Relaxed Temporal Moving Object Clusters",
VLDB'10 (sub)
[Figure: Convoy discovers only restricted patterns (left); Swarm discovers more patterns (right).]
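
The contrast between swarms and convoys can be made concrete with a small check. The sketch assumes the usual definitions: a swarm is a set of at least `min_o` objects that share a cluster for at least `min_t` timestamps that need not be consecutive, while a convoy requires at least `k` consecutive timestamps. The per-timestamp clusters below are hypothetical toy data, and the functions only test a given object set rather than mining all patterns.

```python
def together_timestamps(clusters_by_time, objects):
    """Timestamps at which all `objects` fall into one common cluster."""
    wanted = set(objects)
    return [t for t, clusters in sorted(clusters_by_time.items())
            if any(wanted <= set(cluster) for cluster in clusters)]

def is_swarm(clusters_by_time, objects, min_o, min_t):
    """Relaxed temporal constraint: timestamps may be non-consecutive."""
    times = together_timestamps(clusters_by_time, objects)
    return len(set(objects)) >= min_o and len(times) >= min_t

def is_convoy(clusters_by_time, objects, min_o, k):
    """Strict temporal constraint: needs k consecutive timestamps."""
    times = together_timestamps(clusters_by_time, objects)
    longest = run = 0
    previous = None
    for t in times:
        run = run + 1 if previous is not None and t == previous + 1 else 1
        longest = max(longest, run)
        previous = t
    return len(set(objects)) >= min_o and longest >= k

# Hypothetical clusters per timestamp: a and b separate briefly at t=3.
clusters_by_time = {
    1: [["a", "b"], ["c"]],
    2: [["a", "b", "c"]],
    3: [["a"], ["b", "c"]],
    4: [["a", "b"], ["c"]],
    5: [["a", "b"], ["c"]],
}

print(is_swarm(clusters_by_time, ["a", "b"], min_o=2, min_t=4))
print(is_convoy(clusters_by_time, ["a", "b"], min_o=2, k=3))
```

The pair (a, b) travels together at four of five timestamps, so it qualifies as a swarm, but the gap at t=3 breaks every run of three consecutive timestamps and rules it out as a convoy.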
GeoTopic Discovery: Mining Spatial Text
Z. Yin, et al., GeoTopic Discovery and Comparison, WWW'11
Geo-tagged photos w. landscape (coast vs. desert vs. mountain)
[Figure: results for the methods LDM, TDM, GeoFolk, and LGTA.]
Conclusions: Towards Mining Data Semantics in Integrated
Heterog. Networks

Most data objects are linked, forming heterogeneous information
networks

Most datasets can be "organized" or "transformed" into
"structured" multi-typed heterogeneous info. networks

Examples: DBLP, IMDB, Flickr, Google News, Wikipedia, …

Structures can be progressively mined from less organized
data sets by info. network analysis

Surprisingly rich knowledge can be mined from such structured
heterogeneous info. networks

Clustering, ranking, classification, data cleaning, trust analysis,
role discovery, similarity search, relationship prediction, ……

It is promising to mine data semantics from rich info. networks!
References for the Talk

J. Han, Y. Sun, X. Yan, and P. S. Yu, "Mining Heterogeneous Information Networks"
(tutorial), KDD'10.
M. Ji, J. Han, and M. Danilevsky, "Ranking-Based Classification of
Heterogeneous Information Networks", KDD'11.
Y. Sun, J. Han, et al., "RankClus: Integrating Clustering with Ranking for Heterogeneous
Information Network Analysis", EDBT'09.
Y. Sun, Y. Yu, and J. Han, "Ranking-Based Clustering of Heterogeneous Information
Networks with Star Network Schema", KDD'09.
Y. Sun, J. Han, X. Yan, P. S. Yu, and T. Wu, "PathSim: Meta Path-Based Top-K Similarity
Search in Heterogeneous Information Networks", VLDB'11.
Y. Sun, R. Barber, M. Gupta, C. Aggarwal, and J. Han, "Co-Author Relationship Prediction
in Heterogeneous Bibliographic Networks", ASONAM'11.
C. Wang, J. Han, et al., "Mining Advisor-Advisee Relationships from Research
Publication Networks", KDD'10.
T. Weninger, M. Danilevsky, et al., "WinaCS: Construction and Analysis of Web-Based
Computer Science Information Networks", ACM SIGMOD'11 (system demo).
X. Yin, J. Han, and P. S. Yu, "Truth Discovery with Multiple Conflicting Information
Providers on the Web", IEEE TKDE, 20(6), 2008.