Mining Scientific Data Sets Using Graphs

Download Report

Transcript Mining Scientific Data Sets Using Graphs

Mining Scientific Data
Sets Using Graphs
George Karypis
Department of Computer Science &
Engineering
University of Minnesota
(Michihiro Kuramochi & Mukund Deshpande)
Outline
Mining Scientific Data-sets

Opportunities & Challenges
Using Graphs and Mining them
Pattern Discovery in Graphs
Putting Patterns to Good Use
Going Forward
NGDM-02
Data Mining In Scientific Domain
Data mining has emerged as a critical tool for
knowledge discovery in large data sets.

It has been extensively used to analyze business, financial, and
textual data sets.
The success of these techniques has renewed interest
in applying them to various scientific and engineering
fields.






Astronomy
Life sciences
Ecosystem modeling
Fluid dynamics
Structural mechanics
…
NGDM-02
Challenges in Scientific Data Mining
Most of existing data mining algorithms assume that the data is
represented via




Transactions (set of items)
Sequence of items or events
Multi-dimensional vectors
Time series
Scientific datasets with structures, layers, hierarchy, geometry, and
arbitrary relations can not be accurately modeled using such
frameworks.

e.g., Numerical simulations, 3D protein structures, chemical
compounds, etc.
Need algorithms that operate on scientific datasets
in their native representation
NGDM-02
How to Model Scientific Datasets?
There are two basic choices


Treat each dataset/application differently and develop custom
representations/algorithms.
Employ a new way of modeling such datasets and develop
algorithms that span across different applications!
What should be the properties of this general modeling
framework?


Abstract compared with the original raw data.
Yet powerful enough to capture the important characteristics.
Labeled directed/undirected
topological/geometric graphs and hypergraphs
NGDM-02
Modeling Data With Graphs…
Going Beyond Transactions
Data Instance
Graphs are suitable for
capturing arbitrary
relations between the
various elements.
Element
Element’s Attributes
Relation Between
Two Elements
Graph Instance
Vertex
Vertex Label
Edge
Type Of Relation
Edge Label
Relation between
a Set of Elements
Hyper Edge
Provide enormous flexibility for modeling the underlying data as they allow
the modeler to decide on what the elements should be and the type of
relations to be modeled
NGDM-02
Example: Protein 3D Structure
β
Backbone
Contact
β
β
α
β
β
α
β
β
PDB; 1MWP
N-Terminal Domain Of The Amyloid Precursor Protein
Alzheimer's disease amyloid A4 protein precursor
NGDM-02
Example: Fluid Dynamics
Vertices
Edges
 Vortices
 Proximity
NGDM-02
Graph Mining
Goal:
Develop algorithms to mine and analyze graph
data sets.



Finding patterns in these graphs
Finding groups of similar graphs (clustering)
Building predictive models for the graphs (classification)
Applications




Structural motif discovery
High-throughput screening
Protein fold recognition
VLSI reverse engineering
A lot more …
Beyond Scientific Applications






Semantic web
Mining relational profiles
Behavioral modeling
Intrusion detection
Citation analysis
…
NGDM-02
Finding Frequent Patterns in Graphs
A pattern is a relation between the object’s elements that
is recurring over and over again.



Common structures in a family of chemical compounds or
proteins.
Similar arrangements of vortices at different “instances” of
turbulent fluid flows.
…
There are two general ways to formally define a pattern in
the context of graphs


Arbitrary subgraphs (connected or not)
Induced subgraphs (connected or not)
Frequent pattern discovery translates to frequent
subgraph discovery…
NGDM-02
Finding Frequent Subgraphs:
Input and Output
Input





Database of graph transactions.
Undirected simple graph
(no loops, no multiples edges).
Each graph transaction has labels
associated with its vertices and edges.
Transactions may not be connected.
Minimum support threshold σ.
Output


Input: Graph Transactions
Output: Frequent Connected Subgraphs
Support = 100%
Support = 66%
Support = 66%
Frequent subgraphs that satisfy the
minimum support constraint.
Each frequent subgraph is connected.
NGDM-02
FSG
Frequent Subgraph Discovery Algorithm
Single edges
Follows an Apriori-style
level-by-level approach
and grows the patterns
one edge-at-a-time.
Double edges
3-candidates
3-frequent
subgraphs
4-candidates
4-frequent
subgraphs
NGDM-02
Computational Challenges
Simple operations become complicated & expensive when dealing with graphs…
Candidate generation




To determine if we can join two candidates, we need to perform subgraph isomorphism to
determine if they have a common subgraph.
There is no obvious way to reduce the number of times that we generate the same subgraph.
Need to perform graph isomorphism for redundancy checks.
The joining of two frequent subgraphs can lead to multiple candidate subgraphs.
Candidate pruning

To check downward closure property, we need subgraph isomorphism.
Frequency counting

Subgraph isomorphism for checking containment of a frequent subgraph
Key to FSG’s computational efficiency:



Uses an efficient algorithm to determine a canonical labeling of a graph and use these “strings” to
perform identity checks.
Uses a sophisticated candidate generation algorithm that reduces the number of times each
candidate is generated.
Uses an augmented TID-list based approach to speedup frequency counting.
NGDM-02
Candidate Generation Based On Core
Detection
Multiple candidates
for the same core!
NGDM-02
Candidate Generation Based On Core
Detection
First Core
Second Core
First Core
Second Core
Multiple cores
between two
(k-1)-subgraphs
NGDM-02
Canonical Labeling
v1
v0



v 0

 v1
v 2

v3
v
 4
v5
B
A
B
v0
B
B
B
B
B
A
A
v3
B
v2
B
v1
B
1
1
A
v2
B
1
1
1
v4
v3
B
v4
A
1
1
1
1
1
1
v5
v5 
A 





1






v3

 v1
v2

v4
v
 5
v0
v3
B
B
B
B
A
A
B
1
1
1
1
v1
B
1
v2
B
1
1
1
1
v4
A
1
v5 v0 
A B 

1

1






Label = “1 11 100 1000 01000”
Label = “1 01 011 0001 00010”
NGDM-02
DTP Dataset (chemical compounds)
(Random 100K transactions)
10000
1400
9000
Running Time [sec]
8000
1200
#Patterns
7000
1000
6000
800
5000
600
4000
3000
400
2000
200
Number of Patterns
Discovered
Running Time [sec]
1600
1000
0
0
1
2
3
4
5
6
7
8
9
10
Minimum Support [%]
NGDM-02
DTP Dataset
Running Time [sec]
1600
1400
1200
1000
800
600
400
200
0
0
20000
40000
60000
80000
100000
Number of Transactions
NGDM-02
Topology Is Not Enough (Sometimes)
H
I
H
H
H
H
H
H
H
H
H
H
H
H
H
H
H
H
H
H
H
H
O
H
O
H
H
O
H
H
H
H
O
H
H
H
H
H
H
H
H
H
H H
H
H
Graphs arising from physical
domains have a strong geometric
nature.

O
H
H
H
O
H
H
H
H
H
This geometry must be taken into
account by the data-mining
algorithms.
Geometric graphs.

Vertices have physical 2D and 3D
coordinates associated with them.
NGDM-02
gFSG—Geometric Extension Of FSG
Same input and same output as
FSG

Finds frequent geometric connected
subgraphs
Geometric version of (sub)graph
isomorphism


B
A
The mapping of vertices can be
translation, rotation, and/or scaling
invariant.
The matching of coordinates can be
inexact as long as they are within a
tolerance radius of r.

R-tolerant geometric isomorphism.
NGDM-02
Use Of Geometry
Transformation-invariant signatures enable quick identity checks




Normalized sum of distances from the center to each vertex
A sorted list of edge angles
To compare signatures, use a certain threshold
No canonical labeling
For the subgraph isomorphism, coordinates of vertices also work as
strong constraints to narrow down the search space of combination.

Not only the vertex and edge labels, now the coordinates must be
matched.
R-tolerance makes the problem of finding all patterns extremely
hard.

Patterns that are 2R-isomorphic to each other will not be counted
properly
NGDM-02
Performance of gFSG
Running Time [sec]
2500
w scaling
w/o scaling
2000
1500
1000
500
0
0
5000
10000
15000
20000
Number of Transactions
Different number of transactions randomly sampled from DTP dataset
Average transaction size about 23
Minimum support 1.0%
NGDM-02
A discovered
pattern
Example
NSC 4960
O
O
NSC 699181
NSC 40773
OH
HO
O
O
SH
NSC 191370
HN
O
NH
O
O
H2N
O
S
NSC 164863
O
HO
N
O
O
O
O
HO
O
O
O
O
O
O
O
O
O
O
O
NGDM-02
Putting Patterns to Good Use…
NGDM-02
Drug Development Cycle
Idea for drug
target
File IND
Drug screening/
rational drug design/
direct synthesis
Production for
clinical trials
Small scale
production
Laboratory and
animal testing
NGDM-02
Graph Classification Approach
Graph
Databases
1
Discover Frequent
Sub-graphs
Select Discriminating
Features
2
3
Transform Graphs
in Feature
Representation
Learn a Classification
Model
4
NGDM-02
Chemical Compound Datasets
Predictive Toxicology Challenge (PTC)



Predicting toxicity (carcinogenicity) of compounds.
Bio assays on four kinds of rodents
4 Classification Problems -- Approx 400 chemical compounds.
DTP AIDS Antiviral Screen (AIDS)



Predicting anti-HIV activity of compounds.
Assay to measure protection of human cells against HIV infection.
3 Classification problems -- Approx 40,000 chemical compounds.
Anthrax




Predicting binding ability of compounds with the anthrax toxin.
Expensive molecular dynamics simulation
Collaboration with Dr. Frank Lebeda, USAMRIID
Approx 35,000 chemical compounds
NGDM-02
Comparison with PTC and HIV
Aids-CACI
Aids-AI
Aids-AM
1
1
0.8
True Positive Rate
True Positive Rate
0.8
0.6
0.4
0.2
0.6
0.4
0.2
0
0
0.2
0.4
0.6
False Positive Rate
(female mice)
0.8
1
0
0
0.2
0.4
0.6
0.8
1
False Positive Rate
(HIV screening)
NGDM-02
Anthrax
NGDM-02
Comparison of Topological & Geometric Features on
Anthrax Dataset
Topological
Geometric
True Positive Rate
1
0.8
0.6
0.4
0.2
0
0
0.2
0.4
0.6
0.8
1
False Positive Rate
NGDM-02
Most Discriminating Subbgraphs
OH
H2N
HO
(a) On Toxicology (PTC) Dataset
O
NH2
OH
O
NH
HO
NH2
NH2
O
O
(b) On AIDS Dataset
H
N
S
O
S
NH2
N
N
(c) On Anthrax Dataset
NGDM-02
Moving Forwards
Graphs provide a powerful mechanism to
represent relational and physical datasets.
Can be used as a quick prototyping tool to test
out whether or not data-mining techniques can
help a particular application area and problem.
Their benefits can be realized if there exists an
extensive set of efficient and scalable
algorithms to mine them…
NGDM-02
Research on Pattern Discovery
Robust algorithms for mining 3D geometric
graphs

extensive applications in proteomics
Algorithms for finding approximate patterns


allow for a limited number of changes
there is always variation in the physical world
Algorithms to find patterns in single large graphs



what is a pattern?
what is its support?
do we allow for overlap?
…
NGDM-02
Research on Classification
Position specific models

a substructure at the surface of the protein is
in general more important than the same
substructure being buried
Efficient Graph-based kernel methods

algebra of graphs?
…
NGDM-02
Research on Clustering
Efficient methods to compute graph
similarities
spectral properties?
 graph edit distance?

Graph consensus representations
Multiple graph “alignments”
…
NGDM-02
Graphs, graphs, and more graphs…
Graphs with multi-dimensional labels
Stream graphs

phone-network connections
Hypergraphs

compact representation of set relations
Benchmarks and real-life test cases!
NGDM-02
Thank you!
http://www.cs.umn.edu/~karypis
NGDM-02