BIBE07_Presentation_Topological

Download Report

Transcript BIBE07_Presentation_Topological

Graph and Topological Structure
Mining on Scientific Articles
Fan Wang, Ruoming Jin,
Gagan Agrawal and Helen Piontkivska
The Ohio State University
The Kent State University
Presenter: Fan Wang
The Ohio State University
Outline
• Introduction
• Topological Structure Mining
• Data Preprocessing and Graph
Representations
• Experiment Results and Pattern Analysis
• Conclusion
Introduction
• Huge number of genes in literature
• Associated with targeted disease or
functionality
• Finding interaction among genes manually
– Time consuming
– Error Prone
Introduction
CCL5
CCR5
CCL3
CCL5
T
CCL4
CCL4
CCL3
T
CD4
• Well-known relationship among chemokine ligands
• Mining these relations from literature documents
• Mining frequent patterns from graph datasets
– Convenient representation
– Lots of research in subgraph mining
Introduction
• Our Goal
– Find commonly occurring interactions
– Represent them visually
• Capture the co-occurrence of scientific
terms
• Graph representation of scientific document
• Mining frequent topological structures
Outline
• Introduction
• Topological Structure Mining
• Data Preprocessing and Graph
Representations
• Experiment Results and Pattern Analysis
• Conclusion
Topological Structure Mining
• Disadvantages of subgraph mining
– Exact matching
– Missing potential patterns
• Focusing on the topological relationship
• Incorporating approximate matching
Topological Structure Mining
G
Y
X
G is a subgraph of Y
X is a (0,3) topological structure of Y
Topological Structure Mining
• Definition
– Given a collection of graphs, two parameters l
and h, and a threshold θ. A (l,h)-topological
structure whose support is greater than or
equal to θis called a frequent topological
structure.
• Given a set of graphs, in our KDD05 paper,
an algorithm TSMiner finding frequent
topological structures is implemented
Our Work
• Using topological structure mining
• Challenges
– How to create graphs?
– What are the keywords?
– How to insert edges into graphs?
Outline
• Introduction
• Topological Structure Mining
• Data Preprocessing and Graph
Representations
• Experiment Results and Pattern Analysis
• Conclusion
Data Preprocessing
and Graph Representation
• One graph for each document
• Nodes are keywords of interest
• Edges inserted based on occurrence of the
keywords
• Run topological structure mining algorithm
Data Preprocessing
• Four dictionaries of keywords
– Short Dictionary
• 321 genes expressed between prostate epithelial
and stromal cells
– Long Dictionary
• 2600 human genes found in supperarray’s DNA
microarray experiment
– Confusion Dictionary
• Gene names easily confused with ordinary words
– GO Dictionary
• GO terms (molecular function, biological process
and cellular component)
Graph Representations
• Edge Construction Methods
– Sentence-based Method
• Two keywords in one sentence
– Mutual Information Method
• The mutual information of two keywords greater than a
threshold
– Sliding Window Method
• Two keywords located within a sliding window with a predefined size
Outline
• Introduction
• Topological Structure Mining
• Data Preprocessing and Graph
Representations
• Experiment Results and Pattern Analysis
• Conclusion
Experiment Results
• Focusing on articles containing at least
one of the 5 genes
– CCL5, TF, IGF1, MYLK, IGFBP3
• Generating graph for each article
• Finding frequent topological structures
Three Edge Construction Methods
Number of Patterns Found, MYLK, sup=20%
Number of Patterns Found, MYLK, sup=15%
1000
800
500
Sentence
MI
Window
Total Number of Patterns
Total Number of Patterns
1200
600
400
200
0
400
Sentence
MI
300
Window
200
100
0
(0,0)
(1,1)
L and H values
(2,2)
(0,0)
(1,1)
L and H values
(2,2)
Three Edge Construction Methods
Number of Patterns Found, IGFBP3, sup=20%
Number of Patterns Found, IGFBP3, sup=15%
350
300
Sentence
MI
250
Window
350
Total Number of Patterns
Total Number of Patterns
400
200
150
100
50
300
250
200
Sentence
MI
Window
150
100
50
0
0
(0,0)
(1,1)
L and H values
(2,2)
(0,0)
(1,1)
L and H values
(2,2)
Three Edge Construction Methods
Results
• Sliding window method wins
– Largest number of frequent patterns
– Best scalability
• Topological structure mining giving us
more frequent patterns
• Large number doesn’t mean high
biological significance
Pattern Analysis
CCL4
CCL3
T
CCR5
CCL3
T
CCL4
PSMB6
CCL3
•
CCBP2
CCL5
CCL5
CCL4
• ONLY be found by
CCL5
CCL5
T
CCL4
•
T
CCL3
CXCR4
PSMB6
Underlying Graph 1
PSMB6
Underlying Graph 2
topological structure
mining
ONLY be found by
sliding window method
Restoring nodes
revealing interesting
patterns
Outline
• Introduction
• Topological Structure Mining
• Data Preprocessing and Graph
Representations
• Experiment Results and Pattern Analysis
• Conclusion
Conclusion
• Sliding window method is the best
– The most number of frequent patterns
– The highest quality of frequent patterns
• Topological structures found
corresponding well to known relationships
• Topological mining being a very valuable
tool for biological researchers
Three Edge Construction Methods
• Interestingness of Edges
– Counting the number of distinct edges
– Computing the average interestingness of edges for
all patterns found by using each edge construction
method