Transcript X - UNSW

7th December
2010
A Visual Analytics Approach to Augmenting Formal Concepts
with Relational Background Knowledge in a Biological Domain
Elma Akand*, Mike Bain, Mark Temple
*CSE, UNSW/School of Biomedical and Health Sciences,UWS
The Sixth Australasian Ontology Workshop, Adelaide
University of South Australia
1
Outline
 Machine learning and data mining in bioinformatics
 Domain Ontologies in biomedical applications
 Formal Concept Analysis
 MCW algorithm (Mining Closed itemsets for Web apps)
 BioLattice – a web based browser
 Experimental Application: systems biology
Part-1: Concept ranking by gene interaction
Part-2: Relational learning of multiple-stress rules
Machine learning & Data mining in Bioinformatics
 Bioinformatics
“Bioinformatics is the study of information content and information flow in biological
systems and processes” (Michael Liebman,1995)
 Machine Learning & Data mining
-Can offer automatic knowledge acquisition
-Process to discover knowledge by analyzing data from different perspectives and
can contribute greatly in building knowledge base
 Our work: focus on knowledge-based machine learning
- Previous work: learning from ontologies
- Current work: ontology construction by learning
- Potential application areas: ontologies – central to eCommerce, eHealth
- Current application area: systems biology – predict gene function, data integration
Ontology
 In philosophy - concerned with nature and relations of being
 In knowledge representation - study of categorization of things:
Ontology – "specification of a conceptualization” (Gruber, 1993)
Conceptualization – "formalization of knowledge in declarative
form” (Genesereth and Nilsson, 1987)
Ontology
Informal Ontology
Upper Ontology
Natural language
General
Formal Ontology
Domain Ontology
First order logic or a
variant
Specific
Gene Ontology
a
 Missing concepts and relations
 One gene annotated with different GO terms
with a term specialization of other
gene: x
concepts : a ,b
relations :
(i) x a
(ii) x b and
(iii) b  a
b
x
x
y
Formal Concept Analysis (FCA)
Mathematical order theory (Rudolf Wille in the early 80s)
-Derives conceptual structures out of data
-Method for data analysis, knowledge representation and information management
Components
-Formal context, concept , concept lattice
fourlegged
haircovered
cats
x
x
dogs
x
x
dolphins
gibbons
x
intelligent
marine
x
x
thumbed
x
x
humans
x
x
whales
x
x
Formal concepts in a concept lattice
Top
({cats, gibbons, dogs, dolphins,
humans, whales}, {})
({cats, gibbons, dogs},
{hair-covered})
({gibbons, dolphins, humans,
whales}, {intelligent})
5
6
({dolphins, whales},
{intelligent, marine})
({cats, dogs}, {haircovered, four-legged})
3
2
 Formal context: an n by m Boolean matrix
m attributes
n objects
A
O
columns
rows
1
({gibbons, humans},
{intelligent, thumbed})
({gibbons}, {intelligent,
hair-covered,
thumbed})
 Formal concept: Galois connection
<X, Y>
X is a subset of A, Y is a subset of O
 Concept lattice loosely interpretable in ontology
terms:
concept definitions and
sub-concept relations
cf. T-box
concept membership
by objects
cf. A-box
Bottom ({}, {intelligent, hair-covered,
thumbed, marine, four-legged})
4
FCA in data mining
FCA can be seen as a clustering technique in machine learning
-Most of the work is in a propositional framework
 In data mining closed itemset mining is an efficient alternative to FCA
A frequent itemset X is closed if there exists no proper superset Y such that
Y⊃X with support(Y)=support(X)
E.g., if X = {a,b,c,d} and Y ={a,b,c,d,e} and support(Y)=support(X), then X is not closed
Parameters to avoid building entire lattice
-Extent size must be greater than minsup
Existing closed itemset mining algorithms
-Data structures to speed up closed itemset mining
-But may not build lattice, or include extents
MCW algorithm (Mining Closed itemsets for Web apps)
Vertical data format
IT-tree (itemset-tidset tree) search space
-node has X x t(X) and all children have prefix X
Pruning
- 4 set difference closure operators
Subsumption check
- A look-up table to record all attributes and their occurrences
in closed concepts
Is {TA}{135} closed?
i(135)={TAWC}
Lattice
- adding concepts following a general to specific order
D
T
A
W
C
2
1
1
1
1
4
3
3
2
2
5
5
4
3
3
6
6
5
4
4
5
5
6
attribute
Concept_id
D
C1,C2
T
C3,C4
A
C4,C5
W
C2,C4,C5,C6
C
C1,C2,C3,C4,
C5,C6,C7
Closure operators
Based on CHARM (Zaki, 2005)
D
T
A
W
C
2
1
1
1
1
4
3
3
2
2
5
5
4
3
3
6
6
5
4
4
5
5
6
{TA}{135}={TW}{135} ->{TAW}{135}
{D}{2456}⊂{C}{123456}->{DC}{2456}
{D}{2456} and {W}{12345}->{DW}{245}
Concept lattice as a visual analytics approach
Visual analytics
-combination of information visualization with machine learning and data analysis (Keim
et al., 2008)
Visualization of concept lattice
- provides overview of the structure of the domain
- means for further data analysis, e.g., classification, clustering, implication discovery, rule
learning
Previous work
- lattice navigation since Godin et al. (1993)
-Browsable concept lattice, e.g., Kim & Compton (2004)
Our current work
- on augmenting concept lattice by integrating multiple sources of knowledge (Gene
Ontology, protein interactions) for further analysis & machine learning
Case study: Yeast systems biology
Browsable concept lattice
more
general
Biological validation (1) : synthetic lethality
Synthetic lethal interaction
if cell is viable when either
gene A or B are individually
deleted, but cannot grow
when both are deleted.
Our results show that 72 (119) concepts in the lattice more likely than
random chance at p < 0.01 (p < 0.05) to contain synthetic lethal pairs.
Biological validation (2) : ILP learning of concept definitions
Biochemical pathway data
Protein-protein interaction data
Ontology data
Inductive
Logic
Programming
Microarray geneexpression data
First-order rule
concept(A):- ppi(B,A,C), ppi(B,A,E), ppi(B,C,E)
tfbinds(D,C),fbinds(F,E)
Transcription factor binding
data (ChIP-chip)
Example rule:
RSM19 required for
H2O2 response;
RSM19, RSM22 and
MRPS17 in
“mitochondrial
ribosomal small
subunit” stable
complex; and
RSM22, MRPS17
bound by
transcription factors
under amino acid
starvation.
Transcription
factors
Conclusions
Many real-world domains are data-intensive
Machine learning and data mining applications required to generate predictive and
useful outputs
We focus on knowledge-based learning for comprehensibility – use ontologies
Formal concept analysis as a framework for ontology structure
Use data mining techniques for efficient concept lattice generation
Visual analytics approach: browsable lattice, added background knowledge
Initial validation on a case study from yeast systems biology
Future work
Investigate pseudo-intents to simplify concept lattice
Investigate variants of concept lattice structures
-e.g., concept lattice of inverse context
Add concept definitions to background knowledge in ILP