Chemical structures database subsimilarity search: Persistent object

Download Report

Transcript Chemical structures database subsimilarity search: Persistent object

Distributed Ligand and Monomer Object Database
Milorad Tosic, John Westbrook, Helen Berman
Rutgers, The State University of New Jersey
Department of Chemistry
Introduction
• Key concepts
–
–
–
–
Use hyper-graph as a data model for abstract representation of unstructured data
Examine hyper-graph data model for subsimilarity search of chemical structures
Use graph-based database models for data with features that can not be numerically represented
Apply metric-based indexing for unstructured data
• Key applications
–
–
–
–
Bioinformatics,
Multimedia,
Query systems for World-Wide Web,
Geographic Information Systems (GIS)
• Software architecture
–
–
–
Client-Server
Distributed object database (CORBA)
Metadata description of software methodology
Application Areas
Problems dealing with data that does not conform to traditional data models, like traditional
relational or object oriented model.
• Bioinformatics
–
Structure subsimilarity search
Introduction of topology based searching criteria improves efficiency of searching as well as gives more
accurate results. Instead of comparing structures on the atom-bond level, it is much more promising to
make comparison on different levels of basic building substructures hierarchy.
• Multimedia
–
Features of multimedia data (sound, picture, movie) can not be completely represented by numerical values.
Usually, set of feature values and relation between them appears as necessary.
• Query systems for World-Wide Web
–
Physical data representation is diverse as well as connection pattern.
• Geographic Information Systems (GIS)
–
Nature of both data and query is topology-oriented
(similar or exact paths between two points, similar areas, …)
LIMA (LIgand MAtching) Search Engine for
Maximal Common Substructure Search Strategy
• Back-tracking
– Back-tracking is used as an background algorithm for finding maximal common
subgraph
• Topology-based comparison criteria
– Topology-based features of chemical structures are employed for structure efficient
[XUJ96], [EST98], [WAN98]
description
– Topological queries and indexing in collection of distributed objects can be
[PSV99]
generalized to other applications
– Heuristics for reducing average searching time postpone the computational
explosion for the larger structures based on a substructure-by-substructure
rather than atom-by-atom search
• Distributed objects
– Distributed computing is explored for increasing processing speed
– Persistent objects are essential for robustness of the searching engine
LIMA software architecture
• LIMA version 1 is a client-server application
–
–
Employs a completely in-memory database model
high resource impact
• LIMA version 2 is a distributed object-oriented application
– CORBA implementation for interoperability
–
Consists of CORBA interfaces representing:
• Agents - objects that obey Composite and Chain of responsibility design patterns
–
–
–
Agent object stores and searches a number of structures.
Agent objects can be created by independent creators.
Creator of agent is freed of any administration responsibilities. Agents check in to the
SystemAdministration object which performs further administration.
• SystemAdministration - object that obeys Decorator design pattern
–
Keeps track of all agents in the system, and provides administrative and user services to outside
users.
• Administrator - application providing user interface for administration purposes
• Client - application providing user interface for basic user services provided by the
system as a whole.
Details of the LIMA software architecture a CORBA based implementation
Agent
Agent
...
...
Agent
System
Administration
Interface
Agent
System
Administration
Client
Interface
Administrator
Administrator
Interface
Client-Administrator
Client
Client-User
Details of the CORBA reference model
Interface
Repository
IDL
compiler
Implementation
repository
operation()
Client
DII
IDL
stubs
Object
in args
out args + return value
ORB
Interface
(Servant)
IDL
DSI
skeleton
Use case overview:
•
IDL interface is defined and compiled
•
Servant containing application code
for defined interface is made.
•
Object is created by linking code
generated from IDL interface to
Servant on both programming language
and compiler level
•
Object is connected to ORB by Object
Adapter
•
Client is created by linking code
generated from IDL interface to user
code on both programming language
and compiler level
•
Client gets generated Object’s
reference and accesses the Object by
reference as they are in the same
program space.
Object Adapter
ORB Core
GIOP/IIOP
Standard interface
Standard language mapping
ORB-specific interface
Standard protocol
Object Management Group, The Common Object Request Broker:
Architecture and Specification, 2.2 ed., Feb.1998.
LIMA - topology-based comparison criteria
• LIMA is used for experimental evaluation of topology-based comparison criteria
–
LIMA has been used for subsimilarity search of the small molecule components of the macromolecular-ligand
complexes in the PDB and NDB.
–
LIMA was used to classify and check the topologies of the 2500 ligands and modified residues contained in the
PDB that makes it very useful tool for experimental evaluation of our assumptions in the early system
development phase.
• Is there any searching speed-up due to introduction of topology-based comparison
criteria ?
–
Compare searching time with and without topology-based criteria.
–
The topology criterion based on ring number is used:
An atom X matches atom Y iff they have the same atom types and number of rings that X belongs to is not
greater than that Y belongs to.
–
In order to examine how atom types influence searching process, the same set of target structures is applied
including as well as excluding hydrogens.
• Is there any improvement in quality of the searching results due to
introduction of topology-based comparison criteria ?
–
Does topology-based comparison criteria improve substructure similarity measure?
–
Compare resulting structures obtained by searching with and without topology-based criteria.
Experimental results from LIMA
Is there any searching speed-up due to introduction of topology-based comparison
criteria ? - YES
Searching time per target strucutre
(84 targets; 2164 database structures)
(excluding Hydrogens)
100
160
140
120
100
80
60
40
20
0
80
no-limits-wHs
with-limits-wHs
•
60
no-limits-noHs
with-limits-noHs
40
20
0
structure
•
•
•
time (s)
time (s)
Searching time per target structure
(92 targets; 2164 database structures)
(including Hydrogens)
structure
Searching speed-up is evident if topology-based criteria are applied.
Oscillations in searching time indicate further potential for improving speed.
Exponential complexity remains (both curves have the same growing tendency), but by introducing topology-based
criteria point of the run-time explosion is translated into the area of much more complex structures.
Relative improvement is higher for the case where structures without hydrogens are considered.
Experimental results from LIMA
Is there any improvement in quality of the searching results due to
introduction of topology-based comparison criteria ? - YES
Target
Matching
Eliminated
•
Topological criteria based on rings successfully decreases number of resulting structures.
•
Increased probability for expected structures to be found in the set of resulting structures.
Hyper-graph: definitions
Definition: A hyper-graph HG is an ordered two-tuple HG = (C,E) , where C is set of hyper-graphs that
are containers of HG, and E is a set of hyper-graphs that are elements of HG:
C = { c | c > HG }, E = { e | e < HG }
Definition: An undirected hyper-graph HG is an ordered two-tuple HG = ((C, E), I) , where (C,E) is
hyper-graph, and I is set of undirected hyper-graphs that are neighbors of the HG. We say that HG is
in undirected connection relation with its neighbors.
Definition: The undirected connection relation is an equivalence relation.
Definition: An directed hyper-graph HG is an ordered three-tuple HG = ((C, E), I, O) , where (C,E) is
hyper-graph, I is set of directed hyper-graphs that are input neighbors of the HG, and O is set of
directed hyper-graphs that are output neighbors of the HG. We say that HG is in directed connection
relation with its neighbors.
Definition: The directed connection relation is an order relation.
Note: We use the undirected hyper-graph in MCS.
Hyper-graph: example
v1
v1:
v3
e12
e35
e23
v5
v2
e24
id = v1;
type = VERTEX;
Container = {G1};
Elements = {};
InElements = {e12};
e57
e45
id = v2;
type = VERTEX;
Container = {G1};
Elements = {};
InElements = {e12, e23, e24};
v7
v4
e46
v2:
...
e67
v6
e68
v8
G1:
id = G1;
type = GRAPH;
Container = {};
Elements = {v1, … , v8, e12, e23, … ,e68};
InElements = {};
e12:
id = e12;
type = EDGE;
Container = {G1};
Elements = {};
InElements = {v1,v2};
...
e23:
id = e23;
type = EDGE;
Container = {G1};
Elements = {};
InElements = {v2, v3};
Hyper-graph: example (con’t)
After simple-loop reduction
v3
v1
e12
v2
e35
e23
v2
e24
e46
e45
id = G2;
type = GRAPH;
Container = {};
Elements = {g1,g2,g3,g4, e1,e2,e3,e4};
InElements = {};
g2:
e57
v7
v4
v4
G2:
id = g1;
type = GRAPH;
Container = {G2};
Elements = {v1,v2,e12};
InElements = {e1};
e45
v5
v6
e67
v6
g1
g1:
v5
g2
e1
g3
e2
g4
e3
id = g2;
type = LOOP;
Container = {G2};
Elements = {v2,v3,v4,v5,e23,e24,e35,e45};
InElements = {e1, e2};
e1:
id = e1;
type = EDGE;
Container = {G2};
Elements = {v2};
InElements = {g1,g2};
e2:
id = e2;
type = EDGE;
Container = {G2};
Elements = {v4,v5,e45};
InElements = {g2, g3};
e68
v8
Conclusions
• Experimental analysis proved that topological abstraction of chemical
structure data (e.g. rings in chemical structures) can improve both
searching efficiency and accuracy of searching results.
• New hyper-graph model
– is able to efficiently represent topology features of a chemical structure,
in a hierarchical way.
– is applicable to any application dealing with unstructured data
– can be efficiently implemented by a distributed software architecture
• Hyper-graph model and distributed object technology used here
generalize to other applications areas which use unstructured data
Related work
• Graph-based data models and query languages
–
–
–
Peter Buneman, Susan Davidson, Mary Fernandez, and Dan Suciu, Adding structure to unstructured data. In
Proceedings of the International Conference on Database Theory, pages 3360350, Delphi, Greece, 1997,
Springer Verlag.
Tova Milo, and Dan Suciu, Index Structures for Path Expressions, In ICDT’99, LNCS 1540, pp.277-295,1998,
Springer Verlag
Ralf Hartmut Güting, GraphDB: A data model and query language for graphs in databases, VLDB
1994, pp.297-308.
• Similarity search in metric instead of vector space
–
–
–
Paolo Ciaccia, Marco Patella, Pavel Zezula, M-tree: An efficient access method for similarity search in metric
spaces, In Proceedings of the 23rd VLDB Conference Athens, Greece, 1997.
Monika R. Henzinger, Thomas A. Henzinger, and Peter W. Kopke, Computing simulations on finite and infinite
graphs, In Proceedings of the 36th Annual IEEE Symposium on Foundations of Computer Science (FOCS95),
October 1995, pp.453-462.
Tolga Bozkaya, Nasser Yazdani, and Meral Özsoyoglu, Matching and indexing sequences of different lengths,
Proc. 1997 ACM CIKM, Sixth International Conference on Information and Knowledge Management, Las
Vegas, Nevada, Nov. 1997.
References
[ART92]
Artymiuk, J., et. all., (1992), Similarity searching of three-dimensional molecules and
macromolecules., J. Chem. Inf. Comput. Sci., 32, 617-630.
[BAR93]
Barnard, J.M., (1993), Substructure searching methods: Old and New., J. Chem. Inf. Comput.
Sci., 33, 532-538.
Downs, G.M., and Willett, P. (1995), Similarity searching in databases of chemical structures.,
Rev. Comput. Chem., 7, 1-66.
[DOW96]
[EST98]
Estrada, E., (1998), Spectral moments of the edge adjacency matrix in molecular graphs., J.
Chem. Inf. Comput. Sci., 38, 23-27.
[GWW96] Gillet, V.J., Wild, D.J., Willet, P., and Bradshaw, J. (1998), Similarity and dissimilarity
methods for processing chemical structure databases., The Computer Journal, 41, No. 8, 547558.
[HAG92]
Hagadone, T.R., (1992), Molecule substructure similarity searching: Efficient retrival in twodimensional structure databases., J. Chem. Inf. Comput. Sci., 32, 515-521.
[PSV99]
Papadimitriou, C.H., Suciu, D., and Vianu, V., (1999), Topological queries in spatial databases.,
Journal of Comput. and Sys. Sci., 58, 29-53.
Wang, T., and Zhou, J., (1998), 3DFS: A new 3D flexible searching system for use in drug
design., J. Chem. Inf. Comput. Sci., 38, 71-77.
Xu, J., (1996), GMA: A generic match algorithm for structural homomorphism, isomorphism,
and maximal common substructure match and its applications., J. Chem. Inf. Comput. Sci., 36,
25-34.
[WAN98]
[XUJ96]