Graph mining in bioinformatics

Download Report

Transcript Graph mining in bioinformatics

Graph mining in bioinformatics
Laur Tooming
Graphs in biology
• Graphs are often used in bioinformatics for
describing processes in the cell
• Vertices are genes or proteins
• The meaning of an edge depends on the type of
the graph
– Protein-protein interaction
– Gene regulation
What we’re looking for
• We want to find sets of genes that have a
biological meaning.
• Idea: find graph-theoretically relevant sets
of vertices and find out if they are also
biologically meaningful.
• Simple example: connected components
• A more advanced idea: graph clustering.
Find subgraphs that have a high edge
density.
Markov Cluster Algorithm (MCL)
• If there is cluster structure in a graph, random
walks tend to remain in a cluster for a long time
• Graph modelled as a stochastic matrix: sum of
entries in a column is 1
• aij - probability that randomly walking out of j will
go to i on the next step
• Bigger edge weight means greater probability of
choosing that edge
Stijn van Dongen, Graph Clustering by Flow Simulation. PhD thesis, University of Utrecht, May 2000.
http://micans.org/mcl/
Markov Cluster Algorithm (MCL)
• Two procedures, inflation and expansion,
are applied alternatively
• Expansion: matrix squaring
– considers longer random walks
• Inflation: raising entries to some power,
rescaling to remain stochastic
– Weakens weak edges and strengthens strong
ones
• Converges to a steady state
Markov Cluster Algorithm (MCL)
Images from http://micans.org/mcl/ani/mcl-animation.html
Betweenness centrality clustering
• An edge between different clusters is on many
shortest paths from one cluster to another.
• An edge inside a cluster is on less shortest paths,
because there are more alternative paths inside a
cluster.
• Betweenness centrality of an edge - the number of
shortest paths in the graph containing that edge.
• Remove edges with the highest centrality from the
graph to obtain clustering.
• Optimisations:
– instead of all shortest paths, pick a sample of vertices
and calculate shortest paths from them
– remove several edges at once
GraphWeb
• Web interface for analysing biological graphs
• Simple syntax for entering graphs
– multiple datasets
– directed edges
– edge weights
• Visualising graphs with GraphViz
• Finding biological meaning with g:Profiler
ds1:
ds2:
ds1:
ds2:
A > B 10
A > B 4
B
C 5
C > D 12
Combining several datasets
• Whether or not there is an edge between two
vertices is determined in biological experiments,
which may sometimes give false results.
• For a given graph different sources may give
different information. Some sources may be
more trustworthy than others.
• We would like to combine different sources and
assess the trustworthyness of each edge in the
resulting graph.
• Edge weight in summary graph: sum over
datasets
– w(e,G) = Σ w(e,Gi)*w(Gi)
Combining several datasets
The end