Lecture 13 Networks and Ontology

Download Report

Transcript Lecture 13 Networks and Ontology

Report
due: March 31, electronically submit, pdf format.
Find a computational research article (5 pages or more) from one of the following
journals:
Bioinformatics
BMC Bioinformatics
Genome Research
Journal of Proteome Research
Nucleic Acids Research
The article needs to be published after 1/1/2015.
Write a report based on the study of the paper, and its citations.
Report
due: March 31, electronically submit, pdf format.
Requirements:
8 Pages, 1’’ margin, 1.5 line spacing not including figures/tables.
Figures/tables need to be attached at the end of the document.
Include (but not limited to) the following components:
Background and significance of the work.
What’s the technical improvement of the work over previous works?
What could have been done better?
If you were the authors, what’s your next step to extend this work?
Networks in Bioinformatics
General Characteristics
Directed Acyclic Graph and Gene Ontology
Defining distances on DAGs
Network and expression data
Testing on an existing network
Reverse engineering of networks
Network / Graph
A network is a set of vertices connected by edges.
undirected edges  “undirected network”
directed edges  “directed network”.
Vertex-level characteristic:
The number of connections to a vertex : “degree”
Incoming edges  “in-degree” ki
Outgoing edges  “out-degree” ko
k=ki+ko
ki
Evolution of networks. S.N. Dorogovtsev, J.F.F. Mendes
ko
Network
Network-level characteristics:
Number of vertices: N
Number of edges: L
Number of loops: I
For an undirected network: I=L-N+1
Degree: The distribution of vertex degrees
Network
Distribution of shortest path:
ℓμν is the shortest path between nodes u and v
The mean value is called the “diameter” of the network
Clustering coefficient:
For each vertex, the fraction of existing connections
between nearest neighbors of the vertex:
C(μ) ≡ y(μ)/[z(μ) (z(μ) − 1)/2],
z(μ): Number of neighboring vertices
y(μ): Number of edges between the neighboring
vertices
Clustering coefficient C is the mean of C(μ)
Scale-free Network
Scale-free network:
The degree distribution follows the power law:
P( k )  k

Few nodes are of high degree, while most nodes are
of low degree.
log( P(k ))  c   log( k )
Contrast: random edge generation
yields Poisson distribution.
Scale-free Network
Quote from the figure legend:
Both networks contain 130 nodes and 215 links.
Red, the five nodes with the highest number of links; green, their first
neighbours.
Nature 406(6794):378.
A large number of real-world networks, including biological
networks are found to have power law degree distribution.
Some nodes serve as “hubs”. This makes sense for WWW,
social networks, and for biological networks, where
controllers like the transcription factors are well known.
Scale-free networks are “ultra small-world” – most nodes can
reach one another in a few steps.
These networks exhibit “high tolerance to random
perturbations but are sensitive to targeted attack on the highly
connected nodes”.
Scale-free Network
One way to generate a network with such distribution is the
“rich get richer” model by Barabási and Albert (1999):
Initiate a network, with degree ≥ 1 for each node;
Add new node to the network, linking to existing nodes
with probabilities:
, where ki is the degree of the node.
Higher-degree nodes are more likely to gain new connections.
Scale-free Network
The protein-protein interaction network is a scale-free
network.
S. Wuchty, E. Ravasz and A.-L. Baraba¶si: The Architecture of Biological Networks
Bioinformaticians’ interest in network
Characterizing the structure of biological networks, and find
functional and evolutionary implications.
(a) Ashbya gossypii ATCC 10895
(b) Burkholderia sp.
(c) human
“The compounds that have multiple
pathways to the core compounds are less
likely to cause diseases than the
compounds without multiple pathways.”
Scientific Reports 5, 15567 (2015)
Bioinformaticians’ interest in network
Characterizing the behavior of network nodes/subnetworks
on an existing network.
BMC Genomics,15:314
Bioinformaticians’ interest in network
Reverse engineering of networks based on observations of
gene expression behavior – inference of regulatory relations.
Current Genomics, 2015, 16, 3-22
Bioinformaticians’ interest in network
Disease etiology
Bioinformaticians’ interest in network
Disease etiology
Testing on the network
Goal:
Utilize existing network to aid
biomarker selection (“network marker”)
disease mechanism finding
predictive model building
Data:
A network between biological units
Signal transduction network
Genetic interaction network
Protein-protein interaction network
TF regulatory network
……
Behavior of nodes
Expression data
Knock-out data
…...
Testing on the
network
An example of machinelearning approach.
Mol Syst Biol. 2007; 3: 140.
Testing on the
network
Network markers:
Diamond – univariate significant
Mol Syst Biol. 2007; 3: 140.
Testing on the network
Example: A Bayesian framework
 Univariate test of all genes
 Transform p-values to normal quantiles
 Assume a gene is either “1” (disease related) or “0” (unrelated)
 Use a network-based mixture model – neighboring genes are more likely to
share status
Ann. Appl. Stat. (Epub ahead of print)
Reverse engineering of networks from microarray data
Goal:
infer genetic regulation network structure from microarray
data
Key assumption:
The mRNA level measured by microarray truly reflects the
activity of the regulator
Sadly this is only true for ~20% of the regulators
Methods incorporating more data/knowledge are
developed
Reverse engineering of networks from microarray data
Margolin & Califano, Ann N Y Acad Sci. 2007,1115:51.
Hesselberth et al. Genome Biology. 2006,7:R30.
Reverse engineering of networks from microarray data
Correlation
Partial correlation
(Gaussian graphical models)
Expression data alone
Mutual information
Bayesian network
Expression data + other information
Known transcription factor targets
ChIP-chip and ChIP-seq
Known interactions/pathways
…
Reverse engineering of networks from microarray data
Differentiating mechanisms of co-regulation based on expression data alone is a
daunting task.
Margolin & Califano, Ann N Y Acad Sci. 2007,1115:51.
Network for knowledge representation
Directed Acyclic Graph (DAG)
Directed graph with no directed loops, i.e. from any node,
no route to come back to the same node.
The structure leads to partial ordering of the nodes:
If an edge ij exists, node i is at higher level than
node j.
The Gene-Ontology knowledge-base
Organize knowledge about genes in a directed acyclic
graph. The lower the level, the more detailed
knowledge.
Each gene is annotated to the terms, reflecting people’s
knowledge about it.
The Gene-Ontology knowledge-base
Similar thinking has been used on the tree of life
and other areas
Mol. BioSyst., 2014, 10, 86-92
The Gene-Ontology knowledge-base
Here’s how people’s knowledge about the gene ACE2 is
summarized using the database.
Based on these
papers:
Gene ontology and high-throughput data
Gene ontology was necessitated by high-throughput data --when thousands of genes are measured simultaneously,
people must be able to combine the results with existing
knowledge in a computationally efficient way.
Gene ontology and high-throughput data
Two general types of considerations:
 Does a GO term have first-order association with
the clinical outcome?
 Does the GO term change its interactions with
other functional units in response to the clinical
factor?
Gene ontology and high-throughput data
How to deal with dependency between (neighboring) GO
terms ?
General strategies:
Treat all GO terms as independent units, test for significant
changes one-by-one, and let biologists remove the redundant
information.
Using the GO structure to remove redundant terms, and only
test a small informative subset of all GO terms.
Test for independence conditioned on the results of
descendant nodes.
Gene ontology and high-throughput data
Given a GO term, how to find whether it is up- or downregulated in association with disease is an active research
area. We list a few examples here.
Difficulty:
Within each GO term, a number of genes exist.
These genes in fact operate in a network fashion in the cell.
Competitions and feed back loops are common.
The genes in one GO term don’t change in one direction. In
association with a disease, some are up-regulated, some are
suppressed, and some don’t change.
Gene ontology and high-throughput data
GO term: positive regulation of I-kappaB kinase/NF-kappaB cascade
Disease: Oral cancer metastasis
Gene ontology and high-throughput data
Cutoff-based methods:
General Idea:
Test significance gene-by-gene.
Select a threshold level, divide all genes into two
groups: differentially expressed and non-differentially
expressed.
For each GO term, test the hypothesis that the
differentially expressed genes are drawn from the pool of all
genes independent of the GO term.
Hypergeometric
Binomial
Chi-square test
…………
The arbitrary threshold has substantial impact on the results.
Gene ontology and high-throughput data
Cutoff-free methods:
Try to avoid the use of arbitrary threshold.
Usually use permutation tests to find significance. This
ensures the correlation structure between the genes are
preserved.
With group of genes to analyze, the hypothesis becomes
complicated. Different method may use different assumptions
and test for different hypotheses.
Gene ontology and high-throughput data
Comparing the p-value (or correlation, or other statistics)
distributions from one GO term to the overall distribution:
 Kolmogorov–Smirnov goodness-of-fit test statistic for
comparing two distributions
 Anderson–Darling test statistic for testing for a uniform
distribution
 Wilcoxon rank-sum test statistic
JOURNAL OF COMPUTATIONAL BIOLOGY. 13:798.
GSEA.
PNAS vol. 102 no. 43 15545-15550
GSEA.
PNAS vol. 102 no. 43 15545-15550
Gene ontology and high-throughput data
The competitive null hypothesis: genes in the gene set are not more associated with
the phenotype than genes outside the gene set.
The self-contained null hypothesis: no genes in the gene set are associated with the
phenotype.
GSDCA.
Single gene set
gene set pairs
GSDCA.