Differential Association Rule Mining for the Study of

Download Report

Transcript Differential Association Rule Mining for the Study of

Biological Networks
CSci 732: Introduction to Bioinformatics
Anne Denton
Assistant Professor
Department of
Computer Science
North Dakota State
University, Fargo, ND
The Promise of Bioinformatics

Rich supply of data from high-throughput
experiments
–
–
–

Sequencing
Microarray experiments
Scores of specialized high-throughput experiments
Data made available
–
–
–
Most biological data disseminated in large biological
databases
Dissemination is a condition of funding (in US)
Makes biology different from most other disciplines!
Challenge and Opportunity

Traditional approach
–
–

Challenge
–

Studies of one or a few gene at a time
Conclusions based on thorough domain knowledge
Standard data analysis techniques inadequate for hundreds
or thousands of genes
Opportunity
–
–
Massive data valuable to quantify evidence
New aspects can be studied, such as network structure
From Sequencing to
Functional Genomics

Sequencing genomes is a mature discipline
experimentally and computationally
–

Whole-genome sequencing centrally involved
computations
But what do the proteins do?
–
–
–
Sequence comparison (BLAST) among species
Great computational and experimental challenges
Networks have a fundamental role in specifying
relationships
Computer Scientist’s
Introduction to Genomics

Encoding
–
Smallest unit of information


–
Computer: 1 bit (0 or 1)
Cell: 1 nucleotide of DNA (A,C,T, or G)
Most practical unit of information


Computer: 8 bit are 1 byte
(can represent 256 values)
Cell: 3 nucleotide determine 1 amino acid of the protein
(20 different amino acids)
Further Analogy between
Cells and Computers

Encoded values can serve very different
purposes and are stored in the same location
–
–
Computer: Instructions and data are stored in the
same memory (von Neumann architecture)
Cell: Proteins in the cell do very different things


Catalyze chemical reactions (proteins are part of the
process)
Regulate other proteins (proteins change the process)
Networks in Bioinformatics

Many network
definitions:
–
–
–

Protein-protein
interactions
Biochemical
pathways
Annotation
Networks
Different definitions
in each category
Scientific American 05/03
Why Study Networks?

Biochemical pathways tell us about functioning
of cell
–
–



Chemical processes (Metabolic pathways)
Control of other proteins (Regulatory pathways)
Neighbors in networks often have similar
function
Structure of networks can tell us about evolution
Combined study of networks and data can
uncover yet more information about cells
Outline

Part 1: Properties of the Networks
–

Scale-free networks
Part 2: Networks and data
–
–
–
–
Relational data mining
Problem: Similarity between network neighbors
Solution: Focus on differences between neighbors
Comparison of different networks
Example of a Biological Network:
Physical Protein-Protein Interactions

Proteins interact (attach to each other)
–
–
–

Tests if proteins are stable in a close position
Proteins may perform function together (not tested)
Mathematically: Undirected graph
Only one definition of interactions between
proteins? No!
–
–
Definition based on function: Genetic
Definition based on evolution: Domain fusion
Physical
ProteinProtein
in Yeast
Scientific
American
05/03
Scale-free Networks






Properties
Barabasi, Bonabeau 1998
Some nodes have large number of links,
most have only a few
Number of nodes that have a particular
number of links decreases as a power law
Robust against accidental failures
Hubs with high connectivity
Power-law Behavior


Probability that any node is connected to k
other nodes is 1/k n with n between 2 and 3
For k = 2: Probability of having twice as
many links is a quarter as likely
Scientific
American 05/03
Robust against accidental failures




As many as 80% of nodes can fail in a scale-free
network without breaking up the entire cluster
I.e., even with large number of random mutations in
genes, unaffected proteins continue to work together
Note that random removal of nodes is unlikely to
remove hubs
Removal of only a few hubs does break up network
significantly
Examples Outside Biology





Hyperlink structure of the Internet
Physical structure (routers and
communication lines) of the Internet
Social networks
Airline system
Scientific papers connected by citations
Scale-free Network (Airline System)
Scientific American 05/03
Random Networks in Contrast

Links placed randomly
–




Mathematical model (Erdos 1959)
Example: Highway system
Bell-shaped (Poisson, similar to Gauss) distribution of number of
nodes around typical value
For large number of links k,
probability of k links
decreases exponentially
Very unlikely to have
nodes with a very large
number of links
Random Network (Highway System)
Scientific American 05/03
Reasons for the development of scalefree networks

Networks grow over time
–
–

Older nodes have had longer to accumulate links
The most connected nodes in E.coli metabolic
network have an early evolutionary history
Preferential attachment
–
"The rich get richer“
cf. many people hyper-link to Google
"Small World" Property




Game "Six Degrees of Kevin Bacon ": Acting
in the same movie as links connects most
actors in 6 steps
Internet pages are typically 19 clicks apart
Any two chemicals in a cell are only 3
reactions apart!
Small-world property is not limited to scalefree networks
Are Protein Networks Pure Scale-free
Networks?


Clusters of tightly connected nodes
Example
–

Proteins that perform function together
Recovering scale free network
–
Clustering of nodes leads to groups that interact
as scale-free networks
Summary of Scale-free Networks

Characterized by
–
–

Show small-world property
–

Few hubs with a large number of edges
Many nodes with few edges
Any node can be reached from any other in only a
few steps
Ubiquitous in biology and outside
Has Everything Interesting Related to
Networks Been Done?



Graph theory is an old topic:
Euler 1736
Work on scalefree network
has added to it
Surely now
everything is done!
Protein-Protein Interaction Networks
Name: YAL040C
Function: Cell Cycle & DNA Process
Localization: nucleus
Class: Cyclins
Complex: Cyclin-dependent kinases
Motif: PDOC00013
Phenotype: Cell cycle defects
Name: YAL003W
Function: Protein Synthesis
Localization: Cytoplasm
Class: GTP/GDP-exchange factors
Complex: Translation complexes
MOTIF: PDOC00648
Outline (2): Data and Networks

Relational rather than graph-theoretic
approach to mining of data on a graph
–

New Algorithm
–

Problem of similarity between neighbors
Focuses on differences
Comparison of Networks
–
–
Different definitions of Networks
Generalization of difference-based algorithm
Questions of Interest

How do data between nodes relate?
–

Can we find relationships that are not yet
known to biologists?
–
–

Are there typical patterns among interacting
proteins?
It is expected that proteins in the nucleus interact
with other proteins in the nucleus
It is more surprising if proteins in the nucleus
interact with proteins in the mitochondria
How do different networks compare?
Frequent Patterns in Tables:
Association Rule Mining

1st step: Finding sets of items that are frequent
–
–
–

Originally: Items in shopping carts
Here: Properties of proteins
Support: Fraction of transactions, in which the set of
items occurs
2nd step: Finding associations A -> B
–
–
–
If we find set A, we are likely to find set B
Similar to correlation, but goes in one direction only
Confidence: Fraction of transactions that have A,
which also have B
Relational Approach
0.
Node Table
ORF
1.
Annotations
Node Table
ORF
Annotations
YPR184W
{cytoplasm}
YER146W
{cytoplasm}
YNL287W
{SensitivityTOaaaod}
Edge Table
YPR184W
{cytoplasm}
YER146W
{cytoplasm}
ORF1
ORF1
YPR184W
YER146W
YNL287W
YBL026W
YBL026W
YMR207C
YNL287W
{SensitivityTOaaaod}
YBL026W
{transcription, nucleus}
YBL026W
{transcription, nucleus}
YMR207C
{nucleus}
YMR207C
{nucleus}
Generalization




Works for any number of
nodes
Even 2-node structure allows
complex rules
Results become harder to
interpret for many nodes
“Small world property”: Any
protein can be reached in 3-4
hops
Effect of Joining on Item Sets
ARM on Joined Tables


Protein names (key) used for joining but not in ARM
One node table participates multiple times
–

Items labeled to keep track of instance
Typical transactions
–
Simplest approach had been done
overemphasizes similarities
T1
{0.cyto, 0.Cond_pheno, 0. PDOC00030, 1.Hydrolases}
T2
{0.Cond_pheno, 0.Transferases, 0.Polymerases, 1.Cond_pheno, 1.Transferases, 1.Polymerases}
Problem with Naive Implementation


Many rules that are not interesting
Rules that reflect similarity between neighbors
–

Rules involving only one protein
–
–

0.nucleus → 1.nucleus
0.transcription → 0.nucleus
Note: Different support and confidence compared with ARM
on node relation (protein data ignoring network)
Rules that are a consequence of both above
–
0.transcription → 1.nucleus
Algorithm
TID
Unique Operation
1
{0.cytoplasm}
{1.cytoplasm}
2
{0.metabolism, 0.nucleus}
{1.transcription, 1.nucleus}
3
{0.transcription, 0.nucleus}
{1.nucleus}
TID
2
Final Transactions
{0.metabolism}
{1.transcription}
Properties of Algorithm

Significant pruning at transaction level
–
–

Note: Pruning at rule or item set level would not be
consistent
–
–

Fewer items in transactions
Fewer transactions
Example: 0.transcription → 1.nucleus
Differs from conventional setting: pruning at itemset level is
gold standard
Modular approach
–
Unique operation can be combined with different ARM
implementations
Results for
{0.transcription}  {1.nucleus}

Standard ARM:
–
Support = 0.29%, Confidence = 28.38%
Rule: {0.transcription}  {0.nucleus} (0.70%, 69.59%)
Rule: {0.nucleus}  {1.nucleus} (5.74%, 29.06%)

Differential ARM:
–

Support 0.02%, Confidence 2.08%
Typical range for differential ARM
–
Support 0.2-2%, Confidence 6-20%
Focus on Differences

Similarity can be tested through calculation of
correlation of items with themselves
–
–

Computationally easy
Association meaningless
Typical rules that are interesting to biologists
–
–
Which kinds of proteins show compartmental cross-talk?
E.g., proteins in the nucleus interacting with proteins in the
mitochondria
How do interaction definitions differ?
E.g., which protein families show physical interactions but
no domain fusion interactions
Other Results of Differential ARM

Expected Cross-Talk past analysis papers
–

{1.mitochondria}  {0.cytoplasm} (1.2%, 27.3%)
Interesting Related Rules not found before
–
–
{1.mitochondria}  {0.nucleus} (0.72%, 16%)
{1.ER}  {0.mitochondira} (0.21%, 6%)
Performance
Comparison of Different ProteinProtein Interaction Networks

Different Definitions
–
–
–

Physical interactions
Genetic interactions
Domain fusion
Are the resulting networks biologically
equivalent?
–
Common assumption: All networks signify
similarity in neighbors
Physical Interactions


Tests whether proteins physically interact
when brought close together
Yeast-2-hybrid method
–
–

A gene is cut in half, and each half attached to
one of the proteins in question
The gene can only perform its job if its parts come
close through the protein-protein interactions
Can be done in any cell, i.e. does not test
functions in cell
Genetic Interactions


In vivo analysis, i.e., in the living organism
Typical scenario
–
–

Assume gene A and B can individually be deleted
and the organism survives
Assume deleting A and B together means the
organism does not survive
Other combinations are possible
–
–
Organism does not survive deletion of A and B
individually but does survive combined deletion
Or: Organism survives but is noticeably changed
Domain Fusion Interactions



Comparative Genome Analysis
Purely computational analysis
Based on evolutionary relationships
–
–
–


Assume species A has one gene with two domains 1 and 2
Assume species B has two genes that have the same
evolutionary origin (orthologs), 1' and 2'
Likely that proteins from genes 1' and 2' interact to generate
the same function
24,000 protein-protein interactions in yeast
No experimental verification!
Can ARM Results be Compared
Between Networks?


Not all proteins studied for all interactions
Networks have very different properties
Table
Int/orf
Max int
#>20
#int
Physical
3.55
289
73
14672
Genetic
7.88
157
93
8336
Domain fusion 44.6
231
305
28040
Construction of Network Comparison
Basis Set
Transactions
C
D
E
C
D
K
G
D
E
G
D
K
J
H
I
Algorithm


Only nodes are considered that are involved in both networks
Items are eliminated if they occur in either of the two interaction
types
2.A
2.B
1.A
1.C
1.D
0.C
0.E
Results


Rules based on physical interactions have higher
confidence than genetic or domain fusion
Example: Physical compared with Domain Fusion
–
–
–
{0.ABC trans family signature(PDOC00185)}  {1.ATP/GTP
binding site motif A(PDOC00017)} (0.45%, 90%)
ABC family is known to function together with ATP binding
site
Nevertheless ABC family signature don’t occur in one gene
together with ATP binding site in other species (no domain
fusion)
Possible reason: many proteins with ATP binding site
Conclusions of Part 2

Differential algorithm to ARM in relations that
describe networks contrasts
–
–

Solves other problems of network setting
–
–
–


Neighbors within neighbors
Multiple networks
Overwhelming number of rules due to neighbor similarity
Problem of rules that don’t involve all nodes
Rules that follow from combinations of both above problems
Some results confirmed by biologists
Some results interesting but plausible to biologists
Overall Conclusions

Computer science techniques can uncover
patterns in data that could not be identified
by simple inspection
–
–

Exciting times are only starting
–

Large amount of data
Complex data
Functional genomics only at its beginning
Mutual understanding of language takes time
Overall Conclusions
(Data Miner’s perspective)





Bioinformatics excellent “playing field” for data
miners
Data more easily available than in most other
disciplines (possible exception: astronomy)
Results can directly benefit researchers in
biology
Algorithms can/have to pass test of reality
Real data motivate fundamentally new
algorithms
Acknowledgements (Part 2)
Computational side
 Christopher Besemann
Biological interpretation
(Collaborators from NDSU Dept. of Biology)
 Ajay Yekkirala
 Ron Hutchison
 Marc Anderson
The work was funded by the Dept. of CS, EPSCoR and
the NDSU Research Foundation