BIBE06_kaushik

Transcript BIBE06_kaushik

Exploratory Tools for Follow-up Studies
to Microarray Experiments
Kaushik Sinha
Ruoming Jin
Gagan Agrawal
Helen Piontkivska
Ohio State and Kent State Universities
1
Overall Motivation



Biological literature is vast
Need tools to find interesting patterns from
literature
Specific Example


Identify genes from DNA microarray and other
gene and protein assays
Next step



What is known about these genes?
How are these genes related to each other or other
genes identified in similar studies?
Which other genes are most similar
2
Outline



Hypergraph Mining
Similarity Measures
Evaluation and Observations
3
Hypergraph Mining: Motivating
Example




Micro array experiment - suspects that a small set of genes
are related to a disease
Confirm by searching existing literature - expect related
genes to appear together in literature
However, suppose Gene A and C are related and both of
them are weakly related to another term B
In literature, one would expect




A,C appear together OR/AND
A,B appear together
B,C appear together
How do we efficiently conclude that A,C are actually
related?
4
Hypergraph Mining

Basic Motivation


Example (Gene-Disease Relationship)




Gene A is related to a term B
Term B is related to a gene C
Is Gene A related to Gene C ?
Gene Source


To find useful “Transitive Relation” (hyperedges) among
genes
Microarray Experiments
Information Source

Online Literature abstracts
5
Formal Problem Definition

Given




A dictionary KT
A set KM of user provided keywords (KT‫כ‬KM)
Collection of literature abstracts - each abstract is
represented as a set of words from dictionary
Task

To find hyperedges exceeding user defined threshold,
each of which involves a set of key words from KM and are
potentially connected by another set of linking words from
KT-KM
6
Relationship to Work on Frequent
Pattern Mining

Frequent itemset mining




Can represent each document abstract as a
transaction with several keywords
Find sets of keywords that appear together
and often
Cannot capture cross relationships
Differences


How do we define support ?
How do we prune search space
7
Solution Approach

Define



total weight=support + cross support
Support: set of keywords appear together in one
document
Cross support: set of keywords can be partitioned



each partition appears in different document
Common linking words
Issues

Since downclosure property does not hold for total
weight modified downclosure property can be defined
8
Idea

Support satisfies downclosure property



Cross support can be designed to be restricted below a
particular value, i.e., it is bounded
Form a function h as addition of two functions h=f+g



f satisfies downclosure property
g is bounded
h satisfies modified down closure property


Let X be a set, Ω be its power set. A function f : Ω →R+ satisfies
downclosure property if for all A,B ∈ Ω , A ‫ כ‬B ,f(B)>f(A)
For any θ≥0, if h(Kn) ≥θ then f(Kn-1) ≥ max{0,(θ-sup(g))}
This property can be used to devise efficient algorithm
9
Outline



Hypergraph Mining
Similarity Measures
Evaluation and Observations
10
Similarity Measure among Sets of
Genes



Given two list of gene names
Need to find most similar genes, based on literature
abstract occurrences
Standard statistics approach





Each file containing gene names can be considered as a
Discrete Random Variable (DRV)
Each such DRV can take several values (gene names)
For two such files X,Y and for any pair (x,y),
joint probability mass function p(x,y)=P(X=x,Y=y)
Compute from online abstracts based on co-occurrence
11
Probability Computation

Assume,




File X has n gene names xi, i ∈{1,…,n}
File Y has m gene names yj, j ∈{1,…,m}
M(i,j) is the number of times (xi,yj)
appears together in transactions (article
abstracts)
Then,

p(xi,yj)=M(i,j)/{∑i∑jM(i,j)}
12
Expectation Computation

Now define,



Expectation of Z is,




Z=g(X,Y), where g: X x Y →[0,∞)
Clearly, Z is a random variable
E(Z)=E(g(X,Y))=∑i∑j (g(xi,yj)M(i,j)/Mt)
Where, Mt=∑i∑jM(i,j)
Expected value of Z can directly be used as a
similarity measure
Different choices of g, give rise to different
similarity measures
13
Some Choices of function g

First Choice,



Choose g=M(i,j)
This choice leads to similarity measure,
se1= ∑i∑j M(i,j)2 /Mt
Second Choice,



Choose g=tot_length(xi,yj), where tot_length (xi,yj) is
the sum of transaction lengths where (xi,yj) co-occur
The idea is longer the transaction length, higher the
chance of having related linking key words
This choice leads to similarity measure,
se2= ∑i∑j tot_length(xi,yj)*M(i,j) /Mt
14
Extending the notion towards
gene ranking

Extend to rank genes from a list Y




Most similar to the genes from list X
Here, instead of Y as a random variable, for
each yj ∈Y, consider Uj as a random variable
taking value only yj
Find the similarity measure between X and Uj
for all j∈{1,…,m}
Sort the genes from list Y according to
decreasing similarity measure
15
Datasets
Used two sets of 21 and 31 genes



A standard dictionary, as reported in literature, containing
300 genes was used


These genes are differentially expressed between prostate
epithelial and stromal cells in prostate cancer patients
Dr Gail Frazer’s lab, Kent State University
These genes were significantly up or down regulated in tumor and
adjacent normal tissues when compared with a normal donor
tissue
Each literature abstract was represented in a bag of word
format containing words,

where each word comes from a dataset or the dictionary or is a GO
term
16
Results: Hypergraph Mining

Results show the linking GO terms and linking
genes from the dictionary for 21 and 31 dataset
obtained by hypergraph mining
17
Results: Similarity Measures

4 sets of 300 genes each ,- A,B,C,D were formed




The task is to identify which of A,B,C,D is most similar to the 21 or
31 dataset
As one would expect, A is most similar to the 21 dataset as shown
below


A is the dictionary of 300 genes as mentioned before
B,C,D were randomly chosen from superarray’s DNA micro-array
experiments
It also shows that some naïve similarity measure, such as s1, fails to
capture this
Sometimes, this tool discovers some interesting result,

For 31 dataset, randomly chosen list C was most similar
This has been justified by checking the functionalities of top ranked
genes from list C
18
Results: Ranking


Results of the ranked genes from the most similar list to either 21
or 31 data set
Linking words from hypergraph mining were also found within top
20 genes
19
Summary




Biological Literature is large and
complex
Need data mining tools to summarize
interesting patterns
Proposed hypergraph mining and
similarity metrics
Initial results are promising
20

BIBE06_kaushik

Transcript BIBE06_kaushik

Directory