Transcript Document

Text Mining and Its
Application in
Bioinformatics
Xiaohua Tony Hu
College of Information Science & Technology
Drexel University, USA
Agenda
• Introduction
• Problems of Biomedical Literature Mining
Approaches
• Related Works
• Our System: Bio-SET-DM
• Sub-Network Modeling, Simulation and Evaluation
• Conclusion and Future Studies
2
Biomedical Literature Mining
• Much biomedical and bioinformatics knowledge, along with experimental results, is published only in text documents, and these documents are collected in online digital libraries/databases (Medline, PubMedCentral, BioMedCentral).
• How big is Medline?
– Abstracts from more than 4800 journals, with
over 16 million abstracts
– Over 10,000 papers per week are added
3
Introduction
[Figure: MEDLINE size (number of articles) by year, from 1950 to 2006 (Apr.), with the vertical axis running from 0 to 16,000,000 - the exploding number of PubMed articles over the years]
4
Introduction
• How do we address the information overload in the biomedical literature?
– developing scalable searching & mining
methods
– integrating information extraction and data
mining methods to automatically
o search & retrieve biomedical literature efficiently
and effectively
o extract the results into a structured format
o mine important biological relationships
5
Major Issues in Biomedical
Literature Mining
• Huge numbers of documents
• Lack of structures
• Many subdomains
• Many aliases and typographical variants
for most biomedical objects
• Abbreviations, synonyms, polysemy, etc
6
The General Text Mining View
1. Select the documents to read (Information Retrieval),
2. Identify important entities and relations between those entities (Information Extraction),
3. Combine this new information with other documents and other knowledge into a database,
4. Mine the extracted results (Data Mining)
7
Issues in Current Information
Retrieval (IR)?
• Keyword-based retrieval returns many irrelevant documents and misses many relevant ones
• Query Expansion
• Probabilistic Language Modeling
Examples of ambiguous terms: mouse, bank, chip, apple, etc.
8
Issues in Current Information
Extraction (IE)?
• Examining every document
– Doing so against Medline is extremely time-consuming
• Using filters to select promising abstracts for extraction
– Requires human involvement to maintain the filters and to adapt them to new topics or subdisciplines.
9
Our Approaches: Bio-SET-DM
• Information Retrieval: semantic query expansion (Xiaohua Zhou's Ph.D. thesis)
• Information Extraction: mutual reinforcement learning for automatic pattern learning and tuple extraction (Illhoi Yoo's Ph.D. thesis)
• Text Mining: graph-based representation for text clustering and summarization (Xiaodan Zhang's Ph.D. thesis)
• Bio-SET-DM: Biomedical Literature Searching, Extracting and Text Data Mining
• Biomedical Ontologies: UMLS and GO are the glue
10
NSF CAREER: A Unified Architecture for Data Mining
Biomedical Literature Databases (415K US$, March
2005-Feb 2010)
11
Problem Descriptions of IR
• Descriptions
– Many biomedical literature searches are about
relationships between biological entities.
– The co-occurrence of two keywords often does not mean these two keywords are really related.
obesity [TIAB] AND hypertension [TIAB] AND hasabstract [text]
AND ("1900"[PDAT] : "2005/03/08"[PDAT])
The query used to retrieve documents addressing the interaction of obesity and
hypertension from PubMed. A ranked hit list of 6687 documents is returned. We
then took the top 100 abstracts for human relevance judgment.
Unfortunately, as expected, only 33 of them were relevant.
– Explicitly index and search documents with
relationships
12
Statistical Language Model
• Statistical language model
– It is a probabilistic mechanism for generating
text.
• Text generation
– Suppose the word is the unit of a text (e.g., a document). The text generation process is as follows:
• Choose a language model in each step.
• Generate a word according to the chosen model.
13
Language Modeling and IR
• Example:
– Document 1={(A,3), (B, 5), (C,2)}
– Document 2={(A,4), (B, 1), (C,5)}
– Query={A, B}
– Which document is more relevant to the
query?
Doc 1: 0.3*0.5=0.15
Doc 2: 0.4*0.1=0.04
Doc 1 is more relevant to the query than Doc 2
14
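A minimal Python sketch of this query-likelihood computation (using the toy counts from the slide; not part of the original deck):

```python
# Query-likelihood ranking with maximum-likelihood document models,
# using the toy counts from the slide above.

def ml_model(counts):
    """Maximum-likelihood unigram model: p(w|d) = c(w,d) / |d|."""
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def query_likelihood(query, model):
    """Probability of generating the query word by word from the model."""
    p = 1.0
    for w in query:
        p *= model.get(w, 0.0)   # unseen words get probability 0 (no smoothing)
    return p

doc1 = ml_model({"A": 3, "B": 5, "C": 2})   # p(A)=0.3, p(B)=0.5, p(C)=0.2
doc2 = ml_model({"A": 4, "B": 1, "C": 5})   # p(A)=0.4, p(B)=0.1, p(C)=0.5

print(query_likelihood(["A", "B"], doc1))   # 0.15
print(query_likelihood(["A", "B"], doc2))   # 0.04 -> Doc 1 ranks higher
```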
Why Smoothing?
• Avoid Zero Probability
– Document 1={(A,3), (B, 5), (C,2)}
– Document 2={(A,4), (B, 1), (C,5)}
– Query={A, D}
– Which document is more relevant to the
query?
Doc 1: 0.3*0=0
Doc 2: 0.4*0=0
Obviously, this result is not reasonable.
15
Why Smoothing?
• Discount High-frequency Terms: Stop words (e.g. the, a, an, you…)
frequently occur in documents. According to Maximum Likelihood
Estimate (MLE), their generative probability will be very high.
However, stop words are obviously trivial to those documents.
• Assign reasonable probability to unseen word (Data Sparsity)
– Testing words do not appear in training corpus.
– Need effective smoothing method, especially incorporating the
semantic relationship between the testing words and training
words into the model.
– Example: a document containing "auto" for the query "car" in a text retrieval task.
• With Laplacian smoothing or background smoothing, the document will not be returned for the query.
• With semantic smoothing, the document will be returned for the query.
16
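To make the contrast concrete, here is a small Python sketch of background (Jelinek-Mercer style) smoothing applied to the toy documents above; the corpus model and mixing weight are illustrative values, and semantic smoothing would additionally translate related terms such as "auto" into "car":

```python
# Sketch: background (Jelinek-Mercer) smoothing removes zero probabilities.
# The corpus model p(w|C) and the mixing weight alpha are illustrative values.

def jm_smoothed(counts, corpus_model, alpha=0.1):
    """p(w|d) = (1 - alpha) * p_ml(w|d) + alpha * p(w|C)."""
    total = sum(counts.values())
    vocab = set(counts) | set(corpus_model)
    return {w: (1 - alpha) * counts.get(w, 0) / total
               + alpha * corpus_model.get(w, 0.0)
            for w in vocab}

corpus = {"A": 0.25, "B": 0.25, "C": 0.25, "D": 0.25}  # background model p(w|C)
doc1 = jm_smoothed({"A": 3, "B": 5, "C": 2}, corpus)
doc2 = jm_smoothed({"A": 4, "B": 1, "C": 5}, corpus)

def query_likelihood(query, model):
    p = 1.0
    for w in query:
        p *= model[w]
    return p

# Query {A, D}: no longer zero for either document, so the ranking stays meaningful.
print(query_likelihood(["A", "D"], doc1))
print(query_likelihood(["A", "D"], doc2))
```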
LM and IR
• Steps:
– Estimate the word distribution for each
document, i.e., p(w|di), which is also referred
to as document language model or document
model.
– Compute the probability of generating the query under each document model.
– Rank the documents in the collection by their query-generating probabilities.
17
Language Modeling IR
Formalism
• LM: view IR as a process of word sampling from
the document. The higher the probability of generating the query, the more relevant the document is to the query (Ponte and Croft 1998)
log [ p(r | Q, D) / p(r̄ | Q, D) ]
    = log [ p(Q | D, r) / p(Q | D, r̄) ] + log [ p(r | D) / p(r̄ | D) ]
    (rank-equivalent to)  log p(Q | D, r) + log [ p(r | D) / p(r̄ | D) ]
    (rank-equivalent to)  log p(Q | D, r)
The formula is from (Lafferty and Zhai 2002)
18
Context-Sensitive Semantic
Smoothing (Our Approach)
• Definition
– Like the statistical translation model, term semantic
relationships are used for model smoothing.
– Unlike the statistical translation model, contextual and
sense information is considered
• Method
– Decompose a document into a set of context-sensitive topic signatures and then statistically translate topic signatures into individual words.
19
Topic Signatures
• Concept Pairs
– A pair of two concepts which are semantically and
syntactically related to each other
– Example: computer and mouse, hypertension and
obesity
– Extraction: Ontology-based approach (Zhou et al.
2006, SIGIR)
• Multiword Phrases
– Example: Space Program, Star War, White House
– Extraction: Xtract (Smadja 1993)
20
Translation Probability Estimate
• Method
– Use co-occurrence counts of topic signatures and individual words
– Use a mixture model to remove noise from topic-free general words
[Figure 2. Illustration of document indexing: topic signatures t1-t5 (Vt), documents D1-D4 (Vd) and words w1-w4 (Vw). Vt, Vd and Vw are the topic signature set, document set and word set, respectively.]
p(w | Dk) = (1 - α) p(w | tk) + α p(w | C)

Denote by Dk the set of documents containing the topic signature tk. The parameter α is the coefficient controlling the influence of the corpus model in the mixture model.
21
Translation Probability Estimate
• Log likelihood of generating Dk
log p(Dk | tk, C) = Σ_w c(w, Dk) log [ (1 - α) p(w | tk) + α p(w | C) ]

• EM for estimation

E-step:   p̂^(n)(w) = (1 - α) p^(n)(w | tk) / [ (1 - α) p^(n)(w | tk) + α p(w | C) ]

M-step:   p^(n+1)(w | tk) = c(w, Dk) p̂^(n)(w) / Σ_i c(wi, Dk) p̂^(n)(wi)

where c(w, Dk) is the document frequency of term w in Dk, i.e., the co-occurrence count of w and tk in the whole collection.
22
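A compact Python sketch of this EM procedure, with a made-up document set and corpus model standing in for real Medline statistics (the counts, alpha and the data are illustrative assumptions):

```python
# Sketch of the EM procedure for estimating the translation model p(w | tk)
# from the documents Dk that contain topic signature tk.  The documents and
# the corpus model below are illustrative only.
from collections import Counter

def train_translation_model(docs_with_tk, corpus_model, alpha=0.3, iters=20):
    # c(w, Dk): document frequency of w within Dk
    c = Counter()
    for doc in docs_with_tk:
        c.update(set(doc))
    total = sum(c.values())
    p = {w: cnt / total for w, cnt in c.items()}        # initialize with MLE
    for _ in range(iters):
        # E-step: probability that each word was generated by the topic model
        p_hat = {}
        for w in c:
            topic = (1 - alpha) * p.get(w, 0.0)
            p_hat[w] = topic / (topic + alpha * corpus_model.get(w, 1e-6))
        # M-step: re-estimate p(w | tk) from the expected counts
        norm = sum(c[w] * p_hat[w] for w in c)
        p = {w: c[w] * p_hat[w] / norm for w in c}
    return p

corpus = {"the": 0.05, "space": 0.001, "shuttle": 0.0005, "program": 0.002}
docs = [["space", "shuttle", "the"], ["space", "program", "the"]]
print(train_translation_model(docs, corpus))  # topical words get high p(w|tk)
```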
Contrasting Translation Example
Space:
space 0.245; shuttle 0.057; launch 0.053; flight 0.042; air 0.035; program 0.031;
center 0.030; administration 0.026; develop 0.025; like 0.023; look 0.022; world 0.020;
director 0.020; plan 0.018; release 0.017; problem 0.017; work 0.016; place 0.016;
mile 0.015; base 0.014;
Program:
program 0.193; washington 0.026; congress 0.026; administration 0.024; need 0.024;
billion 0.023; develop 0.023; bush 0.020; plan 0.020;money 0.020; problem 0.020;
provide 0.020; writer 0.018; d 0.018; help 0.018; work 0.017; president 0.017;
house 0.017; million 0.016; increase 0.016;
Space Program:
space 0.101; program 0.071; NASA 0.048; shuttle 0.043; astronaut 0.041;
launch 0.040; mission 0.038; flight 0.037; earth 0.037; moon 0.035; orbit 0.032;
satellite 0.031; Mar 0.030; explorer 0.028; station 0.028; rocket 0.027; technology 0.026;
project 0.025; science 0.023; budget 0.023;
23
Topic Signature LM
• Basic Idea
– Linearly interpolate the topic signature based translation
model with a simple language model.
– The document expansions based on context-sensitive
semantic smoothing will be very specific.
– The simple language model can capture the points the
topic signatures miss.
pbt(w | d) = (1 - λ) pb(w | d) + λ pt(w | d)
Where the translation coefficient (λ) controls the influence of the translation
component in the mixture model.
24
Topic Signature LM
• The Simple Language Model
pb(w | d) = (1 - α) pml(w | d) + α p(w | C)
• The Topic Signature Translation Model
pt(w | d) = Σ_k p(w | tk) pml(tk | d),   where   pml(tk | d) = c(tk, d) / Σ_i c(ti, d)
c(ti, d) is the frequency of topic signature ti in document d.
25
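A hedged Python sketch of the interpolated document model pbt(w|d); the translation table p(w|tk) is assumed to be pre-trained (for example by the EM procedure sketched earlier) and all numbers are illustrative:

```python
# Sketch of the interpolated topic-signature document model
#   pbt(w|d) = (1 - lambda) * pb(w|d) + lambda * pt(w|d)
# Translation probabilities p(w|tk) are assumed given; numbers are illustrative.

def simple_lm(word, doc_counts, corpus_model, beta=0.05):
    """Background-smoothed simple model pb(w|d)."""
    total = sum(doc_counts.values())
    return (1 - beta) * doc_counts.get(word, 0) / total \
        + beta * corpus_model.get(word, 1e-6)

def translation_lm(word, signature_counts, translation):
    """pt(w|d) = sum_k p(w|tk) * pml(tk|d)."""
    total = sum(signature_counts.values())
    return sum(translation.get(tk, {}).get(word, 0.0) * cnt / total
               for tk, cnt in signature_counts.items())

def topic_signature_lm(word, doc_counts, signature_counts,
                       translation, corpus_model, lam=0.3):
    return (1 - lam) * simple_lm(word, doc_counts, corpus_model) \
        + lam * translation_lm(word, signature_counts, translation)

# Toy document containing the topic signature ("space", "program")
doc_counts = {"space": 2, "program": 1, "budget": 1}
signature_counts = {("space", "program"): 1}
translation = {("space", "program"): {"nasa": 0.05, "shuttle": 0.04, "space": 0.10}}
corpus = {"space": 0.002, "nasa": 0.0004, "shuttle": 0.0003}

# "nasa" never occurs in the document, yet gets non-zero probability:
print(topic_signature_lm("nasa", doc_counts, signature_counts, translation, corpus))
```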
Text Retrieval Experiments
• Collections
– TREC Genomics Track 2004 and 2005
– Use sub-collection
– 2004: 48,753 documents
– 2005: 41,018 documents
• Measures:
– Mean Average Precision (AP), Recall
• Settings
–
–
–
–
Simple language model as the baseline
Use concept pairs as topic signatures
Background coefficient: 0.05
Pseudo-relevance feedback: top 50 documents, expand 10 terms
26
Baseline Models
Table 1. Comparison of the baseline language model to the Okapi model. The
Okapi formula is the same as the one in [10]. The number of relevant
documents for TREC04 and TREC05 are 8266 and 4585, respectively. The
asterisk indicates the initial query is weighted.
Collection | Recall (SLM) | Recall (Okapi) | Change | MAP (SLM) | MAP (Okapi) | Change
TREC04     | 6411         | 6662           | +3.9%  | 0.345     | 0.363       | +5.2%
TREC04*    | 6527         | 6704           | +2.7%  | 0.364     | 0.364       | +0.0%
TREC05     | 4084         | 4124           | +1.0%  | 0.255     | 0.250       | -2.0%
TREC05*    | 4135         | 4134           | -0.0%  | 0.260     | 0.254       | -2.3%
28
Experiment Results
Table 2. The comparison of the baseline language model (DM0) to
document smoothing model (DM2) and query smoothing model
(FM1).
Collection | Metric | DM0   | DM2 (λ=0.3) | Change | FM1 (γ=0.6) | Change
TREC04     | MAP    | 0.345 | 0.395       | +14.5% | 0.451       | +30.9%
TREC04     | Recall | 6411  | 6749        | +5.3%  | 6929        | +8.0%
TREC04*    | MAP    | 0.364 | 0.414       | +13.7% | 0.460       | +26.9%
TREC04*    | Recall | 6527  | 6905        | +5.8%  | 7039        | +7.8%
TREC05     | MAP    | 0.255 | 0.277       | +8.6%  | 0.279       | +9.4%
TREC05     | Recall | 4084  | 4167        | +2.0%  | 4227        | +3.5%
TREC05*    | MAP    | 0.260 | 0.288       | +10.8% | 0.287       | +10.4%
TREC05*    | Recall | 4135  | 4214        | +1.9%  | 4235        | +2.4%
29
Context-sensitive vs. Context-insensitive
• The context-sensitive semantic smoothing approach
performs significantly better than context-insensitive
semantic smoothing approaches.
Table 3. Comparison of the context-sensitive semantic smoothing
(DM2) to the context-insensitive semantic smoothing (DM2’) on MAP.
The rightmost column is the change of DM2 over DM2’.
Collection | DM0 (MAP) | DM2' (MAP) | Change | DM2 (MAP) | Change | DM2 vs. DM2'
TREC04     | 0.346     | 0.367      | +6.1%  | 0.395     | +14.5% | +7.6%
TREC04*    | 0.364     | 0.384      | +5.5%  | 0.414     | +13.7% | +7.8%
TREC05     | 0.255     | 0.260      | +2.0%  | 0.277     | +8.6%  | +6.5%
TREC05*    | 0.260     | 0.269      | +3.5%  | 0.288     | +10.8% | +7.1%
30
Relevant Publications
1. Hu X., Xu X., Mining Novel Connections from Online Biomedical Databases Using Semantic Query Expansion and Semantic-Relationship Pruning, International Journal of Web and Grid Service, 1(2), 2005, pp. 222-239
2. Zhou X., Hu X., Zhang X., Topic Signature Language Models for Ad-hoc Retrieval, IEEE Transactions on Knowledge and Data Engineering (IEEE TKDE), September 2007
3. Song M., Song I-Y., Hu X., Allen B., Integration of Association Rules and Ontology for Semantic-based Query Expansion, Journal of Data & Knowledge Engineering
4. Zhou X., Hu X., Zhang X., Lin X., Song I-Y., Context-Sensitive Semantic Smoothing for the Language Modeling Approach to Genomic IR, Proceedings of the 29th Annual International ACM SIGIR Conference on Research & Development in Information Retrieval (SIGIR 2006)
5. Zhou X., Zhang X., Hu X., Semantic Smoothing of Document Models for Agglomerative Clustering, Twentieth International Joint Conference on Artificial Intelligence (IJCAI 07), Hyderabad, India, Jan 6-12, 2007
6. Zhang X., Hu X., Zhou X., A Comparative Evaluation of Different Link Types on Enhancing Document Clustering, 31st Annual International ACM SIGIR Conference on Research & Development in Information Retrieval (SIGIR 2008)
31
SPIE:
Scalable and Portable Information Extraction
• The Scalable and Portable Information Extraction system (SPIE) is influenced by the idea of DIPRE introduced by Brin [Brin, 1998].
• The goal is to develop an efficient and portable information extraction system that automatically extracts various biological relationships from the online biomedical literature with little or no human intervention.
32
SPIE:
Scalable and Portable Information Extraction
• The main ideas of SPIE:
– Automatic query generation and query
expansion for effective search and retrieval
from text databases
– Dual reinforcement information extraction for
pattern generation and tuple extraction
– Scales well to huge collections of text files because it does not need to scan every text file
33
SPIE:
Scalable and Portable Information Extraction
[SPIE architecture diagram. Components: initial seed tuples; automatic query generation & document categorization; query list; search engine over the biomedical literature DB; set of retrieved documents; automatic categorization of documents; data mining to generate rules from the categorized documents; finding occurrences of seed tuples; extracting text segments of interest; mutual reinforcement of pattern generation and instance extraction; extraction patterns generated and stored in the pattern base; new instance extraction based on pattern matching; instance relation holding the initial seed tuples and the tuples generated from IE.]
34
SPIE
(Scalable & Portable IE)
SPIE takes the following steps:
1.Starting with a set of user-provided seed tuples,
SPIE retrieves a sample of documents from the
biomedical literature library.
– the set of seed tuples can be quite small; normally 5 to 10 are enough
– simple queries are constructed from the attribute values of the initial seed tuples to retrieve document samples of a pre-defined size from the search engine
35
SPIE
(Scalable & Portable IE)
2. The tuple set induces a binary partition (a
split) on the documents:
– those that contain tuples or those that do not
contain any tuple from the relation
– The documents are thus labeled
automatically as either positive or negative
examples, respectively.
– The positive examples represent the
documents that contain at least one tuple.
– The negative examples represent documents
that contain no tuples.
36
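A minimal sketch of this automatic labeling step (the seed tuples and the matching rule are illustrative simplifications):

```python
# Sketch of the automatic labeling step: a document is a positive example if
# it mentions both entities of at least one seed tuple, otherwise negative.

seed_tuples = [("HP1", "HDAC4"), ("BRCA1", "RAD51")]   # illustrative PPI seeds

def label(document_text):
    text = document_text.lower()
    has_tuple = any(a.lower() in text and b.lower() in text
                    for a, b in seed_tuples)
    return "positive" if has_tuple else "negative"

print(label("HP1 interacts with HDAC4 in the two-hybrid system."))  # positive
print(label("We measured blood pressure in obese patients."))        # negative
```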
Query Generation/Expansion for
Document Retrieval
STEP 3 consists of two stages
– converting the positive and negative examples
into an appropriate representation for training
– running the data mining algorithms on the training examples to generate a set of rules, and then converting the rules into an ordered list of queries expected to retrieve new useful documents
37
Query Generation/Expansion for
Document Retrieval
38
Query Generation/Expansion for
Document Retrieval
• In STEP 3, three data mining algorithms are used for rule generation: Ripple, CBA & DB-Deci
• The generated rules are ranked based on the Laplace measure
• The top 10% of rules are converted into a query list
Positive IF WORDS ~ protein AND binding
Positive IF WORDS ~ cell and function
Query 1: protein AND binding
Query 2: cell AND function
39
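A small Python sketch of this rule-ranking and query-generation step; the rule representation and coverage counts are assumptions for illustration, not the actual output of Ripple, CBA or DB-Deci:

```python
# Sketch: rank classification rules by the Laplace accuracy measure and turn
# the top fraction into boolean queries.  The rule representation (a set of
# words plus positive/negative coverage counts) is assumed for illustration.

def laplace(pos_covered, neg_covered, num_classes=2):
    """Laplace accuracy estimate: (p + 1) / (p + n + num_classes)."""
    return (pos_covered + 1) / (pos_covered + neg_covered + num_classes)

rules = [
    {"words": ["protein", "binding"], "pos": 40, "neg": 5},
    {"words": ["cell", "function"],   "pos": 25, "neg": 8},
    {"words": ["patient", "trial"],   "pos": 3,  "neg": 30},
]

ranked = sorted(rules, key=lambda r: laplace(r["pos"], r["neg"]), reverse=True)
top = ranked[: max(1, len(ranked) // 10)]        # keep the top 10% of rules

queries = [" AND ".join(r["words"]) for r in top]
print(queries)   # e.g. ['protein AND binding']
```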
Pattern Generation
• A pattern is
– a 5-tuple <prefix, entity_tag1, infix, entity_tag2, suffix>
– prefix, infix, and suffix are vectors associating weights with
terms.
– prefix is the part of sentence before entity1,
– infix is the part of sentence between entity1 and entity2
– suffix is the part of sentence after entity2.
"HP1 interacts with HDAC4 in the two-hybrid system…" →
{ "", <Protein>, "interacts with", <Protein>, "" }
40
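A simplified Python sketch of the 5-tuple pattern and of matching it against new sentences; real SPIE patterns carry term weights in each slot, whereas this sketch uses plain strings and a hypothetical protein-name list:

```python
# Sketch of the 5-tuple extraction pattern <prefix, tag1, infix, tag2, suffix>
# and a simplified exact-string matcher.
import re
from dataclasses import dataclass

@dataclass
class Pattern:
    prefix: str
    entity_tag1: str
    infix: str
    entity_tag2: str
    suffix: str

# Pattern learned from "HP1 interacts with HDAC4 in the two-hybrid system ..."
p = Pattern("", "<Protein>", "interacts with", "<Protein>", "")

def match(pattern, sentence, protein_names):
    """Return (protein1, protein2) pairs the pattern extracts from a sentence."""
    names = "|".join(map(re.escape, protein_names))
    regex = rf"({names})\s+{re.escape(pattern.infix)}\s+({names})"
    return re.findall(regex, sentence)

sentence = "Our assay shows that BRCA1 interacts with RAD51 in vivo."
print(match(p, sentence, ["BRCA1", "RAD51", "HP1", "HDAC4"]))
# [('BRCA1', 'RAD51')]
```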
Pattern Matching
41
Experiment
• Keyword-based vs. SPIE
• Keyword-based experiment
– Input:
o around 7,000 protein names (expanded from 1,600 protein names using protein synonyms)
o 23 keywords
o 1.5 million abstracts (obtained by searching those keywords in PubMed)
• SPIE experiment
– Input:
o only 10 protein-protein interaction (PPI) pairs
– Maximum number of documents used in each iteration: 10k
– Starting with 50k documents and stopping at 500k documents
42
Experiment
Keyword based
SPIE
43
Experiment
Experiment    | Abstracts used | # of distinct PPIs
Keyword-based | 1,444,002      | 9,980
SPIE          | 500k           | 9,483

• SPIE extracts nearly as many distinct PPIs as the keyword-based approach (9,483 vs. 9,980) while processing only about one third of the abstracts, a clear efficiency advantage for SPIE over the keyword-based approach.
44
Chromatin Protein Network
45
Biomolecular Network
Analysis
• Biomolecular networks dynamically respond
to stimuli and implement cellular function
• Understanding these dynamic changes is
the key challenge for cell biologists
• Biomolecular networks grow in size and complexity, so computer simulation is an essential tool for understanding biomolecular network models
• A sub-network executes a specific cellular function and deserves to be studied in its own right
46
Biomolecular Network
Analysis
• Our method consists of two steps.
• First, a novel scale-free network clustering
approach is applied to the biomolecular
network to obtain various sub-networks.
• Second, computational models are generated
for the sub-network and simulated to predict
their behavior in the cellular context.
• We discuss and evaluate three advanced
computational models: state-space model,
probabilistic Boolean network model, and
fuzzy logic model.
47
Mining the large-scale
biomolecular network (1)
Main Algorithm SNBuilder (G, s, f, d)
1: G(V, E) is the input graph with vertex set
V and edge set E.
2: s is the seed vertex; f is the affinity
threshold; d is the distance threshold.
3: N ← {adjacency list of s} ∪ {s}
4: C ← FindCore(N)
5: C’ ← ExpandCore(C, f, d)
6: return C’
48
Mining a large-scale
biomolecular network (2)
Sub-Algorithm FindCore (N)
8:  for each v ∈ N
9:      calculate k_v^in(N)
10: end for
11: Kmin ← min { k_v^in(N) : v ∈ N }
12: Kmax ← max { k_v^in(N) : v ∈ N }
13: if Kmin = Kmax, or k_i^in(N) = k_j^in(N) for all i, j ∈ N with i, j ≠ s and i ≠ j, then return N
14: else return FindCore(N − {v : k_v^in(N) = Kmin})
49
Mining a large-scale
biomolecular network (3)
Sub-Algorithm ExpandCore(C, f, d)
16: D ← ∪_{(v,w)∈E, v∈C, w∉C} {v, w}
17: C' ← C
18: for each t ∈ D with t ∉ C and distance(t, s) <= d
19:     calculate k_t^in(D)
20:     calculate k_t^out(D)
21:     if k_t^in(D) > k_t^out(D) or k_t^in(D)/|D| > f then C' ← C' ∪ {t}
22: end for
23: if C' = C then return C
50
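A simplified, runnable Python sketch of the SNBuilder procedure above (FindCore plus ExpandCore); the tie-breaking details and the toy graph are assumptions made for illustration:

```python
# Simplified sketch of SNBuilder seed-expansion clustering.
# k_in counts a vertex's neighbours inside a group; k_out counts those outside.
from collections import deque

def k_in(graph, v, group):
    return sum(1 for u in graph[v] if u in group)

def k_out(graph, v, group):
    return sum(1 for u in graph[v] if u not in group)

def distance(graph, s, t):
    """Shortest-path length from s to t via BFS (inf if unreachable)."""
    seen, frontier = {s}, deque([(s, 0)])
    while frontier:
        v, dv = frontier.popleft()
        if v == t:
            return dv
        for u in graph[v]:
            if u not in seen:
                seen.add(u)
                frontier.append((u, dv + 1))
    return float("inf")

def find_core(graph, members, seed):
    """Drop the lowest in-degree non-seed vertex until in-degrees even out."""
    members = set(members)
    while True:
        degs = {v: k_in(graph, v, members) for v in members}
        if min(degs.values()) == max(degs.values()):
            return members
        candidates = [v for v in members
                      if v != seed and degs[v] == min(degs.values())]
        if not candidates:
            return members
        members.discard(candidates[0])

def expand_core(graph, core, seed, f, d):
    """Add boundary vertices that are tightly connected to the core."""
    while True:
        boundary = {u for v in core for u in graph[v]
                    if u not in core and distance(graph, seed, u) <= d}
        added = {t for t in boundary
                 if k_in(graph, t, core) > k_out(graph, t, core)
                 or k_in(graph, t, core) / max(len(core), 1) > f}
        if not added:
            return core
        core = core | added

def sn_builder(graph, seed, f=0.5, d=2):
    neighbourhood = set(graph[seed]) | {seed}
    return expand_core(graph, find_core(graph, neighbourhood, seed), seed, f, d)

graph = {"A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"}, "D": {"C"}}
print(sn_builder(graph, "A"))   # a small sub-network grown from seed "A"
```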
Experiment Results
Promising Protein-Protein Interaction clusters
51
Experiment Results
Fig 1
A sub-network obtained using the algorithm
52
State-space model for simulation
(1)
A gene regulatory network
[Figure: gene regulatory network model. External inputs and internal variables Z1 … Zp evolve through dynamic equations; observation equations map the internal variables to the gene expression levels x1 … xn.]
53
State-space model for simulation
(2)
z(t + 1) = A z(t) + B u(t) + n1(t)
x(t) = C z(t) + n2(t)

• x: gene expression data
• z: internal variables (promoters)
• A: state transition matrix
• B: control (input) matrix
• C: transformation matrix
• n1(t) and n2(t): noise terms
54
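A short Python/NumPy sketch of simulating this state-space model; the matrices A, B and C are random stand-ins for the parameters that would be fitted from expression data:

```python
# Sketch of simulating the state-space model
#   z(t+1) = A z(t) + B u(t) + n1(t),   x(t) = C z(t) + n2(t)
# with small random matrices standing in for the fitted parameters.
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_internal, n_inputs, T = 16, 9, 1, 20

A = 0.9 * rng.standard_normal((n_internal, n_internal)) / np.sqrt(n_internal)
B = rng.standard_normal((n_internal, n_inputs))
C = rng.standard_normal((n_genes, n_internal))

z = rng.standard_normal(n_internal)               # initial internal state
profiles = []
for t in range(T):
    u = np.zeros(n_inputs)                        # external input (none here)
    n1 = 0.01 * rng.standard_normal(n_internal)   # process noise
    n2 = 0.01 * rng.standard_normal(n_genes)      # observation noise
    x = C @ z + n2                                # observed expression levels
    profiles.append(x)
    z = A @ z + B @ u + n1                        # advance the internal state

profiles = np.array(profiles)     # T x n_genes simulated expression matrix
print(profiles.shape)             # (20, 16)
```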
State-space model for simulation
(3)
• Applying the state-space modeling method to
gene expression data of 16 genes in Figure 1,
we obtained an inferred gene regulatory
network with nine internal variables
• The analysis shows that the inferred network
is stable, robust, and periodic
• Using the model constructed from the training dataset Thy-Thy 3 to predict the expression profiles in the testing dataset Thy-Noc gives the result shown in Figure 2
55
State-space model for simulation
(4)
[Figure omitted: four panels (A-D) of expression level vs. time, 0 to 20]
Fig 2 Comparison of experimental (solid lines) and predicted (dotted lines) gene expression profiles for DMFT (A), F2 (B), RRM2 (C) and TYR (D)
56
Fuzzy logic model for simulation
(1)
• The fuzzy biomolecular network model is a set of rule sets, one for each node (in this case, gene) in the network, governing the response to each fuzzy state of the input genes to that node (the output gene).
• Fuzzy rule sets are generated for the genes in the sub-network in Figure 1.
• Using the model constructed from the training dataset Thy-Thy 3 to predict the expression profiles in the testing dataset Thy-Noc gives the results shown in Figure 3
57
Fuzzy logic model for simulation
(2)
[Figure omitted: four panels (A-D) of log(expression ratio) vs. time, 0 to 40]
Fig 3 Best fit rule on training set "Thy-Thy 3" predicting gene expression on the test data set (solid line) compared to actual data from the test set "Thy-Noc" (dashed line) for CDK2 (A), BRCA1 (B), EP300 (C), and CDK4 (D)
58
Probabilistic Boolean Networks
for simulation (1)
• A probabilistic Boolean network (PBN) is a Markov chain capturing transition probabilities among different gene expression states.
• We construct PBNs from the given microarray data set "Thy-Thy 3" and use the data set "Thy-Noc" to test the constructed PBNs
• The results are shown in Tables 1 through 3
59
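As an illustration of the idea (not the actual networks inferred from the Thy-Thy 3 data), a tiny probabilistic Boolean network can be simulated as follows; the predictor functions and their probabilities are invented:

```python
# Sketch of a tiny probabilistic Boolean network: each gene has several
# candidate Boolean predictor functions, one of which is chosen at random
# (according to its probability) at every time step.
import random

# Each gene: list of (probability, predictor) pairs; predictors read the
# current state dict and return the gene's next Boolean value.
pbn = {
    "CDK2":  [(0.7, lambda s: s["MYC"] and not s["BRCA1"]),
              (0.3, lambda s: s["MYC"])],
    "MYC":   [(1.0, lambda s: not s["CDK2"])],
    "BRCA1": [(0.6, lambda s: s["CDK2"]),
              (0.4, lambda s: s["BRCA1"])],
}

def step(state):
    nxt = {}
    for gene, predictors in pbn.items():
        r, acc = random.random(), 0.0
        for prob, f in predictors:
            acc += prob
            if r <= acc:
                nxt[gene] = f(state)
                break
    return nxt

state = {"CDK2": True, "MYC": False, "BRCA1": True}
for t in range(5):
    print(t, state)
    state = step(state)
```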
Probabilistic Boolean Networks
for simulation (2)
Gene    | 2 states | Gene  | 2 states
DMTF    | 66.67    | PTEN  | 72.22
BRCA1   | 55.56    | RRM2  | 77.78
HIFX    | 77.78    | PLAT  | 50.00
HE      | 22.22    | TYR   | 55.56
PPP2R4  | 55.56    | CAD   | 66.67
MYC     | 44.44    | CDK2  | 50.00
NR4A2   | 72.22    | CDK4  | 66.67
F2      | 61.11    | EP300 | 72.22

Table 1: Prediction accuracy based on the given genetic network using 2-state microarray data.

Gene    | 3 states | Gene  | 3 states
DMTF    | 50.00    | PTEN  | 44.44
BRCA1   | 55.56    | RRM2  | 72.22
HIFX    | 66.67    | PLAT  | 66.67
HE      | 16.67    | TYR   | 50.00
PPP2R4  | 61.11    | CAD   | 61.11
MYC     | 55.56    | CDK2  | 38.89
NR4A2   | 55.56    | CDK4  | 66.67
F2      | 61.11    | EP300 | 61.11

Table 2: Prediction accuracy based on the given genetic network using 3-state microarray data.
60
Probabilistic Boolean Networks
for simulation (3)
• To improve the prediction accuracy for the HE, MYC and CDK2 genes, we use the developed multivariate Markov chain to model the microarray data set. The results are shown in Table 3

Gene      | HE            | MYC           | CDK2
2 states  | 55.56 (22.22) | 61.11 (44.44) | 66.67 (50.00)
3 states  | 27.78 (16.67) | 55.56 (55.56) | 38.89 (38.89)

Table 3: Prediction accuracy based on the input genes estimated from the multivariate Markov chain model (values in parentheses are the corresponding accuracies from Tables 1 and 2).
61
Conclusions
• We present a new method for mining and dynamic simulation of sub-networks from large biomolecular networks.
• The presented method applies a scale-free network clustering approach to the biomolecular network to obtain biologically functional sub-networks.
• Three computational models (state-space model, probabilistic Boolean network, and fuzzy logic model) are employed to simulate the sub-networks, using time-series gene expression data of the human cell cycle.
• The results indicate that our presented method is promising for mining and simulation of sub-networks.
62
Relevant Publications
1. Hu X., Wu D., Data Mining and Predictive Modeling of Biomolecular Network from Biomedical Literature Databases, IEEE/ACM Transactions on Computational Biology and Bioinformatics, April-June 2007, pp. 251-263
2. Hu X., Sokhansanj B., Wu D., Tang Y., A Novel Approach for Mining and Dynamic Fuzzy Simulation of Biomolecular Network, IEEE Transactions on Fuzzy Systems
3. Hu X., Wu F.X., Ng M., Sokhansanj B., Mining and Dynamic Simulation of Sub-Networks from Large Biomolecular Networks, 2007 International Conference on Artificial Intelligence, June 25-28, Las Vegas, USA (Best Paper Award, out of 500 submissions)
4. Hu X., Yoo I., Song I-Y., Song M., Han J., Lechner M., Extracting and Mining Protein-Protein Interaction Network from Biomedical Literature, Proceedings of the 2004 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology (IEEE CIBCB 2004), Oct. 7-8, 2004, San Diego, USA (Best Paper Award), pp. 244-251
5. Tang Y.C., Zhang Y-Q., Huang Z., Hu X., and Zhao Y., Recursive Fuzzy Granulation for Gene Subsets Extraction and Cancer Classification, accepted for publication in IEEE Transactions on Information Technology in Biomedicine
63
Dragon Toolkit
• A software package designed for Language Modeling, Information Retrieval and Text Mining
• Free download: http://www.ischool.drexel.edu/dmbio/dragontool/default.asp
• 500 Java libraries for NLP, search engines and entity extraction; one of the most popular software packages for Information Retrieval, NLP, etc.
• More than 1500 research groups in the world have downloaded it since Jul 2007
64
Call for Papers
International Journal of Data Mining and
Bioinformatics (IJDMB)
Editor-in-Chief: Xiaohua Hu
Inaugural Issue: July 2006
SCI Indexed: Oct, 2007
65
Call For Participation
2008 IEEE International Conference on
Bioinformatics and Biomedicine (BIBM 08)
Philadelphia, USA, Nov 3-5, 2008
IEEE BIBM Steering Committee Chair:
Xiaohua Hu
66
My Ph.D. Students and Joint Ph.D.
Students with Chinese Universities
1. Illhoi Yoo (graduated in 2006, tenure-track assistant professor at Univ. of Missouri-Columbia)
2. Xiaodan Zhang (4th year Ph.D. student, Text and Web Data Mining, Digital Library, Bioterrorism)
3. Daniel Wu (5th year Ph.D. student, Data Mining and Biomolecular Network Analysis)
4. Xuheng Xu (4th year Ph.D. student, Semantic-based Query Optimization and Intelligent Searching)
5. Davis Zhou (5th year Ph.D. student, Semantic-based Information Extraction and Retrieval)
6. Palakorn Achananuparp (4th year Ph.D. student, Text Mining)
7. Deima Elnatour (4th year Ph.D. student, Semantic-based Text Mining)
8. Guisu Li (2nd year Ph.D. student, Healthcare Informatics)
9. Zhong Huang (2nd year Ph.D. student, Bioinformatics, Computational Biology)
10. Xin Chen (fresh Ph.D. student, USTC)
11. Xiaoshi Yin (joint Ph.D. student with Prof. Zhoujun Li from BAUU)
12. Min Xu (joint Ph.D. student with Prof. Shuigeng Zhou from Fudan University)
13. Yaoyu Zuo (joint Ph.D. student with Prof. Ying Tong from Zhongshan University)
67
Acknowledgements
• PI: NSF CAREER: A Unified Architecture for Data Mining Large Biomedical Literature Databases (NSF CAREER IIS 0448023, $415K, 03/15/2005-02/28/2010)
• PI: High Performance Rough Sets Data Analysis in Data Mining (NSF CCF 0514679, $102K, 08/01/2005-07/31/2008)
• Co-PI: The Drexel University GAANN Fellowship Program: Educating Renaissance Engineers (US Dept. of Education, 9/1/2006 to 8/31/2009, around $700K)
• Co-PI: Penn State Cancer Education Network Evaluation (PA Dept. of Health, 04/25/2006-07/31/2010, $1.2M)
• Co-PI: Center for Public Health Readiness and Communication (PA Dept. of Health, 08/01/2004-08/31/2007, $1.5M)
• Co-PI: Origin and Evolution of Genomic Instability in Breast Cancer (PA Dept. of Health, $100K, 05/01/2004-04/30/2005)
• Co-PI: Systems Biology Approach to Understanding Protein-Protein Interactions (PA Dept. of Health, $100K, 05/01/2004-04/30/2005)
68
Thanks for your
attention
Any comments
or questions ?
69