Innovative Database Models and Advanced Tools in Bioinformatics

Download Report

Transcript Innovative Database Models and Advanced Tools in Bioinformatics

University of Nebraska at Omaha
Innovative Database Models and Advanced
Tools in Bioinformatics
Hesham H. Ali
UNO Bioinformatics Research Group
Department of Computer Science
College of Information Science and Technology
Key Challenges Facing Bioinformatics
Research
• Significant gaps between tool developers and tool
users
– Different objectives
– Different funding agencies
– Different academic cultures
• Significant problems with available Biological
Data
– Archival based
– Lack of structure
Source: ncbi.nih.gov
Problems with Current Biological Data
• The availability of large biological data and the
increasing rate in producing new data, available in
public data banks or via microarray data
• The increasing pressure to maximize the use of the
available data, particularly to impact key related
industries (biotech companies, biotech drugs)
• The large degree of heterogeneity of the available
data in terms of quality
UNO Bioinformatics Research Group
• Group Triangle
– Research motivated by real biological problems
– Innovative Database Models
– Advanced tools
Biological Questions Addresses by our Group
• Molecular diagnosis - Identification
– Sequence based id
– Enzyme (cutting order) based id
– Instrumentation (Mass Spec, WAVE) based id
• Basic Molecular Biology - Gene regulation
– Microarray Analysis
– Motif discovery/searching
• Epidemiology and Clinical Research
– Patient tracking system
– Clinical expert system
Bioinformatics Solutions to These Problems
• Develop new inventive database models
– Custom database for specific domains
– Centralized Structured integrated data
• Develop innovative Bioinformatics tools
– Clustering algorithms
– Advanced motif finding approaches
Database Models
• Customized (Private) Solution:
Custom based Data Base Model
High degree of quality and consistency
• Centralized (Public) Solution:
New Curated Integrated and Structured
DataBase Model
Model One: Custom Databases
•
•
•
•
•
Allowing researchers to create custom sets of genetic
data suited to their specific needs.
Allowing researchers to control the quality of genetic
data in their custom data sets through fine-tuning
parameters.
Searching data using optimal alignment algorithms,
rather than using heuristic methods.
Giving researchers/clinicians the ability to formulate
sequence identification concepts and test their ideas
against a validated database
Incorporating information from GenBank if needed
The Sequence Identification Problem
• Identification of organisms using obtained sequences is a
very important problem
• Relying on wet lab methods only is not enough
• Employing identification algorithms using signature motifs
to complement the experimental approaches
• Currently, no robust software tool is available for aiding
researchers and clinicians in the identification process
• Such a tool would have to utilize biological knowledge and
databases to identify sequences
• Issues related to size of data and quality of data are suspect
and would need to be dealt with
Nebraska gets its very own organism
While trying to pinpoint the cause of
a lung infection in local cancer
patients, they discovered a previously
unknown micro-organism. And
they've named it "mycobacterium
nebraskense," after the Cornhusker
state.
It was discovered few weeks ago
using Mycoalign: A Bioinformatics
program developed at PKI
Source: Omaha World Herald, March 21, 2005
Model Two: Centralized Database - the
Integrated Model
A new integrated model based on:
– Organized and curated database
– True non-redundancy by having one record for each
polymorphic set with pointer to the rest of the set if needed
– Allowing advanced queries
– Being user-friendly and employing true automation
– Employing various algorithms with different levels of
accuracy and speed for conducting homology searches.
The Clean Gene Package
• A set of integrated database and alignment tools:
–
–
–
–
–
–
–
Edited and curated
Web based
Of manageable size
Based on hierarchical database model
Utilize various alignment algorithms
Allows advanced automated queries
Allows fast and accurate searches
The Key Challenges
• The New Structured relational database model
• Identification of equivalence classes of records
(polymorphic sets)
• Identification of a good representative for each set
• Curation and classification
• Accurate annotation
• Advanced data mining tools
• A user-friendly interface that employs true automation for
interfacing with the database
Tool I: Clustering Biological Data
• Clustering is a fundamental technique in finding a
structure in a collection of unlabeled data.
• Basically, clustering is the process of organizing
objects into groups whose members are similar in
some way.
• A good Clustering tool is a key component in
analyzing microarray data
Message Passing Clustering (MPC)
• Inspired by real-world situations: elements with similar
attributes cluster together simultaneously
• Advantages:
– Easy to understand and use.
– Taking the advantage of communication among data objects, MPC is
able to balance the global and local structure and be performed in
parallel.
– “Message” has flexible structure which allows further development to
fit to different research interests.
– We have extended the basic MPC to
• Weighted MPC
• Stochastic MPC
• Semi-supervised MPC
Basic MPC
• The phylogenetic trees of Mycobacterium (9 species, 34 strains),
constructed by the Neighbor Joining and MPC method.
3M.intracellulareMac-D
M.intracellulare ATCC
4M.intracellulareMac-J
M.intracellulare S 348
M.intracellulare S 348
1M.intracellulareMin-A
3M.chelonaeMch-C
M.chelonae DSM
35770 *
*
*
43276
35752
19536
1M.chelonaeMch-A
M.chelonae ATCC
M.chelonae
M.chelonae ATCC
2M.chelonaeMch-B
M.xenopi S 88
1M.xenopiI/Mxe-A
M.xenopi ATCC 19250
3M.xenopiIII/Mxe
M.xenopi S 91
2M.xenopiII/Mxe-B
M.xenopi
1M.kansasiiMka-A
M.kansasii ATCC
12478 #
14470
2M.gordonaeMgo-B
M.gordonae ATCC 35756
4M.gordonaeMgo-D
M.gordonae Bo 10681/99
M.gordonae Bo 11340/99
3M.gordonaeMgo-C
M.gordonae Bo 9411/99
5M.gordonaeMgo-E
1M.gordonaeMgo-A
M.gordonae ATCC
M.gordonae
2M.kansasiiMka-B
M.kansasii S 221
#
#
M.kansasii S 536 #
3M.kansasiiMka-C
M.kansasii
DSM
44431
#
5M.kansasiiMka-F
M.kansasii S 233
4M.kansasiiMka-D
3M.peregrinumMpe
M.peregrinum S
254 ^
ATCC 14467 ^
M.fortiutum ATCC 49403 $
1M.fortiutumMfo
M.peregrinum
1M.peregrinum
2M.peregrinumMpe
M.peregrinum ATCC
700686 ^
2M.flavescensMfla-B
M.flavescens DSM 43531
1M.flavescensMfla-A ATCC14474
M.flavescens ATCC 14474
M.flavescens
3M.flavescensMfla-A
M.flavescens ATCC 23033
3M.terraeIII
M.terrae S 281
M.terrae ATCC 15755
M.terrae
1M.terraeI
M.terrae DSM 43541
2M.terraeII
4M.fortuitumMfo
M.fortiutum ATCC
6841 $
49404 $
M.fortiutum ATCC 43266 $
3M.fortuitumMfo
2M.fortuitumMfo
M.fortiutum ATCC
M.intracellulare ATCC 35847 *
2M.intracellulareMAC-E
5M.intracellulareMac-L
M.intracellulare S 350
a. NJ
*
M.xenopi S 88
M.xenopi S 91
M.xenopi ATCC 19250
M.kansasii S 233
M.kansasii DSM 44431
M.kansasii S 536
M.kansasii S 221
M.kansasii ATCC 12478
M.intracellulare S 350
M.intracellulare ATCC 35847
M.intracellulare S 348
M.intracellulare ATCC 35770
M.intracellulare ATCC 13950
M.gordonae Bo 10681/99
M.gordonae Bo 9411/99
M.gordonae Bo 11340/99
M.gordonae ATCC 35756
M.gordonae ATCC 14470
M.terrae S 281
M.terrae DSM 43541
M.terrae ATCC 15755
M.peregrinum ATCC 700686
M.peregrinum S 254
M.peregrinum ATCC 14467
M.fortuitum ATCC 6841
M.fortuitum ATCC 43266
M.fortuitum ATCC 49404
M.fortiutum ATCC 49403
M.flavescens DSM 43531
M.flavescens ATCC 23033
M.flavescens ATCC 14474
M.chelonae DSM 43276
M.chelonae ATCC 19536
M.chelonae ATCC 35752
b. MPC
M.xenopi
M.kansasii
M.intracellulare
M.gordonae
M.terrae
M.peregrinum
M.fortuitum
M.flavescens
M.chelonae
Weighted MPC (WMPC)—
with Adaptive Feature Scaling
• Add weight associated with each clusterfeature pair. A single feature have multiple
weights in different clusters and, in one cluster,
all features may have different weights.
• Update the weights during the clustering
process. If on some dimension, the similarity
between two going-to-merge clusters is high
(/low), then we increase (/decrease) the weight
on that dimension in the newly merged cluster.
 x1   x11
x   x
21
x   2  
  
  
 x n   xn1
 w1   w11
w  w
21
w   2  
  
  
 w k   wk 1
x12
x22
xn 2
w12
w22
wk 2
x1 p 
x2 p 


xnp 
w1 p 
w2 p 


wkp 
• Test WMPC on Colon Cancer data (2000 genes in 40 tumor and 22
normal samples), giving higher classification rate.
• Two benefits:
– Strengthen the signal features while reducing the noise features, so making
clustering results more accurate.
– More importantly, reveal the contribution of the features (genes) to the clusters
(samples), so that identify the set of genes responsible for certain diseases.
Stochastic MPC (SMPC)
Based on Kernel Functions
Chance to
merge?
Tie ?
Target
Object
a
0
b
d
e
c
Probability Density Estimates Based on Little Gaussian Kernel Functions
Kick
out ?
f
distance
Kernel Density Estimates Using Gaussian Kernels
Semi-supervised MPC
• Clustering methods are considered unsupervised, meaning
that the reduction is derived solely from the data rather
than reflecting any previous knowledge.
• Classification methods are considered supervised, because
in the training phase, samples classes are already known,
and we classify the objects into known groups.
• Between clustering and classification: Unlabeled data with
prior knowledge, such as constraints and hypotheses.
• The goal of semi-supervised clustering is to guide the
clustering, using the prior knowledge, to get better
partitions.
Semi-supervised MPC
Instance-level Constraints
• Colon Cancer data (2000 genes in 40 tumor and 22 normal samples).
• We cluster samples with genes as features. Since the samples (instances)
labels (constrains) are known, it is call instance-level constraints.
1
0.9
Accuracy (Rand index)
0.8
0.7
0.6
0.5
0.4
0.3
0.2
0.1
OP
IP
0
0
5
10
15
20
25
30
35
40
45
50
Number of samples with an initial label
55
60
We want to see how well our method
could separate the normal and tumor
tissues based on different numbers of
known labels for the samples as prior
knowledge.
OP: Output partition after clustering.
IP: Input constraints presented before
clustering.
Combining the power of clustering with
background information achieves better
performance than either in isolation.
Semi-supervised MPC
Attribute-level Constraints
Cluster
t-statistic
Cluster
Size
1
-5.73831
22
2
-5.11799
19
T57079, T60860, R38758, R55828, H55759, X58401, H63361, R56207, U25435, R09138, H82741, H16096, H64576, L34059, U12140, H7834 6, H62177, M55422,
H89481, U12134, H11460, R28608
R00254, H41147, R42765, T51139, D21205, H70635, H69819, H71122, H81802, R46731, M29550, H14607, R88749, J00146, H18451, M9685 9, X69115, X51435,
H06189,
3
13.9581
17
L12723, M81637, D59253, X65873, M77698, T84049, M88108, R56401, H09719, X74330, D26018, D26067, R56630, X13482, H16991, T8674 9, H92195
4
9.81885
14
R39531, R21901, J02645, R34876, H72234, T92259, H47107, T65740, Z15115, D43951, L19437, X82103, U29175, T65438
5
-8.42722
42
M16029*, M16029*, D38549, U03100, L24038, J02906, X93499, M73481, T71207, R73129, X56253, M81758, R27017, U18934, X17651, H47 650, H89688, H23135,
R39130, H82631, H45526*, H45526*, D90188, L06895, L34774, T58756, L11370, U34252, H78063, D21239, M96824, M24470, T77446, R53 612, R38513*,
R38513*, R44677, M98045, T88712, R54317, M77477, H11054
6
14.9275
33
R42127, T90350, L25941, T65790, R22779, M90516, R11485, X68194, X63629, J05032, M34175, X74795, U34074, M58050, H20512, U1829 9, H80114, R67987,
R56399, U10324, R56443, X17025, R40717*, R40717*, D13630, L24203, T96873, X53586, X73478, X14618, M55543, H42884, X54101
7
10.3472
12
D13315, T70062, U30825, T89115, D26600, T60437, D50063, R20804, R09479, R42837, D14658, H87473
8
-9.94085
14
T61333, M28882, M69066, M14539, T67406, M37721, T79831, M69135, M36634, U25138, R48303, U31525, T51539, H21042
9
8.59304
11
U22055, L41559, M88279, R37114, R96357, U29607, H82719, H04802, M22632, R27813, R60195
10
5.97321
29
H48027, M14200, T52642, R85464, T52343, T72503, T49732, T95048, T47584, R21547, L28010, M21339, T87527, T61338, T79813, L3684 4, M34192, M69238,
L26405, H06970, M84326, T70063, X66363, H85878, X62153, T67905, T57872, Z48950, R62425
11
-7.47053
16
T93284, H80975, R73052, X05610, X79683, X55187, U30827, T64974, D31887, M92843, R67358, R46753, X68277, H50623, H15813, X5134 5
12
12.6072
12
T63591, T63370, T53412, T53396, T63133, R44884, R84411, X12671, R08183, M22382, D63874, D21261
13
5.07119
37
H46728, T57686, T68848, M11354, T69026, M14630, T72879, T47144, T61627, X61971, M94345, R42570, H13194, T65580, H15542, J0307 7, D30655
14
-5.35587
16
T61661, T62220, T96832, H80240, H86060, T63508, H24754, T63499, M11799, L28809, T63484, M57710, M33680, U12255, H88360, R7893 4
Gene Name (ORF)
Gene clusters illustrating differentially expressed genes
in tumor and normal samples
a. Cluster 6
b. Cluster 8
Generalizations
• WMPC extends the unweighted MPC to the weighted MPC.
– If we initialize all entries in w to be 1 and never change the
weights, MPC-AFS becomes a regular MPC.
• SMPC extends the deterministic MPC to the stochastic MPC.
– If we choose the particular kernel function (rectangular) and the
particular bandwidth parameter (the minimum distance between
the target cluster and all the others) to estimate the probability,
SMPC can be reduced to a regular MPC .
• Semi-supervised MPC extends unsupervised MPC to somehow
supervised MPC.
– Unsupervised MPC can be considered as a special case of semisupervised MPC with null background info and constraints.
Tool II: Motif Finding/Data Mining Tool
• Given a set of known binding sites, develop a
representation of these binding sites that can be
used to search for additional instances of those
binding sites in the genome.
• Given a set of sequences known to be co-regulated
(i.e. by an expression array) determine the binding
locations in the sequence and determine a
representation for binding specificity.
Motif Representations
•
•
•
•
Static Sequences: tataat
Regular Expressions (RegEx): tat[at].t
Sequences with N errors: tataat:2
RegEx with N errors: tat[at].t:2
• Mononucleotide Scoring Matrices:
• Dinucleotide Scoring Matrices (HMMs)
• Multinucleotide
a
t
[at]
a 75% 25% 50%
Scoring Matrices
t 25%
g 0%
c 0%
1
.
t
t
25%
25%
100%
75%
50%
25%
75%
0%
0%
0%
25%
0%
0%
0%
0%
25%
0%
0%
2
3
4
5
6
Searching for Known Motifs
1. Obtain a multiple sequence alignment of known motifs (e.g. from gel shift assay)
2. Construct
…atagtt…
representation a
…aattat…
t
…attatt…
g
…ttactt…
c
3. Score all possible
windows in the data
Set based on:
I seq (i)   f b,i log 2
b
f b ,i
pb
a
t
[at]
.
t
t
75%
25%
50%
25%
25%
100%
25%
75%
50%
25%
75%
0%
0%
0%
0%
25%
0%
0%
0%
0%
0%
25%
0%
0%
4. Output results that exist over
a specified threshold from data set
Finding Unknown Motifs
1. Input a set of co-expressed
sequences that are related
by micro-array experiment
2. Input: motif length n
3. Score all possible windows by first
Constructing a multiple sequence
Alignment of the window to all other
possible matches in the other sequences
4. Rank the set of all
possible scoring matrices
of length n based on
information content
relative to background.
I seq (i)   f b,i log 2
b
f b ,i
pb
5. Output an ordered list of motifs
and corresponding scoring matrices.
AGAST: Advanced Grammar Alignment
Search Tool
• Capitalize on the advantages of alignment.
• Provide a formal and robust method for computing biorelationships
• Provide optimum results based on the input.
• Calculate relationships in the same time as alignment.
• Allow for user knowledge and subsequence relationships.
• Record attributes and sequence attributes can be
considered simultaneously.
• Dynamically construct requisite algorithms in a user
friendly way, thus limiting development time and technical
knowledge requirements.
AGAST: Advanced Grammar Alignment
Search Tool
Advantages:
• It can evaluate regular expressions important to biology
as well as traditional RegEx tools.
• It can evaluate traditional alignments.
• It can do any combination of RegEx and traditional
alignments.
BioRegEx
BioRegEx II
Example of an Advanced Query
Find a sequence that contains:
tatatagcagcccatgagccggcccgcadtgctagttcag
5-10 bases
Transcription Start Site
Any Number of Bases
Functional Unit
Start Codon
Example Query:
tatata.*{5,10}atg.*[gc]ca[at]gct[atgc]g:2.*
tatatagcaggggcccatgagccggcccccadagctcgttcag
Score: 0
tatatagcagcccatgagccggcccgcadtgagttcag
Score: 2
Conventional Problem
• Motif Searching programs do not calculate based
on combinatorial regulation modules (instead they
calculate based on probability of a single motif).
• We have developed and tested a program that
considers an ordered set of motifs (or sequence
attributes) and searches based on a context of
adjacent elements (or grammar).
Next Steps
• Extract multiple motifs in the context of
regulatory control networks.
• Use phylogenetic footprinting and gene regulatory
network information to compare and contrast gene
regulation networks and extrapolate combinatorial
control mechanisms and corresponding motifs
upstream of genes.
Other Current Research Projects
 Advanced tools for identifying splice sites
 Using ab initio Bayesian networks based approaches
 Using homology graph theory based approaches
 Fast Recognition of Microorganisms using enzyme cutting
sequence, mass spectrometry or sequence based approaches
 Gene Prediction using Comparative Genomics
 Reconstructing Gene Regulatory Networks
 Clustering Techniques for Simplifying Protein Sequences
Acknowledgment









Kiran Bastola
Alexander Churbanov
Xutao Deng
Huimin Geng
Steven Hinrichs
Xiaolu Huang
Daniel Kuyper
Mark Pauley
Daniel Quest