Collaborative Information Management:
Advanced Information Processing in
Bioinformatics
Joost N. Kok
LIACS - Leiden Institute of Advanced Computer Science
&
LUMC - Leiden University Medical Center
BioRange
• Bioinformatics for microarray technology
• Bioinformatics for proteomics and metabolomics
• Integrative bioinformatics
• VL-e informatics for bioinformatics applications
• Test bed with “real-life applications”
BioRange
CIM, AIM in BioINF
Five research lines:
• Information Structuring
• Heterogeneous Data Integration
• Advanced Mining Algorithms
• Data Interlinking and Integration
• Data Storage and Management
1: Advanced Mining Algorithms
Data Mining
• Data Mining is the non-trivial process of identifying
valid, novel, potentially useful, and ultimately
understandable patterns in data
useful
novel, surprising
comprehensible
valid (accurate)
Data Mining
• It is somewhat comparable to statistics (and often
based on the latter), but takes it further in the
sense that whereas statistics aims more at
validating given hypotheses, in data mining often
millions of potential patterns are generated and
tested, in the hope of finding some that are
potentially useful.
Intelligent Interfaces
Case study: SNP data
• Genome scan comprising 500K data points (Single
Nucleotide Polymorphisms or SNPs) in 900 subjects
from families expressing survival to extremely high
ages (longevity).
• The analysis of this set of 450 million data points is
to recognize patterns specific for the genetic make-up
of long survivors.
Case study: SNP data
• The genetic scan data will be combined with
• gene expression data (30,000 data points per
subject in 100 subjects),
• protein data (NMR spectra from blood parameters in
hundreds of subjects) and
• imaging data (quantitative photography of facial
ageing parameters).
Case study: SNP data
• Subjects with SNPs
• Classes (Young, Old)
• Above a certain support within Y,O
• Above a certain difference between classes Y,O
• Above a certain correlation with a class Y,O
• etc.
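A minimal sketch of the selection criteria listed above, assuming a pattern is a set of SNP variants and each subject carries a class label Y or O. The toy data, the function names (class_support, is_interesting) and the threshold values are hypothetical, and "support within Y,O" is read here as the fraction of subjects of a class that contain the pattern. The correlation criterion is not implemented.

```python
# Hypothetical toy data: each subject is (set of SNP variants, class label).
subjects = [
    ({"snp1", "snp2"}, "Y"),
    ({"snp1"}, "Y"),
    ({"snp2", "snp3"}, "O"),
    ({"snp2", "snp3"}, "O"),
    ({"snp1", "snp3"}, "O"),
]


def class_support(pattern, subjects):
    """Fraction of subjects in each class that contain every SNP in the pattern."""
    totals = {}
    hits = {}
    for snps, label in subjects:
        totals[label] = totals.get(label, 0) + 1
        if pattern <= snps:
            hits[label] = hits.get(label, 0) + 1
    return {label: hits.get(label, 0) / totals[label] for label in totals}


def is_interesting(pattern, subjects, min_support=0.5, min_difference=0.4):
    """Keep a pattern if it is frequent in at least one class (an assumed reading
    of the support criterion) and its support differs enough between Y and O."""
    support = class_support(pattern, subjects)
    young = support.get("Y", 0.0)
    old = support.get("O", 0.0)
    return max(young, old) >= min_support and abs(young - old) >= min_difference


for candidate in ({"snp1"}, {"snp2"}, {"snp3"}):
    print(sorted(candidate), class_support(candidate, subjects),
          is_interesting(candidate, subjects))
```

In this toy set, snp2 is frequent in both classes but does not separate them, so it is dropped by the difference threshold.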
Substructures
• Sequences
• DNA
• Trees
• XML documents
• Graphs
• Molecules
GASTON Tools
hms.liacs.nl
Mutagenicity data set of
4069 compounds
(56% mutagenic)
www.cheminformatics.org
To boldly go where no chemist has gone before
08 February 2006
Studying the interactions between different molecular fragments is taking researchers to the uncharted regions of chemical space.
Chemical space, consisting of all possible stable molecules, is mind-bogglingly vast. Theoretical chemists have calculated that there
are more possible molecules based on hexane (10^29) than there are stars in the visible universe. Chemists have only made
fairly tentative journeys into this space, with the largest chemical databases currently containing up to 25 million different
molecules.
Ad IJzerman from Leiden University, the Netherlands, and colleagues realised that analysing these chemical databases could reveal
which regions of chemical space have been extensively explored and which remain relatively uncharted.
IJzerman’s team split the 250 000 molecular structures contained in the US National Cancer Institute’s database into component
fragments, consisting of rings, substituents and several types of linkers.
This generated 65 000 different fragments, of which the vast majority (70 per cent) occurred only once. The chemists selected the
1730 fragments that occurred in more than 20 different molecules and calculated the number of times that each possible pair of
fragments occurred in the same molecule.
Some pairs of fragments were commonly found together, forming what the researchers termed ‘chemical clichés’, but others were
rarely found in the same molecule. By generating molecules containing the fragments that aren’t often brought together, the
researchers predict, chemists should be able to open up new areas of chemical space and potentially discover new molecules with
interesting properties.
IJzerman has already demonstrated the benefits of this fragment analysis to a medicinal chemist. She was having problems with a
particular compound and he suggested possible alternative ring systems, based on his list of the most popular ring fragments. ‘It
turned out that one of our top 40 ring systems was actually her intended modification, reached after much deliberation,’ he
told Chemistry World.
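The fragment analysis described in the article can be sketched roughly as follows: keep the fragments that occur in more than a given number of molecules, then count for every pair of kept fragments how many molecules contain both. This is only an illustration under stated assumptions. The molecule names, fragment names and function names are hypothetical, each molecule is assumed to be already split into a set of fragments, and the splitting step itself is not shown.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy input: molecule id -> set of fragments found in it.
molecules = {
    "mol_a": {"benzene", "amide", "methyl"},
    "mol_b": {"benzene", "amide"},
    "mol_c": {"benzene", "pyridine"},
    "mol_d": {"pyridine", "methyl"},
}


def frequent_fragments(molecules, more_than):
    """Fragments that occur in more than `more_than` different molecules."""
    counts = Counter(f for frags in molecules.values() for f in frags)
    return {f for f, n in counts.items() if n > more_than}


def pair_counts(molecules, fragments):
    """For every pair of selected fragments, count the molecules containing both."""
    # Start every possible pair at zero so that never-seen pairs stay visible.
    counts = Counter({pair: 0 for pair in combinations(sorted(fragments), 2)})
    for frags in molecules.values():
        present = sorted(frags & fragments)
        counts.update(combinations(present, 2))
    return counts


selected = frequent_fragments(molecules, more_than=1)
counts = pair_counts(molecules, selected)
common_pairs = [pair for pair, n in counts.items() if n >= 2]  # candidate "cliches"
unseen_pairs = [pair for pair, n in counts.items() if n == 0]  # unexplored combinations
print(common_pairs, unseen_pairs)
```

Pairs are stored as alphabetically sorted tuples so that each unordered pair has exactly one key.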
2: Data Storage and Management
Patternbases
• Pattern Databases = Patterns + Data
• Query Languages work on Patterns + Data
• Since patternbases provide an architecture for
pattern discovery and a means to discover and use
those patterns through the query language,
data mining becomes in essence an interactive
querying process.
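One way to picture "Patterns + Data" behind a single query interface is the small sketch below. It is not the design of any particular patternbase system: the PatternBase class, its method names and the toy transactions are all hypothetical, and patterns are assumed to be item sets stored together with their support counts.

```python
from dataclasses import dataclass, field


@dataclass
class PatternBase:
    """Hypothetical container that keeps transactions and mined patterns together."""
    data: list                                    # list of frozensets (transactions)
    patterns: dict = field(default_factory=dict)  # item set -> support count

    def support(self, itemset):
        """Number of transactions that contain every item of the item set."""
        return sum(1 for row in self.data if itemset <= row)

    def add_pattern(self, itemset):
        """Store a pattern together with its support in the current data."""
        itemset = frozenset(itemset)
        self.patterns[itemset] = self.support(itemset)

    def query_patterns(self, min_support=0, contains=()):
        """Query over patterns: filter by support and by required items."""
        required = frozenset(contains)
        return [p for p, s in self.patterns.items()
                if s >= min_support and required <= p]

    def query_data(self, itemset):
        """Query over data: the transactions that support a given pattern."""
        itemset = frozenset(itemset)
        return [row for row in self.data if itemset <= row]


pb = PatternBase([frozenset("ab"), frozenset("abc"), frozenset("c")])
pb.add_pattern("ab")
pb.add_pattern("c")
print(pb.query_patterns(min_support=2, contains="a"))
print(pb.query_data("c"))
```

Each query result can feed the next query, which is the sense in which mining becomes an interactive querying process.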
Patternbases
• Derive new patterns from data + old patterns
• Apriori Algorithm: Frequent Item Sets
• Frequent Item Sets + Data: Association Rules
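A compact sketch of the two steps named above: a level-wise Apriori search for frequent item sets, followed by deriving association rules from those sets and their counts. The transactions, the thresholds and the function names are hypothetical, and the code favours readability over speed.

```python
from itertools import combinations

# Hypothetical toy transactions.
transactions = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"milk"}),
]


def support_count(itemset, transactions):
    """Number of transactions that contain the whole item set."""
    return sum(1 for t in transactions if itemset <= t)


def apriori(transactions, min_count):
    """Level-wise search: size-k candidates are built only from frequent (k-1)-sets."""
    items = {item for t in transactions for item in t}
    level = {frozenset({i}) for i in items}
    frequent = {}
    k = 1
    while level:
        counted = {c: support_count(c, transactions) for c in level}
        current = {c: n for c, n in counted.items() if n >= min_count}
        frequent.update(current)
        k += 1
        # Join step: unions of two frequent sets that have exactly k items.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        level = {c for c in candidates
                 if all(frozenset(s) in current for s in combinations(c, k - 1))}
    return frequent


def association_rules(frequent, min_confidence):
    """Rules lhs -> rhs with confidence = count(lhs and rhs) / count(lhs)."""
    rules = []
    for itemset, count in frequent.items():
        for r in range(1, len(itemset)):
            for lhs in combinations(itemset, r):
                lhs = frozenset(lhs)
                confidence = count / frequent[lhs]
                if confidence >= min_confidence:
                    rules.append((lhs, itemset - lhs, confidence))
    return rules


frequent = apriori(transactions, min_count=2)
rules = association_rules(frequent, min_confidence=0.9)
print(frequent)
print(rules)
```

The lookup frequent[lhs] is safe because every subset of a frequent item set is itself frequent, which is the same property the prune step relies on.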
Patternbases
• Derive new patterns from data + old patterns
• Find all item sets that are correlated with classes
• Fix a
• We can prune the search space by only considering
frequent item sets with minimum support
Patternbases
Research Lines BioRange
Five research lines:
• Information Structuring
• Heterogeneous Data Integration
• Advanced Mining Algorithms
• Data Interlinking and Integration
• Data Storage and Management