Data mining for genetics
Download
Report
Transcript Data mining for genetics
Data mining: theory and
applications
Heikki Mannila
Data Mining:
Theory and Applications
• Data analysis becoming more important in other
sciences and in industry
– New measurement methods
– Ability to store data
• High-dimensional large data sets
• Non-traditional forms (e.g., strings, trees, graphs)
• Data analysis lags behind
Data mining
• Has emerged as a major research area in the interface
of computer science and statistics
– Machine learning, databases, algorithms
• Data
analysis questions are increasingly visible in
database and algorithms research
• Theory and practice interact
Goals
•
•
•
•
Develop novel data analysis techniques for the use of
other sciences and industry
How?
– Look at data analysis problems arising in practice
– Abstract new computational concepts from them
– Analyse the concepts and develops new
computational methods
– Take the results into practice
Theoretical work in algorithms and foundations of data
analysis can have fast impact in the application areas
The applications feed interesting novel questions to
theoretical research
Major themes in methods
• Pattern discovery
• Methods for sequence decomposition
• Interplay
of combinatorial and continuous methods in
data mining
• Techniques
sets.
for the decomposition of large 0-1 data
Application areas
• Genome structure
• Gene expression data analysis
• Palaeontology
• Linguistic applications
• Ubiquitous computing
The people
• Heikki Mannila, Hannu Toivonen, Jaakko Hollmen,
Aristides Gionis, Floris Geerts, Bart Goethals
• 6 Ph.D. students
• A visible position in the international community
Highlights
•
•
•
•
Finding recurrent sources in sequences
– global structure in genomic sequences
– recognizing recurrent contexts in mobile device usage
– (k,h)-segmentation
Finding orderings of attributes from unordered binary data
– Fragements of order
– Spectral ordering techniques
Pattern discovery and mixture modelling techniques for
onomastic data sets
Methods for finding topics in 0-1 datasets on the basis of
co-occurrence information
Finding recurrent sources
in sequences
• Sequences
– DNA
– Telecommunications
– Etc.
• How to find some global structure from a sequence?
• Try to find homogenous segments from the sequence
Finding homogenous segments
T = S1
S2
S3
S4
S5
S6
• Sequence T, integer k
• Measure of homogeneity H for segments of T
– E.g., H(S) = |S| Var(S)
k
• Find the division T = S1,S2,…, Sk minimizing
H (Si )
• Dynamic programming
i 1
• (k,k)-segmentation: k-segments with no relationship to each
other; independent sources
(k,h)-segmentation
• We want to limit the number of different types of
segments
• Only h<k different types are allowed
• Find the best segmentation of T into k segments by
using only h different types of segments
(6,3)-segmentation
Source 1
Source 2
Source 3
Data
k = 3 and h = 3
k = 3 and h = 2
(k,h)-segmentation problem
• Given sequence T
• Find h sources w1,w2,…, wh
• A decomposition of sequence T into k segments
T = S1 S2 … Sk
• Minimizing the sum of distances from each point t to the
source wa(t) of the segment to which t belongs to
n
(t
i 1
i
wa (ti ) )
2
Results
• (k,h)-segmentation problem is NP-hard for dimension d>1, for
L1 and L2 metrics
• Dimension d=1: complexity open
• Simple approximation algorithms
• d=1: 3-approximation for L1
• d=1: 5 -approximation for L2
• d>1: 3+e –approximation for L1 for any e>0
• d>1: A+2 –approximation for L2, where A is the best
approximation factor for k-means clustering
• Very good performance in practice
• The algorithms work for any generative model (not just reals
with Lp metrics)
5
Example: onomastic data
• Names of lakes in Finland
• About 150,000 lakes
• What are the main trends?
• High-dimensional marked point process
• Collaboration with Research Center for the Languages
of Finland (Kotus)
• Similar data analysis problems arise also in
environmental sciences
Similarity with the names of lakes in Kangasala
Clustering on the basis of the names of lakes
Example: paleontological data
• Given a matrix of occurrences of species in fossil sites
• Ages of the fossil sites are not available
• How to order the sites according to their age?
• Background information: species arrive and vanish
• Try to find ordering that minimizes Lazarus events
time
A
0
1
1
0
species
B
C
0
1
1
0
0
1
1
0
Lazarus events
Methods
• Spectral ordering: form a Laplacian of the cooccurrence matrix, look at eigenvectors
• Fragments of order: find short segments of orders
which are not violated by observations
• Other applications: text analysis, telecommunications
Fortelius, Jernvall, Gionis, Mannila, in preparation
Future research directions
•
•
•
•
•
•
•
Theory and practice
The combination of continuous and combinatorial
methods
Concepts and algorithms for describing structure of
sequences
Methods for pattern discovery in and modelling of
spatiotemporal data
Theoretical models for data mining (such as inductive
databases)
Foundational issues in pattern discovery (e.g., logical
form of patterns and the difficulty in discovering them)
Publications, collaborations, software releases
Applications in the future
• Genome structure and its relation to function
• Linguistic applications: spatial and temporal variation in
language
• Ubiquitous computing and telecommunications
applications
• Paleontological and ecological applications
Mobile Computing Research
at HIIT
Kimmo Raatikainen
Research Director
Helsinki Institute for Information Technology
[email protected]
Fuego Mission
• To address the research
challenges arising in mobile
computing systems and
applications of tomorrow.
• Mobile computing will fulfil the
vision of ubiquitous - invisible computing providing access
and services anytime,
anywhere, and anyhow.
• The key research challenges
are related to
– context-awareness,
– reconfigurability,
– adaptability,
– understanding user needs
and experience, and
– personalization.
”Any technology distinguishable from magic is
insufficiently advanced,” Gregory Benford
Present State
• Some 20 researchers organised in two closely co-operating
research groups
– Mobile Computing Group (Prof. Kimmo Raatikainen)
– User Experience Research Group (Prof. Martti Mäntylä)
• Other senior researchers and post-docs:
– Dr. Ken Rimey (software technologies, distributed computing)
– Dr. Pekka Nikander, permanent visitor from Ericsson
Research (security and privacy in Mobile Internet)
– Dr. Timo Saari (user experience research, media science)
– Dr. Jan Lindström (distributed data management, mobile data)
• Other postdocs likely to be hired 2004
Current Research Topics
• Middleware for Mobile Wireless Internet – Fuego Core project
– Mobile distributed event system
– Mobile (XML-based) file system with intelligent
synchronization
– SOAP messaging over wireless (W3C: XML Binary Infoset)
– Mobile Presence
– Host Identity Protocol
• Personal Distributed Information Storage – PDIS project
– Synchronization-based peer-to-peer infrastructure for storage
of structured XML data: PIM data, metadata for digital media
• Context Recognition by User Situation Data Analysis – CONTEXT
project
– Bridge between User Experience Group at ARU and Adaptive
Computing Systems Group at BRU
• Software Architectures for Configurable Ubiquitous Systems –
Sarcous project by SoberIT at HUT
– Managing the large variety of software products
Targets to 2005-2010 – 1/3
• to enlarge and strengthen international co-operation
– current: WWRF, UCB, Fraunhofer FOKUS
– new: Japan, KCL/Mobile VCE, an European NoE,
CMU, …
– but not forgetting co-operation in Finland:
• HUT, UHE, Tampere Univ Tech, Univ Oulu,
UIAH, …
• to contribute to software architecture for Wireless World
• to address challenges due to personal networking
– minimal differences between solution stacks for
ad-hoc communities and networked infrastructure
– peer-to-peer, device-to-device solutions
Not in primary focus
(and other smart places)
Not in primary focus but perhaps latter
Targets to 2005-2010 – 2/3
• to put more focus on infrastructure for contextawareness and dynamic (end-user) systems
– context modelling: presentation, maintenance,
sharing, protection, reasoning, and queries
– decision rules for reconfiguration
• reflective (self-aware) middleware for personal
networking
• Fault tolerance in Wireless World
– traditional exception will be the usual case
– compensations, delayed/delegated actions, …
• Trust and privacy in Wireless World
Targets to 2005-2010 – 3/3
• user needs and novel application concepts
– human factors of the Wireless World
• basic psychosocial mechanisms
• what makes a service use experience engaging
and sustaining?
– user-centric concept design (UCPCD)
• process, methods, tools
– novel application concepts based on contextawareness, other novel technologies
– experience prototypes