Data mining for genetics - Helsinki Institute for Information

Data mining for genetics
Hannu Toivonen
Premises of the research
• Computational methods for medical molecular genetics
• Immediate and important applications:
– locating disease predisposing genes is essential for
understanding the etiology of complex common
diseases, such as heart disease or asthma
• Focus on selected topics where
– we can have a significant impact
– we can combine our algorithmic and data analytical
expertise with the unique research on medical
genetics in Finland
Computational methods
• Pattern discovery
– How to find frequently occurring phenomena
• Markov Chain Monte Carlo (MCMC)
– Approximating posterior distributions in
high-dimensional problems
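As a minimal illustration of the MCMC idea, the sketch below runs a random-walk Metropolis sampler on an unnormalised posterior. The target (heads probability of a coin under a flat prior), the step size and all function names are assumptions made for this example, not code from the group.

```python
import math
import random

def log_posterior(theta, heads=7, flips=10):
    """Unnormalised log posterior of a coin's heads probability (flat prior)."""
    if not 0.0 < theta < 1.0:
        return -math.inf
    return heads * math.log(theta) + (flips - heads) * math.log(1.0 - theta)

def metropolis(log_p, start, steps, step_size=0.1, seed=0):
    """Random-walk Metropolis: propose a Gaussian step, accept with prob min(1, ratio)."""
    rng = random.Random(seed)
    x, lp = start, log_p(start)
    samples = []
    for _ in range(steps):
        candidate = x + rng.gauss(0.0, step_size)
        lp_candidate = log_p(candidate)
        # Acceptance probability min(1, p(candidate) / p(x)), computed in log space.
        if rng.random() < math.exp(min(0.0, lp_candidate - lp)):
            x, lp = candidate, lp_candidate
        samples.append(x)
    return samples

samples = metropolis(log_posterior, start=0.5, steps=5000)
posterior_mean = sum(samples) / len(samples)
```

With 7 heads in 10 flips the exact posterior is Beta(8, 4), so the sample mean should settle near 2/3; in many-dimensional models the same loop is used when no closed form exists.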
Present state of the group
• Leading researchers: Profs. Hannu Toivonen, Heikki
Mannila, Jaakko Hollmén
• 5 post-docs, 7 PhD students
• Collaboration with leading groups in medical genetics
– Prof. Leena Palotie (Public Health Institute)
– Prof. Juha Kere (Karolinska Institutet)
• Various forms of collaboration
– Shared personnel (Kismat Sood, Juha Muilu,
visiting researchers Kenneth Lange and Joe
Terwilliger)
– Seminars etc.
Gene mapping
[Figure: 16 haplotypes (~chromosomes), 8 from cases and 8 from
controls, one per row; each column is a marker locus and each
entry is the allele observed at that locus.]
Gene mapping
[Figure: the same case and control haplotypes, with two example
haplotype patterns (pattern 1 and pattern 2) marked on the
alleles they match.]
Highlights
• A novel method, Haplotype Pattern Mining (HPM), for
gene mapping based on association analysis
– search for haplotype patterns that are associated
with the disease status
– HPM is the first method to use the concept of
patterns and to take an algorithmic approach to the
problem
– HPM has later been extended to cover a variety of
different cases, and recently we showed how
haplotype patterns can be found in genotype data
and used for gene mapping without explicit
haplotypes.
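The association idea behind haplotype patterns can be sketched as follows. A pattern fixes alleles at some marker loci and leaves the rest free; its association with disease status is scored from a 2x2 table of pattern presence against case/control status. The chi-square statistic, the toy haplotypes and the function names are assumptions for illustration; this covers only the scoring of one pattern, not the HPM search itself.

```python
def matches(haplotype, pattern):
    """True if the haplotype carries the pattern's allele at every fixed marker."""
    return all(haplotype[marker] == allele for marker, allele in pattern.items())

def chi_square(haplotypes, is_case, pattern):
    """Chi-square statistic of the 2x2 table: pattern present/absent vs case/control."""
    a = sum(1 for h, c in zip(haplotypes, is_case) if c and matches(h, pattern))
    b = sum(1 for h, c in zip(haplotypes, is_case) if not c and matches(h, pattern))
    c = sum(is_case) - a                    # cases without the pattern
    d = len(is_case) - sum(is_case) - b     # controls without the pattern
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom

# Hypothetical toy data: 4 case and 4 control haplotypes over 4 marker loci.
haplotypes = [
    (1, 3, 4, 2), (2, 3, 4, 5), (7, 3, 4, 1), (5, 2, 2, 3),   # cases
    (2, 3, 4, 1), (3, 1, 5, 2), (4, 2, 2, 6), (1, 5, 3, 3),   # controls
]
is_case = [True] * 4 + [False] * 4
pattern = {1: 3, 2: 4}   # allele 3 at marker 1 and allele 4 at marker 2
score = chi_square(haplotypes, is_case, pattern)
```

In a mapping setting, scores like this would be computed for many patterns, and markers covered by many strongly associated patterns would be candidate locations for the gene.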
Highlights (continued)
• TreeDT method for association analysis
– TreeDT looks for tree structured haplotype patterns
– the patterns reflect possible recombination histories
of genes in the population
– gene localization is based on the most plausible of
such histories
• Oligogenic models for binary traits
– Bayesian inference using recurrence risk data and
Markov chain Monte Carlo (MCMC) simulation
methods
Highlights (continued)
• A new method for haplotyping
– Problem: given a set of remotely related genotypes
(unordered allele pairs from a pair of haplotypes),
reconstruct plausible haplotypes
– The new method is based on Markov chains of
variable order, and applicable to genetically larger
regions than previous methods
• Populus simulator
– A very useful tool for method development in
population genetic studies
Proposed research directions (1/2)
• Trend towards higher efficiency in wet labs:
– high-throughput genotyping technologies
– high-density marker maps of biallelic SNP markers
– bigger sample sizes
• Research issues
– scalability of computational methods
• number of data points, number of dimensions
– effective utilization of this richer information
• e.g. haplotype block structure
Proposed research directions (2/2)
• Goal of decreasing per-genotype laboratory expenses
with computational techniques
– Pooling of DNA samples from several individuals
and genotyping the pooled samples
• how to recover haplotypes from pools?
• how much information is lost in pooling?
• how to design pooling studies?
– Selecting a subset of the available markers for a
future study or a diagnostic test
• determine a small set of markers sufficient to
reliably identify the whole haplotype
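The marker selection question can be illustrated with a greedy sketch: keep adding the marker that separates the known haplotypes into the most groups until every distinct haplotype is identified by the chosen markers alone. The greedy rule, the toy data and the function name are assumptions for this example; the greedy choice is a heuristic and is not guaranteed to find the smallest possible set.

```python
def select_markers(haplotypes):
    """Greedily pick marker indices whose alleles identify every distinct haplotype."""
    distinct = {tuple(h) for h in haplotypes}
    n_markers = len(next(iter(distinct)))
    chosen = []

    def n_groups(markers):
        # Number of different allele combinations seen on the given markers.
        return len({tuple(h[m] for m in markers) for h in distinct})

    while n_groups(chosen) < len(distinct):
        remaining = [m for m in range(n_markers) if m not in chosen]
        best = max(remaining, key=lambda m: n_groups(chosen + [m]))
        chosen.append(best)
    return sorted(chosen)

# Hypothetical haplotypes over 4 biallelic markers.
haplotypes = [(1, 1, 2, 1), (1, 2, 2, 1), (2, 1, 2, 2), (2, 2, 2, 2)]
selected = select_markers(haplotypes)
```

Here markers 0 and 1 suffice: marker 2 never varies and marker 3 duplicates the information in marker 0.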
Conclusions
• Common characteristics of the tasks:
– combinatorics
– stochastic data
– optimization problems, non-obvious objective
functions
– conserved patterns due to shared genetic origin
Adaptive Computing Systems
Patrik Floréen, HIIT
SAB meeting 17.10.2003
Outline
• Introduction
• Present activities
• Future activities
What is Adaptive Computing?
• Adaptive computing refers to solutions that adapt to
their environment
• Linked to the ubicomp / pervasive computing /
proactive computing vision
• We focus on some central topics to realise this vision:
– Context-awareness and adaptation are central to
user-friendly ubicomp applications, and ad hoc
networking may in the future provide infrastructure
for many ubicomp applications
Fundamentals
• Draws on existing competence in data mining,
probabilistic reasoning, algorithmics and language
technology
• At the crossroads of many of the research groups of
HIIT: many of our research groups deal with
context-awareness, personalisation and adaptation
• This presentation mainly about the ACS group at BRU
– 8 persons
– New! Started this year
Present projects: CONTEXT
• Nov 2002-Dec 2005, Academy of Finland
• Hannu Toivonen, with ARU
• Characterization and analysis of information about
user's context and its use in proactive adaptivity: what
is the user's understanding of her current context, how
to make automatic inferences about the contexts, and
how to characterize context to users and design user
interaction about contexts
• Research approach: qualitative end user studies, data
analysis algorithm development, and empirical testing
in a prototype environment
• Manuscript is in preparation. The user experience
research (undertaken in ARU) has been reported at
Intl Symp. Human Factors in Telecommunication
Present projects: NAPS
• Jan 2003-Dec 2005, Academy of Finland
• Patrik Floréen, with HUT
• Fundamental topology control and routing problems in
ad hoc networks
• Research approach: algorithmics, in particular
graph-theoretic optimisation
• First topic studied is topology control: multicasting
lifetime maximization under energy constraints
• Results presented in MobiCom 2003 workshop;
journal paper in preparation
• Next directions: sensor networks
Present projects: NAPS results
• Model: Energy consumption function of transmission
power (graph representation)
• Input: Node energies and power threshold graph
• Output: Optimal power assignment schedule
• Usually: min. energy & static power assignments
• Here: max. lifetime & dynamic power assignments
• Result 1: max. multicast lifetime & static: polynomial
• Result 2: max. multicast lifetime & dynamic (discrete
time steps): NP-hard and APX-hard
• (As part of proof: certain Steiner tree packing problem
proved NP-complete)
• Result 3: 2 heuristic algorithms and a method to
calculate an upper bound; algorithms give good results
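One way to see why the static case is tractable is the bottleneck sketch below. It is an illustration under simplifying assumptions, not necessarily the algorithm from the project: a node with energy E transmitting at constant power p is assumed to last E / p time units, only transmission consumes energy, and one transmission reaches every neighbour whose threshold is at most p. All names and the toy network are hypothetical.

```python
from collections import deque

def max_static_lifetime(energy, threshold, source, targets):
    """energy[i]: battery of node i; threshold[(i, j)]: power i needs to reach j.

    Returns the longest time a fixed power assignment can keep all targets
    reachable from the source, or 0.0 if some target is unreachable.
    """
    # The optimum is always one of the values energy[i] / threshold[(i, j)].
    candidates = sorted({energy[i] / p for (i, _), p in threshold.items()}, reverse=True)
    for lifetime in candidates:
        # Keep only the links whose sender can sustain them for `lifetime`.
        adjacency = {}
        for (i, j), p in threshold.items():
            if energy[i] / p >= lifetime:
                adjacency.setdefault(i, []).append(j)
        # Breadth-first search from the source over the surviving links.
        seen = {source}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adjacency.get(u, []):
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        if set(targets) <= seen:
            return lifetime
    return 0.0

# Hypothetical 3-node network: node 0 multicasts to nodes 1 and 2.
energy = {0: 10, 1: 4, 2: 6}
threshold = {(0, 1): 2, (0, 2): 5, (1, 2): 1, (2, 1): 3}
best = max_static_lifetime(energy, threshold, source=0, targets=[1, 2])
```

In the toy network the best static assignment relays through node 1 (node 0 at power 2, node 1 at power 1) and lasts 4 time units, whereas reaching both targets directly from node 0 at power 5 would last only 2. The dynamic case, where the assignment may change over time, is the one reported as NP-hard.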
Present projects: Space4U
• July 2003-June 2004, Nokia, ITEA/EUREKA project
subcontract
• Patrik Floréen
• Context-aware selection of software components in
mobile phones
• Project just started
Present projects:
PROACT coordination
• Jan 2002-May 2006, Academy of Finland (Tekes,
French Ministry of Research)
• Program director Heikki Mannila, Coordinator Greger
Lindén
• Coordination of the Research programme on Proactive
Computing (14 projects with 41 partners)
– Promoting collaboration between projects and
improving national and international contacts
– Follow-up of projects and programme
– Assisting in administrative procedures
Principles for building the future
• There is potential for expanding the activities
• Collaborative projects with other groups (also outside of
HIIT)
• Attention to recruitment of postdocs
• Seek to engage in TEKES projects
• Optimal size of research group is maybe 15 persons
Future research topics
• Continuation and expansion of work on context
inference and reasoning, as well as continuation of
work on fundamentals of ad hoc networking
• New research topics:
– Context-aware personalised information retrieval
– Potential of groups of users in the form of context
sharing, group learning and social navigation
– Distributed solutions, both algorithmically and from
a software engineering point of view distributing the
intelligence into the surroundings
Intelligent Systems
Henry Tirri
Recent achievements
People and collaboration
• Professor Henry Tirri, Dr. Wray Buntine, Dr. Jorma Rissanen,
Dr. Petri Myllymäki and many excellent graduate students
General Goal
The aim of our research is fundamental understanding
and development of computationally efficient
probabilistic and information-theoretic modeling
techniques, and their multi-disciplinary applications
from engineering to sciences.
B-course (http://b-course.hiit.fi)
Some other recent highlights
• Application of the modeling to intelligent educational
software: Outstanding Paper Award (SITE 2002)
• Honorable mention for being 2nd (out of 114
international groups) in the Knowledge Discovery and
Databases prediction competition (pharmaceutical
application) (MDL-solution)
• E-government: Modeling the voting behavior of Finnish
Parliament members 1999-2002
(http://cosco.hiit.fi/eduskunta/index.jsp)
• Mobile Device positioning: “Manhattan trial 2003” - the
most accurate GSM-phone urban positioning (<30m)
available
Future
Next Generation Information Retrieval
Research emphasis
• PROSE: Probabilistic modeling based analysis engine and
query processor; mathematical methods
• SIB: semantic-based information management integrated
with personalization; “network appliance”; specialized
interfaces
• ALVIS: distributed architecture; topic-specific gathering;
sophisticated language processing
NGIR features
• semantic search built on automatically performed
analysis, not just human-tagged content
– probabilistic modeling
– “shallow” language analysis
• integrated personalization
• integrated collaboration
• “intelligent” interface
• distributed architecture
• topic-specific search
Aspects of “Search”
Theory
Theory topics
• Kolmogorov’s structure function interpretation of MDL
(with related rate distortion theory)
• Model-based similarity metrics
• Normalized Maximum likelihood (i.e., stochastic
complexity) for flexible graphical model families
• Computationally efficient models (mPCA, ICA) for
LARGE-SCALE text retrieval
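For the simplest model family, the Bernoulli model, the normalized maximum likelihood (NML) code length can be computed directly. The sketch below is a small illustration of the definition, with a function name chosen for this example; the graphical model families named on the slide need far more efficient computation than this brute-force sum.

```python
import math

def bernoulli_nml_code_length(k, n):
    """Stochastic complexity in bits of a binary sequence with k ones out of n."""
    def max_likelihood(j):
        # Likelihood of a sequence with j ones at its ML estimate j / n.
        # Python evaluates 0.0 ** 0 as 1.0, which is the convention needed here.
        return (j / n) ** j * ((n - j) / n) ** (n - j)

    # Normaliser: maximised likelihood summed over all 2**n sequences,
    # grouped by their number of ones.
    normaliser = sum(math.comb(n, j) * max_likelihood(j) for j in range(n + 1))
    return -math.log2(max_likelihood(k)) + math.log2(normaliser)

bits = bernoulli_nml_code_length(k=7, n=10)
```

The second term, the logarithm of the normaliser, is the model's parametric complexity: it depends only on the model family and the sample size, not on the observed data.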
CoSCo papers
Applications
Multi-disciplinary applications
• Medical domains (“P-Course”)
• Some biology stuff (mPCA for genome modeling
with Michael Jordan)
• Analysis software for social sciences (e.g.,
educational “qualitative” analysis)
Sensor networks
• “Grounded Web” = computing + communication + sensing
• applying mobile positioning & modeling research for
self-organization and modeling
• Scaling challenge
Scientific impact
• Regular channels: publications, books, conferences
etc.
• International networking (cooperation & recruiting,
graduate education)
• Open source code releases (search, e-learning
environments, modeling software)
• Publicly available servers (search, data analysis)
• Cool demonstrations :-)
HIIT Digital Economy (DE)
Jukka Kemppinen
Martti Mäntylä
Olli Pitkänen
HIIT ARU
Premises for the Research
• Digital Economy refers to legal, societal, and business
issues that are specific to the network society
• The rapid development of information and
communication technologies challenges traditional
ways to structure, organize, analyze, and regulate the
activities in a society:
– Digital contents: rights, distribution, …
– Balancing different stakeholders' interests: privacy,
trust, …
– Roadmap to future: whose future will prevail?
• The research interests of HIIT’s DE group are related to
solving these problems
Present State of the Group
• DE group founded in 2000
– Longer roots at Helsinki University of Technology
• About 18 researchers
– Professor Jukka Kemppinen, leader
– Professor Martti Mäntylä
– Olli Pitkänen, program coordinator
• Currently four projects, funded by Tekes and
companies:
– DE Core, MobileIPR, STAMI, Welfare of Nations
– The group’s strengths include especially issues in
intellectual property rights, digital rights
management, open source licensing, and security
Digital Economy 2003
[Diagram of projects and partners]
• Partners shown: UCB/ICSI, BCIS, UCB/SIMS
• Projects:
– Welfare of Nations: network society study
– MobileIPR: DRM & business models
– DE Core: structures of Digital Economy
– STAMI: security technology, personal privacy
• People named, in diagram order: Kemppinen & Himanen;
Kemppinen; Kemppinen & Mäntylä; Mäntylä
Competences
• Present disciplines in the group:
– Information technology
– Law
– Political science
– History
– Philosophy
• More breadth may be needed - economics?
• Partnerships:
– HUT, UH; Helsinki School of Economics,
Lappeenranta University of Technology, Research
Institute of the Finnish Economy, Bank of Finland
Recent Achievements
• Papers, journal articles, reports
• Working prototypes to study certain aspects of the
future digital services
– Digital rights management
– Secure and accountable peer-to-peer content
sharing
– Micromovie sharing (under preparation)
• First International Mobile IPR Workshop: Rights
Management of Information Products on the Mobile
Internet, keynote speakers: Professor Hal Varian (UC
Berkeley) and Professor Ross Anderson (Cambridge)
Future Research Directions (1/2)
• In the next few years, the DE group is going to focus on
the following topics:
– The structures of the network society, including
value networks, transaction costs, relations
between actors, pricing mechanisms, and business
models (DE Core)
– Digital goods, information products and services,
rights in them, value of the rights, products and
services, and rights management (IPIS)
– Open source as a modus operandi, development
model, licensing models, economic analysis of
copyright, incentives, and user rights (RIPOS)
Future Research Directions (2/2)
– Privacy, trust, and economy in P2P content
distribution (MUPPET)
– Models of information and welfare societies, digital
sociology and politics, international comparison
esp. Chinese information society (WoN II)
– Digital economy issues in manufacturing
industries?
• Continued co-operation with University of California,
Berkeley, School of Information Management and
Systems (SIMS)
• New partnerships in Europe?
Digital Economy 2004
[Diagram of projects and partners]
• Partners shown: UCB/ICSI, BCIS, UCB/SIMS
• Projects:
– IPIS: Intellectual Property in Information Society
– Welfare of Nations II: Dynamics of Chinese
Information Society
– DE Core: Structures of Digital Economy
– RIPOS: Risks and Prospects of Open Source
– MUPPET: Privacy and Trust in P2P Communication
• People named, in diagram order: Kemppinen; Himanen;
Kemppinen & Mäntylä; Kemppinen; Mäntylä