Parallel computational methods for sequence analysis


Andrew Meade ([email protected])
School of Biological Sciences
PARALLEL COMPUTATIONAL
METHODS FOR SEQUENCE
ANALYSIS
Molecular sequence growth rates
from 600 to 100 million sequences in 25 years
Human Genome project
Molecular sequence growth rates
 18 million new sequences a year (2007 – 2008)
 Rate of growth is accelerating
 Doubling every 2 years
 Likely to continue with new sequencing technology
 Cost, time and technical ability required have all fallen
It’s worse than it looks
 Lack of suitable tools for sequence analysis
 Analysis methods don’t always scale linearly
 Methods have changed
   Simple heuristics → Statistical methods
   Simple rules → More realistic models
   Descriptive results → Biological process
   Sub system analysis → Systems biology
 Computing power is a major rate-limiting step
 The gap between data and analytical methods is widening
Tools for genomic analysis
Current Tools
 Co-opted for purpose
 Designed for smaller data sets
 Limited to a single computer
 External data required
 Hard to generalise
Required Tools
 Custom built
 Limited by available hardware
 Use available computers
 Models derived from data
 Identify informative information in the data
454 parallel sequencing
 Fast, 400-600 million bases per 10 hours
 Human genome in 100 hours, HGP 13 years
 Cheap, 20¢ per kb, currently $12
 Human genome for $100,000, HGP $10 billion
 Accurate, 99% accurate on 400th base
 Small chunks, 400 – 800 bases per sequence
 Similar to parallel computing, hard to convert raw power to useful results
 The catch - analysis
454 sequencing
 Sequence populations of bacteria (16S) taken from cow guts under different experimental conditions
 Identify how changes in feed affect bacterial populations
 332,000 sequences in total
 £8,000 using 454, previously over £2 million
454 sequencing analysis
 Find how closely related the sequences are to each other
 Perform an approximate match between all pairs of sequences, allowing for insertions, deletions and mutations
 332,000^2 * 0.5 = 5.5 * 10^10 comparisons
 874 years on a single computer
 Trivially parallel task, easy to distribute over nodes, different clusters, different OS / hardware
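Because every pairwise comparison is independent of the others, the pairs can simply be dealt out to nodes. A minimal sketch in Python follows. It assumes a plain Levenshtein edit distance as a stand-in for the approximate match (the slides do not name the algorithm actually used), and the names `edit_distance`, `pairs_for_node` and `compare_on_node` are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions and substitutions cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # match or substitute
        prev = curr
    return prev[-1]


def pairs_for_node(n_seqs: int, node: int, n_nodes: int):
    """Yield the (i, j) pairs, i < j, owned by one node.

    Rows are dealt out round-robin (an assumed scheme), so each node
    gets a mix of long and short rows and the load stays roughly even.
    """
    for i in range(node, n_seqs, n_nodes):
        for j in range(i + 1, n_seqs):
            yield i, j


def compare_on_node(seqs, node: int, n_nodes: int):
    """Compute the distances for this node's share of the pairs."""
    return {(i, j): edit_distance(seqs[i], seqs[j])
            for i, j in pairs_for_node(len(seqs), node, n_nodes)}


def total_pairs(n_seqs: int) -> int:
    """Number of unordered pairs, n * (n - 1) / 2."""
    return n_seqs * (n_seqs - 1) // 2
```

Each node needs only its own index, the node count and read access to the sequences, which is why the task ports easily across clusters and operating systems. `total_pairs(332_000)` gives 55,111,834,000, in line with the 5.5 * 10^10 figure above.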
454 sequencing analysis 2
 Cluster sequences from the previous step to find what species are present and in what quantities
 102 GB of data. Distributed code to reduce memory and processing requirements
 Linear scaling (memory, CPU) up to 200 nodes
 Problems with disk access
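The slides do not say which clustering method was used. As an illustration only, the sketch below groups sequences by single-linkage clustering with a distance threshold, using a union-find structure over pairwise distances. The function names, the threshold and the input layout (a dict keyed by index pairs) are all assumptions.

```python
from collections import Counter


def cluster_by_threshold(n_seqs: int, distances: dict, threshold: int):
    """Single-linkage clustering: join i and j whenever distance <= threshold.

    Returns one label per sequence; the label is the smallest sequence
    index in that cluster.
    """
    parent = list(range(n_seqs))

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for (i, j), d in distances.items():
        if d <= threshold:
            ri, rj = find(i), find(j)
            if ri != rj:
                # Always keep the smaller index as the root.
                parent[max(ri, rj)] = min(ri, rj)

    return [find(i) for i in range(n_seqs)]


def cluster_sizes(labels):
    """Count members per cluster, a rough proxy for abundance."""
    return Counter(labels)
```

Under this assumed scheme, each cluster stands for a candidate species and its size for that species' quantity in the sample.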
Bayesian Phylogenetic inference
 Infer evolutionary histories (phylogenies) from molecular data
 Widely used in all areas of biology
 Used to investigate how genes and proteins change and adapt to their environment
 How viruses spread and mutate
 Reconstruct ancestral genes and proteins
 Used in conservation studies to identify species that are most at risk of extinction and most valuable to conserve
Mammal Mitochondrial
44 Taxa
13 Protein coding regions
16400 Nucleotides
Mammal Mitochondrial scaling
[Figure: scaling plot, x-axis: Number of computers. Labelled points: 1 ~ 70 days, 60 ~ 2 days]
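Taking the two labelled points, roughly 70 days on 1 computer and roughly 2 days on 60, the implied speedup and parallel efficiency can be worked out as below. Both inputs are approximate readings from the plot, so the results are indicative only; the function names are illustrative.

```python
def speedup(t_one: float, t_many: float) -> float:
    """Ratio of single-computer run time to parallel run time."""
    return t_one / t_many


def efficiency(t_one: float, t_many: float, n_computers: int) -> float:
    """Speedup divided by the number of computers (1.0 is ideal)."""
    return speedup(t_one, t_many) / n_computers


s = speedup(70, 2)          # 35.0, i.e. about 35 times faster
e = efficiency(70, 2, 60)   # about 0.58 of ideal linear scaling
```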