PPTX - Tandy Warnow

Download Report

Transcript PPTX - Tandy Warnow

CS 466 and BIOE 498:
Introduction to Bioinformatics
Tandy Warnow
Founder Professor of Engineering
Brief Description
Algorithmic approaches in bioinformatics:
– biological problems that can be solved
computationally (e.g., genome assembly,
sequence alignment, and phylogeny estimation)
– algorithmic techniques with wide applicability in
solving these problems (e.g., dynamic
programming and probabilistic methods)
– practical issues in translating the basic algorithmic
ideas into accurate and efficient tools that
biologists may use.
But also…
• Applications of these techniques to cutting
edge research problems in biology that
require computational approaches, such as:
– Systems Biology
– Protein Structure and Function Prediction
– The Tree of Life
Saurabh Sinha: gene regulation,
big data to knowledge
Two broad areas:
How is information about us encoded in our DNA ?
How do we bring the latest and greatest in machine learning and graph mining to
the biologist’s desktop computer?
Research questions:
–
–
–
–
–
Gene regulation: How are genes turned on and off in precisely orchestrated ways?
Regulatory evolution: Can we model evolution of regulatory sequences?
Genomics of behavior: How does DNA encode animal behavior ?
Cancer pharmacogenomics: Can a person’s DNA predict the best drug treatment?
Big Data To Knowledge (BD2K): Build a “Knowledge Engine for Genomics”.
http://www.sinhalab.net/
4
Jian Peng: Machine learning for
molecular and systems biology
Biological
data





Machine
learning
Knowledge
Understanding of the sequence-structure-function relationship
Machine learning algorithms for biological network integration
Prediction of gene/protein function from heterogeneous datasets
Role of protein mutations in human diseases
Graphical models, statistical modeling and structured prediction
Computational Phylogenomics
NP-hard problems
Large datasets
Complex statistical estimation problems
Metagenomics
Protein structure and function prediction
Medical forensics
Systems biology
Population genetics
This course: Material
• Core bioinformatics problems:
–
–
–
–
•
•
Probabilistic models of sequence evolution
Algorithm design techniques
–
–
–
–
–
•
Genome assembly,
Phylogeny estimation,
Multiple sequence alignment, and
Database search
Recursion
Divide-and-conquer
Dynamic programming
Designing heuristic search strategies
Use of profile Hidden Markov Models
Computational complexity
–
–
NP-hardness
Approximation algorithms
This course: Material
• Core bioinformatics problems:
–
–
–
–
•
•
Probabilistic models of sequence evolution
Algorithm design techniques
–
–
–
–
–
•
Genome assembly,
Phylogeny estimation,
Multiple sequence alignment, and
Database search
Recursion
Divide-and-conquer
Dynamic programming
Designing heuristic search strategies
Use of profile Hidden Markov Models
Computational complexity
–
–
NP-hardness
Approximation algorithms
DNA Sequence Evolution
-3 mil yrs
AAGACTT
AAGGCCT
AGGGCAT
AGGGCAT
TAGCCCT
TAGCCCA
-2 mil yrs
TGGACTT
TAGACTT
AGCACTT
AGCACAA
AGCGCTT
-1 mil yrs
today
Phylogeny Problem
U
AGGGCAT
V
W
TAGCCCA
X
TAGACTT
Y
TGCACAA
X
U
Y
V
W
TGCGCTT
However…
U
V
W
AGGGCATGA
AGAT
X
TAGACTT
Y
TGCACAA
X
U
Y
V
W
TGCGCTT
Indels (insertions and deletions)
Deletion
Mutation
…ACGGTGCAGTTACCA…
…ACCAGTCACCA…
Deletion
Substitution
…ACGGTGCAGTTACCA…
Insertion
…ACCAGTCACCTA…
…ACGGTGCAGTTACC-A…
…AC----CAGTCACCTA…
The true multiple alignment
– Reflects historical substitution, insertion, and deletion
events
– Defined using transitive closure of pairwise alignments
computed on edges of the true tree
Input: unaligned sequences
S1
S2
S3
S4
=
=
=
=
AGGCTATCACCTGACCTCCA
TAGCTATCACGACCGC
TAGCTGACCGC
TCACGACCGACA
Phase 1: Alignment
S1
S2
S3
S4
=
=
=
=
AGGCTATCACCTGACCTCCA
TAGCTATCACGACCGC
TAGCTGACCGC
TCACGACCGACA
S1
S2
S3
S4
=
=
=
=
-AGGCTATCACCTGACCTCCA
TAG-CTATCAC--GACCGC-TAG-CT-------GACCGC--------TCAC--GACCGACA
Phase 2: Construct tree
S1
S2
S3
S4
=
=
=
=
AGGCTATCACCTGACCTCCA
TAGCTATCACGACCGC
TAGCTGACCGC
TCACGACCGACA
S1
S4
S1
S2
S3
S4
S2
S3
=
=
=
=
-AGGCTATCACCTGACCTCCA
TAG-CTATCAC--GACCGC-TAG-CT-------GACCGC--------TCAC--GACCGACA
Phylogenomic pipeline
• Select taxon set and markers
• Gather and screen sequence data, possibly identify orthologs
• Compute multiple sequence alignments for each locus, and construct gene
trees
• Compute species tree or network:
– Combine the estimated gene trees, OR
– Estimate a tree from a concatenation of the multiple sequence
alignments
• Get statistical support on each branch (e.g., bootstrapping)
• Estimate dates on the nodes of the phylogeny
• Use species tree with branch support and dates to understand biology
Phylogenomic pipeline
• Select taxon set and markers
• Gather and screen sequence data, possibly identify orthologs
• Compute multiple sequence alignments for each locus, and construct gene
trees
• Compute species tree or network:
– Combine the estimated gene trees, OR
– Estimate a tree from a concatenation of the multiple sequence
alignments
• Get statistical support on each branch (e.g., bootstrapping)
• Estimate dates on the nodes of the phylogeny
• Use species tree with branch support and dates to understand biology
Two-phase estimation
Alignment methods
• Clustal
• POY (and POY*)
• Probcons (and Probtree)
• Probalign
• MAFFT
• Muscle
• Di-align
• T-Coffee
• Prank (PNAS 2005, Science 2008)
• Opal (ISMB and Bioinf. 2007)
• FSA (PLoS Comp. Bio. 2009)
• Infernal (Bioinf. 2009)
• Etc.
Phylogeny methods
•
•
•
•
•
•
•
•
Bayesian MCMC
Maximum parsimony
Maximum likelihood
Neighbor joining
FastME
UPGMA
Quartet puzzling
Etc.
RAxML: heuristic for large-scale ML optimization
1000-taxon models, ordered by difficulty (Liu et al., 2009)
Re-aligning on a tree
C
A
B
D
Decompose
dataset
A
B
C
D
Align
subsets
Estimate ML
tree on merged
alignment
ABCD
A
B
C
D
Merge
subalignments
SATé and PASTA Algorithms
Obtain initial alignment and
estimated ML tree
Tree
Use tree to compute
new alignment
Estimate ML tree on new
alignment
Alignment
Repeat until termination condition, and
return the alignment/tree pair with the best ML score
SATé-1 (Science 2009) performance
1000-taxon models, ordered by difficulty – rate of evolution generally increases from left to right
SATé-1 24 hour analysis, on desktop machines
(Similar improvements for biological datasets)
SATé-1 can analyze up to about 8,000 sequences.
This course: Material
• Core bioinformatics problems:
–
–
–
–
•
•
Probabilistic models of sequence evolution
Algorithm design techniques
–
–
–
–
–
•
Genome assembly,
Phylogeny estimation,
Multiple sequence alignment, and
Database search
Recursion
Divide-and-conquer
Dynamic programming
Designing heuristic search strategies
Use of profile Hidden Markov Models
Computational complexity
–
–
NP-hardness
Approximation algorithms
Pre-requisites
• CS 466: requires CS 225
• BIOE 498: requires CS 173
All students should be able to program in some
high-level language (of their choice).
This course: grading
• Homework: 35% (includes assigned reading)
• Midterm (March 31): 35%
• Class participation (includes class
presentations): 10%
• Final project: 20% (due last day of course)
Homework
• Generally due Tuesdays by 1 PM (emailed in PDF
format or delivered to my SC office
• Homework assignments will include
–
–
–
–
–
Programming assignments (language of your choice)
Algorithm designs
Proofs
Calculations
Written critiques of published papers (in PDF format)
Final Project
• Can be done by yourself or with one other person
• Options:
– Easiest: Critique of a collection of papers on a particular problem
– Analysis of biological data using different techniques, and paper
describing the differences
– Comparison of two or more methods for the same problem
– Hardest: A new algorithm that you design
• Timeline:
– Project proposals (PDF format, 1 page): April 3
– In class presentation of final project plans: April 7-17
– Projects (PDF format, 6-10 pages): May 3
Tandy Warnow
The Tree of Life: Multiple Challenges
Large datasets:
100,000+ sequences
10,000+ genes
“BigData” complexity
Large-scale statistical phylogeny estimation
Ultra-large multiple-sequence alignment
Estimating species trees from incongruent gene trees
Supertree estimation
Genome rearrangement phylogeny
Reticulate evolution
Visualization of large trees and alignments
Data mining techniques to explore multiple optima
Other
• http://tandy.cs.illinois.edu/CS466.html
• Textbooks:
– Jones and Pevzner, Introduction to Bioinformatics
Algorithms, MIT Press
– Computational Phylogenetics, at
http://tandy.cs.illinois.edu/textbook.pdf
• Office hours: Wednesday 9-10 AM in Siebel 3235
• Homework policy: homework submitted late but
within 24 hours are 80%, no homework can be
submitted more than 24 hours late