Transcript Slide 1
Computational Genomics
Lecture 1, Tuesday April 1, 2003
Biology in One Slide: 2 Paradigms
Molecular Paradigm
Evolution Paradigm
High Throughput Biology
DNA Sequencing
Gene Expression
…ACGTGACTGAGGACCGTG
CGACTGAGACTGACTGGGT
CTAGCTAGACTACGTTTTA
TATATATATACGTCGTCGT
ACTGATGACTAGATTACAG
ACTGATTTAGATACCTGAC
TGATTTTAAAAAAATATT…
Biology is becoming an information science
Goals of this course
• Introduction to Computational Biology
Basic biology for computer scientists
Breadth: mention many topics & applications
• In-depth coverage of Computational Genomics
Algorithms for sequence analysis
Current applications, trends, and open problems
•
Coverage of useful algorithms
Hidden Markov models
Dynamic Programming
String algorithms
Applications of AI techniques
Topics in CS262
Part 1: In-depth coverage of basic computational methods
for analysis of biological sequences
Sequence Alignment & Dynamic Programming
Hidden Markov models
These methods are used heavily in most genomics
applications:
DNA sequencing
Comparison of DNA and proteins across organisms
Discovery of genes, promoters, regulatory sites
Topics in CS262
Part 2: Topics in computational genomics, more algorithms,
and areas of active research
DNA sequencing & assembly: reading a complete genome such
as the human DNA
Gene finding: marking genes on the DNA sequence
Large-scale comparative genomics: comparing whole genomes
from multiple organisms
Microarrays & regulation: understanding the regulatory code, and
potential disease-causing genes
RNA structure: predicting the folding of RNA
Phylogeny and evolution: quantifying the evolution of biological
sequences
Course responsibilities
• Homeworks
[72%]
4 challenging problem sets, 4-5 problems/pset
Collaboration allowed – please give credit
Hws due Thursday, solutions explained Friday
Two worst problems in all hws do not count
• Final
[18%]
Takehome, 1 day
Collaboration not allowed
Basic questions – much easier than homeworks
• Scribing
[10%]
Due one week after the lecture, except special permission
Reading material
• Books
“Biological sequence analysis” by Durbin, Eddy, Krogh,
Mitchinson
• Chapters 1-4, 6, (7-8), (9-10)
“Algorithms on strings, trees, and sequences” by Gusfield
• Chapters (5-7), 11-12, (13), 14, (17)
• Papers
• Lecture notes
Topic 1. Sequence Alignment
Complete genomes
Evolution
Evolution at the DNA level
C
…ACGGTGCAGTCACCA…
…ACGTTGCAGTCCACCA…
SEQUENCE EDITS
REARRANGEMENTS
Evolutionary Rates
next generation
OK
OK
OK
Changes in non-functional sites are OK, so will be propagated
X
X
Still OK?
Most changes in functional sites are deleterious and will be rejected
Sequence conservation implies function
100%
40%
Interleukin region in human and mouse
Sequence Alignment
AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
Definition
Given two strings
x = x1x2...xM, y = y1y2…yN,
an alignment is an assignment of gaps to positions
0,…, M in x, and 0,…, N in y, so as to line up each
letter in one sequence with either a letter, or a gap
in the other sequence
What is a good alignment?
Alignment:
The “best” way to match the letters of one sequence with
those of the other
How do we define “best”?
Alignment:
A hypothesis that the two sequences come from a
common ancestor through sequence edits
Parsimonious explanation:
Find the minimum number of edits that transform one
sequence into the other
Scoring Function
• Sequence edits:
AGGCCTC
Mutations
AGGACTC
Insertions
AGGGCCTC
Deletions
AGG.CTC
Scoring Function:
Match:
+m
Mismatch: -s
Gap:
-d
Score F = (# matches) m - (# mismatches) s – (#gaps) d