Transcript Slide 1

Computational Genomics
Lecture 1, Tuesday April 1, 2003
Biology in One Slide: 2 Paradigms
Molecular Paradigm
Evolution Paradigm
High Throughput Biology
DNA Sequencing
Gene Expression
…ACGTGACTGAGGACCGTG
CGACTGAGACTGACTGGGT
CTAGCTAGACTACGTTTTA
TATATATATACGTCGTCGT
ACTGATGACTAGATTACAG
ACTGATTTAGATACCTGAC
TGATTTTAAAAAAATATT…
Biology is becoming an information science
Goals of this course
• Introduction to Computational Biology
 Basic biology for computer scientists
 Breadth: mention many topics & applications
• In-depth coverage of Computational Genomics
 Algorithms for sequence analysis
 Current applications, trends, and open problems
•
Coverage of useful algorithms




Hidden Markov models
Dynamic Programming
String algorithms
Applications of AI techniques
Topics in CS262
Part 1: In-depth coverage of basic computational methods
for analysis of biological sequences


Sequence Alignment & Dynamic Programming
Hidden Markov models
These methods are used heavily in most genomics
applications:



DNA sequencing
Comparison of DNA and proteins across organisms
Discovery of genes, promoters, regulatory sites
Topics in CS262
Part 2: Topics in computational genomics, more algorithms,
and areas of active research

DNA sequencing & assembly: reading a complete genome such
as the human DNA

Gene finding: marking genes on the DNA sequence

Large-scale comparative genomics: comparing whole genomes
from multiple organisms

Microarrays & regulation: understanding the regulatory code, and
potential disease-causing genes

RNA structure: predicting the folding of RNA

Phylogeny and evolution: quantifying the evolution of biological
sequences
Course responsibilities
• Homeworks




[72%]
4 challenging problem sets, 4-5 problems/pset
Collaboration allowed – please give credit
Hws due Thursday, solutions explained Friday
Two worst problems in all hws do not count
• Final
[18%]
 Takehome, 1 day
 Collaboration not allowed
 Basic questions – much easier than homeworks
• Scribing
[10%]
 Due one week after the lecture, except special permission
Reading material
• Books
 “Biological sequence analysis” by Durbin, Eddy, Krogh,
Mitchinson
• Chapters 1-4, 6, (7-8), (9-10)
 “Algorithms on strings, trees, and sequences” by Gusfield
• Chapters (5-7), 11-12, (13), 14, (17)
• Papers
• Lecture notes
Topic 1. Sequence Alignment
Complete genomes
Evolution
Evolution at the DNA level
C
…ACGGTGCAGTCACCA…
…ACGTTGCAGTCCACCA…
SEQUENCE EDITS
REARRANGEMENTS
Evolutionary Rates
next generation
OK
OK
OK
Changes in non-functional sites are OK, so will be propagated
X
X
Still OK?
Most changes in functional sites are deleterious and will be rejected
Sequence conservation implies function
100%
40%
Interleukin region in human and mouse
Sequence Alignment
AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
Definition
Given two strings
x = x1x2...xM, y = y1y2…yN,
an alignment is an assignment of gaps to positions
0,…, M in x, and 0,…, N in y, so as to line up each
letter in one sequence with either a letter, or a gap
in the other sequence
What is a good alignment?
Alignment:
The “best” way to match the letters of one sequence with
those of the other
How do we define “best”?
Alignment:
A hypothesis that the two sequences come from a
common ancestor through sequence edits
Parsimonious explanation:
Find the minimum number of edits that transform one
sequence into the other
Scoring Function
• Sequence edits:
AGGCCTC
 Mutations
AGGACTC
 Insertions
AGGGCCTC
 Deletions
AGG.CTC
Scoring Function:
Match:
+m
Mismatch: -s
Gap:
-d
Score F = (# matches)  m - (# mismatches)  s – (#gaps)  d