Genes - Computer Science Department, Technion

Download Report

Transcript Genes - Computer Science Department, Technion

Introduction to Algorithms in
Computational Biology
Lecture 1
Background Readings: The first three chapters (pages 1-31) in
Genetics in Medicine, Nussbaum et al., 2001.
This class has been edited from Nir Friedman’s lecture which is available at
www.cs.huji.ac.il/~nir.
Changes made by Dan Geiger.
.
Course Information
Meetings:
 Lecture, by Dan Geiger: Mondays 16:30 –18:30, Taub 4.
 Tutorial, by Ydo Wexler: Tuesdays 10:30 – 11:30, Taub 2.
Grade:
 20% in five question sets. These questions sets are obligatory.
Each contains 4-6 theoretical problems. Submit in pairs in two
weeks time
 80% test. Must pass beyond 55 for the homework’s grade to count
Information and handouts:

www.cs.technion.ac.il/~cs236522

A brochure with zeroxed material at Taub library
2
Course Prerequisites
Computer Science and Probability Background
 Data structure 1 (cs234218)
 Algorithms 1 (cs234247)
 Probability (any course)
Some Biology Background
 Formally: None, to allow CS students to take this course.
 Recommended: Biology 1 (especially for those in the
Bioinformatics track), or a similar Biology course, and/or a serious
desire to complement your knowledge in Biology by reading the
appropriate material (see the course web site).
Studying the algorithms in this course while acquiring enough
biology background is far more rewarding than ignoring the
biological context.
3
Relations to Some Other Courses
Intro to Bioinformatics (cs236523). This course covers practical
aspects and hands on experience with web-based bioinformatics
Software . Albeit not a formal requirement, it is recommended that
you look on the web site http://webcourse.technion.ac.il/234523/ and
examine the relevant software.
Algorithms in Computational Biology (cs236522). This is the
current course which focuses on modeling some bioinformatics
problems and presents algorithms for their solution.
Bioinformatics project (cs5236524). Developing bioinformatics
tools under close guidance.
4
First Homework Assignment
Read carefully the first three chapters (pages 1-31) in Genetics in
Medicine, Nussbaum et al., 2001.
Solve two of the questions for Chapter 2
and two of the questions for Chapter 3.
Due time: During the third tutorial class, or earlier in the
teaching assistant’s mail slot. Recall to submit in pairs.
5
Computational Biology
Computational biology is the application of computational tools
and techniques to (primarily) molecular biology. It enables new
ways of study in life sciences, allowing analytic and predictive
methodologies that support and enhance laboratory work. It is a
multidisciplinary area of study that combines Biology, Computer
Science, and Statistics.
Computational biology is also called Bioinformatics, although
many practitioners define Bioinformatics somewhat narrower by
restricting the field to molecular Biology only.
6
Examples of Areas of Interest
•
•
•
•
•
Building evolutionary trees from molecular (and other) data
Efficiently assembling genomes of various organisms
Understanding the structure of genomes (SNP, SSR, Genes)
Understanding function of genes in the cell cycle and disease
Deciphering structure and function of proteins
7
Exponential growth of biological information:
growth of sequences, structures, and literature.
8
Four Aspects
Biological
 What is the task?
Algorithmic
 How to perform the task at hand efficiently?
Learning
 How to adapt/estimate/learn parameters and
models describing the task from examples
Statistics
 How to differentiate true phenomena from
artifacts
9
Example: Sequence Comparison
Biological

Evolution preserves sequences, thus similar genes might
have similar function
Algorithmic

Consider all ways to “align” one sequence against
another
Learning

How do we define “similar” sequences? Use examples to
define similarity
Statistics

When we compare to ~106 sequences, what is a random
match and what is true one
10
Course Goals
 Learning
about computational tools for (primarily)
molecular biology.
 We will cover computational tasks that are posed
by modern molecular biology
 We will discuss the biological motivation and setup
for these tasks
 We will understand the kinds of solutions that exist
and what principles justify them
11
Topics I
Dealing with DNA/Protein sequences:
 Finding similar sequences
 Models of sequences: Hidden Markov Models
 Gene finding
 Genome projects and how sequences are found
12
Topics II
Models of genetic change:
 Long term: evolutionary changes among species
 Reconstructing evolutionary trees from sequences
 Short term: genetic variations in a population
 Finding genes by linkage and association
13
Topics III (One class, if time allows)
Protein World:
 How proteins fold - secondary & tertiary structure
 How to predict protein folds from sequences data
 How to analyze proteins changes from raw
experimental measurements (MassSpec)
14
Human Genome
Most human cells contain
46 chromosomes:

2 sex chromosomes (X,Y):
XY – in males.
XX – in females.

22 pairs of chromosomes
named autosomes.
15
Source: Alberts et al
DNA Organization
16
Source: Alberts et al
The Double Helix
17
DNA Components
Four nucleotide types:
 Adenine
 Guanine
 Cytosine
 Thymine
Hydrogen bonds
(electrostatic connection):
 A-T
 C-G
18
Genome Sizes
 E.Coli
(bacteria)
 Yeast (simple fungi)
 Smallest human chromosome
 Entire human genome
4.6 x 106 bases
15 x 106 bases
50 x 106 bases
3 x 109 bases
19
Genetic Information
Gene – basic unit of genetic
information. They determine
the inherited characters.
 Genome – the collection of
genetic information.
 Chromosomes – storage
units of genes.

20
Genes
The DNA strings include:
 Coding regions (“genes”)
 E. coli has ~4,000 genes
 Yeast has ~6,000 genes
 C. Elegans has ~13,000 genes
 Humans have ~32,000 genes
 Control regions
 These typically are adjacent to the genes
 They determine when a gene should be
expressed
 “Junk” DNA (unknown function)
21
The Cell
All cells of an organism contain the same DNA content
(and the same genes) yet there is a variety of cell types.
22
Example: Tissues in Stomach
How is this variety encoded and expressed ?
23
Central Dogma
‫שעתוק‬
Transcription
Gene
‫תרגום‬
Translation
mRNA
Protein
cells express different subset of the genes
In different tissues and under different conditions
24
Transcription
sequences can be transcribed to RNA
Source: Mathews & van Holde
 Coding
 RNA


nucleotides:
Similar to DNA, slightly different backbone
Uracil (U) instead of Thymine (T)
25
Transcription: RNA Editing
1. Transcribe to RNA
2. Eliminate introns
3. Splice (connect) exons
* Alternative splicing exists
Exons hold information, they are more stable during evolution.
This process takes place in the nucleus. The mRNA molecules
diffuse through the nucleus membrane to the outer cell plasma.
26
RNA roles
Messenger RNA (mRNA)
 Encodes protein sequences. Each three nucleotide acids
translate to an amino acid (the protein building block).
 Transfer RNA (tRNA)
 Decodes the mRNA molecules to amino-acids. It connects
to the mRNA with one side and holds the appropriate
amino acid on its other side.
 Ribosomal RNA (rRNA)
 Part of the ribosome, a machine for translating mRNA to
proteins. It catalyzes (like enzymes) the reaction that
attaches the hanging amino acid from the tRNA to the
amino acid chain being created.
 ...

27
Translation (Outside the nucleolus)
 Translation
is mediated by the ribosome
 Ribosome is a complex of protein & rRNA
molecules
 The ribosome attaches to the mRNA at a translation
initiation site
 Then ribosome moves along the mRNA sequence
and in the process constructs a sequence of amino
acids (polypeptide) which is released and folds into
a protein.
28
Genetic Code
There are 20 amino acids from which proteins are build.
29
Protein Structure
 Proteins
are polypeptides of 70-3000
amino-acids
 This
structure is
(mostly) determined
by the sequence of
amino-acids that
make up the protein
30
Protein Structure
31
Evolution
 Related
organisms have similar DNA
 Similarity in sequences of proteins
 Similarity in organization of genes along the
chromosomes
 Evolution plays a major role in biology
 Many mechanisms are shared across a wide
range of organisms
 During the course of evolution existing
components are adapted for new functions
32
Evolution
Evolution of new organisms is driven by
 Diversity
 Different individuals carry different variants of
the same basic blue print
 Mutations
 The DNA sequence can be changed due to
single base changes, deletion/insertion of DNA
segments, etc.
 Selection bias
33
Source: Alberts et al
The Tree of Life
34
Example for Phylogenetic Analysis
Input: four nucleotide sequences: AAG, AAA, GGA, AGA taken
from four species.
Question: Which evolutionary tree best explains these sequences ?
One Answer (the parsimony principle): Pick a tree that has a
minimum total number of substitutions of symbols between species
and their originator in the evolutionary tree (Also called
phylogenetic tree).
AAA
AAA
1
AAG
2
GGA
AAA
AAA
1
AGA
Total #substitutions = 4
35
Example Continued
There are many trees possible. For example:
AAA 1
AAA
1
AAG
AAA
AAA
AAA
AGA
1
GGA
AGA
1
AAG
1
AGA
AAA
AAA
2
GGA
Total #substitutions = 3
Total #substitutions = 4
The left tree is “better” than the right tree.
Questions:
Is this principle yielding realistic phylogenetic trees ? (Evolution)
How can we compute the best tree efficiently ? (Computer Science)
What is the probability of substitutions given the data ? (Learning)
Is the best tree found significantly better than others ? (Statistics)
36
Werner’s Syndrome
A successful application of genetic
linkage analysis
37
The Disease
 First
references in 1960s
 Causes premature ageing
 Linkage studies from 1992
 WRN gene cloned in 1996
 Subsequent discovery of mechanisms involved in
wild-type and mutant proteins
38
A sample Input
D
H
D D
A2 A2
2
1
A2/A2
A1/A1
Phase inferred
H D
A1 A2
Recombinant
H
H
4
3
A2/A2
A1/A2
D D
A1 A2
D
H |D
A2 | A2
5
A1/A2
The study used 13 Markers; here we see only one.
The study used 14 families; here we see only one.
39
Genehunter Output
distance between
markers in centimorgans
Most ‘likely’
position
position LOD_score
information
Marker’s name
0.00 -1.254417
0.224384
1.52
2.836135
0.226379
...[data skipped]...
Log likelihood of
18.58
19.92
21.26
22.60
22.92
23.24
23.56
13.688599
14.238474
14.718037
15.159389
15.056713
14.928614
14.754848
placing disease
0.384088
gene at distance,
0.401992
relative to it being
0.426818
unlinked.
0.462284
D8S131
0.462510
D8S339
D8S259
0.463208
0.464387
...[data skipped]...
81.84
1.939215
90.60 -11.930449
0.059748
0.087869
Maximum log
likelihood score
40
Final Location
Marker
D8S259
location of marker
D8S339
Marker
D8S131
WRN Gene
final location
Error in location by genetic linkage of about 1.25M base pairs.
41