
Stat 877(992)
Statistical methods in molecular biology
Course plans
• Team taught: Newton, Larget, Ane, Keles,
Kendziorski, Broman, Yandell
• Per instructor homework set (six at 12pts each)
• Final project, poster presentation (28 pts)
National Research Council Report, 2004
Mathematics and 21st Century Biology
“Progress in the biosciences will increasingly depend on deep and broad
integration of mathematical analysis into studies at all levels of
biological organization…: molecules, cells, organisms, populations, and
ecosystems.”
“The committee regards the interface between mathematics and biology
as biology-driven.”
Some definitions [first approximations!]
cell: structural/functional unit of all living organisms
protein: organic compound produced and used by cell
amino acid: protein building block
nucleic acid: chainlike molecule involved in preservation, replication, and expression of hereditary information in every living cell
nucleotide: nucleic acid building block
Example function: oxygen transport
2-3 x 10^13 red blood cells/body
2 x 10^6 new cells/second
95% of dry weight is the protein hemoglobin
hemoglobin
more about hemoglobin
sequence of amino acids in hemoglobin
• alpha chain (141 amino acids) [2 subunits]
VLSAADKTNVKAAWSKVGGHAGEYGAEALERMFLGFPTTKTYFPHFDLSHGSAQVKA
HGKKVADGLTLAVGHLDDLPGALSDLSNLHAHKLRVDPVNFKLLSHCLLSTLAVHLPND
FTPAVHASLDKFLSSVSTVLTSKYR
• beta chain (146 amino acids) [2 subunits]
VQLSGEEKAAVLALWDKVNEEEVGGEALGRLLVVYPWTQRFFDSFGDLSN
PGAVMGNPKVKAHGKKVLHSFGEGVHHLDNLKGTFAALSELHCDKLHVDP
ENFRLLGNVLALVVARHFGKDFTPELQASYQKVVAGVANALAHKYH
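A quick check of the chain lengths quoted on the slide can be sketched in Python. The two strings are copied from the slide; the names ALPHA, BETA and composition are illustrative only, and the residue tally is a simple count, not a claim about any particular analysis in the course.

```python
from collections import Counter

# Chain sequences copied from the slide (one-letter amino acid codes).
ALPHA = (
    "VLSAADKTNVKAAWSKVGGHAGEYGAEALERMFLGFPTTKTYFPHFDLSHGSAQVKA"
    "HGKKVADGLTLAVGHLDDLPGALSDLSNLHAHKLRVDPVNFKLLSHCLLSTLAVHLPND"
    "FTPAVHASLDKFLSSVSTVLTSKYR"
)
BETA = (
    "VQLSGEEKAAVLALWDKVNEEEVGGEALGRLLVVYPWTQRFFDSFGDLSN"
    "PGAVMGNPKVKAHGKKVLHSFGEGVHHLDNLKGTFAALSELHCDKLHVDP"
    "ENFRLLGNVLALVVARHFGKDFTPELQASYQKVVAGVANALAHKYH"
)


def composition(seq):
    """Count how often each one-letter amino acid code occurs in seq."""
    return Counter(seq)


# Lengths should match the counts quoted on the slide.
print("alpha:", len(ALPHA), "beta:", len(BETA))
# The three most frequent residues in each chain.
print(composition(ALPHA).most_common(3))
print(composition(BETA).most_common(3))
```

Hemoglobin has two copies of each chain, so the whole-molecule residue count is 2 * (len(ALPHA) + len(BETA)).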
A few amino acids (among 20 standard)
V = Val = Valine
L = Leu = Leucine
M = Met = Methionine
more about amino acids
Amino acids are concatenated into protein by the translation
of information stored in messenger RNA
Ribonucleic acid (RNA)
Nucleotide bases
A = adenine
C = cytosine
U = uracil
G = guanine
single stranded
[Figure: a messenger RNA (mRNA) read codon by codon, giving Met, Thr, Glu, Leu, Arg, Ser, stop]
Amino acids are encoded by triples of mRNA nucleotides called codons
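Reading codons can be sketched in a few lines of Python. The mRNA string below is hypothetical, chosen only so that it spells out the residues shown on the slide (Met, Thr, Glu, Leu, Arg, Ser, stop). The table holds just the seven codons needed here, not the full 64-codon standard code, and the names CODON_TABLE, translate and EXAMPLE_MRNA are illustrative.

```python
# Partial codon table: only the codons used in this example.
# The full standard genetic code has 64 codons.
CODON_TABLE = {
    "AUG": "Met",
    "ACG": "Thr",
    "GAG": "Glu",
    "CUU": "Leu",
    "CGG": "Arg",
    "AGC": "Ser",
    "UAG": "stop",
}


def translate(mrna):
    """Read mRNA 5' to 3' in non-overlapping triples until a stop codon."""
    peptide = []
    for i in range(0, len(mrna) - 2, 3):
        residue = CODON_TABLE[mrna[i:i + 3]]
        if residue == "stop":
            break
        peptide.append(residue)
    return peptide


# Hypothetical sequence, not taken from a real gene.
EXAMPLE_MRNA = "AUGACGGAGCUUCGGAGCUAG"
print("-".join(translate(EXAMPLE_MRNA)))
```

A codon missing from the partial table raises KeyError, which is deliberate in a sketch this small: it flags input that the table does not cover.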
more about the genetic code
Translation: mRNA to protein via ribosome & tRNA
Base pairing
A-U, G-C
video podcast of translation
mRNA structure
orientation 5’ to 3’
UTR = untranslated region: affects mRNA stability, mRNA localization, translational efficiency
Mature mRNA may have been processed by
splicing a primary transcript (pre-mRNA)
Primary transcripts are produced by the transcription of DNA
Deoxyribonucleic acid (DNA)
double stranded
4 nucleotide bases ATGC
base pairing: A-T, C-G
Transcription: DNA to RNA via RNA polymerase
initiate
elongate
terminate
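As a rough sketch of what transcription does to sequence (ignoring initiation, termination and all processing), the mRNA is the base-paired complement of the DNA template strand, with U in place of T. The function name transcribe and the example template are hypothetical.

```python
# Base pairing used when RNA is built against a DNA template:
# template A pairs with U in RNA, T with A, G with C, C with G.
DNA_TO_RNA = {"A": "U", "T": "A", "G": "C", "C": "G"}


def transcribe(template_5to3):
    """Return the mRNA (written 5' to 3') for a DNA template strand.

    The template is given 5' to 3'; RNA polymerase reads it 3' to 5',
    so the template is reversed before pairing each base.
    """
    return "".join(DNA_TO_RNA[base] for base in reversed(template_5to3))


# Hypothetical template strand, not taken from a real gene.
EXAMPLE_TEMPLATE = "CTAGCTCCGAAGCTCCGTCAT"
print(transcribe(EXAMPLE_TEMPLATE))
```

Equivalently, the mRNA matches the non-template (coding) strand with every T replaced by U.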
Central dogma of molecular biology
Replication: DNA copies itself during cell division
More on organization of DNA
Chromosomes are organized structures of DNA and proteins
that are found in cells. Each chromosome contains a single
continuous piece of DNA.
In diploid species,
chromosomes are paired.
Human chromosome      Total number of base pairs
1                     247,200,000
2                     242,750,000
3                     199,450,000
4                     191,260,000
5                     180,840,000
6                     170,900,000
7                     158,820,000
8                     146,270,000
9                     140,440,000
10                    135,370,000
11                    134,450,000
12                    132,290,000
13                    114,130,000
14                    106,360,000
15                    100,340,000
16                    88,820,000
17                    78,650,000
18                    76,120,000
19                    63,810,000
20                    62,440,000
21                    46,940,000
22                    49,530,000
X (sex chromosome)    154,910,000
Y (sex chromosome)    57,740,000
A genome is the sequence of one full copy of each chromosome:
about 3 Gbp, or roughly 100 years at 1 bp/second
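The genome-size arithmetic can be checked directly from the per-chromosome counts on the slide. This is a minimal sketch; the dictionary name CHROMOSOME_BP and the choice of a 365.25-day year are assumptions made for the illustration.

```python
# Base-pair counts per human chromosome, copied from the slide
# (chromosomes 1-22, then X and Y).
CHROMOSOME_BP = {
    "1": 247_200_000, "2": 242_750_000, "3": 199_450_000,
    "4": 191_260_000, "5": 180_840_000, "6": 170_900_000,
    "7": 158_820_000, "8": 146_270_000, "9": 140_440_000,
    "10": 135_370_000, "11": 134_450_000, "12": 132_290_000,
    "13": 114_130_000, "14": 106_360_000, "15": 100_340_000,
    "16": 88_820_000, "17": 78_650_000, "18": 76_120_000,
    "19": 63_810_000, "20": 62_440_000, "21": 46_940_000,
    "22": 49_530_000, "X": 154_910_000, "Y": 57_740_000,
}

# One copy of every chromosome.
total_bp = sum(CHROMOSOME_BP.values())

# Assumption: a year of 365.25 days.
SECONDS_PER_YEAR = 365.25 * 24 * 3600
years_at_1bp_per_second = total_bp / SECONDS_PER_YEAR

print(f"{total_bp / 1e9:.2f} Gbp")
print(f"about {years_at_1bp_per_second:.0f} years at 1 bp/second")
```

The total comes out slightly above 3 Gbp, which is consistent with the rounded figures quoted on the slide.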
Estimates from Sanger’s Vertebrate Genome Annotation (VEGA) database, 7/07
2001:
drafts of the human genome sequence published
1% of bases are in exons
24% of bases are in introns
2007:
pilot phase of ENCODE project completed
Encyclopedia Of DNA Elements
majority of bases are transcribed
extensive transcript overlap
functions poorly understood
Evolving definition of gene
1860s-1900s: a discrete unit of heredity (Mendel)
1910s: a distinct locus (Morgan)
1940s: the blueprint for a protein (Beadle & Tatum)
1960s: a transcribed code (Watson & Crick)
Genome era: a locatable region of genomic sequence,
corresponding to a unit of inheritance, which is associated
with regulatory regions, transcribed regions and/or other
functional sequence regions
Figure 5
Mark B. Gerstein et al. Genome Res. 2007; 17: 669-681
Post ENCODE
The gene is a union of genomic sequences encoding a
coherent set of potentially overlapping functional products
Gerstein et al. 2007
What about Statistics?
Statistics supports the development of genomic resources
• In accommodating sequencing errors for genome assembly
• In rating the significance of sequence matches by
alignment algorithms
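One simple way to see what "rating the significance of a match" means is a shuffle test: score the observed pair, then ask how often a randomly shuffled sequence scores at least as well. This is a toy sketch under stated assumptions, not the method used by alignment software, which typically scores gapped alignments and uses analytic score distributions. The score here is plain ungapped identity, and the function names, sequences and default parameters are all hypothetical.

```python
import random


def identity_score(a, b):
    """Number of positions where two equal-length sequences agree (no gaps)."""
    return sum(x == y for x, y in zip(a, b))


def permutation_p_value(a, b, n_shuffles=2000, seed=0):
    """Estimate how often a shuffled b matches a at least as well as b does."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = identity_score(a, b)
    b_chars = list(b)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(b_chars)  # keeps composition, destroys order
        if identity_score(a, b_chars) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_shuffles + 1)


# Hypothetical sequences, for illustration only.
QUERY = "ACGTACGTACGTACGTACGT"
p_same = permutation_p_value(QUERY, QUERY)
p_other = permutation_p_value(QUERY, "TTGACCAGTGCAAGTCCATG")
print(p_same, p_other)
```

Shuffling preserves each sequence's base composition, so the test asks whether the order of the letters, not just their frequencies, explains the match.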
Statistics supports analyses to determine the
function of genes/transcripts/proteins
• Gene regulation
• Gene expression
• Network considerations (many processes/functions)
Example: oxygen transport
According to the Gene Ontology (GO) project,
46 different genes are involved in this biological process
Statistics is critical in analyzing patterns of
genomic variation within populations, and in
relating this variation to disease states or other
phenotypes
• Genomes differ from the reference copy
(single nucleotide polymorphisms, structural variants)
• Gene mapping by linkage and association methods
Statistics is critical in analyzing patterns of
genomic variation between populations/species
• Phylogenetic analysis
“Nothing in biology makes sense except in the light of evolution”
- T. Dobzhansky
Tree of life project
“It is interesting to contemplate a tangled bank, clothed with many
plants of many kinds, with birds singing on the bushes, with various
insects flitting about, and with worms crawling through the damp
earth, and to reflect that these elaborately constructed forms, so
different from each other, and dependent upon each other in so
complex a manner, have all been produced by laws acting around
us. These laws, taken in the largest sense, being Growth with
reproduction; Inheritance which is almost implied by reproduction;
Variability from the indirect and direct action of the conditions of life,
and from use and disuse; a Ratio of Increase so high as to lead to a
Struggle for Life, and as a consequence to Natural Selection,
entailing Divergence of Character and the Extinction of less
improved forms. Thus, from the war of nature, from famine and
death, the most exalted object which we are capable of conceiving,
namely, the production of the higher animals, directly follows.”
- Charles Darwin