Clustering for Accuracy, Performance, and Alternative
Genetics and Molecular Biology
Tutorial II -- Computational
Perspective
The goal is to introduce some topics to
individuals with a minimal background in
genetics/biology, and yet try to provide some
examples of topics to maintain the interest of
individuals with extensive biological/genetics
backgrounds.
Gene structure
Outline
– genomic structure vs mRNA structure
– coding and noncoding exons
– introns
– primary transcript processing
aside -- nonsense mediated mRNA degradation
– alternative splicing and differential
polyadenylation
– evolutionary conservation of coding and
noncoding sequences
2
Outline…
Genomic structure
– repetitive sequences
LINES and SINES
– example -- Y chromosome palindromes
– C value paradox
– genomes of model organisms
example
– yeast genome and gene-chip
– single/double knockouts
– cross-species sequence similarities for
putative function identification
example -- “chaperonin”
3
Fundamental Genetics and
Probability Concepts
meiosis and sampling
patterns of inheritance
monogenic and complex inheritance
– phenocopy
– reduced penetrance
DNA variation
– polymorphisms, SNPs, and mutations
positional cloning
4
Gene Structure
5
Transcript Processing
DNA -> pre-mRNA -> mRNA -> protein
6
Nonsense mediated mRNA
degradation
– unknown mechanism
– more rapidly degrades mRNA containing
premature stop (nonsense) codons
– Lykke-Andersen, “mRNA quality control:
Marking the message for life or death.”
Current Biology, 11, 2001.
7
Nonsense Mediated mRNA Degradation
8
Genome Structure -- repeat classes
Class (blocks)                Size of Repeat   Chr Locations
Megasatellite (100s of kb)    several kb       various locations
  RS447                       4.7 kb           ~50-70 copies on 4, several on 8
  untitled                    2.5 kb           ~400 copies on 4 and 19
  untitled                    3.0 kb           ~50 copies on X
Satellite (100 kb to Mbs)     5-171 bp         centromeric
  alphoid                     171 bp           centromeric heterochromatin, all chrs
  Sau3A family                68 bp            centromeric heterochromatin, chrs 1, 9, 13, 14, 15, 21, 22, 6
  satellite 1 (AT rich)       25-48 bp         centromeric heterochromatin, most chrs
  satellites 2 and 3          5 bp             most chrs
Minisatellite (0.1-20 kb)     6-64 bp          at or close to telomeres
  telomeric family            6 bp             all telomeres
  hypervariable family        9-64 bp          all chrs, often near telomeres
Microsatellite (<150 bp)      1-4 bp           dispersed through all chromosomes
9
C-Value Paradox
Hartl, “Molecular melodies in high and low C,” Nat. Rev. Genetics, Nov. 2000
refers to the massive, counterintuitive
and seemingly arbitrary differences in
genome size observed in eukaryotic
organisms
– Drosophila melanogaster 180 Mb
– Podisma pedestris 18,000 Mb
– difference is difficult to explain in view of
apparently similar levels of evolutionary,
developmental, and behavioral complexity
10
Alternative Splicing
Every conceivable pattern of alternative
splicing is found in nature. Exons have
multiple 5’ or 3’ splice sites alternatively
used (a, b). Single cassette exons can
reside between 2 constitutive exons
such that alternative exon is either
included or skipped (c). Multiple
cassette exons can reside between 2
constitutive exons such that the splicing
machinery must choose between them
(d). Finally, introns can be retained in
the mRNA and become translated.
Graveley, “Alternative splicing:
increasing diversity in the proteomic
world.” Trends in Genetics, Feb., 2001.
11
Classic View of Gene No Longer
Valid -- Strachan pg 185
Mechanism                        Frequency/Examples
multigenic transcription units   rare. 18S, 28S, and 5.8S rRNA, mitochondria
alternative promoters            common. dystrophin gene (8)
alternative splicing             very frequent. slo gene (8 cassettes), >500 mRNAs
alternative polyadenylation      common. calcitonin gene (2)
RNA editing                      extremely rare. apolipoprotein B gene (tissue specific editing – codon changed)
post-translational cleavage      rare. may generate functionally related polypeptides – hormones. insulin
12
Alternative Splicing Example -- Graveley 2001
13
Alternative PolyAdenylation
common in human RNA (Edwards-Gilbert 1997)
in many genes, 2 or more poly-A signals
in 3’ UTR
– alternative transcripts can show tissue
specificity
alternative poly-A signals may be
brought into play following alternative
splicing
14
Edwards-Gilbert. Nucleic Acids Res, 13, 1997
15
Evolution of
the
mitochondrial
genome and
origin of
eukaryotic cells
16
Evolutionary Conservation of
Coding and Noncoding Sequences
Sequencing of H. sapiens and model
organisms is basis for comparative
genomics
Generally, functional solutions (encoded as
genes) are conserved across organisms, which
allows us to compare gene sequences and
infer function
protein functional/structural regions ==
“domains”
Intergenic regions are generally not
conserved (always exceptions)
17
Example - MKKS (UniGene Clusters)
human - rat     87.4%
human - mouse   84.9%
human - cow     87.1%
mouse - rat     97.8%
rat - cow       91.0%
mouse - cow     85.1%
frog - rat      62.5%
18
Example - MKKS
19
20
Computational Approach to
Using Conserved Regions
Problem -- want to screen genes for
mutations
Conventional approach -- screen all
exons of a single gene
Alternative -- identify domains within
multiple genes, and screen domains
first, to optimize screening time and
resources
21
Cross-Species Similarities
yeast
– gene chip for hybridization/expression
– complete genome (first eukaryote)
– single knockouts and double knockouts
22
Fundamental Genetics
meiosis
– H. sapiens are diploid
– meiosis produces haploid gametes
– mechanism for transmission of genetic
material to offspring
– recombination by cross-over (Holliday
structure) or by independent segregation of
homologous pairs
23
Fundamental Genetics (Background for
Linkage Analysis)
Rule of Segregation
– offspring receive ONE allele (genetic
material) from the pair of alleles possessed
by EACH parent
Rule of Independent Assortment
– alleles of one gene can segregate
independently of alleles of other genes
– (Linkage Analysis relies on the violation of
Independent Assortment Rule)
24
Genetic Marker … Prelude to LA
– A genetic marker allows for the observation of
the genetic state at a particular genomic location
(locus).
A genotype is the measured state of a genetic marker.
May never be feasible to sequence cases directly.
– An “informative” marker is often “heterozygous,”
or “polymorphic” and enables the observation of
the inheritance of genetic material.
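One common way to quantify how informative a marker is uses its expected heterozygosity, H = 1 - sum(p_i^2), the chance that a random individual carries two different alleles. The sketch below assumes Hardy-Weinberg proportions; the function name and the allele frequencies are made up for illustration and are not taken from the slides.

```python
def expected_heterozygosity(allele_freqs):
    """Chance that a random individual is heterozygous at a marker,
    assuming Hardy-Weinberg proportions: H = 1 - sum(p_i^2)."""
    total = sum(allele_freqs)
    if abs(total - 1.0) > 1e-9:
        raise ValueError("allele frequencies must sum to 1")
    return 1.0 - sum(p * p for p in allele_freqs)


# Hypothetical markers (illustrative frequencies only).
two_allele_marker = [0.9, 0.1]      # one common and one rare allele
four_allele_marker = [0.25] * 4     # four equally common alleles

h_two = expected_heterozygosity(two_allele_marker)
h_four = expected_heterozygosity(four_allele_marker)
print(f"two alleles:  H = {h_two:.2f}")
print(f"four alleles: H = {h_four:.2f}")
```

Under these assumptions a marker with more, evenly distributed alleles is heterozygous in more individuals, so more meioses can be followed.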
25
Monogenic and Polygenic Diseases
– monogenic (Mendelian) -- one gene
“simple” (dominant and recessive) Mendelian
inheritance
direct correspondence between one gene
mutation and one disorder
majority of disease genes found are monogenic
– polygenic -- (complex) multiple genes
heterogeneity and epistasis
combinatorics
no longer have direct correspondence between
one gene and disorder
majority of disorders are probably polygenic
– complexity of organisms and observed pathways
26
...Monogenic and Polygenic Diseases
phenocopy
reduced penetrance
– Example -- sickle cell anemia
“classic” recessive disorder
defect in red blood cells (hemoglobin)
but… infant hemoglobin gene can “leak”
wide range of phenotypes
27
Examples
28
Examples
29
Example
30
BBS4 Pedigree
31
Hardy-Weinberg Equilibrium
Rule that relates allelic and genotypic frequencies in a
population of diploid, sexually reproducing individuals
if that population has random mating, large size, no
mutation or migration, and no selection
Implications
– allelic frequencies will not change in a population
from one generation to the next
– genotypic frequencies are determined in a
predictable way by allelic frequencies
– the equilibrium is neutral -- if perturbed, it will
reestablish within one generation of random mating
at the new allelic frequency
32
33
H-W
f(AA) = p^2
f(Aa) = 2pq
f(aa) = q^2
(p + q)^2 = p^2 + 2pq + q^2
(p + q + r)^2 = p^2 + q^2 + r^2 + 2pq + 2pr + 2qr
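The two-allele relations can be checked numerically. This is a minimal sketch; the function names and the genotype counts are hypothetical, chosen only to illustrate the arithmetic.

```python
def hardy_weinberg(p):
    """Expected genotype frequencies at a two-allele locus, where p is
    the frequency of allele A and q = 1 - p is the frequency of a."""
    q = 1.0 - p
    return {"AA": p * p, "Aa": 2 * p * q, "aa": q * q}


def allele_freq_from_counts(n_AA, n_Aa, n_aa):
    """Frequency of allele A estimated from genotype counts: each AA
    individual carries two copies of A, each Aa individual one."""
    return (2 * n_AA + n_Aa) / (2 * (n_AA + n_Aa + n_aa))


# Hypothetical sample of 100 individuals.
p_hat = allele_freq_from_counts(49, 42, 9)
expected = hardy_weinberg(p_hat)
print(p_hat, expected)
```

Comparing the expected frequencies with the observed genotype proportions is the usual first check for departure from equilibrium.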
34
Dominant and Recessive
Penetrance Modeled
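penetrance = P(pt | gt), the probability of the phenotype given the genotype
                                DD    Dd    dd
dominant                        1     1     0
recessive                       0     0     1
dominant, reduced penetrance    0.9   0.9   0.0
recessive, reduced penetrance   0     0     0.8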
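A penetrance table can be combined with Hardy-Weinberg genotype frequencies to give the expected population prevalence of the phenotype. The sketch below encodes the four penetrance vectors from this slide; the model names, the function name, and the allele frequencies are assumptions made for illustration.

```python
# P(affected | genotype) for the four penetrance vectors on the slide.
# Genotype labels follow the slide; the model names are hypothetical.
PENETRANCE = {
    "dominant":          {"DD": 1.0, "Dd": 1.0, "dd": 0.0},
    "recessive":         {"DD": 0.0, "Dd": 0.0, "dd": 1.0},
    "reduced_dominant":  {"DD": 0.9, "Dd": 0.9, "dd": 0.0},
    "reduced_recessive": {"DD": 0.0, "Dd": 0.0, "dd": 0.8},
}


def prevalence(model, freq_D):
    """Expected fraction of affected individuals in a population:
    sum over genotypes of P(genotype) * P(affected | genotype),
    with genotype frequencies taken from Hardy-Weinberg proportions."""
    p, q = freq_D, 1.0 - freq_D
    genotype_freq = {"DD": p * p, "Dd": 2 * p * q, "dd": q * q}
    table = PENETRANCE[model]
    return sum(genotype_freq[g] * table[g] for g in genotype_freq)


# Illustrative allele frequencies (not from the slides).
k_dom = prevalence("dominant", 0.01)
k_red = prevalence("reduced_dominant", 0.01)
k_rec = prevalence("recessive", 0.99)   # d allele frequency is 0.01
print(k_dom, k_red, k_rec)
```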
35
D-R Heterogeneous, DD Epistatic
D-R heterogeneous
      AA   Aa   aa
BB    1    1    0
Bb    1    1    0
bb    1    1    1
DD epistatic
      AA   Aa   aa
BB    1    1    0
Bb    1    1    0
bb    0    0    0
reduced penetrance
3, 9, 27, 81, 243, … 3^n
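The two 3 x 3 penetrance tables can be written as simple rules over the genotypes at loci A and B. This is a sketch under the reading that the first table is dominant-recessive heterogeneity and the second is dominant-dominant epistasis; the function names are hypothetical.

```python
from itertools import product

LOCUS_A = ("AA", "Aa", "aa")
LOCUS_B = ("BB", "Bb", "bb")


def heterogeneous(gt_a, gt_b):
    """Dominant-recessive heterogeneity: affected with at least one A
    allele OR with the recessive bb genotype."""
    return int("A" in gt_a or gt_b == "bb")


def epistatic(gt_a, gt_b):
    """Dominant-dominant epistasis: affected only with at least one A
    allele AND at least one B allele."""
    return int("A" in gt_a and "B" in gt_b)


def penetrance_table(model):
    """Full two-locus table keyed by (genotype at A, genotype at B)."""
    return {(a, b): model(a, b) for a, b in product(LOCUS_A, LOCUS_B)}


def n_genotype_combinations(n_loci):
    """Number of multi-locus genotypes for n two-allele loci: 3^n."""
    return 3 ** n_loci


het = penetrance_table(heterogeneous)
epi = penetrance_table(epistatic)
print(het[("aa", "bb")], epi[("aa", "bb")], n_genotype_combinations(5))
```

The 3^n growth is why searching over multi-locus models quickly becomes a combinatorial problem.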
36
Dom-Rec Heterozygous
Screen genes A, B?, b
37
Uninformative Marker
38
Informative Marker
39
Given the following observations: family structure,
affection status, genotypes, and disease allele
frequencies. Assuming a model for the disease, can
we calculate the probability that these observations
“fit” an assumed model?
40
Linkage
41
Linkage Analysis
Goal: find a marker “linked” to a disease
gene.
LOD score = log10 of the likelihood ratio
likelihood L[θ; data] = k P[data; θ]
LOD(θ) = log10( L[θ; data] / L[θ = 0.5; data] )
θ (theta) = estimate of genetic distance
(recombination fraction) between marker
and disease
= proportion of recombinant
gametes/total gametes
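For the simplest case, where every meiosis can be scored directly as recombinant or non-recombinant, the LOD score has a closed form. The sketch below assumes phase-known, fully informative meioses, which is a simplification of what pedigree programs compute; the function names and the counts are hypothetical.

```python
import math


def lod_score(theta, recombinants, meioses):
    """Two-point LOD score for phase-known meioses:
    log10 of L(theta) / L(theta = 0.5), where
    L(theta) = theta^r * (1 - theta)^(n - r)."""
    if not 0.0 <= theta <= 0.5:
        raise ValueError("theta must lie between 0 and 0.5")
    non_recombinants = meioses - recombinants
    if theta == 0.0:
        if recombinants > 0:
            return float("-inf")   # any recombinant rules out theta = 0
        log_likelihood = 0.0
    else:
        log_likelihood = (recombinants * math.log10(theta)
                          + non_recombinants * math.log10(1.0 - theta))
    return log_likelihood - meioses * math.log10(0.5)


def max_lod(recombinants, meioses, steps=500):
    """Grid search over theta in [0, 0.5] for the highest LOD score."""
    thetas = [0.5 * i / steps for i in range(steps + 1)]
    best = max(thetas, key=lambda t: lod_score(t, recombinants, meioses))
    return best, lod_score(best, recombinants, meioses)


# Hypothetical data: 1 recombinant gamete out of 10 scored meioses.
theta_hat, lod = max_lod(1, 10)
print(f"theta = {theta_hat:.2f}, LOD = {lod:.2f}")
```

The maximizing theta equals the observed proportion of recombinant gametes, matching the definition above.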
42
…Linkage Analysis
Linkage analysis calculates the
likelihood that the inheritance pattern of
the phenotype (disease) is supported by
the observed inheritance patterns
(genotypes) in a pedigree.
– few monogenic models, easy to test
– more difficult to find models explaining
inheritance in polygenic models
– parameter maximization
43
Linkage Analysis Programs
FASTLINK - 2 point
– O(n^2), where n = number of markers
GeneHunter - multipoint, 2 point
– O(n^2), where n = number of people
44
Allele Sharing
tries to show that affected family
members inherit the same chromosomal
regions more often than expected by
chance
45
Allele Sharing Example
Needs at least sibs.
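One simple form of allele-sharing test uses affected sib pairs. Without linkage, a sib pair shares 0, 1, or 2 alleles identical by descent (IBD) with probabilities 1/4, 1/2, and 1/4, so excess sharing shows up as a departure from those proportions. The sketch below is one illustrative statistic, not necessarily the method used in the lecture; the function names and the counts are made up.

```python
def sib_pair_chi_square(n_ibd0, n_ibd1, n_ibd2):
    """Goodness-of-fit statistic comparing observed IBD sharing among
    sib pairs with the 1/4 : 1/2 : 1/4 expectation under no linkage."""
    observed = (n_ibd0, n_ibd1, n_ibd2)
    total = sum(observed)
    expected = (0.25 * total, 0.5 * total, 0.25 * total)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def mean_sharing(n_ibd0, n_ibd1, n_ibd2):
    """Average proportion of alleles shared IBD (0.5 expected by chance)."""
    total = n_ibd0 + n_ibd1 + n_ibd2
    return (n_ibd1 + 2 * n_ibd2) / (2 * total)


# Hypothetical counts for 100 affected sib pairs at one marker.
stat = sib_pair_chi_square(10, 50, 40)
share = mean_sharing(10, 50, 40)
print(stat, share)
```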
46
Association Studies
“Allelic association studies provide the most
powerful method for locating genes of small
effect contributing to complex diseases and
traits.” Daniels, Am J Hum Genet 62:1189-1197,
1998.
Linkage analysis
– genome wide screen, 400 markers ~ 10 cM (10 Mb),
association needs 4000+ polymorphic markers
– generally need nuclear family or larger
Association finds “linkage disequilibrium”
47
Association Studies
“Association is simply a statistical
statement about the co-occurrence of
alleles or phenotypes. Allele A is
associated with disease D if people who
have D also have A more (or maybe
less) often than would be predicted from
the individual frequencies of D and A in
the population.” Pg. 286 Human
Molecular Genetics 2, Tom Strachan
48
Examples
HLA-DR4 (antigen marker)
– 36% in UK
– 78% with rheumatoid arthritis
CF (RFLP markers XV2.c (X1, X2), KM19 (K1, K2))
Marker Alleles   CF (case)   Normal (control)
X1, K1           3           49
X1, K2           147         19
X2, K1           8           70
X2, K2           8           25
– CF associated with X1, K2 in ‘89 (Strachan)
49
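The CF marker counts above can be tested for association with a chi-square statistic on the 4 x 2 table, plus an odds ratio for the X1, K2 haplotype against all others. This is a minimal sketch using only the counts listed on the slide; the function names are hypothetical and the choice of statistic is an assumption, not necessarily the analysis in the cited source.

```python
# (CF count, normal count) for each XV2.c / KM19 haplotype on the slide.
COUNTS = {
    ("X1", "K1"): (3, 49),
    ("X1", "K2"): (147, 19),
    ("X2", "K1"): (8, 70),
    ("X2", "K2"): (8, 25),
}


def chi_square(table):
    """Pearson chi-square statistic for a rows x 2 contingency table."""
    rows = list(table.values())
    col_totals = [sum(row[j] for row in rows) for j in range(2)]
    grand_total = sum(col_totals)
    stat = 0.0
    for row in rows:
        row_total = sum(row)
        for j in range(2):
            expected = row_total * col_totals[j] / grand_total
            stat += (row[j] - expected) ** 2 / expected
    return stat


def odds_ratio(table, haplotype):
    """Odds of carrying `haplotype` among cases divided by the odds
    among controls, pooling all other haplotypes together."""
    a, b = table[haplotype]
    c = sum(v[0] for k, v in table.items() if k != haplotype)
    d = sum(v[1] for k, v in table.items() if k != haplotype)
    return (a * d) / (b * c)


stat = chi_square(COUNTS)                 # 3 degrees of freedom
ratio = odds_ratio(COUNTS, ("X1", "K2"))
print(f"chi-square = {stat:.1f}, odds ratio (X1, K2) = {ratio:.1f}")
```

The statistic is compared against a chi-square distribution with (4 - 1) x (2 - 1) = 3 degrees of freedom.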
Linkage Disequilibrium
linkage equilibrium (the two-locus analogue of Hardy-Weinberg) is true if
– P(gt1,gt1’;gt2,gt2’) = P(gt1,gt1’)*P(gt2,gt2’)
where [P(haplotype)]
case vs controls
TDT (heterozygous marker transmitted),
HRR (untransmitted alleles as control)
allelic associations (outbred populations)
maintained at only <= 1 cM
50
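Departure from linkage equilibrium is commonly summarized by the coefficient D = P(AB) - P(A)P(B), which is zero when the haplotype frequency equals the product of the allele frequencies. The sketch below also reports r^2, a standard normalized form; the function name and the haplotype frequencies are hypothetical.

```python
def linkage_disequilibrium(hap_freqs):
    """Return (D, r^2) for two two-allele loci.

    hap_freqs maps the haplotypes 'AB', 'Ab', 'aB', 'ab' to their
    frequencies. Both loci must be polymorphic, otherwise r^2 is
    undefined (division by zero)."""
    p_a = hap_freqs["AB"] + hap_freqs["Ab"]   # frequency of allele A
    p_b = hap_freqs["AB"] + hap_freqs["aB"]   # frequency of allele B
    d = hap_freqs["AB"] - p_a * p_b
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, r2


# Illustrative haplotype frequencies (not from the slides).
in_equilibrium = {"AB": 0.25, "Ab": 0.25, "aB": 0.25, "ab": 0.25}
in_disequilibrium = {"AB": 0.40, "Ab": 0.10, "aB": 0.10, "ab": 0.40}

d0, r2_0 = linkage_disequilibrium(in_equilibrium)
d1, r2_1 = linkage_disequilibrium(in_disequilibrium)
print(d0, r2_0, d1, r2_1)
```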
Equilibrium
51
“SNPs”
Single-Nucleotide Polymorphisms
1 every 1000 bp (estimated)
2,972,052 SNPs submitted to dbSNP
– dbSNP summary link
– 50% of all SNPs are in question
– 10% of UTRs have SNPs
100,000 - 500,000 SNPs needed
Why don’t we do this?
– $$$
52
Homozygosity Mapping
53
Positional Cloning
54
Disease Gene Identification
SSCP -- single strand conformational
polymorphism
PCR -- polymerase chain reaction
– primers amplify template sequence
direct sequencing
BBS2 (Bardet-Biedl Syndrome)
55
BBS2 genetic mapping
[Figure: C16 (chromosome 16), labels 1-12]
56
BBS2 genetic mapping
[Figure: affected vs. unaffected; C16 (chromosome 16), labels 1-12]
57
BBS4 Gene (Direct Sequencing)
(Hs.26471)
58
BBS4 Deletion (by PCR)
exons 3 and 4
59
BBS4 Mutations (direct
sequencing)
(R295P)
60
Summary
Disease Gene Identification
– challenges
– interval localization
genotyping and genetic markers, linkage
analysis, allele sharing, association studies
(“SNiPs”), homozygosity mapping
– disease gene identification techniques
Take home
– A complex disorder (with interacting genes)
has yet to be characterized
61
Demo -- installing a database
A database organizes data
Most common
– relational database (Oracle, Sybase)
– perceived as a collection of tables,
where a table is an unordered collection of
rows
– each row has a fixed number of fields, and
each field can store a predefined type of
data value (date, integer, string, etc.)
Simplest
– flat file
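The table, row, and field ideas can be tried without installing a database server. The sketch below uses SQLite through Python's built-in sqlite3 module as a stand-in for Oracle or Sybase; the table layout, column names, and marker names are hypothetical and are not the schema of the genotype database mentioned in these slides.

```python
import sqlite3

# In-memory database: nothing is written to disk.
conn = sqlite3.connect(":memory:")

# One table; every row has the same fixed set of typed fields.
conn.execute(
    """CREATE TABLE genotypes (
           individual_id INTEGER,
           marker        TEXT,
           allele_1      INTEGER,
           allele_2      INTEGER,
           typed_on      TEXT
       )"""
)

# Hypothetical genotype records (marker names M1 and M2 are made up).
rows = [
    (1, "M1", 3, 5, "2001-06-01"),
    (2, "M1", 4, 4, "2001-06-01"),
    (3, "M1", 2, 5, "2001-06-02"),
    (1, "M2", 1, 1, "2001-06-02"),
]
conn.executemany("INSERT INTO genotypes VALUES (?, ?, ?, ?, ?)", rows)

# Rows are unordered, so ask for an explicit order when it matters.
heterozygous = conn.execute(
    "SELECT individual_id FROM genotypes "
    "WHERE marker = ? AND allele_1 <> allele_2 "
    "ORDER BY individual_id",
    ("M1",),
).fetchall()
print(heterozygous)
```

A flat file would hold the same records as lines of text, but the query, the type checking, and the ordering would all have to be written by hand.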
62
Databases
NCBI
BLAST
Amazon
Yahoo
Several of our own
– genotypes
– rat ESTs
– eye clones from differential display
– micro-array data
63