Transcript Clustering for Accuracy, Performance, and Alternative

Genetics and Molecular Biology
Tutorial II -- Computational Perspective
The goal is to introduce some topics to
individuals with a minimal background in
genetics/biology, while providing examples
that maintain the interest of individuals with
extensive biological/genetics backgrounds.

Gene structure
Outline
– genomic structure vs mRNA structure
– coding and noncoding exons
– introns
– primary transcript processing

aside -- nonsense mediated mRNA degradation
– alternative splicing and differential
polyadenylation
– evolutionary conservation of coding and
noncoding sequences
Outline…

Genomic structure
– repetitive sequences

LINES and SINES
– example -- Y chromosome palindromes
– C value paradox
– genomes of model organisms

example
– yeast genome and gene-chip
– single/double knockouts
– cross-species sequence similarities for
putative function identification

example -- “chaperonin”
Fundamental Genetics and
Probability Concepts
meiosis and sampling
 patterns of inheritance
 monogenic and complex inheritance

– phenocopy
– reduced penetrance

DNA variation
– polymorphisms, SNPs, and mutations

positional cloning
Gene Structure
Transcript Processing

DNA -> pre-mRNA -> mRNA -> protein
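As a rough illustration of the flow above, the Python sketch below transcribes a toy gene, removes its intron, and translates the result. The gene sequence, exon coordinates, and function names are invented for illustration; capping, polyadenylation, and splice-site recognition are not modeled.

```python
# Toy model of DNA -> pre-mRNA -> mRNA -> protein (illustrative names only).
BASES = "TCAG"
# Standard genetic code, one 16-letter block per first base in TCAG order.
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def transcribe(coding_strand):
    """The pre-mRNA matches the coding strand, with U in place of T."""
    return coding_strand.upper().replace("T", "U")

def splice(pre_mrna, exons):
    """Join exon intervals (0-based, end-exclusive) and drop the introns."""
    return "".join(pre_mrna[start:end] for start, end in exons)

def translate(mrna):
    """Translate from the first AUG to the first in-frame stop codon."""
    start = mrna.find("AUG")
    if start < 0:
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        residue = CODON_TABLE[mrna[i:i + 3].replace("U", "T")]
        if residue == "*":  # stop codon
            break
        protein.append(residue)
    return "".join(protein)

# Hypothetical gene: exon 1 + intron (GT...AG) + exon 2.
gene = "ATGGCC" + "GTAAGTTTTCAG" + "TGGAAATAA"
exons = [(0, 6), (18, 27)]

pre_mrna = transcribe(gene)
spliced_protein = translate(splice(pre_mrna, exons))
unspliced_protein = translate(pre_mrna)  # intron retained and translated
```

Comparing `spliced_protein` with `unspliced_protein` shows how a retained intron changes the encoded peptide.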
Nonsense mediated mRNA degradation
– unknown mechanism
– more rapidly degrades mRNA containing
premature stop (nonsense) codons
– Lykke-Andersen, “mRNA quality control:
Marking the message for life or death.”
Current Biology, 11, 2001.
Nonsense Mediated mRNA Degradation
Genome Structure -- repeat classes
Class (blocks)                  Size of Repeat   Chr Locations
Megasatellite (100s of kb)      several kb       various locations
  RS447                         4.7 kb           ~50-70 copies on 4, several on 8
  untitled                      2.5 kb           ~400 copies on 4 and 19
  untitled                      3.0 kb           ~50 copies on X
Satellite (100 kb to Mbs)       5-171 bp         centromeric
  alphoid                       171 bp           centromeric heterochromatin, all chrs
  Sau3A family                  68 bp            centromeric heterochromatin, chrs 1 9 13 14 15 21 22 6
  satellite 1 (AT rich)         25-48 bp         centromeric heterochromatin, most chrs
  satellites 2 and 3            5 bp             most chrs
Minisatellite (0.1-20 kb)       6-64 bp          at or close to telomeres
  telomeric family              6 bp             all telomeres
  hypervariable family          9-64 bp          all chrs, often near telomeres
Microsatellite (<150 bp)        1-4 bp           dispersed through all chromosomes
C-Value Paradox
Hartl, “Molecular melodies in high and low C,” Nat. Rev. Genetics, Nov. 2000

refers to the massive, counterintuitive
and seemingly arbitrary differences in
genome size observed in eukaryotic
organisms
– Drosophila melanogaster 180 Mb
– Podisma pedestris 18,000 Mb
– difference is difficult to explain in view of
apparently similar levels of evolutionary,
developmental, and behavioral complexity
Alternative Splicing
Every conceivable pattern of alternative
splicing is found in nature. Exons can have
multiple 5’ or 3’ splice sites that are
alternatively used (a, b). Single cassette exons can
reside between 2 constitutive exons
such that the alternative exon is either
included or skipped (c). Multiple
cassette exons can reside between 2
constitutive exons such that the splicing
machinery must choose between them
(d). Finally, introns can be retained in
the mRNA and become translated.
Graveley, “Alternative splicing:
increasing diversity in the proteomic
world.” Trends in Genetics, Feb., 2001.
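To make the combinatorics of patterns (c) and (d) concrete, the sketch below enumerates the exon combinations each pattern allows. The exon labels and function names are hypothetical, and the model assumes cassette exons are included or skipped independently, which real genes need not obey.

```python
from itertools import product

def cassette_isoforms(first_exon, cassettes, last_exon):
    """Pattern (c): each cassette exon is independently included or skipped."""
    isoforms = []
    for choices in product([False, True], repeat=len(cassettes)):
        middle = [exon for exon, keep in zip(cassettes, choices) if keep]
        isoforms.append([first_exon, *middle, last_exon])
    return isoforms

def mutually_exclusive_isoforms(first_exon, alternatives, last_exon):
    """Pattern (d): exactly one of the alternative exons is chosen."""
    return [[first_exon, exon, last_exon] for exon in alternatives]

# Hypothetical gene with three cassette exons between two constitutive exons.
independent = cassette_isoforms("E1", ["A", "B", "C"], "E5")
exclusive = mutually_exclusive_isoforms("E1", ["A", "B"], "E5")
print(len(independent), len(exclusive))
```

Under the independence assumption, n cassette exons give 2^n exon combinations, which is why a handful of alternative exons can produce a large number of mRNAs.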
Classic View of Gene No Longer Valid -- Strachan pg 185
Mechanism                        Frequency/Examples
multigenic transcription units   rare. 18S, 28S, and 5.8S rRNA, mitochondria
alternative promoters            common. dystrophin gene (8)
alternative splicing             very frequent. slo gene (8 cassettes), >500 mRNAs
alternative polyadenylation      common. calcitonin gene (2)
RNA editing                      extremely rare. apolipoprotein B gene (tissue specific editing – codon changed)
post-translational cleavage      rare. may generate functionally related polypeptides – hormones. insulin
Alternative Splicing Example -- Graveley 2001
Alternative PolyAdenylation
common in human RNA (Edwalds-Gilbert 1997)
in many genes, 2 or more poly-A signals
in 3’ UTR
– alternative transcripts can show tissue
specificity
alternative poly-A signals may be
brought into play following alternative
splicing
Edwalds-Gilbert. Nucleic Acids Res, 25(13), 1997

Evolution of the mitochondrial genome and origin of eukaryotic cells
Evolutionary Conservation of
Coding and Noncoding Sequences
Sequencing of H. sapiens and model
organisms is the basis for comparative
genomics
Generally, functional solutions (encoded as
genes) are conserved across organisms, which
allows us to compare gene sequences and infer function
protein functional/structural regions ==
“domains”
Intergenic regions are generally not
conserved (always exceptions)

Example - MKKS (UniGene Clusters)
human - rat     87.4 %
human - mouse   84.9 %
human - cow     87.1 %
mouse - rat     97.8 %
rat - cow       91.0 %
mouse - cow     85.1 %
frog - rat      62.5 %
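The slide does not say how these percentages were computed. One common definition of pairwise similarity is percent identity over the aligned, ungapped columns of two sequences, sketched below; the function name and the toy sequences are illustrative assumptions, and the input is assumed to be already aligned.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of ungapped alignment columns where both sequences agree."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Keep only columns where neither sequence has a gap.
    columns = [(x, y) for x, y in zip(aligned_a, aligned_b)
               if x != gap and y != gap]
    if not columns:
        return 0.0
    matches = sum(x == y for x, y in columns)
    return 100.0 * matches / len(columns)

# Toy alignment (not real MKKS sequence): 4 matches in 5 ungapped columns.
identity = percent_identity("ACGT-A", "ACCTTA")
print(identity)
```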

Example - MKKS
Computational Approach to
Using Conserved Regions
Problem -- want to screen genes for
mutations
Conventional approach -- screen all
exons of a single gene
Alternative -- identify domains within
multiple genes, and screen domains
first, to optimize screening time and
resources
Cross-Species Similarities

yeast
– gene chip for hybridization/expression
– complete genome (first eukaryote)
– single knockouts and double knockouts
Fundamental Genetics

meiosis
– humans (H. sapiens) are diploid
– meiosis produces haploid gametes
– mechanism for transmission of genetic
material to offspring
– recombination by cross-over (Holliday
structure) or by independent segregation of
homologous pairs
Fundamental Genetics (Background for
Linkage Analysis)

Rule of Segregation
– offspring receive ONE allele (genetic
material) from the pair of alleles possessed
by EACH parent

Rule of Independent Assortment
– alleles of one gene can segregate
independently of alleles of other genes
– (Linkage Analysis relies on the violation of
Independent Assortment Rule)
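The two rules can be sketched as a small probability calculation: each parent transmits one of its two alleles with probability 1/2, and unlinked loci combine by multiplication. The genotype encoding (two-letter strings such as "Aa") and the function names are assumptions made for this example.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def offspring_genotypes(parent1, parent2):
    """Rule of Segregation: one allele from each parent, each with prob 1/2."""
    counts = Counter("".join(sorted(a + b))
                     for a, b in product(parent1, parent2))
    return {genotype: Fraction(n, 4) for genotype, n in counts.items()}

def two_locus_offspring(p1_a, p2_a, p1_b, p2_b):
    """Independent Assortment: joint probability is the product per locus."""
    locus_a = offspring_genotypes(p1_a, p2_a)
    locus_b = offspring_genotypes(p1_b, p2_b)
    return {(ga, gb): pa * pb
            for ga, pa in locus_a.items()
            for gb, pb in locus_b.items()}

single = offspring_genotypes("Aa", "Aa")
double = two_locus_offspring("Aa", "Aa", "Bb", "Bb")
print(single)
print(double[("aa", "bb")])
```

When two loci are linked, the joint probabilities no longer factor this way, and that departure is what linkage analysis looks for.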
Genetic Marker … Prelude to LA
– A genetic marker allows for the observation of
the genetic state at a particular genomic location
(locus).
A genotype is the measured state of a genetic marker.
 May never be feasible to sequence cases directly.

– An “informative” marker is often “heterozygous,”
or “polymorphic” and enables the observation of
the inheritance of genetic material.
Monogenic and Polygenic Diseases
– monogenic (Mendelian) -- one gene
“simple” (dominant and recessive) Mendelian
inheritance
 direct correspondence between one gene
mutation and one disorder
 majority of disease genes found are monogenic

– polygenic -- (complex) multiple genes
heterogeneity and epistasis
 combinatorics
 no longer have direct correspondence between
one gene and disorder
 majority of disorders are probably polygenic

– complexity of organisms and observed pathways
...Monogenic and Polygenic Diseases
phenocopy
 reduced penetrance

– Example -- sickle cell anemia
“classic” recessive disorder
 defect in red blood cells (hemoglobin)
but… fetal hemoglobin gene can “leak”
 wide range of phenotypes

Examples
BBS4 Pedigree
Hardy-Weinberg Equilibrium


Rule that relates allelic and genotypic frequencies in a
population of diploid, sexually reproducing individuals
if that population has random mating, large size, no
mutation or migration, and no selection
Consequences
– allelic frequencies will not change in a population
from one generation to the next
– genotypic frequencies are determined in a
predictable way by allelic frequencies
– the equilibrium is neutral -- if perturbed, it will
reestablish within one generation of random mating
at the new allelic frequency
H-W
f(AA) = p^2
f(Aa) = 2pq
f(aa) = q^2

(p + q)^2 = p^2 + 2pq + q^2

(p + q + r)^2 = p^2 + q^2 + r^2 + 2pq + 2pr + 2qr
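The expansions above generalize to any number of alleles: homozygotes have frequency p_i^2 and heterozygotes 2 p_i p_j. A minimal sketch, with an illustrative function name and made-up allele frequencies:

```python
from itertools import combinations_with_replacement

def hardy_weinberg(allele_freqs):
    """Expected genotype frequencies from allele frequencies under H-W."""
    if abs(sum(allele_freqs.values()) - 1.0) > 1e-9:
        raise ValueError("allele frequencies must sum to 1")
    genotypes = {}
    for a, b in combinations_with_replacement(sorted(allele_freqs), 2):
        p, q = allele_freqs[a], allele_freqs[b]
        # Homozygote: p^2; heterozygote: 2pq (two parental orders).
        genotypes[(a, b)] = p * p if a == b else 2 * p * q
    return genotypes

two_alleles = hardy_weinberg({"A": 0.7, "a": 0.3})
three_alleles = hardy_weinberg({"A": 0.5, "B": 0.3, "C": 0.2})
print(two_alleles)
```

With three alleles the function returns six genotype classes, matching the six terms of the (p + q + r)^2 expansion.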
Dominant and Recessive
Penetrance Modeled
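penetrance = P(pt | gt), the probability of the phenotype given the genotype
Model                            DD    Dd    dd
dominant                         1     1     0
recessive                        0     0     1
dominant, reduced penetrance     0.9   0.9   0.0
recessive, reduced penetrance    0     0     0.8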
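A penetrance model is just a lookup from genotype to P(pt | gt). The sketch below encodes the four models from this slide and, as an added assumption not stated on the slide, combines them with Hardy-Weinberg genotype frequencies to get the expected population prevalence. The model names and the allele frequencies are illustrative.

```python
PENETRANCE_MODELS = {
    "dominant":          {"DD": 1.0, "Dd": 1.0, "dd": 0.0},
    "recessive":         {"DD": 0.0, "Dd": 0.0, "dd": 1.0},
    "dominant_reduced":  {"DD": 0.9, "Dd": 0.9, "dd": 0.0},
    "recessive_reduced": {"DD": 0.0, "Dd": 0.0, "dd": 0.8},
}

def prevalence(model, freq_D):
    """P(affected) = sum over genotypes of P(pt | gt) * P(gt).

    Genotype frequencies are assumed to follow Hardy-Weinberg proportions,
    with freq_D the population frequency of the D allele.
    """
    p, q = freq_D, 1.0 - freq_D
    genotype_freq = {"DD": p * p, "Dd": 2 * p * q, "dd": q * q}
    return sum(model[gt] * genotype_freq[gt] for gt in genotype_freq)

dominant_prev = prevalence(PENETRANCE_MODELS["dominant"], freq_D=0.1)
recessive_prev = prevalence(PENETRANCE_MODELS["recessive_reduced"], freq_D=0.9)
print(dominant_prev, recessive_prev)
```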
D-R Heterogeneous, DD Epistatic
D-R heterogeneous:
      AA   Aa   aa
BB    1    1    0
Bb    1    1    0
bb    1    1    1

D-D epistatic:
      AA   Aa   aa
BB    1    1    0
Bb    1    1    0
bb    0    0    0

reduced penetrance
3, 9, 27, 81, 243, … 3^n
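The two tables can be read as rules on a pair of genotypes, and the 3, 9, 27, ... sequence is read here as the number of genotype combinations for n biallelic loci (an interpretation, since the slide gives only the numbers). The function names below are illustrative.

```python
from itertools import product

GENOTYPES_A = ("AA", "Aa", "aa")
GENOTYPES_B = ("BB", "Bb", "bb")

def dr_heterogeneous(gt_a, gt_b):
    """Affected if locus A acts dominantly (any A) OR locus B is bb."""
    return 1 if "A" in gt_a or gt_b == "bb" else 0

def dd_epistatic(gt_a, gt_b):
    """Affected only if both loci carry a dominant allele (A and B)."""
    return 1 if "A" in gt_a and "B" in gt_b else 0

def multilocus_genotype_count(n_loci):
    """Three genotypes per biallelic locus gives 3^n combinations."""
    return 3 ** n_loci

# Rebuild each penetrance table as {(gt_a, gt_b): 0 or 1}.
heterogeneous_table = {pair: dr_heterogeneous(*pair)
                       for pair in product(GENOTYPES_A, GENOTYPES_B)}
epistatic_table = {pair: dd_epistatic(*pair)
                   for pair in product(GENOTYPES_A, GENOTYPES_B)}
print(sum(heterogeneous_table.values()), sum(epistatic_table.values()))
```

Replacing the 0/1 entries with values between 0 and 1 gives the reduced-penetrance versions of the same models.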
Dom-Rec Heterozygous
Screen genes A, B?, b
Uninformative Marker
Informative Marker

Given the following observations: family structure,
affection status, genotypes, and disease allele
frequencies. Assuming a model for the disease, can
we calculate the probability that these observations
“fit” the assumed model?
Linkage
Linkage Analysis
Goal: find a marker “linked” to a disease
gene.
LOD score = log10 of the likelihood ratio
LR[θ; data] = k P[data; θ]
LOD(θ) = log10( P[data; θ] / P[data; θ = 0.5] )
θ (theta) = estimate of genetic distance
(recombination fraction) between marker
and disease
= proportion of recombinant
gametes/total gametes
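For the simplest case, where every meiosis is fully informative and phase is known, the likelihood is θ^k (1 - θ)^(n - k) for k recombinants in n meioses, and the LOD score compares it with θ = 0.5 (no linkage). The sketch below covers only that idealized case; real pedigree likelihoods are much more involved, and the function name and counts are illustrative.

```python
import math

def lod_score(recombinants, meioses, theta):
    """Two-point LOD for phase-known, fully informative meioses."""
    if not 0.0 <= theta <= 0.5:
        raise ValueError("theta must lie between 0 and 0.5")
    k, n = recombinants, meioses
    likelihood = theta ** k * (1.0 - theta) ** (n - k)
    if likelihood == 0.0:
        # theta = 0 is impossible once a recombinant has been observed.
        return float("-inf")
    return math.log10(likelihood / 0.5 ** n)

# Hypothetical data: 1 recombinant in 10 meioses, scanned over theta.
for theta in (0.0, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5):
    print(theta, lod_score(1, 10, theta))
```

In this idealized setting the LOD is maximized at θ = k/n, which is the "parameter maximization" step for a single parameter.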

…Linkage Analysis

Linkage analysis calculates the
likelihood that the inheritance pattern of
the phenotype (disease) is supported by
the observed inheritance patterns
(genotypes) in a pedigree.
– few monogenic models, easy to test
– more difficult to find models explaining
inheritance in polygenic models
– parameter maximization
Linkage Analysis Programs

FASTLINK - 2 point
– O(n^2), where n = number of markers

GeneHunter - multipoint, 2 point
– O(n^2), where n = number of people
Allele Sharing

tries to show that affected family
members inherit the same chromosomal
regions more often than expected by
chance
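The slide does not name a statistic. One common formulation for affected sib pairs compares the observed number of pairs sharing 0, 1, or 2 alleles identical by descent (IBD) with the 1/4, 1/2, 1/4 expected by chance. The sketch below uses that formulation as an assumption, with made-up counts and illustrative function names.

```python
# Chance expectation for sibs: share 0, 1, or 2 alleles IBD.
EXPECTED_IBD = {0: 0.25, 1: 0.50, 2: 0.25}

def ibd_chi_square(observed):
    """Goodness-of-fit statistic of observed IBD counts vs. 1/4, 1/2, 1/4."""
    n = sum(observed.get(k, 0) for k in EXPECTED_IBD)
    if n == 0:
        raise ValueError("no sib pairs observed")
    return sum((observed.get(k, 0) - n * e) ** 2 / (n * e)
               for k, e in EXPECTED_IBD.items())

def mean_sharing(observed):
    """Mean proportion of alleles shared IBD (0.5 expected by chance)."""
    n = sum(observed.get(k, 0) for k in EXPECTED_IBD)
    return sum(k * observed.get(k, 0) for k in EXPECTED_IBD) / (2 * n)

# Hypothetical counts for 100 affected sib pairs at one marker.
pairs = {0: 10, 1: 50, 2: 40}
print(ibd_chi_square(pairs), mean_sharing(pairs))
```

A mean sharing proportion well above 0.5 in affected pairs is the kind of excess the slide describes.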
Allele Sharing Example
Needs at least sibs.
Association Studies


“Allelic association studies provide the most
powerful method for locating genes of small
effect contributing to complex diseases and
traits.” Daniels, Am J Hum Genet 62:1189-1197,
1998.
Linkage analysis
– genome wide screen, 400 markers at ~10 cM (~10 Mb) spacing;
association needs 4000+ polymorphic markers
– generally need nuclear family or larger

Association finds “linkage disequilibrium”
Association Studies

“Association is simply a statistical
statement about the co-occurrence of
alleles or phenotypes. Allele A is
associated with disease D if people who
have D also have A more (or maybe
less) often than would be predicted from
the individual frequencies of D and A in
the population.” Pg. 286 Human
Molecular Genetics 2, Tom Strachan
Examples

HLA-DR4 (antigen marker)
– 36% in UK
– 78% with rheumatoid arthritis

CF (RFLP markers XV2.c (X1, X2), KM19 (K1, K2))
Marker alleles   CF (case)   Normal (control)
X1, K1           3           49
X1, K2           147         19
X2, K1           8           70
X2, K2           8           25
– CF associated with X1, K2 in ‘89 (Strachan)
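One way to quantify the CF association is to collapse the counts into a 2x2 table (X1, K2 versus all other marker combinations) and compute an odds ratio and a Pearson chi-square without continuity correction. The choice of statistic and the function name are assumptions for illustration; the counts are the ones listed on the slide.

```python
def two_by_two(case_with, case_without, control_with, control_without):
    """Odds ratio and Pearson chi-square (1 df, uncorrected) for a 2x2 table."""
    a, b, c, d = case_with, case_without, control_with, control_without
    n = a + b + c + d
    odds_ratio = (a * d) / (b * c)
    chi_square = n * (a * d - b * c) ** 2 / (
        (a + b) * (c + d) * (a + c) * (b + d))
    return odds_ratio, chi_square

cf = {"X1,K1": 3, "X1,K2": 147, "X2,K1": 8, "X2,K2": 8}
normal = {"X1,K1": 49, "X1,K2": 19, "X2,K1": 70, "X2,K2": 25}

haplotype = "X1,K2"
odds_ratio, chi_square = two_by_two(
    cf[haplotype], sum(cf.values()) - cf[haplotype],
    normal[haplotype], sum(normal.values()) - normal[haplotype])
print(odds_ratio, chi_square)
```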
Linkage Disequilibrium

linkage equilibrium (analogous to Hardy-Weinberg, but across two loci) holds if
– P(gt1,gt1’;gt2,gt2’) = P(gt1,gt1’)*P(gt2,gt2’)
where P = P(haplotype)
case vs controls
TDT (heterozygous marker transmitted),
HRR (untransmitted alleles as control)
allelic associations (outbred populations)
maintained at only <= 1 cM
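Departure from linkage equilibrium is often summarized by D = P(AB) - P(A)P(B) and by r^2, which rescales D^2 by the allele frequencies. These measures are standard but are not named on the slide; the sketch below, with hypothetical haplotype counts and an illustrative function name, shows the calculation for two biallelic markers.

```python
def linkage_disequilibrium(hap_counts):
    """Return (D, r^2) from counts of the four two-locus haplotypes.

    hap_counts maps ('A','B'), ('A','b'), ('a','B'), ('a','b') to counts.
    """
    n = sum(hap_counts.values())
    p_ab = hap_counts[("A", "B")] / n
    p_a = (hap_counts[("A", "B")] + hap_counts[("A", "b")]) / n
    p_b = (hap_counts[("A", "B")] + hap_counts[("a", "B")]) / n
    d = p_ab - p_a * p_b  # zero at linkage equilibrium
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r_squared = d * d / denom if denom > 0 else 0.0
    return d, r_squared

# Hypothetical counts: equilibrium vs. complete disequilibrium.
equilibrium = {("A", "B"): 25, ("A", "b"): 25, ("a", "B"): 25, ("a", "b"): 25}
complete = {("A", "B"): 50, ("A", "b"): 0, ("a", "B"): 0, ("a", "b"): 50}
print(linkage_disequilibrium(equilibrium))
print(linkage_disequilibrium(complete))
```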

Equilibrium
“SNPs”
Single-Nucleotide Polymorphisms
 1 every 1000 bp (estimated)
 2,972,052 SNPs submitted to dbSNP

– dbSNP summary link
– 50% of all SNPs are in question
– 10% of UTRs have SNPs
100,000 - 500,000 SNPs needed
 Why don’t we do this?

– $$$
Homozygosity Mapping
Positional Cloning
Disease Gene Identification
SSCP -- single strand conformational
polymorphism
 PCR -- polymerase chain reaction

– primers amplify template sequence

direct sequencing

BBS2 (Bardet-Biedl Syndrome)
BBS2 genetic mapping
[Figure: columns labeled C16 and 1-12]
BBS2 genetic mapping
[Figure: columns labeled C16 and 1-12, marked affected and unaffected]
BBS4 Gene (Direct Sequencing)
(Hs.26471)
BBS4 Deletion (by PCR)
exons 3 and 4
BBS4 Mutations (direct sequencing)
(R295P)
Summary

Disease Gene Identification
– challenges
– interval localization

genotyping and genetic markers, linkage
analysis, allele sharing, association studies
(“SNiPs”), homozygosity mapping
– disease gene identification techniques

Take home
– A complex disorder (with interacting genes)
has yet to be characterized
Demo -- installing a database
A database organizes data
 Most common

– relational database (Oracle, Sybase)
– perceived as a collection of tables,
– where each table is an unordered collection of
rows
– each row has a fixed number of fields, and
each field can store a predefined type of
data value (date, integer, string, etc.)

simplest
– flat file
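As a small stand-in for the demo, the sketch below builds a relational table of typed fields and queries it. It uses SQLite through Python's built-in sqlite3 module rather than Oracle or Sybase, which is a substitution made for convenience; the table layout, marker names, and genotype values are hypothetical.

```python
import sqlite3

# In-memory database: nothing is written to disk.
conn = sqlite3.connect(":memory:")

# A table is a collection of rows; each row has a fixed set of typed fields.
conn.execute("""
    CREATE TABLE genotype (
        individual_id INTEGER NOT NULL,
        marker        TEXT    NOT NULL,
        allele1       TEXT    NOT NULL,
        allele2       TEXT    NOT NULL
    )
""")

rows = [
    (1, "M1", "1", "2"),
    (2, "M1", "2", "2"),
    (3, "M1", "1", "3"),
    (1, "M2", "4", "4"),
]
conn.executemany("INSERT INTO genotype VALUES (?, ?, ?, ?)", rows)

# Rows are unordered, so any ordering must be requested explicitly.
heterozygous = conn.execute(
    "SELECT individual_id FROM genotype "
    "WHERE marker = ? AND allele1 <> allele2 ORDER BY individual_id",
    ("M1",),
).fetchall()
print(heterozygous)
```

A flat file would hold the same rows as lines of text, but selecting the heterozygous individuals would then require custom parsing code instead of a query.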
Databases
NCBI
 BLAST
 Amazon
 Yahoo
 Several of our own

– genotypes
– rat ESTs
– eye clones from differential display
– micro-array data