Computational Biology


Pharmacogenomics
Pharmacogenetics is an old discipline.
One may distinguish pharmacogenetics (the study of a single gene) from
pharmacogenomics (the study of many genes or entire genomes),
or reserve pharmacogenomics for approaches that go beyond DNA to include mRNA and
proteins.
Today, it is possible to assess entire pathways that might be relevant to disease or
to drug response at the DNA, mRNA and protein levels.
Eventually, the entire genome, transcriptome and proteome will be available.
Therefore, pharmacogenetics/-genomics and disease genetics/genomics are
undergoing similar transitions, with a shift in focus from Mendelian examples (one
gene → one disease) to more complex modes of genetic causation.
12. Lecture WS 2003/04
Bioinformatics III
Where do drugs interact with proteins?
This figure shows the paths that are taken by the
anti-epileptic drug phenytoin and the angiotensin-converting enzyme (ACE) inhibitor imidapril in the
human body. Phenytoin is absorbed into the
bloodstream at the gut and circulated through the
liver to the brain. It crosses the blood–brain
barrier where it binds and inhibits its target,
neuronal sodium channels. It is pumped back
out across the blood–brain barrier into the
bloodstream by multidrug resistance protein 1
(MDR1, also known as ABCB1) efflux pumps.
At the liver, phenytoin is metabolized by the
cytochrome P450 enzymes CYP2C9 and
CYP2C19, and it is eliminated through the
kidneys.
Imidapril is a prodrug. After its absorption from
the gut into the bloodstream it is hydroxylated in the
liver to the active metabolite imidaprilat. Imidaprilat
binds and inhibits ACE in the plasma. Imidaprilat is
also eliminated through the kidneys.
Goldstein et al. Nature Rev. Gen. 4, 937 (2003)
These associations were compiled from the literature by using the keywords
„pharmacogenetics“ OR „pharmacogenomics“,
„association study“ AND „drug response“,
„polymorphism“ AND „drug response“.
Therefore, the list omits
many polymorphisms and
probably includes some
false positives.
Most of the polymorphisms
are either in the drug target
or in a protein that is in the
pathway in which the
target acts.
Goldstein et al. Nature Rev. Gen. 4, 937 (2003)
The SNP Consortium
Goldstein et al. Nature Rev. Gen. 4, 937 (2003)
Haplotypes
The diagram shows 5 haplotypes. 12 SNPs are
localized in order along the chromosome. The
letters on the top indicate groups of SNPs that
have perfect pairwise linkage
disequilibrium (LD) with one another, and
the numbers on the bottom indicate each of
the 12 SNPs. SNP 9 is the causal variant,
which in this simple example determines drug
response: allele C results in a therapeutic
response, whereas allele G results in an
adverse reaction. In this example, the
selection of just one SNP from each of the
groups A–E would be sufficient to fully
represent all of the haplotype diversity. Each
haplotype can be identified by just five
tagging SNPs (tSNPs), and the causal
variant would be tagged even if it were not
itself typed. So, tSNP profiles that are
highlighted predict an adverse reaction to the
medicine. Normally, LD patterns are not so
clear-cut and statistical methods are required
to select appropriate sets of tSNPs.
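The clear-cut case described above can be sketched in a few lines of Python. This is a minimal illustration only: the haplotype strings are hypothetical toy data (not the alleles of the figure), the function names are made up for this example, and grouping by identical allele patterns only works when LD is perfect, which real data rarely offer.

```python
# Toy data: five haplotypes over 12 SNPs (hypothetical alleles, not those of the figure).
# SNP 9 plays the role of the causal variant (C = therapeutic, G = adverse).
HAPLOTYPES = [
    "ACAGCACACACA",
    "GTAGCACTGGCA",
    "GTCTTACACACA",
    "ACCTTGGACACA",
    "ACAGCGGTGGTG",
]

def allele_pattern(haplotypes, snp_index):
    """Encode one SNP column as 0/1 relative to the allele on the first haplotype."""
    first = haplotypes[0][snp_index]
    return tuple(0 if hap[snp_index] == first else 1 for hap in haplotypes)

def perfect_ld_groups(haplotypes):
    """Group biallelic SNPs whose columns split the haplotypes identically (r^2 = 1)."""
    groups = {}
    for i in range(len(haplotypes[0])):
        # SNPs are reported with 1-based numbers, as on the slide.
        groups.setdefault(allele_pattern(haplotypes, i), []).append(i + 1)
    return list(groups.values())

groups = perfect_ld_groups(HAPLOTYPES)
tag_snps = [group[0] for group in groups]  # any one member of each group would do
profiles = {tuple(hap[s - 1] for s in tag_snps) for hap in HAPLOTYPES}

print(groups)    # [[1, 2], [3, 4, 5], [6, 7], [8, 9, 10], [11, 12]]
print(tag_snps)  # [1, 3, 6, 8, 11]
print(len(profiles) == len(HAPLOTYPES))  # True: five tSNPs identify every haplotype
```

In this toy set SNP 8 is chosen as the tag for the group containing SNP 9, so the causal variant is tagged without being typed itself.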
Goldstein et al. Nature Rev. Gen. 4, 937 (2003)
Haplotypes
b The diagram depicts the same 12
SNPs, but with different associations
among them, as might happen in a
different population group.
Because patterns of LD are different,
some patients would be misclassified if
the same five tSNPs were used and
interpreted in the same way.
Using the same SNP profiles as defined
in population A, haplotype profiles 1, 2
and 3 are predicted to have allele C at
the causal SNP 9 (a therapeutic
response), whereas haplotype profiles 4
and 5 are predicted to have an adverse
response. However, because the
pattern of association has changed, the
new haplotype patterns 6 and 7 in
population B are misclassified.
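The transfer problem can be made concrete with a small Python sketch. Everything here is a hypothetical simplification: each haplotype is reduced to a pair (allele at one tag SNP, allele at the causal SNP), and the population lists are invented toy data arranged to mirror the numbering used above.

```python
def learn_rule(reference):
    """Map each tag-SNP allele to the causal allele it travels with in the reference population."""
    rule = {}
    for tag, causal in reference:
        if rule.setdefault(tag, causal) != causal:
            raise ValueError(f"tag allele {tag!r} is not in perfect LD with the causal SNP")
    return rule

def misclassified(rule, haplotypes):
    """Return (haplotype number, predicted allele, actual allele) for every wrong prediction."""
    return [
        (number, rule[tag], causal)
        for number, (tag, causal) in enumerate(haplotypes, start=1)
        if rule[tag] != causal
    ]

# Hypothetical data: (tag allele, causal allele at SNP 9); C = therapeutic, G = adverse.
POPULATION_A = [("A", "C"), ("A", "C"), ("A", "C"), ("T", "G"), ("T", "G")]
# Population B keeps haplotypes 1-5 and adds two with the opposite association.
POPULATION_B = POPULATION_A + [("A", "G"), ("T", "C")]

rule = learn_rule(POPULATION_A)
errors = misclassified(rule, POPULATION_B)
print(errors)  # [(6, 'C', 'G'), (7, 'G', 'C')]
```

The rule learned in population A is still correct for haplotypes 1 to 5 but gives the wrong prediction for the two haplotypes that only occur in population B.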
Goldstein et al. Nature Rev. Gen. 4, 937 (2003)
Discovering genotypes underlying phenotypes:
from mendelian diseases to complex diseases
Traditional view: over the past decade, about 1200 genes causing human diseases or
traits have been identified, largely by positional cloning.
Identification of the gene → knowledge of relevant protein(s) → often leads to
understanding of the molecular and physiological basis of the disease phenotype.
Successful examples in positional cloning: identification of genes underlying
chronic granulomatous disease
X-linked muscular dystrophies
cystic fibrosis
Fanconi anemia
ataxia telangiectasia
neurofibromatosis I
Huntington disease
identification of genes underlying hereditary predispositions to
cancer, including retinoblastoma
breast cancer
polyposis colorectal cancer
Botstein & Risch, Nature Gen. 33, 228 (2003)
Linkage mapping
Positional cloning begins with linkage analysis.
Families in which the disease phenotype segregates are analyzed using a
group of DNA polymorphisms.
Ideal method for diseases with very clear diagnosis. The limit of resolution
remains the number of meioses in which crossovers might have occurred.
In favorable cases (such as cystic fibrosis), the patterns of crossovers in the
region of the gene among the cohorts studied leave only a few predicted
genes, all within about 1 cM (~1 Mb), as likely candidates.
In less favorable cases, there may be as many as a few hundred predicted
genes that might be the relevant disease genes.
Botstein & Risch, Nature Gen. 33, 228 (2003)
Linkage disequilibrium
Greater power in fine-mapping is obtained by haplotype analysis, in which all
markers are considered simultaneously as haplotypes rather than individually.
Haplotype analysis allows the inference of likely historical crossover points,
which localize the disease mutation.
New algorithms based on haplotype analysis are being developed to estimate
statistically the likely locations of such crossovers and thus the likely location of
the disease mutation.
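A heavily simplified sketch of the idea in Python: if all disease chromosomes descend from one founder, the markers they still share mark the segment that historical crossovers have not broken up. The marker strings are hypothetical toy data and the function name is invented for this example; the statistical algorithms mentioned above additionally model chance sharing, marker spacing and heterogeneity, which this sketch ignores.

```python
def shared_interval(chromosomes):
    """Longest run of consecutive markers at which all chromosomes carry the same allele.

    Returns 1-based (first_marker, last_marker), or None if no marker is shared.
    """
    n_markers = len(chromosomes[0])
    shared = [len({c[i] for c in chromosomes}) == 1 for i in range(n_markers)]
    best = None
    start = None
    for i, same in enumerate(shared + [False]):  # the sentinel closes a run at the end
        if same and start is None:
            start = i
        elif not same and start is not None:
            if best is None or i - start > best[1] - best[0] + 1:
                best = (start + 1, i)
            start = None
    return best

# Hypothetical disease chromosomes typed at 10 ordered markers.
DISEASE_CHROMOSOMES = [
    "AGCTTGACCA",
    "TACTTGACGT",
    "AGCTTGATCA",
]

print(shared_interval(DISEASE_CHROMOSOMES))  # (3, 7)
```

Here the breakpoints between markers 2 and 3 and between markers 7 and 8 stand in for the inferred historical crossovers that bracket the candidate region.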
The success of linkage disequilibrium (LD) mapping depends heavily on the
degree of genetic heterogeneity underlying a disease sample.
Unless one or a few mutations account for most instances of disease, the
signal will be too inconsistent to find mutations.
Some degree of heterogeneity is tolerable and can be overcome by clustering
of disease chromosomes.
Botstein & Risch, Nature Gen. 33, 228 (2003)
Lessons from cloned mendelian genes
HGMD lists 27,000 mutations in
1222 genes associated with
human diseases and traits.
In-frame amino acid
substitutions are the most
frequent.
Less than 1% are found in
regulatory regions.
These data provide overwhelming support for the notion that mendelian clinical
phenotypes are associated primarily with alterations in the normal coding
sequence of proteins.
Botstein & Risch, Nature Gen. 33, 228 (2003)
Criteria for amino acid replacements
Distinguish
(1) biochemical severity of missense changes, and
(2) location and/or context of the altered amino acid in the protein sequence.
A useful guide is the Grantham scale:
categorize codon replacements into classes of increasing chemical dissimilarity
between the encoded amino acids:
conservative
moderately conservative
moderately radical
radical
„stop“ or nonsense.
There is a clear relationship between the severity of amino acid replacement
and the likelihood of clinical observation.
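As a rough illustration, such a classification could be coded as follows in Python. Two things are assumptions here and not taken from the slide: the class boundaries (50, 100 and 150, a commonly used binning of the Grantham distance) and the three-letter residue labels. The table holds only a handful of pairs from the Grantham (1974) matrix, so the sketch covers just those pairs.

```python
# Grantham (1974) distances for a few amino acid pairs (small subset only).
GRANTHAM = {
    frozenset(("Leu", "Ile")): 5,
    frozenset(("Phe", "Tyr")): 22,
    frozenset(("Arg", "Lys")): 26,
    frozenset(("Asp", "Glu")): 45,
    frozenset(("Ala", "Gly")): 60,
    frozenset(("Leu", "Arg")): 102,
    frozenset(("Gly", "Trp")): 184,
    frozenset(("Cys", "Trp")): 215,
}

# Assumed upper limits per class; the slide names the classes but not the cut-offs.
CLASS_LIMITS = [
    (50, "conservative"),
    (100, "moderately conservative"),
    (150, "moderately radical"),
]

def replacement_class(ref, alt):
    """Classify a codon replacement by the chemical dissimilarity of the two residues."""
    if alt == "Stop":
        return "nonsense"
    distance = GRANTHAM[frozenset((ref, alt))]
    for limit, name in CLASS_LIMITS:
        if distance <= limit:
            return name
    return "radical"

for ref, alt in [("Leu", "Ile"), ("Ala", "Gly"), ("Leu", "Arg"), ("Cys", "Trp"), ("Gln", "Stop")]:
    print(ref, alt, replacement_class(ref, alt))
```

A full implementation would load the complete 20 x 20 matrix; the lookup by unordered pair reflects that the distance is symmetric.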
Botstein & Risch, Nature Gen. 33, 228 (2003)
Clinical severity increases with severity of AA substitution
Purple bars represent the ratio of frequencies
of the indicated class of change compared to
conservative changes for functional human
genes compared to pseudogenes.
Orange bars represent the ratio of the
likelihood of clinical observation for a
conservative change versus the indicated
class of change.
A nonsense change is 9 times more likely to
present clinically than a conservative amino
acid substitution.
For the other changes, the ratios are 3, 2.3,
and 1.8.
The same trend exists for the relative abundance of the different types of substitutions
found in SNPs from human genes as compared with their abundance in pseudogenes.
Evolution selects against radical changes!
Botstein & Risch, Nature Gen. 33, 228 (2003)
Clinical significance correlates with degree of cross-species evolutionary conservation
An obvious way to measure the
importance of a particular amino acid:
conservation across species.
The figure shows that the disease
probability decreases monotonically
with the number of amino acid
differences among species.
In simple terms:
if evolution allows mutations between
species, this amino acid cannot be so
crucial.
Relative risks (log odds ratios) for the
observed versus the expected number
of amino acid changes.
Purple: severe diseases, Orange: milder
disease mutations (G6PD).
Botstein & Risch, Nature Gen. 33, 228 (2003)
Correlation of clinical severity and severity of gene lesion
In numerous cases, genotype-phenotype correlation has identified milder forms of
disease that are associated with less severe mutations.
A classic example is Duchenne (severe) and Becker (mild) muscular dystrophy:
Duchenne is caused primarily by frame-shift deletions,
Becker is caused by in-frame changes.
Other examples:
hemolytic anemia – associated with globin mutations
hemochromatosis – high penetrance radical amino acid substitution
low penetrance milder amino acid substitution
Gaucher disease – common milder mutation associated with
fewer clinical symptoms
G6PD deficiency – severity of amino acid substitution correlates
with clinical significance
Botstein & Risch, Nature Gen. 33, 228 (2003)
The future: understand complex diseases
Classical linkage analysis and positional cloning remain the methods of choice for identifying
rare, high-risk, disease-associated mutations, owing to their clear inheritance patterns.
Knowledge of the human genome sequence will certainly help.
But „simple“ mendelian inheritance is often not so simple:
- multiple different mutations are often identified in the same or in different loci,
with variable phenotypic effects and highly variable associated risks.
- mutational or genotypic heterogeneity can explain some of the clinical variability observed in
single-gene diseases, but usually not all → modifier genes, environmental contributors.
For non-mendelian diseases and for diseases with multi-gene effects, all contributing loci
might be thought of as „modifiers“ as no single locus of large effect exists.
Botstein & Risch, Nature Gen. 33, 228 (2003)
large-scale SNP discovery projects
Two strategies: „map-based“ or „sequence-based“.
It is unclear which one will be more effective.
The private sequencing effort has reported 2.1 million SNPs (Venter et al. 2001)
and the public SNP consortium has identified 1.4 million SNPs (Sachidanandam et
al. 2001).
Rates of false-positives (10-15%) are modest.
Rates of false-negatives (undetected SNPs) are more problematic.
Neither collection was based on the sequences of many individuals
→ many lower-frequency (< 10%) SNPs were not detected, especially those that are
specific to a single population.
Botstein & Risch, Nature Gen. 33, 228 (2003)
fine-scale SNP discovery projects
Study A analyzed 313 genes (720 kb of genomic sequence) for 84 ethnically
diverse individuals.
Only 2% (or 6% excluding singletons) of the SNPs identified are in dbSNP,
suggesting that there exist many more SNPs than the roughly 1.2 million
unique SNPs in dbSNP.
Study B analyzed 65% of the unique sequence of chromosome 21 for 10
individuals.
36,000 SNPs were identified → > 6.4 million SNPs for the whole genome.
Only 45% of the SNPs in dbSNP were found in this study.
Conclusion: the number of SNPs in the human genome (defined by a rare-allele
frequency of 1% or greater in at least one population) is likely to be > 15 million.
Note: there are only 30,000 genes.
Botstein & Risch, Nature Gen. 33, 228 (2003)
fine-scale SNP discovery projects
The alternative strategy to „map-based“ is based on genes and sequence. Here, genotyping
focuses on SNPs identified in coding regions that alter or terminate amino acid sequence, or
disrupt splice sites, or occur in promoter regions.
The table shows that we expect 50,000 – 100,000 such gene-related SNPs.
Based on results from cloned mendelian disease, one can prioritize amino acid replacements
according to (a) the severity of the alteration, and (b) the degree of evolutionary conservation.
Botstein & Risch, Nature Gen. 33, 228 (2003)
Can disease-associated alleles be predicted from sequence?
The main feature that distinguishes a map-based approach from a sequence-based
approach to genome-wide association studies is
the degree to which functional variants can be predicted on the basis of sequence in,
for example, coding and/or conserved regions of the genome.
Table 1 showed that – for mendelian phenotypes - most diseases are the result of
changes that cause loss or alterations in encoded proteins.
< 1% of listed mutations occur in regulatory regions (these would be more difficult
to predict from sequence).
The greatest risk of a disease phenotype is associated with splice-site mutations,
deletions and insertions.
Botstein & Risch, Nature Gen. 33, 228 (2003)
Can disease-associated alleles be predicted from sequence?
Can this distribution of risks be extrapolated to alleles of moderate to low relative
risk – which are assumed to underlie complex disease phenotypes?
Literature: 18 changes – 15 AA substitutions, 1 large deletion, 1 frameshift, 1
variation in promoter region. This is not very different from high-risk diseases
and is also biased to substitutions.
Botstein & Risch, Nature Gen. 33, 228 (2003)
Natural variation in human membrane transporter genes:
identify evolutionary and functional constraints
Large-scale SNP and haplotype maps have only analyzed 24-40 chromosomes
within an ethnic population and therefore identified common variants (> 5%) with
good accuracy.
These screens could not identify less common variants that may have more
severe functional consequences.
Little is known about the relative levels of genetic diversity within classes of
genes.
Here: focus on membrane transporters which are important drug targets.
Leabman et al. PNAS 100, 5896 (2003)
Structure of Membrane Transporters
Transmembrane helices (25 residue long
stretches, purely hydrophobic; prediction
accuracy > 90%).
Typically 12-14 TM helices align to form
a pore. External domains are very variable
in size.
Predicted secondary structures
of two representative membrane
transporters from the ABC and
SLC superfamilies. The
transmembrane topology is
schematically rendered.
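Why long hydrophobic stretches make TM helices comparatively easy to predict can be shown with a sliding-window hydropathy scan. The sketch below uses the Kyte-Doolittle hydropathy scale with a 19-residue window and a cut-off of 1.6; this is a classic textbook technique chosen for illustration, not the method used to produce the figure, and the test sequence is artificial.

```python
# Kyte-Doolittle hydropathy values (positive = hydrophobic).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydrophobic_windows(sequence, window=19, threshold=1.6):
    """Return (start, mean hydropathy) for every window whose mean reaches the threshold.

    Start positions are 0-based; window and threshold are assumed defaults.
    """
    hits = []
    for start in range(len(sequence) - window + 1):
        segment = sequence[start:start + window]
        mean = sum(KD[residue] for residue in segment) / window
        if mean >= threshold:
            hits.append((start, round(mean, 2)))
    return hits

# Artificial sequence: a 19-residue hydrophobic stretch flanked by charged residues.
SEQUENCE = "DEKRDEKRDE" + "LLIVALLIVALLIVALLIV" + "DEKRDEKRDE"

hits = hydrophobic_windows(SEQUENCE)
print(hits)
```

Overlapping hits around the hydrophobic core would normally be merged into one predicted helix; that step is left out to keep the sketch short.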
Leabman et al. PNAS 100, 5896 (2003)
Membrane Transporters
Membrane transporters play a critical role in many biological processes:
- maintain cellular and organismal homeostasis by importing nutrients essential
for cellular metabolism
- export cellular waste products and toxic compounds.
- important in drug response – they provide the targets for many commonly used
drugs
- are major determinants for drug absorption, distribution, and elimination.
Two major superfamilies
- ABC (ATP-binding cassette) transporters
- SLC (solute carrier transporters) – take up neurotransmitters, nutrients, heavy
metals ...
Here: screen for variation in a set of 24 genes encoding membrane transporters.
Leabman et al. PNAS 100, 5896 (2003)
24 TM transporters with potential roles in drug response
Transporters are grouped based
on transporter family (e.g., OCT1,
OCT2, and OCT3 belong to the
SLC22 family; CNT1 and CNT2
belong to the SLC28 family).
Blue ovals: transporters of SLC
superfamily;
red rectangles, ABC superfamily;
green hexagon, P-type ATPase.
Typical substrates for each family
of transporters are listed. The
direction of transport is indicated
by an arrow pointing into the cell
(influx) or out of the cell (efflux).
Leabman et al. PNAS 100, 5896 (2003)
Aims of SNP scan
Analyze 247 DNA samples of ethnically diverse collection (100 European
Americans, 100 African Americans, 30 Asians, 10 Mexicans, 7 Pacific Islanders).
Identify SNPs.
Aim 1: determine the levels and patterns of genetic diversity
- in different ethnic groups
- in different transporter families
- across different structural regions of membrane transporters.
Aim 2: combine population-genetic and phylogenetic analysis to identify amino acid
residues and protein domains that may be important for human fitness.
Infer functional consequences of amino acid substitution.
To identify polymorphisms, screen all exons plus 35–100 bp of flanking intronic
sequence.
Leabman et al. PNAS 100, 5896 (2003)
Variation in transporter genes
680 biallelic SNPs, 2 tri-allelic SNPs.
91/477 SNPs were already deposited in dbSNP.
Leabman et al. PNAS 100, 5896 (2003)
Population specificity
421/680 SNPs are population specific.
248/421 are singletons = occur only once among 494 chromosomes.
(This explains why large-scale SNP projects have so far identified far fewer SNPs).
Of the 259 population-unspecific SNPs, 83 are present in all 5 populations.
Few population-specific alleles were found at high frequency:
only 4/278 African American-specific alleles had frequency > 0.1
only 1/50 Asian-specific alleles had frequency > 0.1
The European American population sample had no population-specific allele
(0/80) at frequency > 0.05.
The relatively high incidence of moderately frequent population-specific alleles in
African Americans may facilitate identification of ethnic-specific disease loci in this
population.
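The categories used on this slide can be expressed as a short Python sketch. The sample sizes are those of the 247 DNA samples described for this screen (hence 494 chromosomes); the allele counts passed to the function are hypothetical, and the function and key names are invented for illustration.

```python
# Number of individuals sampled per population (247 in total).
SAMPLES = {
    "European American": 100,
    "African American": 100,
    "Asian": 30,
    "Mexican": 10,
    "Pacific Islander": 7,
}
CHROMOSOMES = 2 * sum(SAMPLES.values())  # two chromosomes per individual

def describe_snp(minor_allele_counts):
    """Classify a SNP from the number of minor-allele copies seen in each population."""
    present = [pop for pop, count in minor_allele_counts.items() if count > 0]
    total = sum(minor_allele_counts.values())
    return {
        "population_specific": len(present) == 1,
        "singleton": total == 1,
        "in_all_populations": len(present) == len(SAMPLES),
        "frequency": {pop: minor_allele_counts[pop] / (2 * SAMPLES[pop]) for pop in present},
    }

print(CHROMOSOMES)                               # 494
print(describe_snp({"African American": 24}))    # population-specific, frequency 0.12
print(describe_snp({"Asian": 1}))                # a singleton
```

Frequencies are computed per population because an allele that is rare across all 494 chromosomes can still be moderately frequent within the one population that carries it.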
Leabman et al. PNAS 100, 5896 (2003)
Analysis of Nucleotide Diversity
On average, genetic variation in membrane transporters (π) is similar to that in
other genes.
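Nucleotide diversity, conventionally written π, is the average number of differences per site between two sequences drawn from the sample. A minimal Python sketch of that definition, using hypothetical aligned sequences of equal length:

```python
from itertools import combinations

def nucleotide_diversity(sequences):
    """Average number of pairwise differences per site among aligned sequences."""
    length = len(sequences[0])
    n_pairs = len(sequences) * (len(sequences) - 1) // 2
    differences = sum(
        sum(a != b for a, b in zip(first, second))
        for first, second in combinations(sequences, 2)
    )
    return differences / n_pairs / length

# Hypothetical sample of four aligned chromosomes, 10 sites each.
SAMPLE = [
    "ACGTACGTAC",
    "ACGTACGTAC",
    "ACGTACGAAC",
    "ACCTACGAAC",
]

pi = nucleotide_diversity(SAMPLE)
print(round(pi, 4))  # 0.1167 (7 differences over 6 pairs and 10 sites)
```

Restricting the input to nonsynonymous sites, or to the sites of one structural region, gives region-specific values of the kind compared on the following slides.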
Next: study nucleotide diversity in TM domains and in loop domains.
Leabman et al. PNAS 100, 5896 (2003)
Variation across structural regions
As expected, amino acid
diversity (π_NS) is significantly
lower in TM domains than
in loops.
This is consistent with the observation
that TM domains are evolutionarily more conserved
than loops, suggesting that
there are constraints on TM
domains of transporters.
EC: evolutionarily conserved
EU: evolutionarily unconserved
This agreement suggests that constraints on structural regions of proteins (e.g.
TM domains) occur across long and short evolutionary distances for this set
of proteins.
Leabman et al. PNAS 100, 5896 (2003)
ABC and SLC superfamilies
ABC and SLC superfamilies of transporters have evolved to transport structurally
diverse biological molecules.
TMDs of both superfamilies contain residues and structural domains responsible for
substrate specificity.
Only the loops of the ABC transporters contain ATP-binding domains.
Observation: π_NS is extremely low in TM domains of ABC transporters,
much lower than in TM domains of SLC family members.
Leabman et al. PNAS 100, 5896 (2003)
Paralogue identification
Predicted secondary structures of two
representative membrane
transporters (BSEP and CNT1) from
the ABC and SLC superfamilies
showing positions of nonsynonymous
SNPs (leading to amino acid
mutations).
The transmembrane topology
schematic was rendered by using the
program TOPO.
Nonsynonymous amino acid changes
are shown in red.
Leabman et al. PNAS 100, 5896 (2003)
Evolutionary conservation
Surprisingly, the extent of amino acid diversity did not parallel evolutionary
conservation:
the fraction of EU residues in the TM domains of the ABC superfamily is
significantly higher than in the TM domains of the SLC superfamily.
This implies that a protein segment (TM domains of ABC transporters) is more
constrained within humans than across species
→ may be related to substrate properties
For the SLC superfamily, NS-EC is significantly lower than NS-EU – both for the TM
domains and for the loops.
For the TM domains of the ABC superfamily, NS-EC ~ NS-EU. This may reflect
special functional demands on the TM domain of this superfamily.
→ Again: variation among humans does not always parallel phylogenetic
variation!
Leabman et al. PNAS 100, 5896 (2003)
Back to Pharmacogenomics
With the linkage of genomics with transcriptomics + proteomics,
pharmacogenomics is undergoing a similar shift in focus from Mendelian
examples to more complex modes of genetic causation.
Candidate genes for variable drug response:
(1) genes that code for drug-metabolizing enzymes (DME). Most DME-encoding
genes have polymorphisms that have been shown to influence enzymatic activity.
(2) proteins involved in drug transport.
Drug transporters (e.g. ABC and SLC) show considerable genetic variation
including many functional polymorphisms.
Goldstein et al. Nature Rev. Gen. 4, 937 (2003)
Future of Pharmacogenomics
To detect the effect of a gene variant that explains 5% of the total phenotypic
variation in a quantitative response to a drug by typing 100 independent SNPs
would require 500 patients to provide an 80% chance of detection assuming an
experiment-wide false-positive rate of 5%.
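The structure of such a calculation can be sketched in Python. The model below is an assumption made for illustration, not the one behind the figure quoted above: it applies a Bonferroni correction across the independent SNPs, treats the variance explained as a squared correlation, and uses the Fisher z approximation for a two-sided test. The function name is hypothetical.

```python
from math import atanh, ceil, sqrt
from statistics import NormalDist

def patients_needed(variance_explained, n_tests, experiment_alpha=0.05, power=0.80):
    """Approximate sample size to detect a correlation of sqrt(variance_explained).

    Assumes a Bonferroni-corrected two-sided test and the Fisher z approximation.
    """
    per_test_alpha = experiment_alpha / n_tests           # Bonferroni correction
    z_alpha = NormalDist().inv_cdf(1 - per_test_alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    r = sqrt(variance_explained)
    return ceil(((z_alpha + z_power) / atanh(r)) ** 2 + 3)

n = patients_needed(variance_explained=0.05, n_tests=100)
print(n)
```

Under these simplified assumptions the answer comes out at a few hundred patients, somewhat below the 500 stated above. The slide does not give the underlying model, so the gap should be read as a difference in assumptions and not as a correction.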
The behaviour of most drugs will be influenced by a wide range of gene products
(DMEs, transporters, targets, and others), and in many cases the importance of
polymorphisms in one of the relevant genes might depend on polymorphisms in
other genes.
As a simple example, CYP1A2 and N-acetyltransferase 2 act in different stages
in the pathway that metabolizes compounds in burnt meat.
Variants might interact to influence the risk of colorectal cancer.
The polymorphisms compiled from the literature suggest that regulatory variants
have a far more important role in variable drug response than they do in Mendelian
diseases.
Goldstein et al. Nature Rev. Gen. 4, 937 (2003)