Comparative genomics and Target discovery


Maarten Sollewijn Gelpke
MDI, Organon





What is comparative genomics?
What can we learn from comparative genomics?
What is target discovery?
What are the implications of comparative genomics for target discovery?
What issues in target discovery can be addressed by comparative genomics?
Overview




Introduction to genomes and sequencing.
Comparative genomics aspects.
Phylogenomics concepts.
Examples of comparative genomics.
Sequence availability

Availability of gene and protein sequences has increased enormously during the last two decades.

Current capacity of the main sequencing centers is >3 Gb per month per center.

This will increase again dramatically with the development of new superfast sequencing techniques.
Currently >100 Gbases
Genomes sequenced
[Figure: timeline of genome sequencing milestones, 1995-2005. Labels shown: first bacterial genomes sequenced (H. influenzae and M. genitalium, 1995); the yeast genome (1996); E. coli K12 (1997); C. elegans; full sequence of chr. 22; D. melanogaster genome and chr. 21; A. thaliana; human draft (2001); mouse, Ciona, rice, Fugu, Anopheles; human finished, rat, chicken; chimpanzee, Xenopus, zebrafish.]
Genome sequencing
[Figure: phylogenetic tree with clade labels primates, rodents, carnivores, bovidae, mammalia, aves, amniota, amphibia, tetrapods, fish, vertebrates, tunicates, chordates, insects, nematodes, metazoans.]
Evolutionary relationship between metazoans (multicellular animals) that have been sequenced or are due for sequencing.
Genome sequencing

BAC fingerprinting → shotgun approach


Accurate but laborious!
Shotgun sequencing (WGS)
[Figure: shotgun sequencing workflow. A BAC clone (100-200 kb) or a whole genome (30 Mb to 3 Gb) is sheared into 1.0-2.0 kb DNA fragments that serve as sequencing templates. Random reads are assembled into a consensus sequence. Finishing addresses gaps, low base quality, single-stranded regions and mis-assemblies (inverted).]
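The "random reads → assembly → consensus sequence" step can be illustrated with a toy greedy overlap assembler. This is only a sketch under strong assumptions: reads are error-free, all come from the same strand, and repeats are ignored. The function names `overlap` and `greedy_assemble` are hypothetical and do not correspond to any production assembler.

```python
def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of a that equals a prefix of b (0 if shorter than min_len)."""
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the longest suffix/prefix overlap."""
    contigs = list(dict.fromkeys(reads))  # drop duplicate reads, keep order
    while True:
        best_size, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    size = overlap(a, b, min_len)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            return contigs  # nothing left to merge; remaining breaks are gaps
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


reads = ["ATGGCGT", "GCGTACC", "TACCTGA"]
print(greedy_assemble(reads))
```

Reads that share no sufficient overlap remain as separate contigs, which corresponds to the gaps that the finishing stage has to close.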
Genome sequencing

Current state of sequenced ‘organisms’:





>316 Prokaryotes
>27 Archaea
>280 Eukaryotes (complete, in assembly or in progress)
>1600 Viruses and > 500 mitochondria/chloroplasts
Some ongoing genome sequencing projects:


Poplar, gibbon, platypus, Drosophila species, a variety of pathogenic fungi and bacteria, etc.
Meta-genomic projects on environmental samples (soil, deep sea, waste sites)
Future of genome sequencing?





New complete genomes.
New low-redundancy genomes.
New (low-redundancy) genome areas.
Meta-genomics. Sequencing of microbial communities.
Sequencing of extinct species.


40,000-year-old cave bear: 26 kb, 21 genes.
45,000-year-old Neanderthal: 75 kb; diverged from the human lineage ~315,000 years ago.
Comparative genomics


Discover what lies hidden in genomic sequence by
comparing sequence information.
Main areas






Whole genome alignment
Gene prediction
Regulatory element prediction
Phylogenomics
Pharmacogenetics
Affected by evolutionary aspects




Mutational forces (introduce random mutations)
Selection pressures
Ratio of non-synonymous to synonymous substitutions
Mutation rates lower or higher than neutral
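The ratio of non-synonymous to synonymous substitutions can be approximated with a crude codon-by-codon count. The sketch below assumes two gap-free, in-frame coding sequences of equal length and uses the standard genetic code. It only counts differing codons and does not normalize by the number of synonymous and non-synonymous sites, as proper dN/dS methods do. The function name `classify_codon_changes` is hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def classify_codon_changes(seq1: str, seq2: str) -> tuple[int, int]:
    """Return (synonymous, non_synonymous) counts of differing codons."""
    if len(seq1) != len(seq2) or len(seq1) % 3 != 0:
        raise ValueError("sequences must be equal length and a multiple of 3")
    synonymous = non_synonymous = 0
    for i in range(0, len(seq1), 3):
        codon1, codon2 = seq1[i:i + 3].upper(), seq2[i:i + 3].upper()
        if codon1 == codon2:
            continue
        if CODON_TABLE[codon1] == CODON_TABLE[codon2]:
            synonymous += 1      # same amino acid, silent change
        else:
            non_synonymous += 1  # amino acid replaced
    return synonymous, non_synonymous


print(classify_codon_changes("ATGCTTAAAGGT", "ATGCTCAGAGGA"))
```

A sequence under purifying selection would be expected to show relatively few non-synonymous changes compared with synonymous ones.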
Comparing sequences, methods.

Pairwise comparison of
sequences (alignments)




proteins or genes
variety of local alignment tools like BLAST, Smith-Waterman etc.
multiple sequence comparisons (ClustalW, Muscle etc.)
results may be dependent on alignment settings
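A minimal Smith-Waterman local alignment score shows how such results depend on the alignment settings. The scoring values (match 2, mismatch -1, linear gap penalty -2) are assumptions chosen for illustration; real tools use substitution matrices and affine gap penalties.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score between a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ch_a == ch_b else mismatch)
            # A local alignment may restart anywhere, so scores never drop below 0.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


print(smith_waterman_score("TTACGTT", "GGACGGG"))
print(smith_waterman_score("TTACGTT", "GGACGGG", match=1))
```

Changing only the match score changes the reported score for the same pair of sequences, which is one reason scores from runs with different settings are not directly comparable.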
Comparing sequences, methods.

Whole genome comparisons



Large stretches of sequence
Divergence up to 450 Mya (Fugu-human) with sufficient similarity remaining.
BLAT, BLASTZ, Phusion/BlastN

Seeding strategy → alignment extension → gapped alignments
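The first two stages of this strategy can be sketched as follows: index the k-mers of the target, look up exact seed matches from the query, then extend each seed in both directions. This toy version extends only while bases match exactly and omits the gapped alignment stage. It is not how BLAT or BLASTZ are implemented; the function name `seed_and_extend` and the default `k` are assumptions.

```python
from collections import defaultdict


def seed_and_extend(query: str, target: str, k: int = 4) -> list[tuple[int, int, int]]:
    """Return sorted (query_start, target_start, length) of ungapped exact matches."""
    # Seeding: index every k-mer position in the target.
    index = defaultdict(list)
    for t in range(len(target) - k + 1):
        index[target[t:t + k]].append(t)

    hits = set()
    for q in range(len(query) - k + 1):
        for t in index.get(query[q:q + k], []):
            # Extension to the left while bases agree.
            start_q, start_t = q, t
            while start_q > 0 and start_t > 0 and query[start_q - 1] == target[start_t - 1]:
                start_q -= 1
                start_t -= 1
            # Extension to the right while bases agree.
            end_q, end_t = q + k, t + k
            while end_q < len(query) and end_t < len(target) and query[end_q] == target[end_t]:
                end_q += 1
                end_t += 1
            hits.add((start_q, start_t, end_q - start_q))
    return sorted(hits)


print(seed_and_extend("TTTACGTGCAAA", "CCACGTGCGG"))
```

Overlapping seeds inside the same conserved block extend to the same hit, so the set removes the duplicates.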
Whole genome comparison

Conservation of synteny!
  Cross-reference of any genetic traits (diseases!) from one organism (e.g. mouse) to genes in the syntenic regions in the other organism (e.g. human).
Genome expansion and contraction
  Genome duplications, segmental duplications: important mechanism for generating new genes.
(G+C) content, CpG islands
  Reflect different mutational or DNA repair processes?
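(G+C) content and the CpG observed/expected ratio are simple to compute for a sequence window. The sketch below uses the common definition obs/exp = (number of CG dinucleotides × length) / (number of C × number of G). The thresholds usually quoted for calling a CpG island (G+C above 50%, obs/exp above 0.6, length of at least 200 bp) are not taken from these slides.

```python
def gc_content(seq: str) -> float:
    """Fraction of G and C bases in the sequence."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def cpg_obs_exp(seq: str) -> float:
    """Observed/expected CpG ratio; 0.0 when C or G is absent."""
    seq = seq.upper()
    c_count, g_count = seq.count("C"), seq.count("G")
    if c_count == 0 or g_count == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c_count * g_count)


window = "CCGG"
print(gc_content(window), cpg_obs_exp(window))
```

In practice these values would be computed in sliding windows along the genome and compared between syntenic regions of the two organisms.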
Repeats


Transposable elements (TEs) are a main force in reshaping genomes.
TEs (or remnants thereof) can be used to measure evolutionary forces acting on the genome, such as the neutral mutation rate.
Gene prediction

Comparing sequences has contributed enormously to the accuracy of
gene prediction.

Evidence-based method.

Use cDNAs, ESTs and proteins from various organisms.
 Apply gene feature rules.
[Figure: gene model built from aligned evidence tracks: proteins, clustered ESTs, cDNAs.]
Gene prediction

De novo methods.




Alignment of genomic sequences
Splicing rules and other gene features
De novo gene prediction by comparing sequences attempts to model negative selection against mutations. Areas with fewer mutations are conserved because mutations there were detrimental to the organism.
Prediction of similar proteins in both genomes.
Example: a newly predicted protein in mouse and human, similar to the disease-related gene dystrophin.
Regulatory element prediction

The complexity of higher eukaryotes and their relatively
low number of genes can be explained partially through
the importance of transcriptional regulation.

Identification of regulatory elements (REs) will have an extensive impact on understanding gene expression patterns (expression intensity, tissue specificity), relations between expression patterns, and the inference of biological systems or networks.
Regulatory element prediction


No formal models for regulatory motifs
Attempt to find conserved regions or motifs based on the
global alignment of similar sequences of different
organisms (phylogenetic footprinting).




Which species to compare? Evolutionary distance?
What regions around gene models to investigate? 5’ and 3’
flanking regions, introns?
Take expression patterns into account?
How does evolution affect REs?
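Phylogenetic footprinting can be sketched as a sliding-window identity scan over an existing pairwise alignment of flanking sequence from two species. The window size and identity threshold below are arbitrary assumptions, and the function name `conserved_windows` is hypothetical. Choosing species at a suitable evolutionary distance remains the open question raised above.

```python
def conserved_windows(aln_a: str, aln_b: str, window: int = 10,
                      min_identity: float = 0.9) -> list[tuple[int, float]]:
    """Return (start_column, identity) for alignment windows at or above min_identity."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have the same length")
    # A column counts as conserved only if both rows carry the same base (no gap).
    matches = [int(x == y and x != "-") for x, y in zip(aln_a.upper(), aln_b.upper())]
    hits = []
    for start in range(len(matches) - window + 1):
        identity = sum(matches[start:start + window]) / window
        if identity >= min_identity:
            hits.append((start, identity))
    return hits


print(conserved_windows("AAAAACCCCC", "AAAAATTTTT", window=5, min_identity=1.0))
```

Overlapping windows that pass the threshold would normally be merged into candidate conserved regions before looking for motifs.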
Phylogenomics

Comparison of genes and gene products across a number of species (whole genomes), to characterize homologues and gain insight into the evolutionary process itself.

Pharmacophylogenomics is the use of phylogenomics in
aid of drug discovery, through improved target selection
and validation.
Orthology and paralogy
[Figure: phylogenetic tree of gene X]

Orthologs: genes in different species that arose from a single gene in the most recent common ancestor, by speciation.
Paralogs: genes in the same species that arose from a single gene in an ancestral species, by a process of gene duplication.
Target orthology

Species differences frequently affect progression of
targets and compounds. Orthology maps in combination
with expression studies may explain these differences.

Establishing orthology




Reciprocal highest-scoring BLAST hit.
Conservation of synteny.
Gene loss or rate of evolution issues.
Orthology does not guarantee common function
(functional shift).



Extensive sequence divergence
High non-synonymous over synonymous nucleotide substitution
ratios.
Comparison of regulatory regions?
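The reciprocal highest-scoring hit criterion can be sketched as below, assuming that similarity scores (for example BLAST bit scores) for both search directions have already been parsed into nested dictionaries. The gene identifiers and scores are hypothetical.

```python
def best_hits(scores: dict[str, dict[str, float]]) -> dict[str, str]:
    """Map each query gene to its highest-scoring subject gene."""
    return {query: max(subjects, key=subjects.get)
            for query, subjects in scores.items() if subjects}


def reciprocal_best_hits(a_vs_b: dict[str, dict[str, float]],
                         b_vs_a: dict[str, dict[str, float]]) -> list[tuple[str, str]]:
    """Pairs (a, b) where b is a's best hit and a is b's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Hypothetical scores: hB has lost its true ortholog and only hits a mouse paralog.
human_vs_mouse = {"hA": {"mA": 900, "mA2": 400}, "hB": {"mA2": 300}}
mouse_vs_human = {"mA": {"hA": 880}, "mA2": {"hA": 410, "hB": 290}}
print(reciprocal_best_hits(human_vs_mouse, mouse_vs_human))
```

In the example, hB is not paired with mA2 because the hit is not reciprocal. Gene loss or unequal rates of evolution produce this kind of case, which is why conservation of synteny is used as supporting evidence.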
Target paralogy


Key insights in large pharmacologically relevant families
(NRs, GPCRs) can be gained from paralogy analysis.
Paralogy is interrelated with several other gene-to-function relationships that can seriously affect the suitability of genes as drug targets.
[Figure labels: Paralogy, Alternative transcription, Pleiotropy, Redundancy, Heteromery, Crosstalk; levels Gene, Protein, Function.]
Schematic representation of various mappings of genes to functions.

Pleiotropy





Suggested to precede paralogy
Relaxed substrate or ligand specificity
Multiple protein domains
Tissue or cellular localization
Redundancy




Total or partial redundancy of function
Directly linked to paralogy
Robustness against gene knock-outs (target validation)
PPAR-δ / PPAR-α in skeletal muscle; PXR / FXR in bile acid
signaling; dopamine transporters / serotonin transporters in
adjacent neurons.

Heteromery


Formation of heteromers between paralogs
Known examples in major classes of drug targets




GPCRs: GABA-B receptors
NRs: formation of heterodimers with retinoid X receptors (RXR)
Ion channels
Crosstalk



Combination of pleiotropy and redundancy
May be regulated in time and space (expression and
localization)
Action of cytokines (interleukins) on immune cell types.

Alternative transcription


Intermediate between paralogy and pleiotropy: ‘paralogy in place’
Increases effective size of the genome (estimated >30% of
human genes show alternative transcription!)
Effects on drug discovery

Functional shifts, pleiotropy and redundancy can each be good or bad news for drug discovery.

Functional shifts
  Misleading or unavailable animal model
  Animal toxicity irrelevant for humans
Pleiotropy
  Unintended drug effects
  Opportunities for multiple indications
Redundancy
  Disease resistant to treatment (multi-functionality)
  Highly selective treatment for complex diseases.
Pharmacogenetics




Within-species comparative genomics:
 Single Nucleotide Polymorphisms (SNPs)
Current focus is on coding regions, expected to expand to sites of transcription regulation.
Determine the site of a SNP and the allele frequencies from ethnic or multi-ethnic panels of individuals (e.g. 100).
Pharmacogenetics (PGx): relate SNP information to efficacy and safety issues during the drug development process.
Efficacy PGx: select/predict drug responders, increase confidence in a certain drug in development.
Safety PGx: identify individuals with adverse reactions to a drug.
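Determining allele frequencies from a panel of individuals can be sketched as follows. Genotypes are assumed to be written in the "1-1", "1-2", "2-2" notation for a biallelic SNP, and the panel of 100 individuals below is invented purely for illustration.

```python
from collections import Counter


def allele_frequencies(genotypes: list[str]) -> dict[str, float]:
    """Allele frequencies from diploid genotypes written as 'allele-allele'."""
    counts = Counter(allele for genotype in genotypes for allele in genotype.split("-"))
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}


# Hypothetical panel of 100 individuals.
panel = ["1-1"] * 49 + ["1-2"] * 42 + ["2-2"] * 9
print(allele_frequencies(panel))
```

For efficacy or safety PGx, the same genotype labels would then be used to group individuals and compare drug response between groups.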
Examples




New genes and REs from yeast genomes.
Multi species comparisons from targeted genomic
regions.
Comparative genomics at the vertebrate extremes.
Pharmacogenetics in drug efficacy
Comparison of yeast species to identify
genes and regulatory elements. (Kellis et al, Nature 2003)

Saccharomyces cerevisiae and 3 related species

7x coverage WGS of each species
 Assembly of draft genome sequence
 S.cerevisiae genome aligned to others using ORFs as seeds




Most ORFs have 1:1 matches. Considerable conserved synteny.
Most genomic rearrangements clustered in telomeric regions.
Local gene family expansion/contraction, creating phenotypic
diversity over evolutionary time.
Balance between conservation and divergence allows for
accurate gene identification and recognition of REs as well!
Identification of genes



Original S.cerevisiae genome (1996): 6275 ORFs
Re-analysis and other evidence (2002): 6062 ORFs
This study validates all ORFs using a reading frame
conservation score (very sensitive).


5538 ORFs, 20 unresolved, 504 rejected ORFs!
In addition to gene recognition, also largely improved
gene structure definitions (start, stop, intron).
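The idea behind a reading frame conservation score can be illustrated with a simplified sketch. This is not the exact procedure of the study: it takes one gapped pairwise alignment of an ORF, tracks the frame offset between the two sequences at every aligned column, and reports the fraction of aligned columns that share the most common offset. The function name `reading_frame_conservation` is hypothetical.

```python
def reading_frame_conservation(aln_ref: str, aln_other: str) -> float:
    """Fraction of aligned (gapless) columns sharing the dominant frame offset."""
    if len(aln_ref) != len(aln_other):
        raise ValueError("aligned sequences must have the same length")
    pos_ref = pos_other = 0
    offset_counts = [0, 0, 0]
    for x, y in zip(aln_ref, aln_other):
        if x != "-" and y != "-":
            # Offset 0 means both bases sit at the same codon position.
            offset_counts[(pos_ref - pos_other) % 3] += 1
        if x != "-":
            pos_ref += 1
        if y != "-":
            pos_other += 1
    aligned = sum(offset_counts)
    return max(offset_counts) / aligned if aligned else 0.0


print(reading_frame_conservation("ATGAAACCCTAA", "ATG---CCCTAA"))  # in-frame indel
print(reading_frame_conservation("ATGAAACCCTAA", "ATG-AACCCTAA"))  # frameshift
```

Indels whose length is a multiple of three leave the frame intact, as expected for real genes, while frameshifting indels lower the score, as expected for spurious ORFs.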
Identification of regulatory elements

REs are difficult to identify
  Short (6-15 bp), sequence variation, few known rules
De novo discovery of REs directly from genomic sequence.
  Develop a motif conservation score system based on known motifs
  78 motifs discovered, overlapping with 36 of 55 known motifs
Putative annotation of motifs using adjacent genes (GO).
  25 of 42 new motifs show high category annotation correlation
Discovery of combinatorial control of REs
Applications to human genome?
  Increase the number of species in the comparison to improve the low signal-to-noise ratio in humans.
Multi species comparisons from targeted
genomic regions. (Thomas et al, Nature 2003)


Comparing targeted regions in multiple evolutionarily diverse vertebrates (conservation is less likely to occur by chance)
ENCODE project



44 genomic regions (14 manually selected, some of them disease-related; 30 random) of diverse gene density and non-exonic conservation
primates, bat, alligator, elephant, cat, emu, leopard, salmon etc.
Initial analysis 1.8 Mb on chromosome 7 containing 10
genes, including CFTR, from 12 species.

Detection of ~1000 multi-species conserved sequences of which
>60% would not be detected by a 2 species comparison.
Comparative genomics at the vertebrate
extremes (Boffelli et al, Nature 2004)

What can be learned from comparisons of genomes that
are distant or closely related in evolution?

Distant comparisons reveal the most constrained
sequence elements.


Most of the conserved human-fish non-coding sequences are
found near genes with roles in embryonic development.
Mutations can have an important role in human disease


Human-Fugu conservation of non-coding sequence in the DACH
gene area (development of brain, limbs, sensory organs).
Validation of identified enhancer regions by driving expression of a
reporter in mouse embryos.
Comparative genomics at the vertebrate
extremes

Intraspecies sequence comparisons allow identification of species-specific sequences
Phylogenetic shadowing
Requires high rate of polymorphism
Comparison among primates shows human-specific sequences
Analysis of the regulatory sequence of ApoA (involved in human heart disease)
[Figure: A. Mutation rate analysis of the 5’ region of the Ciona intestinalis forkhead gene. B. Validation of identified potential regulatory elements in Ciona larvae.]
Pharmacogenetics in drug efficacy
Efficacy PGx for an obesity drug.
Compare genotypes 1-1, 1-2 and 2-2