Drug Target Discovery by Genome Analysis
AREXIS
Model
genome
Drug
[Figure: sequence data over time (×10^6; 1990, 1995, 2000), annotated "97% total", "67% finished" and "gap"]
Species          # of genes    % known function
E. coli          4,289         62
S. cerevisiae    6,217         65
C. elegans       19,000
M. musculus      30-50,000
H. sapiens       30-50,000     ?
Link genes to biological functions
Bioinformatics?
• Bioinformatics - the research field that handles and analyses "bioinformation"
• Bioinformation - the information stored in:
– genome data (genes, gene expression, gene function, etc. in relation to the organism that harbours the genome in question)
– biological sequences, and
– relationships between biological sequences, with respect to the function of biological organisms (metabolism, health, etc.)
• Bioinformatics should generate ideas and proposals for new wet-lab experiments
• Researchers with bioinformatics as an experimental tool (in silico biology)
Animal models
Why animal models?
• Genetically homogeneous
• Controlled environmental influence
• Large family sizes give optimal statistical power
• Tools to define and characterise disease causative genes and mechanisms
• In vivo validation and in vivo pharmacology
• Increase productivity
• Higher resolution
Research and development strategy
Disease models → Genetic analysis → Target discovery → Target validation → Drug discovery → Clinical development → Marketing of new products
Participants: Academic partners, Arexis, Industrial partners
Integrated biology-driven discovery
Comparative biology
Human patient materials
Medicinal chemistry
Bioinformatics
Biotechnology expertise
Clinical science
Functional genomics
In vivo pharmacology
R&D project overview
Metabolic diseases
Type 2 diabetes
X
Obesity
X
AMPK
Inflammatory diseases
Rheumatoid arthritis
X
Multiple sclerosis
X
Skin inflammation
Immunotherapy
Prioritised projects
SCCE
X
Muc. A
Business model
Input to the Arexis
pipeline and
portfolio
Research
collaborations
Sub-contracts
Partnerships
Target and drug discovery
Revenue sources
Commercialisation
process
Spin-off
opportunities
Early
Access fees
Research funding
Targeted
In-licensing
Drug development/
commercialisation
Milestone
payments
Mid
Royalties
Late
Organisation build-up plan
                          2001  2002  2003  2004  2005  2006
Management & Administration
  Management                       3     4     5     5     5
  Administration                   2     3     4     5     5
  Accumulated                3     5     7     9    10    10
R&D
  Bioinformatics                   3     5     8    10    11
  Biology                         10    21    32    45    57
  Chemistry                        2     4     6     8    13
  Clinical development             1     2     3     4     6
  Accumulated                5    16    32    49    67    87
Total                        8    21    39    58    77    97
2001 breakdown values as listed in the source (row assignment not recoverable): 3, 2, 3
Board of Directors
Anders Vedin, Chairman of the Board
Professor, Senior Advisor InnovationsKapital AB
Henry Geraedts, Deputy Chairman of the Board
PhD, Independent director, 3i
Carl Christensson
CEO SEB Företagsinvest
Rikard Holmdahl
Professor of Medical inflammation, founder
Lennart Hansson
PhD, Chief Executive Officer
Leif Andersson
Professor of Animal Genetics, founder
Curt Lönnström
Chief Executive Officer of Ryda Bruk
Research System Architecture
[Diagram labels]
Expression profiling: Affymetrix experiment and experimental data; database with annotated experiments
Genetic approaches: aGDB, genetic/linkage data
In silico approaches: Ensembl, auto-annotated genomes
Pointers to disease loci; pointers to phenotype-related genes
Integration: relevant genes, phenotype-related pathways
Target database: curated gene structures
IT System Architecture
[Diagram labels: aGDB; DAS; Arexis users; Academic partners; Commercial partners; tools for sequence analysis; tools for expression data analysis; LDAP; VPN; GIM; business dev; mail; economy; documents; Arexis intranet]
[Figure: species trees for the AMPK project, project B and project C, relating pig, mouse, rat and homo through common ancestors; some branches marked "?"]
Tissue section of skeletal muscle fiber from
Hampshire pigs
Normal rn+/rn+
Mutant RN-/rn+ or RN-/RN-
AMPK
A skeletal muscle-specific variant of AMPK
AMP-activated protein kinase (AMPK) - a heterotrimeric enzyme
Subunits: α (α1, α2), β (β1, β2), γ (γ1, γ2, γ3)
Tissue distribution of AMPK γ-chains
[Figure: expression of γ1, γ2 and γ3 in Colon, Peripheral blood, Small intestine, Ovary, Testis, Prostate, Thyroid gland, Spleen, Pancreas, Kidney, Muscle, Liver, Lung, Brain, Placenta, Heart]
AMPK
Pathways regulating glucose
transport in muscle cells
AMPK
Modified from Shepherd et al. NEJM 1999
AMPK
genetic mapping
Experimental validation
chr. 5 mouse
chr. 7 human
Link to pathophysiology?
Pathway analysis!
[Figure: AMPK signalling pathway. Labels: AMP; AMPK (α, β, γ subunits); AMPKK; Protein Phosphatase 2C; Protein Phosphatase 2A; Acetyl-CoA Carboxylase (phosphorylated, inactive); Malonyl-CoA Decarboxylase (phosphorylated, active); Acetyl CoA; Malonyl CoA; Fatty acid; Increased glucose uptake; Increased amount of GLUT4; Decreased glycogen degradation]
Pristane-induced arthritis in the rat
Susceptible: DA rat
Resistant: E3 rat
[Figure: duplicated genomic segments, mouse (1 Mbp) vs human (2.4 Mbp), with position of mouse gene]
Genomics data
Expression data
integrate / analyse / visualise
Reconstruction of Pathway
Drug Target
NOVEL
Database resources
National Center for Biotechnology Information (NCBI)
European Bioinformatics Institute (EBI)
DNA Data Bank of Japan (DDBJ)
Databases: Blocks, MGI, PubMed, EMBL, PIR, GenBank, GDB, Pfam, ProSite, Swissprot, MIPS, AceDB
Content: DNA, protein, motifs, families, genomes, bibliography
Where do sequences come from?
DNA → genomic sequence
• Directed / small-scale
• Large-scale: BACs, YACs
mRNA → cDNA sequence
• Directed / small-scale
• Random / large-scale
• Expressed Sequence Tag [EST]
protein → protein sequence
• Directed, very little
Sequence databases
Nucleotide databases
GenBank, EMBL and DDBJ: the International Nucleotide Sequence Database Collaboration
Sequence databases
Primary vs secondary databases
• Primary database = sequence database
– e.g. EMBL, GenBank, SWISSPROT
– Each record describes an individual sequence
– Can contain either nucleotide or protein sequences
Example records: Seq 1 ACGTTT; Seq 2 CTAGAC; Seq 3 TTCTGA
Sequence databases
Primary vs secondary databases
• Secondary database = pattern database
– e.g. PROSITE, PRINTS, BLOCKS, Pfam
– Each record describes a set of sequences
– Set can be expressed as a motif, multiple sequence alignment or probabilistic model
Example records:
Pattern 1: accagtgt, acgactct
Pattern 2: tacgtagc, tacctacc, taggtagc
Pattern 3: ttcgatgtca, ttcgatcgca, tccgatcgtc
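As a rough illustration of how a set of sequences can be expressed as a motif, the sketch below collapses the three sequences listed under "Pattern 2" into a position-wise pattern. The function name `consensus_pattern` and the bracket notation are assumptions made for this example; real secondary databases use their own, richer pattern syntaxes.

```python
import re


def consensus_pattern(sequences):
    """Collapse equal-length sequences into a position-wise pattern.

    Columns where all sequences agree become a single letter; variable
    columns become a bracketed character class such as [cg].
    """
    if len({len(s) for s in sequences}) != 1:
        raise ValueError("sequences must already be aligned to equal length")
    parts = []
    for column in zip(*sequences):
        letters = sorted(set(column))
        if len(letters) == 1:
            parts.append(letters[0])
        else:
            parts.append("[" + "".join(letters) + "]")
    return "".join(parts)


# The three sequences shown under "Pattern 2" on the slide.
pattern_2 = ["tacgtagc", "tacctacc", "taggtagc"]
motif = consensus_pattern(pattern_2)
print(motif)

# The bracket notation happens to be valid regular-expression syntax,
# so the motif can be used directly to test other sequences.
print(all(re.fullmatch(motif, s) for s in pattern_2))
```

Note that such a motif is more permissive than the set it came from: it also accepts combinations of letters that none of the original sequences contain.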
Sequence databases
Nucleotide databases
• How do the databases compare?
– Three databases are 99.99% identical
– Annotations can be slightly different
• How often are they updated?
– New release of databases every 3 months
– Interim releases - EMBL-new
• Can the annotations be trusted?
– Not always - some estimates suggest 25% are incorrect
Sequence databases
Nucleotide databases
• EMBL is subdivided into EST and non-EST sequences, each further divided by species (e.g. vrt, mam, hum, rod)
Sequence databases
Protein databases
GenBank → GenPept
EMBL → TrEMBL
PIR
SWISSPROT
Sequence databases
Protein databases
EMBL
• 13,700,000 entries
TrEMBL
• Coding sequences automatically translated
• 558,150 entries
• TrEMBL split into:
– SP-TrEMBL - Sequences destined for SWISSPROT
– REM-TrEMBL - Remaining sequences
SWISSPROT
• Sequences manually moved to SWISSPROT
• 106,602 entries
• Because it is manually curated, annotations are reliable!
Sequence databases
Summary
• EMBL is main nucleotide sequence database (Europe)
• TrEMBL is an automated translation of EMBL
• SWISSPROT is main curated protein database
• Between main releases, interim releases are made
– e.g. EMBL-new, TrEMBL-new, SWISSPROT-new
• EMBL is subdivided into EST / non-EST then by species
• Annotations can be trusted in SWISSPROT, not in EMBL
• Accession numbers uniquely identify a sequence and remain constant when entries are updated
Basics of sequence searching
Methods
Method          Accuracy   Duration   Example
Rigorous        +++++      +++++      Smith-Waterman
Heuristic       ++         +          BLAST, FASTA
Probabilistic   ++++       +++        HMM
• Probabilistic methods are best, but can be slow and difficult to use
• Rigorous methods are good when used on a small subset of sequences, but too slow to search large sequence databases
• Heuristic methods are the best place to start
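To make the cost of a rigorous method concrete, here is a minimal sketch of Smith-Waterman local alignment scoring. The scoring values (match +2, mismatch -1, linear gap -2) are assumptions chosen for illustration, not parameters from the slides. Every cell of a table the size of query length × subject length is filled for each database sequence, which is why the method is slow on large databases.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between strings a and b.

    Uses the Smith-Waterman recurrence with a linear gap penalty and
    keeps only two rows of the table in memory.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below zero: the alignment may restart anywhere.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# A short query found inside a longer, otherwise unrelated sequence.
print(smith_waterman_score("ACGTTT", "CCCACGTTTGGG"))
```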
Basics of sequence searching
Terminology
• Sensitivity vs Selectivity
– Sensitive searching will find weaker hits
– Selective searching is less likely to find unrelated hits
– Increased sensitivity means more true positives
– Increased selectivity means fewer false positives
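The two terms can be put into numbers. The slides give only the qualitative statements, so the formulas below are an assumption: sensitivity as the fraction of true relatives that the search finds, and selectivity as the fraction of reported hits that are true relatives. Other texts define selectivity differently, so treat this as one possible reading.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of the truly related sequences that the search reported."""
    return true_pos / (true_pos + false_neg)


def selectivity(true_pos, false_pos):
    """Fraction of the reported hits that are truly related (assumed definition)."""
    return true_pos / (true_pos + false_pos)


# Hypothetical search: 50 true relatives exist in the database,
# 45 of them are reported, along with 15 unrelated hits.
print(sensitivity(true_pos=45, false_neg=5))
print(selectivity(true_pos=45, false_pos=15))
```

Loosening a search threshold typically raises the first number and lowers the second, which is the trade-off the slide describes.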
Searching with BLAST
How it works
Query sequence vs sequence in database:
• Find identical stretches of nucleotides in the two sequences
• Extend regions of similarity as far as possible; each extended region is a high-scoring segment pair (HSP)
• Identify all regions of similarity (HSP 1, HSP 2, ...)
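The seed-and-extend idea can be sketched in a few lines. This is a deliberately simplified illustration, not BLAST itself. Real BLAST scores extensions with a substitution matrix and tolerates mismatches up to a drop-off threshold, whereas this sketch extends only while characters are identical. The function names and the word size of 4 are assumptions made for the example.

```python
def find_seeds(query, subject, word_size=4):
    """Yield (query_pos, subject_pos) for every identical word of length word_size."""
    index = {}
    for i in range(len(query) - word_size + 1):
        index.setdefault(query[i:i + word_size], []).append(i)
    for j in range(len(subject) - word_size + 1):
        for i in index.get(subject[j:j + word_size], []):
            yield i, j


def extend_seed(query, subject, i, j, word_size=4):
    """Extend an exact seed in both directions while characters stay identical.

    Returns (query_start, subject_start, length) of the ungapped segment.
    """
    start_q, start_s = i, j
    while start_q > 0 and start_s > 0 and query[start_q - 1] == subject[start_s - 1]:
        start_q -= 1
        start_s -= 1
    end_q, end_s = i + word_size, j + word_size
    while end_q < len(query) and end_s < len(subject) and query[end_q] == subject[end_s]:
        end_q += 1
        end_s += 1
    return start_q, start_s, end_q - start_q


def find_hsps(query, subject, word_size=4):
    """Return the distinct extended segments, longest first."""
    segments = {extend_seed(query, subject, i, j, word_size)
                for i, j in find_seeds(query, subject, word_size)}
    return sorted(segments, key=lambda seg: (-seg[2], seg[0], seg[1]))


# Three overlapping seeds (CCGG, CGGT, GGTT) collapse into one segment.
print(find_hsps("AAACCGGTTGA", "TTTCCGGTTCC"))
```

Indexing the query words first is what makes the heuristic fast: most of the database is never compared character by character.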
Local vs global comparisons
The nature of proteins
• Proteins consist of functional and structural units called domains
Local vs global comparisons
What is a local and global comparison?
Global comparison
attempts to match all of
one sequence against
another
Local comparison attempts
to match short stretches of
one sequence with another
Local vs global comparisons
When should each technique be used?
• Global comparisons
– Closely related sequences
– Same general structure of sequence
– Roughly equal lengths
• Local comparisons
– Sequences not closely related
– Sequence fragments
– Interested in identifying common domains
Local vs global comparisons
When should each technique be used?
[Figure: two sequences sharing a common domain, alongside non-matching domains and a domain unique to one sequence]
Global comparison will
attempt to match all of one
sequence against another
even when sequences
share only one common
domain
Global comparison should
only be used if the
sequences being
compared have a common
domain structure
Local vs global comparisons
Summary
• Proteins are organised into domains
• Local comparisons find short stretches of similarity
• Global comparisons match the whole length of one
sequence against another
• Local comparisons should be used unless sequences
are closely related and have identical domain
structures.
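The difference can be seen by scoring the same pair of sequences both ways. The sketch below assumes a simple scoring scheme (match +2, mismatch -1, linear gap -2) and uses two made-up sequences that share one domain but have unrelated flanking regions. Global scoring follows Needleman-Wunsch and local scoring follows Smith-Waterman.

```python
def alignment_score(a, b, local, match=2, mismatch=-1, gap=-2):
    """Score the best alignment of a and b by dynamic programming.

    local=False: global alignment (Needleman-Wunsch); every residue of both
    sequences must be aligned or paid for with a gap.
    local=True: local alignment (Smith-Waterman); scores restart at zero,
    so only the best-matching stretch counts.
    """
    floor = 0 if local else float("-inf")
    prev = [0 if local else j * gap for j in range(len(b) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0 if local else i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(floor, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best if local else prev[-1]


# Hypothetical sequences: one shared domain, unrelated flanks.
domain = "MKVLAAGIW"
seq_1 = "PPPPPP" + domain
seq_2 = domain + "DDDDDD"

print("local: ", alignment_score(seq_1, seq_2, local=True))
print("global:", alignment_score(seq_1, seq_2, local=False))
```

The local score reflects the shared domain alone, while the global score is dragged down by the gaps needed to account for the flanks.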
Searching with BLAST
Search with DNA or protein?
• Use DNA if
– There are frameshifts - common in ESTs
– Interested in evolution (3rd base in codon hidden in translation)
• Otherwise, use protein sequence. Why?
– Two DNA sequences can be aligned in six ways
– Each alignment can give scores, therefore more partial matches
– Therefore there is more noise associated with comparison
– Statistical significance of good hits is thus reduced.
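The "six ways" are read here as the six reading frames of a DNA sequence (three offsets on each strand), which is an interpretation of the slide rather than something it states. The sketch below translates a sequence in all six frames with the standard genetic code; the function names are chosen for this example.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons; unknown codons (e.g. containing N) become X."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translations(dna):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to protein strings."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for frame, protein in six_frame_translations("ATGGCCAAATAA").items():
    print(frame, protein)
```

Only one of the six frames is normally the real coding frame, so a DNA-level comparison produces many more chance partial matches than a protein-level one.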
Searching with FASTA
BLAST vs FASTA
• Advantages of BLAST
– Faster than FASTA
– Reports all high-scoring local alignments
• Advantages of FASTA
– More sensitive - approaches that of rigorous methods
– Faster than rigorous methods
– E-values are more accurate
– Better handling of frameshifts - important for ESTs.
Basics of sequence searching
Summary
• Sequence searching is complicated because we want
to find partial matches
• Search method should be sensitive and selective
• Rigorous methods are much more sensitive than
heuristic methods, but are too slow
Secondary databases
Databases available - Prosite
• 1492 regular expressions
• Each entry consists of two files
– Text file with information on family
– A regular expression and matching sequences
ID   PROTEIN_KINASE_TYR; PATTERN.
AC   PS00109;
DT   APR-1990 (CREATED); DEC-1992 (DATA UPDATE); JUL-1998 (INF UPDATE).
DE   Tyrosine protein kinases specific active-site signature.
PA   [LIVMFYC]-x-[HY]-x-D-[LIVMFY]-[RSTAC]-x(2)-N-[LIVMFYC](3).
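A PROSITE pattern maps closely onto an ordinary regular expression. The sketch below converts the pattern shown above into Python `re` syntax. It handles only the elements needed here plus `{...}` exclusions (`x` for any residue, bracketed alternatives, repeat counts); the `<` and `>` anchors are not handled. The function name and the test sequence are made up for this example.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE pattern (the PA line) into a Python regex string."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        # Separate an optional repeat count such as (2) or (2,4) from the element.
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = match.groups()
        if core == "x":
            core = "."                       # x = any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"   # {..} = any residue except these
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return "".join(parts)


pa_line = "[LIVMFYC]-x-[HY]-x-D-[LIVMFY]-[RSTAC]-x(2)-N-[LIVMFYC](3)."
regex = prosite_to_regex(pa_line)
print(regex)

# Synthetic sequence built to contain the signature; not a real protein.
hit = re.search(regex, "GGLAHRDLAAANLLLGG")
print(hit.group(0) if hit else None)
```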
Secondary databases
Databases available - Pfam
• Split into two sections
– Pfam-A   3,071 HMMs    (Curated)
– Pfam-B   36,700 HMMs   (Not curated)
• Each entry consists of description and alignment
ID IL7
AC PF01415
DE Interleukin 7/9 family
AU Ponting CP, Schultz J, Bork P
AL Clustalw
BM hmmbuild HMM SEED
BM hmmcalibrate --seed 0 HMM
DR PROSITE; PDOC00228;
CC IL-7 is a cytokine that acts as a growth factor for early
CC lymphoid cells of both B- and T-cell lineages. IL-9 is a
CC multi-functional cytokine.
IL7_BOVIN/28-172 DISGKDGGAYQNVLMVNIDD-LDNMINFDSNCLNNEPNFFKKHSCDDNKEASFLNRASRK
IL7_HUMAN/28-173 DIEGKDGKQYESVLMVSIDQLLDSMKEIGSNCLNNEFNFFKRHICDANKEGMFLFRAARK
IL7_MOUSE/28-152 HIKDKEGKAYESVLMISIDE-LDKMTGTDSNCPNNEPNFFRKHVCDDTKEAAFLNRAARK.
Secondary databases
Databases available - InterPro
Biotechhuset, model
Biotechhuset, view towards the southwest
Biotechhuset, Annedal
http://www.arexis.com