Disease Genomics Part 2 - Medical Sciences Division
Functional genomics approaches to
disease genomics
• Biological information and organisation
• Genomics approaches to identifying disease-relevant enrichment
• Candidate gene approaches
Biological information increases rapidly
• Every day, hundreds of articles are published
– We can’t read them all
– We can’t remember them all
– Our memories are subjective anyway
• To make use of this incredible research
output, we need some ways to bring this
information together and summarise it
• If we could make it readable by a computer
then our power to use it increases hugely
OMIM Home Page
http://www.ncbi.nlm.nih.gov/omim/
OMIM
• Online Mendelian Inheritance in Man (OMIM) is a
catalog of human genes and genetic disorders, with
links to literature references, sequence records,
maps, and related databases
• Annotates 325 genes associated with human disease
• 2,710 disorders with a known molecular basis
• 1,634 genetic disorders with an unknown basis
• The OMIM entries are made by experienced
annotators
– Even the best annotators are not wholly consistent
What is Ontology?
1606
1700s
• Dictionary: A branch of metaphysics
concerned with the nature and relations of
being.
• Barry Smith: The science of what is, of the
kinds and structures of objects, properties,
events, processes and relations in every
area of reality.
Slide from the GO website www.geneontology.org
Ontologies
• Formalising our knowledge into a structured
and defined vocabulary is essential for
genomics approaches
• The benefits from an agreed language enable
rapid progress (e.g. Species classification)
• Recently, biological research communities
have been defining a common language for
describing everything from protein function
through to phenotype
From a practical view, ontology is the
representation of something we
know about. “Ontologies” consist of
a representation of things, that are
detectable or directly observable,
and the relationships between those
things.
Slide taken from GO (www.geneontology.org)
Gene Ontology (GO)
• The Gene Ontology project was set up to
provide a controlled vocabulary that describes
a gene and its products (principally its
product)
• GO describes genes in 3 separate ontologies
– Molecular function, biological process and cellular
component (location)
– Genes can be annotated with many terms in each
category
GO
Molecular Function
GO term: Malate dehydrogenase.
GO id: GO:0030060
(S)-malate + NAD(+) = oxaloacetate + NADH.
[Figure: reaction scheme, (S)-malate + NAD+ converted to oxaloacetate + NADH + H+]
Biological Process
GO term: tricarboxylic acid cycle
Synonym: Krebs cycle
Synonym: citric acid cycle
GO id: GO:0006099
Cellular Component
GO term: mitochondrion
GO id: GO:0005739
Definition: A semiautonomous, self-replicating
organelle that occurs in
varying numbers, shapes, and sizes in
the cytoplasm of virtually all eukaryotic
cells. It is notably the site of tissue
respiration.
GO
• Directed Acyclic Graph (DAG)
• Allows a child node to have more than one parent
[Figure: is_a hierarchy under Biological Process, containing Physiological Process, Metabolism, Primary Metabolism, Biosynthesis, Protein Metabolism and Protein Biosynthesis]
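A minimal sketch of why the DAG structure matters in practice: a gene annotated with a specific term is implicitly annotated with every ancestor of that term. The is_a edges below are an assumption, read off the term names on this slide rather than taken from the current GO release, and the function names are hypothetical.

```python
# Child term -> list of parent terms (is_a edges).
# ASSUMPTION: edges reconstructed from the slide's labels, not from GO itself.
IS_A = {
    "physiological process": ["biological process"],
    "metabolism": ["physiological process"],
    "primary metabolism": ["metabolism"],
    "biosynthesis": ["metabolism"],
    "protein metabolism": ["primary metabolism"],
    # A DAG (unlike a tree) lets one child have two parents:
    "protein biosynthesis": ["protein metabolism", "biosynthesis"],
}


def ancestors(term, is_a=IS_A):
    """Return every term reachable from `term` by following is_a edges."""
    seen = set()
    stack = list(is_a.get(term, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(is_a.get(parent, []))
    return seen


def propagate(gene_to_terms, is_a=IS_A):
    """Annotate each gene with its direct terms plus all of their ancestors."""
    full = {}
    for gene, terms in gene_to_terms.items():
        expanded = set(terms)
        for term in terms:
            expanded |= ancestors(term, is_a)
        full[gene] = expanded
    return full
```

Propagating annotations this way before testing for enrichment means a broad term such as "metabolism" counts every gene annotated to any of its descendants.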
Mammalian Phenotype Ontology
• Really the mouse phenotype ontology
• Annotators take each published mouse gene
knock-out experiment and annotate the
phenotype with the MPO
Human Medical Ontologies
• Human Phenotype Ontology
www.human-phenotype-ontology.org
• The HPO provides a standardized vocabulary of phenotypic
abnormalities encountered in human genetic syndromes
[Figure: example HPO hierarchy: Organ abnormality > Cardiovascular abnormality > Cardiac abnormality > Cardiac malformation > Abn. of the cardiac atria, Abn. of the cardiac septa]
• London Dysmorphology Database
[Figure: example terms: Cranium, general abnormalities > Brachycephaly, Microcephaly; Neurology > Mental cognitive function > Intellectual disability]
Model Organisms
• Excellent functional genomics resources
– The comparison between a human phenotype and a mouse
phenotype is often very readily interpretable.
– Other useful organisms include the fly, the worm and even
yeast
• Useful as they have well-curated data for many genes
Kyoto Encyclopaedia of Genes and Genomes
(KEGG)
• Pathway database
• Manually curated information from the literature
High-throughput functional resources
• Tissue-expression
– Where and when genes are expressed may be
relevant to the disease
• Interactions
– genes that interact may be involved in the same
biological process
– E.g. protein-protein interactions or genetic
interactions (coordinated regulation)
• Sequence patterns (coding or regulatory)
– Similar sequences can imply common functionality
Different data sources have
different types of error
• Literature sources (GO, model organism data,
etc) have poor coverage and a lack of true
negatives
– We publish “A is an X” more than “A is not a Y”
– Not all genes have been subject to the same studies
• High-throughput sources often have high error
rates
– False-positives are particularly a problem for
gene/protein interactions when you’re considering
all pairs
The value of mouse phenotypic data
[Figure: ability of mouse phenotypic data to predict Human Phenotype Ontology terms]
Forming interesting gene sets
• If you can’t identify a single gene/locus, maybe you
can form a subset of genes likely to contain the gene(s)
of interest
– Genes in large intervals identified by linkage studies
– Genes near SNPs with low, but not genome-wide
significant, p-values from GWAS
– Genes in de novo or rare CNVs seen in cases
• Power is important
– Bringing together many similar cases enriches for disease
genes associated with that disease
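A minimal sketch of forming such a gene set: collect every gene that overlaps any interval of interest (a linkage region, a window around a SNP, or a CNV). The gene names and coordinates are hypothetical, and coordinates are assumed to be 1-based and inclusive.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    chrom: str
    start: int  # assumed 1-based, inclusive
    end: int    # inclusive


def overlaps(a, b):
    """True if two intervals share at least one base on the same chromosome."""
    return a.chrom == b.chrom and a.start <= b.end and b.start <= a.end


def genes_in_intervals(genes, intervals):
    """genes: dict of gene name -> Interval. Returns the names overlapping any interval."""
    return {
        name
        for name, location in genes.items()
        if any(overlaps(location, interval) for interval in intervals)
    }


# Hypothetical annotation and a single hypothetical case CNV.
genes = {
    "GENE_A": Interval("chr1", 1_000, 5_000),
    "GENE_B": Interval("chr1", 90_000, 120_000),
    "GENE_C": Interval("chr2", 1_000, 5_000),
}
case_cnvs = [Interval("chr1", 100_000, 400_000)]
candidate_set = genes_in_intervals(genes, case_cnvs)  # only GENE_B overlaps
```

Pooling the intervals from many similar cases simply means passing a longer list of intervals.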
Testing for enrichments
• Compare to the genome
– Pulling balls (genes) from a bag (genome) is
sampling without replacement, hypergeometric
distribution
• Compare to controls
– If chosen well, may account for biases
– Contingency tables, chi-squared (χ²) tests
– If controls are unavailable, you can randomise to
help address potential biases like gene length and
function
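A minimal sketch of both comparisons, using only the standard library. The function names are hypothetical. The chi-squared test is the plain Pearson form for a 2x2 table without continuity correction, which is an assumption since the slides do not specify a variant. Randomisation is not shown.

```python
from math import comb, erfc, sqrt


def hypergeom_upper_tail(k, N, K, n):
    """P(X >= k) when n genes are drawn without replacement from a genome
    of N genes, K of which carry the annotation of interest."""
    total = comb(N, n)
    upper = min(K, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / total


def percent_change_over_expected(k, N, K, n):
    """Observed count k relative to the expected count n * K / N, as a percentage."""
    expected = n * K / N
    return 100.0 * (k - expected) / expected


def chi2_2x2(a, b, c, d):
    """Pearson chi-squared for the table [[a, b], [c, d]].
    Returns (statistic, p-value) with 1 degree of freedom."""
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    statistic = n * (a * d - b * c) ** 2 / denominator
    # For 1 degree of freedom the chi-squared survival function is erfc(sqrt(x / 2)).
    return statistic, erfc(sqrt(statistic / 2))
```

For the genome comparison, `k` is the number of annotated genes in the gene set, `n` the size of the set, `K` the number of annotated genes in the genome and `N` the genome size. For the control comparison, the table rows would be cases and controls and the columns annotated and unannotated genes.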
Rare de novo copy number variant (CNV)
associated with learning disability
2.8 Mb
How does this CNV
relate to the etiology
of the disease?
Which gene(s)
underlie the
phenotype?
Rare de novo CNVs are frequent in
learning disability
• Rare de novo CNVs > 100kb
present in ~10% of LD cases
• Occur all over genome
• 80% unique, non-recurrent
Collect a list of 148
rare de novo CNVs
CNVs are common in all people
• Apparently benign, mostly inherited CNVs
occur all over genome
Collect a list of
26,472 benign CNVs
Redon et al. Nature 2006
Mutations at different loci can give a similar
phenotype
SYMPTOM/PHENOTYPE
Method
[Figure: workflow. Interesting intervals in patients → Human Genes → ORTHOLOGY → Mouse Genes → Available Mouse KO phenotypes → Significantly overrepresented phenotype → Mouse models relevant to the human disorder (Disease phenotype)]
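The mapping step of this workflow can be sketched as two dictionary look-ups followed by a count. All gene and phenotype names below are hypothetical, and a simple one-to-one human-to-mouse orthologue table is assumed; the slides do not say how orthology was assigned.

```python
from collections import Counter


def phenotype_counts(human_genes, orthologs, ko_phenotypes):
    """Count, per mouse phenotype, how many of `human_genes` have a mouse
    orthologue whose knock-out is annotated with that phenotype."""
    counts = Counter()
    for gene in human_genes:
        mouse_gene = orthologs.get(gene)
        if mouse_gene is None:
            continue  # no orthologue, so no phenotype information
        for phenotype in ko_phenotypes.get(mouse_gene, ()):
            counts[phenotype] += 1
    return counts


# Hypothetical inputs.
orthologs = {"HUMAN_1": "Mouse_1", "HUMAN_2": "Mouse_2", "HUMAN_3": "Mouse_3"}
ko_phenotypes = {
    "Mouse_1": {"abnormal axon morphology", "tremors"},
    "Mouse_2": {"abnormal axon morphology"},
}
genes_in_patient_intervals = ["HUMAN_1", "HUMAN_2", "HUMAN_3", "HUMAN_4"]
observed = phenotype_counts(genes_in_patient_intervals, orthologs, ko_phenotypes)
```

Each observed count can then be compared with the count expected from the genome or from control intervals to find overrepresented phenotypes.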
Significant enrichments of genes associated with
particular mouse phenotypes within de novo CNVs
identified in patients with Intellectual disability
[Figure: bar charts of % change over expected (* FDR < 5%) for the mouse phenotypes Nervous System category, Abnormal axon morphology and Abnormal dopaminergic neuron morphology, comparing Benign CNVs, All LD CNVs and Loss LD CNVs, with and without benign CNVs]
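Many phenotype terms are tested at once, which is why significance is reported at a false discovery rate (FDR) threshold. The slides do not say which FDR procedure was used. The sketch below assumes the Benjamini-Hochberg procedure, a common choice, and the function name is hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each adjusted value is
    # no larger than those of the p-values ranked above it.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


def significant_at(pvalues, fdr=0.05):
    """Indices of the tests whose adjusted p-value is below the FDR threshold."""
    return [i for i, q in enumerate(benjamini_hochberg(pvalues)) if q < fdr]
```

With one p-value per phenotype term, `significant_at(pvalues, fdr=0.05)` would select the terms that earn an asterisk under this assumption.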
Human brain-specific genes corroborate mouse
findings
[Figure: bar chart of % change over expected (* significant) for Brain-specific Genes in Benign CNVs, All LD CNVs, All LD CNVs minus benign CNVs, Loss LD CNVs and Loss LD CNVs minus benign CNVs]
“Brain-specific” genes are defined as those whose expression in human
whole brain is > 4 x median expression across all other tissues.
This definition classifies ~3.75% of human genes as “brain-specific”.
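The definition above translates directly into a filter. This is a minimal sketch: the tissue names and expression values are hypothetical, and the function name is an assumption.

```python
from statistics import median


def is_tissue_specific(expression, tissue="whole brain", fold=4.0):
    """True if expression in `tissue` exceeds `fold` times the median
    expression across all other tissues.

    expression: dict of tissue name -> expression level for one gene.
    """
    others = [level for name, level in expression.items() if name != tissue]
    return expression[tissue] > fold * median(others)


# Hypothetical expression profiles for two genes.
gene_x = {"whole brain": 50.0, "liver": 10.0, "heart": 12.0, "kidney": 8.0}
gene_y = {"whole brain": 30.0, "liver": 10.0, "heart": 12.0, "kidney": 8.0}

brain_specific = [
    name
    for name, profile in {"gene_x": gene_x, "gene_y": gene_y}.items()
    if is_tissue_specific(profile)
]
```

Here the median of the other tissues is 10 for both genes, so only `gene_x` (50 > 40) passes the four-fold threshold.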
Autism Spectrum Disorders – the ‘triad’ of symptoms
• Impaired social interaction
• Restrictive, repetitive behaviours and interests
• Impaired communication
Source: Autism.org.uk
Behavioural model phenotypes associated with Autism
Spectrum Disorder (ASD) de novo CNVs
“Difficulty processing and retaining verbal information”
“Difficulty understanding social language”
“Difficulty coping with changes in routine”
Behavioural model phenotypes associated with Autism
Spectrum Disorder (ASD) de novo CNVs
“Difficulty understanding social language”
“Difficulty with empathy and friendships”
Behavioural model phenotypes associated with ASD de
novo CNVs
“Restricted and Repetitive Behaviours and Interests”
60-80% of individuals with ASD exhibit poor motor planning and coordination
Candidate genes
• The genes that constitute significant enrichments
become candidate disease genes
• While the enrichment is significantly associated with
the intervals, the individual genes are not, and each
requires further proof individually
• Experimental follow-up is costly and thus the genes
taken forward need to be considered carefully
Annotations vary in coverage and specificity
[Figure: bar charts comparing annotation sources (GO Transcription, Brain-specific, KEGG Neuro, KEGG Parkinson’s, Mouse phenotypes Abnormal Axon/Neuron) by % change over expected, number of candidate genes, and % of CNVs with a candidate gene]
The better the patients are classified the more
power we have to identify enrichments
6 of 148 LD patients have a cleft palate
[Figure: enrichment (% change over expected, * significant) for the KO phenotype cleft palate in Benign CNVs, LD CNVs in the 6 patients with cleft palate, and LD CNVs in the 142 without cleft palate]
[Figure: % change over expected for the Tremor phenotype in patients +/- seizures, and for the Abnormal myelination phenotype in patients +/- brain abnormality]
Some associations found for the main cohort may be more
relevant to associated, or co-occurring, symptoms – ASD
Mutation databases are a rich source of discovery:
DECIPHER
• DECIPHER is a database that holds genetic information
about patients who present with congenital abnormalities
[Figure: Proband 1, Proband 2 and Proband 3 share a very similar phenotype, pointing to a single gene]
DECIPHER patients are annotated with
London Medical Database terms
Level 1 > Level 2 > Level 3
• Cranium > general abnormalities > Brachycephaly, Microcephaly
• Neurology > Mental cognitive function > Intellectual disability
Formed groups of CNVs associated with each
human phenotype
[Figure: CNV groups for Cranium, General abnormalities; Brachycephaly; and Microcephaly. ENSEMBL genes are assigned to the CNVs in each group, then copy number variable genes observed in healthy individuals are removed. CNV counts shown: 7, 11, 18, 114, 121, 132. Gene counts shown: 692, 3320, 3036, 633, 3030, 2767.]
Many enrichments are readily interpretable
[Figure: four panels of % Enrichment for All, Gain and Loss CNVs (* Statistically Significant, FDR < 0.05). Human symptoms: Short Stature, Prenatal Onset; Malocclusion; Syndactyly of toes; Cupid bow shape of mouth. Mouse phenotypes: Decreased Fetal Size; Abnormal Palate Development; Syndactyly; Malocclusion]
Others identify less obvious relationships
[Figure: two panels of % Enrichment for All, Gain and Loss CNVs (* Statistically Significant, FDR < 0.05). Human symptoms: Psychotic Behaviour; Complex Partial Seizures. Mouse phenotypes: Abnormal pre-pulse inhibition; Abnormal circadian rhythm]
Mutations can be dissected to identify the
contributions of individual genes
Patient id: 248772
– ATG7, OXTR, ATP2B2: Intellectual disability/developmental delay candidate genes
– FANCD2: Short stature, prenatal onset candidate gene
Patient id: 785
– SNX2: Mental retardation/developmental delay candidate gene
– FBN2: Camptodactyly candidate gene
Gene set enrichment analysis
Aravind Subramanian et al, 2005
• Start with some list of ranked genes
– Genes ranked by expression cases vs controls (Microarrays)
– Genes ranked by nearby SNP p-values
• Score genes + or – according to some property
• Ask: are genes with this property more concentrated towards
the top of this list than I would expect by chance?
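A simplified sketch of the idea, not the published implementation: walk down the ranked list, step up at each gene with the property and down otherwise, and take the largest deviation from zero as the enrichment score. Steps here are unweighted, which is an assumption; the published method weights hits by the ranking metric. Significance is estimated by shuffling the ranking, and the function names are hypothetical.

```python
import random


def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum score: positive when the set sits near the top."""
    hits = [gene in gene_set for gene in ranked_genes]
    n_hit = sum(hits)
    n_miss = len(hits) - n_hit
    if n_hit == 0 or n_miss == 0:
        return 0.0
    running, best = 0.0, 0.0
    for is_hit in hits:
        running += 1.0 / n_hit if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


def permutation_pvalue(ranked_genes, gene_set, n_perm=1000, seed=0):
    """Fraction of shuffled rankings scoring at least as high as observed."""
    rng = random.Random(seed)
    observed = enrichment_score(ranked_genes, gene_set)
    shuffled = list(ranked_genes)
    at_least = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if enrichment_score(shuffled, gene_set) >= observed:
            at_least += 1
    # Add one to numerator and denominator so the estimate is never exactly zero.
    return (at_least + 1) / (n_perm + 1)
```

A gene set concentrated at the top of the list gives a score near +1, one concentrated at the bottom a score near -1, and a set scattered evenly a score near 0.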
Gene Prioritisation for disease
• Given a list of genes, which are most likely to be
involved in this disease?
• We just want a ranking, not a significant association
• Commonly employed approaches involve
supervised learning methodologies
– Collect data points from one or more sources
– Take a “Gold Standard” set of genes for this disease
– Train a method using known true positives (and true
negatives if known)
– Given a list of genes, which ones “look” most similar
to the known disease genes?
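One very simple way to ask which genes "look" most similar to the known disease genes is to average the feature vectors of the gold-standard genes and rank candidates by cosine similarity to that average. This is an illustrative stand-in for a trained supervised method, not a method taken from the slides. The feature values and gene names are hypothetical.

```python
from math import sqrt


def centroid(vectors):
    """Element-wise mean of equal-length feature vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def prioritise(candidates, features, gold_standard):
    """Rank candidates by similarity to the average gold-standard gene."""
    profile = centroid([features[gene] for gene in gold_standard])
    return sorted(candidates, key=lambda gene: cosine(features[gene], profile), reverse=True)


# Hypothetical features, e.g. brain expression, interaction score, liver expression.
features = {
    "KNOWN_1": [1.0, 1.0, 0.0],
    "KNOWN_2": [1.0, 0.8, 0.0],
    "CAND_A": [0.9, 1.0, 0.1],
    "CAND_B": [0.0, 0.1, 1.0],
}
ranking = prioritise(["CAND_B", "CAND_A"], features, ["KNOWN_1", "KNOWN_2"])
```

The output is only a ranking, in line with the slide: no claim of significant association is made for any individual gene.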
Linkage networks can infer missing values
– “guilt by association”
From PubMed ID: 19728866
Linkage network for human disorders using the Human
Phenotype ontology (PMID 18950739)
Conserved co-expression of disease genes
(Ala et al. ,PLoS Genetics 2008)
• 850 OMIM entries where a phenotype was mapped to
a locus but the specific genes are unknown
• Used conserved human-mouse co-expression data as
other interaction or pathway data can bias towards
studied genes
• Generated single species gene co-expression networks
– Calculated Pearson’s correlation coefficient between the
expression profiles of all pairs of genes. Formed a network edge if
two genes’ expression correlation was in the top 1% for either gene.
• Clustered OMIM phenotypes using MimMiner
– A text-mining tool
Using this methodology, they were able to predict 321 candidates across
81 disease-associated loci at an FDR of <10%
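The network-building rule described above can be sketched for a single species as follows. This is a toy illustration, not the authors' code: the expression values are hypothetical, and at least one partner is kept per gene (an assumption) so that the rule still works on tiny examples.

```python
from math import ceil, sqrt


def pearson(x, y):
    """Pearson correlation coefficient; 0.0 if either profile is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0


def coexpression_edges(expression, top_fraction=0.01):
    """Connect each gene to its most correlated partners.

    An edge exists if the pair's correlation is in the top `top_fraction`
    of correlations for either of the two genes.
    """
    genes = list(expression)
    edges = set()
    for gene in genes:
        partners = sorted(
            (other for other in genes if other != gene),
            key=lambda other: pearson(expression[gene], expression[other]),
            reverse=True,
        )
        # ASSUMPTION: always keep at least one partner per gene.
        keep = max(1, ceil(top_fraction * len(partners)))
        for other in partners[:keep]:
            edges.add(frozenset((gene, other)))
    return edges


# Hypothetical expression profiles across four samples.
expression = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.1],
    "g3": [4.0, 3.0, 2.0, 1.0],
    "g4": [1.0, 3.0, 2.0, 4.0],
}
network = coexpression_edges(expression)
```

Because edges are stored as unordered pairs, a link is created when either gene ranks the other among its top partners, matching the "either gene" wording of the rule.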
Human phenome-interactome network for predicting
disease candidate genes
(Lage et al., Nature Biotech. 2007)
• 2 data networks
– Phenotypic similarity, consisting of detecting words that
are common to two phenotype descriptions and do not
occur frequently among all phenotype descriptions.
– Human interactome, consisting of several large human sets
and sets transferred from model organisms, weighted
according to observation frequency.
(1) A given positional candidate is queried for high-scoring interaction partners
(“virtual pull-down”). Together with these partners, the candidate forms the candidate complex.
(2) Proteins known to be involved in disease are identified in the candidate
complex, and pairwise scores of the phenotypic overlap between the diseases of
these proteins and the candidate phenotype are assigned.
(3) Based on the phenotypes represented in the candidate complex, a Bayesian
predictor awards a probability to the candidate in the complex. The score is
used to form the ranking.
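The phenotypic similarity network rests on words that two descriptions share but that are rare across all descriptions. A minimal sketch of that idea, assuming inverse-document-frequency weights and a weighted overlap score; this is not the scoring scheme of Lage et al., and the descriptions are hypothetical.

```python
from math import log


def tokenize(text):
    """Lower-cased set of words with surrounding punctuation stripped."""
    return {word.strip(".,;:()").lower() for word in text.split()} - {""}


def idf_weights(descriptions):
    """Weight each word by log(N / number of descriptions containing it),
    so a word present in every description gets weight 0."""
    n = len(descriptions)
    doc_freq = {}
    for text in descriptions.values():
        for word in tokenize(text):
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return {word: log(n / freq) for word, freq in doc_freq.items()}


def phenotype_similarity(a, b, descriptions, weights):
    """Weighted overlap: weight of shared words over weight of all words used."""
    words_a, words_b = tokenize(descriptions[a]), tokenize(descriptions[b])
    shared = sum(weights[w] for w in words_a & words_b)
    total = sum(weights[w] for w in words_a | words_b)
    return shared / total if total else 0.0


# Hypothetical phenotype descriptions.
descriptions = {
    "d1": "Seizures and intellectual disability",
    "d2": "Intellectual disability and microcephaly",
    "d3": "Cleft palate and short stature",
}
weights = idf_weights(descriptions)
```

In this toy set, "and" appears in every description and therefore contributes nothing, while d1 and d2 score above zero because they share the rarer words "intellectual" and "disability".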