Transcript workshop2

CANDID:
A candidate gene identification tool
Part 2
Janna Hutz
[email protected]
March 26, 2007
Review
• Literature
– Well-characterized genes
• Protein domains
– All genes
• Cross-species conservation
– All genes
Today’s agenda
• Expression levels
• Linkage data
• Association data
• CANDID performance measures
Candidate lists vs. single candidates
• Candidate lists
– Complex trait or disease
– Disease with known heterogeneity
• Single candidates
– Mendelian trait
– New disease
– Disease with clear, well-defined pathology
Candidate lists vs. single candidates
• Microarray
• SNP typing
• Sequencing
• Immunocytochemistry
• Knockout model
ACT[A/G]GGA
Example 4
• Goiter - thyroid gland problem
• Iodine deficiency
• Genetic causes
Example 4
• Iodine is not supplied
• Iodine is present, but is not added to the
molecule
• Which gene is mutated?
Expression data
• We know what tissue our gene is
expressed in (thyroid).
• How can we use this knowledge to help
identify the candidate?
• Wouldn’t it be nice if we had an
expression database?
Expression databases
• Our ideal expression database would have:
– Expression data for the same genes across many
different tissues
– As many tissues as possible
– As many genes as possible
– Good documentation
• Gene Atlas
Gene Atlas
• Genomics Institute of the Novartis
Research Foundation
• 79 human tissues (160 samples)
• 2 arrays
– Affymetrix HG-U133A
– GNF1H (custom)
• 17,809 genes
Measure of gene expression
• Our thyroid gene:
– Gene that is brightest on the thyroid array?
– Gene that is brightest on the thyroid array,
compared to all the other arrays.
[Figure: arrays for heart, brain, thyroid, and lung]
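A minimal sketch of the second idea (brightness on the thyroid array compared to the other arrays). The scoring rule here, expression in the chosen tissue divided by the gene's highest expression in any tissue, is an assumption for illustration and may differ from what CANDID actually computes. The gene names and values are made up; they are not Gene Atlas data.

```python
# Hypothetical expression values (arbitrary units) for two made-up genes.
# Illustrative numbers only; NOT taken from Gene Atlas.
expression = {
    "GENE_A": {"heart": 120.0, "brain": 95.0, "thyroid": 2400.0, "lung": 110.0},
    "GENE_B": {"heart": 3000.0, "brain": 2800.0, "thyroid": 2500.0, "lung": 2900.0},
}


def relative_expression(levels, tissue):
    """Expression in `tissue` divided by the gene's peak level in any tissue.

    Assumed scoring rule for illustration; not necessarily CANDID's own.
    A gene whose highest expression is in `tissue` scores exactly 1.
    """
    peak = max(levels.values())
    if peak <= 0:
        return 0.0
    return levels[tissue] / peak


scores = {gene: relative_expression(levels, "thyroid")
          for gene, levels in expression.items()}

for gene, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(gene, round(score, 3))
```

GENE_B is the brighter of the two on the thyroid array, but it is bright everywhere. GENE_A is the one that stands out in thyroid relative to the other tissues, so it gets the higher relative score.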
Measures of gene expression
• Run CANDID, specifying that we’re
interested in the thyroid.
http://dsgweb.wustl.edu/llfs/secure_html/hutz/index.html
User name: workshop
Password: perl031907
• (We’ll need a tissue code for that.)
Example 4 - Results
• Our favorite genes:
• TP53 - rank is…
– 16314th
• KRAS - rank is…
– 5229th
• What genes are ranked most highly?
Example 4 - Results
• 192 genes with expression score of 1
• The TOP gene is actually responsible
for the phenotype described earlier
– Its expression score = 1
Prior evidence
• I’m not interested in examining all of the
genes in the genome - just some of them.
• Linkage and association
Linkage
• CANDID can:
– Weight regions with higher LOD scores
– Limit analysis to certain regions
– How does it do this?
Linkage scoring
1732
Linkage score = (gene’s LOD score) / (maximum genome-wide LOD score)
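The ratio on this slide (a gene's LOD score over the maximum genome-wide LOD score) can be sketched in a few lines. The function name is hypothetical, and the handling of negative or zero LOD scores is an assumption; the slides do not say what CANDID does in those cases.

```python
def linkage_score(gene_lod, max_genome_lod):
    """Gene's LOD score as a fraction of the highest LOD score genome-wide.

    Assumptions (not from the slides): negative LOD scores are treated as 0,
    and the score is 0 when no positive LOD score exists anywhere.
    """
    if max_genome_lod <= 0:
        return 0.0
    return max(gene_lod, 0.0) / max_genome_lod


print(linkage_score(1.5, 3.0))  # gene at half the genome-wide peak
print(linkage_score(3.0, 3.0))  # gene sitting under the peak
```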
Linkage files
• How does CANDID get this linkage
information?
• CANDID takes two kinds of files
– Unformatted output from GENEHUNTER
and MERLIN
– Custom linkage files
Custom linkage files
• Simple format
• Line 1 of the file must contain the word
“custom” somewhere
• Subsequent lines:
Chromosome (tab) cM (tab) LOD score
• But how do I get cM positions?
Mapmaker
• Takes an input file in the format:
Chromosome (tab) basepair (tab) LOD score
• Outputs new file in the format:
Chromosome (tab) cM (tab) LOD score
• Will be available on the CANDID
website soon
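The slides do not say how Mapmaker converts basepairs to cM. One plausible approach, shown here purely as an assumption, is linear interpolation between reference markers whose basepair and cM positions are both known. The marker table below is invented for illustration, and the function names are hypothetical.

```python
from bisect import bisect_right

# Hypothetical reference markers for ONE chromosome: (basepair, cM), sorted
# by basepair. Values are invented for illustration only.
MARKERS = [(1_000_000, 0.0), (5_000_000, 6.0), (9_000_000, 10.0)]


def bp_to_cm(bp, markers=MARKERS):
    """Linearly interpolate a cM position from a basepair position."""
    positions = [marker[0] for marker in markers]
    if bp <= positions[0]:
        return markers[0][1]   # clamp before the first marker
    if bp >= positions[-1]:
        return markers[-1][1]  # clamp after the last marker
    i = bisect_right(positions, bp)
    (bp0, cm0), (bp1, cm1) = markers[i - 1], markers[i]
    return cm0 + (cm1 - cm0) * (bp - bp0) / (bp1 - bp0)


def convert_line(line, markers=MARKERS):
    """Turn 'chromosome<TAB>basepair<TAB>LOD' into 'chromosome<TAB>cM<TAB>LOD'.

    This sketch uses a single marker table and does not look up markers
    by chromosome; a real tool would need one table per chromosome.
    """
    chrom, bp, lod = line.rstrip("\n").split("\t")
    return f"{chrom}\t{bp_to_cm(int(bp), markers):.2f}\t{lod}"


print(convert_line("13\t3000000\t2.5"))
```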
Example 5: pancreatic cancer
• Deletion on chromosome 13 between
23.65 cM and 25.08 cM.
Creating a custom linkage file
• Example:
custom
13	23.64	0
13	23.65	3
13	25.08	3
13	25.08	0
[Figure: the region from 23.65 cM to 25.08 cM]
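A small sketch of writing and reading a custom linkage file in the format described earlier (first line contains "custom", then chromosome, cM, and LOD score separated by tabs). The helper names are hypothetical and this is not CANDID's own parser. The rows are the ones from the example on this slide.

```python
import io


def write_custom_linkage(rows, handle):
    """Write rows of (chromosome, cM, LOD) in the custom linkage format."""
    handle.write("custom\n")  # line 1 must contain the word "custom"
    for chrom, cm, lod in rows:
        handle.write(f"{chrom}\t{cm}\t{lod}\n")


def read_custom_linkage(handle):
    """Read a custom linkage file back into (chromosome, cM, LOD) tuples."""
    first = handle.readline()
    if "custom" not in first:
        raise ValueError("line 1 must contain the word 'custom'")
    rows = []
    for line in handle:
        if not line.strip():
            continue  # skip blank lines
        chrom, cm, lod = line.rstrip("\n").split("\t")
        rows.append((chrom, float(cm), float(lod)))
    return rows


# Rows from the Example 5 slide: LOD 3 inside the deleted region, 0 outside.
rows = [("13", 23.64, 0), ("13", 23.65, 3), ("13", 25.08, 3), ("13", 25.08, 0)]

buffer = io.StringIO()  # stands in for a real file on disk
write_custom_linkage(rows, buffer)
buffer.seek(0)
parsed = read_custom_linkage(buffer)
print(parsed)
```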
Running CANDID
1. Try running CANDID using only the linkage criterion.
2. Now, run CANDID with the linkage criterion and literature criterion (your choice of keywords).
• Linkage weight = 1000
• Literature weight = 1
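To see why a linkage weight of 1000 next to a literature weight of 1 behaves the way it does, here is a sketch that ASSUMES the per-criterion scores are combined as a weighted average. The slides do not state CANDID's actual combination rule, and the gene labels and scores below are made up.

```python
def combined_score(scores, weights):
    """Weighted average of per-criterion scores (assumed combination rule)."""
    total_weight = sum(weights.values())
    if total_weight == 0:
        return 0.0
    weighted = sum(weights[c] * scores.get(c, 0.0) for c in weights)
    return weighted / total_weight


weights = {"linkage": 1000, "literature": 1}

# Hypothetical genes with made-up scores between 0 and 1.
genes = {
    "in_region_no_papers": {"linkage": 1.0, "literature": 0.0},
    "in_region_some_papers": {"linkage": 1.0, "literature": 0.4},
    "outside_many_papers": {"linkage": 0.0, "literature": 1.0},
}

ranked = sorted(genes, key=lambda g: combined_score(genes[g], weights), reverse=True)
print(ranked)
```

Under this assumption the large linkage weight puts every gene in the linked region ahead of every gene outside it, and the literature score only decides the order among genes within the region.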
Results
• From OMIM:
“Individuals with mutations in the
BRCA2 gene, which predisposes to
breast and ovarian carcinoma, have an
increased risk of pancreatic cancer;
germline mutations in BRCA2 are the
most common inherited alteration
identified in familial pancreatic cancer.”
But linkage is so last season…
Association
• Increasing numbers of association
studies
• Increasing numbers of SNPs in each
study
• Can CANDID use this information, too?
Association
• Database
– dbSNP - 11.8 million human SNPs
– Includes HapMap SNPs
– Most comprehensive
– Each SNP has an ID number prefixed with “rs”
Association
• How does CANDID accept association
data?
• Custom file format - each line is:
rs# (tab) p-value
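A sketch of reading this format into a lookup table. The function name and the validation checks are assumptions added for illustration, not CANDID's own parser.

```python
def read_association_file(lines):
    """Parse 'rs#<TAB>p-value' lines into a dict of {rs ID: p-value}."""
    pvalues = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        rs_id, p_text = line.split("\t")
        p = float(p_text)
        # Sanity checks (assumed; the slides do not describe error handling).
        if not rs_id.startswith("rs"):
            raise ValueError(f"SNP ID should start with 'rs': {rs_id}")
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"p-value out of range for {rs_id}: {p}")
        pvalues[rs_id] = p
    return pvalues


# Works on any iterable of lines, e.g. an open file or a list of strings.
example = read_association_file(["rs123\t0.03\n", "rs456\t0.5\n"])  # made-up SNPs
print(example)
```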
Association scoring
• For each gene, take the best p-value for
that gene’s SNPs
• Subtract that p-value from 1
• Unless you test SNPs in every gene,
this can be kind of unfair…
Association scoring
• Tested 10 genes
• Gene 9 has a best p-value of 0.8 (bad)
• Gene X was not tested
• Should Gene 9 get a higher overall
score than Gene X?
p-value threshold
• User defines a p-value threshold
• Let’s say it’s 0.1.
• Any SNPs with p-values above 0.1 are
not considered.
• Now Gene 9 and Gene X have the
same score (0).
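The scoring rule and the threshold can be sketched together: take the best (smallest) p-value among a gene's SNPs, ignore SNPs above the threshold, and subtract from 1. The function name, SNP IDs, and SNP-to-gene assignments below are hypothetical.

```python
def association_score(gene_snps, pvalues, threshold=1.0):
    """1 minus the best p-value among the gene's tested SNPs.

    SNPs with p-values above `threshold` are not considered. A gene with
    no tested (or no passing) SNPs scores 0.
    """
    kept = [pvalues[snp] for snp in gene_snps
            if snp in pvalues and pvalues[snp] <= threshold]
    if not kept:
        return 0.0
    return 1.0 - min(kept)


# Made-up data mirroring the slide: Gene 9 was tested (best p = 0.8),
# Gene X was not tested at all.
pvalues = {"rs0000009": 0.8}
gene_9_snps = ["rs0000009"]
gene_x_snps = []

no_threshold = (association_score(gene_9_snps, pvalues),
                association_score(gene_x_snps, pvalues))
with_threshold = (association_score(gene_9_snps, pvalues, threshold=0.1),
                  association_score(gene_x_snps, pvalues, threshold=0.1))
print(no_threshold, with_threshold)
```

Without a threshold, Gene 9 scores about 0.2 and Gene X scores 0, even though Gene 9's result is unimpressive. With the threshold at 0.1, both score 0.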
Example 6
• Age-related Eye Disease Study
• Macular degeneration
Example 6
• Make custom association file
rs3753396	0.0444
rs543879	0.0494
rs7724788	0.75
• Run CANDID with this association file
Results
rs3753396	0.0444	CFH
rs543879	0.0494	CFH
rs7724788	0.75	SLC25A46
So just how well does this work anyway?
Preliminary evidence
• Online Mendelian Inheritance in Man
• 154 diseases linked to chromosome 1
• Literature, domains - chose keywords
• Conservation
• Expression - chose tissue codes
Ideal weights
• Tested all combinations of weights in
those 4 categories
– Possible weights: (0, 0.1, … , 0.9, 1)
• Which weight combination was the best,
across all 154 diseases?
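A sketch of what testing all weight combinations could look like: 11 possible weights in each of 4 categories gives 11^4 = 14,641 combinations. The weighted-sum ranking, the tie handling, and all names below are assumptions for illustration; the slides do not give CANDID's evaluation code.

```python
from itertools import product

CATEGORIES = ("literature", "domains", "conservation", "expression")
STEPS = [i / 10 for i in range(11)]  # 0, 0.1, ..., 0.9, 1


def weight_grid():
    """Yield every combination of weights as a {category: weight} dict."""
    for combo in product(STEPS, repeat=len(CATEGORIES)):
        yield dict(zip(CATEGORIES, combo))


def rank_of(target, gene_scores, weights):
    """1-based rank of `target` by weighted sum (assumed combination rule).

    Ties count against the target, so all-zero weights cannot look good.
    """
    def total(gene):
        return sum(weights[c] * gene_scores[gene][c] for c in CATEGORIES)

    target_total = total(target)
    return 1 + sum(1 for g in gene_scores
                   if g != target and total(g) >= target_total)


def best_weights(diseases):
    """diseases: list of (known_gene, gene_scores). Returns (mean rank, weights)."""
    best = None
    for weights in weight_grid():
        mean_rank = sum(rank_of(t, s, weights) for t, s in diseases) / len(diseases)
        if best is None or mean_rank < best[0]:
            best = (mean_rank, weights)
    return best


# Toy data: one made-up disease, two made-up genes.
toy = [("g1", {
    "g1": {"literature": 1.0, "domains": 0.0, "conservation": 0.0, "expression": 0.0},
    "g2": {"literature": 0.0, "domains": 1.0, "conservation": 0.0, "expression": 0.0},
})]
print(best_weights(toy))
```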
Top 10 weight combinations
1. Literature = 1, everything else = 0
2. Literature = 0.9, everything else = 0
3. Literature = 0.8, everything else = 0
4. Literature = 0.7, everything else = 0
5. …
10. Literature = 0.1, everything else = 0
11. Literature = 1, domains = 0.1
More specifics
• Literature only: average ranking = 425
– Rank 425 out of 38,697 = 98.9th percentile
– 44/154 genes ranked #1 for at least one set of weights
• Chromosome 1: average ranking = 22
– Rank 22 out of 2,280 = 99th percentile
– 84/154 genes ranked #1 for at least one set of weights
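The percentile figures follow from the rank and the total: the percentile is the share of entries ranked below the gene. A quick check of the arithmetic (the function name is hypothetical):

```python
def rank_percentile(rank, total):
    """Percentile for a 1-based rank out of `total` (rank 1 is best)."""
    return 100.0 * (1.0 - rank / total)


print(round(rank_percentile(425, 38697), 1))  # 98.9
print(round(rank_percentile(22, 2280), 1))    # 99.0
```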
Analysis of results
• They make a lot of sense.
• Genes in OMIM are, by definition, well-characterized.
• Many diseases are rare, with particular
names or keywords that would only appear in
papers about the disease genes.
Next steps
• Separate OMIM analysis into simple
and complex traits
– Get new ideal weights
• See how well these ideal weights do in
ranking candidates from chromosome 2.
Next steps
• CANDID’s databases were last
compiled in November 2006.
• Find publications that have come out
since then.
• How well does CANDID do in ranking
those genes?
Next steps
• Many new whole-genome studies and
microarray studies implicate lists of
candidates.
• If CANDID analyzes those phenotypes,
how significant is the overlap of
CANDID’s top genes and those papers’
top genes?
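One standard way to put a number on the overlap between two gene lists is a hypergeometric tail probability: the chance of seeing at least the observed overlap if one list had been drawn at random from all genes. This is offered as a sketch of a common choice, not as the analysis planned for CANDID; the function name and the example numbers are made up.

```python
from math import comb


def overlap_p_value(total_genes, list_a_size, list_b_size, overlap):
    """P(overlap >= observed) when list B is a random draw of genes.

    Hypergeometric upper tail: list A marks `list_a_size` of the
    `total_genes` genes, and list B picks `list_b_size` genes at random.
    """
    upper = min(list_a_size, list_b_size)
    tail = sum(
        comb(list_a_size, k) * comb(total_genes - list_a_size, list_b_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / comb(total_genes, list_b_size)


# Made-up example: 10 genes in total, two top-3 lists sharing all 3 genes.
print(overlap_p_value(10, 3, 3, 3))
```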
Next steps
• Any other suggestions?
• Any interesting data you have?
Any questions?
Acknowledgments
• Mike Province
• Howard McLeod
• Aldi Kraja
• Ingrid Borecki
• Qunyuan Zhang
• Ryan Christensen
• John Martin