Fast identification and statistical evaluation of segmental

Genomics of Water Use Efficiency
Advisory Committee Meeting
Nov 2003
• Comparative mapping
– FISH software and related computational methods
– Application to tomato fine-mapping
• QTL mapping
– experimental design and analysis methodology
• QTL data management
– web application
Comparative mapping:
computational aspects
• Software needed for two tasks
– Identification of homologous chromosomal
segments given two marker maps and
information about homology among
markers (FISH)
– Prediction of gene content within
homologous segments (ongoing work)
Homology matrix for Arabidopsis
We need to allow for
• non-colinearity in marker order
• the presence of ‘singleton’ markers
Going beyond eyeballing
• LineUp – Hampson et al (2003)
– Designed for genetic maps with error
• ADHoRe – Vandepoele et al (2002)
– Designed for unambiguous marker order data
• Both perform automatic detection of blocks
• For statistics, both employ permutation tests
– Computationally intensive
– p-values are approximate
– What is the null model?
Two contributions
• Local genome alignment
– Dynamic programming approach: fast, with a guarantee of optimality
– Can be generalized to multiple alignments
• Statistics
– An explicit null model for marker homology
– Analytic p-values (i.e. no permutation testing)
• Contributors
– Sugata Chakravarty (Masters, UNC Operations Research)
– Peter Calabrese (collaborator, USC)
From homology matrix to graph
• nodes
– represent dots in the homology matrix
• edges
– connect nodes with nearest neighbors
– are unidirectional
– have an associated distance
– must be shorter than some threshold
• paths
– traverse shortest available edges
– can be efficiently computed
– can be considered candidate blocks
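The graph construction above can be illustrated with a small Python sketch. It is a simplified stand-in and not the FISH implementation: the Manhattan distance, the function name `candidate_blocks`, and the rule that a block needs at least two dots are all assumptions made for this example.

```python
def candidate_blocks(dots, threshold):
    """Chain homology-matrix dots into candidate blocks.

    dots: iterable of (x, y) marker positions in the two maps.
    threshold: maximum allowed edge length (Manhattan distance, an assumption).
    """
    nodes = sorted(set(dots))          # nodes = dots, ordered along map 1
    pred = {}                          # index of the chosen predecessor (or None)
    for i, (x, y) in enumerate(nodes):
        best = None
        for j in range(i):
            px, py = nodes[j]
            if px >= x:                # edges are unidirectional along map 1
                continue
            d = (x - px) + abs(y - py)  # abs() tolerates non-colinear order
            if d <= threshold and (best is None or d < best[0]):
                best = (d, j)          # keep the shortest available edge
        pred[i] = None if best is None else best[1]

    used = {p for p in pred.values() if p is not None}
    blocks = []
    for i in range(len(nodes)):
        if i in used:
            continue                   # not the end of a path
        path, k = [], i
        while k is not None:           # walk back along predecessor links
            path.append(nodes[k])
            k = pred[k]
        if len(path) >= 2:             # singletons are not blocks
            blocks.append(path[::-1])
    return blocks
```

With `threshold=4`, the dots `(1,1), (2,2), (3,3)` form one path, while isolated dots such as `(10,40)` are left out as singletons. Paths that branch from a shared node are reported separately in this sketch.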
Block statistics
• An explicit null model
– Within a genome: homologies are due to the duplication of
a feature followed by its insertion into a random position
– Between genomes: homologies are due to the above
process plus the transposition of features between randomly
chosen positions.
• Number of blocks of a given size is approximately
Poisson
• We can calculate
– The expected number of blocks of a given size
– A conservative matrix-wide p-value
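If the number of blocks of a given size is treated as Poisson, the chance of seeing at least one such block under the null model follows directly from the expected count. The Python sketch below shows only that step; the function name is hypothetical, and the conservative matrix-wide bound used by FISH is not reproduced here.

```python
import math

def prob_at_least_one_block(expected_count):
    """P(X >= 1) for X ~ Poisson(expected_count).

    Uses expm1 so that very small expected counts keep their precision.
    """
    if expected_count < 0:
        raise ValueError("expected_count must be non-negative")
    return -math.expm1(-expected_count)

# Example: an expected count of 0.12 blocks under the null model
p = prob_at_least_one_block(0.12)
```

For small expected counts the probability is close to the expected count itself, which is why rare block sizes yield small p-values.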
How often are blocks of size k observed under the null model compared with expectation (in simulated data)?

k    observed   std. error   upper bound   lower bound
2    45.8       0.06         47.6          40.1
3    2.28       0.02         2.39          1.78
4    0.113      0.003        0.120         0.079
5    0.006      0.001        0.006         0.004
6    0.0003     0.0002       0.0003        0.0002
FISH v.1.0
• Released in July 2003:
– http://www.bio.unc.edu/faculty/vision/lab/FISH
– source code
– compiled executables
– documentation
– sample data
• Publication
– Calabrese PP, Chakravarty S, Vision TJ (2003) Fast identification and statistical evaluation of segmental homologies in comparative maps. Bioinformatics 19, i74-i80
Bancroft (2001) TIG 17, 89 after Ku et al (2000) PNAS 97, 9121
Prediction of gene content
• Explicitly model gene loss among
homologous segments
• Perform multiple rather than pairwise
alignment
• To provide
– Markers for fine-mapping
– Candidate genes
Phytome (http://www.phytome.org)
• Funded independently through PGRP
• A web interface to a relational database for plant comparative genomics
– Integrating organismal phylogenies, genetic maps and gene phylogenies
– Inclusive of major model plant species
• Functionality
– Explore relationships among genes/proteins and chromosome segments within and between species
– Predict gene content in uncharacterized chromosomal regions
• Current status
– One can search for, retrieve, visualize and manipulate protein sequences, gene families, multiple alignments and phylogenetic trees for nine species
– Will be made live during 2004
• Ongoing work to integrate “phylocartographic” data and tools
– Curation
– Analysis
– Visualization
[Figure: Phytome data flow]
• Steps (tool)
– Protein sequence prediction (ESTWise)
– Homolog identification (BLAST)
– Protein family clustering (TRIBE-MCL)
– Multiple sequence alignment (CLUSTALW)
– Phylogenetic inference (PHYLIP)
• Data
– Unigene collections, GenBank IDs, descriptions, GO terms
– Protein sequences, protein families, multiple alignments, phylogenetic trees
Comparative mapping in aid of
marker development: application
• Complementary to marker development
strategy at OK State
• Proposed work (within coming year)
– Combine computational predictions and
experimental validation to design PCR-based
markers in tomato based on known genes in
homologous segments of Arabidopsis
– To be used for fine mapping of QTLs in pennellii (and possibly hirsutum).
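Comparative map of IL5-4
[Figure: tomato markers aligned with Arabidopsis genes]
• Tomato markers: TG23, TG351, TG60, CHS3, T1584, CT145, T0633, TG597, TG238, CT130, TG69
• Arabidopsis genes: At1g45160, At1g48520, At2g38050, At1g45474, Atg308720, At4g23650, At1g48490, At2g37840, At3g08940, At4g23710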
Strategy
• select Arabidopsis genes in putative regions of synteny
• BLAST Arabidopsis genes against tomato EST database
– stop if: no match
• map best match tomato EST in a subset of the IL population
– stop if: maps elsewhere
• design primers to amplify tomato locus from both parents
– stop if: primers fail
• sequence products from both parents to detect polymorphisms
– stop if: no polymorphism
• convert to CAPS or dCAPS markers
QTL Data Converter Tool
• A utility that converts QTL data files to and from the various software formats
• Currently, the utility can do the following:
– Convert comma-delimited (CSV) genotype, phenotype and map data files to QTL Cartographer (cross.inp and map.inp) and Qgene (filename.cro and filename.map) input files
– Error-check the input data files
– Transpose data file rows and columns, if desired
– Tag special data with prefixes, for use in Qgene
– Summarize data file characteristics
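As an illustration of the error-checking and transposition steps, here is a minimal Python sketch. It is not the converter tool's code: the function name `transpose_csv` and the rule that ragged rows are rejected are assumptions for this example, and the QTL Cartographer and Qgene output formats are not reproduced.

```python
import csv
import io

def transpose_csv(text):
    """Swap rows and columns of a comma-delimited data file.

    Raises ValueError if the rows do not all have the same number of
    fields (a simple error check, assumed here for illustration).
    """
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return ""
    width = len(rows[0])
    for number, row in enumerate(rows, start=1):
        if len(row) != width:
            raise ValueError(f"row {number} has {len(row)} fields, expected {width}")
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(zip(*rows))
    return out.getvalue()

# Individuals in rows become individuals in columns
example = "id,m1,m2\nA,1,0\nB,0,1\n"
transposed = transpose_csv(example)
```

Checking the row widths before transposing matters because `zip` would otherwise silently truncate the longer rows.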
Future plans
• Optimize XML code
• Add additional software formats
– MapMaker
– MapPop
– others as needed (JoinMap, MultiQTL, etc.)
• Release in mid-2004
• Advertise availability
– Published note
– Mailing list announcements
QTL mapping methodology
• Problem
– QTL analysis in mapping populations where individuals have
been selected to optimize marker map resolution.
• Work to date
– Effect of selective sampling on crossover distributions
– Effect of selective sampling on bias, power, and resolution in
QTL mapping
• Change of plans from proposal
– QTL mapping software tailored to selected samples is not
necessary
• Manuscript in preparation for Genetical Research
Bins and map resolution
[Figure: bins in the full population, a random sample and an optimized sample]
Selective mapping
• base population
– Genotype framework markers (1/20 cM)
– Use MapPop to select optimized sample
• selected sample
– Genotype additional markers (>1/cM)
– Use MapPop to locate markers with bin mapping
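The quantity being optimized during sample selection is the size of the map bins, i.e. the intervals between neighboring crossover breakpoints seen in the sample. The Python sketch below computes the maximum and expected bin length for one chromosome. It is not MapPop code: the function names are hypothetical, and the expected bin length is assumed here to be the sum of squared bin lengths divided by the map length (the bin length seen by a marker placed uniformly at random).

```python
def bin_lengths(breakpoints_per_individual, map_length):
    """Lengths of the bins defined by all breakpoints in the sample (cM)."""
    cuts = {0.0, float(map_length)}
    for breakpoints in breakpoints_per_individual:
        # Only breakpoints strictly inside the map create new bin boundaries
        cuts.update(float(b) for b in breakpoints if 0 < b < map_length)
    ordered = sorted(cuts)
    return [right - left for left, right in zip(ordered, ordered[1:])]

def max_and_expected_bin_length(breakpoints_per_individual, map_length):
    """Return (maximum bin length, expected bin length) in cM."""
    lengths = bin_lengths(breakpoints_per_individual, map_length)
    total = sum(lengths)
    expected = sum(length * length for length in lengths) / total  # assumed definition
    return max(lengths), expected

# Three individuals on a 100 cM chromosome; the third has no crossovers
sample = [[10, 40], [40, 70], []]
mbl, ebl = max_and_expected_bin_length(sample, 100)
```

Adding an individual with breakpoints in new positions splits existing bins, so both quantities can only stay the same or shrink as the sample grows.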
Experimental design parameters
• Population type (F2, RI, DH, etc.)
• Base population size
• Selected sample size
– Sample fraction (f)
• Framework marker density
Maize RI population (184 markers, 4140 cM)

Bin length (cM)   Whole (N=976)   Optimized (N=90)   Random (N=90)
Maximum           1.8             7.5                12.7
Expected          0.3             1.7                2.6
Advantage of optimizing expected (versus maximum) bin size
[Figure: bin size (cM, 0 to 18) versus marker spacing (cM, 0 to 35); series: MBL perceived, MBL actual, EBL perceived, EBL actual]
Recombination enrichment and pseudo-interference
[Figure: random versus selected sample]
Recombination enrichment
[Figure: two panels plotting RE against sample fraction (0.1 to 0.9)]
• Fixed marker spacing = 10 cM; series for map lengths of 100, 500, 1000, 2500 and 5000 cM
• Fixed map length = 1000 cM; series for marker spacings of 2, 5, 10 and 20 cM
RE = # of crossovers in selected sample / # of crossovers in random sample
Predicting recombination enrichment
Empirical formula:
\[ RE \approx 1 + \left(\frac{A}{L}\right)^{0.5} (1 - f) \]
L = map length in cM
f = sample fraction

pop   A      R2
RI    500    0.965
BRI   750    0.976
DH    1200   0.983
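A short Python sketch of how such a prediction might be evaluated for a given design. The arrangement of the formula is an assumption: it is read here as RE ≈ 1 + (A/L)^0.5 (1 − f), with A the population-specific constant from the table, L the map length in cM and f the sample fraction. The function name is hypothetical.

```python
def predicted_re(a, map_length_cm, sample_fraction):
    """Predicted recombination enrichment for a selected sample.

    ASSUMPTION: the empirical formula is taken to be
    RE = 1 + (A / L) ** 0.5 * (1 - f).
    """
    if map_length_cm <= 0:
        raise ValueError("map length must be positive")
    if not 0 < sample_fraction <= 1:
        raise ValueError("sample fraction must be in (0, 1]")
    return 1.0 + (a / map_length_cm) ** 0.5 * (1.0 - sample_fraction)

# RI population (A = 500), 1000 cM map, 20% of the base population selected
re_ri = predicted_re(500, 1000, 0.2)
```

Under this reading, RE falls to 1 when the whole population is used (f = 1) and grows as the map gets shorter or the sample gets smaller.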
Pseudo-interference and map functions
• A map function translates between the map distance (in cM) and the expected frequency of recombination between two points
• Haldane map function: no interference
\[ r_H = \frac{1 - e^{-2|m|}}{2} \]
• Karlin map function: allows variable interference
\[ r_K = \frac{1}{2}\left[1 - \left(1 - \frac{2m}{N}\right)^{N}\right] \]
• When N > 5, $r_K \approx r_H$
Pseudointerference is very minor
[Figure: four panels (L = 100, 500, 1000 and 2500 cM) plotting N (0 to 50) against sample fraction (0 to 1); series for marker spacings of 5, 10 and 20 cM]
Significance of findings
• Since
– We can predict RE very accurately and easily from
the experimental design
– Pseudointerference is minimal for realistic values
of RE
• We can use standard QTL mapping methods
for selected samples once we have multiplied
map distances by the RE factor.
• No need for specialized software
Do selected samples have better QTL resolution?
A simulation study
• Variables
– Population type (RI, BRI, DH)
– Map length
– Marker spacing (always even)
– Sample fraction (optimized for expected bin length)
– Genetic effects: additive ~ Gamma(1,2); dominance ~ Beta(1,1); pairwise epistasis (when >1 QTL)
• QTL analysis
– Marker regression (QTL Cartographer)
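Drawing genetic effects from the stated distributions could look like the Python sketch below. It is not the simulation code behind these slides: Gamma(1,2) is assumed to mean shape 1 and scale 2 (the slide does not say whether the second parameter is a scale or a rate), epistasis is left out, and the function name is hypothetical.

```python
import random

def draw_qtl_effects(n_qtl, seed=None):
    """Draw (additive, dominance) effects for n_qtl simulated QTL."""
    rng = random.Random(seed)
    effects = []
    for _ in range(n_qtl):
        additive = rng.gammavariate(1.0, 2.0)   # shape=1, scale=2 (assumed)
        dominance = rng.betavariate(1.0, 1.0)   # Beta(1,1) is uniform on [0, 1]
        effects.append((additive, dominance))
    return effects

# Five QTL with a fixed seed so that the draw is reproducible
effects = draw_qtl_effects(5, seed=2003)
```

A gamma distribution with shape 1 is an exponential distribution, so most simulated QTL have small additive effects and a few have large ones.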
QTL detection power
• Reduced in a selected sample in proportion to distance between marker and QTL
• Experimental design
– 5 QTL
– Map length 1000 cM
– Base population 500
– Sample fraction 0.2
[Figure: # QTL detected (0 to 3) versus marker spacing (20 cM, 5 cM, 1 cM) for selected and random samples]
QTL resolution
• Resolution increases with recombination enrichment
• Resolution here measured as width of 95% confidence interval (cM)
• Experimental design
– 1 QTL
– Map length 100 cM
– Base population 500
– Sample fraction 0.1
[Figure: resolution (cM, 0 to 12) versus additive effect (0.5, 0.75, 1, 1.5, 2) for selected and random samples]
Relationship between power and resolution
[Figure: two diagrams, each showing positions 1 to 4 relative to flanking markers]
QTLs for cell wall composition in the maize IBM population
[Figure: two panels plotting confidence interval (Morgans) and number of QTLs against sample fraction (0.5 to 1) for random and selected samples]
data from Hazen SP, Hawley RM et al. (2003) Plant Physiology

f     0.5    0.6    0.7    0.8    0.9
RE    1.24   1.19   1.14   1.09   1.05
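The RE values listed for sample fractions 0.5 to 0.9 (1.24, 1.19, 1.14, 1.09, 1.05) fall close to a straight line in (1 − f). The Python sketch below fits that line through the origin by least squares; the function name is hypothetical, and the linear form RE − 1 = c(1 − f) is the only modeling assumption.

```python
def fit_re_slope(sample_fractions, re_values):
    """Least-squares slope c for RE - 1 = c * (1 - f), with no intercept."""
    xs = [1.0 - f for f in sample_fractions]
    ys = [re - 1.0 for re in re_values]
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

fractions = [0.5, 0.6, 0.7, 0.8, 0.9]
re_observed = [1.24, 1.19, 1.14, 1.09, 1.05]

slope = fit_re_slope(fractions, re_observed)
# Largest gap between the fitted line and the listed RE values
worst = max(abs((1.0 + slope * (1.0 - f)) - re)
            for f, re in zip(fractions, re_observed))
```

A small worst-case gap indicates that a single constant summarizes how RE changes with sample fraction in this population.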
Summary of findings:
QTL mapping methodology
• Selection can result in substantial RE with only minor
pseudointerference
• Corrected map distances can be obtained using a
simple formula for RE (which will depend on the
experimental design)
• Currently available QTL mapping methods are
appropriate for analysis of selected samples.
• Selected samples
– Have increased QTL mapping resolution (relative to random
ones)
– Do not bias estimates of QTL position or effect size