Toward a Better Understanding of Cereal Genome
Evolution Through Ensembl Compara
Apurva Narechania¹, Joshua Stein¹, William Spooner¹, Sharon Wei¹, Ben Faga¹, Shiran Pasternak¹, and Doreen Ware¹,²
¹ Cold Spring Harbor Laboratory, 1 Bungtown Road, Cold Spring Harbor, NY 11724, USA
² USDA-ARS NAA Plant, Soil & Nutrition Laboratory Research Unit, USA
Summary
The maize genome has been largely shaped by its history of tetraploidization,
subsequent rearrangement and duplicate gene loss. Disruption of synteny has also
resulted from apparent gene movement in both maize and sorghum relative to rice.
Many questions remain concerning the evolution of cereals, including the extent of
lineage-specific rearrangements, the selective forces that dictated the retention of
duplicate genes, and the extent of conserved non-coding regions. The availability of
three nearly complete cereal genomes (maize, rice and sorghum) provides an
unprecedented opportunity to use comparative genomics to answer these and other
questions in the evolution of plant genomes. As part of the Maize Genome
Sequencing Project, we describe the use of the Ensembl Compara whole genome
alignment pipeline to construct sequence-based syntenies. The pipeline automates
pairwise whole genome analysis by parallelizing the construction of blastz
alignments, their subsequent consolidation into chains and nets, and their
coalescence into syntenic regions. The algorithms employed identify highly similar
regions between two large sequences while allowing for segments without similarity,
thus highlighting gene movement or genomic rearrangement within syntenic blocks.
The tetraploid nature of maize and its history of whole genome duplications suggest
that much of its genome should have at least two blocks that align to the same
region of rice. In a preliminary analysis, a pilot 22-megabase maize assembly
on maize chromosome 4 shows synteny to a comparably sized region on
rice chromosome 2. In agreement with marker-based synteny studies, we show that
this rice region has a duplicate homeologue on maize chromosome 5. We
address the challenges of applying this pipeline to the maize genome in its partially
assembled state.
Blastz-NET Alignment Stats (Maize Accelerated Region)

Region Statistics (bases)
Total length              21757985
Alignable Sequence         5076641
Rice Aligned Coverage      1767792
Sorghum Aligned Coverage   3342816

Region Alignment Statistics
          Total Alignments  Chain or Net Alignments  Chains  Nets
Rice      53136             30896                    17936   2063
Sorghum   276777            152885                   115210  4337
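
The coverage figures quoted for the accelerated region can be recomputed from the region statistics. The following is a minimal Python sketch that uses only the four base counts reported above; the variable and function names are illustrative and are not part of the Ensembl Compara pipeline.

```python
# Base counts copied from the Region Statistics panel of this poster.
TOTAL_LENGTH = 21_757_985
ALIGNABLE = 5_076_641
ALIGNED = {"rice": 1_767_792, "sorghum": 3_342_816}


def fraction(part: int, whole: int) -> float:
    """Return part / whole as a plain fraction."""
    return part / whole


# Share of the accelerated region that is alignable (high quality, unmasked).
alignable_fraction = fraction(ALIGNABLE, TOTAL_LENGTH)

# Share of the alignable sequence covered by blastz-NETs, per species.
coverage = {species: fraction(n, ALIGNABLE) for species, n in ALIGNED.items()}

if __name__ == "__main__":
    print(f"alignable: {alignable_fraction:.1%} of total length")
    for species, frac in coverage.items():
        print(f"{species}: {frac:.1%} of alignable sequence")
```

Rounded to whole percentages, this gives about 35% for rice and 66% for sorghum, with roughly 23% of the total region counted as alignable.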
[Figure: Blastz-NET coverage by NET Level. x-axis: Blastz-NET Level (Level 1 to Level 4); y-axis: Percent of Alignable Sequence; series: Rice, Sorghum.]
• Alignable Sequence refers to the portion of the maize accelerated region
that is of high quality and has not been RepeatMasked.
• Sorghum blastz-NETs align 66% of the alignable maize sequence, while rice
blastz-NETs align 35% of it.
[Figure: Blastz-NET coverage by Rice Chromosome. x-axis: Chromosome (1 to 12, Mito, Chloro); y-axis: Percent of Alignable Sequence.]
Blastz-CHAIN-NET and the Ensembl Hive
[Figure: Ensembl Hive workflow. Analyses shown: SubmitGenome, ChunkAndGroupDNA, CreatePairAlignerJobs, Blastz, FilterDuplicates, UpdateMaxAlignmentLength, CreateAlignmentChainsJobs, AlignmentChains, CreateAlignmentNetsJobs, AlignmentNets.]
• The Blastz-CHAIN-NET pipeline creates long-range, gapped pairwise blastz chains
and nets from raw blastz alignments, thereby allowing for genomic rearrangements
in syntenic regions. Proc Natl Acad Sci U S A. 2003 Sep 30;100(20):11188-9.
• The Ensembl Hive pipeline parallelizes the generation of blastz alignments and
their consolidation into chains and nets using a hive system that creates specific
jobs and spawns anonymous, general workers to complete those jobs. Nucleic
Acids Res. 2008 Jan;36(Database issue):D707-14.
Syntenic Blocks Between Maize, Rice, and Sorghum
• The maize accelerated region contains blocks syntenic to rice chr2 and sorghum chr4.
• Maize: max gap between NETs 100,000 residues; min NET size 5,000 residues.
• Rice and sorghum: max NET gap 50,000 residues; min NET size 2,000 residues.
• Syntenic blocks are defined in two steps. First, NETs are grouped if the distance
between them is smaller than twice the max gap parameter and no NETs break
the synteny. Second, these groups are arranged into syntenic blocks of up
to 30 times the max gap parameter, with two synteny-breaking groups allowed.
• The rice assembly is courtesy of TIGR (version 5), and early access to the
sorghum assemblies is courtesy of JGI.
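
The first grouping step described above can be sketched as follows. This is a simplified illustration, not the Ensembl Compara synteny code. The `Net` record, the function names, and the reading of the maize parameters as query-side limits and the rice/sorghum parameters as target-side limits are all assumptions. Orientation is ignored, a NET that lies too far away on either genome simply starts a new group, and the second step (arranging groups into blocks of up to 30 times the max gap) is omitted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Net:
    """Hypothetical NET record; coordinates are half-open."""
    q_start: int  # query genome (assumed here to be maize)
    q_end: int
    t_start: int  # target genome (assumed here to be rice or sorghum)
    t_end: int


def interval_gap(a_start, a_end, b_start, b_end):
    """Distance between two intervals; 0 if they touch or overlap."""
    return max(0, max(a_start, b_start) - min(a_end, b_end))


def group_nets(nets, q_max_gap=100_000, t_max_gap=50_000,
               q_min_size=5_000, t_min_size=2_000):
    """Step 1 only: group NETs closer than twice the max gap on both genomes."""
    # Drop NETs below the minimum size on either genome.
    kept = [n for n in nets
            if n.q_end - n.q_start >= q_min_size
            and n.t_end - n.t_start >= t_min_size]
    groups = []
    for net in sorted(kept, key=lambda n: n.q_start):
        if groups:
            prev = groups[-1][-1]
            q_gap = interval_gap(prev.q_start, prev.q_end, net.q_start, net.q_end)
            t_gap = interval_gap(prev.t_start, prev.t_end, net.t_start, net.t_end)
            if q_gap < 2 * q_max_gap and t_gap < 2 * t_max_gap:
                groups[-1].append(net)
                continue
        # Too far from the previous NET (or first NET): start a new group.
        groups.append([net])
    return groups


# Made-up coordinates for illustration only.
a = Net(0, 6_000, 0, 3_000)
b = Net(50_000, 60_000, 20_000, 25_000)
tiny = Net(70_000, 71_000, 26_000, 26_500)       # below the minimum size
c = Net(1_000_000, 1_010_000, 30_000, 35_000)    # far away on the query
example_groups = group_nets([c, tiny, a, b])
```

With these made-up inputs, `a` and `b` fall into one group, `tiny` is filtered out by the minimum size, and `c` starts a second group because its query-side gap exceeds twice the max gap.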
[Figure: Blastz-NET coverage by Sorghum Chromosome. x-axis: Chromosome; y-axis: Percent of Alignable Sequence.]
Maize BAC-contigs versus Rice at MaizeSequence.org
• In its partially assembled state, the longest contiguous regions at
maizesequence.org are the BAC contigs.
• Whole genome alignments to rice for all BAC contigs are available and
correspond well to FgenesH predictions with similarity to known proteins and
maize ESTs.
Maize Accelerated Region Duplication
[Figure: Maize Accelerated Region Duplication. Axes: Maize Chromosome, Blastz-NET Alignments.]
• Rice Chr2 from positions 29 Mb to 36 Mb aligns to maize chromosomes 4 and 5
in equal measure, indicating a duplication event. Alignments were made to maize
BAC-contigs and mapped to chromosomes 4 and 5 using the FPC map.
• The majority of Chr4 hits were on FPC ctg182, corresponding to the accelerated
region. The majority of NETs on Chr5 were on contigs 250, 251, 253, and 254, in
agreement with marker-based studies. PLoS Genet. 2007 Jul 20;3(7):e123

Distribution of blastz-NET sizes for Rice and Sorghum Alignments

Rice Stats: Aligned Stats (blastz-NET length, bp)
Level    Avg Len   Median Len   Max Len   Min Len   Count
Level 1  1260.8    259          170824    15        1181
Level 2  322.8     185          6161      31        841
Level 3  178.8     147          541       56        41

Sorghum Stats: Aligned Stats (blastz-NET length, bp)
Level    Avg Len   Median Len   Max Len   Min Len   Count
Level 1  1830.0    203          934805    27        1213
Level 2  370.3     185          12122     27        2909
Level 3  213.1     137          2346      27        204
Level 4  208.0     107          696       30        11

Rice Stats: Span Stats (blastz-NET span, bp)
Level    Avg span  Median span  Max span  Min span  Count
Level 1  9144.5    275          2936780   15        1181
Level 2  532.0     188          96551     31        841
Level 3  205.2     147          798       56        41

Sorghum Stats: Span Stats (blastz-NET span, bp)
Level    Avg span  Median span  Max span  Min span  Count
Level 1  18500.8   207          10989119  27        1213
Level 2  472.8     187          30712     27        2909
Level 3  330.9     137          10466     27        204
Level 4  208.0     107          696       30        11

[Figure: Rice and Sorghum Level 1/2 Distributions. Panels: Rice Level 1 and Sorghum Level 1; Rice Level 2 and Sorghum Level 2. x-axis: Blastz-NET span (bp, log10); y-axis: Frequency.]
• Blastz-NET lengths are defined as the number of aligning bases in a NET, excluding gaps, while blastz-NET spans are
the distances from the first to the last base in the NET, including gaps.
• Level 1 NETs consistently show the longest length and span across species.
• Sorghum NETs are considerably longer than those found in rice.
• Despite large differences in lengths and spans across levels and species, the overall distributions are similar, highlighting
the influence of biologically significant outliers.
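
The distinction between blastz-NET length and span can be made concrete with a small sketch. It assumes, purely for illustration, that a NET is represented as a list of gap-free aligned blocks in half-open coordinates; this representation and the function names are not taken from the pipeline.

```python
def net_length(blocks):
    """Aligned bases in a NET: sum of gap-free block sizes (gaps excluded)."""
    return sum(end - start for start, end in blocks)


def net_span(blocks):
    """Distance from the first to the last aligned base (gaps included)."""
    return max(end for _, end in blocks) - min(start for start, _ in blocks)


# Hypothetical NET with three gap-free blocks separated by two gaps.
example_net = [(100, 150), (400, 450), (1_000, 1_100)]
example_length = net_length(example_net)  # 50 + 50 + 100 aligned bases
example_span = net_span(example_net)      # from base 100 to base 1100
```

A NET with long internal gaps therefore has a span much larger than its length, which is consistent with the gap between the Level 1 length and span statistics.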
• Across chromosomes, the majority of Blastz-NETs cluster on rice chromosome 2 and
sorghum chromosome 4, in agreement with known marker-based synteny. Proc Natl
Acad Sci U S A. 2005 Sep 13;102(37):13206-11.
Gene Predictions Associated with Blastz-NETs
• 39% of maize genes within syntenic blocks are non-syntenic, suggesting
substantial gene movement within maize.
• Almost 50% of rice genes are non-syntenic, possibly due to loss of duplicate
genes within maize homeologous regions.
Rice Chr.  Block  Rice Spans  Maize Spans  Fold       Rice genes   Syntenic     Maize genes  Syntenic
           count  (kb)        (kb)         Expansion  w/in Blocks  rice genes   w/in blocks  maize genes
1          1      4.3         5.9          1.37       2            2 (100%)     0            -
2          51     5,019.9     13,760.0     2.74       776          392 (50.5%)  721          440 (61.0%)
4          1      6.4         19.7         3.06       1            1 (100%)     1            1 (100%)
5          3      13.9        131.0        9.41       3            2 (66.7%)    12           5 (41.7%)
6          5      30.9        39.1         1.27       7            3 (42.9%)    9            5 (55.6%)
7          2      35.9        57.8         1.61       0            -            2            2 (100%)
9          1      5.1         8.3          1.64       1            1 (100%)     2            2 (100%)
10         2      30.9        43.8         1.42       2            2 (100%)     8            6 (75.0%)
Total      66     5,147.2     14,065.5     2.73       792          403 (50.9%)  755          461 (61.1%)
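
The derived columns of this table follow from simple ratios. The sketch below recomputes the fold expansion and syntenic percentages for the rice chromosome 2 row and the totals; the function names are illustrative. Because the spans in the table are already rounded to 0.1 kb, recomputing the ratio for very small blocks can differ from the printed value in the last digit (for example, 19.7 / 6.4 gives about 3.08 rather than 3.06).

```python
def fold_expansion(maize_span_kb: float, rice_span_kb: float) -> float:
    """Maize span divided by the corresponding rice span."""
    return maize_span_kb / rice_span_kb


def percent_syntenic(syntenic_genes: int, genes_in_blocks: int) -> float:
    """Percentage of genes within blocks that were counted as syntenic."""
    return 100.0 * syntenic_genes / genes_in_blocks


# Rice chromosome 2 row, values copied from the table.
chr2_fold = fold_expansion(13_760.0, 5_019.9)
chr2_rice_pct = percent_syntenic(392, 776)
chr2_maize_pct = percent_syntenic(440, 721)

# Total row, values copied from the table.
total_fold = fold_expansion(14_065.5, 5_147.2)
total_maize_pct = percent_syntenic(461, 755)
```

The total maize percentage of about 61% corresponds to the statement that 39% of maize genes within syntenic blocks are non-syntenic.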
Methods:
• Syntenic blocks were defined from BLASTZ-Chain-Net data using the
parameters MaxDist and MinDist, as described in the synteny views above.
• Genes (excluding TEs) were counted as syntenic if they overlapped a chain HSP
that contributed to the synteny.
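
The gene-counting rule can be illustrated with a short interval-overlap sketch. It assumes genes and synteny-contributing chain HSPs are given as half-open (start, end) intervals on the same chromosome; the function names and the example coordinates are hypothetical.

```python
def overlaps(a, b):
    """True if half-open intervals a and b share at least one base."""
    return a[0] < b[1] and b[0] < a[1]


def count_syntenic(genes, hsps):
    """Count genes that overlap at least one synteny-contributing HSP."""
    return sum(any(overlaps(gene, hsp) for hsp in hsps) for gene in genes)


# Made-up coordinates for illustration only.
example_genes = [(100, 500), (900, 1_200), (2_000, 2_400)]
example_hsps = [(450, 600), (1_150, 1_300)]
example_count = count_syntenic(example_genes, example_hsps)
```

In this example the first two genes each overlap an HSP and the third does not, so two of the three genes would be counted as syntenic.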