summary_Stickleback_Seg_Dup

Stickleback Seg Dup Analysis
1. Genome
2. Parameters for Pipeline
3. Analysis
4. Files and images are at
http://eichlerlab.gs.washington.edu/help/linchen/stickleback/sticklebackwgac.html
5. The data is in directory
http://eichlerlab.gs.washington.edu/help/linchen/stickleback/data/
Stickleback Genome
• The genome (v1.0) was downloaded from UCSC.
• Total length is 463,354,448 bp, which includes a chrUn of 62,550,211 bp.
• A total of 29,101 gene annotations from the Ensembl gene annotation were downloaded from UCSC.
Seg Dup detection pipelines
• WGAC: detects Seg Dups in a genomic assembly by looking for homologous pairs (>1 kb in length, >90% identity).
• WSSD: detects Seg Dups in given sequences based on the depth of coverage of WGS (whole-genome shotgun) reads. Depth of coverage > average + 3 SD.
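The WGAC reporting thresholds above can be sketched as a simple filter. This is a minimal illustration, not the pipeline's own code: the `Pair` record and its field names are hypothetical, and only the >1 kb and >90% cutoffs come from the slide.

```python
from dataclasses import dataclass


@dataclass
class Pair:
    """Hypothetical record for one pairwise alignment (not the pipeline's format)."""
    chrom_a: str
    chrom_b: str
    length_bp: int
    identity: float  # fraction between 0.0 and 1.0


MIN_LENGTH_BP = 1000  # ">1kb in length" from the slide
MIN_IDENTITY = 0.90   # ">90% identity" from the slide


def is_wgac_pair(pair: Pair) -> bool:
    """Keep a pair only if it passes both cutoffs."""
    return pair.length_bp > MIN_LENGTH_BP and pair.identity > MIN_IDENTITY


def split_inter_intra(pairs):
    """Return (inter, intra) lists of the pairs that pass the cutoffs."""
    kept = [p for p in pairs if is_wgac_pair(p)]
    inter = [p for p in kept if p.chrom_a != p.chrom_b]
    intra = [p for p in kept if p.chrom_a == p.chrom_b]
    return inter, intra
```

Both comparisons are strict here, so a pair at exactly 90% identity is dropped; whether the pipeline treats the boundary that way is an assumption.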
Parameters and Notes for WGAC pipeline
• Repeats
– Standard repeat coordinates were reverse-generated from the soft-masked data.
– Secondary repeat masking was done using two repeat libraries, ab_initio_lib.txt and supplemental_lib.txt.
– Repeat-masking results for all three libraries were combined and sorted, then used for both pipelines.
• BLAST parsing seeds in the WGAC pipeline:
– The seed size is 500 bp.
Results from WGAC Pipeline
• Total pairs of SD detected (>1 kb and >90% identity): 152,272
• Inter-chromosome pairs: 63,744
• Intra-chromosome pairs: 88,528
• chrUn intra: 81,641
• chrUn inter and intra: 123,278
• Total NR: 40,573,574 bp
Notes:
• In general, the number of WGAC pairs is too high (10%) for a stickleback genome of only about 400 Mb.
• 92% of the intra-chromosomal WGAC pairs and 81% of all pairs have at least one sequence of the pair on chrUn. This is expected, since chrUn contains a high percentage of redundant, poorly assembled sequence.
• Our analysis also suggests that potential repeats not covered by the repeat libraries may also be detected as WGAC pairs (see the next slide).
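The percentages quoted in these notes can be re-derived from the counts on the results slide. The sketch below only redoes that arithmetic; reading the "10%" as the NR fraction of the genome is an assumption, and the variable names are illustrative.

```python
# Counts copied from the WGAC results and genome slides.
total_pairs = 152_272
inter_pairs = 63_744
intra_pairs = 88_528
chrun_intra = 81_641        # intra pairs on chrUn
chrun_any = 123_278         # pairs (inter or intra) touching chrUn
total_nr_bp = 40_573_574
genome_bp = 463_354_448

# Inter and intra pairs should account for every pair.
assert inter_pairs + intra_pairs == total_pairs

frac_intra_on_chrun = chrun_intra / intra_pairs  # roughly 0.92
frac_pairs_on_chrun = chrun_any / total_pairs    # roughly 0.81
frac_genome_in_nr = total_nr_bp / genome_bp      # a little under 0.09

print(f"intra pairs on chrUn: {frac_intra_on_chrun:.1%}")
print(f"all pairs touching chrUn: {frac_pairs_on_chrun:.1%}")
print(f"WGAC NR space as share of genome: {frac_genome_in_nr:.1%}")
```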
Repeats?
• Since repeats might be an issue, I set up a filter to determine how many WGAC pairs may be affected. With >20 hits, 400 bp on the boundary, and hit length <10 kb, 30% of WGAC pairs are affected. With >10 hits, 400 bp boundary overlap, and hit length <10 kb, 60% of WGAC pairs are affected.
• I then generated the NR space of these hits. It totals 7,481,640 bp from 103,157 pairs, out of all WGAC (152,272 pairs, 40,573,574 bp in total). That is 2/3 of the hits but only 1/5 of the total NR space.
• I think this is reasonable, because a high proportion of the WGAC pairs affects only a small proportion of the NR space.
• These sequence intervals should also be detected by WSSD if they are repeats.
• However, I have not yet removed them from AllDup (the merge of WGAC and WSSD), because many of them have high-frequency hits on chrUn. At this stage we do not know whether they are redundant sequences or real seg dups, but we can pull them out at any time based on the coordinates.
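"Generating the NR space" of a set of hits amounts to merging overlapping intervals and summing the merged lengths. The following is a minimal sketch of that idea, assuming half-open (chrom, start, end) coordinates; the function names and the example coordinates are hypothetical and are not taken from the pipeline.

```python
from collections import defaultdict


def merge_intervals(intervals):
    """Merge overlapping or touching half-open (chrom, start, end) intervals."""
    by_chrom = defaultdict(list)
    for chrom, start, end in intervals:
        by_chrom[chrom].append((start, end))

    merged = []
    for chrom in sorted(by_chrom):
        cur_start, cur_end = None, None
        for start, end in sorted(by_chrom[chrom]):
            if cur_start is None:
                cur_start, cur_end = start, end
            elif start <= cur_end:
                # Overlaps or touches the current block: extend it.
                cur_end = max(cur_end, end)
            else:
                merged.append((chrom, cur_start, cur_end))
                cur_start, cur_end = start, end
        if cur_start is not None:
            merged.append((chrom, cur_start, cur_end))
    return merged


def nr_bp(intervals):
    """Total non-redundant base pairs covered by the intervals."""
    return sum(end - start for _, start, end in merge_intervals(intervals))
```

Because each base is counted once no matter how many pairs cover it, many overlapping hits can collapse into a small NR space, which is the pattern described above (2/3 of the hits, 1/5 of the NR space).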
General analysis of WGAC length and identity distribution
1. The length distribution peaks at <3 kb; intra > inter, with 92% of intra pairs on chrUn.
2. The identity distribution peaks at 96%. Few pairs are higher than 99%.
[Figure: length distribution. Total (bp) versus length (1 kb to 50 kb); inter and intra series.]
[Figure: identity distribution. Total (bp) versus identity (0.9 to 1); inter and intra series.]
General analysis, NR distribution on chromosome.
high SD in chrUn
[Figure: Percentage of Dup NR relative to chromosome. Percent versus chromosome (chrI to chrXXI and chrUn); inter, intra, and both series.]
[Figure: NR length on chromosome. Total (bp) versus chromosome (chrI to chrXXI and chrUn); inter, intra, and both series.]
General view showing all WGAC on all chromosomes
Concentration of SD on smaller supercontigs on chrUn
[Image: global view of inter- and intra-chromosomal pairs at 5 kb and above 90% identity, without chrUn. Red indicates inter-chromosomal pairs and blue indicates intra-chromosomal pairs.]
[Image: global view of inter- and intra-chromosomal pairs at 10 kb and 90% identity, without chrUn. Red indicates inter-chromosomal pairs and blue indicates intra-chromosomal pairs.]
[Image: global view of inter- and intra-chromosomal WGAC pairs at 10 kb and 90% identity, with chrUn included. Red indicates inter-chromosomal pairs and blue indicates intra-chromosomal pairs.]
WSSD analysis
• Downloaded the WGS reads (about 6 million).
• Downloaded the stickleback finished BACs. These BACs are used to determine the threshold for WGS depth of coverage. For a 5 kb window, the average number of reads is 78, with an SD of 27. The threshold is 125 for a 5 kb window and 25 for a 1 kb window (average + 3 SD).
• Repeat-masked the stickleback genome using the standard repeats, ab_initio_lib.txt, and supplemental_lib.txt. In addition, I added the potential repeats detected in the WGAC process that show more than 20 hit pairs in the same region.
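The "average + 3 SD" rule for the depth threshold can be sketched as below. This is an illustration only: the read counts are made up and are not the BAC control values, the function names are hypothetical, and using the population standard deviation rather than the sample one is an assumption.

```python
from statistics import mean, pstdev


def depth_threshold(window_counts, n_sd=3.0):
    """Threshold = mean read count per window + n_sd standard deviations."""
    return mean(window_counts) + n_sd * pstdev(window_counts)


def flag_windows(window_counts, threshold):
    """Indices of windows whose read count exceeds the threshold."""
    return [i for i, count in enumerate(window_counts) if count > threshold]


# Made-up control counts (mean 5, population SD 2), for illustration only.
control_counts = [2, 4, 4, 4, 5, 5, 7, 9]
threshold = depth_threshold(control_counts)

# Made-up genome window counts tested against that threshold.
candidate_windows = flag_windows([3, 12, 11, 20], threshold)
```

The threshold is estimated from control windows (here standing in for the finished BACs) and then applied to windows across the genome, so regions with excess read depth are flagged as candidate duplications.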
WSSD result
http://eichlerlab.gs.washington.edu/help/linchen/stickleback/data/wssd/
• A total of 729 regions covering 22,324,144 bp were found in wssdGE10K_nogap.tab (which has a 10 kb cutoff); 251 of them are on chrUn.
• wssd.tab contains 850 regions with 23,116,317 total bases. It has 125 more regions and less than 1 Mb of extra sequence compared to the 10 kb hits.
• A summary table of the WGAC intersect with WSSD is at
http://eichlerlab.gs.washington.edu/help/linchen/stickleback/data/wgacCMPwssd.xls
Union of WSSD and WGAC
Gene intersect with Seg Dups
• First, a non-redundant union of WGAC and WSSD is generated (AllDup.tab).
• Genes are then intersected with AllDup to identify genes that overlap duplicated space in the genome. There are 3135 Ensembl genes identified.
• Both data sets are at
http://eichlerlab.gs.washington.edu/help/linchen/stickleback/data/
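The gene intersect step can be sketched as an overlap test between gene coordinates and the non-redundant duplication intervals. This is a minimal sketch under stated assumptions: coordinates are half-open, the duplication intervals are already non-overlapping (as a non-redundant union would be), and the function names and record layouts are hypothetical rather than the format of AllDup.tab.

```python
from bisect import bisect_left


def index_dups(dups):
    """Group non-overlapping (chrom, start, end) intervals into sorted start/end lists."""
    by_chrom = {}
    for chrom, start, end in sorted(dups):
        by_chrom.setdefault(chrom, []).append((start, end))
    return {
        chrom: ([s for s, _ in ivs], [e for _, e in ivs])
        for chrom, ivs in by_chrom.items()
    }


def genes_in_dups(genes, dups):
    """Names of genes (name, chrom, start, end) overlapping any duplication interval."""
    index = index_dups(dups)
    hits = []
    for name, chrom, g_start, g_end in genes:
        starts, ends = index.get(chrom, ([], []))
        # Last interval that starts before the gene ends; with non-overlapping
        # intervals it is the only one that needs checking.
        i = bisect_left(starts, g_end) - 1
        if i >= 0 and ends[i] > g_start:
            hits.append(name)
    return hits
```

A gene is reported once if it overlaps duplicated space by at least one base; a minimum-overlap requirement, if the original analysis used one, would need an extra condition.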
The general view of WGAC and WSSD on chromosomes
• WSSD: black, above the chromosome line
• WGAC 5k, 94%: black, below the chromosome line
• WGAC 10k: brown, below the chromosome line
Summary table 1
                 total       chrN        chrUn      No. nr interval  file
wssd (bp)        22324144    13574716    8749428    729              wssdGE10K_nogap.tab
wgac (bp)        40573574    21017679    19555895   7387             data/wgac/NRspace
AllDup (bp)      45608440    24390195    21218245   5934             data/allDup.tab
Genome (bp)      463354448   400804237   62550211
repeats ? (bp)   7481640     1741266     5740374                     data/repeathitMerge
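To put the base-pair totals in Summary table 1 in context, each set can be expressed as a percentage of the genome, of the placed chromosomes (chrN), and of chrUn. The sketch below only redoes that division on the table's numbers; the dictionary layout and names are illustrative.

```python
# (total, chrN, chrUn) in bp, copied from Summary table 1.
GENOME = (463_354_448, 400_804_237, 62_550_211)
SETS = {
    "wssd": (22_324_144, 13_574_716, 8_749_428),
    "wgac": (40_573_574, 21_017_679, 19_555_895),
    "AllDup": (45_608_440, 24_390_195, 21_218_245),
    "repeats?": (7_481_640, 1_741_266, 5_740_374),
}

percent = {}
for name, bp in SETS.items():
    # Each row should split exactly into chrN + chrUn.
    assert bp[0] == bp[1] + bp[2]
    percent[name] = tuple(100.0 * part / whole for part, whole in zip(bp, GENOME))

for name, (total, chrn, chrun) in percent.items():
    print(f"{name:9s} total {total:5.1f}%  chrN {chrn:5.1f}%  chrUn {chrun:5.1f}%")
```

On these numbers, AllDup covers just under 10% of the whole genome but about a third of chrUn, which matches the chrUn enrichment noted elsewhere in the slides.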
The intersect between WSSD and WGAC
chrom     size       allWGAC   gt94WGAC_ge10K  WSSD      Shared    gt94WGAC_ge10K_WGAConly  WSSDonly  <=94%WGAC  <=94%WGAC+shared
chrI      28185914   1275120   315356          709840    195481    119875                   514359    193013     388494
chrII     23295652   713095    144114          234515    77007     67107                    157508    72943      149950
chrIII    16798506   1041842   435522          821969    389684    45838                    432285    108184     497868
chrIV     32632948   2093860   476191          1589484   379805    96386                    1209679   306309     686114
chrIX     20249479   1389579   610360          1004524   490770    119590                   513754    100388     591158
chrUn     62550211   19483869  10809499        8749428   4789618   6019881                  3959810   630260     5419878
chrV      12251397   591969    178851          393826    166869    11982                    226957    50079      216948
chrVI     17083675   621495    177632          245111    128778    48854                    116333    87014      215792
chrVII    27937443   1480355   521853          861056    469264    52589                    391792    175038     644302
chrVIII   19368704   824600    245027          274801    119937    125090                   154864    62353      182290
chrX      15657440   1274186   735451          1039477   611552    123899                   427925    79609      691161
chrXI     16706052   1336848   499828          1152246   474664    25164                    677582    149606     624270
chrXII    18401067   1002589   455231          721761    436954    18277                    284807    91092      528046
chrXIII   20083130   1001618   315089          508170    174381    140708                   333789    93504      267885
chrXIV    15246461   472042    95357           221539    60401     34956                    161138    53894      114295
chrXIX    20240660   918086    240950          635973    212904    28046                    423069    83718      296622
chrXV     16198764   578995    173468          303978    101413    72055                    202565    64444      165857
chrXVI    18115788   1216252   462619          810223    375762    86857                    434461    165325     541087
chrXVII   14603141   278408    54942           45597     24201     30741                    21396     21509      45710
chrXVIII  16282716   827757    320585          572537    273969    46616                    298568    78890      352859
chrXX     19732071   916472    277129          556990    193012    84117                    363978    147507     340519
chrXXI    11717487   1062424   717376          871099    665531    51845                    205568    58839      724370
total     463354448  40417128  18278097        22324144  10811957  7466140                  11512187  2873518    13685475
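The columns of the intersect table are related by simple sums: the shared bases plus the WGAC-only bases give the gt94WGAC_ge10K column, shared plus WSSD-only gives the WSSD column, and the last column is <=94%WGAC plus shared. The sketch below checks those identities on three rows copied from the table; the shortened column keys are illustrative, and the identities are inferred from the numbers rather than stated in the slides.

```python
COLUMNS = (
    "size", "allWGAC", "gt94WGAC_ge10K", "WSSD", "Shared",
    "WGAConly", "WSSDonly", "le94WGAC", "le94WGAC_plus_shared",
)

# Three rows copied from the intersect table (values in bp).
ROWS = {
    "chrI": (28185914, 1275120, 315356, 709840, 195481,
             119875, 514359, 193013, 388494),
    "chrUn": (62550211, 19483869, 10809499, 8749428, 4789618,
              6019881, 3959810, 630260, 5419878),
    "total": (463354448, 40417128, 18278097, 22324144, 10811957,
              7466140, 11512187, 2873518, 13685475),
}


def check_row(values):
    """True if the row satisfies the three column identities."""
    r = dict(zip(COLUMNS, values))
    return (
        r["gt94WGAC_ge10K"] == r["Shared"] + r["WGAConly"]
        and r["WSSD"] == r["Shared"] + r["WSSDonly"]
        and r["le94WGAC_plus_shared"] == r["le94WGAC"] + r["Shared"]
    )


results = {chrom: check_row(values) for chrom, values in ROWS.items()}
```

The same check can be run on every row as a quick guard against transcription errors in the table.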
Summary
• Stickleback Seg Dups have been detected using two independent pipelines, WGAC and WSSD. Since each pipeline is based on its own mechanism, we expect the majority of the intervals to be consistent, with some variation. From the results of the two pipelines, two sets of genomic intervals were generated for Seg Dups.
– The first set consists of the genomic intervals detected by both WGAC and WSSD, that is, the intersect between WGAC and WSSD. This set represents the most conservative estimate of Seg Dups in the genome.
http://eichlerlab.gs.washington.edu/help/linchen/stickleback/data/wssd_wgac_intersect
– The second set is the union of the WGAC and WSSD intervals (AllDup.tab), which represents the largest estimate of Seg Dups in the genome.
http://eichlerlab.gs.washington.edu/help/linchen/stickleback/data/allDup.tab
– A list of genes intersecting with each set was also generated.
• With AllDup (the union of WGAC and WSSD), there are 3153 genes in total.
http://eichlerlab.gs.washington.edu/help/linchen/stickleback/data/gene_in_alldup
• With the WGAC and WSSD intersect, there are 1267 genes in total.
http://eichlerlab.gs.washington.edu/help/linchen/stickleback/data/gene_in_wssd_wgac_intersect
• A list of intervals with the potential to be repeats was also generated. These are regions with a high frequency of hits and defined boundaries (>10 hits, <400 bp at the boundary, <10 kb in length). They account for >60% of total WGAC pairs and 1/5 of the WGAC NR space.
http://eichlerlab.gs.washington.edu/help/linchen/stickleback/data/repeathitMerge
• ChrUn contigs contribute a great deal to the total SD in both WGAC and WSSD. The identity distribution analysis shows that the identity of most pairs is less than 99%, suggesting they may contain true SDs that are hard to assemble. How many of them are true SDs remains to be determined.