Patents 101 - The Zhao Bioinformatics Laboratory


Post-process of IMGAG M.t. 2.0 Release
Affymetrix Medicago Probe set – IMGAG 2.0 / MTGI 8.0 Mapping
Zhao Bioinformatics Lab
Plant Biology Division
IMGAG M.t. 2.0
Data downloaded from ftp://ftpmips.gsf.de/plants/medicago/MT_2_0/MT2.0_medicago_chrX_20080303_NoOverlap.xml.tar.gz
● Summary
- 38,844 TUs and 38,844 gene models, in one-to-one correspondence
- 38,759 distinct gene names, so 82 models are redundant in gene name
- For 85 of the 38,844 models, the CDS region is not compatible with the FASTA file
- 4,644 models with 5’-UTR + CDS
- 5,846 models with CDS + 3’-UTR
- 11,656 models with 5’-UTR + CDS + 3’-UTR
- 16,698 models with CDS only
Evidence Code
● F (5036 genes) full coverage/FL-cDNA: The complete gene model from translation start to translation stop is covered by expressed Medicago sequence, e.g. FL-cDNA or EST alignments across the full length of the coding sequence.
● E (14737 genes) expressed/EST matches: Expression of the gene is supported by Medicago EST sequence that matches the gene call (partially).
● H (14209 genes) homology/heterologous: the gene call is supported by similarity to Medicago or other ESTs, protein, FL-cDNA, genomic or other sequences with partial or full-length alignments.
● I (1375 genes) intrinsic/ab initio/inferred/hypothetical: the gene call is based only on intrinsic prediction tools such as FGENESH, Genscan or Eugene, and no significant alignments to other sequences are available. The length of the prediction is greater than 300 bp or there is a significant domain match in Interpro.
● L (3830 genes) 'low quality' gene calls: gene calls not in F, E, nor H, with no significant Interpro domain match and a length less than 300 bp, i.e., unsupported intrinsic predictions of short length and thus statistically containing many false predictions.
Total genes: 38334 non-overlapped genes
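The evidence codes above read like a small decision rule, which can be sketched in Python. This is only an illustration: the function name and its boolean inputs are hypothetical, the F > E > H precedence is an assumption, and the slide does not say what happens at exactly 300 bp (treated as L here).

```python
def evidence_code(full_length_support, est_match, homology_support,
                  has_interpro_domain, length_bp):
    """Assign an evidence code, checking the strongest evidence first.

    The F > E > H precedence and the handling of exactly 300 bp are
    assumptions; the slide only gives the per-code definitions.
    """
    if full_length_support:   # FL-cDNA/EST coverage from start to stop
        return "F"
    if est_match:             # partial Medicago EST support
        return "E"
    if homology_support:      # similarity to other sequences
        return "H"
    # Only intrinsic (ab initio) predictions remain at this point.
    if has_interpro_domain or length_bp > 300:
        return "I"
    return "L"                # short, unsupported prediction


print(evidence_code(False, False, False, False, 250))  # L
print(evidence_code(False, False, False, True, 250))   # I
```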
Affymetrix Medicago Probe set –
IMGAG gene Mapping
Two approaches
● A. Blast-based approach
(1) HSP length / Affymetrix probeset target length >= threshold1
(2) Matching identity length / Max_HSP length >= threshold2
● B. Affy probe-set level matching
(1) IMGAG gene sequences were matched to corresponding Affymetrix
probe sets using a position-weighted scoring index in which mismatches
near the middle of a probe were most heavily penalized as follows:
(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,2,2,2,2,2,1,1,1,1,1).
(2) A perfect match for a probe yields a score of 45 (the sum of the 25 position weights). A match was
declared when at least 8 of the 11 probes in a probe set had scores of 43 or higher.
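A minimal Python sketch of the position-weighted scoring in Approach B follows. It assumes that a probe's score is 45 minus the weights of its mismatched positions, that each 25-mer probe is slid along the gene sequence without gaps, and that strand handling (reverse complement) is done beforehand. All function names are hypothetical and this is not the lab's actual code.

```python
# Position weights from the slide: mismatches near the middle cost the most.
WEIGHTS = (1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3,
           2, 2, 2, 2, 2, 1, 1, 1, 1, 1)
PERFECT_SCORE = sum(WEIGHTS)  # 45


def probe_score(probe, window):
    """Score one 25-mer probe against an equally long window of the gene."""
    if len(probe) != len(WEIGHTS) or len(window) != len(WEIGHTS):
        raise ValueError("probe and window must both be 25 bases long")
    penalty = sum(w for w, p, t in zip(WEIGHTS, probe, window) if p != t)
    return PERFECT_SCORE - penalty


def best_probe_score(probe, gene_seq):
    """Best ungapped score of the probe over all windows of the gene."""
    n = len(WEIGHTS)
    windows = (gene_seq[i:i + n] for i in range(len(gene_seq) - n + 1))
    return max((probe_score(probe, w) for w in windows), default=0)


def probe_set_matches(probes, gene_seq, min_score=43, min_probes=8):
    """Declare a match when enough probes of the set score well enough."""
    hits = sum(1 for p in probes if best_probe_score(p, gene_seq) >= min_score)
    return hits >= min_probes
```

Under this reading, a score of 43 or higher tolerates at most two penalty points, for example two mismatches in the outer five positions or one mismatch at a weight-2 position, but never a mismatch in the central five bases.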
Statistics on Probe sets

Type                                          Num of probe sets   Percent in the Mtr. set   Notes
Unique probe sets, e.g. Mtr.10097.1.S1_at     44182               86.80                     unique to one gene
Alternative (_a_), e.g. Mtr.10267.1.S1_a_at   116                 0.23                      alternative probe sets to one gene
Shared (_s_), e.g. Mtr.10146.1.S1_s_at        4793                9.42                      common to multiple genes
Others (_x_), e.g. Mtr.10093.1.S1_x_at        1809                3.55                      other probe sets with complicated mapping
Total                                         50900               100
Statistics on Approach A – scenario #1: less stringent
● Affy Probeset Target BLAST against IMGAG cDNA
Threshold 1=0.7; Threshold 2=0.7

Num of cDNA   Matching probe-set   Percent
13717         0                    35.31
10054         1                    25.88
15073         >=2                  38.80
38844         total                100

Num of probe_sets   Matching cDNA   Percent
25190               0               49.49
15223               1               29.91
10487               >=2             20.60
50900               total           100
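The two Approach A ratios and the 0 / 1 / >=2 binning used in these tables can be sketched as follows. The tuple layout of `hits`, the function names, and the assumption that each hit already represents the longest HSP for its probe set and cDNA pair are illustrative choices, not details taken from the slides.

```python
from collections import defaultdict


def passes_approach_a(hsp_len, identity_len, target_len,
                      threshold1=0.7, threshold2=0.7):
    """Apply both ratios: HSP/target length and identity/HSP length."""
    if hsp_len <= 0 or target_len <= 0:
        return False
    return (hsp_len / target_len >= threshold1
            and identity_len / hsp_len >= threshold2)


def bin_match_counts(hits, all_probesets, threshold1=0.7, threshold2=0.7):
    """Count probe sets matching 0, 1, or >=2 cDNAs after filtering.

    hits: iterable of (probeset, cdna, hsp_len, identity_len, target_len),
    assumed to hold the longest HSP per probe set / cDNA pair.
    """
    matched = defaultdict(set)
    for probeset, cdna, hsp_len, identity_len, target_len in hits:
        if passes_approach_a(hsp_len, identity_len, target_len,
                             threshold1, threshold2):
            matched[probeset].add(cdna)
    bins = {"0": 0, "1": 0, ">=2": 0}
    for probeset in all_probesets:
        n = len(matched.get(probeset, ()))
        bins["0" if n == 0 else "1" if n == 1 else ">=2"] += 1
    return bins


hits = [("ps1", "g1", 80, 78, 100), ("ps2", "g1", 90, 90, 100),
        ("ps2", "g2", 75, 70, 100), ("ps3", "g3", 50, 50, 100)]
print(bin_match_counts(hits, ["ps1", "ps2", "ps3"]))
```

Swapping the roles of probe set and cDNA in the grouping gives the cDNA-side table, and passing both thresholds as 1.0 gives a perfect-match variant.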
Statistics on Approach A – scenario #2: Perfect matches
● Affy Probeset Target BLAST against IMGAG cDNA
Threshold 1=1.0; Threshold 2=1.0

Num of cDNA   Matching probe-set   Percent
28169         0                    72.52
8864          1                    22.82
1811          >=2                  4.62
38844         total                100

Num of probe_sets   Matching cDNA   Percent
39593               0               77.79
10344               1               20.32
963                 >=2             1.89
50900               total           100
Statistics of Original probe_set EST mapping

Num of EST   Matching probeset   Percent
6315         0                   17.12
29038        1                   78.74
1525         >=2                 4.14
36878        total               100
Statistics of our probe_set vs. EST mapping

Num of EST   Matching probe-set   Percent
3304         0                    8.96
29535        1                    80.09
4039         >=2                  10.95
36878        total                100

[Figure: bar chart comparing the original ("Origin") and our ("Ours") mappings; x-axis: 0 probeset, 1 probeset, 2 probesets; y-axis ticks from 0 to 90.]

Overlapping mapping between our probe-set vs. EST mapping and the Affy original probe-set vs. EST mapping: 37872 ∩ 32108 = 32106.
Our method covered 32106/32108 = 99.994% of the Affy original mapping.
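The coverage figure is the size of the intersection divided by the size of the original mapping. A small sketch, assuming each mapping is held as a set of (probe set, EST) pairs, which the slides do not state explicitly; the toy identifiers are made up:

```python
def coverage(ours, original):
    """Percent of the original mapping that also occurs in ours."""
    shared = ours & original
    return 100.0 * len(shared) / len(original)


# Toy example with hypothetical identifiers.
original = {("ps1", "est1"), ("ps2", "est2"), ("ps3", "est3")}
ours = {("ps1", "est1"), ("ps2", "est2"), ("ps4", "est9")}
print(round(coverage(ours, original), 2))   # 66.67

# The counts reported on the slide: 32106 shared out of 32108 original.
print(round(100.0 * 32106 / 32108, 3))      # 99.994
```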
Statistics on Approach B
● IMGAG cDNA versus Probe_set

Num of cDNA       Matching probe_set   Percent
19961             0                    51.39
12909             1                    33.23
5974 (3134 uni)   >=2                  15.38
38844             total                100
Probe sets map to IMGAG or ESTs

Item    Matched To                      Num of probe_sets   Percent   Subtotal   Subtotal percent
1       None                            7494                14.72
2       TC/EST only                     21284               41.82
3       TC/EST and unique IMGAGv2       12866               25.28     14362      28.22
        TC/EST and multiple IMGAGv2 +   1496                2.94
4       Unique IMGAGv2 only             6500                12.77     7760       15.25
        Multiple IMGAGv2 only ++        1260                2.48
Total                                   50900               100
MTGI 8 vs. IMGAG gene Mapping
● Mt2.0 cDNA BLASTN against MTGI8 (expectation 1e-04);
● Further applied the filters below:
HSP length/Unigene length (a)
Identity length/HSP length (b)
● Result:
9333 (24.0%) cDNA are mapped to 9255 (25.1%) unigenes (a>0.9 b>0.9);
11517 (29.6%) cDNA are mapped to 11383 (30.9%) unigenes (a>0.8 b>0.8);
13284 (34.2%) cDNA are mapped to 13092 (35.5%) unigenes (a>0.7 b>0.7);
9959 (25.64%) cDNA are mapped to 10543 (28.59%) unigenes (a>0.8 b>0.95);
13063 (33.63%) cDNA are mapped to 14585 (39.55%) unigenes (a>0.5 b>0.95);
● Total cDNA: 38844, Total unigene: 36878
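The (a) and (b) filters and the "N (x%) cDNA are mapped to M (y%) unigenes" summaries can be sketched as below. The sketch assumes 12-column tabular BLAST output (query, subject, percent identity, alignment length, ...) and a separately built dictionary of unigene lengths. The slides do not say which output format was used, so the column layout, helper names, and toy records are all assumptions.

```python
def mapped_pairs(tabular_lines, unigene_len, min_a, min_b):
    """Keep (cDNA, unigene) pairs whose HSP passes both ratio filters.

    a = HSP length / unigene length
    b = identity length / HSP length, taken here as percent identity / 100
    """
    pairs = set()
    for line in tabular_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank and comment lines
        fields = line.rstrip("\n").split("\t")
        cdna, unigene = fields[0], fields[1]
        pident, hsp_len = float(fields[2]), int(fields[3])
        a = hsp_len / unigene_len[unigene]
        b = pident / 100.0
        if a > min_a and b > min_b:
            pairs.add((cdna, unigene))
    return pairs


def summarize(pairs, total_cdna, total_unigene):
    """Return (n_cdna, pct_cdna, n_unigene, pct_unigene) for mapped pairs."""
    cdnas = {c for c, _ in pairs}
    unigenes = {u for _, u in pairs}
    return (len(cdnas), round(100.0 * len(cdnas) / total_cdna, 2),
            len(unigenes), round(100.0 * len(unigenes) / total_unigene, 2))


# Toy records; identifiers and lengths are made up for illustration.
lines = ["c1\tu1\t96.00\t95\t4\t0\t1\t95\t1\t95\t1e-40\t150",
         "c2\tu1\t99.00\t60\t1\t0\t1\t60\t1\t60\t1e-25\t100",
         "c3\tu2\t90.00\t190\t19\t0\t1\t190\t1\t190\t1e-50\t200"]
lengths = {"u1": 100, "u2": 200}
print(summarize(mapped_pairs(lines, lengths, 0.5, 0.95), 4, 2))
```

Because several cDNAs can hit the same unigene (and vice versa), the two counts are taken from separate sets, which is why the cDNA and unigene numbers on the slide differ for the same thresholds.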
MTGI 8 High Quality TC vs. IMGAG gene Mapping
● I. Retrieved 9,396 High Quality TC based on IMGAG’s criteria
BLAST TIGR’s High Quality TC vs. BAC:
(1). >95% identity over 80% of the TC length = 64.3% (current 2,500 BACs) -> 73.2% projected for 2,800 BACs to be sequenced
(2). >95% identity over 50% of the TC length = 68.6% (current 2,500 BACs) -> 77.0% projected for 2,800 BACs to be sequenced
● II. Our Mt2.0 cDNA BLASTN against 9396 MTGI8 High Quality TC (expectation 1e-04);
Further applied the filters below:
HSP length/Unigene length (a)
Identity length/HSP length (b)
Result:
3550 (9.14%) cDNA are mapped to 3294 (35.06%) unigenes (a>0.8 b>0.95);
5052 (13.0%) cDNA are mapped to 4613 (49.10%) unigenes (a>0.5 b>0.95);
Total cDNA: 38844, Total High Quality TC: 9396
Thank You!
● Suggestions / Comments