
Homology Based Analysis of the Human/Mouse lncRNome
Cédric Notredame
Giovanni Bussotti
Comparative Bioinformatics lab, CRG
Part 1: GENCODE v10 lncRNA screening vs human and mouse genomes
Strategy: PipeR one2many homolog assignment
Template:
genes                                       10840
transcripts                                 17547
exons                                       58857
sum of mature transcript length (nt)     16927027
real coverage (nt)                       13083478
non overlapping loci                         7428
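The summed transcript length is larger than the real coverage, presumably because transcripts of the same gene share exonic sequence. The sketch below shows one way the two figures could be computed. It assumes that "real coverage" means the number of distinct nucleotides covered by at least one exon on a given strand, and that exons are supplied as hypothetical (chromosome, strand, start, end) tuples with 1-based inclusive coordinates. It is an illustration, not the PipeR code.

```python
from collections import defaultdict


def summed_length(exons):
    """Sum of exon lengths, counting shared bases once per exon."""
    return sum(end - start + 1 for _, _, start, end in exons)


def real_coverage(exons):
    """Number of distinct bases covered by at least one exon.

    Assumption: coverage is computed per chromosome and strand.
    """
    by_seq = defaultdict(list)
    for chrom, strand, start, end in exons:
        by_seq[(chrom, strand)].append((start, end))

    covered = 0
    for intervals in by_seq.values():
        intervals.sort()
        cur_start, cur_end = intervals[0]
        for start, end in intervals[1:]:
            if start <= cur_end + 1:       # overlapping or adjacent: extend block
                cur_end = max(cur_end, end)
            else:                          # gap: close the current block
                covered += cur_end - cur_start + 1
                cur_start, cur_end = start, end
        covered += cur_end - cur_start + 1
    return covered


# Toy input: two overlapping exons on chr1 and one exon on chr2.
toy = [("chr1", "+", 100, 199), ("chr1", "+", 150, 249), ("chr2", "-", 10, 19)]
print(summed_length(toy), real_coverage(toy))
```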
PipeR Parameters:
Blast
- Freyhult parametrization
- Lower case masking
- Low complexity masking
Exonerate
- est2genome model
- 70% coverage required
- seed extension 2X
(the span of the genomic size of the query on both sides)
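As a reading aid for the seed extension parameter, here is a minimal sketch of how a search window around a Blast anchor could be built before running Exonerate. It assumes that the window is the anchor hit extended on each side by the genomic span of the query; the function name, argument names and the clipping to chromosome bounds are hypothetical and not taken from PipeR.

```python
def extension_window(hit_start, hit_end, query_genomic_span, chrom_length):
    """Return (start, end) of the target region handed to the aligner.

    Coordinates are 1-based inclusive. The anchor hit is extended by the
    genomic span of the query on both sides and clipped to the chromosome.
    """
    start = max(1, hit_start - query_genomic_span)
    end = min(chrom_length, hit_end + query_genomic_span)
    return start, end


# A 200 nt anchor for a query whose locus spans 3000 nt in its own genome.
print(extension_window(5000, 5200, 3000, 100000))
```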
PipeR: a pipeline for mapping lncRNAs
• blast-exonerate based framework to map lncRNAs against target genomes
• algorithm used:
[Diagram labels: lncRNA, 2 Blast hits, chromosome, mapping, extension, Exonerate, spliced transcript]
PipeR: lncRNA Homology Mapping
GENCODE lncRNAs vs Complete Genomes
GFF File
1. Anchor points: ENCODE vs Mouse with tuned Blast
2. Extension: Exonerate
3. Filtering: Id and Coverage
4. Validation of the GFF annotation
   - Overlap with Annotation
   - Overlap with Cufflinks Models
   - RPKM on target genome
5. Further Mapping Parameter Space Exploration using Experimental Evidence
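The filtering step (identity and coverage) can be sketched as follows. The 70% coverage threshold comes from the PipeR parameters; the identity threshold is not given in the slides, so `min_identity` is a hypothetical parameter, and the `Alignment` record is an assumed data structure rather than a PipeR type.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    query_id: str
    query_length: int     # length of the mature lncRNA transcript (nt)
    aligned_bases: int    # query bases placed in the spliced alignment
    identical_bases: int  # aligned bases identical between query and target

    @property
    def coverage(self) -> float:
        return self.aligned_bases / self.query_length

    @property
    def identity(self) -> float:
        if self.aligned_bases == 0:
            return 0.0
        return self.identical_bases / self.aligned_bases


def keep_alignment(aln, min_coverage=0.70, min_identity=0.0):
    """True if the alignment passes both the coverage and identity cut-offs."""
    return aln.coverage >= min_coverage and aln.identity >= min_identity


alignments = [
    Alignment("lnc1", 1000, 900, 810),  # coverage 0.90, identity 0.90
    Alignment("lnc2", 1000, 600, 590),  # coverage 0.60: fails coverage
    Alignment("lnc3", 1000, 750, 300),  # coverage 0.75, identity 0.40
]
# min_identity=0.5 is an illustrative value only.
kept = [a.query_id for a in alignments if keep_alignment(a, min_identity=0.5)]
print(kept)
```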
Mapping overview
[Diagram: transcripts of the query species (Gene A, Gene B; Transcripts 1, 2 and 3) mapped onto the target species. Outcomes shown: Blast/Exonerate failed, or multiple homologues, labelled Homolog 1 (best reciprocal), Homolog 2 (conserved exon number), Homolog 3 (high repeat coverage) and Homolog 4 (overlap with protein)]
GENCODEv10 vs human genome
• mapped 17327 transcripts out of 17547
• many lncRNAs found in multiple copies
(lncRNA families)
- found 144566 homologs
corresponding to 501355 exons
• Annotations of discovered homologs
are readily available
Homolog repeat coverage
• About 10% of all our homolog predictions are fully covered by repeats
Homolog repeat coverage
• We can sub-group the homologs into 3 sets according to their repeat coverage:
<= 20%
<= 80%
<= 100%
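A minimal sketch of this grouping, assuming repeat coverage is the percentage of a homolog's exonic bases that overlap annotated repeats, and that the three sets are nested (in the mapping statistics the <= 100% set equals the full set of homologs). Interval inputs are hypothetical (start, end) pairs, 1-based inclusive, and each list is assumed to be free of internal overlaps.

```python
THRESHOLDS = (20, 80, 100)


def repeat_coverage(exons, repeats):
    """Percentage of exonic bases that fall inside repeat intervals."""
    exon_bases = sum(end - start + 1 for start, end in exons)
    masked = 0
    for e_start, e_end in exons:
        for r_start, r_end in repeats:
            overlap = min(e_end, r_end) - max(e_start, r_start) + 1
            if overlap > 0:
                masked += overlap
    return 100.0 * masked / exon_bases


def coverage_sets(pct):
    """Labels of every nested set the homolog belongs to."""
    return [f"<= {t}%" for t in THRESHOLDS if pct <= t]


exons = [(1, 100), (201, 300)]      # 200 exonic bases
repeats = [(51, 100), (201, 210)]   # 60 of them are repeat-masked
pct = repeat_coverage(exons, repeats)
print(pct, coverage_sets(pct))
```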
HUMAN mapping statistics (by repeat coverage)

                                        <= 20%   <= 80%   <= 100%
genV10 mapped genes                       6088    10425     10698
genV10 mapped transcripts                 9318    16856     17327
Total homologs                           35399   102250    144566
Homologs whose exons overlap protein
coding exons (same strand)                3621     5076      8988
GENCODEv10 vs mouse genome
• mapped 3190 transcripts out of 17547
representing 2249 human genes
• many lncRNAs found in multiple copies
(lncRNA families)
- found 14936 homologs
corresponding to 38910 exons
• Annotations of discovered homologs
are readily available
Human/Mouse
Exon Number Conservation
• Difference between the number of exons in the
human transcripts and in the mouse homologs
• “0” means that the exon number is the same
• Negative bins indicate mouse homologs having
more exons than the human query
• 1160 GENCODE v10 transcripts find at least 1
homolog in mouse with the same exon number
[Histogram regions: human < mouse (negative bins), human > mouse (positive bins)]
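The exon number comparison can be illustrated with a short sketch. It assumes each query/homolog pair is available as a hypothetical (query_id, query_exons, homolog_exons) tuple; the difference is query minus homolog, so negative bins correspond to homologs with more exons than the query.

```python
from collections import Counter


def exon_difference_histogram(pairs):
    """Count query/homolog pairs per exon-number difference (query - homolog)."""
    return Counter(q_exons - h_exons for _, q_exons, h_exons in pairs)


def queries_with_conserved_exon_number(pairs):
    """Queries having at least one homolog with the same number of exons."""
    return {qid for qid, q_exons, h_exons in pairs if q_exons == h_exons}


# Toy data: T1 has two homologs, T2 and T3 have one each.
pairs = [("T1", 3, 3), ("T1", 3, 5), ("T2", 4, 2), ("T3", 2, 2)]
hist = exon_difference_histogram(pairs)
conserved = queries_with_conserved_exon_number(pairs)
print(sorted(hist.items()), sorted(conserved))
```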
Homolog repeat coverage
• We can sub-group the homologs into 3 sets according to their repeat coverage:
<= 20%
<= 80%
<= 100%
MOUSE mapping statistics (by repeat coverage)

                                        <= 20%   <= 80%   <= 100%   Reciprocal homologs
genV10 mapped genes                       1867     2172      2249                  1445
genV10 mapped transcripts                 2586     3076      3190                  1966
Total homologs                            6108    11141     14936                  1966
Homologs whose exons overlap protein
coding exons (same strand)                1611     2290      3177                   497
Homologs with conserved number of exons   1534     2407      2958                   689

Best Candidates: There are 148 transcripts that have < 20% repeat coverage,
conserved exon structure, do not overlap protein coding exons and are best reciprocal homologs
with the human queries
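The selection of best candidates combines four criteria, which can be sketched as a single predicate. The `Homolog` record and its field names are hypothetical, and "conserved exon structure" is approximated here by an equal exon count, which is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Homolog:
    query_id: str
    repeat_coverage_pct: float
    query_exons: int
    homolog_exons: int
    overlaps_protein_coding_exon: bool
    best_reciprocal: bool


def is_best_candidate(h):
    """Apply the four best-candidate criteria to one homolog."""
    return (
        h.repeat_coverage_pct < 20
        and h.query_exons == h.homolog_exons   # assumed proxy for conserved structure
        and not h.overlaps_protein_coding_exon
        and h.best_reciprocal
    )


homologs = [
    Homolog("T1", 5.0, 3, 3, False, True),    # passes every criterion
    Homolog("T2", 45.0, 3, 3, False, True),   # too much repeat coverage
    Homolog("T3", 5.0, 3, 4, False, True),    # exon number differs
    Homolog("T4", 5.0, 2, 2, True, True),     # overlaps a coding exon
    Homolog("T5", 5.0, 2, 2, False, False),   # not a best reciprocal hit
]
best = [h.query_id for h in homologs if is_best_candidate(h)]
print(best)
```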
BlastR vs The World
Figure 2: Exon read support.
a) Venn diagram indicating the number of exons detected by different methods (numbers in parentheses) and their intersection (transcripts annotated identically by the three methods): blastn (8749), blastnOpt (12487), blastr (12093), all (7492)
b) Average amount of reads per exon (y-axis: average reads per exon; x-axis: methods blastn, blastnOpt, blastr, all)
c) Percent of reads covered by at least one exon (y-axis: % exons with read; x-axis: methods blastn, blastnOpt, blastr, all)
Part 2: Ensembl.v65 lncRNAs screening vs human and mouse genomes
Strategy: PipeR one2many homolog assignment
Template:
genes                                        3845
transcripts                                  5669
exons                                       18353
sum of mature transcript length (nt)      7279679
real coverage (nt)                        6091050
non overlapping loci                         2790
PipeR Parameters:
Blast
- Freyhult parametrization
- Lower case masking
- Low complexity masking
Exonerate
- est2genome model
- 70% coverage required
- seed extension 2X
(the span of the genomic size of the query on both sides)
Ensembl.v65 vs human genome
• mapped 1187 transcripts out of 5669
• many lncRNAs found in multiple copies
(lncRNA families)
- found 13193 homologs
corresponding to 46770 exons
• Annotations of discovered homologs
are readily available
Ensembl.v65 vs mouse genome
• mapped 5622 transcripts out of 5669
• many lncRNAs found in multiple copies
(lncRNA families)
- found 41005 homologs
corresponding to 121515 exons
• Annotations of discovered homologs
are readily available
Mouse/Human
Exon Number Conservation
• Difference between the number of exons in the mouse
transcripts and in the human homologs
• “0” means that the exon number is the same
• Negative bins indicate human homologs having more
exons than the mouse query
• 481 Ensemblv65 transcripts find at least 1 homolog in
human with the same exon number
[Histogram regions: mouse < human (negative bins), mouse > human (positive bins)]
Homolog repeat coverage
• No peak of homolog predictions fully covered by repeats is observed
Ensembl.v65 and GENCODEv10 repeat coverage
• Input lncRNA datasets have similar
repeat distributions
Mapping statistics

                                HUMAN    MOUSE
ensV65 mapped genes               879     3815
ensV65 mapped transcripts        1187     5622
Total homologs                  13193    41005

Homologs whose exons overlap protein coding exons (same strand): 3642
Homologs whose exons overlap protein coding exons (same strand): 10086
Homologs whose exons do not overlap any gencode v10 element (same strand): 6085
Homologs with conserved number of exons: 4925
Part 3: GENCODE v10 lncRNA coding potential check
Strategies:
1) GeneId ORF score comparison between mRNAs and lncRNAs
2) BlastX against human proteins (ensembl 65)
3) Overlap with protein coding gene exon annotations (gencodeV10)
4) PipeR filtering routines
1) ORF scores as returned by GeneID
2) blastX against human proteins (Human Ensembl 65 protein set) indicates that 1202 GENCODE v10 lncRNAs match proteins
Parameters: seg low complexity filtering, repeat filtering, evalue 10e-10, search on the plus strand only.
3) Overlap with protein coding gene exon annotations
- Checked the overlap between GENCODE v10 lncRNA exons and GENCODE v10 protein coding exons.
- Found 846 lncRNAs having at least one exon overlapping a protein coding gene exon
[Figures: Example 1, Example 2]
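The overlap check can be sketched as a brute-force comparison of exon intervals. Exons are assumed to be hypothetical (chromosome, strand, start, end) tuples with 1-based inclusive coordinates, and the `same_strand` switch is an assumption added for illustration. A genome-scale run would normally use sorted intervals or an interval index instead of the nested loop shown here.

```python
def intervals_overlap(a_start, a_end, b_start, b_end):
    """True if two inclusive intervals share at least one base."""
    return a_start <= b_end and b_start <= a_end


def overlapping_lncrnas(lnc_exons, coding_exons, same_strand=True):
    """Return ids of lncRNAs with at least one exon overlapping a coding exon.

    lnc_exons: dict mapping transcript id to a list of exon tuples.
    coding_exons: list of exon tuples from protein coding genes.
    """
    hits = set()
    for transcript_id, exons in lnc_exons.items():
        for chrom, strand, start, end in exons:
            if any(
                chrom == c_chrom
                and (not same_strand or strand == c_strand)
                and intervals_overlap(start, end, c_start, c_end)
                for c_chrom, c_strand, c_start, c_end in coding_exons
            ):
                hits.add(transcript_id)
                break  # one overlapping exon is enough for this transcript
    return hits


lnc_exons = {
    "lncA": [("chr1", "+", 100, 200), ("chr1", "+", 500, 600)],
    "lncB": [("chr1", "-", 150, 250)],
    "lncC": [("chr2", "+", 10, 50)],
}
coding_exons = [("chr1", "+", 180, 300), ("chr2", "-", 400, 500)]
print(sorted(overlapping_lncrnas(lnc_exons, coding_exons)))
```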
4) Extensive filtering
7813 GENCODE v10 transcripts passed *ALL* PipeR filtering routines
Filtering rules:
- overlap with protein coding exons
- GeneID ORF score similar to those of mRNAs
- blastX to uniprot database (50% redundancy)
- blastX to nr database
- rpsBlast to pfam domain families
- blast against Rfam