UCSC “Known” Genes Version 3 Take 10


Displaying associations,
improving alignments and
gene sets at UCSC
Jim Kent and the UCSC Genome Bioinformatics Group
Wellcome Trust Case Control Consortium rheumatoid arthritis data
Sort Genes to see candidates
Wellcome Trust Case Control Consortium data for rheumatoid arthritis,
type 1 diabetes and bipolar disorder. National Institute
of Mental Health bipolar disorder data in US and German
populations (different scale).
In the long term we hope to import data from GAIN
and dbGAP and other sources as well.
28-way multiple alignment
Still based on Penn State/UCSC blastz/chain/net/multiz
pipeline.
Have added “syntenic” filtering for high coverage genomes
and reciprocal-best filtering for 2x genomes to reduce
artifacts from paralogs.
PhyloP vs. PhastCons
Existing conservation track uses PhastCons algorithm,
which computes probability that a region is conserved. As
more species are added this converges to 0 or 1.
PhyloP track instead shows the degree of conservation of each base.
UCSC Genes Goals
• Include noncoding as well as coding genes
• Increase sensitivity of gene set in general.
• Increase coverage of alternative splice
forms (but not too much).
• Apply comparative genomics to protein
(CDS) prediction.
• Create permanent accessions for
transcripts.
Make graph
Snap soft ends to hard ends within 6 bp
Extend soft ends to hard ends
Consensus of soft ends weighted 3/4 of way towards long
Weight edges by number of transcripts that make them
[Figure: splicing graph with edges labeled by weight]
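The snapping and edge-weighting steps can be sketched roughly as follows. This is only an illustration, not the UCSC pipeline code: the function names, the representation of ends as integer genomic positions, and the representation of a transcript as a list of boundary positions are all assumptions made for the example. Only the 6 bp snapping distance and the idea of weighting edges by transcript count come from the slide.

```python
from bisect import bisect_left
from collections import Counter

SNAP_DISTANCE = 6  # bp, taken from the slide


def snap_soft_ends(soft_ends, hard_ends, max_dist=SNAP_DISTANCE):
    """Move each soft end to the nearest hard end within max_dist bp.

    Soft ends with no hard end close enough are returned unchanged.
    (Hypothetical helper; positions are plain integers.)
    """
    hard_sorted = sorted(set(hard_ends))
    snapped = []
    for pos in soft_ends:
        i = bisect_left(hard_sorted, pos)
        # Candidates: the hard end just below pos and the one at or above it.
        candidates = hard_sorted[max(i - 1, 0):i + 1]
        best = min(candidates, key=lambda h: abs(h - pos), default=None)
        if best is not None and abs(best - pos) <= max_dist:
            snapped.append(best)
        else:
            snapped.append(pos)
    return snapped


def edge_weights(transcripts):
    """Count how many transcripts contain each edge.

    Each transcript is a list of boundary positions; an edge is a pair of
    consecutive boundaries. (Assumed representation for this sketch.)
    """
    weights = Counter()
    for boundaries in transcripts:
        for edge in zip(boundaries, boundaries[1:]):
            weights[edge] += 1
    return weights
```

For example, with hard ends at 100 and 200, a soft end at 103 snaps to 100, while a soft end at 190 is 10 bp away from the nearest hard end and stays where it is.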
Make graphs from various other sources:
• Exoniphy
• ESTs
• Mouse splicing
Merge in weights from other graphs.
[Figure: splicing graph with merged edge weights]
Initial transcripts (ordered by exon count)
[Figure: splicing graph with merged edge weights and initial transcripts labeled A, B, D, C, E]
Walk graph to get nonredundant transcripts, starting with
first transcript and continuing until all edges in graph of
weight above a threshold are emitted.
A: >= 3
B: >= 2
DONE
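A simplified sketch of this walk is shown below. It is a greedy approximation written for illustration, not the actual UCSC implementation: the function name, the dictionary inputs, and the rule that a transcript using any edge below the threshold is skipped are all assumptions. Transcripts are visited in the order given, which the caller is assumed to have sorted by exon count.

```python
def walk_transcripts(transcripts, edge_weight, threshold):
    """Greedily emit transcripts until every well-supported edge is covered.

    transcripts: dict mapping name -> list of edges, in priority order
                 (assumed to be pre-sorted by exon count).
    edge_weight: dict mapping edge -> total weight.
    threshold:   minimum weight for an edge to count as supported.
    Returns the names of the emitted transcripts, in order.
    """
    supported = {e for e, w in edge_weight.items() if w >= threshold}
    covered = set()
    emitted = []
    for name, edges in transcripts.items():
        if covered >= supported:
            break  # every supported edge has been emitted
        edge_set = set(edges)
        # Assumption: skip transcripts that rely on a weakly supported edge.
        if not edge_set <= supported:
            continue
        # Emit only if the transcript adds a supported edge not yet covered.
        if edge_set - covered:
            emitted.append(name)
            covered |= edge_set
    return emitted


if __name__ == "__main__":
    transcripts = {
        "A": ["e1", "e2", "e3"],
        "B": ["e1", "e4"],
        "C": ["e1", "e2"],
        "D": ["e5"],
    }
    weights = {"e1": 5, "e2": 4, "e3": 3, "e4": 3, "e5": 1}
    print(walk_transcripts(transcripts, weights, threshold=3))
```

In this toy input, C adds no new edge and D relies on an edge below the threshold, so neither is needed once A and B have covered every supported edge.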
Evidence type and weights
Evidence type                                            Weight
RefSeq RNA                                               100
Other GenBank RNA                                        2
GenBank spliced EST graph, edges from at least 2 ESTs    1
Orthologous splicing graph in mouse, mapped to human     1
Exoniphy exon predictions                                1
Minimum total weight of 3 for spliced transcripts, 4 for unspliced.
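The weights and thresholds can be combined as in the following sketch. The weights and the minimum totals come from the slide; how they combine is an assumption made here for illustration: each supporting RNA is assumed to add its weight once per sequence, and each graph-type source (EST graph, mouse graph, Exoniphy) is assumed to add its weight once if it supports the transcript. All names are hypothetical.

```python
# Weights from the slide; the keys are hypothetical labels.
EVIDENCE_WEIGHTS = {
    "refseq_rna": 100,
    "genbank_rna": 2,
    "est_graph": 1,
    "mouse_ortho_splice": 1,
    "exoniphy": 1,
}
MIN_WEIGHT_SPLICED = 3
MIN_WEIGHT_UNSPLICED = 4


def total_weight(n_refseq=0, n_genbank=0, in_est_graph=False,
                 in_mouse_graph=False, in_exoniphy=False):
    """Sum evidence weights (assumed additive; see the note above)."""
    return (EVIDENCE_WEIGHTS["refseq_rna"] * n_refseq
            + EVIDENCE_WEIGHTS["genbank_rna"] * n_genbank
            + EVIDENCE_WEIGHTS["est_graph"] * int(in_est_graph)
            + EVIDENCE_WEIGHTS["mouse_ortho_splice"] * int(in_mouse_graph)
            + EVIDENCE_WEIGHTS["exoniphy"] * int(in_exoniphy))


def passes_threshold(weight, spliced):
    """Apply the minimum total weight: 3 if spliced, 4 if unspliced."""
    minimum = MIN_WEIGHT_SPLICED if spliced else MIN_WEIGHT_UNSPLICED
    return weight >= minimum
```

Under these assumptions, a single GenBank RNA backed by the EST graph totals 3, which is enough for a spliced transcript but not for an unspliced one, while any RefSeq RNA passes on its own.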
Assigning Coding Regions
• Take top scoring ORF using a program,
txCdsPredict, that considers:
– Length of ORF
– Kozak consensus sequence
– Nonsense mediated decay
– Upstream open reading frames
– Length of orthologous ORF in other species.
• txCdsPredict agrees with RefSeq
reviewed ~96% of the time.
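To make the first criterion concrete, here is a minimal ORF scan that picks the longest ATG-to-stop open reading frame on the given strand. It is not txCdsPredict and does not reproduce its scoring: the Kozak consensus, nonsense mediated decay, upstream ORFs and orthologous ORF length are not modeled. The function name and return format are assumptions for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(rna):
    """Return (start, end) of the longest ATG..stop ORF, or None.

    Coordinates are 0-based, half-open, on the input strand only.
    Only complete ORFs (with an in-frame stop codon) are considered.
    """
    seq = rna.upper().replace("U", "T")
    best = None
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first in-frame start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                end = i + 3  # include the stop codon
                if best is None or end - start > best[1] - best[0]:
                    best = (start, end)
                start = None
    return best


if __name__ == "__main__":
    seq = "CCATGAAATTTTAGGG"
    orf = longest_orf(seq)
    print(orf, seq[orf[0]:orf[1]])
```

A real predictor would score every candidate ORF on several features and take the top-scoring one rather than simply the longest.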
Gene Statistics
class        UCSC     Ensembl   RefSeq
coding       20433    22934     18992
antisense    643      109       19
noncoding    5228     9034      590
Transcript Statistics
class        UCSC     Ensembl   RefSeq
coding       45475    43569     25187
nearCoding   4469     112       14
antisense    731      109       19
noncoding    6047     9045      592
[Figure legend: Coding, Non-coding, Near-coding]
• 38% of UCSC noncoding genes are transcripts < 200 bp,
primarily of known types such as snoRNAs,
piRNAs, miRNAs etc.
• 62% are long, with a size distribution much like coding.
• (For Ensembl only 21% of noncoding are long)
Long noncoding genes have
lower expression levels
[Figure: absolute expression values from Affymetrix human exon arrays, coding vs. noncoding]
Other characteristics of
long noncoding
• Long noncoding have lower tissue specificity.
• Poor conservation. Average phastCons score is
0.09 for long noncoding vs 0.73 for coding.
• BLAST analysis suggests 20% of long noncoding
may be transcribed pseudogenes.
• Conclusion - long noncoding but transcribed
genes are slippery. Most are likely nonfunctional.
– Xist is poorly conserved overall but has some peaks and
is reasonably well expressed.
Acknowledgements
• Programming and analysis:
– Galt Barber - Genome Graphs extensions
– Webb Miller Lab - Alignments
– Adam Siepel - Evolutionary analysis
– Dorota Retelska - UCSC noncoding genes
• Data:
– Sanger, Wash U, Broad, JGI, NCBI, EBI, Affy
– Contributors to scientific databases worldwide
• Funding:
– NHGRI, NCI, HHMI, State of California
The End
UCSC Genes Overall Pipeline
• Start with genomic/RNA alignments
• Remove antibody fragments
• Clean alignments and project to genome
• Cluster into splicing graph
• Add EST, Exoniphy, OrthoSplice info.
• Walk unique well supported transcripts out of
graph.
• Assign coding regions (CDS) to transcripts.
• Classify into coding, antisense, noncoding.
• Assign accessions.
Classifying transcripts
• Coding: CDS survives trimming stage
• Near-coding: overlap coding by at
least 20 bases on same strand
• Near-coding junk: near-coding
transcripts that show signs of
incomplete splicing. These are
removed.
• Antisense: overlap coding by at
least 20 bases on opposite strand
• Noncoding: other transcripts
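The classification rules can be sketched as a small function. This is a simplified illustration under stated assumptions, not the pipeline's code: transcripts are reduced to `(start, end, strand)` genomic spans, so overlap is measured on whole spans rather than exon by exon; the near-coding junk filter for incomplete splicing is omitted; and checking near-coding before antisense is an assumed precedence. Only the four class names and the 20-base overlap come from the slide.

```python
MIN_OVERLAP = 20  # bases, from the slide


def overlap(a_start, a_end, b_start, b_end):
    """Number of bases shared by two half-open intervals."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def classify(tx, coding_txs, has_cds=False, min_overlap=MIN_OVERLAP):
    """Classify a transcript as coding, nearCoding, antisense or noncoding.

    tx:         (start, end, strand) span of the transcript (assumed format).
    coding_txs: list of (start, end, strand) spans of coding transcripts.
    has_cds:    True if the transcript's CDS survived the trimming stage.
    """
    if has_cds:
        return "coding"
    start, end, strand = tx
    same_strand = any(
        overlap(start, end, c_start, c_end) >= min_overlap
        for c_start, c_end, c_strand in coding_txs
        if c_strand == strand
    )
    if same_strand:
        return "nearCoding"
    opposite_strand = any(
        overlap(start, end, c_start, c_end) >= min_overlap
        for c_start, c_end, c_strand in coding_txs
        if c_strand != strand
    )
    if opposite_strand:
        return "antisense"
    return "noncoding"
```

For example, a transcript with no CDS that shares 50 bases with a coding transcript on the opposite strand is classified as antisense, while one sharing only 10 bases falls through to noncoding.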