UCSC Known Genes (by Jim Kent)

UCSC Known Genes
Version 3
Take 10
Overall Pipeline
• Get alignments etc. from database
• Remove antibody fragments
• Clean alignments, project to genome
• Cluster into splicing graph
• Add EST, Exoniphy, OrthoSplice info.
• Walk unique transcripts out of graph.
• Assign coding regions (CDS) to transcripts.
• Classify into coding, antisense, noncoding.
• Remove weak transcripts.
• Assign accessions.
Removing Antibody Var Regions
• Chromosomes 2,14,22 contain antibody regions.
• Thousands of transcripts for these in Genbank.
• Gaps are from genomic rearrangements, not
splicing. Millions of possibilities.
• Identify regions by:
– Searching for words like ‘immunoglobulin’ ‘variable’ to
make initial set of Ab fragments.
– Treat anything that overlaps these as Ab fragment too.
– Cluster together putative Ab fragments.
– Take 4 largest clusters as the 4 variable regions. (One is
just a pseudogene of a real variable region.)
• Remove all alignments in Ab clusters.
• Replace with a single noncoding gene for each
cluster near end of gene build.
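The identification steps above can be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual code: the `Aln` record, the `find_ab_clusters` function, the single overlap pass, and the choice to measure cluster size by number of member alignments are all assumptions made for the example.

```python
from dataclasses import dataclass

# Example keywords taken from the slide; the real word list is not given.
AB_WORDS = ("immunoglobulin", "variable")

@dataclass(frozen=True)
class Aln:
    """Hypothetical alignment record: genomic span plus GenBank description."""
    name: str
    chrom: str
    start: int
    end: int
    description: str

def overlaps(a, b):
    """True if two alignments share at least one base (half-open coordinates)."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end

def find_ab_clusters(alns, n_clusters=4):
    # Step 1: initial set of Ab fragments from a keyword search.
    seeds = [a for a in alns
             if any(w in a.description.lower() for w in AB_WORDS)]
    seed_names = {a.name for a in seeds}
    # Step 2: anything overlapping a seed is treated as an Ab fragment too
    # (assumption: one pass only, not a transitive closure).
    extra = [a for a in alns
             if a.name not in seed_names and any(overlaps(a, s) for s in seeds)]
    # Step 3: cluster putative Ab fragments by merging overlapping spans.
    frags = sorted(seeds + extra, key=lambda a: (a.chrom, a.start))
    clusters = []
    for a in frags:
        last = clusters[-1] if clusters else None
        if last is not None and last["chrom"] == a.chrom and a.start < last["end"]:
            last["end"] = max(last["end"], a.end)
            last["members"].append(a.name)
        else:
            clusters.append({"chrom": a.chrom, "start": a.start,
                             "end": a.end, "members": [a.name]})
    # Step 4: keep the largest clusters (assumption: size = member count).
    clusters.sort(key=lambda c: len(c["members"]), reverse=True)
    return clusters[:n_clusters]
```

Every alignment named in a returned cluster would then be dropped from the build and replaced by one noncoding gene per cluster.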
Chr22 Ab Region
(lambda light chain)
Cleaning and projecting
Cluster into splicing graph
• Make graph where vertices are
begin/ends of exons, edges are
exons and introns.
• Multiple input transcripts can share
vertices and edges.
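As a rough sketch of this data structure, the code below builds such a graph from exon lists. The function name `build_splicing_graph`, the input layout, and the restriction to one chromosome and strand are assumptions for illustration only.

```python
from collections import defaultdict

def build_splicing_graph(transcripts):
    """Build a toy splicing graph.

    transcripts: dict mapping transcript name -> list of (start, end) exons,
    sorted by position, all on one chromosome and strand (an assumption).
    Returns (vertices, edges): vertices is a set of exon boundary positions,
    edges maps (from_pos, to_pos, kind) -> set of supporting transcript names.
    """
    vertices = set()
    edges = defaultdict(set)
    for name, exons in transcripts.items():
        for i, (start, end) in enumerate(exons):
            # Exon begin and end are vertices; the exon itself is an edge.
            vertices.update((start, end))
            edges[(start, end, "exon")].add(name)
            # The intron to the next exon is also an edge.
            if i + 1 < len(exons):
                next_start = exons[i + 1][0]
                edges[(end, next_start, "intron")].add(name)
    return vertices, dict(edges)

# Two transcripts that share their first exon and intron.
vertices, edges = build_splicing_graph({
    "t1": [(100, 200), (300, 400)],
    "t2": [(100, 200), (300, 450)],
})
```

Because identical boundaries map to the same vertex, the two transcripts here share the first exon edge and the intron edge, and differ only in the last exon.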
• Make graph
• Snap soft ends to hard
• Extend soft ends to hard
• Consensus of soft ends
• Walk graph to get nonredundant transcripts
Splicing graph and txWalk
Adding Evidence to Graph
• Initial evidence for each edge comes
from mRNAs.
• Add EST evidence if an edge is supported by at
least 2 ESTs. (A single EST is likely the same clone
as a single RNA…) Only spliced ESTs are used.
• Make graph in mouse and map via
chains. Reinforce orthologous human
edges.
• Reinforce exon edges that overlap
Exoniphy predictions.
• Evidence weight: refSeq 100, each mRNA
2, est pair 1, mouse ortho 1, exoniphy 1.
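The weights above could be combined per edge along these lines. The slide does not say exactly how they are summed, so this sketch makes assumptions: RefSeq support counts once, each mRNA adds 2, every full pair of spliced ESTs adds 1, and the mouse and Exoniphy reinforcements each add 1. The function and parameter names are hypothetical.

```python
def edge_weight(has_refseq=False, mrna_count=0, spliced_est_count=0,
                mouse_ortho=False, exoniphy=False):
    """Toy evidence weight for one splicing-graph edge."""
    weight = 0
    if has_refseq:
        weight += 100                # refSeq 100 (assumed to count once)
    weight += 2 * mrna_count         # each mRNA 2
    weight += spliced_est_count // 2 # est pair 1 (assumed: one point per pair)
    if mouse_ortho:
        weight += 1                  # orthologous edge in the mouse graph
    if exoniphy:
        weight += 1                  # overlaps an Exoniphy prediction
    return weight

# One mRNA plus a pair of spliced ESTs gives 2 + 1 = 3.
w = edge_weight(mrna_count=1, spliced_est_count=2)
```

Under these assumptions a lone mRNA (weight 2) needs at least one further line of evidence before its edges count for much.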
Walking graph
• Weight of 3 on an edge is good enough.
• Single exon gene edges take 4 though.
• Rank input RNA by whether refSeq, and
number of good edges they use.
• If any good edges, output a transcript
consisting of the edges used by the first
RNA.
• Output transcript based on next RNA if
the good edges it uses have not been
output in same order before.
• Continue until reach last RNA.
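A minimal sketch of this walk is shown below. It is not txWalk itself. It assumes each RNA is given as an ordered list of edge identifiers, that a single-exon RNA is one with exactly one edge, that an output transcript consists of the RNA's good edges, and that "output in same order before" means the good edges appear as a contiguous run in an earlier output. All names are hypothetical.

```python
def is_subrun(small, big):
    """True if tuple `small` appears as a contiguous run inside tuple `big`."""
    n = len(small)
    return any(big[i:i + n] == small for i in range(len(big) - n + 1))

def walk(rnas, weights):
    """rnas: list of dicts with 'name', 'is_refseq', 'edges' (ordered edge ids).
    weights: dict mapping edge id -> evidence weight.
    Returns a list of (rna_name, good_edges) for each output transcript."""
    def good_edges(rna):
        # Weight 3 is good enough, but single-exon edges need 4.
        threshold = 4 if len(rna["edges"]) == 1 else 3
        return tuple(e for e in rna["edges"] if weights.get(e, 0) >= threshold)

    # Rank by RefSeq status first, then by number of good edges used.
    ranked = sorted(rnas, key=lambda r: (r["is_refseq"], len(good_edges(r))),
                    reverse=True)
    outputs = []
    for rna in ranked:
        good = good_edges(rna)
        # Skip RNAs with no good edges or whose good edges were already output.
        if good and not any(is_subrun(good, prev) for _, prev in outputs):
            outputs.append((rna["name"], good))
    return outputs

weights = {"e1": 100, "i1": 100, "e2": 100, "e3": 2, "s": 3}
rnas = [
    {"name": "mrnaA", "is_refseq": False, "edges": ["e1", "i1", "e2"]},
    {"name": "NM_1", "is_refseq": True, "edges": ["e1", "i1", "e2"]},
    {"name": "weak", "is_refseq": False, "edges": ["e3"]},
    {"name": "single", "is_refseq": False, "edges": ["s"]},
]
result = walk(rnas, weights)
```

In this toy input only the RefSeq RNA yields a transcript: the mRNA repeats its path, and the two single-exon RNAs fall below the weight of 4.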
Evidence, Walk, AltSplice
Assigning Coding Regions
• Score each ORF as follows:
– 1 point for each base in ORF
– 50 points for initial ATG
– 100 points if ATG follows Kozak rules
• G after ATG or A/G 3 bases before
– -400 points if nonsense mediated decay
• Last intron more than 55 bases past stop codon
– -0.5 points for each base in upstream ORF
– -0.5 points each base in upstream Kozak ORF
– +1 point each base also ORF in other species
• Rhesus, mouse, dog
• Scheme agrees with RefSeq reviewed
~96% of the time.
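The scoring scheme can be written out as a small function. This is a sketch under stated assumptions, not txCdsPredict's code: positions are 0-based transcript coordinates, the Kozak bonus is added on top of the ATG bonus, and the upstream-ORF and ortholog base counts are supplied by the caller. All function and parameter names are hypothetical.

```python
def follows_kozak(seq, atg_pos):
    """Kozak rule from the slide: G right after ATG, or A/G 3 bases before."""
    after = seq[atg_pos + 3:atg_pos + 4]
    before = seq[atg_pos - 3:atg_pos - 2] if atg_pos >= 3 else ""
    return after == "G" or before in ("A", "G")

def is_nmd(stop_end, last_intron_start):
    """NMD candidate: last intron lies more than 55 bases past the stop codon."""
    return last_intron_start - stop_end > 55

def score_orf(seq, start, end, nmd=False, upstream_orf_bases=0,
              upstream_kozak_orf_bases=0, ortho_orf_bases=0):
    """Score the ORF seq[start:end] using the point values on the slide."""
    score = float(end - start)             # 1 point per base in the ORF
    if seq[start:start + 3] == "ATG":
        score += 50                        # initial ATG
        if follows_kozak(seq, start):
            score += 100                   # assumed additive to the ATG bonus
    if nmd:
        score -= 400                       # nonsense mediated decay
    score -= 0.5 * upstream_orf_bases      # bases in upstream ORFs
    score -= 0.5 * upstream_kozak_orf_bases
    score += ortho_orf_bases               # bases also ORF in other species
    return score

# ATG GCA TAA at offset 6, with A three bases before and G after the ATG.
s = score_orf("GCCACCATGGCATAA", 6, 15)
```

The ORF with the highest score among the candidates in a transcript would then be taken as its CDS.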
Comparing ORF Finders
method          same     close    in      out
Big orf         62.9%    30.4%    4.0%    2.7%
Kozak           87.2%     7.4%    2.3%    2.2%
twinOrf*        85.6%     7.5%    2.3%    1.8%
bestOrf         80.9%    14.4%    2.9%    1.9%
txCdsPredict    92.8%     4.7%    1.1%    1.3%
+ ortho         93.3%     4.4%    1.1%    1.3%
Comparison vs. RefSeq reviewed ORF annotations.
*twinOrf only makes a prediction when a homologous sequence is available. This
run used dog, so the twinOrf row adds up to only 97.2%.
CDS Mapping, Filtering
Classifying and Weeding
• The transcripts are classified into:
– Coding: CDS survives trimming stage
– Near-coding: overlap coding by at
least 20 bases on same strand
– Antisense: overlap coding by at least
20 bases on opposite strand
– Noncoding: other transcripts
• Near-coding transcripts that show
signs of incomplete splicing
(retained intron, bleeds > 100 bases
into intron) are removed.
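The classification rules can be sketched as below. The slide leaves several details open, so this example assumes overlap is measured on whole transcript spans, that near-coding takes precedence over antisense, and that transcripts are plain dicts. The names `classify` and `is_weeded` are hypothetical.

```python
def overlap_bases(a, b):
    """Number of bases shared by two half-open (start, end) intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))

def classify(tx, coding_txs, min_overlap=20):
    """tx and each coding transcript: dict with chrom, strand, start, end.
    tx also carries has_cds (True if its CDS survived the trimming stage)."""
    if tx["has_cds"]:
        return "coding"
    same = opposite = 0
    for c in coding_txs:
        if c["chrom"] != tx["chrom"]:
            continue
        n = overlap_bases((tx["start"], tx["end"]), (c["start"], c["end"]))
        if c["strand"] == tx["strand"]:
            same = max(same, n)
        else:
            opposite = max(opposite, n)
    if same >= min_overlap:
        return "nearCoding"
    if opposite >= min_overlap:
        return "antisense"
    return "noncoding"

def is_weeded(tx_class, retained_intron=False, intron_bleed_bases=0):
    """Near-coding transcripts with signs of incomplete splicing are removed."""
    return tx_class == "nearCoding" and (retained_intron or intron_bleed_bases > 100)

coding = [{"chrom": "chr1", "strand": "+", "start": 1000, "end": 2000}]
near = classify({"chrom": "chr1", "strand": "+", "start": 1900, "end": 2500,
                 "has_cds": False}, coding)
```

Here the query transcript overlaps the coding transcript by 100 bases on the same strand, so it is classed as near-coding and would be weeded only if it also showed a retained intron or a long bleed into an intron.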
Take 10 Statistics
class         genes    transcripts
coding        20433    45475
nearCoding    N/A       4469
antisense       643      731
uncoding       5228     6047
RefSeq Statistics
class         genes    transcripts
coding        18992    25187
nearCoding    N/A         14
antisense        19       19
uncoding        590      592