Transcript kg3_9
UCSC Known Genes
Version 3
Take 9
Known Gene History
• Initially based on Genie predictions
constrained by BLAT mRNA alignments.
– David Kulp got busy at Affy.
• Switched to RefSeq
– Jim got paranoid Riken RNAs would take over
• Fan built KG 1
– Mark got annoyed at low quality predictions
• Fan & Mark built KG 2
– Jim got annoyed at missing genes
• KG 3
– The perfect set … until KG 4.
Overall Pipeline
• Get alignments etc. from database
• Remove antibody fragments
• Clean alignments, project to genome
• Cluster into splicing graph
• Add EST, Exoniphy, OrthoSplice info.
• Walk unique transcripts out of graph.
• Assign coding regions (CDS) to transcripts.
• Classify into coding, antisense, noncoding.
• Remove weak transcripts.
• Assign accessions.
• Build gene-centric database tables.
Genbank & Alignment Issues
• Using global instead of local near-best
alignment, also higher stringency.
• Including all Genbank RNA, not just mRNA
• These changes not yet reflected in
Genbank mRNA/RefSeq tracks.
• Collect data such as selenocysteine
substitutions and alternative start codons
from Genbank. These data are in the .ra
files but not the SQL database.
Removing Antibody Var Regions
• Chromosomes 2,14,22 contain antibody regions.
• Thousands of transcripts for these in Genbank.
• Gaps are from genomic rearrangements, not
splicing. Millions of possibilities.
• Identify regions by:
– Searching for words like ‘immunoglobulin’ ‘variable’ to
make initial set of Ab fragments.
– Treat anything that overlaps these as Ab fragment too.
– Cluster together putative Ab fragments.
– Take 4 largest clusters as the 4 variable regions. (One is
just a pseudogene of a real variable region.)
• Remove all alignments in Ab clusters.
• Replace with a single noncoding gene for each
cluster near end of gene build.
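A minimal sketch of the identification steps above, in Python. The record layout (dicts with chrom, start, end, description), the keyword list beyond the two words on the slide, and reading "largest" as "most fragments" are all assumptions made for illustration, not the actual pipeline code.

```python
import re

# Assumed keyword list; the slide only names 'immunoglobulin' and 'variable'.
AB_WORDS = re.compile(r"immunoglobulin|variable", re.IGNORECASE)

def overlaps(a, b):
    """True if two (chrom, start, end) half-open intervals overlap."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

def find_ab_clusters(alignments, n_regions=4):
    """Return the n_regions largest clusters as (chrom, start, end, count)."""
    spans = [(a["chrom"], a["start"], a["end"]) for a in alignments]
    # Initial set: alignments whose description mentions an Ab keyword.
    seeds = [s for s, a in zip(spans, alignments)
             if AB_WORDS.search(a["description"])]
    # Anything overlapping a seed is treated as an Ab fragment too.
    frags = [s for s in spans if any(overlaps(s, seed) for seed in seeds)]
    # Cluster by merging overlapping fragments in sorted order.
    clusters = []
    for chrom, start, end in sorted(frags):
        if clusters and clusters[-1][0] == chrom and start < clusters[-1][2]:
            c = clusters[-1]
            clusters[-1] = (chrom, c[1], max(c[2], end), c[3] + 1)
        else:
            clusters.append((chrom, start, end, 1))
    # "Largest" is taken here to mean most fragments (an assumption).
    return sorted(clusters, key=lambda c: c[3], reverse=True)[:n_regions]
```

Alignments falling inside the returned clusters would then be dropped, with one noncoding placeholder gene added per cluster later in the build.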
Chr22 Ab Region
(lambda light chain)
Cleaning, projecting alignments
• BLAT sometimes leaves messy gappy ends.
• New heuristic:
– For gaps of 6 bases or fewer on both mRNA and genome,
just ignore gap, filling in with genome if necessary.
– Try to turn other gaps into introns if they are not already
by wiggling one base on either side of gap.
– Break up alignments at remaining gaps that are not
intronic. Intronic gaps are at least 16 bases, and have
gt/ag or gc/ag ends.
– After break up throw away any pieces less than 18
bases long.
• For refSeq mRNA only, join pieces back together
after breaking up. Pieces of other mRNAs can be
bridged by other transcripts (which may not suffer
the same problems from polymorphism/error).
• Consider applying similar heuristic in mRNA track.
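The gap heuristic can be sketched roughly as below. This is an illustration under stated assumptions, not the real code: it handles forward-strand introns only, takes the genome as a plain string with half-open gap coordinates, and assumes the aligned bases next to the gap were exact matches, so a one-base wiggle is valid only when the base moved across the gap is identical on both sides. The function names are made up for this sketch.

```python
SMALL_GAP = 6    # gaps this size or smaller on both sides are ignored
MIN_INTRON = 16  # minimum intron size from the slide
MIN_PIECE = 18   # pieces shorter than this are thrown away

def is_intron(genome, start, end):
    """Genomic gap [start, end) looks intronic: long enough, gt/ag or gc/ag."""
    if end - start < MIN_INTRON:
        return False
    donor = genome[start:start + 2].lower()
    acceptor = genome[end - 2:end].lower()
    return donor in ("gt", "gc") and acceptor == "ag"

def classify_gap(genome, start, end, mrna_gap):
    """Return (action, start, end) where action is ignore, intron or break."""
    if mrna_gap <= SMALL_GAP and end - start <= SMALL_GAP:
        return ("ignore", start, end)
    if mrna_gap == 0:
        if is_intron(genome, start, end):
            return ("intron", start, end)
        # Wiggle one base left: the base leaving the left exon must equal
        # the base joining the right exon.
        if (start > 0 and genome[start - 1] == genome[end - 1]
                and is_intron(genome, start - 1, end - 1)):
            return ("intron", start - 1, end - 1)
        # Wiggle one base right, same idea in the other direction.
        if (end < len(genome) and genome[start] == genome[end]
                and is_intron(genome, start + 1, end + 1)):
            return ("intron", start + 1, end + 1)
    return ("break", start, end)

def drop_short(pieces):
    """After breaking up, keep only pieces of at least MIN_PIECE bases."""
    return [(s, e) for s, e in pieces if e - s >= MIN_PIECE]
```

For example, a gap reported one base to the right of a true gt...ag intron is shifted back onto it, while a long gap without splice-site ends causes the alignment to be broken there.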
Cleaning and projecting
Cluster into splicing graph
• Make graph where vertices are
begin/ends of exons, edges are
exons and introns.
• Multiple input transcripts can share
vertices and edges.
• Went over this in some detail a few
weeks back…
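As a reminder of the data structure, here is a small Python sketch. The input layout (a dict from transcript name to a sorted exon list on one chromosome and strand) and the function name are assumptions for illustration only.

```python
from collections import defaultdict

def build_splicing_graph(transcripts):
    """Build a splicing graph from aligned transcripts.

    transcripts: dict of name -> sorted list of (exon_start, exon_end).
    Returns (vertices, edges): vertices is the set of exon boundaries,
    edges maps (start, end, kind) to the set of supporting transcripts,
    where kind is "exon" or "intron".
    """
    vertices = set()
    edges = defaultdict(set)
    for name, exons in transcripts.items():
        for i, (start, end) in enumerate(exons):
            vertices.update((start, end))
            edges[(start, end, "exon")].add(name)
            # An intron edge links this exon's end to the next exon's start.
            if i + 1 < len(exons):
                edges[(end, exons[i + 1][0], "intron")].add(name)
    return vertices, dict(edges)
```

Transcripts that use the same exon or intron land on the same edge key, which is how shared vertices and edges arise.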
Splicing graph and txWalk
Adding Evidence to Graph
• Initial evidence for each edge comes
from mRNAs.
• Add EST evidence only if edge is supported by
at least 2 ESTs. (Single EST likely is same clone
as single RNA…) Just use spliced ESTs.
• Make graph in mouse and map via
chains. Reinforce orthologous human
edges.
• Reinforce exon edges that overlap
Exoniphy predictions.
• Evidence weight: refSeq 100, each mRNA
2, est pair 1, mouse ortho 1, exoniphy 1.
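The weights combine roughly as in the sketch below. The slide does not say whether the refSeq and EST weights apply per item, so counting 100 per RefSeq and 1 per complete pair of ESTs is an assumption, as are the function and parameter names.

```python
# Weights as listed on the slide.
WEIGHTS = {"refSeq": 100, "mrna": 2, "est_pair": 1,
           "mouse_ortho": 1, "exoniphy": 1}

def edge_weight(refseq_count, mrna_count, spliced_est_count,
                has_mouse_ortho, overlaps_exoniphy):
    """Sum the evidence weight for one edge of the splicing graph."""
    weight = refseq_count * WEIGHTS["refSeq"]
    weight += mrna_count * WEIGHTS["mrna"]
    # ESTs only count in pairs; a lone EST adds nothing (assumed per pair).
    weight += (spliced_est_count // 2) * WEIGHTS["est_pair"]
    if has_mouse_ortho:
        weight += WEIGHTS["mouse_ortho"]
    if overlaps_exoniphy:
        weight += WEIGHTS["exoniphy"]
    return weight
```

Under this reading a single mRNA alone scores 2, and needs one more piece of evidence (an EST pair, a mouse ortholog, or an Exoniphy overlap) to reach 3.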
Walking graph
• Weight of 3 on an edge is good enough.
• Rank input RNA by whether refSeq, and
number of good edges they use.
• If any good edges, output a transcript
consisting of the edges used by the first
RNA.
• Output transcript based on next RNA if
the good edges it uses have not been
output in same order before.
• Continue until reach last RNA.
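A rough Python sketch of the walk. It assumes edge weights are already computed, that each RNA is a dict with a name, a refSeq flag and an ordered list of edge ids, and that "not output in same order before" means the exact sequence of good edges has not been emitted yet. The real txWalk may compare paths differently.

```python
GOOD_WEIGHT = 3  # an edge with at least this weight is "good"

def walk_transcripts(rnas, edge_weights):
    """Return a list of (rna_name, good_edges) for each output transcript."""
    def good_edges(rna):
        return tuple(e for e in rna["edges"]
                     if edge_weights.get(e, 0) >= GOOD_WEIGHT)

    # RefSeqs first, then by number of good edges used (most first).
    ranked = sorted(rnas,
                    key=lambda r: (r["is_refseq"], len(good_edges(r))),
                    reverse=True)
    seen = set()
    output = []
    for rna in ranked:
        good = good_edges(rna)
        # Skip RNAs with no good edges or a path already output.
        if good and good not in seen:
            seen.add(good)
            output.append((rna["name"], good))
    return output
```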
Evidence, Walk, AltSplice
Assigning Coding Regions
• Align UniProt and RefSeq proteins to
txWalk transcripts. Mark regions they hit
as possible CDS.
• Align Genbank/RefSeq RNAs to txWalk
transcripts, map CDS from RNA records as
possible CDS.
• Use bestorf program for another possible
CDS.
• Assign an ad-hoc score to each possible
CDS, choose highest scoring.
• More comparative genomics could really
help here someday…
CDS Mapping, Filtering
Classifying and Weeding
• The transcripts are classified into:
– Coding: CDS survives trimming stage
– Near-coding: overlap coding by at
least 20 bases on same strand
– Antisense: overlap coding by at least
20 bases on opposite strand
– Noncoding: other transcripts
• Near-coding transcripts that show
signs of incomplete splicing
(retained intron, bleeds > 100 bases
into intron) are removed.
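The classification rules can be sketched as follows. Measuring overlap as shared exonic bases against coding transcripts, and letting near-coding win when a transcript overlaps coding transcripts on both strands, are assumptions. The dict layout is hypothetical, and the incomplete-splicing filter is not shown.

```python
MIN_OVERLAP = 20  # bases of overlap needed with a coding transcript

def overlap_bases(exons_a, exons_b):
    """Total bases shared between two exon lists (half-open intervals)."""
    return sum(max(0, min(a_end, b_end) - max(a_start, b_start))
               for a_start, a_end in exons_a
               for b_start, b_end in exons_b)

def classify(tx, coding_txs):
    """Return coding, near-coding, antisense or noncoding for one transcript."""
    if tx.get("cds") is not None:
        return "coding"
    label = "noncoding"
    for c in coding_txs:
        if c["chrom"] != tx["chrom"]:
            continue
        if overlap_bases(tx["exons"], c["exons"]) < MIN_OVERLAP:
            continue
        if c["strand"] == tx["strand"]:
            return "near-coding"  # same-strand overlap takes priority
        label = "antisense"
    return label
```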
Assigning accessions
• Initial temporary identifiers of form
<chrom>.<cluster>.<tx>.<accession>, e.g.
chr22.210.5.AB209301
• Make permanent identifiers of form TX12345678.
– Find exact match in previous gene set, and
reuse previous accession.
– Find compatible match (all introns alike) in old
gene set, reuse accession, bump version.
– Make up new accession otherwise.
– Record genes in old set not in new.
• Version 7 -> version 9 mapping actually a
good test of this: 53025 exact, 4732 lost, 3736
new, 464 compatible.
• Move to UC1234567 format in v. 10?
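A sketch of the accession assignment logic, assuming transcripts are dicts with chrom, strand and an exon list, and that old transcripts also carry acc and version. Doing exact matches in a first pass, and not attempting compatible matches for single-exon transcripts (which have no introns to compare), are choices made for this illustration only.

```python
def introns(exons):
    """Intron intervals implied by a sorted exon list."""
    return tuple((exons[i][1], exons[i + 1][0])
                 for i in range(len(exons) - 1))

def assign_accessions(new_txs, old_txs, next_number):
    """Return ([(accession, version, status), ...], lost_accessions)."""
    def exact_key(t):
        return (t["chrom"], t["strand"], tuple(t["exons"]))

    def intron_key(t):
        return (t["chrom"], t["strand"], introns(t["exons"]))

    old_exact = {exact_key(o): o for o in old_txs}
    old_compat = {intron_key(o): o for o in old_txs if len(o["exons"]) > 1}
    used, results = set(), {}
    # Pass 1: exact matches reuse accession and version unchanged.
    for i, t in enumerate(new_txs):
        o = old_exact.get(exact_key(t))
        if o is not None and o["acc"] not in used:
            used.add(o["acc"])
            results[i] = (o["acc"], o["version"], "exact")
    # Pass 2: compatible matches bump the version; the rest get new IDs.
    for i, t in enumerate(new_txs):
        if i in results:
            continue
        o = old_compat.get(intron_key(t)) if len(t["exons"]) > 1 else None
        if o is not None and o["acc"] not in used:
            used.add(o["acc"])
            results[i] = (o["acc"], o["version"] + 1, "compatible")
        else:
            results[i] = ("TX%08d" % next_number, 1, "new")
            next_number += 1
    # Old accessions never reused are recorded as lost.
    lost = [o["acc"] for o in old_txs if o["acc"] not in used]
    return [results[i] for i in range(len(new_txs))], lost
```

Counting the status values over a full run would give a breakdown in the same shape as the exact/lost/new/compatible numbers quoted above.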
Building gene-centric tables
• mmBlastTab, rnBlastTab etc. homolog tables.
Blastp best plus syntenic weeding.
• kgXref and knownToXxx tables to relate gene to
other databases and tables.
• kgAlias table to help search on gene names.
• gnfAtlas2Distance to measure expression
similarity between genes for Gene Sorter. Plus 3 other
expression distance tables.
• humanVidalP2P and humanWankerP2P protein
network distance tables.
• knownCanonical/knownIsoform tables to help
people selectively view alt-splicing.
• pbXXX tables for proteome browser.
• In all about 10 hours of compute and
indexing.
The Plan
• Next week
– test preliminary integration on hg18a
– resolve issues with proteome browser
– Tinker on take 10, maybe take 11
• Week after
– Integration of final gene build into hg18a
– Move hg18.knownGenes to hg18.knownGenesOld
– Swap hg18a tables into hg18.
• Coming months
– Continue to improve gene build.
– Add new information from build into details pages.
– Allow user filtering of which genes are shown
– Allow selection by names as well as IDs in table
browser.
– Present at Cold Spring Harbor. Write up paper.