The (new) Table Browser

Talk Outline
• Table Browser History
• New Table Browser Features
• New Table Browser Implementation
– all.joiner & .as files
– Overall control and data flow
– Joining and intersection modules
• Limits and future directions
Table Browser History
• Goal - annotations over a particular region of
genome in text rather than graphic format
• Krish - did first successful implementation; separated tables into positional and non-positional,
merged chrN_ tables, split off hgFind.
• Angie - added sequence output, filters,
intersections, and many help pages.
• These versions of the table browser were called
hgText
Why a New Table Browser
• hgText is powerful, but much of the power
is not obvious in the first page.
• In hgText the association between tracks
and tables was not clear.
• No way to join fields across related tables.
New Table Browser
• Flip to demoing new table browser online.
– Show overall controls
– Demo getting genome position, common name, and
review status for refSeq on ENCODE.
– Demo getting alt-splice variants with knownCanonical
and knownIsoforms
– Demo custom track created from filtered cpgIslands
(>= 500 bases and >= 0.9 Exp/Obs)
– Intersect custom fat cpg track with most conserved,
requiring 75% overlap, output as custom track
– Intersect conserved fat cpg with exoniphy, requiring <=
5% overlap, output as hyperlink (custom track output
crashes!)
New Table Browser Implementation
• Built using:
– AutoSql .as files to describe table fields
– all.joiner file to describe table relationships
– .bed based intersection and sequence output
code from old table browser
– About 8000 lines of new C code in 19 .c files in
src/hg/hgTables
Data Flow
• Each region (piece of a chromosome) processed
separately
• Filter is turned into a SQL where clause
• Field-oriented output, especially from selected tables, is
handled by one branch of code.
– SQL rows -> joining routines -> output
• GFF, Custom Track, Sequence, Hyperlink, and
Summary Stats outputs handled by a branch of
code that turns things into BED format internally:
– SQL rows -> BED -> intersecting -> output
• Ultimately need to merge fields & BEDs so that joining and
intersecting can happen at the same time.
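
As an illustration of the "filter is turned into a SQL where clause" step, here is a minimal Python sketch (the real hgTables code is C and is not shown in this talk). The function name build_where, the (field, operator, value) filter format, and the example field names length and obsExp are assumptions made for this example; chrom, chromStart, and chromEnd are the usual BED-style column names.

```python
def build_where(region, filters):
    """Build a parameterized WHERE clause for one region plus field filters.

    region  -- (chrom, start, end), half-open coordinates
    filters -- list of (field, operator, value) tuples (hypothetical format)
    """
    allowed_ops = {"=", "!=", "<", "<=", ">", ">="}
    chrom, start, end = region
    # An item overlaps the region if it starts before the region ends
    # and ends after the region starts.
    clauses = ["chrom = %s", "chromStart < %s", "chromEnd > %s"]
    params = [chrom, end, start]
    for field, op, value in filters:
        if op not in allowed_ops:
            raise ValueError("unsupported operator: " + op)
        if not field.isidentifier():
            raise ValueError("bad field name: " + field)
        # Values are passed separately so the database driver can quote them.
        clauses.append(field + " " + op + " %s")
        params.append(value)
    return " AND ".join(clauses), params


# Example: the "fat CpG island" filter from the demo, applied to one region.
where, params = build_where(
    ("chr1", 1000, 2000),
    [("length", ">=", 500), ("obsExp", ">=", 0.9)],
)
```

Because each region is processed separately, a clause like this would be generated once per region, with only the region parameters changing.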
Joining Code
• Use all.joiner to find the route from primary table
to other tables in join.
• Construct SQL query for each table that applies
table filters and region and includes key fields
even if not part of final output.
• Construct a row object (array of lists) for each row
returned on primary table.
• Construct a hash keyed by joining field of primary
table, with row objects as values.
• Execute SQL query for next table, and when keys
match add info to row object.
• Repeat with third and subsequent tables if any.
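
The steps above amount to a hash join in which every output field holds a list. A minimal Python sketch follows, assuming rows arrive as dictionaries; the function name join_tables and the sample field names are hypothetical, and the actual implementation is C code in src/hg/hgTables.

```python
from collections import defaultdict


def join_tables(primary_rows, primary_key, secondary_rows, secondary_key,
                extra_fields):
    """Hash-join secondary_rows onto primary_rows.

    Each row object maps a field name to a list of values, so several
    matching secondary rows extend a list instead of multiplying rows.
    """
    row_objects = []
    by_key = defaultdict(list)  # join-key value -> row objects
    for row in primary_rows:
        obj = {field: [value] for field, value in row.items()}
        for field in extra_fields:
            obj[field] = []  # filled in when secondary rows match
        row_objects.append(obj)
        by_key[row[primary_key]].append(obj)

    # Scan the next table once; attach its fields wherever the keys match.
    for row in secondary_rows:
        for obj in by_key.get(row[secondary_key], []):
            for field in extra_fields:
                obj[field].append(row[field])
    return row_objects


# Example with made-up data: one cluster has two isoforms.
clusters = [{"id": "c1", "chrom": "chr1"}, {"id": "c2", "chrom": "chr2"}]
isoforms = [
    {"cluster": "c1", "transcript": "t1"},
    {"cluster": "c1", "transcript": "t2"},
    {"cluster": "c3", "transcript": "t9"},  # no matching primary row
]
joined = join_tables(clusters, "id", isoforms, "cluster", ["transcript"])
```

For a third table, the same scan would be repeated with a hash keyed on whichever field all.joiner names for that link.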
Limits/Features of Joining Code
• Unless a filter is applied, non-positional tables
will be scanned completely. This takes 3 minutes
for gbCdnaInfo. (Hint, add filter type=mRNA)
• Joining code only applied to field oriented output.
• Will handle joins across split tables.
• Can chop off prefixes and suffixes on a key field
before joining if specified in all.joiner. (Needed
for chopping off version number in some Ensembl
tables for instance)
• Avoids combinatorial explosion of output rows
by allowing fields to contain lists.
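
To make the key-chopping point concrete, here is a small Python sketch of normalizing a key before it is used in a join. The function name chop_key, its parameters, and the exact chopping rules are assumptions for illustration only; they are not the all.joiner syntax.

```python
def chop_key(key, prefix="", suffix_delim=""):
    """Normalize a join key (hypothetical rules, for illustration).

    prefix       -- removed from the start of the key if present
    suffix_delim -- if present in the key, it and everything after it is cut
    """
    if prefix and key.startswith(prefix):
        key = key[len(prefix):]
    if suffix_delim:
        cut = key.find(suffix_delim)
        if cut >= 0:
            key = key[:cut]
    return key


# Example: drop a ".version" suffix so a versioned accession matches
# an unversioned key in another table.
plain = chop_key("ENST00000000233.5", suffix_delim=".")
```

Both sides of a join would pass through the same normalization before being hashed or looked up.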
Intersecting Code
• Primarily inherited from hgText.
• Uses hTableInfo (call in hg/lib/hdb.c) which
reports which fields in database store
chromosome, start, end, etc.
• Analyses hTableInfo to figure out how many
fields in corresponding BED structure, and how to
query database and massage output to get a BED.
• Converts second table in intersection into a
bitmap.
• Counts up number of bases in bitmap that intersect
each bed item in first table.
• (For pure bitwise operations converts first table to
bitmap too.)
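
The bitmap approach described above can be sketched in a few lines of Python. This is only an illustration, not the hgText/hgTables C code: the function names are hypothetical, items are plain (start, end) pairs on one chromosome, and a bytearray with one byte per base stands in for a packed bitmap.

```python
def make_bitmap(items, chrom_size):
    """Mark every base covered by any (start, end) item in the second table."""
    bitmap = bytearray(chrom_size)  # one byte per base, all zero
    for start, end in items:
        start, end = max(0, start), min(end, chrom_size)
        if end > start:
            bitmap[start:end] = b"\x01" * (end - start)
    return bitmap


def covered_bases(items, bitmap):
    """For each first-table item, count its bases that are set in the bitmap."""
    return [sum(bitmap[start:end]) for start, end in items]


def filter_by_overlap(items, bitmap, min_fraction):
    """Keep first-table items whose covered fraction is at least min_fraction."""
    kept = []
    for start, end in items:
        size = end - start
        if size > 0 and sum(bitmap[start:end]) / size >= min_fraction:
            kept.append((start, end))
    return kept


# Example: second table covers [10, 30); require 75% overlap.
bitmap = make_bitmap([(10, 20), (15, 30)], 100)
kept = filter_by_overlap([(10, 30), (25, 45)], bitmap, 0.75)
```

Overlapping items in the second table are merged for free, since setting an already-set base changes nothing; this is also why per-item information from the second table is lost.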
Limits and Features of Intersections
• Not applied to field or MAF output.
• Information is lost in converting to BED.
• Does allow intersection code for sequence,
GFF, custom track, BED, statistics, and
hyperlinks output to go through same path.
Future Directions
• Make a combined BED/Row structure to bring
together intersections and joining.
• Polish sequence output in some places.
• Get .as file info for all tables.
• Encourage people to pay a little more attention to
database concerns as well as genome browser
concerns when designing tables.
• See if we can phase out split tables by tuning MySQL
aggressively.