compbioOct2000

Sordid Details of the
Genome Browser
• Totally retro technology
• Highly portable across browsers
• Fulfills the need for speed
• Some assembly required
Retro Design Choices
• C isn’t so bad, really
– Universally available compilers
– Very fast run time
– Really nice debuggers
• CGI is portable, at least
– Works with all web browsers
– Works with all web servers
– Not too hard on host if scripts are
small and fast.
• MySQL worth every penny
– Technically, it’s free
– Fast, simple SQL database
Language Wars
episode 23812
• Problems with C:
– Char arrays aren’t nearly as nice as strings.
– Have to check return values for error codes.
– Uninitialized local and heap variables lead to
hard to isolate bugs.
• Problems with C++
– 8 stream classes, 4 string classes, and half the
time you still have char arrays for strings.
– Throw/Catch not working so well in GNU.
– Uninitialized local and heap vars still lead to
hard to isolate bugs.
– Private info ends up in huge headers.
– Er, which setX is getting called in this context?
• Problems with Java
– Microsoft plot to kill client side Java by
incompatible extensions worked all too well.
– Server side Java not quite mainstream in 1999.
Fixing Problems with C
• Good library routines can make life
with char arrays better.
• setjmp/longjmp, atexit and resource
tracking lists can make error
handling relatively easy
– errAbort(char *message,…);
– pushAbortHandler(AbortHandler)
• Heap memory at least can be
initialized to zero
– needMem(int size)
– #define AllocA(varName)
needMem(sizeof(varName))
– freez(&objectPointer);
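The error handling and zeroed memory ideas above can be sketched roughly as follows. This is a minimal illustration, not the library's actual code: pushAbortPoint, popAbortPoint, lastError, parsePositive and tryParsePositive are hypothetical names, the signatures of errAbort and needMem are assumed, and freez is written as a macro here purely for type safety.

```c
#include <assert.h>
#include <setjmp.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

#define MAX_ABORT_POINTS 16

static jmp_buf *abortStack[MAX_ABORT_POINTS]; /* Stack of recovery points. */
static int abortDepth = 0;
char lastError[256];                          /* Most recent error message. */

void pushAbortPoint(jmp_buf *env)
/* Register a recovery point (hypothetical stand-in for pushAbortHandler). */
{
    assert(abortDepth < MAX_ABORT_POINTS);
    abortStack[abortDepth++] = env;
}

void popAbortPoint(void)
/* Remove the innermost recovery point. */
{
    assert(abortDepth > 0);
    --abortDepth;
}

void errAbort(const char *format, ...)
/* Record message, then jump to the innermost recovery point, or exit if none. */
{
    va_list args;
    va_start(args, format);
    vsnprintf(lastError, sizeof(lastError), format, args);
    va_end(args);
    if (abortDepth > 0)
        longjmp(*abortStack[abortDepth - 1], 1);
    fprintf(stderr, "%s\n", lastError);
    exit(EXIT_FAILURE);
}

void *needMem(size_t size)
/* Return zero-initialized memory. Aborts instead of returning NULL. */
{
    void *pt = calloc(1, size);
    if (pt == NULL)
        errAbort("Out of memory (needed %lu bytes)", (unsigned long)size);
    return pt;
}

/* Free *pp and set it to NULL so a stale pointer cannot be reused. */
#define freez(pp) (free(*(pp)), *(pp) = NULL)

int parsePositive(const char *s)
/* Example caller: no error code to return, just abort on bad input. */
{
    int val = atoi(s);
    if (val <= 0)
        errAbort("Expecting positive number, got '%s'", s);
    return val;
}

int tryParsePositive(const char *s)
/* Return the parsed value, or -1 if parsePositive aborted. */
{
    jmp_buf env;
    int result = -1;
    pushAbortPoint(&env);
    if (setjmp(env) == 0)
        result = parsePositive(s);
    popAbortPoint();
    return result;
}
```

With this arrangement, deeply nested routines never pass error codes upward. Only the few places that want to recover (tryParsePositive here) need to know about setjmp.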
Limited Object Orientation
• A struct can generally act as an
object.
• Families of routines starting with the
name of the object are like non-virtual
methods, but more greppable.
– struct dna *dnaNew(int size);
– void dnaFree(struct dna **pDna);
– int dnaCount(struct dna *dna,
char base);
• Virtual methods can be implemented
by embedded function pointers.
• All objects begin with a next pointer
field so can be hung on a generic
singly linked list.
• Inheritance in wrong hands can
destroy program locality worse than
gotos.
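A minimal sketch of this style, assuming a made-up shape object (struct shape and the shapeXxx, rectArea and triangleArea routines are hypothetical and not part of the library). It shows the leading next pointer, the name-prefixed routine family, and a function pointer acting as a virtual method. slAddHead is written here as a macro so it works on any struct with a next field.

```c
#include <assert.h>
#include <stdlib.h>

struct shape
/* Hypothetical example object. */
{
    struct shape *next;                   /* Next in list. Always the first field. */
    double width, height;                 /* Dimensions. */
    double (*area)(struct shape *shape);  /* "Virtual method". */
};

/* Push node onto the head of a list of any struct that has a next field. */
#define slAddHead(listPt, node) ((node)->next = *(listPt), *(listPt) = (node))

double rectArea(struct shape *shape)
/* Area method for rectangles. */
{
    return shape->width * shape->height;
}

double triangleArea(struct shape *shape)
/* Area method for triangles. */
{
    return 0.5 * shape->width * shape->height;
}

struct shape *shapeNew(double width, double height,
                       double (*area)(struct shape *shape))
/* Allocate a zeroed shape and bind its area method. */
{
    struct shape *shape = calloc(1, sizeof(*shape));
    assert(shape != NULL);
    shape->width = width;
    shape->height = height;
    shape->area = area;
    return shape;
}

double shapeTotalArea(struct shape *list)
/* Sum the areas of all shapes on the list, dispatching through the pointer. */
{
    double total = 0;
    struct shape *shape;
    for (shape = list; shape != NULL; shape = shape->next)
        total += shape->area(shape);
    return total;
}

void shapeFreeList(struct shape **pList)
/* Free every shape on the list and set the list pointer to NULL. */
{
    struct shape *shape, *next;
    for (shape = *pList; shape != NULL; shape = next)
    {
        next = shape->next;
        free(shape);
    }
    *pList = NULL;
}
```

Searching the source for "shape" followed by a capital letter finds every routine in the family, which is the greppability the slide refers to.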
Basic Module Structure
• Library interfaces are in inc/*.h;
implementations in lib/*.c
• There are two libraries:
– src/lib - older, more generic. 54
modules in all
– src/hg/lib - newer, more human
genome project specific. Requires
MySQL to compile. 25 modules in
all.
• Programs are usually one or a
few source files linked with
libraries.
– About 200 programs in all.
Library Utility Modules
• common.h - basic stuff included in every
program. Strings, files, error handling,
singly linked lists.
• hash.h - hash tables
• linefile.h - line oriented and space/tab
delimited file stuff.
• bits.h - exciting arrays of bits
• dlist.h - doubly linked lists
• dystring.h - dynamically sized strings
• localmem.h - fast local heap memory
• portable.h - wrappers around things that
vary between operating systems.
• digraph.h - directed graphs.
Web Oriented Modules
• cheapcgi.h - stuff to get variables
and do other common chores for
CGI scripts in C.
• htmshell.h - stuff that makes it
easier to write .html files, also
heavily used by CGI scripts.
• memgfx.h - draw on a 256 color
bitmap in memory and save it as a
GIF
• hg/jksql.h - wrapper around MySQL
interface with error handling and
some shortcuts.
Biological Modules
• xenalign.h - cross species aligner (pair
HMM).
• supStitch.h - fast large scale aligner for
mRNA and other things with >95% base
identity.
• fuzzyFind.h - small scale aligner for
mRNA and other things with >90% base
identity.
• dnautil.h - reverse complement, etc.
• dnaseq.h - nucleotide sequence object.
• fa.h - read/write Fasta files.
• blastParse.h - read blast output.
• psl.h - read/write psLayout alignments.
Important Programs
• psLayout - Fast bulk alignment program
for mRNA and other sequences with >95%
sequence identity.
• pslSort and pslReps - applies ‘near best
in genome’ filter to alignments.
• ooGreedy - Uses alignments and other
data to assemble draft human genome.
• waba - Cross species aligner.
• ameme - DNA motif finder.
• faNoise - add various types of noise to an
.fa file.
• ccCp - copy a file efficiently to all nodes
in compute cluster
• autoSql - generates C and SQL code from
a data format specification.
The Browser
• CGI script generates graphics
on the fly as .gif file in temp dir.
• Zooming and scrolling handled
by link to same CGI script with
different parameters.
• Separate CGI script called to
process most clicks.
• Data is stored in MySQL
database.
The Interactive
Challenge
• Need to bring up initial page in
about 15 seconds, subsequent pages
in about 5 seconds.
• Precompute stuff on 100 machine
cluster.
• Database usually the bottleneck.
• Scaled out view of chromosome 1
involves over 500,000 items.
• Database design must minimize
number of seeks needed to display a
window.
• Must sort data, not just index it.
• Graphics also need to be snappy.
Anatomy of CGI
• A CGI script essentially just prints a
web page to stdout.
• Web server knows if cgi-bin is part
of URL to call a program to get the
page rather than read a file.
• Web ‘forms’ can pass data to CGI
scripts.
• CGI scripts can generate web forms.
• Can embed images.
• Image maps tell browser what URL
to call when clicking on specific
parts of an image.
• A challenge - maintaining context
between user clicks. (hidden vars)
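The pieces above fit together roughly as in the following sketch. It is not the browser's code: findCgiVar, isSafeValue and printPage are hypothetical helpers, the URLs and image name are invented, and the query string is not URL-decoded. It shows a page printed to a stream, an image map, and a hidden form variable that carries context to the next click.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

int findCgiVar(const char *query, const char *name, char *out, size_t outSize)
/* Copy the raw value of name from a query string such as "a=1&b=2" into out.
 * Return 1 if found, 0 if not. Long values are truncated to fit. */
{
    size_t nameLen = strlen(name);
    const char *s = query;
    assert(outSize > 0);
    while (s != NULL && *s != '\0')
    {
        const char *end = strchr(s, '&');
        size_t pairLen = (end != NULL) ? (size_t)(end - s) : strlen(s);
        if (pairLen > nameLen && strncmp(s, name, nameLen) == 0
            && s[nameLen] == '=')
        {
            size_t valLen = pairLen - nameLen - 1;
            if (valLen >= outSize)
                valLen = outSize - 1;
            memcpy(out, s + nameLen + 1, valLen);
            out[valLen] = '\0';
            return 1;
        }
        s = (end != NULL) ? end + 1 : NULL;
    }
    return 0;
}

int isSafeValue(const char *s)
/* Return 1 if s is non-empty and holds only letters, digits, ':', '-', '_'.
 * Anything else is rejected so it cannot inject HTML into the page. */
{
    static const char ok[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789:-_";
    return s[0] != '\0' && s[strspn(s, ok)] == '\0';
}

void printPage(FILE *f, const char *position)
/* Print HTTP header plus a page with an image, an image map, and a form
 * whose hidden variable remembers the position between clicks. */
{
    if (!isSafeValue(position))
        position = "unknown";
    fprintf(f, "Content-Type: text/html\r\n\r\n");
    fprintf(f, "<HTML><BODY>\n");
    fprintf(f, "<IMG SRC=\"/tmp/example.gif\" USEMAP=\"#map\">\n");
    fprintf(f, "<MAP NAME=\"map\">\n");
    fprintf(f, "<AREA SHAPE=\"rect\" COORDS=\"0,0,100,10\" "
               "HREF=\"/cgi-bin/details?position=%s\">\n", position);
    fprintf(f, "</MAP>\n");
    fprintf(f, "<FORM ACTION=\"/cgi-bin/browser\" METHOD=\"GET\">\n");
    fprintf(f, "<INPUT TYPE=\"hidden\" NAME=\"position\" VALUE=\"%s\">\n",
            position);
    fprintf(f, "<INPUT TYPE=\"submit\" VALUE=\"refresh\">\n");
    fprintf(f, "</FORM>\n</BODY></HTML>\n");
}
```

A real script would read the QUERY_STRING environment variable with getenv, pull out position with findCgiVar, and call printPage(stdout, position).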
Tracks - the central metaphor
struct trackGroup
/* Structure that displays a track. */
{
struct trackGroup *next; /* Next on list.*/
char *mapName; /* Name on ui buttons. */
enum visibility vis; /* Dense? Full? */
char *longLabel; /* Label for center. */
char *shortLabel; /* Label for left side */
void *items; /* Singly linked item list. */
…
void (*loadItems)(struct trackGroup *tg);
/* Load items, called before draw. */
void (*drawItems)(struct trackGroup *tg,
struct memGfx *mg, int x, int y, ...
enum visibility vis);
/* Draw all items. */
char *(*itemName)(struct trackGroup *tg,
void *item);
/* Return name of an item. */
int (*totalHeight)(struct trackGroup *tg);
/* Return height needed for all items. */
…
};
Loading data in window
• Open database and build a query:
conn = sqlConnect("hg3");
sprintf(query, "select * from ctgPos"
" where chrom = '%s'"
" and chromStart < %d"
" and chromEnd > %d",
winChrom, winEnd, winStart);
• Query database
sr = sqlGetResult(conn, query);
• Get results as array of strings
while ((row = sqlNextRow(sr)) != NULL)
• Use AutoSQL generated routine to
convert to object
ctg = ctgPosLoad(row);
• Save on item list.
slAddHead(&itemList, ctg);
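The window query selects every item that overlaps the window: an item overlaps when it starts before the window ends and ends after the window starts. A small sketch of that test and of a bounds-checked way to build the query follows. rangesOverlap, isSafeName and buildWindowQuery are hypothetical helpers, and half-open coordinates (end not included) are an assumption.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

int rangesOverlap(int start1, int end1, int start2, int end2)
/* Return 1 if half-open ranges [start1,end1) and [start2,end2) overlap. */
{
    return start1 < end2 && end1 > start2;
}

int isSafeName(const char *s)
/* Return 1 if s is non-empty and holds only letters, digits and '_',
 * so it can be pasted into a query without quoting problems. */
{
    static const char ok[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789_";
    return s[0] != '\0' && s[strspn(s, ok)] == '\0';
}

int buildWindowQuery(char *buf, size_t bufSize, const char *table,
                     const char *chrom, int winStart, int winEnd)
/* Write a query for all items in table overlapping the window into buf.
 * Return 1 on success, 0 if a name is unsafe or the buffer is too small. */
{
    int n;
    if (!isSafeName(table) || !isSafeName(chrom))
        return 0;
    n = snprintf(buf, bufSize,
                 "select * from %s where chrom = '%s'"
                 " and chromStart < %d and chromEnd > %d",
                 table, chrom, winEnd, winStart);
    return n >= 0 && (size_t)n < bufSize;
}
```

Note the argument order: chromStart is compared against the window end and chromEnd against the window start, which mirrors rangesOverlap.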
Drawing Data
• Loop through item list
for (ctg = items; ctg != NULL;
ctg = ctg->next)
• Scale item to window
x1 = scaleItem(ctg->chromStart);
x2 = scaleItem(ctg->chromEnd);
w = x2-x1;
• Render item
mgDrawBox(mg, x1, y, w, height,
color);
mgTextCentered(mg, x1, y,
w, height, color, ctg->name);
• Write box to image map
mapBox(x1, y, w, height, "ctgPos",
ctg->name);
• Advance to next line if full display
if (vis == tvFull) y += height;
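The scaling step maps a chromosome position to a pixel column. The slide does not show scaleItem itself, so the following is only a guess at the arithmetic: scaleToPixel and boxWidth are hypothetical names, and the window bounds and image width are passed in explicitly rather than read from globals.

```c
#include <assert.h>

int scaleToPixel(int chromPos, int winStart, int winEnd, int pixelWidth)
/* Map chromPos to a pixel column in 0..pixelWidth. Positions outside the
 * window are clipped to its edges. */
{
    long long offset;
    assert(winEnd > winStart);
    if (chromPos < winStart)
        chromPos = winStart;
    if (chromPos > winEnd)
        chromPos = winEnd;
    /* Multiply in 64 bits: base position times pixel width can overflow int. */
    offset = (long long)(chromPos - winStart) * pixelWidth;
    return (int)(offset / (winEnd - winStart));
}

int boxWidth(int x1, int x2)
/* Width of a box from x1 to x2, at least one pixel so that small items
 * stay visible when zoomed far out. */
{
    int w = x2 - x1;
    return (w < 1) ? 1 : w;
}
```

For a 600 pixel wide image showing bases 100 to 200, an item starting at base 150 lands at pixel 300.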
Conclusions
• Robust, simple, extensible, and
fast design that works across
web browsers.
• Appropriate use of lagging edge
technologies.
• Write ups in Science and
Nature.
• >1000 users per day.
Acknowledgements
• David Haussler - bold, charming,
astute. A good teacher to boot.
• Al Zahler - a kind and generous
boss and a sharp biologist.
• Paul Tatarsky - #1 system admin.
• Scott, Nick, Terry, Patrick and
Ewan - for all the programming.
• Francis, Eric, Bob, and John -
over 4 billion bases served.