Sordid Details of the Human Genome Browser


The Shocking Details of
Genome.ucsc.edu
History of the Code
• Started in 1999 in C after Java proved hopelessly
unportable across browsers.
• Early modules include a Worm genome browser
(Intronerator), and GigAssembler which produced
working draft of human genome.
• In 2001 a few other grad students started working
on the code.
• In 2002 hired staff to help with Genome Browser
• Currently project employs ~20 full time people.
The Genome Browser Staff
• 5 programmers: Mark, Angie, Hiram, Kate, Rachel, Fan, Jim
• 4 quality assurance engineers: Heather, Bob, Mike, Galt
• 3 post-docs: Terry, Gill, Katie
• 9 grad students: Chuck, Daryl, Brian, Robert, Yontao, Krish,
Adam, Ryan, Andy
• 3 system administrators: Paul, Jorge, Patrick
• 1 writer: Donna
• David Haussler and CBSE Staff
• About 1/3 of staff (including me 3 days a week)
telecommutes.
The Goal
Make the human genome
understandable by humans.
Prognosis
Maybe we’ll understand it one of these days
Cardiac Troponin T2
Comparative Genomics at BMP10
Normalized eScores
Conservation Levels of
Regulatory Regions
Complex Transcription
Add Your Own Tracks
• Users can extend the browser with their
own tracks.
• User tracks can be private or public.
• No programming required.
• GFF, GTF, PSL or BED formats supported
#chrom start end [name strand score …]
chr1 1302347 1302357 SP1 + 800
chr1 1504778 1504787 SP2 - 980
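To make the BED-style layout above concrete, here is a minimal sketch of reading one such line in C. The `bedLine` struct and `bedLineParse` function are hypothetical names invented for this illustration and are not part of the browser code base; the sketch assumes whitespace-separated fields with three required columns and up to three optional ones.

```c
#include <assert.h>
#include <stdio.h>

struct bedLine
/* Hypothetical record for one custom-track line. */
    {
    char chrom[32];   /* Chromosome name, e.g. "chr1". */
    unsigned start;   /* Start position. */
    unsigned end;     /* End position. */
    char name[64];    /* Optional item name; empty if absent. */
    char strand;      /* Optional '+' or '-'; '.' if absent. */
    int score;        /* Optional score; 0 if absent. */
    };

int bedLineParse(const char *line, struct bedLine *bed)
/* Parse one whitespace-separated line.  Returns the number of fields
 * read (3 to 6), or 0 if the three required fields are not present. */
{
char strand[2] = ".";
int count;
assert(line != NULL && bed != NULL);
bed->name[0] = 0;
bed->score = 0;
count = sscanf(line, "%31s %u %u %63s %1s %d",
    bed->chrom, &bed->start, &bed->end, bed->name, strand, &bed->score);
bed->strand = strand[0];
return (count >= 3) ? count : 0;
}
```

A header line such as `#chrom start end` yields 0 here because `start` is not a number, so a caller can simply skip lines that return 0.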
The Underlying Database
• Power users and bioinformaticians sometimes want
the underlying database.
• There is a table for each track.
• Larger tracks have a table for each chromosome.
• Format of a track table generally similar to add-your-own
track formats.
• Pieces of database available from ‘tables’ browser.
• Whole database available as tab-separated files.
• Most of database served via DAS.
Parasol and Kilo Cluster
• UCSC cluster has 1000 CPUs
running Linux
• 1,000,000 BLASTZ jobs in 25 hours
for mouse/human alignment
• We wrote Parasol job scheduler to
keep up.
– Very fast and free.
– Jobs are organized into batches.
– Error checking at job and at batch
level.
Science is Hard
Coding: Discipline Is Required
• While software development is immune
from almost all physical laws, entropy hits
us hard. - The Pragmatic Programmer
• To keep the system from devolving into
disorder we have to follow code
conventions and insist on a lot of testing.
• We use CVS (concurrent version system) to
help all of us work on the same code at
once.
Obtaining the Code from CVS
• See http://genome.ucsc.edu/admin/cvs.html
• This gets you a ‘sandbox’ - a local copy of the
source to compile and edit.
• Type ‘make’ in the lib and utilities directory.
• You can do a ‘cvs update’ to get our updates to the
code base.
• To add permanently to code base email me to
enable ‘cvs commit’
Expand Your Mental Capacity
With…
Lagging Edge Software
• C language - compilers still available!
• CGI Scripts - portable if not pretty.
• SQL database - at least MySQL is free.
Problems with C
• Missing booleans and strings.
• No real objects.
• Must free things
Advantages of C
• Very fast at runtime.
• Very portable.
• Language is simple.
• No tangled inheritance hierarchy.
• Excellent free tools are available.
• Libraries and conventions can
compensate for language weaknesses.
Coping with Missing Data Types
in C
• #define boolean int
• Fixing lack of real string type much harder
– lineFile/common modules and autoSql code
generator make parsing files relatively painless
– dyString module not a horrible string ‘class’
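As a rough illustration of what a string 'class' in C can look like, the sketch below builds a growable string around a structure. It is not the dyString API: `growString` and its functions are hypothetical names chosen for this example, and the doubling growth policy is an assumption.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct growString
/* Hypothetical growable string; shows the idea, not the dyString module. */
    {
    char *string;      /* Null-terminated contents. */
    size_t size;       /* Current length, not counting the terminator. */
    size_t capacity;   /* Allocated bytes, including the terminator. */
    };

struct growString *growStringNew(void)
/* Allocate an empty growable string. */
{
struct growString *gs = calloc(1, sizeof(*gs));
if (gs == NULL)
    abort();
gs->capacity = 16;
gs->string = calloc(gs->capacity, 1);
if (gs->string == NULL)
    abort();
return gs;
}

void growStringAppend(struct growString *gs, const char *text)
/* Append text, doubling the buffer until it fits. */
{
size_t len;
assert(gs != NULL && text != NULL);
len = strlen(text);
if (gs->size + len + 1 > gs->capacity)
    {
    size_t newCapacity = gs->capacity;
    char *bigger;
    while (gs->size + len + 1 > newCapacity)
        newCapacity *= 2;
    bigger = realloc(gs->string, newCapacity);
    if (bigger == NULL)
        abort();
    gs->string = bigger;
    gs->capacity = newCapacity;
    }
memcpy(gs->string + gs->size, text, len + 1);
gs->size += len;
}

void growStringFree(struct growString **pGs)
/* Free growable string and set pointer to NULL. */
{
struct growString *gs = *pGs;
if (gs != NULL)
    {
    free(gs->string);
    free(gs);
    *pGs = NULL;
    }
}
```

Callers never compute buffer sizes themselves; they append and read `gs->string`, which stays null terminated after every call.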
Object Oriented Programming in C
• Build objects around structures.
• Make families of functions with names that
start with the structure name, and that take
the structure as the first argument.
• Implement polymorphism/virtual functions
with function pointers in structure.
• Inheritance is still difficult. Perhaps this is
not such a bad thing.
struct dnaSeq
/* A dna sequence in one-letter-per-base format. */
{
struct dnaSeq *next; /* Next in list. */
char *name; /* Sequence name. */
char *dna; /* a’s c’s g’s and t’s. Null terminated */
int size; /* Number of bases. */
};
struct dnaSeq *dnaSeqFromString(char *string);
/* Convert string containing sequence and possibly
* white space and numbers to a dnaSeq. */
void dnaSeqFree(struct dnaSeq **pSeq);
/* Free dnaSeq and set pointer to NULL. */
void dnaSeqFreeList(struct dnaSeq **pList);
/* Free list of dnaSeq’s. */
struct screenObj
/* A two dimensional object in a sleazy video game. */
{
struct screenObj *next; /* Next in list. */
char *name; /* Object name. */
int x,y,width,height; /* Bounds of object. */
void (*draw)(struct screenObj *obj); /* Draw object */
boolean (*in)(struct screenObj *obj, int x, int y); /* Return true if x,y is in object */
void *custom; /* Custom data for a particular type */
void (*freeCustom)(struct screenObj *obj); /* Free custom data. */
};
#define screenObjDraw(obj) (obj->draw(obj))
/* Draw object. */
void screenObjFree(struct screenObj **pObj);
/* Free up screen object including custom part. */
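The function-pointer approach to polymorphism can be seen in action with a small self-contained sketch. The `shape` structure below is a hypothetical, cut-down relative of screenObj that keeps only the bounds and the `in` method; `boxIn`, `ellipseIn` and the other function names are invented for illustration.

```c
#include <assert.h>
#include <stdlib.h>

typedef int boolean;

struct shape
/* Hypothetical cut-down relative of screenObj. */
    {
    struct shape *next;     /* Next in list. */
    int x,y,width,height;   /* Bounds of object. */
    boolean (*in)(struct shape *obj, int x, int y); /* True if x,y is in object. */
    };

#define shapeIn(obj, x, y) ((obj)->in((obj), (x), (y)))
/* Call the object's own hit test. */

static boolean boxIn(struct shape *obj, int x, int y)
/* Hit test for a rectangle filling the bounds. */
{
return x >= obj->x && x < obj->x + obj->width
    && y >= obj->y && y < obj->y + obj->height;
}

static boolean ellipseIn(struct shape *obj, int x, int y)
/* Hit test for an ellipse inscribed in the bounds. */
{
double rx = obj->width / 2.0, ry = obj->height / 2.0;
double dx = (x - (obj->x + rx)) / rx;
double dy = (y - (obj->y + ry)) / ry;
return dx*dx + dy*dy <= 1.0;
}

struct shape *shapeNew(int x, int y, int width, int height,
    boolean (*in)(struct shape *obj, int x, int y), struct shape *next)
/* Make a shape with the given hit test and put it in front of next. */
{
struct shape *obj = calloc(1, sizeof(*obj));
assert(width > 0 && height > 0 && in != NULL);
if (obj == NULL)
    abort();
obj->x = x; obj->y = y; obj->width = width; obj->height = height;
obj->in = in;
obj->next = next;
return obj;
}

int shapeCountHits(struct shape *list, int x, int y)
/* Count shapes containing x,y without knowing their concrete types. */
{
int count = 0;
struct shape *obj;
for (obj = list; obj != NULL; obj = obj->next)
    if (shapeIn(obj, x, y))
        ++count;
return count;
}

void shapeFreeList(struct shape **pList)
/* Free list of shapes and set pointer to NULL. */
{
struct shape *obj, *next;
for (obj = *pList; obj != NULL; obj = next)
    {
    next = obj->next;
    free(obj);
    }
*pList = NULL;
}
```

`shapeCountHits` walks a mixed list and never asks which kind of shape it holds; the choice of behavior lives in the structure itself.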
Naming Conventions
• Code is constrained by few natural laws.
• There are many ways to do things, so
programmers make arbitrary decisions.
• Arbitrary decisions are hard to remember.
• Conventions make decisions less arbitrary.
• varName vs. VarName vs varname vs var_name.
We use varName.
• variable vs. var vs. vrbl vs. vble vs varible: if you
need to abbreviate, keep it short.
Commenting Conventions
• Each module has a comment describing its
overall purpose.
• Each function also has an overall comment.
• Each field in a structure has a comment.
• Longer functions broken into ‘paragraphs’
that each begin with a comment.
• The module, function, and structure
comments are replicated in the .h file, which
serves as an index to the module.
Error Handling
• Code prints out a message and aborts (via
the errAbort function) when there is a
problem.
• This saves loads of error handling code and
is generally the right thing to do.
• You can ‘catch’ an errAbort if necessary,
though it rarely is.
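One common way to get abort-with-optional-catch behavior in C is setjmp/longjmp. The sketch below shows that technique under stated assumptions; it is not the errAbort implementation, and every name in it (`sketchErrAbort`, `sketchCatcher`, `sketchParseUnsigned`, `sketchTryParse`) is hypothetical.

```c
#include <assert.h>
#include <setjmp.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

static jmp_buf *sketchCatcher = NULL;  /* Non-NULL while a caller is catching. */

void sketchErrAbort(const char *format, ...)
/* Print message to stderr, then jump to the catcher if one is
 * installed, otherwise exit the program. */
{
va_list args;
va_start(args, format);
vfprintf(stderr, format, args);
va_end(args);
fputc('\n', stderr);
if (sketchCatcher != NULL)
    longjmp(*sketchCatcher, 1);
exit(1);
}

unsigned sketchParseUnsigned(const char *s)
/* Convert s to unsigned.  Aborts unless s is all decimal digits.
 * Note there is no error return for callers to check. */
{
char *end;
unsigned long val;
assert(s != NULL);
if (s[0] < '0' || s[0] > '9')
    sketchErrAbort("invalid unsigned: %s", s);
val = strtoul(s, &end, 10);
if (*end != 0)
    sketchErrAbort("invalid unsigned: %s", s);
return (unsigned)val;
}

int sketchTryParse(const char *s, unsigned *retVal)
/* The rare case: catch the abort.  Returns 1 on success, 0 on failure. */
{
jmp_buf env;
jmp_buf *oldCatcher = sketchCatcher;
sketchCatcher = &env;
if (setjmp(env) == 0)
    {
    *retVal = sketchParseUnsigned(s);
    sketchCatcher = oldCatcher;
    return 1;
    }
sketchCatcher = oldCatcher;
return 0;
}
```

Most code would call `sketchParseUnsigned` directly and let a bad input end the program; only a caller that can genuinely recover pays for the catch.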
Memory
• Uninitialized memory leads to difficult bugs.
• Compiler set to warn of uninitialized vars
• Dynamic memory goes through needMem. It is
always zeroed.
• Memory usually freed with freez(), which sets
pointer to null as well as freeing it.
• ‘Careful’ memory handler can be pushed to help
track down memory bugs:
– Sentinel values to detect writing past end of array
– Detects memory freed twice or not freed
– Detects heap corruption in general.
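The two habits above, always-zeroed allocation and freeing that also nulls the pointer, can be sketched in a few lines. This is an illustration built on `calloc` and `free`, not the needMem or freez source; `sketchNeedMem` and `sketchFreez` are hypothetical names.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

void *sketchNeedMem(size_t size)
/* Return zeroed memory of the given size, or exit with a message.
 * Callers never need to check for NULL. */
{
void *pt;
assert(size > 0);
pt = calloc(1, size);
if (pt == NULL)
    {
    fprintf(stderr, "Out of memory requesting %lu bytes\n", (unsigned long)size);
    exit(1);
    }
return pt;
}

#define sketchFreez(pPt) do { free(*(pPt)); *(pPt) = NULL; } while (0)
/* Free the memory that *pPt points to and set *pPt to NULL, so a stale
 * pointer fails fast instead of silently reading freed memory. */
```

Because `free(NULL)` is a no-op, calling `sketchFreez` twice on the same variable is harmless, which removes one common source of double-free bugs.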
Generally Useful Modules
• String handling - common, dystring, wildcmp
• Collections - common (singly linked lists), hash,
dlist, binRange, rbTree
• DNA - dnautils, dnaseq
• Web - htmshell, cheapcgi, htmlPage
• I/O - linefile, xap (XML), fa, nib, twoBit,
blastParse, blastOut, maf, chain, gff
• Graphics - memgfx, gifwrite, psGfx, vGfx
Anatomy of a CGI Script
• Gets called by Web Server when user clicks
submit or follows a cgi link.
• Input is in environment variables and
sometimes also stdin. Routines in cheapCgi
move this to a hash table.
• Output is to stdout. Routines in htmshell
help with output formatting.
• In the middle often access a database.
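The input and output steps above can be sketched without any library. The code below is a bare-bones illustration, not the cheapCgi or htmshell routines: `cgiSketchVar` and `cgiSketchRespond` are hypothetical names, the lookup scans the query string instead of building a hash table, and it assumes a GET request with no %XX decoding.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int cgiSketchVar(const char *query, const char *name, char *value, size_t valueSize)
/* Look up name in a query string of the form a=b&c=d and copy its value
 * (truncated to fit) into value.  Returns 1 if found, 0 if not. */
{
size_t nameLen;
assert(query != NULL && name != NULL && value != NULL && valueSize > 0);
nameLen = strlen(name);
while (*query != 0)
    {
    const char *end = strchr(query, '&');
    size_t pairLen = (end != NULL) ? (size_t)(end - query) : strlen(query);
    if (pairLen > nameLen && strncmp(query, name, nameLen) == 0
        && query[nameLen] == '=')
        {
        size_t valLen = pairLen - nameLen - 1;
        if (valLen >= valueSize)
            valLen = valueSize - 1;
        memcpy(value, query + nameLen + 1, valLen);
        value[valLen] = 0;
        return 1;
        }
    query += pairLen;
    if (*query == '&')
        ++query;
    }
return 0;
}

void cgiSketchRespond(FILE *out)
/* Read input from the environment and write a tiny page to out.
 * A real script would write to stdout and query a database in between. */
{
const char *query = getenv("QUERY_STRING");
char chrom[64];
fputs("Content-Type: text/html\n\n", out);
fputs("<HTML><BODY>\n", out);
if (query != NULL && cgiSketchVar(query, "chrom", chrom, sizeof(chrom)))
    fputs("Got a chrom parameter.\n", out);
else
    fputs("No chrom parameter.\n", out);
fputs("</BODY></HTML>\n", out);
}
```

The page deliberately does not echo the parameter back, since printing unescaped user input into HTML is unsafe.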
Challenges of CGI
• Each click launches program anew.
– User state can be kept in ‘cart’ variables
• Run from Web Server, harder to debug
– Use cgiSpoof to run from command line
– Push an error handler that will close out web
page, so can see your error messages. htmShell
does this, but webShell may not….
• Ideally should run in less than 2 seconds.
Relational Databases
• Relational databases consist of tables, indices, and
the Structured Query Language (SQL).
• Tables are much like tab-separated files:
#chrom  start     end       name   strand  score
chr22   14600000  14612345  ldlr   +       0.989
chr21   18283999  18298577  vldlr  -       0.998
• Fields are simple - no lists or substructures.
• Can join tables based on a shared field. This is
flexible, but only as fast as the index.
• Tables and joins are accessed a row at a time.
• The row is represented as an array of strings.
Converting A Row to Object
struct exoFish *exoFishLoad(char **row)
/* Load a exoFish from row fetched with select * from exoFish
* from database. Dispose of this with exoFishFree(). */
{
struct exoFish *ret;
AllocVar(ret);
ret->chrom = cloneString(row[0]);
ret->chromStart = sqlUnsigned(row[1]);
ret->chromEnd = sqlUnsigned(row[2]);
ret->name = cloneString(row[3]);
ret->score = sqlUnsigned(row[4]);
return ret;
}
Motivation for AutoSql
• Row to object code is tedious at best.
• Also have save object, free object code to
write.
• SQL create statement needs to match C
structure.
• Lack of lists without doing a join can
seriously impact performance and
complicate schema.
AutoSql Data Declaration
table exoFish
"An evolutionarily conserved region (ecore) with Tetraodon"
(
string chrom; "Human chromosome or FPC contig"
uint chromStart; "Start position in chromosome"
uint chromEnd; "End position in chromosome"
string name; "Ecore name in Genoscope database"
uint score; "Score from 0 to 1000"
)
See autoSql.doc for more details.
See also autoXml
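For a sense of the code that has to stay in step with such a declaration, here is a hand-written sketch of a matching C structure with save and free routines. It is a guess at the general shape, not actual autoSql output: the `exoFishSketch` names and the `sketchCloneString` helper are hypothetical, and the field types assume `string` maps to `char *` and `uint` to `unsigned`.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct exoFishSketch
/* Hand-written structure mirroring the exoFish declaration. */
    {
    struct exoFishSketch *next;  /* Next in list. */
    char *chrom;                 /* Human chromosome or FPC contig. */
    unsigned chromStart;         /* Start position in chromosome. */
    unsigned chromEnd;           /* End position in chromosome. */
    char *name;                  /* Ecore name in Genoscope database. */
    unsigned score;              /* Score from 0 to 1000. */
    };

char *sketchCloneString(const char *s)
/* Return a heap-allocated copy of s. */
{
size_t size;
char *copy;
assert(s != NULL);
size = strlen(s) + 1;
copy = malloc(size);
if (copy == NULL)
    abort();
memcpy(copy, s, size);
return copy;
}

int exoFishSketchTabOut(struct exoFishSketch *el, char *buf, size_t bufSize)
/* Write el as one tab-separated line into buf.  Returns the number of
 * characters the full line needs, as snprintf does. */
{
return snprintf(buf, bufSize, "%s\t%u\t%u\t%s\t%u\n",
    el->chrom, el->chromStart, el->chromEnd, el->name, el->score);
}

void exoFishSketchFree(struct exoFishSketch **pEl)
/* Free one element, including its strings, and set pointer to NULL. */
{
struct exoFishSketch *el = *pEl;
if (el != NULL)
    {
    free(el->chrom);
    free(el->name);
    free(el);
    *pEl = NULL;
    }
}
```

Every field appears in the structure, the output routine and the free routine, so adding one column by hand means touching each of them; that repetition is what a generator removes.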
Coding Conclusion
• It’s always safer on the lagging edge
• Consider redesigning system as COBOL
character-based application
UCSC Gene Family Browser
Expression and other information on genes in a big sorted, linked table
Up in Testes, Down in Brain
Conclusions
• Genome browser - good for exploring
genome and displaying your custom tracks
• ‘kent’ code base - a good starting point for
many programming projects
• Family browser - a fine way to collect data
sets.
• Browser staff - helpful but overworked.