jkIEEE2004 - University of California, Santa Cruz

Spaghetti Code, Soupy Logic
adventures in gene expression & genome annotation
Jim Kent
University of California Santa Cruz
A Challenge Every Speaker Faces:
• Who is the audience?
• Bioinformaticians:
– Biologists with bigger, better databases?
– Geeks trading bits for bases?
– Leading edge interdisciplinary super scientists?
Top 5 Reasons Biologists Go Into Bioinformatics
• 5 - Microscopes and biochemistry are so
20th century.
• 4 - Got started purifying proteins, but it
turns out the cold room is really COLD.
• 3 - After 23 years of school wanted to make
MORE than 23,000/year in a postdoc.
• 2 - Like to swear, @ttracted to $_ Perl #!!
• 1 - Getting carpal tunnel from pipetting
Top 5 Reasons Computer People Go Into Bioinformatics
• 5 - Bio courses have some females.
• 4 - Human genome stabler than Windows XP
• 3 - Having mastered binary trees, quad trees,
and parse trees, ready for phylogenetic trees.
• 2 - Missing heady froth of the internet bubble.
• 1 - Must augment humanity to defeat evil
artificial intelligent robots.
The Paradox of Genomics
How does a long, static, one dimensional string
of DNA turn into the remarkably complex,
dynamic, and three dimensional human body?
GTTTGCCATCTTTTG
CTGCTCTAGGGAATC
CAGCAGCTGTCACCA
TGTAAACAAGCCCAG
GCTAGACCAGTTACC
CTCATCATCTTAGCT
GATAGCCAGCCAGCC
ACCACAGGCATGAGT
Models and Metaphors
• When trying to understand something we like to
build up metaphors and models.
• Computer programs are complex systems that
ultimately are built up of 0’s and 1’s. Perhaps they
are a model for a genome built of A, C, G and T?
• Human genome lacks documentation, has
accumulated 3 billion years of cruft, and does not
believe in local variables.
• Therefore we must look to less than
straightforward software programs as guides.
Bioperl CORBA module
sub new {
my ( $class, @args) = @_;
my $self = $class->SUPER::new(@args);
my ( $idl, $ior, $orbname ) = $self->_rearrange( [ qw(IDL IOR ORB
@args);
$self->{'_ior'} = $ior || 'biocorba.ior';
$self->{'_idl'} = $idl || $ENV{BIOCORBAIDL} || 'biocorba.idl';
$self->{'_orbname'} = $orbname || 'orbit-local-orb';
$CORBA::ORBit::IDL_PATH = $self->{'_idl'};
my $orb = CORBA::ORB_init($orbname);
my $root_poa = $orb->resolve_initial_references("RootPOA");
$self->{'_orb'} = $orb;
$self->{'_rootpoa'} = $root_poa;
return $self;
}
Obfuscated C
#define c(n,s)case n:s;continue
char x[]="((((((((((((((((((((((",w[]=
"\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b";char r[]={92,124,47},l[]={2,3,1
,0};char*T[]={" |"," |","%\\|/%"," %%%",""};char d=1,p=40,o=40,k=0,*a,y,z,g=
-1,G,X,**P=&T[4],f=0;unsigned int s=0;void u(int i){int n;printf(
"\233;%uH\233L%c\233;%uH%c\233;%uH%s\23322;%uH@\23323;%uH \n",*x-*w,r[d],*x+
,r[d],X,*P,p+=k,o);if(abs(p-x[21])>=w[21])exit(0);if(g!=G){struct itimerval t=
{0,0,0,0};g+=((g<G)<<1)-1;t.it_interval.tv_usec=t.it_value.tv_usec=72000/((g>>
3)+1);setitimer(0,&t,0);f&&printf("\e[10;%u]",g+24);}f&&putchar(7);s+=(9-w[21]
)*((g>>3)+1);o=p;m(x);m(w);(n=rand())&255||--*w||++*w;if(!(**P&&P++||n&7936)){
while(abs((X=rand()%76)-*x+2)-*w<6);++X;P=T;}(n=rand()&31)<3&&(d=n);!d&&--*x<=
*w&&(++*x,++d)||d==2&&++*x+*w>79&&(--*x,--d);signal(i,u);}void e(){signal(14,
SIG_IGN);printf("\e[0q\ecScore: %u\n",s);system("stty echo -cbreak");}int main
(int C,char**V){atexit(e);(C<2||*V[1]!=113)&&(f=(C=*(int*)getenv("TERM"))==(
int)0x756E696C||C==(int)0x6C696E75);srand(getpid());system("stty -echo cbreak"
);h(0);u(14);for(;;)switch(getchar()){case 113:return 0;case 91:case 98:c(44,k
=-1);case 32:case 110:c(46,k=0);case 93:case 109:c(47,k=1);c(49,h(0));c(50,h(1
));c(51,h(2));c(52,h(3));}}
Microsoft Windows
Diagram labels: mouse, keyboard, network, elaborate proprietary process, blue screen of death
Looks like metaphor not enough,
must study actual cells & DNA
How DNA is Used by the Cell
Promoter Tells Where to Begin
Different promoters activate different genes in
different parts of the body.
A Computer in Soup
Idealized promoter for a gene involved in making hair.
Proteins that bind to specific DNA sequences in the
promoter region together turn a gene on or off. These
proteins are themselves regulated by their own promoters
leading to a gene regulatory network with many of the
same properties as a neural network.
Genes can be transcription factors that activate
or repress other genes, leading to regulatory networks
such as this one from the development of the central
nervous system. (Image from D’Haeseleer Somogyi 1999)
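The neural-network comparison above can be made concrete with a toy model. This is a minimal sketch, not a model of any real pathway: the gene names (tfA, tfB, tfC), the weights, and the thresholds are all made up, and each gene is simply on or off.

```python
# Toy gene regulatory network: each gene is on (1) or off (0).
# Gene names, weights and thresholds are hypothetical, for illustration only.

def step(state, weights, thresholds):
    """Advance the network by one synchronous time step.

    A gene switches on when the weighted sum of its active regulators
    (positive weight = activator, negative weight = repressor)
    exceeds that gene's threshold.
    """
    new_state = {}
    for gene, regulators in weights.items():
        total = sum(w * state[reg] for reg, w in regulators.items())
        new_state[gene] = 1 if total > thresholds[gene] else 0
    return new_state

weights = {
    "tfA": {"tfA": 1.0},               # tfA maintains its own expression
    "tfB": {"tfA": 1.0, "tfC": -2.0},  # activated by tfA, repressed by tfC
    "tfC": {"tfB": 1.0},               # activated by tfB
}
thresholds = {"tfA": 0.5, "tfB": 0.5, "tfC": 0.5}

state = {"tfA": 1, "tfB": 0, "tfC": 0}
history = [state]
for _ in range(4):
    state = step(state, weights, thresholds)
    history.append(state)

for t, s in enumerate(history):
    print(t, s)
```

With these made-up weights the negative feedback from tfC onto tfB makes the network cycle: tfB and tfC switch on and off in turn while tfA stays on.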
The Decisions of a Cell
• When to reproduce?
• When to migrate and where?
• What to differentiate into?
• When to secrete something?
• When to make an electrical signal?
The more rapid decisions are usually made via the cell
membrane and second messengers. The longer-acting
decisions are usually made in the nucleus.
Nucleus Used to Appear Simple
• Cheek cells stained with basic dyes. Nuclei are
readily visible.
Mammalian Nuclei Stained in Various Ways
Image from Tom Misteli lab
Artist’s rendition of nucleus
Image from nuclear protein database
Chromatin
Turning on a gene:
• Getting DNA into the right compartment of the
nucleus (may involve very diffuse signals in DNA
over very long distances)
• Loosening up chromatin structure (this involves
activators and repressors which can act over
relatively long distances)
• Attracting RNA Polymerase II to the transcription
start site (these involve relatively close factors
both upstream and downstream of transcription
start).
Methods for Studying Transcription
• Genetics in model organisms
• Promoters hooked to reporter genes
• Gel shifts and DNase footprinting
• Phylogenetic footprinting
• Motif searches in clusters of coregulated
genes.
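The last method in the list can be sketched in its simplest form: look for short words (k-mers) that appear upstream of every gene in a coregulated cluster. The sequences below are made up, with TATAAA planted in each, and the function name is hypothetical. Real motif finders use probabilistic models and tolerate mismatches, so treat this only as an illustration of the idea.

```python
from collections import Counter

def kmers_shared_by_all(sequences, k):
    """Return the k-mers present in every sequence, mapped to their total count."""
    per_seq = []
    for seq in sequences:
        seq = seq.upper()
        # Count every overlapping window of length k in this sequence.
        per_seq.append(Counter(seq[i:i + k] for i in range(len(seq) - k + 1)))
    # Keep only k-mers seen in all sequences.
    shared = set(per_seq[0])
    for counts in per_seq[1:]:
        shared &= set(counts)
    return {kmer: sum(c[kmer] for c in per_seq) for kmer in shared}

# Made-up upstream regions of three hypothetical coregulated genes.
upstream = [
    "GGCTATAAAGC",
    "ACTATAAATCG",
    "TTTATAAACGA",
]
shared = kmers_shared_by_all(upstream, 6)
print(shared)
```

Requiring an exact match in every sequence is a deliberately strict choice that keeps the sketch short. Relaxing it (for example, requiring a k-mer in most sequences) is the usual next step.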
Drosophila Genetics
Figure panels: normal; antennapedia mutant
Reporter Gene Constructs
promoter to study
easily seen gene
Drosophila embryo transfected with ftz promoter hooked
up to lacZ reporter gene, creating stripes where ftz promoter
is active.
Biochemical Footprinting Assays
Gel showing selective protection of DNA from nuclease digestion
where transcription factor is bound.
Txn factor footprint
Pseudogenes
Creative Chaos & Genome
Finding Transcription Start
Phylogenetic Footprinting
Mouse Paints Some Promoters
Track labels: RefSeq, Spliced EST, Mouse, Fish, Repeat
Crystallin - a gene expressed in the eye.
Coding regions are very similar to crystallins
in the liver, but the promoter is different.
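A comparison like the one in this caption (similar coding regions, different promoter) can be put into numbers by scoring percent identity in windows along an alignment. The sketch below assumes the two sequences are already aligned to the same length. The fragments and their labels are made up, and the function name is hypothetical.

```python
def window_identity(seq_a, seq_b, window):
    """Percent identity of two pre-aligned sequences in non-overlapping windows."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    scores = []
    for start in range(0, len(seq_a) - window + 1, window):
        a = seq_a[start:start + window]
        b = seq_b[start:start + window]
        # A gap character ("-") never counts as a match.
        matches = sum(1 for x, y in zip(a, b) if x == y and x != "-")
        scores.append(100.0 * matches / window)
    return scores

# Made-up aligned fragments: the first window has diverged,
# the second window is fully conserved.
human = "ACGTACGTACGGCCATTGCA"
mouse = "TCGAACTTAGGGCCATTGCA"
scores = window_identity(human, mouse, 10)
print(scores)
```

Windows with high identity are candidates for conserved, possibly functional sequence. Low-identity windows have been free to drift.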
Normalized eScores
Mouse/Human Chrom 7 Synteny
Motifs in Coregulated Genes
Conservation Levels of Regulatory Regions
Transition from Private Research Interests to Role in Genome Project
Assembly War Story
Building a Better Browser
Pretty Adventurous Programming
Genome Browser
BLAT
Gene Sorter
Table Browser
Service Organization
Parasol and Kilo Cluster
• UCSC cluster has 1000 CPUs
running Linux
• 1,000,000 BLASTZ jobs in 25
hours for mouse/human
alignment
• We wrote Parasol job
scheduler to keep up.
– Very fast and free.
– Jobs are organized into batches.
– Error checking at job and at
batch level.
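A back-of-the-envelope check on the numbers above shows why the scheduler has to be fast. It assumes every CPU is busy for the whole 25 hours and ignores scheduling overhead, neither of which the slide states.

```python
# Figures from the slide.
jobs = 1_000_000
cpus = 1000
hours = 25

# Assumption: all CPUs are fully busy for the whole run.
wall_seconds = hours * 3600
jobs_per_cpu = jobs / cpus
avg_seconds_per_job = wall_seconds / jobs_per_cpu
dispatch_rate = jobs / wall_seconds  # jobs the scheduler must start per second

print(f"{jobs_per_cpu:.0f} jobs per CPU")
print(f"{avg_seconds_per_job:.0f} seconds per job on average")
print(f"{dispatch_rate:.1f} job starts per second across the cluster")
```

Under these assumptions each job averages about a minute and a half, and the scheduler has to start roughly eleven jobs every second for the whole run.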
Acknowledgements
Individuals
David Haussler, Chuck Sugnet
Francis Collins, Bob Waterston, Eric Lander, John Sulston, Richard Gibbs
Lincoln Stein, Sean Eddy, Olivier Jaillon, David Kulp, Victor Solovyev,
Ewan Birney, Greg Schuler, Deanna Church, Asif Chinwalla, Kim Worley,
the Gene Cats.
Everyone else!
Institutions
NHGRI, The Wellcome Trust, HHMI, Taxpayers in the US and worldwide.
Whitehead, Sanger, Wash U, Baylor, Stanford, DOE, and the international
sequencing centers.
NCBI, Ensembl, Genoscope, The SNP Consortium, UCSC, Softberry, Affymetrix.
THE END
Coloring CRYGD Start
gctcgttcaggggtaaaggtgtattctagatCCACAACAAGCCCCGTGGTCTAGCACAGC
AAAGAGAAAAAAAGAGAACACGAAAATGCCCTTGCTCCCCTCCGGGGGCCCCTTTTGTGC
GGTTCTTGCCAACGCAGCAGCCCTCCTGCTATATAGCCCGCCGCGCCgCAGCCCCACCCG
CTCAGCGCCGCCGCCCCACCAGCTCAGCACCGCCGTGCGCCCAGCCAGCCATGGGGAAGG
TGAGCCCAGCCTGCGCCCCGGGACCCCGGAGCTTCCTCCATCGCGGGGGCCAGAGACTGG
GGCAGGAGCAGGCCTGTGAGACCTCGCCTTGTCCCGCCTTGCCTTGCAGATCACCCTCTA
CGAGGACCGGGGCTTCCAGGGCCGCCACTATGAATGCAGCAGCGACCACCCCAACCTGCA
GCCCTACTTGAGCCGCTGCAACTCGGCGCGCGTGGACAGCGGCTGCTGGATGCTCTATGA
GCAGCCCAACTACTCGGGCCTCCAGTACTTCCTGCGCCGCGGCGACTATGCCGACCACCA
GCAGTGGATGGGCCTCAGCGACTCGGTCCGCTCCTGCCGCCTCATCCCCCACGTGAGTAC
ATCCTCAAGTCAGGACCCAGGCCCTCAGGACACTCACTGGAtgGTTTCAAGCAAAAGTTA
AACATTAGAAGTAGTGATCAGTcacaataaCTGAGAGTGGACAAAAGATGAACTATAGTG
GATTAAGTCAATAGagttTGCTCCCCACATAAGCAAAGTATTACCCAGACAcCAGTTAAT
caCAATTAATCCACAAATATGTATTGAGTAGGAATGTGTCTCCTGCCctAGGGGTTGTAT
Trends in Society & Biology
• 50’s - Society: Cars are good. Biology: Mitochondria and metabolism.
• 60’s - Society: Recording. Biology: DNA as recording media of genes.
• 70’s - Society: Birth control. Biology: Working out the cell cycle.
• 80’s - Society: Yuppies. Biology: Start of serious genetic engineering.
• 90’s - Society: Microsoft rules. Biology: Incyte, Celera race to patent genome.
2000’s
(The NEED for Bioinformatics)
• ~200 million bases of DNA are sequenced
every day.
– Not much use without assembly.
• Protein and non-sequence data also being
generated at a prodigious rate.
– How to store it and find the parts you want?
• Making models that are simple enough to
understand, but rich enough to reflect the
biology.
(My Road to a Bio PhD)
• Liked bio, but too many prerequisites!
• Had fun doing graphics/animation
programming in 80’s & early 90’s.
• Bored of endlessly shifting Microsoft APIs
• Community college, UC extension to get
bio BA equivalent in 97 & 98.
• UC Santa Cruz bio grad school 1999
• Interested in developmental biology and
how a cell makes decisions.
Perhaps Must Study Actual Cells
Spaghetti Code or Soupy Logic
Steaming fresh modules in
sourceforge.net
Combinatorial assembly of
transcription factors in cell.