Genome Informatics 2005
Meeting Report
Cold Spring Harbor, Oct 28-Nov 1
Peter E.M. Taschner
PT 11-05
Genome Informatics 2005
• ~220 participants
• 1 keynote speaker: David Haussler
• 47 talks
• 121 posters
Rodger Voelker: Two classes of splice junctions
• Search for 5-7 base motifs in exonic and intronic
flanking sequences of known splice junctions
• Computational analysis of collocations between
different motifs
• Many collocations between exonic and intronic
sequences
• Known ESEs display collocations with intronic
sequences (including ISEs)
• Nearly all introns (89%) can be classified into 2
classes
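The motif search and collocation analysis above can be illustrated with a minimal sketch. It only counts how often an exonic k-mer and an intronic k-mer flank the same junction; the function names and the toy sequences are invented for illustration, and the statistics used in the talk to decide which collocations are significant are not reproduced here.

```python
from collections import Counter
from itertools import product


def kmers(seq, k):
    """Return the set of distinct k-mers occurring in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def collocation_counts(junctions, k=5):
    """Count junctions in which an exonic k-mer and an intronic k-mer co-occur.

    junctions: list of (exonic_flank, intronic_flank) string pairs.
    Returns a Counter keyed by (exonic_kmer, intronic_kmer).
    """
    counts = Counter()
    for exon_flank, intron_flank in junctions:
        # Each distinct pair is counted once per junction.
        for pair in product(kmers(exon_flank, k), kmers(intron_flank, k)):
            counts[pair] += 1
    return counts


# Toy flanking sequences (invented, not real splice junctions).
toy_junctions = [
    ("GAAGAAG", "GTAAGTA"),
    ("TGAAGAA", "GTAAGTC"),
    ("CCCTTTT", "GTGGGCA"),
]
counts = collocation_counts(toy_junctions, k=5)
print(counts.most_common(3))
```

A real analysis would compare these counts with the counts expected if exonic and intronic motifs occurred independently.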
Chip Lawrence: futility of optima in inferences
• The strong focus in bioinformatics on optimal
solutions is fundamentally flawed, because the
asymptotic underpinnings of these solutions, such
as consistency, do not apply
• The curse of dimensionality can render optimal
solutions very unlikely and misleading
• Example: minimum free energy predictions of
RNA structures
• Reason: the energy function used is incomplete, and only
secondary structure is considered, not tertiary
Minimum free energy predictions of RNA structures
• Assumption:
– molecule folds into lowest energy state
– unique solution to folding problem (optimum)
• Many programs (e.g. Zuker's Mfold) use the
Boltzmann probability function
– Most include calculations of suboptimal structures
– but not all structures are computed
– PPV of MFE: 48 %
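As a rough illustration of why the optimum can be unrepresentative, the sketch below computes Boltzmann probabilities for a toy ensemble. The energies are invented, the value of RT assumes about 37 °C, and the function name is hypothetical; it is not how Mfold works internally.

```python
import math

RT = 0.616  # kcal/mol, gas constant times temperature at about 37 °C


def boltzmann_probabilities(energies, rt=RT):
    """Map each structure's free energy (kcal/mol) to its Boltzmann probability."""
    weights = [math.exp(-e / rt) for e in energies]
    z = sum(weights)  # partition function over the listed structures only
    return [w / z for w in weights]


# Toy ensemble: one MFE structure plus 20 alternatives that are
# only 0.5 kcal/mol less stable (invented numbers).
energies = [-10.0] + [-9.5] * 20
probs = boltzmann_probabilities(energies)
mfe_probability = probs[0]
print(f"Probability of the MFE structure: {mfe_probability:.2f}")
```

In this toy case the MFE structure is the single most probable one, yet it accounts for only a small share of the ensemble.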
Alternative prediction of RNA structures
• Sample the ensemble of secondary structures in
proportion to their Boltzmann weights
• Cluster the structures
• Use centroid structure in predictions
– Improved PPV compared to MFE
• Srna module of Sfold
(http://sfold.wadsworth.org/ )
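The centroid step can be sketched as follows, assuming one common definition: the centroid keeps every base pair that occurs in more than half of the sampled structures. The sample below is hand-made, not drawn from Sfold, the clustering step is omitted, and all function names are hypothetical.

```python
from collections import Counter


def base_pairs(dot_bracket):
    """Return the set of (i, j) base pairs encoded by a dot-bracket string."""
    stack, pairs = [], set()
    for j, ch in enumerate(dot_bracket):
        if ch == "(":
            stack.append(j)
        elif ch == ")":
            pairs.add((stack.pop(), j))
    return pairs


def centroid(sampled_structures):
    """Keep base pairs present in more than half of the sampled structures."""
    n = len(sampled_structures)
    freq = Counter()
    for structure in sampled_structures:
        freq.update(base_pairs(structure))
    return {bp for bp, count in freq.items() if count > n / 2}


def to_dot_bracket(pairs, length):
    """Render a set of base pairs as a dot-bracket string."""
    out = ["."] * length
    for i, j in pairs:
        out[i], out[j] = "(", ")"
    return "".join(out)


# Hand-made sample standing in for Boltzmann-weighted sampling output.
sample = ["((...))", "((...))", "(.....)", ".(...).", "..(.).."]
print(to_dot_bracket(centroid(sample), 7))
```

Applied to one cluster at a time instead of the whole sample, the same function would give a cluster centroid.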
A. tumefaciens 5S rRNA energy landscape
Alternative prediction of RNA structures
• Improved PPV compared to MFE:
– Ensemble centroid +30 %
– Largest cluster centroid +18 %
– Best centroid +47 %
Data mining
• Geneseer – searchable name-translation database
(http://geneseer.cshl.org/ )
• Access to genomic information through gene
names
• Mapping sequences to gene names
• Identification of homologs across several species
for a given gene
• Used in RNAi Codex (http://codex.cshl.edu )
Data mining
• Ulysses – annotate human genes based on gene
interactions in model organisms
(http://www.cisreg.ca:8080/ulysses/ )
• Interologs: conserved protein-protein interactions
• Regulogs: conserved protein-DNA interactions
• Almost no overlap between data in interaction
databases
• BIND  DIP: 984 refs; BIND  5 DB's: 3 refs
Data mining
• Integrated Genome Browser (IGB) –
visualize:
– Genomic annotations from multiple data
resources
– Experimental data from Affymetrix arrays
(http://www.affymetrix.com/support/developer/tools/download_igb.affx )
Gene expression and pathways
• Skypainter tool in Reactome database:
– allows overlay of gene expression data on
pathway graphs
– allows generation of a "movie" of a time series
• (http://www.reactome.org/ )
Gene expression
• ArrayBlast:
• Compares gene expression signatures
generated on different platforms
• Uses public microarray data sets (GEO)
• Used to create conserved cancer-related
expression signature
• (http://seq.mc.vanderbilt.edu/arrayBlast/ )
Gene expression
• C. elegans Gene Expression Consortium:
• SAGE data from specific stages, tissues and cell
types
• Database of gene expression data/pictures/movies of
transgenic worms with promoter::GFP fusions for
2000 genes with human orthologs
(http://elegans.bcgsc.ca/home/ge_consortium.html )
Michael Caudy: Whole genome analysis of
combinatorial and architectural transcription codes
• Search for TFBS in known neural pathway genes
• Determine architecture: number, type, order,
orientation and spacing of TFBS
• Compare architecture of activated and repressed
genes
• Determine activity of promoters with TFBS
mutations
• Architecture is critical for differential response to
Notch signalling
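A minimal sketch of what "architecture" means in practice: locate sites on both strands, then report their number, order, orientation and spacing. The motifs TF1 and TF2 and the promoter sequence are invented, matching is exact rather than matrix-based, and the function names are hypothetical.

```python
def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_sites(promoter, motifs):
    """Locate exact motif matches on both strands, sorted by start position.

    motifs: dict mapping a factor name to its site (no degenerate bases).
    Note: a palindromic motif would be reported once per strand.
    """
    sites = []
    for name, motif in motifs.items():
        for strand, pattern in (("+", motif), ("-", reverse_complement(motif))):
            start = promoter.find(pattern)
            while start != -1:
                sites.append((start, name, strand))
                start = promoter.find(pattern, start + 1)
    return sorted(sites)


def architecture(promoter, motifs):
    """Summarize number, order, orientation and spacing of the sites found."""
    sites = find_sites(promoter, motifs)
    return {
        "count": len(sites),
        "order": [(name, strand) for _, name, strand in sites],
        "spacing": [b[0] - a[0] for a, b in zip(sites, sites[1:])],
    }


# Invented motifs and promoter, for illustration only.
motifs = {"TF1": "GGGAAT", "TF2": "CCTTGA"}
promoter = "AAGGGAATCCCCTCAAGGTT"
result = architecture(promoter, motifs)
print(result)
```

Comparing such summaries between activated and repressed genes is the kind of analysis the talk describes.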
Regulatory sequence identification
• Evoprinter:
• highlights multi-species conserved sequences
within orthologous DNAs in the context of a
single species of interest
• (http://evoprinter.ninds.nih.gov/ )
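The idea of showing conservation in the context of one species can be sketched as below. The sketch assumes sequences that are already aligned to equal length and uses uppercase for conserved bases as its own display convention; the function name and the sequences are invented, and the alignment step that Evoprinter performs is not shown.

```python
def conservation_print(reference, *aligned_orthologs):
    """Uppercase reference bases that are identical in every aligned ortholog.

    All sequences must be pre-aligned to the same length; '-' marks a gap.
    """
    out = []
    for i, base in enumerate(reference):
        conserved = base != "-" and all(
            other[i].upper() == base.upper() for other in aligned_orthologs
        )
        out.append(base.upper() if conserved else base.lower())
    return "".join(out)


# Invented, pre-aligned toy sequences.
reference = "ACGTACGT"
ortholog_1 = "ACGTTCGT"
ortholog_2 = "ACCTACGT"
printed = conservation_print(reference, ortholog_1, ortholog_2)
print(printed)
```

Runs of uppercase letters in the output mark candidate conserved regulatory sequence in the species of interest.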
Regulatory sequence identification
• NestedMICA:
– method for discovering many over-represented
short motifs in large sets of strings in a single run
– candidate transcription factor binding sites
• (http://www.sanger.ac.uk/Software/analysis/nmica/ )