Managing Gene Annotation Information: The search is over

Managing Gene
Annotation Information
the search is over
… one problem solved
… another begins
observations from a foot soldier in the bio-information (r)evolution
Bill Farmerie -- ICBR Genomics Group
Interdisciplinary Center for Biotechnology Research
Established at the University of Florida in 1987 by the Florida Legislature
Centralized organization of biomedical core facilities supporting biotechnology-based research
How did information management become my problem?
1998 GSAC Miami Beach
Why should I care about this problem?


Because my paycheck depends on it.
Avoid fatal failure in the funding loop.
PI has $ for large gene-based project
Core Lab generates data
Downstream data management & analysis
PI writes papers, gives talks
Other PIs think this looks like a good idea
PI applies for new funding
From Sequence to Function

The genomic sequence identifies the 'parts'
The next trick is understanding gene function
Post genomic era = functional genomics
Critical concept: genes of similar sequence may have similar functions
Inferring function for a new gene begins with searching for its nearest neighbor (or homolog) of known function
BLAST



Most common starting point for gene identification
Similarity search of sequence repository (GenBank)
Output
  Calculated scores (bit score and e-value)
  Text string (definition line), ID Reference Tag
  Sequence alignment
Advantages
  Fast algorithm, very good at finding close homologs
  Cluster and Grid-enabled versions available
Disadvantages
  Not good at finding distant relatives
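As a rough illustration of working with the scores listed above, the sketch below parses BLAST hits and keeps the best-scoring hit per query. It assumes the standard 12-column tabular output layout, which the slides do not mention, and the sample rows, the function names, and the 1e-5 cutoff are made up for the example.

```python
import csv
import io

# Column order of BLAST's standard 12-column tabular output (an assumption;
# the slides only name bit score, e-value, definition line and alignment).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]

# Made-up example rows (tab-separated); not real search results.
SAMPLE = (
    "contig1\tgi|111|ref|XP_000001.1|\t98.5\t200\t3\t0\t1\t200\t5\t204\t1e-105\t380.0\n"
    "contig1\tgi|222|ref|XP_000002.1|\t45.0\t180\t90\t4\t10\t189\t20\t199\t2e-12\t70.5\n"
    "contig2\tgi|333|ref|XP_000003.1|\t30.1\t120\t80\t5\t1\t120\t1\t120\t0.5\t28.1\n"
)


def parse_hits(text):
    """Turn tabular BLAST text into a list of dicts with numeric scores."""
    hits = []
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(COLUMNS, row))
        hit["evalue"] = float(hit["evalue"])
        hit["bitscore"] = float(hit["bitscore"])
        hits.append(hit)
    return hits


def best_hit_per_query(hits, max_evalue=1e-5):
    """For each query, keep the highest bit-score hit that passes the e-value cutoff."""
    best = {}
    for hit in hits:
        if hit["evalue"] > max_evalue:
            continue  # too likely to be a chance match
        current = best.get(hit["qseqid"])
        if current is None or hit["bitscore"] > current["bitscore"]:
            best[hit["qseqid"]] = hit
    return best


hits = parse_hits(SAMPLE)
best = best_hit_per_query(hits)
for query, hit in best.items():
    print(query, hit["sseqid"], hit["evalue"], hit["bitscore"])
```

In this toy data contig2 has no hit under the cutoff, so it stays unannotated. A query with no close homolog is exactly the case where BLAST alone runs out of steam.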
HMMER



HMMER developed by Sean Eddy
Uses Hidden Markov Models
Searches unknown protein query sequence against a database of protein family models
  Statistical models constructed from alignment of conserved protein regions (Pfam)
Advantages
  Superior to BLAST for discovering more distant homology relations
  GRID enabled
Disadvantages
  More computationally intensive than BLAST
OK! Great!
Sequencing done.
Homology searches complete.
But how will I deliver this information to scientists spread all over campus, and their worldwide collaborators?
Search for summarizing information that restores sanity
CTGGGTTCTGTTCGGGATCCCAGT
CACAGGGACAATGGCGCATTCATA
TGTCACTTCCTTTACCTGCCTGGA
GAGGTGTGGCCACAGACTCTGGTG
GCTGCGAACGGGGACTCTGACCCA
GTCGACTTTATCGCCTTGACGAAG
AACCAGATTGACGTTGTCGGAGTC
GGAACTCACCTGGTCACCTGTACG
ACTCAGCCGTCGCTGGGTTGCGTT
CTGACACGCGGCTCCTCGTGTGGA
GCCGAAACCCCGACAAAAGCGAAG
GAGAGAGTGAGTATGAGCAGGCGG
BlastQuest
A small idea with a big mission
BlastQuest Requirements




Accessible to research groups at remote locations
Privacy constrained sharing of results among the scientists
Selective browsing of BLAST homology search results
Selective data filtering on statistical criteria
  e-value or bit score
Selective data grouping on criteria such as GI number, or a defined number of top-scoring results
Ad hoc search capability on user determined criteria:
  text terms
  boolean logic

From a computational point of view BlastQuest is embarrassingly
simple. However, it solved our problems of information storage,
selective retrieval, and distribution.
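The three kinds of selection listed above (statistical filtering, top-N grouping, ad hoc text search) map naturally onto SQL. The following is a minimal sketch, not BlastQuest's actual schema or queries: it uses Python's built-in sqlite3 as a stand-in for MySQL, and the table layout, function names, and rows are hypothetical.

```python
import sqlite3

# Hypothetical hit table; BlastQuest's real schema is not shown in the slides.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE hits (query_id TEXT, gi INTEGER, definition TEXT,"
    " evalue REAL, bitscore REAL)"
)
rows = [  # made-up example data
    ("contig1", 111, "heat shock protein 70 [Homo sapiens]", 1e-105, 380.0),
    ("contig1", 222, "heat shock protein 70 [Mus musculus]", 1e-90, 330.0),
    ("contig1", 333, "hypothetical protein", 2e-12, 70.5),
    ("contig2", 444, "protein kinase C [Danio rerio]", 1e-40, 160.0),
    ("contig2", 555, "hypothetical protein kinase", 0.5, 28.1),
]
conn.executemany("INSERT INTO hits VALUES (?, ?, ?, ?, ?)", rows)


def filter_by_evalue(conn, max_evalue):
    """Statistical filtering: GI numbers of hits at or below an e-value cutoff."""
    cur = conn.execute(
        "SELECT gi FROM hits WHERE evalue <= ? ORDER BY evalue", (max_evalue,)
    )
    return [r[0] for r in cur]


def top_n_per_query(conn, n):
    """Grouping: the n highest bit-score hits for each query sequence."""
    cur = conn.execute(
        "SELECT query_id, gi FROM hits AS h"
        " WHERE (SELECT COUNT(*) FROM hits AS h2"
        "        WHERE h2.query_id = h.query_id AND h2.bitscore > h.bitscore) < ?"
        " ORDER BY query_id, bitscore DESC",
        (n,),
    )
    return cur.fetchall()


def text_search(conn, must_have, must_not_have):
    """Ad hoc search: definition contains one term AND NOT another."""
    cur = conn.execute(
        "SELECT gi FROM hits WHERE definition LIKE ? AND definition NOT LIKE ?",
        ("%" + must_have + "%", "%" + must_not_have + "%"),
    )
    return [r[0] for r in cur]


print(filter_by_evalue(conn, 1e-10))
print(top_n_per_query(conn, 2))
print(text_search(conn, "kinase", "hypothetical"))
```

Passing user-supplied terms as query parameters rather than pasting them into the SQL string matters for a tool that is open to remote research groups.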
Overview of BlastQuest Architecture
Web Browser: Client Side GUI
Web Server
Inputs: BLAST XML document; Assembly ACE file
Tier 3: Client Interface Module, XML Loader, ACE Loader
Tier 2: SQL Constructor, JDBC
Tier 1: MySQL DBMS
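To make the loader step concrete, here is a small sketch of what an XML loader might do: read a BLAST XML document, pull out one row per hit, and insert the rows into a database. It is not the BlastQuest code. The element names follow NCBI's classic BLAST XML output, the sample document and table are invented, and sqlite3 stands in for the JDBC and MySQL layers named in the diagram.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Invented, heavily trimmed BLAST XML document (classic NCBI element names).
SAMPLE_XML = """<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_query-def>contig1</Iteration_query-def>
      <Iteration_hits>
        <Hit>
          <Hit_id>gi|111|ref|XP_000001.1|</Hit_id>
          <Hit_def>heat shock protein 70 [Homo sapiens]</Hit_def>
          <Hit_hsps><Hsp>
            <Hsp_bit-score>380.0</Hsp_bit-score>
            <Hsp_evalue>1e-105</Hsp_evalue>
          </Hsp></Hit_hsps>
        </Hit>
        <Hit>
          <Hit_id>gi|333|ref|XP_000003.1|</Hit_id>
          <Hit_def>hypothetical protein</Hit_def>
          <Hit_hsps><Hsp>
            <Hsp_bit-score>70.5</Hsp_bit-score>
            <Hsp_evalue>2e-12</Hsp_evalue>
          </Hsp></Hit_hsps>
        </Hit>
      </Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>"""


def rows_from_blast_xml(xml_text):
    """Yield (query, hit id, definition, e-value, bit score) for each hit."""
    root = ET.fromstring(xml_text)
    for iteration in root.iter("Iteration"):
        query = iteration.findtext("Iteration_query-def")
        for hit in iteration.iter("Hit"):
            hsp = hit.find("Hit_hsps/Hsp")  # first HSP of the hit only
            if hsp is None:
                continue
            yield (
                query,
                hit.findtext("Hit_id"),
                hit.findtext("Hit_def"),
                float(hsp.findtext("Hsp_evalue")),
                float(hsp.findtext("Hsp_bit-score")),
            )


def load(conn, xml_text):
    """Insert every parsed hit into the (hypothetical) hits table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS hits (query_id TEXT, hit_id TEXT,"
        " definition TEXT, evalue REAL, bitscore REAL)"
    )
    conn.executemany(
        "INSERT INTO hits VALUES (?, ?, ?, ?, ?)", rows_from_blast_xml(xml_text)
    )
    return conn.execute("SELECT COUNT(*) FROM hits").fetchone()[0]


conn = sqlite3.connect(":memory:")
print(load(conn, SAMPLE_XML), "hits loaded")
```

Keeping the parser separate from the database code mirrors the tiered layout: the same table could be fed by an ACE loader without touching the query side.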
Welcome to BlastQuest
Choose among client projects
Results Selection
Grouped Results
Ad Hoc Text Searching
Internal BLAST Searches
Viewing a Gene Ontology Tree
KEGG Classification



Kyoto Encyclopedia of Genes and Genomes
“Wiring diagrams of life”
KEGG Protein Networks





Metabolic pathways
Regulatory pathways
Molecular complexes
Network-network relations
Network-environment relations
Common to both
Unique to non-Unigene
Unique to Unigene
Bacterial Genome
Annotation Workbench
Another simple idea driven by necessity
Start
Project Summary
Contig Browser
Contig summary
Physical map linked to annotation
Simple problems.
Simple solutions.
Why are these simple ideas important?
Human Genome Project
HGP drove innovation in biotechnology
2 major technological benefits
  stimulated development of high throughput methods
  reliance on computational tools for data mining and visualization of biological information
The HGP and the cost of DNA sequencing
“Finished” quality DNA sequence
  A DNA base call is considered finished if the probability of base call error is less than 1 in 10,000, also known as phred > 40
  Contiguous DNA sequence of phred > 40 is usually achieved by multifold sequencing of the same region; typically 7-10X coverage
1985: $10 per finished base
2001: $1 per 10 finished bases
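The link between "1 in 10,000" and "phred 40" comes from the standard definition of the phred quality score, Q = -10 log10(P), where P is the probability that the base call is wrong. The short sketch below works through that conversion and the cost change between the two years quoted above. The function names are illustrative only.

```python
import math


def phred_from_error(p_error):
    """Phred quality score for a given base-call error probability."""
    return -10.0 * math.log10(p_error)


def error_from_phred(q):
    """Base-call error probability for a given phred quality score."""
    return 10.0 ** (-q / 10.0)


# An error rate of 1 in 10,000 corresponds to phred 40.
print(phred_from_error(1 / 10_000))
print(error_from_phred(40))

# Cost per finished base, from the figures on the slide.
cost_1985 = 10.0        # $10 per finished base
cost_2001 = 1.0 / 10    # $1 per 10 finished bases
print("fold decrease 1985 to 2001:", cost_1985 / cost_2001)
```

Each 10-point step in phred score is a tenfold drop in error probability, so phred 30 would be 1 error in 1,000 and phred 20 would be 1 in 100.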
GenBank August 22, 2005
Public Collections of DNA and RNA Sequence Reach 100 Gigabases
Trends in the cost efficiency of DNA sequencing
Shendure, J., Mitra, R., Varma, C., and Church, G.M. (2004) “Advanced sequencing technologies: methods and goals” Nature Reviews Genetics 5:335
454 Life Sciences Corporation
The first commercial, massively parallel, DNA sequencing technology
454 Technology

Cyclic-array sequencing on in vitro amplified DNA molecules
  Individual molecules must be amplified to give a detectable sequencing signal
  Instead of biological cloning, we amplify individual DNA fragments on solid state beads using PCR
  Instead of terminator-based sequencing, pyrosequencing is used to determine nucleotide order
    “sequencing by synthesis”
454 Process Overview
The bottom line …


efficiency of DNA sequencing increased 100X
cost per finished base declined 10- to 30-fold
… so what happens next?



The “democratization” of large-scale genomic biology
Many projects are now possible that were once fiscally inviable
We must deal with basic local data management and information issues or lose this opportunity
If you thought bioinformatics was important before
By terminator-based sequencing we @ UF produce 60-70 Mbp per year
By synthesis-based sequencing we produce 60-70 Mbp per day
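A back-of-the-envelope calculation shows what this change means for data management. It reads the figures above as 60-70 Mbp per year and 60-70 Mbp per day, and it assumes the new instrument runs every day of the year, which is an idealization rather than a reported number.

```python
DAYS_PER_YEAR = 365  # assumes a run every day; real uptime would be lower

terminator_mbp_per_year = (60, 70)  # low and high estimates from the slide
synthesis_mbp_per_day = (60, 70)

# Annual output if the daily rate were sustained all year.
synthesis_mbp_per_year = tuple(d * DAYS_PER_YEAR for d in synthesis_mbp_per_day)
fold_increase = synthesis_mbp_per_year[0] / terminator_mbp_per_year[0]

print("synthesis-based output, Mbp per year:", synthesis_mbp_per_year)
print("fold increase over terminator-based output:", fold_increase)
```

Under that assumption a year of sequence arrives every day, so any annotation step that needs hand curation per project becomes the bottleneck.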
