Managing Gene Annotation Information: The search is over
Managing Gene
Annotation Information
the search is over
… one problem solved
… another begins
observations from a foot soldier in the bio-information (r)evolution
Bill Farmerie -- ICBR Genomics Group
Interdisciplinary Center for Biotechnology Research
Established at the University of Florida in 1987 by the Florida Legislature
centralized organization of biomedical core facilities
supporting biotechnology-based research
How did information management become my problem?
1998 GSAC Miami Beach
Why should I care about this problem?
Because my paycheck depends on it.
Avoid fatal failure in the funding loop.
PI has $ for large gene-based project
Core Lab generates data
Downstream data management & analysis
PI writes papers, gives talks
Other PIs think this looks like a good idea
PI applies for new funding
From Sequence to Function
The genomic sequence identifies the 'parts'
The next trick is understanding gene function
Post genomic era = functional genomics
Critical concept: genes of similar sequence
may have similar functions
Inferring function for a new gene begins with searching for its nearest neighbor (or homolog) of known function
BLAST
Most common starting point for gene identification
Similarity search of sequence repository (GenBank)
Output
Calculated scores (bit score and e-value)
Text string (definition line), ID Reference Tag
Sequence alignment
Advantages
Fast algorithm, very good at finding close homologs
Cluster and Grid-enabled versions available
Disadvantages
Not good at finding distant relatives
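The two calculated scores in BLAST output are related. A minimal sketch, assuming the standard Karlin-Altschul relation E = m * n * 2^(-S'), where S' is the bit score and m and n are the query and database lengths. BLAST itself uses edge-corrected "effective" lengths, so reported e-values will differ somewhat from this approximation. The function names are illustrative only.

```python
import math


def expected_hits(bit_score: float, query_len: int, db_len: int) -> float:
    """Approximate e-value for a bit score: E = m * n * 2**(-S').

    Uses raw lengths; BLAST applies effective-length corrections (assumption
    of this sketch: those corrections are ignored).
    """
    return query_len * db_len * 2.0 ** (-bit_score)


def bit_score_for_evalue(evalue: float, query_len: int, db_len: int) -> float:
    """Invert the relation: the bit score needed to reach a given e-value."""
    return math.log2(query_len * db_len / evalue)


if __name__ == "__main__":
    # A 1,000-residue query against a 1-gigabase database (made-up sizes).
    print(expected_hits(50.0, 1_000, 10**9))
    print(bit_score_for_evalue(1e-10, 1_000, 10**9))
```

Because the search space m * n appears in the formula, the same bit score yields a larger e-value as the database grows, which is one reason to filter on both scores.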
HMMER
HMMER developed by Sean Eddy
Uses Hidden Markov Models
Statistical models constructed from alignment of conserved protein regions (Pfam)
Searches unknown protein query sequence against a database of protein family models
Advantages
Superior to BLAST for discovering more distant homology relations
GRID enabled
Disadvantages
More computationally intensive than BLAST
OK! Great!
Sequencing done.
Homology searches complete.
But how will I deliver this
information to scientists spread all
over campus, and their worldwide
collaborators?
Search for summarizing information that restores sanity
CTGGGTTCTGTTCGGGATCCCAGT
CACAGGGACAATGGCGCATTCATA
TGTCACTTCCTTTACCTGCCTGGA
GAGGTGTGGCCACAGACTCTGGTG
GCTGCGAACGGGGACTCTGACCCA
GTCGACTTTATCGCCTTGACGAAG
AACCAGATTGACGTTGTCGGAGTC
GGAACTCACCTGGTCACCTGTACG
ACTCAGCCGTCGCTGGGTTGCGTT
CTGACACGCGGCTCCTCGTGTGGA
GCCGAAACCCCGACAAAAGCGAAG
GAGAGAGTGAGTATGAGCAGGCGG
BlastQuest
A small idea with a big mission
BlastQuest Requirements
Accessible to research groups at remote locations
Privacy-constrained sharing of results among scientists
Selective browsing of BLAST homology search results
Selective data filtering on statistical criteria
e-value or bit score
Selective data grouping on criteria such as GI number or a defined number of top-scoring results
Ad hoc search capability on user-determined criteria:
text terms
boolean logic
From a computational point of view, BlastQuest is embarrassingly simple. However, it solved our problem of information storage, selective retrieval, and distribution.
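The requirements above map naturally onto simple SQL queries. The following is a rough sketch only, not BlastQuest's actual schema or code: it uses Python's built-in sqlite3 as a stand-in for a database server, and the table layout, column names, and sample rows are all hypothetical.

```python
import sqlite3

# Hypothetical hit records: (query_id, gi, definition, bit_score, evalue)
HITS = [
    ("contig1", 111, "heat shock protein 70 [Homo sapiens]", 250.0, 1e-65),
    ("contig1", 222, "hypothetical protein [Mus musculus]", 48.0, 2e-4),
    ("contig1", 333, "heat shock protein 70 [Danio rerio]", 231.0, 3e-60),
    ("contig2", 444, "actin, cytoplasmic [Homo sapiens]", 180.0, 1e-45),
    ("contig2", 555, "unnamed protein product", 30.0, 5.0),
]


def build_db() -> sqlite3.Connection:
    """Create an in-memory table of hits (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE hits (query_id TEXT, gi INTEGER, definition TEXT, "
        "bit_score REAL, evalue REAL)"
    )
    conn.executemany("INSERT INTO hits VALUES (?, ?, ?, ?, ?)", HITS)
    return conn


def filter_by_evalue(conn, max_evalue):
    """Selective filtering on a statistical criterion (e-value cutoff)."""
    sql = ("SELECT query_id, gi FROM hits WHERE evalue <= ? "
           "ORDER BY query_id, evalue")
    return conn.execute(sql, (max_evalue,)).fetchall()


def top_n_per_query(conn, n):
    """Grouping: keep the n top-scoring hits for each query sequence."""
    sql = (
        "SELECT h.query_id, h.gi FROM hits h "
        "WHERE (SELECT COUNT(*) FROM hits h2 "
        "       WHERE h2.query_id = h.query_id "
        "         AND h2.bit_score > h.bit_score) < ? "
        "ORDER BY h.query_id, h.bit_score DESC"
    )
    return conn.execute(sql, (n,)).fetchall()


def text_search(conn, include, exclude):
    """Ad hoc boolean text search: definition has `include` AND NOT `exclude`."""
    sql = ("SELECT gi FROM hits WHERE definition LIKE ? "
           "AND definition NOT LIKE ? ORDER BY gi")
    rows = conn.execute(sql, (f"%{include}%", f"%{exclude}%"))
    return [row[0] for row in rows]


if __name__ == "__main__":
    db = build_db()
    print(filter_by_evalue(db, 1e-10))
    print(top_n_per_query(db, 1))
    print(text_search(db, "heat shock", "Danio"))
```

Each requirement becomes one parameterized query, which is consistent with the point that the approach is computationally simple.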
Overview of BlastQuest Architecture
Web Browser
Client Side GUI
Web
Server
BLAST
XML
document
Tier 3
Client Interface Module
XML Loader
ACE Loader
Assembly
ACE file
Tier 2
SQL Constructor
JDBC
Tier 1
MySQL DBMS
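The XML Loader step in this architecture can be illustrated with a short sketch that flattens a BLAST XML document into rows ready for insertion into a database table. This is written in Python for brevity, whereas the architecture above uses JDBC. The element names follow the NCBI BLAST XML output format as commonly documented; the sample document, the choice to keep only the first HSP per hit, and the row layout are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Tiny hand-written sample in the style of NCBI BLAST XML output (made-up data).
SAMPLE_XML = """<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_query-def>contig1</Iteration_query-def>
      <Iteration_hits>
        <Hit>
          <Hit_id>gi|111|ref|NP_000001.1|</Hit_id>
          <Hit_def>heat shock protein 70</Hit_def>
          <Hit_hsps>
            <Hsp>
              <Hsp_bit-score>250.0</Hsp_bit-score>
              <Hsp_evalue>1e-65</Hsp_evalue>
            </Hsp>
          </Hit_hsps>
        </Hit>
      </Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>"""


def load_hits(xml_text):
    """Return one (query, hit_id, definition, bit_score, evalue) row per hit."""
    root = ET.fromstring(xml_text)
    rows = []
    for iteration in root.iter("Iteration"):
        query = iteration.findtext("Iteration_query-def")
        for hit in iteration.iter("Hit"):
            hsp = hit.find("Hit_hsps/Hsp")  # first HSP only (assumption)
            if hsp is None:
                continue  # skip hits with no alignment record
            rows.append((
                query,
                hit.findtext("Hit_id"),
                hit.findtext("Hit_def"),
                float(hsp.findtext("Hsp_bit-score")),
                float(hsp.findtext("Hsp_evalue")),
            ))
    return rows


if __name__ == "__main__":
    for row in load_hits(SAMPLE_XML):
        print(row)
```

Once the hits are flat rows, the lower tiers only need to issue ordinary SQL against them.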
Welcome to BlastQuest
Choose among client projects
Results Selection
Grouped Results
Ad Hoc Text Searching
Internal BLAST Searches
Viewing a Gene Ontology Tree
KEGG Classification
Kyoto Encyclopedia of Genes and Genomes
“Wiring diagrams of life”
KEGG Protein Networks
Metabolic pathways
Regulatory pathways
Molecular complexes
Network-network relations
Network-environment relations
Common to both
Unique to non-Unigene
Unique to Unigene
Bacterial Genome
Annotation Workbench
Another simple idea driven by necessity
Start
Project Summary
Contig Browser
Contig summary
Physical map linked to annotation
Simple problems.
Simple solutions.
Why are these simple ideas important?
Human Genome Project
HGP drove innovation in biotechnology
2 major technological benefits
stimulated development of high throughput methods
reliance on computational tools for data mining and visualization of biological information
The HGP and the cost of
DNA sequencing
“finished” quality DNA sequence
a DNA base call is considered finished if the probability of
base call error is less than 1 in 10,000
also known as phred > 40
contiguous DNA sequence of phred > 40 usually achieved by multifold sequencing of the same region; typically 7-10X coverage
1985: $10 per finished base
2001: $1 per 10 finished bases
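The link between "1 error in 10,000" and "phred 40" comes from the standard phred definition, Q = -10 * log10(P), where P is the probability that a base call is wrong. A minimal sketch of the conversion in both directions (function names are illustrative):

```python
import math


def phred_quality(error_prob: float) -> float:
    """Phred quality score for a base-call error probability."""
    return -10.0 * math.log10(error_prob)


def error_probability(quality: float) -> float:
    """Base-call error probability for a phred quality score."""
    return 10.0 ** (-quality / 10.0)


if __name__ == "__main__":
    print(phred_quality(1.0 / 10_000))  # error rate of 1 in 10,000
    print(error_probability(20))        # a common lower-quality threshold
```

An error probability below 1 in 10,000 therefore corresponds to a score above 40, and each additional 10 points means a tenfold lower error rate.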
GenBank August 22, 2005
Public Collections of DNA and RNA Sequence Reach 100 Gigabases
Trends in the cost efficiency of DNA sequencing
Shendure, J., Mitra, R., Varma, C., and Church, G.M. (2004) "Advanced sequencing technologies: Methods and Goals" Nature Reviews Genetics 5:335
454 Life Sciences
Corporation
The first commercial, massively
parallel, DNA sequencing
technology
454 Technology
Cyclic-array sequencing on in vitro amplified DNA
molecules
individual molecules must be amplified to give a
detectable sequencing signal
Instead of biological cloning, we amplify individual
DNA fragments on solid state beads using PCR
Instead of terminator-based sequencing, pyrosequencing is used to determine nucleotide order
“sequencing by synthesis”
454 Process Overview
The bottom line …
efficiency of DNA sequencing increased 100X
cost per finished base declined 10- to 30-fold
… so what happens next?
The “democratization” of large-scale genomic biology
Many projects are now possible that were once fiscally
inviable
We must deal with basic local data management and
information issues or lose this opportunity
If you thought
bioinformatics was
important before
By terminator-based sequencing we @ UF
produce 60-70 Mbp per year
By synthesis-based sequencing we produce 60-70 Mbp per day
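To put the two throughput figures on the same scale, here is a back-of-the-envelope sketch. The number of operating days per year is an assumption, not a figure from the talk, and the result is raw output rather than finished sequence.

```python
def fold_increase(mbp_per_day: float, mbp_per_year: float,
                  operating_days: int = 365) -> float:
    """Ratio of annualized per-day output to the older per-year output.

    operating_days is a hypothetical parameter: how many days per year
    the instrument is assumed to run.
    """
    return mbp_per_day * operating_days / mbp_per_year


if __name__ == "__main__":
    # Same nominal number (60 Mbp), per day versus per year.
    print(fold_increase(60, 60))                      # running every day
    print(fold_increase(60, 60, operating_days=250))  # working days only
```

Under either assumption the increase in raw output is in the hundreds-fold range, which is why local data management becomes the bottleneck.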