Analysis Environments
For Functional Genomics
Bruce R. Schatz
CANIS Laboratory
Institute for Genomic Biology
University of Illinois at Urbana-Champaign
www.canis.uiuc.edu
Bioinformatics Seminar
Department of Computer Science, UIUC
February 25, 2005
What are Analysis Environments
Functional Analysis
Find the underlying Mechanisms
Of Genes, Behaviors, Diseases
Comparative Analysis
Top-down data mining (vs Bottom-up)
Multiple Sources especially literature
Building Analysis Environments
Manual by Humans
Interaction: user navigation
Classification: collection indexing
Automatic by Computers
Federation: search bridges
Integration: results links
Trends in Analysis Environments
Central versus Distributed Viewpoints
The 90s Pre-Genome
Entrez (NIH NCBI) versus
WCS (NSF Arizona)
The 00s Post-Genome
GO (NIH curators) versus
BeeSpace (NSF Illinois)
Pre-Genome Environments
Focused on Syntax pre-Web
WCS (Worm Community System)
Search words across sources
Follow links across sources
Words automatic, Links manual
Towards Uniform Searching
Post-Genome Environments
Focused on Semantics post-Web
BeeSpace (Honey Bee Inter Space)
Navigate concepts across sources
Integrate data across sources
Concepts automatic, Links automatic
Towards Question Answering
Paradigm Shift
Towards Dry-Lab Biology, Walter Gilbert (Jan 1991)
“The new paradigm, now emerging, is that all the 'genes' will be known
(in the sense of being resident in databases available electronically), and
that the starting point of a biological investigation will be theoretical.
An individual scientist will begin with a theoretical conjecture, only then
turning to experiment to follow or test that hypothesis. ...
To use this flood of knowledge [the total sequence of the human and
model organisms], which will pour across the computer networks of the
world, biologists not only must become computer-literate, but also
change their approach to the problem of understanding life. ...”
The Coming of Informational Science
Correlation of Information across Sources
NCBI Entrez
Community Systems
Formal
literature (information retrieval)
data (database management)
knowledge (hypertext annotations)
results (electronic mail)
news (bulletin boards)
Informal
browse and share all the knowledge of a community
Worm Community System
WCS Information:
Literature: BIOSIS, MEDLINE, newsletters, meetings
Data: Genes, Maps, Sequences, strains, cells
WCS Functionality
Browsing: search, navigation
Filtering: selection, analysis
Sharing: linking, publishing
WCS: 250 users at 50 labs across Internet (1991)
WCS: Molecular
WCS: Cellular
WCS: Publishing
WCS: Linking
WCS invokes gm
WCS vis-à-vis acedb
Towards the Interspace
from Objects to Concepts
from Syntax to Semantics
Infrastructure is Interaction with Abstraction
Internet is packet transmission across computers
Interspace is concept navigation across repositories
THE THIRD WAVE OF NET EVOLUTION
CONCEPTS
OBJECTS
PACKETS
LEVELS OF INDEXES
FORMAL (manual): Technology, Engineering, Electrical, IEEE
INFORMAL (automatic): communities, groups, individuals
COMPUTING CONCEPTS
’92: 4,000 (molecular biology)
’93: 40,000 (molecular biology)
’95: 400,000 (electrical engineering)
’96: 4,000,000 (engineering)
’98: 40,000,000 (medicine)
Simulating a New World
Obtain discipline-scale collection
MEDLINE from NLM, 10M bibliographic abstracts
human classification: Medical Subject Headings
Partition discipline into Community Repositories
4 core terms per abstract for MeSH classification
32K nodes with core terms (classification tree)
Community is all abstracts classified by core term
40M abstracts containing 280M concepts
concept spaces took 2 days on NCSA Origin 2000
Simulating World of Medical Communities
10K repositories with > 1K abstracts
(1K w/ > 10K)
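The partitioning step on this slide can be sketched in a few lines. This is only a toy illustration: the abstract ids, the core terms, and the function names are hypothetical, and the real system worked from MeSH headings on MEDLINE. Each abstract is placed in one repository per core term, which would be consistent with 10M abstracts at 4 core terms each giving roughly 40M repository entries.

```python
from collections import defaultdict

def partition_by_core_terms(abstracts):
    """Group abstract ids into one repository per core term.

    `abstracts` maps an abstract id to its list of core terms
    (hypothetical toy data, not real MeSH assignments).
    """
    repositories = defaultdict(set)
    for abstract_id, core_terms in abstracts.items():
        for term in core_terms:
            repositories[term].add(abstract_id)
    return dict(repositories)

def count_large(repositories, min_size):
    """Count repositories holding more than `min_size` abstracts."""
    return sum(1 for members in repositories.values() if len(members) > min_size)

toy_abstracts = {
    "a1": ["Arthritis, Rheumatoid", "Analgesics"],
    "a2": ["Analgesics", "Gastrointestinal Hemorrhage"],
    "a3": ["Arthritis, Rheumatoid", "Analgesics"],
}
repos = partition_by_core_terms(toy_abstracts)

# An abstract counts once per repository it belongs to.
total_assignments = sum(len(members) for members in repos.values())
print(total_assignments, count_large(repos, 1))
```

The same counting, applied with thresholds of 1K and 10K abstracts, is what the "10K repositories with > 1K abstracts" line summarizes.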
Interspace Remote Access Client
Navigation in MEDSPACE
For a patient with Rheumatoid Arthritis
Find a drug that reduces the pain (analgesic)
but does not cause stomach (gastrointestinal) bleeding
Choose Domain
Concept Search
Concept Navigation
Retrieve Document
Navigate Document
Retrieve Document
Informational Science
Computational Science is widely accepted as
The Third Branch of Science
(beyond Experimental and Theoretical)
Genes are Computed, Proteins are Computed,
Sequence “equivalences” are Computed.
Informational Science is coming to be accepted as
The Fourth Branch of Science
Based on Information Science technologies for
Functional Analysis across Information Sources
Post-Genome Informatics I
Comparative Analysis within the
Dry Lab of Biological Knowledge
Classical Organisms have Genetic Descriptions.
There will be NO more classical organisms beyond
Mice and Men, Worms and Flies, Yeasts and Weeds.
Must use comparative genomics on classical organisms
Via sequence homologies and literature analysis.
Post-Genome Informatics II
Functional Analysis within the
Dry Lab of Biological Knowledge
Automatic annotation of genes to standard
classifications, e.g. Gene Ontology via homology on
computed protein sequences.
Automatic analysis of functions to scientific
literature, e.g. concept spaces via text extractions.
Thus must use functions in literature descriptions.
Informatics: From Bases to Spaces
data Bases support genome data
e.g. FlyBase has sequences and maps
Genes annotated by GO and linked to literature
e.g. BeeBase has computed annotations
Protein homologies for similar Genes via GO
information Spaces support biomedical literature
e.g. BeeSpace uses automatically generated
conceptual relationships to navigate functions
Gene Ontology
Gene Symbol   Data Source   Full Name
…
Calca         MGI           calcitonin-related polypeptide
Cat-1         WormBase      None
Cat-2         WormBase      None
CCKR-Human    UniProt       Cholecystokinin receptor
CRF2-Rat      UniProt       Corticotropin releasing factor
Crhr2         RGD           corticotrophin releasing hormone
Egl-10        WormBase      None
Egl-30        WormBase      None
Feh-1         WormBase      None
For           FlyBase       None
Conceptual Navigation in BeeSpace
Behavioral Biologist
Bee Literature
Molecular Biology Literature
Brain Gene Expression Profiles
Brain Region Localization
Neuroscience Literature
Neuroscientist
Molecular Biologist
Bee Genome
FlyBase, WormBase
BeeSpace Analysis Environment
Build Concept Space of Biomedical Literature
for Functional Analysis of Bee Genes
-Partition Literature into Community Collections
-Extract and Index Concepts within Collections
-Navigate Concepts within Documents
-Follow Links from Documents into Databases
Locate Candidate Genes in Related Literatures
then follow links into Genome Databases
Question Answering
Foraging
Behaviour                                     Organism                 Gene    Molecular Function            Reference
Rover vs sitter phenotype                     Drosophila melanogaster  for     Protein kinase G              8
Roamer vs dweller phenotype                   C. elegans               egl-4   Protein kinase G              16
Division of labour: age at onset of foraging  Apis mellifera           for     Protein kinase G              9
Division of labour: age at onset of foraging  Apis mellifera           mvl     Mn transporter                19
Division of labour: foraging-related?         Apis mellifera           per     Transcription cofactor        68
Division of labour: foraging-related?         Apis mellifera           ache    Acetylcholine esterase        69
Division of labour: foraging-related?         Apis mellifera           IP(3)K  Inositol signaling            70
Foraging specialization: nectar vs. pollen    Apis mellifera           pkc     Protein kinase C              71
Social feeding                                Drosophila melanogaster  dpnf    Neuropeptide Y (NPY) homolog  21
Social feeding (aggregation)                  C. elegans               npr-1   Receptor for NPY              22, 23
Functional Phrases
<gene> encodes <chemical>
Sokolowski and colleagues demonstrated in Drosophila
melanogaster that the foraging gene (for) encodes a cGMP
dependent protein kinase (PKG).
The dg2 gene encodes a cyclic guanosine monophosphate
(cGMP)- dependent protein kinase (PKG).
<chemical> affects/causes <behavior>
Thus, PKG levels affected food-search behavior.
cGMP treatment elevated PKG activity and caused foraging
behavior.
<gene> regulates <behavior>
Amfor, an ortholog of the Drosophila for gene, is involved in
the regulation of age at onset of foraging in honey bees.
This idea is supported by results for malvolio (mvl), which
encodes a manganese transporter and is involved in regulating
Drosophila feeding and age at onset of foraging in honey bees.
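As a rough illustration of how the first two templates could be matched against sentences like those quoted above, here is a minimal sketch using regular expressions. The patterns and the `extract` function are hypothetical and tuned only to these example sentences; they are not the BeeSpace extraction method, which the slides describe as natural language processing. The third template is left out because its example sentences use looser wording ("is involved in the regulation of").

```python
import re

# Hypothetical surface patterns for two of the slide's templates.
# They are tuned to the example sentences only, not a general parser.
ENCODES = re.compile(
    r"(?P<gene>\w+)\)? (?:gene )?encodes (?:an? )?(?P<chemical>[^.]+)"
)
AFFECTS = re.compile(
    r"(?P<chemical>\w+) (?:levels |activity )?(?:and )?"
    r"(?:affect(?:s|ed)?|caus(?:es|ed)) (?P<behavior>[^.]+)"
)

def extract(sentence):
    """Return (relation, subject, object) triples found in one sentence."""
    triples = []
    m = ENCODES.search(sentence)
    if m:
        triples.append(("encodes", m.group("gene"), m.group("chemical")))
    m = AFFECTS.search(sentence)
    if m:
        triples.append(("affects/causes", m.group("chemical"), m.group("behavior")))
    return triples

sentences = [
    "The dg2 gene encodes a cyclic guanosine monophosphate (cGMP)- "
    "dependent protein kinase (PKG).",
    "Thus, PKG levels affected food-search behavior.",
    "cGMP treatment elevated PKG activity and caused foraging behavior.",
]
for s in sentences:
    print(extract(s))
```

Chaining the triples (gene encodes chemical, chemical affects behavior) is what would link a gene such as for to foraging behavior through PKG.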
BeeSpace Software Implementation
Natural Language Processing
Identify noun and verb phrases
Recognize biological entities
Compute biological relations
Statistical Information Retrieval
Compute statistical contexts
Support conceptual navigation
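The slides do not say which statistic is used for the statistical contexts. As an assumption for illustration only, the sketch below uses raw document co-occurrence counts between concepts and ranks the neighbors of a concept by those counts. The documents, concept names, and function names are hypothetical.

```python
from collections import Counter
from itertools import combinations

def cooccurrence(docs):
    """Count how often each unordered pair of concepts shares a document."""
    pairs = Counter()
    for concepts in docs:
        for a, b in combinations(sorted(set(concepts)), 2):
            pairs[(a, b)] += 1
    return pairs

def neighbors(pairs, concept, top=3):
    """Rank the concepts that co-occur with `concept`, most frequent first."""
    scores = Counter()
    for (a, b), n in pairs.items():
        if a == concept:
            scores[b] += n
        elif b == concept:
            scores[a] += n
    return [c for c, _ in scores.most_common(top)]

# Toy documents, each reduced to its extracted concepts (assumed input).
docs = [
    ["foraging", "protein kinase G", "for"],
    ["foraging", "protein kinase G", "cGMP"],
    ["foraging", "malvolio"],
]
pairs = cooccurrence(docs)
print(neighbors(pairs, "foraging", top=1))
```

Navigation then amounts to moving from a concept to its strongest neighbors and retrieving the documents that contain them.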
Data Integration (FlyBase Gene)
D. melanogaster gene foraging, abbreviated as for, is reported here. It
has also been known in FlyBase as BcDNA:GM08338, CG10033 and
l(2)06860. It encodes a product with cGMP-dependent protein kinase
activity (EC:2.7.1.-) involved in protein amino acid phosphorylation
which is a component of the cellular_component unknown. It has been
sequenced and its amino acid sequence contains an eukaryotic protein
kinase, a protein kinase C-terminal domain, a tyrosine kinase catalytic
domain, a serine/threonine protein kinase family active site, a
cAMP-dependent protein kinase and a cGMP-dependent protein kinase. It has
been mapped by recombination to 2-10 and cytologically to 24A2--4. It
interacts genetically with Csr. There are 27 recorded alleles: 1 in vitro
construct (not available from the public stock centers), 25 classical
mutants (3 available from the public stock centers) and 1 wild-type.
Mutations have been isolated which affect the larval nerve terminal and
are behavioral, pupal recessive lethal, hyperactive, larval
neurophysiology defective and larval neuroanatomy defective. for is
discussed in 80 references (excluding sequence accessions), dated
between 1988 and 2003. These include at least 6 studies of mutant
phenotypes, 2 studies of wild-type function, 3 studies of natural
polymorphisms and 7 molecular studies. Among findings on for
function, for activity levels influence adult olfactory trap response to a
food medium attractant. Among findings on for polymorphisms, the
frequency of for R and for s strains in three natural populations are
studied to determine the contribution of the local parasitoid community
to the differences in for R and for s frequencies.
BeeSpace Information Sources
Biomedical Literature
-Medline (medicine)
-Biosis (biology)
-Agricola, CAB Abstracts, Agris (agriculture)
Model Organisms (heredity)
-Gene Descriptions (FlyBase, WormBase)
Natural Histories (environment)
-BeeKeeping Books (Cornell Library, Harvard Press)
Medical Concept Spaces (1998)
Medical Literature (Medline, 10M abstracts)
Partition with Medical Subject Headings (MeSH)
Community is all abstracts classified by core term
40M abstracts containing 280M concepts
computation is 2 days on NCSA Origin 2000
Simulating World of Medical Communities
10K repositories with > 1K abstracts
(1K with > 10K)
Biological Concept Spaces (2005)
Compute concept spaces for All of Biology
BioSpace across entire biomedical literature
50M abstracts across 50K repositories
Use Gene Ontology to partition literature into
biological communities for functional analysis
GO same scale as MeSH but adequate coverage?
GO light on social behavior (biological process)
Paradigm Shift
Dissecting Human Disease, Victor McKusick (Feb 2001)
Structural genomics      → Functional genomics
Genomics                 → Proteomics
Map-based gene discovery → Sequence-based gene discovery
Monogenic disorders      → Multifactorial disorders
Specific DNA diagnosis   → Monitoring susceptibility
Analysis of one gene     → Analysis of multi-gene pathways
Gene action              → Gene regulation
Etiology (mutation)      → Pathogenesis (mechanism)
One species              → Several species
Needles and Haystacks
Genes
Honey Bees have 13K genes
Perhaps 100 have known functions
Paths
Perhaps 30K protein families exist
KEGG has 200 known pathways
Statistical Clustering for Interactive Discovery
Across Two Orders of Magnitude!
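The "two orders of magnitude" can be checked directly from the figures on this slide. The sketch below assumes the intended comparisons are total genes against genes with known functions, and protein families against known pathways; the variable names are ours.

```python
import math

# Figures as stated on the slide (all approximate: "perhaps").
genes_total, genes_known = 13_000, 100
families_total, pathways_known = 30_000, 200

# Gap between what exists and what is characterized, in powers of ten.
gene_gap = math.log10(genes_total / genes_known)
path_gap = math.log10(families_total / pathways_known)
print(round(gene_gap, 2), round(path_gap, 2))  # 2.11 2.18
```

Both ratios (130 and 150) sit just above 100, so each gap is slightly more than two orders of magnitude.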
Concept Switching
In the Interspace…
each Community maintains its own repository
Switching is navigating Across repositories
use your specialty vocabulary to search
another specialty
CONCEPT SWITCHING
“Concept” versus “Term”
set of “semantically” equivalent terms
Concept switching
region to region (set to set) match
Semantic region
term
Concept Space
Concept Space
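A set-to-set match between semantic regions can be sketched as follows. The slide only says the match is region to region; scoring the overlap with the Jaccard coefficient is an assumption made here for illustration, and the term sets, concept names, and function names are hypothetical.

```python
def jaccard(a, b):
    """Overlap of two term sets: |a & b| / |a | b| (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def switch_concept(source_region, target_space):
    """Return the target concept whose term set best overlaps the source region."""
    best = max(target_space, key=lambda name: jaccard(source_region, target_space[name]))
    return best, jaccard(source_region, target_space[best])

# A concept is a set of semantically equivalent terms (toy data).
source_region = {"foraging", "food search", "feeding"}
target_space = {
    "feeding behavior": {"feeding", "food search", "ingestion"},
    "locomotion": {"walking", "flight"},
}
print(switch_concept(source_region, target_space))  # ('feeding behavior', 0.5)
```

Matching whole regions rather than single terms is what lets a user search another specialty with their own vocabulary: no single term has to appear in both repositories with the same meaning.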
Biomedical Session
Categories and Concepts
Concept Switching
Document Retrieval
Future Technologies
Concept Switching
Dynamic Indexing
Spreading activation, type tagging
On-the-fly collections, during session
Path Matching
Aggregating indexes, many repositories
THE NET OF THE 21st CENTURY
Beyond Objects to Concepts
Beyond Search to Analysis
Problem Solving via Cross-Correlating
Multimedia Information across the Net
Every community has its own special library
Every community does semantic indexing
The Interspace approximates Cyberspace
Interactive Functional Analysis
BeeSpace will enable users to navigate a uniform space of
diverse databases and literature sources for hypothesis
development and testing, with a software system beyond a
searchable database, using literature analyses to discover
functional relationships between genes and behavior.
Genes to Behaviors
Behaviors to Genes
Concepts to Concepts
Clusters to Clusters
Navigation across Sources
XSpace Information Sources
Organize Genome Databases (XBase)
Compute Gene Descriptions from Model Organisms
Partition Scientific Literature for Organism X
Compute XSpace using Semantic Indexing
Boost the Functional Analysis from Special Sources
Collecting Useful Data about Natural Histories
e.g. CowSpace Leverage in AIPL Databases
Towards the Interspace
The Analysis Environment technology is
GENERAL!
BirdSpace? BeeSpace?
PigSpace? CowSpace?
BehaviorSpace? BrainSpace?
BioSpace
… Interspace