Virtual Organizations: Building
Interdisciplinary Collaborations
Dan Reed
Chancellor’s Eminent Professor
Vice Chancellor for IT
University of North Carolina at Chapel Hill
Director, Renaissance Computing Institute
Acknowledgments
• Funding agencies
– NIH
• Carolina Center for Exploratory Genetic Analysis (CCEGA)
– NSF
• TeraGrid Science Gateways
– State of North Carolina
• RENCI and ancillary Bioportal support
• RENCI staff
– Alan Blatecky, Kevin Gamiel, Xiaojun Guan
– Clark Jefferies, Howard Lander
– John Magee, Ruth Marinshaw, Jeff Tilson
– Lavanya Ramakrishnan
• And a host of others …
21st Century Challenges
• The three fold way
– theory and scholarship
– experiment and measurement
– computation and analysis
• Supported by
– distributed, multidisciplinary teams
– multimodal collaboration systems
– distributed, large scale data sources
– leading edge computing systems
– distributed experimental facilities
• Socialization and community
– multidisciplinary groups
– geographic distribution
– new enabling technologies
– creation of 21st century IT infrastructure
• sustainable, multidisciplinary communities
• “Come as you are” response
[Figure labels: Theory, Experiment, Computation]
Exemplar 21st Century Challenges
• Population growth in sensitive areas
– severe weather sensitivity
• national impact
– geobiology and environment
– economics and finance
– sociology and policy
• Economics and health care
– longitudinal public health data
• environmental interactions
– genetic susceptibility
• heart disease, cancer, Alzheimer's
– privacy and insurance
– public policy and coordination
Mean Onset of Alzheimer’s Disease
• apolipoprotein (apo)
– apoE2, apoE3 and apoE4 alleles
• on chromosome 19
– apoE4 allele
• 40% to 60% of Alzheimer's patients
• not the only cause for Alzheimer’s
• apo gene inheritance
– ~25% inherit 1 copy of apoE4 allele
• Alzheimer's risk increases 4X
– 2% inherit 2 copies of apoE4 allele
• Alzheimer's risk increases 10X
Source: Alan Roses, GSK
[Figure: proportion of each genotype unaffected (0 to 1.0) versus age at onset (60 to 85), with curves labeled 2/3, 3/3, 2/4, 3/4 and 4/4]
Big Questions
[Figure labels: DNA sequence (TATACAGTACCGT); promoter; message; sequence annotation; protein sequence and regulation; protein structure; homology based protein structure prediction; molecular simulations; protein/enzyme function; data integration; multi-protein machines; metabolic pathways and regulatory networks; pathway simulations; network analysis; bacteria and cells; organs, organisms and ecologies]
Genetics and Disease Susceptibility
[Figure labels: Phenotype 1, Phenotype 2, Phenotype 3, Phenotype 4; ethnicity, environment, age, gender; identify genes; pharmacokinetics, metabolism, endocrine, physiology, immune, proteome, transcriptome, morphometrics; biomarker signatures; predictive disease susceptibility]
Source: Terry Magnuson, UNC
PITAC Report Contents
• Computational Science: Ensuring America’s Competitiveness
1. A Wake-up Call: The Challenges to U.S. Preeminence and Competitiveness
2. Medieval or Modern? Research and Education Structures for the 21st Century
3. Multi-decade Roadmap for Computational Science
4. Sustained Infrastructure for Discovery and Competitiveness
5. Research and Development Challenges
• Two key appendices
– Examples of Computational Science at Work
– Computational Science Warnings – A Message Rarely Heeded
• Available at www.nitrd.gov
Life Science Lessons from Astronomy
• Historically, discoveries accrued to those
– with access to unique data
– who built next generation telescopes
• Two things changed
– growing costs and complexity of telescopes
– emergence of whole sky surveys
• The result – virtual astronomy
– discovering significant patterns
• analysis of rich image/catalog databases
– understanding complex astrophysical systems
• integrated data/large numerical simulations
{Inter}national Virtual Observatory
Cluster Galaxy Morphology Analysis Portal
1. User selects a cluster
2. Look up cluster in internally stored catalog
3. X-ray and optical images retrieved via SIA interface
4. User launches distributed analysis
5. Initial galaxy catalog generated via Cone Search
6. Image cutout pointers merged into catalog
7. Morphological parameters calculated on grid for each galaxy
8. User downloads final table and images for analysis & visualization
[Figure labels: user’s machine (web browser); Chandra SIA; Skyview SIA; DSS SIA; CNOC SIA; NED Cone Search; CADC CNOC Cone Search; Morphology Calculation Service; clusters]
Source: Ray Plante, NCSA
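The cone search step in this workflow selects catalog entries that lie within a given angular radius of a sky position. The following is a minimal sketch of that selection only, not the Virtual Observatory protocol itself; the catalog rows, coordinates and function names are hypothetical and chosen purely for illustration.

```python
import math

def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees between two sky positions given in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    # The haversine form stays numerically stable for small separations.
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))

def cone_search(catalog, ra, dec, radius_deg):
    """Return the catalog rows within radius_deg of the position (ra, dec)."""
    return [row for row in catalog
            if angular_separation_deg(ra, dec, row["ra"], row["dec"]) <= radius_deg]

# Hypothetical catalog entries (made-up names and coordinates).
CATALOG = [
    {"name": "galaxy-a", "ra": 150.00, "dec": 2.00},
    {"name": "galaxy-b", "ra": 150.05, "dec": 2.03},
    {"name": "galaxy-c", "ra": 151.50, "dec": 2.00},
]

matches = cone_search(CATALOG, ra=150.0, dec=2.0, radius_deg=0.1)
print([row["name"] for row in matches])
```

A real service would answer the same question over a network interface and a far larger catalog; the sketch only shows the geometric filter.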
The Bioinformatics Challenge
• Challenge
– the rise of quantitative biology
• burgeoning bioinformatics data
– complex analysis and modeling problems
– education and training in new technologies
• Reality
– diverse tools with idiosyncratic interfaces
• steep learning curves
– software development by diverse groups
– distributed databases with diverse metadata
• Need
– integrated, easy-to-use toolset with standard interfaces
– extensible mechanisms that hide idiosyncrasies
– tool and bioinformatics training
• The solution
– bioinformatics infrastructure and coupled training
Need: Simple, Easy-To-Use Tools
“Genome. Bought the
book. Hard to read.”
Eric Lander
Web and Social Processes
• Google
– it’s a search engine, it’s a verb, …
• Blogs
– published self-expression
• Instant Messenger
– social networks
• Wireless messaging
– semi-synchronous
• Internet commerce
– the dot.com boom/bust
– EBay, Amazon
• Spam, phishing, …
– anti-social behavior
Benefits of Standards
• Interoperability
• Separation of concerns
• Reuse
• Independence
• Dependability
• Sharing
• Commonality
• Shared knowledge base
– knowledge reuse
– simplification (one hopes)
Grids of All Flavors
What’s A Grid/Web Service?
It’s been 12 years!
• Web: uniform access to documents (http://)
• Grid/Web Services: flexible, high-performance access to resources and services for distributed communities (http://)
– software catalogs
– computers
– sensors and instruments
– colleagues
– data archives
Grid History: I-Way at SC’95
• A prototype national infrastructure
– 17 sites, connected by
• vBNS and six other ATM networks
– 60 applications
• Features
– I-POPs for site access
– Kerberos authentication
– manual scheduling
– distributed communication libraries
• Experiences
– led to Globus Grid toolkit
• Concurrent industry needs
– led to web services for B2B interoperation
Web Services: “Commercial Grids”
• From browser-centric to service-centric
– from human-computer to computer-computer
– structured negotiation and response
• Workflow creation and management
– end-to-end service negotiation
– inter-organizational interaction
• Prerequisites
– metadata standard for service descriptions
– standard communication mechanisms
– resource discovery and registration
eBay Web Services Architecture
• Over 40% of eBay's listings are now via API calls
Source: IBM
Web Services: A Definition
A web service is … designed to support interoperable machine-to-machine interaction over a network. It has an interface
described in a machine-processable format (specifically WSDL).
Other systems interact … [using] its description using SOAP-messages, … using HTTP with an XML serialization ....
W3C Working Draft, August 2003
[Figure: Service Provider, Service Broker and Service Consumer, linked by Publish, Locate and Invoke; labels: SOAP, WSDL, UDDI]
• SOAP (Simple Object Access Protocol)
• WSDL (Web Services Description Language)
• UDDI (Universal Description, Discovery and Integration)
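To make the SOAP part of this definition concrete, here is a minimal sketch of the XML message a consumer might send when invoking a service. The envelope namespace is the SOAP 1.1 one; the service namespace, the operation name "Search" and its parameters are hypothetical. The sketch builds and parses the message locally and does not perform the HTTP exchange.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
SERVICE_NS = "urn:example:sequence-search"             # hypothetical service namespace

def build_request(operation, params):
    """Build a SOAP 1.1 request envelope and return it as UTF-8 bytes."""
    ET.register_namespace("soap", SOAP_NS)
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    call = ET.SubElement(body, f"{{{SERVICE_NS}}}{operation}")
    for name, value in params.items():
        ET.SubElement(call, f"{{{SERVICE_NS}}}{name}").text = str(value)
    return ET.tostring(envelope, encoding="utf-8")

def parse_request(message):
    """Recover (operation, params) from a message produced by build_request."""
    envelope = ET.fromstring(message)
    call = envelope.find(f"{{{SOAP_NS}}}Body")[0]  # first child of Body is the call
    local = lambda tag: tag.split("}", 1)[1]       # drop the "{namespace}" prefix
    return local(call.tag), {local(child.tag): child.text for child in call}

message = build_request("Search", {"query": "TATACAGTACCGT", "maxHits": 5})
operation, params = parse_request(message)
print(message.decode("utf-8"))
print(operation, params)
```

In a deployed service the operation names and parameter types would come from the provider's WSDL description rather than being hard-coded as they are here.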
Technology Push
Source: Gartner Group
European myGrid Architecture
Source: www.mygrid.org
The Bioinformatics Challenges
• Complex, multilevel models
– integration and in silico designs
• Information visualization
– complexity and scale
• Data models and ontologies
– community definition
• Data federation, storage and management
– shared access and support
• User access portals
– web-based tool and service interfaces
• Packaging, distribution and deployment
– community building
Multilevel Cellular Models
• Signaling networks
– environmental triggers and behavior
• e.g., cell lifecycle
– different pathways in each tissue type
• Metabolic networks
– measurable products in pathway
– many systems are steady state
– negative feedback leads to stabilization
• Protein interaction networks
– localization of proteins that interact for function
– protein-protein interactions for specific actions
• Gene regulatory networks
– many things affect gene product concentration
– nucleic-nucleic, protein-nucleic interactions
• Computing, physics, engineering and biology
– control theory, mathematical models, phase spaces
– from biological cartoons to predictive models
• e.g., microRNAs and gene expression controls
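The claim that negative feedback leads to stabilization can be illustrated with a toy model. The sketch below assumes a single product x whose production rate falls as x accumulates and which decays at a constant rate; the rate law and all parameter values are illustrative assumptions, not taken from the talk. Forward Euler integration from two different starting concentrations settles at the same steady state.

```python
import math

def simulate(x0, production=10.0, half_point=1.0, decay=1.0, dt=0.01, steps=2000):
    """Integrate dx/dt = production / (1 + x / half_point) - decay * x with forward Euler."""
    x = x0
    for _ in range(steps):
        # Production is inhibited by the product itself (negative feedback).
        rate = production / (1.0 + x / half_point) - decay * x
        x += dt * rate
    return x

# With the default parameters the steady state solves x**2 + x - 10 = 0.
steady_state = (math.sqrt(41.0) - 1.0) / 2.0

from_low = simulate(x0=0.0)
from_high = simulate(x0=8.0)
print(round(from_low, 4), round(from_high, 4), round(steady_state, 4))
```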
Biological Models
• Simulation and prediction
– structures and dynamics
• Reasoning and discovery
– reverse engineering
[Figure: temporal axis (seconds, ticks 10^-12 through 10^3) with bond motion, catalysis, diffusion, transcription, translation, growth & division; spatial axis (nm^3, ticks 10^0 through 10^12) with metabolites, proteins, ribosomes, prokaryotes, eukaryotes]
Biophysical and Environmental Modeling
Airway/flow
Mucus
Cilia
Cell biochemistry
and structure
Proteomics
Genomics
Source: Ric Boucher, UNC
Data Heterogeneity and Complexity
Genomic, proteomic, transcriptomic, metabolomic, protein-protein interactions, regulatory bionetworks, alignments, disease, patterns and motifs, protein structure, protein classifications, specialist proteins (enzymes, receptors), …
[Figure labels: phenotype; disease; clinical trial; drug; genome sequence; gene sequence; gene expression; proteome; protein structure; protein sequence; protein homology; P-P interactions]
Source: Carole Goble (Manchester)
Sensor Data Overload
Source: Chris Johnson, Utah
Art Toga, UCLA
Source: Robert Morris, IBM
• High resolution brain imaging
– 4.5 petabytes (PB) per brain
RENCI: What Is It?
• Statewide objectives
– create broad benefit in a competitive world
– engage industry, academia, government and citizens
• Four target areas
– public benefit
• supporting urban planning, disaster response, …
– economic development
• helping companies and people with innovative ideas
– research engagement across disciplines
• catalyzing new projects and increasing success
• building multidisciplinary partnerships
– education and outreach
• providing hands on experiences and broadening participation
• Mechanisms and approaches
– partnerships and collaborations
– infrastructure as needed to accomplish goals
Carolina Center for Exploratory Genetic
Analysis (CCEGA)
[Figure: virtuous cycle. Labels: interoperable data management; faculty, staff & students; driving problems; analysis techniques; extant data models; promoting mutual awareness; experimental genetics portal; statistical & computational techniques; interdisciplinary research & education]
CCEGA Participants
• Coordination team
– Dan Reed, RENCI
– Terry Magnuson, CCGS
– Alan Blatecky, RENCI
– Kirk Wilhelmsen, CCGS
• Eleven departments/institutes
– Biostatistics
– Cancer Center
– Genetics
– Computer Science
– Epidemiology
– Genetics
– Health Science Library
– Information and Library Science
– Pharmacy
– RENCI
– Statistics
• Campus wide support
– from many sources
• Project participants
– Brad Hemminger, Information & Library Science
– James Evans, Genetics
– Kevin Gamiel, RENCI
– Xiaojun Guan, RENCI
– Barrie Hays, Health Science Library
– Clark Jefferies, RENCI
– Ethan Lange, Genetics
– Andrew Nobel, Statistics
– Karen Mohlke, Genetics
– Kari North, Epidemiology
– Susan Paulsen, Computer Science
– Fernando Manuel Pardo, Genetics
– Charles Perou, Cancer Center
– Lavanya Ramakrishnan, RENCI
– Jan Prins, Computer Science
– Patrick Sullivan, Genetics
– Lisa Susswein, Cancer Center
– David Threadgill, Genetics
– Alexander Tropsha, Pharmacy
– K.T.L. Vaughan, Health Science Library
– Fred Wright, Biostatistics
– Wei Wang, Computer Science
– Fei Zou, Biostatistics
Data: From Lab and Clinic to Analysis
• Independent data management
– data security
– version control
– redundancy
– controlled access
• NIH CCEGA
– Carolina Center for Exploratory Genetic Analysis
[Figure labels: Clinical; Laboratory; ELSI; Integration & Informatics; Analysis]
Source: Brad Hemminger, UNC
Data Management and Information Viz
[Figure labels: published domain literature; GenBank; taxonomy; ontology; annotation; DB schema; annotated domain literature; Information Mining Module; Information Visualization Module]
From SNPs to HapMap
• Single Nucleotide Polymorphisms (SNPs)
– one in ~1200 bases differs across individuals
– SNPs act as markers to locate genes
• Common groups of SNPs are shared
– i.e., form a haplotype
• HapMap data sources
– 90 Yoruba individuals (30 trios) from Nigeria (YRI)
– 90 individuals (30 trios) of European descent from Utah (CEU)
– 45 Han Chinese individuals from Beijing (CHB)
– 45 Japanese individuals from Tokyo (JPT)
• ~3,500,000 SNPs typed
– basis for association studies for disease identification
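The idea that common groups of SNPs are shared as haplotypes can be sketched by treating each chromosome as a string of alleles at a few SNP sites and counting identical strings. The data below are made up for illustration and are not HapMap data; the function names are likewise hypothetical.

```python
from collections import Counter

# Hypothetical phased chromosomes: each string holds the allele at four SNP sites.
CHROMOSOMES = ["ACGT", "ACGT", "ATGT", "ACGT", "GCGA", "ATGT", "ACGT", "GCGA"]

def haplotype_frequencies(chromosomes):
    """Map each distinct allele string (haplotype) to its sample frequency."""
    counts = Counter(chromosomes)
    total = len(chromosomes)
    return {haplotype: n / total for haplotype, n in counts.most_common()}

def polymorphic_sites(chromosomes):
    """Return the site indices where more than one allele is observed."""
    n_sites = len(chromosomes[0])
    return [i for i in range(n_sites)
            if len({chrom[i] for chrom in chromosomes}) > 1]

frequencies = haplotype_frequencies(CHROMOSOMES)
sites = polymorphic_sites(CHROMOSOMES)
print(frequencies)
print(sites)
```

In this toy sample only three haplotypes occur even though three variable sites could in principle combine in more ways, which is the property that lets a few marker SNPs stand in for a whole group.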
CCEGA HapMap Simulator
• Synthetic data
– disease models
– model testing
• mining bakeoffs
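As a rough illustration of what synthetic data under a disease model might look like, the sketch below draws genotypes at one SNP and assigns disease status with a multiplicative per-allele risk. This is not the CCEGA simulator; the model, parameter values and function names are all assumptions made for illustration.

```python
import random

def simulate_cohort(n, risk_allele_freq=0.3, baseline_risk=0.05,
                    relative_risk=3.0, seed=0):
    """Return a list of (risk_allele_copies, affected) pairs for n individuals."""
    rng = random.Random(seed)  # fixed seed so the synthetic data set is reproducible
    cohort = []
    for _ in range(n):
        # Two independent draws, one per chromosome copy.
        copies = sum(rng.random() < risk_allele_freq for _ in range(2))
        # Assumed multiplicative model: each risk allele multiplies the risk.
        risk = min(1.0, baseline_risk * relative_risk ** copies)
        cohort.append((copies, rng.random() < risk))
    return cohort

def risk_allele_frequency(cohort, affected):
    """Risk allele frequency among individuals with the given disease status."""
    copies = [c for c, status in cohort if status == affected]
    return sum(copies) / (2 * len(copies))

cohort = simulate_cohort(20000)
case_freq = risk_allele_frequency(cohort, affected=True)
control_freq = risk_allele_frequency(cohort, affected=False)
print(round(case_freq, 3), round(control_freq, 3))
```

Because the true model is known, an analysis method can be scored on whether it recovers the planted signal, which is the kind of model testing and comparison the slide refers to.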
Carolina Bioportal
• Three overlapping target groups
– undergraduate education
– graduate education and research
– academic/industrial research
• Features
– access to common bioinformatics tools
– extensible toolkit and infrastructure
• OGCE and National Middleware Initiative (NMI)
• leverages emerging international standards
– remotely accessible or locally deployable
– packaged and distributed with documentation
• National reach and community
– TeraGrid deployment
• science gateway
• Education and training
– hands-on workshops
• clusters, Grids, portals and bioinformatics
Distributed Grid and Web Services
• Grid portals: launch, configure and control
– application interface
– workflow service
– application instances
• Open Grid Service Architecture layer
– registries and name binding
– data management service
– security
– policy
– reservations and scheduling
– administration & monitoring
– accounting service
– logging
– Grid orchestration
– event/message service
• Open Grid Service Infrastructure (web service component model)
• Resource layer (from PCs to supercomputers)
– online instruments
Source: Dennis Gannon, Indiana
Bioportal Architecture
• OGCE toolkit
– used by cyberinfrastructure projects
• LEAD, NEES, PACI, DOE, TeraGrid …
[Figure labels: www.ncbioportal.org; PISE application XML description; interface generator; HTML files; Bioportal Velocity files; application processing; command files; application databases; remote file access; job submission; job records; job history database; user profile; OGCE user databases; authentication, Grid credential; MyProxy; GridFTP; Gatekeeper; local cluster]
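The architecture above pairs an XML description of each application with generated interfaces and command files. The sketch below shows the general idea of turning such a description plus user input into a command line. The XML layout, the tool name "seqsearch" and its flags are hypothetical and are not the actual PISE schema or any real program.

```python
import shlex
import xml.etree.ElementTree as ET

# Hypothetical application description (not the real PISE format).
DESCRIPTION = """
<application name="seqsearch">
  <parameter name="program" flag="-p" type="choice" default="nucleotide">
    <choice>nucleotide</choice>
    <choice>protein</choice>
  </parameter>
  <parameter name="database" flag="-d" type="string" default="nt"/>
  <parameter name="expect" flag="-e" type="float" default="10.0"/>
</application>
"""

def build_command(description_xml, user_values):
    """Combine the XML description with user input into an argument list."""
    app = ET.fromstring(description_xml.strip())
    argv = [app.get("name")]
    for param in app.findall("parameter"):
        name = param.get("name")
        value = user_values.get(name, param.get("default"))
        choices = [choice.text for choice in param.findall("choice")]
        if choices and value not in choices:
            raise ValueError(f"{name} must be one of {choices}, got {value!r}")
        if param.get("type") == "float":
            value = str(float(value))  # rejects non-numeric input
        argv += [param.get("flag"), str(value)]
    return argv

command = build_command(DESCRIPTION, {"program": "protein", "expect": "1e-5"})
print(shlex.join(command))
```

Keeping the tool-specific details in the description means the same generator can serve many applications, and the same description could also drive form generation for the web interface.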
Putting the Technologies Together
• NC Bioportal
• OGCE Toolkit (Grid middleware)
• Chef (collaboration/standard portlets)
• Jakarta Jetspeed (enterprise portal)
• Turbine (web app framework)
• Velocity (template engine)
• VMC
• PISE (XML wrapper)
• Tomcat (Apache servlet container)
• Bio applications
• Grid portlets, CoG
• Databases
Community Software Toolkit: Lessons
• NSF PACI Alliance “In a Box” toolkits
– cluster software (aka OSCAR)
– Grid infrastructure (aka NMI)
– Access Grid for distributed collaboration
– tiled display walls for visualization
• Distribution materials
– software and training materials
• CDs and web
• Community workshops and training
– Linux Clusters Institute
– MSI HPC workshops
– hands on training
• Lowering the entry barrier
– usage and deployment
• Bioportal distribution
– workshops, tutorials
– training materials
– road shows
NC Bioportal: What’s Next
• Engagement
– workshops, experiences and deployments
• Infrastructure
– dynamic job scheduling across multiple sites
– migration to OGCE 2.0
– fully automated database updates
– workflow construction and processing
• Portal tool suite
– expanded applications and databases
• phylogeny, morphology, microarray analysis, …
• Training materials
– additional modules based on user feedback
– workshop materials packaged for self-study
• Leverage national presence
– TeraGrid/NCSA bioinformatics portal
The Vision of Grid/Web Services
“… Behold, the people is
one, and they have all one
language; and this they
begin to do: and now
nothing will be restrained
from them, which they have
imagined to do.”
– Book of Genesis
Pieter Bruegel
The Tower of Babel (1563)
Interdisciplinary Collaborations
• Appropriate reward structures
– well-matched time constants
• Intellectual equality
– balanced recognition of contributions
• Research/infrastructure distinctions
– timelines and people needs differ
• Confidentiality and openness
– academic/industry collaboration perspectives
• Intellectual property
– background IP and differential disciplinary models
Some Thoughts on the Future
• Grids/web services are not a panacea
– we have seen this movie before
• standards debates can be endless
• make new mistakes, not the same old ones
– code is shifted from modules to interfaces
• Danger of “Death by CS Abstraction”
– “all problems can be solved by another level of indirection”
• Appropriate decomposition is a challenge
– performance, usability, flexibility
• Generality and extensibility really matter
– incremental aggregation and interoperability
– data management and federation
• Better questions, not just private capabilities
– limited by creativity not resources
The Cambrian Explosion
• Most phyla appear
– sponges, archaeocyathids, brachiopods
– trilobites, primitive mollusks, echinoderms
• Indeed, most appeared quickly!
– Tommotian and Atdabanian
– as little as five million years
• Lessons for computing
– it doesn’t take long when conditions are right
• raw materials and environment
– leave fossil records if you want to be remembered!
Thanks for the Invitation!