Using existing products and technologies for scientific research


Using Existing Products And
Technologies For Scientific Research
Dan Fay
Director – North America
Technical Computing
Microsoft Corporation
Can “Here And Now” Technologies
Reduce Time To Insight?
Can “Business” Tools and techniques
for dealing with
be used in scientific research to raise the bar
and allow researchers to be scientists and
not computer scientists?
The Problem For The e-Scientist
[Diagram: facts flow in from Experiments & Instruments, Other Archives, Literature, and Simulations; questions go in and answers come out]
Data ingest
Managing a petabyte
Common schema
How to organize it?
How to reorganize it?
How to coexist and cooperate
with others?
Data Query and Visualization tools
Support/training
Performance
Execute queries in a minute
Batch (big) query scheduling
Computational Modeling
Persistent Distributed Data
Workflow, Data Mining & Algorithms
Interpretation & Insight
Real-world Data
Persistent Distributed Storage
Visual Programming
Distributed Computation
Interoperability & Legacy Support via Web Services
Searching & Visualization
Live Documents
Reputation & Influence
The Scripps Research Institute
Peter Kuhn Lab
Research Focus
Early detection and therapy management
of cancer patients
Modulation of protein interactions for
therapeutic intervention
Projects
Cancer bioengineering partnership
Structural Proteomics of SARS
TSRI Goals
Improve Collaboration
Complex experimental data
Within Scripps and with outside organizations
Capture more data electronically
Images
Discussions
Structured Data
To provide project data and decisions in
context – e.g. annotations on 2D and
3D objects
Leverage existing productivity applications
The Collaborative
Molecular Environment
Application
Allows the user to establish context among projects, entities, and annotations
Easily collect data from multiple sources (notes, files, URLs, Screen Clipping)
Provides for Annotation on pictures, data, and molecules
Very simple reporting (not yet implemented)
Windows Presentation Foundation
Application container
Controls for annotating 2D and 3D images
Rapid application environment for images and 3D data
SharePoint 2007
Supports the Application with standard Web Services
Provides the security context for project teams and external collaborators
Enables search of annotations in order to find relevant images
Provides a single repository for collaboration with internal and external (SSL) collaborators
Office 2007
Captures metadata to describe application context (image, investigator, etc.)
External Research &
Programs
C-ME And 2D Annotation
Annotating Protein Data
Research
Integrate
Data acquisition from source systems and integration
Data transformation and synthesis
Analyze
Data enrichment, with business logic, hierarchical views
Data discovery via data mining
Report
Data presentation and distribution
Data access for the masses
Comparison Of Soil Moisture
[Figure: two scatter plots of soil water content, Vaira against Tonzi, one for Water Content at 5 cm and one for Water Content at 20 cm; both axes run from 0.0 to 0.6]
Linear fits shown on the plots: y = 0.4712x (R² = 0.7039) and y = 0.5854x (R² = 0.9163)
Thanks to
Gretchen Miller – UC Berkeley
& Catharine Van Ingen (MSR)
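The trend lines on the soil moisture plots have the form y = kx, that is, a straight line forced through the origin. As a rough sketch of how such a fit can be computed, the Python below estimates the slope by least squares and reports an R² value. The function names and the sample numbers are hypothetical and are not the Vaira or Tonzi measurements. Tools differ in how they define R² for a zero-intercept fit, so the value here (1 minus residual sum of squares over the mean-centered total sum of squares) is only one common convention.

```python
def fit_through_origin(xs, ys):
    """Least-squares slope k for the model y = k * x (no intercept)."""
    sxx = sum(x * x for x in xs)
    if sxx == 0:
        raise ValueError("need at least one non-zero x value")
    return sum(x * y for x, y in zip(xs, ys)) / sxx


def r_squared(xs, ys, k):
    """1 - SS_res / SS_tot, with SS_tot taken around the mean of ys."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - k * x) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


# Hypothetical water-content readings at two sites (illustration only).
site_x = [0.10, 0.20, 0.30, 0.40, 0.50]
site_y = [0.06, 0.11, 0.17, 0.25, 0.28]

slope = fit_through_origin(site_x, site_y)
print(f"y = {slope:.4f}x, R2 = {r_squared(site_x, site_y, slope):.4f}")
```

A spreadsheet trend line with the intercept set to zero does the same job interactively; the sketch only makes the arithmetic behind the reported slope visible.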
Other Applications
Temperature at North American Sites
[Figure: average temperature in °C (axis from -10 to 30) plotted against latitude (axis from 20 to 80)]
Dynameomics
Goal: Perform MD simulations of
representatives of all fold families (unique structures)
----maximize sampling of fold and
sequence space
Protein Folding Protein Unfolding (more tractable}
In Protein Databank
~17,000 structures
>35,000 domains
CATH, SCOP, Dali fold classification methods
consensus → 1130 non-redundant protein folds
Thanks to
Valerie Daggett – Univ of Washington
Day et al, Prot. Sci., 2003
www.dynameomics.org
IGG-like: 1fna, fibronectin
Rossmann: 3chy, CheY
TIM barrel: 1ypi, TIM
Jelly Roll: 1sac, SAP
α/β-plait: 1ris, S6
3-helix bundle: 1enh, engrailed homeodomain
Globin: 1a6n, myoglobin
4-helix bundle: 2a0b, phosphotransfer domain
β-grasp: 1pgb, protein G
EF-hand: 4icb, calbindin
OB fold: 1mjc, CspA
IGG-like: 1e65, azurin
Cytochrome C: 1hrc, cytochrome C
Top 30 folds
Represent ~50%
of the structures
in Protein Databank
Trypsin-like serine protease: 1qq4, α-lytic protease
Thioredoxin-like: 1ev4, GST A1-1
Rossmann: 1ght, transposon γδ resolvase
SH3 barrel: 1shg, α-spectrin SH3
knottin: 1snb, neurotoxin BMK M8
C-type lectin: 2afp, type II antifreeze prot.
FAD/NAD(P) binding domain: 1ebd, oxidoreductase
lipocalin: 1ifc, fatty acid binding prot.
trefoil: 1tld, bovine trypsin
Zn finger: 2adr, Zn finger (ADR)
snake toxin: 1ntn, cobra neurotoxin
acid protease: 1g6l, HIV-1 protease
Rossmann: 2pth, peptidyl tRNA hydrolase
GST (C-term): 1ev4, GST A1-1
IL-8 like (OB): 1bf4, Sso7d
PLP dep. transferase: 1e5f, methionine γ-lyase
Laminin-like: 1edm, coagulation factor IX
Day et al, Prot. Sci., 2003
Example simulation system (1fna)
Fibronectin, a representative from the top ranked fold (IGG like), is prepared for
molecular dynamics (MD) simulation by adding hydrogens (not shown) to the
PDB structure and solvating it with explicit waters (red and white in ball & stick).
MD is the time-dependent integration of the classical equations of motion for
a molecular system. Our MD methods have been qualitatively and quantitatively
benchmarked against experiment for more than 50 proteins in the past 15 years.
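To make "time-dependent integration of the classical equations of motion" concrete, here is a minimal sketch using the velocity Verlet scheme, a common integrator in MD codes. It moves a single particle on a one-dimensional spring rather than a solvated protein, and it is not the simulation code used for Dynameomics. All names, the spring force, and the step size are assumptions chosen for illustration.

```python
def velocity_verlet(x, v, force, mass, dt, steps):
    """Integrate m * a = F(x) for one particle in 1-D.

    Returns a list of (time, position, velocity) tuples.
    """
    trajectory = [(0.0, x, v)]
    a = force(x) / mass
    for i in range(1, steps + 1):
        x = x + v * dt + 0.5 * a * dt * dt   # advance position
        a_new = force(x) / mass              # force at the new position
        v = v + 0.5 * (a + a_new) * dt       # advance velocity with averaged acceleration
        a = a_new
        trajectory.append((i * dt, x, v))
    return trajectory


def spring_force(x, k=1.0):
    """Hypothetical stand-in for a molecular force field: F = -k * x."""
    return -k * x


def total_energy(x, v, mass=1.0, k=1.0):
    """Kinetic plus potential energy of the spring system."""
    return 0.5 * mass * v * v + 0.5 * k * x * x


traj = velocity_verlet(x=1.0, v=0.0, force=spring_force, mass=1.0, dt=0.01, steps=1000)
t_end, x_end, v_end = traj[-1]
print(f"t = {t_end:.2f}, x = {x_end:.4f}, energy = {total_energy(x_end, v_end):.6f}")
```

A production MD run applies the same update to every atom at every step, with forces from an empirical force field and a time step on the order of femtoseconds, which is why nanosecond trajectories consume so many CPU hours.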
Example unfolding simulation (1fna)
10 nanoseconds
21 ns
Denatured (D)
Starting structure
Native (N)
Unfolding of fibronectin, a representative from the top ranked fold (IGG like), from its
biologically active state (N) to a denatured, inactive state (D). During unfolding, it
loses a critical hydrophobic contact in its core between a valine and a tyrosine
Dynameomics
200 targets complete –
6 simulations of each
1 native, 5 unfolding
DOE INCITE Award 3,300,000 CPU
hours
On NERSC
250 GB every 48 hours
Projecting that we will have
100 TB (compressed)
Now a database is required
SQL Server 2005
For Their Purposes
Suite of applications
Relational database engine
OLAP engine
Performance tools
Extraction, Transformation and Loading tools
Integrated development environment
OLAP – On-Line Analytical Processing
MOLAP – Multi-dimensional OLAP
MOLAP For Scientific Analysis
Why A Multidimensional
Database Is Desired
It is efficient: most of the time, only two
or three dimensions are actively in play
The multidimensionality allows user
to select properties of interest and
sideline the rest
Better than SQL by eliminating the need
for complicated joins
Sparsity tolerant
Faster Time to Insight
Better integration with existing Windows infrastructure
Integrated and familiar development environment
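The points above about selecting a few dimensions, sidelining the rest, and tolerating sparsity can be sketched in a few lines of plain Python. This is not the SQL Server Analysis Services API; it only imitates the idea with a dictionary keyed by dimension values. The dimension names, the measure, and every number are hypothetical.

```python
from collections import defaultdict

# Dimension order for every cell key in the cube (hypothetical choice).
DIMENSIONS = ("protein", "temperature_K", "time_ns")

# Sparse cube: only cells that actually hold a measurement are stored.
# The measure could be, say, deviation from the starting structure.
cube = {
    ("1fna", 298, 0): 1.0,
    ("1fna", 298, 10): 2.0,
    ("1fna", 498, 0): 1.0,
    ("1fna", 498, 10): 9.0,
    ("1enh", 498, 10): 7.0,
}


def roll_up(cube, keep, where=None):
    """Average the measure over every dimension not listed in `keep`.

    `where` optionally fixes some dimensions to a single value (a slice).
    """
    where = where or {}
    index = {name: i for i, name in enumerate(DIMENSIONS)}
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cell, value in cube.items():
        if any(cell[index[dim]] != wanted for dim, wanted in where.items()):
            continue  # cell lies outside the requested slice
        key = tuple(cell[index[dim]] for dim in keep)
        sums[key] += value
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


# Keep one dimension in play and sideline the other two.
print(roll_up(cube, keep=("temperature_K",)))
# Slice at one time point, then view by protein.
print(roll_up(cube, keep=("protein",), where={"time_ns": 10}))
```

No join is written anywhere: the caller names the dimensions of interest and the rest are aggregated away, which is the convenience the slide attributes to a multidimensional database.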
Fighting HIV With Computer Science
Nebojsa Jojic and David Heckerman - MSR
A major problem: Over 40 million infected
Drug treatments are effective but are an
expensive life commitment
Vaccine needed for third world countries
Effective vaccine could eradicate disease
Methods from computer science are helping with the
design of vaccine
Machine learning: Finding biological patterns that may stimulate
the immune system to fight the HIV virus
Optimization methods: Compressing these patterns
into a small, effective vaccine
Developed Set Of Specialist Tools
Chromatogram deconvolution
Pathway analysis/association/
causal models
Clustering/Trees (phylo, haplotypes etc.)
Protein binding and folding
Sequence diversity models (epitomes)
Image analysis/classification
Evolution modeling and inference
Epitope prediction
HIV: The Diabolical Virus
The train-and-kill mechanism doesn’t
work for HIV – the virus adapts through
rapid mutation. As soon as the killer
cells get the upper hand, the epitopes
start changing
Strategy
Find peptides or epitopes that occur
commonly across a *population* of HIV
viruses
Compact the known or potential immune
targets into a small vaccine
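The two-step strategy (find peptides common across a population, then compact them into something small) can be illustrated with a toy sketch. It counts how many sequences contain each fixed-length peptide and then greedily picks a few peptides that together cover as many sequences as possible. This is a simple greedy cover chosen for illustration, not the machine learning and optimization methods developed at MSR, and the sequences are made-up strings rather than HIV data.

```python
from collections import Counter


def kmers(seq, k):
    """All distinct length-k substrings of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def peptide_coverage(population, k):
    """For each k-mer, count how many sequences in the population contain it."""
    counts = Counter()
    for seq in population:
        counts.update(kmers(seq, k))  # a set, so each sequence counts once
    return counts


def greedy_select(population, k, budget):
    """Pick up to `budget` k-mers, each covering the most still-uncovered sequences."""
    per_seq = [kmers(seq, k) for seq in population]
    uncovered = set(range(len(population)))
    chosen = []
    while uncovered and len(chosen) < budget:
        counts = Counter()
        for i in uncovered:
            counts.update(per_seq[i])
        if not counts:
            break  # remaining sequences are shorter than k
        # Deterministic tie-break: highest count first, then alphabetical.
        best = min(counts, key=lambda p: (-counts[p], p))
        chosen.append(best)
        uncovered = {i for i in uncovered if best not in per_seq[i]}
    return chosen


# Hypothetical variant sequences (illustration only).
population = ["MGARASVLSG", "MGARASILSG", "MGSRASVLSG", "MGARTSVLKG"]
print(peptide_coverage(population, 4).most_common(3))
print(greedy_select(population, k=4, budget=2))
```

The `budget` parameter plays the role of the "small vaccine" constraint: the fewer peptides allowed, the more each one has to be shared across the population.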
HPC and HIV Vaccine Design
Carl Kadie and David Heckerman
Machine Learning and Applied Statistics
Microsoft Research
Developed Software: 8 or so new
research programs. Most in .NET (C# and
C++/CLI), one in R, one in native C++.
Hardware: Cluster of 25 IBM eServer
326, 2 processors per machine
Cluster Software: Windows Compute
Cluster Server 2003
Fusion Events
Integrated Discovery in Gene Networks
Integrate genome-scale data for discovery
and prediction
Incorporate Disease, multiple organisms
Create applied systems network standard
Thanks to:
Mehmet Dalkilic – Indiana University
Andrews-Dalkilic Laboratory
James Costello (PhD Candidate),
Rupali Patwardhan, Sumit Middha,
Brian Eads, John Colbourne, Scott Beason
Junguk Hur
Microarray Co-Expression
Arbeitman – “Life Cycle of Drosophila”
Parisi – “Incyte Drosophila LifeArray v1.0”
White – “Larval Tissues-Specific Transcripts”
Protein-Protein Interaction
FlyGRID (Fly General Repository for Interaction Datasets)
DIP (Drosophila Interaction Database by CuraGen)
MINT (Molecular Interaction Database)
BIND (Biomolecular Interaction Network Database)
Genetic Interaction
FlyBase
Phenotypic Data
FlyBase
Binding Site
DNase I Footprint Database and Patser3 using PWM
RNAi Screens
Harvard RNAi Screen – Norbert Perrimon
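One simple way to picture integrating the evidence types listed above is to merge the gene pairs from each source into a single undirected network and record which kinds of evidence support each edge. The sketch below does only that. The gene names are placeholders, the function names are hypothetical, and this is not the method used by the Andrews-Dalkilic Laboratory.

```python
from collections import defaultdict


def integrate(sources):
    """Merge gene pairs from several evidence sources into one undirected network.

    `sources` maps an evidence type to a list of (gene, gene) pairs.
    Returns a dict mapping each edge to the set of evidence types behind it.
    """
    network = defaultdict(set)
    for evidence, pairs in sources.items():
        for a, b in pairs:
            edge = tuple(sorted((a, b)))  # (a, b) and (b, a) are the same edge
            network[edge].add(evidence)
    return dict(network)


def supported_by(network, min_sources):
    """Keep only edges backed by at least `min_sources` evidence types."""
    return {edge: ev for edge, ev in network.items() if len(ev) >= min_sources}


# Placeholder data: evidence types follow the slide, gene names are invented.
sources = {
    "co-expression": [("geneA", "geneB"), ("geneC", "geneD")],
    "protein-protein": [("geneB", "geneA"), ("geneB", "geneC")],
    "genetic": [("geneA", "geneB")],
}

network = integrate(sources)
for edge, evidence in sorted(supported_by(network, 2).items()):
    print(edge, sorted(evidence))
```

Edges seen in several independent data types are natural candidates for discovery and prediction, which is the motivation the slides give for combining genome-scale sources.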
xl-caBIG Smart Client
How to give scientists a graphical
interface for accessing cancer Biomedical
Informatics Grid (caBIG) data-services
http://xl-cabig-client.sourceforge.net/
PI - Katarzyna Macura
Johns Hopkins
caBIG In Vivo Imaging Workspace
Subject Matter Expert
xl-caBIG Smart Client
Reproducible Research Document
Broad Institute
Infusion Development
SharePoint Products And Technologies
Microsoft Office SharePoint Server 2007
Collaboration: Docs/tasks/calendars, blogs, wikis, e-mail integration, project management “lite”, Outlook integration, offline docs/lists
Portal: Enterprise Portal template, Site Directory, My Sites, social networking, privacy control
Search: Enterprise scalability, contextual relevance, rich people and business data search
Content Management: Integrated document management, records management, and Web content management with policies and workflow
Business Forms: Rich and Web forms based front-ends, LOB actions, enterprise SSO
Business Intelligence: Server-based Excel spreadsheets and data visualization, Report Center, BI Web Parts, KPIs/Dashboards
Platform Services: Workspaces, Mgmt, Security, Storage, Topology, Site Model
Excel Services
Overview
Excel 2007: Design and author
Publish Spreadsheets
SharePoint platform and Excel Services
Spreadsheets stored in document libraries
Spreadsheet calculation and rendering
External data retrieval and caching
100% calculation fidelity
Browser: View and Interact
High quality web rendering
Zero-footprint
Interactive: Set parameters, sort, filter, explore
Limit to browser access
Excel 2007: Export/Snapshot into Excel
Open in Excel for rich exploration and analysis
Open snapshots
Custom applications: Programmatic Access
Set values, perform calculations, get updated values via web services
Retrieve full workbook file
Development Data Workflow
Collaboration Publications
.NET & Visual Studio
F#
IronPython
SQL Server
SQL Server Analysis Services
Windows Workflow
SharePoint Server 2007
Knowledge Network
Instant Messenger
ConferenceXP
Academic Live, Onfolio, etc
…
Resources
Windows Compute Cluster Server
Tuesday 12-1 Baker
High-Performance Computing with Windows
http://windowshpc.net/
Data mining
www.sqlserverdatamining.com/
Develop without Borders Challenge
www.developwithoutborders.com
Technical Computing Blogs
http://blogs.msdn.com/dan_fay and
http://blogs.msdn.com/eScience
© 2006 Microsoft Corporation. All rights reserved.