Transforming life science
research with advanced
Information Technology at
Indiana University
Craig A. Stewart
[email protected]
University Information Technology Services, Indiana University
© Copyright Trustees of Indiana University 2004
License Terms
• Please cite this presentation as: Stewart, C.A. Transforming life science research
with advanced Information Technology at Indiana University. 2004. Presentation.
Presented at: IBM Life Sciences Symposium (Palisades, NY, 31 May 2004).
Available from: http://hdl.handle.net/2022/14785
• Portions of this document that originated from sources outside IU are shown here
and used by permission or under licenses indicated within this document.
• Items indicated with a © or denoted with a source URL are under copyright and
used here with permission. Such items may not be reused without permission from
the holder of copyright except where license terms noted on a slide permit reuse.
• Except where otherwise noted, the contents of this presentation are copyright
2004 by the Trustees of Indiana University. This content is released under the
Creative Commons Attribution 3.0 Unported license
(http://creativecommons.org/licenses/by/3.0/). This license includes the
following terms: You are free to share – to copy, distribute and transmit the work
and to remix – to adapt the work under the following conditions: attribution – you
must attribute the work in the manner specified by the author or licensor (but not
in any way that suggests that they endorse you or your use of the work). For any
reuse or distribution, you must make clear to others the license terms of this work.
Outline
• IU overview
• Data, data grids, and life sciences (Centralized Life
Science Data Service)
• Computational grids and life science research (HPC
Challenge Award at SC2003)
• Looking forward – Institute of Innovation projects
• Strategy and execution: how did we get here?
• Delivering benefits
I-Light & Abilene
• I-Light
– connects IUB, IUPUI, and
Purdue University, to be
extended within Indiana
– first higher ed owned
statewide network in nation
– The networking infrastructure
for collaboration of many sorts
• Abilene
– Nation’s (current) highest-speed national network
– NOC in Indianapolis
IU in a nutshell
• $2B Annual Budget
• One university with
• 8 campuses
• 90,000 students
• 3,900 faculty
• 878 degree programs
• Nation’s 2nd largest school of medicine
• CIO: Vice President Michael A. McRobbie
• ~$100M annual IT budget
• Indiana Genomics Initiative - $105M Lilly
Endowment, Inc. grant
IBM Research SP
(Aries/Orion Complex)
• 1.005 TeraFLOPS. 1st
University-owned
supercomputer in US to
exceed 1 TFLOPS peak
theoretical processing
capacity.
• Geographically
distributed at IUB and
IUPUI
• Initially ranked 50th, now 170th,
on the Top500
supercomputer list
• An enabler of
collaborative research
using very large scale
computations
Photo: Tyagan Miller. May be reused by IU for noncommercial
purposes. To license for commercial use, contact the photographer
AVIDD
• Analysis and
Visualization of
Instrument-Driven Data
• Distributed Linux cluster.
Three locations: IUN,
IUPUI, IUB
• 2.164 TFLOPS (peak
theoretical), 0.5 TB RAM,
10 TB Disk
• First distributed Linux
cluster to achieve more
than 1 TFLOPS on
Linpack benchmark
Massive Data Storage System
• Reliable and robust
• HPSS (High Performance
Storage System)
• Automatic replication of
data between Indianapolis
and Bloomington, via I-Light.
• 180 TB capacity with
existing tapes; total
capacity of 2.4 PB.
• >100 TB currently in use;
>5 TB for biomedical data
Photo: Tyagan Miller. May be reused by IU for noncommercial
purposes. To license for commercial use, contact the photographer
John-E-Box
Design licensed to central Indiana manufacturer
Data, data grids, and life sciences
Federated Databases
• Federated database approach
focuses on establishing glue
between existing databases
• “Private” databases stay
where they are – under local
control
• “Public” databases may be
replicated locally for
performance
• Queries are entered as SQL,
and the Federated Database
System knows enough about
the structure of the
databases to select data from
the right sources
• Integrate the right data in the
right way
[Figure labels: Lab Results, You!, Clinical Data, Toxicity Data]
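As a minimal sketch of the idea on this slide, one SQL statement can draw on several separately held databases. The example below uses SQLite's ATTACH DATABASE from Python's standard library as a stand-in. It is not DiscoveryLink, and every database name, table, column, and value in it is hypothetical.

```python
import sqlite3

# One connection plays the role of the federation layer; each attached
# database stands in for a separately managed source (all hypothetical).
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS lab")
conn.execute("ATTACH DATABASE ':memory:' AS clinical")

conn.execute("CREATE TABLE lab.results (sample_id TEXT, gene TEXT, expression REAL)")
conn.execute("CREATE TABLE clinical.patients (sample_id TEXT, diagnosis TEXT)")

conn.executemany(
    "INSERT INTO lab.results VALUES (?, ?, ?)",
    [("S1", "TP53", 2.4), ("S2", "TP53", 0.9), ("S3", "BRCA1", 3.1)],
)
conn.executemany(
    "INSERT INTO clinical.patients VALUES (?, ?)",
    [("S1", "tumor"), ("S2", "normal"), ("S3", "tumor")],
)

# A single query joins rows that live in two different databases.
rows = conn.execute(
    """
    SELECT p.diagnosis, r.gene, r.expression
    FROM lab.results AS r
    JOIN clinical.patients AS p ON p.sample_id = r.sample_id
    WHERE r.expression > ?
    ORDER BY r.expression DESC
    """,
    (2.0,),
).fetchall()

for row in rows:
    print(row)
```

The person writing the query names the sources only through their schema prefixes; where each database physically lives is the federation layer's concern.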
IBM’s Federated Database approach
• Based on DiscoveryLink
• Wrappers
– A program that sits between a database and
DiscoveryLink, allowing on-the-fly queries of
the database by DiscoveryLink
– Database registration. Each particular database must
be registered once
– Accessing a calculation as one might a database
(BLAST)
• Parsers
– Programs to import data from one format into
another that permits higher-performance queries
• Accessing a calculation from within a database query
(BLAST, HMMER)
• Accessing a database from within a calculation (SAS)
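A rough illustration of "accessing a calculation from within a database query": the sketch below registers a Python function with SQLite so that SQL can call it per row. This is only an analogy for how a wrapper might expose a tool such as BLAST. It is not DiscoveryLink's wrapper API, and `gc_content` is a hypothetical toy calculation chosen because it needs no external software.

```python
import sqlite3


def gc_content(sequence):
    """Fraction of G and C bases in a DNA string (toy stand-in for a real analysis)."""
    if not sequence:
        return 0.0
    seq = sequence.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


conn = sqlite3.connect(":memory:")
# Expose the calculation to SQL under the name gc_content, taking one argument.
conn.create_function("gc_content", 1, gc_content)

conn.execute("CREATE TABLE sequences (name TEXT, dna TEXT)")
conn.executemany(
    "INSERT INTO sequences VALUES (?, ?)",
    [("seqA", "GGCC"), ("seqB", "ATGC"), ("seqC", "ATAT")],
)

# The calculation is used both as an output column and as a filter.
hits = conn.execute(
    "SELECT name, gc_content(dna) FROM sequences "
    "WHERE gc_content(dna) >= 0.5 ORDER BY name"
).fetchall()
print(hits)
```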
Microarray Data Portal
• Web application and database designed for
annotation and analysis of microarray experiments.
• Annotation: Users set up the experimental design
first, which minimizes the time spent on sample entry
while still capturing the essential information
• Analysis
– Allows user to partition data into groups based
on their annotation.
– Extensive filtering, search, and display options
– T-test, Clustering, SVD, etc.
– Allows different views of data based on
informatics associated with the genes (e.g.
KEGG, GO, Chromosome Location)
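The analysis step described above (partition samples by annotation, then test for a difference) can be sketched as follows. This is not the portal's code: the sample records are invented, and the choice of Welch's unequal-variance t statistic is an assumption, since the slide says only "T-test".

```python
from math import sqrt
from statistics import mean, variance

# Hypothetical annotated measurements for one gene across six samples.
samples = [
    {"id": "s1", "tissue": "tumor", "expression": 4.0},
    {"id": "s2", "tissue": "tumor", "expression": 5.0},
    {"id": "s3", "tissue": "tumor", "expression": 6.0},
    {"id": "s4", "tissue": "normal", "expression": 1.0},
    {"id": "s5", "tissue": "normal", "expression": 2.0},
    {"id": "s6", "tissue": "normal", "expression": 3.0},
]


def partition(records, key):
    """Group expression values by the value of one annotation field."""
    groups = {}
    for record in records:
        groups.setdefault(record[key], []).append(record["expression"])
    return groups


def welch_t(a, b):
    """Welch's t statistic for two independent samples (sample variances)."""
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error


groups = partition(samples, "tissue")
t_stat = welch_t(groups["tumor"], groups["normal"])
print(f"t = {t_stat:.3f}")
```

Turning the statistic into a p-value requires the t distribution, which is left out here to keep the sketch within the standard library.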
The Microarray Data Portal was created by the Center for Medical Genomics at IU School of Medicine.
Supported in part by the 21st Century Research & Technology Fund and the Indiana Genomics Initiative.
The Indiana Genomics Initiative is supported in part by a grant from Lilly Endowment, Inc.
Hereditary Diseases and Family Studies Division, Dept. of Medical and Molecular Genetics, IU School of
Medicine. Supported in part by NIH R01 NS37167.
Under development: Linking Cancer
data within IUSM
• Thousands of cancer and normal tissue samples
• De-identified, select phenotype data
• Database system that manages IRB approvals
• DiscoveryLink is planned ‘glue’ to tie tissue data to
data generated by other IUSM cores
Protein identification
• Problem: categorize thousands of protein identifications
from proteomic experiments
• Planned solution: Use CLSD interface with LocusLink to
obtain information about proteins
• Data Generation:
– Peptide Extracts from experiment
– Separate peptides using Liquid 2D Chromatography
– Identify Mass/Charge using Mass Spectrometer
– Creates raw data (LOTS of it!)
• Data Analysis:
– SAS, using queries into CLSD
HPC Challenge @ SC2003
Are hexapods a single evolutionary group? Are ecdysozoans a
single evolutionary group?
Computational grids and life
science research (HPC Challenge
Award at SC2003)
A partial bestiary
All organism illustrations copyright
Jennifer Fairman, 2003.
www.fairmanstudios.com
Used by agreement
Software and data analysis
• Non-grid preparatory work
– Download sequences from NCBI (67 Taxa, 12,162 bp,
mitochondrial genes for 12 proteins)
– Align sequences with Multi-Clustal
– Determine rate parameters with TreePuzzle
• Grid preparatory work
– Analyze performance of fastDNAml with Vampir
– Meetings via Access Grid & COVISE
• The grid software
– PACX-MPI: Grid/MPI middleware (HLRS, High
Performance Computing Center Stuttgart)
– COVISE: Collaboration and visualization (HLRS)
– fastDNAml: Maximum Likelihood phylogenetics (IU)
fastDNAml
• ML analysis of phylogenetic trees based on DNA sequences
• Foreman/worker MPI program
• Heuristic search for best trees
• For 67 taxa: 2.12 × 10^109 trees
• Goal: 300 bootstraps, 10 jumbles per bootstrap: 3,000 executions
(more than 3x typical!)
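The size of the search space and of the workload quoted for fastDNAml can be checked with a few lines of arithmetic. The sketch assumes the tree count refers to unrooted bifurcating trees, of which there are (2n − 5)!! for n taxa; the function and variable names are illustrative and do not come from fastDNAml.

```python
def unrooted_tree_count(n_taxa):
    """Number of distinct unrooted bifurcating trees: (2n - 5)!! for n >= 3."""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 5.
    for odd in range(3, 2 * n_taxa - 4, 2):
        count *= odd
    return count


trees = unrooted_tree_count(67)
print(f"{trees:.3e} possible trees for 67 taxa")

# Workload: every bootstrap replicate is run with several jumbles
# (random orders in which taxa are added to the tree).
bootstraps, jumbles = 300, 10
tasks = [(b, j) for b in range(bootstraps) for j in range(jumbles)]
print(len(tasks), "independent executions")
```

Exhaustive search over a space this large is impossible, which is why the search is heuristic. Because each (bootstrap, jumble) pair is independent, the runs distribute naturally across a grid.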
It worked!
• Grid of 6 continents, 5 functional units, 6+ vendors, 8
types of systems, 641 processors… all analyzing
evolutionary relationships of arthropods
• HPC Challenge Award winner at SC03 conference –
demonstrates new capabilities in grid computing while
advancing research in evolutionary biology
Looking forward – Institute of
Innovation projects
• IBM Life Sciences Institute of Innovation in 3-D
Cell Modeling
• Center for Cell and Virus Theory
• Biocomplexity Institute (talk tomorrow by Debasis
Dan)
• Model repository
• Markup Languages and Cell Models
• To the TeraGrid (and beyond!)
Strategy and execution: how
did we get here?
IU’s IT Strategic Plan
• Real plans and real execution of those plans
• Strong focus on centralization and enablement of
capability computing
• Economy of scale
• Advantages of centralization while minimizing
disadvantages
• Engagement with researchers and vendors in projects
and grants
Support strategy
• CS research is wonderful,
but what biomedical
researchers care about is
tools!
• Considerable effort is put
into seeking out
collaborators and people we
can assist
• If a particular application is
useful it doesn’t matter if it
seems sophisticated to a
computer scientist
• When a problem is
sophisticated we need the
computer scientists!
• Gradual enhancement of
community
Collaboration and Outreach
• AVIDD – 20 faculty,
dozens of staff,
$1.8M in NSF funding
• Research in Indiana –
3 universities, dozens
of faculty
• IP-Grid – 2
universities, dozens
of faculty, $3M in NSF
funding
• INGEN – 100+ faculty,
hundreds of staff,
$105M funding from
Lilly Endowment, Inc.
• In-state, national, and
international outreach
are all essential
Delivering Benefits
• 9 inventions disclosed
since 1997; 6 of these are
open source software
(BSD-like). Participation in
the community behind
community codes is
essential! [IBM has
supported this strongly]
• John-E-Box design
licensed to a central
Indiana firm for
commercial production!
• A software product has
just been commercialized
• Results explainable to a
voter are essential for
continued public support!
For further information
• fastDNAml:
http://www.indiana.edu/~rac/hpc/fastDNAml/
• about.uits.iu.edu/divisions/rac/index.html
• about.uits.iu.edu/divisions/rac/pubsstaff.html
• ingen.iu.edu
• it.iu.edu
Acknowledgments
• This work was supported in part by Shared University
Research grants from IBM, Inc. to Indiana University. IU’s
life science research has benefited from collaboration
with IBM researchers since 1997.
• This research was supported in part by the Indiana
Genomics Initiative. The Indiana Genomics Initiative of
Indiana University is supported in part by Lilly
Endowment Inc.
• This material is based upon work supported by the
National Science Foundation under Grant No. 0116050
and Grant No. CDA-9601632. Any opinions, findings
and conclusions or recommendations expressed in this
material are those of the author(s) and do not necessarily
reflect the views of the National Science Foundation
(NSF).
• Assistance with this presentation: John Herrin, Malinda
Lingwall, W. Les Teach
• For HPC Challenge: thanks to the SciNet team, SC2003
organizers, HLRS, and especially Prof. Dr. Michael Resch
& Dr. Matthias Müller.
Thank you
Any questions?