Why Semantic Web Technology? - Integrated Breeding Platform Wiki


The iPlant Collaborative
IBP Annual Meeting – June 1st 2011
Steve Goff
iPlant Collaborative, BIO5 Institute
School of Plant Science
University of Arizona
www.iplantcollaborative.org
[email protected]
What is iPlant?
• iPlant’s mission is to build the cyberinfrastructure (CI) to support solutions to plant biology’s Grand Challenges
• Phase I – Community Input
• Phase II – Building the CI Foundation
• Next Phase – Enabling Plant Science Discovery
– Now need to integrate workflows and test theories
– Will support tool integration and synthesis activities
NSF Cyberinfrastructure Vision
• High Performance Computing
• Data and Data Analysis
• Virtual Organizations
• Learning and Workforce
Ref: “Cyberinfrastructure Vision for 21st Century Discovery”, NSF Cyberinfrastructure Council, March 2007.
CI for Plant Science: Observations
• Investment in data creation is high
• Sources of data are disparate.
• Investment in existing tools is significant
• Tools shouldn’t be discarded
• Tools shouldn’t be reproduced, but lack:
–Interoperability w/other tools
–Data standards
–Scalability
–Consistency of interface access & use
–Experimental reproducibility
iPlant is a process and a platform (or set of platforms, depending on your point of view).
Computational & Storage Capability
– Compute: Ranger, Lonestar, Stampede (UT/TeraGrid); Saguaro, Sonora (ASU); Marin, Ice (UA)
• ~700 teraflops
– Storage: Corral, Ranch (UT), Ocotillo (ASU)
• > 10 petabytes of storage available for the project
– Visualization: Spur, Stallion (UT), Matinee (ASU), UA-Cave
• Among the world’s largest visualization systems
– Virtualized/Cloud Services: iPlant, TeraGrid, vendor clouds
• Cloud tech to deliver persistent gateways and user services
Thanks to large-scale NSF investments, iPlant has excellent CI access
iPlant Cyberinfrastructure (diagram)
• Users: Bench Biologists, Computational Biologists
• Components: Semantic Web Layer, Discovery Environment, Data Store, Atmosphere
• APIs: Data, Algorithms
Overview of Components
• iPlant Discovery Environment - Core Software
• iRODS Integration – Core Services
• Atmosphere Cloud – Core Services
• Semantic Web Tech – SSWAP Team
• iPlant Tool/Workflow API – Core Software & Engagement Teams
iPlant architecture (diagram)
• Front ends: Discovery Environment, 3rd Party Science Gateways, DNA Subway, User Scripts & Applications
• Public APIs: I/O, Data, Apps, Job, Profile, Auth
• Low-Level Services: Condor, PBS, SGE, LSF, LL, iRODS, LDAP, Shibboleth, Globus/Unicore, GPIR, MySQL, Eucalyptus, MyProxy, XSEDE
• iPlant Hardware Resources: High Perf Computing, Databases, Storage, Cloud Systems
• Other labels in the figure: Semantic Web, Event, Action Folders
iRODS
Integrated Rule-Oriented Data System
www.irods.org
• Why iRODS?
– Large data storage in simple format
– Sharing of large data among iPlant CI Resources
– Sharing of large data with colleagues and collaborators
– Processing large data with TACC resources
• General information on iRODS: www.irods.org
• Access iPlant’s iRODS: irodsweb.iplantcollaborative.org
• Documentation:
https://pods.iplantcollaborative.org/wiki/display/systems/iRODS
Atmosphere
iPlant’s Cloud Computing Resources
http://atmosphere.iplantcollaborative.org
• Tutorial:
https://pods.iplantcollaborative.org/wiki/display/atmosphere/Demo+with+picture+walkthrough
• Why Atmosphere?
– Use a virtual machine (VM) with preinstalled software
– Create a VM to install complex software
– Create and share an image of a VM (VMI)
– Mount data from iPlant iRODS for use by your VM
Semantic Web
http://www.iplantcollaborative.org/communities/developers/semanticweb
• Why Semantic Web Technology?
– Provides a means for web services to communicate with and be aware of one another
Diagram: three service/consumer pairings, each linked through the Semantic Web
• iPlant Service and Remote Consumer
• iPlant Consumer and Remote Service
• User-Created Service in Atmosphere and iPlant’s Discovery Environment
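One way to picture services that are "aware of one another" is a registry in which each service declares the kind of data it consumes and produces, so that a consumer can discover a chain of services it did not know in advance. The sketch below is only an illustration in plain Python. The service names, the data-type labels, and the `find_chain` helper are all hypothetical, and nothing here reflects the SSWAP protocol or an iPlant API.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    name: str      # hypothetical service name
    consumes: str  # data type the service accepts
    produces: str  # data type the service returns


def find_chain(services, have, want):
    """Breadth-first search for the shortest chain of services turning `have` into `want`."""
    queue = deque([(have, [])])
    seen = {have}
    while queue:
        data_type, chain = queue.popleft()
        if data_type == want:
            return [s.name for s in chain]
        for s in services:
            if s.consumes == data_type and s.produces not in seen:
                seen.add(s.produces)
                queue.append((s.produces, chain + [s]))
    return None  # no combination of registered services connects the two types


# Made-up registry mixing a local service, a remote one, and a user-created one.
REGISTRY = [
    Service("iplant_aligner", "reads", "alignments"),
    Service("remote_variant_caller", "alignments", "variants"),
    Service("user_atmosphere_annotator", "variants", "annotated_variants"),
]

chain = find_chain(REGISTRY, "reads", "annotated_variants")
print(chain)
```

The consumer only states what it has and what it wants; the descriptions held in the registry do the rest, which is the property the slide attributes to a semantic layer.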
iPG2P: From Genotype to Phenotype
• Visual Analytics
– R. Grene and G. Abram: Information Visualization Tools capable of displaying diverse types of data from laboratory, field, in silico analyses and simulations
• Data Integration
– D. Ware and C. Jordan: Methods for describing and unifying data sets into systems that support iPG2P activities
• Statistical Inference
– D. Kliebenstein and E. Buckler: Platform for using advanced computational approaches to statistically link genotype to phenotype
• Modeling Tools
– J. White, C. Myers, S. Welch: Framework for the construction, simulation and analysis of computational models of plant
• Ultra High Throughput Sequencing
– T. Brutnell and M. Vaughn: HPC resources and applications to process large-volume sequence data
Ultra High-Throughput Sequencing
Genome Services
Scalable computing
Data
• NCBI SRA
• Desktop
• AmazonS3
• FTP
• HTTP
Data Wrangling
• Quality Control
• Preprocessing
• Rescaling
• Barcoding
Alignments
• BWA
• TopHat
Community Use Cases
• Expression studies
• Forward genetic screens
• Association studies
Outputs
• SAM Alignments
• Cufflinks: Expression Levels (RPKM)
• SAMTools: Genome Variants (VCF3.3)
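For readers unfamiliar with the RPKM unit named on this slide, it is reads per kilobase of transcript per million mapped reads: RPKM = 10^9 x C / (N x L), where C is the number of reads mapped to a gene, N is the total number of mapped reads in the sample, and L is the gene length in base pairs. The sketch below applies that basic definition only. Cufflinks itself estimates abundance with a more involved statistical model, and the gene names and counts here are made up for illustration.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and total mapped reads must be positive")
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# Made-up example: two genes with the same read count but different lengths.
TOTAL_MAPPED_READS = 10_000_000
genes = {
    "geneA": {"reads": 500, "length_bp": 2000},
    "geneB": {"reads": 500, "length_bp": 1000},
}

expression = {
    name: rpkm(g["reads"], g["length_bp"], TOTAL_MAPPED_READS)
    for name, g in genes.items()
}
for name, value in expression.items():
    print(f"{name}: {value:.1f} RPKM")
```

With equal read counts, the shorter gene gets twice the RPKM of the longer one, which is the length normalization the unit is meant to provide.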
High Throughput Image Analysis
Scope: Enable image-based plant sciences research by incorporating image
processing algorithms, grid computing, and databasing into an analysis pipeline
Objectives
1. Integrate Phytomorph and BISQUE as PhytoBisque
2. Broaden access to algorithms that benefit the community
3. Automate workflows so that plant biologists need not be computer scientists
APIs
Storage
Authentication
Compute cluster
E. Spalding @ U of Wisconsin, B.S. Manjunath and K. Kvilekval @ UCSB
Phytobisque: Example Use Case
Given a flatbed scanner image of Arabidopsis seeds, PhytoBisque measures the length, width, and area of each seed and produces a population estimate for each trait.
Seed trait QTL can be mapped when the method is applied to mapped populations like Ler x Cvi.
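The "population estimate for each trait" step can be sketched as a summary over per-seed measurements. The example below assumes the image analysis has already produced one record per seed; the measurement values, the units, and the `population_estimates` helper are hypothetical and are not taken from PhytoBisque.

```python
from statistics import mean, stdev

# Made-up per-seed measurements (length and width in mm, area in mm^2).
seeds = [
    {"length": 0.50, "width": 0.30, "area": 0.12},
    {"length": 0.48, "width": 0.28, "area": 0.11},
    {"length": 0.52, "width": 0.32, "area": 0.13},
    {"length": 0.50, "width": 0.30, "area": 0.12},
]


def population_estimates(seeds, traits=("length", "width", "area")):
    """Return the sample size, mean, and standard deviation of each trait."""
    summary = {}
    for trait in traits:
        values = [seed[trait] for seed in seeds]
        summary[trait] = {
            "n": len(values),
            "mean": mean(values),
            "sd": stdev(values),  # sample standard deviation; needs at least 2 seeds
        }
    return summary


summary = population_estimates(seeds)
for trait, stats in summary.items():
    print(f"{trait}: n={stats['n']} mean={stats['mean']:.3f} sd={stats['sd']:.3f}")
```

One such summary per line of a mapped population gives the per-line trait values that a QTL analysis would then take as its phenotype input.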
A Strategy for Association Studies
Iterative analyses
• iPlant workflow management simplifies automation
• Compare methods!
Basic QTL/GWAS analysis
• R/qtl, QTL Cartographer, et al.
• Community can integrate these into the CI
Exploratory methods
• Hand-built R, Python, SAS, C codes
• Easy integration into iPlant CI via API
• Adopt common data model
Scalability challenges: high-density markers, large populations, combinatorial analyses
• iPlant-authored parallel GLM (etc.) implementations
• Common data model
• Utilize workflow framework
Statistical Inference: Scalable GLM
Genotype (40 million markers in maize NAM) x Phenotype (6 traits of interest): ANOVA
• Simplest case*: a few minutes using GLM on desktop TASSEL
• 1000-replicate bootstrap: 75-150 hours per trait
• Runtimes only get larger (days to years) for more complex analyses (x 1000 replicate analyses, x epistasis testing)
* One trait x 40 million markers with no bootstrapping or epistasis testing
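The runtime figures on this slide can be checked against each other with simple arithmetic. The sketch below assumes the bootstrap replicates run one after another on a single desktop and that each replicate costs about as much as one plain scan; both are assumptions, not statements from the slide.

```python
REPLICATES = 1000
TRAITS = 6
BOOTSTRAP_HOURS_PER_TRAIT = (75, 150)  # range quoted on the slide


def minutes_per_replicate(hours_per_trait, replicates=REPLICATES):
    """Average cost of one replicate implied by the total bootstrap time."""
    return hours_per_trait * 60 / replicates


def total_hours(hours_per_trait, traits=TRAITS):
    """Serial time to bootstrap every trait of interest."""
    return hours_per_trait * traits


for hours in BOOTSTRAP_HOURS_PER_TRAIT:
    per_run = minutes_per_replicate(hours)
    all_traits = total_hours(hours)
    print(f"{hours} h/trait -> {per_run} min per replicate, "
          f"{all_traits} h ({all_traits / 24} days) for {TRAITS} traits")
```

Under these assumptions the bootstrap range works out to 4.5 to 9 minutes per replicate, in line with "a few minutes" for the simplest case, and to roughly 19 to 38 days of serial computing for all six traits before any epistasis testing.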
GPU-based QTL Mapping
•Aspects of the problem are highly parallel
•Re-architect data flow and mapping algorithms for GPU architecture
•Interface for C and GPU implementations will be identical
Ali Akoglu and Dave Lowenthal, UArizona
Alignment-based protein searches sped up 6-10x
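The scan is highly parallel because each marker is tested against the trait independently of every other marker. The sketch below illustrates that property and the idea of interchangeable back ends behind one interface. It uses a squared correlation as a stand-in statistic and a thread pool as a stand-in for a GPU back end; the function names and the data are hypothetical, and this is not the project's mapping algorithm. In CPython, threads give no real speedup for this kind of work; they appear here only to show that the caller does not change.

```python
from concurrent.futures import ThreadPoolExecutor


def marker_r2(genotype, phenotype):
    """Squared Pearson correlation between one marker's genotype codes and the trait."""
    n = len(phenotype)
    mean_g = sum(genotype) / n
    mean_p = sum(phenotype) / n
    sgp = sum((g - mean_g) * (p - mean_p) for g, p in zip(genotype, phenotype))
    sgg = sum((g - mean_g) ** 2 for g in genotype)
    spp = sum((p - mean_p) ** 2 for p in phenotype)
    if sgg == 0 or spp == 0:
        return 0.0  # monomorphic marker or constant trait: nothing to test
    return sgp * sgp / (sgg * spp)


def scan_serial(markers, phenotype):
    """Reference back end: test the markers one after another."""
    return [marker_r2(m, phenotype) for m in markers]


def scan_parallel(markers, phenotype, workers=4):
    """Alternative back end with the same inputs and outputs as scan_serial."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda m: marker_r2(m, phenotype), markers))


# Made-up data: 3 markers scored 0/1 in 6 lines, one trait value per line.
markers = [
    [0, 0, 0, 1, 1, 1],
    [0, 1, 0, 1, 0, 1],
    [1, 1, 1, 1, 1, 1],
]
phenotype = [1.0, 1.0, 1.0, 3.0, 3.0, 3.0]

serial = scan_serial(markers, phenotype)
parallel = scan_parallel(markers, phenotype)
print(serial)
```

Because both functions take the same arguments and return results in marker order, a workflow can switch back ends without any other change, which is the point of keeping the C and GPU interfaces identical.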
iPlant Tree of Life (iPToL)
Large phylogenetic inference
Building a tree of life for up to 500,000 green plants
Tree Visualization
Scalable visualization for small to large trees
Data Assembly and Integration
Acquisition, organization, and processing of the data
Taxonomic Intelligence
Sorting out different names for the same species
Tree Reconciliation
Resolving discordant gene and species trees
Trait Evolution
Using the tree to understand how traits evolved
Phyloviewer: visualization of large phylogenetic trees
My-Plant
• Social networking for plant biologists
• Organized by clade
• Used to organize the data collection for the “big tree”
Taxonomic Name Resolution Service
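Sorting out different names for the same species usually combines two steps: mapping known synonyms to an accepted name, and catching misspellings by approximate matching. The sketch below shows both with Python's standard `difflib` module. The tiny name lists, the `resolve` helper, and the similarity cutoff are illustrative assumptions; this is not the algorithm or the API of the Taxonomic Name Resolution Service.

```python
from difflib import get_close_matches

# Tiny illustrative reference lists; a real service would use a full taxonomic database.
ACCEPTED = ["Arabidopsis thaliana", "Solanum lycopersicum", "Zea mays"]
SYNONYMS = {"Lycopersicon esculentum": "Solanum lycopersicum"}


def resolve(name, cutoff=0.85):
    """Return the accepted name for `name`, or None if nothing matches well enough."""
    cleaned = " ".join(name.split())  # collapse stray whitespace
    if cleaned in ACCEPTED:
        return cleaned
    if cleaned in SYNONYMS:
        return SYNONYMS[cleaned]
    # Fall back to approximate matching against accepted names and synonyms.
    candidates = ACCEPTED + list(SYNONYMS)
    match = get_close_matches(cleaned, candidates, n=1, cutoff=cutoff)
    if not match:
        return None
    return SYNONYMS.get(match[0], match[0])


for submitted in ["Arabidopsis thalliana", "Lycopersicon esculentum", "  Zea   mays ", "Quercus alba"]:
    print(repr(submitted), "->", resolve(submitted))
```

A name that matches nothing above the cutoff is returned as unresolved instead of being forced onto the nearest entry, so that a person can review it.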
Integration of New Tools w/o Programming
This part is done!!!
This part is coming soon!
Related Activities
• Integrated Breeding Platform
• Social networking portal for plant breeders
• R analysis packages
• Breeders fieldbook
• 1kp (1,000 plant transcriptomes)
• DOE’s Knowledgebase (KBase)
• Seed projects
• Elixir
• CoGe
Future Workshop Activities
• Small tool/workflow integration meetings
• 2-3 days each, 10-20 local participants
• 4-5 meetings starting in June 2011
• Addressing specific biological questions
• With appropriate test data and available software
• Building on iPlant’s cyberinfrastructure
• Complementary tools and additional data access
• Preference for broad use, high impact tools & workflows
• Can be kept private until published
• Positive results will stimulate additional support
iPlant’s Building Blocks
Building blocks shown: Metadata, Data, Tools, Workflows, Viz
Executive Team:
Steve Goff, Dan Stanzione
Faculty Advisors:
Greg Andrews, Kobus Barnard, Susan Brown, Vicki Chandler, John Hartman, Nirav Merchant,
Sudha Ram, Ann Stapleton, Lincoln Stein, Doreen Ware, Sue Wessler, Ramin Yadegari
Staff:
Greg Abram, Victoria Bryan, Rion Dooley, Andy Edmonds, Juan Antonio Raygoza Garay, Karla Gendler,
Damian Gessler, Cornel Ghiban, Michael Gonzales, Hariolf Häfele, Matthew Helmke, Natalie Henriques,
Uwe Hilgert, Nicole Hopkins, Lisa Howells, Kathleen Kennedy, Mohammed Khalfan, Seung-jin Kim,
Adam Kubach, Sangeeta Kuchimanchi, Tina Lee, Andrew Lenards, Sonya Lowry, Jerry Lu,
Eric Lyons, Naim Matasci, Sheldon McKay, Dave Micklos, Andy Muir, Martha Narro,
Christos Noutos, Dennis Roberts, Bernice Rogowitz, Jerry Schneider, Bruce Schumaker, Edwin Skidmore,
Sriram Srinivasan, Mary Margaret Sprinkle, Matthew Vaughn, Liya Wang, Sharon Wei, Jason Williams,
Frank Willmore, John Wregglesworth, Weijia Xu
Students:
Storme Briscoe, Steven Gregory, Monica Lent, Bansri Poduval, Pavithra Ravi, Shannon Wermes, Jill Yarmchuk