iPlant Pods - iPlant Collaborative
iPlant Collaborative
Powering a New Biology
David Micklos, Cold Spring Harbor Laboratory
My own suspicion is that the universe is not only queerer than we suppose, but
queerer than we can suppose.
J.B.S. Haldane, Possible Worlds and Other Essays (1927)
The Egalitarian Gene
Agarose Gel Electrophoresis, 1973
1958
Matt Meselson &
Ultracentrifuge, $500,000
1973
Sharp, Sambrook, Sugden
Gel Electrophoresis
Chamber, $250
The Egalitarian Genome
Next Generation Sequencing, 2005
Bacterial colonies
Hundreds of millions of…
PCR colonies (clusters, features)
Su, Andrew (2013): Cumulative sequenced genomes. figshare.
http://dx.doi.org/10.6084/m9.figshare.722952
PacBio Sequencer
14x coverage of Rice IR6
Read Length (nucleotides)
The Egalitarian Genome
Next Generation Sequencing 2014
2003:
ABI 3730 Sequencer
2014
Oxford Nanopore
MinION
Human Genome:
$2.7 Billion, 13 Years
Human Genome:
$900, 6 Hours
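The scale of that drop is easier to see as ratios. The short sketch below only restates the slide's own figures ($2.7 billion and 13 years versus $900 and 6 hours); the average year length of 365.25 days and the helper name `fold_change` are assumptions made for illustration.

```python
# Rough fold-change arithmetic using the figures quoted on the slide.
# ASSUMPTION: a year is taken as 365.25 days; fold_change is a hypothetical helper.
HOURS_PER_YEAR = 365.25 * 24

cost_2003 = 2.7e9                    # dollars, first human genome
cost_2014 = 900.0                    # dollars, per the slide
time_2003_h = 13 * HOURS_PER_YEAR    # 13 years expressed in hours
time_2014_h = 6.0                    # hours, per the slide


def fold_change(before, after):
    """Return how many times smaller `after` is than `before`."""
    return before / after


cost_fold = fold_change(cost_2003, cost_2014)
time_fold = fold_change(time_2003_h, time_2014_h)

print(f"Cost fell about {cost_fold:,.0f}-fold")   # 3,000,000-fold
print(f"Time fell about {time_fold:,.0f}-fold")   # 18,993-fold
```

On these numbers the cost falls by roughly six orders of magnitude and the time by roughly four.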
The Big Data Problem
Storage and Analysis
“BGI, based in China, is the world’s largest genomics research institute,
with 167 DNA sequencers producing the equivalent of 2,000 human genomes a day.
BGI churns out so much data that it often cannot transmit its results to
clients or collaborators over the Internet or other communications lines
because that would take weeks. Instead, it sends computer disks containing
the data, via FedEx.”
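A back-of-the-envelope sketch shows why shipping disks can beat the network. Only the 2,000 genomes a day comes from the quote; the data volume per genome (100 GB) and the sustained link speed (1 Gbit/s) are assumptions chosen for illustration, and `transfer_days` is a hypothetical helper.

```python
# Estimate how long one day of sequencing output would take to send over a link.
GENOMES_PER_DAY = 2000        # from the quote
BYTES_PER_GENOME = 100e9      # ASSUMPTION: about 100 GB of raw data per genome
LINK_BITS_PER_S = 1e9         # ASSUMPTION: sustained 1 Gbit/s connection
SECONDS_PER_DAY = 86400


def transfer_days(total_bytes, link_bits_per_s):
    """Days needed to push total_bytes through a link of the given speed."""
    seconds = total_bytes * 8 / link_bits_per_s
    return seconds / SECONDS_PER_DAY


daily_bytes = GENOMES_PER_DAY * BYTES_PER_GENOME
days = transfer_days(daily_bytes, LINK_BITS_PER_S)

print(f"One day of output: {daily_bytes / 1e12:.0f} TB")
print(f"Time to transmit it: {days:.1f} days")
```

Under these assumptions a single day's output takes more than two weeks to transmit, so the backlog grows faster than the link can drain it. Different assumptions change the number, not the conclusion that the network is the bottleneck.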
Biology’s Other Big Data
Phenomics
Visualization
The useful lifetime of our analysis
toolchains is now 6 months
- Matthew Trunnell, Broad Institute
• Requires a platform that can support diverse and
constantly evolving needs.
• Cyberinfrastructure is the platform for a biological
“App Store” that allows scientists to run tools and
workflows they need.
Paradigm Shift
• Data-limited world > data-unlimited world
• Hypotheses underdetermined by data > data
underdetermined by hypotheses
• Reductive biology > constructive biology
iPlant Collaborative
A 10 year NSF project to develop a
computer infrastructure to apply
computational thinking to solve
biological problems
• Virtual organization
• High performance computing
• Data and data analysis
• Learning and workforce
iPlant Collaborative
A virtual organization
CSHL
UA
TACC
High Performance Computing
Texas Advanced Computing Center (TACC)
• Two of the three largest parallel computers in the XSEDE (formerly
TeraGrid) system
• 500,000 compute cores
• New Intel MIC (Many Integrated Core) chips contain 61 cores!
• Up to 1 TB shared memory
Dan Stanzione, Acting Director
iPlant Collaborative
Ways to Access iPlant
• iData Store: All data large and small
• Atmosphere: For virtual hosting of web apps, sites,
databases.
• Discovery Environment: Integrated Web apps.
• MyPlant: Social Networking.
• DNASubway: Annotation and more
• Standalone Apps: TNRS, TreeViewer, PhytoBisque, etc
• The API: for programmers embedding iPlant capabilities
• Command line for experts (through TeraGrid/XSEDE)
Data Store
Texas Advanced Computing Center
Dan Stanzione: “We hit a billion files about a year ago, so when people
ask us what we’re going to do about a billion files, the answer is we’re
going to do this.”
100,000 terabytes of disk and tape. Data Store moves files larger than
2 GB with ease.
Atmosphere
Cloud Computing for Biology
• Handles big data
• Analogous to Amazon Elastic Compute Cloud (EC2)
• Default virtual machine (VM) has 6 CPUs with 16 GB of RAM, compared
with 1-2 CPUs and 1-4 GB of RAM on a desktop or laptop
• A VM with up to 16 CPUs and 32 GB of RAM can be assigned on request
• Co-localize with your data from the iPlant Data Store
• Configure a machine and data transformations to share with
collaborators or to use as a case study for students.
Discovery Environment
A rich web client
• Free App store for Bio research
• Consistent interface to a range of bioinformatics tools
• Integrated, extensible system of applications and services
• Add tools, build custom workflows
• Other major projects are beginning to adopt the iPlant CI as
their underlying infrastructure (some completely, some in
limited ways):
• CoGe (auth service, hosting)
• BioExtract (web service platform)
• CiPRES (computation)
• Gates Integrated Breeding Platform (hosting,
development)
• Galaxy (storage, for now)
The Biology
App Store
iPlant APIs
Resources
iPlant Audiences:
Converge on the Middle Ground
• Expert: bioinformaticians and computational
biologists
• Intermediate: bright biological researchers who need
to solve problems – but who aren’t bioinformaticians
or don’t know one down the hall
• Novice: high school and college faculty engaged
primarily in teaching
Educational Challenge
For the first time in the history of biology, students
can work with the same data, at the same time, and
with the same tools as research scientists.
Research
Education
Context of scientific discovery
Insights from Genomics in Education
Washington University, June 16-19, 2009
44 participants from three worlds and three kingdoms
• Bioinformatics: Students have limited patience for
pure computer work and want a wet bench hook.
• Student-scientist partnerships: Someone has to care
about the data generated by students.
• Students as co-investigators: Projects should
potentially lead to publication.
• Scale: Need to move from individual classroom
experiments to distributed projects.
Walk or…
…ride
an educational Discovery Environment
DNA Subway
an agnostic education and research tool
• Simplified bioinformatics workflows
• Developed with 25 collaborators at 11 institutions
• Since March 2010 launch: 7,510 registered users
• Red Line: predict and annotate genes in <150 kb
(2,670 projects in last six months)
• Yellow Line: identify homologs in sequenced genomes
• Blue Line: analyze DNA barcodes and build gene trees
(7,700 projects in last six months)
• Green Line: align and analyze RNA-seq data (beta)
DNA Subway
an educational Discovery Environment
• Developed in parallel with Discovery Environment
• Simplified workflow for gene discovery, annotation,
and comparison
• 25 collaborators at 11 institutions
• Since March 2010 launch:
4,218 registered users
69,660 visits, 31,587 unique visits
DNA Subway Concepts (Big Ideas)
• Genomes are complex and dynamic (queer).
• DNA sequence is information.
• DNA sequence is biological identity.
• Gene annotation adds meaning to DNA sequence.
• Concept of gene continues to evolve.
• A genome is more than genes.
DNA Subway
Producers
Uwe Hilgert
David Micklos
Jason Williams
Designers
Eun-Sook Jeong
Susan Lauter
Programmers
Cornel Ghiban
Mohammed Khalfan
Sheldon McKay
Contributors
Matt Vaughn
Rion Dooley
Anthony Biondo
Jim Burnette
Scott Cain
Ed Lee
Zhenyuan Lu
Advisors
Matt Conte
Carson Holt
Bruce Nash
Oscar Pineda-Catalan