The iPlant Tree of Life Project and
Toolkit: Building a
Cyberinfrastructure for Plant Science
Research
Naim Matasci
The iPlant Collaborative
Evolution 2011
Jun 17-21, 2011
What is iPlant?
Discovery Environment
http://www.iplantcollaborative.org/discovery-environment-preview-access
Physical Infrastructure
Computation
• 63K-core cluster
• 20K-core cluster
• 1 TB RAM
Storage
• 2 PB
• 20 PB archive
Cloud Storage
• Store, access and share large
datasets
• Multiple points of entry: web
interface, mounted FS, API
• Free and secure
http://www.iplantcollaborative.org/about/policies/data-set-hosting
Cloud Computing
• Virtual Machines
– Up to 4 cores, 32 GB RAM,
100 GB dedicated disk
– Run any x86-compatible OS
(even Windows)
– Persistent or on-demand
– Log in via SSH or secure VNC
• Use Cases
– Internet-enabled Servers
– Database management
appliances
– Virtual desktops
– …The sky is the limit!
http://www.iplantcollaborative.org/atmosphere-preview
Consumer Applications
iPlant's CI
iPlant Tree of Life Grand Challenge
Large phylogenetic inference
Building a tree of life for up to 500,000 green plants
Tree Visualization
Scalable visualization for small to large trees
Data Assembly and Integration
Acquiring, organizing and processing the data
Taxonomic Intelligence
Sorting out different names for the same species
Tree Reconciliation
Resolving discordant gene and species trees
Trait Evolution
Using trees to understand how traits evolved
BIG TREES
To optimize existing methods to construct phylogenetic trees on the
order of 500K taxa.
Big Trees
NINJA/WINDJAMMER (Travis Wheeler)
Neighbor-Joining implementation that can analyze > 200K species
Six-day run time reduced 32-fold to 4.5 hours for 220K species data set
Two- to three-day run time reduced 1,800-fold to 2 minutes for distance
matrix calculation on 220K set
RAxML-Light (Alexandros Stamatakis)
Large Scale Maximum Likelihood implementation
55K Tree published (Stephen A. Smith et al., “Understanding angiosperm diversification using
small and large phylogenetic trees,” American Journal of Botany 98, no. 3 (2011): 404-414)
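The neighbor-joining method named in the NINJA entry above can be illustrated with a minimal sketch. This is the textbook Saitou and Nei procedure with cubic running time, not the optimized algorithm implemented in NINJA. The function name, the internal node labels (n1, n2, ...) and the five-taxon distance table are assumptions made for illustration.

```python
from itertools import combinations


def neighbor_joining(dist):
    """Textbook neighbor joining.

    dist: {taxon: {other_taxon: distance}}, symmetric, no self-distances needed.
    Returns a list of (parent, child, branch_length) edges; the last edge
    connects the two nodes that remain at the end.
    Assumes no taxon is already named like an internal node ("n1", "n2", ...).
    """
    d = {a: dict(row) for a, row in dist.items()}  # working copy
    nodes = list(d)
    edges = []
    counter = 0
    while len(nodes) > 2:
        n = len(nodes)
        # Row sums of the current distance matrix.
        r = {a: sum(d[a][b] for b in nodes if b != a) for a in nodes}
        # Pick the pair minimizing Q(i, j) = (n - 2) d(i, j) - r(i) - r(j).
        i, j = min(
            combinations(nodes, 2),
            key=lambda p: (n - 2) * d[p[0]][p[1]] - r[p[0]] - r[p[1]],
        )
        counter += 1
        u = f"n{counter}"  # hypothetical label for the new internal node
        li = 0.5 * d[i][j] + (r[i] - r[j]) / (2 * (n - 2))
        lj = d[i][j] - li
        edges += [(u, i, li), (u, j, lj)]
        # Distances from the new node to every remaining node.
        d[u] = {}
        for k in nodes:
            if k != i and k != j:
                d[u][k] = d[k][u] = 0.5 * (d[i][k] + d[j][k] - d[i][j])
        nodes = [k for k in nodes if k != i and k != j] + [u]
    a, b = nodes
    edges.append((a, b, d[a][b]))
    return edges


# Toy distance table (illustrative values only).
EXAMPLE = {
    "a": {"b": 5, "c": 9, "d": 9, "e": 8},
    "b": {"a": 5, "c": 10, "d": 10, "e": 9},
    "c": {"a": 9, "b": 10, "d": 8, "e": 7},
    "d": {"a": 9, "b": 10, "c": 8, "e": 3},
    "e": {"a": 8, "b": 9, "c": 7, "d": 3},
}
edges = neighbor_joining(EXAMPLE)
```

Each pass recomputes every pairwise Q value, which is why the plain method becomes impractical at the scale of hundreds of thousands of taxa and why faster implementations matter for the data set sizes quoted above.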
TREE VISUALIZATION
To develop an application for viewing, analyzing and exploring large
phylogenetic trees.
Tree Visualization
• > 500K Taxa
• Fast
• Web based, platform independent
• Semantic zooming
• Metadata driven display of information
iPlant Tree Viewer Prototype
http://portnoy.iplantcollaborative.org/
1KP
Collaboration (1KP) – To support the data analysis of the Thousand Plant
Transcriptomes Project
1KP
[Figure: N(genes) plotted against N(species). Labels: completed genomes (dozens of species); PCR in 104 species (dozens of genes); unexplored territory; on role of polyploidy in Darwin’s “abominable mystery”.]
Phylogenomics of 1000 species across plant taxa
• Broad phylogenetic coverage: algae, non-flowering, flowering (angiosperm)
TREE RECONCILIATION
To reconcile the evolutionary history of genes and species.
Tree Reconciliation
Gene family data courtesy John Bowers
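One common way to approach the reconciliation goal stated above is LCA mapping: each gene-tree node is mapped to the most recent species-tree ancestor that contains all of its species, and a node that maps to the same place as one of its children is counted as a duplication. The sketch below assumes this approach; the slides do not say which method the iPlant tool uses. The function names, the nested-tuple tree encoding and the toy trees are all illustrative assumptions.

```python
def clades(tree):
    """Return one frozenset of leaf names per node of a nested-tuple tree.

    The last entry is always the clade of the node passed in.
    """
    if isinstance(tree, str):
        return [frozenset([tree])]
    result = []
    leaves = frozenset()
    for child in tree:
        sub = clades(child)
        result.extend(sub)
        leaves |= sub[-1]  # last entry is the child's own clade
    result.append(leaves)
    return result


def count_duplications(gene_tree, species_tree, gene_to_species):
    """Count gene-tree nodes inferred as duplications under LCA mapping."""
    species_clades = clades(species_tree)

    def lca(species):
        # Smallest species-tree clade that contains every given species.
        return min((c for c in species_clades if species <= c), key=len)

    dups = 0

    def walk(node):
        nonlocal dups
        if isinstance(node, str):
            return lca(frozenset([gene_to_species[node]]))
        child_maps = [walk(child) for child in node]
        mapped = lca(frozenset().union(*child_maps))
        if mapped in child_maps:  # same mapping as a child: duplication
            dups += 1
        return mapped

    walk(gene_tree)
    return dups


# Toy example: two gene copies in species A and B, one in C.
SPECIES_TREE = (("A", "B"), "C")
GENE_TREE = ((("a1", "b1"), ("a2", "b2")), "c1")
GENE_TO_SPECIES = {"a1": "A", "a2": "A", "b1": "B", "b2": "B", "c1": "C"}
n_dups = count_duplications(GENE_TREE, SPECIES_TREE, GENE_TO_SPECIES)
```

In the toy gene tree, the node joining the two (A, B) copies maps to the same species-tree ancestor as its children, so it is the single inferred duplication.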
TAXONOMIC NAME
RESOLUTION
Collaboration (BIEN) - To unify and resolve synonymous, erroneous, or
other conflicting taxonomic names.
Taxonomic uncertainty
1. Non-existent names
• Misspellings
• Contamination
• Annotations
• Morphospecies
• Digitization issues (frame shifts, character encoding)
• Lexical variants (digitization conventions)
2. Synonymy
• Nomenclatural synonyms
• Taxonomic synonyms / concepts
3. Misidentifications, incomplete identifications
Taxonomic Name Resolution Service
• Computer-assisted standardization of plant
names
• Corrects spelling errors and alternative
spellings to a standard list of names
• Converts out-of-date names to currently
accepted names
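The two steps listed above (spelling correction against a standard list, then conversion of outdated names) can be sketched with the Python standard library. This is only an illustration of the idea and not the Taxonomic Name Resolution Service's actual matching algorithm. The function name, the similarity cutoff and the tiny reference lists are assumptions.

```python
import difflib

# Toy reference data for illustration only; a real service uses a curated
# taxonomic database.
ACCEPTED = ["Quercus robur", "Zea mays", "Arabidopsis thaliana"]
SYNONYMS = {"Quercus pedunculata": "Quercus robur"}  # outdated -> accepted


def resolve_name(raw, cutoff=0.8):
    """Return the accepted name for a raw input string, or None if no match."""
    # Normalize whitespace and capitalization ("Genus epithet").
    name = " ".join(raw.split()).capitalize()
    known = ACCEPTED + list(SYNONYMS)
    # Fuzzy-match against every known name to absorb spelling errors.
    match = difflib.get_close_matches(name, known, n=1, cutoff=cutoff)
    if not match:
        return None
    # Map synonyms to the currently accepted name.
    return SYNONYMS.get(match[0], match[0])


fixed = resolve_name("Arabidopsis thalliana")
updated = resolve_name("quercus  pedunculata")
```

A cutoff that is too low risks matching a misspelled name to the wrong species, so any real use would need review of low-scoring matches.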
TRAIT EVOLUTION
To develop an infrastructure for downstream analysis of large trees.
Trait Evolution
• Toolkit to study the evolution of traits of
interest on very large phylogenies
– Diversification
– Biogeographic patterns
– Adaptation
– Co-evolution
– …
Current analyses (Proof of concept)
• Phylogenetically Independent Contrasts
(Felsenstein 1985)
• Continuous Ancestral Character Estimation
(Schluter et al. 1997, Paradis 2004)
• Discrete Ancestral Character Estimation
(Pagel 1994, Paradis 2004)
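As a rough illustration of the first analysis in the list above, the following sketch computes phylogenetically independent contrasts on a small bifurcating tree, following the recursion described by Felsenstein (1985). The tree encoding, function name and trait values are assumptions made for the example; this is not the iPlant implementation.

```python
import math


def contrasts(node, traits, out):
    """Compute standardized contrasts for a bifurcating tree.

    node: (label, branch_length), where label is a tip name (str) or a
          pair of child nodes.
    traits: {tip_name: trait_value}
    out: list that receives one standardized contrast per internal node.
    Returns (estimated_value, adjusted_branch_length) for the node.
    """
    label, length = node
    if isinstance(label, str):
        return traits[label], length
    left, right = label
    x1, v1 = contrasts(left, traits, out)
    x2, v2 = contrasts(right, traits, out)
    # Difference between sister values, scaled by its expected std. deviation.
    out.append((x1 - x2) / math.sqrt(v1 + v2))
    # Weighted average of the children, weights inversely proportional
    # to branch length.
    value = (x1 / v1 + x2 / v2) / (1 / v1 + 1 / v2)
    # Lengthen the branch below to account for uncertainty in the estimate.
    return value, length + v1 * v2 / (v1 + v2)


# Toy tree ((A:1,B:1):1,C:2) with made-up trait values.
TREE = (((("A", 1.0), ("B", 1.0)), 1.0), ("C", 2.0))
ROOT = (TREE, 0.0)
TRAITS = {"A": 1.0, "B": 3.0, "C": 5.0}
pics = []
contrasts(ROOT, TRAITS, pics)
```

A tree with n tips yields n - 1 contrasts, which can then be used in place of the raw tip values in a regression or correlation.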
Community Integrated
(2½-day workshop)
• EUtils
• Lopper
• RAxML
• Ninja
• Phyml
• Muscle
• PHYLIP
• VCF to GFF script
• LRmaqqtl
• FASTX quality stats
• FASTX quality boxplot
• FASTX nucleotide distribution
• Cuffcompare
• ERMINEJ
• progressiveMauve
• iPlantBorda (mlpy)
• iPlantCanberra (mlpy)
• vbay
• MECPM
• OUCH
• Picante
• Ontologize
• BOWTIE
• BWA
• TopHat
• SHRiMP
• Cuffdiff
• GNU Core Text utilities
• GeneMania
• SRA import
• PARS
• PL
• DTT
• BBC biclustering
MY-PLANT.ORG
To easily share information and research, collaborate, and stay on top of
the latest news in the field.
Collaborative Tool
http://my-plant.org/
http://www.iplantcollaborative.org