“Big Data” scientific workflow management tools for the Materials

2nd Texas A&M Big Data Workshop
Development of “Big Data” Scientific Workflow
Management Tools for the Materials Genome
Initiative:
“Materials Galaxy”
Rodolfo Aramayo
Department of Biology
College of Sciences
Dr. Rodolfo Aramayo
Ricardo Perez
Department of Biology
Dr. Raymundo Arroyave
Dr. Ibrahim Karaman
Daniel Sauceda
Dr. Anjana Talapatra
Nayan Chaudhary
Ramaranjan Ruj
Vinay Akula
Department of Materials Science and Engineering
Dr. Ricardo Gutierrez-Osuna
Department of Computer Science and Engineering
The Problem…
• When data is generated faster than it can be processed, we have an
“information crisis”
• This crisis is not only associated with the lack of hardware/software
infrastructure to transform data into knowledge but also with two
major informatics-related needs:
– I. Accessibility: It is not uncommon to find scientists unable to process
information due to their lack of programming and/or informatics expertise
– II. Reproducibility: Lack of robust frameworks to ensure reproducibility
has been identified as a major issue in the scientific enterprise
Reproducibility in INFORMATICS is a major challenge, as generating
knowledge from data involves highly complex analysis workflows
The Problem = Opportunity
• This “Problem” will hit Materials Sciences hard, since the
field is undergoing a major transformation into being “Big Data”-centric
– This is particularly true since the launch of the Materials Genome
Initiative (MGI) in 2011
– The Materials data infrastructure is undergoing active development
– Materials Sciences is expected to ramp up from 0% to 100% Big Data in a few years
• This is a tremendous opportunity for us to become leaders in
this emerging field
Our Objective…
• To establish TAMU as a leading center for Materials and Materials
Informatics
• How are we going to do that?
– By developing a series of computational tools designed to collect, store,
and analyze “Big Data” from diverse sources
– By adapting and porting informatics tools from other, more developed
areas into Materials Sciences
We propose to start by adapting “Galaxy”, a complex web-based system
originally developed for genomics applications, to
Materials Informatics
What Is Galaxy? (Definition)
• Galaxy is an open source, web-based platform for accessible,
reproducible, and transparent computational biomedical
research
– Accessible: Users without programming experience can easily
specify parameters and run tools and workflows
– Reproducible: Galaxy captures information so that any user can
repeat and understand a complete computational analysis
– Transparent: Users share and publish analyses via the web and
create Pages, interactive, web-based documents that describe a
complete analysis
Source: Galaxy Wiki: https://wiki.galaxyproject.org/FrontPage
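The “Reproducible” property above can be illustrated with a minimal sketch. This is not Galaxy's implementation or data model; the `History` and `StepRecord` classes, the `scale` and `threshold` tools, and the assumption of a simple linear chain of steps are all hypothetical and chosen only to show the idea of recording each tool, its parameters, and checksums of its data so the analysis can be repeated and checked.

```python
import hashlib
import json
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class StepRecord:
    """Provenance for one analysis step (hypothetical, not Galaxy's schema)."""
    tool: str
    params: Dict[str, Any]
    input_sha256: str
    output_sha256: str


def digest(data: Any) -> str:
    """Checksum of a JSON-serializable value, used to detect changed data."""
    payload = json.dumps(data, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


@dataclass
class History:
    """Runs tools and records enough information to repeat the analysis."""
    tools: Dict[str, Callable[..., Any]]
    records: List[StepRecord] = field(default_factory=list)

    def run(self, tool: str, data: Any, **params: Any) -> Any:
        """Apply a named tool and record what was run on which data."""
        result = self.tools[tool](data, **params)
        self.records.append(
            StepRecord(tool, params, digest(data), digest(result))
        )
        return result

    def replay(self, data: Any) -> Any:
        """Re-run the recorded steps in order and verify each checksum.

        Assumes a linear chain: each step consumes the previous output.
        """
        for rec in self.records:
            if digest(data) != rec.input_sha256:
                raise ValueError(f"input to {rec.tool} differs from the recorded run")
            data = self.tools[rec.tool](data, **rec.params)
            if digest(data) != rec.output_sha256:
                raise ValueError(f"output of {rec.tool} differs from the recorded run")
        return data


# Two toy "tools" standing in for real analysis programs (hypothetical).
def scale(values: List[float], factor: float = 1.0) -> List[float]:
    return [v * factor for v in values]


def threshold(values: List[float], minimum: float = 0.0) -> List[float]:
    return [v for v in values if v >= minimum]


history = History(tools={"scale": scale, "threshold": threshold})
raw = [1.0, 2.5, 4.0]
scaled = history.run("scale", raw, factor=2.0)
kept = history.run("threshold", scaled, minimum=4.0)
print(kept)
```

Calling `history.replay(raw)` repeats both steps with the recorded parameters and raises an error if the input or any intermediate result differs from the original run, which is the kind of check a reproducibility framework needs in some form.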
Galaxy’s Internals
[Figure: architecture diagram. Labeled components: User; Shell; Web Server (NGINX/APACHE); Static HTML; Python Server (PASTE); Galaxy Database (SQL/POSTGRES); XML Tools; Internal Databases and External Databases, each holding Big Data (Media Data, System Log File, Relational Databases, Non-Relational Databases)]
Galaxy’s Interface
Source: Hillman-Jackson et al. (2012), “Using Galaxy to Perform Large-Scale Interactive Data Analyses”
Source: Blankenberg et al. (2011), “Making whole genome multiple alignments usable for biologists”
Galaxy Runs on “Ada” (“Reveille”)