PIPELINE PROGRAM

Python Pipelining
An Introduction to Building Data Analysis Pipelines
(& Hacking Graduate School)
Presented by : Kevin Dick
LECTURE WEBPAGE
http://bioinf.sce.carleton.ca/PythonPipelining/
Intro
Presentation Outline
30 Minutes ::
• Set Up the Environment
• Brief introduction to Python
• Motivating the Development of Analysis Pipelines
• Abstraction in Programming
60 Minutes :: Workshop
• [20 mins] PHASE I : Basic Functional Pipeline
• [20 mins] PHASE II : Scaling the Analysis
• [20 mins] PHASE III :
Preamble :: Getting the Environment Setup
You need R: https://cran.r-project.org/bin/windows/base/
Abstraction in Programming
Program > Function/Module
• Function Definition
• Variable Definition
• Control/Logic
• Operations
• Return Statement
Example (each part is labelled in the comments):
    def function_name(cond):          # Function Definition
        variable_1 = None             # Variable Definition
        variable_2 = None
        variable_n = None
        for v in [1, 10]:             # Control/Logic
            pass                      # Operations: do stuff
            if cond:
                pass                  # stuff
        the_stuff = (variable_1, variable_2, variable_n)
        return the_stuff              # Return Statement
Abstraction in Programming
Program
Function/Module 1
Function/Module 2
…
Main Function
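As a minimal sketch of this layout, a Python program can be split into small functions tied together by a main function. The function names and the sample data below are hypothetical and are not taken from the lecture.

```python
def load_values():
    """Function/Module 1: produce some input data (hard-coded here)."""
    return [3, 1, 2]


def summarize(values):
    """Function/Module 2: reduce the data to a small summary."""
    return {"count": len(values), "total": sum(values)}


def main():
    """Main function: call the other functions in order."""
    values = load_values()
    return summarize(values)


if __name__ == "__main__":
    print(main())
```

Keeping the work in separate functions means each one can be tested or replaced without touching the others.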
Abstraction in Programming
Pipeline
Program 1 → Program 2 → … → Program n
• Data Flow: General direction of data manipulation
• The results of one program generally become the input to the next
• Modularity: Can easily swap programs in and out (e.g., Program 2 for Program 2.1)
• Optimizability: Can use diverse tools for specific problems (e.g., C/C++)
• Reproducibility: Automating the work is highly desirable (from Developer to Scientific Community)
• Software Variety: Can incorporate software across diverse platforms
Image Source: www.python.org, www.r-project.org/about.html, https://www.mathworks.com/products/matlab.html
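The data-flow and modularity points can be sketched in a few lines of Python. This is only an illustration: each "program" is modelled as a plain function, and the stage names (`clean`, `count_words`, `count_chars`, `total`) and sample input are made up for the example.

```python
def clean(lines):
    """Program 1: strip whitespace and drop empty lines."""
    return [line.strip() for line in lines if line.strip()]


def count_words(lines):
    """Program 2: count the words on each line."""
    return [len(line.split()) for line in lines]


def count_chars(lines):
    """Program 2.1: a drop-in alternative that counts characters instead."""
    return [len(line) for line in lines]


def total(counts):
    """Program n: reduce the per-line counts to one number."""
    return sum(counts)


def run_pipeline(stages, data):
    """Feed the output of each stage into the next one."""
    for stage in stages:
        data = stage(data)
    return data


raw = ["  alpha beta ", "", "gamma"]
words_total = run_pipeline([clean, count_words, total], raw)
chars_total = run_pipeline([clean, count_chars, total], raw)
print(words_total, chars_total)
```

Swapping `count_words` for `count_chars` changes one entry in the stage list and nothing else, which is the kind of modularity the slide describes.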
Abstraction in Programming
PIPELINE :: Python script executing Program 1, Program 2, …, Program n
PROGRAM :: Module 1, Module 2, …, Main
MODULE :: Function Definition, Variable Definition, Control/Logic, Operations, …, Return Statement
Hacking your Graduate Studies
“If it's online, it's available…”
Python is great for pulling information out of web pages!
We can follow the BioGrid link to open that page and parse out the values of interest!
Similarly, the IntAct page is more up to date!
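A minimal sketch of pulling links out of a page with the standard-library `html.parser` module is shown below. The HTML fragment, the `example.org` URLs, and the `LinkCollector` class are hypothetical stand-ins; the real database pages will have a different structure that has to be inspected first.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs from anchor tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an anchor with an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


# Hypothetical page fragment; a real page would be fetched first,
# for example with urllib.request.urlopen(url).read().decode().
SAMPLE_HTML = """
<ul>
  <li><a href="https://example.org/biogrid/P12345">BioGrid</a></li>
  <li><a href="https://example.org/intact/P12345">IntAct</a></li>
</ul>
"""

parser = LinkCollector()
parser.feed(SAMPLE_HTML)
by_name = {text: href for href, text in parser.links}
print(by_name["BioGrid"])
```

Once the link is known, the same approach can be applied to the linked page to extract the values of interest.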
Building a Simple Pipeline
[Flowchart, built up in STEP 1 through STEP 7. Legend :: DATA, CODE, DECISION, FLOW. Data files shown: .txt, .fasta, three .png, .pdf]
Building a Simple Pipeline :: PHASE I
Legend :: DATA, CODE, DECISION, FLOW
Choices :: Pick an Organism; Pick a Single Protein
Python scripts :: get_proteome.py, get_interactor_count.py, plot_inter_count.py, compile_report.py, email_alert_simple.py
Files :: Proteome_TAXID.fasta, inter_count_TAXID.txt, r_file_TAXID.r, binary_TAXID.png, biogrid_TAXID.png, intact_TAXID.png, report_template.pdf, report_TAXID.pdf
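One way to chain the Phase I scripts is a small driver script, sketched below. The script names come from the slide, but their execution order and the idea that each takes the TAXID as its only command-line argument are assumptions made for this example, as are the helper names `build_commands` and `run_pipeline`.

```python
import subprocess
import sys

# Script names come from the slide; their order and the single TAXID
# argument are assumptions made for this sketch.
PIPELINE_SCRIPTS = [
    "get_proteome.py",
    "get_interactor_count.py",
    "plot_inter_count.py",
    "compile_report.py",
    "email_alert_simple.py",
]


def build_commands(taxid):
    """Return one command (as an argument list) per pipeline step."""
    return [[sys.executable, script, str(taxid)] for script in PIPELINE_SCRIPTS]


def run_pipeline(taxid, dry_run=True):
    """Run each step in order, stopping at the first failure."""
    for command in build_commands(taxid):
        if dry_run:
            print("would run:", " ".join(command))
        else:
            # check=True raises CalledProcessError if a step fails.
            subprocess.run(command, check=True)


run_pipeline(9606)  # dry run: only prints the commands
```

With `check=True`, a failing step stops the run immediately, so later steps never work on missing or partial output.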
Building a Simple Pipeline :: PHASE II
Legend :: DATA, CODE, DECISION, FLOW
Choices :: Pick an Organism; APPLY TO ALL PROTEINS
Python scripts :: get_proteome.py, get_interactor_count.py, plot_inter_count.py, compile_report.py, email_alert_simple.py
Files :: Proteome_TAXID.fasta, inter_count_TAXID.txt, r_file_TAXID.r, binary_TAXID.png, biogrid_TAXID.png, intact_TAXID.png, report_template.pdf, report_TAXID.pdf
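Scaling from one protein to all proteins amounts to wrapping the per-protein step in a loop over the proteome file. The sketch below assumes a plain FASTA layout (header lines starting with `>`); the sample records and the `count_interactors` placeholder are invented for illustration and do not reproduce the workshop's actual code.

```python
def read_fasta_ids(fasta_text):
    """Return the identifier (first word after '>') of every record."""
    ids = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            header = line[1:].split()
            if header:
                ids.append(header[0])
    return ids


def count_interactors(protein_id):
    """Placeholder for the per-protein step; returns a dummy count."""
    return len(protein_id)


# Made-up two-record proteome standing in for Proteome_TAXID.fasta.
SAMPLE_FASTA = """>PROT_A example protein
MKTAYIAKQR
>PROT_BB another example
MSDNELLQ
"""

counts = {pid: count_interactors(pid) for pid in read_fasta_ids(SAMPLE_FASTA)}
print(counts)
```

The per-protein function stays unchanged; only the loop around it is new.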
Building a Simple Pipeline :: PHASE III
Legend :: DATA, CODE, DECISION, FLOW
Choices :: APPLY TO ALL ORGANISMS; APPLY TO ALL PROTEINS
Python scripts :: get_proteome.py, get_interactor_count.py, plot_inter_count.py, compile_report.py, email_alert_simple.py
Files :: Proteome_TAXID.fasta, inter_count_TAXID.txt, r_file_TAXID.r, binary_TAXID.png, biogrid_TAXID.png, intact_TAXID.png, report_template.pdf, report_TAXID.pdf
Pick Your Starter Pokémon ::
Caenorhabditis elegans (TAX ID :: 6239)
Mus musculus (TAX ID :: 10090)
Escherichia coli (TAX ID :: 511145)
Homo sapiens (TAX ID :: 9606)
Drosophila melanogaster (TAX ID :: 7227)
Saccharomyces cerevisiae (TAX ID :: 559292)
Arabidopsis thaliana (TAX ID :: 3702)
Image Sources: http://bulbapedia.bulbagarden.net/wiki/Main_Page
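Applying the pipeline to every organism can be sketched as a loop over a name-to-TAX-ID table. The file-name patterns follow the TAXID naming shown on the pipeline slides; substituting the numeric ID for the literal `TAXID`, and the helper name `output_names`, are assumptions for this example.

```python
# Organisms and NCBI taxonomy IDs offered in the workshop.
ORGANISMS = {
    "Caenorhabditis elegans": 6239,
    "Mus musculus": 10090,
    "Escherichia coli": 511145,
    "Homo sapiens": 9606,
    "Drosophila melanogaster": 7227,
    "Saccharomyces cerevisiae": 559292,
    "Arabidopsis thaliana": 3702,
}


def output_names(taxid):
    """File names for one organism, following the TAXID naming pattern."""
    return {
        "proteome": f"Proteome_{taxid}.fasta",
        "counts": f"inter_count_{taxid}.txt",
        "report": f"report_{taxid}.pdf",
    }


for name, taxid in ORGANISMS.items():
    print(name, "->", output_names(taxid)["report"])
```

Because every file name is derived from the ID, adding an organism only requires one new entry in the table.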
Take Away Lessons
PHASE I ::
• Combat Data Veracity Issues :: Pipelines are useful for aggregating data from multiple sources
PHASE II ::
• Abstraction :: Wrapping a section in a loop can easily scale the outcome
• Determining Bottlenecks :: Estimating the runtime of each section of code helps locate bottlenecks
PHASE III ::
• Intelligent Design / Amortization :: Think about scaling when designing your pipelines; more work upfront has large payoffs later.
EX :: Specifying TAX_ID as a parameter in all the scripts really simplifies scaling the pipeline to all organisms
• Offlining Data :: Identify areas in your pipeline that are severe bottlenecks (calls to external servers are really bad... offline when you can)
• Replicability :: This pipeline is highly replicable for anyone looking to understand your work, and it allows the community to build upon it!
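The bottleneck and offlining lessons can be illustrated with a short timing sketch. Here `fetch_count` is a hypothetical stand-in for a slow call to an external server (simulated with `time.sleep`), and an in-memory `lru_cache` stands in for saving responses locally; a real pipeline would more likely write the responses to disk.

```python
import time
from functools import lru_cache


def timed(label, func, *args):
    """Run func(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args)
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.4f} s")
    return result, elapsed


CALLS = {"remote": 0}


@lru_cache(maxsize=None)
def fetch_count(protein_id):
    """Stand-in for a slow call to an external server."""
    CALLS["remote"] += 1
    time.sleep(0.05)  # simulated network latency
    return len(protein_id)


first, t_first = timed("first call", fetch_count, "PROT_A")
second, t_second = timed("cached call", fetch_count, "PROT_A")
print("remote calls made:", CALLS["remote"])
```

Timing each section this way shows where the time goes, and the cached second call shows why repeated external requests are worth avoiding.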
Common Bugs
Errors when trying to compile the LaTeX document might be the result of errors in previous steps.
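One hedged way to catch such upstream failures early is to check that the expected inputs exist before compiling. The sketch below assumes the report depends on the three PNG figures named on the pipeline slides, with the numeric ID in place of `TAXID`; the `missing_inputs` helper is hypothetical.

```python
from pathlib import Path
import tempfile


def missing_inputs(directory, taxid):
    """Return the expected figure files that are absent or empty."""
    expected = [f"{prefix}_{taxid}.png" for prefix in ("binary", "biogrid", "intact")]
    missing = []
    for name in expected:
        path = Path(directory) / name
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing


# Demonstration in a throwaway directory with only one figure present.
with tempfile.TemporaryDirectory() as workdir:
    (Path(workdir) / "binary_9606.png").write_bytes(b"placeholder")
    problems = missing_inputs(workdir, 9606)
    print("missing before compiling:", problems)
```

If the list is not empty, the plotting step is the place to look rather than the LaTeX source.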