Introduction to Quantitative Analysis and R


An introduction to quantitative biology and R
David Quigley, Ph.D.
Helen Diller Family Comprehensive Cancer Center, UCSF
Department of Epidemiology and Biostatistics, UCSF
What’s quantitative biology?
The process of data analysis
Reproducible research
A glance at R
Analysis walk-through
High-performance computing at UCSF
What's quantitative biology?
[Diagram: the fields of biomedical science (cell biology, molecular biology, chemical biology, genetics, evolutionary & comparative biology, clinical trials) and the quantitative questions they share: "how many samples?" and "is it a significant effect?"]
[Diagram: the same fields, split between biologists and biostatisticians; biostatisticians contribute study design and statistical inference.]
New techniques bring new data
Gene expression microarrays
Arrayed SNP chips
Pooled and arrayed genetic shRNA
Short-read sequencing (DNA, RNA, etc.)
High-throughput mass spectrometry
[Figure: wet lab science turning quantitative in recent studies: Nik-Zainal Cell 2012, Fullwood Nature 2009, CGAN Nature 2012.]
New challenges
Statistical sophistication, in study design and in interpretation
Many data points: 1,000 to 1,000,000 measurements, many false positives
Software becomes part of the experiment
Engineering thinking vs. biology thinking
[Diagram: the same landscape with bioinformatics added; bioinformaticians are hybrid researchers who sit between the biologists and the biostatisticians, spanning study design and statistical inference.]
Bioinformatics is multi-disciplinary
[Venn diagram: bioinformatics at the intersection of biology, computer science, statistics, and machine learning.]
Working in isolation
biologist: wet lab experiments
bioinformatician: data analysis
biostatistician: statistical inference
biologist: biological interpretation
The process of quantitative data analysis
[Workflow diagram: cells in a dish treated with DMSO or drug → library preparation → sequencing (produce DNA reads) → align reads to genome (alignment pipeline) → normalize data → interpret results.]
[The same workflow annotated: producing and aligning the reads is primary analysis; normalizing the data and interpreting the results is secondary analysis.]
Primary analysis: specialized tools and packages; standardized pipelines develop over time; driven by methods.
Secondary analysis: general tools; open-ended; driven by statistics and biology.
Acquiring primary data
[Pipeline diagram: instrument → tool 1 → tool 2 → tool 3, with quality control and statistical normalization steps.]
[The instrument (engineering) emits a native output format: spectrographs, qPCR cycle files, microarray images, short sequences.]
Primary analysis: what happened?
[Diagram: the instrument (engineering) produces a native output format (spectrographs, qPCR cycle files, microarray images, short sequences); bioinformatics converts it into standardized output (protein assignments, expression matrices, genome variants).]
Considerations during primary analysis
batch effects
sample quantity
biological artifacts (e.g. GC content)
individual assay quality
sample quality
platform effects
operator effects
Normalization challenges vary
Solved problems: microarray expression level, TaqMan expression level, genotypes from SNP chips
Best practices: SNV calling from sequence, gene-level RNA-seq, ChIP-seq
Open problems: mRNA isoform reconstruction, tumor clonality analysis from sequence
Secondary analysis: what might it mean?
To which DNA sequences does TP53 bind?
What mutations are frequent in basal-like breast cancer?
Which kinases does my tool compound target?
Choosing quantitative tools
Cost
Learning curve
Ease of use
Flexibility
Ecosystem: people and other tools
Traditional programming languages
Python, C++, Java, others
Can solve any computable problem
Create the fastest tools
Free
Require programming expertise
Complex to write and test: high effort
Specialized single-purpose programs
Command-line tools: academic research; type commands at a prompt or run scripts; PLINK, bowtie, GATK, bedtools
GUI (point and click): commercial software for a vendor's platform; slick, opaque, hard or impossible to automate
Commercial statistics programs
STATA, SPSS, GraphPad, others
1) Load one dataset
2) Select analysis by clicking on a GUI
3) Generate a report
may have a built-in language
mature tools
Not free
R: a “software environment”
Using R is like writing and using software
Traditionally, biologists did not do this.
Why is R popular?
Open-ended: most statistical problems can be solved; professional statisticians use R
Large library of packages: a package is an easy-to-use published method, like a Qiagen kit
Free!
You use R by typing at the prompt
There is no pull-down menu of statistical commands
What’s good about this approach?
easy to reproduce
runs on anything
makes sense to computer programmers
re-use and chain analyses together
work with multiple datasets
use packages of code
What’s hard about this approach?
hard to get started
cryptic commands
built-in help is hard to use
RStudio makes it easier
Bioconductor
Curated collection of R packages: microarrays, aCGH, sequence analysis, advanced statistics, graphics, lots more
bioconductor.org
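The talk does not show an installation command, so as a hedged sketch: with today's installer (BiocManager, a CRAN package; the package name limma is just an example), installing a Bioconductor package looks roughly like this.
> # install the Bioconductor installer from CRAN, then a package (e.g. limma)
> install.packages( "BiocManager" )
> BiocManager::install( "limma" )
> library( limma )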
A gentle introduction to R
Excel vs. R
Excel
Easy tasks are easy
non-trivial tasks impossible or expensive
No paper trail
Mangles gene names
Plots look terrible
Hard to reproduce and standardize
Comparing Excel and R
R
Easy jobs are hard at first
Non-trivial things are possible
Easy to make a paper trail
Bioinformatics tools often written for R
Can create publication-ready plots
Working with R is interactive
type in a line of text
R evaluates your words as a command
R returns a response, e.g.:
printed message
new variable
picture of a plot
Simple functions on number lists
Create a vector using the c() function
Operate on the vector
Commands are functions
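The code shown on screen is not preserved in the transcript; a minimal reconstruction of this kind of interactive session (the values in my.data are made up for illustration) might be:
> # create a vector of numbers with c(), then operate on it
> my.data = c( 10, 12, 9, 15, 11 )
> my.data
[1] 10 12  9 15 11
> mean( my.data )
[1] 11.4
> length( my.data )
[1] 5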
Organizing data
Each subject has a row
Each feature is a column
R calls columns vectors
A vector is an ordered collection of numbers or words:
name: ["Flopsy", "Mopsy", "Cottontail", "Peter"]
age: [2.5, 2.6, 2.5, 4]
R calls the data set a data frame
A data frame is a list of vectors (columns) that have names; elements can be read and written by row & column.
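A reconstruction of the rabbit example as R code (the exact slide code is not in the transcript, and the data frame name rabbits is my own choice):
> # build the two vectors and combine them into a data frame
> name = c( "Flopsy", "Mopsy", "Cottontail", "Peter" )
> age = c( 2.5, 2.6, 2.5, 4 )
> rabbits = data.frame( name, age )
> # read one element by row number and column name
> rabbits[ 4, "age" ]
[1] 4
> # or pull a column, then an element by position
> rabbits$name[4]
[1] "Peter"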
I can slice and dice the data frame
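The slide-by-slide code is not in the transcript; typical slicing operations on the rabbits data frame from the sketch above would look something like this:
> # one column, as a vector
> rabbits$age
[1] 2.5 2.6 2.5 4.0
> # one row
> rabbits[ 2, ]
   name age
2 Mopsy 2.6
> # rows matching a condition
> rabbits[ rabbits$age > 2.5, ]
   name age
2 Mopsy 2.6
4 Peter 4.0
> # write a new value into one cell
> rabbits[ 1, "age" ] = 3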
You can define your own functions
function_name( details about how to do it )
Example: generate a sequence from 1 to 5, counting by 0.5; the parameters for seq() are named from, to, and by.
Example: report the mean of my.data; the result of one function is fed into another one.
Example: define a new function that adds 2 to whatever it's passed, then compare to the original value of my.data (sketched below).
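Based on the descriptions above, the on-screen code likely resembled the following (using the my.data vector from the earlier sketch; the function name add.two is my own):
> # a sequence from 1 to 5, counting by 0.5
> seq( from = 1, to = 5, by = 0.5 )
[1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0
> # feed the result of one function into another
> print( mean( my.data ) )
[1] 11.4
> # define a new function that adds 2 to whatever it is passed
> add.two = function( x ){ x + 2 }
> add.two( my.data )
[1] 12 14 11 17 13
> # my.data itself is unchanged
> my.data
[1] 10 12  9 15 11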
Code is a protocol for the computer
A program is a series of operations on data
Short programs (scripts) are often linear
Large programs have decision points
“flow control”
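A small hedged illustration of flow control, not taken from the original slides, again using the my.data vector from the sketches above:
> # loop over the data; the if/else is a decision point
> for( value in my.data ){
    if( value > 10 ){
        print( "bigger than ten" )
    } else {
        print( "ten or smaller" )
    }
  }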
Most jobs: data preparation & scripts
tools that manipulate text
text editing programs (TextPad, BBEdit, Emacs)
Python
Old-school command line tools (awk)
Reproducible research
Replicate a wet lab experiment
detailed protocols (not printed in the methods)
extensive optimization
reagents that might be unique or hard to get
techniques that require years of experience
Replicate a dry lab experiment
published algorithms (if novel)
published source code
sometimes “available from the authors”
well-specified input and deterministic output
no reagents
Okay, maybe a supercomputer or cloud
How hard could it be?
Many chances to make honest errors
Bookkeeping errors
Transposed column headers
Out-of-date/changed annotations
Off-by-one
Misunderstood sample labels
Batch effects
Cryptic cohort stratification
Inappropriate analytical methods
Your notebook should be the final product
Hand-curate the metadata; automate the analysis.
[Diagram: primary data + metadata → analysis script → figures and tables, tied together with R Markdown.]
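As a hedged sketch (not the original file), an R Markdown notebook mixes prose with chunks of R code that are executed when the report is knit; the file name patients.txt and the columns used are taken from the walk-through later in the talk:

---
title: "TP53 and breast cancer subtype"
output: html_document
---

Load the patient metadata and tabulate ER status.

```{r}
sa = read.table( "patients.txt", header = TRUE )
table( sa$ER.status )
```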
Walk-through of a straightforward analysis
Primary data from the METABRIC study
Gene expression
TP53 sequence
1,400 samples from 5 hospitals
Is there an association between breast cancer subtype and TP53 mutation?
Tasks
Normalize data: batch effects, unwanted inter-sample variation
Identify outliers
Associations between p53 and subtype
Quantile Normalization (limma)
Force every array to have the same distribution of expression intensities
> library(limma)
> raw = read.table( 'raw_extract.txt', ... )
> # limma's quantile normalization function is normalizeQuantiles()
> raw.normalized = normalizeQuantiles( as.matrix( raw ) )
> normalized = log2( raw.normalized )
Identify batch effects in microarrays
Principal Components Analysis (PCA)
Identify axes of maximal variation in a matrix
[Figure: scatter plots of gene 1 vs. gene 2 showing the strongest axis of variation, with samples falling into two groups, A and B.]
PCA identifies a batch effect
[Figure: samples plotted on the first and second principal components; hospital 3 (yellow) separates from the rest.]
> my.pca = prcomp( t( expression.data ) )
> plot( my.pca, ... )
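plot() on a prcomp object shows the variance of each component; to draw the samples in the space of the first two components, colored by hospital, a sketch might look like the following (sa$Hospital is an assumed per-sample annotation column, one entry per array):
> # samples in the space of the first two principal components,
> # colored by the hospital that contributed each sample
> plot( my.pca$x[ , 1 ], my.pca$x[ , 2 ],
    col = as.numeric( as.factor( sa$Hospital ) ),
    xlab = "first component", ylab = "second component" )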
Batch correction reduces bias
The ComBat function (in the sva package) removes user-defined batch effects.
[Figure: samples on the first and second principal components after batch correction.]
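A hedged sketch of applying ComBat, assuming normalized is a genes-by-samples matrix and sa$Hospital (an assumed column) records each sample's hospital of origin:
> library( sva )
> # remove the known batch effect (here, hospital of origin)
> corrected = ComBat( dat = normalized, batch = sa$Hospital )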
Molecular subtypes of breast carcinoma, defined by gene expression
[Figure: bar plot of subtype frequencies by ER status: Luminal A N=507, Luminal B N=379, Her2 N=161, Basal N=234.]
> sa = read.table( 'patients.txt', ... )
> tumor.counts = table( sa$ER.status, sa$PAM50Subtype )
(convert counts to percentages; sketched below)
> barplot( c( tumor.counts[1], tumor.counts[2] ),
    col = c( "red", "green" ), ... )
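The "(convert counts to percentages)" step is not shown; one possible way to do it, as a sketch, is prop.table() on the tumor.counts table built above (margin = 1 normalizes within each ER status row; margin = 2 would normalize within subtype instead):
> # convert counts to percentages within each ER status group
> tumor.pct = prop.table( tumor.counts, margin = 1 ) * 100
> round( tumor.pct, 1 )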
Find interactions: TP53 and subtype
Fit a linear model:
> fitted.model = lm( dependent ~ independent )
Perform Analysis of Variance:
> anova( fitted.model )
General form of my analysis:
> anova( lm( gene.expression ~ PAM * TP53 ) )
18,000 genes
PAM: {LumA, LumB, Her2, Basal}
TP53: {mutant, WT}
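For a single gene this might look like the sketch below, assuming normalized has gene names as row names and PAM and TP53 are per-sample factors with the levels listed above; the row name "CD3E" is only an illustration borrowed from the result slide:
> gene.expression = normalized[ "CD3E", ]
> fitted.model = lm( gene.expression ~ PAM * TP53 )
> # the PAM:TP53 row of the ANOVA table holds the interaction P value
> anova( fitted.model )[ "PAM:TP53", "Pr(>F)" ]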
Automate with loops
Calculate the ANOVA for all 18,000 genes by looping through each gene and storing the result; the loop repeats 18,000 times.
> n_genes = 18000
> result = rep( 0, n_genes )
> for( counter in 1:n_genes ){
    result[counter] = anova(...)
  }
Then sort the results and identify the significant interactions (a fuller sketch follows).
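Filling in the elided anova(...) call, a hedged version of the loop that stores one interaction P value per gene, under the same assumptions as the single-gene sketch above, could be:
> n_genes = nrow( normalized )
> result = rep( 0, n_genes )
> for( counter in 1:n_genes ){
    gene.expression = normalized[ counter, ]
    result[counter] = anova( lm( gene.expression ~ PAM * TP53 ) )[ "PAM:TP53", "Pr(>F)" ]
  }
> # smallest P values first
> head( sort( result ) )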
Immune infiltration in TP53-WT Basal
[Figure: box plots of CD3E log2 expression by degree of immune infiltration (absent, mild, severe).]
Does p53 have a role in immune surveillance?
High-performance computing resources
Cluster computing
1 computer: 20 hours
20 computers: 1 hour
Clusters available on campus
Institute for Human Genetics: recharge; ~800 cores, plenty of disk space
HDFCC Cluster: free for small jobs to cancer center members; contribute resources for big jobs; ~800 cores, plenty of disk space
QB3: free for small jobs to QB3 members; lots of cores, not much disk space
Amazon AWS: effectively unlimited capacity, pay by the hour
Next steps: getting help and learning more
UCSF resources
Library classes and information
Formal courses (BMI, Biostatistics)
Cores (Computational Biology, Genomics)
QBDG monthly methods discussion group
I’m teaching a biostatistics class...
Next year’s first year students
Online and in-person material
Want to TA? Get in touch.
Online classes and blogs
Free courses on data analysis
http://jhudatascience.org
simplystatistics.org
Coursera etc...
Good tutorials on sequence analysis
http://evomics.org/learning
Online forums: expert help for free
biostars.org: all of bioinformatics
seqanswers.com: next-gen sequencing
stats.stackexchange.com: statistics
Questions?
[email protected]