Introduction to Quantitative Biology: 2015 Teaching


An introduction to quantitative biology and R
David Quigley, Ph.D.
[email protected]
Helen Diller Comprehensive Cancer Center, UCSF
genetics of skin cancer
Balmain (UCSF)
genetics of breast cancer
Børresen-Dale (U Oslo)
genetic interactions
synthetic lethality
Ashworth (UCSF)
2007
2009
2011
2013
2015
What’s quantitative biology?
The process of data analysis
Reproducible research
A glance at R
analysis walk-through
High-performance computing at UCSF
What’s quantitative biology?
Quantitative Biology
Studying biology by integrating molecular,
genetic, computational, and statistical methods.
cf. molecular biology, developmental biology
Data Science
Statistics with venture capital funding
Genetics has always been quantitative
evolutionary genetics
population genetics
epidemiology
linkage analysis
association tests
Molecular biology 30 years ago
Wet lab
quantitative
Suzuki Med Mol Morph 2010
Oh PNAS 1996
Mao Genes Dev 2004
Molecular biology now
Wet lab
quantitative
Nik-Zainal Cell 2012
Fullwood Nature 2009
CGAN Nature 2012
Challenges
requires statistical sophistication
in study design
in interpretation
many data points
1,000 to 1,000,000 measurements per sample
many false positives which look like great stories
software becomes part of the experiment
divide between engineering, biology culture & thinking
The process of quantitative data analysis
Quantitative Biology is biology.
Start with questions.
motivation
approach
statistical power
analysis
First comes bioinformatics
engineering
instrument
native output format: spectrographs, qPCR cycle files, microarray images, short sequences
What did the machine say?
engineering
instrument
native output format: spectrographs, qPCR cycle files, microarray images, short sequences
bioinformatics
standardized output: protein assignments, expression matrices, genome variants
Considerations during primary analysis
batch effects
sample quantity
biological artifacts (e.g. GC content)
individual assay quality
sample quality
platform effects
operator effects
Normalization challenges vary
solved problems
microarray expression level
taqman expression level
genotypes from SNP chips
best practices
SNV calling from sequence
gene-level RNA-seq
ChIP-seq
open problems
mRNA isoform reconstruction
tumor clonality analysis from sequence
Secondary analysis addresses the biological question
To which DNA sequences does TP53 bind?
What mutations are frequent in basal-like breast cancer?
Which kinases does my tool compound target?
Primary analysis
specialized tools and packages
standardized pipelines develop over time
driven by methods
Secondary analysis
general tools
open-ended
driven by statistics and biology
Choosing quantitative tools
Cost
Learning curve
Ease of use
Flexibility
ecosystem
people
other tools
Traditional programming languages
Python, C++, Java, others
can solve any computable problem
creates the fastest tools
free
requires programming expertise
complex to write and test
 high effort
Specialized single-purpose programs
command line tools
academic research
type commands at a prompt or run scripts
PLINK, bowtie, GATK, bedtools
GUI (point and click)
commercial software for a vendor’s platform
slick, opaque, hard/impossible to automate
Commercial statistics programs
STATA, SPSS, GraphPad, others
1) Load one dataset
2) Select analysis by clicking on a GUI
3) Generate a report
may have a built-in language
mature tools
Not free
Web-based tools
Galaxy
string together pre-defined analysis steps
very easy to use
reproducible
R: a “software environment”
Using R is like writing and using software
Traditionally, biologists did not do this.
Why is R popular?
Open-ended, open-source
Large library of packages
package: easy-to-use published methods
like a Qiagen kit
Free!
You use R by typing at the prompt
There is no pull-down menu of statistical commands
What’s good about this approach?
chain analyses
work with multiple datasets
use packages of code
easy to reproduce
runs on anything
makes sense to computer programmers
What’s hard about this approach?
hard to get started
cryptic commands
built-in help is hard to use
RStudio makes it easier
bioconductor
Curated collection of R packages
Microarrays, aCGH, sequence analysis, advanced
statistics, graphics, lots more
bioconductor.org
packages for common tasks
limma: microarray normalization and analysis
samr: differential expression
impute: dealing with missing data
downloaded for free from a central repository
Reproducible research
Replicate a wet lab experiment
detailed protocols (not printed in the methods)
extensive optimization
reagents that might be unique or hard to get
techniques that require years of experience
Replicate a dry lab experiment
published algorithms (if novel)
published source code
sometimes “available from the authors”
well-specified input and deterministic output
no reagents
Okay, maybe a supercomputer or cloud
How hard could it be?
Many chances to make honest errors
Bookkeeping errors
Transposed column headers
Out-of-date/changed annotations
Off-by-one
Misunderstood sample labels
Batch effects
Cryptic cohort stratification
Inappropriate analytical methods
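Several of the bookkeeping errors listed above can be caught with explicit checks before any statistics are run. The following is a minimal sketch, assuming a hypothetical expression matrix with one column per sample and a hypothetical metadata table with a sample.id column; these names and values are illustrative and do not come from the slides.

```r
# Hypothetical toy data: 2 genes x 3 samples, with metadata stored in a different order.
expression <- matrix(c(5.1, 6.3, 7.2, 4.8, 5.9, 6.1), nrow = 2,
                     dimnames = list(c("geneA", "geneB"), c("S1", "S2", "S3")))
metadata <- data.frame(sample.id = c("S3", "S1", "S2"),
                       subtype   = c("Basal", "LumA", "Her2"),
                       stringsAsFactors = FALSE)

# Every sample must be described exactly once in the metadata.
stopifnot(!anyDuplicated(metadata$sample.id))
stopifnot(setequal(metadata$sample.id, colnames(expression)))

# Reorder the metadata rows to match the column order of the expression matrix,
# then confirm that the two are aligned before analysing anything.
metadata <- metadata[match(colnames(expression), metadata$sample.id), ]
stopifnot(identical(metadata$sample.id, colnames(expression)))
```

Checks like these stop the script as soon as labels and data disagree, rather than letting a misaligned table flow silently into the results.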
Your notebook should be the final product
hand-curate metadata; automate the analysis
primary data
metadata
analysis script
figures
tables
R Markdown
Learning R data types by comparing them to Excel spreadsheets
Comparing Excel and R
Excel
Easy tasks are easy
non-trivial tasks impossible or expensive
No paper trail
Mangles gene names
Plots look terrible
R
Easy jobs are hard at first
Non-trivial things are possible
Easy to make a paper trail
Biostatistics researchers publish tools in R
Can create publication-ready plots
Organizing data in Excel
Each subject has a row.
Each column has a feature of your subjects.
R calls the data points variables
variables
numbers and characters (letters, words)
numbers: 2.6, 4
characters: “Flopsy”, “white, brown paws”
R calls the columns vectors
vectors
ordered collections of a variable
name: [“Flopsy”, “Mopsy”, “Cottontail”, “Peter”]
age: [2.5, 2.6, 2.5, 4]
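The two vectors above could be created in R roughly as follows. This is a small sketch; the variable names name and age mirror the labels on the slide, and the choice of operations shown is illustrative.

```r
# c() ("combine") builds a vector from individual values.
name <- c("Flopsy", "Mopsy", "Cottontail", "Peter")   # character vector
age  <- c(2.5, 2.6, 2.5, 4)                           # numeric vector

name[2]           # the second element of the name vector
mean(age)         # functions operate on the whole vector at once
name[age > 2.5]   # select names using a condition on the age vector
```

Vectors are ordered, so the second name and the second age belong to the same subject.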
R calls the data set a data frame
data frame
a list of vectors (columns) that have names
elements can be read and written by row & column
I can slice and dice the data frame
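As an illustration of reading and writing by row and column, here is a minimal sketch. The data frame name rabbits is an assumption made for this example; the names and ages are the ones used on the earlier slides.

```r
# A data frame is a list of named vectors (columns) of equal length.
rabbits <- data.frame(name = c("Flopsy", "Mopsy", "Cottontail", "Peter"),
                      age  = c(2.5, 2.6, 2.5, 4),
                      stringsAsFactors = FALSE)

rabbits[1, ]                          # first row (one subject)
rabbits$age                           # one column, addressed by name
rabbits[rabbits$age > 2.5, "name"]    # names of the subjects older than 2.5
rabbits[4, "age"] <- 4.5              # write a single cell by row and column
```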
Tell R to do things using functions
function_name( details about how to do it )
generate sequence from 1 to 5 counting by 0.5
parameters for seq are named
from, to, and by
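The call described above would look something like this. Storing the result in a variable called my.data is an assumption made here so that the value can be reused.

```r
# Generate the sequence 1.0, 1.5, 2.0, ..., 5.0 using named parameters.
my.data <- seq(from = 1, to = 5, by = 0.5)
my.data
```

Naming the parameters makes the call readable without having to remember their order.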
report the mean of my.data. Result of one
function is fed into another one.
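A short sketch of both ideas, assuming my.data holds the sequence from 1 to 5 in steps of 0.5:

```r
my.data <- seq(from = 1, to = 5, by = 0.5)

# Report the mean of my.data.
mean(my.data)

# The result of one function can be fed directly into another:
# here the output of seq() becomes the input of mean().
mean(seq(from = 1, to = 5, by = 0.5))
```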
define a new function that adds 2 to
whatever it’s passed
compare to original value of my.data
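One way this might be written is shown below. The function name add.two and the variable shifted are hypothetical names chosen for this sketch.

```r
# Define a new function that adds 2 to whatever it is passed.
add.two <- function(x) {
  x + 2
}

my.data <- seq(from = 1, to = 5, by = 0.5)
shifted <- add.two(my.data)

# Compare to the original values: my.data itself is unchanged.
my.data
shifted
```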
Code is a protocol for the computer
A program is a series of operations on data
Short programs (scripts) are often linear
Large programs have decision points
“flow control”
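A decision point can be as small as an if/else inside a loop. In this sketch the gene names, the expression values, and the threshold of 7 are all hypothetical.

```r
# Hypothetical log2 expression values for three genes.
expression.values <- c(geneA = 9.2, geneB = 3.1, geneC = 7.5)

calls <- character(length(expression.values))
for (i in seq_along(expression.values)) {
  # Decision point: the program takes a different path depending on the data.
  if (expression.values[i] > 7) {
    calls[i] <- "high"
  } else {
    calls[i] <- "low"
  }
}
names(calls) <- names(expression.values)
calls
```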
Most jobs: data preparation & scripts
tools that manipulate text
text editing programs (TextPad, BBEdit, Emacs)
Python
Old-school command line tools (awk)
Walk-through of a straightforward analysis
Primary data from METABRIC study
gene expression
TP53 sequence
1,400 samples from 5 hospitals
Is there an association between breast
cancer subtype and TP53 mutation?
Tasks
Normalize data
batch effects
unwanted inter-sample variation
Identify outliers
associations between p53 and subtype
Quantile Normalization (limma)
Force every array to have the same distribution of
expression intensities
> library(limma)
> raw = read.table('raw_extract.txt', ...)
> raw.normalized = normalizeQuantiles( as.matrix( raw ) )
> normalized = log2( raw.normalized )
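To make "the same distribution" concrete, here is a base-R sketch of the idea behind quantile normalization on a tiny hypothetical matrix. It is a simplified illustration and not limma's implementation (for example, it breaks ties by order of appearance rather than averaging them); the function name quantile_normalize and the toy values are assumptions.

```r
# Sketch of quantile normalization: genes in rows, arrays in columns.
quantile_normalize <- function(x) {
  # Rank of each value within its own column (ties broken by order of appearance).
  ranks <- apply(x, 2, rank, ties.method = "first")
  # Target distribution: the mean of the k-th smallest value across all columns.
  target <- rowMeans(apply(x, 2, sort))
  # Replace each value by the target value at its rank.
  normalized <- apply(ranks, 2, function(r) target[r])
  dimnames(normalized) <- dimnames(x)
  normalized
}

# Hypothetical raw intensities for 4 genes on 3 arrays.
toy.raw <- matrix(c(5, 2, 3, 4,
                    4, 1, 6, 2,
                    3, 4, 6, 8), nrow = 4,
                  dimnames = list(paste0("gene", 1:4),
                                  c("array1", "array2", "array3")))
toy.normalized <- quantile_normalize(toy.raw)
toy.normalized
```

After this step every column contains exactly the same set of values; only the assignment of values to genes differs between arrays.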
Identify batch effects in microarrays
Principal Component Analysis
Identify the axes of strongest variation in a matrix
[Figure: scatter plots of gene 1 against gene 2, with samples labelled group A and group B]
PCA identifies a batch effect
[Figure: samples plotted on the first and second components; hospital 3 shown in yellow]
> my.pca = prcomp( t( expression.data ) )
> plot( my.pca, ... )
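The following sketch shows how a batch effect can surface in a PCA, using simulated data rather than the METABRIC arrays. The matrix size, the two batches, and the size of the artificial shift are all assumptions made for illustration.

```r
set.seed(42)

# Simulated expression matrix: 200 genes (rows) x 12 samples (columns).
n.genes <- 200
batch <- rep(c("A", "B"), each = 6)
expression.data <- matrix(rnorm(n.genes * length(batch)), nrow = n.genes)

# Artificial batch effect: shift the first 50 genes upward in batch B only.
expression.data[1:50, batch == "B"] <- expression.data[1:50, batch == "B"] + 3

# prcomp expects samples in rows, so transpose the genes-by-samples matrix.
my.pca <- prcomp(t(expression.data))

# Fraction of the total variance captured by each component.
variance.explained <- my.pca$sdev^2 / sum(my.pca$sdev^2)

# Plot each sample on the first two components, coloured by batch.
plot(my.pca$x[, 1], my.pca$x[, 2],
     col = ifelse(batch == "A", "red", "blue"), pch = 19,
     xlab = "first component", ylab = "second component")
```

The per-sample scores live in my.pca$x. Calling plot() directly on a prcomp object draws a bar chart of component variances instead, so a scatter plot of the scores has to be requested explicitly as above.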
Batch correction reduces bias (ComBat)
ComBat (in the sva package) reduces user-defined batch effects
[Figure: samples plotted on the first and second principal components]
Molecular subtypes of breast carcinoma, defined by gene expression
ER status
Luminal A N=507
Luminal B N=379
Her2 N=161
Basal N=234
> sa = read.table('patients.txt', ...)
> tumor.counts = table( sa$ER.status, sa$PAM50Subtype )
(convert counts to percentages)
> barplot( c( tumor.counts[1], tumor.counts[2] ),
    col=c("red","green"), ... )
Find interactions: TP53 and subtype
Fit a linear model:
> fitted.model = lm( dependent ~ independent )
Perform Analysis of Variance:
> anova( fitted.model )
general form of my analysis:
> anova( lm( gene.expression ~ PAM * TP53 ) )
18,000 genes
PAM: {LumA, LumB, Her2, Basal}
TP53: {mutant, WT}
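For a single gene, the general form above might be run as in the sketch below. The data are simulated with no real effect, and the sample size of 80 is an assumption; only the variable names PAM, TP53, and gene.expression come from the slide.

```r
set.seed(1)

# Simulated annotations for 80 samples: 20 per subtype, alternating TP53 status.
PAM  <- factor(rep(c("LumA", "LumB", "Her2", "Basal"), each = 20))
TP53 <- factor(rep(c("mutant", "WT"), times = 40))

# Simulated log2 expression of one gene, unrelated to subtype or TP53.
gene.expression <- rnorm(80, mean = 8)

# PAM * TP53 expands to PAM + TP53 + PAM:TP53 (two main effects and their interaction).
fitted.model <- lm(gene.expression ~ PAM * TP53)
anova.table <- anova(fitted.model)
anova.table

# The interaction row answers: does the TP53 effect differ between subtypes?
p.interaction <- anova.table["PAM:TP53", "Pr(>F)"]
p.interaction
```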
Automate with loops
Calculate anova for 18,000 genes by looping
through each gene and storing result.
> n_genes = 18000
> result = rep( 0, n_genes )
> for( counter in 1:n_genes ){
result[counter] = anova(...)
}
sort results
identify significant interaction
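The loop, sort, and identify steps above could be sketched as follows on simulated data with 50 genes instead of 18,000. The Benjamini-Hochberg adjustment is an addition made here because testing many genes produces many false positives; the slides do not say which correction was used, and the 0.05 cutoff is likewise an assumption.

```r
set.seed(2)

n_genes <- 50
PAM  <- factor(rep(c("LumA", "LumB", "Her2", "Basal"), each = 20))
TP53 <- factor(rep(c("mutant", "WT"), times = 40))

# Simulated expression matrix: one row per gene, one column per sample.
expression <- matrix(rnorm(n_genes * 80), nrow = n_genes,
                     dimnames = list(paste0("gene", 1:n_genes), NULL))

# Loop through each gene and store the interaction p-value.
result <- rep(NA_real_, n_genes)
names(result) <- rownames(expression)
for (counter in 1:n_genes) {
  y <- expression[counter, ]
  result[counter] <- anova(lm(y ~ PAM * TP53))["PAM:TP53", "Pr(>F)"]
}

# Adjust for multiple testing, sort, and list genes passing the cutoff.
adjusted <- sort(p.adjust(result, method = "BH"))
significant <- names(adjusted)[adjusted < 0.05]
head(adjusted)
```

Because the simulated data contain no real interaction, few or no genes should pass the cutoff; with real data the sorted list is the starting point for follow-up.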
repeat 18,000 times
Immune infiltration in TP53-WT Basal
Does p53 have a role in immune surveillance?
[Figure: CD3E log2 expression by infiltration (absent, mild, severe)]
High-performance computing resources
Cluster computing
1 computer
20 hours
20 computers
1 hour
Clusters available on campus
Institute for Human Genetics
Recharge
~800 cores, plenty of disk space
HDFCC Cluster
Free for small jobs to cancer center members
Contribute resources for big jobs
~800 cores, plenty of disk space
QB3
Free for small jobs to QB3 members
Lots of cores, not much disk space
Amazon AWS
Infinite capacity, but bring a credit card
Next steps: getting help and learning more
online forums: expert help for free
biostars.org: all of bioinformatics
seqanswers.com: Nextgen sequencing
stats.stackexchange.com: statistics
UCSF resources
Library classes and information
Formal courses (BMI, Biostatistics)
Cores (Computational Biology, Genomics)
QGDG monthly methods discussion group
Online classes and blogs
Free courses on data analysis
http://jhudatascience.org
simplystatistics.org
Coursera etc...
Good tutorials on sequence analysis
http://evomics.org/learning
Questions?
[email protected]