Intro to Bioinformatics CSC 360-01

Reflections on teaching a bioinformatics programming and gene expression analysis course
Garrett Dancik, PhD
Eastern Connecticut State University
http://bioinformatics.easternct.edu
May 29, 2015
Bioinformatics Programming and Analysis
(Course Background)
• Supports a forthcoming Bioinformatics minor at Eastern
Connecticut State University
• A computer science-based course
• open to students who have taken Programming I (C++)
• An upper-level elective course for computer science
majors, who will
• Have little / no biological background
• Have little / no background in statistics
• But are experienced programmers (C++, Java)
Course content
R programming
• Basic programming concepts in R – variables, vectors, functions
Biology / Bioinformatics
• Gene expression profiling
• Data processing
• Clustering
• Identification of differentially expressed genes
• Heatmaps
• Predictive modeling
Probability and Statistics
• Descriptive statistics
• Probability and probability distributions
• Hypothesis testing and p-values
Note: a separate Introduction to Bioinformatics course is also offered
(covers sequence databases, sequence alignment, etc.)
R programming
• R (http://cran.r-project.org) is an open source, freely
available programming language for data analysis and
visualization
• Bioconductor (http://www.bioconductor.org) is an open
source repository of (primarily) R packages for the
analysis of genomic data.
• GEOquery – downloads data from the Gene Expression Omnibus
(GEO)
• affy – processing and analysis of Affymetrix oligonucleotide arrays
• limma (Linear Models for Microarray Data) – analysis of gene
expression data
• RStudio (http://www.rstudio.com) is an integrated
development environment (IDE) for R.
[Screenshot of RStudio, with labels: Run, History, R Script, Plots, R Console]
• R Notebooks
• Creates a document
(e.g., web page)
interweaving R code and
output
• Requires the knitr
package
• Serves as a study guide
linking R code with
output
• Confirms that code is
error-free
Probability and Statistics
• Why is this necessary?
• Summary of gene expression data and sample characteristics
• Quality assessment of gene expression data
• Identification of differentially expressed genes
• What is a p-value? What is an adjusted p-value (or FDR)?
• How do students learn this?
• Students are surveyed to collect data that will be fun to analyze
• R is used as a tool for illustrating important concepts in probability
and statistics
• Sample space
• Probability distributions
• Central limit theorem
• P-values
Summary of Student Survey – Data is analyzed in class or in an
assignment
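One way to make the definition of a p-value concrete is a permutation test. The sketch below is only an illustration: the survey variable (hours of sleep), the group sizes, and the simulated values are all hypothetical and do not come from the actual class survey.

```r
set.seed(42)  # arbitrary seed, for reproducibility only

# Hypothetical survey variable: hours of sleep for two groups of students
group.a <- rnorm(15, mean = 7.0, sd = 1)
group.b <- rnorm(15, mean = 6.2, sd = 1)

observed <- mean(group.a) - mean(group.b)

# Under the null hypothesis the group labels are interchangeable,
# so shuffle the pooled values and recompute the difference many times
pooled <- c(group.a, group.b)
perm.diff <- replicate(10000, {
  shuffled <- sample(pooled)
  mean(shuffled[1:15]) - mean(shuffled[16:30])
})

# Two-sided p-value: proportion of shuffles at least as extreme as observed
p.value <- mean(abs(perm.diff) >= abs(observed))
p.value
```

The p-value is read directly as "how often would a difference this large arise if the labels did not matter", which is the idea students later reuse for gene expression data.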
R as a tool: Understanding sample space
(all possible poker hands)
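A minimal sketch of how the poker-hand sample space can be counted in base R. The choice of hands to count (four of a kind, flush) is an assumption made for illustration; the slide does not say which hands were used in class.

```r
# Number of distinct 5-card hands from a 52-card deck: C(52, 5)
n.hands <- choose(52, 5)

# Four of a kind: choose the rank (13 ways), then the fifth card (48 ways)
n.four <- 13 * 48

# Flush, counting straight flushes too: choose a suit (4 ways),
# then any 5 of the 13 cards in that suit
n.flush <- 4 * choose(13, 5)

# Probability = favorable outcomes / size of the sample space
p.four <- n.four / n.hands
p.flush <- n.flush / n.hands

# Check one count by brute force on a smaller sample space:
# enumerate every 5-card hand that can be drawn from a single suit
suit.hands <- combn(13, 5)
ncol(suit.hands) == choose(13, 5)
```

Enumerating the full 52-card sample space with combn(52, 5) is also possible, but it produces about 2.6 million columns, so the counting formulas are the practical route.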
R as a tool: Random sampling and the
Central Limit Theorem
Sample R code
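# Mean of a random sample of size n from an exponential distribution (rate = 1)
get.sample.mean <- function(n) mean(rexp(n))
# A large sample representing the (skewed) population
x.population = rexp(1000)
# 5000 sample means, each computed from a sample of size 10
x.10 = replicate(5000, get.sample.mean(10))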
…
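The following is a self-contained sketch of the same idea, not a reconstruction of the elided code. The second sample size (50), the seed, and the skewness helper are assumptions added for illustration. The sample mean of Exponential(1) data has expectation 1 and standard deviation 1/sqrt(n), and the Central Limit Theorem adds that its distribution becomes more symmetric and bell-shaped as n grows.

```r
set.seed(360)  # arbitrary seed, for reproducibility only

# Mean of one random sample of size n from Exponential(rate = 1)
get.sample.mean <- function(n) mean(rexp(n))

# 5000 sample means for each of two sample sizes
x.10 <- replicate(5000, get.sample.mean(10))
x.50 <- replicate(5000, get.sample.mean(50))

# Sample skewness (0 for a symmetric distribution); helper defined here
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

results <- data.frame(
  n = c(10, 50),
  mean = c(mean(x.10), mean(x.50)),
  sd.observed = c(sd(x.10), sd(x.50)),
  sd.predicted = 1 / sqrt(c(10, 50)),
  skewness = c(skewness(x.10), skewness(x.50))
)
print(results)
```

Calling hist(x.10) and hist(x.50) side by side shows the same effect visually: the distribution of sample means for n = 50 is narrower and less skewed than for n = 10.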
Bioinformatics and Gene Expression
Analysis
• Gene expression and genetics, microarrays
• Processing raw microarray data (CEL files), using the affy package
• Downloading data from the Gene Expression Omnibus (GEO;
http://www.ncbi.nlm.nih.gov/geo/), using the GEOquery package
• Clustering and heatmaps
• Identification of differentially expressed genes, which
extends the concepts of
• linear models, to understand limma
• p-values, to understand the false discovery rate (FDR)
• Predictive modeling and classification, e.g., k-NN
• Learning by example through case studies…
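The step from a p-value to the false discovery rate can be sketched with simulated data. Everything below is hypothetical: the numbers of probes and samples, the size of the shift, and the use of an ordinary t-test per probe (limma uses moderated t-statistics instead). The adjustment itself uses base R's p.adjust with the Benjamini-Hochberg method.

```r
set.seed(1)  # arbitrary seed, for reproducibility only

n.genes <- 1000      # hypothetical number of probes
n.de <- 50           # hypothetical number of truly changed probes
n.per.group <- 10    # hypothetical samples per group

# Simulated expression matrix: rows are probes, columns are samples
X <- matrix(rnorm(n.genes * 2 * n.per.group), nrow = n.genes)
group <- rep(c("A", "B"), each = n.per.group)

# Shift the first n.de probes upward in group B
X[1:n.de, group == "B"] <- X[1:n.de, group == "B"] + 2

# One two-sample t-test per probe
p <- apply(X, 1, function(x) t.test(x[group == "A"], x[group == "B"])$p.value)

# Benjamini-Hochberg adjusted p-values
p.adj <- p.adjust(p, method = "BH")

sum(p < 0.05)      # unadjusted: about 5% of the unchanged probes pass by chance
sum(p.adj < 0.05)  # adjusted: probes called at an FDR of 5%
```

With 950 unchanged probes, a raw threshold of 0.05 is expected to let dozens of false positives through, which is why the adjusted p-value is the quantity to report when testing thousands of probes.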
Sample Assignment
1. Confirm that high EGFR expression is associated with sensitivity to erlotinib.
2. Identify the top 10 probes differentially expressed between sensitive and resistant samples.
3. Generate a heatmap of these 10 probes.
4. Calculate the accuracy of knn, using leave-one-out classification.
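Leave-one-out classification with k-NN can be sketched in base R as follows. This is an illustration under stated assumptions, not the assignment solution: the data are simulated, and loo.knn is a helper written here. The class package also provides knn.cv() for the same purpose.

```r
set.seed(7)  # arbitrary seed, for reproducibility only

# Simulated data (hypothetical): 20 samples (rows) by 10 probes (columns)
X <- rbind(matrix(rnorm(10 * 10, mean = 0), nrow = 10),
           matrix(rnorm(10 * 10, mean = 3), nrow = 10))
labels <- rep(c("sensitive", "resistant"), each = 10)

# Leave-one-out k-NN: classify each sample using all of the other samples
loo.knn <- function(X, labels, k = 3) {
  d <- as.matrix(dist(X))  # Euclidean distances between samples
  diag(d) <- Inf           # a sample cannot be its own neighbor
  sapply(seq_len(nrow(X)), function(i) {
    nn <- order(d[i, ])[1:k]             # indices of the k nearest samples
    names(which.max(table(labels[nn])))  # majority vote among the neighbors
  })
}

pred <- loo.knn(X, labels, k = 3)

# Accuracy: proportion of samples whose held-out prediction is correct
accuracy <- mean(pred == labels)
accuracy
```

Note that the function expects samples in rows. An expression matrix with probes in rows and samples in columns would need to be transposed with t() first.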
Heatmap of differentially expressed genes between
erlotinib-sensitive and -resistant patients
Sample R code
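library(limma)
# Fit a linear model to each probe
fit = lmFit(GSE31625.X, design)
# Contrast of interest: resistant vs. sensitive samples
contrast.matrix <- makeContrasts(resistant - sensitive, levels = design)
fit = contrasts.fit(fit, contrast.matrix)
# Moderated t-statistics via empirical Bayes
fit = eBayes(fit)
# Top probes (10 by default) with adjusted p-value < 0.05
tt = topTable(fit, sort.by = "p", p.value = 0.05)
probes = rownames(tt)
# Extract the expression values of these probes
m = match(probes, rownames(GSE31625.X))
X = GSE31625.X[m,]
# Standardize each probe (row) to mean 0 and standard deviation 1
X = t(scale(t(X)))
heatmap(X, ColSideColors = col.response, col = col.heat)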
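The row-scaling step t(scale(t(X))) is easy to misread, so here is a small sketch on a toy matrix. The probe and sample names and the values are made up for illustration.

```r
# Toy expression matrix: 3 probes (rows) by 4 samples (columns)
X <- matrix(c( 1,  2,  3,  4,
              10, 20, 30, 40,
               5,  5,  6,  8),
            nrow = 3, byrow = TRUE,
            dimnames = list(paste0("probe", 1:3), paste0("sample", 1:4)))

# scale() standardizes columns, so transpose, scale, and transpose back
# in order to standardize each probe across samples
Z <- t(scale(t(X)))

rowMeans(Z)      # approximately 0 for every probe
apply(Z, 1, sd)  # 1 for every probe

# probe1 and probe2 differ tenfold in magnitude but share the same pattern,
# so their standardized rows are identical
all.equal(unname(Z[1, ]), unname(Z[2, ]))
```

After scaling, heatmap colors reflect each probe's pattern across samples rather than its overall expression level, which is usually what a heatmap of differentially expressed genes is meant to show.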
Lessons learned…
• Students enjoyed learning R and using RStudio, and found R to be a
very intuitive language
• Students enjoyed the class survey and were able to learn important
statistical concepts.
• Students liked the real-world examples (cancer, Alzheimer's disease)
• Technical classroom issues – package installation
• Future ideas:
• Include a small research project where a student analyzes a GEO dataset
of their choice
• Include a lab on the analysis of sequencing data
• Can anything be cut out?
• Less in-class time on "R basics"
• Some statistical concepts (e.g., hypothesis testing for a proportion) can be
skipped
THANK YOU!
Garrett Dancik, PhD
Eastern Connecticut State University
E-mail: [email protected]
http://bioinformatics.easternct.edu