Bioconductor, Microarrays and
Genomics
Ben Bolstad
Biostatistics
University of California, Berkeley
www.stat.berkeley.edu/~bolstad
Lecture 1.6
Outline
• Bioconductor
• Introduction to Genetics
• Introduction to microarrays
What is R?
• Widely used freely available implementation
of the S language
• Commercial implementation is known as S-PLUS (www.insightful.com)
• There are small differences between R and
S-PLUS, but for the most part any book about
one applies reasonably well to the other
• Available for Unix, Windows, Macintosh OS
What is Bioconductor?
• Bioconductor (BioC) is an open source and
open development software project to provide
tools for the analysis and comprehension of
genomic data
• Primarily based on R (a free
implementation of the S language)
• Developers from around the world
• Predominantly licensed under the
GPL/LGPL/BSD licenses
BioC Goals
• Provide access to a wide range of powerful statistical
and graphical methods for the analysis of genomic
data;
• Facilitate the integration of biological metadata in the
analysis of experimental data
• Allow the rapid development of extensible, scalable,
and interoperable software;
• Promote high-quality documentation and reproducible
research.
• Provide training in computational and statistical
methods for the analysis of genomic data.
BioC features
• Available for a number of platforms
– Linux/Unix, Windows
• Predominantly command line interface
• Often object oriented: S4 objects
• Most of the current tools are designed for the
analysis of microarray data
• R is used by many statisticians and has a
large repository of packages that might also
be useful (cran.r-project.org)
BioC: Why open source?
• Full access to algorithms and their implementation
• The ability to fix bugs
• To encourage good scientific computing and
statistical practice by providing appropriate tools and
instruction
• To provide a workbench of tools that allow
researchers to explore and expand the methods used
to analyze biological data
• To ensure that the international scientific community
is the owner of the software tools needed to carry out
research
BioC documentation
• Each package contains at least one vignette
– a document that provides a textual, task-oriented description
of the package's functionality and that can be used
interactively. Many are simple "HowTo"s, that is, they are
designed to demonstrate how a particular task can be
accomplished with that package's software. Others provide a
more thorough overview of the package, or might even
discuss general issues related to the package.
• The vignettes are generated using the Sweave
function from the R package tools. They are
documents that intermix text, code, and output
(textual and graphical) and can be regenerated
automatically whenever the data or analyses change.
BioC: Packages
• There are currently almost 90 packages in
the 1.4 release (May 2004). The first release
in May 2002 had only 15 packages
• Some are very simple while others provide
extensive capabilities for the analysis of a
particular type of data
• There is some level of dependency among
the packages
• We will explore a subset of the packages
Biobase
• This is the base package. Many BioC
packages depend on it.
• Provides the exprSet object which is an
object for storing gene expression data for a
particular experiment linked with related
phenotypic data and other descriptive
information (MIAME)
• Many functions in other packages take the
exprSet as input
Biobase
• Accessor functions that can be applied to
exprSets
– exprs() – access the expression values
– se.exprs() – access standard error estimates
– pData() – access phenotype data
– description() – obtain the MIAME information
– geneNames() – access the names of the genes
– sampleNames() – names of the samples
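The layout that these accessors expose can be sketched in base R. This is only a stand-in: the exprSet class itself comes from Biobase and is not reproduced here, and the names `eset`, `expr` and `pheno` are hypothetical. The point is that expression values form a genes-by-samples matrix whose columns line up with the rows of the phenotype table.

```r
# Hypothetical stand-in for an exprSet: a plain list, NOT the Biobase class.
set.seed(1)
expr <- matrix(rnorm(12, mean = 8), nrow = 4,
               dimnames = list(paste0("gene", 1:4), paste0("sample", 1:3)))
pheno <- data.frame(treatment = c("control", "treated", "treated"),
                    row.names = colnames(expr))
eset <- list(exprs = expr, pData = pheno)

# Rough analogues of the accessors, written for the plain list.
rownames(eset$exprs)   # plays the role of geneNames()
colnames(eset$exprs)   # plays the role of sampleNames()
eset$pData             # plays the role of pData()

# Matrix columns and phenotype rows must describe the same samples, in order.
stopifnot(identical(colnames(eset$exprs), rownames(eset$pData)))

# Phenotype data can then be used to select columns of the expression matrix.
treated <- eset$exprs[, eset$pData$treatment == "treated", drop = FALSE]
dim(treated)
```

Keeping both pieces in one object is what lets downstream functions take a single argument and still know which sample belongs to which group.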
reposTools
• Package management
• Handles installation and updating of BioC
packages
affy
• The core package for low-level analysis of
Affymetrix data
• Provides
– Mechanisms for reading and storing cel file data
(raw probe intensities)
– Tools for exploring probe-intensity data
– Methods for pre-processing – background
correction, normalization
– Computing expression measures
affy
boxplot()
hist()
affy
matplot(log2(pm(Dilution, "1004_at")))
matplot(t(log2(pm(Dilution, "AFFX-BioC-5_at"))))
affy
normalize()
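As an illustration of what a normalization step can do, here is a base-R sketch of quantile normalization, one widely used approach. It is not the affy implementation; the function name `quantile_normalize` and the toy matrix are hypothetical, and the sketch assumes a numeric matrix with no missing values.

```r
# Quantile normalization sketch: force every array (column) to share the
# same empirical distribution. Rows are probes, columns are arrays.
quantile_normalize <- function(x) {
  ranks  <- apply(x, 2, rank, ties.method = "first")  # rank within each array
  sorted <- apply(x, 2, sort)                         # sort each array
  target <- rowMeans(sorted)      # mean of the k-th smallest values across arrays
  out <- apply(ranks, 2, function(r) target[r])       # map ranks back to targets
  dimnames(out) <- dimnames(x)
  out
}

x <- cbind(a = c(5, 2, 3, 4), b = c(4, 1, 4.5, 2), c = c(3, 4, 6, 8))
xn <- quantile_normalize(x)
apply(xn, 2, sort)   # every column now holds the same set of values
```

Within each array the ordering of the probes is unchanged; only the values attached to each rank are replaced.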
affyPLM
• Fitting probe-level models to Affymetrix data
provides quality control information
• Quality assessment focuses on
– Residuals
– Weights from a robust fitting procedure
– Relative log expression
– Standard errors
affyPLM - Pseudo-chip images
image(): pseudo-chip images of weights, residuals, positive residuals, and negative residuals
affyPLM - RLE Plots
Relative Log Expression (RLE)
Mbox()
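The idea behind an RLE plot can be sketched in base R on simulated data. This does not call affyPLM; the matrix `log_expr` and the shifted fourth array are hypothetical, chosen only to show what a problem array looks like.

```r
set.seed(42)
# Hypothetical log2 expression estimates: 200 probesets x 4 arrays.
log_expr <- matrix(rnorm(800, mean = 8, sd = 1), nrow = 200,
                   dimnames = list(NULL, paste0("array", 1:4)))
log_expr[, 4] <- log_expr[, 4] + 0.5   # simulate one array shifted upward

# RLE: subtract each probeset's median across arrays from its values.
rle_values <- sweep(log_expr, 1, apply(log_expr, 1, median))

# One box per array; a well-behaved array is centred near 0 with a small spread.
boxplot(rle_values, ylab = "Relative log expression")
abline(h = 0, lty = 2)
```

The reasoning is that most genes are not expected to change between arrays, so a box that sits away from zero or is much wider than the others flags an array worth inspecting.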
affyPLM - NUSE Plots
Normalized Unscaled Standard Errors (NUSE)
boxplot()
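A NUSE plot follows the same pattern as RLE, but on standard errors. The base-R sketch below uses simulated values rather than a fitted probe-level model; the matrix `se` and the inflated second array are hypothetical.

```r
set.seed(7)
# Hypothetical standard errors of expression estimates: 200 probesets x 4 arrays.
se <- matrix(runif(800, min = 0.05, max = 0.2), nrow = 200,
             dimnames = list(NULL, paste0("array", 1:4)))
se[, 2] <- se[, 2] * 1.5   # simulate one lower-quality array

# NUSE: divide each probeset's standard errors by their median across arrays,
# so that the typical value for every probeset is 1.
nuse <- sweep(se, 1, apply(se, 1, median), "/")

boxplot(nuse, ylab = "NUSE")
abline(h = 1, lty = 2)
```

An array whose box sits clearly above 1 has systematically larger standard errors than the others, which suggests lower quality.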
gcrma
• Provides another method of computing
expression measures for affy arrays
• Uses sequence information in its background
adjustment step
affycomp
• A framework and set of routines for
comparing expression summary values for
Affymetrix arrays
• A competition and comparison website
affycomp.biostat.jhsph.edu/
marray
• Basic package for low-level analysis of cDNA
arrays
• Pre-processing for two color arrays
– Diagnostic images to look for artifacts
– Normalization
• Reads the output of various image analysis
programs
marray
image()
marray
plot()
boxplot()
limma
• Linear Models for Microarray Analysis
– Allows the analysis of designed microarray
experiments using linear models
– Methods for working with both two channel
(cDNA) and single channel (affy) arrays
– Also provides some pre-processing functionality
for cDNA microarrays
– Moderated t-statistic
multtest
• Standard hypothesis testing says that a level
α test will reject the null hypothesis (no
change) even when it is true a fraction α of the time,
i.e. false positives
• Problem: microarray experiments involve
tests on thousands of genes simultaneously
so will get many false positives just due to
chance
• Deals with multiple testing problem by
adjusting P-values for multiple comparisons
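The scale of the problem, and the effect of adjusting p-values, can be sketched with base R's p.adjust() from the stats package. This illustrates the idea only and is not the multtest interface; the simulated p-values and variable names are hypothetical.

```r
set.seed(123)
m <- 10000       # number of genes tested
alpha <- 0.05    # per-test significance level

# Simulate the case where no gene changes: null p-values are uniform on [0, 1].
p <- runif(m)

# Unadjusted testing: about m * alpha = 500 false positives expected by chance.
sum(p < alpha)

# Adjusted p-values for three common procedures.
p_bonf <- p.adjust(p, method = "bonferroni")  # controls family-wise error rate
p_holm <- p.adjust(p, method = "holm")        # step-down, never larger than Bonferroni
p_bh   <- p.adjust(p, method = "BH")          # controls false discovery rate

# Number of genes still declared significant after each adjustment.
c(bonferroni = sum(p_bonf < alpha),
  holm       = sum(p_holm < alpha),
  BH         = sum(p_bh < alpha))
```

Adjusted p-values are compared with the same threshold α as before; the adjustment, not the threshold, carries the correction.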
vsn
• Provides variance stabilization normalization
for microarrays, can be applied to both affy
and cDNA arrays
annotate
• Handles annotation
– Converts between UniGene, LocusLink, Affymetrix
probeset IDs and other identifiers
• Methods for accessing online information
from PubMed, GenBank
Rgraphviz
• Allows you to create graphs with nodes and
edges
tkWidgets/widgetTools
• Tools for building GUI widgets
• Pre-built widgets
affylmGUI/limmaGUI
hexbin
• Hexagonal binning is a form of bivariate
histogram useful for datasets with large
number of observations
Metadata
• CDF packages
• Probe sequence packages
• Annotation packages
Other useful R packages (not in
BioC)
• cluster: for clustering
• class: for classification
• rpart: trees
• mclust: model-based clustering
• mgcv: smoothers
BioC: More Information
• www.bioconductor.org
– The main BioC website. Documentation, software
• www.stat.math.ethz.ch/mailman/listinfo/bioconductor
– The BioC mailing list
• www.r-project.org
– The R-project website
Gene Expression
The Human Genome
• The cell is the fundamental working unit of every living
organism.
• Humans: trillions of cells (metazoa);
other organisms, like yeast: one cell (unicellular).
• Cells are of many different types (e.g. blood, skin, nerve
cells), but all can be traced back to a single cell, the
fertilized egg.
Genes
• The human genome is distributed along 23 pairs of
chromosomes.
– 22 autosomal pairs;
– the sex chromosome pair, XX for females and XY for
males.
• In each pair, one chromosome is paternally inherited, the
other maternally inherited.
• Chromosomes are made of compressed and entwined
DNA.
• A (protein-coding) gene is a segment of chromosomal
DNA that directs the synthesis of a protein.
Chromosomes and DNA
DNA
• A deoxyribonucleic acid or DNA molecule is a double-stranded
polymer composed of four basic molecular
units called nucleotides.
• Each nucleotide comprises a phosphate group, a
deoxyribose sugar, and one of four nitrogen bases:
adenine (A), guanine (G), cytosine (C), and thymine (T).
• The two chains are held together by hydrogen bonds
between nitrogen bases.
• Base-pairing occurs according to the following rule: G
pairs with C, and A pairs with T.
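The pairing rule is simple enough to express in a few lines of base R. The helper names `complement` and `reverse_complement` are hypothetical; the reversal reflects the fact that the two strands of a DNA molecule run in opposite directions.

```r
# Complement each base according to the pairing rule: A<->T, C<->G.
complement <- function(dna) chartr("ACGT", "TGCA", dna)

# The partner strand, read in its own direction, is the reverse complement.
reverse_complement <- function(dna) {
  bases <- strsplit(complement(dna), "")[[1]]
  paste(rev(bases), collapse = "")
}

complement("GATTACA")          # "CTAATGT"
reverse_complement("GATTACA")  # "TGTAATC"
```

The same rule underlies microarray hybridization: a probe binds the target whose sequence is its reverse complement.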
Genetic and physical maps
• Physical distance: number of base pairs (bp).
• Genetic distance: expected number of crossovers
between two loci, per chromatid, per meiosis.
• Measured in Morgans (M) or centiMorgans (cM).
• On average, 1 cM ≈ 1 million bp (1 Mb) in humans
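As a rough worked example of the rule of thumb above, the base-R sketch below converts physical distances to approximate genetic distances. The helper `bp_to_cM` and the constant `bp_per_cM` are hypothetical, and the conversion is only an average: the true ratio varies along the genome.

```r
# Rule of thumb: about 1 million base pairs per centiMorgan in humans.
bp_per_cM <- 1e6

# Convert a physical distance in base pairs to an approximate genetic distance.
bp_to_cM <- function(bp) bp / bp_per_cM

bp_to_cM(2.5e6)        # 2.5 cM for two loci 2.5 Mb apart
bp_to_cM(3e9) / 100    # 30 Morgans for a 3 billion bp genome (100 cM = 1 M)
```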
Exons and introns
• Protein-coding sequence makes up only about 2% of the human
genome; the rest consists of non-coding
regions, whose functions may include
providing chromosomal structural integrity
and regulating when, where, and in what
quantity proteins are made (regulatory
regions).
• Within a gene, exons are the segments kept in the
mature mRNA (largely translated into protein); introns are
the intervening non-coding segments removed by splicing.
Differential expression
• Each cell contains a complete copy of the organism's
genome.
• Cells are of many different types and states
E.g. blood, nerve, and skin cells, dividing cells,
cancerous cells, etc.
• What makes the cells different?
• Differential gene expression, i.e., when, where, and
in what quantity each gene is expressed.
• On average, 40% of our genes are expressed at any
given time.
Functional genomics
• The various genome projects have yielded the
complete DNA sequences of many organisms. E.g.
human, mouse, yeast, fruitfly, etc.
• Human: 3 billion base-pairs, 30-40 thousand genes.
• Challenge: go from sequence to function, i.e., define
the role of each gene and understand how the
genome functions as a whole.
Central dogma
• The expression of the genetic information stored in the
DNA molecule occurs in two stages:
(i) transcription, during which DNA is transcribed into
mRNA;
(ii) translation, during which mRNA is translated to
produce a protein.
DNA → mRNA → protein
• Other important aspects of regulation: methylation,
alternative splicing, etc.
• The correspondence between DNA's four-letter alphabet
and a protein's twenty-letter alphabet is specified by the
genetic code, which relates nucleotide triplets to amino
acids.
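The two stages can be sketched in base R. The helpers `transcribe` and `translate` are hypothetical, the input is assumed to be the coding (sense) strand so that the mRNA is the same sequence with U in place of T, and `codon_table` holds only a five-entry excerpt of the standard genetic code.

```r
# Transcription: coding-strand DNA -> mRNA (T becomes U).
transcribe <- function(dna) chartr("T", "U", dna)

# Small excerpt of the standard genetic code (codon -> amino acid).
codon_table <- c(AUG = "Met", UUU = "Phe", GGC = "Gly", AAA = "Lys", UAA = "Stop")

# Translation: read the mRNA three bases at a time until a stop codon.
translate <- function(mrna) {
  starts <- seq(1, nchar(mrna) - 2, by = 3)
  codons <- substring(mrna, starts, starts + 2)
  amino_acids <- unname(codon_table[codons])   # NA for codons not in the excerpt
  stop_at <- match("Stop", amino_acids)
  if (!is.na(stop_at)) amino_acids <- amino_acids[seq_len(stop_at - 1)]
  amino_acids
}

mrna <- transcribe("ATGTTTGGCAAATAA")
mrna              # "AUGUUUGGCAAAUAA"
translate(mrna)   # "Met" "Phe" "Gly" "Lys"
```

A full table would have 64 codons mapping to twenty amino acids plus stop signals, which is why several codons can encode the same amino acid.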
RNA
• A ribonucleic acid or RNA molecule is a nucleic acid
similar to DNA, but
– single-stranded;
– ribose sugar rather than deoxyribose sugar;
– uracil (U) replaces thymine (T) as one of the bases.
• RNA plays an important role in protein synthesis and
other chemical activities of the cell.
• Several classes of RNA molecules, including messenger
RNA (mRNA), transfer RNA (tRNA), ribosomal RNA
(rRNA), and other small RNAs.
Idea: measure the amount of mRNA to see which genes are
being expressed in (used by) the cell.
Measuring protein might be better, but is currently harder.
Microarrays
Uses and types of microarrays
Microarrays are currently used to do many different things: to detect and measure
gene expression at the mRNA or protein level; to find mutations and to genotype; to
(re)sequence DNA; to locate chromosomal changes (CGH = comparative genomic
hybridization), and more. There are many different ways to do these things without
microarrays, but microarrays promise a high-throughput approach to the tasks.
There are many different types of microarrays (called platforms) in use, but all have
a high density and number of biomolecules fixed onto a well-defined surface. Low
density means 100s (e.g. protein antibodies), medium density would be 1000s to 10s
of 1000s (e.g. cDNA arrays), and high-density is 100s to 1000s of 1000s, i.e. millions
(e.g. short oligonucleotide arrays).
In general there are five basic aspects of microarrays: a) coupling biomolecules to
a platform; b) preparing samples for detection; c) hybridization; d) scanning; and e)
analyzing the data.
Obviously we’re interested in e), but without some knowledge of a) to d), we’d be
dangerous.
Nucleic acid hybridization: here DNA-RNA
The cDNA and short (25 bp) oligo technologies in brief.
Long (60-75 bp) oligo arrays are more like the cDNA ones
cDNA arrays in summary
[Figure: cDNA clones (probes) → PCR product amplification and purification → printing (0.1 nl/spot) → microarray; mRNA (target) → hybridise target to microarray → scanning (laser 1, laser 2; excitation, emission) → overlay images and normalize → analysis]
cDNA microarrays on glass slides
A little more detail
• An overview of the Brown/DeRisi/Iyer technology,
based on
– the 2000 CSH Microarray Course notes, Nature Genetics
Supp, Jan 1999
– two books edited by M. Schena: DNA Microarrays: A
Practical Approach, OUP 1999, and Microarray Biochip
Technology, Eaton Publishing, 2000
– DNA Arrays for Analysis of Gene Expression by M. Eisen and
P. Brown
cDNA arrays: history
• cDNA microarrays have evolved from Southern blots, with clone
libraries gridded out on nylon membrane filters being an important and
still widely used intermediate. Things took off with the introduction of
non-porous solid supports, such as glass - these permitted
miniaturization - and fluorescence based detection.
• Currently, up to about 30,000 cDNAs are spotted onto a microscope
slide.
Affymetrix GeneChips
[Figure: GeneChip probe array (1.28 cm) and image of a hybridized probe array. Each 24 µm probe cell holds millions of copies of a specific oligonucleotide probe, synthesized in situ (“grown”); single-stranded, labeled RNA target hybridizes to the probe. More than 500,000 different complementary probes.]
Affymetrix arrays
• Commercial product
• Currently mass produced arrays targeting 17
different organisms
– More than 40 different array types/sets
• Custom arrays also provided
• Recently also selling arrays for
genotyping/detecting SNPs
A word of acknowledgement
BioC
Robert Gentleman
Vince Carey
Rafael Irizarry
Sandrine Dudoit
Yee Hwa Yang
Gordon Smyth
Laurent Gautier
Jeff Gentry
And many more
www.bioconductor.org/people.html
Some Slides
Terry Speed
Francois Colin
Jean Yee Hwa Yang