Introduction to GenomeGraphs and Bioconductor

Download Report

Transcript Introduction to GenomeGraphs and Bioconductor

Intro to
Biocoductor and GeneGraph
by Kyrylo Bessonov
([email protected])
9 Oct 2012
Bioconductor repository
• Is the repository with extensions and
libraries for R-language found at
http://www.bioconductor.org/
• Bioconductor libraries cover
•
•
•
•
•
Micorarray analysis
Genetic variants analysis (SNPs)
Sequence analysis (FASTA, RNA-seq)
Annotation (pathways, genes)
High-troughput assays (Mass Spec)
• All libraries are free to use and contain
documentation (i.e. vignettes)
• Vignettes are short and require some previous
knowledge of R and/or other defendant libraries
2
Configuring R for Bioconductor
• To install Bioconductor libraries the R
environment needs to be configured
source("http://bioconductor.org/biocLite.R")
• To download any Bioconductor library use
biocLite("package_name") function
biocLite("GenomeGraphs")
• Load the downloaded library functions via the
library("lib_name")function
library("GenomeGraphs")
• Read help file(s) via the ?? or browseVignettes()
??library_name
(e.g. ??GenomeGraphs)
browseVignettes(package = "GenomeGraphs")
3
Genome Data Display and Plotting
The GenomeGraphs library
Authors: Steen Durinck and James Bullard
(Bioconductor library)
Intro to GenomeGraphs library
• Allows to retrieve data from
• Ensembl comprehensive DB on genomes
• intron/exon locations
• sequence genetic variation data
• protein properties (pI, domains, motifs)
• GenomeGraphs Bioc library allows to display data on:
• Gene Expression (expression / location)
• Comparative genomic hybridization (CGH)
• Sequencing (e.g. variations/introns/exons)
5
GenomeGraphs: Plotting with gdPlot
• gdPlot main plotting function
gdPlot(gdObjects, minBase, maxBase ...)
• gdObjects = any objects created by GenomeGraphs
a) BaseTrack; b) Gene; c) GenericArray; d) RectangleOverlay
• minBase the lowest nt position to be plotted (optional)
• maxBase the largest nt to display (optional)
•
Objects are data structures that hold/organize a set of variables
1) Create an object to plot using makeBaseTrack() function
ObjBaseT = makeBaseTrack(1:100, rnorm(1:100),strand = "+")
2) Plot the newly created object
gdPlot(ObjBaseT)
• Display object structure / properties (e.g. of BaseTrack)
attributes(ObjBaseT)
• $ sign next to the name represents object variables
•
Change or view individual object variables
attr(ObjBaseT, “strand")
6
Manipulating GenomeGraphs class graphical properties
• Changing values of variables (classical way)
attr(Object, "variable/parameter") = value
attr(ObjBaseT, "strand") = "-“
• If an array, to access individual elements use [element number 1..n]
attr(ObjBaseT, "variable/parameter")[number] = new_value
attr(ObjBaseT, "base")[1] = 0
• Changing graphical parameters such as color
showDisplayOptions(ObjBaseT)
alpha = 1
lty = solid
color = orange
lwd = 1
size = 5
type = p
getPar(ObjBaseT, "color")
[1] "orange"
setPar(ObjBaseT, "color", "blue")
7
gdPlot composite/group plots
• gdPlot(…) can take any number of gdObjects to plot
• Let’s plot two BaseTrack objects a and b
a = makeBaseTrack(1:100, rlnorm(100), strand = "+")
b = makeBaseTrack(1:100, rnorm(100), strand = "-")
ab=list(a,b)
gdplot(ab)
8
Manipulating values of the grouped gdObjects
• ab is list of objects a and b
• To access individual object within the list use [ ]
ab[1] will display object a
ab[2] will display object b
• To modify the grouped object use double [[ ]]
• To change color of b to red
getPar(ab[[2]], "color")
setPar(ab[[2]], "color", "red")
-OR- (re-create object)
b = makeBaseTrack(1:100,
rnorm(100), strand = "-",
dp=DisplayPars(color="red"))
ab=list(a,b)
gdPlot(ab)
9
Plotting with labels and legend
• gdPlot does not provide functions to label axis
• Trick = “use labeled / tagged” objects
"label" = GenomeGraph object
"+ strand"= makeBaseTrack(1:100, rlnorm(100), strand = "+")
• To display legend use makeLegend("text","color")
ab=list("+"=a, "-"=b, makeGenomeAxis(),
makeLegend(c("+", "-"),c("orange", "red")) )
10
Retrieving and Displaying Data
from
Public Database
Combining capabilities of
biomaRt and GenomeGraph libraries
Welcome to biomaRt
• biomaRt library allows to retrieve data from public DBs
ensembl
snp
unimart
bacteria_mart_14
ENSEMBL GENES 68 (SANGER UK)
ENSEMBL VARIATION 68 (SANGER UK)
UNIPROT (EBI UK)
ENSEMBL BACTERIA 14 (EBI UK)
***Use listMarts() to see all available databases***
• Let’s retrieve gene data of the Bacillus subtilis strain
• useMart(database,dataset)allows to connect to
specified database and dataset within this database
db=useMart("bacteria_mart_14")
listDatasets(db)
Dataset
…
bac_6_gene
Description
…
Bacillus subtilis genes (EB 2 b_subtilis)
version
…
EB 2 b_subtilis
db=useMart("bacteria_mart_14", "bac_6_gene")
12
Exploring biomaRt object
• listAttributes()shows all prop. of the biomaRt obj.
• The db object has total of 4175 genes
• Use getBM(attribute, filter, value, biomaRt_obj)to
extract values belonging to specified attributes
• attribute: general term such as gene
name/chromosome # / strand (+ or - )
• filter: parameter applied on attribute such as
genomic region to consider (i.e. start and end in nt)
• value: actual values of the applied filter(s)
getBM(c("external_gene_id", "description","start_position",
"end_position", "strand"), filters = c("start", "end"),
values = list(1,10000), db)
ID
metS
ftsH
hslO
Dgk
description
Methionyl-tRNA synthetase
Cell division protease ftsH homolog
33 kDa chaperonin
Deoxyguanosine kinase
start(nt)
45633
76984
79880
23146
end(nt)
47627
78897
80755
23769
strand
1
1
1
-1
13
Plotting the selected “Genome Region”
• Create an object with makeGeneRegion()function
makeGeneRegion(start,end,chromosome name,
strand, biomaRt object, plotting options)
• Find notation used for the chromosome naming
getBM("chromosome_name","","",db)
chromosome_name
1
Chromosome
gRegion = makeGeneRegion(1, 10000, chr = "Chromosome",
strand = "+", biomart = db, dp =
DisplayPars(plotId = TRUE, idRotation = 90,
cex = 0.8, idColor = "black"))
gdPlot(list(gRegion, makeGenomeAxis(),
makeTitle("Position(nt)",cex=3,"black",0.1)))
14
Bacillus subtilis genome region (intron / exon)
ensembl_gene_id
15
Mapping Expression data
RNAseq
GenomeArray()
Intro into RNA-seq data
• HT sequencing technologies allow to sequence mRNA in
a series of short contigs of 50-200 bp
• In addition to gene expression analysis it possible to
• Map location of introns (UTRs) / exons
• Principal: one searches for a rapid changes in
abundance of the RNA-Seq signal (contigs)
• Integration of sequence + expression information
• Ugrappa Nagalakshmi et. al. 2008 had used this
strategy to accurately map yeast genome1
• Task: Display part of the seqDataEx dataset having both
• abundance of mRNA (cDNA) transcripts and
• annotated yeast genome
1Ugrappa Nagalakshmi et. al. The transcriptional landscape of the yeast genome defined by RNA sequencing. Science, 2008
17
Plotting RNA-seq data with GenomeArray()
• Need to get mRNA contig abundance data
data("seqDataEx")
rnaSeqAb=seqDataEx$david
• Create GeneticArray object with
makeGenericArray(intensity, probeStart, probeEnd, trackOverlay, dp = NULL)
• intensity
• probStart
either microaray or RNAseq transcript abundance signal
start position of the probe (location in nt)
• Plot contig abundance w.r.t. to genomic location
gdPlot(list("abun"=makeGenericArray(rnaSeqAb[, "expr", drop = FALSE],
rnaSeqAb [, "location"]) , makeGenomeAxis() ) )
18
Adding genomic annotation to the prev plot
• Need to get annotated yeast genome data
annotGen = useMart("ensembl", "scerevisiae_gene_ensembl")
• Create mRNA contig abundance
mRNAabun = makeGenericArray(rnaSeqAb[, "expr",
drop = FALSE], rnaSeqAb [, "location"])
• Create annotated seq. covering the mRNA contigs location
annotSeq=
makeGeneRegion(start = min(rnaSeqAb[, "location"]),
end = max(rnaSeqAb[, "location"]), chr = "IV",
strand = "+", biomart = annotGen,
dp = DisplayPars(plotId = TRUE, idRotation = 0,
cex = 0.85, idColor = "black", size=0.5))
• Combine objects and plot them
gdPlot( list("abund" = mRNAabun,makeGenomeAxis(), "+" = annotSeq),
1299000,1312000)
19
Resulting plot:
Transcript abundance w.r.t. to location
20
Overlays of Basic Shapes and Custom Text
• To overlay rectangle use makeRectangleOverlay()
makeRectangleOverlay(start, end, region = NULL,
coords = c("genomic", "absolute"), dp)
rectOver = makeRectangleOverlay( 1301500, 1302200,
region=c(1,2), "genomic", DisplayPars(alpha = 0.5))
• To overlay text use makeTextOverlay()
tOver =
makeTextOverlay("Ribosomal Large subunit",
1302000, 0.95, region = c(1,1),
dp = DisplayPars(color = "red"))
• Combine all overlay objects into one vector with c(v1,v2)
gdPlot( list("abund" = mRNAabun,makeGenomeAxis(),
"+" = annotSeq), overlays=c(rectOver, tOver) )
21
Overlay of rectangle and text
22
Alternative splicing of transcript
makeTranscript(id, type, biomart, dp = NULL)
Displaying alternative splicing of a gene
• mRNA coming from the same ORF could be spliced in
many ways
• E.g. case of VDR genes of IgG
• Given biomaRt object, the makeTranscript()will
extract splicing information for given id (i.e. gene)
• Download human genome databse
hGenome <- useMart("ensembl", "hsapiens_gene_ensembl")
• Select Ensembl ID to look at
head(getBM(c("ensembl_gene_id", "description"),"","", hGenome))
• Get splicing data from biomaRt object (hGenome)
spliceObj = makeTranscript("ENSG00000168309",
"ensembl_gene_id" ,hGenome)
• Plot the object with gdPlot
gdPlot(list(makeTitle("Transcript ID:
ENSG00000168309"),splicingObj, makeGenomeAxis()) )
24
The final result
25
Conclusion
• GeneGraphs provides a wide range to plot genomic data
• Can use external databases through biomaRt
• Main useful features
• identifies exons/introns
• allows to cross-reference expression / genome data
• flexible albeit complex plotting capabilities
• allows to overlay graphical objects and text
• ability to create custom legends
• annotation capabilities provided by powerful biomaRt
26
Thank you for your patience!
&
Happy Bioconductor/R Exploration!