Genome browsers for power users

Download Report

Transcript Genome browsers for power users

Genome Browsers
for power users
• The ENCODE project
• Up- and downloading data from the
browser
• Analyzing genomic data in R
• (depending on time) Galaxy – a power
user tool to interact with the browser
Some (almost) cutting edge
science:
The ENCODE project
• Encyclopedia Of DNA Elements
• Aims for a targeted and coordinated elucidation of the
(whole) human genome, using multiple systems
• Currently ended the pilot phase: analyze 1% (30 Mbp)of the
genome deeply
• 44 regions
• 14 regions chosen by function: for instance the HOXD gene
cluster – 0.5 to 2 Mbp
• 30 regions chosen “randomly” - 500 kb
• Should be viewed as a pilot project for the rest of the
genome – so both technology and biology is driving
• A large number of labs involved in both data production and
analysis
Some impressive numbers
• 400 million data points, excluding sequencing of other
genomes (adds another 250 million)!
• Tiling arrays from 11 different cell sources
• 96 ChIP-chip experiments
• Tag sequencing data to identify promoters (covered later in
the course)
• In-depth cDNA annotation (GENCODE)
• Sequencing of orthologous regions in a wide array of species
• …and a lot more
…and these are not all data tracks!
Why introduce these things
now in the course?
• Get you used to look at tons of data at
once
• This is a fantastic data resource, which
is under-used
• make you realize that to analyze such
data, you will have to understand the
underlying method/biology
• It will be 100* bigger in a year…
Genome browsers for power
users
In this course
1: How to use the web interface;
understanding the data types
2: How to download and upload data to
the browser; interaction with R
3: How to make complex analyses
between data types; Galaxy and R
Power users
• As power users, we want to
– Compare our OWN data with the
annotations graphically
– Download data and test hypothesis: the
web browser is for casual viewing and
dreaming up hypotheses!
– Overlap different data types
For instance…
• What genes are close to my new chipdata sites?
• What is the conservation of these sites?
• Do they overlap with other chip sites
• Etc, etc
• We will learn to do all this this week!
Uploading data
• We can upload custom tracks in the
browser - for instance, results of new
experiments
• To do this, we need to know what
formats that the browser likes
•
•
•
•
•
BED
– Block annotation - something starts and ends at certain positions
(method of choice - more in next slide)
GFF
– A standard annotation format in bioinformatics. Very similar to BED
GTF
– Variant of GFF (I have never used this)
WIG
– Format for continuous data - one “score” per nucleotide, or similar.
Method of choice when BED is not a good option. Will touch on
this.
PSL
– An alignment format - the output of the BLAT program. Will not use
this in the course.
BED
A text file that in each line shows one “entity” that is
mapped to the genome - a gene, a site for a factor,
etc
Can have up to 12 tab-separated columns, but only
three are required:
1. chrom - The name of the chromosome (e.g. chr3, chrY).
2. chromStart - The starting position of the feature in the chromosome or scaffold.
The first base in a chromosome is numbered 0.
3. chromEnd - The ending position of the feature in the chromosome or scaffold.
The chromEnd base is not included in the display of the feature. For example,
the first 100 bases of a chromosome are defined as chromStart=0,
chromEnd=100, and span the bases numbered 0-99.
Example:
chr7
127471196
chr22
20100011
127472363
20100200
The name of the track
• If we have many tracks, it really helps to have
a name of each new track
• We can start a bed file by a “track line” which
gives name, description, plus some nice
options (more later)
• Can look like this
track name=”Dataset 1" description=”Cell line X treatment”
chr7
127471196
127472363
chr22
20100011
20100200
…so a small bed file could
look like this
track name=clones description=”There are many of
me and we have a plan"
chr22 10000000 10004000
chr22 10002000 10006000
chr22 10005000 10009000
chr22 10006000 10010000
chr22 10011000 10015000
chr22 10012000 10017000
Live Demo:I will just paste this into the browser
“custom track” page
Reverse engineering
challenge
Locate the RGS9 gene in the
browser, assembly hg18
Can you make a bed file which
exactly replicates the first exon
(see example in next slide)?
You have to find what the
coordinates should be by zooming
It should have the track name
“my_refseq”
(5 minutes)
Example again:
track name=clones
description=”my Clones"
chr22 10000000 10004000
chr22 10002000 10006000
chr22 10005000 10009000
chr22 10006000 10010000
track name=“my_refseq” description=”my_refseq"
chr17 60564010
60564177
These coordinate are only meaningful with the May
2005 assembly
Challenge 2: real data
At the web page, on today’s date, there is a link to a directory
containing a bed file of chip-chip data of estrogen receptor
alpha sites: ERA_mm8.bed (this is a text file!)
These are from the mm8 assembly
1.
Upload these to the browser with the correct assembly
2.
Look at the following genes to see if they have one or more
sites inside them or further upstream
1.
2.
3.
Nrip1
Esr1 (what is this gene?)
Hoxa1
track name="ERA" description="Estrogen receptor alpha" color=255,0,0 visibility=0 browser position chr6:28,912,411-28,925,620
• The 9 additional optional BED fields are:
4. name - Defines the name of the BED line.
5. score - A score between 0 and 1000. If the track line useScore attribute is set to 1 for
this annotation data set, the score value will determine the level of gray in which this
feature is displayed (higher numbers = darker gray).
6. strand - Defines the strand - either '+' or '-'.
7. thickStart - The starting position at which the feature is drawn thickly (for example, the start
codon in gene displays).
8. thickEnd - The ending position at which the feature is drawn thickly (for example, the stop
codon in gene displays).
9. itemRgb - An RGB value of the form R,G,B (e.g. 255,0,0). If the track line itemRgb attribute
is set to "On", this RBG value will determine the display color of the data contained in this
BED line. NOTE: It is recommended that a simple color scheme (eight colors or less) be
used with this attribute to avoid overwhelming the color resources of the Genome Browser
and your Internet browser.
10. blockCount - The number of blocks (exons) in the BED line.
11. blockSizes - A comma-separated list of the block sizes. The number of items in this list
should correspond to blockCount.
12. blockStarts - A comma-separated list of block starts. All of the blockStart positions should
be calculated relative to chromStart. The number of items in this list should correspond to
blockCount.
Genome graphs
Gene density in the human genome - took < 2 minutes to do
Live demo:
• Where are the ENCODE regions on the
genome?
– As you remember, the ENCODE regions
are 1% of the genome – but what 1%?
Challenge
• In the data directory, there is a set of
estrogen alpha chip sites at the mm8
assembly (we used it some slides back).
• Upload this to the mm8 browser (unless you
already have) just as before (NOT via
genome graph)
• Then, import this set into the genome graph
tool (under custom tracks)
• What is your impressions? Where are they?
era_mm5.bed
Downloading data
• We often want to download whole data tracks
to do analysis on them, in R
• The table browser is a great tool to do this
with
• It allows for selecting tracks, or part of tracks,
and also see the underlying data structure
• Accessed by clicking on “Tables” on top of
the browser
(with live demo: get all RefSeq genes from
chromosome 1)
Step 1. Pick a genome assembly (and species)
Step 2. Pick an annotation track
The group list shows all the annotation track groups available in the selected
genome assembly. The names correspond to the groupings displayed at the
bottom of the Genome Browser annotation tracks page. When a group is selected
from the list, the track list automatically updates to show all the annotation tracks
available within that group.
* To examine all the tracks available within a certain group (e.g. all gene
prediction tracks), select the group name from the group list, then browse the
entries in the track list.
Step 3. Pick a table
The table list shows all tables (both positional and non-positional)
associated with the currently-selected track. By default, it displays the
primary table for the track, i.e. the table containing the data shown in
the Genome Browser annotation track. Other tables in the list are
linked to the primary table by a common field and may provide
supporting data used in constructing the annotation.
Step 4. Pick a genomic region (positional tables only)
By default, the Table Browser region is set to genome, which will
display all the data records in the selected table.
What are we actually getting
back?
#bin
585
587
587
587
589
589
589
591
name
NM_001005484
NM_001005224
NM_001005277
NM_001005221
NM_001005221
NM_001005224
NM_001005277
NM_152486
chr1
chr1
chr1
chr1
chr1
chr1
chr1
chr1
chrom
+
+
+
+
+
strand
58953
357521
357521
357521
610958
610960
610960
850983
txStart
59871
358458
358458
358460
611897
611897
611897
869825
58953
357521
357521
357521
610958
610960
610960
851184
Slightly confusing! Lets go back to the table
browser to get explanations
txEnd
59871
358458
358458
358460
611897
611897
611897
869396
1
1
1
1
1
1
1
14
#bin
585
587
587
587
589
589
589
591
name
NM_001005484
NM_001005224
NM_001005277
NM_001005221
NM_001005221
NM_001005224
NM_001005277
NM_152486
chr1
chr1
chr1
chr1
chr1
chr1
chr1
chr1
chrom
+
+
+
+
+
strand
58953
357521
357521
357521
610958
610960
610960
850983
txStart
59871
358458
358458
358460
611897
611897
611897
869825
txEnd
58953
357521
357521
357521
610958
610960
610960
851184
cdsStart
59871
358458
358458
358460
611897
611897
611897
869396
cdsEnd
1
1
1
1
1
1
1
14
Analyzing downloaded data
These are normal text files and can be read by R!
Two handy tricks:
1. Give an output file names directly in the web
form - some browsers try to save the html page
otherwise (HTML has a lot of junk ><> signs)
2. After you have the file, remove the # in the first
line, and you will get nice column names for the
data, otherwise, this row will be ignored
Larger Challenge
•
Download all refseq genes from chr1, and
chr2, from the mm8 assembly, to two
different files.
• Using R:
1. Make two vectors describing gene lengths
from the different chromosomes
1. Hint: use end and start…
2. Are these significantly different in
means/medians? Which chromosome has
the longest genes?
c1<-read.table("refseq_mm8_chr1.txt", h=T)
chr1_dist<-c1$txEnd-c1$txStart
chr2_dist<-c2$txEnd-c2$txStart
boxplot (chr1_dist, chr2_dist)
boxplot (chr1_dist, chr2_dist, log=“y”)
wilcox.test(chr1_dist, chr2_dist)
In this course
1: How to use the web interface;
understanding the data types
2: How to download and upload data to
the browser; interaction with R
3: How to make complex analyses
between data types; Galaxy and R
The Galaxy tool
• Galaxy is a user-friendly interface to
power-user analysis of UCSC data
• The main strength is that we can
overlap tracks, make sub-tracks, and do
some statistics WITHIN the browser
• We can use this together with R for
even more complex stuff
A different assignment setup…
• The Galaxy people has made several very
informative screencasts, which we will shamelessly
steal
• The general format is that we will watch a short video
lecture showing a concept or technique.
• You will then try to repeat the analysis - the video
lecture you can play yourself if you need a recap:
great way to prepare before home works
• Later on, we will combine different things from
different screencasts, and also explore things not
covered in the casts
Where is galaxy?
• See link on homepage, or
• http://main.g2.bx.psu.edu/
• Please work in groups of 3 persons per
computer (and connection) to not
overload their server – we got thrown
out last year
General intro to getting data to
and from Galaxy and UCSC
• UCSC2galaxy.mov
• Objective:
– 1) find all protein-coding exons and overlap
them with all SNP data
– 2) make these overlaps into a genome
browser track
Try it out!
• Objective:
– (on chr22)
– 1) find all protein-coding exons and overlap them
with all SNP data
– 2) make these overlaps into a genome browser
track
• Try to repeat this WITHOUT the screen cast
and see how well you can manage
• Use the screen cast if you get stuck
Manipulating data
• text_manipulation.mov
• We learn how to add columns and
compute something on each entry
Try it out
• Again, try it with the help of the screen cast
to make an output file based on the coding
exon table you imported that looks like this:
– Chr Start End Your_Own _ID exon_length
• Additional challenge (not covered by screen
cast):
– Using the Statistics option in the Toolbox, what is
the mean exon length?
Intervals
A set of screen casts showing things we
can do with “intervals”
1. Getting the data
2. Genome coverage of a set
3. Coverage of one set to another
4. Intersects (recap), and subtractions
betweens sets
5. Cluster the parts of a set
• We will look at these casts in order, one by
one, (all are in the sub-directory intervals):
• data_prep.mov
• base_cov_and_complement.mov
• coverage.mov
• intersect_and_subtract.mov
• cluster.mov
Do it!
• This is just to get the data to play with should go fast!
• Additional challenge: Upload the CpG
islands track for later use
• At the next cast, we will see how much
of the genome that the different tracks
cover
• Repeat the analysis!
• Challenge: What percentage of the
human genome is covered by CpG
islands?
• Next cast: The coverage of one set to
another set
More challenging…
• Again, repeat the analysis first to understand
it
• Challenge:
– 1) compare the coverage of exons to CpGs
– 2) (not covered by screen cast) Can you use the
“Sort” function under “Filter and sort” to find the
exon with the highest CpG coverage? What gene
is this?
– 2) Can you download this data and open it in R:
plot the distribution of coverage as a histogram
– (solution posted later on the web page)
• Next: intersecting/subtracting sets to
make new tracks
• Intersect_and_subtract.mov
• Challenge: What proportion of CpG
islands have NO overlap with the exon
track?
Next cast: how to cluster data
• Why do we want to cluster data points
in the browser?
• Challenge
• If we cluster repeats that are <100 bp
from each other, what is the longest
repeat-cluster in the human genome?
Operating on genome “scores”
• Often, we have data on each nucleotide in one track
(like conservation), and larger blocks in another (like
exons)
• We want to summarize these - like getting the mean
conservation score for each exon
exons
conservation
mean1
mean2
Live demo
• There is a screencast for this, but the sound
is very bad: genomic_scores.mov
• So, watch this at home for a recap
• We will assess the distribution of mean
conservation scores of the exons we looked
at before
• You will do this yourself on another set in a
few slides
Larger exercise
• We will watch the
“getting_encode_data.mov” screencast
together.
• Then, you will answer the following
question
– Is there a significant difference in mean
species conservation in PU1 and RAR chip
sites in the encode regions (use the
affymetrix data)
– Is there a significant difference in mean
species conservation in PU1 and RAR chip
sites in the encode regions (use the
affymetrix data)
Looking at actual alignments
Fetching_m_alignments.mov
We will learn how to get pairwise or
multiple species alignments for a set of
regions
Challenge:
• Produce a stitched alignment between
human, mouse and rat for the first exon
in the RPS9 gene (you will have to
define a bed file describing this interval
manually - you did already did this in the
first part of this lecture)
Two larger exercises
• We will watch two screencasts of
analysis of some tiling array chip data most of this you have seen before
– Affy_Sp1_CpG.mov
– Affy_Sp1_RefSe.mov
• Then, you will try to answer two larger
questions, by using parts from both
screen casts and R
Large exercise!
We will compare the RAR and PU1 sites from
before.
Is there a significant difference in proportion of
sites that are
1) Overlapping CpG islands?
2) Within 10kb of any gene start?
(hint - it looks as though the screen cast
does this, but beware - they think wrong!)
Tip: look back at the R lectures to see what test
you will need, and then what form of data!