IGVTutorial_CountwayJun2010_final

Integrative Genomics
Viewer
HMS Countway Library
June 15, 2010
Jim Robinson , Helga Thorvaldsdóttir
Broad Institute of MIT and Harvard
Agenda
• Introduction
• User Interface Basics
• Data Integration
• Hands-on Exercise
• File formats
• Viewing Next-Generation Sequence (NGS) Data
• Hands-on Exercise
• IGVTools
Slides and handouts:
ftp.broadinstitute.org/pub/genepattern/igv/tutorials/June2010
Introduction
What is IGV
A desktop application
for integrated visualization
of multiple data types and annotations
in the context of the genome
Microarrays
Epigenomics
RNA-Seq
NGS alignments
Comparative genomics
Motivation
• Easily view investigator-generated datasets alongside
publicly available data
• Support integration of diverse data types and sample
attribute information
• Handle large datasets
IGV goals
• Meet the needs of diverse projects, including
  • The Cancer Genome Atlas (TCGA)
  • Epigenetic & lincRNA studies
  • 1000 Genomes Project
  • Single-investigator projects
• Meet the needs of diverse users –
biologists and bioinformatics specialists
• Scale to very large datasets on standard desktop systems
• Intuitive and easy to use
IGV distribution
• First public release in August 2008
• Current release: 1.4.2
• Early access versions updated frequently
• More than 5500 registered users
• Open source and freely available
• http://www.broadinstitute.org/igv
•Contact us: [email protected]
Installing IGV
• Register at http://www.broadinstitute.org/igv
• Click “Downloads”
• Click a Launch button (Mac or PC), or
• Download and unzip the binary distribution (Linux)
IGV Web site
http://www.broadinstitute.org/igv
Downloads: PC and Mac, Linux
User Interface Basics
IGV layout
Expression and copy number data
Labeled regions: Cytoband, Track Names, Genomic Coordinates, Data Panel, Annotation Heatmap, Genome Features
IGV layout
NGS data
UI basics
• Selecting a reference genome
• Loading data
• Navigating through the data
• Setting track attributes
Selecting a reference genome
• Select one of the hosted genomes from the pull-down menu
• For more information see www.broadinstitute.org/igv/Genomes
• You can import other genomes if you have the sequence data
Loading data
Types of data
• Any data tied to genomic coordinates
• Genome annotations
• Sample attributes/annotations
File formats
• Many different file formats supported
• See www.broadinstitute.org/igv/FileFormats
Tracks
• Two generic types:
• data (continuous valued data)
• annotation (features)
• Specialized types include
• alignments
• mutations
• multiple alignments
• Type is defined by file format, and can be overridden by the user
• IGV uses type to determine
• initial placement in a panel
• display options and options for other track attributes
Loading data
#1 : Load local file
#2 : Load from URL
#3 : Load from server (Broad IGV data server, other data server)
“Load from server” menu
What you see depends on:
(1) which server you selected – default is Broad server
(2) which reference genome you’ve selected
Click on the icon beside an entry for more information about the data source
Click on the expand control to expand the sub-menus
Click on the checkbox to select datasets
Note that all nested datasets are also selected – make sure you know what you’ve selected
One last thing … you cannot unload using the checkboxes
Navigating through the data
Whole genome view
Zooming in to the chromosome level
• Select chromosome from menu
• Click on chromosome number
Chromosome view
Zooming further in
• Use the railroad track
• Double-click in data panel
• Shift-click to go faster
• Alt-click to zoom back out
• Specify range in the search box
• Red box on cytoband shows where we are
• Ruler shows the extent of the region
Scroll or jump to location at same zoom level
• Click on cytoband
• Click on ruler
• Click and drag – up/down left/right
• Use scroll bar
• Use keyboard
  (1) arrow keys
  (2) Page Up, Page Down, Home, End
Zoomed in to base pair view
• Reference genome bases
• Protein residues
Jump to feature
• Enter name of feature in search box
• With or without zoom (View > Preferences > General)
• Click on a feature track (e.g. gene track, BED, GFF)
• Ctrl+F = jump forward to next feature
• Ctrl+B = jump backward to previous feature
Setting track attributes
Right-click popup menu
Multiple tracks
• Select multiple tracks by clicking on track names: Shift-click / Ctrl-click
• Select multiple tracks by clicking on color in annotation heatmap
Global attributes
• Tracks > Fit Data to Window
• Tracks > Set Track Height
Annotation track
Gene representation (zoomed in views)
Labeled parts: 5’ UTR, Exons, Intron, 3’ UTR
Annotation display mode
1. Features are drawn in a single row, by default
2. Expand the track using the popup menu
Sessions
• Save current state of IGV to a named session file.
• Use to
• restore the same state
• share session with colleagues
Data Integration
Data integration
• Load different types of data
• Use sample annotations to manipulate tracks
Sample annotations
• Default annotations for all sample tracks:
• data file, data type, track name
• Custom annotations:
• use sample information file
• Show / hide annotation panel
(View > Show Attribute Display)
• Show / hide selected annotations
(View > Select Attributes to Show)
Sort tracks by attribute value
• Click on the annotation name
• Use the menu Tracks > Sort Tracks
Sort tracks by data value in a region
• Region Tool
• Popup menu
Group tracks
Filter tracks
Hands-on Exercise
UI basics and data integration
File Formats
File formats
•Sample Info File
•Annotation File Formats
•Data File Formats
•Track Line
•Genomes and FASTA Files
Sample info file
A sample information file (also called an attribute file) is a tab-delimited text file that includes descriptive information (attributes) for
track identifiers.
Uses:
Annotation heatmap
Sorting
Filtering
Grouping
The first column of a sample information file contains track identifiers.
Subsequent columns may contain any attribute values and may be
given any arbitrary label.
Sample info file
Example
TRACK_ID    Data_Type    LINKING_ID  SAMPLE_ID   GENDER  T/N      Tumor_type  Treated  Primary/Secondary  Hypermutated
EX-01-001   Expression   P-01-P001   P-01-S001   M       Tumor    GBM         Y        Primary            Y
CN-01-002   CopyNumber   P-01-P001   P-01-S001   M       Tumor    GBM         Y        Primary            Y
MU-01-003   Mutation     P-01-P001   P-01-S002   M       Tumor    GBM         Y        Primary            Y
EX-01-004   Expression   P-01-P002   P-01-S003   M       Normal   GBM         Y        Secondary          Y
CN-01-005   CopyNumber   P-01-P002   P-01-S004   M       Tumor    GBM         Y        Secondary          N
EX-01-006   Expression   P-01-P002   P-01-S004   M       Tumor    GBM         Y        Secondary          N
ME-01-007   Methylation  P-01-P002   P-01-S004   M       Tumor    GBM         Y        Secondary          N
EX-01-008   Expression   P-01-P003   P-01-S006   F       Tumor    GBM         N        Primary            Y
EX-01-009   Expression   P-01-P004   P-01-S009   F       Tumor    GBM         N        Primary            Y
EX-01-0010  Expression   P-01-P005   P-01-S0011  M       Control
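A file laid out like the example could be read into a lookup keyed by track identifier. The following is a minimal Python sketch, not IGV's own parser. The function name read_sample_info is hypothetical, and the inline data reuses two rows and a subset of the columns from the example.

```python
import csv
import io

def read_sample_info(handle):
    """Return {track_id: {attribute: value}} from a tab-delimited sample info file."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    attributes = header[1:]  # the first column holds the track identifiers
    table = {}
    for row in reader:
        if not row or not row[0].strip():
            continue  # skip blank lines
        # Pad short rows so every attribute has a (possibly empty) value.
        values = row[1:] + [""] * (len(attributes) - len(row) + 1)
        table[row[0]] = dict(zip(attributes, values))
    return table

# Two rows and a subset of columns from the example table.
example = (
    "TRACK_ID\tData_Type\tLINKING_ID\tSAMPLE_ID\tGENDER\tT/N\tTumor_type\n"
    "EX-01-001\tExpression\tP-01-P001\tP-01-S001\tM\tTumor\tGBM\n"
    "EX-01-0010\tExpression\tP-01-P005\tP-01-S0011\tM\tControl\n"
)
sample_info = read_sample_info(io.StringIO(example))
```

A dictionary of this shape makes the uses listed earlier (sorting, filtering, grouping) simple lookups on an attribute name.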
Annotation File formats
•BED - UCSC standard format. Useful for displaying any
feature type from simple blocks to genes.
http://genome.ucsc.edu/FAQ/FAQformat.html#format1
•GFF – Two variants, GFF2 and GFF3. Can also be used for all
feature types, but tends to be more verbose and slower to parse
than BED. File sizes can be significantly larger.
http://www.sequenceontology.org/gff3.shtml
•Note: BED file coordinates are “zero-based half-open”. This means an
interval spanning the first base is represented as 0-1. GFF files are “one-based closed” (both ends inclusive). An interval spanning the first base is represented as 1-1.
This difference is responsible for many off-by-one bugs.
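The conversion between the two conventions is a one-line shift of the start coordinate. This Python sketch uses hypothetical helper names (bed_to_gff, gff_to_bed) and treats GFF intervals as one-based with both ends inclusive, matching the 0-1 versus 1-1 example for the first base.

```python
def bed_to_gff(start, end):
    """Zero-based half-open (BED) -> one-based inclusive (GFF)."""
    return start + 1, end

def gff_to_bed(start, end):
    """One-based inclusive (GFF) -> zero-based half-open (BED)."""
    return start - 1, end

# The first base of a chromosome in each convention.
first_base_gff = bed_to_gff(0, 1)
first_base_bed = gff_to_bed(1, 1)
```

Only the start changes; the end value is numerically the same in both conventions, and the interval length is end - start in BED but end - start + 1 in GFF.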
Data File formats
•Single Track Formats
• WIG – for fixed or variable step data with fixed spans
http://genome.ucsc.edu/goldenPath/help/wiggle.html
• BEDGraph – similar to BED format
http://genome.ucsc.edu/goldenPath/help/bedgraph.html
Data File formats
• Multi-track (array) formats
• IGV – general array-based data format.
• CN – GenePattern format designed for SNP copy number data.
Can be used for other data that spans a single base.
• SEG – Specialized format for segmented copy number data.
• GCT – GenePattern format for expression data. The only format
that uses probe names instead of genomic coordinates.
GCT format
GCT rows are keyed by probe identifier. For
display in IGV these rows must be mapped to
genomic coordinates with one of the following
options:
Probe to locus. IGV can automatically map probes for many
common chips directly to a genomic location.
Probe to gene. Optionally the user can specify that probes be
mapped to genes. When this is chosen the expression value is
applied to the entire region spanned by the gene.
User-supplied. The automatic mappings can be overridden by
inserting a locus string in the description column delimited by the
symbols |@ and |, for example |@chr6:1950428-1950681|.
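The user-supplied locus string can be pulled out of a description column with a regular expression. This is a minimal Python sketch of the |@...| convention described above, not IGV's own parsing code. The function name and the sample description text are hypothetical.

```python
import re

# Matches a locus delimited by "|@" and "|", e.g. |@chr6:1950428-1950681|
LOCUS_PATTERN = re.compile(r"\|@(?P<chrom>[^:|]+):(?P<start>\d+)-(?P<end>\d+)\|")

def locus_from_description(description):
    """Return (chromosome, start, end) or None if no locus string is present."""
    match = LOCUS_PATTERN.search(description)
    if match is None:
        return None
    return match["chrom"], int(match["start"]), int(match["end"])

locus = locus_from_description("hypothetical probe |@chr6:1950428-1950681|")
missing = locus_from_description("hypothetical probe without a locus")
```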
UCSC track line
A track line can be used to control many aspects of the track
display such as graph type, color, and scale.
Can be used with wig, bed, gff, igv, cn, and gct files.
Line begins with “track” for wig and bed, “#track” for other formats.
Track line consists of key=value pairs, separated by a single space.
Example:
track name="my custom track" graphType=bar color=255,0,0
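To illustrate the key=value structure, here is a small Python sketch that splits a track line into a dictionary. It relies on shlex so that quoted values containing spaces stay together. The function name parse_track_line is hypothetical, and this is not how IGV itself reads the line.

```python
import shlex

def parse_track_line(line):
    """Split a 'track' or '#track' line into a dict of its key=value pairs."""
    tokens = shlex.split(line.lstrip("#"))
    if not tokens or tokens[0] != "track":
        raise ValueError("not a track line")
    # Keep only well-formed key=value tokens; split on the first "=" only.
    return dict(token.split("=", 1) for token in tokens[1:] if "=" in token)

attributes = parse_track_line(
    'track name="my custom track" graphType=bar color=255,0,0'
)
```

The same function accepts the "#track" prefix used by the non-UCSC formats, since the leading "#" is stripped before splitting.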
Importing a genome
Custom genome assemblies can be defined using “import genome”
The imported genome will be available from the drop down menu
Prerequisites:
A FASTA file, directory of FASTA files, or zip of FASTA files that
contains the sequence data for each chromosome in the
genome. (Required)
A cytoband file, which IGV uses to display the chromosome
ideogram. (Optional)
An annotation file in BED file format, the GFF file format, or any
variation of the genePred table format. (Optional)
Importing a genome
1. Click File > Import Genome. IGV displays the Import Genome
window:
2. Enter a name for the genome.
3. For Sequence File, click the ellipsis button and select the FASTA
file (or zip of FASTA files) that contains the sequence data.
4. Optionally, specify the cytoband file and the gene track annotation
file (Gene File).
5. Click Save. IGV displays the Genome Archive window.
6. Select the directory in which to save the genome archive
(*.genome) file and click Save. IGV saves the genome and loads it
into IGV.
Viewing NGS Data
Next-generation sequencing
The size of NGS datasets presents many challenges,
including:
• Implementation
• Managing terabyte size files with modest compute
resources (desktop computers).
• Visual design
• Highlight events of interest
• Deemphasize irrelevant details
• Avoid information overload
Aligned reads – all bases
Aligned reads - mismatches
Aligned reads – base quality
Vary view by resolution scale
Whole chromosome -- calculated summary data, e.g. coverage.
~ 50-100 kb -- putative rearrangements, SNPs
~ 500 bp -- bases
Viewing NGS data
Paired end data
• Pairs with unexpected insert sizes are color coded by
chromosome.
• Useful for visualizing possible rearrangements.
Paired end data
Alignment preferences
View > Preferences > Alignments
Hands-on Exercise
Viewing NGS data
IGVTools
IGVTools
IGVTools is a set of utilities for preparing large files for efficient display.
• tile: converts a sorted data input file to a binary tiled data (.tdf) file.
Supported input file formats: .wig, .cn, .snp, .igv, .gct
• count: computes average alignment or feature density over a specified
window size across the genome.
Supported input file formats: .sam, .bam, .aligned, .sorted.txt, .bed
• sort: sorts the input file by start position.
Supported input file formats: .cn, .igv, .sam, .aligned, .bed
• index: creates an index file for an input ASCII alignment file.
Supported input file formats: .sam, .aligned, .sorted.txt
IGVTools tile
The tile utility converts large ASCII data files into tiled data format
(.tdf) files. TDF files have the following advantages:
1. Data is indexed for efficient retrieval.
2. Data for zoomed out views is preprocessed.
3. TDF files are web friendly: large data files can be shared over
the web, and only small slices of the file are actually transferred as
needed.
IGVTools count
The count command is used to transform alignment files to read
density TDF files, e.g. for ChIP-Seq, RNA-Seq, & similar alignment
counting experiments.
Alignments (bam/sam, .aligned, or bed format) → igvtools → Read density
(“Tiled Data File” indexed and optimized for fast retrieval at
multiple resolution scales)
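The idea of turning alignments into a density over fixed windows can be sketched in a few lines of Python. This only illustrates window-based counting under simple assumptions (each read is counted once, at its start position). The slides do not describe how igvtools actually computes density, and the function name and read positions are hypothetical.

```python
from collections import Counter

def window_density(read_starts, window_size):
    """Count reads per fixed-size window, keyed by the window's start position."""
    counts = Counter()
    for start in read_starts:
        counts[(start // window_size) * window_size] += 1
    return dict(counts)

# Hypothetical read start positions on one chromosome.
density = window_density([3, 10, 27, 31, 49, 120], window_size=25)
```

Dividing each count by the window size would give an average density per base rather than a raw count.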
IGVTools sort
This utility sorts IGV supported genomic formats by start position.
Example:
igvtools sort -m 1000000 -t ~/myTmpDir inputFile.sam outputFile.sorted.sam
The sort command uses a combination of memory and disk to handle
large files.
-m = maximum # of lines to hold in memory. When this number is
exceeded a temporary file is created.
-t = directory used to create temporary files during sorting.
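Combining memory and disk in this way is commonly done with an external merge sort: sort chunks that fit in memory, write each to a temporary file, then merge the sorted chunks. The Python sketch below illustrates that general technique and is not the igvtools implementation. Its max_lines and tmp_dir parameters mirror the roles of -m and -t. For simplicity it assumes tab-delimited lines with the start position in the second column and ignores chromosome names.

```python
import heapq
import itertools
import os
import tempfile

def start_position(line):
    # Assumption for this sketch: the start position is in the second column.
    return int(line.split("\t")[1])

def external_sort(lines, max_lines, tmp_dir):
    """Yield lines sorted by start position, spilling sorted chunks to tmp_dir."""
    chunk_paths = []
    iterator = iter(lines)
    while True:
        # Hold at most max_lines lines in memory at a time.
        chunk = sorted(itertools.islice(iterator, max_lines), key=start_position)
        if not chunk:
            break
        fd, path = tempfile.mkstemp(dir=tmp_dir, suffix=".chunk")
        with os.fdopen(fd, "w") as handle:
            handle.writelines(chunk)
        chunk_paths.append(path)
    handles = [open(path) for path in chunk_paths]
    try:
        # Merge the already-sorted chunk files without loading them fully.
        yield from heapq.merge(*handles, key=start_position)
    finally:
        for handle in handles:
            handle.close()
        for path in chunk_paths:
            os.remove(path)

records = ["chr1\t300\tc\n", "chr1\t100\ta\n", "chr1\t200\tb\n"]
with tempfile.TemporaryDirectory() as tmp:
    sorted_records = list(external_sort(records, max_lines=2, tmp_dir=tmp))
```

A smaller max_lines lowers peak memory use at the cost of more temporary files to merge.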
IGVTools index
Used to create an index file for viewing SAM (not BAM) files.
Note: not to be confused with samtools index, which is used to
create an index for BAM files.
SAM => igvtools
BAM => samtools
Example: igvtools index inputFile.sam
Result: inputFile.sam.sai
The index file must stay in the same location as the SAM file; IGV finds it
by appending .sai to the SAM file name.
Creating Web links
Use HTML hyperlinks to launch IGV and share datasets over the web.
Two types of links are supported
(1) Launch IGV on a specified session file.
Example:
http://www.broadinstitute.org/igv/dynsession/igv.jnlp?sessionURL=http://www.broadinstitute.org/tumorscape/textReader/IGV/all_tumors_session.xml&locus=chr7:55054218-55206232
(2) Load sessions or data files into a running IGV
Example:
http://localhost:60151/load?file=http://www.broadinstitute.org/igvdata/annotations/hg18/conservation/pi.12mer.wig.tdf&locus=egfr&genome=hg18
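Links of the second type can be assembled programmatically. This Python sketch builds a load URL from the parameters shown in the example (file, locus, genome) and the port from that example. The function name igv_load_url is hypothetical. Note that urlencode percent-encodes the nested file URL, whereas the slide example leaves it unencoded. Both forms describe the same query, but whether IGV 1.4 accepts the encoded form is an assumption.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

def igv_load_url(file_url, locus=None, genome=None, port=60151):
    """Build a link that asks a running IGV to load a file."""
    params = {"file": file_url}
    if locus:
        params["locus"] = locus
    if genome:
        params["genome"] = genome
    return f"http://localhost:{port}/load?{urlencode(params)}"

data_url = (
    "http://www.broadinstitute.org/igvdata/annotations/hg18/"
    "conservation/pi.12mer.wig.tdf"
)
link = igv_load_url(data_url, locus="egfr", genome="hg18")

# Decode the query again to confirm the parameters round-trip.
decoded = parse_qs(urlsplit(link).query)
```

The resulting string can be placed in the href attribute of an HTML hyperlink.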