AdvancedChIPSeq - iPlant Pods

Advanced ChIP-seq
Identification of consensus binding sites
for the LEAFY transcription factor
ChIPseq Conceptual Overview
The NCBI SRA
• NCBI SRA is a repository for NGS sequence reads
• Data is stored in association with basic metadata
explaining experimental technique and inter-sample
relationships
• Data is stored in the NCBI-specific SRA and SRA-lite formats,
described as a “universal” lossless format.
• Upload and download is offered via FTP and HTTP but
also via Aspera ASCP
– Fast, parallel protocol similar in performance to iRODS
iput/iget commands used in iPlant Data Store
• Use NCBI SRA Import to rapidly copy SRA
accession SRP003928 over ASCP into the iPlant Data
Store.
NCBI SRA Toolkit
• SRA data format is a universal format, but
no downstream apps can accept it natively.
• Need to export SRA to FASTQ, SFF, etc.
• These are the standard file formats for
representing sequence.
• Use the NCBI SRA Toolkit fastq-dump
to export FASTQ sequence files from SRA
files so we can process them
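A rough sketch of the export step, assuming the SRA Toolkit is installed locally. The helper only calls fastq-dump when it is on the PATH, and a second helper counts FASTQ records as a sanity check. The file and directory names in the usage line are placeholders, not files from this tutorial, and the count assumes the usual four-lines-per-read FASTQ layout.

```shell
# Export FASTQ from a downloaded .sra archive, if the SRA Toolkit is installed.
# $1 = path to an .sra file (hypothetical name), $2 = output directory.
export_fastq() {
  if command -v fastq-dump >/dev/null 2>&1; then
    fastq-dump --outdir "$2" "$1"
  else
    echo "fastq-dump not found; install the NCBI SRA Toolkit first" >&2
    return 1
  fi
}

# Count reads in an uncompressed FASTQ file, assuming 4 lines per record.
count_fastq_reads() {
  awk 'END { print int(NR / 4) }' "$1"
}
```

Typical use would be something like: export_fastq SRR_example.sra fastq/ followed by count_fastq_reads fastq/SRR_example.fastq.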
Import SRA data from NCBI SRA
Extract FASTQ files from the downloaded SRA archives
BWA
• BWA is one of many applications whose
objective is to efficiently align short
sequence reads to a reference genome
sequence
• Other alternatives are BOWTIE, MAQ,
TopHat, Stampy, Novoalign, etc.
• BWA is used by the 1000 Genomes Project
due to its speed and accuracy.
Outputs from BWA
• BWA emits alignments in the SAM format
• SAM is a universal system for describing
next-gen sequences and their
corresponding genome alignments
• SAMTools is a suite of applications for
manipulating SAM files
– Sort, Merge, Index, and more
– Emit as binary BAM file
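To make the SAM layout concrete: SAM is tab-delimited text, header lines begin with "@", and the second column is a bitwise FLAG in which bit 0x4 marks an unmapped read. The sketch below counts mapped alignment records with plain awk. It is an illustration of the format, not a replacement for SAMtools; with SAMtools installed, samtools view -c -F 4 on a BAM file should give a comparable count.

```shell
# Count mapped alignment records in a SAM file.
# Header lines start with "@"; column 2 is the FLAG; bit 0x4 set means unmapped.
count_mapped() {
  awk -F '\t' '
    /^@/ { next }                        # skip header lines
    int($2 / 4) % 2 == 0 { mapped++ }    # bit 0x4 clear: record is mapped
    END { print mapped + 0 }
  ' "$1"
}
```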
Align FASTQ files to Arabidopsis genome using BWA
Merge and index BAM files using SAMtools apps
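Outside the Discovery Environment, these two steps might look roughly like the dry-run sketch below. It assumes single-end reads, the older bwa aln / bwa samse workflow, and SAMtools 1.x syntax for sort; the slides do not say which BWA mode or SAMtools version the apps wrap, so treat every command line as an assumption. File names are placeholders. By default the commands are only printed; set DRY_RUN=0 to execute them.

```shell
# Print (or run) the alignment steps. Commands are echoed by default;
# set DRY_RUN=0 to actually execute them.
DRY_RUN=${DRY_RUN:-1}

run() {
  echo "+ $*"
  if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

# $1 = reference FASTA (already indexed with "bwa index"),
# $2 = FASTQ file, $3 = output prefix
align_one() {
  run bwa aln -f "$3.sai" "$1" "$2"              # suffix-array coordinates
  run bwa samse -f "$3.sam" "$1" "$3.sai" "$2"   # single-end alignments as SAM
  run samtools sort -o "$3.bam" "$3.sam"         # coordinate-sorted BAM
}

# $1 = merged output BAM, remaining arguments = sorted BAM files to merge
merge_and_index() {
  run samtools merge "$@"
  run samtools index "$1"
}
```

For example, align_one genome.fa sample_rep1.fastq sample_rep1 prints the three commands for one replicate, and merge_and_index sample.bam sample_rep1.bam sample_rep2.bam prints the merge and index commands.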
PeakRanger
• PeakRanger is a fast, optimized algorithm for
detecting enrichment peaks in ChIPseq data
sets
• PeakRanger was developed at OICR in
partnership between modENCODE and
iPlant and is now maintained at UTSW
• It’s not the only option for peak finding:
– MACS
– ChIPseq Peak Finder
– CisGenome
– FindPeaks
http://ranger.sourceforge.net/
Use PeakRanger with the BAM files from the Control and Sample assays
to find LEAFY enrichment
NOTE: There are many parameters to tweak. Reading the PeakRanger
paper is recommended.
Outputs from PeakRanger
• Wiggle (.wig) files: Density map of
sequence reads across the reference
genome for control and sample BAM
alignments
• Region (.bed) file: Feature file containing
the significantly enriched domains in the
genome
• Summit (.bed) file: Feature file containing
the single base maximum of each peak
Wiggle file
BED file
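BED intervals are 0-based and half-open, so a feature's width is end minus start. As a quick check on the outputs described above, the sketch below summarizes a BED file and reports how many features are exactly one base wide. It assumes the summit file stores each summit as a start/end pair one base apart, which is worth confirming on your own output.

```shell
# Report how many features a BED file has and how many are exactly 1 bp wide.
# BED columns: chrom, start (0-based), end (exclusive); width = end - start.
bed_width_summary() {
  awk -F '\t' '
    /^(track|browser|#)/ { next }      # skip optional header lines
    NF >= 3 { n++; if ($3 - $2 == 1) single++ }
    END { printf "%d features, %d single-base\n", n, single }
  ' "$1"
}
```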
Integrative Genomics Viewer
The Integrative Genomics Viewer (IGV) is a high-performance
visualization tool for interactive exploration of large, integrated
genomic datasets. It supports a wide variety of data types,
including array-based and next-generation sequence data,
and genomic annotations.
http://www.broadinstitute.org/igv/
Use IGV to inspect outputs from PeakRanger
Using IGV in Atmosphere
1. Launch an instance of NGS Viewers from the Atmosphere App list
2. Use a VNC client to connect to your remote desktop
Using IGV in Atmosphere
1. Configure iDrop
2. Copy .wig and .bed files from the PeakRanger output to your
Atmosphere instance desktop
Using IGV in Atmosphere
1. Launch IGV (Integrative Genomics Viewer)
2. Change the current genome to A. thaliana (TAIR10)
Using IGV in Atmosphere
1. Open igvtools and convert the .wig file to .tdf
2. Load the .tdf and .bed files into the IGV window
3. Inspect loci by entering their names into the search box
Using IGV in Atmosphere
Enrichment region and alignment peak at the promoter region of
APETALA1 (AP1)
AP1 (APETALA1) Mutant
Wild-type
ap1
Why do we even care about LEAFY? Well, it activates AP1. If AP1 is not
active, Arabidopsis can’t make flowers and instead makes cauliflowers!
Some Known LEAFY targets
Gene Name | Locus
APETALA1 (AP1) | AT1G69120.1
AGAMOUS (AG) | AT4G18960.1
LMI2 | AT3G61250.1
LMI3 | AT5G49770.1
LMI4 | AT5G60630.1
LMI5 | AT1G16070.1
Look for LEAFY enrichment at these loci in IGV 2.0
Filtering the PeakRanger summits file
The statistically best summits from PeakRanger have P-values of zero. If
you look at the summits.bed file you can see this is embedded in the
name of the features. So, if we filter summits.bed for only lines
matching pval_0, we will generate a BED file containing the summits most
likely to be near true LEAFY binding sites.
This is identical to running
egrep "pval_0" peakranger_summit.bed > peakranger_summit_best.bed
on a command line
Find Lines Matching a Regular Expression
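One caveat with a bare pval_0 pattern: it would also match a name such as pval_0.0031, if feature names ever carry non-zero P-values in that form. Whether PeakRanger writes names that way is not shown in these slides, so the name layout in the sketch below is an assumption. The stricter pattern keeps only lines where pval_0 is followed by an underscore, whitespace, or the end of the line.

```shell
# Keep only summits whose name field carries a P-value of exactly zero.
# Assumes names embed the P-value as "pval_<number>", optionally followed by
# "_" and further fields; this layout is an assumption, so check your own file.
filter_best_summits() {
  grep -E 'pval_0(_|[[:space:]]|$)' "$1"
}
```

Usage would mirror the command above: filter_best_summits peakranger_summit.bed > peakranger_summit_best.bed.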
BEDTools for Interval Operations
The BEDTools utilities allow one to address common genomics
tasks such as finding feature overlaps and computing coverage.
The utilities are largely based on four widely-used file formats:
BED, GFF/GTF, VCF, and SAM/BAM. Using BEDTools, one can
develop sophisticated pipelines that answer complicated
research questions by "streaming" several BEDTools together.
slopBed – Expand the coordinates of features in a BED file by a
defined number of bases
fastaFromBed – Extract a multiFASTA file from a reference sequence
using a BED file of features
* The entire BEDTools suite is slated for integration into the iPlant DE.
Follow us on Twitter @iPlantCollab to learn when new tools become
available.
Filter summits.bed on pval_0
Best Summits BED File
(single base pair features)
BEDTools slopBed, 50 bp on each side
100 bp Region BED File
(100 bp centered on peak centers)
BEDTools fastaFromBed, Arabidopsis genome
FASTA file of 100 bp regions
(likely to contain consensus motifs)
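The widening step in this workflow can be sketched with awk as a stand-in for slopBed, which is handy for seeing exactly what happens to the coordinates. Note that a single-base feature padded by 50 bases on each side spans 101 bases, so "100 bp" is approximate. The BEDTools command lines in the comments use placeholder file names, and genome.sizes stands for a chromosome-lengths file that slopBed requires.

```shell
# Widen each BED feature by a fixed number of bases on both sides.
# Awk stand-in for the slop step: it clamps the start at 0 but, unlike
# slopBed with a genome file, does not clamp the end at the chromosome length.
slop_bed() {   # $1 = BED file, $2 = bases to add on each side
  awk -F '\t' -v OFS='\t' -v pad="$2" '
    NF >= 3 {
      $2 = ($2 - pad < 0) ? 0 : $2 - pad
      $3 = $3 + pad
      print
    }
  ' "$1"
}

# With BEDTools installed, the two steps would be roughly (placeholder names):
#   slopBed -i best_summits.bed -g genome.sizes -b 50 > regions.bed
#   fastaFromBed -fi genome.fa -bed regions.bed -fo regions.fa
```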
DREME
Objective
Go from a BED file of single-base peak summits to a FASTA file
containing the 100 bp surrounding those summits that can be used for
motif hunting
DREME
• Run DREME on 100 bp windows surrounding LEAFY peaks
• Download results
DREME results
Success!
CCANTG(G/T)!
Potential Next Steps
• Identify all consensus LEAFY sites in the
genome that fall in promoters
• Extract all the promoters where LEAFY
has significant binding and associate them
with genes.
• Generate a simple gene list and run
Ontology Term enrichment analysis to
find classes of genes influenced by LEAFY
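As a starting point for the first of these steps, the consensus CCANTG(G/T) can be written as the regular expression CCA[ACGT]TG[GT] and counted in a FASTA file. This is a minimal sketch, not a substitute for a proper motif scanner: it counts forward-strand, non-overlapping matches only, ignores the reverse strand, and does not report positions or restrict to promoters.

```shell
# Count forward-strand matches to the consensus CCANTG(G/T) in a FASTA file.
# Each record is joined onto one line first so matches are not split by line
# wrapping; reverse-strand sites are NOT counted in this sketch.
count_consensus_sites() {
  awk '
    /^>/ { if (seq != "") print seq; seq = ""; next }
    { seq = seq toupper($0) }
    END { if (seq != "") print seq }
  ' "$1" | grep -o -E 'CCA[ACGT]TG[GT]' | wc -l | tr -d ' '
}
```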
Cyberinfrastructure Overview
Component | What we did | Why we used it
iPlant Data Store | Imported data from SRA. Stored results of analyses. Downloaded results. | Fast, flexible storage for large bioinformatics data.
Discovery Environment | Data import. NGS Alignment. Peak Finding. Data organization. | One interface. Multiple bioinformatics applications. Easy to manage work products.
Atmosphere | Loaded results into desktop client application. | Avoid downloading large files to personal computer. Easy access to powerful desktop environment.