Day7 In-class slides for RNA-Seq

Download Report

Transcript Day7 In-class slides for RNA-Seq

RNA Sequencing
Day 7
Wooohoooo!
Aaron Odell
June 17th 2015
Outline For The Day
• Map raw fastq files to reference genome
• Convert mapped data to visualization ready
files
• “Quantify” mapped data over reference
genome genes/annotations
• Differential Expression
Data Set
• Mouse RNA Seq
– PolyA Strand Specific Library. Ethanol treatment
– 2 x 100 Paired End Reads = 2 Fastq Files
• mm10 Reference Genome (.fasta file)
• Ensemble mm10 gene annotation file (.gtf)
• Due to time and compute restraints we will
only be working with Chromosome 19
What Mapping Program To Use?...
• Need program that is able to handle spliced
reads…
– Tophat2, GSNAP, STAR, etc …
• We will be using Tophat2
– https://ccb.jhu.edu/software/tophat/manual.shtml
• Excellent spliced alignment program evaluation
– http://www.nature.com/nmeth/journal/v10/n12/full/
nmeth.2722.html
First things first
• Log onto vieques and load all modules we will
need to run analysis successfully
module load commands
•
•
•
•
•
•
•
module load python_2.7.3
module load samtools_0.1.16
module load numpy_1.6.1
module load pysam_0.8.1
module load bowtie_bowtie2-2.0.2
module load tophat_2.0.6
module load htseq_0.6.1
Lets begin…
• cp -r /projects/sreadgrp/RNA /Users/username/
• Edit /Users/username/RNA/SCRIPTS/ILS_E_tophat_samtools.pbs in
text editor
– Change all “odella” to your actual username
• Example /Users/odella/  /Users/your_username/
– Change #PBS –M your_email_address to your actual email address
(or if you don’t want the email then do nothing)
• From the /Users/username/RNA/SCRIPTS directory
qsub ILS_E_1_tophat_samtools.pbs
While we wait…
• Edit
/Users/username/RNA/SCRIPTS/ILS_E_htSeqCou
nt.pbs
– Change all “odella” to your actual username
• htSeqCount manual
– http://wwwhuber.embl.de/users/anders/HTSeq/doc/count.html
• Once
/Users/username/RNA/SCRIPTS/ILS_E_tophat_sa
mtools.pbs has run to completion, qsub
/Users/username/RNA/SCRIPTS/ILS_E_htSeqCou
nt.pbs
For those of you who are bored…
• Read through the tophat and htseq manuals
• Take a look at various files created by our
mapping/quantification pipeline.
• Tophat manual explains what all these files are
and where they come from
• Run this command from
/Users/username/RNA/ILS_E directory
– samtools flagstat ILS_E_sorted.bam
– Do you understand the output? Search the internet or
ask someone if you don’t
Time to visualize!
Copy bam and genome files to student directory
• cp /Users/username/RNA/ILS_E/ILS_E_sorted.bam
/projects/sreadgrp/student/username/ILS_E_sorted.bam
• cp /Users/username/RNA/ILS_E/ILS_E_sorted.bam.bai
/projects/sreadgrp/student/username/ILS_E_sorted.bam.bai
• cp /User/username/RNA/GENOME/chr19.fasta
/projects/sreadgrp/student/username/chr19.fasta
• cp /User/username/RNA/GENOME/chr19.fasta.fai
/projects/sreadgrp/student/username/chr19.fasta.fai
• cp /User/username/RNA/GENOME/chr19.gtf
/projects/sreadgrp/student/username/chr19.gtf
Load the approriate files
• Load genome from file
– Chr19.fasta
• Load from file
– Chr19.gtf
– ILS_E_sorted.bam
Look at ILS_E_htSeq.txt
• Count the number of reads you see mapping
in IGV over any given gene and compare to
the htSeqCount value
– Does it make sense?