2 retrieving sequence data file and Linux commands


Stubbs Lab Bioinformatics - 3
Review
RNA-Seq Analysis Overview
Alignment using Tophat2
Nov 22, 2016
Joe Troy
Agenda
• Review of tools and Linux commands
• Overview of the RNA-Seq Analysis
• Aligning short reads (.fastq files) with Tophat2
to create alignment files (accepted_hits.bam)
Software Tools
Linux Command Line: Run Bash, R, Perl or Python scripts or other command line tools (cheatsheets.s3.amazonaws.com/formobile/linux-commands-cheatsheet-new.pdf)
Bash scripts: Glue individual commands together to form a process; bash = Bourne-again shell (ryanstutorials.net/bash-scripting-tutorial/)
R: Perform statistical analysis and/or transform data and output results (www.r-project.org)
Perl: Transform (reformat) data, call other software (perl.org)
Python: Transform (reformat) data, call other software (python.org)
Bowtie: Align reads to a genome (bowtie-bio.sourceforge.net/index.shtml)
TopHat: Manage reads from spliced exons (ccb.jhu.edu/software/tophat/index.shtml)
HTSeq: Create reads per gene counts (www-huber.embl.de/HTSeq/doc/overview.html)
edgeR (R library): Identify DE genes, create MDS plots (bioconductor.org/packages/release/bioc/html/edgeR.html)
limma (R library): Has functionality needed by edgeR (bioconductor.org/packages/release/bioc/html/limma.html)
pysam: A Python module used by HTSeq to read SAM/BAM files (pysam.readthedocs.io/en/latest/api.html)
Cyberduck: Move and manage files (cyberduck.io)
sftp: Move and manage files (oucsace.cs.ohiou.edu/~chelberg/classes/2400/SSH_SFTP_Handout.pdf)
TextWrangler: Mac file editor that works well with Cyberduck (www.barebones.com/products/textwrangler/)
Excel: Many uses, often to store final results or to do further analysis (http://best-excel-tutorial.com)
R-studio: Not used to execute any of the example scripts, but an invaluable tool for any R user (www.rstudio.com)
Linux commands (review and new)
cp
copy. copy file ex: cp oldfile.txt newfile.txt
copy folder ex: cp -R old_folder new_folder
df -h
See how much disk space is used and available on the server, with sizes shown as human readable (h)
cd
change to new folder. ex: cd my_new_folder
pwd
print working directory, show the current folder
ls -lh
list contents with details (l), show file size & date as human readable (h)
rm
PERMANENTLY remove a file or folder.
ex: rm my_file.txt removes a file named “my_file.txt” in the current working directory.
ex: rm -r myfolder removes the folder named “myfolder”, and all of its contents, in the current working directory.
ex: rm *.txt removes all files ending with ‘.txt’ in the current working directory.
ex: rm * removes every (non-hidden) file in the current working directory. BE CAREFUL.
screen
Screen allows you to start a “sub-process” on stubbslab.igb.illinois.edu, exit
that subprocess while it continues to run (allowing you to disconnect from
stubbslab.igb.illinois.edu), and reattach to the process at a later time.
sh
Used to run a shell script. ex: sh main_script_tophat_16Gso.sh
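The file commands above can be practiced safely in a throwaway folder. The following is a minimal sketch, assuming a Linux or Mac shell with mktemp available; the scratch folder and the file names are hypothetical and exist only for this exercise.

```shell
#!/bin/bash
# Practice cp, ls and rm in a scratch folder so nothing real is at risk.
set -eu

scratch=$(mktemp -d)         # hypothetical scratch folder, created fresh
cd "$scratch"
pwd                          # show the current folder

echo "sample text" > oldfile.txt
cp oldfile.txt newfile.txt   # copy a file

mkdir old_folder
cp oldfile.txt old_folder/
cp -R old_folder new_folder  # copy a folder and its contents

ls -lh                       # list with details and human-readable sizes

rm newfile.txt               # permanently remove one file
rm -r old_folder             # permanently remove a folder and its contents
ls -lh                       # confirm what is left

# The scratch folder is left in place for inspection.
# Remove it afterwards with: rm -r "$scratch"
```

Because rm has no undo, running ls first with the same pattern (for example ls *.txt before rm *.txt) is a cheap way to preview what would be deleted.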
RNA-Seq data analysis
Context and Overview
Perform experimental protocol
↓
RNA Library Prep
↓
Sequencing
↓
RNA-Seq Data Analysis
• Retrieve short-read files (.fastq) from the biotech ftp server
• Align short reads to a genome to create tophat2 alignment files (accepted_hits.bam file) and create count files
• Create alignment summary reports
• Create expression MDS plots to visualize expression differences between samples and sample groups
• Create BigWig files that can be used for "Track Hubs" on the UCSC genome browser
• Analyze differentially expressed genes using edgeR
• Perform DAVID (gene ontology) analysis on differentially expressed genes
• Perform hypergeometric tests on differentially expressed gene lists
Step: Retrieve and uncompress short read files
INPUT: .tgz file(s) from ftp.biotec.illinois.edu
TOOLS: sftp command, tar command
OUTPUT: .fastq short read files
Step: Align reads to genome
INPUT: .fastq short read files
TOOLS: Tophat 2 script
OUTPUT: “accepted_hits.bam” file from each “.fastq” file
Next Step: review alignment stats
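The uncompress half of the retrieve step can be sketched as follows. This is an illustration under stated assumptions, not the lab's actual procedure: the archive name reads.tgz, the file sample_1.fastq and its single placeholder read are invented so the example runs anywhere, and the sftp login shown in the comment uses a hypothetical user name.

```shell
#!/bin/bash
# Sketch: unpack a .tgz archive of short-read files into fastq_files/.
set -eu

work=$(mktemp -d)                   # hypothetical working folder
cd "$work"

# Build a tiny stand-in archive so the example is self-contained.
# On the real server the .tgz would instead be fetched with sftp, e.g.
#   sftp your_user@ftp.biotec.illinois.edu    (then: get some_archive.tgz)
mkdir staging
printf '@read1\nACGT\n+\nIIII\n' > staging/sample_1.fastq   # placeholder read
tar -czf reads.tgz -C staging sample_1.fastq

mkdir fastq_files
tar -tzf reads.tgz                  # list the archive contents first
tar -xzf reads.tgz -C fastq_files   # extract (x) the gzipped (z) file (f)
ls -lh fastq_files
```

Listing the archive with -t before extracting shows whether it unpacks into its own folder or directly into the current one.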
Terminal is used to access the Linux command line on a Mac
Instructions to align short reads with tophat2
INSTRUCTION SLIDE 1
Josephs-MacBook-Pro:~ josephtroy$ ssh jmtroy2@stubbslab.igb.illinois.edu
jmtroy2@stubbslab.igb.illinois.edu's password:
Last login: Mon Nov 21 20:15:51 2016 from c-73-73-226-74.hsd1.il.comcast.net
[jmtroy2@stubbslab ~]$ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1       4.6T  4.2T  156G  97% /
/dev/sda2        95G   14G   77G  16% /var
/dev/sdb1       289M   29M  246M  11% /boot
tmpfs            32G     0   32G   0% /dev/shm
/dev/sdb2       275G  116G  145G  45% /var/lib/mysql
[jmtroy2@stubbslab ~]$ screen
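The df -h output above shows the root filesystem at 97% use, so it can be worth checking free space before starting a long alignment. The following is a minimal sketch of such a check; the 50 GB threshold and the variable names are assumptions chosen for illustration, not values from the lab.

```shell
#!/bin/bash
# Sketch: warn if the filesystem holding a folder is low on free space.
set -eu

target_dir="."      # folder whose filesystem we want to check
min_free_gb=50      # hypothetical threshold, pick what suits your run

# -P prints one line per filesystem; -k reports sizes in 1 KB blocks.
# Column 4 of the second line is the available space.
avail_kb=$(df -Pk "$target_dir" | awk 'NR==2 {print $4}')
avail_gb=$((avail_kb / 1024 / 1024))

echo "Available space: ${avail_gb} GB"
if [ "$avail_gb" -lt "$min_free_gb" ]; then
    echo "WARNING: less than ${min_free_gb} GB free, consider cleaning up first"
fi
```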
Instructions to align short reads with tophat2
INSTRUCTION SLIDE 2
[jmtroy2@stubbslab ~]$ cd /home/share/example_rna_seq_project_16Gso/
[jmtroy2@stubbslab example_rna_seq_project_16Gso]$ ls -1
code_010_tophat2
code_020_alignment_summary_report
code_030_MDS_plots
code_040_create_track_hub_bigwigs
code_050_cpm_means_report
code_060_differential_expression_w_edgeR
fastq_files
output_010_tophat2_RUN_20161121_092530
[jmtroy2@stubbslab example_rna_seq_project_16Gso]$ cd code_010_tophat2/
[jmtroy2@stubbslab code_010_tophat2]$ ls
main_script_tophat_16Gso.sh
[jmtroy2@stubbslab code_010_tophat2]$ sh main_script_tophat_16Gso.sh
Start of Tophat
…
NOW HOLD DOWN THE CONTROL KEY AND PRESS a, THEN PRESS d, TO DETACH FROM THE SCREEN SESSION
Instructions to align short reads with tophat2
DEMONSTRATION SLIDE 3
[jmtroy2@stubbslab ~]$ screen -ls
There is a screen on:
11559.pts-2.stubbslab (Detached)
1 Socket in /var/run/screen/S-jmtroy2.
[jmtroy2@stubbslab ~]$ screen -r 11559
[end of tophat]
[jmtroy2@stubbslab code_010_tophat2]$ exit
[end of tophat]
[jmtroy2@stubbslab code_010_tophat2]$ screen -ls
No Sockets found in /var/run/screen/S-jmtroy2.
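The detach and reattach cycle shown in the slides can also be driven without the keyboard shortcut. The sketch below is a variation on the slides, not the same procedure: it gives the session a name (tophat_run, a hypothetical choice) instead of relying on the numeric id, and it runs sleep as a stand-in for the real script.

```shell
#!/bin/bash
# Sketch: start a named screen session already detached, list it, end it.
session="tophat_run"      # hypothetical session name

if command -v screen >/dev/null 2>&1; then
    # -dmS starts a named session in detached mode running the given command.
    screen -dmS "$session" sh -c 'sleep 30'    # sleep stands in for the real job
    screen -ls || true                         # list sessions
    # To reattach interactively:  screen -r tophat_run
    screen -S "$session" -X quit || true       # end the stand-in session
else
    echo "screen is not installed here"
fi
```

A named session is easier to find again with screen -r when several sessions are running on the same server.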
Review tophat2 output in Cyberduck
align_summary.txt
NOTE: The “Mapped” rate of 99.9% is unusually high because of the way the example
fastq files were created for the training exercise: they contain only reads that had
already been mapped to chromosome 5.
/home/share/example_rna_seq_project_16Gso/code_010_tophat2/main_script_tophat_16Gso.sh (1 of 2)
/home/share/example_rna_seq_project_16Gso/code_010_tophat2/main_script_tophat_16Gso.sh (2 of 2)
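The contents of main_script_tophat_16Gso.sh are not reproduced in this transcript. As a rough idea of how a bash script can glue commands together for this step, here is a hypothetical sketch that is not the lab's script: the index prefix, the GTF path, the thread count and the per-sample output layout are all placeholders. It only prints the tophat2 commands (a dry run), so it is safe to run without TopHat installed.

```shell
#!/bin/bash
# Hypothetical sketch only: NOT the contents of main_script_tophat_16Gso.sh.
set -eu

fastq_dir="fastq_files"                        # assumed input folder
bowtie_index="/path/to/bowtie2_index/genome"   # placeholder index prefix
gtf_file="/path/to/genes.gtf"                  # placeholder annotation file
threads=4                                      # assumed thread count

# Timestamped output folder, in the style of
# output_010_tophat2_RUN_20161121_092530 seen in the project listing.
run_dir="output_010_tophat2_RUN_$(date +%Y%m%d_%H%M%S)"

echo "Start of Tophat"
for fq in "$fastq_dir"/*.fastq; do
    [ -e "$fq" ] || continue                   # skip if no .fastq files exist
    sample=$(basename "$fq" .fastq)
    # DRY RUN: echo prints the command instead of running it.
    # -p = threads, -G = gene annotation, -o = output folder.
    echo tophat2 -p "$threads" -G "$gtf_file" -o "$run_dir/$sample" \
        "$bowtie_index" "$fq"
done
echo "End of Tophat"
```

Removing the echo in front of tophat2 would turn the dry run into a real run, at which point each output folder would receive its own accepted_hits.bam and align_summary.txt.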