M07 Presentation: Sequence analysis File
Download
Report
Transcript M07 Presentation: Sequence analysis File
High-throughput sequencing analyses
Eric Franzosa
Slides courtesy of:
Oliver Hofmann @ HSPH Bioinformatics Core
Michelle Giglio @ UMD IGS
Istvan Albert @ PSU Bioinformatics
Harvard School of Public Health
Department of Biostatistics
03-07-16
Biological samples
Sequence reads
Quality control
Mapping
Assembly
Metagenomes
Contact maps
Peak calling
Variant
Detection
Annotation
...
What do you do with your sequences, and how?
2
Quality Control
• A typical pipeline:
1. (Optional) duplicate removal
2. Frequency checks
a. Reads: primers
b. K-mers: barcodes
3. Quality scores and trimming
4. Length filtering
3
FastQC + FASTX
4
K-mer spectra for QC
https://banana-slug.soe.ucsc.edu/bioinformatic_tools:quake
5
Assembly
Iverson et al. Science 3 February 2012: Vol. 335 no. 6068 pp. 587-590
6
De novo assembly
New methods are still actively developed
Short reads require long overlaps
•
•
•
•
•
e.g., 30-100nt reads must overlap by 20+nt
end-trimming helps to remove low quality bases.
Most de novo short read assemblers use a kmer hashing based approach.
•
•
Known as “de Bruijn graph” assembly
Describes which short words follow which other
short words
TAGTCGAGGCTTTAGATCCGATGAGGCTTTAGAGACAG
Sequence
AGTCGAG CTTTAGA CGATGAG
CTTTAGA
GTCGGG TTAGATC ATGAGGC GAGACAG
GAGGCTC
ATCCGAT AGGCTTT GAGACAG
AGTCGAG TAGATCC ATGAGGC TAGAGAA
TAGTCGA CTTTAGA CCGATGA TTAGAGA
CGAGGCT AGATCCG TGAGGCT AGAGACA
TAGTCGA GCTTTAG TCCGATG
GCTCTAG
TCGACGC GATCCGA GAGGCTT AGAGACA
TAGTCGA TTAGATC GATGAGG
TTTAGAG
GTCGAGG TCTAGAT
ATGAGGC
TAGAGAC
AGGCTTT ATCCGAT AGGCTTT
GAGACAG
AGTCGAG
TTAGATT
ATGAGGC AGAGACA
GGCTTTA
TCCGATG
TTTAGAG
CGAGGCT TAGATCC TGAGGCT
GAGACAG
AGTCGAG TTTAGATC ATGAGGC
TTAGAGA
GAGGCTT GATCCGA GAGGCTT
GAGACAG
Hashing (k = 4)
Graph Building
GATT
(1x)
TAGT
(3x)
AGTC
(7x)
GTCG
(9x)
TCGA CGAG
(10x) (8x)
CGAC
(1x)
ATGA
(8x)
GATG
(5x)
CGAT
(6x)
CCGA
(7x)
TCCG
(7x)
GACG
(1x)
GATC
(8x)
AGAT
(8x)
CTTC
(1x)
TTCA
(2x)
TCAG
(2x)
CAGA
(1x)
{
GAGG AGGC GGCT GCTT
(16x) (16x) (16x) (11x)
ATCC
(7x)
CTTT
(8x)
TTTA
(8x)
{
TGAG
(9x)
AGAA
(1x)
AGAG
(9x)
GAGA AGAC
(12x) (9x)
GACA
(8x)
ACAG
(5x)
TTAG TAGA
(12x) (16x)
ACGC
(1x)
TAGTCGAGGCTTTAGATCCGATGAGGCTTTAGAGACAG
GATT
(1x)
ATGA
(8x)
AGTC
(7x)
GTCG
(9x)
TCGA CGAG
(10x) (8x)
CGAT
(6x)
CCGA
(7x)
TCCG
(7x)
GACG
(1x)
AGAT
(8x)
TTCA
(2x)
TCAG
(2x)
CAGA
(1x)
GAGG AGGC GGCT GCTT
(16x) (16x) (16x) (11x)
CGAC
(1x)
GATC
(8x)
CTTC
(1x)
{
ATCC
(7x)
CTTT
(8x)
TTTA
(8x)
AGAA
(1x)
AGAG
(9x)
GAGA AGAC
(12x) (9x)
GACA
(8x)
ACAG
(5x)
TTAG TAGA
(12x) (16x)
ACGC
(1x)
Simplification of Linear Stretches
GATT
GATCCGATGAG
TAGTCGA
CGAG
CGACGC
AGAT
{
GCTCTAG
AGAA
{
TAGT
(3x)
GATG
(5x)
{
TGAG
(9x)
TAGA
GAGGCT
GCTTTAG
AGAGA
AGACAG
GATT
GATCCGATGAG
AGAT
Tips
TAGTCGA
CGAG
CGACGC
{
GCTCTAG
AGAA
{
TAGA
GAGGCT
AGAGA
AGACAG
GCTTTAG
Bubble
Error (tip and bubble) removal
AGATCCGATGAG
TAGTCGAG
GAGGCTTTAGA
AGAGACAG
TAGTCGAGGCTTTAGATCCGATGAGGCTTTAGAGACAG
12
De novo Assembler performance: the
Assemblathon
• From Genome10K/UCSD
– Roughly annual, latest 21 participants, 3 genomes
Assembly size
for bird
genome
(budgie)
NG50: if all scaffold lengths are
summed from longest to the
shortest, this is the length at
which the sum length accounts
for 50% of total sequence.
“NG50” calculated in genome size,
“N50” in assembly size.
Scaffold size
Higher is better (most sequence in long
scaffolds), but it does not account for error!
% scaffolds
De novo Assembler performance: the
Assemblathon
Assembly
metrics for
bird genome
(budgie)
Second level of QA
Mismatched paired end reads
15
Annotation
Annotation must be grounded with
supporting evidence"
•! The process of functional annotation involves
assessing available evidence and reaching a
conclusion about what you think the protein is doing
in the cell and why.!
•! Functional annotations should only be as specific as
the supporting evidence allows!
•! All evidence that led to the annotation conclusions
that were made must be stored.!
•! In addition, detailed documentation of
methodologies and general rules or guidelines used
in any annotation process should be provided. !
arch
o Make
al
ons"
I conclude that "
you are a cat."
41!
Why?"
-!You look like
other cats I know"
-!I heard you
meow and purr"
I conclude that "
you code for a "
protein kinase."
Why?"
-!You look like other
protein kinases I know"
-!You have been
42!
observed to add
phosphate to proteins"
16
–! DNA microarray analysis (2 tools) !
–! Small molecules (2 tools)!
37!
Annotation
Gene Model Curation: Start site edits"
38!
Gene Model Curation: !
Overlaps and InterEvidence"
What to consider:!
- Start site frequency: ATG >> GTG >> TTG!
- Ribosome Binding Site (RBS): a string of AG rich sequence
located 5-11 bp upstream of the start codon!
- Similarity to match proteins, in BER and multiple alignments !
(Remember to note, that the DNA sequence reads down in columns
for each codon.)!
-In the example below (showing just the beginning of one BER
alignment), homology starts exactly at the first atg (the current
chosen start, aa #1), there is a very favorable RBS beginning 9bp
upstream of this atg (gagggaga). There is no reason to consider the
ttg, and no justification for moving to the second atg (this would cut
off some similarity and it does not have an RBS.) !
Overlap analysis !
When two ORFs overlap (boxed areas), the one without similarity to
anything (another protein, an HMM, etc.) is removed. If both don’ t
match anything, other considerations such as presence in a putative
operon and potential start codon quality are considered. Small regions
of overlap are allowed (circle). !
InterEvidence regions!
Areas of the genome with no genes and areas within genes without any
kind of evidence (no match to another protein, HMM, etc., such regions
may include an entire gene in case of “hypothetical proteins”) are
translated in all 6 frames and searched against a non-redundant protein
database.!
3 possible"
start sites"
RBS upstream of
chosen start"
39!
This ORF’s upstream
boundary"
40!
17
Why?"
-!You look like
other cats I know"
-!I heard you
meow and purr"
Annotation
41!
Evaluate the Evidence!
-!You look like other
protein kinases I know"
-!You have been
42!
observed to add
phosphate to proteins"
The Pitfalls of Transitive Annotation!
•! Visually inspect alignments !
–! they should be full length!
•! or partial matches to identified domains !
–! at least 40% identity!
–! look for conservation of protein features or catalytic
residues!
–! avoid the pitfalls of transitive annotation (see next slide) !
•! Check HMM scores!
–! need to be above trusted to be considered part of the
family modeled by the HMM!
–! how specific is the HMM?!
–! what annotation on the HMM is appropriate for the query?!
•! Look at any available metabolic analysis !
•! Transitive Annotation is the process of
passing annotation from one protein (or
gene) to another based on sequence
similarity: !
A"
B"
B"
C"
C"
D"
–! A’s name has now passed to D from A through
several intermediates.!
•! This is fine if A is also similar to D !
•! This is NOT fine if A is NOT similar to D, which
can happen easily and happens often !
–! pathways, complexes?!
•! Check for operon structure or other information
from neighboring genes.!
•! Current public datasets full of such errors !
•! To avoid transitive annotation errors we
require that in a pairwise match, the match
must be experimentally characterized!
•! Be conservative!
–! presence of a gene in an operon can supplement weak
similarity evidence!
•! Are there transmembrane regions? !
•! Is there a signal peptide? "
•! Are there any motifs that might give a clue to
function?"
43!
–! err on the side of not making an annotation,
when possibly you should, rather than making an
annotation when probably you shouldn’t.!
44!
18
Genescript: a canonical annotation pipeline
Hudek et al Bioinformatics (2003) 19 (9): 1177-1178
19
Genescript
Hudek et al Bioinformatics (2003) 19 (9): 1177-1178
20
HMMs for gene calling and annotation
21
Some annotation resources
• End-to-end annotation pipelines
– Genescript
– Manatee
– Apollo
• HMMs
– HMMER
– Pfam/Rfam/TIGRfam
– TMHMM/SignalP/etc.
• ORF callers
– FragGeneScan
– (Meta)GeneMark
22
Alignment
http://genomesavant.com/savant/
23
SAM files
• Sequence Alignment/Map format
• Is a concise file format that contains information about how sequence
reads maps to a reference genome
• Can be further compressed in BAM format, which is a binary format of
SAM.
• Can also be sorted and indexed to provide fast random access, using
SAMtools (more on this in a minute).
• Requires ~1 byte per input base to store sequences, qualities and
meta information.
• Supports paired-end reads and color space.
• Originally produced by bowtie and bwa, now a de facto standard.
• SAM/BAM can also be converted to pileup format for SNP calling.
25
SAM/BAM files
BAM = Binary sAM
100% equivalent, just
smaller and faster.
Never save files as SAMs
unless necessary!
Save as BAM and use
SAMtools to manipulate.
26
SAM Format
coor
ref
r001+
r002+
r003+
r004+
r003r001-
12345678901234 5678901234567890123456789012345
AGCATGTTAGATAA**GATAGCTGTGCTAGTAGGCAGTCAGCGCCAT
TTAGATAAAGGATA*CTG
aaaAGATAA*GGATA
gcctaAGATAA
ATAGCT..............TCAGC
ttagctTAGGC
CAGCGCCAT
@SQ SN:ref LN:45
r001 163 ref 7 30
r002
0 ref 9 30
r003
0 ref 9 30
r004
0 ref 16 30
r003 16 ref 29 30
r001 83 ref 37 30
8M2I4M1D3M
3S6M1P1I4M
5H6M
6M14N5M
6H5M
9M
= 37 39 TTAGATAAAGGATACTG
* 0 0 AAAAGATAAGGATA
* 0 0 AGATAA
* 0 0 ATAGCTTCAGC
* 0 0 TAGGC
* 0 0 CAGCGCCAT
*
*
*
*
*
28
29
Variant calling
• When do you believe differences with
respect to a reference genome?
– More reads = more support
– Errors are more likely at read ends
– Some sequences can be error hotspots
– Sometimes the reference genome’s the variant!
• When do you believe differences within your
own sequences?
– Error hotspots
– Misalignment of near-repetitive regions
30
Variant detection
Tablet Second-Gen Visualizer (http://bioinf.scri.ac.uk/tablet/)
31
Variant detection
Tablet Second-Gen Visualizer (http://bioinf.scri.ac.uk/tablet/)
32
33
Variant Call Format
##format=PCFv1
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=1000GenomesPilot-NCBI36
##phasing=partial
#CHROM POS
ID
REF
ALT
20
14370
rs6054257 G
A
20
13330
.
T
A
20
1110696 rs6040355 A
G,T
20
10237
.
T
.
20
123456 microsat1 G
D4,IGA
QUAL
29
3
67
47
50
FILTER
0
q10
0
0
0
INFO
NS=58;DP=258;AF=0.786;DB;H2
NS=55;DP=202;AF=0.024
NS=55;DP=276;AF=0.421,0.579;AA=T;DB
NS=57;DP=257;AA=T
NS=55;DP=250;AA=G
FORMAT
GT:GQ:DP:HQ
GT:GQ:DP:HQ
GT:GQ:DP:HQ
GT:GQ:DP:HQ
GT:GQ:DP
NA00001
0|0:48:1:51,51
0|0:49:3:58,50
1|2:21:6:23,27
0|0:54:7:56,60
0/1:35:4
NA00002
1|0:48:8:51,51
0|1:3:5:65,3
2|1:2:0:18,2
0|0:48:4:51,51
0/2:17:2
34
35
ChIP-seq
http://www.fejes.ca
36
37
38
39
40
41
RNA-seq
Wang et al. Nat Rev Genet. 2009 January; 10(1): 57–63
42
Transcript discovery
43
RNA-seq without a ref. genome
Contigs → Genes → Isoforms
Grabherr et al. Nature Biotechnology 29, 644–652 (2011
44
Quantification
Sample comparison
RPKM:
Reads Per Kilobase per Million reads
45
Creative uses
Contact maps
TFBS determination
46
47
http://seqanswers.com/wiki/Software
48