M07 Presentation: Sequence analysis File

Download Report

Transcript M07 Presentation: Sequence analysis File

High-throughput sequencing analyses
Curtis Huttenhower
Slides courtesy of:
Oliver Hofmann @ HSPH Bioinformatics Core
Michelle Giglio @ UMD IGS
Istvan Albert @ PSU Bioinformatics
Harvard School of Public Health
Department of Biostatistics
03-04-12
Biological samples
Sequence reads
Quality control
Mapping
Assembly
Metagenomes
Contact maps
Peak calling
Variant
Detection
Annotation
...
2
3
4
Quality Control
• A typical pipeline:
1. Duplicate removal
2. Frequency checks
a. Reads: primers
b. K-mers: barcodes
3. Quality scores and trimming
4. Length filtering
5
What’s wrong with this picture?
PCR
duplicates:
Bias during
emulsion PCR
Optical
duplicates:
One cluster
detected >once
Bainbridge et al. Genome Biology 2010 11:R62
6
Read Frequency
Distribution
QA: filtering
7
Read Frequency
Distribution
VecBase
Screen
> gnl|uv|NGB00105.1:1-219 pCR4-TOPO multiple cloning site
Length=219
Score = 100 bits (50), Expect = 9e-19
Identities = 50/50 (100%), Gaps = 0/50 (0%)
Strand=Plus/Plus
Query
50
1
ATTAACCCTCACTAAAGGGACTAGTCCTGCAGGTTTAAACGAATTCGCCC
Sbjct
92
43
||||||||||||||||||||||||||||||||||||||||||||||||||
ATTAACCCTCACTAAAGGGACTAGTCCTGCAGGTTTAAACGAATTCGCCC
QA: filtering
8
Read Frequency
Distribution
VecBase
Screen
> gnl|uv|NGB00105.1:1-219 pCR4-TOPO multiple cloning site
Length=219
Score = 100 bits (50), Expect = 9e-19
Identities = 50/50 (100%), Gaps = 0/50 (0%)
Strand=Plus/Plus
Query
50
1
ATTAACCCTCACTAAAGGGACTAGTCCTGCAGGTTTAAACGAATTCGCCC
Sbjct
92
43
||||||||||||||||||||||||||||||||||||||||||||||||||
ATTAACCCTCACTAAAGGGACTAGTCCTGCAGGTTTAAACGAATTCGCCC
QA: filtering
9
K-mer spectra for QC
https://banana-slug.soe.ucsc.edu/bioinformatic_tools:quake
10
11
De-phasing
Crosstalk
Degradation
Error profiles
12
13
FastQC
14
FASTX-Toolkit
15
16
17
18
19
Alignment
http://genomesavant.com/savant/
20
21
SAM files
• Sequence Alignment/Map format
• Is a concise file format that contains information about how sequence
reads maps to a reference genome
• Can be further compressed in BAM format, which is a binary format of
SAM.
• Can also be sorted and indexed to provide fast random access, using
SAMtools (more on this in a minute).
• Requires ~1 byte per input base to store sequences, qualities and
meta information.
• Supports paired-end reads and color space.
• Originally produced by bowtie and bwa, now a de facto standard.
• SAM/BAM can also be converted to pileup format for SNP calling.
SAM/BAM files
BAM = Binary sAM
100% equivalent, just
smaller and faster.
Never save files as SAMs
unless necessary!
Save as BAM and use
SAMtools to manipulate.
23
SAM Format
coor
ref
r001+
r002+
r003+
r004+
r003r001-
12345678901234 5678901234567890123456789012345
AGCATGTTAGATAA**GATAGCTGTGCTAGTAGGCAGTCAGCGCCAT
TTAGATAAAGGATA*CTG
aaaAGATAA*GGATA
gcctaAGATAA
ATAGCT..............TCAGC
ttagctTAGGC
CAGCGCCAT
@SQ SN:ref LN:45
r001 163 ref 7 30
r002
0 ref 9 30
r003
0 ref 9 30
r004
0 ref 16 30
r003 16 ref 29 30
r001 83 ref 37 30
8M2I4M1D3M
3S6M1P1I4M
5H6M
6M14N5M
6H5M
9M
= 37 39 TTAGATAAAGGATACTG
* 0 0 AAAAGATAAGGATA
* 0 0 AGATAA
* 0 0 ATAGCTTCAGC
* 0 0 TAGGC
* 0 0 CAGCGCCAT
*
*
*
*
*
25
26
Pileup
Standard format for mapped data, position
summaries
Seq.
seq1
seq1
seq1
seq1
seq1
seq1
seq1
seq1
272
273
274
275
276
277
278
279
T
T
T
A
G
T
G
C
24
23
23
23
22
22
23
23
,.$.....,,.,.,...,,,.,..^+. <<<+;<<<<<<<<<<<=<;<;7<&
,.....,,.,.,...,,,.,..A
<<<;<<<<<<<<<3<=<<<;<<+
,.$....,,.,.,...,,,.,...
7<7;<;<<<<<<<<<=<;<;<<6
,$....,,.,.,...,,,.,...^l. <+;9*<<<<<<<<<=<<:;<<<<
...T,,.,.,...,,,.,....
33;+<<7=7<<7<&<<1;<<6<
....,,.,.,.C.,,,.,..G.
+7<;<<<<<<<&<=<<:;<<&<
....,,.,.,...,,,.,....^k. %38*<<;<7<<7<=<<<;<<<<<
A..T,,.,.,...,,,.,.....
;75&<<<<<<<<<=<<<9<<:<<
Pos. Len.
Ref.
27
Alignment
Quality
Second level of QA
Mismatched paired end reads
28
Search
• Mapping:
– Short reads, no (few) mismatches
– Extremely speedy!
• BLAST
– Any sequences, configurable sensitivity
– Can accurately reach homology twilight zone
– Less speedy
• Accelerated BLASTs
– Any sequences, heuristic sensitivity
– Speedier
29
Accelerated BLASTs
• USEARCH
• map/mapx
• mblast/mblastx
– All up to 1000sx faster
– Configurable options:
• Is there a “good enough” hit, yes/no?
• Just retrieve first N “good enough hits
• Trade sensitivity for specificity
30
31
...
32
Assembly
Iverson et al. Science 3 February 2012: Vol. 335 no. 6068 pp. 587-590
33
34
Assembly
• No such thing as an automated assembly
– Many alternatives:
• Velvet, Newbler, SOAPdenovo, ABySS, ALLPATHS…
– Each has a fistful of tuning parameters
35
36
Annotation
Annotation must be grounded with
supporting evidence"
•! The process of functional annotation involves
assessing available evidence and reaching a
conclusion about what you think the protein is doing
in the cell and why.!
•! Functional annotations should only be as specific as
the supporting evidence allows!
•! All evidence that led to the annotation conclusions
that were made must be stored.!
•! In addition, detailed documentation of
methodologies and general rules or guidelines used
in any annotation process should be provided. !
arch
o Make
al
ons"
I conclude that "
you are a cat."
41!
Why?"
-!You look like
other cats I know"
-!I heard you
meow and purr"
I conclude that "
you code for a "
protein kinase."
Why?"
-!You look like other
protein kinases I know"
-!You have been
42!
observed to add
phosphate to proteins"
37
–! DNA microarray analysis (2 tools) !
–! Small molecules (2 tools)!
37!
Annotation
Gene Model Curation: Start site edits"
38!
Gene Model Curation: !
Overlaps and InterEvidence"
What to consider:!
- Start site frequency: ATG >> GTG >> TTG!
- Ribosome Binding Site (RBS): a string of AG rich sequence
located 5-11 bp upstream of the start codon!
- Similarity to match proteins, in BER and multiple alignments !
(Remember to note, that the DNA sequence reads down in columns
for each codon.)!
-In the example below (showing just the beginning of one BER
alignment), homology starts exactly at the first atg (the current
chosen start, aa #1), there is a very favorable RBS beginning 9bp
upstream of this atg (gagggaga). There is no reason to consider the
ttg, and no justification for moving to the second atg (this would cut
off some similarity and it does not have an RBS.) !
Overlap analysis !
When two ORFs overlap (boxed areas), the one without similarity to
anything (another protein, an HMM, etc.) is removed. If both don’ t
match anything, other considerations such as presence in a putative
operon and potential start codon quality are considered. Small regions
of overlap are allowed (circle). !
InterEvidence regions!
Areas of the genome with no genes and areas within genes without any
kind of evidence (no match to another protein, HMM, etc., such regions
may include an entire gene in case of “hypothetical proteins”) are
translated in all 6 frames and searched against a non-redundant protein
database.!
3 possible"
start sites"
RBS upstream of
chosen start"
39!
This ORF’s upstream
boundary"
40!
38
Why?"
-!You look like
other cats I know"
-!I heard you
meow and purr"
Annotation
41!
Evaluate the Evidence!
-!You look like other
protein kinases I know"
-!You have been
42!
observed to add
phosphate to proteins"
The Pitfalls of Transitive Annotation!
•! Visually inspect alignments !
–! they should be full length!
•! or partial matches to identified domains !
–! at least 40% identity!
–! look for conservation of protein features or catalytic
residues!
–! avoid the pitfalls of transitive annotation (see next slide) !
•! Check HMM scores!
–! need to be above trusted to be considered part of the
family modeled by the HMM!
–! how specific is the HMM?!
–! what annotation on the HMM is appropriate for the query?!
•! Look at any available metabolic analysis !
•! Transitive Annotation is the process of
passing annotation from one protein (or
gene) to another based on sequence
similarity: !
A"
B"
B"
C"
C"
D"
–! A’s name has now passed to D from A through
several intermediates.!
•! This is fine if A is also similar to D !
•! This is NOT fine if A is NOT similar to D, which
can happen easily and happens often !
–! pathways, complexes?!
•! Check for operon structure or other information
from neighboring genes.!
•! Current public datasets full of such errors !
•! To avoid transitive annotation errors we
require that in a pairwise match, the match
must be experimentally characterized!
•! Be conservative!
–! presence of a gene in an operon can supplement weak
similarity evidence!
•! Are there transmembrane regions? !
•! Is there a signal peptide? "
•! Are there any motifs that might give a clue to
function?"
43!
–! err on the side of not making an annotation,
when possibly you should, rather than making an
annotation when probably you shouldn’t.!
44!
39
Genescript: a canonical annotation pipeline
Hudek et al Bioinformatics (2003) 19 (9): 1177-1178
40
Genescript
Hudek et al Bioinformatics (2003) 19 (9): 1177-1178
41
HMMs for gene calling and annotation
42
Center for Biological Sequence Analysis tools
43
Some annotation resources
• End-to-end annotation pipelines
– Genescript
– Manatee
– Apollo
• HMMs
– HMMER
– Pfam/Rfam/TIGRfam
– TMHMM/SignalP/etc.
• ORF callers
– FragGeneScan
– (Meta)GeneMark
44
Variant calling
• When do you believe differences with
respect to a reference genome?
– More reads = more support
– Errors are more likely at read ends
– Some sequences can be error hotspots
– Sometimes the reference genome’s the variant!
• When do you believe differences within your
own sequences?
– Error hotspots
– Misalignment of near-repetitive regions
45
Variant detection
Tablet Second-Gen Visualizer (http://bioinf.scri.ac.uk/tablet/)
46
Variant detection
Tablet Second-Gen Visualizer (http://bioinf.scri.ac.uk/tablet/)
47
48
Variant Call Format
##format=PCFv1
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=1000GenomesPilot-NCBI36
##phasing=partial
#CHROM POS
ID
REF
ALT
20
14370
rs6054257 G
A
20
13330
.
T
A
20
1110696 rs6040355 A
G,T
20
10237
.
T
.
20
123456 microsat1 G
D4,IGA
QUAL
29
3
67
47
50
FILTER
0
q10
0
0
0
INFO
NS=58;DP=258;AF=0.786;DB;H2
NS=55;DP=202;AF=0.024
NS=55;DP=276;AF=0.421,0.579;AA=T;DB
NS=57;DP=257;AA=T
NS=55;DP=250;AA=G
FORMAT
GT:GQ:DP:HQ
GT:GQ:DP:HQ
GT:GQ:DP:HQ
GT:GQ:DP:HQ
GT:GQ:DP
NA00001
0|0:48:1:51,51
0|0:49:3:58,50
1|2:21:6:23,27
0|0:54:7:56,60
0/1:35:4
NA00002
1|0:48:8:51,51
0|1:3:5:65,3
2|1:2:0:18,2
0|0:48:4:51,51
0/2:17:2
49
50
51
52
ChIP-seq
http://www.fejes.ca
53
54
55
56
57
58
RNA-seq
Wang et al. Nat Rev Genet. 2009 January; 10(1): 57–63
59
Transcript discovery
60
RNA-seq without a ref. genome
Contigs → Genes → Isoforms
Grabherr et al. Nature Biotechnology 29, 644–652 (2011
61
Quantification
Sample comparison
RPKM:
Reads Per Kilobase per Million reads
62
Creative uses
Contact maps
TFBS determination
63
64
http://seqanswers.com/wiki/SEQanswers
The extended selection
220 applications and counting
65
The future: making sense of a genome
66