MS PowerPoint - Genome Projects at University of Kentucky
Download
Report
Transcript MS PowerPoint - Genome Projects at University of Kentucky
Mapping NGS sequences to a
reference genome
Why?
• Resequencing studies (DNA)
– Structural variation
– SNP identification
• RNAseq
– Mapping transcripts to a genome sequence
• Genome annotation
• Transcript enumeration
• Identification of splice junctions/variants
Blast is too slow
• Different alignment algorithms are necessary
• Burrows Wheeler Alignment
– sequence database (genome) is transformed to
produce an index
– Individual sequence reads are searched against
this index
• STAR Aligner (Dobin et al. 2012) Bioinformatics
– Uncompressed Suffix trees
BWT of “banana”
Tophat2
• Based on the Bowtie alignment engine
– Bowtie, matching with no gaps
– Tophat2, gapped matches
• Aligns reads to a Burrows Wheeler
transformed index of the genome
• 1st pass non-gapped matches
• 2nd pass splits unmapped reads and
attempts to align the fragments
The STAR Aligner
• Start at the first base of sequence read
• Find Maximal Mappable Prefix (MMP)
• Repeat process using unmapped portion of read
• 50x faster than other aligners
OUTPUTS
• TopHat (Bowtie)
– .bam file (binary alignment/map)
– .sam (sequence alignment/map)
– Single .sam file entry:
I8MVR:53:837 0 17_dna:chromosome
14090858
TAACTACGAATACCTGTCGAT **%-**,00%-*-%---*-*XX:Z:C5T3C2T2CT2C XM:Z:h..H......h.H...x...h
XG:Z:CT
255 21M * 0 0
NM:i:7
XR:Z:CT
.sam fields
.sam flags
0x0001
1
0x0002
2
0x0004
0x0008
0x0010
4
8
16
0x0020
0x0040
0x0080
0x0100
32
64
128
256
the read is paired in sequencing, no matter
whether it is mapped in a pair
the read is mapped in a proper pair
(depends on the protocol, normally inferred
during alignment)
the query sequence itself is unmapped
the mate is unmapped
strand of the query (0 for forward; 1 for
reverse strand)
strand of the mate
the read is the first read in a pair
the read is the second read in a pair
the alignment is not primary (a read having
split hits may have multiple primary
alignment records)
1. 1
2.
2
3.
1+2
4.
0+4
5.
1+4
6.
0+2+4
7.
1+2+4
8.
0+8
9.
1+8
10. 0+2+8
11. 1+2+8
12. 0+4+8
13. 1+4+8
14. 0+2+4+8
15. 1+2+4+8
16. …etc.
CIGAR format
I8MVR:104:144 0 7_dna:chromosome 120102744 255
62M1I14M *
0
0
GGTTTTTTGGAAGAGTAGTTCGCGTTTCATTAATTAGTTATTTTTTAGTTTTTAAATAAAATAAAATTTTAAAAAAA
Quantifying alignments
• How many reads overlap a given interval on a
chromosome (scaffold)?
• How do these regions correspond to known
genes?
– .gtf file
• How many transcripts from my gene of
interest?
• How confident can I be about a variant call?
Annotate regions - GTF files
1
2
3
4
5
6
7
8
Chromosome
_8.1
Cufflinks transcript 90162 90766 1000
+
.
Chromosome
_8.1
Cufflinks
exon
90162 90231 1000
+
.
Chromosome
_8.1
Cufflinks
exon
90314 90766 1000
+
.
Chromosome
_8.1
Cufflinks transcript 90889 91620 1000
.
.
9
gene_id "CUFF.1"; transcript_id "CUFF.1.1"; FPKM
"110.6292802224"; frac "1.000000"; conf_lo "41.668327";
conf_hi "132.581041"; cov "6.415537";
gene_id "CUFF.1"; transcript_id "CUFF.1.1"; exon_number
"1"; FPKM "110.6292802224"; frac "1.000000"; conf_lo
"41.668327"; conf_hi "132.581041"; cov "6.415537";
gene_id "CUFF.1"; transcript_id "CUFF.1.1"; exon_number
"2"; FPKM "110.6292802224"; frac "1.000000"; conf_lo
"41.668327"; conf_hi "132.581041"; cov "6.415537";
gene_id "CUFF.2"; transcript_id "CUFF.2.1"; FPKM
"49.8117204717"; frac "1.000000"; conf_lo "21.651798";
conf_hi "73.074820"; cov "2.193724";
GTF fields
1.
2.
3.
4.
5.
Sequence ID
Source
Feature
Start
End
6.
7.
8.
9.
Score
Strand
Frame
Attribute
Variant Calling
• .bam/.sam file contains all of the information
required to call variants
• Variant calls can’t be extracted from the .bam file
• Must provide the genome sequence
I8MVR:53:837 0 17_dna:chromosome
14090858 255 21M * 0
0 TAACTACGAATACCTGTCGAT
**%-**,00%-*-%---*-*NM:i:7 XX:Z:C5T3C2T2CT2C XM:Z:h..H......h.H...x...h
XR:Z:CT XG:Z:CT
Today’s exercises
Variant Analysis
• Extract variant information from provided
.bam file
• Examine output file and learn about the
information contained in the various fields
Introducing… Dr. Eric Rouchka
• Bioinformatics Core Director
• Department of Computer Engineering and
Computer Science
• University of Louisville
• Kentucky Biomedical Research Infrastructure
Network