M06: Genome sequences supplementary material File
What is sequencing?
Video:
https://www.youtube.com/watch?v=womKfikWlxM (Illumina video)
Sequences: what's out there?
NCBI: total nucleotide repository, "reference" sequences (more later), tools,
features, annotations, etc.
https://www.youtube.com/watch?v=Phxkg5H5Q6E
Sequences: what's out there?
EBI: EU counterpart, functionally equivalent (tends to have a bit less data,
a bit better tools)
Sequences: what's out there?
DNA:
Complete genome sequences
Draft genome sequences: lower
accuracy and only partially assembled; useful, but
their annotation often improves dramatically later.
Miscellany: GenBank accepts any
identified sequence.
Eric Altermann et al. PNAS 2005;102:3906-3912
Sequences: what's out there?
Sequence Read Archive
Raw short reads
European Nucleotide Archive (ENA)
Sequences: what's out there?
There are recent plans to join SRA and ENA under the INSDC,
standardizing deposition protocols, storage formats, access patterns,
etc.
Sequences: what's out there?
RNA:
cDNA/ESTs:
Not used so much anymore: single-pass, high quality
sequences from reverse-transcribed (RT) mRNAs
Can be used to catalog portions of genomes that are
actively transcribed.
Great for organisms without high quality sequenced
genomes or annotations
ESTs are often 300-800 bp
Early efforts resulted in the identification of many
hundreds of genes novel at the time.
dbEST is a division of GenBank
Sequences: what's out there?
RNA:
RNA-seq
US
EU
Underutilized
Sequences: what's out there?
Amino acids:
Won't discuss today, but amino acid sequences are typically handled very differently and in
different databases
Features: annotations, from location to function. Loci are referred to as
"features", which can be anything: Genes, introns/exons, polymorphisms,
regulatory elements, conserved regions, islands, etc.
Sequences: what's out there?
Alignments
Pairwise alignment is the process of lining
up two sequences to achieve maximal
levels of identity
Fig 3.5 Pevsner. Pairwise alignment of human beta globin (query) and myoglobin (subject)
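To make "lining up two sequences" concrete, here is a minimal sketch of a global pairwise alignment (Needleman-Wunsch style dynamic programming). It is not the method BLAST uses and not necessarily what produced the figure above. The function names and the simple match/mismatch/gap scores are assumptions chosen for illustration; real protein alignments would use a scoring matrix such as BLOSUM62.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align strings a and b; return (aligned_a, aligned_b, score).

    The scoring values are illustrative assumptions, not recommended defaults.
    """
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i, j):
        # Score for pairing a[i-1] with b[j-1]
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # pair the two residues
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one best alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom)), score[n][m]


def fraction_identity(aligned_a, aligned_b):
    """Fraction of alignment columns in which both sequences carry the same letter."""
    same = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-")
    return same / len(aligned_a)


if __name__ == "__main__":
    top, bottom, s = global_align("ACGT", "AGT")
    print(top)
    print(bottom)
    print("score:", s, "identity:", fraction_identity(top, bottom))
```

The toy inputs here are short DNA strings; the same code runs on protein strings, but identity-only scoring ignores the fact that some amino acid substitutions are more conservative than others.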
Basic Local Alignment Search Tool
(BLAST)
BLAST is an algorithm that lets the user
select one sequence (the query), perform
pairwise sequence alignment between the
query and an entire database of
sequences, and identify the database
sequences that resemble it.
Basic Local Alignment Search Tool
(BLAST)
We can assess the relatedness of any two
proteins by performing a pairwise alignment using
the NCBI pairwise BLAST tool.
Perform the following steps:
1. Choose the protein BLAST program and
select “BLAST 2 sequences” for our comparison of
two proteins. An alternative is to select blastn (for
“BLAST nucleotides”) for DNA–DNA comparison.
2. Enter the sequences or their accession
numbers. Here we use the sequence of human
beta globin in FASTA format, and for myoglobin
we use the accession number (Fig. 3.4).
3. Select any optional parameters:
Scoring matrices: BLOSUM#, PAM#
Gap extension penalty
Change reward and penalty values
4. Click BLAST. Output includes a pairwise
alignment using the single letter amino acid
code.
NOTE: Similar pairs of residues are
structurally or functionally related. That
means they may look different but they are
related because they share similar
biochemical properties.
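Step 2 above takes sequences in FASTA format: a header line starting with ">" followed by one or more lines of sequence. As a rough sketch of how such input can be read in Python, the parser below uses only the standard library. The function name, record IDs and sequences are made up for illustration and are not real accessions; for real work a maintained parser is the safer choice.

```python
def parse_fasta(text):
    """Return a dict mapping each record ID to its full sequence string."""
    records = {}
    current_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Use the first whitespace-delimited token after '>' as the ID.
            header = line[1:].split()
            current_id = header[0] if header else ""
            records[current_id] = []
        elif current_id is not None:
            # Sequence lines may be wrapped; collect them and join later.
            records[current_id].append(line)
    return {rid: "".join(chunks) for rid, chunks in records.items()}


# Hypothetical toy records (not real database entries).
EXAMPLE = """>toy_protein_1 made-up record for illustration
MKVLAAGIT
GHWQERP
>toy_protein_2 another made-up record
MSTNDLL
"""

if __name__ == "__main__":
    for record_id, sequence in parse_fasta(EXAMPLE).items():
        print(record_id, len(sequence), sequence)
```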
BLAST algorithm
1) Make a k-letter word list of the query
sequence, for example with k = 3.
2) List the possible matching words. BLAST
cares only about the high-scoring words. We
use a scoring matrix to compare each word in
the list from 1) with all the 3-letter words. For
example, if we have PQG, different scores
are obtained when it is compared to PEG and
PQA. Only keep words that surpass a
threshold T.
3) Organize the remaining high-scoring
words into an efficient search tree.
4) Repeat steps 2-3) for each k-letter word
in the query sequence.
5) Scan the database sequences for exact matches with the remaining high-scoring
words. These matches seed the high-scoring segment pairs (HSPs).
6) List all the HSPs in the database whose score is high enough to be considered.
7) Evaluate the significance of the HSP score (E-value).
8) Report every match whose expect score is lower than a threshold parameter E.
E-value: the number of times that an unrelated database sequence would obtain a score S
higher than x by chance.
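The word-based steps above can be sketched in a few lines of Python. This is a teaching sketch under stated assumptions, not NCBI's implementation: the function names are invented, the scoring table is a small hand-copied subset of BLOSUM62 covering only the PQG / PEG / PQA example, and the candidate words are passed in explicitly instead of being enumerated over all 3-letter words. The expect value uses the standard Karlin-Altschul form E = K * m * n * exp(-lambda * S); K and lambda depend on the scoring system, so they are left as required arguments.

```python
import math

# A few BLOSUM62 entries, just enough for the PQG / PEG / PQA example.
BLOSUM62_SUBSET = {
    ("P", "P"): 7, ("Q", "Q"): 5, ("G", "G"): 6,
    ("E", "E"): 5, ("A", "A"): 4,
    ("Q", "E"): 2, ("G", "A"): 0,
}


def kmer_words(seq, k=3):
    """Step 1: every overlapping k-letter word of the query."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def pair_score(a, b, matrix=BLOSUM62_SUBSET):
    """Symmetric lookup; raises KeyError for pairs missing from the subset."""
    return matrix[(a, b)] if (a, b) in matrix else matrix[(b, a)]


def word_score(w1, w2, matrix=BLOSUM62_SUBSET):
    """Sum of per-position substitution scores for two equal-length words."""
    return sum(pair_score(a, b, matrix) for a, b in zip(w1, w2))


def high_scoring_words(query_word, candidates, threshold):
    """Step 2: keep candidates scoring at least `threshold` (T) against the query word."""
    return [w for w in candidates if word_score(query_word, w) >= threshold]


def find_seeds(words, subject):
    """Step 5: exact matches of each word in a database sequence, as (word, position)."""
    hits = []
    for w in words:
        start = subject.find(w)
        while start != -1:
            hits.append((w, start))
            start = subject.find(w, start + 1)
    return hits


def expect_value(score, query_len, db_len, K, lam):
    """Step 7: E = K * m * n * exp(-lambda * S) for a raw score S."""
    return K * query_len * db_len * math.exp(-lam * score)


if __name__ == "__main__":
    print(kmer_words("PQGEF"))            # words of the query
    print(word_score("PQG", "PEG"))       # 7 + 2 + 6 = 15
    print(word_score("PQG", "PQA"))       # 7 + 5 + 0 = 12
    kept = high_scoring_words("PQG", ["PQG", "PEG", "PQA"], threshold=13)
    print(kept)                           # PQA falls below T = 13
    print(find_seeds(kept, "AAPEGKKPQG")) # toy database sequence
```

The real algorithm also extends each seed in both directions to build the HSP and uses an efficient search structure (step 3) instead of repeated string searches; both are omitted here to keep the sketch short.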
BLAST
PAM#
BLOSUM#
BLOSUM62 scoring matrix of Henikoff and Henikoff (1992).
Sequences: what's out there?
A note on gene IDs
How do you put a standard identifier on _anything_ in genomics?
Partially overlapping ID systems:
GenBank, RefSeq, EMBL-Bank, EMBL, UniGene, UniRef,
HomoloGene, KO, every array platform, Entrez, HGNC, KEGG, UCSC, every
model organism DB...
And these just cover genes!
Different competing systems for proteins, functions, diseases, physiology,
you name it
A large part of the reason we learn Python is so that we can automate things
like gene ID conversion
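As a small illustration of what automating gene ID conversion can look like, the sketch below loads a tab-separated mapping table and translates a list of gene symbols. The table layout, column names and function names are assumptions made for this example; in practice the mapping would be downloaded from one of the databases above, and the two rows shown should be checked against the source databases before reuse.

```python
import csv
import io

# Illustrative mapping table (assumed layout: one row per gene, tab-separated).
MAPPING_TSV = """symbol\tentrez_id\tensembl_id
HBB\t3043\tENSG00000244734
MB\t4151\tENSG00000198125
"""


def load_mapping(handle, source, target):
    """Build a {source_id: target_id} dict from a tab-separated file handle."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row[source]: row[target] for row in reader}


def convert_ids(ids, mapping):
    """Return (original_id, converted_id) pairs; unmapped IDs convert to None."""
    return [(identifier, mapping.get(identifier)) for identifier in ids]


if __name__ == "__main__":
    symbol_to_ensembl = load_mapping(io.StringIO(MAPPING_TSV), "symbol", "ensembl_id")
    for original, converted in convert_ids(["HBB", "MB", "NOT_A_GENE"], symbol_to_ensembl):
        print(original, "->", converted)
```

Returning None for unmapped IDs, instead of silently dropping them, makes it easy to see how many identifiers failed to convert, which is a common problem when ID systems only partially overlap.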
What can you get where?
http://www.ncbi.nlm.nih.gov/genbank/
Tutorial:
https://www.youtube.com/watch?v=g5a__okj5Zs
http://www.ncbi.nlm.nih.gov/refseq/
What can you get where?
http://www.ensembl.org/index.html
Tutorial:
http://www.ensembl.org/Multi/Help/Movie?db=core;id=188
What can you get where?
UCSC Genome Browser
https://genome.ucsc.edu
http://jbrowse.org