Bioinformatics Unit 1: Data Bases and Alignments

Bioinformatics
Unit 1: Data Bases and
Alignments
Lecture 2:
“Homology” Searches and
Sequence Alignments
Overview of Lecture
• Introduction: When, what and how to
search for “homologous” sequences
• Terminology
• Nucleotide database searches
– BLAST programs
– FASTA programs
– Others
• Protein database searches
Introduction
• When do I search?
• What do I search for (which database)?
• How do I search (which program)?
• What do the search results mean?
• Answer: Database searches (hopefully)
identify biologically relevant sequence
alignments
Sequence Alignments
• Sequence alignments allow comparison of new
sequences to either one, a group of, or all known
sequences
• A well-designed alignment can allow one to infer:
– gene or protein function
– evolutionary relationships among genes, proteins or
species
– structure of proteins or nucleic acids
• Process is highly dependent on choice of query
and parameters of alignment
Terminology Associated with
Searches and Alignments
• Query: The input sequence (or other type of search
term) with which all of the entries in a database
are to be compared.
– Examples: Your unknown DNA sequence, a word, an
accession number, etc.
• Algorithm: A fixed procedure embodied in a
computer program
– Examples: Alignment programs like BLAST, FASTA,
BLITZ, etc.
Terminology Associated with
Searches and Alignments (cont.)
• Homology: Similarity attributed to descent from a common
ancestor (often misused).
• Identity: The extent to which two (nucleotide or amino acid)
sequences are invariant. Often expressed as a percentage.
• Similarity: The extent to which nucleotide or protein sequences
are related. The extent of similarity between two sequences can be
based on percent sequence identity (nucleotides) and/or
conservation (proteins, e.g., a lysine substituted for an arginine).
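The difference between identity and similarity can be made concrete with a small calculation. The following is a minimal sketch, assuming two sequences that are already aligned (same length, "-" for gaps). The function names and the `CONSERVATIVE_GROUPS` table are hypothetical and purely illustrative; real programs score conservation with a substitution matrix, and different tools use different denominators for the percentage.

```python
# Illustrative grouping of chemically similar amino acids (an assumption,
# not a published substitution matrix).
CONSERVATIVE_GROUPS = [set("KRH"), set("DE"), set("ILVM"),
                       set("FYW"), set("ST"), set("NQ")]


def _ungapped_columns(a, b):
    """Return the aligned residue pairs in which neither sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    return [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]


def percent_identity(a, b):
    """Percentage of ungapped columns holding identical residues."""
    columns = _ungapped_columns(a, b)
    if not columns:
        return 0.0
    return 100.0 * sum(x == y for x, y in columns) / len(columns)


def percent_similarity(a, b):
    """Like percent_identity, but conservative substitutions also count."""
    columns = _ungapped_columns(a, b)
    if not columns:
        return 0.0

    def similar(x, y):
        return x == y or any(x in g and y in g for g in CONSERVATIVE_GROUPS)

    return 100.0 * sum(similar(x, y) for x, y in columns) / len(columns)


# K->R and L->I are conservative changes: identity drops, similarity does not.
print(round(percent_identity("MKTAYL", "MRTAYI"), 1))    # 66.7
print(round(percent_similarity("MKTAYL", "MRTAYI"), 1))  # 100.0
```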
Terminology Associated with
Searches and Alignments (cont.)
• Gap: A space introduced into an alignment to
compensate for insertions and deletions in one
sequence relative to another.
– Example: Aligning a cDNA sequence with a gene
requires gaps at the position of introns
• Substitution (Scoring) matrices: Speed vs.
sensitivity
– allow a query sequence to be aligned with sequences in
the database very rapidly. The most significant matches
(successful alignments) are reported. Less complex,
faster matrices sacrifice a certain degree of match
significance (i.e. you need a better match for it to be
recognized than if you use a slower, more complex
matrix). The matrix, together with the choice of
program, essentially determines the search sensitivity and
speed.
Terminology Associated with
Searches and Alignments (cont.)
• Filters: usually part of an alignment algorithm and
are turned on by default.
– The filter masks (hides) regions of the query sequence
(your sequence) that have low compositional
complexity (like poly A tails). Masking is achieved by
replacing the sequence with a string of N's
(NNNNNN), the code for any DNA base.
– Poly-A tails, for example, can give rise to artificially
high scores and therefore misleading results. This is due
to the large numbers of such sequences distributed
throughout the genome, and therefore throughout the
database.
– Similarly, new programs exist to filter out vector
sequences.
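The masking step described above can be sketched in a few lines. This is a deliberately simplified illustration that only masks long single-base runs such as poly-A tails; it is not the filter BLAST actually uses (BLAST applies dedicated low-complexity programs such as DUST for nucleotides and SEG for proteins). The function name `mask_homopolymers` and the `min_run` cutoff are assumptions chosen for the example.

```python
import re


def mask_homopolymers(seq, min_run=10):
    """Replace every run of at least min_run identical bases with N's.

    Only upper-case A, C, G and T are considered; everything else is
    left untouched.
    """
    # ([ACGT]) captures one base, \1{k,} requires k or more repeats of it.
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_run - 1))
    return pattern.sub(lambda m: "N" * len(m.group(0)), seq)


query = "GCTCTTTTAATATTAGATAA" + "A" * 15
print(mask_homopolymers(query))
```

The masked region keeps its original length, so alignment coordinates reported against the query still refer to the same positions.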
Nucleotide Database Searching
• Commonly used search algorithms:
– BLAST (at NCBI)
– FASTA (in France)
– BLITZ (at EBI in EMBL)
– SSEARCH (in France)
– PSI-BLAST (at NCBI)
Basic Local Alignment Search
Tool (BLAST)
• A set of similarity search tools
• Fast and sensitive
• “real” matches fairly easily distinguished from
random matches by scoring
• Seeks local rather than global alignment
• Can detect relationships between sequences that
share only regions of similarity
– GREAT as proteins are “modular”
Algorithms Within BLAST
• Blastn compares nucleotide query
sequence against nucleotide
sequence database
• Blastp compares amino acid query
sequence against protein sequence
database
• Blastx compares nucleotide query
translated in all reading frames
against a protein sequence
database
Algorithms Within BLAST (cont.)
• Tblastn compares amino acid query
sequence against nucleotide sequence
database dynamically translated in all
reading frames
• Tblastx compares the six-frame translation of a
nucleotide query sequence against a
nucleotide sequence database
dynamically translated in all
reading frames (COMPUTATIONALLY INTENSE!!)
• Choose the correct algorithm!!!
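"Translated in all reading frames" means three frames on the given strand plus three on the reverse complement. The sketch below shows what that involves, assuming the standard genetic code and an upper-case-convertible DNA sequence; the helper names are hypothetical and this is not how BLAST itself is implemented. It also hints at why tblastx is expensive: both the query and every database sequence are expanded six ways.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons from position 0; unknown codons become X."""
    codons = (seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translations(seq):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to protein strings."""
    seq = seq.upper()
    frames = {}
    for strand, template in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(template[offset:])
    return frames


for label, protein in six_frame_translations("ATGGCC").items():
    print(label, protein)
```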
A Sample BLAST Search
•
AAAAGAAAAGGTTAGAAAGATGAGAGATGATAAAGGGTCCATTTGAGGTTAGGTAA
TATGGTTTGGTATCCCTGTAGTTAAAAGTTTTTGTCTTATTTTAGAATACTGTGAT
CTATTTCTTTAGTATTAATTTTTCCTTCTGTTTTCCTCATCTAGGGAACCCCAAGA
GCATCCAATAGAAGCTGTGCAATTATGTAAAATTTTCAACTGTCTTCCTCAAAATA
AAGAAGTATGGTAATCTTTACCTGTATACAGTGCAGAGCCTTCTCAGAAGCACAGA
ATATTTTTATATTTCCTTTATGTGAATTTTTAAGCTGCAAATCTGATGGCCTTAAT
TTCCTTTTTGACACTGAAAGTTTTGTAAAAGAAATCATGTCCATACACTTTGTTGC
AAGATGTGAATTATTGACACTGAACTTAATAACTGTGTACTGTTCGGAAGGGGTTC
CTCAAATTTTTTGACTTTTTTTGTATGTGTGTTTTTTCTTTTTTTTTAAGTTCTTA
TGAGGAGGGGAGGGTAAATAAACCACTGTGCGTCTTGGTGTAATTTGAAGATTGCC
CCATCTAGACTAGCAATCTCTTCATTATTCTCTGCTATATATAAAACGGTGCTGTG
AGGGAGGGGAAAAGCATTTTTCAATATATTGAACTTTTGTACTGAATTTTTTTGTA
ATAAGCAATCAAGGTTATAATTTTTTTTAAAATAGAAATTTTGTAAGAAGGCAATA
TTAACCTAATCACCATGTAAGCACTCTGGATGATGGATTCCACAAAACTTGGTTTT
ATGGTTACTTCTTCTCTTAGATTCTTAATTCATGAGGAGGGTGGGGGAGGGAGGTG
GAGGGAGGGAAGGGTTTCTCTATTAAAATGCATTCGTTGTGTTTTTTAAGATAGTG
TAACTTGCTTAAATTTCTTATGTGACATTAACAAATAAAAAAGCTCTTTTAATATTAGATAA
Top red line represents query sequence
Each line below indicates matching sequences sorted by score
(in color) and position of match
Below is a list of high scoring matches followed by actual alignments
The “Expectation” Value (E Value)
• Expectation value. The number of different alignments
with scores equivalent to or better than S (threshold score)
that are expected to occur in a database search by chance.
The lower the E value, the more significant the score.
• Given in scientific notation.
• For example, an E value of e-167 (i.e. 10^-167) indicates that there is
roughly a 1/10^167 chance that the match is random
• The smaller the E value, the more significant the match
• Varies due to number of bp of sequence in the database and
the length of the query sequence
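The dependence of the E value on database size and query length can be illustrated with the usual relation between a normalized bit score S' and the expected number of chance hits, E = m x n x 2^(-S'), where m is the query length and n the total database length. The sketch below assumes that simple form and ignores the effective-length corrections BLAST applies; the function names and the example numbers are made up for illustration.

```python
import math


def expect_value(bit_score, query_len, db_len):
    """Expected number of chance alignments scoring at least bit_score.

    Uses E = m * n * 2**(-S'), with m and n taken as raw lengths
    (a simplification; BLAST uses corrected "effective" lengths).
    """
    return query_len * db_len * 2.0 ** (-bit_score)


def p_value(e_value):
    """Probability of seeing at least one such chance alignment."""
    return -math.expm1(-e_value)  # 1 - exp(-E), computed accurately for small E


# The same alignment looks less significant in a larger database.
for db_len in (10**6, 10**9):
    e = expect_value(bit_score=50, query_len=1000, db_len=db_len)
    print(f"database of {db_len:.0e} bp: E = {e:.2e}, P = {p_value(e):.2e}")
```

For very small E values the probability and the E value are practically equal, which is why an E value is often read informally as "the chance that the match is random".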
How Does the BLAST Algorithm
Work? An Overview
• A two step process
• Initial scanning identifies high scoring matches to
“words” in the query sequence
– Positive scores for exact matching bases or amino acids
– Negative scores for mismatches
– Default word size is 11 bases (for blastn)
• Sequences with high scores are extended in both
directions in the second step until the best score is
achieved
• Scoring matrices are used in each step
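The two steps above can be sketched as a toy "seed and extend" routine for nucleotide sequences. This is only an illustration of the idea under simplifying assumptions: exact word seeds, ungapped extension, and arbitrary match/mismatch scores and drop-off cutoff. The function names, the scores (+1/-3) and `x_drop` are choices made for this example; the real BLAST programs add refinements such as gapped extension that are not shown here.

```python
def find_seeds(query, subject, word_size=11):
    """Step 1: yield (query_pos, subject_pos) for every exact word match."""
    index = {}
    for i in range(len(query) - word_size + 1):
        index.setdefault(query[i:i + word_size], []).append(i)
    for j in range(len(subject) - word_size + 1):
        for i in index.get(subject[j:j + word_size], []):
            yield i, j


def extend(query, subject, i, j, word_size, match=1, mismatch=-3, x_drop=5):
    """Step 2: ungapped extension of one seed in both directions.

    Extension stops once the running score falls more than x_drop below
    the best score seen. Returns (best_score, query_start, query_end),
    with query_end exclusive.
    """
    score = best = word_size * match
    q_start, q_end = i, i + word_size

    # Extend to the right of the seed.
    qi, sj = i + word_size, j + word_size
    while qi < len(query) and sj < len(subject) and best - score <= x_drop:
        score += match if query[qi] == subject[sj] else mismatch
        qi += 1
        sj += 1
        if score > best:
            best, q_end = score, qi

    # Extend to the left of the seed, starting again from the best score.
    score = best
    qi, sj = i - 1, j - 1
    while qi >= 0 and sj >= 0 and best - score <= x_drop:
        score += match if query[qi] == subject[sj] else mismatch
        if score > best:
            best, q_start = score, qi
        qi -= 1
        sj -= 1

    return best, q_start, q_end


query = "TTACGTACGGA"
subject = "CCACGTACGGT"
i, j = next(find_seeds(query, subject, word_size=4))
score, start, end = extend(query, subject, i, j, word_size=4)
print(score, query[start:end])  # 8 ACGTACGG
```

A short word size is used in the usage example only so that the sequences fit on one line.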
Options
• Word length
– Set at 11 bases for blastn.
– Requires a perfect 11 bp match to go to the
second step
– Chances of a random 11 bp exact match are
1/4^11 (= 1/4,194,304)
– Shortening the word length may make the
search more sensitive, but it may increase the
number of non-biologically significant hits
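The trade-off can be checked with simple arithmetic. The sketch below assumes all four bases are equally likely and independent, which real genomes are not, so the numbers are only a rough guide. The function names and the example lengths are hypothetical.

```python
def random_word_match_probability(word_size, alphabet_size=4):
    """Chance that two random words of this length match exactly."""
    return 1 / alphabet_size ** word_size


def expected_random_seeds(query_len, db_len, word_size):
    """Rough expected number of chance word matches (seeds) in one search."""
    positions = (query_len - word_size + 1) * (db_len - word_size + 1)
    return positions * random_word_match_probability(word_size)


print(4 ** 11)  # 4194304
for w in (7, 11, 15):
    seeds = expected_random_seeds(query_len=1000, db_len=10**6, word_size=w)
    print(f"word size {w}: about {seeds:.3g} chance seeds")
```

Each base removed from the word multiplies the number of chance seeds by about four, which is why shorter words cost time and bring in more biologically meaningless hits.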
Options (cont.)
• Filters
– Can mask regions of low complexity
• Poly A tails
• Proline rich regions
– Can now mask human repetitive sequences
– Low complexity filter is on by default. Others
must be activated
Options (cont.)
• The Expect threshold
– The statistical significance threshold for reporting
matches against database sequences
– The default value is 10, meaning that 10 matches are
expected to be found merely by chance
– If the E value ascribed to a match is
greater than the EXPECT threshold, the match will not
be reported.
– Lower EXPECT thresholds are more stringent, leading
to fewer chance matches being reported.
– Increasing the threshold shows less stringent matches.
Fractional values are acceptable.
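The effect of the Expect threshold on what gets reported can be sketched as a simple filter. The hit names and E values below are invented purely for illustration, and `report_hits` is a hypothetical helper, not part of any BLAST interface.

```python
# Invented example hits: (name, E value). Not real search results.
hits = [
    ("hit_A", 1e-167),
    ("hit_B", 0.003),
    ("hit_C", 4.2),
    ("hit_D", 25.0),
]


def report_hits(hits, expect=10.0):
    """Return names of hits with E value <= expect, most significant first."""
    ranked = sorted(hits, key=lambda hit: hit[1])
    return [name for name, e_value in ranked if e_value <= expect]


print(report_hits(hits))                # default threshold of 10
print(report_hits(hits, expect=0.001))  # more stringent
print(report_hits(hits, expect=100.0))  # less stringent
```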
Protein Database Searching
• 2-5 times more sensitive than a DNA
database search!
– DNA alphabet is smaller than the protein
alphabet (4 v. 20 letters)
– The genetic code is redundant (6 serine codons)
– There is a selection for function, thus protein
sequence is more highly conserved through
time
• Genes or proteins in different organisms that
descend from a common ancestral gene through
speciation, and usually retain the same function,
are called “orthologs”