Matching problems in bioinformatics

Matching Problems in Bioinformatics
Charles Yan
Fall 2008
Matching Problem
Given a string P (pattern) and a long string T
(text), find all occurrences, if any, of P in T.
Example
T: Given a string P (pattern) and a long string T (text), find all
occurrences, if any, of P in T.
P: any
Exact matching: Does not allow any mismatch
Inexact matching: Allow up to k mismatches
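The two variants can be sketched in a few lines of Python. This is a naive illustration only, not a method given in the slides; the function names find_exact and find_with_mismatches are made up for this example, and "mismatch" is taken to mean a substituted character (no insertions or deletions).

```python
def find_exact(pattern, text):
    """Return the 0-based start positions of every exact occurrence of pattern."""
    positions = []
    start = text.find(pattern)
    while start != -1:
        positions.append(start)
        # Resume one character later so overlapping occurrences are also found.
        start = text.find(pattern, start + 1)
    return positions


def find_with_mismatches(pattern, text, k):
    """Return start positions where pattern aligns with at most k mismatches."""
    m = len(pattern)
    positions = []
    for i in range(len(text) - m + 1):
        mismatches = 0
        for a, b in zip(pattern, text[i:i + m]):
            if a != b:
                mismatches += 1
                if mismatches > k:
                    break  # this alignment already exceeds the budget
        if mismatches <= k:
            positions.append(i)
    return positions


text = ("Given a string P (pattern) and a long string T (text), "
        "find all occurrences, if any, of P in T.")
print(find_exact("any", text))               # exact hits only
print(find_with_mismatches("any", text, 1))  # also reports near hits such as "and"
```

The naive scan takes time proportional to len(P) * len(T) in the worst case, which is why genome-scale searches rely on more specialized algorithms and indexes.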
Matching Problem
Unix: grep
MS word: find
Genbank: http://www.ncbi.nlm.nih.gov/Genbank/
Human genome:
http://www.ncbi.nlm.nih.gov/projects/mapview/map_search.cgi?taxid=9606
Given “TTGTTCCGGTTAAAGATGGTGAAAATTTTT”, does it appear in the
human genome? Where?
How about
“ACCCCCAGGCGAGCATCTGACAGCCTGGAGCAGCACACACAACCCCAGGCGAG”?
Motifs


A motif is a conserved element corresponding to a
certain function (or structure). Occurrence of a motif
in a protein is likely to indicate that the protein has
the corresponding function.
Motifs are usually represented as an alignment or as a
regular expression.
Motifs
Protein function prediction using motifs



Each protein function is characterized by a
single motif or by multiple motifs.
If a protein contains the motif(s), it probably has
the function that the motif(s) correspond to.
A pertinent analogy is the use of fingerprints by
the police for identification purposes. A
fingerprint is generally sufficient to identify a
given individual. Similarly, motif(s) can be used
to formulate hypotheses about the function of a
newly discovered protein.
PROSITE



PROSITE (http://ca.expasy.org/prosite/) is a database of
protein families and domains, started in 1988.
PROSITE currently contains patterns (motifs) and profiles
specific for more than a thousand protein families or
domains. Release 20.36 of 22-Jul-2006 contains 1528
documentation entries.
Each of these signatures comes with documentation
providing background information on the structure and
function of these proteins.
PROSITE
Steps in the development of a new motif


Select a set of sequences that belong to a function family. Make
a multiple alignment.
Find a short (not more than four or five residues long)
conserved sequence (core motif) that is part of a region
known to be important or that includes biologically significant
residue(s).
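As an illustration of the second step, the sketch below looks for runs of conserved columns in a multiple alignment. It is a toy under stated assumptions, not the PROSITE curators' procedure: the alignment is hypothetical, '-' marks a gap, and the names column_conservation and conserved_runs as well as the threshold and min_len parameters are invented for this example.

```python
from collections import Counter


def column_conservation(alignment):
    """Return (most common symbol, fraction of rows sharing it) for each column."""
    n_rows = len(alignment)
    scores = []
    for column in zip(*alignment):
        symbol, count = Counter(column).most_common(1)[0]
        scores.append((symbol, count / n_rows))
    return scores


def conserved_runs(alignment, threshold=1.0, min_len=3):
    """Return (start column, consensus) for each run of conserved columns.

    A column counts as conserved when its most common symbol is a residue
    (not a gap) shared by at least `threshold` of the rows.
    """
    runs, start, consensus = [], None, []
    # The sentinel column at the end closes a run that reaches the last column.
    columns = column_conservation(alignment) + [("-", 0.0)]
    for i, (symbol, fraction) in enumerate(columns):
        if symbol != "-" and fraction >= threshold:
            if start is None:
                start = i
            consensus.append(symbol)
        else:
            if start is not None and len(consensus) >= min_len:
                runs.append((start, "".join(consensus)))
            start, consensus = None, []
    return runs


# Hypothetical aligned sequences of equal length.
toy_alignment = [
    "MKCGHSAL",
    "LRCGHSTV",
    "-KCGHSPI",
]
print(conserved_runs(toy_alignment))                 # fully conserved runs
print(conserved_runs(toy_alignment, threshold=0.6))  # tolerate some variation
```

Whether a conserved run is a useful core motif still depends on the biological knowledge mentioned above; conservation alone does not show that the region is functionally important.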
PROSITE
Steps in the development of a new motif (cont.)


The most recent version of the Swiss-Prot knowledgebase is then
scanned with these core pattern(s). If a core motif detects all
the proteins in the family and none (or very few) of the other
proteins, we can stop at this stage.
In most cases we are not so lucky and we pick up a lot of extra
sequences which clearly do not belong to the group of proteins
under consideration. A further series of scans, involving a gradual
increase in the size of the motif, is then necessary. In some cases
we never manage to find a good motif.
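The scan-and-refine loop can be illustrated with a small sketch that counts how many family members a candidate motif detects and how many unrelated sequences it picks up. Everything here is hypothetical: the sequences are toy data standing in for Swiss-Prot, the motifs are written directly as Python regular expressions, and evaluate_motif is a name made up for this example.

```python
import re


def evaluate_motif(regex, family, others):
    """Count hits of a candidate motif inside and outside the protein family."""
    pattern = re.compile(regex)
    true_pos = sum(1 for seq in family if pattern.search(seq))
    false_pos = sum(1 for seq in others if pattern.search(seq))
    return {
        "true_pos": true_pos,                 # family members detected
        "false_neg": len(family) - true_pos,  # family members missed
        "false_pos": false_pos,               # unrelated sequences picked up
    }


# Toy data: sequences known to belong to the family, and all the rest.
family = ["AACGHLKWTT", "MMCGHQRWAA", "CGHSSWPL"]
others = ["TTCGHAAAAA", "PLKMNQRS", "GGCGHLLKWA"]

# A short core motif finds the whole family but also extra sequences.
print(evaluate_motif("CGH", family, others))

# Extending the motif keeps the family and drops the extra hits.
print(evaluate_motif("CGH.{2}W", family, others))
```

In this toy case one extension is enough; with real data several rounds may be needed, and a good motif may not exist at all.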
PROSITE
The motifs are described using the following conventions:

The standard IUPAC one-letter codes for the amino acids are used.

The symbol 'x' is used for a position where any amino acid is
accepted.

Ambiguities are indicated by listing the acceptable amino acids for a
given position between square brackets '[ ]'. For example:
[ALT] stands for Ala or Leu or Thr.

Ambiguities are also indicated by listing between a pair of curly
brackets '{ }' the amino acids that are not accepted at a given
position. For example: {AM} stands for any amino acid except Ala
and Met.

Each element in a pattern is separated from its neighbor by a '-'.
PROSITE
The motifs are described using the following conventions (cont.):

Repetition of an element of the pattern can be indicated by following
that element with a numerical value or a numerical range in
parentheses. Examples: x(3) corresponds to x-x-x, x(2,4) corresponds
to x-x or x-x-x or x-x-x-x.

When a pattern is restricted to either the N- or C-terminal of a
sequence, that pattern either starts with a '<' symbol or, respectively,
ends with a '>' symbol. In some rare cases (e.g. PS00267 or
PS00539), '>' can also occur inside square brackets for the
C-terminal element. 'F-[GSTV]-P-R-L-[G>]' means that either
'F-[GSTV]-P-R-L-G' or 'F-[GSTV]-P-R-L>' is accepted.

A period ends the pattern.
Example:
[AC]-x-V-x(4)-{ED}.
This pattern is translated as:
[Ala or Cys]-any-Val-any-any-any-any-{any but Glu or Asp}
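Because the conventions above map almost one-to-one onto regular expression syntax, a pattern can be translated mechanically. The sketch below is one possible translation into Python's re syntax, not the code used by PROSITE's own scanning tools; prosite_to_regex is a hypothetical helper, the test sequences are invented, and only the conventions listed above are handled.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE pattern into a Python regular expression string."""
    pattern = pattern.strip().rstrip(".")  # a period ends the pattern
    parts = []
    for element in pattern.split("-"):
        # '<' anchors at the N-terminal, '>' at the C-terminal.
        anchor_start = element.startswith("<")
        element = element.lstrip("<")
        anchor_end = element.endswith(">")
        element = element.rstrip(">")

        # Split off a repetition suffix such as (3) or (2,4).
        repeat = ""
        m = re.fullmatch(r"(.+?)\((\d+)(?:,(\d+))?\)", element)
        if m:
            element = m.group(1)
            upper = "," + m.group(3) if m.group(3) else ""
            repeat = "{" + m.group(2) + upper + "}"

        if element == "x":                # any amino acid
            core = "."
        elif element.startswith("["):     # accepted amino acids
            inner = element[1:-1]
            if ">" in inner:              # residue(s) or the end of the sequence
                core = "(?:[" + inner.replace(">", "") + "]|$)"
            else:
                core = "[" + inner + "]"
        elif element.startswith("{"):     # amino acids that are not accepted
            core = "[^" + element[1:-1] + "]"
        else:                             # a single one-letter code
            core = element

        parts.append(("^" if anchor_start else "") + core + repeat
                     + ("$" if anchor_end else ""))
    return "".join(parts)


regex = prosite_to_regex("[AC]-x-V-x(4)-{ED}.")
print(regex)
print(re.search(regex, "MKAGVLLLLKP"))  # Lys in the last position: accepted
print(re.search(regex, "MKAGVLLLLEP"))  # Glu in the last position: rejected
```

Note that re.search reports only the leftmost occurrence; listing every occurrence, including overlapping ones, would need an explicit loop over start positions.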
PROSITE
A profile or weight matrix is a table of position-specific amino acid
weights and gap costs. These numbers (also referred to as scores) are
used to calculate a similarity score for any alignment between a profile
and a sequence, or parts of a profile and a sequence. An alignment with
a similarity score higher than or equal to a given cutoff value
constitutes a motif occurrence.
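A simplified sketch of profile scanning follows. It assumes ungapped alignments only, so the gap costs of a real profile are left out, and every number in it is invented: the weights, the default weight for unlisted residues and the cutoffs are toy values, and scan_profile is a hypothetical name.

```python
def scan_profile(profile, sequence, cutoff, default=-2):
    """Slide an ungapped weight matrix along a sequence.

    profile  -- one dict per motif position, mapping residue to weight
    default  -- weight given to residues missing from a position's dict
    Returns (start, score) for every window scoring at least `cutoff`.
    """
    width = len(profile)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(column.get(residue, default)
                    for column, residue in zip(profile, window))
        if score >= cutoff:
            hits.append((start, score))
    return hits


# Toy three-position profile (weights are made up for illustration).
toy_profile = [
    {"C": 5, "S": 2},
    {"G": 4, "A": 1},
    {"H": 5},
]
print(scan_profile(toy_profile, "MKCGHSAHL", cutoff=8))
print(scan_profile(toy_profile, "MKCGHSAHL", cutoff=10))
```

Unlike a pattern, which either matches or does not, the profile ranks windows by score, so a window with a less favored residue at some position can still be reported when the cutoff is low enough.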
Motifs and Matching

Motif Finding:
Given a set of protein sequences, find the motif(s) that are shared
by these proteins.

Motif Scanning:
Given a motif and a protein sequence, find the occurrences (not
necessarily identical) of the motif in the protein sequence.
This is the Matching Problem!
From Single Motif to Multiple Motifs
A single motif is often not sufficient to predict a protein
function. Multiple motifs have stronger predictive
power.
Multiple Motifs
Protein function prediction using multiple motifs


Each protein function is characterized by a set of motifs
(instead of a single one).
If a protein contains a set of motifs, it probably has the
function that the set of motifs corresponds to.
PRINTS






PRINTS (http://umber.sbs.man.ac.uk/dbbrowser/PRINTS/) is a
database of protein fingerprints.
A fingerprint is a group of conserved motifs used to
characterize a protein family.
FTP: ftp.bioinf.man.ac.uk/pub/prints
PRINTS is now maintained at the University of
Manchester.
PRINTS version 38.1 (25 May 2007):
1904 fingerprints, encoding 11,451 single motifs.
PRINTS



Two types of fingerprint are represented in the database, simple and
composite, depending on their complexity. Simple fingerprints are
essentially single motifs, while composite fingerprints encode
multiple motifs. The bulk of the database entries are of the latter
type because discrimination power is greater for multi-component
searches.
Usually the motifs do not overlap, but are separated along a
sequence, though they may be contiguous in 3D space.
Fingerprints can encode protein folds and functionalities more
flexibly and powerfully than single motifs can. Their full diagnostic
potency derives from the mutual context provided by motif neighbors.
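The idea of mutual context can be illustrated with a toy check that requires every motif of a fingerprint to occur, in order and without overlap. This is only a sketch under strong assumptions and not how PRINTS itself scores sequences: the motifs are written as fixed-width regular expressions, the fingerprint and sequences are invented, and match_fingerprint is a hypothetical name.

```python
import re


def match_fingerprint(motifs, sequence):
    """Return the start of each motif if all occur in order, otherwise None.

    Each motif is searched for after the end of the previous match, so the
    reported occurrences never overlap.
    """
    positions = []
    cursor = 0
    for motif in motifs:
        match = re.compile(motif).search(sequence, cursor)
        if match is None:
            return None  # one missing motif rejects the whole fingerprint
        positions.append(match.start())
        cursor = match.end()
    return positions


# Hypothetical three-motif fingerprint.
toy_fingerprint = ["CGH", "W[ST]P", "KK"]

print(match_fingerprint(toy_fingerprint, "AACGHLLWSPMMKKA"))  # all three, in order
print(match_fingerprint(toy_fingerprint, "KKAACGHLLWSP"))     # KK comes too early
```

The second sequence contains all three motifs, yet it is rejected because they appear in the wrong order, which is the kind of extra discrimination a single motif cannot provide.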
PRINTS
a) General field
PRINTS
FPScan
Submit a protein sequence to find the closest matching
PRINTS fingerprint(s).
Related Projects









InterPro - Integrated Resource of Protein Domains and Functional Sites
BLOCKS - BLOCKS database
Pfam - Protein families database (HMM derived)
PRINTS - Protein motif fingerprint database
ProDom - Protein domain database (automatically generated)
PROTOMAP - An automatic hierarchical classification of Swiss-Prot proteins
SBASE - SBASE domain database
SMART - Simple Modular Architecture Research Tool
TIGRFAMs - TIGR protein families database
Motifs and Matching

Motif Finding:
Given a set of protein sequences, find the motif(s) that are shared
by these proteins.

Motif Scanning:
Given a motif and a protein sequence, find the occurrences (not
necessarily identical) of the motif in the protein sequence.
This is the Matching Problem!