Single Motif
Charles Yan
Spring 2006
Single Motif
Similar Sequence Similar Function
In some cases the sequence of an unknown protein is too
distantly related to any protein of known structure to detect
its resemblance by sequence alignment, but it can be
identified by the occurrence in its sequence of a particular
cluster of residue types which is variously known as a
pattern, motif, signature, or fingerprint.
Single Motif
Protein function prediction using a single motif
Each protein family is characterized by one motif.
If a protein contains a motif, it probably belongs to the family that the
motif corresponds to.
A pertinent analogy is the use of fingerprints by the police for
identification purposes. A fingerprint is generally sufficient to
identify a given individual. Similarly, a motif can be used to assign a
newly sequenced protein to a specific family of proteins and thus to
formulate hypotheses about its function.
Single Motif
This approach is based on the observation that
While there is a huge number of different proteins, most of them can
be grouped, on the basis of similarities in their sequences, into a
limited number of families.
Proteins belonging to a particular family generally share sequence
and/or structural attributes.
In a protein family, some regions have been better conserved than
others during evolution. These regions are generally important for
the function of a protein and/or for the maintenance of its
three-dimensional structure.
Thus, by analyzing the constant and variable properties of such
groups of similar sequences, it is possible to derive a signature for a
protein family.
Single Motif
A motif is a conserved element corresponding to a
region whose function or structure is known. It is
likely to be predictive of any subsequent occurrence
of such a structural/functional region in any other
protein sequence.
Motifs are usually represented using an alignment or a
regular expression.
Single Motif
PROSITE is a database of
protein families and domains (started in 1988).
PROSITE currently contains patterns (motifs) and profiles
specific for more than a thousand protein families or
domains. Release 19.18 of 10-Jan-2006 contains 1398
documentation entries.
Each of these signatures comes with documentation
providing background information on the structure and
function of these proteins.
Steps in the development of a new motif
Select a set of sequences that belong to a function family. Make
a multiple alignment.
Find a short (not more than four or five residues long)
conserved sequence (core motif) which is part of a region
known to be important or which includes biologically significant
residues.
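As a rough illustration of looking for a core motif in a multiple alignment, the sketch below scores each column by how often its most common residue occurs and then picks the most conserved short window. The function names, the conservation measure, and the toy alignment are assumptions made for this example; they are not how PROSITE curators actually select core motifs, which also relies on biological knowledge of the region.

```python
from collections import Counter

def column_conservation(alignment):
    """Fraction of sequences sharing the most common residue in each column."""
    n = len(alignment)
    scores = []
    for column in zip(*alignment):
        residues = [r for r in column if r != "-"]  # ignore gap characters
        top = Counter(residues).most_common(1)[0][1] if residues else 0
        scores.append(top / n)
    return scores

def best_window(alignment, width=5):
    """Start index and mean conservation of the most conserved window."""
    scores = column_conservation(alignment)
    best_start, best_mean = 0, -1.0
    for start in range(len(scores) - width + 1):
        mean = sum(scores[start:start + width]) / width
        if mean > best_mean:
            best_start, best_mean = start, mean
    return best_start, best_mean

# Toy alignment (made-up sequences, not taken from any real family).
alignment = [
    "MKACHVLT",
    "LRGCHVAS",
    "MQACHVIT",
    "-KSCHVLT",
]
start, mean = best_window(alignment, width=3)
core = alignment[0][start:start + 3]
print(start, core, mean)
```

In this toy case the three fully conserved columns give the candidate core motif; a real family would rarely be this clean.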
Steps in the development of a new motif (cont.)
The most recent version of the Swiss-Prot knowledgebase is then
scanned with these core pattern(s). If a core motif detects all
the proteins in the family and none (or very few) of the other
proteins, we can stop at this stage.
In most cases we are not so lucky and we pick up a lot of extra
sequences which clearly do not belong to the group of proteins
under consideration. A further series of scans, involving a gradual
increase in the size of the motif, is then necessary. In some cases
we never manage to find a good motif.
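The scan-and-refine loop described above can be sketched as follows. This is a minimal example under stated assumptions: the tiny dictionary stands in for a sequence database such as Swiss-Prot, the motifs are written directly as Python regular expressions, and `evaluate_motif` is a hypothetical helper, not part of any PROSITE tool.

```python
import re

def evaluate_motif(regex, sequences, family_ids):
    """Scan all sequences and split the hits into true/false positives.

    Returns (true_positives, false_positives, false_negatives) as sets of IDs.
    """
    pattern = re.compile(regex)
    hits = {sid for sid, seq in sequences.items() if pattern.search(seq)}
    return hits & family_ids, hits - family_ids, family_ids - hits

# Made-up stand-in for a sequence database.
sequences = {
    "fam1": "MKACHVLT",
    "fam2": "LRGCHVAS",
    "fam3": "MQACHVIT",
    "other1": "PPCHVPPL",
    "other2": "MSTNPKPQ",
}
family = {"fam1", "fam2", "fam3"}

# First scan: the short core motif finds the whole family plus an extra sequence.
tp, fp, fn = evaluate_motif("CHV", sequences, family)

# Second scan: the motif is extended by one position to exclude the extra hit.
tp2, fp2, fn2 = evaluate_motif("[AG]CHV", sequences, family)
print(sorted(fp), sorted(fp2))
```

The first scan picks up one sequence outside the family; extending the motif by one position removes it while still detecting every family member, which is the stopping condition the slides describe.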
Motifs are described using the following conventions:
The standard IUPAC one-letter codes for the amino acids are used.
The symbol 'x' is used for a position where any amino acid is accepted.
Ambiguities are indicated by listing the acceptable amino acids for a
given position, between square brackets '[ ]'. For example:
[ALT] stands for Ala or Leu or Thr.
Ambiguities are also indicated by listing between a pair of curly
brackets '{ }' the amino acids that are not accepted at a given
position. For example: {AM} stands for any amino acid except Ala
and Met.
Each element in a pattern is separated from its neighbor by a '-'.
Motifs are described using the following conventions (cont.):
Repetition of an element of the pattern can be indicated by following
that element with a numerical value or a numerical range in
parentheses. Examples: x(3) corresponds to x-x-x; x(2,4) corresponds
to x-x or x-x-x or x-x-x-x.
When a pattern is restricted to either the N- or C-terminus of a
sequence, that pattern either starts with a '<' symbol or, respectively,
ends with a '>' symbol. In some rare cases (e.g. PS00267 or
PS00539), '>' can also occur inside square brackets for the C-terminal
element. 'F-[GSTV]-P-R-L-[G>]' means that either 'F-[GSTV]-P-R-L-G' or
'F-[GSTV]-P-R-L>' is accepted.
A period ends the pattern.
Example: [AC]-x-V-x(4)-{ED}.
This pattern is translated as: [Ala or Cys]-any-Val-any-any-any-any-{any but Glu or Asp}
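Because these conventions map closely onto regular expressions, a pattern can be translated mechanically. The sketch below is one possible translation covering only the conventions listed above; `prosite_to_regex` is a name invented for this example, and the test sequences are made up. It assumes sequences contain only one-letter amino acid codes, so the regex '.' is a safe stand-in for 'x'.

```python
import re

def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    pattern = pattern.strip().rstrip(".")        # a period ends the pattern
    regex = ""
    for element in pattern.split("-"):           # elements are separated by '-'
        prefix = suffix = ""
        if element.startswith("<"):              # N-terminal anchor
            prefix, element = "^", element[1:]
        if element.endswith(">"):                # C-terminal anchor
            suffix, element = "$", element[:-1]
        # Split off an optional repetition such as (3) or (2,4).
        match = re.fullmatch(r"([^()]+)(?:\((\d+)(?:,(\d+))?\))?", element)
        if match is None:
            raise ValueError(f"cannot parse element {element!r}")
        core, low, high = match.groups()
        if core == "x":                          # any amino acid
            atom = "."
        elif core.startswith("[") and core.endswith("]"):
            residues = core[1:-1]
            if ">" in residues:                  # e.g. [G>]: residue or sequence end
                atom = "(?:[" + residues.replace(">", "") + "]|$)"
            else:
                atom = "[" + residues + "]"
        elif core.startswith("{") and core.endswith("}"):
            atom = "[^" + core[1:-1] + "]"       # excluded residues
        else:
            atom = core                          # a single fixed residue
        if low is not None:
            atom += "{" + low + ("," + high if high else "") + "}"
        regex += prefix + atom + suffix
    return regex

regex = prosite_to_regex("[AC]-x-V-x(4)-{ED}.")
hit = re.search(regex, "MKAGVLLLLKT")    # made-up sequence that fits the pattern
miss = re.search(regex, "MKAGVLLLLET")   # same sequence with Glu at the last position
print(regex, hit.group(), miss)
```

The second sequence differs only at the final motif position, where Glu is excluded by {ED}, so it does not match.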
There are a number of protein families as well as
functional or structural domains that cannot be detected
using patterns due to their extreme sequence divergence;
the use of techniques based on weight matrices (also
known as profiles) allows the detection of such proteins
or domains.
Three types of entry in PROSITE:
1327 patterns/motifs
591 profiles/matrices
4 rules
A profile or weight matrix is a table of position-specific amino acid
weights and gap costs. These numbers (also referred to as scores) are
used to calculate a similarity score for any alignment between a profile
and a sequence, or parts of a profile and a sequence. An alignment with
a similarity score higher than or equal to a given cutoff value
constitutes a motif.
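To make the scoring idea concrete, the sketch below slides a small weight matrix along a sequence and reports every window whose score reaches the cutoff. It is a deliberate simplification: real PROSITE profiles also carry gap costs and allow insertions and deletions, which this ungapped example ignores. The weights, the default penalty for unlisted residues, and the function names are all made up for illustration.

```python
def score_window(profile, window, default=-1):
    """Sum the position-specific weights for one ungapped window."""
    return sum(column.get(residue, default)
               for column, residue in zip(profile, window))

def scan_profile(profile, sequence, cutoff):
    """Return (start, score) for every window scoring at or above the cutoff."""
    width = len(profile)
    hits = []
    for start in range(len(sequence) - width + 1):
        score = score_window(profile, sequence[start:start + width])
        if score >= cutoff:
            hits.append((start, score))
    return hits

# Illustrative weights only: one dict per profile position,
# residues not listed at a position receive the default penalty.
profile = [
    {"A": 2, "G": 1},
    {"C": 5},
    {"H": 4, "N": 1},
    {"V": 3, "I": 2, "L": 1},
]

strong = scan_profile(profile, "MKACHVLT", cutoff=10)
weak = scan_profile(profile, "LRGCNIAS", cutoff=10)
weak_relaxed = scan_profile(profile, "LRGCNIAS", cutoff=9)
print(strong, weak, weak_relaxed)
```

Unlike a pattern, which either matches or does not, the second sequence here is a near miss whose status depends on where the cutoff is set; this is what lets profiles detect more divergent family members.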
A rule is described in ordinary English and is free-format.