An Introduction to Bioinformatics
Protein Modules
AIMS
To introduce the concept of multidomain proteins
To define the terms associated with analysis of multidomain
proteins
To introduce the major secondary databases
OBJECTIVES
To select an appropriate secondary database for analysis of
protein domains
To carry out an analysis to establish the domain
structure of a protein
To ascribe likely biological functions to protein domains
When the amino acid sequences of two proteins are compared
and found to exhibit significant similarity, they are assumed to
be evolutionarily related, i.e. they are homologues
two classes of homologue (orthologue and paralogue)
orthologous genes in different organisms are descended from a
single ancestral gene, and their divergence simply parallels
speciation
paralogous genes are descended from copies of a gene that
duplicated within a single ancestral genome
a substantial proportion of all proteins are composed of more
than one domain
A domain is defined as sequentially consecutive residues in a
protein that can fold up independently of other parts of the
protein
Crystallographers commonly refer to domains as folds and the
term module is also used
The domain/module is the fundamental unit of protein structure
inter-domain splicing, fusion, deletion, duplication and shuffling
have occurred frequently during evolution, whereas intra-domain
rearrangements have occurred rarely
[Figure: influenza virus haemagglutinin]
When two homologous proteins are aligned, there are often one or
more regions where sequence identity is particularly high, and
these regions frequently enable the definition of motifs or
signature sequences that are diagnostic of the protein family
(Module 4)
Any particular domain may have one or more characteristic motifs
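
The observation that aligned homologues contain regions of particularly high identity can be illustrated with a sliding-window calculation. The following is a minimal sketch, not a method taken from these slides: it assumes the two sequences are already aligned (equal length, gaps written as "-"), and the function names, window size and identity cutoff are illustrative choices.

```python
from typing import List


def window_identity(aln_a: str, aln_b: str, window: int = 10) -> List[float]:
    """Fraction of identical, non-gap columns in each sliding window."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    scores = []
    for start in range(len(aln_a) - window + 1):
        pairs = zip(aln_a[start:start + window], aln_b[start:start + window])
        # A column counts as a match only if both residues agree and are not gaps.
        matches = sum(1 for a, b in pairs if a == b and a != "-")
        scores.append(matches / window)
    return scores


def conserved_windows(aln_a: str, aln_b: str, window: int = 10,
                      min_identity: float = 0.8) -> List[int]:
    """1-based start columns of windows that meet the identity cutoff."""
    scores = window_identity(aln_a, aln_b, window)
    return [i + 1 for i, s in enumerate(scores) if s >= min_identity]
```

Windows that pass the cutoff are only candidate regions. Defining a diagnostic motif would normally draw on a multiple alignment of many family members rather than a single pair.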
Domains/modules and motifs/signature sequences constitute
the content of many secondary databases and are of
enormous value in attempting to predict the function and
structure of new proteins
Low complexity regions
The individual domains of multidomain proteins are frequently
separated from each other by regions of low complexity, also
referred to as linker sequences
Long stretches of repeated residues, particularly proline,
glutamine, serine or threonine, often indicate linker sequences
The program SEG detects such low complexity regions and
can be used as part of BLAST to mask off segments of the
query sequence that have low compositional complexity
This leaves the biologically interesting regions of the query
sequence available for matching against database sequences
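
The idea of masking low-complexity segments can be sketched with a simple compositional measure. The code below is not the SEG algorithm itself. It is a simplified stand-in that assumes Shannon entropy over a sliding window is an adequate proxy for compositional complexity. The function names and the window and threshold values are illustrative assumptions.

```python
import math
from collections import Counter


def shannon_entropy(window: str) -> float:
    """Shannon entropy (bits per residue) of a window's residue composition."""
    counts = Counter(window)
    total = len(window)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def mask_low_complexity(seq: str, window: int = 12, threshold: float = 2.2) -> str:
    """Replace every residue covered by a low-entropy window with 'X'.

    window and threshold are illustrative parameter values (assumptions).
    """
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) <= threshold:
            for i in range(start, start + window):
                masked[i] = "X"
    return "".join(masked)
```

A run of a single residue, such as a polyglutamine stretch, has an entropy of zero and is masked. A window that only partly overlaps the run can also fall below the threshold, so a few flanking residues may be masked as well.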
Secondary (pattern) databases
Analysis of the primary protein sequence databases, usually
through multiple sequence alignments, has led to the identification
of sequence patterns (motifs, signatures, blocks, profiles)
common to homologous proteins or protein modules
These motifs, usually ~10-20 amino acids in length, commonly
correspond to key functional or structural elements, often
domains/modules, and are extremely useful in identifying
such features in new uncharacterized proteins
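
How a sequence pattern might be read off a multiple alignment can be shown with a toy example. This is a hedged sketch only: it assumes a gap-free block of aligned motif instances, writes the result in a PROSITE-like notation, and uses a hypothetical helper name (derive_pattern) and cutoff (max_choices). Curated patterns in real databases are built and validated far more carefully.

```python
from typing import List


def derive_pattern(aligned: List[str], max_choices: int = 3) -> str:
    """Summarise aligned motif instances column by column.

    One residue in a column   -> that residue
    Up to max_choices residues -> a bracketed set, e.g. [ST]
    More variable or gapped    -> x (any residue)
    """
    if not aligned or len({len(s) for s in aligned}) != 1:
        raise ValueError("need a non-empty block of equal-length sequences")
    elements = []
    for column in zip(*aligned):
        residues = sorted(set(column))
        if "-" in residues or len(residues) > max_choices:
            elements.append("x")
        elif len(residues) == 1:
            elements.append(residues[0])
        else:
            elements.append("[" + "".join(residues) + "]")
    return "-".join(elements)


# Hypothetical four-residue motif instances, invented for illustration.
block = ["NGSA", "NKTL", "NASG", "NQTV"]
pattern = derive_pattern(block)
```

For this invented block the first column is invariant, the third alternates between two residues and the other two are variable, so the derived pattern is N-x-[ST]-x.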
An unknown protein is often too distantly related to any protein
of known sequence to detect its resemblance by overall
sequence alignment, but it can potentially be identified by the
occurrence in its sequence of a particular motif
Several programs allow an unknown protein sequence to be
searched against databases of motifs, profiles, etc.
Pfam is a collection of multiple alignments and profile hidden
Markov models of protein domain families, which is based on
proteins from both SWISS-PROT and SP-TrEMBL
SMART (a Simple Modular Architecture Research Tool)
allows the identification and annotation of genetically mobile
domains and the analysis of domain architectures
PROSITE is a database of protein families and domains. It
consists of biologically significant sites, patterns and profiles that
help to reliably identify to which known protein family (if any)
a new sequence belongs
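
Searching a sequence for a PROSITE-style pattern can be sketched by translating the pattern into a regular expression. The code below is a minimal illustration, not PROSITE's own scanning software. It handles only the common syntax elements (x for any residue, [..] for allowed residues, {..} for excluded residues, (n) or (n,m) for repeats, < and > for the sequence termini), and the function names and test sequence are invented for the example.

```python
import re
from typing import List, Tuple


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    regex = pattern.rstrip(".").replace("-", "")
    regex = regex.replace("x", ".")                       # any residue
    regex = regex.replace("{", "[^").replace("}", "]")    # excluded residues
    regex = regex.replace("(", "{").replace(")", "}")     # repeat counts
    regex = regex.replace("<", "^").replace(">", "$")     # N-/C-terminal anchors
    return regex


def find_motif(pattern: str, seq: str) -> List[Tuple[int, str]]:
    """Return (1-based start, matched text) for every hit, including overlaps."""
    # The lookahead lets overlapping occurrences be reported.
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(seq)]


# N-glycosylation-style pattern scanned against a made-up sequence.
hits = find_motif("N-{P}-[ST]-{P}.", "MKNGSAANPTQ")
```

Here the pattern matches at position 3 (NGSA), while the second asparagine is rejected because it is followed by proline. Short patterns such as this one match frequently by chance, so a hit suggests a possible site rather than demonstrating one.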