The presentation


Identification of Protein
Domains
Orthologs and Paralogs
Describing evolutionary relationships
among genes (proteins):
The two major ways of creating homologous
genes are gene duplication and
speciation.
Homology: not sufficiently well-defined
Therefore additional terms are used:
[Figure: gene tree with relationships labeled ortho and para]
Orthologs are two
genes from two
different species
that derive from a
single gene in the
last common
ancestor of the
species.
Paralogs are genes
that derive from a
single gene that was
duplicated within a
genome.
[Figure label: co-ortho]
Co-orthologs are
paralogs produced
by duplications of
orthologs
subsequent to a
given speciation
event.
[Figure labels: in-para, out-para]
Inparalogs are
paralogs in a given
lineage that all
evolved by gene
duplications that
happened after the
speciation event.
Outparalogs are
paralogs in the given
lineage that evolved
by gene duplications
that happened
before the speciation
event.
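The definitions above can be sketched as a small decision rule. This is only an illustration: it assumes that the event at the last common ancestor of two genes in the gene tree is already known, and the function name and argument encoding are hypothetical.

```python
def classify_pair(event, duplication_after_speciation=None):
    """Classify two homologous genes from the event that separated them.

    event: "speciation" or "duplication" (hypothetical encoding of the
        event at the genes' last common ancestor in the gene tree).
    duplication_after_speciation: for duplications only, True if the
        duplication happened after the reference speciation event,
        False if before, None if no reference speciation is given.
    """
    if event == "speciation":
        return "ortholog"
    if event != "duplication":
        raise ValueError("event must be 'speciation' or 'duplication'")
    if duplication_after_speciation is None:
        return "paralog"
    return "inparalog" if duplication_after_speciation else "outparalog"
```

For example, classify_pair("duplication", True) returns "inparalog", because the duplication followed the speciation event.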
Orthologs and Paralogs
• Orthologs - evolutionary functional
counterparts in different species
• Inparalogs – important for detecting
lineage-specific adaptations
Proteins :
• Rapidly growing databases of protein
sequences due to genome sequencing
projects.
• Many new proteins belong to protein families
with known functions (based on significant sequence
similarity).
• Only a small fraction of known proteins have
functions determined by experiment.
• Databases providing computational sequence
analysis allow us to assign new proteins to
known families, and thus to predict their
function.
Protein Domains
• A domain is an independent structural unit
which can be found alone or in conjunction
with other domains or repeats.
• Module = mobile domain.
• Different domains have distinct functions.
• Many eukaryotic proteins have multiple
domains.
Protein Domains
[Figures: PX domain with ligand; SH3 domain with ligand]
Identifying Protein Domains :
Problems :
– Defining the members of each family.
– Building multiple alignments of the
members.
– Finding the boundaries of the domain.
Identifying Protein Domains
• Little structural data is available, so identification relies on
sequence analysis.
• Even when the structure of the domain
is not known it may be possible to define
its boundaries from sequence alone.
• Sequence characterization of families helps to determine 3D structure and molecular
functions.
Identifying Protein Domains :
Motif matches are often useful to indicate
functional sites, however :
• They do not give a clear picture of the
domain boundaries.
• They lack sensitivity.
Identifying Protein Domains :
Automatic methods :
• Fast and effective; can handle a lot of
information.
• Might fragment domain families.
• Might cause fusion of domain families.
Manual methods :
• Knowledge of protein experts is put to use.
• Slow; requires a lot of manpower.
SMART :
(Simple Modular Architecture Research Tool)
Web-based resource used for :
– rapid annotation of protein domains.
– analysis of domain architectures.
Domain Architecture
Protein: PA-3427CG
Species: Drosophila melanogaster
Protein: ENSMUSP00000023109
Species: Mus musculus
Protein: ENSANGP00000009529
Species: Anopheles gambiae
SMART (Simple Modular Architecture Research Tool)
• There are over 600 domain families.
• Provides information about :
– function .
– subcellular localization.
– phyletic distribution.
– tertiary structure.
• Based on HMMs (Hidden Markov
Models).
SMART (Simple Modular Architecture Research Tool)
HMM – based on seed
alignment.
Threshold values used
to determine
homology of
domains.
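The idea of a model built from a seed alignment plus a score threshold can be illustrated with a much simpler stand-in. The sketch below is not a hidden Markov model (it has no insert or delete states); it is a position-specific log-odds profile over an ungapped seed alignment, with a uniform background and a pseudocount chosen purely for illustration. The function names and the threshold value are assumptions, not SMART's actual procedure.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(seed_alignment, pseudocount=1.0):
    """Build per-column log-odds scores from an ungapped seed alignment."""
    length = len(seed_alignment[0])
    background = 1.0 / len(AMINO_ACIDS)  # uniform background (assumption)
    profile = []
    for col in range(length):
        column = [seq[col] for seq in seed_alignment]
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log(((column.count(aa) + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile


def score(profile, segment):
    """Sum the log-odds scores of a segment with the same length as the profile."""
    return sum(column[aa] for column, aa in zip(profile, segment))


def is_member(profile, segment, threshold=0.0):
    """Accept the segment as homologous if its score reaches the threshold.

    The default threshold is hypothetical; real thresholds are tuned per family.
    """
    return score(profile, segment) >= threshold
```

A segment that resembles the seed sequences scores above the threshold, while an unrelated segment scores below it.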
SMART (Simple Modular Architecture Research Tool)
• Alignments of proteins are built by:
– Minimizing insertions/deletions in conserved
alignment blocks.
– Optimizing amino acid property conservation.
– Closing unnecessary gaps.
• Gapped alignments are preferred over
ungapped ones:
– prediction of domain boundaries.
– greater information content.
• Alignment of entire structural domains.
PROSITE -
database of protein families and domains
• Database of biologically significant sites and
patterns. Contains 1,609 profiles.
• Pattern – conserved sequence of a few amino
acids.
• Identifies to which known family of proteins
(if any) the new sequence belongs.
• Used to determine the function of
uncharacterized proteins translated from
genomic or cDNA sequences.
PROSITE -
database of protein families and domains
• A protein too distant from any other for its
resemblance to be detected by overall
sequence alignment can still be classified
according to a pattern.
• Patterns arise because of requirements
of binding sites that impose very tight
constraint on the evolution of portions
of the protein.
PROSITE – how is a pattern developed ?
• As short as possible.
• Detects all/most sequences it describes.
• As few false results as possible.
Goal: high sensitivity and high specificity.
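Sensitivity and specificity can be computed from the counts of a database scan. The sketch below uses the standard definitions; the function names are illustrative, and the counts in the usage note are made up for the example rather than taken from PROSITE.

```python
def sensitivity(true_positives, false_negatives):
    """Fraction of real family members that the pattern detects."""
    return true_positives / (true_positives + false_negatives)


def specificity(true_negatives, false_positives):
    """Fraction of non-members that the pattern correctly rejects."""
    return true_negatives / (true_negatives + false_positives)
```

With hypothetical counts of 90 detected members, 10 missed members, 990 rejected non-members, and 10 false hits, sensitivity(90, 10) is 0.9 and specificity(990, 10) is 0.99.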
PROSITE – how is a pattern developed ?
First – study reviews on a protein family.
Then build alignment table with particular
attention to residues and regions important to
the biological function of that family.
- Enzyme catalytic sites.
- Prosthetic group attachment sites (heme).
- Amino acids involved in binding a metal ion.
- Cysteines involved in disulfide bonds.
- Regions involved in binding a molecule
(ADP/ATP, GDP/GTP, calcium, DNA, etc.) or
another protein.
PROSITE
steps in the development of a pattern:
• Finding a core pattern : 4-5 biologically
significant residues.
• Test the pattern on a large database.
• If lucky – there is correlation in this
region which indicates a good pattern.
• Mostly, there is no correlation :
– Gradually increase the size of the pattern.
– Search for other patterns.
PROSITE – An example
This pattern is small and would probably
pick up too many false positive results :
ALRDFATHDDF
SMTAEATHDSI
ECDQAATHEAS
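The three sequences above share the residues A-T-H followed by D or E. Assuming the small pattern in question is A-T-H-[DE] (inferred from the sequences, not stated on the slide), the sketch below converts a PROSITE-style pattern into a Python regular expression and locates it. The converter handles only a subset of the syntax and is not PROSITE's own scanning tool.

```python
import re


def prosite_to_regex(pattern):
    """Convert a simple PROSITE-style pattern into a regular expression.

    Supported subset: '-' separators, 'x' wildcards, [..] allowed sets,
    {..} forbidden sets, and (n) or (n,m) repeat counts.
    """
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # Separate an optional repeat count, e.g. x(2) or x(2,4).
        match = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = match.group(1), match.group(2)
        if core == "x":
            core = "."
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"
        parts.append(core + ("{" + repeat + "}" if repeat else ""))
    return "".join(parts)


def find_hits(pattern, sequences):
    """Return (sequence, 0-based start, matched text) for the first hit in each sequence."""
    regex = re.compile(prosite_to_regex(pattern))
    hits = []
    for seq in sequences:
        found = regex.search(seq)
        if found:
            hits.append((seq, found.start(), found.group()))
    return hits
```

Calling find_hits("A-T-H-[DE]", ["ALRDFATHDDF", "SMTAEATHDSI", "ECDQAATHEAS"]) reports a hit at 0-based position 5 in each sequence. A four-residue pattern like this would also match many unrelated proteins by chance, which is the point the slide makes.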
Patterns - small regions, high sequence
similarity.
Profiles – characterize a protein family or
domain over its entire length.
Research: Finding new domain families
Automatic methods
• The team started with 107 nuclear
domains.
• Using SMART - get all proteins with at
least one of these domains, characterize
their complete domain structure.
• Regions not annotated using known
SMART domain models were extracted
with their domain context.
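Extracting the regions that no known domain model covers can be sketched as interval arithmetic. The function below is a hypothetical illustration: it assumes domain hits are given as 1-based inclusive (start, end) pairs, and the minimum region length is an assumed cutoff that the slides do not specify.

```python
def unannotated_regions(protein_length, domain_hits, min_length=30):
    """Return (start, end) spans not covered by any domain hit.

    Coordinates are 1-based and inclusive. min_length is an assumed
    cutoff; gaps shorter than this are ignored.
    """
    regions = []
    next_free = 1  # first position not yet covered by a domain
    for start, end in sorted(domain_hits):
        if start - next_free >= min_length:
            regions.append((next_free, start - 1))
        next_free = max(next_free, end + 1)
    if protein_length - next_free + 1 >= min_length:
        regions.append((next_free, protein_length))
    return regions
```

For a 300-residue protein with overlapping hits at (50, 120) and (100, 180), the function returns [(1, 49), (181, 300)].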
Finding new domain families:
Automatic methods
• Grouping proteins by region similarity.
• Finding homologs using PSI-BLAST on the
longest member of every group (threshold E-value < 0.001).
• Finding domain organization via SMART.
• Homologous regions – candidates for a
novel domain family.
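Two of the steps above, picking the longest region in each group as the query and keeping only hits below the E-value threshold, can be sketched as follows. The data structures are hypothetical, and the code does not run PSI-BLAST itself; it only filters results that are assumed to be available already.

```python
def longest_per_group(groups):
    """Pick the longest region sequence in each group as the search query.

    groups: dict mapping a group name to a list of region sequences
    (hypothetical structure).
    """
    return {name: max(regions, key=len) for name, regions in groups.items()}


def significant_hits(hits, threshold=0.001):
    """Keep identifiers of hits whose E-value is below the threshold.

    hits: list of (identifier, e_value) pairs (hypothetical structure).
    """
    return [identifier for identifier, e_value in hits if e_value < threshold]
```

For example, significant_hits([("p1", 1e-10), ("p2", 0.05)]) keeps only "p1".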
Finding new domain families:
107 nuclear domains
→ finding proteins (SMART)
→ regions not known by SMART
→ group regions
→ finding homologs (PSI-BLAST)
→ domain architecture (SMART)
→ manual inspection, more searches
Finding new domain families:
Manual confirmation
• Different context – novel module family.
• Proteins with nuclear AND extracellular
domains excluded.
• Multiple alignments and known locations of
domains – definition of domains’ borders.
• Automatic searches to find more members (E-value < 0.1), followed by manual checks.
• Marginal similarity to domain family – possible
divergent family.
Prediction of Function:
Chromatin-Binding Domains
• The protein SPT6, which contains a CSZ domain,
regulates transcription through a histone-binding capability.
• It also contains two other types of domains,
which are unlikely to bind histones.
• Therefore it was predicted that the CSZ domain
has that function.
Research :
• Arabidopsis protein – UBA domain at the N-terminus.
• A PSI-BLAST search with the C-terminal region (E-value < 10^-5) found UBX-containing proteins
and metazoan homologs of PNGases.
• PNGases – proteins involved in the
UPR.
• UPR – unfolded protein response.
• PUG – the homologous regions.
• PUG domains are found in proteins
with domains central to ubiquitin-mediated proteolysis (UBA and
UBX).
Conclusion :
PUG-containing proteins might link the
UPR to ubiquitin-mediated protein
degradation.
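[Figure: domain architectures of PUG-containing proteins: PUG with UBA, PUG with UBX, PUG with UBCc, and PNGases with PUG. Labels: "Believed to have a role in the UPR" and "Domains central to ubiquitin-mediated proteolysis".]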
[Figures: UBX domain from human FAF1 (apoptosis); C-terminal UBA domain of the human homologue of RAD23A, HHR23A (DNA-binding protein)]
• Orthologs of PNGases in metazoans are
present singly (not as multiple paralogs) and are
likely to have similar cellular localization.
• The ortholog in Saccharomyces cerevisiae
is known to be localized mainly in the
nucleus.
It is therefore likely that metazoan PNGases are localized in
the nucleus too.
• An HMM built from the PUG domain shows marginal similarity to
IRE1p-like kinases, which are known to
initiate the UPR as well.
• The authors suggest the presence of divergent
PUG domains in the C termini of these
proteins.
• Analysis revealed a conserved region in
metazoan PNGases. It was named PAW and added to
SMART.
• The team found 28 novel nuclear domain
families.
• Most of them with representatives in
diverse molecular context in different
species.
• Some specific to single species.
• Others divergent members of previously
recognized families.
The End