lecture_intro_protx

Transcript lecture_intro_protx

Introduction to protein bioinformatics analysis
Juliette Hayer
November 2015
Protein databases
• Sequences: UniProt Knowledge Base (UniProtKB):
http://www.uniprot.org/
– SwissProt entries: manually annotated and reviewed
(549,646 entries in November 2015)
– TrEMBL entries: automatically annotated translations of
coding sequences (CDS) from European Nucleotide Archive
(52,783,601 in November 2015)
• Structures: Protein Data Bank (PDB):
http://www.rcsb.org/pdb/home/home.do
– 113,494 entries (November 2015): proteins, nucleic acids,
complexes
– Structure entries are files containing the spatial
coordinates of all atoms of the molecule
A SwissProt entry: description, general info
A SwissProt entry: sequence
From DNA to proteins
• Translation of a DNA sequence in 6 frames:
http://web.expasy.org/translate/
– 3 frames forward (5’3’): Frame 1, Frame 2, Frame 3
– 3 frames backward (3’5’): Frame 1, Frame 2, Frame 3
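A six-frame translation can be sketched in a few lines of Python. This is a minimal illustration, not the ExPASy implementation: it assumes the standard genetic code, upper-case A/C/G/T input, and that the three backward frames are read from the reverse complement at offsets 0, 1 and 2. The function names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna, offset=0):
    """Translate dna from offset (0, 1 or 2); '*' is a stop, 'X' an unknown codon."""
    codons = (dna[i:i + 3] for i in range(offset, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frames(dna):
    """Return a dict mapping frame labels to protein translations."""
    dna = dna.upper()
    rev = reverse_complement(dna)
    frames = {}
    for k in range(3):
        frames[f"5'3' Frame {k + 1}"] = translate(dna, k)
        frames[f"3'5' Frame {k + 1}"] = translate(rev, k)
    return frames
```

Calling six_frames("ATGGCCTAA") returns one translation per frame; trailing bases that do not fill a complete codon are ignored.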
Questions about a new protein
• Function:
– Signature of a function and/or a protein family: patterns, profiles
– Homologous proteins
• Structure:
– Physicochemical properties
– Secondary structures
– Tertiary structure (three-dimensional fold)
• Function and structure are linked:
similar sequence => similar structure => similar function
Protein Family: signatures
• Investigate relation between proteins: detection of
patterns of conservation.
• Multiple alignments are a starting point as they tend to
highlight conserved regions, or motifs, that characterize
particular families.
• If stored in a reference database, these ‘signatures’ may
be used to identify similar features in uncharacterized
sequences.
Multiple Sequence Alignment (MSA)
• Tabular description of the relationships between
proteins:
– rows represent individual sequences
– columns represent the residue positions.
• Similar residues are brought into vertical register by
introducing gaps (-), so that the relative position of
residues within the alignment is preserved.
• The result is an expression of the similarities and
dissimilarities between the sequences.
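The tabular view of an alignment can be illustrated with a small Python sketch. The toy alignment below is hypothetical, and the conservation measure (fraction of rows sharing the most common non-gap residue) is only one simple choice among many.

```python
from collections import Counter

# Hypothetical toy alignment: rows are sequences, columns are positions,
# '-' marks a gap. All rows must have the same length.
alignment = [
    "MKT-LLV",
    "MKTALLV",
    "MRT-LIV",
]


def columns(rows):
    """Return the alignment columns as a list of strings."""
    if len({len(row) for row in rows}) != 1:
        raise ValueError("all rows must have the same length")
    return ["".join(col) for col in zip(*rows)]


def conservation(rows):
    """Per column: fraction of rows sharing the most common non-gap residue."""
    scores = []
    for col in columns(rows):
        residues = [c for c in col if c != "-"]
        top = Counter(residues).most_common(1)[0][1] if residues else 0
        scores.append(top / len(col))
    return scores
```

Columns with a score of 1.0 are fully conserved; low scores point to variable or gapped positions.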
Multiple Sequence Alignment (MSA): Tools
• Clustal W:
– the sequence pairs are aligned separately in order to calculate a distance
matrix
– a guide tree is then calculated from the matrix
– the sequences are then progressively aligned according to the branching
order of the guide tree.
• T-Coffee: package that allows the combination of a collection of
multiple/pairwise, global or local alignments into a single model.
It also allows estimation of the level of consistency of each
position in the new alignment with the rest of the alignments.
• Muscle, Clustal omega, Multalign, etc.
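The first two Clustal W steps (distance matrix, then guide tree) can be sketched as follows. This is a simplified stand-in, not the Clustal W algorithm itself: distances come from shared 2-mers rather than from pairwise alignments, the tree is built by average-linkage clustering, and the progressive alignment step is not implemented. The sequences and function names are hypothetical.

```python
from itertools import combinations


def kmer_set(seq, k=2):
    """Return the set of overlapping k-mers of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def distance(a, b, k=2):
    """1 - Jaccard similarity of k-mer sets (stand-in for an alignment distance)."""
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    return 1.0 - len(ka & kb) / len(ka | kb)


def guide_tree(seqs, k=2):
    """Average-linkage clustering; returns a nested tuple of sequence names."""
    dist = {frozenset(pair): distance(seqs[pair[0]], seqs[pair[1]], k)
            for pair in combinations(seqs, 2)}
    clusters = {name: (name,) for name in seqs}  # subtree -> its leaf names

    def linkage(pair):
        x, y = pair
        ds = [dist[frozenset((a, b))] for a in clusters[x] for b in clusters[y]]
        return sum(ds) / len(ds)

    while len(clusters) > 1:
        # Merge the two closest clusters into one subtree.
        x, y = min(combinations(clusters, 2), key=linkage)
        clusters[(x, y)] = clusters.pop(x) + clusters.pop(y)
    return next(iter(clusters))


# Hypothetical example sequences.
example = {"s1": "MKTLLVAG", "s2": "MKTLLVSG", "s3": "MRTAIVAG", "s4": "PQWERTYH"}
```

The nested tuple returned by guide_tree(example) gives the branching order: the most similar pair is joined first, and the most divergent sequence is added last.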
Protein Family: Pattern recognition methods
1. Those that employ single motifs (usually encoded as
regular expressions) => PROSITE
2. Those that exploit multiple motifs => Blocks (logos),
PRINTS (fingerprints)
3. Those that use full domain alignments (generally
encoded as profiles or Hidden Markov Models)
=> Pfam (HMMs), Profile library
=> Each of these methods has been used to create different
types of reference database.
Protein family database: PROSITE
• PROSITE was the first protein family database to be
derived.
• Today, it stores 3 types of information:
– family-specific regular expressions (sometimes simply termed ‘patterns’)
– non-family-specific regular expressions (sometimes termed ‘rules’)
– profiles
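A PROSITE pattern can be converted to an ordinary regular expression, which shows how single-motif signatures are matched. The sketch below covers only the common syntax elements (x, [..], {..}, repetition counts, < and > anchors) and uses the N-glycosylation site pattern N-{P}-[ST]-{P} as an example. The function names are hypothetical, and overlapping matches are not reported.

```python
import re


def prosite_to_regex(pattern):
    """Convert a PROSITE pattern such as 'N-{P}-[ST]-{P}' to a Python regex."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        element = element.replace("{", "[^").replace("}", "]")  # forbidden residues
        element = element.replace("(", "{").replace(")", "}")   # x(2,4) -> .{2,4}
        element = element.replace("x", ".")                     # any residue
        element = element.replace("<", "^").replace(">", "$")   # terminal anchors
        parts.append(element)
    return "".join(parts)


def find_motifs(pattern, sequence):
    """Return (1-based start, matched text) for each non-overlapping match."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start() + 1, m.group()) for m in regex.finditer(sequence)]
```

For example, find_motifs("N-{P}-[ST]-{P}", sequence) lists candidate sites in a protein sequence given as an upper-case string.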
Protein family database: Domain based databases
• Two main resources use domain-based approaches for
family characterization:
– one exploits profiles (Profile library of PROSITE)
– the other uses hidden Markov models (HMMs): Pfam.
• A profile is a mathematical description of a sequence
alignment. It can be viewed as an alternating sequence of
‘match’ and ‘insert’ states.
• Pfam: domain families are stored in the form of profile
HMMs, which turn a multiple sequence alignment into a
position-specific scoring system suitable for searching
databases for remotely homologous sequences
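The idea of a position-specific scoring system can be illustrated with a simple ungapped scoring matrix built from an alignment. This is a deliberately reduced sketch: it has match positions only, with no insert or delete states, so it is not a profile HMM as used by Pfam. The uniform background frequencies and the pseudocount of 1 are assumptions, and sequences are expected to contain only the 20 standard residues.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(rows, pseudocount=1.0):
    """Per-column log-odds scores against a uniform background (an assumption)."""
    background = 1.0 / len(ALPHABET)
    pssm = []
    for col in zip(*rows):
        counts = Counter(c for c in col if c in ALPHABET)  # gaps are ignored
        total = sum(counts.values()) + pseudocount * len(ALPHABET)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in ALPHABET
        })
    return pssm


def score(pssm, segment):
    """Sum of position-specific scores for an ungapped segment."""
    return sum(col[aa] for col, aa in zip(pssm, segment))


def best_hit(pssm, sequence):
    """Slide the matrix along sequence; return (best score, 0-based start)."""
    width = len(pssm)
    return max(
        (score(pssm, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )
```

A residue that is frequent in a column scores positively at that position and negatively elsewhere, which is what allows a profile to detect family members that a single-sequence search would miss.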
Similarity search for identifying homologous proteins
• Similarity search algorithms are usually classified as
global or local:
– Global algorithms optimize the full alignment of 2 sequences,
which may include large dissimilar regions.
– Local similarity algorithms focus on conserved subsequences,
a single comparison often yielding several alignments
(dissimilar regions do not contribute to the similarity
measure).
• Local similarity tools (e.g., BLAST) are usually preferred
for database searches, where distantly-related proteins
may share only isolated regions of similarity (e.g., in the
vicinity of an active site).
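The difference between global and local similarity can be seen by computing both scores for the same pair of sequences. The sketch below uses the classic dynamic programming recurrences (Needleman-Wunsch for global, Smith-Waterman for local) with a simple match/mismatch score and a linear gap penalty. These scoring values are assumptions chosen for illustration; real tools use substitution matrices and affine gap penalties.

```python
def alignment_score(a, b, local=False, match=1, mismatch=-1, gap=-2):
    """Optimal global (default) or local alignment score with linear gaps."""
    # prev holds the previous row of the dynamic programming matrix.
    prev = [0 if local else j * gap for j in range(len(b) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0 if local else i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, prev[j] + gap, curr[j - 1] + gap)
            if local:
                cell = max(cell, 0)       # a local alignment may restart anywhere
                best = max(best, cell)    # and end anywhere
            curr.append(cell)
        prev = curr
    return best if local else prev[-1]
```

With two sequences that share only a short conserved core, for example "AAAAHELLOCCCC" and "GGGGHELLOTTTT", the global score is pulled down by the dissimilar flanks while the local score reflects only the shared region.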
Similarity search: Interpretation of results
• BLAST interpretation:
– identify high-scoring sequences, or groups of sequences, with
low probabilities that such matches may have arisen by
chance (E-value)
– Coverage
– % identity
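These three criteria can be applied programmatically to BLAST+ tabular output (-outfmt 6 with its default 12 columns). The sketch below is an illustration only: the thresholds are arbitrary example values, not recommendations from this lecture, and the query length must be supplied by the caller because it is not part of the default columns.

```python
FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def parse_hit(line):
    """Parse one tab-separated line of BLAST+ -outfmt 6 output."""
    hit = dict(zip(FIELDS, line.rstrip("\n").split("\t")))
    for key in ("pident", "evalue", "bitscore"):
        hit[key] = float(hit[key])
    for key in ("length", "mismatch", "gapopen",
                "qstart", "qend", "sstart", "send"):
        hit[key] = int(hit[key])
    return hit


def query_coverage(hit, query_length):
    """Percentage of the query covered by the aligned region."""
    return 100.0 * (hit["qend"] - hit["qstart"] + 1) / query_length


def keep_hit(hit, query_length,
             max_evalue=1e-5, min_identity=30.0, min_coverage=70.0):
    """Apply illustrative thresholds on E-value, % identity and coverage."""
    return (hit["evalue"] <= max_evalue
            and hit["pident"] >= min_identity
            and query_coverage(hit, query_length) >= min_coverage)


# Hypothetical result lines, invented for this example.
example_lines = [
    "query1\tsubjA\t62.50\t200\t75\t0\t1\t200\t5\t204\t3e-80\t250",
    "query1\tsubjB\t28.00\t50\t36\t0\t10\t59\t1\t50\t0.5\t30.1",
]
```

With a query of 210 residues, the first hypothetical hit passes all three filters, while the second fails on E-value, identity and coverage.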
Similarity search: PSI-BLAST
• Several iterations of BLAST using a profile built from the
alignment of sequences selected in the previous iteration
• Makes it possible to find distant homologs in the database
Structural aspects
• Primary structure: amino acid linear sequence
• Secondary structure: three-dimensional form of local
segments
– α-helices,
– β-strands, β-sheet,
– Random coil
These secondary structures are defined by patterns of
hydrogen bonds between the main-chain peptide groups.
They have a regular geometry, being constrained to specific
values of the dihedral angles ψ and φ
• Tertiary structure: organization/fold of these
sub-structures to form the global 3D structure
of the protein
• Quaternary structure: 3D structure of a multi-subunit
protein and how the subunits fit together
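The dihedral angles mentioned above can be computed from atomic coordinates. For residue i, φ is defined by the atoms C(i-1), N(i), CA(i), C(i) and ψ by N(i), CA(i), C(i), N(i+1). The function below is a minimal sketch that returns the dihedral angle for any four points, in degrees, with the sign convention that a clockwise rotation when looking along the central bond is positive.

```python
import math


def _sub(u, v):
    return tuple(a - b for a, b in zip(u, v))


def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def _cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])


def dihedral(p0, p1, p2, p3):
    """Dihedral angle in degrees (-180 to 180) defined by four 3D points."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)   # normals of the two planes
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))
```

Computing φ and ψ for every residue of a structure gives the points of a Ramachandran plot.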
Physicochemical properties
• Amino-acid composition (number of charged residues)
• Molecular weight
• Physicochemical property profile along the sequence
(hydropathy profiles, hydrophilic parts)
• Potential transmembrane domains
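Several of these properties can be computed directly from the sequence. The sketch below counts charged residues and computes a sliding-window hydropathy profile with the Kyte-Doolittle scale. The window of 19 residues and the threshold of 1.6 for flagging candidate transmembrane segments are commonly cited values for this scale, used here as assumptions. Function names are hypothetical, and only the 20 standard residues are handled.

```python
from collections import Counter

# Kyte-Doolittle hydropathy scale.
KD = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5,
      "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9,
      "M": 1.9, "F": 2.8, "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9,
      "Y": -1.3, "V": 4.2}


def charged_residues(seq):
    """Return (positive, negative) counts: K/R versus D/E (histidine ignored)."""
    counts = Counter(seq)
    return counts["K"] + counts["R"], counts["D"] + counts["E"]


def hydropathy_profile(seq, window=9):
    """Mean Kyte-Doolittle score of each window of the given length."""
    values = [KD[aa] for aa in seq]
    return [sum(values[i:i + window]) / window
            for i in range(len(seq) - window + 1)]


def candidate_tm_windows(seq, window=19, threshold=1.6):
    """0-based starts of windows whose mean hydropathy exceeds the threshold."""
    profile = hydropathy_profile(seq, window)
    return [i for i, value in enumerate(profile) if value > threshold]
```

Plotting hydropathy_profile(seq) against position gives the usual hydropathy plot; runs of windows returned by candidate_tm_windows are only hints and need confirmation with a dedicated predictor.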
Secondary structure prediction
• Some assumptions and simplifications are made:
– that all the information for folding is contained in the sequence
– that examining short sequence windows (e.g., 10-20 residues) is sufficient to
provide robust predictions
• There are 3 main approaches to secondary structure prediction:
– empirical statistical methods, which use parameters from known structures
– machine learning methods, which are trained using known secondary
structures
– threading or fold-recognition approaches, which seek compatible folds for a
sequence within fold template databases
=> The best approach is to use a combination of methods
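One simple way to combine methods is a per-residue majority vote over the predictions of several tools. The sketch below assumes each prediction is a string of H (helix), E (strand) and C (coil) of the same length as the sequence. Resolving ties as coil is an arbitrary choice made for this example.

```python
from collections import Counter


def consensus(predictions):
    """Per-residue majority vote over equal-length H/E/C prediction strings."""
    if len({len(p) for p in predictions}) != 1:
        raise ValueError("all predictions must have the same length")
    result = []
    for states in zip(*predictions):
        ranked = Counter(states).most_common()
        top, count = ranked[0]
        tie = len(ranked) > 1 and ranked[1][1] == count
        result.append("C" if tie else top)  # ties fall back to coil (assumption)
    return "".join(result)
```

For example, consensus(["HHHHCCEEE", "HHHCCCEEE", "CHHHHCCEE"]) keeps a state at each position only when most predictors agree on it.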
Tridimensional structure
• Protein 3D structures are available in the Protein Data Bank (PDB)
• Experimental techniques:
– X-ray crystallography
– NMR
• Data format: atomic coordinates that can be displayed
with a 3D visualization tool:
– Rasmol / Jmol
– Pymol
– DeepView (SwissPDBViewer)
• Fold classification databases:
CATH and SCOP
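The coordinate data can also be read directly, since the classic PDB text format stores each atom on a fixed-width ATOM or HETATM line. The sketch below extracts a few fields by column position. The format_atom helper exists only to build correctly aligned example lines for this illustration, and the coordinates used with it are hypothetical.

```python
def parse_atom(line):
    """Parse one ATOM/HETATM record using the fixed PDB column layout."""
    return {
        "serial": int(line[6:11]),        # columns 7-11
        "name": line[12:16].strip(),      # columns 13-16
        "res_name": line[17:20].strip(),  # columns 18-20
        "chain": line[21],                # column 22
        "res_seq": int(line[22:26]),      # columns 23-26
        "x": float(line[30:38]),          # columns 31-38
        "y": float(line[38:46]),          # columns 39-46
        "z": float(line[46:54]),          # columns 47-54
    }


def read_atoms(lines):
    """Return parsed atoms for all ATOM/HETATM records in an iterable of lines."""
    return [parse_atom(line) for line in lines
            if line.startswith(("ATOM  ", "HETATM"))]


def format_atom(serial, name, res_name, chain, res_seq, x, y, z):
    """Helper for this sketch: build a minimal, correctly aligned ATOM line."""
    return "ATOM  {:5d} {:<4s} {:3s} {}{:4d}    {:8.3f}{:8.3f}{:8.3f}".format(
        serial, name, res_name, chain, res_seq, x, y, z)
```

Passing the lines of a PDB file to read_atoms yields one dictionary per atom, from which distances or angles can then be computed.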
Tridimensional structure prediction: homology modeling
• Search for a template (BLAST, PSI-BLAST)
=> homologous protein with a solved structure
• Pairwise sequence alignment
• Building of the model according to the
alignment:
– Rigid body assembly, segment matching
(SWISS-MODEL, SegMOD)
– Satisfaction of spatial restraints (distance and
angles) => MODELLER, Geno3D
– Artificial evolution (template-based methods
with ab initio-like energy minimization
principles) => NEST
• Check the quality of the model (PROCHECK, PDBSum):
– Energetically favorable conformation
– Maximum number of residues with allowed backbone angles (Ramachandran plot)
– Minimal deviation from the template (Root Mean Square Deviation, RMSD)
• Molecular dynamics simulations (force fields)
• … etc. …
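The RMSD used to compare a model with its template is the square root of the mean squared distance between equivalent atoms. The sketch below assumes the two coordinate lists contain the same atoms in the same order and that the structures have already been superimposed; the optimal superposition step itself is not shown.

```python
import math


def rmsd(coords_a, coords_b):
    """RMSD between two equally long lists of (x, y, z) points, no superposition."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = sum((a - b) ** 2
                for p, q in zip(coords_a, coords_b)
                for a, b in zip(p, q))
    return math.sqrt(total / len(coords_a))
```

An RMSD of 0 means identical coordinates; the value is in the same unit as the input, typically ångströms for PDB files.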
