Day 6 Carlow Bioinformatics
School B&I TCD Bioinformatics
Proteins: structure, function, databases, formats
Wot's a protein, then?
Hierarchical
• A collection of amino acids (0-D)
– AACompIdent can identify a protein from AA%s
• A sequence (string) of AAs (1-D)
• 2ndry structural elements: α-helix etc. (2-D)
• Domains – (independent) functional units
• Whole Protein (from single CDS) (3-D)
• Quaternary structure: dipeptides, ribosomes
• Interactome, pathways
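The "0-D" view above (a protein as a bag of amino acids) can be sketched in a few lines. This is only an illustration of computing AA%s; it is not how AACompIdent itself is implemented, and the peptide used is a toy string rather than a real protein.

```python
from collections import Counter

def aa_composition(seq):
    """Return {amino acid: percentage} for a one-letter protein sequence."""
    seq = seq.upper()
    counts = Counter(seq)
    return {aa: 100.0 * n / len(seq) for aa, n in sorted(counts.items())}

# Toy 10-residue peptide (illustrative only, not a real protein)
comp = aa_composition("LVMFWSIVGE")
for aa, pct in comp.items():
    print(f"{aa}: {pct:.0f}%")
```

A composition-based identifier would then compare such a percentage vector against the precomputed compositions of database entries.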
Protein functions
Amino acid properties
again … and again and again
Amino acid groups
• KR (Lys Arg) NH3+ basic
• DE (Asp Glu) COO- acidic
• WYF (Trp Tyr Phe) large aromatic
• GP (Gly, Pro) α-helix breaking
• C (Cys) disulphide –S–S– bridges
– but not every C is in a disulphide bridge
• etc.
Secondary structure
• α-helix (no Pro, Gly)
– Easy, like exon prediction
– 3.6 residues per turn
– Leucine zipper …LXXXXXXLXXXXXXL…
– Amphipathic helix (charged on one side)
– Transmembrane (α-helix, hydrophobic, ~21 AA long)
• β-sheet
– 2 dimensional zigzag
• Coil, random
• Turn (kink)
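The leucine zipper pattern above (L, six residues of any kind, L, ...) maps directly onto a regular expression. The sketch below is an illustration only: the minimum of three leucines and the toy sequence are assumptions made here, not a published threshold or a real protein.

```python
import re

# Heptad repeat: L followed by at least two more "six residues then L" units.
# The minimum repeat count (2) is an illustrative choice.
LEU_ZIPPER = re.compile(r"L(?:.{6}L){2,}")

def find_leucine_zippers(seq):
    """Return (1-based start, matched text) for each candidate zipper."""
    return [(m.start() + 1, m.group()) for m in LEU_ZIPPER.finditer(seq)]

# Toy sequence with a leucine every 7th residue (hypothetical)
toy = "MK" + "LAEKVRE" * 3 + "LGS"
hits = find_leucine_zippers(toy)
print(hits)
```

A match of this kind is only a hint: the spacing can occur by chance, so it would normally be checked against predicted helices.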
Patterns to recognise
(more reliable in an MSA than in a single sequence)
MSA improves 2ndary structure (α-helix, β-sheet) prediction by >6%
• Alternate hydrophobic residues
– Surface β-sheet (zig-zag-zig-zag)
• Runs of hydrophobic residues
– Interior/buried β-sheet
• Residues with ~3.6 AA spacing (amphipathic)
– α-helix WNNWFNNFNNWNNNF
• Gaps/indels
– Probably surface not core
Conserved residues
• W,F,Y large hydrophobic, internal/core
– conserved WFY best signal for domains
• G,P turns, can mark the end of an α-helix or β-sheet
• C conserved with reliable spacing suggests C-C disulphide bridges, e.g. defensins
• H,S often catalytic sites in proteases (and other enzymes)
• KRDE charged: ligand binding or salt-bridge
• L very common AA but not conserved
– except in Leucine zipper L234567L234567L234567L
Basic information
How big is my protein?
Where are the β-sheets?
Is there a signal peptide?
Is there a trypsin cleavage site?
• ProtParam tool (MWt etc.)
• TMpred, TMHMM: transmembrane helices, inside/outside, external loops.
• JPRED for 2-D (secondary) structure
• see practical manual for examples
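The trypsin question above can be answered with a one-line pattern. The sketch below assumes the usual textbook rule (trypsin cuts on the C-terminal side of K or R, but not when the next residue is P); real digests have further exceptions, and the peptide is a toy example.

```python
import re

def trypsin_sites(seq):
    """1-based positions of K/R after which trypsin is expected to cut."""
    return [m.start() + 1 for m in re.finditer(r"[KR](?!P)", seq)]

def digest(seq):
    """Split seq into the peptides implied by trypsin_sites()."""
    cuts = [0] + trypsin_sites(seq) + [len(seq)]
    return [seq[a:b] for a, b in zip(cuts, cuts[1:]) if a < b]

toy = "MAKRPLGSRWK"   # hypothetical peptide
sites = trypsin_sites(toy)
peptides = digest(toy)
print(sites, peptides)
```

Note that the R at position 4 is skipped because it is followed by P.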
Tertiary structure
• The holy grail of bioinformatics
– Difficult, like gene prediction
• 3-D orientation of known α-helices, β-sheets
• Proteins made of functional units "domains"
– Tried and tested modules
– Domain shuffling and exon boundaries
• Bioinf tries to make predictive calls on aspects of the 3-D structure
• Q. Why is 3-D important?
  A. Structure = function
What binf can do about 3-D
• Expressed/exported proteins have a signal peptide
• Hydropathy plot, antigenicity index, amphipathicity: get a handle on surface probability
• But homology to known 3-D structure (X-ray, NMR) is the best predictor – threading.
• Plan to X-ray all "folds" in human genome.
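A hydropathy plot of the kind mentioned above is just a sliding-window average over a per-residue scale. The sketch below uses the Kyte-Doolittle scale; the window size of 9 and the toy sequence are assumptions chosen for illustration.

```python
# Kyte-Doolittle hydropathy values (positive = hydrophobic)
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydropathy_profile(seq, window=9):
    """Mean hydropathy of each full window, reported at its 1-based centre residue."""
    half = window // 2
    profile = []
    for start in range(len(seq) - window + 1):
        mean = sum(KD[aa] for aa in seq[start:start + window]) / window
        profile.append((start + half + 1, mean))
    return profile

# Toy sequence: polar ends around a hydrophobic stretch (hypothetical)
toy = "MKKDESRR" + "LIVALLVIAGLLVAIVLLA" + "RRKDESNQ"
profile = hydropathy_profile(toy)
peak_pos, peak_score = max(profile, key=lambda item: item[1])
print(f"most hydrophobic window centred on residue {peak_pos}: {peak_score:.2f}")
```

Strongly positive stretches of about 20 residues are candidate transmembrane helices; negative regions are more likely to be on the surface.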
recA
SwissProt/UniProt
Some of the 194 lines of info in a SwissProt entry
ID   RECA_ECOLI     STANDARD;      PRT;   352 AA.
AC   P0A7G6; P03017; P26347; P78213;
RX   MEDLINE=92114994; PubMed=1731246;
RA   Story R.M., Weber I.T., Steitz T.A.;
RT   "The structure of the E. coli recA protein";
RL   Nature 355:318-325(1992).
DR   EMBL; V00328; CAA23618.1; -; Genomic_DNA.
DR   PDB; 2REB; X-ray; @=-.
DR   PRINTS; PR00142; RECA.
DR   ProDom; PD000229; RecA; 1.
DR   SMART; SM00382; AAA; 1.
DR   TIGRFAMs; TIGR02012; tigrfam_recA; 1.
DR   PROSITE; PS00321; RECA_1; 1.
FT   HELIX        72     85
FT   TURN         86     87
FT   STRAND       90     94
FT   HELIX       101    106
UniProt is the key hub of Bioinformatics databases
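Because every line of a SwissProt entry starts with a two-letter code, the flat file is easy to pull apart by hand. The sketch below is a minimal parser for a few of the lines shown on the slide; the function name is made up here, and a real project would more likely use an existing parser such as the one in Biopython.

```python
from collections import defaultdict

# A few lines of the RECA_ECOLI entry, as shown on the slide
ENTRY = """\
ID   RECA_ECOLI     STANDARD;      PRT;   352 AA.
AC   P0A7G6; P03017; P26347; P78213;
DR   PDB; 2REB; X-ray; @=-.
DR   PROSITE; PS00321; RECA_1; 1.
FT   HELIX        72     85
FT   TURN         86     87
FT   STRAND       90     94
FT   HELIX       101    106
"""

def parse_flatfile(text):
    """Group line contents by their two-letter line code."""
    fields = defaultdict(list)
    for line in text.splitlines():
        if len(line) >= 2:
            fields[line[:2]].append(line[2:].strip())
    return fields

fields = parse_flatfile(ENTRY)
accessions = [a.strip() for a in fields["AC"][0].split(";") if a.strip()]
# Cross-references: database name -> primary identifier in that database
xrefs = {v.split(";")[0]: v.split(";")[1].strip() for v in fields["DR"]}
# Secondary structure features: (type, start, end)
features = [(kind, int(start), int(end))
            for kind, start, end in (v.split() for v in fields["FT"])]
print(accessions, xrefs, features, sep="\n")
```

The DR lines are what make UniProt a hub: each one is a link out to another database.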
Homology?
LVMFWSIVGE  Known1
L   W   GE
LIVYWTVIGE  Unknown  40% ID

ILVFYTVVGD  Known2
  V  TV G
LIVYWTVIGE  Unknown  40% ID
Is Unknown part of the same family?
Or is this just a 4/10 co-incidence?
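The 40% figures above can be reproduced by counting identical columns. This is a minimal sketch for already-aligned, equal-length sequences; the function names are chosen here for illustration.

```python
def match_line(a, b):
    """Show identical residues, with spaces where the two sequences differ."""
    return "".join(x if x == y else " " for x, y in zip(a, b))

def percent_identity(a, b):
    """Percent identical positions between two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)

known1, known2, unknown = "LVMFWSIVGE", "ILVFYTVVGD", "LIVYWTVIGE"
for name, known in (("Known1", known1), ("Known2", known2)):
    print(name, repr(match_line(known, unknown)), percent_identity(known, unknown))
```

Note that the two comparisons agree at different positions, which is exactly why identity against single sequences is a weak test of family membership.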
RegEx
LVMFWSIVGE  Known1
ILVFYTVVGD  Known2
RegEx: [MILV](3)-[FYW](2)-[STA]-[MILV]-V-G-[DE]
LIVYWTVIGE  Unknown
******* **
More convincing that it is the same family?
How to modify the RegEx to include the 3rd sequence?
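The pattern above can be tested mechanically by translating the Prosite-style notation into an ordinary regular expression. The converter below is a sketch that handles only the elements used on these slides (single residues, [..] classes, x wildcards and (n) repeats), not the full Prosite syntax. The widened pattern at the end is one possible answer to the question, not the only one.

```python
import re

def prosite_to_regex(pattern):
    """Translate a simple Prosite-style pattern into Python regex syntax."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        m = re.fullmatch(r"(\[[A-Z]+\]|[A-Zx])(?:\((\d+)\))?", element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        residue, count = m.groups()
        residue = "." if residue == "x" else residue
        parts.append(residue + (f"{{{count}}}" if count else ""))
    return "".join(parts)

slide_pattern = "[MILV](3)-[FYW](2)-[STA]-[MILV]-V-G-[DE]"
# One possible widening: allow I as well as V at position 8
widened_pattern = "[MILV](3)-[FYW](2)-[STA]-[MILV]-[IV]-G-[DE]"

seqs = {"Known1": "LVMFWSIVGE", "Known2": "ILVFYTVVGD", "Unknown": "LIVYWTVIGE"}
for label, pattern in (("slide", slide_pattern), ("widened", widened_pattern)):
    regex = re.compile(prosite_to_regex(pattern))
    print(label, regex.pattern,
          {name: bool(regex.fullmatch(s)) for name, s in seqs.items()})
```

Every widening of this kind makes the pattern less specific, which is the trade-off behind false positives and false negatives in pattern databases.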
Family Databases
Three methods
Prosite
• Groups families by a conserved motif, which is
• Present in all family members
• Absent in all other proteins
• No/few false positives (selectivity)
• All true positives (sensitivity)
• Motif defined with a Regular expression
What Prosite looks like, cf. SwissProt
ID   RECA_1; PATTERN.
AC   PS00321;
DT   APR-1990 (CREATED); NOV-1997
DE   recA signature.
PA   A-L-[KR]-[IF]-[FY]-[STA]-[STAD]-[LIVMQ]-R.
NR   /RELEASE=49.0,207132;
NR   /TOTAL=281(281); /POSITIVE=279(279); /UNKNOWN=0(0);
NR   /FALSE_POS=2(2); /FALSE_NEG=11; /PARTIAL=10;
DR   Q01840,RECA1_LACLA,T; P48291,RECA1_MYXXA,T;
DR   P48292,RECA2_MYXXA,T; Q9ZUP2,RECA3_ARATH,T;
     Etc for 70 lines
DR   Q7UJJ0,RECA_RHOBA ,N; Q9EVV7,RECA_STRTR ,N;     (False negatives)
DR   Q4X0X6,EXO70_ASPFU,F; Q5AZS0,EXO70_EMENI,F;     (False positives)
3D   2REB; 2REC;                                     (PDB structures)
DO   PDOC00131;                                      (Documentation)
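The NR lines of the entry give enough numbers to check how well the recA pattern performs. The sketch below uses the counts from the slide (279 true positives, 2 false positives, 11 false negatives) and ignores the partial sequences. It reads the slide's "selectivity" as the fraction of hits that are genuine, which is an interpretation made here.

```python
# Counts taken from the NR lines of PS00321 as shown on the slide
true_pos, false_pos, false_neg = 279, 2, 11

# Fraction of pattern hits that really are recA (few false positives)
selectivity = true_pos / (true_pos + false_pos)
# Fraction of known recA sequences that the pattern finds (few false negatives)
sensitivity = true_pos / (true_pos + false_neg)

print(f"selectivity: {selectivity:.1%}")   # 99.3%
print(f"sensitivity: {sensitivity:.1%}")   # 96.2%
```

So on these figures the pattern already misses about 4% of known recA sequences.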
Prosite problems
• RegEx now breaking down as the number of recA sequences grows, so it no longer defines the protein
• Database now huge, so the probability of finding any short motif by chance is high.
• Many copies of ELVIS hiding in UniProt
• May be more than 1 motif defining a family
• A great first attempt and still useful but too crude
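The "ELVIS" point can be made roughly quantitative. The sketch below assumes all 20 amino acids are equally likely and independent, treats the database as one long string, and uses a database size of 100 million residues. All three are simplifying assumptions made here for illustration, and the size is not the real size of UniProt.

```python
from math import prod

def expected_chance_hits(class_sizes, n_residues, alphabet=20):
    """Expected chance matches to a motif in n_residues of random sequence.

    class_sizes[i] is the number of amino acids allowed at motif position i.
    """
    p_match = prod(size / alphabet for size in class_sizes)
    n_windows = max(n_residues - len(class_sizes) + 1, 0)
    return n_windows * p_match

DB_RESIDUES = 100_000_000  # hypothetical database size

elvis = expected_chance_hits([1] * 5, DB_RESIDUES)           # E-L-V-I-S
# A-L-[KR]-[IF]-[FY]-[STA]-[STAD]-[LIVMQ]-R
reca = expected_chance_hits([1, 1, 2, 2, 2, 3, 4, 5, 1], DB_RESIDUES)
print(f"ELVIS: about {elvis:.0f} chance hits; recA pattern: about {reca:.2f}")
```

Under these assumptions a 5-residue exact motif turns up dozens of times by chance, while the 9-position recA pattern is still rare but no longer negligible as the database grows.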
Prints
• A database of multiple domains/motifs.
• Multiple motifs abstracted to database
• Stored as probability matrix
• If two proteins have the same motifs in the same order they are likely to be homologous.
• More biological/real/sensitive than ProSite
ProDom
• A French DB
• All-against-all search of the nr protein DB.
• Includes domains with no known function
– cf synteny of non coding regions
• Great for determining the domain structure of a particular protein.
Pfam
• Moves up from the short, highly conserved, easily aligned bits of a protein family.
• Uses PSSM (position-specific scoring matrix)
• … on complete aligned family members
PSSM
• Multiple sequence alignment:
  1234567890
  NSGTIVFLWP
  DSGTAIFLKP
  ESGTIIFLHN
  DSDTVRSLKP

  Posn1    50% D,N,E
  Posn2   100% S
  Posn3    75% G,D
  Posn4   100% T
  Posn5    50% I,A,V
  Posn6    50% I,V,R
  Posn7    75% F,S
  Posn8   100% L
  Posn9    50% K,H,W
  Posn0    75% P,N
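The per-position percentages above can be computed directly from the alignment columns. The sketch below builds a plain frequency profile and a simple additive score. A real PSSM would normally use log-odds against background frequencies, with pseudocounts for unseen residues; that step is left out here.

```python
from collections import Counter

MSA = ["NSGTIVFLWP", "DSGTAIFLKP", "ESGTIIFLHN", "DSDTVRSLKP"]

def column_frequencies(msa):
    """One {residue: frequency} dict per alignment column, most common first."""
    n = len(msa)
    return [{aa: count / n for aa, count in Counter(col).most_common()}
            for col in zip(*msa)]

def profile_score(profile, seq):
    """Sum of the profile frequencies of seq's residues (unseen residues score 0)."""
    return sum(col.get(aa, 0.0) for col, aa in zip(profile, seq))

profile = column_frequencies(MSA)
for pos, col in enumerate(profile, start=1):
    print(pos, ", ".join(f"{aa} {freq:.0%}" for aa, freq in col.items()))

print(profile_score(profile, "DSGTIIFLKP"))  # consensus-like sequence
```

Unlike a regular expression, the profile gives partial credit: a sequence that differs at one position still scores well instead of failing outright.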
Domain take home
• Run your protein against
– InterProScan
– CD server at NCBI
– Pfscan
• Likely that the crucial bit of info is in only one of the above.