lecture08_12


Predicting Protein Function
DNA
RNA
protein
Biochemical function
(molecular function)
What does it do?
Kinase???
Ligase???
Page 245
Function based on
ligand binding specificity
What (who) does it bind ??
Page 245
Function based
on biological process
What is it good for ??
Amino acid metabolism?
Page 245
Function based on
cellular location
DNA
RNA
Where is it active??
Nucleolus ?? Cytoplasm??
Page 245
Function based on
cellular location
DNA
RNA
Where is the protein expressed??
Brain? Testis?
Where is it underexpressed??
Page 245
GO (gene ontology)
http://www.geneontology.org/
• The GO project aims to develop three
structured, controlled vocabularies (ontologies)
that describe gene products in terms of their
associated
• molecular functions (F)
• biological processes (P)
• cellular components (C)
Ontology is a description of the concepts and relationships that can
exist for an agent or a community of agents
Inferring protein function
Bioinformatics approach
• Based on homology
• Based on functional characteristics
“protein signature”
Homologous proteins
Rule of thumb:
Two proteins are likely homologous if they are at least 25% identical
over an alignment longer than 100 residues
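The rule of thumb can be checked on a pair of sequences that have already been aligned. The following is a minimal sketch, assuming identity is counted only over columns where both sequences have a residue (other definitions of percent identity exist). The function names, the "-" gap symbol and the default thresholds are illustrative choices, not part of the lecture.

```python
from typing import Tuple


def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> Tuple[float, int]:
    """Return (percent identity, number of compared columns) for two aligned sequences.

    Assumption: columns where either sequence has a gap are ignored.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    pairs = [(x, y) for x, y in zip(aligned_a, aligned_b) if x != gap and y != gap]
    if not pairs:
        return 0.0, 0
    identical = sum(x == y for x, y in pairs)
    return 100.0 * identical / len(pairs), len(pairs)


def passes_rule_of_thumb(aligned_a: str, aligned_b: str,
                         min_identity: float = 25.0, min_length: int = 100) -> bool:
    """Apply the slide's rule: at least 25% identity over more than 100 residues."""
    identity, compared = percent_identity(aligned_a, aligned_b)
    return identity >= min_identity and compared > min_length


if __name__ == "__main__":
    # Toy aligned pair (made up for illustration, not real proteins).
    a = "ACDEFGHIKL" * 11
    b = "ACDEFGHIKL" * 11
    print(percent_identity(a, b), passes_rule_of_thumb(a, b))
```

Passing the rule is only a heuristic: it suggests homology but does not prove it, and short alignments can reach 25% identity by chance.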
Homologous proteins
Proteins with a common evolutionary origin
Orthologs - Proteins from different species that evolved by speciation.
Hemoglobin human vs Hemoglobin mouse
Paralogs - Proteins encoded within a given species that arose from one or
more gene duplication events.
Hemoglobin human vs Myoglobin human
COGs
Clusters of Orthologous Groups of proteins
> Each COG consists of individual orthologous proteins or
orthologous sets of paralogs.
> Orthologs typically have the same function, allowing transfer
of functional information from one member to an entire
COG.
Reference: Classification of conserved genes according to
their homologous relationships (Koonin et al., NAR)
DATABASE
Inferring protein function based on
the protein signature
The Protein Signature
Expression Pattern
Where is it expressed?
Motif (or fingerprint):
• a short, conserved region of a protein
• typically 10 to 20 contiguous amino acid residues
Domain:
• A region of a protein that can adopt a three-dimensional structure
Protein Motifs
Protein motifs can be represented as a consensus or a profile
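Alignment positions 1-50:
ecblc  MRLLPLVAAA TAAFLVVACS SPTPPRGVTV VNNFDAKRYL GTWYEIARFD
vc     MRAIFLILCS V...LLNGCL G..MPESVKP VSDFELNNYL GKWYEVARLD
hsrbp  ~~~MKWVWAL LLLAAWAAAE RDCRVSSFRV KENFDKARFS GTWYAMAKKD
Pattern: GXW[YF][EA][IVLM]
Consensus: GTWYEI (observed alternatives: K at position 2, A at position 5, V or M at position 6)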
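A pattern such as GXW[YF][EA][IVLM] maps almost directly onto a regular expression. The sketch below converts the slide's pattern and searches the three sequences from the alignment (ecblc, vc, hsrbp) after stripping the gap characters. The converter only handles the simplified syntax used here (X for any residue, square brackets for alternatives); the function names are illustrative, and full PROSITE syntax has more constructs than this handles.

```python
import re
from typing import List, Tuple

# Sequences copied from the alignment; '.' and '~' are gap/padding symbols.
ALIGNED = {
    "ecblc": "MRLLPLVAAA TAAFLVVACS SPTPPRGVTV VNNFDAKRYL GTWYEIARFD",
    "vc":    "MRAIFLILCS V...LLNGCL G..MPESVKP VSDFELNNYL GKWYEVARLD",
    "hsrbp": "~~~MKWVWAL LLLAAWAAAE RDCRVSSFRV KENFDKARFS GTWYAMAKKD",
}


def pattern_to_regex(pattern: str) -> str:
    """Convert a simplified motif pattern to a regex (X or x = any residue)."""
    return pattern.replace("-", "").replace("x", ".").replace("X", ".")


def ungap(aligned: str) -> str:
    """Remove spaces and gap symbols from an aligned sequence."""
    return re.sub(r"[.~\s]", "", aligned)


def find_motif(sequence: str, pattern: str) -> List[Tuple[int, str]]:
    """Return (1-based start, matched text) for every non-overlapping match."""
    regex = re.compile(pattern_to_regex(pattern))
    return [(m.start() + 1, m.group()) for m in regex.finditer(sequence)]


if __name__ == "__main__":
    for name, aligned in ALIGNED.items():
        print(name, find_motif(ungap(aligned), "GXW[YF][EA][IVLM]"))
```

A regular expression either matches or does not, so it cannot express that some residues are more common than others at a position; that is what a profile adds.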
Searching for Protein Motifs
- PROSITE: a database of protein patterns that can be searched
by either regular expression patterns or sequence profiles.
- PHI-BLAST: searches for a specific protein sequence pattern
together with local alignments surrounding the match.
- MEME: searches for common motifs in unaligned sequences.
Protein Domains
• Domains can be considered as building blocks
of proteins.
• Some domains can be found in many proteins
with different functions, while others are only
found in proteins with a certain function.
DNA Binding domain
Zinc-Finger
Varieties of protein domains
Extending along the length of a protein
Occupying a subset of a protein sequence
Occurring one or more times
Page 228
Example of a protein with 2 domains:
Methyl CpG binding protein 2 (MeCP2)
MBD
TRD
The protein includes a Methylated DNA Binding Domain
(MBD) and a Transcriptional Repression Domain (TRD).
MeCP2 is a transcriptional repressor.
Result of an MeCP2 blastp search:
A methyl-binding domain shared by several proteins
Are proteins that share only a domain homologous?
Pfam
> Database that contains a large collection of
multiple sequence alignments of protein domains
Based on
Profile hidden Markov Models (HMMs).
Profile HMM (Hidden Markov Model)
HMM is a probabilistic model of the MSA consisting
of a number of interconnected states
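[Figure: profile HMM for alignment columns 16-19, with match states (M16-M19), insert states (I16-I19) and delete states (D16-D19)]
Match state emission probabilities:
M16: D 0.8, S 0.2
M17: P 0.4, R 0.6
M18: T 1.0
M19: R 0.4, S 0.6
Insert states I16-I19 emit any residue (X)
Transition probabilities:
M16 -> M17 50%, M16 -> D17 50%
M17 -> M18 100%, M18 -> M19 100%
D17 -> D18 100%, D18 -> M19 100%
Alignment columns 16-19 used to build the model:
DRTR
DRTS
S--S
SPTR
DRTR
DPTS
D--S
D--S
D--S
D--R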
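The match-state emission probabilities of a profile HMM can be estimated by counting residues in each alignment column. The sketch below does this for the ten aligned rows of columns 16-19 (DRTR, DRTS, S--S, SPTR, DRTR, DPTS, D--S, D--S, D--S, D--R). It uses plain frequencies with no pseudocounts, which is an assumption made to keep the arithmetic visible; real profile HMM tools smooth these estimates. The function names are illustrative.

```python
from collections import Counter
from typing import Dict, List

# Columns 16-19 of the alignment; '-' marks a gap (a delete state is used).
ROWS = ["DRTR", "DRTS", "S--S", "SPTR", "DRTR",
        "DPTS", "D--S", "D--S", "D--S", "D--R"]


def match_emissions(rows: List[str], gap: str = "-") -> List[Dict[str, float]]:
    """Per-column residue frequencies, computed over non-gap residues only."""
    emissions = []
    for column in zip(*rows):
        residues = [c for c in column if c != gap]
        counts = Counter(residues)
        emissions.append({aa: n / len(residues) for aa, n in counts.items()})
    return emissions


def gap_fractions(rows: List[str], gap: str = "-") -> List[float]:
    """Fraction of sequences that skip each column (enter its delete state)."""
    return [sum(c == gap for c in column) / len(column) for column in zip(*rows)]


if __name__ == "__main__":
    for index, probs in enumerate(match_emissions(ROWS), start=16):
        print(f"M{index}", dict(sorted(probs.items())))
    print("gap fractions:", gap_fractions(ROWS))
```

Half of the rows have a gap in column 17, which corresponds to the 50% chance of moving from M16 to the delete state D17 instead of M17.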
Pfam
> The Pfam database is based on two distinct classes
of alignments
- Seed alignments, which are deemed to be
accurate and are used to produce Pfam-A
- Alignments derived by automatic clustering of
SwissProt, which are less reliable and give rise to
Pfam-B
Physical properties of proteins
DNA binding domains have relatively high
frequency of basic (positive) amino acids
GCN4
M K D P A A L K R A R N T E A A
R R S S R A R K L Q R M
zif268
M E R P Y A C P V E S C D R R F
S R S D E L T R H I R I H T
myoD
S K V N E A F E T L K R C T S S N
P N Q R L P K V E I L R N A I R
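The claim about basic residues can be checked by simple counting. The sketch below computes the fraction of basic amino acids in the three fragments shown (GCN4, zif268, myoD). Treating K, R and H as the basic residues is an assumption (histidine is only partly charged at neutral pH), and the function name is illustrative.

```python
# Fragments copied from the slide, with the spaces between letters removed.
FRAGMENTS = {
    "GCN4":   "MKDPAALKRARNTEAA" "RRSSRARKLQRM",
    "zif268": "MERPYACPVESCDRRF" "SRSDELTRHIRIHT",
    "myoD":   "SKVNEAFETLKRCTSSN" "PNQRLPKVEILRNAIR",
}

BASIC = set("KRH")  # assumption: lysine, arginine and histidine count as basic


def basic_fraction(sequence: str) -> float:
    """Fraction of residues in the sequence that are basic (K, R or H)."""
    sequence = sequence.upper()
    if not sequence:
        raise ValueError("empty sequence")
    return sum(aa in BASIC for aa in sequence) / len(sequence)


if __name__ == "__main__":
    for name, seq in FRAGMENTS.items():
        print(f"{name}: {basic_fraction(seq):.2f} basic over {len(seq)} residues")
```

For comparison, the same function can be run on any control sequence; a fraction on its own says little without a background value from non-DNA-binding proteins.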
Transmembrane proteins have a
unique hydrophobicity pattern
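One common way to look at a hydrophobicity pattern is a sliding-window hydropathy profile. The slide does not name a method; the sketch below uses the Kyte-Doolittle hydropathy scale as an assumed choice, with a window of 19 residues, a size often used when looking for membrane-spanning segments. The function name and the window default are illustrative.

```python
from typing import List

# Kyte-Doolittle hydropathy values (positive = hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence: str, window: int = 19) -> List[float]:
    """Mean hydropathy of every full window along the sequence."""
    sequence = sequence.upper()
    if window < 1 or window > len(sequence):
        raise ValueError("window must be between 1 and the sequence length")
    values = [KYTE_DOOLITTLE[aa] for aa in sequence]  # KeyError on unknown residues
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


if __name__ == "__main__":
    # Toy sequence (made up): a hydrophobic stretch flanked by charged residues.
    toy = "KKDEKRDE" + "LIVLLAVILLVAIGLLIVA" + "KRDEEKKD"
    profile = hydropathy_profile(toy)
    peak = max(range(len(profile)), key=profile.__getitem__)
    print(f"highest window starts at residue {peak + 1}, mean {profile[peak]:.2f}")
```

Long runs of windows with a high mean are candidates for transmembrane segments; the cutoff used to call a segment is a further choice that this sketch leaves open.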
Knowledge Based Approach
• IDEA
Find the common properties of a protein
family (or any group of proteins of interest)
which are unique to the group and different
from all the other proteins.
Generate a model for the group and predict
new members of the family which have
similar properties.
Knowledge Based Approach
Basic Steps
1. Building a Model
• Generate a dataset of proteins with a common function
(DNA binding protein)
• Generate a control dataset
• Calculate the properties that are characteristic of the protein
family of interest for all the proteins in the data (both the DNA
binding proteins and the non-DNA binding proteins)
• Represent each protein in a set by a vector of calculated
features and build a statistical model to split the groups
Basic Steps
2. Predicting the function of a new protein
• Calculate the properties for a new protein
and represent them as a vector
• Predict whether the tested protein belongs to the family
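The two steps can be sketched end to end with a deliberately simple model. In the example below each protein is represented by a two-feature vector (fraction of basic and fraction of acidic residues), and the "statistical model" is a nearest-centroid classifier. Both are assumptions chosen for brevity; the lecture does not specify the features or the model. The family set reuses the three DNA-binding fragments from the slides, while the control sequences and the query are made-up toy strings, not real proteins.

```python
import math
from typing import Dict, List, Tuple

BASIC, ACIDIC = set("KRH"), set("DE")


def features(sequence: str) -> Tuple[float, float]:
    """Feature vector: (fraction of basic residues, fraction of acidic residues)."""
    sequence = sequence.upper()
    n = len(sequence)
    return (sum(aa in BASIC for aa in sequence) / n,
            sum(aa in ACIDIC for aa in sequence) / n)


def centroid(vectors: List[Tuple[float, float]]) -> Tuple[float, ...]:
    """Component-wise mean of a list of feature vectors."""
    return tuple(sum(column) / len(column) for column in zip(*vectors))


def build_model(family: List[str], control: List[str]) -> Dict[str, Tuple[float, ...]]:
    """Step 1: one centroid per group."""
    return {"family": centroid([features(s) for s in family]),
            "control": centroid([features(s) for s in control])}


def predict(model: Dict[str, Tuple[float, ...]], sequence: str) -> str:
    """Step 2: assign the new protein to the group with the closest centroid."""
    vector = features(sequence)
    return min(model, key=lambda label: math.dist(vector, model[label]))


# DNA-binding fragments from the slides (GCN4, zif268, myoD).
FAMILY = ["MKDPAALKRARNTEAARRSSRARKLQRM",
          "MERPYACPVESCDRRFSRSDELTRHIRIHT",
          "SKVNEAFETLKRCTSSNPNQRLPKVEILRNAIR"]
# Hypothetical control sequences, invented for this illustration only.
CONTROL = ["MDEELLAVGSDEAGLVDEST", "AGSLVDDEELTGAVEDLSGA", "MSTEDLLGAVDEAGSLEDVA"]

if __name__ == "__main__":
    model = build_model(FAMILY, CONTROL)
    print(predict(model, "MKRRAKLKRSARKT"))  # made-up, lysine/arginine-rich query
```

With so few sequences and features this only illustrates the workflow; a usable predictor needs large, non-redundant datasets and evaluation on proteins that were not used to build the model.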
TEST CASE
Y14 – A protein sequence translated from an ORF
(Open Reading Frame)
Obtained from the Drosophila complete Genome
>Y14
PQRSVGWILFVTSIHEEAQEDEIQEKFCDYGEIKNIHL
NLDRRTGFSKGYALVEYETHKQALAAKEALNGAEIM
GQTIQVDWCFVKG G
Y14 DOES NOT BIND RNA
Projects 2011-12
Instructions for the final project
Introduction to Bioinformatics 2011-12
Key dates
19.12 lists of suggested projects published *
*You are highly encouraged to choose a project yourself or find
a relevant project which can help in your research
29.1 Submission of project overview (PowerPoint presentation,
max 5 slides)
-Title
-Main question
-Major tools you are planning to use to answer the questions
30.1/31.1 Presentation of project overview
7.3 Poster submission
14.3 Poster presentation
2. Planning your research
After you have described the main question or questions of your
project, you should carefully plan your next steps
A. Make sure you understand the problem and read the necessary
background to proceed.
B. Formulate your working plan, step by step.
C. After you have a plan, start by extracting the necessary data and
decide on the relevant tools to use at the first step.
When running a tool, make sure to summarize the results and extract
the relevant information you need to answer your question. It is
recommended to save the raw data for your records, but don't present
raw data in your final written project.
Your initial results should guide you towards your next steps.
D. When you feel you have explored all the tools you can apply to answer your
question, you should summarize and draw conclusions. Remember, NO
is also an answer as long as you are sure it is NO. Also remember this is
a course project, not only a HW exercise.
3. Summarizing final project in a poster (in pairs)
Prepare the poster in PPT, poster size 90-120 cm
Title of the project
Names and affiliation of the students presenting
The poster should include 5 sections:
Background: should include a description of your question (can add
a figure)
Goal and Research Plan:
Describe the main objective and the research plan
Results (main section): Present your results in 3-4 figures, describe
each figure (figure legends) and give a title to each result
Conclusions: summarize the conclusions of your project in points
References: List the papers/databases/tools used for
your project
Examples of posters will be presented in class