introtobioinformatics_includingdemo_28_9_2011

Introduction to Bioinformatics
Monday 26th September 2011
EMBO Practical Course: Protein Bioinformatics Tools
September 25th - 30th 2011
EMBL Heidelberg
Aidan Budd
EMBL Heidelberg,
Germany
Niall Haslam
University College Dublin
Conway Institute, Ireland
Aidan Budd, EMBL Heidelberg
Introduction to the Introduction....
Why include such a session in this course?
Haven't we all had "Introduction to Bioinformatics" courses in
our studies, or have quite some experience of the topic
already?
Hands up who feels that this describes their situation...?
1. Because of the diversity of your backgrounds and experience
• learning occurs in the context of our own specific set of previous
experiences
• different people have different understandings of the same
terms etc.
• exploring your (and our) understanding of some key/basic
bioinformatics ideas helps:
  • identify and address possible misconceptions that might hinder
  learning more sophisticated ideas/content
  • focus your learning on the most important topics and issues for
  you (rather than us just guessing what you might need help with)
• some of you will need this less than others: those with more
experience, please help those around you
2. To demonstrate general principles of how bioinformaticians address
problems:
• show how we link tools together within the context of larger analyses
• highlight the kinds of patterns/information that we tend to focus
on
• experts in a field are better at noticing important information/patterns
in the data they work with
• highlighting the patterns we notice when working with tools may
help you to start spotting similar patterns, i.e. becoming more
expert in the topic
Exploring Your Experience With
Bioinformatics
3 questions on the next slides aim to help you (and us) explore
your current ideas, level of confidence, and understanding of
bioinformatics
For each question:
1. I'll present an example answer
2. You'll spend a few minutes writing (laptop, paper, desktop
computer) your own answers to these questions
3. You'll discuss these answers with your neighbours - explaining
them to each other, and identifying shared understanding (and
problems with understanding)
4. We'll solicit and discuss answers from the class, focusing on
answers/problems shared by several trainees
Question 1:
Useful Bioinformatics Resources
Which bioinformatics resource(s) have been most useful
to you in your work so far?
Why are they so important (think about what would be more
difficult/impossible if these tools did not exist)?
Example:
BLAST
without BLAST (or similar pairwise alignment/sequence similarity
search tools) it would be difficult to
• identify records within a database corresponding to my protein
molecule/sequence of interest
  • relying instead on text-based searches, which can be
  problematic
• obtain suggestions of specific hypotheses for the function of novel
sequences
Other possible tools:
• UniProt
• ENSEMBL
• PFAM
• etc.
Question 2:
Common Problems
Are there any common problems you have
encountered while using bioinformatics tools?
How have you tried to deal with these problems?
Example:
Data records/resources changing with time/disappearing,
meaning I can't reproduce my earlier results
One way I try to deal with this problem is to keep copies of
the original files I downloaded - in particular the sequence
(not just the identifiers) of any proteins/DNA regions of
interest
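One way to put this habit into practice is sketched below in Python. It is only an illustration, not part of the course material: the function names, file layout and provenance fields are my own assumptions. It saves the sequence itself alongside a note of where and when it was obtained, plus a checksum, so that a later download can be compared with the archived copy.

```python
import hashlib
import json
from datetime import date
from pathlib import Path


def archive_sequence(accession, sequence, source, out_dir):
    """Save a sequence plus provenance so an analysis can be repeated later."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    sequence = "".join(sequence.split()).upper()  # drop whitespace and newlines

    # Store the sequence itself, not just its identifier.
    fasta_path = out_dir / f"{accession}.fasta"
    fasta_path.write_text(f">{accession}\n{sequence}\n")

    # Record where and when it was obtained, plus a checksum of the sequence.
    provenance = {
        "accession": accession,
        "source": source,
        "downloaded": date.today().isoformat(),
        "length": len(sequence),
        "sha256": hashlib.sha256(sequence.encode("ascii")).hexdigest(),
    }
    (out_dir / f"{accession}.json").write_text(json.dumps(provenance, indent=2))
    return provenance


def sequence_changed(provenance, current_sequence):
    """Return True if a freshly downloaded sequence differs from the archived one."""
    current = "".join(current_sequence.split()).upper()
    digest = hashlib.sha256(current.encode("ascii")).hexdigest()
    return digest != provenance["sha256"]
```

If the record later changes or disappears, the archived FASTA file still holds the exact sequence used, and `sequence_changed` shows whether a new download matches it.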
Question 3:
Key Knowledge/Experience/Tricks
What bioinformatics knowledge/experience/tricks have
you learnt that you wish you had been taught at the start
of your research career?
How have these ideas been useful for you?
Example:
Realising that almost all bioinformatics tools and resources aim to
address either one or both of two key questions has often helped me
in my work i.e.:
• what experimental data has been reported concerning my entity
(protein) of interest [e.g. much of the data in UniProt]
• what predictions can I make about the structure/function of my entity
(protein) of interest [e.g. BLAST, IUPRED]
Many different bioinformatics resources, no time to learn about them
all!
Knowing this helps me identify the questions a tool aims to address,
It also helps to know that, when using bioinformatics in my analysis,
it is useful to frame the questions I ask in terms of these two
kinds of question
Example:
Using an accession number (a unique identifier of a record within
a database) allows me to unambiguously identify the record I
want from a data resource.
Searches with non-unique identifiers can return several very
different entities from a search, where several of them do not
correspond to the entity I want to identify - using unique identifiers
avoids this problem.
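The difference can be illustrated with a small Python sketch. The records below are invented purely for illustration (the "TOY" accessions are not real database entries), and the two function names are my own. A text search on a name returns several different entities, while a lookup by accession returns exactly one record.

```python
# Toy records, invented for illustration only; the accessions are not real.
RECORDS = [
    {"accession": "TOY00001", "name": "kinase A", "organism": "Danio rerio"},
    {"accession": "TOY00002", "name": "kinase A", "organism": "Homo sapiens"},
    {"accession": "TOY00003", "name": "kinase A-like", "organism": "Danio rerio"},
]


def search_by_name(records, text):
    """Free-text search: returns every record whose name contains the text."""
    text = text.lower()
    return [r for r in records if text in r["name"].lower()]


def get_by_accession(records, accession):
    """Accession lookup: returns exactly one record or raises an error."""
    matches = [r for r in records if r["accession"] == accession]
    if len(matches) != 1:
        raise KeyError(f"expected one record for {accession}, found {len(matches)}")
    return matches[0]


name_hits = search_by_name(RECORDS, "kinase A")       # three different entities
one_record = get_by_accession(RECORDS, "TOY00002")    # exactly one record
```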
Knowledge/Experience/Tricks
Knowledge/experience/tricks on doing successful bioinformatic
analyses are some of the more useful things you could take from a
course like this.
Thus, we'll now present some that we've found useful in our own
work.
After they've been presented, we will ask you to read through them
and (quickly) discuss what you understand by them with your
neighbours, to try and highlight any major misunderstandings.
Then we will demonstrate an example of a bioinformatic analysis
that illustrates many of these points, and how they are built into
a "complete" analysis.
If you notice some of these tricks etc. being used in the analysis but
not commented on by us, please note them and we'll discuss them
at the end of the demonstration
Diversity of Bioinformatics Resources
There are a great many different bioinformatics resources available,
and they change with time, sometimes dramatically...
• too many for me to know them all
• for those I know, I usually don't have time to spend understanding
everything about how they work, what can be done with them, all their
features, etc.
Thus, becoming better at the following tasks helps make me a more
efficient and confident bioinformatician:
• identifying/searching for/finding those resources that can help my
research
• quickly judging whether or not a tool is likely to be useful for my research
• spotting when I've learnt enough about a tool, so that I can use it
reasonably effectively
• knowing that not understanding (all about) how a tool works is not a
failure - it's normal - what's important is deciding whether you need
to
The Two Key Bioinformatics
Questions
We already discussed this as an example answer.
To remind you, I think the key questions are:
• what experimental data has been reported concerning my entity
(protein) of interest [e.g. much of the data in UniProt]
• what predictions can I make about the structure/function of my
entity (protein) of interest [e.g. BLAST, IUPRED]
Incomplete Overlap of Resources
Many data resources contain some of the same/similar data as
each other, i.e. have partial but not complete overlap of their
content. For example, the SwissProt databases searched by BLAST
at NCBI and at EBI on any one day might contain different sets of
sequences.
Being aware of this, I know that
• if I'm looking for something (e.g. a protein sequence) in one resource,
and can't find it there, then I may find it if I look elsewhere
• these differences exist because of:
  • different aims of the developers of different resources
  • different update schedules
  • different amounts of resources available to maintain and update
  resources
• these differences are inevitable - knowing this helps prevent me from
getting (too) frustrated when tools don't contain what I think they
should
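If you have identifier lists from two resources, the overlap described above can be checked directly. The Python sketch below is only an illustration: the identifiers are invented and the function name is my own.

```python
def compare_identifier_sets(ids_a, ids_b):
    """Summarise the overlap between identifier lists from two resources."""
    a, b = set(ids_a), set(ids_b)
    return {
        "shared": sorted(a & b),
        "only_in_a": sorted(a - b),
        "only_in_b": sorted(b - a),
    }


# Invented identifiers standing in for two downloads made on the same day.
resource_a = ["TOY00001", "TOY00002", "TOY00003"]
resource_b = ["TOY00002", "TOY00003", "TOY00004"]
report = compare_identifier_sets(resource_a, resource_b)
```

Anything listed under `only_in_a` or `only_in_b` is a record that one resource has and the other lacks, which is a hint to look elsewhere before concluding that something does not exist.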
Different Features of Different Tools
Different implementations of the same tool may have
different search features, different ways of presenting the
output etc., even if the content is the same
Related/similar/part of the previous point.
For example - the web interfaces to BLAST at EBI and NCBI are
rather different and offer different features - some things are easy to
do on one site and almost impossible on the other
So, if the implementation you're working with doesn't do what you
want, you may be able to find one that does somewhere else
The Importance of Knowing Which
Question You Want to Answer
The "right way" to use a tool depends on the question
you want to address with it
How should I use UniProt?
it depends on what you want to use it for
How should I change parameters to improve my BLAST search?
it depends on what you want to use it for
Which MSA tools should I use to align my sequences?
it depends on what you want to use it for
etc.
Thus, a clear understanding of precisely which question
you want to address helps us use tools more effectively
Importance of Accession Numbers
When Using a Text Search of a
Database
We covered this already...
Using an accession number (a unique identifier of a record within a
database) allows me to unambiguously identify the record I want
from a data resource.
Searches with non-unique identifiers can return several very
different entities from a search, where several of them do not
correspond to the entity I want to identify - using unique identifiers
avoids this problem.
The More You Know the Easier it Gets
Just having experience of recognising where identifiers are likely
to come from, and knowing about the structure of important
databases and the common errors found in them, makes it
easier to spot important patterns
For example, I recognise ENSDARG00000046048 as an
Ensembl identifier immediately, so would know where to begin
etc.
Thus, spending some time working with and exploring key
resources can be a big help when running a range of different
bioinformatics tasks
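Part of this pattern recognition can be written down as rules. The Python sketch below is a rough illustration, not a complete classifier. It assumes that Ensembl stable identifiers consist of "ENS", an optional species prefix (taken here to be three or four letters, which may not cover every species), a feature letter and 11 digits, and it uses the accession pattern given in the UniProt help pages. The function name and the small species table are my own choices.

```python
import re

# Ensembl stable IDs: "ENS", optional species prefix, feature letter, 11 digits.
# The 3-4 letter prefix length is an assumption and may not cover every species.
ENSEMBL_ID = re.compile(
    r"^ENS(?P<species>[A-Z]{3,4})?(?P<feature>[GTPE])(?P<number>\d{11})$"
)

# Accession pattern as given in the UniProt help pages.
UNIPROT_ACC = re.compile(
    r"^([OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2})$"
)

FEATURES = {"G": "gene", "T": "transcript", "P": "protein", "E": "exon"}

# Only a few species prefixes are listed; human identifiers carry no prefix.
SPECIES = {None: "Homo sapiens", "DAR": "Danio rerio", "MUS": "Mus musculus"}


def classify_identifier(identifier):
    """Guess which resource an identifier comes from, based on its format."""
    identifier = identifier.strip()
    match = ENSEMBL_ID.match(identifier)
    if match:
        species = SPECIES.get(match.group("species"), "unknown species")
        return f"Ensembl {FEATURES[match.group('feature')]} ({species})"
    if UNIPROT_ACC.match(identifier):
        return "UniProt accession"
    return "unrecognised"
```

With these rules, `classify_identifier("ENSDARG00000046048")` reports an Ensembl gene from Danio rerio (zebrafish), which tells me which resource to start with.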
Example Bioinformatics Analysis
Demonstrating the use of tools to address problems/questions must be
done in the context of a particular problem/question - because, as
already pointed out, the way to use a tool effectively always depends
on the question it is being used to address.
Scenario:
A friend working in a zebrafish lab has done a forward genetic analysis,
using a phenotype, and has identified the mutated gene.
They want to try and understand how/why the gene contributes to the
phenotype, in particular by identifying or predicting proteins that
physically interact with the gene product -
perhaps knocking these out/silencing them will have a similar
phenotype?
Would you like to:
1. Try this yourselves with no more information/ideas from me?
2. Try this yourselves with some hints on resources you might like to
try?
3. Have me demo it to you first, and then have a go yourselves? - in which
case you'll get a short written description of how I did it to try and
follow yourselves
If you try first, then do it in pairs and keep going until you get stuck,
then get help! We'll try and notice when several people are stuck,
and then we'll move on. In that case, think about what contributed
to you getting stuck.
Hints on how I would/did do it... ENSDARG00000046048
• Find the protein sequence of the ENSEMBL record
• Get the UniProt record - two ways of doing it, database cross-linking and
BLAST (note that I first try SwissProt and it's not there, but it is in UniProt)
• Read about the gene
• Look for interaction partners described in the record - via STRING maybe, but
not very strong evidence
• Look for related PDB structures in complex? Yes, we find one by BLAST at
NCBI
• Get the structure from PDB and look at it in PyMOL
• Look for a protein related to the interacting protein in zebrafish
• Get this info from PDBsum - I find it easier to get the info on which sequences
are in there compared to PDB
• Read about the interaction - is there a model for describing the interaction
modules in the two proteins? Yes, it's the FFAT motif described by the ELM
resource
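The first hint (finding the protein sequence of the Ensembl record) was done through the web interface in the demo. As a possible scripted alternative, the Python sketch below builds a request to the public Ensembl REST service. It assumes that the service at rest.ensembl.org and its `sequence/id` endpoint (with the `type`, `multiple_sequences` and `content-type` parameters) are available in their current form, and that the identifier has not been retired since the course. The function names are my own.

```python
import urllib.parse
import urllib.request

ENSEMBL_REST = "https://rest.ensembl.org"  # assumed public server address


def protein_sequence_url(stable_id):
    """Build the request URL for the protein sequence(s) of an Ensembl ID."""
    query = urllib.parse.urlencode(
        {
            "type": "protein",
            # A gene can have several protein products, so allow more than one.
            "multiple_sequences": 1,
            "content-type": "text/x-fasta",
        }
    )
    return f"{ENSEMBL_REST}/sequence/id/{urllib.parse.quote(stable_id)}?{query}"


def fetch_protein_fasta(stable_id, timeout=30):
    """Download the protein sequence(s) as FASTA text (needs network access)."""
    with urllib.request.urlopen(protein_sequence_url(stable_id), timeout=timeout) as response:
        return response.read().decode("utf-8")


url = protein_sequence_url("ENSDARG00000046048")
```

Calling `fetch_protein_fasta("ENSDARG00000046048")` would then return FASTA text that can be saved and used as the BLAST query in the next hint.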
Another example
Scenario:
A friend is working on the parasite Giardia. They want to study the role
of nucleoporin proteins in the biology of the parasite, expecting this
might be important for understanding gene regulation there etc.
They want to find the sequences of these proteins to help with their
cloning etc. They ask you for help
Discuss with your neighbours some of the ways you could begin to try
and find the sequences of these proteins
• Search for a Giardia genome resource - try a text search for nucleoporin
• Check which ones this is matching using BLAST
• Google for "nucleoporins"
• Choose one of them
• Try a BLAST at the NCBI
• If it doesn't work, what modules do SMART/PFAM predict in the sequence?
• Try using these tools to identify similar proteins in the organism
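If the genome resource lets you download its protein set as a FASTA file, the text search in the first hint can also be run locally. The Python sketch below is a minimal illustration: the function names are my own and the example records are invented, not real Giardia entries.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def search_headers(lines, keyword):
    """Return the records whose header line mentions the keyword."""
    keyword = keyword.lower()
    return [(h, s) for h, s in read_fasta(lines) if keyword in h.lower()]


# Invented records for illustration only.
toy_fasta = [
    ">toy1 putative nucleoporin",
    "MKT",
    "AYI",
    ">toy2 hypothetical protein",
    "MSS",
    ">toy3 Nucleoporin-like protein",
    "MAA",
]
hits = search_headers(toy_fasta, "nucleoporin")
```

With a real download, an open file handle can be passed in place of `toy_fasta`. A header search only finds proteins that are already annotated as nucleoporins, which is why the later hints move on to BLAST and to domain predictions from SMART/PFAM.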
Try, together with your neighbour, to find some other Giardia
nucleoporin sequences