Introductory presentation from the course
The Poor Beginners’ Guide to Bioinformatics
What we have – and don’t have...
a computer connected to the Internet (incl. Web browser)
a text editor (Notepad or better)
public databases of genomic sequences
public databases of cDNA + EST
public databases of protein sequences, structures and motifs
money for specialised software packages (this is what we don’t have)
public servers capable of (almost) anything we wish to do
Dealing with a sequence: model tasks
• basic (DNA) sequence manipulation: restriction analysis, translation…
• sequence similarity and pattern/motif searches
• gene building: modelling exon-intron structures
• protein domain searches, structure analysis
• construction and interpretation of sequence alignments
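The first task on the list (restriction analysis, translation) is normally done on a public server, but a minimal Python sketch may help show what such a server does. The function names below are illustrative only, and the example assumes the standard genetic code and an exact-match recognition site such as EcoRI (GAATTC).

```python
# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(dna):
    """Return the reverse complement of an unambiguous DNA string."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_sites(dna, site):
    """Return 0-based start positions of every occurrence of a recognition site."""
    dna, site = dna.upper(), site.upper()
    positions = []
    start = dna.find(site)
    while start != -1:
        positions.append(start)
        start = dna.find(site, start + 1)  # step by 1 to catch overlapping sites
    return positions


def translate(dna, frame=0):
    """Translate DNA in the given frame (0, 1 or 2); unknown codons become 'X'."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X")
        for i in range(frame, len(dna) - 2, 3)
    )
```

For example, `find_sites("AAGAATTCGGGAATTC", "GAATTC")` lists the EcoRI sites, and `translate(reverse_complement(seq))` reads the opposite strand.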
Notes on basic sequence handling
Make sure you have the correct format.
FASTA format is (almost) always correct.
>sequencename
thisisasequenceinfastaformat
If not, you can always use raw data.
If things don’t work, check for gaps in the sequence, empty lines, and the file extension.
BEWARE OF MICROSOFT!
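The checks above can be sketched as a short Python function. This is only an illustration: the function name and the problem messages are made up here, and it assumes that a "gap" means whitespace or any other non-letter character inside a sequence line.

```python
def check_fasta(text):
    """Return a list of (line_number, problem) tuples for a FASTA-formatted text."""
    problems = []
    lines = text.splitlines()  # also copes with Windows line endings
    if not lines or not lines[0].startswith(">"):
        problems.append((1, "first line is not a '>' header"))
    for number, line in enumerate(lines, start=1):
        if line.startswith(">"):
            if len(line.strip()) == 1:
                problems.append((number, "header has no sequence name"))
        elif line.strip() == "":
            problems.append((number, "empty line"))
        elif any(ch.isspace() for ch in line):
            problems.append((number, "whitespace inside sequence"))
        elif not all(ch.isalpha() or ch == "*" for ch in line):
            problems.append((number, "unexpected character in sequence"))
    return problems
```

An empty result means none of these simple problems was found; it does not prove that a particular server will accept the file.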
Defining a gene family…
• By overall domain structure (domains shown: FH3?, FH1, FH2)
• By domain sequence
• Based on a peptide motif
L-X-X-G-N-X-[ML]-N
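A peptide motif written this way maps directly onto a regular expression. The sketch below assumes only the subset of the notation used on this slide (single residues, X for any residue, and bracketed alternatives separated by hyphens); the function names are illustrative.

```python
import re


def motif_to_regex(motif):
    """Convert a hyphen-separated motif such as L-X-X-G-N-X-[ML]-N to a regex string."""
    parts = []
    for element in motif.split("-"):
        element = element.strip().upper()
        if element == "X":
            parts.append(".")  # any single character
        elif element.startswith("[") and element.endswith("]"):
            parts.append(element)  # set of allowed residues, same syntax in regex
        elif len(element) == 1 and element.isalpha():
            parts.append(element)  # one fixed residue
        else:
            raise ValueError("unsupported motif element: " + element)
    return "".join(parts)


def find_motif(protein, motif):
    """Return (0-based start, matched text) for every match, including overlapping ones."""
    regex = re.compile("(?=(" + motif_to_regex(motif) + "))")
    return [(m.start(), m.group(1)) for m in regex.finditer(protein.upper())]
```

For instance, `find_motif("AAALKEGNPMNAAA", "L-X-X-G-N-X-[ML]-N")` reports one hit starting at position 3.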
Sequence comparison-based searches
• Entrez “related sequences”
  easy identification of “false starts”
  no organism selection
• BLAST/FASTA
  all DNA/protein combinations
  taxonomy selection possible
  statistical data provided
  domain structure comparison available
  divergent motifs may be missed
Two methods are better than one.
Notes on all sequence comparisons, searches, alignments…
Start with the defaults (the authors know what they are doing)…
… BUT don’t be afraid to vary the parameters.
Choose a reasonable scoring matrix:
Distant sequences: low BLOSUM (e.g. BLOSUM45), high PAM (e.g. PAM250)
Closely related sequences: low PAM (e.g. PAM30), high BLOSUM (e.g. BLOSUM80)
Motif-based searches
sensitive
no statistics
only protein databases can be searched
• TAIR PatMatch
  Arabidopsis-specific
  problematic user interface
• ISREC - INSECTS
  admirable technology
  access to SwissProt and TrEMBL
  no organism selection
Some genes are more alike than others…
• A number of splicing prediction servers are available
• Agreement between different methods is a good sign but no absolute measure
• Always align ESTs if possible
• Beware of non-conventional intron boundaries (GC-AG instead of GT-AG)
• Plant data for predicting transcription start sites and transcription factor binding sites are limited
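Once an exon-intron model exists, the intron boundaries are easy to check by hand or with a few lines of code. The sketch below is an illustration only: it assumes that intron coordinates are already known (for example from an EST alignment), are 0-based with an exclusive end, and refer to the coding strand. The function name and labels are made up here.

```python
def classify_intron(genomic, start, end):
    """Classify the intron genomic[start:end] by its first and last two bases."""
    intron = genomic[start:end].upper()
    if len(intron) < 4:
        raise ValueError("intron is too short to have two boundary dinucleotides")
    boundary = intron[:2] + "-" + intron[-2:]
    if boundary == "GT-AG":
        return boundary, "conventional"
    if boundary == "GC-AG":
        return boundary, "non-conventional"
    return boundary, "other"
```

Anything reported as "other" deserves a second look at the alignment before the gene model is trusted.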
Searching for known domains/motifs
• Searching for PROSITE patterns, allowing ambiguities
• PROSITE and Pfam profile searches
• SMART, CD-Search (domains and more)
Predicting protein localisation
• predicting signal peptides/anchors
• 2 methods available
• possibility to predict organelle localisation
• transmembrane segment prediction
Alignment: “manual” or automated?
“Manual”:
locally installed, free, for Mac and PC
interactive domain definition
statistical data provided
may produce false-positive blocks (read the on-line manual!)
Automated:
“objective” results
a number of servers available
recommended for well-conserved proteins
empirical parameters (e.g. gap penalties)
bad for divergent sequences
Phylogenetic analyses
Two methods are better than one.
Your phylogeny cannot be better than your alignment.
Gaps are not data.
Always do bootstrapping (100-500 cycles).
Certain questions cannot be answered from an unrooted tree.
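Bootstrapping works by resampling the columns of the alignment with replacement and rebuilding the tree from each resampled alignment; the support for a branch is the fraction of replicates in which it reappears. The sketch below shows only the resampling step, with an illustrative function name and a plain dictionary standing in for the alignment; tree building is left to whichever phylogeny program is used.

```python
import random


def bootstrap_alignments(alignment, replicates=100, seed=None):
    """Yield resampled alignments.

    alignment maps sequence names to aligned sequences of equal length.
    Each replicate draws as many columns as the original, with replacement.
    """
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    n_columns = lengths.pop()
    rng = random.Random(seed)  # a fixed seed makes the run repeatable
    for _ in range(replicates):
        columns = [rng.randrange(n_columns) for _ in range(n_columns)]
        yield {
            name: "".join(seq[c] for c in columns)
            for name, seq in alignment.items()
        }
```

Because whole columns are drawn, each sequence keeps its residues aligned with those of the others in every replicate.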
Starting points...
• go to the Bioinformatics page: http://www2.rhul.ac.uk/~ujba110/Bioinfo.htm
• select your exercise (A,B,C,D,E)
• … and enjoy it!
If you mean it seriously:
• create your own bookmarks (seed provided on the course web page)