Introductory presentation from the course

The Poor Beginners’ Guide to Bioinformatics
What we have – and don’t have...
• a computer connected to the Internet (incl. Web browser)
• a text editor (Notepad or better)
• public databases of genomic sequences
• public databases of cDNA + EST
• public databases of protein sequences, structures and motifs
• money for specialised software packages
• public servers capable of (almost) anything we wish to do
Dealing with a sequence: model tasks
• basic (DNA) sequence manipulation: restriction analysis, translation…
• sequence similarity and pattern/motif searches
• gene building: modelling exon-intron structures
• protein domain searches, structure analysis
• construction and interpretation of sequence alignments
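The first task in the list (restriction analysis, translation) can be sketched in a few lines of Python. This is only an illustration, not a replacement for the public servers: the function names `find_sites` and `translate` are made up for this example, the codon table is the standard genetic code, and GAATTC (the EcoRI recognition site) is used purely as a sample.

```python
# Illustrative sketch only; function names are hypothetical.
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def find_sites(dna, site):
    """Return 0-based start positions of every (possibly overlapping) site."""
    dna, site = dna.upper(), site.upper()
    positions = []
    start = dna.find(site)
    while start != -1:
        positions.append(start)
        start = dna.find(site, start + 1)
    return positions


def translate(dna, frame=0):
    """Translate one forward reading frame (0, 1 or 2); '*' marks a stop codon."""
    dna = dna.upper()
    protein = []
    for i in range(frame, len(dna) - 2, 3):
        # Codons containing anything other than A, C, G, T become 'X'.
        protein.append(CODON_TABLE.get(dna[i:i + 3], "X"))
    return "".join(protein)


print(find_sites("GGAATTCAAGAATTC", "GAATTC"))
print(translate("ATGGCCTAA"))
```

Only the forward strand is handled here; a real restriction analysis would also consider the reverse complement and the cut position within the site.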
Notes on basic sequence handling
• Make sure you have the correct format.
• FASTA format is (almost) always correct:
>sequencename
thisisasequenceinfastaformat
• If not, you can always use raw data.
• If things don’t work, check for gaps in sequence, empty lines, and file extension.
• BEWARE OF MICROSOFT!
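The checks above (gaps, empty lines, stray formatting) can be automated before pasting a sequence into a server. The sketch below is one possible approach, not a standard tool: `parse_fasta` is a hypothetical helper, and it assumes that the typical "Microsoft" problems are Windows line endings, a byte-order mark at the start of the file, and spaces or position numbers left inside the sequence.

```python
def parse_fasta(text):
    """Parse FASTA text into {name: sequence}, tolerating common formatting problems."""
    text = text.lstrip("\ufeff")  # drop a byte-order mark some editors prepend
    records = {}
    name = None
    # splitlines() copes with Unix (\n), Windows (\r\n) and old Mac (\r) line endings.
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore empty lines
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = []
        elif name is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            # Keep letters plus '*' and '-'; drop spaces, tabs and position numbers.
            records[name].append(
                "".join(ch for ch in line if ch.isalpha() or ch in "*-")
            )
    return {n: "".join(parts) for n, parts in records.items()}


messy = "\ufeff>seq1\r\nACG T\r\n\r\n12 acgt\r\n"
print(parse_fasta(messy))
```

Writing the cleaned records back out as plain text (not as a word-processor document) gives a file most servers will accept.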
Defining a gene family…
• By overall domain structure (diagram labels: FH3?, FH1, FH2)
• By domain sequence
• Based on a peptide motif: L-X-X-G-N-X-[ML]-N
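A peptide motif written this way maps almost directly onto a regular expression, which is handy for checking a handful of sequences locally. The sketch below assumes the notation used on the slide (elements separated by hyphens, X for any residue, square brackets for alternatives); `motif_to_regex` and `find_motif` are names invented for this example, and the test protein is made up.

```python
import re


def motif_to_regex(motif):
    """Convert a hyphen-separated motif such as L-X-X-G-N-X-[ML]-N to a regex."""
    parts = []
    for element in motif.split("-"):
        if element.upper() == "X":
            parts.append(".")      # X = any residue
        else:
            parts.append(element)  # a single residue or a [..] set of alternatives
    return re.compile("".join(parts))


def find_motif(protein, motif):
    """Return (0-based start, matched text) for each non-overlapping match."""
    pattern = motif_to_regex(motif)
    return [(m.start(), m.group()) for m in pattern.finditer(protein.upper())]


print(motif_to_regex("L-X-X-G-N-X-[ML]-N").pattern)
print(find_motif("AAALKEGNAMNPP", "L-X-X-G-N-X-[ML]-N"))
```

Note that `finditer` reports non-overlapping matches only, which is usually sufficient for a motif of this length.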
Sequence comparison-based searches
• Entrez “related sequences”
  - easy identification of “false starts”
  - no organism selection
• BLAST/FASTA
  - all DNA/protein combinations
  - taxonomy selection possible
  - statistical data provided
  - domain structure comparison available
  - divergent motifs may be missed
Two methods are better than one.
Notes on all sequence comparisons, searches, alignments…
• Start with defaults (the authors know what they are doing)…
• … BUT don’t be afraid to vary the parameters
• Choose a reasonable scoring matrix:
  - Distant sequences: low BLOSUM, high PAM
  - Closely related sequences: low PAM, high BLOSUM
Motif-based searches
• sensitive
• no statistics
• only protein databases can be searched
• TAIR PatMatch
  - Arabidopsis-specific
  - problematic user interface
• ISREC - INSECTS
  - admirable technology
  - access to SwissProt and TrEMBL
  - no organism selection
Some genes are more alike than others…
• A number of splicing prediction servers are available
• Agreement between different methods is a good sign but no absolute measure
• Always align ESTs if possible
• Beware of non-conventional intron boundaries (GC-AG instead of GT-AG)
• Plant data for predicting transcription start sites and transcription factor binding sites are limited
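Once an exon-intron model exists (from a prediction server or an EST alignment), the intron boundaries are easy to check by hand or with a few lines of code. The sketch below is an assumption-laden illustration: `classify_intron` is a hypothetical helper, coordinates are 0-based with an exclusive end, and the genomic sequence is invented.

```python
def classify_intron(genomic, intron_start, intron_end):
    """Classify an intron by its first and last two bases.

    intron_start is 0-based and inclusive, intron_end is exclusive.
    """
    intron = genomic[intron_start:intron_end].upper()
    if len(intron) < 4:
        raise ValueError("intron too short to have both boundary dinucleotides")
    donor, acceptor = intron[:2], intron[-2:]
    if (donor, acceptor) == ("GT", "AG"):
        return "canonical GT-AG"
    if (donor, acceptor) == ("GC", "AG"):
        return "non-conventional GC-AG"
    return f"unusual {donor}-{acceptor}: check the model"


# Made-up example: exon, 12-base intron, exon.
print(classify_intron("ATGGCCGTAAGTTTTCAGGATTAA", 6, 18))
print(classify_intron("ATGGCCGCAAGTTTTCAGGATTAA", 6, 18))
```

An "unusual" result does not prove the model wrong, but it is a good reason to look at the EST alignment again.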
Searching for known domains/motifs
• Searching for PROSITE patterns – allowing ambiguities
• PROSITE and Pfam profile searches
• SMART, CDsearch (domains and more)
Predicting protein localisation
• predicting signal peptides/anchors
• 2 methods available
• possibility to predict organelle localisation
• transmembrane segments prediction
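To get a feel for what transmembrane segment prediction looks for, a sliding-window hydropathy profile is a useful toy. The sketch below is not the method used by the prediction servers; it assumes the Kyte-Doolittle hydropathy scale with the classic rule of thumb of a 19-residue window and an average above 1.6, and the function names are invented for this example.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(protein, window=19):
    """Mean hydropathy of every full window; raises KeyError on non-standard residues."""
    scores = [KD[aa] for aa in protein.upper()]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]


def candidate_tm_windows(protein, window=19, threshold=1.6):
    """Return 0-based start positions of windows whose mean exceeds the threshold."""
    profile = hydropathy_profile(protein, window)
    return [i for i, score in enumerate(profile) if score > threshold]


# Made-up sequence: a hydrophobic stretch flanked by charged residues.
toy = "DDKKDD" + "LIVALLIVAFLLIVALLIV" + "KKDDKK"
print(candidate_tm_windows(toy))
```

Adjacent hits belong to the same hydrophobic stretch, so a run of consecutive start positions should be read as one candidate segment.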
Alignment: “manual” or automated?
• “Manual”
  - locally installed, free, for Mac and PC
  - interactive domain definition
  - statistical data provided
  - may produce false-positive blocks (read the on-line manual!)
• Automated
  - “objective” results
  - a number of servers available
  - recommended for well-conserved proteins
  - empirical parameters (e.g. gap penalties)
  - bad for divergent sequences
Phylogenetic analyses
• Two methods are better than one.
• Your phylogeny cannot be better than your alignment.
• Gaps are not data.
• Always do bootstrapping (100-500 cycles)
• Certain questions cannot be answered from an unrooted tree.
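Two of the points above, gaps and bootstrapping, are easy to illustrate on the alignment itself. The sketch below shows the resampling step only: bootstrapping draws alignment columns with replacement to build pseudo-replicate alignments, each of which is then fed to the tree-building program. Removing every gapped column is just one way of reading "gaps are not data", and both function names and the toy alignment are made up.

```python
import random


def drop_gap_columns(alignment):
    """Remove every column that contains a gap ('-') in any sequence."""
    columns = [col for col in zip(*alignment) if "-" not in col]
    if not columns:
        return ["" for _ in alignment]
    return ["".join(row) for row in zip(*columns)]


def bootstrap_alignments(alignment, replicates=100, seed=None):
    """Yield pseudo-replicates made by sampling columns with replacement."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    rng = random.Random(seed)  # a seed makes the resampling reproducible
    for _ in range(replicates):
        picks = [rng.randrange(length) for _ in range(length)]
        yield ["".join(row[i] for i in picks) for row in alignment]


aligned = ["AC-GT", "ACTGT", "A-TGA"]
clean = drop_gap_columns(aligned)
print(clean)
for replicate in bootstrap_alignments(clean, replicates=3, seed=1):
    print(replicate)
```

The bootstrap value of a branch is then the fraction of replicate trees in which that branch appears, which is why 100-500 replicates are recommended.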
Points to take off...
• go to the Bioinformatics page
http://www2.rhul.ac.uk/~ujba110/Bioinfo.htm
• select your exercise (A,B,C,D,E)
• … and enjoy it!
If you are serious about it:
• create your own bookmarks (seed provided on the course web page)