Transcript compute1

Basic Overview of
Bioinformatics Tools and
Biocomputing Applications I
Dr Tan Tin Wee
Director
Bioinformatics Centre
Software Tools
• Data stored in retrievable forms in database systems
• Data generated by machines, DNA / Protein sequencers, automated systems
[Diagram: Automated Machines, Research Labs, Biological Data, Databases, Analytical Tools, New Knowledge]
Common Computational Analyses
• Sequence Assembly
• Simple sequence analysis
– Translation and reverse Complement, ORF
– Composition statistics (protein & DNA)
– Molecular mass
– Total charge and pI; local hydropathy
– Simple determination of secondary structures
– Restriction site analysis
– Internal repeat analysis
• Detection of active sites, functional residues, characteristic
structures, substrates, and processing signals
Common Computational Analyses
• Database sequence search
• Multiple alignment
• 2 and 3 Structure prediction;
transmembrane helix detection
• Structure modeling
• Docking prediction and design
• Hidden Markov model searches
Sequence Assembly
• Fragmented data from DNA sequencers
• Detection of Overlap
• Merging of Contigs
• Assembly into continuous sequence (5' to 3')
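The overlap-and-merge steps above can be sketched as a greedy assembler. This is only an illustration: the function names and the `min_len` parameter are assumptions, sequencing errors and reverse-complement reads are ignored, and fragments fully contained in another fragment are not handled.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b (0 if below min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(fragments: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of contigs with the longest overlap."""
    contigs = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
```

For example, `greedy_assemble(["ATCTCGATAC", "GATACTACTA", "ACTACTACG"])` returns `["ATCTCGATACTACTACG"]`.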
Sequence Format Interconversion
• DNA/Protein and other sequence data come
in different formats.
• Annotations
• Different programs use different formats
• Interconversion utility tools, e.g. READSEQ, TOGCG, TOSTADEN, etc.
Simple Sequence Analysis
1. Linear Sequence, e.g. DNA / Protein
2. Open a Window: n = 1, n = variable, or sliding
3. Calculate based on list of criteria
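The three steps above can be sketched as one generic helper. The names `sliding_windows`, `scan` and `gc_fraction` are illustrative assumptions; GC fraction is used only as one example of a "criterion".

```python
from typing import Callable, Iterator


def sliding_windows(seq: str, size: int, step: int = 1) -> Iterator[tuple[int, str]]:
    """Yield (start, window) for each window of the given size along the sequence."""
    for start in range(0, len(seq) - size + 1, step):
        yield start, seq[start:start + size]


def scan(seq: str, size: int, score: Callable[[str], float], step: int = 1):
    """Apply a scoring function to every window and collect (start, score)."""
    return [(start, score(window)) for start, window in sliding_windows(seq, size, step)]


def gc_fraction(window: str) -> float:
    """Example criterion: fraction of G or C bases in the window."""
    return sum(base in "GC" for base in window) / len(window)
```

For example, `scan("ATGCGC", 4, gc_fraction)` gives `[(0, 0.5), (1, 0.75), (2, 1.0)]`.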
Some Simple Sequence Analysis Applications
• DNA complementary strand, e.g. COMPLEMENT & REVERSE
– Open window size 1
– A ---> T
– C ---> G
– T ---> A
– G ---> C
– Slide to next Window of 1
– Proceed to end of sequence
– Reverse order of complement
5' ...ATCTCGATACTACTACG...3'
      |||||||||||||||||
3' ...TAGAGCTATGATGATGC...5'
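A minimal sketch of the window-of-1 procedure above. The function names are assumptions, and only the four unambiguous bases are handled.

```python
# Base-pairing rules from the slide: A<->T, C<->G.
COMPLEMENT = {"A": "T", "C": "G", "T": "A", "G": "C"}


def complement(seq: str) -> str:
    """Complement each base in place (window of size 1, slide to the end)."""
    return "".join(COMPLEMENT[base] for base in seq.upper())


def reverse_complement(seq: str) -> str:
    """Complement, then reverse, so the result reads 5' to 3'."""
    return complement(seq)[::-1]
```

With the slide's sequence, `complement("ATCTCGATACTACTACG")` gives `"TAGAGCTATGATGATGC"` and `reverse_complement("ATCTCGATACTACTACG")` gives `"CGTAGTAGTATCGAGAT"`.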
Some Simple Sequence Analysis Applications
• DNA to Protein sequence translation, e.g. TRANSLATE
– Open window of 3 bases
– Look up the codon in the genetic code table
– Assign Amino acid residue
– Slide window to next 3 bases
– Proceed till stop codon detected
– Repeat whole procedure for six frames
ATACTACTGAGATCTAGGCTAGTACTGCGTGCG
Frame 1
Frame 2
Frame 3
Reverse complement - Frames 4-6
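The six-frame procedure can be sketched as follows, assuming the standard genetic code. The function names and the `to_stop` flag are illustrative, and this is not how the TRANSLATE program itself is implemented. Unknown codons are reported as "X".

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(seq: str, to_stop: bool = False) -> str:
    """Translate in steps of 3 bases; '*' marks a stop codon."""
    protein = []
    for i in range(0, len(seq) - 2, 3):
        residue = CODON_TABLE.get(seq[i:i + 3], "X")
        if residue == "*" and to_stop:
            break
        protein.append(residue)
    return "".join(protein)


def six_frames(seq: str) -> dict[int, str]:
    """Frames 1-3 on the given strand, frames 4-6 on its reverse complement."""
    seq = seq.upper()
    rev_comp = seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(seq[offset:])
        frames[offset + 4] = translate(rev_comp[offset:])
    return frames
```

For the slide's sequence, frame 1 of `six_frames("ATACTACTGAGATCTAGGCTAGTACTGCGTGCG")` is `"ILLRSRLVLRA"`.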
Some Simple Sequence Analysis Applications
• Detect Open Reading Frame, e.g. ORF
– Translate sequence, report long stretches between start and stop codons
• Compositional analysis
– e.g. Calculate total A, T, G, C
– e.g. Calculate total molecular mass of protein, analyse percentages of amino acids
– e.g. Total Charge composition, pI
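A minimal sketch of ORF detection and base composition on one strand. It assumes ATG as the only start codon and TAA/TAG/TGA as stops. The names and the `min_codons` cut-off are illustrative assumptions.

```python
from collections import Counter

STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, min_codons: int = 3) -> list[tuple[int, int, int]]:
    """Return (start, end, frame) for each ATG...stop stretch.

    Positions are 0-based, end is exclusive and includes the stop codon;
    frame is 1, 2 or 3 on the given strand.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame + 1))
                start = None
    return orfs


def base_composition(seq: str) -> dict[str, int]:
    """Total counts of A, T, G and C."""
    counts = Counter(seq.upper())
    return {base: counts.get(base, 0) for base in "ATGC"}
```

For example, `find_orfs("CCATGAAATTTTAGCC")` returns `[(2, 14, 3)]`, and `base_composition` on the same sequence returns `{"A": 5, "T": 5, "G": 2, "C": 4}`.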
Some Simple Sequence Analysis Applications
• Simple prediction of secondary structure of Protein sequence
– decide a window size
– compute for each window of amino acids the statistical potential to form helix, beta sheet, turn, etc. (e.g. Chou-Fasman, GOR algorithms)
– use a statistical potential chart
– plot potentials in graphical or pictorial format
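The windowed calculation can be sketched as below. This is not the full Chou-Fasman or GOR procedure; it only averages a per-residue propensity over each window. The table holds helix propensities as commonly quoted for five residues; every other residue defaults to a neutral 1.0, which is an assumption of this sketch. Check the values against a primary source before relying on them.

```python
# Partial helix-propensity chart (commonly quoted Chou-Fasman values);
# residues not listed fall back to a neutral default (sketch assumption).
HELIX_PROPENSITY = {"E": 1.51, "A": 1.42, "L": 1.21, "G": 0.57, "P": 0.57}


def window_potential(protein: str, table: dict[str, float],
                     window: int = 6, default: float = 1.0):
    """Mean propensity for each window; returns a list of (start, mean)."""
    scores = []
    for start in range(len(protein) - window + 1):
        segment = protein[start:start + window]
        mean = sum(table.get(res, default) for res in segment) / window
        scores.append((start, mean))
    return scores


def text_plot(scores, scale: int = 20) -> str:
    """Crude pictorial output: one bar of '#' per window."""
    return "\n".join(f"{start:4d} {'#' * round(mean * scale)}" for start, mean in scores)
```

Windows whose mean is well above 1.0 would be flagged as helix-favouring, and those well below as helix-breaking. `print(text_plot(window_potential("EALEALGPGPGP", HELIX_PROPENSITY)))` shows the bars shrinking along the sequence.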
Some Simple Sequence Analysis Applications
• Restriction Mapping, e.g. MAP, MAPPLOT, MAPSORT, PLASMIDMAP etc.
– Table of Restriction Enzymes and cut sites, e.g. EcoRI, BamHI, AluI and their cut sites, e.g. GAATTC, AATT
– Take a DNA sequence
– Pattern match against the list of cut sites
– For each match, assign Restriction enzyme
– Calculate distance between cut sites
– Display in table, graphical, or restriction map, etc.
[Figures: gel; plasmid map]
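The pattern-matching steps can be sketched as follows for a linear molecule. The enzyme table holds three well-known recognition sites (EcoRI G^AATTC, BamHI G^GATCC, AluI AG^CT). The function names and data layout are assumptions, not those of MAP or the related programs.

```python
import re

# Enzyme -> (recognition site, cut offset within the site on the top strand).
ENZYMES = {"EcoRI": ("GAATTC", 1), "BamHI": ("GGATCC", 1), "AluI": ("AGCT", 2)}


def find_cut_sites(dna: str, enzymes=ENZYMES) -> list[tuple[int, str]]:
    """Return sorted (cut position, enzyme) pairs; positions are 0-based."""
    dna = dna.upper()
    cuts = []
    for name, (site, offset) in enzymes.items():
        # A lookahead is used so overlapping occurrences are also found.
        for match in re.finditer(f"(?={site})", dna):
            cuts.append((match.start() + offset, name))
    return sorted(cuts)


def fragment_lengths(dna: str, cuts: list[tuple[int, str]]) -> list[int]:
    """Distances between consecutive cut sites on a linear molecule."""
    positions = [0] + [pos for pos, _ in cuts] + [len(dna)]
    return [b - a for a, b in zip(positions, positions[1:])]
```

For `dna = "AAGAATTCTTAGCTGGGATCCAA"`, `find_cut_sites(dna)` gives `[(3, "EcoRI"), (12, "AluI"), (16, "BamHI")]` and `fragment_lengths` gives `[3, 9, 4, 7]`. A circular plasmid would need the first and last fragments joined.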
Some Simple Sequence Analysis Applications
• Protein sequence Motifs pattern matching, e.g. PROSITEMAP, MOTIFS, BLOCKS etc.
– Table/Database of Sequence Patterns/Motifs and their signature sequence, e.g. Arg-Gly-Asp (RGD) or consensus sequence (e.g. PROSITE, BLOCKS db)
– Take Protein sequence
– Pattern match against the list of signature sites
– For each match, assign potential function according to database
– Display in table or graphically, or hyperlinked
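A minimal sketch of motif scanning with regular expressions. The two-entry table is a hypothetical stand-in for a real motif database: RGD is the cell attachment sequence, and the N-glycosylation pattern is written in regex form. The function name is an assumption.

```python
import re

# Hypothetical mini motif table: name -> (regular expression, annotation).
MOTIFS = {
    "RGD": (r"RGD", "cell attachment sequence"),
    "N-glycosylation": (r"N[^P][ST][^P]", "N-glycosylation site"),
}


def scan_motifs(protein: str, motifs=MOTIFS):
    """Return sorted (1-based position, motif name, matched text, annotation)."""
    protein = protein.upper()
    hits = []
    for name, (pattern, note) in motifs.items():
        # Capturing group inside a lookahead reports overlapping matches too.
        for match in re.finditer(f"(?=({pattern}))", protein):
            hits.append((match.start(1) + 1, name, match.group(1), note))
    return sorted(hits)
```

For example, `scan_motifs("MKRGDSNGTA")` reports RGD at position 3 and an N-glycosylation match `NGTA` at position 7. A match only suggests a potential function; short patterns like these occur by chance.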
Some Simple Sequence Analysis Applications
• Peptide Cleavage Maps, e.g. PEPTIDESORT, PEPTIDEMAP
– Table of Protease vs Cleavage sites, e.g. trypsin, chymotrypsin, and chemical cleavage agents such as cyanogen bromide
– Pattern match with entire protein sequence
– Calculate size of peptide fragments
– Sort and Map, Plot as electrophoretic patterns on a log-linear simulated digest
– Compute Partial Digest patterns
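A sketch of a complete digest using two simplified cleavage rules: trypsin cuts after K or R unless the next residue is P, and cyanogen bromide cuts after M. The rule table and function names are assumptions. Fragment size is reported in residues rather than mass, and partial digests are not modelled.

```python
import re

# Agent -> zero-width regex that matches at each cleavage point (simplified rules).
CLEAVAGE_RULES = {
    "trypsin": r"(?<=[KR])(?!P)",
    "cyanogen bromide": r"(?<=M)",
}


def digest(protein: str, agent: str) -> list[str]:
    """Cut the protein at every site for the agent; fragments in sequence order."""
    cut_points = [m.start() for m in re.finditer(CLEAVAGE_RULES[agent], protein)
                  if 0 < m.start() < len(protein)]
    bounds = [0] + cut_points + [len(protein)]
    return [protein[a:b] for a, b in zip(bounds, bounds[1:])]


def sort_by_size(fragments: list[str]) -> list[tuple[int, str]]:
    """Largest fragment first, as (length in residues, fragment)."""
    return sorted(((len(f), f) for f in fragments), reverse=True)
```

For example, `digest("MAKRPGKLMR", "trypsin")` gives `["MAK", "RPGK", "LMR"]`, and `digest("MAKRPGKLMR", "cyanogen bromide")` gives `["M", "AKRPGKLM", "R"]`.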
Some Simple Sequence Analysis Applications
• DOTPLOT - self-comparison
– Take a Window size
– Compare against entire length of own sequence
– Report matches above a threshold
– Plot on Graph
– Slide window, repeat till end of sequence
– Detection of Internal repeats
[Figure: dot plot, axis labelled Sequence A]
• Pairwise comparison - detection of homology
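The dot-plot procedure can be sketched as a double loop over window positions. The function name, and `threshold` as the minimum number of matching positions per window, are assumptions. Passing the same sequence twice gives the self-comparison case.

```python
def dotplot(seq_a: str, seq_b: str, window: int = 3, threshold: int = 3):
    """Return (i, j) for every pair of windows with at least `threshold` identical positions."""
    dots = []
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            matches = sum(x == y for x, y in
                          zip(seq_a[i:i + window], seq_b[j:j + window]))
            if matches >= threshold:
                dots.append((i, j))
    return dots


def internal_repeats(seq: str, window: int = 3, threshold: int = 3):
    """Off-diagonal dots of a self-comparison indicate internal repeats."""
    return [(i, j) for i, j in dotplot(seq, seq, window, threshold) if i != j]
```

For example, `internal_repeats("ACGTACGT", window=4, threshold=4)` returns `[(0, 4), (4, 0)]`, the repeated `ACGT`.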
Some Simple Sequence Analysis Applications
• RNA secondary structure analysis
• Mfold, PlotFold, FoldRNA, Squiggles, Circles, Domes, Mountains, StemLoop
• Folding of RNA into stems, loops
• Calculation of energy - prediction of stability of structure
• Display of structure and alternatives
[Figure: RNA stem-loop with paired bases A--U, U--A, G--C, C--G between flanking sequences ...AUCGA and AUCUC...]
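As a rough illustration of folding by dynamic programming, the sketch below maximises the number of base pairs (a Nussinov-style recursion). This is a simpler technique than the energy minimisation the slide describes, so it only hints at which stems are possible. The function name and the `min_loop` parameter are assumptions; G-U wobble pairs are allowed.

```python
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def max_base_pairs(rna: str, min_loop: int = 3) -> int:
    """Maximum number of nested base pairs, with at least min_loop unpaired bases per hairpin loop."""
    n = len(rna)
    if n == 0:
        return 0
    best = [[0] * n for _ in range(n)]  # best[i][j]: max pairs within rna[i..j]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]  # case: base j left unpaired
            for k in range(i, j - min_loop):  # case: base j pairs with base k
                if (rna[k], rna[j]) in PAIRS:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1]
```

For example, `max_base_pairs("GGGAAAUCC")` is 3: a three-pair stem closing an `AAA` loop.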
Database Searching
• Text-based Database Searching: using a text string to match an annotation in a sequence database record, i.e. keyword search
• Sequence-based Database Searching: using a biological sequence to match the whole or parts of its sequence against the sequence of every record in the database
Text-Based Database Searching
• Examples: Entrez, SRS, DBGET, AceDB - common integrated database systems
• Search Concepts
– Boolean Search - AND, OR, NOT
– Broadening Search
– Narrowing the Search
– Proximity searching, soundex
– Wild Card, Stemming, e.g. Thala* for thalasemia, thalassemia, thalassemic
• Use standard string search algorithms and boolean operations, vocabulary matches
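Boolean and wild-card matching over annotation text can be sketched as below. This is a toy in-memory search, not how Entrez or SRS work. The function names and the `all_of` / `any_of` / `none_of` parameters (standing in for AND / OR / NOT) are assumptions.

```python
import fnmatch


def matches_term(text: str, term: str) -> bool:
    """True if any word of the text matches the term; '*' is a wild card."""
    words = (w.strip(".,;:()") for w in text.lower().split())
    return any(fnmatch.fnmatchcase(word, term.lower()) for word in words)


def search(records: dict[str, str], all_of=(), any_of=(), none_of=()) -> list[str]:
    """Return ids of records satisfying AND (all_of), OR (any_of) and NOT (none_of)."""
    hits = []
    for rec_id, text in records.items():
        if not all(matches_term(text, t) for t in all_of):
            continue
        if any_of and not any(matches_term(text, t) for t in any_of):
            continue
        if any(matches_term(text, t) for t in none_of):
            continue
        hits.append(rec_id)
    return hits
```

With hypothetical records `{"r1": "Human beta thalassemia mutation", "r2": "Mouse thalassemic model", "r3": "Human period homolog"}`, `search(records, all_of=["thala*"])` returns `["r1", "r2"]`. Adding `none_of=["mouse"]` narrows it to `["r1"]`.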
Text-based Database Searching
• Example: To find the human homolog of the Drosophila per gene
• Procedure
– Web to Entrez
– All Fields: enter "human" "per"
– Hits returned, irrelevant - broaden search
– "human" "period" - more hits
– check every one, find the human RIGUI gene
• Hit and miss, clever guess work, free form or controlled vocabulary (MeSH terms)? Use Boolean searches?
Sequence-based Database
Searching
• Homology Search
• Global or Local Sequence Alignment
• Needleman-Wunsch Algorithm
• Smith-Waterman Algorithm
• Lipman-Pearson FASTA
• Altschul's BLAST
• Take a sequence, pairwise comparison with each sequence in the database
Sequence-based Database
Searching
• Basic Assumptions:
– Sequences of homologous Genes/Proteins diverge over time even though structure and/or function change little
– Significant sequence similarity is inferred as potential structural/functional similarity or common evolutionary origin
– Based on a well-characterised protein, infer the function of an unknown sequence at gene or protein sequence level
Sequence-based Database
Searching
• Global Alignment: forces alignment of the two input sequences over their complete length
• Local Alignment: looks for local stretches of similarity and tries to align the most similar segments
• Algorithms used may be similar, but the outputs differ; statistics are needed to assess results
Sequence-based Database
Searching
• Alignment Scoring
• Substitution score and substitution matrix: PAM, BLOSUM
• Affine gap costs/gap penalty and gap scores
• Optimal alignments, dynamic programming: Needleman-Wunsch algorithm, Smith-Waterman algorithm (SSEARCH)
• Additional heuristics - FASTA, BLAST
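The dynamic-programming recurrence shared by global (Needleman-Wunsch) and local (Smith-Waterman) alignment can be sketched in one function. To keep it short, this sketch assumes a simple match/mismatch score instead of a PAM or BLOSUM matrix, and a linear gap penalty instead of affine gap costs. The function name and default scores are illustrative. Only the optimal score is returned, not the alignment itself.

```python
def align_score(a: str, b: str, match: int = 1, mismatch: int = -1,
                gap: int = -2, local: bool = False) -> int:
    """Optimal alignment score: global by default, local if local=True."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment: leading gaps are penalised.
        for i in range(1, rows):
            H[i][0] = i * gap
        for j in range(1, cols):
            H[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score = max(H[i - 1][j - 1] + s,  # align a[i-1] with b[j-1]
                        H[i - 1][j] + gap,    # gap in b
                        H[i][j - 1] + gap)    # gap in a
            if local:
                score = max(score, 0)  # local alignment may restart anywhere
            H[i][j] = score
            best = max(best, score)
    # Local: best cell anywhere; global: bottom-right corner.
    return best if local else H[-1][-1]
```

For example, `align_score("ACGT", "AGT")` is 1 (three matches and one gap), while `align_score("TTTGATTACATTT", "CCCGATTACACCC", local=True)` is 7, the shared `GATTACA` segment. The same two sequences score much lower globally because the flanking mismatches are forced into the alignment.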