PowerPoint Presentation - University of Hong Kong
Introduction to EMBOSS
HKU Computer Centre
Christine Ho
[email protected]
Web page of EMBOSS
The EMBOSS programs are available at
http://bioinfo.hku.hk/EMBOSS/
The files required for this lecture are available at
http://bioinfo.hku.hk/tutorial/
Users are required to apply for a BIOINFO account to use the tools
on the web and off-line, and to download the databases.
BIOINFO accounts are open to the public to register free of charge, and
usage of BIOINFO is restricted to academic and research
purposes only.
How to apply for a BIOINFO account:
HKU members: submit the HKUESD application form (Cfe139)
Non-HKU members: submit the application form at
http://www.hku.hk/ccoffice/forms/cf139.pdf
Questions and comments: [email protected]
What is EMBOSS?
EMBOSS (The European Molecular Biology Open
Software Suite) is a free, open-source software
package that provides a comprehensive set
of sequence analysis programs developed specifically
for the needs of the molecular biology user
community.
Within EMBOSS you will find around 100 programs
(applications).
More information about EMBOSS can be found at
http://www.uk.embnet.org/Software/EMBOSS/
Main Programs in EMBOSS
Retrieve sequences from database
Sequence alignment
Nucleic gene finding and translation
Protein secondary structure prediction
Rapid database searching with sequence patterns
Protein motif identification, including domain
analysis
Nucleotide sequence pattern analysis, for example
to identify CpG islands or repeats.
Codon usage analysis for small genomes
Rapid identification of sequence patterns in large
scale sequence sets
Presentation tools for publication
Starting EMBOSS
There are three ways to start EMBOSS
Command line, after logging in to bioinfo.hku.hk
Web interface (EMBOSS-GUI)
Command line of EMBOSS
Inside HKU campus
telnet bioinfo.hku.hk
Outside HKU campus
Windows machine: use putty, see http://bioinfo.hku.hk FAQ Q13
Linux or UNIX machine: ssh <username>@bioinfo.hku.hk
Web interface of EMBOSS
Directly access the web page at
http://bioinfo.hku.hk/EMBOSS/
Or browse the BIOSUPPORT Homepage:
http://bioinfo.hku.hk/ and select “Tools” Option
Web interface of EMBOSS
Click on the link EMBOSS - GUI
Programs in EMBOSS
Parameters in EMBOSS
Input can be:
A Uniform Sequence Address (USA) in one of the formats:
database
database:entry_name or database:accession_number (e.g. embl:xlrhodop or embl:L07770)
database:wildcard (e.g. sw:opsd_a*)
filename
filename:entry
format::filename
@list
Sequence data pasted into the text area.
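The USA forms listed above can be told apart by their punctuation alone. The following is a minimal Python sketch, not part of EMBOSS: the function name classify_usa and the returned dictionary keys are hypothetical, and the sketch cannot decide whether a bare name is a database or a file, which EMBOSS resolves against its own configuration.

```python
def classify_usa(usa: str) -> dict:
    """Classify a Uniform Sequence Address by its punctuation only.

    Illustrative helper (hypothetical, not an EMBOSS API).
    """
    if usa.startswith("@"):
        # @list: a file containing further USAs, one per line
        return {"kind": "list", "file": usa[1:]}
    if "::" in usa:
        # format::filename: explicit input format
        fmt, rest = usa.split("::", 1)
        return {"kind": "file", "format": fmt, "file": rest}
    if ":" in usa:
        # database:entry, filename:entry or database:wildcard
        source, entry = usa.split(":", 1)
        kind = "wildcard" if ("*" in entry or "?" in entry) else "entry"
        return {"kind": kind, "source": source, "entry": entry}
    # A bare name may be a database or a file; this sketch cannot tell which.
    return {"kind": "name", "source": usa}


if __name__ == "__main__":
    for example in ("embl:xlrhodop", "sw:opsd_a*", "fasta::myfile", "@list"):
        print(example, "->", classify_usa(example))
```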
Programs in EMBOSS
Output will be:
Textual and/or graphical representation
of data.
The output can be saved as text file or
in some cases image file in PNG or PS
format.
EMBOSS online help
The documentation for EMBOSS is available
at http://bioinfo.hku.hk/emboss/
Difference between GCG and EMBOSS
File formats supported
GCG: GCG, MSF, RSF, FastA, BLAST (other file formats must be converted using a program, e.g. FromFastA, FromEMBL, FromPIR, etc.)
EMBOSS: ABI trace file, ACeDB, Clustal ALN (multiple alignment), EMBL, FASTA, GENBANK, NBRF (PIR), PHYLIP interleaved multiple alignment, SWISSPROT, plain text, etc.
No. of sequences in one file
GCG: One file can only have one sequence.
EMBOSS: One file can have multiple sequences.
3rd party packages included
GCG: FASTA, BLAST
EMBOSS: FASTA, BLAST and assembly programs are not included. They must be run separately.
Upper limit of sequence size
GCG: 35K
EMBOSS: 2G
Replacement of GCG programs
Exchanging sequences between packages
In GCG -> In EMBOSS
getseq -> newseq
fromfasta, tofasta, fromembl, toembl, from…, to… -> seqret (any program that reads/writes sequences)
Replacement of GCG programs
Sequence editing, manipulation and display
In GCG -> In EMBOSS
fetch -> seqret
seqed -> no complete solution yet
command delete -> cutseq
command insert -> pasteseq
lineup -> no good solution yet
assemble -> union
shuffle -> shuffleseq
reverse -> revseq
chopup -> not needed, as EMBOSS reads ‘any’ format
publish -> showseq, prettyseq
Replacement of GCG programs
Translation
In GCG -> In EMBOSS
translate -> transeq
Sequence comparison and alignment
In GCG -> In EMBOSS
compare+dotplot (default (window stringency)) -> dotmatcher
compare+dotplot (word=n) -> dottup
gap -> needle, stretcher (for long sequences)
bestfit -> water, matcher (for long sequences)
pileup, clustal -> emma (=CLUSTAL)
pretty -> cons, showalign
Replacement of GCG programs
Patterns and gene finding
In GCG -> In EMBOSS
findpatterns -> fuzznuc, fuzztran, fuzzpro (NB: uses PROSITE syntax, not GCG syntax, to define the pattern)
motifs -> patmatmotifs (NB: ps_scan also searches PROSITE profiles)
codonpreference -> syco, wobble
Replacement of GCG programs
Phylogeny
In GCG -> In EMBOSS
distances+growtree -> ednadist or eprotdist + eneighbor
Mapping
In GCG -> In EMBOSS
map -> remap, restrict
map with option “Find translationally silent potential restriction sites” -> silent
map with option 3’ or 5’ overhang -> restover
mapsort -> restrict
mapsort+plasmidmap -> cirdna (only a partial solution: the input file with tick positions must be created “manually”)
Replacement of GCG programs
Protein analysis
In GCG -> In EMBOSS
pepplot, peptidestructure+plotstructure -> garnier, pepinfo, octanol, pepwindow
Primer selection
In GCG -> In EMBOSS
prime -> eprimer3 (=Primer3)
primepair, melttemp -> no good solution yet
Replacement of GCG programs
Keyword-based databank searching
In GCG -> In EMBOSS
names -> whichdb
indexsearch -> indexsearch
stringsearch (mode A) -> textsearch
stringsearch (mode B) -> no good solution yet, but advantageously replaceable by indexsearch
Running EMBOSS program
EMBOSS programs are run by typing their names
at the Unix prompt, or by using an
interface.
The EMBOSS command syntax follows
normal Unix command conventions.
programname -help
to get some help on the options.
programname -opt
to make the program prompt you for common
options.
tfm programname
to get the full help on a program.
Login bioinfo
Log in to bioinfo with ‘telnet bioinfo.hku.hk’
If you are using the temp account, please create a
directory named after your username at hkusua:
bioinfo% mkdir <username>
E.g. bioinfo% mkdir chantaiman
Change to the directory you created:
bioinfo% cd <username>
E.g. bioinfo% cd chantaiman
wossname
It is easy to forget the name of a
program.
To find EMBOSS programs, use
wossname
wossname finds programs by looking
for keywords in the description or the
name of the program.
wossname
Type wossname at the Unix % prompt
bioinfo % wossname
Displays a one-line description.
Prompts you for information:
Finds programs by keywords in their one-line documentation
Keyword to search for: restrict
SEARCH FOR 'RESTRICT'
recode Remove restriction sites but maintain the same translation
remap Display a sequence with restriction cut sites, translation
etc…..
Optional parameters
To get prompted for all the optional parameters, type
the following:
bioinfo % wossname -opt
Finds programs by keywords in their one-line
documentation
Keyword to search for: protein
Output program details to a file [stdout]: myfile
Format the output for HTML [N]:
String to form the first half of an HTML link:
String to form the second half of an HTML link:
Output only the group names [N]:
Output an alphabetic list of programs [N]:
Use the expanded group name [N]:
help
bioinfo % wossname -help
Mandatory qualifiers:
[-search] string Enter a word or words here.
Optional qualifiers (* if not always prompted):
-outfile outfile this program will write the program names
Advanced qualifiers:
-[no]emboss bool EMBOSS program documentation will be searched.
Mandatory - required; these are often parameters (in ‘[]’)
Optional - use -opt to be prompted for these.
Advanced - things that are not often used!
Writing to the screen
Note that the default output file for
wossname was:
stdout (Standard output)
Use this whenever prompted for an
output file.
This is a ‘magic’ file name.
It displays the output on the screen,
not a file.
Working with sequences
EMBOSS reads sequences from
files or databases.
It automatically recognizes the input
sequence format.
You can easily specify many output
formats.
Getting sequences from the
databases
Database single entry (ID)
database:entry
For example embl:hsfau
Wildcarded entries (Query)
database:hs*
For example sw:fos_*
All entries
database:*
Most databases will support all 3 methods
- some may not.
showdb
bioinfo% showdb
Displays information on the currently available
databases
# Name    Type ID  Qry All Comment
# ====    ==== ==  === === =======
domo      P    OK  OK  OK  DOMO sequences
enspep    P    OK  OK  OK  ENSEMBL PEP sequences
gp        P    OK  OK  OK  GENPEPT sequences
gpnew     P    OK  OK  OK  New GENPEPT sequences
kabatp    P    OK  OK  OK  KABAT Protein sequences
nrl       P    OK  OK  OK  NRL_3d
pdb       P    OK  OK  OK  PDB sequences
pir       P    OK  OK  OK  PIR using NBRF access for 4 files
rem       P    OK  OK  OK  REMTREMBL sequences
seqret
Reads in a sequence, and writes it out.
bioinfo % seqret
Reads and writes (returns) a sequence
Input sequence: embl:xlrhodop
Output sequence [xlrhodop.fasta]:
bioinfo % more xlrhodop.fasta
>XLRHODOP L07770 Xenopus laevis rhodopsin
ggtagaacagcttcagttgggatcacaggcttctagggatcctttgggcaaaaaagaaac
acagaaggcattctttctatacaagaaaggactttatagagctgctaccatgaacggaac
.
.
seqret from the command line
Give seqret all of its data on the command line.
It doesn’t need to prompt for anything else.
bioinfo % seqret embl:xlrhodop -outseq xlrhodop.fasta
The ‘-outseq’ can be abbreviated to ‘-out’.
Any abbreviation must be unique.
Even shorter, leave out the qualifier:
bioinfo % seqret embl:xlrhodop xlrhodop.fasta
Changing output formats
(reformatting)
seqret can reformat sequences by
specifying the output format:
bioinfo % seqret embl:xlrhodop xlrhodop.gcg -osformat gcg
bioinfo % more xlrhodop.gcg
!!NA_SEQUENCE 1.0
Xenopus laevis rhodopsin mRNA, complete cds.
XLRHODOP Length: 1684 Type: N Check: 9453 ..
1 ggtagaacag cttcagttgg gatcacaggc ttctagggat cctttgggca
51 aaaaagaaac acagaaggca ttctttctat acaagaaagg actttataga
.
.
Multiple sequences, single file
You can use seqret to retrieve multiple
sequences into a file:
bioinfo% seqret "sw:opsd_a*" opsd_a.seqs
This retrieves all the sequences whose
identifiers start with “opsd_a” into a file
called opsd_a.seqs.
Multiple sequences, many files
If you wish to write one sequence per
file, use:
bioinfo % seqret "sw:opsd_a*" -ossingle
The output filenames will be based on
the sequence entry names.
The program seqretsplit will split an
existing multiple sequence file into
many files.
Asterisk on the command line
You can't use an unquoted ‘*’ on the UNIX command line.
UNIX tries to match it to filenames.
Quote it, either with quotes or a
backslash:
"embl:*"
embl:\*
For example:
bioinfo % seqret "embl:hsf*" hsf.seq
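When a command line is assembled by a script rather than typed by hand, the same quoting rule applies. The sketch below uses Python's standard shlex.quote to show which arguments need protecting. The helper name seqret_command is hypothetical, and the sketch only builds the command string; it does not run EMBOSS.

```python
import shlex


def seqret_command(usa: str, outfile: str) -> str:
    """Build a shell-safe seqret command line (string only, nothing is executed)."""
    argv = ["seqret", usa, outfile]
    # shlex.quote leaves safe words untouched and single-quotes anything
    # containing shell metacharacters such as '*'.
    return " ".join(shlex.quote(arg) for arg in argv)


if __name__ == "__main__":
    print(seqret_command("embl:hsf*", "hsf.seq"))
    print(seqret_command("embl:xlrhodop", "xlrhodop.fasta"))
```

Only the wildcarded USA is quoted; a plain USA such as embl:xlrhodop passes through unchanged.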
EMBOSS web interface
On the left, you can choose the program to run. You
can also see all the programs sorted alphabetically,
instead of sorted by group, by clicking on the link.
Getting help in EMBOSS
Help on the program is available by
clicking on the question mark.
Input to EMBOSS
If you know the entry name or accession number,
enter the sequence in Uniform Sequence
Address (USA) format.
E.g. embl:xlrhodop
Input to EMBOSS
If you have your own sequence file,
upload the sequence by clicking the
browse button.
Input to EMBOSS
You can also copy and paste your
own sequence into the text area.
seqret web interface
E.g. seqret - retrieving single sequence
Input:
USA path embl:xlrhodop
Output file format: GCG 9.x/10.x
Output:
The sequence retrieved in GCG
format
seqret
Seqret – retrieving multiple sequences
Input: sw:ops2_*. Output file format: Pearson FASTA
Output: multiple sequences whose identifiers start with
ops2_.
Save the file as ops2.fasta by right-clicking on the link.
coderet
Extract CDS, mRNA and translations from feature
tables. If any sequences are in other entries of
that database, they are automatically fetched and
incorporated correctly into the final sequence.
Input: embl:X03487
coderet
Output
dottup
dottup – Comparison between 2 sequences using
dot-plots.
Input:
1st sequence: embl:xl23808 (Xenopus
laevis rhodopsin gene)
Second sequence: embl:xlrhodop (Xenopus
laevis rhodopsin cDNA from complement of
mRNA)
Output:
A dotplot showing the diagonal lines
representing areas where the two
sequences align well in PNG format.
The image can be saved into the computer.
dottup
The 5 diagonal lines represent areas where the two
sequences align well.
Since this is aligning genomic and cDNA, the five diagonals
represent the five exons of the gene.
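The idea behind a word-based dot plot can be sketched in a few lines of Python. This is an illustration of exact word matching under assumed names (word_matches, wordsize), not the dottup implementation: every pair of positions where the two sequences share an identical word becomes one dot, and runs of dots with a constant difference between the two positions form the diagonals described above.

```python
from collections import defaultdict


def word_matches(seq_a: str, seq_b: str, wordsize: int = 10) -> list:
    """Return (i, j) pairs (0-based) where seq_a and seq_b share an exact word."""
    # Index every word of seq_b by its start position.
    index = defaultdict(list)
    for j in range(len(seq_b) - wordsize + 1):
        index[seq_b[j:j + wordsize]].append(j)
    # Look up every word of seq_a in that index.
    hits = []
    for i in range(len(seq_a) - wordsize + 1):
        for j in index.get(seq_a[i:i + wordsize], ()):
            hits.append((i, j))
    return hits


if __name__ == "__main__":
    for i, j in word_matches("ACGTACGT", "TTACGTAA", wordsize=4):
        print(f"dot at ({i}, {j}), diagonal {i - j}")
```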
Pairwise Sequence Alignment
An alignment is an arrangement of
two sequences which shows where
the two sequences are similar, and
where they differ.
There is no unique, precise, or
universally applicable notion of
similarity.
Global Alignment
A global alignment is one that compares
the two sequences over their entire
lengths, and is appropriate for comparing
sequences that are expected to share
similarity over the whole length.
The alignment maximizes regions of
similarity and minimizes gaps using the
scoring matrices and gap parameters
provided to the program.
needle
Function
Needleman-Wunsch global alignment
Description
This program uses the Needleman-Wunsch global alignment algorithm to find
the optimum alignment (including gaps) of
two sequences when considering their
entire length.
The computation is rigorous.
It can be time consuming to run if the
sequences are long.
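The dynamic programming at the heart of a global alignment can be sketched as follows. This is a simplified illustration, not the needle program: it assumes a flat match/mismatch score and a single linear gap penalty (all three values are made up for the example), whereas needle uses a substitution matrix with separate gap opening and extension penalties. Every cell of the table is filled, which is why the running time grows with the product of the two sequence lengths.

```python
def needleman_wunsch_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -1) -> int:
    """Best global alignment score of a and b (score only, no traceback)."""
    # Row 0: aligning a prefix of b against nothing costs one gap per residue.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # column 0: prefix of a against nothing
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(needleman_wunsch_score("GCATGCG", "GATTACA"))
```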
Input sequence for needle
needle
needle - Needleman-Wunsch global alignment
Input:1st sequence: embl:xlrhodop, 2nd sequence: embl:xl23808
Output: Global alignment showing the 5 aligned regions.
Local alignment
Local alignment searches for regions of
local similarity and need not include the
entire length of the sequences.
Local alignment methods are very useful
for scanning databases or other
circumstances when you wish to find
matches between small regions of
sequences, for example, between protein
domains.
water
Function
Smith-Waterman local alignment.
Description
Water uses the Smith-Waterman
algorithm (modified for speed
enhancements) to calculate the local
alignment.
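A local alignment uses nearly the same table as a global one, with two changes: no cell may drop below zero, and the answer is the best cell anywhere in the table rather than the last one. The sketch below is the plain textbook recurrence with assumed flat scores and a linear gap penalty; it does not include the speed modifications or the affine gap penalties that water uses.

```python
def smith_waterman_score(a: str, b: str, match: int = 1,
                         mismatch: int = -1, gap: int = -1) -> int:
    """Best local alignment score of a and b (score only, no traceback)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Clamping at zero lets an alignment restart anywhere.
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


if __name__ == "__main__":
    print(smith_waterman_score("TTACGTT", "GGACGGG"))
```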
water
water - Smith-Waterman local alignment.
Input: 1st sequence: embl:xlrhodop, 2nd sequence: embl:xl23808
Output: Local alignment showing the 5 aligned regions.
Multiple Sequence Analysis
Multiple sequence alignments are used:
To find patterns to characterize protein
families.
To detect or demonstrate homology
between new sequences and existing
families of sequences.
To help predict the secondary and tertiary
structures of the new sequences.
As an essential prelude to molecular
evolutionary analysis.
emma
Function
Multiple alignment program - interface to
ClustalW program
Description
EMMA calculates the multiple alignment
of nucleic acid or protein sequences
according to the method of Thompson,
J.D., Higgins, D.G. and Gibson, T.J.
(1994). This is an interface to the
ClustalW distribution.
Upload file to emma
Input: output from seqret (ops2.fasta) retrieving all
swissprot sequences whose identifiers begin with
sw:ops2_*
Click on browse button to upload the file ops2.fasta
Input sequence to emma
ops2.fasta
emma
emma – interface to ClustalW program
Output: multiple alignment saved as file ops2.aln.
prettyplot
Prettyplot – displays aligned sequences, with colouring and
boxing
Input: output from program emma ops2.aln
Output: graphic display of aligned sequences. Identical residues in
red, similar residues in green.
prophecy
Function
Creates matrices/profiles from multiple
alignments
Description
This creates a profile matrix file from a
nucleic acid or a protein sequence
alignment.
The profile matrix file can then be used
by program profit or prophet.
prophecy
Input:
Sequence: output from program emma
ops2.aln
Select type: Gribskov
prophecy
Output: A profile to be saved as ops2.prophecy.
This profile allows a new sequence to be aligned
optimally to a family of similar sequences in the
program prophet.
prophet
Prophet – Gapped alignment for profiles
Input:
Input sequence: The file xlrhodop.pep, output from
transeq of the sequence embl:xlrhodop, region 110-1171.
Profile or matrix file: ops2.prophecy
Output file: ops2.prophet
Output:
The gapped alignment to profile. The
vertical bars (|) represent residues that are
identical between the ops2 consensus and our
rhodopsin, while the colons (:) represent
conservative substitutions. Aligning members of a
family can reveal conserved regions that may be
important for structure and/or function.
prophet
Output
plotorf
plotorf – plots potential open reading frames
Input sequence: embl:xlrhodop
Output: graphical output showing the potential open reading
frames in all six frames.
The longest protein is in the second frame.
The correct open reading frame is the second frame.
getorf
getorf - Finds and extracts open reading frames (ORFs)
Input:
Sequence: embl:xlrhodop
Type of sequence to output: Nucleic sequence between
START and STOP codons
Output: Textual information of the region and the sequence of that
region.
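The "between START and STOP codons" option can be illustrated with a short Python sketch. It is a simplification under stated assumptions, not the getorf algorithm: it scans only the three forward frames, uses the standard stop codons and ATG as the only start, and the function name find_orfs and the min_len parameter are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, min_len: int = 30) -> list:
    """Find ATG...stop regions in the three forward frames.

    Returns (start, end, frame) tuples with 1-based inclusive coordinates;
    the stop codon itself is not included in the region.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first START in this stretch
            elif start is not None and codon in STOP_CODONS:
                if i - start >= min_len:
                    orfs.append((start + 1, i, frame + 1))
                start = None                   # look for the next ORF
    return orfs


if __name__ == "__main__":
    print(find_orfs("CCATGAAATTTTAGGG", min_len=9))
```

The reported coordinates play the same role as the region 110-1171 that is later passed to transeq.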
transeq
transeq - Translate nucleic acid sequences
Input:
sequence: embl:xlrhodop
regions to translate: 110-1171 (from information of getorf)
Output: Translated sequence of the given region.
Save the file as xlrhodop.pep
Exercise 1 Q1
Align HER2 _ERB2_HUMAN and
UNKNOWN_AAL39899.1 with needle and water.
What is the main difference between the two types of
alignment in these two cases (the files HER2fasta.prt and ALL39899_1.prt are at
http://bioinfo.hku.hk/tutorial/)?
Repeat the Smith-Waterman alignment of HER2fasta.prt and ALL39899_1.prt with different
parameters. What happens if gap penalties are
changed to 30 and 2 instead of the defaults 10 and
0.5?
BLOSUM62 is default. What happens to the local
alignment (using program water) when using other
matrices, e.g. EPAM10?
Exercise 1 Q2
Type gb:A7120FTSZ in the text box
and run seqret. Run entret with the
same sequence USA and examine
the entry. What is the difference
between the two entries?
Exercise 1 Q3
With the program infoseq, display
information on all sequences whose
name starts with ‘10’ in the SwissProt
database. (Hint: the sequence is
sw:10*; choose the information you
want to display by changing to ‘yes’.)
Exercise 1 answer (A1)
Needle output
Exercise 1 answer (A1)
Water output
Exercise 1 answer (A1)
Water output with gap opening penalty of
30 and gap extension penalty of 2.
Exercise 1 answer (A1)
Water output with matrix of EPAM10
Exercise 1 answer (A1)
The global alignment (needle) requires the whole
sequences to be aligned. The % identity and %
similarity are much lower than for the local alignment
(water).
If the gap penalties are changed to 30 and 2, no
gap appears in the alignment.
If EPAM10 is used, the score and alignment
length drop. Since PAM is derived from global
alignments, it gives a worse result for the local
alignment program water. EPAM10 is more
suitable for very similar proteins with no more than
10% evolutionary divergence.
Exercise 1 answer (A1)
Amino Acid substitution matrices
PAM (percent accepted mutation) – lists the
likelihood of change from one amino acid to
another in homologous sequences during
evolution.
One PAM is a unit of evolutionary divergence in
which 1% of the amino acids have been
changed.
Some amino acid substitutions occur more
readily than others, probably because they do
not have a great effect on the structure and
function of a protein.
Exercise 1 answer (A1)
Amino Acid substitution matrices (con’t)
BLOSUM – matrix values are based on a large
set of ~2000 conserved amino acid patterns
called blocks. Blocks come from a database of
protein sequences representing more than 500
families of related proteins.
PAM is derived from global alignments of proteins,
while BLOSUM comes from alignments of shorter
sequences.
The matrix built from blocks with no more than x%
of similarity is called BLOSUM X
Exercise 1 answer (A1)
PAM100 ==> Blosum90
PAM120 ==> Blosum80
PAM160 ==> Blosum62
PAM200 ==> Blosum52
PAM250 ==> Blosum45
The Blosum matrices are best for detecting
local alignments.
The Blosum62 matrix is the best for
detecting the majority of weak protein
similarities.
The Blosum45 matrix is the best for
detecting long and weak alignments.
Exercise 1 answer (A1)
If the BLOSUM62 matrix is compared to PAM160
then it is found that the BLOSUM matrix is less
tolerant of substitutions to or from hydrophilic
amino acids, while more tolerant of hydrophobic
changes and of cysteine and tryptophan
mismatches.
Exercise 1 answer (A2)
seqret output
Exercise 1 answer (A2)
entret output
Exercise 1 answer (A2)
You will see the sequence for the
Anabaena 7120 ftsZ and gsh-III
genes.
EMBOSS is also capable of
extracting more information than just
the sequence from a database entry.
The program entret will return the
entire entry as a text file.
Exercise 1 answer (A3)
Output
garnier
Garnier - Predicts protein secondary structure using the
Garnier-Osguthorpe-Robson (GOR) method
Secondary structure prediction is notoriously difficult to do
accurately. The GOR I algorithm is one of the first semi-successful methods.
The Garnier method is not regarded as the most accurate
prediction method, but is simple to calculate on most workstations.
Input: translated sequence (xlrhodop.pep), produced with program transeq
from embl:xlrhodop, region 110-1171.
Output: Predicted protein secondary structure
garnier
Output
pepinfo
pepinfo - Plots simple amino acid properties in parallel.
Input sequence: translated sequence (xlrhodop.pep) embl:xlrhodop
from 110-1171 region with program transeq.
Output: A textual and graphical representation of amino acid
properties (size, polarity, aromaticity, charge, etc). Hydrophobicity
profiles useful for locating turns, potential antigenic peptides and
transmembrane helices.
pepinfo
Showing the residues distribution
pepinfo
Hydrophobicity profiles are useful for locating turns, potential
antigenic peptides and transmembrane helices.
Positive score -> a hydrophobic region.
Negative score -> a hydrophilic region.
The plot shows seven highly hydrophobic regions.
Use the program tmap to investigate further.
patmatmotifs
Patmatmotifs – searches a PROSITE motif
database with a protein sequence. It can
identify which known protein family (if
any) the new sequence belongs to.
PROSITE currently contains patterns and
profiles specific for more than a thousand
protein families or domains.
PROSITE patterns: biologically significant
amino acid patterns can be summarized in
the form of regular expressions.
PROSITE profiles: techniques based on
weight matrices allow the detection of
protein families and functional/structural domains
with extreme sequence divergence.
patmatmotifs
Input sequence: The file xlrhodop.pep, which is output
from transeq of the sequence embl:xlrhodop, region 110-1171.
Output: A textual representation showing where the
sequence matches a motif.
pscan
Pscan – Scans proteins using PRINTS
PRINTS is a database of diagnostic protein
signatures, or fingerprints.
Fingerprints are groups of conserved motifs
or elements that together form a diagnostic
signature for particular protein families.
An uncharacterised sequence matching all
motifs or elements can then be readily
diagnosed as a true match to a particular
family fingerprint.
Input sequence: The file xlrhodop.pep, which
is output from transeq of the sequence
embl:xlrhodop from 110-1171 region.
pscan
Output: A textual representation showing where the
short sequences match with the PRINTS
database that defines functional protein families.
fuzznuc
fuzznuc uses PROSITE style patterns to
search nucleotide sequences.
Letter code for pattern
[ACG] stands for A or C or G.
{AG} stands for any nucleotide except A and
G.
N(3) corresponds to N-N-N, N(2,4)
corresponds to N-N or N-N-N or N-N-N-N.
Example: [CG](5)TG{A}N(1,5)C
Input:
Sequence: embl:hhtetra
Pattern: AAGCTT
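The pattern letter codes above map closely onto ordinary regular expressions, which makes them easy to experiment with. The Python sketch below translates a PROSITE-style nucleotide pattern into a regular expression and reports exact matches on the forward strand. The function names are hypothetical, only the syntax elements listed above are handled, and mismatches and the complementary strand (which fuzznuc supports) are not modelled.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style nucleotide pattern into a Python regex."""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "[":                      # [ACG] -> one of A, C, G
            end = pattern.index("]", i)
            out.append("[" + pattern[i + 1:end] + "]")
            i = end + 1
        elif ch == "{":                    # {AG} -> anything except A or G
            end = pattern.index("}", i)
            out.append("[^" + pattern[i + 1:end] + "]")
            i = end + 1
        elif ch == "(":                    # (3) or (2,4) -> repeat count
            end = pattern.index(")", i)
            out.append("{" + pattern[i + 1:end] + "}")
            i = end + 1
        elif ch in "NnXx":                 # N (or x) -> any base
            out.append("[ACGT]")
            i += 1
        elif ch == "-":                    # optional separator between elements
            i += 1
        else:                              # a literal base
            out.append(ch.upper())
            i += 1
    return "".join(out)


def find_pattern(seq: str, pattern: str) -> list:
    """Return (start, end, match) with 1-based inclusive positions."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start() + 1, m.end(), m.group())
            for m in regex.finditer(seq.upper())]


if __name__ == "__main__":
    print(prosite_to_regex("[CG](5)TG{A}N(1,5)C"))
    print(find_pattern("ggaagcttcc", "AAGCTT"))
```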
fuzznuc
Output
Exercise 2 Q1
Use tmap to display membrane-spanning
regions with the input sequence
xlrhodop.pep (translated with program
transeq from embl:xlrhodop, region 110-1171).
Does the result agree with
pepinfo?
Exercise 2 Q2
Use fuzzpro to search sequence:
CREAp_m.txt pattern: CXXXXC (the
file CREAp_m.txt is from
http://bioinfo.hku.hk/tutorial/)
Exercise 2 Q3
Use patmatmotifs to find patterns in
swissprot sequences fos_human or
fos_rat, and use these patterns with
fuzzpro. Search other fos genes of
different organisms. (Hint: Use
sw:fos_human for the input; Other
organisms: bovin, chick, mouse,
sheep.)
Exercise 2 Q4
Sometimes it is better to run the program
fuzznuc in command line because more
parameters can be given
In the BIOINFO terminal, type the following
(you must put the command in one line in the
UNIX prompt):
bioinfo% fuzznuc -sequence=embl:hhtetra
-pattern=AAGCTT -mismatch=1 -complement
-outf=outf.out
How is the result different from previous run in
web interface?
Exercise 2 answer (A1)
Bars are displayed in the plot above the regions
predicted as being most likely to form
transmembrane regions.
There may be seven transmembrane helices in this
protein.
The result agrees with pepinfo.
Exercise 2 answer (A2)
The symbol ‘x’ is used for a position
where any amino acid is accepted.
Therefore, the pattern CXXXXC matches
the result patterns CQFPGC and
CMFPGC.
Exercise 2 answer (A2)
Patmatmotifs output using sw:FOS_HUMAN
Exercise 2 answer (A3)
When run with patmatmotifs, the
sequences sw:FOS_HUMAN and
sw:FOS_RAT return the same
motifs: AMIDATION,
LEUCINE_ZIPPER, and
BZIP_BASIC.
When fuzzpro is run with one of the
patterns, the start and end positions
agree with patmatmotifs.
Exercise 2 answer (A3)
Fuzzpro output with pattern
“GRAQSIGRRGKVEQ” and sequence
sw:fos_human
Exercise 2 answer (A4)
On the command line you can specify the number of mismatches
as an input parameter. The result with
1 mismatch can now be shown.
cpgplot
CPGPLOT – plots CpG-rich areas
CpG refers to a C nucleotide immediately
followed by a G. The 'p' in 'CpG' refers to the
phosphate group linking the two bases.
By default, this program defines a CpG island
as a region where, over an average of 10 windows,
the calculated % composition is over 50%
and the calculated Obs/Exp (i.e.
Observed/Expected) ratio is over 0.6
and the conditions hold for a minimum of 200 bases.
These conditions can be modified by setting
the values of the appropriate parameters.
cpgplot
The Observed number of CpG patterns in
a window is simply the count of the
number of times a 'C' is found followed
immediately by a 'G'.
The Expected frequency of CpG's in a
window is calculated as the number of 'C's
in the window multiplied by the number of
'G's in the window, divided by the window
length.
Expected = (number of C's * number of
G's) / window length
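The Observed and Expected definitions above translate directly into code. The following Python sketch computes them for a single window; the function name is hypothetical, and reading the "% composition" threshold as the combined percentage of C and G is an assumption. The averaging over 10 windows and the minimum island length are not modelled here.

```python
def cpg_window_stats(window: str) -> dict:
    """Observed/Expected CpG statistics for one window of sequence."""
    window = window.upper()
    if not window:
        raise ValueError("window must not be empty")
    n_c = window.count("C")
    n_g = window.count("G")
    observed = window.count("CG")            # C immediately followed by G
    expected = n_c * n_g / len(window)       # (number of C's * number of G's) / window length
    ratio = observed / expected if expected else 0.0
    percent_cg = 100.0 * (n_c + n_g) / len(window)  # assumed meaning of "% composition"
    return {"observed": observed, "expected": expected,
            "obs_exp": ratio, "percent_cg": percent_cg}


if __name__ == "__main__":
    print(cpg_window_stats("CGCG"))
    print(cpg_window_stats("ATATATAT"))
```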
cpgplot
Input:
embl:rnu68037
Output
cusp
CUSP reads one or more coding
sequences (CDS sequence only) and
calculates a codon frequency table.
It is important to use a codon frequency
table that is appropriate for the species
that your protein comes from.
Input:
Seq:
embl:paamir
Codon usage table: Default (Ehum.cut)
cusp
Output:
Fract – the fraction of all codons for the same amino acid
that are this codon triplet.
/1000 – the number of codons per 1000 bases
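A codon frequency table of this kind can be sketched in Python as follows. This is an illustration rather than the cusp implementation: it assumes the standard genetic code, an in-frame coding sequence, and normalises the "/1000" column per 1000 counted codons, which is an assumption about the intended meaning. The function name codon_usage and the dictionary keys are hypothetical.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
GENETIC_CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codon_usage(cds: str) -> dict:
    """Count codons in an in-frame coding sequence and derive Fract and /1000."""
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3          # ignore a trailing partial codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(c for c in codons if c in GENETIC_CODE)
    total = sum(counts.values())
    per_amino = Counter()
    for codon, n in counts.items():
        per_amino[GENETIC_CODE[codon]] += n
    return {
        codon: {
            "aa": GENETIC_CODE[codon],
            "count": n,
            "fract": n / per_amino[GENETIC_CODE[codon]],  # share among synonymous codons
            "per1000": 1000.0 * n / total,
        }
        for codon, n in counts.items()
    }


if __name__ == "__main__":
    for codon, row in sorted(codon_usage("ATGGCTGCTGCCTAA").items()):
        print(codon, row)
```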
cusp
Running the program in command line
allows you to specify the sequence begin
and sequence end
bioinfo% cusp -sbeg 135 -send 1292
Create a codon usage table
Input sequence(s): embl:paamir
Output file [paamir.cusp]:
cusp
bioinfo% more paamir.cusp
hmoment
hmoment plots or writes out the
hydrophobic moment. The hydrophobic moment
is the hydrophobicity of a peptide
measured for a specified angle of rotation
per residue.
Assumption: the angle of rotation (bonds
of the backbone and amino acid side chains) per residue in alpha helices is 100
degrees. The angle of rotation per residue
in beta sheets is 160 degrees.
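One common way to compute a hydrophobic moment is to treat each residue's hydrophobicity as a vector whose direction advances by the rotation angle at every residue, and then take the length of the vector sum. The Python sketch below follows that formulation. The division by window length and the two-letter toy scale in the usage example are assumptions made for illustration, not values or behaviour taken from hmoment; in practice a published hydrophobicity scale would be supplied.

```python
import math


def hydrophobic_moment(window: str, scale: dict, angle_deg: float) -> float:
    """Hydrophobic moment of a window for a given rotation angle per residue.

    scale maps residue letters to hydrophobicity values (supplied by the caller).
    """
    rad = math.radians(angle_deg)
    sum_sin = sum(scale[aa] * math.sin(n * rad) for n, aa in enumerate(window))
    sum_cos = sum(scale[aa] * math.cos(n * rad) for n, aa in enumerate(window))
    # Length of the summed vectors, normalised by window length (an assumption).
    return math.hypot(sum_sin, sum_cos) / len(window)


if __name__ == "__main__":
    # Toy scale for illustration only: 'A' hydrophobic, 'B' hydrophilic.
    toy_scale = {"A": 1.0, "B": -1.0}
    print(hydrophobic_moment("ABABABAB", toy_scale, 100.0))  # alpha helix angle
    print(hydrophobic_moment("ABABABAB", toy_scale, 160.0))  # beta sheet angle
```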
Input:
Sequence:sw:hbb_human
Produce graph: yes
Plot two graph: yes
hmoment
Output: two graphs, one for the alpha helix moment and one
for the beta sheet moment.
End of lecture
Thank you!