Lab session 6

Download Report

Transcript Lab session 6

I529 - Lab 6
GO+MEME + TwinScan
AI : Kwangmin Choi
Today’s topics
• Gene Ontology prediction/mapping
– AmiGo
• http://amigo.geneontology.org/cgi-bin/amigo/go.cgi
– PFP
• http://dragon.bio.purdue.edu/pfp/
– GOtcha
• http://www.compbio.dundee.ac.uk/gotcha/
• Pathway prediction/mapping
– KAAS
• http://www.genome.jp/kegg/kaas
• MEME
• TwinScan
Gene Ontology
• In a species-independent manner., the GO project
has developed three structured controlled
vocabularies (ontologies) that describe gene
products in terms of their associated
GO:biological process
• A biological process is series of events accomplished by one
or more ordered assemblies of molecular functions.
– E.g. cellular physiological process or signal transduction.
– E.g. pyrimidine metabolic process or alpha-glucoside transport.
• It can be difficult to distinguish between a biological
process and a molecular function, but the general rule is
that a process must have more than one distinct steps.
• A biological process is not equivalent to a pathway; at
present, GO does not try to represent the dynamics or
dependencies that would be required to fully describe a
pathway.
GO: molecular functions
• Molecular function describes activities, such as catalytic or
binding activities, that occur at the molecular level.
• GO molecular function terms represent activities rather
than the entities (molecules or complexes) that perform
the actions,
• GO milecular function terms do not specify where or when,
or in what context, the action takes place.
– E..g. (general) catalytic activity, transporter activity, or binding
etc.
– E.g. (specific) adenylate cyclase activity, Toll receptor binding
etc.
GO: cellular components
• A cellular component is just that, a component of a
cell, but with the proviso that it is part of some larger
object;
• Less informative
• This may be an anatomical structure
– e.g. rough endoplasmic reticulum or nucleus
• or a gene product group
– e.g. ribosome, proteasome or a protein dimer
AmiGO
• URL http://amigo.geneontology.org/cgi-bin/amigo/go.cgi
• AmiGO is the official tool for searching and browsing the
Gene Ontology database
• Simple blast search is provided (not useful)
• AmiGO consists of a controlled vocabulary of terms
covering biological concepts, and a large number of genes
or gene products whose attributes have been annotated
using GO terms.
PFP (Automated Protein Function
Prediction Server)
• Hawkins, T., Luban, S. and Kihara, D. 2006. Enhanced
Automated Function Prediction Using Distantly
Related Sequences and Contextual Association by
PFP. Protein Science 15: 1550-6.
• The PFP algorithm has been shown to increase
coverage of sequence-based function annotation more
than fivefold by extending a PSI-BLAST search to
extract and score GO terms individually
• It applies the Function Association Matrix (FAM), to
score significantly associating pairs of annotations.
PFP method
• PFP uses a scoring scheme to rank GO
annotations assigned to all of the most similar
sequences according to
– (1) their frequency of occurrence in those sequences
– (2) the degree of similarity of the originating sequence
to the query.
• This is similar to the scoring basis for the R-value
used by the GOtcha method to score annotations
from pairwise alignment matches (Martin et al.
2004)
PFP method
•
•
•
•
•
•
A GO term, fa
s(fa) is the final score assigned to the GO term, fa
N is the number of the similar sequences retrieved by PSI-BLAST
E_value(i) is the E-value given to the sequence I
b = 2 (or log10[100]) to allow the use of sequence matches to an E-value of 100.
Function Association Matrix (FAM),
–
–
–
–
–
–
fj is a GO term assigned to the sequence i.
P(fa | fj) is the conditional probability that fa is associated with fj,
c(fa, fj) is number of times fa and fj are assigned simultaneously to each sequence in UniProt
c(fj) is the total number of times fj appeared in UniProt,
μ is the size of one dimension of the FAM (i.e., the total number of unique GO terms)
ɛ is the pseudo-count.
PFP
• Web server
http://dragon.bio.purdue.edu/pfp/queue/116
8_kw.f.result.html
• Local installation
– http://dragon.bio.purdue.edu/pfp/dist
– You need to specify the path of blastp
– And also need BLOSUM62
PFP (Automated Protein Function
Prediction Server)
•
PFP output
–
•
http://darwin.informatics.indiana.edu/col/courses/I529-11/Lab/Lab5/Data/pfp_data/0008.out
Columns
–
–
–
–
–
–
–
–
–
–
–
–
1: predicted GO term
2: GO category (f/p/c)
3: raw term score
4: term p-value
5: rank (by p-value)
6: confidence to be exact match
7: rank (by column 7)
8: confidence within 2 edges on the GO DAG
9: rank (by column 8)
10: confidence within 4 edges on the GO DAG
11: rank (by column 10)
12: GO term short definition
GOtcha
• The GOtcha method
– Martin et al. BMC Bioinformatics (2004) 5:178.
• GOtcha assigns functional terms transitively
based upon sequence similarity.
• These terms are ranked by probability and
displayed graphically on a subtree of Gene
Ontology.
Gotcha method
•
GOtcha performs a BLAST search of the
query sequence against individual well
annotated genomes.
•
Annotations are transitively assigned from
all hits, with a score corresponding to the Evalue, individual GO-terms receiving
cumulative scores from multiple sequence
similarity matches.
•
Cumulative scores are normalized and, for
each term, two scores are obtained
–
–
•
the I-score which is normalized to the root
node,
the C-score which is the cumulative score at
the root node.
For each GO-term a precomputed scoring
table is used to establish the assignment
likelihood for that term given that I-score
and that C-score. This is represented as a
probability
Results:
http://darwin.informatics.indiana.edu/col/courses/
I529-11/Lab/Lab5/Data/Gotcha/gotcha.4451/
KAAS
• KAAS (KEGG Automatic Annotation Server) provides
functional annotation of genes in a genome by BLAST
comparisons against a manually curated set of ortholog
groups in KEGG GENES.
• The result contains KO (KEGG Orthology) assignments
and automatically generated KEGG pathways.
• Moriya, Y., Itoh, M., Okuda, S., Yoshizawa, A., and
Kanehisa, M.; KAAS: an automatic genome annotation
and pathway reconstruction server. Nucleic Acids Res.
35, W182-W185 (2007). [NAR]
KAAS
• Web server: http://www.genome.jp/kegg/kaas/
• KAAS works best when a complete set of genes in a genome is
known. Prepare query amino acid sequences and use the BBH (bidirectional best hit) method to assign orthologs.
• KAAS can also be used for a limited number of genes. Prepare query
amino acid sequences and use the SBH (single-directional best hit)
method to assign orthologs.
• When ESTs are comprehensive enough, a set of consensus contigs
can be generated by the EGassembler server and used as a gene set
for KAAS with the BBH method. Otherwise, use ESTs as they are
with the SBH method.
KAAS workflow
Pathway mapping
• KAAS returns
–
–
–
–
KO list
KEGG Atlas Metabolism map [Create atlas]
Pathway maps [Create all maps]
Hierarchy files
• You can highlight KEGG maps using KEGG API
– http://www.genome.jp/kegg/soap/doc/keggapi_man
ual.html
– See: color_pathway_by_objects
MEME
• Discover (conserved) motifs in a group of
unaligned and related sequences (DNA or protein)
• Unsupervised: Automatically choose the following
(with little or no prior knowledge)
– Best width of motifs
– Number of occurrences in each sequence
– Composition of each motif
2
1
MEME
• MEME (Multiple EM for Motif Elicitation) discovers patterns in
nucleotide and amino acid sequences.
• MEME paper
• Few tutorials - 1( Pg 6 ), 2, 3, 4
• MEME represents motifs as position-dependent letter-probability
matrices which describe the probability of each possible letter at
each position in the pattern.
– MEME uses both alternately EM to find model parameters that
maximize the likelihood of the model
• Individual MEME motifs do not contain gaps. Patterns with variablelength gaps are split by MEME into two or more separate motifs.
Motif finding using the EM algorithm MEME
(Bailey & Elkan 1995)
http://meme.sdsc.edu/meme/intro.html
• MEME works by iteratively refining matrix and
identifying sites:
– Estimate motif model
• Start with an n-mer seed (random or specified)
• Build a matrix by incorporating some of background frequencies
– Identify examples of the model
• For every n-mer in the input set, identify its probability given the
matrix model
– Re-estimate the motif model
• Calculate a new matrix, based on the weighted frequencies of all
n-mers in the set
– Iteratively refine the matrix and identify sites until
convergence.
Expectation Maximization
1. Expectation step – initial guess about the
location of a (variable) sequence pattern in a
set of sequences
2. Maximization step – improve/update
pattern. Iterative search
24
Expectation Maximization Algorithm
dataset - unaligned set of
sequences (training data) S1,
S2, …, Si, …, Sn each of length
L
W - width of motif
p - matrix of probabilities that the
motif starts in position j in Si
Z - matrix representing the
probability of character c in
column k (the character c will
be A, C, G, or T for DNA
sequences or one of the 20
protein characters)
e - epsilon value
25
MEME Algorithm
26
Problem: find a 6-mer motif in 4 sequences
S1: GGCTATTGCAGATGACGAGATGAGGCCCAGACC
S2: GGATGAC
TTATATAAAGGACGATAAGAGATGAC
S3: CTAGCTCGTAGCTCGTTGAGATGCGCTCCCCGCTC
S4: GATGACGGAGTATTAAAGACTCGATGAGTTATACGA
1. MEME uses an initial EM heuristic to estimate the best
Starting-point matrix:
G
A
T
C
0.26
0.24
0.25
0.25
0.24
0.26
0.23
0.27
0.18
0.28
0.30
0.24
0.26
0.24
0.25
0.25
0.25
0.25
0.25
0.25
0.26
0.22
0.25
0.27
2. MEME scores the match of all 6-mers to current matrix
GGCTATTGCATATGACGAGATGAGGCCCAGACC
GGATGAC
TTATATAAAGGACCGTGATAAGAGATTAC
CTAGCTCGTAGCTCGTTGAGATGCGCTCCCCGCTC
GATGACGGAGTATTAAAGACTCGATGAGTTATACGA
Here, just consider the
underlined 6-mers,
Although in reality all
6-mers are scored
2. MEME scores the match of all 6-mers to current matrix
GGCTATTGCATATGACGAGATGAGGCCCAGACC
GGATGAC
TTATATAAAGGACCGTGATAAGAGATTAC
CTAGCTCGTAGCTCGTTGAGATGCGCTCCCCGCTC
GATGACGGAGTATTAAAGACTCGATGAGTTATACGA
The height of the bases
above corresponds to
how much that 6-mer
counts in calculating
the new matrix
3. Reestimate the matrix based on the
weighted contribution of all 6 mers
G
A
T
C
0.29
0.22
0.24
0.24
0.24
0.26
0.23
0.27
0.17
0.27
0.33
0.23
0.27
0.22
0.23
0.28
0.24
0.28
0.24
0.24
0.30
0.18
0.28
0.24
MEME scores the match of all 6-mers to current matrix
GGCTATTGCATATGACGA
GGATGAC
GATGAGGCCCAGACC
TTATATAAAGGACCGTGATAAGAGATTAC
CTAGCTCGTAGCTCGTTGAGATGCGCTCCCCGCTC
GATGACGGAGTATTAAAGACTCGATGAGTTATACGA
GGCTATTGCATATGACGA
GGATGAC
GATGAGGCCCAGACC
TTATATAAAGGACCGTGATAAGAGATTAC
CTAGCTCGTAGCTCGTTGAGATGCGCTCCCCGCTC
GATGACGGAGTATTAAAGACTCGATGAGTTATACGA
The height of the bases
above corresponds to
how much that 6-mer
counts in calculating
the new matrix
Reestimate the matrix based on the
weighted contribution of all 6 mers
G
A
T
C
0.40
0.30
0.15
0.15
0.20
0.30
0.30
0.20
0.15
0.20
0.45
0.20
0.42
0.24
0.16
0.16
0.24
0.46
0.15
0.15
0.30
0.18
0.28
0.24
MEME scores the match of all 6-mers to current matrix
GGCTATTGCATATGACGA
GGATGAC
GATGAGGCCCAGACC
TTATATAAAGGACCGTGATAAGAGATTAC
CTAGCTCGTAGCTCGTTGAGATGCGCTCCCCGCTC
GATGACGGAGTATTAAAGACTCGATGAGT
A
TATACG
Iterations continue until convergence
(ie. numbers don’t change much between iterations)
Final motif
G
A
T
C
0.85
0.05
0.05
0.05
0.05
0.60
0.30
0.05
0.10
0.10
0.70
0.10
0.80
0.05
0.05
0.10
0.20
0.60
0.20
0.10
0.35
0.10
0.10
0.35
Expectation Maximization Idea
34
MEME web server
• MEME website
• Protein motif
• Sample Input
• Fill in email address
• Keep all else default
• DNA motif
• Sample Input
• Minimum motif width: 4
• Number of different motifs: 2
Types of Possible Motif Models
1. OOPS
•
One occurrence per sequence of the
motif in the dataset
2. ZOOPS
•
Zero or one motif occurrences per
dataset sequence
3. TCM
•
Motif to appear any number of times
in a sequence (two-component
mixture)
3
6
A simple DNA example
meme dna.fasta -dna -mod oops -pal -o output_dir
•
MEME looks for a single motif in the file dna.fasta which contains DNA sequences
in FASTA format.
•
The OOPS model is used so MEME assumes that every sequence contains exactly
one occurrence of the motif.
•
The palindrome switch is given so the motif model (PSPM) is converted into a
palindrome by combining corresponding frequency columns.
•
•
•
If MEME decides that a motif is a palindrome, it averages the letter frequencies in
corresponding columns together.
For instance, if the width of the motif is 10, columns 1 and 10, 2 and 9, 3 and 8, etc., are
averaged together.
MEME automatically chooses the best width for the motif in this example since no
width was specified.
Searching for motifs on both DNA
strands
meme dna.fasta -dna -mod oops -revcomp -o output_dir
• The -revcomp switch tells MEME to consider both DNA
strands, and
• The -pal switch is absent so the palindrome conversion is
omitted.
• When DNA uses both DNA strands, motif occurrences on
the two strands may not overlap. That is, any position in
the sequence given in the training set may be contained in
an occurrence of a motif on the positive strand or the
negative strand, but not both.
A simple protein example
meme protein.fasta -mod oops -maxw 20 -nmotifs 2 -o output_dir
• The -dna switch is absent, so MEME assumes the input file as
protein sequences.
• Each motif is assumed to occur in each of the sequences because
the OOPS model is specified.
• Specifying -maxw 20 makes MEME run faster since it does not have
to consider motifs longer than 20.
• MEME searches for two motifs each of width less than or equal to
20.
MEME on BigRed
• MEME is installed on BigRed.
• KB: http://kb.iu.edu/data/awwa.html
TWINSCAN
TwinScan
• TwinScan finds genes in a "target" genomic sequence by simultaneously
maximizing the probability of the gene structure in the target and the
evolutionary conservation derived from "informant" genomic sequences.
• The target sequence (i.e. the sequence to be annotated) should generally
be of draft or finished quality.
• The informant can range from a single sequence to a whole genome in any
condition from raw shotgun reads to finished assembly.
• References
– http://mblab.wustl.edu/media/publications/Hu-Brent-2003-Proof.pdf
– http://mrw.interscience.wiley.com/emrw/9780471250951/cp/cpbi/article/bi0
408/current/abstract
– http://bioinformatics.oxfordjournals.org/cgi/content/abstract/17/suppl_1/S14
0
Requirements
•
TwinScan/ Nscan 4.0 executable
•
DNA sequence in FASTA format
•
Twinscan parameter file (/parameters/twinscan_parameters)
–
•
Each filename contains the name of the target organism: eg maize_twinscan.zhmm.
Conservation sequence (see examples/example.conseq)
–
–
–
–
–
–
A symbolic representation of the the best alignments between the target and informant sequences.
To create this conservation sequence, WU-BLAST (http://blast.wustl.edu) is used.
For NCBI BLAST, the input parameters need to be changed. (see parameters/examples/example_blast_parameters.txt)
xdformat (WU-BLAST package) formats the informant sequences to create a blast database.
After running BLAST, the output must be formatted with conseq, which is included in this package.
Example (using WU-BLAST)
•
•
•
–
•
xdformat -n informant.fa
Blast M=1 N=-1 Q=5 R=1 B=10000 V=100 -cpus=1 -warnings -lcfilter filter=seg filter=dust topcomboN=1 informant.fa target.fa >
blast.out
conseq target.fa blast.out > conseq.fa
Note: The runTwinscan2 script will run these steps without user intervention (see below).
EST sequence (optional)
–
–
–
EST sequence is a symbolic representation of evidence from ESTs that align to the target sequence.
The estseq script included in the distribution creates EST sequence when given a DNA sequence and a (set of) BLAT reports of
the the ESTs aligned to the target.
For downloading BLAT, go to http://genome.ucsc.edu/FAQ/FAQblat.html#blat3
Web service and Installation
• Web service
– http://mblab.wustl.edu/nscan/submit/
• Local installation
– http://mblab.wustl.edu/software/twinscan/
• Installation step
$ tar xvzf iscan-4.1.2.tar.gz
$ cd N-SCAN
$ make linux
$ ./test-executable
How to run
• Usage
twinscan <parameter file> <masked sequence file> -c=<conseq file> [e=estseq_file] > <outputfile>
• Example:
./bin/iscan ./parameters/twinscan_parameters/human_twinscan.zhmm
./examples/example.fa.masked -c=./examples/example.conseq >
mySequence.gtf
Running Twinscan using the
runTwinscan2 script
•
In summary, there are 5 steps required to run Twinscan:
–
–
–
–
–
–
Step 1: Mask target sequence with RepeatMasker
Step 2: Create informant BLAST database
Step 3: Run BLAST
Step 4: Create conservation sequence
(Step 4b: Create EST sequence)
Step 5: Run Twinscan
•
These five steps are all contained in the example script runTwinscan2 (see ./bin)
•
The default BLAST parameters used by runTwinscan2 are those for C.elegans (see
parameters/blast_params/Celegans.blast.param).
– This can and should be changed for any other species with the -B option to the runTwinscan2
script.
•
The file example.output in the /examples directory contains the output from
runTwincan2 using the BLAST parameters found in the script.
Local environment seeting for
runTwinscan2
•
Example
– ../bin/runTwinscan2 -r ../parameters/twinscan_parameters/human_twinscan.zhmm -d output
-B ../parameters/blast_params/Hsapiens.blast.param example.fa.masked informant.fa
•
After running you can find output files in the newly created /output directory.
•
Several programs must be installed on your system to run runTwinscan.
•
You may need to change runTwinscan to point it to these programs.
–
–
–
–
–
my $REPEATMASKER
my $BLASTN
my $BLAT
my $XDFORMAT
my $PRESSDB
= "RepeatMasker";
= "blastn";
= "blat";
= "xdformat";
= "pressdb";
# Format for local environment
# Format for local environment
# Format for local environment
# Format for local environment
# Format for local environment