motif-search - Virginia Commonwealth University

Tour of BioBIKE
Motif Discovery
BioBIKE (Biological Integrated Knowledge Environment) combines:
Knowledge: All known genomes of interest to a specific scientific
community.
Analytical Tools: A powerful graphical language that permits creative
expression to those with no programming experience
Various BioBIKEs are available through:
http://biobike.csbc.vcu.edu
This demonstration is best viewed as a slide show,
enabling you to simulate a session and make
changes in cursor position more obvious.
To do this, click Slide Show on the top tool bar, then View show.
Click anywhere to go on to the next slide.
In this tour, you'll see how to:
• Log onto CyanoBIKE (slide 4)
• Find a gene from a short description of it (slides 13, 19)
• Speak BioBIKE (the language of CyanoBIKE) (slide 16)
• Find orthologs of a gene (slide 31)
• Obtain upstream sequences of a gene or list of genes (slide 48)
• Search a set of sequences for common motifs (slide 53)
You can go to any slide in this tour
at any time by typing the slide
number and pressing Enter.
Or go to the next slide by clicking
the mouse.
Coming Attractions!
If you like this tour, you might also try:
Sequence Analysis
• Display a sequence
• Find similar sequences amongst metagenomes, known
viruses, everything in GenBank
• Make a sequence alignment from a set of similar sequences
• Construct a phylogenetic tree
Analysis of Metagenome Aggregates
• Find the number of contigs in a metagenome
• Find the average contig size in a metagenome
• Find the average GC content within a metagenome
• Visualize the distribution of GC content amongst the
contigs of a metagenome
To get to CyanoBIKE, click a link
to one of the public sites.
To see more tours like this one,
click Guided tours of BioBIKE.
Access this site at http://biobike.csbc.vcu.edu
Your name (no spaces)
- Enter anything you like as a login
name, but no spaces or symbols.
- EMail address is optional but may
be useful if you want to send in
questions or complaints.
- Click New Login
The BioBIKE environment is divided
into three areas: the function palette,
the workspace, and the results window.
You'll bring functions down from the
function palette to the workspace,
execute them, and note the results in
the results window.
Two very important buttons on the
function palette:
HELP!
On-line help (general)
PROBLEM
Something went wrong?
Tell us!
Two very important buttons in the
workspace:
Undo (return to workspace
before last action)
Redo (Get back the
workspace you undid)
Our Story
The glnA gene in the cyanobacterium Anabaena
PCC 7120 encodes glutamine synthetase, a
critical enzyme in nitrogen metabolism.
The transcription of this gene is regulated by the
availability of a nitrogen source.
Suppose you want to understand the molecular
mechanism by which the regulation takes place.
Your strategy is to presume that this highly
conserved gene possesses the same upstream
regulatory sequences in related organisms.
You will collect orthologs of glnA in related
organisms, collect their upstream sequences, and
examine them for a conserved sequence motif.
The first step is to get in hand one glnA gene,
the one you already know about in Anabaena.
Mouse over the GENES-PROTEINS button.
Mousing over a button in the function palette
causes a menu to appear.
You know the unofficial name of the gene,
"glnA", and from that you want to get the
official name of the gene described by "glnA".
Mouse over DESCRIPTION-ANALYSIS.
Click on the function
GENE-DESCRIBED-BY.
A GENE-DESCRIBED-BY function
box is now in the workspace.
Before continuing with the problem,
let's consider what function boxes
mean.
General Syntax of BioBIKE
[Diagram: a function box containing a function name, an argument (object), a keyword with its object, and a flag]
The basic unit of BioBIKE is the
function box. It consists of the
name of a function, perhaps one or
more required arguments, and
optional keywords and flags.
A function may be thought of as a
black box: you feed it information,
it produces a product.
Function boxes contain the
following elements:
• Function-name (e.g. SEQUENCE-OF or LENGTH-OF)
• Argument: Required, acted on by function
• Keyword clause: Optional, more information
• Flag: Optional, more (yes/no) information
… and icons to help you work with
functions:
• Option icon: Brings up a menu of keywords and flags
• Action icon: Brings up a menu enabling you to execute
a function, copy and paste, get information, get help, etc.
• Clear/Delete icon: Removes information you entered
or removes the box entirely
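If you happen to read text-based languages, the parts of a function box can be pictured roughly as below. This is only an analogy in Python, not BioBIKE itself: the gene table is a made-up stand-in for BioBIKE's knowledge base (only alr2328 and A7120 are names used in this tour), `in_organism` plays the role of a keyword, and `exact` is a hypothetical flag invented for illustration.

```python
# Hypothetical stand-in for a knowledge base: (organism, gene, description).
# Only "A7120" and "alr2328" come from the tour; the other rows are invented.
TOY_GENES = [
    ("A7120", "alr2328", "glnA glutamine synthetase"),
    ("toy-org", "toy0001", "glnA glutamine synthetase"),
    ("A7120", "toy0002", "hypothetical protein"),
]


def gene_described_by(description, *, in_organism=None, exact=False):
    """Toy analogue of a function box.

    description  -- the required argument
    in_organism  -- an optional keyword (narrows the search)
    exact        -- an optional yes/no flag (hypothetical)
    """
    hits = []
    for organism, gene, text in TOY_GENES:
        if in_organism is not None and organism != in_organism:
            continue  # the keyword restricts which organisms are searched
        if exact:
            matched = text == description
        else:
            matched = description.lower() in text.lower()
        if matched:
            hits.append(gene)
    return hits


everywhere = gene_described_by("glnA")
only_a7120 = gene_described_by("glnA", in_organism="A7120")
print(everywhere)
print(only_a7120)
```

The required argument is always supplied; the keyword and the flag are added only when you want to change the default behavior, which is what the Option icon lets you do in the workspace.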
Back to our story.
Click on the Argument box to open it
for entry…
…then type in the description you
know, "glnA".
A very common error is to forget to
close an entry box. A function can't
be executed until all entry boxes are
closed, either by pressing Enter or
Tab.
Do one or the other.
Left to its own devices, BioBIKE
will search every organism it knows
about for genes described by "glnA".
You'll get a much faster response if
you modify the function to search
only Anabaena. Do this by mousing
over the Option Icon…
… and clicking the IN option.
Then open the IN object box for
entry by clicking on it.
You could type in the official name
or nickname of the organism, but if
you don't happen to know it, find it
by mousing over the DATA button…
Anabaena PCC 7120 is a nitrogen-fixing cyanobacterium. Mouse over
that choice.
Mouse over Anabaena PCC 7120,…
… and click on its official nickname,
A7120.
That causes the name to appear in the
selected box.
The function is now ready for
execution. Mouse over the Action
Icon…
… and click Execute.
A result now appears in the Result
Window.
With the name of the gene in hand,
you want to find all orthologs of it in
cyanobacteria, to extract their
upstream sequences.
Mouse over the GENES-PROTEINS
button…
… and click ORTHOLOG-OF.
Open the argument box of the
function for entry by clicking on it…
And type in the nickname of
Anabaena's glnA gene, alr2328.
Close the entry box by pressing
Enter or Tab…
… and execute the function.
Lots of orthologs!
It would be helpful to be able to refer
to them as a group. To define such a
group, mouse over the DEFINITION
button…
… and click the DEFINE function.
The DEFINE function asks for two
things: the name of the variable to be
defined and the value it is to be
given. The value will be all those
orthologs. The name is up to you.
Click on the variable argument box
to open it up for entry…
… and type a name that makes sense
to you, closing the box afterwards by
pressing Tab.
Tab closes the entry box and
automatically opens the next one (if
it exists).
There are many ways of getting that
list of orthologs. You could copy and
paste that list from the Result pane to
the open value box, but it might be
more clear to cut/paste the function
that produced it. Let me show you.
Click on the Action icon of
ORTHOLOG-OF.
Click Cut.
The function box will disappear
but will be retained in the
BioBIKE clipboard.
… then mouse over the Action Icon
of the value argument box…
… and click Paste.
The definition is now complete
(and reads well for future
reference). But it will not take
effect until the function is
executed.
Click the Action icon.
… and click Execute.
Notice that a new VARIABLES
button appears. We'll use it later to
access the newly defined list.
For now, we need to get upstream
regions from all those genes. Mouse
over the GENES-PROTEINS
button…
… then mouse over GENES-NEIGHBORHOOD…
… and click the SEQUENCE-UPSTREAM-OF function.
The function seems to call for a gene as the argument.
However, like most BioBIKE functions, this one has the
following useful property:
- Give it a single item, it returns a single answer
- Give it a list of items, it returns a list of answers.
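The single-item/list behavior just described can be sketched in Python as follows. This is an illustration of the idea only, not how BioBIKE implements it: the decorator name `maps_over_lists`, the gene names, and the sequences are all hypothetical.

```python
from functools import wraps


def maps_over_lists(func):
    """Make a one-item function also accept a list of items."""
    @wraps(func)
    def wrapper(item, *args, **kwargs):
        if isinstance(item, (list, tuple)):
            # A list of items in, a list of answers out.
            return [func(x, *args, **kwargs) for x in item]
        # A single item in, a single answer out.
        return func(item, *args, **kwargs)
    return wrapper


# Hypothetical upstream sequences, keyed by made-up gene names.
TOY_UPSTREAM = {"geneA": "ACGTTGAC", "geneB": "TTGACAAT"}


@maps_over_lists
def sequence_upstream_of(gene):
    return TOY_UPSTREAM[gene]


single = sequence_upstream_of("geneA")
many = sequence_upstream_of(["geneA", "geneB"])
print(single)
print(many)
```

Because the function accepts either form, the same box works whether you hand it one gene or the whole group of orthologs you defined.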
Open the argument box for input.
We want the function to act on the
group of genes we just defined.
Mouse over the VARIABLES
button…
… and click the name of the group
you just defined. That will bring the
group into the selected box.
We could execute the completed function, and
then take those upstream sequences and look
within them for sequence motifs.
Alternatively we could skip the intermediate
step and have the sequences go directly into
the motif finder. To do that, we surround the
function with the motif finder.
To surround, mouse over the Action Icon…
… and click Surround with.
The entire function is now selected.
We need to specify that we want to
surround the function with a function that
searches for motifs within sequences.
Mouse over the STRINGS-SEQUENCES
button…
… mouse over BIOINFORMATIC-TOOLS…
… and click MOTIFS-IN.
(By the way, if the categories aren't sufficiently
intuitive, you can always find functions
alphabetically, through the ALL button on the
Function Palette)
The upstream sequences returned by
SEQUENCE-UPSTREAM-OF will now be
given to the MOTIFS-IN function.
Executing that function will execute everything
inside of it.
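Surrounding one function with another amounts to nesting function calls: the inner function runs first and its result becomes the argument of the outer one. Here is a minimal Python sketch of that idea. Everything in it is an assumption made for illustration: the gene names and sequences are invented, and `shared_kmers` is a deliberately naive stand-in that is far simpler than the real motif finder.

```python
# Hypothetical upstream sequences, keyed by made-up gene names.
TOY_UPSTREAM = {
    "geneA": "ACGTTGACA",
    "geneB": "TTGACAATG",
    "geneC": "GGTTGACAC",
}


def sequence_upstream_of(genes):
    """Return the (toy) upstream sequence for each gene in the list."""
    return [TOY_UPSTREAM[g] for g in genes]


def shared_kmers(sequences, k=5):
    """Naive stand-in for a motif finder: k-mers present in every sequence."""
    kmer_sets = [
        {seq[i:i + k] for i in range(len(seq) - k + 1)} for seq in sequences
    ]
    return sorted(set.intersection(*kmer_sets))


my_orthologs = ["geneA", "geneB", "geneC"]

# Two steps, with an intermediate result...
upstream = sequence_upstream_of(my_orthologs)
two_step = shared_kmers(upstream)

# ...or one nested expression: executing the outer call executes the inner one.
nested = shared_kmers(sequence_upstream_of(my_orthologs))
print(nested)
```

Both routes give the same answer; nesting simply skips the step of looking at the intermediate sequences.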
You might think it's time to go over to the Action
Icon of MOTIFS-IN and execute …
… but hold that mouse!
MOTIFS-IN, unless told otherwise, looks for
amino-acid motifs. Eventually we'll get around to
teaching it how to distinguish DNA from protein
sequences automatically, but for now, mouse over
the Options Icon…
… and click the DNA option.
Now execute the function.
Notice "Executing" in the message bar.
MOTIFS-IN might take 10-20 seconds to
execute. Don't try to do any other
function during that time.
MOTIFS-IN formats the sequences in a
way a motif-finding program (MEME)
likes to see and supplies its results in a
separate window.
A new window opens, which you can save
to your own computer if you like.
For now just scroll down.
MEME has found a motif with a very low (highly significant) E-value. It provides a
histogram, showing the information content of each position of
the motif. The higher the bar, the more conserved the position.
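To see why a taller bar means a more conserved position, here is a small Python sketch of one common way to compute information content per position: the maximum possible uncertainty (2 bits for DNA) minus the Shannon entropy of the bases observed there. It assumes equal background frequencies for A, C, G, and T and applies no small-sample correction, so the numbers the motif finder reports may differ. The aligned sites are invented for illustration.

```python
import math
from collections import Counter


def information_content(sites, alphabet="ACGT"):
    """Bits of information at each position of equal-length aligned sites."""
    max_bits = math.log2(len(alphabet))  # 2 bits for a four-letter alphabet
    bits = []
    for pos in range(len(sites[0])):
        counts = Counter(site[pos] for site in sites)
        entropy = 0.0
        for base in alphabet:
            p = counts[base] / len(sites)
            if p > 0:
                entropy -= p * math.log2(p)
        bits.append(max_bits - entropy)
    return bits


# Hypothetical occurrences of a 4-base motif in four upstream sequences.
toy_sites = ["GTAC", "GTAA", "GTCG", "GTGT"]
bits = information_content(toy_sites)
print(bits)
```

In this toy example the first two positions are identical in every site and score the full 2 bits, the third is partly conserved, and the fourth, where every base appears once, scores zero.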
Scroll further.
You get the sequence of the motif for each
upstream sequence in which it was found.
Scroll further.
MEME also found a second
good motif.
Scroll to the end of the file.
At the end you get a map of all the
motifs found and where in the upstream
sequences they appear.
Evidently, Motifs 2 and 3, when present,
generally precede Motif 1.
BioBIKE
You've seen a knowledge environment in which:
• Knowledge and tools are integrated.
Data conversion is seldom necessary.
• The language is uniform, facilitating access to
many popular tools through a common interface.
• The language is as flexible as any general purpose
language, permitting construction of new tools.
• The programming language is easy to pick up,
using graphical conventions familiar to those
who don't program.
• The environment is well suited for teaching the
concepts of molecular biology through computational
experiment.
Collaborators
Michael Chaplin
Johnny Casey (Sequoia Cons.)
Sarah Cousins (now Wistar Institute)
Michiko Kato (now UC Davis)
Hailan Liu
JP Massar (Berkeley)
James Mastros (now Philip Morris)
Bogdan Mihai
John Myers (Sequoia Cons.)
Nihar Sheth
Jeff Shrager (Carnegie Inst.)
Arnaud Taton
Hien Truong
Andy Whittam (Washington & Jefferson)
… and many participating students
Development of BioBIKE was funded by a grant from the National Science Foundation
Contact Jeff Elhai, Center for the Study of Biological Complexity, Virginia Commonwealth University
(E-mail) [email protected], (Tel) 804-828-0794