Making a Phylogenetic Tree

Biology of Seed Plants Presents
Making a Phylogenetic Tree
Contents
• Principles behind phylogenetic trees
• How to find the sequences to make the tree
• Doing it (with your own hands)
Goals
• You'll know what goes into making a phylogenetic tree
• You'll appreciate that the concepts aren't very difficult,
but there's an awful lot of picky things to do.
This demonstration is best viewed as a slide show, enabling you to simulate a session and make changes in cursor position more obvious.
To do this, click Slide Show on the top tool bar, then View show.
Click anywhere to go on to the next slide.
In mosses, the gametophyte form is dominant.
In angiosperms, the sporophyte form is dominant.
The moss, Physcomitrella patens. University of Leeds.
The angiosperm, Arabidopsis thaliana. Universität Karlsruhe.
Menand et al (2007) found that a transcription factor governing genes involved in sporophyte development in an angiosperm is found also in moss.
Their finding that factors of the AtRHD6 family are also present in the moss Physcomitrella patens can be explained in at least two ways.
1. Those in moss and angiosperms evolved from a common ancestor, but the specialization of the factors arose after angiosperms diverged from bryophytes. If this were the case, you'd expect to see a phylogenetic tree as shown to the right.
OR…
[Figure: phylogenetic tree of transcription factors in Arabidopsis and moss related to the AtRHD6 family]
2. Specialization of the factors arose before
divergence. If this were the case, you'd
expect to see a phylogenetic tree with the
factors from the two plants mixed together.
… and that's exactly what you DO see for factors
in the AtRHD6 clade! This family evidently
evolved before Arabidopsis and moss existed as
distinct plants.
If so much rests on the structure of this
phylogenetic tree, then we need to understand
what trees mean and how they are made (two
sides of the same issue).
To illustrate the process by which phylogenetic trees are
constructed, consider the following list of words:
English   Mother   Father   Red       Foot    Salt   Young
Dutch     Moeder   Vader    Rood      Voet    Zout   Jong
German    Mutter   Vater    Rot       Fuss    Salz   Jung
Swedish   Fostra   Fader    Rött      Foten   Salt   Barn
French    Mère     Père     Rouge     Pied    Sel    Jeune
Spanish   Madre    Padre    Rojo      Pie     Sal    Joven
Italian   Madre    Padre    Rosso     Piede   Sale   Giovane
Russian   Mat      Otetz    Kracnii   Noga    Sol    Molodoye
Clearly, these words (and many more) are connected across
several European languages, sometimes closely connected,
sometimes more distantly.
The relationships can be quantitated and expressed as a tree.
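One way to put a number on these relationships is sketched below in Python. It's only an illustration, not the method behind any published language tree: the scoring scheme (average letter-level edit distance per word) and the function names are assumptions chosen for simplicity, and the words are typed in lower case from the list above.

```python
# Words for: mother, father, red, foot, salt, young (same order in every language).
WORDS = {
    "English": ["mother", "father", "red", "foot", "salt", "young"],
    "Dutch":   ["moeder", "vader", "rood", "voet", "zout", "jong"],
    "German":  ["mutter", "vater", "rot", "fuss", "salz", "jung"],
    "Swedish": ["fostra", "fader", "rött", "foten", "salt", "barn"],
    "French":  ["mère", "père", "rouge", "pied", "sel", "jeune"],
    "Spanish": ["madre", "padre", "rojo", "pie", "sal", "joven"],
    "Italian": ["madre", "padre", "rosso", "piede", "sale", "giovane"],
    "Russian": ["mat", "otetz", "kracnii", "noga", "sol", "molodoye"],
}


def edit_distance(a, b):
    """Fewest single-letter insertions, deletions, or substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete a letter of a
                               current[j - 1] + 1,       # insert a letter of b
                               previous[j - 1] + cost))  # substitute (or match)
        previous = current
    return previous[-1]


def language_distance(lang1, lang2):
    """Average edit distance per word, each scaled by the longer word's length."""
    total = 0.0
    for w1, w2 in zip(WORDS[lang1], WORDS[lang2]):
        total += edit_distance(w1, w2) / max(len(w1), len(w2))
    return total / len(WORDS[lang1])


def closest_to(lang):
    """Name of the language with the smallest distance to lang."""
    others = [other for other in WORDS if other != lang]
    return min(others, key=lambda other: language_distance(lang, other))


print(closest_to("Spanish"))
```

Note that zip() pairs the first word with the first word, the second with the second, and so on, so the comparison only makes sense because every list is in the same order of meanings.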
The tree to the right showing the divergence of human languages over the past few thousand years is based on an analysis of mutations that have taken place in individual words.
Analysis is not sufficient, however. There is a prior step…
[Figure: tree of French, Spanish, Italian, German, Dutch, English, Swedish, and Russian, with a time axis in years. Adapted from Gray & Atkinson (2003) Nature 426:435-439]
English  Red    Mother  Salt      Foot   Father   Young
Dutch    Voet   Moeder  Zout      Jong   Rood     Vader
German   Salz   Jung    Rot       Vater  Mutter   Fuss
Swedish  Salt   Fostra  Fader     Rött   Foten    Barn
French   Pied   Sel     Rouge     Jeune  Père     Mère
Spanish  Joven  Rojo    Madre     Pie    Sal      Padre
Italian  Piede  Padre   Madre     Sale   Giovane  Rosso
Russian  Otetz  Sol     Molodoye  Noga   Mat      Kracnii
It would not be possible to analyze the mutations if the words with the same meanings had not first been aligned with each other.
Alignment is equally important in building phylogenetic trees from protein and DNA sequences.
Note that when these sequences are aligned properly, similarities stand out, and so do regions that are similar only in certain groups.
The tree may be built by counting how many differences there are
between pairs of sequences.
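The counting idea can be sketched in a few lines of Python. This is a toy, with made-up sequences and hypothetical function names, and it uses simple average-linkage clustering (often called UPGMA) as a stand-in for a real tree-building program: count the differences between every pair, then repeatedly join the two closest groups.

```python
from itertools import combinations


def count_differences(seq1, seq2):
    """Count aligned positions at which two equal-length sequences differ."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for a, b in zip(seq1, seq2) if a != b)


def upgma(aligned):
    """Join the closest pair of clusters repeatedly; return the tree as nested tuples."""
    # Each cluster is (tree so far, names of its members); start with one per sequence.
    clusters = [(name, [name]) for name in aligned]

    def cluster_distance(c1, c2):
        # Average difference count over all pairs with one member in each cluster.
        pairs = [(a, b) for a in c1[1] for b in c2[1]]
        total = sum(count_differences(aligned[a], aligned[b]) for a, b in pairs)
        return total / len(pairs)

    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: cluster_distance(*pair))
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(((c1[0], c2[0]), c1[1] + c2[1]))
    return clusters[0][0]


# Invented, already-aligned toy sequences (not real genes).
toy = {
    "A": "AAAAAAAA",
    "B": "AAAAAAAT",
    "C": "TTTTAAAA",
    "D": "TTTTAAGG",
}
print(upgma(toy))
```

Here A and B differ at one position and C and D at two, so each of those pairs is joined before the two pairs are joined to each other.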
Back to the key figure from Menand et al (2007)…
One excellent way of understanding a figure is to
construct it yourself from the raw data. How could
you construct this tree?
Recall the steps:
1. Alignment of sequences
2. Analysis of alignments → tree
… Actually, we've glided over a step necessary before we can even begin.
0. GET THE SEQUENCES!
How can we get them? This is often the most difficult of all the steps.
Obtaining the Sequences for the Tree
In any research article (if the authors have done their jobs), we should be able to find the source of the key material that underlies the results, usually in the Materials and Methods section or in the appropriate figure legend.
OK, we turn then to Menand et al (2007). If you don't have a copy handy, go back to the main page for this investigation and click on the link to Science… let's do that together (you do it, too).
We want to search for Menand et al (2007), Science 316:1477-1480. You'd think Science would make this sort of thing easy, but the fastest way of doing it requires a click to the Advanced Search section.
Type in the volume and page numbers (316 and 1477) for Menand et al (2007), Science 316:1477-1480. That makes the search absolutely unique. Then click the Search button.
They give us four choices:
• Abstract is useful for a brief picture of the article (but we're well beyond that)
• Full Text gives the article as a web page, good for clicking on references
• PDF gives the article as it looks on the printed page. I choose this.
• Supporting Online Material… we'll see about that.
Once the PDF file loads, I go straight to Fig. 3. What
(if anything) does it say about where the sequences
come from? I see…
Where's "fig.S3"? I can't find it anywhere. I also note
that Science articles don't have Materials and
Methods sections!!! I can think of no reason for
presenting a research article without a Materials and
Methods section, at least no reason I can repeat on a
family web page.
In desperation I try a general search using Acrobat's
search function.
After some searching, I find something potentially
interesting. I click on it.
That seems to be the winner, at the very end of the article! Evidently the source of the sequences is in Fig. S3, and Fig. S3 is hidden in another web site! Nothing to do but to go there.
Now I understand the import of the Supporting
Material link at the page that got me to this article. I
go back, click on it, and download the supporting
material.
I scroll down the page to Fig. S3. This is an alignment
all right, as advertised. But where did these
sequences come from? The legend for this figure is
no help.
Fortunately, during the scrolling, I ran across the
missing Materials and Methods section, including a
paragraph that seems to meet at least some of my
needs. I go back a few pages.
Finally! I discover that at least some of the sequences (those that begin PpRSL…, the ones from moss) are in GenBank and the authors provide GenBank accession numbers. I can get what I need from there.
What exactly does GenBank have?
Let's go together to NCBI (where GenBank lives),
using the link on the main page.
NCBI houses GenBank and much other information.
Type in the first GenBank accession number,
EF156393, and click the Go button.
Since the GenBank accession number is highly specific, we get back only one hit, to a nucleotide record. Click to get to the record.
"Physcomitrella patens…" yup, that's what we want. Click on the link.
This is what GenBank knows about PpRSL1.
Right organism. Right authors.
What about the sequence? Scroll down.
It shows me both the amino acid sequence...
…and the DNA sequence.
Do I have to look up each sequence separately, then
copy and paste, then somehow get rid of the numbers
and spaces, and then somehow do an alignment?
Fortunately, all of this can be automated.
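To give a feel for what such automation involves, here is a minimal Python sketch of just the clean-up step. The function name and the pasted fragment are made up for illustration (it is not the real PpRSL1 sequence); the layout, with position numbers and blocks of ten letters, mimics how GenBank displays a sequence.

```python
import re


def clean_sequence(pasted_text):
    """Strip position numbers, spaces, and line breaks from pasted sequence text."""
    # Keep letters only, then standardize on upper case.
    return re.sub(r"[^A-Za-z]", "", pasted_text).upper()


# An invented fragment laid out the way a GenBank record shows its sequence.
pasted = """
        1 atggcactcg tcaacgacca
       21 ttccgattac
"""

print(clean_sequence(pasted))
```

Because everything that isn't a letter is discarded, this only works on the sequence lines themselves; any pasted words (such as a heading) would end up mixed into the sequence.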
Making a tree: Doing it
Our plan:
• Obtain moss sequences from GenBank
• Somehow obtain Arabidopsis sequences (discussed later)
• Align the sequences
• Use the aligned sequences to make a phylogenetic tree
The execution of this plan will be greatly facilitated
by a web site, an instance of BioBIKE, that makes it
possible for those without programming experience to
do creative programming.
BIKEs are Biological Integrated Knowledge
Environments. They come stocked with knowledge
specific to a group of organisms. We won't be using
that organism-specific knowledge, so any of the
servers should do. Try clicking CyanoBIKE.
Enter anything you like as a login name, but no spaces or symbols.
No password necessary.
Click New Login.
The BioBIKE environment is divided into three areas as shown: the function palette, the workspace, and the results window. You'll bring functions down from the function palette to the workspace, execute them, and note the results in the results window.
Two very important buttons on the function palette:
• HELP!: On-line help (general)
• PROBLEM: Something went wrong? Tell us!
Our Plan
We want to define a set of sequences, each element being
one of the eight moss sequences with GenBank accession
numbers given in the Supplemental Material.
We'll define each sequence separately, then combine them
into a set.
Click the DEFINITION button on the function palette.
Clicking on any palette button
brings down a function or data into
the workspace.
Click DEFINE.
A DEFINE box is now in the
workspace.
Before continuing with the
problem, let's consider what
function boxes mean.
General Syntax of BioBIKE
[Diagram: a function box containing a Function-name, an Argument (object), a Keyword with its object, and a Flag]
The basic unit of BioBIKE is the function box. It consists of the name of a function, perhaps one or more required arguments, and optional keywords and flags.
A function may be thought of as a black box: you feed it information, it produces a product.
[Diagram: the SIN function box, with an argument box labeled angle]
A function you’re already familiar with is the Sin function. You feed it an angle, it produces the sine of the angle.
In BioBIKE, you provide information by clicking on a gray input box to open it up for entry.
[Diagram: the SIN function box, with 30 typed into the argument box]
A box that is white and outlined in red is open for entry. Type into it an appropriate value, then close it by pressing Enter or Tab.
All input boxes must be closed before a function may be executed. If you leave a white input box open while trying to execute a function, you'll get an error!
Function boxes contain the
following elements:
• Function-name (e.g. SEQUENCE-OF or LENGTH-OF)
• Argument: Required, acted on by function
• Keyword clause: Optional, more information
• Flag: Optional, more (yes/no) information
… and icons to help you work with functions:
• Option icon: Brings up a menu of keywords and flags
• Action icon: Brings up a menu enabling you to execute a function, copy and paste, get information, get help, etc.
• Clear/Delete icon: Removes information you entered or removes box entirely
Back to our story… we were defining
the set of moss sequences.
The DEFINE function has two
arguments: the name of the variable
being defined and its value. Click on
the argument marked variable to
provide the name.
In the white, open input box, type in the name of the first sequence on our list, pprsl1 (upper/lower case doesn't matter). Then press Tab or Enter.
If you pressed Tab, the first input box will close (turn gray) and the next box will automatically open.
If you pressed Enter, you'll have to open the next box yourself by clicking on it.
You'll define pprsl1 as the sequence
from GenBank. We could copy and
paste the actual sequence here, but it's
much easier to ask a function to go to
GenBank and do that.
Mouse over to the GENES/PROTEIN
menu to get the function.
Click on SEQUENCE-OF to bring it
down into the open input box.
We need to tell SEQUENCE-OF two
things:
(a) We want to get the sequence from
GenBank.
(b) The accession number by which
to identify it
Start with the first. Mouse over the
Options Arrow.
Click FROM-GENBANK to add that
flag to the SEQUENCE-OF function.
Open the input box of the function and type in the accession number (in quotes), "EF156393", which you get from the Supplemental Material. Press Tab or Return.
One more thing… When we make the
alignment, we want each sequence
labeled by its name. Specify now that
the name is to be associated with the
sequence. That's an option given by
the DEFINE function.
The function is now complete, so
mouse over the green Action Icon and
click Execute.
Notice that a sequence appears in the
Result Window. Is it the right one?
Compare it with what you saw in
GenBank.
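For comparison, here's roughly what fetching a sequence by accession number might look like outside BioBIKE. This is a sketch under assumptions, not a description of what SEQUENCE-OF does internally: it builds a request to NCBI's E-utilities efetch service for a FASTA record and parses the reply. The function names are hypothetical, and the download itself needs network access, so that call is left commented out.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession):
    """Build an E-utilities request for one nucleotide record in FASTA format."""
    query = urlencode({"db": "nuccore", "id": accession,
                       "rettype": "fasta", "retmode": "text"})
    return f"{EFETCH_URL}?{query}"


def parse_fasta(text):
    """Return (header, sequence) from FASTA text holding a single record."""
    lines = [line.strip() for line in text.strip().splitlines()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("not FASTA text")
    return lines[0][1:], "".join(lines[1:]).upper()


def sequence_from_genbank(accession):
    """Download one record (needs network access) and return its sequence."""
    with urlopen(efetch_url(accession)) as response:
        header, sequence = parse_fasta(response.read().decode("utf-8"))
    return sequence


# Usage, using the accession number typed in at NCBI earlier:
# pprsl1 = sequence_from_genbank("EF156393")
print(efetch_url("EF156393"))
```

If you try it, compare the result with the sequence in the Result Window, just as you compared that one with GenBank.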
You've defined one of the eight moss sequences. It's now an easy matter to modify the function to get the rest.
Reopen the variable box in the DEFINE function, and change pprsl1 to pprsl2. Press Tab or Enter.
Then reopen the entity box in SEQUENCE-OF and change the accession number.
Execute, and repeat this to get all eight sequences.
If all went well, you should be able to
mouse over the VARIABLES button
and see all eight sequences you
created.
We're ready to define a set consisting
of all eight moss sequences.
Go back to the DEFINITION menu
and bring down a new copy of
DEFINE.
Open the variable input box and type
in the name of the set. You can call it
anything you like (anything with no
spaces).
I used all-sequences. Perhaps all-moss-sequences would have been more appropriate.
The set will consist of a list of
sequences, so after opening the value
box of DEFINE (causing it to turn
white and get a red outline)…
…mouse over the LISTS/TABLES
button and click LIST.
The list will have eight elements
(count them!), so we need to add
seven additional holes, an action
accessible from the Options Menu.
Repeat until you have all eight holes.
Now populate the holes, selecting
each one in turn and then going to the
VARIABLES menu to select one of
the eight sequences.
When the DEFINE function is
complete, execute it.
Finally we're ready to align the sequences. You can find the ALIGNMENT-OF function by mousing over the STRINGS/SEQUENCES menu and then the BIOINFORMATIC-TOOLS submenu.
Click on ALIGNMENT-OF.
Open the argument box of
ALIGNMENT-OF and put the set
all-sequences in it.
The function is complete, so execute it.
The alignment pops up on the screen (attend to Firefox's popup blocker to allow popup windows).
…but there's only one sequence!
Why?
Scrolling down…
…finally PpRSL2 shows up, after
1700+ nucleotides of PpRSL1.
Scrolling down more…
…there are the rest.
Scrolling to near the end…
…most end together but a few go on.
This won't do. We need all the
sequences to be the same size,
otherwise the tree-making program
will get confused.
We'll need to truncate the long
sequences.
Plan: Define new sequences that are
truncated versions of the originals.
Bring down a new DEFINE function.
First we'll define a modified pprsl1,
I'll call it pprsl1x. It will consist of
part of the sequence of pprsl1.
Select the value box and bring down
SEQUENCE-OF.
This time we don't want the entire
sequence, so mouse over the green
Options Arrow and select FROM.
From where? Recall the alignment. The coordinates on the left allow us to count over to determine the exact nucleotide at which we want to start the modified sequence.
I count 1930.
Now insert a TO keyword from the Options Menu.
Count over to reach the end of pprsl1.
I count 2130.
This sequence needs to be
labeled like all the rest.
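In Python terms, FROM and TO amount to taking a slice. The sketch below assumes the coordinates count from 1 and include both ends (an assumption about how FROM and TO behave, worth checking against the Result Window), and it uses a made-up stand-in sequence rather than the real pprsl1. The function names are hypothetical.

```python
def subsequence(sequence, start, end):
    """Return positions start..end, counting from 1 and including both ends."""
    if not 1 <= start <= end <= len(sequence):
        raise ValueError("coordinates fall outside the sequence")
    # Python counts from 0 and excludes the end, hence start - 1.
    return sequence[start - 1:end]


def same_length(sequences):
    """True when every sequence in the collection has the same length."""
    return len({len(s) for s in sequences}) <= 1


# Invented 2400-letter stand-in for a long sequence.
stand_in = "ACGT" * 600
piece = subsequence(stand_in, 1930, 2130)
print(len(piece))
```

Under that assumption, FROM 1930 TO 2130 keeps 201 nucleotides, not 200, because both end positions are included. The same_length check is a quick way to confirm that the truncated sequences really do match before aligning again.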
Now do the same with the other sequence
(PpRSL2) that needs truncation (don't worry
about the one extra nucleotide of PpIND1)
Be sure to get the coordinates right and to
Execute.
Now we need to update the definition of the set
of all-sequences. Click each box that needs
modifying, close each…
…and execute the function again.
…and execute ALIGNMENT-OF again.
Much nicer, but it doesn't look anything like the
alignment in the Supplemental Material, Fig.S3.
That alignment was of amino acids. This one is
of nucleotides.
Of course amino acid sequences are translations
of nucleotide sequences.
It isn’t necessary to define translated sequences. Much
easier to translate the entire set all at once.
To do this, surround the set with a TRANSLATION-OF
function.
TRANSLATION-OF may be found on the
GENES-SEQUENCES menu, TRANSLATION
submenu.
Execute the alignment again
Now the alignment looks right!
(…except the sequences from Arabidopsis
are still missing)
But before moving on, let's compare the
two alignments we've gotten.
Comparison of nucleotide and protein alignments
[Figure: the nucleotide alignment shown above the protein alignment]
Note that there is a good deal more similarity amongst the first seven sequences as judged by the protein alignment. Why is that?
What's the relationship between the nucleotide sequence and the protein sequence?
Take the first three nucleotides in the nucleotide alignment as an example. They're uniformly GGG or GGT. And the next three are either AGT or TCA.
But in the protein alignment there is no variation in the amino acid sequence at the first two positions. Why?
Why are protein sequences preferred to DNA sequences to compare genes of different organisms?
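A small Python sketch may help you check your answer. It translates DNA with the standard genetic code, built here from a compact 64-letter string; the function name is hypothetical and this is not how TRANSLATION-OF is implemented.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): amino_acid
               for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(dna):
    """Translate DNA codon by codon ('*' marks a stop codon)."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3   # ignore a trailing partial codon
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


# The two variants seen at the start of the nucleotide alignment.
print(translate("GGGAGT"))
print(translate("GGTTCA"))
```

Both print the same two amino acids: several different codons can spell one amino acid, so nucleotide differences need not show up in the protein.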
Our goal was to make a phylogenetic tree,
so let's do it.
The TREE-OF function is found in the STRINGS-SEQUENCES menu and the PHYLOGENETIC-TREE submenu.
(There are a lot of ways to make a tree!
We'll use a simple approach.)
The TREE-OF function asks for two
things: the alignment (which we now have)
and some name of our choice to give the
project.
We can get the alignment by copying and
pasting what we already did. Mouse over
the Action Icon of ALIGNMENT-OF.
Select Copy to copy the function and all of
its contents.
Paste ALIGNMENT-OF into the alignment input box of TREE-OF.
Now we need to name the project. Two warnings:
1. The name must be in quotes (I chose "moss-tree")
2. You can't reuse a previously used name
Of the many ways to make a tree, I choose
parsimony, because (delving deep into the paper)
that's what Menand et al used.
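To see what parsimony means, here's a minimal Python sketch. It doesn't search for the best tree the way a real program does (and it isn't what TREE-OF runs); it only counts, for a tree you hand it, the fewest changes needed to explain each column of an alignment (Fitch's counting method). The sequences, names, and trees are invented for illustration.

```python
def fitch(tree, states):
    """Return (possible ancestral states, minimum changes) for one alignment column.

    tree   -- a leaf name, or a pair (left_subtree, right_subtree)
    states -- dict mapping each leaf name to its character in this column
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_changes = fitch(tree[0], states)
    right_set, right_changes = fitch(tree[1], states)
    shared = left_set & right_set
    if shared:
        # The two branches can agree on a state: no extra change at this node.
        return shared, left_changes + right_changes
    # No state in common: at least one change is needed at this node.
    return left_set | right_set, left_changes + right_changes + 1


def parsimony_score(tree, aligned):
    """Sum the minimum number of changes over every column of the alignment."""
    length = len(next(iter(aligned.values())))
    return sum(fitch(tree, {name: seq[i] for name, seq in aligned.items()})[1]
               for i in range(length))


# Invented aligned sequences and two candidate trees.
toy = {"A": "AAGT", "B": "AAGT", "C": "TTGT", "D": "TTGC"}
tree1 = (("A", "B"), ("C", "D"))
tree2 = (("A", "C"), ("B", "D"))
print(parsimony_score(tree1, toy), parsimony_score(tree2, toy))
```

The first tree needs fewer changes than the second, so parsimony prefers it. A full program repeats this kind of scoring over a great many candidate trees.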
Finally, the tree program will get
confused by the line of asterisks,
so I get rid of that line.
Execute, and…
… if all went well, you should have yet another alignment, AND A TREE!
Better, the tree looks remarkably like Fig. 3, except no Arabidopsis
sequences. We'll finally deal with that.
We'll define the Arabidopsis sequences much as we did the moss sequences
(even reusing the same DEFINE box). Give the first sequence the name
given in Fig. 3.
But where do we get the actual sequence???
First step is to delete the old value, using the red X delete button.
Now to the sequence.
We turn to the time-honored (but universally vilified) practice of screen-scraping. This is the last resort, done only when an article forgets to tell you where they got the sequence.
Highlight the AtRHD6 sequence, and (making sure that the Acrobat Select
Tool is clicked) copy it to the clipboard.
Now select the value box in the DEFINE function, type in "", paste the
sequence between the two quotation marks, press Tab or Enter, and execute
the completed function.
You can repeat this operation for the remaining six sequences from
Arabidopsis, being sure to change in each case both the name of the
sequence and the sequence itself.
Now that you have all the Arabidopsis sequences, join them as you did with
the moss sequences into a set, called perhaps arabidopsis-set, or whatever.
Do this by replacing all the moss sequences with Arabidopsis sequences and
deleting the box you don't need.
The Arabidopsis sequences should now be in the VARIABLES list.
To delete the unwanted box, click the red X delete/clear button first to clear
and then to delete.
Finally, change the name of the set and execute the DEFINE function.
Now we can join the moss set and Arabidopsis set into a single set, from which you'll then derive a tree. Bring down a new DEFINE function, give the set a name, select the value box and bring down the JOIN function, found in the LISTS/TABLES menu, LIST-PRODUCTION submenu.
Be sure you join the right things: the translated moss
sequences (because the originals are DNA sequences) and the
unmodified Arabidopsis set (because the originals are protein
sequences).
Now to make the final phylogenetic tree. Click the red clear button to erase the previous sequences that were aligned, and replace them with the set you just defined.
Once that's done, nothing left but to execute and enjoy!
Congratulations!