Transcript Slide 1

A search engine for
phylogenetic tree
databases
David Fernández-Baca
Joint work with Mukul Bansal, Duhong
Chen (Computer Science, ISU) and J.
Gordon Burleigh (NESCent)
PhyloFinder
http://pilin.cs.iastate.edu/phylofinder/
Outline
1.
2.
3.
4.
5.
Introduction
PhyloFinder queries
Implementation
Future directions
Acknowledgements
Outline
1.
2.
3.
4.
5.
Introduction
PhyloFinder queries
Implementation
Future directions
Acknowledgements
Issues in Phylogenetic Databases





Taxonomic consistency
 Species may appear in multiple trees by different but
synonymous names.
 Homonyms
 Misspellings
Querying capability
 Storage/representation
 Exploiting classification trees (e.g., NCBI)
Clustering capabilities
 Distance measures
Aggregation (synthesis) capabilities
 Supertrees
Visualization
Classification trees and
phylogenies
Exploiting taxonomic classifications


The leaves in phylogenetic trees may
represent different taxonomic levels
A classification tree can allows us to locate
trees that contain a taxon, as well as its
descendants or ancestors.

E.g., a “Pinaceae" query would identify trees
that contain “Pinus thunbergii” or “Abies alba”
TreeBASE (Piel, Donoghue, &
Sanderson, 1996)
TreeBASE capabilities

Search by







taxon
author
citation
study accession number
matrix accession number
structure (topology)
Tree surfing
TreeBASE limitations


Taxonomic name consistency
Querying


Few options
Does not exploit classification



Can’t identify ancestors/descendants
Visualization
Clustering and aggregation (supertrees)
PhyloFinder

A search engine for tree databases


Allows powerful phylogenetic queries






Not a database
Handles synonymous taxonomic names (via TBMap)
Handles misspellings.
Exploits taxonomic classification
Offers precise options for identifying different types of
subtrees and metrics for identifying similar trees.
Provides a visualization tool with links to GenBank and
TBMap.
Fast

Efficient storage and filtering
PhyloFinder Design

Uses simple but powerful techniques


Inverted index for filtering
Nested-set representation of trees



Least common ancestor queries directly on database
Off-the-shelf spell-checking technology
Can be used with any phylogenetic
database


E.g., PhyLoTA browser
However, set-up is not (yet) automatic
Outline
1.
2.
3.
4.
Introduction
Queries
Storage and querying
Acknowledgements
PhyloFinder Queries


Taxonomic queries involve a single taxon
or set of taxa.
Phylogenetic queries take as input a
phylogenetic tree

Locate trees that match it in some specified
way.
Taxonomic Queries
1. Contains: Given a list of taxa, return all trees
that contain all or any of these names.
 Similar to Boolean “AND” and “OR” searches.
 Automatically searches for synonymous taxa
2. Related: Given a taxon, find all trees involving it
or any of its descendants in the NCBI taxonomy.
 E.g., if the query taxon is “birds", identify all trees that
contain bird taxa.
3. Pathlength: Given a pair of taxa, return all trees
containing them, along with the distance between
them in each tree.
Taxonomic Queries: Contains
Phylogenetic Queries

Tree mining: Given a query tree Q, find the
database trees that exhibit Q in some way.
Options:



Return the trees that have Q as an embedded subtree.
Return the trees that refine Q.
Similarity: Given a query tree Q and a specified
similarity measure, return trees in database
ranked by decreasing similarity from Q.

Requires at least 3 taxon overlap
Phylogenetic queries: Notation 1

T(A) is the minimal subtree of T that
contains the leaves in A.
a
b
c
d
e
f
g
Phylogenetic queries: Notation 1

T(A) is the minimal subtree of T that
contains the leaves in A.
a
b
c
d
e
f
g
Phylogenetic queries: Notation 1

T(A) is the minimal subtree of T that
contains the leaves in A.
a
b
c
d
e
f
g
Phylogenetic queries: Notation 2

T|A is obtained from T(A) by suppressing
all internal nodes that have only one child.
a
b
c
d
e
f
g
Phylogenetic queries



Let Q be a query tree with leaf set A.
Q is an embedded subtree of T if and only
if it is identical to T|A.
Q is refined by T (T refines Q) if T|A is a
refinement of Q.
Phylogenetic queries
Phylogenetic queries: Embedded
Phylogenetic queries: Refined by
Phylogenetic queries: Refined by
Q embedded in T  Q refined by T
Phylogenetic queries: Embedded
Similarity queries

Return trees ranked by a similarity score


PhyloFinder’s similarity measures:



Score is a percentage between 0 and 100%
reflecting how similar query tree is to
candidate tree.
Robinson-Foulds (RF) similarity
Least common ancestor (LCA) similarity
Score takes degree of taxon overlap into
account.
Outline
1.
2.
3.
4.
5.
Introduction
PhyloFinder queries
Implementation
Future directions
Acknowledgements
System architecture
Least Common Ancestors (LCAs)
a
b
c
d
e
f
g
Least Common Ancestors (LCAs)
a
b
c
LCA(b,e)
d
e
f
g
Storage: Nested intervals
(4,4) (5,5) (6,6) (8,8) (9,9) (10,10)
a
b
(3,5)
c
e
d
f
(7,9)
(Node_ID,RMD_ID)
(2,9)
(1,10)
 Ancestor/descendant relationship is easy to
determine
 The between predicate defines subtrees
 LCAs are easily computed
 Find common ancestor with largest Node_ID
Storage: Inverted index

For each taxon, store a list of all trees
that contain it.
Cornus
2
Spigelia
1
Hedera

4
2
8
16 32 64 128
3
5
8
13 21 34
13 16
Easy to find trees containing any or all
elements in a list of taxa

Used as a filter
Building the inverted index
I.
Input trees:
1: (((man,pan),gorilla),pongo),
2. (((human, coprinus),cryptomonas),zea_mays),
...,
N: (((dogs,homo_sapiens),pig),lambs)
II.
Convert trees into lists of taxa:
man pan gorilla pongo, human coprinus . . .  . . .
III.
Synonymy preprocessing: Replace names by TBMap name clusters:
 tc1 tc2 tc3 tc4 , tc1 tc5 . . .  . . .
IV.
Build index consisting of (i) dictionary (mapping of taxon names to
name clusters) and (ii) postings (lists of tree IDs).
tc1
1 2 3 4
tc2
1
Schema
Query Processing: Outline
Q:
Consult
inverted index
Candidate trees:
Results:
Compare against Q
using LCA queries
Implementing Phylogenetic Queries



Idea: Use LCA queries to compare ancestordescendant relationships in Q with those in T.
M(x) and M(y) have the same relationship in T as
x and y have in Q  Q can be embedded in T.
Advantage: Database trees need not be read into
main memory.
Implementing Taxonomic Queries
Use Boolean (union/intersection)
operations on the inverted index
Example: Querying for “birds"


1.
2.
3.
Find all bird species in the database trees
using the NCBI taxonomy tree.
Use inverted index to retrieves the tree ID
lists for each bird species.
Return the union of these lists.
Tree visualization
Tree visualization

Other tree visualization tools are available:




Hillis, Heath, & St. John 2005. Syst. Biol. 54: 471-482.
Sanderson 2006. Bioinformatics 22: 1004-1006.
Zmasek & Eddy. 2001. Bioinformatics 17: 383-384.
We developed our own to


avoid plug-ins, and
easily highlight query results and provide outlinks to
GenBank and TBMap.
Spelling


Suggestions come from TreeBASE and NCBI
Uses GNU Aspell

Modified to handle special characters (`-', `&', '.') and
compound words.
Outline
1.
2.
3.
4.
5.
Introduction
PhyloFinder queries
Implementation
Future directions
Acknowledgements
Under construction


Unrooted trees
Supertree methods




MRP, MRF, MMC
Desktop version
Automatic update
Suggestions?
Outline
1.
2.
3.
4.
5.
Introduction
PhyloFinder queries
Implementation
Future directions
Acknowledgements
Thanks to





Rod Page for TBMap
Bill Piel for TreeBASE data
Mike Sanderson
Oliver Eulenstein
National Science Foundation (grant EF0334832)