Transcript Slide 1
A search engine for
phylogenetic tree
databases
David Fernández-Baca
Joint work with Mukul Bansal, Duhong
Chen (Computer Science, ISU) and J.
Gordon Burleigh (NESCent)
PhyloFinder
http://pilin.cs.iastate.edu/phylofinder/
Outline
1.
2.
3.
4.
5.
Introduction
PhyloFinder queries
Implementation
Future directions
Acknowledgements
Outline
1.
2.
3.
4.
5.
Introduction
PhyloFinder queries
Implementation
Future directions
Acknowledgements
Issues in Phylogenetic Databases
Taxonomic consistency
Species may appear in multiple trees by different but
synonymous names.
Homonyms
Misspellings
Querying capability
Storage/representation
Exploiting classification trees (e.g., NCBI)
Clustering capabilities
Distance measures
Aggregation (synthesis) capabilities
Supertrees
Visualization
Classification trees and
phylogenies
Exploiting taxonomic classifications
The leaves in phylogenetic trees may
represent different taxonomic levels
A classification tree can allows us to locate
trees that contain a taxon, as well as its
descendants or ancestors.
E.g., a “Pinaceae" query would identify trees
that contain “Pinus thunbergii” or “Abies alba”
TreeBASE (Piel, Donoghue, &
Sanderson, 1996)
TreeBASE capabilities
Search by
taxon
author
citation
study accession number
matrix accession number
structure (topology)
Tree surfing
TreeBASE limitations
Taxonomic name consistency
Querying
Few options
Does not exploit classification
Can’t identify ancestors/descendants
Visualization
Clustering and aggregation (supertrees)
PhyloFinder
A search engine for tree databases
Allows powerful phylogenetic queries
Not a database
Handles synonymous taxonomic names (via TBMap)
Handles misspellings.
Exploits taxonomic classification
Offers precise options for identifying different types of
subtrees and metrics for identifying similar trees.
Provides a visualization tool with links to GenBank and
TBMap.
Fast
Efficient storage and filtering
PhyloFinder Design
Uses simple but powerful techniques
Inverted index for filtering
Nested-set representation of trees
Least common ancestor queries directly on database
Off-the-shelf spell-checking technology
Can be used with any phylogenetic
database
E.g., PhyLoTA browser
However, set-up is not (yet) automatic
Outline
1.
2.
3.
4.
Introduction
Queries
Storage and querying
Acknowledgements
PhyloFinder Queries
Taxonomic queries involve a single taxon
or set of taxa.
Phylogenetic queries take as input a
phylogenetic tree
Locate trees that match it in some specified
way.
Taxonomic Queries
1. Contains: Given a list of taxa, return all trees
that contain all or any of these names.
Similar to Boolean “AND” and “OR” searches.
Automatically searches for synonymous taxa
2. Related: Given a taxon, find all trees involving it
or any of its descendants in the NCBI taxonomy.
E.g., if the query taxon is “birds", identify all trees that
contain bird taxa.
3. Pathlength: Given a pair of taxa, return all trees
containing them, along with the distance between
them in each tree.
Taxonomic Queries: Contains
Phylogenetic Queries
Tree mining: Given a query tree Q, find the
database trees that exhibit Q in some way.
Options:
Return the trees that have Q as an embedded subtree.
Return the trees that refine Q.
Similarity: Given a query tree Q and a specified
similarity measure, return trees in database
ranked by decreasing similarity from Q.
Requires at least 3 taxon overlap
Phylogenetic queries: Notation 1
T(A) is the minimal subtree of T that
contains the leaves in A.
a
b
c
d
e
f
g
Phylogenetic queries: Notation 1
T(A) is the minimal subtree of T that
contains the leaves in A.
a
b
c
d
e
f
g
Phylogenetic queries: Notation 1
T(A) is the minimal subtree of T that
contains the leaves in A.
a
b
c
d
e
f
g
Phylogenetic queries: Notation 2
T|A is obtained from T(A) by suppressing
all internal nodes that have only one child.
a
b
c
d
e
f
g
Phylogenetic queries
Let Q be a query tree with leaf set A.
Q is an embedded subtree of T if and only
if it is identical to T|A.
Q is refined by T (T refines Q) if T|A is a
refinement of Q.
Phylogenetic queries
Phylogenetic queries: Embedded
Phylogenetic queries: Refined by
Phylogenetic queries: Refined by
Q embedded in T Q refined by T
Phylogenetic queries: Embedded
Similarity queries
Return trees ranked by a similarity score
PhyloFinder’s similarity measures:
Score is a percentage between 0 and 100%
reflecting how similar query tree is to
candidate tree.
Robinson-Foulds (RF) similarity
Least common ancestor (LCA) similarity
Score takes degree of taxon overlap into
account.
Outline
1.
2.
3.
4.
5.
Introduction
PhyloFinder queries
Implementation
Future directions
Acknowledgements
System architecture
Least Common Ancestors (LCAs)
a
b
c
d
e
f
g
Least Common Ancestors (LCAs)
a
b
c
LCA(b,e)
d
e
f
g
Storage: Nested intervals
(4,4) (5,5) (6,6) (8,8) (9,9) (10,10)
a
b
(3,5)
c
e
d
f
(7,9)
(Node_ID,RMD_ID)
(2,9)
(1,10)
Ancestor/descendant relationship is easy to
determine
The between predicate defines subtrees
LCAs are easily computed
Find common ancestor with largest Node_ID
Storage: Inverted index
For each taxon, store a list of all trees
that contain it.
Cornus
2
Spigelia
1
Hedera
4
2
8
16 32 64 128
3
5
8
13 21 34
13 16
Easy to find trees containing any or all
elements in a list of taxa
Used as a filter
Building the inverted index
I.
Input trees:
1: (((man,pan),gorilla),pongo),
2. (((human, coprinus),cryptomonas),zea_mays),
...,
N: (((dogs,homo_sapiens),pig),lambs)
II.
Convert trees into lists of taxa:
man pan gorilla pongo, human coprinus . . . . . .
III.
Synonymy preprocessing: Replace names by TBMap name clusters:
tc1 tc2 tc3 tc4 , tc1 tc5 . . . . . .
IV.
Build index consisting of (i) dictionary (mapping of taxon names to
name clusters) and (ii) postings (lists of tree IDs).
tc1
1 2 3 4
tc2
1
Schema
Query Processing: Outline
Q:
Consult
inverted index
Candidate trees:
Results:
Compare against Q
using LCA queries
Implementing Phylogenetic Queries
Idea: Use LCA queries to compare ancestordescendant relationships in Q with those in T.
M(x) and M(y) have the same relationship in T as
x and y have in Q Q can be embedded in T.
Advantage: Database trees need not be read into
main memory.
Implementing Taxonomic Queries
Use Boolean (union/intersection)
operations on the inverted index
Example: Querying for “birds"
1.
2.
3.
Find all bird species in the database trees
using the NCBI taxonomy tree.
Use inverted index to retrieves the tree ID
lists for each bird species.
Return the union of these lists.
Tree visualization
Tree visualization
Other tree visualization tools are available:
Hillis, Heath, & St. John 2005. Syst. Biol. 54: 471-482.
Sanderson 2006. Bioinformatics 22: 1004-1006.
Zmasek & Eddy. 2001. Bioinformatics 17: 383-384.
We developed our own to
avoid plug-ins, and
easily highlight query results and provide outlinks to
GenBank and TBMap.
Spelling
Suggestions come from TreeBASE and NCBI
Uses GNU Aspell
Modified to handle special characters (`-', `&', '.') and
compound words.
Outline
1.
2.
3.
4.
5.
Introduction
PhyloFinder queries
Implementation
Future directions
Acknowledgements
Under construction
Unrooted trees
Supertree methods
MRP, MRF, MMC
Desktop version
Automatic update
Suggestions?
Outline
1.
2.
3.
4.
5.
Introduction
PhyloFinder queries
Implementation
Future directions
Acknowledgements
Thanks to
Rod Page for TBMap
Bill Piel for TreeBASE data
Mike Sanderson
Oliver Eulenstein
National Science Foundation (grant EF0334832)