Phytome intro and demo: lab meeting, Sept 13, 2004


www.PHYTOME.org
a plant comparative genomics resource
Todd Vision,
Jason Phillips, Dihui Lu, Stefanie Hartmann
Outline of today’s presentation
1. What kind of data is stored in Phytome - and
how did we generate this data?
2. How can you search Phytome?
3. What kind of results will Phytome give you?
Phytome integrates
• organismal phylogeny
• gene family information: sequences, alignments, phylogenies
• genetic and physical maps
Phytome: applications
Starting with a gene family
→ resolve orthology/paralogy relationships
→ identify coevolving families
Starting with a species
→ explore lineage-specific diversification
→ guide comparative mapping bench-work
Starting with a chromosome segment
→ identify homologous segments
→ predict unobserved gene content (candidate QTL)
overview of the pipeline
data acquisition
EST - expressed sequence tags
[diagram labels: DNA, pre-RNA, mRNA, protein, cDNA, cDNA clone]
• are partial sequences of expressed genes
• are error-prone: they contain sequencing and frame-shift errors
• are very useful for discovering new genes, provide data on gene expression, and make up much of the sequence data
EST contig assemblies
• contigs: continuous sequences of multiple overlapping ESTs
• singletons: don’t match other ESTs in the dataset
sources
• TIGR, Plant GDB, NCBI, TAIR, Sputnik, Plant Genome Network;
• for each species, we used the source with the largest number of ESTs
data acquisition/organismal phylogenies
Glycine max
Phaseolus coccineus
Lotus corniculatus
Medicago truncatula
Cucumis sativus
Prunus persica
Populus tremula x tremuloides
Arabidopsis thaliana
Brassica napus
rosids
Gossypium hirsutum
Theobroma cacao
Citrus sinensis
eudicotyledons
Vitis vinifera
core eudicots
Lycopersicon esculentum
Solanum tuberosum
Capsicum annuum
Nicotiana benthamiana
Helianthus annuus
Zinnia elegans
Stevia rebaudiana
Angiosperms
asterids
Lactuca sativa
Beta vulgaris
Mesembryanthemum crystallinum
Eschscholzia californica
Hordeum vulgare
Triticum aestivum
Secale cereale
Avena sativa
Liliopsida
Saccharum officinarum
Zea mays
Sorghum bicolor
Oryza sativa
Allium cepa
Amborella trichopoda
Cryptomeria japonica
Pinus taeda
Cycas rumphii
Ceratopteris richardii
Marchantia polymorpha
Physcomitrella patens
conifers
cycad
fern
liverwort
moss
protein sequence prediction
from EST contigs to peptide sequences: ESTwise
• translate cDNA sequence (ESTs) in all reading frames
• compare the translated DNA to a database of known proteins
(Swiss-Prot, TrEMBL)
• use this information for gene prediction/translation
• correct frame shift errors based on the homology information
protein
EST
TVKKAHFEKWGNIVDVDYFQHFGNIVDINIVIDKETGKKRGFAFVEFDDYDPVDKVVLQKQHQLNGKMVDV
TVK++HF +WG + D DYF+ +G I I I+ D+ +GKKRGF FV FD +D VDK+V+QK H +NG
+V
TVKRSHFxQWGTLTDCDYFEQYGKIEVIEIMTDRGSGKKRGF!FVTFDGHDSVDKIVIQKYHTVNGHNxEV
agaaactNctgacagtgttgctgaaggagaaagcgagaaagt2tgatggcgtggaagacatcagagcatgg
ctaggataaggctcagaataaagatattattcaggggaaggt ttctagaactaatttaaaactagaaNat
tgagcttgagagcgcttttagtaatagtacgtcactcgagct tactcctccgtgtctgacttgtccctat
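The first ESTwise step listed above, translating the cDNA in all reading frames, can be sketched in a few lines of Python. This is only an illustration of six-frame translation with the standard genetic code; it is not ESTwise itself, and it does none of the homology-based gene prediction or frame-shift correction. The function names are made up for this sketch. The test sequence is the first six codons read off the alignment above (columns of the three nucleotide rows).

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate from the first base; codons not in the table (e.g. with N) become 'X'."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to peptide strings."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    # First six codons of the EST shown in the alignment above.
    for frame, peptide in six_frame_translation("ACTGTGAAAAGGAGCCAT").items():
        print(frame, peptide)
```

In the real pipeline the choice among frames, and the repair of frame shifts, comes from the comparison against known proteins rather than from the translation alone.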
protein family clustering
(Tribe-MCL)
input:
• a set of proteins
• BLAST-all vs. BLAST-all values
method:
• construct weighted graph
• convert into Markov matrix
• expansion
• inflation
(repeat expansion and inflation until the matrix doesn’t change)
output:
• clusters of related proteins:
protein families
image taken from the MCL homepage: http://micans.org/mcl/
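The expansion/inflation loop from the method list can be sketched in plain Python. This is a toy illustration of the Markov clustering idea, not the MCL program or the Tribe-MCL pipeline: the added self-loops, the convergence tolerance, the way clusters are read off the final matrix, and the example weights (standing in for BLAST similarity values) are all assumptions of this sketch.

```python
def normalize_columns(m):
    """Scale each column to sum to 1 (column-stochastic Markov matrix)."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)]
            for i in range(n)]


def expand(m):
    """Expansion: square the matrix (one more random-walk step)."""
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def inflate(m, r):
    """Inflation: raise entries to the power r, then renormalize columns."""
    return normalize_columns([[x ** r for x in row] for row in m])


def mcl(weights, inflation=2.0, max_iter=100, tol=1e-9):
    """Cluster a symmetric weight matrix; returns sorted lists of node indices."""
    n = len(weights)
    # Assumption of this sketch: add a self-loop of weight 1 to every node.
    m = normalize_columns(
        [[weights[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    )
    for _ in range(max_iter):
        new = inflate(expand(m), inflation)
        delta = max(abs(new[i][j] - m[i][j]) for i in range(n) for j in range(n))
        m = new
        if delta < tol:  # the matrix no longer changes
            break
    # Each row with non-negligible entries lists the members of one cluster.
    clusters = set()
    for row in m:
        members = frozenset(j for j, x in enumerate(row) if x > 1e-6)
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)


if __name__ == "__main__":
    # Two tight groups joined by one weak similarity (hypothetical values).
    w = [
        [0.0, 1.0, 1.0, 0.0, 0.0],
        [1.0, 0.0, 1.0, 0.0, 0.0],
        [1.0, 1.0, 0.0, 0.1, 0.0],
        [0.0, 0.0, 0.1, 0.0, 1.0],
        [0.0, 0.0, 0.0, 1.0, 0.0],
    ]
    print(mcl(w, inflation=2.0))
```

The inflation parameter controls granularity: higher values tend to split the graph into more, smaller clusters.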
multiple sequence alignment
tested program   quality   speed     algorithm
ClustalW         +         ++        progressive
Mafft i          ++        +         iterative
Mafft p          ++        +++       progressive
T-Coffee         +++       memory!   consistency-based/progressive
Dialign          +++       time!     consistency based
progressive sequence alignment:
1. generate pairwise distances from pairwise alignments of all sequences
2. use distances to construct a guide tree
3. start by aligning the most similar sequences
4. progressively add more sequences to the existing alignment
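The guide-tree part of these steps (1 to 3) can be sketched as follows. To keep the example short, the pairwise distance here is a shared-k-mer distance rather than a distance from pairwise alignments, and the clustering is simple average linkage; both are stand-ins chosen for this sketch, and the sequences are invented. The actual alignment of sequences and profiles (step 4) is not shown.

```python
from itertools import combinations


def kmer_set(seq, k=2):
    """All overlapping k-mers of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_distance(a, b, k=2):
    """1 minus the Jaccard similarity of the two k-mer sets (a cheap stand-in)."""
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    return 1.0 - len(ka & kb) / len(ka | kb)


def guide_tree_merges(seqs, k=2):
    """Return the merge order: a list of (cluster, cluster) pairs, closest first."""
    names = list(seqs)
    dist = {
        frozenset(pair): kmer_distance(seqs[pair[0]], seqs[pair[1]], k)
        for pair in combinations(names, 2)
    }

    def average_distance(c1, c2):
        total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        # Pick the two closest clusters and merge them.
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: average_distance(*pair))
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(c1 + c2)
        merges.append((c1, c2))
    return merges


if __name__ == "__main__":
    example = {"a": "MKTAYIAKQR", "b": "MKTAYIAKQK", "c": "GGSWWPLLNN"}
    for left, right in guide_tree_merges(example):
        print(left, "+", right)
```

A progressive aligner would then align the pairs in this order, starting with the most similar sequences and adding the rest to the growing alignment.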
multiple sequence alignment
1. identification of homologous proteins, clustering these into a
Phytome family, generation of a multiple sequence alignment
2. identification of homologous sequence positions within the
homologous proteins, i.e., columns of amino acids that share a
common ancestral amino acid
multiple sequence alignment
1. find columns that will be retained
• remove columns with low average pairwise scores
• remove columns with high percentage of gaps
2. find sequences that will be retained
• remove sequences with a high proportion of gaps within the
retained columns
• remove misaligned sequences (i.e., with a low overall score)
3. final check
• are enough sequences left for a phylogeny?
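The gap-based parts of this filtering can be sketched as below. This is a minimal illustration, not the pipeline's own code: the thresholds (`max_col_gaps`, `max_seq_gaps`, `min_seqs`) are hypothetical values, and the score-based filters (low average pairwise column scores, misaligned sequences) are left out.

```python
def gap_fraction(chars):
    """Fraction of gap characters ('-') in a string or list of characters."""
    return sum(c == "-" for c in chars) / len(chars)


def filter_alignment(aln, max_col_gaps=0.5, max_seq_gaps=0.5, min_seqs=4):
    """Filter an alignment given as {name: aligned sequence}.

    Returns (reduced alignment, True if enough sequences are left).
    All three thresholds are illustrative assumptions.
    """
    names = list(aln)
    length = len(aln[names[0]])

    # 1. find columns that will be retained (drop gap-rich columns)
    kept_cols = [
        j for j in range(length)
        if gap_fraction([aln[n][j] for n in names]) <= max_col_gaps
    ]
    reduced = {n: "".join(aln[n][j] for j in kept_cols) for n in names}

    # 2. find sequences that will be retained (drop gap-rich sequences)
    kept = {
        n: s for n, s in reduced.items()
        if s and gap_fraction(s) <= max_seq_gaps
    }

    # 3. final check: are enough sequences left for a phylogeny?
    return kept, len(kept) >= min_seqs


if __name__ == "__main__":
    toy = {
        "s1": "MK-LV",
        "s2": "MK-LV",
        "s3": "MKALV",
        "s4": "M---V",
        "s5": "----V",
    }
    reduced, enough = filter_alignment(toy)
    print(reduced, enough)
```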
phylogenetic inference
generate distance matrix
PHYLIP
generate unrooted
neighbor-joining tree
midpoint-root the tree
TreePuzzle
do molecular clock test
?
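The first two boxes of this workflow can be illustrated with a small sketch: a distance matrix computed from an alignment, and the neighbor-joining criterion used to pick the first pair of sequences to join. The uncorrected p-distance used here is a simplification chosen for the example; PHYLIP's protein distance programs use substitution models instead. Function names are hypothetical, and rooting and the molecular clock test are not shown.

```python
def p_distance(a, b):
    """Fraction of differing residues over positions where neither sequence has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no shared ungapped positions")
    return sum(x != y for x, y in pairs) / len(pairs)


def distance_matrix(aln):
    """Return (sorted names, square matrix of p-distances) for {name: aligned seq}."""
    names = sorted(aln)
    return names, [[p_distance(aln[a], aln[b]) for b in names] for a in names]


def nj_first_pair(names, d):
    """Return the pair minimizing Q(i, j) = (n - 2) * d[i][j] - sum(d[i]) - sum(d[j])."""
    n = len(names)
    totals = [sum(row) for row in d]
    best = None
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * d[i][j] - totals[i] - totals[j]
            if best is None or q < best[0]:
                best = (q, names[i], names[j])
    return best[1], best[2]


if __name__ == "__main__":
    aln = {
        "seqA": "MKLVTA",
        "seqB": "MKLVSA",
        "seqC": "MRIVTG",
        "seqD": "MRIV-G",
    }
    names, d = distance_matrix(aln)
    print(nj_first_pair(names, d))
```

Neighbor joining repeats this selection, replacing each joined pair by a new internal node, until an unrooted tree remains.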
defining subfamilies
[figure: phylogenetic tree of one family with numbered subfamily labels; tip labels:]
ghir40678, taes49609, lsat28223, taes10592, lsat22003, taes12120, pper2228, soff68095,
cjap1662, zmay5764, crum2659, soff59135, sbic29242, soff91873, lsat25221, taes42042,
hvul18430, stub712, nben1351, taes10593, osat87929, zmay10735, lsat24951, sbic10907,
lsat35999, gmax12743, taes100462, cann3062, ptre15750, lesc54493, stub32048, ghir40662,
lsat25017, ecal221, ghir36382, bvul1173, ghir31978, ghir27968, stub12723
webflow,
overview
search pages
result pages




Lab meeting, Sept 13, 2004: Phytome demo
Dihui - BLAST search
A friend of mine is working with a plant called Lophopyrum elongatum (it's a weed, and it's salt-tolerant, and that's all I know about it). She just cloned a cDNA and wants to find out more about it: what it does, and which genes in which other taxa it is related to.
Although Lophopyrum is not among the species represented in Phytome, I offered to see whether I could find out more about her gene.
Best to use for this: the single BLAST search.
Navigate to the single BLAST search and explain the page. Mention batch BLAST.
paste the friend's sequence into the appropriate field
MEYQGQQQHDQATTNRVDEYGNPVAGHGVGTGMGAHGGVGTGAAAGGHFQPTREEHKAGGILQRSGSSSSSSSSEDDGMGGRRKKGIKDKIKEKLPGGHGDQQQTAGTYGQQGHTGM
AGTGGNYGQPGHTGMAGTDGTGEKKGIMDKIKEKLPGQH
explain the results page
view the best result: taes7111 from wheat
go to the best scoring family: 1980
Stefanie - Unigene search
http://www.ebi.ac.uk/interpro/IEntry?ac=IPR000167
search Phytome for InterproEntry 000167
look at the hvul1175 entry:
The family and subfamily ID
InterPro and Gene Ontology results, but only if the Unipeptide is an exemplar of its subfamily
The species name
A link to the primary source for this unigene sequence
A list of related unigenes (from all sources) that contain common GenBank accession numbers in their assembly
Predicted peptide sequence (available for download in FASTA format)
Jason - "restrict by species" search
You can search for families that do or do not contain members from particular species. Navigate to the "restrict by species" search and explain the page.
The relationships among the species are displayed as a phylogenetic tree (NCBI taxonomy information)
and you can select families to include or exclude using radio buttons to the right of each species name.
If the default "either" is selected, Phytome will return a family regardless of whether there are members from that species.
I'm interested in monocot gene families (Hordeum, barley, through Allium, onion), so I want to exclude all other taxa and retrieve only gene families with monocot members. NOTE: explain the difference between setting the monocots to "include" and leaving them at "either": because species with small numbers of Unipeptides necessarily lack members in most families, selecting "include" for every monocot will return NO families!
119273 families were retrieved. Their family IDs are shown.
click on family number 1980
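The include/exclude/either logic of the "restrict by species" search can be sketched as a simple filter. This is an assumption about how such a filter might behave, not Phytome's actual query code; apart from the family ID 1980 used in the demo, the family IDs and their species sets below are invented for illustration.

```python
def family_matches(family_species, constraints):
    """Check one family against per-species rules.

    constraints maps species name -> "include", "exclude" or "either".
    Species without a rule are treated as "either".
    """
    for species, rule in constraints.items():
        present = species in family_species
        if rule == "include" and not present:
            return False
        if rule == "exclude" and present:
            return False
    return True


def restrict_by_species(families, constraints):
    """Return the sorted IDs of all families that satisfy every rule."""
    return sorted(
        fam_id for fam_id, species in families.items()
        if family_matches(species, constraints)
    )


if __name__ == "__main__":
    # Hypothetical family memberships.
    families = {
        1980: {"Triticum aestivum", "Hordeum vulgare"},
        7: {"Hordeum vulgare", "Arabidopsis thaliana"},
        8: {"Allium cepa", "Hordeum vulgare", "Triticum aestivum"},
    }
    # Exclude a non-monocot, leave the monocots at "either".
    either = {"Arabidopsis thaliana": "exclude",
              "Hordeum vulgare": "either",
              "Allium cepa": "either"}
    print(restrict_by_species(families, either))

    # Requiring a sparsely sampled species shrinks the result sharply.
    strict = dict(either, **{"Allium cepa": "include"})
    print(restrict_by_species(families, strict))
```

The second query shows the effect described in the note above: every "include" is a hard requirement, so requiring many species at once can leave no families at all.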
Stefanie - family results page
The "Family Information Page" includes
o Related families, if this family is part of a superfamily (?)
o Hyperlinks to subfamilies (these work if the "Subfamily" tab is selected).
o A link to a list of family members excluded from the reduced alignment by REAP
o A list of the species represented within the family (these links work with the default "Species" tab)
The tabs below allow one to view
o A list of member Unipeptides, which can be sorted either by subfamily or by species, depending on which tab is selected. From these lists, you may select members to include in a multiple alignment and/or phylogeny.
o InterPro and GO assignments for an exemplar of each subfamily.
o By selecting multiple Unipeptides and proceeding to the "Alignment Page", one can download a single file containing all the predicted peptide sequences (in FASTA format) as well as additional information such as the names used by the Unigene sources and the component GenBank accession numbers.
protein family clustering
(Tribe-MCL)
[figure: Tribe-MCL cluster labels for the same 39 sequences at inflation values I = 5, 3.6, 2.8, 2.0 and 1.2, giving 6, 5, 4, 3 and 1 clusters respectively]
...some numbers
almost 1 million EST contigs/singletons
ESTwise translation
730,000 unigenes
BLAST all vs. BLAST all
640,000 unigenes
to be clustered
into families
110,000 singletons
data acquisition
species                                 tax_id   common name
Allium cepa                             4679     onion
Amborella trichopoda                    13333    amborella
Arabidopsis thaliana                    3702     thale cress
Avena sativa                            4498     oat
Beta vulgaris                           161934   sugarbeet
Brassica napus                          3708     rape
Capsicum annuum                         4072     (ornamental) pepper
Ceratopteris richardii                  49495    water sprite or indian fern
Citrus sinensis                         2711     orange
Cryptomeria japonica                    3369     Japanese cedar
Cucumis sativus                         3659     cucumber
Cycas rumphii                           58031    sago palm or seashore cycad
Eschscholzia californica                3467     california poppy
Glycine max                             3847     soybean
Gossypium hirsutum                      3635     cotton (tetraploid)
Helianthus annuus                       4232     sunflower
Hordeum vulgare                         4513     barley
Lactuca sativa                          4236     lettuce
Lotus corniculatus                      47247    lotus
Lycopersicon esculentum                 4081     tomato
Marchantia polymorpha                   3197     marchantia
Medicago truncatula                     3880     barrel medic
Mesembryanthemum crystallinum           3544     ice plant
Nicotiana benthamiana                   4100     wild tobacco
Oryza sativa                            4530     rice
Physcomitrella patens                   3218     Physcomitrella moss
Pinus taeda                             3352     loblolly pine
Phaseolus coccineus                     3886     scarlet runner bean
Populus tremula x Populus tremuloides   47664    aspen
Prunus persica                          3760     peach
Saccharum officinarum                   4547     plume grass or sugar cane
Secale cereale                          4550     rye
Solanum tuberosum                       4113     potato
Sorghum bicolor                         4558     sorghum
Stevia rebaudiana                       55670    candyleaf
Theobroma cacao                         3641     cacao
Triticum aestivum                       4565     wheat
Vitis vinifera                          29760    wine grape
Zea mays                                4577     corn
Zinnia elegans                          34245    zinnia
source columns (one X per species): NCBI, PGDB, PGN, SPNK, TIGR [per-species marks not recoverable]
multiple sequence alignment
family   ClustalW   Mafft i   Mafft p2   Mafft p3   T-Coffee   Dialign
1        2061       12380     93         312        –          –
2        360        845       32         73         –          8429
3        5108       8414      182        467        –          –
4        950        2470      45         101        –          12533
5        307        404       22         59         –          3564
6        87         125       9          31         –          1376
7        104        128       9          24         –          1075
8        105        114       8          20         19207      887
9        46         33        6          16         11820      394
10       145        296       17         36         7736       898
11       4          5         1          3          177        27