Transcript Slide 1

ANALYSIS AND VISUALIZATION OF SINGLE COPY ORTHOLOGS IN ARABIDOPSIS,
LETTUCE, SUNFLOWER AND OTHER PLANT SPECIES.
Alexander Kozik and Richard W. Michelmore
University of California, Davis, Dept. of Vegetable Crops, Davis, CA 95616, USA
Approximately 3,700 genes in the Arabidopsis Col-0 genome are single copy. These genes
were used to identify conserved orthologs in several other plant species. Using computational
approaches, we identified 1104 lettuce, 686 sunflower, 1704 tomato, 2016 soybean, 1701 maize and
1290 rice ESTs that are conserved orthologs of these Arabidopsis genes. Each EST sequence
in these sets has a single, unambiguous strong BLAST hit to the Arabidopsis genome.
Reciprocal BLAST searches (Arabidopsis single copy genes versus EST assemblies) showed that
more than 80% of the searches returned only a single strong hit. This indicates that the majority
of these conserved orthologs are represented by single genes in multiple plant species. The total
number of Arabidopsis genes that have similarity (BLAST expectation value of 1e-20 or better) to
at least one of these selected ESTs is 2205, which is 60% of the total number of single copy genes
in Arabidopsis. Only 248 sequences were common to the EST collections of all species and the
Arabidopsis single copy genes. This can be partly explained by incomplete gene representation
within each EST collection. Analysis and visualization of single copy genes along Arabidopsis
chromosomes (http://cgpdb.ucdavis.edu/COS_Arabidopsis/arabidopsis_single_copy_genes_2003.html)
revealed that these genes are distributed throughout the genome regardless of large-scale
chromosomal duplications. This indicates that deduction of the order of genes in common
ancestors is required for informative analyses of synteny.
SINGLE COPY ORTHOLOGS SUMMARY

Set                                     Number of single copy orthologs
lettuce                                 1104
sunflower                               686
tomato                                  1704
soybean                                 2016
maize                                   1701
rice                                    1290
common between all                      248
common between lettuce and sunflower    431
Arabidopsis (total)                     2205 (out of 3,714 single copy genes)

PIPELINE TO IDENTIFY SINGLE COPY ORTHOLOGS

Source: Arabidopsis predicted proteins (27,169 seqs)

[step 1] BLAST search of Arabidopsis proteins against themselves and selection of
Arabidopsis single copy genes. Result: Arabidopsis single copy genes (3,714 seqs).
[step 2] BLAST search of Arabidopsis single copy genes versus full sets of ESTs;
selection of ESTs with BLAST hits to the Arabidopsis single copy subset.
[step 3] BLAST search of selected ESTs versus all Arabidopsis predicted proteins and
selection of ESTs with a single strong hit to the Arabidopsis genome (Exp cutoff 1e-20).

EST sets searched:
lettuce ESTs      (68,197 seqs)
sunflower ESTs    (67,180 seqs)
tomato ESTs       (113,932 seqs)
soybean ESTs      (341,564 seqs)
rice ESTs         (107,329 seqs)
maize ESTs        (362,510 seqs)

Raw data and a detailed description of the sequence extraction pipeline are available at:
http://cgpdb.ucdavis.edu/COS_Arabidopsis/
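The "single strong hit" selection in step 3 of the ortholog pipeline can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes BLAST results in the 12-column tabular format (query ID, subject ID, ..., E-value in column 11), and it assumes that "single strong hit" means exactly one distinct subject at or below the 1e-20 cutoff. The function name is hypothetical; the exact criteria used on the poster are described at the URL above.

```python
import csv
from collections import defaultdict

E_CUTOFF = 1e-20  # expectation cutoff quoted on the poster


def single_strong_hit_queries(tabular_lines, e_cutoff=E_CUTOFF):
    """Map each query with exactly one strong subject to that subject.

    Assumes 12-column tabular BLAST rows: column 1 is the query ID,
    column 2 the subject ID, and column 11 the E-value.
    """
    strong_subjects = defaultdict(set)
    for row in csv.reader(tabular_lines, delimiter="\t"):
        if len(row) < 11 or row[0].startswith("#"):
            continue  # skip blank, truncated, and comment lines
        query, subject, evalue = row[0], row[1], float(row[10])
        if evalue <= e_cutoff:
            strong_subjects[query].add(subject)
    # Keep only queries whose strong hits all point to one subject.
    return {
        query: next(iter(subjects))
        for query, subjects in strong_subjects.items()
        if len(subjects) == 1
    }
```

Several alignments (HSPs) to the same subject count as one hit here, because subjects are collected in a set. Any iterable of lines works as input, for example an open file handle.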
PIPELINE TO EXTRACT ALIGNMENTS AT NUCLEOTIDE LEVEL

Input: GenBank files of Arabidopsis genome (DNA sequences of entire chromosomes and
corresponding annotation)

[step 1] GenBank Parser (alignment): produces spliced DNA sequences corresponding to ORFs.
[step 2] Translation: produces translated (protein) sequences [subject].
[step 3] BLASTX search [ESTs vs proteins], with the ESTs (unigene) set as [query]:
produces BLAST output.
[step 4] BLAST parser (Tcl/Tk script): produces a tab-delimited file with info about
BLAST alignments (start points and end points for each sequence in BLAST report).
[step 5] SeqsExtractorFromBlastX (Python script), the final step of the pipeline:
extraction of DNA sequences corresponding to BLAST alignments from “spliced DNA” (subject)
and EST (query) files. The script automatically counts codon usage.
Output: spreadsheet with info about codon usage.

http://cgpdb.ucdavis.edu/COS_Arabidopsis/Codon_Usage_Pipeline.html
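The final extraction step can be illustrated with a short sketch. This is not the SeqsExtractorFromBlastX script itself; the function names are hypothetical. It assumes a BLASTX-style alignment in which query coordinates are 1-based nucleotide positions on the EST and subject coordinates are 1-based amino acid positions on the protein, so that the matching stretch of spliced DNA starts at (s_start - 1) * 3. It also assumes a plus-strand, ungapped alignment.

```python
from collections import Counter


def extract_aligned_dna(est_seq, spliced_dna, q_start, q_end, s_start, s_end):
    """Return (query_dna, subject_dna) for one plus-strand alignment.

    q_start/q_end: 1-based nucleotide positions on the EST (query).
    s_start/s_end: 1-based amino acid positions on the protein (subject),
    converted here to nucleotide positions on the spliced DNA.
    """
    if q_start > q_end:
        raise ValueError("minus-strand hits are not handled in this sketch")
    query_dna = est_seq[q_start - 1:q_end]
    subject_dna = spliced_dna[(s_start - 1) * 3:s_end * 3]
    return query_dna, subject_dna


def count_codons(dna):
    """Count in-frame codons, ignoring a trailing partial codon."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return Counter(dna[i:i + 3] for i in range(0, usable, 3))
```

Handling minus-strand hits would require reverse-complementing the EST segment first, and gapped alignments would need the aligned strings themselves rather than only start and end points.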
MULTIPLE ALIGNMENT VISUALIZED WITH TkLife (http://www.atgc.org/TkLife/)
[Figure: aligned sequences labeled lettuce, sunflower, Arabidopsis, and alignment summary]
Graphical representation of BLAST search of lettuce, sunflower, tomato, soybean, maize and rice ESTs against
Arabidopsis genome. The picture displays potential conserved orthologs (single copy genes in Arabidopsis).
Each box (element) is a single copy Arabidopsis gene having homology to selected sets of plant ESTs.
Genes are plotted along five Arabidopsis chromosomes according to their physical positions.
Alignment legend:
codon match (and amino acid match)
codon mismatch and amino acid match (synonymous substitutions)
codon mismatch and amino acid mismatch (non-synonymous substitutions)
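The three categories in the alignment legend can be computed codon by codon, as in the sketch below. It assumes the standard genetic code and two in-frame, ungapped DNA segments of equal length; the function name and category labels are illustrative and are not taken from TkLife.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def classify_codon_pairs(query_dna, subject_dna):
    """Count codon matches, synonymous and non-synonymous differences."""
    query_dna, subject_dna = query_dna.upper(), subject_dna.upper()
    n = min(len(query_dna), len(subject_dna))
    counts = Counter()
    for i in range(0, n - n % 3, 3):
        q_codon, s_codon = query_dna[i:i + 3], subject_dna[i:i + 3]
        if q_codon not in CODON_TABLE or s_codon not in CODON_TABLE:
            continue  # skip codons with gaps or ambiguous bases
        if q_codon == s_codon:
            counts["match"] += 1
        elif CODON_TABLE[q_codon] == CODON_TABLE[s_codon]:
            counts["synonymous"] += 1
        else:
            counts["nonsynonymous"] += 1
    return counts
```

For example, GCT versus GCC both encode alanine and count as a synonymous difference, while GAT (aspartate) versus GGT (glycine) counts as non-synonymous.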
Segmental duplication between Arabidopsis chromosomes 4 and 5
[Figure panels: CHRM 4, CHRM 5]
Color Scheme:
Black - single copy genes
Purple - kinases
Green - cytochrome
Red - resistance genes
Yellow - ribosomal proteins
Gray lines connect genes with sequence identity 40% or greater
Note: Single copy genes are distributed evenly through both segments of the
duplicated region. Image was generated by GenomePixelizer using the “locus
zoomer” function. Additional information is available at:
http://www.atgc.org/GP_Ref/presentation/
Patterns of segmental duplications in the Arabidopsis
genome (generated by GenomePixelizer,
http://www.atgc.org/). Regions outlined by white
boxes are shown enlarged above.
Credits:
This work was funded by USDA IFAFS Plant Genome Program to the
Compositae Genome Project
Questions and comments to Alexander Kozik, email: [email protected]
Putative scenario of gene loss after segmental duplication
Because of extensive gene loss after duplication, deduction of
gene order in ancestral genomes is required for informative
synteny analysis between different genomes.