Asilomar - University of Notre Dame

Discovery and Annotation of
Transposable Elements on VectorBase
http://www.vectorbase.org
Ryan C. Kennedy1,2, Maria F. Unger1,3, Scott Christley4,
Jenica L. Abrudan1,3, Neil F. Lobo1,3, Greg Madey1,2, Frank H. Collins1,2,3
1Eck Institute for Global Health, University of Notre Dame
2Department of Computer Science and Engineering, University of Notre Dame
3Department of Biological Sciences, University of Notre Dame
4Department of Mathematics & Department of Computer Science, University of California, Irvine
Abstract
Although transposable elements (TEs) were discovered over 50 years
ago, their robust discovery in newly sequenced genomes remains
a difficult problem. Numerous types with different structural
characteristics, sequence degradation, multiple insertions within existing
elements, and co-option by the organism’s regulatory system are some of
the issues confounding the discovery process.
We have developed a semi-automated pipeline employing a homology-based
approach, complemented with de novo and structure-based approaches,
to discover and annotate TEs in invertebrate genomes. Once fully
automated, our pipeline will be integrated with VectorBase, an NIAID
Bioinformatics Resource Center for invertebrate vectors of human
pathogens, to produce a first-pass discovery and annotation of TEs for
newly sequenced genomes. Currently hosting five organisms with more
on the way, VectorBase provides the Ensembl genome browser,
computational tools, and other data specific to the study of invertebrate
vectors.
The annotation component of our pipeline includes enhancements to the
Ensembl genome browser, elevating the importance of TEs by displaying
genomic location, structural details, alignments with consensus TEs, and
homology with other organisms. VectorBase has developed a community
annotation system whereby the research community can upload
annotation corrections to genes for curation and broad dissemination; we
plan to extend this to TEs. We hope this will provide an invaluable
resource for researchers studying the biology of TEs and their genomic
impact.
Goal
We aim to provide an automatic and easy-to-use method, integrated with
VectorBase, to identify and annotate TEs in invertebrate genomes.
VectorBase
VectorBase is an NIAID bioinformatics resource center that provides
web-based access to information and tools pertaining to invertebrate
vectors of human pathogens. VectorBase currently houses genome
information for the mosquito species Aedes aegypti, Anopheles gambiae,
and Culex quinquefasciatus, as well as the body louse, Pediculus
humanus, and the tick, Ixodes scapularis. Current features and
capabilities include:
• Integrated use of the Ensembl genome browser
• Integrated tools, including BLAST, ClustalW, and HMMER
• Community annotation pipeline for genes
• Microarray and gene expression repository
• Controlled vocabularies
TE Discovery
TEs are difficult to thoroughly characterize because of their complex and
varying structure (or lack thereof). Most current TE discovery techniques
fall into the following categories: homology-based, structure-based, and
de novo. Popular tools exist within each of these categories, yet most are
not automated or easily accessible for all researchers. We have
developed a semi-automated discovery pipeline that uses a homology-based
approach and is complemented with de novo and structure-based
components. Our pipeline relies on several well-known technologies,
including BLAST, Perl (and BioPerl), and DNASTAR SeqMan II. We also
require a library of representative TEs, which we obtain from Repbase,
TEfam, and the literature.
TE Discovery Pipeline
Our homology-based TE discovery pipeline can be broken down into the
following steps and is also shown graphically in Figure 2:
1. BLAST representative sequences against the genome
• Sequences are individually blasted against the genome
2. Process results files and extract hits
• Hits that lie within a prespecified distance of one another are
combined and represented as a single hit
• Hits that do not meet a minimum length threshold are ignored
• The corresponding genomic sequence is extracted, including
flanking sequence
3. Assemble sequences
• Sequences are assembled into contigs
• Biologists familiar with TEs manually determine TE boundaries and
consensus sequences are generated
4. BLAST results from step 3 against the genome and characterize results
• Generated consensus sequences are blasted against the genome to
determine coverage
• Hits are then analyzed by scripts and TEs are annotated
Community Gene Annotation
Annotation is the process by which meaning is given to genomic data.
Ensembl’s automatic gene annotation system is one of the better-known
gene annotation systems. VectorBase currently hosts a community
annotation pipeline, whereby registered users of the site can contribute
annotation data for one of the hosted genomes. VectorBase can accept
four types of annotation information: gene models, publications, controlled
vocabulary terms, and comments. The following steps are taken by users
and curators to submit gene models:
1. Users download, fill out, and upload the gene submission form through
VectorBase
2. Users preview and submit the data
3. Curators can then approve the data
4. If approved, genes are integrated into the manual annotation DAS track
and displayed in the genome browser, shown in Figure 1
Community TE Annotation
Although not yet fully implemented, annotation of TEs on VectorBase
will follow the same general steps as for genes, and TEs will be
shown within the genome browser. Current work has produced a means to
store consensus TEs in the same Chado database schema as genes and
to provide a structural display of TEs. Existing online TE repositories
generally lack both this structural display and the user-feedback
system that VectorBase employs. Additionally, BLAST will be used to
show coverage of TEs within a genome. Figure 1 shows the information
flow for TE community annotation on VectorBase.
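One way to summarize TE coverage from BLAST hits is to take the union of the hit intervals on each genome sequence and divide by the total genome length. The sketch below is written in Python and is only an assumption about how such a figure might be computed, not a description of the VectorBase implementation. The function names and the (sequence id, start, end) tuple layout are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical hit layout: (sequence id, start, end), 1-based inclusive.
Interval = Tuple[str, int, int]


def covered_bases(intervals: Iterable[Interval]) -> Dict[str, int]:
    """Count the bases covered by at least one hit on each sequence."""
    by_seq: Dict[str, List[Tuple[int, int]]] = {}
    for seq_id, start, end in intervals:
        # Minus-strand hits may be reported with start > end; normalize them.
        by_seq.setdefault(seq_id, []).append((min(start, end), max(start, end)))

    totals: Dict[str, int] = {}
    for seq_id, spans in by_seq.items():
        spans.sort()
        total = 0
        cur_start, cur_end = spans[0]
        for start, end in spans[1:]:
            if start <= cur_end + 1:
                # Overlapping or directly adjacent: extend the current span.
                cur_end = max(cur_end, end)
            else:
                total += cur_end - cur_start + 1
                cur_start, cur_end = start, end
        total += cur_end - cur_start + 1
        totals[seq_id] = total
    return totals


def genome_coverage(intervals: Iterable[Interval],
                    seq_lengths: Dict[str, int]) -> float:
    """Return the fraction of the genome covered by the given hits."""
    genome_size = sum(seq_lengths.values())
    if genome_size <= 0:
        raise ValueError("genome size must be positive")
    return sum(covered_bases(intervals).values()) / genome_size


if __name__ == "__main__":
    example_hits = [("chr1", 1, 100), ("chr1", 51, 150), ("chr2", 10, 19)]
    print(genome_coverage(example_hits, {"chr1": 1000, "chr2": 1000}))
```

Taking the union first means that overlapping hits to the same region are counted once, so the result cannot exceed 1.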
Figure 2. Simplified visual diagram of
homology-based discovery pipeline.
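Step 2 of the discovery pipeline (combining nearby hits, dropping short ones, and extracting flanking sequence) can be sketched as follows. This is a minimal illustration in Python rather than the Perl scripts the pipeline itself uses. The names Hit, merge_hits, filter_by_length, and with_flanks, and the parameters max_gap, min_length, and flank, are assumptions made for the example; strand is ignored for brevity.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Hit:
    """One BLAST hit on a genome sequence (hypothetical layout, 1-based inclusive)."""
    seq_id: str
    start: int
    end: int


def merge_hits(hits: List[Hit], max_gap: int) -> List[Hit]:
    """Combine hits on the same sequence that lie within max_gap bases of each other."""
    merged: List[Hit] = []
    for hit in sorted(hits, key=lambda h: (h.seq_id, h.start, h.end)):
        last = merged[-1] if merged else None
        if (last is not None and last.seq_id == hit.seq_id
                and hit.start - last.end <= max_gap):
            # Close enough (or overlapping): represent both as a single hit.
            merged[-1] = Hit(last.seq_id, last.start, max(last.end, hit.end))
        else:
            merged.append(hit)
    return merged


def filter_by_length(hits: List[Hit], min_length: int) -> List[Hit]:
    """Ignore hits shorter than min_length bases."""
    return [h for h in hits if h.end - h.start + 1 >= min_length]


def with_flanks(hit: Hit, flank: int, seq_length: int) -> Hit:
    """Extend a hit by flank bases on each side, clipped to the sequence ends."""
    return Hit(hit.seq_id,
               max(1, hit.start - flank),
               min(seq_length, hit.end + flank))


if __name__ == "__main__":
    raw = [Hit("chr1", 100, 200), Hit("chr1", 230, 400),
           Hit("chr1", 5000, 5020), Hit("chr2", 10, 300)]
    kept = filter_by_length(merge_hits(raw, max_gap=50), min_length=100)
    for h in kept:
        print(with_flanks(h, flank=500, seq_length=10000))
```

The extended coordinates would then be used to pull the corresponding sequence out of the genome before assembly in step 3.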
Figure 1. Information flow diagram for TE annotation.
References
D. Lawson, et al. VectorBase: a data resource for invertebrate vector genomics.
Nucleic Acids Research, 37:D583-D587, 2009.
Repbase. http://www.girinst.org/repbase/index.html.
TEfam. http://tefam.biochem.vt.edu.
The VectorBase project is funded by the US National Institute of Allergy and
Infectious Diseases (NIAID), contract HHSN266200400039C.