Ab Initio - iPlant Pods

Genome Annotation using MAKER-P at iPlant
Collaboration with Mark Yandell Lab (University of Utah)
www.yandell-lab.org
Carson Holt (Ontario Institute Cancer Research)
iPlant:
Josh Stein (CSHL)
Matt Vaughn (TACC)
Dian Jiao (TACC)
Zhenyuan Lu (CSHL)
Nirav Merchant (U. Arizona)
Cantarel et al. 2008. Genome Research 18:188
Holt & Yandell. 2011. BMC Bioinformatics 12:491
What Are Annotations?
• Annotations are descriptions of features of the genome
– Structural: exons, introns, UTRs, splice forms etc.
– Coding & non-coding genes
– Functional: enzymatic activity, expression
• Annotations should include an evidence trail
– Assists in quality control of genome annotations
• Examples of evidence supporting a structural annotation:
– Ab initio gene predictions
– ESTs
– Protein homology
Secondary Annotation
• Protein Domains
– InterProScan: combines many HMM databases
• GO and other ontologies
• Pathway mapping
– E.g. BioCyc Pathway Tools
Challenges in Plant Genome Annotation
• Genomes are BIG
• Highly repetitive
• Many pseudogenes
Yet it is important to get it right!
Contamination Issue
Annotation Error
Example: split gene models
Typical Annotation Pipeline
• Contamination screening
• Repeat/TE masking
• Ab initio prediction
• Evidence alignment (cDNA, EST, RNA-seq, protein)
• Evidence-based prediction
• Combiner
• Evaluation/filtering
• Manual curation
Options for Protein-coding Gene Annotation
• MAKER is an easy-to-use annotation pipeline designed to help smaller research groups convert the mountain of genomic data provided by next generation sequencing technologies into a usable resource.
MAKER identifies repeats, aligns ESTs and proteins to a genome, produces ab initio gene predictions, automatically synthesizes these data into gene annotations, and produces evidence-based quality values for downstream annotation management.
Quality Control evaluation of the MAKER-P and TAIR10 datasets using Annotation Edit Distance (AED).
Quality axis: Better to Worse
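To make the AED score concrete, here is a minimal Python sketch. It assumes the definition from the Eilbeck et al. 2009 paper cited later in these slides: AED = 1 - (SN + SP) / 2, where SN and SP are base-level sensitivity and specificity of the annotation against its evidence. An AED of 0 means the annotation agrees perfectly with the evidence and 1 means no support. The function names and the inclusive (start, end) coordinate convention are illustrative assumptions, not MAKER's own code.

```python
def covered_positions(intervals):
    """Return the set of base positions covered by inclusive (start, end) intervals."""
    positions = set()
    for start, end in intervals:
        positions.update(range(start, end + 1))
    return positions


def annotation_edit_distance(annotation_exons, evidence_exons):
    """Base-level AED between an annotation and its supporting evidence.

    Both arguments are lists of inclusive (start, end) tuples (assumed convention).
    """
    ann = covered_positions(annotation_exons)
    evi = covered_positions(evidence_exons)
    if not ann or not evi:
        return 1.0  # nothing to compare: treat as completely unsupported
    shared = len(ann & evi)
    sensitivity = shared / len(evi)  # fraction of evidence bases recovered
    specificity = shared / len(ann)  # fraction of annotated bases supported
    return 1.0 - (sensitivity + specificity) / 2.0


if __name__ == "__main__":
    # The annotation's second exon is 50 bases longer than the evidence supports.
    print(annotation_edit_distance([(100, 199), (300, 399)],
                                   [(100, 199), (300, 349)]))  # 0.125
```

A set of positions keeps the sketch short; for whole genomes an interval-based overlap would be the more practical choice.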
MAKER-P MPI Support
Message Passing Interface (MPI) is a communication protocol for computer clusters which essentially allows multiple computers to act like a single powerful machine.
Annotating the Genome – Apollo View
Current evidence
Current Assembly
Identify and Mask Repetitive Elements
• RepeatMasker
– RepBase
– Species-specific library
• RepeatRunner
– MAKER internal protein library
Current evidence
Current Assembly
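As an illustration of what masking does to the sequence itself, the sketch below applies repeat coordinates to a string. It assumes the common conventions that soft masking lowercases repeat bases and hard masking replaces them with N. The function name and the 0-based half-open coordinates are hypothetical; this is not how RepeatMasker or RepeatRunner is implemented.

```python
def mask_sequence(sequence, repeats, soft=True):
    """Mask repeat intervals in a DNA string.

    repeats: list of (start, end) tuples, 0-based, end-exclusive (assumed convention).
    soft=True lowercases repeat bases; soft=False replaces them with 'N'.
    """
    bases = list(sequence)
    for start, end in repeats:
        # Clamp to the sequence bounds so stray coordinates cannot raise.
        for i in range(max(start, 0), min(end, len(bases))):
            bases[i] = bases[i].lower() if soft else "N"
    return "".join(bases)


if __name__ == "__main__":
    seq = "ACGTACGTAC"
    print(mask_sequence(seq, [(2, 5)]))              # ACgtaCGTAC
    print(mask_sequence(seq, [(2, 5)], soft=False))  # ACNNNCGTAC
```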
Generate Ab Initio Gene Predictions
• MAKER currently supports:
– SNAP
– Augustus
– GeneMark
– FGENESH
• Can be run internally or externally
Current evidence
Ab initio Predictions
Current Assembly
Align EST and Protein Evidence
• Identify regions being actively transcribed (i.e. EST data)
• Identify regions with homology to a known protein
EST TBLASTX
Protein BLASTX
EST BLASTN
Current evidence
Ab initio Predictions
Current Assembly
Polish BLAST Alignments with Exonerate
• All base pairs must align in order.
• No HSP overlap is permitted.
• Aligns HSPs correctly with respect to splice sites.
Current evidence
Polished protein
Polished EST
Ab initio Predictions
Current Assembly
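The first two polishing rules above (alignments in order, no HSP overlap) can be expressed as a simple check on a set of HSPs. The sketch below is only that check, under the assumptions of forward-strand hits and inclusive coordinates. The HSP tuple and function name are hypothetical, and Exonerate itself performs a full splice-aware realignment rather than a filter like this.

```python
from typing import List, NamedTuple


class HSP(NamedTuple):
    """One high-scoring segment pair (inclusive coordinates, forward strand assumed)."""
    query_start: int
    query_end: int
    target_start: int
    target_end: int


def is_colinear_chain(hsps: List[HSP]) -> bool:
    """True if the HSPs appear in the same order on query and genome without overlapping."""
    ordered = sorted(hsps, key=lambda h: h.query_start)
    for prev, curr in zip(ordered, ordered[1:]):
        if curr.query_start <= prev.query_end:
            return False  # HSPs overlap on the query (EST or protein)
        if curr.target_start <= prev.target_end:
            return False  # HSPs overlap or are out of order on the genome
    return True


if __name__ == "__main__":
    exon1 = HSP(1, 50, 1000, 1149)
    exon2 = HSP(51, 100, 2000, 2149)
    print(is_colinear_chain([exon1, exon2]))  # True
```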
Pass Gene Finders Evidence-based ‘hints’
Current evidence
Ab initio Predictions
Hint-based SNAP
Hint-based FgenesH
Current Assembly
Identify Gene Model Most Consistent with Evidence*
*Eilbeck, Moore, Holt & Yandell. 2009. Quantitative Measures for the Management and Comparison of Annotated Genomes. BMC Bioinformatics 10:67. doi:10.1186/1471-2105-10-67
Revise it further if necessary; Create New Annotation
Compute Support for Each Portion of Gene Model
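One simple way to picture per-portion support is the fraction of each exon's bases that overlap at least one evidence alignment. The sketch below computes that fraction. It is an illustrative assumption about what "support" could mean, with hypothetical names and inclusive coordinates, and is not MAKER's actual scoring.

```python
def exon_support(exons, evidence):
    """For each exon, return the fraction of its bases covered by any evidence interval.

    exons, evidence: lists of inclusive (start, end) tuples (assumed convention).
    """
    fractions = []
    for ex_start, ex_end in exons:
        covered = set()
        for ev_start, ev_end in evidence:
            lo = max(ex_start, ev_start)
            hi = min(ex_end, ev_end)
            if lo <= hi:  # the evidence interval overlaps this exon
                covered.update(range(lo, hi + 1))
        fractions.append(len(covered) / (ex_end - ex_start + 1))
    return fractions


if __name__ == "__main__":
    # First exon fully supported, second exon supported over its last 50 bases.
    print(exon_support([(100, 199), (300, 399)], [(100, 199), (350, 449)]))  # [1.0, 0.5]
```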
MAKER-P v2.28 at iPlant
• TACC Lonestar
– Supercomputer with 22,656 CPU cores
– MPI enabled for parallel computation
– Can complete entire rice genome in ~2 hrs (1,152 cores; 96 CPU per chromosome)
– Can complete Aegilops tauschii ALLPATHS-LG assembly in ~8 hrs (1,152 cores)
– Currently being integrated into the iPlant Discovery Environment
• Atmosphere
– MPI enabled for parallel computation
– Maximum instance size 16 CPU
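The rice figures on Lonestar are consistent with one block of cores per chromosome. Rice has 12 chromosomes, so assuming each chromosome gets its own 96-CPU allocation (an interpretation of the slide, not a documented job layout):

```latex
12 \ \text{chromosomes} \times 96 \ \text{CPU per chromosome} = 1{,}152 \ \text{cores}
```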
Assembly & Annotation at iPlant
• Genome Assembly: ALLPATHS-LG, Newbler, SOAPdenovo, SCARF, ABySS, Oasis, Velvet, Ray
• Transcriptome Assembly
– De novo: Trinity, SOAPdenovo-Trans, Velvet/Oasis, Trans-ABySS
– Reference-guided: Tophat, Cufflinks
• Data Commons: Reference genomes, Reference annotations, SNAP HMM models, Repeat Libraries, Transcriptome data
• SNAP Training: Fathom/Forge, HMM-assembler
• Conversion tools: tophat2gff, cufflinks2gff
• MPI-MAKER on TACC Lonestar (22,656 cores): Augustus, SNAP, Exonerate, BLAST, RepeatMasker
• Conversion tools: maker2jbrowse, maker2zff
• Visualization: JBROWSE, Web-Apollo
• Post Annotation: InterProScan, InterPro2GO
• ncRNA Annotation: miRDeep2
Flow labels: Genome input, Evidence input, MAKER output
Key: DE, TACC, in progress