Ensembl Acknowledgements - The Francis Crick Institute

Transcript Ensembl Acknowledgements - The Francis Crick Institute

Ensembl RNASeq
Pipeline
Steve Searle
EBI is an Outstation of the European Molecular Biology Laboratory.
Outline of Talk
• What the Ensembl RNASeq pipeline does
• Ensembl Pipeline infrastructure
• How we’ve modified the process as its scaled up
• Cloud version
• Tips for running large compute
Uses of RNASeq in Ensembl
Ways we use RNASeq data in Ensembl:
•
Build complete gene set from scratch for individual or pooled
RNASeq data sets
•
Incorporate into a new Ensembl gene set
•
Add novel models into a gene set
•
UTR
•
Filtering Models
•
Improve old gene sets
•
Expression levels (work in progress)
RNASeq pipeline
Align reads
using BWA
Generate proto-transcripts
Find splice sites
using Exonerate
Model generation
Filtering
RNA-Seq models
against UniProt
per tissue & merged
RNASeq Pipeline
Alignment and Initial Processing
•
•
•
•
•
Reads are aligned to the genome with a quick un-gapped alignment using
BWA
Transcriptome reads split over introns - we need to allow for this:
Align with up to 50% miss-matches to get intron spanning reads to align
The alignments are then processed to collapse overlapping reads into
blocks representing exons
Read pairing is used (if available) to group the exon blocks into
approximate transcript structures
RNASeq Pipeline
Intron Alignment
We align split reads using Exonerate – has a good splice model but is
not a short read aligner
Intron alignment is made faster in 2 ways:
• Don’t realign all the reads:
• Introns are resolved by realigning partially aligned reads.
• Use Exonerate word length to define which reads to realign
• Align to a single transcript:
• Reads are realigned either to the rough transcript sequence or
to the genomic span of the rough transcript.
•
•
Limiting the search space allows us to do a more sophisticated
Exonerate alignment with a splice model and a shorter word
length.
Aligning to the genomic span of the transcript can identify small
exons that were missed by the BWA alignment that can be
incorporated into the final model.
Final Models
Collapsed
Intron Features
Split reads
Exonerate spliced alignment
Partially aligned
reads
A*
B
C
A*
1
12
Tissue 1
: 1-A-2-B-3
Tissue 2
: 1-A-2-B-3-D-4
1-A-2-C-4
B
23
D
34
45
Display
Data visible in Ensembl
Transcript models
Intron features
BAM files of BWA alignments
Nile tilapia: BAM files
Advantage of using RNASeq in the
genebuild
• Some species have little specific data
– Eg. Nile tilapia
• 131 proteins in Uniprot
• 35 cDNAs, 119531 ESTs
– Rely on data from related species
• RNASeq supplements the above data
– Species-specific
• Fills gaps, alternate splice sites, filtering
Genebuild process
Raw Computes
Targetted stage
Similarity stage
cDNAs/ESTs
Filtering
UTR addition
Merged
RNA-Seq models
Filtering
Final gene set
RNASeq helps with:
1. Choice of splice site
RNASeq
Similarity models
Ensembl model
RNASeq helps with:
2. UTR addition
RNASeq model
Similarity model
Ensembl model
RNASeq helps with:
3. New models & alt. splicing
RNASeq introns
RNASeq model
Similarity model
Ensembl model
Gene set Update Pipeline
using RNASeq
RNASeq update pipeline is highly automated, many species
take around a week to process
1. Split existing Ensembl gene set into single transcript
genes
2. Transcript scoring / filtering
•UTR addition done at the same time
3. Layering
•avoiding pseudogenes
•gap filling (includes some fragmented models)
4. Rebuild core set
5. Transfer pseudogenes + ncRNAs
Gene set update pipeline is fast and is using existing code
in a novel way with very few alterations
Dealing with biological data issues
•
Retained introns / sequence of partially processed transcripts
• Secondary alignment can build introns within rough exons
•
Wide variation in expression levels
• In highly expressed genes background expression can lead to
joining of genes
• In lowly expressed genes need merged set to get good model
• Mixture of alternate isoforms complicates model building
•
Running on individual tissues gives different pattern of genes
expression, which can make models easier to build for some genes,
BUT comes at cost of extra compute
Retained intron problem
Ensembl pipeline infrastructure
• Runs on top of a batch queuing system (we use LSF),
controlling submission of jobs
• Important for tracking job status - failures occur (maybe
using too much memory, failed writes, hardware
problems)
• Resubmits failed jobs with different submission
parameters
• Runs on large Sanger compute farm (1000s of cores,
large Lustre file system)
Runnables and RunnableDBs
Runnables and RunnableDBs are wrappers
Database
RunnableDB
Runnable
RepeatMasker
24
Runnables and RunnableDBs
• RunnableDB
–
–
–
–
Fetches sequence from the database
Passes data to Runnable
Fetches output from Runnable
Writes output to db
• Runnable
–
–
–
–
–
–
–
Wrapper
Is stand-alone from the db
Accepts data from RunnableDB
Passes data to Module eg. Exonerate
Fetches output
Parses output and creates Ensembl objects
Passes output to RunnableDB
25
Core schema
• Ensembl core schema
– MySQL (MyISAM)
– Tables containing biological data
• Ensembl pipeline schema
– Additional tables to the core tables
– Tables containing job management data
• API to access data
– Uniform method of access to the data
– Object-orientated Perl
– cvs-controlled repositories:
•ensembl
•ensembl-pipeline
•ensembl-analysis
26
Core schema
Gene (Bio::EnsEMBL::Gene)
Gene_id
Analysis_id
Biotype
Seq_region_start
Seq_region_end
Seq_region_strand
459
Targetted_genewise
Protein_coding
1000
2000
-1
Exon (Bio::EnsEMBL::Exon)
Exon_id
Rank
Seq_region_start
Seq_region_end
Seq_region_strand
1
n
Transcript (Bio::EnsEMBL::Transcript)
)
Transcript_id
783
Gene_id
459
Analysis_id
Targetted_genewise
Biotype
Protein_coding
Seq_region_start
1000
Seq_region_end
2000
Seq_region_strand
-1
3562
2
1500
1700
-1
1
1
1
Exon_transcript
n
Transcript_id
Exon_id
783
3562
27
Pipeline schema
• analysis:
ensembl/sql/table.sql
– analysis
• rules:
– rule_conditions
– rule_goal
• input_ids:
– input_id_analysis
– input_id_type_analysis
ensembl-pipeline/sql/table.sql
• job-tracking tables:
– job
– job_status
28
Configs and rules
• Configuration files
– General configuration: General.pm
– PERL5LIB
– Job control configuration in
•Bio/Ensembl/Pipeline/Config/BatchQueue.pm
– Analysis-specific configuration in
•e.g.
Bio/Ensembl/Analysis/Config/GeneBuild/RefineS
olexaGenes.pm
– Database-specific configuration in
•Bio/Ensembl/Analysis/Config/Databases.pm
29
Configs and rules
• Rule tables
Analysis
Analysis 1
SubmitContig
Analysis 2
RepeatMasker
Analysis 3
Pmatch
1
1
Rule_goal
Analysis 2
Goal 1
Analysis 3
Goal 2
1
n
Rule_conditions
Goal 1
SubmitContig
Goal 2
SubmitContig
Goal 2
RepeatMasker
30
Configs and rules
Analysis
Input_id_type
Analysis 1
SubmitContig
Analysis 1
CONTIG
Analysis 2
RepeatMasker
Analysis 2
CONTIG
Analysis 3
Pmatch
Analysis 3
CONTIG
Analysis 4
BestPmatch
Analysis 4
GENOME
1
1
31
Input IDs
• Tables
– input_id_type_analysis: links analysis to type
– input_id_analysis: links analysis to specific data
• Input types
– Sequence types e.g. Chromosome, Contig, Slice
– specified using an Ensembl identifier
– coord_system:asm:seq_region_id:start:end:strand
scaffold:gorGor1:Scaffold_1000:1:65750:1
– File eg. Fasta chunk of 100 cDNAs
32
Input IDs
Analysis
RepeatMasker
Analysis 1
SubmitContig
Analysis 2
RepeatMasker
Input_id_analysis
Analysis 3
Pmatch
Analysis 1
Input_id 1
Analysis 1
Input_id 2
Analysis 1
Input_id 3
Analysis 2
Input_id 1
Analysis 2
Input_id 2
Analysis 2
Input_id 3
Input_id_analysis
Analysis 1
Input_id 1
Analysis 1
Input_id 2
Analysis 1
Input_id 3
33
Jobs
A job is a single analysis processing a single input_id
• Tables
– job and job_status
• Status of jobs
– Pending, Reading, Running, Writing, AWOL, FAILED
– Has is_current flag
• Compute farm
– Platform LSF used as scheduler
34
Jobs on the farm
Job
Input_id_analysis
Analysis 1
Input_id 1
Analysis 1
Input_id 2
Analysis 1
Input_id 3
Analysis 2
Input_id 1
Analysis 2
Input_id 2
Analysis 2
Input_id 3
1
1 Job_id 1
Job_id 2
Analysis 1
Input_id 1
Analysis 1
Input_id 2
Job_id 3
Job_id 4
Analysis 1
Analysis 2
Input_id 3
Input_id 1
…
…
…
1
n
Job_status
Job_id 2
Pending
-
Job_id 2
Running
-
Job_id 2
Writing
Is_current
Job_id 3
Pending
Is_current
35
Jobs on the farm
SubmitContig
Analysis
executable
parameters
Contig
Job
(input_id)
Run on farm
Store result to DB
Retry if failed
36
Infinite loop
Set-up
Rulemanager
Set_up_pipeline
• Get all rules and input_ids
Foreach input_id_type:
Foreach input_id:
Run job if
• Goal type eq input_id_type
• Analysis incomplete
• Rule conditions are met
• Job has failed and can be retried
• Batch size is met
37
Dealing with Data Growth
RNASeq per release
1400
BAM file size across all tissues
(GB)
Chicken
Anole lizard
1200
Human
Cat
1000
Coelacanth
Tasmanian devil
Platyfish
800
Zebrafish
Platypus
600
Opossum
Orangutan
Ferret
400
Turtle
Dog
Pig
200
Tilapia
Chimpanzee
0
e66
e67
e68
e69
e70
e71
Ensembl release number
e72
Evolution of the RNASeq build system
First version
Exonerate for both alignment stages
Stored alignment data in database - doesn't scale
Second version
Exonerate for both alignment stages
Stored alignment data in BAM files, and only gene models and supporting features in
database
Third version
BWA for first stage
Exonerate for spliced alignments, use either rough models or genomic range
Stored alignment data in BAM files, and only gene models and supporting features in
database
Scripts to automate more of the pipeline setup process
Using multi-threaded Picard and Samtools for some BAM file manipulation steps
Investigating effect of read depth
on model generation
• Converted from perl to C and further optimisation of the code
– ~750x speed up
• Allowing us to explore many different filtering combinations:
– Read depth filters for consensus and non consensus
introns
– Intron length filters
– Run on individual tissues (lower depth, different / simpler
pattern of expression)
Optimising RNASeq model building
step
• Rewriting in C gives ~90x speed up without any other optimisation
• Multi-threaded Samtools to speed up reading intron alignments from BAM
file
• Smaller memory because its C rather than perl
• Use tcmalloc
– Small increase in speed
– easier control of releasing memory back to the OS
• Database reading done using few large queries
– speed up fetching
– Reduce load on database
• Sequence fetching optimised
– fetch complete sequence once and then do substrings
• Optimised longest ORF finding code (single pass 6 frame translation)
• Better ordering of data to improve access speed
• Many other optimisations where profiler identified slowness
Multi-threaded version of Samtools
• Modifications to samtools code by Nils Homer
• https://github.com/nh13/samtools
• Stops process being CPU bound when reading/writing
BAM files due to compression/decompression overhead
Performance comparison
RNASeq model building
(Sheep chr 26 refine_all)
CPU time
Perl: 48586 sec
741x
C: 65.5 sec
Memory
Perl: 2.6Gb
C: 500Mb
Run on: Intel Xeon X5650 @ 2.67GHz (12288 KB cache)
Sheep chr 1 1-62Mb RNASeq
models(72 sets of filtering params)
Sheep 1:1-50Mb RNASeq gene models
in 90 different samples
RNASeq pipeline in the cloud
Stephen Keenan in Ensembl is working to create a system
to run the Ensembl RNASeq pipeline in the cloud
Dynamic Cloud Architecture
StarCluster
master
User data
x 100 GBs
Control
Deploy
Storage: GlusterFS
Compute: SUN Grid Engine HPC Cluster
47
x TBs shared across all nodes
7+ GB/s data throughput




48
Sun Grid Engine
400 Job Slots
8 core / 8 Gb hosts
300 Gb GlusterFS
49
50
51
Pipelining tips
• Lots of jobs - tracking and process are important
• Use a pipeline system (there are lots to choose from eg.
eHive, vrpipe, Ensembl genebuild pipeline, Galaxy…)
• Make your pipeline modular so you can switch programs (eg.
Use different aligners) easily
• Use databases when possible
• Allow flexible use and querying of data
• Easier to check for correctness
• Use files when necessary
• Check your writes!
• We use a version of BWA with extra write checking
• Manage locations with a database
Pipelining tips continued
• Write your output at the end of a job, not incrementally
• much easier to tidy up
• Checkpoint your jobs if they have to be long
• Optimise the bottlenecks, not everything
• Look for ways to limit the amount of data which needs to
be put through computationally expensive algorithms
• Eg. Realignment of partially mapped reads cuts number of reads
which need to go into Exonerate substantially
• Write lots of checks to make sure the pipeline has run
correctly
• Visualise your data
Hardware related tips
• Good storage and network important
• Sanger uses Lustre on farm and NFS
• Other places use Isilon
• Gluster for cloud
• Decide how much memory you want per core
• what do you need cached?
• how big will your jobs run?
• Sanger have recently gone up to 8Gb per core
• Must shape your jobs to match the hardware
• Chunk your sequences
• Divide your genome up
People related tips
• Make friends with your systems team
• Document your process
• everyone runs it correctly
• Encourage suggestions for improving pipeline
performance
• See if someone else has already done optimisation of the
code you need.
Ensembl Acknowledgements
Ensembl Team
•
Funding
Genebuild
–
–
–
–
–
–
–
–
–
Bronwen Aken
Thibaut Hourlier
Daniel Barrell
Carlos Garcia Giron
Rishi Nag
Andreas Kahari
Fergal Martin
Daniel Murphy
Konstas Bellis
–
Simon White
•
•
Stephen Keenan
The entire Ensembl team
•
Sanger ISG team particularly Guy Coates and Peter
Clapham
Data
•
Pig, Chicken and Sheep genome consortia
•
Broad Institute
European Commission
Framework Programme 7

Ensembl Acknowledgements - The Francis Crick Institute

Transcript Ensembl Acknowledgements - The Francis Crick Institute

Directory