2013-01-BioUML

NGS data analyses
with BioUML
Fedor Kolpakov
Biosoft.Ru, Ltd.
Institute of Systems Biology, Ltd.
Novosibirsk, Russia
Agenda
• BioUML overview
• NGS tools
– quality control
– alignment tools
– annotation tools
– workflows
• Genome browser
• Archakov’s genome
• Ribosome profiling
• Live demonstration
BioUML overview
BioUML platform
• BioUML is an open source integrated platform for systems
biology. Its capabilities range from access to databases of
experimental data to tools for formalized description, visual
modeling and analysis of complex biological systems.
• Support for scripts (R, JavaScript) and workflows makes it a
powerful tool for the analysis of high-throughput data.
• A plug-in based architecture (using the Eclipse runtime from IBM)
allows new functionality to be added as plug-ins.
The BioUML platform consists of 3 parts:
• BioUML server – provides access to biological databases;
• BioUML workbench – a standalone application;
• BioUML web edition – a web interface based on AJAX
technology.
Main platforms for bioinformatics
and BioUML
• Taverna – standalone application, powerful workflows
• R/Bioconductor – scripts, statistics, plots
• Galaxy – workflows, web interface, collaborative research, genome browser
• BioClipse – Eclipse plug-in based architecture, chemoinformatics
• BioUML platform – standalone application, powerful workflows; scripts,
statistics, plots; web interface, collaborative research, genome browser;
Eclipse plug-in based architecture, chemoinformatics
Main platforms for bioinformatics
and BioUML
BioUML platform offers the features shown for Taverna (standalone application,
powerful workflows), R/Bioconductor (scripts, statistics, plots), Galaxy
(workflows, web interface, collaborative research, genome browser) and
BioClipse (Eclipse plug-in based architecture, chemoinformatics), and adds:
+ systems biology platform
• visual modelling
• simulation
• parameters fitting
• …
+ chat for on-line consultations
Market and platform
• Android market – Android
• AppStore – MacOS, iPod, iPhone
• Biostore – BioUML
BioUML ecosystem
Biostore and the BioUML platform connect three groups:
• Developers provide tools and databases
– plug-ins: methods, visualization, etc.
– databases
• Users use them
– subscriptions
– collaborative & reproducible research
• Experts provide services
– services for data analysis
– on-line consultations
NGS
- methods integrated into BioUML
(Bowtie, MACS, ChIPHorde, ChIPMunk, …)
- programs integrated into Galaxy
- R packages
- annotation of detected peaks (SNPs, sites, etc.)
- visualization
- workflows
- ChIP-seq
- RNA-seq
- assembly and annotation of a human genome (in progress)
- support for parallelization of external programs
as part of a workflow
- GTRD database (based on ChIP-seq data)
- dedicated servers
- Amazon EC2 – on request
- Biodatomics – 64 cores, 256 GB of memory.
Galaxy – analyses methods
Galaxy - workflow
Raw data preprocessing
• Track statistics: gathers various statistics about a track or FASTQ file
• Preprocess raw reads: removes reads that fail simple quality tests,
removes adapters, and trims low-quality bases from read ends
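The preprocessing steps listed above can be sketched in a few lines of Python. This is only an illustration, not the BioUML implementation: the function names, the Phred+33 encoding, the quality threshold of 20, the minimum length of 4 and the exact-match adapter search are all assumptions made for the example.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string (Phred+33 encoding is assumed)."""
    return [ord(ch) - offset for ch in quality_string]


def remove_adapter(seq, qual, adapter):
    """Cut the read at the first exact occurrence of the adapter, if any."""
    pos = seq.find(adapter)
    if pos == -1:
        return seq, qual
    return seq[:pos], qual[:pos]


def trim_low_quality_ends(seq, qual, min_quality=20):
    """Trim bases with quality below min_quality from both read ends."""
    scores = phred_scores(qual)
    start, end = 0, len(scores)
    while start < end and scores[start] < min_quality:
        start += 1
    while end > start and scores[end - 1] < min_quality:
        end -= 1
    return seq[start:end], qual[start:end]


def preprocess_read(seq, qual, adapter, min_quality=20, min_length=4):
    """Return the cleaned (seq, qual) pair, or None if the read is rejected."""
    seq, qual = remove_adapter(seq, qual, adapter)
    seq, qual = trim_low_quality_ends(seq, qual, min_quality)
    if len(seq) < min_length:  # hypothetical "simple quality test"
        return None
    return seq, qual
```

A read whose usable part is too short after trimming is dropped; everything else is passed on with its adapter and low-quality ends removed.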
Short read alignment:
Bowtie
- fast
- no indels
- used for ChIP-seq
Novoalign
- single-end and paired-end
- in nucleotide and color space
- handles indels
- finds globally optimal alignments
using the full Needleman-Wunsch
algorithm
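For readers unfamiliar with Needleman-Wunsch, the sketch below shows the basic dynamic-programming scheme for global alignment. It is a textbook version with a linear gap penalty and made-up scoring values (match +1, mismatch -1, gap -1); it does not reproduce Novoalign's scoring or implementation.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment of strings a and b; returns (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap  # a prefix aligned against gaps only
    for j in range(1, cols):
        score[0][j] = j * gap

    def substitution(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the matrix: each cell is the best of diagonal, up and left moves.
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(score[i - 1][j - 1] + substitution(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    aligned_a, aligned_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + substitution(i, j):
            aligned_a.append(a[i - 1]); aligned_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned_a.append(a[i - 1]); aligned_b.append("-")
            i -= 1
        else:
            aligned_a.append("-"); aligned_b.append(b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(aligned_a)), "".join(reversed(aligned_b))
```

Because the whole matrix is filled, the result is a global optimum under the chosen scoring, at a cost proportional to the product of the two sequence lengths.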
RNA-seq with TopHat and Cuff* tools
ChIP-seq
Bowtie for alignment
MACS for peak calling
ChIPMunk, IPS, MEME for motif discovery
Popular NGS toolboxes available:
GATK, Picard, SAMtools
An example: workflow for analyses of ChIP-Seq data
example: RNA-seq workflow
NGS data
quality control
2 examples:
RNA-seq data (rat, IPS)
genome data – Archakov’s genome
Track statistics (FastQC)
• Estimates the quality of raw or aligned reads, as the FastQC program does
http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
• All original FastQC processors are supported
• Works faster than FastQC
• Additional processor: Overrepresented prefixes
• The Overrepresented K-mers processor is more precise (it does not skip 80% of
sequences)
• Along with the HTML report, separate statistics tables are generated and
are available for further analysis
• Several reports can be merged into a composite report
• Like any BioUML analysis, it can become part of a workflow, script, etc.
• Tested on Archakov AP3 (raw reads: 5.9 Gb csfasta + 12.7 Gb qual),
analysis time: 36 min (all processors)
• Tested on Zakian db50 (raw reads: 6.5 Gb fastq),
analysis time: 7 min (all processors)
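To make two of these processors concrete, here is a minimal Python sketch of "quality per base" and k-mer counting on FASTQ data. It is not the BioUML or FastQC code: the function names are hypothetical, Phred+33 encoding and strict 4-line FASTQ records are assumed, and every sequence is counted (no sampling).

```python
import io
from collections import Counter


def read_fastq(handle):
    """Yield (sequence, quality) pairs from 4-line FASTQ records."""
    while True:
        header = handle.readline()
        if not header:
            return
        seq = handle.readline().strip()
        handle.readline()  # '+' separator line
        qual = handle.readline().strip()
        yield seq, qual


def mean_quality_per_base(records, offset=33):
    """Mean Phred quality at each read position (Phred+33 assumed)."""
    totals, counts = [], []
    for _, qual in records:
        for pos, ch in enumerate(qual):
            if pos == len(totals):  # first read that reaches this position
                totals.append(0)
                counts.append(0)
            totals[pos] += ord(ch) - offset
            counts[pos] += 1
    return [t / c for t, c in zip(totals, counts)]


def kmer_counts(records, k=5):
    """Count every k-mer in every sequence, without skipping any reads."""
    counts = Counter()
    for seq, _ in records:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


example = io.StringIO("@r1\nACGTA\n+\nIIII#\n@r2\nACGTAC\n+\n5555##\n")
records = list(read_fastq(example))
print(mean_quality_per_base(records))
print(kmer_counts(records).most_common(1))
```

Keeping the statistics as plain lists and counters mirrors the idea of exposing the tables for further analysis rather than only rendering a report.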
Track statistics launch
• Input data: BAM, FASTQ and SOLiD (colorspace) data are supported
• Whether reads should be aligned by the left or right side
• Switch off individual processors to save time
Track statistics results (Archakov AP3):
• Quality per base
• Quality per sequence
• Nucleotide content per base
• GC content per base
• GC content per sequence
• N content per base
• Duplicate sequences
• Overrepresented sequences and 5-mers
• Overrepresented prefixes
Track statistics results (Zakian db50):
• Quality per base
• Quality per sequence
• Nucleotide content per base
• GC content per base
• GC content per sequence
• N content per base
• Duplicate sequences
• Overrepresented sequences and 5-mers
Genome browser
Genome browser: main features
• uses AJAX and HTML5 <canvas> technologies
• interactive - dragging, semantic zoom
• track support:
– Ensembl
– DAS servers
– user-loaded BED/GFF/Wiggle files
DAS
The Distributed Annotation System (DAS) defines a communication
protocol used to exchange annotations on genomic or protein
sequences.
It is motivated by the idea that such annotations should not be provided
by single centralized databases, but should instead be spread over
multiple sites. Data distribution, performed by DAS servers, is separated
from visualization, which is done by DAS clients.
DAS is a client-server system in which a single client integrates
information from multiple servers. It allows a single machine to gather up
sequence annotation information from multiple distant web sites, collate
the information, and display it to the user in a single view.
DAS is heavily used in the genome bioinformatics community. Over the
last years we have also seen growing acceptance in the protein
sequence and structure communities.
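The client-side idea described above (one client collating annotations from several servers into a single view) can be illustrated with a small Python sketch. Real DAS servers are queried over HTTP and return XML; here they are replaced by in-memory stand-ins, and all names and coordinates are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    source: str
    chrom: str
    start: int
    end: int
    label: str


def make_source(name, features):
    """Build a stand-in 'server' that answers region queries from a fixed list."""
    def query(chrom, start, end):
        return [Feature(name, c, s, e, label)
                for (c, s, e, label) in features
                if c == chrom and s <= end and e >= start]  # overlap test
    return query


def collate(sources, chrom, start, end):
    """Ask every source for the region and merge the answers into one view."""
    merged = []
    for query in sources:
        merged.extend(query(chrom, start, end))
    return sorted(merged, key=lambda f: (f.start, f.end, f.source))


genes = make_source("genes", [("1", 100, 500, "geneA"), ("1", 900, 1200, "geneB")])
variants = make_source("variants", [("1", 150, 150, "variantX"), ("2", 10, 10, "variantY")])
for feature in collate([genes, variants], "1", 1, 600):
    print(feature.source, feature.start, feature.end, feature.label)
```

The sources know nothing about each other or about display; only the client combines and orders the features, which is the separation the protocol is built around.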
Genome browser
Two BAM tracks are compared with each other (example view of human NCBI37, chr. 1).
A profile showing the coverage is visible.
Genome browser
On zooming in, individual reads become visible. All information associated with
the selected read is displayed in the Info box.
Genome browser
At the most detailed scale, a graph of Phred qualities is displayed along with the
nucleotides that differ between the read and the reference sequence.
NGS data
Archakov’s genome
Preprocessing
1. Remove duplicates
• The purpose is to mitigate the effects of PCR amplification bias
introduced during library construction. Two read pairs are
considered duplicates if they align to the same genomic
position.
• >60% were removed as duplicates
• Alignments after this step: 213 531 460
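The duplicate criterion above can be sketched as follows. This is an illustration only: the record fields, the use of strand and mate position in the key, and the rule of keeping the pair with the highest summed base quality are assumptions, not a description of the tool that was actually run.

```python
from collections import namedtuple

# Hypothetical minimal description of an aligned read pair.
ReadPair = namedtuple("ReadPair", "name chrom pos mate_pos strand quality_sum")


def remove_duplicates(pairs):
    """Keep one read pair per genomic position (the best-quality one)."""
    best = {}
    for pair in pairs:
        key = (pair.chrom, pair.pos, pair.mate_pos, pair.strand)
        kept = best.get(key)
        if kept is None or pair.quality_sum > kept.quality_sum:
            best[key] = pair
    return list(best.values())


pairs = [
    ReadPair("a", "chr1", 100, 300, "+", 50),
    ReadPair("b", "chr1", 100, 300, "+", 70),  # same position as "a"
    ReadPair("c", "chr1", 200, 400, "+", 60),
]
print([p.name for p in remove_duplicates(pairs)])
```

Because the key depends on aligned coordinates, any step that shifts alignments can change which pairs count as duplicates.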
Preprocessing
2. Local realignment
Read mapping algorithms operate on each read independently.
Local realignment re-aligns reads so that the number of mismatching
bases is minimized across all the reads.
Preprocessing
3. Remove duplicates after realignment
• Realignment may change the genomic positions of
read pairs, so additional duplicates can be
identified after this step.
• 712 reads were removed (<0.00035%)
Preprocessing
4. Recalibration of base quality values
For each base in each read, various covariates are calculated (such as
reported quality score, cycle, dinucleotide, GC content).
These values are used to build a model that predicts sequencing
errors. The model is then applied to calculate an empirical base
quality score, which overwrites the Phred quality score
currently in the read.
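The idea of an empirical quality score can be shown with a deliberately simplified Python sketch. It bins bases by three covariates, counts mismatches against the reference in each bin, and converts the observed error rate back to the Phred scale. The function names, the choice of covariates, and the +1/+2 smoothing are assumptions for illustration; this is not the GATK model.

```python
import math
from collections import defaultdict


def build_error_model(observations):
    """observations: iterable of (reported_quality, cycle, dinucleotide, is_mismatch)."""
    counts = defaultdict(lambda: [0, 0])  # covariates -> [bases seen, mismatches]
    for reported_q, cycle, dinuc, is_mismatch in observations:
        entry = counts[(reported_q, cycle, dinuc)]
        entry[0] += 1
        entry[1] += int(is_mismatch)
    return counts


def empirical_quality(model, reported_q, cycle, dinuc):
    """Phred-scaled empirical quality for a covariate bin."""
    key = (reported_q, cycle, dinuc)
    if key not in model:
        return reported_q  # no evidence: keep the reported score
    total, mismatches = model[key]
    error_rate = (mismatches + 1) / (total + 2)  # smoothed estimate (assumption)
    return round(-10 * math.log10(error_rate))


# 998 bases reported as Q30 at cycle 5 after "AC", 9 of them mismatching.
observations = [(30, 5, "AC", i < 9) for i in range(998)]
model = build_error_model(observations)
print(empirical_quality(model, 30, 5, "AC"))
```

In this toy data the bases were reported as Q30 (one error in 1000) but mismatch about once in 100, so their recalibrated score is lower than the reported one.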
Genotyping
1. Call SNVs with the GATK 'Unified Genotyper'
2. Assign a well-calibrated probability to each variant call.
• Estimate the probability that an SNV is a true genetic variant rather than a
sequencing or data processing artifact, given the SNP call annotations provided by
'Unified Genotyper' (for example DepthOfCoverage, StrandBias, HaplotypeScore,
ReadPosRankSumTest).
o Variant Annotator – creates the set of "true variants" from the dbSNP, HapMap and
1000 Genomes databases.
o Variant Recalibrator – creates a Gaussian mixture model by looking at the annotation
values over a high-quality subset of the input call set ("true variants").
o Apply Variant Recalibration – applies the model parameters to each variant identified by
Unified Genotyper, calculating the log odds ratio of being a true variant versus being false
under the trained Gaussian mixture model.
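The log odds calculation in the last step can be illustrated with a small Python sketch. It assumes two already-trained mixtures with diagonal covariance, one for true variants and one for false calls, and scores a vector of annotation values against both. The model parameters below are invented, the fitting step (for example by expectation-maximization) is omitted, and none of this reproduces the GATK code.

```python
import math


def log_gaussian(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at point x."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


def log_mixture(x, components):
    """Log density of a mixture; components is a list of (weight, mean, var)."""
    logs = [math.log(w) + log_gaussian(x, m, v) for w, m, v in components]
    top = max(logs)  # log-sum-exp for numerical stability
    return top + math.log(sum(math.exp(value - top) for value in logs))


def log_odds(x, true_model, false_model):
    """Log odds of x being a true variant versus a false call."""
    return log_mixture(x, true_model) - log_mixture(x, false_model)


# Invented single-component models over two hypothetical annotations.
true_model = [(1.0, [0.0, 0.0], [1.0, 1.0])]
false_model = [(1.0, [3.0, 3.0], [1.0, 1.0])]
print(log_odds([0.0, 0.0], true_model, false_model))
print(log_odds([3.0, 3.0], true_model, false_model))
```

A positive value means the annotations look more like the "true variants" training set, a negative value that they look more like artifacts; a threshold on this score then gives the filter.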
Genotyping
3. Call indels with the GATK 'Unified Genotyper'
4. Assign a well-calibrated probability to each indel.
Similar to SNV calling, but only indels from 1000 Genomes are used as
"true variants".
Genotyping
5. Filter out low-quality variant calls.
• 1 783 656 SNVs
• 17 110 indels
6. Annotate identified variants relative to genes.
Genotyping
Affected genes
http://cloud-biotech.com/bioumlweb/#de=data/Collaboration/Dr.Archakov/Data/alignment/Ap1.bam-CleanedAlignment/Genotyping2/tmp/Raw-affected-annotated
Genotyping: potential loss of function
118 genes have mutations that potentially affect function
Mutation in the exon of MAP4K3
Gene ontology classification
Full table
Genome browser
Example of deletion and insertion presentation in genome browser
Ribosome profiling
Ingolia N.T. et al., Cell, 2011
Ribosome Profiling of Mouse Embryonic Stem Cells Reveals the Complexity and Dynamics of
Mammalian Proteomes
The paper presents:
- genome-wide profiling of ribosome positions (sequencing of
ribosome-protected mRNA fragments);
- translation elongation rates (pulse-chase strategy).
Analysis of the data revealed:
- thousands of strong translational pause sites;
- thousands of unannotated translation products, including:
- N-terminal extensions and truncations
- upstream open reading frames starting at both AUG and non-AUG
codons, whose translation changes after differentiation;
- highly translated short ORFs in the majority of annotated lincRNAs: short,
polycistronic ribosome-associated coding RNAs (sprcRNAs) that encode
small proteins.
These findings reveal an additional level of complexity in the mammalian
proteome.
Live demonstration