Developing Accessible Application Software for
Individual de novo Genome Projects
Vince Forgetta, PhD Candidate
Ken Dewar PhD, Supervisor
Department of Human Genetics, McGill University
Montreal, Quebec, Canada
December 8th, 2011
Next-Gen Gap
“Unfortunately, the software and computer hardware demands on these analyses are not
much less than those of the large Genome Centers. From this perspective, the gap between
large-scale genome centers and individual investigators may seem to be growing, not
shrinking, as the next-generation platforms’ apparent promise of a ‘Genome Center in a
box’ may have only been half delivered, providing data without a full suite of tools.”
(Nature Methods 6, S2 - S5 (2009))
Bacterial genome in < 1 week for ~$3000
(Genome Assembly)+
Download Data
Learn *NIX
Install Software and Dependencies
Run Software … Wait? … Problems?
Three Common Methodologies in de novo Genome Analysis
1. Display and analysis of genome annotations
2. Quality assessment of a genome assembly
3. Comparison and mining of genomic data from public repositories
One or more of these methodologies were used to address needs in three specific
projects; the projects served as vehicles for developing software:
Project                             Software         Methodology
C. difficile 14 Genome Comparison   cgb              1. Genome Display
Multi-centre WGS of O. novo-ulmi    ContiGo          2. Assembly QA
E. fergusonii ECD-227               BLAST in Pivot   3. Data Mining
Assembly Quality Assessment
Assembly Analysis
DNA
Sequencing Centre
Researcher
Assembly
• Researchers should have easy access to their assembly so they can
assess its quality and perform simple analyses.
• Delays and limits on data access exist:
- Viewers must be installed and have specific software (e.g. Linux)
or hardware (e.g. RAM) requirements.
- Assembly data (multiple GB) must be downloaded.
Objective
• Develop a simple assembly viewer that
operates within a web-browser, allowing a
researcher to rapidly analyze and access their
data.
Method
Parser/Converter: Uses Python to parse, analyze, and convert
assembly data into web-accessible formats (HTML, JSON, JPG
images), which are stored on sequencing centre servers.
Interface: Uses a browser-based interface (HTML) to dynamically
access data on the servers (JavaScript). Incorporates pre-existing web
technologies (jQuery, Seadragon Deep Zoom AJAX).
Usage:
- after genome assembly, parser/converter is run on
sequencing center servers
- researcher accesses interface over the internet using a
modern web browser
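As a rough illustration of the parser/converter step described above, the sketch below reads contig sequences in FASTA format and emits per-contig statistics as JSON that a browser could fetch. The FASTA input, the function names (`parse_fasta`, `contigs_to_json`), and the JSON field names are assumptions made for this example; the slides do not specify ContiGo's actual input or output layout.

```python
import json


def parse_fasta(text):
    """Yield (name, sequence) pairs from FASTA-formatted text."""
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the identifier (first word after '>').
            fields = line[1:].split()
            name, chunks = (fields[0] if fields else ""), []
        elif name is not None:
            chunks.append(line.upper())
    if name is not None:
        yield name, "".join(chunks)


def contigs_to_json(fasta_text):
    """Summarize each contig (hypothetical field names) as a JSON string."""
    records = []
    for name, seq in parse_fasta(fasta_text):
        gc = seq.count("G") + seq.count("C")
        records.append({
            "name": name,
            "length": len(seq),
            "gc_percent": round(100.0 * gc / len(seq), 2) if seq else 0.0,
        })
    return json.dumps({"contigs": records}, indent=2)
```

Running a conversion like this once on the server means the browser only ever requests small, pre-computed files rather than the multi-GB assembly itself.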
Performance
Parser/Converter:
– Multiple platforms (Windows/OS X/Linux)
– Multi-processor support.
– Low memory usage (< 250 MB of memory per processor).
User interface:
– Client-side programming → decreased server load.
– Data is downloaded on demand → suits users with limited
bandwidth.
– Sole system requirement: a modern web browser (Firefox,
Opera, Google Chrome) → ease of installation.
– Low memory usage (peaks at ~250 MB).
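On-demand loading of a large contig image is possible because Deep Zoom style viewers (such as the Seadragon component named on the Method slide) fetch tiles from an image pyramid in which each level halves the resolution of the level above it. The sketch below computes that pyramid layout for an image of a given size. The 256-pixel tile size and the example dimensions are assumptions for illustration, not values taken from ContiGo, and tile overlap is ignored.

```python
import math


def deepzoom_levels(width, height, tile_size=256):
    """Return (level, width, height, columns, rows) tuples, smallest level first."""
    # The highest level holds the full-resolution image; level 0 is 1x1 pixel.
    max_level = math.ceil(math.log2(max(width, height)))
    levels = []
    for level in range(max_level + 1):
        scale = 2 ** (max_level - level)
        w = max(1, math.ceil(width / scale))
        h = max(1, math.ceil(height / scale))
        cols = math.ceil(w / tile_size)
        rows = math.ceil(h / tile_size)
        levels.append((level, w, h, cols, rows))
    return levels


if __name__ == "__main__":
    # Hypothetical 20000 x 1000 pixel rendering of a contig.
    for level, w, h, cols, rows in deepzoom_levels(20000, 1000):
        print(level, w, h, cols * rows)
```

A viewer showing a zoomed-out overview needs only the few tiles of a low level, which is why bandwidth use stays small until the user zooms in.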
The Interface
Assembly statistics, batch download of sequence and statistical data.
Table of contig/scaffold statistics:
• Sortable and filterable by column
• Access to contig sequence/quality and read sequences
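The slides do not list which assembly statistics the interface reports. As an example of the kind of summary such a table is typically built from, the sketch below computes contig count, total bases, largest contig, and N50 from a list of contig lengths. The function names and the choice of statistics are assumptions for illustration.

```python
def n50(lengths):
    """Return the N50: the largest length L such that contigs of
    length >= L together contain at least half of all bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0  # empty input


def assembly_summary(lengths):
    """Collect a few common assembly statistics into a dictionary."""
    return {
        "contigs": len(lengths),
        "total_bases": sum(lengths),
        "largest": max(lengths, default=0),
        "n50": n50(lengths),
    }
```

For example, `assembly_summary([80, 70, 50, 40, 30, 20])` sums to 290 bases, and the two largest contigs (80 + 70 = 150) already cover half of that, so the N50 is 70.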
Contig Assembly:
- Pan/Zoom
- Identify position, read names, mismatches
Dynamic Charts:
• toggle axis value
• identify points
• summarize regions
Demo
3. Data Mining
blip.codeplex.com
Microsoft Research Summer Internship
Microsoft Biology Foundation
Redmond, Washington, USA
Mentor - Simon Mercer
BLAST
NCBI
ACGTCACTGACTG
ACTAGCTAGCTAG
CTAGCATCGATCG
ATCGATCGATCGA
TCGACGTAACTAG
CACGACTGACTCT
Species,
Function, …
?
Local
Limitation
Scientist
>gi|301326298|ref|ZP_07219671.1| TIM-barrel protein, nifR3 family [Escherichia coli MS 78-1]
Length=321

 Score = 583.563 bits (1503), Expect = 8.65371E-165
 Identities = 280/281 (100%), Positives = 280/281 (100%), Gaps = 0/281 (0%)
 Frame = 0

Query  1    MMSSNPQVWESDKSRLRMVHIDEPGIRTVQIAGSDPKEMADAARINVESGAQIIDINMGC  60
            MMSSNPQVWESDKSRLRMVHIDEPGIRTVQIAGSDPKEMADAARINVESGAQIIDINMGC
Sbjct  41   MMSSNPQVWESDKSRLRMVHIDEPGIRTVQIAGSDPKEMADAARINVESGAQIIDINMGC  100

Query  61   PAKKVNRKLAGSALLQYPDVVKSILTEVVNAVDVPVTLKIRTGWAPEHRNCEEIAQLAED  120
            PAKKVNRKLAGSALLQYPDVVKSILTEVVN VDVPVTLKIRTGWAPEHRNCEEIAQLAED
Sbjct  101  PAKKVNRKLAGSALLQYPDVVKSILTEVVNTVDVPVTLKIRTGWAPEHRNCEEIAQLAED  160

Query  121  CGIQALTIHGRTRACLFNGEAEYDSIRAVKQKVSIPVIANGDITDPLKARAVLDYTGADA  180
            CGIQALTIHGRTRACLFNGEAEYDSIRAVKQKVSIPVIANGDITDPLKARAVLDYTGADA
Sbjct  161  CGIQALTIHGRTRACLFNGEAEYDSIRAVKQKVSIPVIANGDITDPLKARAVLDYTGADA  220

Query  181  LMIGRAAQGRPWIFREIQHYLDTGELLPPLPLAEVKRLLCAHVRELHDFYGPAKGYRIAR  240
            LMIGRAAQGRPWIFREIQHYLDTGELLPPLPLAEVKRLLCAHVRELHDFYGPAKGYRIAR
Sbjct  221  LMIGRAAQGRPWIFREIQHYLDTGELLPPLPLAEVKRLLCAHVRELHDFYGPAKGYRIAR  280

Query  241  KHVSWYLQEHAPNDQFRRTFNAIEDASEQLEALEAYFENFA  281
            KHVSWYLQEHAPNDQFRRTFNAIEDASEQLEALEAYFENFA
Sbjct  281  KHVSWYLQEHAPNDQFRRTFNAIEDASEQLEALEAYFENFA  321
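A text report like the one above is easy for a person to read one hit at a time but awkward to mine across thousands of genes. As a minimal sketch of turning the summary line into structured values, the code below extracts the Identities, Positives, and Gaps counts with a regular expression. The function names are hypothetical, and this is not how BL!P itself processes results; the slides do not describe that.

```python
import re

# Matches fields such as "Identities = 280/281" in a BLAST text report.
_STAT = re.compile(r"(Identities|Positives|Gaps)\s*=\s*(\d+)/(\d+)")


def parse_hit_stats(text):
    """Return {'identities': (matches, aligned_length), ...} for one hit."""
    stats = {}
    for label, numerator, denominator in _STAT.findall(text):
        stats[label.lower()] = (int(numerator), int(denominator))
    return stats


def percent(pair):
    """Convert a (numerator, denominator) pair to a percentage."""
    numerator, denominator = pair
    return 100.0 * numerator / denominator if denominator else 0.0


if __name__ == "__main__":
    line = ("Identities = 280/281 (100%), Positives = 280/281 (100%), "
            "Gaps = 0/281 (0%)")
    stats = parse_hit_stats(line)
    print(round(percent(stats["identities"]), 2))
```

Note that the report rounds 280/281 to 100%, whereas the exact value is about 99.64%; the single mismatch is the A/T difference in the second alignment block.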
~5000 genes
E. coli
Programmer
Blast in Pivot
[Figure: the six example query sequences (ACGTCACTGACTG, ACTAGCTAGCTAG, CTAGCATCGATCG, ATCGATCGATCGA, TCGACGTAACTAG, CACGACTGACTCT) repeated three times, with "???" and steps labelled 1, 2, 3]
E. coli ECD-227
Acknowledgement
Moussa Diarra, Heidi Rempel
Demo
Conclusions
• ContiGo: used by clients of the Genome Centre at McGill (release soon).
• BL!P: >500 downloads (blip.codeplex.com).
Acknowledgements
C. difficile
Ophiostoma novo-ulmi
Ken Dewar
Jan Kieleczawa
Andre Dascal
Michael Zianni
Matthew Oughton
Robert Steen
Joana Dias
Deborah Grove
Gary Leveque
Anoja Perera
Pascale Marquis
Robert Lyons Jr.
Corina Nagy
Sushmita Singh
Amelie Villeneuve
Doug Bintzler
Ivan Brukner, Mark Miller
Scottie Adams
Vivian Loo
Deborah Grove
Mike Mulvey
Gregory Grove
Dale Gerding
Robert Lyons Jr.
Maya Rupnik
Suzanne Genik
Elaine Mardis
Chris Wright
V. Magrini
Alvaro Hernandez
M. Hickenbotham
Sharon Bachman
K. Haub
Lorie Hetrick
C. Markovic
Sushmita Singh
J. Nelson
Nichole Peterson
Gary Leveque
Joana Dias
Clotilde Teiling
Tim Harkins
E. coli ECD-227
H. Rempel
Andrew Metcalfe
M. S. Diarra
BL!P/Microsoft
Simon Mercer
Xin-Yi Chua
Mauro Luigi Drago
Beatriz Diaz Acosta
Vivek Kumar
Bob Davidson
Mike Zyskowski
Xiaoji Chen
Bob Silverstein
Vikram Bapat
Jared Jackson
Wei Lu
The Pivot Team