Career Advancement Workshop
Workshop on FCP Accelerated NGS
Srinivas Aluru
Iowa State University
The Big Data Challenge
Then (2005)
ABI 3700
96 ~800 bp reads
76.8 × 10^3 bases
~$1 per kilobase
Now
Illumina HiSeq 2500
6 billion 100 bp reads
600 × 10^9 bases
~$1 per 200 million bases
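The Then/Now figures can be checked with a few lines of arithmetic. This is a small sketch that uses only the numbers on the slide; the variable names are illustrative, and the slide does not say what unit of work (run, day, instrument) the read counts refer to.

```python
# Figures copied from the slide; names are illustrative only.
then_reads, then_read_len = 96, 800              # ABI 3700 (2005), ~800 bp reads
now_reads, now_read_len = 6_000_000_000, 100     # Illumina HiSeq 2500

then_bases = then_reads * then_read_len          # 76,800 = 76.8 x 10^3
now_bases = now_reads * now_read_len             # 600,000,000,000 = 600 x 10^9

# Bases obtained per dollar, from "~$1 per kilobase" and
# "~$1 per 200 million bases".
then_bases_per_dollar = 1_000
now_bases_per_dollar = 200_000_000

throughput_gain = now_bases // then_bases                        # 7,812,500
cost_reduction = now_bases_per_dollar // then_bases_per_dollar   # 200,000

print(f"throughput gain: {throughput_gain:,}x")
print(f"cost reduction per base: {cost_reduction:,}x")
```

Under these figures, output grew by a factor of about 7.8 million while the cost per base fell by a factor of about 200,000.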
Many NGS Technologies
Why FCP?
• 1 NGS experiment = ~100 GB data
• Sequencing center a decade ago → small-budget individual
investigator today
• Many FCP technologies are inexpensive and
widely available
Genomes Galore – Big Data Analytics for High Throughput DNA Sequencing
Driving Grand Challenges
Identification of complex disease traits
Detection of biological threats
Microbial studies and human health
Plant genotype to phenotype
⁞
Research and Dissemination Approach
Vision and Goals
Empower community migration to HPC
Preserve ability to create new solutions
Target researchers & software developers
The Team
Srinivas Aluru (ISU)
Jaroslaw Zola (Rutgers)
Kunle Olukotun (Stanford)
Wu Feng (V. Tech)
Domain Experts:
Patrick Schnable (ISU)
Charles Sing (U. of Michigan)
NGS Application: Assembly
Reconstruct longer original sequences from the high-coverage
sampling of short fragments produced by NGS
[Figure: multiple copies of the same genome source; randomly fragment the copies; sequence; unordered fragments]
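The sampling process that assembly has to invert can be mimicked with a toy simulation. This is a minimal sketch, assuming error-free reads of one fixed length drawn uniformly at random from a single strand; `shotgun_fragments` and its parameters are hypothetical names, not part of any tool mentioned in the slides.

```python
import random

def shotgun_fragments(genome, read_len, coverage, seed=0):
    """Sample fixed-length reads uniformly at random from `genome`.

    Hypothetical helper: real NGS reads also carry errors, vary in
    length, and come from both strands; none of that is modeled here.
    """
    rng = random.Random(seed)
    # Coverage = total sampled bases / genome length.
    n_reads = (coverage * len(genome)) // read_len
    starts = [rng.randrange(len(genome) - read_len + 1) for _ in range(n_reads)]
    reads = [genome[s:s + read_len] for s in starts]
    rng.shuffle(reads)  # the assembler receives fragments in no useful order
    return reads

genome = "ACGTTGCATGCCGATAGCTTAGGCTAACGT"  # made-up 30-base sequence
reads = shotgun_fragments(genome, read_len=8, coverage=10)
print(len(reads), "reads, e.g.", reads[:3])
```

Assembly starts from a list like `reads`, with no record of where each fragment came from.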
NGS Application: Assembly
resequencing → genome mapping
de novo sequencing → genome assembly
gene expression analysis → transcriptome assembly
metagenomic sampling → metagenomic clustering and/or assembly
Graph Abstractions for Assembly
• Overlap graphs
– node: an NGS read
– edge: suffix-prefix alignment between a pair of
reads
• De Bruijn graphs
– node: a kmer from an NGS read
– edge: length (k-1) suffix-prefix match between
two kmers
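Both abstractions can be built from a handful of reads in a few lines. The following is a minimal sketch, assuming exact (error-free) matches, distinct reads, and a De Bruijn edge joining two kmers whose (k-1) suffix and prefix agree; the function names are illustrative, and the quadratic all-pairs overlap step is for exposition only.

```python
from collections import defaultdict
from itertools import permutations

def suffix_prefix_overlap(a, b, min_len):
    """Longest exact suffix of `a` equal to a prefix of `b` (0 if < min_len).

    Assumes min_len >= 1.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0

def overlap_graph(reads, min_len):
    """Nodes are reads; maps each edge (a, b) to its overlap length."""
    edges = {}
    for a, b in permutations(reads, 2):
        n = suffix_prefix_overlap(a, b, min_len)
        if n:
            edges[(a, b)] = n
    return edges

def de_bruijn_graph(reads, k):
    """Nodes are kmers; u -> v whenever u[1:] == v[:-1]."""
    kmers = {r[i:i + k] for r in reads for i in range(len(r) - k + 1)}
    by_prefix = defaultdict(set)  # index kmers by their (k-1)-prefix
    for v in kmers:
        by_prefix[v[:-1]].add(v)
    return {u: sorted(by_prefix.get(u[1:], ())) for u in sorted(kmers)}

reads = ["ACGTT", "GTTGC", "TGCAT"]  # made-up example reads
print(overlap_graph(reads, min_len=2))
print(de_bruijn_graph(reads, k=3))
```

The overlap graph keeps one node per read, so it grows with the number of reads, while the De Bruijn graph grows with the number of distinct kmers.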
Graph Operations for Assembly
• Graph construction from reads
• Collapsing chains
• Features in local neighborhood to identify
errors
• Path walking subject to distance constraints
on pairs of edges
• Operations on multiple assembly graphs, or
multiple genomes in a combined graph
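Of the operations listed above, collapsing chains is the easiest to show in a few lines. This is a minimal serial sketch, assuming a kmer graph stored as a dictionary from each kmer to its list of successors, with every successor also present as a key; `collapse_chains` is a hypothetical name, and isolated cycles (no branching node) are skipped for brevity.

```python
def collapse_chains(succ):
    """Merge maximal unbranched paths of a kmer graph into single strings.

    `succ` maps each kmer to a list of successor kmers (overlap k-1).
    Every successor must also be a key of `succ`.
    """
    pred = {u: [] for u in succ}
    for u, vs in succ.items():
        for v in vs:
            pred[v].append(u)

    def is_chain_start(u):
        # u is interior to a chain only if it has exactly one predecessor
        # and that predecessor has exactly one successor.
        return not (len(pred[u]) == 1 and len(succ[pred[u][0]]) == 1)

    contigs = []
    for u in succ:
        if not is_chain_start(u):
            continue
        path = u
        # Extend while the next step is unambiguous in both directions.
        while len(succ[u]) == 1 and len(pred[succ[u][0]]) == 1:
            u = succ[u][0]
            path += u[-1]  # each step contributes one new base
        contigs.append(path)
    return contigs

# Made-up graph with one branch after CGT.
graph = {
    "ACG": ["CGT"],
    "CGT": ["GTA", "GTC"],
    "GTA": ["TAA"],
    "TAA": [],
    "GTC": ["TCC"],
    "TCC": [],
}
print(collapse_chains(graph))
```

Each returned string stands for one chain of kmers, so the collapsed graph has one node per chain rather than one per kmer.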
NGS Error Correction
• Hamming/Edit distance graphs
– Node: a kmer in an NGS read
– Edge: joins two kmers within a small Hamming/edit
distance
• Graph operations needed
– Concurrent access to many nodes for neighbor
queries
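A neighbor query in a Hamming graph can be answered without comparing all pairs of kmers. The sketch below assumes "short distance" means Hamming distance exactly 1 and a DNA alphabet of A, C, G, T; it enumerates the 3k single-base variants of a kmer and looks each one up in a set. The function names are illustrative, and edit distance is not covered.

```python
def hamming_neighbors(kmer, kmer_set, alphabet="ACGT"):
    """Kmers in `kmer_set` at Hamming distance exactly 1 from `kmer`."""
    found = []
    for i, base in enumerate(kmer):
        for alt in alphabet:
            if alt != base:
                candidate = kmer[:i] + alt + kmer[i + 1:]
                if candidate in kmer_set:  # set lookup, no pairwise scan
                    found.append(candidate)
    return found

def hamming_graph(reads, k):
    """Nodes are kmers from the reads; edges join kmers one mismatch apart."""
    kmers = {r[i:i + k] for r in reads for i in range(len(r) - k + 1)}
    return {u: hamming_neighbors(u, kmers) for u in sorted(kmers)}

# Two made-up reads that differ at a single base.
reads = ["ACGTA", "ACCTA"]
print(hamming_graph(reads, k=4))
```

Each query only reads the shared kmer set, so queries for different nodes are independent of one another, which is what makes concurrent access to many nodes plausible.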