Presentation Title - Indiana University


Early Users: Metagenomics Sequence Analysis
Mason – an HP ProLiant DL580 G7 provided by NCGAS
A specific goal is to provide dedicated access to large-memory supercomputers, such as IU's new Mason system. Each Mason compute node has 512 GB of random access memory, critical for data-intensive science applications such as genome assembly.
What is the National Center for Genome Analysis Support?
• NCGAS is a national center dedicated to providing scientists easy and ready
access to the software and supercomputers necessary for the important work
of genomics research.
• Initially funded by the National Science Foundation Advances in Biological
Informatics (ABI) program, grant # 1062432
• Provides access to memory-rich supercomputers customized for genomics
studies, including Mason and other XSEDE systems.
• A Cyberinfrastructure Service Center affiliated with the Pervasive Technology
Institute at Indiana University (http://pti.iu.edu)
• Provides distributions of hardened versions of popular codes
• Particularly dedicated to genome assembly software such as:
• de Bruijn graph methods: SOAPdenovo, Velvet, ABySS
• consensus methods: Celera, Arachne 2
• For more information, see http://ncgas.org
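To make the de Bruijn approach named above concrete, here is a minimal, hedged sketch: it builds a graph of (k-1)-mer nodes from the reads' k-mers and walks it to recover a contig. The reads and k value are invented toy data, and the sketch assumes error-free reads forming a single unambiguous path, which production assemblers such as Velvet, ABySS, and SOAPdenovo of course do not assume.

```python
from collections import defaultdict

def de_bruijn_assemble(reads, k):
    """Toy de Bruijn assembly: build a graph whose nodes are (k-1)-mers,
    with an edge for each observed k-mer, then walk the single simple
    path it contains. Real assemblers handle sequencing errors, branches,
    repeats, and reverse complements; this sketch does not."""
    edges = defaultdict(list)      # prefix (k-1)-mer -> suffix (k-1)-mers
    indegree = defaultdict(int)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            left, right = kmer[:-1], kmer[1:]
            if right not in edges[left]:   # de-duplicate repeated k-mers
                edges[left].append(right)
                indegree[right] += 1
    # The contig starts at the node with no incoming edge.
    start = next(node for node in edges if indegree[node] == 0)
    contig, node = start, start
    while edges[node]:
        node = edges[node][0]
        contig += node[-1]                 # append the newly exposed base
    return contig

reads = ["ATGGCGT", "GGCGTGC", "GTGCAAT"]
print(de_bruijn_assemble(reads, 4))        # → ATGGCGTGCAAT
```

Note how the graph's size is driven by the number of distinct k-mers, not the genome length alone, which is one reason assembly is so memory-hungry on real data.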
IU's NCGAS partners include the Texas Advanced Computing Center (TACC) and the San Diego Supercomputer Center (SDSC). NCGAS will support software running on supercomputers at TACC and SDSC, as well as on other supercomputers available as part of XSEDE (the new NSF-funded Extreme Science and Engineering Discovery Environment). NCGAS will also further campus-based integration, known as "campus bridging."
• 16-node cluster
• 10 GbE interconnect
– Cisco Nexus 7018
– Compute nodes are oversubscribed 4:1
– This is the same switch that we use for DC and other 10G-connected equipment.
• Quad-socket nodes
– 8-core Xeon L7555, 1.87 GHz base frequency
– 32 cores per node
– 512 GB of memory per node!
• Rated at 3.383 TFLOPS (G-HPL benchmark)
NCGAS Sandbox Demo at Supercomputing 11
For Indiana University’s Supercomputing 11 research sandbox demo, NCGAS
implemented a biological application to simulate a sequence alignment and
SNP (single nucleotide polymorphism) identification pipeline (shown above).
The goal is to demonstrate that, with a network bridge between NCGAS
computing nodes at IU and a remote storage file system, we are able to
run a data-intensive pipeline without repetitive data file movement.
• STEP 1: data preprocessing, to evaluate and improve the quality of the input sequence
• STEP 2: sequence alignment to a known reference genome
• STEP 3: SNP detection, to scan the alignment result for new polymorphisms
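As a minimal illustration of STEP 3, the sketch below scans a pileup of already-aligned reads (the output of STEP 2) and reports positions where the read consensus disagrees with the reference base. The alignment format, depth, and frequency thresholds are invented for the example; they are not parameters from the NCGAS demo, which would use production tools rather than this toy.

```python
from collections import Counter

def call_snps(reference, alignments, min_depth=3, min_fraction=0.8):
    """Toy SNP detection: `alignments` is a list of (start, read) pairs of
    reads already placed on `reference`. A position is reported as a SNP
    when enough reads cover it and a large majority of them agree on a
    base different from the reference. Thresholds are illustrative."""
    pileup = {i: Counter() for i in range(len(reference))}
    for start, read in alignments:
        for offset, base in enumerate(read):
            pileup[start + offset][base] += 1
    snps = []
    for pos, counts in pileup.items():
        depth = sum(counts.values())
        if depth < min_depth:
            continue                       # too little evidence at this site
        base, n = counts.most_common(1)[0]
        if base != reference[pos] and n / depth >= min_fraction:
            snps.append((pos, reference[pos], base))
    return snps

# Four short reads tiling the reference; all carry A where the reference has T.
reference = "ACGTACGT"
alignments = [(0, "ACGA"), (1, "CGAA"), (2, "GAAC"), (3, "AACG")]
print(call_snps(reference, alignments))    # → [(3, 'T', 'A')]
```

A real pipeline performs the same scan across billions of aligned bases, which is why keeping the data near the compute nodes (rather than repeatedly moving files) matters.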
Early Users: Daphnia Population Genomics
Michael Lynch Lab (IU Bloomington Department of
Biology)
This project involves whole-genome shotgun sequences of over 20
additional diploid genomes, with genome sizes >200 megabases each.
• With each genome sequenced to over 30× coverage, the full project
involves both the mapping of reads to a reference genome and the de novo
assembly of each individual genome.
• The genome assembly of millions of small reads often requires
extensive memory, for which we once turned to Dash at SDSC. With Mason
now online at IU, we have been able to run our assemblies and analysis
programs here at IU.
Yuzhen Ye Lab (IU Bloomington School of Informatics)
Environmental sequencing
– Sampling DNA sequences directly from the environment
– Since the sequences consist of DNA fragments from hundreds
or even thousands of species, the analysis is far more difficult
than traditional sequence analysis that involves only one
species.
• Assembling metagenomic sequences and deriving genes from the
dataset
• Dynamic programming to optimally map consecutive contigs from
the assembly.
Since the number of contigs is enormous for most metagenomic
datasets, a large-memory computing system is required to run the
dynamic programming algorithm so that the task can be completed in
polynomial time.
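The slides do not spell out the exact recurrence used to "optimally map consecutive contigs," so the sketch below shows one common shape such a dynamic program takes: choosing a maximum-score set of non-overlapping contig placements along a reference (classic weighted interval scheduling). The placement tuples are invented toy data; the point is that the DP table scales with the number of candidate placements, which is enormous for metagenomic data.

```python
import bisect

def chain_contigs(placements):
    """Hedged sketch of contig chaining as weighted interval scheduling:
    given candidate placements (start, end, score) of contigs on a
    reference, pick a non-overlapping subset maximizing total score.
    Runs in O(n log n) after sorting; not the NCGAS formulation, which
    the slides do not describe in detail."""
    placements = sorted(placements, key=lambda p: p[1])    # sort by end
    ends = [p[1] for p in placements]
    best = [0] * (len(placements) + 1)   # best[i]: optimum over first i items
    for i, (start, end, score) in enumerate(placements, 1):
        # Index of the last placement that ends at or before `start`.
        j = bisect.bisect_right(ends, start, 0, i - 1)
        best[i] = max(best[i - 1], best[j] + score)        # skip vs. take
    return best[-1]

# Three candidate placements; the middle one overlaps both of the others.
print(chain_contigs([(0, 10, 5), (8, 20, 9), (12, 30, 7)]))   # → 12
```

With millions of contigs the `ends` and `best` arrays, plus the candidate placements themselves, must all be held in memory at once, which is the large-memory requirement the text describes.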
Early Users: Genome Assembly and Annotation
Michael Lynch Lab (IU Bloomington, Department of Biology)
• Assembles and annotates genomes in the Paramecium aurelia species
complex in order to eventually study the evolutionary fates of duplicate genes
after whole-genome duplication. This project has also been performing RNAseq
on each genome, which is currently used to aid genome annotation and
subsequently to detect expression differences between paralogs.
• The assembler used is based on an overlap-layout-consensus method instead
of a de Bruijn graph method (like some of the newer assemblers). It is more
memory-intensive, as it requires performing pairwise alignments between all
pairs of reads.
• The annotation of the genome assemblies involves programs such as GMAP,
GSNAP, PASA, and Augustus. To use these programs, we need to load in
millions of RNAseq and EST reads and map them back to the genome.
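The all-pairs comparison that makes the overlap-layout-consensus approach memory-intensive can be sketched in a few lines. The reads below are invented, and the sketch uses exact suffix-prefix matching; real OLC assemblers such as Celera use error-tolerant alignments, but the quadratic blow-up in pairs is the same.

```python
from itertools import permutations

def best_overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that exactly matches a prefix
    of `b`, or 0 if none of at least `min_len` exists. Exact matching
    only; production assemblers tolerate sequencing errors."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0

def overlap_graph(reads, min_len=3):
    """The all-pairs overlap step of OLC assembly: every ordered pair of
    reads is compared, so both the work and the size of the resulting
    overlap graph grow quadratically with the number of reads."""
    return {(a, b): n
            for a, b in permutations(reads, 2)
            if (n := best_overlap(a, b, min_len)) >= min_len}

reads = ["ATGGCGT", "GCGTGCA", "TGCAATT"]
print(overlap_graph(reads))
```

For three reads there are only six ordered pairs; for the millions of reads in a real project there are trillions, which is why the text says this method "requires performing pairwise alignments between all pairs of reads."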
Early Users: Genome Informatics for Animals and Plants
Genome Informatics Lab (IU Bloomington Department of
Biology)
• This project aims to find genes in animals and plants, using the vast
amounts of new gene information coming from next-generation
sequencing technology.
• These improvements are applied to newly deciphered genomes for an
environmental sentinel animal, the waterflea (Daphnia); the agricultural
pest insect the pea aphid; the evolutionarily interesting jewel wasp
(Nasonia); and the chocolate plant (Th. cacao), which will bring
genomics to the sustainable agriculture of cacao.
• Large-memory compute systems are needed for biological genome and
gene-transcript assembly because assembling genomic DNA or gene
RNA sequence reads (billions of fragments) into full genomic or gene
sequences requires a minimum of 128 GB of shared memory, or more
depending on the data set. These programs build graph matrices of
sequence alignments in memory.
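A back-of-envelope estimate suggests why 128 GB is a floor rather than a luxury for this kind of in-memory graph. All figures below (read count, read length, k-mer size, bytes per graph node) are illustrative assumptions, not measurements from the projects described here.

```python
def assembly_graph_memory_gb(n_reads, read_len, k=31, bytes_per_node=16):
    """Crude upper-bound memory estimate for an in-memory assembly graph:
    each read contributes read_len - k + 1 k-mer nodes, assumed (worst
    case) to be unshared. Every parameter here is a hypothetical
    illustration, not a measured NCGAS figure."""
    kmers = n_reads * (read_len - k + 1)
    return kmers * bytes_per_node / 1e9

# One billion 100 bp reads, 31-mers, 16 bytes of graph node per k-mer:
print(f"{assembly_graph_memory_gb(1_000_000_000, 100):.0f} GB")  # → 1120 GB
```

Deduplicating shared k-mers brings real usage far below this worst case, but it still easily exceeds the memory of a commodity node, consistent with the 128 GB minimum cited above and with Mason's 512 GB nodes.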
Early Users: Imputation of Genotypes and Sequence Alignment
Tatiana Foroud Lab (IU School of Medicine, Medical and Molecular
Genetics)
• Studies complex disorders using imputation of genotypes, typically for
genome-wide association studies, as well as sequence alignment and
post-processing of whole-genome and whole-exome sequencing.
• Requires analysis of markers in a genetic region (such as a
chromosome) in several hundred representative individuals genotyped
for the full reference panel of SNPs, with extrapolation of the inferred
haplotype structures.
• More memory allows the imputation algorithms to evaluate haplotypes
across much broader genomic regions, reducing or eliminating the need
to partition the chromosomes into segments. This increases the
accuracy and speed of genotype imputation, allowing for improved
evaluation of detailed within-study results, as well as communication
and collaboration with other researchers (including meta-analysis)
using the disease study results.
Thomas G. Doak ([email protected]), Le-Shin Wu, Craig A. Stewart, Robert Henschel, William K. Barnett
http://ncgas.org