Presentation Title


Early Users:
Metagenomics Sequence Analysis
Yuzhen Ye Lab (IU Bloomington School of Informatics)
Environmental sequencing
– Sampling DNA sequences directly from the environment
– Because the sequences consist of DNA fragments from
hundreds or even thousands of species, the analysis is far
more difficult than traditional sequence analysis, which
involves only one species.
• Assembling metagenomic sequences and deriving genes from
the dataset
• Dynamic programming to optimally map consecutive contigs
from the assembly.
A Pervasive Technology Institute (pti.iu.edu) Center
NCGAS is a national service center funded by the National
Science Foundation’s Advances in Biological Informatics (ABI) to
provide scientists access to software and supercomputers for
genomics research.
http://ncgas.org
NCGAS provides
• Dedicated access to memory-rich supercomputers customized
for genomics studies, including Mason and other XSEDE systems
• Distributions of hardened versions of popular codes
• Support initially nucleated around genome assembly software
such as:
– de Bruijn graph methods: SOAPdenovo, Velvet, ABySS
– consensus methods: Celera, Arachne 2
• Expansion to other areas as users are recruited: now
moving into phylogenetics and metagenomics
• We’re especially interested in helping smaller institutions
• Funded only since Nov. 2011, NCGAS is actively
seeking users!
Current participating institutions:
• IU’s Mason: an HP ProLiant DL580 G7 system with 10GE
interconnect and quad-socket nodes (8-core Xeon L7555,
1.87 GHz base frequency; 32 cores and 512 GB of memory
per node), rated at 3.383 TFLOPS (G-HPL benchmark)
• Texas Advanced Computing Center (TACC)
• San Diego Supercomputer Center (SDSC); e.g. DASH
• NCGAS will support software running at IU, TACC and
SDSC, as well as on other supercomputers available as part
of XSEDE, with the goal of creating a single allocation
system that transparently accesses all appropriate
clusters
• NCGAS will further campus bridging integration
Because the number of contigs is enormous for most metagenomic
datasets, a large-memory computing system is required to
run the dynamic programming contig-mapping algorithm so that
the task can be completed in polynomial time.
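To give a feel for why contig counts drive memory needs, here is a rough back-of-the-envelope sketch. It assumes, purely for illustration, that the dynamic programming step keeps a dense table with one cell per pair of contigs; the slides do not describe the actual data structure, and the function name and the 4-byte cell size are hypothetical.

```python
def dp_table_gib(num_contigs: int, bytes_per_cell: int = 4) -> float:
    """Memory in GiB for a dense num_contigs x num_contigs table.

    Assumption (not from the slides): one fixed-size cell per contig pair.
    """
    return num_contigs * num_contigs * bytes_per_cell / 2**30


# Illustrative contig counts only; real datasets vary widely.
for n in (10_000, 100_000, 300_000):
    print(f"{n:>7} contigs -> {dp_table_gib(n):8.1f} GiB")
```

Under this assumption the table grows quadratically, so a few hundred thousand contigs would already approach the 512 GB available on a single Mason node.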
Genome Assembly and Annotation
Michael Lynch Lab (IU Bloomington, Department of Biology)
• Assembles and annotates genomes in the Paramecium aurelia species
complex in order to eventually study the evolutionary fates of duplicate
genes after whole-genome duplication. The project has also been
performing RNAseq on each genome; the data are currently used to aid
genome annotation and will subsequently be used to detect expression
differences between paralogs.
• The assembler used is based on an overlap-layout-consensus method
instead of a de Bruijn graph method (like some of the newer
assemblers). It is more memory intensive because it requires performing
pairwise alignments between all pairs of reads.
• The annotation of the genome assemblies involves programs such as
GMAP, GSNAP, PASA, and Augustus. To use these programs, we need
to load millions of RNAseq and EST reads and map them back to the
genome.
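The all-pairs comparison behind the overlap step can be illustrated with a minimal sketch. This is a naive exact suffix-prefix matcher written for illustration, not the assembler the lab uses; real overlap-layout-consensus assemblers use indexed, error-tolerant alignment. The function names, the toy reads, and the `min_len` threshold are all hypothetical.

```python
from itertools import permutations


def suffix_prefix_overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 when no exact overlap of at least min_len exists.
    """
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def all_pairs_overlaps(reads, min_len=3):
    """Compare every ordered pair of reads: n * (n - 1) comparisons."""
    overlaps = {}
    for i, j in permutations(range(len(reads)), 2):
        length = suffix_prefix_overlap(reads[i], reads[j], min_len)
        if length:
            overlaps[(i, j)] = length
    return overlaps


# Toy reads for illustration only.
reads = ["ATGGCGT", "GCGTACC", "TACCTTA"]
print(all_pairs_overlaps(reads))
```

Because the number of ordered pairs is n * (n - 1), doubling the number of reads roughly quadruples the comparisons, which is one reason large read sets push this style of assembly toward large-memory systems.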
Genome Informatics for Animals and Plants
Genome Informatics Lab (IU Bloomington Department of Biology)
• This project finds genes in animals and plants, using the vast
amounts of new gene information coming from next-generation
sequencing technology.
• The improved gene finding is applied to newly deciphered genomes
for an environmental sentinel animal, the water flea (Daphnia), the
agricultural pest insect pea aphid, the evolutionarily
interesting jewel wasp (Nasonia), and the chocolate plant (Theobroma
cacao), which will bring genomics to sustainable agriculture of
cacao.
• Large-memory compute systems are needed for biological
genome and gene transcript assembly because assembling
genomic DNA or gene RNA sequence reads (billions of
fragments) into full genomic or gene sequences requires a
minimum of 128 GB of shared memory, and more depending on the
data set. These programs build graph matrices of sequence
alignments in memory.
Imputation of Genotypes And Sequence Alignment
Tatiana Foroud Lab (IU School of Medicine, Medical and Molecular Genetics)
• Studies complex disorders using imputation of genotypes,
typically for genome-wide association studies, as well as sequence
alignment and post-processing of whole-genome and whole-exome
sequencing data.
• Requires analysis of markers in a genetic region (such as a
chromosome) in several hundred representative individuals
genotyped for the full reference panel of SNPs, with extrapolation
of the inferred haplotype structures.
• More memory allows the imputation algorithms to evaluate
haplotypes across much broader genomic regions, reducing or
eliminating the need to partition the chromosomes into segments.
This increases the accuracy and speed of genotype imputation,
allowing for improved evaluation of detailed within-study results as
well as communication and collaboration (including meta-analysis)
with other researchers using the disease study results.
Daphnia Population Genomics
Michael Lynch Lab (IU Bloomington Department of Biology)
This project involves the whole-genome shotgun sequences of
more than 20 diploid genomes with genome sizes >200
megabases each.
• With each genome sequenced to over 30x coverage, the full
project involves both the mapping of reads to a reference genome
and the de novo assembly of each individual genome.
• The genome assembly of millions of small reads often requires
excessive memory use, for which we once turned to DASH at
SDSC. With Mason now online at IU, we have been able to run our
assemblies and analysis programs here at IU.
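The scale implied by these figures can be estimated with the standard coverage relation, reads = coverage x genome size / read length. The sketch below uses the slide's round numbers (200 megabases, 30x, 20 genomes) as lower bounds; the 100 bp read length is an assumption chosen for illustration, since the slides do not state it.

```python
import math


def reads_needed(genome_size_bp: int, coverage: float, read_length_bp: int) -> int:
    """Number of reads required to reach a target average coverage."""
    return math.ceil(genome_size_bp * coverage / read_length_bp)


GENOME_SIZE_BP = 200_000_000  # ">200 megabases" from the slide, used as a lower bound
COVERAGE = 30                 # "over 30x" from the slide, used as a lower bound
READ_LENGTH_BP = 100          # assumption: not stated in the slides
NUM_GENOMES = 20              # "more than 20" from the slide, used as a lower bound

per_genome = reads_needed(GENOME_SIZE_BP, COVERAGE, READ_LENGTH_BP)
print(f"reads per genome: {per_genome:,}")
print(f"reads across {NUM_GENOMES} genomes: {per_genome * NUM_GENOMES:,}")
```

Shorter reads or deeper coverage raise these counts proportionally.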
Thomas G. Doak, Le-Shin Wu, Craig A. Stewart, Robert
Henschel, William K. Barnett
http://ncgas.org