PowerPoint - Computer Sciences Dept.

Condor in Cryo-EM image processing
Weimin Wu, Wen Jiang
Department of Biological Sciences
Purdue University
04/30/2008
Cryo-EM: low-temperature electron microscopy
Image processing: obtaining a 3D reconstruction from 2D images.
Introduction:
Viral infections have been, and remain, one of the major threats to
human health. Viruses are large assemblies of proteins and nucleic
acids that rely on infecting hosts to complete their life cycle and
sustain their propagation. High-resolution 3-D structures of virus
particles will provide important insights into these processes and
support the development of effective prevention and treatment
strategies.
Recently we have demonstrated, in collaboration with researchers at
Baylor College of Medicine and MIT, the 3-D reconstruction of the
infectious bacterial virus Epsilon15 (ε15) at 4.5 Å resolution, which
allowed tracing of the polypeptide backbone of its major capsid
protein gp7 (Jiang et al., Nature 451(7182):1130-4, 2008).
For many of the tailed dsDNA viruses, for
example the bacterial viruses T7, T3 and ε15,
one of the 12 icosahedral 5-fold vertices is
occupied by a unique 12-fold portal protein
complex. This unique portal vertex is
responsible for the packaging of dsDNA
genome into the protein shell during assembly
and the ejection of the dsDNA genome out of
the virus and into the host cell during infection.
However, high-resolution structures of these
virus particles, especially of the non-icosahedrally
organized components such as the portal
complex, the tail and the encapsulated dsDNA
genome, are lacking.
I am working on projects of this kind, reconstructing
the virus without imposing any symmetry. We have
now obtained a sub-nanometre resolution result that
lets us visualize the secondary structure of the
portal, tail hub and tail spikes.
[Figure, panels A and B. Labels: capsid I, capsid II, ipDNA capsid, infectious phage. Scale bar: 100 nm.]
(A) Schematic diagram of the T7/T3 phage particle assembly and dsDNA
genome packaging pathway. Adapted from (Serwer, 2004). (B) A cryo-EM
micrograph of T3 phage showing particles representing each of the major
stages during assembly and genome packaging.
[Figure labels: tail hub, spike, portal, terminus, core, DNA rings]
Image processing is a critical step in generating the 3D structure of a
macromolecule from 2D images taken with the cryo-EM technique. This step
includes 2D alignment and 3D reconstruction, both of which need intensive
computing power. High performance computing (HPC) resources supported by
RCAC enable us to work on huge datasets to obtain high-resolution results
and therefore learn more details of biological systems.
Scientific needs:
Two major steps are involved in cryo-EM image processing. The first is 2D
alignment, which finds the orientation and center of each sample particle by
matching its image (a 2D projection of the particle) against the reference
projections. The second is 3D reconstruction, which generates the 3D map by
combining all the particles according to their orientation and center
information and averaging them.
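The matching idea behind the 2D alignment step can be illustrated with a toy sketch. It is not the software used in this work: it assumes tiny binary images stored as lists of lists, cyclic shifts only (no in-plane rotation), and normalized cross-correlation as the score. All names (`shift`, `score`, `align`, `ref_bar`, `ref_block`) are hypothetical.

```python
import math

def shift(img, dx, dy):
    """Cyclically shift a 2D list by dx columns and dy rows."""
    h, w = len(img), len(img[0])
    return [[img[(y - dy) % h][(x - dx) % w] for x in range(w)] for y in range(h)]

def score(a, b):
    """Normalized cross-correlation between two images of equal size."""
    dot = sum(pa * pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    norm_a = math.sqrt(sum(p * p for row in a for p in row))
    norm_b = math.sqrt(sum(p * p for row in b for p in row))
    return dot / (norm_a * norm_b)

def align(image, references):
    """Exhaustive search over references and shifts.

    Returns (reference index, dx, dy): the index stands in for the
    orientation, and (dx, dy) for the center of the particle.
    """
    h, w = len(image), len(image[0])
    best = None
    for i, ref in enumerate(references):
        for dy in range(h):
            for dx in range(w):
                s = score(image, shift(ref, dx, dy))
                if best is None or s > best[0]:
                    best = (s, i, dx, dy)
    return best[1:]

# Two hypothetical reference projections on a 4x4 grid.
ref_bar = [[1, 1, 1, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
ref_block = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]

# A "raw image": the block reference moved 1 pixel right and 2 pixels down.
image = shift(ref_block, 1, 2)
result = align(image, [ref_bar, ref_block])
print(result)  # (1, 1, 2)
```

Every image has to be scored against every reference, which is why the cost of this step grows with the product of the two counts.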
1 raw image vs 1 projection: 1 second
55,000 2D images vs 1,400 projections: 22K CPU hours
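The timing figures above can be checked with a short calculation. This sketch assumes every 2D image is compared once with every projection at 1 second per comparison; the function name is hypothetical. The result comes out a little under the rounded 22K CPU hours quoted on the slide.

```python
SECONDS_PER_COMPARISON = 1.0  # 1 raw image vs 1 projection
N_IMAGES = 55_000             # 2D images
N_PROJECTIONS = 1_400         # reference projections

def alignment_cpu_hours(n_images, n_projections, seconds_per_comparison):
    """CPU hours for one all-against-all alignment pass (assumed model)."""
    return n_images * n_projections * seconds_per_comparison / 3600.0

cpu_hours = alignment_cpu_hours(N_IMAGES, N_PROJECTIONS, SECONDS_PER_COMPARISON)
print(f"{cpu_hours:,.0f} CPU hours per alignment pass")  # 21,389 CPU hours per alignment pass
```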
GroEL serves as an example to show the 3D
reconstruction and the many iterations
needed for high resolution. For our
ε15 project, even though we started from
an intermediate-resolution map (7 Å),
more than 10 further iterations were
needed to reach 4.5 Å.
Features as a function of resolution, showing how to evaluate
the resolution qualitatively from a density map.
Condor Performance:
We are lucky at Purdue to have so many resources supported by RCAC;
otherwise our research would take forever. Listed here are the Condor jobs
we submitted and the CPU hours we used.
10/01/2006~10/01/2007   Condor Jobs   CPU hours
Group-jiang12           4,817,429*    9,465,944
*Each job took about half an hour.

08/17/2007~04/22/2008   Condor Jobs   CPU hours
wu49                    3,884,568*    4,046,914
*Each job took about one hour, due to a different algorithm and other
reasons.
[Figure: Running jobs on different Condor platforms vs. time. Y axis: number of running jobs (0 to 1,500). X axis: time in hours (0 to 70), with the job submitted at 10:00pm 10/15/2007. Series: TOTAL RUNNING JOBS, WINDOWS, LINUX 64-bit, LINUX 32-bit. Annotation: the peak around 8am 10/16/2007.]
Running jobs versus time. This was a long run of about 64 hours. Three major
peaks are clearly visible, and each falls in an overnight period. During the
daytime the number of running jobs drops sharply because the machines' owners
are using them. The peaks become successively smaller, which means our user
priority was getting lower. Now that it is the summer holiday, I can get more
than 3,000 nodes for my Condor jobs.
We tried to use all the platforms to run our Condor jobs. How
does performance compare across platforms?
Platform       Average        Average remote     Average remote    Average queue
               running time   running CPU time   wall clock time   wait time
LINUX 32-bit   2991.1 s       2331.1 s           2957.0 s          255.5 s
LINUX 64-bit   2516.6 s       1630.0 s           2457.0 s          329.1 s
WINDOWS        2929.2 s       2302.8 s           2881.6 s          356.5 s
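One way to read the timing table above is to compare platforms by ratio. The sketch below computes two derived quantities that are not reported in the original: the fraction of remote wall clock time spent as CPU time, and each platform's running time relative to LINUX 32-bit. The dictionary and function names are hypothetical; the numbers are copied from the table.

```python
# (running, remote CPU, remote wall clock, queue wait), all in seconds
TIMINGS = {
    "LINUX 32-bit": (2991.1, 2331.1, 2957.0, 255.5),
    "LINUX 64-bit": (2516.6, 1630.0, 2457.0, 329.1),
    "WINDOWS": (2929.2, 2302.8, 2881.6, 356.5),
}

def cpu_fraction(platform):
    """Remote CPU time divided by remote wall clock time."""
    _, cpu, wall, _ = TIMINGS[platform]
    return cpu / wall

def relative_running_time(platform, baseline="LINUX 32-bit"):
    """Average running time relative to the baseline platform."""
    return TIMINGS[platform][0] / TIMINGS[baseline][0]

for name in TIMINGS:
    print(f"{name:13s} CPU/wall = {cpu_fraction(name):.2f}, "
          f"running time vs 32-bit = {relative_running_time(name):.2f}")
```

On these numbers the LINUX 64-bit jobs finish in roughly 84% of the LINUX 32-bit running time.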
The LINUX 64-bit machines were not as fast as we expected. Why?
We checked the remote hosts that the Condor jobs in this test were sent to:
90% of the LINUX 64-bit machines were from ccl00.cse.nd.edu.
Remote host    Jobs    Average running time
*.nd.edu       5,449   2541.4 s
*.purdue.edu   560     2275.3 s
Condor jobs could run on nodes off campus, and the performance
was only slightly worse. This made us more confident about
seriously considering the TeraGrid; we have tried the TeraGrid,
but so far we have still used on-campus resources. Large file
transfers remain a problem, however, for example when the files
are larger than 700 MB.
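The remote host figures above can be turned into two ratios with a little arithmetic. This is only a sketch using the job counts and average running times from the host table; the variable names are hypothetical.

```python
# Jobs and average running time (seconds) per remote host domain
HOSTS = {
    "*.nd.edu": (5449, 2541.4),
    "*.purdue.edu": (560, 2275.3),
}

total_jobs = sum(jobs for jobs, _ in HOSTS.values())
nd_share = HOSTS["*.nd.edu"][0] / total_jobs
slowdown = HOSTS["*.nd.edu"][1] / HOSTS["*.purdue.edu"][1] - 1.0

print(f"Share of jobs run at *.nd.edu: {nd_share:.1%}")
print(f"Extra running time off campus: {slowdown:.1%}")
```

About 91% of these jobs ran at *.nd.edu, and they took about 12% longer on average than the jobs that ran at *.purdue.edu.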
High-quality alpha helices, beta sheets
and side chains, which enabled us to
do the modeling and obtain the
backbone structure.
[Figure label: with icosahedral symmetry]
Our problem/concern about Condor:
1. Operation: ideally we would submit Condor jobs from our
desktop and let Condor find the resources itself, but at
present we need to specify where the jobs go when using
the TeraGrid.
2. File transfer: when large files are transferred, the network
becomes the bottleneck, which can easily overload the head
node and crash it, especially when the files go off campus.
This is due to the large number of reads from the only copy
of a large dataset. It might be circumvented by building a
P2P client into Condor. In the 2D alignment step of our image
processing, each image is compared with all the reference
projections, and those projections may already have been sent
to neighboring computers running other Condor jobs, so a new
job could fetch the file from neighboring nodes instead. The
number of reads from the original copy would then drop
sharply, in theory to just a few, and the file transfer speed
would increase dramatically.
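The P2P suggestion in item 2 can be illustrated with an idealized model. It is not an existing Condor feature: the sketch assumes that in each round every machine already holding the file, the original copy included, sends it to exactly one machine that lacks it, so the number of copies doubles each round. Function names are hypothetical, and the 3,000-node figure is the one mentioned earlier for the summer holiday.

```python
def origin_reads_direct(n_nodes):
    """Every node reads the dataset from the single original copy."""
    return n_nodes

def origin_reads_with_peers(n_nodes):
    """Reads from the original copy under the idealized doubling model."""
    holders = 1  # only the original copy exists at the start
    reads = 0
    while holders - 1 < n_nodes:
        # Every current holder, the origin included, sends one copy this round.
        holders = min(n_nodes + 1, holders * 2)
        reads += 1  # the origin served exactly one copy in this round
    return reads

print(origin_reads_direct(3000), origin_reads_with_peers(3000))  # 3000 12
```

Under these assumptions the original copy is read about log2(N) times instead of N times. Real transfers would fall somewhere in between, depending on which neighbors actually hold the file.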
Acknowledgment:
Preston Smith
David Braun
Steve Wilson
Pia Mikeal
Bruce L. Fuller
Reference:
1. Jiang et al., Nature, Vol. 439, 2 February 2006, nature04487
2. Jiang et al., Nature, Vol. 451, 28 February 2008, nature06665