Methods (cont.) - CSIE, NCKU


CUDASW++: optimizing Smith-Waterman sequence database searches for CUDA-enabled graphics processing units
Authors: Yongchao Liu, Douglas L. Maskell, Bertil Schmidt
Publisher: BMC Research Notes, 2009
Speaker: De Yu Chen
Date: 2011/4/20
Outline
Introduction
Smith-Waterman Algorithm
CUDA Programming Model
Methods
Results and discussion
Introduction
In this paper, the compute power of CUDA-enabled GPUs is further explored to
accelerate SW sequence database searches.
Two versions of CUDASW++ are implemented: a single-GPU version and a
multi-GPU version. The CUDASW++ implementations deliver better performance
for protein sequence database searches than:
(1) SWPS3: PlayStation 3, 2008
(2) CBESW: PlayStation 3, 2008
(3) SW-CUDA: CUDA-enabled GPU, 2008
(4) NCBI-BLAST: the Basic Local Alignment Search Tool program, designed by the
National Center for Biotechnology Information, USA, 1997
Smith-Waterman Algorithm
CUDA Programming Model
CUDA execution model
[Figure: the CPU (host) launches kernels that execute on the GPU (device)]
CUDA Programming Model (cont.)
CUDA hardware model
CUDA Programming Model (cont.)
※ Shared memory access patterns
[Figure: a shared array declared as __shared__ int data[32], elements 0-31 distributed across the shared-memory banks]
CUDA Programming Model (cont.)
[Figure: an FPGA implementation versus a GPU implementation, using a forward array and a backward array]
Methods
Considering the optimal local alignment of a query sequence and a subject
sequence as one task, we have investigated two approaches for parallelizing
sequence database searches using CUDA.
Inter-task parallelization
Each task is assigned to exactly one thread and dimBlock tasks are performed
in parallel by different threads in a thread block.
[Figure: alignment matrix with the subject sequence along the horizontal axis and the query sequence along the vertical axis]
Methods (cont.)
Intra-task parallelization
Each task is assigned to one thread block, and all dimBlock threads in the
thread block cooperate to perform the task in parallel, exploiting the
independence of the cells on each minor (anti-)diagonal.
[Figure: alignment matrix of query versus subject, swept by the threads of a block along the minor diagonals]
Methods (cont.)
Inter-task parallelization occupies more device memory but achieves better
performance than intra-task parallelization. Intra-task parallelization, on the
other hand, occupies significantly less device memory and can therefore support
longer query/subject sequences.
Our implementation therefore uses two stages: the first stage exploits inter-task
parallelization and the second intra-task parallelization. Alignments of a query
with subject sequences of length less than or equal to the threshold are
performed in the first stage, in order to maximize performance; alignments with
subject sequences of length greater than the threshold are carried out in the
second stage. In our implementation, the threshold is set to 3,072.
Methods (cont.)
※ Device memory access patterns for coalescing:
(1) Placing matrix elements into linear order
[Figure: a 4×4 matrix M with elements 1-16, flattened row by row into a linear array 1…16]
Methods (cont.)
(2) A coalesced access pattern
[Figure: threads T(0)-T(3) read matrix M along the access direction; each load iteration reads four consecutive elements, e.g. elements 1-4 in load iteration 1]
Methods (cont.)
(3) An uncoalesced access pattern
[Figure: threads T(0)-T(3) each walk their own row of matrix M, so the addresses touched in one load iteration are far apart, e.g. elements 1, 5, 9, 13 in load iteration 1]
Methods (cont.)
Our implementation uses three techniques to improve performance:
(I) Coalesced subject sequence arrangement.
(II) Coalesced global memory access.
(III) Cell block division method.
I. Coalesced subject sequence arrangement
Methods (cont.)
II. Coalesced global memory access
Methods (cont.)
III. Cell block division method
To maximize performance and to reduce the bandwidth demand of global memory,
we propose a cell block division method for the inter-task parallelization, where
the alignment matrix is divided into cell blocks of equal size.
[Figure: the alignment matrices between the query sequences and the subject sequences, divided into a grid of n×n cell blocks]
Results and discussion
To remove the dependency on the query sequences and the databases used for
the different tests, Cell Updates Per Second (CUPS) is a commonly used
performance measure in bioinformatics.
Given a query sequence Q and a database D, the GCUPS (billion cell updates
per second) value is calculated by:
GCUPS = |Q| × |D| / (t × 10^9)
where |Q| is the total number of symbols in the query sequence, |D| is the
total number of symbols in the database, and t is the runtime in seconds.
In this paper, for the single-GPU version, the runtime t includes the transfer
time of the query sequences from host to GPU, the calculation time of the SW
algorithm, and the transfer-back time of the scores.
Results and discussion (cont.)
(1) Database: Swiss-Prot release 56.6
(2) Number of query sequences: 25
(3) Query length: 144 ~ 5,478
(4) Single-GPU: NVIDIA GeForce GTX 280 (30 SMs, 240 cores, 1 GB RAM)
(5) Multi-GPU: NVIDIA GeForce GTX 295 (60 SMs, 480 cores, 1.8 GB RAM)
Results and discussion (cont.)
Results and discussion (cont.)
(1) Number of query sequences: 1
(2) Number of database sequences: 30
(3) Database length: 256 ~ 5,000
(4) Single-GPU: NVIDIA Tesla C1060 (30 SMs, 240 cores, 4 GB RAM)
(5) CPU: Intel E6420 1.8GHz

Execution time T (s):

Database Length | CUDASW++ | CPU SW | Our implementation
          128   |   0.007  |  0.02  |  0.002
          256   |   0.017  |  0.082 |  0.008
          512   |   0.058  |  0.277 |  0.029
         1024   |   0.223  |  1.25  |  0.113
         1500   |   0.476  |  2.81  |  0.328
         2048   |   0.88   |  5.39  |  0.437
         3500   |   1.45   | 10.86  |  0.83