memsys 16 - Data-Intensive Scalable Computing Laboratory

Download Report

Transcript memsys 16 - Data-Intensive Scalable Computing Laboratory

Concurrent Dynamic Memory Coalescing
on GoblinCore-64 Architecture
Xi Wang,
John, D. Leidel,
Yong Chen
[email protected], [email protected], [email protected]
Texas Tech University
Oct. 3. 2016
MEMSYS16
Overview
• Background
• Dynamic Memory Coalescing
• Concurrent Design
• Evaluations
• Conclusions
Background:GC-64
GoblinCore-64
• The GoblinCore-64 (GC-64), is the first
RISC-V Extension for data intensive
computing, such as graphs, sparse
matrices, (et.al.) [1] [2].
• We also utilize the Hybrid Memory Cube
as the main memory[3] and discard the
data cache.
Hybrid Memory Cube Device [3]
MEMSYS 16
3
Background:GC-64
`
`
MEMSYS 16
4
4
Background
• Cache structures provide good utilization for unit-stride memory
access patterns.
• However, this convenience comes at a price: die area and
power! To make it worse, cache structures provide little
assistance for random memory access patterns. Therefore,
the GC-64 gets rid of cache.
• The main purpose of Dynamic Memory Coalescing research is
to reduce the memory accesses to the main memory, thereby
providing equivalent performance on unit stride access or
random access patterns.
MEMSYS 16
5
Dynamic Memory Coalescing:Tree Structure
• Sorting Binary Tree structures are defined to store the
requests from processors. Each tree contains one root
node that is null
•
•
All nodes left of the top-level root are read operations
All nodes right of the top-level root are write operations
• Tree nodes are structures of:
•
•
•
•
Task Operation
Task ID
Address
{TP:TID:ADDRESS}
Root
R
R
W
R
W
W
Tree
MEMSYS 16
6
Dynamic Memory Coalescing :Tree Logic
Coalescing Tree Logic
• Memory coalescing is conducted when the tree reaches the
max read/write bytes (128), or exceeds time limits.
• Firstly, the "First Order Traversal" will be conducted. The
most left child will be found as the base address.
• Then the address and size of request data will be evaluated to
check whether the requests are consecutive, in the manner of
Left Child => Parent => Right Child.
• Afterwards, coalescing tree will be expired and generate the
HMC requests.
MEMSYS 16
7
Concurrent Design:
Architecture
MEMSYS 16
8
Concurrent Design: Algorithms
Following the design purpose of further coalescing the memory
accesses, 2 algorithms are designed to increase the possiblilty of
consecutive requests stored in the coalescing tree.
• Address Partitioned Algorithm (APA): designed to coalesce
adjacent memory accesses by partitioning the physical
address.
• Work Partitioned Algorithm (WPA): designed to coalesce same
type of the memory accesses by partitioning the requests type:
read/write, based on the APA.
MEMSYS 16
9
Concurrent Design: Sequential Example
6 HMC Requests
Suppose there are 8 requests are going to be
inserted, as shown in the table 1.
OP stands for the operation of the requests, RD
represents read request, WR represents write
request.
MEMSYS 16
10
Concurrent Design: APA Example
4 HMC Requests
Suppose there are 16 threads.
The size of each memory partition is 0x10000000.
Thread 0 => [0x00000000, 0x0FFFFFFF]
Thread 1 => [0x10000000, 0x1FFFFFFF]
MEMSYS 16
11
Concurrent Design: WPA Example
4 HMC Requests
Suppose there are 16 threads.
The size of each memory partition
is 0x10000000.
Thread 0,8 => [0x00000000, 0x0FFFFFFF]
Thread 1,9 => [0x10000000, 0x1FFFFFFF]
MEMSYS 16
12
Evaluation: Test Cases
• A port of RISC-V Linux kernel built by UC Berkeley [6],
is eventually chosen as the environment to run the 5 test
cases, including: HPCG, SSCA#2, stream, scatter and
gather benchmark [7][8][9].
• The efficiency is calculated by the following equation:
# of Reduced Requests
Efficiency=
# of Input Requests
MEMSYS 16
[0,1)
13
Evaluation: APA & WAPA Evaluations
WPA Efficiency
APA Efficiency
MEMSYS 16
14
Evaluation: Requests Distribution
15
Requests Distribution of HPCG with APA on 2 & 8 threads
Evaluation: Decreases of Total Request Cost
Total Request Cost =
Control (32 bytes) + Data
APA
For instance, one 16 bytes HMC
read request consists of:
WPA
- Request packet = 16 bytes
(control cost)
- Response packet = 32 bytes (16
control + 16 data)
Total Request Cost =
32 bytes(control) + 16 bytes(Data)
= 48 bytes
MEMSYS 16
16
Evaluation: Test Results
As a conclusion to all the test results shown above:
• Across all the test cases, we find an average increase in
efficiency of 55.17% and 55.94% through APA and WPA
approach, respectively.
• The average cost decreases across all APA tests and WPA
tests are 8.83% and 10.04% ,respectively.
• As such, the WPA approach is also considered to be more
efficient in this research.
MEMSYS 16
17
Conclusions & Future Work
• We construct the architecture and model for concurrent DMC
design and implement two different parallel algorithms: APA,
WPA;
• We also evaluate the concurrent DMC and prove the superiority
of memory coalescing unit, in the perspective of efficacy.
• The future direction of the research will focus on the extension
based on GC-64 and improve the concurrent DMC according to
the HMC specification 2.1.
MEMSYS 16
18
References
[1] GoblinCore-64, http://gc64.org/
[2] riscv.org. Regents of the University of California. Retrieved August 25, 2014.
[3] HMC Specification 2.0 from Micron, http://www.hybridmemorycube.org/
[4] Memory Wall, https://en.wikipedia.org/wiki/Random-access_memory#Memory_wall
[5] Yocto/OpenEmbedded RISC-V Port. http://riscv.org/software-tools/yocto/
[6] Linux/RISC-V, Linux kernel port of RISC-V. http://riscv.org/software-tools/linux/
[7] Dongarra J, Heroux M A, Luszczek P. A new metric for ranking high performance
computing systems[J]. National Science Review, 2016: nwv084.
[8] J. Kepner, D. P. Koester, and et al. HPCS SSCA#2 Graph Analysis Benchmark
Specifications v1.0, April 2005.
[9] McCalpin J D. A survey of memory bandwidth and machine balance in current high
performance computers[J]. IEEE TCCA Newsletter, 1995: 19-25.
MEMSYS 16
19
21
22
Concurrent Design: Components
Spike is the modified
RISC-V simulator which
has been extended to
support the memory
tracing.
Microcode is responsible
for the requests insertion
and the generation the
HMC requests.
Library contains the data
structure of the coalescing
tree, requests from RISCV cores and HMC
requests etc.
Components of DMC Unit
Master's Thesis Xi Wang
23
Concurrent Design: Tree Logic
Coalescing Tree logicReduce Requests
For the case that the address of
the requests in the tree are
consecutive , we will reduce it
into 1 HMC request regardless of
the request type.
As shown in the case a , b and c
on the right
Case a
Case b
Case c
MEMSYS 16
24
Concurrent Design: Tree Logic
Coalescing Tree logic----Reduce Requests
For the case that the requests data are not concecutive:
• If they are READ requests, we can still reduce it as one HMC
request if the read bytes are smaller than the 128 bytes;
(Max and Min read/write bytes are 128 and 16 respectively)
Case d
MEMSYS 16
25
Evaluation: Test Cases
• The first one is the High Performance Conjugate Gradient
Benchmark, also known as HPCG, that performs symmetric
Gauss-Seidel algorithms .
• The 2rd test case is a synthetic graph theory benchmark
called SSCA#2 which is developed by the High Productivity
Computer Systems (HPCS).
• The 3rd test case is the stream benchmark, which is designed
to measure sustainable memory bandwidth for contiguous,
long-vector memory accesses.
• The last 2 test cases are the scatter and gather, which is a type
of memory addressing that often arises when addressing
26
vectors in sparse linear algebra operations.
Evaluation: Decreases of Total Request Cost
Total Request Cost =
Control (32 bytes) + Data
APA
For instance, one 16 bytes HMC
read request consists of:
WPA
- Request packet = 16 bytes
(control cost)
- Response packet = 32 bytes (16
control + 16 data)
Total Request Cost =
32 bytes(control) + 16 bytes(Data)
= 48 bytes
MEMSYS 16
27