
2016/4/11
Silicon Operating System for
Large Scale Heterogeneous Cores
and its FPGA Implementation
SoC CAD
黃翔 Huang, Xiang
電機系, Department of Electrical Engineering
國立成功大學, National Cheng Kung University
Tainan, Taiwan, R.O.C
(06)2757575 ext. 62400, ext. 2825, Office: Chi-Mei Building (奇美樓), 6F, 95602
Email: [email protected]
Website address: http://j92a21b.ee.ncku.edu.tw/broad/index.html
NCKU
Abstract (1/3)

Grand challenge applications have a strong hunger for high
performance supercomputing clusters to satisfy their requirements.

A competent node architecture in supercomputing clusters is critical
to meeting the requirements of varied, computationally
demanding applications.

This accentuates the need for heterogeneous multicore node architectures in
supercomputing clusters, paving the way for the novel concept of
Simultaneous Multiple Application (SMAPP) execution with non space-time sharing,
that is, without conventional space or time sharing.
Huang, Xiang, 黃翔
SoC & ASIC Lab
Abstract (2/3)

The OS is the other side of the coin in attaining exa-flop performance in
supercomputing clusters.

Because conventional OSs are software driven, their performance becomes a
bottleneck: they must handle the complexities of mapping and scheduling
different applications in parallel across the underlying nodes.

In this context, it is preferable to make the kernel of the OS completely
hardware based.

Further, simultaneous multiple application execution with non space-time
sharing calls for a parallel, hierarchy-based multi-host system.

Hence the Silicon Operating System [SILICOS], a hardware design of an OS
for supercomputing clusters that meets these demands, was developed at
Waran Research Foundation [WARFT].
Abstract (3/3)

This thesis analyses the architecture and design of SILICOS in
greater depth.

The SILICOS architecture is integrated with the Warft India Many
Core [WIMAC] simulator, a clock-driven, cycle-accurate simulator.
1. Origin and History (1/10)
The execution of Simultaneous Multiple Application (SMAPP) with non
space-time sharing will be a major step towards attaining
exa-flop computing.
• Some positives of SMAPP are:
  - Enhanced resource utilization, due to a large-scale increase in the number of
    independent instructions executed as the number of applications and their
    problem sizes increase.
  - Cost effectiveness, since multiple applications run in a single cluster.
  - Elimination of conventional space sharing and time sharing, leading to
    increased performance.
• SMAPP is supported by heterogeneous multicore node architectures based on the
  CUBEMACH (CUstom Built hEterogeneous MultiCore ArCHitectures) design paradigm [2].
• The CUBEMACH design paradigm achieves low power yet high performance.

1. Origin and History (2/10)
Figure 1: Concept of SMAPP
1. CUBEMACH (3/10)

The CUBEMACH design paradigm is aimed at creating
high-performance, low-power and cost-effective heterogeneous
multicore architectures capable of executing a wide range of
applications without space or time sharing [1].

The use of hardwired Algorithm Level Functional Units (ALFUs) [3]
and their corresponding backbone instruction set architecture, also
called the Algorithm Level Instruction Set Architecture (ALISA),
increases performance because far fewer instructions are generated
and hence far fewer memory fetches are needed.

1. Algorithm Level Functional Unit (4/10)
Why ALFU and why not ALU?
• ALFUs handle higher-order computations by processing blocks of data in a
  single operation, whereas a set of ALUs needs many operations to execute the
  same computations.
• 1 ALFU instruction = several ALU instructions.
• ALFU-based cores have been shown to offer better performance at reduced power
  compared to ALU-based cores [3].

ALISA is a superset of other instruction sets, such as vector
instructions, CISC and VLIW, which are used in various multicore/many-core processors.
• A single ALISA instruction encompasses the data dependencies
  associated with several equivalent ALU instructions and helps
  minimize the number of cache misses.
• Parallel issue of ALISA instructions poses a major challenge to
  compilers and cannot be handled by a purely software-based
  compiler; hence we have resorted to a hardware-based compiler.
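
To make the "1 ALFU instruction = several ALU instructions" point concrete, the sketch below counts the scalar multiply and add instructions needed for an n x n matrix product and compares that with a single block-level instruction. The matrix-multiply ALFU and both function names are hypothetical illustrations, not taken from the CUBEMACH design, and loads, stores and loop overhead are ignored.

```python
def scalar_alu_ops(n: int) -> int:
    """Scalar ALU instructions (multiplies plus adds) for an n x n matrix product."""
    multiplies = n * n * n        # one multiply per (row, column, k) triple
    additions = n * n * (n - 1)   # n - 1 adds to accumulate each output element
    return multiplies + additions


def alfu_ops(n: int) -> int:
    """Hypothetical matrix-multiply ALFU: the whole block is one instruction."""
    return 1


for n in (2, 4, 8):
    print(f"n={n}: ALU instructions={scalar_alu_ops(n)}, ALFU instructions={alfu_ops(n)}")
```

Under these assumptions the gap grows cubically with the block size, which is the intuition behind the reduced instruction generation and memory fetches claimed for ALFU-based cores.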

1. Customizable Compiler On Silicon (5/10)

The Compiler-On-Silicon (COS) [4] is a hardware-based
compiler that is easily customized to suit different CUBEMACH
architectures for different classes of applications.

Compiler-On-Silicon is made up of a two-stage hierarchy:
• The Primary Compiler On Silicon.
• The Secondary Compiler On Silicon.

The hardware-based dependency analyzer in COS is the key to
increasing the rate of instruction generation.
1. Customizable Compiler On Silicon (6/10)
Figure 3: Hierarchical Architecture of Compiler on Silicon
1. On Core Network Architecture (7/10)

The CUBEMACH architecture uses a novel, cost-effective on-chip
network called the On Core Network (OCN).

The OCN is hierarchical: a Sub-Local Router serves a group of ALFUs
(a population), and a Local Router handles communication across populations.

• Populations of ALFUs together form a core, and global routers are used to
  establish communication across them.
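
The router hierarchy can be sketched as a simple decision on source and destination addresses. This is only an illustration: the three-field address and the `router_level` function are hypothetical, and the sketch assumes that global routers are the level used when source and destination lie in different cores.

```python
from typing import NamedTuple


class AlfuAddress(NamedTuple):
    """Hypothetical address of an ALFU: core, population within core, ALFU within population."""
    core: int
    population: int
    alfu: int


def router_level(src: AlfuAddress, dst: AlfuAddress) -> str:
    """Return the highest OCN router level a message must pass through."""
    if src.core != dst.core:
        return "global"      # assumption: global routers link different cores
    if src.population != dst.population:
        return "local"       # across populations inside one core
    return "sub-local"       # inside a single population


print(router_level(AlfuAddress(0, 1, 3), AlfuAddress(0, 1, 5)))
print(router_level(AlfuAddress(0, 1, 3), AlfuAddress(0, 2, 0)))
print(router_level(AlfuAddress(0, 1, 3), AlfuAddress(1, 0, 0)))
```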
1. On Core Network Architecture (8/10)
Figure 4: Hierarchical OCN Architecture of single core

1. Silicon Operating System (9/10)

The OS of current-day supercomputers is managed by a stripped-down kernel
present in the nodes.

• Core OS functionalities such as process scheduling, memory management,
  I/O handling and exception handling are handled by this stripped-down kernel.

This level of operation at the cluster suffices for the parallel
execution of applications.

• In the case of SMAPP with non space-time sharing, however, the communication
  complexity involved is huge and needs to be managed by efficient mapping strategies.

The hardware design of an OS for supercomputing clusters that meets
these demands, known as the Silicon Operating System (SILICOS),
was developed at WARFT [1].
2. Overview of Linux Kernels (1/8)

The Linux Kernel abstracts and mediates access to all hardware
resources including the CPU.

One important aspect of the Linux kernel is its support for multitasking.
• Each process can act as if it were alone in the system, with exclusive memory
  access and use of other hardware.
• The kernel provides this facility by running the processes
  concurrently, giving each process an equal share of the hardware resources,
  and maintaining inter-process security.


2. Overview of Linux Kernels (2/8)
The Linux kernel, as described by Ivan T. Bowman [8], is composed of
five main subsystems:
• the Process Scheduler (SCHED)
• the Memory Manager (MM)
• the Virtual File System (VFS)
• the Network Interface (NET)
• the Inter-Process Communication (IPC) subsystem


2. Loop Unroller and Dependency Analyser (3/8)
• Dependencies across application libraries play a major role in the
  allocation of libraries to the underlying nodes.
• The information on dependent libraries needs to be passed on to the
  process scheduler for efficient scheduling, thus extracting maximum
  work from the underlying nodes.
• In addition, in complex applications there may be loops across the
  dependent libraries; these loops need to be unrolled in order to
  identify the execution time of each iteration and to
  schedule those libraries.
• A dependency analyzer is therefore needed to extract the dependencies
  and execution time of each dependent library and to
  perform loop unrolling.
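
As a software analogy for dependency extraction, the sketch below walks a dependency graph breadth-first and lists every library that depends, directly or indirectly, on a given one. The graph representation, the library names and the `dependent_libraries` function are hypothetical; the actual analyzer is a hardware unit whose internals are not described here.

```python
from collections import deque


def dependent_libraries(graph: dict, root: str) -> list:
    """Return all libraries reachable from root, in breadth-first order (root excluded).

    graph maps a library name to the list of libraries that depend on it.
    """
    seen = {root}
    order = []
    queue = deque([root])
    while queue:
        lib = queue.popleft()
        for child in graph.get(lib, []):
            if child not in seen:   # visit each dependent library only once
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


# Hypothetical application: "merge" needs both "filter" and "scale", which need "fft".
graph = {"fft": ["filter", "scale"], "filter": ["merge"], "scale": ["merge"], "merge": []}
print(dependent_libraries(graph, "fft"))
```

A scheduler that receives this list knows which libraries cannot start until the root library has finished.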


2. Loop Unroller and Dependency Analyser (4/8)

The graph traversal unit traverses the dependency
graph and extracts the dependent libraries.
• The information from this unit is updated in the library detail table.

The loop unroller unit forms an integral part of the dependency
analyzer.
• It unrolls loops by replicating the libraries using the loop index value.
• This unit thus greatly assists the dependency analyzer in time stamp
  generation.

After the dependencies across the libraries are extracted, the
information is used to generate the time stamps of the child libraries.
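
Loop unrolling and time stamp generation can be sketched together as follows. This is a minimal illustration under stated assumptions: each loop iteration becomes one replicated library tagged with its loop index, iterations depend on the previous iteration, and a child's time stamp is taken to be the latest finish time among its parents. The names `unroll` and `start_times`, the library names and the execution times are all hypothetical.

```python
def unroll(library: str, loop_count: int) -> list:
    """Replicate a looped library once per iteration, tagged with the loop index."""
    return [f"{library}[{i}]" for i in range(loop_count)]


def start_times(parents: dict, exec_time: dict) -> dict:
    """Time stamp of each library: the latest finish time among its parent libraries."""
    stamps = {}

    def start(lib: str) -> int:
        if lib not in stamps:
            stamps[lib] = max(
                (start(p) + exec_time[p] for p in parents.get(lib, [])),
                default=0,  # libraries without parents can start at time 0
            )
        return stamps[lib]

    for lib in exec_time:
        start(lib)
    return stamps


# A loop over "solve" with 3 iterations, followed by a dependent "report" library.
iterations = unroll("solve", 3)
exec_time = {name: 5 for name in iterations}
exec_time["report"] = 2
# Assumption: each iteration depends on the previous one.
parents = {later: [earlier] for earlier, later in zip(iterations, iterations[1:])}
parents["report"] = [iterations[-1]]

stamps = start_times(parents, exec_time)
print(stamps)
```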
2. ISA of the Dependency Analyzer (5/8)
Figure 12: Overall Architecture of Dependency Analyzer
2. Design of Hardware Based Programmable
Scheduler for SMAPP (6/8)

The existing scheduler is not programmable and hence cannot accommodate
new scheduling heuristics.

The scheduler therefore needs to be made adaptive, so that
the user can choose the scheduling heuristic.

The optimization techniques we have adopted for our
scheduler are:
• Game Theory-Simulated Annealing (GT-SA) based scheduling
• Ant Colony Optimization based scheduling


2. Design of Hardware Based Programmable
Scheduler for SMAPP (7/8)
Game Theory-Simulated Annealing based Scheduling
• The communication and computation complexities of the nodes are taken
  as the cost function in the GT-SA based approach.
• By varying the parameters of the cluster system, an optimal cost function for
  the nodes in the secondary host is achieved.
• This scheduler unit schedules libraries to the underlying nodes while keeping the
  computation and communication complexity (cost functions) of the node
  plane in its optimized state.
• The GT-SA based scheduler unit compares the current state of the cost
  function with a next state obtained by varying the system parameters: available
  load and queue length of the buffers.
• The unit also accepts poorer next states, with a probability given by a probability
  equation, to avoid getting stuck in a local minimum.
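
The slide does not give the probability equation, so the sketch below uses the standard Metropolis acceptance rule from simulated annealing as a stand-in: a worse state is accepted with probability exp(-(increase in cost) / temperature). The function names and the temperature parameter are assumptions for illustration, not the GT-SA unit's actual equation.

```python
import math
import random


def acceptance_probability(current_cost: float, next_cost: float, temperature: float) -> float:
    """Metropolis rule (assumed): always accept improvements, sometimes accept worse states."""
    if next_cost <= current_cost:
        return 1.0
    return math.exp(-(next_cost - current_cost) / temperature)


def accept(current_cost: float, next_cost: float, temperature: float, rng: random.Random) -> bool:
    """Decide whether the scheduler moves to the next state."""
    return rng.random() < acceptance_probability(current_cost, next_cost, temperature)


rng = random.Random(0)
print(acceptance_probability(10.0, 8.0, 1.0))    # better state
print(acceptance_probability(10.0, 12.0, 2.0))   # worse state, high temperature
print(acceptance_probability(10.0, 12.0, 0.1))   # worse state, low temperature
```

Lowering the temperature over time makes poorer states progressively less likely to be accepted, so the search can escape local minima early on and settle later.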


2. Design of Hardware Based Programmable
Scheduler for SMAPP (8/8)
Ant Colony Optimization based Scheduling
• The behavior of ants, which use pheromones to find the shortest path to food,
  has been adopted in this scheduling algorithm [21].
• Here, application libraries to be mapped onto a distant node need to
  traverse the shortest path across the nodes to reach the destination
  node.
• The host can make use of this scheduling unit to choose the path for
  reaching a destination node or for broadcasting.
• Information about the distances between nodes and the shortest paths is constantly
  updated by this unit and can therefore be used to reduce the communication
  complexity across the nodes.
• Thus, based on the network topology and the traffic in the network,
  optimized paths to reach a particular node are identified.
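
The two core steps of ant colony optimization, probabilistic next-hop selection and pheromone update, can be sketched as below. The formulas follow the textbook Ant System form and are an assumption here, since the slide does not state the scheduler's equations; the parameters `alpha`, `beta`, `evaporation` and `deposit`, and the node names, are hypothetical.

```python
def choice_probabilities(pheromone: dict, distance: dict, alpha: float = 1.0, beta: float = 2.0) -> dict:
    """Probability of hopping to each neighbour: more pheromone and shorter distance are favoured."""
    weights = {
        node: (pheromone[node] ** alpha) * ((1.0 / distance[node]) ** beta)
        for node in pheromone
    }
    total = sum(weights.values())
    return {node: weight / total for node, weight in weights.items()}


def update_pheromone(pheromone: dict, used_node: str, path_length: float,
                     evaporation: float = 0.5, deposit: float = 1.0) -> dict:
    """Evaporate pheromone on every hop, then reinforce the hop that was used."""
    updated = {node: (1.0 - evaporation) * tau for node, tau in pheromone.items()}
    updated[used_node] += deposit / path_length   # shorter paths get a larger deposit
    return updated


# Two candidate next hops from the host: node "B" is closer than node "C".
pheromone = {"B": 1.0, "C": 1.0}
distance = {"B": 1.0, "C": 2.0}
probs = choice_probabilities(pheromone, distance)
print(probs)
print(update_pheromone(pheromone, "B", path_length=1.0))
```

Traffic in the network could be folded into the `distance` values, so that congested links look longer and attract less pheromone; that mapping is likewise an assumption.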

3. Xilinx Virtex FPGA Family

The Xilinx Virtex family of FPGAs is being used to prototype
the SILICOS architecture.

The Xilinx Virtex-7 FPGA kit, the latest in the Virtex family,
offers up to 2,000,000 logic cells and up to 68 Mb of RAM.

• It also provides up to 3,600 DSP slices, giving higher bandwidth and aiding
  the programming of parallel processing logic into the FPGA kit.