pptx - University of Colorado Denver

Mrutyunjay (Mjay)
University of Colorado, Denver
Hardware Trends
Multi-Core CPUs
Many Core: Co-Processors
GPU (NVIDIA, AMD Radeon)
Huge main memory capacity with complex access characteristics
(Caches, NUMA)
Non-Volatile Storage
Flash SSD (Solid State Drive)
Around 2005, CPUs hit the frequency-scaling wall.
Since then, performance improvements come from adding
multiple processing cores to the same CPU chip,
forming chip multiprocessors (CMP).
Servers with multiple CPU sockets of multicore
processors form an SMP of CMPs.
Use Moore’s law to place more cores per chip
2x cores/chip with each CMOS generation
Roughly same clock frequency
Known as multi-core chips or chip-multiprocessors (CMP)
The good news
Exponentially scaling peak performance
No power problems due to clock frequency
Easier design and verification
The bad news
Need parallel programs if we want to run a single app faster
Power density is still an issue as transistors shrink
This is how we think
it works.
This is EXACTLY how it
works.
Type of cores
E.g., a few out-of-order (OOO) cores vs. many simple
cores
Memory hierarchy
Which caching levels are shared and
which are private
Cache coherence
Synchronization
On-chip interconnect
Bus Vs Ring Vs scalable interconnect
(e.g., mesh)
Flat Vs hierarchical
All processors have access to a unified physical
memory
They can communicate using loads and stores
Advantages
Looks like a better multithreaded processor
(multitasking)
Requires only evolutionary changes to the OS
Threads within an app communicate implicitly
without using OS
Simpler to code for and low overhead
App development: first focus on correctness,
then on performance
Disadvantages
Implicit communication is hard to optimize
Synchronization can get tricky
Higher hardware complexity for cache
management
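The points above about implicit communication and tricky synchronization can be illustrated with a minimal C++ sketch. It assumes standard C++11 threads; the function name count_with_threads is hypothetical and not from the slides. All workers update one counter that lives in shared memory, so they communicate through ordinary loads and stores without calling into the OS for each update.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Every worker increments one counter that lives in shared memory.
// std::atomic makes each increment an indivisible read-modify-write;
// with a plain long, concurrent updates could be lost (a data race).
long count_with_threads(int num_threads, int increments_per_thread) {
    assert(num_threads > 0 && increments_per_thread >= 0);
    std::atomic<long> counter{0};
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&counter, increments_per_thread] {
            for (int i = 0; i < increments_per_thread; ++i) {
                counter.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    // Joining every worker guarantees all increments are visible below.
    for (auto& w : workers) w.join();
    return counter.load();
}
```

Calling count_with_threads(4, 10000) returns 40000 because no increment is lost. The same loop on a non-atomic counter has no such guarantee, which is the kind of synchronization problem the disadvantages list refers to.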
NUMA: Non-Uniform Memory Access
GPU (Graphics Processing Unit) is a specialized microprocessor for
accelerating graphics rendering
GPUs traditionally for graphics computing
GPUs now allow general purpose computing easily
GPGPU: using GPU for general purpose computing
Physics, Finance, Biology, Geosciences, Medicine, etc
NVIDIA and AMD Radeon
A GPU design with up to a thousand cores enables massively
parallel computing
The GPU architecture of streaming multiprocessors takes the form of
SIMD processors
CPU
GPU
SIMD: Single Instruction Multiple Data
Distributed memory SIMD computer
Shared memory SIMD computer
Each GPU has ≥ 1 Streaming Multiprocessors (SMs)
Each SM is designed as a simple SIMD processor
8-192 Streaming Processors (SPs)
NVIDIA GeForce 8-Series GPUs and later
SMP of CMP:
SMP: Symmetric Multiprocessor (multiple CPU sockets in a single system)
CMP: Chip Multiprocessor (a single chip with multiple or many cores)
SP: Streaming Processor
SFU: Special Function Units
Double Precision Unit
Multithreaded Instruction Unit
Hardware thread scheduling
14 Streaming Multiprocessors per GPU
32 cores per Streaming Multiprocessor
Two main approaches:
Other tool: OpenACC
CUDA = Compute Unified Device Architecture
A development framework for NVIDIA GPUs
Extensions of the C language
Supports NVIDIA GeForce 8-Series and later
Host = CPU
Device = GPU
Host memory = RAM
Device memory = RAM on GPU
[Diagram: Host (CPU) with host memory, connected over the PCI Express bus to Device (GPU) with device memory]
CPU sends data to the GPU
CPU instructs the processing on GPU
GPU processes data
CPU collects the results from GPU
[Diagram: the host (CPU) with host memory and the device (GPU) with device memory, with the four steps above marked 1 to 4]
Host Code
int N = 1000;
size_t size = N * sizeof(float);
float A[1000], *dA;   // A: host array (assumed already filled), dA: its copy on the device
// 1. CPU sends data to the GPU
cudaMalloc((void **)&dA, size);
cudaMemcpy(dA, A, size, cudaMemcpyHostToDevice);
// 2. CPU instructs the processing on GPU: 10 blocks x 100 threads = one thread per element
ComputeArray<<<10, 100>>>(dA, N);
// 3. GPU processes data (runs ComputeArray, see Device Code)
// 4. CPU collects the results from GPU
cudaMemcpy(A, dA, size, cudaMemcpyDeviceToHost);
cudaFree(dA);
Device Code
__global__ void ComputeArray(float *A, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global index of this thread
    if (i < N) A[i] = A[i] * A[i];                  // square in place; the guard skips extra threads
}
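The launch configuration decides how many threads run the kernel, so it has to cover all N elements. The following is a host-only C++ sketch, not CUDA device code: it emulates the block and thread loop on the CPU to show the index formula and the bounds guard at work. The function names blocks_needed and emulate_compute_array are hypothetical.

```cpp
#include <cassert>
#include <vector>

// Smallest number of blocks such that blocks * threads_per_block >= n
// (ceiling division).
int blocks_needed(int n, int threads_per_block) {
    assert(n >= 0 && threads_per_block > 0);
    return (n + threads_per_block - 1) / threads_per_block;
}

// CPU emulation of a ComputeArray launch: visits every (block, thread) pair
// and applies the same index formula and bounds guard as the device code.
// Returns how many elements were actually squared.
int emulate_compute_array(std::vector<float>& a, int num_blocks,
                          int threads_per_block) {
    const int n = static_cast<int>(a.size());
    int touched = 0;
    for (int block = 0; block < num_blocks; ++block) {
        for (int thread = 0; thread < threads_per_block; ++thread) {
            // Same as blockIdx.x * blockDim.x + threadIdx.x on the device.
            int i = block * threads_per_block + thread;
            if (i < n) {
                a[i] = a[i] * a[i];
                ++touched;
            }
        }
    }
    return touched;
}
```

For N = 1000 and 256 threads per block, blocks_needed gives 4, which starts 1024 threads; the guard makes the last 24 do nothing. A launch of 10 blocks with 20 threads each starts only 200 threads, so the emulation touches just the first 200 elements.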
• A kernel is executed as a grid of
blocks
• A block is a batch of threads that
can cooperate with each other by:
– Sharing data through shared memory
– Synchronizing their execution
• Threads from different blocks
cannot cooperate
Limiting kernel launches
Limiting data transfers (solution: overlapped transfers)
GPU strengths are useful
Memory bandwidth
Parallel processing
Accelerating SQL queries – 10x improvement
Also well suited for stream mining
Continuous queries on streaming data instead of one-time queries on
static database
Slowest part: main memory and fixed disk.
Can we decrease the latency between main memory and fixed disk?
Solution: SSD
A Solid-State Disk (SSD) is a data storage device that emulates a hard disk
drive (HDD). Unlike an HDD, it has no moving parts.
NAND flash SSDs are essentially arrays of flash memory devices with
a controller; they electrically and mechanically emulate magnetic HDDs and are
software compatible with them.
Host Interface Logic
SSD Controller
RAM Buffer
Flash Memory Package
• What will be the initial state of an SSD?
Ans: Still looking for it.
• NAND-flash cells have a limited lifespan due to their
limited number of P/E cycles (Program/Erase Cycle)
Reads are aligned on page size: It is not possible to read less than one page at once.
One can of course only request just one byte from the operating system, but a full
page will be retrieved in the SSD, forcing a lot more data to be read than necessary.
Writes are aligned on page size: When writing to an SSD, writes happen by increments
of the page size. So even if a write operation affects only one byte, a whole page will
be written anyway. Writing more data than necessary is known as write amplification.
Pages cannot be overwritten: A NAND-flash page can be written to only if it is in the
“free” state. When data is changed, the content of the page is copied into an internal
register, the data is updated, and the new version is stored in a “free” page, an
operation called “read-modify-write”.
Erases are aligned on block size: Pages cannot be overwritten, and once they become
stale, the only way to make them free again is to erase them. However, it is not
possible to erase individual pages, and it is only possible to erase whole blocks at
once.
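The page-alignment rules above can be made concrete with a little arithmetic. This is a simplified C++ sketch under stated assumptions: the 4096-byte page size used in the example is illustrative only (the slides give no size), the function names are hypothetical, and the estimate ignores the extra writes caused by garbage collection.

```cpp
#include <cassert>
#include <cstdint>

// Number of flash pages touched by a request of `length` bytes that starts
// at byte `offset`, for a given page size.
std::uint64_t pages_touched(std::uint64_t offset, std::uint64_t length,
                            std::uint64_t page_size) {
    assert(length > 0 && page_size > 0);
    std::uint64_t first = offset / page_size;
    std::uint64_t last = (offset + length - 1) / page_size;
    return last - first + 1;
}

// Ratio of bytes physically written (whole pages) to bytes requested.
// A value of 1.0 means nothing extra was written.
double write_amplification(std::uint64_t offset, std::uint64_t length,
                           std::uint64_t page_size) {
    std::uint64_t written = pages_touched(offset, length, page_size) * page_size;
    return static_cast<double>(written) / static_cast<double>(length);
}
```

With an assumed 4096-byte page, a 1-byte write still costs one full page, an amplification of 4096. A 2-byte write that straddles a page boundary costs two pages. An aligned 8192-byte write costs exactly two pages, an amplification of 1.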
Align writes: Align writes on the page size,
and write chunks of data that are multiples
of the page size.
Buffer small writes: To maximize
throughput, whenever possible keep small
writes in a buffer in RAM and when the
buffer is full, perform a single large write
to batch all the small writes.
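One way to buffer small writes is sketched below in C++. It is an illustration under assumptions, not a real storage API: the class name WriteBuffer is hypothetical, and the "drive" is only a list recording the size of each large write. The capacity should be chosen as a multiple of the page size so that full flushes stay aligned.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

class WriteBuffer {
public:
    // capacity: bytes to collect before issuing one large write
    // (assumed to be a multiple of the drive's page size).
    explicit WriteBuffer(std::size_t capacity) : capacity_(capacity) {
        assert(capacity > 0);
        buffer_.reserve(capacity);
    }

    // Queue a small write in RAM; emit one large write each time the
    // buffer becomes full.
    void write(const std::string& data) {
        for (char c : data) {
            buffer_.push_back(c);
            if (buffer_.size() == capacity_) flush();
        }
    }

    // Force out pending bytes, e.g. before shutdown. This write may be
    // smaller than the capacity.
    void flush() {
        if (buffer_.empty()) return;
        flushed_sizes_.push_back(buffer_.size());  // stand-in for one write to the drive
        buffer_.clear();
    }

    const std::vector<std::size_t>& flushed_sizes() const { return flushed_sizes_; }
    std::size_t pending() const { return buffer_.size(); }

private:
    std::size_t capacity_;
    std::string buffer_;
    std::vector<std::size_t> flushed_sizes_;
};
```

With a capacity of 8 bytes, three 3-byte writes produce a single 8-byte write to the drive and leave 1 byte pending, instead of three separate small writes. Data held in RAM is lost on a crash, so a real application has to decide how much unflushed data it can afford.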
Latency differs for
each type.
More levels increase
the latency: delays in
read and write.
Solution: Hybrid SSD,
consisting of mixed levels
The garbage collection process in the SSD controller ensures that “stale” pages are
erased and restored into a “free” state so that the incoming write commands can be
processed.
Split cold and hot data: Hot data is data that changes frequently, and cold data is
data that changes infrequently. If some hot data is stored in the same page as some
cold data, the cold data will be copied along every time the hot data is updated in a
read-modify-write operation, and will be moved along during garbage collection for
wear leveling. Splitting cold and hot data as much as possible into separate pages will
make the job of the garbage collector easier.
Buffer hot data: Extremely hot data should be buffered as much as possible and
written to the drive as infrequently as possible.
The main factor that made adoption of SSDs so easy is that they use the same host
interfaces as HDDs.
Although presenting an array of Logical Block Addresses (LBA) makes sense for HDDs
as their sectors can be overwritten, it is not fully suited to the way flash memory
works.
For this reason, an additional component is required to hide the inner characteristics
of NAND flash memory and expose only an array of LBAs to the host. This component
is called the Flash Translation Layer (FTL), and resides in the SSD controller.
The FTL is critical and has two main purposes: logical block mapping and garbage
collection.
Logical block mapping translates each LBA from the host into a Physical Block Address (PBA) in flash.
This mapping takes the form of a table, which for any LBA gives the corresponding
PBA. The table is kept in the RAM of the SSD for speed of access, and is
persisted in flash memory in case of power failure. When the SSD powers up, the
table is read from the persisted version and reconstructed in the RAM of the SSD.
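The mapping table can be pictured with a toy model in C++. Everything here is a simplifying assumption rather than how any real controller works: the class name ToyFtl and the constant kUnmapped are hypothetical, the mapping is kept per page, free pages are handed out in order, and garbage collection and persistence are left out.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Sentinel meaning "no physical page" (hypothetical, for this sketch only).
const std::uint64_t kUnmapped = ~static_cast<std::uint64_t>(0);

class ToyFtl {
public:
    explicit ToyFtl(std::uint64_t total_pages)
        : total_pages_(total_pages), next_free_(0), stale_pages_(0) {}

    // Out-of-place write: the new version goes to the next free physical
    // page, and the page previously mapped to this LBA becomes stale.
    // Returns the physical page used, or kUnmapped when no free page is
    // left (a real controller would run garbage collection at this point).
    std::uint64_t write(std::uint64_t lba) {
        if (next_free_ == total_pages_) return kUnmapped;
        if (table_.count(lba) != 0) ++stale_pages_;
        std::uint64_t pba = next_free_++;
        table_[lba] = pba;
        return pba;
    }

    // Translate a logical address into its current physical page.
    std::uint64_t lookup(std::uint64_t lba) const {
        auto it = table_.find(lba);
        return it == table_.end() ? kUnmapped : it->second;
    }

    std::uint64_t stale_pages() const { return stale_pages_; }

private:
    std::uint64_t total_pages_;
    std::uint64_t next_free_;
    std::uint64_t stale_pages_;
    std::unordered_map<std::uint64_t, std::uint64_t> table_;
};
```

Writing LBA 7, then LBA 9, then LBA 7 again on a 4-page device uses physical pages 0, 1 and 2. The table then maps LBA 7 to page 2, and page 0 is counted as stale until a garbage collector erases its block.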
Internal parallelism: Internally, several
levels of parallelism allow writing
several blocks at once to different NAND-flash chips, into what is called a “clustered
block”.
Multiple levels of parallelism:
Channel-level parallelism
Package-level parallelism
Chip-level parallelism
Plane-level parallelism
SSD Advantages
Reads and writes are much faster than on a traditional HDD
Allow PCs to boot up and launch programs far more quickly
More physically robust
Use less power and generate less heat
SSD Disadvantages
Lower capacity than HDDs
Higher storage cost per GB
Limited number of data write cycles
Performance degradation over time
http://codecapsule.com/2014/02/12/coding-for-ssds-part-6-a-summary-what-every-programmer-should-know-about-solid-state-drives/