Implementing Advanced
Intelligent Memory
Josep Torrellas, U of Illinois & IBM Watson Ctr.
David Padua and Dan Reed, U of Illinois
[email protected], [email protected], [email protected]
September 1998
Technological Opportunity
We can fabricate a large silicon area of
Merged Logic and DRAM (MLD)
Question: How to exploit this capability best
to advance computing?
Pieces of the Puzzle
• Today:
256 Mbit MLD process with 0.25um
Includes logic running at 200 MHz
E.g., two IBM PowerPC 603 cores with 8 KB I+D caches take 10% of the chip
• Manufacturers:
IBM CMOS-7LD technology available Fall '98
Japanese manufacturers (NEC, Fujitsu) are in the lead
• In a couple of years: 512 Mbit MLD process at 0.18um
Key Applications Clamor for HW
• Data Mining (decision trees and neural networks)
• Computational Biology (DNA sequence matching)
• Financial Modeling (stock options, derivatives)
• Molecular Dynamics (short-range forces)
• Plus the typical ones: MPEG, TPC-D, speech recognition
All are Data Intensive Applications
Our Solution: Principles
1. Extract high bandwidth from DRAM:
> Many simple processing units
2. Run legacy codes w/ high performance:
> Do not replace the off-the-shelf µP in the workstation
> Take place of memory chip. Same interface as DRAM
> Intelligent memory defaults to plain DRAM
3. Small increase in cost over DRAM:
> Simple processing units, still dense
4. General purpose:
> Do not hardwire any algorithm. No special purpose
Architecture Proposed
[Diagram: the host processor (P.Host) with its L1/L2 caches sits on the memory network next to plain DRAM chips and FlexRAM chips; each FlexRAM chip contains a P.Mem with its own cache, the P.Arrays, and the DRAM.]
Proposed Work
• Design an architecture based on key IBM applications
• Fabricate chips using IBM CMOS-7LD technology
• Build a workstation w/ an intelligent memory system
• Build a language and compiler for the intelligent memory
• Demonstrate significant speedups on the applications
Example App: DNA Matching
• BLAST code from the NIH web site
[Figure: a sample DNA chain compared against a database of DNA chains]
• Problem: Find areas of the database DNA chains that match (modulo some mutations) the sample DNA chain
How the Algorithm Works
1. Pick 4 consecutive amino acids from the sample (e.g. bbcf)
2. Generate the 50+ most-likely mutations (e.g. becf)
Example App: DNA Matching
3. Compare them to every position in the database DNAs (e.g. becf)
4. If a match is found: try to extend it in both directions (see the code sketch below)
[Figure: the mutated seed becf is located in the database DNA chains and the match is extended against the sample DNA]
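To make steps 3-4 concrete, here is a minimal C sketch of the seed-and-extend kernel. The 4-character seed length comes from the slides; the exact-match extension rule and the function name are simplifications of what BLAST actually does (BLAST scores matches with a substitution matrix and tolerates a bounded score drop-off).

#include <stddef.h>
#include <string.h>

#define SEED_LEN 4

/* Scan one database chain for a mutated seed (e.g. "becf") and, on a hit,
 * extend the match in both directions as long as characters keep agreeing.
 * Returns the length of the longest extended match found (0 if none).
 * Simplified sketch: real BLAST uses a scoring matrix and allows a bounded
 * score drop-off instead of requiring exact agreement. */
static size_t match_and_extend(const char *db, size_t db_len,
                               const char *sample, size_t sample_len,
                               size_t sample_pos,   /* where the seed was taken */
                               const char *seed)
{
    size_t best = 0;

    for (size_t i = 0; i + SEED_LEN <= db_len; i++) {
        if (memcmp(db + i, seed, SEED_LEN) != 0)
            continue;                       /* step 3: no seed hit at this position */

        /* step 4: extend the hit to the left and to the right */
        size_t left = 0;
        while (i > left && sample_pos > left &&
               db[i - left - 1] == sample[sample_pos - left - 1])
            left++;

        size_t right = 0;
        while (i + SEED_LEN + right < db_len &&
               sample_pos + SEED_LEN + right < sample_len &&
               db[i + SEED_LEN + right] == sample[sample_pos + SEED_LEN + right])
            right++;

        size_t len = left + SEED_LEN + right;
        if (len > best)
            best = len;
    }
    return best;
}

Each P.Array could run this scan over the database chains stored in its own DRAM bank, which is what makes this data-intensive application a natural fit for the architecture.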
P.Arrays
• Total of 64 per chip (90 mm²)
• SPMD engines, not SIMD. Cycling at 200 MHz (programming model sketched below)
• 32-bit datapath, integer only, including multiply (MPY). 28 instructions
• Organized as a ring, no need for a mesh
• Each P.Array has:
> 1 Mbyte of DRAM memory; can also access the memory of its N and S neighbors
> 2 1-Kbyte row buffers to capture data locality
> 8 Kbytes of SRAM instruction memory, shared by 4 P.Arrays
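A hedged sketch of the SPMD programming model follows; the 64 engines, the ring, and the 1 MB banks come from the slides, while the kernel body, the host-side driver loop, and the neighbor-access pattern are illustrative assumptions, not the actual FlexRAM interface.

#include <stdint.h>
#include <stddef.h>

#define NUM_PARRAYS 64
#define BANK_BYTES  (1u << 20)   /* 1 MB of DRAM per P.Array (from the slide) */

/* Host-side model of the on-chip DRAM: 64 banks, one per P.Array. */
static uint8_t dram[NUM_PARRAYS][BANK_BYTES];

/* SPMD kernel for one P.Array: every engine runs this same code on its own
 * bank, and may also read the banks of its north/south neighbors on the ring.
 * Example: clamp each byte of the local bank to the maximum of itself and the
 * corresponding byte of the north neighbor, a stand-in for the integer-only,
 * data-parallel work the P.Arrays target. */
static void parray_kernel(int id)
{
    uint8_t *mine  = dram[id];
    uint8_t *north = dram[(id + NUM_PARRAYS - 1) % NUM_PARRAYS];  /* ring neighbor */

    for (size_t i = 0; i < BANK_BYTES; i++)
        if (north[i] > mine[i])
            mine[i] = north[i];
    /* Because the engines are SPMD rather than SIMD, each P.Array could take a
     * data-dependent path here, e.g. skip regions its data marks as empty. */
}

/* In this host-side model the 64 "engines" are just a loop; on the chip each
 * P.Array would execute parray_kernel with its own id concurrently. */
void run_all_parrays(void)
{
    for (int id = 0; id < NUM_PARRAYS; id++)
        parray_kernel(id);
}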
P.Array Design
[Block diagram: each P.Array couples an ALU, registers (R.Reg.), address generator, controller, and instruction memory to its DRAM block through three ports, the switches, the sense amps/column decoder, and the row decoder, and to the rest of the chip through the broadcast bus.]
P.Mem
• IBM PowerPC 603 with 8 KB D + 8 KB I caches
• About 15 mm²
• 200 MHz
• Also included: memory interface
DRAM Memory
• 512 Mbit (64 Mbyte) with 0.18um
• Organized as 64 banks of 1 MB each (one per P.Array)
• 2.2V operating voltage
• Internal memory bandwidth: 102 Gbytes/s at 200 MHz (worked out below)
• Memory access time at 200 MHz:
> 2 cycles for a row buffer hit
> 4 cycles for a miss
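As a rough sanity check on the bandwidth figure, assuming each of the 64 banks streams 8 bytes from its open row buffer per cycle: 64 banks x 8 bytes x 200 MHz = 102.4 Gbytes/s, consistent with the number above.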
Chip Architecture
[Floorplan diagram: 16 Basic Blocks surround a central P.Mem and two broadcast buses. Each Basic Block holds 4 P.Arrays, 4 MB of DRAM, an 8 kB 4-port SRAM instruction memory, and 1 multiplier; its DRAM is organized as 1 MB (8 Mb) blocks built from 256 kB (2 Mb) sub-blocks of 512 rows x 4k columns, each served by a memory control block and flanked by the P.Arrays.]
Language & Compiler
• High-level C-like explicitly parallel language that
exposes the architecture
• Compiler that automatically translates it into
structured assembly
• Libraries of Intelligent Memory Operations (IMOs)
written in assembly
Intelligent Memory Ops
• General-purpose operations (one is sketched in code below), such as:
• Arithmetic/logic/symbolic array operations
• Set operations. Iterators over elements of a set
• Regular/irregular structure search and update
(CAM operations)
• Domain-specific operations: e.g. FFT
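As an illustration of the host-side view of an IMO, here is a hedged C sketch of one general-purpose array operation. The function names, the range-count operation itself, and the explicit partitioning loop are hypothetical; they only show how P.Mem might split work across the 64 P.Arrays while the library call keeps ordinary C semantics for the programmer.

#include <stdint.h>
#include <stddef.h>

#define NUM_PARRAYS 64

/* Hypothetical IMO: count the 32-bit elements of an array resident in FlexRAM
 * memory that fall inside [lo, hi]. */

/* What one P.Array would run over the slice of the array in its own bank. */
static size_t count_slice(const uint32_t *slice, size_t n,
                          uint32_t lo, uint32_t hi)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        if (slice[i] >= lo && slice[i] <= hi)
            count++;
    return count;
}

/* Host-visible library call.  Conceptually, P.Mem partitions the array across
 * the 64 P.Arrays and sums their partial counts over the ring/broadcast bus;
 * here the partitioning is only modeled with a loop so the sketch stays
 * self-contained.  When the chip defaults to plain DRAM, the same semantics
 * are obtained by the host processor scanning the array itself. */
size_t imo_count_in_range(const uint32_t *data, size_t n,
                          uint32_t lo, uint32_t hi)
{
    size_t total = 0;
    size_t chunk = (n + NUM_PARRAYS - 1) / NUM_PARRAYS;   /* elements per bank */

    for (int p = 0; p < NUM_PARRAYS; p++) {
        size_t start = (size_t)p * chunk;
        if (start >= n)
            break;
        size_t len = (n - start < chunk) ? (n - start) : chunk;
        total += count_slice(data + start, len, lo, hi);
    }
    return total;
}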
Performance Evaluation
• Hardware performance monitoring
embedded in the chip
• Software tools to extract and interpret
performance info
Preliminary Results
[Bar chart: performance of a uniprocessor, 1 FlexRAM, and 4 FlexRAM configurations on MPEG2 and Chroma/Keying; vertical axis 0 to 120]
Current Status
• Identified and wrote all applications
• Designed architecture based on apps & IBM technology
• Conceived ideas behind language/compiler
• Need to do:
> chip layout and fabrication
> development of the compiler
• Funds needed for:
> processor core (P.Mem)
> chip fabrication
> hardware and software engineers
Conclusion
• We have a handle on:
• A promising technology (MLD)
• Key applications of industrial interest
• Real chance to transform the computing landscape
Current Research Work
Josep Torrellas,
U of Illinois & IBM Watson Ctr.
[email protected]
http://iacoma.cs.uiuc.edu
September 1998
Current Research Projects
• 1. Illinois Aggressive COMA (I-ACOMA): Scalable
NUMA and COMA architectures
• 2. FlexRAM: Advanced Intelligent Memory
• 3. Speculative Parallelization Hardware
• 4. Database Workload Characterization: TPC-C, TPC-D, data mining
> All projects are in collaboration with IBM Watson
> Project 4 is also in collaboration with Intel Oregon
Publications 1997 and 98
1.Architectural Advances in DSMs: A Possible Road Ahead
by Josep Torrellas, Ninth SIAM Conference on Parallel Processing for Scientific Computing, Spring 1999.
2.A Direct-Execution Framework for Fast and Accurate Simulation of Superscalar Processors
by Venkata Krishnan and Josep Torrellas, International Conference on Parallel Architectures and Compilation Techniques (PACT), October 1998.
3.Hardware and Software Support for Speculative Execution of Sequential Binaries on a Chip-Multiprocessor
by Venkata Krishnan and Josep Torrellas, International Conference on Supercomputing (ICS), July 1998.
4.Comparing Data Forwarding and Prefetching for Communication-Induced Misses in Shared-Memory MPs
by David Koufaty and Josep Torrellas, International Conference on Supercomputing (ICS), July 1998.
5.Cache-Only Memory Architectures
by Fredrik Dahlgren and Josep Torrellas, IEEE Computer Magazine, to appear 1998.
6.Executing Sequential Binaries on a Multithreaded Architecture with Speculation Support
by Venkata Krishnan and Josep Torrellas, Workshop on Multi-Threaded Execution, Architecture and Compilation (MTEAC'98), January 1998.
7.A Clustered Approach to Multithreaded Processors
by Venkata Krishnan and Josep Torrellas, International Parallel Processing Symposium, March 1998.
8.Hardware for Speculative Run-Time Parallelization in Distributed Shared-Memory Multiprocessors
by Ye Zhang, Lawrence Rauchwerger, and Josep Torrellas, Fourth International Symposium on High-Performance Computer Architecture, February 1998.
9.Enhancing Memory Use in Simple Coma: Multiplexed Simple Coma
by Sujoy Basu and Josep Torrellas, Fourth International Symposium on High-Performance Computer Architecture, February 1998.
10.How Processor-Memory Integration Affects the Design of DSMs
by Liuxi Yang, Anthony-Trung Nguyen, and Josep Torrellas, Workshop on Mixing Logic and DRAM: Chips that Compute and Remember, June 1997.
11.Efficient Use of Processing Transistors for Larger On-Chip Storage: Multithreading
by Venkata Krishnan and Josep Torrellas, Workshop on Mixing Logic and DRAM: Chips that Compute and Remember, June 1997.
12.The Memory Performance of DSS Commercial Workloads in Shared-Memory Multiprocessors
by Pedro Trancoso, Josep-L. Larriba-Pey, Zheng Zhang, and Josep Torrellas, Third International Symposium on High-Performance Computer Architecture, January 1997.
13.Reducing Remote Conflict Misses: NUMA with Remote Cache versus COMA
by Zheng Zhang and Josep Torrellas, Third International Symposium on High-Performance Computer Architecture, January 1997.
14.Speeding up the Memory Hierarchy in Flat COMA Multiprocessors
by Liuxi Yang and Josep Torrellas, Third International Symposium on High-Performance Computer Architecture, January 1997.