Transcript uw-ugrads

System Support for Data-Intensive
Applications
Katherine Yelick
U.C. Berkeley, EECS
Slide 1
The “Post PC” Generation
Two technologies will likely dominate:
1) Mobile Consumer Electronic Devices
– e.g., PDA, Cell phone, wearable computers,
with cameras, recorders, sensors
– make the computing “invisible” through
reliability and simple interfaces
2) Infrastructure to Support such Devices
– e.g., successor to Big Fat Web Servers,
Database Servers
– make these “utilities” with reliability and
new economic models
Slide 2
Open Research Issues
• Human-computer interaction
– uniformity across devices
• Distributed computing
– coordination across independent devices
• Power
– low power designs and renewable power sources
• Information retrieval
– finding useful information amidst a flood of data
• Scalability
– Scaling devices down
– Scaling services up
• Reliability and maintainability
Slide 3
The problem space: big data
• Big demand for handling enormous amounts of data
– today: enterprise and internet applications
» online applications: e-commerce, mail, web, archives
» enterprise decision-support, data mining databases
– future: richer data and more of it
» computational & storage back-ends for mobile devices
» more multimedia content
» more use of historical data to provide better services
• Two key application domains:
– storage: public, private, and institutional data
– search: building static indexes, dynamic discovery
Slide 4
Reliability/Performance Trade-off
• Techniques for reliability:
– High level languages with strong types
» avoid memory leaks, wild pointers, etc.
» C vs. Java
– Redundant storage, computation, etc.
» adds storage and bandwidth overhead
• Techniques for performance:
– Optimize for a specific machine
» e.g., cache or memory hierarchy
– Minimize redundancy
• These two goals work against each other
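A tiny illustration of the safety half of this trade-off (a sketch, not from the talk): the same out-of-bounds store that C would silently accept is caught at runtime by Java's bounds checking.

// Hedged sketch: Java's bounds check turns silent memory corruption into an exception.
class BoundsDemo {
  public static void main(String[] args) {
    int[] a = new int[4];
    try {
      a[4] = 42;                       // one past the end of the array
    } catch (ArrayIndexOutOfBoundsException e) {
      System.out.println("caught: " + e);   // the analogous C code may corrupt memory
    }
  }
}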
Slide 5
Specific Projects
• ISTORE
– A reliable, scalable, maintainable storage system
• Data-intensive applications for “backend” servers
– Modeling the real world
– Storing and finding information
• Titanium
– A high level language (Java) with high
performance
– A domain-specific language and optimizing
compiler
• Sparsity
– Optimization using partial program input
Slide 6
ISTORE: Reliable Storage System
• 80-node x86-based cluster, 1.4TB storage
– cluster nodes are plug-and-play, intelligent, network-attached storage “bricks”
» a single field-replaceable unit to simplify maintenance
– each node is a full x86 PC w/256MB DRAM, 18GB disk
– 2-node system running now; full system in next quarter
[Figure: ISTORE Chassis — 80 nodes, 8 per tray; 2 levels of switches
(20 at 100 Mbit/s, 2 at 1 Gbit/s); environment monitoring: UPS, redundant
power supplies, fans, heat and vibration sensors]
[Figure: Intelligent Disk “Brick” — portable PC CPU (Pentium II/266) + DRAM,
redundant NICs (4 x 100 Mb/s links), diagnostic processor, and disk in a
half-height canister]
Slide 7
A glimpse into the future?
• System-on-a-chip integration allows the processor, memory, and
redundant network interfaces to be added without significantly
increasing the size of the disk
• ISTORE HW in 5-7 years:
– building block: 2006 MicroDrive integrated with IRAM
» 9GB disk, 50 MB/sec from disk
» connected via crossbar switch
– 10,000 nodes fit into one rack!
• O(10,000) scale is our ultimate
design point
Slide 8
Specific Projects
• ISTORE
– A reliable, scalable, maintainable storage system
• Data-intensive applications for “backend” servers
– Modeling the real world
– Storing and finding information
• Titanium
– A high level language (Java) with high
performance
– A domain-specific language and optimizing
compiler
• Sparsity
– Optimization using partial program input
Slide 9
Heart Modeling
• A computer simulation of a human heart
– Used to design artificial heart valves
– Simulations run for days on a C90 supercomputer
– Done by Peskin and MacQueen at NYU
• Modern machines are
faster but harder to use
– working with NYU
– using Titanium
• Shown here: close-up of aortic valve during ejection
• Images from the Pittsburgh Supercomputer Center
Slide 10
Simulation of a Beating Heart
• Shown here:
– Aortic valve (yellow); Mitral valve (purple)
– Mitral valve closes when the left ventricle pumps
• Future: virtual surgery?
Slide 11
Earthquake Simulation
• Earthquake modeling
– Used for retrofitting buildings, emergency
preparedness, construction policies
– Done by Bielak (CMU); also by Fenves (Berkeley)
– Problems: grid (graph) generation; using images
Slide 12
Earthquake Simulation
• Movie shows a simulated aftershock following the
1994 Northridge earthquake in California
• Future: sensors everywhere; tied to central system
Slide 13
Pollution Standards
• Simulation of ozone levels
– Done by Russell (CMU) and McRae (MIT)
– Used to influence automobile emissions policy
Los Angeles Basin shown at 8am (left) and 2pm (right)
The “cloud” shows areas where ozone levels are above federal ambient
air quality standards (0.12 parts per million)
Slide 14
Information Retrieval
• Finding useful information amidst huge data sets
– I/O intensive application
• Today’s example: web search engines
– 10 Million documents in typical matrix.
– Web storage increasing 2x every 5 months
– One class of techniques based on sparse matrices
[Figure: keyword-by-document matrix, ~100K keywords x ~10 M documents,
multiplied by a query vector]
• Matrix is compressed
• “Random” memory access
• Cache miss per 2 Flops
• Run at 1-5% of machine peak
• Problem: Can you make this run faster, without
writing hand-optimized, non-portable code?
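As a rough sketch of the kernel behind this (names are illustrative, not from the talk): scoring documents against a query is a sparse matrix-vector multiply over the keyword-by-document matrix, stored here in compressed sparse row (CSR) form; the indirect accesses to x are what cause a cache miss every couple of flops.

// Hedged sketch: CSR sparse matrix-vector multiply y = A*x,
// where A is the compressed keyword-by-document matrix and x a query vector.
// rowPtr, colIdx, and val are the standard CSR arrays (names are illustrative).
static void spmv(int nRows, int[] rowPtr, int[] colIdx, double[] val,
                 double[] x, double[] y) {
  for (int i = 0; i < nRows; i++) {
    double sum = 0.0;
    for (int k = rowPtr[i]; k < rowPtr[i+1]; k++) {
      sum += val[k] * x[colIdx[k]];   // indirect access: poor cache behavior
    }
    y[i] = sum;
  }
}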
Slide 15
Image-Based Retrieval
• Digital library problem:
– retrieval on images
– content-based
• Computer vision problem
– uses sparse matrix
• Future: search in
medical image
databases; diagnosis;
epidemiological studies
Slide 16
Object Based Image Description
Slide 17
Specific Projects
• ISTORE
– A reliable, scalable, maintainable storage system
• Data-intensive applications for “backend” servers
– Modeling the real world
– Storing and finding information
• Titanium
– A high level language (Java) with high
performance
– A domain-specific language and optimizing
compiler
• Sparsity
– Optimization using partial program input
Slide 18
Titanium Goals
• Help programmers write reliable software
– Retain safety properties of Java
– Extend to parallel programming constructs
• Performance
– Sequential code comparable to C/C++/Fortran
– Parallel performance comparable to MPI
• Portability
• How?
– Domain-specific language and compiler
– No JVM
– Optimizing compiler
– Explicit parallelism and other language constructs
for high performance
Slide 19
Titanium Overview: Sequential
Object-oriented language based on Java with:
• Immutable classes
– user-definable non-reference types for
performance
• Unordered loops
– compiler is free to run iteration in any order
– useful for cache optimizations and others
• Operator overloading
– by demand from our user community
• Multidimensional arrays
– points and index sets as first-class values
– specific to an application domain: scientific
computing with block-structured grids
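A minimal Titanium-style sketch of two of these features (the domain and array-literal syntax is illustrative, not taken from the talk): an immutable class used as a value type, and an unordered foreach loop over a multidimensional array whose iterations the compiler may reorder.

// Hedged Titanium-style sketch.
immutable class Complex {            // value type: no object header or indirection
  public double re, im;
  public Complex(double r, double i) { re = r; im = i; }
}

double [2d] grid = new double [[0,0] : [n-1,n-1]];
foreach (p in grid.domain()) {       // unordered loop: compiler may reorder iterations
  grid[p] = 0.0;
}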
Slide 20
Titanium Overview: Parallel
Extensions of Java for scalable parallelism:
• Scalable parallelism
– SPMD model with global address space
• Global communication library
– E.g., broadcast, exchange (all-to-all)
– Used to build data structures in the global
address space
• Parallel Optimizations
– Pointer operations
– Communication (underway)
• Bulk asynchronous I/O
– speed with safety
Slide 21
Implementation
• Strategy
– Compile Titanium into C
– Communicate through shared memory on SMPs
– Lightweight communication for distributed memory
• Titanium currently runs on:
– Uniprocessors
– SMPs with Posix or Solaris threads
– Berkeley NOW, SP2 (distributed memory)
– Tera MTA (multithreaded, hierarchical)
– Cray T3E (global address space)
– SP3 (cluster of SMPs, e.g., Blue Horizon at SDSC)
Slide 22
Sequential Performance
Times in seconds; overhead is Titanium arrays vs. C/C++/FORTRAN.

Ultrasparc:      C/C++/FORTRAN   Java arrays   Titanium arrays   Overhead
DAXPY                 1.4s           6.8s            1.5s            7%
3D multigrid          12s             -              22s            83%
2D multigrid          5.4s            -              6.2s           15%
EM3D                  0.7s           1.8s            1.0s           42%

Pentium II:      C/C++/FORTRAN   Java arrays   Titanium arrays   Overhead
DAXPY                 1.8s            -              2.3s           27%
3D multigrid         23.0s            -             20.0s          -13%
2D multigrid          7.3s            -              5.5s          -25%
EM3D                  1.0s            -              1.6s           60%
Performance results are from '98; a new IR and optimization
framework is almost complete.
Slide 23
SPMD Execution Model
• Java programs can be run as Titanium, but the result
will be that all processors do all the work
• E.g., parallel hello world
class HelloWorld {
  public static void main (String [] argv) {
    System.out.println("Hello from proc " + Ti.thisProc());
  }
}
• Any non-trivial program will have communication and
synchronization
Slide 24
SPMD Execution Model
• A common style is compute/communicate
• E.g., in each timestep within particle simulation with
gravitation attraction
read all particles and compute forces on mine
Ti.barrier();
write to my particles using new forces
Ti.barrier();
• This basic model is used on the large-scale parallel
simulations described earlier
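In Titanium-style code, one such timestep might look roughly like the sketch below (computeForces and updateMyParticles are hypothetical helpers standing in for the real force and update code).

// Hedged sketch of one compute/communicate timestep.
for (int step = 0; step < nSteps; step++) {
  // compute: read all particles, accumulate forces on my own particles
  computeForces(allParticles, myParticles);
  Ti.barrier();                 // everyone finishes reading the old positions
  // update: write new positions/velocities for my particles using the new forces
  updateMyParticles(myParticles);
  Ti.barrier();                 // everyone finishes writing before the next read
}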
Slide 25
SPMD Model
• All processors start together and execute the same code,
but not in lock-step
• Basic control done using
– Ti.numProcs() total number of processors
– Ti.thisProc() number of executing processor
• Sometimes they do something independent
if (Ti.thisProc() == 0) { ... do setup ... }
System.out.println("Hello from " + Ti.thisProc());
double [1d] a = new double [Ti.numProcs()];
Slide 26
Barriers and Single
• A common source of bugs is a barrier or other global
operation inside a branch or loop
barrier, broadcast, reduction, exchange
• A “single” method is one called by all procs
public single static void allStep(...)
• A “single” variable has same value on all procs
int single timestep = 0;
• The compiler uses “single” type annotations to ensure
there are no synchronization bugs with barriers
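For example (a hedged sketch; checkConvergence is a hypothetical method that returns the same value on every processor): a barrier under a branch whose condition can differ across processors deadlocks, while a single condition guarantees all processors take the same path.

// Buggy: 'done' may differ across processors, so some reach the barrier and some don't.
if (!done) { Ti.barrier(); }

// Safe: a 'single' variable has the same value everywhere,
// so either all processors hit the barrier or none do.
boolean single allDone = checkConvergence();   // hypothetical single-valued method
if (!allDone) { Ti.barrier(); }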
Slide 27
Explicit Communication: Exchange
• To create shared data structures
– each processor builds its own piece
– pieces are exchanged (for object, just exchange
pointers)
• Exchange primitive in Titanium
int [1d] single allData;
allData = new int [0:Ti.numProcs()-1];
allData.exchange(Ti.thisProc()*2);
• E.g., on 4 procs, each will have a copy of allData: [0, 2, 4, 6]
Slide 28
Exchange on Objects
• More interesting example:
class Boxed {
  public Boxed (int j) {
    val = j;
  }
  public int val;
}
Object [1d] single allData;
allData = new Object [0:Ti.numProcs()-1];
allData.exchange(new Boxed(Ti.thisProc()));
Slide 29
Use of Global / Local
• As seen, references (pointers) may be remote
– easy to port shared-memory programs
• Global pointers are more expensive than local
– True even when data is on the same processor
– Use local declarations in critical sections
• Costs of global:
– space (processor number + memory address)
– dereference time (check to see if local)
• May declare references as local
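A hedged Titanium-style sketch of that advice (gArr and n are illustrative; the cast syntax is an assumption based on the local qualifier described here): assert locality once, then dereference through the local reference in the hot loop.

// Hedged sketch: gArr is a hypothetical global reference to an array
// that is known to live on this processor.
double [1d] local lArr = (double [1d] local) gArr;   // assert/cast to local once
double sum = 0.0;
for (int i = 0; i < n; i++) {
  sum += lArr[i];        // dereferences skip the global-pointer locality check
}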
Slide 30
Global Address Space
• Processes allocate locally
• References can be passed to
other processes
class C { int val; ... }
C gv;         // global pointer
C local lv;   // local pointer

[Figure: process 0 and the other processes each hold an lv and a gv;
after the broadcast, every gv refers to the object allocated on process 0]

if (Ti.thisProc() == 0) {
  lv = new C();
}
gv = broadcast lv from 0;
gv.val = ...;
... = gv.val;
Slide 31
Local Pointer Analysis
• Compiler can infer many uses of local
– “Local Qualification Inference” (Liblit’s work)
[Chart: "Effect of LQI" — running time (sec), original vs. after LQI, for the
applications cannon, lu, sample, gsrb, and poisson]
• Data structures must be well partitioned
Slide 32
Bulk Asynchronous I/O Performance
External sort benchmark on the NOW
• raf: random access file (Java)
• ds: unbuffered stream (Java)
• dsb: buffered stream (Java)
• bulkraf: bulk random access (Titanium)
• bulkds: bulk sequential (Titanium)
• async: asynchronous (Titanium)
[Chart: throughput (MB/sec, 0-1.4) vs. file size (MB, 0-60) for the six variants;
the Titanium bulk and asynchronous variants reach the highest throughput]
Slide 33
Performance Heterogeneity
• System performance limited by the weakest link
• Performance heterogeneity is the norm
– disks: inner vs. outer track (50%), fragmentation
– processors: load (1.5-5x)
• Virtual Streams: dynamically off-load I/O work from slower
disks to faster ones
[Chart: minimum per-process bandwidth (MB/sec, 0-6) vs. efficiency of a single
slow disk (100%, 67%, 39%, 29%), comparing Ideal, Virtual Streams, and Static]
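One way to picture the Virtual Streams idea (a generic sketch of dynamic I/O load balancing, not the actual implementation): reader threads pull fixed-size chunks from a shared work queue, so a fast disk naturally serves more chunks than a slow one; this assumes chunks are replicated so any disk can serve them.

// Hedged sketch: per-disk reader threads pull chunks from one shared queue.
// nChunks, nDisks, and readChunkFromThisDisk are hypothetical placeholders.
java.util.concurrent.BlockingQueue<Integer> chunks =
    new java.util.concurrent.LinkedBlockingQueue<>();
for (int c = 0; c < nChunks; c++) chunks.add(c);

Runnable reader = () -> {
  Integer chunk;
  while ((chunk = chunks.poll()) != null) {  // faster disks loop more often
    readChunkFromThisDisk(chunk);            // hypothetical helper
  }
};
for (int d = 0; d < nDisks; d++) new Thread(reader).start();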
Slide 34
Parallel performance on an SMP
• Speedup on Ultrasparc SMP
(shared memory multiprocessor)
• EM3D performance linear
– simple kernel
• AMR largely limited by
– problem size
– 2 levels, with top one serial
[Chart: speedup vs. number of processors (1, 2, 4, 8) for em3d and amr;
em3d scales nearly linearly, amr less so]
Slide 35
Parallel Performance on a NOW
• MLC for Finite Differences by Balls and Colella
• Poisson equation with infinite boundaries
– arises in astrophysics, some biological systems, etc.
• Method is scalable
– low communication
• Performance on
– SP2 (shown) and T3E
– scaled speedups
– nearly ideal (flat)
• Currently 2D and non-adaptive
[Chart: time per fine-patch iteration per processor vs. processors (1, 4, 16)
on the SP2, for problem sizes 129x129/65x65, 129x129/33x33, 257x257/129x129,
and 257x257/65x65; curves are nearly flat]
Slide 36
Performance on CLUMPs
• Clusters of SMPs (CLUMPs) have two levels of
communication
– BH at SDSC has 144 nodes, each with 8 processors
– 8th processor cannot be used effectively
[Chart: "GSRB performance with 700x700 patches" — time (s, 0-70) vs. number of
processes (0-35), for 1, 2, 4, 7, and 8 processes per node]
Slide 37
Cluster of SMPs
• Communication within a node is shared-memory
• Communication between nodes uses LAPI
– for large messages, a separate thread is created
by LAPI
– interferes with computation performance
[Chart: "Aggregate bandwidth with multiple processes" — bandwidth (MB/s, 0-50)
vs. data size (bytes, 0-70000), for 1, 2, 4, 7, and 8 processes per node]
Slide 38
Optimizing Parallel Programs
• Would like compiler to introduce asynchronous
communication, which is a form of possible reordering
• Hardware also reorders
– out-of-order execution
– write buffers with read bypass
– non-FIFO write buffers
• Software already reorders too
– register allocation
– any code motion
• System provides enforcement primitives
– volatile: not well-defined at the language level
– tend to be heavy weight, unpredictable
• Can the compiler hide all this?
Slide 39
Semantics: Sequential Consistency
• When compiling sequential programs, reordering
x = expr1;                      y = expr2;
y = expr2;          into        x = expr1;
is valid if y is not in expr1 and x is not in expr2 (roughly)
• When compiling parallel code, that test is not sufficient.
Initially flag = data = 0
Proc A                          Proc B
data = 1;                       while (flag != 1);
flag = 1;                       ... = ...data...;
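Written out in SPMD/Titanium style (a sketch; data, flag, and result stand for ordinary shared variables), the hazard is:

// Hedged sketch of the flag/data example.
// If the compiler or hardware reorders the two writes (or the two reads),
// proc B can observe flag == 1 while still reading the old value of data.
if (Ti.thisProc() == 0) {          // "Proc A"
  data = 1;                        // must happen before the flag write
  flag = 1;
} else {                           // "Proc B"
  while (flag != 1) { /* spin */ }
  int result = data;               // must see data == 1 under sequential consistency
}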
Slide 40
Cycle Detection: Dependence
Analog
• Processors define a “program order” on accesses from
the same thread
– P is the union of these total orders
• The memory system defines an “access order” on accesses
to the same variable
– A is the access order (read/write & write/write pairs)
Proc A                          Proc B
write data                      read flag
write flag                      read data
• A violation of sequential consistency is a cycle in P ∪ A.
• Intuition: time cannot flow backwards.
Slide 41
Cycle Detection
• Generalizes to arbitrary numbers of variables and
processors
[Figure: a longer cycle of conflicting accesses (write x, read y, write y,
read y, write x) spanning several processors]
• Cycles may be arbitrarily long, but it is sufficient
to consider only cycles with 1 or 2 consecutive stops
per processor [Shasha & Snir]
Slide 42
Static Analysis for Cycle
Detection
• Approximate P by the control flow graph
• Approximate A by undirected “dependence” edges
• Let the “delay set” D be all edges from P that
are part of a minimal cycle
[Figure: accesses on three processors (write z / read x, write y / read y,
read x / write z) with program-order and conflict edges; the edges on minimal
cycles form the delay set D]
• The execution order of D edges must be
preserved; other P edges may be reordered
(modulo the usual rules about serial code)
• Synchronization analysis is also critical
[Krishnamurthy]
Slide 43
Automatic Communication
Optimization
• Implemented in a subset of C with limited pointers
• Experiments on the NOW; 3 synchronization styles
• Future: pointer analysis and optimizations
[Chart: normalized execution time for the three synchronization styles]
Slide 44
Specific Projects
• ISTORE
– A reliable, scalable, maintainable storage system
• Data-intensive applications for “backend” servers
– Modeling the real world
– Storing and finding information
• Titanium
– A high level language (Java) with high
performance
– A domain-specific language and optimizing
compiler
• Sparsity
– Optimization using partial program input
Slide 45
Sparsity: Sparse Matrix Optimizer
• Several data mining or web search algorithms use
sparse matrix-vector multiplication
– used for documents, images, video, etc.
– irregular, indirect memory patterns perform poorly
on memory hierarchies
• Performance improvements possible, but depend on:
– sparsity structure, e.g., keywords within documents
– machine parameters that are hard to capture in analytical models
• Good news:
– operation repeated many times on similar matrix
– Sparsity: automatic code generator based on
matrix structure and machine
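A hedged sketch of the flavor of code Sparsity generates (the real generator and its heuristics differ): the generic CSR loop shown earlier is specialized into a 2x1 register-blocked kernel, so each indirect load of x is reused across two rows that are kept in registers.

// Hedged sketch: 2x1 register-blocked SpMV, the kind of kernel Sparsity emits.
// Rows are processed in pairs; each stored block holds two values sharing one column index.
static void spmv2x1(int nBlockRows, int[] rowPtr, int[] colIdx, double[] val,
                    double[] x, double[] y) {
  for (int ib = 0; ib < nBlockRows; ib++) {
    double y0 = 0.0, y1 = 0.0;               // kept in registers across the row pair
    for (int k = rowPtr[ib]; k < rowPtr[ib+1]; k++) {
      double xj = x[colIdx[k]];              // one indirect load reused twice
      y0 += val[2*k]   * xj;
      y1 += val[2*k+1] * xj;
    }
    y[2*ib]   = y0;
    y[2*ib+1] = y1;
  }
}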
Slide 46
Sparsity: Sparse Matrix Optimizer
Slide 47
Summary
• Future
– small devices + larger servers
– reliability increasingly important
• Reliability techniques include
– hardware: redundancy, monitoring
– software: better languages, many others
• Performance trades off against safety in languages
– use of domain-specific features (e.g., Titanium)
Slide 48
Backup Slides
Slide 49
The Big Motivators for
Programming Systems Research
• Ease of Programming
– Hardware costs -> 0
– Software costs -> infinity
• Correctness
– Increasing reliance on software increases cost of
software errors (medical, financial, etc.)
• Performance
– Increasing machine complexity
– New languages and applications
» Enabling Java; network packet filters
Slide 50
The Real Scalability Problems: AME
• Availability
– systems should continue to meet quality of service
goals despite hardware and software failures and
extreme load
• Maintainability
– systems should require only minimal ongoing human
administration, regardless of scale or complexity
• Evolutionary Growth
– systems should evolve gracefully in terms of
performance, maintainability, and availability as
they are grown/upgraded/expanded
• These are problems at today’s scales, and will only
get worse as systems grow
Slide 51
Research Principles
• Redundancy everywhere, no single point of failure
• Performance secondary to AME
– Performance robustness over peak performance
– Dedicate resources to AME
» biological systems use > 50% of resources on maintenance
– Optimizations viewed as AME-enablers
» e.g., use of (slower) safe languages like Java with static
and dynamic optimizations
• Introspection
– reactive techniques to detect and adapt to failures,
workload variations, and system evolution
– proactive techniques to anticipate and avert
problems before they happen
Slide 52
Outline
• Motivation
• Hardware Techniques
– general techniques
– ISTORE projects
• Software Techniques
• Availability Benchmarks
• Conclusions
Slide 53
Hardware techniques
• Fully shared-nothing cluster organization
– truly scalable architecture, automatic redundancy
– tolerates partial hardware failure
• No Central Processor Unit:
distribute processing with storage
– Most storage servers limited by speed of CPUs;
why does this make sense?
– Amortize sheet metal, power, cooling infrastructure
for disk to add processor, memory, and network
• On-demand network partitioning/isolation
– Applications must tolerate these anyway
– Allows testing, repair of online system
Slide 54
Hardware techniques
• Heavily instrumented hardware
– sensors for temp, vibration, humidity, power
• Independent diagnostic processor on each node
– remote control of power, console, boot code
– collects, stores, processes environmental data
– connected via independent network
• Built-in fault injection capabilities
– Used for proactive hardware introspection
» automated detection of flaky components
» controlled testing of error-recovery mechanisms
– Important for AME benchmarking
Slide 55
ISTORE-2 Hardware Proposal
• Smaller disks
– replace 3.5” disks with 2.5” or 1” drives
» 340MB available now in 1”, 1 GB next year (?)
• Smaller, more highly integrated processors
– E.g., Transmeta Crusoe includes processor and
Northbridge (interface) functionality in 1 Watt
– Xilinx FPGA for Southbridge, diagnostic proc, etc.
• Larger scale
– Roughly 1000 nodes, depending on support
» ISTORE-1 built with donated disks, memory, processors
» Paid for network, board design, enclosures (discounted)
Slide 56
Outline
• Motivation
• Hardware Techniques
• Software Techniques
– general techniques
– Titanium: a high performance Java dialect
– Sparsity: using dynamic information
– Virtual streams: performance robustness
• Availability Benchmarks
• Conclusions
Slide 57
Software techniques
• Fault tolerant data structures
– Application controls replication, checkpointing, and
consistency policy
– Self-scrubbing used to identify software errors
that have corrupted application state
• Encourage use of safe languages
– Type safety and automatic memory management
avoid a host of application errors
– Use of static and dynamic information to meet
performance needs
• Runtime adaptation to performance heterogeneity
– e.g., outer vs. inner track (1.5X), fragmentation
– Evolution of systems adds to this problem
Slide 58
Software Techniques
• Reactive introspection
– Use statistical techniques to identify normal
behavior and detect deviations from it
» e.g., network activity, response time, program counter (?)
– Semi-automatic response to abnormal behavior
» initially, rely on human administrator
» eventually, system learns to set response parameters
• Proactive introspection
– Continuous online self-testing
» in deployed systems!
» goal is to shake out bugs in failure response code on
isolated subset
» use of fault-injection and stress testing
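One simple instance of the statistical idea above (a sketch; the metric, warm-up window, and threshold are illustrative): keep a running mean and variance of, say, response time, and flag samples that fall far outside the normal range.

// Hedged sketch: online mean/variance (Welford) with a k-sigma anomaly flag.
class Detector {
  private long n = 0;
  private double mean = 0.0, m2 = 0.0;
  private final double k;                    // e.g., 3.0 for "3 sigma"
  Detector(double k) { this.k = k; }

  boolean isAnomalous(double sample) {
    boolean anomalous = n > 30 &&            // wait for a baseline before flagging
        Math.abs(sample - mean) > k * Math.sqrt(m2 / (n - 1));
    n++;                                     // fold the sample into the running model
    double delta = sample - mean;
    mean += delta / n;
    m2 += delta * (sample - mean);
    return anomalous;
  }
}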
Slide 59
Techniques for Safe Languages
Titanium: A high performance dialect of Java
• Scalable parallelism
– A global address space, but not shared memory
– For tightly-coupled applications, e.g., mining
– Safe, region-based memory management
• Scalar performance enhancements, some specific
to application domain
– immutable classes (avoids indirection)
– multidimensional arrays with subarrays
• Application domains
– scientific computing on grids
» typically +/-20% of C++/F in this domain
– data mining in progress
Slide 60
Use of Static Information
• Titanium compiler performs parallel optimizations
– communication overlap (40%) and aggregation
• Uses two new analyses
– synchronization analysis: the parallel analog to
control flow analysis
» identifies code segments that may execute in
parallel
– shared variable analysis: the parallel analog to
dependence analysis
» recognize when reordering can be observed by
another processor
» necessary for any code motion or use of relaxed
memory models in hardware => missed or illegal
optimizations
Slide 61
Conclusions
• Two key applications domains
– Storage: loosely coupled
– Search: tightly coupled, computation important
• Key challenges to future servers are:
– Availability, Maintainability, and Evolutionary growth
• Use of self-monitoring to satisfy AME goals
– Proactive and reactive techniques
• Use of static techniques for high performance and
reliable software
– Titanium extension of Java
• Use of dynamic information for performance robustness
– Sparsity and Virtual Streams
• Availability benchmarks a powerful tool?
Slide 62
Projects and Participants
ISTORE: iram.cs.berkeley.edu/istore
With James Beck, Aaron Brown, Daniel Hettena,
David Oppenheimer, Randi Thomas, Noah Treuhaft,
David Patterson, John Kubiatowicz
Titanium: www.cs.berkeley.edu/projects/titanium
With Greg Balls, Dan Bonachea, David Gay, Ben Liblit,
Chang-Sun Lin, Peter McCorquodale,
Carleton Miyamoto, Geoff Pike, Alex Aiken,
Phil Colella, Susan Graham, Paul Hilfinger
Sparsity: www.cs.berkeley.edu/~ejim/sparsity
With Eun-Jin Im
Slide 63
History of Programming Language Research
[Timeline figure spanning the 70s, 80s, 90s, and 2K: general-purpose language
design, parsing theory, type systems theory, program verification, and garbage
collection in the earlier decades; domain-specific language design, flop
optimization, memory optimizations, data and control analysis, type-based
analysis, threads, and program checking tools in the later ones]
Slide 64