Transcript of Slides - Dan Davis

DS-RT 2010
DISTRIBUTED AND INTERACTIVE
SIMULATIONS AT LARGE SCALE FOR
TRANSCONTINENTAL EXPERIMENTATION
19 October 2010
Dan M. Davis
For Gottschalk, Yao
Lucas, & Wagenbreth
[email protected]
(310)448-8434
Approved for public release; distribution is unlimited.
Co-Authors
Dr. Thomas D. Gottschalk
Center for Advanced Computing Research
California Institute of Technology
Marina del Rey, California 91125
[email protected]
Prof. Robert F. Lucas,
Dr. Ke-Thia Yao and
Gene Wagenbreth
Information Sciences Institute
University of Southern California
Marina del Rey, California 90292
{ rflucas, kyao, genew } @isi.edu
DS-RT 2010
Overview
DS-RT 2010
Needs of user community at JFCOM
Three Technologies
General Purpose GPUs Implementation
High Bandwidth Studies with Interest Management
Distributed Data Management
Lessons Learned
Theses
DS-RT 2010
Transcontinentally Distributed Simulations require
advanced technologies and innovative
paradigmatic approaches
Academic Researchers often have such capabilities
that are typically not yet in the journeyman
programmers’ toolbox
GPGPUs provide usable acceleration, but conservative
assessments of potential speed-up are warranted
Transcontinental High Performance Communications
(“10 Gig”) are possible with Interest Management
With distributed systems and a data glut from High
Performance Computing simulations, new
approaches to Data Management are required
JFCOM as a Distributed
Simulation User
DS-RT 2010
U.S. Joint Forces Command, Norfolk, Virginia
One of DoD’s combatant commands
Key role in transforming defense capabilities
Two JFCOM Directorates use agent-based simulations:
J7 - Training
trains forces
develops doctrine
leads training requirements analysis
provides an interoperable training environment
J9 - Concept Development and Experimentation
develops innovative joint concepts and capabilities
provides proven solutions to problems facing the joint force
Simulations are typically GenSer Secret and characterized by:
Interactive use by hundreds of personnel
Distributed trans-continentally, but must be real time
Vast majority of users at the terminals are uniformed warfighters
Plan View Display
DS-RT 2010
3D Stealth View
DS-RT 2010
Simulation Federates
DS-RT 2010
Agent-based models use rules for entity behavior
Autonomous-agent entities
Can be Human-In-The-Loop (HITL) and run in real time
Large compute clusters required to run large-scale simulations
Standard interface is HLA RTI communication (IEEE 1516)
Supplanted the older DIS (Distributed Interactive Simulation)
Publish/Subscribe model
USC/Caltech Software Routers scale better
Common Codes in use at JFCOM:
Joint Semi-Automated Forces (JSAF)
“Culture”, a stripped-down civilian instantiation of JSAF
Simulating Location & Attack of Mobile Enemy Missiles (SLAMEM)
OneSAF
Computing Infrastructure
DS-RT 2010
Koa and Glenn Deployed,
spring ’04
MHPCC & ASC-MSRC
2 Linux Clusters, 32 bit
256 nodes, 512 cores each
Joshua deployed 2008
GPGPU-enhanced
256 nodes, 1024 cores, 64 bit
DREN Connectivity
Users in VA, CA & Army bases
Application tolerates network
latency
Real-time interactive
supercomputing
Joshua GPGPU-Enhanced
Cluster Configuration
DS-RT 2010
256 Nodes
Nodes - (2) AMD Santa Rosa 2220 2.8 GHz dual-core
processors, 1024 cores total
GPUs - (1) NVIDIA 8800 Video Card
Node Chassis - 4U chassis
Memory - 16 GB DIMM DDR2 667 per node
GigE Inter-node Communications
Delivery to:
Joint Advanced Tactics and Training Laboratory
(JATTL) in Suffolk, VA
Perspective:
Entity Growth vs. Time
DS-RT 2010
[Chart: Number and complexity of JSAF entities over time - SCALE and FIDELITY.
Experiments continue to require orders-of-magnitude larger and more complex battlespaces.
Milestones: UE 98-1 (1997), SAF Express (1997), J9901 (1999), AO-00 (2000),
JSAF/SPP Urban Resolve (2004), JSAF/SPP Tests (2004), JSAF/SPP Capability (2006),
JSAF/SPP Joshua (2008).
Entity counts: 3,600; 12,000; 50,000; 107,000; 250,000; 1,000,000; 1,400,000
(axis scale to 10,000,000).
Platforms: DARPA/Caltech SPP proof of principle; DC clusters at MHPCC & ASC-MSRC;
DHPI GPU-enhanced cluster.]
Why GPUs?
DS-RT 2010
GPU performance can be 100X that of the host CPUs
Don't forget Gene Amdahl's Law; 2-3X overall is typical (see the sketch at the end of this slide)
This speed-up is expected to grow
Early SAF work (UNC, SAIC, USC):
Line of Sight
Route Finding
Collision Detection
Sparse Matrix Factorization
ISI verified similar bottlenecks in JSAF
New ideas for use in scenario generation for
new multi-spectral sensors
Route Planning
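As a rough illustration of why kernels that run 100X faster yield only 2-3X overall, Amdahl's Law bounds the whole-application speed-up by the fraction of run time that is actually accelerated; the numbers below are purely illustrative, not measured JSAF figures:

```latex
% Amdahl's Law: overall speed-up when a fraction p of the run time
% is accelerated by a factor s (illustrative values only).
\[
  S_{\text{overall}} = \frac{1}{(1 - p) + \frac{p}{s}}
\]
% Example: if the accelerated portion is p = 0.6 of the run time and the
% GPU gives s = 100 on that portion, then
\[
  S_{\text{overall}} = \frac{1}{0.4 + 0.6/100} \approx 2.5
\]
```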
Performance Impact
DS-RT 2010
Time Spent in Route Planning is a Critical Bottleneck
Benefits of GPGPU
Computing
DS-RT 2010
Joshua has provided many benefits; some are not easily quantified
Training, analysis or evaluation in cities otherwise off-limits due to:
security issues
public resistance to combat troops in their city
diplomatic sensitivities about U.S. interest in cities of potential conflict
Joshua does save personnel costs, e.g. Army Division costs ~ $20M per day.
DHPI cluster can run such a program using only ~100 technicians
Cost saving may be ~$19.5M each day.
Good visibility with the leadership elite:
Congressional visits
A Lieutenant General noted that it was probably the only time in his
career he would have an opportunity to command so large a unit
1,500 soldiers across the country participated, all connected
Transcontinental Network
High Bandwidth Research
DS-RT 2010
The issue here was the potential exploitation of high-bandwidth
nets, e.g. "10 Gig" (10 Gigabit per second) networks
The nodes of this WAN were located at
ISI-East in Virginia
University of Illinois at Chicago
ISI-West in California
Previous work indicated the utility of interest managed
communications on
cluster meshes
high-bandwidth Local Area Networks (LANs)
lower bandwidth WANs
Interest-limited message exchange was done using
Caltech’s MeshRouter formalism
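As a hedged sketch of the idea only (not Caltech's MeshRouter code or the RTI-s API), interest-limited exchange can be pictured as a router that forwards each published message solely to subscribers whose declared interest matches the message's interest state:

```python
# Minimal sketch of interest-managed routing (illustrative; not the MeshRouter
# or RTI-s implementation). The router forwards each message only to
# subscribers that declared interest in that message's interest state.
from collections import defaultdict

class InterestRouter:
    def __init__(self):
        # interest state -> set of subscriber delivery callbacks
        self.subscriptions = defaultdict(set)

    def subscribe(self, interest_state, deliver):
        """Register a subscriber callback for one interest state."""
        self.subscriptions[interest_state].add(deliver)

    def publish(self, interest_state, payload):
        """Forward a message only to interested subscribers."""
        for deliver in self.subscriptions.get(interest_state, ()):
            deliver(interest_state, payload)

# Usage: two subscribers with different interests; only matching traffic flows.
router = InterestRouter()
router.subscribe("sector-7/ground", lambda s, p: print("sub A got", s, p))
router.subscribe("sector-9/air",    lambda s, p: print("sub B got", s, p))
router.publish("sector-7/ground", b"entity state update")   # reaches sub A only
router.publish("sector-3/sea",    b"unsubscribed traffic")  # dropped by router
```

In the real system the interest states are far richer and the routers are arranged as in the diagram on the next slide; the point here is only that traffic a node has not subscribed to never reaches it.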
Interest Managed
Software Router Diagram
DS-RT 2010
Transcontinental Testbed for
“10 Gig” Proof of Concept
DS-RT 2010
Tests on Local ISI Cluster
DS-RT 2010
Prepared for the wide area tests
Ran a number of variations of various
configurations using the ISI-W cluster
Tests involved configurations with a
single router process on its own node
single publish process on its own node
multiple subscriber processes on either
Performance results for these configurations are
summarized on the next slide
Bandwidth numbers reflect only data rates into the
subscriber processes.
Bandwidth results for Various
Test Configurations
DS-RT 2010
Number of     Subscriber   Per Node        Total           Router
Subscribers   Nodes        Bandwidth       Bandwidth       Load
     1            1        680 Mbit/sec    690 Mbit/sec    41% Busy
     2            1        672 Mbit/sec    1.3 Gbit/sec    54% Busy
     4            1        504 Mbit/sec    2.0 Gbit/sec    59% Busy
     6            2        344 Mbit/sec    2.1 Gbit/sec    55% Busy
     8            2        288 Mbit/sec    2.3 Gbit/sec    51% Busy
    16            2        160 Mbit/sec    2.6 Gbit/sec    44% Busy
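(In each row, the total bandwidth is roughly the per-node bandwidth times the number of subscribers, and the router stays below about 60% busy throughout.)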
Benchmark Tests:
Two forms
DS-RT 2010
Application processes
Publish Processors, which send
messages of a specified length and interest state
at a nominal total publication rate (MByte/sec)
controlled by a data file
Subscribe Processors, which
receive messages for a specified interest state
collect messages from multiple publishers
measure actual incoming message rates
Routers direct individual messages from publishers to
subscribers according to the interest declarations
Router processes were instrumented to determine the
fraction of time spent on management.
Aggregate bandwidth versus
Number of Subscribers
DS-RT 2010
10 Gig WAN Test Results
DS-RT 2010
WAN Tests Were then Performed
This entailed a great deal of trouble-shooting various
routers along the WAN
A number of variants of the basic configuration were
explored:
the number of distinct interest states
the number of processes associated with a single router
the number of replicas of the basic “Router plus Associated
Pub/Sub” nodes at each site
Typical performance numbers are given for a test with eight
participating nodes at each of the UIC and ISI sites
Results are summarized on the next slide
Performance Measures for
Typical WAN Test
DS-RT 2010
Message      Client BW      Single MR BW    Aggregate BW
Length       (bytes/sec)    (bytes/sec)     (bits/sec)
0.4 KByte        3.2 M          16.0 M          1.0 G
0.8 KByte        6.4 M          32.0 M          2.1 G
1.6 KByte       12.8 M          64.0 M          4.1 G
2 KByte         14.3 M          71.5 M          4.6 G
100 KByte        0.8 M           4.0 M          0.3 G
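(The aggregate column is roughly the single-MeshRouter rate times eight MeshRouter instances, converted from bytes to bits: e.g. 71.5 MByte/sec x 8 x 8 bits/byte ≈ 4.6 Gbit/sec.)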
Final Results ~ 5 GigaBits
Over a “10 Gig” Line
DS-RT 2010
Aggregate throughput is noticeably poorer for:
Individual message sizes that are too small
Individual message sizes that are too large
This is consistent with experience on a single SPP
Constraints largely due to the nature of the RTI-s
communications primitives
As long as messages are near the optimal size, the aggregate
bandwidth for the WAN test is ~4.6 Gbit/sec
WAN tests with varying configurations give similar results:
Max Total WAN BW = 4.6 – 4.9 Gbit/sec
Data Requirements
DS-RT 2010
Larger Scale
Global scale vs. theatre
Higher Fidelity
Greater complexity in models of entities
(sensors, people, vehicles, etc.)
Urban Operations
Millions of civilians
All of the above produce dramatic
increases in data relative to the
previously experienced events.
Terrain Data is Large, but NOT a
Significant Data Issue
DS-RT 2010
Growth and the Impending
Data Armageddon
DS-RT 2010
JFCOM had immediate need for more entities (>10X)
Limited Memory on Nodes and in Tesla
Tertiary Storage very limited
TeraByte a week with existing practice
Keeping only 20% of current data
Need 10X more entities
Need 10X behavior improvement
Net growth needed: almost three orders of magnitude
Now doing only face validation
Need a more quantitative, statistical
approach (future work at Caltech – Dr. Thomas Gottschalk)
Data mining efforts now commencing
Two Key Challenges
DS-RT 2010
Collect the “fire hoses” of data generated by large-scale, distributed, sensor-rich environments
Without interfering with communication
Without interfering with simulator performance
Maximally exploit the collected data efficiently
Without overwhelming users
Without losing critical content
Goal:
A unified distributed logging/analysis
infrastructure that helps the users
without burdening the computing/networking
managers or unduly taxing the network
Limitation of the First System:
Did not scale!
Two separate data analysis systems
One for near-real time during the event
Another one for post event processing
For near-real time
Too much data to access over
wide-area network without crashing
For post event processing
1-2 weeks to stage data to
centralized data store
Discards Green entities (80% of the data)
DS-RT 2010
Handling Dynamic Data
DS-RT 2010
Data is NOT static during runs, but users need to access it while the run is in progress
Logger continuously inserts new data from the simulation
Need distributed query to combine remote data sources
Distributed logger inserts data into SDG data store at each site
Problems
Local caches become invalid with respect to new inserts
Cannot preposition data to optimize queries
ISI Strategy: explore trade-offs
Compute on demand for better efficiency
Compute on insert for faster queries
Variable fidelity: periodic updates
Dynamic pre-computation: detect frequent queries
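A hypothetical sketch of the compute-on-insert versus compute-on-demand trade-off named above (illustrative class and field names, not the SDG implementation):

```python
# Illustrative sketch of the compute-on-insert vs. compute-on-demand trade-off
# (hypothetical; not the SDG code). Both keep a count of reports per sensor
# type, but pay the aggregation cost at different times.
class ComputeOnInsert:
    """Update the aggregate as each record arrives: costlier inserts, instant queries."""
    def __init__(self):
        self.counts = {}

    def insert(self, record):
        key = record["sensor_type"]
        self.counts[key] = self.counts.get(key, 0) + 1

    def query(self):
        return dict(self.counts)

class ComputeOnDemand:
    """Just log records on insert; pay the aggregation cost only when queried."""
    def __init__(self):
        self.records = []

    def insert(self, record):
        self.records.append(record)

    def query(self):
        counts = {}
        for r in self.records:
            counts[r["sensor_type"]] = counts.get(r["sensor_type"], 0) + 1
        return counts
```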
Handling Distributed Data
DS-RT 2010
Analyze data in place
Data is generated on distributed nodes
Leave data where it is generated
Distribute data access so data appears to be at a single site
Take advantage of HPC hardware capabilities
Large capacity data storage
High bandwidth network
Data archival
Exploit JSAF query characteristics
Limited number of joins
Counting/aggregation type queries
Data products are several orders of magnitude
smaller than the raw data
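A minimal sketch of the "analyze data in place" approach, assuming a hypothetical per-site logger interface (not the actual SDG API): counting/aggregation queries run at each site, and only the small partial results cross the WAN:

```python
# Illustrative distributed counting query (hypothetical interface, not the SDG
# API): each site aggregates its locally logged contact reports, and only the
# small per-site summaries, not the raw data, are shipped and merged.
from collections import Counter

def local_count(site_records, key):
    """Runs at one site, over data left where it was generated."""
    return Counter(rec[key] for rec in site_records)

def global_count(per_site_results):
    """Merge the per-site partial counts into one scoreboard-style total."""
    total = Counter()
    for partial in per_site_results:
        total += partial
    return total

# Usage with toy data standing in for two sites' loggers:
site_a = [{"target_class": "tank"}, {"target_class": "truck"}]
site_b = [{"target_class": "tank"}]
partials = [local_count(site_a, "target_class"),
            local_count(site_b, "target_class")]
print(global_count(partials))   # Counter({'tank': 2, 'truck': 1})
```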
Notional Diagram of
Scalable Data Grid
DS-RT 2010
Multidimensional Analysis:
Sensor/Target Scoreboard
Summarizes sensor contact reports
Positive and negative sensor
detections
Displays two dimensional views
of the data
Provides three levels of drill-down
Sensor platform type
vs. target class
Sensor platforms vs. sensor modes
List of contact reports
DS-RT 2010
Multidimensional Analysis
DS-RT 2010
Raw data has other dimensions of potential interest
Detection status
Time, location
Terrain type, entity density
Weather condition
Each dimension can be aggregated at multiple
levels
Location: country, grid square
Time: minutes, hours, days
Collapse and expand multiple dimensions
(e.g. sensor, target) for viewing
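A minimal sketch of the collapse/expand idea using pandas group-bys; the column names are hypothetical, not the JSAF schema:

```python
# Illustrative multidimensional roll-up of contact reports (hypothetical
# column names). Each drill-down level is a group-by over a different
# subset of dimensions.
import pandas as pd

reports = pd.DataFrame([
    {"sensor_type": "UAV",   "sensor_mode": "EO",   "target_class": "tank",  "detected": True},
    {"sensor_type": "UAV",   "sensor_mode": "IR",   "target_class": "truck", "detected": False},
    {"sensor_type": "radar", "sensor_mode": "GMTI", "target_class": "tank",  "detected": True},
])

# Level 1: sensor platform type vs. target class (the scoreboard view).
level1 = reports.groupby(["sensor_type", "target_class"])["detected"].agg(["count", "sum"])

# Level 2: expand the sensor dimension to individual sensor modes.
level2 = reports.groupby(["sensor_type", "sensor_mode", "target_class"])["detected"].count()

# Level 3: the underlying list of contact reports for one scoreboard cell.
cell = reports[(reports.sensor_type == "UAV") & (reports.target_class == "tank")]
print(level1, level2, cell, sep="\n\n")
```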
Multidimensional
Analysis
DS-RT 2010
Dimension Lattice over five dimensions:
t : sensor platform type
p : sensor platform
c : target class
m : sensor mode
o : target object
* : any
Sensor/Target Scoreboard drill-downs in the context of
multidimensional analysis
Data classified along 3 dimensions
Drill-down to 3 nodes in the dimension lattice
[Lattice diagram: nodes *, t, p, c, m, o, tc, pc, to, po, tm, pm, cm, om, tcm, pcm, tom, pom]
Cube Dimension Editor
DS-RT 2010
SDG:
Broader Applicability?
DS-RT 2010
Scalable Data Grid: a distributed data management
application/middleware that effectively:
Collects and stores high volumes of data at very high data rates
from geographically distributed sources
Accesses, queries and analyzes the distributed data
Utilizes the distributed computing resources on HPC
Provides a multidimensional framework for
viewing the data
Potential application areas
Large scale distributed simulations
Instrumented live training exercises
High volume instrumented physics research
Virtually any distributed data environment using HPC
K-Means Testing
DS-RT 2010
K-Means is a popular data mining algorithm
The K-Means algorithm requires three inputs:
- an integer k to indicate the number of desired clusters
- a distance function over the data instances
- the set of n data instances to be clustered.
Typically, a data instance is represented as a vector
The output of the algorithm is a set of k points
representing the means of the k clusters
Each of the n data instances is assigned to the nearest
cluster mean based on the distance function
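A compact sketch of the algorithm exactly as described above, in plain Python/NumPy (illustrative code, not the implementation used in the SDG tests):

```python
# Minimal K-Means sketch matching the description above. Inputs: k, a distance
# function (Euclidean here), and n data instances represented as vectors.
import numpy as np

def kmeans(data, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Start from k randomly chosen data instances as the initial means.
    means = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(iters):
        # Assign each instance to the nearest cluster mean.
        dists = np.linalg.norm(data[:, None, :] - means[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each mean from its assigned instances.
        new_means = np.array([
            data[labels == j].mean(axis=0) if np.any(labels == j) else means[j]
            for j in range(k)
        ])
        if np.allclose(new_means, means):
            break
        means = new_means
    return means, labels

# Toy usage: three well-separated blobs recover three cluster means.
rng = np.random.default_rng(1)
pts = np.vstack([rng.normal(c, 0.5, size=(50, 2))
                 for c in ([0, 0], [5, 5], [-5, 3])])
centers, assignments = kmeans(pts, k=3)
```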
K-Means Clustering of
Three Data Points
DS-RT 2010
[Scatter plot: "Clustering of 3 Clusters"; x axis roughly -6,000 to 2,000, y axis roughly -2,000 to 5,000]
Conclusions
DS-RT 2010
Large-scale Distributed Simulations demand more power
Technology can deliver that power
This paper set out three emerging technologies
GPGPU Acceleration of Linux Clusters
High Bandwidth Interest Managed Communications
Distributed Data Management
This work shows Simulation Researchers’ new abilities
to generate simulations
to more effectively move the data around
to more efficiently store the data
to better analyze the data
Researchers would benefit from understanding the technologies’
power
limitations
costs
Research Funded by
JFCOM and AFRL
DS-RT 2010
This material is based on research sponsored by the
U.S. Joint Forces Command via a contract with the
Lockheed Martin Corporation and SimIS, Inc., and on
research sponsored by the Air Force Research
Laboratory under agreement numbers F30602-02-C-0213
and FA8750-05-2-0204. The U.S. Government is
authorized to reproduce and distribute reprints for
Governmental purposes notwithstanding any copyright
notation thereon. The views and conclusions contained
herein are those of the authors and should not be
interpreted as necessarily representing the official
policies or endorsements, either expressed or implied,
of the U.S. Government. Approved for public release;
distribution is unlimited.