IPD Fall 2005 IPU Summary and Report Out

Transcript: IPD Fall 2005 IPU Summary and Report Out

Designing and Optimizing Software for
Intel® Architecture Multi-core Processors
Peter van der Veen
QNX Software Systems
Overview

• Software and system vendors continue to add features and capabilities that demand more and more CPU performance
• Microprocessor vendors can no longer scale performance simply by increasing clock speed
  ► Thermal considerations
  ► Design complexity
• Trend to include multiple processor cores on a single die
• Multi-core designs address performance issues
  ► Favorable power / performance ratio for embedded systems
  ► Decreased board area
• Companies that can leverage the full capabilities of the hardware can achieve a competitive advantage

[Diagram: system topologies ranging from a single CPU with a bridge, to multiple discrete CPUs, to multiple CPU cores sharing one bridge]
Multi-core Architectures

• Increased integration on the die
  ► Multiple CPU cores and caches
  ► High-speed, on-chip system interconnect
    - Greatly reduces the latency associated with a traditional board-level interconnect
• Memory controller(s) on the system bus
  ► Allows separation of memory for asymmetric operation
• On-chip peripherals on the system bus
  ► Maximizes peripheral throughput
  ► Reduces latency

[Diagram: a single die containing two CPUs with caches, a system interconnect, I/O blocks, and a memory controller]
Intel Evolution of Parallelism

[Diagram: five configurations. Classic uniprocessor: one architectural state (AS), APIC, and set of processor execution resources (PER) on one die, attached to a chipset. Classic SMP: two such processors sharing a chipset. Hyper-Threading Technology* (HT Technology): two architectural states sharing one set of execution resources on one die. Multi-core: two complete cores sharing an L2 cache and bus interface on one die. Symmetrical multi-processing (SMP) with multi-core: multiple multi-core processors sharing a chipset.]

Legend:
• AS: Architectural State - registers, flags, timestamp counter, etc.
• APIC: Advanced Programmable Interrupt Controller
• PER: Processor Execution Resources - caches, execution units, instruction decode, bus interface, etc.

All of these forms of parallelism are in use today.

* Hyper-Threading Technology (HT Technology) requires a computer system with an Intel® Processor supporting HT Technology and an HT Technology enabled chipset, BIOS, and operating system. Performance will vary depending on the specific hardware and software you use. See www.intel.com/products/ht/hyperthreading_more.html for more information including details on which processors support HT Technology.
QNX and Multi-core

• QNX has done the heavy lifting to enable migration to multi-core
  ► Lets developers focus on product differentiation
• Reliable, proven support for multi-core applications
  ► 1997: Industry's first to bring SMP to embedded
  ► 1984: High-performance, transparent distributed messaging
  ► Full support for asymmetric and symmetric multiprocessing
  ► Linux, VxWorks interoperability
• Migrate the existing software base and enable new multi-core optimized applications
• Multi-core capable tool suite
• World-class professional services and expert training
• Active role in developing standards through the Multi-core Exchange consortium
  ► Enable portability of applications across various platforms
  ► Derive a common set of APIs that multi-core development tools can utilize to support interoperability
Microkernel Architecture

• The microkernel is the only trusted component
• Applications and drivers:
  ► Are processes that plug into a message bus (see the message-passing sketch below)
  ► Reside in memory-protected address spaces
  ► Cannot corrupt other software components
  ► Can be started, stopped, and upgraded on the fly

[Diagram: process manager, file system, application, audio driver, graphics driver, and protocol stack all attached to a message bus running over the microkernel]
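Since every application and driver communicates over this message bus, a small client/server exchange illustrates the model. This is a minimal sketch using the QNX Neutrino native message-passing calls (ChannelCreate, ConnectAttach, MsgSend, MsgReceive, MsgReply); the two-thread setup and the message contents are illustrative, not taken from the presentation.

/* Hypothetical client/server message-passing sketch for QNX Neutrino.
 * The channel setup and payload are illustrative. */
#include <stdio.h>
#include <string.h>
#include <pthread.h>
#include <sys/neutrino.h>

static int chid;                       /* channel the "server" listens on */

static void *server(void *arg)
{
    char msg[64], reply[64];

    /* Block until a client sends a message on our channel */
    int rcvid = MsgReceive(chid, msg, sizeof(msg), NULL);
    if (rcvid == -1) return NULL;

    snprintf(reply, sizeof(reply), "ack: %s", msg);
    MsgReply(rcvid, 0, reply, strlen(reply) + 1);   /* unblock the client */
    return NULL;
}

int main(void)
{
    pthread_t tid;
    char reply[64];

    chid = ChannelCreate(0);                        /* server endpoint */
    pthread_create(&tid, NULL, server, NULL);

    /* Client side: attach to the channel (pid 0 = this process) and send */
    int coid = ConnectAttach(0, 0, chid, _NTO_SIDE_CHANNEL, 0);
    MsgSend(coid, "hello", 6, reply, sizeof(reply)); /* blocks until reply */
    printf("server replied: %s\n", reply);

    ConnectDetach(coid);
    pthread_join(tid, NULL);
    return 0;
}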
Multiprocessing Models

• Asymmetric
  ► Two cores, two OSs
  ► Same (homogeneous) or different (heterogeneous) OSs
• Symmetric
  ► Two cores, one OS (a single OS instance spans both CPUs)
Asymmetric Processing

• Asymmetric Model Pros:
  ► Only possible mode when different OSs are running
  ► A CPU core can be dedicated to specific applications
  ► One possible mode for applications that cannot operate with parallel processing

• Asymmetric Model Cons:
  ► Resource sharing / arbitration needs to be designed into the system by developers
    - Neither OS "owns" the whole system
    - Memory, I/O, and interrupts are shared
    - Evolution: complexity increases as cores are added
    - Static configuration; difficult to add dynamic resourcing
    - Time to market?
    - Any hardware contention must be dealt with by the designer
  ► Synchronization between cores is done through application-level messages
    - Sub-optimal performance
    - Complexity of the problem is not linear
    - Addition of cores may require re-architecting the application to increase performance
  ► Managing shared resources complicates the design

[Diagram: two OSs, each with its own applications, CPU, and cache, sharing a system interconnect, I/O, and memory controller; memory is split into OS 1 memory, OS 2 memory, and shared memory]
Neutrino Homogeneous AMP
Transparent Distributed Processing

• Extends the message-passing bus over a transport layer
  ► Message bridge (Ethernet, RapidIO, shared memory)
• Applications / services can be built in a fully distributed manner without special code
  ► Message queues
  ► File systems
  ► Hardware ports
• Seamless sharing of I/O resources between cores (e.g., use a serial port "owned" by another core); a fuller sketch follows below

[Diagram: two microkernels (core 0 and core 1), each running its own applications and services (networking stack, flash file systems, message queues, database), joined by a message bridge over the message-passing bus]

Local vs. remote access to the same flash file system:
  fd = open("/dev/ffs1", …);
  open("/net/core0/dev/ffs1", …);
  write(fd, …);
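As the snippet above suggests, the same POSIX calls work whether a resource is local or owned by another core; only the pathname changes. The following is a minimal sketch of that idea; the node name (/net/core0) follows the slide's example and the device path is illustrative.

/* Sketch of transparent distributed access under QNX Neutrino: the same POSIX
 * calls are used whether a device is local or "owned" by another core; only
 * the pathname changes.  The node name (core0) and device are illustrative. */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    /* Serial port managed by the microkernel instance on core 0; requests
     * are forwarded over the message bridge (Ethernet, RapidIO, or shared
     * memory) with no special application code. */
    int fd = open("/net/core0/dev/ser1", O_WRONLY);
    if (fd == -1) {
        perror("open /net/core0/dev/ser1");
        return 1;
    }

    const char msg[] = "hello from core 1\r\n";
    write(fd, msg, sizeof(msg) - 1);   /* same call as for a local device */
    close(fd);
    return 0;
}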
Symmetric Processing

• Symmetric Model Pros:
  ► Highly scalable; supports multiple processing cores seamlessly without code modification
  ► One OS "sees all" and handles all resource sharing / arbitration issues
  ► Dynamic load balancing handles processing bursts with OS thread scheduling
  ► Dynamic memory allocation: all cores can draw on the full pool of available memory without penalty
  ► High-performance inter-core messaging and synchronization
    - Core-to-core synchronization using OS primitives
  ► System-wide statistics / information-gathering capability for performance optimization, debugging, etc.

• Symmetric Model Cons:
  ► Load balancing is dynamic, and an application may require a dedicated CPU
  ► Applications with poor synchronization among threads may not work properly
    - Difficult to change such software
    - 3rd-party software

[Diagram: a single OS running all applications across two CPUs, each with its own cache, sharing a system interconnect, I/O, memory controller, and memory]
Multi-core Scaling Software

• QNX conforms to the POSIX (Portable Operating System Interface) Application Programming Interface
  ► Allows straightforward porting of code from one conformant OS to another
• An application is broken down into memory-protected units called processes
• Processes are further divided into internal, schedulable units called threads (see the sketch below)
  ► Threads share all of the same resources (memory space included)
• PROCESSES run on individual cores concurrently in asymmetric mode (all threads of a process are tied to one core)
• THREADS run on individual cores concurrently in symmetric operation

[Diagram: an application made up of processes, each containing multiple threads]
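To make the process/thread distinction concrete, here is a minimal POSIX-threads sketch of one process spawning several threads; under SMP the OS is free to run them on different cores concurrently. The worker count and the work done are illustrative.

/* Minimal POSIX-threads sketch: one process, several threads.  Under SMP the
 * OS may schedule these threads on different cores at the same time. */
#include <stdio.h>
#include <pthread.h>

#define NUM_WORKERS 4   /* illustrative worker count */

static void *worker(void *arg)
{
    long id = (long)arg;
    /* All threads share the process's address space and resources. */
    printf("worker %ld running\n", id);
    return NULL;
}

int main(void)
{
    pthread_t tid[NUM_WORKERS];

    for (long i = 0; i < NUM_WORKERS; i++)
        pthread_create(&tid[i], NULL, worker, (void *)i);

    for (int i = 0; i < NUM_WORKERS; i++)
        pthread_join(tid[i], NULL);

    return 0;
}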
Active Threads and Ready Queues: SMP

[Diagram: a single set of ready queues, one per priority level (0-255), shared by all CPUs; each CPU (CPU 0, CPU 1) runs the highest-priority ready thread, while threads that cannot run wait in blocked states]
AMP or SMP?

• Sometimes this can be a clear-cut decision
  ► Two operating systems = AMP
  ► Application requires all available CPUs to maximize performance = SMP
• What if the versatility of SMP is desired but the control of AMP is needed?
QNX Bound Multiprocessing

• Benefits of both AMP and SMP
• Support the legacy code base and multi-core optimized applications simultaneously
  ► Supports bound and symmetric operation, selectable by process / thread
  ► Applications and/or threads can be "bound" to a specific core
• Designer has full control over applications
  ► Restrictive CPU usage as decided by the designer
• Load balancing
  ► OS-dynamic or designer-controlled
  ► Tools to optimize load balancing
• Single OS has full visibility and control
  ► Resource sharing handled by the OS, simplifying the design process
  ► System-wide statistics / information-gathering capability for performance optimization and debugging
• High performance
  ► Kernel support for message passing and thread synchronization
• The best of both worlds

[Diagram: applications A1-A5 running on one OS across two CPUs with caches, a system interconnect, I/O, a memory controller, and memory; some applications are bound to a specific CPU]
Active Threads and Ready Queues: BMP

• An available CPU runs the highest-priority thread designated for that CPU
• The user controls which CPU will run a process's threads; all threads in a process are tied to one CPU (see the binding sketch below)

[Diagram: a single scheduler and set of per-priority ready queues (0-255); CPU 0 runs threads of processes designated for CPU 0, CPU 1 runs threads of processes designated for CPU 1]
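A sketch of how a thread can be bound to one core under BMP, assuming the QNX Neutrino ThreadCtl() runmask control (_NTO_TCTL_RUNMASK); the particular mask value is illustrative.

/* Sketch of binding the calling thread to CPU 0, assuming the QNX Neutrino
 * ThreadCtl() runmask control; the mask value is illustrative. */
#include <stdio.h>
#include <stdint.h>
#include <sys/neutrino.h>

int main(void)
{
    /* Bit 0 set = allow this thread to run on CPU 0 only. */
    unsigned runmask = 0x1;

    if (ThreadCtl(_NTO_TCTL_RUNMASK, (void *)(uintptr_t)runmask) == -1) {
        perror("ThreadCtl(_NTO_TCTL_RUNMASK)");
        return 1;
    }

    /* From here on, the scheduler dispatches this thread to CPU 0 only;
     * other threads remain free to float across cores. */
    printf("bound to CPU 0\n");
    return 0;
}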
Multiprocessing Summary

Design Consideration                    Symmetric               Bound                   Asymmetric
Seamless resource sharing               Yes                     Yes                     No
Scalable beyond dual core               Yes                     Yes                     ?
Legacy application operation            ?                       Yes                     Yes
Mixed OS environment                    No                      No                      Yes
Dedicated processor by function         No                      Yes                     Yes
Inter-core messaging                    Fast (OS primitives)    Fast (OS primitives)    Slower (application)
Thread synchronization between cores    Yes                     Yes                     No
Load balancing                          Yes                     Yes                     No
System-wide debug and optimization      Yes                     Yes                     No
The Transition to Multi-core: The Role of Tools

The Role of Tools

• The right toolset eases the transition to multi-core processors
• Assess current software when moving to multi-core
  ► Should processes be separated between cores?
    - Determine how closely coupled the current processes are
  ► Where can concurrent processing help?
    - Show the current processing bottlenecks
• Debugging in a multi-core environment
  ► Characterize and debug interaction between threads on multiple CPUs
• Tuning and optimization in a multi-core environment
  ► Move processes and threads between cores
  ► Examine processing bottlenecks
  ► Examine inter-process communications
Instrumented Kernel

• The instrumented kernel logs events, which are filtered, stored into buffers, and then captured and analyzed
  ► Events: process/thread creation, system calls, interrupts, state changes
  ► Filters: on/off filters, static event filters, user-defined filters (see the sketch below for adding a user event)
  ► Event buffers are captured over the network or to a file and examined in the System Profiler

[Diagram: events flow from the microkernel through the filters into event buffers (E1…E6), then via network or file capture into the System Profiler]
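Applications can also add their own markers to the event stream so they show up in the captured log alongside kernel events. The sketch below assumes the TraceEvent() interface from <sys/trace.h>; the event number (_NTO_TRACE_USERFIRST) and the message text are illustrative.

/* Sketch of inserting an application marker into the instrumented kernel's
 * event log, assuming the TraceEvent() interface; values are illustrative. */
#include <stdio.h>
#include <sys/trace.h>

int main(void)
{
    /* Insert a user string event; it can then be located in the
     * System Profiler timeline next to the surrounding kernel events. */
    if (TraceEvent(_NTO_TRACE_INSERTUSRSTREVENT,
                   _NTO_TRACE_USERFIRST, "frame processing start") == -1) {
        perror("TraceEvent");
        return 1;
    }
    return 0;
}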
Thread / Process Coupling: QNX Momentics System Profiler

• Determine the amount of messaging between processes
Load Balancing: QNX Momentics System Profiler

• Measure CPU activity for all cores to determine optimal load balancing
Intel® C++ Compiler 8.1 for QNX Neutrino® RTOS

• Compiler based on classic Intel® C++ Compilers for desktop/server markets
  ► Leverages mature Intel compiler technology
  ► Leads the industry in supporting Intel Architecture's performance features and *T technologies
• Cross-compiler:
  ► From Windows to QNX Neutrino RTOS 6.3.0
• Superior performance (see benchmarks)
• Integrates into the QNX Momentics* Development Suite
• GCC C/C++ object compatibility and interoperability

Download a free 30-day evaluation: www.intel.com/software/products/compilers/qnx
EEMBC* 1.1 on the Intel® Pentium® 4 Processor
(Embedded Microprocessor Benchmark Consortium*)

[Chart: relative EEMBC 1.1 scores for the Intel C++ Compiler vs. GCC 3.3.1, averaging 28% and 26% better across the -O2 and advanced-optimization configurations]

Configuration info:
• Intel® C++ Compiler 8.1 for QNX Neutrino* RTOS, GCC 3.3.1
• Intel® Pentium® 4 Processor, 3.0 GHz, 512 KB L2 cache, 512 MB memory
• QNX Neutrino* RTOS 6.3
• EEMBC 1.1 scores were not certified by ECL. Out-of-the-box performance was measured. Relative performance was computed by averaging relative performance on the Automotive, Consumer, Networking, Office Automation, and Telecomm tests.

Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, see www.intel.com/software/products or call (U.S.) 1-800-628-8686 or 1-916-356-3104.
The Transition to Multi-core: Software Architecture and Optimization
Optimizing Multi-core Applications

• Reduce contention
  ► Minimize or remove core-to-core interactions to ensure maximum parallelism
  ► Ensure no serialization between competing tasks due to resource contention
• Scale to the number of available processors (see the sketch after this list)
• Use system analysis tools to tune performance
• Asymmetric operation
  ► Properly partition to produce the desired CPU loading for each core
• Symmetric operation
  ► Asymmetric application operation
  ► Thread affinity
  ► Bound Multiprocessing for dedicated CPU allocation
• Select proper thread / process priorities to optimize real-time performance / CPU allocation
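As noted in the list above, one simple way to scale to the number of available processors is to size the worker pool at run time. This sketch assumes sysconf(_SC_NPROCESSORS_ONLN) is available (a common extension rather than a strict POSIX requirement); the worker body is illustrative.

/* Sketch of sizing a worker pool to the number of online processors. */
#include <stdio.h>
#include <unistd.h>
#include <pthread.h>

static void *worker(void *arg)
{
    /* ... this worker's share of the overall workload ... */
    return NULL;
}

int main(void)
{
    long ncpus = sysconf(_SC_NPROCESSORS_ONLN);
    if (ncpus < 1)
        ncpus = 1;                        /* fall back to a single worker */

    pthread_t tid[ncpus];                 /* one worker per online CPU */
    for (long i = 0; i < ncpus; i++)
        pthread_create(&tid[i], NULL, worker, (void *)i);
    for (long i = 0; i < ncpus; i++)
        pthread_join(tid[i], NULL);

    printf("ran %ld workers\n", ncpus);
    return 0;
}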
Example: Layer 3 Forwarding Optimization

• Original implementation
  ► Driver threads on CPU 0 and CPU 1 share a single forwarding table
  ► Lock contention and cache misses in the forwarding table
  ► Serializes Rx / Tx operations

• Optimized: one table per CPU (see the sketch below)
  ► Each CPU's driver thread uses its own forwarding table
  ► No lock contention for the forwarding table
  ► Minimizes cache contention and snoop traffic
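A rough sketch of the "one table per CPU" idea in C: each driver thread consults its own copy of the forwarding table, so the hot lookup path needs no lock. The table layout, lookup, and update replication are illustrative, not the actual implementation described in the presentation.

/* Per-CPU forwarding tables: lookups touch only this core's copy, so there is
 * no lock contention and no cache-line ping-pong between cores. */
#include <stdint.h>

#define MAX_ROUTES 1024
#define MAX_CPUS   8

struct route {
    uint32_t prefix, mask, next_hop;
};

struct fwd_table {
    struct route routes[MAX_ROUTES];
    int          count;
};

/* One private table per CPU. */
static struct fwd_table tables[MAX_CPUS];

uint32_t lookup_next_hop(int cpu, uint32_t dst_addr)
{
    const struct fwd_table *t = &tables[cpu];   /* this core's own copy */
    for (int i = 0; i < t->count; i++)
        if ((dst_addr & t->routes[i].mask) == t->routes[i].prefix)
            return t->routes[i].next_hop;
    return 0;                                   /* no route found */
}

void add_route(const struct route *r)
{
    /* Slow path: replicate the update into every per-CPU copy.
     * Synchronization of updates is omitted for brevity. */
    for (int cpu = 0; cpu < MAX_CPUS; cpu++)
        tables[cpu].routes[tables[cpu].count++] = *r;
}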
Instrumented Kernel Profile

[Profiler traces: the unoptimized run shows lock contention; the optimized run does not]

• Result: a 10% increase in small-packet performance
Summary

QNX Momentics® Multi-Core Edition

• The QNX Momentics Multi-Core Edition provides the industry's only comprehensive software foundation that addresses the imminent transition to multi-core silicon
• The QNX Momentics Multi-Core Edition lets you:
  ► Rapidly move current uniprocessor-based applications to any multi-processing architecture, decreasing overall time to market
  ► Quickly build reliable, high-performance products that leverage the latest generation of multi-core processors
  ► Future-proof your designs to scale beyond dual-core to multi-core silicon and beyond to highly distributed systems
  ► Focus on product differentiation and product delivery rather than plumbing
• Supports all multi-processing models: AMP, SMP, or BMP