AMD64 - University of Virginia, Department of Computer Science


The AMD Opteron
Henry Cook
Kum Sackey
Andrew Weatherton
Presentation Outline
• History and Goals
• Improvements
• Pipeline Structure
• Performance Comparisons
K8 Architecture Development
• The Nx586, March 1994
– Superscalar
– Designed by NexGen
– Manufactured by IBM
– 70-111MHz
– 32KB L1 cache
– 3.5 million transistors
– .5 micron process
K8 Architecture Development
• AMD SSA/5 (K5)
– March 1996
– Built by AMD from the ground up
• Superscalar architecture
• Out-of-order speculative execution
• Branch prediction
• Integrated FPU
• Power management
– 75-117MHz
– Ran “hot”
– 34KB L1 cache
– 4.5 million transistors
– .35 micron process
K8 Architecture Development
• AMD K6 (1997)
– Based on NexGen's RISC86 core (from the
Nx586)
– 166-300MHz
– 84KB L1 Cache
– 8.8 million transistors
– .25 micron process
K8 Architecture Development
• AMD K6 (1997) continued
– Advantages of K6 over K5:
• The RISC86 core translates complex x86 instructions into
shorter ones, allowing the K6 to reach higher frequencies
than the K5 core.
• Larger L1 cache.
• New MMX instructions.
– AMD produced both desktop and mobile K6
processors; the only difference was a lower
processor core voltage for the mobile part
K8 Architecture Development
• First AMD Athlons, K7 (June 23, 1999)
– Based on the K6 core
– Improved the K6’s FPU
– 128 KB (2x64 KB) L1 cache
– Initially 500-700MHz
– 8.8 million transistors
– .25 micron process
K8 Architecture Development
• AMD Athlons, K7 continued
– From 1999 to 2002, held the fastest-x86 title off and on
– First to a 1GHz clock speed
– Intel suffered a series of major production, design,
and quality control issues at this time.
– Changed from slot to socket format
– Athlon XP – desktop
– Athlon XP-M – laptop
– Athlon MP – server
K8 Architecture Development
• AMD Athlons, K7 continued
– Final (5th) revision, the Barton
– 400 MHz FSB (up from 200 MHz)
– Up to 2.2 GHz clock
– 512 KB on-die L2 cache
– 54.3 million transistors
– .13 micron process
• In 2004 AMD began using 90nm process on XP-M
The AMD Opteron
• Built on the K8 Core
– Released April 22, 2003
– AMD's AMD64 (x86-64) ISA
• Direct Connect Architecture
– Integrated memory controllers
– HyperTransport interface
• Native execution of x86 64-bit apps
• Native execution of x86 32-bit apps with
no speed penalty!
Opteron vs. Intel Offerings
• Targeted at the server market
– 64-bit computing
– Registered memory
• Initial direct competitor was the Itanium
• Itanium was the only other 64-bit processor
architecture with 32-bit x86 compatibility
• But, 32-bit software support was not native
– Emulated 32-bit performance took a significant hit
Opteron vs. ???
• Opteron had no real competition
• Near 1:1 multi-processor scaling
• Competing CPUs share a single common bus
– With its integrated memory controller, each Opteron
can access local RAM without using the HyperTransport
bus for processor-memory communication
– Contention for the shared bus leads to decreased
efficiency, which is not an issue for the Opteron
• Still did not dominate the market
Opteron Layout
Other New Opteron Features
• 48-bit virtual address space and a 40-bit
physical address space
• ECC (error correcting code) protection for
L1 cache data, L2 cache data and tags
• DRAM with hardware scrubbing of all
ECC-protected arrays
Other New Opteron Features
• Lower thermal output and improved frequency
scaling via a .13 micron SOI (silicon-on-insulator) process technology
• Two additional pipeline stages (compared
to K7) for increased performance and
frequency scalability
• Higher IPC (instructions-per-clock) with
larger TLBs, flush filters, and enhanced
branch prediction algorithms
64-bit Computing
• Move beyond the 4GB virtual-address
space ceiling 32-bit systems impose
• Servers and apps like databases, content
creation, MCAD, and design-automation
tools push that boundary.
• AMD’s implementation allows:
– Up to 256TB of virtual-address space
– Up to 1TB of physical memory
– No performance penalty
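Both limits fall straight out of the address widths given earlier (48-bit virtual, 40-bit physical); a quick check:

```python
# 48-bit virtual and 40-bit physical address widths imply:
TB = 2**40
virtual_tb = 2**48 // TB
physical_tb = 2**40 // TB

print(virtual_tb)   # 256 -> 256TB of virtual-address space
print(physical_tb)  # 1   -> 1TB of physical memory
```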
64-bit Computing Cont’d
• AMD believes the following desktop apps stand
to benefit the most from its architecture, once
64-bit becomes more widespread
– 3D gaming
– Codecs
– Compression algorithms
– Encryption
– Internet content serving
– Rendering
AMD and 64-bit Computing
• Goal is not immediate transition to 64-bit
operation
– Like Intel’s transition to 32-bit with the 386
– AMD's Brunner: "The transition will occur at
the pace of demand for its benefits."
• Sets foundation and encourages
development of 64-bit applications while
fully supporting current 32-bit standard
AMD64
• AMD’s 64-bit ISA
• 64-bit software support with zero-penalty
32-bit backward compatibility
• x86 based, with extensions
• Cleans up x86-32 idiosyncrasies
• Updated since release: i.e. SSE3
AMD64 - Features
• All benefits of 64-bit processing (e.g. virtual-address space)
• Added registers
– Same 8 GPRs as the Pentium 4 in 32-bit mode, but 8
more 64-bit GPRs are available in 64-bit mode
– 8 more XMM registers
• Native 32-bit compatibility
– Low translation overhead (unlike Intel)
– Both 32- and 64-bit apps can be run under a 64-bit OS
Register Map for AMD64
AMD64 – More Features
• RIP-relative data access: instructions can
reference data relative to the PC, which makes code
in shared libraries more efficient and able to be
mapped anywhere in the virtual address space.
• NX Bit: Not required for 64-bit computing, but
provides for a more tightly controlled software
environment. Hardware-set permission levels
make it much more difficult for malicious code to
take control of the system.
AMD64 Operating Modes
• Legacy mode supports 16- and 32-bit OSes and
apps, while long mode enables 64-bit OSes to
accommodate both 32- and 64-bit apps.
– Legacy: OS, device drivers, and apps will run exactly
as they did prior to upgrading.
– Long: Drivers and apps have to be recompiled, so
software selection will be limited, at least initially.
• Most likely scenario is a 64-bit OS with 64-bit
drivers, running a mixture of 32- and 64-bit apps
in compatibility mode.
Direct Connect Architecture
• I/O Architecture for Opteron and Athlon64
• Microprocessors are connected to:
– Memory through an integrated memory
controller.
– A high-performance I/O subsystem via the
HyperTransport bus
– Other CPUs via the HyperTransport bus
Onboard Memory Control
• Processors do not have to go through a
northbridge to access memory
• 128-bit memory bus
• Latency reduced and bandwidth doubled
• Multicore: each processor has its own memory
interface and its own memory
• Available memory scales with the number
of processors
More Onboard Memory Control
• DDR-SDRAM only
• Up to 8 registered DDR DIMMs per
processor
• Memory bandwidth of up to 5.3 Gbytes/s
(with PC2700) per processor.
• 20% improvement over Athlon just due to
integrated memory
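The 5.3 Gbytes/s figure can be reproduced from the PC2700 specs; this assumes DDR at a ~166 MHz memory clock (333 MT/s) on a 64-bit channel, with the 128-bit memory bus acting as two such channels:

```python
# Peak bandwidth of a 128-bit (dual-channel) PC2700 memory bus.
transfers_per_s = 166.67e6 * 2   # DDR: two transfers per memory clock
channel_bytes = 64 // 8          # each channel is 64 bits wide
channels = 2                     # 128-bit bus = two 64-bit channels

bandwidth = transfers_per_s * channel_bytes * channels
print(round(bandwidth / 1e9, 1))  # 5.3 (GB/s)
```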
HyperTransport
• Bidirectional, serial/parallel, scalable, high-bandwidth, low-latency bus
• Packet based
– 32-bit words regardless of physical width
• Facilitates power management and low
latencies
HyperTransport in the Opteron
• 16 CAD HyperTransport (16-bit wide,
CAD=Command, Address, Data)
– processor-to-processor and processor-to-chipset links
– bandwidth of up to 6.4 GB/s (per HT port)
– 50% more than the latest Pentium 4 or
Xeon processors offer
• 8-bit wide HyperTransport for components
such as normal I/O-Hubs
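As a sanity check on the 6.4 GB/s figure, a back-of-the-envelope calculation, assuming the 800 MHz double-data-rate link clocking described on the Opteron memory set-up slide later in the deck:

```python
# Peak bandwidth of a 16-bit CAD HyperTransport link.
link_width_bits = 16
link_clock_hz = 800e6
transfers_per_cycle = 2  # double data rate

one_way = link_width_bits / 8 * link_clock_hz * transfers_per_cycle
print(one_way / 1e9)      # 3.2 (GB/s each direction)
print(2 * one_way / 1e9)  # 6.4 (GB/s aggregate, bidirectional)
```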
More Opteron HyperTransport
• Number of HyperTransport channels
(up to 3) determined by number of CPUs
– 19.2 Gbytes/s of peak bandwidth per
processor
• All are bi-directional, quad-pumped
• Low power consumption (1.2 W) reduces
system thermal budget
More HyperTransport
• Auto-negotiated bus widths
• Devices negotiate sizes during initialization
• 2-bit lines up to 32-bit lines
• Busses of various widths can be mixed together
in a single application
• Allows for high speed busses between main
memory and the CPU and lower speed busses
to peripherals as appropriate
• PCI compatible but 80x faster
DCA – InterCPU Connections
• Multiple CPUs connected through a
proprietary extension running on additional
HyperTransport interfaces
• Allows support of a cache-coherent, Non-Uniform
Memory Access (NUMA), multi-CPU
memory access protocol
DCA – InterCPU Connections
• Non-Uniform Memory Access
– Separate cache memory for each processor
– Memory access time depends on memory location.
(i.e. local faster than non-local)
• Cache coherence
– Integrity of data stored in local caches of a shared
resource
• Each CPU can access the main memory of
another processor, transparent to the
programmer
DCA Enables Multiple CPUs
• Integrated memory controller allows local memory
access without using HyperTransport
• For non-local memory access and
interprocessor communication, only the initiator
and target are involved, keeping bus-utilization
to a minimum.
• All CPUs in multiprocessor Intel Xeon systems
share a single common bus for both non-local
memory access and interprocessor communication
• Contention for shared bus reduces efficiency
Multicore vs Multi-Processor
• In multi-processor systems (more than one
Opteron on a single motherboard), the
CPUs communicate using the Direct
Connect Architecture
• Most retail motherboards offer one or two
CPU sockets
• The Opteron CPU directly supports up to
an 8-way configuration (found in mid-level
servers)
Multicore vs Multi-Processor
• With multicore each physical Opteron chip
contains two separate processor cores
(more someday soon?)
• Doubles the compute power available to
each motherboard socket. One socket can
deliver the performance of two
processors, two deliver a four-processor
equivalent, etc.
Future Improvements
• Dual-Core vs Double Core
– Dual core: Two processors on a single die
– Double core: Two single core processors in
one ‘package’
• Better for manufacturing
• Intel Pentium D 900 Presler
• Combined L2 cache
• Quad-core, etc.
K7 vs. K8 Changes
Summary of Changes
From K7 to K8
• Deeper & wider pipeline
• Better branch predictor
• Large-workload TLB
• HyperTransport capabilities eliminate the
Northbridge and allow low-latency
communication between processors as well as
I/O
• Larger L2 cache with higher bandwidth and
lower latency
• AMD 64 ISA allowing for 64-bit operation
The K7 Basics
• 3 x86 decoding units
• 3 integer units (ALU)
• 3 floating point units (FPU)
• A 128KB L1 cache
• Designed with an efficiency aim
– IPC (Instructions Per Cycle) mark
– The K7’s units can handle up to 9 instructions
per clock cycle
The K8 Basics
• 3 x86 decoding units
• 3 integer units (ALU)
• 3 floating point units (FPU)
• A 1MB L2 cache
The K7 Core
The K8 Core
Things To Note About the K8
• Schedules a large number of instructions
simultaneously
– 3 8-entry schedulers for integer instructions
– A 36-entry scheduler for floating point
instructions
• Compared to the K7, the K8 allows for
more integer instructions to be active in
the pipeline. How is this possible?
Processor Constraints
• A 'bigger' processor has more execution units
(width) and more stages in the pipeline (depth)
– Processor 'size' is limited by the accuracy of the
branch predictor
– The predictor determines how many instructions can be
active in the pipeline before an incorrect branch
prediction occurs
– In theory, the CPU should only accommodate the number of
instructions that can be sent into a pipe before a
misprediction
The K8 Branch Predictor
The K8 Branch Predictor Details
• Compared to the K7, the K8 has improved
branch prediction
– The global history counter (GHC) is 4x its previous size
• The GHC is a large array of 2-bit (0-3) counters, indexed by
part of an instruction's address
• If the value is >= 2 then the branch is predicted as "taken"
– Taken branches increment the counter
– Untaken branches decrement it
– The larger global history counter means more
instruction addresses can be tracked, thus increasing
branch predictor accuracy
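The counter scheme above can be sketched in a few lines; the table size and index function here are illustrative, not the K8's actual dimensions:

```python
# Sketch of a table of 2-bit saturating counters (values 0-3),
# indexed by a slice of the instruction address.
TABLE_BITS = 12
counters = [1] * (1 << TABLE_BITS)  # start weakly not-taken

def index(pc):
    # Index by low address bits (word-aligned); illustrative choice.
    return (pc >> 2) & ((1 << TABLE_BITS) - 1)

def predict(pc):
    return counters[index(pc)] >= 2  # >= 2 means predicted "taken"

def update(pc, taken):
    i = index(pc)
    if taken:
        counters[i] = min(3, counters[i] + 1)  # taken branches increment
    else:
        counters[i] = max(0, counters[i] - 1)  # untaken branches decrement

# A branch taken repeatedly trains its counter toward "taken".
for _ in range(3):
    update(0x400100, True)
print(predict(0x400100))  # True
```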
Translation Look-aside Buffer
• The number of TLB entries has been increased
– Helps performance in servers with large memory
requirements
– Desktop performance impact will be limited to a small
boost when running 3D rendering software
HyperTransport:
Typical CPU to Memory Set-Up
• The CPU sends a 200MHz clock to the north bridge; this is the FSB.
• The bus between the north bridge and the CPU is 64 bits wide at 200MHz
(quad-pumped for 4 packets per cycle), giving an effective rate of 800MHz.
• The memory bus is also 200MHz and 64 or 128 bits wide (single or dual
channel). As it is DDR memory, two 64/128-bit packets are sent every
clock cycle.
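The quad-pumped figures above reduce to simple arithmetic:

```python
# Effective FSB rate and peak bandwidth for the set-up described above.
bus_clock_hz = 200e6
pumps = 4                    # quad-pumped: 4 packets per clock cycle
bus_width_bytes = 64 // 8

effective_rate = bus_clock_hz * pumps
bandwidth = effective_rate * bus_width_bytes
print(effective_rate / 1e6)  # 800.0 (MHz effective)
print(bandwidth / 1e9)       # 6.4 (GB/s peak)
```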
HyperTransport:
Opteron Memory Set-Up
• The integrated memory controller does not improve memory bandwidth,
but drastically reduces memory request time
• HyperTransport uses a 16-bit wide bus at 800MHz and a double data
rate system that enables 3.2GB/s peak bandwidth one-way
Pros & Cons
• Pros
– The performance of the integrated controller
of the K8 increases as the CPU speed
increases and so does the request speed.
– The addressable memory size and the total
bandwidth increase with the number of CPUs
• Cons
– Memory controller is customized to use a
specific memory, and is not very flexible about
upgrading
Caches
L1 Cache Comparison
CPU                    | K8              | Pentium 4 Prescott
Size                   | code: 64KB      | TC: 12Kµops
                       | data: 64KB      | data: 16KB
Associativity          | code: 2 way     | TC: 8 way
                       | data: 2 way     | data: 8 way
Cache line size        | code: 64 bytes  | TC: n.a.
                       | data: 64 bytes  | data: 64 bytes
Write policy           | Write Back      | Write Through
Latency given by       | 3 cycles        | 4 cycles
manufacturer           |                 |
K8 L1 Cache
• Compared to the Intel machine, the large
size of the L1 cache allows for a bigger
block size
– Pros: a big range of data or code in the same
memory area
– Cons: low associativity tends to create
conflicts during the caching phase.
L2 Cache Comparison
CPU                    | K8                 | Pentium 4 Prescott
Size                   | 512KB (NewCastle)  | 1024KB
                       | 1024KB (Hammer)    |
Associativity          | 16 way             | 8 way
Cache line size        | 64 bytes           | 64 bytes
Latency given by       | 11 cycles          | 11 cycles
manufacturer           |                    |
Bus width              | 128 bits           | 256 bits
L1 relationship        | exclusive          | inclusive
K8 L2 cache
• The L2 cache of the K8 shares a lot of features
with the K7’s.
• The K8’s L2 cache uses 16-way set
associativity to partially compensate for the low
associativity of the L1.
• Although the bus width in the K8 is double what
the K7 offered, it is still smaller than the Intel
model’s
• The K8 also includes hardware prefetch
logic that pulls data from memory into the
L2 cache during memory-bus idle time.
Inclusive vs. Exclusive Caching
• Inclusive Caching: Used by the Intel P4
– L1 cache contains a subset of the L2 cache
– During an L1 miss/L2 success data is copied
into the L1 cache and forwarded to the CPU
– During an L1/L2 miss, data is copied from
memory into both L1 and L2 caches
Inclusive vs. Exclusive Caching
• Exclusive: Used by the Opteron
– L1 and L2 caches cannot contain the same
data
– During an L1 miss/L2 success data
• One line is evicted from the L1 cache into the L2
• L2 cache copies data into the L1 cache
– During an L1/L2 miss, data is copied into the
L1 cache alone
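The two fill policies can be modeled as operations on two sets; this is a toy sketch (capacities and eviction choices are simplified) that only tracks which lines each level holds:

```python
# Toy model of the two fill policies. l1/l2 are sets of line addresses.

def inclusive_fill(l1, l2, line):
    """P4-style: on a miss the line ends up in both levels."""
    l1.add(line)
    l2.add(line)  # L1 stays a subset of L2

def exclusive_fill(l1, l2, line, evict=None):
    """Opteron-style: a line lives in exactly one level."""
    if line in l2:             # L1 miss / L2 hit
        l2.discard(line)       # the line moves, it is not copied
        if evict is not None:  # one L1 line is evicted into the L2
            l1.discard(evict)
            l2.add(evict)
    l1.add(line)               # on a full miss, fill the L1 alone

l1, l2 = set(), set()
inclusive_fill(l1, l2, 0xA0)
print(l1 <= l2)  # True: inclusion property holds

l1, l2 = {0x10}, {0x20}
exclusive_fill(l1, l2, 0x20, evict=0x10)
# L1 now holds only 0x20 and L2 holds only 0x10: no duplication.
```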
Drawback of Exclusive Caching
and its solution…
• Problem: A line from the L1 must be copied to the L2 *before*
getting back the data from the L2.
– Takes a lot of clock cycles, adding to the time needed to get data
from the L2
• Solution:
– A victim buffer (VB): a very small, fast memory between the L1 and
the L2.
• The line evicted from the L1 is copied into the VB rather than into the L2.
• At the same time, the L2 read request is started, so the L1-to-VB write
operation is hidden by the L2 latency
• Then, if the next requested data happens to be in the VB, getting it back
from there is much quicker than getting it from the L2.
– The VB is a good improvement, but it is very limited by its small size
(generally between 8 and 16 cache lines). Moreover, when the VB is
full, it must be flushed into the L2, which is an additional step and needs
some extra cycles.
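A toy version of the victim-buffer idea, with an illustrative size (the slide notes 8-16 lines in practice):

```python
from collections import deque

# Evicted L1 lines go into a tiny FIFO (the VB) instead of straight
# to the L2; lookups check the VB before falling back to the L2.
VB_SIZE = 8
l2 = set()
vb = deque()

def evict_from_l1(line):
    vb.append(line)
    if len(vb) > VB_SIZE:      # VB full: flush oldest entry to the L2
        l2.add(vb.popleft())   # (the extra step costing cycles)

def lookup(line):
    if line in vb:             # VB hit: much quicker than the L2
        vb.remove(line)
        return "vb"
    if line in l2:
        l2.discard(line)       # exclusive: the line moves back up
        return "l2"
    return "miss"

evict_from_l1(0x40)
print(lookup(0x40))  # "vb"
```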
Drawback of Inclusive
• The constraint on the L1/L2 size ratio requires the L1 to be
small,
– but a small size reduces its hit rate, and
consequently its performance.
– On the other hand, if it is too big, the ratio will be too large for
good performance of the L2.
• Reduces flexibility when deciding the sizes of the L1 and L2
caches
– It is very hard to build a CPU line with such constraints. Intel
released the Celeron P4 as a budget CPU, but its 128KB L2
cache crippled performance.
• Total useful cache size is reduced since data is
duplicated over the caches
Inclusive vs. Exclusive Caching
           | Pros                            | Cons
Exclusive  | No constraint on the L2 size.   | L2 performance decreases
           | Total cache size is the sum of  |
           | the sub-level sizes.            |
Inclusive  | L2 performance                  | Constraint on the L1/L2 size ratio
           |                                 | Total cache size is effectively reduced
The Pipeline
K7 vs. K8 – Pipeline Comparison
The Fetch Stage
• Two Cycles Long
• Feeds the 3 decoders with 16 instruction
bytes each cycle
• Uses the L1 code cache and the branch
prediction logic
The Decode Stage
• The decoders convert x86 instructions into
fixed-length micro-operations (µOPs).
• Can generate 3 µOPs per cycle
– The FastPath: "simple" instructions, decoded
into 1-2 µOPs, are decoded by hardware then packed
and dispatched
– Microcoded path: complex instructions are decoded
using the internal ROM
• Compared to the K7, more instructions in the K8
use the fast path, especially SSE instructions.
– AMD claims the number of microcoded instructions
decreased by 8% for integer and 28% for floating
point instructions.
Instruction Dispatch
• There are:
– 3 address generation units (AGU)
– Three integer units (ALU). Most operations
complete within a cycle, in both 32 and 64 bits:
addition, rotation, shift, logical operations
(and, or).
• Integer multiplication has a 3-cycle latency in 32
bits, and a 5-cycle latency in 64 bits.
– Three floating point units (FPU), that handle
x87, MMX, 3DNow!, SSE and SSE2.
Load/Store Stage
• Last stage of the pipeline process
– uses the L1 data cache.
– the L1 is dual-ported to handle two 32/64-bit
reads or writes each clock cycle.
Cache Summary
• Compared to the K7, the K8 cache
provides higher bandwidth and lower
latencies
• Compared to the Intel P4, the K8 caches
are write-back and exclusive
AMD 64: GPR encoding
IA32 instruction encoding uses a special byte called the
ModRM (Mode / Register / Memory), in which the source
and destination registers of the instruction are encoded.
– 3 bits encode the source register, 3 bits encode the destination
• There’s no way to change the ModRM byte, since that would break
IA32 compatibility. So, to let instructions use the 8 new GPRs, an
additional extension bit, carried in the new REX prefix, is added outside the ModRM.
• The REX prefix is used only in long (64-bit) mode, and only if the
specified instruction is a 64-bit one
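The field layout can be illustrated by decoding a ModRM byte with and without the REX extension bits. This is a simplified view: REX is a one-byte prefix whose R and B bits extend the two 3-bit register fields, and the mod/addressing forms are ignored here:

```python
# Decode the register fields of a ModRM byte, optionally extended
# by the REX.R / REX.B bits (simplified: ignores mod/addressing).
def decode_regs(modrm, rex=0):
    reg = (modrm >> 3) & 0b111   # 3-bit "reg" field
    rm = modrm & 0b111           # 3-bit "r/m" field
    rex_r = (rex >> 2) & 1       # REX.R extends "reg" to 4 bits
    rex_b = rex & 1              # REX.B extends "r/m" to 4 bits
    return (rex_r << 3) | reg, (rex_b << 3) | rm

# Without REX, only registers 0-7 are reachable (IA32 behaviour).
print(decode_regs(0b11_000_001))                    # (0, 1) -> eax, ecx
# With REX.R and REX.B set, the same ModRM names r8-r15.
print(decode_regs(0b11_000_001, rex=0b0100_0101))   # (8, 9) -> r8, r9
```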
AMD 64: SSE
• Abandoned the original MMX and 3DNow!
instruction sets because they operated on the
same physical registers
• Supports SSE/SSE2 using eight SSE-dedicated
80-bit registers
– If a 128-bit instruction is processed, it takes two
steps to complete
– Intel’s P4 allows the use of 128-bit registers, so 128-bit
instructions take only a single step
– However, C/C++ compilers still usually output scalar
SSE instructions that only use 32/64 bits, so the
Opteron can process most SSE instructions in one
step and thus remains competitive with the P4
AMD 64: One Last Trick
• Suppose we want to write 1 into a register, written
in pseudo-code as:
• mov register, 1
• In the case of a 32-bit register, the immediate
value 1 will be encoded on 32 bits:
– mov eax, 00000001h
• In the case the register is 64 bits:
– mov rax, 0000000000000001h
• Problem? The 64-bit instruction takes 5 more
bytes to encode the same number, thus wasting
space.
AMD 64: One Last Trick
• Under AMD64, the default operand size is
32 bits.
AMD 64: One Last Trick
• For memory addressing a more
complicated table is used.
AMD 64: Code Size
• Cpuid.org estimated that 64-bit code will be
20-25% bigger than the equivalent IA32 code.
• However, the use of sixteen GPRs will tend to
reduce the number of instructions, and perhaps
make 64-bit code shorter than 32-bit code.
– The K8 is able to handle the code size increase
thanks to its 3 decoding units and its big L1 code
cache. The use of big 32KB blocks in the L1
organization now seems very useful
AMD 64:
32-bit code vs. 64-bit Code
Note:
Athlon64 3200+ runs at 2.0GHz
AthlonXP 3200+ runs at 2.2 GHz
• [H]ard|OCP:
• “AthlonXP 3200+ got
outpaced by the Athlon64
3200+…the P4 and the
P4EE came in at a dead
tie, which suggests that
the extra CPU cache is not
a factor in this
benchmark... pipeline
enhancements made to
the new K8 core certainly
did impact instructions per
clock.”
AMD 64: Conclusions
• Allows for a larger addressable memory
size
• Allows for wider GPRs and 8 more of them
• Allows the use of all x86 instructions, which
are available on the AMD64 by default
• Can lead to smaller code that is faster as a
result of less memory shuffling
Opteron vs. Xeon
Opteron vs Xeon in a nutshell
• Opteron offers better computing and per-Watt
performance at a roughly equivalent
per-device price
• Opteron scales much better when moving
from one to two or even more CPUs
• Fundamental limitation:
– Xeon processors must share one front side
bus and one memory array
FSB Bottleneck
Intel’s Xeon
AMD’s Opteron
Xeon and the FSB Bottleneck
• External north bridge makes
implementing multiple FSB
interfaces expensive and hard
• Intel just has all the processors
share
• Relies on large on-die L3 caches
to hide the issue
• Problem grows with number of
CPUs
The AMD Solution
• Recall: Each processor
has own integrated
memory controller and
three HyperTransport
ports
– No NB required for
memory interaction
– 6.4 GB/s bandwidth
between all CPUs
• No scaling issue!
Further Xeon Notes
• Even 64-bit extensions would not solve the
fundamental performance bottleneck
imposed by the current architecture
• Xeon can make use of Hyperthreading
– Found to improve performance by 3 - 5%
AnandTech: Database Benchmarks
• SQL workload based on site’s forum
usage, database was forums themselves
– i.e. trying to be real world
• Two categories: 2-way and 4-way setups
• Labels:
– Xeon: Clock Speed / FSB Speed / L3 Cache Size
– Opteron: Clock Speed / L2 Cache Size
AnandTech: Average Load 2-way
• Longer line is better
• Opterons at 2.2 GHz maintain 5% lead
over Xeons at 3.2 GHz
AnandTech: Average Load 4-way
• With two more processors, best Opteron system
increases performance lead to 11%
• Opterons @ 1.8 GHz nearly equal Xeons at
3.0 GHz
AnandTech: Enterprise
benchmarks
Stored Procedures / Second
• 2-way Xeon at 3+GHz and large L3 cache does
better
• 4-way Opteron jumps ahead (8.5% lead)
AnandTech Test Conclusions
• Opteron is clear winner for >2 processor
systems
– Even for dual-processors, Xeon essentially
only ties
• Clearly illustrates the scaling bottleneck
• Xeons are using most of their huge (4MB)
L3 cache to keep traffic off the FSB
• Also Opteron systems used in tests cost ½
as much
Tom’s Hardware Benchmarks
• AMD's Opteron 250 vs. Intel's Xeon 3.6GHz
• Xeon Nocona (i.e. 64-bit processing)
– Results enhanced by chipset used (875P)
which has improved memory controller
– Still suffers from lack in memory performance
• Workstation applications rather than
server based tests
Tom’s Hardware
Tom’s Hardware
Tom’s Hardware Conclusions
• AMD has memory benefits, as before
• Opteron better in video, Intel better with
3D but only when 875P-chipset is used
– Otherwise Opteron wins in spite of inferior
graphics hardware
• Still undecided re: 64-bit, no good
applications to benchmark on
K8 in Different Packages
K8 in Different Packages
• Opteron
– Server Market
– Registered memory
– 940 pin count
– Three HyperTransport links
• Multi-CPU configurations (1, 2, 4, or 8 CPUs)
• Multiple multi-core CPUs supported as well
– Up to 8 1GB DIMMs
K8 in Different Packages
• Athlon 64
– Desktop market
– Unregistered memory
– 754 or 939 pin count
– Up to 4 1GB DIMMs
– Single HyperTransport link
• Single slot configurations
• X2 has multiple cores in one slot
– Athlon 64 FX
• Same feature set as Athlon 64
• Unlocked multiplier for overclocking
• Offered at higher clock speeds (2.8GHz vs. 2.6GHz)
K8 in Different Packages
• Turion 64
– Named to evoke the “touring” concept
– 90nm “Lancaster” Athlon 64 core
• 64bit computing
• SSE3 support
– High quality core yields, can run at high clock speeds
with low voltage
• Similar process for low wattage opterons
– On chip memory controller
• Saves power by running in single channel mode
• Better than the Pentium M’s extra controller on the mobo
Thermal Design Points
• Pentium 4’s TDP: 130W
• Athlon 64’s TDP: 89-104W
• Opteron HE: 50W; EE: 30W
• Athlon 64 mobiles: 50W
– DTR market sector
• Pentium M: 27W
• Turion 64: 25W
K8 in Different Packages
• Turion 64 continued
– Uses PowerNow! Technology
• Similar to Intel’s SpeedStep
• Identical to desktop Cool’N’Quiet
• Dynamic voltage and clock frequency modulation
– Operates “on demand”
– Run “Cooler and Quieter” even when plugged in
K8 in Different Packages
• AMD uses the “Mobile Technology” name
– Intel has a monopoly on Centrino
• Supplies the wireless, chipset, and CPU
• Invested $300 million in Centrino advertising
– Some consumers think Centrino is the only way to get wireless
connectivity in a notebook
– AMD supplies only the CPU
• Chipset and wireless are left up to the motherboard
manufacturer/OEM
Marketing
VS
Intel’s Marketing
• Men who are Blue
• Moore’s Law
• Megahertz
• Most importantly: Money
Beginning with: “In order to correctly
communicate the benefits of new processors to
PC buyers it became important that Intel transfer
any brand equity from the ambiguous and
unprotected processor numbers to the company
itself”
Industry on AMD vs. Intel
• Intel spends more on R&D in one quarter than
AMD makes in a year
• Intel still has a tremendous amount of arrogance
• Has been shamed technologically by a flea-sized (relatively speaking) firm
• Humbling? Intel is still grudgingly turning to the
high-IPC, low-clock-rate, dual-core, x86-64, on-die
memory controller design pioneered by its
diminutive rival.
– Geek.com
AMD’s Marketing
• Mascot = The AMD Arrow
• “AMD makes superior CPUs, but the marketing
department is acting like they are still selling the K6” – theinquirer.net
• Guilty with Intel on poor metrics:
• “AMD made all the marketing hay it could on the
historically significant clock-speed number.
By trying to turn attention away from that number now, it
runs the risk of appearing to want to change the subject
when it no longer has the perceived advantage. In
marketing, appearance is everything. And no one wants
to look like a sore loser, even when they aren't. “- Forbes
Anandtech on AMD’s Marketing
• AMD argued that they didn't have to talk
about a new architecture, as Intel is just
playing catch-up to their current
architecture.
• However, we look at it like this - AMD has
the clear advantage today, and for a
variety of reasons, their stance in the
marketplace has not changed all that
much.
Conclusion
• Improvements over K7
– 64-bit
– Integrated memory controller
– HyperTransport
– Pipeline
• Multiprocessor scaling > Xeon
• K8 is dominant in every market
performance-wise
• K8 is trounced in every market in sales
Reason for 64-bit in Consumer
Market
• “If there aren't widespread, consumer-priced
64-bit machines available in three
years, we're going to have a hard time
developing games that are more
compelling than last year's games.” – Tim
Sweeney, Founder & President, Epic
Games
Questions?
http://www.people.virginia.edu/~avw6s/opteron.html