Exploring P4 Trace Cache Features

Ed Carpenter
Marsha Robinson
Jana Wooten
Problem Statement

Explore characteristics of the P4 Trace
Cache using microbenchmarks and
performance counters related to branching
and Trace Cache
Approach


Determine characteristics of the Pentium 4
processor that will help us evaluate the
P4’s trace cache
Using a performance monitoring tool (Intel’s VTune Performance Analyzer),
measure the data we need and analyze it to find
limitations on the trace cache
Some P4 Characteristics

Like most high performance processors,
the P4 has special on-chip hardware for
performance monitoring. This hardware
typically includes



Event detectors and counters
Qualification of event detections and counting by
privilege mode and event characteristics
Support for event-based sampling
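To make the three features above concrete, here is a minimal toy model in Python of an event counter with privilege-mode qualification and event-based sampling. It is only a sketch: the class name, the `sample_after` threshold, and the `count_mode` filter are made up for illustration and do not mirror the P4's actual counter registers.

```python
class ToyEventCounter:
    """Toy model: counts qualified events and samples every Nth one."""

    def __init__(self, sample_after, count_mode=None):
        self.sample_after = sample_after  # events between samples (hypothetical knob)
        self.count_mode = count_mode      # None counts every privilege mode
        self.total = 0                    # all qualified events seen
        self.since_sample = 0             # qualified events since the last sample
        self.samples = []                 # addresses captured at each sample point

    def on_event(self, addr, mode="user"):
        # Qualification: ignore events from other privilege modes.
        if self.count_mode is not None and mode != self.count_mode:
            return
        self.total += 1
        self.since_sample += 1
        if self.since_sample == self.sample_after:
            # Event-based sampling: record where the Nth event happened.
            self.samples.append(addr)
            self.since_sample = 0


if __name__ == "__main__":
    counter = ToyEventCounter(sample_after=3)
    for addr in range(7):
        counter.on_event(addr)
    print(counter.total, counter.samples)
```

A small `sample_after` gives more samples at the cost of more interruptions; the model ignores that cost entirely.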
P4 characteristics cont.

Common problems faced by modern processors




Small number of counters
Inability to distinguish between speculative and nonspeculative events
Imprecise event-based sampling
With 42 million transistors (compared to the P3’s 28 million),
the P4 has overcome these problems

48 event detectors and 18 event counters
Provides instruction tagging to enable counting of
nonspeculative performance events
Provides support for imprecise event-based sampling
(IEBS) and precise event-based sampling (PEBS)
Trace Cache



Special instruction cache for capturing long
dynamic instruction sequences.
Each line stores a snapshot, or trace, of the
dynamic instruction stream
The P4 executes from the trace cache when there is an
L1 cache hit (which is over 90% of the time)
Characteristics of Trace Cache

Stores instructions after they’ve already been
decoded into μops (“micro-ops”).
μops: RISC-style instructions
Cache Line Size: 6 μops
Trace Cache Size: 12K μops

Branch Prediction hardware is used
Knows about any branch and fetches the instructions that follow
the branch.

Conditional Branches can cause problems
Won’t know if the prediction is wrong until the branch condition is checked in ALU0
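As a rough worked example of the sizes above, this Python sketch computes how many 6-μop lines a 12K-μop trace cache holds and how many lines a straight-line run of μops would need. It assumes "12K" means 12 × 1024 and that every line is packed full, which real trace packing need not achieve; the function names are illustrative only.

```python
import math

LINE_SIZE_UOPS = 6            # cache line size from the slide
TRACE_CACHE_UOPS = 12 * 1024  # assumption: "12K" means 12 * 1024 uops


def trace_cache_lines(capacity_uops=TRACE_CACHE_UOPS, line_uops=LINE_SIZE_UOPS):
    """Number of lines if the whole capacity is split into equal lines."""
    return capacity_uops // line_uops


def lines_needed(n_uops, line_uops=LINE_SIZE_UOPS):
    """Lines needed for n_uops, assuming every line is packed full (idealized)."""
    return math.ceil(n_uops / line_uops)


if __name__ == "__main__":
    print(trace_cache_lines())  # 2048
    print(lines_needed(500))    # 84
```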
Figure: “Entering the Execution Pipeline: Pentium 4’s Trace Cache,” Tom’s Hardware Guide, http://www6.tomshardware.com/cpu/20001120/p4-06.html
Advantages of Trace Cache

More efficient use of limited cache space.



Trace cache lines contain both branch
instructions and the code after the branch
instruction.
No extra latency for branches
Does not use TLB check
The P4’s Critical Execution Path: “Execute Mode” (when needed code is in L1 cache)
Execute Mode vs. Trace Segment Build Mode

Execute Mode



Trace cache feeds stored traces to the execution logic to
be executed.
Trace cache normally runs in this mode.
Trace Segment Build Mode

Used when there is an L1 cache miss




Front end fetches x86 code from the L2 cache,
Translates into μops,
Builds a “trace segment” with it,
Loads that segment into the trace cache to be executed.
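The two modes can be sketched as a toy Python model: a hit delivers a stored trace (execute mode), and a miss fetches x86 code from a stand-in L2, decodes it, builds a segment, and stores it (build mode). Everything here is an assumption for illustration, including the class name, the one-μop-per-instruction decoder, and the least-recently-used eviction, none of which the slides specify.

```python
from collections import OrderedDict


class ToyTraceCache:
    """Toy model: maps a start address to a list of already-decoded uops."""

    def __init__(self, max_traces):
        self.max_traces = max_traces
        self.traces = OrderedDict()  # start address -> list of uops
        self.deliver_count = 0       # fetches served in execute mode
        self.build_count = 0         # fetches that needed build mode

    def fetch(self, addr, l2_code, decode):
        if addr in self.traces:
            # Execute mode: feed the stored trace to the execution logic.
            self.traces.move_to_end(addr)
            self.deliver_count += 1
            return self.traces[addr]
        # Trace segment build mode: fetch x86 code from L2, translate it
        # into uops, build a segment, and load it into the trace cache.
        self.build_count += 1
        segment = [uop for insn in l2_code[addr] for uop in decode(insn)]
        if len(self.traces) >= self.max_traces:
            self.traces.popitem(last=False)  # evict oldest entry (assumed policy)
        self.traces[addr] = segment
        return segment


def toy_decode(insn):
    """Hypothetical decoder: one uop per simple instruction."""
    return ["uop(" + insn + ")"]


if __name__ == "__main__":
    l2 = {0x100: ["mov eax, 10", "mov eax, 20"]}
    tc = ToyTraceCache(max_traces=4)
    tc.fetch(0x100, l2, toy_decode)  # first fetch misses: build mode
    tc.fetch(0x100, l2, toy_decode)  # second fetch hits: execute mode
    print(tc.build_count, tc.deliver_count)
```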
Branch Prediction

For x86 code with a branch in it:



The trace cache builds a trace from instructions
up to and including the branch instruction
Then picks which branch it thinks the program
will take
Continues to build the trace along that
speculative branch.
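The trace-building steps above can be sketched in Python as follows. This is a simplified illustration, not the P4's actual algorithm: each instruction is treated as one μop, addresses advance by 1, the 6-entry limit reuses the line size quoted earlier, and `predict_taken` is a hypothetical stand-in for the branch predictor.

```python
def build_trace(program, start, predict_taken, max_uops=6):
    """Follow the predicted path from `start`, collecting up to max_uops entries.

    program: dict mapping address -> (text, kind, target), where kind is
             "op" or "branch" and target is the taken address for branches.
    predict_taken: callable(address) -> bool, a stand-in for the predictor.
    """
    trace = []
    pc = start
    while pc in program and len(trace) < max_uops:
        text, kind, target = program[pc]
        trace.append(text)  # the branch itself is part of the trace
        if kind == "branch" and predict_taken(pc):
            pc = target     # continue along the speculative taken path
        else:
            pc += 1         # fall through to the next instruction
    return trace


if __name__ == "__main__":
    program = {
        0: ("cmp eax, 0", "op", None),
        1: ("je 10", "branch", 10),
        2: ("mov ebx, 1", "op", None),
        10: ("mov ebx, 2", "op", None),
        11: ("ret", "op", None),
    }
    print(build_trace(program, 0, lambda pc: True))
    print(build_trace(program, 0, lambda pc: False))
```

If the prediction turns out wrong, the μops collected after the branch are from the wrong path; the sketch does not model that recovery.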
Microcode ROM




Used by P4 to process longer instructions
Allows regular hardware decoder to concentrate on decoding
the smaller, faster instructions.
Stores a sequence of μops for each long instruction
encountered.
Inserts a tag into the trace segment that points to the section
of the microcode ROM where the μop sequence is held.


Trace Cache gives control to the Microcode ROM when a tag is
encountered until the proper sequence of μops is produced.
Execution Engine does not care if instructions come from the
Trace Cache or the Microcode ROM
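A minimal Python sketch of the tag mechanism described above: the trace segment holds either plain μops or a tag, and delivery expands each tag from a ROM table so the consumer sees one uniform μop stream. The ROM contents, the `ROM_TAG` marker, and the function name are invented for illustration.

```python
# Hypothetical ROM contents: tag -> uop sequence for one long instruction.
MICROCODE_ROM = {
    "long_insn_A": ["uA1", "uA2", "uA3"],
}


def deliver_uops(trace_segment, rom=MICROCODE_ROM):
    """Yield uops for the execution engine, expanding ROM tags in place."""
    for entry in trace_segment:
        if isinstance(entry, tuple) and entry[0] == "ROM_TAG":
            # Control passes to the microcode ROM until its sequence is done.
            yield from rom[entry[1]]
        else:
            # Ordinary uop stored directly in the trace segment.
            yield entry


if __name__ == "__main__":
    segment = ["u1", ("ROM_TAG", "long_insn_A"), "u2"]
    print(list(deliver_uops(segment)))
```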
VTune Experiment
/* Small loop body: two mov instructions, run 1M times */
for (i = 0; i < 1000000; i++)
    _asm
    {
        mov eax, 10
        mov eax, 20
    }
VTune Experiment
/* Same loop with a long run of mov instructions */
for (i = 0; i < 1000000; i++)
    _asm
    {
        mov eax, 10
        …
        mov eax, 4990
    }
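Writing hundreds of mov lines by hand is tedious, so a loop like the one above could be generated. The Python sketch below emits such source text. It assumes the immediates increase in steps of 10 (the original elides the middle of the sequence) and that 1M means 1,000,000 iterations; `make_benchmark` is a made-up helper name.

```python
def make_benchmark(n_movs, iterations=1_000_000):
    """Return C source (MSVC-style inline asm) with n_movs mov instructions."""
    lines = [
        "for (i = 0; i < %d; i++)" % iterations,
        "    _asm",
        "    {",
    ]
    for k in range(1, n_movs + 1):
        # Assumed pattern: immediates 10, 20, 30, ... (step of 10).
        lines.append("        mov eax, %d" % (10 * k))
    lines.append("    }")
    return "\n".join(lines)


if __name__ == "__main__":
    print(make_benchmark(2))
```

Under the step-of-10 assumption, `make_benchmark(499)` ends with `mov eax, 4990`.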
VTune Results
Last instruction  Trace Cache Misses  Trace Cache Delivery Mode
mov eax, 4980     0                   174,605,634
mov eax, 4990     2,356               173,879,264
mov eax, 5000     3,945               174,448,595
VTune Results cont.
Distance  Run #  Spec microcode uops  Spec TC-built uops  Spec TC-delivered uops  TC Build Mode  TC Deliver Mode  TC Misses  uops Decoded  uops Retired
501       1      -                    224,924             509,636,316             -              176,973,480      4,705      507,140,497   508,671,072
501       2      -                    222,852             510,233,880             441,313        175,451,130      5,505      512,390,816   509,080,482
501       5      41,843               273,615             509,599,380             442,288        177,215,964      10,725     511,918,204   511,939,750
490       1      -                    86,260              491,929,610             -              172,872,716      1,880      499,609,758   498,960,432
490       2      48,614               373,550             498,086,310             -              171,210,494      5,361      497,178,660   497,336,020
490       4      -                    217,040             500,424,768             382,397        173,107,503      6,444      496,964,597   496,790,932
450       1      -                    245,190             455,461,376             -              157,471,452      1,877      458,074,872   460,907,257
450       2      55,108               82,445              457,650,081             -              154,759,410      5,768      460,827,366   459,866,660
450       3      -                    193,896             457,591,820             449,223        158,811,048      12,448     460,118,105   459,147,504
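One way to read these counts is to normalize them. The Python sketch below computes μops retired per loop iteration and trace cache misses per million retired μops for the first run at each distance, using the TC Misses and uops Retired values listed above. It assumes these runs used the same 1,000,000-iteration loop as the experiment code, which the results do not state; the function names are illustrative.

```python
ITERATIONS = 1_000_000  # assumption: same loop count as the experiment code

# (distance, run, TC misses, uops retired), copied from the P4 results above
ROWS = [
    (501, 1, 4_705, 508_671_072),
    (490, 1, 1_880, 498_960_432),
    (450, 1, 1_877, 460_907_257),
]


def uops_per_iteration(uops_retired, iterations=ITERATIONS):
    """Average uops retired per pass through the loop."""
    return uops_retired / iterations


def misses_per_million_uops(tc_misses, uops_retired):
    """Trace cache misses normalized to one million retired uops."""
    return tc_misses * 1_000_000 / uops_retired


for distance, run, misses, retired in ROWS:
    print(distance, run,
          round(uops_per_iteration(retired), 1),
          round(misses_per_million_uops(misses, retired), 2))
```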
VTune Results for P4m
Distance  Run #  Spec uops retired  Spec TC-built uops  Spec TC-delivered uops  TC Build Mode  TC Deliver Mode  TC Misses  uops Retired
150       1      157,706,784        0                   156,600,752             0              53,391,182       4,248      158,219,460
150       2      158,352,360        0                   159,005,262             383,183        55,624,016       2,957      157,856,940
149       1      157,698,240        0                   157,680,678             0              55,166,319       7,248      158,195,300
149       2      157,311,357        0                   157,421,964             389,101        55,592,768       5,192      157,215,060
130       1      163,841,040        0                   137,760,210             0              48,314,569       0          137,856,452
130       2      139,101,786        0                   137,808,330             342,955        48,707,795       9,054      138,242,080
130       3      140,317,920        0                   138,527,055             360,100        50,786,612       0          139,032,684
Sources:

M. Milenkovic, A. Milenkovic, J. Kulick, “Demystifying Intel Branch Predictors,” Proceedings of the Workshop on Duplicating, Deconstructing, and Debunking (held in conjunction with 29th ISCA), Anchorage, Alaska, May 2002
E. Rotenberg, S. Bennett, J. E. Smith, “A Trace Cache Microarchitecture and Evaluation,” IEEE Transactions on Computers, Vol. 48, No. 2, February 1999
http://www6.tomshardware.com/cpu/20001120/p4-06.html
http://www.extremetech.com/article2/0,3973,1488,00.asp
http://www.arstechnica.com/cpu/01q2/p4andg4e/p4andg4e5.htm