Chapter7 - Code Cortex

COMP 2003:
Assembly Language and Digital Logic
Chapter 7: Computer Architecture
Notes by Neil Dickson
About This Chapter
• This chapter delves deeper into the computer
to give an understanding of the issues
regarding the CPU, RAM, and I/O.
• Having an understanding of the underlying
architecture helps with writing efficient
software.
Part 1 of 3: CPU Execution
Pipelining and Beyond
Execution Pipelining
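Old systems:
1 instruction at a time, each passing through these stages in order
Fetch Instruction
Decode Instruction
Load Operand Values
Execute Operation
Store Results
Less-old systems:
multiple independent instructions at a time, each one stage behind the previous one
Cycle:          1        2        3        4        5        6        7        8
Instruction 1:  Fetch    Decode   Load     Execute  Store
Instruction 2:           Fetch    Decode   Load     Execute  Store
Instruction 3:                    Fetch    Decode   Load     Execute  Store
Instruction 4:                             Fetch    Decode   Load     Execute  Store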
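The difference between the two schemes above can be put into numbers. The following is a minimal sketch, assuming every stage takes exactly one cycle and no instruction ever has to wait; the function names are hypothetical and chosen only for this illustration.

```python
# The five stages named on the slide.
STAGES = ["Fetch", "Decode", "Load", "Execute", "Store"]

def cycles_unpipelined(num_instructions, num_stages=len(STAGES)):
    """Old systems: an instruction finishes every stage before the next one starts."""
    return num_instructions * num_stages

def cycles_pipelined(num_instructions, num_stages=len(STAGES)):
    """Ideal pipeline: the first instruction takes num_stages cycles,
    then one more instruction completes on every following cycle."""
    if num_instructions == 0:
        return 0
    return num_stages + (num_instructions - 1)

for n in (1, 4, 1000):
    print(n, cycles_unpipelined(n), cycles_pipelined(n))
```

Under these assumptions 4 instructions take 20 cycles one at a time and 8 cycles when pipelined; for long instruction streams the pipelined count approaches one cycle per instruction.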
A Hardware View of Pipelining
[Figure: Instructions 1 to 7 moving in order through the pipeline hardware listed below]
Instruction Fetching Circuitry
Instruction Decoder(s)
Operand Loading Circuitry
Execution Unit(s)
Results Storing Circuitry
Problem:
What if Instruction 1
stores result in eax
(e.g. “mov eax,1”)
and Instruction 2
needs to load eax?
(e.g. “add ebx,eax”)
Pipeline Dependencies
[Figure: Instructions 1 to 7 moving through the Instruction Fetching Circuitry, Instruction Decoder(s), Operand Loading Circuitry, Execution Unit(s), and Results Storing Circuitry]
Suppose Instruction 1 stores its result in eax
and Instruction 2 needs to load eax.
Instruction 2 has to wait before loading its operands
until the result of Instruction 1 is stored.
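The cost of such a wait can be sketched with a small model. This is only an illustration under stated assumptions: one cycle per stage, instructions stay in order, and an instruction may enter the Load stage only on the cycle after the instruction producing its operand has gone through Store. Real CPUs have hardware tricks that shorten this wait, which the model ignores. The `schedule` function and its tuple format are hypothetical.

```python
STAGES = ["Fetch", "Decode", "Load", "Execute", "Store"]
LOAD, STORE = 2, 4

def schedule(program):
    """program: list of (destination_register, [source_registers]).
    Returns, for each instruction, the cycle in which it enters each stage."""
    times = []
    last_writer = {}  # register name -> index of the latest instruction writing it
    for i, (dest, sources) in enumerate(program):
        t = []
        for s in range(len(STAGES)):
            # An instruction spends at least one cycle in each stage.
            earliest = 1 if s == 0 else t[s - 1] + 1
            if i > 0:
                prev = times[i - 1]
                # Stage s is free once the previous instruction has moved on.
                freed = prev[s + 1] if s < STORE else prev[s] + 1
                earliest = max(earliest, freed)
            if s == LOAD:
                # Wait until every needed register has been stored.
                for reg in sources:
                    if reg in last_writer:
                        earliest = max(earliest, times[last_writer[reg]][STORE] + 1)
            t.append(earliest)
        times.append(t)
        if dest is not None:
            last_writer[dest] = i
    return times

# "mov eax,1" followed by "add ebx,eax" (add reads and writes ebx, and reads eax)
dependent = schedule([("eax", []), ("ebx", ["ebx", "eax"])])
independent = schedule([("eax", []), ("ecx", [])])
print(dependent[1])    # [2, 3, 6, 7, 8]
print(independent[1])  # [2, 3, 4, 5, 6]
```

In this model the dependent instruction finishes two cycles later than an independent one would, and every instruction behind it is delayed as well.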
Problem:
What about
conditional jumps?
Branch Prediction
• Suppose that Instruction 3 is a conditional jump
(e.g. jc MyLabel)
• The “operand” to load is the flags.
• Its execution is to determine whether or not to
jump (i.e. where to go next).
• Its result is stored in the instruction pointer, eip.
• What comes next is unknown until the execution
stage, so the CPU makes a prediction first and
checks it in the execution stage.
Branch Prediction and the Pipeline
[Figure: Instructions 1 to 6 in the pipeline hardware (Instruction Fetching Circuitry, Instruction Decoder(s), Operand Loading Circuitry, Execution Unit(s), Results Storing Circuitry)]
Suppose Instruction 3 is a conditional jump.
Instruction 2 changed the flags, so wait here.
Oh no! It turned out that the CPU guessed wrong,
so clear the pipeline and start from the new eip (Instruction 4’).
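The slides do not say how the CPU makes its guess. As an illustration only, the sketch below uses a 2-bit saturating counter, a classic textbook scheme that is assumed here and is not claimed to be what any particular CPU does. It counts how often the guess is wrong for a given sequence of jump outcomes; the function name is hypothetical.

```python
def count_mispredictions(outcomes, state=2):
    """outcomes: sequence of booleans, True meaning the jump was taken.
    state: 2-bit counter from 0 to 3; values 2 and 3 predict "taken"."""
    misses = 0
    for taken in outcomes:
        predicted_taken = state >= 2
        if predicted_taken != taken:
            misses += 1  # the pipeline would have to be cleared here
        # Move the counter toward what actually happened, staying within 0..3.
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return misses

loop_branch = [True] * 9 + [False]   # a loop that jumps back 9 times, then exits
alternating = [True, False] * 5      # a jump that flips direction every time

print(count_mispredictions(loop_branch))   # 1
print(count_mispredictions(alternating))   # 5
```

A regular loop costs a single wrong guess at its exit, while the alternating pattern is guessed wrong half the time, which is the kind of hard-to-predict jump that incurs a pipeline clear each time.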
Pipelining Pros/Cons
• Pro: Only one set of each hardware
component is needed (plus some hardware to
manage the pipeline)
• Pro: Initial concept was simple
• Con: Programmer/compiler must try to
eliminate dependencies, which can be tough,
else face big performance penalties
• Con: The actual hardware can get complicated
• Note: No longer short on CPU die space, so
first Pro doesn’t matter much anymore
Beyond Pipelining
• For jumps that are hard to predict, guess
BOTH directions, and keep two copies of
results based on the guess (e.g. 2 of each
register)
• Allow many instructions in at once (e.g.
multiple decoders, multiple execution units,
etc.) so that there’s a higher probability of
more operations that can run concurrently
• Vector instructions (multiple data together)
Intel Core i7 Execution Architecture
[Figure: block diagram of one core, with the blocks listed below]
• 32KB Instruction Cache
• Branch Prediction
• 16-byte Prefetch Buffer
• Initial (Length) Decoder
• Queue of ≤18 Instructions
• 4 Decoders: split instructions into parts called “MicroOps”
• Buffer of ≤128 MicroOps
• 2 Copies of Registers
• Several 128-bit Execution Units
• Load from Memory
• Store to Memory
• 32KB Data Cache
• L2 Cache
• L3 Cache (filled from RAM)
What About Multiple Cores?
• What we’ve looked at so far is a single CPU
core’s execution.
• A CPU core is a copy of this functionality on
the CPU die, so a quad-core CPU has 4 copies
of everything shown (except the larger caches,
which are shared between cores).
• Instead of trying to run multiple instructions
from the same stream of code concurrently, as
before, each core runs independently of any
others (one thread on each)
Confusion About Cores
• “Cores” in GPUs and custom processors like the
Cell are not independent, whereas cores in
standard CPUs are, so this has led to great
confusion and misunderstanding.
• The operating system decides what instruction
stream (thread) to run on each CPU core, and can
periodically change this (thread scheduling)
• These issues are not part of this course, but may
be covered in a parallel computing or operating
systems course.
Part 2 of 3: Memory
Caches and Virtual Memory
Memory Caches
• Caches hold copies of parts of RAM on the CPU, so that
repeated accesses take less time
• A cache miss is when one checks a cache for a
piece of memory that is not there
• Larger caches have fewer misses, but are slower,
so modern CPUs have multiple levels of cache:
– Memory Buffers (ignored here), L1 Cache, L2 Cache, L3
Cache, RAM
• CPU only accesses memory through cache under
normal circumstances
Reading From Cache
• want value of memory at location A
• if A is not in L1
    • if A is not in L2
        • if A is not in L3
            • L3 reads A from RAM
        • L2 reads A from L3
    • L1 reads A from L2
• read A from L1
• Note: A is now in all levels of cache
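The read steps above can be sketched as a small simulation. It is a simplification under stated assumptions: each cache level is modelled as a dictionary with unlimited room, so nothing is ever evicted, and single addresses are cached instead of whole blocks. The `read` function and its return format are hypothetical.

```python
def read(address, levels, ram):
    """levels: list of dicts ordered [L1, L2, L3]; ram: dict of address -> value.
    Returns (value, name of the place where the value was found)."""
    found = len(levels)  # assume RAM unless some cache level has the address
    for depth, cache in enumerate(levels):
        if address in cache:
            found = depth
            break
    value = ram[address] if found == len(levels) else levels[found][address]
    # Every level closer to the CPU than the hit now gets a copy.
    for cache in levels[:found]:
        cache[address] = value
    source = "RAM" if found == len(levels) else f"L{found + 1}"
    return value, source

l1, l2, l3 = {}, {}, {}
ram = {0x10: 42}
print(read(0x10, [l1, l2, l3], ram))  # (42, 'RAM'): missed in every cache
print(read(0x10, [l1, l2, l3], ram))  # (42, 'L1'): A is now in all levels
```

The second read is served from L1 because the first read left a copy of A in every level.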
Writing to Cache
• want to store value into memory at location A
• write A into L1
• after time delay, L1 writes A into L2
• after time delay, L2 writes A into L3
• after time delay, L3 writes A into RAM
• Note: the time delays could result in
concurrency issues in multi-core CPUs, so
write caching can get more complicated
Caching Concerns
• Randomly accessing memory causes many
more cache misses than sequentially accessing
memory or accessing relatively few locations
– This is how quicksort is usually not so quick
compared to mergesort
• Writing to a huge block of memory that won’t
be read soon can cause cache misses later,
since it fills up caches with the written data
– There are special instructions to indicate not to
cache certain writes, avoiding this in assembly
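The first concern can be illustrated by counting misses in a toy cache. This is a sketch under assumptions that are not from the slides: a single cache of 64 slots, each holding one 64-byte block, where every block of memory can live in exactly one slot. Real caches are larger and more flexible, but the trend is the same.

```python
import random

def count_misses(addresses, num_slots=64, block_size=64):
    """Count cache misses for a sequence of byte addresses in a toy cache
    where each memory block maps to exactly one slot."""
    slots = [None] * num_slots
    misses = 0
    for addr in addresses:
        block = addr // block_size
        slot = block % num_slots
        if slots[slot] != block:
            misses += 1
            slots[slot] = block  # bring the block in, replacing what was there
    return misses

n = 4096
sequential = [i * 4 for i in range(n)]  # 4-byte elements, one after another
rng = random.Random(0)                  # fixed seed so runs are repeatable
scattered = [rng.randrange(0, 1 << 20, 4) for _ in range(n)]  # spread over 1MB

print(count_misses(sequential))  # 256: one miss per 64-byte block
print(count_misses(scattered))
```

The sequential pass misses once per block and then hits for the other 15 elements in that block, while the scattered accesses miss on almost every access.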
Paging
• Paging, a.k.a. virtual memory mapping, is a
feature of CPUs that allows the apparent
rearrangement of physical memory blocks into
one or more virtual memory spaces.
• 3 main reasons for this:
– Programs can be in separate memory spaces, so they
don’t interfere with each other
– The OS can give the illusion of more memory using the
hard drive
– The OS can prevent programs from messing up the
system (accidentally or intentionally)
Virtual Memory
• With a modern OS, no memory accesses by a
program directly access physical memory
• Virtual addresses are mapped to physical
addresses in 4KB or 2MB pages using page
tables, set up by the OS.
Page Tables
[Figure: the page table for Dude.exe and the page table for Sweet.exe, each with entries for virtual pages 0 to 7 (and onward), pointing into physical memory, whose physical pages are numbered 0 to F (and onward)]
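The mapping from virtual to physical addresses can be sketched as follows, assuming 4KB pages and a page table modelled as a simple dictionary. The page numbers in the two tables are made up for this illustration and are not the ones from the original figure; the `translate` function is hypothetical.

```python
PAGE_SIZE = 4096  # 4KB pages

def translate(virtual_address, page_table):
    """page_table: dict of virtual page number -> physical page number."""
    virtual_page, offset = divmod(virtual_address, PAGE_SIZE)
    if virtual_page not in page_table:
        # On a real system the CPU would raise an exception for the OS to handle.
        raise LookupError(f"virtual page {virtual_page} is not mapped")
    # The offset within the page is the same in both address spaces.
    return page_table[virtual_page] * PAGE_SIZE + offset

# Hypothetical page tables for the two programs named on the slide.
dude_table = {0: 0x3, 1: 0xA}
sweet_table = {0: 0x7, 1: 0xC}

print(hex(translate(0x1234, dude_table)))   # 0xa234
print(hex(translate(0x1234, sweet_table)))  # 0xc234
```

The same virtual address lands in different physical pages for the two programs, which is how separate memory spaces keep them from interfering with each other.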
Part 3 of 3: I/O and Interrupts
Just an Overview
Common I/O Devices
• Human Interface (what most people think of)
– Keyboard, Mouse, Microphone, Speaker, Display,
Webcam, etc.
• Storage
– Hard Drive, Optical Drive, USB Key, SD Card
• Adapters
– Network Card, Graphics Card
• Timers (very important for software)
– PITs, LAPIC Timers, CMOS Timer
If There’s One Thing to Remember
• I/O IS SLOW!
• Bad Throughput:
– Mechanical drives can transfer up to 127MB/s
– Memory bus can transfer up to 30,517 MB/s
(or more for modern ones)
• Very Bad Latency:
– 10,000 RPM drive average latency: 3,000,000ns
– 1333MHz uncached memory average latency: 16ns
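The size of the gap can be worked out from the figures above. The sketch below also checks that the drive latency matches the time for half a revolution at 10,000 RPM; that reading of the figure is an assumption, and it ignores the time needed to move the drive head.

```python
def average_rotational_latency_ns(rpm):
    """On average, the platter must turn half a revolution to reach the data."""
    ns_per_revolution = 60 * 1_000_000_000 // rpm
    return ns_per_revolution // 2

drive_latency_ns = average_rotational_latency_ns(10_000)
memory_latency_ns = 16
latency_ratio = drive_latency_ns / memory_latency_ns

drive_mb_per_s = 127
memory_mb_per_s = 30_517
throughput_ratio = memory_mb_per_s / drive_mb_per_s

print(drive_latency_ns)         # 3000000
print(latency_ratio)            # 187500.0
print(round(throughput_ratio))  # 240
```

With these numbers memory moves data about 240 times faster than the drive, but responds about 187,500 times sooner, so latency is by far the larger problem.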
I/O Terminology
• I/O Ports or Memory-Mapped I/O?
– Some devices are controlled through special “I/O
ports” accessible with the “in” and “out”
instructions.
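– Some devices make themselves controllable by
occupying blocks of memory addresses and intercepting
any reads or writes to those addresses instead of using
“in” and “out”. This is called memory-mapped I/O.
It is not the same as Direct Memory Access (DMA), in
which a device reads or writes RAM itself without the
CPU copying the data.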
I/O Terminology
• Programmed I/O or Interrupt-Driven I/O?
– Programmed I/O is controlling a device’s
“operation” step-by-step with the CPU
– Interrupt-Driven I/O involves the CPU setting up
some “operation” to be done by a device and
getting “notified” by the device when the
“operation” is done
– Most I/O in a modern system is interrupt-driven
Interrupts
• Instead of continuously checking for keyboard
or mouse input, can be notified of it when it
happens
• Instead of waiting idly for the hard drive to
finish writing data, can do other work and be
notified when it’s done
• Such a notification is called an I/O interrupt.
• (There are also exception interrupts e.g. for
when doing an integer division by zero.)
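The difference between continuously checking a device and being notified by it can be sketched with a simulated device. Everything here is hypothetical: the `Device` class stands in for real hardware, a "tick" stands in for the passage of time, and a Python callback stands in for the interrupt.

```python
class Device:
    """Hypothetical device that finishes its operation after a number of ticks."""
    def __init__(self, ticks_needed):
        if ticks_needed < 1:
            raise ValueError("ticks_needed must be at least 1")
        self.ticks_left = ticks_needed
        self.on_done = None  # function to call when finished (the "interrupt")

    @property
    def done(self):
        return self.ticks_left == 0

    def tick(self):
        if self.ticks_left > 0:
            self.ticks_left -= 1
            if self.ticks_left == 0 and self.on_done is not None:
                self.on_done()  # notify the CPU that the operation is done

def wait_by_checking(device):
    """The CPU does nothing except check the device until it is done."""
    checks = 0
    while not device.done:
        checks += 1
        device.tick()
    return checks

def wait_by_interrupt(device):
    """The CPU does other work and only finds out when it is notified."""
    state = {"finished": False, "useful_work": 0}
    def handler():
        state["finished"] = True
    device.on_done = handler
    while not state["finished"]:
        state["useful_work"] += 1  # stands in for running other code
        device.tick()
    return state["useful_work"]

print(wait_by_checking(Device(5)))   # 5 status checks, no other work done
print(wait_by_interrupt(Device(5)))  # 5 units of other work done meanwhile
```

Both versions take the same time to finish, but only the second one spends that time on something other than checking the device.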
Interrupts
• When an interrupt occurs, the CPU stops what it
was doing and calls a function specified by the OS
to handle the interrupt.
– This function is an interrupt handler
• The interrupt handler deals with the I/O
operation (e.g. saves a typed key) and returns,
resuming whatever was interrupted
• Because interrupts can occur at any time, values
on the stack below esp may change at any time