CS 352H: Computer Systems Architecture

Lecture 1: What is Computer Architecture and why should I care?
Professor Emmett Witchel
University of Texas at Austin
Goals
• Understand the “how” and “why” of computer system organization
– Instruction Set Architecture
– System Organization (processor, memory, I/O)
– Microarchitecture
– Virtualization
• Learn methods of evaluating performance
– Metrics & benchmarks
• Learn how to make systems go fast
– Pipelining, caching
– Parallelism (ILP, DLP, TLP)
– Application specific architectures (graphics, signal proc.)
• Preview of where architecture is heading
Logistics
Lectures: T/Th 12:30-2:00pm, PAI 3.14
Instructor: Prof. Emmett Witchel, W 1:15-2:15
TA: Shalini Sahoo, MW 11:30-1:00pm, PAI 5.38 Desk 1
Grading: see web page
Texts: Patterson & Hennessy, Computer Organization and Design (Fourth Edition), including CD
Revised Fourth Edition preferred, not required
CS352H Online
URL: www.cs.utexas.edu/users/witchel/CS352H
I will occasionally email you via blackboard and by your
registered email address. I expect this channel to be
reliable and timely.
discussion group: via blackboard
login at courses.utexas.edu
General, Homeworks, Project
Computer Architecture Seminar Series:
www.cs.utexas.edu/users/cart/arch
Assignment for Next Tuesday
• Turn in student survey forms, if you want
• Read the Moore paper (see webpage)
– Write a review of 1/2-1 page (see syllabus)
– Review should include
• Summary of content of paper
• Your observations on the most interesting/important aspects
• Your observations on its relevance today
– Be prepared to discuss on Tuesday in class
Discussion
• Are you interested in taking this course?
• One question about computer science
• One question about computer architecture
CS352H
Fall 2007
Arch vs. µarch
Specification: compute the Fibonacci sequence
Program:
for (i = 2; i < 100; i++) {
    a[i] = a[i-1] + a[i-2];
}
ISA (Instruction Set Architecture):
load r1, a[i];
add r2, r2, r1;
microArchitecture
Logic
Transistors
Physics/Chemistry
[Figure: abstraction layers from specification down to physics/chemistry, with logic-gate and transistor schematics]
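To make the step from the Program layer to the ISA layer concrete, here is a minimal Python sketch of the same recurrence with each array access and addition written as a separate register-style step. The seed values a[0] = 0 and a[1] = 1, the function name `fib_table`, and the explicit store step are assumptions for illustration; the slide shows only the loop and two sample instructions.

```python
def fib_table(n=100):
    """Fill a[0..n-1] with Fibonacci numbers, one 'instruction' per line."""
    a = [0] * n
    if n > 1:
        a[1] = 1  # assumed seeds: a[0] = 0, a[1] = 1 (not given on the slide)
    for i in range(2, n):
        r1 = a[i - 1]   # load r1, a[i-1]
        r2 = a[i - 2]   # load r2, a[i-2]
        r2 = r2 + r1    # add  r2, r2, r1
        a[i] = r2       # store r2 back to a[i] (hypothetical store step)
    return a


if __name__ == "__main__":
    print(fib_table(10))
```

Python integers grow as needed, so the sketch runs to i = 99 without trouble. With these seeds a[99] does not fit in 64 bits, which a C version with fixed-width integers would have to account for.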
CS352H Topics
• Technology Trends
• Instruction set architectures
• Pipelining
• Modern pipelined architectures
– Dynamic ILP machines
– Static ILP machines
• Cache memory systems
• Virtual memory
• Multiprocessors
• Computer system implementation
Making This Class Work For You
• Plus and minus grades
• Clickers
What is Computer Architecture?
[Figure: the Computer Architect at the center, balancing Applications, Technology, Interfaces (API, ISA, Link, I/O Chan), Machine Organization (Regs, IR), and Measurement & Evaluation]
Technology Constraints
• Yearly improvement
– Semiconductor technology
• 60% more devices per chip (doubles every 18 months)
• 15% faster devices (doubles every 5 years)
• Slower wires
– Magnetic Disks
• 60% increase in density
– Circuit boards
• 5% increase in wire density
– Cables
• no change
[Figure: process generations 1000nm (1989), 800nm (1992), 350nm (1995), 250nm (1998), 130nm (2002), 90nm (2006); >100x more devices since 1989, 10x faster devices]
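The doubling times on this slide follow from compound growth: a quantity growing at a fixed rate r per year doubles in ln 2 / ln(1 + r) years. The Python sketch below checks the slide's figures under the assumption that the 60% and 15% yearly rates hold steadily from 1989 to 2006. The function names are illustrative and not part of the course material.

```python
import math


def doubling_time_years(annual_rate):
    """Years needed to double at a fixed compound annual growth rate."""
    return math.log(2) / math.log(1 + annual_rate)


def growth_factor(annual_rate, years):
    """Total multiplicative growth after `years` of compound growth."""
    return (1 + annual_rate) ** years


if __name__ == "__main__":
    years = 2006 - 1989
    print(f"60%/yr doubles every {doubling_time_years(0.60) * 12:.1f} months")
    print(f"15%/yr doubles every {doubling_time_years(0.15):.2f} years")
    print(f"devices per chip after {years} years: {growth_factor(0.60, years):.0f}x")
    print(f"device speed after {years} years: {growth_factor(0.15, years):.1f}x")
```

Under these assumptions, 60% per year doubles in a little under 18 months and 15% per year doubles in about 5 years. Over the 17 years the speed factor comes out near 10x, and the device count grows well beyond the slide's 100x lower bound.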
Changing Technology leads to Changing Architecture
• 1970s
– multi-chip CPUs
– semiconductor memory very expensive
– microcoded control
– complex instruction sets (good code density)
• 1980s
– single-chip CPUs, on-chip RAM feasible
– simple, hard-wired control
– simple instruction sets
– small on-chip caches
• 1990s
– lots of transistors
– complex control to exploit instruction-level parallelism
• 2000s
– even more transistors
– Power wall
– Transition to CMPs
– Multi-level caches
• 2010s
– Embedded vs. Desktop vs. Data center (cloud)
– New storage (PCM, flash)
– Simpler cores and lots of them
– Optimizing for power
Intel 4004 - 1971
• The first microprocessor
• 2,300 transistors
• 108 kHz
• 10µm process
Some Recent Chips!
Intel Pentium IV
• 42 million transistors
• 4GHz
• 0.13µm process
• Could fit ~15,000 4004s on this chip!
Intel’s net revenue was around $35 billion a year for most of the aughts
R&D: about $5 billion a year
NVidia GeForce 6800
• 222 million transistors
• 400MHz
• 0.13µm process
Intel Itanium II (Montecito)
• 1.7 billion transistors
• 1.6 GHz
• 90nm process
IBM Cell
• 8 vector processors + 1 PPC
• 4 GHz
• 90nm process
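As a rough sanity check on the "~15,000 4004s" remark, the Python sketch below divides the transistor counts quoted on these slides by the 4004's 2,300 transistors. It assumes transistor count alone is the basis for comparison and ignores die area and layout. The dictionary and function names are illustrative.

```python
# Transistor counts as quoted on the slides.
TRANSISTORS = {
    "Intel 4004": 2_300,
    "Intel Pentium IV": 42_000_000,
    "NVidia GeForce 6800": 222_000_000,
    "Intel Itanium II (Montecito)": 1_700_000_000,
}


def equivalent_4004s(transistors, baseline=TRANSISTORS["Intel 4004"]):
    """How many 4004s' worth of transistors a chip contains."""
    return transistors / baseline


if __name__ == "__main__":
    for name, count in TRANSISTORS.items():
        print(f"{name}: about {equivalent_4004s(count):,.0f} x 4004")
```

By this count the Pentium IV holds roughly 18,000 4004s' worth of transistors, the same order of magnitude as the slide's estimate.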
Any Architecture You Want (as long as it is x86)
Application Constraints
• Applications drive machine ‘balance’
– Numerical simulations
• floating-point performance
• main memory bandwidth
– Transaction processing
• I/Os per second
• integer CPU performance
– Decision support
• I/O bandwidth
– Embedded control
• I/O timing, power
– Media processing
• low-precision ‘pixel’ arithmetic
Application-Driven Architectures
• General purpose - good performance on “all” programs
– x86 family, ARM, PowerPC, etc.
• Application specificity can focus on:
– Types of concurrency available
– Domain of deployment (server, handheld, desktop)
• Today - overview of graphics processors
– Interface (instruction set architecture - ISA)
– Processor organization
– Concurrent elements
Apple’s iPad/iPhone4 Powered by A4 Chip
• A4 is a modified ARM Cortex running at 1GHz
– Integrated processor, graphics, memory controller
• Among other claims, ARM says the processor gets a near "25 percent processing power boost, even at same processor speed, from the use of a new instruction pipelining system."
– We will cover pipelining in this class.
• Claim: 10 hours of 1024x768 video on a 25Wh battery
• Let’s look at the Freescale i.MX51
Performance: Latency and Throughput
• Latency: time to complete an operation
• Throughput: work completed per unit time
• Consider plumbing
– Low latency: turn on faucet and water comes out
– High bandwidth: lots of water (e.g., to fill a pool)
• What is “High speed Internet?”
– Low latency: needed for interactive gaming
– High bandwidth: needed for downloading large files
– Marketing departments like to conflate latency and bandwidth…
Relationship between Latency and Throughput
• Latency and bandwidth only loosely coupled
– Henry Ford: assembly lines increase bandwidth without reducing latency
• My factory takes 1 day to make a Model-T Ford.
– But I can start building a new car every 10 minutes
– At 24 hrs/day, I can make 24 * 6 = 144 cars per day
– A special order for 1 green car still takes 1 day
– Throughput is increased, but latency is not.
• Latency reduction is difficult
• Often, one can buy bandwidth
– E.g., more memory chips, more disks, more computers
– Big server farms (e.g., Google) are high bandwidth
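The Model-T arithmetic can be sketched in a few lines of Python. The sketch assumes an idealized assembly line that starts one car at a fixed interval and never stalls. The function names are illustrative.

```python
MINUTES_PER_DAY = 24 * 60


def cars_per_day(start_interval_min):
    """Steady-state throughput: one car leaves the line per start interval."""
    return MINUTES_PER_DAY // start_interval_min


def finish_time_min(n_cars, latency_min, start_interval_min):
    """Minutes until the n-th car is done when a new car starts every interval."""
    return latency_min + (n_cars - 1) * start_interval_min


if __name__ == "__main__":
    latency = MINUTES_PER_DAY  # one car takes a full day from start to finish
    print("throughput:", cars_per_day(10), "cars/day")
    print("one special-order car:", finish_time_min(1, latency, 10), "minutes")
    print("144 cars:", finish_time_min(144, latency, 10), "minutes")
```

Throughput depends only on the start interval (144 cars per day), while a single car still takes the full 1440 minutes. Shortening the interval buys more bandwidth but does nothing for the special order.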
What is cloud computing?
• Cloud computing is where dynamically scalable and often virtualized resources are provided as a service over the Internet (thanks, Wikipedia!)
• Infrastructure as a service (IaaS)
– Amazon’s EC2 (Elastic Compute Cloud)
• Platform as a service (PaaS)
– Google Gears
– Microsoft Azure
• Software as a service (SaaS)
– Gmail
– Facebook
– Flickr
Thanks, James Hamilton, Amazon
Graphics has dedicated chip in PCs
[Figure: PC block diagram]
• CPU: 582 million transistors (Intel “Kentsfield” quad core, QX6700, 65nm, two dies, 8MB L2$)
• Memory Controller Chip (“North Bridge”), linking CPU, memory, and graphics processor (AGP, PCIe)
• Memory (four modules)
• Graphics Processor: 681 million transistors (GeForce 8800, 90nm)
• Input/Output Glue Chip (“South Bridge”): Disk, Keyboard, PCIe, etc.
GPU/CPU Performance comparison
[Figure: GFLOPS of successive NVIDIA GPUs, with IBM Cell (~200 GFLOPS) and Core 2 Quad 3GHz (96 GFLOPS) marked for comparison]
G80 = GeForce 8800 GTX
G71 = GeForce 7900 GTX
G70 = GeForce 7800 GTX
NV40 = GeForce 6800 Ultra
NV35 = GeForce FX 5950 Ultra
NV30 = GeForce FX 5800
Source: NVIDIA (except CELL and Core2 Quad)
Why a dedicated processing chip?
• 1) Specialization – becoming less important with time
• 2) Parallelism – becoming more important
Graphics processors are the only highly-parallel processors in every desktop machine.
128 “processors” * 2 FLOPS @ 1.35 GHz
You can program them!
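The peak-throughput figure behind "128 processors * 2 FLOPS @ 1.35 GHz" is a simple product, sketched below in Python. It reads "2 FLOPS" as two floating-point operations per processor per clock cycle. The Core 2 Quad line additionally assumes 4 cores at 8 single-precision operations per cycle each, which is not stated in the slides but reproduces the 96 GFLOPS quoted in the comparison chart. The function name is illustrative.

```python
def peak_gflops(units, flops_per_cycle, clock_ghz):
    """Peak throughput = parallel units x operations per cycle x clock rate."""
    return units * flops_per_cycle * clock_ghz


if __name__ == "__main__":
    # GeForce 8800: numbers taken from the slide.
    print(f"GeForce 8800: {peak_gflops(128, 2, 1.35):.1f} GFLOPS")
    # Core 2 Quad: 8 flops/cycle/core is an assumption, see the text above.
    print(f"Core 2 Quad:  {peak_gflops(4, 8, 3.0):.1f} GFLOPS")
```

These are peak rates that assume every unit is busy on every cycle. Sustained throughput on real workloads is lower.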
Graphics requires programmability
Every application does something a bit different.
Example Cg “shader” program (invoked like a “callback” function):
void normalmapped(float2 normalMapTexCoord : TEXCOORD0,
                  …
                  out float4 color : COLOR,
                  uniform float ambient,
                  …)
{
    float3 normalTex, …;
    normalTex = tex2D(normalMap, normalMapTexCoord).xyz;
    …
    diffuse = saturate(dot(normal, normLightDir));
    …
    color = Kd * (ambient + diffuse) +
            Ks * pow(specular, specularExponent);
}
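The final lighting expression in the shader can be tried on the CPU with a small Python sketch. It handles a single scalar color channel rather than a float4. It also computes the specular term from a half-angle vector, which is an assumption (a Blinn-style choice), since the slide elides how `specular` is obtained. The names `shade` and `half_vec` and the helper functions are hypothetical.

```python
import math


def saturate(x):
    """Clamp to [0, 1], mirroring Cg's saturate()."""
    return max(0.0, min(1.0, x))


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a non-zero vector to unit length."""
    length = math.sqrt(dot(v, v))
    return tuple(x / length for x in v)


def shade(normal, light_dir, half_vec, kd, ks, ambient, specular_exponent):
    """One-channel version of: Kd * (ambient + diffuse) + Ks * specular^exponent."""
    n = normalize(normal)
    diffuse = saturate(dot(n, normalize(light_dir)))
    specular = saturate(dot(n, normalize(half_vec)))  # assumed Blinn-style term
    return kd * (ambient + diffuse) + ks * specular ** specular_exponent


if __name__ == "__main__":
    # Light shining straight along the surface normal.
    print(shade((0, 0, 1), (0, 0, 1), (0, 0, 1), 0.5, 0.5, 0.1, 8))
    # Light grazing the surface: only the ambient term remains.
    print(shade((0, 0, 1), (1, 0, 0), (1, 0, 0), 0.5, 0.5, 0.1, 8))
```

On a GPU this function body runs once per pixel, with many pixels processed in parallel. That is the sense in which the shader is invoked like a callback.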
GeForce 8800
Next Time
• Performance evaluation
• Basic computer organization
• How chips are made
• Start in on instruction set review/overview
• Always check web page for assignments