Interconnection Networks
(based on D. Patterson’s lectures and
Hennessy/Patterson’s book)
1
Networks
• Goal: Communication between computers
• Eventual Goal: treat collection of computers as one big
computer with distributed resource sharing
• Theme: Different computers must agree on many
things
– Overriding importance of standards and protocols
– Error tolerance critical as well
• Warning: Terminology-rich environment
2
Networks
• Facets people talk a lot about:
– direct (point-to-point) vs. indirect (multi-hop)
– topology (e.g., bus, ring, DAG)
– routing algorithms
– switching (aka multiplexing)
– wiring (e.g., choice of media, copper, coax, fiber)
• What really matters:
– latency
– bandwidth
– cost
– reliability
3
Interconnections (Networks)
• Examples (see Figure 7.19, page 633):
– Wide Area Network (ATM): 100-1000s nodes; ~ 5,000 kilometers
– Local Area Networks (Ethernet): 10-1000 nodes; ~ 1-2 kilometers
– System/Storage Area Networks (FC-AL): 10-100s nodes;
~ 0.025 to 0.1 kilometers per link
(Figure: nodes, a.k.a. end systems or hosts, attached to an interconnection network, a.k.a. network or communication subnet)
4
SAN: Storage vs. System
• Storage Area Network (SAN): A block I/O oriented
network between application servers and storage
– Fibre Channel is an example
• Usually high bandwidth requirements, and less
concerned about latency
– in 2001: 1 Gbit bandwidth and millisecond latency OK
• Commonly a dedicated network
(that is, not connected to another network)
• May need to work gracefully when saturated
• Given larger block size, may have higher bit error
rate (BER) requirement than LAN
5
SAN: Storage vs. System
• System Area Network (SAN): A network aimed at
connecting computers
– Myrinet is an example
• Aimed at High Bandwidth AND Low Latency.
– in 2001: > 1 Gbit bandwidth and ~ 10 microsecond latency
• May offer in order delivery of packets
• Given larger block size, may have higher bit error
rate (BER) requirement than LAN
6
More Network Background
• Connection of 2 or more networks:
Internetworking
• 3 cultures for 3 classes of networks
– WAN: telecommunications, Internet
– LAN: PCs, workstations, servers: cost
– SAN: Clusters, RAID boxes: latency (System A.N.) or
bandwidth (Storage A.N.)
• Try for single terminology
• Motivate the interconnection complexity
incrementally
7
ABCs of Networks
• Starting Point: Send bits between 2 computers
• Queue (FIFO) on each end
• Information sent called a “message”
• Can send both ways (“Full Duplex”)
• Rules for communication? “protocol”
– Inside a computer:
» Loads/Stores: Request (Address) & Response (Data)
» Need Request & Response signaling
8
A Simple Example
• What is the format of message?
– Fixed? Number bytes?
Request/Response (1 bit) | Address/Data (32 bits)
0: Please send data from Address
1: Packet contains data corresponding to request
• Header/Trailer: information to deliver a message
• Payload: data in message (1 word above)
9
Questions About Simple Example
• What if more than 2 computers want to communicate?
– Need computer “address field” (destination) in packet
• What if packet is garbled in transit?
– Add “error detection field” in packet (e.g., Cyclic Redundancy Chk)
• What if packet is lost?
– More “elaborate protocols” to detect loss
(e.g., NAK, ARQ, time outs)
• What if multiple processes/machine?
– Queue per process to provide protection
• Simple questions such as these lead to more complex
protocols and packet formats => complexity
10
A Simple Example Revisited
• What is the format of packet?
– Fixed? Number bytes?
Request/Response (2 bits) | Address/Data (32 bits) | CRC (4 bits)
00: Request: Please send data from Address
01: Reply: Packet contains data corresponding to request
10: Acknowledge request
11: Acknowledge reply
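A minimal sketch of this packet format in Python, assuming a 2-bit type field (which the four codes 00 to 11 require), a 32-bit address/data word, and a 4-bit CRC. The generator polynomial x^4 + x + 1 and the names `crc4`, `make_packet`, and `parse_packet` are illustrative assumptions, not part of the slides.

```python
# Packet layout (38 bits): [2-bit type][32-bit address/data][4-bit CRC]
REQUEST, REPLY, ACK_REQUEST, ACK_REPLY = 0b00, 0b01, 0b10, 0b11

CRC_POLY = 0b10011  # x^4 + x + 1, an assumed polynomial for illustration


def crc4(value: int, nbits: int) -> int:
    """Remainder of value * x^4 divided by CRC_POLY (bitwise long division)."""
    reg = value << 4  # append four zero bits for the remainder
    for shift in range(nbits + 3, 3, -1):
        if reg & (1 << shift):
            reg ^= CRC_POLY << (shift - 4)
    return reg & 0xF


def make_packet(kind: int, word: int) -> int:
    """Pack type and word, then append the CRC over those 34 bits."""
    body = ((kind & 0b11) << 32) | (word & 0xFFFFFFFF)
    return (body << 4) | crc4(body, 34)


def parse_packet(packet: int) -> tuple[int, int]:
    """Return (type, word); raise ValueError if the CRC does not match."""
    body, crc = packet >> 4, packet & 0xF
    if crc4(body, 34) != crc:
        raise ValueError("CRC mismatch: packet garbled in transit")
    return body >> 32, body & 0xFFFFFFFF
```

A receiver would call `parse_packet` and answer a `REQUEST` with an `ACK_REQUEST` followed by a `REPLY`; a packet that raises `ValueError` is dropped so that the sender's timeout triggers a resend.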
11
Software to Send and Receive
• SW Send steps
1: Application copies data to OS buffer
2: OS calculates checksum, starts timer
3: OS sends data to network interface HW and says start
• SW Receive steps
3: OS copies data from network interface HW to OS buffer
2: OS calculates checksum, if matches send ACK; if not, deletes message
(sender resends when timer expires)
1: If OK, OS copies data to user address space and signals application
to continue
• Sequence of steps for SW: protocol
– Example similar to UDP/IP protocol in UNIX
12
Network Performance Measures
• Overhead: latency of interface vs. Latency: network
13
Universal Performance Metrics
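(Figure: timeline of one message. Sender: sender overhead (processor busy), then transmission time (size ÷ bandwidth). Time of flight. Receiver: transmission time (size ÷ bandwidth), then receiver overhead (processor busy). Transport latency spans time of flight plus transmission time; total latency also includes both overheads.)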
Total Latency = Sender Overhead + Time of Flight +
Message Size ÷ BW + Receiver Overhead
Includes header/trailer in BW calculation?
14
Total Latency Example
• 1000 Mbit/sec., sending overhead of 80 µsec & receiving
overhead of 100 µsec.
• a 10000 byte message (including the header), allows 10000
bytes in a single message
• 3 situations: distance 1000 km vs. 0.5 km vs. 0.01 km
• Speed of light ~ 300,000 km/sec (1/2 in media)
• Latency0.01km =
• Latency0.5km =
• Latency1000km =
15
Total Latency Example
• 1000 Mbit/sec., sending overhead of 80 µsec & receiving
overhead of 100 µsec.
• a 10000 byte message (including the header), allows 10000
bytes in a single message
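• 3 situations: distance 0.01 km vs. 0.5 km vs. 1000 km
• Speed of light ~ 300,000 km/sec (1/2 in media)
• Latency0.01km = 80 µsec + 0.01 km / (50% x 300,000 km/sec)
+ 10000 x 8 bits / 1000 Mbit/sec + 100 µsec = 260 µsec
• Latency0.5km = 80 µsec + 0.5 km / (50% x 300,000 km/sec)
+ 10000 x 8 bits / 1000 Mbit/sec + 100 µsec = 263 µsec
• Latency1000km = 80 µsec + 1000 km / (50% x 300,000 km/sec)
+ 10000 x 8 bits / 1000 Mbit/sec + 100 µsec = 6927 µsec
(6931 µsec with the exact speed of light, 299,792.5 km/sec)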
• Long time of flight => complex WAN protocol
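The total latency formula can be checked with a short Python sketch. The function name and parameter names are illustrative; the default values are the ones stated in the example (1000 Mbit/sec, 80 and 100 µsec overheads, signals at half of ~300,000 km/sec).

```python
def total_latency_us(distance_km, size_bytes, bw_mbit_s=1000.0,
                     send_overhead_us=80.0, recv_overhead_us=100.0,
                     light_km_s=300_000.0, media_factor=0.5):
    """Total latency = sender overhead + time of flight
    + message size / bandwidth + receiver overhead (all in µsec)."""
    time_of_flight_us = distance_km / (media_factor * light_km_s) * 1e6
    transmission_us = size_bytes * 8 / bw_mbit_s  # bits / (Mbit/s) = µsec
    return (send_overhead_us + time_of_flight_us
            + transmission_us + recv_overhead_us)


for distance in (0.01, 0.5, 1000):
    latency = total_latency_us(distance, 10_000)
    print(f"{distance:>7} km: {latency:.0f} µsec")
```

With the 300,000 km/sec approximation the 1000 km case comes to about 6927 µsec; time of flight dominates there, while at 0.01 km and 0.5 km the overheads and transmission time dominate.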
16
Universal Metrics
• Apply recursively to all levels of system
• inside a chip, between chips on a board, between
computers in a cluster, …
• Look at WAN v. LAN v. SAN
17
Simplified Latency Model
• Total Latency = Overhead + Message Size / BW
• Overhead = Sender Overhead + Time of Flight +
Receiver Overhead
• Example: show what happens as vary
– Overhead: 1, 25, 500 µsec
– BW: 10,100, 1000 Mbit/sec (factors of 10)
– Message Size: 16 Bytes to 4 MB (factors of 4)
• If overhead 500 µsec,
how big a message > 10 Mb/s?
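One way to explore the closing question is to compute delivered bandwidth, message size ÷ total latency, under the simplified model. This is a sketch; the function names are made up for illustration, and the 100 Mbit/sec link in the usage note is an assumed choice from the list of bandwidths above.

```python
def delivered_bw_mbit_s(size_bytes, overhead_us, link_mbit_s):
    """Delivered bandwidth = message size / (overhead + size / link BW)."""
    bits = size_bytes * 8
    return bits / (overhead_us + bits / link_mbit_s)  # bits per µsec = Mbit/s


def min_message_bytes(target_mbit_s, overhead_us, link_mbit_s):
    """Smallest message whose delivered bandwidth reaches target_mbit_s.

    Solves bits / (overhead + bits / link) = target for bits.
    """
    if target_mbit_s >= link_mbit_s:
        raise ValueError("target must be below the raw link bandwidth")
    bits = target_mbit_s * overhead_us / (1 - target_mbit_s / link_mbit_s)
    return bits / 8
```

For example, `min_message_bytes(10, 500, 100)` asks how large a message must be to see more than 10 Mbit/sec on a 100 Mbit/sec link with 500 µsec of overhead; the answer is several hundred bytes, so short messages never see the raw link rate.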
18
Interconnect Issues
• Performance Measures
• Network Media
19
Network Media
• Twisted Pair:
– Copper, 1mm thick, twisted to avoid antenna effect (telephone)
– "Cat 5" is 4 twisted pairs in bundle
• Coaxial Cable:
– Layers: plastic covering, braided outer conductor, insulator, copper core
– Used by cable companies: high BW, good noise immunity
• Fiber Optics:
– 3 parts are cable, light source, light detector
– Transmitter (light source): L.E.D. or laser diode
– Receiver: photodiode
– Cable: silica core, cladding, buffer; light is guided by total internal reflection
– Note fiber is unidirectional; need 2 for full duplex
20
Fiber
• Multimode fiber: ~ 62.5 micron diameter vs. the 1.3 micron
wavelength of infrared light. Since it is wider, it has more
dispersion problems, limiting its length to 0.1 km at 1000 Mbits/s
and 1-3 km at 100 Mbits/s. Uses LED as light source
• Single mode fiber: "single wavelength" fiber (8-9 microns)
uses laser diodes, 1-5 Gbits/s for 100 km
– Less reliable and more expensive, and restrictions on bending
– Cost, bandwidth, and distance of single-mode fiber affected by power of
the light source, the sensitivity of the light detector, and the attenuation
rate (loss of optical signal strength as light passes through the fiber) per
kilometer of the fiber cable.
– Typically glass fiber, since has better characteristics than the less
expensive plastic fiber
21
Wavelength Division Multiplexing Fiber
• Send N independent streams on single fiber!
• Just use different wavelengths to send and
demultiplex at receiver
• WDM in 2000: 40 Gbit/s using 8 wavelengths
• Plan to go to 80 wavelengths => 400 Gbit/s!
• A figure of merit: BW* max distance
(Gbit-km/sec)
• 10X/4 years, or 1.8X per year
22
Compare Media
• Assume 40 2.5" disks, each 25 GB, Move 1 km
• Compare Cat 5 (100 Mbit/s), Multimode fiber (1000 Mbit/s),
single mode (2500 Mbit/s), and car
• Cat 5: 1000 x 1024 x 8 Mb / 100 Mb/s = 23 hrs
• MM: 1000 x 1024 x 8 Mb / 1000 Mb/s = 2.3 hrs
• SM: 1000 x 1024 x 8 Mb / 2500 Mb/s = 0.9 hrs
• Car: 5 min + 1 km / 50 kph + 10 min = 0.25 hrs
• Car of disks = high BW media
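The comparison can be reproduced with a few lines of Python. This is a sketch using the figures from the slide (40 disks of 25 GB, 5 minutes to load, 10 minutes to unload, 50 kph); the function and parameter names are illustrative.

```python
# 40 disks x 25 GB = 1000 GB, expressed in Mbit as on the slide
TOTAL_MBIT = 40 * 25 * 1024 * 8


def network_hours(link_mbit_s):
    """Hours to push the whole data set through a link of the given speed."""
    return TOTAL_MBIT / link_mbit_s / 3600


def car_hours(load_min=5, distance_km=1, speed_kph=50, unload_min=10):
    """Hours to load the disks, drive them over, and unload them."""
    drive_min = distance_km / speed_kph * 60
    return (load_min + drive_min + unload_min) / 60


for name, speed in (("Cat 5", 100), ("Multimode", 1000), ("Single mode", 2500)):
    print(f"{name}: {network_hours(speed):.1f} hrs")
print(f"Car: {car_hours():.2f} hrs")
```

The car's time is almost independent of the amount of data, which is why it wins at this data volume even over single mode fiber.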
23
Interconnect Issues
• Performance Measures
• Network Media
• Connecting Multiple Computers
24
Connecting Multiple Computers
• Shared Media vs. Switched: in a switched network, pairs
communicate at the same time over “point-to-point” connections
• Aggregate BW in switched network is
many times that of shared
– point-to-point faster since no arbitration,
simpler interface
• Arbitration in Shared network?
– Central arbiter for LAN?
– Listen to check if being used (“Carrier Sensing”)
– Listen to check if collision (“Collision Detection”)
– Random resend to avoid repeated collisions;
not fair arbitration;
– OK if low utilization
• Switches are a.k.a. data switching interchanges, multistage
interconnection networks, interface message processors
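The "random resend" step can be sketched as follows. The slide does not give the exact rule, so this assumes the truncated binary exponential backoff of classic 10 Mbit/s Ethernet: after the n-th collision a station waits a random number of slot times drawn from 0 to 2^min(n, 10) - 1, with a 51.2 µsec slot. The function names are illustrative.

```python
import random


def backoff_slots(collisions, rng=random):
    """Random wait, in slot times, after the given number of collisions.

    Assumes truncated binary exponential backoff: the range doubles with
    each collision and stops growing after the 10th.
    """
    exponent = min(collisions, 10)
    return rng.randrange(2 ** exponent)


def backoff_delay_us(collisions, slot_time_us=51.2, rng=random):
    """Wait time in µsec; 51.2 µsec is the 10 Mbit/s Ethernet slot time."""
    return backoff_slots(collisions, rng) * slot_time_us
```

Because each station draws its own random wait, two stations that just collided are unlikely to collide again, but nothing guarantees fairness: a station that keeps losing waits longer on average, which matches the "not fair arbitration" caveat.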
25
Connection-Based vs. Connectionless
• Telephone: operator sets up connection between the caller
and the receiver
– Once the connection is established, conversation can continue for hours
• Share transmission lines over long distances by using
switches to multiplex several conversations on the same
lines
– “Time division multiplexing” divides the bandwidth of the transmission line into a fixed
number of slots, with each slot assigned to a conversation
• Problem: lines busy based on number of conversations, not
amount of information sent
• Advantage: reserved bandwidth
27
Connection-Based vs. Connectionless
• Connectionless: every package of information
must have an address => packets
– Each package is routed to its destination by looking at its
address
– Analogy, the postal system (sending a letter)
– also called “Statistical multiplexing”
– Note: “Split phase buses” are sending packets
28
Routing Messages
• Shared Media
– Broadcast to everyone
• Switched Media needs real routing. Options:
– Source-based routing: message specifies path to the destination
(changes of direction)
– Virtual Circuit: circuit established from source to destination,
message picks the circuit to follow
– Destination-based routing: message specifies destination, switch
must pick the path
» deterministic: always follow same path
» adaptive: pick different paths to avoid congestion, failures
» Randomized routing: pick between several good paths to
balance network load
29
Deterministic Routing Examples
• mesh: dimension-order routing
– (x1, y1) -> (x2, y2)
– first Δx = x2 - x1,
– then Δy = y2 - y1
• hypercube: edge-cube routing
– X = x0 x1 x2 . . . xn -> Y = y0 y1 y2 . . . yn
– R = X xor Y
– Traverse dimensions of differing address in order
– (Figure: 3-dimensional cube with nodes 000, 001, 010, 011, 100, 101, 110, 111)
• tree: common ancestor
• Deadlock free?
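Both deterministic schemes can be sketched in a few lines of Python. The function names are illustrative, and traversing hypercube dimensions from the lowest bit upward is an assumed ordering; any fixed order gives a deterministic route.

```python
def mesh_route(src, dst):
    """Dimension-order routing on a 2D mesh: finish x first, then y."""
    x, y = src
    path = [(x, y)]
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


def ecube_route(src, dst, dimensions):
    """Hypercube routing: R = src xor dst marks the differing dimensions,
    which are corrected one at a time in a fixed order."""
    diff = src ^ dst
    node, path = src, [src]
    for bit in range(dimensions):
        if diff & (1 << bit):
            node ^= 1 << bit
            path.append(node)
    return path
```

For instance, routing from node 000 to node 110 on the 3-cube gives R = 110, so the message crosses two links, one per differing bit.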
30
Store and Forward vs. Cut-Through
• Store-and-forward policy: each switch waits for the full
packet to arrive in switch before sending to the next switch
(good for WAN)
• Cut-through routing or worm hole routing: switch examines
the header, decides where to send the message, and then
starts forwarding it immediately
– In worm hole routing, when head of message is blocked, message stays
strung out over the network, potentially blocking other messages (needs to buffer
only the piece of the packet that is sent between switches).
– Cut through routing lets the tail continue when head is blocked,
accordioning the whole message into a single switch. (Requires a buffer large
enough to hold the largest packet).
31
Cut-Through vs. Store and Forward
• Advantage
– Latency reduces from function of:
number of intermediate switches multiplied by the size of the packet
to
time for 1st part of the packet to negotiate the switches
+ the packet size ÷ interconnect BW
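A rough numeric comparison, as a sketch under simplifying assumptions: every link has the same bandwidth, switch processing time and time of flight are ignored, and a cut-through switch can forward as soon as it has received the header. The 20-byte header in the usage note is an assumed value, and the function names are illustrative.

```python
def store_and_forward_us(packet_bytes, switches, bw_mbit_s):
    """Each switch waits for the whole packet before sending it on,
    so the packet time is paid once per link (switches + 1 links)."""
    return (switches + 1) * packet_bytes * 8 / bw_mbit_s


def cut_through_us(packet_bytes, header_bytes, switches, bw_mbit_s):
    """Each switch delays only the header; the packet time is paid once."""
    header_delay = switches * header_bytes * 8 / bw_mbit_s
    return header_delay + packet_bytes * 8 / bw_mbit_s
```

With a 1500-byte packet, a 20-byte header, 4 switches, and 100 Mbit/sec links, store-and-forward costs five full packet times while cut-through costs little more than one.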
32
Congestion Control
• Packet switched networks do not reserve bandwidth; this leads
to contention (connection based limits input)
• Solution: prevent packets from entering until contention is
reduced
(e.g., freeway on-ramp metering lights)
• Options:
– Packet discarding: If packet arrives at switch and no room in buffer, packet is
discarded (e.g., UDP)
– Flow control: between pairs of receivers and senders;
use feedback to tell sender when allowed to send next packet
» Back-pressure: separate wires to tell to stop
» Window: give original sender right to send N packets before getting
permission to send more; (e.g., TCP), adjustable window
– Choke packets: aka “rate-based”; Each packet received by busy switch in
warning state sent back to the source via choke packet. Source reduces traffic to
that destination by a fixed % (e.g., ATM)
33
Protocols: HW/SW Interface
• Internetworking: allows computers on independent and
incompatible networks to communicate reliably and
efficiently;
– Enabling technologies: SW standards that allow reliable communications
without reliable networks
– Hierarchy of SW layers, giving each layer responsibility for portion of
overall communications task, called
protocol families or protocol suites
• Transmission Control Protocol/Internet Protocol (TCP/IP)
– This protocol family is the basis of the Internet
– IP makes best effort to deliver; TCP guarantees delivery
– TCP/IP used even when communicating locally: NFS uses IP even though
communicating across homogeneous LAN
34
Protocol Family Concept
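(Figure: a message passed between two nodes. Peers at each level communicate logically, while the actual transfer goes down through the lower levels; each level wraps what it receives in its own header (H) and trailer (T): Message, then H Message T, then H H Message T T at the physical level.)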
35
Protocol Family Concept
• Key to protocol families is that communication occurs logically
at the same level of the protocol, called peer-to-peer,
• but is implemented via services at the next lower level
• Encapsulation: carry higher level information within lower
level “envelope”
• Fragmentation: break packet into multiple smaller packets
and reassemble
• Danger is each level increases latency if implemented as
hierarchy (e.g., multiple check sums)
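Encapsulation and fragmentation can be illustrated with plain byte strings. This is a sketch only: the bracketed header and trailer labels in the usage note are placeholders, not real protocol headers, and the function names are made up for illustration.

```python
def encapsulate(payload: bytes, header: bytes, trailer: bytes = b"") -> bytes:
    """Carry higher-level data inside a lower-level envelope."""
    return header + payload + trailer


def decapsulate(packet: bytes, header: bytes, trailer: bytes = b"") -> bytes:
    """Strip this level's envelope and hand the payload to the level above."""
    if not (packet.startswith(header) and packet.endswith(trailer)):
        raise ValueError("envelope does not match this protocol level")
    return packet[len(header):len(packet) - len(trailer)]


def fragment(data: bytes, max_payload: int) -> list[bytes]:
    """Break a packet into pieces no larger than max_payload bytes."""
    return [data[i:i + max_payload] for i in range(0, len(data), max_payload)]


def reassemble(fragments: list[bytes]) -> bytes:
    """Rebuild the original packet from in-order fragments."""
    return b"".join(fragments)
```

A sender would call `encapsulate` once per level on the way down, for example `encapsulate(encapsulate(b"hello", b"[L2]"), b"[L1]", b"[/L1]")`, and the receiver would call `decapsulate` in the reverse order on the way up. Each extra level adds bytes and processing, which is the latency danger the last bullet points out.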
36
TCP/IP packet, Ethernet packet, protocols
• Application sends message
• TCP breaks into 64KB segments,
adds 20B header
• IP adds 20B header, sends to network
• If Ethernet, broken into 1500B
packets with headers, trailers (24B)
• All Headers, trailers have length
field, destination, ...
(Figure: the message is carried as TCP data behind a TCP header, which is carried as IP data behind an IP header, which is carried in an Ethernet packet behind an Ethernet header.)
37
Example Networks
• Ethernet: shared media 10 Mbit/s proposed in 1978, carrier
sensing with exponential backoff on collision detection
• 15 years with no improvement; higher BW?
• Multiple Ethernets with devices to allow Ethernets to operate
in parallel!
• 10 Mbit Ethernet successors?
– FDDI: shared media (too late)
– ATM (too late?)
– Switched Ethernet
– 100 Mbit Ethernet (Fast Ethernet)
– Gigabit Ethernet
– 10 Gigabit Ethernet in 2002?
38
Connecting Networks
• Bridges: connect LANs together, passing traffic from one side
to another depending on the addresses in the packet.
– operate at the Ethernet protocol level
– usually simpler and cheaper than routers
• Routers or Gateways: these devices connect LANs to WANs
or WANs to WANs and resolve incompatible addressing.
– Generally slower than bridges, they operate at the internetworking
protocol (IP) level
– Routers divide the interconnect into separate smaller subnets, which
simplifies manageability and improves security
• Cisco is major supplier;
basically special purpose computers
39
Comparing Networks
                 SAN                          LAN                                      WAN
                 FC-AL         Infiniband     10 Mb         100 Mb        1000 Mb       ATM
                                              Ethernet      Ethernet      Ethernet
Length (meters)  30/1000       17/100         500/2500      200           100           (not given)
Data lines       2             1, 4, 12       1             1             4/1           1
Clock (MHz)      1000          2500           10            100           1000          155/622
Switch?          Opt.          Yes            Opt.          Opt.          Yes           Yes
Nodes            <=127         ~1000          <=254         <=254         <=254         ~10000
Material         Copper/fiber  Copper/fiber   Copper        Copper        Copper/fiber  Copper/fiber
40
Comparing Networks
                 SAN                            LAN                                          WAN
                 FC-AL          Infiniband      10 Mb          100 Mb         1000 Mb        ATM
                                                Ethernet       Ethernet       Ethernet
Switch?          Opt.           Yes             Opt.           Opt.           Yes            Yes
Bisection BW     800 shared     (2000 to        10 shared      100 shared     1000 x         155 x
(Mbits/sec)      or 800 x       24000) x        or 10 x        or 100 x       switch ports   switch ports
                 switch ports   switch ports    switch ports   switch ports
Peak link BW     800            2000, 8000,     10             100            1000           155/622
(Mbits/sec)                     24000
Topology         Ring or Star   Star            Line or Star   Line or Star   Star           Star
41
Comparing Networks
                    SAN                            LAN                                          WAN
                    FC-AL          Infiniband      10 Mb          100 Mb         1000 Mb        ATM
                                                   Ethernet       Ethernet       Ethernet
Connectionless?     Yes            Yes             Yes            Yes            Yes            No
Store & forward?    No             No              No             No             No             Yes
Congestion control  Credit-based   Backpressure    Carrier sense  Carrier sense  Carrier sense  Credit-based
Standard            ANSI Task      Infiniband      IEEE 802.3     IEEE 802.3     IEEE           ATM Forum
                    Group X3T11    Trade                                         802.3ab-1999
                                   Association
42
Wireless Networks
• Media can be air as well as glass or copper
• Radio wave is electromagnetic wave propagated by an
antenna
• Radio waves are modulated: sound signal superimposed
on stronger radio wave which carries sound signal, called
carrier signal
• Radio waves have a wavelength or frequency: measure
either length of wave
or number of waves per second (MHz):
long waves => low frequencies,
short waves => high frequencies
• Tuning to different frequencies => radio receiver picks up a
signal.
– FM radio stations transmit on band of 88 MHz to 108 MHz using
frequency modulation (FM) to encode the sound signal
43
Issues in Wireless
• Wireless often => mobile => network must
rearrange itself dynamically
• Subject to jamming and eavesdropping
– No physical tap
– Cannot detect interception
• Power
– devices tend to be battery powered
– antennas radiate power to communicate and little of it reaches
the receiver
• As a result, raw bit error rates are typically a
thousand to a million times higher than copper
wire
44
Reliability of Wireless Transmission
• bit error rate (BER) of wireless link determined by
received signal power, noise due to interference
caused by the receiver hardware, interference from
other sources, and characteristics of the channel
– Path loss: power to overcome interference
– Shadow fading: blocked by objects (walls, buildings)
– Multipath fading: interference between multiple versions of
signals arriving at different times
– Interference: reuse of frequency or from adjacent channels
45
2 Wireless Architectures
• Base-station architectures
– Connected by land lines for longer distance communication, and
the mobile units communicate only with a single local base
station
– More reliable since 1-hop from land lines
– Example: cell phones
• Peer-to-peer architectures
– Allow mobile units to communicate with each other, and
messages hop from one unit to the next until delivered to the
desired unit
– More reconfigurable
46
Cellular Telephony
• Exploit exponential path loss to reuse same frequency at
spatially separated locations, thereby greatly increasing
customers served
• Divide region into nonoverlapping hexagonal cells (2-10
mi. diameter) which use different frequencies if nearby,
reusing a frequency when cells far apart so that mutual
interference OK
• Intersection of three hexagonal cells is a base station with
transmitters and antennas
• Handset selects a cell based on signal strength and then
picks an unused radio channel
• To properly bill for cellular calls, each cellular phone
handset has an electronic serial number
47
Cellular Telephony II
• Original analog design: frequencies set for each
direction; pair called a channel
– 869.04 to 893.97 MHz, called the forward path
– 824.04 MHz to 848.97 MHz, called the reverse path
– Cells might have had between 4 and 80 channels
• Several digital successors:
– Code division multiple access (CDMA) uses a wider radio frequency
band
– time division multiple access (TDMA)
– global system for mobile communication (GSM)
– International Mobile Telephony 2000 (IMT-2000) which is based
primarily on two competing versions of CDMA and one TDMA,
called Third Generation (3G)
48
Practical Issues for Interconnection
Networks
• Connectivity: max number of machines affects
complexity of network and protocols since
protocols must target largest size
• Connection Network Interface to computer
– Where in bus hierarchy? Memory bus? Fast I/O bus? Slow I/O
bus? (Ethernet to Fast I/O bus, Infiniband to Memory bus since it
is the Fast I/O bus)
– SW Interface: does software need to flush caches for consistency
of sends or receives?
– Programmed I/O vs. DMA? Is NIC in uncachable address
space?
49
Practical Issues for Interconnection
Networks
• Standardization advantages:
– low cost (components used repeatedly)
– stability (many suppliers to choose from)
• Standardization disadvantages:
– Time for committees to agree
– When to standardize?
» Before anything built? => Committee does design?
» Too early suppresses innovation
• Reliability (vs. availability) of interconnect
50
Practical Issues
Interconnection   Example      Standard   Fault Tolerance?   Hot Insert?
SAN               Infiniband   Yes        Yes                Yes
LAN               Ethernet     Yes        Yes                Yes
WAN               ATM          Yes        Yes                Yes
• Standards: required for WAN, LAN, and likely SAN!
• Fault Tolerance: Can nodes fail and still deliver messages to
other nodes?
• Hot Insert: If the interconnection can survive a failure, can it
also continue operation while a new node is added to the
interconnection?
51
Cross-Cutting Issues for Networking
• Efficient Interface to Memory Hierarchy vs. to Network
– SPEC ratings => fast to memory hierarchy
– Writes go via write buffer, reads via L1 and L2
caches
• Example: I/O bus latency of 40 MHz SPARCStation(SS)-2 vs. 50 MHz SS-20, no L2$
vs. 50 MHz SS-20 with L2$; different generations
• SS-2: combined memory, I/O bus => 200 ns
• SS-20, no L2$: 2 busses +300ns => 500ns
• SS-20, w L2$: cache miss+500ns => 1000ns
52
Crosscutting: Smart Switch vs. Smart
Network Interface Card
(Figure: switches and NICs placed on a scale from less intelligent to more intelligent. Switches shown: Large Ethernet, Small Ethernet, Myrinet, Infiniband. NICs shown: Ethernet, Infiniband Target Channel Adapter, Myrinet, Infiniband Host Channel Adapter.)
•Inexpensive NIC => Ethernet standard in all computers
•Inexpensive switch => Ethernet used in home networks
53
Summary: Networking
• Protocols allow heterogeneous networking
– Protocols allow operation in the presence of failures
– Internetworking protocols used as LAN protocols
=> large overhead for LAN
• Integrated circuit revolutionizing networks as well
as processors
– Switch is a specialized computer
– Faster networks and slow overheads violate Amdahl’s Law
54
Review: Networking
• Protocols allow heterogeneous networking
– Protocols allow operation in the presence of failures
– Internetworking protocols used as LAN protocols
=> large overhead for LAN
• Integrated circuit revolutionizing networks as well
as processors
– Switch is a specialized computer
– Faster networks and slow overheads violate Amdahl’s Law
• Wireless Networking offers new challenges in
bandwidth, mobility, reliability, ...
55
Cluster
• LAN switches => high network bandwidth and scaling was
available from off the shelf components
• 2001 Cluster = collection of independent computers using
switched network to provide a common service
• Many mainframe applications run more "loosely coupled"
machines than shared memory machines (next chapter/week)
– databases, file servers, Web servers, simulations, and
multiprogramming/batch processing
– Often need to be highly available, requiring error tolerance and repairability
– Often need to scale
56
Cluster Drawbacks
• Cost of administering a cluster of N machines
~ administering N independent machines
vs. cost of administering a shared address space N-processor
multiprocessor
~ administering 1 big machine
• Clusters usually connected using I/O bus, whereas
multiprocessors usually connected on memory bus
• Cluster of N machines has N independent memories and N
copies of OS, but a shared address multi-processor allows 1
program to use almost all memory
– DRAM prices have made memory costs so low that this multiprocessor
advantage is much less important in 2001
57
Cluster Advantages
• Error isolation: separate address space limits contamination of
error
• Repair: Easier to replace a machine without bringing down the
system than in a shared memory multiprocessor
• Scale: easier to expand the system without bringing down the
application that runs on top of the cluster
• Cost: Large scale machine has low volume => fewer machines
to spread development costs vs. leverage high volume off-the-shelf switches and computers
• Amazon, AOL, Google, Hotmail, Inktomi, WebTV, and Yahoo
rely on clusters of PCs to provide services used by millions of
people every day
58
Addressing Cluster Weaknesses
• Network performance: SAN, especially Infiniband,
may tie cluster closer to memory
• Maintenance: separation of long term storage and
computation
• Computation maintenance:
– Clones of identical PCs
– 3 steps: reboot, reinstall OS, recycle
– At $1000/PC, cheaper to discard than to figure out what is
wrong and repair it?
• Storage maintenance:
– If separate storage servers or file servers, cluster is no worse?
59
Clusters and TPC Benchmarks
• “Shared Nothing” database (not memory, not
disks) is a match to cluster
• 2/2001: Top 10 TPC performance 6/10 are clusters
(4 / top 5)
60
Putting it all together: Google
• Google: search engine that scales at
Internet growth rates
• Search engines: 24x7 availability
• Google 12/2000: 70M queries per day, or
AVERAGE of 800 queries/sec all day
• Response time goal: < 1/2 sec for search
• Google crawls WWW and puts up new index every
4 weeks
• Stores local copy of text of pages of WWW (snippet
as well as cached copy of page)
• 3 collocation sites (2 CA + 1 Virginia)
• 6000 PCs, 12000 disks: almost 1 petabyte!
61
Hardware Infrastructure
• VME rack 19 in. wide, 6 feet tall, 30 inches deep
• Per side: 40 1 Rack Unit (RU) PCs + 1 HP Ethernet switch (4 RU):
Each blade can contain 8 100-Mbit/s EN or a single 1-Gbit Ethernet interface
• Front+back => 80 PCs + 2 EN switches/rack
• Each rack connects to 2 128 x 1-Gbit/s EN switches
• Dec 2000: 40 racks at most recent site
62
Google PCs
• 2 IDE drives, 256 MB of SDRAM, modest Intel
microprocessor, a PC mother-board, 1 power supply
and a few fans.
• Each PC runs the Linux operating system
• Buy over time, so upgrade components:
populated between March and November 2000
– microprocessors: 533 MHz Celeron to an 800 MHz Pentium III,
– disks: capacity between 40 and 80 GB, speed 5400 to 7200 RPM
– bus speed is either 100 or 133 MHz
– Cost: ~ $1300 to $1700 per PC
• PC operates at about 55 Watts
• Rack => 4500 Watts , 60 amps
63
Reliability
• For 6000 PCs, 12000 disks, 200 EN switches
• ~ 20 PCs will need to be rebooted/day
• ~ 2 PCs/day hardware failure, or 2%-3% / year
– 5% due to problems with motherboard, power supply, and
connectors
– 30% DRAM: bits change + errors in transmission (100 MHz)
– 30% Disks fail
– 30% Disks go very slow (10%-3% expected BW)
• 200 EN switches, 2-3 fail in 2 years
• 6 Foundry switches: none failed, but 2-3 of 96
blades of switches have failed (16 blades/switch)
• Collocation site reliability:
– 1 power failure,1 network outage per year per site
64
Google Performance: Serving
• How big is a page returned by Google? ~16KB
• Average bandwidth to serve searches
(70,000,000/day x 16,750 B x 8 bits/B) / (24 x 60 x 60 secs/day)
= 9,378,880 Mbits / 86,400 secs
= 108 Mbit/s
65
Google Performance: Crawling
• How big is a text of a WWW page? ~4000B
• 1 Billion pages searched
• Assume 7 days to crawl
• Average bandwidth to crawl
(1,000,000,000 pages x 4000 B x 8 bits/B) / (24 x 60 x 60 x 7 secs)
= 32,000,000 Mbits / 604,800 secs
= 53 Mbit/s (the original slide gives 59 Mbit/s, the figure reused in the totals)
66
Google Performance: Replicating Index
• How big is Google index? ~5 TB
• Assume 7 days to replicate to 2 sites, implies BW to
send + BW to receive
• Average bandwidth to replicate new index
(2 x 2 x 5,000,000 MB x 8 bits/B) / (24 x 60 x 60 x 7 secs)
= 160,000,000 Mbits / 604,800 secs
= 260 Mbit/s
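The three Google bandwidth estimates (serving, crawling, replicating) all follow the same pattern, total bits ÷ seconds, and can be recomputed with a short sketch. The helper name is illustrative; the inputs are the figures stated on the slides, with 1 Mbit taken as 10^6 bits and 1 MB as 10^6 bytes.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def average_mbit_s(total_bytes, seconds):
    """Average bandwidth in Mbit/s needed to move total_bytes in seconds."""
    return total_bytes * 8 / 1e6 / seconds


# Serving: 70M result pages per day at 16,750 B each
serving = average_mbit_s(70_000_000 * 16_750, SECONDS_PER_DAY)

# Crawling: 1 billion pages of ~4000 B of text in 7 days
crawling = average_mbit_s(1_000_000_000 * 4_000, 7 * SECONDS_PER_DAY)

# Replicating: 5 TB index to 2 sites in 7 days, counting send and receive
replicating = average_mbit_s(2 * 2 * 5_000_000 * 1_000_000, 7 * SECONDS_PER_DAY)

print(f"serving {serving:.1f}, crawling {crawling:.1f}, "
      f"replicating {replicating:.1f} Mbit/s")
```

Under these inputs the crawl estimate comes out nearer 53 Mbit/s; serving and replicating agree with the slide values after rounding down.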
67
Collocation Sites
• Allow scalable space, power, cooling and network
bandwidth plus provide physical security
• charge about $500 to $750 per Mbit/sec/month
if your continuous use measures 1-2 Gbits/second,
rising to $1500 to $2000 per Mbit/sec/month
if your continuous use measures 1-10 Mbits/second
• Rack space: costs $800-$1200/month (includes one 20 amp circuit),
and drops by 20% if > 75 to 100 racks
– Each additional 20 amp circuit per rack costs another $200 to $400 per
month
• PG&E: 12 megawatts of power, 100,000 sq. ft./building,
10 sq. ft./rack => 1000 watts/rack
68
Google Performance: Total
• Serving pages: 108 Mbit/sec/month
• Crawling: 59 Mbit/sec/week, 15 Mbit/s/month
• Replicating: 260 Mbit/sec/week, 65 Mb/s/month
• Total: roughly 200 Mbit/sec/month
• Google’s Collocation sites have OC48
(2488 Mbit/sec) link to Internet
• Bandwidth cost per month?
~$150,000 to $200,000
• 1/2 BW grows at 20%/month
69
Google Costs
• Collocation costs: 40 racks @ $1000 per month + $500
per month for extra circuits
= ~$60,000 per site, * 3 sites
~$180,000 for space
• Machine costs:
• Rack = $2k + 80 * $1500/pc + 2 * $1500/EN
= ~$125k
• 40 racks + 2 Foundry switches @$100,000
= ~$5M
• 3 sites = $15M
• Cost today is $10,000 to $15,000 per TB
70
Comparing Storage Costs: 1/2001
• Google site, including 3200 processors and 0.8 TB
of DRAM, 500 TB (40 racks)
$10k - $15k/ TB
• Compaq Cluster with 192 processors,
0.2 TB of DRAM, 45 TB of SCSI Disks (17+ racks)
$115k/TB (TPC-C)
• HP 9000 Superdome: 48 processors,
0.25 TB DRAM, 19 TB of SCSI disk =
(23+ racks) $360k/TB (TPC-C)
71
Putting It All Together: Cell Phones
• 1999 280M handsets sold;
2001 500M
• Radio steps/components: Receive/transmit
– Antenna
– Amplifier
– Mixer
– Filter
– Demodulator
– Decoder
72
Putting It All Together: Cell Phones
• about 10 chips in 2000, which should shrink, but
likely separate MPU and DSP
• Emphasis on energy efficiency
From “How Stuff Works” on cell phones: www.howstuffworks.com
73
Cell phone steps (protocol)
• Find a cell
– Scans full BW to find stronger signal every 7 secs
• Local switching office registers call
– records phone number, cell phone serial number, assigns channel
– sends special tone to phone, which cell acks if correct
– Cell times out after 5 sec if doesn't get supervisory tone
• Communicate at 9600 b/s digitally (modem)
– Old style: message repeated 5 times
– AMPS had 2 power levels depending on distance (0.6W and 3W)
74
Frequency Division Multiple Access
(FDMA)
• FDMA separates the
spectrum into distinct voice
channels by splitting it into
uniform chunks of
bandwidth
• 1st generation analog
75
Time Division Multiple Access (TDMA)
• a narrow band that is 30 kHz
wide and 6.7 ms long is split time-wise into 3 time slots.
• Each conversation gets the radio
for 1/3 of time.
• Possible because voice data
converted to digital information is
compressed
• Therefore, TDMA has 3 times
capacity of analog
• GSM implements TDMA in a
somewhat different and
incompatible way from US (IS-136); also encrypts the call
76
Code Division Multiple Access (CDMA)
• CDMA, after digitizing data,
spreads it out over the entire
bandwidth it has available.
• Multiple calls are overlaid over
each other on the channel, with
each assigned a unique sequence
code.
• CDMA is a form of spread
spectrum; All the users transmit
in the same wide-band chunk of
spectrum.
• Each user's signal is spread over
the entire bandwidth by a unique
spreading code. The same unique code
is used to recover the signal.
• GPS for time stamp
• Between 8 and 10 separate calls fit in the same
space as 1 analog call
77
Cell Phone Towers
78
Review: Networking
• Clusters +: fault isolation and repair, scaling, cost
• Clusters -: maintenance, network interface
performance, memory efficiency
• Google as cluster example:
– scaling (6000 PCs, 1 petabyte storage)
– fault isolation (2 failures per day yet available)
– repair (replace failures weekly/repair offline)
– Maintenance: 8 people for 6000 PCs
• Cell phone as portable network device
– # Handsets >> # PCs
– Universal mobile interface?
• Are future services built on Google-like clusters
delivered to gadgets like cell phone handsets?
79