Realization and Utilization of high-BW TCP on real applications
Kei Hiraki
Data Reservoir / GRAPE-DR project
The University of Tokyo
Computing Systems for Real Scientists
• Fast CPU, huge memory and disks, good graphics
– Cluster technology, DSM technology, Graphics processors
– Grid technology
• Very fast remote file accesses
– Global file system, data parallel file systems, Replication facilities
• Transparency to local computation
– No complex middleware; at most small modifications to existing software
• Real scientists are not computer scientists
• Computer scientists are not a workforce for real scientists
Objectives of Data Reservoir / GRAPE-DR(1)
• Sharing Scientific Data between distant research institutes
– Physics, astronomy, earth science, simulation data
• Very High-speed single file transfer on Long Fat pipe Network
– > 10 Gbps, > 20,000 Km, > 400ms RTT
• High utilization of available bandwidth
– Transferred file data rate > 90% of available bandwidth
• Including header overheads, initial negotiation overheads
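As a rough illustration of why >400ms RTT at >10Gbps is hard, the TCP window needed to keep such a long fat pipe full is enormous. The sketch below computes the bandwidth-delay product for the target figures (an illustration added here, not part of the original slides).

```python
# Bandwidth-delay product for the target long fat pipe network
# (figures from the slide: 10 Gbps, 400 ms RTT).
def bdp_bytes(bandwidth_bps: float, rtt_s: float) -> float:
    """Bytes that must be in flight to keep the pipe full."""
    return bandwidth_bps * rtt_s / 8

bdp = bdp_bytes(10e9, 0.4)                  # 10 Gbps, 400 ms RTT
print(f"required TCP window ~ {bdp / 2**20:.0f} MiB")        # ~477 MiB
print(f"~ {bdp / 1460:,.0f} full-size segments in flight")   # ~342,000 segments
```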
Objectives of Data Reservoir / GRAPE-DR(2)
• GRAPE-DR: Very high-speed attached processor to a server
– 2004 – 2008
– Successor of Grape-6 astronomical simulator
• 2PFLOPS on 128 node cluster system
– 1 GFLOPS / processor
– 1024 processors / chip
– 8 chips / PCI card
– 2 PCI cards / server
– 2M processors / system
Data intensive scientific computation through global networks
(Diagram: the Nobeyama Radio Observatory (VLBI), Belle nuclear experiments, the Digital Sky Survey, the X-ray astronomy satellite ASUKA, the SUBARU Telescope, and Grape6 each feed a Data Reservoir; the Data Reservoirs exchange data over a very high-speed network and serve local accesses for data analysis at the University of Tokyo.)
Basic Architecture
(Diagram: two Data Reservoirs with cache disks, connected by a high-latency, very high-bandwidth network; data is shared at the disk-block level using parallel, multi-stream transfers, while applications see ordinary local file accesses, a DSM-like architecture for distributed shared data.)
File accesses on Data Reservoir
(Diagram: scientific detectors and user programs access four file servers; files are striped across the file servers (1st-level striping), which reach the disk servers over iSCSI through IP switches with a 2nd level of striping; the disk servers are IBM x345 machines (2 x 2.6GHz).)
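To make the two-level striping concrete, here is a minimal sketch of how a logical block number could be mapped first to a file server and then to a disk server. The stripe widths and function name are assumptions for illustration, not taken from the slides.

```python
# Hypothetical two-level striping map: logical block -> (file server, disk server, local block).
N_FILE_SERVERS = 4      # 1st-level stripe width (file servers)
N_DISK_SERVERS = 4      # 2nd-level stripe width (disk servers reached over iSCSI)

def locate_block(logical_block: int) -> tuple[int, int, int]:
    fs = logical_block % N_FILE_SERVERS            # 1st-level striping
    per_fs = logical_block // N_FILE_SERVERS
    ds = per_fs % N_DISK_SERVERS                   # 2nd-level striping
    local_block = per_fs // N_DISK_SERVERS         # block index on that disk server
    return fs, ds, local_block

print(locate_block(37))    # (1, 1, 2): file server 1, disk server 1, local block 2
```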
Global Data Transfer
(Diagram: the same file servers and disk servers as above, but the disk servers on each side perform iSCSI bulk transfers to their peers across the global network through the IP switches.)
Problems found in 1st generation Data Reservoir
• Low TCP bandwidth due to packet losses
– TCP congestion window size control
– Very slow recovery from fast recovery phase (>20min)
• Imbalance among parallel iSCSI streams
– Packet scheduling by switches and routers
– Users and other network users care only about the total behavior of the parallel TCP streams
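The ">20 min" recovery time follows from how slowly standard additive-increase congestion avoidance grows the window on a long-RTT path. The sketch below estimates it with assumed figures (link rate and RTT are illustrative, not measurements from the project).

```python
# Rough recovery-time estimate for standard TCP congestion avoidance:
# cwnd grows by about 1 MSS per RTT, so regaining the halved window after a
# loss takes roughly (cwnd/2) RTTs.  All figures below are assumptions.
MSS = 1460        # bytes
RTT = 0.2         # seconds (roughly Tokyo <-> US)
RATE = 600e6      # bits/s we want to sustain

cwnd_segments = RATE * RTT / (8 * MSS)      # ~10,000 segments to fill the pipe
rtts_to_recover = cwnd_segments / 2         # additive increase: 1 MSS per RTT
print(f"recovery ~ {rtts_to_recover * RTT / 60:.0f} minutes")   # ~17 minutes
```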
Fast Ethernet vs. GbE
• iperf, 30-second runs
• Min/Avg throughput: Fast Ethernet > GbE
(Bar chart: Min/Max/Avg throughput [Mbps], 0-180, for Fast Ethernet with TxQ 100 and TxQ 5000 and for GbE with TxQ 100 and TxQ 25000.)
Packet Transmission Rate
• Bursty behavior
– Transmission within 20ms of the 200ms RTT
– Idle for the remaining 180ms
(Plot: US-to-Tokyo packet transmission vs. time [sec], showing bursts and the point where packet loss occurred.)
Packet Spacing
• Ideal story
– Transmit one packet every RTT/cwnd
– 24μs interval for 500Mbps (MTU 1500B)
– Too high a load for a software-only implementation
– Low overhead because it is only needed during the slow-start phase
(Diagram: packets sent back-to-back within one RTT vs. packets spaced at RTT/cwnd intervals.)
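The 24μs figure above is just the per-packet budget at the target rate; a minimal sketch of that arithmetic (added here for illustration, figures as on the slide) follows.

```python
# Inter-packet interval needed to pace a TCP stream at a given rate.
def pacing_interval_us(rate_bps: float, mtu_bytes: int = 1500) -> float:
    """Microseconds between packet transmissions to average rate_bps."""
    return mtu_bytes * 8 / rate_bps * 1e6

print(f"{pacing_interval_us(500e6):.0f} us")   # 24 us at 500 Mbps, 1500B MTU
print(f"{pacing_interval_us(10e9):.1f} us")    # 1.2 us at 10 Gbps
```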
Example Case of 8 IPG
• Success on Fast Retransmit
– Smooth Transition to Congestion Avoidance
– CA takes 28 minutes to recover to 550Mbps
(Plot: throughput [Mbps] vs. time [sec], 0-35s, for the 8-byte IPG case.)
Best Case of 1023B IPG
• Behaves like the Fast Ethernet case
– Proper transmission rate
• Spurious retransmits due to reordering
(Plot: throughput [Mbps] vs. time [sec], 0-35s, for the 1023B IPG case.)
Imbalance within parallel TCP streams
• Imbalance among parallel iSCSI streams
– Packet scheduling by switches and routers
– Meaningless unfairness among the parallel streams
– Users and other network users care only about the total behavior of the parallel TCP streams
• Our approach (sketched below)
– Keep Σ cwnd_i constant for fair TCP network usage towards other users
– Balance the individual cwnd_i by communicating between the parallel TCP streams
(Figure: per-stream bandwidth vs. time for the parallel streams.)
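A minimal sketch of the rebalancing idea: redistribute window between the streams while keeping the sum, and hence the aggregate fairness seen by other traffic, unchanged. This is my own illustration; the project's actual balancing mechanism is not detailed on the slides.

```python
# Illustrative cwnd rebalancing across parallel streams: move window from
# streams above the mean towards streams below it, keeping sum(cwnd) constant.
def rebalance(cwnds: list[float], step: float = 0.1) -> list[float]:
    mean = sum(cwnds) / len(cwnds)
    # Each stream moves a fraction of its distance to the mean;
    # the corrections cancel out, so the total window is unchanged.
    return [c + step * (mean - c) for c in cwnds]

streams = [120.0, 40.0, 20.0, 60.0]       # hypothetical cwnd values (segments)
print(sum(streams))                        # 240.0
streams = rebalance(streams)
print([round(c, 1) for c in streams], sum(streams))   # [114.0, 42.0, 24.0, 60.0] 240.0
```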
3rd Generation Data Reservoir
• Hardware and software basis for 100Gbps distributed data-sharing systems
• 10Gbps disk data transfer by a single Data Reservoir server
• Transparent support for multiple filesystems (detection of modified disk blocks)
• Hardware (FPGA) implementation of inter-layer coordination mechanisms
• 10Gbps Long Fat pipe Network emulator and 10Gbps data logger
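Filesystem-transparent transfer works at the disk-block level, so the system only needs to know which blocks changed. Below is a minimal sketch of such modified-block tracking with a dirty bitmap; the mechanism and names are assumptions for illustration, not the project's actual implementation.

```python
# Illustrative dirty-block tracking: record which disk blocks were written so
# that only modified blocks need to be transferred, independent of the filesystem.
class DirtyBlockMap:
    def __init__(self, n_blocks: int):
        self.dirty = bytearray(n_blocks)      # one byte per block for simplicity

    def on_write(self, block: int) -> None:   # hook called on every block write
        self.dirty[block] = 1

    def blocks_to_transfer(self) -> list[int]:
        return [b for b, d in enumerate(self.dirty) if d]

    def clear(self) -> None:                  # after a successful transfer
        self.dirty = bytearray(len(self.dirty))

m = DirtyBlockMap(1_000_000)
m.on_write(42); m.on_write(4711)
print(m.blocks_to_transfer())                 # [42, 4711]
```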
Utilization of 10Gbps network
• A single-box 10Gbps Data Reservoir server
• Quad Opteron server with multiple PCI-X buses (prototype: SUN V40z server)
• Two Chelsio T110 TCP off-loading NICs
• Disk arrays for the necessary disk bandwidth
• Data Reservoir software (iSCSI daemon, disk driver, data transfer manager)
(Diagram: quad Opteron server (SUN V40z, Linux 2.6.6) with multiple PCI-X buses hosting two Chelsio T110 TCP NICs connected via 10GBASE-SR to a 10G Ethernet switch, plus Ultra320 SCSI adaptors to the disks, all driven by the Data Reservoir software.)
Tokyo-CERN experiment (Oct.2004)
• CERN-Amsterdam-Chicago-Seattle-Tokyo
– SURFnet – CA*net 4 – IEEAF/Tyco – WIDE
– 18,500 km WAN PHY connection
• Performance result
– 7.21 Gbps (TCP payload) standard Ethernet frame size, iperf
– 7.53 Gbps (TCP payload) 8K Jumbo frame, iperf
– 8.8 Gbps disk to disk performance
• 9 servers, 36 disks
• 36 parallel TCP streams
Tokyo-CERN network connection
(Map of the network used in the experiment: Tokyo, Seattle (IEEAF), Vancouver, Calgary, Minneapolis, Chicago (CA*net 4 / CANARIE), Amsterdam (SURFnet), and Geneva; markers distinguish end systems from L1/L2 switches.)
Network topology of CERN-Tokyo experiment
(Diagram: the Data Reservoir at Univ. of Tokyo (IBM x345 servers: dual Intel Xeon 2.4GHz, 2GB memory, Linux 2.6.6 on No.2-7 and Linux 2.4.x on No.1; Opteron servers: dual Opteron 248 2.2GHz, 1GB memory, Chelsio T110 NIC, Linux 2.6.6, 10GBASE-LW) connects through a Fujitsu XG800 12-port switch and an Extreme Summit 400 at T-LEX Tokyo, over WIDE/IEEAF WAN PHY to Seattle (Pacific Northwest Gigapop), via Vancouver, Minneapolis and Chicago (StarLight, Foundry NetIron40G and BI MG8) on CA*net 4, and over SURFnet to Amsterdam (NetherLight) and the Data Reservoir at CERN, Geneva (Foundry FESX448, IBM x345 with GbE).)
LSR experiments
• Target
– > 30,000 km LSR distance
– L3 switching at Chicago and Amsterdam
– Period of the experiment
• 12/20 – 1/3
• Holiday season, so the public research networks were lightly loaded
• System configuration
– A pair of Opteron servers with Chelsio T110 NICs (at N-otemachi)
– Another pair of Opteron servers with Chelsio T110 NICs for generating competing traffic
– ClearSight 10Gbps packet analyzer for packet capturing
Network used in the experiment
(Figure 2. Network connection: Tokyo, Seattle (IEEAF/Tyco/WIDE), Vancouver, Calgary, Minneapolis, Chicago (CA*net 4 / CANARIE), Amsterdam (SURFnet), New York (Abilene), and APAN/JGN2; markers distinguish routers/L3 switches from L1/L2 switches.)
Single stream TCP – Tokyo – Chicago – Amsterdam – NY – Chicago - Tokyo
(Diagram: the single-stream TCP path from the Opteron servers (Chelsio T110 NICs) at Univ. of Tokyo / T-LEX over WIDE and IEEAF/Tyco WAN PHY to Seattle (Pacific Northwest Gigapop), across CA*net 4 (Vancouver OME 6550, ONS 15454 nodes at Calgary and Minneapolis) to Chicago StarLight (Force10 E1200, Fujitsu XG800, ClearSight 10Gbps capture), over SURFnet OC-192 to Amsterdam (NetherLight, Force10 E600, University of Amsterdam), back via Abilene/SURFnet OC-192 through New York MANLAN (T640, HDXc, Cisco 12416), and over TransPAC/APAN-JGN2 (Procket 8812/8801, Cisco 6509) to Tokyo; routers/L3 switches and L1/L2 switches are marked.)
Network Traffic on routers and switches
(Traffic graphs: StarLight Force10 E1200, University of Amsterdam Force10 E600, Abilene T640 NYCM-to-CHIN, TransPAC Procket 8801; the submitted run is marked.)
Summary
• Single-stream TCP
– We removed the TCP-related difficulties
– Now the I/O bus bandwidth is the bottleneck
– Cheap and simple servers can enjoy 10Gbps networks
• Lack of methodology in high-performance network debugging
– 3 days of debugging (working overnight)
– 1 day of stable operation (usable for measurements)
– The network seems to fatigue: some trouble always happens
– We need something more effective
• Detailed issues
– Flow control (and QoS)
– Buffer size and policy
– Optical level setting
Systems used in long-distance TCP experiments
(Photos: the systems at CERN, Pittsburgh, and Tokyo.)
Efficient and effective utilization of
High-speed internet
• Efficient and effective utilization of a 10Gbps network is still very difficult
• PHY, MAC, data link, and switches
– 10Gbps is ready to use
• Network interface adaptor
– 8Gbps is ready to use, 10Gbps in several months
– Proper offloading, RDMA implementation
• I/O bus of a server
– 20Gbps is necessary to drive a 10Gbps network
• Drivers, operating system
– Too many interrupts, buffer memory management
• File system
– Slow NFS service
– Consistency problems
Difficulty in 10Gbps Data Reservoir
• Disk-to-disk single-stream TCP data transfer
– High CPU utilization (performance limited by the CPU)
• Too many context switches
• Too many interrupts from the network adaptor (> 30,000/s)
• Data copies from buffer to buffer
• I/O bus bottleneck
– PCI-X/133: maximum 7.6Gbps data transfer
• Waiting for PCI-X/266 or PCI Express x8/x16 NICs
– Disk performance
• Performance limit of the RAID adaptor
• Number of disks for data transfer (> 40 disks are required)
• File system
– High bandwidth in file service is more difficult than data sharing
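Two of these limits are simple arithmetic; the sketch below shows why PCI-X/133 tops out below 10Gbps and why more than 40 disks are needed (the per-disk rate is an assumed figure for illustration).

```python
# Back-of-the-envelope numbers behind the I/O bus and disk bullet points.
# PCI-X/133: 64-bit bus at 133 MHz, before protocol overhead.
pci_x_peak_bps = 64 * 133e6
print(f"PCI-X/133 peak ~ {pci_x_peak_bps / 1e9:.1f} Gbps (about 7.6 Gbps usable)")

# Disks needed to feed a 10 Gbps stream, assuming ~30 MB/s sustained per disk.
per_disk_bps = 30e6 * 8
print(f"disks needed ~ {10e9 / per_disk_bps:.0f}")    # ~42 disks
```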
High-speed IP network in supercomputing (GRAPE-DR project)
• World's fastest computing system
– 2PFLOPS in 2008 (performance on actual application programs)
• Construction of a general-purpose massively parallel architecture
– Low power consumption at PFLOPS-range performance
– MPP architecture more general-purpose than vector architecture
• Use of commodity networks for interconnection
– 10Gbps optical network (2008) + MEMS switches
– 100Gbps optical network (2010)
Target performance
(Chart: peak FLOPS vs. year, 1970-2050, for processor chips and parallel processors, with the Earth Simulator at 40TFLOPS, the KEISOKU supercomputer at 10PFLOPS, and GRAPE-DR at 2PFLOPS.)
GRAPE-DR architecture
• Massively parallel processor
• Pipelined connection of a large number of PEs
• SIMASD (Single Instruction on Multiple and Shared Data); see the sketch after the diagram below
– All instructions operate on data in local memory and shared memory
– Extension of vector architecture
• Issues
– Compiler for the SIMASD architecture (flat-C, currently under development)
(Diagram: each PE contains local memory, an integer ALU, and a floating-point ALU; 512 PEs per chip are connected by an on-chip network to on-chip shared memory, external shared memory, and the outside world.)
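To convey the SIMASD idea, here is a small, purely illustrative sketch in plain Python (an analogy, not GRAPE-DR code): a single instruction is executed by every PE, combining its own local memory with data from shared memory.

```python
# Toy illustration of SIMASD: one instruction applied by all PEs in lockstep,
# each combining its local memory with values read from shared memory.
N_PE = 8                                    # 512 PEs on the real chip
local_mem = [[float(i + pe) for i in range(4)] for pe in range(N_PE)]
shared_mem = [10.0, 20.0, 30.0, 40.0]       # visible to every PE

def broadcast_madd(scale: float) -> None:
    """One 'instruction': local[i] = local[i] * scale + shared[i] on every PE."""
    for pe in range(N_PE):                  # the hardware does this in parallel
        for i in range(len(shared_mem)):
            local_mem[pe][i] = local_mem[pe][i] * scale + shared_mem[i]

broadcast_madd(2.0)
print(local_mem[0])    # [10.0, 22.0, 34.0, 46.0]
```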
Hierarchical construction of GRAPE-DR
– 512 PE / chip: 512 GFLOPS / chip
– 2K PE / PCI board: 2 TFLOPS / PCI board
– 8K PE / server: 8 TFLOPS / server
– 1M PE / node: 1 PFLOPS / node
– 2M PE / system: 2 PFLOPS / system
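The hierarchy is consistent with roughly 1 GFLOPS per PE; the small check below multiplies out each level (illustrative only).

```python
# Consistency check of the GRAPE-DR hierarchy, assuming ~1 GFLOPS per PE.
GFLOPS_PER_PE = 1
levels = {
    "chip":      512,
    "PCI board": 2 * 1024,
    "server":    8 * 1024,
    "node":      1 * 1024**2,
    "system":    2 * 1024**2,
}
for name, pes in levels.items():
    print(f"{name:9s}: {pes:>9,d} PE -> {pes * GFLOPS_PER_PE / 1000:g} TFLOPS")
# chip 0.512, board 2.048, server 8.192, node ~1049 (~1 PFLOPS), system ~2097 (~2 PFLOPS)
```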
Network architecture inside a GRAPE-DR system
(Diagram: AMD-based servers and iSCSI servers / IP storage systems with 100Gbps optical interfaces (KOE) attached to the memory bus, interconnected by a MEMS-based optical switch and a highly functional router to the outside IP network; an adaptive compiler and a total-system conductor perform dynamic optimization.)
Fujitsu Computer Technologies, LTD