Slide 1 - Universidad Abierta Interamericana

Large Scale Parallel File System and Cluster Management
ICT, CAS
About ICT, CAS
• Institute of Computing Technology, Chinese Academy of Sciences
• The first (since 1958) and largest national IT research institute in China
• The largest graduate school of computer science in China
• Builder of most of the Chinese systems in the HPC TOP500 list
• Focus on computing system architecture: CPU, compiler, network, grid, HPC and storage
Storage Centre of ICT
• Founded in 2001
• Leader: Dr. Xu Lu (from HP Lab)
• Storage for scientific computing
– BWFS: parallel cluster file system
– Service on Demand system: storage-based cluster management system
• Storage for business computing
– VSDS: virtual storage research project
– Backup / Virtual Computing……
The Storage Bottleneck of Clusters
• NFS (Network File System)
– Most widely used in clusters to provide shared data access
– Simple, easy to use and easy to manage
• Scalability Problem
– Multiple NFS servers mean multiple name spaces
– Hard to extend in capacity
– Performance does not increase with capacity
– Poor performance in I/O-intensive computing
– Weak MS Windows support
• Parallel Access Problem
[Figure: data throughput (KB), 0 to 80000, versus number of compute nodes (1, 2, 4, 8, 16, 32). Legend entries: 4k, 8k, 32k, 64k, 1M, 2M.]
What’s BWFS
• Parallel network file system
– Supports multiple storage appliances (8-128) in a single name space (up to 512 TB)
– Separates data access from metadata access, so that clients can access different storage appliances in parallel
• Global name space shared by clients on different platforms
– Fully compatible with NFS (not 100% POSIX)
– Supports data sharing between Linux and Windows clients
– Supports IA32, IA64 and x86_64 hardware platforms
What’s BWFS
• Centralized Management
– Web-based management for the storage appliances and the storage sub-system
– Client management integrated with the Service on Demand system
• Online extension
– Storage appliances can be added to increase capacity without stopping the application
– New data is automatically striped across all the storage appliances for high performance
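The slides do not describe the BWFS on-disk layout, so the following is only a minimal sketch of generic round-robin striping, the usual way a parallel file system spreads new data across appliances. The function names, the 1 MB stripe size and the appliance counts are illustrative assumptions, not BWFS parameters.

```python
from typing import List, NamedTuple, Tuple


class StripeLocation(NamedTuple):
    appliance: int     # index of the storage appliance holding the byte
    local_offset: int  # byte offset within that appliance's share of the file


def locate(offset: int, stripe_size: int, num_appliances: int) -> StripeLocation:
    """Map a file byte offset to (appliance, local offset) under round-robin striping."""
    if offset < 0 or stripe_size <= 0 or num_appliances <= 0:
        raise ValueError("offset must be >= 0; stripe_size and num_appliances must be > 0")
    stripe_index = offset // stripe_size            # which stripe unit of the file
    appliance = stripe_index % num_appliances       # round-robin placement
    local_stripe = stripe_index // num_appliances   # stripe units already on that appliance
    return StripeLocation(appliance, local_stripe * stripe_size + offset % stripe_size)


def split_write(offset: int, length: int, stripe_size: int,
                num_appliances: int) -> List[Tuple[int, int, int]]:
    """Split one write into (appliance, local_offset, chunk_length) pieces."""
    chunks = []
    end = offset + length
    while offset < end:
        loc = locate(offset, stripe_size, num_appliances)
        room = stripe_size - offset % stripe_size   # bytes left in the current stripe unit
        n = min(room, end - offset)
        chunks.append((loc.appliance, loc.local_offset, n))
        offset += n
    return chunks


MB = 1024 * 1024
# A 3 MB write starting at offset 0 touches three different appliances.
print(split_write(0, 3 * MB, MB, 4))
```

Because consecutive stripe units land on different appliances, a large sequential write can be served by several appliances at once. Under this scheme, adding an appliance changes `num_appliances` only for newly written data, which matches the statement above that new data is spread over all appliances.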
Data Access on NFS
[Diagram labels: Meta-Data; User Data; Application Server (×2); Storage Appliance]
Data Access on BWFS
[Diagram labels: Meta-Data; User-Data; Metadata controller; Node server (×2); Storage device (×2)]
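The two diagrams contrast a single NFS path with separated metadata and user-data paths. As a rough illustration of that separation (an in-memory toy, not the BWFS protocol), the sketch below has a client ask a metadata controller for a file's layout once and then fetch the chunks from several storage nodes in parallel. All class and function names here are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple


class StorageNode:
    """Holds user data only, keyed by chunk id."""
    def __init__(self) -> None:
        self.chunks: Dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self.chunks[key] = data

    def get(self, key: str) -> bytes:
        return self.chunks[key]


class MetadataController:
    """Holds metadata only: for each file, the ordered list of (node, chunk id)."""
    def __init__(self) -> None:
        self.layouts: Dict[str, List[Tuple[int, str]]] = {}

    def create(self, name: str, num_chunks: int, num_nodes: int) -> List[Tuple[int, str]]:
        layout = [(i % num_nodes, f"{name}#{i}") for i in range(num_chunks)]
        self.layouts[name] = layout
        return layout

    def lookup(self, name: str) -> List[Tuple[int, str]]:
        return self.layouts[name]


def write_file(mdc: MetadataController, nodes: List[StorageNode],
               name: str, data: bytes, chunk_size: int) -> None:
    pieces = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    layout = mdc.create(name, len(pieces), len(nodes))   # metadata path
    for (node_id, key), piece in zip(layout, pieces):    # data path
        nodes[node_id].put(key, piece)


def read_file(mdc: MetadataController, nodes: List[StorageNode], name: str) -> bytes:
    layout = mdc.lookup(name)                            # one metadata request
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        # Chunks are fetched concurrently; map() preserves layout order.
        parts = pool.map(lambda loc: nodes[loc[0]].get(loc[1]), layout)
        return b"".join(parts)


nodes = [StorageNode() for _ in range(4)]
mdc = MetadataController()
payload = bytes(range(256)) * 10
write_file(mdc, nodes, "demo.dat", payload, chunk_size=100)
assert read_file(mdc, nodes, "demo.dat") == payload
```

The point of the design is that the metadata controller never carries user data, so bulk transfers scale with the number of storage nodes rather than with a single server.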
Bandwidth of BWFS
[Figure, two panels: "write large files (20G per node, 1MB record size)" and "read large files (20G per node, 1MB record size)". Y axis: Aggregate Bandwidth (MB/s), 0 to 350. X axis: Number of client nodes (1, 2, 4, 8, 16). Series: 1SN, 2SN, 4SN, NFS.]
Paradigm Epos3 (China Petrol, Xinjiang)
[Figure: computing capacity (lines/hour), 0 to 10, versus number of nodes (32, 64, 96, 128). Series: BWFS, NAS9500, RackServer+DawningNFS.]
Paradigm Disco (China Petrol, Xinjiang)
[Figure: average run time (seconds), 0 to 2500, versus number of nodes (one job per node), 1 to 15. Series: BWFS, NAS8500, NAS9500, RaidsysNFS, Linear (BWFS).]
Management Interface
Service on Demand System
• Initially developed as a subsystem of BWFS to provide cluster management
• Reduces management work, especially during system deployment
• Improves availability when storage components fail
• Enables fast scheduling in large server farms with multiple clusters
• Boots the system directly from the BWFS storage appliance, with no need for local hard disks
Traditional Cluster Deployment System
[Diagram labels: System image; Hard disk (×6); 20 mins]
Shortcoming 1: Inefficient Scheduling
[Diagram labels: System image; System image 2; Hard disk (×6); 20 mins]
Shortcoming 2: Inefficient Maintenance
Hard disk errors account for 30%-50% of all computer system errors
[Diagram labels: System image; System image 2; Hard disk (×6)]
Shortcoming 3: Inefficient Use of Capacity
A 5GB system on a 74GB hard disk
Disks keep getting larger, but system images are kept small to reduce deployment time
[Diagram labels: System image; System image 2; Hard disk (×6)]
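To put the 5GB-on-74GB figure in numbers, here is a small arithmetic sketch. The two sizes come from the slide; the node count of 128 is a hypothetical value chosen only for illustration.

```python
def image_utilization(image_gb: float, disk_gb: float) -> float:
    """Fraction of a local disk actually occupied by the system image."""
    return image_gb / disk_gb


def idle_gb(image_gb: float, disk_gb: float, nodes: int) -> float:
    """Total local disk capacity left unused across the cluster."""
    return (disk_gb - image_gb) * nodes


IMAGE_GB, DISK_GB = 5, 74   # figures from the slide
NODES = 128                 # hypothetical cluster size, not from the slide

print(f"utilization per disk: {image_utilization(IMAGE_GB, DISK_GB):.1%}")
print(f"idle capacity on {NODES} nodes: {idle_gb(IMAGE_GB, DISK_GB, NODES)} GB")
```

Under these assumptions each local disk is less than 7% used, which is the inefficiency this slide points at.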
Service on Demand System
• Diskless OS boot over TCP/IP
– Virtual SCSI disk to support Windows and Linux
– Fully compatible with applications
• High-performance snapshots to support fast cloning of system images
– Copy on write when the system image is modified
– Online backup of system images using snapshots
• Automatic takeover of failed clients
• Integrated monitoring engine to support automatic scheduling or adaptive computing (still under research)
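The copy-on-write cloning mentioned above can be sketched in a few lines. This is a generic block-level illustration, assuming one shared read-only base image and a per-node snapshot that stores only modified blocks; the class names are hypothetical and it is not the Service on Demand implementation.

```python
from typing import Dict, Sequence


class BaseImage:
    """Read-only system image shared by every node."""
    def __init__(self, blocks: Sequence[bytes]) -> None:
        self._blocks = tuple(blocks)

    def read(self, index: int) -> bytes:
        return self._blocks[index]

    def __len__(self) -> int:
        return len(self._blocks)


class Snapshot:
    """Writable per-node view of a base image; stores only modified blocks."""
    def __init__(self, base: BaseImage) -> None:
        self.base = base
        self.delta: Dict[int, bytes] = {}   # block index -> private copy

    def read(self, index: int) -> bytes:
        if index in self.delta:
            return self.delta[index]        # block was modified by this node
        return self.base.read(index)        # unmodified: served from the shared image

    def write(self, index: int, data: bytes) -> None:
        if not 0 <= index < len(self.base):
            raise IndexError("block index out of range")
        self.delta[index] = data            # copy on write: the base is never touched


base = BaseImage([b"boot", b"kernel", b"etc"])
node_a, node_b = Snapshot(base), Snapshot(base)   # cloning copies no data
node_a.write(2, b"etc-A")
print(node_a.read(2), node_b.read(2), base.read(2))
```

Creating a snapshot costs almost nothing because no blocks are copied up front, which is why cloning a system image this way can be far faster than copying it to each local disk.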
Service on Demand System
[Diagram labels: Service 1; Service 2; Service N; Storage Appliance; Network; Map to Local Disk; Application Node; User]
Fast Deployment and Scheduling
[Diagram labels: Paradigm Services; CGG Services; Web system; Email system; Paradigm Image; CGG Image; Paradigm Snapshot (×5); CGG Snapshot]
Easy to Maintain
[Diagram labels: Maintenance; System Image; System Snapshot (×5)]
Management UI
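[Diagram: system topology. Labels, translated and listed in extraction order:
– SERVER: 73GB×3 hard disks, 4GB memory, 2 CPU
– Large-memory node
– Compute nodes: 17 units, 73G hard disk, 2 CPU, 4GB memory
– 4 CPU, 8GB memory
– 4T disk array
– Service network: Gigabit Ethernet
– Note: DVD burner attached
– InfiniBand
– Deployment and management network: 100 Mbit Ethernet
– Virtual storage
– Management server (the physical machine is the Console)
– 3T disk array ×3
– IP SAN
– Gigabit
– Platform management access LAN
– Deployment system
– Device, 1TB
– Internet
– Spare node: 4 CPU, 4GB memory
– Console (Dawning PC)
– Two Loongson NCs
– Dawning PC]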
Thanks!