Status of the Bologna Computing Farm and GRID related activities
Vincenzo M. Vagnoni
Thursday, 7 March 2002
Outline

Currently available resources
Farm configuration
Performance
Scalability of the system (in view of the DC)
Resources foreseen for the DC
Grid middleware issues
Conclusions
Current resources

Core system (hosted in two racks at INFN-CNAF)

56 CPUs hosted in dual-processor machines (18 PIII 866 MHz + 32 PIII 1 GHz + 6 PIII Tualatin 1.13 GHz), 512 MB RAM
2 Network Attached Storage (NAS) systems (see the capacity sketch at the end of this slide)
1 TB in RAID5, with 14 IDE disks + hot spare
1 TB in RAID5, with 7 SCSI disks + hot spare
1 Fast Ethernet switch with Giga uplink
Ethernet-controlled power distributor for remote power cycling
Additional resources from INFN-CNAF

42 CPUs in dual-processor machines (14 PIII 800 MHz, 26 PIII 1 GHz, 2 PIII Tualatin 1.13 GHz)
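
A quick consistency check of the quoted NAS capacities under the usual RAID5 rule, usable space ≈ (N − 1) × disk size, where one disk's worth of space holds parity and the hot spare sits outside the array; the per-disk sizes below are inferred from the 1 TB totals, not taken from the slide:

# RAID5 usable capacity ~ (n_disks - 1) * disk_size: one disk's worth of space
# stores parity, and the hot spare is outside the array.
# The per-disk sizes are inferred from the quoted 1 TB totals, not from the slide.
def raid5_usable_gb(n_disks, disk_gb):
    return (n_disks - 1) * disk_gb

print(raid5_usable_gb(14, 80))    # IDE NAS:  13 * 80 GB  = 1040 GB ~ 1 TB
print(raid5_usable_gb(7, 180))    # SCSI NAS:  6 * 180 GB = 1080 GB ~ 1 TB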
Farm Configuration (I)

Diskless processing nodes with the OS centralized on a file server (root over NFS); a trivial check is sketched at the end of this slide

Makes adding or removing a node from the system trivial, i.e. no software installation is needed on local disks
Allows easy interchange of CEs in case of shared resources (e.g. among various experiments), and permits dynamic allocation of the latter without additional work
Very stable! No real drawback observed in about 1 year of running

Improved security

Use of private network IP addresses and an Ethernet VLAN
High level of isolation
Access to external services (AFS, mccontrol, bookkeeping DB, servlets of various kinds, …) provided by means of NAT on the GW

The most important critical systems (single points of failure), though not yet all of them, have been made redundant

Two NAS in the core system with RAID5 redundancy
GW and OS server: operating systems installed on two RAID1 (mirrored) disks
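
As a purely illustrative aside (not part of the production setup), a diskless node can verify that its root filesystem really is mounted over NFS by reading the kernel mount table:

# Illustrative check that a node's root filesystem is mounted over NFS,
# by reading the kernel's mount table. Not part of the production setup.
def root_is_nfs(mounts_file="/proc/mounts"):
    with open(mounts_file) as f:
        for line in f:
            _device, mountpoint, fstype = line.split()[:3]
            if mountpoint == "/":
                return fstype.startswith("nfs")
    return False

if __name__ == "__main__":
    print("root over NFS:", root_is_nfs())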
Farm Configuration (II)
[Diagram: farm layout. Gateway (Red Hat 7.2, DNS, IP forwarding, masquerading, RAID 1); OS file server (Red Hat 7.2, RAID 1; OS file systems, home directories, and various services: PXE remote boot, DHCP, NIS); Control Node 1 (diskless, Red Hat 6.1, kernel 2.2.18, PBS master, Mcserver, farm monitoring); Processing Nodes 2…n (diskless, Red Hat 6.1, kernel 2.2.18, PBS slave); two NAS (RAID 5); Fast Ethernet switch with Giga uplink carrying the public and private VLANs; Ethernet-linked power distributor for power control.]
[Photos: Fast Ethernet switch; rack of 1U dual-processor motherboards; 1 TB NAS; Ethernet-controlled power distributor (32 channels).]
Performance

The system has been fully integrated in the LHCb MC production since August 2001
20 CPUs until December, 60 CPUs until last week, 100 CPUs now
Produced mostly bb inclusive DST2 with the classic detector (SICBMC v234 and SICBDST v235r4, 1.5 M events) + some 100k channel data sets for LHCb light studies
Typically about 20 hours are needed on a 1 GHz PIII to run the full chain (minbias RAWH + bbincl RAWH + bbincl piled-up DST2) for 500 events

Farm capable of producing about (500 events/day per CPU) * (100 CPUs) = 50000 events/day, i.e. 350000 events/week, i.e. 1.4 TB/week (RAWH + DST2); see the sketch at the end of this slide
Data transfer to CASTOR at CERN is currently done with standard ftp (15 Mbit/s out of an available bandwidth of 100 Mbit/s), but tests with bbftp reached very good throughput (70 Mbit/s)

Still waiting for IT to install a bbftp server at CERN
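
As a quick cross-check of the figures above, the sketch below reproduces the production rate and estimates how long one week of output would take to ship to CERN at the two measured transfer rates; the only assumption beyond the quoted numbers is 1 TB ≈ 10^6 MB:

# Back-of-the-envelope check of the production and transfer figures quoted above.
HOURS_PER_CHAIN = 20.0        # full chain for 500 events on a 1 GHz PIII
EVENTS_PER_CHAIN = 500
N_CPUS = 100

events_per_cpu_per_day = EVENTS_PER_CHAIN * 24.0 / HOURS_PER_CHAIN  # ~600, quoted conservatively as ~500
events_per_day = 500 * N_CPUS            # 50000 events/day
events_per_week = events_per_day * 7     # 350000 events/week

weekly_output_mbit = 1.4 * 1e6 * 8       # 1.4 TB/week of RAWH + DST2, with 1 TB ~ 1e6 MB

for name, rate_mbit_s in (("standard ftp", 15), ("bbftp", 70)):
    days = weekly_output_mbit / rate_mbit_s / 86400.0
    print(f"{name:12s}: ~{days:.1f} days to ship one week of output")

At the current 15 Mbit/s ftp rate, one week of output needs roughly 8-9 days of continuous transfer, i.e. more than the week itself, while at the 70 Mbit/s measured with bbftp it takes under 2 days; hence the importance of the bbftp server at CERN.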
Scalability

Production tests were made in the last few days with 82 MC processes running in parallel

Using the two NAS systems independently (instead of sharing the load between them)
Each NAS worked at 20% of its full performance, i.e. each of them can be scaled up by much more than a factor of 2
By distributing the load, we are confident this system can handle more than 200 CPUs working at the same time at 100% (i.e. without bottlenecks); a rough estimate is sketched below
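
A rough scaling estimate from these figures, assuming the 82 processes were split about evenly between the two NAS and that I/O load grows roughly linearly with the number of concurrent processes:

# Rough NAS scaling estimate from the test above. Assumes the 82 processes were
# split about evenly between the two NAS and that I/O load scales linearly.
processes = 82            # MC processes running in parallel during the test
nas_count = 2
nas_load_fraction = 0.20  # each NAS ran at ~20% of its full performance

procs_per_nas = processes / nas_count                  # ~41 processes per NAS
capacity_per_nas = procs_per_nas / nas_load_fraction   # ~205 processes per NAS at 100%
total_capacity = capacity_per_nas * nas_count          # ~410 processes across both NAS

print(f"Each NAS could sustain roughly {capacity_per_nas:.0f} concurrent processes,")
print(f"so together they should cope with well over 200 CPUs (~{total_capacity:.0f}).")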
For the analysis we want to test other technologies

We plan to test a fibre channel network (SAN, Storage Area
Network) on some of our machines, with nominal 1 Gbit/s
bandwidth to fibre channel disk arrays
Resources for the DC

Additional resources from INFN-CNAF are foreseen for the DC period
We will join the DC with on the order of 150-200 CPUs (around 1 GHz or more), 5 TB of disk storage and a local tape storage system (CASTOR-like? Not yet officially decided)
Some work is still needed to make the system fully redundant
Grid issues (A. Collamati)

2 nodes are reserved at the moment for tests of GRID middleware
The two nodes form a mini-farm, i.e. they have exactly the same configuration as the production nodes (one master node and one slave node) and can run MC jobs as well
Globus has been installed and the first trivial tests of job submission through PBS were successful
Next step: test job submission via Globus on a large scale by extending the PBS queue of the Globus test farm to all our processing nodes (a submission sketch is given below)

No interference with the working distributed production system
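
A minimal sketch of what such a test submission could look like, wrapping the standard globus-job-run GRAM client from Python; the gatekeeper contact string below is a placeholder, not the farm's actual host:

# Minimal sketch of submitting a test job to the farm's PBS queue through the
# Globus GRAM gatekeeper, wrapping the standard globus-job-run client.
# The host name below is a placeholder, not the farm's actual gatekeeper.
import subprocess

GATEKEEPER = "gridtest.example.org/jobmanager-pbs"  # hypothetical contact string

def run_test_job(executable="/bin/hostname"):
    """Run a trivial job on the Globus test farm via the PBS jobmanager."""
    result = subprocess.run(
        ["globus-job-run", GATEKEEPER, executable],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

if __name__ == "__main__":
    # Should print the name of whichever PBS processing node ran the job.
    print(run_test_job())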
Conclusions

Bologna is ready to join the DC with a reasonable amount of resources
Scalability tests were successful
The farm configuration is pretty stable
We need the bbftp server installed at CERN to fully exploit the WAN connectivity and throughput
We are waiting for CERN's decision on the DC period for the final allocation of INFN-CNAF resources
Work on GRID middleware has started, and the first results are encouraging
We plan to install Brunel ASAP