
WLCG – January 2007
ATLAS COOL/TAGs (online, offline, T1)
Florbela Viegas, Gancho Dimitrov
22-01-2007
Online Database, T0 and T1 Architecture
[Diagram] PVSS2COOL application and TAGs application → ATLAS Online RAC (6-node ATONR, ATLAS_COOL_3D schema) → ATLAS Offline RAC (6-node ATLR) → redo log transport services → ATLDSC (real-time downstream capture) → Tier-1 sites: PIC, BNL, NORDUGRID, CNAF, SARA, IN2P3, RAL, TRIUMF, GridKa, ASGC.
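For illustration only (not taken from the slides), a minimal sketch of how a real-time downstream capture and a propagation to one Tier-1 could be declared with the standard Oracle Streams administration packages; all queue, capture, propagation, schema and database-link names below are hypothetical and do not claim to be the ones used in this setup.

  -- Run on the downstream capture database (the ATLDSC role in the diagram).
  BEGIN
    -- Buffered queue for the captured LCRs.
    DBMS_STREAMS_ADM.SET_UP_QUEUE(
      queue_table => 'STRMADMIN.COOL_QUEUE_TAB',
      queue_name  => 'STRMADMIN.COOL_QUEUE');

    -- Downstream capture: redo shipped from the source database is mined here.
    DBMS_CAPTURE_ADM.CREATE_CAPTURE(
      queue_name        => 'STRMADMIN.COOL_QUEUE',
      capture_name      => 'COOL_CAPTURE',
      source_database   => 'ATLR.CERN.CH',
      use_database_link => TRUE);

    -- Mine the standby redo logs as they arrive (real-time downstream capture).
    DBMS_CAPTURE_ADM.SET_PARAMETER(
      capture_name => 'COOL_CAPTURE',
      parameter    => 'downstream_real_time_mine',
      value        => 'Y');

    -- Capture only the replicated COOL schema.
    DBMS_STREAMS_ADM.ADD_SCHEMA_RULES(
      schema_name     => 'ATLAS_COOL_3D',
      streams_type    => 'capture',
      streams_name    => 'COOL_CAPTURE',
      queue_name      => 'STRMADMIN.COOL_QUEUE',
      include_dml     => TRUE,
      include_ddl     => TRUE,
      source_database => 'ATLR.CERN.CH');

    -- One propagation per Tier-1 destination queue, RAL taken as the example.
    DBMS_STREAMS_ADM.ADD_SCHEMA_PROPAGATION_RULES(
      schema_name            => 'ATLAS_COOL_3D',
      streams_name           => 'COOL_PROP_RAL',
      source_queue_name      => 'STRMADMIN.COOL_QUEUE',
      destination_queue_name => 'STRMADMIN.COOL_QUEUE@RAL_T1_DBLINK',
      include_dml            => TRUE,
      include_ddl            => TRUE,
      source_database        => 'ATLR.CERN.CH');
  END;
  /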
Test environment – status as it was on 18-19.01.2007
[Diagram] Stefan Stonjek’s COOL client application → ATLAS Online RAC (2-node ATONR, ATLAS_COOL_3D schema) → ATLAS Validation RAC (2-node INTR, also fed by the TAGs test application) → ATLAS_COOL_3D and ATLAS_TAGS_3D schemas → Tier-1 sites (including the Phase 2 sites): PIC, BNL, NORDUGRID, CNAF, SARA, IN2P3, RAL, TRIUMF, GridKa, ASGC.
Replication Test Issues
- The tests made so far ‘support’ the throughput requirements (we still do not know what the actual requirements for COOL are) at the following T1 sites: IN2P3, GRIDKA, CNAF, RAL, ASGC
- The following sites have throughput problems: TRIUMF and BNL
- Sites being tested and connected now: SARA
- Sites not ready yet: NORDUGRID and PIC
Special COOL setup ‘to help’ the Streams Apply process
[Diagram] ATLAS Online RAC: the PVSS2COOL application and the ATLAS PVSS Oracle archive feed the ATLAS_COOL_DCS3D schema, whose COOL tables carry primary keys only; an ATLAS_COOL_DCS schema holds materialized views over them, refreshed on demand, with the full set of indexes and RANGE partitioning on iov_since; the COOL online sub-detector accounts ATLAS_COOLONL_xxx keep all indexes. ATLAS Offline RAC: the same layout, i.e. ATLAS_COOL_DCS3D with PKs only, ATLAS_COOL_DCS materialized views refreshed on demand (full set of indexes, RANGE partitioning on iov_since), plus ATLAS_COOLONL_xxx and ATLAS_COOL_xxx with all indexes. Tier-1s: ATLAS_COOLONL_xxx and ATLAS_COOL_xxx with all indexes.
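As an illustration of this setup (not taken from the slides), a minimal sketch of a RANGE-partitioned materialized view with its own indexes, built over a streamed PK-only table and refreshed on demand; all table, column and partition names below are hypothetical, not the real COOL object names.

  -- Hypothetical names throughout; the real COOL folder tables are not named in the slides.
  -- The base table in ATLAS_COOL_DCS3D keeps only its primary key, so the Streams apply
  -- process does not have to maintain secondary indexes for every applied LCR.
  CREATE MATERIALIZED VIEW ATLAS_COOL_DCS.DCS_IOVS_MV
    PARTITION BY RANGE (IOV_SINCE) (
      PARTITION P_2007_01 VALUES LESS THAN (1200000000000000000),
      PARTITION P_MAX     VALUES LESS THAN (MAXVALUE))
    BUILD IMMEDIATE
    REFRESH COMPLETE ON DEMAND
  AS
    SELECT OBJECT_ID, CHANNEL_ID, IOV_SINCE, IOV_UNTIL, PAYLOAD
    FROM   ATLAS_COOL_DCS3D.DCS_IOVS;

  -- The full set of query indexes lives on the materialized view, not on the replicated table.
  CREATE INDEX ATLAS_COOL_DCS.DCS_IOVS_MV_IDX
    ON ATLAS_COOL_DCS.DCS_IOVS_MV (CHANNEL_ID, IOV_SINCE) LOCAL;

  -- Refresh on demand, e.g. from a scheduled job once a batch of IOVs has been applied.
  BEGIN
    DBMS_MVIEW.REFRESH(list => 'ATLAS_COOL_DCS.DCS_IOVS_MV', method => 'C');
  END;
  /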
Best result so far … (obtained on 19 Dec 2006)
- Participants: INTR (source) and RAL, GridKa, CNAF, IN2P3 (destinations)
- TAGs insert rate: 200 Hz (row length 1.3 kB, i.e. about 15 MB per minute)
- COOL rate: 2.2 MB per minute at the beginning, but it went down with time because of the lack of an index
- For the test period of 12 hours, the latency was 1-2 seconds, except for 2-3 small peaks (plot: y axis = LCRs/sec, x axis = timeline)
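Not from the slides, but for context: latency and LCR-rate figures of this kind are typically read from the Streams dynamic performance views; a minimal sketch, assuming a 10g-style setup (the "apply latency" interpretation is the usual approximation).

  -- Approximate apply latency at a destination: age of the most recently dequeued LCR.
  SELECT apply_name,
         total_messages_dequeued,
         (SYSDATE - dequeued_message_create_time) * 86400 AS latency_seconds
  FROM   v$streams_apply_reader;

  -- Capture-side message counters on the (downstream) capture database.
  SELECT capture_name,
         total_messages_captured,
         total_messages_enqueued
  FROM   v$streams_capture;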
Non-conformity of throughput - BNL
BNL has the following throughput:
[Plot: delay vs. number of records sent for BNL and GRIDKA, propagation at 75 Hz; y axis = minutes of delay (0-20), x axis = number of records sent (0-50000)]
The maximum TAGS rate supported is 50 Hz, which translates into 1.3 kB × 60 × 50 = 3.9 MB/minute.
Non-conformity of throughput - TRIUMF
TRIUMF has the following throughput:
[Plot: y axis = LCRs/sec, x axis = timeline]
The maximum TAGS rate supported is 100 Hz, which translates into 1.3 kB × 60 × 100 = 7.8 MB/minute.
Issues encountered in tests
- Dropping propagations causes ORA-600 errors. This means that if a site is down and has to be removed from the Streams configuration, it cannot be; we have to drop the whole capture, which might leave the destinations out of sync. We are looking for a better solution to this problem (see the sketch after this list).
- When we add rules to filter out tables, the capture slows down considerably, even for a simple rule.
- There is a "library cache lock" problem we are experiencing on INTR which is an obstacle for the administration of the propagation schedules.
- The capture process lags because of the very high production of archive logs on the source database.
- Undo SQL using ROWIDs gets translated at the destination into a column-by-column comparison, which is not usable for recovery in production.
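For reference (not from the slides), the two administrative operations referred to above, as they would normally be issued with the standard Streams packages; the propagation, capture, queue and table names are hypothetical, and the sketch only illustrates the calls whose side effects are described, it is not a fix.

  -- Removing a single down site from the configuration by dropping its propagation;
  -- in these tests this kind of call was reported to raise ORA-600.
  BEGIN
    DBMS_PROPAGATION_ADM.DROP_PROPAGATION(
      propagation_name      => 'COOL_PROP_BNL',
      drop_unused_rule_sets => TRUE);
  END;
  /

  -- Filtering one table out of the capture with a negative rule (inclusion_rule => FALSE);
  -- even a simple rule like this slowed the capture down considerably.
  BEGIN
    DBMS_STREAMS_ADM.ADD_TABLE_RULES(
      table_name     => 'ATLAS_COOL_3D.SOME_LARGE_TABLE',
      streams_type   => 'capture',
      streams_name   => 'COOL_CAPTURE',
      queue_name     => 'STRMADMIN.COOL_QUEUE',
      include_dml    => TRUE,
      include_ddl    => FALSE,
      inclusion_rule => FALSE);
  END;
  /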
Proposal for optimizing the Capture
- Since the capture gets delayed when there is a large transaction volume on the database, we propose to split the source instances (Online and Offline) into two separate databases:
- One holds only the applications that are to be streamed, so the archive logs contain only the meaningful transactions.
- The other is dedicated to non-streamed applications, so its load does not interfere with Streams performance.
Agreement on the Proposal from IT for the Backup Strategy
- The backup strategy is the same for the Online database, the Offline (T0) database and the T1 sites.
- The backup on disk is kept for 48 hours, so recovery can be quick for any point in time within that window.
- The backup on tape is kept for 31 days, so data from up to 31 days ago can be recovered. Recovery from tape takes 5 hours for a 300 GB database.
- ATLAS is in agreement, with the following provisions:
  - UNDO_RETENTION should be set for the databases to safeguard against errors. We propose 24 hours as a starting value.
  - Provision has to be made for historical, read-only data to be stored in a special pool of tapes or disks, outside the RMAN retention policy.
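As a concrete sketch of the UNDO_RETENTION provision (24 hours, as proposed above); the undo tablespace name and the use of RETENTION GUARANTEE are assumptions for the example, not part of the agreement.

  -- 24 hours of undo retention; the parameter value is in seconds.
  ALTER SYSTEM SET UNDO_RETENTION = 86400 SCOPE = BOTH;

  -- Optionally guarantee that retention on the undo tablespace (assumed name UNDOTBS1),
  -- at the price of possibly failing large transactions if undo space runs out.
  ALTER TABLESPACE UNDOTBS1 RETENTION GUARANTEE;

  -- The 31-day tape retention itself would sit on the RMAN side, e.g.:
  --   RMAN> CONFIGURE RETENTION POLICY TO RECOVERY WINDOW OF 31 DAYS;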
Plans
- Continue with the “high-rate” streaming tests.
- Help resolve the problems that pop up at the Tier-1s.
- Demonstrate a reliable service over an extended time period: Athena jobs running at Tier-1s read back data that has been generated at Tier-0. Tests with real COOL data have already been started by Richard and Stefan (RAL has been chosen as the destination).
- Go into production mode at the end of March (when ATLAS offline release 13 will be available).