CHEP 2007
Building a Scalable Event Level
Metadata Service for ATLAS
Performance and Scalability Tests for
Relational TAG Database
Jack Cranshaw, Luc Goossens, David Malon,
Helen McGlone, Florbela Viegas
Introduction
• The ATLAS TAG Database
• A Challenging Environment
• The 1TB Tests
• Lessons learned
• Performance results
• What’s next
The ATLAS TAG Database
• Defined in ATLAS Computing Model
• File and relational database system
• Supports seamless discovery, identification,
selection and retrieval of events of interest
to an analysis, and no others
• TAGs written at AOD production
• Implemented using LCG POOL Collection
infrastructure
What is a TAG?
• 1kB summary information per event
• Attributes selected to support selection of
events and navigation within the system
– Event ID and Global event properties
– Trigger information
– Quality information
– Temporal information
– Physics objects (high level)
A Challenging Environment…
• ATLAS data rate – 200 Hz
– 200 new event TAGs per second during data taking;
assuming 50k active seconds/day (≈58% efficiency on
each active day), that is 10^7 events/day
• ATLAS data volume

  Year  % year data taking  Data Volume
  2008  40                  1.42 TB
  2009  60                  3.65 TB
        additional 60       6.09 TB

  Terabytes!
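(A rough cross-check of these figures: 40% of 365 days × 10^7 events/day × ~1 kB/event ≈ 1.5 × 10^9 kB ≈ 1.5 TB, in line with the quoted 1.42 TB for 2008.)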
A Challenging User!
• The ATLAS physicist…
– Fast, efficient, accurate queries
– Reliable navigation to event data
– Seamless integration with analysis

QUERY → TAG DB → RETRIEVE → ANALYSIS
Performance and Scalability Tests
for Relational TAG Database
• Large-scale, realistic tests to uncover the
challenges brought by scale
• Optimise and measure performance:
– Management
– Partitioning
– Indexing
– Optimizer hints
– Multiple clients
– Query patterns
Why 1TB tests?
• We expect that important performance
and management phase transitions will be
crossed as we scale to billions of events
• Queries may be unselective and select
across a range of attributes – we want to
address this challenge at large scale
• In particular, the transition from memory to disk
The 1TB TAG Database
• Created one million dummy events based
on real TAG attributes
• Realistic and varied data types and value
distributions
• Multiplied and replicated from one million to
one billion rows, realistic and with unique IDs
The 1TB Table

Name              Datatype      Size  Scale  Nulls?  Index?
ID                NUMBER        36    0      Yes     B*Tree
RUNNR             NUMBER        12    0      Yes     Bitmap
EVENTNR           NUMBER        12    0      Yes     Bitmap
GOLDEN1           NUMBER        12    0      Yes     Bitmap
GOLDEN2           NUMBER        12    0      Yes     Bitmap
AODFILEFK         NUMBER        12    0      Yes     Bitmap
ESDFILEFK         NUMBER        12    0      Yes     Bitmap
BOOL500CHAR01     CHAR          1     -      Yes     Bitmap
BOOL100CHAR01     CHAR          1     -      Yes     Bitmap
BOOL10CHAR01      CHAR          1     -      Yes     Bitmap
BOOL1CHAR01       CHAR          1     -      Yes     Bitmap
BOOL500NUM01      NUMBER        1     0      Yes     Bitmap
BOOL100NUM01      NUMBER        1     0      Yes     Bitmap
BOOL10NUM01       NUMBER        1     0      Yes     Bitmap
BOOL1NUM01        NUMBER        1     0      Yes     Bitmap
ENUMUNI100VC01    VARCHAR2      10    -      Yes     Bitmap
ENUMUNI10VC01     VARCHAR2      10    -      Yes     Bitmap
ENUMEXP1000NUM01  NUMBER        5     0      Yes     Bitmap
ENUMEXP100NUM01   NUMBER        5     0      Yes     Bitmap
ENUMEXP10NUM01    NUMBER        5     0      Yes     Bitmap
ENUMUNI1000NUM01  NUMBER        5     0      Yes     Bitmap
ENUMUNI100NUM01   NUMBER        5     0      Yes     Bitmap
ENUMUNI10NUM01    NUMBER        5     0      Yes     Bitmap
ENUMUNI5NUM01     NUMBER        5     0      Yes     Bitmap
NOR100BF01        BINARY_FLOAT  4     -      Yes     B*Tree
NOR10BF01         BINARY_FLOAT  4     -      Yes     B*Tree
NOR1BF01          BINARY_FLOAT  4     -      Yes     B*Tree
NOR100NUM01       NUMBER        12    5      Yes     B*Tree
NOR10NUM01        NUMBER        12    5      Yes     B*Tree
NOR1NUM01         NUMBER        12    5      Yes     B*Tree
UNINUM01          NUMBER        12    0      Yes     B*Tree
UNI10KNUM01       NUMBER        12    0      Yes     Bitmap
UNI1KNUM01        NUMBER        12    0      Yes     Bitmap
UNI100NUM01       NUMBER        12    0      Yes     Bitmap
UNI10NUM01        NUMBER        12    0      Yes     Bitmap
• We replicate the 1M rows to produce a 1TB table
• Approximately 1 kB row size (the expected TAG size)
• The distribution of values tries to mimic the types of
columns and distributions expected for TAG metadata
• This set of attributes is repeated in the same
sequence up to set 8, but only this (first) set is indexed
1TB Test system
• INT8R.CERN.CH cluster – ASM shared storage, ~2 TB user space available
• ITRAC04.cern.ch – 2 CPUs, 4 GB total memory – runs the INT8R1 instance with a 2 GB Oracle cache
• ITRAC05.cern.ch – 2 CPUs, 4 GB total memory – runs the INT8R2 instance with a 2 GB Oracle cache
Partitioning the data
• Partitioning is an Oracle tool for
– query performance
– data manageability
• Multidimensional
– Horizontal
– Vertical
• Choice of partition key
– RUN
– STREAM
Table Architecture
• Divide all data HORIZONTALLY into TEN
STREAM TABLES (STREAM A, STREAM B, … STREAM N)
• Partition all STREAM TABLES HORIZONTALLY
by RUN (RUN a, RUN b, RUN c, …) – each table has
100 partitions, 40 runs per partition,
1 million rows per partition
• This is our nth partitioning schema… (a minimal DDL sketch follows)
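A minimal Oracle DDL sketch of this layout for one of the ten stream tables; the table name and the selection of columns are illustrative placeholders, not taken from the talk:

-- Hypothetical sketch: one stream table, range-partitioned on the run
-- number so that predicates on RUNNR allow partition elimination.
CREATE TABLE tag_stream_a (
  id          NUMBER(36,0) NOT NULL,  -- unique event TAG id
  runnr       NUMBER(12,0),           -- partition key
  eventnr     NUMBER(12,0),
  bool1char01 CHAR(1),                -- low-cardinality flag
  nor1num01   NUMBER(12,5),           -- normal-distributed attribute
  uninum01    NUMBER(12,0)            -- uniform, effectively unique
  -- ... remaining TAG attributes elided ...
)
PARTITION BY RANGE (runnr) (
  PARTITION p001 VALUES LESS THAN (40),  -- runs 0-39, ~1M rows
  PARTITION p002 VALUES LESS THAN (80),  -- runs 40-79
  -- ... 100 partitions in total, 40 runs per partition ...
  PARTITION p100 VALUES LESS THAN (MAXVALUE)
);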
Indexing
• In our test table we have B*Tree-indexed, bitmap-indexed
and non-indexed attributes
• Tested extensively to understand the performance
and management implications of all three
We learned
• B*Tree-indexed and bitmap-indexed attributes perform differently
• Oracle has distinct query plans which are optimal for each
• Non-indexed attributes are a huge restriction on query
performance
• So…
• Index everything (a challenge at this scale)
• Pre-process queries to separate B*Trees and bitmaps, and implement a
different query plan for each
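A sketch of the two index types under discussion; the index names are hypothetical, and the table and columns reuse the illustrative DDL above:

-- B*Tree index on a high-cardinality attribute (LOCAL: one index per partition)
CREATE INDEX tag_stream_a_id_ix ON tag_stream_a (id) LOCAL;

-- Bitmap index on a low-cardinality attribute
CREATE BITMAP INDEX tag_stream_a_b1c_ix ON tag_stream_a (bool1char01) LOCAL;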
Optimizing the Optimizer
• Oracle has an Optimizer designed to select
optimal query execution plans
• Performance is sensitive to query plans
• The Optimizer chooses the plan based on cost estimation
• Complex selection
• Uses statistical information about the data
• The Optimizer is not perfect
• Investigate optimal query plans
• Help the Optimizer make the right choice
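One standard Oracle idiom for inspecting the plan the Optimizer would choose (a generic technique, not a command shown in the talk; the query reuses the illustrative table above):

-- Ask Oracle for its chosen plan, then display it
EXPLAIN PLAN FOR
SELECT COUNT(*)
FROM tag_stream_a
WHERE bool1char01 = 'Y';

-- Shows access paths, index usage and parallelism for the statement above
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);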
Oracle Hints
• Many hints were tested…
– PARALLEL – needed for parallelization of a full table scan or
partition range scan
– PARALLEL_INDEX – needed for parallelization of index access
– INDEX_JOIN – for hash joins with B*Tree indexes
– INDEX_COMBINE – for bitmap indexes
– OPT_PARAM('_index_join_enabled', 'false') – can enable and
disable session parameters for a single SQL statement
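A sketch of how such hints attach to a query; the predicates and values are illustrative, reusing the hypothetical table above:

-- INDEX_COMBINE: steer the Optimizer towards combining bitmap indexes
SELECT /*+ INDEX_COMBINE(t) */ COUNT(*)
FROM tag_stream_a t
WHERE bool1char01 = 'Y';

-- INDEX_JOIN: steer it towards a hash join of B*Tree indexes
SELECT /*+ INDEX_JOIN(t) */ COUNT(*)
FROM tag_stream_a t
WHERE nor1num01 > 10 AND uninum01 < 500;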
Queries for testing
"Count the events with at least two electrons and missing
ET > 10 GeV that are 'good for physics'" – SUMMARY
"Give me all the events with at least two electrons and missing
ET > 10 GeV that are 'good for physics'" – CONTENT
• Selection based on both index types
• Use INDEX_JOIN for B*Trees and INDEX_COMBINE for
bitmaps, then INTERSECT (see the sketch below)
• Flush the buffer cache between queries, so no cache
advantage
• Increase the number of partitions involved as the rows
returned increase, with a consistent % of rows from each partition
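A minimal sketch of the split-and-INTERSECT pattern and of the cache flush between timed runs; the predicates and names are illustrative:

-- Bitmap-indexed predicates via INDEX_COMBINE ...
SELECT /*+ INDEX_COMBINE(t) */ id
FROM tag_stream_a t
WHERE bool1char01 = 'Y'
INTERSECT
-- ... intersected with B*Tree-indexed predicates via INDEX_JOIN
SELECT /*+ INDEX_JOIN(t) */ id
FROM tag_stream_a t
WHERE nor1num01 > 10;

-- Drop cached blocks between runs so no query gains a cache advantage
ALTER SYSTEM FLUSH BUFFER_CACHE;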
Summary queries
[Charts: "Time to count events, 1% data per partition, both index types" (time in s vs rows counted, up to ~80,000) and "Time to count events, 10% data per partition, both index types" (time in s vs rows returned, up to ~800,000)]
• Time increases with the number of partitions
• Linear increase – we can predict the time
• 10× the data is not 10× the time – time can be predicted within some bounds
• Times are of the order of seconds
Summary queries – an extreme
[Chart: "Time to count events, all partitions, increase data returned, both index types" – time in minutes vs rows returned, up to ~4,000,000]
• See the value of partition elimination
Content queries
[Charts: "Time to select, both index types, 1-10 partitions, 1% data per partition" (time in s vs rows returned, up to ~120,000) and "Time to select, both index types, 1-10 partitions, 10% data per partition" (time in s vs rows selected, up to ~1,200,000)]
• Linear increase in time with the number of partitions
• The time overhead is in the number of partitions accessed, not in the data returned from within them
• Times are of the order of seconds
• Does this linear relation extrapolate indefinitely…?
Content query – an extreme
[Chart: "Time to select, both index types, all partitions, increase data per partition" – time in minutes vs rows returned, up to ~70,000,000]
• If the linear relation held (time roughly proportional to the number of
partitions in the query), a query over all 100 partitions would take ~20 minutes
• Not in practice… why?
• A threshold case where sorts move from memory to disk – higher performance
overhead; same query plan, but now using disk
• Still faster than a full table scan
Results
[Chart: "Count vs Select – 10% data, 1-10 partitions" – time in s vs rows returned; the select curve lies well above the count curve. The optimised "extreme case" takes upwards of 30 minutes for selects.]
• Encourage counts before selects
• Encourage use of the partition key (see the sketch below)
• We can extrapolate to predict the time a query is likely to take
• Pre-processing is crucial
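To illustrate use of the partition key, a hedged sketch reusing the hypothetical schema above:

-- RUNNR is the partition key: this predicate maps onto a single partition,
-- so Oracle can eliminate the other 99 before touching any data.
SELECT COUNT(*)
FROM tag_stream_a
WHERE runnr BETWEEN 0 AND 39
  AND bool1char01 = 'Y';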
Multiple clients – Stress Tests
• Assess the performance of the system in a multi-client
environment
• Simulate a realistic query environment by creating
expected typical queries
– Counts and retrieves
– Vary attributes and filters
• Create a sample job of 9 optimised queries, with 60
seconds, then 90 seconds, of pause between each
(a sketch of such a job follows)
• Sessions run on one node of the INT8R cluster;
the node has 2 GB memory and 2 CPUs
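A minimal sketch of what one such client job might look like; the specific queries are illustrative, not the talk's nine:

-- One client "job": a mix of optimised counts and retrieves
SELECT /*+ INDEX_COMBINE(t) */ COUNT(*)
FROM tag_stream_a t
WHERE bool1char01 = 'Y';

SELECT id, runnr, eventnr
FROM tag_stream_a
WHERE runnr BETWEEN 0 AND 39
  AND nor1num01 > 10;

-- ... 9 optimised queries per job in total; the client then pauses 60 s
-- (or 90 s) before submitting the next job.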
[Charts: "Job Frequency – 60 seconds" and "Job Frequency – 90 seconds" – number of active/concurrent jobs, job time (minutes) and average host CPU, plotted against wall-clock time]
• Saturation at 60 seconds; queries supported at
90 seconds
• 1 job every 90 seconds generates ~9000
queries per day
• The T0 production database has 6 nodes; each T1
has 2 nodes (3 Oracle licences)
• New hardware in April 2008; will test when the new
hardware becomes available
• Need to limit concurrent processes
ATLAS Relational TAG Database experience
• The Streams Test TAG database
– Smaller scale
– Introduces users to TAGs
– Gathers query pattern information
http://atlas.web.cern.ch/Atlas/GROUPS/OPERATIONS/dataBases/TAGS/tag_browser.php
• More than 5000 queries captured
• Take the query capture with a "pinch of salt"… but it is positive and
useful feedback nonetheless
[Chart: counts vs retrievals for the Streams Test TAG database]
What’s next?
• Assess performance on new hardware
• New 'real' TAG database for CSC data
• Continue gathering data about likely query
patterns
• Assess file vs. relational database to guide users
as to when each is appropriate
• Trigger implementation in TAGs
Summary
• The 1TB tests optimised and measured the
performance of a realistic-scale TAG
database
• We understand the challenges and are developing a
system that can offer good performance
• This remains an ongoing challenge
Thank you!
Scenarios
– Complete set of data (or most used part thereof) fits in
memory/cache
• memory is O(1 GB)
• query time is limited by CPU and memory speed
– Complete set of Indices (or most used part thereof) fits in
memory/cache
• fast identification of data
• slow retrieval from disk
– contiguous parts @ O(100MB/s)
– random parts @ O(1000 IO/s) = O(1 MB/s)
– contiguous IO degrades with parallelism
– Neither Indices nor data fit in memory
• index reading from disk
– usually contiguous
– Final and/or intermediate results do not fit in memory
• e.g. for sorting, intersection, joins, ...
• need to use disk even for these ops
Query plans used:
• summary query: HASH JOIN, BITMAP CONVERSION TO ROWIDS, BITMAP INDEX
RANGE SCAN, INDEX RANGE SCAN, PARALLEL QUERY
• content query: INDEX (FULL SCAN) OF 'ICMG1_1_ID', BITMAP INDEX (RANGE
SCAN), BITMAP MERGE, BITMAP AND, BITMAP CONVERSION (TO
ROWIDS), HASH JOIN, SORT (UNIQUE), PARALLEL QUERY
• Each query scans 1 GB of data which, in
logical units, is around 1 hour and 30
minutes of data taking at 200 Hz
• The average time of a job running alone is 10
minutes
• The node has 2 GB memory and 2 CPUs
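(A rough cross-check of the first bullet: at ~1 kB per TAG, 1 GB ≈ 10^6 events, and 10^6 events / 200 Hz = 5,000 s ≈ 83 minutes, roughly the quoted hour and a half.)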