
Recent cache improvements in ROOT I/O
LCG Applications Area meeting
13 September 2006
René Brun, Leandro Franco
CERN
Problem
• The original ROOT I/O was designed to read local files efficiently.
• Caching was assumed to be implemented in the remote file server daemons (RFIO, dCache, etc.).
• However, server caches were optimized for sequential reads, not for data sets like ROOT Trees.
• In fact, efficient caches cannot be implemented by the remote file servers alone: additional information about the internals of the data sets is essential.
• This is why we implemented TFileCacheRead, TFileCacheWrite and TTreeCache (in version 5.12); a usage sketch follows below.
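As an illustration of how the new cache is used from a ROOT macro, here is a minimal sketch for a read job; the file URL, tree name and branch name are placeholders, and it assumes the TTree::SetCacheSize interface that accompanies TTreeCache rather than being code from the talk.

// Minimal sketch: read a remote file with a 10 MB tree cache enabled.
void read_with_cache()
{
   TFile *f = TFile::Open("root://server//data/remote.root");  // hypothetical remote file
   if (!f || f->IsZombie()) return;

   TTree *T = (TTree*)f->Get("T");   // hypothetical tree name
   T->SetCacheSize(10000000);        // reserve a 10 MB TTreeCache for this tree

   // Subsequent reads (Draw, GetEntry, ...) go through the cache, which groups
   // many small basket reads into a few large requests.
   T->Draw("px");                    // "px" is just an example branch name
   f->Close();
}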
Problem in one picture
[Diagram: several client CPUs reading from one remote data server.]

Total time to execute a remote read =
  CPU time on client
  + network transfer time
  + disk access time
  + process context switch on client side
ROOT file structure
[Figure: layout of a ROOT file.]
Looking inside a ROOT file
root [0] TFile f("hsimple.root");
root [1] f.Map();

20030825/123739  At:100     N=108     TFile
20030825/123739  At:208     N=28486   TBasket        CX = 1.12
20030825/123739  At:28694   N=29695   TBasket        CX = 1.08
20030825/123739  At:58389   N=29695   TBasket        CX = 1.08
20030825/123739  At:88084   N=29170   TBasket        CX = 1.10
20030825/123739  At:117254  N=10151   TBasket        CX = 3.15
20030825/123739  At:127405  N=1506    TH1F           CX = 1.44
20030825/123739  At:128911  N=1604    TH2F           CX = 4.77
20030825/123739  At:130515  N=2731    TProfile       CX = 1.55
20030825/123739  At:133246  N=34167   TNtuple        CX = 3.00
20030825/123739  At:167413  N=277     KeysList
20030825/123739  At:167690  N=6508    StreamerInfo   CX = 3.59
20030825/123739  At:174198  N=56      FreeSegments
20030825/123739  At:174254  N=1       END
Looking inside a ROOT Tree
root [0] TFile f("h1big.root");
root [1] T.Print();

******************************************************************************
*Tree    :h42       : dstar                                                  *
*Entries :   283813 : Total =  561578413 bytes  File Size =  280404398       *
*        :          : Tree compression factor =   2.00                       *
******************************************************************************
*Br    0 :nrun      : nrun/I                                                 *
*Entries :   283813 : Total Size=    1149132 bytes  File Size =      35488   *
*Baskets :      143 : Basket Size=       8000 bytes  Compression=  32.23     *
*............................................................................*
*Br    1 :nevent    : nevent/I                                               *
*Entries :   283813 : Total Size=    1149430 bytes  File Size =     698331   *
*Baskets :      143 : Basket Size=       8000 bytes  Compression=   1.64     *
*............................................................................*
*Br    2 :nentry    : nentry/I                                               *
*Entries :   283813 : Total Size=    1149430 bytes  File Size =     410862   *
*Baskets :      143 : Basket Size=       8000 bytes  Compression=   2.78     *
*............................................................................*
*Br    3 :trelem    : trelem[192]/b                                          *
*Entries :   283813 : Total Size=   55157062 bytes  File Size =   10246696   *
*Baskets :     6922 : Basket Size=       8000 bytes  Compression=   5.37     *
*............................................................................*
*Br    4 :subtr     : subtr[128]/b                                           *
*Entries :   283813 : Total Size=   36770452 bytes  File Size =    2403023   *
*Baskets :     4652 : Basket Size=       8000 bytes  Compression=  15.25     *
*............................................................................*
*Br    5 :rawtr     : rawtr[128]/b                                           *
*Entries :   283813 : Total Size=   36770452 bytes  File Size =    5050350   *
*Baskets :     4652 : Basket Size=       8000 bytes  Compression=   7.26     *
*............................................................................*
*.......                                                                     *
*............................................................................*
*Br  151 :nnout     : nnout[1]/F                                             *
*Entries :   283813 : Total Size=    1149287 bytes  File Size =      20163   *
*Baskets :      143 : Basket Size=       8000 bytes  Compression=  56.73     *
*............................................................................*
Looking inside a ROOT Tree
root [0] TFile f("h1big.root");
root [1] f.DrawMap();

[Figure: map of h1big.root (283813 entries, 280 Mbytes, 152 branches); 3 branches have been colored.]
What happens when reading one branch
• The file h1big.root is 280 Mbytes and contains 152 branches.
• The uncompressed file is 561 Mbytes.
• The uncompressed basket size is 8000 bytes.
• T.Draw("nrun")
  • reads 35488 bytes distributed over 143 baskets
  • average compressed basket size = 248 bytes
  • about 1985 entries per basket
• T.Draw("rawtr")
  • reads 5050350 bytes distributed over 4652 baskets
  • average compressed basket size = 1085 bytes
  • about 61 entries per basket
• Reading one basket triggers a call to TFile::ReadBuffer (these numbers can be reproduced with the sketch below).
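The per-branch numbers above can be read back from the tree metadata. The following ROOT macro is a small sketch, not code from the talk: it loops over the branches of the h42 tree and prints the compressed size, the basket count (as reported by TTree::Print) and the resulting average basket size.

// Sketch: per-branch size and basket statistics for the tree "h42" in h1big.root.
void branch_stats()
{
   TFile *f = TFile::Open("h1big.root");
   if (!f || f->IsZombie()) return;
   TTree *T = (TTree*)f->Get("h42");

   TIter next(T->GetListOfBranches());
   TBranch *br;
   while ((br = (TBranch*)next())) {
      Long64_t zip   = (Long64_t)br->GetZipBytes();   // compressed bytes on file
      Long64_t tot   = (Long64_t)br->GetTotBytes();   // uncompressed bytes
      Int_t    nbask = br->GetWriteBasket();          // basket count, as shown by TTree::Print()
      if (nbask <= 0 || zip <= 0) continue;
      printf("%-10s %10lld bytes in %5d baskets -> %7lld bytes/basket (CX=%5.2f)\n",
             br->GetName(), zip, nbask, zip / nbask, (double)tot / zip);
   }
   f->Close();
}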
System file cache and seek
[Diagram: file1, file2 and file3 laid out on the same disk.]
• Disk latency is small when seeking within a file that is being read sequentially.
• But the time to seek between two files may be large (> 5 ms).
Situation before TTreeCache
• A problem arises if 2 or more users read different files from the same disk at the same time.
Situation with TTreeCache
• Many small buffers are read compacted into a few large buffers.
• This minimizes the number of large seeks.
A major problem: network latency
[Diagram: client/server exchange for three consecutive reads; each read costs one Round Trip Time plus the Client Process Time.]
• Round Trip Time (RTT) = 2*Latency + Response Time
• Total Time = 3*[Client Process Time (CPT)] + 3*[Round Trip Time (RTT)]
             = 3*(CPT) + 3*(Response Time) + 3*(2*Latency)
Idea ( diagram )
• Perform one big request instead of many small requests (only possible if the future reads are known!).
[Diagram: the client sends a single request and the server returns one large response, so the latency is paid only once.]
• Total Time = 3*(CPT) + 3*(Response Time) + (2*Latency)
• A numeric comparison of the two formulas is sketched below.
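To make the gain concrete, the short program below evaluates both formulas for illustrative numbers. Only the 240 ms latency is taken from the case G example later in the talk; the CPT and response times are made-up placeholders, not measurements.

#include <cstdio>

// Compare the two access patterns from the previous slides:
//  - naive:    every read pays its own round trip (2*latency + response)
//  - vectored: one big request pays the latency only once
// All inputs below are illustrative placeholders, not measured values.
int main()
{
    const int    nreads   = 3;      // number of small reads in the example diagrams
    const double cpt      = 0.001;  // client process time per read [s] (assumed)
    const double response = 0.002;  // server response time per read [s] (assumed)
    const double latency  = 0.240;  // one-way network latency [s] (case G: 240 ms)

    double naive    = nreads * (cpt + response + 2.0 * latency);
    double vectored = nreads * (cpt + response) + 2.0 * latency;

    std::printf("naive    : %.3f s\n", naive);
    std::printf("vectored : %.3f s\n", vectored);
    return 0;
}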
Example ( h2fast ) - simulated latency ( xrootd )
[Plot: read times for the h2fast example with simulated network latency, using xrootd.]
TFileCacheRead
[Diagram: the ReadBuffer implementations of TFile, TWebFile, TCastorFile, TDcacheFile, TRFIOFile, TXNetFile and TNetFile all go through TFileCacheRead::ReadBuffers.]
• TFileCacheRead::ReadBuffers manages a list of buffers to be read in one transaction.
• The list is filled by the user or by TTreeCache.
• Buffers are sorted in seek order.
• rootd and xrootd must implement the ReadBuffers protocol.
• A sketch of a vectored read through this interface follows below.
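For orientation, a vectored read through this interface looks roughly like the sketch below. It uses the public TFile::ReadBuffers(char*, Long64_t*, Int_t*, Int_t) entry point; the offsets and lengths are borrowed from the hsimple.root map above purely as example values, whereas in real use the cache fills and sorts this list.

// Sketch: issue one vectored read of three byte ranges from a remote file.
void vectored_read()
{
   TFile *f = TFile::Open("root://server//data/remote.root");  // hypothetical remote file
   if (!f || f->IsZombie()) return;

   const Int_t nbuf = 3;
   Long64_t pos[nbuf] = {100, 28694, 117254};   // seek positions (sorted)
   Int_t    len[nbuf] = {108, 29695, 10151};    // bytes to read at each position
   Int_t    total = len[0] + len[1] + len[2];

   char *buf = new char[total];
   Bool_t err = f->ReadBuffers(buf, pos, len, nbuf);  // one transaction, one round trip
   if (!err) {
      // ... use the data in buf ...
   }
   delete [] buf;
   f->Close();
}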
ReadBuffers protocol on server
• The ReadBuffers protocol has been implemented so far with rootd (Leandro), xrootd (Leandro/Fabrizio) and the Apache httpd server (Fons).
• It could be implemented with other servers like dCache or rfio, but probably not with GFAL.
TTreeCache
• TTreeCache is a specialized TFileCacheRead.
• During the learning phase it estimates the list of branches that are candidates for the cache. This depends on the selection factor.
• The learning phase currently uses a naïve algorithm: by default it covers the first 100 events of the Tree (this number can be set).
• When the learning phase is over, it reads as many buffers as can fit in the cache (default 10 Mbytes). A tuning sketch follows below.
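As a rough illustration of how the learning phase and the cache size can be tuned from user code, here is a minimal sketch; it assumes the TTreeCache::SetLearnEntries and TTree::SetCacheSize entry points and should be read as indicative rather than as the exact 5.12 API.

// Sketch: tune the TTreeCache learning phase and cache size for a read job.
void tuned_cache_read()
{
   TFile *f = TFile::Open("root://server//data/h1big.root");  // hypothetical remote copy
   if (!f || f->IsZombie()) return;
   TTree *T = (TTree*)f->Get("h42");

   TTreeCache::SetLearnEntries(10);   // learn the branch set on the first 10 entries
   T->SetCacheSize(10000000);         // 10 MB cache, the default mentioned above

   Long64_t nentries = T->GetEntries();
   for (Long64_t i = 0; i < nentries; ++i) {
      T->GetEntry(i);                 // after learning, baskets arrive in big vectored reads
   }
   f->Close();
}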
Example of TTreeCache improvement
• The file is on a CERN machine connected to the CERN LAN at 100 MB/s.
• Client A is on the same machine as the file (local read).
• Client B is on the CERN LAN, connected at 100 Mbits/s with a network latency of 0.3 milliseconds (P IV 3 GHz).
• Client C is on a CERN wireless network, connected at 10 Mbits/s with a network latency of 2 milliseconds (Mac Intel Core Duo 2 GHz).
• Client D is in Orsay (LAN 100 Mbits/s), connected to CERN via a WAN with a bandwidth of 1 Gbit/s and a network latency of 11 milliseconds (P IV 3 GHz).
• Client E is in Amsterdam (LAN 100 Mbits/s), connected to CERN via a WAN with a bandwidth of 10 Gbits/s and a network latency of 22 milliseconds (AMD64 280).
• Client F is connected via ADSL with a bandwidth of 8 Mbits/s and a latency of 70 milliseconds (Mac Intel Core Duo 2 GHz).
• Client G is connected via a 10 Gbits/s link to a CERN machine routed via Caltech, latency 240 ms.
• The times reported in the table are real-time seconds for one query to a 280 MB Tree (I/O = 6.6 MB).

client   latency(ms)   cachesize=0   cachesize=64KB   cachesize=10MB
A            0.0           3.4            3.4              3.4
B            0.3           8.0            6.0              4.0
C            2.0          11.6            5.6              4.9
D           11.0         124.7           12.3              9.0
E           22.0         230.9           11.7              8.4
F           72.0         743.7           48.3             28.0
G          240.0         >1800          125.4              9.9
            (0.0)                        (2.4)            (2.4)            (2.4)
The interesting case G
[Diagram: the ROOT client on vinci1.cern and the rootd server on vinci2.cern are both on the CERN LAN (optical switch, 0.1 ms), but the traffic is routed over a 10 Gbits/s WAN loop via New York (120 ms), Chicago (160 ms) and Caltech (240 ms).]
The interesting case G (2)
• The test used the 10 Gbits/s line CERN->Caltech->CERN with TCP/IP jumbo frames (9 KB) and a TCP/IP window size of 10 Mbytes.
• The TTreeCache size was set to 10 Mbytes, so in principle only one buffer/message was required to transport the 6.6 Mbytes used by the query.
• However, the TTreeCache learning phase (set to 10 events) had to exchange 10*3 messages (one per branch) to process the first 10 events, i.e. 10*3*240 ms = 7.2 seconds (see the breakdown below).
• As a result, more time was spent on the first 10 events than on the remaining 283000 events!
• There is still work to do to optimize the learning phase. In this example, we could process the query in 2.7 seconds instead of 9.9.
• We must also reduce the number of messages when opening a file.
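A back-of-the-envelope breakdown of the 9.9 seconds measured for case G with the 10 MB cache, using only the numbers quoted above:

  learning-phase overhead = 10 events * 3 branches * 240 ms = 7.2 s
  remaining query time    = 9.9 s - 7.2 s = 2.7 s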
The interesting case G (3)
• Thanks to this interesting experience with case G (and many thanks to Iosif Legrand, who provided the test bed), I am convinced that access to ROOT files across a WAN with high bandwidth and high latency is doable for physics analysis.
• This will still require additional developments in ROOT I/O, but we must be ready with the software once these WANs with the right TCP/IP parameters and the OS kernels supporting them become available (just a question of months).
• This may have important consequences for the LHC analysis strategy.
Future work ( client side )
• We can try a parallel transfer (multiple threads asking for different chunks of the same buffer) to avoid latency (protocol specific). Recalling the first graphs, we would be dividing the slope by the number of threads.
• We can implement a client-side read-ahead mechanism (also multithreaded) to ask the server for future chunks (in parallel if possible, but it can also be seen as one thread transferring data while the main thread does something else); see the sketch below.
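The sketch below illustrates the parallel-transfer idea with generic C++ threads; fetch_range is a hypothetical stand-in for a protocol-specific request (rootd/xrootd), so this is not the actual ROOT implementation.

#include <thread>
#include <vector>
#include <cstring>
#include <cstdio>

// Stand-in for a protocol-specific request: read `len` bytes at offset `pos`
// into `dst`. Here it only fills dummy data; it is not a real ROOT API.
static void fetch_range(long long pos, int len, char *dst)
{
    std::memset(dst, 0, len);           // pretend the bytes arrived from the server
    std::printf("fetched %d bytes at offset %lld\n", len, pos);
}

// Sketch of the parallel-transfer idea: split one large cache buffer into
// nthreads chunks and fetch them concurrently, so the round-trip latency is
// paid in parallel instead of once per chunk.
static void parallel_fetch(long long pos, int len, char *buf, int nthreads)
{
    std::vector<std::thread> workers;
    int chunk = len / nthreads;
    for (int i = 0; i < nthreads; ++i) {
        long long off = pos + 1LL * i * chunk;
        int n = (i == nthreads - 1) ? len - i * chunk : chunk;  // last chunk takes the remainder
        workers.emplace_back(fetch_range, off, n, buf + 1LL * i * chunk);
    }
    for (auto &t : workers) t.join();   // the buffer is complete once all threads finish
}

int main()
{
    std::vector<char> buf(10 * 1024 * 1024);      // a 10 MB cache buffer, as in TTreeCache
    parallel_fetch(0, (int)buf.size(), buf.data(), 4);
    return 0;
}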
Future work ( server side )
• We could use the pre-read mechanism specified in the xrootd protocol, for example, to avoid the disk latency, but this does not help much with the network latency.
• Although this is implemented in the server, modifications on the client side must be made (we have to tell the server which buffers we want it to pre-read).
Future work ( different issue )
• After receiving the buffer with all the requested chunks, create a thread to decompress the chunks that will be used. This hides the latency of the decompression and reduces the memory footprint, since right now the data is copied twice before being unzipped.
• This is not really related to the other subject, but it could be interesting.
• It is particularly interesting on multi-core CPUs! A sketch of the idea follows below.
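Here is a minimal sketch of the unzip-ahead idea using a generic worker thread; unzip_basket is a placeholder for the real decompression step, so the example only illustrates how decompression can overlap with other work.

#include <thread>
#include <vector>
#include <cstdio>

// Placeholder for basket decompression (stands in for ROOT's unzip step);
// it is not a real ROOT API and just copies the bytes here.
static std::vector<char> unzip_basket(const std::vector<char> &compressed)
{
    return compressed;   // pretend this inflates the basket
}

int main()
{
    // Compressed baskets as they arrive from one vectored read (dummy data).
    std::vector<std::vector<char>> baskets(5, std::vector<char>(8000));
    std::vector<std::vector<char>> unzipped(baskets.size());

    // Unzip-ahead thread: decompress every basket while the main thread
    // is free to do other work (event processing, issuing the next read, ...).
    std::thread unzipper([&] {
        for (size_t i = 0; i < baskets.size(); ++i)
            unzipped[i] = unzip_basket(baskets[i]);
    });

    // ... main-thread work overlaps with the decompression here ...

    unzipper.join();     // baskets are ready to be consumed without further waiting
    std::printf("decompressed %zu baskets\n", unzipped.size());
    return 0;
}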
Conclusions (1)
• TFileCacheRead
  • State: implemented.
  • Potential improvement: critical in high-latency networks (the gain can reach 2 orders of magnitude).
• Pre-reads on the xrootd server
  • State: already implemented on the server; the modifications on the client side are easy.
  • Potential improvement: reduce disk latency.
• Parallel reading
  • State: working on it, beginning with one additional thread and then moving to a pool.
  • Potential improvement: avoid the limitation of the block size in the xrootd server (new latency = old latency / number of threads).
Conclusions (2)
• Read-ahead on the client side
  • State: implemented independently of TFileCacheRead (integration pending).
  • Potential improvement: use all of the CPU time while data is transferred at the same time (in a different thread).
• Unzipping ahead?
  • State: idea.
  • Potential improvement: the application will not need to wait, since the data has been unzipped in advance (by another thread). This could be a substantial gain (up to a factor of 2).