Descriptive Data Analysis of File Transfer Data

Sudarshan Srinivasan
Victor Hazlewood
Gregory D. Peterson
Objective
• Understand the GridFTP transfer log data we have at NICS.
• Analyze the data and identify areas of potential improvement.
• Perform predictive analysis to improve efficiency.
• Apply the knowledge gained to XSEDE service providers.
NICS GridFTP Infrastructure
GridFTP Logging
• GridFTP data transfer protocol, version 5.2.2.
• Two types of logging: "usage" logging and "log_transfer" logging (enabled in 5.2.2).
• Prior to 5.2.2, the endpoint IP address field was filled with 0.0.0.0.
• Thanks to the Globus folks for fixing this bug!
Transfer Logs
• NICS uses a PostgreSQL database to store transfer log data.
• Two new tables: n_gridftp_usage and n_gridftp_usage_detail.
• n_gridftp_usage: quick lookup of aggregate monthly GridFTP usage information.
• n_gridftp_usage_detail: detailed records of each data transfer.
• Log data includes: starttime, endtime, nbytes, user, filename, and source and destination endpoints.
Log Data Collection
• Log files from each GridFTP server are copied to a central NFS location.
• Each month we run a processing script on the log files that checks the log entries for errors.
• Following this, we run a script to load the log files into the database tables.
• We chose the transfer log data for the year 2013 for this analysis.
Sample log entry:
DATE=20130401132041.657463 HOST=datamover1.nics.utk.edu PROG=globus-gridftp-server NL.EVNT=FTP_INFO START=20130401132041.534646 USER=username NBYTES=1048576 VOLUME=/ STREAMS=1 STRIPES=1 DEST=[192.249.6.164] TYPE=RETR CODE=226
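The error check and load steps are not shown on the slides. As a rough illustration only, the sketch below parses one KEY=VALUE entry into a typed record and rejects malformed lines. The names `parse_entry`, `parse_timestamp`, and `REQUIRED_KEYS` are hypothetical, the sample string is a cleaned-up copy of the entry above, and treating DATE as the completion time is an assumption. Values that contain spaces (for example some file names) would need a more careful tokenizer.

```python
from datetime import datetime

# Cleaned-up copy of the sample entry from the slide, wrapped for width.
SAMPLE_ENTRY = (
    "DATE=20130401132041.657463 HOST=datamover1.nics.utk.edu "
    "PROG=globus-gridftp-server NL.EVNT=FTP_INFO "
    "START=20130401132041.534646 USER=username NBYTES=1048576 "
    "VOLUME=/ STREAMS=1 STRIPES=1 DEST=[192.249.6.164] "
    "TYPE=RETR CODE=226"
)

# Assumption: the keys a well-formed entry must carry for this analysis.
REQUIRED_KEYS = ("DATE", "START", "USER", "NBYTES", "DEST", "TYPE", "CODE")


def parse_timestamp(value):
    """Parse a YYYYMMDDhhmmss.ffffff timestamp into a datetime."""
    return datetime.strptime(value, "%Y%m%d%H%M%S.%f")


def parse_entry(line):
    """Split one KEY=VALUE log line into a typed record, or raise ValueError."""
    fields = {}
    for token in line.split():
        key, sep, value = token.partition("=")
        if not sep or not key:
            raise ValueError(f"malformed token: {token!r}")
        fields[key] = value
    missing = [key for key in REQUIRED_KEYS if key not in fields]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    return {
        "starttime": parse_timestamp(fields["START"]),
        "endtime": parse_timestamp(fields["DATE"]),  # assumed completion time
        "nbytes": int(fields["NBYTES"]),
        "user": fields["USER"],
        "dest": fields["DEST"].strip("[]"),  # drop the surrounding brackets
        "type": fields["TYPE"],
        "code": int(fields["CODE"]),
    }


record = parse_entry(SAMPLE_ENTRY)
elapsed = (record["endtime"] - record["starttime"]).total_seconds()
print(record["user"], record["nbytes"], record["dest"], elapsed)
```

Raising `ValueError` on any malformed token, bad timestamp, or missing key lets a monthly batch job count and set aside bad entries before anything is loaded into the database.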
Log Data Analysis
• Two variables were identified: number of transfers and total amount of data transferred.
• Data transfer rate is computed from starttime, endtime, and nbytes.
• Monthly visual comparison of data coming into and going out of NICS from everywhere.
• Intra-XSEDE site number of transfers and amount of data transferred coming into and going out of NICS.
• Bucketing of transfer data based on transfer size (ts).
• The R statistical computing language was used to plot all histograms and graphs.
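The rate calculation and the size bucketing can be sketched as follows. This is an illustrative Python sketch, not the R code used for the analysis. The function names are hypothetical, the units (1 MB = 2^20 bytes, 1 GB = 2^30 bytes, rates in gigabits per second) are assumptions, and so is the handling of sizes that fall exactly on a bucket boundary, which the slides do not specify.

```python
from collections import Counter
from datetime import datetime

MB = 1024 ** 2  # assumption: binary megabyte
GB = 1024 ** 3  # assumption: binary gigabyte


def transfer_rate_gbps(starttime, endtime, nbytes):
    """Return the average rate in gigabits per second, or None if undefined."""
    seconds = (endtime - starttime).total_seconds()
    if seconds <= 0:
        return None  # zero or negative duration: rate cannot be computed
    return nbytes * 8 / seconds / 1e9


def size_bucket(nbytes):
    """Assign a transfer to one of the three size buckets."""
    if nbytes < MB:
        return "ts < 1 MB"
    if nbytes <= 64 * GB:  # assumption: boundary sizes go to the middle bucket
        return "1 MB <= ts <= 64 GB"
    return "ts > 64 GB"


# A 1 GB transfer that took 10 seconds.
start = datetime(2013, 11, 1, 0, 0, 0)
end = datetime(2013, 11, 1, 0, 0, 10)
rate = transfer_rate_gbps(start, end, GB)

# Bucket a handful of made-up transfer sizes.
sizes = [512, MB, 5 * GB, 100 * GB]
buckets = Counter(size_bucket(size) for size in sizes)
print(rate, dict(buckets))
```

Returning `None` for non-positive durations keeps clock-skewed or truncated records from producing infinite or negative rates in the plots.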
Basic Statistics for the year 2013
Type                                   Quantity
Total transfers                        67,160,380
Average transfers per month            5,596,698
File transfers, ts > 64 GB             813 (0.001%)
File transfers, 1 MB < ts < 64 GB      19,374,549 (28.85%)
File transfers, ts < 1 MB              47,785,018 (71.15%)
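The percentages and the monthly average follow directly from the counts. The short Python check below reproduces them from the figures in the table; the variable names are illustrative only.

```python
# Transfer counts per size bucket for 2013, taken from the table.
counts = {
    "ts > 64 GB": 813,
    "1 MB < ts < 64 GB": 19_374_549,
    "ts < 1 MB": 47_785_018,
}

total = sum(counts.values())

# Share of each bucket as a percentage of all transfers.
shares = {bucket: 100 * count / total for bucket, count in counts.items()}

# Average number of transfers per month over the 12 months of 2013.
monthly_average = total / 12

print(f"Total transfers: {total:,}")
for bucket, share in shares.items():
    print(f"{bucket}: {share:.3f}%")
print(f"Average per month: {monthly_average:,.0f}")
```

The three buckets add up to the reported total of 67,160,380 transfers, so the table is internally consistent.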
Number of transfers and amount transferred for the year 2013
[Figure: two plots by month, with the mean shown. Number of transfers (in millions), Total = 83.54 millions. Total amount transferred (in TB), Total = 1235.7 millions.]
Percentage of transfers vs. transfer size for the year 2013
[Figure. Y-axis: percentage of transfers. X-axis: transfer size (ts). Total transfers: 67,160,380.]
Transfer speed for top 500 transfers with transfer size > 1 GB
[Figure. Y-axis: transfer speed (Gbps). X-axis: month.]
Monthly comparison between number of transfers coming into and going out of NICS for the year 2013
[Figure. Y-axis: total number of transfers (in millions). X-axis: month.]
Monthly comparison between total amount of data coming into and going out of NICS for the year 2013
[Figure. Y-axis: total amount of data moved (in TB). X-axis: month.]
Transfer data buckets for November 2013
[Figure: four panels, each plotting percentage of transfers against transfer size (ts). Panel titles: all transfers for November 2013; ts < 1 MB; 1 MB < ts < 64 GB; ts > 64 GB. Total transfers listed on the panels: 2181157, 749747, 1431385, and 25.]
Intra-XSEDE Sites and Abbreviations

Site Name                                                                         Abbreviation
Texas Advanced Computing Center                                                   TACC
Pittsburgh Supercomputing Center                                                  PSC
San Diego Supercomputer Center                                                    SDSC
National Institute for Computational Sciences / Georgia Institute of Technology   NICS/GaTech
Indiana University                                                                IU
Open Science Grid                                                                 OSG
National Center for Atmospheric Research                                          NCAR
Intra-XSEDE site data coming into NICS
[Figure: two plots by month, total amount transferred (in TB) and number of transfers (in thousands). Legend: TACC, PSC, SDSC, NICS/GaTech, IU, OSG, NCAR.]
Intra-XSEDE site data going out of NICS
[Figure: two plots by month, total amount transferred (in TB) and number of transfers (in thousands). Legend: TACC, PSC, SDSC, NICS/GaTech, IU, OSG, NCAR.]
Intra-XSEDE site data coming into and going out of NICS together
[Figure: two plots by month, total amount transferred (in TB) and number of transfers (in thousands). Legend: TACC, PSC, SDSC, NICS/GaTech, IU, OSG, NCAR.]
Future Work
• Currently in progress:
– Moving from a PostgreSQL database to loading the data entirely in memory on a separate machine.
– Using Apache Spark for fast large-scale data processing.
– Combining SQL, streaming, and complex analytics.
– Using advanced data mining and machine learning algorithms provided by Python libraries.
• Next steps:
– Combine job data, filesystem data, and archive data for analysis.
– Visualize data flow within the XSEDE network on a geographical map.
Thank You!