Database Deployment scenarios and performance on SSD arrays

Download Report

Transcript Database Deployment scenarios and performance on SSD arrays

DATABASE DEPLOYMENT
SCENARIOS AND PERFORMANCE
ON SSD ARRAYS
Mark Holliman
Wide Field Astronomy Unit
Institute for Astronomy
University of Edinburgh
Summary
 Some Astronomical survey databases are becoming so
large that some curation tasks are approaching
unreasonable timeframes. This is evident in such surveys as
the Galactic Plane Survey (GPS) for UKIDSS, and the VISTA
Variables in the Via Lactea (VVV). The VVV alone contains
>10 of TB of data and includes a detection table with >1010
rows. While RDBMSs are capable of handling this much
data, the execution times for curation activities can stretch
into weeks at a time. To address this issue, there are two
main approaches:
1.
2.

Switch from a RDBMS to a Key-Pair based model or Column
oriented DB (i.e. Hadoop, MonetDB, etc)
Throw Hardware at it (i.e. SSDs, SAN, LUSTRE, GPFS, etc)
As you can probably guess, this talk is about #2
The Big Database Problem
 In an RDBMS the main bottleneck on almost all operations is disk
I/O. While parallel processing and increased RAM can address
some performance issues, ultimate performance figures are
dictated by how fast data can be moved to/from the storage
medium.

For large databases (>1TB) the most cost efficient storage medium are
RAID5/6/10 arrays of spinning disks (HDD). These disks range in size
from 1TB to 4TB at present.
 These arrays provide redundancy in case of disk failure, while at the same
time speed up I/O by spreading disk operations across multiple devices
simultaneously
 The rotational speed on the hard drives is the main factor in determining
an HDD’s performance. The faster the disk spins, the faster read/write
operations can occur. Most enterprise disks run at 7200RPM, though
10000RPM and 15000RPM disks are available (at a serious jump in price)

SSDs are just beginning to approach the size/price ratio necessary for
large DBs.
Test Server Details
 Intel Xeon 2.8GHz, 24 cores
 16GB RAM
 Disk Subsystem (6Gbps SAS RAID)
 HDD1,2: 7 x 1TB HDD RAID5 arrays
 SSD1: 1 x 512GB Crucial SSD
 SSD2: 6 x 512GB Crucial SSD RAID5 Array
 OS: Windows Server 2008 R2
 DBMS: SQL Server 2008 R2
Full DB Tests
 The first tests involved placing an entire database on each
particular disk subsystem and running a set of queries to
measure performance. The queries were constructed to
represent 3 specific use cases in order to identify exactly
where the SSDs provide performance gains.
 Database: 2MASS
 Query 1, All Indexed Columns:
Select * From twomass_psc Where ((ra> 240) AND (ra<242)) OR ((ra> 120) AND
(ra<122)) AND ((dec> -47) AND (dec<-44)) OR ((dec> 120) AND (dec<123))
 Query 2, All Nonindexed Columns:
Select * From twomass_psc Where (j_m>16) AND (h_m>15.6 AND h_m<15.8) AND (k_m>15.1 AND
k_m<15.3)
 Query 3, Mixed Index Columns:
Select * From twomass_psc Where (j_m>16) AND (h_m>15.6 AND h_m<15.8) AND (k_m>15.1 AND
k_m<15.3) AND ((ra> 230) AND (ra<241)) OR ((ra> 110) AND (ra<121)) AND ((dec> -47) AND (dec<-44)) OR
((dec> 120) AND (dec<123)) AND ((k_psfchi>1) OR (k_psfchi<0.4)) AND (dist_opt<.3) AND (k_msig_stdap<1)
Full DB Test Results
Indexed Col Select Time (s)
Nonindexed Col Select Time (s)
300
3500
3000
253
250
3000
212
2500
200
150
150
150
2000
1500
100
1140
1000
50
500
0
293
255
SSD
SSD RAID5
0
RAID5
RAID0
SSD
SSD RAID5
RAID5
RAID0
Mixed Index Col Select Time (s)
1400
1200
1200
1000
800
600
305
400
104
200
0
RAID5
SSD
SSD RAID5
SSD Cost Issue
 While putting an entire Database on SSDs is
certainly preferable, it is unfortunately cost
prohibitive for very large archive databases
(where 10’s of TB are required).
 At recent pricing:
1TB on SSD = ~£500 vs 1TB on HDD = ~£87
 So with this in mind we ran a second battery of
tests to assess whether sufficient performance
gains can be achieved with a hybrid system,
whereby some database files are placed on SSDs
while the rest remain on standard HDD arrays
SSD File Location Tests
 Database: UKIDSS DR5
 4 different representative queries were run (email me
if you want to see the actual SQL)




#1: produce light curves
#2: produce variable light curves
#3: JOIN Source and Detection Tables (single index)
#4: JOIN Source and Detection Tables (multi indices)
 4 different file/disk configurations
 All files on HDD
 Indices on SSD (RAID)
 Source Tables on SSD (RAID)
 Detection Tables on SSD (RAID)
SSD File Location test results
Query 1 Execution Time
25
11
12
21
19
20
Query 2 Execution Time
9
10
15
8
10
6
5
4
2
1
SSD Index
SSD Source
4
2
0
Spin Disk
6
SSD
Detection
0
Spin Disk
Query 3 Execution Time
1400
1271
SSD Source
SSD Detection
Query 4 Execution Time
1289
60
1167
1200
SSD Index
50
1000
49
39
40
800
33
37
30
600
20
400
130
200
0
10
0
Spin Disk
SSD Index
SSD Source
SSD
Detection
Spin Disk
SSD Index
SSD Source
SSD Detection
Conclusions

The biggest performance gains from SSDs are seen on queries involving
non-indexed columns
 RAID arrays of SSDs do provide performance gains over individual disks,
though these gains are not universal or consistently scaled
 Placing Source or Detection tables on SSDs both result in large
performance gains




Performance gains are heavily dependant on the query being run, and how much
of the query uses the tables on the SSDs
Due to SSD size constraints, Source tables are more likely to fit as they tend to be
smaller
Curation activities could be optimized by moving database files from HDD to SSD
for intensive operations, then moved back to HDD once complete. This only
works if the file movement speeds are high enough to overcome the extra time
necessary for running operations on the HDD stored files
Index file operations are actually slower on SSD than on HDD


This is a serious surprise
Further testing is necessary to determine if this is due to the inherent sequential
access method of HDDs, or if this is due to some quirk of the MS SQL index file
format