Scalable, Fault-Tolerant NAS for
Oracle - The Next Generation
Kevin Closson
Chief Software Architect
Oracle Platform Solutions, PolyServe Inc.
The Un-“Show Stopper”
• NAS for Oracle is not “file serving”; let me explain…
• Think of GbE NFS I/O paths from the Oracle servers to the NAS device that are totally direct, with no VLAN-style indirection.
  – In these terms, NFS over GbE is just a protocol, as FCP is over Fibre Channel
  – The proof is in the numbers (see the arithmetic after this list):
    • A single dual-socket/dual-core AMD server running Oracle10gR2 can push 273 MB/s of large I/Os (scattered reads, direct path read/write, etc.) over triple-bonded GbE NICs!
    • Compare that to the infrastructure and hardware costs of 4Gb FCP (~450 MB/s, but you need 2 cards for redundancy)
  – OLTP over modern NFS with GbE is not a challenging I/O profile
• However, not all NAS devices are created equal by any means
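A rough back-of-the-envelope check on those throughput figures (my arithmetic, not from the original slide): three bonded GbE links top out at roughly

\[
3 \times 1\,\mathrm{Gb/s} \approx 3 \times 125\,\mathrm{MB/s} = 375\,\mathrm{MB/s}\ \text{(wire rate)},
\qquad
\frac{273\,\mathrm{MB/s}}{375\,\mathrm{MB/s}} \approx 73\%,
\]

so the quoted 273 MB/s is roughly three quarters of the theoretical wire rate of the bonded NICs.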
Agenda
• Oracle on NAS
• NAS Architecture
• Proof of Concept Testing
• Special Characteristics
Oracle on NAS
Oracle on NAS
• Connectivity
  – A Fantasyland Dream Grid™ would be nearly impossible with a FibreChannel switched fabric; for instance:
    • 128 nodes == 256 HBAs and 2 switches, each with 256 ports, just for the servers; then you still have to work out the storage paths (see the port arithmetic after this list)
• Simplicity
  – NFS is simple. Anyone with a pulse can plug in Cat-5 and mount filesystems.
  – MUCH, MUCH simpler than:
    • Raw partitions for ASM
    • Raw or OCFS2 for CRS
    • Oracle Home? Local ext3 or UFS?
    • What a mess
  – Supports a shared Oracle Home, and a shared APPL_TOP too
  – But not simpler than a certified third-party cluster filesystem; that, however, is a different presentation
• Cost
  – FC HBAs are always going to be more expensive than NICs
  – Ports on enterprise-level FC switches are very expensive
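The connectivity bullet above, made explicit (my arithmetic, not from the slide), assuming two HBAs per node for redundancy:

\[
128\ \text{nodes} \times 2\ \tfrac{\text{HBAs}}{\text{node}} = 256\ \text{HBAs}
\quad\Rightarrow\quad
2\ \text{switches} \times 256\ \text{ports} = 512\ \text{switch ports},
\]

before a single storage-side path has been cabled.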
Oracle on NAS
• NFS Client Improvements
  – Direct I/O
    • open() with O_DIRECT works with the Linux NFS client and the Solaris NFS client, and likely others (see the sketch after this list)
• Oracle Improvements
  – init.ora: filesystemio_options=directIO
  – No async I/O on NFS, but look at the numbers
  – The Oracle runtime checks mount options
    • Caveat: it doesn’t always get it right, but at least it tries (OSDS)
  – Don’t be surprised to see Oracle offer a platform-independent NFS client
  – NFS V4 will bring more improvements
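For illustration, a minimal sketch of what direct I/O against an NFS-mounted file looks like at the Linux system-call level; the datafile path below is hypothetical, and the key point is the O_DIRECT flag plus the aligned buffer, offset, and length it requires:

/* Minimal sketch: direct (unbuffered) read from an NFS-mounted file on Linux.
 * The path is hypothetical; the O_DIRECT alignment rules are the point. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    const char *path = "/u02/oradata/system01.dbf";   /* hypothetical datafile on an NFS mount */
    int fd = open(path, O_RDONLY | O_DIRECT);         /* bypass the NFS client page cache */
    if (fd < 0) { perror("open"); return 1; }

    void *buf;
    if (posix_memalign(&buf, 4096, 4096) != 0) {      /* O_DIRECT wants an aligned buffer */
        close(fd);
        return 1;
    }

    ssize_t n = pread(fd, buf, 4096, 0);              /* aligned length and offset as well */
    if (n < 0)
        perror("pread");
    else
        printf("read %zd bytes directly from %s\n", n, path);

    free(buf);
    close(fd);
    return 0;
}

Setting filesystemio_options=directIO asks Oracle to open its datafiles in essentially this way, so datafile I/O is not double-buffered in the NFS client's page cache.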
NAS Architecture
NAS Architecture
• Single-headed Filers
• Clustered Single-headed Filers
• Asymmetrical Multi-headed NAS
• Symmetrical Multi-headed NAS
Single-headed Filer Architecture
NAS Architecture: Single-headed Filer
[Diagram: Oracle servers reach a single-headed filer over a GigE network; the filer presents filesystems /u01, /u02, and /u03.]
Oracle Servers Accessing a Single-headed Filer: I/O Bottleneck
[Diagram: many Oracle database servers funnel into one filer presenting /u01, /u02, and /u03. A single one of those database servers has the same (or more) bus bandwidth as the entire filer; the filer head is the I/O bottleneck.]
Oracle Servers Accessing a Single-headed Filer: Single Point of Failure
[Diagram: the Oracle database servers are highly available through failover HA, Data Guard, RAC, etc., but the single filer head presenting /u01, /u02, and /u03 remains a single point of failure.]
Clustered Single-headed Filers
Architecture: Cluster of Single-headed Filers
[Diagram: two single-headed filers in a failover pair; one presents /u01 and /u02, the other presents /u03, and each head has paths that become active only after failover.]
Oracle Servers Accessing a Cluster of Single-headed Filers
[Diagram: the Oracle database servers mount /u01 and /u02 from one filer head and /u03 from the other; standby paths become active only after failover.]
Architecture: Cluster of Single-headed Filers
[Diagram: the same failover pair of filers serving /u01, /u02, and /u03. What if /u03 I/O saturates its filer?]
Filer I/O Bottleneck. Resolution == Data Migration
[Diagram: a third filer is added with a new filesystem, /u04, and some of the “hot” data is migrated from /u03 to /u04 to relieve the saturated filer.]
Data Migration Remedies I/O Bottleneck
[Diagram: after some of the “hot” data is migrated to /u04, the filer serving /u04 becomes a NEW single point of failure.]
Summary: Single-headed Filers
• Cluster to mitigate the S.P.O.F.
  – Clustering is a pure afterthought with filers
  – Failover times?
    • Long, really, really long.
  – Transparent?
    • Not in many cases.
• Migrate data to mitigate I/O bottlenecks
  – What if the data “hot spot” moves with time? The dog-chasing-its-tail syndrome
• Poor modularity
  – Expanded in pairs for data availability
• What’s all this talk about CNS?
Asymmetrical Multi-headed NAS Architecture
Asymmetrical Multi-headed NAS Architecture
Three Active NAS Heads, Three for Failover, and “Pools of Data”
[Diagram: three active NAS heads and three standby NAS heads in front of pools of data on a FibreChannel SAN, serving the Oracle database servers.]
Note: Some variants of this architecture support M:1 active:standby, but that doesn’t really change much.
Asymmetrical NAS Gateway Architecture
• Really not much different from clusters of single-headed filers:
  – A 1 NAS head to 1 filesystem relationship
  – Migrate data to mitigate I/O contention
  – Failover is not transparent
• But:
  – More modular
    • Not necessary to scale up by pairs
Symmetric Multi-headed NAS
HP Enterprise File Services
Clustered Gateway
Symmetric vs Asymmetric
[Diagram: with the symmetric EFS-CG, every NAS head presents every file (/Dir1/File1, /Dir2/File2, /Dir3/File3); in the asymmetric design, each file is reachable only through the single NAS head that owns it.]
Enterprise File Services Clustered Gateway
Component Overview
• Cluster Volume Manager
  – RAID 0
  – Expand online
• Fully Distributed, Symmetric Cluster Filesystem
  – The embedded filesystem is a fully distributed, symmetric cluster filesystem
• Virtual NFS Services
  – Filesystems are presented through Virtual NFS Services
• Modular and Scalable
  – Add NAS heads without interruption
  – All filesystems can be presented for read/write through any or all NAS heads
EFS-CG Clustered Volume Manager
• RAID 0
  – The LUNs are RAID 1, so this implements S.A.M.E. (Stripe And Mirror Everything)
• Expand online
  – Add LUNs, grow the volume
• Up to 16 TB
  – In a single volume
The EFS-CG Filesystem
• All NAS devices have embedded operating systems and filesystems, but the EFS-CG filesystem is:
  – Fully symmetric
    • Distributed Lock Manager
    • No metadata server or lock server
  – A general-purpose clustered filesystem
  – Standard C Library and POSIX support
  – Journaled, with online recovery
• Proprietary format, but it uses standard Linux filesystem semantics and system calls, including flock() and fcntl(), clusterwide (see the sketch after this list)
• A single filesystem can be expanded online up to 16 TB; up to 254 filesystems are supported in the current release
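A minimal sketch of the POSIX advisory locking referred to above; the file path is hypothetical. The call is the ordinary single-node fcntl() interface, and the slide's point is that the EFS-CG filesystem honors such locks across all NAS heads and nodes:

/* Minimal sketch: a standard POSIX record lock via fcntl().
 * On a symmetric cluster filesystem this advisory lock is honored clusterwide.
 * The file path is hypothetical. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/u01/app/shared/etl.lock", O_RDWR | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    struct flock fl = {
        .l_type   = F_WRLCK,    /* exclusive write lock        */
        .l_whence = SEEK_SET,
        .l_start  = 0,
        .l_len    = 0,          /* 0 means lock the whole file */
    };

    if (fcntl(fd, F_SETLKW, &fl) < 0) {   /* block until the lock is granted */
        perror("fcntl(F_SETLKW)");
        close(fd);
        return 1;
    }

    printf("holding the write lock; press Enter to release\n");
    getchar();

    fl.l_type = F_UNLCK;                  /* release the lock */
    fcntl(fd, F_SETLK, &fl);
    close(fd);
    return 0;
}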
EFS-CG Filesystem Scalability
Scalability: Single Filesystem Export
[Chart: single-filesystem export throughput in MB/s versus cluster size in NAS heads, using x86 Xeon-based NAS heads (old numbers): 123 MB/s with 1 head, 246 with 2, 493 with 4, 739 with 6, 986 with 8, 1,084 with 9, and 1,196 with 10. The approximate single-headed filer limit is marked for comparison.]
NAS Heads (# Servers)   Total Bytes (Mbytes)   Time (sec.)   Mbytes/Sec.   Gbits/Sec   Scale Factor   Scaling Coefficient
1                        16,384                133             123.19      0.96        1.00           100%
2                        32,768                133             246.38      1.92        2.00           100%
4                        65,536                133             492.75      3.85        4.00           100%
6                        98,304                133             739.13      5.77        6.00           100%
8                       131,072                133             985.50      7.70        8.00           100%
9                       147,456                136           1,084.24      8.47        8.80            98%
10                      163,840                137           1,195.91      9.34        9.71            97%
HP StorageWorks Clustered File System is optimized for both READ and WRITE performance.
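Spelling out how the last two columns of the table are derived: the scale factor is throughput relative to a single NAS head, and the scaling coefficient divides that factor by the head count. For the 10-head row:

\[
\text{Scale Factor} = \frac{1{,}195.91\ \mathrm{MB/s}}{123.19\ \mathrm{MB/s}} \approx 9.71,
\qquad
\text{Scaling Coefficient} = \frac{9.71}{10} \approx 97\%.
\]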
Virtual NFS Services
• A specialized virtual host IP
• Filesystem groups are exported through a VNFS
• VNFS failover and rehosting are 100% transparent to the NFS client
  – Including active file descriptors, file locks (e.g., fcntl/flock), etc.
EFS-CG Filesystems and VNFS
[Diagram: an Enterprise File Services Clustered Gateway whose NAS heads present /u01, /u02, /u03, and /u04 to the Oracle database servers through virtual NFS services (e.g., vnfs1 with failover counterparts vnfs1b, vnfs2b, vnfs3b); any filesystem can be served through any head.]
EFS-CG Management Console
EFS-CG Proof of Concept
EFS-CG Proof of Concept
• Goals
  – Use Oracle10g (10.2.0.1) with a single high-performance filesystem for the RAC database and measure:
    • Durability
    • Scalability
    • Virtual NFS functionality
EFS-CG Proof of Concept
• The 4 filesystems presented by the EFS-CG were:
  – /u01. This filesystem contained all Oracle executables (e.g., $ORACLE_HOME)
  – /u02. This filesystem contained the Oracle10gR2 clusterware files (e.g., OCR, CSS) plus some datafiles and external tables for ETL testing
  – /u03. This filesystem was lower-performance space used for miscellaneous tests such as disk-to-disk backup
  – /u04. This filesystem resided on a high-performance volume that spanned two storage arrays. It contained the main benchmark database
EFS-CG P.O.C.
Parallel Tablespace Creation
• All datafiles created in a single exported filesystem
  – Proof of multi-headed, single-filesystem write scalability
EFS-CG P.O.C.
Parallel Tablespace Creation
[Chart: multi-headed EFS-CG tablespace creation scalability, in MB/s: 111 MB/s with a single head and a single GigE path versus 208 MB/s multi-headed with dual GigE paths.]
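Put as a ratio (my arithmetic):

\[
\frac{208\ \mathrm{MB/s}}{111\ \mathrm{MB/s}} \approx 1.9\times,
\]

so tablespace-creation write throughput into a single filesystem nearly doubles when the second NAS head and GigE path are added.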
EFS-CG P.O.C.
Full Table Scan Performance
• All datafiles located in a single exported filesystem
  – Proof of multi-headed, single-filesystem sequential I/O scalability
EFS-CG P.O.C.
Parallel Query Scan Throughput
[Chart: multi-headed EFS-CG full table scan scalability, in MB/s: 98 MB/s with a single head and a single GigE path versus 188 MB/s multi-headed with dual GigE paths.]
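The same ratio for sequential reads (my arithmetic):

\[
\frac{188\ \mathrm{MB/s}}{98\ \mathrm{MB/s}} \approx 1.9\times,
\]

matching the near-2x scaling seen for parallel tablespace creation above.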
EFS-CG P.O.C.
OLTP Testing
• OLTP database based on an Order Entry schema and workload
• Test areas
  – Physical I/O scalability under Oracle OLTP
  – Long-duration testing
EFS-CG P.O.C.
OLTP Workload Transaction Avg Cost
Oracle Statistic          Average Per Transaction
SGA Logical Reads         33
SQL Executions            5
Physical I/O              6.9 *
Block Changes             8.5
User Calls                6
GCS/GES Messages Sent     12

* Averages with RAC can be deceiving; be aware of CR sends.
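To translate the per-transaction averages into an aggregate load (my arithmetic, not from the slide): the physical I/O rate grows as roughly

\[
\text{physical I/O per second} \approx 6.9 \times \text{transactions per second},
\]

so, for example, 1,000 TPS implies on the order of 7,000 physical I/O operations per second against the EFS-CG.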
EFS-CG P.O.C.
OLTP Testing
[Chart: 10gR2 RAC scalability on the EFS-CG, in transactions per second by number of RHEL4-64 RAC servers: 650 TPS on 1 server, 1,246 on 2, 1,773 on 3, and 2,276 on 4.]
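The implied scaling efficiency (my arithmetic):

\[
\frac{2{,}276}{650} \approx 3.5\times\ \text{on 4 nodes},
\qquad
\frac{3.5}{4} \approx 88\%.
\]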
EFS-CG P.O.C.
OLTP Testing. Physical I/O Operations
[Chart: RAC OLTP I/O scalability on the EFS-CG, in random 4K IOps by number of RHEL4-64 RAC servers: 5,214 on 1 server, 8,831 on 2, 11,619 on 3, and 13,743 on 4.]
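As a cross-check against the per-transaction cost table (my arithmetic): at four nodes,

\[
\frac{13{,}743\ \text{IOps}}{2{,}276\ \text{TPS}} \approx 6.0\ \text{physical I/O per transaction},
\]

broadly consistent with the ~6.9 physical I/O per transaction reported earlier.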
EFS-CG Handles All OLTP I/O Types Sufficiently—No Logging Bottleneck
[Chart: OLTP I/O by type, in I/O operations per second: roughly 8,150 datafile reads, 5,593 datafile writes, and 893 redo writes per second.]
Long Duration Stress Test
• Benchmarks do not prove durability
  – Benchmarks are “sprints”
  – Typically 30-60 minute measured runs (e.g., TPC-C)
• This long-duration stress test was no benchmark by any means
  – Ramp OLTP I/O up to roughly 10,000 per second
  – Run non-stop until the aggregate I/O breaks through 10 billion physical transfers
  – That is 10,000 physical I/O transfers per second, every second, for nearly 12 days
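The duration claim, spelled out:

\[
\frac{10^{10}\ \text{I/O transfers}}{10{,}000\ \text{I/O per second}} = 10^{6}\ \text{seconds} \approx 11.6\ \text{days}.
\]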
Long Duration Stress Test
Special Characteristics
Special Characteristics
• The EFS-CG NAS heads are Linux servers
  – Tasks can be executed directly within the EFS-CG NAS heads at FCP speed:
    • Compression
    • ETL, data importing
    • Backup
    • etc.
Example of EFS-CG Special Functionality
• A table is exported on one of the RAC nodes
• The export file is then compressed on the EFS-CG NAS head:
  – Uses CPU from the NAS head instead of the database servers
    • The NAS heads are really just protocol engines; I/O DMAs are offloaded to the I/O subsystems, so there are plenty of spare cycles
  – Data movement happens at FCP rate instead of over GigE
    • This offloads the I/O fabric (the NFS paths from the servers to the EFS-CG)
Export a Table to NFS Mount
Compress it on the NAS Head
Questions and Answers
Backup Slide
EFS-CG Scales “Up” and “Out”
[Diagram: the Oracle servers connect through an Ethernet switch to the EFS-CG NAS heads over 3 GbE NFS paths each (which can be triple-bonded, etc.); the NAS heads connect through FibreChannel switches to the SAN.]