RMAN in the Trenches: Part II
Download
Report
Transcript RMAN in the Trenches: Part II
Part 2 -- RMAN in the Trenches:
To Go Forward, We Must Backup
Philip Rice
Univ. of California Santa Cruz
NoCOUG: August 16, 2007
1
Overview
Motivation: Few RMAN sessions, & Giving Back
Experience Level: Intermediate & Beginner
Corruption Detection
Metadata Management and Reporting
The Good, The Bad, The Ugly (a sampling)
Flashback
Performance/Tuning (safety first!)
Plenty of material today; please ask if for clarity,
otherwise best to save questions to end
NoCOUG: August 16, 2007
RMAN in the Trenches, Part 2
2
Corruption Detection
Default is to stop the backup as soon as
corruption is detected
SET MAXCORRUPT for each datafile would
override that
But MAXCORRUPT should only be used when
priority is finishing rest of backup vs. repairing
corruption (seldom)
BACKUP VALIDATE will expose other
corruptions, and repair can be done
NoCOUG: August 16, 2007
RMAN in the Trenches, Part 2
3
Corruption Detection
Can use RESTORE...VALIDATE to check on
backups; this is not checking datafiles (.dbf)
From Oracle Press book: “...validation is not a
comprehensive test.“
RESTORE DATABASE looks at headers in
the level 0 backup, which is used to get
datafiles
Level 1 has changes applied on top of those
datafiles, so level 1 would not come into play
until doing RECOVER rather than RESTORE
NoCOUG: August 16, 2007
RMAN in the Trenches, Part 2
4
Corruption Detection
RECOVER...VALIDATE n/a, but can use
VALIDATE instead of RESTORE...VALIDATE
Use KEY column values from LIST BACKUP
SUMMARY
Testing shows we can examine any or all
backups, including level 1 and archivelog
backups
Alternate: RECOVER DATABASE TEST, but
docs say that the TEST clause can be used
"only if you have restored a backup taken since
the last RESETLOGS operation.“
I tried it: said system datafile in use
NoCOUG: August 16, 2007
RMAN in the Trenches, Part 2
5
Corruption Detection
The init parameters db_block_checking and
db_block_checksum will detect datafile corruptions, as
reads and writes are occurring
Similar, but not interdependent:
When block checking is on, blocks are examined for
internal consistency --always enabled for the system
tablespace, but off by default for other tablespaces.
When checksum is on, corruption caused by
underlying I/O systems can be detected. If set to
FULL, it also catches in-memory corruptions and
stops them from making it to the disk. Default is
TYPICAL, same as TRUE: 9i backward compatibility
NoCOUG: August 16, 2007
RMAN in the Trenches, Part 2
6
Corruption Detection
For strongest possible corruption protection with
RMAN backups, a White Paper
(http://www.oracle.com/technology/deploy/availability/pdf/corruption_wp.pdf)
recommends:
In the initialization parameter file, set
DB_BLOCK_CHECKSUM=TRUE (default setting;
default is TYPICAL for 10g, TRUE for
backward compatibility)
In BACKUP and RESTORE commands, do
not specify the MAXCORRUPT option, do not
specify the NOCHECKSUM option, but do
specify the CHECK LOGICAL option
NoCOUG: August 16, 2007
RMAN in the Trenches, Part 2
7
Corruption Detection
Turn on db_block_checking (for non-system
tablespaces) with LOW, MEDIUM, FULL in 10g,
with 1-10% overhead
TRUE (backward compatible from 9i) is the
same as FULL
In docs for this parameter: "You should set
DB_BLOCK_CHECKING to FULL if the
performance overhead is acceptable."
NoCOUG: August 16, 2007
RMAN in the Trenches, Part 2
8
Corruption Detection
FULL option for db_block_checksum (10g):
Extra 4-5% overhead
10.2 docs: “catches in-memory corruptions
and stops them from making it to the disk. [...].
Oracle recommends that you set
DB_BLOCK_CHECKSUM to TYPICAL.”
Steve Adams says I/O intensive queries with
moderate to high CPU use can be worse than
the estimate indicated in the docs
Testing is advisable
NoCOUG: August 16, 2007
RMAN in the Trenches, Part 2
9
Corruption Detection
DB_BLOCK_CHECKSUM FULL setting or not:
Prior job: CPU glitch discovered after months
in production, introduced corruption due to
heavy batch job use. Financial repercussion
$1M+ (1999 $$), vendor essentially gave a
top end machine to compensate
Cost for not capturing in-memory corruption
could be high.
NoCOUG: August 16, 2007
RMAN in the Trenches, Part 2
10
Metadata Basics
Data Dictionary for backup work
Always in controlfile in virtual tables (V$ views)
Optionally in separate catalog DB, comparable
info in real tables (& RC* views)
Catalog: Open ended time span, multiple DBs
NoCOUG: August 16, 2007
RMAN in the Trenches, Part 2
11
Metadata: Create New Controlfile
All previous metadata lost; asking for fresh start
How does RESYNC affect catalog?
Testing shows resync is 1-way street, controlfile
to catalog
Nice surprise: Testing shows resync does not
wipe out catalog entries in control_file_record_keep_time
period; metadata not lost in catalog DB
Catalog recommended by Oracle; less critical
with newer features
NoCOUG: August 16, 2007
RMAN in the Trenches, Part 2
12
Metadata Management
Safety: With CONFIGURE command, turn on
autobackup of the controlfile
Controlfile plus catalog: extra layer for safety
"High Availability Best Practices" section 2.5.3.2:
Run source backups in nocatalog mode to
reduce dependency on the catalog database
being available. At a later point, do a resync
Feature idea: be able to connect to two catalogs;
like mirrored disks;
alternatively, do standby of catalog DB
NoCOUG: August 16, 2007
RMAN in the Trenches, Part 2
13
Metadata Reporting: Runtime trends for disk/tape
Do crosstab from RC_BACKUP_PIECE view.
Make anything before 4PM part of previous overnight run:
CREATE OR REPLACE VIEW ucsc_bkup_trend_insert_vw
(...column aliases...) AS
SELECT
CASE WHEN to_char(p.START_TIME,'HH24') < 16
THEN trunc(p.START_TIME - 1)
ELSE trunc(p.START_TIME)
END AS bkup_date,
sdl.SERVER_NAME, d.name, p.device_type,
nvl(max(CASE WHEN backup_type = 'D‘
THEN p.ELAPSED_SECONDS END),0) AS LVL0_secs,
nvl(max(CASE
nvl(max(CASE
nvl(max(CASE
nvl(max(CASE
nvl(max(CASE
WHEN
WHEN
WHEN
WHEN
WHEN
backup_type
backup_type
backup_type
backup_type
backup_type
=
=
=
=
=
'I'
'L'
'D'
'I'
'L'
THEN
THEN
THEN
THEN
THEN
p.ELAPSED_SECONDS END),0) AS LVL1_secs,
p.ELAPSED_SECONDS END),0) AS ARCH_secs,
p.BYTES END),0) AS LVL0_bytes,
p.BYTES END),0) AS LVL1_bytes,
p.BYTES END),0) AS ARCH_bytes,
p.START_TIME, p.COMPLETION_TIME
FROM rc_backup_piece p, rc_database d,
ucsc_server_db_list sdl
[...]
NoCOUG: August 16, 2007
RMAN in the Trenches, Part 2
14
Metadata Reporting: Runtime trends for disk/tape
[crosstab from RC view...]
WHERE p.DB_KEY = d.DB_KEY
AND d.NAME = sdl.DB_NAME
AND p.backup_type in ('D','I','L')
GROUP BY
(CASE WHEN to_char(p.START_TIME,'HH24') < 16
THEN trunc(p.START_TIME - 1)
ELSE trunc(p.START_TIME)
END )
,
d.NAME, sdl.SERVER_NAME, p.DEVICE_TYPE, p.START_TIME, p.COMPLETION_TIME
ORDER BY
(CASE WHEN to_char(p.START_TIME,'HH24') < 16
THEN trunc(p.START_TIME - 1)
ELSE trunc(p.START_TIME)
END )
, sdl.SERVER_NAME, d.name, p.device_type, p.START_TIME;
Make persistent table so we have trend info beyond retention period:
create table ucsc_bkup_trend_details as
select * from ucsc_bkup_trend_insert_vw;
Do scheduled inserts so trend info is available long term.
NoCOUG: August 16, 2007
RMAN in the Trenches, Part 2
15
Metadata Reporting: Runtime trends for disk/tape
In next 4 slides, we see OPTIMIZATION ON in
effect for last several days on graphs. For
OPTIMIZATION OFF in earlier days, two factors:
1. Archives on disk for 3 days, used in transition
at our site: results in 3 copies of archive
backups. This was known/expected.
2. BACKUP BACKUPSET gives multiple tape copies,
including Level 0 each day! The repository
knows about Incremental Level, but behavior is
different from making original backupset. This
was not expected, and metadata reporting
brought out this difference.
NoCOUG: August 16, 2007
RMAN in the Trenches, Part 2
16
Metadata Reporting: Schedule Planning – Disk Time
Archive backups greatly reduced in last few days
250
200
Minutes
150
ARCH
LVL1
LVL0
100
50
0
7/20
7/21
7/22
7/23
NoCOUG: August 16, 2007
7/24
7/25
7/26
7/27
7/28
7/29
7/30
7/31
8/1
RMAN in the Trenches, Part 2
8/2
8/3
8/4
8/5
8/6
17
Metadata Reporting: Tape to Disk Size Ratio
Size greatly reduced, no extra Lvl0 copies
Ratio reduced from Max of 16:1, down to 1:1
18
16
14
Tape to Disk Ratio
12
10
8
6
4
2
0
7/21
7/22
7/23
7/24
NoCOUG: August 16, 2007
7/26
7/27
7/28
7/29
7/30
7/31
RMAN in the Trenches, Part 2
8/1
8/2
8/3
8/4
8/5
8/6
18
Metadata Reporting: Tape Runtime
Multiple tape processes from MML
Execution Time higher than Clock Time
Not cumulative (not stacked) line chart
1200
1000
Minutes
800
600
400
200
0
7/19
7/20
7/21
7/22
7/23
7/24
7/25
7/26
7/27
7/28
CLOCK
NoCOUG: August 16, 2007
7/29
7/30
7/31
8/1
8/2
8/3
8/4
8/5
8/6
EXEC
RMAN in the Trenches, Part 2
19
Metadata Reporting: Disk/Tape
Cumulative Runtime – stacked line chart
700.0
600.0
500.0
Minutes
400.0
TAPE
DISK
300.0
200.0
100.0
0.0
7/19
7/20
7/21
7/22
7/23
NoCOUG: August 16, 2007
7/24
7/25
7/26
7/27
7/28
7/29
7/30
7/31
RMAN in the Trenches, Part 2
8/1
8/2
8/3
8/4
8/5
8/6
20
Metadata Reporting: Compression Ratio
COMPRESSION_RATIO column is in 10
_summary and _details views, but these are
“primarily intended to be used internally by
Enterprise Manager.”
Before finding that caveat, I had found results to
not trust RC_BACKUP_SET_DETAILS --- good to
avoid. In following chart, (10.2.0.2 for testing),
Input Bytes is same for Level 0 and 1, so calc is
from total DB used space.
Level 1 ratio is distorted.
The BDF table in RC_BACKUP_DATAFILE view
knows about Block Change Tracking, and has
count of blocks scanned, so a better ratio could
be calculated.
NoCOUG: August 16, 2007
RMAN in the Trenches, Part 2
21
Metadata Reporting: Compression Ratio
from RC_BACKUP_SET_DETAILS
For device_type = ‘*’, MAX Ratio for Level 1 of 690 to 1
690
391
LVL1
47
7
LVL0
7
Bkup Type
6
MAX
AVG
6
MIN
6
ARCH
4
1
CTLFL
1
1
0
100
200
300
400
500
600
700
800
Ratio
NoCOUG: August 16, 2007
RMAN in the Trenches, Part 2
22
The Good, The Bad, The Ugly (a sampling)
Gradual improvements in each release:
e.g. binary compression in 10g
I requested a couple: separate retention
periods for disk/tape, ability to display
connection information at the RMAN prompt
Corruption detection during block scanning
History in the RMAN catalog, e.g. disk and
tape/MML runtimes for planning purposes
NoCOUG: August 16, 2007
RMAN in the Trenches, Part 2
23
The Good, The Bad, The Ugly (a sampling)
Unavoidable: When testing, we can't modify
metadata in the controlfile to alter behavior for
our purposes, e.g. the shortest retention window
is 1 day (use ALTER SYSTEM SET FIXED_DATE )
No command editing or buffer display in RMAN
comparable to sqlplus; LIST command without
summary clause can be copious, can be off the
terminal buffer, so cmd not retrievable
NLS_DATE_FORMAT variable must be set
before starting RMAN (no SET cmd)
NoCOUG: August 16, 2007
RMAN in the Trenches, Part 2
24
The Good, The Bad, The Ugly (a sampling)
RMAN is tied into SQL engine, but no SELECT;
For catalog query (sometimes better than LIST),
need separate sqlplus session
RMAN will make a new backup file, but for
backups in separate directories based on
database name (%d in the format), it won't make
a new directory for us; causes backup failure
NoCOUG: August 16, 2007
RMAN in the Trenches, Part 2
25
The Good, The Bad, The Ugly (a sampling)
Can't reconnect a different way after command
requiring repository connection (e.g. BACKUP),
must exit and start over
GLOBAL stored script in 10g is a step forward:
But no variables and language
PL/SQL is inherently in engine, e.g. could
execute correct RMAN syntax based on query
to determine DB version, allow generic script
NoCOUG: August 16, 2007
RMAN in the Trenches, Part 2
26
Cover Your Sixes ...
NoCOUG: August 16, 2007
RMAN in the Trenches, Part 2
27
...so you don’t get caught by surprise!
NoCOUG: August 16, 2007
RMAN in the Trenches, Part 2
28
Cover Your Sixes
In 10.2 docs for ALLOCATE CHANNEL: "You
must use a recovery catalog when backing up a
standby database." -- another benefit of catalog
"When using Flashback Database with a target
time at which a NOLOGGING operation was in
progress, block corruption is likely in the
database objects and datafiles affected by the
NOLOGGING operation."
NoCOUG: August 16, 2007
RMAN in the Trenches, Part 2
29
Cover Your Sixes: Syntax
Syntax can be similar with different meanings:
# We're doing a 'normal' backup here, not an image copy:
RMAN> backup as backupset ...;
# The backupset that was created before is copied to another destination:
RMAN> backup backupset ...;
------------------------------------------------# These two will deal with all types: controlfile, datafile, and archivelog
RMAN> CROSSCHECK BACKUP;
RMAN> LIST EXPIRED BACKUP;
------------------------------------------------# These two affect archivelogs, not backups of archivelogs
RMAN> CROSSCHECK ARCHIVELOG ALL;
RMAN> LIST EXPIRED ARCHIVELOG;
NoCOUG: August 16, 2007
RMAN in the Trenches, Part 2
30
Cover Your Sixes: tape
The "BACKUP BACKUPSET" command did not
pick up format from CONFIGURE, it used the
default of “%U”, not what I specified for tape:
CONFIGURE CHANNEL DEVICE TYPE 'SBT_TAPE' FORMAT '%d_%T_%U' SEND [...]
But using the format in ALLOCATE CHANNEL in
the script was successful.
Docs say default of “%U” is unique, but it gave
us occasional duplicate tape file names.
Virtual Tape Library -- "BACKUP BACKUPSET"
command can not copy from tape to tape.
We will still want VTL as secondary storage, not
as a replacement for our disk backup area.
NoCOUG: August 16, 2007
RMAN in the Trenches, Part 2
31
Flashback
9i was logical only, using Undo
10g Flashback Database is physical, using
Flashback Logs; rewind DB faster than PIT
recovery
When to use? Business can not lose
transactions for a number of hours!!
NoCOUG: August 16, 2007
RMAN in the Trenches, Part 2
32
Flashback
Scenarios for Flashback Database:
“...save the SCN to a spool file, for example, before
running a high-risk batch job.“
“Easy conversion of a physical standby database to a
reporting database and back to a standby. [...] reverse
the activation of a standby database.”
Test/Dev DB: known starting point for tests
Standby can be reverted to an earlier time, which
could allow examination and manipulation in two
different time periods. This would allow recovery of
corrupted objects.
NoCOUG: August 16, 2007
RMAN in the Trenches, Part 2
33
Flashback
Flashback Recovery Area, many file types
recommended: redo, archive, ctlfile, backupsets
Potential for more disk contention when all in
one area
Example: RAID10 for DB plus redo, archive,
ctlfile now; RAID5 for backupsets (write penalty
not significant enough for off hours batch job);
vendor app generated 25GB in a half hour for
temp tables in reporting(!)
Do FRA for a reason, not on a whim
NoCOUG: August 16, 2007
RMAN in the Trenches, Part 2
34
Tuning/Performance
Minimize what needs to be read and written:
10.2.0.2 can skip empty blocks in addition to
unused blocks
10g can do binary compression to disk, much
less quantity that needs to be taken to tape
Concern about backups adversely affecting
online use?
RATE clause can limit disk I/O
Look at max bytes/sec of disk system, e.g. for
10M max possible, RATE of 5M would allow
1/2 of disk capability for non-backup purposes
NoCOUG: August 16, 2007
RMAN in the Trenches, Part 2
35
Tuning/Performance
For sites using a physical standby, docs say that
turning off db_block_checking in a physical
standby "can provide as much as a twofold
increase in the apply rate" for redo logs, but
db_block_checking should not be turned off at
the primary database
Metalink Note 311068.1 -- RMAN Performance
Tuning Diagnostics
NoCOUG: August 16, 2007
RMAN in the Trenches, Part 2
36
Tuning/Performance
References:
http://www.oracle.com/technology/deploy/availability/pdf/rman_performance_wp.pdf
http://www.oracle.com/technology/deploy/availability/pdf/br_optimization.pdf
Chapter in 10.2 RMAN Docs -- Advanced User's Guide:
http://downloadwest.oracle.com/docs/cd/B19306_01/backup.102/b14191/rcmtunin.htm#sthref1057
NoCOUG: August 16, 2007
RMAN in the Trenches, Part 2
37
A&Q
Acknowledgements:
Timothy Chien, Oracle Product Mgr.
Bill Wagman, UC Davis
Presenter: Philip Rice price [at] ucsc.edu
A&Q
Answers: Wisdom to share?
Questions?
NoCOUG: August 16, 2007
RMAN in the Trenches, Part 2
38