RMAN in the Trenches: Part II

Part 2 -- RMAN in the Trenches:
To Go Forward, We Must Backup
Philip Rice
Univ. of California Santa Cruz
NoCOUG: August 16, 2007
Overview

• Motivation: few RMAN sessions, and giving back
• Experience level: intermediate and beginner
• Corruption Detection
• Metadata Management and Reporting
• The Good, The Bad, The Ugly (a sampling)
• Flashback
• Performance/Tuning (safety first!)
• Plenty of material today; please ask if something needs clarifying, otherwise it is best to save questions for the end
Corruption Detection

• Default is to stop the backup as soon as corruption is detected
• SET MAXCORRUPT for each datafile would override that
• But MAXCORRUPT should only be used when the priority is finishing the rest of the backup rather than repairing the corruption (seldom)
• BACKUP VALIDATE will expose other corruptions, and repair can be done
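As a hedged sketch (not from the original slides): after a validating backup has run in RMAN, blocks it flagged as corrupt can be listed from the V$DATABASE_BLOCK_CORRUPTION view in a separate sqlplus session on the target database. The RMAN command is shown only as a comment because it is not SQL.

```sql
-- Run first in RMAN (comment only, not SQL):
--   RMAN> BACKUP VALIDATE CHECK LOGICAL DATABASE;
-- Then, in sqlplus on the target database, list the blocks flagged as corrupt.
SELECT file#,
       block#,
       blocks,               -- number of contiguous corrupt blocks
       corruption_change#,
       corruption_type
FROM   v$database_block_corruption
ORDER  BY file#, block#;
```

No rows returned means the validation recorded no corrupt blocks; any rows give the file and block ranges to target for repair.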
Corruption Detection

• Can use RESTORE ... VALIDATE to check on backups; this is not checking the datafiles (.dbf)
• From the Oracle Press book: "...validation is not a comprehensive test."
• RESTORE DATABASE looks at headers in the level 0 backup, which is used to get datafiles
• Level 1 has changes applied on top of those datafiles, so level 1 would not come into play until doing RECOVER rather than RESTORE
Corruption Detection

• RECOVER ... VALIDATE is not available, but VALIDATE can be used instead of RESTORE ... VALIDATE
• Use KEY column values from LIST BACKUP SUMMARY
• Testing shows we can examine any or all backups, including level 1 and archivelog backups
• Alternative: RECOVER DATABASE TEST, but docs say that the TEST clause can be used "only if you have restored a backup taken since the last RESETLOGS operation." I tried it: it said the system datafile was in use
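A minimal sketch, assuming a recovery catalog is in use, of finding backup set keys with a query instead of LIST BACKUP SUMMARY. In catalog mode, the key that LIST shows corresponds to BS_KEY in RC_BACKUP_SET. The database name 'PROD' and the key in the comment are hypothetical.

```sql
-- Run in sqlplus, connected to the recovery catalog schema.
SELECT bs.bs_key,
       bs.backup_type,        -- 'D' = full/level 0, 'I' = incremental, 'L' = archivelog
       bs.incremental_level,
       bs.completion_time,
       bs.status
FROM   rc_backup_set bs,
       rc_database   d
WHERE  bs.db_key = d.db_key
AND    d.name    = 'PROD'     -- hypothetical database name
ORDER  BY bs.completion_time;

-- A key from the result can then be checked in RMAN (comment only, not SQL):
--   RMAN> VALIDATE BACKUPSET 1234;   -- 1234 is a placeholder key
```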
Corruption Detection


• The init parameters db_block_checking and db_block_checksum will detect datafile corruptions as reads and writes are occurring
• Similar, but not interdependent:
  - When block checking is on, blocks are examined for internal consistency; it is always enabled for the system tablespace, but off by default for other tablespaces.
  - When checksum is on, corruption caused by underlying I/O systems can be detected. If set to FULL, it also catches in-memory corruptions and stops them from making it to the disk. Default is TYPICAL, which is the same as TRUE (9i backward compatibility)
Corruption Detection

For strongest possible corruption protection with RMAN backups, a White Paper
(http://www.oracle.com/technology/deploy/availability/pdf/corruption_wp.pdf)
recommends:
• In the initialization parameter file, set DB_BLOCK_CHECKSUM=TRUE (the default setting; in 10g the default is TYPICAL, and TRUE is kept for backward compatibility)
• In BACKUP and RESTORE commands, do not specify the MAXCORRUPT option, do not specify the NOCHECKSUM option, but do specify the CHECK LOGICAL option
Corruption Detection

• Turn on db_block_checking (for non-system tablespaces) with LOW, MEDIUM, or FULL in 10g, with 1-10% overhead
• TRUE (backward compatible from 9i) is the same as FULL
• In docs for this parameter: "You should set DB_BLOCK_CHECKING to FULL if the performance overhead is acceptable."
Corruption Detection

FULL option for db_block_checksum (10g):
• Extra 4-5% overhead
• 10.2 docs: "catches in-memory corruptions and stops them from making it to the disk. [...]. Oracle recommends that you set DB_BLOCK_CHECKSUM to TYPICAL."
• Steve Adams says I/O-intensive queries with moderate to high CPU use can be worse than the estimate indicated in the docs
• Testing is advisable
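A minimal sketch of how the two parameters might be inspected and changed while testing overhead, assuming a 10.2 test instance and a privileged sqlplus session. The values chosen here are illustrative, not a recommendation from the slides.

```sql
-- Show the current settings of both parameters.
SELECT name, value
FROM   v$parameter
WHERE  name IN ('db_block_checking', 'db_block_checksum');

-- On a test instance, switch to stricter settings and measure the workload.
ALTER SYSTEM SET db_block_checksum = FULL;
ALTER SYSTEM SET db_block_checking = MEDIUM;

-- If the measured overhead is too high, fall back to the checksum default.
ALTER SYSTEM SET db_block_checksum = TYPICAL;
```

Both parameters can be changed with ALTER SYSTEM without a restart, which makes before-and-after timing of a representative batch job straightforward.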
Corruption Detection

DB_BLOCK_CHECKSUM FULL setting or not:
• Prior job: a CPU glitch, discovered after months in production, introduced corruption under heavy batch job use. Financial repercussion was $1M+ (1999 $$); the vendor essentially gave a top-end machine to compensate
• The cost of not capturing in-memory corruption could be high.
Metadata Basics

Data Dictionary for backup work:
• Always in the controlfile, in virtual tables (V$ views)
• Optionally in a separate catalog DB, with comparable info in real tables (and RC* views)
• Catalog: open-ended time span, multiple DBs
Metadata: Create New Controlfile





• All previous metadata is lost; we are asking for a fresh start
• How does RESYNC affect the catalog?
• Testing shows resync is a one-way street, controlfile to catalog
• Nice surprise: testing shows resync does not wipe out catalog entries within the control_file_record_keep_time period; metadata is not lost in the catalog DB
• Catalog is recommended by Oracle; it is less critical with newer features
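For reference, a small sketch of checking how long the controlfile retains its own backup records, which is the window the test above refers to. The value 14 is only an illustration.

```sql
-- Number of days that reusable controlfile records are kept before reuse.
SELECT value AS keep_days
FROM   v$parameter
WHERE  name = 'control_file_record_keep_time';

-- Illustrative change: keep records for 14 days (choose a value that
-- covers at least the interval between catalog resyncs).
ALTER SYSTEM SET control_file_record_keep_time = 14;
```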
Metadata Management

• Safety: with the CONFIGURE command, turn on autobackup of the controlfile
• Controlfile plus catalog: extra layer for safety
• "High Availability Best Practices" section 2.5.3.2: Run source backups in nocatalog mode to reduce dependency on the catalog database being available. At a later point, do a resync
• Feature idea: be able to connect to two catalogs, like mirrored disks; alternatively, do a standby of the catalog DB
Metadata Reporting: Runtime trends for disk/tape
Do crosstab from RC_BACKUP_PIECE view.
Make anything before 4PM part of previous overnight run:
CREATE OR REPLACE VIEW ucsc_bkup_trend_insert_vw
(...column aliases...) AS
SELECT
CASE WHEN to_char(p.START_TIME,'HH24') < 16
THEN trunc(p.START_TIME - 1)
ELSE trunc(p.START_TIME)
END AS bkup_date,
sdl.SERVER_NAME, d.name, p.device_type,
nvl(max(CASE WHEN backup_type = 'D' THEN p.ELAPSED_SECONDS END),0) AS LVL0_secs,
nvl(max(CASE WHEN backup_type = 'I' THEN p.ELAPSED_SECONDS END),0) AS LVL1_secs,
nvl(max(CASE WHEN backup_type = 'L' THEN p.ELAPSED_SECONDS END),0) AS ARCH_secs,
nvl(max(CASE WHEN backup_type = 'D' THEN p.BYTES END),0) AS LVL0_bytes,
nvl(max(CASE WHEN backup_type = 'I' THEN p.BYTES END),0) AS LVL1_bytes,
nvl(max(CASE WHEN backup_type = 'L' THEN p.BYTES END),0) AS ARCH_bytes,
p.START_TIME, p.COMPLETION_TIME
FROM rc_backup_piece p, rc_database d,
ucsc_server_db_list sdl
[...]
Metadata Reporting: Runtime trends for disk/tape
[crosstab from RC view...]
WHERE p.DB_KEY = d.DB_KEY
AND d.NAME = sdl.DB_NAME
AND p.backup_type in ('D','I','L')
GROUP BY
(CASE WHEN to_char(p.START_TIME,'HH24') < 16
THEN trunc(p.START_TIME - 1)
ELSE trunc(p.START_TIME)
END )
,
d.NAME, sdl.SERVER_NAME, p.DEVICE_TYPE, p.START_TIME, p.COMPLETION_TIME
ORDER BY
(CASE WHEN to_char(p.START_TIME,'HH24') < 16
THEN trunc(p.START_TIME - 1)
ELSE trunc(p.START_TIME)
END )
, sdl.SERVER_NAME, d.name, p.device_type, p.START_TIME;
Make persistent table so we have trend info beyond retention period:
create table ucsc_bkup_trend_details as
select * from ucsc_bkup_trend_insert_vw;
Do scheduled inserts so trend info is available long term.
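One possible shape for the scheduled insert is sketched below. Because the view's column aliases are elided in the slides, the column name start_time is an assumption, as is the idea of using the latest stored start time as a high-water mark.

```sql
-- Append only rows newer than what the trend table already holds.
-- ASSUMPTION: the view and table expose a column named start_time.
INSERT INTO ucsc_bkup_trend_details
SELECT v.*
FROM   ucsc_bkup_trend_insert_vw v
WHERE  v.start_time > (SELECT NVL(MAX(t.start_time), DATE '1900-01-01')
                       FROM   ucsc_bkup_trend_details t);

COMMIT;
```

A single high-water mark can miss rows from a database whose catalog resync arrives late, so with several databases in one catalog a per-database comparison may be safer. The statement can be run from any scheduler, as long as it runs more often than the backup retention period.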
Metadata Reporting: Runtime trends for disk/tape

In the next 4 slides, the graphs show OPTIMIZATION ON in effect for the last several days. For OPTIMIZATION OFF in the earlier days, there are two factors:
1. Archives stay on disk for 3 days, used in transition at our site; this results in 3 copies of archive backups. This was known/expected.
2. BACKUP BACKUPSET gives multiple tape copies, including Level 0 each day! The repository knows about Incremental Level, but the behavior is different from making the original backupset. This was not expected, and metadata reporting brought out this difference.
Metadata Reporting: Schedule Planning – Disk Time
Archive backups greatly reduced in last few days
[Chart: disk backup time in minutes per day, 7/20 to 8/6; series: ARCH, LVL1, LVL0]
Metadata Reporting: Tape to Disk Size Ratio
Size greatly reduced, no extra Lvl0 copies
Ratio reduced from Max of 16:1, down to 1:1
[Chart: tape to disk size ratio per day, 7/21 to 8/6]
Metadata Reporting: Tape Runtime
Multiple tape processes from MML
Execution Time higher than Clock Time
Not cumulative (not stacked) line chart
[Chart: tape runtime in minutes per day, 7/19 to 8/6; series: CLOCK, EXEC]
Metadata Reporting: Disk/Tape
Cumulative Runtime – stacked line chart
[Chart: cumulative runtime in minutes per day, 7/19 to 8/6; series: TAPE, DISK]
Metadata Reporting: Compression Ratio

• COMPRESSION_RATIO column is in 10 _summary and _details views, but these are "primarily intended to be used internally by Enterprise Manager."
• Before finding that caveat, I had found results that led me not to trust RC_BACKUP_SET_DETAILS; good to avoid. In the following chart (10.2.0.2 for testing), Input Bytes is the same for Level 0 and 1, so the calculation is from total DB used space. The Level 1 ratio is distorted.
• The BDF table in the RC_BACKUP_DATAFILE view knows about Block Change Tracking and has a count of blocks scanned, so a better ratio could be calculated.
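A rough sketch of such a calculation, not taken from the slides. It assumes that BLOCKS_READ in RC_BACKUP_DATAFILE is the "count of blocks scanned" mentioned above and that BLOCKS is the number of blocks written to the backup; both interpretations should be checked against the catalog views reference for your release.

```sql
-- Blocks scanned versus blocks written, per database and incremental level.
-- ASSUMPTION: blocks_read = blocks scanned, blocks = blocks written to backup.
SELECT db_name,
       incremental_level,
       used_change_tracking,
       SUM(blocks_read) AS blocks_scanned,
       SUM(blocks)      AS blocks_written,
       ROUND(SUM(blocks_read) / NULLIF(SUM(blocks), 0), 1) AS scan_to_write_ratio
FROM   rc_backup_datafile
GROUP  BY db_name, incremental_level, used_change_tracking
ORDER  BY db_name, incremental_level;
```

NULLIF guards against division by zero when a backup wrote no blocks. With Block Change Tracking in use, the scanned count for level 1 should be far smaller than the datafile size, which avoids the distortion described above.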
Metadata Reporting: Compression Ratio
from RC_BACKUP_SET_DETAILS
For device_type = ‘*’, MAX Ratio for Level 1 of 690 to 1
[Bar chart: compression ratio by backup type]
Bkup Type   MAX   AVG   MIN
LVL1        690   391    47
LVL0          7     7     6
ARCH          6     6     4
CTLFL         1     1     1
The Good, The Bad, The Ugly (a sampling)

Gradual improvements in each release:
• e.g. binary compression in 10g
• I requested a couple: separate retention periods for disk/tape, and the ability to display connection information at the RMAN prompt
• Corruption detection during block scanning
• History in the RMAN catalog, e.g. disk and tape/MML runtimes for planning purposes
The Good, The Bad, The Ugly (a sampling)

• Unavoidable: when testing, we can't modify metadata in the controlfile to alter behavior for our purposes, e.g. the shortest retention window is 1 day (workaround: use ALTER SYSTEM SET FIXED_DATE)
• No command editing or buffer display in RMAN comparable to sqlplus; LIST output without the SUMMARY clause can be copious and can scroll off the terminal buffer, so the command is not retrievable
• NLS_DATE_FORMAT variable must be set before starting RMAN (no SET command)
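A minimal sketch of the FIXED_DATE workaround on a test instance only. The date is illustrative; the idea is to move the database's notion of "now" forward so that backups age past the retention window without waiting.

```sql
-- TEST INSTANCE ONLY: freeze SYSDATE at an illustrative future date.
ALTER SYSTEM SET fixed_date = '2007-08-20-00:00:00';

-- Confirm what the database now reports as the current date.
SELECT TO_CHAR(SYSDATE, 'YYYY-MM-DD HH24:MI:SS') AS db_now FROM dual;

-- ... run the RMAN retention tests here ...

-- Return to the real clock when finished.
ALTER SYSTEM SET fixed_date = NONE;
```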
The Good, The Bad, The Ugly (a sampling)

• RMAN is tied into the SQL engine, but has no SELECT; for a catalog query (sometimes better than LIST), a separate sqlplus session is needed
• RMAN will make a new backup file, but for backups in separate directories based on database name (%d in the format), it won't make a new directory for us; this causes backup failure
The Good, The Bad, The Ugly (a sampling)

• Can't reconnect a different way after a command requiring a repository connection (e.g. BACKUP); must exit and start over
• GLOBAL stored script in 10g is a step forward:
  - But there are no variables and no scripting language
  - PL/SQL is inherently in the engine, e.g. it could execute the correct RMAN syntax based on a query to determine the DB version, allowing a generic script
Cover Your Sixes ...
...so you don’t get caught by surprise!
Cover Your Sixes

• In 10.2 docs for ALLOCATE CHANNEL: "You must use a recovery catalog when backing up a standby database." This is another benefit of the catalog
• "When using Flashback Database with a target time at which a NOLOGGING operation was in progress, block corruption is likely in the database objects and datafiles affected by the NOLOGGING operation."
Cover Your Sixes: Syntax

Syntax can be similar with different meanings:
# We're doing a 'normal' backup here, not an image copy:
RMAN> backup as backupset ...;
# The backupset that was created before is copied to another destination:
RMAN> backup backupset ...;
------------------------------------------------
# These two will deal with all types: controlfile, datafile, and archivelog
RMAN> CROSSCHECK BACKUP;
RMAN> LIST EXPIRED BACKUP;
------------------------------------------------
# These two affect archivelogs, not backups of archivelogs
RMAN> CROSSCHECK ARCHIVELOG ALL;
RMAN> LIST EXPIRED ARCHIVELOG ALL;
Cover Your Sixes: tape

• The "BACKUP BACKUPSET" command did not pick up the format from CONFIGURE; it used the default of "%U", not what I specified for tape:
  CONFIGURE CHANNEL DEVICE TYPE 'SBT_TAPE' FORMAT '%d_%T_%U' SEND [...]
  But using the format in ALLOCATE CHANNEL in the script was successful.
• Docs say the default of "%U" is unique, but it gave us occasional duplicate tape file names.
• Virtual Tape Library: the "BACKUP BACKUPSET" command cannot copy from tape to tape. We will still want VTL as secondary storage, not as a replacement for our disk backup area.
Flashback

• 9i was logical only, using Undo
• 10g Flashback Database is physical, using Flashback Logs; it rewinds the DB faster than PIT recovery
• When to use? Business cannot lose transactions for a number of hours!!
Flashback

Scenarios for Flashback Database:
• "...save the SCN to a spool file, for example, before running a high-risk batch job."
• "Easy conversion of a physical standby database to a reporting database and back to a standby. [...] reverse the activation of a standby database."
• Test/Dev DB: known starting point for tests
• Standby can be reverted to an earlier time, which could allow examination and manipulation in two different time periods. This would allow recovery of corrupted objects.
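The first scenario can be sketched as follows, assuming Flashback Database is already enabled and the session has SYSDBA privileges. The SCN value is a placeholder for whatever the first query returned.

```sql
-- Before the high-risk batch job: record the current SCN.
SELECT current_scn FROM v$database;

-- If the job goes wrong: with the database mounted but not open
-- (shut down and started in MOUNT mode), rewind to the recorded SCN.
FLASHBACK DATABASE TO SCN 1234567;   -- placeholder SCN from the query above

-- Open the database; RESETLOGS is required after a flashback.
ALTER DATABASE OPEN RESETLOGS;
```

Everything committed after the recorded SCN is discarded by the flashback, so this fits cases where the batch job is the only significant activity in that window.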
Flashback

• Flash Recovery Area (FRA): many file types recommended: redo, archive, ctlfile, backupsets
• Potential for more disk contention when all are in one area
• Example: RAID10 for DB plus redo, archive, ctlfile now; RAID5 for backupsets (write penalty not significant enough for an off-hours batch job); a vendor app generated 25GB in a half hour for temp tables in reporting(!)
• Do FRA for a reason, not on a whim
Tuning/Performance

• Minimize what needs to be read and written:
  - 10.2.0.2 can skip empty blocks in addition to unused blocks
  - 10g can do binary compression to disk, so much less needs to be taken to tape
• Concern about backups adversely affecting online use?
  - RATE clause can limit disk I/O
  - Look at max bytes/sec of the disk system, e.g. for 10M max possible, a RATE of 5M would allow 1/2 of disk capability for non-backup purposes
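To pick a RATE value, it helps to see what throughput backups actually achieve. A hedged sketch, assuming asynchronous I/O is in use (otherwise V$BACKUP_SYNC_IO would be the view to check), queried on the target database during or shortly after a backup:

```sql
-- Observed throughput per backup input/output file, in MB and MB/sec.
SELECT device_type,
       type,                 -- INPUT or OUTPUT (AGGREGATE rows are excluded)
       status,
       filename,
       ROUND(bytes / 1024 / 1024)                         AS mb,
       ROUND(effective_bytes_per_second / 1024 / 1024, 1) AS mb_per_sec
FROM   v$backup_async_io
WHERE  type <> 'AGGREGATE'
ORDER  BY open_time;

-- The limit itself is set per channel in RMAN (comment only, not SQL):
--   RMAN> ALLOCATE CHANNEL d1 DEVICE TYPE DISK RATE 5M;
```

RATE applies to each channel separately, so with several channels the combined limit is the per-channel value multiplied by the channel count.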
Tuning/Performance

• For sites using a physical standby, docs say that turning off db_block_checking in a physical standby "can provide as much as a twofold increase in the apply rate" for redo logs, but db_block_checking should not be turned off at the primary database
• Metalink Note 311068.1: RMAN Performance Tuning Diagnostics
Tuning/Performance

References:
http://www.oracle.com/technology/deploy/availability/pdf/rman_performance_wp.pdf
http://www.oracle.com/technology/deploy/availability/pdf/br_optimization.pdf
Chapter in 10.2 RMAN Docs -- Advanced User's Guide:
http://downloadwest.oracle.com/docs/cd/B19306_01/backup.102/b14191/rcmtunin.htm#sthref1057
A&Q



Acknowledgements:
Timothy Chien, Oracle Product Mgr.
Bill Wagman, UC Davis
Presenter: Philip Rice price [at] ucsc.edu
A&Q
Answers: Wisdom to share?
Questions?