Nominet template

Transcript Nominet template

Adventures in Dataguard
Dr. Jason Arneil
Motivation
Why Dataguard
AGENDA
• Introduction
• The Motivation
• Dataguard Architecture & Features
• Creating a Physical Standby
• Maintaining your standby
• Using your Standby
• Performing a Switchover
Introduction
Health Warning
Introduction
About Me
• Jason Arneil
• System Administrator/DBA
• Using Oracle since 1998
• At Nominet since 2001
Introduction
About Nominet
• Nominet is the internet registry for .uk domain names
• Nominet has been in existence for over 11 years
• Nominet is run as a not-for-profit company
• Nominet is owned by its members
• There are over 6 Million .uk domain names
Motivation
Why Dataguard
• Big push on a Nominet Business Continuity Plan
• Dataguard is the Oracle solution for disaster recovery
• Physical Standby was the obvious option
• Maximum Availability Architecture (MAA)
Motivation
Business Continuity Site
Architecture & Features
Dataguard Processes
Oracle Net
Transactions
Physical/Logical
Standby
Database
LGWR
LNS
Primary
Database
RFS
MRP/ LSP
Transform Redo
to SQL for
SQL Apply
Online Redo Logs
ARCH
Archived Redo Logs
FAL
Standby
Redo
Logs
Backup /
Reports
ARCH
Archived Redo Logs
Architecture & Features
Dataguard Features
• Several Protection Modes
– Maximum Protection
– Maximum Availability
– Maximum Performance
• Several Transport Modes
– LGWR SYNC
– LGWR ASYNC
– ARCH
Creating a Standby
Prepare Primary & Standby
• Prepare Primary Database
– Enable Force Logging
SQL> alter database force logging;
– Modify initialization parameters
• Prepare Standby Database
– Setup directory structure
– Create spfile with correct parameters
– Start database in nomount
Creating a Standby
Log Transport Parameters
• LOG_ARCHIVE_CONFIG='DG_CONFIG=(PRIMARY, STANDBY)'
• LOG_ARCHIVE_DEST_1='LOCATION=/var/oracle/PRIMARY/arch'
• LOG_ARCHIVE_DEST_2='SERVICE=PRIMARC
DB_UNIQUE_NAME=PRIMARY'
• LOG_ARCHIVE_DEST_3='SERVICE=STANDBY LGWR ASYNC
REOPEN=15 MAX_FAILURE=10 OPTIONAL
VALID_FOR=(ONLINE_LOGFILES,PRIMARY_ROLE)
DB_UNIQUE_NAME=STANDBY'
Creating a Standby
ssh tunnels
• You may not wish your redo data being sent unencrypted across
the internet to your standby. You can use ssh tunnels to avoid this
– ssh -N -L 3333:standby:1521 oracle@standby
• Now the tnsnames entry points to the localhost
STANDBYARC =
(DESCRIPTION =
(SDU = 32767)
(ADDRESS_LIST =
(ADDRESS = (PROTOCOL = TCP)(HOST = localhost)(PORT=3333)))
(CONNECT_DATA =
(SERVICE_NAME = STANDBY)))
Creating a Standby
Some Other Parameters
• FAL_SERVER
• FAL_CLIENT
• ARCHIVE_LAG_TARGET
• STANDBY_FILE_MANAGEMENT
• DB_FILE_NAME_CONVERT
• LOG_FILE_NAME_CONVERT
Creating a Standby
backup your primary
• Backup primary - rman is good
– rman> backup format '/backup/%U' database plus archivelog;
– rman> backup format '/backup/%U' current controlfile for
standby;
• Recover backup on standby node
–
I like using rman duplicate to create standby:
• (oracle$) rman target sys/password@PRIMARY auxiliary /
• rman> duplicate target database for standby;
Creating a Standby
Start applying redo
• Create standby redo log files on both primary and standby:
– sql> alter database add standby logfile thread 2 group 42
(’PATH_TO_DATA/standbyredo01.log') size 512M;
• Now you can start the physical standby recovering logs:
– sql>alter database recover managed standby database disconnect
from session;
• Or if you prefer real time apply:
– sql>alter database recover managed standby database using
current logfile disconnect from session;
Maintaining your standby
Monitoring the Standby
• You have to ensure your standby is keeping up with your primary
• You can check which was the last log to have been applied to your
standby is
– sql> SELECT MAX(SEQUENCE#), THREAD#
FROM V$ARCHIVED_LOG
where APPLIED='YES'
GROUP BY THREAD#;
MAX(SEQUENCE#)
--------------
THREAD#
----------
2976
1
1888
2
Maintaining your standby
Monitoring Standby Progress
•
A good way of checking what the background processes of your
standby are up to is using v$managed_standby
– SQL>
select process, sequence#, status
from V$managed_standby;
PROCESS
SEQUENCE# STATUS
--------
----------
------------
ARCH
2967
CLOSING
ARCH
2974
CLOSING
RFS
2977
IDLE
MRP0
1889
APPLYING_LOG
RFS
1889
IDLE
RFS
2977
IDLE
Maintaining your standby
Monitoring Your Standby
• You have to ensure your standby is keeping up with your primary
• V$DATAGUARD_STATS provides useful information
– SQL> select name, value from v$dataguard_stats;
NAME
VALUE
-------------------------------- -----------------------------------apply finish time
+00 00:00:00
apply lag
+00 00:00:11
estimated startup time
41
standby has been open
N
transport lag
+00 00:00:03
Maintaining your standby
Monitoring Your Standby
• A way of finding out what has been happening to your standby
over a period time is to look at the v$dataguard_status view
– Log Apply Services
01-AUG-07 Media Recovery Waiting for
thread 1 sequence 2977 (in transit)
– Log Apply Services
01-AUG-07 Media Recovery Waiting for thread
1 sequence 2977 (in transit)
– Log Apply Services
01-AUG-07 Media Recovery Waiting for thread
2 sequence 1889 (in transit)
– Remote File Server
PERFORMANCE mode
01-AUG-07 Primary database is in MAXIMUM
– Remote File Server
01-AUG-07 RFS[53]: Successfully opened
standby log 14: '+DATA2/standby/standbyredo02.log'
Maintaining your standby
Oracle can’t divide by 0
• Standby was happily working away
– ORA-07445: exception encountered: core dump [kcrarmb()+152]
[SIGFPE] [Integer divide by zero] [0x00085C300
• MRP process crashes
– No redo gets applied from this point
• Logs after the one that caused the ORA-07445 still being shipped
• A simple restart of the managed recovery process does a FAL and
the standby is back up-to-date
Maintaining your standby
kcrfr_resize2
• Lots of problems after upgrade to 10.2.0.3
– Recovery of Online Redo Log: Thread 2 Group 23 Seq 999 Reading
mem 0
Mem# 0: +DATA3/standby/standbyredo11.log
ORA-00600: internal error code, arguments: [kcrfr_resize2],
[652614828032], [268423168], [], [], [], [], []
• Perhaps caused by the following:
– Bug 3306010 OERI[kcrfr_resize2] possible in MEDIA recovery
Media recovery may fail with ORA-600 [kcrfr_resize2] when
the number of redo strands is set to a high value using
log_parallelism.
Maintaining your standby
kcrfr_resize2
• This issue has recently been published as Note:453259.1
– Triggered by having a large log_buffer
• This bug affects 10.2.0.3 and potentially 9.2.0.8
• It is related to the size of the log_buffer parameter
• Fix is included in 10.2.0.4
Maintaining your standby
kcrrupirfs
•
ARC processes died on primary:
ORA-00600: [kcrrupirfs.20] [4] [368]
•
Trace file showed the following:
Corrupt redo block 479421 detected: bad block number
Flag: 0x0 Format: 0x0 Block: 0x00000000 Seq: 0x00000000 Beg:
0x0 Cks:0x0 <<<<<<<------ Dump of Corrupt Redo Buffer ----00000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000
000000000000000000000000
Maintaining your standby
kcrrupirfs
•
Oracle think initially think this ORA-600 error was hardware related
– There are NO indications of any hardware fault - the primary keeps
running
•
After a couple of weeks it was decided this was a “bug situation”
– This was bug 4767278 which talked about FAL not being able to
read from multiple mirror sides when encountering invalid/stale
redo in a file. Apparently required for ASM configurations because
ASM does not guarantee all mirror sides contain same data after
writing.
– We were using ASM, but external redundancy
– Oracle then said “The ASM group is not 100% sure if the patch
4767278 will fix the problem”
Maintaining your standby
log corruption
• The Managed Recovery process crashed complaining about log
corruption
MRP0: Background Media Recovery terminated with error 355
ORA-00355: change numbers out of order
ORA-00353: log corruption near block 2 change 1273622545 time
03/06/2007 08:32:46
ORA-00312: online log 13 thread 1:
'+DATA2/standby/standbyredo01.log'
•
Oracle blame the upgrade process at first. They suggest rebuilding the
standby
•
Then I notice that trying managed recovery rather than real time
apply seems to allow the standby to progress
Maintaining your standby
log corruption
• At this point Oracle say “it looks like a bug”
• Lots of time spent diagnosing the issue
– ALTER SYSTEM DUMP LOGFILE '+DATA2/nom/standby33.log' scn
min 865465290 scn max 865465300;
• Eventually Oracle produced a patch 5746174
– MRP HANGS WITH ASYNC LNS AND PARALLEL ARCHIVAL
Using Your Standby
Utilize those cpu cycles
• A Standby can be considered an insurance policy
• Several ways to utilize your standby
– Run your backups from your standby
– Open your standby read only for reporting
– Flashback standby to look at old data
– Open your standby read write for testing purposes
Using Your Standby
Open for Reports
• You need to cancel managed recovery
– sql> alter database recover managed standby database cancel;
• Then simply open the standby
– sql> alter database open;
• Redo is still transported to your standby
• To transition back to applying redo shutdown the open standby,
startup mount and restart the recovery process
Using Your Standby
Open for read write
• You must have flashback database enabled for this
• Stop redo apply on standby
• Create a restore point
• Activate the Standby & perform read/write testing
• Flashback to restore point
• Start the redo on the Standby again
Using Your Standby
Open for read write
Physical Standby
Restore
Point
Physical Standby
read write
Using Your Standby
Flashback Database in a Nutshell
• Set up Flashback Database
– alter system set db_recovery_file_dest_size = 8G;
– alter system set db_recovery_file_dest = 'your flashback
destination';
– alter system set db_flashback_retention_target = 1440 ;
– alter database flashback on;
• Once you have cancelled the standby recovery create a
guaranteed restore point
– create guaranteed restore point before_activate;
Using Your Standby
Open for read write
•
Activate your Standby
– SQL> ALTER DATABASE ACTIVATE STANDBY DATABASE;
•
You can open the Standby for business
– SQL> ALTER DATABASE OPEN;
•
To become a Standby again shutdown and startup in mount
– SQL> FLASHBACK DATABASE TO RESTORE POINT BEFORE_ACTIVATE;
– SQL> ALTER DATABASE CONVERT TO PHYSICAL STANDBY;
Using Your Standby
Open for read write
• However things never go according to plan
– ORA-00600: internal error code, arguments: [3705], [1], [8], [3],
[8], [], []
• This was bug 4479323 which is a bug with recovery (not standby
specific) and only occurs in a RAC environment
• This is fixed in 10.2.0.3
Doing a Switchover
It’s good to test
• A business continuity plan is no good unless it’s been tested
• It’s not all about the database
• Good to think in terms of services
Doing a Switchover
Database Switchover
• Make sure your standby is up-to-date
• Check your primary database switchover status:
– primary> SELECT SWITCHOVER_STATUS FROM V$DATABASE;
• Switchover primary database
– primary> ALTER DATABASE COMMIT TO SWITCHOVER TO
PHYSICAL STANDBY with session shutdown;
• Switchover the standby
– standby> ALTER DATABASE COMMIT TO SWITCHOVER TO
PRIMARY with session shutdown;
Doing a switchover
DNS Primer
• DNS allows translation from hostname to IP address
– example.co.uk
IN
A
162.0.0.1
• Our principle is all services are accessed through a CNAME
– anexample.co.uk 5M
IN CNAME
example.co.uk
• relocation of the service is just a case of changing where the
CNAME points
Conclusion
Conclusion
• Dataguard is an efficient DR solution for your primary database
• Dataguard is mostly reliable but is not without it’s blips
• There are opportunities for gaining added value from your standby
• You can’t test your Business continuity plan enough
Adventures in Dataguard
Questions?
Contact:
• [email protected]
• http://blog.nominet.org.uk

Nominet template

Transcript Nominet template

Directory