Emergency Database Failover Impact and Recovery


Emergency Database Failover:
Impacts & Recovery Plan
Aaron Smallwood – ERCOT IT
Joel Mickey – ERCOT Market Operations
Emergency Database Failover
• Summary:
– ERCOT conducted an emergency database failover on April 21st, 2008
following a hardware failure
– While ERCOT does perform controlled database failovers monthly, this
was different due to the nature of the hardware failure
• Normally, the database is ‘stopped’ at one site, and then ‘started’ at the other
in a controlled manner
• In this case, the database ‘hung’, meaning that it became unresponsive
and data could not be written to or read from the database
– The impacts:
• Transactions were prevented from updating downstream databases
• The missing transaction updates left a gap in the transactional records of
the downstream databases (out of sync with production)
– The affected extracts for April 21st through April 30th are listed in market
notices for the incident
– ERCOT considers this to be an isolated incident and not a systemic
problem
Recovery Plan
• Goal:
– From a restored copy of the production database, recover the transactions
that are missing from downstream databases and are needed to perform
price adjustment calculations
• Plan:
– Build an environment identical to the production environment
• Servers, storage, applications
– Restore data to pre-crash state (4/21)
• Over 20TB of data to restore from tape (in progress)
– Using the restored environment and data, extract transactions missing
from downstream databases and then roll forward all subsequent
transactions
– ERCOT Market Operations will then review the data for reasonableness
and approve the data for reporting and settlement
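
The extract-and-roll-forward step in the plan above can be illustrated with a minimal sketch. This is not ERCOT's actual tooling; the Txn record, its txn_id ordering key, and both function names are hypothetical and assume each transaction carries an identifier that reflects the order in which it was originally committed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Txn:
    txn_id: int   # hypothetical ordering key (assumed to reflect commit order)
    payload: str  # hypothetical transaction content


def missing_transactions(restored, downstream_ids):
    """Return restored transactions absent from the downstream store, in commit order."""
    known = set(downstream_ids)
    return sorted(
        (t for t in restored if t.txn_id not in known),
        key=lambda t: t.txn_id,
    )


def roll_forward(downstream, missing):
    """Apply the missing transactions to the downstream store in order."""
    for t in missing:
        downstream[t.txn_id] = t.payload
    return downstream


# Example: the downstream copy never received transactions 2 and 3.
restored = [Txn(1, "a"), Txn(2, "b"), Txn(3, "c"), Txn(4, "d")]
downstream = {1: "a", 4: "d"}

gap = missing_transactions(restored, downstream.keys())
roll_forward(downstream, gap)
```

In this sketch the gap is found by comparing identifiers only. A real recovery would also need to confirm that transactions present on both sides carry the same content, which is part of what the reasonableness review in the final step of the plan is for.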
Questions
• Actions to prevent future occurrences:
– Nodal market databases will be on newer hardware with more fault tolerance and
redundancy
– Potential re-architecture of system integration between the databases
• Lessons learned are being documented, but there is no plan yet
• Resources are focused on the data recovery efforts
• Questions:
– When will non-spinning reserve price adjustments for PRR 650 be completed?
• When the transactional data has been restored, reviewed, and approved
– What is the timeline?
• The environment build is complete; we anticipate that the data restore from tape
will be the task that takes the longest
• We are estimating weeks, not months, to complete the plan
– Unknowns include the amount of time needed to restore from tape and the quality
of the data once it’s been restored
• Market notices will continue to be sent to indicate status