Managing Outages
This chapter describes unscheduled and scheduled outages and the Oracle operational best practices that can tolerate or manage each outage type and minimize downtime.
This chapter contains these topics:
■ Outage Overview
■ Recovering from Unscheduled Outages
■ Restoring Fault Tolerance
■ Eliminating or Reducing Downtime for Scheduled Outages
4.1 Outage Overview
This section describes the types of possible outages and the recommended methods to repair or minimize the downtime associated with each outage.
This section contains these topics:
■ Unscheduled Outages
■ Scheduled Outages
4.1.1 Unscheduled Outages
Unscheduled outages are unanticipated failures in any part of the technology infrastructure that supports the application, including the following components:
■ Hardware
■ Software
■ Network infrastructure
■ Naming services infrastructure
■ Database
Your monitoring and high-availability infrastructure should provide rapid detection and recovery from downtime. Chapter 3, "Monitoring Using Oracle Grid Control" describes detection, while this chapter focuses on reducing downtime.
Table 4–1 describes unscheduled outages that affect the primary or secondary site components.
The best practice recommendations for reducing unscheduled outages on the primary site and the secondary site, estimated recovery times, and recovery steps appear in the following sections:
■ Managing Unscheduled Outages on the Primary Site
■ Managing Unscheduled Outages on the Secondary Site

Table 4–1 Unscheduled Outages

Outage Type: Site failure
Description: The entire site where the current production database resides is unavailable. This includes all tiers of the application.
Examples:
■ Disaster at the production site such as a fire, flood, or earthquake
■ Malicious attack on the site
■ Power outages. If there are multiple power grids and backup generators for critical systems, then this should affect only part of the data center.

Outage Type: Clusterwide failure
Description: The whole cluster hosting the RAC database is unavailable or fails. This includes failures of nodes in the cluster, and any other components that result in the cluster being unavailable and the Oracle database and instances on the site being unavailable.
Examples:
■ The last surviving node on the RAC cluster fails and cannot be restarted
■ Both of the redundant cluster interconnects fail or clusterware failure or problem
■ Database corruption is severe enough to disallow continuity on the current data server
■ Disk storage fails

Outage Type: Computer failure (node)
Description: A node of the RAC cluster is unavailable or fails
Examples:
■ A database tier node fails or has to be shut down because of bad memory or bad CPU
■ The database tier node is unreachable
■ Both of the redundant cluster interconnects fail, resulting in another node taking ownership

Outage Type: Computer failure (instance)
Description: A database instance is unavailable or fails
Examples:
■ An instance of the RAC database on the data server fails because of a software bug, an operating system problem, or a hardware problem.

Outage Type: Storage failure
Description: Storage holding some or all of the database contents becomes unavailable, because it has shut down or is no longer accessible
Examples:
■ Disk drive failure
■ Disk controller failure
■ Storage array failure

Outage Type: Data corruption
Description: Parts of the database are unavailable because of media corruption, inaccessibility, or inconsistencies
Examples:
■ A datafile is accidentally removed or is unavailable
■ Media corruption affects blocks of the database
■ Oracle block corruption is caused by operating system or other node-related problems

Outage Type: Human error
Description: Parts of the database are unavailable, and transactional or logical data inconsistencies arise. Usually caused by an operator or bugs in the application code.
Note: This category focuses on human errors that affect database availability and, in particular, cause transactional or logical data inconsistencies.
Examples:
■ Localized damage (needs surgical repair): Human error results in a table being dropped or in rows being deleted from a table
■ Widespread damage (needs drastic action to avoid downtime): Application errors result in logical corruption in the database, or operator error results in a batch job being run more times than specified.
4.1.1.1 Managing Unscheduled Outages on the Primary Site
If the primary site contains the production database and the secondary site contains the standby database, then outages on the primary site are the most crucial. Solutions for these outages are critical for maximum availability of the system. Only the Oracle Database 10g with Data Guard, and the Oracle Database 10g with RAC and Data Guard (MAA) architectures have a secondary site to protect from site disasters.
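The relationship between architecture and site-disaster protection described above can be sketched as a small lookup. This is an illustrative model only, not part of any Oracle tool: the `Architecture` class, its field names, and the helper function are hypothetical, and the entry named "Oracle Database 10g with RAC" is assumed here to complete the set of four architectures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Architecture:
    """Hypothetical description of a deployment architecture."""
    name: str
    has_rac: bool          # clustered instances on the primary site
    has_data_guard: bool   # standby database on a secondary site


# The two Data Guard entries are named in the text above; the RAC-only
# entry is an assumption added to make the comparison complete.
ARCHITECTURES = [
    Architecture("Oracle Database 10g", False, False),
    Architecture("Oracle Database 10g with RAC", True, False),
    Architecture("Oracle Database 10g with Data Guard", False, True),
    Architecture("Oracle Database 10g with RAC and Data Guard (MAA)", True, True),
]


def protects_against_site_disaster(arch: Architecture) -> bool:
    """Only architectures with a secondary site survive loss of the primary site."""
    return arch.has_data_guard


site_protected = [a.name for a in ARCHITECTURES if protects_against_site_disaster(a)]
```

Under these assumptions, `site_protected` contains only the two Data Guard architectures, which matches the statement that only those architectures have a secondary site.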
Table 4–2 summarizes the recovery steps for unscheduled outages on the primary site.
For outages that require multiple recovery steps, the table includes links to the detailed descriptions in Section 4.2, "Recovering from Unscheduled Outages" that starts on page 4-9.
Table 4–2 Recovery Times and Steps for Unscheduled Outages on the Primary Site

Outage Type    Oracle Database 10g    Oracle Database 10g

Site failure
    Hours to days
    1. Restore site.
    2. Restore from tape backups.
    3. Recover database.

    Hours to days
    1. Restore site.
    2. Restore from tape backups.
    3. Recover database.

    Seconds to 5 minutes (1)
    1. Database Failover

    Seconds to 5 minutes (1)
    1. Database Failover

Clusterwide failure
    Not applicable

    Hours to days
    1. Restore cluster or restore at least one node.

    Not applicable

    Seconds to 5 minutes
    1. Database Failover

Computer failure (node)
    Minutes to hours (2)
    1. Restart node and restart database.
    2. Reconnect users.

    No downtime (3)
    Managed automatically by RAC Recovery for Unscheduled Outages

    Seconds to 5 minutes (2)
    1. Database Failover
    1. Restart node and restart database.

Computer failure (instance)
    by RAC Recovery for Unscheduled Outages

    Minutes (2)
    1. Restart instance.
    2. Reconnect users.
    or
    Seconds to 5 minutes (1)
    1. Database Failover

Storage failure
    No downtime (4)
    ASM Recovery After

    Seconds to 5 minutes (1)
    1. Database Failover

    Seconds to 5 minutes (1)
    1. Database Failover

Human error
    < 30 minutes (6)
    Recovering from

(1) Recovery time indicated applies to database and existing connection failover. Network connection changes and other site-specific failover activities may lengthen overall recovery time.
(2) Recovery time consists largely of the time it takes to restore the failed system.
(3) Database is still available, but the portion of the application connected to the failed system is temporarily affected.
(4) Storage failures are prevented by using Automatic Storage Management (ASM) with mirroring and its automatic rebalance capability.
(5) Not all types of data corruption are prevented. For the most recent information about the HARD initiative, refer to Section 2.1.6, "Consider HARD-Compliant Storage" on page 2-7.
(6) Recovery times from human errors depend primarily on detection time. If it takes seconds to detect a malicious DML or DDL transaction, then it typically requires only seconds to flash back the appropriate transactions.

4.1.1.2 Managing Unscheduled Outages on the Secondary Site
In most cases, outages on the secondary site can be managed with no effect on availability of the primary database located on the primary site. However, if the configuration is in maximum protection mode, then unscheduled outages on the last surviving standby database will cause outages on the production database to ensure no data loss when failing over to the standby database. After downgrading the data protection mode, you can restart the production database even if the standby databases are not accessible. Outages on the secondary site might affect the mean time to recovery (MTTR) if there are concurrent failures on the primary site.
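The effect of the protection mode on primary availability can be illustrated with a minimal sketch. It models only the rule stated in this section: in maximum protection mode, losing the last surviving standby takes the production database down, while a downgraded mode lets it run without a standby. The function and enum are hypothetical; the two non-maximum-protection mode names are the standard Data Guard protection modes and are not named in this section.

```python
from enum import Enum


class ProtectionMode(Enum):
    MAXIMUM_PROTECTION = "maximum protection"
    # The two modes below are the other Data Guard protection modes;
    # this section refers to them only as a "downgraded" mode.
    MAXIMUM_AVAILABILITY = "maximum availability"
    MAXIMUM_PERFORMANCE = "maximum performance"


def primary_can_stay_open(mode: ProtectionMode, reachable_standbys: int) -> bool:
    """Return True if the production database can remain available.

    Simplified model: maximum protection mode needs at least one
    reachable standby so that no data is lost on failover.
    """
    if reachable_standbys < 0:
        raise ValueError("reachable_standbys must not be negative")
    if mode is ProtectionMode.MAXIMUM_PROTECTION:
        return reachable_standbys >= 1
    return True
```

For example, `primary_can_stay_open(ProtectionMode.MAXIMUM_PROTECTION, 0)` returns `False`, whereas the same call with `ProtectionMode.MAXIMUM_PERFORMANCE` returns `True`, which corresponds to restarting the production database after downgrading the protection mode.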
Table 4–3 summarizes the recovery steps for unscheduled outages of the standby database on the secondary site. For outages that require multiple recovery steps, the table includes links to the detailed descriptions in Section 4.2, "Recovering from Unscheduled Outages" that starts on page 4-9.