3. Click Yes to continue with the switchover. Click No to cancel.

4.2.7 Recovering from Data Corruption (Data Failures)

Recovering from data corruption is an unscheduled outage scenario. Data corruption is usually—but not always—caused by some activity or failure that occurs outside the database, even though the problem might be evident within the database.

Data corruption in data files has two categories:

■ Data file block corruption

A corrupt data file block can be accessed, but the contents in the block are invalid or inconsistent. The typical cause of data file corruption is a faulty hardware or software component in the I/O stack, which includes, but is not limited to, the file system, volume manager, device driver, host bus adapter, storage controller, and disk drive.

The database usually remains available when corrupt blocks have been detected, but some corrupt blocks might cause widespread problems, such as corruption in a file header or with a data dictionary object, or corruption in a critical table that renders an application unusable.

A data fault is detected when it is recognized by the user, administrator, RMAN backup, or application because it has affected the availability of the application.

For example:

– A single corrupt data block in a user table that cannot be read by the application because of a bad spot on the physical disk

– A single corrupt data block because of block inconsistencies detected by Oracle. The block is marked corrupt, and any application accessing the block receives an ORA-01578 error.

– A database that automatically shuts down because of the invalid blocks of a data file in the SYSTEM tablespace caused by a failing disk controller

■ Media failure

This category of data corruption results from a physical hardware problem or user error. The system cannot successfully read or write to a file that is necessary to operate the database.

In all environments, you can resolve a data corruption outage by one of the following methods:

■ RMAN block media restoration and recovery

■ Data Guard switchover or failover to a standby database

■ RMAN data file media restoration and recovery

■ Manually re-create the object

RMAN block media restoration and recovery provides the highest application availability if the targeted blocks are not critical to application functionality. Data Guard switchover or failover to a standby database provides the fastest predictable RTO.

Other outages that result in database objects becoming unavailable or inconsistent are caused by human error, such as dropping a table or erroneously updating table data.

Information about recovering from human error can be found in Section 4.2.8, "Recovering from Human Error" on page 4-37.

If the data corruption affects non-data files, then the repair may be slightly different.

Table 4–10 provides a matrix of the key non-database objects that can be corrupted and the recommended repair for each.

The following recovery methods can be used:

■ Use Data Guard to Recover From Data Corruption and Data Failure

■ Use RMAN Block Media Recovery

■ Use RMAN Data File Media Recovery

■ Re-Create Objects Manually

Table 4–10 Non-Database Object Corruption and Recommended Repair

Object or component affected: Any control file
Impact: Database fails
Repair: Data Guard fast-start failover will automatically fail over to the standby database

Object or component affected: Redo log member
Impact: None
Repair:
1. Investigate failure and check system
2. Drop and re-create redo log member

Object or component affected: Active log group that is archived and not needed for crash recovery
Impact: Database fails
Repair: Restart database after dropping affected redo log group

Object or component affected: Active redo log group not archived and not needed for crash recovery
Impact: Database fails
Repair:
1. Restart database after dropping affected log group
2. Create a new backup
3. Refresh the standby database either by applying an incremental backup or re-creating the standby database from the primary or a backup of the primary database

Object or component affected: Active or current redo log group that is still needed for crash recovery
Impact: Database fails
Repair: Use one of the following solutions:
■ Data Guard failover
■ Flashback Database: flash the database back to a consistent time and then issue an OPEN RESETLOGS

Object or component affected: Archived redo log file
Impact: None
Repair:
1. Create database backup
2. Refresh the standby database either by applying an incremental backup or re-creating the standby database from the primary or a backup of the primary database

Object or component affected: SPFILE
Impact: None
Repair: Restore SPFILE from a backup and revise
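For the "Redo log member" entry in Table 4–10, the drop and re-create step might look like the following sketch. It is issued through the RMAN SQL command to match the RMAN prompt used in this section; the same statements can be run from SQL*Plus. The file paths and the group number are hypothetical placeholders, and the example assumes that the affected group is not the active or current group and that it keeps at least one healthy member.

```
# Hypothetical paths and group number; substitute your own values.
# Drop the damaged member. Its group must not be active or current.
SQL "ALTER DATABASE DROP LOGFILE MEMBER ''/u02/oradata/prod/redo02b.log''";

# Add a replacement member for the same group on healthy storage.
SQL "ALTER DATABASE ADD LOGFILE MEMBER ''/u03/oradata/prod/redo02b.log'' TO GROUP 2";
```

Investigate the underlying failure first, as the table recommends, so that the new member is not placed on the same faulty device.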

4.2.7.1 Use Data Guard to Recover From Data Corruption and Data Failure

Failover is the operation of transitioning a standby database to become the new production database. A database switchover is a planned transition in which a standby database and a production database switch roles. Either of these operations can occur in less than 5 minutes and with no data loss.

Use Data Guard switchover or failover for data corruption or data failure when:

■ The database is down or when the database is up but the application is unavailable because of data corruption or failure, and the time to restore and recover locally is long or unknown.

■ Recovering locally would take longer than the business SLA or RTO allows.

4.2.7.2 Use RMAN Block Media Recovery

Block media recovery (BMR) recovers one block or a set of data blocks marked "media corrupt" in a data file by using the RMAN BLOCKRECOVER command. When a small number of data blocks are marked media corrupt and require media recovery, you can selectively restore and recover damaged blocks rather than whole data files. This results in lower RTO because only blocks that need recovery are restored and only necessary corrupt blocks undergo recovery. Block media recovery minimizes redo application time and avoids I/O overhead during recovery. It also enables affected data files to remain online during recovery of the corrupt blocks. The corrupt blocks, however, remain unavailable until they are completely recovered.

Use block media recovery when:

■ A small number of blocks require media recovery and the blocks that need recovery are known. If a significant portion of the data file is corrupt, or if the amount of corruption is unknown, then a different recovery method should be used.

■ Blocks are marked corrupt (as verified with the RMAN BACKUP VALIDATE command) and complete recovery is required.

■ A backup of the data file containing the corrupted blocks is available locally or can be retrieved from a remote location, including from a physical standby database.
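The verification step named in the list above can be sketched as follows. This minimal example assumes that RMAN is already connected to the target database. The commands read the data files and report corrupt blocks without creating a backup, and the blocks they find are recorded in the V$DATABASE_BLOCK_CORRUPTION view.

```
# Scan all data files for physically corrupt blocks; no backup pieces are created.
BACKUP VALIDATE DATABASE;

# Optional variant: also test the blocks for logical corruption.
BACKUP VALIDATE CHECK LOGICAL DATABASE;
```

Querying V$DATABASE_BLOCK_CORRUPTION afterward shows the file and block numbers involved, which helps decide whether the corruption is small and well defined enough for block media recovery.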

Block media recovery cannot be used to recover from the following:

■ User error or software bugs that cause logical corruption where the data blocks are intact. See Section 4.2.8, "Recovering from Human Error" on page 4-37 for additional details for this type of recovery.

■ Changes caused by corrupt redo data. Block media recovery requires that all available redo data be applied to the blocks being recovered.

For example, to recover a specific corrupt block using RMAN block media recovery:

RMAN> BLOCKRECOVER DATAFILE 7 BLOCK 3;

After the corruption is detected, this block can also be recovered easily through Grid Control.
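When several blocks are corrupt, they can be recovered in one operation. The following is a brief sketch, assuming that V$DATABASE_BLOCK_CORRUPTION was populated by an earlier BACKUP VALIDATE run; the file and block numbers in the second command are placeholders only.

```
# Recover every block currently listed in V$DATABASE_BLOCK_CORRUPTION.
BLOCKRECOVER CORRUPTION LIST;

# Alternatively, name the blocks explicitly (hypothetical file and block numbers).
BLOCKRECOVER DATAFILE 7 BLOCK 3 DATAFILE 9 BLOCK 21;
```

The affected data files stay online in both cases; only the listed blocks are unavailable until their recovery completes.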

See Also: Database Switchover with a Standby Database on page 4-19 and Database Failover with a Standby Database on page 4-13

4.2.7.3 Use RMAN Data File Media Recovery

Data file media recovery recovers an entire data file or a set of data files for a database by using the RMAN RECOVER command. When a large or unknown number of data blocks are marked media-corrupt and require media recovery, or when an entire file is lost, the affected data files must be restored and recovered.

Use RMAN data file media recovery when the following conditions are true:

■ The number of blocks requiring recovery is large or unknown

■ Block media recovery is not available (for example, if incomplete recovery is required, or if only incremental backups are available for the data file requiring recovery)
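A typical restore and recover sequence for one data file is sketched below. It assumes that the database runs in ARCHIVELOG mode and stays open, that a usable backup and the required archived redo logs are available, and that the damaged file does not belong to the SYSTEM tablespace. Data file number 7 is a placeholder.

```
# Hypothetical data file number; substitute the affected file.
# Take the damaged file offline so the rest of the database stays available.
SQL 'ALTER DATABASE DATAFILE 7 OFFLINE';

# Restore the file from backup, then apply redo to bring it current.
RESTORE DATAFILE 7;
RECOVER DATAFILE 7;

# Return the recovered file to service.
SQL 'ALTER DATABASE DATAFILE 7 ONLINE';
```

Unlike block media recovery, the whole data file is unavailable while this sequence runs, which is why it suits large or unknown amounts of corruption rather than a few known blocks.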

4.2.7.4 Re-Create Objects Manually

Some database objects, such as small look-up tables or indexes, can be recovered quickly by manually re-creating the object instead of doing media recovery.

Use manual object re-creation when:

■ You must re-create a small index because of media corruption. Creating an index online enables the base object to be used concurrently.

■ You must re-create a look-up table, or the scripts to re-create the table are readily available. Dropping and re-creating the table might be the fastest option.
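Re-creating a small index online might look like the following sketch. The schema, table, column, and index names are hypothetical, and the statements are issued through the RMAN SQL command only to match the RMAN prompt used in this section; they can equally be run from SQL*Plus. The example assumes that the base table itself is intact.

```
# Hypothetical object names; substitute the affected index and its table.
# Remove the corrupt index.
SQL 'DROP INDEX app.lookup_code_ix';

# Re-create it online so the base table remains usable during the build.
SQL 'CREATE INDEX app.lookup_code_ix ON app.lookup_codes (code) ONLINE';
```

Queries that relied on the index may run more slowly between the two statements, so this approach fits small indexes that rebuild quickly.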

From the document Oracle® Database High Availability Best Practices 10g (pages 115-118)