Microsoft SQL Server Crash Recovery Mechanism

How SQL Server performs its logging and crash recovery mechanisms till date is a mystery for most IT professionals including DBAs. Microsoft SQL Server Crash recovery mechanism is the process that occurs when the database is moved back to a consistent and usable state after a dirty shutdown. When the SQL Server instance fails, there is a high possibility that the database can be left in an inconsistent state. During the time of the crash, the following could have occurred:

  • Existing data might have changed by transactions which did not fully complete at the time of failure.
  • Completed transactions that had not been flushed to the disk could be lost during failure.

Crash recovery operates by rolling back incomplete transactions and writing to the disk completed committed transactions that were still in memory when the crash occurred.

A quick look at the causes of a crash recovery could be:

  • A hardware failure like disk, CPU, memory, or network.
  • A fatal operating system error
  • A power failure on the server.
  • The DBMS software crashed
  • Human error

A database recovery process is a crucial requirement for all RDMS and it can end up being a mind-numbing process. The quest to improve the recovery process has birthed different procedures, but still only few have really understood RECOVERY.

In this article, I’ll demonstrate the impact of a dirty shutdown of the database instance while a transaction is running and provide several techniques and tools that are available for faster and successful recovery even in the worst-case scenario of a corrupted database. Let’s stay under the assumption that we currently have a LIVE database with no issues and OLTP transactions being run against the database(s). In this simulation, I would have to create a test database and try to mimic the above assumptions.

First, we run this code to create a blank database for simulation purposes

Next, using the database context of CrashRecoverySimulator, we create a dummy table with 3 columns and insert test data of replicated data values.

Now this is where we want to begin a transaction and keep it open to simulate the crash recovery of the SQL Server database

With the transaction still running, perform a dirty shutdown with code below. This mimics the scenario of a server crashing or forced shutdown.

NOTE: Run this in a separate query window or editor.

Now that SQL Server has stopped, we need to restart it by going to SQL Configuration Manager, Right Click on the SQL instance and Click Start.

With SQL Server running successfully and all databases ONLINE, including CrashRecoverySimulator. You can check database state with the code below.

Result

Now let’s query the CRS table and see if the UPDATE we performed in the running transaction is available.

Result

The UPDATE statement was rolled back hence the data in ColA still remains the same.

Want to know the INTERNALS?

Logging serves a provision for a variety of operations in SQL Server. There are so many reasons as to why logging is vital in any RDMS. A few are presented below:

  • It makes sure that if a crash occurs, a committed transaction will be correctly reflected in the database after the crash.
  • It ensures that an uncommitted transaction will be correctly rolled back and not reflected in the database after a crash.
  • It ensures that it is possible to cancel an in-flight transaction and have all its operations rolled back.
  • It allows a backup copy of the transaction log to be taken so that a database can be restored and the transaction log backups replayed to bring the database to a specific point in time with transactional consistency.
  • It supports features that rely on reading the transaction log, such as replication, database mirroring, and change data capture.

The majority of these uses of logging involve a mechanism called recovery. Recovery is the procedure of having the changes found in log records restated or reverted in the database. Recovery has three phases: analysis, redo and undo. Database is completely accessible only after all three phases of recovery has completed. There is a feature in the enterprise edition called FAST recovery which allows the database to come online after REDO phase. Aside this, all these phases are required to happen before database can be in a consistent state.

Replaying log records is called the REDO (or roll forward) phase of recovery. Reverting log records is called the UNDO (or roll back) phase of recovery. In other words, recovery will make sure that a transaction and all its constituent log records are either redone or undone.

The simple form of recovery is when a single transaction is canceled, in which case it is undone and there is no net effect on the database. The most complex form is crash recovery—when SQL Server crashes (for whatever reason) and the transaction log must be recovered to bring the database to a transactionally consistent point. This means that all transactions that were committed at the time of the crash must be rolled forward to ensure their effects are persisted in the database. And all in-flight transactions that had not committed at the time of the crash must be rolled back to ensure their effects are not persisted in the database.

This is because there is no facility for a transaction in SQL Server to continue after a crash. Thus, if the effects of a partially complete transaction were not rolled back, the database would be left in an inconsistent state (possibly even structurally corrupt, depending on what the transaction was in the middle of doing). (Randal, 2009)

Why is my recovery taking so LONG?

Did you look at your SQL Server error longs and notice the below information? You are possibly experiencing a slow database recovery process. The three mentioned phases talked above is occurring in this scenario and the possible ONLY solution is to WAIT for it to complete. The wait time can be exponentially long based of factors of size of database files, number of VLFs, and computer resources.

Recovery of database ‘CrashRecoverySimulator’ (6) is 77% complete (approximately 28890 seconds remain). Phase 3 of 3. This is an informational message only. No user action is required.

Recovery of database ‘CrashRecoverySimulator’ (6) is 77% complete (approximately 28595 seconds remain). Phase 3 of 3. This is an informational message only. No user action is required.

Recovery of database ‘CrashRecoverySimulator’ (6) is 77% complete (approximately 28499 seconds remain). Phase 3 of 3. This is an informational message only. No user action is required.

Recovery of database ‘CrashRecoverySimulator’ (6) is 77% complete (approximately 28202 seconds remain). Phase 3 of 3. This is an informational message only. No user action is required.

Recovery of database ‘CrashRecoverySimulator’ (6) is 77% complete (approximately 28067 seconds remain). Phase 3 of 3. This is an informational message only. No user action is required.

Query rollback operation is usually single threaded while when same query executes it can use parallelism/multiple threads to perform same operation making it very fast as compared to the rollback. Even restarting SQL Server service is will not help much, as it will actually make it worst as the rollback process will restart again. Luckily there are some dmv’s that help give you an estimated completion time.

Sample:

Result

Session_id command status percent_complete estimated_completion_time
35DB STARTUPbackground86.06061289093

From the result, you can see that there is a background process running a DB STARTUP command which is at an estimate percent of 86% with an estimated completion time of 289093 milliseconds. Convert the time to minutes or hours to get a rough idea of when the process will complete.

What happens if recovery FAILS?

In a case of failed recovery of the database, we would either have to restore the database from good known backups or attempt to repair the database. Allow me to introduce to you a tool that is trusted by Microsoft MVPs. Stellar SQL Database Repair is the most recommended SQL Recovery software amongst database administrators. The Software fixes corrupt SQL database files and recovers inaccessible objects to save business data.