Testing Disaster Recovery Systems? Steps CIOs Miss

One frequently overlooked or avoided aspect of disaster recovery is the need to test your disaster recovery plan on a regular schedule. Otherwise, you risk an event that your company may not recover from.

Testing involves a run-through of the entire recovery process so that bugs can be identified and ironed out. The best way to accomplish this is to execute what is called a soft failover on a regular basis. This practice run gives you a chance to decide whether or not the process meets your company’s needs. If it doesn’t, changes can be made before the potential for irrecoverable loss arises.

Disaster Recovery Testing Preparation

Prior to executing your DR test, it is essential to engage in a deliberate process of understanding your current environment, mapping your application dependencies, and defining what application-level testing will be performed to validate the integrity of the DR environment. And of course, it is also critical to ensure that your data sets are being replicated, ideally via SAN-to-SAN replication or a DR orchestration tool such as Veeam or Zerto.

One area where many smaller organizations get tripped up in DR testing and planning is application dependency mapping. What exactly do we mean by this? Well, we are talking about making sure you have a clear understanding of which virtual machines and what external connectivity are required to support a particular application. Some organizations may choose to deploy a centralized set of highly available database servers, which multiple applications rely upon. In a DR test (or an actual DR scenario), it's absolutely critical to understand that all of these applications are reliant upon that shared database platform.
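As a rough illustration, a dependency map can start life as a simple data structure long before you adopt a dedicated tool. The sketch below is only one way to do it; every application, VM, and external service name in it is hypothetical. It records what each application needs and then reports any VM that more than one application relies on, such as a shared database server.

```python
from collections import defaultdict

# Hypothetical inventory: all application, VM, and external names below are
# illustrative and are not taken from a real environment.
APP_DEPENDENCIES = {
    "billing": {"vms": ["billing-web-01", "shared-db-01"], "external": ["payment-gateway"]},
    "crm": {"vms": ["crm-app-01", "shared-db-01"], "external": []},
    "intranet": {"vms": ["intranet-web-01"], "external": ["sso-provider"]},
}


def shared_vms(dependencies):
    """Return VMs that more than one application relies on, with those applications."""
    users = defaultdict(list)
    for app, needs in dependencies.items():
        for vm in needs["vms"]:
            users[vm].append(app)
    return {vm: sorted(apps) for vm, apps in users.items() if len(apps) > 1}


print(shared_vms(APP_DEPENDENCIES))  # {'shared-db-01': ['billing', 'crm']}
```

Anything this report returns is a component whose failure, or whose late recovery, affects several applications at once, so it deserves extra attention in the test plan.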

In other situations, you may find that your applications are reliant upon external network connectivity to certain resources, which may need to be routed through specific IPSec tunnels or come from a particular set of authorized IP addresses. These third-party dependencies can be terribly impactful during a real DR event. Imagine the frustration of knowing that your DR plan was otherwise completely successful, but you forgot to set up an IPSec tunnel or inform your third-party provider of IP address changes, causing degraded application functionality.
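One inexpensive safeguard is to probe each external dependency from the DR site as part of the test. The following is a minimal sketch using only Python's standard library; the endpoint tuples it expects (a name, a host, and a port) are an assumed format, not something your provider will hand you. It attempts a plain TCP connection to each endpoint and returns the ones that could not be reached.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


def unreachable_endpoints(endpoints, timeout=3.0):
    """Return the (name, host, port) entries that could not be reached."""
    return [entry for entry in endpoints if not can_connect(entry[1], entry[2], timeout)]
```

You would call `unreachable_endpoints` from a machine at the DR site with the list of third-party hosts and ports your applications need. Keep in mind that a successful TCP connection only shows the network path is open; it does not prove that the provider will accept application traffic from your new IP addresses, so application-level checks are still needed.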

There are also application sequencing concerns. Complex N-tier applications may require that a certain startup order be adhered to in order to function properly. For instance, if an application or web server comes online before a database server, the web server may become overwhelmed by requests that result in errors. In some cases, the web server can recover automatically once the database server is online, but in other cases the web servers will require manual intervention if they are started prior to the database server.
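If you record which servers must be online before each other server starts, a safe startup order can be derived rather than remembered. The sketch below uses the `graphlib` module from Python's standard library (Python 3.9 or later); the server names and their relationships are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical tiers: each key lists the servers that must be online before it starts.
STARTS_AFTER = {
    "db-01": [],
    "app-01": ["db-01"],
    "web-01": ["app-01"],
    "web-02": ["app-01"],
}


def startup_order(starts_after):
    """Return a start order in which every server follows its prerequisites."""
    return list(TopologicalSorter(starts_after).static_order())


order = startup_order(STARTS_AFTER)
print(order)  # the database comes first, then the app server, then the web servers
```

If two servers are recorded as each depending on the other, `static_order` raises `graphlib.CycleError`, which is a useful signal that the dependency notes themselves need correcting before the test.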

Clearly, it is also essential to have a well-defined test plan. In order to properly test the functionality of your applications and systems, you will likely need to engage many different stakeholder groups who have highly specific application-level knowledge. The input of these stakeholders is critical in developing a manual test plan, or ideally an automated test plan.
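To make the idea of an automated test plan concrete, here is a minimal sketch, assuming each stakeholder group can express its checks as small functions that return True on success. The `Check` class, the team names, and the placeholder lambdas are all hypothetical; real checks would log in, run a query, submit a test order, and so on.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Check:
    application: str
    owner: str                # stakeholder group that defined the check
    description: str
    run: Callable[[], bool]   # returns True when the check passes


def run_test_plan(checks):
    """Run every check and return the ones that failed or raised an error."""
    failures = []
    for check in checks:
        try:
            passed = check.run()
        except Exception:
            # A check that crashes is treated the same as a check that fails.
            passed = False
        if not passed:
            failures.append(check)
    return failures


# Hypothetical plan with placeholder results standing in for real checks.
plan = [
    Check("billing", "finance-team", "invoice page loads", lambda: True),
    Check("crm", "sales-ops", "customer search returns results", lambda: False),
]

for failed in run_test_plan(plan):
    print(f"FAILED: {failed.application} ({failed.owner}): {failed.description}")
```

Recording the owner alongside each check means a failure report already tells you who to call, which ties the test plan back to your dependency mapping.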

Finally, part of being prepared for a DR event, whether it's a test or a real disaster, is to continuously monitor your data replication platform. Perhaps it sounds trite, but if your data is not at the DR site, your DR test will be a failure.
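At its simplest, replication monitoring compares the time of the last successful replication against your recovery point objective. The sketch below assumes you can obtain that timestamp from your replication platform; how you fetch it depends entirely on the product, so it is passed in as a plain argument here, and the function names are illustrative.

```python
from datetime import datetime, timedelta, timezone


def replication_lag(last_replicated_at, now=None):
    """Return how far the DR copy trails behind the current time."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now - last_replicated_at


def within_rpo(last_replicated_at, rpo, now=None):
    """Return True when the replication lag does not exceed the RPO."""
    return replication_lag(last_replicated_at, now) <= rpo


# Fixed example times so the result is reproducible.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
last = datetime(2024, 1, 1, 11, 40, tzinfo=timezone.utc)
print(replication_lag(last, now))                    # 0:20:00
print(within_rpo(last, timedelta(minutes=15), now))  # False
```

Run on a schedule, a check like this can raise an alert well before a test or a real disaster reveals that the DR copy has fallen behind.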

Disaster Recovery Testing

A key element of DR testing is communication and scheduling. Consider these questions: Who needs to be notified of the testing? When should we do the test? Who needs to be available to assist with the overall failover, and then the application-specific testing? If you have application dependency mappings, you should also know who the application owners are, and whether they are teams or specific individuals. Similarly, if you have a well-defined test plan, you can also quickly and easily identify who is going to be necessary to perform the DR test.

In some cases, non-disruptive testing can actually be performed during normal business hours. However, depending on the data replication methodology in use, if you happen to suffer an actual disaster during the DR test, you may be stuck with data that is outside of your defined RPOs (recovery point objectives). In other cases, doing the testing after hours is a requirement.
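One way this exposure arises is when replication has to be paused for the duration of the test. The arithmetic is simple enough to sketch, assuming you know your normal replication lag and the planned length of the test; the function name and the figures are illustrative only.

```python
from datetime import timedelta


def worst_case_data_age(normal_lag, test_duration, replication_paused):
    """Oldest the DR copy could be if a disaster hits at the very end of the test."""
    if replication_paused:
        return normal_lag + test_duration
    return normal_lag


rpo = timedelta(hours=1)
age = worst_case_data_age(
    normal_lag=timedelta(minutes=10),
    test_duration=timedelta(hours=3),
    replication_paused=True,
)
print(age > rpo)  # True: a three-hour paused test exceeds a one-hour RPO
```

If the result exceeds your RPO, that is a strong argument for scheduling the test in a low-activity window or for using a replication method that keeps running during the test.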

The good news is that at this point, running your DR test should be smooth sailing. Yes, some long nights or weekend work might be required, but there's a plan, there's a process, and the right people are on board to ensure the testing goes off as planned.

And do not fret if issues are identified during the testing. If you do a DR test and don't find issues, you should actually be more concerned that your test plan is lacking.

Bottom Line

By implementing a disaster recovery system, you have accepted that things can go wrong unexpectedly. And while a negative event could damage your company's hardware or corrupt its data, a separate, undetected issue may ruin the effectiveness of your disaster recovery plan itself.

Regular testing will keep your backup system in good form and let you monitor its effectiveness as the company changes. This ensures that when you need it, the system will serve its purpose.

Has your company implemented an effective disaster recovery plan?

To learn more about how to implement effective disaster recovery plans, download your copy of our free 20+ page eBook: “Backup and Disaster Recovery Planning Guide for CIOs.”