I'm going through a NetApp filer DR (Disaster Recovery) failover exercise. These are the questions I prepared for the team before the actual execution. In my environment, I have tier 1 production data that is currently "unprotected" from disaster, and I am setting up SnapMirror to protect it.  I also have to consider an RTO (Recovery Time Objective) and an RPO (Recovery Point Objective).

 

To narrow it down, here are some questions you need to answer, working with the respective team members.

 

1.       What kind of data is this: CIFS, NFS, iSCSI, etc...?

 

2.       What/who uses this data: human, servers, other...?

 

3.       Will this replace or supplement your current backup scheme?

 

4.       What exactly are your RTO and RPO?  That will help determine what type of SnapMirror you'll be using (assuming that's what you'll be using).  In theory, synchronous SnapMirror keeps the destination current enough to recover data immediately, but it is resource intensive.  With synchronous SnapMirror, the lag time for full recovery depends on everything other than the NetApp.  Most sites use asynchronous SnapMirror.

 

5.       Almost every protocol involves other non-NetApp technologies.  For example, CIFS requires AD, DNS, DFS, etc.  Which do you have?  The design will need to include this information.  From my understanding, the SnapMirror part of the RTO is small: all you have to do is break the mirror to fail over to the SnapMirror destination in the event of a disaster.  That one task takes hardly any time.  The majority of the RTO will be spent restoring the technologies/services that the NetApp requires to function.

 

6.       Will you use FlexClone with SnapMirror to easily test DR (Disaster Recovery) without taking downtime?

 

7.       Can you perform SnapMirror operations during your production hours, or do you only have a particular time window?  In theory, SnapMirror (asynchronous or synchronous) requires no downtime.  It takes a point-in-time snapshot of the data and performs a block-level replication of that snapshot to the destination filer, so you are hardly touching production data.  However, be cautious: replication can be resource intensive in terms of CPU, disk reads, and network throughput on the production filer, which may be noticeable to the user community.  I'm experimenting with the throttling that can be set for replication to minimize the impact during production hours.
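For questions 4 and 7, a rough back-of-the-envelope calculation can help when discussing the schedule and throttle with the team. The sketch below is only a simplified model, not a NetApp tool: it assumes the worst-case RPO for asynchronous replication is roughly the update interval plus the time to transfer the changed data, and that the throttle is expressed in kilobytes per second. The function names and the sample numbers are hypothetical.

```python
# Simplified planning model (an assumption, not NetApp's own sizing method).

def transfer_minutes(changed_mb: float, throttle_kbs: float) -> float:
    """Minutes needed to replicate changed_mb megabytes at throttle_kbs KB/s."""
    seconds = (changed_mb * 1024) / throttle_kbs
    return seconds / 60


def worst_case_rpo_minutes(interval_min: float, changed_mb: float,
                           throttle_kbs: float) -> float:
    """Worst case: data written right after one snapshot is only safe
    once the next scheduled update has finished transferring."""
    return interval_min + transfer_minutes(changed_mb, throttle_kbs)


def keeps_up(interval_min: float, changed_mb: float,
             throttle_kbs: float) -> bool:
    """True if one update finishes before the next one is due."""
    return transfer_minutes(changed_mb, throttle_kbs) < interval_min


if __name__ == "__main__":
    # Hypothetical numbers: 600 MB changed per 15-minute interval,
    # replication throttled to 2048 KB/s.
    print(transfer_minutes(600, 2048))
    print(worst_case_rpo_minutes(15, 600, 2048))
    print(keeps_up(15, 600, 2048))
```

If the transfer time approaches the interval, either the throttle is too tight for the change rate or the stated RPO is not achievable with that schedule.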

 

 

If it’s as simple as CIFS and SnapMirror, all you need to know about is the snapmirror resync command.

 

 

What does the snapmirror resync command do?

 

After the snapmirror break command, you can apply the snapmirror resync command to either the original SnapMirror destination or the original source.

 

  • Applied to the original destination—the snapmirror resync command puts a volume or qtree back into a SnapMirror relationship and resynchronizes its contents with the source without repeating the initial transfer.
  • Applied to the source volume—the snapmirror resync command can turn the source volume into a copy of the original destination volume. In this way, the roles of source and destination can be reversed.
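The two cases above differ only in which side is named as the source. A small sketch, assuming the 7-Mode form snapmirror resync -S source destination run on the system that will become the destination; the filer and volume names are hypothetical, and the helper only builds the command text for a runbook without executing anything:

```python
def resync_command(source: str, destination: str) -> str:
    """Build the resync command line for a given direction.

    Assumes 7-Mode syntax: snapmirror resync -S <source> <destination>.
    """
    return f"snapmirror resync -S {source} {destination}"


# Hypothetical names for illustration only.
PROD = "prod_filer:vol1"
DR = "dr_filer:vol1"

# Case 1: put the original destination back into the relationship
# (run on the DR filer).
restore_original = resync_command(PROD, DR)

# Case 2: reverse the roles so the original source becomes a copy
# of the original destination (run on the production filer).
reverse_roles = resync_command(DR, PROD)

if __name__ == "__main__":
    print(restore_original)
    print(reverse_roles)
```

Check the exact syntax against the documentation for your Data ONTAP release before using it in a real exercise.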

 

To avoid data loss, follow this procedure:

 

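1.       Turn off CIFS on the source so that no more writes are made; this leaves the data consistent.

2.       Run a snapmirror update just prior to failing over, to ensure that all data written before CIFS was shut down is pushed across.

3.       Fail over by breaking the mirror (snapmirror break) on the destination.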
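The three steps above can be written down as an ordered runbook so that everyone on the team sees which command runs on which filer. This is a minimal sketch assuming 7-Mode commands (cifs terminate, snapmirror update, snapmirror break) and a volume that has the same name on both sides; the filer and volume names are hypothetical, and the script only prints the steps.

```python
from typing import List, Tuple


def failover_steps(src_filer: str, dst_filer: str,
                   volume: str) -> List[Tuple[str, str]]:
    """Return (filer, command) pairs in the order they should be run."""
    dst_path = f"{dst_filer}:{volume}"
    return [
        # 1. Stop CIFS on the source so no further writes arrive.
        #    Note: this stops CIFS for the whole filer, not one volume.
        (src_filer, "cifs terminate"),
        # 2. Push the last changes across; update runs on the destination.
        (dst_filer, f"snapmirror update {dst_path}"),
        # 3. Break the mirror to make the destination writable.
        (dst_filer, f"snapmirror break {volume}"),
    ]


if __name__ == "__main__":
    # Hypothetical names for illustration only.
    for number, (filer, command) in enumerate(
            failover_steps("prod_filer", "dr_filer", "vol1"), start=1):
        print(f"{number}. on {filer}: {command}")
```

Bringing the CIFS shares, DNS, and DFS targets up on the DR side is deliberately left out, since that depends on the non-NetApp services discussed in question 5.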

 

Before failing back, you will once more need to turn CIFS off (this time on the DR side), run a snapmirror update, and revert using resync. After all this there is some cleanup to be done: on the DR controller you will need to delete all snapmirror destinations.