Data Domain Replication Issues

 

Recently, I had a pair of Data Domains that were having replication issues. After the offsite unit was seeded and taken to its location, we noticed that the pre-comp remaining value would get down to 150 GiB or so, and then jump back up to its original number of about 2,300 GiB.

[Screenshot: replication status showing the pre-comp remaining value]

What was happening was that replication kept starting over because the Data Domain lost communication with its replication partner.

So, what’s the problem?  Well, the first thing to check is your network.

  • Dropped pings?
  • Firewalls?
  • Packet filtering / IPS / AV inline?

In this case, we had no devices inspecting the packets. (It's possible for replication traffic to randomly match an AV definition, in which case the firewall would discard it, causing the Data Domain to restart replication due to the lost data.)
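If you want to rule the basics in or out yourself, a couple of quick checks from a host on each side of the WAN go a long way. The hostname below is just an example, and TCP 2051 is the default Data Domain replication port on the systems I've worked with; verify what your contexts are actually configured to use:

  # Watch for drops and latency spikes over a sustained run
  ping -c 100 dd-destination.example.com

  # Confirm the replication port is reachable end to end
  nc -vz dd-destination.example.com 2051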

The next step was to modify our TCP keepalive values.

  • PuTTY (or SSH) into the Data Domain
  • Type system show serialno to get the unit's serial number
  • To enter SE mode, type priv set se
  • Use the serial number of the device as the password
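Put together, the login sequence looks something like this from an SSH or PuTTY session (the hostname and the sysadmin account are examples; use whatever administrative login your unit has):

  ssh sysadmin@dd-source.example.com   # open a session to the source Data Domain
  system show serialno                 # note the serial number it returns
  priv set se                          # when prompted for a password, enter that serial number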

We will change 3 values (per EMC support’s direction).

  • net.ipv4.tcp_keepalive_time
    • The interval between the last data packet sent (simple ACKs are not considered data) and the first keepalive probe; once the connection is marked as needing keepalives, this counter is not used any further
  • net.ipv4.tcp_keepalive_probes
    • The number of unacknowledged probes to send before considering the connection dead and notifying the application layer
  • net.ipv4.tcp_keepalive_intvl
    • The interval between subsequent keepalive probes, regardless of what the connection has exchanged in the meantime
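To put the numbers we are about to set in perspective: with tcp_keepalive_time at 300, tcp_keepalive_intvl at 30, and tcp_keepalive_probes at 9, an idle replication connection sends its first probe after 300 seconds and then sends up to 9 unanswered probes 30 seconds apart before the stack declares the peer dead, so 300 + (9 × 30) = 570 seconds, or roughly nine and a half minutes. That math assumes the standard Linux keepalive behavior these sysctls come from, which is what the Data Domain appears to expose here.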

While in SE mode, enter the following commands:

  • net option set net.ipv4.tcp_keepalive_time 300
  • net option set net.ipv4.tcp_keepalive_probes 9
  • net option set net.ipv4.tcp_keepalive_intvl 30
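If you have shell access on the unit and want to confirm the new values took effect, the underlying Linux sysctls can be read back directly; these are standard /proc paths rather than Data Domain commands:

  cat /proc/sys/net/ipv4/tcp_keepalive_time     # should now read 300
  cat /proc/sys/net/ipv4/tcp_keepalive_intvl    # should now read 30
  cat /proc/sys/net/ipv4/tcp_keepalive_probes   # should now read 9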

This does not require a reboot. However, existing TCP connections are not affected, so we will need to restart our replication. Exit SE mode, then run:

  • replication disable all
  • replication enable all
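It's worth a quick sanity check that the contexts reconnected after being re-enabled. On the DD OS versions I've used, something along these lines shows the state of each replication context (check the command reference for your release):

  replication status                   # connection state of each context
  replication show config              # confirm the source/destination pairs are intact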

At this point, you can monitor the replication and, most likely, the issue will be taken care of, assuming no physical network issues.

To monitor performance from the CLI, type:

  • iostat 2

[Screenshot: iostat output during replication]
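Depending on your DD OS release, replication-specific counters can be easier to read than raw iostat output; something like the following (again, verify the exact syntax in your command reference) prints per-context throughput on a two-second refresh:

  replication show performance all 2   # per-context replication throughput, every 2 seconds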

 

After a night of replication, the job finally finished successfully, and replication is now in a normal state.

 

 

This Post Has 8 Comments

  1. Michael White

    You left me hanging. Did those network settings make things work better?

    1. Tim

      I’ll update the post when we find out! Should know by tomorrow.

    2. Tim

      Worked like a charm. Replication completed without any issues.

  2. ex-ddup

    Interesting blog post. How many times did it get down to ~150GB pre-comp remaining? Did it replicate for a while and then repeatedly get stuck at the same value? If so, that’s pretty suspicious and points to a problem either with something in the network (AV/IDS stuff resetting replication TCP connections; but you ruled this out), something on the source going wrong, or something on the destination going wrong. Replication on the source basically just needs to read data and metadata, and if that’s not happening, then you have bigger problems (!) and should be getting filesystem panics, and it would probably be pretty obvious. It’s most likely a problem on the destination. It sounds like it got to a point where it needed to do some specific type of processing, maybe specific to m-tree replication or some underlying filesystem module that it relies on, and whatever it was doing was just taking too long and the repl destination code did not handle that cleanly. As a result, it did not respond to the source in a timely manner, so the source timed out, logged an error message (most likely), and restarted after a short delay.

    If you want, push back on tech support; have them escalate to engineering and have them tell you what exactly was going on with the destination filesystem at the time of the previous timeout/restarts (or actually, just before those occurred). They should be able to tell from the filesystem logs in a support bundle from the destination system.

    (PS, in case you are guessing, yeah, I used to work on this product…)

    1. Tim

      Customer closed the case, and since it’s working, we didn’t push any further. However, since then, I have noticed the WAN connection is less than reliable and we show some dropped pings, so I’m leaning towards that. Great information, thanks!

  3. Paul Aviles

    Tim, do you know what the values were before the change? I have VTL replication lag issues, and before making the changes you suggested, I saw the numbers are as follows:

    # cat /proc/sys/net/ipv4/tcp_keepalive_time
    300
    # cat /proc/sys/net/ipv4/tcp_keepalive_intvl
    75
    # cat /proc/sys/net/ipv4/tcp_keepalive_probes
    9

    So the only difference is tcp_keepalive_intvl, which is already higher than the value you suggest.

    Regards,

    Paul

    1. Tim

      Unfortunately, I do not. I was directed by EMC support to change the values in this case, but I didn’t record the initial values. If memory serves me correctly, it was a latency issue with the WAN connection that prompted the change.
