Recently, I had a pair of Data Domains that were having replication issues. After the offsite unit was seeded and taken to its location, we noticed that the pre-comp remaining value would get down to 150 GiB or so, and then jump back up to its original number of about 2,300 GiB.
What was happening was that replication was starting over because it lost communication with the other Data Domain.
So, what’s the problem? Well, the first thing to check is your network.
- Dropped pings?
- Firewalls?
- Packet filtering / IPS / AV inline?
In this case, we had no devices inspecting the packets. (It is possible for replication traffic to randomly match an AV definition and for the firewall to discard it, causing the Data Domain to restart replication due to the lost data.)
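If you want to rule out packet loss from the Data Domain itself, you can run a longer ping from the CLI and watch for drops. This is only a sketch: dd-dr.example.com stands in for your destination system's hostname, and the exact options for net ping can vary by DD OS release, so check your command reference.
- net ping dd-dr.example.com count 100
A few lost packets over a WAN is not unusual, but consistent loss is a good hint that the network, not the Data Domain, is the thing to chase.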
The next step was to modify our Keep Alive values.
- PuTTY into the Data Domain
- Type system show serialno and note the serial number
- To enter SE mode, type priv set se
- Use the serial number of the device as the password
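Before changing anything, it is worth recording the current values so you have something to roll back to. On a Linux-based system they live under /proc; I have not confirmed that every DD OS release lets you read them from the SE prompt, so treat this as a sketch:
- cat /proc/sys/net/ipv4/tcp_keepalive_time
- cat /proc/sys/net/ipv4/tcp_keepalive_intvl
- cat /proc/sys/net/ipv4/tcp_keepalive_probes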
We will change 3 values (per EMC support’s direction).
- net.ipv4.tcp_keepalive_time
- The interval between the last data packet sent (simple ACKs are not considered data) and the first keepalive probe; after the connection is marked to need keepalive, this counter is not used any further
- net.ipv4.tcp_keepalive_probes
- The number of unacknowledged keepalive probes to send before considering the connection dead and notifying the application layer
- net.ipv4.tcp_keepalive_intvl
- The interval between subsequent keepalive probes, regardless of what the connection has exchanged in the meantime
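Taken together, these control how long a dead peer can go unnoticed: the kernel waits tcp_keepalive_time seconds after the last data before probing, then sends up to tcp_keepalive_probes probes spaced tcp_keepalive_intvl seconds apart. With the values below, that works out to roughly 300 + (9 x 30) = 570 seconds, or about nine and a half minutes, before the connection is declared dead.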
While in SE mode, enter the following commands
- net option set net.ipv4.tcp_keepalive_time 300
- net option set net.ipv4.tcp_keepalive_probes 9
- net option set net.ipv4.tcp_keepalive_intvl 30
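To confirm the options took, you should be able to list them afterwards. I am going from memory on the exact command name, so verify it against your DD OS command reference:
- net option show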
This does not require a reboot; however, existing TCP connections are not affected, so we will need to restart our replication. Exit SE mode and run:
- replication disable all
- replication enable all
At this point, you can monitor the replication; assuming there are no physical network issues, the problem will most likely be resolved.
To monitor performance from the CLI, type:
- iostat 2
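If you would rather watch the replication context itself than raw disk I/O, DD OS also has replication reporting commands; again, I am going from memory here, so confirm the syntax for your release:
- replication status
- replication show stats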
After a night of replication, the job finally finished successfully, and replication is now in a normal state.
You left me hanging. Did those network settings make things work better?
I’ll update the post when we find out! Should know by tomorrow.
Worked like a charm. Replication completed without any issues.
Interesting blog post. How many times did it get down to ~150 GB pre-comp remaining? Did it replicate for a while and then repeatedly get stuck at the same value? If so, that's pretty suspicious and points to a problem either with something in the network (AV/IDS gear resetting the replication TCP connections, but you ruled this out), something going wrong on the source, or something going wrong on the destination.
Replication on the source basically just needs to read data and metadata, and if that's not happening, then you have bigger problems (!) and should be getting filesystem panics, and it would probably be pretty obvious. It's most likely a problem on the destination. It sounds like it got to a point where it needed to do some specific type of processing, maybe specific to M-Tree replication or some underlying filesystem module that it relies on, and whatever it was doing was taking too long, and the replication destination code did not handle that cleanly. As a result, it did not respond to the source in a timely manner, so the source timed out, most likely logged an error message, and restarted after a short delay.
If you want, push back on tech support; have them escalate to engineering and have them tell you what exactly was going on with the destination filesystem at the time of the previous timeout/restarts (or actually, just before those occurred). They should be able to tell from the filesystem logs in a support bundle from the destination system.
(PS, in case you are guessing, yeah, I used to work on this product…)
The customer closed the case, and since it's working, we didn't push any further. However, since then I have noticed the WAN connection is less than reliable and we see some dropped pings, so I'm leaning toward that. Great information, thanks!
Tim, do you know what the values were before the change? I have VTL replication lag issues, and before making the changes you suggested I checked the current numbers, which are as follows:
# cat /proc/sys/net/ipv4/tcp_keepalive_time
300
# cat /proc/sys/net/ipv4/tcp_keepalive_intvl
75
# cat /proc/sys/net/ipv4/tcp_keepalive_probes
9
So the only difference is tcp_keepalive_intvl, and mine is already higher than the value you suggest.
Regards,
Paul
Unfortunately, I do not. I was directed by EMC support to change the values in this case, but I didn't record the initial ones. If memory serves me correctly, it was a latency issue with the WAN connection that prompted the change.