MSSQL 2014 AlwaysOn Availability Group Cluster & Gratuitous ARP (GARP) Issue

An MSSQL 2014 AlwaysOn cluster running on Windows 2012 R2 doesn’t send Gratuitous ARP (GARP) packets by default!

I recently came across gratuitous ARP (GARP) issues while working on a Microsoft SQL 2014 AlwaysOn Availability Group cluster setup. This is what I experienced:

  1. The MSSQL 2014 AlwaysOn cluster with an AlwaysOn Availability Group (AG) was set up as per best practices and expert recommendations; all cluster-related services were running without any issue.
  2. Clients on the same IP network/same VLAN were able to connect to the AlwaysOn AG listener Virtual IP (VIP) address immediately after a cluster failover from Node-A to Node-B and vice versa.
  3. However, clients on different IP subnets were NOT able to connect to the VIP immediately after a cluster failover.
  4. Clients on different IP subnets had to wait 20 minutes before they could connect to the VIP.
  5. These 20 minutes are the lifetime of the VIP's MAC address entry on the Ethernet switch (I use Juniper EX-series switches) where the servers are connected (via the physical hypervisor).
  6. At the network layer, the switch “ARP table” was still showing the previously learnt MAC address for the AG listener VIP; the switch did not update the MAC address after a cluster failover was triggered. It flushed the old MAC and learnt the new, correct MAC address only after the age time (20 minutes) expired.

I looked for a solution and found that “GARP Reply” needs to be enabled manually on the Juniper EX switch. I did that, but there was still NO improvement!

I also looked at Microsoft KB documents and forums. People say GARP needs to be turned on at the network switch, which I had ALREADY done without any success.

After digging further, I found that the Windows 2012 R2 servers were not sending any GARP packets, so the switch was not updating its ARP table even though it was configured to work with GARP.
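
For readers unfamiliar with what the switch is waiting for, here is a minimal Python sketch of a gratuitous ARP frame. It is an illustration only, not what Windows actually emits: the function name build_garp_frame, the example MAC address and the VIP 10.0.0.50 are hypothetical, and the target hardware address used here (zeros for a request, broadcast for a reply) is a common convention that varies between implementations. The key point is that the sender IP and the target IP are both the VIP, so every device that hears the broadcast can refresh its entry for that address.

```python
import socket
import struct

BROADCAST_MAC = b"\xff" * 6
ETHERTYPE_ARP = 0x0806


def build_garp_frame(sender_mac: str, vip: str, reply: bool = False) -> bytes:
    """Build a raw Ethernet frame carrying a gratuitous ARP for `vip`.

    The frame is only constructed here, not sent; sending raw frames
    needs elevated privileges and is platform specific.
    """
    mac = bytes.fromhex(sender_mac.replace(":", "").replace("-", ""))
    if len(mac) != 6:
        raise ValueError("MAC address must be exactly 6 bytes")
    ip = socket.inet_aton(vip)

    # Ethernet header: broadcast destination, our MAC as source, ARP EtherType.
    ethernet = BROADCAST_MAC + mac + struct.pack("!H", ETHERTYPE_ARP)

    # ARP header: hardware type 1 (Ethernet), protocol 0x0800 (IPv4),
    # hardware length 6, protocol length 4, opcode 1 = request / 2 = reply.
    opcode = 2 if reply else 1
    header = struct.pack("!HHBBH", 1, 0x0800, 6, 4, opcode)

    # Assumed convention: zeros as target MAC in a request, broadcast in a reply.
    target_mac = BROADCAST_MAC if reply else b"\x00" * 6

    # Gratuitous ARP: sender IP and target IP are the same address (the VIP).
    return ethernet + header + mac + ip + target_mac + ip


if __name__ == "__main__":
    frame = build_garp_frame("00:11:22:33:44:55", "10.0.0.50")
    print(len(frame), frame.hex())
```

If the new active node never emits a frame like this after failover, the upstream device has nothing to react to, no matter how it is configured.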

To get this working, the Windows Server registry entry “ArpRetryCount” needs to be added. Microsoft says the following about it:

“Determines how many times TCP sends an Address Request Packet for its own address when the service is installed. This is known as a gratuitous Address Request Packet. TCP sends a gratuitous Address Request Packet to determine whether the IP address to which it is assigned is already in use on the network.”

Add the registry entry as follows:

- Key: HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters
- Entry: ArpRetryCount (type REG_DWORD)
- Value: between 0 and 3 (use 3)

0: don't send GARP
1: send GARP once only
2: send GARP twice
3: send GARP three times (the documented default value, although the entry is not actually present on Windows 2012 R2)
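
If you prefer to script the change, the sketch below only builds the reg add command line for the entry described above, so it can be reviewed before being run from an elevated prompt on each cluster node. The helper name arp_retry_count_command is hypothetical; reg add with /v, /t, /d and /f is the standard Windows command-line registry tool.

```python
# Registry key from the post; the value entry is created under it.
TCPIP_PARAMETERS_KEY = r"HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters"


def arp_retry_count_command(count: int = 3) -> str:
    """Return a `reg add` command that sets ArpRetryCount to `count`.

    Only the 0-3 range described in the post is accepted.
    """
    if not 0 <= count <= 3:
        raise ValueError("ArpRetryCount must be between 0 and 3")
    return (
        f'reg add "{TCPIP_PARAMETERS_KEY}" '
        f"/v ArpRetryCount /t REG_DWORD /d {count} /f"
    )


if __name__ == "__main__":
    # Print the command rather than executing it, so it can be checked first.
    print(arp_retry_count_command(3))
```

Printing the command instead of executing it is deliberate: a registry change on a production cluster node is worth a second look before it is applied.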

To enable “GARP reply” on Juniper EX & SRX platforms, use the following command:

#set interfaces interface_name/number gratuitous-arp-reply

The interface can be a physical interface, logical interface, interface group, SVI or IRB.

To enable GARP on Cisco IOS, use the command “ip gratuitous-arps”.

References:
https://technet.microsoft.com/en-us/library/cc957526.aspx
http://www.juniper.net/techpubs/en_US/junos13.2/topics/usage-guidelines/interfaces-configuring-gratuitous-arp.html
http://www.cisco.com/web/techdoc/dc/reference/cli/nxos/commands/l3/ip_arp_gratuitous.html

5 thoughts on “MSSQL 2014 AlwaysOn Availability Group Cluster & Gratuitous ARP (GARP) Issue”

  1. Came across this while researching a similar issue in my environment. I am curious whether you have ever researched an alternate solution to this. In my environment we are seeing this with AGs on multiple SQL versions (2012/2014/2016) running on top of Windows Server 2012 R2 clusters. Even without ArpRetryCount explicitly set, failover is near instantaneous if GARP is enabled on the VLAN SVI. With it disabled, we see disconnects until the VLAN ARP entry goes stale and is updated (about 15 minutes).

    The reason I ask about an alternate solution is because in my environment gratuitous ARP is disabled by design because of DISA STIGs.

    Great article and explanation!

    • Thank you Joie.

      Since you have disabled GARP, you might try reducing the “MAC address age time” on the switch. This will not take effect instantaneously, but it will activate the new MAC address much sooner than the default MAC age (around 300 seconds or more, depending on the hardware platform). There are some downsides as well, such as higher system resource usage on the switch and a MAC table rebuild after every age expiry (which adds some broadcast traffic due to new MAC learning).

  2. Pingback: Microsoft Failover Cluster node not sending out Gratuitous ARP request after a failover | myitblog

  3. Why did Microsoft change the default behavior so that ARP is sent only if the registry key exists in Windows 2012 and later versions, when earlier versions such as Windows Server 2008 don’t require this registry key per the TechNet article? I would like to know whether there were any security issues with enabling ARP by default.

    • ARP is layer 2; its boundary is limited to a single VLAN and it is never advertised beyond its VLAN domain. From a security point of view, I don’t see it as a big issue, provided that your DB servers are in a separate VLAN from web/app/other servers. Apologies for the delayed response.
