I recently rebuilt my lab and added two new ESXi hosts, re-using my old single host in the process, which I upgraded from ESXi 5.5 to 6.0 and patched to the same level as the new hosts.
Everything was working as expected until it came time to enable HA.
My old host claimed the master role, and thus the other boxes had to connect to it as slaves; however, these failed with “HA Agent Unreachable” and “Operation Timed Out” errors.
After some host reboots, ping, nslookup and other standard connectivity tests with still no progress, I started blaming the ESXi 5.5 -> 6.0 upgrade – this was, as it turns out, unfounded.
Looking at /var/log/fdm.log on the master host, the following lines could be seen:
SSL Async Handshake Timeout : Read timeout after approximately 25000ms. Closing stream <SSL(<io_obj p:0x1f33f794, h:31, <TCP 'ip:8182'>, <TCP 'ip:47416'>>)>
Further along we could see that it knows the other hosts are alive:
[ClusterDatastore::UpdateSlaveHeartbeats] (NFS) host-50 @ host-50 is ALIVE
And further along again:
[AcceptorImpl::FinishSSLAccept] Error N7Vmacore16TimeoutExceptionE(Operation timed out) creating ssl stream or doing handshake
On the slave candidates, this could be seen:
[ClusterManagerImpl::AddBadIP] IP 1{master.ip.address.here} marked bad for reason Unreachable IP
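If you want to pull these lines out quickly, something along these lines from an SSH session on each host should surface them (/var/log/fdm.log is where the FDM agent logs on ESXi, though the exact messages may vary slightly by build):

# On the master - look for the SSL handshake timeouts against the slaves:
grep -i "Handshake Timeout" /var/log/fdm.log
grep -i "FinishSSLAccept" /var/log/fdm.log
# On the slave candidates - look for the master being marked unreachable:
grep -i "AddBadIP" /var/log/fdm.log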
After yet more troubleshooting and messing about with SSL cert regeneration, I stumbled upon this:
This issue occurs when Jumbo Frames is enabled on the host Management Network (VMkernel port used for host management) and a network misconfiguration prevents hosts from communicating using jumbo frames. It is supported to use jumbo frames on the Management Network as long as the MTU values and physical network are set correctly.
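A quick way to check whether jumbo frames actually make it end-to-end between hosts is a don't-fragment vmkping at jumbo size (8972 bytes of payload, which with the 28 bytes of ICMP/IP headers makes a 9000-byte packet). The address below is just a placeholder for another host's management VMkernel IP – if the large ping fails while the standard one succeeds, something in the path isn't passing jumbo frames:

# Standard-size ping with the don't-fragment bit set - should always succeed:
vmkping -d -s 1472 other.host.mgmt.ip
# Jumbo-size ping - fails if any hop in the management path isn't configured for MTU 9000:
vmkping -d -s 8972 other.host.mgmt.ip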
Checked the vmk0 MTU on my master host – sure enough, I had configured this as 9000 back in the day and completely forgotten about it. Bumped it back down to 1500 and the HA agents came up right away.
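For reference, the same check and fix can be done from the ESXi shell instead of the Web Client; a minimal sketch, assuming vmk0 is the management VMkernel port as it was in my case:

# List the VMkernel interfaces - the MTU column shows the current value:
esxcli network ip interface list
# Drop vmk0 back to the standard 1500-byte MTU:
esxcli network ip interface set -i vmk0 -m 1500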
Hopefully this saves you some time and you don’t have to go through what I did trying to solve this.
Why not follow @mylesagray on Twitter for more like this!
This is exactly what I ran into, set the MTU back to 1500 on the management network vmknic and HA enabled as expected. Thanks for posting this!
thank you, thank you, thank you, thank you, thank you – seriously. took ALL DAY to find this.
This post did the magic for me. I just changed the new host’s MTU to 9000 to be the same as the existing hosts in the cluster and bingo, everything worked fine. Thanks for this great post.
I had the same issue, resolved as below:
https://kb.vmware.com/s/article/2017233
Cause
This issue occurs due to a security feature on physical switches that blocks communication if the source and destination ports are identical. In the case of HA (FDM), some packets have the source and destination port set to 8182.
This feature is one of the Denial of Service Attack Protection methods. The name of the feature may differ from one switch vendor to another. For example, on Dell PowerConnect switches, it is called DOS-Control l4port. On HP switches, the feature is called Auto Denial-of-Service (DoS) protections.
Resolution
To work around this issue, contact your network switch vendor to help disable the Denial-of-Service protection feature.
For example:
On a Dell PowerConnect switch, run this command on the switch to disable the feature:
console(config)#no dos-control l4port
For more information, see Denial of Service Attack Protection in the Dell PowerConnect 6200 Series Configuration Guide.
On an HP ProCurve switch, navigate to Security > Advanced Security and deselect the Enable Auto DOS checkbox.
On an Extreme Networks switch (running ExtremeWare 7.7), run this command on the switch to disable the feature:
console# disable cpu-dos-protect
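If you suspect this switch-side variant rather than an MTU mismatch, one way to sanity-check it is to capture the FDM traffic (TCP port 8182) on the management VMkernel port of both hosts using tcpdump-uw, which ships with ESXi – this assumes vmk0 is your management interface. If the packets leave one host but never arrive on the other, the switch is likely dropping them:

# Watch for FDM traffic on the management interface (Ctrl-C to stop):
tcpdump-uw -i vmk0 -n port 8182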