vSphere HA Configuration fails: Operation Timed Out

22/07/2015 by Myles Gray

I recently rebuilt my lab and added two new ESXi hosts, re-using my old single host in the process; I upgraded it from ESXi 5.5 to 6.0 and patched it to the same level as the new hosts.

Everything was working as expected until it came time to enable HA.

My old host claimed the master role, so the other boxes had to connect to it as slaves; however, they failed with “HA Agent Unreachable” and “Operation Timed Out” errors.

After host reboots, ping, nslookup, and other standard connectivity tests yielded no progress, I started blaming the ESXi 5.5 -> 6.0 upgrade. This was, as it turns out, unfounded.

Looking at /var/log/fdm.log on the master host, the following lines could be seen:

SSL Async Handshake Timeout : Read timeout after approximately 25000ms. Closing stream <SSL(<io_obj p:0x1f33f794, h:31, <TCP 'ip:8182'>, <TCP 'ip:47416'>>)>
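
To watch for these as they happen, you can SSH to the host and follow the FDM log directly (the grep filter here is just my own shorthand for the messages shown in this post):

# Follow the HA (FDM) agent log, filtering for handshake and liveness messages:
tail -f /var/log/fdm.log | grep -iE 'ssl|timeout|alive'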

Further along, we could see that the master knows the other hosts are alive:

[ClusterDatastore::UpdateSlaveHeartbeats] (NFS) host-50 @ host-50 is ALIVE

And further along again:

[AcceptorImpl::FinishSSLAccept] Error N7Vmacore16TimeoutExceptionE(Operation timed out) creating ssl stream or doing handshake

On the slave candidates this could be seen:

[ClusterManagerImpl::AddBadIP] IP 1{master.ip.address.here} marked bad for reason Unreachable IP
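
Since FDM communicates on TCP 8182, a basic reachability check from a slave candidate to the master is worth doing first (nc ships with ESXi; substitute your master's address for the placeholder):

# Test a TCP connect from the slave candidate to the master's FDM port:
nc -z master.ip.address.here 8182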

After yet more troubleshooting and messing about with SSL cert regeneration, I stumbled upon this:

This issue occurs when jumbo frames are enabled on the host Management Network (the VMkernel port used for host management) and a network misconfiguration prevents hosts from communicating using jumbo frames. It is supported to use jumbo frames on the Management Network as long as the MTU values and physical network are set correctly.
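
A quick way to verify whether jumbo frames actually survive the path between two hosts is vmkping with the don't-fragment bit set, using an 8972-byte payload (8972 bytes plus 28 bytes of ICMP/IP headers = 9000; substitute a real management IP for the placeholder):

# From one host, ping another host's management vmk with a full-size,
# unfragmentable jumbo frame:
vmkping -I vmk0 -d -s 8972 other-host-mgmt-ip
# If this fails while a plain vmkping to the same address succeeds,
# something in the path is dropping jumbo frames.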

I checked the vmk0 MTU on my master host and, sure enough, I had configured it as 9000 back in the day and completely forgotten about it. I bumped it back down to 1500 and the HA agents came up right away:

[Screenshot: the HA agent running as master, shown in the vSphere client]
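
For reference, checking and resetting the vmk MTU from the ESXi shell looks like this (esxcli syntax as of 6.0; equally, you could standardise on 9000 instead, as long as every vmk and the physical network agree):

# Show each vmkernel interface along with its current MTU:
esxcli network ip interface list
# Drop the management vmk back down to the standard 1500-byte MTU:
esxcli network ip interface set --interface-name=vmk0 --mtu=1500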

Hopefully this saves you some time and you don’t have to go through what I did trying to solve this.

Why not follow @mylesagray on Twitter for more like this!



About Myles Gray

Hi! I'm Myles, and I'm a Dev Advocate at VMware. I focus primarily on content generation, product enablement, and feedback from customers and the field to engineering.

Comments

  1. Ryan F says

    24/03/2016 at 00:24

    This is exactly what I ran into, set the MTU back to 1500 on the management network vmknic and HA enabled as expected. Thanks for posting this!

  2. Jerry Kendall says

    08/08/2016 at 22:18

    Thank you, thank you, thank you, thank you, thank you – seriously. Took ALL DAY to find this.

  3. chima says

    24/05/2019 at 08:35

    This post did the magic for me. I just changed the new host’s MTU to 9000 to match the existing hosts in the cluster and bingo, everything worked fine. Thanks for this great post.

  4. Nuwan says

    20/07/2019 at 07:55

    I had the same issue; it was resolved as below:

    https://kb.vmware.com/s/article/2017233

    Cause
    This issue occurs due to a security feature on physical switches that blocks communication if the source and destination ports are identical. In the case of HA (FDM), some packets have both the source and destination port set to 8182.
    This feature is one of the Denial of Service Attack Protection methods. The name of the feature may differ from one switch vendor to another. For example, on Dell PowerConnect switches, it is called DOS-Control l4port. On HP switches, the feature is called Auto Denial-of-Service (DoS) protection.
    Resolution
    To work around this issue, contact your network switch vendor to help disable the Denial-of-Service protection feature.

    For example:

    On a Dell PowerConnect switch, run this command to disable the feature:

    console(config)#no dos-control l4port

    For more information, see Denial of Service Attack Protection in the Dell PowerConnect 6200 Series Configuration Guide.

    On an HP ProCurve switch, navigate to Security > Advanced Security and deselect the Enable Auto DOS checkbox.

    On an Extreme Networks switch (running ExtremeWare 7.7), run this command on the switch to disable the feature:

    console# disable cpu-dos-protect

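If you suspect the switch-side DoS protection described in the comment above, one quick sanity check is whether FDM is actually holding connections on port 8182 on the host (a sketch using stock esxcli):

# List the host's active network connections and filter for the FDM port:
esxcli network ip connection list | grep 8182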


