Thursday, October 1, 2015

ramdisk 'var' is full and vMotion Fails

I recently had a couple of odd events in a VM cluster.  I started looking at one particular host and found some odd entries in the Events tab in the vSphere Client.  Most of them were:

  • The ramdisk 'var' is full.  As a result, the file /var/run/vmware/tickets/vmtck-<id> could not be written.

I couldn't find this particular error in the VMware KBs, but I found my problem described in an article at Techazine.com.

Here's an excerpt from that post.



Fixing ESXi when RAMdisks run out of disk space
I ran into an interesting problem a few weeks back with one of my ESXi hosts.  While trying to do some normal things – like vMotion – I noticed an error recorded for the tasks – nothing that seemed to point to a lot of detail – just “A general system error occurred.”  On further investigation, I found that the underlying message was an out of disk space message while trying to proceed with a Storage vMotion.

Observed errors
  • Attempting vMotion – “A general system error occurred:”
  • Attempting Storage vMotion – “/var/log/vmware/journal/xxxx error writing file. There is no space left on the device.”


Troubleshooting steps


  • Go to Configuration tab on host in vCenter client, go to Security Profile, click Properties link on the Services section.
  • Scroll down to SSH and highlight – click options – click start to start SSH service.
  • Use PuTTY or Reflection to SSH to the host.
  • If the connection is rejected, the root filesystem ramdisk is probably full.
  • Go to console (either through KVM or OA for blades)
  • F2 to login, login, arrow down to Troubleshooting Options, select Enable ESXi Shell.
  • Press ALT-F1 to change to management shell and login (same root credentials).
  • Run ‘vdf -h’ and look for root filesystem – should look like:
    Ramdisk                   Size      Used Available Use% Mounted on
    root                       32M        3M       28M  10% --
  • If it is 0M available and 100% used, that’s the problem.  Try to clear up space:
    • cd /var/log/
    • ls -la
  • Check size of the hpHelper.log file – likely pretty large.  Reset the file, if large.
    • > hpHelper.log
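
The 'vdf -h' check in the steps above can also be scripted.  What follows is only a minimal sketch and is not part of the original post: full_ramdisks is a hypothetical helper name, the 90% default threshold is an arbitrary choice, and it assumes awk is available in the shell you are using and that the ramdisk table has the column layout shown above.

```shell
# Hypothetical helper (not from the original post): print every ramdisk
# whose Use% is at or above a threshold (default 90, an arbitrary choice).
# Reads 'vdf -h' style output on stdin, in the layout shown in the excerpt.
full_ramdisks() {
    threshold="${1:-90}"
    awk -v t="$threshold" '
        $1 == "Ramdisk" { in_table = 1; next }   # header row starts the ramdisk table
        in_table && NF >= 5 {
            pct = $5
            sub(/%/, "", pct)                    # "100%" -> "100"
            if (pct + 0 >= t) print $1, $5       # e.g. "root 100%"
        }
    '
}
```

On a host it would be used as 'vdf -h | full_ramdisks 90'.  Anything it prints is a ramdisk worth looking at before it fills up completely.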

Host is back online and working, and it looks like the most likely culprit was the HP agents inside of the custom ESXi image provided by HP.  It seems that in some circumstances the hpHelper.log file can become very large, filling the RAMdisk and causing the issues.  It's a first for me, and I have not observed the issues on any of my other ESXi hosts running on ProLiant rack-mount or blade servers.
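
If it is not obvious which log has grown, the manual 'ls -la' check can be generalized.  This is a rough sketch under assumptions, not something from the quoted article: largest_files is a hypothetical helper name, the defaults (/var/log, five entries) are arbitrary, and it assumes find, du, sort and head are all present in the shell you are using.

```shell
# Hypothetical helper (not from the original post): list the N largest
# regular files under a directory, biggest first, with sizes in KiB.
largest_files() {
    dir="${1:-/var/log}"     # directory to scan (assumed default)
    count="${2:-5}"          # how many entries to show (assumed default)
    find "$dir" -type f -exec du -k {} \; 2>/dev/null | sort -rn | head -n "$count"
}
```

Running 'largest_files /var/log 5' prints one "size path" line per file.  A file that dominates the list, as hpHelper.log did here, is the candidate for the '> file' reset described in the steps above.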
