At work, we have various ways of running machines. One is cheap local virtual machines, controlled by a set of Python scripts. The setup has been running well for almost two years. Recently, we started spinning up new Ubuntu 14.04-based machines, and somehow this has started to degrade performance rather noticeably: Munin warning levels for IO write times are triggered on several virtual machines, and sometimes just getting the number of open files takes longer than Munin waits for the answer, so it reports the value as UNKNOWN. Initial googling helped very little, but I took some time today to really try to understand what could be wrong (SMART didn’t report any errors on any of the disks in the RAID 1 serving the troubled machines). It looks like we’re bitten by the fact that VirtualBox does unaligned direct (or asynchronous) IO writes to the underlying ext4 filesystem, causing the following to be displayed in the logs:
kern.log:Jul 6 19:30:52 server02 kernel: [18356737.680513] EXT4-fs (md127): Unaligned AIO/DIO on inode 23855109 by VBoxHeadless; performance will be poor.
This message is only printed about once per day, so it is hard to tie it to the specific moments when we get the very slow IO write times (Munin reports 5+ seconds).
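Since the warning is rare and easy to miss, it can help to pull every occurrence out of the kernel log along with the device, inode and process it names. The following is a minimal sketch, not a polished tool: the regular expression is modelled only on the single log line quoted above, and the function name `find_unaligned_warnings` is made up for this example. Other kernel versions may word the message differently.

```python
import re

# Pattern derived from the one log line quoted above (an assumption:
# other kernels may format this warning differently).
UNALIGNED_RE = re.compile(
    r"EXT4-fs \((?P<device>[^)]+)\): Unaligned AIO/DIO on inode "
    r"(?P<inode>\d+) by (?P<process>[^;]+);"
)


def find_unaligned_warnings(lines):
    """Return one dict per log line that contains the ext4 warning."""
    hits = []
    for line in lines:
        match = UNALIGNED_RE.search(line)
        if match:
            hits.append({
                "device": match.group("device"),
                "inode": int(match.group("inode")),
                "process": match.group("process"),
                "line": line.rstrip("\n"),
            })
    return hits
```

Any iterable of strings works as input, so an open file handle on `kern.log` can be passed in directly. The reported inode can then be mapped back to a file, for example with `find <mountpoint> -inum <inode>`, to see which disk image is affected.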
“Use host I/O cache”
EDIT: It turned out the problem was that the VM didn’t have “Use host I/O cache” enabled. All good now.
While this seems to have fixed the issue for us, the IO caching section of the VirtualBox manual recommends against this usage:
While buffering is a useful default setting for virtualizing a few machines on a desktop computer, there are some disadvantages to this approach: […] this may slow down the host immensely, especially if several VMs run at the same time. For example, on Linux hosts, host caching may result in Linux delaying all writes until the host cache is nearly full and then writing out all these changes at once, possibly stalling VM execution for minutes.
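For VMs that are managed by scripts rather than through the GUI, the same setting can be toggled per storage controller with `VBoxManage storagectl ... --hostiocache on|off`. Below is a rough sketch of how that could be wrapped in Python. The helper names, the VM name `vm01` and the controller name `SATA Controller` are placeholders for illustration, not values from our setup.

```python
import subprocess


def hostiocache_command(vm_name, controller_name, enabled=True):
    """Build the VBoxManage argument list that toggles the host I/O cache."""
    return [
        "VBoxManage", "storagectl", vm_name,
        "--name", controller_name,
        "--hostiocache", "on" if enabled else "off",
    ]


def set_hostiocache(vm_name, controller_name, enabled=True):
    """Run the command. Assumption: the VM most likely has to be powered
    off for the change to be accepted."""
    subprocess.run(
        hostiocache_command(vm_name, controller_name, enabled),
        check=True,  # raise CalledProcessError if VBoxManage fails
    )


# Hypothetical names, shown without executing anything:
print(" ".join(hostiocache_command("vm01", "SATA Controller")))
```

Keeping the command construction separate from the call that runs it makes it easy to log or dry-run the change before applying it to a whole set of machines.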
Rethink partitioning layout and block sizes
There’s an IBM developerWorks article dated 04/2010 that includes some benchmarks and impact scenarios using different filesystems:
The essence, as I understood it:
- If you’re seeing this error on Advanced Format drives (newer, bigger HDs that use 4 KiB physical sectors by default), the speed penalty is substantial. You should rethink your partitioning layout and probably move away from layouts that assume 512-byte sectors in Linux.
- If, like me, you’re on a hardware RAID 5 of bigger, newer disks and seeing this error, the penalty for ext4 filesystems on top is somewhere between 5% and 30%.
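Whether a partition is aligned for a 4 KiB-sector disk comes down to simple arithmetic: its starting byte offset has to be a multiple of 4096. A small sketch of that check follows. The function names are my own, and `read_start_sector` assumes the usual Linux sysfs layout, where `/sys/block/<disk>/<partition>/start` holds the start position in 512-byte units.

```python
def is_aligned(start_sector, sector_bytes=512, physical_block_bytes=4096):
    """True if the partition's byte offset is a multiple of the physical block size."""
    return (start_sector * sector_bytes) % physical_block_bytes == 0


def read_start_sector(disk, partition):
    """Read a partition's start sector from sysfs (Linux only).

    Assumes the path /sys/block/<disk>/<partition>/start exists,
    e.g. disk="sda", partition="sda1".
    """
    with open(f"/sys/block/{disk}/{partition}/start") as handle:
        return int(handle.read().strip())


# Sector 63 was the traditional first-partition start on older layouts;
# sector 2048 (1 MiB) is what current partitioning tools typically use.
print(is_aligned(63))    # 63 * 512 = 32256, not a multiple of 4096
print(is_aligned(2048))  # 2048 * 512 = 1048576, a multiple of 4096
```

This only checks the partition start. The filesystem block size and, in our case, the offsets inside the VDI image also have to line up for writes to be fully aligned.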
Realign VDI images
Move away from Ext4
- Linux on 4KB-sector disks: Practical advice
- Windows 7 reported to trigger this
- Ask Ubuntu: Unaligned AIO/DIO
- Eric Sandeen’s patch, with comment on why and workaround to Linux Ext4 mailing list
- Guest system stuck in IOWAIT for tens of seconds periodically
- Guest Linux hangs on disk I/O every 30-60 min