VirtualBox With Poor IO Performance

At work, we have various ways of running machines. One is local cheap virtual machines, controlled by a set of Python scripts. It’s been running well for almost two years. Recently, we started spinning up new Ubuntu 14.04 based machines, and somehow this has started to degrade performance, rather noticeably: Munin warning levels for IO write times are triggered on several virtual machines, and sometimes just getting the number of open files takes longer than Munin waits for the answer, so it reports the value as UNKNOWN.

Initial googling helped very little, but I took some time today to really try to understand what could be wrong (SMART didn’t report any errors on any of the disks in the RAID 1 serving the troubled machines). It looks like we’re bitten by the fact that VirtualBox does unaligned Direct (or Asynchronous) IO writes to the underlying ext4 filesystem, causing the following to show up in the logs:

kern.log:Jul  6 19:30:52 server02 kernel: [18356737.680513] EXT4-fs (md127): Unaligned AIO/DIO on inode 23855109 by VBoxHeadless; performance will be poor.

This message is only printed roughly once per day, which makes it hard to correlate with the situations where we see the very slow IO write times (Munin reports 5+ seconds).
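
To see how often the warning actually fires, you can grep the kernel logs directly. A quick sketch, assuming a Debian/Ubuntu-style /var/log/kern.log (adjust the path for your distribution):

```shell
# Count unaligned AIO/DIO warnings in the kernel log.
grep -c 'Unaligned AIO/DIO' /var/log/kern.log 2>/dev/null || true
# dmesg works too, if the message is still in the kernel ring buffer:
dmesg 2>/dev/null | grep -c 'Unaligned AIO/DIO' || true
```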

Possible Solutions

“Use host I/O cache”

EDIT: It turned out the problem was that the VM didn’t have “Use host I/O cache” enabled. All good now.
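
For the record, the setting can also be toggled from the command line with VBoxManage. A sketch, where both the VM name and the controller name (“SATA”) are placeholders for your own setup:

```shell
# VM and controller names are placeholders; list your storage
# controllers with: VBoxManage showvminfo "$VM"
VM="ubuntu-1404-worker"
# Skip quietly on machines without VirtualBox installed.
command -v VBoxManage >/dev/null 2>&1 || exit 0
# The VM must be powered off before changing controller settings.
VBoxManage storagectl "$VM" --name "SATA" --hostiocache on
```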

While this seems to have fixed the issue for us, the IO caching section of the VirtualBox manual recommends against this setting:

While buffering is a useful default setting for virtualizing a few machines on a desktop computer, there are some disadvantages to this approach: […] this may slow down the host immensely, especially if several VMs run at the same time. For example, on Linux hosts, host caching may result in Linux delaying all writes until the host cache is nearly full and then writing out all these changes at once, possibly stalling VM execution for minutes.

Rethink partitioning layout and block sizes

There’s an IBM developerWorks article from April 2010 that includes some benchmarks and impact scenarios for different filesystems.

The essence, as I understood it:

  • If you’re seeing this error on Advanced Format drives (newer, bigger HDs with a 4k physical sector size), the speed penalty is substantial. You should rethink your partitioning layout and probably move away from the 512-byte block size in Linux.
  • If you’re (like me) on a hardware RAID 5 of bigger, newer disks and seeing this error, the penalty for ext4 filesystems on top is somewhere between 5 and 30%.
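
To tell whether a drive is Advanced Format, compare the logical and physical sector sizes the kernel reports; a logical size of 512 with a physical size of 4096 means a 512e Advanced Format drive. A sketch reading the sysfs entries:

```shell
# Print logical vs physical sector size for every block device the
# kernel knows about; 512/4096 indicates a 512e Advanced Format drive.
for dev in /sys/block/*/queue; do
    [ -r "$dev/logical_block_size" ] || continue
    printf '%s: logical %s, physical %s\n' \
        "$(basename "$(dirname "$dev")")" \
        "$(cat "$dev/logical_block_size")" \
        "$(cat "$dev/physical_block_size")"
done
```

parted can then verify that each partition starts on an aligned boundary, e.g. `parted /dev/sda align-check opt 1` (device name is a placeholder).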

Realign VDI images

Move away from Ext4

Learned along the way

iostat, fuser and LatencyTOP
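
For reference, the invocations I found useful (iostat ships with the sysstat package on Debian/Ubuntu; the VDI path below is a placeholder):

```shell
# Extended per-device statistics, one-second interval, three samples;
# the await column (w_await on newer sysstat) shows how long writes
# sit queued before completion.
command -v iostat >/dev/null 2>&1 && iostat -x 1 3 || true
# Which processes have a given disk image open (path is a placeholder):
# fuser -v /path/to/machine.vdi
```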

This work by Fredrik Wendt is licensed under CC BY-SA.