Extending SCSI Timeouts in KVM Guests

Hi folks, today’s post covers something that came up at work in the last few days. Someone was concerned about a lengthy delay affecting the health of his VM’s virtual disk, and therefore the VM itself. We’ve all seen the aftermath – either the VM falls into a “paused” state, or the disk goes read-only. So, how do you extend the timeout value in KVM for SCSI disks? The first thing to verify is that you have a reliable network connection. No timeout value will make up for a poorly deployed network, or even a misconfigured switch.

Ok, so you’ve verified that the network is about as “ok” as it’s going to get. The file that we want to edit is /sys/block/$DISK/device/timeout, and typically we would simply “echo” a new timeout value to that file, for each disk.
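
In its simplest form, that looks like this (sda as a placeholder, and 180 jumping ahead to the value we’ll settle on below):

# echo 180 > /sys/block/sda/device/timeout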

Let us first see what the default is; I’m not a fan of blindly editing values such as timeouts.

# cat /sys/block/sda/device/timeout
30

Ok, so we have a default timeout of 30 seconds. Normally I would tell you to move up gradually – say, 10 seconds at a time. However, most vendors (including my employer, NetApp) recommend 180 seconds, so we’ll just jump to that instead of titrating. For most techies, that would lead you to just echo the new timeout, put together some clever script, or even tempt you to put that clever script into /etc/rc.d/rc.local.
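
If you were tempted down that path, the clever script might look something like this (just a sketch, assuming all of your virtual disks show up as sd*):

# Old-school approach: a snippet for /etc/rc.d/rc.local that bumps the
# SCSI timeout to 180 seconds on every sd* disk at boot
for disk in /sys/block/sd*; do
    echo 180 > "$disk/device/timeout"
done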

That’s old school. If you know the purpose of rc.local, I applaud you and you get a gold star. Then I’ll tell you how much cooler you would be if you went to the new school in this case – that would be “udev”. Here’s a great article on udev. From 10 years ago. So does that qualify as “new old school”? Or is it simply that rc.local is that outdated?

Regardless, we need to create a file in /etc/udev/rules.d. You’ll notice that each of the existing rules files starts with a number, then an arbitrary but meaningful name, with a .rules extension. We’re going to create a file called 96-scsi-timeout.rules and enter a rule that will take good care of us from a timeout standpoint. The rules files are read in numerical order when udev processes devices at boot.
Here it is:
# cat 96-scsi-timeout.rules 
ACTION=="add", SUBSYSTEM=="block", ENV{ID_MODEL}=="QEMU_HARDDISK", \ 
RUN+="/bin/sh -c 'echo 180 >/sys$DEVPATH/device/timeout'"

The line that starts with “ACTION” needs to be written all on one line, without the “\”. I split it here so it’s easier to read, rather than letting WordPress do something funky with the long string…
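
Once the rule is in place, you don’t need to reboot to test it; you can ask udev to reload its rules and replay the “add” events, then check the result (sda as a placeholder again):

# udevadm control --reload-rules
# udevadm trigger --subsystem-match=block --action=add
# cat /sys/block/sda/device/timeout
180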

Now, I will also tell you that this will only work on Red Hat, CentOS, and Fedora systems running as KVM guests on Red Hat, CentOS, and Fedora. I don’t know what the equivalent string for Ubuntu or SuSE would be, or what it would be for a Red Hat system on VMware or Xen. Although a quick search in /dev/Google would likely turn up some answers.
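
If you do go hunting for the right match string on another platform, udev itself will tell you what it sees for a given disk (sda as a placeholder; the output shown is what a KVM guest typically reports):

# udevadm info --query=property --name=/dev/sda | grep ID_MODEL
ID_MODEL=QEMU_HARDDISK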

Hope this helps,
Captain KVM

10 thoughts on “Extending SCSI Timeouts in KVM Guests”

  1. This is a great tip!!! Just to know, which NetApp document describes the recommended timeout configuration?

    I have the same problem with some VMs in my datacenter, but in my case, the main hypervisor is VMware and the lab hypervisor is KVM.

    1. Hi Gabriel,

      Sorry for the delayed response. If you Google for VMware SCSI timeouts, you’ll find examples. You should actually find some links to VMware’s knowledge base, and I would trust that over others.

      hope this helps,

      Captain KVM

      1. FYI, ‘VMware Tools’ installs a udev rule in the /etc/udev/rules.d/99-vmware-scsi-udev.rules file. Here’s the content:

        #
        # VMware SCSI devices Timeout adjustment
        #
        # Modify the timeout value for VMware SCSI devices so that
        # in the event of a failover, we don’t time out.
        # See Bug 271286 for more information.

        ACTION=="add", SUBSYSTEMS=="scsi", ATTRS{vendor}=="VMware ", ATTRS{model}=="Virtual disk ", RUN+="/bin/sh -c 'echo 180 >/sys$DEVPATH/timeout'"

  2. Given how common this change is, and how many vendors require it, are there good reasons why 30 seconds remains the default for devices, even on Enterprise Linux kernels?

    1. Eugene – it’s a fair question. It’s a good default for a “local, native” system, but not for a VM. I would think that starting “libvirtd” should trigger a reconfiguration of the timeout, since that clearly marks the system as a hypervisor. There should be a similar trigger for guests, even if it’s just a server build type.

      Captain KVM

  3. The kernel parameter mentioned earlier,

    /sys/block/sda/device/timeout

    is not exposed to the guest when using VirtIO, only iSCSI and IDE. Is there any other way to achieve the same when using VirtIO?

    Also, I’m trying to understand how this will impact the data on the VM. In theory, all I/O paths are on hold while the failover happens and resume when the path is available again?

    Could this be applied at the hypervisor layer if it’s not available on the VM?

    Sorry lots of questions :D, I’m trying to understand what options we have available.

    1. Jesus,

      Thanks for reaching out, and don’t apologize for questions – I love questions. If I don’t have the answer, then it forces me to learn the answer, and we both benefit. You are correct that a virtio block device on a VM doesn’t expose the timeout value. There really isn’t an equivalent for virtio, since it doesn’t emulate a SCSI device – it’s meant to be a leaner, optimized path.

      If the paths return before the timeout value expires, then everything resumes. If the paths don’t return before the timeout expires, then the default action for the VM is to go to a “read only” state in order to maintain data integrity. The VM may also go into a paused state.

      However, if you need to make an adjustment because of network hiccups affecting virtio-backed storage, look at the mount options for your NFS or iSCSI.
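
      As a sketch of what that might look like (the server name and export path below are hypothetical): on the NFS side the retry behavior lives in the mount options, and on the iSCSI side it lives in /etc/iscsi/iscsid.conf.

        # Hypothetical /etc/fstab entry on the hypervisor -- "hard" retries
        # indefinitely, and timeo is in tenths of a second (600 = 60s)
        nfsserver:/export/vm_images /var/lib/libvirt/images nfs hard,timeo=600,retrans=5 0 0

        # /etc/iscsi/iscsid.conf -- seconds to wait for a failed session to
        # recover before failing I/O back up the stack
        node.session.timeo.replacement_timeout = 180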

      Captain KVM

      1. Thank you, I really appreciate your comment. It helped me confirm that this can be done at the iSCSI layer.

Agree? Disagree? Something to add to the conversation?