Revisiting File System Alignment for Linux

If this is a new blog, how is Captain KVM “revisiting” the topic?

In January of this year, I posted an entry for File System Alignment for Linux VMs on my old blog; I’d like to revisit the topic.  So, why would anyone want to worry about something like “file system alignment” in the first place?  Simple.

Performance.  (You do like performance, right?)

Allow me to illuminate… Running native Linux on local disk presents no issue, but running a virtualized instance of Linux starts to present challenges.  As multiple abstraction layers are added between physical disks and virtual disks, there is plenty of room to have partition boundaries and sector boundaries get out of alignment.  Look at the layers below:

VM filesystem
VM LVM
VM disk
Storage File System (NFS)
Storage Volume
Storage Physical disk

We could actually expand that out even further, but this will suffice.  What needs to happen is that the partition boundaries at the top layer need to line up with the blocks on the bottom layer.  It’s actually quite complicated, but part of the reason for adding abstraction is to hide the complexities.

NOTE: Newer operating systems such as RHEL 6, Windows Server 2008 R2, and Windows 7 align properly by default.

So let’s look at this issue using a real-life example – a RHEL 5.x VM (KVM, of course!), on a RHEL 5.x host, backed by NetApp storage.  Left to the default configuration of ‘fdisk’, ‘sfdisk’, or ‘parted’, the /boot partition would start at sector 63 and each subsequent partition would start at the next available sector, ensuring that the effort you put into 10GbE & Jumbo Frames would be largely wasted.

Why is this such a problem?  Imagine that you have a 4k piece of data that gets written to disk.  Normally, this is a perfect fit for NetApp blocks (also 4k in size) – but a misaligned file system would cause that piece of data to be written across 2 data blocks.  Do you see the problem yet?  Essentially, this will result in 2 reads or 2 writes for every 1 I/O request.  Multiply that by your favorite 7-digit number and imagine all of the extra reads and writes that are occurring because of misalignment.  Performance penalties of 40% are not unheard of.
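To make that arithmetic concrete, here is a minimal shell sketch.  The helper name `blocks_touched` is mine, and it assumes 512-byte sectors and 4k storage blocks as described above.  It counts how many storage blocks a single 4k I/O lands on, given the sector where it starts:

```shell
# Assumptions: 512-byte sectors and 4096-byte storage blocks.
SECTOR_BYTES=512
BLOCK_BYTES=4096

# blocks_touched START_SECTOR [IO_BYTES]
# Hypothetical helper: prints how many storage blocks one I/O of
# IO_BYTES (default 4096) touches when it begins at START_SECTOR.
blocks_touched() {
    local start_sector=$1
    local io_bytes=${2:-4096}
    local first_byte=$(( start_sector * SECTOR_BYTES ))
    local last_byte=$(( first_byte + io_bytes - 1 ))
    echo $(( last_byte / BLOCK_BYTES - first_byte / BLOCK_BYTES + 1 ))
}

blocks_touched 63   # default start sector: prints 2
blocks_touched 64   # aligned start sector: prints 1
```

Starting at sector 63, every 4k request straddles a block boundary; starting at sector 64, it fits in exactly one block.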

Now let’s prevent this issue from even rearing its ugly head.  If you are using Linux and KVM, you likely have some kind of provisioning process such as Kickstart.  This is arguably the easiest way to prevent any issues that stem from misaligned file systems.  The standard Kickstart file has a section for options, packages, an optional %post, and an optional %pre.  In the exercise below, we will concern ourselves with the “disk layout” part of the options section and we will implement the optional %pre section.  NOTE: The %pre section must come after the options and packages sections; putting it at the end of the Kickstart file is a safe choice.

Let’s look at the %pre section and provide some explanation:

%pre
 parted /dev/sda mklabel msdos
 parted /dev/sda mkpart primary ext3 64s 208718s
 parted /dev/sda mkpart primary 208720s 100%
 parted /dev/sda set 2 lvm on

We’ve defined the section by starting with the “%pre” directive, then proceeded with 4 “parted” commands: create a disk label, create a partition starting on sector 64 that is roughly 100MB in size, create another partition that takes up the rest of the disk, and prep the second partition for use with LVM.
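If you want to sanity-check the “roughly 100MB” figure, a quick sketch like the one below does the math.  The function name `sectors_to_mib` is my own, and it assumes 512-byte sectors and that the end sector given to “parted” is inclusive:

```shell
# sectors_to_mib START END
# Hypothetical helper: size in whole MiB of a partition that spans
# sectors START..END inclusive, assuming 512-byte sectors.
sectors_to_mib() {
    local start=$1
    local end=$2
    echo $(( (end - start + 1) * 512 / 1048576 ))
}

sectors_to_mib 64 208718   # the /boot partition above: prints 101
```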

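So why sector 64, and why sector 208720?  The short answer is that they are “alignment friendly” sectors.  A 4k block spans 8 sectors of 512 bytes each, so any starting sector that is cleanly divisible by 8 lands on a block boundary.  “64” is cleanly divisible by 8 – so, 128 and 2048 would work as well – and 208720 is the first such sector after the end of the first partition.  Don’t worry about the unused sectors in between the partitions; reclaiming them is not worth the performance penalty.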
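The “divisible by 8” rule is easy to script.  This is only a sketch with helper names I made up (`is_aligned`, `next_aligned`), and it assumes 512-byte sectors on top of 4k blocks:

```shell
# Assumption: 8 sectors of 512 bytes make up one 4k storage block.
SECTORS_PER_BLOCK=8

# is_aligned SECTOR: succeeds when SECTOR sits on a 4k boundary.
is_aligned() {
    [ $(( $1 % SECTORS_PER_BLOCK )) -eq 0 ]
}

# next_aligned SECTOR: prints SECTOR itself if it is aligned,
# otherwise the next sector that is.
next_aligned() {
    echo $(( ($1 + SECTORS_PER_BLOCK - 1) / SECTORS_PER_BLOCK * SECTORS_PER_BLOCK ))
}

for s in 63 64 208719 208720; do
    if is_aligned "$s"; then
        echo "$s is aligned"
    else
        echo "$s is misaligned, next aligned sector is $(next_aligned "$s")"
    fi
done
```

For the sectors used in this post, it reports 63 and 208719 as misaligned and points to 64 and 208720 as the next aligned sectors.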

Remember though, that is only the first of 2 pieces to configure in the Kickstart file.  The second section is the “disk layout”.  Let’s look at a layout example:

zerombr yes
##clearpart --linux --drives=sda  ## comment out or remove
part /boot --fstype ext3 --onpart sda1
part pv.2 --onpart sda2
volgroup VolGroup00 --pesize=32768 pv.2
logvol swap --fstype swap --name=LogVol01 --vgname=VolGroup00 --size=1008 --grow --maxsize=2016
logvol / --fstype ext3 --name=LogVol00 --vgname=VolGroup00 --size=1024 --grow

The first thing to note is that the “clearpart” directive is commented out.  (You could actually remove it altogether.)  Leaving it in would undo the magic in the “%pre” section.  The next thing to note is that we are using the “--onpart” option to the “part” directive.  (The “sda” can be changed to “vda” if using the VirtIO driver.)  This is just telling Kickstart to use the properly aligned partitions.  Everything else is the same.

The magic is illustrated and alignment is optimal!
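If you would rather verify than trust the magic, the start sectors reported by `fdisk -lu` can be checked after the install.  The sketch below is my own helper, not part of Kickstart or fdisk; it assumes the classic `fdisk -lu` table layout with 512-byte sectors, where a “*” in the Boot column shifts the start sector over by one field:

```shell
# check_fdisk_alignment
# Hypothetical helper: reads `fdisk -lu` output on stdin and prints
# one line per partition saying whether its start sector is
# divisible by 8 (assumes 512-byte sectors and 4k storage blocks).
check_fdisk_alignment() {
    awk '
        /^\/dev\// {
            # The boot flag "*" moves the start sector from field 2 to field 3.
            start = ($2 == "*") ? $3 : $2
            status = (start % 8 == 0) ? "aligned" : "MISALIGNED"
            printf "%s start=%s %s\n", $1, start, status
        }
    '
}

# Typical use on a live system (needs root):
#   fdisk -lu /dev/sda | check_fdisk_alignment
```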

But what about the extra layers of abstraction you say?

It’s all good.  Provided the partition boundaries are set to an “alignment friendly” sector, Linux LVM is itself alignment friendly.  And the layers introduced by NetApp are also alignment friendly by default.  There is one caveat in dealing with LUNs, though.  It’s not an exception, mind you, just something to be aware of.  Let’s say you have a NetApp LUN presented to a RHEL 5.x server and you want to layer it with LVM.  Run your `pvcreate` command against the entire LUN without running `fdisk` or `parted` first.  Again, LVM itself is alignment friendly.  And unless the LUN is to be carved up into multiple partitions, there is no point in using “parted” or “fdisk”.
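As a rough illustration of the whole-LUN approach, the sketch below prints the LVM commands instead of running them, so it is safe to try anywhere.  The device path and volume group name are placeholders I picked, not values from a real system:

```shell
# lvm_on_whole_lun DEVICE VGNAME
# Hypothetical dry-run helper: prints the commands that put LVM
# directly on an unpartitioned LUN. Remove the "echo" words to run
# them for real (as root, against the correct device).
lvm_on_whole_lun() {
    local lun=$1
    local vg=$2
    echo pvcreate "$lun"        # whole device, no partition table
    echo vgcreate "$vg" "$lun"
}

# /dev/sdb and vg_data are placeholder names.
lvm_on_whole_lun /dev/sdb vg_data
```

Because no partition table is written, there is no partition offset that could push the data off a 4k boundary.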

But what about the reason for revisiting the topic you say?

Two reasons:

  • I can never overstate the importance of proper alignment
  • I have a new scenario

What if I’m actually SAN booting my Linux box with multi-pathing enabled?  The server itself is not virtualized, but the storage is.  How do I take that into consideration?

Simple.

Let’s say you’re using RHEL 5.x, CentOS 5.x, or Fedora (of the same time frame).  The “%pre” section remains the same.  The disk layout has a slightly different look:

zerombr yes
###clearpart --all --drives=sda
part /boot --fstype ext3 --size=100 --onpart=mapper/mpath0p1
part pv.2 --size=0 --grow --onpart=mapper/mpath0p2
volgroup VolGroup00 --pesize=32768 pv.2
logvol swap --fstype swap --name=LogVol01 --vgname=VolGroup00 --size=1008 --grow --maxsize=2016
logvol / --fstype ext3 --name=LogVol00 --vgname=VolGroup00 --size=1024 --grow

The “--onpart” options to the “part” directives now point to “mapper/mpath0p1” and “mapper/mpath0p2”.  Why didn’t we change the “%pre” section?  Easy.  Multipathing isn’t configured that early in the process, so “%pre” stays the same.  All we need to account for are the mpath partitions in the disk layout.

The magic is illustrated and alignment is optimal!

thanks for reading,

ck

For more information on File System Alignment, please review NetApp TR-3747 http://www.netapp.com/us/library/technical-reports/tr-3747.html

For more information on Kickstart please review the “deployment guides” as hosted on http://redhat.com or http://centos.com.

13 thoughts on “Revisiting File System Alignment for Linux”

  1. Hi,

    you mention “So why sector 64, and why sector 208896” but I cannot find the number 208896 anywhere?! 208896 / 4096 = 51, which makes sense to me, but it looks like the numbers you use with parted are not divisible by 4096?!

    Any idea?

    Regards,
    engel

    1. Hi Engel,

      First off, thanks for taking the time to post a comment & question.

      Sector 64 is cleanly divisible by “8”. It’s the next “alignment friendly” sector after the default of 63. The first partition is typically used for /boot, and as such doesn’t typically need to be larger than 100MB (although sometimes we see 256MB). If we stick with 100MB for /boot, then the next available sector for the next partition is 208719, which is NOT cleanly divisible by 8. If we bump up the sector to 208848, then we’re properly aligned. Sure, we lose the use of the sectors that we’ve skipped, but I will take the loss of a few sectors if it means I don’t lose I/O performance.

      I hope this answers your question.

      c.k.

      1. You have missed what the original poster was pointing out. Your pre script contains an error:
        parted /dev/sda mkpart primary 208720s 100%
        You then discuss using sector 208896, but that is not the number you used in the pre script. 208896 is evenly divisible by 64, but 208720 is not.

        1. Hi Brian,

          Thanks for taking the time to point out something that needs clarification.

          OK, here is what happened. The “208896” number was a typo on my part in the explanation paragraph. I’m not sure where that number came from, to be honest. The 208720 number is the first “alignment friendly” sector after 208718. 208720 is cleanly divisible by 8 – 64 is the first alignment friendly sector, not our divisor.

          I’ve edited my original article and corrected the explanation paragraph.

          I apologize for the confusion,

          Captain KVM

  2. Have you attempted to do this just using part? I was having an issue where parted was hanging my kickstart because it was prompting for a Yes/No, it was doing this at the mklabel step.

    I asked about this on #cobbler and they suggested that I stop using parted and just use part, it’s the recommended way going forward apparently.

    I’ve got all the partitioning working using parted but I’m not getting aligned partitions. I even tried using --start=64 with part, still misaligned partitions. Would you happen to know anything about this?

    Dan

    1. Hey Dan,

      Thanks for stopping by. What distro and version of Linux are you using? I’ve not had an issue with parted going interactive. And I’ll be honest, I’ve not heard of using “part” instead of “parted”. I don’t have it on my Fedora 16 server or any of my RHEL 6 servers.

      In any case, if you’re using a fairly recent Linux distro, you might not even need to worry about file system alignment.

      thanks,

      Captain KVM

  3. Unfortunately, before we became familiar with best practices, we built a lot of systems into production without proper alignment. I’m working on finding a simple and concise way to fix the alignment, hopefully while the system is live, but I expect it will probably need a shutdown or reboot.

    So for example we have:

    Disk /dev/sda: 64.4 GB, 64424509440 bytes
    255 heads, 63 sectors/track, 7832 cylinders, total 125829120 sectors
    Units = sectors of 1 * 512 = 512 bytes

    Device Boot      Start         End      Blocks   Id  System
    /dev/sda1   *       63     1044224      522081   83  Linux
    /dev/sda2      1044225   125821079   62388427+   8e  Linux LVM

    What would be your best recommendation for fixing this?

    1. Hi Damion,

      Sorry for the delayed response – I’ve been traveling. The short answer is that it’s easier to build new VMs and restore from backup. You could use `dd` as well, but you definitely want to back up before using that, and once you have that backup, it’s easier to use the first option that I presented.

      I know that’s not what you wanted to hear, but I don’t want to sugar coat this.

      Captain KVM

  4. This is a good article. This is very useful info. I was finally able to get aligned partitions using the following %pre section of the script.

    When I hit ‘fdisk -lu’ I get the partition table like this.

    Disk /dev/sda: 214.7 GB, 214748364800 bytes
    255 heads, 63 sectors/track, 26108 cylinders, total 419430400 sectors
    Units = sectors of 1 * 512 = 512 bytes

    Device Boot      Start         End      Blocks   Id  System
    /dev/sda1   *       64      208848     104392+   83  Linux
    Partition 1 does not end on cylinder boundary.
    /dev/sda2       208856    12490712    6140928+   82  Linux swap / Solaris
    Partition 2 does not end on cylinder boundary.
    /dev/sda3     12490720   419430399   203469840   83  Linux

    %pre
    #
    # Get the type of disk sda, hda, etc.
    #
    set $(list-harddrives)
    let numd=$#/2
    primary_disk=$1
    secondary_disk=$3

    #
    #Boot partition is assumed as 100MB. (208848-64)*512=99MB
    #
    boot_start_sector=64
    boot_end_sector=208848

    #
    # Calculate the swap size based on RAM.
    #

    let swap_start_sector=$boot_end_sector+8
    swap_end_sector=8597456 #assumed default 4GB swap i.e ${OUT}
    echo "part /boot --fstype ext3 --onpart ${primary_disk}1" >> ${OUT}
    echo "part / --fstype ext3 --onpart ${primary_disk}3" >> ${OUT}
    #Please add %include /tmp/part-include in the command section after zerombr in ks.cfg
    #You can view the file with cat /tmp/part-include from the BusyBox shell during installation (press Alt+F2).
    #Do not perform clearpart --all --initlabel.
    fi

    #In case of 2 HDD, with RAID.
    if [ -b "/dev/$secondary_disk" ]; then
    parted -s /dev/$secondary_disk mklabel msdos
    parted -s /dev/$secondary_disk mkpart primary ext3 ${boot_start_sector}s ${boot_end_sector}s
    parted -s /dev/$secondary_disk mkpart primary linux-swap ${swap_start_sector}s ${swap_end_sector}s
    parted -s /dev/$secondary_disk mkpart primary ext3 ${root_start_sector}s 100%
    # You need to write your own /tmp/part-include, something like below, to suit your specifications.
    # Assumed primary_disk=sda and secondary_disk=sdb
    #part swap --onpart sda2\n\
    #part raid.18 --onpart sda1\n\
    #part raid.20 --onpart sda3\n\
    #part swap --onpart sdb2\n\
    #part raid.19 --onpart sdb1\n\
    #part raid.21 --onpart sdb3\n\
    #raid \/boot --fstype ext3 --level=RAID1 --device=md0 raid.18 raid.19\n\
    #raid \/ --fstype ext3 --level=RAID1 --device=md1 raid.20 raid.21\n"
    fi

    1. The previous ‘pre’ section was not cleanly pasted. Here you go.

      %pre
      #
      # Get the type of disk sda, hda, etc.
      #
      set $(list-harddrives)
      let numd=$#/2
      primary_disk=$1
      secondary_disk=$3

      #
      #Boot partition is assumed as 100MB. (208848-64)*512=99MB
      #
      boot_start_sector=64
      boot_end_sector=208848

      #
      # Calculate the swap size based on RAM.
      #

      let swap_start_sector=$boot_end_sector+8
      #assumed default 4GB swap i.e ${OUT}

      echo "part /boot --fstype ext3 --onpart ${primary_disk}1" >> ${OUT}

      echo "part / --fstype ext3 --onpart ${primary_disk}3" >> ${OUT}

      #Please add %include /tmp/part-include in the command section after zerombr in ks.cfg
      #You can view the file with cat /tmp/part-include from the BusyBox shell during installation (press Alt+F2).
      #Do not perform clearpart --all --initlabel.
      fi

      #In case of 2 HDD, with RAID.
      if [ -b "/dev/$secondary_disk" ]; then
      parted -s /dev/$secondary_disk mklabel msdos
      parted -s /dev/$secondary_disk mkpart primary ext3 ${boot_start_sector}s ${boot_end_sector}s
      parted -s /dev/$secondary_disk mkpart primary linux-swap ${swap_start_sector}s ${swap_end_sector}s
      parted -s /dev/$secondary_disk mkpart primary ext3 ${root_start_sector}s 100%
      # You need to write your own /tmp/part-include, something like below, to suit your specifications.
      # Assumed primary_disk=sda and secondary_disk=sdb
      #part swap --onpart sda2\n\
      #part raid.18 --onpart sda1\n\
      #part raid.20 --onpart sda3\n\
      #part swap --onpart sdb2\n\
      #part raid.19 --onpart sdb1\n\
      #part raid.21 --onpart sdb3\n\
      #raid \/boot --fstype ext3 --level=RAID1 --device=md0 raid.18 raid.19\n\
      #raid \/ --fstype ext3 --level=RAID1 --device=md1 raid.20 raid.21\n"
      fi
