The UNIX way.

Around the clock, across the globe. By Vladimir Legeza

Update OS on hundreds of servers.

I was once asked:
What would I do if I needed to update the OS on many servers, several hundred for instance, without taking the entire cluster down?

At the time I offered several ideas off the top of my head, nothing special and purely theoretical. Today I would like to share my recent practical experience in this area. Please read on and leave your comments and questions.

The first implementation was deployed a few days ago, and this is how it was done.

Background:

The whole implementation rests on just two major principles:

  • The first, as usual in UNIX, is KISS (an acronym for “Keep it simple, stupid!”): don’t make it complex.
  • The second (in order, not in importance) is “Divide and conquer”.

In practice this means that everything we build should be simple to understand; easy to use and manage; fast (taking a server down for even an hour just for an update is definitely not acceptable); robust (“shit happens” is common practice, not an exception, so we should know how to recover, also quickly and simply); and well documented (hopefully that one needs no comment).

Phases:

  • Initial Installation.
  • OS Update.

As you will see below, a system in an arbitrary state cannot be updated by the described method; every single server has to be specially prepared for it. This is why the whole procedure is divided into two major phases.

Server content:

I started with the question: “What exactly do we need to do?”
To answer it, I began by defining what is actually on our servers. An OS? Sure, but what else?
In general, I divided server content into four parts:

  1. The Operating System itself (lots of packages).
  2. The configuration of this Operating System: the settings that make the system unique, such as IP address and hostname.
  3. Additional software configuration. Here we have a large spectrum of apps and their configurations.
    • Some of them are common to most servers in a cluster and are usually configured identically on all of them (for instance NTP, MTA, authentication, etc.).
    • Others are unique and present on only one or a few servers (mail gateway, NTP server, Apache, etc.).
  4. User Data. The most important content. This is everything else (logs and backups included).

Now we know what we have to deal with: a set of packages (the first part) and its unique identification (part number two).

Do you remember the first time you tried to install a new OS as a secondary one on your home PC? I do. It was not so difficult, was it? Just cut a partition in half, create a new empty partition and boot the installation CD (in my case it was a floppy disk).

The major benefit of such an installation is the possibility of switching between different OSes just by rebooting into the chosen one (which usually takes a few minutes). Is it fast? Fast enough, I think. So why not use this great method to replace an old OS instance with a new one?

To make this method a bit more realistic, keep in mind that we have to separate the OS from everything else. For that purpose I move everything I call User Data to a separate space (an additional partition mounted as /spool). Additional software configuration also needs to be separated.

So, let’s move on:

Disk layout:

  • Partition 1 – OS. First system instance.
  • Partition 2 – OS. Empty partition for future use. Equal in size to Partition 1.
  • Partition 3 – SWAP. Should be large enough that space can be cut from it if the OS partitions ever need to grow, without affecting the size of the User Data storage.
  • Partition 4 – User Data.

Initial installation phase:

Create a new partition table, then install one instance of the OS into Partition 1.
No User Data is available at this time.

For this phase we can employ “Network Install” with features such as Preseed/Kickseed or Kickstart. The process will be heavily automated. In addition to just installing a new system, you might employ DNS plus DHCP services to automate the import of the unique system identification settings. Finally, thanks to configuration automation controllers such as Puppet and CFEngine (I use the first one), you can obtain a fresh, well-prepared and ready-to-run installation in just about 10-20 minutes. Everything will be done in “hands off” mode.

OS Update phase:

  • Get the image of the new version of the OS. The simplest way to get it is to install a fresh instance from the new distribution, as we did in the first phase. If you don’t have free hardware for that, I suggest installing it in a virtual environment (in VirtualBox on your laptop, for example).
  • Boot the newly installed OS and correct the files related to partitioning. The main idea is to remove such a modern feature as UUIDs: they have to be changed to regular device names like /dev/sda1 or /dev/md0. I assume device naming should not bother us, because all the servers are identical and hence the partitions and their names are the same everywhere.
  • Now we need to grab the new image and spread it across our testing environment. The image should be loaded into Partition 2 (use the dd tool to do this). Once again, because all the partitions are sized by the same pattern, there should not be any problems with image loading. Perform all necessary tests. As usual, everything should be tested prior to production use.
  • Spread the image across the whole cluster in the same way.
  • Reboot into the new system, one (or a few) systems at a time, to control how it is going and to be able to interrupt the process in case of problems. Try to choose the right number of servers to reboot simultaneously, so that the whole cluster update does not take too long and at the same time does not affect the cluster’s workload. (A serialized update across 200 servers might take about 16 hours.)
  • Set the partition as Active to allow the hardware to boot from it. (This makes our changes permanent.)
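
As a rough illustration of the image preparation and loading steps above, here is a minimal shell sketch. The function names fstab_use_device and load_image are hypothetical, as are any paths you pass to them; it assumes a Linux system with the usual GNU tools (sed, dd, head, sha256sum).

```shell
#!/bin/sh
# Sketch only: function names and paths are hypothetical, not from the article.

# Replace one UUID= entry in an fstab-style file with a plain device name.
# Usage: fstab_use_device FILE UUID DEVICE
fstab_use_device() {
    file=$1 uuid=$2 device=$3
    # Match only lines whose first field is UUID=<uuid>; '|' is the sed
    # delimiter because device paths contain '/'.
    sed "s|^UUID=${uuid}\([[:space:]]\)|${device}\1|" "$file" > "$file.new" &&
        mv "$file.new" "$file"
}

# Copy an image onto a target (for example a partition such as /dev/md1)
# with dd, then verify the copy by checksumming the first SIZE bytes.
# Usage: load_image IMAGE TARGET
load_image() {
    image=$1 target=$2
    size=$(( $(wc -c < "$image") )) || return 1
    dd if="$image" of="$target" bs=1M conv=notrunc 2>/dev/null || return 1
    sync
    want=$(sha256sum < "$image")
    got=$(head -c "$size" "$target" | sha256sum)
    [ "$want" = "$got" ]
}
```

On the freshly installed reference system you might run something like fstab_use_device /etc/fstab <uuid-of-root> /dev/md0 once per UUID entry, and on each target server load_image <image-file> /dev/md1. The target partition may be larger than the image, which is why only the image-sized prefix is compared.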

Great, we are done. But... how do we actually boot the new OS instance? And how is that possible if we set the Active state only after booting into the new environment?

Booting:

Technically, to boot from the second partition it should be enough to set this partition as the “Active” one. But this is not as flexible a solution as we need.

To make it flexible, configure the servers to try booting from the network first and only then from the local drive. In this case the boot process of the whole cluster is controlled from a single point: the Network Boot Environment. (And if anything happens to the Network Boot Environment, the cluster is still able to boot normally.)

Note that this flexibility will cost you a 5-30 second timeout at boot time.

Software such as PXELINUX allows you to vary the boot process. We may choose to continue with a local boot (which is the default option in a normal situation) or to initiate the Initial Installation process. Or, most interestingly, we may boot a specially crafted environment that boots our local system from a specified partition (actually, the kernel and its modules are loaded from the network, but the root file system is the one we specify, regardless of the “Active” flag).

And this is the mechanism for temporarily booting into the newly loaded image. Boot menu manipulations can help you solve many different problems related to the installation process, from reverting to the previous state to loading some sort of rescue image to see what is going on.

Boot with second partition as a root file system (Linux case):

To create the required bootable environment we need to build a new initrd file system containing all the required kernel modules for the kernel we want to boot, and we need the kernel itself.

Employ tools like mkinitramfs or mkinitrd, which build an image from the system you run them on. So, you might run one of them right from the installation that was taken as the OS image.

The initrd image also contains file system configuration. Check whether this configuration contains UUIDs and fix it as needed.
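
One way to perform that check is to unpack the image into a scratch directory and search it. The sketch below is only an illustration: list_uuid_refs is a hypothetical helper name, and the unpacking command in the comment assumes the initrd is a gzip-compressed cpio archive.

```shell
#!/bin/sh
# Sketch only: list_uuid_refs is a hypothetical helper name.
#
# Unpack the initrd first, for example (assuming a gzip-compressed cpio
# archive and a scratch directory of your choice):
#   mkdir /tmp/initrd && cd /tmp/initrd && zcat /path/to/initrd.gz | cpio -id

# Print every line under DIR that mentions UUID=, as path:line:text.
# Returns 0 if at least one reference was found, non-zero otherwise.
# Usage: list_uuid_refs DIR
list_uuid_refs() {
    grep -rn 'UUID=' "$1"
}
```

Any file it reports is a candidate for the same UUID-to-device-name replacement that was applied to the partitioning files of the OS image.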

To mount the right root partition, specify it as a kernel parameter at boot time (it can also be set permanently in PXELINUX).
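
PXELINUX looks for a configuration file named after the client’s MAC address, then after its IP address in upper-case hex, before falling back to pxelinux.cfg/default. That makes it possible to point a single server at the new root partition without touching the shared menu. Here is a minimal sketch of that idea; the function names and the label name newroot are hypothetical, and the kernel and initrd paths are simply the ones used in the configuration examples at the end of this post.

```shell
#!/bin/sh
# Sketch only: function names and the "newroot" label are hypothetical.

# Config file name derived from a MAC address: ARP type prefix 01,
# lower-case hex, dash-separated.
pxe_mac_file() {
    printf '01-%s\n' "$(printf '%s' "$1" | tr 'A-Z:' 'a-z-')"
}

# Config file name derived from an IPv4 address: eight upper-case hex digits.
pxe_ip_file() {
    old_ifs=$IFS
    IFS=.
    set -- $1          # split the dotted quad into four positional parameters
    IFS=$old_ifs
    printf '%02X%02X%02X%02X\n' "$1" "$2" "$3" "$4"
}

# Write a per-host config that boots the given root device without a menu.
# Usage: write_host_cfg CFG_DIR IP ROOT_DEVICE
write_host_cfg() {
    cat > "$1/$(pxe_ip_file "$2")" <<EOF
default newroot
label newroot
 kernel ubuntu/12.04/amd64/linux
 append initrd=ubuntu/12.04/amd64/initrd_local.gz root=$3
EOF
}
```

For example, write_host_cfg <tftp-root>/pxelinux.cfg 172.30.0.21 /dev/md1 would make only that server boot with /dev/md1 as its root file system; deleting the file sends it back to the default menu on the next boot.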

Conclusion:

The described implementation is not so difficult. All you need to do is:

  • Divide the OS from the data;
  • Plan your installation (the space required for it);
  • Prepare the OS image (and change one file);
  • Prepare the initrd image (hopefully it will work without modifications);
  • Add appropriate options to the Network Boot Environment:
    • boot from local disks (default)
    • boot from first partition
    • boot from second partition
    • initial install
    • boot in rescue mode.

Boom, it’s done and ready to go.

It will allow you to:

  • Update the entire OS to a new version with relatively short downtime (a few minutes per server).
  • Roll back to the previous version (as fast as a server is able to boot).
  • Deal with any sort of problem related to the boot process (boot blocks and boot loaders).
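
To get a feeling for how the per-server downtime adds up across a cluster, a rough estimate can be scripted. This is only a sketch: rollout_minutes is a hypothetical helper, and the figure of 5 minutes per reboot is an assumption back-derived from the “200 servers, about 16 hours” estimate for a serialized update.

```shell
#!/bin/sh
# Sketch only: rollout_minutes is a hypothetical helper; 5 minutes per
# reboot is an assumed figure, not a measurement.

# Estimate the total rollout time in minutes when SERVERS machines are
# rebooted BATCH at a time and each batch takes MINUTES to come back.
# Usage: rollout_minutes SERVERS BATCH MINUTES
rollout_minutes() {
    batches=$(( ($1 + $2 - 1) / $2 ))   # ceiling division
    echo $(( batches * $3 ))
}

rollout_minutes 200 1 5   # serialized: one server at a time
rollout_minutes 200 4 5   # four servers at a time
```

Under these assumptions the two calls print 1000 and 250, that is roughly 16.7 hours for a fully serialized update and a little over 4 hours when four servers are rebooted at once.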

So, practice and enjoy.

Configuration examples:


File pxelinux.cfg/default 

menu title Network boot menu
menu background splash.png
menu color title * #FFFFFFFF *
menu color border * #00000000 #00000000 none
menu color sel * #ffffffff #76a1d0ff *
menu color hotsel 1;7;37;40 #ffffffff #76a1d0ff *
menu color tabmsg * #ffffffff #00000000 *
menu color help 37;40 #ffdddd00 #00000000 none
menu tabmsg Press ENTER to boot or TAB to edit a menu entry
default menu.c32
label localboot
 menu label ^Boot from local drive
 menu default
 localboot 0
 timeout 50
label partition1
 menu label Boot from Partition ^1
 kernel ubuntu/12.04/amd64/linux
 append initrd=ubuntu/12.04/amd64/initrd_local.gz root=/dev/md0
label partition2
 menu label Boot from Partition ^2
 kernel ubuntu/12.04/amd64/linux
 append initrd=ubuntu/12.04/amd64/initrd_local.gz root=/dev/md1
label initialinstall
 menu label ^Ubuntu 12.04 Initial install (amd64)
 kernel ubuntu/12.04/amd64/linux
 append vga=normal initrd=ubuntu/12.04/amd64/initrd.gz ks=http://172.30.0.20/ks.cfg netcfg/choose_interface=eth1
 text help
 All the data on first two hard drives will be destroyed!
 WITHOUT ANY QUESTIONS!
 endtext

File ks.cfg

lang en_US
langsupport en_US
keyboard us
mouse
timezone Europe/Moscow
reboot
text
install
url --url http://172.30.0.20/ubuntu/
bootloader --location mbr
zerombr
preseed --owner d-i partman-auto/method string raid
preseed --owner d-i partman-auto/disk string "/dev/sdb /dev/sda"
preseed --owner d-i partman-auto/expert_recipe string "multiraid :: 2048 2048 2048 raid $primary{ } method{ raid } raidid{ 1 } . 2048 2048 2048 raid $primary{ } method{ raid } raidid{ 2 } . 4096 4096 4096 raid $primary{ } method{ raid } raidid{ 3 } . 50 10000 1000000000 raid method{ raid } raidid{ 4 } . "
preseed --owner d-i partman-auto-raid/recipe string "1 2 0 ext4 / raidid=1 . 1 2 0 ext4 /mnt/md1 raidid=2 . 1 2 0 swap - raidid=3 . 1 2 0 ext4 /spool raidid=4 . "
preseed --owner d-i partman-md/device_remove_md boolean true
preseed --owner d-i partman-md/confirm boolean true
preseed --owner d-i partman-md/confirm_nooverwrite boolean true
preseed --owner d-i partman-partitioning/confirm_new_label boolean true
preseed --owner d-i partman-partitioning/confirm_write_new_label boolean true
preseed --owner d-i partman/choose_partition select finish
preseed --owner d-i partman/confirm boolean true
preseed --owner d-i partman/confirm_nooverwrite boolean true
preseed --owner d-i mdadm/boot_degraded boolean true
preseed --owner d-i grub-installer/bootdev string "(hd0,0) (hd1,0)"
rootpw --disable
user mnt-noc --fullname "NOC Manager" --iscrypted --password $6$93lyzIxN$NtC9GCwnqPjvXJT29bFA6V7W76DB9PC2GQ9SuRTFYj3jv2hFPxBxV3LiQskYltkk138vXv73spQWcBbPm/BQ01
auth --useshadow --enablemd5
network --bootproto dhcp --device eth1
firewall --disabled
skipx
%packages
ubuntu-minimal
openssh-server
puppet
vim
curl
%pre
%post
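INTERFACE="eth1"
# Collect the settings handed out by DHCP and DNS during installation
MY_IP=$(ip addr show "$INTERFACE" | awk '/inet / {split($2,x,"/");print x[1]}')
# Reverse DNS lookup; keep the name from the PTR record without its trailing dot
MY_HOSTNAME=$(host "$MY_IP" | sed -n 's/.* domain name pointer \([A-Za-z0-9._-]*\)[.]$/\1/p')
MY_MAC=$(ip addr show "$INTERFACE" | awk '/link\/ether / {print $2}')
MY_GW=$(ip route list | awk '/^default/ {print $3}')
# Configure NETWORK: make the DHCP-assigned values static (assumes a /24 network)
echo "$MY_HOSTNAME" >/etc/hostname
cat >/etc/network/interfaces<<EOF
auto lo
iface lo inet loopback
auto $INTERFACE
iface $INTERFACE inet static
 address $MY_IP
 netmask 255.255.255.0
 gateway $MY_GW
EOF
# Fix DNS configuration: replace the resolv.conf symlink with a regular file
FILE="/etc/resolv.conf"
if [ -L "$FILE" ];then
 # readlink -f resolves both relative and absolute link targets
 DATA_SRC="$(readlink -f "$FILE")"
 rm "$FILE"
 cat "$DATA_SRC" > "$FILE"
fi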

* For those interested, the password for the mnt-noc user is “password”.

Written by Vladimir Legeza

July 13, 2012 at 10:09 am
