Argh...
The days pass like seconds and the minutes become months; it has been a challenging time for me. I have so much to write about, but no time at all (I managed to visit SNOLAB 2km underground in Sudbury last Friday and will be presenting my work on the ATLAS detector upgrade at the Large Hadron Collider at the National Conference on Undergraduate Research (NCUR) in Spokane, Washington next month). I got home today, fell asleep from 2PM to 10PM, and have been up since, just trying to catch up with myself and spend a few minutes with my own thoughts (and apparently my blog). I still need to finish posts about the International Astronautical Congress and my multiple trips to Germany. I am starting to wonder whether, at this point, they will ever happen...
Here's a short video of what it's like to go to SNOLAB (hint: holy shit, mind officially blown for having been able to see it with my own eyes...):
The reason for this post is that I've realized I need to upgrade my Linux server from Slackware 13.37 to Slackware 14.1, as I've needed to install software that requires more modern libraries. To that end, I just wanted to reproduce a post I made in June 2011 on Livejournal that doesn't seem to be here. This is necessary to frame things for a coming post on what the upgrade required and any issues I ran into along the way. Since this blog is search-indexed, hopefully it can help someone else who is trying to do cool things with their computers. Keep in mind this is a reposting of a historical entry from a few years ago. With that said, the server in question has been rock frickin' solid the whole time. I think I've needed to reboot it only once in that entire time because of an actual issue (it has been rebooted more than that because of power failures and deciding to move it, but only once because of a problem... at this point, 'uptime' says it has been up for 110 days now... since I moved it to the other side of the living room).
Going from stable hardware to a functional Internet server is not an instant process. For instance, deciding how to install the operating system, getting it to boot, and partitioning the drives for data takes a lot of work — especially when "state of the art" is a moving target. When I last installed a system, booting off a RAID 1 partition (mirrored disks... in case one disk dies, the exact same data is on the second one as well) was not possible. In my first post on the topic, I had been planning to have one non-mirrored partition on each of the two drives (for redundancy) that I would have had to manage manually so I could boot off either disk if the other failed. On my current server, I have a separate (non-mirrored) boot disk (it also has the operating system on it) and then a pair of disks in a RAID 1 configuration for my data. I learned, however, that LILO (the LInux LOader) could now boot a RAID 1 partition! That was going to save me a lot of manual configuration and provide better data safety, so it sounded like a great idea. Right? I mean, right?
Well, I had already partitioned my hard disk as follows (sda and sdb were identically partitioned... and note, in case you didn't know or are used to other Unices, that Linux blocks are reported as 1K, not 512 bytes):

       Device Boot      Start        End     Blocks  Id  System
    /dev/sda1   *        2048     206847     102400  83  Linux
    /dev/sda2          206848    8595455    4194304  82  Linux swap
    /dev/sda3         8595456  176367615   83886080  fd  Linux raid autodetect
    /dev/sda4       176367616 1953525167  888578776  fd  Linux raid autodetect

sda1/sdb1 [100MiB] was going to be where I stored the operating system image to boot off of (manually placing copies on each filesystem and installing the LILO bootloader individually on each disk's Master Boot Record (MBR)), mounted as /boot once the system was running. sda2/sdb2 [4GiB] would be non-mirrored swap partitions (both used simultaneously to give 8G of swap). sda3/sdb3 [80GiB] was going to be the RAID 1 (mirrored) / (root) partition, and sda4/sdb4 [some crazyass number of GiB, like 850 or something] was going to be RAID 1 (mirrored) with a Logical Volume Manager (LVM) volume group (VG) on top of it (more on that later...). A quick note on the swap partitions: because I did not put swap on a RAID partition, if the system is heavily loaded, swap is in use, and a disk fails, things could crash (programs and possibly even the operating system). However, the performance hit of putting swap on top of a software RAID implementation would be unforgivable. The system could crash, but if it's brought back up, there's enough space on the one functioning swap partition to run the system fine. A compromise I feel is acceptable.
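(One convenience worth noting that wasn't in my original notes: rather than partitioning the second disk by hand to match, the table can be cloned from the first disk. A minimal sketch, assuming sfdisk from util-linux, which ships with Slackware:)

    # Dump sda's partition table and replicate it verbatim onto sdb
    sfdisk -d /dev/sda | sfdisk /dev/sdb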
I went ahead and created the mirrored partitions /dev/md0 and /dev/md1 from /dev/sda3:/dev/sdb3 and /dev/sda4:/dev/sdb4 respectively [mdadm --create /dev/md{0|1} --level=1 --raid-devices=2 /dev/sda{3|4} /dev/sdb{3|4}] and created EXT4 filesystems on /dev/sda1, /dev/sdb1, and /dev/md0 (the mirrored partition from the previous step) [mkfs.ext4 /dev/{sda1|sdb1|md0}]. I mentioned earlier that LILO can now boot off RAID 1 partitions, but I did not know that at the point I had done all of this... I installed the Slackware64 13.37 distribution and then started investigating how to do the LILO boot thing properly with my particular configuration. It was then that I learned about the new capability and realized it would be best if I rolled things back a little and mirrored sda1 and sdb1 as well. I copied the files out of that filesystem into a temporary directory I created, then rebooted the system so I could change the partitions from type 83 "Linux" to type fd "Linux raid autodetect" and mirror them. Sadly... the temporary directory I had created was on the RAMdisk used by the installer, and when I rebooted, all the files were gone. It was a laughing (at myself) head-desk moment... doh! Well, not such a bad thing (I just needed to re-install the OS, so not a problem at that stage, heh). It also gave me the chance to redo things with the new configuration: I would make /dev/md0 the /dev/sda1:/dev/sdb1 mirrored partition and go from there.
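For clarity, the bracketed shorthand above expands to this (the same commands, just written out longhand):

    # Mirror sda3/sdb3 (the root-to-be) and sda4/sdb4 (the future LVM space)
    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda3 /dev/sdb3
    mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sda4 /dev/sdb4
    # EXT4 on the two (still unmirrored) boot partitions and the root mirror
    mkfs.ext4 /dev/sda1
    mkfs.ext4 /dev/sdb1
    mkfs.ext4 /dev/md0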
And here's where things took a turn for the argh... I knew I had to re-number the other mirrored partitions so that the /dev/sda4:/dev/sdb4 partition went from /dev/md1 to /dev/md2, and the /dev/sda3:/dev/sdb3 partition went from /dev/md0 to /dev/md1, so I could make the boot one /dev/md0. How to do this? Well, after much research (this is all new functionality, so it's not very well documented anywhere), you stop the mirrored partition (say /dev/mdX for the mirrored partitions /dev/sdaN and /dev/sdbN), re-assign it a new "superblock minor number" (let's say Y), and start it back up again [mdadm --stop /dev/mdX; mdadm --assemble /dev/mdY --super-minor=X --update=super-minor; mdadm --assemble /dev/mdY /dev/sdaN /dev/sdbN] (boy, did it take a long time to figure out how to do that!). I did /dev/md2, then /dev/md1, then created /dev/md0, and everything looked good. Did a "cat /proc/mdstat" and everything was happily mirrored and chugging away. Created an EXT4 filesystem on /dev/md0 and everything looked good. I wiped the filesystem on /dev/md1 to make sure I had a clean installation, did a fresh installation, and rebooted the computer just for good measure and... all the RAID device numbering was messed up! I thought it was hard to figure out how to do the stuff I just did... it had nothing on figuring out how to fix this new problem! The clue came when I looked at the information associated with the RAID devices [mdadm --detail /dev/mdX] and saw a line like "Name : slackware:1", where the number after the "slackware:" seemed to match the "mdX" number assigned... and also corresponded to the number I used to create the RAID partition (which the --update=super-minor command didn't seem to change). I wondered whether this was something autogenerated at boot time or actually part of the RAID configuration information stored on the disk... I used the program "hexdump" to look at the first few kilobytes of data stored in the RAID device block on the disk [e.g. hexdump -C -n 4096 /dev/mdX] and sure enough, the string "slackware:X" was there. I then had to start the search for how to change the "Name" of a RAID array, as apparently this was very new and rarely used functionality. The built-in help indicated it could be done, but the syntax didn't make sense. Ultimately, I figured it out and changed the name (and re-changed the minor number in the superblock as well, just to be sure) [mdadm --stop /dev/mdX; mdadm --assemble /dev/mdY --update=name --name=slackware:Y /dev/sdaN /dev/sdbN; mdadm --stop /dev/mdY; mdadm --assemble /dev/mdY --update=super-minor /dev/sdaN /dev/sdbN; mdadm --stop /dev/mdY; mdadm --assemble /dev/mdY /dev/sdaN /dev/sdbN], and this technique proved reliable and worked like a charm every time (rebooted the system to make sure everything stuck, and it did, yay!). I understand this is Slackware functionality to guarantee which mdX number gets assigned to a RAID array (where other operating systems can, and do, randomly make assignments), so it's ultimately a Good Thing™, but it's not well documented.
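To make that concrete for one array, here's what the rename dance looks like for the sda3/sdb3 mirror moving from md0 to md1 (a sketch; note the --stop between steps, since each --assemble starts the array, and note that --update=super-minor only actually means anything for 0.90-format superblocks; for 1.2 metadata the name is what counts):

    # Stop the array under its old number
    mdadm --stop /dev/md0
    # Re-assemble under the new number, rewriting the "Name" in the superblock;
    # on Slackware this name decides the mdX number assigned at boot
    mdadm --assemble /dev/md1 --update=name --name=slackware:1 /dev/sda3 /dev/sdb3
    # Verify the name stuck and the mirror is healthy
    mdadm --detail /dev/md1 | grep Name
    cat /proc/mdstat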
So, it was time to finish up the installation by installing the bootloader. The configuration (in /etc/lilo.conf in the /etc directory of the operating system installed on the disk, e.g. /mnt/etc/lilo.conf if that's where the disk partition with the OS is mounted) was pretty much this (LILO was having problems with my video card, so I left out the fancy graphical console modes):

    lba32                 # Allow booting past 1024th cylinder with a recent BIOS
    boot = /dev/sda
    # Append any additional kernel parameters:
    append=" vt.default_utf8=0"
    prompt
    timeout = 50          # In 1/10ths of a second
    vga = normal
    # Linux bootable partition config begins
    image = /boot/vmlinuz
      root = /dev/md1
      label = Linux
      read-only           # Partitions should be mounted read-only for checking

Fairly simple stuff: the "boot" line specifies the "whole disk" so the bootloader gets installed in the Master Boot Record (MBR) of the drive, it loads the Linux image, and it uses /dev/md1 as the root filesystem. Simple, except it didn't work!!! LILO, when run [mount /dev/md1 /mnt; mount /dev/md0 /mnt/boot; chroot /mnt lilo -v -v -v], would generate the message "Inconsistent Raid Version information on /dev/md0". Sigh... now what? Well, it turns out that sometime over the past year, the "metadata format" version of the "mdadm" package had changed from 0.9 to 1.2... and LILO did not know how to read the 1.2 version metadata, so it assumed the superblock of the RAID array was corrupted (there's a bug report here). It could, according to what I read, understand the 0.9 metadata format, so... I copied the files off the /dev/md0 partition (this time onto the actual hard drive, heh) and re-initialized the partition to use the old metadata format (again, it took a huge amount of time to track down the poorly documented command) [umount /mnt/boot; mdadm --stop /dev/md0; mdadm --create /dev/md0 --name=slackware:0 --metadata=0.90 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1; mkfs.ext4 /dev/md0; mount /dev/md0 /mnt/boot]. Once that was done, the boot files could be copied back and lilo run again. When I first tried it, I only installed on /dev/sda, and when I tried to boot, it just hung (never even made it to LILO). This confused me, so I checked the boot order of the disks in the BIOS settings. The "1st disk" was set to boot first and then the "3rd disk" if it couldn't. It took me a while, but I eventually tried (out of desperation) switching the boot order of the disks and... voila... the LILO boot prompt! Turns out the disk Linux thinks is "a", the BIOS thinks is the "3rd" disk, and "b" was the "1st" disk. Live and learn, eh? The trick was that LILO still needed to be installed on both hard disks (each has a separate MBR), so "lilo" had to be run, then the "boot" parameter changed to /dev/sdb in lilo.conf, and lilo run again [just "chroot /mnt lilo -v -v -v" once the filesystems were already mounted]. Once I had installed on both /dev/sda and /dev/sdb, it didn't matter which one the BIOS tried first, so that was then working the way it should.
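For anyone following along at home, here's the metadata downgrade written out as one sequence (a sketch, assuming the installed root is mounted on /mnt; mdadm will warn that the members already contain an array and ask for confirmation, and yes, re-creating wipes the filesystem, hence the backup):

    # Save the boot files onto the real disk first (not a RAMdisk this time!)
    mkdir /mnt/boot-backup
    cp -a /mnt/boot/. /mnt/boot-backup/
    umount /mnt/boot
    mdadm --stop /dev/md0
    # Re-create the mirror with the old 0.90 superblock format that LILO can read
    mdadm --create /dev/md0 --metadata=0.90 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
    mkfs.ext4 /dev/md0
    mount /dev/md0 /mnt/boot
    cp -a /mnt/boot-backup/. /mnt/boot/
    rm -r /mnt/boot-backup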
Great... right? Sigh... the kernel would load and then panic because it could not figure out how to use the root filesystem (it would give the message: "VFS: Unable to mount root fs on unknown-block(9,1)"). I remembered from my digging that the RAID disk devices had the major device number "9", and the root minor device (from above) was "1", so it knew it was trying to load that device, but couldn't. To me, that said the RAID drivers were not in the kernel and I would need to build a RAMdisk with the proper kernel modules and libraries for it to properly mount the RAID device as root. I'd had enough and went to bed at that point, picking it up again the next day. Again, what a pain to find documentation (one of the reasons why I'm writing this all out for posterity's sake... maybe I should write a magazine article, heh)! The trick was to use the "mkinitrd" script that comes with Slackware, and to do that you need to have the installed OS available because the command doesn't seem to be installed on the DVD's filesystem. Once the operating system is mounted [mount /dev/md1 /mnt; mount /dev/md0 /mnt/boot], create a copy of the /proc/partitions file on the disk version of the OS [cat /proc/partitions > /mnt/proc/partitions] (it will be the only file in that proc directory). Edit the /mnt/etc/lilo.conf file to include the line "initrd = /boot/initrd.gz" right below the "image = /boot/vmlinuz" line (and make sure the boot line is "boot = /dev/sda"). Then run the mkinitrd command to create the RAMdisk image and lilo to install it [chroot /mnt mkinitrd -R -m ext4 -f ext4 -r /dev/md1; chroot /mnt lilo -v -v -v]. Change the /mnt/etc/lilo.conf file to "boot = /dev/sdb" and run lilo again [chroot /mnt lilo -v -v -v] to install LILO's configuration on both disks. At this point, you need to delete the "partitions" file on the mounted OS image (it should be an empty directory for the virtual /proc filesystem when it runs) [rm /mnt/proc/partitions].
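Pulling all of that together (every command here is from the paragraph above, just collected into one sequence to run from the installer's shell):

    # Mount the installed OS and its boot partition
    mount /dev/md1 /mnt
    mount /dev/md0 /mnt/boot
    # mkinitrd needs a partition list, but /proc is empty inside the chroot,
    # so drop a copy of the live /proc/partitions into the on-disk tree
    cat /proc/partitions > /mnt/proc/partitions
    # (make sure lilo.conf has "initrd = /boot/initrd.gz" under the image line
    #  and "boot = /dev/sda")
    chroot /mnt mkinitrd -R -m ext4 -f ext4 -r /dev/md1
    chroot /mnt lilo -v -v -v
    # edit /mnt/etc/lilo.conf to "boot = /dev/sdb", then install on the second MBR
    chroot /mnt lilo -v -v -v
    # Clean up so /proc is an empty mount point when the real system boots
    rm /mnt/proc/partitions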
And that, my friends, is how I spent my summer vacation ;). The system booted (I tried switching the boot order via the BIOS and it worked fine), mounted its root filesystem, and loaded my shiny new Slackware64 13.37 installation in all its glory. Finally!!! But my journey is far from over... I now have to configure the system and integrate it with the framework I already have running so it can eventually take over from my current server (my plan is to move the pair of 200G disks from the current server to the new one and use them as part of a system backup strategy). I have to set up the LVM partition for my data and decide how to carve up the space into Logical Volumes (LVs) (a rough sketch of that follows below). I have to decide whether I want to stick with NIS or move to LDAP for authentication (I've been meaning to for a while, but know it's going to be a colossal nightmare), I have to configure Samba (for file and print sharing with Windoze machines), I have to move my web sites to the new box (including migrating the MySQL databases for the WordPress installations), and then migrate the data from my old server to the new data partitions. Sigh... it's a huge job with so many different technologies (each of which requires a great deal of expertise to use).
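Since the LVM step is the next concrete one, here's roughly what carving up the big mirror might look like (a sketch only, assuming /dev/md2 is the big mirrored partition from earlier; the volume group name, LV names, and sizes are placeholders I haven't committed to):

    # Make the big mirror an LVM physical volume and create a volume group on it
    pvcreate /dev/md2
    vgcreate datavg /dev/md2          # "datavg" is a placeholder name
    # Carve out some logical volumes (sizes are illustrative only)
    lvcreate -L 200G -n home datavg
    lvcreate -L 100G -n www datavg
    # Filesystems on the new LVs
    mkfs.ext4 /dev/datavg/home
    mkfs.ext4 /dev/datavg/www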
Actually, the next thing I need to get working after the upgrade is to sync my server's clock with the NRC NTP servers since the hardware clock on its motherboard swerves like a drunken landlubber on a crooked dock. But that will likely have to wait for the summer.
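When I do get to it, the NTP piece itself should be small; something like this in /etc/ntp.conf (a sketch; the NRC server names here are from memory, so double-check them before pointing at them):

    # /etc/ntp.conf -- sync against NRC's public time servers (names assumed)
    server time.nrc.ca iburst
    server time.chu.nrc.ca iburst
    driftfile /etc/ntp/drift

followed by making the rc script executable and starting the daemon [chmod +x /etc/rc.d/rc.ntpd; /etc/rc.d/rc.ntpd start].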