Converting RAID1 to RAID10 online

Schema of a RAID10 array

Schema of a RAID10 array (CC BY JaviMZN)

I have a (now old) HP microserver with 4 HDDs. I installed Ubuntu 14.04 (then in beta) on it on a quiet Sunday in February 2014. It is now running Ubuntu 16.04 and still working perfectly. However, I’m not sure what I thought on that Sunday more than 3 years ago. I had partitioned the 4 HDDs in a similar fashion each with a partition for /boot, one for swap and the last one for a BTRFS volume (with subvolumes to separate / from other spaces like /var or /home). My idea was to have the 4 partitions for /boot in RAID10 and the 4 ones for swap in RAID0. I realised today that I only used 2 partitions for /boot and configured them in RAID1, and only used 3 partitions for swap in RAID0.

I have a recurrent problem that because each partition for /boot was 256MB, therefore instead of having 512 (RAID10 with 4 devices) I ended up having only 256MB (RAID1), and that’s not much especially if you install the Ubuntu HWE (Hardware Enablement) kernels, then you quickly have problems with unattended-update failing to install security update because there is no space left on /boot, etc. It was becoming high maintenance and with 4 kids to attend I had to remediate that quickly.

But here is the magic with Linux, I did an online reshaping from RAID1 to RAID10 (via RAID0) and an online resizing of /boot (ext4). And in 15 minutes I went from 256MB problematic /boot to 512MB low maintenance one without rebooting!

That’s how I did it, and it will only work if you have mdadm 3.3+ (could work with 3.2.1+ but not tested) and a recent kernel (I had 4.10, but should have worked with the 4.4 shipped with Ubuntu 16.04 and probably older Kernel). Note that you should backup, test your backup and know how to recover your /boot (or whatever partition you are trying to change).

Increasing the size a RAID0 array (for swap)

First this is how I fixed the RAID0 for the swap (no backup necessary, but you should make sure that you have enough free space to release the swap). The current RAID0 is called md0 and is composed of sda3, sdb3 and sdc3. The partition sdd3 is missing.

$ sudo mdadm --grow /dev/md0 --raid-devices=4 --add /dev/sdd3
mdadm: level of /dev/md0 changed to raid4
mdadm: added /dev/sdd3
mdadm: Need to backup 6144K of critical section..
$ cat /proc/mdstat
md0 : active raid4 sdd3[4] sdc3[2] sda3[0] sdb3[1]
      17576448 blocks super 1.2 level 4, 512k chunk, algorithm 5 [5/4] [UUU__]
      [>....................]  reshape =  1.8% (105660/5858816) finish=4.6min speed=20722K/sec
$ sudo swapoff /dev/md0
$ grep swap /etc/fstab
UUID=2863a135-946b-4876-8458-454cec3f620e none            swap    sw              0       0
$ sudo mkswap -L swap -U 2863a135-946b-4876-8458-454cec3f620e /dev/md0
$ sudo swapon -a

What I just did is tell MD that I need to grow the array from 3 to 4 devices and add the new device. After that, one can see that the reshape is taking place (it was rather fast because the partitions were small, only 256MB). After that first operation, the array is bigger but the swap size is still the same. So I “unmounted” or turn off the swap, recreated it using the full device and “remounted” it. I grepped for the swap in my `/etc/fstab` file in order to see how it was mounted, here it is using the UUID. So when formatting I reused the same UUID so I did not need to change my `/etc/fstab`.

Converting a RAID1 to RAID10 array online (without copying the data)

Now a bit more complex. I want to migrate the array from RAID1 to RAID10 online. There is no direct path for that, so we need to go via RAID0. You should note that RAID0 is very dangerous, so you should really backup as advised earlier.

Converting from RAID1 to RAID0 online

The current RAID1 array is called m1 and is composed of sdb2 and sdc2. I’m going to convert it to a RAID0. After the conversion, only one disk will belong to the array.

$ sudo mdadm --grow /dev/md1 --level=0 --backup-file=/home/backup-md0
$ cat /proc/mdstat
md1 : active raid0 sdc2[1]
      249728 blocks super 1.2 64k chunks
$ sudo mdadm --misc --detail /dev/md1
/dev/md1:
        Version : 1.2
  Creation Time : Sun Feb  9 15:13:33 2014
     Raid Level : raid0
     Array Size : 249664 (243.85 MiB 255.66 MB)
   Raid Devices : 1
  Total Devices : 1
    Persistence : Superblock is persistent

    Update Time : Tue Jul 25 19:27:56 2017
          State : clean 
 Active Devices : 1
Working Devices : 1
 Failed Devices : 0
  Spare Devices : 0

     Chunk Size : 64K

           Name : jupiter:1  (local to host jupiter)
           UUID : b95b33c4:26ad8f39:950e870c:03a3e87c
         Events : 68

    Number   Major   Minor   RaidDevice State
       1       8       34        0      active sync   /dev/sdc2

I printed some extra information on the array to illustrate that it is still the same array but in RAID0 and with only 1 disk.

Converting from RAID0 to RAID10 online

$ sudo mdadm --grow /dev/md1 --level=10 --backup-file=/home/backup-md0 --raid-devices=4 --add /dev/sda2 /dev/sdb2 /dev/sdd2
mdadm: level of /dev/md1 changed to raid10
mdadm: added /dev/sda2
mdadm: added /dev/sdb2
mdadm: added /dev/sdd2
raid_disks for /dev/md1 set to 5
$ cat /proc/mdstat
md1 : active raid10 sdd2[4] sdb2[3](S) sda2[2](S) sdc2[1]
      249728 blocks super 1.2 2 near-copies [2/2] [UU]
$ sudo mdadm --misc --detail /dev/md1
/dev/md1:
        Version : 1.2
  Creation Time : Sun Feb  9 15:13:33 2014
     Raid Level : raid10
     Array Size : 249664 (243.85 MiB 255.66 MB)
  Used Dev Size : 249728 (243.92 MiB 255.72 MB)
   Raid Devices : 2
  Total Devices : 4
    Persistence : Superblock is persistent

    Update Time : Tue Jul 25 19:29:10 2017
          State : clean 
 Active Devices : 2
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 2

         Layout : near=2
     Chunk Size : 64K

           Name : jupiter:1  (local to host jupiter)
           UUID : b95b33c4:26ad8f39:950e870c:03a3e87c
         Events : 91

    Number   Major   Minor   RaidDevice State
       1       8       34        0      active sync set-A   /dev/sdc2
       4       8       50        1      active sync set-B   /dev/sdd2

       2       8        2        -      spare   /dev/sda2
       3       8       18        -      spare   /dev/sdb2

As the result of the conversion, we are in RAID10 but with only 2 devices and 2 spares. We need to tell MD to use the 2 spares as well if not we just have a RAID1 named differently.

$ sudo mdadm --grow /dev/md1 --raid-devices=4
$ cat /proc/mdstat
md1 : active raid10 sdd2[4] sdb2[3] sda2[2] sdc2[1]
      249728 blocks super 1.2 64K chunks 2 near-copies [4/4] [UUUU]
      [=============>.......]  reshape = 68.0% (170048/249728) finish=0.0min speed=28341K/sec
$ sudo mdadm --misc --detail /dev/md1
/dev/md1:
        Version : 1.2
  Creation Time : Sun Feb  9 15:13:33 2014
     Raid Level : raid10
     Array Size : 499456 (487.83 MiB 511.44 MB)
  Used Dev Size : 249728 (243.92 MiB 255.72 MB)
   Raid Devices : 4
  Total Devices : 4
    Persistence : Superblock is persistent

    Update Time : Tue Jul 25 19:29:59 2017
          State : clean, resyncing 
 Active Devices : 4
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 0

         Layout : near=2
     Chunk Size : 64K

  Resync Status : 99% complete

           Name : jupiter:1  (local to host jupiter)
           UUID : b95b33c4:26ad8f39:950e870c:03a3e87c
         Events : 111

    Number   Major   Minor   RaidDevice State
       1       8       34        0      active sync set-A   /dev/sdc2
       4       8       50        1      active sync set-B   /dev/sdd2
       3       8       18        2      active sync set-A   /dev/sdb2
       2       8        2        3      active sync set-B   /dev/sda2

Once again, the reshape is very fast but this is due to the small size of the array. Here what we can see is that the array is now 512MB but only 256MB are used. Next step is to increase the file system size.

Increasing file system to use full RAID10 array size online

This cannot be done online with all file systems. But I’ve tested it with XFS or ext4 and it works perfectly. I suspect other file systems support that too, but I never tried it online. In all cases, as already advised, make a backup before continuing.

$ sudo resize2fs /dev/md1
resize2fs 1.42.13 (17-May-2015)
Filesystem at /dev/md1 is mounted on /boot; on-line resizing required
old_desc_blocks = 1, new_desc_blocks = 2
The filesystem on /dev/md1 is now 499456 (1k) blocks long.

$ df -Th /boot/
Filesystem     Type  Size  Used Avail Use% Mounted on
/dev/md1       ext4  469M  155M  303M  34% /boot

When changing the /boot array, do not forget GRUB

I already had a RAID array before. So the Grub configuration is correct and does not need to be changed. But if you reshaped your array from something different than RAID1 (e.g. RAID5), then you should update Grub because it is possible that you need different module for the initial boot steps. On Ubuntu run `sudo update-grub`, on other platform see `man grub-mkconfig` on how to do it (e.g. `sudo grub-mkconfig -o /boot/grub/grub.cfg`).

It is not enough to have the right Grub configuration. You need to make sure that the GRUB bootloader is installed on all HDDs.

$ sudo grub-install /dev/sdX  # Example: sudo grub-install /dev/sda

GitLab – Disable Container Registry feature for all existing Projects

Today I activate a new feature at work which provides the Container Registry feature on our GitLab instance.

However not all projects are requiring this feature, actually at the moment just a few. Therefore we wanted this option to be disabled by defaults and to the responsibility of the project leaders to activate or not when needed.

GitLab offers to disable the Container registryfeature for new projects only. What I wanted was to do that for existing projects. This needs to be done directly on the database, so back it up before doing this, and try it on a non-production environment first. Note that this was tested on GitLab 8.17.3 using PostreSQL 9.6, with other releases this could be different. In addition, the following is provided as is without any warranty, it worked for me, it might not for you and I’m in no way responsible if you mess your database.

Now that you have done your backup, to perform the changes on the database, you need to login as the `gitlab-psql` user:

When using Docker:

$ docker exec -it --user=gitlab-psql gitlab bash

When using omnibus package:

$ su - gitlab-psql

The following applies for both Docker and Omnibus installation once logged in:

$ export PGHOST="/var/opt/gitlab/postgresql"
$ psql gitlabhq_production
gitlabhq_production=# update projects SET container_registry_enabled = FALSE;

That’s it!

Installing Ubuntu Server on Raspberry Pi – Headless

Raspberry Pi 2 Model B+ v1.1This article will describes the steps to install Ubuntu Server 16.04 on a Raspberry Pi 2. This article provides extra steps so that no screen or keyboard are required on the Raspberry Pi, it will be headless. But of course you need a screen and keyboard on the computer on which you will download the image and write it to the MicroSD card. It is similar to a previous article about installing Debian on Raspberry Pi 2, also headless mode.

Disclaimer: you need to know a minimum about computer, operating system, Linux and Raspberry Pi. If you just want to install an Operating System on your Raspberry Pi, get NOOBS the Raspberry Pi Foundation installer. This guide is for more advanced users. If you follow this guide but do mistakes, you might wipe out disk content or could even brick Micro SD card or what not.

Install the Ubuntu Server image

Ubuntu Circle of Friend LogoGrab your official Ubuntu Server for Raspberry Pi 2 image (the latest version at time of writing is ubuntu-16.04.1-preinstalled-server-armhf+raspi2.img.xz but in a few days the image for Ubuntu 16.04.2 should be available, it will save you some time when upgrading it (and save some write cycles on your Micro SD card). Once downloaded, you need to insert the Micro SD card on your computer (you probably need a USB card reader for that) and try to figure out which device it corresponds to, see the Ubuntu documentation for further guidance. I assume you know what you do but be weary that the next command if done on the wrong device could wipe out the data on that device. I do not take any responsibility if things go wrong.

$ xzcat ubuntu-16.04.1-preinstalled-server-armhf+raspi2.img.xz | dd of=<device> bs=32M

Create a user account and allow SSH access

Then make sure to sync your media data and then mount the newly created partition (normally there are 2 partitions created, we are interested in the second one, it should be named <device>p2 or <device>2:

$ sudo sync
$ sudo mkdir -p /mnt/rpi
$ sudo mount <device>2 /mnt/rpi

User account creation

As the Raspberry Pi uses an ARM processor and the computer on which I created the Micro SD card is a x86_64 processor, I cannot simply chroot and execute adduser in the newly mounted partition. The programs are compiled for a different architecture. So to add a new user we will need to do it manually by editing system files. We will create a new user and group, then add the corresponding entries in the files where the passwords are kept.

Add a new user (replace $(whoami) by your username if you want a different username than your current one).

$ echo "$(whoami):x:1000:1000:<Full Name>:/home/$(whoami):/bin/bash" | sudo tee -a /mnt/rpi/etc/passwd

Now create your group by editing /mnt/rpi/etc/group:

$ echo "$(whoami):x:1000:"" | sudo tee -a /mnt/rpi/etc/group

Now edit the group password database:

$ echo "$(whoami):*::$(whoami)" | sudo tee -a /mnt/rpi/etc/gshadow

And the user passsword database (it will have no default password but allow SSH key base authentication over the network and it will request to set a password upon first login. Note that with this configuration remote SSH login cannot happen without the SSH key, so it is a secure configuration):

$ echo "$(whoami)::0:0:99999:7:::" | sudo tee -a /mnt/rpi/etc/shadow

Grant your user access to administrative tasks (via sudo), but still requires that the user enter his own password:

$ echo "$(whoami) ALL=(ALL) ALL" | sudo tee /mnt/rpi/etc/sudoers.d/20_$(whoami)_superuser

User home folder and SSH access

Now we shall create the user’s home and add the SSH public key so we can login (it is assumed that you have a public RSA key under your home directory named ~/.ssh/id_rsa.pub change the name if it’s different):

$ sudo cp -R /mnt/rpi/etc/skel /mnt/rpi/home/$(whoami)
$ sudo chmod 0750 /mnt/rpi/home/$(whoami)
$ sudo mkdir -m 0700 /mnt/rpi/home/$(whoami)/.ssh
$ cat ~/.ssh/id_rsa.pub | sudo tee -a /mnt/rpi/home/$(whoami)/.ssh/authorized_keys
$ sudo chmod 0600 /mnt/rpi/home/$(whoami)/.ssh/authorized_keys
$ sudo chown -R 1000:1000 /mnt/rpi/home/$(whoami)

Setup Systemd for enabling SSH access and headless mode

Normally everything else should be correctly setup. However you might want to have a look at systemd configuration, mostly of interests are which default target is in use (for headless you want multi-user.target) and if the SSH service is part of the default target. What I did was the following (it also avoid creating the ubuntu user):

$ cd /mnt/rpi/lib/systemd/system
$ rm -f default.target
$ ln -s multi-user.target default.target
$ cd /mnt/rpi/etc/systemd/system/multi-user.target.wants
$ ln -s /lib/systemd/system/ssh.service ssh.service

(if the last command fails because the file already exist then it is all OK)

Start Ubuntu Server on Raspberry Pi

Now unmount the card and eject it: sudo umount /mnt/rpi. You can now safely insert the card in your Raspberry Pi 2 and boot it. It boots slower than with Raspbian, so be patient. Note that with all the above configuration, you do not need to boot with a keyboard or screen attached to your Raspberry Pi. Only an Ethernet cable and the power plug are necessary.

Now you need to find your newly installed Ubuntu Server on your network, the default hostname is ubuntu so you could always start with that (ssh $(whoami)@ubuntu) if it is not in conflict with another device of yours and if your router is clever enough to have updated the DNS resolver. Or else you need to scan your network for it. To scan your network you need to know your subnet (e.g. 192.168.1.0 with a netmask of 255.255.255.0) and have nmap installed on your computer (sudo dnf install nmap will work for Fedora, and it is as easy for Debian/Ubuntu-based distros as well, just replace sudo apt-get install nmap).

$ sudo nmap -sP 192.168.1.0/24

Of course you need to adapt the above command to your subnet. The “/24” part is the netmask equivalent of 255.255.255.0. I recommend running the above command with sudo because it will display the MAC address of all the discovered devices which will help you spot your Raspberry Pi as nmap is displaying the vendor attached to each MAC address. See for yourself in the example output:

Starting Nmap 6.47 ( http://nmap.org ) at 2015-07-19 20:12 CEST
(...)
Nmap scan report for ubuntu.lan (192.168.1.9)
Host is up (0.0060s latency).
MAC Address: B8:27:EB:1E:42:18 (Raspberry Pi Foundation)
(...)
Nmap done: 256 IP addresses (8 hosts up) scanned in 2.05 seconds

Now you can simply connect to your RPi using SSH:

ssh $(whoami)@192.168.1.9
Enter passphrase for key '~/.ssh/id_rsa':
You are required to change your password immediately (root enforced)
Welcome to Ubuntu 16.04.1 LTS (GNU/Linux 4.4.0-1017-raspi2 armv7l)

(...)

142 packages can be updated.
69 updates are security updates.

(...)

WARNING: Your password has expired.
You must change your password now and login again!
(current) UNIX password:

Now that you are authenticated and have access to your newly installed Ubuntu Server, it is time to upgrade it.

Upgrade Ubuntu Server to latest packages

The tool tmux should already be installed on your system (or do sudo apt install tmux), so use it to create a new session, so even if you get a network problem your session is not killed (simply do tmux attach)

$ tmux
$ sudo apt dist-upgrade
$ sudo systemctl reboot

Note: it is possible that unattended-upgrade kicks in before you can do the upgrade manually. Then wait an hour or more (depending on the speed of your internet connection and Micro SD card mainly) before doing the above steps. It is still worth while as the dist-upgrade command will perform more thorough upgrade (potentially removing deprecated packages or even downgrading some if necessary) but you will be in sync with the latest and greatest Ubuntu Server.

Picture credits: Photo of a Raspberry Pi board by me, see the website licensing policy. Ubuntu Circle of Friends logo is copyright by Canonical.

Ubuntu Core – Atomicity rough on the edges

View of a Harbour Terminal before Containers existed

Before the invention of containers, docker was a much more manual job. And that’s what I’m looking for my Raspberry Pi.

I’ve been recently trying to play with Ubuntu Core on my 2nd Raspberry Pi 2. I like the concept of a minimalist host with atomic updates and the possibility to run my services inside containers. In addition, snap looks like an interesting package system (and more than that). But this setup does not fit my use cases, it is perfect for repeatable testing and safe environment for deployment. However I need a system that is stable and safe, but which I can tinker with (modify a specific configuration or kernel in order to optimise its use or detect new devices) and which grow organically (depending on my free time).

Introduction to Ubuntu Core

So Ubuntu Core (or the Project Atomic) do appeal to me but they are too restrictive for my use cases. I need more freedom. Anyway, for those of you who could be interested in these projects here is a quick review of these systems with respect to day-to-day use as I’m not going to explain the philosophy of Core/Atomic neither of snap/atomic technologies.

Both Ubuntu Core (armhf variant) and CentOS Atomic Host (x86_64 variant) felt a bit rough and despite carrying the respective name Ubuntu and CentOS I had to reconsider how I am used to administer such boxes. A basic concept is that you install a core (or minimalist) OS and you cannot pretty much change it (most parts are read only), but it should have everything to run containers. For Atomic Host, there is no way to install additional packages, you need to add every other bit of software as a container. The big difference with Ubuntu Core is that you have snaps which allows to extend the core OS without having to install and configure containers manually. A snap package – once installed – feels more or less like if you just installed a deb package. But there is a big difference, they are like little containers or sandboxed process(es) already neatly packaged so they feel like a normal command, but a lot is going on behind the scene. Here is an example, I’ve installed `htop` on my Raspberry Pi 2, now I get two distinct results if I run it with a standard user or with the super user rights, a behaviour uncommon on a standard Linux installation:

Standard User Super user

Snap htop - standard user

Snap htop – Standard User

Snap htop - super user (via sudo) - see complete list of processes

Snap htop – Super User (via sudo)

As it is visible in the case of htop, by default when I run it, htop shows only processes where the owner is myself. This is not bad but it is different from traditional Linux distribution and so you need to get the habit to do `sudo htop` to see all processes, but take care you then run htop as super user, and htop allows you to kill processes, so be careful.

However, on Raspberry Pi (and probably other ARMv7 (aka armhf) platforms) there is a very limited amount of snap packages available yet. It is probably changing fast, but if you run today Ubuntu Core 16 you can’t install much snaps and need to rely on Docker containers for adding extra applications to the base system (more on that later, including its current limitations).

Basic system configuration absent

Let’s get back to the beginning and I mean by that right after the installation of Ubuntu Core upon first boot. I still had my keyboard and screen attached to it (and you need it in order to login with your Launchpad SSO login to create the first user). After the first boot, I was not offered the possibility to change the keyboard layout (I do not have a US layout) but at least on Ubuntu Core one can change it afterwards by editing the file `/etc/default/keyboard`, this feat is not possible on Atomic Host. Anyway, not such a big issue as I’m using my Raspberry Pi mostly (if not alway) via SSH then it does not matter anymore. Staying on the localization topic, both systems do not allow changing the locale (language, regional preferences, etc.). On Atomic Host it is set to US with the “peculiar” time and date format ;-) no offense! Similar fate on Ubuntu Core but they are using the C locale (which sadly for me also uses the US date/time format). Although I would anyway stick to the English language, I don’t like the regional choice and I don’t like the lack of choices here.

So let’s move back to an SSH connection, at least the keyboard layout is no longer an issue. But here comes the next one ;-) whenever I use SSH, the first thing I do is launch a new tmux session or attach to an existing one. I’m open to other solutions, so if the host only has screen or byobu, I’m also fine. However, Ubuntu Core for ARMv7 does not ship by default with any of them, nor are they any existing snap package. So there is no simple way (out of compiling from source and creating a snap package) to have safe SSH session.

Docker

Normally thanks to container technology, it is possible to install more or less everything I would need and with some clever alias it would seem much like a snap installation. So I just wanted to create a Dockerfile which reference Alpine Linux and install tmux and build the container using Docker. Not possible with Ubuntu Core and the Docker snap. According to the Docker snap information, I should be creating a folder under `$HOME/apps/docker/1.6.xxx` (eventhough Docker 1.11.2 was installed, weird), but it does NOT work. The AppArmor profile installed with Docker snap denies it. I’ve tried many different folders to no avail. This is not a misconfiguration of Ubuntu Core but rather one of the Docker snap packager, but the end result is that as a user Ubuntu Core on armhf is barely usable (lack of too many essential tool, no snap and cannot easily use Docker).

Other rough edges with Docker are that your default user does not belong to the Docker group, so you need to sudo every single Docker command. There is also no way to add yourself to the Docker group (as the group is defined in the standard file `/etc/group` which is on a read-only filesystem). Your default user is also not a real traditional user, it is an “extra user” (did not know this subtlety before this) so there are a bunch of things that are not compatible with it like you cannot launch a container to run as yourself (docker run -u $(whoami) ...) as you are not a standard user (no entry in `/etc/passwd`), you are an extra users (see in `/var/lib/extrausers/passwd`), but at least it would be possible to do it specifying your user ID (e.g. 1000)! This is all logic but “rough”. A corollary to the previous statements is that you cannot run Docker in user namespaces mode (unprivileged Docker container) as you cannot add yourself to the Docker group.

At least the Docker snap includes Docker Compose, that is cool. On Atomic Host, Docker is installed but not Docker Compose. As the file system is also read-only there is good way to install Docker Compose in a central place, it should be run within a container. At least on Atomic Host, it is very easy (as on standard Linux OS) to create Dockerfile and build them. So these limitations (of not having Docker Compose and some other tools) can be overcome with some efforts.

Final Try

My next steps was to download some scripts and tools. However both curl and wget are absent of the base installation. Even git is not available as a snap or on the core installation. There is no way to build a Dockerfile, so in the end to have tmux, curl, git, etc. I had to create a lxd container running Ubuntu just to get a normal OS (or from my laptop, I could download the scripts and via scp pushing them to Ubuntu Core). But then I simply prefer to run Ubuntu Server or Raspbian on my Raspberry Pi.

For me I’ve lost too much time on this already. Ubuntu Core seems nice I really like the principles and approach, but it is definitively missing too many basic tools, too rough for my taste and available free time.

Conclusion

I will install Ubuntu Server or CentOS 7 for armhf on my second Raspberry Pi 2. Ubuntu Core is perhaps still a bit too young (maybe it is specific to the armhf platform) and requires too much work for my needs. At least this experience made me understand that I don’t want a lock down and safe box, but I need flexibility.

Picture credits: Public Domain. The picture is part of the State Library of South Australia collection (see original B 4433 photo).

Getting docker-compose on Raspberry Pi (ARM) the easy way (updated)

I really like docker-compose, it has a simple language (YaML) to describe how to build and run a container, so you do not have to remember (or count on your history availability) the long `docker build ...` and `docker run ...` commands (and many others).

However, docker-compose is not (yet) available for Raspberry Pi or any other ARM architecture. (Update 2017-03-02: but we are getting there. A first series of patches to allow support has been merged in the master branch but is not yet released. However, it does not look like official releases of compose for ARM will be provided in the near future, but at least building them will become even easier.)

Our Hypriot friends have done a great job and do provide a binary version of it on their repositories. But I usually do not like to add 3rd party repositories and I finded the list of changes (patches) longer than I expected.

So I forked the official Docker Compose repository and did a few minimalistic changes in order to get a built of docker-compose for Raspberry Pi. I have created a Pull Request in the hope that it might get accepted and that ARMv7 be officially built. But while waiting for the review process to be triggered, here is how to do it for yourself.

Download the Project

As pre-requisite you need to have `git` (sudo apt-get install git) and `docker` (see my previous article) already installed on your platform.

Then get a copy of the project on your local Raspberry Pi.

$ git clone https://github.com/docker/compose.git
$ git checkout release

Now apply the following patch (Update 2017-03-02: soon when the master branch will be merged into the release one, these extra steps won’t be necessary):

$ cd compose
$ cp -i Dockerfile Dockerfile.armhf
$ sed -i -e 's/^FROM debian\:/FROM armhf\/debian:/' Dockerfile.armhf
$ sed -i -e 's/x86_64/armel/g' Dockerfile.armhf

Build and install docker-compose

To build the docker-compose binary, the procedure is rather simple. First you need to build the docker image which will be used to set-up the build environment. Second and last you need to run the container which will build docker-compose. At the end the binary will be available under the `dist` subfolder.

$ docker build -t docker-compose:armhf -f Dockerfile.armhf .
$ docker run --rm --entrypoint="script/build/linux-entrypoint" -v $(pwd)/dist:/code/dist -v $(pwd)/.git:/code/.git "docker-compose:armhf"

After several minutes you will get a binary file which you can then install on your system:

$ ls -l dist/
total 6816
-rwxr-xr-x 1 pi pi 6976500 Feb  8 11:41 docker-compose-Linux-armv7l
$ sudo cp dist/docker-compose-Linux-armv7l /usr/local/bin/docker-compose
$ sudo chown root:root /usr/local/bin/docker-compose
$ sudo chmod 0755 /usr/local/bin/docker-compose
$ docker-compose version
docker-compose version 1.11.0-rc1, build daed6db
docker-py version: 2.0.2
CPython version: 2.7.13
OpenSSL version: OpenSSL 1.0.1t  3 May 2016

Goodies: Install docker-compose bash autocompletion

Docker Compose provides autocompletion for bash. Installing it is as simple as doing:

$ sudo curl -L https://raw.githubusercontent.com/docker/compose/$(docker-compose version --short)/contrib/completion/bash/docker-compose -o /etc/bash_completion.d/docker-compose

Conclusion

If you might wonder why I stated “the easy way” in my title, well if you want to master Docker, you ought to consider the above easy :-)

Of course in our field of work nothing is as simple as a mouse click, especially when you need to create something that is not (until today) provided out of the box. If you want real easy and are using a Debian-based Linux OS, then just use the repository provided by Hypriot. If you’re not using Debian, you might want to consider the above described approach.

GPL https://commons.wikimedia.org/wiki/File:Sablier-temps-icone-5376-128.png

A Time Server in a Container – Part 1

To learn Docker in details I decided to use it to run a local time server using ntpd from the ntp.org project.

I have used an incremental approach where I started with an easy setup and then increased the challenges either to improve the time server or to better understand Docker.

Getting Started

So why ntpd and not <put your favourite time server>

I know ntpd for having configured it many times in the past 10 years. So I wanted to start with it first to quickly get time synchronisation working.

Which platform?

I have a Raspberry Pi (abbreviated RPi from now on) which serves as DHCP server and local forwarding and caching of DNS queries for my LAN. It had early support for Docker back in October when I started my experiment which added a bit of spice to it.

Getting Docker on a Raspberry Pi

There are many ways to get Docker running on your RPi. You could get the Hypriot OS Linux distribution which has everything setup nicely for running Docker containers. You can compile Docker on your platform of choice (which I had to do to squash a few early adopters’ bugs). You can install a tarball containing the binaries for your platform. But if running Raspbian Jessie – like I was – you can today just include Docker’s own repository and install a binary version using apt-get. Make sure your Kernel is recent (Docker requires 3.10 at least, but if you have a properly updated Raspbian it should be running 4.4 at the time of writing).

You can follow Docker’s installation guide for Debian, but by default it will install you the x86_64 Docker repository. As hinted in the documentation, for other architecture you need to use the [arch=...] clause. In addition, Docker provides a specific variant of the package for Raspbian. So for Raspbian Jessie, use the following entry for your docker.list file:

deb [arch=armhf] https://apt.dockerproject.org/repo raspbian-jessie main

Continue to follow the Docker guide, including how to set up non-root access to a specific user.

Creating a Docker image for ntpd

Create a specific folder somewhere on your Raspberry Pi storage (e.g. mkdir -p ~/projects/docker/ntpd) and create a file Dockerfile.armhf (I use the extension .armhf so I can have distinct Dockerfiles for each platform I use) with the following content:

FROM armhf/ubuntu:16.04
RUN apt-get update \
    && apt-get install -y --no-install-recommends ntp \
    && apt-get clean -q \
    && rm -Rf /var/lib/apt/lists/*
ENTRYPOINT ["/usr/sbin/ntpd"]

Note: This file as well as newer version of it and instructions to build and run the container are available on my GitHub ntp container project. In the rest of this blog post, I’m only going to detailed how I approach running the container and solve problems.

The first line state that the base image for the container will be Ubuntu 16.04 (the specific variant for RPi architecture). The second until the fifth lines are commands we execute on top of the base image, basically it updates the packages list to install the latest version of ntpd with the smallest dependencies, and it removes any cached or temporary files. So we minimise the size of the image on disk. Finally the last line, is the command that will be executed by Docker when instructed to run the container. I have used the term ENTRYPOINT because it allows me – while experimenting – to change the list of parameters I send to ntpd when I create the container and run it. This gives me flexibility with testing different parameters.

I picked up Ubuntu as the base image because it has sane default for the ntpd configuration file. It will use the NTP Pool project and the configuration is secured by default. Note that other base images could have also worked and have also sane default. I could have used Alpine Linux base image, it is really compact and lightweight, would have been perfect for a small platform like a Raspberry Pi, but it does not provide the ntpd packages from the NTP project which I wanted to start with. It only supports OpenNTPD (which does not support leap seconds, so it was a no go for me) and Chrony (which could be a good alternative but as I mentioned before I wanted to first experiment with Docker not learn yet another NTP application).

Let’s build the container image (I named the image “article/armhf/ntpd” and tagged it with the current date, but just name it like you want):

$ docker build -f Dockerfile.armhf -t article/armhf/ntpd:20170106.1 .

Running the NTP container

We are now going to spawn an instance of the container image in foreground to see what is going on and to notice any error:

$ docker run --rm -it article/armhf/ntpd:20170106.1 -n
 6 Jan 14:03:30 ntpd[1]: ntpd 4.2.8p4@1.3265-o Wed Oct  5 12:38:30 UTC 2016 (1): Starting
 6 Jan 14:03:30 ntpd[1]: Command line: /usr/sbin/ntpd -n
 6 Jan 14:03:30 ntpd[1]: Cannot set RLIMIT_MEMLOCK: Operation not permitted
 6 Jan 14:03:30 ntpd[1]: proto: precision = 1.198 usec (-20)
 6 Jan 14:03:30 ntpd[1]: Listen and drop on 1 v4wildcard 0.0.0.0:123
 6 Jan 14:03:30 ntpd[1]: Listen normally on 2 lo 127.0.0.1:123
 6 Jan 14:03:30 ntpd[1]: Listen normally on 3 eth0 172.17.0.2:123
 6 Jan 14:03:30 ntpd[1]: Listen normally on 4 lo [::1]:123
 6 Jan 14:03:30 ntpd[1]: Listening on routing socket on fd #21 for interface updates
 6 Jan 14:03:30 ntpd[1]: start_kern_loop: ntp_loopfilter.c line 1126: ntp_adjtime: Operation not permitted
 6 Jan 14:03:30 ntpd[1]: set_freq: ntp_loopfilter.c line 1089: ntp_adjtime: Operation not permitted
 6 Jan 14:03:31 ntpd[1]: Soliciting pool server 193.200.241.66
 6 Jan 14:03:32 ntpd[1]: Soliciting pool server 90.187.7.5
 6 Jan 14:03:32 ntpd[1]: adj_systime: Operation not permitted
 6 Jan 14:03:32 ntpd[1]: Soliciting pool server 129.70.132.37
 6 Jan 14:03:33 ntpd[1]: Soliciting pool server 85.25.210.112
 6 Jan 14:03:33 ntpd[1]: Soliciting pool server 31.25.153.77
 6 Jan 14:03:34 ntpd[1]: Soliciting pool server 178.63.9.212
 6 Jan 14:03:34 ntpd[1]: Soliciting pool server 193.22.253.13
^C 6 Jan 14:03:40 ntpd[1]: ntpd exiting on signal 2 (Interrupt)

We have a few errors (Operation not permitted) which I have highlighted above, one is about RLIMIT_MEMLOCK (this is about resetting the limit of the maximum locked-in-memory address space, ntpd uses it to forbid its main process from swapping to limit jitter) and the other ones are about ntp_adjtime and adj_systime (both are used by ntpd to interface with the Kernel and adjust the system time).

By default ntpd is running as root user, so it should have enough privilege for these operations. In addition, even though Docker supports running unprivileged containers (i.-e. the root user inside the container is mapped to a normal user on the host, this is based on user namespaces (see namespaces(7)), this is not the default Docker configuration, so my root user inside the container is the root user outside the container (and if Docker would be configured to use user namespace, they are not compiled in the Raspberry Pi foundation Kernel. So it is at the moment not possible to use that feature on a Raspberry Pi without some extra efforts, but I will details this in a future article).

In order to implement basic privilege limitations of container, Docker can use various security feature of the Linux Kernel to limit the container accessing certain sensible Kernel calls, the most notable ones are Linux Capabilities (since Docker 1.2), Linux SECCOMP filtering (since Docker 1.10, but better use Docker 1.12+ as pervious default SECCOMP profiles were in conflict with the Linux Capabilities management of Docker. In addition, the Raspbian Kernel (version 4.4 as of writing) has not the built-in support for SECCOMP filtering, so this functionality is not usable on Raspberry Pi, unless you compile your own Kernel) and Linux MAC (like SELinux or AppArmor, but none of them are available on Raspberry Pi without recompiling your own Kernel and installing the user space tools). So Docker on Raspberry Pi can only use Linux Capabilities as security feature.

By default Docker provides each container with a reasonable set of capabilities (see Docker documentation on capabilities). If you check both documentation (the Linux Capability manual and the Docker runtime privileges doc), you will find out that basically our container is missing the CAP_SYS_RESOURCE and CAP_SYS_TIME capabilities. Now there are 2 ways to add them, most online guide would tell you that when you run into “operations denied” errors, just add the --privilege flag to the docker run command line and it will be fixed, that’s the first way and it’s the wrong approach (sure it works, but it is like deactivating SELinux because you are not allowed to perform an operation). The other way is to add the missing capabilities to the container. This can be done by using the --cap-add flag. That’s what I’m going to show now:

$ docker run --rm -it --cap-add SYS_RESOURCE --cap-add SYS_TIME article/armhf/ntpd:20170106.1 -n
 7 Jan 11:19:24 ntpd[1]: ntpd 4.2.8p4@1.3265-o Wed Oct  5 12:38:30 UTC 2016 (1): Starting
 7 Jan 11:19:24 ntpd[1]: Command line: /usr/sbin/ntpd -n
 7 Jan 11:19:24 ntpd[1]: proto: precision = 1.823 usec (-19)
 7 Jan 11:19:24 ntpd[1]: Listen and drop on 0 v6wildcard [::]:123
 7 Jan 11:19:24 ntpd[1]: Listen and drop on 1 v4wildcard 0.0.0.0:123
 7 Jan 11:19:24 ntpd[1]: Listen normally on 2 lo 127.0.0.1:123
 7 Jan 11:19:24 ntpd[1]: Listen normally on 3 eth0 172.17.0.2:123
 7 Jan 11:19:24 ntpd[1]: Listen normally on 4 lo [::1]:123
 7 Jan 11:19:24 ntpd[1]: Listening on routing socket on fd #21 for interface updates
 7 Jan 11:19:25 ntpd[1]: Soliciting pool server 213.95.21.43
 7 Jan 11:19:26 ntpd[1]: Soliciting pool server 134.119.8.130
 7 Jan 11:19:26 ntpd[1]: Soliciting pool server 46.4.32.135
 7 Jan 11:19:27 ntpd[1]: Soliciting pool server 213.136.86.203
 7 Jan 11:19:27 ntpd[1]: Soliciting pool server 178.63.9.212
 7 Jan 11:19:27 ntpd[1]: Listen normally on 7 eth0 [fe80::42:acff:fe11:2%6]:123
 7 Jan 11:19:27 ntpd[1]: new interface(s) found: waking up resolver
 7 Jan 11:19:27 ntpd[1]: Soliciting pool server 46.165.212.205
 7 Jan 11:19:28 ntpd[1]: Soliciting pool server 109.239.58.247
 7 Jan 11:19:28 ntpd[1]: Soliciting pool server 131.188.3.221
 7 Jan 11:19:28 ntpd[1]: Soliciting pool server 78.46.189.152
 7 Jan 11:19:28 ntpd[1]: Soliciting pool server 195.50.171.101
^C  7 Jan 11:22:40 ntpd[1]: ntpd exiting on signal 2 (Interrupt)

To make sure this is working, first verify that you do not have any time synchronisation service running: $ sudo systemctl stop systemd-timesyncd ntp.

Then change the system time by shifting it by 5 seconds: $ sudo date -s "5 seconds".

Check that your system clock is now off by 5 seconds:

$ ntpdate -q time1.google.com
server 216.239.35.0, stratum 2, offset -5.002284, delay 0.14117
18 Jan 11:27:55 ntpdate[5217]: step time server 216.239.35.0 offset -5.002284 sec

Start the container in the background this time: $ docker run --name ntpd --detach --restart always --cap-add SYS_RESOURCE --cap-add SYS_TIME article/armhf/ntpd:20170106.1 -g -n

Wait a few seconds and query again the network time using the above ntpdate command. The offset should now be below 5 seconds and probably close to 0 second.

You have now a ntp service running inside a container and synchronising your system clock using Internet time servers from the NTP pool project. If you want to stop the experiment here and restore your system, you need to stop the container ($ docker stop ntpd) and block it from restarting at next boot ($ docker update --restart=no ntpd) and perhaps reboot so that you reactivate the default time synchronisation service.

But if you want to keep experimenting or let the container do its job of time synchronisation, you should make sure to deactivate any other time synchronisation mechanisms to avoid conflicts if you want to keep your NTP container running:

$ sudo timedatectl set-ntp false 
$ sudo systemctl disable ntp chronyd
$ sudo systemctl mask systemd-timesyncd
$ sudo systemctl stop systemd-timesyncd ntp chronyd

Foreword about Time and NTP on a Raspberry Pi

The Raspberry Pi (at the time of writing this applies to all models) has no real time clock (RTC) module on its board. A RTC is a small oscillator (e.g. quartz, like in your electronic wristwatch) plus some electronic to keep track of time and a battery (or equivalent). Those RTCs help a system keep track of time when there are off and in the early phases of boot. On a standard desktop or laptop computer the motherboard has an RTC. Many oscillators are not particularly accurate (low quality) with non-stable frequencies which can depend on external factors such as room temperature. It is possible to add a RTC module to the Raspberry Pi (I will have a detailed article on that soon), but without RTC you need a network connection in order for the RPi to know the current time.

On Linux, the kernel manage 2 clocks, the hardware clock (which is based on the RTC) and the system clock (which is the clock used by the system to query/set the time, this clock is ticking using a clocksource such as a CPU/SoC timer, Kernel jiffies, etc.). On boot, the current time is read from the hardware clock and is used to initialise the system clock. The system clock is then driven by the ticks from the selected clocksource and the time read at boot from the hardware clock. Usually, on shutdown, many Linux distribution are configured to store the system clock in the hardware clock.

The Raspberry Pi has maybe no hardware clock but it has a clock source (current clocksource on Raspberry Pi 2, other models may differ):

[    0.000000] arm_arch_timer: Architected cp15 timer(s) running at 19.20MHz (phys).
[    0.000000] clocksource: arch_sys_counter: mask: 0xffffffffffffff max_cycles: 0x46d987e47, max_idle_ns: 440795202767 ns
[    0.000010] sched_clock: 56 bits at 19MHz, resolution 52ns, wraps every 4398046511078ns
[    0.000032] Switching to timer-based delay loop, resolution 52ns

So if the time is set on boot, the OS can keep track of the time even if disconnected and as long as it is up and running. Systemd 213 introduced a new service systemd-timesyncd which is a SNTP client implementation, so it is able to query a network time server and set the OS system time based on the response. This service has an extra feature for systems without RTCs, it saves the system time on disk on shutdown. So when your Raspberry Pi reboot, it can use the stored time to initialise its system time while waiting for more accurate time once the network is ready. Sure during the early boot process the system time might be off by a couple of seconds but it is better than nothing.

As for NTP, it is adjusting the system time based on responses from network time servers or when offline based on the clock drift NTP has been calculating for the current clock source. This means that if you run NTP, it is good to let it run at least 24hours so it can accurately measure the clock source drift and then it can compensate it during network disconnection periods. In addition, NTP will regularly sync back the system time to the hardware time to correct the RTC clock. In up coming articles, we will see how we can add a RTC to our Raspberry Pi and how to overcome the challenges of allowing RTC access to NTP inside the container and increasing the clock accuracy. In addition, we will see how we can become an NTP network time server for the local LAN.

What did we learn about Docker

First, we practiced the basics of building a container (the Dockerfile syntax and docker build ... command), running a container in foreground or background mode (docker run ...) and controlling the running container (docker stop ... and docker update ...). I did not yet elaborate much on the capabilities of these commands offer but it is my intention that we will discover them further as we progress we the experiment.

Second, we learned about some of Docker security measures (like Linux capabilities) and limitations of the current Raspberry Pi platform (like no SECCOMP filtering or AppArmor or user namespace), and we also learned how to extend a container permission by adding new capabilities.

Next to learn will be how to provide access to specific devices (such as an RTC), how to do simple monitoring (checking the container is running, its resource usage and logs), how to increase its security (dropping unnecessary capabilities, using the other security measures). With this quest we will learn a lot on the Raspberry Pi as well, we will add an RTC module, we will compile our own Kernels in order to add new security functions and improve the OS jitter, etc.

jcberthon

September 19, 2016

I’ve been busy in the past months and did not update much this  site. However, here are a few hints about some upcoming articles:

  • Install and run Docker on Raspberry Pi
  • Run ntpd to provide time to your local network (Docker), I will talk about the pros and cons of running that on a Raspberry Pi.
  • Run dnsmasq to provide a DHCP and DNS resolver on your local network (Docker)
  • This website is now using TLS so that the URL has changed to HTTPS. The certificate authority is Let’s Encrypt which my hosting provider is offering easily.
  • I have a few more unfinished articles which might still take some time to complete, topics are ranging from: LXC advance usage and tips&tricks, LXD first usage, Linux SSH key management, SSD caching on Linux, FreeBSD/Arch Linux on Raspberry Pi 2.
  • Last but not least, I ought to announce why I’ve been so busy in the past months (and years). I will give some hints soon.

Listing open sockets on Linux

Everyone knows probably the `netstat’ command already, so I will only give my favourite options with it (`sudo’ is optional here, but if you want to see the process attached to the socket, you will want to use it):

  • List all (TCP) listening sockets:
$ sudo netstat -tlpen
  • List all TCP listening sockets and all UDP sockets:
$ sudo netstat -tulpen
  • List all sockets (TCP, UDP, Unix, etc.):
$ sudo netstat -apen

Now, you may not know it, but `netstat’ is deprecated. It is still installed on most Linux distribution that I know of, but you should move to its replacement command `ss’. The good thing is that all of the above options are still working with this command for the same results, the output is however differently formatted, hence script using the output of netstat command might get broken.

$ sudo ss -tlpen
$ sudo ss -tulpen
$ sudo ss -apen

Finally because everything (or almost) is a file on Unix/Linux, you can use the `lsof’ command – which provides listing of open files – to query the list of open sockets. It is yet another method, but I like it because the output is more condensed and contains usually what I want. Thanks to Michael Crosby (@crosbymichael) for sharing his knowledge.

So the first command is listing all IPv4 TCP and UDP sockets:

$ sudo lsof -Pnl +M -i4

The next one does the same thing for IPv6 TCP and UDP sockets:

$ sudo lsof -Pnl +M -i6

The combined output of the previous commands is equivalent to the output of `sudo ss -atupen’.

Setting Shared Folder Compression on Synology NAS (BTRFS)

disk-managementIf you have a Synology NAS that supports BTRFS (mostly the intel based NASes) and that you decided to use BTRFS, there are a couple of shared folders automatically created for you (like the “homes” or “video”) but they don’t have the “compression” option set, and trying to edit the shared folder in the administration GUI does not help, the check box is grayed out, meaning it is not possible.

BTRFS compression is quite “clever”. It has some heuristics that evaluate if a file is worth being compressed or not so it won’t try to compress the 1GB video of your toddlers playing together which is a waste of time given that the compression achieved might not be visible. But anyway, even if BTRFS is “clever” it does not mean that if you have a folder named video that you should consider using compression. Simply just don’t do it.

For folders with mixed data like “homes” (which is the shared folder for all user home directory) you might have wished Synology would have activated the compression. Or if you forgot to tick the check box once creating the volume, you might want to change it. But there is a way to change that. It is not guaranteed that it won’t break your NAS, especially if you do execute the wrong command, but if you don’t mind the risk then follow on.

BTRFS allows you to change the option on a live system without troubles. However, existing data on the shared folder won’t be compressed after activating the option, you would need to copy again the existing data to take benefits for it or defragment it using the compression option (-c see man btrfs-filesystem), however depending on your amount of data this might take a while.

To do it, you will need to activate SSH remote connection (try to limit it to your local LAN and do not open it to the internet unless you know what you are doing). You will need to connect via SSH using the administrator account (admin by default, but you would be wise to change the default name). I trust you know how to activate SSH on your NAS box, if not I would recommend you don’t try to do the rest of this article, ask someone who might know it! From a Linux or macOS (OS X) system, just open a terminal and type:

$ ssh <admin>@<hostname>

(and replace admin by the correct user account and hostname by your NAS hostname or IP address)

On Windows, you could use putty and achieve a similar fate.

Once connected, you need to know your BTRFS volume path:

$ mount -t btrfs
/dev/mapper/vg1-volume_1 on /volume1 [...]

In the above example, it is /volume1. Now you should have a BTRFS subvolume (think of it as a BTRFS internal sub partition which Synology uses to define shared folders) called “homes” (or whatever other shared folder you would like to tweak):

$ sudo btrfs subvolume list /volume1
[...]
ID 259 gen 1688 top level 257 path homes
[...]
ID 264 gen 1686 top level 257 path video

So here we have made sure that the “homes” shared folder is located on /volume1/homes. Now let us check its properties:

$ sudo btrfs property get /volume1/homes
ro=false

Here we can confirm that compression is not set (note that compression was not set as a mount option, nor at the volume root). To activate is, you need to create the “compression” property, you can choose either zlib or lzo. The former compress better but is slower, the latter is fast but as much lower compression ratio. I personnaly choose lzo:

$ sudo btrfs property set /volume1/homes compression lzo

You can use again the previous command to get the properties for the volume and see if it was set. Now you can copy your files to the shared folder, and BTRFS will try to compress them if it thinks it makes sense.

Picture credits: Picture is from the KDE project. The original materials is licensed under GNU LGPLv3.

LXC unprivileged containers on Ubuntu 14.04 LTS

LXC Containers

LXC Containers

I’ve been toying around with containers using LXC, and I decided to use this technology to do some performance technique for a PHP web application. So I set-up a VM with Ubuntu 14.04 LTS and decided to use containers to test various stacks, e.g. MySQL 5.5 + PHP 5.6 + Nginx 1.4.6 vs MariaDB 10 + PHP 7 RC3 + Nginx 1.9 (and all possible combinations), and HTTP/2, latencies, etc. On top of this experiment I wanted to learn more about Ansible.

Therefore my needs were the following:

  • One account to rule them all: one user account on the VM with sudo privileges. Used by Ansible to administer the VM and its containers.
  • Run each container as unprivileged ones.
  • Each container is bound to one unique user.

As of Ansible 1.8, LXC containers are supported, but as second class citizen: this extra module needs some dependency to work (which are not in the default repositories) and it would hide me how to configure the LXC containers, so I discarded this solution.

Unprivileged containers using the LXC command line UI

Standard containers usually run as root and the root user within that container maps to the root user outside of the container. This is not exactly how I think of a container.

Real Container

Real Container

In my view, a container is meant to hold something which I don’t want to leak out of the container, a chroot on steroids! So although I was really looking forward for containers on Linux, the first implementation were not matching my expectations or use cases.

Being able to run unprivileged containers is one of the great thing which finally decided me to check those containers on Linux, they finally cut the mapping between the privileged users on the host and those in the container.

Unprivileged containers are not as easy to set-up as normal fully privileged containers and you have to accept that you need to download an image from some website you need to trust. Not ideal, but resources about how to build an image for an unprivileged container are really scarce online, and I decided to first try it with the images, and then once I master LXC, to try building my own images.

Setting up LXC unprivileged containers require a few more packages, especially for the user and group IDs mapping, some preliminary account setup and giving LXC proper access to where you store the LXC containers data in your home folder. I did all of this, including setting up ACLs for the access. But when I hit lxc-start (...) it all failed!!!

Container_Failuer

Container in Distress

What I did after all preliminary configurations was:

$ sudo -H -u dbusr lxc-create -t download -n mysql55 -- --dist ubuntu --release trusty --arch amd64
$ sudo -H -u dbusr lxc-start -n mysql55 -d
lxc-start: lxc_start.c: main: 344 The container failed to start.
lxc-start: lxc_start.c: main: 346 To get more details, run the container in foreground mode.
lxc-start: lxc_start.c: main: 348 Additional information can be obtained by setting the --logfile and --logpriority options

And the lxc-start command failed miserably.

Why? A bit of background first, I will try to describe at a high level how containers “contain” on Linux. LXC should not really be compared to Solaris Zones or FreeBSD Jails. LXC uses Linux Control Groups (cgroups) to contain (allow/restricting/limiting) access by some process to some resources (e.g. CPU, memory, etc.). When one creates a “local” session on Ubuntu 14.04 LTS (and this is still valid on the up-coming Ubuntu 15.10 as of writing), such as login through the console or via SSH, Ubuntu allocates for the user different control groups controllers which can be viewed by doing:

$ cat /proc/self/cgroup
12:name=systemd:/user/1000.user/4.session
11:perf_event:/user/1000.user/4.session
10:net_prio:/user/1000.user/4.session
9:net_cls:/user/1000.user/4.session
8:memory:/user/1000.user/4.session
7:hugetlb:/user/1000.user/4.session
6:freezer:/user/masteen/0
5:devices:/user/1000.user/4.session
4:cpuset:/user/1000.user/4.session
3:cpuacct:/user/1000.user/4.session
2:cpu:/user/1000.user/4.session
1:blkio:/user/1000.user/4.session

Each line is a type of control group controller assigned to the user. For example the line about memory concerns the Memory Resource Controller which can be used for things such as limiting the amount of memory a group of process can use.

So when I run my script via Ansible, Ansible first establish a SSH session using my “rule-them-all” user, an SSH session is considered “local” and PAM is triggering systemd-logind to create automatically cgroups for my user shell process. Yes, even-though Ubuntu 14.04 LTS is still using upstart for init system, it has already a few dependencies on systemd! Now my “rule-them-all” user, when he is using sudo to execute commands as another user (be it root or one of my container users), the executed command is not considered as a “local” session for the sudo-ed user. So no cgroups are created for the new process, and it actually inherit the cgroups of the callee. This is easily visible by doing, you can see that the username and UID did not change despite the command being run as another user:

$ sudo -u userdb cat /proc/self/cgroup
12:name=systemd:/user/1000.user/4.session
11:perf_event:/user/1000.user/4.session
10:net_prio:/user/1000.user/4.session
9:net_cls:/user/1000.user/4.session
8:memory:/user/1000.user/4.session
7:hugetlb:/user/1000.user/4.session
6:freezer:/user/masteen/0
5:devices:/user/1000.user/4.session
4:cpuset:/user/1000.user/4.session
3:cpuacct:/user/1000.user/4.session
2:cpu:/user/1000.user/4.session
1:blkio:/user/1000.user/4.session

This usually does not really matter, unless you are LXC and you use cgroups heavily!! So what happened is that lxc-start wanted to write to the various cgroups to create the container. But lxc-start was called by the user dbuser (via the sudo command), however the cgroups it inherited were from my “rule-them-all” user and obviously (and thankfully) dbuser does not have the right to change the cgroups of my “rule-them-all” user. So lxc-start failed due to some permission denied:

lxc_container: cgmanager.c: lxc_cgmanager_create: 299 call to cgmanager_create_sync failed: invalid request
lxc_container: cgmanager.c: lxc_cgmanager_create: 301 Failed to create hugetlb:mysql55
lxc_container: cgmanager.c: cgm_create: 646 Error creating cgroup hugetlb:mysql55
lxc_container: start.c: lxc_spawn: 861 failed creating cgroups
lxc_container: start.c: __lxc_start: 1080 failed to spawn 'mysql55'
lxc_container: lxc_start.c: main: 342 The container failed to start.

Getting Ubuntu 14.04 LTS ready to run our Unprivileged Containers

I did not manage to solve the problem on Ubuntu 14.04 LTS in a first attempt. But after some extra steps (which I will details below), I made it work on the up-coming Ubuntu 15.10 (still alpha) release! So revisiting my Ubuntu 14.04 LTS setup, I identified the required packages needing an upgrade: kernel, lxc and cgmanager (and a few dependencies). For the Kernel, I’ve used the latest Ubuntu LTS Enablement Stack and upgraded to kernel 3.19. For the other packages, despite my aversion for 3rd party repositories, I decided to trust the Ubuntu LXC team (they are the ones who do the work to get LXC/LXD in Ubuntu in the first place) and their LXC PPA. So the commands were:

$ sudo apt-get install --install-recommends linux-generic-lts-vivid
$ sudo apt-add-repository ppa:ubuntu-lxc/lxc-stable
$ sudo apt update
$ sudo apt full-upgrade
$ sudo apt-get install --no-install-recommends lxc lxc-templates uidmap libpam-cgm

The lxc-templates package are necessary for me as I’m using the “download” template. The uidmap is mandatory for unprivileged containers. And libpam-cgm is necessary to resolve my problem, it is a PAM module for the cgmanager which is the Control Group Manager daemon (installed as a dependency to LXC on Ubuntu).

Now armed with this updated kernel and container stack it is time to present you the extra setup steps I was talking about. We need first to update either (or both depending which method you use) the PAM configuration for sudo or su. I will show the extra line for the former, they should apply for the later if you wish to use it. You need to edit as root the file /etc/pam.d/sudo and add the following lines after the line ‘@include common-session-noninteractive‘:

session required pam_loginuid.so
session required pam_systemd.so class=user

The above 2 lines will register the new session created by sudo with the systemd login manager (systemd-logind). That’s the guy we wanted notified so that the creation of the cgroups for our user can now work. I’m still scouring the internet for the exact explanation of how this work. If I find it, I will probably write another post with the information.

If I would simply use su -l userdb or sudo -u userdb -H -s, I would just have to execute the following:

$ sudo cgm create all $USER
$ sudo cgm chown all $USER $(id -u) $(id -g)
$ cgm movepid all $USER $$

This will create all cgroups under the user $USER which is userdb. It will then set the owner of these new cgroups to the UID and GID of the userdb. And the last command move the SHELL process ($$) within these new cgroups. Then, the rest is trivial:

$ lxc-start -n mysql55 -d

And it is working. And if you want to run those commands using sudo, this is how you do it:

$ sudo -H -i -u userdb bash -c 'sudo cgm create all $USER; sudo cgm chown all $USER $(id -u) $(id -g)'
$ sudo -H -i -u userdb bash -c 'cgm movepid all $USER $$; lxc-start -n mysql55 -d'

If you close later the SSH session and you reconnect to it, you have to run the 2 commands again, even though it might display some warning that the paths or what-not are already existing. I still have those pesky warning and extra commands which I’m not sure why I still have.

Conclusion

While trying to run unprivileged containers on Ubuntu 14.04 LTS I’ve met several problems which I could only solved by using the latest LXC and CGManager packages and some special PAM configuration for which I’m not 100% sure of the impact. So unprivileged LXC containers on Ubuntu 14.04 LTS is still quite rough.

But during my journey with LXC, I’ve found an even easier way to create unprivileged LXC containers, without touching PAM, but still requiring the latest LXC and CGManager packages. This solution is based on LXD, the Linux Containers Daemon. There is a really good getting started guide by Stéphane Graber, the man behind LXD. I’m exploring this avenue at the moment and will report soon on this very blog.

All of this make me really look forward to the next Ubuntu LTS due to next Spring, the newer LXC and the under-heavy-development LXD will be part of Ubuntu 16.04 LTS and this could give a really great experience out-of-the-box with Linux Containerisation powers.

Picture credits: LXC Containers is based on a Public Domain photo of an unknown author. Real Container by Petr Brož, licensed under CC BY-SA 3.0 via Wikimedia Commons. Container in distress is licensed under a CC-BY 2.0 license by the New Zealand Defence Force.