Discovering SmartOS

Revolutionize the datacenter: ZFS, DTrace, Zones, KVM

What 22TB looks like.

It has been a long and interesting weekend of fixing computers.
Adopt the pose: sit cross-legged on the floor surrounded by 9 hard-drives – wait, I need another one, make it 10 hard-drives – and the attendant spaghetti of SATA cables and plastic housings and fragments of case.
Funnily enough, the need for screwdrivers has reduced over the years, albeit more than compensated for by the cost of a case alone. I’m sure it never used to make for such a sore back, either…
Anyway. Amidst the turmoil of fixing my main archive/work/backup server, I discovered a new OS.
For a few years now, I’ve been fond of ZFS – reliable as a brick, convenient as anything to use; I choose my OSes based on their ability to support ZFS, amongst other things. Just a quick
zpool create data /dev/ada1 /dev/ada2
zfs create data/Pictures
and that’s it: a new pool and filesystem created; one more one-liner adds NFS sharing… Not a format or a mount in sight.
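That sharing one-liner, for the record, is just another property on the dataset; something along these lines, using the same filesystem as above:
zfs set sharenfs=on data/Pictures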
Of course, Linux has not been able to include ZFS in the kernel due to licensing considerations, so the various implementations (out-of-tree kernel module; user-space FUSE module) have been less than desirable. So I’ve been using FreeBSD as my server operating system of choice. The most convenient way to control a plethora of virtual machines on a FreeBSD host seems to be VirtualBox – rather large and clunky these days.
However, a couple of weeks ago I stumbled across SmartOS, a new-to-me OS combining ZFS, DTrace and a Solaris/Illumos kernel, with both its own native Zones and Linux’s KVM virtualization.
There have been a few steps in this direction previously – the most memorable being Nexenta, an OpenSolaris/illumos kernel with Debian packaging and a GNU toolchain. That was a nice idea, but it lacked virtualization.
So, this weekend, with a storage server box rebuilt (staying with FreeBSD) and a whole new machine on which to experiment, I installed SmartOS.
Overall, it’s the perfect feature blend for running one’s own little cloud server. ZFS remains the filesystem of choice, DTrace has yet to be experimented with, and KVM is a breeze, mostly because Joyent provide their own semi-installed OS images to work from (think: Docker, but without the Linux-specificity). The vmadm command shares a high-level succinctness with the zfs tools: just import an image, write a JSON config file describing the guest VM, create an instance, and it’s up and running with a VNC console before you know it.
There’s one quirk that deserves special note so far. If you wish to use a guest VM as a gateway, e.g. via VPN to another network, you have to enable DHCP and IP spoofing and list the private netblocks it will forward for, in the NIC section of the VM config file:
      "allow_dhcp_spoofing": "true",
      "allow_ip_spoofing": "true",
      "allowed_ips": [ "192.168.99.0/24" ]
[root@78-24-af-39-19-7a ~]# imgadm avail | grep centos-7 
5e164fac-286d-11e4-9cf7-b3f73eefcd01 centos-7 20140820 linux 2014-08-20T13:24:52Z 
553da8ba-499e-11e4-8bee-5f8dadc234ce centos-7 20141001 linux 2014-10-01T19:08:31Z 
1f061f26-6aa9-11e4-941b-ff1a9c437feb centos-7 20141112 linux 2014-11-12T20:18:53Z 
b1df4936-7a5c-11e4-98ed-dfe1fa3a813a centos-7 20141202 linux 2014-12-02T19:52:06Z 
02dbab66-a70a-11e4-819b-b3dc41b361d6 centos-7 20150128 linux 2015-01-28T16:23:36Z 
3269b9fa-d22e-11e4-afcc-2b4d49a11805 centos-7 20150324 linux 2015-03-24T14:00:58Z 
c41bf236-dc75-11e4-88e5-038814c07c11 centos-7 20150406 linux 2015-04-06T15:58:28Z 
d8e65ea2-1f3e-11e5-8557-6b43e0a88b38 centos-7 20150630 linux 2015-06-30T15:44:09Z 

[root@78-24-af-39-19-7a ~]# imgadm import d8e65ea2-1f3e-11e5-8557-6b43e0a88b38 
Importing d8e65ea2-1f3e-11e5-8557-6b43e0a88b38 (centos-7@20150630) from "https://images.joyent.com" 
Gather image d8e65ea2-1f3e-11e5-8557-6b43e0a88b38 ancestry 
Must download and install 1 image (514.3 MiB) 
Download 1 image [=====================================================>] 100% 514.39MB 564.58KB/s 15m32s 
Downloaded image d8e65ea2-1f3e-11e5-8557-6b43e0a88b38 (514.3 MiB) 
...1f3e-11e5-8557-6b43e0a88b38 [=====================================================>] 100% 514.39MB 38.13MB/s 13s 
Imported image d8e65ea2-1f3e-11e5-8557-6b43e0a88b38 (centos-7@20150630) 
[root@78-24-af-39-19-7a ~]# 

[root@78-24-af-39-19-7a ~]# cat newbox.config 
{
  "brand": "kvm",
  "resolvers": [
    "8.8.8.8",
    "8.8.4.4"
  ],
  "ram": "256",
  "vcpus": "2",
  "nics": [
    {
      "nic_tag": "admin",
      "ip": "192.168.5.48",
      "netmask": "255.255.255.0",
      "gateway": "192.168.5.1",
      "model": "virtio",
      "primary": true,
      "allow_dhcp_spoofing": "true",
      "allow_ip_spoofing": "true",
      "allowed_ips": [ "192.168.99.0/24" ]
    }
  ],
  "disks": [
    {
      "image_uuid": "d8e65ea2-1f3e-11e5-8557-6b43e0a88b38",
      "boot": true,
      "model": "virtio"
    }
  ],
"customer_metadata": {
    "root_authorized_keys":
"ssh-rsa AAAAB3NzaC1y[...]"
  }

}
[root@78-24-af-39-19-7a ~]# vmadm create -f newbox.config 
Successfully created VM d7b00fa6-8aa5-466b-aba4-664913e80a2e 
[root@78-24-af-39-19-7a ~]# ping -s 192.168.5.48 
PING 192.168.5.48: 56 data bytes 
64 bytes from 192.168.5.48: icmp_seq=0. time=0.377 ms 
64 bytes from 192.168.5.48: icmp_seq=1. time=0.519 ms 
64 bytes from 192.168.5.48: icmp_seq=2. time=0.525 ms ... 

zsh, basalt% ssh root@192.168.5.48 
Warning: Permanently added '192.168.5.48' (ECDSA) to the list of known hosts.
Last login: Mon Aug  3 16:49:24 2015 from 192.168.5.47
   __        .                   .
 _|  |_      | .-. .  . .-. :--. |-
|_    _|     ;|   ||  |(.-' |  | |
  |__|   `--'  `-' `;-| `-' '  ' `-'
                   /  ;  Instance (CentOS 7.1 (1503) 20150630)
                   `-'   https://docs.joyent.com/images/linux/centos

[root@d7b00fa6-8aa5-466b-aba4-664913e80a2e ~]# 

And there we have a new guest VM up and running in less than a minute’s effort.
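
The VNC console mentioned above is just as easy to find: vmadm can report where the guest's console is listening. A quick sketch, using the UUID that vmadm create printed:

vmadm info d7b00fa6-8aa5-466b-aba4-664913e80a2e vnc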

Infrastructure and development environments recreated from scratch (partly thanks to storing my ~/etc/ in git) in under an hour.

I’m still looking for the perfect distributed filesystem, however…

Determining the best ZFS compression algorithm for email

I’m in the process of setting up a FreeBSD jail in which to run a local mail-server, mostly for work. As the main purpose will be simply archiving mails for posterity (does anyone ever actually delete emails these days?), I thought I’d investigate which of ZFS’s compression algorithms offers the best trade-off between speed and compression-ratio achieved.
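
Compression in ZFS is a per-dataset property, and the filesystem reports the ratio it actually achieves, so checking (or later changing) a dataset is a one-liner; the dataset name here is just an example:

zfs get compression,compressratio data/mail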

The Dataset

The email corpus comprises 273,273 files totalling 2.14GB; individually the mean size is 8KB, the median is 1.7KB and the vast majority are around 2.5KB.
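
Figures like those can be pulled out of the maildir with a quick pipeline. A rough sketch only, relying on the file size being the seventh field of find -ls output:

find . -type f -ls | awk '{ print $7 }' | sort -n |
  awk '{ a[NR] = $1; total += $1 }
       END { printf "%d files, %.2f GB, mean %.1f KB, median %.1f KB\n",
                    NR, total / 1e9, total / NR / 1024, a[int((NR + 1) / 2)] / 1024 }'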

The Test

The test is simple: the candidate algorithms are the nine gzip levels (gzip-1 to gzip-9) plus lzjb, which is noted for being fast, if not for compressing particularly effectively.

A test run consists of two parts: copying the entire email corpus from its regular directory to a new temporary ZFS filesystem, first using a single thread and then using two parallel threads. The old but efficient   find . | cpio -pdu   construct makes it easy to spawn two background jobs copying the files sorted into ascending and descending order – two writers working in opposite directions. Because the server was carrying a live load at the time, each test was run 5 times per algorithm – a total of 13 hours.

The test script is as follows:

#!/bin/zsh
# For each compression algorithm, copy the mail corpus onto a fresh ZFS
# dataset twice: once with a single writer, then with two writers in parallel.

cd /data/mail || exit 1

zfs destroy data/temp

for i ( gzip-1 gzip-2 gzip-3 gzip-4 gzip-5 gzip-6 \
	gzip-7 gzip-8 gzip-9 lzjb ) {
  echo "DEBUG: Doing $i"
  zfs create -ocompression=$i data/temp
  echo "DEBUG: Partition created"
  t1=$(date +%s)
  find . | cpio -pdu /data/temp 2>/dev/null
  t2=$(date +%s)
  size=$(zfs list -H data/temp)
  compr=$(zfs get -H compressratio data/temp)
  echo "$i,$size,$compr,$t1,$t2,1"
  zfs destroy data/temp

  sync
  sleep 5
  sync

  echo "DEBUG: Doing $i - parallel"
  zfs create -ocompression=$i data/temp
  echo "DEBUG: Partition created"
  t1=$(date +%s)
  find . | sort | cpio -pdu /data/temp 2>/dev/null &
  find . | sort -r | cpio -pdu /data/temp 2>/dev/null &
  wait
  t2=$(date +%s)
  size=$(zfs list -H data/temp)
  compr=$(zfs get -H compressratio data/temp)
  echo "$i,$size,$compr,$t1,$t2,2"
  zfs destroy data/temp
}

zfs destroy data/temp

echo "DONE"

Results

The script’s output was massaged with a little command-line awk, sed and vi into a CSV file, which was then loaded into R.
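
Something along these lines does most of that work. This is only a sketch: the log filename is hypothetical, and the field positions follow the echo format used in the script above (the embedded zfs output is tab-separated, and compressratio comes back with a trailing "x"):

grep -hE '^(gzip|lzjb)' runlog-* |
  awk -F, 'BEGIN { OFS = ","; print "algorithm,nowriters,compressratio,timetaken,used" }
           { split($2, list, "\t"); split($3, ratio, "\t")
             sub(/x$/, "", ratio[3])              # zfs reports e.g. "2.58x"
             print $1, $6, ratio[3], $5 - $4, list[2] }' > results.csv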

The runs were aggregated by algorithm and by whether one or two writers were used, taking a 10%-trimmed mean of each group.

Since it is desirable for an algorithm both to compress well and not take much time to do it, it was decided to define efficiency = compressratio / timetaken.

The aggregated data (time taken in seconds) looks like this:

   algorithm nowriters         eff timetaken compressratio
1     gzip-1         1 0.011760128     260.0         2.583
2     gzip-2         1 0.011800408     286.2         2.613
3     gzip-3         1 0.013763665     196.4         2.639
4     gzip-4         1 0.013632926     205.0         2.697
5     gzip-5         1 0.015003015     183.4         2.723
6     gzip-6         1 0.013774746     201.4         2.743
7     gzip-7         1 0.012994211     214.6         2.747
8     gzip-8         1 0.013645055     203.6         2.757
9     gzip-9         1 0.012950727     215.2         2.755
10      lzjb         1 0.009921776     181.6         1.669
11    gzip-1         2 0.004261760     677.6         2.577
12    gzip-2         2 0.003167507    1178.4         2.601
13    gzip-3         2 0.004932052     539.4         2.625
14    gzip-4         2 0.005056057     539.6         2.691
15    gzip-5         2 0.005248420     528.6         2.721
16    gzip-6         2 0.004156005     709.8         2.731
17    gzip-7         2 0.004446555     644.8         2.739
18    gzip-8         2 0.004949638     566.0         2.741
19    gzip-9         2 0.004044351     727.6         2.747
20      lzjb         2 0.002705393     900.8         1.657

A plot of efficiency against algorithm shows two clear bands, one for each number of simultaneous writers.

Analysis

In both cases, the lzjb algorithm’s reputed speed is more than offset by its limited compression ratio: it ends up with the lowest efficiency in each band.

The consequences of using two writer processes are two-fold. First, the overall efficiency is not merely halved: it drops to nearer a third of the single-writer figure – environmental factors such as caching and disk I/O bandwidth are likely at play. Second, the overall variance increases by about 8 percentage points:

> aggregate(eff ~ nowriters, data, FUN=function(x) { sd(x)/mean(x, trim=0.1)*100. })
  nowriters      eff
1         1 21.56343
2         2 29.74183

so choosing the right algorithm becomes more significant. The winner remains gzip-5, with levels 4, 8 and 3 as closer contenders, while gzip-2 and gzip-9 are much worse choices.

Of course, your mileage may vary; feel free to perform similar tests on your own setup, but I know which method I’ll be using on my new mail server.
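
For the record, applying the winner to the mail dataset is the usual one-liner (only newly written blocks are compressed, so existing files pick it up as they are rewritten):

zfs set compression=gzip-5 data/mail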