index-long.html
- 2023-08-24 - Interactive sleep [tags: shell, sleep]
- q – stop waiting, with exit code 0 so the next command could continue (otherwise Ctrl-C would abort the whole chain)
- + – increase waiting time by 10 seconds and print it
- - – decrease waiting time by 10 seconds and print it
- anything else prints the remaining time (and slightly increases the waiting time due to processing)
- 2023-08-16 - VIM tip: split file to 2 panes, lines continued [tags: usability, vim]
- resize - the line offsets are fixed from the time the window is split, resizing terminal window will get that out of sync
- randomly not refreshed - sometimes the right pane does not show the right lines, but Ctrl-L fixes that, so it’s just a visual glitch
- unsynced text during and after typing - when writing new lines to the left pane, the right pane shifts only the line numbers, not the text itself, ESC and Ctrl-L does not fix it, only scrolling gets it back in sync with visual appearance
- wrap - with set wrap on and some lines actually wrapped in the left pane, the line offset is derived from that, and once the wrapped line(s) scroll away from the screen the lines and text get out of sync
- switching files in the right pane will inherit the scrollbind, so movement will also apply to the left window (not desired e.g. when using tags/cscope/…)
- 2023-08-09 - JSON format dumper (a C API) [tags: c, json]
- produce valid json (validators exist)
- allow nesting of compound structures
- hashes ({key=value})
- arrays ([value, value, value])
- simple values (integer, string, uuid, time, …)
- track depth of structured types
- print delayed "," when there's a next object in the sequence
- quoting or escaping
- start/end the whole json document: fmt_start, fmt_end
- start/end an optionally named group: fmt_print_start_group/fmt_print_end_group, when the name is defined it means it's a map/hash ("key": "value"), otherwise a plain array made of "[ value, ... ]"
- for list values there's fmt_start_value/fmt_end_value so there can be not just a simple value but also a compound one
- the remaining case is a simple value, printed by fmt_print
- https://github.com/kdave/btrfs-progs/blob/master/common/format-output.h
- https://github.com/kdave/btrfs-progs/blob/master/common/format-output.c
- https://github.com/kdave/btrfs-progs/blob/master/tests/json-formatter-test.c
- https://github.com/wireshark/wireshark/blob/master/wsutil/json_dumper.c
- 2022-01-12 - CUDA not initialized after resume (unknown error) [tags: cuda, linux, nvidia, resume]
- 2021-05-24 - Authenticated hashes for btrfs (part 1) [tags: blake2, btrfs, kernel, sha256, xxhash]
- primary (key available) hash matches, secondary does not – as the authenticated hash is hard to forge, it’s trusted (even if it’s not full length of the digest)
- primary (key available) does not match, secondary does not – checksum mismatch for the same reason as above
- primary (key not available) does not match, secondary does – this is the prime time for the secondary hash, the floor is yours
- [1] https://lore.kernel.org/linux-fsdevel/20200428105859.4719-1-jth@kernel.org/
- [2] https://lwn.net/Articles/819143/ LWN discussion under Authenticated Btrfs (2020)
- 2020-02-15 - Btrfs hilights in 5.4: tree checker updates [tags: btrfs, kernel]
- key.objectid must match the tree that's being read, though the code verifies it only if the type is not 0
- key.offset must be 0
- the block offset bytenr must be aligned to the sector size (4KiB in this case)
- itemsize depends on the item type, but for the root item it's a fixed value
- level and drop_level is 0 to 7, but it's not possible to cross check if the tree really is of that level
- generation must be lower than the super block generation, same for lastsnap
- flags can be simply compared to the bit mask of allowed flags, right now there are two, one represents a read-only subvolume and another a subvolume that has been marked as deleted but its blocks not yet cleaned
- 2020-02-10 - Btrfs hilights in 5.5: new hashes [tags: blake2, btrfs, kernel, sha256, xxhash]
- 2020-02-02 - Btrfs hilights in 5.5: 3-copy and 4-copy block groups [tags: btrfs, kernel]
- 2020-01-21 - BLAKE3 vs BLAKE2 for BTRFS [tags: blake2, blake3, btrfs]
- Block size: 4KiB
- Iterations: 10000000
- Digest size: 256 bits (32 bytes)
- new hash for BTRFS selection, same testing box for the measurements
- https://github.com/BLAKE3-team/BLAKE3 – top commit 02250a7b7c80ded, 2020-01-13, upstream version 0.1.1
- DJB’s hash menu and per-machine results with numbers
- 2019-12-11 - Btrfs hilights in 5.3 [tags: btrfs, kernel]
- 2019-10-12 - Simple asm diff filter [tags: asm, diff, perl]
- 2019-10-08 - Selecting the next checksum for btrfs [tags: blake2, btrfs, kernel]
- implementation available under a GPL2 compatible license
- available in the linux kernel, either as a library function or as a module
- license that allows use in external tools like bootloaders (namely grub2)
- maximum digest width is 32 bytes
- blocks of size from 4KiB up to 64KiB
- standardized by FIPS
- xxhash
- XXH3
- SipHash
- CRC64
- SHA256
- SHA3
- BLAKE2
- fast hash: xxhash
- strong hash: BLAKE2 and SHA256
- NULL-NOP – stub to measure overhead of the framework
- NULL-MEMCPY – simple memcpy of the input buffer
- CRC64 – linux kernel lib/crc64.c
- CRC32C – hw assisted crc32c, (linux)
- CRC32C-SW – software implementation, table-based (linux)
- XXHASH – reference implementation
- XXH3 – reference implementation
- BLAKE2b-256-arch – 64bit optimized reference version (for SSE2/SSSE3/SSE41/AVX/AVX2)
- BLAKE2b-256 – 64bit reference implementation
- BLAKE2s-arch – 32bit optimized reference version
- BLAKE2s – 32bit reference implementation
- SHA256 – RFC 6234 reference implementation
- SHA3-256 – C, based on canonical implementation
- 2019-07-25 - Btrfs hilights in 5.2 [tags: btrfs, kernel]
- read-time – the first time a block is read fresh from disk
- write-time – when the final contents of a block is in memory and going to be written to disk
- device item (item that describes the physical device of the filesystem) – check that items have the right key type, device id is within allowed range and that usage and total sizes are within limits of the physical device
- inode item (item for inodes, ie. files, directories or special files) – the objectid (that would be the inode number) is in the range, generation is consistent with the global one, and the basic sanity of mode, flags/attributes and link count
- block group profiles – check that only a single type is set in the bit mask that represents block group profile type of a chunk (ie. single/dup/raid1/…)
- 2019-06-12 - Here, have a pyramid made of BLAKE2 hashes [tags: blake2, fun]
- 2019-04-27 - Linux crypto: testing blake2s [tags: blake2, crypto, linux]
- ref/blake2s-ref.c
- ref/blake2.h
- ref/blake2-impl.h
- update the coding style of the blake2 sources
- add Kconfig
- write self-tests
- optionally add the optimized implementations
- Makefile: out-of-tree build
- blake2.h: copied and updated for linux
- blake2-impl.h: copied and updated for linux
- blake2s.c: copied and updated for linux
- https://blake2.net
- https://github.com/BLAKE2/BLAKE2:
- http://www.chronox.de/libkcapi.html
- https://www.kernel.org/doc/html/latest/crypto/userspace-if.html
- https://www.kernel.org/doc/html/latest/crypto/intro.html
- 2018-10-18 - Some btrfs benchmarks of changes pending for 4.20 [tags: benchmark, btrfs, kernel]
- base/misc1: v4.19-rc8
- misc2: v4.19-rc8 + btrfs pull request
- misc3: v4.19 with btrfs changes merged
- blogbench 1, 2
- blogbench 1, 2, 3
- dbench4-async 1, 2
- dbench4-async 1, 2, 3
- good: slight improvement in 128 thread latency
- good: decreased system cpu time
- bad: longer overall runtime
- dbench4-fsync 1, 2
- dbench4-fsync 1, 2, 3
- good: slight improvement in 128 thread latency
- good: decreased system cpu time
- filebench-webproxy 1, 2
- filebench-webproxy 1, 2, 3
- fio-async-randrw 1, 2
- fio-async-randrw 1, 2, 3
- fio-async-seqw 1, 2
- fio-async-seqw 1, 2, 3
- fsmark-metadata 1, 2
- fsmark-metadata 1, 2, 3
- good: overall improvements in transaction counts for 3
- HP ProLiant DL380 G6
- 16 CPUs (Intel(R) Xeon(R), E5520 2.27GHz)
- 16 GiB RAM
- 4x SAS 136 GiB disks, partitioned
- filesystem:
- btrfs
- mkfs: data: raid0, metadata: raid1
- mount: noatime
- 2018-06-08 - qemu 2.12 does not support if=scsi [tags: qemu]
- 2017-12-06 - gitstats: btrfs-progs for v4.14 [tags: btrfs-progs, draft, git, stats]
- 2017-11-23 - errno.h for the masses [tags: debugging, errno]
- 2016-11-05 - Escaped snapper snapshots [tags: btrfs, snapper]
- synthetic tools may not give us an accurate view: snapper requires some metadata describing its snapshots, which is obviously OK, given that the user has the freedom to create snapshots anywhere
- learn how to use the low-level tools to cross-verify the state of the system
- snapper ls – show all snapper snapshots (for the first config)
- snapper delete – delete snapshots by id or range
- btrfs subvolume list – list all subvolumes (snapshots included) on a given path
- btrfs subvolume delete – delete a subvolume/snapshot by path
The sleep utility is something I'd hardly consider worth enhancing with features; the task is simply to sleep for a given time. Nothing else is needed when it's used from scripts. However, it can also be used from a manually typed command line, e.g. to let some command start later to offset initial IO load or not to overload the network.
Let’s say for example that we’d like to download a big data set, but rather let it start in 20 minutes.
sleep 20m && wget https://URL
Once this command starts we don't know how much time is remaining, and interrupting it by Ctrl-C would not start the download. Additionally, what if the time is too short and we'd like to increase it? So let's implement it. For more compatibility it's done in shell and using only widely available utilities (date). And btw, it does not even use sleep for the sleeping but the builtin read with the timeout option, as it allows reading input while waiting.
sleepy, the shell script:

#!/bin/sh

t=${1:-1}
ts="$t"

case "$t" in
        *s) tm=${t%*s}; t=$(($tm)) ;;
        *m) tm=${t%*m}; t=$(($tm * 60)) ;;
        *h) tm=${t%*h}; t=$(($tm * 60 * 60)) ;;
        *d) tm=${t%*d}; t=$(($tm * 60 * 60 * 24)) ;;
        *[0-9]) : ;;
        *) echo "ERROR: unknown timestamp"; exit 1 ;;
esac

while :; do
        [ "$t" -le 0 ] && break
        read -n 1 -t 1 ch
        ret=$?
        if [ $ret = 142 ]; then
                # timeout
                :
        elif [ $ret = 0 ]; then
                # keypress
                echo "sleep for $ts, remaining" `TZ=utc date --date=@$t +%H:%M:%S` "($t)"
                if [ "$ch" = 'q' ]; then
                        echo "quit"
                        exit 0
                elif [ "$ch" = '+' ]; then
                        echo "increase by 10s"
                        t=$(($t + 10 - 1))
                elif [ "$ch" = '-' ]; then
                        echo "decrease by 10s"
                        t=$(($t - 10 - 1))
                fi
                t=$(($t + 1))
        fi
        t=$(($t - 1))
done
The time passed on the command line needs to be parsed and fractional times are not supported, otherwise it's straightforward. The interesting part is that it reacts to key presses: q stops the waiting (with exit code 0 so the next command continues), + and - increase or decrease the waiting time by 10 seconds, and anything else prints the remaining time.
Vim has the excellent feature to split the window to multiple panes, but can it open one file in two vertical tabs and let the contents of left pane continue in the one next to it?
From
1 ... |
2 ... |
3 ... |
4 ... |
5 ... |
6 ... |
to
1 ... | 7 ... |
2 ... | 8 ... |
3 ... | 9 ... |
4 ... | 10 ... |
5 ... | 11 ... |
6 ... | 12 ... |
Not by default, but certainly with some scripting or by a set of a few commands entered manually. We also want to preserve the line continuation while scrolling. Maybe all of that is implemented somewhere for vim, I don't know. The first time I saw this was many years ago in a computer lab, somebody using emacs. What I'm going to show dates back to 2011; it works, and I think there must be an easier way or some vim fork that can do it natively.
The idea is to split the panel, move the lines by the screen height and enable scrollbind.
function SplitAndJoin()
:vsplit
:wincmd l
:normal L
:normal z+
:wincmd h
:windo set scrollbind
:wincmd h
endfunction
Moving the lines is really emulated by navigating to the bottom and resetting the position so the next line is the first one displayed. This is obviously not resistant to resizing the terminal, as the line offset between the windows is fixed. This can be manually adjusted by disabling scrollbind, syncing the lines and enabling scrollbind again. I never needed that urgently enough to implement it; reopening the buffer is faster.
The function can be invoked from command line mode by
:call SplitAndJoin()
… or 3 panes, lines continued
If we can do 2 panes by a few simple commands, we can do more, any number of splits that still fit the screen. This repeats the split and bind step from the last opened pane before moving back to the first one. The scrollbind is temporarily disabled when moving the lines for the split.
function SplitAndJoin3()
:vsplit
:wincmd l
:normal L
:windo set noscrollbind
:normal z+
:wincmd h
:windo set scrollbind
:vsplit
:wincmd l
:normal L
:windo set noscrollbind
:normal z+
:wincmd h
:windo set scrollbind
:wincmd h
:wincmd h
endfunction
Which can look like this:
Problems
That’s not a short list, might be a bug somewhere like a missing refresh
after an action or it’s working as expected individually but not when
combined like that. A native support in vim
is needed. But, it is still
usable to some extent.
We’ve had users asking for a machine readable output of btrfs-progs commands,
suggesting to add an alternative way to dump the data in a structured format
e.g. json. And nowadays this is very common in many if not all commands
provided by util-linux
. Try lscpu --json
, or lsblk --json
. Feeding such
output to python or a shell-friendly json parser is going to make life easier.
So that’s the background motivation.
Adding that to btrfs-progs is a task for long winter evenings, but the first step has been made: I wrote yet another json dumper. Of course this was preceded by research about the state of existing libraries or files to copy, but why not have some fun writing one nevertheless.
The requirements: produce valid json (validators exist), allow nesting of compound structures, and support hashes ({key=value}), arrays ([value, value, value]) and simple values (integer, string, uuid, time, …).
The nesting was one of the things I have not found in the existing libraries, or not implemented properly. The output must be valid json, and the single most difficult thing was the placement of the continuation comma ",". There must be none after the last object in a sequence. When the dump is done incrementally, the next object is often not known while printing the current one, so the comma cannot be printed unconditionally and the previous state needs to be stored until the last object is dumped.
But, it’s not sufficient to keep track of the last object only, simply nest some hashes and arrays. Like this:
1: {
2: "first" : {
3: "nested" : {
4: "key" : [
5: "value",
6: "value2"
7: ]
8: }
9: },
10: "last" : null
11: }
Last objects in sequences are on lines 6, 7, 8 and 10, so this needs a stack for tracking that. The main building blocks for correct syntax are: tracking the depth of the structured types, printing the delayed "," when there's a next object in the sequence, and quoting or escaping.
The rest is for readable output, indentation is simply derived from the nesting level.
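To illustrate the bookkeeping, here is a minimal sketch in plain C (not the btrfs-progs code, names are made up) of how a per-level "already printed an element" flag can drive the delayed comma and the indentation:

#include <stdio.h>
#include <stdbool.h>

#define MAX_DEPTH 32

static int depth;
static bool has_element[MAX_DEPTH];

/* Called before printing any element (a value or a nested group): emit the
 * delayed "," only when the current level already printed something. */
static void separator(void)
{
	if (has_element[depth])
		printf(",");
	printf("\n%*s", 2 * depth, "");	/* indentation derived from the depth */
	has_element[depth] = true;
}

static void group_open(char bracket)
{
	separator();
	putchar(bracket);
	depth++;
	has_element[depth] = false;
}

static void group_close(char bracket)
{
	depth--;
	printf("\n%*s%c", 2 * depth, "", bracket);
}

int main(void)
{
	group_open('[');
	separator(); printf("\"a\"");
	group_open('[');
	separator(); printf("\"b\"");
	separator(); printf("\"c\"");
	group_close(']');
	separator(); printf("\"d\"");
	group_close(']');
	printf("\n");
	return 0;
}

The same mechanism extends to maps, where the key is printed as part of the element; the real formatter additionally uses the depth for the sanity checks mentioned below.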
The code representation of the json object should be straightforward, basically translating the objects into API calls, for compound object there’s the start and end element. Element nesting must be proper. On the lower level, there should be some nice specification of the key/value formatting.
I’ve ended up with a static declaration of the value formatting specifiers in a list, the actual structure of the json objects will be defined by API calls referring back to the spec structure.
struct rowspec {
const char *key;
const char *fmt;
const char *out_text;
const char *out_json;
};
The key
refers to the key (name), the rest of the structure is relevant for
the value formatting. The out_text
can be ignored for now, it’s a
generalization of the output formatting so the same code is used for plain text
and json output, the switch is done from the outside (e.g. by a command line
option).
Basic value types are a string and a number. For the purposes of btrfs-progs there are more, so far built into the formatter, but it's extensible (on the source level, not dynamically). Additional custom types are UUID, qgroupid (a u64 split into 16 and 48 bits), size (human friendly formatting of byte suffix values), and time. Another example of customization is that there are actually two byte size formatters, one of which prints "-" instead of a zero value.
The structural API is simple: fmt_start/fmt_end for the whole document, fmt_print_start_group/fmt_print_end_group for optionally named groups (maps or arrays), fmt_start_value/fmt_end_value for compound list values, and fmt_print for simple values.
Most of them take a reference to the rowspec; this is convenient because at the time we'd like to print a value we refer to it by its name and the rest is taken from the table. All the API calls take a context structure that stores the nesting level (for indentation and sanity checks) and some internal data.
Now the above example would not be the best one to demonstrate the API but let’s try:
static const struct rowspec rows[] = {
{ .key = "first", .fmt = "map", .out_json = "first" },
{ .key = "nested", .fmt = "map", .out_json = "nested" },
{ .key = "key", .fmt = "list", .out_json = "key" },
{ .key = "value", .fmt = "str", .out_json = NULL },
};
struct format_ctx ctx;
fmt_start(&ctx, rows, 0, 0);
fmt_print_start_group(&ctx, "first", JSON_TYPE_MAP);
fmt_print_start_group(&ctx, "nested", JSON_TYPE_MAP);
fmt_print_start_group(&ctx, "key", JSON_TYPE_ARRAY);
fmt_print(&ctx, "value", "value");
fmt_print(&ctx, "value", "value2");
fmt_print_end_group(&ctx, NULL);
fmt_print_end_group(&ctx, NULL);
fmt_print_end_group(&ctx, NULL);
fmt_print_start_group(&ctx, "last", JSON_TYPE_MAP);
fmt_print_end_group(&ctx, NULL);
fmt_end(&ctx);
That should demonstrate it. There might be some slight tweaks needed, as I took it from a specialized use but would like to present it as a generic API. So far it's not generic, though it might become one. With the sources available it's open for further reuse; I don't plan to make it a standalone library, but if there's interest then, well, why not.
Examples
The simple values can be formatted in multiple ways, e.g. size values with suffixes or raw numbers, though it’s defined as multiple keys:
{ .key = "size", .fmt = "%llu", .out_json = "size" },
{ .key = "size-si", .fmt = "size", .out_json = "size" },
{ .key = "size-si-not-zero", .fmt = "size-or-none", .out_json = "size" },
Raw formats starting with “%
” are passed to printf
, this lets it do all the
heavy work. The special formatters may need additional parameters, in this case
it’s the pretty printer mode for size.
if (mode == RAW)
fmt_print(&ctx, "size", value);
else if (mode == SI)
fmt_print(&ctx, "size-si", value, mode);
else if (mode == SI_NO_ZERO)
fmt_print(&ctx, "size-si-not-zero", value, mode);
Here the size-or-none
format prints “-
” instead of 0
. As can be seen here,
not all the row specifiers need to be used at once.
References
Source links point to the most recent release so there might be an update since the time I write this but I’m not expecting significant changes.
tl;dr
CUDA returning an unknown error upon initialization after resume on linux can be solved by reloading the kernel module nvidia_uvm.
There are some problems reported where CUDA returns an unknown error upon initialization, with no clear cause nor a clear way to fix it. I've found https://victorwyee.com/engineering/failed-call-to-cuinit-cuda-error-unknown/ listing various solutions, from rebooting to tweaking environment variables.
I’ve hit another problem that does not seem to be explicitly mentioned as the cause or it could have been misdiagnosed. After switching the built in Intel graphic card to be primary and NVIDIA card only for computations, it was not possible to run any application using CUDA. All it reported back were unknown errors.
The Leela Chess engine lc0 reported CUDA version 0.0.0 with a message like:
$ ./lc0
       v0.28.2 built Dec 13 2021
ucinewgame
Found pb network file: ./192x15-2021_1212_2050_25_709.pb.gz
Creating backend [cuda-auto]...
Switching to [cuda]...
CUDA Runtime version: 0.0.0
WARNING: CUDA Runtime version mismatch, was compiled with version 11.4.0
Latest version of CUDA supported by the driver: 11.4.0
error CUDA error: unknown error (../../src/neural/cuda/network_cuda.cc:134)
The line in source file is simply reporting errors:
134 ReportCUDAErrors(cudaGetDeviceCount(&total_gpus));
This is the first place where the errors are checked; a quick trivial CUDA program fails on the first memory allocation with the same unknown error.
I guess this stresses the importance of error checking, namely when some piece of hardware is involved. Though this fails on the first attempt to use the API, there are possibly more cases during a long computation and many tasks using the GPU in parallel.
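For illustration, even a trivial check of the very first CUDA runtime calls would catch this; a minimal sketch (built with nvcc, not taken from lc0):

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
	int count = 0;
	cudaError_t err = cudaGetDeviceCount(&count);

	if (err != cudaSuccess) {
		/* After a broken resume this is where "unknown error" shows up */
		fprintf(stderr, "cudaGetDeviceCount: %s\n", cudaGetErrorString(err));
		return 1;
	}

	void *buf = NULL;
	err = cudaMalloc(&buf, 4096);
	if (err != cudaSuccess) {
		fprintf(stderr, "cudaMalloc: %s\n", cudaGetErrorString(err));
		return 1;
	}
	cudaFree(buf);
	printf("CUDA OK, %d device(s)\n", count);
	return 0;
}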
A reboot fixed the problem, until the next suspend, which is impractical, and setting up the work environment again gets annoying. I have tried to look for some utility/command that is supposed to be run after resume. There's a script called nvidia-sleep.sh that is the callback script for suspend and resume. Running that with the parameter suspend manually while the system was up was not a good idea, as it switched me out of the graphical desktop to some virtual terminal without any way to undo that. Also running it with resume did not help either (and does not even make sense after the first proper resume).
There are no obvious signs that something could be wrong after resume, nothing in the logs, and running games utilizing acceleration works (judging by FPS). One piece of advice recommends running nvidia-modprobe, in case some of the 4 modules (nvidia_uvm, nvidia_drm, nvidia_modeset, nvidia) is not loaded. Besides the UVM module, all are in use.
Right after resume the modules look like this:
nvidia_drm 69632 2
nvidia_modeset 1204224 3 nvidia_drm
nvidia_uvm 2531328 0
nvidia 35373056 90 nvidia_uvm,nvidia_modeset
That there’s 0 (usage by other modules) brought some suspicion, because
according to nvidia-smi
, there’s a least Xorg keeping a few megabytes of
allocated memory (which may or may not be related though).
I’ve reloaded the nvidia_uvm module, because it was the only thing possible to do with the modules:
# modprobe -r nvidia_uvm
# modprobe nvidia_uvm
And it started to work; a quick check with lc0 shows it works again. Now the module list is:
nvidia_uvm 2531328 2
nvidia_drm 69632 2
nvidia_modeset 1204224 3 nvidia_drm
nvidia 35373056 204 nvidia_uvm,nvidia_modeset
While this is still a workaround, it’s much better than rebooting and a post-resume hook should be able to make it work automatically.
The suspend script is in /usr/lib/systemd/system-sleep/
:
#!/bin/sh
case "$1" in
post)
sleep 4
modprobe -r nvidia_uvm
sleep 1
modprobe nvidia_uvm
;;
esac
There was a request to provide authenticated hashes in btrfs, natively as one of the btrfs checksum algorithms. Sounds fun but there’s always more to it, even if this sounds easy to implement.
Johannes T., at that time in SUSE, sent the patchset adding the support for SHA256 [1], together with a Labs conference paper summarizing existing solutions and giving details about the proposed implementation and use cases.
The first version of the patchset got some feedback, issues were found and some ideas suggested. Things have stalled a bit, but the feature is still very interesting and really not hard to implement. The support for additional checksums has provided enough support code to just plug in the new algorithm and enhance the existing interfaces to provide the key bytes. Until now I've assumed you know what an authenticated hash means, but for clarity and in simple terms: a checksum that depends on a key. The main point is that it's impossible to generate the same checksum for given data without knowing the key, where impossible is used in the cryptographic-strength sense: there's an almost zero probability of doing that by chance, and a brute force attack is not practical.
Auth hash, fsverity
A notable existing solution for that is fsverity, which works in a read-only fashion: the key is securely hidden and used only to verify that data read from the media haven't been tampered with. A typical use case is an OS image in your phone. But that's not all: OS images appear in all sorts of boxed devices and IoT. Nowadays, with the explosion of edge computing, assuring integrity of the end devices is a fundamental requirement.
Where btrfs can add some value is read AND write support with an authenticated hash. This brings questions around key handling, and not everybody is OK with a device that could potentially store malicious/invalid data with a proper authenticated checksum. So yeah, use something else, this is not your use case, or maybe there's another way to make sure the key won't be compromised easily. This is beyond the scope of what a filesystem can do, though.
As an example use case of writable filesystem with authenticated hash: detect outside tampering with on-disk data, eg. when the filesystem was unmounted. Filesystem metadata formats are public, interesting data can be located by patterns on the device, so changing a few bytes and updating the checksum(s) is not hard.
There’s one issue that was brought up and I think it’s not hard to observe anyway: there’s a total dependency on the key to verify a basic integrity of the data. Ie. without the key it’s not possible to say if the data are valid as if a basic checksum was used. This might be still useful for a read-only access to the filesystem, but absence of key makes this impossible.
Existing implementations
As was noted in the LWN discussion [2], ZFS does this with two checksums: one is authenticated and one is not. I point you to the comment stating that, as I was not able to navigate far enough in the ZFS code to verify the claim, but the idea is clear. It's said that the authenticated hash is eg. SHA512 and the plain hash is SHA256, split half/half in the bytes available for the checksum. The way the hash is stored is a simple trim of the first 16 bytes of each checksum, stored consecutively. As both hashes are cryptographically strong, the first 16 bytes should provide enough strength despite the truncation. Where 16 bytes is 128 bits.
When I was thinking about that, I had a different idea how to do it. Not that copying the scheme would not work for btrfs; anything that the linux kernel crypto API provides is usable, so the same is achievable. I'm not judging the decisions about what hashes to use or how to do the split, it works and I don't see a problem in the strength. Where I see potential for an improvement is performance, without sacrificing strength too much. Trade-offs.
The CPU or software implementation of SHA256 is considerably slower than checksums with hardware aids (like the CRC32C instructions) or hashes designed to perform well on CPUs. That was the topic of the previous round of new hashes, so it now competes against BLAKE2b and XXHASH. There are CPUs with native instructions to calculate SHA256 and the performance improvement is noticeable, orders of magnitude better, but the support is not as widespread as eg. for CRC32C. Anyway, there's always choice and hardware improves over time. The number of hashes may seem to explode but as long as it's manageable inside the filesystem, we take it. And a coffee please.
Secondary hash
The checksum scheme proposed is to use a cryptographic hash and a non-cryptographic one. Given the current support for SHA256 and BLAKE2b, the cryptographic hash is given. There are two of them and that’s fine. I’m not drawing an exact parallel with ZFS, the common point for the cryptographic hash is that there are limited options and the calculation is expensive by design. This is where the non-cryptographic hash can be debated. Also I want to call it secondary hash, with obvious meaning that it’s not too important by default and comes second when the authenticated hash is available.
We have CRC32C and XXHASH to choose from. Note that there are already two hashes from the start so supporting both secondary hashes would double the number of final combinations. We’ve added XXHASH to enhance the checksum collision space from 32 bits to 64 bits. What I propose is to use just XXHASH as the secondary hash, resulting in two new hashes for the authenticated and secondary hash. I haven’t found a good reason to also include CRC32C.
Another design point was where to do the split and truncation. As the XXHASH has fixed length, this could be defined as 192 bits for the cryptographic hash and 64 bits for full XXHASH.
Here we are: we could have authenticated SHA256 accompanied by XXHASH, or the same with BLAKE2b. The checksum split also splits the decision tree of what to do when the checksum matches only partially. For a single checksum it's a simple yes/no decision. The partial match is the interesting case: when the key is available and the primary hash matches but the secondary does not, the data are trusted because the authenticated hash is hard to forge (even truncated); when the key is available and the primary hash does not match, it's a checksum mismatch for the same reason; and when the key is not available and only the secondary hash matches, that's the prime time for the secondary hash, the floor is yours.
This leads to 4 outcomes of the checksum verification, compared to 2. A boolean type can simply represent the yes/no outcome, but for two hashes it's not that easy. It depends on the context, though I think it should still be straightforward to decide what to do in the code. Nevertheless, this has to be updated in all calls to checksum verification and has to reflect the key availability, eg. in case the data are auto-repaired during scrub or when there's a copy.
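To make the split and the outcomes more concrete, here's a sketch of how the 32 checksum bytes could be laid out and how the verification result could be represented (the names are illustrative, not actual btrfs code):

#include <stdint.h>

struct split_csum {
	uint8_t auth[24];	/* truncated authenticated hash, 192 bits */
	uint8_t fast[8];	/* full XXHASH64 digest, 64 bits */
};

enum csum_verify_result {
	CSUM_MATCH,		/* both hashes match */
	CSUM_AUTH_MATCH,	/* authenticated hash matches, secondary does not: trusted */
	CSUM_MISMATCH,		/* authenticated hash available but does not match */
	CSUM_FAST_MATCH,	/* key not available, only the secondary hash matches */
};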
Performance considerations
The performance comparison should now be clear: we have the potentially slow SHA256 but fast XXHASH for each metadata and data block, vs slow SHA512 and slow SHA256. As far as I can tell it's also possible to select a SHA256/SHA256 split in ZFS, but that can't beat SHA256/XXHASH.
The key availability seems to be the key point in all that, puns notwithstanding. The initial implementation assumed, for simplicity, that the raw key bytes are provided to the kernel and to the userspace utilities. This is maybe OK for a prototype but under no circumstances can it survive until a final release. There's key management wired deep into the linux kernel, there's a library for the whole API and command line tools. We ought to use that: pass the key by name, not the raw bytes.
Key management has it’s own culprits and surprises (key owned vs possessed), but let’s assume that there’s a standardized way how to obtain the key bytes from the key name. In kernel its “READ_USER_KEY_BYTES”, in userspace it’s either keyctl_read from libkeyutils or a raw syscall to keyctl. Problem solved, on the low-level. But, well, don’t try that over ssh.
Accessing a btrfs image for various reasons (check, image, restore) now needs the key to verify data or even the key itself to perform modifications (check + repair). The command line interface has to be extended for all commands that interact with the filesystem offline, ie. the image and not the mounted filesystem.
This results in a global option, like btrfs --auth-key 1234 inspect-internal dump-tree, compared to btrfs inspect-internal dump-tree --auth-key 1234. This is not finalized, but a global option is now the preferred choice.
Final words
I have a prototype that does not work in all cases but at least passes mkfs and mount. The number of checksum verification cases got above what I was able to fix by the time of writing this. I think this has enough matter in itself, so I'm pushing it out as part 1. There are open questions regarding the command line interface and also some kind of proof or discussion regarding attacks. Stay tuned.
References:
A bit more detailed overview of a btrfs update that I find interesting, see the pull request for the rest.
There’s not much to show in this release. Some users find that good too, a boring release. But still there are some changes of interest. The 5.4 is a long-term support stable tree, stability and core improvements are perhaps more appropriate than features that need a release or two to stabilize.
(Which release becomes a long-term stable tree is not known in advance, so half-baked features that could require more intrusive fixups are not pushed anyway.)
The development cycle happened over summer and this slowed down the pace of patch reviews and update turnarounds.
Tree-checker updates
The tree-checker is a sanity checker of metadata that are read from/written to devices. Over time it’s being enhanced by more checks, let’s have a look at two of them.
ROOT_ITEM checks
The item represents the root of a b-tree, either one of the internal trees or a subvolume tree.
Let’s take an example from btrfs inspect dump-tree
:
item 0 key (EXTENT_TREE ROOT_ITEM 0) itemoff 15844 itemsize 439
generation 5 root_dirid 0 bytenr 30523392 level 0 refs 1
lastsnap 0 byte_limit 0 bytes_used 16384 flags 0x0(none)
uuid 00000000-0000-0000-0000-000000000000
drop key (0 UNKNOWN.0 0) level 0
Some of the metadata inside the item allow only simple checks (the key.objectid and key.offset values, alignment of the block offset bytenr, the itemsize, the level and drop_level range, generation and lastsnap against the super block generation, and the allowed flags), following commit 259ee7754b6793.
The refs is a reference counter and a sanity check would require reading all the expected reference holders; bytes_used would need to look up the block that it accounts, etc. The subvolume trees have more data, like ctime, otime and real uuids. If you wonder what byte_limit is, this used to be a mechanism to emulate quotas by setting the limit value, but it has been deprecated and unused for a long time. One day we might find another purpose for the bytes.
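The simple checks mentioned above are mostly alignment and range comparisons; a simplified sketch of the idea (the real checks live in fs/btrfs/tree-checker.c and report detailed errors rather than a plain boolean):

#include <stdbool.h>
#include <stdint.h>

struct root_item_fields {
	uint64_t bytenr;
	uint64_t generation;
	uint64_t lastsnap;
	uint8_t level;
	uint8_t drop_level;
};

bool root_item_sane(const struct root_item_fields *ri,
		    uint32_t sectorsize, uint64_t super_generation)
{
	if (ri->bytenr % sectorsize != 0)		/* block offset must be aligned */
		return false;
	if (ri->level > 7 || ri->drop_level > 7)	/* b-tree levels are 0 to 7 */
		return false;
	if (ri->generation > super_generation)		/* not newer than the super block */
		return false;
	if (ri->lastsnap > super_generation)
		return false;
	return true;
}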
Many of the tree-checker enhancements are follow-ups to fuzz testing and reports, as it was in this case. The bug report shows that some of the incorrect data were detected and even triggered auto-repair (as this was on a filesystem with DUP metadata), but there was too much damage and it crashed at some point. The crash was not random but a BUG_ON of an unexpected condition, which is a sanity check of last resort. Catching inconsistent data early with graceful error handling is of course desired and ongoing work.
Extent metadata item checks
There are two item types that represent extents and information about sharing. EXTENT_ITEM is older and bigger, while METADATA_ITEM is the building block of the skinny-metadata feature, smaller and more compact. Both items contain the type of the reference(s) and the owner (a tree id). Besides the generic checks that the root item also does (alignment, value ranges, generation), there's a number of allowed combinations of the reference types and extent types. The commit f82d1c7ca8ae1bf implements that, however further explanation is out of scope of this overview as the sharing and references are the fundamental design of btrfs.
Example of METADATA_ITEM
:
item 170 key (88145920 METADATA_ITEM 0) itemoff 10640 itemsize 33
refs 1 gen 27 flags TREE_BLOCK
tree block skinny level 0
tree block backref root FS_TREE
And EXTENT_ITEM
:
item 27 key (20967424 EXTENT_ITEM 4096) itemoff 14895 itemsize 53
refs 1 gen 499706 flags DATA
extent data backref root FS_TREE objectid 8626071 offset 0 count 1
This is for a simple case with one reference, a tree block (for metadata) and ordinary data, so comparing the sizes shows 20 bytes saved. On my 20GiB root partition with about 70 snapshots there are XXX EXTENT and YYY METADATA items.
Otherwise there can be more references inside one item (eg. many snapshots of a file that is randomly updated over time), so the overhead of the item itself is smaller.
A bit more detailed overview of a btrfs update that I find interesting, see the pull request for the rest.
More checksum algorithms
The first request for more checksums is several years old (2014), it was SHA256. There was no followup of the patch but the discussion in the mail thread brought the topic into light. Questions about why SHA256 and why not something else (welcome to the zoo: SpookyHash, BLAKE2, CityHash, MurmurHash3, CRC64), preferences, use cases etc. Browsing the mail thread again, I don’t see anybody suggest SHA1 or MD5. There were some open questions and suggestions and though it’s an old discussion, most of the points were valid at the time of pre-5.5 evaluation and served as a checklist. This is what I really like about the community projects and btrfs users in particular. Getting useful feedback on topics out of scope of the filesystem itself.
Fast forward to now, there are 3 new hashes:
Name | Digest size | Notes |
---|---|---|
CRC32C | 32bit | fast, weak, default, CPU support, backward compatible |
XXHASH | 64bit | fast, weak, good on modern CPUs |
SHA256 | 256bit | slow, strong, FIPS certified, CPU support |
BLAKE2B | 256bit | slow, strong, good on 64bit architectures |
You can find more about the selection process in another post.
Quick start
# mkfs.btrfs --csum xxhash /dev/sda
And the output:
btrfs-progs v5.4.1
See http://btrfs.wiki.kernel.org for more information.
Label: (null)
UUID: 9774b466-fa5d-4c2e-8b4f-479d0bb5873d
Node size: 16384
Sector size: 4096
Filesystem size: 10.00GiB
Block group profiles:
Data: single 8.00MiB
Metadata: DUP 256.00MiB
System: DUP 8.00MiB
SSD detected: no
Incompat features: extref, skinny-metadata
Checksum: xxhash64
Number of devices: 1
Devices:
ID SIZE PATH
1 10.00GiB /dev/sda
For convenience, the hash names can be specified by aliases, so both xxhash and xxhash64 are accepted on the command line; similarly for blake2 it's blake2b. This is based on a usability pattern where humans don't have to remember the exact specification, and copy & paste from the output works as well.
Use cases and outlook
For ordinary checksumming, all the hashes are considered good enough. CRC32C is the weakest but it’s been with us since the beginning and has caught many random bit flips.
XXHASH is supposed to add to the strength with a good performance. Increased digest size reduces collisions, but can’t compete with the crypto-strong hashes. That can still be of great help for deduplication.
From my perspective, the number of hashes should stay limited as the support is needed not just in kernel but for other tooling. For btrfs-progs it’s mandatory to support them, but also bootloaders (*cough*in case they verify checksums*cough*). The kernel support requires the hash implementations and the modules become a dependency of btrfs.ko. This can increase overall size and build time.
One of the suggestions was to allow users to choose the hash out of all what linux crypto API offers, not just the fixed set of hashes selected by developers. I understand that from the point of view of users who can make the decisions about trade-offs for their use case this is better, but I argue that such users are not in majority. Satisfying the common case was the preferred option, hopefully providing good options also for the knowledgeable users. Providing and maintaining implementations of seven and one hashes would be a burden for several projects.
Yes, I’m aware of BLAKE3, but let’s wait before adding new hashes. By the time we do another round there might be more contenders or improved versions. There’s XXH3 from the author of XXHASH, producing 128bit digest and is slightly faster. However it is still not recommended for production use due to fine tuning and finalization.
The cryptographic hashes can be a building block for a keyed hash, ie. checksums that can’t be subverted. There’s a patch implementing that, currently in an RFC state. While the code is straightforward, adding a new interface could use a feedback round. Please contact us in case you have comments.
Last but not least, strong hashes are a building block for deduplication. At the moment btrfs supports only the out-of-band way, on demand by calling an ioctl. There are tools utilizing that, eg. BEES or duperemove, calculating the block checksums by themselves. Exporting the raw checksums would help to decrease CPU time and also external tracking of the checksums.
Implementation notes
With the single checksum CRC32C, lots of code had hard coded size (4 bytes/32 bits) or functions and identifiers contained ‘crc32c’. This could not remain once there are more hashes to be added, so a round of cleanups happened in advance in 5.4. I’d call it great preparatory work that paid off, because adding new hashes was a matter of inserting new definition entries to a table. The rest was behind abstraction of btrfs functions that call to the linux crypto API.
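For illustration, such a definition table can look roughly like this (a sketch; the field names and driver strings are illustrative, not the exact kernel code):

#include <stdint.h>

struct csum_def {
	const char *name;	/* user visible name, eg. for mkfs.btrfs --csum */
	uint16_t size;		/* digest size in bytes, at most 32 */
	const char *driver;	/* name of the linux crypto API implementation */
};

static const struct csum_def csums[] = {
	{ .name = "crc32c",   .size = 4,  .driver = "crc32c" },
	{ .name = "xxhash64", .size = 8,  .driver = "xxhash64" },
	{ .name = "sha256",   .size = 32, .driver = "sha256" },
	{ .name = "blake2b",  .size = 32, .driver = "blake2b-256" },
};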
In parallel to the btrfs patches, the BLAKE2B linux implementation was merged, while XXHASH was merged in 5.3 and SHA256 was there forever. All in all, everything was ready in the same target release.
A bit more detailed overview of a btrfs update that I find interesting, see the pull request for the rest.
New block group profiles RAID1C3 and RAID1C4
There are two new block group profiles enhancing the capabilities of the RAID1 type with more copies than 2. A brief overview of the profiles is in the table below; for a table with all profiles see the manual page of mkfs.btrfs, also available on the wiki.
Profile | Copies | Utilization | Min devices |
---|---|---|---|
RAID1 | 2 | 50% | 2 |
RAID1C3 | 3 | 33% | 3 |
RAID1C4 | 4 | 25% | 4 |
The way all the RAID1 types work is that there are 2 / 3 / 4 exact copies over all available devices. The terminology is different from linux MD RAID, which can do any number of copies. We decided not to do that in btrfs to keep the implementation simple. Another point for simplicity is the users' perspective: that RAID1C3 provides 3 copies is clear from the type. Even after adding a new device and not doing a balance, the guarantees about redundancy still hold. Newly written data will use the new device together with 2 devices from the original set.
Compare that with a hypothetical RAID1CN, on a filesystem with M devices (N <= M). When the filesystem starts with 2 devices, equivalent to RAID1, adding a new one will have mixed redundancy guarantees after writing more data. Old data with RAID1, new with RAID1C3 – but all accounted under RAID1CN profile. A full re-balance would be required to make it a reliable 3-copy RAID1. Add another device, going to RAID1C4, same problem with more data to shuffle around.
The allocation policy would depend on number of devices, making it hard for the
user to know the redundancy level. This is already the case for
RAID0/RAID5/RAID6. For the striped profile RAID0 it’s not much of a problem,
there’s no redundancy. For the parity profiles it’s been a known problem and
new balance filter stripe
has been added to support fine grained selection of
block groups.
Speaking about RAID6, there's an elephant in the room: the write hole. The lack of resiliency against damage to 2 devices has been bothering all of us because of the known write hole problem in the RAID6 implementation. How this is going to be addressed is for another post, but for now the newly added RAID1C3 profile is a reasonable substitute for RAID6.
How to use it
On a freshly created filesystem it’s simple:
# mkfs.btrfs -d raid1c3 -m raid1c4 /dev/sd[abcd]
The command combines both new profiles for sake of demonstration, you should always consider the expected use and required guarantees and choose the appropriate profiles.
Changing the profile later on an existing filesystem works as usual, you can use:
# btrfs balance start -mconvert=raid1c3 /mnt/path
Provided there are enough devices and enough space to do the conversion, this will go through all metadata block groups and after it finishes all of them will be of the desired type.
Backward compatibility
The new block groups are not understood by old kernels and can’t be mounted,
not even in the read-only mode. To prevent that a new incompatibility bit is
introduced, called raid1c34
. Its presence on a device can be checked by
btrfs inspect-internal dump-super
in the incompat_flags
. On a running
system the incompat features are exported in sysfs,
/sys/fs/btrfs/UUID/features/raid1c34
.
Outlook
There is no demand for RAID1C5 at the moment (I asked more than once). The space utilization is low already, the RAID1C4 survives 3 dead devices so IMHO this is enough for most users. Extending resilience to more devices should perhaps take a different route.
With more copies there’s potential for parallelization of reads from multiple devices. Up to now this is not optimal, there’s a decision logic that’s semi-random based on process ID of the btrfs worker threads or process submitting the IO. Better load balancing policy is a work in progress and could appear in 5.7 at the earliest (because 5.6 development is now in fixes-only mode).
Look back
The history of the patchset is a bit bumpy. There was enough motivation and requests for the functionality, so I started the analysis what needs to be done. Several cleanups were necessary to unify code and to make it easily extendable for more copies while using the same mirroring code. In the end change a few constants and be done.
Following with testing, I tried simple mkfs and conversions, that worked well. Then scrub, overwrite some blocks and let the auto-repair do the work. No hiccups. The remaining and important part was the device replace, as the expected use case was to substitute RAID6, replacing a missing or damaged disk. I wrote the test script, replace 1 missing, replace 2 missing. And it did not work. While the filesystem was mounted, everything seemed OK. Unmount, check again and the devices were still missing. Not cool, right.
Due to lack of time before the upcoming merge window (a code freeze before next development cycle), I had to declare it not ready and put it aside. This was in late 2018. For a highly requested feature this was not an easy decision. Should it be something less important, the development cycle between rc1 and final release provides enough time to fix things up. But due to the maintainer role with its demands I was not confident that I could find enough time to debug and fix the remaining problem. Also nobody offered help to continue the work, but that’s how it goes.
In late 2019 I had some spare time and looked at the pending work again. I enhanced the test script with more debugging messages and more checks. The code worked well; the test script was subtly broken. Oh well, what a blunder. That cost a year, but on the other hand releasing a highly requested feature that lacks an important part was not an appealing option.
The patchset was added to the 5.5 development queue at about the last moment before the freeze; the final 5.5 release happened a week ago.
Irony isn’t it. The paint of BLAKE2 as BTRFS checksum algorithm hasn’t dried yet, 1-2 weeks to go but there’s a successor to it. Faster, yet still supposed to be strong. For a second or two I considered ripping out all the work and … no not really but I do admit the excitement.
Speed and strength are competing goals for a hash algorithm. The speed can be evaluated by anyone, not so much the strength. I am no cryptographer and for that area rely on the expertise and opinion of others. That BLAKE was a SHA3 finalist is a good indication, and BLAKE2 is its successor, weakened but not weak. BLAKE3 is yet another step trading off strength and speed.
Regarding BTRFS, BLAKE2 is going to be the faster of strong hashes for now (the other one is SHA256). The argument I have for it now is proof of time. It’s been deployed in many projects (even crypto currencies!), there are optimized implementations, various language ports.
The look ahead regarding more checksums is to revisit them in about 5 years. Hopefully by that time there will be deployments, real workload performance evaluations and overall user experience that will back future decisions.
Maybe there are going to be new strong yet fast hashes developed. During my research I learned about Kangaroo 12 that’s a reduced version of SHA3 (Keccak). The hash is constructed in a different way, perhaps there might be a Kangaroo 2π one day on par with BLAKE3. Or something else. Why not EDON-R, it’s #1 in many of the cr.yp.to/hash benchmarks? Another thing I learned during the research is that hash algorithms are twelve in a dozen, IOW too many to choose from. That Kangaroo 12 is internally of a different construction might be a point for selecting it to have wider range of “building block types”.
Quick evaluation
For BTRFS I have a micro benchmark, repeatedly hashing a 4 KiB block and using cycles per block as a metric.
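The micro benchmark boils down to something like the following sketch, with hash_block() standing in for the implementation being measured (a dummy here so the example is self-contained; the real runs call the individual hash implementations):

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>

#define BLOCK_SIZE	4096
#define ITERATIONS	10000000ULL

/* Dummy stand-in for the hash implementation under test. */
static uint8_t hash_block(const uint8_t *buf, size_t len)
{
	uint8_t acc = 0;

	for (size_t i = 0; i < len; i++)
		acc ^= buf[i];
	return acc;
}

int main(void)
{
	static uint8_t block[BLOCK_SIZE];
	uint64_t start, end;
	volatile uint8_t sink = 0;

	start = __rdtsc();
	for (uint64_t i = 0; i < ITERATIONS; i++)
		sink ^= hash_block(block, BLOCK_SIZE);	/* hash the same 4 KiB block */
	end = __rdtsc();

	printf("total cycles %llu, cycles/iteration %llu\n",
	       (unsigned long long)(end - start),
	       (unsigned long long)((end - start) / ITERATIONS));
	return 0;
}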
Hash | Total cycles | Cycles/iteration | Perf vs BLAKE3 | Perf vs BLAKE2b |
---|---|---|---|---|
BLAKE3 (AVX2) | 111260245256 | 11126 | 1.0 | 0.876 (-13%) |
BLAKE2b (AVX2) | 127009487092 | 12700 | 1.141 (+14%) | 1.0 |
BLAKE2b (AVX) | 166426785907 | 16642 | 1.496 (+50%) | 1.310 (+31%) |
BLAKE2b (ref) | 225053579540 | 22505 | 2.022 (+102%) | 1.772 (+77%) |
Right now there’s only the reference Rust implementation and a derived C implementation of BLAKE3, claimed not to be optimized but from my other experience the compiler can do a good job optimizing programmers ideas away. There’s only one BLAKE3 entry with the AVX2 implementation, the best hardware support my testing box provides. As I had the other results of BLAKE2 at hand, they’re in the table for comparison, but the most interesting pair are the AVX2 versions anyway.
The improvement is 13-14%. Not much, ain't it, way less than the announced 4+x faster than BLAKE2b. Well, it's always important to interpret results of a benchmark with respect to the environment of measurement and the tested parameters.
For the BTRFS filesystem the block size is always going to be in kilobytes. I can't find what input size was used for the official benchmark results; the bench.rs script iterates over various sizes, so I assume it's an average. Short input buffers can skew the results as the setup/output overhead can be significant, while for long buffers the compression phase is significant. I don't have an explanation for the difference and won't draw conclusions about BLAKE3 in general.
One thing that I dare to claim is that I can sleep well because upon the above evaluation, BLAKE3 won’t bring a notable improvement if used as a checksum hash.
References
Personal addendum
During the evaluations, now and in the past, I've found it convenient when there's an offer of implementations in various languages. That eg. the Keccak project pages do not point me directly to a C implementation slightly annoyed me, and as the reference implementation in C++ performed worse than BLAKE2, I did not take the next step to compare the C version, wherever I would find it.
BLAKE3 is fresh and the Rust implementation seems to be the only thing that has been improved since the initial release. A plain C implementation without any warning-not-optimized labels would be good. I think that C versions will appear eventually; besides Rust being the new language hotness, there are projects that have not yet gone "let's rewrite it in Rust". Please bear with us.
A bit more detailed overview of a btrfs update that I find interesting, see the pull request for the rest.
CRC32C uses accelerated versions on other architectures
… than just Intel-based ones. There was a hard coded check for the intel SSE
feature providing the accelerated instruction, but this totally skipped other
architectures. A brief check in linux.git/arch/*/crypto
for files
implementing the accelerated versions revealed that there’s a bunch of them:
ARM, ARM64, MIPS, PowerPC, s390 and SPARC. I don’t have enough hardware to show
the improvements, though.
Automatically remove incompat bit for RAID5/6
While this is not the fix everybody is waiting for, it demonstrates the user-developer-user cycle of how things can be improved. A filesystem created with RAID5 or -6 profiles sets an incompatibility bit. That's supposed to be there as long as there's any chunk using the profiles. The user expectation is that the bit should be removed once the chunks are eg. balanced away. This is what got implemented and it serves as a prior example for any future feature that might get removed on a given filesystem. (Note from the future: I did that for RAID1C34 as well.)
Side note: for the chunk profile it’s easy because the runtime check of
presence is quick and requires only going through a list of chunk profiles
types. Example of the worst case would be dropping the incompat bit for LZO
after there are no compressed files using that algorithm. You can easily see
that this would require either enumerating all extent metadata (one time check)
or keeping track of count since mkfs time. This is IMHO not a typical request
and can be eventually done using the btrfstune
tool on an unmounted
filesystem.
One thing you windows guys have is a decent binary diff. I know there's radare2 on linux, but I'm too much of a noob to set it up and I'd be fine with something less fancy. Something like two objdump -dr files, an intelligent filter and vimdiff.
The tricky part is the intelligent filter. I say intelligent, but in fact it’s a 20 line perl script that basically filters out things that are not important or are likely to change in a slightly modified C source.
The result is lossy, eg. there are no addresses and thus jumps can’t be followed, but having that would actually deserve the ‘intelligent’ title. My use case is simpler, eg. doing small tweaks like reordering lines, adding annotations (like READ_ONCE/WRITE_ONCE) or reducing argument counts.
Which takes eg.
0000000000000000 <btrfs_set_lock_blocking_read>:
0: e8 00 00 00 00 callq 5 <btrfs_set_lock_blocking_read+0x5>
1: R_X86_64_PLT32 __fentry__-0x4
5: 55 push %rbp
6: 53 push %rbx
7: 48 89 fb mov %rdi,%rbx
a: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
f: 65 8b 05 00 00 00 00 mov %gs:0x0(%rip),%eax # 16 <btrfs_set_lock_blocking_read+0x16>
12: R_X86_64_PC32 cpu_number-0x4
16: 89 c0 mov %eax,%eax
18: 48 0f a3 05 00 00 00 bt %rax,0x0(%rip) # 20 <btrfs_set_lock_blocking_read+0x20>
1f: 00
1c: R_X86_64_PC32 __cpu_online_mask-0x4
20: 0f 82 b5 00 00 00 jb db <btrfs_set_lock_blocking_read+0xdb>
26: 0f b6 ab 94 00 00 00 movzbl 0x94(%rbx),%ebp
2d: 40 80 fd 01 cmp $0x1,%bpl
31: 0f 87 00 00 00 00 ja 37 <btrfs_set_lock_blocking_read+0x37>
33: R_X86_64_PC32 .text.unlikely+0x11
37: 83 e5 01 and $0x1,%ebp
3a: 74 1b je 57 <btrfs_set_lock_blocking_read+0x57>
3c: 8b 8b 88 00 00 00 mov 0x88(%rbx),%ecx
42: 65 48 8b 04 25 00 00 mov %gs:0x0,%rax
49: 00 00
47: R_X86_64_32S current_task
4b: 39 88 c0 04 00 00 cmp %ecx,0x4c0(%rax)
51: 0f 84 bd 00 00 00 je 114 <btrfs_set_lock_blocking_read+0x114>
57: 8b 83 18 02 00 00 mov 0x218(%rbx),%eax
5d: 85 c0 test %eax,%eax
5f: 0f 84 b2 00 00 00 je 117 <btrfs_set_lock_blocking_read+0x117>
65: f0 ff 83 90 00 00 00 lock incl 0x90(%rbx)
6c: 8b 83 14 02 00 00 mov 0x214(%rbx),%eax
72: 85 c0 test %eax,%eax
74: 0f 84 ab 00 00 00 je 125 <btrfs_set_lock_blocking_read+0x125>
7a: f0 ff 8b 14 02 00 00 lock decl 0x214(%rbx)
81: 48 8d bb 98 00 00 00 lea 0x98(%rbx),%rdi
88: 5b pop %rbx
89: 5d pop %rbp
...
and produces
0000000000000000 <btrfs_set_lock_blocking_read>:
callq btrfs_set_lock_blocking_read
TARGET __fentry__
push %rbp
push %rbx
mov %rdi,%rbx
NOP
mov %gs:0x0(%rip),%eax
mov %eax,%eax
bt %rax,0x0(%rip)
jb btrfs_set_lock_blocking_read
movzbl 0x94(%rbx),%ebp
cmp $0x1,%bpl
ja btrfs_set_lock_blocking_read
and $0x1,%ebp
je btrfs_set_lock_blocking_read
mov 0x88(%rbx),%ecx
mov %gs:0x0,%rax
cmp %ecx,0x4c0(%rax)
je btrfs_set_lock_blocking_read
mov 0x218(%rbx),%eax
test %eax,%eax
je btrfs_set_lock_blocking_read
lock incl 0x90(%rbx)
mov 0x214(%rbx),%eax
test %eax,%eax
je btrfs_set_lock_blocking_read
lock decl 0x214(%rbx)
lea 0x98(%rbx),%rdi
pop %rbx
pop %rbp
...
This looks quite simple and when lined together, the diff is readable.
So here’s the magic script (pardon my perl skills):
#!/usr/bin/perl
@c=<>;
foreach(@c) {
	chomp;
	# turn PLT32 relocation lines into "TARGET <symbol>"
	s/.*R_X86_64_PLT32\s+([^+-]+)[+-].*/TARGET $1/;
	# drop the remaining relocation lines
	next if(/R_X86_/);
	# drop lines that contain only an address and hex bytes
	next if(/^\s*[0-9a-f]+:\s*([0-9a-f][0-9a-f]\s)+$/);
	# strip the address and the hex bytes, keep the mnemonic and operands
	s/^\s*[0-9a-f]+:\s*([0-9a-f][0-9a-f]\s)+\s*//;
	# replace jump/call targets with the bare symbol name
	s/[0-9a-f]+ <([^+]+)\+.*>$/$1/;
	# drop trailing comments added by objdump
	s/\s+#.*$//;
	# normalize the various nop encodings
	s/nopl.*/NOP/;
	s/xchg.*ax.*ax/NOP/;
	s/data16 nop/NOP/;
	s/nop/NOP/;
	print("$_\n");
}
Use like:
$ objdump -dr before/locking.o > before
$ objdump -dr after/locking.o > after
$ vimdiff before after
The only checksum algorithm currently implemented and used in btrfs is crc32c, with a 32bit digest. It has served well for years but has some weaknesses, so we're looking for replacements and also for enhancing the use cases based on possibly better hash algorithms.
The advantage of crc32c is its simplicity of implementation, the various optimized versions, and the hardware CPU instruction-level support. The error detection strength is not great though, collisions are easy to generate. Note that crc32c has not been used in a cryptographically sensitive context (eg. deduplication).
Side note: the collision generation weakness is used in the filesystem image dump tool to preserve hashes of directory entries while obscuring the real names.
Future use cases
The natural use case is still to provide checksumming for data and metadata blocks. With strong hashes, the same checksums can be used to aid deduplication or verification (HMAC instead of plain hash). Due to different requirements, one hash algorithm cannot possibly satisfy all requirements, namely speed vs. strength. Or can it?
The criteria
The frequency of checksumming is high: every data block and every metadata block needs it.
During the discussions in the community several candidate hash algorithms were proposed, and as it goes, users want different things while we developers want to keep the number of features sane, or at least maintainable. I think and hope the solution will address that.
The first hash type is a replacement for crc32c, with focus on speed and not necessarily strength (ie. collision resistance).
The second type focuses on strength, in the cryptographic sense.
In addition to the technical aspects of the hashes, there are some requirements that would allow free distribution and use of the implementations:
Constraints posed by btrfs implementation:
For the enhanced use case of data verification (using HMAC), there’s a requirement that might not interest everybody but still covers a lot of deployments. And this is standard compliance and certification:
And maybe last but not least, use something that is in wide use already, proven by time.
Speed
Implementations of all algorithms should be performant on common hardware, ie. 64bit architectures, and hopefully not terrible on 32bit architectures or older and weaker hardware. By hardware I mean the CPU, not specialized hardware cards.
The crypto API provided by the linux kernel can automatically select the best implementation of a given algorithm, eg. an implementation optimized in assembly, on multiple architectures.
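To illustrate how that selection is consumed from kernel code, here's a minimal sketch of computing a digest through the shash interface; only the algorithm name string changes when switching hashes (the function name and the lack of error handling are just for this example):
#include <crypto/hash.h>

/* Minimal sketch: the crypto API resolves the name to the highest-priority
 * registered implementation (generic C, assembly, HW-accelerated, ...). */
static int demo_digest(const u8 *data, unsigned int len, u8 *out)
{
	struct crypto_shash *tfm;
	int ret;

	tfm = crypto_alloc_shash("crc32c", 0, 0);	/* or another name from /proc/crypto */
	if (IS_ERR(tfm))
		return PTR_ERR(tfm);

	{
		SHASH_DESC_ON_STACK(desc, tfm);

		desc->tfm = tfm;
		ret = crypto_shash_digest(desc, data, len, out);
	}

	crypto_free_shash(tfm);
	return ret;
}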
Strength
For the fast hash, collisions can probably be generated, but hopefully not as easily as for crc32c. For the strong hash it's obvious that finding a collision would be a jackpot.
In the case of the fast hash, the quality can be evaluated using the SMHasher suite.
The contenders
The following list of hashes has been mentioned and considered or evaluated:
xxhash
criteria: license OK, implementation OK, digest size OK, not standardized but in wide use
The hash is quite fast as it tries to exploit the CPU features that allow instruction parallelism. The SMHasher score is 10, that’s great. The linux kernel implementation landed in 5.3.
Candidate for fast hash.
XXH3
Unfortunately the hash is not yet finalized and cannot be in the final round, but for evaluation of speed it was considered. The hash comes from the same author as xxhash.
SipHash
This hash is made for hash tables and hashing short strings but we want 4KiB or larger blocks. Not a candidate.
CRC64
Similar to crc32 and among the contenders only because it was easy to evaluate, but otherwise it is not in the final round. It turned out to be slow in the microbenchmark.
SHA256
criteria: license OK, implementation OK, digest size OK, standardized in FIPS
The SHA family of hashes is well known, has decent support in CPU and is standardized. Specifically, SHA256 is the strongest that still fits into the available 32 bytes.
Candidate for strong hash.
SHA3
criteria: license OK, implementation OK, digest size OK, standardized in FIPS
It won the 2012 hash contest, so we can't leave it out. From the practical perspective of checksumming, the speed is bad even for the strong hash type. One of the criteria stated above is performance without special hardware, unlike what was preferred during the SHA3 contest.
Candidate for strong hash.
BLAKE2
criteria: license OK, implementation OK, digest size OK, not standardized
From the family of BLAKE that participated in the 2012 SHA contest, the ‘2’ provides a trade-off between speed and strength. More and more projects adopt it.
Candidate for strong hash.
Final round
I don’t blame you if you skipped all the previous paragraphs. The (re)search for the next hash was quite informative and fun, so it would be a shame not to share it, and it also documents the selection process for some transparency. This is a committee-driven process though.
Selecting two hashes for the strong type is a compromise to get a fast-but-strong hash and also something that’s standardized.
The specific version of BLAKE2 is going to be the ‘BLAKE2b-256’ variant, ie. optimized for 64bit (2b) but with 256bit digest.
Microbenchmark
A microbenchmark gives more details about performance of the hashes:
Block: 4KiB (4096 bytes), Iterations: 100000
Hash | Total cycles | Cycles/iteration |
---|---|---|
NULL-NOP | 56888626 | 568 |
NULL-MEMCPY | 60644484 | 606 |
CRC64 | 3240483902 | 32404 |
CRC32C | 174338871 | 1743 |
CRC32C-SW | 174388920 | 1743 |
XXHASH | 251802871 | 2518 |
XXH3 | 193287384 | 1932 |
BLAKE2b-256-arch | 1798517566 | 17985 |
BLAKE2b-256 | 2358400785 | 23584 |
BLAKE2s-arch | 2593112451 | 25931 |
BLAKE2s | 3451879891 | 34518 |
SHA256 | 10674261873 | 106742 |
SHA3-256 | 29152193318 | 291521 |
Machine: Intel(R) Xeon(R) CPU E5-1620 v3 @ 3.50GHz (AVX2)
Hash implementations are the reference ones in C:
There aren’t optimized versions for all hashes, so for a fair comparison the unoptimized reference implementations should be used. As BLAKE2 is my personal favourite I added the other variants and optimized versions to observe the relative improvements.
Evaluation
CRC64 was added out of mere curiosity about how it compares to the rest. Well, curiosity satisfied.
SHA3 is indeed slow on a CPU.
What isn’t here
There are non-cryptographic hashes like CityHash, FarmHash, Murmur3 and more that were found unsuitable or not meeting some of the basic criteria. Others like FNV or Fletcher (used in ZFS) are of comparable strength to crc32c, so that would not be progress.
Merging
All the preparatory work in btrfs landed in version 5.3. Hardcoded assumptions of crc32c were abstracted, the linux crypto API was wired in, with additional cleanups or refactoring. With that in place, adding a new hash is a matter of a few lines of code adding the specifier string for the crypto API.
The work on btrfs-progs is following the same path.
Right now, version 5.4 is in development but new features can’t be added, so the target for the new hashes is 5.5. The BLAKE2 algorithm family still needs to be submitted and merged, hopefully it’ll make it to 5.5 as well.
One of my merge process requirements was to do a call for public testing, so we’re going to do that once all the kernel and progs code is ready for testing. Stay tuned.
A bit more detailed overview of a btrfs update that I find interesting; see the pull request for the rest.
Read-time, write-time corruption detection
The tree-checker is a recent addition to the sanity checks, a functionality that verifies a subset of the metadata structures. The capabilities and strength are limited by the context and the bits of information available at a given point, but there’s still enough to get additional value.
The context here means a b-tree leaf that packs several items into a block (ie. the nodesize, 16KiB by default). The individual items’ members’ values can be verified against allowed values, the order of the keys of all items listed in the leaf header can be checked, etc. This is just to give a rough idea of what happens.
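For illustration only, a check of the key ordering could look roughly like the sketch below; this is not the actual tree-checker code, just the kind of purely local verification that needs nothing but the leaf already in memory (the accessors follow the btrfs helpers, the function itself is made up):
static int leaf_keys_sorted(struct extent_buffer *leaf)
{
	u32 nritems = btrfs_header_nritems(leaf);
	struct btrfs_key key, prev;
	u32 i;

	for (i = 0; i < nritems; i++) {
		btrfs_item_key_to_cpu(leaf, &key, i);
		/* keys must be strictly increasing within the leaf */
		if (i && btrfs_comp_cpu_keys(&prev, &key) >= 0)
			return -EUCLEAN;
		prev = key;
	}
	return 0;
}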
With that in the works, there are two occasions that can utilize the extended checking:
It’s clear that the read-time check is merely a detector of problems that have already happened, so there’s not much to do besides warning and returning an error (EUCLEAN). Turning the filesystem read-only to prevent further problems is another option, and inevitable on some occasions.
The write-side check aims to catch silent errors that could otherwise make it to the permanent storage. The reasons why this happens are twofold, but the main idea is to catch memory bit flips. You’d be surprised how often that happens; that would be for a separate article entirely.
The new checks in 5.2 improve:
As mentioned before, the bits to check are inside a single buffer that represents the tree block, and the check is really local to that. As an inode item can represent more than just files and directories, doing structural checks like where and how the inode item is linked is not easy in this context. This is basically what btrfs check would do.
The checks need to be fast because they happen on each metadata block, so no additional IO is allowed. This still brings some overhead, but it is (considered) negligible compared to all the other updates to the block. A measurement can be done by adding a tracepoint, but that’s left as an exercise.
Fun stuff only …
Hash some bytes with changing digest length from 1 to 64.
[ 1] 3b
[ 2] cf5b
[ 3] 68a2fa
[ 4] ac6de759
[ 5] 022847e13c
[ 6] fd2965be0997
[ 7] d4c93ad6ffbcbb
[ 8] 87c1dcb224fd2a57
[ 9] 6017c4f8c4bcb1d115
[10] 9b3f7b15df8867b8214c
[11] 8087b2861bbdf3d0356ea6
[12] 7a035413d62f0774148a12cc
[13] 697cabcfcade7114a271373039
[14] 784e3578c7b5a031237296225e26
[15] 0102e6aa5ff6bf57071b34f838d3d4
[16] 324f54d0e8f39326145e8e04b00434b4
[17] 45fbfb40756a53305881dc52d7aa6a9025
[18] c1a664957cbc0bbd0e2eb15be3a8e0f10b46
[19] 58c84031abff74bba0cab2a3ce8160c955fb92
[20] bad65265d1351746c3fc23816dca6966b42d2e90
[21] c9ed77f178c889cc985aaad89d740831922d7d6253
[22] 751673d5516709339176859392a61e45b444a3a4612c
[23] e62b1a6a239571d2a7ebf686f3f1f628f7253fa595debf
[24] 3afb48409fa94374bd1819bef5d40f8e90386686ec7212db
[25] 139fb5a766ab873fc1747e76253ebddec0da85388261ddcd0c
[26] 90b849e8d333229c8942f68faac2870530886c30edf4fe295281
[27] cfefa0767fdd316475e4374dc3b9e1eabdebb0111c88ccdaf6b668
[28] 6928e0b74efa9534c84483975ab83b65191d59d69c76880eac794afc
[29] 5d825aaaaabff533c36308227ed42982837bf7fa85d5f4457f6d7ebdfe
[30] a14f3dafe99f88dbb119f23cd2d390b3396f2c1c90fc8099514abb95872a
[31] 6a62d371a4b6583df5cc9f125e01536338a8e87fd17b1e24468778e00205bc
[32] 99b4a09edc99a1a2283f94189d8604ca65aab7aeebc31327b8c4a918341f92b0
[33] 1b0728d4ae415dc572ebb7caf18f7b43115ca6b53f6379a6571b4949d2049e6b7f
[34] 04c6817ba561b6e6aefb2237682aa4171314fc2a7bfa15f604227e239e8a08696153
[35] 102d84c3e030bd1d413f12128c34caef64c06c364eb13c4eb49549ab278efcd5d0dac1
[36] fa4edc42c4fcd5754de080616d09412c3f7baba4a972954e449bfa65fc040ae5556bc63c
[37] 0728fcf98a665b6a8c9a63ae31c87538d68bca8dc2e2a6b66dba3c1f4a8c5965bb4045a1e2
[38] 3fd0e5397a0d98eaf56a941e5b7ba7b8aa4b8dbe2a22993809cbb9a4b68834de1341c1412e2f
[39] 16eee95d5dc56762bc42f59aac2fc0bee5ba3a5fa7d6b2c1acc13af519a1a13cc793d2482685b9
[40] 74843823a200d423fbead5d40d83dea26638ecf79a768f819b44174a2ce0b51702944ff967597b9f
[41] 53134460150687f2b993259a048a9b59de7d8c091ffc635b792b33ae9c7970ef6dc47fe6308e47da21
[42] 174cbb459228b1b5fe302d36dfd69d0c75b8d77398e6eb8440e9449db38ce7387d12095f23531eb5a2bd
[43] f550a1d564f073486095bf473a20d135fcba595fd574c124e6a8f3f1451a4cd649b3d3c6c22fd3961dff9e
[44] 9e25048892b6d367d0e85a75d1f49c6950a9ae3ca02b03165b1308f0aadac093032fbb13556cb9bb85872def
[45] d43eb7fa68c34906813c360cd5c50e8d5f62c0de941f1e9520d405104144c30c1eb43c95bad4b850b3dc308141
[46] b9e4bc8364d69248c68dfe17a66131de04e6be949663c936fdbe5fec3efb73d4e40c22ad298d8be849de58b0a3a8
[47] f81017569d44554c934f6d89723e8997e115def6d66c76ee3c05a7e7626f487cab5659b4e3a7f32c9859dd04c579eb
[48] f2a987f00b96fafd25f2d8a19717913674961f234d0f81da5dbb549382ce2f25a5a95c4be917b97341d7c8b8a05aa2d1
[49] 03602f74fadc251d46a064dcc55a812c37c98960a984782c533bce1db92c514521996c58ff293a3957fd686cd814cbc18e
[50] bf939f9418697008f5a1c67e16e7f01bf596a7be648069c5bb322ba7207b6af5947912a40773f3c8f48bef03cda925c79eb8
[51] 77cb53241baa7dfd4b73b786b29543ad4f9da15c33d21b7acf0d04bd78c437018f7358357879f06e8fa59832028a851c126fbf
[52] 404a9cdd87d1ff194c45c9dd1995889b5492b22d446f87f8efe0148793bee21a756cfe6ac18f438f2535b35e909678e1f4ddfd92
[53] 38b399db58e9a1d4b3d78e3756b55a7d8bc3fce565a6cd68f66632cb3e201720384487bc314fc960cc7ece0915ab06ab655cb21f95
[54] 341c51088313284bb6efb2ce87116fa0ad86c540b4761ad1a91db8a54faf0ac16ae44de62e02d87d88124cdf7820f939657f30d43401
[55] 4d217c7ba91da2c4503357904f2264117a445fde54a7f25cf85c0b9d4528e1a8ca682ea59586d63755c2dac848716d19afa00f9c2a70e5
[56] 05664a51270f7a9dc309a1e78bddd3ed6a210aecf31f42be0a4231280a9e15ca1af3ec0456760fc2a6b76039a7f041a7623470bd912afc7d
[57] 434c8e7d8357e840f8522c582385fcba6650b7e923a821a185ae4e036de7ed6b931b908ef4bc3838de6864decdff762ac123f6f66a14a2fded
[58] 87fc3385049632e310553cda08b22810b0fa59c92f5559b49d2ffd5e5d0734886ccd269e6866d6412bc06489cba014265767edebcd2c18977c42
[59] 282ca365baaa22317a3c7faadc075e6b132fefee0d98d8823a79ffe87f87f7a38d8fd9e4a7ca10a904fbbcd9952dc12af3e561b884a40b7b2b4288
[60] a0f9ca58069fe970f2fd36d6c86b38b14ad41418f870ee695aa52395400552c4f583760e01c68c440c5fa6f9c6db25138d85cd6568f0f9a5698db89a
[61] 79b0b42401bb022fdfb71fdd2e6ddd2d0dc20cedcdd4b64e7921a03fd2f71508a195e2d8a4ebeb0ea8cdc6942ccd28cca7eda0fa2126fd884a0b279cd4
[62] 42b704ca2da3386bf00fc511648668af1b9f3bbf69d7972d35dc24365d0b37778fffa624662bd50139430644314367ff855da6bb489227144456007d5f1d
[63] 504a6a52e16b29a0ebe9106ac74931889243ef98aee97bd598502aac9363d99c7c414d39cb1229a3b49e79351f2c0daa3e9221bc7a135ffc41f79a6e416762
[64] 0b944b6bdc6c99e10448a536dea6ea79fd1d78e70307bf6e3c4a889f3876e2ec6a002ec9b4198a27e2b2dcdfcef7a4de4da52791248457a50d97307cdb0b6c80
Sources: blake2-impl.h test-blake2.h test-blake2b.c
The BLAKE2 algorithm has been out for some time, yet there’s no port to the linux crypto API. The recent push of WireGuard would add it to the linux kernel, but using a different API (zinc) than the existing one. I haven’t explored zinc; I assume there will be a way to use it from inside the kernel, but this would need another layer to switch the APIs according to the algorithm.
As btrfs is going to use more hashing algorithms, we are in the process of selection. One of the contenders is BLAKE2, though everybody would have a different suggestion. In order to test it, a port is needed, which is basically glue code between the linux crypto API and the BLAKE2 implementation.
I’m not going to reimplement crypto or anything, so the default implementation is going to be the reference one found in the blake2 git.
Note: due to the maximum space of 32 bytes available in the btrfs metadata blocks, the version is BLAKE2s.
The BLAKE2s sources
From the repository https://github.com/BLAKE2/BLAKE2:
Briefly skimming over the sources, there’s nothing that’ll cause trouble. Standard C code, some definitions. Adapting it for the linux kernel means replacing the stdint.h types (uint8_t etc) with the kernel uXX types (u8 etc) and changing the include paths. For simplicity, let’s remove the ifdefs for C++ and MSVC too.
Add the new algorithm definition (CRA)
Though it’s possible to prepare a module for an out-of-tree build (see below), let’s do it inside the linux.git/crypto/ path for now. There’s also plenty of sources to copy & paste. I’ve used crc32c_generic.c, and it turned out to be useful.
The crypto hash description is defined in struct shash_alg; it contains the technical description like the type of the algorithm, the length of the context, callbacks for the various phases of the hash calculation (init, update, final), and the identification of the algorithm and the implementation. The default implementations are in C and use the string “-generic” in the name.
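A minimal sketch of such a definition, modeled on crc32c_generic.c and matching the /proc/crypto output further below; the callback and context type names are placeholders that reappear in the glue-code sketch later:
static struct shash_alg alg = {
	.digestsize	= 32,		/* BLAKE2s has a 32-byte digest */
	.init		= blake2s_checksum_init,
	.update		= blake2s_checksum_update,
	.final		= blake2s_checksum_final,
	.descsize	= sizeof(struct blake2s_desc_ctx),
	.base		= {
		.cra_name	 = "blake2s",
		.cra_driver_name = "blake2s-generic",
		.cra_priority	 = 100,
		.cra_blocksize	 = 1,
		.cra_ctxsize	 = sizeof(struct blake2s_tfm_ctx),
		.cra_module	 = THIS_MODULE,
		.cra_init	 = blake2s_cra_init,
	},
};
The module init/exit then just call crypto_register_shash(&alg) and crypto_unregister_shash(&alg).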
The crc32c module came with a few stub callbacks (checksum_init etc) that will only call into the blake2 functions, and their definitions can stay. A simple search and replace from crc32c to blake2s will do most of the conversion.
Add the glue code for crypto API
Now we have blake2s.c with the reference implementation and the crypto algorithm definition. The glue code connects the API with the implementation. We need 2 helper structures that hold the context once the user starts a digest calculation. The private blake2s context is embedded into one of them; the intermediate results will be stored there.
And the rest is quite easy: each callback will call into the respective blake2s function, passing the context, data and lengths. One thing that I copied from the examples is the key initialization in blake2s_cra_init, which appears to be global and is copied to the context each time a new one is initialized.
Here the choice of crc32c.c helped, as there were stub callbacks with the right parameters, calling the blake2s functions that can retain their original signature. This makes later updates easier. All the functions are static, so the compiler will most probably optimize the code so that there is no unnecessary overhead.
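Put together, the glue callbacks could look roughly like this sketch (unkeyed for brevity, the default-key setup in blake2s_cra_init is left out; the names are the same placeholders as in the struct shash_alg sketch above):
#include <crypto/internal/hash.h>
#include <linux/errno.h>
#include "blake2.h"			/* reference implementation header */

struct blake2s_desc_ctx {
	blake2s_state state;		/* per-request state of the reference code */
};

static int blake2s_checksum_init(struct shash_desc *desc)
{
	struct blake2s_desc_ctx *ctx = shash_desc_ctx(desc);

	return blake2s_init(&ctx->state, BLAKE2S_OUTBYTES) ? -EINVAL : 0;
}

static int blake2s_checksum_update(struct shash_desc *desc, const u8 *data,
				   unsigned int length)
{
	struct blake2s_desc_ctx *ctx = shash_desc_ctx(desc);

	return blake2s_update(&ctx->state, data, length) ? -EINVAL : 0;
}

static int blake2s_checksum_final(struct shash_desc *desc, u8 *out)
{
	struct blake2s_desc_ctx *ctx = shash_desc_ctx(desc);

	return blake2s_final(&ctx->state, out, BLAKE2S_OUTBYTES) ? -EINVAL : 0;
}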
Well and that’s it. Let’s try to compile it and insert the module:
linux.git$ make crypto/blake2s.o
...
CC [M] crypto/blake2s.o
linux.git$ make crypto/blake2s.ko
...
CC crypto/blake2s.mod.o
LD [M] crypto/blake2s.ko
...
Check that it’s been properly loaded
linux.git/crypto$ sudo insmod ./blake2s.ko
and registered
linux.git$ cat /proc/crypto
name : blake2s
driver : blake2s-generic
module : blake2s
priority : 100
refcnt : 1
selftest : passed
internal : no
type : shash
blocksize : 1
digestsize : 32
...
The selftest says it passed, but there’s no such thing so far. There are test values provided in the blake2 git, so it would be nice to have that too (tm). But otherwise it looks good.
To do an actual test, we’d need something inside the kernel to utilize the new hash. One option is to implement a module that will do that, or to use the userspace library libkcapi that can forward the requests from userspace to the available kernel implementations.
Test it with libkcapi
The libkcapi project at http://www.chronox.de/libkcapi.html provides an API that uses the AF_ALG socket type to exchange data with the kernel. The library provides a command line tool that we can use right away, so we don’t need to code anything.
$ kcapi-dgst -c blake2s --hex < /dev/null
48a8997da407876b3d79c0d92325ad3b89cbb754d86ab71aee047ad345fd2c49
The test vectors provided by blake2 confirm that this is the hash of the empty string with the default key (0x000102..1f).
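For completeness, the same AF_ALG interface can be used directly from C without libkcapi; a minimal sketch (hashing a short non-empty string here, error handling omitted, the algorithm name is the one registered above):
#include <stdio.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/if_alg.h>

int main(void)
{
	struct sockaddr_alg sa = {
		.salg_family = AF_ALG,
		.salg_type = "hash",
		.salg_name = "blake2s",		/* as listed in /proc/crypto */
	};
	unsigned char digest[32];
	int tfm, op;

	tfm = socket(AF_ALG, SOCK_SEQPACKET, 0);
	bind(tfm, (struct sockaddr *)&sa, sizeof(sa));
	op = accept(tfm, NULL, 0);

	send(op, "abc", 3, 0);			/* one message, no MSG_MORE */
	read(op, digest, sizeof(digest));	/* returns the finished digest */

	for (unsigned int i = 0; i < sizeof(digest); i++)
		printf("%02x", digest[i]);
	printf("\n");

	close(op);
	close(tfm);
	return 0;
}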
Out-of-tree build
Sources in linux.git require one additional line in the Makefile to build it unconditionally as a module. A proper submission to the linux kernel would also need a Kconfig option.
obj-m += blake2s.o
The standalone build needs a Makefile with a few targets that use the existing build of the kernel. Note that you’d need a running kernel built from the same sources. This is usually provided by the kernel-*-devel packages. Otherwise, if you build kernels from git, you know what to do, right?
KDIR ?= /lib/modules/`uname -r`/build
obj-m := blake2s.o
default:
	$(MAKE) -C $(KDIR) M=$$PWD
clean:
	$(MAKE) -C $(KDIR) M=$$PWD clean
modules_install:
	$(MAKE) -C $(KDIR) M=$$PWD modules_install
After running make, the kernel module is ready for use.
What next?
Send it upstream. Well, after some work of course.
All the files can be found here:
References
Update: enhance the text, still a bit raw, check the numbers
Example of benchmarks of patches going to the 4.20 kernel. The goal was to identify potential differences between the development branch with and without the merge to mainline. The factors affecting the difference are usually in other subsystems like the block layer (IO queuing), memory management (namely page cache, memory allocator, writeback) or the core (scheduling, locking primitives).
The numbers are reflected in the results below, referring to comparisons of the branches. 1-to-2 is the net effect of the patches, 1-to-3 is the same patches after merge.
The testsuite used is MMTests, which is internally used for performance evaluation and provides a wide range of workloads.
The expected performance delta should be roughly the same for 2 and 3. The tables show whether the delta is significant (green: positive, red: negative), so it can be easily observed. Note that the values need to be properly interpreted in the context of the benchmark and perhaps other values.
MMTests results
The raw stats are available.
Test setup
After updating qemu to version 2.12, my testing VMs stopped just warning about if=scsi and did not want to start anymore (with a bit more cryptic message).
qemu-system-x86_64: -drive file=root,if=scsi,media=disk,cache=none,index=0,format=raw: machine type does not support if=scsi,bus=0,unit=0
The if=scsi shortcut was handy; the maze of qemu command line options may take some time to comprehend, but then you can do wonders.
The release notes of version 2.12 do mention that the support is gone and that other options should be used instead, and they also mention which ones. But they do not tell how exactly. As this should be a simple syntax transformation from one option to another, an example would have saved a lot of time.
I was not able to find any ready-to-use example in a few minutes so had to experiment a bit (this saved me documentation reading time).
The setup I use is a file-backed root image for a virtual machine, nothing really fancy. The file is named root, a sparse file, raw (ie. no qcow2), no caching.
Here you are:
-device virtio-scsi-pci,id=scsi
-drive file=root,id=root-img,if=none,format=raw,cache=none
-device scsi-hd,drive=root-img
First you need to define the bus, otherwise the 3rd line will complain that there’s no SCSI bus. The second line defines the file-backed drive and the third one puts it all together.
Using SCSI might not be the best idea for a qemu VM as the emulated driver is buggy and crashes, so I’d recommend using virtio, but for an almost read-only root image it’s fine. Also, the device names are predictable.
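Put together, the relevant part of the invocation looks like this (the memory size and the other options are just placeholders for whatever your VM already uses):
$ qemu-system-x86_64 -enable-kvm -m 2048 \
    -device virtio-scsi-pci,id=scsi \
    -drive file=root,id=root-img,if=none,format=raw,cache=none \
    -device scsi-hd,drive=root-img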
The errno values, in the various formats you may encounter. The standard errno is a positive number; the linux kernel uses the negative values. Besides the sources (going through like 5 include hops from /usr/include/errno.h to /usr/include/asm-generic/errno.h and /usr/include/asm-generic/errno-base.h) there’s no direct way to get the symbolic <-> numeric mapping. The hex values commonly appear in stack trace dumps.
symbolic | dec | hex | neg hex | comment |
---|---|---|---|---|
EPERM | 1 | 0x01 | 0x..ffff | Operation not permitted |
ENOENT | 2 | 0x02 | 0x..fffe | No such file or directory |
ESRCH | 3 | 0x03 | 0x..fffd | No such process |
EINTR | 4 | 0x04 | 0x..fffc | Interrupted system call |
EIO | 5 | 0x05 | 0x..fffb | I/O error |
ENXIO | 6 | 0x06 | 0x..fffa | No such device or address |
E2BIG | 7 | 0x07 | 0x..fff9 | Argument list too long |
ENOEXEC | 8 | 0x08 | 0x..fff8 | Exec format error |
EBADF | 9 | 0x09 | 0x..fff7 | Bad file number |
ECHILD | 10 | 0x0a | 0x..fff6 | No child processes |
EAGAIN | 11 | 0x0b | 0x..fff5 | Try again |
ENOMEM | 12 | 0x0c | 0x..fff4 | Out of memory |
EACCES | 13 | 0x0d | 0x..fff3 | Permission denied |
EFAULT | 14 | 0x0e | 0x..fff2 | Bad address |
ENOTBLK | 15 | 0x0f | 0x..fff1 | Block device required |
EBUSY | 16 | 0x10 | 0x..fff0 | Device or resource busy |
EEXIST | 17 | 0x11 | 0x..ffef | File exists |
EXDEV | 18 | 0x12 | 0x..ffee | Cross-device link |
ENODEV | 19 | 0x13 | 0x..ffed | No such device |
ENOTDIR | 20 | 0x14 | 0x..ffec | Not a directory |
EISDIR | 21 | 0x15 | 0x..ffeb | Is a directory |
EINVAL | 22 | 0x16 | 0x..ffea | Invalid argument |
ENFILE | 23 | 0x17 | 0x..ffe9 | File table overflow |
EMFILE | 24 | 0x18 | 0x..ffe8 | Too many open files |
ENOTTY | 25 | 0x19 | 0x..ffe7 | Not a typewriter |
ETXTBSY | 26 | 0x1a | 0x..ffe6 | Text file busy |
EFBIG | 27 | 0x1b | 0x..ffe5 | File too large |
ENOSPC | 28 | 0x1c | 0x..ffe4 | No space left on device |
ESPIPE | 29 | 0x1d | 0x..ffe3 | Illegal seek |
EROFS | 30 | 0x1e | 0x..ffe2 | Read-only file system |
EMLINK | 31 | 0x1f | 0x..ffe1 | Too many links |
EPIPE | 32 | 0x20 | 0x..ffe0 | Broken pipe |
EDOM | 33 | 0x21 | 0x..ffdf | Math argument out of domain of func |
ERANGE | 34 | 0x22 | 0x..ffde | Math result not representable |
EDEADLK | 35 | 0x23 | 0x..ffdd | Resource deadlock would occur |
ENAMETOOLONG | 36 | 0x24 | 0x..ffdc | File name too long |
ENOLCK | 37 | 0x25 | 0x..ffdb | No record locks available |
ENOSYS | 38 | 0x26 | 0x..ffda | Function not implemented |
ENOTEMPTY | 39 | 0x27 | 0x..ffd9 | Directory not empty |
ELOOP | 40 | 0x28 | 0x..ffd8 | Too many symbolic links encountered |
EWOULDBLOCK | EAGAIN | |||
ENOMSG | 42 | 0x2a | 0x..ffd6 | No message of desired type |
EIDRM | 43 | 0x2b | 0x..ffd5 | Identifier removed |
ECHRNG | 44 | 0x2c | 0x..ffd4 | Channel number out of range |
EL2NSYNC | 45 | 0x2d | 0x..ffd3 | Level 2 not synchronized |
EL3HLT | 46 | 0x2e | 0x..ffd2 | Level 3 halted |
EL3RST | 47 | 0x2f | 0x..ffd1 | Level 3 reset |
ELNRNG | 48 | 0x30 | 0x..ffd0 | Link number out of range |
EUNATCH | 49 | 0x31 | 0x..ffcf | Protocol driver not attached |
ENOCSI | 50 | 0x32 | 0x..ffce | No CSI structure available |
EL2HLT | 51 | 0x33 | 0x..ffcd | Level 2 halted |
EBADE | 52 | 0x34 | 0x..ffcc | Invalid exchange |
EBADR | 53 | 0x35 | 0x..ffcb | Invalid request descriptor |
EXFULL | 54 | 0x36 | 0x..ffca | Exchange full |
ENOANO | 55 | 0x37 | 0x..ffc9 | No anode |
EBADRQC | 56 | 0x38 | 0x..ffc8 | Invalid request code |
EBADSLT | 57 | 0x39 | 0x..ffc7 | Invalid slot |
EDEADLOCK | EDEADLK | |||
EBFONT | 59 | 0x3b | 0x..ffc5 | Bad font file format |
ENOSTR | 60 | 0x3c | 0x..ffc4 | Device not a stream |
ENODATA | 61 | 0x3d | 0x..ffc3 | No data available |
ETIME | 62 | 0x3e | 0x..ffc2 | Timer expired |
ENOSR | 63 | 0x3f | 0x..ffc1 | Out of streams resources |
ENONET | 64 | 0x40 | 0x..ffc0 | Machine is not on the network |
ENOPKG | 65 | 0x41 | 0x..ffbf | Package not installed |
EREMOTE | 66 | 0x42 | 0x..ffbe | Object is remote |
ENOLINK | 67 | 0x43 | 0x..ffbd | Link has been severed |
EADV | 68 | 0x44 | 0x..ffbc | Advertise error |
ESRMNT | 69 | 0x45 | 0x..ffbb | Srmount error |
ECOMM | 70 | 0x46 | 0x..ffba | Communication error on send |
EPROTO | 71 | 0x47 | 0x..ffb9 | Protocol error |
EMULTIHOP | 72 | 0x48 | 0x..ffb8 | Multihop attempted |
EDOTDOT | 73 | 0x49 | 0x..ffb7 | RFS specific error |
EBADMSG | 74 | 0x4a | 0x..ffb6 | Not a data message |
EOVERFLOW | 75 | 0x4b | 0x..ffb5 | Value too large for defined data type |
ENOTUNIQ | 76 | 0x4c | 0x..ffb4 | Name not unique on network |
EBADFD | 77 | 0x4d | 0x..ffb3 | File descriptor in bad state |
EREMCHG | 78 | 0x4e | 0x..ffb2 | Remote address changed |
ELIBACC | 79 | 0x4f | 0x..ffb1 | Can not access a needed shared library |
ELIBBAD | 80 | 0x50 | 0x..ffb0 | Accessing a corrupted shared library |
ELIBSCN | 81 | 0x51 | 0x..ffaf | .lib section in a.out corrupted |
ELIBMAX | 82 | 0x52 | 0x..ffae | Attempting to link in too many shared libraries |
ELIBEXEC | 83 | 0x53 | 0x..ffad | Cannot exec a shared library directly |
EILSEQ | 84 | 0x54 | 0x..ffac | Illegal byte sequence |
ERESTART | 85 | 0x55 | 0x..ffab | Interrupted system call should be restarted |
ESTRPIPE | 86 | 0x56 | 0x..ffaa | Streams pipe error |
EUSERS | 87 | 0x57 | 0x..ffa9 | Too many users |
ENOTSOCK | 88 | 0x58 | 0x..ffa8 | Socket operation on non-socket |
EDESTADDRREQ | 89 | 0x59 | 0x..ffa7 | Destination address required |
EMSGSIZE | 90 | 0x5a | 0x..ffa6 | Message too long |
EPROTOTYPE | 91 | 0x5b | 0x..ffa5 | Protocol wrong type for socket |
ENOPROTOOPT | 92 | 0x5c | 0x..ffa4 | Protocol not available |
EPROTONOSUPPORT | 93 | 0x5d | 0x..ffa3 | Protocol not supported |
ESOCKTNOSUPPORT | 94 | 0x5e | 0x..ffa2 | Socket type not supported |
EOPNOTSUPP | 95 | 0x5f | 0x..ffa1 | Operation not supported on transport endpoint |
EPFNOSUPPORT | 96 | 0x60 | 0x..ffa0 | Protocol family not supported |
EAFNOSUPPORT | 97 | 0x61 | 0x..ff9f | Address family not supported by protocol |
EADDRINUSE | 98 | 0x62 | 0x..ff9e | Address already in use |
EADDRNOTAVAIL | 99 | 0x63 | 0x..ff9d | Cannot assign requested address |
ENETDOWN | 100 | 0x64 | 0x..ff9c | Network is down |
ENETUNREACH | 101 | 0x65 | 0x..ff9b | Network is unreachable |
ENETRESET | 102 | 0x66 | 0x..ff9a | Network dropped connection because of reset |
ECONNABORTED | 103 | 0x67 | 0x..ff99 | Software caused connection abort |
ECONNRESET | 104 | 0x68 | 0x..ff98 | Connection reset by peer |
ENOBUFS | 105 | 0x69 | 0x..ff97 | No buffer space available |
EISCONN | 106 | 0x6a | 0x..ff96 | Transport endpoint is already connected |
ENOTCONN | 107 | 0x6b | 0x..ff95 | Transport endpoint is not connected |
ESHUTDOWN | 108 | 0x6c | 0x..ff94 | Cannot send after transport endpoint shutdown |
ETOOMANYREFS | 109 | 0x6d | 0x..ff93 | Too many references: cannot splice |
ETIMEDOUT | 110 | 0x6e | 0x..ff92 | Connection timed out |
ECONNREFUSED | 111 | 0x6f | 0x..ff91 | Connection refused |
EHOSTDOWN | 112 | 0x70 | 0x..ff90 | Host is down |
EHOSTUNREACH | 113 | 0x71 | 0x..ff8f | No route to host |
EALREADY | 114 | 0x72 | 0x..ff8e | Operation already in progress |
EINPROGRESS | 115 | 0x73 | 0x..ff8d | Operation now in progress |
ESTALE | 116 | 0x74 | 0x..ff8c | Stale NFS file handle |
EUCLEAN | 117 | 0x75 | 0x..ff8b | Structure needs cleaning |
ENOTNAM | 118 | 0x76 | 0x..ff8a | Not a XENIX named type file |
ENAVAIL | 119 | 0x77 | 0x..ff89 | No XENIX semaphores available |
EISNAM | 120 | 0x78 | 0x..ff88 | Is a named type file |
EREMOTEIO | 121 | 0x79 | 0x..ff87 | Remote I/O error |
EDQUOT | 122 | 0x7a | 0x..ff86 | Quota exceeded |
ENOMEDIUM | 123 | 0x7b | 0x..ff85 | No medium found |
EMEDIUMTYPE | 124 | 0x7c | 0x..ff84 | Wrong medium type |
ECANCELED | 125 | 0x7d | 0x..ff83 | Operation Canceled |
ENOKEY | 126 | 0x7e | 0x..ff82 | Required key not available |
EKEYEXPIRED | 127 | 0x7f | 0x..ff81 | Key has expired |
EKEYREVOKED | 128 | 0x80 | 0x..ff80 | Key has been revoked |
EKEYREJECTED | 129 | 0x81 | 0x..ff7f | Key was rejected by service |
EOWNERDEAD | 130 | 0x82 | 0x..ff7e | Owner died |
ENOTRECOVERABLE | 131 | 0x83 | 0x..ff7d | State not recoverable |
ERFKILL | 132 | 0x84 | 0x..ff7c | Operation not possible due to RF-kill |
EHWPOISON | 133 | 0x85 | 0x..ff7b | Memory page has hardware error |
ERESTARTSYS | 512 | 0x200 | 0x..fe00 | |
ERESTARTNOINTR | 513 | 0x201 | 0x..fdff | |
ERESTARTNOHAND | 514 | 0x202 | 0x..fdfe | restart if no handler.. |
ENOIOCTLCMD | 515 | 0x203 | 0x..fdfd | No ioctl command |
ERESTART_RESTARTBLOCK | 516 | 0x204 | 0x..fdfc | restart by calling sys_restart_syscall |
EBADHANDLE | 521 | 0x209 | 0x..fdf7 | Illegal NFS file handle |
ENOTSYNC | 522 | 0x20a | 0x..fdf6 | Update synchronization mismatch |
EBADCOOKIE | 523 | 0x20b | 0x..fdf5 | Cookie is stale |
ENOTSUPP | 524 | 0x20c | 0x..fdf4 | Operation is not supported |
ETOOSMALL | 525 | 0x20d | 0x..fdf3 | Buffer or request is too small |
ESERVERFAULT | 526 | 0x20e | 0x..fdf2 | An untranslatable error occurred |
EBADTYPE | 527 | 0x20f | 0x..fdf1 | Type not supported by server |
EJUKEBOX | 528 | 0x210 | 0x..fdf0 | Request initiated, but will not complete before timeout |
EIOCBQUEUED | 529 | 0x211 | 0x..fdef | iocb queued, will get completion event |
EIOCBRETRY | 530 | 0x212 | 0x..fdee | iocb queued, will trigger a retry |
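If you only need the number-to-description part of the table above (the symbolic names really do live only in the headers), a small sketch like this can regenerate it from strerror():
#include <stdio.h>
#include <string.h>

int main(void)
{
	/* the classic range; unused numbers (41, 58) print "Unknown error" */
	for (int i = 1; i <= 133; i++)
		printf("%3d  0x%02x  0x..%04x  %s\n",
		       i, i, (unsigned int)(-i) & 0xffff, strerror(i));
	return 0;
}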
It happened to me that space on my root partition was running low, no matter what I removed. I have snapper hourly snapshots enabled and have tuned the daily and weekly settings, so the usual problem of running out of space is prevented. Also, before doing a big update of Tumbleweed (the openSUSE rolling distro) I check that the available space roughly matches what zypper says it’d need.
$ snapper ls
Type | # | Pre # | Date | User | Cleanup | Description | Userdata
-------+-----+-------+----------------------------------+------+----------+--------------+--------------
single | 0 | | | root | | current |
single | 395 | | Mon 26 Sep 2016 10:30:01 PM CEST | root | timeline | timeline |
So what happened? I was removing some old kernels, manually by rpm. And it started to report write errors, no space. Indeed, df reported just 64K available. This is within the margin of accuracy that we’re able to get from the statfs syscall. Side note, getting this right is a lot of guesswork, so 64K is as good as it gets.
$ df -h /
Filesystem Size Used Avail Use% Mounted on
/dev/sda1 30G 30G 64K 86% /
RPM continued to spew lots of errors, I removed the rest of the snapper snapshots, no improvement. So then I resorted to the bare ‘btrfs’ command and found a stale snapper snapshot. The snapshot directory itself was there, but the metadata (an xml file) were missing. After deleting it manually, the free space jumped up by 6GB!
# btrfs subvol list /
...
... .snapshots/3721/snapshot
...
The theory is that I had once removed all snapshots so the numbering started from 1 again, as the current series is around ~390, while the number of the stale snapshot is much higher.
Takeaways:
Summary of used tools:
Update
Again. There were 2 stale snapper snapshots, about 8 months old by the time they were discovered. The free space went up by 12GiB after deletion. The root partition is 50GiB; I think it survived pretty long given the frequent updates of the rolling distro. There’s an open issue.