index-long.html
- 2023-08-24 - Interactive sleep [tags: shell, sleep]
- q – stop waiting, with exit code 0 so the next command could continue (otherwise Ctrl-C would abort the whole chain)
- + – increase waiting time by 10 seconds and print it
- - – decrease waiting time by 10 seconds and print it
- anything else prints the remaining time (and slightly increases the waiting time due to processing)
- 2023-08-16 - VIM tip: split file to 2 panes, lines continued [tags: usability, vim]
- resize - the line offsets are fixed from the time the window is split, resizing terminal window will get that out of sync
- randomly not refreshed - sometimes the right pane does not show the right lines, but Ctrl-L fixes that, so it’s just a visual glitch
- unsynced text during and after typing - when writing new lines to the left pane, the right pane shifts only the line numbers, not the text itself, ESC and Ctrl-L does not fix it, only scrolling gets it back in sync with visual appearance
- wrap - with set wrap on and some lines actually wrapped in the left pane, the line offset is derived from that, and once the wrapped line(s) scroll away from the screen the lines and text get out of sync
- switching files in the right pane will inherit the scrollbind, so movement will also apply to the left window (not desired e.g. when using tags/cscope/…)
- 2023-08-09 - JSON format dumper (a C API) [tags: c, json]
- produce valid json (validators exist)
- allow nesting of compound structures
- hashes ({key=value})
- arrays ([value, value, value])
- simple values (integer, string, uuid, time, …)
- track depth of structured types
- print delayed "," when there's a next object in the sequence
- quoting or escaping
- start/end the whole json document: fmt_start, fmt_end
- start/end an optionally named group: fmt_print_start_group/fmt_print_end_group, when the name is defined it means it's a map/hash ("key": "value"), otherwise a plain array made of "[ value, ... ]"
- for list values there's fmt_start_value/fmt_end_value so there can be not just a simple value but also a compound one
- the remaining case is a simple value, printed by fmt_print
- https://github.com/kdave/btrfs-progs/blob/master/common/format-output.h
- https://github.com/kdave/btrfs-progs/blob/master/common/format-output.c
- https://github.com/kdave/btrfs-progs/blob/master/tests/json-formatter-test.c
- https://github.com/wireshark/wireshark/blob/master/wsutil/json_dumper.c
- 2022-01-12 - CUDA not initialized after resume (unknown error) [tags: cuda, linux, nvidia, resume]
- 2021-05-24 - Authenticated hashes for btrfs (part 1) [tags: blake2, btrfs, kernel, sha256, xxhash]
- primary (key available) hash matches, secondary does not – as the authenticated hash is hard to forge, it’s trusted (even if it’s not full length of the digest)
- primary (key available) does not match, secondary does not – checksum mismatch for the same reason as above
- primary (key not available) does not match, secondary does – this is the prime time for the secondary hash, the floor is yours
- [1] https://lore.kernel.org/linux-fsdevel/20200428105859.4719-1-jth@kernel.org/
- [2] https://lwn.net/Articles/819143/ LWN discussion under Authenticated Btrfs (2020)
- 2020-02-15 - Btrfs hilights in 5.4: tree checker updates [tags: btrfs, kernel]
- key.objectid must match the tree that's being read, though the code verifies it only if the type is not 0
- key.offset must be 0
- the block offset bytenr must be aligned to the sector size (4KiB in this case)
- itemsize depends on the item type, but for the root item it's a fixed value
- level and drop_level is 0 to 7, but it's not possible to cross check if the tree really is of that level
- generation must be lower than the super block generation, same for lastsnap
- flags can be simply compared to the bit mask of allowed flags, right now there are two, one represents a read-only subvolume and another a subvolume that has been marked as deleted but its blocks not yet cleaned
- 2020-02-10 - Btrfs hilights in 5.5: new hashes [tags: blake2, btrfs, kernel, sha256, xxhash]
- 2020-02-02 - Btrfs hilights in 5.5: 3-copy and 4-copy block groups [tags: btrfs, kernel]
- 2020-01-21 - BLAKE3 vs BLAKE2 for BTRFS [tags: blake2, blake3, btrfs]
- Block size: 4KiB
- Iterations: 10000000
- Digest size: 256 bits (32 bytes)
- new hash for BTRFS selection, same testing box for the measurements
- https://github.com/BLAKE3-team/BLAKE3 – top commit 02250a7b7c80ded, 2020-01-13, upstream version 0.1.1
- DJB’s hash menu and per-machine results with numbers
- 2019-12-11 - Btrfs hilights in 5.3 [tags: btrfs, kernel]
- 2019-10-12 - Simple asm diff filter [tags: asm, diff, perl]
- 2019-10-08 - Selecting the next checksum for btrfs [tags: blake2, btrfs, kernel]
- implementation available under a GPL2 compatible license
- available in the linux kernel, either as a library function or as a module
- license that allows use in external tools like bootloaders (namely grub2)
- maximum digest width is 32 bytes
- blocks of size from 4KiB up to 64KiB
- standardized by FIPS
- xxhash
- XXH3
- SipHash
- CRC64
- SHA256
- SHA3
- BLAKE2
- fast hash: xxhash
- strong hash: BLAKE2 and SHA256
- NULL-NOP – stub to measure overhead of the framework
- NULL-MEMCPY – simple memcpy of the input buffer
- CRC64 – linux kernel lib/crc64.c
- CRC32C – hw assisted crc32c, (linux)
- CRC32C-SW – software implementation, table-based (linux)
- XXHASH – reference implementation
- XXH3 – reference implementation
- BLAKE2b-256-arch – 64bit optimized reference version (for SSE2/SSSE3/SSE41/AVX/AVX2)
- BLAKE2b-256 – 64bit reference implementation
- BLAKE2s-arch – 32bit optimized reference version
- BLAKE2s – 32bit reference implementation
- SHA256 – RFC 6234 reference implementation
- SHA3-256 – C, based on canonical implementation
- 2019-07-25 - Btrfs hilights in 5.2 [tags: btrfs, kernel]
- read-time – the first time a block is read fresh from disk
- write-time – when the final contents of a block is in memory and going to be written to disk
- device item (item that describes the physical device of the filesystem) – check that items have the right key type, device id is within allowed range and that usage and total sizes are within limits of the physical device
- inode item (item for inodes, ie. files, directories or special files) – the objectid (that would be the inode number) is in the range, generation is consistent with the global one, and the basic sanity of mode, flags/attributes and link count
- block group profiles – check that only a single type is set in the bit mask that represents block group profile type of a chunk (ie. single/dup/raid1/…)
- 2019-06-12 - Here, have a pyramid made of BLAKE2 hashes [tags: blake2, fun]
- 2019-04-27 - Linux crypto: testing blake2s [tags: blake2, crypto, linux]
- ref/blake2s-ref.c
- ref/blake2.h
- ref/blake2-impl.h
- update the coding style of the blake2 sources
- add Kconfig
- write self-tests
- optionally add the optimized implementations
- Makefile: out-of-tree build
- blake2.h: copied and updated for linux
- blake2-impl.h: copied and updated for linux
- blake2s.c: copied and updated for linux
- https://blake2.net
- https://github.com/BLAKE2/BLAKE2:
- http://www.chronox.de/libkcapi.html
- https://www.kernel.org/doc/html/latest/crypto/userspace-if.html
- https://www.kernel.org/doc/html/latest/crypto/intro.html
- 2018-10-18 - Some btrfs benchmarks of changes pending for 4.20 [tags: benchmark, btrfs, kernel]
- base/misc1: v4.19-rc8
- misc2: v4.19-rc8 + btrfs pull request
- misc3: v4.19 with btrfs changes merged
- blogbench 1, 2
- blogbench 1, 2, 3
- dbench4-async 1, 2
- dbench4-async 1, 2, 3
- good: slight improvement in 128 thread latency
- good: decreased system cpu time
- bad: longer overall runtime
- dbench4-fsync 1, 2
- dbench4-fsync 1, 2, 3
- good: slight improvement in 128 thread latency
- good: decreased system cpu time
- filebench-webproxy 1, 2
- filebench-webproxy 1, 2, 3
- fio-async-randrw 1, 2
- fio-async-randrw 1, 2, 3
- fio-async-seqw 1, 2
- fio-async-seqw 1, 2, 3
- fsmark-metadata 1, 2
- fsmark-metadata 1, 2, 3
- good: overall improvements in transaction counts for 3
- HP ProLiant DL380 G6
- 16 CPUs (Intel(R) Xeon(R), E5520 2.27GHz)
- 16 GiB RAM
- 4x SAS 136 GiB disks, partitioned
- filesystem:
- btrfs
- mkfs: data: raid0, metadata: raid1
- mount: noatime
- 2018-06-08 - qemu 2.12 does not support if=scsi [tags: qemu]
- 2017-12-06 - gitstats: btrfs-progs for v4.14 [tags: btrfs-progs, draft, git, stats]
- 2017-11-23 - errno.h for the masses [tags: debugging, errno]
- 2016-11-05 - Escaped snapper snapshots [tags: btrfs, snapper]
- synthetic tools may not give us an accurate view: snapper requires some metadata describing its snapshots, which is obviously OK, given that the user has the freedom to create snapshots anywhere
- learn how to use the low-level tools to cross-verify the state of the system
- snapper ls – show all snapper snapshots (for the first config)
- snapper delete – delete snapshots by id or range
- btrfs subvolume list – list all subvolumes (snapshots included) on a given path
- btrfs subvolume delete – delete a subvolume/snapshot by path
The sleep utility is something I'd hardly consider worth enhancing with features; the task is simply to sleep for a given time. Nothing else is needed when it's used from scripts. However, it can also be used from a manually typed command line, e.g. to let some command start later to offset initial IO load or not to overload the network.
Let’s say for example that we’d like to download a big data set, but rather let it start in 20 minutes.
sleep 20m && wget https://URL
Once this command starts we don't know how much time is remaining, and interrupting it by Ctrl-C would not start the download. Additionally, what if the time is too short and we'd like to increase it? So let's implement it. For more compatibility it's done in shell and using only widely available utilities (date). And btw, it does not even use sleep for the sleeping but the builtin read with the timeout option, as it allows reading input while waiting.
sleepy, the shell script:

#!/bin/sh

t=${1:-1}
ts="$t"

case "$t" in
        *s) tm=${t%*s}; t=$(($tm)) ;;
        *m) tm=${t%*m}; t=$(($tm * 60)) ;;
        *h) tm=${t%*h}; t=$(($tm * 60 * 60)) ;;
        *d) tm=${t%*d}; t=$(($tm * 60 * 60 * 24)) ;;
        *[0-9]) : ;;
        *) echo "ERROR: unknown timestamp"; exit 1 ;;
esac

while :; do
        [ "$t" -le 0 ] && break
        read -n 1 -t 1 ch
        ret=$?
        if [ $ret = 142 ]; then
                # timeout
                :
        elif [ $ret = 0 ]; then
                # keypress
                echo "sleep for $ts, remaining" `TZ=utc date --date=@$t +%H:%M:%S` "($t)"
                if [ "$ch" = 'q' ]; then
                        echo "quit"
                        exit 0
                elif [ "$ch" = '+' ]; then
                        echo "increase by 10s"
                        t=$(($t + 10 - 1))
                elif [ "$ch" = '-' ]; then
                        echo "decrease by 10s"
                        t=$(($t - 10 - 1))
                fi
                t=$(($t + 1))
        fi
        t=$(($t - 1))
done
The time passed on the command line needs to be parsed and fractional times are not supported, otherwise it's straightforward. The interesting part is that it reacts to key presses: q stops the waiting (with exit code 0 so the next command continues), + and - increase or decrease the waiting time by 10 seconds, and anything else prints the remaining time.
Vim has the excellent feature to split the window to multiple panes, but can it open one file in two vertical tabs and let the contents of left pane continue in the one next to it?
From
1 ... |
2 ... |
3 ... |
4 ... |
5 ... |
6 ... |
to
1 ... | 7 ... |
2 ... | 8 ... |
3 ... | 9 ... |
4 ... | 10 ... |
5 ... | 11 ... |
6 ... | 12 ... |
Not by default, but certainly with some scripting or by a set of a few commands entered manually. We also want to preserve the line continuation while scrolling. Maybe all of that is implemented somewhere for vim, I don't know. The first time I saw this was many years ago in a computer lab, somebody using emacs. What I'm going to show dates back to 2011; it works, and I think there must be an easier way or some vim fork that can do it natively.
The idea is to split the panel, move the lines by the screen height and enable scrollbind.
function SplitAndJoin()
:vsplit
:wincmd l
:normal L
:normal z+
:wincmd h
:windo set scrollbind
:wincmd h
endfunction
Moving the lines is really emulated by navigating to the bottom and resetting the position so the next line is the first one displayed. This is obviously not resistant to resizing the terminal, as the line offset between the windows is fixed. This can be manually adjusted by disabling scrollbind, syncing the lines and enabling scrollbind again. I never needed that urgently enough to implement it; reopening the buffer is faster.
The function can be invoked from command line mode by
:call SplitAndJoin()
… or 3 panes, lines continued
If we can do 2 panes by a few simple commands, we can do more, any number of splits that still fit the screen. This repeats the split and bind step from the last opened pane before moving back to the first one. The scrollbind is temporarily disabled when moving the lines for the split.
function SplitAndJoin3()
:vsplit
:wincmd l
:normal L
:windo set noscrollbind
:normal z+
:wincmd h
:windo set scrollbind
:vsplit
:wincmd l
:normal L
:windo set noscrollbind
:normal z+
:wincmd h
:windo set scrollbind
:wincmd h
:wincmd h
endfunction
Which can look like this:
Problems
That’s not a short list, might be a bug somewhere like a missing refresh
after an action or it’s working as expected individually but not when
combined like that. A native support in vim
is needed. But, it is still
usable to some extent.
We’ve had users asking for a machine readable output of btrfs-progs commands,
suggesting to add an alternative way to dump the data in a structured format
e.g. json. And nowadays this is very common in many if not all commands
provided by util-linux
. Try lscpu --json
, or lsblk --json
. Feeding such
output to python or a shell-friendly json parser is going to make life easier.
So that’s the background motivation.
Adding that to btrfs-progs is a task for long winter evenings, but the first step has been made: I wrote yet another json dumper. Of course this was preceded by research about the state of existing libraries or files to copy, but why not have some fun writing one nevertheless.
The requirements: produce valid json (validators exist), allow nesting of compound structures, and support hashes ({key=value}), arrays ([value, value, value]) and simple values (integer, string, uuid, time, …).
The nesting was one of the things I have not found in the existing libraries, or not implemented properly. The output must be valid json, and the single most difficult thing was the placement of the continuation comma ",". There must be none after the last object in a sequence. When the dump is done incrementally, the next object is often not known while printing the current one, so the comma cannot be printed unconditionally and the previous state needs to be stored until the last object is dumped.
But, it’s not sufficient to keep track of the last object only, simply nest some hashes and arrays. Like this:
1: {
2: "first" : {
3: "nested" : {
4: "key" : [
5: "value",
6: "value2"
7: ]
8: }
9: },
10: "last" : null
11: }
Last objects in sequences are on lines 6, 7, 8 and 10, so this needs a stack for tracking that. The main building blocks for correct syntax are: tracking the depth of the structured types, printing the delayed "," when there's a next object in the sequence, and quoting or escaping.
The rest is for readable output, indentation is simply derived from the nesting level.
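To illustrate the bookkeeping, here is a minimal sketch in plain C (not the btrfs-progs code, names are made up) of how a per-level "already printed an element" flag can drive the delayed comma and the indentation:

#include <stdio.h>
#include <stdbool.h>

#define MAX_DEPTH 32

static int depth;
static bool has_element[MAX_DEPTH];

/* Called before printing any element (a value or a nested group): emit the
 * delayed "," only when the current level already printed something. */
static void separator(void)
{
	if (has_element[depth])
		printf(",");
	printf("\n%*s", 2 * depth, "");	/* indentation derived from the depth */
	has_element[depth] = true;
}

static void group_open(char bracket)
{
	separator();
	putchar(bracket);
	depth++;
	has_element[depth] = false;
}

static void group_close(char bracket)
{
	depth--;
	printf("\n%*s%c", 2 * depth, "", bracket);
}

int main(void)
{
	group_open('[');
	separator(); printf("\"a\"");
	group_open('[');
	separator(); printf("\"b\"");
	separator(); printf("\"c\"");
	group_close(']');
	separator(); printf("\"d\"");
	group_close(']');
	printf("\n");
	return 0;
}

The same mechanism extends to maps, where the key is printed as part of the element; the real formatter additionally uses the depth for the sanity checks mentioned below.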
The code representation of the json object should be straightforward, basically translating the objects into API calls, for compound object there’s the start and end element. Element nesting must be proper. On the lower level, there should be some nice specification of the key/value formatting.
I’ve ended up with a static declaration of the value formatting specifiers in a list, the actual structure of the json objects will be defined by API calls referring back to the spec structure.
struct rowspec {
const char *key;
const char *fmt;
const char *out_text;
const char *out_json;
};
The key
refers to the key (name), the rest of the structure is relevant for
the value formatting. The out_text
can be ignored for now, it’s a
generalization of the output formatting so the same code is used for plain text
and json output, the switch is done from the outside (e.g. by a command line
option).
Basic value types are a string and a number. For the purposes of btrfs-progs there are more, so far built into the formatter, but it's extensible (on the source level, not dynamically). Additional custom types are UUID, qgroupid (a u64 split into 16 and 48 bits), size (human friendly formatting of byte suffix values), and time. Another example of customization is that there are actually two byte size formatters, one of which prints "-" instead of a zero value.
The structural API is simple: fmt_start/fmt_end for the whole document, fmt_print_start_group/fmt_print_end_group for optionally named groups (maps or arrays), fmt_start_value/fmt_end_value for compound list values, and fmt_print for simple values.
Most of them take a reference to the rowspec; this is convenient because at the time we'd like to print a value we refer to it by its name and the rest is taken from the table. All the API calls take a context structure that stores the nesting level (for indentation and sanity checks) and some internal data.
Now the above example would not be the best one to demonstrate the API but let’s try:
static const struct rowspec rows[] = {
{ .key = "first", .fmt = "map", .out_json = "first" },
{ .key = "nested", .fmt = "map", .out_json = "nested" },
{ .key = "key", .fmt = "list", .out_json = "key" },
{ .key = "value", .fmt = "str", .out_json = NULL },
};
struct format_ctx ctx;
fmt_start(&ctx, rows, 0, 0);
fmt_print_start_group(&ctx, "first", JSON_TYPE_MAP);
fmt_print_start_group(&ctx, "nested", JSON_TYPE_MAP);
fmt_print_start_group(&ctx, "key", JSON_TYPE_ARRAY);
fmt_print(&ctx, "value", "value");
fmt_print(&ctx, "value", "value2");
fmt_print_end_group(&ctx, NULL);
fmt_print_end_group(&ctx, NULL);
fmt_print_end_group(&ctx, NULL);
fmt_print_start_group(&ctx, "last", JSON_TYPE_MAP);
fmt_print_end_group(&ctx, NULL);
fmt_end(&ctx);
That should demonstrate it. There might be some slight tweaks needed, as I took it from a specialized use but would like to present it as a generic API. So far it's not generic, though it might become one. With the sources available it's open for further reuse; I don't plan to make it a standalone library, but if there's interest then, well, why not.
Examples
The simple values can be formatted in multiple ways, e.g. size values with suffixes or raw numbers, though it’s defined as multiple keys:
{ .key = "size", .fmt = "%llu", .out_json = "size" },
{ .key = "size-si", .fmt = "size", .out_json = "size" },
{ .key = "size-si-not-zero", .fmt = "size-or-none", .out_json = "size" },
Raw formats starting with “%
” are passed to printf
, this lets it do all the
heavy work. The special formatters may need additional parameters, in this case
it’s the pretty printer mode for size.
if (mode == RAW)
fmt_print(&ctx, "size", value);
else if (mode == SI)
fmt_print(&ctx, "size-si", value, mode);
else if (mode == SI_NO_ZERO)
fmt_print(&ctx, "size-si-not-zero", value, mode);
Here the size-or-none
format prints “-
” instead of 0
. As can be seen here,
not all the row specifiers need to be used at once.
References
Source links point to the most recent release so there might be an update since the time I write this but I’m not expecting significant changes.
tl;dr
CUDA returning an unknown error upon initialization after resume on linux can be solved by reloading the kernel module nvidia_uvm.
There are some problems reported where CUDA returns an unknown error upon initialization, with no clear cause nor a clear way to fix it. I've found https://victorwyee.com/engineering/failed-call-to-cuinit-cuda-error-unknown/ listing various solutions, from rebooting to tweaking environment variables.
I’ve hit another problem that does not seem to be explicitly mentioned as the cause or it could have been misdiagnosed. After switching the built in Intel graphic card to be primary and NVIDIA card only for computations, it was not possible to run any application using CUDA. All it reported back were unknown errors.
The Leela Chess engine lc0 reported CUDA version 0.0.0 with a message like:
$ ./lc0
       v0.28.2 built Dec 13 2021
ucinewgame
Found pb network file: ./192x15-2021_1212_2050_25_709.pb.gz
Creating backend [cuda-auto]...
Switching to [cuda]...
CUDA Runtime version: 0.0.0
WARNING: CUDA Runtime version mismatch, was compiled with version 11.4.0
Latest version of CUDA supported by the driver: 11.4.0
error CUDA error: unknown error (../../src/neural/cuda/network_cuda.cc:134)
The line in source file is simply reporting errors:
134 ReportCUDAErrors(cudaGetDeviceCount(&total_gpus));
This is the first place where the errors are checked; a quick trivial CUDA program fails on the first memory allocation with the same unknown error.
I guess this stresses the importance of error checking, namely when some piece of hardware is involved. Though this fails on the first attempt to use the API, there are possibly more cases during a long computation and many tasks using the GPU in parallel.
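For illustration, even a trivial check of the very first CUDA runtime calls would catch this; a minimal sketch (built with nvcc, not taken from lc0):

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
	int count = 0;
	cudaError_t err = cudaGetDeviceCount(&count);

	if (err != cudaSuccess) {
		/* After a broken resume this is where "unknown error" shows up */
		fprintf(stderr, "cudaGetDeviceCount: %s\n", cudaGetErrorString(err));
		return 1;
	}

	void *buf = NULL;
	err = cudaMalloc(&buf, 4096);
	if (err != cudaSuccess) {
		fprintf(stderr, "cudaMalloc: %s\n", cudaGetErrorString(err));
		return 1;
	}
	cudaFree(buf);
	printf("CUDA OK, %d device(s)\n", count);
	return 0;
}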
A reboot fixed the problem, until the next suspend, which is impractical, and setting up the work environment again gets annoying. I have tried to look for some utility/command that is supposed to be run after resume. There's a script called nvidia-sleep.sh that is the callback script for suspend and resume. Running that with the parameter suspend manually while the system was up was not a good idea, as it switched me out of the graphical desktop to some virtual terminal without any way to undo that. Also running it with resume did not help either (and does not even make sense after the first proper resume).
There are no obvious signs that something could be wrong after resume, nothing in the logs, and running games utilizing acceleration works (judging by FPS). One piece of advice recommends running nvidia-modprobe, in case some of the 4 modules (nvidia_uvm, nvidia_drm, nvidia_modeset, nvidia) is not loaded. Besides the UVM module, all are in use.
Right after resume the modules look like this:
nvidia_drm 69632 2
nvidia_modeset 1204224 3 nvidia_drm
nvidia_uvm 2531328 0
nvidia 35373056 90 nvidia_uvm,nvidia_modeset
That there’s 0 (usage by other modules) brought some suspicion, because
according to nvidia-smi
, there’s a least Xorg keeping a few megabytes of
allocated memory (which may or may not be related though).
I’ve reloaded the nvidia_uvm module, because it was the only thing possible to do with the modules:
# modprobe -r nvidia_uvm
# modprobe nvidia_uvm
And it started to work; a quick check with lc0 shows it works again. Now the module list is:
nvidia_uvm 2531328 2
nvidia_drm 69632 2
nvidia_modeset 1204224 3 nvidia_drm
nvidia 35373056 204 nvidia_uvm,nvidia_modeset
While this is still a workaround, it’s much better than rebooting and a post-resume hook should be able to make it work automatically.
The suspend script is in /usr/lib/systemd/system-sleep/
:
#!/bin/sh
case "$1" in
post)
sleep 4
modprobe -r nvidia_uvm
sleep 1
modprobe nvidia_uvm
;;
esac
There was a request to provide authenticated hashes in btrfs, natively as one of the btrfs checksum algorithms. Sounds fun but there’s always more to it, even if this sounds easy to implement.
Johannes T., at that time in SUSE, sent the patchset adding the support for SHA256 [1], together with a Labs conference paper summarizing existing solutions and giving details about the proposed implementation and use cases.
The first version of the patchset got some feedback, issues were found and some ideas suggested. Things have stalled a bit, but the feature is still very interesting and really not hard to implement. The support for additional checksums has provided enough support code to just plug in the new algorithm and enhance the existing interfaces to provide the key bytes. Until now I've assumed you know what an authenticated hash means, but for clarity and in simple terms: a checksum that depends on a key. The main point is that it's impossible to generate the same checksum for given data without knowing the key, where impossible is used in the cryptographic-strength sense: there's an almost zero probability of doing that by chance, and a brute force attack is not practical.
Auth hash, fsverity
A notable existing solution for that is fsverity, which works in a read-only fashion: the key is securely hidden and used only to verify that data read from the media haven't been tampered with. A typical use case is an OS image in your phone. But that's not all: OS images appear in all sorts of boxed devices and IoT. Nowadays, with the explosion of edge computing, assuring integrity of the end devices is a fundamental requirement.
Where btrfs can add some value is read AND write support with an authenticated hash. This brings questions around key handling, and not everybody is OK with a device that could potentially store malicious/invalid data with a proper authenticated checksum. So yeah, use something else, this is not your use case, or maybe there's another way to make sure the key won't be compromised easily. This is beyond the scope of what a filesystem can do, though.
As an example use case of writable filesystem with authenticated hash: detect outside tampering with on-disk data, eg. when the filesystem was unmounted. Filesystem metadata formats are public, interesting data can be located by patterns on the device, so changing a few bytes and updating the checksum(s) is not hard.
There’s one issue that was brought up and I think it’s not hard to observe anyway: there’s a total dependency on the key to verify a basic integrity of the data. Ie. without the key it’s not possible to say if the data are valid as if a basic checksum was used. This might be still useful for a read-only access to the filesystem, but absence of key makes this impossible.
Existing implementations
As was noted in the LWN discussion [2], ZFS does this with two checksums: one is authenticated and one is not. I point you to the comment stating that, as I was not able to navigate far enough in the ZFS code to verify the claim, but the idea is clear. It's said that the authenticated hash is eg. SHA512 and the plain hash is SHA256, split half/half in the bytes available for the checksum. The way the hash is stored is a simple trim of the first 16 bytes of each checksum, stored consecutively. As both hashes are cryptographically strong, the first 16 bytes should provide enough strength despite the truncation. Where 16 bytes is 128 bits.
When I was thinking about that, I had a different idea how to do it. Not that copying the scheme would not work for btrfs; anything that the linux kernel crypto API provides is usable, so the same is achievable. I'm not judging the decisions about what hashes to use or how to do the split, it works and I don't see a problem in the strength. Where I see potential for an improvement is performance, without sacrificing strength too much. Trade-offs.
The CPU or software implementation of SHA256 is considerably slower than checksums with hardware aids (like the CRC32C instructions) or hashes designed to perform well on CPUs. That was the topic of the previous round of new hashes, so it now competes against BLAKE2b and XXHASH. There are CPUs with native instructions to calculate SHA256 and the performance improvement is noticeable, orders of magnitude better, but the support is not as widespread as eg. for CRC32C. Anyway, there's always choice and hardware improves over time. The number of hashes may seem to explode but as long as it's manageable inside the filesystem, we take it. And a coffee please.
Secondary hash
The checksum scheme proposed is to use a cryptographic hash and a non-cryptographic one. Given the current support for SHA256 and BLAKE2b, the cryptographic hash is given. There are two of them and that’s fine. I’m not drawing an exact parallel with ZFS, the common point for the cryptographic hash is that there are limited options and the calculation is expensive by design. This is where the non-cryptographic hash can be debated. Also I want to call it secondary hash, with obvious meaning that it’s not too important by default and comes second when the authenticated hash is available.
We have CRC32C and XXHASH to choose from. Note that there are already two hashes from the start so supporting both secondary hashes would double the number of final combinations. We’ve added XXHASH to enhance the checksum collision space from 32 bits to 64 bits. What I propose is to use just XXHASH as the secondary hash, resulting in two new hashes for the authenticated and secondary hash. I haven’t found a good reason to also include CRC32C.
Another design point was where to do the split and truncation. As the XXHASH has fixed length, this could be defined as 192 bits for the cryptographic hash and 64 bits for full XXHASH.
Here we are: we could have authenticated SHA256 accompanied by XXHASH, or the same with BLAKE2b. The checksum split also splits the decision tree of what to do when the checksum matches only partially. For a single checksum it's a simple yes/no decision. The partial match is the interesting case: when the key is available and the primary hash matches but the secondary does not, the data are trusted because the authenticated hash is hard to forge (even truncated); when the key is available and the primary hash does not match, it's a checksum mismatch for the same reason; and when the key is not available and only the secondary hash matches, that's the prime time for the secondary hash, the floor is yours.
This leads to 4 outcomes of the checksum verification, compared to 2. A boolean type can simply represent the yes/no outcome, but for two hashes it's not that easy. It depends on the context, though I think it should still be straightforward to decide what to do in the code. Nevertheless, this has to be updated in all calls to checksum verification and has to reflect the key availability, eg. in case the data are auto-repaired during scrub or when there's a copy.
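To make the split and the outcomes more concrete, here's a sketch of how the 32 checksum bytes could be laid out and how the verification result could be represented (the names are illustrative, not actual btrfs code):

#include <stdint.h>

struct split_csum {
	uint8_t auth[24];	/* truncated authenticated hash, 192 bits */
	uint8_t fast[8];	/* full XXHASH64 digest, 64 bits */
};

enum csum_verify_result {
	CSUM_MATCH,		/* both hashes match */
	CSUM_AUTH_MATCH,	/* authenticated hash matches, secondary does not: trusted */
	CSUM_MISMATCH,		/* authenticated hash available but does not match */
	CSUM_FAST_MATCH,	/* key not available, only the secondary hash matches */
};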
Performance considerations
The performance comparison should now be clear: we have the potentially slow SHA256 but fast XXHASH for each metadata and data block, vs slow SHA512 and slow SHA256. As far as I can tell it's also possible to select a SHA256/SHA256 split in ZFS, but that can't beat SHA256/XXHASH.
The key availability seems to be the key point in all that, puns notwithstanding. The initial implementation assumed, for simplicity, that the raw key bytes are provided to the kernel and to the userspace utilities. This is maybe OK for a prototype but under no circumstances can it survive until a final release. There's key management wired deep into the linux kernel, there's a library for the whole API and command line tools. We ought to use that: pass the key by name, not the raw bytes.
Key management has it’s own culprits and surprises (key owned vs possessed), but let’s assume that there’s a standardized way how to obtain the key bytes from the key name. In kernel its “READ_USER_KEY_BYTES”, in userspace it’s either keyctl_read from libkeyutils or a raw syscall to keyctl. Problem solved, on the low-level. But, well, don’t try that over ssh.
Accessing a btrfs image for various reasons (check, image, restore) now needs the key to verify data or even the key itself to perform modifications (check + repair). The command line interface has to be extended for all commands that interact with the filesystem offline, ie. the image and not the mounted filesystem.
This results in a global option, like btrfs --auth-key 1234 inspect-internal dump-tree, compared to btrfs inspect-internal dump-tree --auth-key 1234. This is not finalized, but a global option is now the preferred choice.
Final words
I have a prototype that does not work in all cases but at least passes mkfs and mount. The number of checksum verification cases got above what I was able to fix by the time of writing this. I think this has enough matter in itself, so I'm pushing it out as part 1. There are open questions regarding the command line interface and also some kind of proof or discussion regarding attacks. Stay tuned.
References:
A bit more detailed overview of a btrfs update that I find interesting, see the pull request for the rest.
There’s not much to show in this release. Some users find that good too, a boring release. But still there are some changes of interest. The 5.4 is a long-term support stable tree, stability and core improvements are perhaps more appropriate than features that need a release or two to stabilize.
(Which release becomes a long-term stable tree is not known in advance, so half-baked features that could require more intrusive fixups are not pushed anyway.)
The development cycle happened over summer and this slowed down the pace of patch reviews and update turnarounds.
Tree-checker updates
The tree-checker is a sanity checker of metadata that are read from/written to devices. Over time it’s being enhanced by more checks, let’s have a look at two of them.
ROOT_ITEM checks
The item represents the root of a b-tree, either one of the internal trees or a subvolume tree.
Let’s take an example from btrfs inspect dump-tree
:
item 0 key (EXTENT_TREE ROOT_ITEM 0) itemoff 15844 itemsize 439
generation 5 root_dirid 0 bytenr 30523392 level 0 refs 1
lastsnap 0 byte_limit 0 bytes_used 16384 flags 0x0(none)
uuid 00000000-0000-0000-0000-000000000000
drop key (0 UNKNOWN.0 0) level 0
Some of the metadata inside the item allow only simple checks (the key.objectid and key.offset values, alignment of the block offset bytenr, the itemsize, the level and drop_level range, generation and lastsnap against the super block generation, and the allowed flags), following commit 259ee7754b6793.
The refs is a reference counter and a sanity check would require reading all the expected reference holders; bytes_used would need to look up the block that it accounts, etc. The subvolume trees have more data, like ctime, otime and real uuids. If you wonder what byte_limit is, this used to be a mechanism to emulate quotas by setting the limit value, but it has been deprecated and unused for a long time. One day we might find another purpose for the bytes.
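The simple checks mentioned above are mostly alignment and range comparisons; a simplified sketch of the idea (the real checks live in fs/btrfs/tree-checker.c and report detailed errors rather than a plain boolean):

#include <stdbool.h>
#include <stdint.h>

struct root_item_fields {
	uint64_t bytenr;
	uint64_t generation;
	uint64_t lastsnap;
	uint8_t level;
	uint8_t drop_level;
};

bool root_item_sane(const struct root_item_fields *ri,
		    uint32_t sectorsize, uint64_t super_generation)
{
	if (ri->bytenr % sectorsize != 0)		/* block offset must be aligned */
		return false;
	if (ri->level > 7 || ri->drop_level > 7)	/* b-tree levels are 0 to 7 */
		return false;
	if (ri->generation > super_generation)		/* not newer than the super block */
		return false;
	if (ri->lastsnap > super_generation)
		return false;
	return true;
}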
Many of the tree-checker enhancements are follow-ups to fuzz testing and reports, as it was in this case. The bug report shows that some of the incorrect data were detected and even triggered auto-repair (as this was on a filesystem with DUP metadata), but there was too much damage and it crashed at some point. The crash was not random but a BUG_ON of an unexpected condition, which is a sanity check of last resort. Catching inconsistent data early with graceful error handling is of course desired and ongoing work.
Extent metadata item checks
There are two item types that represent extents and information about sharing. EXTENT_ITEM is older and bigger, while METADATA_ITEM is the building block of the skinny-metadata feature, smaller and more compact. Both items contain the type of the reference(s) and the owner (a tree id). Besides the generic checks that the root item also does (alignment, value ranges, generation), there's a number of allowed combinations of the reference types and extent types. The commit f82d1c7ca8ae1bf implements that, however further explanation is out of scope of this overview as the sharing and references are the fundamental design of btrfs.
Example of METADATA_ITEM
:
item 170 key (88145920 METADATA_ITEM 0) itemoff 10640 itemsize 33
refs 1 gen 27 flags TREE_BLOCK
tree block skinny level 0
tree block backref root FS_TREE
And EXTENT_ITEM
:
item 27 key (20967424 EXTENT_ITEM 4096) itemoff 14895 itemsize 53
refs 1 gen 499706 flags DATA
extent data backref root FS_TREE objectid 8626071 offset 0 count 1
This is for a simple case with one reference, a tree block (for metadata) and ordinary data, so comparing the sizes shows 20 bytes saved. On my 20GiB root partition with about 70 snapshots there are XXX EXTENT and YYY METADATA items.
Otherwise there can be more references inside one item (eg. many snapshots of a file that is randomly updated over time), so the overhead of the item itself is smaller.
A bit more detailed overview of a btrfs update that I find interesting, see the pull request for the rest.
More checksum algorithms
The first request for more checksums is several years old (2014), it was SHA256. There was no followup of the patch but the discussion in the mail thread brought the topic into light. Questions about why SHA256 and why not something else (welcome to the zoo: SpookyHash, BLAKE2, CityHash, MurmurHash3, CRC64), preferences, use cases etc. Browsing the mail thread again, I don’t see anybody suggest SHA1 or MD5. There were some open questions and suggestions and though it’s an old discussion, most of the points were valid at the time of pre-5.5 evaluation and served as a checklist. This is what I really like about the community projects and btrfs users in particular. Getting useful feedback on topics out of scope of the filesystem itself.
Fast forward to now, there are 3 new hashes:
Name | Digest size | Notes |
---|---|---|
CRC32C | 32bit | fast, weak, default, CPU support, backward compatible |
XXHASH | 64bit | fast, weak, good on modern CPUs |
SHA256 | 256bit | slow, strong, FIPS certified, CPU support |
BLAKE2B | 256bit | slow, strong, good on 64bit architectures |
You can find more about the selection process in another post.
Quick start
# mkfs.btrfs --csum xxhash /dev/sda
And the output:
btrfs-progs v5.4.1
See http://btrfs.wiki.kernel.org for more information.
Label: (null)
UUID: 9774b466-fa5d-4c2e-8b4f-479d0bb5873d
Node size: 16384
Sector size: 4096
Filesystem size: 10.00GiB
Block group profiles:
Data: single 8.00MiB
Metadata: DUP 256.00MiB
System: DUP 8.00MiB
SSD detected: no
Incompat features: extref, skinny-metadata
Checksum: xxhash64
Number of devices: 1
Devices:
ID SIZE PATH
1 10.00GiB /dev/sda
For convenience, the hash names can be specified by aliases, so both xxhash and xxhash64 are accepted on the command line; similarly for blake2 it's blake2b. This is based on a usability pattern where humans don't have to remember the exact specification, and copy & paste from the output works as well.
Use cases and outlook
For ordinary checksumming, all the hashes are considered good enough. CRC32C is the weakest but it’s been with us since the beginning and has caught many random bit flips.
XXHASH is supposed to add to the strength with a good performance. Increased digest size reduces collisions, but can’t compete with the crypto-strong hashes. That can still be of great help for deduplication.
From my perspective, the number of hashes should stay limited as the support is needed not just in kernel but for other tooling. For btrfs-progs it’s mandatory to support them, but also bootloaders (*cough*in case they verify checksums*cough*). The kernel support requires the hash implementations and the modules become a dependency of btrfs.ko. This can increase overall size and build time.
One of the suggestions was to allow users to choose the hash out of all what linux crypto API offers, not just the fixed set of hashes selected by developers. I understand that from the point of view of users who can make the decisions about trade-offs for their use case this is better, but I argue that such users are not in majority. Satisfying the common case was the preferred option, hopefully providing good options also for the knowledgeable users. Providing and maintaining implementations of seven and one hashes would be a burden for several projects.
Yes, I’m aware of BLAKE3, but let’s wait before adding new hashes. By the time we do another round there might be more contenders or improved versions. There’s XXH3 from the author of XXHASH, producing 128bit digest and is slightly faster. However it is still not recommended for production use due to fine tuning and finalization.
The cryptographic hashes can be a building block for a keyed hash, ie. checksums that can’t be subverted. There’s a patch implementing that, currently in an RFC state. While the code is straightforward, adding a new interface could use a feedback round. Please contact us in case you have comments.
Last but not least, strong hashes are a building block for deduplication. At the moment btrfs supports only the out-of-band way, on demand by calling an ioctl. There are tools utilizing that, eg. BEES or duperemove, calculating the block checksums by themselves. Exporting the raw checksums would help to decrease CPU time and also external tracking of the checksums.
Implementation notes
With the single checksum CRC32C, lots of code had hard coded size (4 bytes/32 bits) or functions and identifiers contained ‘crc32c’. This could not remain once there are more hashes to be added, so a round of cleanups happened in advance in 5.4. I’d call it great preparatory work that paid off, because adding new hashes was a matter of inserting new definition entries to a table. The rest was behind abstraction of btrfs functions that call to the linux crypto API.
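For illustration, such a definition table can look roughly like this (a sketch; the field names and driver strings are illustrative, not the exact kernel code):

#include <stdint.h>

struct csum_def {
	const char *name;	/* user visible name, eg. for mkfs.btrfs --csum */
	uint16_t size;		/* digest size in bytes, at most 32 */
	const char *driver;	/* name of the linux crypto API implementation */
};

static const struct csum_def csums[] = {
	{ .name = "crc32c",   .size = 4,  .driver = "crc32c" },
	{ .name = "xxhash64", .size = 8,  .driver = "xxhash64" },
	{ .name = "sha256",   .size = 32, .driver = "sha256" },
	{ .name = "blake2b",  .size = 32, .driver = "blake2b-256" },
};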
In parallel to the btrfs patches, the BLAKE2B linux implementation was merged, while XXHASH was merged in 5.3 and SHA256 was there forever. All in all, everything was ready in the same target release.
A bit more detailed overview of a btrfs update that I find interesting, see the pull request for the rest.
New block group profiles RAID1C3 and RAID1C4
There are two new block group profiles enhancing the capabilities of the RAID1 type with more copies than 2. A brief overview of the profiles is in the table below; for a table with all profiles see the manual page of mkfs.btrfs, also available on the wiki.
Profile | Copies | Utilization | Min devices |
---|---|---|---|
RAID1 | 2 | 50% | 2 |
RAID1C3 | 3 | 33% | 3 |
RAID1C4 | 4 | 25% | 4 |
The way all the RAID1 types work is that there are 2 / 3 / 4 exact copies over all available devices. The terminology is different from linux MD RAID, which can do any number of copies. We decided not to do that in btrfs to keep the implementation simple. Another point for simplicity is the users' perspective: that RAID1C3 provides 3 copies is clear from the type. Even after adding a new device and not doing a balance, the guarantees about redundancy still hold. Newly written data will use the new device together with 2 devices from the original set.
Compare that with a hypothetical RAID1CN, on a filesystem with M devices (N <= M). When the filesystem starts with 2 devices, equivalent to RAID1, adding a new one will have mixed redundancy guarantees after writing more data. Old data with RAID1, new with RAID1C3 – but all accounted under RAID1CN profile. A full re-balance would be required to make it a reliable 3-copy RAID1. Add another device, going to RAID1C4, same problem with more data to shuffle around.
The allocation policy would depend on number of devices, making it hard for the
user to know the redundancy level. This is already the case for
RAID0/RAID5/RAID6. For the striped profile RAID0 it’s not much of a problem,
there’s no redundancy. For the parity profiles it’s been a known problem and
new balance filter stripe
has been added to support fine grained selection of
block groups.
Speaking about RAID6, there's an elephant in the room: the write hole. The lack of resiliency against damage to 2 devices has been bothering all of us because of the known write hole problem in the RAID6 implementation. How this is going to be addressed is for another post, but for now the newly added RAID1C3 profile is a reasonable substitute for RAID6.
How to use it
On a freshly created filesystem it’s simple:
# mkfs.btrfs -d raid1c3 -m raid1c4 /dev/sd[abcd]
The command combines both new profiles for sake of demonstration, you should always consider the expected use and required guarantees and choose the appropriate profiles.
Changing the profile later on an existing filesystem works as usual, you can use:
# btrfs balance start -mconvert=raid1c3 /mnt/path
Provided there are enough devices and enough space to do the conversion, this will go through all metadata block groups and after it finishes all of them will be of the desired type.
Backward compatibility
The new block groups are not understood by old kernels and can’t be mounted,
not even in the read-only mode. To prevent that a new incompatibility bit is
introduced, called raid1c34
. Its presence on a device can be checked by
btrfs inspect-internal dump-super
in the incompat_flags
. On a running
system the incompat features are exported in sysfs,
/sys/fs/btrfs/UUID/features/raid1c34
.
Outlook
There is no demand for RAID1C5 at the moment (I asked more than once). The space utilization is low already, the RAID1C4 survives 3 dead devices so IMHO this is enough for most users. Extending resilience to more devices should perhaps take a different route.
With more copies there’s potential for parallelization of reads from multiple devices. Up to now this is not optimal, there’s a decision logic that’s semi-random based on process ID of the btrfs worker threads or process submitting the IO. Better load balancing policy is a work in progress and could appear in 5.7 at the earliest (because 5.6 development is now in fixes-only mode).
Look back
The history of the patchset is a bit bumpy. There was enough motivation and requests for the functionality, so I started the analysis what needs to be done. Several cleanups were necessary to unify code and to make it easily extendable for more copies while using the same mirroring code. In the end change a few constants and be done.
Following with testing, I tried simple mkfs and conversions, that worked well. Then scrub, overwrite some blocks and let the auto-repair do the work. No hiccups. The remaining and important part was the device replace, as the expected use case was to substitute RAID6, replacing a missing or damaged disk. I wrote the test script, replace 1 missing, replace 2 missing. And it did not work. While the filesystem was mounted, everything seemed OK. Unmount, check again and the devices were still missing. Not cool, right.
Due to lack of time before the upcoming merge window (a code freeze before next development cycle), I had to declare it not ready and put it aside. This was in late 2018. For a highly requested feature this was not an easy decision. Should it be something less important, the development cycle between rc1 and final release provides enough time to fix things up. But due to the maintainer role with its demands I was not confident that I could find enough time to debug and fix the remaining problem. Also nobody offered help to continue the work, but that’s how it goes.
In late 2019 I had some spare time and looked at the pending work again. I enhanced the test script with more debugging messages and more checks. The code worked well; the test script was subtly broken. Oh well, what a blunder. That cost a year, but on the other hand releasing a highly requested feature that lacks an important part was not an appealing option.
The patchset was added to the 5.5 development queue at about the last moment before the freeze; the final 5.5 release happened a week ago.
Irony isn’t it. The paint of BLAKE2 as BTRFS checksum algorithm hasn’t dried yet, 1-2 weeks to go but there’s a successor to it. Faster, yet still supposed to be strong. For a second or two I considered ripping out all the work and … no not really but I do admit the excitement.
Speed and strength are competing goals for a hash algorithm. The speed can be evaluated by anyone, not so much the strength. I am no cryptographer and for that area rely on the expertise and opinion of others. That BLAKE was a SHA3 finalist is a good indication, and BLAKE2 is its successor, weakened but not weak. BLAKE3 is yet another step trading off strength and speed.
Regarding BTRFS, BLAKE2 is going to be the faster of strong hashes for now (the other one is SHA256). The argument I have for it now is proof of time. It’s been deployed in many projects (even crypto currencies!), there are optimized implementations, various language ports.
The look ahead regarding more checksums is to revisit them in about 5 years. Hopefully by that time there will be deployments, real workload performance evaluations and overall user experience that will back future decisions.
Maybe there are going to be new strong yet fast hashes developed. During my research I learned about Kangaroo 12 that’s a reduced version of SHA3 (Keccak). The hash is constructed in a different way, perhaps there might be a Kangaroo 2π one day on par with BLAKE3. Or something else. Why not EDON-R, it’s #1 in many of the cr.yp.to/hash benchmarks? Another thing I learned during the research is that hash algorithms are twelve in a dozen, IOW too many to choose from. That Kangaroo 12 is internally of a different construction might be a point for selecting it to have wider range of “building block types”.
Quick evaluation
For BTRFS I have a micro benchmark, repeatedly hashing a 4 KiB block and using cycles per block as a metric.
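The micro benchmark boils down to something like the following sketch, with hash_block() standing in for the implementation being measured (a dummy here so the example is self-contained; the real runs call the individual hash implementations):

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>

#define BLOCK_SIZE	4096
#define ITERATIONS	10000000ULL

/* Dummy stand-in for the hash implementation under test. */
static uint8_t hash_block(const uint8_t *buf, size_t len)
{
	uint8_t acc = 0;

	for (size_t i = 0; i < len; i++)
		acc ^= buf[i];
	return acc;
}

int main(void)
{
	static uint8_t block[BLOCK_SIZE];
	uint64_t start, end;
	volatile uint8_t sink = 0;

	start = __rdtsc();
	for (uint64_t i = 0; i < ITERATIONS; i++)
		sink ^= hash_block(block, BLOCK_SIZE);	/* hash the same 4 KiB block */
	end = __rdtsc();

	printf("total cycles %llu, cycles/iteration %llu\n",
	       (unsigned long long)(end - start),
	       (unsigned long long)((end - start) / ITERATIONS));
	return 0;
}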
Hash | Total cycles | Cycles/iteration | Perf vs BLAKE3 | Perf vs BLAKE2b |
---|---|---|---|---|
BLAKE3 (AVX2) | 111260245256 | 11126 | 1.0 | 0.876 (-13%) |
BLAKE2b (AVX2) | 127009487092 | 12700 | 1.141 (+14%) | 1.0 |
BLAKE2b (AVX) | 166426785907 | 16642 | 1.496 (+50%) | 1.310 (+31%) |
BLAKE2b (ref) | 225053579540 | 22505 | 2.022 (+102%) | 1.772 (+77%) |
Right now there’s only the reference Rust implementation and a derived C implementation of BLAKE3, claimed not to be optimized but from my other experience the compiler can do a good job optimizing programmers ideas away. There’s only one BLAKE3 entry with the AVX2 implementation, the best hardware support my testing box provides. As I had the other results of BLAKE2 at hand, they’re in the table for comparison, but the most interesting pair are the AVX2 versions anyway.
The improvement is 13-14%. Not much, ain't it, way less than the announced 4+x faster than BLAKE2b. Well, it's always important to interpret results of a benchmark with respect to the environment of measurement and the tested parameters.
For the BTRFS filesystem the block size is always going to be in kilobytes. I can't find what input size was used for the official benchmark results; the bench.rs script iterates over various sizes, so I assume it's an average. Short input buffers can skew the results as the setup/output overhead can be significant, while for long buffers the compression phase is significant. I don't have an explanation for the difference and won't draw conclusions about BLAKE3 in general.
One thing that I dare to claim is that I can sleep well because upon the above evaluation, BLAKE3 won’t bring a notable improvement if used as a checksum hash.
References
Personal addendum
During the evaluations, now and in the past, I've found it convenient when there's an offer of implementations in various languages. That eg. the Keccak project pages do not point me directly to a C implementation slightly annoyed me, and as the reference implementation in C++ performed worse than BLAKE2, I did not take the next step to compare the C version, wherever I would find it.
BLAKE3 is fresh and the Rust implementation seems to be the only thing that has been improved since the initial release. A plain C implementation without any warning-not-optimized labels would be good. I think that C versions will appear eventually; besides Rust being the new language hotness, there are projects that have not yet gone "let's rewrite it in Rust". Please bear with us.
A bit more detailed overview of a btrfs update that I find interesting, see the pull request for the rest.
CRC32C uses accelerated versions on other architectures
… than just Intel-based ones. There was a hard coded check for the intel SSE
feature providing the accelerated instruction, but this totally skipped other
architectures. A brief check in linux.git/arch/*/crypto
for files
implementing the accelerated versions revealed that there’s a bunch of them:
ARM, ARM64, MIPS, PowerPC, s390 and SPARC. I don’t have enough hardware to show
the improvements, though.
Automatically remove incompat bit for RAID5/6
While this is not the fix everybody is waiting for, it demonstrates the user-developer-user cycle of how things can be improved. A filesystem created with RAID5 or -6 profiles sets an incompatibility bit. That's supposed to be there as long as there's any chunk using the profiles. The user expectation is that the bit should be removed once the chunks are eg. balanced away. This is what got implemented and it serves as a prior example for any future feature that might get removed on a given filesystem. (Note from the future: I did that for RAID1C34 as well.)
Side note: for the chunk profile it’s easy because the runtime check of
presence is quick and requires only going through a list of chunk profiles
types. Example of the worst case would be dropping the incompat bit for LZO
after there are no compressed files using that algorithm. You can easily see
that this would require either enumerating all extent metadata (one time check)
or keeping track of count since mkfs time. This is IMHO not a typical request
and can be eventually done using the btrfstune
tool on an unmounted
filesystem.
One thing you windows guys have is a decent binary diff. I know there's radare2 on linux, but I'm too much of a noob to set it up and I'd be fine with something less fancy. Something like two objdump -dr files, an intelligent filter and vimdiff.
The tricky part is the intelligent filter. I say intelligent, but in fact it’s a 20 line perl script that basically filters out things that are not important or are likely to change in a slightly modified C source.
The result is lossy, eg. there are no addresses and thus jumps can’t be followed, but having that would actually deserve the ‘intelligent’ title. My use case is simpler, eg. doing small tweaks like reordering lines, adding annotations (like READ_ONCE/WRITE_ONCE) or reducing argument counts.
Which takes eg.
0000000000000000 <btrfs_set_lock_blocking_read>:
0: e8 00 00 00 00 callq 5 <btrfs_set_lock_blocking_read+0x5>
1: R_X86_64_PLT32 __fentry__-0x4
5: 55 push %rbp
6: 53 push %rbx
7: 48 89 fb mov %rdi,%rbx
a: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
f: 65 8b 05 00 00 00 00 mov %gs:0x0(%rip),%eax # 16 <btrfs_set_lock_blocking_read+0x16>
12: R_X86_64_PC32 cpu_number-0x4
16: 89 c0 mov %eax,%eax
18: 48 0f a3 05 00 00 00 bt %rax,0x0(%rip) # 20 <btrfs_set_lock_blocking_read+0x20>
1f: 00
1c: R_X86_64_PC32 __cpu_online_mask-0x4
20: 0f 82 b5 00 00 00 jb db <btrfs_set_lock_blocking_read+0xdb>
26: 0f b6 ab 94 00 00 00 movzbl 0x94(%rbx),%ebp
2d: 40 80 fd 01 cmp $0x1,%bpl
31: 0f 87 00 00 00 00 ja 37 <btrfs_set_lock_blocking_read+0x37>
33: R_X86_64_PC32 .text.unlikely+0x11
37: 83 e5 01 and $0x1,%ebp
3a: 74 1b je 57 <btrfs_set_lock_blocking_read+0x57>
3c: 8b 8b 88 00 00 00 mov 0x88(%rbx),%ecx
42: 65 48 8b 04 25 00 00 mov %gs:0x0,%rax
49: 00 00
47: R_X86_64_32S current_task
4b: 39 88 c0 04 00 00 cmp %ecx,0x4c0(%rax)
51: 0f 84 bd 00 00 00 je 114 <btrfs_set_lock_blocking_read+0x114>
57: 8b 83 18 02 00 00 mov 0x218(%rbx),%eax
5d: 85 c0 test %eax,%eax
5f: 0f 84 b2 00 00 00 je 117 <btrfs_set_lock_blocking_read+0x117>
65: f0 ff 83 90 00 00 00 lock incl 0x90(%rbx)
6c: 8b 83 14 02 00 00 mov 0x214(%rbx),%eax
72: 85 c0 test %eax,%eax
74: 0f 84 ab 00 00 00 je 125 <btrfs_set_lock_blocking_read+0x125>
7a: f0 ff 8b 14 02 00 00 lock decl 0x214(%rbx)
81: 48 8d bb 98 00 00 00 lea 0x98(%rbx),%rdi
88: 5b pop %rbx
89: 5d pop %rbp
...
and produces
0000000000000000 <btrfs_set_lock_blocking_read>:
callq btrfs_set_lock_blocking_read
TARGET __fentry__
push %rbp
push %rbx
mov %rdi,%rbx
NOP
mov %gs:0x0(%rip),%eax
mov %eax,%eax
bt %rax,0x0(%rip)
jb btrfs_set_lock_blocking_read
movzbl 0x94(%rbx),%ebp
cmp $0x1,%bpl
ja btrfs_set_lock_blocking_read
and $0x1,%ebp
je btrfs_set_lock_blocking_read
mov 0x88(%rbx),%ecx
mov %gs:0x0,%rax
cmp %ecx,0x4c0(%rax)
je btrfs_set_lock_blocking_read
mov 0x218(%rbx),%eax
test %eax,%eax
je btrfs_set_lock_blocking_read
lock incl 0x90(%rbx)
mov 0x214(%rbx),%eax
test %eax,%eax
je btrfs_set_lock_blocking_read
lock decl 0x214(%rbx)
lea 0x98(%rbx),%rdi
pop %rbx
pop %rbp
...
This looks quite simple and when lined together, the diff is readable.
So here’s the magic script (pardon my perl skills):
#!/usr/bin/perl
@c=<>;
foreach(@c) {
	chomp;
	# turn PLT32 relocation lines into "TARGET <symbol>"
	s/.*R_X86_64_PLT32\s+([^+-]+)[+-].*/TARGET $1/;
	# drop the remaining relocation lines
	next if(/R_X86_/);
	# drop lines that contain only an address and hex bytes
	next if(/^\s*[0-9a-f]+:\s*([0-9a-f][0-9a-f]\s)+$/);
	# strip the address and the hex bytes, keep the mnemonic and operands
	s/^\s*[0-9a-f]+:\s*([0-9a-f][0-9a-f]\s)+\s*//;
	# replace jump/call targets with the bare symbol name
	s/[0-9a-f]+ <([^+]+)\+.*>$/$1/;
	# drop trailing comments added by objdump
	s/\s+#.*$//;
	# normalize the various nop encodings
	s/nopl.*/NOP/;
	s/xchg.*ax.*ax/NOP/;
	s/data16 nop/NOP/;
	s/nop/NOP/;
	print("$_\n");
}
Use like:
$ objdump -dr before/locking.o > before
$ objdump -dr after/locking.o > after
$ vimdiff before after
The only checksum algorithm currently implemented and used in btrfs is crc32c, with a 32bit digest. It has served well for years but has some weaknesses, so we're looking for replacements and also for enhancing the use cases based on possibly better hash algorithms.
The advantage of crc32c is its simplicity of implementation, the various optimized versions, and the hardware CPU instruction-level support. The error detection strength is not great though, collisions are easy to generate. Note that crc32c has not been used in a cryptographically sensitive context (eg. deduplication).
Side note: the collision generation weakness is used in the filesystem image dump tool to preserve hashes of directory entries while obscuring the real names.
Future use cases
The natural use case is still to provide checksumming for data and metadata blocks. With strong hashes, the same checksums can be used to aid deduplication or verification (HMAC instead of plain hash). Due to different requirements, one hash algorithm cannot possibly satisfy all requirements, namely speed vs. strength. Or can it?
The criteria
The frequency of checksumming is high: every data block and every metadata block needs it.
During the discussions in the community several candidate hash algorithms were proposed, and as it goes, users want different things while we developers want to keep the number of features sane, or at least maintainable. I think and hope the solution will address that.
The first hash type is a replacement for crc32c, with focus on speed and not necessarily strength (ie. collision resistance).
The second type focuses on strength, in the cryptographic sense.
In addition to the technical aspects of the hashes, there are some requirements that would allow free distribution and use of the implementations:
Constraints posed by btrfs implementation:
For the enhanced use case of data verification (using HMAC), there’s a requirement that might not interest everybody but still covers a lot of deployments. And this is standard compliance and certification:
And maybe last but not least, use something that is in wide use already, proven by time.
Speed
Implementations of all algorithms should be performant on common hardware, ie. 64bit architectures, and hopefully not terrible on 32bit architectures or older and weaker hardware. By hardware I mean the CPU, not specialized hardware cards.
The crypto API provided by the linux kernel can automatically select the best implementation of a given algorithm, eg. an implementation optimized in assembly, on multiple architectures.
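To illustrate how that selection is consumed from kernel code, here's a minimal sketch of computing a digest through the shash interface; only the algorithm name string changes when switching hashes (the function name and the lack of error handling are just for this example):
#include <crypto/hash.h>

/* Minimal sketch: the crypto API resolves the name to the highest-priority
 * registered implementation (generic C, assembly, HW-accelerated, ...). */
static int demo_digest(const u8 *data, unsigned int len, u8 *out)
{
	struct crypto_shash *tfm;
	int ret;

	tfm = crypto_alloc_shash("crc32c", 0, 0);	/* or another name from /proc/crypto */
	if (IS_ERR(tfm))
		return PTR_ERR(tfm);

	{
		SHASH_DESC_ON_STACK(desc, tfm);

		desc->tfm = tfm;
		ret = crypto_shash_digest(desc, data, len, out);
	}

	crypto_free_shash(tfm);
	return ret;
}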
Strength
For the fast hash, collisions can probably be generated, but hopefully not as easily as for crc32c. For the strong hash it's obvious that finding a collision would be a jackpot.
In the case of the fast hash, the quality can be evaluated using the SMHasher suite.
The contenders
The following list of hashes has been mentioned and considered or evaluated:
xxhash
criteria: license OK, implementation OK, digest size OK, not standardized but in wide use
The hash is quite fast as it tries to exploit the CPU features that allow instruction parallelism. The SMHasher score is 10, that’s great. The linux kernel implementation landed in 5.3.
Candidate for fast hash.
XXH3
Unfortunately the hash is not yet finalized and cannot be in the final round, but for evaluation of speed it was considered. The hash comes from the same author as xxhash.
SipHash
This hash is made for hash tables and hashing short strings but we want 4KiB or larger blocks. Not a candidate.
CRC64
Similar to crc32 and among the contenders only because it was easy to evaluate, but otherwise it is not in the final round. It turned out to be slow in the microbenchmark.
SHA256
criteria: license OK, implementation OK, digest size OK, standardized in FIPS
The SHA family of hashes is well known, has decent support in CPU and is standardized. Specifically, SHA256 is the strongest that still fits into the available 32 bytes.
Candidate for strong hash.
SHA3
criteria: license OK, implementation OK, digest size OK, standardized in FIPS
It won the 2012 hash contest, so we can't leave it out. From the practical perspective of checksumming, the speed is bad even for the strong hash type. One of the criteria stated above is performance without special hardware, unlike what was preferred during the SHA3 contest.
Candidate for strong hash.
BLAKE2
criteria: license OK, implementation OK, digest size OK, not standardized
From the family of BLAKE that participated in the 2012 SHA contest, the ‘2’ provides a trade-off between speed and strength. More and more projects adopt it.
Candidate for strong hash.
Final round
I don’t blame you if you skipped all the previous paragraphs. The (re)search for the next hash was quite informative and fun, so it would be a shame not to share it, and it also documents the selection process for some transparency. This is a committee-driven process though.
Selecting two hashes for the strong type is a compromise to get a fast-but-strong hash and also something that’s standardized.
The specific version of BLAKE2 is going to be the ‘BLAKE2b-256’ variant, ie. optimized for 64bit (2b) but with 256bit digest.
Microbenchmark
A microbenchmark gives more details about performance of the hashes:
Block: 4KiB (4096 bytes), Iterations: 100000
Hash | Total cycles | Cycles/iteration |
---|---|---|
NULL-NOP | 56888626 | 568 |
NULL-MEMCPY | 60644484 | 606 |
CRC64 | 3240483902 | 32404 |
CRC32C | 174338871 | 1743 |
CRC32C-SW | 174388920 | 1743 |
XXHASH | 251802871 | 2518 |
XXH3 | 193287384 | 1932 |
BLAKE2b-256-arch | 1798517566 | 17985 |
BLAKE2b-256 | 2358400785 | 23584 |
BLAKE2s-arch | 2593112451 | 25931 |
BLAKE2s | 3451879891 | 34518 |
SHA256 | 10674261873 | 106742 |
SHA3-256 | 29152193318 | 291521 |
Machine: Intel(R) Xeon(R) CPU E5-1620 v3 @ 3.50GHz (AVX2)
Hash implementations are the reference ones in C:
There aren’t optimized versions for all hashes, so for a fair comparison the unoptimized reference implementations should be used. As BLAKE2 is my personal favourite I added the other variants and optimized versions to observe the relative improvements.
Evaluation
CRC64 was added out of mere curiosity about how it compares to the rest. Well, curiosity satisfied.
SHA3 is indeed slow on a CPU.
What isn’t here
There are non-cryptographic hashes like CityHash, FarmHash, Murmur3 and more that were found unsuitable or not meeting some of the basic criteria. Others like FNV or Fletcher (used in ZFS) are of comparable strength to crc32c, so that would not be progress.
Merging
All the preparatory work in btrfs landed in version 5.3. Hardcoded assumptions of crc32c were abstracted, the linux crypto API was wired in, with additional cleanups or refactoring. With that in place, adding a new hash is a matter of a few lines of code adding the specifier string for the crypto API.
The work on btrfs-progs is following the same path.
Right now, version 5.4 is in development but new features can’t be added, so the target for the new hashes is 5.5. The BLAKE2 algorithm family still needs to be submitted and merged, hopefully it’ll make it to 5.5 as well.
One of my merge process requirements was to do a call for public testing, so we’re going to do that once all the kernel and progs code is ready for testing. Stay tuned.
A bit more detailed overview of a btrfs update that I find interesting; see the pull request for the rest.
Read-time, write-time corruption detection
The tree-checker is a recent addition to the sanity checks, a functionality that verifies a subset of the metadata structures. The capabilities and strength are limited by the context and the bits of information available at a given point, but there’s still enough to get additional value.
The context here means a b-tree leaf that packs several items into a block (ie. the nodesize, 16KiB by default). The individual items’ members’ values can be verified against allowed values, the order of the keys of all items listed in the leaf header can be checked, etc. This is just to give a rough idea of what happens.
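For illustration only, a check of the key ordering could look roughly like the sketch below; this is not the actual tree-checker code, just the kind of purely local verification that needs nothing but the leaf already in memory (the accessors follow the btrfs helpers, the function itself is made up):
static int leaf_keys_sorted(struct extent_buffer *leaf)
{
	u32 nritems = btrfs_header_nritems(leaf);
	struct btrfs_key key, prev;
	u32 i;

	for (i = 0; i < nritems; i++) {
		btrfs_item_key_to_cpu(leaf, &key, i);
		/* keys must be strictly increasing within the leaf */
		if (i && btrfs_comp_cpu_keys(&prev, &key) >= 0)
			return -EUCLEAN;
		prev = key;
	}
	return 0;
}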
With that in the works, there are two occasions that can utilize the extended checking:
It’s clear that the read-time check is merely a detector of problems that have already happened, so there’s not much to do besides warning and returning an error (EUCLEAN). Turning the filesystem read-only to prevent further problems is another option, and inevitable on some occasions.
The write-side check aims to catch silent errors that could otherwise make it to the permanent storage. The reasons why this happens are twofold, but the main idea is to catch memory bit flips. You’d be surprised how often that happens; that would be for a separate article entirely.
The new checks in 5.2 improve:
As mentioned before, the bits to check are inside a single buffer that represents the tree block, and the check is really local to that. As an inode item can represent more than just files and directories, doing structural checks like where and how the inode item is linked is not easy in this context. This is basically what btrfs check would do.
The checks need to be fast because they happen on each metadata block, so no additional IO is allowed. This still brings some overhead, but it is (considered) negligible compared to all the other updates to the block. A measurement can be done by adding a tracepoint, but that’s left as an exercise.
Fun stuff only …
Hash some bytes with changing digest length from 1 to 64.
[ 1] 3b
[ 2] cf5b
[ 3] 68a2fa
[ 4] ac6de759
[ 5] 022847e13c
[ 6] fd2965be0997
[ 7] d4c93ad6ffbcbb
[ 8] 87c1dcb224fd2a57
[ 9] 6017c4f8c4bcb1d115
[10] 9b3f7b15df8867b8214c
[11] 8087b2861bbdf3d0356ea6
[12] 7a035413d62f0774148a12cc
[13] 697cabcfcade7114a271373039
[14] 784e3578c7b5a031237296225e26
[15] 0102e6aa5ff6bf57071b34f838d3d4
[16] 324f54d0e8f39326145e8e04b00434b4
[17] 45fbfb40756a53305881dc52d7aa6a9025
[18] c1a664957cbc0bbd0e2eb15be3a8e0f10b46
[19] 58c84031abff74bba0cab2a3ce8160c955fb92
[20] bad65265d1351746c3fc23816dca6966b42d2e90
[21] c9ed77f178c889cc985aaad89d740831922d7d6253
[22] 751673d5516709339176859392a61e45b444a3a4612c
[23] e62b1a6a239571d2a7ebf686f3f1f628f7253fa595debf
[24] 3afb48409fa94374bd1819bef5d40f8e90386686ec7212db
[25] 139fb5a766ab873fc1747e76253ebddec0da85388261ddcd0c
[26] 90b849e8d333229c8942f68faac2870530886c30edf4fe295281
[27] cfefa0767fdd316475e4374dc3b9e1eabdebb0111c88ccdaf6b668
[28] 6928e0b74efa9534c84483975ab83b65191d59d69c76880eac794afc
[29] 5d825aaaaabff533c36308227ed42982837bf7fa85d5f4457f6d7ebdfe
[30] a14f3dafe99f88dbb119f23cd2d390b3396f2c1c90fc8099514abb95872a
[31] 6a62d371a4b6583df5cc9f125e01536338a8e87fd17b1e24468778e00205bc
[32] 99b4a09edc99a1a2283f94189d8604ca65aab7aeebc31327b8c4a918341f92b0
[33] 1b0728d4ae415dc572ebb7caf18f7b43115ca6b53f6379a6571b4949d2049e6b7f
[34] 04c6817ba561b6e6aefb2237682aa4171314fc2a7bfa15f604227e239e8a08696153
[35] 102d84c3e030bd1d413f12128c34caef64c06c364eb13c4eb49549ab278efcd5d0dac1
[36] fa4edc42c4fcd5754de080616d09412c3f7baba4a972954e449bfa65fc040ae5556bc63c
[37] 0728fcf98a665b6a8c9a63ae31c87538d68bca8dc2e2a6b66dba3c1f4a8c5965bb4045a1e2
[38] 3fd0e5397a0d98eaf56a941e5b7ba7b8aa4b8dbe2a22993809cbb9a4b68834de1341c1412e2f
[39] 16eee95d5dc56762bc42f59aac2fc0bee5ba3a5fa7d6b2c1acc13af519a1a13cc793d2482685b9
[40] 74843823a200d423fbead5d40d83dea26638ecf79a768f819b44174a2ce0b51702944ff967597b9f
[41] 53134460150687f2b993259a048a9b59de7d8c091ffc635b792b33ae9c7970ef6dc47fe6308e47da21
[42] 174cbb459228b1b5fe302d36dfd69d0c75b8d77398e6eb8440e9449db38ce7387d12095f23531eb5a2bd
[43] f550a1d564f073486095bf473a20d135fcba595fd574c124e6a8f3f1451a4cd649b3d3c6c22fd3961dff9e
[44] 9e25048892b6d367d0e85a75d1f49c6950a9ae3ca02b03165b1308f0aadac093032fbb13556cb9bb85872def
[45] d43eb7fa68c34906813c360cd5c50e8d5f62c0de941f1e9520d405104144c30c1eb43c95bad4b850b3dc308141
[46] b9e4bc8364d69248c68dfe17a66131de04e6be949663c936fdbe5fec3efb73d4e40c22ad298d8be849de58b0a3a8
[47] f81017569d44554c934f6d89723e8997e115def6d66c76ee3c05a7e7626f487cab5659b4e3a7f32c9859dd04c579eb
[48] f2a987f00b96fafd25f2d8a19717913674961f234d0f81da5dbb549382ce2f25a5a95c4be917b97341d7c8b8a05aa2d1
[49] 03602f74fadc251d46a064dcc55a812c37c98960a984782c533bce1db92c514521996c58ff293a3957fd686cd814cbc18e
[50] bf939f9418697008f5a1c67e16e7f01bf596a7be648069c5bb322ba7207b6af5947912a40773f3c8f48bef03cda925c79eb8
[51] 77cb53241baa7dfd4b73b786b29543ad4f9da15c33d21b7acf0d04bd78c437018f7358357879f06e8fa59832028a851c126fbf
[52] 404a9cdd87d1ff194c45c9dd1995889b5492b22d446f87f8efe0148793bee21a756cfe6ac18f438f2535b35e909678e1f4ddfd92
[53] 38b399db58e9a1d4b3d78e3756b55a7d8bc3fce565a6cd68f66632cb3e201720384487bc314fc960cc7ece0915ab06ab655cb21f95
[54] 341c51088313284bb6efb2ce87116fa0ad86c540b4761ad1a91db8a54faf0ac16ae44de62e02d87d88124cdf7820f939657f30d43401
[55] 4d217c7ba91da2c4503357904f2264117a445fde54a7f25cf85c0b9d4528e1a8ca682ea59586d63755c2dac848716d19afa00f9c2a70e5
[56] 05664a51270f7a9dc309a1e78bddd3ed6a210aecf31f42be0a4231280a9e15ca1af3ec0456760fc2a6b76039a7f041a7623470bd912afc7d
[57] 434c8e7d8357e840f8522c582385fcba6650b7e923a821a185ae4e036de7ed6b931b908ef4bc3838de6864decdff762ac123f6f66a14a2fded
[58] 87fc3385049632e310553cda08b22810b0fa59c92f5559b49d2ffd5e5d0734886ccd269e6866d6412bc06489cba014265767edebcd2c18977c42
[59] 282ca365baaa22317a3c7faadc075e6b132fefee0d98d8823a79ffe87f87f7a38d8fd9e4a7ca10a904fbbcd9952dc12af3e561b884a40b7b2b4288
[60] a0f9ca58069fe970f2fd36d6c86b38b14ad41418f870ee695aa52395400552c4f583760e01c68c440c5fa6f9c6db25138d85cd6568f0f9a5698db89a
[61] 79b0b42401bb022fdfb71fdd2e6ddd2d0dc20cedcdd4b64e7921a03fd2f71508a195e2d8a4ebeb0ea8cdc6942ccd28cca7eda0fa2126fd884a0b279cd4
[62] 42b704ca2da3386bf00fc511648668af1b9f3bbf69d7972d35dc24365d0b37778fffa624662bd50139430644314367ff855da6bb489227144456007d5f1d
[63] 504a6a52e16b29a0ebe9106ac74931889243ef98aee97bd598502aac9363d99c7c414d39cb1229a3b49e79351f2c0daa3e9221bc7a135ffc41f79a6e416762
[64] 0b944b6bdc6c99e10448a536dea6ea79fd1d78e70307bf6e3c4a889f3876e2ec6a002ec9b4198a27e2b2dcdfcef7a4de4da52791248457a50d97307cdb0b6c80
Sources: blake2-impl.h test-blake2.h test-blake2b.c
The BLAKE2 algorithm has been out for some time, yet there’s no port to the linux crypto API. The recent push of WireGuard would add it to the linux kernel, but using a different API (zinc) than the existing one. I haven’t explored zinc; I assume there will be a way to use it from inside the kernel, but this would need another layer to switch the APIs according to the algorithm.
As btrfs is going to use more hashing algorithms, we are in the process of selection. One of the contenders is BLAKE2, though everybody would have a different suggestion. In order to test it, a port is needed, which is basically glue code between the linux crypto API and the BLAKE2 implementation.
I’m not going to reimplement crypto or anything, so the default implementation is going to be the reference one found in the blake2 git.
Note: due to the maximum space of 32 bytes available in the btrfs metadata blocks, the version is BLAKE2s.
The BLAKE2s sources
From the repository https://github.com/BLAKE2/BLAKE2:
Briefly skimming over the sources, there’s nothing that’ll cause trouble. Standard C code, some definitions. Adapting it for the linux kernel means replacing the stdint.h types (uint8_t etc) with the kernel uXX types (u8 etc) and changing the include paths. For simplicity, let’s remove the ifdefs for C++ and MSVC too.
Add the new algorithm definition (CRA)
Though it’s possible to prepare a module for an out-of-tree build (see below), let’s do it inside the linux.git/crypto/ path for now. There’s also plenty of sources to copy & paste. I’ve used crc32c_generic.c, and it turned out to be useful.
The crypto hash description is defined in struct shash_alg; it contains the technical description like the type of the algorithm, the length of the context, callbacks for the various phases of the hash calculation (init, update, final), and the identification of the algorithm and the implementation. The default implementations are in C and use the string “-generic” in the name.
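A minimal sketch of such a definition, modeled on crc32c_generic.c and matching the /proc/crypto output further below; the callback and context type names are placeholders that reappear in the glue-code sketch later:
static struct shash_alg alg = {
	.digestsize	= 32,		/* BLAKE2s has a 32-byte digest */
	.init		= blake2s_checksum_init,
	.update		= blake2s_checksum_update,
	.final		= blake2s_checksum_final,
	.descsize	= sizeof(struct blake2s_desc_ctx),
	.base		= {
		.cra_name	 = "blake2s",
		.cra_driver_name = "blake2s-generic",
		.cra_priority	 = 100,
		.cra_blocksize	 = 1,
		.cra_ctxsize	 = sizeof(struct blake2s_tfm_ctx),
		.cra_module	 = THIS_MODULE,
		.cra_init	 = blake2s_cra_init,
	},
};
The module init/exit then just call crypto_register_shash(&alg) and crypto_unregister_shash(&alg).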
The crc32c module came with a few stub callbacks (checksum_init etc) that will only call into the blake2 functions, and their definitions can stay. A simple search and replace from crc32c to blake2s will do most of the conversion.
Add the glue code for crypto API
Now we have blake2s.c with the reference implementation and the crypto algorithm definition. The glue code connects the API with the implementation. We need 2 helper structures that hold the context once the user starts a digest calculation. The private blake2s context is embedded into one of them; the intermediate results will be stored there.
And the rest is quite easy: each callback will call into the respective blake2s function, passing the context, data and lengths. One thing that I copied from the examples is the key initialization in blake2s_cra_init, which appears to be global and is copied to the context each time a new one is initialized.
Here the choice of crc32c.c helped, as there were stub callbacks with the right parameters, calling the blake2s functions that can retain their original signature. This makes later updates easier. All the functions are static, so the compiler will most probably optimize the code so that there is no unnecessary overhead.
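Put together, the glue callbacks could look roughly like this sketch (unkeyed for brevity, the default-key setup in blake2s_cra_init is left out; the names are the same placeholders as in the struct shash_alg sketch above):
#include <crypto/internal/hash.h>
#include <linux/errno.h>
#include "blake2.h"			/* reference implementation header */

struct blake2s_desc_ctx {
	blake2s_state state;		/* per-request state of the reference code */
};

static int blake2s_checksum_init(struct shash_desc *desc)
{
	struct blake2s_desc_ctx *ctx = shash_desc_ctx(desc);

	return blake2s_init(&ctx->state, BLAKE2S_OUTBYTES) ? -EINVAL : 0;
}

static int blake2s_checksum_update(struct shash_desc *desc, const u8 *data,
				   unsigned int length)
{
	struct blake2s_desc_ctx *ctx = shash_desc_ctx(desc);

	return blake2s_update(&ctx->state, data, length) ? -EINVAL : 0;
}

static int blake2s_checksum_final(struct shash_desc *desc, u8 *out)
{
	struct blake2s_desc_ctx *ctx = shash_desc_ctx(desc);

	return blake2s_final(&ctx->state, out, BLAKE2S_OUTBYTES) ? -EINVAL : 0;
}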
Well and that’s it. Let’s try to compile it and insert the module:
linux.git$ make crypto/blake2s.o
...
CC [M] crypto/blake2s.o
linux.git$ make crypto/blake2s.ko
...
CC crypto/blake2s.mod.o
LD [M] crypto/blake2s.ko
...
Check that it’s been properly loaded
linux.git/crypto$ sudo insmod ./blake2s.ko
and registered
linux.git$ cat /proc/crypto
name : blake2s
driver : blake2s-generic
module : blake2s
priority : 100
refcnt : 1
selftest : passed
internal : no
type : shash
blocksize : 1
digestsize : 32
...
The selftest says it passed, but there’s no such thing so far. There are test values provided in the blake2 git, so it would be nice to have that too (tm). But otherwise it looks good.
To do an actual test, we’d need something inside the kernel to utilize the new hash. One option is to implement a module that will do that, or to use the userspace library libkcapi that can forward the requests from userspace to the available kernel implementations.
Test it with libkcapi
The libkcapi project at http://www.chronox.de/libkcapi.html provides an API that uses the AF_ALG socket type to exchange data with the kernel. The library provides a command line tool that we can use right away, so we don’t need to code anything.
$ kcapi-dgst -c blake2s --hex < /dev/null
48a8997da407876b3d79c0d92325ad3b89cbb754d86ab71aee047ad345fd2c49
The test vectors provided by blake2 confirm that this is the hash of the empty string with the default key (0x000102..1f).
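For completeness, the same AF_ALG interface can be used directly from C without libkcapi; a minimal sketch (hashing a short non-empty string here, error handling omitted, the algorithm name is the one registered above):
#include <stdio.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/if_alg.h>

int main(void)
{
	struct sockaddr_alg sa = {
		.salg_family = AF_ALG,
		.salg_type = "hash",
		.salg_name = "blake2s",		/* as listed in /proc/crypto */
	};
	unsigned char digest[32];
	int tfm, op;

	tfm = socket(AF_ALG, SOCK_SEQPACKET, 0);
	bind(tfm, (struct sockaddr *)&sa, sizeof(sa));
	op = accept(tfm, NULL, 0);

	send(op, "abc", 3, 0);			/* one message, no MSG_MORE */
	read(op, digest, sizeof(digest));	/* returns the finished digest */

	for (unsigned int i = 0; i < sizeof(digest); i++)
		printf("%02x", digest[i]);
	printf("\n");

	close(op);
	close(tfm);
	return 0;
}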
Out-of-tree build
Sources in linux.git require one additional line in the Makefile to build it unconditionally as a module. A proper submission to the linux kernel would also need a Kconfig option.
obj-m += blake2s.o
The standalone build needs a Makefile with a few targets that use the existing build of the kernel. Note that you’d need a running kernel built from the same sources. This is usually provided by the kernel-*-devel packages. Otherwise, if you build kernels from git, you know what to do, right?
KDIR ?= /lib/modules/`uname -r`/build
obj-m := blake2s.o
default:
	$(MAKE) -C $(KDIR) M=$$PWD
clean:
	$(MAKE) -C $(KDIR) M=$$PWD clean
modules_install:
	$(MAKE) -C $(KDIR) M=$$PWD modules_install
After running make, the kernel module is ready for use.
What next?
Send it upstream. Well, after some work of course.
All the files can be found here:
References
Update: enhance the text, still a bit raw, check the numbers
Example of benchmarks of patches going to the 4.20 kernel. The goal was to identify potential differences between the development branch with and without the merge to mainline. The factors affecting the difference are usually in other subsystems like the block layer (IO queuing), memory management (namely page cache, memory allocator, writeback) or the core (scheduling, locking primitives).
The numbers are reflected in the results below, referring to comparisons of the branches. 1-to-2 is the net effect of the patches, 1-to-3 is the same patches after merge.
The testsuite used is MMTests, which is internally used for performance evaluation and provides a wide range of workloads.
The expected performance delta should be roughly the same for 2 and 3. The tables show whether the delta is significant (green: positive, red: negative), so it can be easily observed. Note that the values need to be properly interpreted in the context of the benchmark and perhaps other values.
MMTests results
The raw stats are available.
Test setup
After updating qemu to version 2.12, my testing VMs stopped just warning about if=scsi and did not want to start anymore (with a bit more cryptic message).
qemu-system-x86_64: -drive file=root,if=scsi,media=disk,cache=none,index=0,format=raw: machine type does not support if=scsi,bus=0,unit=0
The if=scsi shortcut was handy; the maze of qemu command line options may take some time to comprehend, but then you can do wonders.
The release notes of version 2.12 do mention that the support is gone and that other options should be used instead, and they also mention which ones. But they do not tell how exactly. As this should be a simple syntax transformation from one option to another, an example would have saved a lot of time.
I was not able to find any ready-to-use example in a few minutes so had to experiment a bit (this saved me documentation reading time).
The setup I use is a file-backed root image for a virtual machine, nothing really fancy. The file is named root, a sparse file, raw (ie. no qcow2), no caching.
Here you are:
-device virtio-scsi-pci,id=scsi
-drive file=root,id=root-img,if=none,format=raw,cache=none
-device scsi-hd,drive=root-img
First you need to define the bus, otherwise the 3rd line will complain that there’s no SCSI bus. The second line defines the file-backed drive and the third one puts it all together.
Using SCSI might not be the best idea for a qemu VM as the emulated driver is buggy and crashes, so I’d recommend using virtio, but for an almost read-only root image it’s fine. Also, the device names are predictable.
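Put together, the relevant part of the invocation looks like this (the memory size and the other options are just placeholders for whatever your VM already uses):
$ qemu-system-x86_64 -enable-kvm -m 2048 \
    -device virtio-scsi-pci,id=scsi \
    -drive file=root,id=root-img,if=none,format=raw,cache=none \
    -device scsi-hd,drive=root-img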
The errno values, in the various formats you may encounter. The standard errno is a positive number; the linux kernel uses the negative values. Besides the sources (going through like 5 include hops from /usr/include/errno.h to /usr/include/asm-generic/errno.h and /usr/include/asm-generic/errno-base.h) there’s no direct way to get the symbolic <-> numeric mapping. The hex values commonly appear in stack trace dumps.
symbolic | dec | hex | neg hex | comment |
---|---|---|---|---|
EPERM | 1 | 0x01 | 0x..ffff | Operation not permitted |
ENOENT | 2 | 0x02 | 0x..fffe | No such file or directory |
ESRCH | 3 | 0x03 | 0x..fffd | No such process |
EINTR | 4 | 0x04 | 0x..fffc | Interrupted system call |
EIO | 5 | 0x05 | 0x..fffb | I/O error |
ENXIO | 6 | 0x06 | 0x..fffa | No such device or address |
E2BIG | 7 | 0x07 | 0x..fff9 | Argument list too long |
ENOEXEC | 8 | 0x08 | 0x..fff8 | Exec format error |
EBADF | 9 | 0x09 | 0x..fff7 | Bad file number |
ECHILD | 10 | 0x0a | 0x..fff6 | No child processes |
EAGAIN | 11 | 0x0b | 0x..fff5 | Try again |
ENOMEM | 12 | 0x0c | 0x..fff4 | Out of memory |
EACCES | 13 | 0x0d | 0x..fff3 | Permission denied |
EFAULT | 14 | 0x0e | 0x..fff2 | Bad address |
ENOTBLK | 15 | 0x0f | 0x..fff1 | Block device required |
EBUSY | 16 | 0x10 | 0x..fff0 | Device or resource busy |
EEXIST | 17 | 0x11 | 0x..ffef | File exists |
EXDEV | 18 | 0x12 | 0x..ffee | Cross-device link |
ENODEV | 19 | 0x13 | 0x..ffed | No such device |
ENOTDIR | 20 | 0x14 | 0x..ffec | Not a directory |
EISDIR | 21 | 0x15 | 0x..ffeb | Is a directory |
EINVAL | 22 | 0x16 | 0x..ffea | Invalid argument |
ENFILE | 23 | 0x17 | 0x..ffe9 | File table overflow |
EMFILE | 24 | 0x18 | 0x..ffe8 | Too many open files |
ENOTTY | 25 | 0x19 | 0x..ffe7 | Not a typewriter |
ETXTBSY | 26 | 0x1a | 0x..ffe6 | Text file busy |
EFBIG | 27 | 0x1b | 0x..ffe5 | File too large |
ENOSPC | 28 | 0x1c | 0x..ffe4 | No space left on device |
ESPIPE | 29 | 0x1d | 0x..ffe3 | Illegal seek |
EROFS | 30 | 0x1e | 0x..ffe2 | Read-only file system |
EMLINK | 31 | 0x1f | 0x..ffe1 | Too many links |
EPIPE | 32 | 0x20 | 0x..ffe0 | Broken pipe |
EDOM | 33 | 0x21 | 0x..ffdf | Math argument out of domain of func |
ERANGE | 34 | 0x22 | 0x..ffde | Math result not representable |
EDEADLK | 35 | 0x23 | 0x..ffdd | Resource deadlock would occur |
ENAMETOOLONG | 36 | 0x24 | 0x..ffdc | File name too long |
ENOLCK | 37 | 0x25 | 0x..ffdb | No record locks available |
ENOSYS | 38 | 0x26 | 0x..ffda | Function not implemented |
ENOTEMPTY | 39 | 0x27 | 0x..ffd9 | Directory not empty |
ELOOP | 40 | 0x28 | 0x..ffd8 | Too many symbolic links encountered |
EWOULDBLOCK | EAGAIN | |||
ENOMSG | 42 | 0x2a | 0x..ffd6 | No message of desired type |
EIDRM | 43 | 0x2b | 0x..ffd5 | Identifier removed |
ECHRNG | 44 | 0x2c | 0x..ffd4 | Channel number out of range |
EL2NSYNC | 45 | 0x2d | 0x..ffd3 | Level 2 not synchronized |
EL3HLT | 46 | 0x2e | 0x..ffd2 | Level 3 halted |
EL3RST | 47 | 0x2f | 0x..ffd1 | Level 3 reset |
ELNRNG | 48 | 0x30 | 0x..ffd0 | Link number out of range |
EUNATCH | 49 | 0x31 | 0x..ffcf | Protocol driver not attached |
ENOCSI | 50 | 0x32 | 0x..ffce | No CSI structure available |
EL2HLT | 51 | 0x33 | 0x..ffcd | Level 2 halted |
EBADE | 52 | 0x34 | 0x..ffcc | Invalid exchange |
EBADR | 53 | 0x35 | 0x..ffcb | Invalid request descriptor |
EXFULL | 54 | 0x36 | 0x..ffca | Exchange full |
ENOANO | 55 | 0x37 | 0x..ffc9 | No anode |
EBADRQC | 56 | 0x38 | 0x..ffc8 | Invalid request code |
EBADSLT | 57 | 0x39 | 0x..ffc7 | Invalid slot |
EDEADLOCK | EDEADLK | |||
EBFONT | 59 | 0x3b | 0x..ffc5 | Bad font file format |
ENOSTR | 60 | 0x3c | 0x..ffc4 | Device not a stream |
ENODATA | 61 | 0x3d | 0x..ffc3 | No data available |
ETIME | 62 | 0x3e | 0x..ffc2 | Timer expired |
ENOSR | 63 | 0x3f | 0x..ffc1 | Out of streams resources |
ENONET | 64 | 0x40 | 0x..ffc0 | Machine is not on the network |
ENOPKG | 65 | 0x41 | 0x..ffbf | Package not installed |
EREMOTE | 66 | 0x42 | 0x..ffbe | Object is remote |
ENOLINK | 67 | 0x43 | 0x..ffbd | Link has been severed |
EADV | 68 | 0x44 | 0x..ffbc | Advertise error |
ESRMNT | 69 | 0x45 | 0x..ffbb | Srmount error |
ECOMM | 70 | 0x46 | 0x..ffba | Communication error on send |
EPROTO | 71 | 0x47 | 0x..ffb9 | Protocol error |
EMULTIHOP | 72 | 0x48 | 0x..ffb8 | Multihop attempted |
EDOTDOT | 73 | 0x49 | 0x..ffb7 | RFS specific error |
EBADMSG | 74 | 0x4a | 0x..ffb6 | Not a data message |
EOVERFLOW | 75 | 0x4b | 0x..ffb5 | Value too large for defined data type |
ENOTUNIQ | 76 | 0x4c | 0x..ffb4 | Name not unique on network |
EBADFD | 77 | 0x4d | 0x..ffb3 | File descriptor in bad state |
EREMCHG | 78 | 0x4e | 0x..ffb2 | Remote address changed |
ELIBACC | 79 | 0x4f | 0x..ffb1 | Can not access a needed shared library |
ELIBBAD | 80 | 0x50 | 0x..ffb0 | Accessing a corrupted shared library |
ELIBSCN | 81 | 0x51 | 0x..ffaf | .lib section in a.out corrupted |
ELIBMAX | 82 | 0x52 | 0x..ffae | Attempting to link in too many shared libraries |
ELIBEXEC | 83 | 0x53 | 0x..ffad | Cannot exec a shared library directly |
EILSEQ | 84 | 0x54 | 0x..ffac | Illegal byte sequence |
ERESTART | 85 | 0x55 | 0x..ffab | Interrupted system call should be restarted |
ESTRPIPE | 86 | 0x56 | 0x..ffaa | Streams pipe error |
EUSERS | 87 | 0x57 | 0x..ffa9 | Too many users |
ENOTSOCK | 88 | 0x58 | 0x..ffa8 | Socket operation on non-socket |
EDESTADDRREQ | 89 | 0x59 | 0x..ffa7 | Destination address required |
EMSGSIZE | 90 | 0x5a | 0x..ffa6 | Message too long |
EPROTOTYPE | 91 | 0x5b | 0x..ffa5 | Protocol wrong type for socket |
ENOPROTOOPT | 92 | 0x5c | 0x..ffa4 | Protocol not available |
EPROTONOSUPPORT | 93 | 0x5d | 0x..ffa3 | Protocol not supported |
ESOCKTNOSUPPORT | 94 | 0x5e | 0x..ffa2 | Socket type not supported |
EOPNOTSUPP | 95 | 0x5f | 0x..ffa1 | Operation not supported on transport endpoint |
EPFNOSUPPORT | 96 | 0x60 | 0x..ffa0 | Protocol family not supported |
EAFNOSUPPORT | 97 | 0x61 | 0x..ff9f | Address family not supported by protocol |
EADDRINUSE | 98 | 0x62 | 0x..ff9e | Address already in use |
EADDRNOTAVAIL | 99 | 0x63 | 0x..ff9d | Cannot assign requested address |
ENETDOWN | 100 | 0x64 | 0x..ff9c | Network is down |
ENETUNREACH | 101 | 0x65 | 0x..ff9b | Network is unreachable |
ENETRESET | 102 | 0x66 | 0x..ff9a | Network dropped connection because of reset |
ECONNABORTED | 103 | 0x67 | 0x..ff99 | Software caused connection abort |
ECONNRESET | 104 | 0x68 | 0x..ff98 | Connection reset by peer |
ENOBUFS | 105 | 0x69 | 0x..ff97 | No buffer space available |
EISCONN | 106 | 0x6a | 0x..ff96 | Transport endpoint is already connected |
ENOTCONN | 107 | 0x6b | 0x..ff95 | Transport endpoint is not connected |
ESHUTDOWN | 108 | 0x6c | 0x..ff94 | Cannot send after transport endpoint shutdown |
ETOOMANYREFS | 109 | 0x6d | 0x..ff93 | Too many references: cannot splice |
ETIMEDOUT | 110 | 0x6e | 0x..ff92 | Connection timed out |
ECONNREFUSED | 111 | 0x6f | 0x..ff91 | Connection refused |
EHOSTDOWN | 112 | 0x70 | 0x..ff90 | Host is down |
EHOSTUNREACH | 113 | 0x71 | 0x..ff8f | No route to host |
EALREADY | 114 | 0x72 | 0x..ff8e | Operation already in progress |
EINPROGRESS | 115 | 0x73 | 0x..ff8d | Operation now in progress |
ESTALE | 116 | 0x74 | 0x..ff8c | Stale NFS file handle |
EUCLEAN | 117 | 0x75 | 0x..ff8b | Structure needs cleaning |
ENOTNAM | 118 | 0x76 | 0x..ff8a | Not a XENIX named type file |
ENAVAIL | 119 | 0x77 | 0x..ff89 | No XENIX semaphores available |
EISNAM | 120 | 0x78 | 0x..ff88 | Is a named type file |
EREMOTEIO | 121 | 0x79 | 0x..ff87 | Remote I/O error |
EDQUOT | 122 | 0x7a | 0x..ff86 | Quota exceeded |
ENOMEDIUM | 123 | 0x7b | 0x..ff85 | No medium found |
EMEDIUMTYPE | 124 | 0x7c | 0x..ff84 | Wrong medium type |
ECANCELED | 125 | 0x7d | 0x..ff83 | Operation Canceled |
ENOKEY | 126 | 0x7e | 0x..ff82 | Required key not available |
EKEYEXPIRED | 127 | 0x7f | 0x..ff81 | Key has expired |
EKEYREVOKED | 128 | 0x80 | 0x..ff80 | Key has been revoked |
EKEYREJECTED | 129 | 0x81 | 0x..ff7f | Key was rejected by service |
EOWNERDEAD | 130 | 0x82 | 0x..ff7e | Owner died |
ENOTRECOVERABLE | 131 | 0x83 | 0x..ff7d | State not recoverable |
ERFKILL | 132 | 0x84 | 0x..ff7c | Operation not possible due to RF-kill |
EHWPOISON | 133 | 0x85 | 0x..ff7b | Memory page has hardware error |
ERESTARTSYS | 512 | 0x200 | 0x..fe00 | |
ERESTARTNOINTR | 513 | 0x201 | 0x..fdff | |
ERESTARTNOHAND | 514 | 0x202 | 0x..fdfe | restart if no handler.. |
ENOIOCTLCMD | 515 | 0x203 | 0x..fdfd | No ioctl command |
ERESTART_RESTARTBLOCK | 516 | 0x204 | 0x..fdfc | restart by calling sys_restart_syscall |
EBADHANDLE | 521 | 0x209 | 0x..fdf7 | Illegal NFS file handle |
ENOTSYNC | 522 | 0x20a | 0x..fdf6 | Update synchronization mismatch |
EBADCOOKIE | 523 | 0x20b | 0x..fdf5 | Cookie is stale |
ENOTSUPP | 524 | 0x20c | 0x..fdf4 | Operation is not supported |
ETOOSMALL | 525 | 0x20d | 0x..fdf3 | Buffer or request is too small |
ESERVERFAULT | 526 | 0x20e | 0x..fdf2 | An untranslatable error occurred |
EBADTYPE | 527 | 0x20f | 0x..fdf1 | Type not supported by server |
EJUKEBOX | 528 | 0x210 | 0x..fdf0 | Request initiated, but will not complete before timeout |
EIOCBQUEUED | 529 | 0x211 | 0x..fdef | iocb queued, will get completion event |
EIOCBRETRY | 530 | 0x212 | 0x..fdee | iocb queued, will trigger a retry |
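If you only need the number-to-description part of the table above (the symbolic names really do live only in the headers), a small sketch like this can regenerate it from strerror():
#include <stdio.h>
#include <string.h>

int main(void)
{
	/* the classic range; unused numbers (41, 58) print "Unknown error" */
	for (int i = 1; i <= 133; i++)
		printf("%3d  0x%02x  0x..%04x  %s\n",
		       i, i, (unsigned int)(-i) & 0xffff, strerror(i));
	return 0;
}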
It happened to me that space on my root partition was running low, no matter what I removed. I have snapper hourly snapshots enabled and have tuned the daily and weekly settings, so the usual problem of running out of space is prevented. Also, before doing a big update of Tumbleweed (the openSUSE rolling distro) I check that the available space roughly matches what zypper says it’d need.
$ snapper ls
Type | # | Pre # | Date | User | Cleanup | Description | Userdata
-------+-----+-------+----------------------------------+------+----------+--------------+--------------
single | 0 | | | root | | current |
single | 395 | | Mon 26 Sep 2016 10:30:01 PM CEST | root | timeline | timeline |
So what happened? I was removing some old kernels, manually by rpm. And it started to report write errors, no space. Indeed, df reported just 64K available. This is within the margin of accuracy that we’re able to get from the statfs syscall. Side note, getting this right is a lot of guesswork, so 64K is as good as it gets.
$ df -h /
Filesystem Size Used Avail Use% Mounted on
/dev/sda1 30G 30G 64K 86% /
RPM continued to spew lots of errors, I removed the rest of the snapper snapshots, no improvement. So then I resorted to the bare ‘btrfs’ command and found a stale snapper snapshot. The snapshot directory itself was there, but the metadata (an xml file) were missing. After deleting it manually, the free space jumped up by 6GB!
# btrfs subvol list /
...
... .snapshots/3721/snapshot
...
The theory is that I had once removed all snapshots so the numbering started from 1 again, as the current series is around ~390, while the number of the stale snapshot is much higher.
Takeaways:
Summary of used tools:
Update
Again. There were 2 stale snapper snapshots, about 8 months old by the time they were discovered. The free space went up by 12GiB after deletion. The root partition is 50GiB; I think it survived pretty long given the frequent updates of the rolling distro. There’s an open issue.