Kernel bugs can have all kinds of unfortunate consequences, from inconvenient crashes to nasty security vulnerabilities. Some of the most feared bugs, though, are those that corrupt data in filesystems. The losses imposed on users can be severe, and the resulting problems may not be noticed for a long time, making recovery difficult. Filesystem developers, knowing that they will have to face their users in the real world, go to considerable effort to prevent this kind of bug from finding its way into a released kernel. A recent failure in that regard raises a number of interesting questions about how kernel development is done.

On November 13, Claude Heiland-Allen created a bug report about a filesystem corruption problem with the 4.19.1 kernel; other users joined in with reports of their own. Initially, the problem was thought to be in the ext4 filesystem, since that is what the affected users were using. Tracking it down took a few weeks, though, because few developers were able to reproduce it. There were some attempts at using bisection to find the commit that caused the problem, but they proved to be worse than useless, as they identified the wrong commits and caused developers to waste time on false leads.

It took until December 4 for Lukáš Krejčí to correctly bisect the problem down to a block-layer change. Commit 6ce3dd6eec, added during the 4.19 merge window, optimized the handling of requests in the multiqueue block layer. If there is no I/O scheduler in use, and if the hardware queue is not full, this patch causes new I/O requests to be placed directly into the hardware queue, shorting out a bunch of unnecessary processing. It's a harmless-seeming change that should make I/O go a little faster.
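The dispatch decision described above can be illustrated with a small toy model. This is only a sketch, not kernel code: the names `HardwareQueue`, `submit`, and `staging` are hypothetical, and the real multiqueue block layer is written in C and is considerably more involved.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class HardwareQueue:
    """Toy stand-in for a hardware dispatch queue with a fixed depth."""
    depth: int
    slots: List[str] = field(default_factory=list)

    def is_full(self) -> bool:
        return len(self.slots) >= self.depth


def submit(request: str, hwq: HardwareQueue,
           scheduler: Optional[str], staging: List[str]) -> str:
    """Place a request and report which path it took.

    Hypothetical model of the optimization: with no I/O scheduler and
    room in the hardware queue, the request bypasses the staging list.
    """
    if scheduler is None and not hwq.is_full():
        hwq.slots.append(request)   # fast path: straight to the hardware queue
        return "direct"
    staging.append(request)         # slow path: normal queueing and processing
    return "staged"
```

In this model, a request takes the fast path only when both conditions hold; a configured scheduler or a full hardware queue sends it through the staging list instead.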

Things can go bad, though, if the low-level driver for the block device is unable to actually execute that request. This is most likely to happen as the result of a resource shortage — memory, perhaps, or something related to the hardware itself. In that case, the driver will return a soft failure, causing the I/O request to be requeued for another attempt later. While that request sits in the queue, the block layer may merge it with other requests for adjacent blocks, which should be fine. If, however, the low-level driver has already done some of the setup for the request, such as creating scatter/gather DMA mappings, those mappings may not be updated to match the larger, merged request. That results in only part of the request being executed by the hardware, with bad effects on the data involved.
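The sequence of events in this failure can be sketched as a toy simulation. Everything here is an illustrative assumption rather than kernel code: `Request`, `driver_issue`, and `merge` are hypothetical names, and `mapped_blocks` stands in for whatever setup (such as scatter/gather DMA mappings) a real driver would perform.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Request:
    start: int                           # first block of the request
    nblocks: int                         # number of blocks requested
    mapped_blocks: Optional[int] = None  # driver-side setup; None until prepared


def driver_issue(req: Request, resources_available: bool) -> Optional[int]:
    """Try to execute a request; return blocks transferred, or None on soft failure."""
    if req.mapped_blocks is None:
        # Setup is done once, on the first attempt, and then reused.
        req.mapped_blocks = req.nblocks
    if not resources_available:
        return None                      # soft failure: caller requeues the request
    return req.mapped_blocks             # hardware transfers only what was mapped


def merge(req: Request, other: Request) -> None:
    """Merge an adjacent request into req, as may happen while req sits in the queue."""
    assert req.start + req.nblocks == other.start
    req.nblocks += other.nblocks
    # The bug being modeled: mapped_blocks is NOT refreshed here.


def demo() -> tuple:
    req = Request(start=0, nblocks=8)
    first = driver_issue(req, resources_available=False)  # soft failure, setup kept
    merge(req, Request(start=8, nblocks=8))               # request grows to 16 blocks
    second = driver_issue(req, resources_available=True)  # reuses the stale setup
    return first, req.nblocks, second
```

Calling `demo()` in this model shows the first attempt failing softly, the request growing to 16 blocks through the merge, and the second attempt transferring only the 8 blocks that were mapped before the merge.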

The problem was partially fixed by one commit, but a second was required to address a new problem introduced by the first. Both fixes were included in the 4.20-rc6 release; they also found their way into 4.19.8. The original patch was never selected for backporting to older stable kernels, so those were not affected.