Linux Kernel: Four LWN Articles Outside the Paywall Today
-
Readahead: the documentation I wanted to read [LWN.net]
The readahead code in the Linux kernel is nominally responsible for reading data that has not yet been explicitly requested from storage, with the idea that it might be needed soon. The code is stable, functional, widely used, and uncontroversial, so it is reasonable to expect the code to be of high quality, and largely this is true. Recently, I found the need to document this code, which naturally shone a rather different light on it. This work revealed minor problems with functionality and significant problems with naming.
My particular reason for wanting documentation probably colors my view of the code so I'll start there. Once upon a time, Linux had a strong concept of "congestion" as it applied to I/O paths. If the queue of requests to some device grew too large, the backing device would be marked as "congested" and certain optional I/O requests would be skipped or delayed, particularly writeback and readahead. As time has passed, so too (apparently) has the need for congestion management. Maybe this is because many I/O devices are now faster than our CPUs but, whatever the reason, the block layer no longer tracks congestion and only a few virtual "backing devices" continue this outdated practice.
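The congestion idea described above can be illustrated with a toy model. This is only a sketch for intuition, not kernel code: the class name, the single `congestion_threshold` knob, and the `optional` flag are all hypothetical, and the real block layer used its own, more involved thresholds.

```python
from collections import deque


class ToyBackingDevice:
    """Toy model of a backing device with a congestion flag (not kernel code)."""

    def __init__(self, congestion_threshold):
        # Hypothetical tunable: queue depth at which the device counts as congested.
        self.congestion_threshold = congestion_threshold
        self.queue = deque()

    @property
    def congested(self):
        return len(self.queue) >= self.congestion_threshold

    def submit(self, request, optional=False):
        """Queue a request; skip optional ones while the device is congested."""
        if optional and self.congested:
            return False  # readahead or background writeback is simply skipped
        self.queue.append(request)
        return True

    def complete_one(self):
        """Simulate the device finishing the oldest queued request."""
        return self.queue.popleft() if self.queue else None


bdi = ToyBackingDevice(congestion_threshold=2)
results = [
    bdi.submit("read block 0"),
    bdi.submit("read block 1"),
    bdi.submit("readahead block 2", optional=True),  # skipped: device is congested
]
bdi.complete_one()
results.append(bdi.submit("readahead block 2", optional=True))  # accepted now
print(results)
```

Mandatory requests are always queued in this model; only the optional ones are affected by the flag, which is the behavior the paragraph above attributes to readahead and writeback.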
In Linux 5.16, the only backing device that gets marked as "read congested" is the virtual device used for FUSE filesystems. As part of a project to remove all remnants of congestion tracking, I proposed that there was really nothing special about FUSE, and that it should accept all readahead requests just like everyone else. Miklos Szeredi, the maintainer of FUSE, found my reasoning to be unsatisfactory, and who could blame him? If FUSE doesn't want readahead requests, it shouldn't have to accept them. Trying to understand how FUSE could safely say "no" to readahead, without having to maintain the congestion-tracking functionality in common code, started me on the path to understanding readahead, once it was explained to me that it wasn't as simple as just changing the "readahead" callback in FUSE to return zero.
-
Negative dentries, 20 years later [LWN.net]
Filesystems and the virtual filesystem layer are in the business of managing files that actually exist, but the Linux "dentry cache", which remembers the results of file-name lookups, also keeps track of files that don't exist. This cache of "negative dentries" plays an important role in the overall performance of the system but, if it is allowed to grow too large, its role can become negative in its own right. As the 2022 Linux Storage, Filesystem, and Memory-Management Summit (LSFMM) approaches, the subject of negative dentries has come up yet again; whether one can be positive about the prospects for a resolution this time around remains unclear.
The kernel's dentry cache saves the results of looking up a file in a filesystem. Should the need arise to look up the same file again, the cached result can be used, avoiding a trip through the underlying filesystem and accesses to the storage device. Repeated file-name lookups are common — consider /usr/bin/bash or ~/.nethackrc — so this is an important optimization to make.
The importance of remembering failed lookups in negative dentries may be less obvious at the outset. As it happens, repeated attempts to look up a nonexistent file are also common; an example would be the shell's process of working through the search path every time a user types "vi" (Emacs users start the editor once and never leave its cozy confines thereafter, so they don't benefit in the same way). Even more common are failed lookups created by the program loader searching for shared libraries or a compiler looking for include files. One is often advised to "fail fast" in this society; when it comes to lookups of files that don't exist, that can indeed be good advice.
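The search-path example can be made concrete with a small user-space sketch. It assumes a simplified shell that tries each directory in order and, unlike a real shell, ignores the executable bit; the function names and directory names are made up for illustration. Every candidate that fails to resolve is the kind of lookup the kernel can remember as a negative dentry.

```python
import os
import tempfile


def search_path(command, path_dirs):
    """Return (full_path_or_None, failed_candidates), trying each directory in order."""
    failed = []
    for directory in path_dirs:
        candidate = os.path.join(directory, command)
        if os.path.isfile(candidate):
            return candidate, failed
        failed.append(candidate)  # this name was looked up and did not resolve
    return None, failed


def demo():
    # Build three throwaway directories; only the last one contains "vi".
    with tempfile.TemporaryDirectory() as root:
        dirs = [os.path.join(root, name) for name in ("local-bin", "usr-bin", "bin")]
        for d in dirs:
            os.mkdir(d)
        with open(os.path.join(dirs[-1], "vi"), "w") as f:
            f.write("#!/bin/sh\n")
        found, failed = search_path("vi", dirs)
        return os.path.basename(os.path.dirname(found)), len(failed)


print(demo())
```

Here two lookups fail before the third succeeds, and the same two failures would be repeated on every invocation if nothing cached them.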
So negative dentries are a good thing but, as we all know, it is possible to have too much of a good thing. While normal dentries are limited by the number of files that actually exist, there are few limits to the number of nonexistent files. As a result, it is easy for a malicious (or simply unaware) application to create negative dentries in huge numbers. If memory is tight, the memory-management subsystem will eventually work to push some of these negative dentries out. In the absence of memory pressure, though, negative dentries can accumulate indefinitely, leaving a large mess to clean up when memory does inevitably run out.
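One way to watch this accumulation from user space is the `/proc/sys/fs/dentry-state` file. The sketch below assumes the six-field layout given in the kernel's sysctl documentation, where the fifth field (`nr_negative`) counts unused negative dentries; older kernels may not report that counter. The helper names and the sample line are made up for illustration.

```python
DENTRY_STATE_PATH = "/proc/sys/fs/dentry-state"
# Field order as documented for struct dentry_stat_t (an assumption of this sketch).
FIELDS = ("nr_dentry", "nr_unused", "age_limit", "want_pages", "nr_negative", "dummy")


def parse_dentry_state(text):
    """Map the six whitespace-separated integers to their documented names."""
    values = [int(v) for v in text.split()]
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    return dict(zip(FIELDS, values))


def read_dentry_state(path=DENTRY_STATE_PATH):
    """Return the parsed counters, or None where the file is unavailable or odd."""
    try:
        with open(path) as f:
            return parse_dentry_state(f.read())
    except (OSError, ValueError):
        return None


# Made-up sample line, only to show the layout; real values vary by system.
sample = "81342\t61230\t45\t0\t12045\t0\n"
print(parse_dentry_state(sample)["nr_negative"])

state = read_dentry_state()
if state is not None:
    print("negative dentries on this system:", state["nr_negative"])
```

Sampling the counter before and after a burst of failed lookups gives a rough view of how quickly negative dentries pile up when memory is not under pressure.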
-
Private memory for KVM guests [LWN.net]
Cloud computing is a wonderful thing; it allows efficient use of computing systems and makes virtual machines instantly available at the click of a mouse or API call. But cloud computing can also be problematic; the security of virtual machines is dependent on the security of the host system. In most deployed systems, a host computer can dig through its guests' memory at will; users running guest systems have to just hope that doesn't happen. There are a number of solutions to that problem under development, including this KVM guest-private memory patch set by Chao Peng and others, but some open questions remain.
A KVM-based hypervisor runs as a user-space process on the host system. To provide a guest with memory, the hypervisor allocates that memory on the host, then uses various KVM ioctl() calls to map it into the guest's "physical" address space. But the hypervisor retains its mapping to the memory as well, with no constraints on how the memory can be accessed. Sometimes that access is necessary for communication between the guest and the hypervisor, but the guest would likely want to keep much of that memory to itself.
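The conventional flow described above, before any guest-private memory is involved, might be sketched as follows. The structure layout and ioctl numbers are my reading of `<linux/kvm.h>`, and the ioctl encoding assumes the generic Linux scheme used on x86 and arm64 (some architectures encode the direction bits differently). The helper names are hypothetical, and this does not show the interface added by the patch set.

```python
import ctypes
import fcntl
import mmap
import os
import struct

KVMIO = 0xAE  # ioctl "type" byte used by KVM

# struct kvm_userspace_memory_region: u32 slot, u32 flags,
# u64 guest_phys_addr, u64 memory_size, u64 userspace_addr
REGION_FORMAT = "=IIQQQ"


def ioc(direction, nr, size):
    """Encode an ioctl number with the generic Linux layout (assumed here)."""
    return (direction << 30) | (size << 16) | (KVMIO << 8) | nr


KVM_CREATE_VM = ioc(0, 0x01, 0)
KVM_SET_USER_MEMORY_REGION = ioc(1, 0x46, struct.calcsize(REGION_FORMAT))


def pack_region(slot, guest_phys_addr, memory_size, userspace_addr, flags=0):
    """Build the argument for KVM_SET_USER_MEMORY_REGION."""
    return struct.pack(REGION_FORMAT, slot, flags,
                       guest_phys_addr, memory_size, userspace_addr)


def map_guest_memory(size=0x1000, guest_phys_addr=0):
    """Create a VM and back one guest-physical range with host memory.

    Returns True on success, False if KVM cannot be used here.
    """
    try:
        kvm_fd = os.open("/dev/kvm", os.O_RDWR)
    except OSError:
        return False
    vm_fd = -1
    try:
        vm_fd = fcntl.ioctl(kvm_fd, KVM_CREATE_VM, 0)
        mem = mmap.mmap(-1, size)  # anonymous, page-aligned host allocation
        anchor = ctypes.c_char.from_buffer(mem)  # used only to learn the host address
        region = pack_region(0, guest_phys_addr, size, ctypes.addressof(anchor))
        fcntl.ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, region)
        mem[0] = 0xF4  # x86 HLT opcode, written through the host's own mapping
        return True
    except OSError:
        return False
    finally:
        if vm_fd >= 0:
            os.close(vm_fd)
        os.close(kvm_fd)


print(hex(KVM_CREATE_VM), hex(KVM_SET_USER_MEMORY_REGION))
```

The write to `mem[0]` after the region is registered is the point of the sketch: nothing in this flow stops the host process from reading or modifying memory it has handed to the guest.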
-
trusted_for() bounces off the merge window [LWN.net]
When last we looked in on the proposed trusted_for() system call, which would allow user-space interpreters and other tools to ask the kernel whether a file is "trusted" for execution, it looked like it was on track for the mainline. That was back in October 2020; the patch has been updated multiple times since then, made its way into linux-next, and was the subject of a pull request from Mickaël Salaün for the 5.18 merge window. But it seems that there will be more to the story of getting this functionality into the kernel: Linus Torvalds declined to pull trusted_for(), in part because he did not like the name, though there were other reasons as well. While he is not opposed to the functionality it would provide, he also had strong feelings that a new system call was not the right approach.
-
