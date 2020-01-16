Kernel: LWN Article (Outside Paywall Today) and Remembering the LAN (Way Before Wireguard)
-
process_madvise(), pidfd capabilities, and the revenge of the PIDs
Once upon a time, there were few ways for one process to operate upon another after its creation; sending signals and ptrace() were about it. In recent years, interest in providing ways for processes to control others has been on the increase, and the kernel's process-management API has been expanded accordingly. Along these lines, the process_madvise() system call has been proposed as a way for one process to influence how memory management is done in another. There is a new process_madvise() series which is interesting in its own right, but this series has also raised a couple of questions about how process management should be improved in general.
The existing madvise() system call allows a process to make suggestions to the kernel about how its address space should be managed. The 5.4 kernel saw a couple of new types of advice that could be provided with madvise(): MADV_COLD and MADV_PAGEOUT. The former requests that the kernel place the indicated range of pages onto the inactive list, essentially saying that they have not been used in a long time. Those pages will thus be among the first considered for reclaim if the kernel needs memory for other purposes. MADV_PAGEOUT, instead, is a stronger statement that the indicated pages are no longer needed; it will cause them to be reclaimed immediately.
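As a rough illustration of the two advice types described above, the following Python sketch applies them to an anonymous mapping. It assumes Linux 5.4 or later and Python 3.8 or later. The numeric fallbacks (20 and 21) are the values from the Linux UAPI headers, used here because Python's mmap module may not export these constants. The helper names advise() and demo() are made up for this example.

```python
import mmap

# Linux UAPI values; assumed here because the mmap module may not define them.
MADV_COLD = getattr(mmap, "MADV_COLD", 20)
MADV_PAGEOUT = getattr(mmap, "MADV_PAGEOUT", 21)


def advise(region, advice):
    """Apply one piece of advice to the whole mapping; return True on success."""
    if not hasattr(region, "madvise"):  # Python < 3.8 or no madvise() support
        return False
    try:
        region.madvise(advice)  # start and length default to the full mapping
        return True
    except OSError:  # for example EINVAL on kernels older than 5.4
        return False


def demo(size=16 * mmap.PAGESIZE):
    region = mmap.mmap(-1, size)  # anonymous mapping
    region[:5] = b"hello"
    cold_ok = advise(region, MADV_COLD)        # hint: reclaim these pages first
    pageout_ok = advise(region, MADV_PAGEOUT)  # hint: reclaim these pages now
    data = bytes(region[:5])  # both hints are non-destructive; data is intact
    region.close()
    return cold_ok, pageout_ok, data


if __name__ == "__main__":
    print(demo())
```

Both requests are only hints about reclaim order; the contents of the mapping remain readable afterward, which is why the sketch can read its data back.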
These new requests are useful for processes that know what their future access patterns will be. In certain environments (Android, in particular), though, processes lack that knowledge, while the management system does know when certain memory ranges are no longer needed. The bulk of a process's address space could be marked as MADV_COLD when that process is moved out of the foreground, for example. In such settings, letting one process call madvise() on behalf of another helps the system as a whole make the best use of its memory resources. That is the purpose behind the process_madvise() proposal.
-
KRSI and proprietary BPF programs
The "kernel runtime security instrumentation" (or KRSI) patch set enables the attachment of BPF programs to every security hook in the kernel; LWN covered this work in December. That article focused on ABI issues, but it deferred another potential problem to our 2020 predictions: the possibility that vendors could start shipping proprietary BPF programs for use with frameworks like KRSI. Other developers did pick up on the possibility that KRSI could be abused this way, though, leading to a discussion on whether KRSI should continue to allow the loading of BPF programs that do not carry a GPL-compatible license.
It may be surprising to some that the kernel, while allowing BPF programs to declare their license, is entirely happy to load programs that have a proprietary license. This behavior, though, is consistent with how the kernel handles loadable modules: any module can be loaded, but modules without a GPL-compatible license will not have access to many kernel symbols (any that are exported with EXPORT_SYMBOL_GPL()). BPF programs interact with the kernel through special "helper functions", each of which must be explicitly exported; these, too, can have a "GPL only" marking on them. In current kernels, about 25% of the defined helpers are restricted to GPL-licensed code.
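The rule described above can be modeled in a few lines. This is a toy sketch, not kernel code: the helper names and their "GPL only" flags are hypothetical, and the set of license strings is an illustrative subset of what the kernel treats as GPL-compatible.

```python
from dataclasses import dataclass, field

# Illustrative subset of license strings treated as GPL-compatible.
GPL_COMPATIBLE = {"GPL", "GPL v2", "Dual BSD/GPL", "Dual MIT/GPL"}

# Hypothetical helper table: helper name -> "GPL only" flag.
HELPERS = {
    "helper_map_lookup": False,
    "helper_trace_print": True,
}


@dataclass
class Program:
    license: str
    calls: list = field(default_factory=list)


def check_load(prog):
    """Return (ok, reason). Any license may load, but GPL-only helpers
    are refused to programs without a GPL-compatible license."""
    gpl_ok = prog.license in GPL_COMPATIBLE
    for name in prog.calls:
        if name not in HELPERS:
            return False, f"unknown helper {name}"
        if HELPERS[name] and not gpl_ok:
            return False, f"{name} is GPL-only"
    return True, "loaded"


if __name__ == "__main__":
    print(check_load(Program("Proprietary", ["helper_map_lookup"])))
    print(check_load(Program("Proprietary", ["helper_trace_print"])))
    print(check_load(Program("GPL", ["helper_trace_print"])))
```

The point of the model is that the license declaration does not gate loading by itself; it only limits which helpers the program may call.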
-
Scheduling for the Android display pipeline
The default CPU-frequency governor used by Android is schedutil, which relies on the CPU utilization of the runnable tasks to select the frequency of the CPU they execute on: the higher the utilization, the higher the frequency of the CPU when they are runnable. This governor fits so well with the needs of mobile Android devices that, in Android, it also takes care of the SCHED_RT tasks, which are normally run at the maximum frequency in mainline Linux kernels.
Schedutil chooses the lowest frequency sufficient not to overload the system, based on the measurement of the system utilization. This solution works well when tasks are independent and are able to run in parallel. But, whenever there is a dependency — tasks that are blocked on the completion of others — the single-task utilization accounting mechanism is no longer sufficient to define the requirements of the whole task set.
For example, consider a scenario in which schedutil sees that RenderThread only requires 50% of a CPU's capacity, so it sets the CPU frequency to 50% of the maximum. But RenderThread cannot run until the UI thread has done its work (the two tasks cannot run in parallel), so it misses its deadline.
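The arithmetic behind this kind of deadline miss can be worked through with a toy model. All numbers are assumptions chosen for illustration: a 16 ms frame, and a UI thread and RenderThread that each need 8 ms of work at the maximum frequency. The model ignores any headroom margin the real governor applies and simply scales run time inversely with frequency.

```python
FRAME_MS = 16.0  # assumed frame deadline


def pick_frequency(utilization):
    """Lowest frequency (as a fraction of the maximum) that covers one
    task's utilization when that task is considered in isolation."""
    return min(1.0, utilization)


def pipeline_finish(stages_ms_at_max, freq):
    """Finish time when the stages run strictly one after another,
    each slowed down in proportion to the chosen frequency."""
    return sum(work / freq for work in stages_ms_at_max)


def demo():
    ui_work = render_work = 8.0          # ms of work at maximum frequency
    utilization = render_work / FRAME_MS  # 0.5 for each task on its own
    freq = pick_frequency(utilization)
    return {
        "freq": freq,
        "finish_scaled": pipeline_finish([ui_work, render_work], freq),
        "finish_max": pipeline_finish([ui_work, render_work], 1.0),
    }


if __name__ == "__main__":
    result = demo()
    print(result)
    print("deadline met at scaled frequency:", result["finish_scaled"] <= FRAME_MS)
```

Under these assumptions each task looks like a 50% load, yet the serialized pair needs the full frame at maximum frequency, so halving the frequency doubles the pipeline's length and the frame is late.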
-
Control-flow integrity for the kernel
Control-flow integrity (CFI) is a technique used to reduce the ability to redirect the execution of a program's code in attacker-specified ways. The Clang compiler has some features that can assist in maintaining control-flow integrity, which have been applied to the Android kernel. Kees Cook gave a talk about CFI for the Linux kernel at the recently concluded linux.conf.au in Gold Coast, Australia.
Cook said that he thinks about CFI as a way to reduce the attack, or exploit, surface of the kernel. Most compromises of the kernel involve an attacker gaining execution control, typically using some kind of write flaw to change system memory. These write flaws come in many flavors, generally with some restrictions (e.g. can only write a single zero or only a set of fixed byte values), but in the worst case, they can be a "write anything anywhere at any time" flaw. The latter, thankfully, is relatively rare.
-
Remembering the LAN
We can have the LAN-like experience of the 90s back again, and we can add the best parts of the 21st century internet. A safe small space of people we trust, where we can program away from the prying eyes of the multi-billion-person internet. Where the outright villainous will be kept at bay by good identity services and good crypto.
The broader concept of virtualizing networks has existed forever: the Virtual Private Network. New protocols make VPNs better than before; WireGuard is pioneering easy and efficient tunneling between peers. Marry the VPN to identity, and make it work anywhere, and you can have a virtual 90s-style LAN made up of all your 21st century devices. Let the internet be the dumb pipe; let your endpoints determine who they will talk to based on the person at the other end.
-
