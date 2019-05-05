LWN Articles About Linux Kernel (Outside Paywall Today)
The state of system observability with BPF
The 2019 version of the Linux Storage, Filesystem, and Memory-Management Summit opened with a plenary talk by Brendan Gregg on observing the state of Linux systems using BPF. It is, he said, an exciting time; the BPF-based "superpowers" being added to the kernel are growing in capability and maturity. It is now possible to ask many questions about what is happening in a production Linux system without the need for kernel modifications or even basic debugging information.
Gregg started with a demonstration tool that he had just written: its immediate manifestation was in the creation of a high-pitched tone that varied in frequency as he walked around the lectern. It was, it turns out, a BPF-based tool that extracts the signal strength of the laptop's WiFi connection from the kernel and creates a noise in response. As he interfered with that signal with his body, the strength (and thus the pitch of the tone) varied. By tethering the laptop to his phone, he used the tool to measure how close he was to the laptop. It may not be the most practical tool, but it did demonstrate how BPF can be used to do unexpected things.
Gregg works at Netflix, a company that typically operates about 150,000 server instances. Naturally, Netflix cares a lot about performance; that leads to a desire for observability tools that can help to pinpoint the source of performance problems. But the value of good tools goes beyond just performance tuning.
Containers and address space separation
James Bottomley began his talk at the 2019 Linux Storage, Filesystem, and Memory-Management Summit (LSFMM) by noting that the main opposition to his ideas was not present at the summit, which was likely to mean the ideas got a much easier reception than they would have otherwise. In particular, Peter Zijlstra and Ingo Molnar expressed some strong reservations about the work that Bottomley's colleague Mike Rapoport posted recently; none of those three were in attendance at LSFMM. The idea is to use address spaces to reduce the attack surface available to virtual machines (VMs) and containers such that kernel bugs of various sorts have less reach on multi-tenant systems.
Bottomley has been working with Rapoport on the idea for the container use case, but there are others, from Google and Oracle, who are trying to solve the same problems for VMs. Address spaces are the oldest and most secure mechanism for keeping tenants separate from one another, he said. Separating processes into their own address spaces is what was used to support multi-user systems, so there is around 50 years of history there. Part of the reason to extend the idea for VMs and containers is that address spaces have proven to work well as a security measure.
Some 5.1 development statistics
The release of the 5.1-rc6 kernel prepatch on April 21 indicates that the 5.1 development cycle is getting close to its conclusion. So naturally the time has come to put together some statistics describing where the changes merged for 5.1 came from. It is, for the most part, a fairly typical development cycle.
As of this writing, 12,749 non-merge changesets have been pulled into the mainline repository for the 5.1 release. That is slightly more than seen in 5.0, but still a bit lower than the other kernels released in the last few years. There were nearly 545,000 lines of code added by those changesets and 289,000 lines removed, for a net growth of 256,000 lines; this is not one of those rare development cycles where the kernel gets smaller. That work was contributed by 1,707 developers, 245 of whom made their first contribution in the 5.1 cycle.
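The arithmetic behind those figures can be checked with a few lines of Python. This is only a sketch using the numbers quoted above; the line counts are rounded in the text, and the variable names are chosen here for illustration.

```python
# Figures quoted in the paragraph above for the 5.1 cycle.
# The line counts are rounded in the text, so the results are approximate.
changesets = 12_749
lines_added = 545_000
lines_removed = 289_000
developers = 1_707
first_time_contributors = 245

# Net growth is simply lines added minus lines removed.
net_growth = lines_added - lines_removed

# Share of contributors who made their first contribution in this cycle.
first_time_share = first_time_contributors / developers

# Average number of non-merge changesets per contributing developer.
changesets_per_developer = changesets / developers

print(f"net growth: {net_growth:,} lines")
print(f"first-time contributors: {first_time_share:.1%}")
print(f"changesets per developer: {changesets_per_developer:.1f}")
```

The derived ratios (first-time share, changesets per developer) are not stated in the text; they follow directly from the quoted totals.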
Bounce buffers for untrusted devices
The recently discovered vulnerability in Thunderbolt has restarted discussions about protecting the kernel against untrusted, hotpluggable hardware. That vulnerability, known as Thunderclap, allows a hostile external device to exploit Input-Output Memory Management Unit (IOMMU) mapping limitations and access system memory that it was never meant to reach. Thunderclap can be exploited by USB-C-connected devices; while we have seen USB attacks in the past, this vulnerability is different in that PCI devices, often considered trusted, can be a source of attacks too. One way of stopping those attacks would be to make sure that the IOMMU is used correctly and restricts the device to accessing the memory that was allocated for it. Lu Baolu has posted an implementation of that approach in the form of bounce buffers for untrusted devices.
Toward a reverse splice()
The splice() system call is, at its core, a write operation; it attempts to implement zero-copy I/O by moving pages from a pipe to a file. At the 2019 Linux Storage, Filesystem, and Memory-Management Summit, Miklos Szeredi described a nascent idea for rsplice() — a "reverse splice" system call. There were not a lot of definitive outcomes from this discussion, but one thing was clear: rsplice() needs a much better description (and some code posted) before the development community can begin to form an opinion on it.
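For readers who have not used the existing call, the pipe-to-file direction described above might look like the following minimal sketch. It assumes Linux and Python 3.10 or later, where os.splice() is available; the helper name splice_pipe_to_file() is hypothetical, and whether the kernel actually moves pages rather than copying them depends on the filesystem and kernel version. Nothing here says anything about how rsplice() would work.

```python
import os


def splice_pipe_to_file(payload: bytes, path: str) -> int:
    """Feed payload into a pipe, then splice() it from the pipe into path.

    Hypothetical helper for illustration only. Keep payload smaller than
    the pipe buffer (64KB by default on Linux), since the pipe is filled
    before anything drains it.
    """
    read_fd, write_fd = os.pipe()
    out_fd = -1
    try:
        out_fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)

        # Fill the pipe, then close the write end so that the reader
        # eventually sees end-of-file.
        os.write(write_fd, payload)
        os.close(write_fd)
        write_fd = -1

        total = 0
        while True:
            # One of the two descriptors must be a pipe; here it is the source.
            moved = os.splice(read_fd, out_fd, 65536)
            if moved == 0:  # pipe drained and write end closed
                break
            total += moved
        return total
    finally:
        for fd in (read_fd, write_fd, out_fd):
            if fd >= 0:
                os.close(fd)
```

Calling splice_pipe_to_file(b"hello splice", "/tmp/out.bin") should return the number of bytes that ended up in the file; the path is just an example.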
Memory encryption issues
"People think that memory encryption sounds really cool; it will make my system more secure so I want it". At least, that is how Dave Hansen characterized the situation at the beginning of a session on the topic during the memory-management track at the 2019 Linux Storage, Filesystem, and Memory-Management Summit. This session, also led by Kirill Shutemov, covered a number of aspects of the memory-encryption problem on Intel processors and beyond. One clear outcome of the discussion was also raised by Hansen at the beginning: users of memory encryption need to think hard about where that extra security is actually coming from.
Android memory management
The Android system is designed to provide a responsive user experience on systems that, in a relative sense at least, have limited amounts of CPU and memory. Doing so requires a number of techniques, including regular use of a low-memory process killer, that are not seen elsewhere. In a memory-management-track session at the 2019 Linux Storage, Filesystem, and Memory-Management Summit, Suren Baghdasaryan covered a number of issues related to how Android ensures that interactive processes have enough memory to get their jobs done.
Baghdasaryan started by noting that the recently added pressure-stall information feature, which was not originally developed for Android at all, has proved to be quite useful. It gives the Android runtime more accurate information about memory pressure, which can be used to better manage the set of running processes. Overall, he said, the goal of Android memory management is to ensure that interactive processes work as well as possible while minimizing the number of out-of-memory kills needed to do that.
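As a rough illustration of what pressure-stall information looks like to user space, the sketch below parses the text format that the kernel exposes in /proc/pressure/memory ("some" and "full" lines with avg10, avg60, avg300, and total fields). The function names and the sample values are made up for this example; how the Android runtime actually consumes this data is not covered here.

```python
def parse_psi(text: str) -> dict:
    """Parse pressure-stall text into {"some": {...}, "full": {...}}.

    Each line looks like:
        some avg10=0.00 avg60=0.00 avg300=0.00 total=0
    The avg* fields are percentages of time stalled over 10, 60, and 300
    seconds; total is the cumulative stall time in microseconds.
    """
    result = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        values = {}
        for field in fields:
            key, value = field.split("=", 1)
            values[key] = float(value)
        result[kind] = values
    return result


def read_memory_pressure(path: str = "/proc/pressure/memory") -> dict:
    """Read and parse the memory pressure file (requires a kernel with PSI)."""
    with open(path) as f:
        return parse_psi(f.read())


# Illustrative sample only; these numbers do not come from a real system.
SAMPLE = (
    "some avg10=1.25 avg60=0.40 avg300=0.10 total=123456\n"
    "full avg10=0.50 avg60=0.10 avg300=0.02 total=45678\n"
)

if __name__ == "__main__":
    pressure = parse_psi(SAMPLE)
    print(pressure["some"]["avg10"], pressure["full"]["avg10"])
```

A process manager could poll values like these and act when the short-term averages rise; the thresholds would be a policy decision that this sketch does not attempt to model.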
