Date: 2020-06-12
Linux Foundation and Linux Kernel Coverage at LWN
ELISA Project Momentum Continues
As ELISA (Enabling Linux in Safety Applications) nears its year-and-a-half anniversary, the project continues to hit key milestones that show its value in delivering foundational support for safety-critical applications. ELISA, formed in February 2019 and hosted by the Linux Foundation, aims to create a shared set of tools and processes to help companies build and certify Linux-based safety-critical applications and systems whose failure could result in loss of human life, significant property damage, or environmental damage.
As Linux continues to be a key component in safety applications, autonomous vehicles, medical devices, and even rockets, ELISA will make it easier for companies to build and expand these safety-critical systems. As a show of support for this business-critical initiative, several new members have joined the ELISA project: Premier Member Intel/Mobileye; General Members ADIT, Elektrobit, Mentor, SiFive, Suzuki, and Wind River; and Associate Members Automotive Grade Linux and Technical University of Applied Sciences Regensburg.
Cloud Native Computing Foundation Welcomes SPD Bank as Gold Member
As an active participant of the CNCF Financial Services User Group, SPD Bank is dedicated to working with the cloud native ecosystem to address the security, regulatory and compliance-related questions that financial institutions face when using cloud native platforms. Since 2017, it has introduced microservices and cloud native technologies, including Docker and Kubernetes, into its infrastructure. The bank's container management platform now runs over 3,000 containers and nearly 60 applications, including mission-critical applications for mobile banking and online payments.
Seccomp and deep argument inspection
Seccomp filtering (or "seccomp mode 2") allows a process to filter which system calls can be made by it or its threads—it can be used to "sandbox" a program such that it cannot make calls that it shouldn't. Those filters use the "classic" BPF (cBPF) language to specify which system calls and argument values to allow or disallow. The seccomp() system call is used to enable filtering mode or to load a cBPF filtering program. Those programs only have access to the values of the arguments passed to the system call; if those arguments are pointers, they cannot be dereferenced by seccomp, which means that accepting or rejecting the system call cannot depend on, for example, values in structures that are passed to system calls via pointers—or even string values.
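To make the "values only" limitation concrete, here is a minimal sketch in Python. The structure mirrors the kernel's struct seccomp_data, but the policy function, its name, and the sample values are hypothetical; real filters are cBPF programs, not Python. The system call number 257 for openat() is the x86_64 value and is an assumption about the target architecture.

```python
import ctypes

class SeccompData(ctypes.Structure):
    """Mirror of the layout of the kernel's struct seccomp_data."""
    _fields_ = [
        ("nr", ctypes.c_int),                     # system call number
        ("arch", ctypes.c_uint32),                # AUDIT_ARCH_* value
        ("instruction_pointer", ctypes.c_uint64),
        ("args", ctypes.c_uint64 * 6),            # raw argument values
    ]

AUDIT_ARCH_X86_64 = 0xC000003E
SYS_OPENAT_X86_64 = 257      # assumption: x86_64 numbering
O_ACCMODE = 0o3              # mask for the access-mode bits
O_RDONLY, O_WRONLY = 0, 1

def toy_filter(data):
    """Hypothetical policy: openat() may only open files read-only.

    The flags in args[2] are a plain value, so they can be checked.
    The pathname in args[1] is only an address; a filter cannot follow it.
    """
    if data.nr == SYS_OPENAT_X86_64 and (data.args[2] & O_ACCMODE) != O_RDONLY:
        return "deny"
    return "allow"

def make_openat(path_buffer, flags):
    """Build the view a filter would get for openat(3, path, flags)."""
    data = SeccompData(nr=SYS_OPENAT_X86_64, arch=AUDIT_ARCH_X86_64)
    data.args[0] = 3                              # hypothetical directory fd
    data.args[1] = ctypes.addressof(path_buffer)  # pointer value only
    data.args[2] = flags
    return data

path = ctypes.create_string_buffer(b"/etc/passwd")
read_call = make_openat(path, O_RDONLY)
write_call = make_openat(path, O_WRONLY)
print(toy_filter(read_call), toy_filter(write_call))
```

Both calls carry the same number in args[1], and nothing in the structure says what string lives at that address; that is the gap that deep argument inspection is meant to address.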
Seccomp cannot dereference pointers because doing so would open a time-of-check-to-time-of-use (TOCTTOU) race condition, where user space can change the value being pointed to between the time that the kernel checks it and the time that the value gets used. But certain system calls, especially newer ones like clone3() and openat2(), have some important arguments passed in structures via pointers. These new system calls are designed with an eye toward easily adding new arguments and flags by redefining the structure that gets passed; in an email on the topic, Kees Cook called these "extensible argument" (or EA) system calls.
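One way to picture how a structure can be redefined without breaking callers is the size-based convention sketched below. It is loosely modeled on the kernel's copy_struct_from_user() helper, which the text above does not discuss, so treat the details as an assumption. The three 64-bit fields follow struct open_how (flags, mode, resolve); the 32-byte "future" layout and all function names are hypothetical.

```python
import struct

OPEN_HOW_FMT = "=QQQ"                      # flags, mode, resolve
OPEN_HOW_SIZE = struct.calcsize(OPEN_HOW_FMT)

class StructTooBig(Exception):
    """Caller used fields that this (simulated) kernel does not know."""

def copy_extensible_struct(user_bytes, ksize):
    """Copy a caller's structure into a buffer of the kernel's size.

    Shorter structures from older callers are zero-extended; longer ones
    from newer callers are accepted only if the unknown tail is all zero.
    """
    usize = len(user_bytes)
    if usize > ksize:
        if any(user_bytes[ksize:]):
            raise StructTooBig("non-zero data in fields unknown to the kernel")
        return bytes(user_bytes[:ksize])
    return bytes(user_bytes) + b"\x00" * (ksize - usize)

# A caller built against today's 24-byte layout, running on a hypothetical
# future kernel whose structure has grown to 32 bytes.
old_caller = struct.pack(OPEN_HOW_FMT, 0, 0, 0x04)
grown = copy_extensible_struct(old_caller, 32)
print(struct.unpack("=QQQQ", grown))

# A hypothetical newer caller passing 32 bytes to a 24-byte kernel.
new_caller_unused = struct.pack("=QQQQ", 0, 0, 0x04, 0)
new_caller_used = struct.pack("=QQQQ", 0, 0, 0x04, 1)
print(struct.unpack(OPEN_HOW_FMT,
                    copy_extensible_struct(new_caller_unused, OPEN_HOW_SIZE)))
```

Passing new_caller_used to the 24-byte kernel raises StructTooBig, since silently dropping a field the caller set would change the meaning of the request.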
It does not make sense for seccomp to provide a mechanism to inspect the pointer arguments of every system call, he said: "[...] the grudging consensus was reached that having seccomp do this for ALL syscalls was likely going to be extremely disruptive for very little gain". But for the EA system calls (or perhaps only a subset of those), seccomp could copy the structure pointed to and make it available to the BPF program via its struct seccomp_data. That would mean that seccomp would need to change to perform that copy, which would require a copy_from_user() call, and affected system calls would need to be seccomp-aware so that they can use the cached copy if seccomp creates one.
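The value of a cached copy can be shown with a small, deterministic simulation. Everything here is a stand-in: a bytearray plays the role of user-space memory, bytes() plays the role of copy_from_user(), and the "attacker" callback runs at exactly the point where a racing thread would. The function names and the read-only policy are hypothetical.

```python
import struct

OPEN_HOW_FMT = "=QQQ"     # flags, mode, resolve, as in struct open_how
O_ACCMODE = 0o3
O_RDONLY, O_RDWR = 0, 2

def filter_allows(how_bytes):
    """Hypothetical filter policy: only read-only opens are permitted."""
    flags, _mode, _resolve = struct.unpack(OPEN_HOW_FMT, bytes(how_bytes))
    return (flags & O_ACCMODE) == O_RDONLY

def flip_to_read_write(user_mem):
    """Simulated racing thread: rewrite the flags field in place."""
    struct.pack_into("=Q", user_mem, 0, O_RDWR)

def racy_syscall(user_mem, attacker):
    """Check user memory, then read it again when doing the work."""
    if not filter_allows(user_mem):
        return None
    attacker(user_mem)                       # runs between check and use
    return struct.unpack(OPEN_HOW_FMT, bytes(user_mem))[0]

def cached_syscall(user_mem, attacker):
    """Copy once; the filter and the system call share the same snapshot."""
    snapshot = bytes(user_mem)               # stand-in for copy_from_user()
    if not filter_allows(snapshot):
        return None
    attacker(user_mem)                       # too late to matter
    return struct.unpack(OPEN_HOW_FMT, snapshot)[0]

def fresh_request():
    return bytearray(struct.pack(OPEN_HOW_FMT, O_RDONLY, 0, 0))

racy_flags = racy_syscall(fresh_request(), flip_to_read_write)
cached_flags = cached_syscall(fresh_request(), flip_to_read_write)
print(racy_flags, cached_flags)
```

In the racy variant the flags that get used differ from the flags that were checked; in the cached variant they cannot, which is why the affected system calls would need to consume the copy that seccomp made rather than reading user memory a second time.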
5.8 Merge window, part 1
Just over 7,500 non-merge changesets have been pulled into the mainline repository since the opening of the 5.8 merge window — not a small amount of work for just four days. The early pulls are dominated by the networking and graphics trees, but there is a lot of other material in there as well. Read on for a summary of what entered the kernel in the first part of this development cycle.
A crop of new capabilities
The first of the new capabilities is CAP_PERFMON, which was covered in detail here last February. With this capability, a user can perform performance monitoring, attach BPF programs to tracepoints, and carry out other related actions. In current kernels, the catch-all CAP_SYS_ADMIN capability is required for this sort of performance monitoring; going forward, users can be given more restricted access. Of course, a process with CAP_SYS_ADMIN will still be able to do performance monitoring as well; it would be nice to remove that power from CAP_SYS_ADMIN, but doing so would likely break existing systems.
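As a rough illustration of that rule, the sketch below decodes an effective-capability mask of the kind shown on the CapEff line of /proc/<pid>/status and asks whether performance monitoring would be allowed. The capability numbers are the ones defined in the kernel's capability header; the helper names are hypothetical, and the check is a simplification of what the kernel actually does.

```python
CAP_SYS_ADMIN = 21
CAP_PERFMON = 38

def extract_cap_eff(status_text):
    """Return the hex mask from the CapEff line of a status file, or None."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return line.split()[1]
    return None

def has_cap(mask, cap):
    """True if the bit for capability number `cap` is set in `mask`."""
    return bool((mask >> cap) & 1)

def can_monitor_performance(cap_eff_hex):
    """CAP_PERFMON suffices; CAP_SYS_ADMIN is still accepted as well."""
    mask = int(cap_eff_hex, 16)
    return has_cap(mask, CAP_PERFMON) or has_cap(mask, CAP_SYS_ADMIN)

# Sample masks built by hand rather than read from a live system.
only_perfmon = format(1 << CAP_PERFMON, "016x")
only_sys_admin = format(1 << CAP_SYS_ADMIN, "016x")
sample_status = "Name:\tdemo\nCapEff:\t" + only_perfmon + "\n"

print(can_monitor_performance(extract_cap_eff(sample_status)))
print(can_monitor_performance(only_sys_admin))
print(can_monitor_performance("0000000000000000"))
```

On a Linux system the same helpers could be fed the contents of /proc/self/status instead of the hand-built sample.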
The other new capability, CAP_BPF, controls many of the actions that can be carried out with the bpf() system call. This capability has been the subject of a number of long and intense conversations over the last year; see this thread or this one for examples. The original idea was to provide a special device called /dev/bpf that would control access to BPF functionality, but that proposal did not get far. What was being provided was, in essence, a new capability, so capabilities seemed like a better solution.
The current CAP_BPF controls a number of BPF-specific operations, including the creation of BPF maps, use of a number of advanced BPF program features (bounded loops, cross-program function calls, etc.), access to BPF type format (BTF) data, and more. While the original plan was to not retain backward compatibility for processes holding CAP_SYS_ADMIN in an attempt to avoid what Alexei Starovoitov described as the "deprecated mess", the code that was actually merged does still recognize CAP_SYS_ADMIN.
One interesting aspect of CAP_BPF is that, on its own, it does not confer the power to do much that is useful. Crucially, it is still not possible to load most types of BPF programs with just CAP_BPF; to do that, a process must hold other capabilities relevant to the subsystem of interest. For example, programs for tracepoints, kprobes, or perf events can only be loaded if the process also holds CAP_PERFMON. Most program types related to networking (packet classifiers, XDP programs, etc.) require CAP_NET_ADMIN. If a user wants to load a program for a networking function that calls bpf_trace_printk(), then both CAP_NET_ADMIN and CAP_PERFMON are required. It is thus the combination of CAP_BPF with other capabilities that grants the ability to use BPF in specific ways.
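The combinations described above can be summarized as a small lookup. This is only a sketch of the rules as stated in the text, not of the kernel's actual checks: the program-type labels and function names are hypothetical, and because the text says only that "most" networking program types need CAP_NET_ADMIN, the two listed here are the examples it gives.

```python
TRACING_TYPES = {"tracepoint", "kprobe", "perf_event"}
NETWORKING_TYPES = {"packet_classifier", "xdp"}

def required_caps(prog_type, calls_trace_printk=False):
    """Capabilities needed to load a program without CAP_SYS_ADMIN."""
    caps = {"CAP_BPF"}
    if prog_type in TRACING_TYPES:
        caps.add("CAP_PERFMON")
    elif prog_type in NETWORKING_TYPES:
        caps.add("CAP_NET_ADMIN")
    else:
        raise ValueError("program type not covered by this sketch")
    if calls_trace_printk:
        # Using bpf_trace_printk() pulls in the tracing capability too.
        caps.add("CAP_PERFMON")
    return caps

def may_load(held_caps, prog_type, calls_trace_printk=False):
    """True if the held capabilities cover the program being loaded."""
    held = set(held_caps)
    if "CAP_SYS_ADMIN" in held:
        return True    # the merged code still recognizes CAP_SYS_ADMIN
    return required_caps(prog_type, calls_trace_printk) <= held

print(sorted(required_caps("xdp", calls_trace_printk=True)))
print(may_load({"CAP_BPF"}, "kprobe"))
print(may_load({"CAP_BPF", "CAP_PERFMON"}, "kprobe"))
```

The second call shows the point made above: CAP_BPF alone is not enough to load a tracing program.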
DMA-BUF cache handling: Off the DMA API map (part 1)
Recently, the DMA-BUF heaps interface was added to the 5.6 kernel. This interface is similar to ION, which has been used for years by Android vendors. However, in trying to move vendors to use DMA-BUF heaps, we have begun to see how the DMA API model doesn't fit well for modern mobile devices. Additionally, the lack of clear guidance on how to handle cache operations efficiently results in vendors using custom device-specific optimizations that aren't generic enough for an upstream solution. This article will describe the nature of the problem; the upcoming second installment will look at the path toward a solution.
The kernel's DMA APIs are all provided for the sharing of memory between the CPU and devices. The traditional DMA API has, in recent years, been joined by additional interfaces such as ION, DMA-BUF, and DMA-BUF heaps. But, as we will see, the problem of efficiently supporting memory sharing is not yet fully solved.
As an interface, ION was poorly specified, allowing applications to pass custom, opaque flags and arguments to vendor-specific, out-of-tree heap implementations. Additionally, since the users of these interfaces only ran on the vendors' devices with their custom kernel implementations, little attention was paid to trying to create useful generic interfaces. So multiple vendors might use the same heap ID for different purposes, or they might implement the same heap functionality using different heap IDs and flag options. Even worse, many vendors drastically changed the ION interface and implementation itself, so that there was little in common between vendor ION implementations other than their name and basic functionality. ION essentially became a playground for out-of-tree and device-specific vendor hacks.
