New LWN Articles About Kernel (Paywall Expired)
Date: 2019-07-05
Bcachefs gets closer
When it comes to new filesystems for Linux, patience is certainly a virtue. Btrfs took years to mature and, according to some, still isn't ready. Tux3 has kept users waiting since at least 2008; as of 2018 its developer still said that it was progressing. By these measures, bcachefs is a relative youngster, having been first announced a mere four years ago. Development of this next-generation filesystem continues, and bcachefs developer Kent Overstreet recently proclaimed his desire to "get this sucker merged", but there are still some obstacles to overcome.
Bcachefs has its origins in the bcache caching layer, though it is a separate project at this point. Like most of the newer filesystems out there, it uses a copy-on-write approach — data is copied to a new location when changed rather than overwritten. That enables the implementation of a number of interesting features; those intended for bcachefs include data checksumming, compression, multiple-device and RAID support, hierarchical storage management, snapshots, and, naturally, good performance. Work on bcachefs has apparently been slowed by the fact that there is relatively little interest in supporting this work; Over
5.3 Merge window, part 1
As of this writing, exactly 6,666 non-merge changesets have been pulled into the mainline repository for the 5.3 development cycle. The merge window has thus just begun, and there is already quite a bit in the way of interesting changes to look at. Read on for a list of what has been merged so far.
Reworking CFS load balancing
The Linux scheduler consists of several scheduling classes, the main ones being the Completely Fair Scheduler (CFS), the realtime (RT) scheduler, and the more recent deadline scheduler. The CFS class is the default and most commonly used one; it aims at sharing the running time of CPUs between tasks according to their priority. It was introduced in 2007 and has seen several major changes since. One of these major changes was the introduction of per-entity load tracking (PELT), which gives more details about the utilization of CPUs by tasks.
The load-balancing algorithm of the scheduler has the key responsibility of placing tasks on CPUs to optimize the overall throughput of the system. It periodically monitors the system and decides when tasks have to migrate to ensure a fair distribution of compute capacity and an optimal use of resources. But the algorithm hasn't really changed to take full advantage of the new PELT metrics: it still uses only load as the unit for migrating tasks, even when the root cause of an imbalance is linked not to load but, for example, to the available compute capacity of CPUs.
Frequency scale-invariance on x86_64
The utilization and load signals computed with the PELT algorithm are affected by the processor's clock frequency: loosely speaking, a task looks bigger if the machine is running slower. The remedy to this problem is called "frequency scale-invariance" and consists of normalizing all interesting quantities via the scaling factor current_frequency / max_frequency. At the time of this writing only the Arm architecture implements it; a session at the third OSPM summit in Pisa discussed a possible way forward for x86_64 systems.
The reader may recall that, in PELT, time is partitioned into segments and, for each of those, the on-CPU time of a task is recorded (in the case of utilization; for load, the quantity of interest is on-run-queue time). This implies that a given task would score a higher utilization and load if the CPU is running at a lower frequency: generally speaking, a slower-running CPU makes tasks run for longer, and a longer running time produces larger values of the PELT signals. This effect of the PELT formula is undesired, because it means that the utilization and load of tasks and run queues cannot be compared across CPUs or across time, since the operating frequency might be different.
The PELT framework offers a mechanism to rescale quantities and make them invariant to changes of frequency: some architecture-specific code has to implement the function arch_scale_freq_capacity() to return an appropriate scaling factor which, ideally, is going to be the ratio current_frequency / max_frequency; PELT will then use this factor where appropriate. As of today, only the Arm architecture implements arch_scale_freq_capacity(), so it is the only architecture that can claim to have frequency scale-invariant load and utilization.
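The effect of the scaling factor can be illustrated with a small sketch. This is not the kernel's PELT code: the function names, the single-period model, and the example numbers are all assumptions made for illustration. It only shows that multiplying the raw on-CPU fraction by current_frequency / max_frequency gives the same value for the same amount of work, whatever the operating frequency.

```python
# Illustrative sketch only: a single-period model of utilization, not PELT itself.
# All names and numbers here are hypothetical.

def raw_utilization(work_cycles: float, cur_freq_hz: float, period_s: float) -> float:
    """Fraction of the period spent on-CPU, with no frequency correction."""
    running_s = work_cycles / cur_freq_hz
    return min(running_s / period_s, 1.0)


def invariant_utilization(work_cycles: float, cur_freq_hz: float,
                          max_freq_hz: float, period_s: float) -> float:
    """Raw utilization rescaled by the ideal factor cur_freq / max_freq."""
    scale = cur_freq_hz / max_freq_hz
    return raw_utilization(work_cycles, cur_freq_hz, period_s) * scale


if __name__ == "__main__":
    work = 2_000_000      # cycles needed by the task in each period (made up)
    period = 0.004        # 4 ms observation period (made up)
    f_max = 2e9           # 2 GHz maximum frequency (made up)

    for f_cur in (2e9, 1e9):
        print(f_cur,
              raw_utilization(work, f_cur, period),
              invariant_utilization(work, f_cur, f_max, period))
```

In this toy model the raw value doubles when the frequency is halved, while the rescaled value stays put. Note that the clamp at 1.0 means the invariance breaks down once the CPU is saturated at the lower frequency, since the task can no longer complete its work within the period.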
How can we make schedutil even more effective?
Mobile platforms can feature some operating performance points (OPPs) that are more energy-efficient than others at lower frequencies. The inefficient low-frequency OPPs can therefore be avoided in normal conditions, leading to better latency at no cost. The power cost of OPPs does not increase linearly with frequency, which gives some opportunities for smarter decisions: if the frequency can be increased when doing so is beneficial and adds little to the power bill, why not do it?
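One way to make "inefficient" concrete is to compare the energy spent per cycle at each OPP. The sketch below is an assumption about how such a check could look, not how schedutil or any kernel code does it; the `Opp` type, the function names, and the table of frequencies and power figures are invented for the example.

```python
# Illustrative sketch only: flag OPPs that cost at least as much energy per
# cycle as some higher-frequency OPP. The table values are made up.
from typing import List, NamedTuple


class Opp(NamedTuple):
    freq_mhz: int
    power_mw: float


def energy_per_cycle(opp: Opp) -> float:
    """Energy per cycle in nanojoules (mW / MHz = nJ per cycle)."""
    return opp.power_mw / opp.freq_mhz


def inefficient_opps(table: List[Opp]) -> List[Opp]:
    """Return OPPs for which a faster OPP is at least as cheap per cycle."""
    ordered = sorted(table, key=lambda o: o.freq_mhz)
    result = []
    for i, opp in enumerate(ordered):
        faster = ordered[i + 1:]
        if any(energy_per_cycle(h) <= energy_per_cycle(opp) for h in faster):
            result.append(opp)
    return result


if __name__ == "__main__":
    table = [Opp(200, 100.0), Opp(400, 160.0), Opp(800, 400.0), Opp(1200, 900.0)]
    for opp in inefficient_opps(table):
        print("could be skipped:", opp)
```

With this invented table, only the 200 MHz entry is flagged: running at 400 MHz finishes the same work sooner for less energy per cycle, so there is little reason to pick the slower point.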
Scheduler soft affinity
As systems are getting bigger with more and more CPU cores, multiple instances of workloads are being consolidated on a single system. For example, running multiple virtual machines (VMs) or containers on the same host is a common use case. Currently the Linux scheduler provides a few ways to partition multiple workload instances: hard partitioning, using the sched_setaffinity() system call or the cpuset.cpus control group interface to bind threads to a specific set of CPUs, and control group CPU shares (cpu.shares), which divide the CPU cycles of the system among multiple instances using fair sharing.
But there is a need to have a way of dynamically partitioning workload instances so that one instance can use the available CPUs of another instance if they are idle, but only use the CPUs of its own partition when other partitions are busy. For example, the Oracle database has a multi-tenancy feature that can enable the root-level database instance to house multiple lightweight Pluggable Database (PDB) instances, each of which can be partitioned to use a NUMA node in a multi-socket system. Hard partitioning is not an option here, as one PDB instance needs to be able to burst out of its partition and use other available idle CPUs when other PDBs are idle. Hence CPU shares are used in this case. But this has the disadvantage of cache-coherence overhead (i.e. each instance running on all sockets will incur the cross-socket cache-coherence penalty due to data sharing).
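The desired behavior can be sketched as a CPU-selection policy. The function below is a hypothetical illustration of the idea only, not the algorithm from any posted patch set: `pick_cpu()` and its parameters are invented names, and a real scheduler would weigh far more than idleness.

```python
# Illustrative sketch only: a toy "soft affinity" CPU choice.
# Names and policy details are hypothetical.
from typing import Set


def pick_cpu(preferred: Set[int], all_cpus: Set[int], idle: Set[int]) -> int:
    """Choose a CPU for a task that has a preferred (soft) partition."""
    if not preferred or not preferred <= all_cpus:
        raise ValueError("preferred must be a non-empty subset of all_cpus")

    # 1. Stay inside the partition if one of its CPUs is idle.
    idle_preferred = preferred & idle
    if idle_preferred:
        return min(idle_preferred)

    # 2. Otherwise burst out onto an idle CPU belonging to another partition.
    idle_elsewhere = (all_cpus - preferred) & idle
    if idle_elsewhere:
        return min(idle_elsewhere)

    # 3. Everything is busy: fall back to the task's own partition.
    return min(preferred)


if __name__ == "__main__":
    cpus = {0, 1, 2, 3}
    print(pick_cpu({0, 1}, cpus, idle={1, 3}))
    print(pick_cpu({0, 1}, cpus, idle={3}))
    print(pick_cpu({0, 1}, cpus, idle=set()))
```

Using `min()` simply makes the choice deterministic for the example. The ordering of the three steps is what distinguishes this from hard partitioning (which would stop after step 1) and from plain CPU shares (which would not prefer the partition at all).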
SCHED_DEADLINE on heterogeneous multicores
As already mentioned in other talks, the SCHED_DEADLINE policy currently does not consider the capacities or the running frequencies of the various CPU cores. This mainly impacts two different aspects: admission control and task placement.
The SCHED_DEADLINE admission control is designed with two goals: avoiding overload (that is, avoid starving non-deadline tasks) and providing performance guarantees to deadline tasks. Unfortunately, the current code assumes that all of the CPU cores have the same maximum capacity (which is assumed to be equal to the maximum capacity of the fastest core), and this assumption breaks the admission-control mechanism. A simple experiment (creating SCHED_DEADLINE tasks until the admission control fails) shows that on a big.LITTLE CPU, it is currently possible to starve non-deadline tasks. A first patch that has been submitted to the Linux kernel mailing list fixes this issue by considering the maximum capacity of each CPU core when performing the admission control. Repeating the experiment shows that the patch is effective (until thermal throttling slows down the CPU, but this is a different issue).
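The difference between the two admission tests can be shown with a simplified model. This sketch is an assumption-laden illustration, not the kernel's code or the submitted patch: it ignores any global bandwidth cap the kernel applies, the function names are invented, and the capacity values (1024 for big cores, 446 for little ones) are example numbers only.

```python
# Illustrative sketch only: SCHED_DEADLINE-style admission control modeled as
# "total requested bandwidth must fit in the available CPU capacity".
# Names and capacity values are hypothetical.
from typing import List, Tuple

Task = Tuple[float, float]  # (runtime, period), in the same time unit


def total_bandwidth(tasks: List[Task]) -> float:
    """Sum of runtime/period over all deadline tasks."""
    return sum(runtime / period for runtime, period in tasks)


def admit_uniform(tasks: List[Task], capacities: List[int]) -> bool:
    """Model of the current assumption: every core counts as a full-speed core."""
    return total_bandwidth(tasks) <= len(capacities)


def admit_capacity_aware(tasks: List[Task], capacities: List[int]) -> bool:
    """Model of the fix: each core counts in proportion to its own capacity."""
    max_cap = max(capacities)
    available = sum(cap / max_cap for cap in capacities)
    return total_bandwidth(tasks) <= available


if __name__ == "__main__":
    big_little = [1024, 1024, 446, 446, 446, 446]  # 2 big + 4 little cores
    tasks = [(9.0, 10.0)] * 5                      # five tasks at 90% each
    print(admit_uniform(tasks, big_little))
    print(admit_capacity_aware(tasks, big_little))
```

In this example the task set asks for 4.5 CPUs' worth of bandwidth. The uniform test sees six full CPUs and accepts it, while the capacity-aware test sees roughly 3.74 CPUs' worth of real capacity and rejects it, which is the kind of overload that lets deadline tasks starve everything else.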
TurboSched
Parth Shah discussed the problem of sustaining "turbo" frequencies on SMP systems. Modern multicore systems have support for turbo frequencies, which are frequencies above the range of the rated frequencies that can be sustained by a small number of CPUs in the chip under certain power and thermal constraints. However, due to these very power and thermal constraints, it is harder to sustain these turbo frequencies for longer durations. Shah said that IBM POWER9 systems have a margin of around 18% for turbo range and sustaining these frequencies can provide better single-threaded performance.
New approaches to thermal management
Volker Eckert presented results from his experiments to use the CFS bandwidth controller for thermal management. The fundamental idea is to use less CPU bandwidth while running low-priority (background) tasks and thus keep the power budget available for more important tasks. This led to two interesting discussions: how to solve the per-entity load tracking (PELT) utilization issues for throttled tasks, and the idea, pushed by Morten Rasmussen, that thermal management should be applied to tasks rather than CPUs. Following this overall design approach, which was also backed by Paul Turner, the CFS bandwidth controller could play an essential role in a thermal-management architecture for future mobile systems.
Proxy execution
At the risk of playing defense, Juri Lelli started his talk by saying that he was going to be quick, as he didn't actually have any updates from what he presented last year at the Linux Plumbers Conference and from the first RFC posted on the Linux kernel mailing list. The main goal of his session was to understand if there is still interest in this line of work.
Proxy execution can be simply thought of as a "better" priority-inheritance mechanism, in which a mutex owner can potentially run using (inheriting) the scheduling context (properties) of other tasks blocked on the same mutex (avoiding priority inversions). For the SCHED_DEADLINE scheduling policy, this translates to the possibility for a mutex owner to run "inside" the bandwidth of its donors (the mutex waiters), fixing a longstanding issue with the policy: priority-boosted tasks are currently allowed to run outside of runtime enforcement, as they only inherit the donors' deadline.
