Kernel Planet

Kernel Planet - http://planet.kernel.org

Linux Plumbers Conference: Get ready for LPC 2021!

4 hours 34 min ago

The LPC 2021 conference is just around the corner. We wanted to share the logistics on how to participate and watch the virtual conference.

For those who are not registered for the conference, we will live stream the sessions on YouTube, as we did last year. This is free of charge. We will provide the URLs for watching each day on this page. The only limitation is that you cannot participate and ask questions live with audio. However, this year the chat in each Big Blue Button room will also be available externally via the Matrix open communication network, and anyone is invited to join with their personal Matrix account.

Those who are registered for the conference will be able to log into our Big Blue Button server through our front end page, starting Monday September 20 at 7:00AM US Pacific time.
To log in to BBB, please go to meet.lpc.events. You will find a front end showing the schedule for the current day with all the active sessions you can join. Your credentials are the email address you used for registration, and the confirmation code you received in email when you registered. Please make sure you have those available in advance of trying to log in.

Please review the LPC 2021 Participant Guide before you join the conference.

Looking forward to seeing you there!

Linux Plumbers Conference: Linux Plumbers Conference 2021 is Almost Here

Friday 17th of September 2021 02:43:00 AM

We are only three days away from the start of LPC 2021!

Thank you to all that made our conference possible:
– Our generous Sponsors, listed here on the right
– The Linux Foundation, which, as always, provides impeccable support
– Our speakers and leaders, who are providing a lot of great content and planning great discussions

As you can see, the schedule is now finalized. There will be seven parallel tracks each day, each lasting four hours. We have a total of 23 different tracks and Microconferences, with 191 sessions.

At this time we are closing the CfPs for all tracks. We still have room for a limited number of Birds of a Feather sessions. If you want to propose one (even during the conference) and the necessary participants are all registered, please send an email to our lpc-contact@lists.linuxplumbersconf.org mailing list.

Take a look at all the great technical content at this year's virtual LPC.
You can view the schedule by main blocks, by track, or as a complete detailed view.

Note that at the end of the first day we’ll have a plenary keynote by Jon “maddog” Hall.
Additionally, at the end of the last day we’ll have a plenary session as a wrap-up for this year’s conference.

The conference will be entirely virtual, offered on a completely free and open software stack.

We look forward to five days filled with great discussions, and we hope that LPC 2021 will once again provide a creative and productive environment where ideas can be exchanged and problems tackled. Many great ideas have sprung from these meetings in the past, driving innovation in the Linux plumbing layer!

Pete Zaitcev: Scalability of a varying degree

Tuesday 14th of September 2021 08:27:15 PM

Seen at the official site of Qumulo:

Scale

Platforms must be able to serve petabytes of data, billions of files, millions of operations, and thousands of users.

Thousands of users...? Isn't it a little too low? Typical Swift clusters in Telcos have tens of millions of users, of which tens or hundreds of thousands are active simultaneously.

Google's Chubby paper has a little section on the scalability problems of talking to a cluster over TCP/IP. Basically, at low tens of thousands of connections you start to have serious issues with kernel sockets and TIME_WAIT. So maybe that's where the number comes from.

Paul E. Mc Kenney: Stupid RCU Tricks: Making Race Conditions More Probable

Tuesday 14th of September 2021 04:43:14 AM
Given that it is much more comfortable chasing down race conditions reported by rcutorture than those reported from the field, it would be good to make race conditions more probable during rcutorture runs than in production. A number of tricks are used to make this happen, including making rare events (such as CPU-hotplug operations) happen more frequently, testing the in-kernel RCU API directly from within the kernel, and so on.

Another approach is to change timing. Back at Sequent in the 1990s, one way that this was accomplished was by plugging different-speed CPUs into the same system and then testing on that system. It was observed that for certain types of race conditions, the probability of the race occurring increased by the ratio of the CPU speeds. One such race condition is when a timed event on the slow CPU races with a workload-driven event on the fast CPU. If the fast CPU is (say) two times faster than the slow CPU, then the timed event will provide two times greater “collision cross section” than if the same workload was running on CPUs running at the same speed.

Given that modern CPUs can easily adjust their core clock rates at runtime, it is tempting to try this same trick on present-day systems. Unfortunately, everything and its dog is adjusting CPU clock rates for various purposes, plus a number of modern CPUs are quite happy to let you set their core clock rates to a value sufficient to result in physical damage. Throwing rcutorture into this fray might be entertaining, but it is unlikely to be all that productive.

Another approach is to make use of memory latency. The idea is for the rcutorture scripting to place one pair of a given scenario's vCPUs in the hyperthreads of a single core and to place another pair of that same scenario's vCPUs in the hyperthreads of a different single core, and preferably a core on some other socket. The theory is that the different communications latencies and bandwidths within a core on the one hand and between cores (or, better yet, between sockets) on the other should have roughly the same effect as does varying CPU core clock rates.

OK, theory is all well and good, but what happens in practice?

As it turns out, on dual-socket systems, quite a bit.

With this small change to the rcutorture scripting, RCU Tasks Trace suddenly started triggering assertions. These test failures led to no fewer than 12 fixes, perhaps most notably surrounding proper handling of the count of tasks from which quiescent states are needed. This caused me to undertake a full review of RCU Tasks Trace, greatly assisted by Boqun Feng, Frederic Weisbecker, and Neeraj Upadhyay, with Neeraj providing half of the fixes. There is likely to be another fix or three, but then again isn't that always the case?

More puzzling were the 2,199.0-second RCU CPU stall warnings (described in more detail here). These were puzzling for a number of reasons:

  1. The RCU CPU stall warning timeout is set to only 21 seconds.
  2. There was absolutely no console output during the full stall duration.
  3. The stall duration was never 2,199.1 seconds and never 2,198.9 seconds, but always exactly 2,199.0 seconds, give or take a (very) few tens of milliseconds.
  4. The stalled CPU usually took only a handful of scheduling-clock interrupts during the stall, but would sometimes take them at a rate of 100,000 per second, which seemed just a bit excessive for a kernel built with HZ=1000.
  5. At the end of the stall, the kernel happily continued, usually with no other complaints.
These stall warnings appeared most frequently when running rcutorture's TREE04 scenario.

But perhaps this is not a stall, but instead a case of time jumping forward. This might explain the precision of the stall duration, and would definitely explain the lack of intervening console output, the lack of other complaints, and the kernel's being happy to continue at the end of the stall. Not so much the occasional extreme rate of scheduling-clock interrupts, but perhaps that is a separate problem.

However, running large numbers (as in 200) of concurrent shorter one-hour TREE04 runs often resulted in the run terminating (forcibly) in the middle of the stall. Now this might be due to the host's and the guests' clocks all jumping forward at the same time, except that different guests stalled at different times, and even when running TREE04, most guests didn't stall at all. Therefore, the stalls really did stall, and for a very long time.

But then it should be possible to work out what the CPUs were doing in the meantime. One approach would be to use tracing, but previous experience with massive volumes of trace messages (and thus lost trace messages) suggested a more surgical approach. Furthermore, the last console message before the stall was always of the form “kvm-clock: cpu 3, msr d4a80c1, secondary cpu clock” and the first console message after the stall was always of the form “kvm-guest: stealtime: cpu 3, msr 1f597140”. These are widely separated and are often printed from different CPUs, which also suggests a more surgical approach. This situation also implicates CPU hotplug, but this is not at all unusual.

The first attempt at exploratory surgery used the jiffies counter to check for segments of code taking more than 100 seconds to complete. Unfortunately, these checks never triggered, even in runs having stall warnings. So maybe the jiffies counter is not being updated. It is easy enough to switch to ktime_get_mono_fast_ns(), right? Except that this did not trigger, either.
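
As a rough sketch of what such an elapsed-time check might look like (the helper names here are invented for illustration; this is not the actual rcutorture diagnostic), using ktime_get_mono_fast_ns(), which is safe to call from most contexts:

#include <linux/ktime.h>
#include <linux/printk.h>
#include <linux/timekeeping.h>

/* Hypothetical helpers for flagging code segments that run too long. */
static u64 dbg_seg_start_ns;

static void dbg_seg_begin(void)
{
        dbg_seg_start_ns = ktime_get_mono_fast_ns();
}

static void dbg_seg_end(const char *what)
{
        u64 delta_ns = ktime_get_mono_fast_ns() - dbg_seg_start_ns;

        /* Complain if the bracketed segment took more than 100 seconds. */
        if (delta_ns > 100ULL * NSEC_PER_SEC)
                pr_err("%s took %llu ns\n", what, delta_ns);
}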

Maybe there is a long-running interrupt handler? Mark Rutland recently posted a patchset to detect exactly that, so I applied it. But it did not trigger.

I switched to ktime_get() in order to do cross-CPU time comparisons, and out of sheer paranoia added checks for time going backwards. And these backwards-time checks really did trigger just before the stall warnings appeared, once again demonstrating the concurrent-programming value of a healthy level of paranoia, and also explaining why many of my earlier checks were not triggering. Time moved forward, and then jumped backwards, making it appear that no time had passed. (Time did jump forward again, but that happened after the last of my debug code had executed.)
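
A minimal sketch of this kind of backwards-time paranoia check (again with invented names, not the actual debug patch; it assumes the caller has preemption disabled):

#include <linux/ktime.h>
#include <linux/percpu.h>
#include <linux/printk.h>
#include <linux/smp.h>

static DEFINE_PER_CPU(ktime_t, dbg_last_now);

/* Call periodically; warn if ktime_get() appears to move backwards. */
static void dbg_check_monotonic(void)
{
        ktime_t now = ktime_get();
        ktime_t *last = this_cpu_ptr(&dbg_last_now);

        if (ktime_before(now, *last))
                pr_err("CPU %d: time went backwards by %lld ns\n",
                       smp_processor_id(),
                       ktime_to_ns(ktime_sub(*last, now)));
        *last = now;
}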

Adding yet more checks showed that the temporal issues were occurring within stop_machine_from_inactive_cpu(). This invocation takes the mtrr_rendezvous_handler() function as an argument, and it really does take 2,199.0 seconds (that is, about 36 minutes) from the time that stop_machine_from_inactive_cpu() is called until the time that mtrr_rendezvous_handler() is called. But only sometimes.

Further testing confirmed that increasing the frequency of CPU-hotplug operations increased the frequency of 2,199.0-second stall warnings.

An extended stint of code inspection suggested further diagnostics, which showed that one of the CPUs would be stuck in the multi_cpu_stop() state machine. The stuck CPU was never CPU 0 and was never the incoming CPU. Further tests showed that the scheduler always thought that all of the CPUs, including the stuck CPU, were in the TASK_RUNNING state. Even more instrumentation showed that the stuck CPU was failing to advance to state 2 (MULTI_STOP_DISABLE_IRQ), meaning that all of the other CPUs were spinning in a reasonably tight loop with interrupts disabled. This could of course explain the lack of console messages, at least from the non-stuck CPUs.

Might qemu and KVM be to blame? A quick check of the code revealed that vCPUs are preserved across CPU-hotplug events, that is, taking a CPU offline does not cause qemu to terminate the corresponding user-level thread. Furthermore, the distribution of stuck CPUs was uniform across the CPUs other than CPU 0. The next step was to find out where CPUs were getting stuck within the multi_cpu_stop() state machine. The answer was “at random places”. Further testing also showed that the identity of the CPU orchestrating the onlining of the incoming CPU had nothing to do with the problem.

Now TREE04 marks all but CPU 0 as nohz_full CPUs, meaning that they disable their scheduling-clock interrupts when running in userspace when only one task is runnable on that CPU. Maybe the CPUs need to manually enable their scheduling-clock interrupt when starting multi_cpu_stop()? This did not fix the problem, but it did manage to shorten some of the stalls, in a few cases to less than ten minutes.

The next trick was to send an IPI to the stalled CPU every 100 seconds during multi_cpu_stop() execution. To my surprise, this IPI was handled by the stuck CPU, although with surprisingly long delays ranging from just a bit less than one millisecond to more than eight milliseconds.
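
A rough sketch of this “poke the stalled CPU with an IPI and see how long it takes” idea (illustrative only; everything other than the kernel APIs is an invented name):

#include <linux/ktime.h>
#include <linux/printk.h>
#include <linux/smp.h>
#include <linux/timekeeping.h>

static void dbg_poke_handler(void *info)
{
        u64 *sent_ns = info;

        pr_info("CPU %d ran poke IPI %llu ns after it was sent\n",
                smp_processor_id(),
                ktime_get_mono_fast_ns() - *sent_ns);
}

/* Send an IPI to @cpu and wait for its handler to complete. */
static void dbg_poke_cpu(int cpu)
{
        u64 sent_ns = ktime_get_mono_fast_ns();

        smp_call_function_single(cpu, dbg_poke_handler, &sent_ns, 1);
}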

This suggests that the stuck CPUs might be suffering from an interrupt storm, so that the IPI had to wait for its turn among a great many other interrupts. Further testing therefore sent an NMI backtrace at 100 seconds into multi_cpu_stop() execution. The resulting stack traces showed that the stuck CPU was always executing within sysvec_apic_timer_interrupt() or some function that it calls. Further checking showed that the stuck CPU was in fact suffering from an interrupt storm, namely an interrupt storm of scheduling-clock interrupts. This spurred another code-inspection session.

Subsequent testing showed that the interrupt duration was about 3.5 microseconds, which corresponded to about one third of the stuck CPU's time. It appears that the other two-thirds is consumed repeatedly entering and exiting the interrupt.

The retriggering of the scheduling-clock interrupt does have some potential error conditions, including setting times in the past and various overflow possibilities. Unfortunately, further diagnostics showed that none of this was happening. However, they also showed that the code was trying to schedule the next interrupt at time KTIME_MAX, so that an immediate relative-time-zero interrupt is a rather surprising result.

So maybe this confusion occurs only when multi_cpu_stop() preempts some timekeeping activity. Now TREE04 builds its kernels with CONFIG_PREEMPT=n, but maybe there is an unfortunately placed call to schedule() or some such. Except that further code inspection found no such possibility. Furthermore, another test run that dumped the previous task running on each CPU showed nothing suspicious (aside from rcutorture, which some might argue is always suspicious).

And further debugging showed that tick_program_event() thought that it was asking for the scheduling-clock interrupt to be turned off completely. This seemed like a good time to check with the experts, and Frederic Weisbecker, noting that all of the action was happening within multi_cpu_stop() and its called functions, ran the following command to enlist ftrace, while also limiting its output to something that the console might reasonably keep up with:

./kvm.sh --configs "18*TREE04" --allcpus --bootargs "ftrace=function_graph ftrace_graph_filter=multi_cpu_stop" --kconfig "CONFIG_FUNCTION_TRACER=y CONFIG_FUNCTION_GRAPH_TRACER=y"
This showed that there was no hrtimer pending (consistent with KTIME_MAX), and that the timer was nevertheless being set to fire immediately. Frederic then proposed the following small patch:

--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -595,7 +595,8 @@ void irq_enter_rcu(void)
 {
 	__irq_enter_raw();
 
-	if (is_idle_task(current) && (irq_count() == HARDIRQ_OFFSET))
+	if (tick_nohz_full_cpu(smp_processor_id()) ||
+	    (is_idle_task(current) && (irq_count() == HARDIRQ_OFFSET)))
 		tick_irq_enter();
 
 	account_hardirq_enter(current);
This forces the jiffies counter to be recomputed upon interrupt from nohz_full CPUs in addition to idle CPUs, which avoids the timekeeping confusion that caused KTIME_MAX to be interpreted as zero.

And a 20-hour run for each of 200 instances of TREE04 was free of RCU CPU stall warnings! (This represents 4,000 hours of testing consuming 32,000 CPU-hours.)

This was an example of that rare form of deadlock, a temporary deadlock. The stuck CPU was stuck because timekeeping wasn't happening. Timekeeping wasn't happening because all the timekeeping CPUs were spinning in multi_cpu_stop() with interrupts disabled. The other CPUs could not exit their spinloops (and thus could not update timekeeping information) because the stuck CPU did not advance through the multi_cpu_stop() state machine.

So what caused this situation to be temporary? I must confess that I have not dug into it (nor do I intend to), but my guess is that integer overflow resulted in KTIME_MAX once again taking on its proper role, thus ending the stuck CPU's interrupt storm and in turn allowing the multi_cpu_stop() state machine to advance.
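
One arithmetic observation in support of the overflow guess: 2^41 nanoseconds is 2,199,023,255,552 ns, or just over 2,199.0 seconds, which matches the observed stall duration almost exactly.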

Nevertheless, this completely explains the mystery. Assuming integer overflow, the extremely repeatable stall durations make perfect sense. The RCU CPU stall warning did not happen at the expected 21 seconds because all the CPUs were either spinning with interrupts disabled on the one hand or being interrupt stormed on the other. The interrupt-stormed CPU did not report the RCU CPU stall because the jiffies counter wasn't incrementing. A random CPU would report the stall, depending on which took the first scheduling-clock tick after time jumped backwards (again, presumably due to integer overflow) and back forwards. In the relatively rare case where this CPU was the stuck CPU, it reported an amazing number of scheduling clock ticks, otherwise very few. Since everything was stuck, it is only a little surprising that the kernel continued blithely on after the stall ended. TREE04 reproduced the problem best because it had the largest proportion of nohz_full CPUs.

All in all, this experience was a powerful (if sometimes a bit painful) demonstration of the ability of controlled memory latencies to flush out rare race conditions!

Brendan Gregg: ZFS Is Mysteriously Eating My CPU

Sunday 5th of September 2021 02:00:00 PM
A microservice team asked me for help with a mysterious issue. They claimed that the ZFS file system was consuming 30% of CPU capacity. I summarized this case study at [Kernel Recipes] in 2017; it is an old story that's worth resharing here.

## 1. Problem Statement

The microservice was for metrics ingestion and had recently updated its base OS image (BaseAMI). After doing so, they claimed that ZFS was now eating over 30% of CPU capacity. My first thought was that they were somehow mistaken: I worked on ZFS internals at Sun Microsystems, and unless it is badly misconfigured there's no way it can consume that much CPU. I have been surprised many times by unexpected performance issues, so I thought I should check their instances anyway. At the very least, I could show that I took it seriously enough to check it myself. I should also be able to identify the real CPU consumer.

## 2. Monitoring

I started with the cloud-wide monitoring tool, [Atlas], to check high-level CPU metrics. These included a breakdown of CPU time into percentages for "usr" (user: applications) and "sys" (system: the kernel). I was surprised to find a whopping 38% of CPU time was in sys, which is highly unusual for cloud workloads at Netflix. This supported the claim that ZFS was eating CPU, but how? Surely this is some other kernel activity, and not ZFS.

## 3. Next Steps

I'd usually SSH to instances for deeper analysis, where I could use mpstat(1) to confirm the usr/sys breakdown and perf(1) to begin profiling on-CPU kernel code paths. But since Netflix has tools (previously [Vector], now FlameCommander) that allow us to easily fetch flame graphs from our cloud deployment UI, I thought I'd cut to the chase. Just for illustration, this shows the Vector UI and a typical cloud flame graph:

Note that this sample flame graph is dominated by Java, shown by the green frames.

## 4. Flame Graph

Here's the CPU flame graph from one of the problem instances:

The kernel CPU time is pretty obvious, shown as two orange towers on the left and right. (The other colors are yellow for C++, and red for other user-level code.) Zooming in to the left kernel tower:

This is arc_reclaim_thread! I worked on this code back at Sun. So this really is ZFS; they were right!

The ZFS Adaptive Replacement Cache (ARC) is the main memory cache for the file system. The arc_reclaim_thread runs arc_adjust() to evict memory from the cache to keep it from growing too large, and to maintain a threshold of free memory that applications can quickly use. It does this periodically or when woken up by low memory conditions. In the past I've seen the arc_reclaim_thread eat too much CPU when a tiny file system record size was used (e.g., 512 bytes), creating millions of tiny buffers. But that was basically a misconfiguration. The default size is 128 Kbytes, and people shouldn't be tuning below 8 Kbytes.

The right kernel tower enters spl_kmem_cache_reap_now(), another ZFS memory-freeing function. I imagine this is related to the left tower (e.g., contending for the same locks). But first: why is ZFS in use?

## 5. Configuration

There was only one use of ZFS so far at Netflix that I knew of: a new infrastructure project was using ZFS for containers. This led me to a theory: if they were quickly churning through containers, they would also be churning through ZFS file systems, and this might mean that many old pages needed to be cleaned out of the cache. Ahh, it makes sense now. I told them my theory, confident I was on the right path.
But they replied: "We aren't using containers." Ok, then how _are_ you using ZFS? I did not expect their answer: "We aren't using ZFS." What!? Yes you are, I can see the arc_reclaim_thread in the flame graph. It doesn't run for fun! It's only on CPU because it's evicting pages from the ZFS ARC. If you aren't using ZFS, there are no pages in the ARC, so it shouldn't run. They were confident that they weren't using ZFS at all. The flame graph defied logic. I needed to prove to them that they were indeed using ZFS somehow, and figure out why.

## 6. cd & ls

I should be able to debug this using nothing more than the cd and ls(1) commands: cd to the file system and ls(1) to see what's there. The file names should be a big clue as to its use. First, finding out where the ZFS file systems are mounted:

df -h
mount
zfs list

This showed nothing! No ZFS file systems were currently mounted. I tried another instance and saw the same thing. Huh? Ah, but containers may have been created previously and since destroyed, hence no remaining file systems now. How can I tell if ZFS has ever been used?

## 7. arcstats

I know, arcstats! The kernel counters that track ZFS statistics, including ARC hits and misses. Viewing them:

# cat /proc/spl/kstat/zfs/arcstats
name                            type data
hits                            4    0
misses                          4    0
demand_data_hits                4    0
demand_data_misses              4    0
demand_metadata_hits            4    0
demand_metadata_misses          4    0
prefetch_data_hits              4    0
prefetch_data_misses            4    0
prefetch_metadata_hits          4    0
prefetch_metadata_misses        4    0
mru_hits                        4    0
mru_ghost_hits                  4    0
mfu_hits                        4    0
mfu_ghost_hits                  4    0
deleted                         4    0
mutex_miss                      4    0
evict_skip                      4    0
evict_not_enough                4    0
evict_l2_cached                 4    0
evict_l2_eligible               4    0
[...]

Unbelievable! All the counters were zero! ZFS really wasn't in use, ever! But at the same time, it was eating over 30% of CPU capacity! Whaaat??

The customer had been right all along. ZFS was straight up eating CPU, and for no reason. How can a file system _that's not in use at all_ consume 38% CPU? I'd never seen this before. This was a mystery.

## 8. Code Analysis

I took a closer look at the flame graph and the paths involved, and saw that the CPU code paths led to get_random_bytes() and extract_entropy(). These were new to me. Browsing the [source code] and change history I found the culprit.

The ARC contains lists of cached buffers for different memory types. A performance feature ("multilist") had been added that split the ARC lists into one per CPU, to reduce lock contention on multi-CPU systems. Sounds good, as that should improve performance. But what happens when you want to evict some memory? You need to pick one of those CPU lists. Which one? You could go through them in a round-robin fashion, but the developer thought it better to pick one at random. _Cryptographically-secure random._

The kicker was that ZFS wasn't even in use. The ARC was detecting low system memory and then adjusting its size accordingly, at which point it'd discover it was zero size already and didn't need to do anything. But this was after randomly selecting a zero-sized list, using a CPU-expensive random number generator.

I filed this as ZFS [issue #6531]. I believe the first fix was to have the arc_reclaim_thread bail out earlier if ZFS wasn't in use, and not enter list selection. The ARC has since had many changes, and I haven't heard of this issue again.
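
To make the cost contrast concrete, here is a simplified, hypothetical sketch of the two selection strategies discussed above (not the actual OpenZFS multilist code; the function names are invented):

#include <linux/atomic.h>
#include <linux/random.h>

static atomic_t next_sublist;

/* What the flame graph showed: a cryptographically strong RNG per pick. */
static unsigned int pick_sublist_random(unsigned int num_sublists)
{
        unsigned int idx;

        get_random_bytes(&idx, sizeof(idx));
        return idx % num_sublists;
}

/* A far cheaper alternative: simple round-robin. */
static unsigned int pick_sublist_round_robin(unsigned int num_sublists)
{
        return (unsigned int)atomic_inc_return(&next_sublist) % num_sublists;
}

As noted above, though, the first fix was simply to bail out before doing any selection when ZFS isn't in use.
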
[Kernel Recipes]: https://youtu.be/UVM3WX8Lq2k?t=133
[Kernel Recipes2]: https://www.slideshare.net/brendangregg/kernel-recipes-2017-using-linux-perf-at-netflix/2
[talks]: /index.html#Videos
[issue #6531]: https://github.com/openzfs/zfs/issues/6531
[source code]: https://github.com/openzfs/zfs/blob/4e33ba4c389f59b74138bf7130e924a4230d64e9/module/zfs/arc.c
[Atlas]: https://netflixtechblog.com/introducing-atlas-netflixs-primary-telemetry-platform-bd31f4d8ed9a
[Vector]: https://netflixtechblog.com/introducing-vector-netflixs-on-host-performance-monitoring-tool-c0d3058c3f6f

Brendan Gregg: Analyzing a High Rate of Paging

Sunday 29th of August 2021 02:00:00 PM
These are rough notes. A service team was debugging a performance issue and noticed it coincided with a high rate of paging. I was asked to help, and used a variety of Linux performance tools to solve this including those that use eBPF and Ftrace. This is a rough post to share this old but good case study of using these tools, and to help justify their further development. No editing, spell checking, or comments. Mostly screenshots. ## 1. Problem Statement The microservice managed and processed large files, including encrypting them and then storing them on S3. The problem was that large files, such as 100 Gbytes, seemed to take forever to upload. Hours. Smaller files, as large as 40 Gbytes, were relatively quick, only taking minutes. A cloud-wide monitoring tool, Atlas, showed a high rate of paging for the larger file uploads: The blue is pageins (page ins). Pageins are a type of disk I/O where a page of memory is read in from disk, and are normal for many workloads. You might be able to guess the issue from the problem statement alone. ## 2. iostat Starting with my [60-second performance checklist], the iostat(1) tool showed a high rate of disk I/O during a large file upload: # iostat -xz 1 Linux 4.4.0-1072-aws (xxx) 12/18/2018 _x86_64_ (16 CPU) avg-cpu: %user %nice %system %iowait %steal %idle 5.03 0.00 0.83 1.94 0.02 92.18 Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util xvda 0.00 0.29 0.21 0.17 6.29 3.09 49.32 0.00 12.74 6.96 19.87 3.96 0.15 xvdb 0.00 0.08 44.39 9.98 5507.39 1110.55 243.43 2.28 41.96 41.75 42.88 1.52 8.25 avg-cpu: %user %nice %system %iowait %steal %idle 14.81 0.00 1.08 29.94 0.06 54.11 Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util xvdb 0.00 0.00 745.00 0.00 91656.00 0.00 246.06 25.32 33.84 33.84 0.00 1.35 100.40 avg-cpu: %user %nice %system %iowait %steal %idle 14.86 0.00 0.89 24.76 0.06 59.43 Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util xvdb 0.00 0.00 739.00 0.00 92152.00 0.00 249.40 24.75 33.49 33.49 0.00 1.35 100.00 avg-cpu: %user %nice %system %iowait %steal %idle 14.95 0.00 0.89 28.75 0.06 55.35 Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util xvdb 0.00 0.00 734.00 0.00 91704.00 0.00 249.87 24.93 34.04 34.04 0.00 1.36 100.00 avg-cpu: %user %nice %system %iowait %steal %idle 14.54 0.00 1.14 29.40 0.06 54.86 Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util xvdb 0.00 0.00 750.00 0.00 92104.00 0.00 245.61 25.14 33.37 33.37 0.00 1.33 100.00 ^C I'm looking at the r_await column in particular: the average wait time for reads in milliseconds. Reads usually have apps waiting on them; writes may not (write-back caching). An r_wait of 33 ms is kinda high, and likely due to the queueing (avgqu-sz). They are largeish I/O, about 128 Kbytes (divide rkB/s by r/s). ## 3. biolatency From [bcc], this eBPF tool shows a latency histogram of disk I/O. I'm running it in case the averages are hiding outliers, which could be a device issue: # /usr/share/bcc/tools/biolatency -m Tracing block device I/O... Hit Ctrl-C to end. ^C msecs : count distribution 0 -> 1 : 83 | | 2 -> 3 : 20 | | 4 -> 7 : 0 | | 8 -> 15 : 41 | | 16 -> 31 : 1620 |******* | 32 -> 63 : 8139 |****************************************| 64 -> 127 : 176 | | 128 -> 255 : 95 | | 256 -> 511 : 61 | | 512 -> 1023 : 93 | | This doesn't look too bad. Most I/O are between 16 and 127 ms. 
Some outliers reaching the 0.5 to 1.0 second range, but again, there's quite a bit of queueing here seen in the earlier iostat(1) output. I don't think this is a device issue. I think it's the workload. ## 4. bitesize As I think it's a workload issue, I want a better look at the I/O sizes in case there's something odd: # /usr/share/bcc/tools/bitesize Tracing... Hit Ctrl-C to end. ^C Process Name = java Kbytes : count distribution 0 -> 1 : 0 | | 2 -> 3 : 0 | | 4 -> 7 : 0 | | 8 -> 15 : 31 | | 16 -> 31 : 15 | | 32 -> 63 : 15 | | 64 -> 127 : 15 | | 128 -> 255 : 1682 |****************************************| The I/O is mostly in the 128 to 255 Kbyte bucket, as expected from the iostat(1) output. Nothing odd here. ## 5. free Also from the 60-second checklist: # free -m total used free shared buff/cache available Mem: 64414 15421 349 5 48643 48409 Swap: 0 0 0 There's not much memory left, 349 Mbytes, but more interesting is the amount in the buffer/page cache: 48,643 Mbytes (48 Gbytes). This is a 64-Gbyte memory system, and 48 Gbytes is in the page cache (the file system cache). Along with the numbers from the problem statement, this gives me a theory: Do the 100-Gbyte files bust the page cache, whereas 40-Gbyte files fit? ## 6. cachestat [cachestat] is an experimental tool I developed that uses Ftrace and has since been ported to bcc/eBPF. It shows statistics for the page cache: # /apps/perf-tools/bin/cachestat Counting cache functions... Output every 1 seconds. HITS MISSES DIRTIES RATIO BUFFERS_MB CACHE_MB 1811 632 2 74.1% 17 48009 1630 15132 92 9.7% 17 48033 1634 23341 63 6.5% 17 48029 1851 13599 17 12.0% 17 48019 1941 3689 33 34.5% 17 48007 1733 23007 154 7.0% 17 48034 1195 9566 31 11.1% 17 48011 [...] This shows many cache misses, with a hit ratio varying between 6.5 and 74%. I usually like to see that in the upper 90's. This is "cache busting." The 100 Gbyte file doesn't fit in the 48 Gbytes of page cache, so we have many page cache misses that will cause disk I/O and relatively poor performance. The quickest fix is to move to a larger-memory instance that does fit 100 Gbyte files. The developers can also rework the code with the memory constraint in mind to improve performance (e.g., processing parts of the file, instead of making multiple passes over the entire file). ## 7. Smaller File Test For further confirmation, I collected the same outputs for a 32 Gbyte file upload. cachestat shows a ~100% cache hit ratio: # /apps/perf-tools/bin/cachestat Counting cache functions... Output every 1 seconds. HITS MISSES DIRTIES RATIO BUFFERS_MB CACHE_MB 61831 0 126 100.0% 41 33680 53408 0 78 100.0% 41 33680 65056 0 173 100.0% 41 33680 65158 0 79 100.0% 41 33680 55052 0 107 100.0% 41 33680 61227 0 149 100.0% 41 33681 58669 0 71 100.0% 41 33681 33424 0 73 100.0% 41 33681 ^C This smaller size allows the service to process the file (however many passes it takes) entirely from memory, without re-reading it from disk. 
free(1) shows it fitting in the page cache: # free -m total used free shared buff/cache available Mem: 64414 18421 11218 5 34773 45407 Swap: 0 0 0 And iostat(1) shows little disk I/O, as expected: # iostat -xz 1 Linux 4.4.0-1072-aws (xxx) 12/19/2018 _x86_64_ (16 CPU) avg-cpu: %user %nice %system %iowait %steal %idle 12.25 0.00 1.24 0.19 0.03 86.29 Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util xvda 0.00 0.32 0.31 0.19 7.09 4.85 47.59 0.01 12.58 5.44 23.90 3.09 0.15 xvdb 0.00 0.07 0.01 11.13 0.10 1264.35 227.09 0.91 82.16 3.49 82.20 2.80 3.11 avg-cpu: %user %nice %system %iowait %steal %idle 57.43 0.00 2.95 0.00 0.00 39.62 Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util avg-cpu: %user %nice %system %iowait %steal %idle 53.50 0.00 2.32 0.00 0.00 44.18 Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util xvdb 0.00 0.00 0.00 2.00 0.00 19.50 19.50 0.00 0.00 0.00 0.00 0.00 0.00 avg-cpu: %user %nice %system %iowait %steal %idle 39.02 0.00 2.14 0.00 0.00 58.84 Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util [...] ## 8. Final Notes cachestat was the killer tool here, but I should stress that it's still experimental. I wrote it for Ftrace with the constraints that it must be low-overhead and use the Ftrace function profiler only. As I mentioned in my LSFMMBPF 2019 keynote in Puerto Rico, where the Linux mm kernel engineers were present, I think the cachestat statistics are so commonly needed that they should be in /proc and not need my experimental tool. They pointed out challenges with providing them properly, and I think any robust solution is going to need their help and expertise. I hope this case study helps show why it is useful and worth the effort. Until the kernel does support page cache statistics (which may be never: they are hot-path, so adding counters isn't free) we can use my cachestat tool, even if it does require frequent maintenance to keep working. [cachestat]: /blog/2014-12-31/linux-page-cache-hit-ratio.html [60-second performance checklist]: /Articles/Netflix_Linux_Perf_Analysis_60s.pdf [bcc]: https://github.com/iovisor/bcc

Michael Kerrisk (manpages): man-pages-5.13 released

Friday 27th of August 2021 09:07:02 PM
Alex Colomar and I have released man-pages-5.13. The release tarball is available on kernel.org. The browsable online pages can be found on man7.org. The Git repository for man-pages is available on kernel.org.

This release resulted from patches, bug reports, reviews, and comments from 40 contributors. The release includes around 200 commits that changed around 120 manual pages.

The most notable of the changes in man-pages-5.13 are the following:
  • A new mount_setattr(2) page, written by Christian Brauner, that documents the mount_setattr() system call added in Linux 5.12.
  • I added documentation of the /proc/PID/projid_map file to the user_namespaces(7) page.
  • I added a greatly expanded description of the concept of locked mounts in the mount_namespaces(7) page.

Special thanks again to Alex, who kept track of a lot of patches while I was unavailable.

Brendan Gregg: Slack's Secret STDERR Messages

Thursday 26th of August 2021 02:00:00 PM
These are rough notes. I run the Slack messaging application on Ubuntu Linux, and it recently started mysteriously crashing. I'd Alt-Tab and find it was no longer there. No error message, no dialog, just gone. It usually happened when locking and unlocking the screen. A quick internet search revealed nothing. These are rough notes for how I debugged it, in case it's useful for someone searching on this topic. I spend many hours documenting advanced debugging stories for books, talks, and blog posts, but many things I never have time to share. I'm experimenting with this "rough notes" format as a way to quickly share things. No editing, spell checking, or comments. Mostly screenshots. Dead ends included. Note that I don't know anything about Slack internals, and there may be better ways to solve this. ## 1. Enabling core dumps I'm guessing it's core dumping and Ubuntu's apport is eating them. Redirecting them to the file system so I can then do core dump analysis using gdb(1), as root: # cat /proc/sys/kernel/core_pattern |/usr/share/apport/apport %p %s %c %d %P # mkdir /var/cores # echo "/var/cores/core.%e.%p.%h.%t" > /proc/sys/kernel/core_pattern [...another crash...] # ls /var/cores # This didn't work: No core file showed up. I may need to increase the core file size ulimits for Slack, but that might mean mucking around with its startup scripts; I'll try some other tracing first. ## 2. exitsnoop Using an eBPF/bcc tool to look for exit reasons: # exitsnoop -t TIME-AEST PCOMM PID PPID TID AGE(s) EXIT_CODE 13:51:19.432 kworker/dying 3663305 2 3663305 1241.59 0 13:51:30.948 kworker/dying 3663626 2 3663626 835.76 0 13:51:33.296 systemd-udevd 3664149 2054939 3664149 3.55 0 13:53:09.256 kworker/dying 3662179 2 3662179 2681.15 0 13:53:25.636 kworker/dying 3663520 2 3663520 1122.60 0 13:53:30.705 grep 3664239 6009 3664239 0.08 0 13:53:30.705 ps 3664238 6009 3664238 0.08 0 13:53:40.297 slack 3663135 1786 3663135 1459.54 signal 6 (ABRT) 13:53:40.298 slack 3663208 3663140 3663208 1457.86 0 13:53:40.302 slack 3663140 1786 3663140 1459.18 0 13:53:40.302 slack 3663139 1786 3663139 1459.18 0 13:53:40.303 slack 3663171 1786 3663171 1458.22 0 13:53:40.317 slack 3663197 1786 3663197 1458.03 0 13:53:44.827 gdm-session-wor 3664269 1778 3664269 0.02 0 [...] This traced a Slack SIGABRT which happened around the same time as a crash. A strong lead. ## 3. killsnoop Running killsnoop (eBPF/bcc) to get more info: # killsnoop TIME PID COMM SIG TPID RESULT 13:45:01 2053366 systemd-journal 0 1024 0 13:45:01 2053366 systemd-journal 0 3663525 -3 13:45:01 2053366 systemd-journal 0 3663528 -3 13:49:00 2054939 systemd-udevd 15 3664053 0 13:51:33 2054939 systemd-udevd 15 3664149 0 13:53:44 2053366 systemd-journal 0 4265 -1 13:53:44 2053366 systemd-journal 0 972 0 13:53:44 2053366 systemd-journal 0 1778 0 13:53:44 2053366 systemd-journal 0 6414 -1 [...] A crash happened, but killsnoop(8) didn't see it. A quick look at the killsnoop(8) source reminded me that I wrote it back in 2015, which is practically ancient in eBPF years. Back then there wasn't tracepoint support yet so I was using kprobes for everything. Kprobes aren't a stable interface, which might be the problem. ## 4. 
perf trace Nowadays this can be done as a perf one-liner: # perf list syscalls:sys_enter_*kill List of pre-defined events (to be used in -e): syscalls:sys_enter_kill [Tracepoint event] syscalls:sys_enter_tgkill [Tracepoint event] syscalls:sys_enter_tkill [Tracepoint event] # perf trace -e 'syscalls:sys_enter_*kill' 15755.483 slack/3684015 syscalls:sys_enter_tgkill(tgid: 3684015 (slack), pid: 3684015 (slack), sig: ABRT) Ok, so there's our slack SIGABRT, sent via tgkill(2). (And I filed an issue to update bcc killsnoop(8) to use tracepoints.) This output doesn't really tell me much more about it though. I want to see a stack trace. I can use perf record or bpftrace...and that reminds me, didn't I write another signal tool using bpftrace? ## 5. signals.bt The signals.bt bpftrace tool from my BPF book traces the signal:signal_generate tracepoint, which should catch every type of generated signal, including tgkill(2). Trying it out: # bpftrace /home/bgregg/Git/bpf-perf-tools-book/originals/Ch13_Applications/signals.bt Attaching 3 probes... Counting signals. Hit Ctrl-C to end. ^C @[SIGNAL, PID, COMM] = COUNT @[SIGPIPE, 1883, Xorg]: 1 @[SIGCHLD, 1797, dbus-daemon]: 1 @[SIGINT, 3665167, bpftrace]: 1 @[SIGTERM, 3665198, gdm-session-wor]: 1 @[SIGCHLD, 3665197, gdm-session-wor]: 1 @[SIGABRT, 3664940, slack]: 1 @[SIGTERM, 3665197, gdm-session-wor]: 1 @[SIGKILL, 3665207, dbus-daemon]: 1 @[SIGWINCH, 859450, bash]: 2 @[SIGCHLD, 1778, gdm-session-wor]: 2 @[, 3665201, gdbus]: 2 @[, 3665199, gmain]: 2 @[SIGWINCH, 3665167, bpftrace]: 2 @[SIGWINCH, 3663319, vi]: 2 @[SIGCHLD, 1786, systemd]: 6 @[SIGALRM, 1883, Xorg]: 106 Ok, there's the SIGABRT for slack. (There's a new sigsnoop(8) tool for bcc that uses this tracepoint as well.) ## 6. Signal Stacks Moving from signals.bt to a bpftrace one-liner: # bpftrace -e 't:signal:signal_generate /comm == "slack"/ { printf("%d, %s%s\n", args->sig, kstack, ustack); }' Attaching 1 probe... 6, __send_signal+579 __send_signal+579 send_signal+221 do_send_sig_info+81 do_send_specific+110 do_tkill+171 __x64_sys_tgkill+34 do_syscall_64+73 entry_SYSCALL_64_after_hwframe+68 0x7f4a2e2e2f95 This was supposed to print both the kernel and user stack traces that led to the SIGABRT, but the user stack is broken, showing 0x7f4a2e2e2f95 only. Grumble. There's ways to fix this, but it usually gets time consuming, so let me try something else first. Logs! ## 7. Logs Does Slack have logs? I have no idea. Maybe they contain the error message. # lsof -p `pgrep -n slack` | grep -i log lsof: WARNING: can't stat() fuse.gvfsd-fuse file system /run/user/1000/gvfs Output information may be incomplete. lsof: WARNING: can't stat() fuse file system /run/user/1000/doc Output information may be incomplete. Ignore the lsof(8) warnings. There's no output here, nothing containing "log". Although I'm looking at the most recent process called "slack" and maybe that's the wrong one. # pstree -ps `pgrep -n slack` systemd(1)───systemd(1786)───slack(3666477)───slack(3666481)───slack(3666548)─┬─{slack}(3666549) ├─{slack}(3666551) ├─{slack}(3666552) ├─{slack}(3666553) ├─{slack}(3666554) ├─{slack}(3666555) ├─{slack}(3666556) ├─{slack}(3666557) ├─{slack}(3666558) ├─{slack}(3666559) ├─{slack}(3666560) ├─{slack}(3666564) ├─{slack}(3666566) ├─{slack}(3666568) └─{slack}(3666609) Ok, how about I try the great-grandparent, PID 3666477: # lsof -p 3666477 | grep -i log lsof: WARNING: can't stat() fuse.gvfsd-fuse file system /run/user/1000/gvfs Output information may be incomplete. 
lsof: WARNING: can't stat() fuse file system /run/user/1000/doc Output information may be incomplete. slack 3666477 bgregg 37r REG 253,1 32768 140468 /home/bgregg/.local/share/gvfs-metadata/home-8fd8d123.log slack 3666477 bgregg 40r REG 253,1 32768 131314 /home/bgregg/.local/share/gvfs-metadata/trash:-85854456.log slack 3666477 bgregg 71w REG 253,1 15566 1707316 /home/bgregg/.config/Slack/Local Storage/leveldb/013430.log slack 3666477 bgregg 72w REG 253,1 347 1704816 /home/bgregg/.config/Slack/Local Storage/leveldb/LOG slack 3666477 bgregg 73w REG 253,1 2324236 1718407 /home/bgregg/.config/Slack/logs/browser.log slack 3666477 bgregg 90w REG 253,1 363600 1713625 /home/bgregg/.config/Slack/Service Worker/Database/000004.log slack 3666477 bgregg 92w REG 253,1 274 1704249 /home/bgregg/.config/Slack/Service Worker/Database/LOG slack 3666477 bgregg 108w REG 253,1 4182513 1749672 /home/bgregg/.config/Slack/logs/webapp-service-worker-console.log slack 3666477 bgregg 116w REG 253,1 259 1704369 /home/bgregg/.config/Slack/Session Storage/LOG slack 3666477 bgregg 122w REG 253,1 31536 1749325 /home/bgregg/.config/Slack/Session Storage/000036.log slack 3666477 bgregg 126w REG 253,1 3970909 1704566 /home/bgregg/.config/Slack/logs/webapp-console.log slack 3666477 bgregg 127w REG 253,1 2330006 1748923 /home/bgregg/.config/Slack/IndexedDB/https_app.slack.com_0.indexeddb.leveldb/023732.log slack 3666477 bgregg 131w REG 253,1 330 1704230 /home/bgregg/.config/Slack/IndexedDB/https_app.slack.com_0.indexeddb.leveldb/LOG slack 3666477 bgregg 640r REG 253,1 32768 140378 /home/bgregg/.local/share/gvfs-metadata/root-7d269acf.log (deleted) Lots of logs in ~/config/Slack/logs! # cd ~/.config/Slack/logs # ls -lrth total 26M -rw-rw-r-- 1 bgregg bgregg 5.0M Aug 20 07:54 webapp-service-worker-console1.log -rw-rw-r-- 1 bgregg bgregg 5.1M Aug 23 19:30 webapp-console2.log -rw-rw-r-- 1 bgregg bgregg 5.1M Aug 25 16:07 webapp-console1.log drwxrwxr-x 2 bgregg bgregg 4.0K Aug 27 14:34 recorded-trace/ -rw-rw-r-- 1 bgregg bgregg 4.0M Aug 27 14:46 webapp-service-worker-console.log -rw-rw-r-- 1 bgregg bgregg 2.3M Aug 27 14:46 browser.log -rw-rw-r-- 1 bgregg bgregg 3.9M Aug 27 14:46 webapp-console.log Ok, so how about this one: # cat webapp-console.log [...] 
[08/27/21, 14:46:36:238] info: [API-Q] (TKZ41AXQD) 614b3789-1630039595.801 dnd.teamInfo is RESOLVED [08/27/21, 14:46:36:240] info: [API-Q] (TKZ41AXQD) 614b3789-1630039595.930 users.interactions.list is RESOLVED [08/27/21, 14:46:36:242] info: [DND] (TKZ41AXQD) Fetched DND info for the following member: ULG5H012L [08/27/21, 14:46:36:251] info: [DND] (TKZ41AXQD) Checking for changes in DND status for the following members: ULG5H012L [08/27/21, 14:46:36:254] info: [DND] (TKZ41AXQD) Will check for changes in DND status again in 5 minutes [08/27/21, 14:46:36:313] info: [DND] (TKZ41AXQD) Fetched DND info for the following members: UL0US3455 [08/27/21, 14:46:36:313] info: [DND] (TKZ41AXQD) Checking for changes in DND status for the following members: ULG5H012L,UL0US3455 [08/27/21, 14:46:36:314] info: [DND] (TKZ41AXQD) Will check for changes in DND status again in 5 minutes [08/27/21, 14:46:37:337] info: [FOCUS-EVENT] Client window blurred [08/27/21, 14:46:40:022] info: [RTM] (T029N2L97) Processed 1 user_typing event(s) in channel(s) C0S9267BE over 0.10ms [08/27/21, 14:46:40:594] info: [RTM] (T029N2L97) Processed 1 message:message_replied event(s) in channel(s) C0S9267BE over 2.60ms [08/27/21, 14:46:40:595] info: [RTM] Setting a timeout of 37 ms to process more rtm events [08/27/21, 14:46:40:633] info: [RTM] Waited 37 ms, processing more rtm events now [08/27/21, 14:46:40:653] info: [RTM] (T029N2L97) Processed 1 message event(s) in channel(s) C0S9267BE over 18.60ms [08/27/21, 14:46:44:938] info: [RTM] (T029N2L97) Processed 1 user_typing event(s) in channel(s) C0S9267BE over 0.00ms No, I don't see any errors jumping out at me. How about searching for errors: # egrep -i 'error|fail' webapp-console.log [08/25/21, 16:07:13:051] info: [DESKTOP-SIDE-EFFECT] (TKZ41AXQD) Reacting to {"type":"[39] Set a value that represents whether we are currently in the boot phase","payload":false,"error":false} [08/25/21, 16:07:13:651] info: [DESKTOP-SIDE-EFFECT] (T7GLTMS0P) Reacting to {"type":"[39] Set a value that represents whether we are currently in the boot phase","payload":false,"error":false} [08/25/21, 16:07:14:249] info: [DESKTOP-SIDE-EFFECT] (T0DS04W11) Reacting to {"type":"[39] Set a value that represents whether we are currently in the boot phase","payload":false,"error":false} [08/25/21, 16:07:14:646] info: [DESKTOP-SIDE-EFFECT] (T0375HBGA) Reacting to {"type":"[39] Set a value that represents whether we are currently in the boot phase","payload":false,"error":false} [...] 
# egrep -i 'error|fail' browser.log [07/16/21, 08:18:27:621] error: Cannot override webPreferences key(s): webviewTag, nativeWindowOpen, nodeIntegration, nodeIntegrationInWorker, nodeIntegrationInSubFrames, enableRemoteModule, contextIsolation, sandbox [07/16/21, 08:18:27:653] error: Failed to load empty window url in window "error": { "stack": "Error: ERR_ABORTED (-3) loading 'about:blank'\n at rejectAndCleanup (electron/js2c/browser_init.js:217:1457)\n at Object.navigationListener (electron/js2c/browser_init.js:217:1763)\n at Object.emit (events.js:315:20)\n at Object.EventEmitter.emit (domain.js:467:12)", [07/16/21, 08:18:31:355] error: Cannot override webPreferences key(s): webviewTag, nativeWindowOpen, nodeIntegration, nodeIntegrationInWorker, nodeIntegrationInSubFrames, enableRemoteModule, contextIsolation, sandbox [07/16/21, 08:18:31:419] error: Cannot override webPreferences key(s): webviewTag, nativeWindowOpen, nodeIntegration, nodeIntegrationInWorker, nodeIntegrationInSubFrames, enableRemoteModule, contextIsolation, sandbox [07/24/21, 09:00:52:252] error: Failed to load calls-desktop-interop.WindowBorderPanel "error": { "stack": "Error: Module did not self-register: '/snap/slack/42/usr/lib/slack/resources/app.asar.unpacked/node_modules/@tinyspeck/calls-desktop-interop/build/Release/CallsDesktopInterop.node'.\n at process.func [as dlopen] (electron/js2c/asar_bundle.js:5:1846)\n at Object.Module._extensions..node (internal/modules/cjs/loader.js:1138:18)\n at Object.func [as .node] (electron/js2c/asar_bundle.js:5:2073)\n at Module.load (internal/modules/cjs/loader.js:935:32)\n at Module._load (internal/modules/cjs/loader.js:776:14)\n at Function.f._load (electron/js2c/asar_bundle.js:5:12684)\n at Module.require (internal/modules/cjs/loader.js:959:19)\n at require (internal/modules/cjs/helpers.js:88:18)\n at bindings (/snap/slack/42/usr/lib/slack/resources/app.asar/node_modules/bindings/bindings.js:112:48)\n at Object. (/snap/slack/42/usr/lib/slack/resources/app.asar/node_modules/@tinyspeck/calls-desktop-interop/lib/index.js:1:34)", [07/24/21, 09:00:52:260] warn: Failed to install protocol handler for slack:// links [07/24/21, 09:00:52:440] error: Cannot override webPreferences key(s): webviewTag [...] I browsed the logs for a while but didn't see a smoking gun. Surely it spits out some error message when crashing, like to STDERR... ## 8. STDERR Tracing Where is STDERR written? # lsof -p 3666477 [...] slack 3666477 bgregg mem REG 7,16 141930 7165 /snap/slack/44/usr/lib/slack/chrome_100_percent.pak slack 3666477 bgregg mem REG 7,16 165680 7433 /snap/slack/44/usr/lib/slack/v8_context_snapshot.bin slack 3666477 bgregg 0r CHR 1,3 0t0 6 /dev/null slack 3666477 bgregg 1w CHR 1,3 0t0 6 /dev/null slack 3666477 bgregg 2w CHR 1,3 0t0 6 /dev/null slack 3666477 bgregg 3r FIFO 0,12 0t0 29532192 pipe slack 3666477 bgregg 4u unix 0x00000000134e3c45 0t0 29526717 type=SEQPACKET slack 3666477 bgregg 5r REG 7,16 10413488 7167 /snap/slack/44/usr/lib/slack/icudtl.dat [...] /dev/null? Like that's going to stop me. I could trace writes to STDERR, but I think my old shellsnoop(8) tool (another from eBPF/bcc) already does that: # shellsnoop 3666477 [...] 
[08/27/21, 14:46:36:314] info: [DND] (TKZ41AXQD) Will check for changes in DND status again in 5 minutes [08/27/21, 14:46:37:337] info: [FOCUS-EVENT] Client window blurred [08/27/21, 14:46:40:022] info: [RTM] (T029N2L97) Processed 1 user_typing event(s) in channel(s) C0S928EBE over 0.10ms [08/27/21, 14:46:40:594] info: [RTM] (T029N2L97) Processed 1 message:message_replied event(s) in channel(s) C0S928EBE over 2.60ms [08/27/21, 14:46:40:595] info: [RTM] Setting a timeout of 37 ms to process more rtm events [08/27/21, 14:46:40:633] info: [RTM] Waited 37 ms, processing more rtm events now [08/27/21, 14:46:40:653] info: [RTM] (T029N2L97) Processed 1 message event(s) in channel(s) C0S928EBE over 18.60ms [08/27/21, 14:46:44:938] info: [RTM] (T029N2L97) Processed 1 user_typing event(s) in channel(s) C0S928EBE over 0.00ms (slack:3666477): Gtk-WARNING **: 14:46:45.525: Could not load a pixbuf from icon theme. This may indicate that pixbuf loaders or the mime database could not be found. ** Gtk:ERROR:../../../../gtk/gtkiconhelper.c:494:ensure_surface_for_gicon: assertion failed (error == NULL): Failed to load /usr/share/icons/Yaru/16x16/status/image-missing.png: Unable to load image-loading module: /snap/slack/42/usr/lib/x86_64-linux-gnu/gdk-pixbuf-2.0/2.10.0/loaders/libpixbufloader-png.so: /snap/slack/42/usr/lib/x86_64-linux-gnu/gdk-pixbuf-2.0/2.10.0/loaders/libpixbufloader-png.so: cannot open shared object file: No such file or directory (gdk-pixbuf-error-quark, 5) Ah-ha! The last message printed is an error about a .png file and a .so file. As it's Slack's final mesage before crashing, this is a smoking gun. Note that this was not in any log!: # grep image-missing.png * grep: recorded-trace: Is a directory It's the .so file that is missing, not the .png: # ls -lh /usr/share/icons/Yaru/16x16/status/image-missing.png -rw-r--r-- 1 root root 535 Nov 6 2020 /usr/share/icons/Yaru/16x16/status/image-missing.png # ls -lh /snap/slack/42/usr/lib/x86_64-linux-gnu/gdk-pixbuf-2.0/2.10.0/loaders/libpixbufloader-png.so ls: cannot access '/snap/slack/42/usr/lib/x86_64-linux-gnu/gdk-pixbuf-2.0/2.10.0/loaders/libpixbufloader-png.so': No such file or directory But there is a .so file with a similar path: # ls -lh /snap/slack/ total 0 drwxrwxr-x 8 root root 123 Jul 14 02:49 43/ drwxrwxr-x 8 root root 123 Aug 18 10:27 44/ lrwxrwxrwx 1 root root 2 Aug 24 09:48 current -> 44/ # ls -lh /snap/slack/44/usr/lib/x86_64-linux-gnu/gdk-pixbuf-2.0/2.10.0/loaders/libpixbufloader-png.so -rw-r--r-- 1 root root 27K Aug 18 10:27 /snap/slack/44/usr/lib/x86_64-linux-gnu/gdk-pixbuf-2.0/2.10.0/loaders/libpixbufloader-png.so Hmm, I wonder... ## 9. Workaround This is obviously a hack and is not guaranteed to be safe: # cd /snap/slack # ln -s current 42 # ls -lh total 0 lrwxrwxrwx 1 root root 7 Aug 27 15:01 42 -> current/ drwxrwxr-x 8 root root 123 Jul 14 02:49 43/ drwxrwxr-x 8 root root 123 Aug 18 10:27 44/ lrwxrwxrwx 1 root root 2 Aug 24 09:48 current -> 44/ # ls -lh /snap/slack/42/usr/lib/x86_64-linux-gnu/gdk-pixbuf-2.0/2.10.0/loaders/libpixbufloader-png.so -rw-r--r-- 1 root root 27K Aug 18 10:27 /snap/slack/42/usr/lib/x86_64-linux-gnu/gdk-pixbuf-2.0/2.10.0/loaders/libpixbufloader-png.so I don't know why Slack was looking up this library via the old directory version, but linking the new version to the old path did the trick. Slack has stopped crashing! I'm guessing this is a problem with how the snap is built. Needs more debugging. ## 10. 
Other debugging In case you're wondering what I'd do if I didn't find the error in STDERR, I'd go back to setting ulimits to see if I could get a core dump, and if that still didn't work, I'd try to run Slack from a gdb(1) session. I'd also work on fixing the user stack trace and symbols to see what that revealed. ## 11. Bonus: opensnoop I often wonder how I could have debugged things sooner, and I'm kicking myself I didn't run opensnoop(8) as I usually do. Tracing just failed opens: # opensnoop -Tx TIME(s) PID COMM FD ERR PATH [...] 11.412358000 3673057 slack -1 2 /var/lib/snapd/desktop/mime/subclasses 11.412360000 3673057 slack -1 2 /var/lib/snapd/desktop/mime/icons 11.412363000 3673057 slack -1 2 /var/lib/snapd/desktop/mime/generic-icons 11.412495000 3673057 slack -1 2 /snap/slack/42/usr/lib/x86_64-linux-gnu/gdk-pixbuf-2.0/2.10.0/loaders/libpixbufloader-png.so 11.412527000 3673057 slack -1 2 /usr/share/locale/en_AU/LC_MESSAGES/gdk-pixbuf.mo 11.412537000 3673057 slack -1 2 /usr/share/locale/en/LC_MESSAGES/gdk-pixbuf.mo 11.412559000 3673057 slack -1 2 /usr/share/locale-langpack/en/LC_MESSAGES/gdk-pixbuf.mo 11.412916000 3673057 slack -1 2 /snap/slack/42/usr/lib/x86_64-linux-gnu/gdk-pixbuf-2.0/2.10.0/loaders/libpixbufloader-png.so 11.425405000 1786 systemd -1 2 /sys/fs/cgroup/memory/user.slice/user-1000.slice/user@1000.service/snap.slack.slack.402dde03-7f71-48a0-98a5-33fd695ccbde.scope/memory.events That shows its last failed open was to the .so file. Which would have been a good lead. But the best clue was the secret STDERR messages Slack sends to /dev/null, rescued using shellsnoop(8).

Linux Plumbers Conference: BOFs Call for Proposals Now Open

Thursday 26th of August 2021 12:06:34 AM

We have formally opened the CfP for Birds of a Feather. Select the BOFs track when submitting a BOF here.

As a reminder:

  • Submissions for Microconference topics are still open.
  • The GNU Tools Track CfP will close in a few days, on August 31.
  • Kernel Summit track is open for late breaking topics.
  • All the other tracks are now closed for additional submissions.

 

Linux Plumbers Conference: Diversity, Equity & Inclusion Microconference Accepted into 2021 Linux Plumbers Conference

Thursday 19th of August 2021 04:35:20 PM

We are pleased to announce that the Diversity, Equity & Inclusion Microconference has been accepted into the 2021 Linux Plumbers Conference.

Creating diverse communities requires effort and commitment to building inclusive and welcoming spaces. Recognizing that communities which adopt inclusive language and actions attract and retain more individuals from diverse backgrounds, the Linux kernel community adopted inclusive language in the Linux 5.8 release. Understanding whether this sort of change has been effective is a topic of active research. This MC will take the pulse of the Linux kernel community as it turns 30 this year and discuss some next steps. Experts from the DEI research community will share their perspectives, together with perspectives from Linux community members. This microconference will build on what was started at the LPC 2020 BoF session on Improving Diversity.

This year’s topics to be discussed include:

  • What are the challenges worth focusing on in attracting and retaining a diverse group of developers?
  • Do the Code of Conduct and inclusive naming efforts help people from diverse groups feel at home? What else is missing?
  • How effective have the kernel mentoring initiatives been? Are there best practices emerging that help the limited pool of mentors be more effective?
  • What will be the most effective next steps for advancing Diversity, Equity and Inclusion that will improve the trends, and help us scale?

Come and join us in the discussion of how we can improve the diversity of the Linux Kernel community and help keep it vibrant for the next 30 years!

We hope to see you there.

Linux Plumbers Conference: GPU/media/AI buffer management and interop Microconference Accepted into 2021 Linux Plumbers Conference

Monday 16th of August 2021 06:51:15 PM

We are pleased to announce that the GPU/media/AI buffer management and interop Microconference has been accepted into the 2021 Linux Plumbers Conference. The Linux GPU subsystem has long had three major tenets:

  • Kernel mediation, validation, submission, and scheduling of GPU jobs
  • Implicit synchronisation between multiple user space accessors
  • Open-source user space

Forthcoming hardware makes the first two difficult, if not impossible, to achieve. In order to give user space the fastest possible path to support modern complex workloads, forthcoming hardware is removing the notion of a small number of kernel-controlled job queues, replacing it with direct user space access to command queues to submit and control their own jobs.

This, and evolution in the Vulkan API, make it difficult to retain the existing implicit synchronization model, where the kernel tracks all access and ensures that the hardware executes jobs in the order of user space submission, so that multiple independent clients can reuse the same buffers without data hazards. As all of these changes impact both media and neural-network accelerators, this Linux Plumbers Conference microconference allows us to open the discussion past the graphics community and into the wider kernel community.

This year’s topics to be discussed include:

Come and join us in the discussion of keeping Linux a first-class citizen in the world of graphics and media.

We hope to see you there.

Linux Plumbers Conference: Android Microconference Accepted into 2021 Linux Plumbers Conference

Wednesday 4th of August 2021 07:40:00 PM

We are pleased to announce that the Android Microconference has been accepted into the 2021 Linux Plumbers Conference. The past Android microconferences have been centered around the idea that they were primarily a synchronization point between the Android kernel team and the rest of the community, to inform the latter of what the Android team has been doing. With the help of last year’s focus on the Generic Kernel Image[1] (GKI), this year’s Android microconference will instead be an opportunity to foster a higher level of collaboration between the Android and Linux kernel communities. Discussions will be centered on the goal of ensuring that Android and Linux kernel development move in lockstep going forward.

Last year’s meetup achieved the following:

This year’s topics to be discussed include:

  • Alignment issues between Android and Cgroups v2: Issues in refactoring Android’s use of cgroups to utilize cgroups v2
  • Thermal: Issues around performance and thermal handling between the kernel and Android’s HAL
  • Fuse/device-mapper/other storage: Discuss out-of-tree dm/storage drivers and how they might go upstream or better align with upstream efforts
  • In-kernel memory tracking: Tracking/accounting GPU (and other multi-device shared) memory and how it might fit with cgroups
  • fw_devlink: Remaining fw_devlink issues to resolve, now that it's enabled by default.
  • Hermetic builds/Kbuild
  • GKI updates: What's new this year in GKI and where it's going next year
  • Rust in AOSP / Kernel / Binder: How Android is adopting Rust for userland and potentially for kernel drivers
  • Android Automotive OS Reference Platform: Details on recent Android Automotive work
  • Community devboard/device Collaboration: Ways to better collaborate on enabling various devboard against AOSP, without needing close interlock with Google

Come and join us in the discussion of making Android a better partner with Linux.

We hope to see you there.

Dave Airlie (blogspot): crocus misrendering of the week

Wednesday 4th of August 2021 08:04:07 AM

I've been chasing a crocus misrendering bug shown in a Qt trace.


The bottom image is crocus, versus 965 on top. This only happened on Gen4->5, so Ironlake and GM45 were my test machines. I burned a lot of time trying to work this out. I trimmed the traces down, dumped a stupendous amount of batchbuffers, turned off UBO push constants, dumped all the index and vertex buffers, and tried some RGBx changes, but nothing jumped out at me, except that the vertex shaders produced were different.

However they were different for many reasons, due to the optimization pipelines the mesa state tracker runs vs the 965 driver. Inputs and UBO loads were in different places so there was a lot of noise in the shaders.

I ported the trace to a piglit GL application so I could hack on the shaders and GL more easily; with that I trimmed it down even further (even if I did burn some time on a misplaced */+ typo).

Using the ported app, I removed all uniform buffer loads and then split the vertex shader in half (it was quite large, but had two chunks). I could then finally spot the difference in the NIR shaders.

What stood out was that the 965 shader had an if which the crocus shader had converted to a bcsel. This is a peephole optimization that the mesa/st state tracker calls, and sure enough removing that call fixed the rendering. But why? It is a valid optimization.
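
To make that transformation concrete, here is a rough Python analogy of what an if-to-bcsel flattening does. The function names are invented for illustration and this is of course nothing like real NIR, which operates on SSA instructions rather than Python:

def tweak_a(x):
    return x * 2.0          # stand-in for the math on one side of the branch

def tweak_b(x):
    return x + 1.0          # stand-in for the math on the other side

# Branchy form, roughly what the 965 backend was given:
def shade_with_branch(cond, a, b):
    if cond:
        result = tweak_a(a)
    else:
        result = tweak_b(b)
    return result

# Flattened form after the peephole pass: both sides are computed and a
# conditional select (bcsel) picks one, so no control flow remains.
def shade_with_bcsel(cond, a, b):
    result_a = tweak_a(a)
    result_b = tweak_b(b)
    return result_a if cond else result_b

The two forms compute the same values for every input, which is why the pass is valid; the actual bug, as described below, turned out to be in how the gen4/5 backends handled the resulting conditional selects.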

In a parallel thread on another part of the planet, Ian Romanick filed a MR to mesa https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/12191 fixing a bug in the gen4/5 fs backend with conditional selects. This was something he noticed while debugging elsewhere. However his fix was for the fragment shader backend, and my bug was in the vec4 vertex shader backend. I tracked down where the same changes were needed in the vec4 backend and tested a fix on top of his branch, and the misrendering disappeared.

It's a strange coincidence we both started hitting the same bug in different backends in the same week via different tests, but he's definitely saved me a lot of pain in working this out! Hopefully we can combine them and get it merged this week.

Also thanks to Angelo on the initial MR for testing crocus with some real workloads.

Linux Plumbers Conference: System Boot and Security Microconference Accepted into 2021 Linux Plumbers Conference

Thursday 29th of July 2021 07:10:43 PM

We are pleased to announce that the System Boot and Security Microconference has been accepted into the 2021 Linux Plumbers Conference. This microconference brings together those who are interested in firmware, bootloaders, system boot, and security. The events around last year’s BootHole showed how crucial platform initialization is for overall system security. Those events may have shown the shortcomings in the current boot process, but they have also tightened the cooperation between various companies and organizations. Now is the time to use this opportunity to discuss the lessons learned and what can be done to improve in the future. Other cooperation discussions are also welcome, such as those concerning legal and organizational issues which may hinder working together.

Last year’s meetup achieved the following:

This year’s topics to be discussed include:

Come and join us in the discussion about how to keep your system secure even at bootup.

We hope to see you there.

Linux Plumbers Conference: Kernel Dependability and Assurance Microconference Accepted into 2021 Linux Plumbers Conference

Wednesday 28th of July 2021 02:28:36 PM

We are pleased to announce that the Kernel Dependability and Assurance Microconference has been accepted into the 2021 Linux Plumbers Conference.

Linux development is producing kernels at an ever-increasing rate, and at the same time with arguably increasing software quality. The process of kernel development has been adapting to handle the increasing number of contributors over the years to ensure sufficient software quality. This quality is key in that Linux is now being used in applications that require a high degree of trust that the kernel is going to behave as expected. Some of the key areas where we’re seeing Linux start to be used are medical devices, civil infrastructure, caregiving robots, automotive systems, etc.

Last year’s miniconference raised awareness about this topic with the wider community. Since then the ELISA team has made contributions to the Documentation and tools. The team has deployed a CI server that runs static analysis tools and syzkaller on the Linux kernel repos, and makes the results for the last 10 days of linux-next available to the community.

This year’s topics to be discussed include:

  • Identify missing features that will provide assurance in safety critical systems.
  • Which test coverage infrastructures are most effective to provide evidence for kernel quality assurance? How should it be measured?
  • Explore ways to improve testing framework and tests in the kernel with a specific goal to increase traceability and code coverage.
  • Regression Testing for safety: Prioritize configurations and tests critical and important for quality and dependability

Come and join us in the discussion on how we can assure that Linux becomes the most trusted and dependable software in the world!

We hope to see you there.

Paul E. Mc Kenney: Confessions of a Recovering Proprietary Programmer, Part XVIII: Preventing Involuntary Generosity

Tuesday 27th of July 2021 03:12:16 AM
I recently learned that all that is required for someone to take out a loan in some random USA citizen's name is that citizen's full name, postal address, email address, date of birth, and social security number. If you are above a certain age, all of these are for all intents and purposes a matter of public record. If you are younger, then your social security number is of course supposed to be secret—and it will be, right up to that data breach that makes it available to all the wrong people.

This sort of thing can of course be a bit annoying to our involuntarily generous USA citizen. Some unknown person out there gets a fancy toy, and our citizen gets some bank's dunning notices. Fortunately, there are quite a few things you can do, although I will not try to reproduce the entirety of the volumes of good advice that are available out there. Especially given that laws, processes, and procedures are all subject to change.

But at present, one important way to prevent this is to put a hold on your credit information through any of Experian, Equifax, or TransUnion. I strongly suggest that you save yourself considerable time and hassle by doing this, which is free of charge for a no-questions-asked one-year hold. Taking this step is especially important if you are among the all too many of us whose finances don't have much slack, as was the case with my family back when my children were small. After all, it is one thing to have to deal with a few hassles in the form of phone calls, email, and paperwork, but it is quite another if you and your loved ones end up missing meals. Thankfully, it never came to that for my family, although one of my children did complain bitterly to a medical professional about the woefully insufficient stores of candy in our house.

Of course, I also have some advice for the vendor, retailer, digital-finance company, and bank that were involved in my case of attempted involuntary generosity:

  1. Put just a little more effort into determining who you are really doing business with.
  2. If the toy contains a computer and connects to the internet, consider the option of directing your dunning notices through the toy rather than to the email and phone of your involuntarily generous USA citizen.
  3. A loan application for a toy that is shipped to a non-residential address should be viewed with great suspicion.
  4. In fact, such a loan application should be viewed with some suspicion even if the addresses match. Porch pirates and all that.
  5. If the toy is of a type that must connect to the internet to do anything useful, you have an easy method of dealing with non-payment, don't you?

I should hasten to add that after only a little discussion, these companies have thus far proven quite cooperative in my particular case, which is why they are thus far going nameless.

Longer term, it is hard to be optimistic, especially given advances in various easy-to-abuse areas of information technology. In the meantime, I respectfully suggest that you learn from my experience and put a hold on your credit information!

Linux Plumbers Conference: RISC-V Microconference Accepted into 2021 Linux Plumbers Conference

Monday 26th of July 2021 09:26:42 PM

We are pleased to announce that the RISC-V Microconference has been accepted into the 2021 Linux Plumbers Conference. The RISC-V software ecosystem is gaining momentum at breakneck speed, with three new Linux development platforms available this year. The new platforms bring new issues to deal with.

Last year’s meetup achieved the following:

This year’s topics to be discussed include:

  • Platform specification progress, including SBI-0.3 and the future plans for SBI-0.4. There has been significant progress on the platform specifications, including a server profile, that needs discussion.
  • Privileged specification progress, possibly 1.12 (which is a work in progress at the foundation).
  • Support for the V and B specifications, along with questions about the drafts. The V extension is of particular interest, as there are implementations of the draft extension that are likely to be incompatible with what will eventually be ratified, so we need to discuss what exactly user ABI compatibility means.
  • H extension / KVM discussion, which is probably part of the drafts.  The KVM port has been hung up on the H extension ratification process, which is unlikely to proceed any time soon. We should discuss other options for a KVM port that avoid waiting for the H extension.
  • Support for the batch of SOCs currently landing (JH7100, D1)
  • Support for non-coherent systems
  • How to handle compliance.

Come join us and participate in the discussion on how we can improve the support for RISC-V in the Linux kernel.

We hope to see you there.

Pete Zaitcev: MinIO liberates your storage from rebalancing

Thursday 22nd of July 2021 11:08:18 PM

MinIO posted a blog entry a few days ago where they bragged about adding capacity without a need to rebalance.

First, they went into a full marketoid mode, whipping up the fear:

Rebalancing a massive distributed storage system can be a nightmare. There’s nothing worse than adding a storage node and watching helplessly as user response time increases while the system taxes its own resources rebalancing to include the new node.

Seems like MinIO folks assume that operators of distributed storage such as Swift and Ceph have no tools to regulate the resource consumption of rebalancing. So they have no choice but to "wait helplessly". Very funny.

But it gets worse when obviously senseless statements are made:

Rebalancing doesn’t just affect performance - moving many objects between many nodes across a network can be risky. Devices and components fail and that often leads to data loss or corruption.

Often, man! Also, a commit protocol? Never heard of her!

Then, we talk about some unrelated matters:

A group of drives is an erasure set and MinIO uses a Reed-Solomon algorithm to split objects into data and parity blocks based on the size of the erasure set and then uniformly distributes them across all of the drives in the erasure such that each drive in the set contains no more than one block per object.

Understood, your erasure set is what we call "partition" in Swift or a placement group in Ceph.

Finally, we get to the matter at hand:

To enable rapid growth, MinIO scales by adding server pools and erasure sets. If we had built MinIO to allow you to add a drive or even a single hardware node to an existing server pool, then you would have to suffer through rebalancing.

MinIO scales up quickly by adding server pools, each an independent set of compute, network and storage resources.

Add hardware, run MinIO server to create and name server processes, then update MinIO with the name of the new server pool. MinIO leaves existing data in their original server pools while exposing the new server pools to incoming data.

My hot take on social media was: "Placing new sets on new storage impacts utilization and risks hotspotting because of time affinity. There's no free lunch." Even on second thought, I think that is about right. But let us not ignore the cost of the data movement associated with rebalancing. What if the operator wants to implement in Swift what the MinIO blog post talks about?

It is possible to emulate MinIO, to an extent. Some operators add a new storage policy when they expand the cluster, configure all the new nodes and/or volumes in its ring, then make it default, so newly-created objects end up on the new hardware. This works to accomplish the same goals that MinIO outlines above, but it's a kludge. Swift was not intended for this originally and it shows. In particular, storage policies were intended for low numbers of storage classes, such as rotating media and SSD, or Silver/Gold/Platinum. Once you make a new policy for each new forklift visit, you run a risk of finding scalability issues. Well, most clusters only upgrade a few times over their lifetime, but potentially it's a problem. Also, policies are customer-visible; they are intended to be.

In the end, I still think that balanced cluster is the way to go. Just think rationally about it.
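
To put a rough number on the rebalancing cost being weighed here, below is a toy consistent-hashing sketch in Python. It has nothing to do with Swift's actual ring code, and the node names, vnode count, and hash choice are all arbitrary; it only illustrates that growing a well-balanced ring by one node moves roughly 1/(N+1) of the objects, and moves them only onto the new node:

import hashlib
from bisect import bisect

def build_ring(nodes, vnodes=128):
    # A toy consistent-hash ring: each node gets `vnodes` points on the ring.
    points = sorted(
        (int(hashlib.md5(f"{node}-{i}".encode()).hexdigest(), 16), node)
        for node in nodes
        for i in range(vnodes)
    )
    hashes = [h for h, _ in points]
    owners = [n for _, n in points]
    return hashes, owners

def locate(ring, key):
    # Map a key to the owner of the next ring point clockwise from its hash.
    hashes, owners = ring
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return owners[bisect(hashes, h) % len(owners)]

keys = [f"object-{i}" for i in range(50_000)]
old = build_ring([f"node{i}" for i in range(10)])
new = build_ring([f"node{i}" for i in range(11)])   # grow the cluster by one node

moved = sum(locate(old, k) != locate(new, k) for k in keys)
print(f"{moved / len(keys):.1%} of objects move, all of them onto the new node")

With the data movement throttled, that roughly one-in-eleven shuffle is the price of keeping utilization even across the whole cluster.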

Interestingly, the reverse emulation appears not to be possible for MinIO: if you wanted to rebalance your storage, you would not be able to. Or at least the blog post above says: "If we had built MinIO to allow you to add a drive or ... a node to an existing server pool". I take it to mean that they don't allow it, and the blog post is very much a case of sour grapes, then.

Linux Plumbers Conference: Open Printing Microconference Accepted into 2021 Linux Plumbers Conference

Thursday 22nd of July 2021 04:21:12 PM

We are pleased to announce that the Open Printing Microconference has been accepted into the 2021 Linux Plumbers Conference. Over the years OpenPrinting has been actively working on improving and modernizing the way we print in Linux. We have been working on multiple areas of printing and scanning. In particular, driverless print and scan technologies have helped do away with a lot of the hassle involved in finding and installing the correct driver. Users can now just plug in their printer and do what they need.

Based on the discussions that we had last year, we have been able to achieve the following:

– Significant progress in deciding on the structure of PAPPL, a framework/library for developing Printer Applications as a replacement for Printer Drivers.

– Progress on LPrint, a Label Printer Application implementing printing for a variety of common label and receipt printers connected via network or USB.

– Helped give shape to the Printer Application concept. Sample printer applications for HP PCL printers have been created that use PAPPL to support IPP printing from multiple operating systems. This prototype will help others looking to adopt this new concept of Printer Applications. The first production Printer Application started from this prototype is the PostScript Printer Application.

Development is in continuous progress, see the state of the art in OpenPrinting’s monthly news posts[6].

This year’s topics to be discussed include:

  • Changes in CUPS 2.4.x
    • Print sharing changes for mobile
    • OAuth support to replace Kerberos
    • Printer drivers replaced with Printer Applications
    • TLS/X.509 changes
  • CUPS in containers
    • CUPS 3.0
    • Future CUPS development
    • Identify support platforms
    • Key printing system components
    • Discuss integration with Printer Applications and application stores like Snap Store
  • Print Management GUI
    • Migrating from working with CUPS queues to IPP services
    • Handling legacy devices that do not handle IPP services
  • Common Print Dialog Backends
    • CPDB, CUPS backend.
    • Separating GUI toolkits and the print technology support to be independent from each other.
  • Printer/Scanner Driver Design and Development

Come join us and participate in the discussion to give Linux printing, scanning, and fax a better experience.

We hope to see you there.

Dave Airlie (blogspot): llvmpipe/lavapipe: anisotropic texture filtering

Wednesday 21st of July 2021 01:07:50 AM

In order to expose OpenGL 4.6, the last missing feature in llvmpipe is anisotropic texture filtering. Adding support for this also allows lavapipe to expose the Vulkan samplerAnisotropy feature.

I started writing anisotropic support > 6 months ago. At the time we were trying to deprecate the classic swrast driver, and someone pointed out it had support for anisotropic filtering. This support had also been ported to the softpipe driver, but never to llvmpipe.

I had also considered porting SwiftShader's anisotropic support, but since I was told the softpipe code was functional and had users, I based my llvmpipe port on that.

Porting the code to llvmpipe means rewriting it to generate LLVM IR using the llvmpipe vector processing code. This is a lot messier than just writing linear processing code, and when I thought I had it working it passed the GL CTS but failed the VK CTS. The results also looked worse to my eye than I'd have thought acceptable, and softpipe seemed to be just as bad.

Once I swung back around to this, I decided to port the VK CTS test to GL and run it on the softpipe and llvmpipe code. Initially llvmpipe had some more bugs to solve, especially where the mipmap levels were being chosen, but once I'd finished aligning softpipe and llvmpipe I started digging into why the softpipe code wasn't as nice as I expected.

The softpipe code was based on an implementation of an Elliptical Weighted Average Filter (EWA). The paper "Creating Raster Omnimax Images from Multiple Perspective Views Using the Elliptical Weighted Average Filter" described this. I sat down with the paper and softpipe code and eventually found the one line where they diverged.[1] This turned out to be a bug introduced in a refactoring 5 years ago, and nobody had noticed or tracked it down.
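
For readers who have not met EWA before, here is a deliberately simplified Python sketch of the idea, assuming Heckbert's formulation. It is only an illustration, not the softpipe or llvmpipe code, and it ignores mipmap selection, wrap modes and all the vectorisation that the real drivers do:

import math

def ewa_sample(texture, u0, v0, dudx, dvdx, dudy, dvdy):
    # Filter `texture` (a list of rows of floats) at texel-space centre
    # (u0, v0); (dudx, dvdx) and (dudy, dvdy) are the texel-space derivatives
    # of the coordinates across one screen pixel in x and y.
    height, width = len(texture), len(texture[0])

    # Ellipse implicit equation Q(du, dv) = A*du^2 + B*du*dv + C*dv^2 <= F.
    # The +1.0 keeps the ellipse at least about a texel wide.
    A = dvdx * dvdx + dvdy * dvdy + 1.0
    B = -2.0 * (dudx * dvdx + dudy * dvdy)
    C = dudx * dudx + dudy * dudy + 1.0
    F = A * C - 0.25 * B * B

    # With F defined this way, the ellipse's bounding box half-extents
    # come out as sqrt(C) in u and sqrt(A) in v.
    u_lo, u_hi = int(math.floor(u0 - math.sqrt(C))), int(math.ceil(u0 + math.sqrt(C)))
    v_lo, v_hi = int(math.floor(v0 - math.sqrt(A))), int(math.ceil(v0 + math.sqrt(A)))

    weighted_sum = weight_sum = 0.0
    for v in range(v_lo, v_hi + 1):
        for u in range(u_lo, u_hi + 1):
            du, dv = u - u0, v - v0
            q = A * du * du + B * du * dv + C * dv * dv
            if q < F:                        # texel centre lies inside the ellipse
                w = math.exp(-2.0 * q / F)   # Gaussian falloff towards the rim
                tu = min(max(u, 0), width - 1)    # clamp-to-edge for simplicity
                tv = min(max(v, 0), height - 1)
                weighted_sum += w * texture[tv][tu]
                weight_sum += w

    if weight_sum == 0.0:                    # degenerate footprint: nearest texel
        return texture[min(max(int(v0), 0), height - 1)][min(max(int(u0), 0), width - 1)]
    return weighted_sum / weight_sum

The point is that the filter footprint is an ellipse derived from the per-pixel texture-coordinate derivatives, and every texel inside it contributes with a Gaussian weight; get any one of those lines wrong and the footprint quietly takes the wrong shape, which is the kind of subtle one-line divergence described above.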

I then ported the same fix to my llvmpipe code, and VK CTS passes. I also optimized the llvmpipe code a bit to avoid doing pointless sampling and cleaned things up. This code landed in [2] today.

For GL 4.6 there are still some fixes needed in other areas.

[1] https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/11917

[2] https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/8804

More in Tux Machines

Programming Leftovers

  • Announcement : An AArch64 (Arm64) Darwin port is planned for GCC12

    As many of you know, Apple has now released an AArch64-based version of macOS and desktop/laptop platforms using the ‘M1’ chip to support it. This is in addition to the existing iOS mobile platforms (but shares some of their constraints). There is considerable interest in the user-base for a GCC port (starting with https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96168) - and, of great kudos to the gfortran team, one of the main drivers is folks using Fortran. Fortunately, I was able to obtain access to one of the DTKs, courtesy of the OSS folks, and using that managed to draft an initial attempt at the port last year (however, nowhere near ready for presentation in GCC11). Nevertheless (as an aside) despite being a prototype, the port is in use with many via hombrew, macports or self-builds - which has shaken out some of the fixable bugs. The work done in the prototype identified three issues that could not be coded around without work on generic parts of the compiler. I am very happy to say that two of our colleagues, Andrew Burgess and Maxim Blinov (both from embecosm) have joined me in drafting a postable version of the port and we are seeking sponsorship to finish this in the GCC12 timeframe. Maxim has a lightning talk on the GNU tools track at LPC (right after the steering committee session) that will focus on the two generic issues that we’re tackling (1 and 2 below). Here is a short summary of the issues and proposed solutions (detailed discussion of any of the parts below would better be in new threads).

  • Apple Silicon / M1 Port Planned For GCC 12 - Phoronix

Developers are hoping that for next year's GCC 12 release they will have Apple AArch64 support on Darwin in place, so that GCC can support Apple Silicon -- initially the M1 SoC -- on macOS. LLVM/Clang has long supported AArch64 on macOS, given that Apple leverages LLVM/Clang as part of its official Xcode toolchain as the basis for its compiler across macOS, iOS, and other products. While the GNU Compiler Collection (GCC) supports both AArch64 and macOS/Darwin, it hasn't supported the two of them together, but there is a port in progress to change that.

  • Dirk Eddelbuettel: tidyCpp 0.0.5 on CRAN: More Protect’ion

Another small release of the tidyCpp package arrived on CRAN overnight. The package offers a clean C++ layer (as well as one small C++ helper class) on top of the C API for R which aims to make use of this robust (if awkward) C API a little easier and more consistent. See the vignette for motivating examples. The Protect class now uses the default methods for copy and move constructors and assignment, allowing for wide use of the class. The small NumVec class now uses it for its data member.

  • QML Modules in Qt 6.2

With Qt 6.2 there is, for the first time, a comprehensive build system API that allows you to specify a QML module as a complete, encapsulated unit. This is a significant improvement, but as the concept of QML modules was rather under-developed in Qt 5, even seasoned QML developers might now ask "What exactly is a QML module?" In our previous post we scratched the surface by introducing the CMake API used to define them. We'll take a closer look in this post.

  • Santiago Zarate: So you want to recover an old git branch because it has been overwritten?
  • Start using YAML now | Opensource.com

    YAML (YAML Ain't Markup Language) is a human-readable data serialization language. Its syntax is simple and human-readable. It does not contain quotation marks, opening and closing tags, or braces. It does not contain anything which might make it harder for humans to parse nesting rules. You can scan your YAML document and immediately know what's going on. [...] At this point, you know enough YAML to get started. You can play around with the online YAML parser to test yourself. If you work with YAML daily, then this handy cheatsheet will be helpful.

  • 40 C programming examples

The C programming language is one of the most popular programming languages for novice programmers. It is a structured programming language that was mainly developed for the UNIX operating system. It supports different operating systems, and it is very easy to learn. This tutorial shows 40 useful C programming examples for users who want to learn C programming from the beginning.

Devices/Embedded: Asus Tinker Board 2 and More

  • Asus Tinker Board 2 single-board computer now available for $94 and up - Liliputing

    The Asus Tinker Board 2 is a Raspberry Pi-shaped single-board computer powered by a Rockchip RK3399 hexa-core processor and featuring 2GB to 4GB of RAM. First announced almost a year ago, the Tinker Board 2 is finally available for $99 and up. Asus also offers a Tinker Board 2S model that’s pretty similar except that it has 16GB of eMMC storage. Prices for that model start at about $120.

  • Raspberry Pi Weekly Issue #371 - Sir Clive Sinclair, 1940 – 2021

    This week ended with the incredibly sad news of the passing of Sir Clive Sinclair. He was one of the founding fathers of home computing and got many of us at Raspberry Pi hooked on programming as kids. Join us in sharing your Sinclair computing memories with us on Twitter and our blog, and we’ll see you next week.

  • cuplTag battery-powered NFC tag logs temperature and humidity (Crowdfunding) - CNX Software

Temperature and humidity sensors would normally connect to a gateway that sends data to the cloud; the coin-cell battery-powered cuplTag NFC tag instead sends data to your smartphone after a tap. cuplTag is controlled by an MSP430 16-bit microcontroller from Texas Instruments which regularly reads and stores sensor data into an EEPROM; the data can then be read over NFC, with the tag returning a URL carrying the sensor and battery data, and everything is displayed in the phone’s web browser (no app needed).

  • A first look at Microchip PolarFire SoC FPGA Icicle RISC-V development board - CNX Software

Formally launched on Crowd Supply a little over a year ago, Microchip PolarFire SoC FPGA Icicle (codenamed MPFS-ICICLE-KIT-ES) was one of the first Linux & FreeBSD capable RISC-V development boards. The system is equipped with a PolarFire SoC FPGA comprising a RISC-V CPU subsystem with four 64-bit RISC-V (RV64GC) application cores, one 64-bit RISC-V real-time core (RV64IMAC), as well as FPGA fabric. Backers of the board have been able to play with it for several months, but Microchip is now sending the board to more people for evaluation/review, and I got one of my own to experiment with. It’s good to have a higher-end development board instead of the usual hobbyist-grade board. Today, I’ll just have a look at the kit content and the main components on the board before playing with the Linux and FPGA development tools in an upcoming post or two.

  • What is IoT device management?

    Smart devices are everywhere around us. We carry one in our pocket, watch movies on another while a third cooks us dinner. Every day there are thousands of new devices connecting to the Internet. Research shows that by 2025, more than 150,000 IoT devices will come online every minute. With such vast numbers it is impossible to keep everything in working order just on your own. This brings the need for IoT device management. But what is IoT device management? To answer this question we first need to understand what the Internet of Things (IoT) is.

  • Beelink U59 mini PC with Intel Celeron N5095 Jasper Lake coming soon - Liliputing

Beelink says the system ships with Windows 10, but it should also support Linux.

  • Beelink U59 Celeron N5095 Jasper Lake mini PC to ship with 16GB RAM, 512GB SSD - CNX Software

    Beelink U59 is an upcoming Jasper Lake mini PC based on the Intel Celeron N5095 15W quad-core processor that will ship with up to 16GB RAM, and 512 GB M.2 SSD storage. The mini PC will also offer two 4K HDMI 2.0 ports, a Gigabit Ethernet port, WiFi 5, as well as four USB 3.0 ports, and support for 2.5-inch SATA drives up to 7mm thick.

Graphics: Mesa, KWinFT, and RADV

  • Experimenting Is Underway For Rust Code Within Mesa - Phoronix

Longtime Mesa developer Karol Herbst, who has worked extensively on the open-source NVIDIA "Nouveau" driver as well as the OpenCL/compute stack while employed by Red Hat, is now toying with the idea of Rust code inside Mesa. Herbst has begun investigating how Rust, which is known for its memory safety and concurrency benefits, could be used within Mesa. Ultimately he's evaluating how Rust could be used inside Mesa as an API implementation as well as how existing Mesa code could be leveraged from Rust.

  • KWinFT Continues Working On WLROOTS Render, Library Split

    KWinFT as a fork of KDE's KWin X11/Wayland compositor code continues making progress on driving fundamental display improvements and ironing out the Wayland support.  KWinFT has been transitioning to use WLROOTS for its Wayland heavy-lifting and that process remains ongoing. KWinFT has also been working on splitting up its library code to make it more manageable and robust.  Among the features still desired by KWinFT and to be worked on include input methods, graphical tablet support, and PipeWire video stream integration. Currently there are two full-time developers working on the project but they hope to scale up to four to five full-time developers. 

  • Raytracing Starting to Come Together – Bas Nieuwenhuizen – Open Source GPU Drivers

I am back with another status update on raytracing in RADV. And the good news is that things are finally starting to come together. After ~9 months of on-and-off work we now have games working with raytracing.

  • Multiple Games Are Now Working With RADV's Ray-Tracing Code - Phoronix

Not only is Intel progressing with its open-source ray-tracing driver support, but the Mesa Radeon Vulkan driver "RADV" has been rounding out its RT code too and now has multiple games correctly rendering. Bas Nieuwenhuizen has been spearheading the RADV work on Vulkan ray-tracing support, and after more than a half-year tackling it things are starting to fall into place nicely. Games such as Quake II RTX with native Vulkan ray-tracing are working, along with the game Control via VKD3D-Proton for going from Direct3D 12 DXR to Vulkan RT. Metro Exodus is also working, while Ghostrunner and Doom Eternal are two tested games that are not yet working.

Audiocasts/Shows: Full Circle Weekly News, Juno Computers, Kali Linux 2021.3