Message ID: 20231107215742.363031-1-ankur.a.arora@oracle.com (mailing list archive)
Series: Make the kernel preemptible
On Tue, 7 Nov 2023 13:56:46 -0800 Ankur Arora <ankur.a.arora@oracle.com> wrote: > Hi, Hi Ankur, Thanks for doing this! > > We have two models of preemption: voluntary and full (and RT which is > a fuller form of full preemption.) In this series -- which is based > on Thomas' PoC (see [1]), we try to unify the two by letting the > scheduler enforce policy for the voluntary preemption models as well. I would say there's "NONE" which is really just a "voluntary" but with fewer preemption points ;-) But still should be mentioned, otherwise people may get confused. > > (Note that this is about preemption when executing in the kernel. > Userspace is always preemptible.) > > Design > == > > As Thomas outlines in [1], to unify the preemption models we > want to: always have the preempt_count enabled and allow the scheduler > to drive preemption policy based on the model in effect. > > Policies: > > - preemption=none: run to completion > - preemption=voluntary: run to completion, unless a task of higher > sched-class awaits > - preemption=full: optimized for low-latency. Preempt whenever a higher > priority task awaits. > > To do this add a new flag, TIF_NEED_RESCHED_LAZY which allows the > scheduler to mark that a reschedule is needed, but is deferred until > the task finishes executing in the kernel -- voluntary preemption > as it were. > > The TIF_NEED_RESCHED flag is evaluated at all three of the preemption > points. TIF_NEED_RESCHED_LAZY only needs to be evaluated at ret-to-user. > > ret-to-user ret-to-kernel preempt_count() > none Y N N > voluntary Y Y Y > full Y Y Y Wait. The above is for when RESCHED_LAZY is to preempt, right? Then, shouldn't voluntary be: voluntary Y N N For LAZY, but voluntary Y Y Y For NEED_RESCHED (without lazy) That is, the only difference between voluntary and none (as you describe above) is that when an RT task wakes up, on voluntary, it sets NEED_RESCHED, but on none, it still sets NEED_RESCHED_LAZY? > > > There's just one remaining issue: now that explicit preemption points are > gone, processes that spread a long time in the kernel have no way to give > up the CPU. I wonder if this needs to be solved by with a user space knob, to trigger the time that "NEED_RESCHED" will force a schedule? > > For full preemption, that is a non-issue as we always use TIF_NEED_RESCHED. > > For none/voluntary preemption, we handle that by upgrading to TIF_NEED_RESCHED > if a task marked TIF_NEED_RESCHED_LAZY hasn't preempted away by the next tick. > (This would cause preemption either at ret-to-kernel, or if the task is in > a non-preemptible section, when it exits that section.) > > Arguably this provides for much more consistent maximum latency (~2 tick > lengths + length of non-preemptible section) as compared to the old model > where the maximum latency depended on the dynamic distribution of > cond_resched() points. Again, why I think we probably want to set a knob for users to adjust this. Default, it will be set to "tick" but if not, then we need to add another timer to trigger before then. And this would only be available with HRTIMERS of course ;-) > > (As a bonus it handles code that is preemptible but cannot call > cond_resched() completely trivially: ex. long running Xen hypercalls, or > this series which started this discussion: > https://lore.kernel.org/all/20230830184958.2333078-8-ankur.a.arora@oracle.com/) > > > Status > == > > What works: > - The system seems to keep ticking over with the normal scheduling > policies (SCHED_OTHER). 
The support for the realtime policies is somewhat > more half baked.) > - The basic performance numbers seem pretty close to 6.6-rc7 baseline > > What's broken: > - ARCH_NO_PREEMPT (See patch-45 "preempt: ARCH_NO_PREEMPT only preempts > lazily") > - Non-x86 architectures. It's trivial to support other archs (only need > to add TIF_NEED_RESCHED_LAZY) but wanted to hold off until I got some > comments on the series. > (From some testing on arm64, didn't find any surprises.) > - livepatch: livepatch depends on using _cond_resched() to provide > low-latency patching. That is obviously difficult with cond_resched() > gone. We could get a similar effect by using a static_key in > preempt_enable() but at least with inline locks, that might end > up bloating the kernel quite a bit. Maybe if we have that timer, livepatch could set it to be temporarily shorter? > - Documentation/ and comments mention cond_resched() > - ftrace support for need-resched-lazy is incomplete Shouldn't be a problem. > > What needs more discussion: > - Should cond_resched_lock() etc be scheduling out for TIF_NEED_RESCHED > only or both TIF_NEED_RESCHED_LAZY as well? (See patch 35 "thread_info: > change to tif_need_resched(resched_t)") I would say NEED_RESCHED only, then it would match the description of the different models above. > - Tracking whether a task in userspace or in the kernel (See patch-40 > "context_tracking: add ct_state_cpu()") > - The right model for preempt=voluntary. (See patch 44 "sched: voluntary > preemption") > > > Performance > == > > * optimal-load (-j 1024) > > 6.6-rc7 +series > > > wall 139.2 +- 0.3 wall 138.8 +- 0.2 utime 11161.0 +- 3360.4 utime 11061.2 +- 3244.9 stime 1357.6 +- 199.3 stime 1366.6 +- 216.3 %cpu 9108.8 +- 2431.4 %cpu 9081.0 +- 2351.1 csw 2078599 +- 2013320.0 csw 1970610 +- 1969030.0 > > > For both of these the wallclock, utime, stime etc are pretty much > identical. The one interesting difference is that the number of > context switches are fewer. This intuitively makes sense given that > we reschedule threads lazily rather than rescheduling if we encounter > a cond_resched() and there's a thread wanting to be scheduled. > > The max-load numbers (not posted here) also behave similarly. It would be interesting to run any "latency sensitive" benchmarks. I wonder how cyclictest would work under each model with and without this patch? -- Steve
Steven Rostedt <rostedt@goodmis.org> writes: > On Tue, 7 Nov 2023 13:56:46 -0800 > Ankur Arora <ankur.a.arora@oracle.com> wrote: > >> Hi, > > Hi Ankur, > > Thanks for doing this! > >> >> We have two models of preemption: voluntary and full (and RT which is >> a fuller form of full preemption.) In this series -- which is based >> on Thomas' PoC (see [1]), we try to unify the two by letting the >> scheduler enforce policy for the voluntary preemption models as well. > > I would say there's "NONE" which is really just a "voluntary" but with > fewer preemption points ;-) But still should be mentioned, otherwise people > may get confused. > >> >> (Note that this is about preemption when executing in the kernel. >> Userspace is always preemptible.) >> > > >> Design >> == >> >> As Thomas outlines in [1], to unify the preemption models we >> want to: always have the preempt_count enabled and allow the scheduler >> to drive preemption policy based on the model in effect. >> >> Policies: >> >> - preemption=none: run to completion >> - preemption=voluntary: run to completion, unless a task of higher >> sched-class awaits >> - preemption=full: optimized for low-latency. Preempt whenever a higher >> priority task awaits. >> >> To do this add a new flag, TIF_NEED_RESCHED_LAZY which allows the >> scheduler to mark that a reschedule is needed, but is deferred until >> the task finishes executing in the kernel -- voluntary preemption >> as it were. >> >> The TIF_NEED_RESCHED flag is evaluated at all three of the preemption >> points. TIF_NEED_RESCHED_LAZY only needs to be evaluated at ret-to-user. >> >> ret-to-user ret-to-kernel preempt_count() >> none Y N N >> voluntary Y Y Y >> full Y Y Y > > Wait. The above is for when RESCHED_LAZY is to preempt, right? > > Then, shouldn't voluntary be: > > voluntary Y N N > > For LAZY, but > > voluntary Y Y Y > > For NEED_RESCHED (without lazy) Yes. You are, of course, right. I was talking about the TIF_NEED_RESCHED flags and in the middle switched to talking about how the voluntary model will get to what it wants. > That is, the only difference between voluntary and none (as you describe > above) is that when an RT task wakes up, on voluntary, it sets NEED_RESCHED, > but on none, it still sets NEED_RESCHED_LAZY? Yeah exactly. Just to restate without mucking it up: The TIF_NEED_RESCHED flag is evaluated at all three of the preemption points. TIF_NEED_RESCHED_LAZY only needs to be evaluated at ret-to-user. ret-to-user ret-to-kernel preempt_count() NEED_RESCHED_LAZY Y N N NEED_RESCHED Y Y Y Based on how various preemption models set the flag they would cause preemption at: ret-to-user ret-to-kernel preempt_count() none Y N N voluntary Y Y Y full Y Y Y >> The max-load numbers (not posted here) also behave similarly. > > It would be interesting to run any "latency sensitive" benchmarks. > > I wounder how cyclictest would work under each model with and without this > patch? 
Didn't post these numbers because I suspect that code isn't quite right, but voluntary preemption for instance does what it promises: # echo NO_FORCE_PREEMPT > sched/features # echo NO_PREEMPT_PRIORITY > sched/features # preempt=none # stress-ng --cyclic 1 --timeout 10 stress-ng: info: [1214172] setting to a 10 second run per stressor stress-ng: info: [1214172] dispatching hogs: 1 cyclic stress-ng: info: [1214174] cyclic: sched SCHED_DEADLINE: 100000 ns delay, 10000 samples stress-ng: info: [1214174] cyclic: mean: 9834.56 ns, mode: 3495 ns stress-ng: info: [1214174] cyclic: min: 2413 ns, max: 3145065 ns, std.dev. 77096.98 stress-ng: info: [1214174] cyclic: latency percentiles: stress-ng: info: [1214174] cyclic: 25.00%: 3366 ns stress-ng: info: [1214174] cyclic: 50.00%: 3505 ns stress-ng: info: [1214174] cyclic: 75.00%: 3776 ns stress-ng: info: [1214174] cyclic: 90.00%: 4316 ns stress-ng: info: [1214174] cyclic: 95.40%: 10989 ns stress-ng: info: [1214174] cyclic: 99.00%: 91181 ns stress-ng: info: [1214174] cyclic: 99.50%: 290477 ns stress-ng: info: [1214174] cyclic: 99.90%: 1360837 ns stress-ng: info: [1214174] cyclic: 99.99%: 3145065 ns stress-ng: info: [1214172] successful run completed in 10.00s # echo PREEMPT_PRIORITY > features # preempt=voluntary # stress-ng --cyclic 1 --timeout 10 stress-ng: info: [916483] setting to a 10 second run per stressor stress-ng: info: [916483] dispatching hogs: 1 cyclic stress-ng: info: [916484] cyclic: sched SCHED_DEADLINE: 100000 ns delay, 10000 samples stress-ng: info: [916484] cyclic: mean: 3682.77 ns, mode: 3185 ns stress-ng: info: [916484] cyclic: min: 2523 ns, max: 150082 ns, std.dev. 2198.07 stress-ng: info: [916484] cyclic: latency percentiles: stress-ng: info: [916484] cyclic: 25.00%: 3185 ns stress-ng: info: [916484] cyclic: 50.00%: 3306 ns stress-ng: info: [916484] cyclic: 75.00%: 3666 ns stress-ng: info: [916484] cyclic: 90.00%: 4778 ns stress-ng: info: [916484] cyclic: 95.40%: 5359 ns stress-ng: info: [916484] cyclic: 99.00%: 6141 ns stress-ng: info: [916484] cyclic: 99.50%: 7824 ns stress-ng: info: [916484] cyclic: 99.90%: 29825 ns stress-ng: info: [916484] cyclic: 99.99%: 150082 ns stress-ng: info: [916483] successful run completed in 10.01s This is with a background kernbench half-load. Let me see if I can dig out the numbers without this series. -- ankur
On Tue, 07 Nov 2023 15:43:40 -0800 Ankur Arora <ankur.a.arora@oracle.com> wrote: > > The TIF_NEED_RESCHED flag is evaluated at all three of the preemption > points. TIF_NEED_RESCHED_LAZY only needs to be evaluated at ret-to-user. > > ret-to-user ret-to-kernel preempt_count() > NEED_RESCHED_LAZY Y N N > NEED_RESCHED Y Y Y > > Based on how various preemption models set the flag they would cause > preemption at: I would change the above to say "set the NEED_RESCHED flag", as "set the flag" is still ambiguous. Or am I still misunderstanding the below table? > > ret-to-user ret-to-kernel preempt_count() > none Y N N > voluntary Y Y Y > full Y Y Y > -- Steve
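[To make the two tables above concrete, here is a rough C sketch of how a resched_curr()-style helper could pick between the two flags per preemption model. This is not code from the series; the enum, the helper name, and the "higher sched class" check are assumptions loosely modeled on the patch titles "thread_info: change to tif_need_resched(resched_t)" and "sched: handle resched policy in resched_curr()".]

/*
 * Hedged sketch only -- not the series' actual implementation.
 * Assumes two thread_info flags: TIF_NEED_RESCHED (acted on at
 * ret-to-user, ret-to-kernel and preempt_count()==0) and
 * TIF_NEED_RESCHED_LAZY (acted on at ret-to-user only).
 */
enum sketch_preempt_model {
	SKETCH_PREEMPT_NONE,
	SKETCH_PREEMPT_VOLUNTARY,
	SKETCH_PREEMPT_FULL,
};

static void sketch_resched_curr(struct task_struct *curr,
				enum sketch_preempt_model model,
				bool higher_class_waiting)
{
	bool eager;

	switch (model) {
	case SKETCH_PREEMPT_FULL:
		eager = true;				/* always preempt ASAP */
		break;
	case SKETCH_PREEMPT_VOLUNTARY:
		/* eager only when e.g. an RT task of a higher class wakes */
		eager = higher_class_waiting;
		break;
	case SKETCH_PREEMPT_NONE:
	default:
		eager = false;				/* defer to ret-to-user */
		break;
	}

	if (eager)
		set_tsk_thread_flag(curr, TIF_NEED_RESCHED);
	else
		set_tsk_thread_flag(curr, TIF_NEED_RESCHED_LAZY);
}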
The kernel is not preemptible???? What are you smoking? On Tue, 7 Nov 2023, Ankur Arora wrote: > In voluntary models, the scheduler's job is to match the demand > side of preemption points (a task that needs to be scheduled) with > the supply side (a task which calls cond_resched().) Voluntary preemption models are important for code optimization because the code can rely on the scheduler not changing the cpu we are running on. This allows the preempt_enable/disable code to be removed and allows better code generation. The best performing code is generated with defined preemption points when we have a guarantee that the code is not being rescheduled on a different processor. This is f.e. important for consistent access to PER CPU areas. > To do this add a new flag, TIF_NEED_RESCHED_LAZY which allows the > scheduler to mark that a reschedule is needed, but is deferred until > the task finishes executing in the kernel -- voluntary preemption > as it were. That is different from the current no preemption model? Seems to be the same. > There's just one remaining issue: now that explicit preemption points are > gone, processes that spend a long time in the kernel have no way to give > up the CPU. These are needed to avoid adding preempt_enable/disable to a lot of primitives that are used for synchronization. You cannot remove those without changing a lot of synchronization primitives to always have to consider being preempted while operating.
Christoph Lameter <cl@linux.com> writes: > The kernel is not preemptible???? What are you smoking? The title admittedly is a little tongue in check but the point was that a kernel under a voluntary preemption model isn't preemptible. That's what this series attempts to do. Essentially enable PREEMPT_COUNT and PREEMPTION for all preemption models. PREEMPT_COUNT is always enabled with PREEMPT_DYNAMIC as well. There the approach is to toggle which preemption points are used dynamically. Here the idea is to not have statically placed preemption points and let the scheduler decide when preemption is warranted. And the only way to safely do that is by having PREEMPT_COUNT=y. >> In voluntary models, the scheduler's job is to match the demand >> side of preemption points (a task that needs to be scheduled) with >> the supply side (a task which calls cond_resched().) > > Voluntary preemption models are important for code optimization because the code > can rely on the scheduler not changing the cpu we are running on. This allows > removing code for preempt_enable/disable to be removed from the code and allows > better code generation. The best performing code is generated with defined > preemption points when we have a guarantee that the code is not being > rescheduled on a different processor. This is f.e. important for consistent > access to PER CPU areas. Right. This necessitates preempt_enable/preempt_disable() so you get consistent access to the CPU. This came up in an earlier discussion (See https://lore.kernel.org/lkml/87cyyfxd4k.ffs@tglx/) and Thomas mentioned that preempt_enable/_disable() overhead was relatively minimal. Is your point that always-on preempt_count is far too expensive? >> To do this add a new flag, TIF_NEED_RESCHED_LAZY which allows the >> scheduler to mark that a reschedule is needed, but is deferred until >> the task finishes executing in the kernel -- voluntary preemption >> as it were. > > That is different from the current no preemption model? Seems to be the same. >> There's just one remaining issue: now that explicit preemption points are >> gone, processes that spread a long time in the kernel have no way to give >> up the CPU. > > These are needed to avoid adding preempt_enable/disable to a lot of primitives > that are used for synchronization. You cannot remove those without changing a > lot of synchronization primitives to always have to consider being preempted > while operating. I'm afraid I don't understand why you would need to change any synchronization primitives. The code that does preempt_enable/_disable() is compiled out because CONFIG_PREEMPT_NONE/_VOLUNTARY don't define CONFIG_PREEMPT_COUNT. The intent here is to always have CONFIG_PREEMPT_COUNT=y. -- ankur
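[For reference, the behaviour Ankur describes — preempt_disable()/preempt_enable() collapsing to compiler barriers when CONFIG_PREEMPT_COUNT is off — looks roughly like the following. This is a simplified paraphrase of include/linux/preempt.h, not the exact macros; the real ones add tracing hooks and per-arch preempt_count operations.]

#ifdef CONFIG_PREEMPT_COUNT
# define sketch_preempt_disable()				\
	do { preempt_count_inc(); barrier(); } while (0)
# define sketch_preempt_enable()				\
	do {							\
		barrier();					\
		if (unlikely(preempt_count_dec_and_test()))	\
			__preempt_schedule();	/* only with CONFIG_PREEMPTION */ \
	} while (0)
#else
/* !PREEMPT_COUNT (today's none/voluntary): both are just barriers */
# define sketch_preempt_disable()	barrier()
# define sketch_preempt_enable()	barrier()
#endif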
On Tue, 7 Nov 2023, Ankur Arora wrote: > This came up in an earlier discussion (See > https://lore.kernel.org/lkml/87cyyfxd4k.ffs@tglx/) and Thomas mentioned > that preempt_enable/_disable() overhead was relatively minimal. > > Is your point that always-on preempt_count is far too expensive? Yes over the years distros have traditionally delivered their kernels by default without preemption because of these issues. If the overhead has been minimized then that may have changed. Even if so there is still a lot of code being generated that has questionable benefit and just bloats the kernel. >> These are needed to avoid adding preempt_enable/disable to a lot of primitives >> that are used for synchronization. You cannot remove those without changing a >> lot of synchronization primitives to always have to consider being preempted >> while operating. > > I'm afraid I don't understand why you would need to change any > synchronization primitives. The code that does preempt_enable/_disable() > is compiled out because CONFIG_PREEMPT_NONE/_VOLUNTARY don't define > CONFIG_PREEMPT_COUNT. In the trivial cases it is simple like that. But look f.e. in the slub allocator at the #ifdef CONFIG_PREEMPTION section. There is a overhead added to be able to allow the cpu to change under us. There are likely other examples in the source. And the whole business of local data access via per cpu areas suffers if we cannot rely on two accesses in a section being able to see consistent values. > The intent here is to always have CONFIG_PREEMPT_COUNT=y. Just for fun? Code is most efficient if it does not have to consider too many side conditions like suddenly running on a different processor. This introduces needless complexity into the code. It would be better to remove PREEMPT_COUNT for good to just rely on voluntary preemption. We could probably reduce the complexity of the kernel source significantly. I have never noticed a need to preemption at every instruction in the kernel (if that would be possible at all... Locks etc prevent that ideal scenario frequently). Preemption like that is more like a pipe dream. High performance kernel solution usually disable overhead like that.
On Tue, 7 Nov 2023 20:52:39 -0800 (PST) Christoph Lameter <cl@linux.com> wrote: > On Tue, 7 Nov 2023, Ankur Arora wrote: > > > This came up in an earlier discussion (See > > https://lore.kernel.org/lkml/87cyyfxd4k.ffs@tglx/) and Thomas mentioned > > that preempt_enable/_disable() overhead was relatively minimal. > > > > Is your point that always-on preempt_count is far too expensive? > > Yes over the years distros have traditionally delivered their kernels by > default without preemption because of these issues. If the overhead has > been minimized then that may have changed. Even if so there is still a lot > of code being generated that has questionable benefit and just > bloats the kernel. > > >> These are needed to avoid adding preempt_enable/disable to a lot of primitives > >> that are used for synchronization. You cannot remove those without changing a > >> lot of synchronization primitives to always have to consider being preempted > >> while operating. > > > > I'm afraid I don't understand why you would need to change any > > synchronization primitives. The code that does preempt_enable/_disable() > > is compiled out because CONFIG_PREEMPT_NONE/_VOLUNTARY don't define > > CONFIG_PREEMPT_COUNT. > > In the trivial cases it is simple like that. But look f.e. > in the slub allocator at the #ifdef CONFIG_PREEMPTION section. There is a > overhead added to be able to allow the cpu to change under us. There are > likely other examples in the source. > preempt_disable() and preempt_enable() are much lower overhead today than it use to be. If you are worried about changing CPUs, there's also migrate_disable() too. > And the whole business of local data > access via per cpu areas suffers if we cannot rely on two accesses in a > section being able to see consistent values. > > > The intent here is to always have CONFIG_PREEMPT_COUNT=y. > > Just for fun? Code is most efficient if it does not have to consider too > many side conditions like suddenly running on a different processor. This > introduces needless complexity into the code. It would be better to remove > PREEMPT_COUNT for good to just rely on voluntary preemption. We could > probably reduce the complexity of the kernel source significantly. That is what caused this thread in the first place. Randomly scattered "preemption points" does not scale! And I'm sorry, we have latency sensitive use cases that require full preemption. > > I have never noticed a need to preemption at every instruction in the > kernel (if that would be possible at all... Locks etc prevent that ideal > scenario frequently). Preemption like that is more like a pipe dream. > > High performance kernel solution usually disable > overhead like that. > Please read the email from Thomas: https://lore.kernel.org/lkml/87cyyfxd4k.ffs@tglx/ This is not technically getting rid of PREEMPT_NONE. It is adding a new NEED_RESCHED_LAZY flag, that will have the kernel preempt only when entering or in user space. It will behave the same as PREEMPT_NONE, but without the need for all the cond_resched() scattered randomly throughout the kernel. If the task is in the kernel for more than one tick (1ms at 1000Hz, 4ms at 250Hz and 10ms at 100Hz), it will then set NEED_RESCHED, and you will preempt at the next available location (preempt_count == 0). But yes, all locations that do not explicitly disable preemption, will now possibly preempt (due to long running kernel threads). -- Steve
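[A rough sketch of the tick-time escalation Steve describes above. The hook name and the exact bookkeeping are assumptions; the series does something along these lines from the scheduler tick (see patch "sched: force preemption on tick expiration").]

static void sketch_sched_tick_lazy_check(struct task_struct *curr)
{
	if (!test_tsk_thread_flag(curr, TIF_NEED_RESCHED_LAZY))
		return;

	/*
	 * The task was asked to reschedule lazily but is still running
	 * in the kernel one tick later: escalate to TIF_NEED_RESCHED so
	 * it is preempted at the next ret-to-kernel, or at the next
	 * preempt_enable() that drops preempt_count() to zero.
	 */
	set_tsk_thread_flag(curr, TIF_NEED_RESCHED);
	/* On x86 this also needs folding into preempt_count. */
}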
Steven Rostedt <rostedt@goodmis.org> writes: > On Tue, 7 Nov 2023 20:52:39 -0800 (PST) > Christoph Lameter <cl@linux.com> wrote: > >> On Tue, 7 Nov 2023, Ankur Arora wrote: >> >> > This came up in an earlier discussion (See >> > https://lore.kernel.org/lkml/87cyyfxd4k.ffs@tglx/) and Thomas mentioned >> > that preempt_enable/_disable() overhead was relatively minimal. >> > >> > Is your point that always-on preempt_count is far too expensive? >> >> Yes over the years distros have traditionally delivered their kernels by >> default without preemption because of these issues. If the overhead has >> been minimized then that may have changed. Even if so there is still a lot >> of code being generated that has questionable benefit and just >> bloats the kernel. >> >> >> These are needed to avoid adding preempt_enable/disable to a lot of primitives >> >> that are used for synchronization. You cannot remove those without changing a >> >> lot of synchronization primitives to always have to consider being preempted >> >> while operating. >> > >> > I'm afraid I don't understand why you would need to change any >> > synchronization primitives. The code that does preempt_enable/_disable() >> > is compiled out because CONFIG_PREEMPT_NONE/_VOLUNTARY don't define >> > CONFIG_PREEMPT_COUNT. >> >> In the trivial cases it is simple like that. But look f.e. >> in the slub allocator at the #ifdef CONFIG_PREEMPTION section. There is a >> overhead added to be able to allow the cpu to change under us. There are >> likely other examples in the source. >> > > preempt_disable() and preempt_enable() are much lower overhead today than > it use to be. > > If you are worried about changing CPUs, there's also migrate_disable() too. > >> And the whole business of local data >> access via per cpu areas suffers if we cannot rely on two accesses in a >> section being able to see consistent values. >> >> > The intent here is to always have CONFIG_PREEMPT_COUNT=y. >> >> Just for fun? Code is most efficient if it does not have to consider too >> many side conditions like suddenly running on a different processor. This >> introduces needless complexity into the code. It would be better to remove >> PREEMPT_COUNT for good to just rely on voluntary preemption. We could >> probably reduce the complexity of the kernel source significantly. > > That is what caused this thread in the first place. Randomly scattered > "preemption points" does not scale! > > And I'm sorry, we have latency sensitive use cases that require full > preemption. > >> >> I have never noticed a need to preemption at every instruction in the >> kernel (if that would be possible at all... Locks etc prevent that ideal >> scenario frequently). Preemption like that is more like a pipe dream. The intent isn't to preempt at every other instruction in the kernel. As Thomas describes, the idea is that for voluntary preemption kernels resched happens at cond_resched() points which have been distributed heuristically. As a consequence you might get both too little preemption and too much preemption. The intent is to bring preemption in control of the scheduler which can do a better job than randomly placed cond_resched() points. >> High performance kernel solution usually disable >> overhead like that. You are also missing all the ways in which voluntary preemption points are responsible for poor performance. For instance, if you look atq clear_huge_page() it does page by page copy with a cond_resched() call after clearing each page. 
But if you can expose the full extent to the CPU, it can optimize differently (for the 1GB page it can now elide cacheline allocation):

  *Milan*       mm/clear_huge_page   x86/clear_huge_page   change
                      (GB/s)              (GB/s)
  pg-sz=2MB            14.55               19.29           +32.5%
  pg-sz=1GB            19.34               49.60           +156.4%

(See https://lore.kernel.org/all/20230830184958.2333078-1-ankur.a.arora@oracle.com/) > Please read the email from Thomas: > > https://lore.kernel.org/lkml/87cyyfxd4k.ffs@tglx/ > > This is not technically getting rid of PREEMPT_NONE. It is adding a new > NEED_RESCHED_LAZY flag, that will have the kernel preempt only when > entering or in user space. It will behave the same as PREEMPT_NONE, but > without the need for all the cond_resched() scattered randomly throughout > the kernel. And a corollary of that is that with a scheduler controlled PREEMPT_NONE a task might end up running to completion where earlier it could have been preempted early because it crossed a cond_resched(). > If the task is in the kernel for more than one tick (1ms at 1000Hz, 4ms at > 250Hz and 10ms at 100Hz), it will then set NEED_RESCHED, and you will > preempt at the next available location (preempt_count == 0). > > But yes, all locations that do not explicitly disable preemption, will now > possibly preempt (due to long running kernel threads). -- ankur
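[For readers unfamiliar with the clear_huge_page() point above, the contrast is roughly the following. This is an illustrative sketch only, not the actual mm/memory.c code nor the x86 clearing series.]

/* Today's pattern: clear page-at-a-time with a voluntary
 * preemption point per iteration. */
static void sketch_clear_huge_page(struct page *page, unsigned int npages)
{
	unsigned int i;

	for (i = 0; i < npages; i++) {
		clear_highpage(page + i);	/* 4K at a time */
		cond_resched();			/* explicit preemption point */
	}
}

/* With scheduler-driven (lazy) preemption, the whole extent can be
 * handed to one optimized routine, e.g. a single rep-string or
 * non-temporal clear on x86, which is where the 2MB/1GB numbers
 * above come from. */
static void sketch_clear_huge_page_lazy(void *kaddr, unsigned long bytes)
{
	/* Preemptible throughout: NEED_RESCHED(_LAZY) is honoured at the
	 * next preemption point, no explicit calls needed here. */
	memset(kaddr, 0, bytes);	/* stand-in for the optimized clear */
}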
On 07.11.23 22:56, Ankur Arora wrote: > Hi, > > We have two models of preemption: voluntary and full (and RT which is > a fuller form of full preemption.) In this series -- which is based > on Thomas' PoC (see [1]), we try to unify the two by letting the > scheduler enforce policy for the voluntary preemption models as well. > > (Note that this is about preemption when executing in the kernel. > Userspace is always preemptible.) > > Background > == > > Why?: both of these preemption mechanisms are almost entirely disjoint. > There are four main sets of preemption points in the kernel: > > 1. return to user > 2. explicit preemption points (cond_resched() and its ilk) > 3. return to kernel (tick/IPI/irq at irqexit) > 4. end of non-preemptible sections at (preempt_count() == preempt_offset) > > Voluntary preemption uses mechanisms 1 and 2. Full preemption > uses 1, 3 and 4. In addition both use cond_resched_{rcu,lock,rwlock*} > which can be all things to all people because they internally > contain 2, and 4. > > Now since there's no ideal placement of explicit preemption points, > they tend to be randomly spread over code and accumulate over time, > as they are are added when latency problems are seen. Plus fear of > regressions makes them difficult to remove. > (Presumably, asymptotically they would spead out evenly across the > instruction stream!) > > In voluntary models, the scheduler's job is to match the demand > side of preemption points (a task that needs to be scheduled) with > the supply side (a task which calls cond_resched().) > > Full preemption models track preemption count so the scheduler can > always knows if it is safe to preempt and can drive preemption > itself (ex. via dynamic preemption points in 3.) > > Design > == > > As Thomas outlines in [1], to unify the preemption models we > want to: always have the preempt_count enabled and allow the scheduler > to drive preemption policy based on the model in effect. > > Policies: > > - preemption=none: run to completion > - preemption=voluntary: run to completion, unless a task of higher > sched-class awaits > - preemption=full: optimized for low-latency. Preempt whenever a higher > priority task awaits. > > To do this add a new flag, TIF_NEED_RESCHED_LAZY which allows the > scheduler to mark that a reschedule is needed, but is deferred until > the task finishes executing in the kernel -- voluntary preemption > as it were. > > The TIF_NEED_RESCHED flag is evaluated at all three of the preemption > points. TIF_NEED_RESCHED_LAZY only needs to be evaluated at ret-to-user. > > ret-to-user ret-to-kernel preempt_count() > none Y N N > voluntary Y Y Y > full Y Y Y > > > There's just one remaining issue: now that explicit preemption points are > gone, processes that spread a long time in the kernel have no way to give > up the CPU. > > For full preemption, that is a non-issue as we always use TIF_NEED_RESCHED. > > For none/voluntary preemption, we handle that by upgrading to TIF_NEED_RESCHED > if a task marked TIF_NEED_RESCHED_LAZY hasn't preempted away by the next tick. > (This would cause preemption either at ret-to-kernel, or if the task is in > a non-preemptible section, when it exits that section.) > > Arguably this provides for much more consistent maximum latency (~2 tick > lengths + length of non-preemptible section) as compared to the old model > where the maximum latency depended on the dynamic distribution of > cond_resched() points. 
> > (As a bonus it handles code that is preemptible but cannot call cond_resched() > completely trivially: ex. long running Xen hypercalls, or this series > which started this discussion: > https://lore.kernel.org/all/20230830184958.2333078-8-ankur.a.arora@oracle.com/) > > > Status > == > > What works: > - The system seems to keep ticking over with the normal scheduling policies > (SCHED_OTHER). The support for the realtime policies is somewhat more > half baked.) > - The basic performance numbers seem pretty close to 6.6-rc7 baseline > > What's broken: > - ARCH_NO_PREEMPT (See patch-45 "preempt: ARCH_NO_PREEMPT only preempts > lazily") > - Non-x86 architectures. It's trivial to support other archs (only need > to add TIF_NEED_RESCHED_LAZY) but wanted to hold off until I got some > comments on the series. > (From some testing on arm64, didn't find any surprises.) > - livepatch: livepatch depends on using _cond_resched() to provide > low-latency patching. That is obviously difficult with cond_resched() > gone. We could get a similar effect by using a static_key in > preempt_enable() but at least with inline locks, that might be end > up bloating the kernel quite a bit. > - Documentation/ and comments mention cond_resched() > - ftrace support for need-resched-lazy is incomplete > > What needs more discussion: > - Should cond_resched_lock() etc be scheduling out for TIF_NEED_RESCHED > only or both TIF_NEED_RESCHED_LAZY as well? (See patch 35 "thread_info: > change to tif_need_resched(resched_t)") > - Tracking whether a task in userspace or in the kernel (See patch-40 > "context_tracking: add ct_state_cpu()") > - The right model for preempt=voluntary. (See patch 44 "sched: voluntary > preemption") > > > Performance > == > > Expectation: > > * perf sched bench pipe > > preemption full none > > 6.6-rc7 6.68 +- 0.10 6.69 +- 0.07 > +series 6.69 +- 0.12 6.67 +- 0.10 > > This is rescheduling out of idle which should and does perform identically. > > * schbench, preempt=none > > * 1 group, 16 threads each > > 6.6-rc7 +series > (usecs) (usecs) > 50.0th: 6 6 > 90.0th: 8 7 > 99.0th: 11 11 > 99.9th: 15 14 > > * 8 groups, 16 threads each > > 6.6-rc7 +series > (usecs) (usecs) > 50.0th: 6 6 > 90.0th: 8 8 > 99.0th: 12 11 > 99.9th: 20 21 > > > * schbench, preempt=full > > * 1 group, 16 threads each > > 6.6-rc7 +series > (usecs) (usecs) > 50.0th: 6 6 > 90.0th: 8 7 > 99.0th: 11 11 > 99.9th: 14 14 > > > * 8 groups, 16 threads each > > 6.6-rc7 +series > (usecs) (usecs) > 50.0th: 7 7 > 90.0th: 9 9 > 99.0th: 12 12 > 99.9th: 21 22 > > > Not much in it either way. > > * kernbench, preempt=full > > * half-load (-j 128) > > 6.6-rc7 +series > > wall 149.2 +- 27.2 wall 132.8 +- 0.4 > utime 8097.1 +- 57.4 utime 8088.5 +- 14.1 > stime 1165.5 +- 9.4 stime 1159.2 +- 1.9 > %cpu 6337.6 +- 1072.8 %cpu 6959.6 +- 22.8 > csw 237618 +- 2190.6 %csw 240343 +- 1386.8 > > > * optimal-load (-j 1024) > > 6.6-rc7 +series > > wall 137.8 +- 0.0 wall 137.7 +- 0.8 > utime 11115.0 +- 3306.1 utime 11041.7 +- 3235.0 > stime 1340.0 +- 191.3 stime 1323.1 +- 179.5 > %cpu 8846.3 +- 2830.6 %cpu 9101.3 +- 2346.7 > csw 2099910 +- 2040080.0 csw 2068210 +- 2002450.0 > > > The preempt=full path should effectively not see any change in > behaviour. The optimal-loads are pretty much identical. > For the half-load, however, the +series version does much better but that > seems to be because of much higher run to run variability in the 6.6-rc7 load. 
> > * kernbench, preempt=none > > * half-load (-j 128) > > 6.6-rc7 +series > > wall 134.5 +- 4.2 wall 133.6 +- 2.7 > utime 8093.3 +- 39.3 utime 8099.0 +- 38.9 > stime 1175.7 +- 10.6 stime 1169.1 +- 8.4 > %cpu 6893.3 +- 233.2 %cpu 6936.3 +- 142.8 > csw 240723 +- 423.0 %csw 173152 +- 1126.8 > > > * optimal-load (-j 1024) > > 6.6-rc7 +series > > wall 139.2 +- 0.3 wall 138.8 +- 0.2 > utime 11161.0 +- 3360.4 utime 11061.2 +- 3244.9 > stime 1357.6 +- 199.3 stime 1366.6 +- 216.3 > %cpu 9108.8 +- 2431.4 %cpu 9081.0 +- 2351.1 > csw 2078599 +- 2013320.0 csw 1970610 +- 1969030.0 > > > For both of these the wallclock, utime, stime etc are pretty much > identical. The one interesting difference is that the number of > context switches are fewer. This intuitively makes sense given that > we reschedule threads lazily rather than rescheduling if we encounter > a cond_resched() and there's a thread wanting to be scheduled. > > The max-load numbers (not posted here) also behave similarly. > > > Series > == > > With that, this is how he series is laid out: > > - Patches 01-30: revert the PREEMPT_DYNAMIC code. Most of the infrastructure > used by that is via static_calls() and this is a simpler approach which > doesn't need any of that (and does away with cond_resched().) > > Some of the commits will be resurrected. > 089c02ae2771 ("ftrace: Use preemption model accessors for trace header printout") > cfe43f478b79 ("preempt/dynamic: Introduce preemption model accessors") > 5693fa74f98a ("kcsan: Use preemption model accessors") > > - Patches 31-45: contain the scheduler changes to do this. Of these > the critical ones are: > patch 35 "thread_info: change to tif_need_resched(resched_t)" > patch 41 "sched: handle resched policy in resched_curr()" > patch 43 "sched: enable PREEMPT_COUNT, PREEMPTION for all preemption models" > patch 44 "sched: voluntary preemption" > (this needs more work to decide when a higher sched-policy task > should preempt a lower sched-policy task) > patch 45 "preempt: ARCH_NO_PREEMPT only preempts lazily" > > - Patches 47-50: contain RCU related changes. RCU now works in both > PREEMPT_RCU=y and PREEMPT_RCU=n modes with CONFIG_PREEMPTION. > (Until now PREEMPTION=y => PREEMPT_RCU) > > - Patches 51-56,86: contain cond_resched() related cleanups. > patch 54 "sched: add cond_resched_stall()" adds a new cond_resched() > interface. Pitchforks? > > - Patches 57-86: remove cond_resched() from the tree. > > > Also at: github.com/terminus/linux preemption-rfc > > > Please review. 
> > Thanks > Ankur > > [1] https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/ > > > Ankur Arora (86): > Revert "riscv: support PREEMPT_DYNAMIC with static keys" > Revert "sched/core: Make sched_dynamic_mutex static" > Revert "ftrace: Use preemption model accessors for trace header > printout" > Revert "preempt/dynamic: Introduce preemption model accessors" > Revert "kcsan: Use preemption model accessors" > Revert "entry: Fix compile error in > dynamic_irqentry_exit_cond_resched()" > Revert "livepatch,sched: Add livepatch task switching to > cond_resched()" > Revert "arm64: Support PREEMPT_DYNAMIC" > Revert "sched/preempt: Add PREEMPT_DYNAMIC using static keys" > Revert "sched/preempt: Decouple HAVE_PREEMPT_DYNAMIC from > GENERIC_ENTRY" > Revert "sched/preempt: Simplify irqentry_exit_cond_resched() callers" > Revert "sched/preempt: Refactor sched_dynamic_update()" > Revert "sched/preempt: Move PREEMPT_DYNAMIC logic later" > Revert "preempt/dynamic: Fix setup_preempt_mode() return value" > Revert "preempt: Restore preemption model selection configs" > Revert "sched: Provide Kconfig support for default dynamic preempt > mode" > sched/preempt: remove PREEMPT_DYNAMIC from the build version > Revert "preempt/dynamic: Fix typo in macro conditional statement" > Revert "sched,preempt: Move preempt_dynamic to debug.c" > Revert "static_call: Relax static_call_update() function argument > type" > Revert "sched/core: Use -EINVAL in sched_dynamic_mode()" > Revert "sched/core: Stop using magic values in sched_dynamic_mode()" > Revert "sched,x86: Allow !PREEMPT_DYNAMIC" > Revert "sched: Harden PREEMPT_DYNAMIC" > Revert "sched: Add /debug/sched_preempt" > Revert "preempt/dynamic: Support dynamic preempt with preempt= boot > option" > Revert "preempt/dynamic: Provide irqentry_exit_cond_resched() static > call" > Revert "preempt/dynamic: Provide preempt_schedule[_notrace]() static > calls" > Revert "preempt/dynamic: Provide cond_resched() and might_resched() > static calls" > Revert "preempt: Introduce CONFIG_PREEMPT_DYNAMIC" > x86/thread_info: add TIF_NEED_RESCHED_LAZY > entry: handle TIF_NEED_RESCHED_LAZY > entry/kvm: handle TIF_NEED_RESCHED_LAZY > thread_info: accessors for TIF_NEED_RESCHED* > thread_info: change to tif_need_resched(resched_t) > entry: irqentry_exit only preempts TIF_NEED_RESCHED > sched: make test_*_tsk_thread_flag() return bool > sched: *_tsk_need_resched() now takes resched_t > sched: handle lazy resched in set_nr_*_polling() > context_tracking: add ct_state_cpu() > sched: handle resched policy in resched_curr() > sched: force preemption on tick expiration > sched: enable PREEMPT_COUNT, PREEMPTION for all preemption models > sched: voluntary preemption > preempt: ARCH_NO_PREEMPT only preempts lazily > tracing: handle lazy resched > rcu: select PREEMPT_RCU if PREEMPT > rcu: handle quiescent states for PREEMPT_RCU=n > osnoise: handle quiescent states directly > rcu: TASKS_RCU does not need to depend on PREEMPTION > preempt: disallow !PREEMPT_COUNT or !PREEMPTION > sched: remove CONFIG_PREEMPTION from *_needbreak() > sched: fixup __cond_resched_*() > sched: add cond_resched_stall() > xarray: add cond_resched_xas_rcu() and cond_resched_xas_lock_irq() > xarray: use cond_resched_xas*() > coccinelle: script to remove cond_resched() > treewide: x86: remove cond_resched() > treewide: rcu: remove cond_resched() > treewide: torture: remove cond_resched() > treewide: bpf: remove cond_resched() > treewide: trace: remove cond_resched() > treewide: futex: remove cond_resched() > treewide: printk: 
remove cond_resched() > treewide: task_work: remove cond_resched() > treewide: kernel: remove cond_resched() > treewide: kernel: remove cond_reshed() > treewide: mm: remove cond_resched() > treewide: io_uring: remove cond_resched() > treewide: ipc: remove cond_resched() > treewide: lib: remove cond_resched() > treewide: crypto: remove cond_resched() > treewide: security: remove cond_resched() > treewide: fs: remove cond_resched() > treewide: virt: remove cond_resched() > treewide: block: remove cond_resched() > treewide: netfilter: remove cond_resched() > treewide: net: remove cond_resched() > treewide: net: remove cond_resched() > treewide: sound: remove cond_resched() > treewide: md: remove cond_resched() > treewide: mtd: remove cond_resched() > treewide: drm: remove cond_resched() > treewide: net: remove cond_resched() > treewide: drivers: remove cond_resched() > sched: remove cond_resched() I'm missing the removal of the Xen parts, which were one of the reasons to start this whole work (xen_in_preemptible_hcall etc.). Juergen
On 11/8/23 06:12, Steven Rostedt wrote: > On Tue, 7 Nov 2023 20:52:39 -0800 (PST) > Christoph Lameter <cl@linux.com> wrote: > >> On Tue, 7 Nov 2023, Ankur Arora wrote: >> >> > This came up in an earlier discussion (See >> > https://lore.kernel.org/lkml/87cyyfxd4k.ffs@tglx/) and Thomas mentioned >> > that preempt_enable/_disable() overhead was relatively minimal. >> > >> > Is your point that always-on preempt_count is far too expensive? >> >> Yes over the years distros have traditionally delivered their kernels by >> default without preemption because of these issues. If the overhead has >> been minimized then that may have changed. Even if so there is still a lot >> of code being generated that has questionable benefit and just >> bloats the kernel. >> >> >> These are needed to avoid adding preempt_enable/disable to a lot of primitives >> >> that are used for synchronization. You cannot remove those without changing a >> >> lot of synchronization primitives to always have to consider being preempted >> >> while operating. >> > >> > I'm afraid I don't understand why you would need to change any >> > synchronization primitives. The code that does preempt_enable/_disable() >> > is compiled out because CONFIG_PREEMPT_NONE/_VOLUNTARY don't define >> > CONFIG_PREEMPT_COUNT. >> >> In the trivial cases it is simple like that. But look f.e. >> in the slub allocator at the #ifdef CONFIG_PREEMPTION section. There is a >> overhead added to be able to allow the cpu to change under us. There are >> likely other examples in the source. >> > > preempt_disable() and preempt_enable() are much lower overhead today than > it use to be. > > If you are worried about changing CPUs, there's also migrate_disable() too. Note that while migrate_disable() would be often sufficient, the implementation of it has actually more overhead (function call, does preempt_disable()/enable() as part of it) than just preempt_disable(). See for example the pcpu_task_pin() definition in mm/page_alloc.c
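[To illustrate the per-CPU consistency point being debated, here are the two existing patterns side by side, using a made-up per-CPU counter. Note that migrate_disable() alone does not exclude other tasks running on the same CPU between the two accesses, which is part of Christoph's and Vlastimil's concern.]

static DEFINE_PER_CPU(u64, sketch_counter);

static void sketch_update_pinned(void)
{
	u64 *p;

	preempt_disable();		/* no preemption, no migration */
	p = this_cpu_ptr(&sketch_counter);
	*p += 1;			/* both accesses see the same, undisturbed CPU */
	preempt_enable();
}

static void sketch_update_migrate_disabled(void)
{
	u64 *p;

	migrate_disable();		/* may be preempted, won't migrate */
	p = this_cpu_ptr(&sketch_counter);
	*p += 1;			/* must still tolerate preemption in between,
					 * e.g. via this_cpu_add() or a local lock
					 * for anything larger than one update */
	migrate_enable();
}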
On Tue, Nov 07, 2023 at 01:56:46PM -0800, Ankur Arora wrote: > Hi, > > We have two models of preemption: voluntary and full 3, also none (RT is not actually a preemption model). > (and RT which is > a fuller form of full preemption.) It is not in fact a different preemption model, it is the same full preemption, the difference with RT is that it makes a lot more stuff preemptible, but the fundamental preemption model is the same -- full. > In this series -- which is based > on Thomas' PoC (see [1]), we try to unify the two by letting the > scheduler enforce policy for the voluntary preemption models as well. Well, you've also taken out preempt_dynamic for some obscure reason :/ > Please review. > Ankur Arora (86): > Revert "riscv: support PREEMPT_DYNAMIC with static keys" > Revert "sched/core: Make sched_dynamic_mutex static" > Revert "ftrace: Use preemption model accessors for trace header > printout" > Revert "preempt/dynamic: Introduce preemption model accessors" > Revert "kcsan: Use preemption model accessors" > Revert "entry: Fix compile error in > dynamic_irqentry_exit_cond_resched()" > Revert "livepatch,sched: Add livepatch task switching to > cond_resched()" > Revert "arm64: Support PREEMPT_DYNAMIC" > Revert "sched/preempt: Add PREEMPT_DYNAMIC using static keys" > Revert "sched/preempt: Decouple HAVE_PREEMPT_DYNAMIC from > GENERIC_ENTRY" > Revert "sched/preempt: Simplify irqentry_exit_cond_resched() callers" > Revert "sched/preempt: Refactor sched_dynamic_update()" > Revert "sched/preempt: Move PREEMPT_DYNAMIC logic later" > Revert "preempt/dynamic: Fix setup_preempt_mode() return value" > Revert "preempt: Restore preemption model selection configs" > Revert "sched: Provide Kconfig support for default dynamic preempt > mode" > sched/preempt: remove PREEMPT_DYNAMIC from the build version > Revert "preempt/dynamic: Fix typo in macro conditional statement" > Revert "sched,preempt: Move preempt_dynamic to debug.c" > Revert "static_call: Relax static_call_update() function argument > type" > Revert "sched/core: Use -EINVAL in sched_dynamic_mode()" > Revert "sched/core: Stop using magic values in sched_dynamic_mode()" > Revert "sched,x86: Allow !PREEMPT_DYNAMIC" > Revert "sched: Harden PREEMPT_DYNAMIC" > Revert "sched: Add /debug/sched_preempt" > Revert "preempt/dynamic: Support dynamic preempt with preempt= boot > option" > Revert "preempt/dynamic: Provide irqentry_exit_cond_resched() static > call" > Revert "preempt/dynamic: Provide preempt_schedule[_notrace]() static > calls" > Revert "preempt/dynamic: Provide cond_resched() and might_resched() > static calls" > Revert "preempt: Introduce CONFIG_PREEMPT_DYNAMIC" NAK Even if you were to remove PREEMPT_NONE, which should be a separate series, but that isn't on the table at all afaict, removing preempt_dynamic doesn't make sense. You still want the preempt= boot time argument and the /debug/sched/preempt things to dynamically switch between the models. Please, focus on the voluntary thing, gut that and then replace it with the lazy thing, but leave everything else in place. Re dynamic preempt, gutting the current voluntary preemption model means getting rid of the cond_resched and might_resched toggles but you'll gain a switch to kill the lazy stuff.
From: Ankur Arora > Sent: 07 November 2023 21:57 ... > There are four main sets of preemption points in the kernel: > > 1. return to user > 2. explicit preemption points (cond_resched() and its ilk) > 3. return to kernel (tick/IPI/irq at irqexit) > 4. end of non-preemptible sections at (preempt_count() == preempt_offset) > ... > Policies: > > A - preemption=none: run to completion > B - preemption=voluntary: run to completion, unless a task of higher > sched-class awaits > C - preemption=full: optimized for low-latency. Preempt whenever a higher > priority task awaits. If you remove cond_resched() then won't both B and C require an extra IPI. That is probably OK for RT tasks but could get expensive for normal tasks that aren't bound to a specific cpu. I suspect C could also lead to tasks being pre-empted just before they sleep (eg after waking another task). There might already be mitigation for that, I'm not sure if a voluntary sleep can be done in a non-pre-emptible section. Certainly it should all help the scheduling of RT tasks - which can currently get delayed by a non-RT task in a slow kernel path. Although the worst one is the softint code... David
On 11/8/23 09:51, Peter Zijlstra wrote: > On Tue, Nov 07, 2023 at 01:56:46PM -0800, Ankur Arora wrote: >> Hi, >> >> We have two models of preemption: voluntary and full > 3, also none (RT is not actually a preemption model). > >> (and RT which is >> a fuller form of full preemption.) > It is not in fact a different preemption model, it is the same full > preemption, the difference with RT is that it makes a lot more stuff > preemptible, but the fundamental preemption model is the same -- full. +1 -- Daniel
Peter Zijlstra <peterz@infradead.org> writes: > On Tue, Nov 07, 2023 at 01:56:46PM -0800, Ankur Arora wrote: >> Hi, >> >> We have two models of preemption: voluntary and full > > 3, also none (RT is not actually a preemption model). I think we are just using preemption models differently. I was trying to distinguish between how preemption happens: thus voluntary (based on voluntary preemption points), and full (based on preemption state.) >> (and RT which is >> a fuller form of full preemption.) > > It is not in fact a different preemption model, it is the same full > preemption, the difference with RT is that it makes a lot more stuff > preemptible, but the fundamental preemption model is the same -- full. > >> In this series -- which is based >> on Thomas' PoC (see [1]), we try to unify the two by letting the >> scheduler enforce policy for the voluntary preemption models as well. > > Well, you've also taken out preempt_dynamic for some obscure reason :/ Sorry about that :). I didn't mention it because I was using preemption model in the sense I describe above. And to my mind preempt_dynamic is a clever mechanism that switches between other preemption models. >> Please review. > >> Ankur Arora (86): >> Revert "riscv: support PREEMPT_DYNAMIC with static keys" >> Revert "sched/core: Make sched_dynamic_mutex static" >> Revert "ftrace: Use preemption model accessors for trace header >> printout" >> Revert "preempt/dynamic: Introduce preemption model accessors" >> Revert "kcsan: Use preemption model accessors" >> Revert "entry: Fix compile error in >> dynamic_irqentry_exit_cond_resched()" >> Revert "livepatch,sched: Add livepatch task switching to >> cond_resched()" >> Revert "arm64: Support PREEMPT_DYNAMIC" >> Revert "sched/preempt: Add PREEMPT_DYNAMIC using static keys" >> Revert "sched/preempt: Decouple HAVE_PREEMPT_DYNAMIC from >> GENERIC_ENTRY" >> Revert "sched/preempt: Simplify irqentry_exit_cond_resched() callers" >> Revert "sched/preempt: Refactor sched_dynamic_update()" >> Revert "sched/preempt: Move PREEMPT_DYNAMIC logic later" >> Revert "preempt/dynamic: Fix setup_preempt_mode() return value" >> Revert "preempt: Restore preemption model selection configs" >> Revert "sched: Provide Kconfig support for default dynamic preempt >> mode" >> sched/preempt: remove PREEMPT_DYNAMIC from the build version >> Revert "preempt/dynamic: Fix typo in macro conditional statement" >> Revert "sched,preempt: Move preempt_dynamic to debug.c" >> Revert "static_call: Relax static_call_update() function argument >> type" >> Revert "sched/core: Use -EINVAL in sched_dynamic_mode()" >> Revert "sched/core: Stop using magic values in sched_dynamic_mode()" >> Revert "sched,x86: Allow !PREEMPT_DYNAMIC" >> Revert "sched: Harden PREEMPT_DYNAMIC" >> Revert "sched: Add /debug/sched_preempt" >> Revert "preempt/dynamic: Support dynamic preempt with preempt= boot >> option" >> Revert "preempt/dynamic: Provide irqentry_exit_cond_resched() static >> call" >> Revert "preempt/dynamic: Provide preempt_schedule[_notrace]() static >> calls" >> Revert "preempt/dynamic: Provide cond_resched() and might_resched() >> static calls" >> Revert "preempt: Introduce CONFIG_PREEMPT_DYNAMIC" > > NAK > > Even if you were to remove PREEMPT_NONE, which should be a separate > series, but that isn't on the table at all afaict, removing > preempt_dynamic doesn't make sense. Agreed. I don't intend to remove PREEMPT_NONE. And, obviously you do want preempt_dynamic like toggling abilities. 
> You still want the preempt= boot time argument and the > /debug/sched/preempt things to dynamically switch between the models. Also, yes. > Please, focus on the voluntary thing, gut that and then replace it with > the lazy thing, but leave everything else in place. > > Re dynamic preempt, gutting the current voluntary preemption model means > getting rid of the cond_resched and might_resched toggles but you'll > gain a switch to kill the lazy stuff. Yes. I think I mostly agree with you. And, I should have thought this whole revert thing through. Reverting it wasn't really the plan. The plan was to revert these patches temporarily, put in the changes you see in this series, and then pull in the relevant bits of the preempt_dynamic. Only I decided to push that to later. Sigh. On your NAK, what about these patches for instance: >> Revert "riscv: support PREEMPT_DYNAMIC with static keys" >> Revert "livepatch,sched: Add livepatch task switching to >> cond_resched()" >> Revert "arm64: Support PREEMPT_DYNAMIC" >> Revert "sched/preempt: Add PREEMPT_DYNAMIC using static keys" What's the best way to handle these? With the lazy bit, cond_resched() and might_resched() are gone. So we don't need all of the static key inftrastructure for toggling etc. The part of preempt_dynamic that makes sense to me is the one that switches dynamically between none/voluntary/full. Here it would need to be wired onto controls of the lazy bit. (Right now the preemption policy is controlled by sched_feat in patches 43, and 44 but sched/preempt is a much better interface.) -- ankur
On Wed, Nov 08, 2023 at 02:04:02AM -0800, Ankur Arora wrote: > >> Revert "riscv: support PREEMPT_DYNAMIC with static keys" > >> Revert "livepatch,sched: Add livepatch task switching to > >> cond_resched()" > >> Revert "arm64: Support PREEMPT_DYNAMIC" > >> Revert "sched/preempt: Add PREEMPT_DYNAMIC using static keys" > > What's the best way to handle these? With the lazy bit, cond_resched() > and might_resched() are gone. So we don't need all of the static > key inftrastructure for toggling etc. > > The part of preempt_dynamic that makes sense to me is the one that > switches dynamically between none/voluntary/full. Here it would need > to be wired onto controls of the lazy bit. > (Right now the preemption policy is controlled by sched_feat in > patches 43, and 44 but sched/preempt is a much better interface.) I'm not understanding, those should stay obviously. The current preempt_dynamic stuff has 5 toggles: /* * SC:cond_resched * SC:might_resched * SC:preempt_schedule * SC:preempt_schedule_notrace * SC:irqentry_exit_cond_resched * * * NONE: * cond_resched <- __cond_resched * might_resched <- RET0 * preempt_schedule <- NOP * preempt_schedule_notrace <- NOP * irqentry_exit_cond_resched <- NOP * * VOLUNTARY: * cond_resched <- __cond_resched * might_resched <- __cond_resched * preempt_schedule <- NOP * preempt_schedule_notrace <- NOP * irqentry_exit_cond_resched <- NOP * * FULL: * cond_resched <- RET0 * might_resched <- RET0 * preempt_schedule <- preempt_schedule * preempt_schedule_notrace <- preempt_schedule_notrace * irqentry_exit_cond_resched <- irqentry_exit_cond_resched */ If you kill voluntary as we know it today, you can remove cond_resched and might_resched, but the remaining 3 are still needed to switch between NONE and FULL. Additionally, you'll get one new state to enable/disable the LAZY stuff. Neither NONE nor FULL want the LAZY thing on. You'll then end up with something like: /* * SK:preempt_lazy * SC:preempt_schedule * SC:preempt_schedule_notrace * SC:irqentry_exit_cond_resched * * * NONE: * preempt_lazy <- OFF * preempt_schedule <- NOP * preempt_schedule_notrace <- NOP * irqentry_exit_cond_resched <- NOP * * VOLUNTARY: * preempt_lazy <- ON * preempt_schedule <- preempt_schedule * preempt_schedule_notrace <- preempt_schedule_notrace * irqentry_exit_cond_resched <- irqentry_exit_cond_resched * * FULL: * preempt_lazy <- OFF * preempt_schedule <- preempt_schedule * preempt_schedule_notrace <- preempt_schedule_notrace * irqentry_exit_cond_resched <- irqentry_exit_cond_resched */ For the architectures that do not have static_call but instead use static_key for everything, the SC's are obviously static_key based wrappers around the function calls -- like now.
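[A rough sketch of the end state Peter outlines: only a lazy-policy toggle changes with the preemption mode, while the preempt_schedule*/irqentry_exit_cond_resched entry points stay wired up in every mode. The sk_preempt_lazy key and the helpers below are assumptions, not existing kernel symbols; the mode names follow the existing preempt_dynamic enum in kernel/sched/core.c.]

static DEFINE_STATIC_KEY_FALSE(sk_preempt_lazy);

static void sketch_sched_dynamic_update(int mode)
{
	/*
	 * preempt_schedule / preempt_schedule_notrace /
	 * irqentry_exit_cond_resched stay enabled in every mode,
	 * so no static_call_update() is needed for them here.
	 */
	switch (mode) {
	case preempt_dynamic_none:
	case preempt_dynamic_full:
		static_branch_disable(&sk_preempt_lazy);
		break;
	case preempt_dynamic_voluntary:
		static_branch_enable(&sk_preempt_lazy);
		break;
	}
}

/* Scheduler side: consult the key when deciding which TIF flag to set. */
static inline bool sketch_preempt_lazy_enabled(void)
{
	return static_branch_unlikely(&sk_preempt_lazy);
}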
Peter Zijlstra <peterz@infradead.org> writes: > On Wed, Nov 08, 2023 at 02:04:02AM -0800, Ankur Arora wrote: > >> >> Revert "riscv: support PREEMPT_DYNAMIC with static keys" >> >> Revert "livepatch,sched: Add livepatch task switching to >> >> cond_resched()" >> >> Revert "arm64: Support PREEMPT_DYNAMIC" >> >> Revert "sched/preempt: Add PREEMPT_DYNAMIC using static keys" >> >> What's the best way to handle these? With the lazy bit, cond_resched() >> and might_resched() are gone. So we don't need all of the static >> key inftrastructure for toggling etc. >> >> The part of preempt_dynamic that makes sense to me is the one that >> switches dynamically between none/voluntary/full. Here it would need >> to be wired onto controls of the lazy bit. >> (Right now the preemption policy is controlled by sched_feat in >> patches 43, and 44 but sched/preempt is a much better interface.) > > I'm not understanding, those should stay obviously. > > The current preempt_dynamic stuff has 5 toggles: > > /* > * SC:cond_resched > * SC:might_resched > * SC:preempt_schedule > * SC:preempt_schedule_notrace > * SC:irqentry_exit_cond_resched > * > * > * NONE: > * cond_resched <- __cond_resched > * might_resched <- RET0 > * preempt_schedule <- NOP > * preempt_schedule_notrace <- NOP > * irqentry_exit_cond_resched <- NOP > * > * VOLUNTARY: > * cond_resched <- __cond_resched > * might_resched <- __cond_resched > * preempt_schedule <- NOP > * preempt_schedule_notrace <- NOP > * irqentry_exit_cond_resched <- NOP > * > * FULL: > * cond_resched <- RET0 > * might_resched <- RET0 > * preempt_schedule <- preempt_schedule > * preempt_schedule_notrace <- preempt_schedule_notrace > * irqentry_exit_cond_resched <- irqentry_exit_cond_resched > */ > > If you kill voluntary as we know it today, you can remove cond_resched > and might_resched, but the remaining 3 are still needed to switch > between NONE and FULL. Ah now I see what you are saying. Quick thought: even if we were running under NONE, eventually you'll want to forcibly preempt out a CPU hog. So we will need to have at least this one enabled always: > * irqentry_exit_cond_resched <- irqentry_exit_cond_resched These two, it might make sense to toggle them based on model. > * preempt_schedule <- preempt_schedule > * preempt_schedule_notrace <- preempt_schedule_notrace Anyway let me think about this more and respond tomorrow. For now, time for bed. Thanks for clarifying btw. Ankur > Additionally, you'll get one new state to enable/disable the LAZY stuff. > Neither NONE nor FULL want the LAZY thing on. > > You'll then end up with something like: > > /* > * SK:preempt_lazy > * SC:preempt_schedule > * SC:preempt_schedule_notrace > * SC:irqentry_exit_cond_resched > * > * > * NONE: > * preempt_lazy <- OFF > * preempt_schedule <- NOP > * preempt_schedule_notrace <- NOP > * irqentry_exit_cond_resched <- NOP > * > * VOLUNTARY: > * preempt_lazy <- ON > * preempt_schedule <- preempt_schedule > * preempt_schedule_notrace <- preempt_schedule_notrace > * irqentry_exit_cond_resched <- irqentry_exit_cond_resched > * > * FULL: > * preempt_lazy <- OFF > * preempt_schedule <- preempt_schedule > * preempt_schedule_notrace <- preempt_schedule_notrace > * irqentry_exit_cond_resched <- irqentry_exit_cond_resched > */ > > For the architectures that do not have static_call but instead use > static_key for everything, the SC's are obviously static_key based > wrappers around the function calls -- like now.
On Wed, Nov 08, 2023 at 03:00:44AM -0800, Ankur Arora wrote: > Ah now I see what you are saying. > > Quick thought: even if we were running under NONE, eventually you'll Well, NONE, you get what you pay for etc.. > want to forcibly preempt out a CPU hog. So we will need to have > at least this one enabled always: > > > * irqentry_exit_cond_resched <- irqentry_exit_cond_resched > > These two, it might make sense to toggle them based on model. > > > * preempt_schedule <- preempt_schedule > > * preempt_schedule_notrace <- preempt_schedule_notrace > > Anyway let me think about this more and respond tomorrow. That's more or less killing NONE. At that point there's really no point in having it. I would really suggest you start with transforming VOLUNTARY into the LAZY thing, keep it as simple and narrow as possible. Once you've got that done, then you can try and argue that NONE makes no sense and try and take it out. Smaller step etc..
On Wed, Nov 08, 2023 at 12:14:15PM +0100, Peter Zijlstra wrote: > I would really suggest you start with transforming VOLUNTARY into the > LAZY thing, keep it as simple and narrow as possible. Possibly make it worse first, add LAZY as a fourth option, then show it makes VOLUNTARY redundant, kill that. > Once you've got that done, then you can try and argue that NONE makes no > sense and try and take it out. This also pushes out having to deal with the !PREEMPT archs until the very last moment. And once you're here, cond_resched() should be an obvious no-op function and you can go delete them. Anyway, as said, smaller steps more better. Nobody likes 86 patches in their inbox in the morning (or at any time really).
On Wed, 8 Nov 2023 09:43:10 +0000 David Laight <David.Laight@ACULAB.COM> wrote: > > Policies: > > > > A - preemption=none: run to completion > > B - preemption=voluntary: run to completion, unless a task of higher > > sched-class awaits > > C - preemption=full: optimized for low-latency. Preempt whenever a higher > > priority task awaits. > > If you remove cond_resched() then won't both B and C require an extra IPI. > That is probably OK for RT tasks but could get expensive for > normal tasks that aren't bound to a specific cpu. What IPI is extra? > > I suspect C could also lead to tasks being pre-empted just before > they sleep (eg after waking another task). > There might already be mitigation for that, I'm not sure if > a voluntary sleep can be done in a non-pre-emptible section. No, voluntary sleep can not be done in a preemptible section. -- Steve
On Wed, Nov 08 2023 at 11:13, Peter Zijlstra wrote: > On Wed, Nov 08, 2023 at 02:04:02AM -0800, Ankur Arora wrote: > I'm not understanding, those should stay obviously. > > The current preempt_dynamic stuff has 5 toggles: > > /* > * SC:cond_resched > * SC:might_resched > * SC:preempt_schedule > * SC:preempt_schedule_notrace > * SC:irqentry_exit_cond_resched > * > * > * NONE: > * cond_resched <- __cond_resched > * might_resched <- RET0 > * preempt_schedule <- NOP > * preempt_schedule_notrace <- NOP > * irqentry_exit_cond_resched <- NOP > * > * VOLUNTARY: > * cond_resched <- __cond_resched > * might_resched <- __cond_resched > * preempt_schedule <- NOP > * preempt_schedule_notrace <- NOP > * irqentry_exit_cond_resched <- NOP > * > * FULL: > * cond_resched <- RET0 > * might_resched <- RET0 > * preempt_schedule <- preempt_schedule > * preempt_schedule_notrace <- preempt_schedule_notrace > * irqentry_exit_cond_resched <- irqentry_exit_cond_resched > */ > > If you kill voluntary as we know it today, you can remove cond_resched > and might_resched, but the remaining 3 are still needed to switch > between NONE and FULL. No. The whole point of LAZY is to keep preempt_schedule(), preempt_schedule_notrace(), irqentry_exit_cond_resched() always enabled. Look at my PoC: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/ The idea is to always enable preempt count and keep _all_ preemption points enabled. For NONE/VOLUNTARY mode let the scheduler set TIF_NEED_RESCHED_LAZY instead of TIF_NEED_RESCHED. In full mode set TIF_NEED_RESCHED. Here is where the regular and the lazy flags are evaluated: Ret2user Ret2kernel PreemptCnt=0 need_resched() NEED_RESCHED Y Y Y Y LAZY_RESCHED Y N N Y The trick is that LAZY is not folded into preempt_count so a 1->0 counter transition won't cause preempt_schedule() to be invoked because the topmost bit (NEED_RESCHED) is set. The scheduler can still decide to set TIF_NEED_RESCHED which will cause an immediate preemption at the next preemption point. This allows to force out a task which loops, e.g. in a massive copy or clear operation, as it did not reach a point where TIF_NEED_RESCHED_LAZY is evaluated after a time which is defined by the scheduler itself. For my PoC I did: 1) Set TIF_NEED_RESCHED_LAZY 2) Set TIF_NEED_RESCHED when the task did not react on TIF_NEED_RESCHED_LAZY within a tick I know that's crude but it just works and obviously requires quite some refinement. So the way how you switch between preemption modes is to select when the scheduler sets TIF_NEED_RESCHED/TIF_NEED_RESCHED_LAZY. No static call switching at all. In full preemption mode it sets always TIF_NEED_RESCHED and otherwise it uses the LAZY bit first, grants some time and then gets out the hammer and sets TIF_NEED_RESCHED when the task did not reach a LAZY preemption point. Which means once the whole thing is in place then the whole PREEMPT_DYNAMIC along with NONE, VOLUNTARY, FULL can go away along with the cond_resched() hackery. So I think this series is backwards. It should add the LAZY muck with a Kconfig switch like I did in my PoC _first_. Once that is working and agreed on, the existing muck can be removed. Thanks, tglx
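A minimal sketch of the two policy points described above: which bit the scheduler sets, and the crude tick-time escalation. The helper names are made up for illustration, and the real code also has to fold TIF_NEED_RESCHED into the preempt count and send a resched IPI to remote CPUs, both omitted here:

/* Which bit to set when the scheduler wants the current task to yield. */
static void resched_curr_sketch(struct task_struct *curr, bool full_preempt)
{
	if (full_preempt)
		set_tsk_thread_flag(curr, TIF_NEED_RESCHED);       /* next preemption point */
	else
		set_tsk_thread_flag(curr, TIF_NEED_RESCHED_LAZY);  /* ret-to-user only */
}

/*
 * Crude escalation from the scheduler tick: the task ignored the lazy bit
 * for a full tick, so bring out the hammer.
 */
static void tick_escalate_sketch(struct task_struct *curr)
{
	if (test_tsk_thread_flag(curr, TIF_NEED_RESCHED_LAZY) &&
	    !test_tsk_thread_flag(curr, TIF_NEED_RESCHED))
		set_tsk_thread_flag(curr, TIF_NEED_RESCHED);
}

The preemption mode then only changes which branch is taken when setting the bit; no static call retargeting is involved.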
On Wed, Nov 08, 2023 at 04:38:11PM +0100, Thomas Gleixner wrote: > No. The whole point of LAZY is to keep preempt_schedule(), > preempt_schedule_notrace(), irqentry_exit_cond_resched() always enabled. Yeah, I got that. What wasn't at all clear is that it also wanted to replace NONE. 0/n didn't even mention none.
On Wed, 08 Nov 2023 16:38:11 +0100 Thomas Gleixner <tglx@linutronix.de> wrote: > On Wed, Nov 08 2023 at 11:13, Peter Zijlstra wrote: > > On Wed, Nov 08, 2023 at 02:04:02AM -0800, Ankur Arora wrote: > > I'm not understanding, those should stay obviously. > > > > The current preempt_dynamic stuff has 5 toggles: > > > > /* > > * SC:cond_resched > > * SC:might_resched > > * SC:preempt_schedule > > * SC:preempt_schedule_notrace > > * SC:irqentry_exit_cond_resched > > * > > * > > * NONE: > > * cond_resched <- __cond_resched > > * might_resched <- RET0 > > * preempt_schedule <- NOP > > * preempt_schedule_notrace <- NOP > > * irqentry_exit_cond_resched <- NOP > > * > > * VOLUNTARY: > > * cond_resched <- __cond_resched > > * might_resched <- __cond_resched > > * preempt_schedule <- NOP > > * preempt_schedule_notrace <- NOP > > * irqentry_exit_cond_resched <- NOP > > * > > * FULL: > > * cond_resched <- RET0 > > * might_resched <- RET0 > > * preempt_schedule <- preempt_schedule > > * preempt_schedule_notrace <- preempt_schedule_notrace > > * irqentry_exit_cond_resched <- irqentry_exit_cond_resched > > */ > > > > If you kill voluntary as we know it today, you can remove cond_resched > > and might_resched, but the remaining 3 are still needed to switch > > between NONE and FULL. > > No. The whole point of LAZY is to keep preempt_schedule(), > preempt_schedule_notrace(), irqentry_exit_cond_resched() always enabled. Right. * NONE: * cond_resched <- __cond_resched * might_resched <- RET0 * preempt_schedule <- NOP * preempt_schedule_notrace <- NOP * irqentry_exit_cond_resched <- NOP Peter, how can you say we can get rid of cond_resched() in NONE when you show that NONE still uses it? I thought the entire point of this was to get rid of all the cond_resched() and they are there for PREEMPT_NONE as well as VOLUNTARY. As you showed above, the only difference between NONE and VOLUNTARY was the might_sleep. > > Look at my PoC: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/ And I've been saying that many times already ;-) Thanks Thomas for reiterating it. -- Steve
From: Steven Rostedt > Sent: 08 November 2023 15:16 > > On Wed, 8 Nov 2023 09:43:10 +0000 > David Laight <David.Laight@ACULAB.COM> wrote: > > > > Policies: > > > > > > A - preemption=none: run to completion > > > B - preemption=voluntary: run to completion, unless a task of higher > > > sched-class awaits > > > C - preemption=full: optimized for low-latency. Preempt whenever a higher > > > priority task awaits. > > > > If you remove cond_resched() then won't both B and C require an extra IPI. > > That is probably OK for RT tasks but could get expensive for > > normal tasks that aren't bound to a specific cpu. > > What IPI is extra? I was thinking that you wouldn't currently need an IPI if the target cpu was running in-kernel because nothing would happen until cond_resched() was called. > > I suspect C could also lead to tasks being pre-empted just before > > they sleep (eg after waking another task). > > There might already be mitigation for that, I'm not sure if > > a voluntary sleep can be done in a non-pre-emptible section. > > No, voluntary sleep can not be done in a preemptible section. I'm guessing you missed out a negation in that (or s/not/only/). I was thinking about sequences like: wake_up(); ... set_current_state(TASK_UNINTERRUPTIBLE) add_wait_queue(); spin_unlock(); schedule(); Where you really don't want to be pre-empted by the woken up task. For non CONFIG_RT the lock might do it - if held long enough. Otherwise you'll need to have pre-emption disabled and enable it just after the set_current_state(). And then quite likely disable again after the schedule() to balance things out. So having the scheduler save the pre-empt disable count might be useful. David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
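Laid out as code, the window being described looks roughly like the sketch below. The function and its parameters are invented for illustration; this is not a canonical wait pattern, just the sequence above with the preemption-relevant spot marked:

static void wake_then_sleep(struct wait_queue_head *wake_wq,
			    struct wait_queue_head *sleep_wq,
			    struct wait_queue_entry *wait,
			    spinlock_t *lock)
{
	wake_up(wake_wq);		/* may mark us NEED_RESCHED under full preemption */
	/* ... */
	set_current_state(TASK_UNINTERRUPTIBLE);
	add_wait_queue(sleep_wq, wait);
	spin_unlock(lock);		/* preemption point: the woken task can run here,
					 * even though we are about to give up the CPU anyway */
	schedule();
}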
On Tue, Nov 07, 2023 at 01:56:46PM -0800, Ankur Arora wrote: > What's broken: > - ARCH_NO_PREEMPT (See patch-45 "preempt: ARCH_NO_PREEMPT only preempts > lazily") > - Non-x86 architectures. It's trivial to support other archs (only need > to add TIF_NEED_RESCHED_LAZY) but wanted to hold off until I got some > comments on the series. > (From some testing on arm64, didn't find any surprises.) When you say "testing on arm64, didn't find any surprises", I assume you mean with an additional patch adding TIF_NEED_RESCHED_LAZY? Applying this series as-is atop v6.6-rc7 and building defconfig (with GCC 13.2.0) blows up with: | In file included from ./arch/arm64/include/asm/preempt.h:5, | from ./include/linux/preempt.h:79, | from ./include/linux/spinlock.h:56, | from ./include/linux/mmzone.h:8, | from ./include/linux/gfp.h:7, | from ./include/linux/slab.h:16, | from ./include/linux/resource_ext.h:11, | from ./include/linux/acpi.h:13, | from ./include/acpi/apei.h:9, | from ./include/acpi/ghes.h:5, | from ./include/linux/arm_sdei.h:8, | from arch/arm64/kernel/asm-offsets.c:10: | ./include/linux/thread_info.h:63:2: error: #error "Arch needs to define TIF_NEED_RESCHED_LAZY" | 63 | #error "Arch needs to define TIF_NEED_RESCHED_LAZY" | | ^~~~~ | ./include/linux/thread_info.h:66:42: error: 'TIF_NEED_RESCHED_LAZY' undeclared here (not in a function); did you mean 'TIF_NEED_RESCHED'? | 66 | #define TIF_NEED_RESCHED_LAZY_OFFSET (TIF_NEED_RESCHED_LAZY - TIF_NEED_RESCHED) | | ^~~~~~~~~~~~~~~~~~~~~ | ./include/linux/thread_info.h:70:24: note: in expansion of macro 'TIF_NEED_RESCHED_LAZY_OFFSET' | 70 | RESCHED_lazy = TIF_NEED_RESCHED_LAZY_OFFSET, | | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~ | make[2]: *** [scripts/Makefile.build:116: arch/arm64/kernel/asm-offsets.s] Error 1 | make[1]: *** [/home/mark/src/linux/Makefile:1202: prepare0] Error 2 | make: *** [Makefile:234: __sub-make] Error 2 Note that since arm64 doesn't use the generic entry code, that also requires changes to arm64_preempt_schedule_irq() in arch/arm64/kernel/entry-common.c, to handle TIF_NEED_RESCHED_LAZY. > - ftrace support for need-resched-lazy is incomplete What exactly do we need for ftrace here? Thanks, Mark.
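For reference, the missing definition would be a small arch hunk along these lines. This is a hypothetical sketch, not a patch from the series: the bit number is chosen arbitrarily and the _TIF_WORK_MASK is abbreviated:

/* arch/arm64/include/asm/thread_info.h (hypothetical hunk) */
#define TIF_NEED_RESCHED_LAZY	7	/* lazy rescheduling, acted on at ret-to-user */
#define _TIF_NEED_RESCHED_LAZY	(1 << TIF_NEED_RESCHED_LAZY)

#define _TIF_WORK_MASK	(_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY | \
			 _TIF_SIGPENDING | _TIF_NOTIFY_RESUME | \
			 /* ... remaining existing flags ... */ \
			 _TIF_NOTIFY_SIGNAL)

On top of that the arm64 entry code needs to check the new bit on the return-to-user path, which is the part discussed below.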
On Wed, Nov 08, 2023 at 11:22:27AM -0500, Steven Rostedt wrote: > Peter, how can you say we can get rid of cond_resched() in NONE when you Because that would fix none to actually be none. Who cares. > > Look at my PoC: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/ > > And I've been saying that many times already ;-) Why should I look at random patches on the interweb to make sense of these patches here? That just underlines these here patches are not making sense.
On Wed, 8 Nov 2023 17:49:16 +0100 Peter Zijlstra <peterz@infradead.org> wrote: > On Wed, Nov 08, 2023 at 11:22:27AM -0500, Steven Rostedt wrote: > > > Peter, how can you say we can get rid of cond_resched() in NONE when you > > Because that would fix none to actually be none. Who cares. Well, that would lead to regressions with PREEMPT_NONE and the watchdog timer. > > > > Look at my PoC: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/ > > > > And I've been saying that many times already ;-) > > Why should I look at random patches on the interweb to make sense of > these patches here? I actually said it to others, this wasn't really supposed to be addressed to you. > > That just underlines these here patches are not making sense. It's a complex series, and there's a lot of room for improvement. I'm happy to help out here. -- Steve
Thomas Gleixner <tglx@linutronix.de> writes: > On Wed, Nov 08 2023 at 11:13, Peter Zijlstra wrote: >> On Wed, Nov 08, 2023 at 02:04:02AM -0800, Ankur Arora wrote: >> I'm not understanding, those should stay obviously. >> >> The current preempt_dynamic stuff has 5 toggles: >> >> /* >> * SC:cond_resched >> * SC:might_resched >> * SC:preempt_schedule >> * SC:preempt_schedule_notrace >> * SC:irqentry_exit_cond_resched >> * >> * >> * NONE: >> * cond_resched <- __cond_resched >> * might_resched <- RET0 >> * preempt_schedule <- NOP >> * preempt_schedule_notrace <- NOP >> * irqentry_exit_cond_resched <- NOP >> * >> * VOLUNTARY: >> * cond_resched <- __cond_resched >> * might_resched <- __cond_resched >> * preempt_schedule <- NOP >> * preempt_schedule_notrace <- NOP >> * irqentry_exit_cond_resched <- NOP >> * >> * FULL: >> * cond_resched <- RET0 >> * might_resched <- RET0 >> * preempt_schedule <- preempt_schedule >> * preempt_schedule_notrace <- preempt_schedule_notrace >> * irqentry_exit_cond_resched <- irqentry_exit_cond_resched >> */ >> >> If you kill voluntary as we know it today, you can remove cond_resched >> and might_resched, but the remaining 3 are still needed to switch >> between NONE and FULL. > > No. The whole point of LAZY is to keep preempt_schedule(), > preempt_schedule_notrace(), irqentry_exit_cond_resched() always enabled. > > Look at my PoC: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/ > > The idea is to always enable preempt count and keep _all_ preemption > points enabled. > > For NONE/VOLUNTARY mode let the scheduler set TIF_NEED_RESCHED_LAZY > instead of TIF_NEED_RESCHED. In full mode set TIF_NEED_RESCHED. > > Here is where the regular and the lazy flags are evaluated: > > Ret2user Ret2kernel PreemptCnt=0 need_resched() > > NEED_RESCHED Y Y Y Y > LAZY_RESCHED Y N N Y > > The trick is that LAZY is not folded into preempt_count so a 1->0 > counter transition won't cause preempt_schedule() to be invoked because > the topmost bit (NEED_RESCHED) is set. > > The scheduler can still decide to set TIF_NEED_RESCHED which will cause > an immediate preemption at the next preemption point. > > This allows to force out a task which loops, e.g. in a massive copy or > clear operation, as it did not reach a point where TIF_NEED_RESCHED_LAZY > is evaluated after a time which is defined by the scheduler itself. > > For my PoC I did: > > 1) Set TIF_NEED_RESCHED_LAZY > > 2) Set TIF_NEED_RESCHED when the task did not react on > TIF_NEED_RESCHED_LAZY within a tick > > I know that's crude but it just works and obviously requires quite some > refinement. > > So the way how you switch between preemption modes is to select when the > scheduler sets TIF_NEED_RESCHED/TIF_NEED_RESCHED_LAZY. No static call > switching at all. > > In full preemption mode it sets always TIF_NEED_RESCHED and otherwise it > uses the LAZY bit first, grants some time and then gets out the hammer > and sets TIF_NEED_RESCHED when the task did not reach a LAZY preemption > point. > > Which means once the whole thing is in place then the whole > PREEMPT_DYNAMIC along with NONE, VOLUNTARY, FULL can go away along with > the cond_resched() hackery. > > So I think this series is backwards. > > It should add the LAZY muck with a Kconfig switch like I did in my PoC > _first_. Once that is working and agreed on, the existing muck can be > removed. Yeah. I should have done it in the order in your PoC. 
Right now I'm doing all of the stuff you describe above, but because there are far too many structural changes it's not clear to anybody what the code is doing.

Okay, so for the next version let me limit the series to just the scheduler changes, kept orthogonal to the old models (basically a new scheduler model, PREEMPT_AUTO). Once that is agreed on, the other models can be removed (or expressed in terms of PREEMPT_AUTO).

-- ankur
Peter Zijlstra <peterz@infradead.org> writes:

> On Wed, Nov 08, 2023 at 11:22:27AM -0500, Steven Rostedt wrote:
>
>> Peter, how can you say we can get rid of cond_resched() in NONE when you
>
> Because that would fix none to actually be none. Who cares.
>
>> > Look at my PoC: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
>>
>> And I've been saying that many times already ;-)
>
> Why should I look at random patches on the interweb to make sense of
> these patches here?
>
> That just underlines these here patches are not making sense.

Yeah, I'm changing too many structural things all at once. Let me redo this, this time limiting the changes to the scheduler: add a preemption model which introduces the lazy bit and which can behave like preempt=none or preempt=full depending on how the lazy bit is treated.

This keeps the current models as they are. And, once that makes sense to people, then we can decide how best to remove cond_resched() etc.

-- ankur
Mark Rutland <mark.rutland@arm.com> writes:

> On Tue, Nov 07, 2023 at 01:56:46PM -0800, Ankur Arora wrote:
>> What's broken:
>> - ARCH_NO_PREEMPT (See patch-45 "preempt: ARCH_NO_PREEMPT only preempts
>> lazily")
>> - Non-x86 architectures. It's trivial to support other archs (only need
>> to add TIF_NEED_RESCHED_LAZY) but wanted to hold off until I got some
>> comments on the series.
>> (From some testing on arm64, didn't find any surprises.)
>
> When you say "testing on arm64, didn't find any surprises", I assume you mean
> with an additional patch adding TIF_NEED_RESCHED_LAZY?

Yeah. And, handling that in the user exit path.

> Note that since arm64 doesn't use the generic entry code, that also requires
> changes to arm64_preempt_schedule_irq() in arch/arm64/kernel/entry-common.c, to
> handle TIF_NEED_RESCHED_LAZY.

So, the intent (which got muddied due to this overly large series)
was to delay handling TIF_NEED_RESCHED_LAZY until we are about to
return to user.

I think arm64_preempt_schedule_irq() should only handle TIF_NEED_RESCHED
and the _TIF_NEED_RESCHED_LAZY should be handled via _TIF_WORK_MASK
and do_notify_resume().

(The design is much clearer in Thomas' PoC:
https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/)

>> - ftrace support for need-resched-lazy is incomplete
>
> What exactly do we need for ftrace here?

Only support for TIF_NEED_RESCHED_LAZY which should be complete.
That comment was based on a misreading of the code.

Thanks

-- ankur
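A sketch of the split described above, with the lazy bit only acted on in the arm64 return-to-user work loop. This is illustrative only: it heavily simplifies the real do_notify_resume() (DAIF masking, signal and notify-resume work elided) and assumes the hypothetical _TIF_NEED_RESCHED_LAZY/_TIF_WORK_MASK definitions sketched earlier:

static void do_notify_resume_sketch(struct pt_regs *regs, unsigned long thread_flags)
{
	do {
		local_irq_enable();

		/* Either bit means: reschedule before returning to user. */
		if (thread_flags & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY))
			schedule();

		/* ... signal delivery, TIF_NOTIFY_RESUME work, etc. ... */

		local_irq_disable();
		thread_flags = read_thread_flags();
	} while (thread_flags & _TIF_WORK_MASK);
}

arm64_preempt_schedule_irq() would keep keying off the folded NEED_RESCHED state in the preempt count, so it never has to know about the lazy bit.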
On Wed, Nov 08, 2023 at 04:34:41PM -0800, Ankur Arora wrote:
> Mark Rutland <mark.rutland@arm.com> writes:
>
> > On Tue, Nov 07, 2023 at 01:56:46PM -0800, Ankur Arora wrote:
> >> What's broken:
> >> - ARCH_NO_PREEMPT (See patch-45 "preempt: ARCH_NO_PREEMPT only preempts
> >> lazily")
> >> - Non-x86 architectures. It's trivial to support other archs (only need
> >> to add TIF_NEED_RESCHED_LAZY) but wanted to hold off until I got some
> >> comments on the series.
> >> (From some testing on arm64, didn't find any surprises.)
> >
> > When you say "testing on arm64, didn't find any surprises", I assume you mean
> > with an additional patch adding TIF_NEED_RESCHED_LAZY?
>
> Yeah. And, handling that in the user exit path.
>
> > Note that since arm64 doesn't use the generic entry code, that also requires
> > changes to arm64_preempt_schedule_irq() in arch/arm64/kernel/entry-common.c, to
> > handle TIF_NEED_RESCHED_LAZY.
>
> So, the intent (which got muddied due to this overly large series)
> was to delay handling TIF_NEED_RESCHED_LAZY until we are about to
> return to user.

Ah, I missed that detail -- thanks for clarifying!

> I think arm64_preempt_schedule_irq() should only handle TIF_NEED_RESCHED
> and the _TIF_NEED_RESCHED_LAZY should be handled via _TIF_WORK_MASK
> and do_notify_resume().

Digging a bit more, I think that should still work.

One slight clarification: arm64_preempt_schedule_irq() doesn't look at
TIF_NEED_RESCHED today, as it relies on the scheduler IPI calling
preempt_fold_need_resched() to propagate TIF_NEED_RESCHED into
PREEMPT_NEED_RESCHED. That should still work since this series makes
preempt_fold_need_resched() check tif_need_resched(RESCHED_eager).

I was a bit confused because in the generic entry code,
irqentry_exit_cond_resched() explicitly checks for TIF_NEED_RESCHED, and I'm
not sure why it does that rather than relying on the scheduler IPI as above.

> (The design is much clearer in Thomas' PoC:
> https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/)
>
> >> - ftrace support for need-resched-lazy is incomplete
> >
> > What exactly do we need for ftrace here?
>
> Only support for TIF_NEED_RESCHED_LAZY which should be complete.
> That comment was based on a misreading of the code.

Cool; thanks!

Mark.
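For reference, the helper Mark mentions is tiny; with the series' flag argument it would look roughly like this (a sketch: the real helper lives in include/linux/preempt.h as a macro, and RESCHED_eager is the series' spelling of the non-lazy bit):

/*
 * Fold only the "eager" bit into the preempt count; the lazy bit is
 * deliberately left alone so it is only noticed on the way back to user.
 */
static __always_inline void preempt_fold_need_resched(void)
{
	if (tif_need_resched(RESCHED_eager))
		set_preempt_need_resched();
}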
Mark Rutland <mark.rutland@arm.com> writes: > On Wed, Nov 08, 2023 at 04:34:41PM -0800, Ankur Arora wrote: >> Mark Rutland <mark.rutland@arm.com> writes: >> >> > On Tue, Nov 07, 2023 at 01:56:46PM -0800, Ankur Arora wrote: >> >> What's broken: >> >> - ARCH_NO_PREEMPT (See patch-45 "preempt: ARCH_NO_PREEMPT only preempts >> >> lazily") >> >> - Non-x86 architectures. It's trivial to support other archs (only need >> >> to add TIF_NEED_RESCHED_LAZY) but wanted to hold off until I got some >> >> comments on the series. >> >> (From some testing on arm64, didn't find any surprises.) >> > >> > When you say "testing on arm64, didn't find any surprises", I assume you mean >> > with an additional patch adding TIF_NEED_RESCHED_LAZY? >> >> Yeah. And, handling that in the user exit path. >> >> > Note that since arm64 doesn't use the generic entry code, that also requires >> > changes to arm64_preempt_schedule_irq() in arch/arm64/kernel/entry-common.c, to >> > handle TIF_NEED_RESCHED_LAZY. >> >> So, the intent (which got muddied due to this overly large series) >> was to delay handling TIF_NEED_RESCHED_LAZY until we are about to >> return to user. > > Ah, I missed that detail -- thanks for clarifying! > >> I think arm64_preempt_schedule_irq() should only handle TIF_NEED_RESCHED >> and the _TIF_NEED_RESCHED_LAZY should be handled via _TIF_WORK_MASK >> and do_notify_resume(). > > Digging a bit more, I think that should still work. > > One slight clarification: arm64_preempt_schedule_irq() doesn't look at > TIF_NEED_RESCHED today, as it relies on the scheduler IPI calling > preempt_fold_need_resched() to propogate TIF_NEED_RESCHED into > PREEMPT_NEED_RESCHED. That should still work since this series makes > preempt_fold_need_resched() check tif_need_resched(RESCHED_eager). > > I was a bit cnofused because in the generic entry code, > irqentry_exit_cond_resched() explicitly checks for TIF_NEED_RESCHED, and I'm > not sure why it does that rather than relying on the scheduler IPI as above. Yeah I found that confusing as well. I suspect the reason is that not all archs do the folding and we need the explicit check for those that don't. -- ankur