mbox series

[RFC,00/86] Make the kernel preemptible

Message ID 20231107215742.363031-1-ankur.a.arora@oracle.com (mailing list archive)
Headers show
Series Make the kernel preemptible | expand

Message

Ankur Arora Nov. 7, 2023, 9:56 p.m. UTC
Hi,

We have two models of preemption: voluntary and full (and RT which is
a fuller form of full preemption.) In this series -- which is based
on Thomas' PoC (see [1]), we try to unify the two by letting the
scheduler enforce policy for the voluntary preemption models as well.

(Note that this is about preemption when executing in the kernel.
Userspace is always preemptible.)

Background
==

Why?: both of these preemption mechanisms are almost entirely disjoint.
There are four main sets of preemption points in the kernel:

 1. return to user
 2. explicit preemption points (cond_resched() and its ilk)
 3. return to kernel (tick/IPI/irq at irqexit)
 4. end of non-preemptible sections at (preempt_count() == preempt_offset)

Voluntary preemption uses mechanisms 1 and 2. Full preemption
uses 1, 3 and 4. In addition both use cond_resched_{rcu,lock,rwlock*}
which can be all things to all people because they internally
contain 2, and 4.

Now since there's no ideal placement of explicit preemption points,
they tend to be randomly spread over code and accumulate over time,
as they are are added when latency problems are seen. Plus fear of
regressions makes them difficult to remove.
(Presumably, asymptotically they would spead out evenly across the
instruction stream!)

In voluntary models, the scheduler's job is to match the demand
side of preemption points (a task that needs to be scheduled) with
the supply side (a task which calls cond_resched().)

Full preemption models track preemption count so the scheduler can
always knows if it is safe to preempt and can drive preemption
itself (ex. via dynamic preemption points in 3.)

Design
==

As Thomas outlines in [1], to unify the preemption models we
want to: always have the preempt_count enabled and allow the scheduler
to drive preemption policy based on the model in effect.

Policies:

- preemption=none: run to completion
- preemption=voluntary: run to completion, unless a task of higher
  sched-class awaits
- preemption=full: optimized for low-latency. Preempt whenever a higher
  priority task awaits.

To do this add a new flag, TIF_NEED_RESCHED_LAZY which allows the
scheduler to mark that a reschedule is needed, but is deferred until
the task finishes executing in the kernel -- voluntary preemption
as it were.

The TIF_NEED_RESCHED flag is evaluated at all three of the preemption
points. TIF_NEED_RESCHED_LAZY only needs to be evaluated at ret-to-user.

         ret-to-user    ret-to-kernel    preempt_count()
none           Y              N                N
voluntary      Y              Y                Y
full           Y              Y                Y


There's just one remaining issue: now that explicit preemption points are
gone, processes that spread a long time in the kernel have no way to give
up the CPU.

For full preemption, that is a non-issue as we always use TIF_NEED_RESCHED.

For none/voluntary preemption, we handle that by upgrading to TIF_NEED_RESCHED
if a task marked TIF_NEED_RESCHED_LAZY hasn't preempted away by the next tick.
(This would cause preemption either at ret-to-kernel, or if the task is in
a non-preemptible section, when it exits that section.)

Arguably this provides for much more consistent maximum latency (~2 tick
lengths + length of non-preemptible section) as compared to the old model
where the maximum latency depended on the dynamic distribution of
cond_resched() points.

(As a bonus it handles code that is preemptible but cannot call cond_resched()
 completely trivially: ex. long running Xen hypercalls, or this series
 which started this discussion:
 https://lore.kernel.org/all/20230830184958.2333078-8-ankur.a.arora@oracle.com/)


Status
==

What works:
 - The system seems to keep ticking over with the normal scheduling policies
   (SCHED_OTHER). The support for the realtime policies is somewhat more
   half baked.)
 - The basic performance numbers seem pretty close to 6.6-rc7 baseline

What's broken:
 - ARCH_NO_PREEMPT (See patch-45 "preempt: ARCH_NO_PREEMPT only preempts
   lazily")
 - Non-x86 architectures. It's trivial to support other archs (only need
   to add TIF_NEED_RESCHED_LAZY) but wanted to hold off until I got some
   comments on the series.
   (From some testing on arm64, didn't find any surprises.)
 - livepatch: livepatch depends on using _cond_resched() to provide
   low-latency patching. That is obviously difficult with cond_resched()
   gone. We could get a similar effect by using a static_key in
   preempt_enable() but at least with inline locks, that might be end
   up bloating the kernel quite a bit.
 - Documentation/ and comments mention cond_resched()
 - ftrace support for need-resched-lazy is incomplete

What needs more discussion:
 - Should cond_resched_lock() etc be scheduling out for TIF_NEED_RESCHED
   only or both TIF_NEED_RESCHED_LAZY as well? (See patch 35 "thread_info:
   change to tif_need_resched(resched_t)")
 - Tracking whether a task in userspace or in the kernel (See patch-40
   "context_tracking: add ct_state_cpu()")
 - The right model for preempt=voluntary. (See patch 44 "sched: voluntary
   preemption")


Performance
==

Expectation:

* perf sched bench pipe

preemption               full           none

6.6-rc7              6.68 +- 0.10   6.69 +- 0.07
+series              6.69 +- 0.12   6.67 +- 0.10

This is rescheduling out of idle which should and does perform identically.

* schbench, preempt=none

  * 1 group, 16 threads each

                 6.6-rc7      +series  
                 (usecs)      (usecs)
     50.0th:         6            6  
     90.0th:         8            7  
     99.0th:        11           11  
     99.9th:        15           14  
  
  * 8 groups, 16 threads each

                6.6-rc7       +series  
                 (usecs)      (usecs)
     50.0th:         6            6  
     90.0th:         8            8  
     99.0th:        12           11  
     99.9th:        20           21  


* schbench, preempt=full

  * 1 group, 16 threads each

                6.6-rc7       +series  
                (usecs)       (usecs)
     50.0th:         6            6  
     90.0th:         8            7  
     99.0th:        11           11  
     99.9th:        14           14  


  * 8 groups, 16 threads each

                6.6-rc7       +series  
                (usecs)       (usecs)
     50.0th:         7            7  
     90.0th:         9            9  
     99.0th:        12           12  
     99.9th:        21           22  


  Not much in it either way.

* kernbench, preempt=full

  * half-load (-j 128)

           6.6-rc7                                    +series                     

  wall        149.2  +-     27.2             wall        132.8  +-     0.4
  utime      8097.1  +-     57.4             utime      8088.5  +-    14.1
  stime      1165.5  +-      9.4             stime      1159.2  +-     1.9
  %cpu       6337.6  +-   1072.8             %cpu       6959.6  +-    22.8
  csw      237618    +-   2190.6             %csw     240343    +-  1386.8


  * optimal-load (-j 1024)

           6.6-rc7                                    +series                     

  wall        137.8 +-       0.0             wall       137.7  +-       0.8
  utime     11115.0 +-    3306.1             utime    11041.7  +-    3235.0
  stime      1340.0 +-     191.3             stime     1323.1  +-     179.5
  %cpu       8846.3 +-    2830.6             %cpu      9101.3  +-    2346.7
  csw     2099910   +- 2040080.0             csw    2068210    +- 2002450.0


  The preempt=full path should effectively not see any change in
  behaviour. The optimal-loads are pretty much identical.
  For the half-load, however, the +series version does much better but that
  seems to be because of much higher run to run variability in the 6.6-rc7 load.

* kernbench, preempt=none

  * half-load (-j 128)

           6.6-rc7                                    +series                     

  wall        134.5  +-      4.2             wall        133.6  +-     2.7
  utime      8093.3  +-     39.3             utime      8099.0  +-    38.9
  stime      1175.7  +-     10.6             stime      1169.1  +-     8.4
  %cpu       6893.3  +-    233.2             %cpu       6936.3  +-   142.8
  csw      240723    +-    423.0             %csw     173152    +-  1126.8
                                             

  * optimal-load (-j 1024)

           6.6-rc7                                    +series                     

  wall        139.2 +-       0.3             wall       138.8  +-       0.2
  utime     11161.0 +-    3360.4             utime    11061.2  +-    3244.9
  stime      1357.6 +-     199.3             stime     1366.6  +-     216.3
  %cpu       9108.8 +-    2431.4             %cpu      9081.0  +-    2351.1
  csw     2078599   +- 2013320.0             csw    1970610    +- 1969030.0


  For both of these the wallclock, utime, stime etc are pretty much
  identical. The one interesting difference is that the number of
  context switches are fewer. This intuitively makes sense given that
  we reschedule threads lazily rather than rescheduling if we encounter
  a cond_resched() and there's a thread wanting to be scheduled.

  The max-load numbers (not posted here) also behave similarly.


Series
==

With that, this is how he series is laid out:

 - Patches 01-30: revert the PREEMPT_DYNAMIC code. Most of the infrastructure
   used by that is via static_calls() and this is a simpler approach which
   doesn't need any of that (and does away with cond_resched().)

   Some of the commits will be resurrected.
       089c02ae2771 ("ftrace: Use preemption model accessors for trace header printout")
       cfe43f478b79 ("preempt/dynamic: Introduce preemption model accessors")
       5693fa74f98a ("kcsan: Use preemption model accessors")

 - Patches 31-45: contain the scheduler changes to do this. Of these
   the critical ones are:
     patch 35 "thread_info: change to tif_need_resched(resched_t)"
     patch 41 "sched: handle resched policy in resched_curr()"
     patch 43 "sched: enable PREEMPT_COUNT, PREEMPTION for all preemption models"
     patch 44 "sched: voluntary preemption"
      (this needs more work to decide when a higher sched-policy task
       should preempt a lower sched-policy task)
     patch 45 "preempt: ARCH_NO_PREEMPT only preempts lazily"

 - Patches 47-50: contain RCU related changes. RCU now works in both
   PREEMPT_RCU=y and PREEMPT_RCU=n modes with CONFIG_PREEMPTION.
   (Until now PREEMPTION=y => PREEMPT_RCU)

 - Patches 51-56,86: contain cond_resched() related cleanups.
     patch 54 "sched: add cond_resched_stall()" adds a new cond_resched()
     interface. Pitchforks?

 - Patches 57-86: remove cond_resched() from the tree.


Also at: github.com/terminus/linux preemption-rfc


Please review.

Thanks
Ankur

[1] https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/


Ankur Arora (86):
  Revert "riscv: support PREEMPT_DYNAMIC with static keys"
  Revert "sched/core: Make sched_dynamic_mutex static"
  Revert "ftrace: Use preemption model accessors for trace header
    printout"
  Revert "preempt/dynamic: Introduce preemption model accessors"
  Revert "kcsan: Use preemption model accessors"
  Revert "entry: Fix compile error in
    dynamic_irqentry_exit_cond_resched()"
  Revert "livepatch,sched: Add livepatch task switching to
    cond_resched()"
  Revert "arm64: Support PREEMPT_DYNAMIC"
  Revert "sched/preempt: Add PREEMPT_DYNAMIC using static keys"
  Revert "sched/preempt: Decouple HAVE_PREEMPT_DYNAMIC from
    GENERIC_ENTRY"
  Revert "sched/preempt: Simplify irqentry_exit_cond_resched() callers"
  Revert "sched/preempt: Refactor sched_dynamic_update()"
  Revert "sched/preempt: Move PREEMPT_DYNAMIC logic later"
  Revert "preempt/dynamic: Fix setup_preempt_mode() return value"
  Revert "preempt: Restore preemption model selection configs"
  Revert "sched: Provide Kconfig support for default dynamic preempt
    mode"
  sched/preempt: remove PREEMPT_DYNAMIC from the build version
  Revert "preempt/dynamic: Fix typo in macro conditional statement"
  Revert "sched,preempt: Move preempt_dynamic to debug.c"
  Revert "static_call: Relax static_call_update() function argument
    type"
  Revert "sched/core: Use -EINVAL in sched_dynamic_mode()"
  Revert "sched/core: Stop using magic values in sched_dynamic_mode()"
  Revert "sched,x86: Allow !PREEMPT_DYNAMIC"
  Revert "sched: Harden PREEMPT_DYNAMIC"
  Revert "sched: Add /debug/sched_preempt"
  Revert "preempt/dynamic: Support dynamic preempt with preempt= boot
    option"
  Revert "preempt/dynamic: Provide irqentry_exit_cond_resched() static
    call"
  Revert "preempt/dynamic: Provide preempt_schedule[_notrace]() static
    calls"
  Revert "preempt/dynamic: Provide cond_resched() and might_resched()
    static calls"
  Revert "preempt: Introduce CONFIG_PREEMPT_DYNAMIC"
  x86/thread_info: add TIF_NEED_RESCHED_LAZY
  entry: handle TIF_NEED_RESCHED_LAZY
  entry/kvm: handle TIF_NEED_RESCHED_LAZY
  thread_info: accessors for TIF_NEED_RESCHED*
  thread_info: change to tif_need_resched(resched_t)
  entry: irqentry_exit only preempts TIF_NEED_RESCHED
  sched: make test_*_tsk_thread_flag() return bool
  sched: *_tsk_need_resched() now takes resched_t
  sched: handle lazy resched in set_nr_*_polling()
  context_tracking: add ct_state_cpu()
  sched: handle resched policy in resched_curr()
  sched: force preemption on tick expiration
  sched: enable PREEMPT_COUNT, PREEMPTION for all preemption models
  sched: voluntary preemption
  preempt: ARCH_NO_PREEMPT only preempts lazily
  tracing: handle lazy resched
  rcu: select PREEMPT_RCU if PREEMPT
  rcu: handle quiescent states for PREEMPT_RCU=n
  osnoise: handle quiescent states directly
  rcu: TASKS_RCU does not need to depend on PREEMPTION
  preempt: disallow !PREEMPT_COUNT or !PREEMPTION
  sched: remove CONFIG_PREEMPTION from *_needbreak()
  sched: fixup __cond_resched_*()
  sched: add cond_resched_stall()
  xarray: add cond_resched_xas_rcu() and cond_resched_xas_lock_irq()
  xarray: use cond_resched_xas*()
  coccinelle: script to remove cond_resched()
  treewide: x86: remove cond_resched()
  treewide: rcu: remove cond_resched()
  treewide: torture: remove cond_resched()
  treewide: bpf: remove cond_resched()
  treewide: trace: remove cond_resched()
  treewide: futex: remove cond_resched()
  treewide: printk: remove cond_resched()
  treewide: task_work: remove cond_resched()
  treewide: kernel: remove cond_resched()
  treewide: kernel: remove cond_reshed()
  treewide: mm: remove cond_resched()
  treewide: io_uring: remove cond_resched()
  treewide: ipc: remove cond_resched()
  treewide: lib: remove cond_resched()
  treewide: crypto: remove cond_resched()
  treewide: security: remove cond_resched()
  treewide: fs: remove cond_resched()
  treewide: virt: remove cond_resched()
  treewide: block: remove cond_resched()
  treewide: netfilter: remove cond_resched()
  treewide: net: remove cond_resched()
  treewide: net: remove cond_resched()
  treewide: sound: remove cond_resched()
  treewide: md: remove cond_resched()
  treewide: mtd: remove cond_resched()
  treewide: drm: remove cond_resched()
  treewide: net: remove cond_resched()
  treewide: drivers: remove cond_resched()
  sched: remove cond_resched()

 .../admin-guide/kernel-parameters.txt         |   7 -
 arch/Kconfig                                  |  42 +-
 arch/arm64/Kconfig                            |   1 -
 arch/arm64/include/asm/preempt.h              |  19 +-
 arch/arm64/kernel/entry-common.c              |  10 +-
 arch/riscv/Kconfig                            |   1 -
 arch/s390/include/asm/preempt.h               |   4 +-
 arch/x86/Kconfig                              |   1 -
 arch/x86/include/asm/preempt.h                |  50 +-
 arch/x86/include/asm/thread_info.h            |   6 +-
 arch/x86/kernel/alternative.c                 |  10 -
 arch/x86/kernel/cpu/sgx/encl.c                |  14 +-
 arch/x86/kernel/cpu/sgx/ioctl.c               |   3 -
 arch/x86/kernel/cpu/sgx/main.c                |   5 -
 arch/x86/kernel/cpu/sgx/virt.c                |   4 -
 arch/x86/kvm/lapic.c                          |   6 +-
 arch/x86/kvm/mmu/mmu.c                        |   2 +-
 arch/x86/kvm/svm/sev.c                        |   5 +-
 arch/x86/net/bpf_jit_comp.c                   |   1 -
 arch/x86/net/bpf_jit_comp32.c                 |   1 -
 arch/x86/xen/mmu_pv.c                         |   1 -
 block/blk-cgroup.c                            |   2 -
 block/blk-lib.c                               |  11 -
 block/blk-mq.c                                |   3 -
 block/blk-zoned.c                             |   6 -
 crypto/internal.h                             |   2 +-
 crypto/tcrypt.c                               |   5 -
 crypto/testmgr.c                              |  10 -
 drivers/accel/ivpu/ivpu_drv.c                 |   2 -
 drivers/accel/ivpu/ivpu_gem.c                 |   1 -
 drivers/accel/ivpu/ivpu_pm.c                  |   8 +-
 drivers/accel/qaic/qaic_data.c                |   2 -
 drivers/acpi/processor_idle.c                 |   2 +-
 drivers/auxdisplay/charlcd.c                  |  11 -
 drivers/base/power/domain.c                   |   1 -
 drivers/block/aoe/aoecmd.c                    |   3 +-
 drivers/block/brd.c                           |   1 -
 drivers/block/drbd/drbd_bitmap.c              |   4 -
 drivers/block/drbd/drbd_debugfs.c             |   1 -
 drivers/block/loop.c                          |   3 -
 drivers/block/xen-blkback/blkback.c           |   3 -
 drivers/block/zram/zram_drv.c                 |   2 -
 drivers/bluetooth/virtio_bt.c                 |   1 -
 drivers/char/hw_random/arm_smccc_trng.c       |   1 -
 drivers/char/lp.c                             |   2 -
 drivers/char/mem.c                            |   4 -
 drivers/char/mwave/3780i.c                    |   4 +-
 drivers/char/ppdev.c                          |   4 -
 drivers/char/random.c                         |   2 -
 drivers/char/virtio_console.c                 |   1 -
 drivers/crypto/virtio/virtio_crypto_core.c    |   1 -
 drivers/cxl/pci.c                             |   1 -
 drivers/dma-buf/selftest.c                    |   1 -
 drivers/dma-buf/st-dma-fence-chain.c          |   1 -
 drivers/fsi/fsi-sbefifo.c                     |  14 +-
 drivers/gpu/drm/bridge/samsung-dsim.c         |   2 +-
 drivers/gpu/drm/drm_buddy.c                   |   1 -
 drivers/gpu/drm/drm_gem.c                     |   1 -
 .../gpu/drm/i915/gem/i915_gem_execbuffer.c    |   2 +-
 drivers/gpu/drm/i915/gem/i915_gem_object.c    |   1 -
 drivers/gpu/drm/i915/gem/i915_gem_shmem.c     |   2 -
 .../gpu/drm/i915/gem/selftests/huge_pages.c   |   6 -
 .../drm/i915/gem/selftests/i915_gem_mman.c    |   5 -
 drivers/gpu/drm/i915/gt/intel_breadcrumbs.c   |   2 +-
 drivers/gpu/drm/i915/gt/intel_gt.c            |   2 +-
 drivers/gpu/drm/i915/gt/intel_migrate.c       |   4 -
 drivers/gpu/drm/i915/gt/selftest_execlists.c  |   4 -
 drivers/gpu/drm/i915/gt/selftest_hangcheck.c  |   2 -
 drivers/gpu/drm/i915/gt/selftest_lrc.c        |   2 -
 drivers/gpu/drm/i915/gt/selftest_migrate.c    |   2 -
 drivers/gpu/drm/i915/gt/selftest_timeline.c   |   4 -
 drivers/gpu/drm/i915/i915_active.c            |   2 +-
 drivers/gpu/drm/i915/i915_gem_evict.c         |   2 -
 drivers/gpu/drm/i915/i915_gpu_error.c         |  18 +-
 drivers/gpu/drm/i915/intel_uncore.c           |   1 -
 drivers/gpu/drm/i915/selftests/i915_gem_gtt.c |   2 -
 drivers/gpu/drm/i915/selftests/i915_request.c |   2 -
 .../gpu/drm/i915/selftests/i915_selftest.c    |   3 -
 drivers/gpu/drm/i915/selftests/i915_vma.c     |   9 -
 .../gpu/drm/i915/selftests/igt_flush_test.c   |   2 -
 .../drm/i915/selftests/intel_memory_region.c  |   4 -
 drivers/gpu/drm/tests/drm_buddy_test.c        |   5 -
 drivers/gpu/drm/tests/drm_mm_test.c           |  29 -
 drivers/i2c/busses/i2c-bcm-iproc.c            |   9 +-
 drivers/i2c/busses/i2c-highlander.c           |   9 +-
 drivers/i2c/busses/i2c-ibm_iic.c              |  11 +-
 drivers/i2c/busses/i2c-mpc.c                  |   2 +-
 drivers/i2c/busses/i2c-mxs.c                  |   9 +-
 drivers/i2c/busses/scx200_acb.c               |   9 +-
 drivers/infiniband/core/umem.c                |   1 -
 drivers/infiniband/hw/hfi1/driver.c           |   1 -
 drivers/infiniband/hw/hfi1/firmware.c         |   2 +-
 drivers/infiniband/hw/hfi1/init.c             |   1 -
 drivers/infiniband/hw/hfi1/ruc.c              |   1 -
 drivers/infiniband/hw/hns/hns_roce_hw_v2.c    |   5 +-
 drivers/infiniband/hw/qib/qib_init.c          |   1 -
 drivers/infiniband/sw/rxe/rxe_qp.c            |   3 +-
 drivers/infiniband/sw/rxe/rxe_task.c          |   4 +-
 drivers/input/evdev.c                         |   1 -
 drivers/input/keyboard/clps711x-keypad.c      |   2 +-
 drivers/input/misc/uinput.c                   |   1 -
 drivers/input/mousedev.c                      |   1 -
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c   |   2 -
 drivers/md/bcache/btree.c                     |   5 -
 drivers/md/bcache/journal.c                   |   2 -
 drivers/md/bcache/sysfs.c                     |   1 -
 drivers/md/bcache/writeback.c                 |   2 -
 drivers/md/dm-bufio.c                         |  14 -
 drivers/md/dm-cache-target.c                  |   4 -
 drivers/md/dm-crypt.c                         |   3 -
 drivers/md/dm-integrity.c                     |   3 -
 drivers/md/dm-kcopyd.c                        |   2 -
 drivers/md/dm-snap.c                          |   1 -
 drivers/md/dm-stats.c                         |   8 -
 drivers/md/dm-thin.c                          |   2 -
 drivers/md/dm-writecache.c                    |  11 -
 drivers/md/dm.c                               |   4 -
 drivers/md/md.c                               |   1 -
 drivers/md/raid1.c                            |   2 -
 drivers/md/raid10.c                           |   3 -
 drivers/md/raid5.c                            |   2 -
 drivers/media/i2c/vpx3220.c                   |   3 -
 drivers/media/pci/cobalt/cobalt-i2c.c         |   4 +-
 drivers/misc/bcm-vk/bcm_vk_dev.c              |   3 +-
 drivers/misc/bcm-vk/bcm_vk_msg.c              |   3 +-
 drivers/misc/genwqe/card_base.c               |   3 +-
 drivers/misc/genwqe/card_ddcb.c               |   6 -
 drivers/misc/genwqe/card_dev.c                |   2 -
 drivers/misc/vmw_balloon.c                    |   4 -
 drivers/mmc/host/mmc_spi.c                    |   3 -
 drivers/mtd/chips/cfi_cmdset_0001.c           |   6 -
 drivers/mtd/chips/cfi_cmdset_0002.c           |   1 -
 drivers/mtd/chips/cfi_util.c                  |   2 +-
 drivers/mtd/devices/spear_smi.c               |   2 +-
 drivers/mtd/devices/sst25l.c                  |   3 +-
 drivers/mtd/devices/st_spi_fsm.c              |   4 -
 drivers/mtd/inftlcore.c                       |   5 -
 drivers/mtd/lpddr/lpddr_cmds.c                |   6 +-
 drivers/mtd/mtd_blkdevs.c                     |   1 -
 drivers/mtd/nand/onenand/onenand_base.c       |  18 +-
 drivers/mtd/nand/onenand/onenand_samsung.c    |   8 +-
 drivers/mtd/nand/raw/diskonchip.c             |   4 +-
 drivers/mtd/nand/raw/fsmc_nand.c              |   3 +-
 drivers/mtd/nand/raw/hisi504_nand.c           |   2 +-
 drivers/mtd/nand/raw/nand_base.c              |   3 +-
 drivers/mtd/nand/raw/nand_legacy.c            |  17 +-
 drivers/mtd/spi-nor/core.c                    |   8 +-
 drivers/mtd/tests/mtd_test.c                  |   2 -
 drivers/mtd/tests/mtd_test.h                  |   2 +-
 drivers/mtd/tests/pagetest.c                  |   1 -
 drivers/mtd/tests/readtest.c                  |   2 -
 drivers/mtd/tests/torturetest.c               |   1 -
 drivers/mtd/ubi/attach.c                      |  10 -
 drivers/mtd/ubi/build.c                       |   2 -
 drivers/mtd/ubi/cdev.c                        |   4 -
 drivers/mtd/ubi/eba.c                         |   8 -
 drivers/mtd/ubi/misc.c                        |   2 -
 drivers/mtd/ubi/vtbl.c                        |   6 -
 drivers/mtd/ubi/wl.c                          |  13 -
 drivers/net/dummy.c                           |   1 -
 drivers/net/ethernet/broadcom/tg3.c           |   2 +-
 drivers/net/ethernet/intel/e1000/e1000_hw.c   |   3 -
 drivers/net/ethernet/mediatek/mtk_eth_soc.c   |   2 +-
 drivers/net/ethernet/mellanox/mlx4/catas.c    |   2 +-
 drivers/net/ethernet/mellanox/mlx4/cmd.c      |  13 +-
 .../ethernet/mellanox/mlx4/resource_tracker.c |   9 +-
 drivers/net/ethernet/mellanox/mlx5/core/cmd.c |   4 +-
 drivers/net/ethernet/mellanox/mlx5/core/fw.c  |   3 +-
 drivers/net/ethernet/mellanox/mlxsw/i2c.c     |   5 -
 drivers/net/ethernet/mellanox/mlxsw/pci.c     |   2 -
 drivers/net/ethernet/pasemi/pasemi_mac.c      |   3 -
 .../ethernet/qlogic/netxen/netxen_nic_init.c  |   2 -
 .../ethernet/qlogic/qlcnic/qlcnic_83xx_init.c |   1 -
 .../net/ethernet/qlogic/qlcnic/qlcnic_init.c  |   1 -
 .../ethernet/qlogic/qlcnic/qlcnic_minidump.c  |   2 -
 drivers/net/ethernet/sfc/falcon/falcon.c      |   6 -
 drivers/net/ifb.c                             |   1 -
 drivers/net/ipvlan/ipvlan_core.c              |   1 -
 drivers/net/macvlan.c                         |   2 -
 drivers/net/mhi_net.c                         |   4 +-
 drivers/net/netdevsim/fib.c                   |   1 -
 drivers/net/virtio_net.c                      |   2 -
 drivers/net/wireguard/ratelimiter.c           |   2 -
 drivers/net/wireguard/receive.c               |   3 -
 drivers/net/wireguard/send.c                  |   4 -
 drivers/net/wireless/broadcom/b43/lo.c        |   6 +-
 drivers/net/wireless/broadcom/b43/pio.c       |   1 -
 drivers/net/wireless/broadcom/b43legacy/phy.c |   5 -
 .../broadcom/brcm80211/brcmfmac/cfg80211.c    |   1 -
 drivers/net/wireless/cisco/airo.c             |   2 -
 .../net/wireless/intel/iwlwifi/pcie/trans.c   |   2 -
 drivers/net/wireless/marvell/mwl8k.c          |   2 -
 drivers/net/wireless/mediatek/mt76/util.c     |   1 -
 drivers/net/wwan/mhi_wwan_mbim.c              |   2 +-
 drivers/net/wwan/t7xx/t7xx_hif_dpmaif_tx.c    |   3 -
 drivers/net/xen-netback/netback.c             |   1 -
 drivers/net/xen-netback/rx.c                  |   2 -
 drivers/nvdimm/btt.c                          |   2 -
 drivers/nvme/target/zns.c                     |   2 -
 drivers/parport/parport_ip32.c                |   1 -
 drivers/parport/parport_pc.c                  |   4 -
 drivers/pci/pci-sysfs.c                       |   1 -
 drivers/pci/proc.c                            |   1 -
 .../intel/speed_select_if/isst_if_mbox_pci.c  |   4 +-
 drivers/s390/cio/css.c                        |   8 -
 drivers/scsi/NCR5380.c                        |   2 -
 drivers/scsi/megaraid.c                       |   1 -
 drivers/scsi/qedi/qedi_main.c                 |   1 -
 drivers/scsi/qla2xxx/qla_nx.c                 |   2 -
 drivers/scsi/qla2xxx/qla_sup.c                |   5 -
 drivers/scsi/qla4xxx/ql4_nx.c                 |   1 -
 drivers/scsi/xen-scsifront.c                  |   2 +-
 drivers/spi/spi-lantiq-ssc.c                  |   3 +-
 drivers/spi/spi-meson-spifc.c                 |   2 +-
 drivers/spi/spi.c                             |   2 +-
 drivers/staging/rtl8723bs/core/rtw_mlme_ext.c |   2 +-
 drivers/staging/rtl8723bs/core/rtw_pwrctrl.c  |   2 -
 drivers/tee/optee/ffa_abi.c                   |   1 -
 drivers/tee/optee/smc_abi.c                   |   1 -
 drivers/tty/hvc/hvc_console.c                 |   6 +-
 drivers/tty/tty_buffer.c                      |   3 -
 drivers/tty/tty_io.c                          |   1 -
 drivers/usb/gadget/udc/max3420_udc.c          |   1 -
 drivers/usb/host/max3421-hcd.c                |   2 +-
 drivers/usb/host/xen-hcd.c                    |   2 +-
 drivers/vfio/vfio_iommu_spapr_tce.c           |   2 -
 drivers/vfio/vfio_iommu_type1.c               |   7 -
 drivers/vhost/vhost.c                         |   1 -
 drivers/video/console/vgacon.c                |   4 -
 drivers/virtio/virtio_mem.c                   |   8 -
 drivers/xen/balloon.c                         |   2 -
 drivers/xen/gntdev.c                          |   2 -
 drivers/xen/xen-scsiback.c                    |   9 +-
 fs/afs/write.c                                |   2 -
 fs/btrfs/backref.c                            |   6 -
 fs/btrfs/block-group.c                        |   3 -
 fs/btrfs/ctree.c                              |   1 -
 fs/btrfs/defrag.c                             |   1 -
 fs/btrfs/disk-io.c                            |   3 -
 fs/btrfs/extent-io-tree.c                     |   5 -
 fs/btrfs/extent-tree.c                        |   8 -
 fs/btrfs/extent_io.c                          |   9 -
 fs/btrfs/file-item.c                          |   1 -
 fs/btrfs/file.c                               |   4 -
 fs/btrfs/free-space-cache.c                   |   4 -
 fs/btrfs/inode.c                              |   9 -
 fs/btrfs/ordered-data.c                       |   2 -
 fs/btrfs/qgroup.c                             |   1 -
 fs/btrfs/reflink.c                            |   2 -
 fs/btrfs/relocation.c                         |   9 -
 fs/btrfs/scrub.c                              |   3 -
 fs/btrfs/send.c                               |   1 -
 fs/btrfs/space-info.c                         |   1 -
 fs/btrfs/tests/extent-io-tests.c              |   1 -
 fs/btrfs/transaction.c                        |   3 -
 fs/btrfs/tree-log.c                           |  12 -
 fs/btrfs/uuid-tree.c                          |   1 -
 fs/btrfs/volumes.c                            |   2 -
 fs/buffer.c                                   |   1 -
 fs/cachefiles/cache.c                         |   4 +-
 fs/cachefiles/namei.c                         |   1 -
 fs/cachefiles/volume.c                        |   1 -
 fs/ceph/addr.c                                |   1 -
 fs/dax.c                                      |  16 +-
 fs/dcache.c                                   |   2 -
 fs/dlm/ast.c                                  |   1 -
 fs/dlm/dir.c                                  |   2 -
 fs/dlm/lock.c                                 |   3 -
 fs/dlm/lowcomms.c                             |   3 -
 fs/dlm/recover.c                              |   1 -
 fs/drop_caches.c                              |   1 -
 fs/erofs/utils.c                              |   1 -
 fs/erofs/zdata.c                              |   8 +-
 fs/eventpoll.c                                |   3 -
 fs/exec.c                                     |   4 -
 fs/ext4/block_validity.c                      |   2 -
 fs/ext4/dir.c                                 |   1 -
 fs/ext4/extents.c                             |   1 -
 fs/ext4/ialloc.c                              |   1 -
 fs/ext4/inode.c                               |   1 -
 fs/ext4/mballoc.c                             |  12 +-
 fs/ext4/namei.c                               |   3 -
 fs/ext4/orphan.c                              |   1 -
 fs/ext4/super.c                               |   2 -
 fs/f2fs/checkpoint.c                          |  16 +-
 fs/f2fs/compress.c                            |   1 -
 fs/f2fs/data.c                                |   3 -
 fs/f2fs/dir.c                                 |   1 -
 fs/f2fs/extent_cache.c                        |   1 -
 fs/f2fs/f2fs.h                                |   6 +-
 fs/f2fs/file.c                                |   3 -
 fs/f2fs/node.c                                |   4 -
 fs/f2fs/super.c                               |   1 -
 fs/fat/fatent.c                               |   2 -
 fs/file.c                                     |   7 +-
 fs/fs-writeback.c                             |   3 -
 fs/gfs2/aops.c                                |   1 -
 fs/gfs2/bmap.c                                |   2 -
 fs/gfs2/glock.c                               |   2 +-
 fs/gfs2/log.c                                 |   1 -
 fs/gfs2/ops_fstype.c                          |   1 -
 fs/hpfs/buffer.c                              |   8 -
 fs/hugetlbfs/inode.c                          |   3 -
 fs/inode.c                                    |   3 -
 fs/iomap/buffered-io.c                        |   7 +-
 fs/jbd2/checkpoint.c                          |   2 -
 fs/jbd2/commit.c                              |   3 -
 fs/jbd2/recovery.c                            |   2 -
 fs/jffs2/build.c                              |   6 +-
 fs/jffs2/erase.c                              |   3 -
 fs/jffs2/gc.c                                 |   2 -
 fs/jffs2/nodelist.c                           |   1 -
 fs/jffs2/nodemgmt.c                           |  11 +-
 fs/jffs2/readinode.c                          |   2 -
 fs/jffs2/scan.c                               |   4 -
 fs/jffs2/summary.c                            |   2 -
 fs/jfs/jfs_txnmgr.c                           |  14 +-
 fs/libfs.c                                    |   5 +-
 fs/mbcache.c                                  |   1 -
 fs/namei.c                                    |   1 -
 fs/netfs/io.c                                 |   1 -
 fs/nfs/delegation.c                           |   3 -
 fs/nfs/pnfs.c                                 |   2 -
 fs/nfs/write.c                                |   4 -
 fs/nilfs2/btree.c                             |   1 -
 fs/nilfs2/inode.c                             |   1 -
 fs/nilfs2/page.c                              |   4 -
 fs/nilfs2/segment.c                           |   4 -
 fs/notify/fanotify/fanotify_user.c            |   1 -
 fs/notify/fsnotify.c                          |   1 -
 fs/ntfs/attrib.c                              |   3 -
 fs/ntfs/file.c                                |   2 -
 fs/ntfs3/file.c                               |   9 -
 fs/ntfs3/frecord.c                            |   2 -
 fs/ocfs2/alloc.c                              |   4 +-
 fs/ocfs2/cluster/tcp.c                        |   8 +-
 fs/ocfs2/dlm/dlmthread.c                      |   7 +-
 fs/ocfs2/file.c                               |  10 +-
 fs/proc/base.c                                |   1 -
 fs/proc/fd.c                                  |   1 -
 fs/proc/kcore.c                               |   1 -
 fs/proc/page.c                                |   6 -
 fs/proc/task_mmu.c                            |   7 -
 fs/quota/dquot.c                              |   1 -
 fs/reiserfs/journal.c                         |   2 -
 fs/select.c                                   |   1 -
 fs/smb/client/file.c                          |   2 -
 fs/splice.c                                   |   1 -
 fs/ubifs/budget.c                             |   1 -
 fs/ubifs/commit.c                             |   1 -
 fs/ubifs/debug.c                              |   5 -
 fs/ubifs/dir.c                                |   1 -
 fs/ubifs/gc.c                                 |   5 -
 fs/ubifs/io.c                                 |   2 -
 fs/ubifs/lprops.c                             |   2 -
 fs/ubifs/lpt_commit.c                         |   3 -
 fs/ubifs/orphan.c                             |   1 -
 fs/ubifs/recovery.c                           |   4 -
 fs/ubifs/replay.c                             |   7 -
 fs/ubifs/scan.c                               |   2 -
 fs/ubifs/shrinker.c                           |   1 -
 fs/ubifs/super.c                              |   2 -
 fs/ubifs/tnc_commit.c                         |   2 -
 fs/ubifs/tnc_misc.c                           |   1 -
 fs/userfaultfd.c                              |   9 -
 fs/verity/enable.c                            |   1 -
 fs/verity/read_metadata.c                     |   1 -
 fs/xfs/scrub/common.h                         |   7 -
 fs/xfs/scrub/xfarray.c                        |   7 -
 fs/xfs/xfs_aops.c                             |   1 -
 fs/xfs/xfs_icache.c                           |   2 -
 fs/xfs/xfs_iwalk.c                            |   1 -
 include/asm-generic/preempt.h                 |  18 +-
 include/linux/console.h                       |   2 +-
 include/linux/context_tracking_state.h        |  21 +
 include/linux/entry-common.h                  |  19 +-
 include/linux/entry-kvm.h                     |   2 +-
 include/linux/kernel.h                        |  32 +-
 include/linux/livepatch.h                     |   1 -
 include/linux/livepatch_sched.h               |  29 -
 include/linux/preempt.h                       |  44 +-
 include/linux/rcupdate.h                      |  10 +-
 include/linux/rcutree.h                       |   2 +-
 include/linux/sched.h                         | 153 ++----
 include/linux/sched/cond_resched.h            |   1 -
 include/linux/sched/idle.h                    |   8 +-
 include/linux/thread_info.h                   |  29 +-
 include/linux/trace_events.h                  |   6 +-
 include/linux/vermagic.h                      |   2 +-
 include/linux/xarray.h                        |  14 +
 init/Makefile                                 |   3 +-
 io_uring/io-wq.c                              |   4 +-
 io_uring/io_uring.c                           |  21 +-
 io_uring/kbuf.c                               |   2 -
 io_uring/sqpoll.c                             |   6 +-
 io_uring/tctx.c                               |   4 +-
 ipc/msgutil.c                                 |   3 -
 ipc/sem.c                                     |   2 -
 kernel/Kconfig.preempt                        |  70 +--
 kernel/auditsc.c                              |   2 -
 kernel/bpf/Kconfig                            |   2 +-
 kernel/bpf/arraymap.c                         |   3 -
 kernel/bpf/bpf_iter.c                         |   7 +-
 kernel/bpf/btf.c                              |   9 -
 kernel/bpf/cpumap.c                           |   2 -
 kernel/bpf/hashtab.c                          |   7 -
 kernel/bpf/syscall.c                          |   3 -
 kernel/bpf/verifier.c                         |   5 -
 kernel/cgroup/rstat.c                         |   3 +-
 kernel/dma/debug.c                            |   2 -
 kernel/entry/common.c                         |  32 +-
 kernel/entry/kvm.c                            |   4 +-
 kernel/events/core.c                          |   2 +-
 kernel/futex/core.c                           |   6 +-
 kernel/futex/pi.c                             |   6 +-
 kernel/futex/requeue.c                        |   1 -
 kernel/futex/waitwake.c                       |   2 +-
 kernel/gcov/base.c                            |   1 -
 kernel/hung_task.c                            |   6 +-
 kernel/kallsyms.c                             |   4 +-
 kernel/kcsan/kcsan_test.c                     |   5 +-
 kernel/kexec_core.c                           |   6 -
 kernel/kthread.c                              |   1 -
 kernel/livepatch/core.c                       |   1 -
 kernel/livepatch/transition.c                 | 107 +---
 kernel/locking/test-ww_mutex.c                |   4 +-
 kernel/module/main.c                          |   1 -
 kernel/printk/printk.c                        |  65 +--
 kernel/ptrace.c                               |   2 -
 kernel/rcu/Kconfig                            |   4 +-
 kernel/rcu/rcuscale.c                         |   2 -
 kernel/rcu/rcutorture.c                       |   8 +-
 kernel/rcu/tasks.h                            |   5 +-
 kernel/rcu/tree.c                             |   4 +-
 kernel/rcu/tree_exp.h                         |   4 +-
 kernel/rcu/tree_plugin.h                      |   7 +-
 kernel/rcu/tree_stall.h                       |   2 +-
 kernel/scftorture.c                           |   1 -
 kernel/sched/core.c                           | 497 +++++-------------
 kernel/sched/core_sched.c                     |   2 +-
 kernel/sched/deadline.c                       |  26 +-
 kernel/sched/debug.c                          |  67 +--
 kernel/sched/fair.c                           |  54 +-
 kernel/sched/features.h                       |  18 +
 kernel/sched/idle.c                           |   6 +-
 kernel/sched/rt.c                             |  35 +-
 kernel/sched/sched.h                          |   9 +-
 kernel/softirq.c                              |   1 -
 kernel/stop_machine.c                         |   2 +-
 kernel/task_work.c                            |   1 -
 kernel/torture.c                              |   1 -
 kernel/trace/Kconfig                          |   4 +-
 kernel/trace/ftrace.c                         |   4 -
 kernel/trace/ring_buffer.c                    |   4 -
 kernel/trace/ring_buffer_benchmark.c          |  13 -
 kernel/trace/trace.c                          |  29 +-
 kernel/trace/trace_events.c                   |   1 -
 kernel/trace/trace_osnoise.c                  |  37 +-
 kernel/trace/trace_output.c                   |  16 +-
 kernel/trace/trace_selftest.c                 |   9 -
 kernel/workqueue.c                            |  10 -
 lib/crc32test.c                               |   2 -
 lib/crypto/mpi/mpi-pow.c                      |   1 -
 lib/memcpy_kunit.c                            |   5 -
 lib/random32.c                                |   1 -
 lib/rhashtable.c                              |   2 -
 lib/test_bpf.c                                |   3 -
 lib/test_lockup.c                             |   2 +-
 lib/test_maple_tree.c                         |   8 -
 lib/test_rhashtable.c                         |  10 -
 mm/backing-dev.c                              |   8 +-
 mm/compaction.c                               |  23 +-
 mm/damon/paddr.c                              |   1 -
 mm/dmapool_test.c                             |   2 -
 mm/filemap.c                                  |  11 +-
 mm/gup.c                                      |   1 -
 mm/huge_memory.c                              |   3 -
 mm/hugetlb.c                                  |  12 -
 mm/hugetlb_cgroup.c                           |   1 -
 mm/kasan/quarantine.c                         |   6 +-
 mm/kfence/kfence_test.c                       |  22 +-
 mm/khugepaged.c                               |  10 +-
 mm/kmemleak.c                                 |   8 -
 mm/ksm.c                                      |  21 +-
 mm/madvise.c                                  |   3 -
 mm/memcontrol.c                               |   4 -
 mm/memfd.c                                    |  10 +-
 mm/memory-failure.c                           |   1 -
 mm/memory.c                                   |  12 +-
 mm/memory_hotplug.c                           |   6 -
 mm/mempolicy.c                                |   1 -
 mm/migrate.c                                  |   6 -
 mm/mincore.c                                  |   1 -
 mm/mlock.c                                    |   2 -
 mm/mm_init.c                                  |  13 +-
 mm/mmap.c                                     |   1 -
 mm/mmu_gather.c                               |   2 -
 mm/mprotect.c                                 |   1 -
 mm/mremap.c                                   |   1 -
 mm/nommu.c                                    |   1 -
 mm/page-writeback.c                           |   6 +-
 mm/page_alloc.c                               |  13 +-
 mm/page_counter.c                             |   1 -
 mm/page_ext.c                                 |   1 -
 mm/page_idle.c                                |   2 -
 mm/page_io.c                                  |   2 -
 mm/page_owner.c                               |   1 -
 mm/percpu.c                                   |   5 -
 mm/rmap.c                                     |   2 -
 mm/shmem.c                                    |  19 +-
 mm/shuffle.c                                  |   6 +-
 mm/slab.c                                     |   3 -
 mm/swap_cgroup.c                              |   4 -
 mm/swapfile.c                                 |  14 -
 mm/truncate.c                                 |   4 -
 mm/userfaultfd.c                              |   3 -
 mm/util.c                                     |   1 -
 mm/vmalloc.c                                  |   5 -
 mm/vmscan.c                                   |  29 +-
 mm/vmstat.c                                   |   4 -
 mm/workingset.c                               |   1 -
 mm/z3fold.c                                   |  15 +-
 mm/zsmalloc.c                                 |   1 -
 mm/zswap.c                                    |   1 -
 net/batman-adv/tp_meter.c                     |   2 -
 net/bpf/test_run.c                            |   1 -
 net/bridge/br_netlink.c                       |   1 -
 net/core/dev.c                                |   4 -
 net/core/neighbour.c                          |   1 -
 net/core/net_namespace.c                      |   1 -
 net/core/netclassid_cgroup.c                  |   1 -
 net/core/rtnetlink.c                          |   1 -
 net/core/sock.c                               |   2 -
 net/ipv4/inet_connection_sock.c               |   3 -
 net/ipv4/inet_diag.c                          |   1 -
 net/ipv4/inet_hashtables.c                    |   1 -
 net/ipv4/inet_timewait_sock.c                 |   1 -
 net/ipv4/inetpeer.c                           |   1 -
 net/ipv4/netfilter/arp_tables.c               |   2 -
 net/ipv4/netfilter/ip_tables.c                |   3 -
 net/ipv4/nexthop.c                            |   1 -
 net/ipv4/tcp_ipv4.c                           |   2 -
 net/ipv4/udp.c                                |   2 -
 net/ipv6/fib6_rules.c                         |   1 -
 net/ipv6/netfilter/ip6_tables.c               |   2 -
 net/ipv6/udp.c                                |   2 -
 net/mptcp/mptcp_diag.c                        |   2 -
 net/mptcp/pm_netlink.c                        |   5 -
 net/mptcp/protocol.c                          |   1 -
 net/netfilter/ipset/ip_set_core.c             |   1 -
 net/netfilter/ipvs/ip_vs_est.c                |   3 -
 net/netfilter/nf_conncount.c                  |   2 -
 net/netfilter/nf_conntrack_core.c             |   3 -
 net/netfilter/nf_conntrack_ecache.c           |   3 -
 net/netfilter/nf_tables_api.c                 |   2 -
 net/netfilter/nft_set_rbtree.c                |   2 -
 net/netfilter/x_tables.c                      |   3 +-
 net/netfilter/xt_hashlimit.c                  |   1 -
 net/netlink/af_netlink.c                      |   1 -
 net/rds/ib_recv.c                             |   2 -
 net/rds/tcp.c                                 |   2 +-
 net/rds/threads.c                             |   1 -
 net/rxrpc/call_object.c                       |   2 +-
 net/sched/sch_api.c                           |   3 -
 net/sctp/socket.c                             |   1 -
 net/socket.c                                  |   2 -
 net/sunrpc/cache.c                            |  11 +-
 net/sunrpc/sched.c                            |   2 +-
 net/sunrpc/svc_xprt.c                         |   1 -
 net/sunrpc/xprtsock.c                         |   2 -
 net/tipc/core.c                               |   2 +-
 net/tipc/topsrv.c                             |   3 -
 net/unix/af_unix.c                            |   5 +-
 net/x25/af_x25.c                              |   1 -
 scripts/coccinelle/api/cond_resched.cocci     |  53 ++
 security/keys/gc.c                            |   1 -
 security/landlock/fs.c                        |   1 -
 security/selinux/ss/hashtab.h                 |   2 -
 security/selinux/ss/policydb.c                |   6 -
 security/selinux/ss/services.c                |   1 -
 security/selinux/ss/sidtab.c                  |   1 -
 sound/arm/aaci.c                              |   2 +-
 sound/core/seq/seq_virmidi.c                  |   2 -
 sound/hda/hdac_controller.c                   |   1 -
 sound/isa/sb/emu8000_patch.c                  |   5 -
 sound/isa/sb/emu8000_pcm.c                    |   2 +-
 sound/isa/wss/wss_lib.c                       |   1 -
 sound/pci/echoaudio/echoaudio_dsp.c           |   2 -
 sound/pci/ens1370.c                           |   1 -
 sound/pci/es1968.c                            |   2 +-
 sound/pci/lola/lola.c                         |   1 -
 sound/pci/mixart/mixart_hwdep.c               |   2 +-
 sound/pci/pcxhr/pcxhr_core.c                  |   5 -
 sound/pci/vx222/vx222_ops.c                   |   2 -
 sound/x86/intel_hdmi_audio.c                  |   1 -
 virt/kvm/pfncache.c                           |   2 -
 596 files changed, 881 insertions(+), 2813 deletions(-)
 delete mode 100644 include/linux/livepatch_sched.h
 delete mode 100644 include/linux/sched/cond_resched.h
 create mode 100644 scripts/coccinelle/api/cond_resched.cocci

Comments

Steven Rostedt Nov. 7, 2023, 11:01 p.m. UTC | #1
On Tue,  7 Nov 2023 13:56:46 -0800
Ankur Arora <ankur.a.arora@oracle.com> wrote:

> Hi,

Hi Ankur,

Thanks for doing this!

> 
> We have two models of preemption: voluntary and full (and RT which is
> a fuller form of full preemption.) In this series -- which is based
> on Thomas' PoC (see [1]), we try to unify the two by letting the
> scheduler enforce policy for the voluntary preemption models as well.

I would say there's "NONE" which is really just a "voluntary" but with
fewer preemption points ;-) But still should be mentioned, otherwise people
may get confused.

> 
> (Note that this is about preemption when executing in the kernel.
> Userspace is always preemptible.)
> 


> Design
> ==
> 
> As Thomas outlines in [1], to unify the preemption models we
> want to: always have the preempt_count enabled and allow the scheduler
> to drive preemption policy based on the model in effect.
> 
> Policies:
> 
> - preemption=none: run to completion
> - preemption=voluntary: run to completion, unless a task of higher
>   sched-class awaits
> - preemption=full: optimized for low-latency. Preempt whenever a higher
>   priority task awaits.
> 
> To do this add a new flag, TIF_NEED_RESCHED_LAZY which allows the
> scheduler to mark that a reschedule is needed, but is deferred until
> the task finishes executing in the kernel -- voluntary preemption
> as it were.
> 
> The TIF_NEED_RESCHED flag is evaluated at all three of the preemption
> points. TIF_NEED_RESCHED_LAZY only needs to be evaluated at ret-to-user.
> 
>          ret-to-user    ret-to-kernel    preempt_count()
> none           Y              N                N
> voluntary      Y              Y                Y
> full           Y              Y                Y

Wait. The above is for when RESCHED_LAZY is to preempt, right?

Then, shouldn't voluntary be:

 voluntary      Y              N                N

For LAZY, but 

 voluntary      Y              Y                Y

For NEED_RESCHED (without lazy)

That is, the only difference between voluntary and none (as you describe
above) is that when an RT task wakes up, on voluntary, it sets NEED_RESCHED,
but on none, it still sets NEED_RESCHED_LAZY?

> 
> 
> There's just one remaining issue: now that explicit preemption points are
> gone, processes that spread a long time in the kernel have no way to give
> up the CPU.

I wonder if this needs to be solved by with a user space knob, to trigger
the time that "NEED_RESCHED" will force a schedule?

> 
> For full preemption, that is a non-issue as we always use TIF_NEED_RESCHED.
> 
> For none/voluntary preemption, we handle that by upgrading to TIF_NEED_RESCHED
> if a task marked TIF_NEED_RESCHED_LAZY hasn't preempted away by the next tick.
> (This would cause preemption either at ret-to-kernel, or if the task is in
> a non-preemptible section, when it exits that section.)
> 
> Arguably this provides for much more consistent maximum latency (~2 tick
> lengths + length of non-preemptible section) as compared to the old model
> where the maximum latency depended on the dynamic distribution of
> cond_resched() points.

Again, why I think we probably want to set a knob  for users to adjust
this. Default, it will be set to "tick" but if not, then we need to add
another timer to trigger before then. And this would only be available with
HRTIMERS of course ;-)

> 
> (As a bonus it handles code that is preemptible but cannot call
> cond_resched() completely trivially: ex. long running Xen hypercalls, or
> this series which started this discussion:
>  https://lore.kernel.org/all/20230830184958.2333078-8-ankur.a.arora@oracle.com/)
> 
> 
> Status
> ==
> 
> What works:
>  - The system seems to keep ticking over with the normal scheduling
> policies (SCHED_OTHER). The support for the realtime policies is somewhat
> more half baked.)
>  - The basic performance numbers seem pretty close to 6.6-rc7 baseline
> 
> What's broken:
>  - ARCH_NO_PREEMPT (See patch-45 "preempt: ARCH_NO_PREEMPT only preempts
>    lazily")
>  - Non-x86 architectures. It's trivial to support other archs (only need
>    to add TIF_NEED_RESCHED_LAZY) but wanted to hold off until I got some
>    comments on the series.
>    (From some testing on arm64, didn't find any surprises.)


>  - livepatch: livepatch depends on using _cond_resched() to provide
>    low-latency patching. That is obviously difficult with cond_resched()
>    gone. We could get a similar effect by using a static_key in
>    preempt_enable() but at least with inline locks, that might be end
>    up bloating the kernel quite a bit.

Maybe if we have that timer, livepatch could set it to be temporarily
shorter?

>  - Documentation/ and comments mention cond_resched()

>  - ftrace support for need-resched-lazy is incomplete

Shouldn't be a problem.

> 
> What needs more discussion:
>  - Should cond_resched_lock() etc be scheduling out for TIF_NEED_RESCHED
>    only or both TIF_NEED_RESCHED_LAZY as well? (See patch 35 "thread_info:
>    change to tif_need_resched(resched_t)")

I would say NEED_RESCHED only, then it would match the description of the
different models above.

>  - Tracking whether a task in userspace or in the kernel (See patch-40
>    "context_tracking: add ct_state_cpu()")
>  - The right model for preempt=voluntary. (See patch 44 "sched: voluntary
>    preemption")
> 
> 
> Performance
> ==
> 


>   * optimal-load (-j 1024)
> 
>            6.6-rc7                                    +series
>         
> 
>   wall        139.2 +-       0.3             wall       138.8  +-
> 0.2 utime     11161.0 +-    3360.4             utime    11061.2  +-
> 3244.9 stime      1357.6 +-     199.3             stime     1366.6  +-
>  216.3 %cpu       9108.8 +-    2431.4             %cpu      9081.0  +-
> 2351.1 csw     2078599   +- 2013320.0             csw    1970610    +-
> 1969030.0
> 
> 
>   For both of these the wallclock, utime, stime etc are pretty much
>   identical. The one interesting difference is that the number of
>   context switches are fewer. This intuitively makes sense given that
>   we reschedule threads lazily rather than rescheduling if we encounter
>   a cond_resched() and there's a thread wanting to be scheduled.
> 
>   The max-load numbers (not posted here) also behave similarly.

It would be interesting to run any "latency sensitive" benchmarks.

I wounder how cyclictest would work under each model with and without this
patch?

-- Steve
Ankur Arora Nov. 7, 2023, 11:43 p.m. UTC | #2
Steven Rostedt <rostedt@goodmis.org> writes:

> On Tue,  7 Nov 2023 13:56:46 -0800
> Ankur Arora <ankur.a.arora@oracle.com> wrote:
>
>> Hi,
>
> Hi Ankur,
>
> Thanks for doing this!
>
>>
>> We have two models of preemption: voluntary and full (and RT which is
>> a fuller form of full preemption.) In this series -- which is based
>> on Thomas' PoC (see [1]), we try to unify the two by letting the
>> scheduler enforce policy for the voluntary preemption models as well.
>
> I would say there's "NONE" which is really just a "voluntary" but with
> fewer preemption points ;-) But still should be mentioned, otherwise people
> may get confused.
>
>>
>> (Note that this is about preemption when executing in the kernel.
>> Userspace is always preemptible.)
>>
>
>
>> Design
>> ==
>>
>> As Thomas outlines in [1], to unify the preemption models we
>> want to: always have the preempt_count enabled and allow the scheduler
>> to drive preemption policy based on the model in effect.
>>
>> Policies:
>>
>> - preemption=none: run to completion
>> - preemption=voluntary: run to completion, unless a task of higher
>>   sched-class awaits
>> - preemption=full: optimized for low-latency. Preempt whenever a higher
>>   priority task awaits.
>>
>> To do this add a new flag, TIF_NEED_RESCHED_LAZY which allows the
>> scheduler to mark that a reschedule is needed, but is deferred until
>> the task finishes executing in the kernel -- voluntary preemption
>> as it were.
>>
>> The TIF_NEED_RESCHED flag is evaluated at all three of the preemption
>> points. TIF_NEED_RESCHED_LAZY only needs to be evaluated at ret-to-user.
>>
>>          ret-to-user    ret-to-kernel    preempt_count()
>> none           Y              N                N
>> voluntary      Y              Y                Y
>> full           Y              Y                Y
>
> Wait. The above is for when RESCHED_LAZY is to preempt, right?
>
> Then, shouldn't voluntary be:
>
>  voluntary      Y              N                N
>
> For LAZY, but
>
>  voluntary      Y              Y                Y
>
> For NEED_RESCHED (without lazy)

Yes. You are, of course, right. I was talking about the TIF_NEED_RESCHED flags
and in the middle switched to talking about how the voluntary model will
get to what it wants.

> That is, the only difference between voluntary and none (as you describe
> above) is that when an RT task wakes up, on voluntary, it sets NEED_RESCHED,
> but on none, it still sets NEED_RESCHED_LAZY?

Yeah exactly. Just to restate without mucking it up:

The TIF_NEED_RESCHED flag is evaluated at all three of the preemption
points. TIF_NEED_RESCHED_LAZY only needs to be evaluated at ret-to-user.

                  ret-to-user    ret-to-kernel    preempt_count()
NEED_RESCHED_LAZY    Y              N                N
NEED_RESCHED         Y              Y                Y

Based on how various preemption models set the flag they would cause
preemption at:

                  ret-to-user    ret-to-kernel    preempt_count()
none                 Y              N                N
voluntary            Y              Y                Y
full                 Y              Y                Y

>>   The max-load numbers (not posted here) also behave similarly.
>
> It would be interesting to run any "latency sensitive" benchmarks.
>
> I wounder how cyclictest would work under each model with and without this
> patch?

Didn't post these numbers because I suspect that code isn't quite right,
but voluntary preemption for instance does what it promises:

# echo NO_FORCE_PREEMPT  > sched/features
# echo NO_PREEMPT_PRIORITY > sched/features    # preempt=none
# stress-ng --cyclic 1  --timeout 10
stress-ng: info:  [1214172] setting to a 10 second run per stressor
stress-ng: info:  [1214172] dispatching hogs: 1 cyclic
stress-ng: info:  [1214174] cyclic: sched SCHED_DEADLINE: 100000 ns delay, 10000 samples
stress-ng: info:  [1214174] cyclic:   mean: 9834.56 ns, mode: 3495 ns
stress-ng: info:  [1214174] cyclic:   min: 2413 ns, max: 3145065 ns, std.dev. 77096.98
stress-ng: info:  [1214174] cyclic: latency percentiles:
stress-ng: info:  [1214174] cyclic:   25.00%:       3366 ns
stress-ng: info:  [1214174] cyclic:   50.00%:       3505 ns
stress-ng: info:  [1214174] cyclic:   75.00%:       3776 ns
stress-ng: info:  [1214174] cyclic:   90.00%:       4316 ns
stress-ng: info:  [1214174] cyclic:   95.40%:      10989 ns
stress-ng: info:  [1214174] cyclic:   99.00%:      91181 ns
stress-ng: info:  [1214174] cyclic:   99.50%:     290477 ns
stress-ng: info:  [1214174] cyclic:   99.90%:    1360837 ns
stress-ng: info:  [1214174] cyclic:   99.99%:    3145065 ns
stress-ng: info:  [1214172] successful run completed in 10.00s

# echo PREEMPT_PRIORITY > features    # preempt=voluntary
# stress-ng --cyclic 1  --timeout 10
stress-ng: info:  [916483] setting to a 10 second run per stressor
stress-ng: info:  [916483] dispatching hogs: 1 cyclic
stress-ng: info:  [916484] cyclic: sched SCHED_DEADLINE: 100000 ns delay, 10000 samples
stress-ng: info:  [916484] cyclic:   mean: 3682.77 ns, mode: 3185 ns
stress-ng: info:  [916484] cyclic:   min: 2523 ns, max: 150082 ns, std.dev. 2198.07
stress-ng: info:  [916484] cyclic: latency percentiles:
stress-ng: info:  [916484] cyclic:   25.00%:       3185 ns
stress-ng: info:  [916484] cyclic:   50.00%:       3306 ns
stress-ng: info:  [916484] cyclic:   75.00%:       3666 ns
stress-ng: info:  [916484] cyclic:   90.00%:       4778 ns
stress-ng: info:  [916484] cyclic:   95.40%:       5359 ns
stress-ng: info:  [916484] cyclic:   99.00%:       6141 ns
stress-ng: info:  [916484] cyclic:   99.50%:       7824 ns
stress-ng: info:  [916484] cyclic:   99.90%:      29825 ns
stress-ng: info:  [916484] cyclic:   99.99%:     150082 ns
stress-ng: info:  [916483] successful run completed in 10.01s

This is with a background kernbench half-load.

Let me see if I can dig out the numbers without this series.

--
ankur
Steven Rostedt Nov. 8, 2023, midnight UTC | #3
On Tue, 07 Nov 2023 15:43:40 -0800
Ankur Arora <ankur.a.arora@oracle.com> wrote:

> 
> The TIF_NEED_RESCHED flag is evaluated at all three of the preemption
> points. TIF_NEED_RESCHED_LAZY only needs to be evaluated at ret-to-user.
> 
>                   ret-to-user    ret-to-kernel    preempt_count()
> NEED_RESCHED_LAZY    Y              N                N
> NEED_RESCHED         Y              Y                Y
> 
> Based on how various preemption models set the flag they would cause
> preemption at:

I would change the above to say "set the NEED_SCHED flag", as "set the
flag" is still ambiguous. Or am I still misunderstanding the below table?

> 
>                   ret-to-user    ret-to-kernel    preempt_count()
> none                 Y              N                N
> voluntary            Y              Y                Y
> full                 Y              Y                Y
> 

-- Steve
Christoph Lameter (Ampere) Nov. 8, 2023, 4:08 a.m. UTC | #4
The kernel is not preemptible???? What are you smoking?

On Tue, 7 Nov 2023, Ankur Arora wrote:

> In voluntary models, the scheduler's job is to match the demand
> side of preemption points (a task that needs to be scheduled) with
> the supply side (a task which calls cond_resched().)

Voluntary preemption models are important for code optimization because 
the code can rely on the scheduler not changing the cpu we are running on. 
This allows removing code for preempt_enable/disable to be removed from 
the code and allows better code generation. The best performing code is 
generated with defined preemption points when we have a guarantee that the 
code is not being rescheduled on a different processor. This is f.e. 
important for consistent access to PER CPU areas.

> To do this add a new flag, TIF_NEED_RESCHED_LAZY which allows the
> scheduler to mark that a reschedule is needed, but is deferred until
> the task finishes executing in the kernel -- voluntary preemption
> as it were.

That is different from the current no preemption model? Seems to be 
the same.

> There's just one remaining issue: now that explicit preemption points are
> gone, processes that spread a long time in the kernel have no way to give
> up the CPU.

These are needed to avoid adding preempt_enable/disable to a lot of 
primitives that are used for synchronization. You cannot remove those 
without changing a lot of synchronization primitives to always have to 
consider being preempted while operating.
Ankur Arora Nov. 8, 2023, 4:33 a.m. UTC | #5
Christoph Lameter <cl@linux.com> writes:

> The kernel is not preemptible???? What are you smoking?

The title admittedly is a little tongue in check but the point was that
a kernel under a voluntary preemption model isn't preemptible. That's
what this series attempts to do. Essentially enable PREEMPT_COUNT and
PREEMPTION for all preemption models.

PREEMPT_COUNT is always enabled with PREEMPT_DYNAMIC as well. There the
approach is to toggle which preemption points are used dynamically.
Here the idea is to not have statically placed preemption points and let
the scheduler decide when preemption is warranted.
And the only way to safely do that is by having PREEMPT_COUNT=y.

>> In voluntary models, the scheduler's job is to match the demand
>> side of preemption points (a task that needs to be scheduled) with
>> the supply side (a task which calls cond_resched().)
>
> Voluntary preemption models are important for code optimization because the code
> can rely on the scheduler not changing the cpu we are running on. This allows
> removing code for preempt_enable/disable to be removed from the code and allows
> better code generation. The best performing code is generated with defined
> preemption points when we have a guarantee that the code is not being
> rescheduled on a different processor. This is f.e. important for consistent
> access to PER CPU areas.

Right. This necessitates preempt_enable/preempt_disable() so you get
consistent access to the CPU.

This came up in an earlier discussion (See
https://lore.kernel.org/lkml/87cyyfxd4k.ffs@tglx/) and Thomas mentioned
that preempt_enable/_disable() overhead was relatively minimal.

Is your point that always-on preempt_count is far too expensive?

>> To do this add a new flag, TIF_NEED_RESCHED_LAZY which allows the
>> scheduler to mark that a reschedule is needed, but is deferred until
>> the task finishes executing in the kernel -- voluntary preemption
>> as it were.
>
> That is different from the current no preemption model? Seems to be the same.
>> There's just one remaining issue: now that explicit preemption points are
>> gone, processes that spread a long time in the kernel have no way to give
>> up the CPU.
>
> These are needed to avoid adding preempt_enable/disable to a lot of primitives
> that are used for synchronization. You cannot remove those without changing a
> lot of synchronization primitives to always have to consider being preempted
> while operating.

I'm afraid I don't understand why you would need to change any
synchronization primitives. The code that does preempt_enable/_disable()
is compiled out because CONFIG_PREEMPT_NONE/_VOLUNTARY don't define
CONFIG_PREEMPT_COUNT.

The intent here is to always have CONFIG_PREEMPT_COUNT=y.

--
ankur
Christoph Lameter (Ampere) Nov. 8, 2023, 4:52 a.m. UTC | #6
On Tue, 7 Nov 2023, Ankur Arora wrote:

> This came up in an earlier discussion (See
> https://lore.kernel.org/lkml/87cyyfxd4k.ffs@tglx/) and Thomas mentioned
> that preempt_enable/_disable() overhead was relatively minimal.
>
> Is your point that always-on preempt_count is far too expensive?

Yes over the years distros have traditionally delivered their kernels by 
default without preemption because of these issues. If the overhead has 
been minimized then that may have changed. Even if so there is still a lot 
of code being generated that has questionable benefit and just 
bloats the kernel.

>> These are needed to avoid adding preempt_enable/disable to a lot of primitives
>> that are used for synchronization. You cannot remove those without changing a
>> lot of synchronization primitives to always have to consider being preempted
>> while operating.
>
> I'm afraid I don't understand why you would need to change any
> synchronization primitives. The code that does preempt_enable/_disable()
> is compiled out because CONFIG_PREEMPT_NONE/_VOLUNTARY don't define
> CONFIG_PREEMPT_COUNT.

In the trivial cases it is simple like that. But look f.e.
in the slub allocator at the #ifdef CONFIG_PREEMPTION section. There is a 
overhead added to be able to allow the cpu to change under us. There are 
likely other examples in the source.

And the whole business of local data 
access via per cpu areas suffers if we cannot rely on two accesses in a 
section being able to see consistent values.

> The intent here is to always have CONFIG_PREEMPT_COUNT=y.

Just for fun? Code is most efficient if it does not have to consider too 
many side conditions like suddenly running on a different processor. This 
introduces needless complexity into the code. It would be better to remove 
PREEMPT_COUNT for good to just rely on voluntary preemption. We could 
probably reduce the complexity of the kernel source significantly.

I have never noticed a need to preemption at every instruction in the 
kernel (if that would be possible at all... Locks etc prevent that ideal 
scenario frequently). Preemption like that is more like a pipe dream.

High performance kernel solution usually disable 
overhead like that.
Steven Rostedt Nov. 8, 2023, 5:12 a.m. UTC | #7
On Tue, 7 Nov 2023 20:52:39 -0800 (PST)
Christoph Lameter <cl@linux.com> wrote:

> On Tue, 7 Nov 2023, Ankur Arora wrote:
> 
> > This came up in an earlier discussion (See
> > https://lore.kernel.org/lkml/87cyyfxd4k.ffs@tglx/) and Thomas mentioned
> > that preempt_enable/_disable() overhead was relatively minimal.
> >
> > Is your point that always-on preempt_count is far too expensive?  
> 
> Yes over the years distros have traditionally delivered their kernels by 
> default without preemption because of these issues. If the overhead has 
> been minimized then that may have changed. Even if so there is still a lot 
> of code being generated that has questionable benefit and just 
> bloats the kernel.
> 
> >> These are needed to avoid adding preempt_enable/disable to a lot of primitives
> >> that are used for synchronization. You cannot remove those without changing a
> >> lot of synchronization primitives to always have to consider being preempted
> >> while operating.  
> >
> > I'm afraid I don't understand why you would need to change any
> > synchronization primitives. The code that does preempt_enable/_disable()
> > is compiled out because CONFIG_PREEMPT_NONE/_VOLUNTARY don't define
> > CONFIG_PREEMPT_COUNT.  
> 
> In the trivial cases it is simple like that. But look f.e.
> in the slub allocator at the #ifdef CONFIG_PREEMPTION section. There is a 
> overhead added to be able to allow the cpu to change under us. There are 
> likely other examples in the source.
> 

preempt_disable() and preempt_enable() are much lower overhead today than
it use to be.

If you are worried about changing CPUs, there's also migrate_disable() too.

> And the whole business of local data 
> access via per cpu areas suffers if we cannot rely on two accesses in a 
> section being able to see consistent values.
> 
> > The intent here is to always have CONFIG_PREEMPT_COUNT=y.  
> 
> Just for fun? Code is most efficient if it does not have to consider too 
> many side conditions like suddenly running on a different processor. This 
> introduces needless complexity into the code. It would be better to remove 
> PREEMPT_COUNT for good to just rely on voluntary preemption. We could 
> probably reduce the complexity of the kernel source significantly.

That is what caused this thread in the first place. Randomly scattered
"preemption points" does not scale!

And I'm sorry, we have latency sensitive use cases that require full
preemption.

> 
> I have never noticed a need to preemption at every instruction in the 
> kernel (if that would be possible at all... Locks etc prevent that ideal 
> scenario frequently). Preemption like that is more like a pipe dream.
> 
> High performance kernel solution usually disable 
> overhead like that.
> 

Please read the email from Thomas:

   https://lore.kernel.org/lkml/87cyyfxd4k.ffs@tglx/

This is not technically getting rid of PREEMPT_NONE. It is adding a new
NEED_RESCHED_LAZY flag, that will have the kernel preempt only when
entering or in user space. It will behave the same as PREEMPT_NONE, but
without the need for all the cond_resched() scattered randomly throughout
the kernel.

If the task is in the kernel for more than one tick (1ms at 1000Hz, 4ms at
250Hz and 10ms at 100Hz), it will then set NEED_RESCHED, and you will
preempt at the next available location (preempt_count == 0).

But yes, all locations that do not explicitly disable preemption, will now
possibly preempt (due to long running kernel threads).

-- Steve
Ankur Arora Nov. 8, 2023, 6:49 a.m. UTC | #8
Steven Rostedt <rostedt@goodmis.org> writes:

> On Tue, 7 Nov 2023 20:52:39 -0800 (PST)
> Christoph Lameter <cl@linux.com> wrote:
>
>> On Tue, 7 Nov 2023, Ankur Arora wrote:
>>
>> > This came up in an earlier discussion (See
>> > https://lore.kernel.org/lkml/87cyyfxd4k.ffs@tglx/) and Thomas mentioned
>> > that preempt_enable/_disable() overhead was relatively minimal.
>> >
>> > Is your point that always-on preempt_count is far too expensive?
>>
>> Yes over the years distros have traditionally delivered their kernels by
>> default without preemption because of these issues. If the overhead has
>> been minimized then that may have changed. Even if so there is still a lot
>> of code being generated that has questionable benefit and just
>> bloats the kernel.
>>
>> >> These are needed to avoid adding preempt_enable/disable to a lot of primitives
>> >> that are used for synchronization. You cannot remove those without changing a
>> >> lot of synchronization primitives to always have to consider being preempted
>> >> while operating.
>> >
>> > I'm afraid I don't understand why you would need to change any
>> > synchronization primitives. The code that does preempt_enable/_disable()
>> > is compiled out because CONFIG_PREEMPT_NONE/_VOLUNTARY don't define
>> > CONFIG_PREEMPT_COUNT.
>>
>> In the trivial cases it is simple like that. But look f.e.
>> in the slub allocator at the #ifdef CONFIG_PREEMPTION section. There is a
>> overhead added to be able to allow the cpu to change under us. There are
>> likely other examples in the source.
>>
>
> preempt_disable() and preempt_enable() are much lower overhead today than
> it use to be.
>
> If you are worried about changing CPUs, there's also migrate_disable() too.
>
>> And the whole business of local data
>> access via per cpu areas suffers if we cannot rely on two accesses in a
>> section being able to see consistent values.
>>
>> > The intent here is to always have CONFIG_PREEMPT_COUNT=y.
>>
>> Just for fun? Code is most efficient if it does not have to consider too
>> many side conditions like suddenly running on a different processor. This
>> introduces needless complexity into the code. It would be better to remove
>> PREEMPT_COUNT for good to just rely on voluntary preemption. We could
>> probably reduce the complexity of the kernel source significantly.
>
> That is what caused this thread in the first place. Randomly scattered
> "preemption points" does not scale!
>
> And I'm sorry, we have latency sensitive use cases that require full
> preemption.
>
>>
>> I have never noticed a need to preemption at every instruction in the
>> kernel (if that would be possible at all... Locks etc prevent that ideal
>> scenario frequently). Preemption like that is more like a pipe dream.

The intent isn't to preempt at every other instruction in the kernel.

As Thomas describes, the idea is that for voluntary preemption kernels
resched happens at cond_resched() points which have been distributed
heuristically. As a consequence you might get both too little preemption
and too much preemption.

The intent is to bring preemption in control of the scheduler which
can do a better job than randomly placed cond_resched() points.

>> High performance kernel solution usually disable
>> overhead like that.

You are also missing all the ways in which voluntary preemption
points are responsible for poor performance. For instance, if you look
atq clear_huge_page() it does page by page copy with a cond_resched()
call after clearing each page.

But if you can expose the full extent to the CPU, it can optimize
differently (for the 1GB page it can now elide cacheline allocation):

  *Milan*     mm/clear_huge_page   x86/clear_huge_page   change
                          (GB/s)                (GB/s)

  pg-sz=2MB                14.55                 19.29    +32.5%
  pg-sz=1GB                19.34                 49.60   +156.4%

(See https://lore.kernel.org/all/20230830184958.2333078-1-ankur.a.arora@oracle.com/)

> Please read the email from Thomas:
>
>    https://lore.kernel.org/lkml/87cyyfxd4k.ffs@tglx/
>
> This is not technically getting rid of PREEMPT_NONE. It is adding a new
> NEED_RESCHED_LAZY flag, that will have the kernel preempt only when
> entering or in user space. It will behave the same as PREEMPT_NONE, but
> without the need for all the cond_resched() scattered randomly throughout
> the kernel.

And a corollary of that is that with a scheduler controlled PREEMPT_NONE
a task might end up running to completion where earlier it could have been
preempted early because it crossed a cond_resched().

> If the task is in the kernel for more than one tick (1ms at 1000Hz, 4ms at
> 250Hz and 10ms at 100Hz), it will then set NEED_RESCHED, and you will
> preempt at the next available location (preempt_count == 0).
>
> But yes, all locations that do not explicitly disable preemption, will now
> possibly preempt (due to long running kernel threads).

--
ankur
Jürgen Groß Nov. 8, 2023, 7:31 a.m. UTC | #9
On 07.11.23 22:56, Ankur Arora wrote:
> Hi,
> 
> We have two models of preemption: voluntary and full (and RT which is
> a fuller form of full preemption.) In this series -- which is based
> on Thomas' PoC (see [1]), we try to unify the two by letting the
> scheduler enforce policy for the voluntary preemption models as well.
> 
> (Note that this is about preemption when executing in the kernel.
> Userspace is always preemptible.)
> 
> Background
> ==
> 
> Why?: both of these preemption mechanisms are almost entirely disjoint.
> There are four main sets of preemption points in the kernel:
> 
>   1. return to user
>   2. explicit preemption points (cond_resched() and its ilk)
>   3. return to kernel (tick/IPI/irq at irqexit)
>   4. end of non-preemptible sections at (preempt_count() == preempt_offset)
> 
> Voluntary preemption uses mechanisms 1 and 2. Full preemption
> uses 1, 3 and 4. In addition both use cond_resched_{rcu,lock,rwlock*}
> which can be all things to all people because they internally
> contain 2, and 4.
> 
> Now since there's no ideal placement of explicit preemption points,
> they tend to be randomly spread over code and accumulate over time,
> as they are are added when latency problems are seen. Plus fear of
> regressions makes them difficult to remove.
> (Presumably, asymptotically they would spead out evenly across the
> instruction stream!)
> 
> In voluntary models, the scheduler's job is to match the demand
> side of preemption points (a task that needs to be scheduled) with
> the supply side (a task which calls cond_resched().)
> 
> Full preemption models track preemption count so the scheduler can
> always knows if it is safe to preempt and can drive preemption
> itself (ex. via dynamic preemption points in 3.)
> 
> Design
> ==
> 
> As Thomas outlines in [1], to unify the preemption models we
> want to: always have the preempt_count enabled and allow the scheduler
> to drive preemption policy based on the model in effect.
> 
> Policies:
> 
> - preemption=none: run to completion
> - preemption=voluntary: run to completion, unless a task of higher
>    sched-class awaits
> - preemption=full: optimized for low-latency. Preempt whenever a higher
>    priority task awaits.
> 
> To do this add a new flag, TIF_NEED_RESCHED_LAZY which allows the
> scheduler to mark that a reschedule is needed, but is deferred until
> the task finishes executing in the kernel -- voluntary preemption
> as it were.
> 
> The TIF_NEED_RESCHED flag is evaluated at all three of the preemption
> points. TIF_NEED_RESCHED_LAZY only needs to be evaluated at ret-to-user.
> 
>           ret-to-user    ret-to-kernel    preempt_count()
> none           Y              N                N
> voluntary      Y              Y                Y
> full           Y              Y                Y
> 
> 
> There's just one remaining issue: now that explicit preemption points are
> gone, processes that spread a long time in the kernel have no way to give
> up the CPU.
> 
> For full preemption, that is a non-issue as we always use TIF_NEED_RESCHED.
> 
> For none/voluntary preemption, we handle that by upgrading to TIF_NEED_RESCHED
> if a task marked TIF_NEED_RESCHED_LAZY hasn't preempted away by the next tick.
> (This would cause preemption either at ret-to-kernel, or if the task is in
> a non-preemptible section, when it exits that section.)
> 
> Arguably this provides for much more consistent maximum latency (~2 tick
> lengths + length of non-preemptible section) as compared to the old model
> where the maximum latency depended on the dynamic distribution of
> cond_resched() points.
> 
> (As a bonus it handles code that is preemptible but cannot call cond_resched()
>   completely trivially: ex. long running Xen hypercalls, or this series
>   which started this discussion:
>   https://lore.kernel.org/all/20230830184958.2333078-8-ankur.a.arora@oracle.com/)
> 
> 
> Status
> ==
> 
> What works:
>   - The system seems to keep ticking over with the normal scheduling policies
>     (SCHED_OTHER). The support for the realtime policies is somewhat more
>     half baked.)
>   - The basic performance numbers seem pretty close to 6.6-rc7 baseline
> 
> What's broken:
>   - ARCH_NO_PREEMPT (See patch-45 "preempt: ARCH_NO_PREEMPT only preempts
>     lazily")
>   - Non-x86 architectures. It's trivial to support other archs (only need
>     to add TIF_NEED_RESCHED_LAZY) but wanted to hold off until I got some
>     comments on the series.
>     (From some testing on arm64, didn't find any surprises.)
>   - livepatch: livepatch depends on using _cond_resched() to provide
>     low-latency patching. That is obviously difficult with cond_resched()
>     gone. We could get a similar effect by using a static_key in
>     preempt_enable() but at least with inline locks, that might be end
>     up bloating the kernel quite a bit.
>   - Documentation/ and comments mention cond_resched()
>   - ftrace support for need-resched-lazy is incomplete
> 
> What needs more discussion:
>   - Should cond_resched_lock() etc be scheduling out for TIF_NEED_RESCHED
>     only or both TIF_NEED_RESCHED_LAZY as well? (See patch 35 "thread_info:
>     change to tif_need_resched(resched_t)")
>   - Tracking whether a task in userspace or in the kernel (See patch-40
>     "context_tracking: add ct_state_cpu()")
>   - The right model for preempt=voluntary. (See patch 44 "sched: voluntary
>     preemption")
> 
> 
> Performance
> ==
> 
> Expectation:
> 
> * perf sched bench pipe
> 
> preemption               full           none
> 
> 6.6-rc7              6.68 +- 0.10   6.69 +- 0.07
> +series              6.69 +- 0.12   6.67 +- 0.10
> 
> This is rescheduling out of idle which should and does perform identically.
> 
> * schbench, preempt=none
> 
>    * 1 group, 16 threads each
> 
>                   6.6-rc7      +series
>                   (usecs)      (usecs)
>       50.0th:         6            6
>       90.0th:         8            7
>       99.0th:        11           11
>       99.9th:        15           14
>    
>    * 8 groups, 16 threads each
> 
>                  6.6-rc7       +series
>                   (usecs)      (usecs)
>       50.0th:         6            6
>       90.0th:         8            8
>       99.0th:        12           11
>       99.9th:        20           21
> 
> 
> * schbench, preempt=full
> 
>    * 1 group, 16 threads each
> 
>                  6.6-rc7       +series
>                  (usecs)       (usecs)
>       50.0th:         6            6
>       90.0th:         8            7
>       99.0th:        11           11
>       99.9th:        14           14
> 
> 
>    * 8 groups, 16 threads each
> 
>                  6.6-rc7       +series
>                  (usecs)       (usecs)
>       50.0th:         7            7
>       90.0th:         9            9
>       99.0th:        12           12
>       99.9th:        21           22
> 
> 
>    Not much in it either way.
> 
> * kernbench, preempt=full
> 
>    * half-load (-j 128)
> 
>             6.6-rc7                                    +series
> 
>    wall        149.2  +-     27.2             wall        132.8  +-     0.4
>    utime      8097.1  +-     57.4             utime      8088.5  +-    14.1
>    stime      1165.5  +-      9.4             stime      1159.2  +-     1.9
>    %cpu       6337.6  +-   1072.8             %cpu       6959.6  +-    22.8
>    csw      237618    +-   2190.6             %csw     240343    +-  1386.8
> 
> 
>    * optimal-load (-j 1024)
> 
>             6.6-rc7                                    +series
> 
>    wall        137.8 +-       0.0             wall       137.7  +-       0.8
>    utime     11115.0 +-    3306.1             utime    11041.7  +-    3235.0
>    stime      1340.0 +-     191.3             stime     1323.1  +-     179.5
>    %cpu       8846.3 +-    2830.6             %cpu      9101.3  +-    2346.7
>    csw     2099910   +- 2040080.0             csw    2068210    +- 2002450.0
> 
> 
>    The preempt=full path should effectively not see any change in
>    behaviour. The optimal-loads are pretty much identical.
>    For the half-load, however, the +series version does much better but that
>    seems to be because of much higher run to run variability in the 6.6-rc7 load.
> 
> * kernbench, preempt=none
> 
>    * half-load (-j 128)
> 
>             6.6-rc7                                    +series
> 
>    wall        134.5  +-      4.2             wall        133.6  +-     2.7
>    utime      8093.3  +-     39.3             utime      8099.0  +-    38.9
>    stime      1175.7  +-     10.6             stime      1169.1  +-     8.4
>    %cpu       6893.3  +-    233.2             %cpu       6936.3  +-   142.8
>    csw      240723    +-    423.0             %csw     173152    +-  1126.8
>                                               
> 
>    * optimal-load (-j 1024)
> 
>             6.6-rc7                                    +series
> 
>    wall        139.2 +-       0.3             wall       138.8  +-       0.2
>    utime     11161.0 +-    3360.4             utime    11061.2  +-    3244.9
>    stime      1357.6 +-     199.3             stime     1366.6  +-     216.3
>    %cpu       9108.8 +-    2431.4             %cpu      9081.0  +-    2351.1
>    csw     2078599   +- 2013320.0             csw    1970610    +- 1969030.0
> 
> 
>    For both of these the wallclock, utime, stime etc are pretty much
>    identical. The one interesting difference is that the number of
>    context switches are fewer. This intuitively makes sense given that
>    we reschedule threads lazily rather than rescheduling if we encounter
>    a cond_resched() and there's a thread wanting to be scheduled.
> 
>    The max-load numbers (not posted here) also behave similarly.
> 
> 
> Series
> ==
> 
> With that, this is how he series is laid out:
> 
>   - Patches 01-30: revert the PREEMPT_DYNAMIC code. Most of the infrastructure
>     used by that is via static_calls() and this is a simpler approach which
>     doesn't need any of that (and does away with cond_resched().)
> 
>     Some of the commits will be resurrected.
>         089c02ae2771 ("ftrace: Use preemption model accessors for trace header printout")
>         cfe43f478b79 ("preempt/dynamic: Introduce preemption model accessors")
>         5693fa74f98a ("kcsan: Use preemption model accessors")
> 
>   - Patches 31-45: contain the scheduler changes to do this. Of these
>     the critical ones are:
>       patch 35 "thread_info: change to tif_need_resched(resched_t)"
>       patch 41 "sched: handle resched policy in resched_curr()"
>       patch 43 "sched: enable PREEMPT_COUNT, PREEMPTION for all preemption models"
>       patch 44 "sched: voluntary preemption"
>        (this needs more work to decide when a higher sched-policy task
>         should preempt a lower sched-policy task)
>       patch 45 "preempt: ARCH_NO_PREEMPT only preempts lazily"
> 
>   - Patches 47-50: contain RCU related changes. RCU now works in both
>     PREEMPT_RCU=y and PREEMPT_RCU=n modes with CONFIG_PREEMPTION.
>     (Until now PREEMPTION=y => PREEMPT_RCU)
> 
>   - Patches 51-56,86: contain cond_resched() related cleanups.
>       patch 54 "sched: add cond_resched_stall()" adds a new cond_resched()
>       interface. Pitchforks?
> 
>   - Patches 57-86: remove cond_resched() from the tree.
> 
> 
> Also at: github.com/terminus/linux preemption-rfc
> 
> 
> Please review.
> 
> Thanks
> Ankur
> 
> [1] https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
> 
> 
> Ankur Arora (86):
>    Revert "riscv: support PREEMPT_DYNAMIC with static keys"
>    Revert "sched/core: Make sched_dynamic_mutex static"
>    Revert "ftrace: Use preemption model accessors for trace header
>      printout"
>    Revert "preempt/dynamic: Introduce preemption model accessors"
>    Revert "kcsan: Use preemption model accessors"
>    Revert "entry: Fix compile error in
>      dynamic_irqentry_exit_cond_resched()"
>    Revert "livepatch,sched: Add livepatch task switching to
>      cond_resched()"
>    Revert "arm64: Support PREEMPT_DYNAMIC"
>    Revert "sched/preempt: Add PREEMPT_DYNAMIC using static keys"
>    Revert "sched/preempt: Decouple HAVE_PREEMPT_DYNAMIC from
>      GENERIC_ENTRY"
>    Revert "sched/preempt: Simplify irqentry_exit_cond_resched() callers"
>    Revert "sched/preempt: Refactor sched_dynamic_update()"
>    Revert "sched/preempt: Move PREEMPT_DYNAMIC logic later"
>    Revert "preempt/dynamic: Fix setup_preempt_mode() return value"
>    Revert "preempt: Restore preemption model selection configs"
>    Revert "sched: Provide Kconfig support for default dynamic preempt
>      mode"
>    sched/preempt: remove PREEMPT_DYNAMIC from the build version
>    Revert "preempt/dynamic: Fix typo in macro conditional statement"
>    Revert "sched,preempt: Move preempt_dynamic to debug.c"
>    Revert "static_call: Relax static_call_update() function argument
>      type"
>    Revert "sched/core: Use -EINVAL in sched_dynamic_mode()"
>    Revert "sched/core: Stop using magic values in sched_dynamic_mode()"
>    Revert "sched,x86: Allow !PREEMPT_DYNAMIC"
>    Revert "sched: Harden PREEMPT_DYNAMIC"
>    Revert "sched: Add /debug/sched_preempt"
>    Revert "preempt/dynamic: Support dynamic preempt with preempt= boot
>      option"
>    Revert "preempt/dynamic: Provide irqentry_exit_cond_resched() static
>      call"
>    Revert "preempt/dynamic: Provide preempt_schedule[_notrace]() static
>      calls"
>    Revert "preempt/dynamic: Provide cond_resched() and might_resched()
>      static calls"
>    Revert "preempt: Introduce CONFIG_PREEMPT_DYNAMIC"
>    x86/thread_info: add TIF_NEED_RESCHED_LAZY
>    entry: handle TIF_NEED_RESCHED_LAZY
>    entry/kvm: handle TIF_NEED_RESCHED_LAZY
>    thread_info: accessors for TIF_NEED_RESCHED*
>    thread_info: change to tif_need_resched(resched_t)
>    entry: irqentry_exit only preempts TIF_NEED_RESCHED
>    sched: make test_*_tsk_thread_flag() return bool
>    sched: *_tsk_need_resched() now takes resched_t
>    sched: handle lazy resched in set_nr_*_polling()
>    context_tracking: add ct_state_cpu()
>    sched: handle resched policy in resched_curr()
>    sched: force preemption on tick expiration
>    sched: enable PREEMPT_COUNT, PREEMPTION for all preemption models
>    sched: voluntary preemption
>    preempt: ARCH_NO_PREEMPT only preempts lazily
>    tracing: handle lazy resched
>    rcu: select PREEMPT_RCU if PREEMPT
>    rcu: handle quiescent states for PREEMPT_RCU=n
>    osnoise: handle quiescent states directly
>    rcu: TASKS_RCU does not need to depend on PREEMPTION
>    preempt: disallow !PREEMPT_COUNT or !PREEMPTION
>    sched: remove CONFIG_PREEMPTION from *_needbreak()
>    sched: fixup __cond_resched_*()
>    sched: add cond_resched_stall()
>    xarray: add cond_resched_xas_rcu() and cond_resched_xas_lock_irq()
>    xarray: use cond_resched_xas*()
>    coccinelle: script to remove cond_resched()
>    treewide: x86: remove cond_resched()
>    treewide: rcu: remove cond_resched()
>    treewide: torture: remove cond_resched()
>    treewide: bpf: remove cond_resched()
>    treewide: trace: remove cond_resched()
>    treewide: futex: remove cond_resched()
>    treewide: printk: remove cond_resched()
>    treewide: task_work: remove cond_resched()
>    treewide: kernel: remove cond_resched()
>    treewide: kernel: remove cond_reshed()
>    treewide: mm: remove cond_resched()
>    treewide: io_uring: remove cond_resched()
>    treewide: ipc: remove cond_resched()
>    treewide: lib: remove cond_resched()
>    treewide: crypto: remove cond_resched()
>    treewide: security: remove cond_resched()
>    treewide: fs: remove cond_resched()
>    treewide: virt: remove cond_resched()
>    treewide: block: remove cond_resched()
>    treewide: netfilter: remove cond_resched()
>    treewide: net: remove cond_resched()
>    treewide: net: remove cond_resched()
>    treewide: sound: remove cond_resched()
>    treewide: md: remove cond_resched()
>    treewide: mtd: remove cond_resched()
>    treewide: drm: remove cond_resched()
>    treewide: net: remove cond_resched()
>    treewide: drivers: remove cond_resched()
>    sched: remove cond_resched()

I'm missing the removal of the Xen parts, which were one of the reasons to start
this whole work (xen_in_preemptible_hcall etc.).


Juergen
Vlastimil Babka Nov. 8, 2023, 7:54 a.m. UTC | #10
On 11/8/23 06:12, Steven Rostedt wrote:
> On Tue, 7 Nov 2023 20:52:39 -0800 (PST)
> Christoph Lameter <cl@linux.com> wrote:
> 
>> On Tue, 7 Nov 2023, Ankur Arora wrote:
>> 
>> > This came up in an earlier discussion (See
>> > https://lore.kernel.org/lkml/87cyyfxd4k.ffs@tglx/) and Thomas mentioned
>> > that preempt_enable/_disable() overhead was relatively minimal.
>> >
>> > Is your point that always-on preempt_count is far too expensive?  
>> 
>> Yes over the years distros have traditionally delivered their kernels by 
>> default without preemption because of these issues. If the overhead has 
>> been minimized then that may have changed. Even if so there is still a lot 
>> of code being generated that has questionable benefit and just 
>> bloats the kernel.
>> 
>> >> These are needed to avoid adding preempt_enable/disable to a lot of primitives
>> >> that are used for synchronization. You cannot remove those without changing a
>> >> lot of synchronization primitives to always have to consider being preempted
>> >> while operating.  
>> >
>> > I'm afraid I don't understand why you would need to change any
>> > synchronization primitives. The code that does preempt_enable/_disable()
>> > is compiled out because CONFIG_PREEMPT_NONE/_VOLUNTARY don't define
>> > CONFIG_PREEMPT_COUNT.  
>> 
>> In the trivial cases it is simple like that. But look f.e.
>> in the slub allocator at the #ifdef CONFIG_PREEMPTION section. There is a 
>> overhead added to be able to allow the cpu to change under us. There are 
>> likely other examples in the source.
>> 
> 
> preempt_disable() and preempt_enable() are much lower overhead today than
> it use to be.
> 
> If you are worried about changing CPUs, there's also migrate_disable() too.

Note that while migrate_disable() would be often sufficient, the
implementation of it has actually more overhead (function call, does
preempt_disable()/enable() as part of it) than just preempt_disable(). See
for example the pcpu_task_pin() definition in mm/page_alloc.c
Peter Zijlstra Nov. 8, 2023, 8:51 a.m. UTC | #11
On Tue, Nov 07, 2023 at 01:56:46PM -0800, Ankur Arora wrote:
> Hi,
> 
> We have two models of preemption: voluntary and full 

3, also none (RT is not actually a preemption model).

> (and RT which is
> a fuller form of full preemption.)

It is not in fact a different preemption model, it is the same full
preemption, the difference with RT is that it makes a lot more stuff
preemptible, but the fundamental preemption model is the same -- full.

> In this series -- which is based
> on Thomas' PoC (see [1]), we try to unify the two by letting the
> scheduler enforce policy for the voluntary preemption models as well.

Well, you've also taken out preempt_dynamic for some obscure reason :/


> Please review.

> Ankur Arora (86):
>   Revert "riscv: support PREEMPT_DYNAMIC with static keys"
>   Revert "sched/core: Make sched_dynamic_mutex static"
>   Revert "ftrace: Use preemption model accessors for trace header
>     printout"
>   Revert "preempt/dynamic: Introduce preemption model accessors"
>   Revert "kcsan: Use preemption model accessors"
>   Revert "entry: Fix compile error in
>     dynamic_irqentry_exit_cond_resched()"
>   Revert "livepatch,sched: Add livepatch task switching to
>     cond_resched()"
>   Revert "arm64: Support PREEMPT_DYNAMIC"
>   Revert "sched/preempt: Add PREEMPT_DYNAMIC using static keys"
>   Revert "sched/preempt: Decouple HAVE_PREEMPT_DYNAMIC from
>     GENERIC_ENTRY"
>   Revert "sched/preempt: Simplify irqentry_exit_cond_resched() callers"
>   Revert "sched/preempt: Refactor sched_dynamic_update()"
>   Revert "sched/preempt: Move PREEMPT_DYNAMIC logic later"
>   Revert "preempt/dynamic: Fix setup_preempt_mode() return value"
>   Revert "preempt: Restore preemption model selection configs"
>   Revert "sched: Provide Kconfig support for default dynamic preempt
>     mode"
>   sched/preempt: remove PREEMPT_DYNAMIC from the build version
>   Revert "preempt/dynamic: Fix typo in macro conditional statement"
>   Revert "sched,preempt: Move preempt_dynamic to debug.c"
>   Revert "static_call: Relax static_call_update() function argument
>     type"
>   Revert "sched/core: Use -EINVAL in sched_dynamic_mode()"
>   Revert "sched/core: Stop using magic values in sched_dynamic_mode()"
>   Revert "sched,x86: Allow !PREEMPT_DYNAMIC"
>   Revert "sched: Harden PREEMPT_DYNAMIC"
>   Revert "sched: Add /debug/sched_preempt"
>   Revert "preempt/dynamic: Support dynamic preempt with preempt= boot
>     option"
>   Revert "preempt/dynamic: Provide irqentry_exit_cond_resched() static
>     call"
>   Revert "preempt/dynamic: Provide preempt_schedule[_notrace]() static
>     calls"
>   Revert "preempt/dynamic: Provide cond_resched() and might_resched()
>     static calls"
>   Revert "preempt: Introduce CONFIG_PREEMPT_DYNAMIC"

NAK

Even if you were to remove PREEMPT_NONE, which should be a separate
series, but that isn't on the table at all afaict, removing
preempt_dynamic doesn't make sense.

You still want the preempt= boot time argument and the
/debug/sched/preempt things to dynamically switch between the models.

Please, focus on the voluntary thing, gut that and then replace it with
the lazy thing, but leave everything else in place.

Re dynamic preempt, gutting the current voluntary preemption model means
getting rid of the cond_resched and might_resched toggles but you'll
gain a switch to kill the lazy stuff.
David Laight Nov. 8, 2023, 9:43 a.m. UTC | #12
From: Ankur Arora
> Sent: 07 November 2023 21:57
...
> There are four main sets of preemption points in the kernel:
> 
>  1. return to user
>  2. explicit preemption points (cond_resched() and its ilk)
>  3. return to kernel (tick/IPI/irq at irqexit)
>  4. end of non-preemptible sections at (preempt_count() == preempt_offset)
> 
...
> Policies:
> 
> A - preemption=none: run to completion
> B - preemption=voluntary: run to completion, unless a task of higher
>     sched-class awaits
> C - preemption=full: optimized for low-latency. Preempt whenever a higher
>     priority task awaits.

If you remove cond_resched() then won't both B and C require an extra IPI.
That is probably OK for RT tasks but could get expensive for
normal tasks that aren't bound to a specific cpu.

I suspect C could also lead to tasks being pre-empted just before
they sleep (eg after waking another task).
There might already be mitigation for that, I'm not sure if
a voluntary sleep can be done in a non-pre-emptible section.

Certainly it should all help the scheduling of RT tasks - which
can currently get delayed by a non-RT task in a slow kernel path.
Although the worst one is the softint code...

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
Daniel Bristot de Oliveira Nov. 8, 2023, 9:53 a.m. UTC | #13
On 11/8/23 09:51, Peter Zijlstra wrote:
> On Tue, Nov 07, 2023 at 01:56:46PM -0800, Ankur Arora wrote:
>> Hi,
>>
>> We have two models of preemption: voluntary and full 
> 3, also none (RT is not actually a preemption model).
> 
>> (and RT which is
>> a fuller form of full preemption.)
> It is not in fact a different preemption model, it is the same full
> preemption, the difference with RT is that it makes a lot more stuff
> preemptible, but the fundamental preemption model is the same -- full.

+1

-- Daniel
Ankur Arora Nov. 8, 2023, 10:04 a.m. UTC | #14
Peter Zijlstra <peterz@infradead.org> writes:

> On Tue, Nov 07, 2023 at 01:56:46PM -0800, Ankur Arora wrote:
>> Hi,
>>
>> We have two models of preemption: voluntary and full
>
> 3, also none (RT is not actually a preemption model).

I think we are just using preemption models differently.

I was trying to distinguish between how preemption happens: thus
voluntary (based on voluntary preemption points), and full
(based on preemption state.)

>> (and RT which is
>> a fuller form of full preemption.)
>
> It is not in fact a different preemption model, it is the same full
> preemption, the difference with RT is that it makes a lot more stuff
> preemptible, but the fundamental preemption model is the same -- full.
>
>> In this series -- which is based
>> on Thomas' PoC (see [1]), we try to unify the two by letting the
>> scheduler enforce policy for the voluntary preemption models as well.
>
> Well, you've also taken out preempt_dynamic for some obscure reason :/

Sorry about that :). I didn't mention it because I was using preemption
model in the sense I describe above. And to my mind preempt_dynamic is
a clever mechanism that switches between other preemption models.

>> Please review.
>
>> Ankur Arora (86):
>>   Revert "riscv: support PREEMPT_DYNAMIC with static keys"
>>   Revert "sched/core: Make sched_dynamic_mutex static"
>>   Revert "ftrace: Use preemption model accessors for trace header
>>     printout"
>>   Revert "preempt/dynamic: Introduce preemption model accessors"
>>   Revert "kcsan: Use preemption model accessors"
>>   Revert "entry: Fix compile error in
>>     dynamic_irqentry_exit_cond_resched()"
>>   Revert "livepatch,sched: Add livepatch task switching to
>>     cond_resched()"
>>   Revert "arm64: Support PREEMPT_DYNAMIC"
>>   Revert "sched/preempt: Add PREEMPT_DYNAMIC using static keys"
>>   Revert "sched/preempt: Decouple HAVE_PREEMPT_DYNAMIC from
>>     GENERIC_ENTRY"
>>   Revert "sched/preempt: Simplify irqentry_exit_cond_resched() callers"
>>   Revert "sched/preempt: Refactor sched_dynamic_update()"
>>   Revert "sched/preempt: Move PREEMPT_DYNAMIC logic later"
>>   Revert "preempt/dynamic: Fix setup_preempt_mode() return value"
>>   Revert "preempt: Restore preemption model selection configs"
>>   Revert "sched: Provide Kconfig support for default dynamic preempt
>>     mode"
>>   sched/preempt: remove PREEMPT_DYNAMIC from the build version
>>   Revert "preempt/dynamic: Fix typo in macro conditional statement"
>>   Revert "sched,preempt: Move preempt_dynamic to debug.c"
>>   Revert "static_call: Relax static_call_update() function argument
>>     type"
>>   Revert "sched/core: Use -EINVAL in sched_dynamic_mode()"
>>   Revert "sched/core: Stop using magic values in sched_dynamic_mode()"
>>   Revert "sched,x86: Allow !PREEMPT_DYNAMIC"
>>   Revert "sched: Harden PREEMPT_DYNAMIC"
>>   Revert "sched: Add /debug/sched_preempt"
>>   Revert "preempt/dynamic: Support dynamic preempt with preempt= boot
>>     option"
>>   Revert "preempt/dynamic: Provide irqentry_exit_cond_resched() static
>>     call"
>>   Revert "preempt/dynamic: Provide preempt_schedule[_notrace]() static
>>     calls"
>>   Revert "preempt/dynamic: Provide cond_resched() and might_resched()
>>     static calls"
>>   Revert "preempt: Introduce CONFIG_PREEMPT_DYNAMIC"
>
> NAK
>
> Even if you were to remove PREEMPT_NONE, which should be a separate
> series, but that isn't on the table at all afaict, removing
> preempt_dynamic doesn't make sense.

Agreed. I don't intend to remove PREEMPT_NONE. And, obviously you
do want preempt_dynamic like toggling abilities.

> You still want the preempt= boot time argument and the
> /debug/sched/preempt things to dynamically switch between the models.

Also, yes.

> Please, focus on the voluntary thing, gut that and then replace it with
> the lazy thing, but leave everything else in place.
>
> Re dynamic preempt, gutting the current voluntary preemption model means
> getting rid of the cond_resched and might_resched toggles but you'll
> gain a switch to kill the lazy stuff.

Yes. I think I mostly agree with you.

And, I should have thought this whole revert thing through.
Reverting it wasn't really the plan. The plan was to revert these
patches temporarily, put in the changes you see in this series, and
then pull in the relevant bits of the preempt_dynamic.

Only I decided to push that to later. Sigh.

On your NAK, what about these patches for instance:

>>   Revert "riscv: support PREEMPT_DYNAMIC with static keys"
>>   Revert "livepatch,sched: Add livepatch task switching to
>>     cond_resched()"
>>   Revert "arm64: Support PREEMPT_DYNAMIC"
>>   Revert "sched/preempt: Add PREEMPT_DYNAMIC using static keys"

What's the best way to handle these? With the lazy bit, cond_resched()
and might_resched() are gone. So we don't need all of the static
key inftrastructure for toggling etc.

The part of preempt_dynamic that makes sense to me is the one that
switches dynamically between none/voluntary/full. Here it would need
to be wired onto controls of the lazy bit.
(Right now the preemption policy is controlled by sched_feat in
patches 43, and 44 but sched/preempt is a much better interface.)

--
ankur
Peter Zijlstra Nov. 8, 2023, 10:13 a.m. UTC | #15
On Wed, Nov 08, 2023 at 02:04:02AM -0800, Ankur Arora wrote:

> >>   Revert "riscv: support PREEMPT_DYNAMIC with static keys"
> >>   Revert "livepatch,sched: Add livepatch task switching to
> >>     cond_resched()"
> >>   Revert "arm64: Support PREEMPT_DYNAMIC"
> >>   Revert "sched/preempt: Add PREEMPT_DYNAMIC using static keys"
> 
> What's the best way to handle these? With the lazy bit, cond_resched()
> and might_resched() are gone. So we don't need all of the static
> key inftrastructure for toggling etc.
> 
> The part of preempt_dynamic that makes sense to me is the one that
> switches dynamically between none/voluntary/full. Here it would need
> to be wired onto controls of the lazy bit.
> (Right now the preemption policy is controlled by sched_feat in
> patches 43, and 44 but sched/preempt is a much better interface.)

I'm not understanding, those should stay obviously.

The current preempt_dynamic stuff has 5 toggles:

/*
 * SC:cond_resched
 * SC:might_resched
 * SC:preempt_schedule
 * SC:preempt_schedule_notrace
 * SC:irqentry_exit_cond_resched
 *
 *
 * NONE:
 *   cond_resched               <- __cond_resched
 *   might_resched              <- RET0
 *   preempt_schedule           <- NOP
 *   preempt_schedule_notrace   <- NOP
 *   irqentry_exit_cond_resched <- NOP
 *
 * VOLUNTARY:
 *   cond_resched               <- __cond_resched
 *   might_resched              <- __cond_resched
 *   preempt_schedule           <- NOP
 *   preempt_schedule_notrace   <- NOP
 *   irqentry_exit_cond_resched <- NOP
 *
 * FULL:
 *   cond_resched               <- RET0
 *   might_resched              <- RET0
 *   preempt_schedule           <- preempt_schedule
 *   preempt_schedule_notrace   <- preempt_schedule_notrace
 *   irqentry_exit_cond_resched <- irqentry_exit_cond_resched
 */

If you kill voluntary as we know it today, you can remove cond_resched
and might_resched, but the remaining 3 are still needed to switch
between NONE and FULL.

Additionally, you'll get one new state to enable/disable the LAZY stuff.
Neither NONE nor FULL want the LAZY thing on.

You'll then end up with something like:

/*
 * SK:preempt_lazy
 * SC:preempt_schedule
 * SC:preempt_schedule_notrace
 * SC:irqentry_exit_cond_resched
 *
 *
 * NONE:
 *   preempt_lazy		<- OFF
 *   preempt_schedule           <- NOP
 *   preempt_schedule_notrace   <- NOP
 *   irqentry_exit_cond_resched <- NOP
 *
 * VOLUNTARY:
 *   preempt_lazy		<- ON
 *   preempt_schedule           <- preempt_schedule
 *   preempt_schedule_notrace   <- preempt_schedule_notrace
 *   irqentry_exit_cond_resched <- irqentry_exit_cond_resched
 *
 * FULL:
 *   preempt_lazy		<- OFF
 *   preempt_schedule           <- preempt_schedule
 *   preempt_schedule_notrace   <- preempt_schedule_notrace
 *   irqentry_exit_cond_resched <- irqentry_exit_cond_resched
 */

For the architectures that do not have static_call but instead use
static_key for everything, the SC's are obviously static_key based
wrappers around the function calls -- like now.
Ankur Arora Nov. 8, 2023, 11 a.m. UTC | #16
Peter Zijlstra <peterz@infradead.org> writes:

> On Wed, Nov 08, 2023 at 02:04:02AM -0800, Ankur Arora wrote:
>
>> >>   Revert "riscv: support PREEMPT_DYNAMIC with static keys"
>> >>   Revert "livepatch,sched: Add livepatch task switching to
>> >>     cond_resched()"
>> >>   Revert "arm64: Support PREEMPT_DYNAMIC"
>> >>   Revert "sched/preempt: Add PREEMPT_DYNAMIC using static keys"
>>
>> What's the best way to handle these? With the lazy bit, cond_resched()
>> and might_resched() are gone. So we don't need all of the static
>> key inftrastructure for toggling etc.
>>
>> The part of preempt_dynamic that makes sense to me is the one that
>> switches dynamically between none/voluntary/full. Here it would need
>> to be wired onto controls of the lazy bit.
>> (Right now the preemption policy is controlled by sched_feat in
>> patches 43, and 44 but sched/preempt is a much better interface.)
>
> I'm not understanding, those should stay obviously.
>
> The current preempt_dynamic stuff has 5 toggles:
>
> /*
>  * SC:cond_resched
>  * SC:might_resched
>  * SC:preempt_schedule
>  * SC:preempt_schedule_notrace
>  * SC:irqentry_exit_cond_resched
>  *
>  *
>  * NONE:
>  *   cond_resched               <- __cond_resched
>  *   might_resched              <- RET0
>  *   preempt_schedule           <- NOP
>  *   preempt_schedule_notrace   <- NOP
>  *   irqentry_exit_cond_resched <- NOP
>  *
>  * VOLUNTARY:
>  *   cond_resched               <- __cond_resched
>  *   might_resched              <- __cond_resched
>  *   preempt_schedule           <- NOP
>  *   preempt_schedule_notrace   <- NOP
>  *   irqentry_exit_cond_resched <- NOP
>  *
>  * FULL:
>  *   cond_resched               <- RET0
>  *   might_resched              <- RET0
>  *   preempt_schedule           <- preempt_schedule
>  *   preempt_schedule_notrace   <- preempt_schedule_notrace
>  *   irqentry_exit_cond_resched <- irqentry_exit_cond_resched
>  */
>
> If you kill voluntary as we know it today, you can remove cond_resched
> and might_resched, but the remaining 3 are still needed to switch
> between NONE and FULL.

Ah now I see what you are saying.

Quick thought: even if we were running under NONE, eventually you'll
want to forcibly preempt out a CPU hog. So we will need to have
at least this one enabled always:

>  *   irqentry_exit_cond_resched <- irqentry_exit_cond_resched

These two, it might make sense to toggle them based on model.

>  *   preempt_schedule           <- preempt_schedule
>  *   preempt_schedule_notrace   <- preempt_schedule_notrace

Anyway let me think about this more and respond tomorrow.

For now, time for bed.

Thanks for clarifying btw.

Ankur

> Additionally, you'll get one new state to enable/disable the LAZY stuff.
> Neither NONE nor FULL want the LAZY thing on.
>
> You'll then end up with something like:
>
> /*
>  * SK:preempt_lazy
>  * SC:preempt_schedule
>  * SC:preempt_schedule_notrace
>  * SC:irqentry_exit_cond_resched
>  *
>  *
>  * NONE:
>  *   preempt_lazy		<- OFF
>  *   preempt_schedule           <- NOP
>  *   preempt_schedule_notrace   <- NOP
>  *   irqentry_exit_cond_resched <- NOP
>  *
>  * VOLUNTARY:
>  *   preempt_lazy		<- ON
>  *   preempt_schedule           <- preempt_schedule
>  *   preempt_schedule_notrace   <- preempt_schedule_notrace
>  *   irqentry_exit_cond_resched <- irqentry_exit_cond_resched
>  *
>  * FULL:
>  *   preempt_lazy		<- OFF
>  *   preempt_schedule           <- preempt_schedule
>  *   preempt_schedule_notrace   <- preempt_schedule_notrace
>  *   irqentry_exit_cond_resched <- irqentry_exit_cond_resched
>  */
>
> For the architectures that do not have static_call but instead use
> static_key for everything, the SC's are obviously static_key based
> wrappers around the function calls -- like now.
Peter Zijlstra Nov. 8, 2023, 11:14 a.m. UTC | #17
On Wed, Nov 08, 2023 at 03:00:44AM -0800, Ankur Arora wrote:

> Ah now I see what you are saying.
> 
> Quick thought: even if we were running under NONE, eventually you'll

Well, NONE, you get what you pay for etc..

> want to forcibly preempt out a CPU hog. So we will need to have
> at least this one enabled always:
> 
> >  *   irqentry_exit_cond_resched <- irqentry_exit_cond_resched
> 
> These two, it might make sense to toggle them based on model.
> 
> >  *   preempt_schedule           <- preempt_schedule
> >  *   preempt_schedule_notrace   <- preempt_schedule_notrace
> 
> Anyway let me think about this more and respond tomorrow.

That's more or less killing NONE. At that point there's really no point
in having it.

I would really suggest you start with transforming VOLUNTARY into the
LAZY thing, keep it as simple and narrow as possible.

Once you've got that done, then you can try and argue that NONE makes no
sense and try and take it out.

Smaller step etc..
Peter Zijlstra Nov. 8, 2023, 12:16 p.m. UTC | #18
On Wed, Nov 08, 2023 at 12:14:15PM +0100, Peter Zijlstra wrote:

> I would really suggest you start with transforming VOLUNTARY into the
> LAZY thing, keep it as simple and narrow as possible.

Possibly make it worse first, add LAZY as a fourth option, then show it
makes VOLUNTARY redundant, kill that.

> Once you've got that done, then you can try and argue that NONE makes no
> sense and try and take it out.

This also pushes out having to deal with the !PREEMPT archs until the
very last moment.

And once you're here, cond_resched() should be an obvious no-op function
and you can go delete them.

Anyway, as said, smaller steps more better. Nobody likes 86 patches in
their inbox in the morning (or at any time really).
Steven Rostedt Nov. 8, 2023, 3:15 p.m. UTC | #19
On Wed, 8 Nov 2023 09:43:10 +0000
David Laight <David.Laight@ACULAB.COM> wrote:

> > Policies:
> > 
> > A - preemption=none: run to completion
> > B - preemption=voluntary: run to completion, unless a task of higher
> >     sched-class awaits
> > C - preemption=full: optimized for low-latency. Preempt whenever a higher
> >     priority task awaits.  
> 
> If you remove cond_resched() then won't both B and C require an extra IPI.
> That is probably OK for RT tasks but could get expensive for
> normal tasks that aren't bound to a specific cpu.

What IPI is extra?

> 
> I suspect C could also lead to tasks being pre-empted just before
> they sleep (eg after waking another task).
> There might already be mitigation for that, I'm not sure if
> a voluntary sleep can be done in a non-pre-emptible section.

No, voluntary sleep can not be done in a preemptible section.

-- Steve
Thomas Gleixner Nov. 8, 2023, 3:38 p.m. UTC | #20
On Wed, Nov 08 2023 at 11:13, Peter Zijlstra wrote:
> On Wed, Nov 08, 2023 at 02:04:02AM -0800, Ankur Arora wrote:
> I'm not understanding, those should stay obviously.
>
> The current preempt_dynamic stuff has 5 toggles:
>
> /*
>  * SC:cond_resched
>  * SC:might_resched
>  * SC:preempt_schedule
>  * SC:preempt_schedule_notrace
>  * SC:irqentry_exit_cond_resched
>  *
>  *
>  * NONE:
>  *   cond_resched               <- __cond_resched
>  *   might_resched              <- RET0
>  *   preempt_schedule           <- NOP
>  *   preempt_schedule_notrace   <- NOP
>  *   irqentry_exit_cond_resched <- NOP
>  *
>  * VOLUNTARY:
>  *   cond_resched               <- __cond_resched
>  *   might_resched              <- __cond_resched
>  *   preempt_schedule           <- NOP
>  *   preempt_schedule_notrace   <- NOP
>  *   irqentry_exit_cond_resched <- NOP
>  *
>  * FULL:
>  *   cond_resched               <- RET0
>  *   might_resched              <- RET0
>  *   preempt_schedule           <- preempt_schedule
>  *   preempt_schedule_notrace   <- preempt_schedule_notrace
>  *   irqentry_exit_cond_resched <- irqentry_exit_cond_resched
>  */
>
> If you kill voluntary as we know it today, you can remove cond_resched
> and might_resched, but the remaining 3 are still needed to switch
> between NONE and FULL.

No. The whole point of LAZY is to keep preempt_schedule(),
preempt_schedule_notrace(), irqentry_exit_cond_resched() always enabled.

Look at my PoC: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/

The idea is to always enable preempt count and keep _all_ preemption
points enabled.

For NONE/VOLUNTARY mode let the scheduler set TIF_NEED_RESCHED_LAZY
instead of TIF_NEED_RESCHED. In full mode set TIF_NEED_RESCHED.

Here is where the regular and the lazy flags are evaluated:

                Ret2user        Ret2kernel      PreemptCnt=0  need_resched()

NEED_RESCHED       Y                Y               Y         Y
LAZY_RESCHED       Y                N               N         Y

The trick is that LAZY is not folded into preempt_count so a 1->0
counter transition won't cause preempt_schedule() to be invoked because
the topmost bit (NEED_RESCHED) is set.

The scheduler can still decide to set TIF_NEED_RESCHED which will cause
an immediate preemption at the next preemption point.

This allows to force out a task which loops, e.g. in a massive copy or
clear operation, as it did not reach a point where TIF_NEED_RESCHED_LAZY
is evaluated after a time which is defined by the scheduler itself.

For my PoC I did:

    1) Set TIF_NEED_RESCHED_LAZY

    2) Set TIF_NEED_RESCHED when the task did not react on
       TIF_NEED_RESCHED_LAZY within a tick

I know that's crude but it just works and obviously requires quite some
refinement.

So the way how you switch between preemption modes is to select when the
scheduler sets TIF_NEED_RESCHED/TIF_NEED_RESCHED_LAZY. No static call
switching at all.

In full preemption mode it sets always TIF_NEED_RESCHED and otherwise it
uses the LAZY bit first, grants some time and then gets out the hammer
and sets TIF_NEED_RESCHED when the task did not reach a LAZY preemption
point.

Which means once the whole thing is in place then the whole
PREEMPT_DYNAMIC along with NONE, VOLUNTARY, FULL can go away along with
the cond_resched() hackery.

So I think this series is backwards.

It should add the LAZY muck with a Kconfig switch like I did in my PoC
_first_. Once that is working and agreed on, the existing muck can be
removed.

Thanks,

        tglx
Peter Zijlstra Nov. 8, 2023, 4:15 p.m. UTC | #21
On Wed, Nov 08, 2023 at 04:38:11PM +0100, Thomas Gleixner wrote:

> No. The whole point of LAZY is to keep preempt_schedule(),
> preempt_schedule_notrace(), irqentry_exit_cond_resched() always enabled.

Yeah, I got that. What wasn't at all clear is that it also wanted to
replace NONE.

0/n didn't even mention none.
Steven Rostedt Nov. 8, 2023, 4:22 p.m. UTC | #22
On Wed, 08 Nov 2023 16:38:11 +0100
Thomas Gleixner <tglx@linutronix.de> wrote:

> On Wed, Nov 08 2023 at 11:13, Peter Zijlstra wrote:
> > On Wed, Nov 08, 2023 at 02:04:02AM -0800, Ankur Arora wrote:
> > I'm not understanding, those should stay obviously.
> >
> > The current preempt_dynamic stuff has 5 toggles:
> >
> > /*
> >  * SC:cond_resched
> >  * SC:might_resched
> >  * SC:preempt_schedule
> >  * SC:preempt_schedule_notrace
> >  * SC:irqentry_exit_cond_resched
> >  *
> >  *
> >  * NONE:
> >  *   cond_resched               <- __cond_resched
> >  *   might_resched              <- RET0
> >  *   preempt_schedule           <- NOP
> >  *   preempt_schedule_notrace   <- NOP
> >  *   irqentry_exit_cond_resched <- NOP
> >  *
> >  * VOLUNTARY:
> >  *   cond_resched               <- __cond_resched
> >  *   might_resched              <- __cond_resched
> >  *   preempt_schedule           <- NOP
> >  *   preempt_schedule_notrace   <- NOP
> >  *   irqentry_exit_cond_resched <- NOP
> >  *
> >  * FULL:
> >  *   cond_resched               <- RET0
> >  *   might_resched              <- RET0
> >  *   preempt_schedule           <- preempt_schedule
> >  *   preempt_schedule_notrace   <- preempt_schedule_notrace
> >  *   irqentry_exit_cond_resched <- irqentry_exit_cond_resched
> >  */
> >
> > If you kill voluntary as we know it today, you can remove cond_resched
> > and might_resched, but the remaining 3 are still needed to switch
> > between NONE and FULL.  
> 
> No. The whole point of LAZY is to keep preempt_schedule(),
> preempt_schedule_notrace(), irqentry_exit_cond_resched() always enabled.

Right.

 * NONE:
 *   cond_resched               <- __cond_resched
 *   might_resched              <- RET0
 *   preempt_schedule           <- NOP
 *   preempt_schedule_notrace   <- NOP
 *   irqentry_exit_cond_resched <- NOP

Peter, how can you say we can get rid of cond_resched() in NONE when you
show that NONE still uses it? I thought the entire point of this was to get
rid of all the cond_resched() and they are there for PREEMPT_NONE as well as
VOLUNTARY. As you showed above, the only difference between NONE and
VOLUNTARY was the might_sleep.

> 
> Look at my PoC: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/

And I've been saying that many times already ;-)

Thanks Thomas for reiterating it.

-- Steve
David Laight Nov. 8, 2023, 4:29 p.m. UTC | #23
From: Steven Rostedt
> Sent: 08 November 2023 15:16
> 
> On Wed, 8 Nov 2023 09:43:10 +0000
> David Laight <David.Laight@ACULAB.COM> wrote:
> 
> > > Policies:
> > >
> > > A - preemption=none: run to completion
> > > B - preemption=voluntary: run to completion, unless a task of higher
> > >     sched-class awaits
> > > C - preemption=full: optimized for low-latency. Preempt whenever a higher
> > >     priority task awaits.
> >
> > If you remove cond_resched() then won't both B and C require an extra IPI.
> > That is probably OK for RT tasks but could get expensive for
> > normal tasks that aren't bound to a specific cpu.
> 
> What IPI is extra?

I was thinking that you wouldn't currently need an IPI if the target cpu
was running in-kernel because nothing would happen until cond_resched()
was called.

> > I suspect C could also lead to tasks being pre-empted just before
> > they sleep (eg after waking another task).
> > There might already be mitigation for that, I'm not sure if
> > a voluntary sleep can be done in a non-pre-emptible section.
> 
> No, voluntary sleep can not be done in a preemptible section.

I'm guessing you missed out a negation in that (or s/not/only/).

I was thinking about sequences like:
	wake_up();
	...
	set_current_state(TASK_UNINTERRUPTIBLE)
	add_wait_queue();
	spin_unlock();
	schedule();

Where you really don't want to be pre-empted by the woken up task.
For non CONFIG_RT the lock might do it - if held long enough.
Otherwise you'll need to have pre-emption disabled and enable
it just after the set_current_state().
And then quite likely disable again after the schedule()
to balance things out.

So having the scheduler save the pre-empt disable count might
be useful.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
Mark Rutland Nov. 8, 2023, 4:33 p.m. UTC | #24
On Tue, Nov 07, 2023 at 01:56:46PM -0800, Ankur Arora wrote:
> What's broken:
>  - ARCH_NO_PREEMPT (See patch-45 "preempt: ARCH_NO_PREEMPT only preempts
>    lazily")
>  - Non-x86 architectures. It's trivial to support other archs (only need
>    to add TIF_NEED_RESCHED_LAZY) but wanted to hold off until I got some
>    comments on the series.
>    (From some testing on arm64, didn't find any surprises.)

When you say "testing on arm64, didn't find any surprises", I assume you mean
with an additional patch adding TIF_NEED_RESCHED_LAZY?

Applying this series as-is atop v6.6-rc7 and building defconfig (with GCC
13.2.0) blows up with:

| In file included from ./arch/arm64/include/asm/preempt.h:5,
|                  from ./include/linux/preempt.h:79,
|                  from ./include/linux/spinlock.h:56,
|                  from ./include/linux/mmzone.h:8,
|                  from ./include/linux/gfp.h:7,
|                  from ./include/linux/slab.h:16,
|                  from ./include/linux/resource_ext.h:11,
|                  from ./include/linux/acpi.h:13,
|                  from ./include/acpi/apei.h:9,
|                  from ./include/acpi/ghes.h:5,
|                  from ./include/linux/arm_sdei.h:8,
|                  from arch/arm64/kernel/asm-offsets.c:10:
| ./include/linux/thread_info.h:63:2: error: #error "Arch needs to define TIF_NEED_RESCHED_LAZY"
|    63 | #error "Arch needs to define TIF_NEED_RESCHED_LAZY"
|       |  ^~~~~
| ./include/linux/thread_info.h:66:42: error: 'TIF_NEED_RESCHED_LAZY' undeclared here (not in a function); did you mean 'TIF_NEED_RESCHED'?
|    66 | #define TIF_NEED_RESCHED_LAZY_OFFSET    (TIF_NEED_RESCHED_LAZY - TIF_NEED_RESCHED)
|       |                                          ^~~~~~~~~~~~~~~~~~~~~
| ./include/linux/thread_info.h:70:24: note: in expansion of macro 'TIF_NEED_RESCHED_LAZY_OFFSET'
|    70 |         RESCHED_lazy = TIF_NEED_RESCHED_LAZY_OFFSET,
|       |                        ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
| make[2]: *** [scripts/Makefile.build:116: arch/arm64/kernel/asm-offsets.s] Error 1
| make[1]: *** [/home/mark/src/linux/Makefile:1202: prepare0] Error 2
| make: *** [Makefile:234: __sub-make] Error 2

Note that since arm64 doesn't use the generic entry code, that also requires
changes to arm64_preempt_schedule_irq() in arch/arm64/kernel/entry-common.c, to
handle TIF_NEED_RESCHED_LAZY.

>  - ftrace support for need-resched-lazy is incomplete

What exactly do we need for ftrace here?

Thanks,
Mark.
Peter Zijlstra Nov. 8, 2023, 4:49 p.m. UTC | #25
On Wed, Nov 08, 2023 at 11:22:27AM -0500, Steven Rostedt wrote:

> Peter, how can you say we can get rid of cond_resched() in NONE when you

Because that would fix none to actually be none. Who cares.

> > Look at my PoC: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
> 
> And I've been saying that many times already ;-)

Why should I look at random patches on the interweb to make sense of
these patches here?

That just underlines these here patches are not making sense.
Steven Rostedt Nov. 8, 2023, 5:18 p.m. UTC | #26
On Wed, 8 Nov 2023 17:49:16 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

> On Wed, Nov 08, 2023 at 11:22:27AM -0500, Steven Rostedt wrote:
> 
> > Peter, how can you say we can get rid of cond_resched() in NONE when you  
> 
> Because that would fix none to actually be none. Who cares.

Well, that would lead to regressions with PREEMPT_NONE and the watchdog
timer.

> 
> > > Look at my PoC: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/  
> > 
> > And I've been saying that many times already ;-)  
> 
> Why should I look at random patches on the interweb to make sense of
> these patches here?

I actually said it to others, this wasn't really supposed to be addressed
to you.

> 
> That just underlines these here patches are not making sense.

It's a complex series, and there's a lot of room for improvement. I'm happy
to help out here.

-- Steve
Ankur Arora Nov. 8, 2023, 8:26 p.m. UTC | #27
Thomas Gleixner <tglx@linutronix.de> writes:

> On Wed, Nov 08 2023 at 11:13, Peter Zijlstra wrote:
>> On Wed, Nov 08, 2023 at 02:04:02AM -0800, Ankur Arora wrote:
>> I'm not understanding, those should stay obviously.
>>
>> The current preempt_dynamic stuff has 5 toggles:
>>
>> /*
>>  * SC:cond_resched
>>  * SC:might_resched
>>  * SC:preempt_schedule
>>  * SC:preempt_schedule_notrace
>>  * SC:irqentry_exit_cond_resched
>>  *
>>  *
>>  * NONE:
>>  *   cond_resched               <- __cond_resched
>>  *   might_resched              <- RET0
>>  *   preempt_schedule           <- NOP
>>  *   preempt_schedule_notrace   <- NOP
>>  *   irqentry_exit_cond_resched <- NOP
>>  *
>>  * VOLUNTARY:
>>  *   cond_resched               <- __cond_resched
>>  *   might_resched              <- __cond_resched
>>  *   preempt_schedule           <- NOP
>>  *   preempt_schedule_notrace   <- NOP
>>  *   irqentry_exit_cond_resched <- NOP
>>  *
>>  * FULL:
>>  *   cond_resched               <- RET0
>>  *   might_resched              <- RET0
>>  *   preempt_schedule           <- preempt_schedule
>>  *   preempt_schedule_notrace   <- preempt_schedule_notrace
>>  *   irqentry_exit_cond_resched <- irqentry_exit_cond_resched
>>  */
>>
>> If you kill voluntary as we know it today, you can remove cond_resched
>> and might_resched, but the remaining 3 are still needed to switch
>> between NONE and FULL.
>
> No. The whole point of LAZY is to keep preempt_schedule(),
> preempt_schedule_notrace(), irqentry_exit_cond_resched() always enabled.
>
> Look at my PoC: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
>
> The idea is to always enable preempt count and keep _all_ preemption
> points enabled.
>
> For NONE/VOLUNTARY mode let the scheduler set TIF_NEED_RESCHED_LAZY
> instead of TIF_NEED_RESCHED. In full mode set TIF_NEED_RESCHED.
>
> Here is where the regular and the lazy flags are evaluated:
>
>                 Ret2user        Ret2kernel      PreemptCnt=0  need_resched()
>
> NEED_RESCHED       Y                Y               Y         Y
> LAZY_RESCHED       Y                N               N         Y
>
> The trick is that LAZY is not folded into preempt_count so a 1->0
> counter transition won't cause preempt_schedule() to be invoked because
> the topmost bit (NEED_RESCHED) is set.
>
> The scheduler can still decide to set TIF_NEED_RESCHED which will cause
> an immediate preemption at the next preemption point.
>
> This allows to force out a task which loops, e.g. in a massive copy or
> clear operation, as it did not reach a point where TIF_NEED_RESCHED_LAZY
> is evaluated after a time which is defined by the scheduler itself.
>
> For my PoC I did:
>
>     1) Set TIF_NEED_RESCHED_LAZY
>
>     2) Set TIF_NEED_RESCHED when the task did not react on
>        TIF_NEED_RESCHED_LAZY within a tick
>
> I know that's crude but it just works and obviously requires quite some
> refinement.
>
> So the way how you switch between preemption modes is to select when the
> scheduler sets TIF_NEED_RESCHED/TIF_NEED_RESCHED_LAZY. No static call
> switching at all.
>
> In full preemption mode it sets always TIF_NEED_RESCHED and otherwise it
> uses the LAZY bit first, grants some time and then gets out the hammer
> and sets TIF_NEED_RESCHED when the task did not reach a LAZY preemption
> point.
>
> Which means once the whole thing is in place then the whole
> PREEMPT_DYNAMIC along with NONE, VOLUNTARY, FULL can go away along with
> the cond_resched() hackery.
>
> So I think this series is backwards.
>
> It should add the LAZY muck with a Kconfig switch like I did in my PoC
> _first_. Once that is working and agreed on, the existing muck can be
> removed.

Yeah. I should have done it in the order in your PoC. Right now I'm
doing all of the stuff you describe above, but because there are far
too many structural changes, it's not clear to anybody what the code
is doing.

Okay, so for the next version let me limit the series to just the
scheduler changes which can be orthogonal to the old models (basically
a new scheduler model PREEMPT_AUTO).

Once that is agreed on, the other models can be removed (or expressed
in terms of PREEMPT_AUTO.)

--
ankur
Ankur Arora Nov. 8, 2023, 8:46 p.m. UTC | #28
Peter Zijlstra <peterz@infradead.org> writes:

> On Wed, Nov 08, 2023 at 11:22:27AM -0500, Steven Rostedt wrote:
>
>> Peter, how can you say we can get rid of cond_resched() in NONE when you
>
> Because that would fix none to actually be none. Who cares.
>
>> > Look at my PoC: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
>>
>> And I've been saying that many times already ;-)
>
> Why should I look at random patches on the interweb to make sense of
> these patches here?
>
> That just underlines these here patches are not making sense.

Yeah, I'm changing too many things structural things all together.

Let me redo this. This time limiting the changes to the scheduler
adding a preemption model which adds the lazy-bit and which can
behave in ways that are similar to preempt=none/full differently
by toggling the treatment of lazy-bit.

This keeps the current models as it is.

And, once that makes sense to people, then we can decide how best to
remove cond_resched() etc.

--
ankur
Ankur Arora Nov. 9, 2023, 12:34 a.m. UTC | #29
Mark Rutland <mark.rutland@arm.com> writes:

> On Tue, Nov 07, 2023 at 01:56:46PM -0800, Ankur Arora wrote:
>> What's broken:
>>  - ARCH_NO_PREEMPT (See patch-45 "preempt: ARCH_NO_PREEMPT only preempts
>>    lazily")
>>  - Non-x86 architectures. It's trivial to support other archs (only need
>>    to add TIF_NEED_RESCHED_LAZY) but wanted to hold off until I got some
>>    comments on the series.
>>    (From some testing on arm64, didn't find any surprises.)
>
> When you say "testing on arm64, didn't find any surprises", I assume you mean
> with an additional patch adding TIF_NEED_RESCHED_LAZY?

Yeah. And, handling that in the user exit path.

> Note that since arm64 doesn't use the generic entry code, that also requires
> changes to arm64_preempt_schedule_irq() in arch/arm64/kernel/entry-common.c, to
> handle TIF_NEED_RESCHED_LAZY.

So, the intent (which got muddied due to this overly large series)
was to delay handling TIF_NEED_RESCHED_LAZY until we are about to
return to user.

I think arm64_preempt_schedule_irq() should only handle TIF_NEED_RESCHED
and the _TIF_NEED_RESCHED_LAZY should be handled via _TIF_WORK_MASK
and do_notify_resume().

(The design is much clearer in Thomas' PoC:
https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/)

>>  - ftrace support for need-resched-lazy is incomplete
>
> What exactly do we need for ftrace here?

Only support for TIF_NEED_RESCHED_LAZY which should be complete.
That comment was based on a misreading of the code.


Thanks

--
ankur
Mark Rutland Nov. 9, 2023, 11 a.m. UTC | #30
On Wed, Nov 08, 2023 at 04:34:41PM -0800, Ankur Arora wrote:
> Mark Rutland <mark.rutland@arm.com> writes:
> 
> > On Tue, Nov 07, 2023 at 01:56:46PM -0800, Ankur Arora wrote:
> >> What's broken:
> >>  - ARCH_NO_PREEMPT (See patch-45 "preempt: ARCH_NO_PREEMPT only preempts
> >>    lazily")
> >>  - Non-x86 architectures. It's trivial to support other archs (only need
> >>    to add TIF_NEED_RESCHED_LAZY) but wanted to hold off until I got some
> >>    comments on the series.
> >>    (From some testing on arm64, didn't find any surprises.)
> >
> > When you say "testing on arm64, didn't find any surprises", I assume you mean
> > with an additional patch adding TIF_NEED_RESCHED_LAZY?
> 
> Yeah. And, handling that in the user exit path.
> 
> > Note that since arm64 doesn't use the generic entry code, that also requires
> > changes to arm64_preempt_schedule_irq() in arch/arm64/kernel/entry-common.c, to
> > handle TIF_NEED_RESCHED_LAZY.
> 
> So, the intent (which got muddied due to this overly large series)
> was to delay handling TIF_NEED_RESCHED_LAZY until we are about to
> return to user.

Ah, I missed that detail -- thanks for clarifying!

> I think arm64_preempt_schedule_irq() should only handle TIF_NEED_RESCHED
> and the _TIF_NEED_RESCHED_LAZY should be handled via _TIF_WORK_MASK
> and do_notify_resume().

Digging a bit more, I think that should still work.

One slight clarification: arm64_preempt_schedule_irq() doesn't look at
TIF_NEED_RESCHED today, as it relies on the scheduler IPI calling
preempt_fold_need_resched() to propogate TIF_NEED_RESCHED into
PREEMPT_NEED_RESCHED. That should still work since this series makes
preempt_fold_need_resched() check tif_need_resched(RESCHED_eager).

I was a bit cnofused because in the generic entry code,
irqentry_exit_cond_resched() explicitly checks for TIF_NEED_RESCHED, and I'm
not sure why it does that rather than relying on the scheduler IPI as above.

> (The design is much clearer in Thomas' PoC:
> https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/)
> 
> >>  - ftrace support for need-resched-lazy is incomplete
> >
> > What exactly do we need for ftrace here?
> 
> Only support for TIF_NEED_RESCHED_LAZY which should be complete.
> That comment was based on a misreading of the code.

Cool; thanks!

Mark.
Ankur Arora Nov. 9, 2023, 10:36 p.m. UTC | #31
Mark Rutland <mark.rutland@arm.com> writes:

> On Wed, Nov 08, 2023 at 04:34:41PM -0800, Ankur Arora wrote:
>> Mark Rutland <mark.rutland@arm.com> writes:
>>
>> > On Tue, Nov 07, 2023 at 01:56:46PM -0800, Ankur Arora wrote:
>> >> What's broken:
>> >>  - ARCH_NO_PREEMPT (See patch-45 "preempt: ARCH_NO_PREEMPT only preempts
>> >>    lazily")
>> >>  - Non-x86 architectures. It's trivial to support other archs (only need
>> >>    to add TIF_NEED_RESCHED_LAZY) but wanted to hold off until I got some
>> >>    comments on the series.
>> >>    (From some testing on arm64, didn't find any surprises.)
>> >
>> > When you say "testing on arm64, didn't find any surprises", I assume you mean
>> > with an additional patch adding TIF_NEED_RESCHED_LAZY?
>>
>> Yeah. And, handling that in the user exit path.
>>
>> > Note that since arm64 doesn't use the generic entry code, that also requires
>> > changes to arm64_preempt_schedule_irq() in arch/arm64/kernel/entry-common.c, to
>> > handle TIF_NEED_RESCHED_LAZY.
>>
>> So, the intent (which got muddied due to this overly large series)
>> was to delay handling TIF_NEED_RESCHED_LAZY until we are about to
>> return to user.
>
> Ah, I missed that detail -- thanks for clarifying!
>
>> I think arm64_preempt_schedule_irq() should only handle TIF_NEED_RESCHED
>> and the _TIF_NEED_RESCHED_LAZY should be handled via _TIF_WORK_MASK
>> and do_notify_resume().
>
> Digging a bit more, I think that should still work.
>
> One slight clarification: arm64_preempt_schedule_irq() doesn't look at
> TIF_NEED_RESCHED today, as it relies on the scheduler IPI calling
> preempt_fold_need_resched() to propogate TIF_NEED_RESCHED into
> PREEMPT_NEED_RESCHED. That should still work since this series makes
> preempt_fold_need_resched() check tif_need_resched(RESCHED_eager).
>
> I was a bit cnofused because in the generic entry code,
> irqentry_exit_cond_resched() explicitly checks for TIF_NEED_RESCHED, and I'm
> not sure why it does that rather than relying on the scheduler IPI as above.

Yeah I found that confusing as well. I suspect the reason is that not
all archs do the folding and we need the explicit check for those that
don't.


--
ankur