[0/3] drm/nouveau: Fix & improve nouveau_fence_done()

Message ID: 20250410092418.135258-2-phasta@kernel.org

Philipp Stanner April 10, 2025, 9:24 a.m. UTC
Contains two patches improving nouveau_fence_done(), and one addressing
an actual bug (race):

[   39.848463] WARNING: CPU: 21 PID: 1734 at drivers/gpu/drm/nouveau/nouveau_fence.c:509 nouveau_fence_no_signaling+0xac/0xd0 [nouveau]
[   39.848551] Modules linked in: snd_seq_dummy snd_hrtimer nf_conntrack_netbios_ns nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet
nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 rfkill ip_set nf_tables qrtr sunrpc snd_sof_pci_intel_tgl
snd_sof_pci_intel_cnl snd_sof_intel_hda_generic snd_sof_pci snd_sof_xtensa_dsp snd_sof_intel_hda_common snd_soc_hdac_hda snd_sof_intel_hda snd_sof snd_sof_utils snd_soc_acpi_intel_match
snd_soc_acpi snd_soc_acpi_intel_sdca_quirks snd_sof_intel_hda_mlink snd_soc_sdca snd_soc_avs snd_ctl_led snd_soc_hda_codec intel_rapl_msr snd_hda_codec_realtek
snd_hda_ext_core intel_rapl_common snd_hda_codec_generic snd_soc_core snd_hda_scodec_component intel_uncore_frequency intel_uncore_frequency_common snd_hda_codec_hdmi
intel_ifs snd_compress i10nm_edac skx_edac_common nfit snd_hda_intel snd_intel_dspcfg libnvdimm snd_hda_codec binfmt_misc snd_hwdep snd_hda_core snd_seq snd_seq_device dell_wmi
[   39.848575]  dell_pc x86_pkg_temp_thermal spi_nor platform_profile sparse_keymap intel_powerclamp dax_hmem snd_pcm cxl_acpi coretemp cxl_port iTCO_wdt mtd rapl intel_pmc_bxt
pmt_telemetry cxl_core dell_wmi_sysman pmt_class iTCO_vendor_support snd_timer isst_if_mmio vfat intel_cstate dell_smbios dcdbas fat dell_wmi_ddv dell_smm_hwmon
dell_wmi_descriptor firmware_attributes_class wmi_bmof intel_uncore einj pcspkr isst_if_mbox_pci atlantic snd isst_if_common intel_vsec e1000e macsec mei_me i2c_i801
spi_intel_pci soundcore i2c_smbus spi_intel mei joydev loop nfnetlink zram nouveau drm_ttm_helper ttm polyval_clmulni iaa_crypto gpu_sched polyval_generic rtsx_pci_sdmmc
ghash_clmulni_intel i2c_algo_bit mmc_core drm_gpuvm sha512_ssse3 nvme drm_exec drm_display_helper sha256_ssse3 idxd sha1_ssse3 cec nvme_core idxd_bus rtsx_pci nvme_auth
pinctrl_alderlake ip6_tables ip_tables fuse
[   39.848603] CPU: 21 UID: 42 PID: 1734 Comm: gnome-shell Tainted: G        W          6.14.0-rc4+ #11
[   39.848605] Tainted: [W]=WARN
[   39.848606] Hardware name: Dell Inc. Precision 7960 Tower/01G0M6, BIOS 2.7.0 12/17/2024
[   39.848607] RIP: 0010:nouveau_fence_no_signaling+0xac/0xd0 [nouveau]
[   39.848688] Code: db 74 17 48 8d 7b 38 b8 ff ff ff ff f0 0f c1 43 38 83 f8 01 74 29 85 c0 7e 17 31 c0 5b 5d c3 cc cc cc cc e8 76 b2 c5 f0 eb 96 <0f> 0b e9 67 ff ff ff be 03 00 00 00 e8 83 76 33 f1 31 c0 eb dd e8
[   39.848690] RSP: 0018:ff1cc1ffc5c039f0 EFLAGS: 00010046
[   39.848691] RAX: 0000000000000001 RBX: ff175a3b504da980 RCX: ff175a3b4801e008
[   39.848692] RDX: ff175a3b43e7bad0 RSI: ffffffffc09d3fda RDI: ff175a3b504da980
[   39.848693] RBP: ff175a3b504da9c0 R08: ffffffffc09e39df R09: 0000000000000001
[   39.848694] R10: 0000000000000001 R11: 0000000000000000 R12: ff175a3b6d97de00
[   39.848695] R13: 0000000000000246 R14: ff1cc1ffc5c03c60 R15: 0000000000000001
[   39.848696] FS:  00007fc5477846c0(0000) GS:ff175a5a50280000(0000) knlGS:0000000000000000
[   39.848698] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   39.848699] CR2: 000055cb7613d1a8 CR3: 000000012e5ce004 CR4: 0000000000f71ef0
[   39.848700] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   39.848701] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[   39.848702] PKRU: 55555554
[   39.848703] Call Trace:
[   39.848704]  <TASK>
[   39.848705]  ? nouveau_fence_no_signaling+0xac/0xd0 [nouveau]
[   39.848782]  ? __warn.cold+0x93/0xfa
[   39.848785]  ? nouveau_fence_no_signaling+0xac/0xd0 [nouveau]
[   39.848861]  ? report_bug+0xff/0x140
[   39.848863]  ? handle_bug+0x58/0x90
[   39.848865]  ? exc_invalid_op+0x17/0x70
[   39.848866]  ? asm_exc_invalid_op+0x1a/0x20
[   39.848870]  ? nouveau_fence_no_signaling+0xac/0xd0 [nouveau]
[   39.848943]  nouveau_fence_enable_signaling+0x32/0x80 [nouveau]
[   39.849016]  ? __pfx_nouveau_fence_cleanup_cb+0x10/0x10 [nouveau]
[   39.849088]  __dma_fence_enable_signaling+0x33/0xc0
[   39.849090]  dma_fence_add_callback+0x4b/0xd0
[   39.849093]  nouveau_fence_emit+0xa3/0x260 [nouveau]
[   39.849166]  nouveau_fence_new+0x7d/0xf0 [nouveau]
[   39.849242]  nouveau_gem_ioctl_pushbuf+0xe8f/0x1300 [nouveau]
[   39.849338]  ? __pfx_nouveau_gem_ioctl_pushbuf+0x10/0x10 [nouveau]
[   39.849431]  drm_ioctl_kernel+0xad/0x100
[   39.849433]  drm_ioctl+0x288/0x550
[   39.849435]  ? __pfx_nouveau_gem_ioctl_pushbuf+0x10/0x10 [nouveau]
[   39.849526]  nouveau_drm_ioctl+0x57/0xb0 [nouveau]
[   39.849620]  __x64_sys_ioctl+0x94/0xc0
[   39.849621]  do_syscall_64+0x82/0x160
[   39.849623]  ? drm_ioctl+0x2b7/0x550
[   39.849625]  ? __pfx_nouveau_gem_ioctl_pushbuf+0x10/0x10 [nouveau]
[   39.849719]  ? ktime_get_mono_fast_ns+0x38/0xd0
[   39.849721]  ? __pm_runtime_suspend+0x69/0xc0
[   39.849724]  ? syscall_exit_to_user_mode_prepare+0x15e/0x1a0
[   39.849726]  ? syscall_exit_to_user_mode+0x10/0x200
[   39.849729]  ? do_syscall_64+0x8e/0x160
[   39.849730]  ? exc_page_fault+0x7e/0x1a0
[   39.849733]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[   39.849735] RIP: 0033:0x7fc5576fe0ad
[   39.849736] Code: 04 25 28 00 00 00 48 89 45 c8 31 c0 48 8d 45 10 c7 45 b0 10 00 00 00 48 89 45 b8 48 8d 45 d0 48 89 45 c0 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1a 48 8b 45 c8 64 48 2b 04 25 28 00 00 00
[   39.849737] RSP: 002b:00007ffc002688a0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[   39.849739] RAX: ffffffffffffffda RBX: 000055cb74e316c0 RCX: 00007fc5576fe0ad
[   39.849740] RDX: 00007ffc00268960 RSI: 00000000c0406481 RDI: 000000000000000e
[   39.849741] RBP: 00007ffc002688f0 R08: 0000000000000000 R09: 000055cb74e35560
[   39.849742] R10: 0000000000000014 R11: 0000000000000246 R12: 00007ffc00268960
[   39.849744] R13: 00000000c0406481 R14: 000000000000000e R15: 000055cb74e3cd10
[   39.849746]  </TASK>
[   39.849746] ---[ end trace 0000000000000000 ]---
[   39.849776] ------------[ cut here ]------------


This is the first WARN_ON() in dma_fence_set_error(), called by
nouveau_fence_context_kill().

It's rare, but it is a bug, or rather the archetype of a race: as
Christian pointed out, nouveau_fence_update() will at some later point
remove the already-signaled fence (by signaling it again).


P.


Philipp Stanner (3):
  drm/nouveau: Prevent signaled fences in pending list
  drm/nouveau: Remove surplus if-branch
  drm/nouveau: Add helper to check base fence

 drivers/gpu/drm/nouveau/nouveau_fence.c | 32 ++++++++++++++-----------
 1 file changed, 18 insertions(+), 14 deletions(-)
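
For readers who want the race in miniature: below is a userspace sketch of
the situation the cover letter describes, not the driver code. All names
(model_fence, model_done_racy(), model_context_kill(), ...) are hypothetical
stand-ins. It assumes only what the thread states: a "done" check can mark a
fence signaled while it is still on the pending list, and a later context
kill sets an error on every pending fence, which trips the signaled-fence
check in the set-error path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these names exist in the kernel. */
struct model_fence {
	bool signaled;	/* stands in for the fence's "signaled" flag */
	bool pending;	/* true while the fence is on the pending list */
	int error;
};

static int model_warnings;	/* counts would-be WARN_ON() hits */

/* Modeled on the rule that an error must be set before signaling. */
static void model_set_error(struct model_fence *f, int error)
{
	assert(error < 0);	/* errors are negative errno-style values */
	if (f->signaled)
		model_warnings++;	/* the real code would WARN here */
	f->error = error;
}

/* Racy "done" check: marks the fence signaled but leaves it pending. */
static bool model_done_racy(struct model_fence *f, bool hw_finished)
{
	if (hw_finished)
		f->signaled = true;
	return f->signaled;
}

/* Context kill: set an error on every pending fence, then retire it. */
static void model_context_kill(struct model_fence *fences, size_t n, int error)
{
	for (size_t i = 0; i < n; i++) {
		if (!fences[i].pending)
			continue;
		model_set_error(&fences[i], error);
		fences[i].signaled = true;
		fences[i].pending = false;
	}
}
```

If model_done_racy() runs on a finished fence before model_context_kill(),
the warning counter goes up; if the fence was never marked signaled, the
error is set without a warning.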

Comments

Philipp Stanner April 10, 2025, 9:51 a.m. UTC | #1
On Thu, 2025-04-10 at 11:24 +0200, Philipp Stanner wrote:
> Contains two patches improving nouveau_fence_done(), and one
> addressing
> an actual bug (race):

Oops, that's the wrong calltrace. Here we go:

[ 85.791794] Call Trace:
[ 85.791796] <TASK>
[ 85.791797] ? nouveau_fence_context_kill (/home/imperator/linux/./include/linux/dma-fence.h:587 (discriminator 9) /home/imperator/linux/drivers/gpu/drm/nouveau/nouveau_fence.c:94 (discriminator 9)) nouveau
[ 85.791874] ? __warn.cold (/home/imperator/linux/kernel/panic.c:748)
[ 85.791878] ? nouveau_fence_context_kill (/home/imperator/linux/./include/linux/dma-fence.h:587 (discriminator 9) /home/imperator/linux/drivers/gpu/drm/nouveau/nouveau_fence.c:94 (discriminator 9)) nouveau
[ 85.791950] ? report_bug (/home/imperator/linux/lib/bug.c:180 /home/imperator/linux/lib/bug.c:219)
[ 85.791953] ? handle_bug (/home/imperator/linux/arch/x86/kernel/traps.c:260)
[ 85.791956] ? exc_invalid_op (/home/imperator/linux/arch/x86/kernel/traps.c:309 (discriminator 1))
[ 85.791957] ? asm_exc_invalid_op (/home/imperator/linux/./arch/x86/include/asm/idtentry.h:621)
[ 85.791960] ? nouveau_fence_context_kill (/home/imperator/linux/./include/linux/dma-fence.h:587 (discriminator 9) /home/imperator/linux/drivers/gpu/drm/nouveau/nouveau_fence.c:94 (discriminator 9)) nouveau
[ 85.792028] drm_sched_fini.cold (/home/imperator/linux/./include/trace/../../drivers/gpu/drm/scheduler/gpu_scheduler_trace.h:72 (discriminator 1)) gpu_sched
[ 85.792033] ? drm_sched_entity_kill.part.0 (/home/imperator/linux/drivers/gpu/drm/scheduler/sched_entity.c:243 (discriminator 2)) gpu_sched
[ 85.792037] nouveau_sched_destroy (/home/imperator/linux/drivers/gpu/drm/nouveau/nouveau_sched.c:509 /home/imperator/linux/drivers/gpu/drm/nouveau/nouveau_sched.c:518) nouveau
[ 85.792122] nouveau_abi16_chan_fini.isra.0 (/home/imperator/linux/drivers/gpu/drm/nouveau/nouveau_abi16.c:188) nouveau
[ 85.792191] nouveau_abi16_fini (/home/imperator/linux/drivers/gpu/drm/nouveau/nouveau_abi16.c:224 (discriminator 3)) nouveau
[ 85.792263] nouveau_drm_postclose (/home/imperator/linux/drivers/gpu/drm/nouveau/nouveau_drm.c:1240) nouveau
[ 85.792349] drm_file_free (/home/imperator/linux/drivers/gpu/drm/drm_file.c:255)
[ 85.792353] drm_release (/home/imperator/linux/./arch/x86/include/asm/atomic.h:67 (discriminator 1) /home/imperator/linux/./include/linux/atomic/atomic-arch-fallback.h:2278 (discriminator 1) /home/imperator/linux/./include/linux/atomic/atomic-instrumented.h:1384 (discriminator 1) /home/imperator/linux/drivers/gpu/drm/drm_file.c:428 (discriminator 1))
[ 85.792355] __fput (/home/imperator/linux/fs/file_table.c:464)
[ 85.792357] task_work_run (/home/imperator/linux/kernel/task_work.c:227)
[ 85.792360] do_exit (/home/imperator/linux/kernel/exit.c:939)
[ 85.792362] do_group_exit (/home/imperator/linux/kernel/exit.c:1069)
[ 85.792364] get_signal (/home/imperator/linux/kernel/signal.c:3036)
[ 85.792366] arch_do_signal_or_restart (/home/imperator/linux/./arch/x86/include/asm/syscall.h:38 /home/imperator/linux/arch/x86/kernel/signal.c:264 /home/imperator/linux/arch/x86/kernel/signal.c:339)
[ 85.792369] syscall_exit_to_user_mode (/home/imperator/linux/kernel/entry/common.c:113 /home/imperator/linux/./include/linux/entry-common.h:329 /home/imperator/linux/kernel/entry/common.c:207 /home/imperator/linux/kernel/entry/common.c:218)
[ 85.792372] do_syscall_64 (/home/imperator/linux/./arch/x86/include/asm/cpufeature.h:172 /home/imperator/linux/arch/x86/entry/common.c:98)
[ 85.792373] ? syscall_exit_to_user_mode_prepare (/home/imperator/linux/./include/linux/audit.h:357 /home/imperator/linux/kernel/entry/common.c:166 /home/imperator/linux/kernel/entry/common.c:200)
[ 85.792376] ? syscall_exit_to_user_mode (/home/imperator/linux/./arch/x86/include/asm/paravirt.h:686 /home/imperator/linux/./include/linux/entry-common.h:232 /home/imperator/linux/kernel/entry/common.c:206 /home/imperator/linux/kernel/entry/common.c:218)
[ 85.792377] ? do_syscall_64 (/home/imperator/linux/./arch/x86/include/asm/cpufeature.h:172 /home/imperator/linux/arch/x86/entry/common.c:98)
[ 85.792378] entry_SYSCALL_64_after_hwframe (/home/imperator/linux/arch/x86/entry/entry_64.S:130)
[ 85.792381] RIP: 0033:0x7ff950b6af70
[ 85.792383] Code: Unable to access opcode bytes at 0x7ff950b6af46.
objdump: '/tmp/tmp.sfPRl5k2te.o': No such file
Code starting with the faulting instruction
===========================================
[ 85.792383] RSP: 002b:00007ff93cdfb6f0 EFLAGS: 00000293 ORIG_RAX: 000000000000010f
[ 85.792385] RAX: fffffffffffffdfe RBX: 000055d386d61870 RCX: 00007ff950b6af70
[ 85.792386] RDX: 0000000000000000 RSI: 0000000000000001 RDI: 00007ff928000b90
[ 85.792387] RBP: 00007ff93cdfb740 R08: 0000000000000008 R09: 0000000000000000
[ 85.792388] R10: 0000000000000000 R11: 0000000000000293 R12: 0000000000000001
[ 85.792388] R13: 0000000000000000 R14: 0000000000000000 R15: 00007ff951b10b40
[ 85.792390] </TASK>
[ 85.792391] ---[ end trace 0000000000000000 ]---

By the way, for reference:
I did check whether nouveau_fence_signal() could be incorporated into
nouveau_fence_update() and nouveau_fence_done(). This, however, would
then cause a race with the list_del() in nouveau_fence_no_signaling(),
WARNing because of the list poison.

So the "solution" space is:
 * A cleanup callback on the dma_fence.
 * Keeping the current race or
 * replacing it with another race with another function.
 * Just preventing nouveau_fence_done() from signaling fences other
   than through nouveau_fence_update/signal

The latter clearly seems like the cleanest solution to me. The
alternative would be a work-intensive rework of all the design mistakes
in nouveau_fence.c


P.
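
As an illustration of the last option in the list above, here is a small
userspace sketch, assuming a simple sequence-number model of the hardware.
It is not the patch itself, and every name in it (sketch_fence,
sketch_signal(), sketch_update(), sketch_done(), ...) is hypothetical. The
point it tries to show: if one helper is the only place that sets the
signaled flag, and that helper also unlinks the fence, then a "done" check
that only goes through the update path cannot leave a signaled fence on the
pending list.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model types; these are not the kernel's structures. */
struct sketch_fence {
	struct sketch_fence *prev, *next;	/* links in the pending list */
	bool signaled;
	unsigned int seqno;
};

struct sketch_ctx {
	struct sketch_fence head;	/* sentinel of the circular pending list */
	unsigned int hw_seqno;		/* last seqno the modeled GPU completed */
};

static void sketch_ctx_init(struct sketch_ctx *ctx)
{
	ctx->head.prev = ctx->head.next = &ctx->head;
	ctx->hw_seqno = 0;
}

/* Append a new, unsignaled fence to the tail of the pending list. */
static void sketch_emit(struct sketch_ctx *ctx, struct sketch_fence *f,
			unsigned int seqno)
{
	f->signaled = false;
	f->seqno = seqno;
	f->prev = ctx->head.prev;
	f->next = &ctx->head;
	ctx->head.prev->next = f;
	ctx->head.prev = f;
}

/* The only place that sets the signaled flag; it also unlinks the fence. */
static void sketch_signal(struct sketch_fence *f)
{
	assert(!f->signaled);	/* signaling twice would be a bug in this model */
	f->signaled = true;
	f->prev->next = f->next;
	f->next->prev = f->prev;
	f->prev = f->next = f;
}

/* Signal every pending fence the modeled GPU has passed (no wraparound). */
static void sketch_update(struct sketch_ctx *ctx)
{
	struct sketch_fence *f = ctx->head.next;

	while (f != &ctx->head && f->seqno <= ctx->hw_seqno) {
		struct sketch_fence *next = f->next;

		sketch_signal(f);
		f = next;
	}
}

/* "Done" check that never signals on its own; it defers to sketch_update(). */
static bool sketch_done(struct sketch_ctx *ctx, struct sketch_fence *f)
{
	if (!f->signaled)
		sketch_update(ctx);
	return f->signaled;
}

/* Number of fences still on the pending list. */
static size_t sketch_pending_count(const struct sketch_ctx *ctx)
{
	size_t n = 0;

	for (const struct sketch_fence *f = ctx->head.next; f != &ctx->head; f = f->next)
		n++;
	return n;
}
```

In this model a fence reported as done by sketch_done() is never counted by
sketch_pending_count(), so a later walk over the pending list never meets an
already-signaled fence. Locking is left out entirely; the real driver would
need the pending list protected by its lock.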

Christian König April 10, 2025, 12:18 p.m. UTC | #2
Am 10.04.25 um 11:51 schrieb Philipp Stanner:
> On Thu, 2025-04-10 at 11:24 +0200, Philipp Stanner wrote:
>> Contains two patches improving nouveau_fence_done(), and one
>> addressing
>> an actual bug (race):
> Oops, that's the wrong calltrace. Here we go:
>

I think I understand the problem now as well, but that backtrace is completely mangled in the mail.

It would be nice if you could send that out again.

Thanks,
Christian.

>
> By the way, for reference:
> I did try whether it could be done to have nouveau_fence_signal()
> incorporated into nouveau_fence_update() and nouveau_fence_done().
> This, however, would then cause a race with the list_del() in
> nouveau_fence_no_signaling(), WARNing because of the list poison.
>
> So the "solution" space is:
>  * A cleanup callback on the dma_fence.
>  * Keeping the current race or
>  * replacing it with another race with another function.
>  * Just preventing nouveau_fence_done() from signaling fences other
>    than through nouveau_fence_update/signal
>
> The later seems clearly like the cleanest solution to me. Alternative
> would be a work-intensive rework of all the misdesigns broken in
> nouveau_fence.c
>
>
> P.
>
>> [   39.848463] WARNING: CPU: 21 PID: 1734 at
>> drivers/gpu/drm/nouveau/nouveau_fence.c:509
>> nouveau_fence_no_signaling+0xac/0xd0 [nouveau]
>> [   39.848551] Modules linked in: snd_seq_dummy snd_hrtimer
>> nf_conntrack_netbios_ns nf_conntrack_broadcast nft_fib_inet
>> nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_ine
>> t nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat
>> nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 rfkill ip_set
>> nf_tables qrtr sunrpc snd_sof_pci_intel_
>> tgl snd_sof_pci_intel_cnl snd_sof_intel_hda_generic snd_sof_pci
>> snd_sof_xtensa_dsp snd_sof_intel_hda_common snd_soc_hdac_hda
>> snd_sof_intel_hda snd_sof snd_sof_utils snd
>> _soc_acpi_intel_match snd_soc_acpi snd_soc_acpi_intel_sdca_quirks
>> snd_sof_intel_hda_mlink snd_soc_sdca snd_soc_avs snd_ctl_led
>> snd_soc_hda_codec intel_rapl_msr snd_hda_
>> codec_realtek snd_hda_ext_core intel_rapl_common
>> snd_hda_codec_generic snd_soc_core snd_hda_scodec_component
>> intel_uncore_frequency intel_uncore_frequency_common snd_hd
>> a_codec_hdmi intel_ifs snd_compress i10nm_edac skx_edac_common nfit
>> snd_hda_intel snd_intel_dspcfg libnvdimm snd_hda_codec binfmt_misc
>> snd_hwdep snd_hda_core snd_seq sn
>> d_seq_device dell_wmi
>> [   39.848575]  dell_pc x86_pkg_temp_thermal spi_nor platform_profile
>> sparse_keymap intel_powerclamp dax_hmem snd_pcm cxl_acpi coretemp
>> cxl_port iTCO_wdt mtd rapl intel
>> _pmc_bxt pmt_telemetry cxl_core dell_wmi_sysman pmt_class
>> iTCO_vendor_support snd_timer isst_if_mmio vfat intel_cstate
>> dell_smbios dcdbas fat dell_wmi_ddv dell_smm_hwmo
>> n dell_wmi_descriptor firmware_attributes_class wmi_bmof intel_uncore
>> einj pcspkr isst_if_mbox_pci atlantic snd isst_if_common intel_vsec
>> e1000e macsec mei_me i2c_i801 
>> spi_intel_pci soundcore i2c_smbus spi_intel mei joydev loop nfnetlink
>> zram nouveau drm_ttm_helper ttm polyval_clmulni iaa_crypto gpu_sched
>> polyval_generic rtsx_pci_sdmm
>> c ghash_clmulni_intel i2c_algo_bit mmc_core drm_gpuvm sha512_ssse3
>> nvme drm_exec drm_display_helper sha256_ssse3 idxd sha1_ssse3 cec
>> nvme_core idxd_bus rtsx_pci nvme_au
>> th pinctrl_alderlake ip6_tables ip_tables fuse
>> [   39.848603] CPU: 21 UID: 42 PID: 1734 Comm: gnome-shell Tainted:
>> G        W          6.14.0-rc4+ #11
>> [   39.848605] Tainted: [W]=WARN
>> [   39.848606] Hardware name: Dell Inc. Precision 7960 Tower/01G0M6,
>> BIOS 2.7.0 12/17/2024
>> [   39.848607] RIP: 0010:nouveau_fence_no_signaling+0xac/0xd0
>> [nouveau]
>> [   39.848688] Code: db 74 17 48 8d 7b 38 b8 ff ff ff ff f0 0f c1 43
>> 38 83 f8 01 74 29 85 c0 7e 17 31 c0 5b 5d c3 cc cc cc cc e8 76 b2 c5
>> f0 eb 96 <0f> 0b e9 67 ff ff f
>> f be 03 00 00 00 e8 83 76 33 f1 31 c0 eb dd e8
>> [   39.848690] RSP: 0018:ff1cc1ffc5c039f0 EFLAGS: 00010046
>> [   39.848691] RAX: 0000000000000001 RBX: ff175a3b504da980 RCX:
>> ff175a3b4801e008
>> [   39.848692] RDX: ff175a3b43e7bad0 RSI: ffffffffc09d3fda RDI:
>> ff175a3b504da980
>> [   39.848693] RBP: ff175a3b504da9c0 R08: ffffffffc09e39df R09:
>> 0000000000000001
>> [   39.848694] R10: 0000000000000001 R11: 0000000000000000 R12:
>> ff175a3b6d97de00
>> [   39.848695] R13: 0000000000000246 R14: ff1cc1ffc5c03c60 R15:
>> 0000000000000001
>> [   39.848696] FS:  00007fc5477846c0(0000) GS:ff175a5a50280000(0000)
>> knlGS:0000000000000000
>> [   39.848698] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [   39.848699] CR2: 000055cb7613d1a8 CR3: 000000012e5ce004 CR4:
>> 0000000000f71ef0
>> [   39.848700] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
>> 0000000000000000
>> [   39.848701] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7:
>> 0000000000000400
>> [   39.848702] PKRU: 55555554
>> [   39.848703] Call Trace:
>> [   39.848704]  <TASK>
>> [   39.848705]  ? nouveau_fence_no_signaling+0xac/0xd0 [nouveau]
>> [   39.848782]  ? __warn.cold+0x93/0xfa
>> [   39.848785]  ? nouveau_fence_no_signaling+0xac/0xd0 [nouveau]
>> [   39.848861]  ? report_bug+0xff/0x140
>> [   39.848863]  ? handle_bug+0x58/0x90
>> [   39.848865]  ? exc_invalid_op+0x17/0x70
>> [   39.848866]  ? asm_exc_invalid_op+0x1a/0x20
>> [   39.848870]  ? nouveau_fence_no_signaling+0xac/0xd0 [nouveau]
>> [   39.848943]  nouveau_fence_enable_signaling+0x32/0x80 [nouveau]
>> [   39.849016]  ? __pfx_nouveau_fence_cleanup_cb+0x10/0x10 [nouveau]
>> [   39.849088]  __dma_fence_enable_signaling+0x33/0xc0
>> [   39.849090]  dma_fence_add_callback+0x4b/0xd0
>> [   39.849093]  nouveau_fence_emit+0xa3/0x260 [nouveau]
>> [   39.849166]  nouveau_fence_new+0x7d/0xf0 [nouveau]
>> [   39.849242]  nouveau_gem_ioctl_pushbuf+0xe8f/0x1300 [nouveau]
>> [   39.849338]  ? __pfx_nouveau_gem_ioctl_pushbuf+0x10/0x10 [nouveau]
>> [   39.849431]  drm_ioctl_kernel+0xad/0x100
>> [   39.849433]  drm_ioctl+0x288/0x550
>> [   39.849435]  ? __pfx_nouveau_gem_ioctl_pushbuf+0x10/0x10 [nouveau]
>> [   39.849526]  nouveau_drm_ioctl+0x57/0xb0 [nouveau]
>> [   39.849620]  __x64_sys_ioctl+0x94/0xc0
>> [   39.849621]  do_syscall_64+0x82/0x160
>> [   39.849623]  ? drm_ioctl+0x2b7/0x550
>> [   39.849625]  ? __pfx_nouveau_gem_ioctl_pushbuf+0x10/0x10 [nouveau]
>> [   39.849719]  ? ktime_get_mono_fast_ns+0x38/0xd0
>> [   39.849721]  ? __pm_runtime_suspend+0x69/0xc0
>> [   39.849724]  ? syscall_exit_to_user_mode_prepare+0x15e/0x1a0
>> [   39.849726]  ? syscall_exit_to_user_mode+0x10/0x200
>> [   39.849729]  ? do_syscall_64+0x8e/0x160
>> [   39.849730]  ? exc_page_fault+0x7e/0x1a0
>> [   39.849733]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
>> [   39.849735] RIP: 0033:0x7fc5576fe0ad
>> [   39.849736] Code: 04 25 28 00 00 00 48 89 45 c8 31 c0 48 8d 45 10
>> c7 45 b0 10 00 00 00 48 89 45 b8 48 8d 45 d0 48 89 45 c0 b8 10 00 00
>> 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1a 48 8b 45 c8 64 48 2b 04 25 28
>> 00 00 00
>> [   39.849737] RSP: 002b:00007ffc002688a0 EFLAGS: 00000246 ORIG_RAX:
>> 0000000000000010
>> [   39.849739] RAX: ffffffffffffffda RBX: 000055cb74e316c0 RCX:
>> 00007fc5576fe0ad
>> [   39.849740] RDX: 00007ffc00268960 RSI: 00000000c0406481 RDI:
>> 000000000000000e
>> [   39.849741] RBP: 00007ffc002688f0 R08: 0000000000000000 R09:
>> 000055cb74e35560
>> [   39.849742] R10: 0000000000000014 R11: 0000000000000246 R12:
>> 00007ffc00268960
>> [   39.849744] R13: 00000000c0406481 R14: 000000000000000e R15:
>> 000055cb74e3cd10
>> [   39.849746]  </TASK>
>> [   39.849746] ---[ end trace 0000000000000000 ]---
>> [   39.849776] ------------[ cut here ]------------
>>
>>
>> This is the first WARN_ON() in dma_fence_set_error(), called by
>> nouveau_fence_context_kill().
>>
>> It's rare, but it is a bug, or rather: the archetype of a race, since
>> (as Christian pointed out) nouveau_fence_update() later at some point
>> will remove the signaled fence (by signaling it again).
>>
>>
>> P.
>>
>>
>> Philipp Stanner (3):
>>   drm/nouveau: Prevent signaled fences in pending list
>>   drm/nouveau: Remove surplus if-branch
>>   drm/nouveau: Add helper to check base fence
>>
>>  drivers/gpu/drm/nouveau/nouveau_fence.c | 32 ++++++++++++++-----------
>>  1 file changed, 18 insertions(+), 14 deletions(-)
>>
Philipp Stanner April 10, 2025, 1:18 p.m. UTC | #3
On Thu, 2025-04-10 at 14:18 +0200, Christian König wrote:
> Am 10.04.25 um 11:51 schrieb Philipp Stanner:
> > On Thu, 2025-04-10 at 11:24 +0200, Philipp Stanner wrote:
> > > Contains two patches improving nouveau_fence_done(), and one
> > > addressing
> > > an actual bug (race):
> > Oops, that's the wrong calltrace. Here we go:
> > 
> > [ 85.791794] Call Trace: [ 85.791796] <TASK> [ 85.791797] ?
> > nouveau_fence_context_kill
> > (/home/imperator/linux/./include/linux/dma-fence.h:587
> > (discriminator 9)
> > /home/imperator/linux/drivers/gpu/drm/nouveau/nouveau_fence.c:94
> > (discriminator 9)) nouveau [ 85.791874] ? __warn.cold
> > (/home/imperator/linux/kernel/panic.c:748) [ 85.791878] ?
> > nouveau_fence_context_kill
> > (/home/imperator/linux/./include/linux/dma-fence.h:587
> > (discriminator 9)
> > /home/imperator/linux/drivers/gpu/drm/nouveau/nouveau_fence.c:94
> > (discriminator 9)) nouveau [ 85.791950] ? report_bug
> > (/home/imperator/linux/lib/bug.c:180
> > /home/imperator/linux/lib/bug.c:219) [ 85.791953] ? handle_bug
> > (/home/imperator/linux/arch/x86/kernel/traps.c:260) [ 85.791956] ?
> > exc_invalid_op (/home/imperator/linux/arch/x86/kernel/traps.c:309
> > (discriminator 1)) [ 85.791957] ? asm_exc_invalid_op
> > (/home/imperator/linux/./arch/x86/include/asm/idtentry.h:621) [
> > 85.791960] ? nouveau_fence_context_kill
> > (/home/imperator/linux/./include/linux/dma-fence.h:587
> > (discriminator 9)
> > /home/imperator/linux/drivers/gpu/drm/nouveau/nouveau_fence.c:94
> > (discriminator 9)) nouveau [ 85.792028] drm_sched_fini.cold
> > (/home/imperator/linux/./include/trace/../../drivers/gpu/drm/schedu
> > ler/gpu_scheduler_trace.h:72 (discriminator 1)) gpu_sched [
> > 85.792033] ? drm_sched_entity_kill.part.0
> > (/home/imperator/linux/drivers/gpu/drm/scheduler/sched_entity.c:243
> > (discriminator 2)) gpu_sched [ 85.792037] nouveau_sched_destroy
> > (/home/imperator/linux/drivers/gpu/drm/nouveau/nouveau_sched.c:509
> > /home/imperator/linux/drivers/gpu/drm/nouveau/nouveau_sched.c:518)
> > nouveau [ 85.792122] nouveau_abi16_chan_fini.isra.0
> > (/home/imperator/linux/drivers/gpu/drm/nouveau/nouveau_abi16.c:188)
> > nouveau [ 85.792191] nouveau_abi16_fini
> > (/home/imperator/linux/drivers/gpu/drm/nouveau/nouveau_abi16.c:224
> > (discriminator 3)) nouveau [ 85.792263] nouveau_drm_postclose
> > (/home/imperator/linux/drivers/gpu/drm/nouveau/nouveau_drm.c:1240)
> > nouveau [ 85.792349] drm_file_free
> > (/home/imperator/linux/drivers/gpu/drm/drm_file.c:255) [ 85.792353]
> > drm_release
> > (/home/imperator/linux/./arch/x86/include/asm/atomic.h:67
> > (discriminator 1)
> > /home/imperator/linux/./include/linux/atomic/atomic-arch-
> > fallback.h:2278 (discriminator 1)
> > /home/imperator/linux/./include/linux/atomic/atomic-
> > instrumented.h:1384 (discriminator 1)
> > /home/imperator/linux/drivers/gpu/drm/drm_file.c:428 (discriminator
> > 1)) [ 85.792355] __fput (/home/imperator/linux/fs/file_table.c:464)
> > [ 85.792357] task_work_run
> > (/home/imperator/linux/kernel/task_work.c:227) [ 85.792360] do_exit
> > (/home/imperator/linux/kernel/exit.c:939) [ 85.792362]
> > do_group_exit (/home/imperator/linux/kernel/exit.c:1069) [
> > 85.792364] get_signal (/home/imperator/linux/kernel/signal.c:3036)
> > [ 85.792366] arch_do_signal_or_restart
> > (/home/imperator/linux/./arch/x86/include/asm/syscall.h:38
> > /home/imperator/linux/arch/x86/kernel/signal.c:264
> > /home/imperator/linux/arch/x86/kernel/signal.c:339) [ 85.792369]
> > syscall_exit_to_user_mode
> > (/home/imperator/linux/kernel/entry/common.c:113
> > /home/imperator/linux/./include/linux/entry-common.h:329
> > /home/imperator/linux/kernel/entry/common.c:207
> > /home/imperator/linux/kernel/entry/common.c:218) [ 85.792372]
> > do_syscall_64
> > (/home/imperator/linux/./arch/x86/include/asm/cpufeature.h:172
> > /home/imperator/linux/arch/x86/entry/common.c:98) [ 85.792373] ?
> > syscall_exit_to_user_mode_prepare
> > (/home/imperator/linux/./include/linux/audit.h:357
> > /home/imperator/linux/kernel/entry/common.c:166
> > /home/imperator/linux/kernel/entry/common.c:200) [ 85.792376] ?
> > syscall_exit_to_user_mode
> > (/home/imperator/linux/./arch/x86/include/asm/paravirt.h:686
> > /home/imperator/linux/./include/linux/entry-common.h:232
> > /home/imperator/linux/kernel/entry/common.c:206
> > /home/imperator/linux/kernel/entry/common.c:218) [ 85.792377] ?
> > do_syscall_64
> > (/home/imperator/linux/./arch/x86/include/asm/cpufeature.h:172
> > /home/imperator/linux/arch/x86/entry/common.c:98) [ 85.792378]
> > entry_SYSCALL_64_after_hwframe
> > (/home/imperator/linux/arch/x86/entry/entry_64.S:130) [ 85.792381]
> > RIP: 0033:0x7ff950b6af70 [ 85.792383] Code: Unable to access opcode
> > bytes at 0x7ff950b6af46. objdump: '/tmp/tmp.sfPRl5k2te.o': No such
> > file Code starting with the faulting instruction
> > =========================================== [ 85.792383] RSP:
> > 002b:00007ff93cdfb6f0 EFLAGS: 00000293 ORIG_RAX: 000000000000010f [
> > 85.792385] RAX: fffffffffffffdfe RBX: 000055d386d61870 RCX:
> > 00007ff950b6af70 [ 85.792386] RDX: 0000000000000000 RSI:
> > 0000000000000001 RDI: 00007ff928000b90 [ 85.792387] RBP:
> > 00007ff93cdfb740 R08: 0000000000000008 R09: 0000000000000000 [
> > 85.792388] R10: 0000000000000000 R11: 0000000000000293 R12:
> > 0000000000000001 [ 85.792388] R13: 0000000000000000 R14:
> > 0000000000000000 R15: 00007ff951b10b40 [ 85.792390] </TASK> [
> > 85.792391] ---[ end trace 0000000000000000 ]---
> 
> I think I understand the problem now as well, but that backtrace is
> completely mangled in the mail.
> 
> It would be nice if you could send that out again.


I really need to install Mutt soon..

Let's try it this way:
https://paste.debian.net/1368679/

P.

> 
> Thanks,
> Christian.
> 
> > 
> > By the way, for reference:
> > I did try to incorporate nouveau_fence_signal() into
> > nouveau_fence_update() and nouveau_fence_done(). This, however,
> > would then cause a race with the list_del() in
> > nouveau_fence_no_signaling(), WARNing because of the list poison.
> > 
> > So the "solution" space is:
> >  * A cleanup callback on the dma_fence.
> >  * Keeping the current race.
> >  * Replacing it with another race in another function.
> >  * Just preventing nouveau_fence_done() from signaling fences other
> >    than through nouveau_fence_update/signal
> > 
> > The last option clearly seems like the cleanest solution to me. The
> > alternative would be a work-intensive rework of all the design
> > flaws in nouveau_fence.c.
> > 
> > 
> > P.
> > 
> > > [   39.848463] WARNING: CPU: 21 PID: 1734 at
> > > drivers/gpu/drm/nouveau/nouveau_fence.c:509
> > > nouveau_fence_no_signaling+0xac/0xd0 [nouveau]
> > > [   39.848551] Modules linked in: snd_seq_dummy snd_hrtimer
> > > nf_conntrack_netbios_ns nf_conntrack_broadcast nft_fib_inet
> > > nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_ine
> > > t nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat
> > > nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 rfkill ip_set
> > > nf_tables qrtr sunrpc snd_sof_pci_intel_
> > > tgl snd_sof_pci_intel_cnl snd_sof_intel_hda_generic snd_sof_pci
> > > snd_sof_xtensa_dsp snd_sof_intel_hda_common snd_soc_hdac_hda
> > > snd_sof_intel_hda snd_sof snd_sof_utils snd
> > > _soc_acpi_intel_match snd_soc_acpi snd_soc_acpi_intel_sdca_quirks
> > > snd_sof_intel_hda_mlink snd_soc_sdca snd_soc_avs snd_ctl_led
> > > snd_soc_hda_codec intel_rapl_msr snd_hda_
> > > codec_realtek snd_hda_ext_core intel_rapl_common
> > > snd_hda_codec_generic snd_soc_core snd_hda_scodec_component
> > > intel_uncore_frequency intel_uncore_frequency_common snd_hd
> > > a_codec_hdmi intel_ifs snd_compress i10nm_edac skx_edac_common
> > > nfit
> > > snd_hda_intel snd_intel_dspcfg libnvdimm snd_hda_codec
> > > binfmt_misc
> > > snd_hwdep snd_hda_core snd_seq sn
> > > d_seq_device dell_wmi
> > > [   39.848575]  dell_pc x86_pkg_temp_thermal spi_nor
> > > platform_profile
> > > sparse_keymap intel_powerclamp dax_hmem snd_pcm cxl_acpi coretemp
> > > cxl_port iTCO_wdt mtd rapl intel
> > > _pmc_bxt pmt_telemetry cxl_core dell_wmi_sysman pmt_class
> > > iTCO_vendor_support snd_timer isst_if_mmio vfat intel_cstate
> > > dell_smbios dcdbas fat dell_wmi_ddv dell_smm_hwmo
> > > n dell_wmi_descriptor firmware_attributes_class wmi_bmof
> > > intel_uncore
> > > einj pcspkr isst_if_mbox_pci atlantic snd isst_if_common
> > > intel_vsec
> > > e1000e macsec mei_me i2c_i801 
> > > spi_intel_pci soundcore i2c_smbus spi_intel mei joydev loop
> > > nfnetlink
> > > zram nouveau drm_ttm_helper ttm polyval_clmulni iaa_crypto
> > > gpu_sched
> > > polyval_generic rtsx_pci_sdmm
> > > c ghash_clmulni_intel i2c_algo_bit mmc_core drm_gpuvm
> > > sha512_ssse3
> > > nvme drm_exec drm_display_helper sha256_ssse3 idxd sha1_ssse3 cec
> > > nvme_core idxd_bus rtsx_pci nvme_au
> > > th pinctrl_alderlake ip6_tables ip_tables fuse
> > > [   39.848603] CPU: 21 UID: 42 PID: 1734 Comm: gnome-shell
> > > Tainted:
> > > G        W          6.14.0-rc4+ #11
> > > [   39.848605] Tainted: [W]=WARN
> > > [   39.848606] Hardware name: Dell Inc. Precision 7960
> > > Tower/01G0M6,
> > > BIOS 2.7.0 12/17/2024
> > > [   39.848607] RIP: 0010:nouveau_fence_no_signaling+0xac/0xd0
> > > [nouveau]
> > > [   39.848688] Code: db 74 17 48 8d 7b 38 b8 ff ff ff ff f0 0f c1
> > > 43
> > > 38 83 f8 01 74 29 85 c0 7e 17 31 c0 5b 5d c3 cc cc cc cc e8 76 b2
> > > c5
> > > f0 eb 96 <0f> 0b e9 67 ff ff f
> > > f be 03 00 00 00 e8 83 76 33 f1 31 c0 eb dd e8
> > > [   39.848690] RSP: 0018:ff1cc1ffc5c039f0 EFLAGS: 00010046
> > > [   39.848691] RAX: 0000000000000001 RBX: ff175a3b504da980 RCX:
> > > ff175a3b4801e008
> > > [   39.848692] RDX: ff175a3b43e7bad0 RSI: ffffffffc09d3fda RDI:
> > > ff175a3b504da980
> > > [   39.848693] RBP: ff175a3b504da9c0 R08: ffffffffc09e39df R09:
> > > 0000000000000001
> > > [   39.848694] R10: 0000000000000001 R11: 0000000000000000 R12:
> > > ff175a3b6d97de00
> > > [   39.848695] R13: 0000000000000246 R14: ff1cc1ffc5c03c60 R15:
> > > 0000000000000001
> > > [   39.848696] FS:  00007fc5477846c0(0000)
> > > GS:ff175a5a50280000(0000)
> > > knlGS:0000000000000000
> > > [   39.848698] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > [   39.848699] CR2: 000055cb7613d1a8 CR3: 000000012e5ce004 CR4:
> > > 0000000000f71ef0
> > > [   39.848700] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> > > 0000000000000000
> > > [   39.848701] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7:
> > > 0000000000000400
> > > [   39.848702] PKRU: 55555554
> > > [   39.848703] Call Trace:
> > > [   39.848704]  <TASK>
> > > [   39.848705]  ? nouveau_fence_no_signaling+0xac/0xd0 [nouveau]
> > > [   39.848782]  ? __warn.cold+0x93/0xfa
> > > [   39.848785]  ? nouveau_fence_no_signaling+0xac/0xd0 [nouveau]
> > > [   39.848861]  ? report_bug+0xff/0x140
> > > [   39.848863]  ? handle_bug+0x58/0x90
> > > [   39.848865]  ? exc_invalid_op+0x17/0x70
> > > [   39.848866]  ? asm_exc_invalid_op+0x1a/0x20
> > > [   39.848870]  ? nouveau_fence_no_signaling+0xac/0xd0 [nouveau]
> > > [   39.848943]  nouveau_fence_enable_signaling+0x32/0x80
> > > [nouveau]
> > > [   39.849016]  ? __pfx_nouveau_fence_cleanup_cb+0x10/0x10
> > > [nouveau]
> > > [   39.849088]  __dma_fence_enable_signaling+0x33/0xc0
> > > [   39.849090]  dma_fence_add_callback+0x4b/0xd0
> > > [   39.849093]  nouveau_fence_emit+0xa3/0x260 [nouveau]
> > > [   39.849166]  nouveau_fence_new+0x7d/0xf0 [nouveau]
> > > [   39.849242]  nouveau_gem_ioctl_pushbuf+0xe8f/0x1300 [nouveau]
> > > [   39.849338]  ? __pfx_nouveau_gem_ioctl_pushbuf+0x10/0x10
> > > [nouveau]
> > > [   39.849431]  drm_ioctl_kernel+0xad/0x100
> > > [   39.849433]  drm_ioctl+0x288/0x550
> > > [   39.849435]  ? __pfx_nouveau_gem_ioctl_pushbuf+0x10/0x10
> > > [nouveau]
> > > [   39.849526]  nouveau_drm_ioctl+0x57/0xb0 [nouveau]
> > > [   39.849620]  __x64_sys_ioctl+0x94/0xc0
> > > [   39.849621]  do_syscall_64+0x82/0x160
> > > [   39.849623]  ? drm_ioctl+0x2b7/0x550
> > > [   39.849625]  ? __pfx_nouveau_gem_ioctl_pushbuf+0x10/0x10
> > > [nouveau]
> > > [   39.849719]  ? ktime_get_mono_fast_ns+0x38/0xd0
> > > [   39.849721]  ? __pm_runtime_suspend+0x69/0xc0
> > > [   39.849724]  ? syscall_exit_to_user_mode_prepare+0x15e/0x1a0
> > > [   39.849726]  ? syscall_exit_to_user_mode+0x10/0x200
> > > [   39.849729]  ? do_syscall_64+0x8e/0x160
> > > [   39.849730]  ? exc_page_fault+0x7e/0x1a0
> > > [   39.849733]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > > [   39.849735] RIP: 0033:0x7fc5576fe0ad
> > > [   39.849736] Code: 04 25 28 00 00 00 48 89 45 c8 31 c0 48 8d 45
> > > 10
> > > c7 45 b0 10 00 00 00 48 89 45 b8 48 8d 45 d0 48 89 45 c0 b8 10 00
> > > 00
> > > 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1a 48 8b 45 c8 64 48 2b 04 25
> > > 28
> > > 00 00 00
> > > [   39.849737] RSP: 002b:00007ffc002688a0 EFLAGS: 00000246
> > > ORIG_RAX:
> > > 0000000000000010
> > > [   39.849739] RAX: ffffffffffffffda RBX: 000055cb74e316c0 RCX:
> > > 00007fc5576fe0ad
> > > [   39.849740] RDX: 00007ffc00268960 RSI: 00000000c0406481 RDI:
> > > 000000000000000e
> > > [   39.849741] RBP: 00007ffc002688f0 R08: 0000000000000000 R09:
> > > 000055cb74e35560
> > > [   39.849742] R10: 0000000000000014 R11: 0000000000000246 R12:
> > > 00007ffc00268960
> > > [   39.849744] R13: 00000000c0406481 R14: 000000000000000e R15:
> > > 000055cb74e3cd10
> > > [   39.849746]  </TASK>
> > > [   39.849746] ---[ end trace 0000000000000000 ]---
> > > [   39.849776] ------------[ cut here ]------------
> > > 
> > > 
> > > This is the first WARN_ON() in dma_fence_set_error(), called by
> > > nouveau_fence_context_kill().
> > > 
> > > It's rare, but it is a bug, or rather: the archetype of a race,
> > > since
> > > (as Christian pointed out) nouveau_fence_update() later at some
> > > point
> > > will remove the signaled fence (by signaling it again).
> > > 
> > > 
> > > P.
> > > 
> > > 
> > > Philipp Stanner (3):
> > >   drm/nouveau: Prevent signaled fences in pending list
> > >   drm/nouveau: Remove surplus if-branch
> > >   drm/nouveau: Add helper to check base fence
> > > 
> > >  drivers/gpu/drm/nouveau/nouveau_fence.c | 32 ++++++++++++++-----------
> > >  1 file changed, 18 insertions(+), 14 deletions(-)
> > > 
>
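
For readers following along, the race discussed above can be modeled in
plain userspace C. This is only a sketch under loud assumptions: every
name in it (toy_fence, toy_chan, toy_emit, toy_done_racy, toy_done_fixed,
toy_set_error, toy_context_kill, toy_run) is hypothetical and none of it
is nouveau or dma_fence code. It only illustrates why a fence that is
already signaled but still on the pending list trips a warning once the
kill path sets an error on it, and why signaling and unlinking together
avoids that.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_PENDING 8
#define TOY_ERR (-1)	/* arbitrary negative error code for the sketch */

/* Toy stand-in for a fence; NOT the kernel's struct dma_fence. */
struct toy_fence {
	bool signaled;
	int error;
};

/* Toy stand-in for a channel with its list of pending fences. */
struct toy_chan {
	struct toy_fence *pending[TOY_MAX_PENDING];
	size_t npending;
	int warnings;	/* counts what would be WARN_ON() hits */
};

static void toy_emit(struct toy_chan *c, struct toy_fence *f)
{
	assert(c->npending < TOY_MAX_PENDING);
	f->signaled = false;
	f->error = 0;
	c->pending[c->npending++] = f;
}

/* Remove a fence from the pending list (order is not preserved). */
static void toy_unlink(struct toy_chan *c, struct toy_fence *f)
{
	for (size_t i = 0; i < c->npending; i++) {
		if (c->pending[i] == f) {
			c->pending[i] = c->pending[--c->npending];
			return;
		}
	}
}

/* Racy variant: signals the fence but leaves it on the pending list. */
static void toy_done_racy(struct toy_fence *f)
{
	f->signaled = true;
}

/* Fixed variant: signaling and unlinking always happen together. */
static void toy_done_fixed(struct toy_chan *c, struct toy_fence *f)
{
	f->signaled = true;
	toy_unlink(c, f);
}

/* Loosely mirrors the idea of the check: an error on a signaled fence warns. */
static void toy_set_error(struct toy_chan *c, struct toy_fence *f, int error)
{
	if (f->signaled)
		c->warnings++;
	f->error = error;
}

/* Models the kill path: set an error on each pending fence, then signal it. */
static void toy_context_kill(struct toy_chan *c, int error)
{
	while (c->npending > 0) {
		struct toy_fence *f = c->pending[0];

		toy_set_error(c, f, error);
		f->signaled = true;
		toy_unlink(c, f);
	}
}

/* Returns the warning count when one fence completes before the kill. */
int toy_run(bool racy)
{
	struct toy_chan c = { .npending = 0, .warnings = 0 };
	struct toy_fence a, b;

	toy_emit(&c, &a);
	toy_emit(&c, &b);

	if (racy)
		toy_done_racy(&a);
	else
		toy_done_fixed(&c, &a);

	toy_context_kill(&c, TOY_ERR);
	return c.warnings;
}
```

Calling toy_run(true) takes the path where the signaled fence stays on
the pending list, so the kill path warns once. toy_run(false) unlinks the
fence at signal time and the kill path never sees it. The real driver
additionally has locking and reference counting, which this model
deliberately leaves out.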