Message ID | 20250410092418.135258-2-phasta@kernel.org (mailing list archive) |
---|---|
Headers | show |
Series | drm/nouveau: Fix & improve nouveau_fence_done() | expand |
On Thu, 2025-04-10 at 11:24 +0200, Philipp Stanner wrote: > Contains two patches improving nouveau_fence_done(), and one > addressing > an actual bug (race): Oops, that's the wrong calltrace. Here we go: [ 85.791794] Call Trace: [ 85.791796] <TASK> [ 85.791797] ? nouveau_fence_context_kill (/home/imperator/linux/./include/linux/dma-fence.h:587 (discriminator 9) /home/imperator/linux/drivers/gpu/drm/nouveau/nouveau_fence.c:94 (discriminator 9)) nouveau [ 85.791874] ? __warn.cold (/home/imperator/linux/kernel/panic.c:748) [ 85.791878] ? nouveau_fence_context_kill (/home/imperator/linux/./include/linux/dma-fence.h:587 (discriminator 9) /home/imperator/linux/drivers/gpu/drm/nouveau/nouveau_fence.c:94 (discriminator 9)) nouveau [ 85.791950] ? report_bug (/home/imperator/linux/lib/bug.c:180 /home/imperator/linux/lib/bug.c:219) [ 85.791953] ? handle_bug (/home/imperator/linux/arch/x86/kernel/traps.c:260) [ 85.791956] ? exc_invalid_op (/home/imperator/linux/arch/x86/kernel/traps.c:309 (discriminator 1)) [ 85.791957] ? asm_exc_invalid_op (/home/imperator/linux/./arch/x86/include/asm/idtentry.h:621) [ 85.791960] ? nouveau_fence_context_kill (/home/imperator/linux/./include/linux/dma-fence.h:587 (discriminator 9) /home/imperator/linux/drivers/gpu/drm/nouveau/nouveau_fence.c:94 (discriminator 9)) nouveau [ 85.792028] drm_sched_fini.cold (/home/imperator/linux/./include/trace/../../drivers/gpu/drm/scheduler/gpu_scheduler_trace.h:72 (discriminator 1)) gpu_sched [ 85.792033] ? drm_sched_entity_kill.part.0 (/home/imperator/linux/drivers/gpu/drm/scheduler/sched_entity.c:243 (discriminator 2)) gpu_sched [ 85.792037] nouveau_sched_destroy (/home/imperator/linux/drivers/gpu/drm/nouveau/nouveau_sched.c:509 /home/imperator/linux/drivers/gpu/drm/nouveau/nouveau_sched.c:518) nouveau [ 85.792122] nouveau_abi16_chan_fini.isra.0 (/home/imperator/linux/drivers/gpu/drm/nouveau/nouveau_abi16.c:188) nouveau [ 85.792191] nouveau_abi16_fini (/home/imperator/linux/drivers/gpu/drm/nouveau/nouveau_abi16.c:224 (discriminator 3)) nouveau [ 85.792263] nouveau_drm_postclose (/home/imperator/linux/drivers/gpu/drm/nouveau/nouveau_drm.c:1240) nouveau [ 85.792349] drm_file_free (/home/imperator/linux/drivers/gpu/drm/drm_file.c:255) [ 85.792353] drm_release (/home/imperator/linux/./arch/x86/include/asm/atomic.h:67 (discriminator 1) /home/imperator/linux/./include/linux/atomic/atomic-arch-fallback.h:2278 (discriminator 1) /home/imperator/linux/./include/linux/atomic/atomic-instrumented.h:1384 (discriminator 1) /home/imperator/linux/drivers/gpu/drm/drm_file.c:428 (discriminator 1)) [ 85.792355] __fput (/home/imperator/linux/fs/file_table.c:464) [ 85.792357] task_work_run (/home/imperator/linux/kernel/task_work.c:227) [ 85.792360] do_exit (/home/imperator/linux/kernel/exit.c:939) [ 85.792362] do_group_exit (/home/imperator/linux/kernel/exit.c:1069) [ 85.792364] get_signal (/home/imperator/linux/kernel/signal.c:3036) [ 85.792366] arch_do_signal_or_restart (/home/imperator/linux/./arch/x86/include/asm/syscall.h:38 /home/imperator/linux/arch/x86/kernel/signal.c:264 /home/imperator/linux/arch/x86/kernel/signal.c:339) [ 85.792369] syscall_exit_to_user_mode (/home/imperator/linux/kernel/entry/common.c:113 /home/imperator/linux/./include/linux/entry-common.h:329 /home/imperator/linux/kernel/entry/common.c:207 /home/imperator/linux/kernel/entry/common.c:218) [ 85.792372] do_syscall_64 (/home/imperator/linux/./arch/x86/include/asm/cpufeature.h:172 /home/imperator/linux/arch/x86/entry/common.c:98) [ 85.792373] ? syscall_exit_to_user_mode_prepare (/home/imperator/linux/./include/linux/audit.h:357 /home/imperator/linux/kernel/entry/common.c:166 /home/imperator/linux/kernel/entry/common.c:200) [ 85.792376] ? syscall_exit_to_user_mode (/home/imperator/linux/./arch/x86/include/asm/paravirt.h:686 /home/imperator/linux/./include/linux/entry-common.h:232 /home/imperator/linux/kernel/entry/common.c:206 /home/imperator/linux/kernel/entry/common.c:218) [ 85.792377] ? do_syscall_64 (/home/imperator/linux/./arch/x86/include/asm/cpufeature.h:172 /home/imperator/linux/arch/x86/entry/common.c:98) [ 85.792378] entry_SYSCALL_64_after_hwframe (/home/imperator/linux/arch/x86/entry/entry_64.S:130) [ 85.792381] RIP: 0033:0x7ff950b6af70 [ 85.792383] Code: Unable to access opcode bytes at 0x7ff950b6af46. objdump: '/tmp/tmp.sfPRl5k2te.o': No such file Code starting with the faulting instruction =========================================== [ 85.792383] RSP: 002b:00007ff93cdfb6f0 EFLAGS: 00000293 ORIG_RAX: 000000000000010f [ 85.792385] RAX: fffffffffffffdfe RBX: 000055d386d61870 RCX: 00007ff950b6af70 [ 85.792386] RDX: 0000000000000000 RSI: 0000000000000001 RDI: 00007ff928000b90 [ 85.792387] RBP: 00007ff93cdfb740 R08: 0000000000000008 R09: 0000000000000000 [ 85.792388] R10: 0000000000000000 R11: 0000000000000293 R12: 0000000000000001 [ 85.792388] R13: 0000000000000000 R14: 0000000000000000 R15: 00007ff951b10b40 [ 85.792390] </TASK> [ 85.792391] ---[ end trace 0000000000000000 ]--- By the way, for reference: I did try whether it could be done to have nouveau_fence_signal() incorporated into nouveau_fence_update() and nouveau_fence_done(). This, however, would then cause a race with the list_del() in nouveau_fence_no_signaling(), WARNing because of the list poison. So the "solution" space is: * A cleanup callback on the dma_fence. * Keeping the current race or * replacing it with another race with another function. * Just preventing nouveau_fence_done() from signaling fences other than through nouveau_fence_update/signal The later seems clearly like the cleanest solution to me. Alternative would be a work-intensive rework of all the misdesigns broken in nouveau_fence.c P. > > [ 39.848463] WARNING: CPU: 21 PID: 1734 at > drivers/gpu/drm/nouveau/nouveau_fence.c:509 > nouveau_fence_no_signaling+0xac/0xd0 [nouveau] > [ 39.848551] Modules linked in: snd_seq_dummy snd_hrtimer > nf_conntrack_netbios_ns nf_conntrack_broadcast nft_fib_inet > nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_ine > t nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat > nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 rfkill ip_set > nf_tables qrtr sunrpc snd_sof_pci_intel_ > tgl snd_sof_pci_intel_cnl snd_sof_intel_hda_generic snd_sof_pci > snd_sof_xtensa_dsp snd_sof_intel_hda_common snd_soc_hdac_hda > snd_sof_intel_hda snd_sof snd_sof_utils snd > _soc_acpi_intel_match snd_soc_acpi snd_soc_acpi_intel_sdca_quirks > snd_sof_intel_hda_mlink snd_soc_sdca snd_soc_avs snd_ctl_led > snd_soc_hda_codec intel_rapl_msr snd_hda_ > codec_realtek snd_hda_ext_core intel_rapl_common > snd_hda_codec_generic snd_soc_core snd_hda_scodec_component > intel_uncore_frequency intel_uncore_frequency_common snd_hd > a_codec_hdmi intel_ifs snd_compress i10nm_edac skx_edac_common nfit > snd_hda_intel snd_intel_dspcfg libnvdimm snd_hda_codec binfmt_misc > snd_hwdep snd_hda_core snd_seq sn > d_seq_device dell_wmi > [ 39.848575] dell_pc x86_pkg_temp_thermal spi_nor platform_profile > sparse_keymap intel_powerclamp dax_hmem snd_pcm cxl_acpi coretemp > cxl_port iTCO_wdt mtd rapl intel > _pmc_bxt pmt_telemetry cxl_core dell_wmi_sysman pmt_class > iTCO_vendor_support snd_timer isst_if_mmio vfat intel_cstate > dell_smbios dcdbas fat dell_wmi_ddv dell_smm_hwmo > n dell_wmi_descriptor firmware_attributes_class wmi_bmof intel_uncore > einj pcspkr isst_if_mbox_pci atlantic snd isst_if_common intel_vsec > e1000e macsec mei_me i2c_i801 > spi_intel_pci soundcore i2c_smbus spi_intel mei joydev loop nfnetlink > zram nouveau drm_ttm_helper ttm polyval_clmulni iaa_crypto gpu_sched > polyval_generic rtsx_pci_sdmm > c ghash_clmulni_intel i2c_algo_bit mmc_core drm_gpuvm sha512_ssse3 > nvme drm_exec drm_display_helper sha256_ssse3 idxd sha1_ssse3 cec > nvme_core idxd_bus rtsx_pci nvme_au > th pinctrl_alderlake ip6_tables ip_tables fuse > [ 39.848603] CPU: 21 UID: 42 PID: 1734 Comm: gnome-shell Tainted: > G W 6.14.0-rc4+ #11 > [ 39.848605] Tainted: [W]=WARN > [ 39.848606] Hardware name: Dell Inc. Precision 7960 Tower/01G0M6, > BIOS 2.7.0 12/17/2024 > [ 39.848607] RIP: 0010:nouveau_fence_no_signaling+0xac/0xd0 > [nouveau] > [ 39.848688] Code: db 74 17 48 8d 7b 38 b8 ff ff ff ff f0 0f c1 43 > 38 83 f8 01 74 29 85 c0 7e 17 31 c0 5b 5d c3 cc cc cc cc e8 76 b2 c5 > f0 eb 96 <0f> 0b e9 67 ff ff f > f be 03 00 00 00 e8 83 76 33 f1 31 c0 eb dd e8 > [ 39.848690] RSP: 0018:ff1cc1ffc5c039f0 EFLAGS: 00010046 > [ 39.848691] RAX: 0000000000000001 RBX: ff175a3b504da980 RCX: > ff175a3b4801e008 > [ 39.848692] RDX: ff175a3b43e7bad0 RSI: ffffffffc09d3fda RDI: > ff175a3b504da980 > [ 39.848693] RBP: ff175a3b504da9c0 R08: ffffffffc09e39df R09: > 0000000000000001 > [ 39.848694] R10: 0000000000000001 R11: 0000000000000000 R12: > ff175a3b6d97de00 > [ 39.848695] R13: 0000000000000246 R14: ff1cc1ffc5c03c60 R15: > 0000000000000001 > [ 39.848696] FS: 00007fc5477846c0(0000) GS:ff175a5a50280000(0000) > knlGS:0000000000000000 > [ 39.848698] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [ 39.848699] CR2: 000055cb7613d1a8 CR3: 000000012e5ce004 CR4: > 0000000000f71ef0 > [ 39.848700] DR0: 0000000000000000 DR1: 0000000000000000 DR2: > 0000000000000000 > [ 39.848701] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: > 0000000000000400 > [ 39.848702] PKRU: 55555554 > [ 39.848703] Call Trace: > [ 39.848704] <TASK> > [ 39.848705] ? nouveau_fence_no_signaling+0xac/0xd0 [nouveau] > [ 39.848782] ? __warn.cold+0x93/0xfa > [ 39.848785] ? nouveau_fence_no_signaling+0xac/0xd0 [nouveau] > [ 39.848861] ? report_bug+0xff/0x140 > [ 39.848863] ? handle_bug+0x58/0x90 > [ 39.848865] ? exc_invalid_op+0x17/0x70 > [ 39.848866] ? asm_exc_invalid_op+0x1a/0x20 > [ 39.848870] ? nouveau_fence_no_signaling+0xac/0xd0 [nouveau] > [ 39.848943] nouveau_fence_enable_signaling+0x32/0x80 [nouveau] > [ 39.849016] ? __pfx_nouveau_fence_cleanup_cb+0x10/0x10 [nouveau] > [ 39.849088] __dma_fence_enable_signaling+0x33/0xc0 > [ 39.849090] dma_fence_add_callback+0x4b/0xd0 > [ 39.849093] nouveau_fence_emit+0xa3/0x260 [nouveau] > [ 39.849166] nouveau_fence_new+0x7d/0xf0 [nouveau] > [ 39.849242] nouveau_gem_ioctl_pushbuf+0xe8f/0x1300 [nouveau] > [ 39.849338] ? __pfx_nouveau_gem_ioctl_pushbuf+0x10/0x10 [nouveau] > [ 39.849431] drm_ioctl_kernel+0xad/0x100 > [ 39.849433] drm_ioctl+0x288/0x550 > [ 39.849435] ? __pfx_nouveau_gem_ioctl_pushbuf+0x10/0x10 [nouveau] > [ 39.849526] nouveau_drm_ioctl+0x57/0xb0 [nouveau] > [ 39.849620] __x64_sys_ioctl+0x94/0xc0 > [ 39.849621] do_syscall_64+0x82/0x160 > [ 39.849623] ? drm_ioctl+0x2b7/0x550 > [ 39.849625] ? __pfx_nouveau_gem_ioctl_pushbuf+0x10/0x10 [nouveau] > [ 39.849719] ? ktime_get_mono_fast_ns+0x38/0xd0 > [ 39.849721] ? __pm_runtime_suspend+0x69/0xc0 > [ 39.849724] ? syscall_exit_to_user_mode_prepare+0x15e/0x1a0 > [ 39.849726] ? syscall_exit_to_user_mode+0x10/0x200 > [ 39.849729] ? do_syscall_64+0x8e/0x160 > [ 39.849730] ? exc_page_fault+0x7e/0x1a0 > [ 39.849733] entry_SYSCALL_64_after_hwframe+0x76/0x7e > [ 39.849735] RIP: 0033:0x7fc5576fe0ad > [ 39.849736] Code: 04 25 28 00 00 00 48 89 45 c8 31 c0 48 8d 45 10 > c7 45 b0 10 00 00 00 48 89 45 b8 48 8d 45 d0 48 89 45 c0 b8 10 00 00 > 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1a 48 8b 45 c8 64 48 2b 04 25 28 > 00 00 00 > [ 39.849737] RSP: 002b:00007ffc002688a0 EFLAGS: 00000246 ORIG_RAX: > 0000000000000010 > [ 39.849739] RAX: ffffffffffffffda RBX: 000055cb74e316c0 RCX: > 00007fc5576fe0ad > [ 39.849740] RDX: 00007ffc00268960 RSI: 00000000c0406481 RDI: > 000000000000000e > [ 39.849741] RBP: 00007ffc002688f0 R08: 0000000000000000 R09: > 000055cb74e35560 > [ 39.849742] R10: 0000000000000014 R11: 0000000000000246 R12: > 00007ffc00268960 > [ 39.849744] R13: 00000000c0406481 R14: 000000000000000e R15: > 000055cb74e3cd10 > [ 39.849746] </TASK> > [ 39.849746] ---[ end trace 0000000000000000 ]--- > [ 39.849776] ------------[ cut here ]------------ > > > This is the first WARN_ON() in dma_fence_set_error(), called by > nouveau_fence_context_kill(). > > It's rare, but it is a bug, or rather: the archetype of a race, since > (as Christian pointed out) nouveau_fence_update() later at some point > will remove the signaled fence (by signaling it again). > > > P. > > > Philipp Stanner (3): > drm/nouveau: Prevent signaled fences in pending list > drm/nouveau: Remove surplus if-branch > drm/nouveau: Add helper to check base fence > > drivers/gpu/drm/nouveau/nouveau_fence.c | 32 ++++++++++++++--------- > -- > 1 file changed, 18 insertions(+), 14 deletions(-) >
Am 10.04.25 um 11:51 schrieb Philipp Stanner: > On Thu, 2025-04-10 at 11:24 +0200, Philipp Stanner wrote: >> Contains two patches improving nouveau_fence_done(), and one >> addressing >> an actual bug (race): > Oops, that's the wrong calltrace. Here we go: > > [ 85.791794] Call Trace: [ 85.791796] <TASK> [ 85.791797] ? nouveau_fence_context_kill (/home/imperator/linux/./include/linux/dma-fence.h:587 (discriminator 9) /home/imperator/linux/drivers/gpu/drm/nouveau/nouveau_fence.c:94 (discriminator 9)) nouveau [ 85.791874] ? __warn.cold (/home/imperator/linux/kernel/panic.c:748) [ 85.791878] ? nouveau_fence_context_kill (/home/imperator/linux/./include/linux/dma-fence.h:587 (discriminator 9) /home/imperator/linux/drivers/gpu/drm/nouveau/nouveau_fence.c:94 (discriminator 9)) nouveau [ 85.791950] ? report_bug (/home/imperator/linux/lib/bug.c:180 /home/imperator/linux/lib/bug.c:219) [ 85.791953] ? handle_bug (/home/imperator/linux/arch/x86/kernel/traps.c:260) [ 85.791956] ? exc_invalid_op (/home/imperator/linux/arch/x86/kernel/traps.c:309 (discriminator 1)) [ 85.791957] ? asm_exc_invalid_op (/home/imperator/linux/./arch/x86/include/asm/idtentry.h:621) [ 85.791960] ? nouveau_fence_context_kill (/home/imperator/linux/./include/linux/dma-fence.h:587 (discriminator 9) /home/imperator/linux/drivers/gpu/drm/nouveau/nouveau_fence.c:94 (discriminator 9)) nouveau [ 85.792028] drm_sched_fini.cold (/home/imperator/linux/./include/trace/../../drivers/gpu/drm/scheduler/gpu_scheduler_trace.h:72 (discriminator 1)) gpu_sched [ 85.792033] ? drm_sched_entity_kill.part.0 (/home/imperator/linux/drivers/gpu/drm/scheduler/sched_entity.c:243 (discriminator 2)) gpu_sched [ 85.792037] nouveau_sched_destroy (/home/imperator/linux/drivers/gpu/drm/nouveau/nouveau_sched.c:509 /home/imperator/linux/drivers/gpu/drm/nouveau/nouveau_sched.c:518) nouveau [ 85.792122] nouveau_abi16_chan_fini.isra.0 (/home/imperator/linux/drivers/gpu/drm/nouveau/nouveau_abi16.c:188) nouveau [ 85.792191] nouveau_abi16_fini (/home/imperator/linux/drivers/gpu/drm/nouveau/nouveau_abi16.c:224 (discriminator 3)) nouveau [ 85.792263] nouveau_drm_postclose (/home/imperator/linux/drivers/gpu/drm/nouveau/nouveau_drm.c:1240) nouveau [ 85.792349] drm_file_free (/home/imperator/linux/drivers/gpu/drm/drm_file.c:255) [ 85.792353] drm_release (/home/imperator/linux/./arch/x86/include/asm/atomic.h:67 (discriminator 1) /home/imperator/linux/./include/linux/atomic/atomic-arch-fallback.h:2278 (discriminator 1) /home/imperator/linux/./include/linux/atomic/atomic-instrumented.h:1384 (discriminator 1) /home/imperator/linux/drivers/gpu/drm/drm_file.c:428 (discriminator 1)) [ 85.792355] __fput (/home/imperator/linux/fs/file_table.c:464) [ 85.792357] task_work_run (/home/imperator/linux/kernel/task_work.c:227) [ 85.792360] do_exit (/home/imperator/linux/kernel/exit.c:939) [ 85.792362] do_group_exit (/home/imperator/linux/kernel/exit.c:1069) [ 85.792364] get_signal (/home/imperator/linux/kernel/signal.c:3036) [ 85.792366] arch_do_signal_or_restart (/home/imperator/linux/./arch/x86/include/asm/syscall.h:38 /home/imperator/linux/arch/x86/kernel/signal.c:264 /home/imperator/linux/arch/x86/kernel/signal.c:339) [ 85.792369] syscall_exit_to_user_mode (/home/imperator/linux/kernel/entry/common.c:113 /home/imperator/linux/./include/linux/entry-common.h:329 /home/imperator/linux/kernel/entry/common.c:207 /home/imperator/linux/kernel/entry/common.c:218) [ 85.792372] do_syscall_64 (/home/imperator/linux/./arch/x86/include/asm/cpufeature.h:172 /home/imperator/linux/arch/x86/entry/common.c:98) [ 85.792373] ? syscall_exit_to_user_mode_prepare (/home/imperator/linux/./include/linux/audit.h:357 /home/imperator/linux/kernel/entry/common.c:166 /home/imperator/linux/kernel/entry/common.c:200) [ 85.792376] ? syscall_exit_to_user_mode (/home/imperator/linux/./arch/x86/include/asm/paravirt.h:686 /home/imperator/linux/./include/linux/entry-common.h:232 /home/imperator/linux/kernel/entry/common.c:206 /home/imperator/linux/kernel/entry/common.c:218) [ 85.792377] ? do_syscall_64 (/home/imperator/linux/./arch/x86/include/asm/cpufeature.h:172 /home/imperator/linux/arch/x86/entry/common.c:98) [ 85.792378] entry_SYSCALL_64_after_hwframe (/home/imperator/linux/arch/x86/entry/entry_64.S:130) [ 85.792381] RIP: 0033:0x7ff950b6af70 [ 85.792383] Code: Unable to access opcode bytes at 0x7ff950b6af46. objdump: '/tmp/tmp.sfPRl5k2te.o': No such file Code starting with the faulting instruction =========================================== [ 85.792383] RSP: 002b:00007ff93cdfb6f0 EFLAGS: 00000293 ORIG_RAX: 000000000000010f [ 85.792385] RAX: fffffffffffffdfe RBX: 000055d386d61870 RCX: 00007ff950b6af70 [ 85.792386] RDX: 0000000000000000 RSI: 0000000000000001 RDI: 00007ff928000b90 [ 85.792387] RBP: 00007ff93cdfb740 R08: 0000000000000008 R09: 0000000000000000 [ 85.792388] R10: 0000000000000000 R11: 0000000000000293 R12: 0000000000000001 [ 85.792388] R13: 0000000000000000 R14: 0000000000000000 R15: 00007ff951b10b40 [ 85.792390] </TASK> [ 85.792391] ---[ end trace 0000000000000000 ]--- I think I understand the problem now as well, but that backtrace is completely mangled in the mail. It would be nice if you could send that out again. Thanks, Christian. > > By the way, for reference: > I did try whether it could be done to have nouveau_fence_signal() > incorporated into nouveau_fence_update() and nouveau_fence_done(). > This, however, would then cause a race with the list_del() in > nouveau_fence_no_signaling(), WARNing because of the list poison. > > So the "solution" space is: > * A cleanup callback on the dma_fence. > * Keeping the current race or > * replacing it with another race with another function. > * Just preventing nouveau_fence_done() from signaling fences other > than through nouveau_fence_update/signal > > The later seems clearly like the cleanest solution to me. Alternative > would be a work-intensive rework of all the misdesigns broken in > nouveau_fence.c > > > P. > >> [ 39.848463] WARNING: CPU: 21 PID: 1734 at >> drivers/gpu/drm/nouveau/nouveau_fence.c:509 >> nouveau_fence_no_signaling+0xac/0xd0 [nouveau] >> [ 39.848551] Modules linked in: snd_seq_dummy snd_hrtimer >> nf_conntrack_netbios_ns nf_conntrack_broadcast nft_fib_inet >> nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_ine >> t nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat >> nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 rfkill ip_set >> nf_tables qrtr sunrpc snd_sof_pci_intel_ >> tgl snd_sof_pci_intel_cnl snd_sof_intel_hda_generic snd_sof_pci >> snd_sof_xtensa_dsp snd_sof_intel_hda_common snd_soc_hdac_hda >> snd_sof_intel_hda snd_sof snd_sof_utils snd >> _soc_acpi_intel_match snd_soc_acpi snd_soc_acpi_intel_sdca_quirks >> snd_sof_intel_hda_mlink snd_soc_sdca snd_soc_avs snd_ctl_led >> snd_soc_hda_codec intel_rapl_msr snd_hda_ >> codec_realtek snd_hda_ext_core intel_rapl_common >> snd_hda_codec_generic snd_soc_core snd_hda_scodec_component >> intel_uncore_frequency intel_uncore_frequency_common snd_hd >> a_codec_hdmi intel_ifs snd_compress i10nm_edac skx_edac_common nfit >> snd_hda_intel snd_intel_dspcfg libnvdimm snd_hda_codec binfmt_misc >> snd_hwdep snd_hda_core snd_seq sn >> d_seq_device dell_wmi >> [ 39.848575] dell_pc x86_pkg_temp_thermal spi_nor platform_profile >> sparse_keymap intel_powerclamp dax_hmem snd_pcm cxl_acpi coretemp >> cxl_port iTCO_wdt mtd rapl intel >> _pmc_bxt pmt_telemetry cxl_core dell_wmi_sysman pmt_class >> iTCO_vendor_support snd_timer isst_if_mmio vfat intel_cstate >> dell_smbios dcdbas fat dell_wmi_ddv dell_smm_hwmo >> n dell_wmi_descriptor firmware_attributes_class wmi_bmof intel_uncore >> einj pcspkr isst_if_mbox_pci atlantic snd isst_if_common intel_vsec >> e1000e macsec mei_me i2c_i801 >> spi_intel_pci soundcore i2c_smbus spi_intel mei joydev loop nfnetlink >> zram nouveau drm_ttm_helper ttm polyval_clmulni iaa_crypto gpu_sched >> polyval_generic rtsx_pci_sdmm >> c ghash_clmulni_intel i2c_algo_bit mmc_core drm_gpuvm sha512_ssse3 >> nvme drm_exec drm_display_helper sha256_ssse3 idxd sha1_ssse3 cec >> nvme_core idxd_bus rtsx_pci nvme_au >> th pinctrl_alderlake ip6_tables ip_tables fuse >> [ 39.848603] CPU: 21 UID: 42 PID: 1734 Comm: gnome-shell Tainted: >> G W 6.14.0-rc4+ #11 >> [ 39.848605] Tainted: [W]=WARN >> [ 39.848606] Hardware name: Dell Inc. Precision 7960 Tower/01G0M6, >> BIOS 2.7.0 12/17/2024 >> [ 39.848607] RIP: 0010:nouveau_fence_no_signaling+0xac/0xd0 >> [nouveau] >> [ 39.848688] Code: db 74 17 48 8d 7b 38 b8 ff ff ff ff f0 0f c1 43 >> 38 83 f8 01 74 29 85 c0 7e 17 31 c0 5b 5d c3 cc cc cc cc e8 76 b2 c5 >> f0 eb 96 <0f> 0b e9 67 ff ff f >> f be 03 00 00 00 e8 83 76 33 f1 31 c0 eb dd e8 >> [ 39.848690] RSP: 0018:ff1cc1ffc5c039f0 EFLAGS: 00010046 >> [ 39.848691] RAX: 0000000000000001 RBX: ff175a3b504da980 RCX: >> ff175a3b4801e008 >> [ 39.848692] RDX: ff175a3b43e7bad0 RSI: ffffffffc09d3fda RDI: >> ff175a3b504da980 >> [ 39.848693] RBP: ff175a3b504da9c0 R08: ffffffffc09e39df R09: >> 0000000000000001 >> [ 39.848694] R10: 0000000000000001 R11: 0000000000000000 R12: >> ff175a3b6d97de00 >> [ 39.848695] R13: 0000000000000246 R14: ff1cc1ffc5c03c60 R15: >> 0000000000000001 >> [ 39.848696] FS: 00007fc5477846c0(0000) GS:ff175a5a50280000(0000) >> knlGS:0000000000000000 >> [ 39.848698] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 >> [ 39.848699] CR2: 000055cb7613d1a8 CR3: 000000012e5ce004 CR4: >> 0000000000f71ef0 >> [ 39.848700] DR0: 0000000000000000 DR1: 0000000000000000 DR2: >> 0000000000000000 >> [ 39.848701] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: >> 0000000000000400 >> [ 39.848702] PKRU: 55555554 >> [ 39.848703] Call Trace: >> [ 39.848704] <TASK> >> [ 39.848705] ? nouveau_fence_no_signaling+0xac/0xd0 [nouveau] >> [ 39.848782] ? __warn.cold+0x93/0xfa >> [ 39.848785] ? nouveau_fence_no_signaling+0xac/0xd0 [nouveau] >> [ 39.848861] ? report_bug+0xff/0x140 >> [ 39.848863] ? handle_bug+0x58/0x90 >> [ 39.848865] ? exc_invalid_op+0x17/0x70 >> [ 39.848866] ? asm_exc_invalid_op+0x1a/0x20 >> [ 39.848870] ? nouveau_fence_no_signaling+0xac/0xd0 [nouveau] >> [ 39.848943] nouveau_fence_enable_signaling+0x32/0x80 [nouveau] >> [ 39.849016] ? __pfx_nouveau_fence_cleanup_cb+0x10/0x10 [nouveau] >> [ 39.849088] __dma_fence_enable_signaling+0x33/0xc0 >> [ 39.849090] dma_fence_add_callback+0x4b/0xd0 >> [ 39.849093] nouveau_fence_emit+0xa3/0x260 [nouveau] >> [ 39.849166] nouveau_fence_new+0x7d/0xf0 [nouveau] >> [ 39.849242] nouveau_gem_ioctl_pushbuf+0xe8f/0x1300 [nouveau] >> [ 39.849338] ? __pfx_nouveau_gem_ioctl_pushbuf+0x10/0x10 [nouveau] >> [ 39.849431] drm_ioctl_kernel+0xad/0x100 >> [ 39.849433] drm_ioctl+0x288/0x550 >> [ 39.849435] ? __pfx_nouveau_gem_ioctl_pushbuf+0x10/0x10 [nouveau] >> [ 39.849526] nouveau_drm_ioctl+0x57/0xb0 [nouveau] >> [ 39.849620] __x64_sys_ioctl+0x94/0xc0 >> [ 39.849621] do_syscall_64+0x82/0x160 >> [ 39.849623] ? drm_ioctl+0x2b7/0x550 >> [ 39.849625] ? __pfx_nouveau_gem_ioctl_pushbuf+0x10/0x10 [nouveau] >> [ 39.849719] ? ktime_get_mono_fast_ns+0x38/0xd0 >> [ 39.849721] ? __pm_runtime_suspend+0x69/0xc0 >> [ 39.849724] ? syscall_exit_to_user_mode_prepare+0x15e/0x1a0 >> [ 39.849726] ? syscall_exit_to_user_mode+0x10/0x200 >> [ 39.849729] ? do_syscall_64+0x8e/0x160 >> [ 39.849730] ? exc_page_fault+0x7e/0x1a0 >> [ 39.849733] entry_SYSCALL_64_after_hwframe+0x76/0x7e >> [ 39.849735] RIP: 0033:0x7fc5576fe0ad >> [ 39.849736] Code: 04 25 28 00 00 00 48 89 45 c8 31 c0 48 8d 45 10 >> c7 45 b0 10 00 00 00 48 89 45 b8 48 8d 45 d0 48 89 45 c0 b8 10 00 00 >> 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1a 48 8b 45 c8 64 48 2b 04 25 28 >> 00 00 00 >> [ 39.849737] RSP: 002b:00007ffc002688a0 EFLAGS: 00000246 ORIG_RAX: >> 0000000000000010 >> [ 39.849739] RAX: ffffffffffffffda RBX: 000055cb74e316c0 RCX: >> 00007fc5576fe0ad >> [ 39.849740] RDX: 00007ffc00268960 RSI: 00000000c0406481 RDI: >> 000000000000000e >> [ 39.849741] RBP: 00007ffc002688f0 R08: 0000000000000000 R09: >> 000055cb74e35560 >> [ 39.849742] R10: 0000000000000014 R11: 0000000000000246 R12: >> 00007ffc00268960 >> [ 39.849744] R13: 00000000c0406481 R14: 000000000000000e R15: >> 000055cb74e3cd10 >> [ 39.849746] </TASK> >> [ 39.849746] ---[ end trace 0000000000000000 ]--- >> [ 39.849776] ------------[ cut here ]------------ >> >> >> This is the first WARN_ON() in dma_fence_set_error(), called by >> nouveau_fence_context_kill(). >> >> It's rare, but it is a bug, or rather: the archetype of a race, since >> (as Christian pointed out) nouveau_fence_update() later at some point >> will remove the signaled fence (by signaling it again). >> >> >> P. >> >> >> Philipp Stanner (3): >> drm/nouveau: Prevent signaled fences in pending list >> drm/nouveau: Remove surplus if-branch >> drm/nouveau: Add helper to check base fence >> >> drivers/gpu/drm/nouveau/nouveau_fence.c | 32 ++++++++++++++--------- >> -- >> 1 file changed, 18 insertions(+), 14 deletions(-) >>
On Thu, 2025-04-10 at 14:18 +0200, Christian König wrote: > Am 10.04.25 um 11:51 schrieb Philipp Stanner: > > On Thu, 2025-04-10 at 11:24 +0200, Philipp Stanner wrote: > > > Contains two patches improving nouveau_fence_done(), and one > > > addressing > > > an actual bug (race): > > Oops, that's the wrong calltrace. Here we go: > > > > [ 85.791794] Call Trace: [ 85.791796] <TASK> [ 85.791797] ? > > nouveau_fence_context_kill > > (/home/imperator/linux/./include/linux/dma-fence.h:587 > > (discriminator 9) > > /home/imperator/linux/drivers/gpu/drm/nouveau/nouveau_fence.c:94 > > (discriminator 9)) nouveau [ 85.791874] ? __warn.cold > > (/home/imperator/linux/kernel/panic.c:748) [ 85.791878] ? > > nouveau_fence_context_kill > > (/home/imperator/linux/./include/linux/dma-fence.h:587 > > (discriminator 9) > > /home/imperator/linux/drivers/gpu/drm/nouveau/nouveau_fence.c:94 > > (discriminator 9)) nouveau [ 85.791950] ? report_bug > > (/home/imperator/linux/lib/bug.c:180 > > /home/imperator/linux/lib/bug.c:219) [ 85.791953] ? handle_bug > > (/home/imperator/linux/arch/x86/kernel/traps.c:260) [ 85.791956] ? > > exc_invalid_op (/home/imperator/linux/arch/x86/kernel/traps.c:309 > > (discriminator 1)) [ 85.791957] ? asm_exc_invalid_op > > (/home/imperator/linux/./arch/x86/include/asm/idtentry.h:621) [ > > 85.791960] ? nouveau_fence_context_kill > > (/home/imperator/linux/./include/linux/dma-fence.h:587 > > (discriminator 9) > > /home/imperator/linux/drivers/gpu/drm/nouveau/nouveau_fence.c:94 > > (discriminator 9)) nouveau [ 85.792028] drm_sched_fini.cold > > (/home/imperator/linux/./include/trace/../../drivers/gpu/drm/schedu > > ler/gpu_scheduler_trace.h:72 (discriminator 1)) gpu_sched [ > > 85.792033] ? drm_sched_entity_kill.part.0 > > (/home/imperator/linux/drivers/gpu/drm/scheduler/sched_entity.c:243 > > (discriminator 2)) gpu_sched [ 85.792037] nouveau_sched_destroy > > (/home/imperator/linux/drivers/gpu/drm/nouveau/nouveau_sched.c:509 > > /home/imperator/linux/drivers/gpu/drm/nouveau/nouveau_sched.c:518) > > nouveau [ 85.792122] nouveau_abi16_chan_fini.isra.0 > > (/home/imperator/linux/drivers/gpu/drm/nouveau/nouveau_abi16.c:188) > > nouveau [ 85.792191] nouveau_abi16_fini > > (/home/imperator/linux/drivers/gpu/drm/nouveau/nouveau_abi16.c:224 > > (discriminator 3)) nouveau [ 85.792263] nouveau_drm_postclose > > (/home/imperator/linux/drivers/gpu/drm/nouveau/nouveau_drm.c:1240) > > nouveau [ 85.792349] drm_file_free > > (/home/imperator/linux/drivers/gpu/drm/drm_file.c:255) [ 85.792353] > > drm_release > > (/home/imperator/linux/./arch/x86/include/asm/atomic.h:67 > > (discriminator 1) > > /home/imperator/linux/./include/linux/atomic/atomic-arch- > > fallback.h:2278 (discriminator 1) > > /home/imperator/linux/./include/linux/atomic/atomic- > > instrumented.h:1384 (discriminator 1) > > /home/imperator/linux/drivers/gpu/drm/drm_file.c:428 (discriminator > > 1)) [ 85.792355] __fput (/home/imperator/linux/fs/file_table.c:464) > > [ 85.792357] task_work_run > > (/home/imperator/linux/kernel/task_work.c:227) [ 85.792360] do_exit > > (/home/imperator/linux/kernel/exit.c:939) [ 85.792362] > > do_group_exit (/home/imperator/linux/kernel/exit.c:1069) [ > > 85.792364] get_signal (/home/imperator/linux/kernel/signal.c:3036) > > [ 85.792366] arch_do_signal_or_restart > > (/home/imperator/linux/./arch/x86/include/asm/syscall.h:38 > > /home/imperator/linux/arch/x86/kernel/signal.c:264 > > /home/imperator/linux/arch/x86/kernel/signal.c:339) [ 85.792369] > > syscall_exit_to_user_mode > > (/home/imperator/linux/kernel/entry/common.c:113 > > /home/imperator/linux/./include/linux/entry-common.h:329 > > /home/imperator/linux/kernel/entry/common.c:207 > > /home/imperator/linux/kernel/entry/common.c:218) [ 85.792372] > > do_syscall_64 > > (/home/imperator/linux/./arch/x86/include/asm/cpufeature.h:172 > > /home/imperator/linux/arch/x86/entry/common.c:98) [ 85.792373] ? > > syscall_exit_to_user_mode_prepare > > (/home/imperator/linux/./include/linux/audit.h:357 > > /home/imperator/linux/kernel/entry/common.c:166 > > /home/imperator/linux/kernel/entry/common.c:200) [ 85.792376] ? > > syscall_exit_to_user_mode > > (/home/imperator/linux/./arch/x86/include/asm/paravirt.h:686 > > /home/imperator/linux/./include/linux/entry-common.h:232 > > /home/imperator/linux/kernel/entry/common.c:206 > > /home/imperator/linux/kernel/entry/common.c:218) [ 85.792377] ? > > do_syscall_64 > > (/home/imperator/linux/./arch/x86/include/asm/cpufeature.h:172 > > /home/imperator/linux/arch/x86/entry/common.c:98) [ 85.792378] > > entry_SYSCALL_64_after_hwframe > > (/home/imperator/linux/arch/x86/entry/entry_64.S:130) [ 85.792381] > > RIP: 0033:0x7ff950b6af70 [ 85.792383] Code: Unable to access opcode > > bytes at 0x7ff950b6af46. objdump: '/tmp/tmp.sfPRl5k2te.o': No such > > file Code starting with the faulting instruction > > =========================================== [ 85.792383] RSP: > > 002b:00007ff93cdfb6f0 EFLAGS: 00000293 ORIG_RAX: 000000000000010f [ > > 85.792385] RAX: fffffffffffffdfe RBX: 000055d386d61870 RCX: > > 00007ff950b6af70 [ 85.792386] RDX: 0000000000000000 RSI: > > 0000000000000001 RDI: 00007ff928000b90 [ 85.792387] RBP: > > 00007ff93cdfb740 R08: 0000000000000008 R09: 0000000000000000 [ > > 85.792388] R10: 0000000000000000 R11: 0000000000000293 R12: > > 0000000000000001 [ 85.792388] R13: 0000000000000000 R14: > > 0000000000000000 R15: 00007ff951b10b40 [ 85.792390] </TASK> [ > > 85.792391] ---[ end trace 0000000000000000 ]--- > > I think I understand the problem now as well, but that backtrace is > completely mangled in the mail. > > It would be nice if you could send that out again. I really need to install Mutt soon.. Let's try it this way: https://paste.debian.net/1368679/ P. > > Thanks, > Christian. > > > > > By the way, for reference: > > I did try whether it could be done to have nouveau_fence_signal() > > incorporated into nouveau_fence_update() and nouveau_fence_done(). > > This, however, would then cause a race with the list_del() in > > nouveau_fence_no_signaling(), WARNing because of the list poison. > > > > So the "solution" space is: > > * A cleanup callback on the dma_fence. > > * Keeping the current race or > > * replacing it with another race with another function. > > * Just preventing nouveau_fence_done() from signaling fences other > > than through nouveau_fence_update/signal > > > > The later seems clearly like the cleanest solution to me. > > Alternative > > would be a work-intensive rework of all the misdesigns broken in > > nouveau_fence.c > > > > > > P. > > > > > [ 39.848463] WARNING: CPU: 21 PID: 1734 at > > > drivers/gpu/drm/nouveau/nouveau_fence.c:509 > > > nouveau_fence_no_signaling+0xac/0xd0 [nouveau] > > > [ 39.848551] Modules linked in: snd_seq_dummy snd_hrtimer > > > nf_conntrack_netbios_ns nf_conntrack_broadcast nft_fib_inet > > > nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_ine > > > t nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat > > > nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 rfkill ip_set > > > nf_tables qrtr sunrpc snd_sof_pci_intel_ > > > tgl snd_sof_pci_intel_cnl snd_sof_intel_hda_generic snd_sof_pci > > > snd_sof_xtensa_dsp snd_sof_intel_hda_common snd_soc_hdac_hda > > > snd_sof_intel_hda snd_sof snd_sof_utils snd > > > _soc_acpi_intel_match snd_soc_acpi snd_soc_acpi_intel_sdca_quirks > > > snd_sof_intel_hda_mlink snd_soc_sdca snd_soc_avs snd_ctl_led > > > snd_soc_hda_codec intel_rapl_msr snd_hda_ > > > codec_realtek snd_hda_ext_core intel_rapl_common > > > snd_hda_codec_generic snd_soc_core snd_hda_scodec_component > > > intel_uncore_frequency intel_uncore_frequency_common snd_hd > > > a_codec_hdmi intel_ifs snd_compress i10nm_edac skx_edac_common > > > nfit > > > snd_hda_intel snd_intel_dspcfg libnvdimm snd_hda_codec > > > binfmt_misc > > > snd_hwdep snd_hda_core snd_seq sn > > > d_seq_device dell_wmi > > > [ 39.848575] dell_pc x86_pkg_temp_thermal spi_nor > > > platform_profile > > > sparse_keymap intel_powerclamp dax_hmem snd_pcm cxl_acpi coretemp > > > cxl_port iTCO_wdt mtd rapl intel > > > _pmc_bxt pmt_telemetry cxl_core dell_wmi_sysman pmt_class > > > iTCO_vendor_support snd_timer isst_if_mmio vfat intel_cstate > > > dell_smbios dcdbas fat dell_wmi_ddv dell_smm_hwmo > > > n dell_wmi_descriptor firmware_attributes_class wmi_bmof > > > intel_uncore > > > einj pcspkr isst_if_mbox_pci atlantic snd isst_if_common > > > intel_vsec > > > e1000e macsec mei_me i2c_i801 > > > spi_intel_pci soundcore i2c_smbus spi_intel mei joydev loop > > > nfnetlink > > > zram nouveau drm_ttm_helper ttm polyval_clmulni iaa_crypto > > > gpu_sched > > > polyval_generic rtsx_pci_sdmm > > > c ghash_clmulni_intel i2c_algo_bit mmc_core drm_gpuvm > > > sha512_ssse3 > > > nvme drm_exec drm_display_helper sha256_ssse3 idxd sha1_ssse3 cec > > > nvme_core idxd_bus rtsx_pci nvme_au > > > th pinctrl_alderlake ip6_tables ip_tables fuse > > > [ 39.848603] CPU: 21 UID: 42 PID: 1734 Comm: gnome-shell > > > Tainted: > > > G W 6.14.0-rc4+ #11 > > > [ 39.848605] Tainted: [W]=WARN > > > [ 39.848606] Hardware name: Dell Inc. Precision 7960 > > > Tower/01G0M6, > > > BIOS 2.7.0 12/17/2024 > > > [ 39.848607] RIP: 0010:nouveau_fence_no_signaling+0xac/0xd0 > > > [nouveau] > > > [ 39.848688] Code: db 74 17 48 8d 7b 38 b8 ff ff ff ff f0 0f c1 > > > 43 > > > 38 83 f8 01 74 29 85 c0 7e 17 31 c0 5b 5d c3 cc cc cc cc e8 76 b2 > > > c5 > > > f0 eb 96 <0f> 0b e9 67 ff ff f > > > f be 03 00 00 00 e8 83 76 33 f1 31 c0 eb dd e8 > > > [ 39.848690] RSP: 0018:ff1cc1ffc5c039f0 EFLAGS: 00010046 > > > [ 39.848691] RAX: 0000000000000001 RBX: ff175a3b504da980 RCX: > > > ff175a3b4801e008 > > > [ 39.848692] RDX: ff175a3b43e7bad0 RSI: ffffffffc09d3fda RDI: > > > ff175a3b504da980 > > > [ 39.848693] RBP: ff175a3b504da9c0 R08: ffffffffc09e39df R09: > > > 0000000000000001 > > > [ 39.848694] R10: 0000000000000001 R11: 0000000000000000 R12: > > > ff175a3b6d97de00 > > > [ 39.848695] R13: 0000000000000246 R14: ff1cc1ffc5c03c60 R15: > > > 0000000000000001 > > > [ 39.848696] FS: 00007fc5477846c0(0000) > > > GS:ff175a5a50280000(0000) > > > knlGS:0000000000000000 > > > [ 39.848698] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > > [ 39.848699] CR2: 000055cb7613d1a8 CR3: 000000012e5ce004 CR4: > > > 0000000000f71ef0 > > > [ 39.848700] DR0: 0000000000000000 DR1: 0000000000000000 DR2: > > > 0000000000000000 > > > [ 39.848701] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: > > > 0000000000000400 > > > [ 39.848702] PKRU: 55555554 > > > [ 39.848703] Call Trace: > > > [ 39.848704] <TASK> > > > [ 39.848705] ? nouveau_fence_no_signaling+0xac/0xd0 [nouveau] > > > [ 39.848782] ? __warn.cold+0x93/0xfa > > > [ 39.848785] ? nouveau_fence_no_signaling+0xac/0xd0 [nouveau] > > > [ 39.848861] ? report_bug+0xff/0x140 > > > [ 39.848863] ? handle_bug+0x58/0x90 > > > [ 39.848865] ? exc_invalid_op+0x17/0x70 > > > [ 39.848866] ? asm_exc_invalid_op+0x1a/0x20 > > > [ 39.848870] ? nouveau_fence_no_signaling+0xac/0xd0 [nouveau] > > > [ 39.848943] nouveau_fence_enable_signaling+0x32/0x80 > > > [nouveau] > > > [ 39.849016] ? __pfx_nouveau_fence_cleanup_cb+0x10/0x10 > > > [nouveau] > > > [ 39.849088] __dma_fence_enable_signaling+0x33/0xc0 > > > [ 39.849090] dma_fence_add_callback+0x4b/0xd0 > > > [ 39.849093] nouveau_fence_emit+0xa3/0x260 [nouveau] > > > [ 39.849166] nouveau_fence_new+0x7d/0xf0 [nouveau] > > > [ 39.849242] nouveau_gem_ioctl_pushbuf+0xe8f/0x1300 [nouveau] > > > [ 39.849338] ? __pfx_nouveau_gem_ioctl_pushbuf+0x10/0x10 > > > [nouveau] > > > [ 39.849431] drm_ioctl_kernel+0xad/0x100 > > > [ 39.849433] drm_ioctl+0x288/0x550 > > > [ 39.849435] ? __pfx_nouveau_gem_ioctl_pushbuf+0x10/0x10 > > > [nouveau] > > > [ 39.849526] nouveau_drm_ioctl+0x57/0xb0 [nouveau] > > > [ 39.849620] __x64_sys_ioctl+0x94/0xc0 > > > [ 39.849621] do_syscall_64+0x82/0x160 > > > [ 39.849623] ? drm_ioctl+0x2b7/0x550 > > > [ 39.849625] ? __pfx_nouveau_gem_ioctl_pushbuf+0x10/0x10 > > > [nouveau] > > > [ 39.849719] ? ktime_get_mono_fast_ns+0x38/0xd0 > > > [ 39.849721] ? __pm_runtime_suspend+0x69/0xc0 > > > [ 39.849724] ? syscall_exit_to_user_mode_prepare+0x15e/0x1a0 > > > [ 39.849726] ? syscall_exit_to_user_mode+0x10/0x200 > > > [ 39.849729] ? do_syscall_64+0x8e/0x160 > > > [ 39.849730] ? exc_page_fault+0x7e/0x1a0 > > > [ 39.849733] entry_SYSCALL_64_after_hwframe+0x76/0x7e > > > [ 39.849735] RIP: 0033:0x7fc5576fe0ad > > > [ 39.849736] Code: 04 25 28 00 00 00 48 89 45 c8 31 c0 48 8d 45 > > > 10 > > > c7 45 b0 10 00 00 00 48 89 45 b8 48 8d 45 d0 48 89 45 c0 b8 10 00 > > > 00 > > > 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1a 48 8b 45 c8 64 48 2b 04 25 > > > 28 > > > 00 00 00 > > > [ 39.849737] RSP: 002b:00007ffc002688a0 EFLAGS: 00000246 > > > ORIG_RAX: > > > 0000000000000010 > > > [ 39.849739] RAX: ffffffffffffffda RBX: 000055cb74e316c0 RCX: > > > 00007fc5576fe0ad > > > [ 39.849740] RDX: 00007ffc00268960 RSI: 00000000c0406481 RDI: > > > 000000000000000e > > > [ 39.849741] RBP: 00007ffc002688f0 R08: 0000000000000000 R09: > > > 000055cb74e35560 > > > [ 39.849742] R10: 0000000000000014 R11: 0000000000000246 R12: > > > 00007ffc00268960 > > > [ 39.849744] R13: 00000000c0406481 R14: 000000000000000e R15: > > > 000055cb74e3cd10 > > > [ 39.849746] </TASK> > > > [ 39.849746] ---[ end trace 0000000000000000 ]--- > > > [ 39.849776] ------------[ cut here ]------------ > > > > > > > > > This is the first WARN_ON() in dma_fence_set_error(), called by > > > nouveau_fence_context_kill(). > > > > > > It's rare, but it is a bug, or rather: the archetype of a race, > > > since > > > (as Christian pointed out) nouveau_fence_update() later at some > > > point > > > will remove the signaled fence (by signaling it again). > > > > > > > > > P. > > > > > > > > > Philipp Stanner (3): > > > drm/nouveau: Prevent signaled fences in pending list > > > drm/nouveau: Remove surplus if-branch > > > drm/nouveau: Add helper to check base fence > > > > > > drivers/gpu/drm/nouveau/nouveau_fence.c | 32 ++++++++++++++----- > > > ---- > > > -- > > > 1 file changed, 18 insertions(+), 14 deletions(-) > > > >