Message ID | CAPM=9tx_KS1qc8E1kUB5PPBvO9EKHNkk7hYWu-WwWJ6os=otJA@mail.gmail.com (mailing list archive) |
---|---|
State | New, archived |
Series | [git,pull] drm urgent for 6.10-rc1 |
The pull request you sent on Thu, 16 May 2024 12:53:52 +1000:
> https://gitlab.freedesktop.org/drm/kernel.git tags/drm-next-2024-05-16
has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/972a2543e3dd87f7310d65944b857631b4290e12
Thank you!
On Wed, 15 May 2024 at 19:54, Dave Airlie <airlied@gmail.com> wrote:
>
> Here is the buddy allocator fix I picked up from the list, please apply.

So I removed my reverts, and am running a kernel that includes the
merge 972a2543e3dd ("Merge tag 'drm-next-2024-05-16' of
https://gitlab.freedesktop.org/drm/kernel") but I still see a lot of
warnings as per below.

I was going to say that the difference is that now they trigger
through the page fault path (amdgpu_gem_fault) while previously they
triggered through the system call path and amdgpu_drm_ioctl. But it
turns out it's both in both cases, and it just happened to be one or
the other in the particular warnings that I cut-and-pasted.

As before, there are tens of thousands of them after being up for less
than an hour, so this is not some kind of rare thing.

The machine hasn't _crashed_ yet, though. But I'm going to be out and
about and working on my laptop the rest of the day, so I won't be able
to test.

(And that kernel version of "6.9.0-08295-gfd39ab3b5289" that is quoted
in the WARN isn't some official kernel, I have about ten private
patches that I keep testing in my tree, so if you wondered what the
heck that git version is, it's not going to match anything you see,
but the ~ten patches also aren't relevant to this).

Nothing unusual in the config, although this is clang-built. Shouldn't
matter, never has before.

Linus

---

CPU: 28 PID: 3326 Comm: mutter-x11-fram Tainted: G W 6.9.0-08295-gfd39ab3b5289 #64
Hardware name: Gigabyte Technology Co., Ltd. TRX40 AORUS MASTER/TRX40 AORUS MASTER, BIOS F7 09/07/2022
RIP: 0010:__force_merge+0x14f/0x180 [drm_buddy]
Code: 74 0d 49 8b 44 24 18 48 d3 e0 49 29 44 24 30 4c 89 e7 ba 01 00 00 00 e8 9f 00 00 00 44 39 e8 73 1f 49 8b 04 24 e9 25 ff ff ff <0f> 0b 4c 39 c3 75 a3 eb 99 b8 f4 ff ff ff c3 b8 f4 ff ff ff eb 02
RSP: 0000:ffff9e350314baa0 EFLAGS: 00010246
RAX: ffff974a227a4a00 RBX: ffff974a2d024b88 RCX: 000000000b8eb800
RDX: ffff974a2d024bf8 RSI: ffff974a2d024bd0 RDI: ffff974a2d024bb0
RBP: 0000000000000000 R08: ffff974a2d024b88 R09: 0000000000001000
R10: 0000000000000800 R11: 0000000000000000 R12: ffff974a2198fa18
R13: 0000000000000009 R14: 0000000010000000 R15: 0000000000000000
FS: 00007f56a78b6540(0000) GS:ffff97591e700000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f5688040000 CR3: 0000000198cc9000 CR4: 0000000000350ef0
Call Trace:
<TASK>
? __warn+0xc1/0x190
? __force_merge+0x14f/0x180 [drm_buddy]
? report_bug+0x129/0x1a0
? handle_bug+0x3d/0x70
? exc_invalid_op+0x16/0x40
? asm_exc_invalid_op+0x16/0x20
? __force_merge+0x14f/0x180 [drm_buddy]
drm_buddy_alloc_blocks+0x249/0x400 [drm_buddy]
? __cond_resched+0x16/0x40
amdgpu_vram_mgr_new+0x204/0x3f0 [amdgpu]
ttm_resource_alloc+0x31/0x120 [ttm]
ttm_bo_alloc_resource+0xbc/0x260 [ttm]
? memcg_account_kmem+0x4a/0xe0
? ttm_resource_compatible+0xbb/0xe0 [ttm]
ttm_bo_validate+0x9f/0x210 [ttm]
? __alloc_pages+0x129/0x210
amdgpu_bo_fault_reserve_notify+0x98/0x110 [amdgpu]
amdgpu_gem_fault+0x53/0xd0 [amdgpu]
__do_fault+0x41/0x140
do_pte_missing+0x453/0xfd0
handle_mm_fault+0x73c/0x1090
do_user_addr_fault+0x2e2/0x6f0
exc_page_fault+0x56/0x110
asm_exc_page_fault+0x22/0x30
On Thu, May 16, 2024 at 2:02 PM Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> On Wed, 15 May 2024 at 19:54, Dave Airlie <airlied@gmail.com> wrote:
> >
> > Here is the buddy allocator fix I picked up from the list, please apply.
>
> So I removed my reverts, and am running a kernel that includes the
> merge 972a2543e3dd ("Merge tag 'drm-next-2024-05-16' of
> https://gitlab.freedesktop.org/drm/kernel") but I still see a lot of
> warnings as per below.
>
> I was going to say that the difference is that now they trigger
> through the page fault path (amdgpu_gem_fault) while previously they
> triggered through the system call path and amdgpu_drm_ioctl. But it
> turns out it's both in both cases, and it just happened to be one or
> the other in the particular warnings that I cut-and-pasted.
>
> As before, there are tens of thousands of them after being up for less
> than an hour, so this is not some kind of rare thing.
>
> The machine hasn't _crashed_ yet, though. But I'm going to be out and
> about and working on my laptop the rest of the day, so I won't be able
> to test.
>
> (And that kernel version of "6.9.0-08295-gfd39ab3b5289" that is quoted
> in the WARN isn't some official kernel, I have about ten private
> patches that I keep testing in my tree, so if you wondered what the
> heck that git version is, it's not going to match anything you see,
> but the ~ten patches also aren't relevant to this).
>
> Nothing unusual in the config, although this is clang-built. Shouldn't
> matter, never has before.

Arun is investigating and trying to repro it. You still have a
polaris based GPU right?

Thanks,

Alex
On 5/17/2024 12:01 AM, Alex Deucher wrote:
> On Thu, May 16, 2024 at 2:02 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>> On Wed, 15 May 2024 at 19:54, Dave Airlie <airlied@gmail.com> wrote:
>>> Here is the buddy allocator fix I picked up from the list, please apply.
>> So I removed my reverts, and am running a kernel that includes the
>> merge 972a2543e3dd ("Merge tag 'drm-next-2024-05-16' of
>> https://gitlab.freedesktop.org/drm/kernel") but I still see a lot of
>> warnings as per below.
>>
>> I was going to say that the difference is that now they trigger
>> through the page fault path (amdgpu_gem_fault) while previously they
>> triggered through the system call path and amdgpu_drm_ioctl. But it
>> turns out it's both in both cases, and it just happened to be one or
>> the other in the particular warnings that I cut-and-pasted.
>>
>> As before, there are tens of thousands of them after being up for less
>> than an hour, so this is not some kind of rare thing.
>>
>> The machine hasn't _crashed_ yet, though. But I'm going to be out and
>> about and working on my laptop the rest of the day, so I won't be able
>> to test.
>>
>> (And that kernel version of "6.9.0-08295-gfd39ab3b5289" that is quoted
>> in the WARN isn't some official kernel, I have about ten private
>> patches that I keep testing in my tree, so if you wondered what the
>> heck that git version is, it's not going to match anything you see,
>> but the ~ten patches also aren't relevant to this).
>>
>> Nothing unusual in the config, although this is clang-built. Shouldn't
>> matter, never has before.
> Arun is investigating and trying to repro it. You still have a
> polaris based GPU right?

We haven't been able to reproduce it across a variety of GPUs. Would it
please be possible to send your dmesg logs and kernel config? I will
check this on the same GPU you are using.

Thanks,
Arun.
> >>
> >> (And that kernel version of "6.9.0-08295-gfd39ab3b5289" that is quoted
> >> in the WARN isn't some official kernel, I have about ten private
> >> patches that I keep testing in my tree, so if you wondered what the
> >> heck that git version is, it's not going to match anything you see,
> >> but the ~ten patches also aren't relevant to this).
> >>
> >> Nothing unusual in the config, although this is clang-built. Shouldn't
> >> matter, never has before.
> > Arun is investigating and trying to repro it. You still have a
> > polaris based GPU right?
> We haven't been able to reproduce it across a variety of GPUs. Would it
> please be possible to send your dmesg logs and kernel config? I will
> check this on the same GPU you are using.

I just installed my RX480 polaris card in my AMD test machine, and
with current origin/master I'm not seeing this at all. Running an F40
GNOME desktop, doing firefox etc.

Linus, do you see it at boot straight away?

Dave.
On Thu, May 16, 2024 at 2:02 PM Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> On Wed, 15 May 2024 at 19:54, Dave Airlie <airlied@gmail.com> wrote:
> >
> > Here is the buddy allocator fix I picked up from the list, please apply.
>
> So I removed my reverts, and am running a kernel that includes the
> merge 972a2543e3dd ("Merge tag 'drm-next-2024-05-16' of
> https://gitlab.freedesktop.org/drm/kernel") but I still see a lot of
> warnings as per below.
>
> I was going to say that the difference is that now they trigger
> through the page fault path (amdgpu_gem_fault) while previously they
> triggered through the system call path and amdgpu_drm_ioctl. But it
> turns out it's both in both cases, and it just happened to be one or
> the other in the particular warnings that I cut-and-pasted.
>
> As before, there are tens of thousands of them after being up for less
> than an hour, so this is not some kind of rare thing.
>
> The machine hasn't _crashed_ yet, though. But I'm going to be out and
> about and working on my laptop the rest of the day, so I won't be able
> to test.
>
> (And that kernel version of "6.9.0-08295-gfd39ab3b5289" that is quoted
> in the WARN isn't some official kernel, I have about ten private
> patches that I keep testing in my tree, so if you wondered what the
> heck that git version is, it's not going to match anything you see,
> but the ~ten patches also aren't relevant to this).
>
> Nothing unusual in the config, although this is clang-built. Shouldn't
> matter, never has before.

Can you try this patch?
https://patchwork.freedesktop.org/patch/594539/

Alex
On Thu, 16 May 2024 at 18:08, Dave Airlie <airlied@gmail.com> wrote:
>
> Linus, do you see it at boot straight away?

Ok, back at that computer now, and yes, I see those messages right away.

In fact, they seem to happen before gnome even starts up, ie I see
those messages long before the first messages from gnome-session:

May 17 12:07:17 tr3970x kernel: WARNING: CPU: 4 PID: 1067 at drivers/gpu/drm/drm_buddy.c:198 __force_merge+0x184/0x1b0 [drm_buddy]
.. lots and lots and lots of them ..
...
May 17 12:07:23 tr3970x systemd-cryptsetup[982]: ...
...
May 17 12:07:25 tr3970x systemd[1]: Reached target basic.target
...
May 17 12:07:25 tr3970x systemd[1]: Mounted sysroot.mount - /sysroot.
...
May 17 12:07:25 tr3970x systemd[1]: Switching root.
...
May 17 12:07:36 tr3970x gnome-session[2824]: ..
...
May 17 12:07:36 tr3970x gnome-shell[2836]: Obtained a high priority EGL context
May 17 12:07:36 tr3970x kernel: WARNING: CPU: 31 PID: 2836 at drivers/gpu/drm/drm_buddy.c:198 __force_merge+0x184/0x1b0 [drm_buddy]
.. lots of warnings resume ...

IOW, it happens already during the graphical boot before I have even
typed in my disk encryption password.

Then it starts again when gnome starts.

I just checked: I have exactly 8192 warnings from the early boot
before the first gnome warning. Which sounds like too round a number
to be an accident.

I will try the patch Alex pointed at next:

https://patchwork.freedesktop.org/patch/594539/

and see if that fixes it for me.

Linus
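
For anyone wanting to repeat the count described in the message above on their own log, here is a minimal sketch. It is not how the 8192 figure in the thread was obtained; the regular expressions and the function name `count_warnings_before_gnome` are hypothetical, modelled only on the log lines quoted above, and the sample lines are made up for illustration.

```python
import re

# Hypothetical patterns, modelled on the log lines quoted in the message above.
WARN_RE = re.compile(r"kernel: WARNING: .*__force_merge")
GNOME_RE = re.compile(r"gnome-session\[\d+\]:")


def count_warnings_before_gnome(lines):
    """Count __force_merge warnings that appear before the first gnome-session line."""
    count = 0
    for line in lines:
        if GNOME_RE.search(line):
            break  # first gnome-session message: early boot is over
        if WARN_RE.search(line):
            count += 1
    return count


# Tiny made-up sample in the same shape as the quoted log (not real data).
sample = [
    "May 17 12:07:17 tr3970x kernel: WARNING: CPU: 4 PID: 1067 at drivers/gpu/drm/drm_buddy.c:198 __force_merge+0x184/0x1b0 [drm_buddy]",
    "May 17 12:07:17 tr3970x kernel: WARNING: CPU: 4 PID: 1067 at drivers/gpu/drm/drm_buddy.c:198 __force_merge+0x184/0x1b0 [drm_buddy]",
    "May 17 12:07:25 tr3970x systemd[1]: Switching root.",
    "May 17 12:07:36 tr3970x gnome-session[2824]: ..",
    "May 17 12:07:36 tr3970x kernel: WARNING: CPU: 31 PID: 2836 at drivers/gpu/drm/drm_buddy.c:198 __force_merge+0x184/0x1b0 [drm_buddy]",
]

print(count_warnings_before_gnome(sample))

# The count reported in the thread is a power of two: 8192 == 2**13.
print(8192 == 2 ** 13)
```

On a real machine the lines could come from the current boot's journal, for example by piping `journalctl -b` into a script that passes `sys.stdin` to the function.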
On Fri, 17 May 2024 at 06:55, Alex Deucher <alexdeucher@gmail.com> wrote:
>
> Can you try this patch?
> https://patchwork.freedesktop.org/patch/594539/

Ack. This seems to fix it for me - unless the problem is random and
only happens sometimes, and I've just been *very* unlucky until now.

Linus