diff mbox series

drm/atmel-hlcdc: Release CRTC commit when destroying plane state

Message ID 20240306194935.11871-1-pl.dourneau@klervi.com (mailing list archive)
State New, archived
Headers show
Series drm/atmel-hlcdc: Release CRTC commit when destroying plane state | expand

Commit Message

Pierre-Louis Dourneau March 6, 2024, 7:49 p.m. UTC
From: Arnaud Lahache <a.lahache@klervi.com>

Fixes a memory leak occurring on each modeset update.

Running a program such as libdrm's modetest[0] with this driver exhausts
all available memory after a few minutes. Enabling kmemleak yields a series
of such leak reports:

unreferenced object 0xc21acf40 (size 64):
  comm "modetest", pid 210, jiffies 4294942460 (age 331.240s)
  hex dump (first 32 bytes):
    00 a0 a1 c1 01 00 00 00 ff ff ff ff 4c cf 1a c2  ............L...
    4c cf 1a c2 ff ff ff ff 58 cf 1a c2 58 cf 1a c2  L.......X...X...
  backtrace:
    [<d68b3e09>] kmalloc_trace+0x18/0x24
    [<f858a020>] drm_atomic_helper_setup_commit+0x1e0/0x7e0
    [<26e8ab04>] drm_atomic_helper_commit+0x40/0x160
    [<49708b0c>] drm_atomic_commit+0xa8/0xf0
    [<e58c2942>] drm_mode_obj_set_property_ioctl+0x154/0x3d8
    [<5e97e57d>] drm_ioctl+0x200/0x3c4
    [<ed514ba1>] sys_ioctl+0x240/0xb48
    [<26aab344>] ret_fast_syscall+0x0/0x44

drm_atomic_helper_setup_commit() acquires a reference to a drm_crtc_commit
for each CRTC and associated connectors and planes involved in a modeset.
64-byte leaks map well to the size of a drm_crtc_commit on 32-bit
architectures, and the second 4-byte chunk in the hex dump above awfully
looks like a reference count.

We tracked this missing reference decrement down to the driver's
atmel_hlcdc_plane_atomic_destroy_state() callback. Its CRTC counterpart,
atmel_hlcdc_crtc_destroy_state(), calls into the drm_atomic helpers and
properly releases its references to the commit. Planes didn't. Using the
default helper for that purpose, __drm_atomic_helper_plane_destroy_state(),
fixes the leak and avoids reimplementing the same logic in the driver.

[0]: https://gitlab.freedesktop.org/mesa/drm/-/tree/main/tests/modetest
     Invoke with `modetest -M atmel-hlcdc -s 32:#0 -v`, assuming 32 is the
     ID of a connector.

Signed-off-by: Arnaud Lahache <a.lahache@klervi.com>
Co-developed-by: Pierre-Louis Dourneau <pl.dourneau@klervi.com>
Signed-off-by: Pierre-Louis Dourneau <pl.dourneau@klervi.com>
Co-developed-by: Benoît Alcaina <b.alcaina@klervi.com>
Signed-off-by: Benoît Alcaina <b.alcaina@klervi.com>
---
As far as our testing goes, we've been running 6 of our production units
with this patch for more than 2 weeks as per the date this patch is sent
out. We can report stable memory usage.

Admittedly, our usage of the DRM uAPI is rather simple: create 2 dumb
buffers, do an initial MODE_SETCRTC, and then MODE_PAGE_FLIP between the
two dumb buffers at the rate of once per second on average. We haven't
evaluated more complex workloads.

 drivers/gpu/drm/atmel-hlcdc/atmel_hlcdc_plane.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)


base-commit: 9dfc46c87cdc8f5a42a71de247a744a6b8188980

Comments

Ludovic Desroches March 8, 2024, 8 a.m. UTC | #1
On 3/6/24 20:49, Pierre-Louis Dourneau wrote:
> 
> From: Arnaud Lahache <a.lahache@klervi.com>
> 
> Fixes a memory leak occurring on each modeset update.
> 
> Running a program such as libdrm's modetest[0] with this driver exhausts
> all available memory after a few minutes. Enabling kmemleak yields a series
> of such leak reports:
> 
> unreferenced object 0xc21acf40 (size 64):
>    comm "modetest", pid 210, jiffies 4294942460 (age 331.240s)
>    hex dump (first 32 bytes):
>      00 a0 a1 c1 01 00 00 00 ff ff ff ff 4c cf 1a c2  ............L...
>      4c cf 1a c2 ff ff ff ff 58 cf 1a c2 58 cf 1a c2  L.......X...X...
>    backtrace:
>      [<d68b3e09>] kmalloc_trace+0x18/0x24
>      [<f858a020>] drm_atomic_helper_setup_commit+0x1e0/0x7e0
>      [<26e8ab04>] drm_atomic_helper_commit+0x40/0x160
>      [<49708b0c>] drm_atomic_commit+0xa8/0xf0
>      [<e58c2942>] drm_mode_obj_set_property_ioctl+0x154/0x3d8
>      [<5e97e57d>] drm_ioctl+0x200/0x3c4
>      [<ed514ba1>] sys_ioctl+0x240/0xb48
>      [<26aab344>] ret_fast_syscall+0x0/0x44
> 
> drm_atomic_helper_setup_commit() acquires a reference to a drm_crtc_commit
> for each CRTC and associated connectors and planes involved in a modeset.
> 64-byte leaks map well to the size of a drm_crtc_commit on 32-bit
> architectures, and the second 4-byte chunk in the hex dump above awfully
> looks like a reference count.
> 
> We tracked this missing reference decrement down to the driver's
> atmel_hlcdc_plane_atomic_destroy_state() callback. Its CRTC counterpart,
> atmel_hlcdc_crtc_destroy_state(), calls into the drm_atomic helpers and
> properly releases its references to the commit. Planes didn't. Using the
> default helper for that purpose, __drm_atomic_helper_plane_destroy_state(),
> fixes the leak and avoids reimplementing the same logic in the driver.
> 
> [0]: https://gitlab.freedesktop.org/mesa/drm/-/tree/main/tests/modetest
>       Invoke with `modetest -M atmel-hlcdc -s 32:#0 -v`, assuming 32 is the
>       ID of a connector.
> 
> Signed-off-by: Arnaud Lahache <a.lahache@klervi.com>
> Co-developed-by: Pierre-Louis Dourneau <pl.dourneau@klervi.com>
> Signed-off-by: Pierre-Louis Dourneau <pl.dourneau@klervi.com>
> Co-developed-by: Benoît Alcaina <b.alcaina@klervi.com>
> Signed-off-by: Benoît Alcaina <b.alcaina@klervi.com>
> ---
> As far as our testing goes, we've been running 6 of our production units
> with this patch for more than 2 weeks as per the date this patch is sent
> out. We can report stable memory usage.

Hello Arnaud,

This patch fixes the memory leak but introduces a crash on my side when 
exiting a graphics app using the Microchip graphics library.


------------[ cut here ]------------ 
 
 

WARNING: CPU: 0 PID: 201 at lib/refcount.c:28 
refcount_warn_saturate+0x58/0x130 
 
 

refcount_t: underflow; use-after-free. 
 
 

Modules linked in: 
 
 

CPU: 0 PID: 201 Comm: basic Not tainted 6.1.55-linux4microchip-2023.10+ 
#65 
 

Hardware name: Microchip SAM9X60 
 
 

  unwind_backtrace from show_stack+0x10/0x18 
 
 

  show_stack from dump_stack_lvl+0x28/0x34 
 
 

  dump_stack_lvl from __warn+0x70/0xb8 
 
 

  __warn from warn_slowpath_fmt+0x78/0xac 
 
 

  warn_slowpath_fmt from refcount_warn_saturate+0x58/0x130 
 
 

  refcount_warn_saturate from kref_put+0x54/0x5c 
 
 

  kref_put from drm_fb_release+0x100/0x11c 
 
 

  drm_fb_release from drm_file_free+0xcc/0x1bc 
 
 

  drm_file_free from drm_release+0x44/0x94 
 
 

  drm_release from __fput+0xf0/0x20c 
 
 

  __fput from task_work_run+0x8c/0xb0 
 
 

  task_work_run from do_exit+0x310/0x760 
 
 

  do_exit from sys_exit_group+0x0/0x14 
 
 

  sys_exit_group from get_signal+0x524/0x64c 
 
 

  get_signal from do_work_pending+0xf4/0x398 
 
 

  do_work_pending from slow_work_pending+0xc/0x24 
 
 

Exception stack(0xc9991fb0 to 0xc9991ff8) 
 
 

1fa0:                                     0054dd48 00000000 005331f8 
00000001 
 

1fc0: 00533270 00000002 00000000 404c8000 beb54170 0051b448 43ab8000 
0051b424 
 

1fe0: b6cc1d98 beb54040 b6c224fc b68357dc 20000010 ffffffff 
 
 

---[ end trace 0000000000000000 ]--- 
 
 

8<--- cut here --- 
 
 

Unable to handle kernel NULL pointer dereference at virtual address 
00000004 
 

[00000004] *pgd=00000000 
 
 

Internal error: Oops: 805 [#1] ARM 
 
 

Modules linked in: 
 
 

CPU: 0 PID: 201 Comm: basic Tainted: G        W 
6.1.55-linux4microchip-2023.10+ #65 
 

Hardware name: Microchip SAM9X60 
 
 

PC is at drm_fb_release+0xc0/0x11c 
 
 

LR is at drm_fb_release+0x100/0x11c 
 
 

pc : [<c0329b10>]    lr : [<c0329b50>]    psr: 80000013 
 
 

sp : c9991e48  ip : 00000000  fp : 0051b424 
 
 

r10: c174a5f0  r9 : 00000000  r8 : c2188074 
 
 

r7 : 60000013  r6 : c9991e5c  r5 : ffffff8c  r4 : c2188048 
 
 

r3 : c1590994  r2 : c2188048  r1 : 00000000  r0 : c1590920 
 
 

Flags: Nzcv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment none 
 
 

Control: 0005317f  Table: 2384c000  DAC: 00000051 
 
 

Register r0 information: slab kmalloc-192 start c1590920 pointer offset 
0 size 192 
 

Register r1 information: NULL pointer 
 
 

Register r2 information: slab kmalloc-192 start c2188000 pointer offset 
72 size 192 
 

Register r3 information: slab kmalloc-192 start c1590920 pointer offset 
116 size 192 
 

Register r4 information: slab kmalloc-192 start c2188000 pointer offset 
72 size 192 
 

Register r5 information: non-paged memory 
 
 

Register r6 information: 2-page vmalloc region starting at 0xc9990000 
allocated at kernel_clone+0xb4/0x204 
 

Register r7 information: non-paged memory 
 
 

Register r8 information: slab kmalloc-192 start c2188000 pointer offset 
116 size 192 
 

Register r9 information: NULL pointer 
 
 

Register r10 information: slab task_struct start c174a140 pointer offset 
1200 
 

Register r11 information: non-paged memory 
 
 

Register r12 information: NULL pointer 
 
 

Process basic (pid: 201, stack limit = 0x41541c7b) 
 
 

Stack: (0xc9991e48 to 0xc9992000)
1e40:                   00000001 00000000 00000000 00000000 00000000 
c9991e5c 
 

1e60: c9991e5c 5c44a9dd c2188000 c1621800 c2188060 c03146e4 000000c9 
0000e200 
 

1e80: 00000001 00000000 00000000 c1621800 c38f2cc0 c0fb63e0 c0c1ce70 
c0f99560 
 

1ea0: 00000000 c0314a5c c38f2cc0 c10ee6e8 000a201f c00c66a0 c38f2cc0 
00000001 
 

1ec0: c09f1a4c c00c67fc c38f2e00 c174a140 c0a20b1c c2232a50 c9991ef4 
c002f000 
 

1ee0: c174a140 c2232a20 c174a140 c001a094 00000002 00000008 c383c880 
5c44a9dd 
 

1f00: 00000002 c383c880 0051b424 c001a6a8 00000009 c0024af0 beb53c50 
00000000 
 

1f20: 00000000 5c44a9dd 00000000 00000000 c9991fb0 c174a140 00000000 
00000000 
 

1f40: 00000000 00000000 0051b424 c000c244 00000000 00000000 00000000 
00000000 
 

1f60: 00000000 00000000 00000009 00000000 00000000 00000000 00000000 
00000000 
 

1f80: 00000000 00000000 00000000 5c44a9dd b68357dc 20000010 ffffffff 
0005317f 
 

1fa0: 00000000 c174a140 43ab8000 c00084dc 0054dd48 00000000 005331f8 
00000001 
 

1fc0: 00533270 00000002 00000000 404c8000 beb54170 0051b448 43ab8000 
0051b424 
 

1fe0: b6cc1d98 beb54040 b6c224fc b68357dc 20000010 ffffffff 00000000 
00000000 
 

  drm_fb_release from drm_file_free+0xcc/0x1bc 
 
 

  drm_file_free from drm_release+0x44/0x94 
 
 

  drm_release from __fput+0xf0/0x20c 
 
 

  __fput from task_work_run+0x8c/0xb0 
 
 

  task_work_run from do_exit+0x310/0x760 
 
 

  do_exit from sys_exit_group+0x0/0x14 
 
 

  sys_exit_group from get_signal+0x524/0x64c 
 
 

  get_signal from do_work_pending+0xf4/0x398 
 
 

  do_work_pending from slow_work_pending+0xc/0x24 
 
 

Exception stack(0xc9991fb0 to 0xc9991ff8) 
 
 

1fa0:                                     0054dd48 00000000 005331f8 
00000001 
 

1fc0: 00533270 00000002 00000000 404c8000 beb54170 0051b448 43ab8000 
0051b424 
 

1fe0: b6cc1d98 beb54040 b6c224fc b68357dc 20000010 ffffffff 
 
 

Code: e590c018 e5902078 e5901074 e35c0001 (e5812004) 
 
 

---[ end trace 00000000c0a20288 ]--- 
 
 

Fixing recursive fault but reboot is needed!


The memory leak had been introduced with this commit:


commit eec44d44a3d2d00c8f663f13555d3dd280b1ea3f
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Thu Jan 21 16:29:54 2021 +0100



     drm/atmel: Use drm_atomic_helper_commit

     One of these drivers that predates the nonblocking support in helpers,
     and hand-rolled its own thing. Entirely not anything specific here, we
     can just delete it all and replace it with the helper version.

     Could also perhaps use the drm_mode_config_helper_suspend/resume
     stuff, for another few lines deleted. But I'm not looking at that
     stuff, I'm just going through all the atomic commit functions and make
     sure they have properly annotated dma-fence critical sections
     everywhere.


I think this move to the drm atomic helper should have gone with an 
update on the release side as well. There is probably something wrong 
with the atomic_destroy_state callbacks provided by the atmel-hlcdc driver.

Regards,
Ludovic

> 
> Admittedly, our usage of the DRM uAPI is rather simple: create 2 dumb
> buffers, do an initial MODE_SETCRTC, and then MODE_PAGE_FLIP between the
> two dumb buffers at the rate of once per second on average. We haven't
> evaluated more complex workloads.
> 
>   drivers/gpu/drm/atmel-hlcdc/atmel_hlcdc_plane.c | 3 +--
>   1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/atmel-hlcdc/atmel_hlcdc_plane.c b/drivers/gpu/drm/atmel-hlcdc/atmel_hlcdc_plane.c
> index daa508504f47..390c4fc62af7 100644
> --- a/drivers/gpu/drm/atmel-hlcdc/atmel_hlcdc_plane.c
> +++ b/drivers/gpu/drm/atmel-hlcdc/atmel_hlcdc_plane.c
> @@ -934,8 +934,7 @@ static void atmel_hlcdc_plane_atomic_destroy_state(struct drm_plane *p,
>                                state->dscrs[i]->self);
>          }
> 
> -       if (s->fb)
> -               drm_framebuffer_put(s->fb);
> +       __drm_atomic_helper_plane_destroy_state(s);
> 
>          kfree(state);
>   }
> 
> base-commit: 9dfc46c87cdc8f5a42a71de247a744a6b8188980
> --
> 2.34.1
>
Pierre-Louis Dourneau March 12, 2024, 8:40 a.m. UTC | #2
Hey Ludovic,

On 3/8/24, Ludovic.Desroches@microchip.com <Ludovic.Desroches@microchip.com> wrote:
> On 3/6/24 20:49, Pierre-Louis Dourneau wrote:
> >
> > From: Arnaud Lahache <a.lahache@klervi.com>
> >
> > Fixes a memory leak occurring on each modeset update.
> >
> > Running a program such as libdrm's modetest[0] with this driver exhausts
> > all available memory after a few minutes. Enabling kmemleak yields a series
> > of such leak reports:
> >
> > unreferenced object 0xc21acf40 (size 64):
> >    comm "modetest", pid 210, jiffies 4294942460 (age 331.240s)
> >    hex dump (first 32 bytes):
> >      00 a0 a1 c1 01 00 00 00 ff ff ff ff 4c cf 1a c2  ............L...
> >      4c cf 1a c2 ff ff ff ff 58 cf 1a c2 58 cf 1a c2  L.......X...X...
> >    backtrace:
> >      [<d68b3e09>] kmalloc_trace+0x18/0x24
> >      [<f858a020>] drm_atomic_helper_setup_commit+0x1e0/0x7e0
> >      [<26e8ab04>] drm_atomic_helper_commit+0x40/0x160
> >      [<49708b0c>] drm_atomic_commit+0xa8/0xf0
> >      [<e58c2942>] drm_mode_obj_set_property_ioctl+0x154/0x3d8
> >      [<5e97e57d>] drm_ioctl+0x200/0x3c4
> >      [<ed514ba1>] sys_ioctl+0x240/0xb48
> >      [<26aab344>] ret_fast_syscall+0x0/0x44
> >
> > drm_atomic_helper_setup_commit() acquires a reference to a drm_crtc_commit
> > for each CRTC and associated connectors and planes involved in a modeset.
> > 64-byte leaks map well to the size of a drm_crtc_commit on 32-bit
> > architectures, and the second 4-byte chunk in the hex dump above awfully
> > looks like a reference count.
> >
> > We tracked this missing reference decrement down to the driver's
> > atmel_hlcdc_plane_atomic_destroy_state() callback. Its CRTC counterpart,
> > atmel_hlcdc_crtc_destroy_state(), calls into the drm_atomic helpers and
> > properly releases its references to the commit. Planes didn't. Using the
> > default helper for that purpose, __drm_atomic_helper_plane_destroy_state(),
> > fixes the leak and avoids reimplementing the same logic in the driver.
> >
> > [0]: https://gitlab.freedesktop.org/mesa/drm/-/tree/main/tests/modetest
> >       Invoke with `modetest -M atmel-hlcdc -s 32:#0 -v`, assuming 32 is the
> >       ID of a connector.
> >
> > Signed-off-by: Arnaud Lahache <a.lahache@klervi.com>
> > Co-developed-by: Pierre-Louis Dourneau <pl.dourneau@klervi.com>
> > Signed-off-by: Pierre-Louis Dourneau <pl.dourneau@klervi.com>
> > Co-developed-by: Benoît Alcaina <b.alcaina@klervi.com>
> > Signed-off-by: Benoît Alcaina <b.alcaina@klervi.com>
> > ---
> > As far as our testing goes, we've been running 6 of our production units
> > with this patch for more than 2 weeks as per the date this patch is sent
> > out. We can report stable memory usage.
> 
> Hello Arnaud,
> 
> This patch fixes the memory leak but introduces a crash on my side when
> exiting a graphics app using the Microchip graphics library.
> 
> 
> ------------[ cut here ]------------
> 
> 
> 
> WARNING: CPU: 0 PID: 201 at lib/refcount.c:28
> refcount_warn_saturate+0x58/0x130
> 
> 
> 
> refcount_t: underflow; use-after-free.
> 
> 
> 
> Modules linked in:
> 
> 
> 
> CPU: 0 PID: 201 Comm: basic Not tainted 6.1.55-linux4microchip-2023.10+
> #65
> 
> 
> Hardware name: Microchip SAM9X60
> 
> 
> 
>   unwind_backtrace from show_stack+0x10/0x18
> 
> 
> 
>   show_stack from dump_stack_lvl+0x28/0x34
> 
> 
> 
>   dump_stack_lvl from __warn+0x70/0xb8
> 
> 
> 
>   __warn from warn_slowpath_fmt+0x78/0xac
> 
> 
> 
>   warn_slowpath_fmt from refcount_warn_saturate+0x58/0x130
> 
> 
> 
>   refcount_warn_saturate from kref_put+0x54/0x5c
> 
> 
> 
>   kref_put from drm_fb_release+0x100/0x11c
> 
> 
> 
>   drm_fb_release from drm_file_free+0xcc/0x1bc
> 
> 
> 
>   drm_file_free from drm_release+0x44/0x94
> 
> 
> 
>   drm_release from __fput+0xf0/0x20c
> 
> 
> 
>   __fput from task_work_run+0x8c/0xb0
> 
> 
> 
>   task_work_run from do_exit+0x310/0x760
> 
> 
> 
>   do_exit from sys_exit_group+0x0/0x14
> 
> 
> 
>   sys_exit_group from get_signal+0x524/0x64c
> 
> 
> 
>   get_signal from do_work_pending+0xf4/0x398
> 
> 
> 
>   do_work_pending from slow_work_pending+0xc/0x24
> 
> 
> 
> Exception stack(0xc9991fb0 to 0xc9991ff8)
> 
> 
> 
> 1fa0:                                     0054dd48 00000000 005331f8
> 00000001
> 
> 
> 1fc0: 00533270 00000002 00000000 404c8000 beb54170 0051b448 43ab8000
> 0051b424
> 
> 
> 1fe0: b6cc1d98 beb54040 b6c224fc b68357dc 20000010 ffffffff
> 
> 
> 
> ---[ end trace 0000000000000000 ]---
> 
> 
> 
> 8<--- cut here ---
> 
> 
> 
> Unable to handle kernel NULL pointer dereference at virtual address
> 00000004
> 
> 
> [00000004] *pgd=00000000
> 
> 
> 
> Internal error: Oops: 805 [#1] ARM
> 
> 
> 
> Modules linked in:
> 
> 
> 
> CPU: 0 PID: 201 Comm: basic Tainted: G        W
> 6.1.55-linux4microchip-2023.10+ #65
> 
> 
> Hardware name: Microchip SAM9X60
> 
> 
> 
> PC is at drm_fb_release+0xc0/0x11c
> 
> 
> 
> LR is at drm_fb_release+0x100/0x11c
> 
> 
> 
> pc : [<c0329b10>]    lr : [<c0329b50>]    psr: 80000013
> 
> 
> 
> sp : c9991e48  ip : 00000000  fp : 0051b424
> 
> 
> 
> r10: c174a5f0  r9 : 00000000  r8 : c2188074
> 
> 
> 
> r7 : 60000013  r6 : c9991e5c  r5 : ffffff8c  r4 : c2188048
> 
> 
> 
> r3 : c1590994  r2 : c2188048  r1 : 00000000  r0 : c1590920
> 
> 
> 
> Flags: Nzcv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment none
> 
> 
> 
> Control: 0005317f  Table: 2384c000  DAC: 00000051
> 
> 
> 
> Register r0 information: slab kmalloc-192 start c1590920 pointer offset
> 0 size 192
> 
> 
> Register r1 information: NULL pointer
> 
> 
> 
> Register r2 information: slab kmalloc-192 start c2188000 pointer offset
> 72 size 192
> 
> 
> Register r3 information: slab kmalloc-192 start c1590920 pointer offset
> 116 size 192
> 
> 
> Register r4 information: slab kmalloc-192 start c2188000 pointer offset
> 72 size 192
> 
> 
> Register r5 information: non-paged memory
> 
> 
> 
> Register r6 information: 2-page vmalloc region starting at 0xc9990000
> allocated at kernel_clone+0xb4/0x204
> 
> 
> Register r7 information: non-paged memory
> 
> 
> 
> Register r8 information: slab kmalloc-192 start c2188000 pointer offset
> 116 size 192
> 
> 
> Register r9 information: NULL pointer
> 
> 
> 
> Register r10 information: slab task_struct start c174a140 pointer offset
> 1200
> 
> 
> Register r11 information: non-paged memory
> 
> 
> 
> Register r12 information: NULL pointer
> 
> 
> 
> Process basic (pid: 201, stack limit = 0x41541c7b)
> 
> 
> 
> Stack: (0xc9991e48 to 0xc9992000)
> 1e40:                   00000001 00000000 00000000 00000000 00000000
> c9991e5c
> 
> 
> 1e60: c9991e5c 5c44a9dd c2188000 c1621800 c2188060 c03146e4 000000c9
> 0000e200
> 
> 
> 1e80: 00000001 00000000 00000000 c1621800 c38f2cc0 c0fb63e0 c0c1ce70
> c0f99560
> 
> 
> 1ea0: 00000000 c0314a5c c38f2cc0 c10ee6e8 000a201f c00c66a0 c38f2cc0
> 00000001
> 
> 
> 1ec0: c09f1a4c c00c67fc c38f2e00 c174a140 c0a20b1c c2232a50 c9991ef4
> c002f000
> 
> 
> 1ee0: c174a140 c2232a20 c174a140 c001a094 00000002 00000008 c383c880
> 5c44a9dd
> 
> 
> 1f00: 00000002 c383c880 0051b424 c001a6a8 00000009 c0024af0 beb53c50
> 00000000
> 
> 
> 1f20: 00000000 5c44a9dd 00000000 00000000 c9991fb0 c174a140 00000000
> 00000000
> 
> 
> 1f40: 00000000 00000000 0051b424 c000c244 00000000 00000000 00000000
> 00000000
> 
> 
> 1f60: 00000000 00000000 00000009 00000000 00000000 00000000 00000000
> 00000000
> 
> 
> 1f80: 00000000 00000000 00000000 5c44a9dd b68357dc 20000010 ffffffff
> 0005317f
> 
> 
> 1fa0: 00000000 c174a140 43ab8000 c00084dc 0054dd48 00000000 005331f8
> 00000001
> 
> 
> 1fc0: 00533270 00000002 00000000 404c8000 beb54170 0051b448 43ab8000
> 0051b424
> 
> 
> 1fe0: b6cc1d98 beb54040 b6c224fc b68357dc 20000010 ffffffff 00000000
> 00000000
> 
> 
>   drm_fb_release from drm_file_free+0xcc/0x1bc
> 
> 
> 
>   drm_file_free from drm_release+0x44/0x94
> 
> 
> 
>   drm_release from __fput+0xf0/0x20c
> 
> 
> 
>   __fput from task_work_run+0x8c/0xb0
> 
> 
> 
>   task_work_run from do_exit+0x310/0x760
> 
> 
> 
>   do_exit from sys_exit_group+0x0/0x14
> 
> 
> 
>   sys_exit_group from get_signal+0x524/0x64c
> 
> 
> 
>   get_signal from do_work_pending+0xf4/0x398
> 
> 
> 
>   do_work_pending from slow_work_pending+0xc/0x24
> 
> 
> 
> Exception stack(0xc9991fb0 to 0xc9991ff8)
> 
> 
> 
> 1fa0:                                     0054dd48 00000000 005331f8
> 00000001
> 
> 
> 1fc0: 00533270 00000002 00000000 404c8000 beb54170 0051b448 43ab8000
> 0051b424
> 
> 
> 1fe0: b6cc1d98 beb54040 b6c224fc b68357dc 20000010 ffffffff
> 
> 
> 
> Code: e590c018 e5902078 e5901074 e35c0001 (e5812004)
> 
> 
> 
> ---[ end trace 00000000c0a20288 ]---
> 
> 
> 
> Fixing recursive fault but reboot is needed!

We've tried to reproduce your crash with 6.1.22-linux4microchip-2023.04,
to no avail. We'll try to upgrade to 6.1.55-linux4microchip-2023.10 (your
version) and test again.

Is there a particular EGT program you recommend to trigger the crash? Or
any should do? We've limited ourselves to a few of the samples provided
with the library, namely egt_basic, egt_water, egt_charts, and egt_i18n.

> The memory leak had been introduced with this commit:
> 
> 
> commit eec44d44a3d2d00c8f663f13555d3dd280b1ea3f
> Author: Daniel Vetter <daniel.vetter@ffwll.ch>
> Date:   Thu Jan 21 16:29:54 2021 +0100
> 
> 
> 
>      drm/atmel: Use drm_atomic_helper_commit
> 
>      One of these drivers that predates the nonblocking support in helpers,
>      and hand-rolled its own thing. Entirely not anything specific here, we
>      can just delete it all and replace it with the helper version.
> 
>      Could also perhaps use the drm_mode_config_helper_suspend/resume
>      stuff, for another few lines deleted. But I'm not looking at that
>      stuff, I'm just going through all the atomic commit functions and make
>      sure they have properly annotated dma-fence critical sections
>      everywhere.
> 
> 
> I think this move to the drm atomic helper should have gone with an
> update on the release side as well. There is probably something wrong
> with the atomic_destroy_state callbacks provided by the atmel-hlcdc driver.
> 
> Regards,
> Ludovic
> 
> >
> > Admittedly, our usage of the DRM uAPI is rather simple: create 2 dumb
> > buffers, do an initial MODE_SETCRTC, and then MODE_PAGE_FLIP between the
> > two dumb buffers at the rate of once per second on average. We haven't
> > evaluated more complex workloads.
> >
> >   drivers/gpu/drm/atmel-hlcdc/atmel_hlcdc_plane.c | 3 +--
> >   1 file changed, 1 insertion(+), 2 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/atmel-hlcdc/atmel_hlcdc_plane.c b/drivers/gpu/drm/atmel-hlcdc/atmel_hlcdc_plane.c
> > index daa508504f47..390c4fc62af7 100644
> > --- a/drivers/gpu/drm/atmel-hlcdc/atmel_hlcdc_plane.c
> > +++ b/drivers/gpu/drm/atmel-hlcdc/atmel_hlcdc_plane.c
> > @@ -934,8 +934,7 @@ static void atmel_hlcdc_plane_atomic_destroy_state(struct drm_plane *p,
> >                                state->dscrs[i]->self);
> >          }
> >
> > -       if (s->fb)
> > -               drm_framebuffer_put(s->fb);
> > +       __drm_atomic_helper_plane_destroy_state(s);
> >
> >          kfree(state);
> >   }
> >
> > base-commit: 9dfc46c87cdc8f5a42a71de247a744a6b8188980
> > --
> > 2.34.1
> >

Pierre-Louis
Pierre-Louis Dourneau March 15, 2024, 12:25 p.m. UTC | #3
On 3/12/24, Pierre-Louis Dourneau <pl.dourneau@klervi.com> wrote:
> On 3/8/24, Ludovic.Desroches@microchip.com <Ludovic.Desroches@microchip.com> wrote:
> > This patch fixes the memory leak but introduces a crash on my side when
> > exiting a graphics app using the Microchip graphics library.
> 
> We've tried to reproduce your crash with 6.1.22-linux4microchip-2023.04,
> to no avail. We'll try to upgrade to 6.1.55-linux4microchip-2023.10 (your
> version) and test again.

I was able to test a few more recent kernel versions[0] with the patch
applied. None yielded any crash, be it running Microchip's EGT samples[1]
or libdrm's modetest. Although, what I did manage to reproduce was a
refcount underflow similar to the one you had:

  # modetest -M atmel-hlcdc -s 32:#0 -P 33@47:800x400@XR24 -a
  setting mode 1024x600-65.48Hz on connectors 32, crtc 47
  testing 800x400@XR24 on plane 33, crtc 47
  [   75.736699] ------------[ cut here ]------------
  [   75.741353] WARNING: CPU: 0 PID: 200 at lib/refcount.c:28 refcount_warn_saturate+0xf0/0x15c
  [   75.749731] refcount_t: underflow; use-after-free.
  [   75.754574] CPU: 0 PID: 200 Comm: modetest Not tainted 6.1.55-linux4microchip-2023.10 #4
  [   75.762915] Hardware name: Microchip SAM9X60
  [   75.767198]  unwind_backtrace from show_stack+0x10/0x18
  [   75.772423]  show_stack from dump_stack_lvl+0x28/0x34
  [   75.777479]  dump_stack_lvl from __warn+0x8c/0xc0
  [   75.782187]  __warn from warn_slowpath_fmt+0x74/0xa8
  [   75.787158]  warn_slowpath_fmt from refcount_warn_saturate+0xf0/0x15c
  [   75.793611]  refcount_warn_saturate from __drm_atomic_helper_plane_destroy_state+0xd0/0xd4
  [   75.801894]  __drm_atomic_helper_plane_destroy_state from atmel_hlcdc_plane_atomic_destroy_state+0x38/0x48
  [   75.811573]  atmel_hlcdc_plane_atomic_destroy_state from drm_atomic_state_default_clear+0x1c4/0x2fc
  [   75.820642]  drm_atomic_state_default_clear from __drm_atomic_state_free+0x7c/0xb0
  [   75.828228]  __drm_atomic_state_free from drm_mode_atomic_ioctl+0x868/0xb88
  [   75.835204]  drm_mode_atomic_ioctl from drm_ioctl+0x200/0x3c4
  [   75.840960]  drm_ioctl from sys_ioctl+0x240/0xb48
  [   75.845669]  sys_ioctl from ret_fast_syscall+0x0/0x44
  [   75.850725] Exception stack(0xc8c91fa8 to 0xc8c91ff0)
  [   75.855794] 1fa0:                   004b0ec0 00000003 00000003 c03864bc becd7bc8 becd7b98
  [   75.863992] 1fc0: 004b0ec0 00000003 becd7bc8 00000036 00000003 00000003 b6f22f80 00000400
  [   75.872183] 1fe0: b6f21e74 becd7a68 b6f07494 b6f61cc8
  [   75.877289] ---[ end trace 0000000000000000 ]---
  Atomic Commit failed [1]

Same but without using the atomic uAPI (`-a` option removed):

  # modetest -M atmel-hlcdc -s 32:#0 -P 33@47:800x400@XR24
  setting mode 1024x600-65.48Hz on connectors 32, crtc 47
  testing 800x400@XR24 overlay plane 33
  failed to enable plane: Invalid argument

  [   98.542958] ------------[ cut here ]------------
  [   98.547547] WARNING: CPU: 0 PID: 28 at lib/refcount.c:28 refcount_warn_saturate+0xf0/0x15c
  [   98.555902] refcount_t: underflow; use-after-free.
  [   98.560698] CPU: 0 PID: 28 Comm: kworker/0:7 Tainted: G        W          6.1.55-linux4microchip-2023.10 #6
  [   98.570695] Hardware name: Microchip SAM9X60
  [   98.574972] Workqueue: events drm_mode_rmfb_work_fn
  [   98.579859] unwind_backtrace from show_stack+0x10/0x18
  [   98.587615] show_stack from dump_stack_lvl+0x28/0x34
  [   98.595201] dump_stack_lvl from __warn+0x8c/0xc0
  [   98.602438] __warn from warn_slowpath_fmt+0x74/0xa8
  [   98.609937] warn_slowpath_fmt from refcount_warn_saturate+0xf0/0x15c
  [   98.618919] refcount_warn_saturate from __drm_atomic_helper_plane_destroy_state+0xd0/0xd4
  [   98.629740] __drm_atomic_helper_plane_destroy_state from atmel_hlcdc_plane_atomic_destroy_state+0x38/0x48
  [   98.641947] atmel_hlcdc_plane_atomic_destroy_state from drm_atomic_state_default_clear+0x1c4/0x2fc
  [   98.653545] drm_atomic_state_default_clear from __drm_atomic_state_free+0x7c/0xb0
  [   98.663660] __drm_atomic_state_free from drm_framebuffer_remove+0x48c/0x540
  [   98.673252] drm_framebuffer_remove from drm_mode_rmfb_work_fn+0x68/0x84
  [   98.682495] drm_mode_rmfb_work_fn from process_one_work+0x1b4/0x3f4
  [   98.691390] process_one_work from worker_thread+0x214/0x4e8
  [   98.699587] worker_thread from kthread+0xb4/0xd8
  [   98.706824] kthread from ret_from_fork+0x14/0x28
  [   98.714060] Exception stack(0xc88adfb0 to 0xc88adff8)
  [   98.719125] dfa0:                                     00000000 00000000 00000000 00000000
  [   98.727327] dfc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
  [   98.735520] dfe0: 00000000 00000000 00000000 00000000 00000013 00000000
  [   98.742219] ---[ end trace 0000000000000000 ]---

The first one is not deterministic, you have to try a few times to trigger
it. The second one is a hit every time.

Same commands on a kernel without the patch don't report any underflow.
Note the commit in the first command also fails on a kernel without the
patch, which I guess is expected as plane 33 is the primary plane and I'm
trying set dimensions that do not match the size of the display. The commit
succeeds when invoking with the correct dimensions, but then I can't make
it produce an underflow. Same with the second command.

It seems to only trigger once per boot. Running the commands again does not
yield another underflow.

Looking at the disassembly, this is an underflow of the drm_crtc_commit's
refcount this time. In the warning you had, it was on a framebuffer object.

Anyway, I'll go back to the drawing board, study more closely the resource
release part of the driver. Thanks for having brought up the issues with
the patch.

[0]: Namely, linux4microchip-2023.10 (6.1.55), v6.8, and drm-next-2023-03-13
[1]: https://github.com/linux4sam/egt/tree/master/examples

Pierre-Louis
Ludovic Desroches March 21, 2024, 6:45 a.m. UTC | #4
On 3/15/24 13:25, Pierre-Louis Dourneau wrote:
> [You don't often get email from pl.dourneau@klervi.com. Learn why this is important at https://aka.ms/LearnAboutSenderIdentification ]
> 
> EXTERNAL EMAIL: Do not click links or open attachments unless you know the content is safe
> 
> On 3/12/24, Pierre-Louis Dourneau <pl.dourneau@klervi.com> wrote:
>> On 3/8/24, Ludovic.Desroches@microchip.com <Ludovic.Desroches@microchip.com> wrote:
>>> This patch fixes the memory leak but introduces a crash on my side when
>>> exiting a graphics app using the Microchip graphics library.
>>
>> We've tried to reproduce your crash with 6.1.22-linux4microchip-2023.04,
>> to no avail. We'll try to upgrade to 6.1.55-linux4microchip-2023.10 (your
>> version) and test again.
> 
> I was able to test a few more recent kernel versions[0] with the patch
> applied. None yielded any crash, be it running Microchip's EGT samples[1]
> or libdrm's modetest. Although, what I did manage to reproduce was a
> refcount underflow similar to the one you had:
> 
>    # modetest -M atmel-hlcdc -s 32:#0 -P 33@47:800x400@XR24 -a
>    setting mode 1024x600-65.48Hz on connectors 32, crtc 47
>    testing 800x400@XR24 on plane 33, crtc 47
>    [   75.736699] ------------[ cut here ]------------
>    [   75.741353] WARNING: CPU: 0 PID: 200 at lib/refcount.c:28 refcount_warn_saturate+0xf0/0x15c
>    [   75.749731] refcount_t: underflow; use-after-free.
>    [   75.754574] CPU: 0 PID: 200 Comm: modetest Not tainted 6.1.55-linux4microchip-2023.10 #4
>    [   75.762915] Hardware name: Microchip SAM9X60
>    [   75.767198]  unwind_backtrace from show_stack+0x10/0x18
>    [   75.772423]  show_stack from dump_stack_lvl+0x28/0x34
>    [   75.777479]  dump_stack_lvl from __warn+0x8c/0xc0
>    [   75.782187]  __warn from warn_slowpath_fmt+0x74/0xa8
>    [   75.787158]  warn_slowpath_fmt from refcount_warn_saturate+0xf0/0x15c
>    [   75.793611]  refcount_warn_saturate from __drm_atomic_helper_plane_destroy_state+0xd0/0xd4
>    [   75.801894]  __drm_atomic_helper_plane_destroy_state from atmel_hlcdc_plane_atomic_destroy_state+0x38/0x48
>    [   75.811573]  atmel_hlcdc_plane_atomic_destroy_state from drm_atomic_state_default_clear+0x1c4/0x2fc
>    [   75.820642]  drm_atomic_state_default_clear from __drm_atomic_state_free+0x7c/0xb0
>    [   75.828228]  __drm_atomic_state_free from drm_mode_atomic_ioctl+0x868/0xb88
>    [   75.835204]  drm_mode_atomic_ioctl from drm_ioctl+0x200/0x3c4
>    [   75.840960]  drm_ioctl from sys_ioctl+0x240/0xb48
>    [   75.845669]  sys_ioctl from ret_fast_syscall+0x0/0x44
>    [   75.850725] Exception stack(0xc8c91fa8 to 0xc8c91ff0)
>    [   75.855794] 1fa0:                   004b0ec0 00000003 00000003 c03864bc becd7bc8 becd7b98
>    [   75.863992] 1fc0: 004b0ec0 00000003 becd7bc8 00000036 00000003 00000003 b6f22f80 00000400
>    [   75.872183] 1fe0: b6f21e74 becd7a68 b6f07494 b6f61cc8
>    [   75.877289] ---[ end trace 0000000000000000 ]---
>    Atomic Commit failed [1]
> 
> Same but without using the atomic uAPI (`-a` option removed):
> 
>    # modetest -M atmel-hlcdc -s 32:#0 -P 33@47:800x400@XR24
>    setting mode 1024x600-65.48Hz on connectors 32, crtc 47
>    testing 800x400@XR24 overlay plane 33
>    failed to enable plane: Invalid argument
> 
>    [   98.542958] ------------[ cut here ]------------
>    [   98.547547] WARNING: CPU: 0 PID: 28 at lib/refcount.c:28 refcount_warn_saturate+0xf0/0x15c
>    [   98.555902] refcount_t: underflow; use-after-free.
>    [   98.560698] CPU: 0 PID: 28 Comm: kworker/0:7 Tainted: G        W          6.1.55-linux4microchip-2023.10 #6
>    [   98.570695] Hardware name: Microchip SAM9X60
>    [   98.574972] Workqueue: events drm_mode_rmfb_work_fn
>    [   98.579859] unwind_backtrace from show_stack+0x10/0x18
>    [   98.587615] show_stack from dump_stack_lvl+0x28/0x34
>    [   98.595201] dump_stack_lvl from __warn+0x8c/0xc0
>    [   98.602438] __warn from warn_slowpath_fmt+0x74/0xa8
>    [   98.609937] warn_slowpath_fmt from refcount_warn_saturate+0xf0/0x15c
>    [   98.618919] refcount_warn_saturate from __drm_atomic_helper_plane_destroy_state+0xd0/0xd4
>    [   98.629740] __drm_atomic_helper_plane_destroy_state from atmel_hlcdc_plane_atomic_destroy_state+0x38/0x48
>    [   98.641947] atmel_hlcdc_plane_atomic_destroy_state from drm_atomic_state_default_clear+0x1c4/0x2fc
>    [   98.653545] drm_atomic_state_default_clear from __drm_atomic_state_free+0x7c/0xb0
>    [   98.663660] __drm_atomic_state_free from drm_framebuffer_remove+0x48c/0x540
>    [   98.673252] drm_framebuffer_remove from drm_mode_rmfb_work_fn+0x68/0x84
>    [   98.682495] drm_mode_rmfb_work_fn from process_one_work+0x1b4/0x3f4
>    [   98.691390] process_one_work from worker_thread+0x214/0x4e8
>    [   98.699587] worker_thread from kthread+0xb4/0xd8
>    [   98.706824] kthread from ret_from_fork+0x14/0x28
>    [   98.714060] Exception stack(0xc88adfb0 to 0xc88adff8)
>    [   98.719125] dfa0:                                     00000000 00000000 00000000 00000000
>    [   98.727327] dfc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
>    [   98.735520] dfe0: 00000000 00000000 00000000 00000000 00000013 00000000
>    [   98.742219] ---[ end trace 0000000000000000 ]---
> 
> The first one is not deterministic, you have to try a few times to trigger
> it. The second one is a hit every time.
> 
> Same commands on a kernel without the patch don't report any underflow.
> Note the commit in the first command also fails on a kernel without the
> patch, which I guess is expected as plane 33 is the primary plane and I'm
> trying set dimensions that do not match the size of the display. The commit
> succeeds when invoking with the correct dimensions, but then I can't make
> it produce an underflow. Same with the second command.
> 
> It seems to only trigger once per boot. Running the commands again does not
> yield another underflow.
> 
> Looking at the disassembly, this is an underflow of the drm_crtc_commit's
> refcount this time. In the warning you had, it was on a framebuffer object.
> 
> Anyway, I'll go back to the drawing board, study more closely the resource
> release part of the driver. Thanks for having brought up the issues with
> the patch.

Hello,

Sorry for the late reply, I was off previous weeks. Late me know if you 
need more information or testing on my side.

 From the top of my head, the crash was systematic on my side and I 
think I get it with any EGT application.

I use an EGT app to check if the memory leak is still here. It 
continuously change the text of a button causing many calls of 
drm_atomic_helper_setup_commit. After a while, we can see that the 
virtual memory is constant while the system memory decreases. At least, 
your patch sounds to fix this memory leak.

Regards,
Ludovic

> 
> [0]: Namely, linux4microchip-2023.10 (6.1.55), v6.8, and drm-next-2023-03-13
> [1]: https://github.com/linux4sam/egt/tree/master/examples
> 
> Pierre-Louis
diff mbox series

Patch

diff --git a/drivers/gpu/drm/atmel-hlcdc/atmel_hlcdc_plane.c b/drivers/gpu/drm/atmel-hlcdc/atmel_hlcdc_plane.c
index daa508504f47..390c4fc62af7 100644
--- a/drivers/gpu/drm/atmel-hlcdc/atmel_hlcdc_plane.c
+++ b/drivers/gpu/drm/atmel-hlcdc/atmel_hlcdc_plane.c
@@ -934,8 +934,7 @@  static void atmel_hlcdc_plane_atomic_destroy_state(struct drm_plane *p,
 			      state->dscrs[i]->self);
 	}
 
-	if (s->fb)
-		drm_framebuffer_put(s->fb);
+	__drm_atomic_helper_plane_destroy_state(s);
 
 	kfree(state);
 }