diff mbox series

[RFC] drm/bridge: panel: Use devm_drm_bridge_add()

Message ID 20241009052402.411978-1-fshao@chromium.org (mailing list archive)
State New, archived
Headers show
Series [RFC] drm/bridge: panel: Use devm_drm_bridge_add() | expand

Commit Message

Fei Shao Oct. 9, 2024, 5:23 a.m. UTC
In the mtk_dsi driver, its DSI host attach callback calls
devm_drm_of_get_bridge() to get the next bridge. If that next bridge is
a panel bridge, a panel_bridge object is allocated and managed by the
panel device.

Later, if the attach callback fails with -EPROBE_DEFER from subsequent
component_add(), the panel device invoking the callback at probe time
also fails, and all device-managed resources are freed accordingly.

This exposes a drm_bridge bridge_list corruption due to the unbalanced
lifecycle between the DSI host and the panel devices: the panel_bridge
object managed by panel device is freed, while drm_bridge_remove() is
bound to DSI host device and never gets called.
The next drm_bridge_add() will trigger UAF against the freed bridge list
object and result in kernel panic.

This bug is observed on a MediaTek MT8188-based Chromebook with MIPI DSI
outputting to a DSI panel (DT is WIP for upstream).

As a fix, using devm_drm_bridge_add() with the panel device in the panel
path seems reasonable. This also implies a chain of potential cleanup
actions:

1. Removing drm_bridge_remove() means devm_drm_panel_bridge_release()
   becomes hollow and can be removed.

2. devm_drm_panel_bridge_add_typed() is almost emptied except for the
   `bridge->pre_enable_prev_first` line. Itself can be also removed if
   we move the line into drm_panel_bridge_add_typed(). (maybe?)

3. drm_panel_bridge_add_typed() now calls all the needed devm_* calls,
   so it's essentially the new devm_drm_panel_bridge_add_typed().

4. drmm_panel_bridge_add() needs to be updated accordingly since it
   calls drm_panel_bridge_add_typed(). But now there's only one bridge
   object to be freed, and it's already being managed by panel device.
   I wonder if we still need both drmm_ and devm_ version in this case.
   (maybe yes from DRM PoV, I don't know much about the context)

This is a RFC patch since I'm not sure if my understanding is correct
(for both the fix and the cleanup). It fixes the issue I encountered,
but I don't expect it to be picked up directly due to the redundant
commit message and the dangling devm_drm_panel_bridge_release().
I plan to resend the official patch(es) once I know what I supposed to
do next.

For reference, here's the KASAN report from the device:
==================================================================
 BUG: KASAN: slab-use-after-free in drm_bridge_add+0x98/0x230
 Read of size 8 at addr ffffff80c4e9e100 by task kworker/u32:1/69

 CPU: 1 UID: 0 PID: 69 Comm: kworker/u32:1 Not tainted 6.12.0-rc1-next-20241004-kasan-00030-g062135fa4046 #1
 Hardware name: Google Ciri sku0/unprovisioned board (DT)
 Workqueue: events_unbound deferred_probe_work_func
 Call trace:
  dump_backtrace+0xfc/0x140
  show_stack+0x24/0x38
  dump_stack_lvl+0x40/0xc8
  print_report+0x140/0x700
  kasan_report+0xcc/0x130
  __asan_report_load8_noabort+0x20/0x30
  drm_bridge_add+0x98/0x230
  devm_drm_panel_bridge_add_typed+0x174/0x298
  devm_drm_of_get_bridge+0xe8/0x190
  mtk_dsi_host_attach+0x130/0x2b0
  mipi_dsi_attach+0x8c/0xe8
  hx83102_probe+0x1a8/0x368
  mipi_dsi_drv_probe+0x6c/0x88
  really_probe+0x1c4/0x698
  __driver_probe_device+0x160/0x298
  driver_probe_device+0x7c/0x2a8
  __device_attach_driver+0x2a0/0x398
  bus_for_each_drv+0x198/0x200
  __device_attach+0x1c0/0x308
  device_initial_probe+0x20/0x38
  bus_probe_device+0x11c/0x1f8
  deferred_probe_work_func+0x80/0x250
  worker_thread+0x9b4/0x2780
  kthread+0x274/0x350
  ret_from_fork+0x10/0x20

 Allocated by task 69:
  kasan_save_track+0x40/0x78
  kasan_save_alloc_info+0x44/0x58
  __kasan_kmalloc+0x84/0xa0
  __kmalloc_node_track_caller_noprof+0x228/0x450
  devm_kmalloc+0x6c/0x288
  devm_drm_panel_bridge_add_typed+0xa0/0x298
  devm_drm_of_get_bridge+0xe8/0x190
  mtk_dsi_host_attach+0x130/0x2b0
  mipi_dsi_attach+0x8c/0xe8
  hx83102_probe+0x1a8/0x368
  mipi_dsi_drv_probe+0x6c/0x88
  really_probe+0x1c4/0x698
  __driver_probe_device+0x160/0x298
  driver_probe_device+0x7c/0x2a8
  __device_attach_driver+0x2a0/0x398
  bus_for_each_drv+0x198/0x200
  __device_attach+0x1c0/0x308
  device_initial_probe+0x20/0x38
  bus_probe_device+0x11c/0x1f8
  deferred_probe_work_func+0x80/0x250
  worker_thread+0x9b4/0x2780
  kthread+0x274/0x350
  ret_from_fork+0x10/0x20

 Freed by task 69:
  kasan_save_track+0x40/0x78
  kasan_save_free_info+0x58/0x78
  __kasan_slab_free+0x48/0x68
  kfree+0xd4/0x750
  devres_release_all+0x144/0x1e8
  really_probe+0x48c/0x698
  __driver_probe_device+0x160/0x298
  driver_probe_device+0x7c/0x2a8
  __device_attach_driver+0x2a0/0x398
  bus_for_each_drv+0x198/0x200
  __device_attach+0x1c0/0x308
  device_initial_probe+0x20/0x38
  bus_probe_device+0x11c/0x1f8
  deferred_probe_work_func+0x80/0x250
  worker_thread+0x9b4/0x2780
  kthread+0x274/0x350
  ret_from_fork+0x10/0x20

 The buggy address belongs to the object at ffffff80c4e9e000
  which belongs to the cache kmalloc-4k of size 4096
 The buggy address is located 256 bytes inside of
  freed 4096-byte region [ffffff80c4e9e000, ffffff80c4e9f000)

 The buggy address belongs to the physical page:
 head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
 flags: 0x8000000000000040(head|zone=2)
 page_type: f5(slab)
 page: refcount:1 mapcount:0 mapping:0000000000000000
 index:0x0 pfn:0x104e98
 raw: 8000000000000040 ffffff80c0003040 dead000000000122 0000000000000000
 raw: 0000000000000000 0000000000040004 00000001f5000000 0000000000000000
 head: 8000000000000040 ffffff80c0003040 dead000000000122 0000000000000000
 head: 0000000000000000 0000000000040004 00000001f5000000 0000000000000000
 head: 8000000000000003 fffffffec313a601 ffffffffffffffff 0000000000000000
 head: 0000000000000008 0000000000000000 00000000ffffffff 0000000000000000
 page dumped because: kasan: bad access detected

 Memory state around the buggy address:
  ffffff80c4e9e000: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
  ffffff80c4e9e080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 >ffffff80c4e9e100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                    ^
  ffffff80c4e9e180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
  ffffff80c4e9e200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
===================================================================

Signed-off-by: Fei Shao <fshao@chromium.org>
---

 drivers/gpu/drm/bridge/panel.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

Comments

Chen-Yu Tsai Oct. 15, 2024, 9:23 a.m. UTC | #1
On Wed, Oct 9, 2024 at 1:24 PM Fei Shao <fshao@chromium.org> wrote:
>
> In the mtk_dsi driver, its DSI host attach callback calls
> devm_drm_of_get_bridge() to get the next bridge. If that next bridge is
> a panel bridge, a panel_bridge object is allocated and managed by the
> panel device.
>
> Later, if the attach callback fails with -EPROBE_DEFER from subsequent
> component_add(), the panel device invoking the callback at probe time
> also fails, and all device-managed resources are freed accordingly.
>
> This exposes a drm_bridge bridge_list corruption due to the unbalanced
> lifecycle between the DSI host and the panel devices: the panel_bridge
> object managed by panel device is freed, while drm_bridge_remove() is
> bound to DSI host device and never gets called.
> The next drm_bridge_add() will trigger UAF against the freed bridge list
> object and result in kernel panic.

I think this comes back to Laurent's comment [1] when devm_drm_of_get_bridge()
was first introduced. Maybe it's best to let the panel core always register
a bridge.

https://lore.kernel.org/dri-devel/YUvKcTv2hSrUqIvF@pendragon.ideasonboard.com/

> This bug is observed on a MediaTek MT8188-based Chromebook with MIPI DSI
> outputting to a DSI panel (DT is WIP for upstream).
>
> As a fix, using devm_drm_bridge_add() with the panel device in the panel
> path seems reasonable. This also implies a chain of potential cleanup
> actions:
>
> 1. Removing drm_bridge_remove() means devm_drm_panel_bridge_release()
>    becomes hollow and can be removed.
>
> 2. devm_drm_panel_bridge_add_typed() is almost emptied except for the
>    `bridge->pre_enable_prev_first` line. Itself can be also removed if
>    we move the line into drm_panel_bridge_add_typed(). (maybe?)
>
> 3. drm_panel_bridge_add_typed() now calls all the needed devm_* calls,
>    so it's essentially the new devm_drm_panel_bridge_add_typed().
>
> 4. drmm_panel_bridge_add() needs to be updated accordingly since it
>    calls drm_panel_bridge_add_typed(). But now there's only one bridge
>    object to be freed, and it's already being managed by panel device.
>    I wonder if we still need both drmm_ and devm_ version in this case.
>    (maybe yes from DRM PoV, I don't know much about the context)
>
> This is a RFC patch since I'm not sure if my understanding is correct
> (for both the fix and the cleanup). It fixes the issue I encountered,
> but I don't expect it to be picked up directly due to the redundant
> commit message and the dangling devm_drm_panel_bridge_release().
> I plan to resend the official patch(es) once I know what I supposed to
> do next.
>
> For reference, here's the KASAN report from the device:
> ==================================================================
>  BUG: KASAN: slab-use-after-free in drm_bridge_add+0x98/0x230
>  Read of size 8 at addr ffffff80c4e9e100 by task kworker/u32:1/69
>
>  CPU: 1 UID: 0 PID: 69 Comm: kworker/u32:1 Not tainted 6.12.0-rc1-next-20241004-kasan-00030-g062135fa4046 #1
>  Hardware name: Google Ciri sku0/unprovisioned board (DT)
>  Workqueue: events_unbound deferred_probe_work_func
>  Call trace:
>   dump_backtrace+0xfc/0x140
>   show_stack+0x24/0x38
>   dump_stack_lvl+0x40/0xc8
>   print_report+0x140/0x700
>   kasan_report+0xcc/0x130
>   __asan_report_load8_noabort+0x20/0x30
>   drm_bridge_add+0x98/0x230
>   devm_drm_panel_bridge_add_typed+0x174/0x298
>   devm_drm_of_get_bridge+0xe8/0x190
>   mtk_dsi_host_attach+0x130/0x2b0
>   mipi_dsi_attach+0x8c/0xe8
>   hx83102_probe+0x1a8/0x368
>   mipi_dsi_drv_probe+0x6c/0x88
>   really_probe+0x1c4/0x698
>   __driver_probe_device+0x160/0x298
>   driver_probe_device+0x7c/0x2a8
>   __device_attach_driver+0x2a0/0x398
>   bus_for_each_drv+0x198/0x200
>   __device_attach+0x1c0/0x308
>   device_initial_probe+0x20/0x38
>   bus_probe_device+0x11c/0x1f8
>   deferred_probe_work_func+0x80/0x250
>   worker_thread+0x9b4/0x2780
>   kthread+0x274/0x350
>   ret_from_fork+0x10/0x20
>
>  Allocated by task 69:
>   kasan_save_track+0x40/0x78
>   kasan_save_alloc_info+0x44/0x58
>   __kasan_kmalloc+0x84/0xa0
>   __kmalloc_node_track_caller_noprof+0x228/0x450
>   devm_kmalloc+0x6c/0x288
>   devm_drm_panel_bridge_add_typed+0xa0/0x298
>   devm_drm_of_get_bridge+0xe8/0x190
>   mtk_dsi_host_attach+0x130/0x2b0
>   mipi_dsi_attach+0x8c/0xe8
>   hx83102_probe+0x1a8/0x368
>   mipi_dsi_drv_probe+0x6c/0x88
>   really_probe+0x1c4/0x698
>   __driver_probe_device+0x160/0x298
>   driver_probe_device+0x7c/0x2a8
>   __device_attach_driver+0x2a0/0x398
>   bus_for_each_drv+0x198/0x200
>   __device_attach+0x1c0/0x308
>   device_initial_probe+0x20/0x38
>   bus_probe_device+0x11c/0x1f8
>   deferred_probe_work_func+0x80/0x250
>   worker_thread+0x9b4/0x2780
>   kthread+0x274/0x350
>   ret_from_fork+0x10/0x20
>
>  Freed by task 69:
>   kasan_save_track+0x40/0x78
>   kasan_save_free_info+0x58/0x78
>   __kasan_slab_free+0x48/0x68
>   kfree+0xd4/0x750
>   devres_release_all+0x144/0x1e8
>   really_probe+0x48c/0x698
>   __driver_probe_device+0x160/0x298
>   driver_probe_device+0x7c/0x2a8
>   __device_attach_driver+0x2a0/0x398
>   bus_for_each_drv+0x198/0x200
>   __device_attach+0x1c0/0x308
>   device_initial_probe+0x20/0x38
>   bus_probe_device+0x11c/0x1f8
>   deferred_probe_work_func+0x80/0x250
>   worker_thread+0x9b4/0x2780
>   kthread+0x274/0x350
>   ret_from_fork+0x10/0x20
>
>  The buggy address belongs to the object at ffffff80c4e9e000
>   which belongs to the cache kmalloc-4k of size 4096
>  The buggy address is located 256 bytes inside of
>   freed 4096-byte region [ffffff80c4e9e000, ffffff80c4e9f000)
>
>  The buggy address belongs to the physical page:
>  head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
>  flags: 0x8000000000000040(head|zone=2)
>  page_type: f5(slab)
>  page: refcount:1 mapcount:0 mapping:0000000000000000
>  index:0x0 pfn:0x104e98
>  raw: 8000000000000040 ffffff80c0003040 dead000000000122 0000000000000000
>  raw: 0000000000000000 0000000000040004 00000001f5000000 0000000000000000
>  head: 8000000000000040 ffffff80c0003040 dead000000000122 0000000000000000
>  head: 0000000000000000 0000000000040004 00000001f5000000 0000000000000000
>  head: 8000000000000003 fffffffec313a601 ffffffffffffffff 0000000000000000
>  head: 0000000000000008 0000000000000000 00000000ffffffff 0000000000000000
>  page dumped because: kasan: bad access detected
>
>  Memory state around the buggy address:
>   ffffff80c4e9e000: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>   ffffff80c4e9e080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>  >ffffff80c4e9e100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>                     ^
>   ffffff80c4e9e180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>   ffffff80c4e9e200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> ===================================================================
>
> Signed-off-by: Fei Shao <fshao@chromium.org>
> ---
>
>  drivers/gpu/drm/bridge/panel.c | 5 +----
>  1 file changed, 1 insertion(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/bridge/panel.c b/drivers/gpu/drm/bridge/panel.c
> index 6e88339dec0f..352723c59c70 100644
> --- a/drivers/gpu/drm/bridge/panel.c
> +++ b/drivers/gpu/drm/bridge/panel.c
> @@ -303,7 +303,7 @@ struct drm_bridge *drm_panel_bridge_add_typed(struct drm_panel *panel,
>         panel_bridge->bridge.ops = DRM_BRIDGE_OP_MODES;
>         panel_bridge->bridge.type = connector_type;
>
> -       drm_bridge_add(&panel_bridge->bridge);
> +       devm_drm_bridge_add(panel->dev, &panel_bridge->bridge);
>
>         return &panel_bridge->bridge;
>  }
> @@ -327,7 +327,6 @@ void drm_panel_bridge_remove(struct drm_bridge *bridge)
>
>         panel_bridge = drm_bridge_to_panel_bridge(bridge);
>
> -       drm_bridge_remove(bridge);

I believe this is incorrect, because the bridge object is still freed
on the next line. If one calls drm_panel_bridge_remove() explicitly,
or somehow the interface device is unbound before the panel device
(which can happen for RGB or LVDS panels where the panel is not a
sub-device of the interface), it will still blow up. Also, some
drivers choose to manage the lifetime themselves with
drm_panel_bridge_add_typed() and drm_panel_bridge_remove(). A new
devm_drm_bridge_remove() function needs to be added and used here.


ChenYu


>         devm_kfree(panel_bridge->panel->dev, bridge);
>  }
>  EXPORT_SYMBOL(drm_panel_bridge_remove);
> @@ -359,8 +358,6 @@ static void devm_drm_panel_bridge_release(struct device *dev, void *res)
>
>         if (!bridge)
>                 return;
> -
> -       drm_bridge_remove(bridge);
>  }
>
>  /**
> --
> 2.47.0.rc1.288.g06298d1525-goog
>
Maxime Ripard Oct. 24, 2024, 12:35 p.m. UTC | #2
On Wed, Oct 09, 2024 at 01:23:31PM +0800, Fei Shao wrote:
> In the mtk_dsi driver, its DSI host attach callback calls
> devm_drm_of_get_bridge() to get the next bridge. If that next bridge is
> a panel bridge, a panel_bridge object is allocated and managed by the
> panel device.
> 
> Later, if the attach callback fails with -EPROBE_DEFER from subsequent
> component_add(), the panel device invoking the callback at probe time
> also fails, and all device-managed resources are freed accordingly.
> 
> This exposes a drm_bridge bridge_list corruption due to the unbalanced
> lifecycle between the DSI host and the panel devices: the panel_bridge
> object managed by panel device is freed, while drm_bridge_remove() is
> bound to DSI host device and never gets called.
> The next drm_bridge_add() will trigger UAF against the freed bridge list
> object and result in kernel panic.
> 
> This bug is observed on a MediaTek MT8188-based Chromebook with MIPI DSI
> outputting to a DSI panel (DT is WIP for upstream).
> 
> As a fix, using devm_drm_bridge_add() with the panel device in the panel
> path seems reasonable. This also implies a chain of potential cleanup
> actions:
> 
> 1. Removing drm_bridge_remove() means devm_drm_panel_bridge_release()
>    becomes hollow and can be removed.
> 
> 2. devm_drm_panel_bridge_add_typed() is almost emptied except for the
>    `bridge->pre_enable_prev_first` line. Itself can be also removed if
>    we move the line into drm_panel_bridge_add_typed(). (maybe?)
> 
> 3. drm_panel_bridge_add_typed() now calls all the needed devm_* calls,
>    so it's essentially the new devm_drm_panel_bridge_add_typed().
> 
> 4. drmm_panel_bridge_add() needs to be updated accordingly since it
>    calls drm_panel_bridge_add_typed(). But now there's only one bridge
>    object to be freed, and it's already being managed by panel device.
>    I wonder if we still need both drmm_ and devm_ version in this case.
>    (maybe yes from DRM PoV, I don't know much about the context)
> 
> This is a RFC patch since I'm not sure if my understanding is correct
> (for both the fix and the cleanup). It fixes the issue I encountered,
> but I don't expect it to be picked up directly due to the redundant
> commit message and the dangling devm_drm_panel_bridge_release().
> I plan to resend the official patch(es) once I know what I supposed to
> do next.
> 
> For reference, here's the KASAN report from the device:
> ==================================================================
>  BUG: KASAN: slab-use-after-free in drm_bridge_add+0x98/0x230
>  Read of size 8 at addr ffffff80c4e9e100 by task kworker/u32:1/69
> 
>  CPU: 1 UID: 0 PID: 69 Comm: kworker/u32:1 Not tainted 6.12.0-rc1-next-20241004-kasan-00030-g062135fa4046 #1
>  Hardware name: Google Ciri sku0/unprovisioned board (DT)
>  Workqueue: events_unbound deferred_probe_work_func
>  Call trace:
>   dump_backtrace+0xfc/0x140
>   show_stack+0x24/0x38
>   dump_stack_lvl+0x40/0xc8
>   print_report+0x140/0x700
>   kasan_report+0xcc/0x130
>   __asan_report_load8_noabort+0x20/0x30
>   drm_bridge_add+0x98/0x230
>   devm_drm_panel_bridge_add_typed+0x174/0x298
>   devm_drm_of_get_bridge+0xe8/0x190
>   mtk_dsi_host_attach+0x130/0x2b0
>   mipi_dsi_attach+0x8c/0xe8
>   hx83102_probe+0x1a8/0x368
>   mipi_dsi_drv_probe+0x6c/0x88
>   really_probe+0x1c4/0x698
>   __driver_probe_device+0x160/0x298
>   driver_probe_device+0x7c/0x2a8
>   __device_attach_driver+0x2a0/0x398
>   bus_for_each_drv+0x198/0x200
>   __device_attach+0x1c0/0x308
>   device_initial_probe+0x20/0x38
>   bus_probe_device+0x11c/0x1f8
>   deferred_probe_work_func+0x80/0x250
>   worker_thread+0x9b4/0x2780
>   kthread+0x274/0x350
>   ret_from_fork+0x10/0x20
> 
>  Allocated by task 69:
>   kasan_save_track+0x40/0x78
>   kasan_save_alloc_info+0x44/0x58
>   __kasan_kmalloc+0x84/0xa0
>   __kmalloc_node_track_caller_noprof+0x228/0x450
>   devm_kmalloc+0x6c/0x288
>   devm_drm_panel_bridge_add_typed+0xa0/0x298
>   devm_drm_of_get_bridge+0xe8/0x190
>   mtk_dsi_host_attach+0x130/0x2b0
>   mipi_dsi_attach+0x8c/0xe8
>   hx83102_probe+0x1a8/0x368
>   mipi_dsi_drv_probe+0x6c/0x88
>   really_probe+0x1c4/0x698
>   __driver_probe_device+0x160/0x298
>   driver_probe_device+0x7c/0x2a8
>   __device_attach_driver+0x2a0/0x398
>   bus_for_each_drv+0x198/0x200
>   __device_attach+0x1c0/0x308
>   device_initial_probe+0x20/0x38
>   bus_probe_device+0x11c/0x1f8
>   deferred_probe_work_func+0x80/0x250
>   worker_thread+0x9b4/0x2780
>   kthread+0x274/0x350
>   ret_from_fork+0x10/0x20
> 
>  Freed by task 69:
>   kasan_save_track+0x40/0x78
>   kasan_save_free_info+0x58/0x78
>   __kasan_slab_free+0x48/0x68
>   kfree+0xd4/0x750
>   devres_release_all+0x144/0x1e8
>   really_probe+0x48c/0x698
>   __driver_probe_device+0x160/0x298
>   driver_probe_device+0x7c/0x2a8
>   __device_attach_driver+0x2a0/0x398
>   bus_for_each_drv+0x198/0x200
>   __device_attach+0x1c0/0x308
>   device_initial_probe+0x20/0x38
>   bus_probe_device+0x11c/0x1f8
>   deferred_probe_work_func+0x80/0x250
>   worker_thread+0x9b4/0x2780
>   kthread+0x274/0x350
>   ret_from_fork+0x10/0x20
> 
>  The buggy address belongs to the object at ffffff80c4e9e000
>   which belongs to the cache kmalloc-4k of size 4096
>  The buggy address is located 256 bytes inside of
>   freed 4096-byte region [ffffff80c4e9e000, ffffff80c4e9f000)
> 
>  The buggy address belongs to the physical page:
>  head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
>  flags: 0x8000000000000040(head|zone=2)
>  page_type: f5(slab)
>  page: refcount:1 mapcount:0 mapping:0000000000000000
>  index:0x0 pfn:0x104e98
>  raw: 8000000000000040 ffffff80c0003040 dead000000000122 0000000000000000
>  raw: 0000000000000000 0000000000040004 00000001f5000000 0000000000000000
>  head: 8000000000000040 ffffff80c0003040 dead000000000122 0000000000000000
>  head: 0000000000000000 0000000000040004 00000001f5000000 0000000000000000
>  head: 8000000000000003 fffffffec313a601 ffffffffffffffff 0000000000000000
>  head: 0000000000000008 0000000000000000 00000000ffffffff 0000000000000000
>  page dumped because: kasan: bad access detected
> 
>  Memory state around the buggy address:
>   ffffff80c4e9e000: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>   ffffff80c4e9e080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>  >ffffff80c4e9e100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>                     ^
>   ffffff80c4e9e180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>   ffffff80c4e9e200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> ===================================================================
> 
> Signed-off-by: Fei Shao <fshao@chromium.org>

I was looking at the driver to try to follow your (awesome btw, thanks)
commit log, and it does have a quite different structure compared to
what we recommend.

Would following
https://docs.kernel.org/gpu/drm-kms-helpers.html#special-care-with-mipi-dsi-bridges
help?

Maxime
Fei Shao Oct. 29, 2024, 2:53 p.m. UTC | #3
On Thu, Oct 24, 2024 at 8:36 PM Maxime Ripard <mripard@kernel.org> wrote:
>
> On Wed, Oct 09, 2024 at 01:23:31PM +0800, Fei Shao wrote:
> > In the mtk_dsi driver, its DSI host attach callback calls
> > devm_drm_of_get_bridge() to get the next bridge. If that next bridge is
> > a panel bridge, a panel_bridge object is allocated and managed by the
> > panel device.
> >
> > Later, if the attach callback fails with -EPROBE_DEFER from subsequent
> > component_add(), the panel device invoking the callback at probe time
> > also fails, and all device-managed resources are freed accordingly.
> >
> > This exposes a drm_bridge bridge_list corruption due to the unbalanced
> > lifecycle between the DSI host and the panel devices: the panel_bridge
> > object managed by panel device is freed, while drm_bridge_remove() is
> > bound to DSI host device and never gets called.
> > The next drm_bridge_add() will trigger UAF against the freed bridge list
> > object and result in kernel panic.
> >
> > This bug is observed on a MediaTek MT8188-based Chromebook with MIPI DSI
> > outputting to a DSI panel (DT is WIP for upstream).
> >
> > As a fix, using devm_drm_bridge_add() with the panel device in the panel
> > path seems reasonable. This also implies a chain of potential cleanup
> > actions:
> >
> > 1. Removing drm_bridge_remove() means devm_drm_panel_bridge_release()
> >    becomes hollow and can be removed.
> >
> > 2. devm_drm_panel_bridge_add_typed() is almost emptied except for the
> >    `bridge->pre_enable_prev_first` line. Itself can be also removed if
> >    we move the line into drm_panel_bridge_add_typed(). (maybe?)
> >
> > 3. drm_panel_bridge_add_typed() now calls all the needed devm_* calls,
> >    so it's essentially the new devm_drm_panel_bridge_add_typed().
> >
> > 4. drmm_panel_bridge_add() needs to be updated accordingly since it
> >    calls drm_panel_bridge_add_typed(). But now there's only one bridge
> >    object to be freed, and it's already being managed by panel device.
> >    I wonder if we still need both drmm_ and devm_ version in this case.
> >    (maybe yes from DRM PoV, I don't know much about the context)
> >
> > This is a RFC patch since I'm not sure if my understanding is correct
> > (for both the fix and the cleanup). It fixes the issue I encountered,
> > but I don't expect it to be picked up directly due to the redundant
> > commit message and the dangling devm_drm_panel_bridge_release().
> > I plan to resend the official patch(es) once I know what I supposed to
> > do next.
> >
> > For reference, here's the KASAN report from the device:
> > ==================================================================
> >  BUG: KASAN: slab-use-after-free in drm_bridge_add+0x98/0x230
> >  Read of size 8 at addr ffffff80c4e9e100 by task kworker/u32:1/69
> >
> >  CPU: 1 UID: 0 PID: 69 Comm: kworker/u32:1 Not tainted 6.12.0-rc1-next-20241004-kasan-00030-g062135fa4046 #1
> >  Hardware name: Google Ciri sku0/unprovisioned board (DT)
> >  Workqueue: events_unbound deferred_probe_work_func
> >  Call trace:
> >   dump_backtrace+0xfc/0x140
> >   show_stack+0x24/0x38
> >   dump_stack_lvl+0x40/0xc8
> >   print_report+0x140/0x700
> >   kasan_report+0xcc/0x130
> >   __asan_report_load8_noabort+0x20/0x30
> >   drm_bridge_add+0x98/0x230
> >   devm_drm_panel_bridge_add_typed+0x174/0x298
> >   devm_drm_of_get_bridge+0xe8/0x190
> >   mtk_dsi_host_attach+0x130/0x2b0
> >   mipi_dsi_attach+0x8c/0xe8
> >   hx83102_probe+0x1a8/0x368
> >   mipi_dsi_drv_probe+0x6c/0x88
> >   really_probe+0x1c4/0x698
> >   __driver_probe_device+0x160/0x298
> >   driver_probe_device+0x7c/0x2a8
> >   __device_attach_driver+0x2a0/0x398
> >   bus_for_each_drv+0x198/0x200
> >   __device_attach+0x1c0/0x308
> >   device_initial_probe+0x20/0x38
> >   bus_probe_device+0x11c/0x1f8
> >   deferred_probe_work_func+0x80/0x250
> >   worker_thread+0x9b4/0x2780
> >   kthread+0x274/0x350
> >   ret_from_fork+0x10/0x20
> >
> >  Allocated by task 69:
> >   kasan_save_track+0x40/0x78
> >   kasan_save_alloc_info+0x44/0x58
> >   __kasan_kmalloc+0x84/0xa0
> >   __kmalloc_node_track_caller_noprof+0x228/0x450
> >   devm_kmalloc+0x6c/0x288
> >   devm_drm_panel_bridge_add_typed+0xa0/0x298
> >   devm_drm_of_get_bridge+0xe8/0x190
> >   mtk_dsi_host_attach+0x130/0x2b0
> >   mipi_dsi_attach+0x8c/0xe8
> >   hx83102_probe+0x1a8/0x368
> >   mipi_dsi_drv_probe+0x6c/0x88
> >   really_probe+0x1c4/0x698
> >   __driver_probe_device+0x160/0x298
> >   driver_probe_device+0x7c/0x2a8
> >   __device_attach_driver+0x2a0/0x398
> >   bus_for_each_drv+0x198/0x200
> >   __device_attach+0x1c0/0x308
> >   device_initial_probe+0x20/0x38
> >   bus_probe_device+0x11c/0x1f8
> >   deferred_probe_work_func+0x80/0x250
> >   worker_thread+0x9b4/0x2780
> >   kthread+0x274/0x350
> >   ret_from_fork+0x10/0x20
> >
> >  Freed by task 69:
> >   kasan_save_track+0x40/0x78
> >   kasan_save_free_info+0x58/0x78
> >   __kasan_slab_free+0x48/0x68
> >   kfree+0xd4/0x750
> >   devres_release_all+0x144/0x1e8
> >   really_probe+0x48c/0x698
> >   __driver_probe_device+0x160/0x298
> >   driver_probe_device+0x7c/0x2a8
> >   __device_attach_driver+0x2a0/0x398
> >   bus_for_each_drv+0x198/0x200
> >   __device_attach+0x1c0/0x308
> >   device_initial_probe+0x20/0x38
> >   bus_probe_device+0x11c/0x1f8
> >   deferred_probe_work_func+0x80/0x250
> >   worker_thread+0x9b4/0x2780
> >   kthread+0x274/0x350
> >   ret_from_fork+0x10/0x20
> >
> >  The buggy address belongs to the object at ffffff80c4e9e000
> >   which belongs to the cache kmalloc-4k of size 4096
> >  The buggy address is located 256 bytes inside of
> >   freed 4096-byte region [ffffff80c4e9e000, ffffff80c4e9f000)
> >
> >  The buggy address belongs to the physical page:
> >  head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
> >  flags: 0x8000000000000040(head|zone=2)
> >  page_type: f5(slab)
> >  page: refcount:1 mapcount:0 mapping:0000000000000000
> >  index:0x0 pfn:0x104e98
> >  raw: 8000000000000040 ffffff80c0003040 dead000000000122 0000000000000000
> >  raw: 0000000000000000 0000000000040004 00000001f5000000 0000000000000000
> >  head: 8000000000000040 ffffff80c0003040 dead000000000122 0000000000000000
> >  head: 0000000000000000 0000000000040004 00000001f5000000 0000000000000000
> >  head: 8000000000000003 fffffffec313a601 ffffffffffffffff 0000000000000000
> >  head: 0000000000000008 0000000000000000 00000000ffffffff 0000000000000000
> >  page dumped because: kasan: bad access detected
> >
> >  Memory state around the buggy address:
> >   ffffff80c4e9e000: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> >   ffffff80c4e9e080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> >  >ffffff80c4e9e100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> >                     ^
> >   ffffff80c4e9e180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> >   ffffff80c4e9e200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> > ===================================================================
> >
> > Signed-off-by: Fei Shao <fshao@chromium.org>
>
> I was looking at the driver to try to follow your (awesome btw, thanks)
> commit log, and it does have a quite different structure compared to
> what we recommend.
>
> Would following
> https://docs.kernel.org/gpu/drm-kms-helpers.html#special-care-with-mipi-dsi-bridges
> help?

Hi Maxime,

Thank you for the pointer.
I read the suggested pattern in the doc and compared it with the
drivers. If I understand correctly, both the MIPI-DSI host and panel
drivers follow the instructions:

1. The MIPI-DSI host driver must run mipi_dsi_host_register() in its probe hook.
   >> drm/mediatek/mtk_dsi.c runs mipi_dsi_host_register() in the probe hook.
2. In its probe hook, the bridge driver must try to find its MIPI-DSI
host, register as a MIPI-DSI device and attach the MIPI-DSI device to
its host.
   >> drm/panel/panel-himax-hx83102.c follows and runs
mipi_dsi_attach() at the end of probe hook.
3. In its struct mipi_dsi_host_ops.attach hook, the MIPI-DSI host can
now add its component.
   >> drm/mediatek/mtk_dsi.c calls component_add() in the attach callback.

Could you elaborate on the "different structures" you mentioned?

To clarify my point: the issue is that component_add() may return
-EPROBE_DEFER if the component (e.g. DSI encoder) is not ready,
causing the panel bridge to be removed. However, drm_bridge_remove()
is bound to MIPI-DSI host instead of panel bridge, which owns the
actual list_head object.

This might be reproducible with other MIPI-DSI host + panel
combinations by forcibly returning -EPROBE_DEFER in the host attach
hook (verification with another device is needed), so the fix may be
required in drm/bridge/panel.c.

And to Chen-Yu: Thanks for the suggestion. I'll incorporate that into
v2 pending confirmation that it is the correct fix.

Regards,
Fei

>
> Maxime
Maxime Ripard Nov. 14, 2024, 1:12 p.m. UTC | #4
On Tue, Oct 29, 2024 at 10:53:49PM +0800, Fei Shao wrote:
> On Thu, Oct 24, 2024 at 8:36 PM Maxime Ripard <mripard@kernel.org> wrote:
> >
> > On Wed, Oct 09, 2024 at 01:23:31PM +0800, Fei Shao wrote:
> > > In the mtk_dsi driver, its DSI host attach callback calls
> > > devm_drm_of_get_bridge() to get the next bridge. If that next bridge is
> > > a panel bridge, a panel_bridge object is allocated and managed by the
> > > panel device.
> > >
> > > Later, if the attach callback fails with -EPROBE_DEFER from subsequent
> > > component_add(), the panel device invoking the callback at probe time
> > > also fails, and all device-managed resources are freed accordingly.
> > >
> > > This exposes a drm_bridge bridge_list corruption due to the unbalanced
> > > lifecycle between the DSI host and the panel devices: the panel_bridge
> > > object managed by panel device is freed, while drm_bridge_remove() is
> > > bound to DSI host device and never gets called.
> > > The next drm_bridge_add() will trigger UAF against the freed bridge list
> > > object and result in kernel panic.
> > >
> > > This bug is observed on a MediaTek MT8188-based Chromebook with MIPI DSI
> > > outputting to a DSI panel (DT is WIP for upstream).
> > >
> > > As a fix, using devm_drm_bridge_add() with the panel device in the panel
> > > path seems reasonable. This also implies a chain of potential cleanup
> > > actions:
> > >
> > > 1. Removing drm_bridge_remove() means devm_drm_panel_bridge_release()
> > >    becomes hollow and can be removed.
> > >
> > > 2. devm_drm_panel_bridge_add_typed() is almost emptied except for the
> > >    `bridge->pre_enable_prev_first` line. Itself can be also removed if
> > >    we move the line into drm_panel_bridge_add_typed(). (maybe?)
> > >
> > > 3. drm_panel_bridge_add_typed() now calls all the needed devm_* calls,
> > >    so it's essentially the new devm_drm_panel_bridge_add_typed().
> > >
> > > 4. drmm_panel_bridge_add() needs to be updated accordingly since it
> > >    calls drm_panel_bridge_add_typed(). But now there's only one bridge
> > >    object to be freed, and it's already being managed by panel device.
> > >    I wonder if we still need both drmm_ and devm_ version in this case.
> > >    (maybe yes from DRM PoV, I don't know much about the context)
> > >
> > > This is a RFC patch since I'm not sure if my understanding is correct
> > > (for both the fix and the cleanup). It fixes the issue I encountered,
> > > but I don't expect it to be picked up directly due to the redundant
> > > commit message and the dangling devm_drm_panel_bridge_release().
> > > I plan to resend the official patch(es) once I know what I supposed to
> > > do next.
> > >
> > > For reference, here's the KASAN report from the device:
> > > ==================================================================
> > >  BUG: KASAN: slab-use-after-free in drm_bridge_add+0x98/0x230
> > >  Read of size 8 at addr ffffff80c4e9e100 by task kworker/u32:1/69
> > >
> > >  CPU: 1 UID: 0 PID: 69 Comm: kworker/u32:1 Not tainted 6.12.0-rc1-next-20241004-kasan-00030-g062135fa4046 #1
> > >  Hardware name: Google Ciri sku0/unprovisioned board (DT)
> > >  Workqueue: events_unbound deferred_probe_work_func
> > >  Call trace:
> > >   dump_backtrace+0xfc/0x140
> > >   show_stack+0x24/0x38
> > >   dump_stack_lvl+0x40/0xc8
> > >   print_report+0x140/0x700
> > >   kasan_report+0xcc/0x130
> > >   __asan_report_load8_noabort+0x20/0x30
> > >   drm_bridge_add+0x98/0x230
> > >   devm_drm_panel_bridge_add_typed+0x174/0x298
> > >   devm_drm_of_get_bridge+0xe8/0x190
> > >   mtk_dsi_host_attach+0x130/0x2b0
> > >   mipi_dsi_attach+0x8c/0xe8
> > >   hx83102_probe+0x1a8/0x368
> > >   mipi_dsi_drv_probe+0x6c/0x88
> > >   really_probe+0x1c4/0x698
> > >   __driver_probe_device+0x160/0x298
> > >   driver_probe_device+0x7c/0x2a8
> > >   __device_attach_driver+0x2a0/0x398
> > >   bus_for_each_drv+0x198/0x200
> > >   __device_attach+0x1c0/0x308
> > >   device_initial_probe+0x20/0x38
> > >   bus_probe_device+0x11c/0x1f8
> > >   deferred_probe_work_func+0x80/0x250
> > >   worker_thread+0x9b4/0x2780
> > >   kthread+0x274/0x350
> > >   ret_from_fork+0x10/0x20
> > >
> > >  Allocated by task 69:
> > >   kasan_save_track+0x40/0x78
> > >   kasan_save_alloc_info+0x44/0x58
> > >   __kasan_kmalloc+0x84/0xa0
> > >   __kmalloc_node_track_caller_noprof+0x228/0x450
> > >   devm_kmalloc+0x6c/0x288
> > >   devm_drm_panel_bridge_add_typed+0xa0/0x298
> > >   devm_drm_of_get_bridge+0xe8/0x190
> > >   mtk_dsi_host_attach+0x130/0x2b0
> > >   mipi_dsi_attach+0x8c/0xe8
> > >   hx83102_probe+0x1a8/0x368
> > >   mipi_dsi_drv_probe+0x6c/0x88
> > >   really_probe+0x1c4/0x698
> > >   __driver_probe_device+0x160/0x298
> > >   driver_probe_device+0x7c/0x2a8
> > >   __device_attach_driver+0x2a0/0x398
> > >   bus_for_each_drv+0x198/0x200
> > >   __device_attach+0x1c0/0x308
> > >   device_initial_probe+0x20/0x38
> > >   bus_probe_device+0x11c/0x1f8
> > >   deferred_probe_work_func+0x80/0x250
> > >   worker_thread+0x9b4/0x2780
> > >   kthread+0x274/0x350
> > >   ret_from_fork+0x10/0x20
> > >
> > >  Freed by task 69:
> > >   kasan_save_track+0x40/0x78
> > >   kasan_save_free_info+0x58/0x78
> > >   __kasan_slab_free+0x48/0x68
> > >   kfree+0xd4/0x750
> > >   devres_release_all+0x144/0x1e8
> > >   really_probe+0x48c/0x698
> > >   __driver_probe_device+0x160/0x298
> > >   driver_probe_device+0x7c/0x2a8
> > >   __device_attach_driver+0x2a0/0x398
> > >   bus_for_each_drv+0x198/0x200
> > >   __device_attach+0x1c0/0x308
> > >   device_initial_probe+0x20/0x38
> > >   bus_probe_device+0x11c/0x1f8
> > >   deferred_probe_work_func+0x80/0x250
> > >   worker_thread+0x9b4/0x2780
> > >   kthread+0x274/0x350
> > >   ret_from_fork+0x10/0x20
> > >
> > >  The buggy address belongs to the object at ffffff80c4e9e000
> > >   which belongs to the cache kmalloc-4k of size 4096
> > >  The buggy address is located 256 bytes inside of
> > >   freed 4096-byte region [ffffff80c4e9e000, ffffff80c4e9f000)
> > >
> > >  The buggy address belongs to the physical page:
> > >  head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
> > >  flags: 0x8000000000000040(head|zone=2)
> > >  page_type: f5(slab)
> > >  page: refcount:1 mapcount:0 mapping:0000000000000000
> > >  index:0x0 pfn:0x104e98
> > >  raw: 8000000000000040 ffffff80c0003040 dead000000000122 0000000000000000
> > >  raw: 0000000000000000 0000000000040004 00000001f5000000 0000000000000000
> > >  head: 8000000000000040 ffffff80c0003040 dead000000000122 0000000000000000
> > >  head: 0000000000000000 0000000000040004 00000001f5000000 0000000000000000
> > >  head: 8000000000000003 fffffffec313a601 ffffffffffffffff 0000000000000000
> > >  head: 0000000000000008 0000000000000000 00000000ffffffff 0000000000000000
> > >  page dumped because: kasan: bad access detected
> > >
> > >  Memory state around the buggy address:
> > >   ffffff80c4e9e000: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> > >   ffffff80c4e9e080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> > >  >ffffff80c4e9e100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> > >                     ^
> > >   ffffff80c4e9e180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> > >   ffffff80c4e9e200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> > > ===================================================================
> > >
> > > Signed-off-by: Fei Shao <fshao@chromium.org>
> >
> > I was looking at the driver to try to follow your (awesome btw, thanks)
> > commit log, and it does have a quite different structure compared to
> > what we recommend.
> >
> > Would following
> > https://docs.kernel.org/gpu/drm-kms-helpers.html#special-care-with-mipi-dsi-bridges
> > help?
> 
> Hi Maxime,
> 
> Thank you for the pointer.
> I read the suggested pattern in the doc and compared it with the
> drivers. If I understand correctly, both the MIPI-DSI host and panel
> drivers follow the instructions:
> 
> 1. The MIPI-DSI host driver must run mipi_dsi_host_register() in its probe hook.
>    >> drm/mediatek/mtk_dsi.c runs mipi_dsi_host_register() in the probe hook.
> 2. In its probe hook, the bridge driver must try to find its MIPI-DSI
> host, register as a MIPI-DSI device and attach the MIPI-DSI device to
> its host.
>    >> drm/panel/panel-himax-hx83102.c follows and runs
> mipi_dsi_attach() at the end of probe hook.
> 3. In its struct mipi_dsi_host_ops.attach hook, the MIPI-DSI host can
> now add its component.
>    >> drm/mediatek/mtk_dsi.c calls component_add() in the attach callback.
> 
> Could you elaborate on the "different structures" you mentioned?

Yeah, you're right, sorry.

> To clarify my point: the issue is that component_add() may return
> -EPROBE_DEFER if the component (e.g. DSI encoder) is not ready,
> causing the panel bridge to be removed. However, drm_bridge_remove()
> is bound to MIPI-DSI host instead of panel bridge, which owns the
> actual list_head object.
> 
> This might be reproducible with other MIPI-DSI host + panel
> combinations by forcibly returning -EPROBE_DEFER in the host attach
> hook (verification with another device is needed), so the fix may be
> required in drm/bridge/panel.c.

Yeah, I think you're just hitting another bridge lifetime issue, and
it's not the only one unfortunately. Tying the bridge structure lifetime
itself to the device is wrong, it should be tied to the DRM device
lifetime instead.

But then, the discussion becomes that bridges typically probe outside of
the "main" DRM device probe path, so you don't have access to the DRM
device structure until attach at best.

That's why I'm a bit skeptical about your patch. It might workaround
your issue, but it doesn't actually solve the problem. I guess the best
way about it would be to convert bridges to reference counting, with the
device taking a reference at probe time when it allocates the structure
(and giving it back at remove time), and the DRM device taking one when
it's attached and one when it's detached.

It's much more involved than just another helper though :/

Maxime
Laurent Pinchart Nov. 14, 2024, 3:21 p.m. UTC | #5
On Thu, Nov 14, 2024 at 02:12:01PM +0100, Maxime Ripard wrote:
> On Tue, Oct 29, 2024 at 10:53:49PM +0800, Fei Shao wrote:
> > On Thu, Oct 24, 2024 at 8:36 PM Maxime Ripard <mripard@kernel.org> wrote:
> > >
> > > On Wed, Oct 09, 2024 at 01:23:31PM +0800, Fei Shao wrote:
> > > > In the mtk_dsi driver, its DSI host attach callback calls
> > > > devm_drm_of_get_bridge() to get the next bridge. If that next bridge is
> > > > a panel bridge, a panel_bridge object is allocated and managed by the
> > > > panel device.
> > > >
> > > > Later, if the attach callback fails with -EPROBE_DEFER from subsequent
> > > > component_add(), the panel device invoking the callback at probe time
> > > > also fails, and all device-managed resources are freed accordingly.
> > > >
> > > > This exposes a drm_bridge bridge_list corruption due to the unbalanced
> > > > lifecycle between the DSI host and the panel devices: the panel_bridge
> > > > object managed by panel device is freed, while drm_bridge_remove() is
> > > > bound to DSI host device and never gets called.
> > > > The next drm_bridge_add() will trigger UAF against the freed bridge list
> > > > object and result in kernel panic.
> > > >
> > > > This bug is observed on a MediaTek MT8188-based Chromebook with MIPI DSI
> > > > outputting to a DSI panel (DT is WIP for upstream).
> > > >
> > > > As a fix, using devm_drm_bridge_add() with the panel device in the panel
> > > > path seems reasonable. This also implies a chain of potential cleanup
> > > > actions:
> > > >
> > > > 1. Removing drm_bridge_remove() means devm_drm_panel_bridge_release()
> > > >    becomes hollow and can be removed.
> > > >
> > > > 2. devm_drm_panel_bridge_add_typed() is almost emptied except for the
> > > >    `bridge->pre_enable_prev_first` line. Itself can be also removed if
> > > >    we move the line into drm_panel_bridge_add_typed(). (maybe?)
> > > >
> > > > 3. drm_panel_bridge_add_typed() now calls all the needed devm_* calls,
> > > >    so it's essentially the new devm_drm_panel_bridge_add_typed().
> > > >
> > > > 4. drmm_panel_bridge_add() needs to be updated accordingly since it
> > > >    calls drm_panel_bridge_add_typed(). But now there's only one bridge
> > > >    object to be freed, and it's already being managed by panel device.
> > > >    I wonder if we still need both drmm_ and devm_ version in this case.
> > > >    (maybe yes from DRM PoV, I don't know much about the context)
> > > >
> > > > This is a RFC patch since I'm not sure if my understanding is correct
> > > > (for both the fix and the cleanup). It fixes the issue I encountered,
> > > > but I don't expect it to be picked up directly due to the redundant
> > > > commit message and the dangling devm_drm_panel_bridge_release().
> > > > I plan to resend the official patch(es) once I know what I supposed to
> > > > do next.
> > > >
> > > > For reference, here's the KASAN report from the device:
> > > > ==================================================================
> > > >  BUG: KASAN: slab-use-after-free in drm_bridge_add+0x98/0x230
> > > >  Read of size 8 at addr ffffff80c4e9e100 by task kworker/u32:1/69
> > > >
> > > >  CPU: 1 UID: 0 PID: 69 Comm: kworker/u32:1 Not tainted 6.12.0-rc1-next-20241004-kasan-00030-g062135fa4046 #1
> > > >  Hardware name: Google Ciri sku0/unprovisioned board (DT)
> > > >  Workqueue: events_unbound deferred_probe_work_func
> > > >  Call trace:
> > > >   dump_backtrace+0xfc/0x140
> > > >   show_stack+0x24/0x38
> > > >   dump_stack_lvl+0x40/0xc8
> > > >   print_report+0x140/0x700
> > > >   kasan_report+0xcc/0x130
> > > >   __asan_report_load8_noabort+0x20/0x30
> > > >   drm_bridge_add+0x98/0x230
> > > >   devm_drm_panel_bridge_add_typed+0x174/0x298
> > > >   devm_drm_of_get_bridge+0xe8/0x190
> > > >   mtk_dsi_host_attach+0x130/0x2b0
> > > >   mipi_dsi_attach+0x8c/0xe8
> > > >   hx83102_probe+0x1a8/0x368
> > > >   mipi_dsi_drv_probe+0x6c/0x88
> > > >   really_probe+0x1c4/0x698
> > > >   __driver_probe_device+0x160/0x298
> > > >   driver_probe_device+0x7c/0x2a8
> > > >   __device_attach_driver+0x2a0/0x398
> > > >   bus_for_each_drv+0x198/0x200
> > > >   __device_attach+0x1c0/0x308
> > > >   device_initial_probe+0x20/0x38
> > > >   bus_probe_device+0x11c/0x1f8
> > > >   deferred_probe_work_func+0x80/0x250
> > > >   worker_thread+0x9b4/0x2780
> > > >   kthread+0x274/0x350
> > > >   ret_from_fork+0x10/0x20
> > > >
> > > >  Allocated by task 69:
> > > >   kasan_save_track+0x40/0x78
> > > >   kasan_save_alloc_info+0x44/0x58
> > > >   __kasan_kmalloc+0x84/0xa0
> > > >   __kmalloc_node_track_caller_noprof+0x228/0x450
> > > >   devm_kmalloc+0x6c/0x288
> > > >   devm_drm_panel_bridge_add_typed+0xa0/0x298
> > > >   devm_drm_of_get_bridge+0xe8/0x190
> > > >   mtk_dsi_host_attach+0x130/0x2b0
> > > >   mipi_dsi_attach+0x8c/0xe8
> > > >   hx83102_probe+0x1a8/0x368
> > > >   mipi_dsi_drv_probe+0x6c/0x88
> > > >   really_probe+0x1c4/0x698
> > > >   __driver_probe_device+0x160/0x298
> > > >   driver_probe_device+0x7c/0x2a8
> > > >   __device_attach_driver+0x2a0/0x398
> > > >   bus_for_each_drv+0x198/0x200
> > > >   __device_attach+0x1c0/0x308
> > > >   device_initial_probe+0x20/0x38
> > > >   bus_probe_device+0x11c/0x1f8
> > > >   deferred_probe_work_func+0x80/0x250
> > > >   worker_thread+0x9b4/0x2780
> > > >   kthread+0x274/0x350
> > > >   ret_from_fork+0x10/0x20
> > > >
> > > >  Freed by task 69:
> > > >   kasan_save_track+0x40/0x78
> > > >   kasan_save_free_info+0x58/0x78
> > > >   __kasan_slab_free+0x48/0x68
> > > >   kfree+0xd4/0x750
> > > >   devres_release_all+0x144/0x1e8
> > > >   really_probe+0x48c/0x698
> > > >   __driver_probe_device+0x160/0x298
> > > >   driver_probe_device+0x7c/0x2a8
> > > >   __device_attach_driver+0x2a0/0x398
> > > >   bus_for_each_drv+0x198/0x200
> > > >   __device_attach+0x1c0/0x308
> > > >   device_initial_probe+0x20/0x38
> > > >   bus_probe_device+0x11c/0x1f8
> > > >   deferred_probe_work_func+0x80/0x250
> > > >   worker_thread+0x9b4/0x2780
> > > >   kthread+0x274/0x350
> > > >   ret_from_fork+0x10/0x20
> > > >
> > > >  The buggy address belongs to the object at ffffff80c4e9e000
> > > >   which belongs to the cache kmalloc-4k of size 4096
> > > >  The buggy address is located 256 bytes inside of
> > > >   freed 4096-byte region [ffffff80c4e9e000, ffffff80c4e9f000)
> > > >
> > > >  The buggy address belongs to the physical page:
> > > >  head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
> > > >  flags: 0x8000000000000040(head|zone=2)
> > > >  page_type: f5(slab)
> > > >  page: refcount:1 mapcount:0 mapping:0000000000000000
> > > >  index:0x0 pfn:0x104e98
> > > >  raw: 8000000000000040 ffffff80c0003040 dead000000000122 0000000000000000
> > > >  raw: 0000000000000000 0000000000040004 00000001f5000000 0000000000000000
> > > >  head: 8000000000000040 ffffff80c0003040 dead000000000122 0000000000000000
> > > >  head: 0000000000000000 0000000000040004 00000001f5000000 0000000000000000
> > > >  head: 8000000000000003 fffffffec313a601 ffffffffffffffff 0000000000000000
> > > >  head: 0000000000000008 0000000000000000 00000000ffffffff 0000000000000000
> > > >  page dumped because: kasan: bad access detected
> > > >
> > > >  Memory state around the buggy address:
> > > >   ffffff80c4e9e000: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> > > >   ffffff80c4e9e080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> > > >  >ffffff80c4e9e100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> > > >                     ^
> > > >   ffffff80c4e9e180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> > > >   ffffff80c4e9e200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> > > > ===================================================================
> > > >
> > > > Signed-off-by: Fei Shao <fshao@chromium.org>
> > >
> > > I was looking at the driver to try to follow your (awesome btw, thanks)
> > > commit log, and it does have a quite different structure compared to
> > > what we recommend.
> > >
> > > Would following
> > > https://docs.kernel.org/gpu/drm-kms-helpers.html#special-care-with-mipi-dsi-bridges
> > > help?
> > 
> > Hi Maxime,
> > 
> > Thank you for the pointer.
> > I read the suggested pattern in the doc and compared it with the
> > drivers. If I understand correctly, both the MIPI-DSI host and panel
> > drivers follow the instructions:
> > 
> > 1. The MIPI-DSI host driver must run mipi_dsi_host_register() in its probe hook.
> >    >> drm/mediatek/mtk_dsi.c runs mipi_dsi_host_register() in the probe hook.
> > 2. In its probe hook, the bridge driver must try to find its MIPI-DSI
> > host, register as a MIPI-DSI device and attach the MIPI-DSI device to
> > its host.
> >    >> drm/panel/panel-himax-hx83102.c follows and runs
> > mipi_dsi_attach() at the end of probe hook.
> > 3. In its struct mipi_dsi_host_ops.attach hook, the MIPI-DSI host can
> > now add its component.
> >    >> drm/mediatek/mtk_dsi.c calls component_add() in the attach callback.
> > 
> > Could you elaborate on the "different structures" you mentioned?
> 
> Yeah, you're right, sorry.
> 
> > To clarify my point: the issue is that component_add() may return
> > -EPROBE_DEFER if the component (e.g. DSI encoder) is not ready,
> > causing the panel bridge to be removed. However, drm_bridge_remove()
> > is bound to MIPI-DSI host instead of panel bridge, which owns the
> > actual list_head object.
> > 
> > This might be reproducible with other MIPI-DSI host + panel
> > combinations by forcibly returning -EPROBE_DEFER in the host attach
> > hook (verification with another device is needed), so the fix may be
> > required in drm/bridge/panel.c.
> 
> Yeah, I think you're just hitting another bridge lifetime issue, and
> it's not the only one unfortunately. Tying the bridge structure lifetime
> itself to the device is wrong, it should be tied to the DRM device
> lifetime instead.
> 
> But then, the discussion becomes that bridges typically probe outside of
> the "main" DRM device probe path, so you don't have access to the DRM
> device structure until attach at best.
> 
> That's why I'm a bit skeptical about your patch. It might workaround
> your issue, but it doesn't actually solve the problem. I guess the best
> way about it would be to convert bridges to reference counting, with the
> device taking a reference at probe time when it allocates the structure
> (and giving it back at remove time), and the DRM device taking one when
> it's attached and one when it's detached.

+1, I was considering writing exactly the same while reading your
review until I reached this paragraph. devm_* is a nice dream, and maybe
APIs that simplify cleanup in a similar way can be implemented (possibly
based on cleanup.h), but behind the scene they will need to rely on a
sound reference-counting base.

> It's much more involved than just another helper though :/
Chen-Yu Tsai Nov. 27, 2024, 9:58 a.m. UTC | #6
Revisiting this thread since I just stepped on the same problem on a
different device.

On Thu, Nov 14, 2024 at 9:12 PM Maxime Ripard <mripard@kernel.org> wrote:
>
> On Tue, Oct 29, 2024 at 10:53:49PM +0800, Fei Shao wrote:
> > On Thu, Oct 24, 2024 at 8:36 PM Maxime Ripard <mripard@kernel.org> wrote:
> > >
> > > On Wed, Oct 09, 2024 at 01:23:31PM +0800, Fei Shao wrote:
> > > > In the mtk_dsi driver, its DSI host attach callback calls
> > > > devm_drm_of_get_bridge() to get the next bridge. If that next bridge is
> > > > a panel bridge, a panel_bridge object is allocated and managed by the
> > > > panel device.
> > > >
> > > > Later, if the attach callback fails with -EPROBE_DEFER from subsequent
> > > > component_add(), the panel device invoking the callback at probe time
> > > > also fails, and all device-managed resources are freed accordingly.
> > > >
> > > > This exposes a drm_bridge bridge_list corruption due to the unbalanced
> > > > lifecycle between the DSI host and the panel devices: the panel_bridge
> > > > object managed by panel device is freed, while drm_bridge_remove() is
> > > > bound to DSI host device and never gets called.
> > > > The next drm_bridge_add() will trigger UAF against the freed bridge list
> > > > object and result in kernel panic.
> > > >
> > > > This bug is observed on a MediaTek MT8188-based Chromebook with MIPI DSI
> > > > outputting to a DSI panel (DT is WIP for upstream).
> > > >
> > > > As a fix, using devm_drm_bridge_add() with the panel device in the panel
> > > > path seems reasonable. This also implies a chain of potential cleanup
> > > > actions:
> > > >
> > > > 1. Removing drm_bridge_remove() means devm_drm_panel_bridge_release()
> > > >    becomes hollow and can be removed.
> > > >
> > > > 2. devm_drm_panel_bridge_add_typed() is almost emptied except for the
> > > >    `bridge->pre_enable_prev_first` line. Itself can be also removed if
> > > >    we move the line into drm_panel_bridge_add_typed(). (maybe?)
> > > >
> > > > 3. drm_panel_bridge_add_typed() now calls all the needed devm_* calls,
> > > >    so it's essentially the new devm_drm_panel_bridge_add_typed().
> > > >
> > > > 4. drmm_panel_bridge_add() needs to be updated accordingly since it
> > > >    calls drm_panel_bridge_add_typed(). But now there's only one bridge
> > > >    object to be freed, and it's already being managed by panel device.
> > > >    I wonder if we still need both drmm_ and devm_ version in this case.
> > > >    (maybe yes from DRM PoV, I don't know much about the context)
> > > >
> > > > This is a RFC patch since I'm not sure if my understanding is correct
> > > > (for both the fix and the cleanup). It fixes the issue I encountered,
> > > > but I don't expect it to be picked up directly due to the redundant
> > > > commit message and the dangling devm_drm_panel_bridge_release().
> > > > I plan to resend the official patch(es) once I know what I supposed to
> > > > do next.
> > > >
> > > > For reference, here's the KASAN report from the device:
> > > > ==================================================================
> > > >  BUG: KASAN: slab-use-after-free in drm_bridge_add+0x98/0x230
> > > >  Read of size 8 at addr ffffff80c4e9e100 by task kworker/u32:1/69
> > > >
> > > >  CPU: 1 UID: 0 PID: 69 Comm: kworker/u32:1 Not tainted 6.12.0-rc1-next-20241004-kasan-00030-g062135fa4046 #1
> > > >  Hardware name: Google Ciri sku0/unprovisioned board (DT)
> > > >  Workqueue: events_unbound deferred_probe_work_func
> > > >  Call trace:
> > > >   dump_backtrace+0xfc/0x140
> > > >   show_stack+0x24/0x38
> > > >   dump_stack_lvl+0x40/0xc8
> > > >   print_report+0x140/0x700
> > > >   kasan_report+0xcc/0x130
> > > >   __asan_report_load8_noabort+0x20/0x30
> > > >   drm_bridge_add+0x98/0x230
> > > >   devm_drm_panel_bridge_add_typed+0x174/0x298
> > > >   devm_drm_of_get_bridge+0xe8/0x190
> > > >   mtk_dsi_host_attach+0x130/0x2b0
> > > >   mipi_dsi_attach+0x8c/0xe8
> > > >   hx83102_probe+0x1a8/0x368
> > > >   mipi_dsi_drv_probe+0x6c/0x88
> > > >   really_probe+0x1c4/0x698
> > > >   __driver_probe_device+0x160/0x298
> > > >   driver_probe_device+0x7c/0x2a8
> > > >   __device_attach_driver+0x2a0/0x398
> > > >   bus_for_each_drv+0x198/0x200
> > > >   __device_attach+0x1c0/0x308
> > > >   device_initial_probe+0x20/0x38
> > > >   bus_probe_device+0x11c/0x1f8
> > > >   deferred_probe_work_func+0x80/0x250
> > > >   worker_thread+0x9b4/0x2780
> > > >   kthread+0x274/0x350
> > > >   ret_from_fork+0x10/0x20
> > > >
> > > >  Allocated by task 69:
> > > >   kasan_save_track+0x40/0x78
> > > >   kasan_save_alloc_info+0x44/0x58
> > > >   __kasan_kmalloc+0x84/0xa0
> > > >   __kmalloc_node_track_caller_noprof+0x228/0x450
> > > >   devm_kmalloc+0x6c/0x288
> > > >   devm_drm_panel_bridge_add_typed+0xa0/0x298
> > > >   devm_drm_of_get_bridge+0xe8/0x190
> > > >   mtk_dsi_host_attach+0x130/0x2b0
> > > >   mipi_dsi_attach+0x8c/0xe8
> > > >   hx83102_probe+0x1a8/0x368
> > > >   mipi_dsi_drv_probe+0x6c/0x88
> > > >   really_probe+0x1c4/0x698
> > > >   __driver_probe_device+0x160/0x298
> > > >   driver_probe_device+0x7c/0x2a8
> > > >   __device_attach_driver+0x2a0/0x398
> > > >   bus_for_each_drv+0x198/0x200
> > > >   __device_attach+0x1c0/0x308
> > > >   device_initial_probe+0x20/0x38
> > > >   bus_probe_device+0x11c/0x1f8
> > > >   deferred_probe_work_func+0x80/0x250
> > > >   worker_thread+0x9b4/0x2780
> > > >   kthread+0x274/0x350
> > > >   ret_from_fork+0x10/0x20
> > > >
> > > >  Freed by task 69:
> > > >   kasan_save_track+0x40/0x78
> > > >   kasan_save_free_info+0x58/0x78
> > > >   __kasan_slab_free+0x48/0x68
> > > >   kfree+0xd4/0x750
> > > >   devres_release_all+0x144/0x1e8
> > > >   really_probe+0x48c/0x698
> > > >   __driver_probe_device+0x160/0x298
> > > >   driver_probe_device+0x7c/0x2a8
> > > >   __device_attach_driver+0x2a0/0x398
> > > >   bus_for_each_drv+0x198/0x200
> > > >   __device_attach+0x1c0/0x308
> > > >   device_initial_probe+0x20/0x38
> > > >   bus_probe_device+0x11c/0x1f8
> > > >   deferred_probe_work_func+0x80/0x250
> > > >   worker_thread+0x9b4/0x2780
> > > >   kthread+0x274/0x350
> > > >   ret_from_fork+0x10/0x20
> > > >
> > > >  The buggy address belongs to the object at ffffff80c4e9e000
> > > >   which belongs to the cache kmalloc-4k of size 4096
> > > >  The buggy address is located 256 bytes inside of
> > > >   freed 4096-byte region [ffffff80c4e9e000, ffffff80c4e9f000)
> > > >
> > > >  The buggy address belongs to the physical page:
> > > >  head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
> > > >  flags: 0x8000000000000040(head|zone=2)
> > > >  page_type: f5(slab)
> > > >  page: refcount:1 mapcount:0 mapping:0000000000000000
> > > >  index:0x0 pfn:0x104e98
> > > >  raw: 8000000000000040 ffffff80c0003040 dead000000000122 0000000000000000
> > > >  raw: 0000000000000000 0000000000040004 00000001f5000000 0000000000000000
> > > >  head: 8000000000000040 ffffff80c0003040 dead000000000122 0000000000000000
> > > >  head: 0000000000000000 0000000000040004 00000001f5000000 0000000000000000
> > > >  head: 8000000000000003 fffffffec313a601 ffffffffffffffff 0000000000000000
> > > >  head: 0000000000000008 0000000000000000 00000000ffffffff 0000000000000000
> > > >  page dumped because: kasan: bad access detected
> > > >
> > > >  Memory state around the buggy address:
> > > >   ffffff80c4e9e000: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> > > >   ffffff80c4e9e080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> > > >  >ffffff80c4e9e100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> > > >                     ^
> > > >   ffffff80c4e9e180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> > > >   ffffff80c4e9e200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> > > > ===================================================================
> > > >
> > > > Signed-off-by: Fei Shao <fshao@chromium.org>
> > >
> > > I was looking at the driver to try to follow your (awesome btw, thanks)
> > > commit log, and it does have a quite different structure compared to
> > > what we recommend.
> > >
> > > Would following
> > > https://docs.kernel.org/gpu/drm-kms-helpers.html#special-care-with-mipi-dsi-bridges
> > > help?
> >
> > Hi Maxime,
> >
> > Thank you for the pointer.
> > I read the suggested pattern in the doc and compared it with the
> > drivers. If I understand correctly, both the MIPI-DSI host and panel
> > drivers follow the instructions:
> >
> > 1. The MIPI-DSI host driver must run mipi_dsi_host_register() in its probe hook.
> >    >> drm/mediatek/mtk_dsi.c runs mipi_dsi_host_register() in the probe hook.
> > 2. In its probe hook, the bridge driver must try to find its MIPI-DSI
> > host, register as a MIPI-DSI device and attach the MIPI-DSI device to
> > its host.
> >    >> drm/panel/panel-himax-hx83102.c follows and runs
> > mipi_dsi_attach() at the end of probe hook.
> > 3. In its struct mipi_dsi_host_ops.attach hook, the MIPI-DSI host can
> > now add its component.
> >    >> drm/mediatek/mtk_dsi.c calls component_add() in the attach callback.
> >
> > Could you elaborate on the "different structures" you mentioned?
>
> Yeah, you're right, sorry.
>
> > To clarify my point: the issue is that component_add() may return
> > -EPROBE_DEFER if the component (e.g. DSI encoder) is not ready,
> > causing the panel bridge to be removed. However, drm_bridge_remove()
> > is bound to MIPI-DSI host instead of panel bridge, which owns the
> > actual list_head object.
> >
> > This might be reproducible with other MIPI-DSI host + panel
> > combinations by forcibly returning -EPROBE_DEFER in the host attach
> > hook (verification with another device is needed), so the fix may be
> > required in drm/bridge/panel.c.

> Yeah, I think you're just hitting another bridge lifetime issue, and
> it's not the only one unfortunately. Tying the bridge structure lifetime
> itself to the device is wrong, it should be tied to the DRM device
> lifetime instead.

I think the more immediate issue is that the bridge object's lifetime
and drm_bridge_add/remove are inconsistent when devm_drm_of_get_bridge()
or drmm_of_get_bridge() are used.

These helpers tie the bridge add/removal to the device or drm_device
passed in, but internally they call down to drm_panel_bridge_add_typed()
which allocates the bridge object tied to the panel device.
> But then, the discussion becomes that bridges typically probe outside of
> the "main" DRM device probe path, so you don't have access to the DRM
> device structure until attach at best.
>
> That's why I'm a bit skeptical about your patch. It might workaround
> your issue, but it doesn't actually solve the problem. I guess the best
> way about it would be to convert bridges to reference counting, with the
> device taking a reference at probe time when it allocates the structure
> (and giving it back at remove time), and the DRM device taking one when
> it's attached and one when it's detached.

Without going as far, it's probably better to align the lifecycle of
the two parts. Most other bridge drivers in the kernel have |drm_bridge|
lifecycle tied to their underlying |device|, either with explicit
drm_bridge_{add,remove}() calls in their probe/bind and remove/unbind
callbacks respectively, or with devm_drm_bridge_add in the probe/bind
path. The only ones with a narrower lifecycle are the DSI hosts, which
add the bridge in during host attach and remove it during host detach.

I'm thinking about fixing the panel_bridge lifecycle such that it is
tied to the panel itself. Maybe that would involve making
devm_drm_of_get_bridge() correctly return bridges even if a panel was
found, and then making the panels create and add panel bridges directly,
possibly within drm_panel_add(). Would that make sense?


Thanks
ChenYu

> It's much more involved than just another helper though :/
>
> Maxime
Sui Jingfeng Nov. 27, 2024, 6:46 p.m. UTC | #7
Hi,

On 2024/11/27 17:58, Chen-Yu Tsai wrote:
> Revisiting this thread since I just stepped on the same problem on a
> different device.
>
> On Thu, Nov 14, 2024 at 9:12 PM Maxime Ripard <mripard@kernel.org> wrote:
>> On Tue, Oct 29, 2024 at 10:53:49PM +0800, Fei Shao wrote:
>>> On Thu, Oct 24, 2024 at 8:36 PM Maxime Ripard <mripard@kernel.org> wrote:
>>>> On Wed, Oct 09, 2024 at 01:23:31PM +0800, Fei Shao wrote:
>>>>> In the mtk_dsi driver, its DSI host attach callback calls
>>>>> devm_drm_of_get_bridge() to get the next bridge. If that next bridge is
>>>>> a panel bridge, a panel_bridge object is allocated and managed by the
>>>>> panel device.
>>>>>
>>>>> Later, if the attach callback fails with -EPROBE_DEFER from subsequent
>>>>> component_add(), the panel device invoking the callback at probe time
>>>>> also fails, and all device-managed resources are freed accordingly.
>>>>>
>>>>> This exposes a drm_bridge bridge_list corruption due to the unbalanced
>>>>> lifecycle between the DSI host and the panel devices: the panel_bridge
>>>>> object managed by panel device is freed, while drm_bridge_remove() is
>>>>> bound to DSI host device and never gets called.
>>>>> The next drm_bridge_add() will trigger UAF against the freed bridge list
>>>>> object and result in kernel panic.
>>>>>
>>>>> This bug is observed on a MediaTek MT8188-based Chromebook with MIPI DSI
>>>>> outputting to a DSI panel (DT is WIP for upstream).
>>>>>
>>>>> As a fix, using devm_drm_bridge_add() with the panel device in the panel
>>>>> path seems reasonable. This also implies a chain of potential cleanup
>>>>> actions:
>>>>>
>>>>> 1. Removing drm_bridge_remove() means devm_drm_panel_bridge_release()
>>>>>     becomes hollow and can be removed.
>>>>>
>>>>> 2. devm_drm_panel_bridge_add_typed() is almost emptied except for the
>>>>>     `bridge->pre_enable_prev_first` line. Itself can be also removed if
>>>>>     we move the line into drm_panel_bridge_add_typed(). (maybe?)
>>>>>
>>>>> 3. drm_panel_bridge_add_typed() now calls all the needed devm_* calls,
>>>>>     so it's essentially the new devm_drm_panel_bridge_add_typed().
>>>>>
>>>>> 4. drmm_panel_bridge_add() needs to be updated accordingly since it
>>>>>     calls drm_panel_bridge_add_typed(). But now there's only one bridge
>>>>>     object to be freed, and it's already being managed by panel device.
>>>>>     I wonder if we still need both drmm_ and devm_ version in this case.
>>>>>     (maybe yes from DRM PoV, I don't know much about the context)
>>>>>
>>>>> This is a RFC patch since I'm not sure if my understanding is correct
>>>>> (for both the fix and the cleanup). It fixes the issue I encountered,
>>>>> but I don't expect it to be picked up directly due to the redundant
>>>>> commit message and the dangling devm_drm_panel_bridge_release().
>>>>> I plan to resend the official patch(es) once I know what I supposed to
>>>>> do next.
>>>>>
>>>>> For reference, here's the KASAN report from the device:
>>>>> ==================================================================
>>>>>   BUG: KASAN: slab-use-after-free in drm_bridge_add+0x98/0x230
>>>>>   Read of size 8 at addr ffffff80c4e9e100 by task kworker/u32:1/69
>>>>>
>>>>>   CPU: 1 UID: 0 PID: 69 Comm: kworker/u32:1 Not tainted 6.12.0-rc1-next-20241004-kasan-00030-g062135fa4046 #1
>>>>>   Hardware name: Google Ciri sku0/unprovisioned board (DT)
>>>>>   Workqueue: events_unbound deferred_probe_work_func
>>>>>   Call trace:
>>>>>    dump_backtrace+0xfc/0x140
>>>>>    show_stack+0x24/0x38
>>>>>    dump_stack_lvl+0x40/0xc8
>>>>>    print_report+0x140/0x700
>>>>>    kasan_report+0xcc/0x130
>>>>>    __asan_report_load8_noabort+0x20/0x30
>>>>>    drm_bridge_add+0x98/0x230
>>>>>    devm_drm_panel_bridge_add_typed+0x174/0x298
>>>>>    devm_drm_of_get_bridge+0xe8/0x190
>>>>>    mtk_dsi_host_attach+0x130/0x2b0
>>>>>    mipi_dsi_attach+0x8c/0xe8
>>>>>    hx83102_probe+0x1a8/0x368
>>>>>    mipi_dsi_drv_probe+0x6c/0x88
>>>>>    really_probe+0x1c4/0x698
>>>>>    __driver_probe_device+0x160/0x298
>>>>>    driver_probe_device+0x7c/0x2a8
>>>>>    __device_attach_driver+0x2a0/0x398
>>>>>    bus_for_each_drv+0x198/0x200
>>>>>    __device_attach+0x1c0/0x308
>>>>>    device_initial_probe+0x20/0x38
>>>>>    bus_probe_device+0x11c/0x1f8
>>>>>    deferred_probe_work_func+0x80/0x250
>>>>>    worker_thread+0x9b4/0x2780
>>>>>    kthread+0x274/0x350
>>>>>    ret_from_fork+0x10/0x20
>>>>>
>>>>>   Allocated by task 69:
>>>>>    kasan_save_track+0x40/0x78
>>>>>    kasan_save_alloc_info+0x44/0x58
>>>>>    __kasan_kmalloc+0x84/0xa0
>>>>>    __kmalloc_node_track_caller_noprof+0x228/0x450
>>>>>    devm_kmalloc+0x6c/0x288
>>>>>    devm_drm_panel_bridge_add_typed+0xa0/0x298
>>>>>    devm_drm_of_get_bridge+0xe8/0x190
>>>>>    mtk_dsi_host_attach+0x130/0x2b0
>>>>>    mipi_dsi_attach+0x8c/0xe8
>>>>>    hx83102_probe+0x1a8/0x368
>>>>>    mipi_dsi_drv_probe+0x6c/0x88
>>>>>    really_probe+0x1c4/0x698
>>>>>    __driver_probe_device+0x160/0x298
>>>>>    driver_probe_device+0x7c/0x2a8
>>>>>    __device_attach_driver+0x2a0/0x398
>>>>>    bus_for_each_drv+0x198/0x200
>>>>>    __device_attach+0x1c0/0x308
>>>>>    device_initial_probe+0x20/0x38
>>>>>    bus_probe_device+0x11c/0x1f8
>>>>>    deferred_probe_work_func+0x80/0x250
>>>>>    worker_thread+0x9b4/0x2780
>>>>>    kthread+0x274/0x350
>>>>>    ret_from_fork+0x10/0x20
>>>>>
>>>>>   Freed by task 69:
>>>>>    kasan_save_track+0x40/0x78
>>>>>    kasan_save_free_info+0x58/0x78
>>>>>    __kasan_slab_free+0x48/0x68
>>>>>    kfree+0xd4/0x750
>>>>>    devres_release_all+0x144/0x1e8
>>>>>    really_probe+0x48c/0x698
>>>>>    __driver_probe_device+0x160/0x298
>>>>>    driver_probe_device+0x7c/0x2a8
>>>>>    __device_attach_driver+0x2a0/0x398
>>>>>    bus_for_each_drv+0x198/0x200
>>>>>    __device_attach+0x1c0/0x308
>>>>>    device_initial_probe+0x20/0x38
>>>>>    bus_probe_device+0x11c/0x1f8
>>>>>    deferred_probe_work_func+0x80/0x250
>>>>>    worker_thread+0x9b4/0x2780
>>>>>    kthread+0x274/0x350
>>>>>    ret_from_fork+0x10/0x20
>>>>>
>>>>>   The buggy address belongs to the object at ffffff80c4e9e000
>>>>>    which belongs to the cache kmalloc-4k of size 4096
>>>>>   The buggy address is located 256 bytes inside of
>>>>>    freed 4096-byte region [ffffff80c4e9e000, ffffff80c4e9f000)
>>>>>
>>>>>   The buggy address belongs to the physical page:
>>>>>   head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
>>>>>   flags: 0x8000000000000040(head|zone=2)
>>>>>   page_type: f5(slab)
>>>>>   page: refcount:1 mapcount:0 mapping:0000000000000000
>>>>>   index:0x0 pfn:0x104e98
>>>>>   raw: 8000000000000040 ffffff80c0003040 dead000000000122 0000000000000000
>>>>>   raw: 0000000000000000 0000000000040004 00000001f5000000 0000000000000000
>>>>>   head: 8000000000000040 ffffff80c0003040 dead000000000122 0000000000000000
>>>>>   head: 0000000000000000 0000000000040004 00000001f5000000 0000000000000000
>>>>>   head: 8000000000000003 fffffffec313a601 ffffffffffffffff 0000000000000000
>>>>>   head: 0000000000000008 0000000000000000 00000000ffffffff 0000000000000000
>>>>>   page dumped because: kasan: bad access detected
>>>>>
>>>>>   Memory state around the buggy address:
>>>>>    ffffff80c4e9e000: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>>>>>    ffffff80c4e9e080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>>>>>   >ffffff80c4e9e100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>>>>>                      ^
>>>>>    ffffff80c4e9e180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>>>>>    ffffff80c4e9e200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>>>>> ===================================================================
>>>>>
>>>>> Signed-off-by: Fei Shao <fshao@chromium.org>
>>>> I was looking at the driver to try to follow your (awesome btw, thanks)
>>>> commit log, and it does have a quite different structure compared to
>>>> what we recommend.
>>>>
>>>> Would following
>>>> https://docs.kernel.org/gpu/drm-kms-helpers.html#special-care-with-mipi-dsi-bridges
>>>> help?
>>> Hi Maxime,
>>>
>>> Thank you for the pointer.
>>> I read the suggested pattern in the doc and compared it with the
>>> drivers. If I understand correctly, both the MIPI-DSI host and panel
>>> drivers follow the instructions:
>>>
>>> 1. The MIPI-DSI host driver must run mipi_dsi_host_register() in its probe hook.
>>>     >> drm/mediatek/mtk_dsi.c runs mipi_dsi_host_register() in the probe hook.
>>> 2. In its probe hook, the bridge driver must try to find its MIPI-DSI
>>> host, register as a MIPI-DSI device and attach the MIPI-DSI device to
>>> its host.
>>>     >> drm/panel/panel-himax-hx83102.c follows and runs
>>> mipi_dsi_attach() at the end of probe hook.
>>> 3. In its struct mipi_dsi_host_ops.attach hook, the MIPI-DSI host can
>>> now add its component.
>>>     >> drm/mediatek/mtk_dsi.c calls component_add() in the attach callback.
>>>
>>> Could you elaborate on the "different structures" you mentioned?
>> Yeah, you're right, sorry.
>>
>>> To clarify my point: the issue is that component_add() may return
>>> -EPROBE_DEFER if the component (e.g. DSI encoder) is not ready,
>>> causing the panel bridge to be removed. However, drm_bridge_remove()
>>> is bound to MIPI-DSI host instead of panel bridge, which owns the
>>> actual list_head object.
>>>
>>> This might be reproducible with other MIPI-DSI host + panel
>>> combinations by forcibly returning -EPROBE_DEFER in the host attach
>>> hook (verification with another device is needed), so the fix may be
>>> required in drm/bridge/panel.c.
>> Yeah, I think you're just hitting another bridge lifetime issue, and
>> it's not the only one unfortunately. Tying the bridge structure lifetime
>> itself to the device is wrong, it should be tied to the DRM device
>> lifetime instead.
> I think the more immediate issue is that the bridge object's lifetime
> and drm_bridge_add/remove are inconsistent when devm_drm_of_get_bridge()
> or drmm_of_get_bridge() are used.

Well, I think this is more of probe issue of multiple kernel modules.

The root issue is that the global bridge list still stores the
pointer to *old* the bridge instance which has been freed after
the first '-EPROBE_DEFER' happened. The next time the
'drm_bridge_add(&panel_bridge->bridge);' is called, we will deference
the *old* NULL pointer. Because it will touch the 'struct drm_bridge::list'
field, which's backing memory has been freed.


> These helpers tie the bridge add/removal to the device or drm_device
> passed in, but internally they call down to drm_panel_bridge_add_typed()
> which allocates the bridge object tied to the panel device.

When the devm_drm_panel_bridge_add_typed() is passed a pointer of
DSI host device, we essentially tie the lifetime of the freshly
created drm bridge instance to the DSI host device. But, the
'struct panel_bridge' clearly hint that the bridge instance has
the same lifetime with the backing panel, after all, it's the
underlying panel baking the bridge.

>> But then, the discussion becomes that bridges typically probe outside of
>> the "main" DRM device probe path, so you don't have access to the DRM
>> device structure until attach at best.
>>
>> That's why I'm a bit skeptical about your patch. It might workaround
>> your issue, but it doesn't actually solve the problem. I guess the best
>> way about it would be to convert bridges to reference counting, with the
>> device taking a reference at probe time when it allocates the structure
>> (and giving it back at remove time), and the DRM device taking one when
>> it's attached and one when it's detached.
> Without going as far, it's probably better to align the lifecycle of
> the two parts. Most other bridge drivers in the kernel have |drm_bridge|
> lifecycle tied to their underlying |device|, either with explicit
> drm_bridge_{add,remove}() calls in their probe/bind and remove/unbind
> callbacks respectively, or with devm_drm_bridge_add in the probe/bind
> path. The only ones with a narrower lifecycle are the DSI hosts, which
> add the bridge in during host attach and remove it during host detach.
>
> I'm thinking about fixing the panel_bridge lifecycle such that it is
> tied to the panel itself. Maybe that would involve making
> devm_drm_of_get_bridge() correctly return bridges even if a panel was
> found, and then making the panels create and add panel bridges directly,
> possibly within drm_panel_add(). Would that make sense?

I think, align the lifetime of the bridge with 'panel->dev' probably helps.
Modifying the devm_drm_of_get_bridge() function like the following pattern:


```

struct drm_bridge *devm_drm_of_get_bridge(struct device_node *np, u32 
port, u32 endpoint)
{
     struct drm_bridge *bridge;
     struct drm_panel *panel;
     int ret;

     ret = drm_of_find_panel_or_bridge(np, port, endpoint, &panel, &bridge);
     if (ret)
         return ERR_PTR(ret);

     if (panel)
         bridge = devm_drm_panel_bridge_add(panel->dev, panel);

     return bridge;
}

```


Or alternatively, inline this to drm/mediatek,
rename it as mtk_drm_of_get_bridge().

Or alternatively, manage the bridge's lifetime manually.
Remove it from the global bridge list if errors happen.


>
> Thanks
> ChenYu
>
>> It's much more involved than just another helper though :/
>>
>> Maxime
Chen-Yu Tsai Nov. 29, 2024, 4:52 a.m. UTC | #8
On Thu, Nov 28, 2024 at 2:46 AM Sui Jingfeng <sui.jingfeng@linux.dev> wrote:
>
> Hi,
>
> On 2024/11/27 17:58, Chen-Yu Tsai wrote:
> > Revisiting this thread since I just stepped on the same problem on a
> > different device.
> >
> > On Thu, Nov 14, 2024 at 9:12 PM Maxime Ripard <mripard@kernel.org> wrote:
> >> On Tue, Oct 29, 2024 at 10:53:49PM +0800, Fei Shao wrote:
> >>> On Thu, Oct 24, 2024 at 8:36 PM Maxime Ripard <mripard@kernel.org> wrote:
> >>>> On Wed, Oct 09, 2024 at 01:23:31PM +0800, Fei Shao wrote:
> >>>>> In the mtk_dsi driver, its DSI host attach callback calls
> >>>>> devm_drm_of_get_bridge() to get the next bridge. If that next bridge is
> >>>>> a panel bridge, a panel_bridge object is allocated and managed by the
> >>>>> panel device.
> >>>>>
> >>>>> Later, if the attach callback fails with -EPROBE_DEFER from subsequent
> >>>>> component_add(), the panel device invoking the callback at probe time
> >>>>> also fails, and all device-managed resources are freed accordingly.
> >>>>>
> >>>>> This exposes a drm_bridge bridge_list corruption due to the unbalanced
> >>>>> lifecycle between the DSI host and the panel devices: the panel_bridge
> >>>>> object managed by panel device is freed, while drm_bridge_remove() is
> >>>>> bound to DSI host device and never gets called.
> >>>>> The next drm_bridge_add() will trigger UAF against the freed bridge list
> >>>>> object and result in kernel panic.
> >>>>>
> >>>>> This bug is observed on a MediaTek MT8188-based Chromebook with MIPI DSI
> >>>>> outputting to a DSI panel (DT is WIP for upstream).
> >>>>>
> >>>>> As a fix, using devm_drm_bridge_add() with the panel device in the panel
> >>>>> path seems reasonable. This also implies a chain of potential cleanup
> >>>>> actions:
> >>>>>
> >>>>> 1. Removing drm_bridge_remove() means devm_drm_panel_bridge_release()
> >>>>>     becomes hollow and can be removed.
> >>>>>
> >>>>> 2. devm_drm_panel_bridge_add_typed() is almost emptied except for the
> >>>>>     `bridge->pre_enable_prev_first` line. Itself can be also removed if
> >>>>>     we move the line into drm_panel_bridge_add_typed(). (maybe?)
> >>>>>
> >>>>> 3. drm_panel_bridge_add_typed() now calls all the needed devm_* calls,
> >>>>>     so it's essentially the new devm_drm_panel_bridge_add_typed().
> >>>>>
> >>>>> 4. drmm_panel_bridge_add() needs to be updated accordingly since it
> >>>>>     calls drm_panel_bridge_add_typed(). But now there's only one bridge
> >>>>>     object to be freed, and it's already being managed by panel device.
> >>>>>     I wonder if we still need both drmm_ and devm_ version in this case.
> >>>>>     (maybe yes from DRM PoV, I don't know much about the context)
> >>>>>
> >>>>> This is a RFC patch since I'm not sure if my understanding is correct
> >>>>> (for both the fix and the cleanup). It fixes the issue I encountered,
> >>>>> but I don't expect it to be picked up directly due to the redundant
> >>>>> commit message and the dangling devm_drm_panel_bridge_release().
> >>>>> I plan to resend the official patch(es) once I know what I supposed to
> >>>>> do next.
> >>>>>
> >>>>> For reference, here's the KASAN report from the device:
> >>>>> ==================================================================
> >>>>>   BUG: KASAN: slab-use-after-free in drm_bridge_add+0x98/0x230
> >>>>>   Read of size 8 at addr ffffff80c4e9e100 by task kworker/u32:1/69
> >>>>>
> >>>>>   CPU: 1 UID: 0 PID: 69 Comm: kworker/u32:1 Not tainted 6.12.0-rc1-next-20241004-kasan-00030-g062135fa4046 #1
> >>>>>   Hardware name: Google Ciri sku0/unprovisioned board (DT)
> >>>>>   Workqueue: events_unbound deferred_probe_work_func
> >>>>>   Call trace:
> >>>>>    dump_backtrace+0xfc/0x140
> >>>>>    show_stack+0x24/0x38
> >>>>>    dump_stack_lvl+0x40/0xc8
> >>>>>    print_report+0x140/0x700
> >>>>>    kasan_report+0xcc/0x130
> >>>>>    __asan_report_load8_noabort+0x20/0x30
> >>>>>    drm_bridge_add+0x98/0x230
> >>>>>    devm_drm_panel_bridge_add_typed+0x174/0x298
> >>>>>    devm_drm_of_get_bridge+0xe8/0x190
> >>>>>    mtk_dsi_host_attach+0x130/0x2b0
> >>>>>    mipi_dsi_attach+0x8c/0xe8
> >>>>>    hx83102_probe+0x1a8/0x368
> >>>>>    mipi_dsi_drv_probe+0x6c/0x88
> >>>>>    really_probe+0x1c4/0x698
> >>>>>    __driver_probe_device+0x160/0x298
> >>>>>    driver_probe_device+0x7c/0x2a8
> >>>>>    __device_attach_driver+0x2a0/0x398
> >>>>>    bus_for_each_drv+0x198/0x200
> >>>>>    __device_attach+0x1c0/0x308
> >>>>>    device_initial_probe+0x20/0x38
> >>>>>    bus_probe_device+0x11c/0x1f8
> >>>>>    deferred_probe_work_func+0x80/0x250
> >>>>>    worker_thread+0x9b4/0x2780
> >>>>>    kthread+0x274/0x350
> >>>>>    ret_from_fork+0x10/0x20
> >>>>>
> >>>>>   Allocated by task 69:
> >>>>>    kasan_save_track+0x40/0x78
> >>>>>    kasan_save_alloc_info+0x44/0x58
> >>>>>    __kasan_kmalloc+0x84/0xa0
> >>>>>    __kmalloc_node_track_caller_noprof+0x228/0x450
> >>>>>    devm_kmalloc+0x6c/0x288
> >>>>>    devm_drm_panel_bridge_add_typed+0xa0/0x298
> >>>>>    devm_drm_of_get_bridge+0xe8/0x190
> >>>>>    mtk_dsi_host_attach+0x130/0x2b0
> >>>>>    mipi_dsi_attach+0x8c/0xe8
> >>>>>    hx83102_probe+0x1a8/0x368
> >>>>>    mipi_dsi_drv_probe+0x6c/0x88
> >>>>>    really_probe+0x1c4/0x698
> >>>>>    __driver_probe_device+0x160/0x298
> >>>>>    driver_probe_device+0x7c/0x2a8
> >>>>>    __device_attach_driver+0x2a0/0x398
> >>>>>    bus_for_each_drv+0x198/0x200
> >>>>>    __device_attach+0x1c0/0x308
> >>>>>    device_initial_probe+0x20/0x38
> >>>>>    bus_probe_device+0x11c/0x1f8
> >>>>>    deferred_probe_work_func+0x80/0x250
> >>>>>    worker_thread+0x9b4/0x2780
> >>>>>    kthread+0x274/0x350
> >>>>>    ret_from_fork+0x10/0x20
> >>>>>
> >>>>>   Freed by task 69:
> >>>>>    kasan_save_track+0x40/0x78
> >>>>>    kasan_save_free_info+0x58/0x78
> >>>>>    __kasan_slab_free+0x48/0x68
> >>>>>    kfree+0xd4/0x750
> >>>>>    devres_release_all+0x144/0x1e8
> >>>>>    really_probe+0x48c/0x698
> >>>>>    __driver_probe_device+0x160/0x298
> >>>>>    driver_probe_device+0x7c/0x2a8
> >>>>>    __device_attach_driver+0x2a0/0x398
> >>>>>    bus_for_each_drv+0x198/0x200
> >>>>>    __device_attach+0x1c0/0x308
> >>>>>    device_initial_probe+0x20/0x38
> >>>>>    bus_probe_device+0x11c/0x1f8
> >>>>>    deferred_probe_work_func+0x80/0x250
> >>>>>    worker_thread+0x9b4/0x2780
> >>>>>    kthread+0x274/0x350
> >>>>>    ret_from_fork+0x10/0x20
> >>>>>
> >>>>>   The buggy address belongs to the object at ffffff80c4e9e000
> >>>>>    which belongs to the cache kmalloc-4k of size 4096
> >>>>>   The buggy address is located 256 bytes inside of
> >>>>>    freed 4096-byte region [ffffff80c4e9e000, ffffff80c4e9f000)
> >>>>>
> >>>>>   The buggy address belongs to the physical page:
> >>>>>   head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
> >>>>>   flags: 0x8000000000000040(head|zone=2)
> >>>>>   page_type: f5(slab)
> >>>>>   page: refcount:1 mapcount:0 mapping:0000000000000000
> >>>>>   index:0x0 pfn:0x104e98
> >>>>>   raw: 8000000000000040 ffffff80c0003040 dead000000000122 0000000000000000
> >>>>>   raw: 0000000000000000 0000000000040004 00000001f5000000 0000000000000000
> >>>>>   head: 8000000000000040 ffffff80c0003040 dead000000000122 0000000000000000
> >>>>>   head: 0000000000000000 0000000000040004 00000001f5000000 0000000000000000
> >>>>>   head: 8000000000000003 fffffffec313a601 ffffffffffffffff 0000000000000000
> >>>>>   head: 0000000000000008 0000000000000000 00000000ffffffff 0000000000000000
> >>>>>   page dumped because: kasan: bad access detected
> >>>>>
> >>>>>   Memory state around the buggy address:
> >>>>>    ffffff80c4e9e000: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> >>>>>    ffffff80c4e9e080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> >>>>>   >ffffff80c4e9e100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> >>>>>                      ^
> >>>>>    ffffff80c4e9e180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> >>>>>    ffffff80c4e9e200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> >>>>> ===================================================================
> >>>>>
> >>>>> Signed-off-by: Fei Shao <fshao@chromium.org>
> >>>> I was looking at the driver to try to follow your (awesome btw, thanks)
> >>>> commit log, and it does have a quite different structure compared to
> >>>> what we recommend.
> >>>>
> >>>> Would following
> >>>> https://docs.kernel.org/gpu/drm-kms-helpers.html#special-care-with-mipi-dsi-bridges
> >>>> help?
> >>> Hi Maxime,
> >>>
> >>> Thank you for the pointer.
> >>> I read the suggested pattern in the doc and compared it with the
> >>> drivers. If I understand correctly, both the MIPI-DSI host and panel
> >>> drivers follow the instructions:
> >>>
> >>> 1. The MIPI-DSI host driver must run mipi_dsi_host_register() in its probe hook.
> >>>     >> drm/mediatek/mtk_dsi.c runs mipi_dsi_host_register() in the probe hook.
> >>> 2. In its probe hook, the bridge driver must try to find its MIPI-DSI
> >>> host, register as a MIPI-DSI device and attach the MIPI-DSI device to
> >>> its host.
> >>>     >> drm/panel/panel-himax-hx83102.c follows and runs
> >>> mipi_dsi_attach() at the end of probe hook.
> >>> 3. In its struct mipi_dsi_host_ops.attach hook, the MIPI-DSI host can
> >>> now add its component.
> >>>     >> drm/mediatek/mtk_dsi.c calls component_add() in the attach callback.
> >>>
> >>> Could you elaborate on the "different structures" you mentioned?
> >> Yeah, you're right, sorry.
> >>
> >>> To clarify my point: the issue is that component_add() may return
> >>> -EPROBE_DEFER if the component (e.g. DSI encoder) is not ready,
> >>> causing the panel bridge to be removed. However, drm_bridge_remove()
> >>> is bound to MIPI-DSI host instead of panel bridge, which owns the
> >>> actual list_head object.
> >>>
> >>> This might be reproducible with other MIPI-DSI host + panel
> >>> combinations by forcibly returning -EPROBE_DEFER in the host attach
> >>> hook (verification with another device is needed), so the fix may be
> >>> required in drm/bridge/panel.c.
> >> Yeah, I think you're just hitting another bridge lifetime issue, and
> >> it's not the only one unfortunately. Tying the bridge structure lifetime
> >> itself to the device is wrong, it should be tied to the DRM device
> >> lifetime instead.
> > I think the more immediate issue is that the bridge object's lifetime
> > and drm_bridge_add/remove are inconsistent when devm_drm_of_get_bridge()
> > or drmm_of_get_bridge() are used.
>
> Well, I think this is more of probe issue of multiple kernel modules.
>
> The root issue is that the global bridge list still stores the
> pointer to *old* the bridge instance which has been freed after
> the first '-EPROBE_DEFER' happened. The next time the
> 'drm_bridge_add(&panel_bridge->bridge);' is called, we will deference
> the *old* NULL pointer. Because it will touch the 'struct drm_bridge::list'
> field, which's backing memory has been freed.

Yes. That is what is causing the crash.

> > These helpers tie the bridge add/removal to the device or drm_device
> > passed in, but internally they call down to drm_panel_bridge_add_typed()
> > which allocates the bridge object tied to the panel device.
>
> When the devm_drm_panel_bridge_add_typed() is passed a pointer of
> DSI host device, we essentially tie the lifetime of the freshly
> created drm bridge instance to the DSI host device. But, the
> 'struct panel_bridge' clearly hint that the bridge instance has
> the same lifetime with the backing panel, after all, it's the
> underlying panel baking the bridge.

Exactly.

> >> But then, the discussion becomes that bridges typically probe outside of
> >> the "main" DRM device probe path, so you don't have access to the DRM
> >> device structure until attach at best.
> >>
> >> That's why I'm a bit skeptical about your patch. It might workaround
> >> your issue, but it doesn't actually solve the problem. I guess the best
> >> way about it would be to convert bridges to reference counting, with the
> >> device taking a reference at probe time when it allocates the structure
> >> (and giving it back at remove time), and the DRM device taking one when
> >> it's attached and one when it's detached.
> > Without going as far, it's probably better to align the lifecycle of
> > the two parts. Most other bridge drivers in the kernel have |drm_bridge|
> > lifecycle tied to their underlying |device|, either with explicit
> > drm_bridge_{add,remove}() calls in their probe/bind and remove/unbind
> > callbacks respectively, or with devm_drm_bridge_add in the probe/bind
> > path. The only ones with a narrower lifecycle are the DSI hosts, which
> > add the bridge in during host attach and remove it during host detach.
> >
> > I'm thinking about fixing the panel_bridge lifecycle such that it is
> > tied to the panel itself. Maybe that would involve making
> > devm_drm_of_get_bridge() correctly return bridges even if a panel was
> > found, and then making the panels create and add panel bridges directly,
> > possibly within drm_panel_add(). Would that make sense?
>
> I think, align the lifetime of the bridge with 'panel->dev' probably helps.
> Modifying the devm_drm_of_get_bridge() function like the following pattern:
>
>
> ```
>
> struct drm_bridge *devm_drm_of_get_bridge(struct device_node *np, u32
> port, u32 endpoint)
> {
>      struct drm_bridge *bridge;
>      struct drm_panel *panel;
>      int ret;
>
>      ret = drm_of_find_panel_or_bridge(np, port, endpoint, &panel, &bridge);
>      if (ret)
>          return ERR_PTR(ret);
>
>      if (panel)
>          bridge = devm_drm_panel_bridge_add(panel->dev, panel);
>
>      return bridge;
> }

That's one possible solution I thought of, but then the devm_ prefix
no longer makes sense. Also the panel_bridge is still implicitly created,
and we might as well move that to the panel side.

> ```
>
>
> Or alternatively, inline this to drm/mediatek,
> rename it as mtk_drm_of_get_bridge().

I would prefer to not do that, since that only fixes the issue for
MediaTek, while we have some 30 odd users of devm_drm_of_get_bridge().

> Or alternatively, manage the bridge's lifetime manually.
> Remove it from the global bridge list if errors happen.

That's also one way; it's just messy.


Thanks
ChenYu

> >
> > Thanks
> > ChenYu
> >
> >> It's much more involved than just another helper though :/
> >>
> >> Maxime
>
> --
> Best regards,
> Sui
>
Sui Jingfeng Nov. 29, 2024, 10:33 a.m. UTC | #9
Hi,

On 2024/11/29 12:52, Chen-Yu Tsai wrote:
> On Thu, Nov 28, 2024 at 2:46 AM Sui Jingfeng <sui.jingfeng@linux.dev> wrote:
>> Hi,
>>
>> On 2024/11/27 17:58, Chen-Yu Tsai wrote:
>>> Revisiting this thread since I just stepped on the same problem on a
>>> different device.
>>>
>>> On Thu, Nov 14, 2024 at 9:12 PM Maxime Ripard <mripard@kernel.org> wrote:
>>>> On Tue, Oct 29, 2024 at 10:53:49PM +0800, Fei Shao wrote:
>>>>> On Thu, Oct 24, 2024 at 8:36 PM Maxime Ripard <mripard@kernel.org> wrote:
>>>>>> On Wed, Oct 09, 2024 at 01:23:31PM +0800, Fei Shao wrote:
>>>>>>> In the mtk_dsi driver, its DSI host attach callback calls
>>>>>>> devm_drm_of_get_bridge() to get the next bridge. If that next bridge is
>>>>>>> a panel bridge, a panel_bridge object is allocated and managed by the
>>>>>>> panel device.
>>>>>>>
>>>>>>> Later, if the attach callback fails with -EPROBE_DEFER from subsequent
>>>>>>> component_add(), the panel device invoking the callback at probe time
>>>>>>> also fails, and all device-managed resources are freed accordingly.
>>>>>>>
>>>>>>> This exposes a drm_bridge bridge_list corruption due to the unbalanced
>>>>>>> lifecycle between the DSI host and the panel devices: the panel_bridge
>>>>>>> object managed by panel device is freed, while drm_bridge_remove() is
>>>>>>> bound to DSI host device and never gets called.
>>>>>>> The next drm_bridge_add() will trigger UAF against the freed bridge list
>>>>>>> object and result in kernel panic.
>>>>>>>
>>>>>>> This bug is observed on a MediaTek MT8188-based Chromebook with MIPI DSI
>>>>>>> outputting to a DSI panel (DT is WIP for upstream).
>>>>>>>
>>>>>>> As a fix, using devm_drm_bridge_add() with the panel device in the panel
>>>>>>> path seems reasonable. This also implies a chain of potential cleanup
>>>>>>> actions:
>>>>>>>
>>>>>>> 1. Removing drm_bridge_remove() means devm_drm_panel_bridge_release()
>>>>>>>      becomes hollow and can be removed.
>>>>>>>
>>>>>>> 2. devm_drm_panel_bridge_add_typed() is almost emptied except for the
>>>>>>>      `bridge->pre_enable_prev_first` line. Itself can be also removed if
>>>>>>>      we move the line into drm_panel_bridge_add_typed(). (maybe?)
>>>>>>>
>>>>>>> 3. drm_panel_bridge_add_typed() now calls all the needed devm_* calls,
>>>>>>>      so it's essentially the new devm_drm_panel_bridge_add_typed().
>>>>>>>
>>>>>>> 4. drmm_panel_bridge_add() needs to be updated accordingly since it
>>>>>>>      calls drm_panel_bridge_add_typed(). But now there's only one bridge
>>>>>>>      object to be freed, and it's already being managed by panel device.
>>>>>>>      I wonder if we still need both drmm_ and devm_ version in this case.
>>>>>>>      (maybe yes from DRM PoV, I don't know much about the context)
>>>>>>>
>>>>>>> This is a RFC patch since I'm not sure if my understanding is correct
>>>>>>> (for both the fix and the cleanup). It fixes the issue I encountered,
>>>>>>> but I don't expect it to be picked up directly due to the redundant
>>>>>>> commit message and the dangling devm_drm_panel_bridge_release().
>>>>>>> I plan to resend the official patch(es) once I know what I supposed to
>>>>>>> do next.
>>>>>>>
>>>>>>> For reference, here's the KASAN report from the device:
>>>>>>> ==================================================================
>>>>>>>    BUG: KASAN: slab-use-after-free in drm_bridge_add+0x98/0x230
>>>>>>>    Read of size 8 at addr ffffff80c4e9e100 by task kworker/u32:1/69
>>>>>>>
>>>>>>>    CPU: 1 UID: 0 PID: 69 Comm: kworker/u32:1 Not tainted 6.12.0-rc1-next-20241004-kasan-00030-g062135fa4046 #1
>>>>>>>    Hardware name: Google Ciri sku0/unprovisioned board (DT)
>>>>>>>    Workqueue: events_unbound deferred_probe_work_func
>>>>>>>    Call trace:
>>>>>>>     dump_backtrace+0xfc/0x140
>>>>>>>     show_stack+0x24/0x38
>>>>>>>     dump_stack_lvl+0x40/0xc8
>>>>>>>     print_report+0x140/0x700
>>>>>>>     kasan_report+0xcc/0x130
>>>>>>>     __asan_report_load8_noabort+0x20/0x30
>>>>>>>     drm_bridge_add+0x98/0x230
>>>>>>>     devm_drm_panel_bridge_add_typed+0x174/0x298
>>>>>>>     devm_drm_of_get_bridge+0xe8/0x190
>>>>>>>     mtk_dsi_host_attach+0x130/0x2b0
>>>>>>>     mipi_dsi_attach+0x8c/0xe8
>>>>>>>     hx83102_probe+0x1a8/0x368
>>>>>>>     mipi_dsi_drv_probe+0x6c/0x88
>>>>>>>     really_probe+0x1c4/0x698
>>>>>>>     __driver_probe_device+0x160/0x298
>>>>>>>     driver_probe_device+0x7c/0x2a8
>>>>>>>     __device_attach_driver+0x2a0/0x398
>>>>>>>     bus_for_each_drv+0x198/0x200
>>>>>>>     __device_attach+0x1c0/0x308
>>>>>>>     device_initial_probe+0x20/0x38
>>>>>>>     bus_probe_device+0x11c/0x1f8
>>>>>>>     deferred_probe_work_func+0x80/0x250
>>>>>>>     worker_thread+0x9b4/0x2780
>>>>>>>     kthread+0x274/0x350
>>>>>>>     ret_from_fork+0x10/0x20
>>>>>>>
>>>>>>>    Allocated by task 69:
>>>>>>>     kasan_save_track+0x40/0x78
>>>>>>>     kasan_save_alloc_info+0x44/0x58
>>>>>>>     __kasan_kmalloc+0x84/0xa0
>>>>>>>     __kmalloc_node_track_caller_noprof+0x228/0x450
>>>>>>>     devm_kmalloc+0x6c/0x288
>>>>>>>     devm_drm_panel_bridge_add_typed+0xa0/0x298
>>>>>>>     devm_drm_of_get_bridge+0xe8/0x190
>>>>>>>     mtk_dsi_host_attach+0x130/0x2b0
>>>>>>>     mipi_dsi_attach+0x8c/0xe8
>>>>>>>     hx83102_probe+0x1a8/0x368
>>>>>>>     mipi_dsi_drv_probe+0x6c/0x88
>>>>>>>     really_probe+0x1c4/0x698
>>>>>>>     __driver_probe_device+0x160/0x298
>>>>>>>     driver_probe_device+0x7c/0x2a8
>>>>>>>     __device_attach_driver+0x2a0/0x398
>>>>>>>     bus_for_each_drv+0x198/0x200
>>>>>>>     __device_attach+0x1c0/0x308
>>>>>>>     device_initial_probe+0x20/0x38
>>>>>>>     bus_probe_device+0x11c/0x1f8
>>>>>>>     deferred_probe_work_func+0x80/0x250
>>>>>>>     worker_thread+0x9b4/0x2780
>>>>>>>     kthread+0x274/0x350
>>>>>>>     ret_from_fork+0x10/0x20
>>>>>>>
>>>>>>>    Freed by task 69:
>>>>>>>     kasan_save_track+0x40/0x78
>>>>>>>     kasan_save_free_info+0x58/0x78
>>>>>>>     __kasan_slab_free+0x48/0x68
>>>>>>>     kfree+0xd4/0x750
>>>>>>>     devres_release_all+0x144/0x1e8
>>>>>>>     really_probe+0x48c/0x698
>>>>>>>     __driver_probe_device+0x160/0x298
>>>>>>>     driver_probe_device+0x7c/0x2a8
>>>>>>>     __device_attach_driver+0x2a0/0x398
>>>>>>>     bus_for_each_drv+0x198/0x200
>>>>>>>     __device_attach+0x1c0/0x308
>>>>>>>     device_initial_probe+0x20/0x38
>>>>>>>     bus_probe_device+0x11c/0x1f8
>>>>>>>     deferred_probe_work_func+0x80/0x250
>>>>>>>     worker_thread+0x9b4/0x2780
>>>>>>>     kthread+0x274/0x350
>>>>>>>     ret_from_fork+0x10/0x20
>>>>>>>
>>>>>>>    The buggy address belongs to the object at ffffff80c4e9e000
>>>>>>>     which belongs to the cache kmalloc-4k of size 4096
>>>>>>>    The buggy address is located 256 bytes inside of
>>>>>>>     freed 4096-byte region [ffffff80c4e9e000, ffffff80c4e9f000)
>>>>>>>
>>>>>>>    The buggy address belongs to the physical page:
>>>>>>>    head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
>>>>>>>    flags: 0x8000000000000040(head|zone=2)
>>>>>>>    page_type: f5(slab)
>>>>>>>    page: refcount:1 mapcount:0 mapping:0000000000000000
>>>>>>>    index:0x0 pfn:0x104e98
>>>>>>>    raw: 8000000000000040 ffffff80c0003040 dead000000000122 0000000000000000
>>>>>>>    raw: 0000000000000000 0000000000040004 00000001f5000000 0000000000000000
>>>>>>>    head: 8000000000000040 ffffff80c0003040 dead000000000122 0000000000000000
>>>>>>>    head: 0000000000000000 0000000000040004 00000001f5000000 0000000000000000
>>>>>>>    head: 8000000000000003 fffffffec313a601 ffffffffffffffff 0000000000000000
>>>>>>>    head: 0000000000000008 0000000000000000 00000000ffffffff 0000000000000000
>>>>>>>    page dumped because: kasan: bad access detected
>>>>>>>
>>>>>>>    Memory state around the buggy address:
>>>>>>>     ffffff80c4e9e000: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>>>>>>>     ffffff80c4e9e080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>>>>>>>    >ffffff80c4e9e100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>>>>>>>                       ^
>>>>>>>     ffffff80c4e9e180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>>>>>>>     ffffff80c4e9e200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>>>>>>> ===================================================================
>>>>>>>
>>>>>>> Signed-off-by: Fei Shao <fshao@chromium.org>
>>>>>> I was looking at the driver to try to follow your (awesome btw, thanks)
>>>>>> commit log, and it does have a quite different structure compared to
>>>>>> what we recommend.
>>>>>>
>>>>>> Would following
>>>>>> https://docs.kernel.org/gpu/drm-kms-helpers.html#special-care-with-mipi-dsi-bridges
>>>>>> help?
>>>>> Hi Maxime,
>>>>>
>>>>> Thank you for the pointer.
>>>>> I read the suggested pattern in the doc and compared it with the
>>>>> drivers. If I understand correctly, both the MIPI-DSI host and panel
>>>>> drivers follow the instructions:
>>>>>
>>>>> 1. The MIPI-DSI host driver must run mipi_dsi_host_register() in its probe hook.
>>>>>      >> drm/mediatek/mtk_dsi.c runs mipi_dsi_host_register() in the probe hook.
>>>>> 2. In its probe hook, the bridge driver must try to find its MIPI-DSI
>>>>> host, register as a MIPI-DSI device and attach the MIPI-DSI device to
>>>>> its host.
>>>>>      >> drm/panel/panel-himax-hx83102.c follows and runs
>>>>> mipi_dsi_attach() at the end of probe hook.
>>>>> 3. In its struct mipi_dsi_host_ops.attach hook, the MIPI-DSI host can
>>>>> now add its component.
>>>>>      >> drm/mediatek/mtk_dsi.c calls component_add() in the attach callback.
>>>>>
>>>>> Could you elaborate on the "different structures" you mentioned?
>>>> Yeah, you're right, sorry.
>>>>
>>>>> To clarify my point: the issue is that component_add() may return
>>>>> -EPROBE_DEFER if the component (e.g. DSI encoder) is not ready,
>>>>> causing the panel bridge to be removed. However, drm_bridge_remove()
>>>>> is bound to MIPI-DSI host instead of panel bridge, which owns the
>>>>> actual list_head object.
>>>>>
>>>>> This might be reproducible with other MIPI-DSI host + panel
>>>>> combinations by forcibly returning -EPROBE_DEFER in the host attach
>>>>> hook (verification with another device is needed), so the fix may be
>>>>> required in drm/bridge/panel.c.
>>>> Yeah, I think you're just hitting another bridge lifetime issue, and
>>>> it's not the only one unfortunately. Tying the bridge structure lifetime
>>>> itself to the device is wrong, it should be tied to the DRM device
>>>> lifetime instead.
>>> I think the more immediate issue is that the bridge object's lifetime
>>> and drm_bridge_add/remove are inconsistent when devm_drm_of_get_bridge()
>>> or drmm_of_get_bridge() are used.
>> Well, I think this is more of probe issue of multiple kernel modules.
>>
>> The root issue is that the global bridge list still stores the
>> pointer to *old* the bridge instance which has been freed after
>> the first '-EPROBE_DEFER' happened. The next time the
>> 'drm_bridge_add(&panel_bridge->bridge);' is called, we will deference
>> the *old* NULL pointer. Because it will touch the 'struct drm_bridge::list'
>> field, which's backing memory has been freed.
> Yes. That is what is causing the crash.
>
>>> These helpers tie the bridge add/removal to the device or drm_device
>>> passed in, but internally they call down to drm_panel_bridge_add_typed()
>>> which allocates the bridge object tied to the panel device.
>> When the devm_drm_panel_bridge_add_typed() is passed a pointer of
>> DSI host device, we essentially tie the lifetime of the freshly
>> created drm bridge instance to the DSI host device. But, the
>> 'struct panel_bridge' clearly hint that the bridge instance has
>> the same lifetime with the backing panel, after all, it's the
>> underlying panel baking the bridge.
> Exactly.
>
>>>> But then, the discussion becomes that bridges typically probe outside of
>>>> the "main" DRM device probe path, so you don't have access to the DRM
>>>> device structure until attach at best.
>>>>
>>>> That's why I'm a bit skeptical about your patch. It might workaround
>>>> your issue, but it doesn't actually solve the problem. I guess the best
>>>> way about it would be to convert bridges to reference counting, with the
>>>> device taking a reference at probe time when it allocates the structure
>>>> (and giving it back at remove time), and the DRM device taking one when
>>>> it's attached and one when it's detached.
>>> Without going as far, it's probably better to align the lifecycle of
>>> the two parts. Most other bridge drivers in the kernel have |drm_bridge|
>>> lifecycle tied to their underlying |device|, either with explicit
>>> drm_bridge_{add,remove}() calls in their probe/bind and remove/unbind
>>> callbacks respectively, or with devm_drm_bridge_add in the probe/bind
>>> path. The only ones with a narrower lifecycle are the DSI hosts, which
>>> add the bridge in during host attach and remove it during host detach.
>>>
>>> I'm thinking about fixing the panel_bridge lifecycle such that it is
>>> tied to the panel itself. Maybe that would involve making
>>> devm_drm_of_get_bridge() correctly return bridges even if a panel was
>>> found, and then making the panels create and add panel bridges directly,
>>> possibly within drm_panel_add(). Would that make sense?
>> I think, align the lifetime of the bridge with 'panel->dev' probably helps.
>> Modifying the devm_drm_of_get_bridge() function like the following pattern:
>>
>>
>> ```
>>
>> struct drm_bridge *devm_drm_of_get_bridge(struct device_node *np, u32
>> port, u32 endpoint)
>> {
>>       struct drm_bridge *bridge;
>>       struct drm_panel *panel;
>>       int ret;
>>
>>       ret = drm_of_find_panel_or_bridge(np, port, endpoint, &panel, &bridge);
>>       if (ret)
>>           return ERR_PTR(ret);
>>
>>       if (panel)
>>           bridge = devm_drm_panel_bridge_add(panel->dev, panel);
>>
>>       return bridge;
>> }
> That's one possible solution I thought of, but then the devm_ prefix
> no longer makes sense.

It still is dev managed...

Despite of_get_something() typically increase the reference count of
something. of_find_something() will return a pointer to the underlyinginstance, but will not touch the reference counter.

> Also the panel_bridge is still implicitly created,

Yeah.

And it will always create a new bridge instance when you call the function.
(if the backing panel is already populated).Similar with Dmitry's AUX/HPD bridge.

> and we might as well move that to the panel side.

Your idea is possible and may bring back better lifetime control,
I don't know which method is the best, gonna to run away.


>> ```
>>
>>
>> Or alternatively, inline this to drm/mediatek,
>> rename it as mtk_drm_of_get_bridge().
> I would prefer to not do that, since that only fixes the issue for
> MediaTek, while we have some 30 odd users of devm_drm_of_get_bridge().
>
>> Or alternatively, manage the bridge's lifetime manually.
>> Remove it from the global bridge list if errors happen.
> That's also one way; it's just messy.

Right.

>
> Thanks
> ChenYu
>
>>> Thanks
>>> ChenYu
>>>
>>>> It's much more involved than just another helper though :/
>>>>
>>>> Maxime
>> --
>> Best regards,
>> Sui
>>
Maxime Ripard Nov. 29, 2024, 10:51 a.m. UTC | #10
On Wed, Nov 27, 2024 at 05:58:31PM +0800, Chen-Yu Tsai wrote:
> Revisiting this thread since I just stepped on the same problem on a
> different device.
> 
> On Thu, Nov 14, 2024 at 9:12 PM Maxime Ripard <mripard@kernel.org> wrote:
> >
> > On Tue, Oct 29, 2024 at 10:53:49PM +0800, Fei Shao wrote:
> > > On Thu, Oct 24, 2024 at 8:36 PM Maxime Ripard <mripard@kernel.org> wrote:
> > > >
> > > > On Wed, Oct 09, 2024 at 01:23:31PM +0800, Fei Shao wrote:
> > > > > In the mtk_dsi driver, its DSI host attach callback calls
> > > > > devm_drm_of_get_bridge() to get the next bridge. If that next bridge is
> > > > > a panel bridge, a panel_bridge object is allocated and managed by the
> > > > > panel device.
> > > > >
> > > > > Later, if the attach callback fails with -EPROBE_DEFER from subsequent
> > > > > component_add(), the panel device invoking the callback at probe time
> > > > > also fails, and all device-managed resources are freed accordingly.
> > > > >
> > > > > This exposes a drm_bridge bridge_list corruption due to the unbalanced
> > > > > lifecycle between the DSI host and the panel devices: the panel_bridge
> > > > > object managed by panel device is freed, while drm_bridge_remove() is
> > > > > bound to DSI host device and never gets called.
> > > > > The next drm_bridge_add() will trigger UAF against the freed bridge list
> > > > > object and result in kernel panic.
> > > > >
> > > > > This bug is observed on a MediaTek MT8188-based Chromebook with MIPI DSI
> > > > > outputting to a DSI panel (DT is WIP for upstream).
> > > > >
> > > > > As a fix, using devm_drm_bridge_add() with the panel device in the panel
> > > > > path seems reasonable. This also implies a chain of potential cleanup
> > > > > actions:
> > > > >
> > > > > 1. Removing drm_bridge_remove() means devm_drm_panel_bridge_release()
> > > > >    becomes hollow and can be removed.
> > > > >
> > > > > 2. devm_drm_panel_bridge_add_typed() is almost emptied except for the
> > > > >    `bridge->pre_enable_prev_first` line. Itself can be also removed if
> > > > >    we move the line into drm_panel_bridge_add_typed(). (maybe?)
> > > > >
> > > > > 3. drm_panel_bridge_add_typed() now calls all the needed devm_* calls,
> > > > >    so it's essentially the new devm_drm_panel_bridge_add_typed().
> > > > >
> > > > > 4. drmm_panel_bridge_add() needs to be updated accordingly since it
> > > > >    calls drm_panel_bridge_add_typed(). But now there's only one bridge
> > > > >    object to be freed, and it's already being managed by panel device.
> > > > >    I wonder if we still need both drmm_ and devm_ version in this case.
> > > > >    (maybe yes from DRM PoV, I don't know much about the context)
> > > > >
> > > > > This is a RFC patch since I'm not sure if my understanding is correct
> > > > > (for both the fix and the cleanup). It fixes the issue I encountered,
> > > > > but I don't expect it to be picked up directly due to the redundant
> > > > > commit message and the dangling devm_drm_panel_bridge_release().
> > > > > I plan to resend the official patch(es) once I know what I supposed to
> > > > > do next.
> > > > >
> > > > > For reference, here's the KASAN report from the device:
> > > > > ==================================================================
> > > > >  BUG: KASAN: slab-use-after-free in drm_bridge_add+0x98/0x230
> > > > >  Read of size 8 at addr ffffff80c4e9e100 by task kworker/u32:1/69
> > > > >
> > > > >  CPU: 1 UID: 0 PID: 69 Comm: kworker/u32:1 Not tainted 6.12.0-rc1-next-20241004-kasan-00030-g062135fa4046 #1
> > > > >  Hardware name: Google Ciri sku0/unprovisioned board (DT)
> > > > >  Workqueue: events_unbound deferred_probe_work_func
> > > > >  Call trace:
> > > > >   dump_backtrace+0xfc/0x140
> > > > >   show_stack+0x24/0x38
> > > > >   dump_stack_lvl+0x40/0xc8
> > > > >   print_report+0x140/0x700
> > > > >   kasan_report+0xcc/0x130
> > > > >   __asan_report_load8_noabort+0x20/0x30
> > > > >   drm_bridge_add+0x98/0x230
> > > > >   devm_drm_panel_bridge_add_typed+0x174/0x298
> > > > >   devm_drm_of_get_bridge+0xe8/0x190
> > > > >   mtk_dsi_host_attach+0x130/0x2b0
> > > > >   mipi_dsi_attach+0x8c/0xe8
> > > > >   hx83102_probe+0x1a8/0x368
> > > > >   mipi_dsi_drv_probe+0x6c/0x88
> > > > >   really_probe+0x1c4/0x698
> > > > >   __driver_probe_device+0x160/0x298
> > > > >   driver_probe_device+0x7c/0x2a8
> > > > >   __device_attach_driver+0x2a0/0x398
> > > > >   bus_for_each_drv+0x198/0x200
> > > > >   __device_attach+0x1c0/0x308
> > > > >   device_initial_probe+0x20/0x38
> > > > >   bus_probe_device+0x11c/0x1f8
> > > > >   deferred_probe_work_func+0x80/0x250
> > > > >   worker_thread+0x9b4/0x2780
> > > > >   kthread+0x274/0x350
> > > > >   ret_from_fork+0x10/0x20
> > > > >
> > > > >  Allocated by task 69:
> > > > >   kasan_save_track+0x40/0x78
> > > > >   kasan_save_alloc_info+0x44/0x58
> > > > >   __kasan_kmalloc+0x84/0xa0
> > > > >   __kmalloc_node_track_caller_noprof+0x228/0x450
> > > > >   devm_kmalloc+0x6c/0x288
> > > > >   devm_drm_panel_bridge_add_typed+0xa0/0x298
> > > > >   devm_drm_of_get_bridge+0xe8/0x190
> > > > >   mtk_dsi_host_attach+0x130/0x2b0
> > > > >   mipi_dsi_attach+0x8c/0xe8
> > > > >   hx83102_probe+0x1a8/0x368
> > > > >   mipi_dsi_drv_probe+0x6c/0x88
> > > > >   really_probe+0x1c4/0x698
> > > > >   __driver_probe_device+0x160/0x298
> > > > >   driver_probe_device+0x7c/0x2a8
> > > > >   __device_attach_driver+0x2a0/0x398
> > > > >   bus_for_each_drv+0x198/0x200
> > > > >   __device_attach+0x1c0/0x308
> > > > >   device_initial_probe+0x20/0x38
> > > > >   bus_probe_device+0x11c/0x1f8
> > > > >   deferred_probe_work_func+0x80/0x250
> > > > >   worker_thread+0x9b4/0x2780
> > > > >   kthread+0x274/0x350
> > > > >   ret_from_fork+0x10/0x20
> > > > >
> > > > >  Freed by task 69:
> > > > >   kasan_save_track+0x40/0x78
> > > > >   kasan_save_free_info+0x58/0x78
> > > > >   __kasan_slab_free+0x48/0x68
> > > > >   kfree+0xd4/0x750
> > > > >   devres_release_all+0x144/0x1e8
> > > > >   really_probe+0x48c/0x698
> > > > >   __driver_probe_device+0x160/0x298
> > > > >   driver_probe_device+0x7c/0x2a8
> > > > >   __device_attach_driver+0x2a0/0x398
> > > > >   bus_for_each_drv+0x198/0x200
> > > > >   __device_attach+0x1c0/0x308
> > > > >   device_initial_probe+0x20/0x38
> > > > >   bus_probe_device+0x11c/0x1f8
> > > > >   deferred_probe_work_func+0x80/0x250
> > > > >   worker_thread+0x9b4/0x2780
> > > > >   kthread+0x274/0x350
> > > > >   ret_from_fork+0x10/0x20
> > > > >
> > > > >  The buggy address belongs to the object at ffffff80c4e9e000
> > > > >   which belongs to the cache kmalloc-4k of size 4096
> > > > >  The buggy address is located 256 bytes inside of
> > > > >   freed 4096-byte region [ffffff80c4e9e000, ffffff80c4e9f000)
> > > > >
> > > > >  The buggy address belongs to the physical page:
> > > > >  head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
> > > > >  flags: 0x8000000000000040(head|zone=2)
> > > > >  page_type: f5(slab)
> > > > >  page: refcount:1 mapcount:0 mapping:0000000000000000
> > > > >  index:0x0 pfn:0x104e98
> > > > >  raw: 8000000000000040 ffffff80c0003040 dead000000000122 0000000000000000
> > > > >  raw: 0000000000000000 0000000000040004 00000001f5000000 0000000000000000
> > > > >  head: 8000000000000040 ffffff80c0003040 dead000000000122 0000000000000000
> > > > >  head: 0000000000000000 0000000000040004 00000001f5000000 0000000000000000
> > > > >  head: 8000000000000003 fffffffec313a601 ffffffffffffffff 0000000000000000
> > > > >  head: 0000000000000008 0000000000000000 00000000ffffffff 0000000000000000
> > > > >  page dumped because: kasan: bad access detected
> > > > >
> > > > >  Memory state around the buggy address:
> > > > >   ffffff80c4e9e000: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> > > > >   ffffff80c4e9e080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> > > > >  >ffffff80c4e9e100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> > > > >                     ^
> > > > >   ffffff80c4e9e180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> > > > >   ffffff80c4e9e200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> > > > > ===================================================================
> > > > >
> > > > > Signed-off-by: Fei Shao <fshao@chromium.org>
> > > >
> > > > I was looking at the driver to try to follow your (awesome btw, thanks)
> > > > commit log, and it does have a quite different structure compared to
> > > > what we recommend.
> > > >
> > > > Would following
> > > > https://docs.kernel.org/gpu/drm-kms-helpers.html#special-care-with-mipi-dsi-bridges
> > > > help?
> > >
> > > Hi Maxime,
> > >
> > > Thank you for the pointer.
> > > I read the suggested pattern in the doc and compared it with the
> > > drivers. If I understand correctly, both the MIPI-DSI host and panel
> > > drivers follow the instructions:
> > >
> > > 1. The MIPI-DSI host driver must run mipi_dsi_host_register() in its probe hook.
> > >    >> drm/mediatek/mtk_dsi.c runs mipi_dsi_host_register() in the probe hook.
> > > 2. In its probe hook, the bridge driver must try to find its MIPI-DSI
> > > host, register as a MIPI-DSI device and attach the MIPI-DSI device to
> > > its host.
> > >    >> drm/panel/panel-himax-hx83102.c follows and runs
> > > mipi_dsi_attach() at the end of probe hook.
> > > 3. In its struct mipi_dsi_host_ops.attach hook, the MIPI-DSI host can
> > > now add its component.
> > >    >> drm/mediatek/mtk_dsi.c calls component_add() in the attach callback.
> > >
> > > Could you elaborate on the "different structures" you mentioned?
> >
> > Yeah, you're right, sorry.
> >
> > > To clarify my point: the issue is that component_add() may return
> > > -EPROBE_DEFER if the component (e.g. DSI encoder) is not ready,
> > > causing the panel bridge to be removed. However, drm_bridge_remove()
> > > is bound to MIPI-DSI host instead of panel bridge, which owns the
> > > actual list_head object.
> > >
> > > This might be reproducible with other MIPI-DSI host + panel
> > > combinations by forcibly returning -EPROBE_DEFER in the host attach
> > > hook (verification with another device is needed), so the fix may be
> > > required in drm/bridge/panel.c.
> 
> > Yeah, I think you're just hitting another bridge lifetime issue, and
> > it's not the only one unfortunately. Tying the bridge structure lifetime
> > itself to the device is wrong, it should be tied to the DRM device
> > lifetime instead.
> 
> I think the more immediate issue is that the bridge object's lifetime
> and drm_bridge_add/remove are inconsistent when devm_drm_of_get_bridge()
> or drmm_of_get_bridge() are used.
> 
> These helpers tie the bridge add/removal to the device or drm_device
> passed in, but internally they call down to drm_panel_bridge_add_typed()
> which allocates the bridge object tied to the panel device.
> > But then, the discussion becomes that bridges typically probe outside of
> > the "main" DRM device probe path, so you don't have access to the DRM
> > device structure until attach at best.
> >
> > That's why I'm a bit skeptical about your patch. It might workaround
> > your issue, but it doesn't actually solve the problem. I guess the best
> > way about it would be to convert bridges to reference counting, with the
> > device taking a reference at probe time when it allocates the structure
> > (and giving it back at remove time), and the DRM device taking one when
> > it's attached and one when it's detached.
> 
> Without going as far, it's probably better to align the lifecycle of
> the two parts. Most other bridge drivers in the kernel have |drm_bridge|
> lifecycle tied to their underlying |device|, either with explicit
> drm_bridge_{add,remove}() calls in their probe/bind and remove/unbind
> callbacks respectively, or with devm_drm_bridge_add in the probe/bind
> path. The only ones with a narrower lifecycle are the DSI hosts, which
> add the bridge in during host attach and remove it during host detach.
> 
> I'm thinking about fixing the panel_bridge lifecycle such that it is
> tied to the panel itself. Maybe that would involve making
> devm_drm_of_get_bridge() correctly return bridges even if a panel was
> found, and then making the panels create and add panel bridges directly,
> possibly within drm_panel_add(). Would that make sense?

Not really. Or rather, it doesn't fix the root cause that is that tieing
the bridge lifetime to the device is wrong. It needs to be tied to the
DRM device somehow. Your suggestion might indeed work around your issue,
but it doesn't fix the actual problem.

Maxime
Sui Jingfeng Nov. 29, 2024, 2:12 p.m. UTC | #11
Hi,

On 2024/11/29 18:51, Maxime Ripard wrote:
> On Wed, Nov 27, 2024 at 05:58:31PM +0800, Chen-Yu Tsai wrote:
>> Revisiting this thread since I just stepped on the same problem on a
>> different device.
>>
>> On Thu, Nov 14, 2024 at 9:12 PM Maxime Ripard <mripard@kernel.org> wrote:
>>> On Tue, Oct 29, 2024 at 10:53:49PM +0800, Fei Shao wrote:
>>>> On Thu, Oct 24, 2024 at 8:36 PM Maxime Ripard <mripard@kernel.org> wrote:
>>>>> On Wed, Oct 09, 2024 at 01:23:31PM +0800, Fei Shao wrote:
>>>>>> In the mtk_dsi driver, its DSI host attach callback calls
>>>>>> devm_drm_of_get_bridge() to get the next bridge. If that next bridge is
>>>>>> a panel bridge, a panel_bridge object is allocated and managed by the
>>>>>> panel device.
>>>>>>
>>>>>> Later, if the attach callback fails with -EPROBE_DEFER from subsequent
>>>>>> component_add(), the panel device invoking the callback at probe time
>>>>>> also fails, and all device-managed resources are freed accordingly.
>>>>>>
>>>>>> This exposes a drm_bridge bridge_list corruption due to the unbalanced
>>>>>> lifecycle between the DSI host and the panel devices: the panel_bridge
>>>>>> object managed by panel device is freed, while drm_bridge_remove() is
>>>>>> bound to DSI host device and never gets called.
>>>>>> The next drm_bridge_add() will trigger UAF against the freed bridge list
>>>>>> object and result in kernel panic.
>>>>>>
>>>>>> This bug is observed on a MediaTek MT8188-based Chromebook with MIPI DSI
>>>>>> outputting to a DSI panel (DT is WIP for upstream).
>>>>>>
>>>>>> As a fix, using devm_drm_bridge_add() with the panel device in the panel
>>>>>> path seems reasonable. This also implies a chain of potential cleanup
>>>>>> actions:
>>>>>>
>>>>>> 1. Removing drm_bridge_remove() means devm_drm_panel_bridge_release()
>>>>>>     becomes hollow and can be removed.
>>>>>>
>>>>>> 2. devm_drm_panel_bridge_add_typed() is almost emptied except for the
>>>>>>     `bridge->pre_enable_prev_first` line. Itself can be also removed if
>>>>>>     we move the line into drm_panel_bridge_add_typed(). (maybe?)
>>>>>>
>>>>>> 3. drm_panel_bridge_add_typed() now calls all the needed devm_* calls,
>>>>>>     so it's essentially the new devm_drm_panel_bridge_add_typed().
>>>>>>
>>>>>> 4. drmm_panel_bridge_add() needs to be updated accordingly since it
>>>>>>     calls drm_panel_bridge_add_typed(). But now there's only one bridge
>>>>>>     object to be freed, and it's already being managed by panel device.
>>>>>>     I wonder if we still need both drmm_ and devm_ version in this case.
>>>>>>     (maybe yes from DRM PoV, I don't know much about the context)
>>>>>>
>>>>>> This is a RFC patch since I'm not sure if my understanding is correct
>>>>>> (for both the fix and the cleanup). It fixes the issue I encountered,
>>>>>> but I don't expect it to be picked up directly due to the redundant
>>>>>> commit message and the dangling devm_drm_panel_bridge_release().
>>>>>> I plan to resend the official patch(es) once I know what I supposed to
>>>>>> do next.
>>>>>>
>>>>>> For reference, here's the KASAN report from the device:
>>>>>> ==================================================================
>>>>>>   BUG: KASAN: slab-use-after-free in drm_bridge_add+0x98/0x230
>>>>>>   Read of size 8 at addr ffffff80c4e9e100 by task kworker/u32:1/69
>>>>>>
>>>>>>   CPU: 1 UID: 0 PID: 69 Comm: kworker/u32:1 Not tainted 6.12.0-rc1-next-20241004-kasan-00030-g062135fa4046 #1
>>>>>>   Hardware name: Google Ciri sku0/unprovisioned board (DT)
>>>>>>   Workqueue: events_unbound deferred_probe_work_func
>>>>>>   Call trace:
>>>>>>    dump_backtrace+0xfc/0x140
>>>>>>    show_stack+0x24/0x38
>>>>>>    dump_stack_lvl+0x40/0xc8
>>>>>>    print_report+0x140/0x700
>>>>>>    kasan_report+0xcc/0x130
>>>>>>    __asan_report_load8_noabort+0x20/0x30
>>>>>>    drm_bridge_add+0x98/0x230
>>>>>>    devm_drm_panel_bridge_add_typed+0x174/0x298
>>>>>>    devm_drm_of_get_bridge+0xe8/0x190
>>>>>>    mtk_dsi_host_attach+0x130/0x2b0
>>>>>>    mipi_dsi_attach+0x8c/0xe8
>>>>>>    hx83102_probe+0x1a8/0x368
>>>>>>    mipi_dsi_drv_probe+0x6c/0x88
>>>>>>    really_probe+0x1c4/0x698
>>>>>>    __driver_probe_device+0x160/0x298
>>>>>>    driver_probe_device+0x7c/0x2a8
>>>>>>    __device_attach_driver+0x2a0/0x398
>>>>>>    bus_for_each_drv+0x198/0x200
>>>>>>    __device_attach+0x1c0/0x308
>>>>>>    device_initial_probe+0x20/0x38
>>>>>>    bus_probe_device+0x11c/0x1f8
>>>>>>    deferred_probe_work_func+0x80/0x250
>>>>>>    worker_thread+0x9b4/0x2780
>>>>>>    kthread+0x274/0x350
>>>>>>    ret_from_fork+0x10/0x20
>>>>>>
>>>>>>   Allocated by task 69:
>>>>>>    kasan_save_track+0x40/0x78
>>>>>>    kasan_save_alloc_info+0x44/0x58
>>>>>>    __kasan_kmalloc+0x84/0xa0
>>>>>>    __kmalloc_node_track_caller_noprof+0x228/0x450
>>>>>>    devm_kmalloc+0x6c/0x288
>>>>>>    devm_drm_panel_bridge_add_typed+0xa0/0x298
>>>>>>    devm_drm_of_get_bridge+0xe8/0x190
>>>>>>    mtk_dsi_host_attach+0x130/0x2b0
>>>>>>    mipi_dsi_attach+0x8c/0xe8
>>>>>>    hx83102_probe+0x1a8/0x368
>>>>>>    mipi_dsi_drv_probe+0x6c/0x88
>>>>>>    really_probe+0x1c4/0x698
>>>>>>    __driver_probe_device+0x160/0x298
>>>>>>    driver_probe_device+0x7c/0x2a8
>>>>>>    __device_attach_driver+0x2a0/0x398
>>>>>>    bus_for_each_drv+0x198/0x200
>>>>>>    __device_attach+0x1c0/0x308
>>>>>>    device_initial_probe+0x20/0x38
>>>>>>    bus_probe_device+0x11c/0x1f8
>>>>>>    deferred_probe_work_func+0x80/0x250
>>>>>>    worker_thread+0x9b4/0x2780
>>>>>>    kthread+0x274/0x350
>>>>>>    ret_from_fork+0x10/0x20
>>>>>>
>>>>>>   Freed by task 69:
>>>>>>    kasan_save_track+0x40/0x78
>>>>>>    kasan_save_free_info+0x58/0x78
>>>>>>    __kasan_slab_free+0x48/0x68
>>>>>>    kfree+0xd4/0x750
>>>>>>    devres_release_all+0x144/0x1e8
>>>>>>    really_probe+0x48c/0x698
>>>>>>    __driver_probe_device+0x160/0x298
>>>>>>    driver_probe_device+0x7c/0x2a8
>>>>>>    __device_attach_driver+0x2a0/0x398
>>>>>>    bus_for_each_drv+0x198/0x200
>>>>>>    __device_attach+0x1c0/0x308
>>>>>>    device_initial_probe+0x20/0x38
>>>>>>    bus_probe_device+0x11c/0x1f8
>>>>>>    deferred_probe_work_func+0x80/0x250
>>>>>>    worker_thread+0x9b4/0x2780
>>>>>>    kthread+0x274/0x350
>>>>>>    ret_from_fork+0x10/0x20
>>>>>>
>>>>>>   The buggy address belongs to the object at ffffff80c4e9e000
>>>>>>    which belongs to the cache kmalloc-4k of size 4096
>>>>>>   The buggy address is located 256 bytes inside of
>>>>>>    freed 4096-byte region [ffffff80c4e9e000, ffffff80c4e9f000)
>>>>>>
>>>>>>   The buggy address belongs to the physical page:
>>>>>>   head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
>>>>>>   flags: 0x8000000000000040(head|zone=2)
>>>>>>   page_type: f5(slab)
>>>>>>   page: refcount:1 mapcount:0 mapping:0000000000000000
>>>>>>   index:0x0 pfn:0x104e98
>>>>>>   raw: 8000000000000040 ffffff80c0003040 dead000000000122 0000000000000000
>>>>>>   raw: 0000000000000000 0000000000040004 00000001f5000000 0000000000000000
>>>>>>   head: 8000000000000040 ffffff80c0003040 dead000000000122 0000000000000000
>>>>>>   head: 0000000000000000 0000000000040004 00000001f5000000 0000000000000000
>>>>>>   head: 8000000000000003 fffffffec313a601 ffffffffffffffff 0000000000000000
>>>>>>   head: 0000000000000008 0000000000000000 00000000ffffffff 0000000000000000
>>>>>>   page dumped because: kasan: bad access detected
>>>>>>
>>>>>>   Memory state around the buggy address:
>>>>>>    ffffff80c4e9e000: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>>>>>>    ffffff80c4e9e080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>>>>>>   >ffffff80c4e9e100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>>>>>>                      ^
>>>>>>    ffffff80c4e9e180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>>>>>>    ffffff80c4e9e200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>>>>>> ===================================================================
>>>>>>
>>>>>> Signed-off-by: Fei Shao <fshao@chromium.org>
>>>>> I was looking at the driver to try to follow your (awesome btw, thanks)
>>>>> commit log, and it does have a quite different structure compared to
>>>>> what we recommend.
>>>>>
>>>>> Would following
>>>>> https://docs.kernel.org/gpu/drm-kms-helpers.html#special-care-with-mipi-dsi-bridges
>>>>> help?
>>>> Hi Maxime,
>>>>
>>>> Thank you for the pointer.
>>>> I read the suggested pattern in the doc and compared it with the
>>>> drivers. If I understand correctly, both the MIPI-DSI host and panel
>>>> drivers follow the instructions:
>>>>
>>>> 1. The MIPI-DSI host driver must run mipi_dsi_host_register() in its probe hook.
>>>>     >> drm/mediatek/mtk_dsi.c runs mipi_dsi_host_register() in the probe hook.
>>>> 2. In its probe hook, the bridge driver must try to find its MIPI-DSI
>>>> host, register as a MIPI-DSI device and attach the MIPI-DSI device to
>>>> its host.
>>>>     >> drm/panel/panel-himax-hx83102.c follows and runs
>>>> mipi_dsi_attach() at the end of probe hook.
>>>> 3. In its struct mipi_dsi_host_ops.attach hook, the MIPI-DSI host can
>>>> now add its component.
>>>>     >> drm/mediatek/mtk_dsi.c calls component_add() in the attach callback.
>>>>
>>>> Could you elaborate on the "different structures" you mentioned?
>>> Yeah, you're right, sorry.
>>>
>>>> To clarify my point: the issue is that component_add() may return
>>>> -EPROBE_DEFER if the component (e.g. DSI encoder) is not ready,
>>>> causing the panel bridge to be removed. However, drm_bridge_remove()
>>>> is bound to MIPI-DSI host instead of panel bridge, which owns the
>>>> actual list_head object.
>>>>
>>>> This might be reproducible with other MIPI-DSI host + panel
>>>> combinations by forcibly returning -EPROBE_DEFER in the host attach
>>>> hook (verification with another device is needed), so the fix may be
>>>> required in drm/bridge/panel.c.
>>> Yeah, I think you're just hitting another bridge lifetime issue, and
>>> it's not the only one unfortunately. Tying the bridge structure lifetime
>>> itself to the device is wrong, it should be tied to the DRM device
>>> lifetime instead.
>> I think the more immediate issue is that the bridge object's lifetime
>> and drm_bridge_add/remove are inconsistent when devm_drm_of_get_bridge()
>> or drmm_of_get_bridge() are used.
>>
>> These helpers tie the bridge add/removal to the device or drm_device
>> passed in, but internally they call down to drm_panel_bridge_add_typed()
>> which allocates the bridge object tied to the panel device.
>>> But then, the discussion becomes that bridges typically probe outside of
>>> the "main" DRM device probe path, so you don't have access to the DRM
>>> device structure until attach at best.
>>>
>>> That's why I'm a bit skeptical about your patch. It might workaround
>>> your issue, but it doesn't actually solve the problem. I guess the best
>>> way about it would be to convert bridges to reference counting, with the
>>> device taking a reference at probe time when it allocates the structure
>>> (and giving it back at remove time), and the DRM device taking one when
>>> it's attached and one when it's detached.
>> Without going as far, it's probably better to align the lifecycle of
>> the two parts. Most other bridge drivers in the kernel have |drm_bridge|
>> lifecycle tied to their underlying |device|, either with explicit
>> drm_bridge_{add,remove}() calls in their probe/bind and remove/unbind
>> callbacks respectively, or with devm_drm_bridge_add in the probe/bind
>> path. The only ones with a narrower lifecycle are the DSI hosts, which
>> add the bridge in during host attach and remove it during host detach.
>>
>> I'm thinking about fixing the panel_bridge lifecycle such that it is
>> tied to the panel itself. Maybe that would involve making
>> devm_drm_of_get_bridge() correctly return bridges even if a panel was
>> found, and then making the panels create and add panel bridges directly,
>> possibly within drm_panel_add(). Would that make sense?
> Not really.


[...]


> Or rather, it doesn't fix the root cause that is that tieing
> the bridge lifetime to the device is wrong.


This is multiple kernel driver module probe issue, not an issue
about bridge's lifetime.


The life time of the bridge of an 'struct panel_bridge' has
been tied to the 'panel->dev' since 2017 [1].

See commit 13dfc0540a575b47b2d640b093ac16e9e09474f6
("drm/bridge: Refactor out the panel wrapper from the lvds-encoder bridge.")


[1] https://patchwork.freedesktop.org/patch/159673/


>   It needs to be tied to the DRM device somehow.

Why?

It's the underlying hardware device backing the bridge, if the
backing hardware device has been freed, How does the bound drm
bridge driver could continue to work?

How could the dangling pointer stored in the bridge_list still
will make sense?

The imx-lcdif could instantiate three DRM driver, which one
should be selected as the "main" DRM device to attach?


No, It is messy since day 0. And has been made worse since 2017,
from then, thedevm_drm_panel_bridge_add() [2] was initially introduced. Which allow us 
to abuse the lifetime of bridge to a different device or (any device).

Maxime's patch just follow this way, but if the caller side
wise enough to refuse to use those helper, we should be still
safe. That why I suggest ChenYu to inline and with a little bit
revise.

[2] https://patchwork.freedesktop.org/patch/167666/

[3] 
https://lore.kernel.org/dri-devel/20210910130941.1740182-2-maxime@cerno.tech/

> Your suggestion might indeed work around your issue,


To be clear, the mentioned problem in this thread is caused
by deferral probe. We should remove the dangling pointer
stored in the bridge_list, This is just something similar to
the fault cleanup or error handling, Right?

But the fundamental thing is that the issue is happened in
the deferral probe context.


> but it doesn't fix the actual problem.

If drm bridge still have lifetime related issue, then it is
yet an another problem. Which should be then orthogonal to
this one. Then, it should deserve an another fix.

Best regards,
Sui

> Maxime
Maxime Ripard Nov. 29, 2024, 2:54 p.m. UTC | #12
On Fri, Nov 29, 2024 at 10:12:02PM +0800, Sui Jingfeng wrote:
> Hi,
> 
> On 2024/11/29 18:51, Maxime Ripard wrote:
> > On Wed, Nov 27, 2024 at 05:58:31PM +0800, Chen-Yu Tsai wrote:
> > > Revisiting this thread since I just stepped on the same problem on a
> > > different device.
> > > 
> > > On Thu, Nov 14, 2024 at 9:12 PM Maxime Ripard <mripard@kernel.org> wrote:
> > > > On Tue, Oct 29, 2024 at 10:53:49PM +0800, Fei Shao wrote:
> > > > > On Thu, Oct 24, 2024 at 8:36 PM Maxime Ripard <mripard@kernel.org> wrote:
> > > > > > On Wed, Oct 09, 2024 at 01:23:31PM +0800, Fei Shao wrote:
> > > > > > > In the mtk_dsi driver, its DSI host attach callback calls
> > > > > > > devm_drm_of_get_bridge() to get the next bridge. If that next bridge is
> > > > > > > a panel bridge, a panel_bridge object is allocated and managed by the
> > > > > > > panel device.
> > > > > > > 
> > > > > > > Later, if the attach callback fails with -EPROBE_DEFER from subsequent
> > > > > > > component_add(), the panel device invoking the callback at probe time
> > > > > > > also fails, and all device-managed resources are freed accordingly.
> > > > > > > 
> > > > > > > This exposes a drm_bridge bridge_list corruption due to the unbalanced
> > > > > > > lifecycle between the DSI host and the panel devices: the panel_bridge
> > > > > > > object managed by panel device is freed, while drm_bridge_remove() is
> > > > > > > bound to DSI host device and never gets called.
> > > > > > > The next drm_bridge_add() will trigger UAF against the freed bridge list
> > > > > > > object and result in kernel panic.
> > > > > > > 
> > > > > > > This bug is observed on a MediaTek MT8188-based Chromebook with MIPI DSI
> > > > > > > outputting to a DSI panel (DT is WIP for upstream).
> > > > > > > 
> > > > > > > As a fix, using devm_drm_bridge_add() with the panel device in the panel
> > > > > > > path seems reasonable. This also implies a chain of potential cleanup
> > > > > > > actions:
> > > > > > > 
> > > > > > > 1. Removing drm_bridge_remove() means devm_drm_panel_bridge_release()
> > > > > > >     becomes hollow and can be removed.
> > > > > > > 
> > > > > > > 2. devm_drm_panel_bridge_add_typed() is almost emptied except for the
> > > > > > >     `bridge->pre_enable_prev_first` line. Itself can be also removed if
> > > > > > >     we move the line into drm_panel_bridge_add_typed(). (maybe?)
> > > > > > > 
> > > > > > > 3. drm_panel_bridge_add_typed() now calls all the needed devm_* calls,
> > > > > > >     so it's essentially the new devm_drm_panel_bridge_add_typed().
> > > > > > > 
> > > > > > > 4. drmm_panel_bridge_add() needs to be updated accordingly since it
> > > > > > >     calls drm_panel_bridge_add_typed(). But now there's only one bridge
> > > > > > >     object to be freed, and it's already being managed by panel device.
> > > > > > >     I wonder if we still need both drmm_ and devm_ version in this case.
> > > > > > >     (maybe yes from DRM PoV, I don't know much about the context)
> > > > > > > 
> > > > > > > This is a RFC patch since I'm not sure if my understanding is correct
> > > > > > > (for both the fix and the cleanup). It fixes the issue I encountered,
> > > > > > > but I don't expect it to be picked up directly due to the redundant
> > > > > > > commit message and the dangling devm_drm_panel_bridge_release().
> > > > > > > I plan to resend the official patch(es) once I know what I supposed to
> > > > > > > do next.
> > > > > > > 
> > > > > > > For reference, here's the KASAN report from the device:
> > > > > > > ==================================================================
> > > > > > >   BUG: KASAN: slab-use-after-free in drm_bridge_add+0x98/0x230
> > > > > > >   Read of size 8 at addr ffffff80c4e9e100 by task kworker/u32:1/69
> > > > > > > 
> > > > > > >   CPU: 1 UID: 0 PID: 69 Comm: kworker/u32:1 Not tainted 6.12.0-rc1-next-20241004-kasan-00030-g062135fa4046 #1
> > > > > > >   Hardware name: Google Ciri sku0/unprovisioned board (DT)
> > > > > > >   Workqueue: events_unbound deferred_probe_work_func
> > > > > > >   Call trace:
> > > > > > >    dump_backtrace+0xfc/0x140
> > > > > > >    show_stack+0x24/0x38
> > > > > > >    dump_stack_lvl+0x40/0xc8
> > > > > > >    print_report+0x140/0x700
> > > > > > >    kasan_report+0xcc/0x130
> > > > > > >    __asan_report_load8_noabort+0x20/0x30
> > > > > > >    drm_bridge_add+0x98/0x230
> > > > > > >    devm_drm_panel_bridge_add_typed+0x174/0x298
> > > > > > >    devm_drm_of_get_bridge+0xe8/0x190
> > > > > > >    mtk_dsi_host_attach+0x130/0x2b0
> > > > > > >    mipi_dsi_attach+0x8c/0xe8
> > > > > > >    hx83102_probe+0x1a8/0x368
> > > > > > >    mipi_dsi_drv_probe+0x6c/0x88
> > > > > > >    really_probe+0x1c4/0x698
> > > > > > >    __driver_probe_device+0x160/0x298
> > > > > > >    driver_probe_device+0x7c/0x2a8
> > > > > > >    __device_attach_driver+0x2a0/0x398
> > > > > > >    bus_for_each_drv+0x198/0x200
> > > > > > >    __device_attach+0x1c0/0x308
> > > > > > >    device_initial_probe+0x20/0x38
> > > > > > >    bus_probe_device+0x11c/0x1f8
> > > > > > >    deferred_probe_work_func+0x80/0x250
> > > > > > >    worker_thread+0x9b4/0x2780
> > > > > > >    kthread+0x274/0x350
> > > > > > >    ret_from_fork+0x10/0x20
> > > > > > > 
> > > > > > >   Allocated by task 69:
> > > > > > >    kasan_save_track+0x40/0x78
> > > > > > >    kasan_save_alloc_info+0x44/0x58
> > > > > > >    __kasan_kmalloc+0x84/0xa0
> > > > > > >    __kmalloc_node_track_caller_noprof+0x228/0x450
> > > > > > >    devm_kmalloc+0x6c/0x288
> > > > > > >    devm_drm_panel_bridge_add_typed+0xa0/0x298
> > > > > > >    devm_drm_of_get_bridge+0xe8/0x190
> > > > > > >    mtk_dsi_host_attach+0x130/0x2b0
> > > > > > >    mipi_dsi_attach+0x8c/0xe8
> > > > > > >    hx83102_probe+0x1a8/0x368
> > > > > > >    mipi_dsi_drv_probe+0x6c/0x88
> > > > > > >    really_probe+0x1c4/0x698
> > > > > > >    __driver_probe_device+0x160/0x298
> > > > > > >    driver_probe_device+0x7c/0x2a8
> > > > > > >    __device_attach_driver+0x2a0/0x398
> > > > > > >    bus_for_each_drv+0x198/0x200
> > > > > > >    __device_attach+0x1c0/0x308
> > > > > > >    device_initial_probe+0x20/0x38
> > > > > > >    bus_probe_device+0x11c/0x1f8
> > > > > > >    deferred_probe_work_func+0x80/0x250
> > > > > > >    worker_thread+0x9b4/0x2780
> > > > > > >    kthread+0x274/0x350
> > > > > > >    ret_from_fork+0x10/0x20
> > > > > > > 
> > > > > > >   Freed by task 69:
> > > > > > >    kasan_save_track+0x40/0x78
> > > > > > >    kasan_save_free_info+0x58/0x78
> > > > > > >    __kasan_slab_free+0x48/0x68
> > > > > > >    kfree+0xd4/0x750
> > > > > > >    devres_release_all+0x144/0x1e8
> > > > > > >    really_probe+0x48c/0x698
> > > > > > >    __driver_probe_device+0x160/0x298
> > > > > > >    driver_probe_device+0x7c/0x2a8
> > > > > > >    __device_attach_driver+0x2a0/0x398
> > > > > > >    bus_for_each_drv+0x198/0x200
> > > > > > >    __device_attach+0x1c0/0x308
> > > > > > >    device_initial_probe+0x20/0x38
> > > > > > >    bus_probe_device+0x11c/0x1f8
> > > > > > >    deferred_probe_work_func+0x80/0x250
> > > > > > >    worker_thread+0x9b4/0x2780
> > > > > > >    kthread+0x274/0x350
> > > > > > >    ret_from_fork+0x10/0x20
> > > > > > > 
> > > > > > >   The buggy address belongs to the object at ffffff80c4e9e000
> > > > > > >    which belongs to the cache kmalloc-4k of size 4096
> > > > > > >   The buggy address is located 256 bytes inside of
> > > > > > >    freed 4096-byte region [ffffff80c4e9e000, ffffff80c4e9f000)
> > > > > > > 
> > > > > > >   The buggy address belongs to the physical page:
> > > > > > >   head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
> > > > > > >   flags: 0x8000000000000040(head|zone=2)
> > > > > > >   page_type: f5(slab)
> > > > > > >   page: refcount:1 mapcount:0 mapping:0000000000000000
> > > > > > >   index:0x0 pfn:0x104e98
> > > > > > >   raw: 8000000000000040 ffffff80c0003040 dead000000000122 0000000000000000
> > > > > > >   raw: 0000000000000000 0000000000040004 00000001f5000000 0000000000000000
> > > > > > >   head: 8000000000000040 ffffff80c0003040 dead000000000122 0000000000000000
> > > > > > >   head: 0000000000000000 0000000000040004 00000001f5000000 0000000000000000
> > > > > > >   head: 8000000000000003 fffffffec313a601 ffffffffffffffff 0000000000000000
> > > > > > >   head: 0000000000000008 0000000000000000 00000000ffffffff 0000000000000000
> > > > > > >   page dumped because: kasan: bad access detected
> > > > > > > 
> > > > > > >   Memory state around the buggy address:
> > > > > > >    ffffff80c4e9e000: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> > > > > > >    ffffff80c4e9e080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> > > > > > >   >ffffff80c4e9e100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> > > > > > >                      ^
> > > > > > >    ffffff80c4e9e180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> > > > > > >    ffffff80c4e9e200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> > > > > > > ===================================================================
> > > > > > > 
> > > > > > > Signed-off-by: Fei Shao <fshao@chromium.org>
> > > > > > I was looking at the driver to try to follow your (awesome btw, thanks)
> > > > > > commit log, and it does have a quite different structure compared to
> > > > > > what we recommend.
> > > > > > 
> > > > > > Would following
> > > > > > https://docs.kernel.org/gpu/drm-kms-helpers.html#special-care-with-mipi-dsi-bridges
> > > > > > help?
> > > > > Hi Maxime,
> > > > > 
> > > > > Thank you for the pointer.
> > > > > I read the suggested pattern in the doc and compared it with the
> > > > > drivers. If I understand correctly, both the MIPI-DSI host and panel
> > > > > drivers follow the instructions:
> > > > > 
> > > > > 1. The MIPI-DSI host driver must run mipi_dsi_host_register() in its probe hook.
> > > > >     >> drm/mediatek/mtk_dsi.c runs mipi_dsi_host_register() in the probe hook.
> > > > > 2. In its probe hook, the bridge driver must try to find its MIPI-DSI
> > > > > host, register as a MIPI-DSI device and attach the MIPI-DSI device to
> > > > > its host.
> > > > >     >> drm/panel/panel-himax-hx83102.c follows and runs
> > > > > mipi_dsi_attach() at the end of probe hook.
> > > > > 3. In its struct mipi_dsi_host_ops.attach hook, the MIPI-DSI host can
> > > > > now add its component.
> > > > >     >> drm/mediatek/mtk_dsi.c calls component_add() in the attach callback.
> > > > > 
> > > > > Could you elaborate on the "different structures" you mentioned?
> > > > Yeah, you're right, sorry.
> > > > 
> > > > > To clarify my point: the issue is that component_add() may return
> > > > > -EPROBE_DEFER if the component (e.g. DSI encoder) is not ready,
> > > > > causing the panel bridge to be removed. However, drm_bridge_remove()
> > > > > is bound to MIPI-DSI host instead of panel bridge, which owns the
> > > > > actual list_head object.
> > > > > 
> > > > > This might be reproducible with other MIPI-DSI host + panel
> > > > > combinations by forcibly returning -EPROBE_DEFER in the host attach
> > > > > hook (verification with another device is needed), so the fix may be
> > > > > required in drm/bridge/panel.c.
> > > > Yeah, I think you're just hitting another bridge lifetime issue, and
> > > > it's not the only one unfortunately. Tying the bridge structure lifetime
> > > > itself to the device is wrong, it should be tied to the DRM device
> > > > lifetime instead.
> > > I think the more immediate issue is that the bridge object's lifetime
> > > and drm_bridge_add/remove are inconsistent when devm_drm_of_get_bridge()
> > > or drmm_of_get_bridge() are used.
> > > 
> > > These helpers tie the bridge add/removal to the device or drm_device
> > > passed in, but internally they call down to drm_panel_bridge_add_typed()
> > > which allocates the bridge object tied to the panel device.
> > > > But then, the discussion becomes that bridges typically probe outside of
> > > > the "main" DRM device probe path, so you don't have access to the DRM
> > > > device structure until attach at best.
> > > > 
> > > > That's why I'm a bit skeptical about your patch. It might workaround
> > > > your issue, but it doesn't actually solve the problem. I guess the best
> > > > way about it would be to convert bridges to reference counting, with the
> > > > device taking a reference at probe time when it allocates the structure
> > > > (and giving it back at remove time), and the DRM device taking one when
> > > > it's attached and one when it's detached.
> > > Without going as far, it's probably better to align the lifecycle of
> > > the two parts. Most other bridge drivers in the kernel have |drm_bridge|
> > > lifecycle tied to their underlying |device|, either with explicit
> > > drm_bridge_{add,remove}() calls in their probe/bind and remove/unbind
> > > callbacks respectively, or with devm_drm_bridge_add in the probe/bind
> > > path. The only ones with a narrower lifecycle are the DSI hosts, which
> > > add the bridge in during host attach and remove it during host detach.
> > > 
> > > I'm thinking about fixing the panel_bridge lifecycle such that it is
> > > tied to the panel itself. Maybe that would involve making
> > > devm_drm_of_get_bridge() correctly return bridges even if a panel was
> > > found, and then making the panels create and add panel bridges directly,
> > > possibly within drm_panel_add(). Would that make sense?
> > Not really.
> 
> 
> [...]
> 
> 
> > Or rather, it doesn't fix the root cause that is that tieing
> > the bridge lifetime to the device is wrong.
> 
> 
> This is multiple kernel driver module probe issue, not an issue
> about bridge's lifetime.
> 
> 
> The life time of the bridge of an 'struct panel_bridge' has
> been tied to the 'panel->dev' since 2017 [1].
> 
> See commit 13dfc0540a575b47b2d640b093ac16e9e09474f6
> ("drm/bridge: Refactor out the panel wrapper from the lvds-encoder bridge.")
>
> [1] https://patchwork.freedesktop.org/patch/159673/

Yeah, and it's been wrong since 2017.

> >   It needs to be tied to the DRM device somehow.
> 
> Why?

Because the DRM device is only free'd when the last userspace
application has closed it's FD to it, which might much later than the
device being removed. So if we tie that to the device lifetime, and the
device goes away, we have a dangling pointer and potential
use-after-free issue if the application continues to access its fd.

> It's the underlying hardware device backing the bridge, if the
> backing hardware device has been freed, How does the bound drm
> bridge driver could continue to work?

Using drm_dev_enter/drm_dev_exit.

> How could the dangling pointer stored in the bridge_list still
> will make sense?

It's dangling only if the bridge has been free'd while still having a
pointer to it. If you have a reference counted allocation, it's not
dangling anymore.

> The imx-lcdif could instantiate three DRM driver, which one
> should be selected as the "main" DRM device to attach?

The one the bridge attaches to?

> No, It is messy since day 0. And has been made worse since 2017,
> from then, thedevm_drm_panel_bridge_add() [2] was initially introduced.
> Which allow us to abuse the lifetime of bridge to a different device or (any
> device).

I agree it's messy. I'm sure you'd agree that we do not want to make the
situation any messier.

> Maxime's patch just follow this way, but if the caller side
> wise enough to refuse to use those helper, we should be still
> safe. That why I suggest ChenYu to inline and with a little bit
> revise.

Hi! I'm that Maxime. And it was indeed a mistake in hindsight.

Maxime

> [2] https://patchwork.freedesktop.org/patch/167666/
> 
> [3]
> https://lore.kernel.org/dri-devel/20210910130941.1740182-2-maxime@cerno.tech/
> 
> > Your suggestion might indeed work around your issue,
>
> To be clear, the mentioned problem in this thread is caused
> by deferral probe. We should remove the dangling pointer
> stored in the bridge_list, This is just something similar to
> the fault cleanup or error handling, Right?
> 
> But the fundamental thing is that the issue is happened in
> the deferral probe context.

The context doesn't matter here.

> > but it doesn't fix the actual problem.
> 
> If drm bridge still have lifetime related issue, then it is
> yet an another problem. Which should be then orthogonal to
> this one. Then, it should deserve an another fix.

No, it's absolutely the same one: bridges should be refcounted, they
aren't, it's a mess.

Maxime
Chen-Yu Tsai Nov. 29, 2024, 3:16 p.m. UTC | #13
On Fri, Nov 29, 2024 at 10:55 PM Maxime Ripard <mripard@kernel.org> wrote:
>
> On Fri, Nov 29, 2024 at 10:12:02PM +0800, Sui Jingfeng wrote:
> > Hi,
> >
> > On 2024/11/29 18:51, Maxime Ripard wrote:
> > > On Wed, Nov 27, 2024 at 05:58:31PM +0800, Chen-Yu Tsai wrote:
> > > > Revisiting this thread since I just stepped on the same problem on a
> > > > different device.
> > > >
> > > > On Thu, Nov 14, 2024 at 9:12 PM Maxime Ripard <mripard@kernel.org> wrote:
> > > > > On Tue, Oct 29, 2024 at 10:53:49PM +0800, Fei Shao wrote:
> > > > > > On Thu, Oct 24, 2024 at 8:36 PM Maxime Ripard <mripard@kernel.org> wrote:
> > > > > > > On Wed, Oct 09, 2024 at 01:23:31PM +0800, Fei Shao wrote:
> > > > > > > > In the mtk_dsi driver, its DSI host attach callback calls
> > > > > > > > devm_drm_of_get_bridge() to get the next bridge. If that next bridge is
> > > > > > > > a panel bridge, a panel_bridge object is allocated and managed by the
> > > > > > > > panel device.
> > > > > > > >
> > > > > > > > Later, if the attach callback fails with -EPROBE_DEFER from subsequent
> > > > > > > > component_add(), the panel device invoking the callback at probe time
> > > > > > > > also fails, and all device-managed resources are freed accordingly.
> > > > > > > >
> > > > > > > > This exposes a drm_bridge bridge_list corruption due to the unbalanced
> > > > > > > > lifecycle between the DSI host and the panel devices: the panel_bridge
> > > > > > > > object managed by panel device is freed, while drm_bridge_remove() is
> > > > > > > > bound to DSI host device and never gets called.
> > > > > > > > The next drm_bridge_add() will trigger UAF against the freed bridge list
> > > > > > > > object and result in kernel panic.
> > > > > > > >
> > > > > > > > This bug is observed on a MediaTek MT8188-based Chromebook with MIPI DSI
> > > > > > > > outputting to a DSI panel (DT is WIP for upstream).
> > > > > > > >
> > > > > > > > As a fix, using devm_drm_bridge_add() with the panel device in the panel
> > > > > > > > path seems reasonable. This also implies a chain of potential cleanup
> > > > > > > > actions:
> > > > > > > >
> > > > > > > > 1. Removing drm_bridge_remove() means devm_drm_panel_bridge_release()
> > > > > > > >     becomes hollow and can be removed.
> > > > > > > >
> > > > > > > > 2. devm_drm_panel_bridge_add_typed() is almost emptied except for the
> > > > > > > >     `bridge->pre_enable_prev_first` line. Itself can be also removed if
> > > > > > > >     we move the line into drm_panel_bridge_add_typed(). (maybe?)
> > > > > > > >
> > > > > > > > 3. drm_panel_bridge_add_typed() now calls all the needed devm_* calls,
> > > > > > > >     so it's essentially the new devm_drm_panel_bridge_add_typed().
> > > > > > > >
> > > > > > > > 4. drmm_panel_bridge_add() needs to be updated accordingly since it
> > > > > > > >     calls drm_panel_bridge_add_typed(). But now there's only one bridge
> > > > > > > >     object to be freed, and it's already being managed by panel device.
> > > > > > > >     I wonder if we still need both drmm_ and devm_ version in this case.
> > > > > > > >     (maybe yes from DRM PoV, I don't know much about the context)
> > > > > > > >
> > > > > > > > This is a RFC patch since I'm not sure if my understanding is correct
> > > > > > > > (for both the fix and the cleanup). It fixes the issue I encountered,
> > > > > > > > but I don't expect it to be picked up directly due to the redundant
> > > > > > > > commit message and the dangling devm_drm_panel_bridge_release().
> > > > > > > > I plan to resend the official patch(es) once I know what I supposed to
> > > > > > > > do next.
> > > > > > > >
> > > > > > > > For reference, here's the KASAN report from the device:
> > > > > > > > ==================================================================
> > > > > > > >   BUG: KASAN: slab-use-after-free in drm_bridge_add+0x98/0x230
> > > > > > > >   Read of size 8 at addr ffffff80c4e9e100 by task kworker/u32:1/69
> > > > > > > >
> > > > > > > >   CPU: 1 UID: 0 PID: 69 Comm: kworker/u32:1 Not tainted 6.12.0-rc1-next-20241004-kasan-00030-g062135fa4046 #1
> > > > > > > >   Hardware name: Google Ciri sku0/unprovisioned board (DT)
> > > > > > > >   Workqueue: events_unbound deferred_probe_work_func
> > > > > > > >   Call trace:
> > > > > > > >    dump_backtrace+0xfc/0x140
> > > > > > > >    show_stack+0x24/0x38
> > > > > > > >    dump_stack_lvl+0x40/0xc8
> > > > > > > >    print_report+0x140/0x700
> > > > > > > >    kasan_report+0xcc/0x130
> > > > > > > >    __asan_report_load8_noabort+0x20/0x30
> > > > > > > >    drm_bridge_add+0x98/0x230
> > > > > > > >    devm_drm_panel_bridge_add_typed+0x174/0x298
> > > > > > > >    devm_drm_of_get_bridge+0xe8/0x190
> > > > > > > >    mtk_dsi_host_attach+0x130/0x2b0
> > > > > > > >    mipi_dsi_attach+0x8c/0xe8
> > > > > > > >    hx83102_probe+0x1a8/0x368
> > > > > > > >    mipi_dsi_drv_probe+0x6c/0x88
> > > > > > > >    really_probe+0x1c4/0x698
> > > > > > > >    __driver_probe_device+0x160/0x298
> > > > > > > >    driver_probe_device+0x7c/0x2a8
> > > > > > > >    __device_attach_driver+0x2a0/0x398
> > > > > > > >    bus_for_each_drv+0x198/0x200
> > > > > > > >    __device_attach+0x1c0/0x308
> > > > > > > >    device_initial_probe+0x20/0x38
> > > > > > > >    bus_probe_device+0x11c/0x1f8
> > > > > > > >    deferred_probe_work_func+0x80/0x250
> > > > > > > >    worker_thread+0x9b4/0x2780
> > > > > > > >    kthread+0x274/0x350
> > > > > > > >    ret_from_fork+0x10/0x20
> > > > > > > >
> > > > > > > >   Allocated by task 69:
> > > > > > > >    kasan_save_track+0x40/0x78
> > > > > > > >    kasan_save_alloc_info+0x44/0x58
> > > > > > > >    __kasan_kmalloc+0x84/0xa0
> > > > > > > >    __kmalloc_node_track_caller_noprof+0x228/0x450
> > > > > > > >    devm_kmalloc+0x6c/0x288
> > > > > > > >    devm_drm_panel_bridge_add_typed+0xa0/0x298
> > > > > > > >    devm_drm_of_get_bridge+0xe8/0x190
> > > > > > > >    mtk_dsi_host_attach+0x130/0x2b0
> > > > > > > >    mipi_dsi_attach+0x8c/0xe8
> > > > > > > >    hx83102_probe+0x1a8/0x368
> > > > > > > >    mipi_dsi_drv_probe+0x6c/0x88
> > > > > > > >    really_probe+0x1c4/0x698
> > > > > > > >    __driver_probe_device+0x160/0x298
> > > > > > > >    driver_probe_device+0x7c/0x2a8
> > > > > > > >    __device_attach_driver+0x2a0/0x398
> > > > > > > >    bus_for_each_drv+0x198/0x200
> > > > > > > >    __device_attach+0x1c0/0x308
> > > > > > > >    device_initial_probe+0x20/0x38
> > > > > > > >    bus_probe_device+0x11c/0x1f8
> > > > > > > >    deferred_probe_work_func+0x80/0x250
> > > > > > > >    worker_thread+0x9b4/0x2780
> > > > > > > >    kthread+0x274/0x350
> > > > > > > >    ret_from_fork+0x10/0x20
> > > > > > > >
> > > > > > > >   Freed by task 69:
> > > > > > > >    kasan_save_track+0x40/0x78
> > > > > > > >    kasan_save_free_info+0x58/0x78
> > > > > > > >    __kasan_slab_free+0x48/0x68
> > > > > > > >    kfree+0xd4/0x750
> > > > > > > >    devres_release_all+0x144/0x1e8
> > > > > > > >    really_probe+0x48c/0x698
> > > > > > > >    __driver_probe_device+0x160/0x298
> > > > > > > >    driver_probe_device+0x7c/0x2a8
> > > > > > > >    __device_attach_driver+0x2a0/0x398
> > > > > > > >    bus_for_each_drv+0x198/0x200
> > > > > > > >    __device_attach+0x1c0/0x308
> > > > > > > >    device_initial_probe+0x20/0x38
> > > > > > > >    bus_probe_device+0x11c/0x1f8
> > > > > > > >    deferred_probe_work_func+0x80/0x250
> > > > > > > >    worker_thread+0x9b4/0x2780
> > > > > > > >    kthread+0x274/0x350
> > > > > > > >    ret_from_fork+0x10/0x20
> > > > > > > >
> > > > > > > >   The buggy address belongs to the object at ffffff80c4e9e000
> > > > > > > >    which belongs to the cache kmalloc-4k of size 4096
> > > > > > > >   The buggy address is located 256 bytes inside of
> > > > > > > >    freed 4096-byte region [ffffff80c4e9e000, ffffff80c4e9f000)
> > > > > > > >
> > > > > > > >   The buggy address belongs to the physical page:
> > > > > > > >   head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
> > > > > > > >   flags: 0x8000000000000040(head|zone=2)
> > > > > > > >   page_type: f5(slab)
> > > > > > > >   page: refcount:1 mapcount:0 mapping:0000000000000000
> > > > > > > >   index:0x0 pfn:0x104e98
> > > > > > > >   raw: 8000000000000040 ffffff80c0003040 dead000000000122 0000000000000000
> > > > > > > >   raw: 0000000000000000 0000000000040004 00000001f5000000 0000000000000000
> > > > > > > >   head: 8000000000000040 ffffff80c0003040 dead000000000122 0000000000000000
> > > > > > > >   head: 0000000000000000 0000000000040004 00000001f5000000 0000000000000000
> > > > > > > >   head: 8000000000000003 fffffffec313a601 ffffffffffffffff 0000000000000000
> > > > > > > >   head: 0000000000000008 0000000000000000 00000000ffffffff 0000000000000000
> > > > > > > >   page dumped because: kasan: bad access detected
> > > > > > > >
> > > > > > > >   Memory state around the buggy address:
> > > > > > > >    ffffff80c4e9e000: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> > > > > > > >    ffffff80c4e9e080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> > > > > > > >   >ffffff80c4e9e100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> > > > > > > >                      ^
> > > > > > > >    ffffff80c4e9e180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> > > > > > > >    ffffff80c4e9e200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> > > > > > > > ===================================================================
> > > > > > > >
> > > > > > > > Signed-off-by: Fei Shao <fshao@chromium.org>
> > > > > > > I was looking at the driver to try to follow your (awesome btw, thanks)
> > > > > > > commit log, and it does have a quite different structure compared to
> > > > > > > what we recommend.
> > > > > > >
> > > > > > > Would following
> > > > > > > https://docs.kernel.org/gpu/drm-kms-helpers.html#special-care-with-mipi-dsi-bridges
> > > > > > > help?
> > > > > > Hi Maxime,
> > > > > >
> > > > > > Thank you for the pointer.
> > > > > > I read the suggested pattern in the doc and compared it with the
> > > > > > drivers. If I understand correctly, both the MIPI-DSI host and panel
> > > > > > drivers follow the instructions:
> > > > > >
> > > > > > 1. The MIPI-DSI host driver must run mipi_dsi_host_register() in its probe hook.
> > > > > >     >> drm/mediatek/mtk_dsi.c runs mipi_dsi_host_register() in the probe hook.
> > > > > > 2. In its probe hook, the bridge driver must try to find its MIPI-DSI
> > > > > > host, register as a MIPI-DSI device and attach the MIPI-DSI device to
> > > > > > its host.
> > > > > >     >> drm/panel/panel-himax-hx83102.c follows and runs
> > > > > > mipi_dsi_attach() at the end of probe hook.
> > > > > > 3. In its struct mipi_dsi_host_ops.attach hook, the MIPI-DSI host can
> > > > > > now add its component.
> > > > > >     >> drm/mediatek/mtk_dsi.c calls component_add() in the attach callback.
> > > > > >
> > > > > > Could you elaborate on the "different structures" you mentioned?
> > > > > Yeah, you're right, sorry.
> > > > >
> > > > > > To clarify my point: the issue is that component_add() may return
> > > > > > -EPROBE_DEFER if the component (e.g. DSI encoder) is not ready,
> > > > > > causing the panel bridge to be removed. However, drm_bridge_remove()
> > > > > > is bound to MIPI-DSI host instead of panel bridge, which owns the
> > > > > > actual list_head object.
> > > > > >
> > > > > > This might be reproducible with other MIPI-DSI host + panel
> > > > > > combinations by forcibly returning -EPROBE_DEFER in the host attach
> > > > > > hook (verification with another device is needed), so the fix may be
> > > > > > required in drm/bridge/panel.c.
> > > > > Yeah, I think you're just hitting another bridge lifetime issue, and
> > > > > it's not the only one unfortunately. Tying the bridge structure lifetime
> > > > > itself to the device is wrong, it should be tied to the DRM device
> > > > > lifetime instead.
> > > > I think the more immediate issue is that the bridge object's lifetime
> > > > and drm_bridge_add/remove are inconsistent when devm_drm_of_get_bridge()
> > > > or drmm_of_get_bridge() are used.
> > > >
> > > > These helpers tie the bridge add/removal to the device or drm_device
> > > > passed in, but internally they call down to drm_panel_bridge_add_typed()
> > > > which allocates the bridge object tied to the panel device.
> > > > > But then, the discussion becomes that bridges typically probe outside of
> > > > > the "main" DRM device probe path, so you don't have access to the DRM
> > > > > device structure until attach at best.
> > > > >
> > > > > That's why I'm a bit skeptical about your patch. It might workaround
> > > > > your issue, but it doesn't actually solve the problem. I guess the best
> > > > > way about it would be to convert bridges to reference counting, with the
> > > > > device taking a reference at probe time when it allocates the structure
> > > > > (and giving it back at remove time), and the DRM device taking one when
> > > > > it's attached and one when it's detached.
> > > > Without going as far, it's probably better to align the lifecycle of
> > > > the two parts. Most other bridge drivers in the kernel have |drm_bridge|
> > > > lifecycle tied to their underlying |device|, either with explicit
> > > > drm_bridge_{add,remove}() calls in their probe/bind and remove/unbind
> > > > callbacks respectively, or with devm_drm_bridge_add in the probe/bind
> > > > path. The only ones with a narrower lifecycle are the DSI hosts, which
> > > > add the bridge in during host attach and remove it during host detach.
> > > >
> > > > I'm thinking about fixing the panel_bridge lifecycle such that it is
> > > > tied to the panel itself. Maybe that would involve making
> > > > devm_drm_of_get_bridge() correctly return bridges even if a panel was
> > > > found, and then making the panels create and add panel bridges directly,
> > > > possibly within drm_panel_add(). Would that make sense?
> > > Not really.
> >
> >
> > [...]
> >
> >
> > > Or rather, it doesn't fix the root cause that is that tieing
> > > the bridge lifetime to the device is wrong.

Well, yeah, but at least it aligns the panel_bridge case with every other
bridge driver: the bridge object and its duration existing in the list
of bridges are the same in every other bridge driver. The bridge driver
allocates the bridge object in its probe function (with devm or not),
and adds the bridge to the global list as probably the last step of
the probe function. In the remove function the reverse is done.

Right now for the panel bridge, the bridge object is allocated with its
lifetime tied to the panel, but the adding and removing of the bridge
to/from the global list is tied to the caller of the panel_bridge API,
likely the previous bridge or encoder in the chain.

This is what is blowing up for us, right now, with the Ciri Chromebooks.
If we fix this bit, then at least it reduces the problem to be only as
worse as the other bridge drivers.

I think MediaTek is currently the only platform that hits this because
it probably is the only platform that has dual display pipelines with
bridges and panel bridges, with one of the pipelines being MIPI DSI.
That's not to say another platform, such as Rockchip, wouldn't hit it
if similar devices were produced.

> > This is multiple kernel driver module probe issue, not an issue
> > about bridge's lifetime.
> >
> >
> > The life time of the bridge of an 'struct panel_bridge' has
> > been tied to the 'panel->dev' since 2017 [1].
> >
> > See commit 13dfc0540a575b47b2d640b093ac16e9e09474f6
> > ("drm/bridge: Refactor out the panel wrapper from the lvds-encoder bridge.")
> >
> > [1] https://patchwork.freedesktop.org/patch/159673/
>
> Yeah, and it's been wrong since 2017.
>
> > >   It needs to be tied to the DRM device somehow.
> >
> > Why?
>
> Because the DRM device is only free'd when the last userspace
> application has closed it's FD to it, which might much later than the
> device being removed. So if we tie that to the device lifetime, and the
> device goes away, we have a dangling pointer and potential
> use-after-free issue if the application continues to access its fd.
>
> > It's the underlying hardware device backing the bridge, if the
> > backing hardware device has been freed, How does the bound drm
> > bridge driver could continue to work?
>
> Using drm_dev_enter/drm_dev_exit.
>
> > How could the dangling pointer stored in the bridge_list still
> > will make sense?
>
> It's dangling only if the bridge has been free'd while still having a
> pointer to it. If you have a reference counted allocation, it's not
> dangling anymore.
>
> > The imx-lcdif could instantiate three DRM driver, which one
> > should be selected as the "main" DRM device to attach?
>
> The one the bridge attaches to?
>
> > No, It is messy since day 0. And has been made worse since 2017,
> > from then, thedevm_drm_panel_bridge_add() [2] was initially introduced.
> > Which allow us to abuse the lifetime of bridge to a different device or (any
> > device).
>
> I agree it's messy. I'm sure you'd agree that we do not want to make the
> situation any messier.
>
> > Maxime's patch just follow this way, but if the caller side
> > wise enough to refuse to use those helper, we should be still
> > safe. That why I suggest ChenYu to inline and with a little bit
> > revise.
>
> Hi! I'm that Maxime. And it was indeed a mistake in hindsight.
>
> Maxime
>
> > [2] https://patchwork.freedesktop.org/patch/167666/
> >
> > [3]
> > https://lore.kernel.org/dri-devel/20210910130941.1740182-2-maxime@cerno.tech/
> >
> > > Your suggestion might indeed work around your issue,
> >
> > To be clear, the mentioned problem in this thread is caused
> > by deferral probe. We should remove the dangling pointer
> > stored in the bridge_list, This is just something similar to
> > the fault cleanup or error handling, Right?
> >
> > But the fundamental thing is that the issue is happened in
> > the deferral probe context.
>
> The context doesn't matter here.
>
> > > but it doesn't fix the actual problem.
> >
> > If drm bridge still have lifetime related issue, then it is
> > yet an another problem. Which should be then orthogonal to
> > this one. Then, it should deserve an another fix.
>
> No, it's absolutely the same one: bridges should be refcounted, they
> aren't, it's a mess.

I agree that the bridge lifecycle is a mess (and panels are probably the
same). What we're trying to fix here is related, but not exactly the same,
as explained above.

Can we at least fix this bit?


ChenYu
Sui Jingfeng Nov. 29, 2024, 3:24 p.m. UTC | #14
Hi,

On 2024/11/29 22:54, Maxime Ripard wrote:
> On Fri, Nov 29, 2024 at 10:12:02PM +0800, Sui Jingfeng wrote:
>> Hi,
>>
>> On 2024/11/29 18:51, Maxime Ripard wrote:
>>> On Wed, Nov 27, 2024 at 05:58:31PM +0800, Chen-Yu Tsai wrote:
>>>> Revisiting this thread since I just stepped on the same problem on a
>>>> different device.
>>>>
>>>> On Thu, Nov 14, 2024 at 9:12 PM Maxime Ripard <mripard@kernel.org> wrote:
>>>>> On Tue, Oct 29, 2024 at 10:53:49PM +0800, Fei Shao wrote:
>>>>>> On Thu, Oct 24, 2024 at 8:36 PM Maxime Ripard <mripard@kernel.org> wrote:
>>>>>>> On Wed, Oct 09, 2024 at 01:23:31PM +0800, Fei Shao wrote:
>>>>>>>> In the mtk_dsi driver, its DSI host attach callback calls
>>>>>>>> devm_drm_of_get_bridge() to get the next bridge. If that next bridge is
>>>>>>>> a panel bridge, a panel_bridge object is allocated and managed by the
>>>>>>>> panel device.
>>>>>>>>
>>>>>>>> Later, if the attach callback fails with -EPROBE_DEFER from subsequent
>>>>>>>> component_add(), the panel device invoking the callback at probe time
>>>>>>>> also fails, and all device-managed resources are freed accordingly.
>>>>>>>>
>>>>>>>> This exposes a drm_bridge bridge_list corruption due to the unbalanced
>>>>>>>> lifecycle between the DSI host and the panel devices: the panel_bridge
>>>>>>>> object managed by panel device is freed, while drm_bridge_remove() is
>>>>>>>> bound to DSI host device and never gets called.
>>>>>>>> The next drm_bridge_add() will trigger UAF against the freed bridge list
>>>>>>>> object and result in kernel panic.
>>>>>>>>
>>>>>>>> This bug is observed on a MediaTek MT8188-based Chromebook with MIPI DSI
>>>>>>>> outputting to a DSI panel (DT is WIP for upstream).
>>>>>>>>
>>>>>>>> As a fix, using devm_drm_bridge_add() with the panel device in the panel
>>>>>>>> path seems reasonable. This also implies a chain of potential cleanup
>>>>>>>> actions:
>>>>>>>>
>>>>>>>> 1. Removing drm_bridge_remove() means devm_drm_panel_bridge_release()
>>>>>>>>      becomes hollow and can be removed.
>>>>>>>>
>>>>>>>> 2. devm_drm_panel_bridge_add_typed() is almost emptied except for the
>>>>>>>>      `bridge->pre_enable_prev_first` line. Itself can be also removed if
>>>>>>>>      we move the line into drm_panel_bridge_add_typed(). (maybe?)
>>>>>>>>
>>>>>>>> 3. drm_panel_bridge_add_typed() now calls all the needed devm_* calls,
>>>>>>>>      so it's essentially the new devm_drm_panel_bridge_add_typed().
>>>>>>>>
>>>>>>>> 4. drmm_panel_bridge_add() needs to be updated accordingly since it
>>>>>>>>      calls drm_panel_bridge_add_typed(). But now there's only one bridge
>>>>>>>>      object to be freed, and it's already being managed by panel device.
>>>>>>>>      I wonder if we still need both drmm_ and devm_ version in this case.
>>>>>>>>      (maybe yes from DRM PoV, I don't know much about the context)
>>>>>>>>
>>>>>>>> This is a RFC patch since I'm not sure if my understanding is correct
>>>>>>>> (for both the fix and the cleanup). It fixes the issue I encountered,
>>>>>>>> but I don't expect it to be picked up directly due to the redundant
>>>>>>>> commit message and the dangling devm_drm_panel_bridge_release().
>>>>>>>> I plan to resend the official patch(es) once I know what I supposed to
>>>>>>>> do next.
>>>>>>>>
>>>>>>>> For reference, here's the KASAN report from the device:
>>>>>>>> ==================================================================
>>>>>>>>    BUG: KASAN: slab-use-after-free in drm_bridge_add+0x98/0x230
>>>>>>>>    Read of size 8 at addr ffffff80c4e9e100 by task kworker/u32:1/69
>>>>>>>>
>>>>>>>>    CPU: 1 UID: 0 PID: 69 Comm: kworker/u32:1 Not tainted 6.12.0-rc1-next-20241004-kasan-00030-g062135fa4046 #1
>>>>>>>>    Hardware name: Google Ciri sku0/unprovisioned board (DT)
>>>>>>>>    Workqueue: events_unbound deferred_probe_work_func
>>>>>>>>    Call trace:
>>>>>>>>     dump_backtrace+0xfc/0x140
>>>>>>>>     show_stack+0x24/0x38
>>>>>>>>     dump_stack_lvl+0x40/0xc8
>>>>>>>>     print_report+0x140/0x700
>>>>>>>>     kasan_report+0xcc/0x130
>>>>>>>>     __asan_report_load8_noabort+0x20/0x30
>>>>>>>>     drm_bridge_add+0x98/0x230
>>>>>>>>     devm_drm_panel_bridge_add_typed+0x174/0x298
>>>>>>>>     devm_drm_of_get_bridge+0xe8/0x190
>>>>>>>>     mtk_dsi_host_attach+0x130/0x2b0
>>>>>>>>     mipi_dsi_attach+0x8c/0xe8
>>>>>>>>     hx83102_probe+0x1a8/0x368
>>>>>>>>     mipi_dsi_drv_probe+0x6c/0x88
>>>>>>>>     really_probe+0x1c4/0x698
>>>>>>>>     __driver_probe_device+0x160/0x298
>>>>>>>>     driver_probe_device+0x7c/0x2a8
>>>>>>>>     __device_attach_driver+0x2a0/0x398
>>>>>>>>     bus_for_each_drv+0x198/0x200
>>>>>>>>     __device_attach+0x1c0/0x308
>>>>>>>>     device_initial_probe+0x20/0x38
>>>>>>>>     bus_probe_device+0x11c/0x1f8
>>>>>>>>     deferred_probe_work_func+0x80/0x250
>>>>>>>>     worker_thread+0x9b4/0x2780
>>>>>>>>     kthread+0x274/0x350
>>>>>>>>     ret_from_fork+0x10/0x20
>>>>>>>>
>>>>>>>>    Allocated by task 69:
>>>>>>>>     kasan_save_track+0x40/0x78
>>>>>>>>     kasan_save_alloc_info+0x44/0x58
>>>>>>>>     __kasan_kmalloc+0x84/0xa0
>>>>>>>>     __kmalloc_node_track_caller_noprof+0x228/0x450
>>>>>>>>     devm_kmalloc+0x6c/0x288
>>>>>>>>     devm_drm_panel_bridge_add_typed+0xa0/0x298
>>>>>>>>     devm_drm_of_get_bridge+0xe8/0x190
>>>>>>>>     mtk_dsi_host_attach+0x130/0x2b0
>>>>>>>>     mipi_dsi_attach+0x8c/0xe8
>>>>>>>>     hx83102_probe+0x1a8/0x368
>>>>>>>>     mipi_dsi_drv_probe+0x6c/0x88
>>>>>>>>     really_probe+0x1c4/0x698
>>>>>>>>     __driver_probe_device+0x160/0x298
>>>>>>>>     driver_probe_device+0x7c/0x2a8
>>>>>>>>     __device_attach_driver+0x2a0/0x398
>>>>>>>>     bus_for_each_drv+0x198/0x200
>>>>>>>>     __device_attach+0x1c0/0x308
>>>>>>>>     device_initial_probe+0x20/0x38
>>>>>>>>     bus_probe_device+0x11c/0x1f8
>>>>>>>>     deferred_probe_work_func+0x80/0x250
>>>>>>>>     worker_thread+0x9b4/0x2780
>>>>>>>>     kthread+0x274/0x350
>>>>>>>>     ret_from_fork+0x10/0x20
>>>>>>>>
>>>>>>>>    Freed by task 69:
>>>>>>>>     kasan_save_track+0x40/0x78
>>>>>>>>     kasan_save_free_info+0x58/0x78
>>>>>>>>     __kasan_slab_free+0x48/0x68
>>>>>>>>     kfree+0xd4/0x750
>>>>>>>>     devres_release_all+0x144/0x1e8
>>>>>>>>     really_probe+0x48c/0x698
>>>>>>>>     __driver_probe_device+0x160/0x298
>>>>>>>>     driver_probe_device+0x7c/0x2a8
>>>>>>>>     __device_attach_driver+0x2a0/0x398
>>>>>>>>     bus_for_each_drv+0x198/0x200
>>>>>>>>     __device_attach+0x1c0/0x308
>>>>>>>>     device_initial_probe+0x20/0x38
>>>>>>>>     bus_probe_device+0x11c/0x1f8
>>>>>>>>     deferred_probe_work_func+0x80/0x250
>>>>>>>>     worker_thread+0x9b4/0x2780
>>>>>>>>     kthread+0x274/0x350
>>>>>>>>     ret_from_fork+0x10/0x20
>>>>>>>>
>>>>>>>>    The buggy address belongs to the object at ffffff80c4e9e000
>>>>>>>>     which belongs to the cache kmalloc-4k of size 4096
>>>>>>>>    The buggy address is located 256 bytes inside of
>>>>>>>>     freed 4096-byte region [ffffff80c4e9e000, ffffff80c4e9f000)
>>>>>>>>
>>>>>>>>    The buggy address belongs to the physical page:
>>>>>>>>    head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
>>>>>>>>    flags: 0x8000000000000040(head|zone=2)
>>>>>>>>    page_type: f5(slab)
>>>>>>>>    page: refcount:1 mapcount:0 mapping:0000000000000000
>>>>>>>>    index:0x0 pfn:0x104e98
>>>>>>>>    raw: 8000000000000040 ffffff80c0003040 dead000000000122 0000000000000000
>>>>>>>>    raw: 0000000000000000 0000000000040004 00000001f5000000 0000000000000000
>>>>>>>>    head: 8000000000000040 ffffff80c0003040 dead000000000122 0000000000000000
>>>>>>>>    head: 0000000000000000 0000000000040004 00000001f5000000 0000000000000000
>>>>>>>>    head: 8000000000000003 fffffffec313a601 ffffffffffffffff 0000000000000000
>>>>>>>>    head: 0000000000000008 0000000000000000 00000000ffffffff 0000000000000000
>>>>>>>>    page dumped because: kasan: bad access detected
>>>>>>>>
>>>>>>>>    Memory state around the buggy address:
>>>>>>>>     ffffff80c4e9e000: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>>>>>>>>     ffffff80c4e9e080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>>>>>>>>    >ffffff80c4e9e100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>>>>>>>>                       ^
>>>>>>>>     ffffff80c4e9e180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>>>>>>>>     ffffff80c4e9e200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>>>>>>>> ===================================================================
>>>>>>>>
>>>>>>>> Signed-off-by: Fei Shao <fshao@chromium.org>
>>>>>>> I was looking at the driver to try to follow your (awesome btw, thanks)
>>>>>>> commit log, and it does have a quite different structure compared to
>>>>>>> what we recommend.
>>>>>>>
>>>>>>> Would following
>>>>>>> https://docs.kernel.org/gpu/drm-kms-helpers.html#special-care-with-mipi-dsi-bridges
>>>>>>> help?
>>>>>> Hi Maxime,
>>>>>>
>>>>>> Thank you for the pointer.
>>>>>> I read the suggested pattern in the doc and compared it with the
>>>>>> drivers. If I understand correctly, both the MIPI-DSI host and panel
>>>>>> drivers follow the instructions:
>>>>>>
>>>>>> 1. The MIPI-DSI host driver must run mipi_dsi_host_register() in its probe hook.
>>>>>>      >> drm/mediatek/mtk_dsi.c runs mipi_dsi_host_register() in the probe hook.
>>>>>> 2. In its probe hook, the bridge driver must try to find its MIPI-DSI
>>>>>> host, register as a MIPI-DSI device and attach the MIPI-DSI device to
>>>>>> its host.
>>>>>>      >> drm/panel/panel-himax-hx83102.c follows and runs
>>>>>> mipi_dsi_attach() at the end of probe hook.
>>>>>> 3. In its struct mipi_dsi_host_ops.attach hook, the MIPI-DSI host can
>>>>>> now add its component.
>>>>>>      >> drm/mediatek/mtk_dsi.c calls component_add() in the attach callback.
>>>>>>
>>>>>> Could you elaborate on the "different structures" you mentioned?
>>>>> Yeah, you're right, sorry.
>>>>>
>>>>>> To clarify my point: the issue is that component_add() may return
>>>>>> -EPROBE_DEFER if the component (e.g. DSI encoder) is not ready,
>>>>>> causing the panel bridge to be removed. However, drm_bridge_remove()
>>>>>> is bound to MIPI-DSI host instead of panel bridge, which owns the
>>>>>> actual list_head object.
>>>>>>
>>>>>> This might be reproducible with other MIPI-DSI host + panel
>>>>>> combinations by forcibly returning -EPROBE_DEFER in the host attach
>>>>>> hook (verification with another device is needed), so the fix may be
>>>>>> required in drm/bridge/panel.c.
>>>>> Yeah, I think you're just hitting another bridge lifetime issue, and
>>>>> it's not the only one unfortunately. Tying the bridge structure lifetime
>>>>> itself to the device is wrong, it should be tied to the DRM device
>>>>> lifetime instead.
>>>> I think the more immediate issue is that the bridge object's lifetime
>>>> and drm_bridge_add/remove are inconsistent when devm_drm_of_get_bridge()
>>>> or drmm_of_get_bridge() are used.
>>>>
>>>> These helpers tie the bridge add/removal to the device or drm_device
>>>> passed in, but internally they call down to drm_panel_bridge_add_typed()
>>>> which allocates the bridge object tied to the panel device.
>>>>> But then, the discussion becomes that bridges typically probe outside of
>>>>> the "main" DRM device probe path, so you don't have access to the DRM
>>>>> device structure until attach at best.
>>>>>
>>>>> That's why I'm a bit skeptical about your patch. It might workaround
>>>>> your issue, but it doesn't actually solve the problem. I guess the best
>>>>> way about it would be to convert bridges to reference counting, with the
>>>>> device taking a reference at probe time when it allocates the structure
>>>>> (and giving it back at remove time), and the DRM device taking one when
>>>>> it's attached and one when it's detached.
>>>> Without going as far, it's probably better to align the lifecycle of
>>>> the two parts. Most other bridge drivers in the kernel have |drm_bridge|
>>>> lifecycle tied to their underlying |device|, either with explicit
>>>> drm_bridge_{add,remove}() calls in their probe/bind and remove/unbind
>>>> callbacks respectively, or with devm_drm_bridge_add in the probe/bind
>>>> path. The only ones with a narrower lifecycle are the DSI hosts, which
>>>> add the bridge in during host attach and remove it during host detach.
>>>>
>>>> I'm thinking about fixing the panel_bridge lifecycle such that it is
>>>> tied to the panel itself. Maybe that would involve making
>>>> devm_drm_of_get_bridge() correctly return bridges even if a panel was
>>>> found, and then making the panels create and add panel bridges directly,
>>>> possibly within drm_panel_add(). Would that make sense?
>>> Not really.
>>
>> [...]
>>
>>
>>> Or rather, it doesn't fix the root cause that is that tieing
>>> the bridge lifetime to the device is wrong.
>>
>> This is multiple kernel driver module probe issue, not an issue
>> about bridge's lifetime.
>>
>>
>> The life time of the bridge of an 'struct panel_bridge' has
>> been tied to the 'panel->dev' since 2017 [1].
>>
>> See commit 13dfc0540a575b47b2d640b093ac16e9e09474f6
>> ("drm/bridge: Refactor out the panel wrapper from the lvds-encoder bridge.")
>>
>> [1] https://patchwork.freedesktop.org/patch/159673/
> Yeah, and it's been wrong since 2017.
>
>>>    It needs to be tied to the DRM device somehow.
>> Why?
> Because the DRM device is only free'd when the last userspace
> application has closed it's FD to it, which might much later than the
> device being removed. So if we tie that to the device lifetime, and the
> device goes away, we have a dangling pointer and potential
> use-after-free issue if the application continues to access its fd.
>
>> It's the underlying hardware device backing the bridge, if the
>> backing hardware device has been freed, How does the bound drm
>> bridge driver could continue to work?
> Using drm_dev_enter/drm_dev_exit.
>
>> How could the dangling pointer stored in the bridge_list still
>> will make sense?
> It's dangling only if the bridge has been free'd while still having a
> pointer to it. If you have a reference counted allocation, it's not
> dangling anymore.

I meant that in the deferral context, the underlying panel device has
been freed. You could keep the allocated storage in memory, but this
is in vain. The real hardware has gone, the reference counted allocation
could only stand for the panel bridge itself, without the real hardware
backing there. It can not fully functional.

As far as I could understand, in the deferral context, tears down
everything is standard behavior. This is not very related to the
lifetime.

>> The imx-lcdif could instantiate three DRM driver, which one
>> should be selected as the "main" DRM device to attach?
> The one the bridge attaches to?


The point is how can we select one from it.


>> No, It is messy since day 0. And has been made worse since 2017,
>> from then, thedevm_drm_panel_bridge_add() [2] was initially introduced.
>> Which allow us to abuse the lifetime of bridge to a different device or (any
>> device).
> I agree it's messy. I'm sure you'd agree that we do not want to make the
> situation any messier.
>
>> Maxime's patch just follow this way, but if the caller side
>> wise enough to refuse to use those helper, we should be still
>> safe. That why I suggest ChenYu to inline and with a little bit
>> revise.
> Hi! I'm that Maxime. And it was indeed a mistake in hindsight.
>
> Maxime
>
>> [2] https://patchwork.freedesktop.org/patch/167666/
>>
>> [3]
>> https://lore.kernel.org/dri-devel/20210910130941.1740182-2-maxime@cerno.tech/
>>
>>> Your suggestion might indeed work around your issue,
>> To be clear, the mentioned problem in this thread is caused
>> by deferral probe. We should remove the dangling pointer
>> stored in the bridge_list, This is just something similar to
>> the fault cleanup or error handling, Right?
>>
>> But the fundamental thing is that the issue is happened in
>> the deferral probe context.
> The context doesn't matter here.


Its an important factor, it really matters.

One fundamental criteria, I think, is that *if* other
bridge + KMS driver combinations suffer from the same problem.

Apparently, other drm bridge users didn't report similar problem.
This means that non devm_drm_of_get_bridge() callers are different
with those devm_drm_of_get_bridge() callers.

>>> but it doesn't fix the actual problem.
>> If drm bridge still have lifetime related issue, then it is
>> yet an another problem. Which should be then orthogonal to
>> this one. Then, it should deserve an another fix.
> No, it's absolutely the same one: bridges should be refcounted, they
> aren't, it's a mess.


I think, bridges should be ref-counted or not is another story.


> Maxime
Maxime Ripard Dec. 2, 2024, 10:12 a.m. UTC | #15
On Fri, Nov 29, 2024 at 11:24:18PM +0800, Sui Jingfeng wrote:
> Hi,
> 
> On 2024/11/29 22:54, Maxime Ripard wrote:
> > On Fri, Nov 29, 2024 at 10:12:02PM +0800, Sui Jingfeng wrote:
> > > Hi,
> > > 
> > > On 2024/11/29 18:51, Maxime Ripard wrote:
> > > > On Wed, Nov 27, 2024 at 05:58:31PM +0800, Chen-Yu Tsai wrote:
> > > > > Revisiting this thread since I just stepped on the same problem on a
> > > > > different device.
> > > > > 
> > > > > On Thu, Nov 14, 2024 at 9:12 PM Maxime Ripard <mripard@kernel.org> wrote:
> > > > > > On Tue, Oct 29, 2024 at 10:53:49PM +0800, Fei Shao wrote:
> > > > > > > On Thu, Oct 24, 2024 at 8:36 PM Maxime Ripard <mripard@kernel.org> wrote:
> > > > > > > > On Wed, Oct 09, 2024 at 01:23:31PM +0800, Fei Shao wrote:
> > > > > > > > > In the mtk_dsi driver, its DSI host attach callback calls
> > > > > > > > > devm_drm_of_get_bridge() to get the next bridge. If that next bridge is
> > > > > > > > > a panel bridge, a panel_bridge object is allocated and managed by the
> > > > > > > > > panel device.
> > > > > > > > > 
> > > > > > > > > Later, if the attach callback fails with -EPROBE_DEFER from subsequent
> > > > > > > > > component_add(), the panel device invoking the callback at probe time
> > > > > > > > > also fails, and all device-managed resources are freed accordingly.
> > > > > > > > > 
> > > > > > > > > This exposes a drm_bridge bridge_list corruption due to the unbalanced
> > > > > > > > > lifecycle between the DSI host and the panel devices: the panel_bridge
> > > > > > > > > object managed by panel device is freed, while drm_bridge_remove() is
> > > > > > > > > bound to DSI host device and never gets called.
> > > > > > > > > The next drm_bridge_add() will trigger UAF against the freed bridge list
> > > > > > > > > object and result in kernel panic.
> > > > > > > > > 
> > > > > > > > > This bug is observed on a MediaTek MT8188-based Chromebook with MIPI DSI
> > > > > > > > > outputting to a DSI panel (DT is WIP for upstream).
> > > > > > > > > 
> > > > > > > > > As a fix, using devm_drm_bridge_add() with the panel device in the panel
> > > > > > > > > path seems reasonable. This also implies a chain of potential cleanup
> > > > > > > > > actions:
> > > > > > > > > 
> > > > > > > > > 1. Removing drm_bridge_remove() means devm_drm_panel_bridge_release()
> > > > > > > > >      becomes hollow and can be removed.
> > > > > > > > > 
> > > > > > > > > 2. devm_drm_panel_bridge_add_typed() is almost emptied except for the
> > > > > > > > >      `bridge->pre_enable_prev_first` line. Itself can be also removed if
> > > > > > > > >      we move the line into drm_panel_bridge_add_typed(). (maybe?)
> > > > > > > > > 
> > > > > > > > > 3. drm_panel_bridge_add_typed() now calls all the needed devm_* calls,
> > > > > > > > >      so it's essentially the new devm_drm_panel_bridge_add_typed().
> > > > > > > > > 
> > > > > > > > > 4. drmm_panel_bridge_add() needs to be updated accordingly since it
> > > > > > > > >      calls drm_panel_bridge_add_typed(). But now there's only one bridge
> > > > > > > > >      object to be freed, and it's already being managed by panel device.
> > > > > > > > >      I wonder if we still need both drmm_ and devm_ version in this case.
> > > > > > > > >      (maybe yes from DRM PoV, I don't know much about the context)
> > > > > > > > > 
> > > > > > > > > This is a RFC patch since I'm not sure if my understanding is correct
> > > > > > > > > (for both the fix and the cleanup). It fixes the issue I encountered,
> > > > > > > > > but I don't expect it to be picked up directly due to the redundant
> > > > > > > > > commit message and the dangling devm_drm_panel_bridge_release().
> > > > > > > > > I plan to resend the official patch(es) once I know what I supposed to
> > > > > > > > > do next.
> > > > > > > > > 
> > > > > > > > > For reference, here's the KASAN report from the device:
> > > > > > > > > ==================================================================
> > > > > > > > >    BUG: KASAN: slab-use-after-free in drm_bridge_add+0x98/0x230
> > > > > > > > >    Read of size 8 at addr ffffff80c4e9e100 by task kworker/u32:1/69
> > > > > > > > > 
> > > > > > > > >    CPU: 1 UID: 0 PID: 69 Comm: kworker/u32:1 Not tainted 6.12.0-rc1-next-20241004-kasan-00030-g062135fa4046 #1
> > > > > > > > >    Hardware name: Google Ciri sku0/unprovisioned board (DT)
> > > > > > > > >    Workqueue: events_unbound deferred_probe_work_func
> > > > > > > > >    Call trace:
> > > > > > > > >     dump_backtrace+0xfc/0x140
> > > > > > > > >     show_stack+0x24/0x38
> > > > > > > > >     dump_stack_lvl+0x40/0xc8
> > > > > > > > >     print_report+0x140/0x700
> > > > > > > > >     kasan_report+0xcc/0x130
> > > > > > > > >     __asan_report_load8_noabort+0x20/0x30
> > > > > > > > >     drm_bridge_add+0x98/0x230
> > > > > > > > >     devm_drm_panel_bridge_add_typed+0x174/0x298
> > > > > > > > >     devm_drm_of_get_bridge+0xe8/0x190
> > > > > > > > >     mtk_dsi_host_attach+0x130/0x2b0
> > > > > > > > >     mipi_dsi_attach+0x8c/0xe8
> > > > > > > > >     hx83102_probe+0x1a8/0x368
> > > > > > > > >     mipi_dsi_drv_probe+0x6c/0x88
> > > > > > > > >     really_probe+0x1c4/0x698
> > > > > > > > >     __driver_probe_device+0x160/0x298
> > > > > > > > >     driver_probe_device+0x7c/0x2a8
> > > > > > > > >     __device_attach_driver+0x2a0/0x398
> > > > > > > > >     bus_for_each_drv+0x198/0x200
> > > > > > > > >     __device_attach+0x1c0/0x308
> > > > > > > > >     device_initial_probe+0x20/0x38
> > > > > > > > >     bus_probe_device+0x11c/0x1f8
> > > > > > > > >     deferred_probe_work_func+0x80/0x250
> > > > > > > > >     worker_thread+0x9b4/0x2780
> > > > > > > > >     kthread+0x274/0x350
> > > > > > > > >     ret_from_fork+0x10/0x20
> > > > > > > > > 
> > > > > > > > >    Allocated by task 69:
> > > > > > > > >     kasan_save_track+0x40/0x78
> > > > > > > > >     kasan_save_alloc_info+0x44/0x58
> > > > > > > > >     __kasan_kmalloc+0x84/0xa0
> > > > > > > > >     __kmalloc_node_track_caller_noprof+0x228/0x450
> > > > > > > > >     devm_kmalloc+0x6c/0x288
> > > > > > > > >     devm_drm_panel_bridge_add_typed+0xa0/0x298
> > > > > > > > >     devm_drm_of_get_bridge+0xe8/0x190
> > > > > > > > >     mtk_dsi_host_attach+0x130/0x2b0
> > > > > > > > >     mipi_dsi_attach+0x8c/0xe8
> > > > > > > > >     hx83102_probe+0x1a8/0x368
> > > > > > > > >     mipi_dsi_drv_probe+0x6c/0x88
> > > > > > > > >     really_probe+0x1c4/0x698
> > > > > > > > >     __driver_probe_device+0x160/0x298
> > > > > > > > >     driver_probe_device+0x7c/0x2a8
> > > > > > > > >     __device_attach_driver+0x2a0/0x398
> > > > > > > > >     bus_for_each_drv+0x198/0x200
> > > > > > > > >     __device_attach+0x1c0/0x308
> > > > > > > > >     device_initial_probe+0x20/0x38
> > > > > > > > >     bus_probe_device+0x11c/0x1f8
> > > > > > > > >     deferred_probe_work_func+0x80/0x250
> > > > > > > > >     worker_thread+0x9b4/0x2780
> > > > > > > > >     kthread+0x274/0x350
> > > > > > > > >     ret_from_fork+0x10/0x20
> > > > > > > > > 
> > > > > > > > >    Freed by task 69:
> > > > > > > > >     kasan_save_track+0x40/0x78
> > > > > > > > >     kasan_save_free_info+0x58/0x78
> > > > > > > > >     __kasan_slab_free+0x48/0x68
> > > > > > > > >     kfree+0xd4/0x750
> > > > > > > > >     devres_release_all+0x144/0x1e8
> > > > > > > > >     really_probe+0x48c/0x698
> > > > > > > > >     __driver_probe_device+0x160/0x298
> > > > > > > > >     driver_probe_device+0x7c/0x2a8
> > > > > > > > >     __device_attach_driver+0x2a0/0x398
> > > > > > > > >     bus_for_each_drv+0x198/0x200
> > > > > > > > >     __device_attach+0x1c0/0x308
> > > > > > > > >     device_initial_probe+0x20/0x38
> > > > > > > > >     bus_probe_device+0x11c/0x1f8
> > > > > > > > >     deferred_probe_work_func+0x80/0x250
> > > > > > > > >     worker_thread+0x9b4/0x2780
> > > > > > > > >     kthread+0x274/0x350
> > > > > > > > >     ret_from_fork+0x10/0x20
> > > > > > > > > 
> > > > > > > > >    The buggy address belongs to the object at ffffff80c4e9e000
> > > > > > > > >     which belongs to the cache kmalloc-4k of size 4096
> > > > > > > > >    The buggy address is located 256 bytes inside of
> > > > > > > > >     freed 4096-byte region [ffffff80c4e9e000, ffffff80c4e9f000)
> > > > > > > > > 
> > > > > > > > >    The buggy address belongs to the physical page:
> > > > > > > > >    head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
> > > > > > > > >    flags: 0x8000000000000040(head|zone=2)
> > > > > > > > >    page_type: f5(slab)
> > > > > > > > >    page: refcount:1 mapcount:0 mapping:0000000000000000
> > > > > > > > >    index:0x0 pfn:0x104e98
> > > > > > > > >    raw: 8000000000000040 ffffff80c0003040 dead000000000122 0000000000000000
> > > > > > > > >    raw: 0000000000000000 0000000000040004 00000001f5000000 0000000000000000
> > > > > > > > >    head: 8000000000000040 ffffff80c0003040 dead000000000122 0000000000000000
> > > > > > > > >    head: 0000000000000000 0000000000040004 00000001f5000000 0000000000000000
> > > > > > > > >    head: 8000000000000003 fffffffec313a601 ffffffffffffffff 0000000000000000
> > > > > > > > >    head: 0000000000000008 0000000000000000 00000000ffffffff 0000000000000000
> > > > > > > > >    page dumped because: kasan: bad access detected
> > > > > > > > > 
> > > > > > > > >    Memory state around the buggy address:
> > > > > > > > >     ffffff80c4e9e000: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> > > > > > > > >     ffffff80c4e9e080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> > > > > > > > >    >ffffff80c4e9e100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> > > > > > > > >                       ^
> > > > > > > > >     ffffff80c4e9e180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> > > > > > > > >     ffffff80c4e9e200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> > > > > > > > > ===================================================================
> > > > > > > > > 
> > > > > > > > > Signed-off-by: Fei Shao <fshao@chromium.org>
> > > > > > > > I was looking at the driver to try to follow your (awesome btw, thanks)
> > > > > > > > commit log, and it does have a quite different structure compared to
> > > > > > > > what we recommend.
> > > > > > > > 
> > > > > > > > Would following
> > > > > > > > https://docs.kernel.org/gpu/drm-kms-helpers.html#special-care-with-mipi-dsi-bridges
> > > > > > > > help?
> > > > > > > Hi Maxime,
> > > > > > > 
> > > > > > > Thank you for the pointer.
> > > > > > > I read the suggested pattern in the doc and compared it with the
> > > > > > > drivers. If I understand correctly, both the MIPI-DSI host and panel
> > > > > > > drivers follow the instructions:
> > > > > > > 
> > > > > > > 1. The MIPI-DSI host driver must run mipi_dsi_host_register() in its probe hook.
> > > > > > >      >> drm/mediatek/mtk_dsi.c runs mipi_dsi_host_register() in the probe hook.
> > > > > > > 2. In its probe hook, the bridge driver must try to find its MIPI-DSI
> > > > > > > host, register as a MIPI-DSI device and attach the MIPI-DSI device to
> > > > > > > its host.
> > > > > > >      >> drm/panel/panel-himax-hx83102.c follows and runs
> > > > > > > mipi_dsi_attach() at the end of probe hook.
> > > > > > > 3. In its struct mipi_dsi_host_ops.attach hook, the MIPI-DSI host can
> > > > > > > now add its component.
> > > > > > >      >> drm/mediatek/mtk_dsi.c calls component_add() in the attach callback.
> > > > > > > 
> > > > > > > Could you elaborate on the "different structures" you mentioned?
> > > > > > Yeah, you're right, sorry.
> > > > > > 
> > > > > > > To clarify my point: the issue is that component_add() may return
> > > > > > > -EPROBE_DEFER if the component (e.g. DSI encoder) is not ready,
> > > > > > > causing the panel bridge to be removed. However, drm_bridge_remove()
> > > > > > > is bound to MIPI-DSI host instead of panel bridge, which owns the
> > > > > > > actual list_head object.
> > > > > > > 
> > > > > > > This might be reproducible with other MIPI-DSI host + panel
> > > > > > > combinations by forcibly returning -EPROBE_DEFER in the host attach
> > > > > > > hook (verification with another device is needed), so the fix may be
> > > > > > > required in drm/bridge/panel.c.
> > > > > > Yeah, I think you're just hitting another bridge lifetime issue, and
> > > > > > it's not the only one unfortunately. Tying the bridge structure lifetime
> > > > > > itself to the device is wrong, it should be tied to the DRM device
> > > > > > lifetime instead.
> > > > > I think the more immediate issue is that the bridge object's lifetime
> > > > > and drm_bridge_add/remove are inconsistent when devm_drm_of_get_bridge()
> > > > > or drmm_of_get_bridge() are used.
> > > > > 
> > > > > These helpers tie the bridge add/removal to the device or drm_device
> > > > > passed in, but internally they call down to drm_panel_bridge_add_typed()
> > > > > which allocates the bridge object tied to the panel device.
> > > > > > But then, the discussion becomes that bridges typically probe outside of
> > > > > > the "main" DRM device probe path, so you don't have access to the DRM
> > > > > > device structure until attach at best.
> > > > > > 
> > > > > > That's why I'm a bit skeptical about your patch. It might workaround
> > > > > > your issue, but it doesn't actually solve the problem. I guess the best
> > > > > > way about it would be to convert bridges to reference counting, with the
> > > > > > device taking a reference at probe time when it allocates the structure
> > > > > > (and giving it back at remove time), and the DRM device taking one when
> > > > > > it's attached and one when it's detached.
> > > > > Without going as far, it's probably better to align the lifecycle of
> > > > > the two parts. Most other bridge drivers in the kernel have |drm_bridge|
> > > > > lifecycle tied to their underlying |device|, either with explicit
> > > > > drm_bridge_{add,remove}() calls in their probe/bind and remove/unbind
> > > > > callbacks respectively, or with devm_drm_bridge_add in the probe/bind
> > > > > path. The only ones with a narrower lifecycle are the DSI hosts, which
> > > > > add the bridge in during host attach and remove it during host detach.
> > > > > 
> > > > > I'm thinking about fixing the panel_bridge lifecycle such that it is
> > > > > tied to the panel itself. Maybe that would involve making
> > > > > devm_drm_of_get_bridge() correctly return bridges even if a panel was
> > > > > found, and then making the panels create and add panel bridges directly,
> > > > > possibly within drm_panel_add(). Would that make sense?
> > > > Not really.
> > > 
> > > [...]
> > > 
> > > 
> > > > Or rather, it doesn't fix the root cause that is that tieing
> > > > the bridge lifetime to the device is wrong.
> > > 
> > > This is multiple kernel driver module probe issue, not an issue
> > > about bridge's lifetime.
> > > 
> > > 
> > > The life time of the bridge of an 'struct panel_bridge' has
> > > been tied to the 'panel->dev' since 2017 [1].
> > > 
> > > See commit 13dfc0540a575b47b2d640b093ac16e9e09474f6
> > > ("drm/bridge: Refactor out the panel wrapper from the lvds-encoder bridge.")
> > > 
> > > [1] https://patchwork.freedesktop.org/patch/159673/
> > Yeah, and it's been wrong since 2017.
> > 
> > > >    It needs to be tied to the DRM device somehow.
> > > Why?
> > Because the DRM device is only free'd when the last userspace
> > application has closed it's FD to it, which might much later than the
> > device being removed. So if we tie that to the device lifetime, and the
> > device goes away, we have a dangling pointer and potential
> > use-after-free issue if the application continues to access its fd.
> > 
> > > It's the underlying hardware device backing the bridge, if the
> > > backing hardware device has been freed, How does the bound drm
> > > bridge driver could continue to work?
> > Using drm_dev_enter/drm_dev_exit.
> > 
> > > How could the dangling pointer stored in the bridge_list still
> > > will make sense?
> > It's dangling only if the bridge has been free'd while still having a
> > pointer to it. If you have a reference counted allocation, it's not
> > dangling anymore.
> 
> I meant that in the deferral context, the underlying panel device has
> been freed. You could keep the allocated storage in memory, but this
> is in vain.

It prevents memory safety issues.

> The real hardware has gone, the reference counted allocation could
> only stand for the panel bridge itself, without the real hardware
> backing there. It can not fully functional.

Yes, but that's not the point?

> As far as I could understand, in the deferral context, tears down
> everything is standard behavior. This is not very related to the
> lifetime.
> 
> > > The imx-lcdif could instantiate three DRM driver, which one
> > > should be selected as the "main" DRM device to attach?
> > The one the bridge attaches to?
>
> The point is how can we select one from it.

bridge->dev ?

> > > No, It is messy since day 0. And has been made worse since 2017,
> > > from then, thedevm_drm_panel_bridge_add() [2] was initially introduced.
> > > Which allow us to abuse the lifetime of bridge to a different device or (any
> > > device).
> > I agree it's messy. I'm sure you'd agree that we do not want to make the
> > situation any messier.
> > 
> > > Maxime's patch just follow this way, but if the caller side
> > > wise enough to refuse to use those helper, we should be still
> > > safe. That why I suggest ChenYu to inline and with a little bit
> > > revise.
> > Hi! I'm that Maxime. And it was indeed a mistake in hindsight.
> > 
> > Maxime
> > 
> > > [2] https://patchwork.freedesktop.org/patch/167666/
> > > 
> > > [3]
> > > https://lore.kernel.org/dri-devel/20210910130941.1740182-2-maxime@cerno.tech/
> > > 
> > > > Your suggestion might indeed work around your issue,
> > > To be clear, the mentioned problem in this thread is caused
> > > by deferral probe. We should remove the dangling pointer
> > > stored in the bridge_list, This is just something similar to
> > > the fault cleanup or error handling, Right?
> > > 
> > > But the fundamental thing is that the issue is happened in
> > > the deferral probe context.
> > The context doesn't matter here.
> 
> 
> Its an important factor, it really matters.
> 
> One fundamental criteria, I think, is that *if* other
> bridge + KMS driver combinations suffer from the same problem.

All of them do. We just collectively stick our head in the sand.

> Apparently, other drm bridge users didn't report similar problem.
> This means that non devm_drm_of_get_bridge() callers are different
> with those devm_drm_of_get_bridge() callers.

Nope. They are strictly equivalent.

Maxime
Sui Jingfeng Dec. 2, 2024, 2:31 p.m. UTC | #16
Hi,

On 2024/12/2 18:12, Maxime Ripard wrote:
> On Fri, Nov 29, 2024 at 11:24:18PM +0800, Sui Jingfeng wrote:
>> Hi,
>>
>> On 2024/11/29 22:54, Maxime Ripard wrote:
>>> On Fri, Nov 29, 2024 at 10:12:02PM +0800, Sui Jingfeng wrote:
>>>> Hi,
>>>>
>>>> On 2024/11/29 18:51, Maxime Ripard wrote:
>>>>> On Wed, Nov 27, 2024 at 05:58:31PM +0800, Chen-Yu Tsai wrote:
>>>>>> Revisiting this thread since I just stepped on the same problem on a
>>>>>> different device.
>>>>>>
>>>>>> On Thu, Nov 14, 2024 at 9:12 PM Maxime Ripard <mripard@kernel.org> wrote:
>>>>>>> On Tue, Oct 29, 2024 at 10:53:49PM +0800, Fei Shao wrote:
>>>>>>>> On Thu, Oct 24, 2024 at 8:36 PM Maxime Ripard <mripard@kernel.org> wrote:
>>>>>>>>> On Wed, Oct 09, 2024 at 01:23:31PM +0800, Fei Shao wrote:
>>>>>>>>>> In the mtk_dsi driver, its DSI host attach callback calls
>>>>>>>>>> devm_drm_of_get_bridge() to get the next bridge. If that next bridge is
>>>>>>>>>> a panel bridge, a panel_bridge object is allocated and managed by the
>>>>>>>>>> panel device.
>>>>>>>>>>
>>>>>>>>>> Later, if the attach callback fails with -EPROBE_DEFER from subsequent
>>>>>>>>>> component_add(), the panel device invoking the callback at probe time
>>>>>>>>>> also fails, and all device-managed resources are freed accordingly.
>>>>>>>>>>
>>>>>>>>>> This exposes a drm_bridge bridge_list corruption due to the unbalanced
>>>>>>>>>> lifecycle between the DSI host and the panel devices: the panel_bridge
>>>>>>>>>> object managed by panel device is freed, while drm_bridge_remove() is
>>>>>>>>>> bound to DSI host device and never gets called.
>>>>>>>>>> The next drm_bridge_add() will trigger UAF against the freed bridge list
>>>>>>>>>> object and result in kernel panic.
>>>>>>>>>>
>>>>>>>>>> This bug is observed on a MediaTek MT8188-based Chromebook with MIPI DSI
>>>>>>>>>> outputting to a DSI panel (DT is WIP for upstream).
>>>>>>>>>>
>>>>>>>>>> As a fix, using devm_drm_bridge_add() with the panel device in the panel
>>>>>>>>>> path seems reasonable. This also implies a chain of potential cleanup
>>>>>>>>>> actions:
>>>>>>>>>>
>>>>>>>>>> 1. Removing drm_bridge_remove() means devm_drm_panel_bridge_release()
>>>>>>>>>>       becomes hollow and can be removed.
>>>>>>>>>>
>>>>>>>>>> 2. devm_drm_panel_bridge_add_typed() is almost emptied except for the
>>>>>>>>>>       `bridge->pre_enable_prev_first` line. Itself can be also removed if
>>>>>>>>>>       we move the line into drm_panel_bridge_add_typed(). (maybe?)
>>>>>>>>>>
>>>>>>>>>> 3. drm_panel_bridge_add_typed() now calls all the needed devm_* calls,
>>>>>>>>>>       so it's essentially the new devm_drm_panel_bridge_add_typed().
>>>>>>>>>>
>>>>>>>>>> 4. drmm_panel_bridge_add() needs to be updated accordingly since it
>>>>>>>>>>       calls drm_panel_bridge_add_typed(). But now there's only one bridge
>>>>>>>>>>       object to be freed, and it's already being managed by panel device.
>>>>>>>>>>       I wonder if we still need both drmm_ and devm_ version in this case.
>>>>>>>>>>       (maybe yes from DRM PoV, I don't know much about the context)
>>>>>>>>>>
>>>>>>>>>> This is a RFC patch since I'm not sure if my understanding is correct
>>>>>>>>>> (for both the fix and the cleanup). It fixes the issue I encountered,
>>>>>>>>>> but I don't expect it to be picked up directly due to the redundant
>>>>>>>>>> commit message and the dangling devm_drm_panel_bridge_release().
>>>>>>>>>> I plan to resend the official patch(es) once I know what I supposed to
>>>>>>>>>> do next.
>>>>>>>>>>
>>>>>>>>>> For reference, here's the KASAN report from the device:
>>>>>>>>>> ==================================================================
>>>>>>>>>>     BUG: KASAN: slab-use-after-free in drm_bridge_add+0x98/0x230
>>>>>>>>>>     Read of size 8 at addr ffffff80c4e9e100 by task kworker/u32:1/69
>>>>>>>>>>
>>>>>>>>>>     CPU: 1 UID: 0 PID: 69 Comm: kworker/u32:1 Not tainted 6.12.0-rc1-next-20241004-kasan-00030-g062135fa4046 #1
>>>>>>>>>>     Hardware name: Google Ciri sku0/unprovisioned board (DT)
>>>>>>>>>>     Workqueue: events_unbound deferred_probe_work_func
>>>>>>>>>>     Call trace:
>>>>>>>>>>      dump_backtrace+0xfc/0x140
>>>>>>>>>>      show_stack+0x24/0x38
>>>>>>>>>>      dump_stack_lvl+0x40/0xc8
>>>>>>>>>>      print_report+0x140/0x700
>>>>>>>>>>      kasan_report+0xcc/0x130
>>>>>>>>>>      __asan_report_load8_noabort+0x20/0x30
>>>>>>>>>>      drm_bridge_add+0x98/0x230
>>>>>>>>>>      devm_drm_panel_bridge_add_typed+0x174/0x298
>>>>>>>>>>      devm_drm_of_get_bridge+0xe8/0x190
>>>>>>>>>>      mtk_dsi_host_attach+0x130/0x2b0
>>>>>>>>>>      mipi_dsi_attach+0x8c/0xe8
>>>>>>>>>>      hx83102_probe+0x1a8/0x368
>>>>>>>>>>      mipi_dsi_drv_probe+0x6c/0x88
>>>>>>>>>>      really_probe+0x1c4/0x698
>>>>>>>>>>      __driver_probe_device+0x160/0x298
>>>>>>>>>>      driver_probe_device+0x7c/0x2a8
>>>>>>>>>>      __device_attach_driver+0x2a0/0x398
>>>>>>>>>>      bus_for_each_drv+0x198/0x200
>>>>>>>>>>      __device_attach+0x1c0/0x308
>>>>>>>>>>      device_initial_probe+0x20/0x38
>>>>>>>>>>      bus_probe_device+0x11c/0x1f8
>>>>>>>>>>      deferred_probe_work_func+0x80/0x250
>>>>>>>>>>      worker_thread+0x9b4/0x2780
>>>>>>>>>>      kthread+0x274/0x350
>>>>>>>>>>      ret_from_fork+0x10/0x20
>>>>>>>>>>
>>>>>>>>>>     Allocated by task 69:
>>>>>>>>>>      kasan_save_track+0x40/0x78
>>>>>>>>>>      kasan_save_alloc_info+0x44/0x58
>>>>>>>>>>      __kasan_kmalloc+0x84/0xa0
>>>>>>>>>>      __kmalloc_node_track_caller_noprof+0x228/0x450
>>>>>>>>>>      devm_kmalloc+0x6c/0x288
>>>>>>>>>>      devm_drm_panel_bridge_add_typed+0xa0/0x298
>>>>>>>>>>      devm_drm_of_get_bridge+0xe8/0x190
>>>>>>>>>>      mtk_dsi_host_attach+0x130/0x2b0
>>>>>>>>>>      mipi_dsi_attach+0x8c/0xe8
>>>>>>>>>>      hx83102_probe+0x1a8/0x368
>>>>>>>>>>      mipi_dsi_drv_probe+0x6c/0x88
>>>>>>>>>>      really_probe+0x1c4/0x698
>>>>>>>>>>      __driver_probe_device+0x160/0x298
>>>>>>>>>>      driver_probe_device+0x7c/0x2a8
>>>>>>>>>>      __device_attach_driver+0x2a0/0x398
>>>>>>>>>>      bus_for_each_drv+0x198/0x200
>>>>>>>>>>      __device_attach+0x1c0/0x308
>>>>>>>>>>      device_initial_probe+0x20/0x38
>>>>>>>>>>      bus_probe_device+0x11c/0x1f8
>>>>>>>>>>      deferred_probe_work_func+0x80/0x250
>>>>>>>>>>      worker_thread+0x9b4/0x2780
>>>>>>>>>>      kthread+0x274/0x350
>>>>>>>>>>      ret_from_fork+0x10/0x20
>>>>>>>>>>
>>>>>>>>>>     Freed by task 69:
>>>>>>>>>>      kasan_save_track+0x40/0x78
>>>>>>>>>>      kasan_save_free_info+0x58/0x78
>>>>>>>>>>      __kasan_slab_free+0x48/0x68
>>>>>>>>>>      kfree+0xd4/0x750
>>>>>>>>>>      devres_release_all+0x144/0x1e8
>>>>>>>>>>      really_probe+0x48c/0x698
>>>>>>>>>>      __driver_probe_device+0x160/0x298
>>>>>>>>>>      driver_probe_device+0x7c/0x2a8
>>>>>>>>>>      __device_attach_driver+0x2a0/0x398
>>>>>>>>>>      bus_for_each_drv+0x198/0x200
>>>>>>>>>>      __device_attach+0x1c0/0x308
>>>>>>>>>>      device_initial_probe+0x20/0x38
>>>>>>>>>>      bus_probe_device+0x11c/0x1f8
>>>>>>>>>>      deferred_probe_work_func+0x80/0x250
>>>>>>>>>>      worker_thread+0x9b4/0x2780
>>>>>>>>>>      kthread+0x274/0x350
>>>>>>>>>>      ret_from_fork+0x10/0x20
>>>>>>>>>>
>>>>>>>>>>     The buggy address belongs to the object at ffffff80c4e9e000
>>>>>>>>>>      which belongs to the cache kmalloc-4k of size 4096
>>>>>>>>>>     The buggy address is located 256 bytes inside of
>>>>>>>>>>      freed 4096-byte region [ffffff80c4e9e000, ffffff80c4e9f000)
>>>>>>>>>>
>>>>>>>>>>     The buggy address belongs to the physical page:
>>>>>>>>>>     head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
>>>>>>>>>>     flags: 0x8000000000000040(head|zone=2)
>>>>>>>>>>     page_type: f5(slab)
>>>>>>>>>>     page: refcount:1 mapcount:0 mapping:0000000000000000
>>>>>>>>>>     index:0x0 pfn:0x104e98
>>>>>>>>>>     raw: 8000000000000040 ffffff80c0003040 dead000000000122 0000000000000000
>>>>>>>>>>     raw: 0000000000000000 0000000000040004 00000001f5000000 0000000000000000
>>>>>>>>>>     head: 8000000000000040 ffffff80c0003040 dead000000000122 0000000000000000
>>>>>>>>>>     head: 0000000000000000 0000000000040004 00000001f5000000 0000000000000000
>>>>>>>>>>     head: 8000000000000003 fffffffec313a601 ffffffffffffffff 0000000000000000
>>>>>>>>>>     head: 0000000000000008 0000000000000000 00000000ffffffff 0000000000000000
>>>>>>>>>>     page dumped because: kasan: bad access detected
>>>>>>>>>>
>>>>>>>>>>     Memory state around the buggy address:
>>>>>>>>>>      ffffff80c4e9e000: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>>>>>>>>>>      ffffff80c4e9e080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>>>>>>>>>>     >ffffff80c4e9e100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>>>>>>>>>>                        ^
>>>>>>>>>>      ffffff80c4e9e180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>>>>>>>>>>      ffffff80c4e9e200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>>>>>>>>>> ===================================================================
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: Fei Shao <fshao@chromium.org>
>>>>>>>>> I was looking at the driver to try to follow your (awesome btw, thanks)
>>>>>>>>> commit log, and it does have a quite different structure compared to
>>>>>>>>> what we recommend.
>>>>>>>>>
>>>>>>>>> Would following
>>>>>>>>> https://docs.kernel.org/gpu/drm-kms-helpers.html#special-care-with-mipi-dsi-bridges
>>>>>>>>> help?
>>>>>>>> Hi Maxime,
>>>>>>>>
>>>>>>>> Thank you for the pointer.
>>>>>>>> I read the suggested pattern in the doc and compared it with the
>>>>>>>> drivers. If I understand correctly, both the MIPI-DSI host and panel
>>>>>>>> drivers follow the instructions:
>>>>>>>>
>>>>>>>> 1. The MIPI-DSI host driver must run mipi_dsi_host_register() in its probe hook.
>>>>>>>>       >> drm/mediatek/mtk_dsi.c runs mipi_dsi_host_register() in the probe hook.
>>>>>>>> 2. In its probe hook, the bridge driver must try to find its MIPI-DSI
>>>>>>>> host, register as a MIPI-DSI device and attach the MIPI-DSI device to
>>>>>>>> its host.
>>>>>>>>       >> drm/panel/panel-himax-hx83102.c follows and runs
>>>>>>>> mipi_dsi_attach() at the end of probe hook.
>>>>>>>> 3. In its struct mipi_dsi_host_ops.attach hook, the MIPI-DSI host can
>>>>>>>> now add its component.
>>>>>>>>       >> drm/mediatek/mtk_dsi.c calls component_add() in the attach callback.
>>>>>>>>
>>>>>>>> Could you elaborate on the "different structures" you mentioned?
>>>>>>> Yeah, you're right, sorry.
>>>>>>>
>>>>>>>> To clarify my point: the issue is that component_add() may return
>>>>>>>> -EPROBE_DEFER if the component (e.g. DSI encoder) is not ready,
>>>>>>>> causing the panel bridge to be removed. However, drm_bridge_remove()
>>>>>>>> is bound to MIPI-DSI host instead of panel bridge, which owns the
>>>>>>>> actual list_head object.
>>>>>>>>
>>>>>>>> This might be reproducible with other MIPI-DSI host + panel
>>>>>>>> combinations by forcibly returning -EPROBE_DEFER in the host attach
>>>>>>>> hook (verification with another device is needed), so the fix may be
>>>>>>>> required in drm/bridge/panel.c.
>>>>>>> Yeah, I think you're just hitting another bridge lifetime issue, and
>>>>>>> it's not the only one unfortunately. Tying the bridge structure lifetime
>>>>>>> itself to the device is wrong, it should be tied to the DRM device
>>>>>>> lifetime instead.
>>>>>> I think the more immediate issue is that the bridge object's lifetime
>>>>>> and drm_bridge_add/remove are inconsistent when devm_drm_of_get_bridge()
>>>>>> or drmm_of_get_bridge() are used.
>>>>>>
>>>>>> These helpers tie the bridge add/removal to the device or drm_device
>>>>>> passed in, but internally they call down to drm_panel_bridge_add_typed()
>>>>>> which allocates the bridge object tied to the panel device.
>>>>>>> But then, the discussion becomes that bridges typically probe outside of
>>>>>>> the "main" DRM device probe path, so you don't have access to the DRM
>>>>>>> device structure until attach at best.
>>>>>>>
>>>>>>> That's why I'm a bit skeptical about your patch. It might workaround
>>>>>>> your issue, but it doesn't actually solve the problem. I guess the best
>>>>>>> way about it would be to convert bridges to reference counting, with the
>>>>>>> device taking a reference at probe time when it allocates the structure
>>>>>>> (and giving it back at remove time), and the DRM device taking one when
>>>>>>> it's attached and one when it's detached.
>>>>>> Without going as far, it's probably better to align the lifecycle of
>>>>>> the two parts. Most other bridge drivers in the kernel have |drm_bridge|
>>>>>> lifecycle tied to their underlying |device|, either with explicit
>>>>>> drm_bridge_{add,remove}() calls in their probe/bind and remove/unbind
>>>>>> callbacks respectively, or with devm_drm_bridge_add in the probe/bind
>>>>>> path. The only ones with a narrower lifecycle are the DSI hosts, which
>>>>>> add the bridge in during host attach and remove it during host detach.
>>>>>>
>>>>>> I'm thinking about fixing the panel_bridge lifecycle such that it is
>>>>>> tied to the panel itself. Maybe that would involve making
>>>>>> devm_drm_of_get_bridge() correctly return bridges even if a panel was
>>>>>> found, and then making the panels create and add panel bridges directly,
>>>>>> possibly within drm_panel_add(). Would that make sense?
>>>>> Not really.
>>>> [...]
>>>>
>>>>
>>>>> Or rather, it doesn't fix the root cause that is that tieing
>>>>> the bridge lifetime to the device is wrong.
>>>> This is multiple kernel driver module probe issue, not an issue
>>>> about bridge's lifetime.
>>>>
>>>>
>>>> The life time of the bridge of an 'struct panel_bridge' has
>>>> been tied to the 'panel->dev' since 2017 [1].
>>>>
>>>> See commit 13dfc0540a575b47b2d640b093ac16e9e09474f6
>>>> ("drm/bridge: Refactor out the panel wrapper from the lvds-encoder bridge.")
>>>>
>>>> [1] https://patchwork.freedesktop.org/patch/159673/
>>> Yeah, and it's been wrong since 2017.
>>>
>>>>>     It needs to be tied to the DRM device somehow.
>>>> Why?
>>> Because the DRM device is only free'd when the last userspace
>>> application has closed it's FD to it, which might much later than the
>>> device being removed. So if we tie that to the device lifetime, and the
>>> device goes away, we have a dangling pointer and potential
>>> use-after-free issue if the application continues to access its fd.
>>>
>>>> It's the underlying hardware device backing the bridge, if the
>>>> backing hardware device has been freed, How does the bound drm
>>>> bridge driver could continue to work?
>>> Using drm_dev_enter/drm_dev_exit.
>>>
>>>> How could the dangling pointer stored in the bridge_list still
>>>> will make sense?
>>> It's dangling only if the bridge has been free'd while still having a
>>> pointer to it. If you have a reference counted allocation, it's not
>>> dangling anymore.
>> I meant that in the deferral context, the underlying panel device has
>> been freed. You could keep the allocated storage in memory, but this
>> is in vain.
> It prevents memory safety issues.

The mentioned problem is NOT a memory safety issue, it's just that the
devm_drm_of_get_bridge() fails to do resource destruction on driver detach.

When mipi_dsi_attach(dsi) returns -EPROBE_DEFFER, the panel driver detached.
The underlying *panel* has been removed from the 'panel_list'. The drm_panel
instance that the panel_bridge structure references has been freed.

In this circumstance,have a reference counted allocation for the bridge itself
means nothing at all. Because the 'struct panel_bridge::panel' pointer is
dangling after the panel driver detached.

We should not access the panel_bridge/the panel/the bridge instance via the
dangling pointer anymore.

>> The real hardware has gone, the reference counted allocation could
>> only stand for the panel bridge itself, without the real hardware
>> backing there. It can not fully functional.
> Yes, but that's not the point?
>

The point is to remove the dangling pointer in the bridge_list, make sure
that it won't be de-referenced anymore.

As the memory it references has been free with regard to the '-EPROBE_DEFFER'
On the next time round, we will have *new* pointer to *new* memory location.
The new instance and pointer to that is expected to be used.

>> As far as I could understand, in the deferral context, tears down
>> everything is standard behavior. This is not very related to the
>> lifetime.
>>
>>>> The imx-lcdif could instantiate three DRM driver, which one
>>>> should be selected as the "main" DRM device to attach?
>>> The one the bridge attaches to?
>> The point is how can we select one from it.
> bridge->dev ?


This still will lead to release time misaligned, as KMS driver/device could
managed to not leave once probed. Such drivers could create a sub-device
to bind a sub-driver, offload attach() actions to the sub-device to break
thecyclic dependency  issue.


>>>> No, It is messy since day 0. And has been made worse since 2017,
>>>> from then, thedevm_drm_panel_bridge_add() [2] was initially introduced.
>>>> Which allow us to abuse the lifetime of bridge to a different device or (any
>>>> device).
>>> I agree it's messy. I'm sure you'd agree that we do not want to make the
>>> situation any messier.
>>>
>>>> Maxime's patch just follow this way, but if the caller side
>>>> wise enough to refuse to use those helper, we should be still
>>>> safe. That why I suggest ChenYu to inline and with a little bit
>>>> revise.
>>> Hi! I'm that Maxime. And it was indeed a mistake in hindsight.
>>>
>>> Maxime
>>>
>>>> [2] https://patchwork.freedesktop.org/patch/167666/
>>>>
>>>> [3]
>>>> https://lore.kernel.org/dri-devel/20210910130941.1740182-2-maxime@cerno.tech/
>>>>
>>>>> Your suggestion might indeed work around your issue,
>>>> To be clear, the mentioned problem in this thread is caused
>>>> by deferral probe. We should remove the dangling pointer
>>>> stored in the bridge_list, This is just something similar to
>>>> the fault cleanup or error handling, Right?
>>>>
>>>> But the fundamental thing is that the issue is happened in
>>>> the deferral probe context.
>>> The context doesn't matter here.
>>
>> Its an important factor, it really matters.
>>
>> One fundamental criteria, I think, is that *if* other
>> bridge + KMS driver combinations suffer from the same problem.
> All of them do. We just collectively stick our head in the sand.

Non devm_drm_of_get_bridge() callers won't create an bridge on the
*attach* time.

As well as those of_drm_find_bridge() users, which didn't manage
the release time explicitly. Or those users that tie the release
time to the underlying backing platform device, are just fine.


>> Apparently, other drm bridge users didn't report similar problem.
>> This means that non devm_drm_of_get_bridge() callers are different
>> with those devm_drm_of_get_bridge() callers.
> Nope. They are strictly equivalent.

No.

One is put emphasis on controlling when is ready to use,
the other one is focus on addressing when is safety to leave.
Not exactly same in details.

> Maxime
Maxime Ripard Dec. 2, 2024, 2:57 p.m. UTC | #17
On Fri, Nov 29, 2024 at 11:16:50PM +0800, Chen-Yu Tsai wrote:
> On Fri, Nov 29, 2024 at 10:55 PM Maxime Ripard <mripard@kernel.org> wrote:
> >
> > On Fri, Nov 29, 2024 at 10:12:02PM +0800, Sui Jingfeng wrote:
> > > Hi,
> > >
> > > On 2024/11/29 18:51, Maxime Ripard wrote:
> > > > On Wed, Nov 27, 2024 at 05:58:31PM +0800, Chen-Yu Tsai wrote:
> > > > > Revisiting this thread since I just stepped on the same problem on a
> > > > > different device.
> > > > >
> > > > > On Thu, Nov 14, 2024 at 9:12 PM Maxime Ripard <mripard@kernel.org> wrote:
> > > > > > On Tue, Oct 29, 2024 at 10:53:49PM +0800, Fei Shao wrote:
> > > > > > > On Thu, Oct 24, 2024 at 8:36 PM Maxime Ripard <mripard@kernel.org> wrote:
> > > > > > > > On Wed, Oct 09, 2024 at 01:23:31PM +0800, Fei Shao wrote:
> > > > > > > > > In the mtk_dsi driver, its DSI host attach callback calls
> > > > > > > > > devm_drm_of_get_bridge() to get the next bridge. If that next bridge is
> > > > > > > > > a panel bridge, a panel_bridge object is allocated and managed by the
> > > > > > > > > panel device.
> > > > > > > > >
> > > > > > > > > Later, if the attach callback fails with -EPROBE_DEFER from subsequent
> > > > > > > > > component_add(), the panel device invoking the callback at probe time
> > > > > > > > > also fails, and all device-managed resources are freed accordingly.
> > > > > > > > >
> > > > > > > > > This exposes a drm_bridge bridge_list corruption due to the unbalanced
> > > > > > > > > lifecycle between the DSI host and the panel devices: the panel_bridge
> > > > > > > > > object managed by panel device is freed, while drm_bridge_remove() is
> > > > > > > > > bound to DSI host device and never gets called.
> > > > > > > > > The next drm_bridge_add() will trigger UAF against the freed bridge list
> > > > > > > > > object and result in kernel panic.
> > > > > > > > >
> > > > > > > > > This bug is observed on a MediaTek MT8188-based Chromebook with MIPI DSI
> > > > > > > > > outputting to a DSI panel (DT is WIP for upstream).
> > > > > > > > >
> > > > > > > > > As a fix, using devm_drm_bridge_add() with the panel device in the panel
> > > > > > > > > path seems reasonable. This also implies a chain of potential cleanup
> > > > > > > > > actions:
> > > > > > > > >
> > > > > > > > > 1. Removing drm_bridge_remove() means devm_drm_panel_bridge_release()
> > > > > > > > >     becomes hollow and can be removed.
> > > > > > > > >
> > > > > > > > > 2. devm_drm_panel_bridge_add_typed() is almost emptied except for the
> > > > > > > > >     `bridge->pre_enable_prev_first` line. Itself can be also removed if
> > > > > > > > >     we move the line into drm_panel_bridge_add_typed(). (maybe?)
> > > > > > > > >
> > > > > > > > > 3. drm_panel_bridge_add_typed() now calls all the needed devm_* calls,
> > > > > > > > >     so it's essentially the new devm_drm_panel_bridge_add_typed().
> > > > > > > > >
> > > > > > > > > 4. drmm_panel_bridge_add() needs to be updated accordingly since it
> > > > > > > > >     calls drm_panel_bridge_add_typed(). But now there's only one bridge
> > > > > > > > >     object to be freed, and it's already being managed by panel device.
> > > > > > > > >     I wonder if we still need both drmm_ and devm_ version in this case.
> > > > > > > > >     (maybe yes from DRM PoV, I don't know much about the context)
> > > > > > > > >
> > > > > > > > > This is a RFC patch since I'm not sure if my understanding is correct
> > > > > > > > > (for both the fix and the cleanup). It fixes the issue I encountered,
> > > > > > > > > but I don't expect it to be picked up directly due to the redundant
> > > > > > > > > commit message and the dangling devm_drm_panel_bridge_release().
> > > > > > > > > I plan to resend the official patch(es) once I know what I supposed to
> > > > > > > > > do next.
> > > > > > > > >
> > > > > > > > > For reference, here's the KASAN report from the device:
> > > > > > > > > ==================================================================
> > > > > > > > >   BUG: KASAN: slab-use-after-free in drm_bridge_add+0x98/0x230
> > > > > > > > >   Read of size 8 at addr ffffff80c4e9e100 by task kworker/u32:1/69
> > > > > > > > >
> > > > > > > > >   CPU: 1 UID: 0 PID: 69 Comm: kworker/u32:1 Not tainted 6.12.0-rc1-next-20241004-kasan-00030-g062135fa4046 #1
> > > > > > > > >   Hardware name: Google Ciri sku0/unprovisioned board (DT)
> > > > > > > > >   Workqueue: events_unbound deferred_probe_work_func
> > > > > > > > >   Call trace:
> > > > > > > > >    dump_backtrace+0xfc/0x140
> > > > > > > > >    show_stack+0x24/0x38
> > > > > > > > >    dump_stack_lvl+0x40/0xc8
> > > > > > > > >    print_report+0x140/0x700
> > > > > > > > >    kasan_report+0xcc/0x130
> > > > > > > > >    __asan_report_load8_noabort+0x20/0x30
> > > > > > > > >    drm_bridge_add+0x98/0x230
> > > > > > > > >    devm_drm_panel_bridge_add_typed+0x174/0x298
> > > > > > > > >    devm_drm_of_get_bridge+0xe8/0x190
> > > > > > > > >    mtk_dsi_host_attach+0x130/0x2b0
> > > > > > > > >    mipi_dsi_attach+0x8c/0xe8
> > > > > > > > >    hx83102_probe+0x1a8/0x368
> > > > > > > > >    mipi_dsi_drv_probe+0x6c/0x88
> > > > > > > > >    really_probe+0x1c4/0x698
> > > > > > > > >    __driver_probe_device+0x160/0x298
> > > > > > > > >    driver_probe_device+0x7c/0x2a8
> > > > > > > > >    __device_attach_driver+0x2a0/0x398
> > > > > > > > >    bus_for_each_drv+0x198/0x200
> > > > > > > > >    __device_attach+0x1c0/0x308
> > > > > > > > >    device_initial_probe+0x20/0x38
> > > > > > > > >    bus_probe_device+0x11c/0x1f8
> > > > > > > > >    deferred_probe_work_func+0x80/0x250
> > > > > > > > >    worker_thread+0x9b4/0x2780
> > > > > > > > >    kthread+0x274/0x350
> > > > > > > > >    ret_from_fork+0x10/0x20
> > > > > > > > >
> > > > > > > > >   Allocated by task 69:
> > > > > > > > >    kasan_save_track+0x40/0x78
> > > > > > > > >    kasan_save_alloc_info+0x44/0x58
> > > > > > > > >    __kasan_kmalloc+0x84/0xa0
> > > > > > > > >    __kmalloc_node_track_caller_noprof+0x228/0x450
> > > > > > > > >    devm_kmalloc+0x6c/0x288
> > > > > > > > >    devm_drm_panel_bridge_add_typed+0xa0/0x298
> > > > > > > > >    devm_drm_of_get_bridge+0xe8/0x190
> > > > > > > > >    mtk_dsi_host_attach+0x130/0x2b0
> > > > > > > > >    mipi_dsi_attach+0x8c/0xe8
> > > > > > > > >    hx83102_probe+0x1a8/0x368
> > > > > > > > >    mipi_dsi_drv_probe+0x6c/0x88
> > > > > > > > >    really_probe+0x1c4/0x698
> > > > > > > > >    __driver_probe_device+0x160/0x298
> > > > > > > > >    driver_probe_device+0x7c/0x2a8
> > > > > > > > >    __device_attach_driver+0x2a0/0x398
> > > > > > > > >    bus_for_each_drv+0x198/0x200
> > > > > > > > >    __device_attach+0x1c0/0x308
> > > > > > > > >    device_initial_probe+0x20/0x38
> > > > > > > > >    bus_probe_device+0x11c/0x1f8
> > > > > > > > >    deferred_probe_work_func+0x80/0x250
> > > > > > > > >    worker_thread+0x9b4/0x2780
> > > > > > > > >    kthread+0x274/0x350
> > > > > > > > >    ret_from_fork+0x10/0x20
> > > > > > > > >
> > > > > > > > >   Freed by task 69:
> > > > > > > > >    kasan_save_track+0x40/0x78
> > > > > > > > >    kasan_save_free_info+0x58/0x78
> > > > > > > > >    __kasan_slab_free+0x48/0x68
> > > > > > > > >    kfree+0xd4/0x750
> > > > > > > > >    devres_release_all+0x144/0x1e8
> > > > > > > > >    really_probe+0x48c/0x698
> > > > > > > > >    __driver_probe_device+0x160/0x298
> > > > > > > > >    driver_probe_device+0x7c/0x2a8
> > > > > > > > >    __device_attach_driver+0x2a0/0x398
> > > > > > > > >    bus_for_each_drv+0x198/0x200
> > > > > > > > >    __device_attach+0x1c0/0x308
> > > > > > > > >    device_initial_probe+0x20/0x38
> > > > > > > > >    bus_probe_device+0x11c/0x1f8
> > > > > > > > >    deferred_probe_work_func+0x80/0x250
> > > > > > > > >    worker_thread+0x9b4/0x2780
> > > > > > > > >    kthread+0x274/0x350
> > > > > > > > >    ret_from_fork+0x10/0x20
> > > > > > > > >
> > > > > > > > >   The buggy address belongs to the object at ffffff80c4e9e000
> > > > > > > > >    which belongs to the cache kmalloc-4k of size 4096
> > > > > > > > >   The buggy address is located 256 bytes inside of
> > > > > > > > >    freed 4096-byte region [ffffff80c4e9e000, ffffff80c4e9f000)
> > > > > > > > >
> > > > > > > > >   The buggy address belongs to the physical page:
> > > > > > > > >   head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
> > > > > > > > >   flags: 0x8000000000000040(head|zone=2)
> > > > > > > > >   page_type: f5(slab)
> > > > > > > > >   page: refcount:1 mapcount:0 mapping:0000000000000000
> > > > > > > > >   index:0x0 pfn:0x104e98
> > > > > > > > >   raw: 8000000000000040 ffffff80c0003040 dead000000000122 0000000000000000
> > > > > > > > >   raw: 0000000000000000 0000000000040004 00000001f5000000 0000000000000000
> > > > > > > > >   head: 8000000000000040 ffffff80c0003040 dead000000000122 0000000000000000
> > > > > > > > >   head: 0000000000000000 0000000000040004 00000001f5000000 0000000000000000
> > > > > > > > >   head: 8000000000000003 fffffffec313a601 ffffffffffffffff 0000000000000000
> > > > > > > > >   head: 0000000000000008 0000000000000000 00000000ffffffff 0000000000000000
> > > > > > > > >   page dumped because: kasan: bad access detected
> > > > > > > > >
> > > > > > > > >   Memory state around the buggy address:
> > > > > > > > >    ffffff80c4e9e000: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> > > > > > > > >    ffffff80c4e9e080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> > > > > > > > >   >ffffff80c4e9e100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> > > > > > > > >                      ^
> > > > > > > > >    ffffff80c4e9e180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> > > > > > > > >    ffffff80c4e9e200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> > > > > > > > > ===================================================================
> > > > > > > > >
> > > > > > > > > Signed-off-by: Fei Shao <fshao@chromium.org>
> > > > > > > > I was looking at the driver to try to follow your (awesome btw, thanks)
> > > > > > > > commit log, and it does have a quite different structure compared to
> > > > > > > > what we recommend.
> > > > > > > >
> > > > > > > > Would following
> > > > > > > > https://docs.kernel.org/gpu/drm-kms-helpers.html#special-care-with-mipi-dsi-bridges
> > > > > > > > help?
> > > > > > > Hi Maxime,
> > > > > > >
> > > > > > > Thank you for the pointer.
> > > > > > > I read the suggested pattern in the doc and compared it with the
> > > > > > > drivers. If I understand correctly, both the MIPI-DSI host and panel
> > > > > > > drivers follow the instructions:
> > > > > > >
> > > > > > > 1. The MIPI-DSI host driver must run mipi_dsi_host_register() in its probe hook.
> > > > > > >     >> drm/mediatek/mtk_dsi.c runs mipi_dsi_host_register() in the probe hook.
> > > > > > > 2. In its probe hook, the bridge driver must try to find its MIPI-DSI
> > > > > > > host, register as a MIPI-DSI device and attach the MIPI-DSI device to
> > > > > > > its host.
> > > > > > >     >> drm/panel/panel-himax-hx83102.c follows and runs
> > > > > > > mipi_dsi_attach() at the end of probe hook.
> > > > > > > 3. In its struct mipi_dsi_host_ops.attach hook, the MIPI-DSI host can
> > > > > > > now add its component.
> > > > > > >     >> drm/mediatek/mtk_dsi.c calls component_add() in the attach callback.
> > > > > > >
> > > > > > > Could you elaborate on the "different structures" you mentioned?
> > > > > > Yeah, you're right, sorry.
> > > > > >
> > > > > > > To clarify my point: the issue is that component_add() may return
> > > > > > > -EPROBE_DEFER if the component (e.g. DSI encoder) is not ready,
> > > > > > > causing the panel bridge to be removed. However, drm_bridge_remove()
> > > > > > > is bound to MIPI-DSI host instead of panel bridge, which owns the
> > > > > > > actual list_head object.
> > > > > > >
> > > > > > > This might be reproducible with other MIPI-DSI host + panel
> > > > > > > combinations by forcibly returning -EPROBE_DEFER in the host attach
> > > > > > > hook (verification with another device is needed), so the fix may be
> > > > > > > required in drm/bridge/panel.c.
> > > > > > Yeah, I think you're just hitting another bridge lifetime issue, and
> > > > > > it's not the only one unfortunately. Tying the bridge structure lifetime
> > > > > > itself to the device is wrong, it should be tied to the DRM device
> > > > > > lifetime instead.
> > > > > I think the more immediate issue is that the bridge object's lifetime
> > > > > and drm_bridge_add/remove are inconsistent when devm_drm_of_get_bridge()
> > > > > or drmm_of_get_bridge() are used.
> > > > >
> > > > > These helpers tie the bridge add/removal to the device or drm_device
> > > > > passed in, but internally they call down to drm_panel_bridge_add_typed()
> > > > > which allocates the bridge object tied to the panel device.
> > > > > > But then, the discussion becomes that bridges typically probe outside of
> > > > > > the "main" DRM device probe path, so you don't have access to the DRM
> > > > > > device structure until attach at best.
> > > > > >
> > > > > > That's why I'm a bit skeptical about your patch. It might workaround
> > > > > > your issue, but it doesn't actually solve the problem. I guess the best
> > > > > > way about it would be to convert bridges to reference counting, with the
> > > > > > device taking a reference at probe time when it allocates the structure
> > > > > > (and giving it back at remove time), and the DRM device taking one when
> > > > > > it's attached and one when it's detached.
> > > > > Without going as far, it's probably better to align the lifecycle of
> > > > > the two parts. Most other bridge drivers in the kernel have |drm_bridge|
> > > > > lifecycle tied to their underlying |device|, either with explicit
> > > > > drm_bridge_{add,remove}() calls in their probe/bind and remove/unbind
> > > > > callbacks respectively, or with devm_drm_bridge_add in the probe/bind
> > > > > path. The only ones with a narrower lifecycle are the DSI hosts, which
> > > > > add the bridge in during host attach and remove it during host detach.
> > > > >
> > > > > I'm thinking about fixing the panel_bridge lifecycle such that it is
> > > > > tied to the panel itself. Maybe that would involve making
> > > > > devm_drm_of_get_bridge() correctly return bridges even if a panel was
> > > > > found, and then making the panels create and add panel bridges directly,
> > > > > possibly within drm_panel_add(). Would that make sense?
> > > > Not really.
> > >
> > >
> > > [...]
> > >
> > >
> > > > Or rather, it doesn't fix the root cause that is that tieing
> > > > the bridge lifetime to the device is wrong.
> 
> Well, yeah, but at least it aligns the panel_bridge case with every other
> bridge driver: the bridge object and its duration existing in the list
> of bridges are the same in every other bridge driver. The bridge driver
> allocates the bridge object in its probe function (with devm or not),
> and adds the bridge to the global list as probably the last step of
> the probe function. In the remove function the reverse is done.
> 
> Right now for the panel bridge, the bridge object is allocated with its
> lifetime tied to the panel, but the adding and removing of the bridge
> to/from the global list is tied to the caller of the panel_bridge API,
> likely the previous bridge or encoder in the chain.
> 
> This is what is blowing up for us, right now, with the Ciri Chromebooks.
> If we fix this bit, then at least it reduces the problem to be only as
> worse as the other bridge drivers.

It also creates a harmful function in the framework, and thus technical
debt that we will have to deal with eventually. It's pretty clear that
you don't want to fix it properly, but at the very least we shouldn't
pile on more technical debt in the framework.

Anyway, you both seem to have decided I'm wrong, so go ahead anyway you
see fit.

Maxime
Sui Jingfeng Dec. 2, 2024, 3:49 p.m. UTC | #18
Hi,

On 2024/12/2 22:57, Maxime Ripard wrote:
> On Fri, Nov 29, 2024 at 11:16:50PM +0800, Chen-Yu Tsai wrote:
>> On Fri, Nov 29, 2024 at 10:55 PM Maxime Ripard <mripard@kernel.org> wrote:
>>> On Fri, Nov 29, 2024 at 10:12:02PM +0800, Sui Jingfeng wrote:
>>>> Hi,
>>>>
>>>> On 2024/11/29 18:51, Maxime Ripard wrote:
>>>>> On Wed, Nov 27, 2024 at 05:58:31PM +0800, Chen-Yu Tsai wrote:
>>>>>> Revisiting this thread since I just stepped on the same problem on a
>>>>>> different device.
>>>>>>
>>>>>> On Thu, Nov 14, 2024 at 9:12 PM Maxime Ripard <mripard@kernel.org> wrote:
>>>>>>> On Tue, Oct 29, 2024 at 10:53:49PM +0800, Fei Shao wrote:
>>>>>>>> On Thu, Oct 24, 2024 at 8:36 PM Maxime Ripard <mripard@kernel.org> wrote:
>>>>>>>>> On Wed, Oct 09, 2024 at 01:23:31PM +0800, Fei Shao wrote:
>>>>>>>>>> In the mtk_dsi driver, its DSI host attach callback calls
>>>>>>>>>> devm_drm_of_get_bridge() to get the next bridge. If that next bridge is
>>>>>>>>>> a panel bridge, a panel_bridge object is allocated and managed by the
>>>>>>>>>> panel device.
>>>>>>>>>>
>>>>>>>>>> Later, if the attach callback fails with -EPROBE_DEFER from subsequent
>>>>>>>>>> component_add(), the panel device invoking the callback at probe time
>>>>>>>>>> also fails, and all device-managed resources are freed accordingly.
>>>>>>>>>>
>>>>>>>>>> This exposes a drm_bridge bridge_list corruption due to the unbalanced
>>>>>>>>>> lifecycle between the DSI host and the panel devices: the panel_bridge
>>>>>>>>>> object managed by panel device is freed, while drm_bridge_remove() is
>>>>>>>>>> bound to DSI host device and never gets called.
>>>>>>>>>> The next drm_bridge_add() will trigger UAF against the freed bridge list
>>>>>>>>>> object and result in kernel panic.
>>>>>>>>>>
>>>>>>>>>> This bug is observed on a MediaTek MT8188-based Chromebook with MIPI DSI
>>>>>>>>>> outputting to a DSI panel (DT is WIP for upstream).
>>>>>>>>>>
>>>>>>>>>> As a fix, using devm_drm_bridge_add() with the panel device in the panel
>>>>>>>>>> path seems reasonable. This also implies a chain of potential cleanup
>>>>>>>>>> actions:
>>>>>>>>>>
>>>>>>>>>> 1. Removing drm_bridge_remove() means devm_drm_panel_bridge_release()
>>>>>>>>>>      becomes hollow and can be removed.
>>>>>>>>>>
>>>>>>>>>> 2. devm_drm_panel_bridge_add_typed() is almost emptied except for the
>>>>>>>>>>      `bridge->pre_enable_prev_first` line. Itself can be also removed if
>>>>>>>>>>      we move the line into drm_panel_bridge_add_typed(). (maybe?)
>>>>>>>>>>
>>>>>>>>>> 3. drm_panel_bridge_add_typed() now calls all the needed devm_* calls,
>>>>>>>>>>      so it's essentially the new devm_drm_panel_bridge_add_typed().
>>>>>>>>>>
>>>>>>>>>> 4. drmm_panel_bridge_add() needs to be updated accordingly since it
>>>>>>>>>>      calls drm_panel_bridge_add_typed(). But now there's only one bridge
>>>>>>>>>>      object to be freed, and it's already being managed by panel device.
>>>>>>>>>>      I wonder if we still need both drmm_ and devm_ version in this case.
>>>>>>>>>>      (maybe yes from DRM PoV, I don't know much about the context)
>>>>>>>>>>
>>>>>>>>>> This is a RFC patch since I'm not sure if my understanding is correct
>>>>>>>>>> (for both the fix and the cleanup). It fixes the issue I encountered,
>>>>>>>>>> but I don't expect it to be picked up directly due to the redundant
>>>>>>>>>> commit message and the dangling devm_drm_panel_bridge_release().
>>>>>>>>>> I plan to resend the official patch(es) once I know what I supposed to
>>>>>>>>>> do next.
>>>>>>>>>>
>>>>>>>>>> For reference, here's the KASAN report from the device:
>>>>>>>>>> ==================================================================
>>>>>>>>>>    BUG: KASAN: slab-use-after-free in drm_bridge_add+0x98/0x230
>>>>>>>>>>    Read of size 8 at addr ffffff80c4e9e100 by task kworker/u32:1/69
>>>>>>>>>>
>>>>>>>>>>    CPU: 1 UID: 0 PID: 69 Comm: kworker/u32:1 Not tainted 6.12.0-rc1-next-20241004-kasan-00030-g062135fa4046 #1
>>>>>>>>>>    Hardware name: Google Ciri sku0/unprovisioned board (DT)
>>>>>>>>>>    Workqueue: events_unbound deferred_probe_work_func
>>>>>>>>>>    Call trace:
>>>>>>>>>>     dump_backtrace+0xfc/0x140
>>>>>>>>>>     show_stack+0x24/0x38
>>>>>>>>>>     dump_stack_lvl+0x40/0xc8
>>>>>>>>>>     print_report+0x140/0x700
>>>>>>>>>>     kasan_report+0xcc/0x130
>>>>>>>>>>     __asan_report_load8_noabort+0x20/0x30
>>>>>>>>>>     drm_bridge_add+0x98/0x230
>>>>>>>>>>     devm_drm_panel_bridge_add_typed+0x174/0x298
>>>>>>>>>>     devm_drm_of_get_bridge+0xe8/0x190
>>>>>>>>>>     mtk_dsi_host_attach+0x130/0x2b0
>>>>>>>>>>     mipi_dsi_attach+0x8c/0xe8
>>>>>>>>>>     hx83102_probe+0x1a8/0x368
>>>>>>>>>>     mipi_dsi_drv_probe+0x6c/0x88
>>>>>>>>>>     really_probe+0x1c4/0x698
>>>>>>>>>>     __driver_probe_device+0x160/0x298
>>>>>>>>>>     driver_probe_device+0x7c/0x2a8
>>>>>>>>>>     __device_attach_driver+0x2a0/0x398
>>>>>>>>>>     bus_for_each_drv+0x198/0x200
>>>>>>>>>>     __device_attach+0x1c0/0x308
>>>>>>>>>>     device_initial_probe+0x20/0x38
>>>>>>>>>>     bus_probe_device+0x11c/0x1f8
>>>>>>>>>>     deferred_probe_work_func+0x80/0x250
>>>>>>>>>>     worker_thread+0x9b4/0x2780
>>>>>>>>>>     kthread+0x274/0x350
>>>>>>>>>>     ret_from_fork+0x10/0x20
>>>>>>>>>>
>>>>>>>>>>    Allocated by task 69:
>>>>>>>>>>     kasan_save_track+0x40/0x78
>>>>>>>>>>     kasan_save_alloc_info+0x44/0x58
>>>>>>>>>>     __kasan_kmalloc+0x84/0xa0
>>>>>>>>>>     __kmalloc_node_track_caller_noprof+0x228/0x450
>>>>>>>>>>     devm_kmalloc+0x6c/0x288
>>>>>>>>>>     devm_drm_panel_bridge_add_typed+0xa0/0x298
>>>>>>>>>>     devm_drm_of_get_bridge+0xe8/0x190
>>>>>>>>>>     mtk_dsi_host_attach+0x130/0x2b0
>>>>>>>>>>     mipi_dsi_attach+0x8c/0xe8
>>>>>>>>>>     hx83102_probe+0x1a8/0x368
>>>>>>>>>>     mipi_dsi_drv_probe+0x6c/0x88
>>>>>>>>>>     really_probe+0x1c4/0x698
>>>>>>>>>>     __driver_probe_device+0x160/0x298
>>>>>>>>>>     driver_probe_device+0x7c/0x2a8
>>>>>>>>>>     __device_attach_driver+0x2a0/0x398
>>>>>>>>>>     bus_for_each_drv+0x198/0x200
>>>>>>>>>>     __device_attach+0x1c0/0x308
>>>>>>>>>>     device_initial_probe+0x20/0x38
>>>>>>>>>>     bus_probe_device+0x11c/0x1f8
>>>>>>>>>>     deferred_probe_work_func+0x80/0x250
>>>>>>>>>>     worker_thread+0x9b4/0x2780
>>>>>>>>>>     kthread+0x274/0x350
>>>>>>>>>>     ret_from_fork+0x10/0x20
>>>>>>>>>>
>>>>>>>>>>    Freed by task 69:
>>>>>>>>>>     kasan_save_track+0x40/0x78
>>>>>>>>>>     kasan_save_free_info+0x58/0x78
>>>>>>>>>>     __kasan_slab_free+0x48/0x68
>>>>>>>>>>     kfree+0xd4/0x750
>>>>>>>>>>     devres_release_all+0x144/0x1e8
>>>>>>>>>>     really_probe+0x48c/0x698
>>>>>>>>>>     __driver_probe_device+0x160/0x298
>>>>>>>>>>     driver_probe_device+0x7c/0x2a8
>>>>>>>>>>     __device_attach_driver+0x2a0/0x398
>>>>>>>>>>     bus_for_each_drv+0x198/0x200
>>>>>>>>>>     __device_attach+0x1c0/0x308
>>>>>>>>>>     device_initial_probe+0x20/0x38
>>>>>>>>>>     bus_probe_device+0x11c/0x1f8
>>>>>>>>>>     deferred_probe_work_func+0x80/0x250
>>>>>>>>>>     worker_thread+0x9b4/0x2780
>>>>>>>>>>     kthread+0x274/0x350
>>>>>>>>>>     ret_from_fork+0x10/0x20
>>>>>>>>>>
>>>>>>>>>>    The buggy address belongs to the object at ffffff80c4e9e000
>>>>>>>>>>     which belongs to the cache kmalloc-4k of size 4096
>>>>>>>>>>    The buggy address is located 256 bytes inside of
>>>>>>>>>>     freed 4096-byte region [ffffff80c4e9e000, ffffff80c4e9f000)
>>>>>>>>>>
>>>>>>>>>>    The buggy address belongs to the physical page:
>>>>>>>>>>    head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
>>>>>>>>>>    flags: 0x8000000000000040(head|zone=2)
>>>>>>>>>>    page_type: f5(slab)
>>>>>>>>>>    page: refcount:1 mapcount:0 mapping:0000000000000000
>>>>>>>>>>    index:0x0 pfn:0x104e98
>>>>>>>>>>    raw: 8000000000000040 ffffff80c0003040 dead000000000122 0000000000000000
>>>>>>>>>>    raw: 0000000000000000 0000000000040004 00000001f5000000 0000000000000000
>>>>>>>>>>    head: 8000000000000040 ffffff80c0003040 dead000000000122 0000000000000000
>>>>>>>>>>    head: 0000000000000000 0000000000040004 00000001f5000000 0000000000000000
>>>>>>>>>>    head: 8000000000000003 fffffffec313a601 ffffffffffffffff 0000000000000000
>>>>>>>>>>    head: 0000000000000008 0000000000000000 00000000ffffffff 0000000000000000
>>>>>>>>>>    page dumped because: kasan: bad access detected
>>>>>>>>>>
>>>>>>>>>>    Memory state around the buggy address:
>>>>>>>>>>     ffffff80c4e9e000: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>>>>>>>>>>     ffffff80c4e9e080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>>>>>>>>>>    >ffffff80c4e9e100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>>>>>>>>>>                       ^
>>>>>>>>>>     ffffff80c4e9e180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>>>>>>>>>>     ffffff80c4e9e200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>>>>>>>>>> ===================================================================
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: Fei Shao <fshao@chromium.org>
>>>>>>>>> I was looking at the driver to try to follow your (awesome btw, thanks)
>>>>>>>>> commit log, and it does have a quite different structure compared to
>>>>>>>>> what we recommend.
>>>>>>>>>
>>>>>>>>> Would following
>>>>>>>>> https://docs.kernel.org/gpu/drm-kms-helpers.html#special-care-with-mipi-dsi-bridges
>>>>>>>>> help?
>>>>>>>> Hi Maxime,
>>>>>>>>
>>>>>>>> Thank you for the pointer.
>>>>>>>> I read the suggested pattern in the doc and compared it with the
>>>>>>>> drivers. If I understand correctly, both the MIPI-DSI host and panel
>>>>>>>> drivers follow the instructions:
>>>>>>>>
>>>>>>>> 1. The MIPI-DSI host driver must run mipi_dsi_host_register() in its probe hook.
>>>>>>>>      >> drm/mediatek/mtk_dsi.c runs mipi_dsi_host_register() in the probe hook.
>>>>>>>> 2. In its probe hook, the bridge driver must try to find its MIPI-DSI
>>>>>>>> host, register as a MIPI-DSI device and attach the MIPI-DSI device to
>>>>>>>> its host.
>>>>>>>>      >> drm/panel/panel-himax-hx83102.c follows and runs
>>>>>>>> mipi_dsi_attach() at the end of probe hook.
>>>>>>>> 3. In its struct mipi_dsi_host_ops.attach hook, the MIPI-DSI host can
>>>>>>>> now add its component.
>>>>>>>>      >> drm/mediatek/mtk_dsi.c calls component_add() in the attach callback.
>>>>>>>>
>>>>>>>> Could you elaborate on the "different structures" you mentioned?
>>>>>>> Yeah, you're right, sorry.
>>>>>>>
>>>>>>>> To clarify my point: the issue is that component_add() may return
>>>>>>>> -EPROBE_DEFER if the component (e.g. DSI encoder) is not ready,
>>>>>>>> causing the panel bridge to be removed. However, drm_bridge_remove()
>>>>>>>> is bound to MIPI-DSI host instead of panel bridge, which owns the
>>>>>>>> actual list_head object.
>>>>>>>>
>>>>>>>> This might be reproducible with other MIPI-DSI host + panel
>>>>>>>> combinations by forcibly returning -EPROBE_DEFER in the host attach
>>>>>>>> hook (verification with another device is needed), so the fix may be
>>>>>>>> required in drm/bridge/panel.c.
>>>>>>> Yeah, I think you're just hitting another bridge lifetime issue, and
>>>>>>> it's not the only one unfortunately. Tying the bridge structure lifetime
>>>>>>> itself to the device is wrong, it should be tied to the DRM device
>>>>>>> lifetime instead.
>>>>>> I think the more immediate issue is that the bridge object's lifetime
>>>>>> and drm_bridge_add/remove are inconsistent when devm_drm_of_get_bridge()
>>>>>> or drmm_of_get_bridge() are used.
>>>>>>
>>>>>> These helpers tie the bridge add/removal to the device or drm_device
>>>>>> passed in, but internally they call down to drm_panel_bridge_add_typed()
>>>>>> which allocates the bridge object tied to the panel device.
>>>>>>> But then, the discussion becomes that bridges typically probe outside of
>>>>>>> the "main" DRM device probe path, so you don't have access to the DRM
>>>>>>> device structure until attach at best.
>>>>>>>
>>>>>>> That's why I'm a bit skeptical about your patch. It might workaround
>>>>>>> your issue, but it doesn't actually solve the problem. I guess the best
>>>>>>> way about it would be to convert bridges to reference counting, with the
>>>>>>> device taking a reference at probe time when it allocates the structure
>>>>>>> (and giving it back at remove time), and the DRM device taking one when
>>>>>>> it's attached and one when it's detached.
>>>>>> Without going as far, it's probably better to align the lifecycle of
>>>>>> the two parts. Most other bridge drivers in the kernel have |drm_bridge|
>>>>>> lifecycle tied to their underlying |device|, either with explicit
>>>>>> drm_bridge_{add,remove}() calls in their probe/bind and remove/unbind
>>>>>> callbacks respectively, or with devm_drm_bridge_add in the probe/bind
>>>>>> path. The only ones with a narrower lifecycle are the DSI hosts, which
>>>>>> add the bridge in during host attach and remove it during host detach.
>>>>>>
>>>>>> I'm thinking about fixing the panel_bridge lifecycle such that it is
>>>>>> tied to the panel itself. Maybe that would involve making
>>>>>> devm_drm_of_get_bridge() correctly return bridges even if a panel was
>>>>>> found, and then making the panels create and add panel bridges directly,
>>>>>> possibly within drm_panel_add(). Would that make sense?
>>>>> Not really.
>>>>
>>>> [...]
>>>>
>>>>
>>>>> Or rather, it doesn't fix the root cause that is that tieing
>>>>> the bridge lifetime to the device is wrong.
>> Well, yeah, but at least it aligns the panel_bridge case with every other
>> bridge driver: the bridge object and its duration existing in the list
>> of bridges are the same in every other bridge driver. The bridge driver
>> allocates the bridge object in its probe function (with devm or not),
>> and adds the bridge to the global list as probably the last step of
>> the probe function. In the remove function the reverse is done.
>>
>> Right now for the panel bridge, the bridge object is allocated with its
>> lifetime tied to the panel, but the adding and removing of the bridge
>> to/from the global list is tied to the caller of the panel_bridge API,
>> likely the previous bridge or encoder in the chain.
>>
>> This is what is blowing up for us, right now, with the Ciri Chromebooks.
>> If we fix this bit, then at least it reduces the problem to be only as
>> worse as the other bridge drivers.
> It also creates a harmful function in the framework, and thus technical
> debt that we will have to deal with eventually. It's pretty clear that
> you don't want to fix it properly, but at the very least we shouldn't
> pile on more technical debt in the framework.
>
> Anyway, you both seem to have decided I'm wrong,


Hmm, I merely meant that your instruction is an another approach, may need
a large commits to all drm bridges and drm panels though.


> so go ahead anyway you
> see fit.
>
> Maxime
diff mbox series

Patch

diff --git a/drivers/gpu/drm/bridge/panel.c b/drivers/gpu/drm/bridge/panel.c
index 6e88339dec0f..352723c59c70 100644
--- a/drivers/gpu/drm/bridge/panel.c
+++ b/drivers/gpu/drm/bridge/panel.c
@@ -303,7 +303,7 @@  struct drm_bridge *drm_panel_bridge_add_typed(struct drm_panel *panel,
 	panel_bridge->bridge.ops = DRM_BRIDGE_OP_MODES;
 	panel_bridge->bridge.type = connector_type;
 
-	drm_bridge_add(&panel_bridge->bridge);
+	devm_drm_bridge_add(panel->dev, &panel_bridge->bridge);
 
 	return &panel_bridge->bridge;
 }
@@ -327,7 +327,6 @@  void drm_panel_bridge_remove(struct drm_bridge *bridge)
 
 	panel_bridge = drm_bridge_to_panel_bridge(bridge);
 
-	drm_bridge_remove(bridge);
 	devm_kfree(panel_bridge->panel->dev, bridge);
 }
 EXPORT_SYMBOL(drm_panel_bridge_remove);
@@ -359,8 +358,6 @@  static void devm_drm_panel_bridge_release(struct device *dev, void *res)
 
 	if (!bridge)
 		return;
-
-	drm_bridge_remove(bridge);
 }
 
 /**