diff mbox series

[BUG] drm_connector reference counting and USB-C docks

Message ID Z-Cw0x7wa5w7tliO@debian.local (mailing list archive)
State New
Headers show
Series [BUG] drm_connector reference counting and USB-C docks | expand

Commit Message

Chris Bainbridge March 24, 2025, 1:09 a.m. UTC
There is a reference couting / lifecycle issue with drm_connector when used
with a USB-C dock. The problem has been previously reproduced on both Intel and
AMD GPUs.

On both Intel and AMD, the symptoms are:

  - multiple connectors being listed in sysfs `sys/class/drm/cardX/` (because
    the old connectors are not removed when the dock is unplugged)
  - no display on the external monitors.
  - "Payload for VCPI 1 not in topology, not sending remove" error if drm.debug
    is enabled

On AMD, this issue is the root cause of a number of errors when re-plugging in
a dock:

  - *ERROR* Failed to get ACT after 3000ms
  - kernel NULL pointer dereference calling setcrtc
  - UBSAN: shift-out-of-bounds in drivers/gpu/drm/display/drm_dp_mst_topology.c
  - use-after-free in dc_stream_release
  - refcount_t: underflow; use-after-free.
  - slab-use-after-free in event_property_validate
  - WARNING display/dc/dcn21/dcn21_link_encoder.c:215 dcn21_link_encoder_acquire_phy
  - Part 1 of payload creation for DP-2 failed, skipping part 2
  - probably most bug reports relating to suspend/resume and a dock

This bug has been reproduced on both Ubuntu/Gnome and Debian/XFCE. The symptoms
are intermittent and vary (as above), but the consistent initial symptom is
multiple connectors being listed in sysfs.

To reproduce, annotate drm_dp_delayed_destroy_port with something like:

Boot laptop with dock connected, activate external monitors, suspend, unplug
the dock, and resume. This problem is intermittent, so these steps may need to
be repeated. But when the problem is hit, the drm_dp_mst_port will be
destroyed, but the drm_connector will still be alive. (This can also be
reproduced with just plugging and unplugging without suspend/resume, but, on my
laptop, it happens almost every time with suspend/resume).

The cause of this problem appears to be:

  - calling setcrtc to enable a CRTC results in the drm_connector refcount
    being incremented:
  - drm_atomic_get_connector_state appears to add connectors into
    drm_atomic_state->connectors, and increments the refcount

  - on disabling the external monitors, a call to drm_mode_setcrtc results in
    the drm_connector being destroyed via call chain:

    amdgpu_drm_ioctl
      drm_ioctl
        drm_ioctl_kernel
          drm_mode_setcrtc (via func)
            drm_atomic_helper_set_config (via crtc->funcs->set_config)
              drm_atomic_state_put
                __drm_atomic_state_free (via kref_put)
                  drm_atomic_state_clear
                    drm_atomic_state_default_clear
                      drm_connector_put
                        drm_mode_object_put
                          drm_connector_free (via ->free_cb put destroyer)
                            dm_dp_mst_connector_destroy

  - so the drm_connector is not destroyed until/if userspace calls setcrtc to
    clear the CRTC (set.num_connectors=0). If this does not happen for whatever
    reason (userspace process is terminated, frozen due to suspend, etc.) then
    the drm_connector object will still be alive even though the corresponding
    drm_dp_mst_port is dead.

  - in normal usage, drm_connector_cleanup releases the connector ID:

    ida_free(&dev->mode_config.connector_ida, connector->index);

  - when dock is replugged, a connector ID is allocated:

    connector->connector_type_id = ida_alloc_min(connector_ida, 1, GFP_KERNEL);

  - if setcrtc has not been called to free the old ID, then ida_alloc_min
    allocates a new connector ID instead of reusing the old one. This explains
    the "multiple connectors being listed in sysfs" problem.

  - the other problems occur after this, due to the multiple half-dead
    connector objects.

  - UBSAN: shift-out-of-bounds in drivers/gpu/drm/display/drm_dp_mst_topology.c:4568
    occurs because vcpi==0 in this payload, so BIT op is a left-shift by -1.

  - slab-use-after-free in event_property_validate: looks like it happens
    because hdcp_update_display, hdcp_remove_display copy references to
    amdgpu_dm_connector (which contains a nested drm_connector) in to the
    delayed_work struct hdcp_workqueue without incrementing the reference count
    (see pair of lines "hdcp_w->aconnector[conn_index] = aconnector;").
    If the connector is freed, &aconnector[conn_index] will become a dangling
    pointer. Actually, I can reproduce this easily by just booting to
    gdm then plugging and unplugging the dock a few times, so it's
    possible this is an independent issue that also needs fixing.

  - use-after-free in dc_stream_release - there appears to be a few
    points where a dc_stream_state pointer is copied without refcounting
    ("pipe_ctx->stream = stream;") but I don't know if this is the problem. It
    could also just be that earlier failures have left something in a bad state.

I'm unsure of the best approach to fix the root cause. One way is to try
and release the references by disabling the CRTC. I tried calling
drm_mode_crtc from drm_dp_delayed_destroy_port. This was a bit hacky,
but did seem to work, the reference count got reduced to 0, and the
drm_connector was destroyed. Another option would be to call the
drm_connector destructor from drm_dp_delayed_destroy_port (protected by
some mutex so that it doesn't get called twice when the actual refcount
goes to 0) - that might work to free up the connector ID, but I suspect
there could be other issues with having the drm_connector object still
alive and potentially holding references to other objects, even though
the dock has been physically disconnected.
diff mbox series

Patch

--- a/drivers/gpu/drm/display/drm_dp_mst_topology.c
+++ b/drivers/gpu/drm/display/drm_dp_mst_topology.c
@@ -5014,6 +5014,9 @@  drm_dp_delayed_destroy_port(struct drm_dp_mst_port *port)
 
        if (port->connector) {
                drm_connector_unregister(port->connector);
+               printk("drm_dp_delayed_destroy_port %s refcount=%d\n",
+                               port->connector->name,
+                               kref_read(&port->connector->base.refcount));
                drm_connector_put(port->connector);
        }