snd_hda_intel/snd_hda_codec_hdmi module load/unload race

Message ID s5hd1g5vi91.wl-tiwai@suse.de
State New

Commit Message

Takashi Iwai Jan. 2, 2017, 11 a.m. UTC
[Whipping the old thread again, as I'm finally catching up on the
 backlog after vacation]

On Fri, 16 Dec 2016 17:40:54 +0100,
Greg Kroah-Hartman wrote:
> 
> On Thu, Dec 15, 2016 at 12:32:08PM +0100, Takashi Iwai wrote:
> > On Wed, 14 Dec 2016 22:00:50 +0100,
> > Imre Deak wrote:
> > > 
> > > Hi,
> > > 
> > > I got the trace below while trying to unload (unbind) snd_hda_intel, while
> > > it's still loading the HDMI codec driver. IIUC what happens is:
> > > 
> > > Task1                                    Task2                                    Task3
> > > modprobe snd_hda_intel
> > >   schedule(azx_probe_work)
> > > unbind snd_hda_intel via sysfs
> > >   device_release_driver()
> > >     device_lock(snd_hda_intel)
> > >     azx_remove()
> > >       cancel_work_sync(azx_probe_work)
> > >                                          azx_probe_work()
> > >                                            request_module(snd-hda-codec-hdmi)
> > >                                                                                   hdmi_driver_init()
> > >                                                                                     __driver_attach()
> > >                                                                                       device_lock(snd_hda_intel)
> > > 
> > > Deadlock, since azx_probe_work() will never finish and the snd_hda_intel device
> > > lock will never get released.
> > 
> > This is indeed nasty.  The deadlock happens when the driver core takes
> > the parent's device lock.
> > 
> > static int __driver_attach(struct device *dev, void *data)
> > {
> > ....
> > 	if (dev->parent)	/* Needed for USB */
> > 		device_lock(dev->parent);
> > 	device_lock(dev);
> > 	if (!dev->driver)
> > 		driver_probe_device(drv, dev);
> > 
> > I vaguely remember some other issue due to the device_lock of the
> > parent device.  And, I guess a similar deadlock may happen not only
> > with HD-audio driver but also in general with every driver using async
> > probe.
> > 
> > Greg, any good way to avoid such a deadlock?  Can we make the parent
> > device lock conditional somehow?
> 
> Ick, messy.  I don't want to make the parent lock conditional, as it's
> needed.  Shouldn't the cancel_work_sync() prevent the request_module()
> from running?  Seems like you need to serialize your probe_work
> somehow...

The situation is a bit complex.  The work itself was kicked off by the
controller driver's probe(), in order to make the codec binding
asynchronous.  And we can't serialize inside remove() because it is
already called with the device lock held.

I guess a workaround for the time being would be just to unlock the
device temporarily during this cancel_work_sync().  Since we're in
remove() and the parent device's lock is always taken, the race
against another binding should be suppressed even if we temporarily
drop the device lock there.
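
To make the idea concrete, here is a toy, single-threaded user-space
model of the lock interaction.  It is only a sketch: every toy_* name
is hypothetical, the "lock" is a plain flag, and a failed
toy_trylock() stands for the point where the real code would sleep
forever.  It does not use or reproduce any kernel API.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for struct device: only tracks whether its lock is held. */
struct toy_dev {
	bool locked;
	bool codec_bound;
};

/* Non-blocking stand-in for device_lock(): false means "would sleep". */
static bool toy_trylock(struct toy_dev *dev)
{
	if (dev->locked)
		return false;
	dev->locked = true;
	return true;
}

static void toy_unlock(struct toy_dev *dev)
{
	dev->locked = false;
}

/* Stand-in for azx_probe_work(): binding the codec needs the parent's lock. */
static bool toy_probe_work(struct toy_dev *ctrl)
{
	if (!toy_trylock(ctrl))
		return false;	/* the real code would block here forever */
	ctrl->codec_bound = true;
	toy_unlock(ctrl);
	return true;
}

/*
 * Stand-in for azx_remove(), entered with the controller's lock held.
 * Returns true if the pending work could run to completion.
 */
static bool toy_remove(struct toy_dev *ctrl, bool drop_lock)
{
	bool finished;

	assert(ctrl->locked);	/* remove() is called under the device lock */
	if (drop_lock)
		toy_unlock(ctrl);
	finished = toy_probe_work(ctrl);	/* what cancel_work_sync() waits for */
	if (drop_lock)
		(void)toy_trylock(ctrl);	/* retake before returning to the caller */
	return finished;
}

/* Stand-in for the sysfs unbind path: lock, call remove(), unlock. */
static bool toy_unbind(bool drop_lock)
{
	struct toy_dev ctrl = { .locked = false, .codec_bound = false };
	bool finished;

	(void)toy_trylock(&ctrl);
	finished = toy_remove(&ctrl, drop_lock);
	toy_unlock(&ctrl);
	return finished;
}
```

In this model toy_unbind(false) reports that the work could not
finish (the deadlock case), while toy_unbind(true) lets it finish
because the lock is dropped around the wait.  The model says nothing
about the race against another bind, which in the real driver is
expected to be covered by the parent's lock.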

Below is the untested patch.  It's a pity that the first patch I wrote
this year is something like this... ;)


thanks,

Takashi

-- 8< --
From: Takashi Iwai <tiwai@suse.de>
Subject: [PATCH] ALSA: hda - Fix deadlock of controller device lock at
 unbinding

Imre Deak reported a deadlock of the HD-audio driver at unbinding while
it's still probing.  Since we probe the codecs asynchronously in a
work, the codec driver probe may still be kicked off while the
controller itself is being unbound.  And since azx_remove() tries to
process all pending tasks via cancel_work_sync() to fix the other
races (see commit [0b8c82190c12: ALSA: hda - Cancel probe work instead
of flush at remove]), we may now hit a bizarre deadlock:

Unbind snd_hda_intel via sysfs:
  device_release_driver() ->
    device_lock(snd_hda_intel) ->
      azx_remove() ->
        cancel_work_sync(azx_probe_work)

azx_probe_work():
  codec driver probe() ->
     __driver_attach() ->
       device_lock(snd_hda_intel)

This deadlock is caused by the fact that both device_release_driver()
and driver_probe_device() take both the device's lock and its parent's
lock at the same time.  The codec device sets the controller device as
its parent, and this lock is taken before the probe() callback is
called, while the controller's remove() callback is also called with
the same lock held.

In this patch, as an ugly workaround, we unlock the controller device
temporarily during the cancel_work_sync() call.  The race against
another bind call should still be suppressed by the parent's device
lock.

Reported-by: Imre Deak <imre.deak@intel.com>
Fixes: 0b8c82190c12 ("ALSA: hda - Cancel probe work instead of flush at remove")
Signed-off-by: Takashi Iwai <tiwai@suse.de>
---
 sound/pci/hda/hda_intel.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

Patch

diff --git a/sound/pci/hda/hda_intel.c b/sound/pci/hda/hda_intel.c
index c64d986009a9..2587c197e353 100644
--- a/sound/pci/hda/hda_intel.c
+++ b/sound/pci/hda/hda_intel.c
@@ -2155,7 +2155,20 @@  static void azx_remove(struct pci_dev *pci)
 		/* cancel the pending probing work */
 		chip = card->private_data;
 		hda = container_of(chip, struct hda_intel, chip);
+		/* FIXME: below is an ugly workaround.
+		 * Both device_release_driver() and driver_probe_device()
+		 * take *both* the device's and its parent's lock before
+		 * calling the remove() and probe() callbacks.  The codec
+		 * probe takes the locks of both the codec itself and its
+		 * parent, i.e. the PCI controller dev.  Meanwhile, when
+		 * the PCI controller is unbound, it takes its lock, too
+		 * ==> ouch, a deadlock!
+		 * As a workaround, we unlock temporarily here the controller
+		 * device during cancel_work_sync() call.
+		 */
+		device_unlock(&pci->dev);
 		cancel_work_sync(&hda->probe_work);
+		device_lock(&pci->dev);
 
 		snd_card_free(card);
 	}