diff mbox series

ALSA: hda: fix general protection fault in azx_runtime_idle

Message ID 20211110210307.1172004-1-kai.vehmanen@linux.intel.com (mailing list archive)
State New, archived
Headers show
Series ALSA: hda: fix general protection fault in azx_runtime_idle | expand

Commit Message

Kai Vehmanen Nov. 10, 2021, 9:03 p.m. UTC
Fix a corner case between PCI device driver remove callback and
runtime PM idle callback.

Following sequence of events can happen:
  - at azx_create, context is allocated with devm_kzalloc() and
    stored as pci_set_drvdata()
  - user-space requests to unbind audio driver
  - dd.c:__device_release_driver() calls PCI remove
  - pci-driver.c:pci_device_remove() calls the audio
    driver azx_remove() callback and this is completed
  - pci-driver.c:pm_runtime_put_sync() leads to a call
    to rpm_idle() which again calls azx_runtime_idle()
  - the azx context object, as returned by dev_get_drvdata(),
    is no longer valid
  -> access fault in azx_runtime_idle when executing
	struct snd_card *card = dev_get_drvdata(dev);
	chip = card->private_data;
	if (chip->disabled || hda->init_failed)

This was discovered by i915_module_load test with 5.15.0 based
linux-next tree.

Example log caught by i915_module_load test with linux-next
https://intel-gfx-ci.01.org/tree/linux-next/

<4> [264.038232] general protection fault, probably for non-canonical address 0x6b6b6b6b6b6b73f0: 0000 [#1] PREEMPT SMP NOPTI
<4> [264.038248] CPU: 0 PID: 5374 Comm: i915_module_loa Not tainted 5.15.0-next-20211109-gc8109c2ba35e-next-20211109 #1
[...]
<4> [264.038267] RIP: 0010:azx_runtime_idle+0x12/0x60 [snd_hda_intel]
[...]
<4> [264.038355] Call Trace:
<4> [264.038359]  <TASK>
<4> [264.038362]  __rpm_callback+0x3d/0x110
<4> [264.038371]  rpm_idle+0x27f/0x380
<4> [264.038376]  __pm_runtime_idle+0x3b/0x100
<4> [264.038382]  pci_device_remove+0x6d/0xa0
<4> [264.038388]  device_release_driver_internal+0xef/0x1e0
<4> [264.038395]  unbind_store+0xeb/0x120
<4> [264.038400]  kernfs_fop_write_iter+0x11a/0x1c0

Fix the issue by setting drvdata to NULL at end of azx_remove().

Signed-off-by: Kai Vehmanen <kai.vehmanen@linux.intel.com>
---
 sound/pci/hda/hda_intel.c | 1 +
 1 file changed, 1 insertion(+)

Some non-persistent direct links showing the bug trigger on
different platforms with linux-next 20211109:
 - https://intel-gfx-ci.01.org/tree/linux-next/next-20211109/fi-tgl-1115g4/igt@i915_module_load@reload.html
 - https://intel-gfx-ci.01.org/tree/linux-next/next-20211109/fi-jsl-1/igt@i915_module_load@reload.html

Notably with 20211110 linux-next, the bug does not trigger:
 - https://intel-gfx-ci.01.org/tree/linux-next/next-20211110/fi-tgl-1115g4/igt@i915_module_load@reload.html


base-commit: 6322ec8d0de924cf9672b23c1b5052afafc2f03b

Comments

Takashi Iwai Nov. 10, 2021, 9:55 p.m. UTC | #1
On Wed, 10 Nov 2021 22:03:07 +0100,
Kai Vehmanen wrote:
> 
> Fix a corner case between PCI device driver remove callback and
> runtime PM idle callback.
> 
> Following sequence of events can happen:
>   - at azx_create, context is allocated with devm_kzalloc() and
>     stored as pci_set_drvdata()
>   - user-space requests to unbind audio driver
>   - dd.c:__device_release_driver() calls PCI remove
>   - pci-driver.c:pci_device_remove() calls the audio
>     driver azx_remove() callback and this is completed
>   - pci-driver.c:pm_runtime_put_sync() leads to a call
>     to rpm_idle() which again calls azx_runtime_idle()
>   - the azx context object, as returned by dev_get_drvdata(),
>     is no longer valid
>   -> access fault in azx_runtime_idle when executing
> 	struct snd_card *card = dev_get_drvdata(dev);
> 	chip = card->private_data;
> 	if (chip->disabled || hda->init_failed)
> 
> This was discovered by i915_module_load test with 5.15.0 based
> linux-next tree.
> 
> Example log caught by i915_module_load test with linux-next
> https://intel-gfx-ci.01.org/tree/linux-next/
> 
> <4> [264.038232] general protection fault, probably for non-canonical address 0x6b6b6b6b6b6b73f0: 0000 [#1] PREEMPT SMP NOPTI
> <4> [264.038248] CPU: 0 PID: 5374 Comm: i915_module_loa Not tainted 5.15.0-next-20211109-gc8109c2ba35e-next-20211109 #1
> [...]
> <4> [264.038267] RIP: 0010:azx_runtime_idle+0x12/0x60 [snd_hda_intel]
> [...]
> <4> [264.038355] Call Trace:
> <4> [264.038359]  <TASK>
> <4> [264.038362]  __rpm_callback+0x3d/0x110
> <4> [264.038371]  rpm_idle+0x27f/0x380
> <4> [264.038376]  __pm_runtime_idle+0x3b/0x100
> <4> [264.038382]  pci_device_remove+0x6d/0xa0
> <4> [264.038388]  device_release_driver_internal+0xef/0x1e0
> <4> [264.038395]  unbind_store+0xeb/0x120
> <4> [264.038400]  kernfs_fop_write_iter+0x11a/0x1c0
> 
> Fix the issue by setting drvdata to NULL at end of azx_remove().
> 
> Signed-off-by: Kai Vehmanen <kai.vehmanen@linux.intel.com>
> ---
>  sound/pci/hda/hda_intel.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> Some non-persistent direct links showing the bug trigger on
> different platforms with linux-next 20211109:
>  - https://intel-gfx-ci.01.org/tree/linux-next/next-20211109/fi-tgl-1115g4/igt@i915_module_load@reload.html
>  - https://intel-gfx-ci.01.org/tree/linux-next/next-20211109/fi-jsl-1/igt@i915_module_load@reload.html
> 
> Notably with 20211110 linux-next, the bug does not trigger:
>  - https://intel-gfx-ci.01.org/tree/linux-next/next-20211110/fi-tgl-1115g4/igt@i915_module_load@reload.html

Is this the case with CONFIG_DEBUG_KOBJECT_RELEASE?
This would be the only logical explanation I can think of for now.

In anyway, the code change itself looks good, so I took the fix now.


thanks,

Takashi
Kai Vehmanen Nov. 10, 2021, 10:15 p.m. UTC | #2
Hey,

On Wed, 10 Nov 2021, Takashi Iwai wrote:

> On Wed, 10 Nov 2021 22:03:07 +0100, Kai Vehmanen wrote:
> > Fix a corner case between PCI device driver remove callback and
> > runtime PM idle callback.
[...]
> > Some non-persistent direct links showing the bug trigger on
> > different platforms with linux-next 20211109:
> >  - https://intel-gfx-ci.01.org/tree/linux-next/next-20211109/fi-tgl-1115g4/igt@i915_module_load@reload.html
> >  - https://intel-gfx-ci.01.org/tree/linux-next/next-20211109/fi-jsl-1/igt@i915_module_load@reload.html
> > 
> > Notably with 20211110 linux-next, the bug does not trigger:
> >  - https://intel-gfx-ci.01.org/tree/linux-next/next-20211110/fi-tgl-1115g4/igt@i915_module_load@reload.html
> 
> Is this the case with CONFIG_DEBUG_KOBJECT_RELEASE?
> This would be the only logical explanation I can think of for now.

hmm, that doesn't seem to be used. Here's a link to kconfig used in the 
failing CI run:
https://intel-gfx-ci.01.org/tree/linux-next/next-20211109/kconfig.txt

It's still a bit odd, especially given Scott just reported the other HDA 
related regression in 5.15 today. The two issues don't seem to be related 
though, although both are fixed by clearing drvdata (but in different 
places of hda_intel.c).

I'll try to run some more tests tomorrow. The fix should be good in any 
case, but it would be interesting to understand better what change made 
this more (?) likely to hit than before. This is not a new test and the 
problem happens on fairly old platforms, so something has changed.

Br, Kai
Takashi Iwai Nov. 11, 2021, 1:29 p.m. UTC | #3
On Wed, 10 Nov 2021 23:15:40 +0100,
Kai Vehmanen wrote:
> 
> Hey,
> 
> On Wed, 10 Nov 2021, Takashi Iwai wrote:
> 
> > On Wed, 10 Nov 2021 22:03:07 +0100, Kai Vehmanen wrote:
> > > Fix a corner case between PCI device driver remove callback and
> > > runtime PM idle callback.
> [...]
> > > Some non-persistent direct links showing the bug trigger on
> > > different platforms with linux-next 20211109:
> > >  - https://intel-gfx-ci.01.org/tree/linux-next/next-20211109/fi-tgl-1115g4/igt@i915_module_load@reload.html
> > >  - https://intel-gfx-ci.01.org/tree/linux-next/next-20211109/fi-jsl-1/igt@i915_module_load@reload.html
> > > 
> > > Notably with 20211110 linux-next, the bug does not trigger:
> > >  - https://intel-gfx-ci.01.org/tree/linux-next/next-20211110/fi-tgl-1115g4/igt@i915_module_load@reload.html
> > 
> > Is this the case with CONFIG_DEBUG_KOBJECT_RELEASE?
> > This would be the only logical explanation I can think of for now.
> 
> hmm, that doesn't seem to be used. Here's a link to kconfig used in the 
> failing CI run:
> https://intel-gfx-ci.01.org/tree/linux-next/next-20211109/kconfig.txt

OK, then it's not due to the delayed release, but the cause should be
the same, I suppose.

> It's still a bit odd, especially given Scott just reported the other HDA 
> related regression in 5.15 today. The two issues don't seem to be related 
> though, although both are fixed by clearing drvdata (but in different 
> places of hda_intel.c).

I don't think it's the same issue, rather a coincidence of the
timing.  There have been many changes in 5.15, after all :)

> I'll try to run some more tests tomorrow. The fix should be good in any 
> case, but it would be interesting to understand better what change made 
> this more (?) likely to hit than before. This is not a new test and the 
> problem happens on fairly old platforms, so something has changed.

A potential problem with the current code is that it doesn't disable
the runtime PM at the release procedure.  Could you try the patch
below?  You can put WARN_ON(!chip) at azx_runtime_idle(), too, for
catching the invalid runtime call.


thanks,

Takashi

--- a/sound/pci/hda/hda_intel.c
+++ b/sound/pci/hda/hda_intel.c
@@ -1347,8 +1347,13 @@ static void azx_free(struct azx *chip)
 	if (hda->freed)
 		return;
 
-	if (azx_has_pm_runtime(chip) && chip->running)
+	if (azx_has_pm_runtime(chip) && chip->running) {
 		pm_runtime_get_noresume(&pci->dev);
+		pm_runtime_forbid(&pci->dev);
+		pm_runtime_dont_use_autosuspend(&pci->dev);
+		pm_runtime_disable(&pci->dev);
+	}
+
 	chip->running = 0;
 
 	azx_del_card_list(chip);
@@ -2320,6 +2325,7 @@ static int azx_probe_continue(struct azx *chip)
 	set_default_power_save(chip);
 
 	if (azx_has_pm_runtime(chip)) {
+		pm_runtime_enable(&pci->dev);
 		pm_runtime_use_autosuspend(&pci->dev);
 		pm_runtime_allow(&pci->dev);
 		pm_runtime_put_autosuspend(&pci->dev);
Kai Vehmanen Nov. 11, 2021, 5:39 p.m. UTC | #4
Hi,

On Thu, 11 Nov 2021, Takashi Iwai wrote:

> A potential problem with the current code is that it doesn't disable
> the runtime PM at the release procedure.  Could you try the patch
> below?  You can put WARN_ON(!chip) at azx_runtime_idle(), too, for
> catching the invalid runtime call.
[...]
> --- a/sound/pci/hda/hda_intel.c
> +++ b/sound/pci/hda/hda_intel.c
> @@ -1347,8 +1347,13 @@ static void azx_free(struct azx *chip)
>  	if (hda->freed)
>  		return;
>  
> -	if (azx_has_pm_runtime(chip) && chip->running)
> +	if (azx_has_pm_runtime(chip) && chip->running) {
>  		pm_runtime_get_noresume(&pci->dev);
> +		pm_runtime_forbid(&pci->dev);
> +		pm_runtime_dont_use_autosuspend(&pci->dev);
> +		pm_runtime_disable(&pci->dev);
> +	}
> +
>  	chip->running = 0;

Tested with next-20211019 (first next tag where I've seen test failures) 
and your patch, and this seems to do the trick. I didn't have my drvdata 
patch included when I ran the test. No rpm_idle() calls 
anymore after azx_remove(), so the bug is not hit.

>  	azx_del_card_list(chip);
> @@ -2320,6 +2325,7 @@ static int azx_probe_continue(struct azx *chip)
>  	set_default_power_save(chip);
>  
>  	if (azx_has_pm_runtime(chip)) {
> +		pm_runtime_enable(&pci->dev);
>  		pm_runtime_use_autosuspend(&pci->dev);

This does generate warnings
[   13.495059] snd_hda_intel 0000:00:1f.3: Unbalanced pm_runtime_enable!

And later
[   54.770701] Enabling runtime PM for inactive device (0000:00:1f.3) with active children
[   54.770718] WARNING: CPU: 0 PID: 10 at drivers/base/power/runtime.c:1439 pm_runtime_enable+0x98/0xb0

Adding a "pm_runtime_set_active(&pci->dev)" to both azx_free() and 
azx_probe_continue() seems to help and fix still works.

Br, Kai
Takashi Iwai Nov. 12, 2021, 10:02 a.m. UTC | #5
On Thu, 11 Nov 2021 18:39:36 +0100,
Kai Vehmanen wrote:
> 
> Hi,
> 
> On Thu, 11 Nov 2021, Takashi Iwai wrote:
> 
> > A potential problem with the current code is that it doesn't disable
> > the runtime PM at the release procedure.  Could you try the patch
> > below?  You can put WARN_ON(!chip) at azx_runtime_idle(), too, for
> > catching the invalid runtime call.
> [...]
> > --- a/sound/pci/hda/hda_intel.c
> > +++ b/sound/pci/hda/hda_intel.c
> > @@ -1347,8 +1347,13 @@ static void azx_free(struct azx *chip)
> >  	if (hda->freed)
> >  		return;
> >  
> > -	if (azx_has_pm_runtime(chip) && chip->running)
> > +	if (azx_has_pm_runtime(chip) && chip->running) {
> >  		pm_runtime_get_noresume(&pci->dev);
> > +		pm_runtime_forbid(&pci->dev);
> > +		pm_runtime_dont_use_autosuspend(&pci->dev);
> > +		pm_runtime_disable(&pci->dev);
> > +	}
> > +
> >  	chip->running = 0;
> 
> Tested with next-20211019 (first next tag where I've seen test failures) 
> and your patch, and this seems to do the trick. I didn't have my drvdata 
> patch included when I ran the test. No rpm_idle() calls 
> anymore after azx_remove(), so the bug is not hit.

So far, so good...

> >  	azx_del_card_list(chip);
> > @@ -2320,6 +2325,7 @@ static int azx_probe_continue(struct azx *chip)
> >  	set_default_power_save(chip);
> >  
> >  	if (azx_has_pm_runtime(chip)) {
> > +		pm_runtime_enable(&pci->dev);
> >  		pm_runtime_use_autosuspend(&pci->dev);
> 
> This does generate warnings
> [   13.495059] snd_hda_intel 0000:00:1f.3: Unbalanced pm_runtime_enable!
> 
> And later
> [   54.770701] Enabling runtime PM for inactive device (0000:00:1f.3) with active children
> [   54.770718] WARNING: CPU: 0 PID: 10 at drivers/base/power/runtime.c:1439 pm_runtime_enable+0x98/0xb0
> 
> Adding a "pm_runtime_set_active(&pci->dev)" to both azx_free() and 
> azx_probe_continue() seems to help and fix still works.

Ah yes, I was confused as if it were already called in hdac_device.c,
but this was about the HD-audio bus controller, not the codec.

Below is the revised one.


Takashi

-- 8< --
From: Takashi Iwai <tiwai@suse.de>
Subject: [PATCH] ALSA: hda: intel: More comprehensive PM runtime setup for
 controller driver

Currently we haven't explicitly enable and allow/forbid the runtime PM
at the probe and the remove phases of HD-audio controller driver, and
this was the reason of a GPF mentioned in the commit e81478bbe7a1
("ALSA: hda: fix general protection fault in azx_runtime_idle");
namely, even after the resources are released, the runtime PM might be
still invoked by the bound graphics driver during the remove of the
controller driver.  Although we've fixed it by clearing the drvdata
reference, it'd be also better to cover the runtime PM issue more
properly.

This patch adds a few more pm_runtime_*() calls at the probe and the
remove time for setting and cleaning up the runtime PM.  Particularly,
now more explicitly pm_runtime_enable() and _disable() get called as
well as pm_runtime_forbid() call at the remove callback, so that a
use-after-free should be avoided.

Reported-by: Kai Vehmanen <kai.vehmanen@linux.intel.com>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
---
 sound/pci/hda/hda_intel.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/sound/pci/hda/hda_intel.c b/sound/pci/hda/hda_intel.c
index fe51163f2d82..45e85180048c 100644
--- a/sound/pci/hda/hda_intel.c
+++ b/sound/pci/hda/hda_intel.c
@@ -1347,8 +1347,14 @@ static void azx_free(struct azx *chip)
 	if (hda->freed)
 		return;
 
-	if (azx_has_pm_runtime(chip) && chip->running)
+	if (azx_has_pm_runtime(chip) && chip->running) {
 		pm_runtime_get_noresume(&pci->dev);
+		pm_runtime_disable(&pci->dev);
+		pm_runtime_set_suspended(&pci->dev);
+		pm_runtime_forbid(&pci->dev);
+		pm_runtime_dont_use_autosuspend(&pci->dev);
+	}
+
 	chip->running = 0;
 
 	azx_del_card_list(chip);
@@ -2322,6 +2328,8 @@ static int azx_probe_continue(struct azx *chip)
 	if (azx_has_pm_runtime(chip)) {
 		pm_runtime_use_autosuspend(&pci->dev);
 		pm_runtime_allow(&pci->dev);
+		pm_runtime_set_active(&pci->dev);
+		pm_runtime_enable(&pci->dev);
 		pm_runtime_put_autosuspend(&pci->dev);
 	}
Kai Vehmanen Nov. 12, 2021, 12:27 p.m. UTC | #6
Hi,

On Fri, 12 Nov 2021, Takashi Iwai wrote:

> On Thu, 11 Nov 2021 18:39:36 +0100, Kai Vehmanen wrote:
> > And later
> > [   54.770701] Enabling runtime PM for inactive device (0000:00:1f.3) with active children
> > [   54.770718] WARNING: CPU: 0 PID: 10 at drivers/base/power/runtime.c:1439 pm_runtime_enable+0x98/0xb0
> > 
> > Adding a "pm_runtime_set_active(&pci->dev)" to both azx_free() and 
> > azx_probe_continue() seems to help and fix still works.
> 
> Ah yes, I was confused as if it were already called in hdac_device.c,
> but this was about the HD-audio bus controller, not the codec.
> 
> Below is the revised one.
[...]
> Currently we haven't explicitly enable and allow/forbid the runtime PM
> at the probe and the remove phases of HD-audio controller driver, and
> this was the reason of a GPF mentioned in the commit e81478bbe7a1
> ("ALSA: hda: fix general protection fault in azx_runtime_idle");
> namely, even after the resources are released, the runtime PM might be
> still invoked by the bound graphics driver during the remove of the
> controller driver.  Although we've fixed it by clearing the drvdata
> reference, it'd be also better to cover the runtime PM issue more
> properly.
> 
> This patch adds a few more pm_runtime_*() calls at the probe and the
> remove time for setting and cleaning up the runtime PM.  Particularly,
> now more explicitly pm_runtime_enable() and _disable() get called as
> well as pm_runtime_forbid() call at the remove callback, so that a
> use-after-free should be avoided.
> 
> Reported-by: Kai Vehmanen <kai.vehmanen@linux.intel.com>
> Signed-off-by: Takashi Iwai <tiwai@suse.de>

ack, tested this and no warnings anymore.

Reviewed-by: Kai Vehmanen <kai.vehmanen@linux.intel.com>
Tested-by: Kai Vehmanen <kai.vehmanen@linux.intel.com>

Br, Kai
Takashi Iwai Nov. 15, 2021, 7:57 a.m. UTC | #7
On Fri, 12 Nov 2021 13:27:34 +0100,
Kai Vehmanen wrote:
> 
> Hi,
> 
> On Fri, 12 Nov 2021, Takashi Iwai wrote:
> 
> > On Thu, 11 Nov 2021 18:39:36 +0100, Kai Vehmanen wrote:
> > > And later
> > > [   54.770701] Enabling runtime PM for inactive device (0000:00:1f.3) with active children
> > > [   54.770718] WARNING: CPU: 0 PID: 10 at drivers/base/power/runtime.c:1439 pm_runtime_enable+0x98/0xb0
> > > 
> > > Adding a "pm_runtime_set_active(&pci->dev)" to both azx_free() and 
> > > azx_probe_continue() seems to help and fix still works.
> > 
> > Ah yes, I was confused as if it were already called in hdac_device.c,
> > but this was about the HD-audio bus controller, not the codec.
> > 
> > Below is the revised one.
> [...]
> > Currently we haven't explicitly enable and allow/forbid the runtime PM
> > at the probe and the remove phases of HD-audio controller driver, and
> > this was the reason of a GPF mentioned in the commit e81478bbe7a1
> > ("ALSA: hda: fix general protection fault in azx_runtime_idle");
> > namely, even after the resources are released, the runtime PM might be
> > still invoked by the bound graphics driver during the remove of the
> > controller driver.  Although we've fixed it by clearing the drvdata
> > reference, it'd be also better to cover the runtime PM issue more
> > properly.
> > 
> > This patch adds a few more pm_runtime_*() calls at the probe and the
> > remove time for setting and cleaning up the runtime PM.  Particularly,
> > now more explicitly pm_runtime_enable() and _disable() get called as
> > well as pm_runtime_forbid() call at the remove callback, so that a
> > use-after-free should be avoided.
> > 
> > Reported-by: Kai Vehmanen <kai.vehmanen@linux.intel.com>
> > Signed-off-by: Takashi Iwai <tiwai@suse.de>
> 
> ack, tested this and no warnings anymore.
> 
> Reviewed-by: Kai Vehmanen <kai.vehmanen@linux.intel.com>
> Tested-by: Kai Vehmanen <kai.vehmanen@linux.intel.com>

OK, I'll submit and merge the patch.

As the original problem was fixed by NULL checks, I'd push this for
5.16, though.


thanks,

Takashi
diff mbox series

Patch

diff --git a/sound/pci/hda/hda_intel.c b/sound/pci/hda/hda_intel.c
index 7762718cf429..b90c817e3f6f 100644
--- a/sound/pci/hda/hda_intel.c
+++ b/sound/pci/hda/hda_intel.c
@@ -2364,6 +2364,7 @@  static void azx_remove(struct pci_dev *pci)
 		cancel_delayed_work_sync(&hda->probe_work);
 		device_lock(&pci->dev);
 
+		pci_set_drvdata(pci, NULL);
 		snd_card_free(card);
 	}
 }