
[v8,3/3] ASoC: SOF: Fix deadlock when shutdown a frozen userspace

Message ID 20221127-snd-freeze-v8-3-3bc02d09f2ce@chromium.org (mailing list archive)
State New, archived
Series ASoC: SOF: Fix deadlock when shutdown a frozen userspace

Commit Message

Ricardo Ribalda Dec. 1, 2022, 11:08 a.m. UTC
If we are shutting down due to kexec and the userspace is frozen, the
system will stall forever waiting for userspace to complete.

Do not wait for the clients to complete in that case.

This fixes:

[   84.943749] Freezing user space processes ... (elapsed 0.111 seconds) done.
[  246.784446] INFO: task kexec-lite:5123 blocked for more than 122 seconds.
[  246.819035] Call Trace:
[  246.821782]  <TASK>
[  246.824186]  __schedule+0x5f9/0x1263
[  246.828231]  schedule+0x87/0xc5
[  246.831779]  snd_card_disconnect_sync+0xb5/0x127
...
[  246.889249]  snd_sof_device_shutdown+0xb4/0x150
[  246.899317]  pci_device_shutdown+0x37/0x61
[  246.903990]  device_shutdown+0x14c/0x1d6
[  246.908391]  kernel_kexec+0x45/0xb9

And:

[  246.893222] INFO: task kexec-lite:4891 blocked for more than 122 seconds.
[  246.927709] Call Trace:
[  246.930461]  <TASK>
[  246.932819]  __schedule+0x5f9/0x1263
[  246.936855]  ? fsnotify_grab_connector+0x5c/0x70
[  246.942045]  schedule+0x87/0xc5
[  246.945567]  schedule_timeout+0x49/0xf3
[  246.949877]  wait_for_completion+0x86/0xe8
[  246.954463]  snd_card_free+0x68/0x89
...
[  247.001080]  platform_device_unregister+0x12/0x35

Cc: stable@vger.kernel.org
Fixes: 83bfc7e793b5 ("ASoC: SOF: core: unregister clients and machine drivers in .shutdown")
Signed-off-by: Ricardo Ribalda <ribalda@chromium.org>
---
 sound/soc/sof/core.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

Comments

Oliver Neukum Dec. 1, 2022, 12:28 p.m. UTC | #1
On 01.12.22 12:08, Ricardo Ribalda wrote:
> If we are shutting down due to kexec and the userspace is frozen, the
> system will stall forever waiting for userspace to complete.
> 
> Do not wait for the clients to complete in that case.

Hi,

I am afraid I have to state that this approach is bad in every case,
not just this corner case. It basically means that user space can stall
the kernel for an arbitrary amount of time. And we cannot have that.

	Regards
		Oliver
Ricardo Ribalda Dec. 1, 2022, 1:03 p.m. UTC | #2
Hi Oliver

Thanks for your review

On Thu, 1 Dec 2022 at 13:29, Oliver Neukum <oneukum@suse.com> wrote:
>
> On 01.12.22 12:08, Ricardo Ribalda wrote:
> > If we are shutting down due to kexec and the userspace is frozen, the
> > system will stall forever waiting for userspace to complete.
> >
> > Do not wait for the clients to complete in that case.
>
> Hi,
>
> I am afraid I have to state that this approach is bad in every case,
> not just this corner case. It basically means that user space can stall
> the kernel for an arbitrary amount of time. And we cannot have that.
>
>         Regards
>                 Oliver

This patchset does not modify this behaviour. It simply fixes the
stall for kexec().

The  patch that introduced the stall:
83bfc7e793b5 ("ASoC: SOF: core: unregister clients and machine drivers
in .shutdown")

was sent as a generalised version of:
https://github.com/thesofproject/linux/pull/3388

AFAIK, we would need a similar patch for every single board... which
I am not sure is doable in a reasonable timeframe.

In the meantime this seems like a decent compromise. Yes, a
misbehaving userspace can still stall during suspend, but that was
not introduced in this patch.

Regards!
Oliver Neukum Dec. 1, 2022, 1:22 p.m. UTC | #3
On 01.12.22 14:03, Ricardo Ribalda wrote:

Hi,
  
> This patchset does not modify this behaviour. It simply fixes the
> stall for kexec().
> 
> The  patch that introduced the stall:
> 83bfc7e793b5 ("ASoC: SOF: core: unregister clients and machine drivers
> in .shutdown")

That patch is problematic. I would go as far as saying that
it needs to be reverted.

> was sent as a generalised version of:
> https://github.com/thesofproject/linux/pull/3388
> 
> AFAIK, we would need a similar patch for every single board... which
> I am not sure is doable in a reasonable timeframe.
> 
> In the meantime this seems like a decent compromise. Yes, a
> misbehaving userspace can still stall during suspend, but that was
> not introduced in this patch.

Well, I mean if you know what is wrong then I'd say at least return to
a sanely broken state.

The whole approach is wrong. You need to be able to deal with user
space talking to removed devices by returning an error and keeping
the resources associated with the open file allocated until
user space calls close().

	Regards
		Oliver
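
The pattern described in the message above (fail operations on a removed device, but keep per-file resources alive until close()) can be sketched in plain userspace C. This is only an illustration under stated assumptions: every name here (demo_dev, demo_file, demo_probe, demo_open, demo_write, demo_close, demo_remove, demo_freed) is hypothetical and is not ALSA or SOF API. The point it tries to show is that removal never waits for user space; it marks the device as gone and drops the driver's reference.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical device object; none of these names exist in the kernel. */
struct demo_dev {
	int refcount;       /* one reference for the driver, one per open file */
	bool disconnected;  /* set once the device has been removed */
};

/* Hypothetical per-open-file state that pins the device. */
struct demo_file {
	struct demo_dev *dev;
};

/* Counts freed devices so the lifetime is observable in this sketch. */
int demo_freed;

static void demo_put(struct demo_dev *dev)
{
	assert(dev->refcount > 0);
	if (--dev->refcount == 0) {
		free(dev);
		demo_freed++;
	}
}

/* Driver binds to the device and holds the initial reference. */
struct demo_dev *demo_probe(void)
{
	struct demo_dev *dev = calloc(1, sizeof(*dev));

	if (dev)
		dev->refcount = 1;
	return dev;
}

/* Each open file takes its own reference; opening a removed device fails. */
struct demo_file *demo_open(struct demo_dev *dev)
{
	struct demo_file *f;

	if (dev->disconnected)
		return NULL;
	f = calloc(1, sizeof(*f));
	if (!f)
		return NULL;
	f->dev = dev;
	dev->refcount++;
	return f;
}

/* Operations on a removed device return an error instead of blocking. */
int demo_write(struct demo_file *f)
{
	if (f->dev->disconnected)
		return -ENODEV;
	return 0;
}

/* The last close releases the memory, however late user space gets to it. */
void demo_close(struct demo_file *f)
{
	demo_put(f->dev);
	free(f);
}

/* Removal marks the device as gone and returns without waiting. */
void demo_remove(struct demo_dev *dev)
{
	dev->disconnected = true;
	demo_put(dev);
}
```

In this sketch, a file opened before demo_remove() keeps the device memory valid, demo_write() on it returns -ENODEV afterwards, and the memory is only freed by the final demo_close(). A frozen process that never calls close() therefore leaks nothing the shutdown path has to wait for.
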
Ricardo Ribalda Dec. 1, 2022, 1:34 p.m. UTC | #4
Hi Oliver

On Thu, 1 Dec 2022 at 14:22, 'Oliver Neukum' via Chromeos Kdump
<chromeos-kdump@google.com> wrote:
>
> On 01.12.22 14:03, Ricardo Ribalda wrote:
>
> Hi,
>
> > This patchset does not modify this behaviour. It simply fixes the
> > stall for kexec().
> >
> > The  patch that introduced the stall:
> > 83bfc7e793b5 ("ASoC: SOF: core: unregister clients and machine drivers
> > in .shutdown")
>
> That patch is problematic. I would go as far as saying that
> it needs to be reverted.
>

It fixes a real issue. We have not had any complaints until we tried
to kexec on the platform.
I won't recommend reverting it until we have an alternative implementation.

kexec is far less common than suspend/reboot.

> > was sent as a generalised version of:
> > https://github.com/thesofproject/linux/pull/3388
> >
> > AFAIK, we would need a similar patch for every single board... which
> > I am not sure is doable in a reasonable timeframe.
> >
> > In the meantime this seems like a decent compromise. Yes, a
> > misbehaving userspace can still stall during suspend, but that was
> > not introduced in this patch.
>
> Well, I mean if you know what is wrong then I'd say at least return to
> a sanely broken state.
>
> The whole approach is wrong. You need to be able to deal with user
> space talking to removed devices by returning an error and keeping
> the resources associated with the open file allocated until
> user space calls close().

In general, the whole shutdown is broken for all the subsystems ;).
It is a complicated issue. Users handling fds, devices with DMAs in
the middle of an operation, dma fences....

Unfortunately I am not familiar enough with the sound subsystem to make
a proper patch for this.

>
>         Regards
>                 Oliver
Takashi Iwai Dec. 1, 2022, 1:38 p.m. UTC | #5
On Thu, 01 Dec 2022 14:22:12 +0100,
Oliver Neukum wrote:
> 
> On 01.12.22 14:03, Ricardo Ribalda wrote:
> 
> Hi,
>  
> > This patchset does not modify this behaviour. It simply fixes the
> > stall for kexec().
> > 
> > The  patch that introduced the stall:
> > 83bfc7e793b5 ("ASoC: SOF: core: unregister clients and machine drivers
> > in .shutdown")
> 
> That patch is problematic. I would go as far as saying that
> it needs to be reverted.

... or fixed.

> > was sent as a generalised version of:
> > https://github.com/thesofproject/linux/pull/3388
> > 
> > AFAIK, we would need a similar patch for every single board... which
> > I am not sure is doable in a reasonable timeframe.
> > 
> > In the meantime this seems like a decent compromise. Yes, a
> > misbehaving userspace can still stall during suspend, but that was
> > not introduced in this patch.
> 
> Well, I mean if you know what is wrong then I'd say at least return to
> a sanely broken state.
> 
> The whole approach is wrong. You need to be able to deal with user
> space talking to removed devices by returning an error and keeping
> the resources associated with the open file allocated until
> user space calls close().

As I already mentioned in another thread, if the user-space action has
to be cut off, we just need to call snd_card_disconnect() instead
without sync.  A quick hack would be like below (totally untested and
might be wrong, though).

In any case, Ricardo, please stop respinning so frequently; v8 in a few
days is way too much, and the recipient list has now become unmanageable.
Let's give people some time to review and consider a better solution
first.


thanks,

Takashi

-- 8< --
--- a/sound/soc/sof/core.c
+++ b/sound/soc/sof/core.c
@@ -475,7 +475,7 @@ EXPORT_SYMBOL(snd_sof_device_remove);
 int snd_sof_device_shutdown(struct device *dev)
 {
 	struct snd_sof_dev *sdev = dev_get_drvdata(dev);
-	struct snd_sof_pdata *pdata = sdev->pdata;
+	struct snd_soc_component *component;
 
 	if (IS_ENABLED(CONFIG_SND_SOC_SOF_PROBE_WORK_QUEUE))
 		cancel_work_sync(&sdev->probe_work);
@@ -484,9 +484,9 @@ int snd_sof_device_shutdown(struct device *dev)
 	 * make sure clients and machine driver(s) are unregistered to force
 	 * all userspace devices to be closed prior to the DSP shutdown sequence
 	 */
-	sof_unregister_clients(sdev);
-
-	snd_sof_machine_unregister(sdev, pdata);
+	component = snd_soc_lookup_component(sdev->dev, NULL);
+	if (component && component->card && component->card->snd_card)
+		snd_card_disconnect(component->card->snd_card);
 
 	if (sdev->fw_state == SOF_FW_BOOT_COMPLETE)
 		return snd_sof_shutdown(sdev);
Kai Vehmanen Dec. 9, 2022, 11:53 a.m. UTC | #6
Hi,

On Thu, 1 Dec 2022, Ricardo Ribalda wrote:

> On Thu, 1 Dec 2022 at 14:22, 'Oliver Neukum' via Chromeos Kdump <chromeos-kdump@google.com> wrote:
> >
> > On 01.12.22 14:03, Ricardo Ribalda wrote:
> > > This patchset does not modify this behaviour. It simply fixes the
> > > stall for kexec().
> > >
> > > The  patch that introduced the stall:
> > > 83bfc7e793b5 ("ASoC: SOF: core: unregister clients and machine drivers
> > > in .shutdown")
> >
> > That patch is problematic. I would go as far as saying that
> > it needs to be reverted.
> 
> It fixes a real issue. We have not had any complaints until we tried
> to kexec on the platform.
> I won't recommend reverting it until we have an alternative implementation.
> 
> kexec is far less common than suspend/reboot.

I've posted an alternative to the ALSA list that reverts the problematic
patch and fixes the problem that patch was originally addressing
in a different way:

https://mailman.alsa-project.org/pipermail/alsa-devel/2022-December/209776.html

No changes outside sound/soc/ are needed with this approach.

Br, Kai

Patch

diff --git a/sound/soc/sof/core.c b/sound/soc/sof/core.c
index 3e6141d03770..9587b6a85103 100644
--- a/sound/soc/sof/core.c
+++ b/sound/soc/sof/core.c
@@ -9,6 +9,8 @@ 
 //
 
 #include <linux/firmware.h>
+#include <linux/kexec.h>
+#include <linux/freezer.h>
 #include <linux/module.h>
 #include <sound/soc.h>
 #include <sound/sof.h>
@@ -484,9 +486,10 @@  int snd_sof_device_shutdown(struct device *dev)
 	 * make sure clients and machine driver(s) are unregistered to force
 	 * all userspace devices to be closed prior to the DSP shutdown sequence
 	 */
-	sof_unregister_clients(sdev);
-
-	snd_sof_machine_unregister(sdev, pdata);
+	if (!(kexec_in_progress() && pm_freezing())) {
+		sof_unregister_clients(sdev);
+		snd_sof_machine_unregister(sdev, pdata);
+	}
 
 	if (sdev->fw_state == SOF_FW_BOOT_COMPLETE)
 		return snd_sof_shutdown(sdev);
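
The guard added by this patch can be read as a small truth table: the unregister step is skipped only when both conditions hold at once. The following userspace sketch models just that decision. The struct shutdown_state type and the should_unregister_clients() helper are hypothetical stand-ins for illustration; in the patch the two inputs come from kexec_in_progress() and pm_freezing().

```c
#include <stdbool.h>

/* Hypothetical snapshot of the two conditions the patch checks. */
struct shutdown_state {
	bool kexec_in_progress;  /* models kexec_in_progress() */
	bool userspace_frozen;   /* models pm_freezing() */
};

/*
 * Mirrors the condition in the patch: unregistering clients waits for
 * user space to close its devices, so it is skipped only when a kexec
 * is under way and user space is frozen and cannot respond.
 */
bool should_unregister_clients(const struct shutdown_state *s)
{
	return !(s->kexec_in_progress && s->userspace_frozen);
}
```

Under this model, a normal reboot, a kexec with a running userspace, and a freeze without kexec all still take the unregister path; only the combination of kexec and frozen userspace skips it.
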