diff mbox series

ALSA: firewire-lib: restore process context workqueue to prevent deadlock

Message ID 20240718115637.12816-1-edmund.raile@proton.me (mailing list archive)
State New, archived
Headers show
Series ALSA: firewire-lib: restore process context workqueue to prevent deadlock | expand

Commit Message

Edmund Raile July 18, 2024, 11:56 a.m. UTC
Commit b5b519965c4c ("ALSA: firewire-lib: operate for period elapse event
in process context") removed the process context workqueue from
amdtp_domain_stream_pcm_pointer() and update_pcm_pointers() to remove
its overhead.

With RME Fireface 800, this lead to a regression since Kernels 5.14.0,
causing a deadlock with eventual system freeze under ALSA operation:

? tasklet_unlock_spin_wait
 </NMI>
 <TASK>
ohci_flush_iso_completions firewire_ohci
amdtp_domain_stream_pcm_pointer snd_firewire_lib
snd_pcm_update_hw_ptr0 snd_pcm
snd_pcm_status64 snd_pcm

? native_queued_spin_lock_slowpath
 </NMI>
 <IRQ>
_raw_spin_lock_irqsave
snd_pcm_period_elapsed snd_pcm
process_rx_packets snd_firewire_lib
irq_target_callback snd_firewire_lib
handle_it_packet firewire_ohci
context_tasklet firewire_ohci

Restore the work queue to prevent deadlock between ALSA substream
process context spin_lock of snd_pcm_stream_lock_irq() in snd_pcm_status64()
and OHCI 1394 IT softIRQ context spin_lock of snd_pcm_stream_lock_irqsave()
in snd_pcm_period_elapsed().

to reproduce the issue:
direct ALSA playback to the device:
  mpv --audio-device=alsa/sysdefault:CARD=Fireface800 Spor-Ignition.flac
Time to occurrence: 2s to 30m
Likelihood increased by:
  - high CPU load
    stress --cpu $(nproc)
  - switching between applications via workspaces
    tested with i915 in Xfce
PulsaAudio / PipeWire conceal the issue as they run PCM substream
without period wakeup mode, issuing less hardIRQs.

Closes: https://lore.kernel.org/regressions/kwryofzdmjvzkuw6j3clftsxmoolynljztxqwg76hzeo4simnl@jn3eo7pe642q/T/#u
Fixes: 7ba5ca32fe6e ("ALSA: firewire-lib: operate for period elapse event in process context")
Signed-off-by: Edmund Raile <edmund.raile@proton.me>
---
This is the follow-up patch to the 5.14.0 regression I reported:
https://lore.kernel.org/regressions/kwryofzdmjvzkuw6j3clftsxmoolynljztxqwg76hzeo4simnl@jn3eo7pe642q/T/#u
("[REGRESSION] ALSA: firewire-lib: snd_pcm_period_elapsed deadlock with Fireface 800")

Takashi Sakamoto explained the issue in his response to the regression:
A. In the process context
    * (lock A) Acquiring spin_lock by snd_pcm_stream_lock_irq() in
               snd_pcm_status64()
    * (lock B) Then attempt to enter tasklet

B. In the softIRQ context
    * (lock B) Enter tasklet
    * (lock A) Attempt to acquire spin_lock by snd_pcm_stream_lock_irqsave() in
               snd_pcm_period_elapsed()

This leads me to believe this isn't just an issue limited to the RME Fireface

driver (snd_fireface), though I can not test the other devices.
 sound/firewire/amdtp-stream.c | 32 +++++++++++++++++++++-----------
 sound/firewire/amdtp-stream.h |  1 +
 2 files changed, 22 insertions(+), 11 deletions(-)

Comments

Takashi Sakamoto July 24, 2024, 12:44 a.m. UTC | #1
Hi,

Thank you for your posting the patch, however I have some nitpickings
(and some requests to you).

On Thu, Jul 18, 2024 at 11:56:54AM +0000, Edmund Raile wrote:
> Commit b5b519965c4c ("ALSA: firewire-lib: operate for period elapse event
> in process context") removed the process context workqueue from
> amdtp_domain_stream_pcm_pointer() and update_pcm_pointers() to remove
> its overhead.
> 
> With RME Fireface 800, this lead to a regression since Kernels 5.14.0,
> causing a deadlock with eventual system freeze under ALSA operation:
> 
> ? tasklet_unlock_spin_wait
>  </NMI>
>  <TASK>
> ohci_flush_iso_completions firewire_ohci
> amdtp_domain_stream_pcm_pointer snd_firewire_lib
> snd_pcm_update_hw_ptr0 snd_pcm
> snd_pcm_status64 snd_pcm
> 
> ? native_queued_spin_lock_slowpath
>  </NMI>
>  <IRQ>
> _raw_spin_lock_irqsave
> snd_pcm_period_elapsed snd_pcm
> process_rx_packets snd_firewire_lib
> irq_target_callback snd_firewire_lib
> handle_it_packet firewire_ohci
> context_tasklet firewire_ohci
> 
> Restore the work queue to prevent deadlock between ALSA substream
> process context spin_lock of snd_pcm_stream_lock_irq() in snd_pcm_status64()
> and OHCI 1394 IT softIRQ context spin_lock of snd_pcm_stream_lock_irqsave()
> in snd_pcm_period_elapsed().
> 
> to reproduce the issue:
> direct ALSA playback to the device:
>   mpv --audio-device=alsa/sysdefault:CARD=Fireface800 Spor-Ignition.flac
> Time to occurrence: 2s to 30m
> Likelihood increased by:
>   - high CPU load
>     stress --cpu $(nproc)
>   - switching between applications via workspaces
>     tested with i915 in Xfce
> PulsaAudio / PipeWire conceal the issue as they run PCM substream
> without period wakeup mode, issuing less hardIRQs.
> 
> Closes: https://lore.kernel.org/regressions/kwryofzdmjvzkuw6j3clftsxmoolynljztxqwg76hzeo4simnl@jn3eo7pe642q/T/#u
> Fixes: 7ba5ca32fe6e ("ALSA: firewire-lib: operate for period elapse event in process context")
> Signed-off-by: Edmund Raile <edmund.raile@proton.me>
> ---
> This is the follow-up patch to the 5.14.0 regression I reported:
> https://lore.kernel.org/regressions/kwryofzdmjvzkuw6j3clftsxmoolynljztxqwg76hzeo4simnl@jn3eo7pe642q/T/#u
> ("[REGRESSION] ALSA: firewire-lib: snd_pcm_period_elapsed deadlock with Fireface 800")
> 
> Takashi Sakamoto explained the issue in his response to the regression:
> A. In the process context
>     * (lock A) Acquiring spin_lock by snd_pcm_stream_lock_irq() in
>                snd_pcm_status64()
>     * (lock B) Then attempt to enter tasklet
> 
> B. In the softIRQ context
>     * (lock B) Enter tasklet
>     * (lock A) Attempt to acquire spin_lock by snd_pcm_stream_lock_irqsave() in
>                snd_pcm_period_elapsed()
> 
> This leads me to believe this isn't just an issue limited to the RME Fireface

We are in the merge window to Linux kernel 6.11[1]. I prefer reviewing a
small and trivial patches in the weeks, thus I would like to postpone
applying this kind of patches after releasing -rc1 even if they look
good.

In Linux kernel development, we have the process to apply patches into
released kernels to fix bugs and regressions. As of today, the developers
pay some of their efforts to maintain the following kernels[2]:

* 6.10
* 6.9.10 
* 6.6.41
* 6.1.100
* 5.15.163
* 5.10.222
* 5.4.280
* 4.19.318

It is cleared that the issued changes were applied to v5.14 kernel, thus we
should lead the developers to find the posted patches and apply them to
future releases of each version. The process[3] have come into existence
enough before the regression report procedure[4]. I would like you to read
some documentation about the process and add more care for stable kernel
maintainers.

Well, the issued commits are (older at first):

* 7ba5ca32fe6
* b5b519965c4

As long as I can see, these commits can be reverted per each, with a
slight handy-changes. In the case, it is preferable to make a patchset
including these two revert patches. I would like to request it to you,
instead of the all-in-one patch, so that developers easily find the issued
commits (and work to apply these patches into kernels maintained
publicly/locally).

At last, I prefer that the whole patch comment is written by the
posters, instead of referring to comments by the others. I know that the
description about AB/BA deadlock is a bit hard to write, but it is enough
and satisfied for you to write what you understand about the issue. I'd
accept it.

[1] https://lore.kernel.org/lkml/CAHk-=wjV_O2g_K19McjGKrxFxMFDqex+fyGcKc3uac1ft_O2gg@mail.gmail.com/
[2] https://kernel.org/
[3] https://docs.kernel.org/process/stable-kernel-rules.html
[4] https://docs.kernel.org/process/handling-regressions.html


Thanks

Takashi Sakamoto
diff mbox series

Patch

diff --git a/sound/firewire/amdtp-stream.c b/sound/firewire/amdtp-stream.c
index d35d0a420ee0..77b99a2117f4 100644
--- a/sound/firewire/amdtp-stream.c
+++ b/sound/firewire/amdtp-stream.c
@@ -77,6 +77,8 @@ 
 // overrun. Actual device can skip more, then this module stops the packet streaming.
 #define IR_JUMBO_PAYLOAD_MAX_SKIP_CYCLES	5
 
+static void pcm_period_work(struct work_struct *work);
+
 /**
  * amdtp_stream_init - initialize an AMDTP stream structure
  * @s: the AMDTP stream to initialize
@@ -105,6 +107,7 @@  int amdtp_stream_init(struct amdtp_stream *s, struct fw_unit *unit,
 	s->flags = flags;
 	s->context = ERR_PTR(-1);
 	mutex_init(&s->mutex);
+	INIT_WORK(&s->period_work, pcm_period_work);
 	s->packet_index = 0;
 
 	init_waitqueue_head(&s->ready_wait);
@@ -347,6 +350,7 @@  EXPORT_SYMBOL(amdtp_stream_get_max_payload);
  */
 void amdtp_stream_pcm_prepare(struct amdtp_stream *s)
 {
+	cancel_work_sync(&s->period_work);
 	s->pcm_buffer_pointer = 0;
 	s->pcm_period_pointer = 0;
 }
@@ -611,19 +615,21 @@  static void update_pcm_pointers(struct amdtp_stream *s,
 		// The program in user process should periodically check the status of intermediate
 		// buffer associated to PCM substream to process PCM frames in the buffer, instead
 		// of receiving notification of period elapsed by poll wait.
-		if (!pcm->runtime->no_period_wakeup) {
-			if (in_softirq()) {
-				// In software IRQ context for 1394 OHCI.
-				snd_pcm_period_elapsed(pcm);
-			} else {
-				// In process context of ALSA PCM application under acquired lock of
-				// PCM substream.
-				snd_pcm_period_elapsed_under_stream_lock(pcm);
-			}
-		}
+		if (!pcm->runtime->no_period_wakeup)
+			queue_work(system_highpri_wq, &s->period_work);
 	}
 }
 
+static void pcm_period_work(struct work_struct *work)
+{
+	struct amdtp_stream *s = container_of(work, struct amdtp_stream,
+					      period_work);
+	struct snd_pcm_substream *pcm = READ_ONCE(s->pcm);
+	
+	if (pcm)
+		snd_pcm_period_elapsed(pcm);
+}
+
 static int queue_packet(struct amdtp_stream *s, struct fw_iso_packet *params,
 			bool sched_irq)
 {
@@ -1854,7 +1860,10 @@  unsigned long amdtp_domain_stream_pcm_pointer(struct amdtp_domain *d,
 	if (irq_target && amdtp_stream_running(irq_target)) {
 		// In software IRQ context, the call causes dead-lock to disable the tasklet
 		// synchronously.
-		if (!in_softirq())
+		// workqueue serves to prevent deadlock between process context spinlock
+		// of snd_pcm_stream_lock_irq() in snd_pcm_status64() and softIRQ spinlock
+		// of snd_pcm_stream_lock_irqsave() in snd_pcm_period_elapsed()
+		if ((!in_softirq()) && (current_work() != &s->period_work))
 			fw_iso_context_flush_completions(irq_target->context);
 	}
 
@@ -1910,6 +1919,7 @@  static void amdtp_stream_stop(struct amdtp_stream *s)
 		return;
 	}
 
+	cancel_work_sync(&s->period_work);
 	fw_iso_context_stop(s->context);
 	fw_iso_context_destroy(s->context);
 	s->context = ERR_PTR(-1);
diff --git a/sound/firewire/amdtp-stream.h b/sound/firewire/amdtp-stream.h
index a1ed2e80f91a..775db3fc4959 100644
--- a/sound/firewire/amdtp-stream.h
+++ b/sound/firewire/amdtp-stream.h
@@ -191,6 +191,7 @@  struct amdtp_stream {
 
 	/* For a PCM substream processing. */
 	struct snd_pcm_substream *pcm;
+	struct work_struct period_work;
 	snd_pcm_uframes_t pcm_buffer_pointer;
 	unsigned int pcm_period_pointer;
 	unsigned int pcm_frame_multiplier;