diff mbox series

[PROBLEM] ALSA: hda: PCM streams are suspended while opening

Message ID 1553097854-10287-1-git-send-email-jonathanh@nvidia.com (mailing list archive)
State New, archived
Headers show
Series [PROBLEM] ALSA: hda: PCM streams are suspended while opening | expand

Commit Message

Jon Hunter March 20, 2019, 4:04 p.m. UTC
From: Jonathan Hunter <jonathanh@nvidia.com>

This issue is not observed on the latest mainline kernel since the
'ALSA: PCM suspend cleanup' series has been applied and the
snd_pcm_suspend_all() function call in the HDA codec driver has been
removed. However, this issue impacts stable kernel branches and so I
am trying to find a solution for these branches.

When stress testing audio playback across multiple HDA streams
simultaneously, the following failure is sometimes observed when
starting playback ...

 Unable to set hw params for playback: File descriptor in bad state

The problem is caused because ...
1. Playback is starting for one HDA codec and so the chip->open_mutex
   in azx_pcm_open() has been acquired.
2. Playback for another HDA codec is starting but is blocked waiting to
   acquire the chip->open_mutex in azx_pcm_open().
3. For the HDA codec that is blocked, its runtime-pm status is still
   enabled from a previous playback session that has since completed and
   been closed, however its autosuspend timeout has not expired yet.
4. For the HDA codec that is blocked, the runtime-pm autosuspend timeout
   now occurs calling the runtime-pm suspend callback and this calls
   snd_pcm_suspend_all() placing all PCM streams in the suspended state.
5. The block HDA codec is now unblocked and fails to set the HW params
   because the PCM stream is in the suspended state.

The above has been observed on Tegra platforms where the autosuspend
delay is set to 1 second and there is a delay to an AZX command. This
highlights that there is a window of time where autosuspend can place
the HDA PCM stream in the suspended state while opening the stream
and cause playback to fail.

Looking at commit 3d21ef0b49f8 ('ALSA: pcm: Suspend streams globally via
device type PM ops') it appears that the call to snd_pcm_suspend_all()
has now been moved to the to the suspend handler for PCM streams and so
I am wondering if it would be equivalent to make the following change to
the HDA codec driver to achieve the same behaviour but only for the HDA
driver. So far the issue has not been observed with this change.

Please note that I am not sending this as a formal patch as I wanted to
get some feedback/comments on the above.
---
 sound/pci/hda/hda_codec.c | 16 ++++++++++++----
 1 file changed, 12 insertions(+), 4 deletions(-)

Comments

Takashi Iwai March 20, 2019, 5:19 p.m. UTC | #1
On Wed, 20 Mar 2019 17:04:14 +0100,
Jon Hunter wrote:
> 
> From: Jonathan Hunter <jonathanh@nvidia.com>
> 
> This issue is not observed on the latest mainline kernel since the
> 'ALSA: PCM suspend cleanup' series has been applied and the
> snd_pcm_suspend_all() function call in the HDA codec driver has been
> removed. However, this issue impacts stable kernel branches and so I
> am trying to find a solution for these branches.
> 
> When stress testing audio playback across multiple HDA streams
> simultaneously, the following failure is sometimes observed when
> starting playback ...
> 
>  Unable to set hw params for playback: File descriptor in bad state
> 
> The problem is caused because ...
> 1. Playback is starting for one HDA codec and so the chip->open_mutex
>    in azx_pcm_open() has been acquired.
> 2. Playback for another HDA codec is starting but is blocked waiting to
>    acquire the chip->open_mutex in azx_pcm_open().
> 3. For the HDA codec that is blocked, its runtime-pm status is still
>    enabled from a previous playback session that has since completed and
>    been closed, however its autosuspend timeout has not expired yet.
> 4. For the HDA codec that is blocked, the runtime-pm autosuspend timeout
>    now occurs calling the runtime-pm suspend callback and this calls
>    snd_pcm_suspend_all() placing all PCM streams in the suspended state.
> 5. The block HDA codec is now unblocked and fails to set the HW params
>    because the PCM stream is in the suspended state.
> 
> The above has been observed on Tegra platforms where the autosuspend
> delay is set to 1 second and there is a delay to an AZX command. This
> highlights that there is a window of time where autosuspend can place
> the HDA PCM stream in the suspended state while opening the stream
> and cause playback to fail.
> 
> Looking at commit 3d21ef0b49f8 ('ALSA: pcm: Suspend streams globally via
> device type PM ops') it appears that the call to snd_pcm_suspend_all()
> has now been moved to the to the suspend handler for PCM streams and so
> I am wondering if it would be equivalent to make the following change to
> the HDA codec driver to achieve the same behaviour but only for the HDA
> driver. So far the issue has not been observed with this change.
> 
> Please note that I am not sending this as a formal patch as I wanted to
> get some feedback/comments on the above.

I guess just backporting 3d21ef0b49f8 should be OK even for older
kernels.  This assures the PCM stream gets suspended before entering
any parent device suspend call.

The remaining changes (except for the one for atiixp) are merely
cleanup of superfluous calls, and keeping them are harmless.

Could you check whether my theory above correct?


thanks,

Takashi
Jon Hunter March 20, 2019, 5:59 p.m. UTC | #2
On 20/03/2019 17:19, Takashi Iwai wrote:
> On Wed, 20 Mar 2019 17:04:14 +0100,
> Jon Hunter wrote:
>>
>> From: Jonathan Hunter <jonathanh@nvidia.com>
>>
>> This issue is not observed on the latest mainline kernel since the
>> 'ALSA: PCM suspend cleanup' series has been applied and the
>> snd_pcm_suspend_all() function call in the HDA codec driver has been
>> removed. However, this issue impacts stable kernel branches and so I
>> am trying to find a solution for these branches.
>>
>> When stress testing audio playback across multiple HDA streams
>> simultaneously, the following failure is sometimes observed when
>> starting playback ...
>>
>>  Unable to set hw params for playback: File descriptor in bad state
>>
>> The problem is caused because ...
>> 1. Playback is starting for one HDA codec and so the chip->open_mutex
>>    in azx_pcm_open() has been acquired.
>> 2. Playback for another HDA codec is starting but is blocked waiting to
>>    acquire the chip->open_mutex in azx_pcm_open().
>> 3. For the HDA codec that is blocked, its runtime-pm status is still
>>    enabled from a previous playback session that has since completed and
>>    been closed, however its autosuspend timeout has not expired yet.
>> 4. For the HDA codec that is blocked, the runtime-pm autosuspend timeout
>>    now occurs calling the runtime-pm suspend callback and this calls
>>    snd_pcm_suspend_all() placing all PCM streams in the suspended state.
>> 5. The block HDA codec is now unblocked and fails to set the HW params
>>    because the PCM stream is in the suspended state.
>>
>> The above has been observed on Tegra platforms where the autosuspend
>> delay is set to 1 second and there is a delay to an AZX command. This
>> highlights that there is a window of time where autosuspend can place
>> the HDA PCM stream in the suspended state while opening the stream
>> and cause playback to fail.
>>
>> Looking at commit 3d21ef0b49f8 ('ALSA: pcm: Suspend streams globally via
>> device type PM ops') it appears that the call to snd_pcm_suspend_all()
>> has now been moved to the to the suspend handler for PCM streams and so
>> I am wondering if it would be equivalent to make the following change to
>> the HDA codec driver to achieve the same behaviour but only for the HDA
>> driver. So far the issue has not been observed with this change.
>>
>> Please note that I am not sending this as a formal patch as I wanted to
>> get some feedback/comments on the above.
> 
> I guess just backporting 3d21ef0b49f8 should be OK even for older
> kernels.  This assures the PCM stream gets suspended before entering
> any parent device suspend call.
> 
> The remaining changes (except for the one for atiixp) are merely
> cleanup of superfluous calls, and keeping them are harmless.
> 
> Could you check whether my theory above correct?

I believe that you will need that change and 17bc4815de58 because we
need to prevent snd_pcm_suspend_all() being called in the pm-runtime
suspend callback. Applying only 3d21ef0b49f8 will means that at runtime
snd_pcm_suspend_all() can still be called from within the HDA codec
driver when the autosuspend timeout occurs. I will test this and
confirm. Maybe both could be backported?

Cheers
Jon
Jon Hunter March 21, 2019, 2:28 p.m. UTC | #3
On 20/03/2019 17:59, Jon Hunter wrote:
> 
> On 20/03/2019 17:19, Takashi Iwai wrote:
>> On Wed, 20 Mar 2019 17:04:14 +0100,
>> Jon Hunter wrote:
>>>
>>> From: Jonathan Hunter <jonathanh@nvidia.com>
>>>
>>> This issue is not observed on the latest mainline kernel since the
>>> 'ALSA: PCM suspend cleanup' series has been applied and the
>>> snd_pcm_suspend_all() function call in the HDA codec driver has been
>>> removed. However, this issue impacts stable kernel branches and so I
>>> am trying to find a solution for these branches.
>>>
>>> When stress testing audio playback across multiple HDA streams
>>> simultaneously, the following failure is sometimes observed when
>>> starting playback ...
>>>
>>>  Unable to set hw params for playback: File descriptor in bad state
>>>
>>> The problem is caused because ...
>>> 1. Playback is starting for one HDA codec and so the chip->open_mutex
>>>    in azx_pcm_open() has been acquired.
>>> 2. Playback for another HDA codec is starting but is blocked waiting to
>>>    acquire the chip->open_mutex in azx_pcm_open().
>>> 3. For the HDA codec that is blocked, its runtime-pm status is still
>>>    enabled from a previous playback session that has since completed and
>>>    been closed, however its autosuspend timeout has not expired yet.
>>> 4. For the HDA codec that is blocked, the runtime-pm autosuspend timeout
>>>    now occurs calling the runtime-pm suspend callback and this calls
>>>    snd_pcm_suspend_all() placing all PCM streams in the suspended state.
>>> 5. The block HDA codec is now unblocked and fails to set the HW params
>>>    because the PCM stream is in the suspended state.
>>>
>>> The above has been observed on Tegra platforms where the autosuspend
>>> delay is set to 1 second and there is a delay to an AZX command. This
>>> highlights that there is a window of time where autosuspend can place
>>> the HDA PCM stream in the suspended state while opening the stream
>>> and cause playback to fail.
>>>
>>> Looking at commit 3d21ef0b49f8 ('ALSA: pcm: Suspend streams globally via
>>> device type PM ops') it appears that the call to snd_pcm_suspend_all()
>>> has now been moved to the to the suspend handler for PCM streams and so
>>> I am wondering if it would be equivalent to make the following change to
>>> the HDA codec driver to achieve the same behaviour but only for the HDA
>>> driver. So far the issue has not been observed with this change.
>>>
>>> Please note that I am not sending this as a formal patch as I wanted to
>>> get some feedback/comments on the above.
>>
>> I guess just backporting 3d21ef0b49f8 should be OK even for older
>> kernels.  This assures the PCM stream gets suspended before entering
>> any parent device suspend call.
>>
>> The remaining changes (except for the one for atiixp) are merely
>> cleanup of superfluous calls, and keeping them are harmless.
>>
>> Could you check whether my theory above correct?
> 
> I believe that you will need that change and 17bc4815de58 because we
> need to prevent snd_pcm_suspend_all() being called in the pm-runtime
> suspend callback. Applying only 3d21ef0b49f8 will means that at runtime
> snd_pcm_suspend_all() can still be called from within the HDA codec
> driver when the autosuspend timeout occurs. I will test this and
> confirm. Maybe both could be backported?

I confirmed today that with just 3d21ef0b49f8 the issue can occur, but
with both 3d21ef0b49f8 and 17bc4815de58 the issue is not seen.

Do you think that it would be appropriate to include these in the stable
branches? They apply cleanly to v5.0, but we would probably need to
back-port for early kernels such as v4.19, v4.9, etc.

Cheers
Jon
Takashi Iwai March 25, 2019, 9:59 a.m. UTC | #4
On Thu, 21 Mar 2019 15:28:23 +0100,
Jon Hunter wrote:
> 
> 
> On 20/03/2019 17:59, Jon Hunter wrote:
> > 
> > On 20/03/2019 17:19, Takashi Iwai wrote:
> >> On Wed, 20 Mar 2019 17:04:14 +0100,
> >> Jon Hunter wrote:
> >>>
> >>> From: Jonathan Hunter <jonathanh@nvidia.com>
> >>>
> >>> This issue is not observed on the latest mainline kernel since the
> >>> 'ALSA: PCM suspend cleanup' series has been applied and the
> >>> snd_pcm_suspend_all() function call in the HDA codec driver has been
> >>> removed. However, this issue impacts stable kernel branches and so I
> >>> am trying to find a solution for these branches.
> >>>
> >>> When stress testing audio playback across multiple HDA streams
> >>> simultaneously, the following failure is sometimes observed when
> >>> starting playback ...
> >>>
> >>>  Unable to set hw params for playback: File descriptor in bad state
> >>>
> >>> The problem is caused because ...
> >>> 1. Playback is starting for one HDA codec and so the chip->open_mutex
> >>>    in azx_pcm_open() has been acquired.
> >>> 2. Playback for another HDA codec is starting but is blocked waiting to
> >>>    acquire the chip->open_mutex in azx_pcm_open().
> >>> 3. For the HDA codec that is blocked, its runtime-pm status is still
> >>>    enabled from a previous playback session that has since completed and
> >>>    been closed, however its autosuspend timeout has not expired yet.
> >>> 4. For the HDA codec that is blocked, the runtime-pm autosuspend timeout
> >>>    now occurs calling the runtime-pm suspend callback and this calls
> >>>    snd_pcm_suspend_all() placing all PCM streams in the suspended state.
> >>> 5. The block HDA codec is now unblocked and fails to set the HW params
> >>>    because the PCM stream is in the suspended state.
> >>>
> >>> The above has been observed on Tegra platforms where the autosuspend
> >>> delay is set to 1 second and there is a delay to an AZX command. This
> >>> highlights that there is a window of time where autosuspend can place
> >>> the HDA PCM stream in the suspended state while opening the stream
> >>> and cause playback to fail.
> >>>
> >>> Looking at commit 3d21ef0b49f8 ('ALSA: pcm: Suspend streams globally via
> >>> device type PM ops') it appears that the call to snd_pcm_suspend_all()
> >>> has now been moved to the to the suspend handler for PCM streams and so
> >>> I am wondering if it would be equivalent to make the following change to
> >>> the HDA codec driver to achieve the same behaviour but only for the HDA
> >>> driver. So far the issue has not been observed with this change.
> >>>
> >>> Please note that I am not sending this as a formal patch as I wanted to
> >>> get some feedback/comments on the above.
> >>
> >> I guess just backporting 3d21ef0b49f8 should be OK even for older
> >> kernels.  This assures the PCM stream gets suspended before entering
> >> any parent device suspend call.
> >>
> >> The remaining changes (except for the one for atiixp) are merely
> >> cleanup of superfluous calls, and keeping them are harmless.
> >>
> >> Could you check whether my theory above correct?
> > 
> > I believe that you will need that change and 17bc4815de58 because we
> > need to prevent snd_pcm_suspend_all() being called in the pm-runtime
> > suspend callback. Applying only 3d21ef0b49f8 will means that at runtime
> > snd_pcm_suspend_all() can still be called from within the HDA codec
> > driver when the autosuspend timeout occurs. I will test this and
> > confirm. Maybe both could be backported?
> 
> I confirmed today that with just 3d21ef0b49f8 the issue can occur, but
> with both 3d21ef0b49f8 and 17bc4815de58 the issue is not seen.
> 
> Do you think that it would be appropriate to include these in the stable
> branches? They apply cleanly to v5.0, but we would probably need to
> back-port for early kernels such as v4.19, v4.9, etc.

I reconsidered the problem again, and noticed that the very same
problem may appear with the system PM, not only with runtime PM.
The suspend can happen at any time, so even a stream in OPEN state may
go to suspend, and you'll hit the same problem.  It's just a corner
case so no one really cared much.

So, I think the patch like below should fix the problem.
This can be easily backported to all stable trees, and it alone should
work without backporting the else intrusive changes.

Could you check whether my theory is correct?


thanks,

Takashi

---
From: Takashi Iwai <tiwai@suse.de>
Subject: [PATCH] ALSA: pcm: Don't suspend stream in unrecoverable PCM state

Currently PCM core sets each opened stream forcibly to SUSPENDED state
via snd_pcm_suspend_all() call, and the user-space is responsible for
re-triggering the resume manually either via snd_pcm_resume() or
prepare call.  The scheme works fine usually, but there are corner
cases where the stream can't be resumed by that call: the streams
still in OPEN state before finishing hw_params.  When they are
suspended, user-space cannot perform resume or prepare because they
haven't been set up yet.  The only possible recovery is to re-open the
device, which isn't nice at all.  Similarly, when a stream is in
DISCONNECTED state, it makes no sense to change it to SUSPENDED
state.  Ditto for in SETUP state; which you can re-prepare directly.

So, this patch addresses these issues by filtering the PCM streams to
be suspended by checking the PCM state.  When a stream is in either
OPEN, SETUP or DISCONNECTED as well as already SUSPENDED, the suspend
action is skipped.

To be noted, this problem was originally reported for the PCM runtime
PM on HD-audio.  And, the runtime PM problem itself was already
addressed (although not intended) by the code refactoring commits
3d21ef0b49f8 ("ALSA: pcm: Suspend streams globally via device type PM
ops") and 17bc4815de58 ("ALSA: pci: Remove superfluous
snd_pcm_suspend*() calls").  These commits eliminated the
snd_pcm_suspend*() calls from the runtime PM suspend callback code
path, hence the racy OPEN state won't appear while runtime PM.
(FWIW, the race window is between snd_pcm_open_substream() and the
first power up in azx_pcm_open().)

Although the runtime PM issue was already "fixed", the same problem is
still present for the system PM, hence this patch is still needed.
And for stable trees, this patch alone should suffice for fixing the
runtime PM problem, too.

Reported-by: Jonathan Hunter <jonathanh@nvidia.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
---
 sound/core/pcm_native.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/sound/core/pcm_native.c b/sound/core/pcm_native.c
index f731f904e8cc..1d8452912b14 100644
--- a/sound/core/pcm_native.c
+++ b/sound/core/pcm_native.c
@@ -1445,8 +1445,15 @@ static int snd_pcm_pause(struct snd_pcm_substream *substream, int push)
 static int snd_pcm_pre_suspend(struct snd_pcm_substream *substream, int state)
 {
 	struct snd_pcm_runtime *runtime = substream->runtime;
-	if (runtime->status->state == SNDRV_PCM_STATE_SUSPENDED)
+	switch (runtime->status->state) {
+	case SNDRV_PCM_STATE_SUSPENDED:
 		return -EBUSY;
+	/* unresumable PCM state; return -EBUSY for skipping suspend */
+	case SNDRV_PCM_STATE_OPEN:
+	case SNDRV_PCM_STATE_SETUP:
+	case SNDRV_PCM_STATE_DISCONNECTED:
+		return -EBUSY;
+	}
 	runtime->trigger_master = substream;
 	return 0;
 }
Jon Hunter March 25, 2019, 10:19 a.m. UTC | #5
On 25/03/2019 09:59, Takashi Iwai wrote:
> On Thu, 21 Mar 2019 15:28:23 +0100,
> Jon Hunter wrote:
>>
>>
>> On 20/03/2019 17:59, Jon Hunter wrote:
>>>
>>> On 20/03/2019 17:19, Takashi Iwai wrote:
>>>> On Wed, 20 Mar 2019 17:04:14 +0100,
>>>> Jon Hunter wrote:
>>>>>
>>>>> From: Jonathan Hunter <jonathanh@nvidia.com>
>>>>>
>>>>> This issue is not observed on the latest mainline kernel since the
>>>>> 'ALSA: PCM suspend cleanup' series has been applied and the
>>>>> snd_pcm_suspend_all() function call in the HDA codec driver has been
>>>>> removed. However, this issue impacts stable kernel branches and so I
>>>>> am trying to find a solution for these branches.
>>>>>
>>>>> When stress testing audio playback across multiple HDA streams
>>>>> simultaneously, the following failure is sometimes observed when
>>>>> starting playback ...
>>>>>
>>>>>  Unable to set hw params for playback: File descriptor in bad state
>>>>>
>>>>> The problem is caused because ...
>>>>> 1. Playback is starting for one HDA codec and so the chip->open_mutex
>>>>>    in azx_pcm_open() has been acquired.
>>>>> 2. Playback for another HDA codec is starting but is blocked waiting to
>>>>>    acquire the chip->open_mutex in azx_pcm_open().
>>>>> 3. For the HDA codec that is blocked, its runtime-pm status is still
>>>>>    enabled from a previous playback session that has since completed and
>>>>>    been closed, however its autosuspend timeout has not expired yet.
>>>>> 4. For the HDA codec that is blocked, the runtime-pm autosuspend timeout
>>>>>    now occurs calling the runtime-pm suspend callback and this calls
>>>>>    snd_pcm_suspend_all() placing all PCM streams in the suspended state.
>>>>> 5. The block HDA codec is now unblocked and fails to set the HW params
>>>>>    because the PCM stream is in the suspended state.
>>>>>
>>>>> The above has been observed on Tegra platforms where the autosuspend
>>>>> delay is set to 1 second and there is a delay to an AZX command. This
>>>>> highlights that there is a window of time where autosuspend can place
>>>>> the HDA PCM stream in the suspended state while opening the stream
>>>>> and cause playback to fail.
>>>>>
>>>>> Looking at commit 3d21ef0b49f8 ('ALSA: pcm: Suspend streams globally via
>>>>> device type PM ops') it appears that the call to snd_pcm_suspend_all()
>>>>> has now been moved to the to the suspend handler for PCM streams and so
>>>>> I am wondering if it would be equivalent to make the following change to
>>>>> the HDA codec driver to achieve the same behaviour but only for the HDA
>>>>> driver. So far the issue has not been observed with this change.
>>>>>
>>>>> Please note that I am not sending this as a formal patch as I wanted to
>>>>> get some feedback/comments on the above.
>>>>
>>>> I guess just backporting 3d21ef0b49f8 should be OK even for older
>>>> kernels.  This assures the PCM stream gets suspended before entering
>>>> any parent device suspend call.
>>>>
>>>> The remaining changes (except for the one for atiixp) are merely
>>>> cleanup of superfluous calls, and keeping them are harmless.
>>>>
>>>> Could you check whether my theory above correct?
>>>
>>> I believe that you will need that change and 17bc4815de58 because we
>>> need to prevent snd_pcm_suspend_all() being called in the pm-runtime
>>> suspend callback. Applying only 3d21ef0b49f8 will means that at runtime
>>> snd_pcm_suspend_all() can still be called from within the HDA codec
>>> driver when the autosuspend timeout occurs. I will test this and
>>> confirm. Maybe both could be backported?
>>
>> I confirmed today that with just 3d21ef0b49f8 the issue can occur, but
>> with both 3d21ef0b49f8 and 17bc4815de58 the issue is not seen.
>>
>> Do you think that it would be appropriate to include these in the stable
>> branches? They apply cleanly to v5.0, but we would probably need to
>> back-port for early kernels such as v4.19, v4.9, etc.
> 
> I reconsidered the problem again, and noticed that the very same
> problem may appear with the system PM, not only with runtime PM.
> The suspend can happen at any time, so even a stream in OPEN state may
> go to suspend, and you'll hit the same problem.  It's just a corner
> case so no one really cared much.

Yes I was not sure if it could also happen in the system suspend case or
not.

> So, I think the patch like below should fix the problem.
> This can be easily backported to all stable trees, and it alone should
> work without backporting the else intrusive changes.
> 
> Could you check whether my theory is correct?

Do you want me to test this on the same stable branch I have been
testing with where commits 3d21ef0b49f8 and 17bc4815de58 are not present?

Cheers
Jon
 --
nvpublic
Takashi Iwai March 25, 2019, 10:29 a.m. UTC | #6
On Mon, 25 Mar 2019 11:19:31 +0100,
Jon Hunter wrote:
> 
> 
> On 25/03/2019 09:59, Takashi Iwai wrote:
> > On Thu, 21 Mar 2019 15:28:23 +0100,
> > Jon Hunter wrote:
> >>
> >>
> >> On 20/03/2019 17:59, Jon Hunter wrote:
> >>>
> >>> On 20/03/2019 17:19, Takashi Iwai wrote:
> >>>> On Wed, 20 Mar 2019 17:04:14 +0100,
> >>>> Jon Hunter wrote:
> >>>>>
> >>>>> From: Jonathan Hunter <jonathanh@nvidia.com>
> >>>>>
> >>>>> This issue is not observed on the latest mainline kernel since the
> >>>>> 'ALSA: PCM suspend cleanup' series has been applied and the
> >>>>> snd_pcm_suspend_all() function call in the HDA codec driver has been
> >>>>> removed. However, this issue impacts stable kernel branches and so I
> >>>>> am trying to find a solution for these branches.
> >>>>>
> >>>>> When stress testing audio playback across multiple HDA streams
> >>>>> simultaneously, the following failure is sometimes observed when
> >>>>> starting playback ...
> >>>>>
> >>>>>  Unable to set hw params for playback: File descriptor in bad state
> >>>>>
> >>>>> The problem is caused because ...
> >>>>> 1. Playback is starting for one HDA codec and so the chip->open_mutex
> >>>>>    in azx_pcm_open() has been acquired.
> >>>>> 2. Playback for another HDA codec is starting but is blocked waiting to
> >>>>>    acquire the chip->open_mutex in azx_pcm_open().
> >>>>> 3. For the HDA codec that is blocked, its runtime-pm status is still
> >>>>>    enabled from a previous playback session that has since completed and
> >>>>>    been closed, however its autosuspend timeout has not expired yet.
> >>>>> 4. For the HDA codec that is blocked, the runtime-pm autosuspend timeout
> >>>>>    now occurs calling the runtime-pm suspend callback and this calls
> >>>>>    snd_pcm_suspend_all() placing all PCM streams in the suspended state.
> >>>>> 5. The block HDA codec is now unblocked and fails to set the HW params
> >>>>>    because the PCM stream is in the suspended state.
> >>>>>
> >>>>> The above has been observed on Tegra platforms where the autosuspend
> >>>>> delay is set to 1 second and there is a delay to an AZX command. This
> >>>>> highlights that there is a window of time where autosuspend can place
> >>>>> the HDA PCM stream in the suspended state while opening the stream
> >>>>> and cause playback to fail.
> >>>>>
> >>>>> Looking at commit 3d21ef0b49f8 ('ALSA: pcm: Suspend streams globally via
> >>>>> device type PM ops') it appears that the call to snd_pcm_suspend_all()
> >>>>> has now been moved to the to the suspend handler for PCM streams and so
> >>>>> I am wondering if it would be equivalent to make the following change to
> >>>>> the HDA codec driver to achieve the same behaviour but only for the HDA
> >>>>> driver. So far the issue has not been observed with this change.
> >>>>>
> >>>>> Please note that I am not sending this as a formal patch as I wanted to
> >>>>> get some feedback/comments on the above.
> >>>>
> >>>> I guess just backporting 3d21ef0b49f8 should be OK even for older
> >>>> kernels.  This assures the PCM stream gets suspended before entering
> >>>> any parent device suspend call.
> >>>>
> >>>> The remaining changes (except for the one for atiixp) are merely
> >>>> cleanup of superfluous calls, and keeping them are harmless.
> >>>>
> >>>> Could you check whether my theory above correct?
> >>>
> >>> I believe that you will need that change and 17bc4815de58 because we
> >>> need to prevent snd_pcm_suspend_all() being called in the pm-runtime
> >>> suspend callback. Applying only 3d21ef0b49f8 will means that at runtime
> >>> snd_pcm_suspend_all() can still be called from within the HDA codec
> >>> driver when the autosuspend timeout occurs. I will test this and
> >>> confirm. Maybe both could be backported?
> >>
> >> I confirmed today that with just 3d21ef0b49f8 the issue can occur, but
> >> with both 3d21ef0b49f8 and 17bc4815de58 the issue is not seen.
> >>
> >> Do you think that it would be appropriate to include these in the stable
> >> branches? They apply cleanly to v5.0, but we would probably need to
> >> back-port for early kernels such as v4.19, v4.9, etc.
> > 
> > I reconsidered the problem again, and noticed that the very same
> > problem may appear with the system PM, not only with runtime PM.
> > The suspend can happen at any time, so even a stream in OPEN state may
> > go to suspend, and you'll hit the same problem.  It's just a corner
> > case so no one really cared much.
> 
> Yes I was not sure if it could also happen in the system suspend case or
> not.
> 
> > So, I think the patch like below should fix the problem.
> > This can be easily backported to all stable trees, and it alone should
> > work without backporting the else intrusive changes.
> > 
> > Could you check whether my theory is correct?
> 
> Do you want me to test this on the same stable branch I have been
> testing with where commits 3d21ef0b49f8 and 17bc4815de58 are not present?

Yes, that'd be appreciated.  Once after confirming it actually helps
for the runtime PM, I'll queue to my tree and merge to upstream /
stable.


thanks,

Takashi
Jon Hunter March 25, 2019, 3:31 p.m. UTC | #7
On 25/03/2019 09:59, Takashi Iwai wrote:

...

> I reconsidered the problem again, and noticed that the very same
> problem may appear with the system PM, not only with runtime PM.
> The suspend can happen at any time, so even a stream in OPEN state may
> go to suspend, and you'll hit the same problem.  It's just a corner
> case so no one really cared much.
> 
> So, I think the patch like below should fix the problem.
> This can be easily backported to all stable trees, and it alone should
> work without backporting the else intrusive changes.
> 
> Could you check whether my theory is correct?
> 
> 
> thanks,
> 
> Takashi
> 
> ---
> From: Takashi Iwai <tiwai@suse.de>
> Subject: [PATCH] ALSA: pcm: Don't suspend stream in unrecoverable PCM state
> 
> Currently PCM core sets each opened stream forcibly to SUSPENDED state
> via snd_pcm_suspend_all() call, and the user-space is responsible for
> re-triggering the resume manually either via snd_pcm_resume() or
> prepare call.  The scheme works fine usually, but there are corner
> cases where the stream can't be resumed by that call: the streams
> still in OPEN state before finishing hw_params.  When they are
> suspended, user-space cannot perform resume or prepare because they
> haven't been set up yet.  The only possible recovery is to re-open the
> device, which isn't nice at all.  Similarly, when a stream is in
> DISCONNECTED state, it makes no sense to change it to SUSPENDED
> state.  Ditto for in SETUP state; which you can re-prepare directly.
> 
> So, this patch addresses these issues by filtering the PCM streams to
> be suspended by checking the PCM state.  When a stream is in either
> OPEN, SETUP or DISCONNECTED as well as already SUSPENDED, the suspend
> action is skipped.
> 
> To be noted, this problem was originally reported for the PCM runtime
> PM on HD-audio.  And, the runtime PM problem itself was already
> addressed (although not intended) by the code refactoring commits
> 3d21ef0b49f8 ("ALSA: pcm: Suspend streams globally via device type PM
> ops") and 17bc4815de58 ("ALSA: pci: Remove superfluous
> snd_pcm_suspend*() calls").  These commits eliminated the
> snd_pcm_suspend*() calls from the runtime PM suspend callback code
> path, hence the racy OPEN state won't appear while runtime PM.
> (FWIW, the race window is between snd_pcm_open_substream() and the
> first power up in azx_pcm_open().)
> 
> Although the runtime PM issue was already "fixed", the same problem is
> still present for the system PM, hence this patch is still needed.
> And for stable trees, this patch alone should suffice for fixing the
> runtime PM problem, too.
> 
> Reported-by: Jonathan Hunter <jonathanh@nvidia.com>
> Cc: <stable@vger.kernel.org>
> Signed-off-by: Takashi Iwai <tiwai@suse.de>
> ---
>  sound/core/pcm_native.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/sound/core/pcm_native.c b/sound/core/pcm_native.c
> index f731f904e8cc..1d8452912b14 100644
> --- a/sound/core/pcm_native.c
> +++ b/sound/core/pcm_native.c
> @@ -1445,8 +1445,15 @@ static int snd_pcm_pause(struct snd_pcm_substream *substream, int push)
>  static int snd_pcm_pre_suspend(struct snd_pcm_substream *substream, int state)
>  {
>  	struct snd_pcm_runtime *runtime = substream->runtime;
> -	if (runtime->status->state == SNDRV_PCM_STATE_SUSPENDED)
> +	switch (runtime->status->state) {
> +	case SNDRV_PCM_STATE_SUSPENDED:
>  		return -EBUSY;
> +	/* unresumable PCM state; return -EBUSY for skipping suspend */
> +	case SNDRV_PCM_STATE_OPEN:
> +	case SNDRV_PCM_STATE_SETUP:
> +	case SNDRV_PCM_STATE_DISCONNECTED:
> +		return -EBUSY;
> +	}
>  	runtime->trigger_master = substream;
>  	return 0;
>  }

Thanks, this works for me! I have tested this on stable branch
linux-5.0.y with Tegra194. So feel free to add my ...

Tested-by: Jon Hunter <jonathanh@nvidia.com>

Cheers
Jon
Takashi Iwai March 25, 2019, 3:36 p.m. UTC | #8
On Mon, 25 Mar 2019 16:31:26 +0100,
Jon Hunter wrote:
> 
> 
> On 25/03/2019 09:59, Takashi Iwai wrote:
> 
> ...
> 
> > I reconsidered the problem again, and noticed that the very same
> > problem may appear with the system PM, not only with runtime PM.
> > The suspend can happen at any time, so even a stream in OPEN state may
> > go to suspend, and you'll hit the same problem.  It's just a corner
> > case so no one really cared much.
> > 
> > So, I think the patch like below should fix the problem.
> > This can be easily backported to all stable trees, and it alone should
> > work without backporting the else intrusive changes.
> > 
> > Could you check whether my theory is correct?
> > 
> > 
> > thanks,
> > 
> > Takashi
> > 
> > ---
> > From: Takashi Iwai <tiwai@suse.de>
> > Subject: [PATCH] ALSA: pcm: Don't suspend stream in unrecoverable PCM state
> > 
> > Currently PCM core sets each opened stream forcibly to SUSPENDED state
> > via snd_pcm_suspend_all() call, and the user-space is responsible for
> > re-triggering the resume manually either via snd_pcm_resume() or
> > prepare call.  The scheme works fine usually, but there are corner
> > cases where the stream can't be resumed by that call: the streams
> > still in OPEN state before finishing hw_params.  When they are
> > suspended, user-space cannot perform resume or prepare because they
> > haven't been set up yet.  The only possible recovery is to re-open the
> > device, which isn't nice at all.  Similarly, when a stream is in
> > DISCONNECTED state, it makes no sense to change it to SUSPENDED
> > state.  Ditto for in SETUP state; which you can re-prepare directly.
> > 
> > So, this patch addresses these issues by filtering the PCM streams to
> > be suspended by checking the PCM state.  When a stream is in either
> > OPEN, SETUP or DISCONNECTED as well as already SUSPENDED, the suspend
> > action is skipped.
> > 
> > To be noted, this problem was originally reported for the PCM runtime
> > PM on HD-audio.  And, the runtime PM problem itself was already
> > addressed (although not intended) by the code refactoring commits
> > 3d21ef0b49f8 ("ALSA: pcm: Suspend streams globally via device type PM
> > ops") and 17bc4815de58 ("ALSA: pci: Remove superfluous
> > snd_pcm_suspend*() calls").  These commits eliminated the
> > snd_pcm_suspend*() calls from the runtime PM suspend callback code
> > path, hence the racy OPEN state won't appear while runtime PM.
> > (FWIW, the race window is between snd_pcm_open_substream() and the
> > first power up in azx_pcm_open().)
> > 
> > Although the runtime PM issue was already "fixed", the same problem is
> > still present for the system PM, hence this patch is still needed.
> > And for stable trees, this patch alone should suffice for fixing the
> > runtime PM problem, too.
> > 
> > Reported-by: Jonathan Hunter <jonathanh@nvidia.com>
> > Cc: <stable@vger.kernel.org>
> > Signed-off-by: Takashi Iwai <tiwai@suse.de>
> > ---
> >  sound/core/pcm_native.c | 9 ++++++++-
> >  1 file changed, 8 insertions(+), 1 deletion(-)
> > 
> > diff --git a/sound/core/pcm_native.c b/sound/core/pcm_native.c
> > index f731f904e8cc..1d8452912b14 100644
> > --- a/sound/core/pcm_native.c
> > +++ b/sound/core/pcm_native.c
> > @@ -1445,8 +1445,15 @@ static int snd_pcm_pause(struct snd_pcm_substream *substream, int push)
> >  static int snd_pcm_pre_suspend(struct snd_pcm_substream *substream, int state)
> >  {
> >  	struct snd_pcm_runtime *runtime = substream->runtime;
> > -	if (runtime->status->state == SNDRV_PCM_STATE_SUSPENDED)
> > +	switch (runtime->status->state) {
> > +	case SNDRV_PCM_STATE_SUSPENDED:
> >  		return -EBUSY;
> > +	/* unresumable PCM state; return -EBUSY for skipping suspend */
> > +	case SNDRV_PCM_STATE_OPEN:
> > +	case SNDRV_PCM_STATE_SETUP:
> > +	case SNDRV_PCM_STATE_DISCONNECTED:
> > +		return -EBUSY;
> > +	}
> >  	runtime->trigger_master = substream;
> >  	return 0;
> >  }
> 
> Thanks, this works for me! I have tested this on stable branch
> linux-5.0.y with Tegra194. So feel free to add my ...
> 
> Tested-by: Jon Hunter <jonathanh@nvidia.com>

OK, I'm going to queue the fix now.  Thanks for quick testing!


Takashi
diff mbox series

Patch

diff --git a/sound/pci/hda/hda_codec.c b/sound/pci/hda/hda_codec.c
index 9f8d59e7e89f..26ecc6e1f524 100644
--- a/sound/pci/hda/hda_codec.c
+++ b/sound/pci/hda/hda_codec.c
@@ -2923,12 +2923,9 @@  static void hda_call_codec_resume(struct hda_codec *codec)
 static int hda_codec_runtime_suspend(struct device *dev)
 {
 	struct hda_codec *codec = dev_to_hda_codec(dev);
-	struct hda_pcm *pcm;
 	unsigned int state;
 
 	cancel_delayed_work_sync(&codec->jackpoll_work);
-	list_for_each_entry(pcm, &codec->pcm_list_head, list)
-		snd_pcm_suspend_all(pcm->pcm);
 	state = hda_call_codec_suspend(codec);
 	if (codec->link_down_at_suspend ||
 	    (codec_has_clkstop(codec) && codec_has_epss(codec) &&
@@ -2948,11 +2945,22 @@  static int hda_codec_runtime_resume(struct device *dev)
 	pm_runtime_mark_last_busy(dev);
 	return 0;
 }
+
+static int hda_codec_suspend(struct device *dev)
+{
+	struct hda_codec *codec = dev_to_hda_codec(dev);
+	struct hda_pcm *pcm;
+
+	list_for_each_entry(pcm, &codec->pcm_list_head, list)
+		snd_pcm_suspend_all(pcm->pcm);
+
+	return pm_runtime_force_suspend(dev);
+}
 #endif /* CONFIG_PM */
 
 /* referred in hda_bind.c */
 const struct dev_pm_ops hda_codec_driver_pm = {
-	SET_SYSTEM_SLEEP_PM_OPS(pm_runtime_force_suspend,
+	SET_SYSTEM_SLEEP_PM_OPS(hda_codec_suspend,
 				pm_runtime_force_resume)
 	SET_RUNTIME_PM_OPS(hda_codec_runtime_suspend, hda_codec_runtime_resume,
 			   NULL)