diff mbox series

ALSA: pcm: fix incorrect hw_base increase

Message ID 1589515779-20987-1-git-send-email-brent.lu@intel.com (mailing list archive)
State New, archived
Headers show
Series ALSA: pcm: fix incorrect hw_base increase | expand

Commit Message

Brent Lu May 15, 2020, 4:09 a.m. UTC
The hw_base will be increased by runtime->buffer_size frames
unconditionally if the runtime->status->hw_ptr is not updated for over
half of buffer time. As the hw_base increases, so does the
runtime->status->hw_ptr which could lead to invalid return value when
user space program calls snd_pcm_avail() function.

By updating runtime->hw_ptr_jiffies each time the HWSYNC is called,
the hw_base will keep the same when buffer stall happens, so does the
hw_ptr.

Signed-off-by: Brent Lu <brent.lu@intel.com>
---
 sound/core/pcm_lib.c | 1 +
 1 file changed, 1 insertion(+)

Comments

Greg Kroah-Hartman May 15, 2020, 7:04 a.m. UTC | #1
On Fri, May 15, 2020 at 12:09:39PM +0800, Brent Lu wrote:
> The hw_base will be increased by runtime->buffer_size frames
> unconditionally if the runtime->status->hw_ptr is not updated for over
> half of buffer time. As the hw_base increases, so does the
> runtime->status->hw_ptr which could lead to invalid return value when
> user space program calls snd_pcm_avail() function.
> 
> By updating runtime->hw_ptr_jiffies each time the HWSYNC is called,
> the hw_base will keep the same when buffer stall happens, so does the
> hw_ptr.
> 
> Signed-off-by: Brent Lu <brent.lu@intel.com>
> ---
>  sound/core/pcm_lib.c | 1 +
>  1 file changed, 1 insertion(+)

Is this a bugfix needed for older kernels as well?  When did this issue
show up?

thanks,

greg k-h
Takashi Iwai May 15, 2020, 7:21 a.m. UTC | #2
On Fri, 15 May 2020 06:09:39 +0200,
Brent Lu wrote:
> 
> The hw_base will be increased by runtime->buffer_size frames
> unconditionally if the runtime->status->hw_ptr is not updated for over
> half of buffer time. As the hw_base increases, so does the
> runtime->status->hw_ptr which could lead to invalid return value when
> user space program calls snd_pcm_avail() function.
> 
> By updating runtime->hw_ptr_jiffies each time the HWSYNC is called,
> the hw_base will keep the same when buffer stall happens, so does the
> hw_ptr.

Updating hw_ptr_jiffies at that code path looks correct, but it still
leaves the question why this condition happens.  It means that the
actual hwptr isn't changed and yet only jiffies increase
significantly; it means that the hardware can't report proper pointer,
and it should have set SNDRV_PCM_INFO_BATCH flag, then the jiffies
check is skipped.

With which hardware and under which situation did it happen (and the
patch fixed)?


thanks,

Takashi

> 
> Signed-off-by: Brent Lu <brent.lu@intel.com>
> ---
>  sound/core/pcm_lib.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/sound/core/pcm_lib.c b/sound/core/pcm_lib.c
> index 872a852..d531e1b 100644
> --- a/sound/core/pcm_lib.c
> +++ b/sound/core/pcm_lib.c
> @@ -433,6 +433,7 @@ static int snd_pcm_update_hw_ptr0(struct snd_pcm_substream *substream,
>  
>   no_delta_check:
>  	if (runtime->status->hw_ptr == new_hw_ptr) {
> +		runtime->hw_ptr_jiffies = curr_jiffies;
>  		update_audio_tstamp(substream, &curr_tstamp, &audio_tstamp);
>  		return 0;
>  	}
> -- 
> 2.7.4
>
Brent Lu May 15, 2020, 9:04 a.m. UTC | #3
> 
> Is this a bugfix needed for older kernels as well?  When did this issue show
> up?
> 
> thanks,
> 
> greg k-h

It happens when DMA stop moving data from host to DSP/DAI for a long time
(> half of buffer time). I know host driver should do something about it. But if
not, the HWSYNC will keep increasing the hw_base and hw_ptr and confuses
user space program.

Regards,
Brent
Jaroslav Kysela May 15, 2020, 9:30 a.m. UTC | #4
Dne 15. 05. 20 v 11:04 Lu, Brent napsal(a):
>>
>> Is this a bugfix needed for older kernels as well?  When did this issue show
>> up?
>>
>> thanks,
>>
>> greg k-h
> 
> It happens when DMA stop moving data from host to DSP/DAI for a long time
> (> half of buffer time). I know host driver should do something about it. But if
> not, the HWSYNC will keep increasing the hw_base and hw_ptr and confuses
> user space program.

I'm afraid, but with this code, you turn off the hw_ptr jiffies code. It would 
be better to fix the driver in this case (return the updated / estimated DMA 
pointer, increase DMA buffer size etc.). This "lag" is unacceptable.

						Jaroslav

> 
> Regards,
> Brent
>
Brent Lu May 15, 2020, 9:36 a.m. UTC | #5
> 
> Updating hw_ptr_jiffies at that code path looks correct, but it still leaves the
> question why this condition happens.  It means that the actual hwptr isn't
> changed and yet only jiffies increase significantly; it means that the hardware
> can't report proper pointer, and it should have set
> SNDRV_PCM_INFO_BATCH flag, then the jiffies check is skipped.
> 
> With which hardware and under which situation did it happen (and the patch
> fixed)?
> 
> 
> thanks,
> 
> Takashi
> 

From time to time we got questions from google about why sometimes the
snd_pcm_avail() returns a value larger than buffer size. Recently we finally
found reliable reproduce steps: it's on Intel GLK Chromebook Fleex with
SOF firmware. There is a 1/20 chance the audio playback to HDMI fails.

From the FW side we observe a buffer runderrun, the FW is not able to
recover it for some reason and stops the pipe.

From the Linux side we see the pos does not increase because the FW stops
receiving data but suddenly the hw_prt is increased by buffer_size (16368).
It could make snd_pcm_avail() reporting a false underrun to the caller like
following log:

2020-05-09T16:08:32.712042+08:00 DEBUG kernel: [  418.510086] sound pcmC0D5p: pos 96 hw_ptr 96 appl_ptr 4096 avail 12368
2020-05-09T16:08:32.712043+08:00 DEBUG kernel: [  418.510149] sound pcmC0D5p: pos 96 hw_ptr 96 appl_ptr 6910 avail 9554
...
2020-05-09T16:08:32.883095+08:00 DEBUG kernel: [  418.680868] sound pcmC0D5p: pos 96 hw_ptr 96 appl_ptr 15102 avail 1362
2020-05-09T16:08:32.883104+08:00 DEBUG kernel: [  418.681052] sound pcmC0D5p: pos 96 hw_ptr 96 appl_ptr 15102 avail 1362
2020-05-09T16:08:32.883109+08:00 DEBUG kernel: [  418.681130] sound pcmC0D5p: pos 96 hw_ptr 96 appl_ptr 16464 avail 0
2020-05-09T16:08:32.929330+08:00 DEBUG kernel: [  418.726515] sound pcmC0D5p: pos 96 hw_ptr 16464 appl_ptr 16464 avail 16368
2020-05-09T16:08:32.929512+08:00 DEBUG kernel: [  418.727041] sound pcmC0D5p: pos 96 hw_ptr 16464 appl_ptr 16464 avail 16368

Or it could make snd_pcm_avail() returns an invalid number and confuses the
Caller like following log:

2020-05-09T16:08:33.054039+08:00 DEBUG kernel: [  418.851755] sound pcmC0D5p: pos 96 hw_ptr 16464 appl_ptr 27390 avail 5442
2020-05-09T16:08:33.129552+08:00 DEBUG kernel: [  418.926491] sound pcmC0D5p: pos 96 hw_ptr 32832 appl_ptr 27390 avail 21810
2020-05-09T16:08:33.131746+08:00 ERR cras_server[1907]: pcm_avail returned frames larger than buf_size: sof-glkda7219max: :0,5: 21810 > 16368

I've submitted a patch to fix the issue in SOF side. However, I think it's also good
to fix the incorrect hw_base increasement in Linux side.

https://github.com/thesofproject/sof/pull/2926


Regards,
Brent
Takashi Iwai May 15, 2020, 10:39 a.m. UTC | #6
On Fri, 15 May 2020 11:30:54 +0200,
Jaroslav Kysela wrote:
> 
> Dne 15. 05. 20 v 11:04 Lu, Brent napsal(a):
> >>
> >> Is this a bugfix needed for older kernels as well?  When did this issue show
> >> up?
> >>
> >> thanks,
> >>
> >> greg k-h
> >
> > It happens when DMA stop moving data from host to DSP/DAI for a long time
> > (> half of buffer time). I know host driver should do something about it. But if
> > not, the HWSYNC will keep increasing the hw_base and hw_ptr and confuses
> > user space program.
> 
> I'm afraid, but with this code, you turn off the hw_ptr jiffies
> code. It would be better to fix the driver in this case (return the
> updated / estimated DMA pointer, increase DMA buffer size etc.). This
> "lag" is unacceptable.

The problem is obviously in the driver's side and it's best to be
addressed there.  But, I think it's still worth to apply this change.

The hw_ptr jiffies check is performed basically in two places: one is
snd_pcm_period_elapsed() call from ISR, and another is with the
no_period_wakeup flag.  In both cases, it calculates the diff of
jiffies from the previous update, and corrects the hw_ptr_base if that
exceeds the threshold.

And the bug here is that the "previous" jiffies is kept as long as the
hwptr itself is updated.  What we need is the correction of the base
when it really has processed the period size; i.e. hwptr got the same
value (with no_period_wakeup) and yet the jiffies diff is big.  For
this check, it's correct to update hw_ptr_jiffies at each call no
matter whether hwptr moved or not; we need to evaluate from the
previous update, after all.

But I might overlook something.  Jaroslav, could you check it again?
The jiffies check code is your black magic :)


thanks,

Takashi
Takashi Iwai May 15, 2020, 10:40 a.m. UTC | #7
On Fri, 15 May 2020 11:36:19 +0200,
Lu, Brent wrote:
> 
> > 
> > Updating hw_ptr_jiffies at that code path looks correct, but it still leaves the
> > question why this condition happens.  It means that the actual hwptr isn't
> > changed and yet only jiffies increase significantly; it means that the hardware
> > can't report proper pointer, and it should have set
> > SNDRV_PCM_INFO_BATCH flag, then the jiffies check is skipped.
> > 
> > With which hardware and under which situation did it happen (and the patch
> > fixed)?
> > 
> > 
> > thanks,
> > 
> > Takashi
> > 
> 
> >From time to time we got questions from google about why sometimes the
> snd_pcm_avail() returns a value larger than buffer size. Recently we finally
> found reliable reproduce steps: it's on Intel GLK Chromebook Fleex with
> SOF firmware. There is a 1/20 chance the audio playback to HDMI fails.
> 
> >From the FW side we observe a buffer runderrun, the FW is not able to
> recover it for some reason and stops the pipe.
> 
> >From the Linux side we see the pos does not increase because the FW stops
> receiving data but suddenly the hw_prt is increased by buffer_size (16368).
> It could make snd_pcm_avail() reporting a false underrun to the caller like
> following log:
> 
> 2020-05-09T16:08:32.712042+08:00 DEBUG kernel: [  418.510086] sound pcmC0D5p: pos 96 hw_ptr 96 appl_ptr 4096 avail 12368
> 2020-05-09T16:08:32.712043+08:00 DEBUG kernel: [  418.510149] sound pcmC0D5p: pos 96 hw_ptr 96 appl_ptr 6910 avail 9554
> ...
> 2020-05-09T16:08:32.883095+08:00 DEBUG kernel: [  418.680868] sound pcmC0D5p: pos 96 hw_ptr 96 appl_ptr 15102 avail 1362
> 2020-05-09T16:08:32.883104+08:00 DEBUG kernel: [  418.681052] sound pcmC0D5p: pos 96 hw_ptr 96 appl_ptr 15102 avail 1362
> 2020-05-09T16:08:32.883109+08:00 DEBUG kernel: [  418.681130] sound pcmC0D5p: pos 96 hw_ptr 96 appl_ptr 16464 avail 0
> 2020-05-09T16:08:32.929330+08:00 DEBUG kernel: [  418.726515] sound pcmC0D5p: pos 96 hw_ptr 16464 appl_ptr 16464 avail 16368
> 2020-05-09T16:08:32.929512+08:00 DEBUG kernel: [  418.727041] sound pcmC0D5p: pos 96 hw_ptr 16464 appl_ptr 16464 avail 16368
> 
> Or it could make snd_pcm_avail() returns an invalid number and confuses the
> Caller like following log:
> 
> 2020-05-09T16:08:33.054039+08:00 DEBUG kernel: [  418.851755] sound pcmC0D5p: pos 96 hw_ptr 16464 appl_ptr 27390 avail 5442
> 2020-05-09T16:08:33.129552+08:00 DEBUG kernel: [  418.926491] sound pcmC0D5p: pos 96 hw_ptr 32832 appl_ptr 27390 avail 21810
> 2020-05-09T16:08:33.131746+08:00 ERR cras_server[1907]: pcm_avail returned frames larger than buf_size: sof-glkda7219max: :0,5: 21810 > 16368
> 
> I've submitted a patch to fix the issue in SOF side. However, I think it's also good
> to fix the incorrect hw_base increasement in Linux side.
> 
> https://github.com/thesofproject/sof/pull/2926

Oh this whole information (at least some digested version) should have
been included in the patch description.  Otherwise we have no idea why
it's needed and what actually means.

Let's wait for Jaroslav's review, and if it's OK, could you resubmit
with more description?


thanks,

Takashi
Jaroslav Kysela May 15, 2020, 12:01 p.m. UTC | #8
Dne 15. 05. 20 v 12:39 Takashi Iwai napsal(a):
> On Fri, 15 May 2020 11:30:54 +0200,
> Jaroslav Kysela wrote:
>>
>> Dne 15. 05. 20 v 11:04 Lu, Brent napsal(a):
>>>>
>>>> Is this a bugfix needed for older kernels as well?  When did this issue show
>>>> up?
>>>>
>>>> thanks,
>>>>
>>>> greg k-h
>>>
>>> It happens when DMA stop moving data from host to DSP/DAI for a long time
>>> (> half of buffer time). I know host driver should do something about it. But if
>>> not, the HWSYNC will keep increasing the hw_base and hw_ptr and confuses
>>> user space program.
>>
>> I'm afraid, but with this code, you turn off the hw_ptr jiffies
>> code. It would be better to fix the driver in this case (return the
>> updated / estimated DMA pointer, increase DMA buffer size etc.). This
>> "lag" is unacceptable.
> 
> The problem is obviously in the driver's side and it's best to be
> addressed there.  But, I think it's still worth to apply this change.
> 
> The hw_ptr jiffies check is performed basically in two places: one is
> snd_pcm_period_elapsed() call from ISR, and another is with the
> no_period_wakeup flag.  In both cases, it calculates the diff of
> jiffies from the previous update, and corrects the hw_ptr_base if that
> exceeds the threshold.
> 
> And the bug here is that the "previous" jiffies is kept as long as the
> hwptr itself is updated.  What we need is the correction of the base
> when it really has processed the period size; i.e. hwptr got the same
> value (with no_period_wakeup) and yet the jiffies diff is big.  For
> this check, it's correct to update hw_ptr_jiffies at each call no
> matter whether hwptr moved or not; we need to evaluate from the
> previous update, after all.
> 
> But I might overlook something.  Jaroslav, could you check it again?
> The jiffies check code is your black magic :)

I tried to imagine a negative impact for this hw_ptr_jiffies update when the 
DMA position is not updated from the driver and I haven't found any so far.

Let's apply this and we'll see in future :-)

And yes, the patch description should be improved (DMA ptr is not updated / 
streaming is inactive).

Reviewed-by: Jaroslav Kysela <perex@perex.cz>

> 
> 
> thanks,
> 
> Takashi
>
Brent Lu May 18, 2020, 4:38 a.m. UTC | #9
> 
> I tried to imagine a negative impact for this hw_ptr_jiffies update when the
> DMA position is not updated from the driver and I haven't found any so far.
> 
> Let's apply this and we'll see in future :-)
> 
> And yes, the patch description should be improved (DMA ptr is not updated /
> streaming is inactive).
> 
> Reviewed-by: Jaroslav Kysela <perex@perex.cz>

Hi Jaroslav and Takashi,

Thank you for the review. Patch V2 has been uploaded.


Regards,
Brent
> 
> >
> >
> > thanks,
> >
> > Takashi
> >
> 
> 
> --
> Jaroslav Kysela <perex@perex.cz>
> Linux Sound Maintainer; ALSA Project; Red Hat, Inc.
diff mbox series

Patch

diff --git a/sound/core/pcm_lib.c b/sound/core/pcm_lib.c
index 872a852..d531e1b 100644
--- a/sound/core/pcm_lib.c
+++ b/sound/core/pcm_lib.c
@@ -433,6 +433,7 @@  static int snd_pcm_update_hw_ptr0(struct snd_pcm_substream *substream,
 
  no_delta_check:
 	if (runtime->status->hw_ptr == new_hw_ptr) {
+		runtime->hw_ptr_jiffies = curr_jiffies;
 		update_audio_tstamp(substream, &curr_tstamp, &audio_tstamp);
 		return 0;
 	}