diff mbox series

ASoC: Intel: Atom: use hardware counter to update hw_ptr

Message ID 1595779727-31404-1-git-send-email-brent.lu@intel.com (mailing list archive)
State New, archived
Headers show
Series ASoC: Intel: Atom: use hardware counter to update hw_ptr | expand

Commit Message

Brent Lu July 26, 2020, 4:08 p.m. UTC
The ring buffer counter runs faster than hardware counter if the
period size in hw_param is larger than 240. Although the differce is
not much (around 2k frames), it causes false underrun in CRAS
sometimes because it's using 256 frames as period size in hw_param.
Using the hardware counter could provide precise hw_ptr to user space
and avoid the false underrun in CRAS.

Signed-off-by: Brent Lu <brent.lu@intel.com>
---
 sound/soc/intel/atom/sst/sst_drv_interface.c | 15 +++------------
 1 file changed, 3 insertions(+), 12 deletions(-)

Comments

Pierre-Louis Bossart July 27, 2020, 2:09 p.m. UTC | #1
On 7/26/20 11:08 AM, Brent Lu wrote:
> The ring buffer counter runs faster than hardware counter if the
> period size in hw_param is larger than 240. Although the differce is
> not much (around 2k frames), it causes false underrun in CRAS
> sometimes because it's using 256 frames as period size in hw_param.

All the Atom firmware assumes data chunks in multiples of 1ms (typically 
5, 10 or 20ms). I have never seen anyone use 256 frames, that's asking 
for trouble really.

it's actually the same with Skylake and SOF in most cases.

Is this a 'real' problem or a problem detected by the Chrome ALSA 
compliance tests, in the latter case that would hint at a too generic 
value of min_period.

> Using the hardware counter could provide precise hw_ptr to user space
> and avoid the false underrun in CRAS.
> 
> Signed-off-by: Brent Lu <brent.lu@intel.com>
> ---
>   sound/soc/intel/atom/sst/sst_drv_interface.c | 15 +++------------
>   1 file changed, 3 insertions(+), 12 deletions(-)
> 
> diff --git a/sound/soc/intel/atom/sst/sst_drv_interface.c b/sound/soc/intel/atom/sst/sst_drv_interface.c
> index 7624953..1949ad9 100644
> --- a/sound/soc/intel/atom/sst/sst_drv_interface.c
> +++ b/sound/soc/intel/atom/sst/sst_drv_interface.c
> @@ -485,7 +485,6 @@ static inline int sst_calc_tstamp(struct intel_sst_drv *ctx,
>   		struct snd_pcm_substream *substream,
>   		struct snd_sst_tstamp *fw_tstamp)
>   {
> -	size_t delay_bytes, delay_frames;
>   	size_t buffer_sz;
>   	u32 pointer_bytes, pointer_samples;
>   
> @@ -493,22 +492,14 @@ static inline int sst_calc_tstamp(struct intel_sst_drv *ctx,
>   			fw_tstamp->ring_buffer_counter);
>   	dev_dbg(ctx->dev, "mrfld hardware_counter %llu in bytes\n",
>   			 fw_tstamp->hardware_counter);
> -	if (substream->stream == SNDRV_PCM_STREAM_PLAYBACK)
> -		delay_bytes = (size_t) (fw_tstamp->ring_buffer_counter -
> -					fw_tstamp->hardware_counter);
> -	else
> -		delay_bytes = (size_t) (fw_tstamp->hardware_counter -
> -					fw_tstamp->ring_buffer_counter);
> -	delay_frames = bytes_to_frames(substream->runtime, delay_bytes);
> +
>   	buffer_sz = snd_pcm_lib_buffer_bytes(substream);
> -	div_u64_rem(fw_tstamp->ring_buffer_counter, buffer_sz, &pointer_bytes);
> +	div_u64_rem(fw_tstamp->hardware_counter, buffer_sz, &pointer_bytes);
>   	pointer_samples = bytes_to_samples(substream->runtime, pointer_bytes);
>   
> -	dev_dbg(ctx->dev, "pcm delay %zu in bytes\n", delay_bytes);
> -
>   	info->buffer_ptr = pointer_samples / substream->runtime->channels;
> +	info->pcm_delay = 0;

and that seems also wrong? Why would the delay be zero?

> -	info->pcm_delay = delay_frames;
>   	dev_dbg(ctx->dev, "buffer ptr %llu pcm_delay rep: %llu\n",
>   			info->buffer_ptr, info->pcm_delay);
>   	return 0;
>
Brent Lu July 28, 2020, 2:28 a.m. UTC | #2
> 
> All the Atom firmware assumes data chunks in multiples of 1ms (typically 5,
> 10 or 20ms). I have never seen anyone use 256 frames, that's asking for
> trouble really.
> 
> it's actually the same with Skylake and SOF in most cases.
> 
> Is this a 'real' problem or a problem detected by the Chrome ALSA
> compliance tests, in the latter case that would hint at a too generic value of
> min_period.
> 

I've told them 240 is more reasonable since the sample rate is 48000 and our
android bsp also uses 240 for multimedia use case for many years but they don't
want to change the CRAS setting for some reason.

Google says it's a real issue for them: "The driver consumes frames quickly at the
beginning will make CRAS underrun because CRAS fills samples in the fixed rate."

Currently they implement constraint in machine driver of atom machines to force
240 period size so CRAS is using 240 for atom platforms and 256 for other big cores.

I'm curious why not just using hardware counter to update hw_ptr and get rid of
the period setting in hw_param? It seems to me the ring buffer counter does not
reflect the real status.


Regards,
Brent

> 
> and that seems also wrong? Why would the delay be zero?
> 

info->pcm_delay is the difference between ring buffer counter and hardware
counter. Because the ring buffer counter (hw_ptr) is running faster then it should,
so we add the info->pcm_delay to substream->runtime->delay as compensation.

Therefore, application could use snd_pcm_delay() to get the actual frame number
which are still in buffer.

snd_pcm_delay() = buffer_size - snd_pcm_avail() + runtime->delay

We don't need pcm_delay to compensate anything if using hardware counter.


> > -	info->pcm_delay = delay_frames;
> >   	dev_dbg(ctx->dev, "buffer ptr %llu pcm_delay rep: %llu\n",
> >   			info->buffer_ptr, info->pcm_delay);
> >   	return 0;
> >
Pierre-Louis Bossart July 28, 2020, 12:06 p.m. UTC | #3
On 7/27/20 9:28 PM, Lu, Brent wrote:
>>
>> All the Atom firmware assumes data chunks in multiples of 1ms (typically 5,
>> 10 or 20ms). I have never seen anyone use 256 frames, that's asking for
>> trouble really.
>>
>> it's actually the same with Skylake and SOF in most cases.
>>
>> Is this a 'real' problem or a problem detected by the Chrome ALSA
>> compliance tests, in the latter case that would hint at a too generic value of
>> min_period.
>>
> 
> I've told them 240 is more reasonable since the sample rate is 48000 and our
> android bsp also uses 240 for multimedia use case for many years but they don't
> want to change the CRAS setting for some reason.
> 
> Google says it's a real issue for them: "The driver consumes frames quickly at the
> beginning will make CRAS underrun because CRAS fills samples in the fixed rate."
> 
> Currently they implement constraint in machine driver of atom machines to force
> 240 period size so CRAS is using 240 for atom platforms and 256 for other big cores.

So if there are already quirks in atom machine drivers to change the 
period size, why is this patch necessary?

> I'm curious why not just using hardware counter to update hw_ptr and get rid of
> the period setting in hw_param? It seems to me the ring buffer counter does not
> reflect the real status.

I don't recall precisely what this hardware counter does. I vaguely 
recall it's tied to the 19.2MHz external timer which is also used to 
schedule the 1ms SBA mixer and the SSP IOs. And by comparing with the 
ring buffer pointer you can infer the delay inside the DSP. I think you 
are also making an assumption that all streams are tied to the output 
rate, but that's most likely a bad assumption. The hard-coded topology 
supported media, speech and compressed data and the consumption rate on 
the DMA side could be faster with some buffering happening in the DSP. 
It's not a passthrough DMA in all cases.

This is really legacy code that no one really fully understands nor 
plans on improving, it'd be a bad idea to change the pcm pointer reports 
now, 6 years after the initial code release and after all initial 
contributors moved on. It's what it is.

>> and that seems also wrong? Why would the delay be zero?
>>
> 
> info->pcm_delay is the difference between ring buffer counter and hardware
> counter. Because the ring buffer counter (hw_ptr) is running faster then it should,
> so we add the info->pcm_delay to substream->runtime->delay as compensation.
> 
> Therefore, application could use snd_pcm_delay() to get the actual frame number
> which are still in buffer.
> 
> snd_pcm_delay() = buffer_size - snd_pcm_avail() + runtime->delay
> 
> We don't need pcm_delay to compensate anything if using hardware counter.

If you force info->pcm_delay to be zero, then runtime->delay is also zero:

see sst_soc_pointer():
	substream->runtime->delay = str_info->pcm_delay;

>>> -	info->pcm_delay = delay_frames;
Brent Lu July 28, 2020, 5:02 p.m. UTC | #4
> 
> So if there are already quirks in atom machine drivers to change the period
> size, why is this patch necessary?
> 

The story is: google implemented the constraint but doesn't know why it works
so asked us to explain. After checking the two counters I realized the increase of
ring buffer pointer follows the period size setting in hw_param (256) but the
period of interrupt is always 5ms instead of 5.33 so it's running little bit too fast.
It seems the LPE keeps tracking the difference of two counters. When the
difference exceeds 2160 samples, the next interrupt will be canceled so the
hardware counter could catch up a little.

[   43.208299] intel_sst_acpi 808622A8:00: mrfld ring_buffer_counter 107520 hardware_counter 98880 pcm delay 8640 (in bytes)
[   43.208306] intel_sst_acpi 808622A8:00: buffer ptr 26880 pcm_delay rep: 2160
[   43.208321] sound pcmC1D0p: [Q] pos 26880 hw_ptr 26880 appl_ptr 40000 avail 191680
=> one interrupt is skipped.
[   43.218299] intel_sst_acpi 808622A8:00: mrfld ring_buffer_counter 108544 hardware_counter 100800 pcm delay 7744 (in bytes)
[   43.218307] intel_sst_acpi 808622A8:00: buffer ptr 27136 pcm_delay rep: 1936
[   43.218336] sound pcmC1D0p: [Q] pos 27136 hw_ptr 27136 appl_ptr 40000 avail 191936

So I think why not using the hardware counter? It increases 240 samples every 5ms
perfectly match the 48000 sample rate. The test result is good but I know there must
be a reason for the original designer to use ring buffer counter instead of hardware
counter. I uploaded this patch to see if anyone still remember the reason and share
some insight with me.

I totally agree that we shouldn't touch this part of design. Do you think it make sense
to add a constraint to enforce the period size in machine driver? If yes then I would
upload patches for Chrome atom machines for google.


Regards,
Brent

> > I'm curious why not just using hardware counter to update hw_ptr and
> > get rid of the period setting in hw_param? It seems to me the ring
> > buffer counter does not reflect the real status.
> 
> I don't recall precisely what this hardware counter does. I vaguely recall it's
> tied to the 19.2MHz external timer which is also used to schedule the 1ms
> SBA mixer and the SSP IOs. And by comparing with the ring buffer pointer
> you can infer the delay inside the DSP. I think you are also making an
> assumption that all streams are tied to the output rate, but that's most likely
> a bad assumption. The hard-coded topology supported media, speech and
> compressed data and the consumption rate on the DMA side could be faster
> with some buffering happening in the DSP.
> It's not a passthrough DMA in all cases.
> 
> This is really legacy code that no one really fully understands nor plans on
> improving, it'd be a bad idea to change the pcm pointer reports now, 6 years
> after the initial code release and after all initial contributors moved on. It's
> what it is.
>
Pierre-Louis Bossart July 28, 2020, 5:30 p.m. UTC | #5
On 7/28/20 12:02 PM, Lu, Brent wrote:
>>
>> So if there are already quirks in atom machine drivers to change the period
>> size, why is this patch necessary?
>>
> 
> The story is: google implemented the constraint but doesn't know why it works
> so asked us to explain. After checking the two counters I realized the increase of
> ring buffer pointer follows the period size setting in hw_param (256) but the
> period of interrupt is always 5ms instead of 5.33 so it's running little bit too fast.
> It seems the LPE keeps tracking the difference of two counters. When the
> difference exceeds 2160 samples, the next interrupt will be canceled so the
> hardware counter could catch up a little.
> 
> [   43.208299] intel_sst_acpi 808622A8:00: mrfld ring_buffer_counter 107520 hardware_counter 98880 pcm delay 8640 (in bytes)
> [   43.208306] intel_sst_acpi 808622A8:00: buffer ptr 26880 pcm_delay rep: 2160
> [   43.208321] sound pcmC1D0p: [Q] pos 26880 hw_ptr 26880 appl_ptr 40000 avail 191680
> => one interrupt is skipped.
> [   43.218299] intel_sst_acpi 808622A8:00: mrfld ring_buffer_counter 108544 hardware_counter 100800 pcm delay 7744 (in bytes)
> [   43.218307] intel_sst_acpi 808622A8:00: buffer ptr 27136 pcm_delay rep: 1936
> [   43.218336] sound pcmC1D0p: [Q] pos 27136 hw_ptr 27136 appl_ptr 40000 avail 191936
> 
> So I think why not using the hardware counter? It increases 240 samples every 5ms
> perfectly match the 48000 sample rate. The test result is good but I know there must
> be a reason for the original designer to use ring buffer counter instead of hardware
> counter. I uploaded this patch to see if anyone still remember the reason and share
> some insight with me.
> 
> I totally agree that we shouldn't touch this part of design. Do you think it make sense
> to add a constraint to enforce the period size in machine driver? If yes then I would
> upload patches for Chrome atom machines for google.

I think it'd make sense to add this constraint, either in the machine 
driver or in the platform driver, so that we don't change the position 
updates and introduce more issues by accident by doing so. As you 
rightly said, I don't think anyone tested periods multiple of 256 
samples so it's not a regression, more aligning with the internal design.
Cheng-yi Chiang July 29, 2020, 5:06 a.m. UTC | #6
On Wed, Jul 29, 2020 at 1:31 AM Pierre-Louis Bossart
<pierre-louis.bossart@linux.intel.com> wrote:
>
>
>
> On 7/28/20 12:02 PM, Lu, Brent wrote:
> >>
> >> So if there are already quirks in atom machine drivers to change the period
> >> size, why is this patch necessary?
> >>
> >
> > The story is: google implemented the constraint but doesn't know why it works
> > so asked us to explain. After checking the two counters I realized the increase of
> > ring buffer pointer follows the period size setting in hw_param (256) but the
> > period of interrupt is always 5ms instead of 5.33 so it's running little bit too fast.
> > It seems the LPE keeps tracking the difference of two counters. When the
> > difference exceeds 2160 samples, the next interrupt will be canceled so the
> > hardware counter could catch up a little.
> >
> > [   43.208299] intel_sst_acpi 808622A8:00: mrfld ring_buffer_counter 107520 hardware_counter 98880 pcm delay 8640 (in bytes)
> > [   43.208306] intel_sst_acpi 808622A8:00: buffer ptr 26880 pcm_delay rep: 2160
> > [   43.208321] sound pcmC1D0p: [Q] pos 26880 hw_ptr 26880 appl_ptr 40000 avail 191680
> > => one interrupt is skipped.
> > [   43.218299] intel_sst_acpi 808622A8:00: mrfld ring_buffer_counter 108544 hardware_counter 100800 pcm delay 7744 (in bytes)
> > [   43.218307] intel_sst_acpi 808622A8:00: buffer ptr 27136 pcm_delay rep: 1936
> > [   43.218336] sound pcmC1D0p: [Q] pos 27136 hw_ptr 27136 appl_ptr 40000 avail 191936
> >
> > So I think why not using the hardware counter? It increases 240 samples every 5ms
> > perfectly match the 48000 sample rate. The test result is good but I know there must
> > be a reason for the original designer to use ring buffer counter instead of hardware
> > counter. I uploaded this patch to see if anyone still remember the reason and share
> > some insight with me.
> >
> > I totally agree that we shouldn't touch this part of design. Do you think it make sense
> > to add a constraint to enforce the period size in machine driver? If yes then I would
> > upload patches for Chrome atom machines for google.
>
> I think it'd make sense to add this constraint, either in the machine
> driver or in the platform driver, so that we don't change the position
> updates and introduce more issues by accident by doing so. As you
> rightly said, I don't think anyone tested periods multiple of 256
> samples so it's not a regression, more aligning with the internal design.

Thanks for the explanation and discussion.
A slight correction in this thread is that CRAS does not set period size to 256.
It just uses whatever value that driver wants to use, which happens to be 256.
If the driver can limit the selection to only correct values that
would be the best.
Adding constraints in the machine driver or platform driver looks
good. Prefer to adding in platform driver so we don't need to add for
each machine driver.
Thanks!
diff mbox series

Patch

diff --git a/sound/soc/intel/atom/sst/sst_drv_interface.c b/sound/soc/intel/atom/sst/sst_drv_interface.c
index 7624953..1949ad9 100644
--- a/sound/soc/intel/atom/sst/sst_drv_interface.c
+++ b/sound/soc/intel/atom/sst/sst_drv_interface.c
@@ -485,7 +485,6 @@  static inline int sst_calc_tstamp(struct intel_sst_drv *ctx,
 		struct snd_pcm_substream *substream,
 		struct snd_sst_tstamp *fw_tstamp)
 {
-	size_t delay_bytes, delay_frames;
 	size_t buffer_sz;
 	u32 pointer_bytes, pointer_samples;
 
@@ -493,22 +492,14 @@  static inline int sst_calc_tstamp(struct intel_sst_drv *ctx,
 			fw_tstamp->ring_buffer_counter);
 	dev_dbg(ctx->dev, "mrfld hardware_counter %llu in bytes\n",
 			 fw_tstamp->hardware_counter);
-	if (substream->stream == SNDRV_PCM_STREAM_PLAYBACK)
-		delay_bytes = (size_t) (fw_tstamp->ring_buffer_counter -
-					fw_tstamp->hardware_counter);
-	else
-		delay_bytes = (size_t) (fw_tstamp->hardware_counter -
-					fw_tstamp->ring_buffer_counter);
-	delay_frames = bytes_to_frames(substream->runtime, delay_bytes);
+
 	buffer_sz = snd_pcm_lib_buffer_bytes(substream);
-	div_u64_rem(fw_tstamp->ring_buffer_counter, buffer_sz, &pointer_bytes);
+	div_u64_rem(fw_tstamp->hardware_counter, buffer_sz, &pointer_bytes);
 	pointer_samples = bytes_to_samples(substream->runtime, pointer_bytes);
 
-	dev_dbg(ctx->dev, "pcm delay %zu in bytes\n", delay_bytes);
-
 	info->buffer_ptr = pointer_samples / substream->runtime->channels;
+	info->pcm_delay = 0;
 
-	info->pcm_delay = delay_frames;
 	dev_dbg(ctx->dev, "buffer ptr %llu pcm_delay rep: %llu\n",
 			info->buffer_ptr, info->pcm_delay);
 	return 0;