[3/3] ASoC: max98090: fix possible race conditions

Message ID	20191120060256.212818-4-tzungbi@google.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <SRS0=+StP=ZM=alsa-project.org=alsa-devel-bounces@kernel.org> DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org CC51322461 Date: Wed, 20 Nov 2019 14:02:56 +0800 In-Reply-To: <20191120060256.212818-1-tzungbi@google.com> Message-Id: <20191120060256.212818-4-tzungbi@google.com> Mime-Version: 1.0 References: <20191120060256.212818-1-tzungbi@google.com> From: Tzung-Bi Shih <tzungbi@google.com> To: broonie@kernel.org Cc: tzungbi@google.com, alsa-devel@alsa-project.org, dgreid@google.com, cychiang@google.com Subject: [alsa-devel] [PATCH 3/3] ASoC: max98090: fix possible race conditions Precedence: list Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Errors-To: alsa-devel-bounces@alsa-project.org Sender: "Alsa-devel" <alsa-devel-bounces@alsa-project.org>
Series	ASoC: max98090: fix PLL unlocked workaround-related \| expand [0/3] ASoC: max98090: fix PLL unlocked workaround-related [1/3] ASoC: max98090: remove msleep in PLL unlocked workaround [2/3] ASoC: max98090: exit workaround earlier if PLL is locked [3/3] ASoC: max98090: fix possible race conditions

Tzung-Bi Shih Nov. 20, 2019, 6:02 a.m. UTC

max98090_interrupt() and max98090_pll_work() run in 2 different threads.
There are 2 possible races:

Note: M98090_REG_DEVICE_STATUS = 0x01.
Note: ULK == 0, PLL is locked; ULK == 1, PLL is unlocked.

max98090_interrupt      max98090_pll_work
----------------------------------------------
schedule max98090_pll_work
                        restart max98090 codec
receive ULK INT
                        assert ULK == 0
schedule max98090_pll_work (1).

In the case (1), the PLL is locked but max98090_interrupt unnecessarily
schedules another max98090_pll_work.

max98090_interrupt      max98090_pll_work      max98090 codec
----------------------------------------------------------------------
                                               ULK = 1
receive ULK INT
read 0x01
                                               ULK = 0 (clear on read)
schedule max98090_pll_work
                        restart max98090 codec
                                               ULK = 1
receive ULK INT
read 0x01
                                               ULK = 0 (clear on read)
                        read 0x1
                        assert ULK == 0 (2).

In the case (2), both max98090_interrupt and max98090_pll_work read
the same clear-on-read register.  max98090_pll_work would falsely
thought PLL is locked.

There are 2 possible options:
A. turn off ULK interrupt before scheduling max98090_pll_work; and turn
on again before exiting max98090_pll_work.
B. remove the second thread of execution.

Adopts option B which is more straightforward.

Signed-off-by: Tzung-Bi Shih <tzungbi@google.com>
---
 sound/soc/codecs/max98090.c | 8 ++------
 sound/soc/codecs/max98090.h | 1 -
 2 files changed, 2 insertions(+), 7 deletions(-)

Pierre-Louis Bossart Nov. 20, 2019, 2:32 p.m. UTC | #1

On 11/20/19 12:02 AM, Tzung-Bi Shih wrote:
> max98090_interrupt() and max98090_pll_work() run in 2 different threads.
> There are 2 possible races:
> 
> Note: M98090_REG_DEVICE_STATUS = 0x01.
> Note: ULK == 0, PLL is locked; ULK == 1, PLL is unlocked.
> 
> max98090_interrupt      max98090_pll_work
> ----------------------------------------------
> schedule max98090_pll_work
>                          restart max98090 codec
> receive ULK INT
>                          assert ULK == 0
> schedule max98090_pll_work (1).
> 
> In the case (1), the PLL is locked but max98090_interrupt unnecessarily
> schedules another max98090_pll_work.

if you re-test that the PLL is already running, then you can exit the 
work function immediately without redoing the sequence?

maybe also play with the masks so that the PLL unlock is masked in the 
interrupt and unmasked after the PLL locks?

> 
> max98090_interrupt      max98090_pll_work      max98090 codec
> ----------------------------------------------------------------------
>                                                 ULK = 1
> receive ULK INT
> read 0x01
>                                                 ULK = 0 (clear on read)
> schedule max98090_pll_work
>                          restart max98090 codec
>                                                 ULK = 1
> receive ULK INT
> read 0x01
>                                                 ULK = 0 (clear on read)
>                          read 0x1
>                          assert ULK == 0 (2).

what are those 0x01 and 0x1? is the second a typo possibly?

> 
> In the case (2), both max98090_interrupt and max98090_pll_work read
> the same clear-on-read register.  max98090_pll_work would falsely
> thought PLL is locked.
> 
> There are 2 possible options:
> A. turn off ULK interrupt before scheduling max98090_pll_work; and turn
> on again before exiting max98090_pll_work.
> B. remove the second thread of execution.
> 
> Adopts option B which is more straightforward.

but has the side effect of possibly adding a 10ms delay in the interrupt 
thread?

Tzung-Bi Shih Nov. 21, 2019, 2:19 a.m. UTC | #2

On Thu, Nov 21, 2019 at 12:10 AM Pierre-Louis Bossart
<pierre-louis.bossart@linux.intel.com> wrote:
> On 11/20/19 12:02 AM, Tzung-Bi Shih wrote:
> > max98090_interrupt() and max98090_pll_work() run in 2 different threads.
> > There are 2 possible races:
> >
> > Note: M98090_REG_DEVICE_STATUS = 0x01.
> > Note: ULK == 0, PLL is locked; ULK == 1, PLL is unlocked.
> >
> > max98090_interrupt      max98090_pll_work
> > ----------------------------------------------
> > schedule max98090_pll_work
> >                          restart max98090 codec
> > receive ULK INT
> >                          assert ULK == 0
> > schedule max98090_pll_work (1).
> >
> > In the case (1), the PLL is locked but max98090_interrupt unnecessarily
> > schedules another max98090_pll_work.
>
> if you re-test that the PLL is already running, then you can exit the
> work function immediately without redoing the sequence?

ACK.

> maybe also play with the masks so that the PLL unlock is masked in the
> interrupt and unmasked after the PLL locks?

ACK, this is our option A mentioned below.

> > max98090_interrupt      max98090_pll_work      max98090 codec
> > ----------------------------------------------------------------------
> >                                                 ULK = 1
> > receive ULK INT
> > read 0x01
> >                                                 ULK = 0 (clear on read)
> > schedule max98090_pll_work
> >                          restart max98090 codec
> >                                                 ULK = 1
> > receive ULK INT
> > read 0x01
> >                                                 ULK = 0 (clear on read)
> >                          read 0x1
> >                          assert ULK == 0 (2).
>
> what are those 0x01 and 0x1? is the second a typo possibly?

ACK, a typo.

> > In the case (2), both max98090_interrupt and max98090_pll_work read
> > the same clear-on-read register.  max98090_pll_work would falsely
> > thought PLL is locked.
> >
> > There are 2 possible options:
> > A. turn off ULK interrupt before scheduling max98090_pll_work; and turn
> > on again before exiting max98090_pll_work.
> > B. remove the second thread of execution.
> >
> > Adopts option B which is more straightforward.
>
> but has the side effect of possibly adding a 10ms delay in the interrupt
> thread?

(forgot to mention) Option A cannot fix the case (2) race condition:
there would be 2 threads read the same clear-on-read register.  In
theory, the hardware should faster than CPUs' accesses via I2C.
max98090 should returns ULK=1 any time if PLL is unlocked.  Shall we
ignore the case (2) and adopt option A?

Pierre-Louis Bossart Nov. 21, 2019, 3:20 a.m. UTC | #3

> 
>>> max98090_interrupt      max98090_pll_work      max98090 codec
>>> ----------------------------------------------------------------------
>>>                                                  ULK = 1
>>> receive ULK INT
>>> read 0x01
>>>                                                  ULK = 0 (clear on read)
>>> schedule max98090_pll_work
>>>                           restart max98090 codec
>>>                                                  ULK = 1
>>> receive ULK INT
>>> read 0x01
>>>                                                  ULK = 0 (clear on read)
>>>                           read 0x1
>>>                           assert ULK == 0 (2).
>>
>> what are those 0x01 and 0x1? is the second a typo possibly?
> 
> ACK, a typo.
> 
>>> In the case (2), both max98090_interrupt and max98090_pll_work read
>>> the same clear-on-read register.  max98090_pll_work would falsely
>>> thought PLL is locked.
>>>
>>> There are 2 possible options:
>>> A. turn off ULK interrupt before scheduling max98090_pll_work; and turn
>>> on again before exiting max98090_pll_work.
>>> B. remove the second thread of execution.
>>>
>>> Adopts option B which is more straightforward.
>>
>> but has the side effect of possibly adding a 10ms delay in the interrupt
>> thread?
> 
> (forgot to mention) Option A cannot fix the case (2) race condition:
> there would be 2 threads read the same clear-on-read register.  In
> theory, the hardware should faster than CPUs' accesses via I2C.
> max98090 should returns ULK=1 any time if PLL is unlocked.  Shall we
> ignore the case (2) and adopt option A?

I don't have any specific recommendation, just trying to follow your 
line of thought on a problem that bugged be in the past. If your tests 
shows that the extra delay is fine, then it's good progress

Still you may want to clarify your second point and the commit message. 
The first race condition was clear, the second one not so much.

what did you mean with 'assert ULK == 0' and what does this map to in 
pll_work

the code looks as this:

static void max98090_pll_work(struct work_struct *work)
{
	if (!snd_soc_component_is_active(component))
		return;

	/* Toggle shutdown OFF then ON */
	snd_soc_component_update_bits(component, M98090_REG_DEVICE_SHUTDOWN,
			    M98090_SHDNN_MASK, 0);

	snd_soc_component_update_bits(component, M98090_REG_DEVICE_SHUTDOWN,
			    M98090_SHDNN_MASK, M98090_SHDNN_MASK);

	/* Give PLL time to lock */
	msleep(10);
}

so in what does the race occur?

or was it race with the other work queue, which starts like this:

static void max98090_pll_det_enable_work(struct work_struct *work)
{
	struct max98090_priv *max98090 =
		container_of(work, struct max98090_priv,
			     pll_det_enable_work.work);
	struct snd_soc_component *component = max98090->component;
	unsigned int status, mask;

	/*
	 * Clear status register in order to clear possibly already occurred
	 * PLL unlock. If PLL hasn't still locked, the status will be set
	 * again and PLL unlock interrupt will occur.
	 * Note this will clear all status bits
	 */
	regmap_read(max98090->regmap, M98090_REG_DEVICE_STATUS, &status);

So here indeed there is the same read, but what is the consequence of 
the read?
	
did you mean that in the pll_det_enable_work the value of 'status' may 
be incorrect as it was cleared already?

But the interrupt does not schedule the pll_det_enable_work(), it 
happens on a trigger, so how would the race happen?

it'd be good if you provide more details on the flow so that all 
reviewers get the ideas, it's a good opportunity to fix this for good 
and move on.

Tzung-Bi Shih Nov. 21, 2019, 6:17 a.m. UTC | #4

On Thu, Nov 21, 2019 at 11:20 AM Pierre-Louis Bossart
<pierre-louis.bossart@linux.intel.com> wrote:
> Still you may want to clarify your second point and the commit message.
> The first race condition was clear, the second one not so much.
>
> what did you mean with 'assert ULK == 0' and what does this map to in
> pll_work

The case (2) race condition is based on the previous patch:
https://mailman.alsa-project.org/pipermail/alsa-devel/2019-November/158852.html

After applying the patch, the code would be something like: (in the
case, two threads read 0x01.)
static void max98090_pll_work(struct work_struct *work)
{
[snip]
        snd_soc_component_update_bits(component, M98090_REG_DEVICE_SHUTDOWN,
                            M98090_SHDNN_MASK, 0);
        snd_soc_component_update_bits(component, M98090_REG_DEVICE_SHUTDOWN,
                            M98090_SHDNN_MASK, M98090_SHDNN_MASK);

        for (i = 0; i < 10; ++i) {
                /* Check lock status */
                pll = snd_soc_component_read32(
                                component, M98090_REG_DEVICE_STATUS);
                if (!(pll & M98090_ULK_MASK))
                        break;

                /* Give PLL time to lock */
                usleep_range(1000, 1200);
        }
}

I guess we have 2 choices:
i. stick to the original plan: option B, remove the second thread of execution.
In the case, we would be punished by at most 10~12ms delay in the
interrupt handler thread.  I browsed the max98090_interrupt() and tend
to think it should be fine if we delay jack detection for extra 10ms.

ii. adopt option A, play with the masks; and drop the previous patch
to avoid multiple threads access 0x01
We would be forced to wait >10ms before re-enabling ULK interrupt.
Notably, Documentation/timers/timers-howto.txt states "msleep(1~20)
may not do what the caller intends, and will often sleep
longer".Documentation/timers/timers-howto.txt".

Either way should be fine.  Which one would you suggest?

Pierre-Louis Bossart Nov. 21, 2019, 3:08 p.m. UTC | #5

On 11/21/19 12:17 AM, Tzung-Bi Shih wrote:
> On Thu, Nov 21, 2019 at 11:20 AM Pierre-Louis Bossart
> <pierre-louis.bossart@linux.intel.com> wrote:
>> Still you may want to clarify your second point and the commit message.
>> The first race condition was clear, the second one not so much.
>>
>> what did you mean with 'assert ULK == 0' and what does this map to in
>> pll_work
> 
> The case (2) race condition is based on the previous patch:
> https://mailman.alsa-project.org/pipermail/alsa-devel/2019-November/158852.html

ah, ok, so that's a race you introduced :-)

> 
> After applying the patch, the code would be something like: (in the
> case, two threads read 0x01.)
> static void max98090_pll_work(struct work_struct *work)
> {
> [snip]
>          snd_soc_component_update_bits(component, M98090_REG_DEVICE_SHUTDOWN,
>                              M98090_SHDNN_MASK, 0);
>          snd_soc_component_update_bits(component, M98090_REG_DEVICE_SHUTDOWN,
>                              M98090_SHDNN_MASK, M98090_SHDNN_MASK);
> 
>          for (i = 0; i < 10; ++i) {
>                  /* Check lock status */
>                  pll = snd_soc_component_read32(
>                                  component, M98090_REG_DEVICE_STATUS);
>                  if (!(pll & M98090_ULK_MASK))
>                          break;
> 
>                  /* Give PLL time to lock */
>                  usleep_range(1000, 1200);
>          }
> }

Sorry, I don't see at what point the race happens.

if your interrupt sees an ULK status, it will schedule the pll_work.
You only test the status again *after* switching the device on/off, so 
what is the side effect?

I am also wondering if you shouldn't sleep first in the loop, then check 
the status?

And in case you do have a side effect due to the clear on read, maybe 
you could save the status in a context and retest in the workqueue, that 
way you'd have a trace of the ULK event even if the register was cleared.

Tzung-Bi Shih Nov. 21, 2019, 4:49 p.m. UTC | #6

On Thu, Nov 21, 2019 at 11:41 PM Pierre-Louis Bossart
<pierre-louis.bossart@linux.intel.com> wrote:
> On 11/21/19 12:17 AM, Tzung-Bi Shih wrote:
> > On Thu, Nov 21, 2019 at 11:20 AM Pierre-Louis Bossart
> > <pierre-louis.bossart@linux.intel.com> wrote:
> >> Still you may want to clarify your second point and the commit message.
> >> The first race condition was clear, the second one not so much.
> >>
> >> what did you mean with 'assert ULK == 0' and what does this map to in
> >> pll_work
> >
> > The case (2) race condition is based on the previous patch:
> > https://mailman.alsa-project.org/pipermail/alsa-devel/2019-November/158852.html
>
> ah, ok, so that's a race you introduced :-)
>
> >
> > After applying the patch, the code would be something like: (in the
> > case, two threads read 0x01.)
> > static void max98090_pll_work(struct work_struct *work)
> > {
> > [snip]
> >          snd_soc_component_update_bits(component, M98090_REG_DEVICE_SHUTDOWN,
> >                              M98090_SHDNN_MASK, 0);
> >          snd_soc_component_update_bits(component, M98090_REG_DEVICE_SHUTDOWN,
> >                              M98090_SHDNN_MASK, M98090_SHDNN_MASK);
> >
> >          for (i = 0; i < 10; ++i) {
> >                  /* Check lock status */
> >                  pll = snd_soc_component_read32(
> >                                  component, M98090_REG_DEVICE_STATUS);
> >                  if (!(pll & M98090_ULK_MASK))
> >                          break;
> >
> >                  /* Give PLL time to lock */
> >                  usleep_range(1000, 1200);
> >          }
> > }
>
> Sorry, I don't see at what point the race happens.
>
> if your interrupt sees an ULK status, it will schedule the pll_work.
> You only test the status again *after* switching the device on/off, so
> what is the side effect?

Both interrupt thread and pll_work thread access the status register.
The side effect is: interrupt thread could read the status register
right before pll_work thread's read.

> I am also wondering if you shouldn't sleep first in the loop, then check
> the status?

ACK, "sleep first and check the status" makes more sense.

> And in case you do have a side effect due to the clear on read, maybe
> you could save the status in a context and retest in the workqueue, that
> way you'd have a trace of the ULK event even if the register was cleared.

I think cache the status couldn't be a good idea.  We couldn't make
sure the freshness of the cache.


Can we suppose CPUs always get ULK=1 if PLL is unlocked so that we can
simply ignore the case (2)?  Since we know the register is volatile
and read the register via I2C should be much slower than hardware
raise the bit.

Pierre-Louis Bossart Nov. 21, 2019, 5:20 p.m. UTC | #7

>> if your interrupt sees an ULK status, it will schedule the pll_work.
>> You only test the status again *after* switching the device on/off, so
>> what is the side effect?
> 
> Both interrupt thread and pll_work thread access the status register.
> The side effect is: interrupt thread could read the status register
> right before pll_work thread's read.
> 
>> I am also wondering if you shouldn't sleep first in the loop, then check
>> the status?
> 
> ACK, "sleep first and check the status" makes more sense.
> 
>> And in case you do have a side effect due to the clear on read, maybe
>> you could save the status in a context and retest in the workqueue, that
>> way you'd have a trace of the ULK event even if the register was cleared.
> 
> I think cache the status couldn't be a good idea.  We couldn't make
> sure the freshness of the cache.

I was thinking more of treating it as a volatile, but it's ugly indeed.

> Can we suppose CPUs always get ULK=1 if PLL is unlocked so that we can
> simply ignore the case (2)?  Since we know the register is volatile
> and read the register via I2C should be much slower than hardware
> raise the bit.

Your option B with the small change to sleep then retest is probably the 
better solution overall.

BTW did you test this on both Baytrail and BSW chromebooks? In the past 
I saw baytrail devices exposed this error a lot more than BSW.

Tzung-Bi Shih Nov. 22, 2019, 7:52 a.m. UTC | #8

On Fri, Nov 22, 2019 at 3:20 AM Pierre-Louis Bossart
<pierre-louis.bossart@linux.intel.com> wrote:
> > Can we suppose CPUs always get ULK=1 if PLL is unlocked so that we can
> > simply ignore the case (2)?  Since we know the register is volatile
> > and read the register via I2C should be much slower than hardware
> > raise the bit.
>
> Your option B with the small change to sleep then retest is probably the
> better solution overall.

Have sent v2, https://mailman.alsa-project.org/pipermail/alsa-devel/2019-November/159050.html

> BTW did you test this on both Baytrail and BSW chromebooks? In the past
> I saw baytrail devices exposed this error a lot more than BSW.

I tested on a Baytrail-based chromebook which is easy to generate lots
of "PLL unlocked" on the console.  After applying this series and the
next series (https://mailman.alsa-project.org/pipermail/alsa-devel/2019-November/158853.html),
I seldom see "PLL unlocked" on the console.  It still happens a few
times, but with very very low probability.

[3/3] ASoC: max98090: fix possible race conditions

Commit Message

Comments

Patch