diff mbox series

ad_sigma_delta: fix race between IRQ and completion

Message ID 63a01acb.a70a0220.9a08f.987d@mx.google.com (mailing list archive)
State Changes Requested
Headers show
Series ad_sigma_delta: fix race between IRQ and completion | expand

Commit Message

Daniel Beer Dec. 19, 2022, 7:48 a.m. UTC
ad_sigma_delta waits for a conversion which terminates with the firing
of a one-shot IRQ handler. In this handler, the interrupt is disabled
and a completion is set.

Meanwhile, the thread that initiated the conversion is waiting on the
completion to know when the conversion happened. If this wait times out,
the conversion is aborted and IRQs are disabled. But the IRQ may fire
anyway between the time the completion wait times out and the disabling
of interrupts. If this occurs, we get a double-disabled interrupt.

This patch fixes that by wrapping the completion wait in a function that
handles timeouts correctly by synchronously disabling the interrupt and
then undoing the damage if it got disabled twice.

Fixes: af3008485ea0 ("iio:adc: Add common code for ADI Sigma Delta devices")
Cc: Lars-Peter Clausen <lars@metafoo.de>
Signed-off-by: Daniel Beer <dlbeer@gmail.com>
---
 drivers/iio/adc/ad_sigma_delta.c | 49 +++++++++++++++++++-------------
 1 file changed, 30 insertions(+), 19 deletions(-)

Comments

Jonathan Cameron Dec. 23, 2022, 4:16 p.m. UTC | #1
On Mon, 19 Dec 2022 20:48:46 +1300
Daniel Beer <dlbeer@gmail.com> wrote:

> ad_sigma_delta waits for a conversion which terminates with the firing
> of a one-shot IRQ handler. In this handler, the interrupt is disabled
> and a completion is set.
> 
> Meanwhile, the thread that initiated the conversion is waiting on the
> completion to know when the conversion happened. If this wait times out,
> the conversion is aborted and IRQs are disabled. But the IRQ may fire
> anyway between the time the completion wait times out and the disabling
> of interrupts. If this occurs, we get a double-disabled interrupt.

Ouch and good work tracking it down.  just to check, did you see this
bug happen in the wild or spotted by code inspection?

Given that timeout generally indicates hardware failure, I'm not sure
how critical this is to fix.

I don't think this fix fully closes the race - see inline.

Jonathan

> 
> This patch fixes that by wrapping the completion wait in a function that
> handles timeouts correctly by synchronously disabling the interrupt and
> then undoing the damage if it got disabled twice.
> 
> Fixes: af3008485ea0 ("iio:adc: Add common code for ADI Sigma Delta devices")
> Cc: Lars-Peter Clausen <lars@metafoo.de>
> Signed-off-by: Daniel Beer <dlbeer@gmail.com>
> ---
>  drivers/iio/adc/ad_sigma_delta.c | 49 +++++++++++++++++++-------------
>  1 file changed, 30 insertions(+), 19 deletions(-)
> 
> diff --git a/drivers/iio/adc/ad_sigma_delta.c b/drivers/iio/adc/ad_sigma_delta.c
> index d8570f620785..2f1702eeed56 100644
> --- a/drivers/iio/adc/ad_sigma_delta.c
> +++ b/drivers/iio/adc/ad_sigma_delta.c
> @@ -202,6 +202,31 @@ int ad_sd_reset(struct ad_sigma_delta *sigma_delta,
>  }
>  EXPORT_SYMBOL_NS_GPL(ad_sd_reset, IIO_AD_SIGMA_DELTA);
>  
> +static int ad_sd_wait_and_disable(struct ad_sigma_delta *sigma_delta,
> +				  unsigned long timeout)
> +{
> +	const int ret = wait_for_completion_interruptible_timeout(
> +			&sigma_delta->completion, timeout);
> +
> +	if (!ret) {
> +		/* Just because the completion timed out, doesn't mean that the
Multiline comment syntax in IIO is the
/*
 * Just...

form.

> +		 * IRQ didn't fire. It might be in progress right now.
> +		 */
> +		disable_irq(sigma_delta->spi->irq);
> +
> +		/* The IRQ handler may have run after all. If that happened,

Same for this comment.

> +		 * then we will now have double-disabled the IRQ, and irq_dis
> +		 * will be true (having been set in the handler).
> +		 */
> +		if (sigma_delta->irq_dis)
> +			enable_irq(sigma_delta->spi->irq);
> +		else
> +			sigma_delta->irq_dis = true;

I'd set this unconditionally.  It might already be set, but that shouldn't
be a problem.

Is this fix sufficient?  If the interrupt is being handled on a different
CPU to the caller of this function, I think we can still race enough that
this fails to fix it up.  Might need a spinlock to prevent that.

  CPU 0                                        CPU 1
ad_sd_data_rdy_trig_poll()               ad_sd_wait_and_disable()
                                       
                                         //wait_for_completion ends
					
Interrupt
                                          disable_irq()
					  if (sigma-delta->irq_dis) !true	
					  else
						sigma_delta->irq_dis = true

disable_irq_nosync(irq)
sigma_delta->irq_dis = true;

So we still end up with a doubly disabled irq.  Add a spinlock to make the
disable and the setting of sigma_delta->irq_dis atomic then it should all be fine.                



> +	}
> +
> +	return ret;
> +}
> +
>  int ad_sd_calibrate(struct ad_sigma_delta *sigma_delta,
>  	unsigned int mode, unsigned int channel)
>  {
> @@ -223,14 +248,11 @@ int ad_sd_calibrate(struct ad_sigma_delta *sigma_delta,
>  
>  	sigma_delta->irq_dis = false;
>  	enable_irq(sigma_delta->spi->irq);
> -	timeout = wait_for_completion_timeout(&sigma_delta->completion, 2 * HZ);
> -	if (timeout == 0) {
> -		sigma_delta->irq_dis = true;
> -		disable_irq_nosync(sigma_delta->spi->irq);
> +	timeout = ad_sd_wait_and_disable(sigma_delta, 2 * HZ);
> +	if (timeout == 0)
>  		ret = -EIO;
> -	} else {
> +	else
>  		ret = 0;
> -	}
>  out:
>  	sigma_delta->keep_cs_asserted = false;
>  	ad_sigma_delta_set_mode(sigma_delta, AD_SD_MODE_IDLE);
> @@ -296,8 +318,7 @@ int ad_sigma_delta_single_conversion(struct iio_dev *indio_dev,
>  
>  	sigma_delta->irq_dis = false;
>  	enable_irq(sigma_delta->spi->irq);
> -	ret = wait_for_completion_interruptible_timeout(
> -			&sigma_delta->completion, HZ);
> +	ret = ad_sd_wait_and_disable(sigma_delta, HZ);
>  
>  	if (ret == 0)
>  		ret = -EIO;
> @@ -314,11 +335,6 @@ int ad_sigma_delta_single_conversion(struct iio_dev *indio_dev,
>  		&raw_sample);
>  
>  out:
> -	if (!sigma_delta->irq_dis) {
> -		disable_irq_nosync(sigma_delta->spi->irq);
> -		sigma_delta->irq_dis = true;
> -	}
> -
>  	sigma_delta->keep_cs_asserted = false;
>  	ad_sigma_delta_set_mode(sigma_delta, AD_SD_MODE_IDLE);
>  	sigma_delta->bus_locked = false;
> @@ -411,12 +427,7 @@ static int ad_sd_buffer_postdisable(struct iio_dev *indio_dev)
>  	struct ad_sigma_delta *sigma_delta = iio_device_get_drvdata(indio_dev);
>  
>  	reinit_completion(&sigma_delta->completion);
> -	wait_for_completion_timeout(&sigma_delta->completion, HZ);
> -
> -	if (!sigma_delta->irq_dis) {
> -		disable_irq_nosync(sigma_delta->spi->irq);
> -		sigma_delta->irq_dis = true;
> -	}
> +	ad_sd_wait_and_disable(sigma_delta, HZ);
>  
>  	sigma_delta->keep_cs_asserted = false;
>  	ad_sigma_delta_set_mode(sigma_delta, AD_SD_MODE_IDLE);
Daniel Beer Dec. 24, 2022, 2:31 a.m. UTC | #2
On Fri, Dec 23, 2022 at 04:16:59PM +0000, Jonathan Cameron wrote:
> > ad_sigma_delta waits for a conversion which terminates with the firing
> > of a one-shot IRQ handler. In this handler, the interrupt is disabled
> > and a completion is set.
> > 
> > Meanwhile, the thread that initiated the conversion is waiting on the
> > completion to know when the conversion happened. If this wait times out,
> > the conversion is aborted and IRQs are disabled. But the IRQ may fire
> > anyway between the time the completion wait times out and the disabling
> > of interrupts. If this occurs, we get a double-disabled interrupt.
> 
> Ouch and good work tracking it down.  just to check, did you see this
> bug happen in the wild or spotted by code inspection?

Hi Jonathan,

Thanks for reviewing. It was by inspection -- I'd originally thought
about it and fixed in in a similar way in this patch:

    https://lore.kernel.org/all/61dd3e0c.1c69fb81.cea15.8d98@mx.google.com/

But since that's not applied, I thought I'd better put together a
separate fix for the time being.

> Given that timeout generally indicates hardware failure, I'm not sure
> how critical this is to fix.

Probably not very critical. I think you'd have to be pretty unlucky to
encounter it.

> Is this fix sufficient?  If the interrupt is being handled on a different
> CPU to the caller of this function, I think we can still race enough that
> this fails to fix it up.  Might need a spinlock to prevent that.
> 
>   CPU 0                                        CPU 1
> ad_sd_data_rdy_trig_poll()               ad_sd_wait_and_disable()
>                                        
>                                          //wait_for_completion ends
> 					
> Interrupt
>                                           disable_irq()
> 					  if (sigma-delta->irq_dis) !true	
> 					  else
> 						sigma_delta->irq_dis = true
> 
> disable_irq_nosync(irq)
> sigma_delta->irq_dis = true;
> 
> So we still end up with a doubly disabled irq.  Add a spinlock to make the
> disable and the setting of sigma_delta->irq_dis atomic then it should all be fine.                

My understanding is that the suffix-less version of disable_irq would
wait for all running handlers on other CPUs (i.e.
ad_sd_data_rdy_trig_poll) to finish before proceeding, which would
prevent this from happening. Is that not the case?

But now that you mention it, there is another small problem: in the case
where the conversion doesn't time out, the interrupt handler will call
complete() and then perform some operations on the struct
ad_sigma_delta.

This is always ok on a single processor, but if there are multiple CPUs
there is possibly a brief period where both the interrupt handler and
the waiting thread are accessing the ad_sigma_delta struct without
synchronization between them.

Not sure if that's really a problem in practice, but I think an easy way
to rule it out would just be to move the complete() call to the bottom
of the handler and make sure it doesn't touch the structure again after
that.

Cheers,
Daniel
Jonathan Cameron Dec. 31, 2022, 2:28 p.m. UTC | #3
On Sat, 24 Dec 2022 15:31:58 +1300
Daniel Beer <dlbeer@gmail.com> wrote:

> On Fri, Dec 23, 2022 at 04:16:59PM +0000, Jonathan Cameron wrote:
> > > ad_sigma_delta waits for a conversion which terminates with the firing
> > > of a one-shot IRQ handler. In this handler, the interrupt is disabled
> > > and a completion is set.
> > > 
> > > Meanwhile, the thread that initiated the conversion is waiting on the
> > > completion to know when the conversion happened. If this wait times out,
> > > the conversion is aborted and IRQs are disabled. But the IRQ may fire
> > > anyway between the time the completion wait times out and the disabling
> > > of interrupts. If this occurs, we get a double-disabled interrupt.  
> > 
> > Ouch and good work tracking it down.  just to check, did you see this
> > bug happen in the wild or spotted by code inspection?  
> 
> Hi Jonathan,
> 
> Thanks for reviewing. It was by inspection -- I'd originally thought
> about it and fixed in in a similar way in this patch:
> 
>     https://lore.kernel.org/all/61dd3e0c.1c69fb81.cea15.8d98@mx.google.com/
> 
> But since that's not applied, I thought I'd better put together a
> separate fix for the time being.
> 
> > Given that timeout generally indicates hardware failure, I'm not sure
> > how critical this is to fix.  
> 
> Probably not very critical. I think you'd have to be pretty unlucky to
> encounter it.
> 
> > Is this fix sufficient?  If the interrupt is being handled on a different
> > CPU to the caller of this function, I think we can still race enough that
> > this fails to fix it up.  Might need a spinlock to prevent that.
> > 
> >   CPU 0                                        CPU 1
> > ad_sd_data_rdy_trig_poll()               ad_sd_wait_and_disable()
> >                                        
> >                                          //wait_for_completion ends
> > 					
> > Interrupt
> >                                           disable_irq()
> > 					  if (sigma-delta->irq_dis) !true	
> > 					  else
> > 						sigma_delta->irq_dis = true
> > 
> > disable_irq_nosync(irq)
> > sigma_delta->irq_dis = true;
> > 
> > So we still end up with a doubly disabled irq.  Add a spinlock to make the
> > disable and the setting of sigma_delta->irq_dis atomic then it should all be fine.                  
> 
> My understanding is that the suffix-less version of disable_irq would
> wait for all running handlers on other CPUs (i.e.
> ad_sd_data_rdy_trig_poll) to finish before proceeding, which would
> prevent this from happening. Is that not the case?

Ah. I'd missed that - it is indeed documented as waiting for
pending irq handlers so that race doesn't exist.


> 
> But now that you mention it, there is another small problem: in the case
> where the conversion doesn't time out, the interrupt handler will call
> complete() and then perform some operations on the struct
> ad_sigma_delta.
> 
> This is always ok on a single processor, but if there are multiple CPUs
> there is possibly a brief period where both the interrupt handler and
> the waiting thread are accessing the ad_sigma_delta struct without
> synchronization between them.
> 
> Not sure if that's really a problem in practice, but I think an easy way
> to rule it out would just be to move the complete() call to the bottom
> of the handler and make sure it doesn't touch the structure again after
> that.

Hmm. At first glance, you are correct that moving completion later probably
makes sense. There may be some subtleties in that ordering though so I definitely want
some feedback from those more familiar with this driver than I am before
taking this patch or that further change.

Jonathan


> 
> Cheers,
> Daniel
>
diff mbox series

Patch

diff --git a/drivers/iio/adc/ad_sigma_delta.c b/drivers/iio/adc/ad_sigma_delta.c
index d8570f620785..2f1702eeed56 100644
--- a/drivers/iio/adc/ad_sigma_delta.c
+++ b/drivers/iio/adc/ad_sigma_delta.c
@@ -202,6 +202,31 @@  int ad_sd_reset(struct ad_sigma_delta *sigma_delta,
 }
 EXPORT_SYMBOL_NS_GPL(ad_sd_reset, IIO_AD_SIGMA_DELTA);
 
+static int ad_sd_wait_and_disable(struct ad_sigma_delta *sigma_delta,
+				  unsigned long timeout)
+{
+	const int ret = wait_for_completion_interruptible_timeout(
+			&sigma_delta->completion, timeout);
+
+	if (!ret) {
+		/* Just because the completion timed out, doesn't mean that the
+		 * IRQ didn't fire. It might be in progress right now.
+		 */
+		disable_irq(sigma_delta->spi->irq);
+
+		/* The IRQ handler may have run after all. If that happened,
+		 * then we will now have double-disabled the IRQ, and irq_dis
+		 * will be true (having been set in the handler).
+		 */
+		if (sigma_delta->irq_dis)
+			enable_irq(sigma_delta->spi->irq);
+		else
+			sigma_delta->irq_dis = true;
+	}
+
+	return ret;
+}
+
 int ad_sd_calibrate(struct ad_sigma_delta *sigma_delta,
 	unsigned int mode, unsigned int channel)
 {
@@ -223,14 +248,11 @@  int ad_sd_calibrate(struct ad_sigma_delta *sigma_delta,
 
 	sigma_delta->irq_dis = false;
 	enable_irq(sigma_delta->spi->irq);
-	timeout = wait_for_completion_timeout(&sigma_delta->completion, 2 * HZ);
-	if (timeout == 0) {
-		sigma_delta->irq_dis = true;
-		disable_irq_nosync(sigma_delta->spi->irq);
+	timeout = ad_sd_wait_and_disable(sigma_delta, 2 * HZ);
+	if (timeout == 0)
 		ret = -EIO;
-	} else {
+	else
 		ret = 0;
-	}
 out:
 	sigma_delta->keep_cs_asserted = false;
 	ad_sigma_delta_set_mode(sigma_delta, AD_SD_MODE_IDLE);
@@ -296,8 +318,7 @@  int ad_sigma_delta_single_conversion(struct iio_dev *indio_dev,
 
 	sigma_delta->irq_dis = false;
 	enable_irq(sigma_delta->spi->irq);
-	ret = wait_for_completion_interruptible_timeout(
-			&sigma_delta->completion, HZ);
+	ret = ad_sd_wait_and_disable(sigma_delta, HZ);
 
 	if (ret == 0)
 		ret = -EIO;
@@ -314,11 +335,6 @@  int ad_sigma_delta_single_conversion(struct iio_dev *indio_dev,
 		&raw_sample);
 
 out:
-	if (!sigma_delta->irq_dis) {
-		disable_irq_nosync(sigma_delta->spi->irq);
-		sigma_delta->irq_dis = true;
-	}
-
 	sigma_delta->keep_cs_asserted = false;
 	ad_sigma_delta_set_mode(sigma_delta, AD_SD_MODE_IDLE);
 	sigma_delta->bus_locked = false;
@@ -411,12 +427,7 @@  static int ad_sd_buffer_postdisable(struct iio_dev *indio_dev)
 	struct ad_sigma_delta *sigma_delta = iio_device_get_drvdata(indio_dev);
 
 	reinit_completion(&sigma_delta->completion);
-	wait_for_completion_timeout(&sigma_delta->completion, HZ);
-
-	if (!sigma_delta->irq_dis) {
-		disable_irq_nosync(sigma_delta->spi->irq);
-		sigma_delta->irq_dis = true;
-	}
+	ad_sd_wait_and_disable(sigma_delta, HZ);
 
 	sigma_delta->keep_cs_asserted = false;
 	ad_sigma_delta_set_mode(sigma_delta, AD_SD_MODE_IDLE);