diff mbox series

irq: Resolve that mask_irq/unmask_irq may not be called in pairs

Message ID 20231207014003.12919-1-xiongxin@kylinos.cn (mailing list archive)
State New
Headers show
Series irq: Resolve that mask_irq/unmask_irq may not be called in pairs | expand

Commit Message

xiongxin Dec. 7, 2023, 1:40 a.m. UTC
When an interrupt controller uses a function such as handle_level_irq()
as an interrupt handler and the controller implements the irq_disable()
callback, the following scenario will appear in the i2c-hid driver in
the sleep scenario:

in the sleep flow, while the user is still triggering the i2c-hid
interrupt, we get the following function call:

  handle_level_irq()
    -> mask_ack_irq()
      -> mask_irq()
				i2c_hid_core_suspend()
				  -> disable_irq()
				    -> __irq_disable()
				      -> irq_state_set_disabled()
				      -> irq_state_set_masked()

  irq_thread_fn()
    -> irq_finalize_oneshot()
      -> if (!desc->threads_oneshot && !irqd_irq_disabled() &&
	     irqd_irq_masked())
      	 	unmask_threaded_irq()
		  -> unmask_irq()

That is, when __irq_disable() is called between mask_irq() and
irq_finalize_oneshot(), the code in irq_finalize_oneshot() will cause
the !irqd_irq_disabled() fails to enter the unmask_irq() branch, which
causes mask_irq/unmask_irq to be called unpaired and the i2c-hid
interrupt to be masked.

Since mask_irq/unmask_irq and irq_disabled() belong to two different
hardware registers or policies, the !irqd_irq_disabled() assertion may
not be used to determine whether unmask_irq() needs to be called.

Cc: stable@vger.kernel.org
Signed-off-by: xiongxin <xiongxin@kylinos.cn>
Signed-off-by: Riwen Lu <luriwen@kylinos.cn>

Comments

Thomas Gleixner Dec. 8, 2023, 1:52 p.m. UTC | #1
On Thu, Dec 07 2023 at 09:40, xiongxin@kylinos.cn wrote:
> When an interrupt controller uses a function such as handle_level_irq()
> as an interrupt handler and the controller implements the irq_disable()
> callback, the following scenario will appear in the i2c-hid driver in
> the sleep scenario:
>
> in the sleep flow, while the user is still triggering the i2c-hid
> interrupt, we get the following function call:
>
>   handle_level_irq()
>     -> mask_ack_irq()
>       -> mask_irq()
> 				i2c_hid_core_suspend()
> 				  -> disable_irq()
> 				    -> __irq_disable()
> 				      -> irq_state_set_disabled()
> 				      -> irq_state_set_masked()
>
>   irq_thread_fn()
>     -> irq_finalize_oneshot()
>       -> if (!desc->threads_oneshot && !irqd_irq_disabled() &&
> 	     irqd_irq_masked())
>       	 	unmask_threaded_irq()
> 		  -> unmask_irq()
>
> That is, when __irq_disable() is called between mask_irq() and
> irq_finalize_oneshot(), the code in irq_finalize_oneshot() will cause
> the !irqd_irq_disabled() fails to enter the unmask_irq() branch, which
> causes mask_irq/unmask_irq to be called unpaired and the i2c-hid
> interrupt to be masked.
>
> Since mask_irq/unmask_irq and irq_disabled() belong to two different
> hardware registers or policies, the !irqd_irq_disabled() assertion may
> not be used to determine whether unmask_irq() needs to be called.

No. That's fundamentally wrong.

Disabled interrupts are disabled and can only be reenabled by the
corresponding enable call. The existing code is entirely correct.

What you are trying to do is unmasking a disabled interrupt, which
results in inconsistent state.

Which interrupt chip is involved here?

Thanks,

        tglx
xiongxin Dec. 11, 2023, 3:10 a.m. UTC | #2
在 2023/12/8 21:52, Thomas Gleixner 写道:
> On Thu, Dec 07 2023 at 09:40, xiongxin@kylinos.cn wrote:
>> When an interrupt controller uses a function such as handle_level_irq()
>> as an interrupt handler and the controller implements the irq_disable()
>> callback, the following scenario will appear in the i2c-hid driver in
>> the sleep scenario:
>>
>> in the sleep flow, while the user is still triggering the i2c-hid
>> interrupt, we get the following function call:
>>
>>    handle_level_irq()
>>      -> mask_ack_irq()
>>        -> mask_irq()
>> 				i2c_hid_core_suspend()
>> 				  -> disable_irq()
>> 				    -> __irq_disable()
>> 				      -> irq_state_set_disabled()
>> 				      -> irq_state_set_masked()
>>
>>    irq_thread_fn()
>>      -> irq_finalize_oneshot()
>>        -> if (!desc->threads_oneshot && !irqd_irq_disabled() &&
>> 	     irqd_irq_masked())
>>        	 	unmask_threaded_irq()
>> 		  -> unmask_irq()
>>
>> That is, when __irq_disable() is called between mask_irq() and
>> irq_finalize_oneshot(), the code in irq_finalize_oneshot() will cause
>> the !irqd_irq_disabled() fails to enter the unmask_irq() branch, which
>> causes mask_irq/unmask_irq to be called unpaired and the i2c-hid
>> interrupt to be masked.
>>
>> Since mask_irq/unmask_irq and irq_disabled() belong to two different
>> hardware registers or policies, the !irqd_irq_disabled() assertion may
>> not be used to determine whether unmask_irq() needs to be called.
> 
> No. That's fundamentally wrong.
> 
> Disabled interrupts are disabled and can only be reenabled by the
> corresponding enable call. The existing code is entirely correct.
> 
> What you are trying to do is unmasking a disabled interrupt, which
> results in inconsistent state.
> 
> Which interrupt chip is involved here?

i2c hid driver use gpio interrupt controller like 
drivers/gpio/gpio-dwapb.c, The gpio interrupt controller code implements 
handle_level_irq() and irq_disabled().
> 
> Thanks,
> 
>          tglx
> 

According to my code tracking and analysis:

Normally, when using the i2c hid device, the gpio interrupt controller's 
mask_irq() and unmask_irq() are called in pairs.For example, the process 
is as follows:

mask_irq();

if (!irqd_irq_disabled() && irqd_irq_masked())
	unmask_irq();

irq_state_set_disabled();
irq_state_set_masked();

In this process, unmask_irq() will be called, and the following 
mask_irq()/unmask_irq() will return directly.


But when doing a sleep process, such as suspend to RAM, 
i2c_hid_core_suspend() of the i2c hid driver is called, which implements 
the disable_irq() function, which finally calls __irq_disable(). Because 
the desc parameter is set to the __irq_disabled() function without a 
lock (desk->lock), the __irq_disabled() function can be called during 
the execution of the handle_level_irq() function, which causes the 
following:

mask_irq();

		irq_state_set_disabled();
		irq_state_set_masked();

if (!irqd_irq_disabled() && irqd_irq_masked())
	unmask_irq();

In this scenario, unmask_irq() will not be called, and then gpio 
corresponding interrupt pin will be masked. Finally, in the suspend() 
process driven by gpio interrupt controller, the interrupt mask register 
will be saved, and then masked will continue to be read when resuming () 
process. After the kernel resumed, the i2c hid gpio interrupt was masked 
and the i2c hid device was unavailable.
Thomas Gleixner Dec. 12, 2023, 3:17 p.m. UTC | #3
On Mon, Dec 11 2023 at 11:10, xiongxin@kylinos.cn wrote:
> 在 2023/12/8 21:52, Thomas Gleixner 写道:
>> On Thu, Dec 07 2023 at 09:40, xiongxin@kylinos.cn wrote:
>> Disabled interrupts are disabled and can only be reenabled by the
>> corresponding enable call. The existing code is entirely correct.
>> 
>> What you are trying to do is unmasking a disabled interrupt, which
>> results in inconsistent state.
>> 
>> Which interrupt chip is involved here?
>
> i2c hid driver use gpio interrupt controller like 
> drivers/gpio/gpio-dwapb.c, The gpio interrupt controller code implements 
> handle_level_irq() and irq_disabled().

No it does not. handle_level_irq() is implemented in the interrupt core
code and irq_disabled() is not a function at all.

Please describe things precisely and not by fairy tales.

> Normally, when using the i2c hid device, the gpio interrupt controller's 
> mask_irq() and unmask_irq() are called in pairs.

Sure. That's how the core code works.

> But when doing a sleep process, such as suspend to RAM, 
> i2c_hid_core_suspend() of the i2c hid driver is called, which implements 
> the disable_irq() function,

IOW, i2c_hid_core_suspend() disables the interrupt of the client device.

> which finally calls __irq_disable(). Because 
> the desc parameter is set to the __irq_disabled() function without a 
> lock (desk->lock), the __irq_disabled() function can be called during

That's nonsense.

disable_irq(irq)
  if (!__disable_irq_nosync(irq)
     desc = irq_get_desc_buslock(irq, &flags, IRQ_GET_DESC_CHECK_GLOBAL);

            ^^^^^^^^^^^^^^^^^^^^ This locks the interrupt descriptor

And yes disable_irq() can be invoked when the interrupt is handled
concurrently. That's legitimate and absolutely correct, but that has
absolutely nothing to do with the locking.

The point is that after disable_irq() returns the interrupt handler is
guaranteed not to be running and not to be invoked anymore until
something invokes enable_irq().

The fact that disable_irq() marks the interrupt disabled prevents the
hard interrupt handler and the threaded handler to unmask the interrupt.
That's correct and fundamental to ensure that the interrupt is and stays
truly disabled.

> if (!irqd_irq_disabled() && irqd_irq_masked())
> 	unmask_irq();

> In this scenario, unmask_irq() will not be called, and then gpio 
> corresponding interrupt pin will be masked.

It _cannot_ be called because the interrupt is _disabled_, which means
the interrupt stays masked. Correctly so.

> Finally, in the suspend() process driven by gpio interrupt controller,
> the interrupt mask register will be saved, and then masked will
> continue to be read when resuming () process. After the kernel
> resumed, the i2c hid gpio interrupt was masked and the i2c hid device
> was unavailable.

That's just wrong again.

Suspend:

       i2c_hid_core_suspend()
          disable_irq();       <- Marks it disabled and eventually
                                  masks it.

       gpio_irq_suspend()
          save_registers();    <- Saves masked interrupt

Resume:

       gpio_irq_resume()
          restore_registers(); <- Restores masked interrupt

       i2c_hid_core_resume()
          enable_irq();        <- Unmasks interrupt and removes the
                                  disabled marker

As I explained you before, disable_irq() can only be undone by
enable_irq() and not by ignoring the disabled state somewhere
else. Disabled state is well defined.

So if the drivers behave correctly in terms of suspend/resume ordering
as shown above, then this all should just work.

If it does not then please figure out what's the actual underlying
problem instead of violating well defined constraints in the core code
and telling me fairy tales about the code.

Thanks,

        tglx
Jiri Kosina Dec. 12, 2023, 4:57 p.m. UTC | #4
On Mon, 11 Dec 2023, xiongxin wrote:

> In this scenario, unmask_irq() will not be called, and then gpio corresponding
> interrupt pin will be masked. Finally, in the suspend() process driven by gpio
> interrupt controller, the interrupt mask register will be saved, and then
> masked will continue to be read when resuming () process. After the kernel
> resumed, the i2c hid gpio interrupt was masked and the i2c hid device was
> unavailable.

In addition to what Thomas already wrote -- what exactly is the problem 
you are trying to solve here?

Is it that your device drive by i2c-hid driver is no longer sending any 
data reports after a suspend/resume cycle? What makes you think that it's 
because of its IRQ being disabled?

Don't you just perhaps need I2C_HID_QUIRK_RESET_ON_RESUME quirk for that 
device?
xiongxin Dec. 13, 2023, 2:29 a.m. UTC | #5
在 2023/12/12 23:17, Thomas Gleixner 写道:
> On Mon, Dec 11 2023 at 11:10, xiongxin@kylinos.cn wrote:
>> 在 2023/12/8 21:52, Thomas Gleixner 写道:
>>> On Thu, Dec 07 2023 at 09:40, xiongxin@kylinos.cn wrote:
>>> Disabled interrupts are disabled and can only be reenabled by the
>>> corresponding enable call. The existing code is entirely correct.
>>>
>>> What you are trying to do is unmasking a disabled interrupt, which
>>> results in inconsistent state.
>>>
>>> Which interrupt chip is involved here?
>>
>> i2c hid driver use gpio interrupt controller like
>> drivers/gpio/gpio-dwapb.c, The gpio interrupt controller code implements
>> handle_level_irq() and irq_disabled().
> 
> No it does not. handle_level_irq() is implemented in the interrupt core
> code and irq_disabled() is not a function at all.
> 
> Please describe things precisely and not by fairy tales.
> 
>> Normally, when using the i2c hid device, the gpio interrupt controller's
>> mask_irq() and unmask_irq() are called in pairs.
> 
> Sure. That's how the core code works.
> 
>> But when doing a sleep process, such as suspend to RAM,
>> i2c_hid_core_suspend() of the i2c hid driver is called, which implements
>> the disable_irq() function,
> 
> IOW, i2c_hid_core_suspend() disables the interrupt of the client device.
> 
>> which finally calls __irq_disable(). Because
>> the desc parameter is set to the __irq_disabled() function without a
>> lock (desk->lock), the __irq_disabled() function can be called during
> 
> That's nonsense.
> 
> disable_irq(irq)
>    if (!__disable_irq_nosync(irq)
>       desc = irq_get_desc_buslock(irq, &flags, IRQ_GET_DESC_CHECK_GLOBAL);
> 
>              ^^^^^^^^^^^^^^^^^^^^ This locks the interrupt descriptor
> 
> And yes disable_irq() can be invoked when the interrupt is handled
> concurrently. That's legitimate and absolutely correct, but that has
> absolutely nothing to do with the locking.
> 
> The point is that after disable_irq() returns the interrupt handler is
> guaranteed not to be running and not to be invoked anymore until
> something invokes enable_irq().
> 
> The fact that disable_irq() marks the interrupt disabled prevents the
> hard interrupt handler and the threaded handler to unmask the interrupt.
> That's correct and fundamental to ensure that the interrupt is and stays
> truly disabled.
> 
>> if (!irqd_irq_disabled() && irqd_irq_masked())
>> 	unmask_irq();
> 
>> In this scenario, unmask_irq() will not be called, and then gpio
>> corresponding interrupt pin will be masked.
> 
> It _cannot_ be called because the interrupt is _disabled_, which means
> the interrupt stays masked. Correctly so.
> 
>> Finally, in the suspend() process driven by gpio interrupt controller,
>> the interrupt mask register will be saved, and then masked will
>> continue to be read when resuming () process. After the kernel
>> resumed, the i2c hid gpio interrupt was masked and the i2c hid device
>> was unavailable.
> 
> That's just wrong again.
> 
> Suspend:
> 
>         i2c_hid_core_suspend()
>            disable_irq();       <- Marks it disabled and eventually
>                                    masks it.
> 
>         gpio_irq_suspend()
>            save_registers();    <- Saves masked interrupt
> 
> Resume:
> 
>         gpio_irq_resume()
>            restore_registers(); <- Restores masked interrupt
> 
>         i2c_hid_core_resume()
>            enable_irq();        <- Unmasks interrupt and removes the
>                                    disabled marker
> 
> As I explained you before, disable_irq() can only be undone by
> enable_irq() and not by ignoring the disabled state somewhere
> else. Disabled state is well defined.
> 
> So if the drivers behave correctly in terms of suspend/resume ordering
> as shown above, then this all should just work.
> 
> If it does not then please figure out what's the actual underlying
> problem instead of violating well defined constraints in the core code
> and telling me fairy tales about the code.
> 
> Thanks,
> 
>          tglx
> 
> 
> 
> 

Sorry, the previous reply may not have clarified the BUG process. I 
re-debugged and confirmed it yesterday. The current BUG execution 
sequence is described as follows:

1: call in interrupt context

handle_level_irq(struct irq_desc *desc)
     raw_spin_lock(&desc->lock);

     mask_ack_irq(desc);
         mask_irq(desc);
	    desc->irq_data.chip->irq_mask(&desc->irq_data);
	                         <--- gpio irq_chip irq_mask call func.
	    irq_state_set_masked(desc);
     ...
     handle_irq_event(desc); <--- wake interrupt handler thread

     cond_unmask_irq(desc);
     raw_spin_unlock(&desc->lock);

2: call in suspend process

i2c_hid_core_suspend()
     disable_irq(client->irq);
	__disable_irq_nosync(irq)
	    desc = irq_get_desc_buslock(...);

	    __disable_irq(desc);
		irq_disable(desc);
		    __irq_disable(...);
			irq_state_set_disabled(...); <-set disabled flag
			irq_state_set_masked(desc); <-set masked flag

	    irq_put_desc_busunlock(desc, flags);


3:  Interrupt handler thread call

irq_thread_fn()
     irq_finalize_oneshot(desc, action);
	raw_spin_lock_irq(&desc->lock);

	if (!desc->threads_oneshot &&
		!irqd_irq_disabled(&desc->irq_data) && <-
		irqd_irq_masked(&desc->irq_data))
	    unmask_threaded_irq(desc);
		unmask_irq(desc);
		    desc->irq_data.chip->irq_unmask(&desc->irq_data);
			        <--- gpio irq_chip irq_unmask call func.

	raw_spin_unlock_irq(&desc->lock);

That is, there is a time between the 1:handle_level_irq() and 
3:irq_thread_fn() calls for the 2:disable_irq() call to acquire the lock 
and then implement the irq_state_set_disabled() operation. When finally 
call irq_thread_fn()->irq_finalize_oneshot(), it cannot enter the 
unmask_thread_irq() process.

In this case, the gpio irq_chip irq_mask()/irq_unmask() callback pairs 
are not called in pairs, so I think this is a BUG, but not necessarily 
fixed from the irq core code layer.

Next, when the gpio controller driver calls the suspend/resume process, 
it is as follows:

suspend process:
dwapb_gpio_suspend()
     ctx->int_mask   = dwapb_read(gpio, GPIO_INTMASK);

resume process:
dwapb_gpio_resume()
     dwapb_write(gpio, GPIO_INTMASK, ctx->int_mask);

In this case, the masked interrupt bit of GPIO interrupt corresponding 
to i2c hid is saved, so that when gpio resume() process writes from the 
register, the gpio interrupt bit corresponding to i2c hid is masked and 
the i2c hid device cannot be used.

My first solution is to remove the !irqd_irq_disabled(&desc->irq_data) 
condition and the BUG disappears. I can't think of a better solution 
right now.
xiongxin Dec. 13, 2023, 2:35 a.m. UTC | #6
在 2023/12/13 00:57, Jiri Kosina 写道:
> On Mon, 11 Dec 2023, xiongxin wrote:
> 
>> In this scenario, unmask_irq() will not be called, and then gpio corresponding
>> interrupt pin will be masked. Finally, in the suspend() process driven by gpio
>> interrupt controller, the interrupt mask register will be saved, and then
>> masked will continue to be read when resuming () process. After the kernel
>> resumed, the i2c hid gpio interrupt was masked and the i2c hid device was
>> unavailable.
> 
> In addition to what Thomas already wrote -- what exactly is the problem
> you are trying to solve here?
> 
> Is it that your device drive by i2c-hid driver is no longer sending any
> data reports after a suspend/resume cycle? What makes you think that it's
> because of its IRQ being disabled?
> 
> Don't you just perhaps need I2C_HID_QUIRK_RESET_ON_RESUME quirk for that
> device?
> 

I have confirmed I2C_HID_QUIRK_RESET_ON_RESUME quirk, the current BUG is 
related to GPIO interrupt masking, and has little to do with hid code.

I explained the detailed process of the BUG in another email.
Thomas Gleixner Dec. 13, 2023, 2:59 p.m. UTC | #7
On Wed, Dec 13 2023 at 10:29, xiongxin wrote:
> 在 2023/12/12 23:17, Thomas Gleixner 写道:
> Sorry, the previous reply may not have clarified the BUG process. I 
> re-debugged and confirmed it yesterday. The current BUG execution 
> sequence is described as follows:

It's the sequence how this works and it works correctly.

Just because it does not work on your machine it does not mean that this
is incorrect and a BUG.

You are trying to fix a symptom and thereby violating guarantees of the
core code.

> That is, there is a time between the 1:handle_level_irq() and 
> 3:irq_thread_fn() calls for the 2:disable_irq() call to acquire the lock 
> and then implement the irq_state_set_disabled() operation. When finally 
> call irq_thread_fn()->irq_finalize_oneshot(), it cannot enter the 
> unmask_thread_irq() process.

Correct, because the interrupt has been DISABLED in the mean time.

> In this case, the gpio irq_chip irq_mask()/irq_unmask() callback pairs 
> are not called in pairs, so I think this is a BUG, but not necessarily 
> fixed from the irq core code layer.

No. It is _NOT_ a BUG. unmask() is not allowed to be invoked when the
interrupt is DISABLED. That's the last time I'm going to tell you that.
Only enable_irq() can undo the effect of disable_irq(), period.

> Next, when the gpio controller driver calls the suspend/resume process, 
> it is as follows:
>
> suspend process:
> dwapb_gpio_suspend()
>      ctx->int_mask   = dwapb_read(gpio, GPIO_INTMASK);
>
> resume process:
> dwapb_gpio_resume()
>      dwapb_write(gpio, GPIO_INTMASK, ctx->int_mask);

Did you actually look at the sequence I gave you?

   Suspend:

	  i2c_hid_core_suspend()
	     disable_irq();       <- Marks it disabled and eventually
				     masks it.

	  gpio_irq_suspend()
	     save_registers();    <- Saves masked interrupt

   Resume:

	  gpio_irq_resume()
	     restore_registers(); <- Restores masked interrupt

	  i2c_hid_core_resume()
	     enable_irq();        <- Unmasks interrupt and removes the
				     disabled marker


Have you verified that this order of invocations is what happens on
your machine?

Thanks,

        tglx
xiongxin Dec. 14, 2023, 1:54 a.m. UTC | #8
在 2023/12/13 22:59, Thomas Gleixner 写道:
> On Wed, Dec 13 2023 at 10:29, xiongxin wrote:
>> 在 2023/12/12 23:17, Thomas Gleixner 写道:
>> Sorry, the previous reply may not have clarified the BUG process. I
>> re-debugged and confirmed it yesterday. The current BUG execution
>> sequence is described as follows:
> 
> It's the sequence how this works and it works correctly.
> 
> Just because it does not work on your machine it does not mean that this
> is incorrect and a BUG.
> 
> You are trying to fix a symptom and thereby violating guarantees of the
> core code.
> 
>> That is, there is a time between the 1:handle_level_irq() and
>> 3:irq_thread_fn() calls for the 2:disable_irq() call to acquire the lock
>> and then implement the irq_state_set_disabled() operation. When finally
>> call irq_thread_fn()->irq_finalize_oneshot(), it cannot enter the
>> unmask_thread_irq() process.
> 
> Correct, because the interrupt has been DISABLED in the mean time.
> 
>> In this case, the gpio irq_chip irq_mask()/irq_unmask() callback pairs
>> are not called in pairs, so I think this is a BUG, but not necessarily
>> fixed from the irq core code layer.
> 
> No. It is _NOT_ a BUG. unmask() is not allowed to be invoked when the
> interrupt is DISABLED. That's the last time I'm going to tell you that.
> Only enable_irq() can undo the effect of disable_irq(), period.
> 
>> Next, when the gpio controller driver calls the suspend/resume process,
>> it is as follows:
>>
>> suspend process:
>> dwapb_gpio_suspend()
>>       ctx->int_mask   = dwapb_read(gpio, GPIO_INTMASK);
>>
>> resume process:
>> dwapb_gpio_resume()
>>       dwapb_write(gpio, GPIO_INTMASK, ctx->int_mask);
> 
> Did you actually look at the sequence I gave you?
> 
>     Suspend:
> 
> 	  i2c_hid_core_suspend()
> 	     disable_irq();       <- Marks it disabled and eventually
> 				     masks it.
> 
> 	  gpio_irq_suspend()
> 	     save_registers();    <- Saves masked interrupt
> 
>     Resume:
> 
> 	  gpio_irq_resume()
> 	     restore_registers(); <- Restores masked interrupt
> 
> 	  i2c_hid_core_resume()
> 	     enable_irq();        <- Unmasks interrupt and removes the
> 				     disabled marker
> 
> 
> Have you verified that this order of invocations is what happens on
> your machine?
> 
> Thanks,
> 
>          tglx

As described earlier, in the current situation, the irq_mask() callback 
of gpio irq_chip is called in mask_irq(), followed by the disable_irq() 
in i2c_hid_core_suspend(), unmask_irq() will not be executed.

Then call enable_irq() in i2c_hid_core_resume(). Since gpio irq_chip 
does not implement the irq_startup() callback, it ends up calling 
irq_enable().

The irq_enable() function is then implemented as follows:

irq_state_clr_disabled(desc);
if (desc->irq_data.chip->irq_enable) {
	desc->irq_data.chip->irq_enable(&desc->irq_data);
	irq_state_clr_masked(desc);
} else {
	unmask_irq(desc);
}

Because gpio irq_chip implements irq_enable(), unmask_irq() is not 
executed, and gpio irq_chip's irq_unmask() callback is not called. 
Instead, irq_state_clr_masked() was called to clear the masked flag.

The irq masked behavior is actually controlled by the 
irq_mask()/irq_unmask() callback function pairs in gpio irq_chip. When 
the whole situation occurs, there is one more irq_mask() operation, or 
one less irq_unmask() operation. This ends the i2c hid resume and the 
gpio corresponding i2c hid interrupt is also masked.

Please help confirm whether the current situation is a BUG, or suggest 
other solutions to fix it.
Serge Semin Dec. 14, 2023, 10:06 a.m. UTC | #9
Hi Thomas, Xiong

On Thu, Dec 14, 2023 at 09:54:06AM +0800, xiongxin wrote:
> 在 2023/12/13 22:59, Thomas Gleixner 写道:
> > On Wed, Dec 13 2023 at 10:29, xiongxin wrote:
> > > 在 2023/12/12 23:17, Thomas Gleixner 写道:
> > > Sorry, the previous reply may not have clarified the BUG process. I
> > > re-debugged and confirmed it yesterday. The current BUG execution
> > > sequence is described as follows:
> > 
> > It's the sequence how this works and it works correctly.
> > 
> > Just because it does not work on your machine it does not mean that this
> > is incorrect and a BUG.
> > 
> > You are trying to fix a symptom and thereby violating guarantees of the
> > core code.
> > 
> > > That is, there is a time between the 1:handle_level_irq() and
> > > 3:irq_thread_fn() calls for the 2:disable_irq() call to acquire the lock
> > > and then implement the irq_state_set_disabled() operation. When finally
> > > call irq_thread_fn()->irq_finalize_oneshot(), it cannot enter the
> > > unmask_thread_irq() process.
> > 
> > Correct, because the interrupt has been DISABLED in the mean time.
> > 
> > > In this case, the gpio irq_chip irq_mask()/irq_unmask() callback pairs
> > > are not called in pairs, so I think this is a BUG, but not necessarily
> > > fixed from the irq core code layer.
> > 
> > No. It is _NOT_ a BUG. unmask() is not allowed to be invoked when the
> > interrupt is DISABLED. That's the last time I'm going to tell you that.
> > Only enable_irq() can undo the effect of disable_irq(), period.
> > 
> > > Next, when the gpio controller driver calls the suspend/resume process,
> > > it is as follows:
> > > 
> > > suspend process:
> > > dwapb_gpio_suspend()
> > >       ctx->int_mask   = dwapb_read(gpio, GPIO_INTMASK);
> > > 
> > > resume process:
> > > dwapb_gpio_resume()
> > >       dwapb_write(gpio, GPIO_INTMASK, ctx->int_mask);
> > 
> > Did you actually look at the sequence I gave you?
> > 
> >     Suspend:
> > 
> > 	  i2c_hid_core_suspend()
> > 	     disable_irq();       <- Marks it disabled and eventually
> > 				     masks it.
> > 
> > 	  gpio_irq_suspend()
> > 	     save_registers();    <- Saves masked interrupt
> > 
> >     Resume:
> > 
> > 	  gpio_irq_resume()
> > 	     restore_registers(); <- Restores masked interrupt
> > 
> > 	  i2c_hid_core_resume()
> > 	     enable_irq();        <- Unmasks interrupt and removes the
> > 				     disabled marker
> > 
> > 
> > Have you verified that this order of invocations is what happens on
> > your machine?
> > 
> > Thanks,
> > 
> >          tglx
> 
> As described earlier, in the current situation, the irq_mask() callback of
> gpio irq_chip is called in mask_irq(), followed by the disable_irq() in
> i2c_hid_core_suspend(), unmask_irq() will not be executed.
> 
> Then call enable_irq() in i2c_hid_core_resume(). Since gpio irq_chip does
> not implement the irq_startup() callback, it ends up calling irq_enable().
> 
> The irq_enable() function is then implemented as follows:
> 
> irq_state_clr_disabled(desc);
> if (desc->irq_data.chip->irq_enable) {
> 	desc->irq_data.chip->irq_enable(&desc->irq_data);
> 	irq_state_clr_masked(desc);
> } else {
> 	unmask_irq(desc);
> }
> 
> Because gpio irq_chip implements irq_enable(), unmask_irq() is not executed,
> and gpio irq_chip's irq_unmask() callback is not called. Instead,
> irq_state_clr_masked() was called to clear the masked flag.
> 
> The irq masked behavior is actually controlled by the
> irq_mask()/irq_unmask() callback function pairs in gpio irq_chip. When the
> whole situation occurs, there is one more irq_mask() operation, or one less
> irq_unmask() operation. This ends the i2c hid resume and the gpio
> corresponding i2c hid interrupt is also masked.
> 
> Please help confirm whether the current situation is a BUG, or suggest other
> solutions to fix it.
> 

I has just been Cc'ed to this thread from the very start of the thread
so not sure whether I fully understand the problem. But a while ago an
IRQ disable/undisable-mask/unmask-unpairing problem was reported for
DW APB GPIO driver implementation, but in a another context though:
https://lore.kernel.org/linux-gpio/1606728979-44259-1-git-send-email-luojiaxing@huawei.com/
We didn't come to a final conclusion back then, so the patch got lost
in the emails archive. Xiong, is it related to the problem in your
case?

-Serge(y)
Thomas Gleixner Dec. 14, 2023, 10:13 a.m. UTC | #10
On Thu, Dec 14 2023 at 09:54, xiongxin wrote:
> 在 2023/12/13 22:59, Thomas Gleixner 写道:
>> Did you actually look at the sequence I gave you?
>> 
>>     Suspend:
>> 
>> 	  i2c_hid_core_suspend()
>> 	     disable_irq();       <- Marks it disabled and eventually
>> 				     masks it.
>> 
>> 	  gpio_irq_suspend()
>> 	     save_registers();    <- Saves masked interrupt
>> 
>>     Resume:
>> 
>> 	  gpio_irq_resume()
>> 	     restore_registers(); <- Restores masked interrupt
>> 
>> 	  i2c_hid_core_resume()
>> 	     enable_irq();        <- Unmasks interrupt and removes the
>> 				     disabled marker
>> 
>> 
>> Have you verified that this order of invocations is what happens on
>> your machine?
>
> As described earlier, in the current situation, the irq_mask() callback 
> of gpio irq_chip is called in mask_irq(), followed by the disable_irq() 
> in i2c_hid_core_suspend(), unmask_irq() will not be executed.

Which is correct.

> Then call enable_irq() in i2c_hid_core_resume(). Since gpio irq_chip 
> does not implement the irq_startup() callback, it ends up calling 
> irq_enable().
>
> The irq_enable() function is then implemented as follows:
>
> irq_state_clr_disabled(desc);
> if (desc->irq_data.chip->irq_enable) {
> 	desc->irq_data.chip->irq_enable(&desc->irq_data);
> 	irq_state_clr_masked(desc);
> } else {
> 	unmask_irq(desc);
> }
>
> Because gpio irq_chip implements irq_enable(), unmask_irq() is not 
> executed, and gpio irq_chip's irq_unmask() callback is not called. 
> Instead, irq_state_clr_masked() was called to clear the masked flag.
>
> The irq masked behavior is actually controlled by the 
> irq_mask()/irq_unmask() callback function pairs in gpio irq_chip. When 
> the whole situation occurs, there is one more irq_mask() operation, or 
> one less irq_unmask() operation. This ends the i2c hid resume and the 
> gpio corresponding i2c hid interrupt is also masked.
>
> Please help confirm whether the current situation is a BUG, or suggest 
> other solutions to fix it.

Again, I already explained to you in great detail why the core code is
correct and does not have a bug.

But as you insist that the bug is in the core code you obviously failed
to validate what I asked you to validate:

>> 	  i2c_hid_core_resume()
>> 	     enable_irq();        <- Unmasks interrupt and removes the
>> 				     disabled marker

The keyword to validate here is 'Unmasks'.

As gpio_dwapb implements the irq_enable() callback enable_irq() is not
going to end up invoking the irq_unmask() callback. But the irq_enable()
callback is required to be a superset of irq_unmask(). I.e. the core
code expects it to do:

    1) Some preparatory work to enable the interrupt line

    2) Unmask the interrupt, which is why the masked state is cleared
       by the core after invoking the irq_enable() callback.

which is pretty obvious because if an interrupt chip does not implement
the irq_enable() callback the core defaults to irq_unmask()

Correspondingly the core expects from the irq_disable() callback:

    1) To mask the interrupt

    2) To do some extra work to disable the interrupt line

which is obvious again because the core defaults to irq_mask() if the
irq_disable() callback is not implemented by the interrupt chip.

I'm pretty sure that with the previous provided information and the
extra information above you can figure out yourself that:

  1) the core code is correct as is

  2) where exactly the problem is located and how to fix it

No?

Thanks,

        tglx
Andy Shevchenko Dec. 14, 2023, 4:11 p.m. UTC | #11
On Thu, Dec 14, 2023 at 01:06:23PM +0300, Serge Semin wrote:
> On Thu, Dec 14, 2023 at 09:54:06AM +0800, xiongxin wrote:
> > 在 2023/12/13 22:59, Thomas Gleixner 写道:
> > > On Wed, Dec 13 2023 at 10:29, xiongxin wrote:
> > > > 在 2023/12/12 23:17, Thomas Gleixner 写道:
> > > > Sorry, the previous reply may not have clarified the BUG process. I
> > > > re-debugged and confirmed it yesterday. The current BUG execution
> > > > sequence is described as follows:
> > > 
> > > It's the sequence how this works and it works correctly.
> > > 
> > > Just because it does not work on your machine it does not mean that this
> > > is incorrect and a BUG.
> > > 
> > > You are trying to fix a symptom and thereby violating guarantees of the
> > > core code.
> > > 
> > > > That is, there is a time between the 1:handle_level_irq() and
> > > > 3:irq_thread_fn() calls for the 2:disable_irq() call to acquire the lock
> > > > and then implement the irq_state_set_disabled() operation. When finally
> > > > call irq_thread_fn()->irq_finalize_oneshot(), it cannot enter the
> > > > unmask_thread_irq() process.
> > > 
> > > Correct, because the interrupt has been DISABLED in the mean time.
> > > 
> > > > In this case, the gpio irq_chip irq_mask()/irq_unmask() callback pairs
> > > > are not called in pairs, so I think this is a BUG, but not necessarily
> > > > fixed from the irq core code layer.
> > > 
> > > No. It is _NOT_ a BUG. unmask() is not allowed to be invoked when the
> > > interrupt is DISABLED. That's the last time I'm going to tell you that.
> > > Only enable_irq() can undo the effect of disable_irq(), period.
> > > 
> > > > Next, when the gpio controller driver calls the suspend/resume process,
> > > > it is as follows:
> > > > 
> > > > suspend process:
> > > > dwapb_gpio_suspend()
> > > >       ctx->int_mask   = dwapb_read(gpio, GPIO_INTMASK);
> > > > 
> > > > resume process:
> > > > dwapb_gpio_resume()
> > > >       dwapb_write(gpio, GPIO_INTMASK, ctx->int_mask);
> > > 
> > > Did you actually look at the sequence I gave you?
> > > 
> > >     Suspend:
> > > 
> > > 	  i2c_hid_core_suspend()
> > > 	     disable_irq();       <- Marks it disabled and eventually
> > > 				     masks it.
> > > 
> > > 	  gpio_irq_suspend()
> > > 	     save_registers();    <- Saves masked interrupt
> > > 
> > >     Resume:
> > > 
> > > 	  gpio_irq_resume()
> > > 	     restore_registers(); <- Restores masked interrupt
> > > 
> > > 	  i2c_hid_core_resume()
> > > 	     enable_irq();        <- Unmasks interrupt and removes the
> > > 				     disabled marker
> > > 
> > > 
> > > Have you verified that this order of invocations is what happens on
> > > your machine?
> > > 
> > > Thanks,
> > > 
> > >          tglx
> > 
> > As described earlier, in the current situation, the irq_mask() callback of
> > gpio irq_chip is called in mask_irq(), followed by the disable_irq() in
> > i2c_hid_core_suspend(), unmask_irq() will not be executed.
> > 
> > Then call enable_irq() in i2c_hid_core_resume(). Since gpio irq_chip does
> > not implement the irq_startup() callback, it ends up calling irq_enable().
> > 
> > The irq_enable() function is then implemented as follows:
> > 
> > irq_state_clr_disabled(desc);
> > if (desc->irq_data.chip->irq_enable) {
> > 	desc->irq_data.chip->irq_enable(&desc->irq_data);
> > 	irq_state_clr_masked(desc);
> > } else {
> > 	unmask_irq(desc);
> > }
> > 
> > Because gpio irq_chip implements irq_enable(), unmask_irq() is not executed,
> > and gpio irq_chip's irq_unmask() callback is not called. Instead,
> > irq_state_clr_masked() was called to clear the masked flag.
> > 
> > The irq masked behavior is actually controlled by the
> > irq_mask()/irq_unmask() callback function pairs in gpio irq_chip. When the
> > whole situation occurs, there is one more irq_mask() operation, or one less
> > irq_unmask() operation. This ends the i2c hid resume and the gpio
> > corresponding i2c hid interrupt is also masked.
> > 
> > Please help confirm whether the current situation is a BUG, or suggest other
> > solutions to fix it.
> 
> I has just been Cc'ed to this thread from the very start of the thread
> so not sure whether I fully understand the problem. But a while ago an
> IRQ disable/undisable-mask/unmask-unpairing problem was reported for
> DW APB GPIO driver implementation, but in a another context though:
> https://lore.kernel.org/linux-gpio/1606728979-44259-1-git-send-email-luojiaxing@huawei.com/
> We didn't come to a final conclusion back then, so the patch got lost
> in the emails archive. Xiong, is it related to the problem in your
> case?

From what explained it sounds to me that GPIO driver has missing part and
this seems the missing patch (in one or another form). Perhaps we can get
a second round of review,
xiongxin Dec. 15, 2023, 2:18 a.m. UTC | #12
在 2023/12/15 00:11, Andy Shevchenko 写道:
> On Thu, Dec 14, 2023 at 01:06:23PM +0300, Serge Semin wrote:
>> On Thu, Dec 14, 2023 at 09:54:06AM +0800, xiongxin wrote:
>>> 在 2023/12/13 22:59, Thomas Gleixner 写道:
>>>> On Wed, Dec 13 2023 at 10:29, xiongxin wrote:
>>>>> 在 2023/12/12 23:17, Thomas Gleixner 写道:
>>>>> Sorry, the previous reply may not have clarified the BUG process. I
>>>>> re-debugged and confirmed it yesterday. The current BUG execution
>>>>> sequence is described as follows:
>>>>
>>>> It's the sequence how this works and it works correctly.
>>>>
>>>> Just because it does not work on your machine it does not mean that this
>>>> is incorrect and a BUG.
>>>>
>>>> You are trying to fix a symptom and thereby violating guarantees of the
>>>> core code.
>>>>
>>>>> That is, there is a time between the 1:handle_level_irq() and
>>>>> 3:irq_thread_fn() calls for the 2:disable_irq() call to acquire the lock
>>>>> and then implement the irq_state_set_disabled() operation. When finally
>>>>> call irq_thread_fn()->irq_finalize_oneshot(), it cannot enter the
>>>>> unmask_thread_irq() process.
>>>>
>>>> Correct, because the interrupt has been DISABLED in the mean time.
>>>>
>>>>> In this case, the gpio irq_chip irq_mask()/irq_unmask() callback pairs
>>>>> are not called in pairs, so I think this is a BUG, but not necessarily
>>>>> fixed from the irq core code layer.
>>>>
>>>> No. It is _NOT_ a BUG. unmask() is not allowed to be invoked when the
>>>> interrupt is DISABLED. That's the last time I'm going to tell you that.
>>>> Only enable_irq() can undo the effect of disable_irq(), period.
>>>>
>>>>> Next, when the gpio controller driver calls the suspend/resume process,
>>>>> it is as follows:
>>>>>
>>>>> suspend process:
>>>>> dwapb_gpio_suspend()
>>>>>        ctx->int_mask   = dwapb_read(gpio, GPIO_INTMASK);
>>>>>
>>>>> resume process:
>>>>> dwapb_gpio_resume()
>>>>>        dwapb_write(gpio, GPIO_INTMASK, ctx->int_mask);
>>>>
>>>> Did you actually look at the sequence I gave you?
>>>>
>>>>      Suspend:
>>>>
>>>> 	  i2c_hid_core_suspend()
>>>> 	     disable_irq();       <- Marks it disabled and eventually
>>>> 				     masks it.
>>>>
>>>> 	  gpio_irq_suspend()
>>>> 	     save_registers();    <- Saves masked interrupt
>>>>
>>>>      Resume:
>>>>
>>>> 	  gpio_irq_resume()
>>>> 	     restore_registers(); <- Restores masked interrupt
>>>>
>>>> 	  i2c_hid_core_resume()
>>>> 	     enable_irq();        <- Unmasks interrupt and removes the
>>>> 				     disabled marker
>>>>
>>>>
>>>> Have you verified that this order of invocations is what happens on
>>>> your machine?
>>>>
>>>> Thanks,
>>>>
>>>>           tglx
>>>
>>> As described earlier, in the current situation, the irq_mask() callback of
>>> gpio irq_chip is called in mask_irq(), followed by the disable_irq() in
>>> i2c_hid_core_suspend(), unmask_irq() will not be executed.
>>>
>>> Then call enable_irq() in i2c_hid_core_resume(). Since gpio irq_chip does
>>> not implement the irq_startup() callback, it ends up calling irq_enable().
>>>
>>> The irq_enable() function is then implemented as follows:
>>>
>>> irq_state_clr_disabled(desc);
>>> if (desc->irq_data.chip->irq_enable) {
>>> 	desc->irq_data.chip->irq_enable(&desc->irq_data);
>>> 	irq_state_clr_masked(desc);
>>> } else {
>>> 	unmask_irq(desc);
>>> }
>>>
>>> Because gpio irq_chip implements irq_enable(), unmask_irq() is not executed,
>>> and gpio irq_chip's irq_unmask() callback is not called. Instead,
>>> irq_state_clr_masked() was called to clear the masked flag.
>>>
>>> The irq masked behavior is actually controlled by the
>>> irq_mask()/irq_unmask() callback function pairs in gpio irq_chip. When the
>>> whole situation occurs, there is one more irq_mask() operation, or one less
>>> irq_unmask() operation. This ends the i2c hid resume and the gpio
>>> corresponding i2c hid interrupt is also masked.
>>>
>>> Please help confirm whether the current situation is a BUG, or suggest other
>>> solutions to fix it.
>>
>> I has just been Cc'ed to this thread from the very start of the thread
>> so not sure whether I fully understand the problem. But a while ago an
>> IRQ disable/undisable-mask/unmask-unpairing problem was reported for
>> DW APB GPIO driver implementation, but in a another context though:
>> https://lore.kernel.org/linux-gpio/1606728979-44259-1-git-send-email-luojiaxing@huawei.com/
>> We didn't come to a final conclusion back then, so the patch got lost
>> in the emails archive. Xiong, is it related to the problem in your
>> case?
> 
>  From what explained it sounds to me that GPIO driver has missing part and
> this seems the missing patch (in one or another form). Perhaps we can get
> a second round of review,
> 

Yes, the current case is exactly the situation described in the link, 
and the specific implementation process is as described in my previous 
email. After adding the patch for retest, the exception can be 
effectively solved. And at present, can the patch be incorporated?

Thank you Thomas for your kind advice!

My previous focus has been locked until mask_irq()/unmask_irq() is not 
called in pairs, in fact, it can turn on the corresponding irq masked in 
enable_irq. Looking at the irq_enable() callback implementation of other 
GPIO interrupt controllers, there are indeed operations on masked registers.

Thank you all!
diff mbox series

Patch

diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index 1782f90cd8c6..9160fc9170b3 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -1120,8 +1120,7 @@  static void irq_finalize_oneshot(struct irq_desc *desc,
 
 	desc->threads_oneshot &= ~action->thread_mask;
 
-	if (!desc->threads_oneshot && !irqd_irq_disabled(&desc->irq_data) &&
-	    irqd_irq_masked(&desc->irq_data))
+	if (!desc->threads_oneshot && irqd_irq_masked(&desc->irq_data))
 		unmask_threaded_irq(desc);
 
 out_unlock: