diff mbox series

[v2,2/3] drm/panthor: Fix ordering in _irq_suspend()

Message ID 20240325135705.3717293-2-boris.brezillon@collabora.com (mailing list archive)
State New, archived
Headers show
Series [v2,1/3] drm/panthor: Fix IO-page mmap() for 32-bit userspace on 64-bit kernel | expand

Commit Message

Boris Brezillon March 25, 2024, 1:57 p.m. UTC
Make sure we set suspended=true last to avoid generating an irq storm
in the unlikely case where an IRQ happens between the suspended=true
assignment and the _INT_MASK update.

v2:
- New patch

Reported-by: Steven Price <steven.price@arm.com>
Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
---
 drivers/gpu/drm/panthor/panthor_device.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Comments

Steven Price March 25, 2024, 3:23 p.m. UTC | #1
On 25/03/2024 13:57, Boris Brezillon wrote:
> Make sure we set suspended=true last to avoid generating an irq storm
> in the unlikely case where an IRQ happens between the suspended=true
> assignment and the _INT_MASK update.
> 
> v2:
> - New patch
> 
> Reported-by: Steven Price <steven.price@arm.com>
> Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>

Reviewed-by: Steven Price <steven.price@arm.com>

Thanks!

Steve

> ---
>  drivers/gpu/drm/panthor/panthor_device.h | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/panthor/panthor_device.h b/drivers/gpu/drm/panthor/panthor_device.h
> index 7ee4987a3796..3a930a368ae1 100644
> --- a/drivers/gpu/drm/panthor/panthor_device.h
> +++ b/drivers/gpu/drm/panthor/panthor_device.h
> @@ -325,7 +325,7 @@ static inline void panthor_ ## __name ## _irq_suspend(struct panthor_irq *pirq)
>  {												\
>  	int cookie;										\
>  												\
> -	atomic_set(&pirq->suspended, true);							\
> +	pirq->mask = 0;										\
>  												\
>  	if (drm_dev_enter(&pirq->ptdev->base, &cookie)) {					\
>  		gpu_write(pirq->ptdev, __reg_prefix ## _INT_MASK, 0);				\
> @@ -333,7 +333,7 @@ static inline void panthor_ ## __name ## _irq_suspend(struct panthor_irq *pirq)
>  		drm_dev_exit(cookie);								\
>  	}											\
>  												\
> -	pirq->mask = 0;										\
> +	atomic_set(&pirq->suspended, true);							\
>  }												\
>  												\
>  static inline void panthor_ ## __name ## _irq_resume(struct panthor_irq *pirq, u32 mask)	\
Liviu Dudau March 25, 2024, 5:16 p.m. UTC | #2
On Mon, Mar 25, 2024 at 02:57:04PM +0100, Boris Brezillon wrote:
> Make sure we set suspended=true last to avoid generating an irq storm
> in the unlikely case where an IRQ happens between the suspended=true
> assignment and the _INT_MASK update.
> 
> v2:
> - New patch
> 
> Reported-by: Steven Price <steven.price@arm.com>
> Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
> ---
>  drivers/gpu/drm/panthor/panthor_device.h | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/panthor/panthor_device.h b/drivers/gpu/drm/panthor/panthor_device.h
> index 7ee4987a3796..3a930a368ae1 100644
> --- a/drivers/gpu/drm/panthor/panthor_device.h
> +++ b/drivers/gpu/drm/panthor/panthor_device.h
> @@ -325,7 +325,7 @@ static inline void panthor_ ## __name ## _irq_suspend(struct panthor_irq *pirq)
>  {												\
>  	int cookie;										\
>  												\
> -	atomic_set(&pirq->suspended, true);							\
> +	pirq->mask = 0;										\

I think you might still have a race between _irq_suspend() and _irq_threaded_handler() where the
status will be zero due to pirq->mask being zero, so no interrupt will be cleared but they will
be masked (kind of the opposite problem to patch 3/3).

I'm starting to think that pirq->mask should be local to _irq_threaded_handler() and not be messed
with in the other functions.

>  												\
>  	if (drm_dev_enter(&pirq->ptdev->base, &cookie)) {					\
>  		gpu_write(pirq->ptdev, __reg_prefix ## _INT_MASK, 0);				\

If you move the line above before the if condition, do you still need patch 3/3?

Best regards,
Liviu

> @@ -333,7 +333,7 @@ static inline void panthor_ ## __name ## _irq_suspend(struct panthor_irq *pirq)
>  		drm_dev_exit(cookie);								\
>  	}											\
>  												\
> -	pirq->mask = 0;										\
> +	atomic_set(&pirq->suspended, true);							\
>  }												\
>  												\
>  static inline void panthor_ ## __name ## _irq_resume(struct panthor_irq *pirq, u32 mask)	\
> -- 
> 2.44.0
>
Boris Brezillon March 25, 2024, 6:02 p.m. UTC | #3
On Mon, 25 Mar 2024 17:16:16 +0000
Liviu Dudau <liviu.dudau@arm.com> wrote:

> On Mon, Mar 25, 2024 at 02:57:04PM +0100, Boris Brezillon wrote:
> > Make sure we set suspended=true last to avoid generating an irq storm
> > in the unlikely case where an IRQ happens between the suspended=true
> > assignment and the _INT_MASK update.
> > 
> > v2:
> > - New patch
> > 
> > Reported-by: Steven Price <steven.price@arm.com>
> > Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
> > ---
> >  drivers/gpu/drm/panthor/panthor_device.h | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/panthor/panthor_device.h b/drivers/gpu/drm/panthor/panthor_device.h
> > index 7ee4987a3796..3a930a368ae1 100644
> > --- a/drivers/gpu/drm/panthor/panthor_device.h
> > +++ b/drivers/gpu/drm/panthor/panthor_device.h
> > @@ -325,7 +325,7 @@ static inline void panthor_ ## __name ## _irq_suspend(struct panthor_irq *pirq)
> >  {												\
> >  	int cookie;										\
> >  												\
> > -	atomic_set(&pirq->suspended, true);							\
> > +	pirq->mask = 0;										\  
> 
> I think you might still have a race between _irq_suspend() and _irq_threaded_handler() where the
> status will be zero due to pirq->mask being zero, so no interrupt will be cleared but they will
> be masked (kind of the opposite problem to patch 3/3).

Right, but I'm trying to find a case where this is an issue. Yes, we
might lose events, but at the same time, when _irq_suspend() is called,
we are supposed to be idle, so all this mask=0 assignment does is
speed-up the synchronization with the irq-thread. If there's anything
we need to be done before suspending the IRQ, this should really use
its own synchronization model.

> 
> I'm starting to think that pirq->mask should be local to _irq_threaded_handler() and not be messed
> with in the other functions.

It kinda is, as we don't modify panthor_irq::mask outside the
suspend/resume (and now unplug) path, and each of these accesses has a
reason to exist:

- in the resume path, we know all IRQs are masked, and we reset the
  SW-side mask to the interrupts we want to accept before updating
  _INT_MASK. No risk of race in that one
- in the unplug path, I don't think we care about unhandled interrupts,
  because the device will become unusable after that point, so updating
  the panthor_irq::mask early and losing events should be okay.
- the suspend case has been described above. As explained, I don't think
  it matters if we lose events there, because really, if there's any
  synchronization needed, it should have happened explicitly before
  _irq_suspend() is called. The synchronize_irq() we have is just here
  to make sure there's nothing accessing registers when we turn the
  device clk/power-domain off.

> 
> >  												\
> >  	if (drm_dev_enter(&pirq->ptdev->base, &cookie)) {					\
> >  		gpu_write(pirq->ptdev, __reg_prefix ## _INT_MASK, 0);				\  
> 
> If you move the line above before the if condition, do you still need patch 3/3?

The whole point of the drm_dev_enter/exit() section was to prevent
access to registers after the device has been unplugged, so, if I move
the gpu_write() outside of this block, I'd rather drop the entire
drm_dev_enter/exit() section (both here and in _irq_resume()). That
should be safe actually, as I don't expect the PM hooks or the reset
handler to be called after the device and its resource have been
removed, and those are the two only paths where _irq_suspend/resume()
can be called.
Liviu Dudau March 26, 2024, 9:58 a.m. UTC | #4
On Mon, Mar 25, 2024 at 07:02:13PM +0100, Boris Brezillon wrote:
> On Mon, 25 Mar 2024 17:16:16 +0000
> Liviu Dudau <liviu.dudau@arm.com> wrote:
> 
> > On Mon, Mar 25, 2024 at 02:57:04PM +0100, Boris Brezillon wrote:
> > > Make sure we set suspended=true last to avoid generating an irq storm
> > > in the unlikely case where an IRQ happens between the suspended=true
> > > assignment and the _INT_MASK update.
> > > 
> > > v2:
> > > - New patch
> > > 
> > > Reported-by: Steven Price <steven.price@arm.com>
> > > Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
> > > ---
> > >  drivers/gpu/drm/panthor/panthor_device.h | 4 ++--
> > >  1 file changed, 2 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/drivers/gpu/drm/panthor/panthor_device.h b/drivers/gpu/drm/panthor/panthor_device.h
> > > index 7ee4987a3796..3a930a368ae1 100644
> > > --- a/drivers/gpu/drm/panthor/panthor_device.h
> > > +++ b/drivers/gpu/drm/panthor/panthor_device.h
> > > @@ -325,7 +325,7 @@ static inline void panthor_ ## __name ## _irq_suspend(struct panthor_irq *pirq)
> > >  {												\
> > >  	int cookie;										\
> > >  												\
> > > -	atomic_set(&pirq->suspended, true);							\
> > > +	pirq->mask = 0;										\  
> > 
> > I think you might still have a race between _irq_suspend() and _irq_threaded_handler() where the
> > status will be zero due to pirq->mask being zero, so no interrupt will be cleared but they will
> > be masked (kind of the opposite problem to patch 3/3).
> 
> Right, but I'm trying to find a case where this is an issue. Yes, we
> might lose events, but at the same time, when _irq_suspend() is called,
> we are supposed to be idle, so all this mask=0 assignment does is
> speed-up the synchronization with the irq-thread. If there's anything
> we need to be done before suspending the IRQ, this should really use
> its own synchronization model.

Perf counter collection before going idle ;)

I know that's not yet on your radar, but we need to keep it in mind to
test that scenario when we add support for keeping runtime PM alive
during perf counters dumping.

> 
> > 
> > I'm starting to think that pirq->mask should be local to _irq_threaded_handler() and not be messed
> > with in the other functions.
> 
> It kinda is, as we don't modify panthor_irq::mask outside the
> suspend/resume (and now unplug) path, and each of these accesses has a
> reason to exist:
> 
> - in the resume path, we know all IRQs are masked, and we reset the
>   SW-side mask to the interrupts we want to accept before updating
>   _INT_MASK. No risk of race in that one
> - in the unplug path, I don't think we care about unhandled interrupts,
>   because the device will become unusable after that point, so updating
>   the panthor_irq::mask early and losing events should be okay.
> - the suspend case has been described above. As explained, I don't think
>   it matters if we lose events there, because really, if there's any
>   synchronization needed, it should have happened explicitly before
>   _irq_suspend() is called. The synchronize_irq() we have is just here
>   to make sure there's nothing accessing registers when we turn the
>   device clk/power-domain off.
> 
> > 
> > >  												\
> > >  	if (drm_dev_enter(&pirq->ptdev->base, &cookie)) {					\
> > >  		gpu_write(pirq->ptdev, __reg_prefix ## _INT_MASK, 0);				\  
> > 
> > If you move the line above before the if condition, do you still need patch 3/3?
> 
> The whole point of the drm_dev_enter/exit() section was to prevent
> access to registers after the device has been unplugged, so, if I move
> the gpu_write() outside of this block, I'd rather drop the entire
> drm_dev_enter/exit() section (both here and in _irq_resume()). That
> should be safe actually, as I don't expect the PM hooks or the reset
> handler to be called after the device and its resource have been
> removed, and those are the two only paths where _irq_suspend/resume()
> can be called.

Agree. It feels strange to care a lot about preventing access to registers
(as we should) and then going and adding the unplug function which replaces
the suspend in 3 out of 7 cases. New contributors might get confused about
which one to use in the future.

What we should do is rename suspend to unplug and move the drm_dev_enter/exit()
condition at the calling point for the cases where it makes sense.

Best regards,
Liviu
Boris Brezillon March 26, 2024, 10:25 a.m. UTC | #5
On Tue, 26 Mar 2024 09:58:17 +0000
Liviu Dudau <liviu.dudau@arm.com> wrote:

> On Mon, Mar 25, 2024 at 07:02:13PM +0100, Boris Brezillon wrote:
> > On Mon, 25 Mar 2024 17:16:16 +0000
> > Liviu Dudau <liviu.dudau@arm.com> wrote:
> >   
> > > On Mon, Mar 25, 2024 at 02:57:04PM +0100, Boris Brezillon wrote:  
> > > > Make sure we set suspended=true last to avoid generating an irq storm
> > > > in the unlikely case where an IRQ happens between the suspended=true
> > > > assignment and the _INT_MASK update.
> > > > 
> > > > v2:
> > > > - New patch
> > > > 
> > > > Reported-by: Steven Price <steven.price@arm.com>
> > > > Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
> > > > ---
> > > >  drivers/gpu/drm/panthor/panthor_device.h | 4 ++--
> > > >  1 file changed, 2 insertions(+), 2 deletions(-)
> > > > 
> > > > diff --git a/drivers/gpu/drm/panthor/panthor_device.h b/drivers/gpu/drm/panthor/panthor_device.h
> > > > index 7ee4987a3796..3a930a368ae1 100644
> > > > --- a/drivers/gpu/drm/panthor/panthor_device.h
> > > > +++ b/drivers/gpu/drm/panthor/panthor_device.h
> > > > @@ -325,7 +325,7 @@ static inline void panthor_ ## __name ## _irq_suspend(struct panthor_irq *pirq)
> > > >  {												\
> > > >  	int cookie;										\
> > > >  												\
> > > > -	atomic_set(&pirq->suspended, true);							\
> > > > +	pirq->mask = 0;										\    
> > > 
> > > I think you might still have a race between _irq_suspend() and _irq_threaded_handler() where the
> > > status will be zero due to pirq->mask being zero, so no interrupt will be cleared but they will
> > > be masked (kind of the opposite problem to patch 3/3).  
> > 
> > Right, but I'm trying to find a case where this is an issue. Yes, we
> > might lose events, but at the same time, when _irq_suspend() is called,
> > we are supposed to be idle, so all this mask=0 assignment does is
> > speed-up the synchronization with the irq-thread. If there's anything
> > we need to be done before suspending the IRQ, this should really use
> > its own synchronization model.  
> 
> Perf counter collection before going idle ;)

That's typically a case of synchronization that should be done
explicitly rather than relying on the synchronize_irq() we have in
_irq_suspend().

> 
> I know that's not yet on your radar, but we need to keep it in mind to
> test that scenario when we add support for keeping runtime PM alive
> during perf counters dumping.

It's definitely on our radar, but I don't think _irq_suspend() is the
right way to handle perfcnt dump synchronization. IMHO, it deserves its
own _perfcnt_suspend() function that's called before _fw_suspend() to
make sure any in-flight perfcnt dumps get flushed before the
JOB/FW irq suspend happens.

> 
> >   
> > > 
> > > I'm starting to think that pirq->mask should be local to _irq_threaded_handler() and not be messed
> > > with in the other functions.  
> > 
> > It kinda is, as we don't modify panthor_irq::mask outside the
> > suspend/resume (and now unplug) path, and each of these accesses has a
> > reason to exist:
> > 
> > - in the resume path, we know all IRQs are masked, and we reset the
> >   SW-side mask to the interrupts we want to accept before updating
> >   _INT_MASK. No risk of race in that one
> > - in the unplug path, I don't think we care about unhandled interrupts,
> >   because the device will become unusable after that point, so updating
> >   the panthor_irq::mask early and losing events should be okay.
> > - the suspend case has been described above. As explained, I don't think
> >   it matters if we lose events there, because really, if there's any
> >   synchronization needed, it should have happened explicitly before
> >   _irq_suspend() is called. The synchronize_irq() we have is just here
> >   to make sure there's nothing accessing registers when we turn the
> >   device clk/power-domain off.
> >   
> > >   
> > > >  												\
> > > >  	if (drm_dev_enter(&pirq->ptdev->base, &cookie)) {					\
> > > >  		gpu_write(pirq->ptdev, __reg_prefix ## _INT_MASK, 0);				\    
> > > 
> > > If you move the line above before the if condition, do you still need patch 3/3?  
> > 
> > The whole point of the drm_dev_enter/exit() section was to prevent
> > access to registers after the device has been unplugged, so, if I move
> > the gpu_write() outside of this block, I'd rather drop the entire
> > drm_dev_enter/exit() section (both here and in _irq_resume()). That
> > should be safe actually, as I don't expect the PM hooks or the reset
> > handler to be called after the device and its resource have been
> > removed, and those are the two only paths where _irq_suspend/resume()
> > can be called.  
> 
> Agree. It feels strange to care a lot about preventing access to registers
> (as we should) and then going and adding the unplug function which replaces
> the suspend in 3 out of 7 cases. New contributors might get confused about
> which one to use in the future.

Actually, irq_unplug() should be used in the _unplug() path, as the
name implies, I don't think that's confusing :P. My point was more on
the fact we probably don't need the dev_enter/exit() sections in the
first place, so why introduce a new helper if we can make the
_irq_suspend() helper simpler and re-usable for the _unplug() case?

> 
> What we should do is rename suspend to unplug and move the drm_dev_enter/exit()
> condition at the calling point for the cases where it makes sense.

That'd be even more confusing IMO, why would you call an helper named
_unplug() if all you want to do is temporarily suspend the IRQ?
diff mbox series

Patch

diff --git a/drivers/gpu/drm/panthor/panthor_device.h b/drivers/gpu/drm/panthor/panthor_device.h
index 7ee4987a3796..3a930a368ae1 100644
--- a/drivers/gpu/drm/panthor/panthor_device.h
+++ b/drivers/gpu/drm/panthor/panthor_device.h
@@ -325,7 +325,7 @@  static inline void panthor_ ## __name ## _irq_suspend(struct panthor_irq *pirq)
 {												\
 	int cookie;										\
 												\
-	atomic_set(&pirq->suspended, true);							\
+	pirq->mask = 0;										\
 												\
 	if (drm_dev_enter(&pirq->ptdev->base, &cookie)) {					\
 		gpu_write(pirq->ptdev, __reg_prefix ## _INT_MASK, 0);				\
@@ -333,7 +333,7 @@  static inline void panthor_ ## __name ## _irq_suspend(struct panthor_irq *pirq)
 		drm_dev_exit(cookie);								\
 	}											\
 												\
-	pirq->mask = 0;										\
+	atomic_set(&pirq->suspended, true);							\
 }												\
 												\
 static inline void panthor_ ## __name ## _irq_resume(struct panthor_irq *pirq, u32 mask)	\