
[0/5] dma-fence, i915: Stop allowing SLAB_TYPESAFE_BY_RCU for dma_fence

Message ID 20210609212959.471209-1-jason@jlekstrand.net (mailing list archive)

Message

Jason Ekstrand June 9, 2021, 9:29 p.m. UTC
Ever since 0eafec6d3244 ("drm/i915: Enable lockless lookup of request
tracking via RCU"), the i915 driver has used SLAB_TYPESAFE_BY_RCU (it
was called SLAB_DESTROY_BY_RCU at the time) in order to allow RCU on
i915_request.  As nifty as SLAB_TYPESAFE_BY_RCU may be, it comes with
some serious disclaimers.  In particular, objects can get recycled while
RCU readers are still in-flight.  This can be ok if everyone who touches
these objects knows about the disclaimers and is careful.  However,
because we've chosen to use SLAB_TYPESAFE_BY_RCU for i915_request and
because i915_request contains a dma_fence, we've leaked
SLAB_TYPESAFE_BY_RCU and its whole pile of disclaimers to every driver
in the kernel which may consume a dma_fence.

We've tried to keep it somewhat contained by doing most of the hard work
to prevent access of recycled objects via dma_fence_get_rcu_safe().
However, a quick grep of kernel sources says that, of the 30 instances
of dma_fence_get_rcu*, only 11 of them use dma_fence_get_rcu_safe().
It's likely there are bear traps in DRM and related subsystems just waiting
for someone to accidentally step in them.

This patch series stops us using SLAB_TYPESAFE_BY_RCU for i915_request
and, instead, does an RCU-safe slab free via call_rcu().  This should
let us keep most of the perf benefits of slab allocation while avoiding
the bear traps inherent in SLAB_TYPESAFE_BY_RCU.  It then removes support
for SLAB_TYPESAFE_BY_RCU from dma_fence entirely.
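
For reference, the shape of the change is roughly the following (a
minimal sketch; the slab and helper names here are assumptions, not
necessarily the exact code in patch 3):

/* Sketch: defer the kmem_cache_free() by one RCU grace period via
 * the rcu_head already embedded in struct dma_fence, so RCU readers
 * can never observe recycled memory.
 */
static void __i915_request_free(struct rcu_head *rcu)
{
        struct i915_request *rq =
                container_of(rcu, struct i915_request, fence.rcu);

        kmem_cache_free(slab_requests, rq);
}

static void i915_request_free(struct i915_request *rq)
{
        call_rcu(&rq->fence.rcu, __i915_request_free);
}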

Note: The last patch is labeled DONOTMERGE.  This was at Daniel Vetter's
request as we may want to let this bake for a couple releases before we
rip out dma_fence_get_rcu_safe entirely.

Signed-off-by: Jason Ekstrand <jason@jlekstrand.net>
Cc: Jon Bloomfield <jon.bloomfield@intel.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Christian König <christian.koenig@amd.com>
Cc: Dave Airlie <airlied@redhat.com>
Cc: Matthew Auld <matthew.auld@intel.com>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>

Jason Ekstrand (5):
  drm/i915: Move intel_engine_free_request_pool to i915_request.c
  drm/i915: Use a simpler scheme for caching i915_request
  drm/i915: Stop using SLAB_TYPESAFE_BY_RCU for i915_request
  dma-buf: Stop using SLAB_TYPESAFE_BY_RCU in selftests
  DONOTMERGE: dma-buf: Get rid of dma_fence_get_rcu_safe

 drivers/dma-buf/dma-fence-chain.c         |   8 +-
 drivers/dma-buf/dma-resv.c                |   4 +-
 drivers/dma-buf/st-dma-fence-chain.c      |  24 +---
 drivers/dma-buf/st-dma-fence.c            |  27 +---
 drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c |   4 +-
 drivers/gpu/drm/i915/gt/intel_engine_cs.c |   8 --
 drivers/gpu/drm/i915/i915_active.h        |   4 +-
 drivers/gpu/drm/i915/i915_request.c       | 147 ++++++++++++----------
 drivers/gpu/drm/i915/i915_request.h       |   2 -
 drivers/gpu/drm/i915/i915_vma.c           |   4 +-
 include/drm/drm_syncobj.h                 |   4 +-
 include/linux/dma-fence.h                 |  50 --------
 include/linux/dma-resv.h                  |   4 +-
 13 files changed, 110 insertions(+), 180 deletions(-)

Comments

Tvrtko Ursulin June 10, 2021, 9:29 a.m. UTC | #1
On 09/06/2021 22:29, Jason Ekstrand wrote:
> Ever since 0eafec6d3244 ("drm/i915: Enable lockless lookup of request
> tracking via RCU"), the i915 driver has used SLAB_TYPESAFE_BY_RCU (it
> was called SLAB_DESTROY_BY_RCU at the time) in order to allow RCU on
> i915_request.  As nifty as SLAB_TYPESAFE_BY_RCU may be, it comes with
> some serious disclaimers.  In particular, objects can get recycled while
> RCU readers are still in-flight.  This can be ok if everyone who touches
> these objects knows about the disclaimers and is careful.  However,
> because we've chosen to use SLAB_TYPESAFE_BY_RCU for i915_request and
> because i915_request contains a dma_fence, we've leaked
> SLAB_TYPESAFE_BY_RCU and its whole pile of disclaimers to every driver
> in the kernel which may consume a dma_fence.

I don't think the part about leaking is true...

> We've tried to keep it somewhat contained by doing most of the hard work
> to prevent access of recycled objects via dma_fence_get_rcu_safe().
> However, a quick grep of kernel sources says that, of the 30 instances
> of dma_fence_get_rcu*, only 11 of them use dma_fence_get_rcu_safe().
> It's likely there are bear traps in DRM and related subsystems just waiting
> for someone to accidentally step in them.

...because dma_fence_get_rcu_safe appears to be about whether the
*pointer* to the fence itself is rcu protected, not about the fence 
object itself.

If one has a stable pointer to a fence dma_fence_get_rcu is I think 
enough to deal with SLAB_TYPESAFE_BY_RCU used by i915_request (as dma 
fence is a base object there). Unless you found a bug in rq field 
recycling. But access to the dma fence is all tightly controlled so I 
don't get what leaks.

> This patch series stops us using SLAB_TYPESAFE_BY_RCU for i915_request
> and, instead, does an RCU-safe slab free via call_rcu().  This should
> let us keep most of the perf benefits of slab allocation while avoiding
> the bear traps inherent in SLAB_TYPESAFE_BY_RCU.  It then removes support
> for SLAB_TYPESAFE_BY_RCU from dma_fence entirely.

According to the rationale behind SLAB_TYPESAFE_BY_RCU traditional RCU 
freeing can be a lot more costly so I think we need a clear 
justification on why this change is being considered.

Regards,

Tvrtko

> 
> Note: The last patch is labeled DONOTMERGE.  This was at Daniel Vetter's
> request as we may want to let this bake for a couple releases before we
> rip out dma_fence_get_rcu_safe entirely.
> 
> Signed-off-by: Jason Ekstrand <jason@jlekstrand.net>
> Cc: Jon Bloomfield <jon.bloomfield@intel.com>
> Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
> Cc: Christian König <christian.koenig@amd.com>
> Cc: Dave Airlie <airlied@redhat.com>
> Cc: Matthew Auld <matthew.auld@intel.com>
> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
> 
> Jason Ekstrand (5):
>    drm/i915: Move intel_engine_free_request_pool to i915_request.c
>    drm/i915: Use a simpler scheme for caching i915_request
>    drm/i915: Stop using SLAB_TYPESAFE_BY_RCU for i915_request
>    dma-buf: Stop using SLAB_TYPESAFE_BY_RCU in selftests
>    DONOTMERGE: dma-buf: Get rid of dma_fence_get_rcu_safe
> 
>   drivers/dma-buf/dma-fence-chain.c         |   8 +-
>   drivers/dma-buf/dma-resv.c                |   4 +-
>   drivers/dma-buf/st-dma-fence-chain.c      |  24 +---
>   drivers/dma-buf/st-dma-fence.c            |  27 +---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c |   4 +-
>   drivers/gpu/drm/i915/gt/intel_engine_cs.c |   8 --
>   drivers/gpu/drm/i915/i915_active.h        |   4 +-
>   drivers/gpu/drm/i915/i915_request.c       | 147 ++++++++++++----------
>   drivers/gpu/drm/i915/i915_request.h       |   2 -
>   drivers/gpu/drm/i915/i915_vma.c           |   4 +-
>   include/drm/drm_syncobj.h                 |   4 +-
>   include/linux/dma-fence.h                 |  50 --------
>   include/linux/dma-resv.h                  |   4 +-
>   13 files changed, 110 insertions(+), 180 deletions(-)
>
Christian König June 10, 2021, 9:39 a.m. UTC | #2
Am 10.06.21 um 11:29 schrieb Tvrtko Ursulin:
>
> On 09/06/2021 22:29, Jason Ekstrand wrote:
>> Ever since 0eafec6d3244 ("drm/i915: Enable lockless lookup of request
>> tracking via RCU"), the i915 driver has used SLAB_TYPESAFE_BY_RCU (it
>> was called SLAB_DESTROY_BY_RCU at the time) in order to allow RCU on
>> i915_request.  As nifty as SLAB_TYPESAFE_BY_RCU may be, it comes with
>> some serious disclaimers.  In particular, objects can get recycled while
>> RCU readers are still in-flight.  This can be ok if everyone who touches
>> these objects knows about the disclaimers and is careful. However,
>> because we've chosen to use SLAB_TYPESAFE_BY_RCU for i915_request and
>> because i915_request contains a dma_fence, we've leaked
>> SLAB_TYPESAFE_BY_RCU and its whole pile of disclaimers to every driver
>> in the kernel which may consume a dma_fence.
>
> I don't think the part about leaking is true...
>
>> We've tried to keep it somewhat contained by doing most of the hard work
>> to prevent access of recycled objects via dma_fence_get_rcu_safe().
>> However, a quick grep of kernel sources says that, of the 30 instances
>> of dma_fence_get_rcu*, only 11 of them use dma_fence_get_rcu_safe().
>> It's likely there are bear traps in DRM and related subsystems just waiting
>> for someone to accidentally step in them.
>
> ...because dma_fence_get_rcu_safe appears to be about whether the
> *pointer* to the fence itself is rcu protected, not about the fence 
> object itself.

Yes, exactly that.

>
> If one has a stable pointer to a fence dma_fence_get_rcu is I think 
> enough to deal with SLAB_TYPESAFE_BY_RCU used by i915_request (as dma 
> fence is a base object there). Unless you found a bug in rq field 
> recycling. But access to the dma fence is all tightly controlled so I 
> don't get what leaks.
>
>> This patch series stops us using SLAB_TYPESAFE_BY_RCU for i915_request
>> and, instead, does an RCU-safe slab free via call_rcu().  This should
>> let us keep most of the perf benefits of slab allocation while avoiding
>> the bear traps inherent in SLAB_TYPESAFE_BY_RCU.  It then removes 
>> support
>> for SLAB_TYPESAFE_BY_RCU from dma_fence entirely.
>
> According to the rationale behind SLAB_TYPESAFE_BY_RCU traditional RCU 
> freeing can be a lot more costly so I think we need a clear 
> justification on why this change is being considered.

The problem is that SLAB_TYPESAFE_BY_RCU requires that we use a sequence 
counter to make sure that we don't grab the reference to a reallocated 
dma_fence.

Updating the sequence counter every time we add a fence now means two 
additional writes and one additional barrier for an extremely hot path.
The extra overhead of RCU freeing is completely negligible compared to that.
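
In other words, the write side has to look something like this (a
sketch with dma_resv-style names; the exact fields are assumptions):

/* Every fence publication pays a seqcount bump: two extra writes
 * plus the implied write barrier, only so that readers can detect
 * SLAB_TYPESAFE_BY_RCU recycling. Putting the old fence is omitted.
 */
static void publish_excl_fence(struct dma_resv *obj,
                               struct dma_fence *fence)
{
        dma_fence_get(fence);

        write_seqcount_begin(&obj->seq);   /* extra write + barrier */
        RCU_INIT_POINTER(obj->fence_excl, fence);
        write_seqcount_end(&obj->seq);     /* extra write */
}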

The good news is that I think if we are just a bit more clever about our 
handling we can both avoid the sequence counter and keep
SLAB_TYPESAFE_BY_RCU around.

But this needs more code cleanup and abstracting the sequence counter 
usage in a macro.

Regards,
Christian.


>
> Regards,
>
> Tvrtko
>
>>
>> Note: The last patch is labeled DONOTMERGE.  This was at Daniel Vetter's
>> request as we may want to let this bake for a couple releases before we
>> rip out dma_fence_get_rcu_safe entirely.
>>
>> Signed-off-by: Jason Ekstrand <jason@jlekstrand.net>
>> Cc: Jon Bloomfield <jon.bloomfield@intel.com>
>> Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
>> Cc: Christian König <christian.koenig@amd.com>
>> Cc: Dave Airlie <airlied@redhat.com>
>> Cc: Matthew Auld <matthew.auld@intel.com>
>> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
>>
>> Jason Ekstrand (5):
>>    drm/i915: Move intel_engine_free_request_pool to i915_request.c
>>    drm/i915: Use a simpler scheme for caching i915_request
>>    drm/i915: Stop using SLAB_TYPESAFE_BY_RCU for i915_request
>>    dma-buf: Stop using SLAB_TYPESAFE_BY_RCU in selftests
>>    DONOTMERGE: dma-buf: Get rid of dma_fence_get_rcu_safe
>>
>>   drivers/dma-buf/dma-fence-chain.c         |   8 +-
>>   drivers/dma-buf/dma-resv.c                |   4 +-
>>   drivers/dma-buf/st-dma-fence-chain.c      |  24 +---
>>   drivers/dma-buf/st-dma-fence.c            |  27 +---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c |   4 +-
>>   drivers/gpu/drm/i915/gt/intel_engine_cs.c |   8 --
>>   drivers/gpu/drm/i915/i915_active.h        |   4 +-
>>   drivers/gpu/drm/i915/i915_request.c       | 147 ++++++++++++----------
>>   drivers/gpu/drm/i915/i915_request.h       |   2 -
>>   drivers/gpu/drm/i915/i915_vma.c           |   4 +-
>>   include/drm/drm_syncobj.h                 |   4 +-
>>   include/linux/dma-fence.h                 |  50 --------
>>   include/linux/dma-resv.h                  |   4 +-
>>   13 files changed, 110 insertions(+), 180 deletions(-)
>>
Daniel Vetter June 10, 2021, 11:29 a.m. UTC | #3
On Thu, Jun 10, 2021 at 11:39 AM Christian König
<christian.koenig@amd.com> wrote:
> Am 10.06.21 um 11:29 schrieb Tvrtko Ursulin:
> > On 09/06/2021 22:29, Jason Ekstrand wrote:
> >> Ever since 0eafec6d3244 ("drm/i915: Enable lockless lookup of request
> >> tracking via RCU"), the i915 driver has used SLAB_TYPESAFE_BY_RCU (it
> >> was called SLAB_DESTROY_BY_RCU at the time) in order to allow RCU on
> >> i915_request.  As nifty as SLAB_TYPESAFE_BY_RCU may be, it comes with
> >> some serious disclaimers.  In particular, objects can get recycled while
> >> RCU readers are still in-flight.  This can be ok if everyone who touches
> >> these objects knows about the disclaimers and is careful. However,
> >> because we've chosen to use SLAB_TYPESAFE_BY_RCU for i915_request and
> >> because i915_request contains a dma_fence, we've leaked
> >> SLAB_TYPESAFE_BY_RCU and its whole pile of disclaimers to every driver
> >> in the kernel which may consume a dma_fence.
> >
> > I don't think the part about leaking is true...
> >
> >> We've tried to keep it somewhat contained by doing most of the hard work
> >> to prevent access of recycled objects via dma_fence_get_rcu_safe().
> >> However, a quick grep of kernel sources says that, of the 30 instances
> >> of dma_fence_get_rcu*, only 11 of them use dma_fence_get_rcu_safe().
> >> It's likely there are bear traps in DRM and related subsystems just waiting
> >> for someone to accidentally step in them.
> >
> > ...because dma_fence_get_rcu_safe appears to be about whether the
> > *pointer* to the fence itself is rcu protected, not about the fence
> > object itself.
>
> Yes, exactly that.

We do leak, and badly. Any __rcu protected fence pointer where a
shared fence could show up is affected. And the point of dma_fence is
that they're shareable, and we're inventing ever more ways to do so
(sync_file, drm_syncobj, implicit fencing maybe soon with
import/export ioctl on top, in/out fences in CS ioctl, atomic ioctl,
...).

So without a full audit anything that uses the following pattern is
probably busted:

rcu_read_lock();
fence = rcu_dereference();
fence = dma_fence_get_rcu();
rcu_read_unlock();

/* use the fence now that we acquired a full reference */

And I don't mean "you might wait a bit too much" busted, but "this can
lead to loops in the dma_fence dependency chain, resulting in
deadlocks" kind of busted. What's worse, the standard rcu lockless
access pattern is also busted completely:

rcu_read_lock();
fence = rcu_dereference();
/* locklessly check the state of fence */
rcu_read_unlock();

because once you have TYPESAFE_BY_RCU rcu_read_lock doesn't prevent a
use-after-free anymore. The only thing it guarantees is that your
fence pointer keeps pointing at either freed memory, or a fence, but
nothing else. You have to wrap your rcu_dereference and code into a
seqlock of some kind, either a real one like dma_resv, or an
open-coded one like dma_fence_get_rcu_safe uses. And yes the latter is
a specialized seqlock, except it fails to properly document in
comments where all the required barriers are.
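
So the lockless read has to look something like this instead (a sketch
assuming a dma_resv-style seqcount sitting next to the __rcu pointer):

struct dma_fence *fence;
unsigned int seq;

rcu_read_lock();
do {
        seq = read_seqcount_begin(&obj->seq);
        fence = rcu_dereference(obj->fence_excl);
        /* locklessly inspect the fence state here */
} while (read_seqcount_retry(&obj->seq, seq));
rcu_read_unlock();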

tldr; all the code using dma_fence_get_rcu needs to be assumed to be broken.

Heck this is fragile and tricky enough that i915 shot its own leg off
routinely (there's a bugfix floating around just now), so not even
internally we're very good at getting this right.

> > If one has a stable pointer to a fence dma_fence_get_rcu is I think
> > enough to deal with SLAB_TYPESAFE_BY_RCU used by i915_request (as dma
> > fence is a base object there). Unless you found a bug in rq field
> > recycling. But access to the dma fence is all tightly controlled so I
> > don't get what leaks.
> >
> >> This patch series stops us using SLAB_TYPESAFE_BY_RCU for i915_request
> >> and, instead, does an RCU-safe slab free via call_rcu().  This should
> >> let us keep most of the perf benefits of slab allocation while avoiding
> >> the bear traps inherent in SLAB_TYPESAFE_BY_RCU.  It then removes
> >> support
> >> for SLAB_TYPESAFE_BY_RCU from dma_fence entirely.
> >
> > According to the rationale behind SLAB_TYPESAFE_BY_RCU traditional RCU
> > freeing can be a lot more costly so I think we need a clear
> > justification on why this change is being considered.
>
> The problem is that SLAB_TYPESAFE_BY_RCU requires that we use a sequence
> counter to make sure that we don't grab the reference to a reallocated
> dma_fence.
>
> Updating the sequence counter every time we add a fence now means two
> additional writes and one additional barrier for an extremely hot path.
> The extra overhead of RCU freeing is completely negligible compared to that.
>
> The good news is that I think if we are just a bit more clever about our
> handling we can both avoid the sequence counter and keep
> SLAB_TYPESAFE_BY_RCU around.

You still need a seqlock, or something else that's serving as your
seqlock. A dma_fence_list behind a single __rcu protected pointer, with
all subsequent fence pointers _not_ being rcu protected (i.e. full
references, with a new list allocated on every change) might work. Which is a very
funny way of implementing something like a seqlock.
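
A sketch of that idea, with all of the names invented for illustration:

/* Readers see either the whole old list or the whole new one; the
 * inner pointers hold full references and are not __rcu protected.
 */
struct dma_fence_list {
        struct rcu_head rcu;
        unsigned int num_fences;
        struct dma_fence *fences[];     /* full references */
};

static void replace_fences(struct dma_fence_list __rcu **slot,
                           struct dma_fence_list *new_list)
{
        struct dma_fence_list *old;

        /* writer side, under the writer's lock: publish the new list
         * with a single store, free the old one after a grace period */
        old = rcu_replace_pointer(*slot, new_list, true);
        if (old)
                kfree_rcu(old, rcu);
}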

And that only covers dma_resv, you _have_ to do this _everywhere_ in
every driver. Except if you can prove that your __rcu fence pointer
only ever points at your own driver's fences.

So unless you're volunteering to audit all the drivers, and constantly
re-audit them (because rcu only guaranteeing type-safety but not
actually preventing use-after-free is very unusual in the kernel) just
fixing dma_resv doesn't solve the problem here at all.

> But this needs more code cleanup and abstracting the sequence counter
> usage in a macro.

The other thing is that this doesn't even make sense for i915 anymore.
The solution to the "userspace wants to submit bazillion requests"
problem is direct userspace submit. Current hw doesn't have userspace
ringbuffer, but we have a pretty clever trick in the works to make
this possible with current hw, essentially by submitting a CS that
loops on itself, and then inserting batches into this "ring" by
latching a conditional branch in this CS. It's not pretty, but it gets
the job done and outright removes the need for plaid mode throughput
of i915_request dma fences.
-Daniel

>
> Regards,
> Christian.
>
>
> >
> > Regards,
> >
> > Tvrtko
> >
> >>
> >> Note: The last patch is labeled DONOTMERGE.  This was at Daniel Vetter's
> >> request as we may want to let this bake for a couple releases before we
> >> rip out dma_fence_get_rcu_safe entirely.
> >>
> >> Signed-off-by: Jason Ekstrand <jason@jlekstrand.net>
> >> Cc: Jon Bloomfield <jon.bloomfield@intel.com>
> >> Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
> >> Cc: Christian König <christian.koenig@amd.com>
> >> Cc: Dave Airlie <airlied@redhat.com>
> >> Cc: Matthew Auld <matthew.auld@intel.com>
> >> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
> >>
> >> Jason Ekstrand (5):
> >>    drm/i915: Move intel_engine_free_request_pool to i915_request.c
> >>    drm/i915: Use a simpler scheme for caching i915_request
> >>    drm/i915: Stop using SLAB_TYPESAFE_BY_RCU for i915_request
> >>    dma-buf: Stop using SLAB_TYPESAFE_BY_RCU in selftests
> >>    DONOTMERGE: dma-buf: Get rid of dma_fence_get_rcu_safe
> >>
> >>   drivers/dma-buf/dma-fence-chain.c         |   8 +-
> >>   drivers/dma-buf/dma-resv.c                |   4 +-
> >>   drivers/dma-buf/st-dma-fence-chain.c      |  24 +---
> >>   drivers/dma-buf/st-dma-fence.c            |  27 +---
> >>   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c |   4 +-
> >>   drivers/gpu/drm/i915/gt/intel_engine_cs.c |   8 --
> >>   drivers/gpu/drm/i915/i915_active.h        |   4 +-
> >>   drivers/gpu/drm/i915/i915_request.c       | 147 ++++++++++++----------
> >>   drivers/gpu/drm/i915/i915_request.h       |   2 -
> >>   drivers/gpu/drm/i915/i915_vma.c           |   4 +-
> >>   include/drm/drm_syncobj.h                 |   4 +-
> >>   include/linux/dma-fence.h                 |  50 --------
> >>   include/linux/dma-resv.h                  |   4 +-
> >>   13 files changed, 110 insertions(+), 180 deletions(-)
> >>
>
Daniel Vetter June 10, 2021, 11:53 a.m. UTC | #4
On Thu, Jun 10, 2021 at 1:29 PM Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
> On Thu, Jun 10, 2021 at 11:39 AM Christian König
> <christian.koenig@amd.com> wrote:
> > Am 10.06.21 um 11:29 schrieb Tvrtko Ursulin:
> > > On 09/06/2021 22:29, Jason Ekstrand wrote:
> > >> Ever since 0eafec6d3244 ("drm/i915: Enable lockless lookup of request
> > >> tracking via RCU"), the i915 driver has used SLAB_TYPESAFE_BY_RCU (it
> > >> was called SLAB_DESTROY_BY_RCU at the time) in order to allow RCU on
> > >> i915_request.  As nifty as SLAB_TYPESAFE_BY_RCU may be, it comes with
> > >> some serious disclaimers.  In particular, objects can get recycled while
> > >> RCU readers are still in-flight.  This can be ok if everyone who touches
> > >> these objects knows about the disclaimers and is careful. However,
> > >> because we've chosen to use SLAB_TYPESAFE_BY_RCU for i915_request and
> > >> because i915_request contains a dma_fence, we've leaked
> > >> SLAB_TYPESAFE_BY_RCU and its whole pile of disclaimers to every driver
> > >> in the kernel which may consume a dma_fence.
> > >
> > > I don't think the part about leaking is true...
> > >
> > >> We've tried to keep it somewhat contained by doing most of the hard work
> > >> to prevent access of recycled objects via dma_fence_get_rcu_safe().
> > >> However, a quick grep of kernel sources says that, of the 30 instances
> > >> of dma_fence_get_rcu*, only 11 of them use dma_fence_get_rcu_safe().
> > >> It's likely there are bear traps in DRM and related subsystems just waiting
> > >> for someone to accidentally step in them.
> > >
> > > ...because dma_fence_get_rcu_safe appears to be about whether the
> > > *pointer* to the fence itself is rcu protected, not about the fence
> > > object itself.
> >
> > Yes, exactly that.
>
> We do leak, and badly. Any __rcu protected fence pointer where a
> shared fence could show up is affected. And the point of dma_fence is
> that they're shareable, and we're inventing ever more ways to do so
> (sync_file, drm_syncobj, implicit fencing maybe soon with
> import/export ioctl on top, in/out fences in CS ioctl, atomic ioctl,
> ...).
>
> So without a full audit anything that uses the following pattern is
> probably busted:
>
> rcu_read_lock();
> fence = rcu_dereference();
> fence = dma_fence_get_rcu();
> rcu_read_unlock();
>
> /* use the fence now that we acquired a full reference */
>
> And I don't mean "you might wait a bit too much" busted, but "this can
> lead to loops in the dma_fence dependency chain, resulting in
> deadlocks" kind of busted. What's worse, the standard rcu lockless
> access pattern is also busted completely:
>
> rcu_read_lock();
> fence = rcu_dereference();
> /* locklessly check the state of fence */
> rcu_read_unlock();
>
> because once you have TYPESAFE_BY_RCU rcu_read_lock doesn't prevent a
> use-after-free anymore. The only thing it guarantees is that your
> fence pointer keeps pointing at either freed memory, or a fence, but
> nothing else. You have to wrap your rcu_dereference and code into a
> seqlock of some kind, either a real one like dma_resv, or an
> open-coded one like dma_fence_get_rcu_safe uses. And yes the latter is
> a specialized seqlock, except it fails to properly document in
> comments where all the required barriers are.
>
> tldr; all the code using dma_fence_get_rcu needs to be assumed to be broken.
>
> Heck this is fragile and tricky enough that i915 shot its own leg off
> routinely (there's a bugfix floating around just now), so not even
> internally we're very good at getting this right.
>
> > > If one has a stable pointer to a fence dma_fence_get_rcu is I think
> > > enough to deal with SLAB_TYPESAFE_BY_RCU used by i915_request (as dma
> > > fence is a base object there). Unless you found a bug in rq field
> > > recycling. But access to the dma fence is all tightly controlled so I
> > > don't get what leaks.
> > >
> > >> This patch series stops us using SLAB_TYPESAFE_BY_RCU for i915_request
> > >> and, instead, does an RCU-safe slab free via call_rcu().  This should
> > >> let us keep most of the perf benefits of slab allocation while avoiding
> > >> the bear traps inherent in SLAB_TYPESAFE_BY_RCU.  It then removes
> > >> support
> > >> for SLAB_TYPESAFE_BY_RCU from dma_fence entirely.
> > >
> > > According to the rationale behind SLAB_TYPESAFE_BY_RCU traditional RCU
> > > freeing can be a lot more costly so I think we need a clear
> > > justification on why this change is being considered.
> >
> > The problem is that SLAB_TYPESAFE_BY_RCU requires that we use a sequence
> > counter to make sure that we don't grab the reference to a reallocated
> > dma_fence.
> >
> > Updating the sequence counter every time we add a fence now means two
> > additional writes and one additional barrier for an extremely hot path.
> > The extra overhead of RCU freeing is completely negligible compared to that.
> >
> > The good news is that I think if we are just a bit more clever about our
> > handling we can both avoid the sequence counter and keep
> > SLAB_TYPESAFE_BY_RCU around.
>
> You still need a seqlock, or something else that's serving as your
> seqlock. A dma_fence_list behind a single __rcu protected pointer, with
> all subsequent fence pointers _not_ being rcu protected (i.e. full
> references, with a new list allocated on every change) might work. Which is a very
> funny way of implementing something like a seqlock.
>
> And that only covers dma_resv, you _have_ to do this _everywhere_ in
> every driver. Except if you can prove that your __rcu fence pointer
> only ever points at your own driver's fences.
>
> So unless you're volunteering to audit all the drivers, and constantly
> re-audit them (because rcu only guaranteeing type-safety but not
> actually preventing use-after-free is very unusual in the kernel) just
> fixing dma_resv doesn't solve the problem here at all.
>
> > But this needs more code cleanup and abstracting the sequence counter
> > usage in a macro.
>
> The other thing is that this doesn't even make sense for i915 anymore.
> The solution to the "userspace wants to submit bazillion requests"
> problem is direct userspace submit. Current hw doesn't have userspace
> ringbuffer, but we have a pretty clever trick in the works to make
> this possible with current hw, essentially by submitting a CS that
> loops on itself, and then inserting batches into this "ring" by
> latching a conditional branch in this CS. It's not pretty, but it gets
> the job done and outright removes the need for plaid mode throughput
> of i915_request dma fences.

To put it another way: I'm the guy who reviewed the patch which
started this entire TYPESAFE_BY_RCU mess we got ourselves into:

commit 0eafec6d3244802d469712682b0f513963c23eff
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Thu Aug 4 16:32:41 2016 +0100

   drm/i915: Enable lockless lookup of request tracking via RCU

...

   Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
   Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
   Cc: "Goel, Akash" <akash.goel@intel.com>
   Cc: Josh Triplett <josh@joshtriplett.org>
   Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
   Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
   Link: http://patchwork.freedesktop.org/patch/msgid/1470324762-2545-25-git-send-email-chris@chris-wilson.co.uk

Looking back this was a mistake. The innocently labelled
DESTROY_BY_RCU tricked me real bad, and we never had any real-world
use-case to justify all the danger this brought not just to i915, but
to any driver using __rcu protected dma_fence access. It's not worth
it.
-Daniel
Tvrtko Ursulin June 10, 2021, 1:07 p.m. UTC | #5
On 10/06/2021 12:29, Daniel Vetter wrote:
> On Thu, Jun 10, 2021 at 11:39 AM Christian König
> <christian.koenig@amd.com> wrote:
>> Am 10.06.21 um 11:29 schrieb Tvrtko Ursulin:
>>> On 09/06/2021 22:29, Jason Ekstrand wrote:
>>>> Ever since 0eafec6d3244 ("drm/i915: Enable lockless lookup of request
>>>> tracking via RCU"), the i915 driver has used SLAB_TYPESAFE_BY_RCU (it
>>>> was called SLAB_DESTROY_BY_RCU at the time) in order to allow RCU on
>>>> i915_request.  As nifty as SLAB_TYPESAFE_BY_RCU may be, it comes with
>>>> some serious disclaimers.  In particular, objects can get recycled while
>>>> RCU readers are still in-flight.  This can be ok if everyone who touches
>>>> these objects knows about the disclaimers and is careful. However,
>>>> because we've chosen to use SLAB_TYPESAFE_BY_RCU for i915_request and
>>>> because i915_request contains a dma_fence, we've leaked
>>>> SLAB_TYPESAFE_BY_RCU and its whole pile of disclaimers to every driver
>>>> in the kernel which may consume a dma_fence.
>>>
>>> I don't think the part about leaking is true...
>>>
>>>> We've tried to keep it somewhat contained by doing most of the hard work
>>>> to prevent access of recycled objects via dma_fence_get_rcu_safe().
>>>> However, a quick grep of kernel sources says that, of the 30 instances
>>>> of dma_fence_get_rcu*, only 11 of them use dma_fence_get_rcu_safe().
>>>> It's likely there are bear traps in DRM and related subsystems just waiting
>>>> for someone to accidentally step in them.
>>>
>>> ...because dma_fence_get_rcu_safe appears to be about whether the
>>> *pointer* to the fence itself is rcu protected, not about the fence
>>> object itself.
>>
>> Yes, exactly that.
> 
> We do leak, and badly. Any __rcu protected fence pointer where a
> shared fence could show up is affected. And the point of dma_fence is
> that they're shareable, and we're inventing ever more ways to do so
> (sync_file, drm_syncobj, implicit fencing maybe soon with
> import/export ioctl on top, in/out fences in CS ioctl, atomic ioctl,
> ...).
> 
> So without a full audit anything that uses the following pattern is
> probably busted:
> 
> rcu_read_lock();
> fence = rcu_dereference();
> fence = dma_fence_get_rcu();
> rcu_read_unlock();

What do you mean by _probably_ busted? This should either fail in 
kref_get_unless_zero for freed fences or grab the wrong fence, which
dma_fence_get_rcu_safe() is supposed to detect.

> /* use the fence now that we acquired a full reference */
> 
> And I don't mean "you might wait a bit too much" busted, but "this can
> lead to loops in the dma_fence dependency chain, resulting in
> deadlocks" kind of busted. What's worse, the standard rcu lockless

I don't have the story on dma_fence dependency chain deadlocks. Maybe 
put a few more words about that in the cover letter since it would be 
good to understand the real motivation behind the change.

Are there even bugs about deadlocks which should be mentioned?

Or why does the act of a fence being signaled not remove the fence from
the rcu protected containers, preventing the stale pointer problem?

> access pattern is also busted completely:
> 
> rcu_read_lock();
> fence = rcu_dereference();
> /* locklessly check the state of fence */
> rcu_read_unlock();

My understanding is that lockless access, e.g. with no reference taken,
can end up at a different fence or a recycled fence, but won't cause use after
free when under the rcu lock.

As long as dma fence users are going through the API entry points 
individual drivers should be able to handle things correctly.

> because once you have TYPESAFE_BY_RCU rcu_read_lock doesn't prevent a
> use-after-free anymore. The only thing it guarantees is that your
> fence pointer keeps pointing at either freed memory, or a fence, but

Again, I think it can't be freed memory inside the rcu lock section. It 
can only be a re-allocated or unused object.

Overall it looks like there is some complication involving the 
interaction between rcu protected pointers and SLAB_TYPESAFE_BY_RCU, 
rather than a simple statement that i915 leaked/broke something for other
drivers. And then there is the challenge of auditing drivers to make them
all use dma_fence_get_rcu_safe() when dealing with such storage.

To be clear I don't mind simplifications in principle as long as the 
problem statement is accurate.

And some benchmarks definitely need to be run here. At least that was
the usual thing in the past when such large changes were being proposed.

Regards,

Tvrtko

> nothing else. You have to wrap your rcu_dereference and code into a
> seqlock of some kind, either a real one like dma_resv, or an
> open-coded one like dma_fence_get_rcu_safe uses. And yes the latter is
> a specialized seqlock, except it fails to properly document in
> comments where all the required barriers are.
> 
> tldr; all the code using dma_fence_get_rcu needs to be assumed to be broken.
> 
> Heck this is fragile and tricky enough that i915 shot its own leg off
> routinely (there's a bugfix floating around just now), so not even
> internally we're very good at getting this right.
> 
>>> If one has a stable pointer to a fence dma_fence_get_rcu is I think
>>> enough to deal with SLAB_TYPESAFE_BY_RCU used by i915_request (as dma
>>> fence is a base object there). Unless you found a bug in rq field
>>> recycling. But access to the dma fence is all tightly controlled so I
>>> don't get what leaks.
>>>
>>>> This patch series stops us using SLAB_TYPESAFE_BY_RCU for i915_request
>>>> and, instead, does an RCU-safe slab free via call_rcu().  This should
>>>> let us keep most of the perf benefits of slab allocation while avoiding
>>>> the bear traps inherent in SLAB_TYPESAFE_BY_RCU.  It then removes
>>>> support
>>>> for SLAB_TYPESAFE_BY_RCU from dma_fence entirely.
>>>
>>> According to the rationale behind SLAB_TYPESAFE_BY_RCU traditional RCU
>>> freeing can be a lot more costly so I think we need a clear
>>> justification on why this change is being considered.
>>
>> The problem is that SLAB_TYPESAFE_BY_RCU requires that we use a sequence
>> counter to make sure that we don't grab the reference to a reallocated
>> dma_fence.
>>
>> Updating the sequence counter every time we add a fence now means two
>> additional writes and one additional barrier for an extremely hot path.
>> The extra overhead of RCU freeing is completely negligible compared to that.
>>
>> The good news is that I think if we are just a bit more clever about our
>> handling we can both avoid the sequence counter and keep
>> SLAB_TYPESAFE_BY_RCU around.
> 
> You still need a seqlock, or something else that's serving as your
> seqlock. A dma_fence_list behind a single __rcu protected pointer, with
> all subsequent fence pointers _not_ being rcu protected (i.e. full
> references, with a new list allocated on every change) might work. Which is a very
> funny way of implementing something like a seqlock.
> 
> And that only covers dma_resv, you _have_ to do this _everywhere_ in
> every driver. Except if you can prove that your __rcu fence pointer
> only ever points at your own driver's fences.
> 
> So unless you're volunteering to audit all the drivers, and constantly
> re-audit them (because rcu only guaranteeing type-safety but not
> actually preventing use-after-free is very unusual in the kernel) just
> fixing dma_resv doesn't solve the problem here at all.
> 
>> But this needs more code cleanup and abstracting the sequence counter
>> usage in a macro.
> 
> The other thing is that this doesn't even make sense for i915 anymore.
> The solution to the "userspace wants to submit bazillion requests"
> problem is direct userspace submit. Current hw doesn't have userspace
> ringbuffer, but we have a pretty clever trick in the works to make
> this possible with current hw, essentially by submitting a CS that
> loops on itself, and then inserting batches into this "ring" by
> latching a conditional branch in this CS. It's not pretty, but it gets
> the job done and outright removes the need for plaid mode throughput
> of i915_request dma fences.
> -Daniel
> 
>>
>> Regards,
>> Christian.
>>
>>
>>>
>>> Regards,
>>>
>>> Tvrtko
>>>
>>>>
>>>> Note: The last patch is labeled DONOTMERGE.  This was at Daniel Vetter's
>>>> request as we may want to let this bake for a couple releases before we
>>>> rip out dma_fence_get_rcu_safe entirely.
>>>>
>>>> Signed-off-by: Jason Ekstrand <jason@jlekstrand.net>
>>>> Cc: Jon Bloomfield <jon.bloomfield@intel.com>
>>>> Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
>>>> Cc: Christian König <christian.koenig@amd.com>
>>>> Cc: Dave Airlie <airlied@redhat.com>
>>>> Cc: Matthew Auld <matthew.auld@intel.com>
>>>> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
>>>>
>>>> Jason Ekstrand (5):
>>>>     drm/i915: Move intel_engine_free_request_pool to i915_request.c
>>>>     drm/i915: Use a simpler scheme for caching i915_request
>>>>     drm/i915: Stop using SLAB_TYPESAFE_BY_RCU for i915_request
>>>>     dma-buf: Stop using SLAB_TYPESAFE_BY_RCU in selftests
>>>>     DONOTMERGE: dma-buf: Get rid of dma_fence_get_rcu_safe
>>>>
>>>>    drivers/dma-buf/dma-fence-chain.c         |   8 +-
>>>>    drivers/dma-buf/dma-resv.c                |   4 +-
>>>>    drivers/dma-buf/st-dma-fence-chain.c      |  24 +---
>>>>    drivers/dma-buf/st-dma-fence.c            |  27 +---
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c |   4 +-
>>>>    drivers/gpu/drm/i915/gt/intel_engine_cs.c |   8 --
>>>>    drivers/gpu/drm/i915/i915_active.h        |   4 +-
>>>>    drivers/gpu/drm/i915/i915_request.c       | 147 ++++++++++++----------
>>>>    drivers/gpu/drm/i915/i915_request.h       |   2 -
>>>>    drivers/gpu/drm/i915/i915_vma.c           |   4 +-
>>>>    include/drm/drm_syncobj.h                 |   4 +-
>>>>    include/linux/dma-fence.h                 |  50 --------
>>>>    include/linux/dma-resv.h                  |   4 +-
>>>>    13 files changed, 110 insertions(+), 180 deletions(-)
>>>>
>>
> 
>
Jason Ekstrand June 10, 2021, 1:35 p.m. UTC | #6
On Thu, Jun 10, 2021 at 6:30 AM Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
>
> On Thu, Jun 10, 2021 at 11:39 AM Christian König
> <christian.koenig@amd.com> wrote:
> > Am 10.06.21 um 11:29 schrieb Tvrtko Ursulin:
> > > On 09/06/2021 22:29, Jason Ekstrand wrote:
> > >> Ever since 0eafec6d3244 ("drm/i915: Enable lockless lookup of request
> > >> tracking via RCU"), the i915 driver has used SLAB_TYPESAFE_BY_RCU (it
> > >> was called SLAB_DESTROY_BY_RCU at the time) in order to allow RCU on
> > >> i915_request.  As nifty as SLAB_TYPESAFE_BY_RCU may be, it comes with
> > >> some serious disclaimers.  In particular, objects can get recycled while
> > >> RCU readers are still in-flight.  This can be ok if everyone who touches
> > >> these objects knows about the disclaimers and is careful. However,
> > >> because we've chosen to use SLAB_TYPESAFE_BY_RCU for i915_request and
> > >> because i915_request contains a dma_fence, we've leaked
> > >> SLAB_TYPESAFE_BY_RCU and its whole pile of disclaimers to every driver
> > >> in the kernel which may consume a dma_fence.
> > >
> > > I don't think the part about leaking is true...
> > >
> > >> We've tried to keep it somewhat contained by doing most of the hard work
> > >> to prevent access of recycled objects via dma_fence_get_rcu_safe().
> > >> However, a quick grep of kernel sources says that, of the 30 instances
> > >> of dma_fence_get_rcu*, only 11 of them use dma_fence_get_rcu_safe().
> > >> It's likely there are bear traps in DRM and related subsystems just waiting
> > >> for someone to accidentally step in them.
> > >
> > > ...because dma_fence_get_rcu_safe appears to be about whether the
> > > *pointer* to the fence itself is rcu protected, not about the fence
> > > object itself.
> >
> > Yes, exactly that.

The fact that both of you think this either means that I've completely
missed what's going on with RCUs here (possible but, in this case, I
think unlikely) or RCUs on dma fences should scare us all.  Yes, it
protects against races on the dma_fence pointer itself.  However,
whether or not that dma_fence pointer lives in RCU-protected memory is
immaterial AFAICT.  It also does magic to deal with
SLAB_TYPESAFE_BY_RCU.  Let's walk through it.  Please tell me if/where
I go off the rails.

First, let's set the scenario:  The race this is protecting us against
(I think) is where someone else comes along and swaps out the pointer
we're trying to fetch for NULL or a different one and then drops the
last reference.

First, before we get to dma_fence_get_rcu_safe(), the caller has taken
an RCU read lock.  Then we get into the function

    fence = rcu_dereference(*fencep);
    if (!fence)
        return NULL;

First, we dereference fencep and grab the pointer.  There's an
rcu_dereference() here which does the usual RCU magic (which I don't
fully understand yet) to turn an __rcu pointer into a "real" pointer.
It's possible that the pointer is NULL, if so we bail.  We may have
lost the race or it could be that the pointer was NULL the whole time.
Doesn't matter.

    if (!dma_fence_get_rcu(fence))
        continue;

This attempts to get a reference and, if it fails, continues.  More on
the continue later.  For now, let's dive into dma_fence_get_rcu()

    if (kref_get_unless_zero(&fence->refcount))
        return fence;
    else
        return NULL;

So we try to get a reference unless it's zero.  This is a pretty
standard pattern and, if the dma_fence was freed with kfree_rcu(),
would be all we need.  If the reference count on the dma_fence drops
to 0 and then the dma_fence is freed with kfree_rcu, we're guaranteed
that there is an RCU grace period between when the reference count
hits 0 and the memory is reclaimed.  Since all this happens inside the
RCU read lock, if we raced with someone attempting to swap out the
pointer and drop the reference count to zero, we have one of two
cases:

 1. We get the old pointer but successfully take a reference.  In this
case, it's the same as if we were called a few cycles earlier and
straight-up won the race.  We get the old pointer and, because we now
have a reference, the object is never freed.

 2. We get the old pointer but refcount is already zero by the time we
get here.  In this case, kref_get_unless_zero() returns false and
dma_fence_get_rcu() returns NULL.

If these were the only two cases we cared about, all of
dma_fence_get_rcu_safe() could be implemented as follows:

static inline struct dma_fence *
dma_fence_get_rcu_safe(struct dma_fence **fencep)
{
    struct dma_fence *fence;

    fence = rcu_dereference(*fencep);
    if (fence)
        fence = dma_fence_get_rcu(fence);

    return fence;
}

and we'd be done.  The case the above code doesn't handle is if the
thing we're racing with swaps it to a non-NULL pointer.  To handle
that case, we throw a loop around the whole thing as follows:

static inline struct dma_fence *
dma_fence_get_rcu_safe(struct dma_fence **fencep)
{
    struct dma_fence *fence;

    do {
        fence = rcu_dereference(*fencep);
        if (!fence)
            return NULL;

        fence = dma_fence_get_rcu(fence);
    } while (!fence);

    return fence;
}

Ok, great, we've got an implementation, right?  Unfortunately, this is
where SLAB_TYPESAFE_BY_RCU crashes the party.  The giant disclaimer
about SLAB_TYPESAFE_BY_RCU is that memory gets recycled immediately
and doesn't wait for an RCU grace period.  You're guaranteed that
memory exists at that pointer so you won't get a nasty SEGFAULT and
you're guaranteed that the memory is still a dma_fence, but you're not
guaranteed anything else.  In particular, there's a 3rd case:

 3. We get an old pointer but it's been recycled and points to a
totally different dma_fence whose reference count is non-zero.  In
this case, rcu_dereference returns non-null and kref_get_unless_zero()
succeeds but we still managed to end up with the wrong fence.

To deal with 3, we do this:

    /* The atomic_inc_not_zero() inside dma_fence_get_rcu()
     * provides a full memory barrier upon success (such as now).
     * This is paired with the write barrier from assigning
     * to the __rcu protected fence pointer so that if that
     * pointer still matches the current fence, we know we
     * have successfully acquired a reference to it. If it no
     * longer matches, we are holding a reference to some other
     * reallocated pointer. This is possible if the allocator
     * is using a freelist like SLAB_TYPESAFE_BY_RCU where the
     * fence remains valid for the RCU grace period, but it
     * may be reallocated. When using such allocators, we are
     * responsible for ensuring the reference we get is to
     * the right fence, as below.
     */
    if (fence == rcu_access_pointer(*fencep))
        return rcu_pointer_handoff(fence);

    dma_fence_put(fence);

We dereference fencep one more time and check to ensure that the
pointer we fetched at the start still matches.  There are some serious
memory barrier tricks going on here.  In particular, we're depending
on the fact that kref_get_unless_zero() does an atomic which means a
memory barrier between when the other thread we're racing with swapped
out the pointer and when the atomic happened.  Assuming that the other
thread swapped out the pointer BEFORE dropping the reference, we can
detect the recycle race with this pointer check.  If this last check
succeeds, we return the fence.  If it fails, then we ended up with the
wrong dma_fence and we drop the reference we acquired above and try
again.
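
Putting all the pieces together, the complete helper, as assembled from
the fragments above (modulo the big comment block), is:

static inline struct dma_fence *
dma_fence_get_rcu_safe(struct dma_fence __rcu **fencep)
{
    do {
        struct dma_fence *fence;

        fence = rcu_dereference(*fencep);
        if (!fence)
            return NULL;

        if (!dma_fence_get_rcu(fence))
            continue;

        /* Re-check that fencep still points at the fence we took a
         * reference on; if not, it was recycled and we retry. */
        if (fence == rcu_access_pointer(*fencep))
            return rcu_pointer_handoff(fence);

        dma_fence_put(fence);
    } while (1);
}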

Again, the important issue here that causes problems is that there's
no RCU grace period between the kref hitting zero and the dma_fence
being recycled.  If a dma_fence is freed with kfree_rcu(), we have
such a grace period and it's fine.  If we're recycling, we can end up in
all sorts of weird corners if we're not careful to ensure that the
fence we got is the fence we think we got.

Before I move on, there's one more important point:  This can happen
without SLAB_TYPESAFE_BY_RCU.  Really, any dma_fence recycling scheme
which doesn't ensure an RCU grace period between kref->zero and
recycle will run afoul of this.  SLAB_TYPESAFE_BY_RCU just happens to
be the way i915 gets into this mess.

> We do leak, and badly. Any __rcu protected fence pointer where a
> shared fence could show up is affected. And the point of dma_fence is
> that they're shareable, and we're inventing ever more ways to do so
> (sync_file, drm_syncobj, implicit fencing maybe soon with
> import/export ioctl on top, in/out fences in CS ioctl, atomic ioctl,
> ...).
>
> So without a full audit anything that uses the following pattern is
> probably busted:
>
> rcu_read_lock();
> fence = rcu_dereference();
> fence = dma_fence_get_rcu();
> rcu_read_unlock();
>
> /* use the fence now that we acquired a full reference */
>
> And I don't mean "you might wait a bit too much" busted, but "this can
> lead to loops in the dma_fence dependency chain, resulting in
> deadlocks" kind of busted.

Yup.

> What's worse, the standard rcu lockless
> access pattern is also busted completely:
>
> rcu_read_lock();
> fence = rcu_dereference();
> /* locklessly check the state of fence */
> rcu_read_unlock();

Yeah, this one's broken too.  It depends on what you're doing with
that state just how busted and what that breakage costs you but it's
definitely busted.

> because once you have TYPESAFE_BY_RCU rcu_read_lock doesn't prevent a
> use-after-free anymore. The only thing it guarantees is that your
> fence pointer keeps pointing at either freed memory, or a fence, but
> nothing else. You have to wrap your rcu_dereference and code into a
> seqlock of some kind, either a real one like dma_resv, or an
> open-coded one like dma_fence_get_rcu_safe uses. And yes the latter is
> a specialized seqlock, except it fails to properly document in
> comments where all the required barriers are.
>
> tldr; all the code using dma_fence_get_rcu needs to be assumed to be broken.
>
> Heck this is fragile and tricky enough that i915 shot its own leg off
> routinely (there's a bugfix floating around just now), so not even
> internally we're very good at getting this right.
>
> > > If one has a stable pointer to a fence dma_fence_get_rcu is I think
> > > enough to deal with SLAB_TYPESAFE_BY_RCU used by i915_request (as dma
> > > fence is a base object there). Unless you found a bug in rq field
> > > recycling. But access to the dma fence is all tightly controlled so I
> > > don't get what leaks.
> > >
> > >> This patch series stops us using SLAB_TYPESAFE_BY_RCU for i915_request
> > >> and, instead, does an RCU-safe slab free via call_rcu().  This should
> > >> let us keep most of the perf benefits of slab allocation while avoiding
> > >> the bear traps inherent in SLAB_TYPESAFE_BY_RCU.  It then removes
> > >> support
> > >> for SLAB_TYPESAFE_BY_RCU from dma_fence entirely.
> > >
> > > According to the rationale behind SLAB_TYPESAFE_BY_RCU traditional RCU
> > > freeing can be a lot more costly so I think we need a clear
> > > justification on why this change is being considered.
> >
> > The problem is that SLAB_TYPESAFE_BY_RCU requires that we use a sequence
> > counter to make sure that we don't grab the reference to a reallocated
> > dma_fence.
> >
> > Updating the sequence counter every time we add a fence now means two
> > additional writes and one additional barrier for an extremely hot path.
> > The extra overhead of RCU freeing is completely negligible compared to that.
> >
> > The good news is that I think if we are just a bit more clever about our
> > handling we can both avoid the sequence counter and keep
> > SLAB_TYPESAFE_BY_RCU around.

We're already trying to be clever about the handling, as described above.  But,
as Daniel said and I put in some commit message, we're probably only
doing it in about 1/3 of the places we need to be.

> You still need a seqlock, or something else that's serving as your
> seqlock. A dma_fence_list behind a single __rcu protected pointer, with
> all subsequent fence pointers _not_ being rcu protected (i.e. full
> references, with a new list allocated on every change) might work. Which is a very
> funny way of implementing something like a seqlock.
>
> And that only covers dma_resv, you _have_ to do this _everywhere_ in
> every driver. Except if you can prove that your __rcu fence pointer
> only ever points at your own driver's fences.
>
> So unless you're volunteering to audit all the drivers, and constantly
> re-audit them (because rcu only guaranteeing type-safety but not
> actually preventing use-after-free is very unusual in the kernel) just
> fixing dma_resv doesn't solve the problem here at all.
>
> > But this needs more code cleanup and abstracting the sequence counter
> > usage in a macro.
>
> The other thing is that this doesn't even make sense for i915 anymore.

I'm not sure I'd go that far.  Yes, we've got the ULLS hack but
i915_request is going to stay around for a while.  What's really
overblown here is the bazillions of requests.  GL drivers submit tens
or maybe 100ish batches per frame.  Media has to ping-pong a bit more
but it should still be < 1000/second.  If we're really
dma_fence_release-bound, we're in a microbenchmark.

--Jason

> The solution to the "userspace wants to submit bazillion requests"
> problem is direct userspace submit. Current hw doesn't have userspace
> ringbuffer, but we have a pretty clever trick in the works to make
> this possible with current hw, essentially by submitting a CS that
> loops on itself, and then inserting batches into this "ring" by
> latching a conditional branch in this CS. It's not pretty, but it gets
> the job done and outright removes the need for plaid mode throughput
> of i915_request dma fences.
> -Daniel
>
> >
> > Regards,
> > Christian.
> >
> >
> > >
> > > Regards,
> > >
> > > Tvrtko
> > >
> > >>
> > >> Note: The last patch is labeled DONOTMERGE.  This was at Daniel Vetter's
> > >> request as we may want to let this bake for a couple releases before we
> > >> rip out dma_fence_get_rcu_safe entirely.
> > >>
> > >> Signed-off-by: Jason Ekstrand <jason@jlekstrand.net>
> > >> Cc: Jon Bloomfield <jon.bloomfield@intel.com>
> > >> Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
> > >> Cc: Christian König <christian.koenig@amd.com>
> > >> Cc: Dave Airlie <airlied@redhat.com>
> > >> Cc: Matthew Auld <matthew.auld@intel.com>
> > >> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
> > >>
> > >> Jason Ekstrand (5):
> > >>    drm/i915: Move intel_engine_free_request_pool to i915_request.c
> > >>    drm/i915: Use a simpler scheme for caching i915_request
> > >>    drm/i915: Stop using SLAB_TYPESAFE_BY_RCU for i915_request
> > >>    dma-buf: Stop using SLAB_TYPESAFE_BY_RCU in selftests
> > >>    DONOTMERGE: dma-buf: Get rid of dma_fence_get_rcu_safe
> > >>
> > >>   drivers/dma-buf/dma-fence-chain.c         |   8 +-
> > >>   drivers/dma-buf/dma-resv.c                |   4 +-
> > >>   drivers/dma-buf/st-dma-fence-chain.c      |  24 +---
> > >>   drivers/dma-buf/st-dma-fence.c            |  27 +---
> > >>   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c |   4 +-
> > >>   drivers/gpu/drm/i915/gt/intel_engine_cs.c |   8 --
> > >>   drivers/gpu/drm/i915/i915_active.h        |   4 +-
> > >>   drivers/gpu/drm/i915/i915_request.c       | 147 ++++++++++++----------
> > >>   drivers/gpu/drm/i915/i915_request.h       |   2 -
> > >>   drivers/gpu/drm/i915/i915_vma.c           |   4 +-
> > >>   include/drm/drm_syncobj.h                 |   4 +-
> > >>   include/linux/dma-fence.h                 |  50 --------
> > >>   include/linux/dma-resv.h                  |   4 +-
> > >>   13 files changed, 110 insertions(+), 180 deletions(-)
> > >>
> >
>
>
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch
Jason Ekstrand June 10, 2021, 8:09 p.m. UTC | #7
On Thu, Jun 10, 2021 at 8:35 AM Jason Ekstrand <jason@jlekstrand.net> wrote:
> On Thu, Jun 10, 2021 at 6:30 AM Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
> > On Thu, Jun 10, 2021 at 11:39 AM Christian König
> > <christian.koenig@amd.com> wrote:
> > > Am 10.06.21 um 11:29 schrieb Tvrtko Ursulin:
> > > > On 09/06/2021 22:29, Jason Ekstrand wrote:
> > > >>
> > > >> We've tried to keep it somewhat contained by doing most of the hard work
> > > >> to prevent access of recycled objects via dma_fence_get_rcu_safe().
> > > >> However, a quick grep of kernel sources says that, of the 30 instances
> > > >> of dma_fence_get_rcu*, only 11 of them use dma_fence_get_rcu_safe().
> > > >> It's likely there are bear traps in DRM and related subsystems just waiting
> > > >> for someone to accidentally step in them.
> > > >
> > > > ...because dma_fence_get_rcu_safe appears to be about whether the
> > > > *pointer* to the fence itself is rcu protected, not about the fence
> > > > object itself.
> > >
> > > Yes, exactly that.
>
> The fact that both of you think this either means that I've completely
> missed what's going on with RCUs here (possible but, in this case, I
> think unlikely) or RCUs on dma fences should scare us all.

Taking a step back for a second and ignoring SLAB_TYPESAFE_BY_RCU as
such,  I'd like to ask a slightly different question:  What are the
rules about what is allowed to be done under the RCU read lock and
what guarantees does a driver need to provide?

I think so far that we've all agreed on the following:

 1. Freeing an unsignaled fence is ok as long as it doesn't have any
pending callbacks.  (Callbacks should hold a reference anyway).

 2. The pointer race solved by dma_fence_get_rcu_safe is real and
requires the loop to sort out.

But let's say I have a dma_fence pointer that I got from, say, calling
dma_resv_excl_fence() under rcu_read_lock().  What am I allowed to do
with it under the RCU lock?  What assumptions can I make?  Is this
code, for instance, ok?

rcu_read_lock();
fence = dma_resv_excl_fence(obj);
idle = !fence || test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags);
rcu_read_unlock();

This code very much looks correct under the following assumptions:

 1. A valid fence pointer stays alive under the RCU read lock
 2. SIGNALED_BIT is set-once (it's never unset after being set).

However, if it were, we wouldn't have dma_resv_test_signaled(), now
would we? :-)

The moment you introduce ANY dma_fence recycling that recycles a
dma_fence within a single RCU grace period, all your assumptions break
down.  SLAB_TYPESAFE_BY_RCU is just one way that i915 does this.  We
also have a little i915_request recycler to try and help with memory
pressure scenarios in certain critical sections that also doesn't
respect RCU grace periods.  And, as mentioned multiple times, our
recycling leaks into every other driver because, thanks to i915's
choice, the above 4-line code snippet isn't valid ANYWHERE in the
kernel.
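
The only robust way to write that snippet today is to take a full
reference, something like this (obj->fence_excl being dma_resv's __rcu
exclusive-fence slot):

struct dma_fence *fence;
bool idle;

rcu_read_lock();
fence = dma_fence_get_rcu_safe(&obj->fence_excl);
rcu_read_unlock();

/* two atomics just to peek at the signaled bit */
idle = !fence || test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags);
dma_fence_put(fence);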

So the question I'm raising isn't so much about the rules today.
Today, we live in the wild wild west where everything is YOLO.  But
where do we want to go?  Do we like this wild west world?  Or do we want
more consistency under the RCU read lock?  If so, what do we want the
rules to be?

One option would be to accept the wild-west world we live in and say
"The RCU read lock gains you nothing.  If you want to touch the guts
of a dma_fence, take a reference".  But, at that point, we're eating
two atomics for every time someone wants to look at a dma_fence.  Do
we want that?

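Concretely, under that rule the 4-line snippet above would have to
become something like this (sketch only; assuming obj is the dma_resv
and fence_excl is its exclusive-fence slot):

rcu_read_lock();
fence = dma_fence_get_rcu_safe(&obj->fence_excl);  /* atomic inc */
rcu_read_unlock();
idle = !fence || test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags);
dma_fence_put(fence);  /* atomic dec; NULL-safe */
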
Alternatively, and this is what I think Daniel and I were trying to
propose here, is that we place some constraints on dma_fence
recycling.  Specifically that, under the RCU read lock, the fence
doesn't suddenly become a new fence.  All of the immutability and
once-mutability guarantees of various bits of dma_fence hold as long
as you have the RCU read lock.

--Jason
Daniel Vetter June 10, 2021, 8:42 p.m. UTC | #8
On Thu, Jun 10, 2021 at 10:10 PM Jason Ekstrand <jason@jlekstrand.net> wrote:
>
> Alternatively, and this is what I think Daniel and I were trying to
> propose here, is that we place some constraints on dma_fence
> recycling.  Specifically that, under the RCU read lock, the fence
> doesn't suddenly become a new fence.  All of the immutability and
> once-mutability guarantees of various bits of dma_fence hold as long
> as you have the RCU read lock.

Yeah this is suboptimal. Too many potential bugs, not enough benefits.

This entire __rcu business started so that there would be a lockless
way to get at fences, or at least the exclusive one. That did not
really pan out. I think we have a few options:

- drop the idea of rcu/lockless dma-fence access outright. A quick
sequence of grabbing the lock, acquiring the dma_fence and then
dropping your lock again is probably plenty good. There's a lot of
call_rcu and other stuff we could probably delete. I have no idea what
the perf impact across all the drivers would be. (See the sketch after
this list.)

- try to make all drivers follow some stricter rules. The trouble is
that at least with radeon dma_fence callbacks aren't even very
reliable (that's why it has its own dma_fence_wait implementation), so
things are wobbly anyway.

- live with the current situation, but radically delete all unsafe
interfaces. I.e. nothing is allowed to directly deref an rcu fence
pointer, everything goes through dma_fence_get_rcu_safe. The
kref_get_unless_zero would become an internal implementation detail.
Our "fast" and "lockless" dma_resv fence access stays a pile of
seqlock, retry loop and a conditional atomic inc + atomic dec. The
only thing that's slightly faster would be dma_resv_test_signaled()

- I guess minimally we should rename dma_fence_get_rcu to
dma_fence_tryget. It has nothing to do with rcu really, and the use is
very, very limited.

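For illustration, the locked sequence from the first option would be
roughly this (sketch only):

dma_resv_lock(obj, NULL);
fence = dma_fence_get(dma_resv_excl_fence(obj));  /* NULL-safe get */
dma_resv_unlock(obj);
/* ... inspect fence, then dma_fence_put(fence); */
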
Not sure what's a good idea here tbh.
-Daniel
Christian König June 11, 2021, 6:55 a.m. UTC | #9
Am 10.06.21 um 22:42 schrieb Daniel Vetter:
> Yeah this is suboptimal. Too many potential bugs, not enough benefits.
>
> This entire __rcu business started so that there would be a lockless
> way to get at fences, or at least the exclusive one. That did not
> really pan out. I think we have a few options:
>
> - drop the idea of rcu/lockless dma-fence access outright. A quick
> sequence of grabbing the lock, acquiring the dma_fence and then
> dropping your lock again is probably plenty good. There's a lot of
> call_rcu and other stuff we could probably delete. I have no idea what
> the perf impact across all the drivers would be.

The question is maybe not the perf impact, but rather if that is 
possible at all.

IIRC we now have some cases in TTM where RCU is mandatory and we simply 
don't have any other choice than using it.

> - try to make all drivers follow some stricter rules. The trouble is
> that at least with radeon dma_fence callbacks aren't even very
> reliable (that's why it has its own dma_fence_wait implementation), so
> things are wobbly anyway.
>
> - live with the current situation, but radically delete all unsafe
> interfaces. I.e. nothing is allowed to directly deref an rcu fence
> pointer, everything goes through dma_fence_get_rcu_safe. The
> kref_get_unless_zero would become an internal implementation detail.
> Our "fast" and "lockless" dma_resv fence access stays a pile of
> seqlock, retry loop and a conditional atomic inc + atomic dec. The
> only thing that's slightly faster would be dma_resv_test_signaled()
>
> - I guess minimally we should rename dma_fence_get_rcu to
> dma_fence_tryget. It has nothing to do with rcu really, and the use is
> very, very limited.

I think what we should do is to use RCU internally in the dma_resv 
object but disallow drivers/frameworks to mess with that directly.

In other words drivers should use one of the following:
1. dma_resv_wait_timeout()
2. dma_resv_test_signaled()
3. dma_resv_copy_fences()
4. dma_resv_get_fences()
5. dma_resv_for_each_fence() <- to be implemented
6. dma_resv_for_each_fence_unlocked() <- to be implemented (see the sketch below)

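For 5. and 6., one possible caller-side shape (purely illustrative;
the cursor type and the loop macro here are hypothetical):

struct dma_resv_iter cursor;  /* hypothetical cursor type */
struct dma_fence *fence;

dma_resv_for_each_fence(&cursor, resv, true, fence) {
	/* the iterator hands out a referenced fence and drops that
	 * reference itself when it advances to the next one */
}
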
Inside those functions we then make sure that we only use safe ways of
accessing the RCU protected data structures.

This way we only need to make sure that those accessor functions are 
sane and don't need to audit every driver individually.

I can tackle implementing dma_resv_for_each_fence()/_unlocked().
Already got a large bunch of that coded out anyway.

Regards,
Christian.

>
> Not sure what's a good idea here tbh.
> -Daniel
Daniel Vetter June 11, 2021, 7:20 a.m. UTC | #10
On Fri, Jun 11, 2021 at 8:55 AM Christian König
<christian.koenig@amd.com> wrote:
>
> Am 10.06.21 um 22:42 schrieb Daniel Vetter:
> > - drop the idea of rcu/lockless dma-fence access outright. A quick
> > sequence of grabbing the lock, acquiring the dma_fence and then
> > dropping your lock again is probably plenty good. There's a lot of
> > call_rcu and other stuff we could probably delete. I have no idea what
> > the perf impact across all the drivers would be.
>
> The question is maybe not the perf impact, but rather if that is
> possible at all.
>
> IIRC we now have some cases in TTM where RCU is mandatory and we simply
> don't have any other choice than using it.

Adding Thomas Hellstrom.

Where is that stuff? If we end up with all the dma_resv locking
complexity just for an oddball, then I think that would be a rather big
bummer.

>
> I think what we should do is to use RCU internally in the dma_resv
> object but disallow drivers/frameworks to mess with that directly.
>
> In other words drivers should use one of the following:
> 1. dma_resv_wait_timeout()
> 2. dma_resv_test_signaled()
> 3. dma_resv_copy_fences()
> 4. dma_resv_get_fences()
> 5. dma_resv_for_each_fence() <- to be implemented
> 6. dma_resv_for_each_fence_unlocked() <- to be implemented
>
> Inside those functions we then make sure that we only use safe ways of
> accessing the RCU protected data structures.
>
> This way we only need to make sure that those accessor functions are
> sane and don't need to audit every driver individually.

Yeah better encapsulation for dma_resv sounds like a good thing, not least
for all the other issues we've been discussing recently. I guess your
list is also missing the various "add/replace some more fences"
functions, but we have them already.

> I can tackle implementing dma_resv_for_each_fence()/_unlocked().
> Already got a large bunch of that coded out anyway.

When/where do we need to iterate over fences unlocked? Given how much
pain it is to get a consistent snapshot of the fences or fence state
(I've read the dma-buf poll implementation, and it looks a bit buggy
in that regard, but not sure, just as an example), an unlocked
iterator sounds very dangerous to me.
-Daniel
Christian König June 11, 2021, 7:42 a.m. UTC | #11
Am 11.06.21 um 09:20 schrieb Daniel Vetter:
> On Fri, Jun 11, 2021 at 8:55 AM Christian König
> <christian.koenig@amd.com> wrote:
>> The question is maybe not the perf impact, but rather if that is
>> possible at all.
>>
>> IIRC we now have some cases in TTM where RCU is mandatory and we simply
>> don't have any other choice than using it.
> Adding Thomas Hellstrom.
>
> Where is that stuff? If we end up with all the dma_resv locking
> complexity just for an oddball, then I think that would be a rather big
> bummer.

This is during buffer destruction. See the call to dma_resv_copy_fences().

But that is basically just using a dma_resv function which accesses the 
object without taking a lock.

> Yeah better encapsulation for dma_resv sounds like a good thing, not least
> for all the other issues we've been discussing recently. I guess your
> list is also missing the various "add/replace some more fences"
> functions, but we have them already.
>
>> I can tackle implementing dma_resv_for_each_fence()/_unlocked().
>> Already got a large bunch of that coded out anyway.
> When/where do we need to iterate over fences unlocked? Given how much
> pain it is to get a consistent snapshot of the fences or fence state
> (I've read the dma-buf poll implementation, and it looks a bit buggy
> in that regard, but not sure, just as an example), an unlocked
> iterator sounds very dangerous to me.

This is to make implementation of the other functions easier. Currently 
they basically each roll their own loop implementation which at least 
for dma_resv_test_signaled() looks a bit questionable to me.

Additionally to those we have one more case in i915 and the unlocked
polling implementation which I agree is a bit questionable as well.

My idea is to have the problematic logic in the iterator and only give
back fences which hold a reference and are 100% sure to be the right one.

Probably best if I show some code around to explain what I mean.

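To give a rough idea of the direction (a hypothetical sketch, not the
actual code), the unlocked walk could pair the dma_resv seqcount with
the safe tryget, so the caller only ever sees a referenced,
still-current fence:

unsigned int seq;
struct dma_fence *fence;

rcu_read_lock();
retry:
	seq = read_seqcount_begin(&resv->seq);
	fence = dma_fence_get_rcu_safe(&resv->fence_excl);
	if (read_seqcount_retry(&resv->seq, seq)) {
		/* the fence set changed under us: drop and redo */
		dma_fence_put(fence);
		goto retry;
	}
rcu_read_unlock();
/* fence (possibly NULL) now holds a reference the caller must put */
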
Regards,
Christian.

> -Daniel
Daniel Vetter June 11, 2021, 9:33 a.m. UTC | #12
On Fri, Jun 11, 2021 at 09:42:07AM +0200, Christian König wrote:
> Am 11.06.21 um 09:20 schrieb Daniel Vetter:
> > > The question is maybe not the perf impact, but rather if that is
> > > possible at all.
> > > 
> > > IIRC we now have some cases in TTM where RCU is mandatory and we simply
> > > don't have any other choice than using it.
> > Adding Thomas Hellstrom.
> > 
> > Where is that stuff? If we end up with all the dma_resv locking
> > complexity just for an oddball, then I think that would be a rather big
> > bummer.
> 
> This is during buffer destruction. See the call to dma_resv_copy_fences().

Ok yeah that's tricky.

The way we solved this in i915 is with a trylock and punting to a worker
queue if the trylock fails. And the worker queue would also be flushed
from the shrinker (once we get there at least).

So this looks fixable.

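(Roughly this pattern, with hypothetical names; not the actual i915
code:)

if (dma_resv_trylock(bo->base.resv)) {
	/* uncontended: do the final cleanup right here */
	bo_cleanup(bo);  /* hypothetical helper */
	dma_resv_unlock(bo->base.resv);
} else {
	/* contended: punt to a worker, which the shrinker also
	 * flushes so memory can't get stuck behind the lock */
	queue_work(system_unbound_wq, &bo->delayed_free);  /* hypothetical member */
}
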
> But that is basically just using a dma_resv function which accesses the
> object without taking a lock.

The other one I've found is the ghost object, but that one is locked
fully.

> > > I can tackle implementing dma_resv_for_each_fence()/_unlocked().
> > > Already got a large bunch of that coded out anyway.
> > When/where do we need to iterate over fences unlocked? Given how much
> > pain it is to get a consistent snapshot of the fences or fence state
> > (I've read the dma-buf poll implementation, and it looks a bit buggy
> > in that regard, but not sure, just as an example), an unlocked
> > iterator sounds very dangerous to me.
> 
> This is to make implementation of the other functions easier. Currently they
> basically each roll their own loop implementation which at least for
> dma_resv_test_signaled() looks a bit questionable to me.
> 
> Additionally to those we have one more case in i915 and the unlocked
> polling implementation which I agree is a bit questionable as well.

Yeah, the more I look at any of these lockless loop things the more I'm
worried. 90% sure the one in dma_buf_poll is broken too.

> My idea is to have the problematic logic in the iterator and only give back
> fences which hold a reference and are 100% sure to be the right one.
> 
> Probably best if I show some code around to explain what I mean.

My gut feeling is that we should just try and convert them all over to
taking the dma_resv_lock. And if there is really a contention issue with
that, then either try to shrink it, or make it an rwlock or similar. But
the more I read of the implementations, the more I see bugs and
have questions.

Maybe at the end a few will be left over, and then we can look at these
individually in detail. Like the ttm_bo_individualize_resv situation.
Christian König June 11, 2021, 10:03 a.m. UTC | #13
Am 11.06.21 um 11:33 schrieb Daniel Vetter:
>>>> IIRC we now have some cases in TTM where RCU is mandatory and we simply
>>>> don't have any other choice than using it.
>>> Adding Thomas Hellstrom.
>>>
>>> Where is that stuff? If we end up with all the dma_resv locking
>>> complexity just for an oddball, then I think that would be a rather big
>>> bummer.
>> This is during buffer destruction. See the call to dma_resv_copy_fences().
> Ok yeah that's tricky.
>
> The way we solved this in i915 is with a trylock and punting to a worker
> queue if the trylock fails. And the worker queue would also be flushed
> from the shrinker (once we get there at least).

That's what we already had done here as well, but the worker is exactly 
what we wanted to avoid by this.

> So this looks fixable.

I'm not sure of that. We had really good reasons to remove the worker.

> Yeah, the more I look at any of these lockless loop things the more I'm
> worried. 90% sure the one in dma_buf_poll is broken too.
>
>> My idea is to have the problematic logic in the iterator and only give back
>> fences which hold a reference and are 100% sure to be the right one.
>>
>> Probably best if I show some code around to explain what I mean.
> My gut feeling is that we should just try and convert them all over to
> taking the dma_resv_lock. And if there is really a contention issue with
> that, then either try to shrink it, or make it an rwlock or similar. But
> the more I read of the implementations, the more I see bugs and
> have questions.

How about we abstract all that funny rcu dance inside the iterator instead?

I mean, when we just have one walker function which is well documented
and understood, then the rest becomes relatively easy.

Christian.

> Maybe at the end a few will be left over, and then we can look at these
> individually in detail. Like the ttm_bo_individualize_resv situation.
Daniel Vetter June 11, 2021, 3:08 p.m. UTC | #14
On Fri, Jun 11, 2021 at 12:03:31PM +0200, Christian König wrote:
> > > This is during buffer destruction. See the call to dma_resv_copy_fences().
> > Ok yeah that's tricky.
> > 
> > The way we solved this in i915 is with a trylock and punting to a worker
> > queue if the trylock fails. And the worker queue would also be flushed
> > from the shrinker (once we get there at least).
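> > 
> > Something like this, roughly (a sketch; the workqueue and work item
> > names are made up, and the shrinker flush happens elsewhere):
> > 
> >     if (dma_resv_trylock(resv)) {
> >             /* do the cleanup directly */
> >             dma_resv_unlock(resv);
> >     } else {
> >             /* defer to a worker that the shrinker also flushes */
> >             queue_work(cleanup_wq, &obj->free_work);
> >     }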
> 
> That's what we already had done here as well, but the worker is exactly what
> we wanted to avoid by this.
> 
> > So this looks fixable.
> 
> I'm not sure of that. We had really good reasons to remove the worker.

I've looked around, and I didn't see any huge changes around the
delayed_delete work. There are lots of changes to how the LRU is handled to
optimize that.

And even today we still have the delayed deletion thing.

So essentially what I had in mind is that, instead of just calling
ttm_bo_cleanup_refs(), you first check whether the resv is individualized
already, and if not do that first (see the sketch below).

This means there's a slight delay when a bo is deleted between when the
refcount drops and when we actually individualize the fences.

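Roughly like this in the delayed-delete path (a sketch; error handling is
elided and the exact ttm_bo_cleanup_refs() arguments are from memory):

    /* individualize first if that hasn't happened yet */
    if (bo->base.resv != &bo->base._resv)
            ttm_bo_individualize_resv(bo);

    ttm_bo_cleanup_refs(bo, false, false, true);
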
What was the commit that removed another worker here?
-Daniel


> 
> > 
> > > But that is basically just using a dma_resv function which accesses the
> > > object without taking a lock.
> > The other one I've found is the ghost object, but that one is locked
> > fully.
> > 
> > > > > > - try to make all drivers follow some stricter rules. The trouble is
> > > > > > that, at least with radeon, dma_fence callbacks aren't even very
> > > > > > reliable (that's why it has its own dma_fence_wait implementation), so
> > > > > > things are wobbly anyway.
> > > > > > 
> > > > > > - live with the current situation, but radically delete all unsafe
> > > > > > interfaces. I.e. nothing is allowed to directly deref an rcu fence
> > > > > > pointer, everything goes through dma_fence_get_rcu_safe (usage
> > > > > > sketched below). The kref_get_unless_zero would become an internal
> > > > > > implementation detail. Our "fast" and "lockless" dma_resv fence
> > > > > > access stays a pile of seqlock, retry loop and a conditional atomic
> > > > > > inc + atomic dec. The only thing that's slightly faster would be
> > > > > > dma_resv_test_signaled()
> > > > > > 
> > > > > > - I guess minimally we should rename dma_fence_get_rcu to
> > > > > > dma_fence_tryget. It has nothing to do with rcu really, and the use is
> > > > > > very, very limited.
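> > > > > > 
> > > > > > For reference, the dma_fence_get_rcu_safe pattern is roughly this
> > > > > > (a sketch, with the exclusive fence as the example):
> > > > > > 
> > > > > >     rcu_read_lock();
> > > > > >     fence = dma_fence_get_rcu_safe(&resv->fence_excl);
> > > > > >     rcu_read_unlock();
> > > > > >     /* fence, if non-NULL, now holds a reference and is
> > > > > >      * guaranteed to be the fence the pointer contained */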
> > > > > I think what we should do is to use RCU internally in the dma_resv
> > > > > object but disallow drivers/frameworks from messing with that directly.
> > > > > 
> > > > > In other words drivers should use one of the following:
> > > > > 1. dma_resv_wait_timeout()
> > > > > 2. dma_resv_test_signaled()
> > > > > 3. dma_resv_copy_fences()
> > > > > 4. dma_resv_get_fences()
> > > > > 5. dma_resv_for_each_fence() <- to be implemented
> > > > > 6. dma_resv_for_each_fence_unlocked() <- to be implemented
> > > > > 
> > > > > Inside those functions we then make sure that we only use safe ways
> > > > > of accessing the RCU-protected data structures.
> > > > > 
> > > > > This way we only need to make sure that those accessor functions are
> > > > > sane and don't need to audit every driver individually.
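> > > > > 
> > > > > E.g. usage of the iterator from 5/6 could look something like this
> > > > > (hypothetical interface, since it isn't implemented yet):
> > > > > 
> > > > >     struct dma_resv_iter cursor;
> > > > >     struct dma_fence *fence;
> > > > > 
> > > > >     dma_resv_for_each_fence_unlocked(&cursor, resv, fence) {
> > > > >             /* fence comes with a reference held; all the
> > > > >              * RCU/seqlock retry magic stays inside the iterator */
> > > > >     }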
> > > > Yeah better encapsulation for dma_resv sounds like a good thing, at
> > > > least for all the other issues we've been discussing recently. I guess
> > > > your list is also missing the various "add/replace some more fences"
> > > > functions, but we have them already.
> > > > 
> > > > > I can tackle implementing the dma_resv_for_each_fence()/_unlocked()
> > > > > pair. Already got a large bunch of that coded out anyway.
> > > > When/where do we need to iterate over fences unlocked? Given how much
> > > > pain it is to get a consistent snapshot of the fences or fence state
> > > > (I've read the dma-buf poll implementation, and it looks a bit buggy
> > > > in that regard, but not sure, just as an example), an unlocked
> > > > iterator sounds very dangerous to me.
> > > This is to make the implementation of the other functions easier.
> > > Currently they each basically roll their own loop implementation, which
> > > at least for dma_resv_test_signaled() looks a bit questionable to me.
> > > 
> > > In addition to those, we have one more case in i915, and the unlocked
> > > polling implementation, which I agree is a bit questionable as well.
> > Yeah, the more I look at any of these lockless loop things the more I'm
> > worried. 90% sure the one in dma_buf_poll is broken too.
> > 
> > > My idea is to have the problematic logic in the iterator and only give
> > > back fences which hold a reference and are 100% sure to be the right
> > > ones.
> > > 
> > > Probably best if I show some code around to explain what I mean.
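> > > 
> > > As a rough sketch of one iteration step (field names from dma_resv as
> > > it is today, not the final code):
> > > 
> > >     struct dma_fence *fence;
> > >     unsigned int seq;
> > > 
> > >     rcu_read_lock();
> > >     do {
> > >             seq = read_seqcount_begin(&resv->seq);
> > >             fence = rcu_dereference(resv->fence_excl);
> > >             if (fence && !dma_fence_get_rcu(fence))
> > >                     continue;       /* fence freed under us, retry */
> > >             if (fence && read_seqcount_retry(&resv->seq, seq)) {
> > >                     dma_fence_put(fence);   /* raced, retry */
> > >                     continue;
> > >             }
> > >             break;
> > >     } while (1);
> > >     rcu_read_unlock();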
> > My gut feeling is that we should just try and convert them all over to
> > taking the dma_resv_lock. And if there is really a contention issue with
> > that, then either try to shrink it, or make it an rwlock or similar. But
> > the more I read through the implementations, the more bugs I see and the
> > more questions I have.
> 
> How about we abstract all that funny rcu dance inside the iterator instead?
> 
> I mean, when we just have one walker function which is well documented and
> understood, then the rest becomes relatively easy.
> 
> Christian.
> 
> > Maybe at the end a few will be left over, and then we can look at these
> > individually in detail. Like the ttm_bo_individualize_resv situation.
>