
[v2,2/4] cxl/mem: Fix synchronization mechanism for device removal vs ioctl operations

Message ID 161707246948.2072157.2116502455691653472.stgit@dwillia2-desk3.amr.corp.intel.com
State Superseded
Series cxl/mem: Fix memdev device setup

Commit Message

Dan Williams March 30, 2021, 2:47 a.m. UTC
The percpu_ref to gate whether cxl_memdev_ioctl() is free to use the
driver context (@cxlm) to issue I/O is overkill and was implemented
incorrectly (missing a device reference before accessing the
percpu_ref). The complexity of shutting down a percpu_ref also
contributed to a bug in the error unwind in cxl_mem_add_memdev() (a
missing put_device(), to be fixed separately).

Use srcu + device_is_registered() to achieve the same effect and add the
missing reference counting for cxlmd in cxl_memdev_open() and
cxl_memdev_release_file().

Fixes: b39cb1052a5c ("cxl/mem: Register CXL memX devices")
Reported-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/cxl/mem.c |   83 +++++++++++++++++++++++++++--------------------------
 1 file changed, 43 insertions(+), 40 deletions(-)

Comments

Jason Gunthorpe March 30, 2021, 11:16 a.m. UTC | #1
On Mon, Mar 29, 2021 at 07:47:49PM -0700, Dan Williams wrote:

> @@ -1155,21 +1175,12 @@ static void cxlmdev_unregister(void *_cxlmd)
>  	struct cxl_memdev *cxlmd = _cxlmd;
>  	struct device *dev = &cxlmd->dev;
>  
> -	percpu_ref_kill(&cxlmd->ops_active);
>  	cdev_device_del(&cxlmd->cdev, dev);
> -	wait_for_completion(&cxlmd->ops_dead);
> +	synchronize_srcu(&cxl_memdev_srcu);

This needs some kind of rcu protected pointer for SRCU to
work. The write side has to null the pointer and the read side has to
copy the pointer to the stack and check for NULL.

Otherwise the read side can't detect when the write side is shutting
down.

Basically you must use rcu_dereference(), rcu_assign_pointer(), etc.
when working with RCU.

Something like 'ops' is usually a reasonable choice
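
A minimal sketch of that pattern, assuming a hypothetical 'ops'
pointer hanging off cxlmd (the names are illustrative, not from the
patch):

    /* read side: copy the pointer to the stack, check for NULL */
    idx = srcu_read_lock(&cxl_memdev_srcu);
    ops = srcu_dereference(cxlmd->ops, &cxl_memdev_srcu);
    if (ops)
        rc = ops->ioctl(cxlmd, cmd, arg);
    else
        rc = -ENXIO;
    srcu_read_unlock(&cxl_memdev_srcu, idx);

    /* write side: unpublish the pointer, then wait out readers */
    rcu_assign_pointer(cxlmd->ops, NULL);
    synchronize_srcu(&cxl_memdev_srcu);
    /* no reader can still be using the old ops */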

This really can't just be a simple rwsem?

Jason
Dan Williams March 30, 2021, 3:37 p.m. UTC | #2
On Tue, Mar 30, 2021 at 4:16 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Mon, Mar 29, 2021 at 07:47:49PM -0700, Dan Williams wrote:
>
> > @@ -1155,21 +1175,12 @@ static void cxlmdev_unregister(void *_cxlmd)
> >       struct cxl_memdev *cxlmd = _cxlmd;
> >       struct device *dev = &cxlmd->dev;
> >
> > -     percpu_ref_kill(&cxlmd->ops_active);
> >       cdev_device_del(&cxlmd->cdev, dev);
> > -     wait_for_completion(&cxlmd->ops_dead);
> > +     synchronize_srcu(&cxl_memdev_srcu);
>
> This needs some kind of rcu protected pointer for SRCU to
> work. The write side has to null the pointer and the read side has to
> copy the pointer to the stack and check for NULL.
>
> Otherwise the read side can't detect when the write side is shutting
> down.
>
> Basically you must use rcu_dereference(), rcu_assign_pointer(), etc.
> when working with RCU.

If the shutdown path was not using synchronize_rcu() then I would
agree with you. This srcu scheme is also used to protect dax device
shutdown; it was arrived at after talking through rwsem vs srcu with
Jan and Paul in this thread [1]. The synchronize_rcu() guarantees
that all read-side critical sections have had at least one chance to
quiesce.

So this could either use rcu pointer accessors and call_srcu to free
the object in a quiescent state, or it can use synchronize_srcu()
relative to a condition that aborts usage of the pointer.
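
Condensed from the patch at the bottom of this page, that second
option looks roughly like:

    /* read side: abort if the device is no longer published */
    idx = srcu_read_lock(&cxl_memdev_srcu);
    if (!device_is_registered(&cxlmd->dev))
        rc = -ENXIO;
    else
        rc = __cxl_memdev_ioctl(cxlmd, cmd, arg);
    srcu_read_unlock(&cxl_memdev_srcu, idx);

    /* shutdown: change the condition, then flush readers */
    cdev_device_del(&cxlmd->cdev, dev);
    synchronize_srcu(&cxl_memdev_srcu);
    cxlmd->cxlm = NULL;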

[1]: https://lore.kernel.org/lkml/20180408031113.GO3948@linux.vnet.ibm.com/
Jason Gunthorpe March 30, 2021, 3:47 p.m. UTC | #3
On Tue, Mar 30, 2021 at 08:37:19AM -0700, Dan Williams wrote:
> On Tue, Mar 30, 2021 at 4:16 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
> >
> > On Mon, Mar 29, 2021 at 07:47:49PM -0700, Dan Williams wrote:
> >
> > > @@ -1155,21 +1175,12 @@ static void cxlmdev_unregister(void *_cxlmd)
> > >       struct cxl_memdev *cxlmd = _cxlmd;
> > >       struct device *dev = &cxlmd->dev;
> > >
> > > -     percpu_ref_kill(&cxlmd->ops_active);
> > >       cdev_device_del(&cxlmd->cdev, dev);
> > > -     wait_for_completion(&cxlmd->ops_dead);
> > > +     synchronize_srcu(&cxl_memdev_srcu);
> >
> > This needs some kind of rcu protected pointer for SRCU to
> > work. The write side has to null the pointer and the read side has to
> > copy the pointer to the stack and check for NULL.
> >
> > Otherwise the read side can't detect when the write side is shutting
> > down.
> >
> > Basically you must use rcu_dereference(), rcu_assign_pointer(), etc.
> > when working with RCU.
> 
> If the shutdown path was not using synchronize_rcu() then I would
> agree with you. This srcu scheme is also used to protect dax device
> shutdown; it was arrived at after talking through rwsem vs srcu with
> Jan and Paul in this thread [1]. The synchronize_rcu() guarantees
> that all read-side critical sections have had at least one chance to
> quiesce.
> 
> So this could either use rcu pointer accessors and call_srcu to free
> the object in a quiescent state, or it can use synchronize_srcu()
> relative to a condition that aborts usage of the pointer.

synchronize_rcu() doesn't stop the read side from running. It only
guarantees that all running or future read sides will see the *write*
performed prior to synchronize_rcu().

If you can't clearly point to the *data* under RCU protection it is
being used wrong.

Same as if you can't point to the *data* being protected by a rwsem it
is probably being used wrong.

We are not locking code.

Jason
Dan Williams March 30, 2021, 4:05 p.m. UTC | #4
On Tue, Mar 30, 2021 at 8:47 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Tue, Mar 30, 2021 at 08:37:19AM -0700, Dan Williams wrote:
> > On Tue, Mar 30, 2021 at 4:16 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
> > >
> > > On Mon, Mar 29, 2021 at 07:47:49PM -0700, Dan Williams wrote:
> > >
> > > > @@ -1155,21 +1175,12 @@ static void cxlmdev_unregister(void *_cxlmd)
> > > >       struct cxl_memdev *cxlmd = _cxlmd;
> > > >       struct device *dev = &cxlmd->dev;
> > > >
> > > > -     percpu_ref_kill(&cxlmd->ops_active);
> > > >       cdev_device_del(&cxlmd->cdev, dev);
> > > > -     wait_for_completion(&cxlmd->ops_dead);
> > > > +     synchronize_srcu(&cxl_memdev_srcu);
> > >
> > > This needs some kind of rcu protected pointer for SRCU to
> > > work. The write side has to null the pointer and the read side has to
> > > copy the pointer to the stack and check for NULL.
> > >
> > > Otherwise the read side can't detect when the write side is shutting
> > > down.
> > >
> > > Basically you must use rcu_dereference(), rcu_assign_pointer(), etc.
> > > when working with RCU.
> >
> > If the shutdown path was not using synchronize_rcu() then I would
> > agree with you. This srcu scheme is also used to protect dax device
> > shutdown; it was arrived at after talking through rwsem vs srcu with
> > Jan and Paul in this thread [1]. The synchronize_rcu() guarantees
> > that all read-side critical sections have had at least one chance to
> > quiesce.
> >
> > So this could either use rcu pointer accessors and call_srcu to free
> > the object in a quiescent state, or it can use synchronize_srcu()
> > relative to a condition that aborts usage of the pointer.
>
> synchronize_rcu() doesn't stop the read side from running. It only
> guarantees that all running or future read sides will see the *write*
> performed prior to synchronize_rcu().
>
> If you can't clearly point to the *data* under RCU protection it is
> being used wrong.

Agree.

The data being protected is the value of dev->kobj.state_in_sysfs. The
read-side is allowed to keep running, and the synchronize_rcu()
guarantees that any read-side that saw state_in_sysfs == 1 has had a
chance to complete. Future reads terminate the ioctl at the
device_is_registered() check.

> Same as if you can't point to the *data* being protected by a rwsem it
> is probably being used wrong.
>
> We are not locking code.

Agree.
Jason Gunthorpe March 30, 2021, 5:02 p.m. UTC | #5
On Tue, Mar 30, 2021 at 09:05:29AM -0700, Dan Williams wrote:

> > If you can't clearly point to the *data* under RCU protection it is
> > being used wrong.
> 
> Agree.
> 
> The data being protected is the value of
> dev->kobj.state_in_sysfs. The

So where is that read under:

+	idx = srcu_read_lock(&cxl_memdev_srcu);
+	rc = __cxl_memdev_ioctl(cxlmd, cmd, arg);
+	srcu_read_unlock(&cxl_memdev_srcu, idx);

?

It can't read the RCU protected data outside the RCU critical region,
and it can't read/write RCU protected data without using the helper
macros which insert the required barriers.

IMHO this can't use 'dev->kobj.state_in_sysfs' as the RCU protected data.

Jason
Dan Williams March 30, 2021, 5:31 p.m. UTC | #6
On Tue, Mar 30, 2021 at 10:03 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Tue, Mar 30, 2021 at 09:05:29AM -0700, Dan Williams wrote:
>
> > > If you can't clearly point to the *data* under RCU protection it is
> > > being used wrong.
> >
> > Agree.
> >
> > The data being protected is the value of
> > dev->kobj.state_in_sysfs. The
>
> So where is that read under:
>
> +       idx = srcu_read_lock(&cxl_memdev_srcu);
> +       rc = __cxl_memdev_ioctl(cxlmd, cmd, arg);
> +       srcu_read_unlock(&cxl_memdev_srcu, idx);
>
> ?

device_is_registered() inside __cxl_memdev_ioctl().

> It can't read the RCU protected data outside the RCU critical region,
> and it can't read/write RCU protected data without using the helper
> macros which insert the required barriers.

The required barriers are there. srcu_read_lock() +
device_is_registered() is paired with cdev_device_del() +
synchronize_rcu().

> IMHO this can't use 'dev->kobj.state_in_sysfs' as the RCU protected data.

This usage of srcu is functionally equivalent to replacing
srcu_read_lock() with down_read() and the shutdown path with:

cdev_device_del(...);
down_write(...);
up_write(...);

...to flush readers that read ->state_in_sysfs == 1.
Jason Gunthorpe March 30, 2021, 5:54 p.m. UTC | #7
On Tue, Mar 30, 2021 at 10:31:15AM -0700, Dan Williams wrote:
> On Tue, Mar 30, 2021 at 10:03 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
> >
> > On Tue, Mar 30, 2021 at 09:05:29AM -0700, Dan Williams wrote:
> >
> > > > If you can't clearly point to the *data* under RCU protection it is
> > > > being used wrong.
> > >
> > > Agree.
> > >
> > > The data being protected is the value of
> > > dev->kobj.state_in_sysfs. The
> >
> > So where is that read under:
> >
> > +       idx = srcu_read_lock(&cxl_memdev_srcu);
> > +       rc = __cxl_memdev_ioctl(cxlmd, cmd, arg);
> > +       srcu_read_unlock(&cxl_memdev_srcu, idx);
> >
> > ?
> 
> device_is_registered() inside __cxl_memdev_ioctl().

Oh, I see, I missed that

> > It can't read the RCU protected data outside the RCU critical region,
> > and it can't read/write RCU protected data without using the helper
> > macros which insert the required barriers.
> 
> The required barriers are there. srcu_read_lock() +
> device_is_registered() is paired with cdev_device_del() +
> synchronize_rcu().

RCU needs barriers on the actual load/store; a naked
device_is_registered() alone is not strong enough.

> > IMHO this can't use 'dev->kobj.state_in_sysfs' as the RCU protected data.
> 
> This usage of srcu is functionally equivalent to replacing
> srcu_read_lock() with down_read() and the shutdown path with:

Sort of, but the rules for load/store under RCU are different from
load/store under a normal barriered lock. All the data is unstable,
for instance, and minimally needs READ_ONCE.

> cdev_device_del(...);
> down_write(...);
> up_write(...);

The lock would have to enclose the store to state_in_sysfs, otherwise
as written it has the same data race problems.

Jason
Dan Williams March 30, 2021, 7 p.m. UTC | #8
On Tue, Mar 30, 2021 at 10:54 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Tue, Mar 30, 2021 at 10:31:15AM -0700, Dan Williams wrote:
> > On Tue, Mar 30, 2021 at 10:03 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
> > >
> > > On Tue, Mar 30, 2021 at 09:05:29AM -0700, Dan Williams wrote:
> > >
> > > > > If you can't clearly point to the *data* under RCU protection it is
> > > > > being used wrong.
> > > >
> > > > Agree.
> > > >
> > > > The data being protected is the value of
> > > > dev->kobj.state_in_sysfs. The
> > >
> > > So where is that read under:
> > >
> > > +       idx = srcu_read_lock(&cxl_memdev_srcu);
> > > +       rc = __cxl_memdev_ioctl(cxlmd, cmd, arg);
> > > +       srcu_read_unlock(&cxl_memdev_srcu, idx);
> > >
> > > ?
> >
> > device_is_registered() inside __cxl_memdev_ioctl().
>
> Oh, I see, I missed that
>
> > > It can't read the RCU protected data outside the RCU critical region,
> > > and it can't read/write RCU protected data without using the helper
> > > macros which insert the required barriers.
> >
> > The required barriers are there. srcu_read_lock() +
> > device_is_registered() is paired with cdev_device_del() +
> > synchronize_rcu().
>
> RCU needs barriers on the actual load/store; a naked
> device_is_registered() alone is not strong enough.
>
> > > IMHO this can't use 'dev->kobj.state_in_sysfs' as the RCU protected data.
> >
> > This usage of srcu is functionally equivalent to replacing
> > srcu_read_lock() with down_read() and the shutdown path with:
>
> Sort of, but the rules for load/store under RCU are different from
> load/store under a normal barriered lock. All the data is unstable,
> for instance, and minimally needs READ_ONCE.

The data is unstable under the srcu_read_lock until the end of the
next rcu grace period; synchronize_rcu() ensures all active
srcu_read_lock() sections have completed. Unless Paul and I
misunderstood each other, this scheme of synchronizing object state is
also used in kill_dax(), and I put that comment there the last time
this question was raised. If srcu was being used to synchronize the
liveness of an rcu object like @cxlm or a new ops object then I would
expect rcu_dereference + rcu_assign_pointer around usages of that
object. The liveness of the object in this case is handled by kobject
reference, or inode reference in the case of kill_dax() outside of
srcu.

>
> > cdev_device_del(...);
> > down_write(...);
> > up_write(...);
>
> The lock would have to enclose the store to state_in_sysfs, otherwise
> as written it has the same data race problems.

There's no race above. The rule is that any possible observation of
->state_in_sysfs == 1, or rcu_dereference() != NULL, must be flushed.
After that value transitions to zero, or the rcu object is marked for
deletion, an rcu grace period is needed before that memory can be
freed. If an rwsem is used, the only requirement is that any read-side
sections that might have observed ->state_in_sysfs == 1 have ended,
which is why the down_write() / up_write() does not need to surround
the cdev_device_del(). It's sufficient to flush the read side after
the state is known to have changed. There are several examples of
rwsem being used as a barrier like this (see the sketch after the list):

drivers/mtd/ubi/wl.c:1432:      down_write(&ubi->work_sem);
drivers/mtd/ubi/wl.c-1433-      up_write(&ubi->work_sem);

drivers/scsi/cxlflash/main.c:2229:      down_write(&cfg->ioctl_rwsem);
drivers/scsi/cxlflash/main.c-2230-      up_write(&cfg->ioctl_rwsem);

fs/btrfs/block-group.c:355:     down_write(&space_info->groups_sem);
fs/btrfs/block-group.c-356-     up_write(&space_info->groups_sem);

fs/btrfs/disk-io.c:4189:        down_write(&fs_info->cleanup_work_sem);
fs/btrfs/disk-io.c-4190-        up_write(&fs_info->cleanup_work_sem);

net/core/net_namespace.c:629:   down_write(&pernet_ops_rwsem);
net/core/net_namespace.c-630-   up_write(&pernet_ops_rwsem);
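
A sketch of the barrier pattern those callers rely on (hypothetical
names; the 'published' flag stands in for ->state_in_sysfs):

    /* reader: check the state under the lock, bail if gone */
    down_read(&sem);
    if (published)
        do_work();
    up_read(&sem);

    /* writer: change the state, then take an empty write-side
     * section to flush any reader that saw the old value; note
     * the store itself is outside the lock, which is the point
     * under debate below */
    published = false;
    down_write(&sem);
    up_write(&sem);
    /* no reader that observed published == true is still running */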
Jason Gunthorpe March 30, 2021, 7:26 p.m. UTC | #9
On Tue, Mar 30, 2021 at 12:00:23PM -0700, Dan Williams wrote:

> > > > IMHO this can't use 'dev->kobj.state_in_sysfs' as the RCU protected data.
> > >
> > > This usage of srcu is functionally equivalent to replacing
> > > srcu_read_lock() with down_read() and the shutdown path with:
> >
> > Sort of, but the rules for load/store under RCU are different than for
> > load/store under a normal barriered lock. All the data is unstable for
> > instance and minimially needs READ_ONCE.
> 
> The data is unstable under the srcu_read_lock until the end of the
> > next rcu grace period; synchronize_rcu() ensures all active
> srcu_read_lock() sections have completed.

No, that isn't how I would phrase it. *Any* write-side data under RCU
is always unstable by definition on the read side because the write
side can always race with any reader. Thus one should use the RCU
accessors to deal with that data race, and get some acquire/release
semantics when pointer chasing (though this doesn't matter here).

> Unless Paul and I misunderstood each other, this scheme of
> synchronizing object state is also used in kill_dax(), and I put
> that comment there the last time this question was raised. If srcu
> was being used to synchronize the liveness of an rcu object like
> @cxlm or a new ops object then I would expect rcu_dereference +
> rcu_assign_pointer around usages of that object. The liveness of the
> object in this case is handled by kobject reference, or inode
> reference in the case of kill_dax() outside of srcu.

It was probably a misunderstanding, as I don't think Paul would say
you should read data in thread A and write to it in B without using
READ_ONCE/WRITE_ONCE or a stronger atomic to manage the data race.

The LWN articles on the "big bad compiler" are informative here. You
don't want the compiler to do a transformation where it loads
state_in_sysfs multiple times and gets different answers. This is what
READ_ONCE is aiming to prevent.

Here it is just a boolean flag, and the flag is only cleared, so risks
are low, but it still isn't a technically correct way to use RCU.

(and yes the kernel is full of examples of not following the memory
model strictly)

> > > cdev_device_del(...);
> > > down_write(...);
> > > up_write(...);
> >
> > The lock would have to enclose the store to state_in_sysfs, otherwise
> > as written it has the same data race problems.
> 
> There's no race above. The rule is that any possible observation of
> ->state_in_sysfs == 1, or rcu_dereference() != NULL, must be
> flushed.

It is not about the flushing.

Jason
Dan Williams March 30, 2021, 7:43 p.m. UTC | #10
On Tue, Mar 30, 2021 at 12:26 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Tue, Mar 30, 2021 at 12:00:23PM -0700, Dan Williams wrote:
>
> > > > > IMHO this can't use 'dev->kobj.state_in_sysfs' as the RCU protected data.
> > > >
> > > > This usage of srcu is functionally equivalent to replacing
> > > > srcu_read_lock() with down_read() and the shutdown path with:
> > >
> > > Sort of, but the rules for load/store under RCU are different than for
> > > load/store under a normal barriered lock. All the data is unstable for
> > > instance and minimially needs READ_ONCE.
> >
> > The data is unstable under the srcu_read_lock until the end of the
> > next rcu grace period; synchronize_rcu() ensures all active
> > srcu_read_lock() sections have completed.
>
> No, that isn't how I would phrase it. *Any* write-side data under RCU
> is always unstable by definition on the read side because the write
> side can always race with any reader. Thus one should use the RCU
> accessors to deal with that data race, and get some acquire/release
> semantics when pointer chasing (though this doesn't matter here).
>
> > Unless Paul and I misunderstood each other, this scheme of
> > synchronizing object state is also used in kill_dax(), and I put
> > that comment there the last time this question was raised. If srcu
> > was being used to synchronize the liveness of an rcu object like
> > @cxlm or a new ops object then I would expect rcu_dereference +
> > rcu_assign_pointer around usages of that object. The liveness of the
> > object in this case is handled by kobject reference, or inode
> > reference in the case of kill_dax() outside of srcu.
>
> It was probably a misunderstanding, as I don't think Paul would say
> you should read data in thread A and write to it in B without using
> READ_ONCE/WRITE_ONCE or a stronger atomic to manage the data race.
>
> The LWN articles on the "big bad compiler" are informative here. You
> don't want the compiler to do a transformation where it loads
> state_in_sysfs multiple times and gets different answers. This is what
> READ_ONCE is aiming to prevent.
>
> Here it is just a boolean flag, and the flag is only cleared, so risks
> are low, but it still isn't a technically correct way to use RCU.
>
> (and yes the kernel is full of examples of not following the memory
> model strictly)

Ok, so this is the disagreement: strict adherence to the model even
where it does not matter in practice. In that sense this use is not
idiomatic, and the fact that we've spilled this many bytes in emails
back and forth is good enough reason to change it.

>
> > > > cdev_device_del(...);
> > > > down_write(...);
> > > > up_write(...);
> > >
> > > The lock would have to enclose the store to state_in_sysfs, otherwise
> > > as written it has the same data race problems.
> >
> > There's no race above. The rule is that any possible observation of
> > ->state_in_sysfs == 1, or rcu_dereference() != NULL, must be
> > flushed.
>
> It is not about the flushing.

Ok, it's not about the flushing; it's about whether the store to
state_in_sysfs can leak past up_write(). If that store can leak then
neither arrangement of:

cdev_device_del(...);
down_write(...);
up_write(...);

...or:

down_write(...);
cdev_device_del(...);
up_write(...);

...is sufficient.
Jason Gunthorpe March 30, 2021, 7:51 p.m. UTC | #11
On Tue, Mar 30, 2021 at 12:43:15PM -0700, Dan Williams wrote:

> Ok, so this is the disagreement: strict adherence to the model even
> where it does not matter in practice.

The basic problem is that it is hard to know if it does not matter in
practice because you never know what the compiler might decide to do
...

It is probably fine, but then again I've seen a surprising case in the
mm where the compiler did generate double loads and it wasn't fine.

Now with the new data race detector it feels like marking all
concurrent data access with READ_ONCE / WRITE_ONCE (or a stronger
atomic) is the correct thing to do.

> > > There's no race above. The rule is that any possible observation of
> > > ->state_in_sysfs == 1, or rcu_dereference() != NULL, must be
> > > flushed.
> >
> > It is not about the flushing.
> 
> Ok, it's not about the flushing; it's about whether the store to
> state_in_sysfs can leak past up_write(). If that store can leak then
> neither arrangement of:

up_write() and synchronize_srcu() are both store barriers, so the
store must be ordered.

It is the reader that is the problem. This ordering:

> down_write(...);
> cdev_device_del(...);
> up_write(...);

Prevents state_in_sysfs from being unstable during a read, as the write
lock prevents it from being written while a reader is active. No
READ_ONCE needed.

Jason
Dan Williams March 30, 2021, 9 p.m. UTC | #12
On Tue, Mar 30, 2021 at 12:52 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Tue, Mar 30, 2021 at 12:43:15PM -0700, Dan Williams wrote:
>
> > Ok, so this is the disagreement. Strict adherence to the model where
> > it does not matter in practice.
>
> The basic problem is that it is hard to know if it does not matter in
> practice because you never know what the compiler might decide to do
> ...
>
> It is probably fine, but then again I've seen a surprising case in the
> mm where the compiler did generate double loads and it wasn't fine.
>
> Now with the new data race detector it feels like marking all
> concurrent data access with READ_ONCE / WRITE_ONCE (or a stronger
> atomic) is the correct thing to do.
>
> > > > There's no race above. The rule is that any possible observation of
> > > > ->state_in_sysfs == 1, or rcu_dereference() != NULL, must be
> > > > flushed.
> > >
> > > It is not about the flushing.
> >
> > Ok, it's not about the flushing; it's about whether the store to
> > state_in_sysfs can leak past up_write(). If that store can leak then
> > neither arrangement of:
>
> up_write() and synchronize_srcu() are both store barriers, so the
> store must be ordered.
>
> It is the reader that is the problem. This ordering:
>
> > down_write(...);
> > cdev_device_del(...);
> > up_write(...);
>
> Prevents state_in_sysfs from being unstable during a read, as the write
> lock prevents it from being written while a reader is active. No
> READ_ONCE needed.
>

Ok, so another case where I agree the instability exists but does not
matter in practice, because the cxl_memdev_ioctl() read side is
prepared for the state change to leak into the down_read() section.
There's no meaningful publish/unpublish race that READ_ONCE()
resolves here. With the read side prepared for instability there is
no need to hold a lock over cdev_device_del(), which would entangle
this locking dependency unnecessarily with all the other operations
that cdev_device_del() performs.

Again, though, I appear to be spilling bytes explaining specific
mitigating circumstances. I anticipate your response to the assertion
that holding a lock over cdev_device_del() is too broad would be to
move the synchronization away from device object state and towards the
specific dependency that needs to remain stable for the execution of
the ioctl. So the registration path does:

cxlmd->cxlm = cxlm;
down_write(&cxl_memdev_rwsem);
up_write(&cxl_memdev_rwsem);
cdev_device_add(...);

...unregistration path does:

down_write(&cxl_memdev_rwsem);
cxlmd->cxlm = NULL;
up_write(&cxl_memdev_rwsem);
cdev_device_del(...);

...and then no unstable state of the ->cxlm reference leaks into the
read-side that does:

down_read(&cxl_memdev_rwsem);
if (!cxlmd->cxlm) {
    up_read(&cxl_memdev_rwsem);
    return -ENXIO;
}
...
up_read(&cxl_memdev_rwsem);

...and no need to debate locks held over cdev_device_del() to resolve
instability.
Jason Gunthorpe March 30, 2021, 10:09 p.m. UTC | #13
On Tue, Mar 30, 2021 at 02:00:43PM -0700, Dan Williams wrote:

> Ok, so another case where I agree the instability exists but does not
> matter in practice, because the cxl_memdev_ioctl() read side is
> prepared for the state change to leak into the down_read() section.
> There's no meaningful publish/unpublish race that READ_ONCE()
> resolves here.

Generally READ_ONCE is about protecting the control flow that follows
the load from the *compiler*.

Say I have:

    if (a) {
        do_x();
        do_y();
    }

The compiler *might* transform it so I get:

    if (a)
        do_x();
    if (a)
        do_y();

Now if 'a' is an unstable data race I have a crazy problem: the basic
invariant that x and y are always done together is broken and I can
have x done without y (or y done without x if the CPU runs it out of
order!).

But I can't see this at all from the code; it just runs wrong under
certain races with certain compilers.

READ_ONCE prevents this kind of compiler transform; this is what it
is for. It is why in modern times it has become expected to always
mark unlocked, unstable data access with these helpers, or the RCU
specific variants.
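
With the load marked, the compiler has to materialize a single value,
so the two branches cannot disagree (a sketch of the same example):

    int v = READ_ONCE(a);

    if (v) {
        do_x();
        do_y();
    }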

It is not about happens-before/happens-after kinds of reasoning; it
is about "the compiler assumes the memory doesn't change and if we
break that assumption the compiled result might not work sanely".

> down_write(&cxl_memdev_rwsem);
> cxlmd->cxlm = NULL;
> up_write(&cxl_memdev_rwsem);
> cdev_device_del(...);
 
> ...and then no unstable state of the ->cxlm reference leaks into the
> read-side that does:
> 
> down_read(&cxl_memdev_rwsem);
> if (!cxlmd->cxlm) {
>     up_read(&cxl_memdev_rwsem);
>     return -ENXIO;
> }
> ...
> up_read(&cxl_memdev_rwsem);

Sure, this construction is easy to understand and validate, if an
rwsem is OK performance-wise. We can do it after the cdev_device_del().
Here the cxlm is the protected data and it is always accessed under
the lock.

The rwsem can be transformed to SRCU by marking cxlm with the __rcu
type annotation and using srcu_dereference() (as a READ_ONCE wrapper)
and rcu_assign_pointer() (as a WRITE_ONCE wrapper) to manipulate the
cxlm.
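
A rough sketch of that transformation (untested; built from the
proposal above):

    /* in struct cxl_memdev */
    struct cxl_mem __rcu *cxlm;

    /* read side */
    idx = srcu_read_lock(&cxl_memdev_srcu);
    cxlm = srcu_dereference(cxlmd->cxlm, &cxl_memdev_srcu);
    if (cxlm)
        rc = __cxl_memdev_ioctl(cxlmd, cmd, arg);
    else
        rc = -ENXIO;
    srcu_read_unlock(&cxl_memdev_srcu, idx);

    /* unregistration: unpublish, then wait out readers */
    rcu_assign_pointer(cxlmd->cxlm, NULL);
    synchronize_srcu(&cxl_memdev_srcu);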

Jason

Patch

diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
index 30bf4f0f3c17..548d696f1f54 100644
--- a/drivers/cxl/mem.c
+++ b/drivers/cxl/mem.c
@@ -96,21 +96,18 @@  struct mbox_cmd {
  * @dev: driver core device object
  * @cdev: char dev core object for ioctl operations
  * @cxlm: pointer to the parent device driver data
- * @ops_active: active user of @cxlm in ops handlers
- * @ops_dead: completion when all @cxlm ops users have exited
  * @id: id number of this memdev instance.
  */
 struct cxl_memdev {
 	struct device dev;
 	struct cdev cdev;
 	struct cxl_mem *cxlm;
-	struct percpu_ref ops_active;
-	struct completion ops_dead;
 	int id;
 };
 
 static int cxl_mem_major;
 static DEFINE_IDA(cxl_memdev_ida);
+DEFINE_STATIC_SRCU(cxl_memdev_srcu);
 static struct dentry *cxl_debugfs;
 static bool cxl_raw_allow_all;
 
@@ -763,6 +760,14 @@  static int cxl_send_cmd(struct cxl_memdev *cxlmd,
 static long __cxl_memdev_ioctl(struct cxl_memdev *cxlmd, unsigned int cmd,
 			       unsigned long arg)
 {
+	/*
+	 * Stop servicing ioctls once the device has been unregistered
+	 * which indicates it has been disconnected from its cxlm driver
+	 * context
+	 */
+	if (!device_is_registered(&cxlmd->dev))
+		return -ENXIO;
+
 	switch (cmd) {
 	case CXL_MEM_QUERY_COMMANDS:
 		return cxl_query_cmd(cxlmd, (void __user *)arg);
@@ -776,26 +781,42 @@  static long __cxl_memdev_ioctl(struct cxl_memdev *cxlmd, unsigned int cmd,
 static long cxl_memdev_ioctl(struct file *file, unsigned int cmd,
 			     unsigned long arg)
 {
-	struct cxl_memdev *cxlmd;
-	struct inode *inode;
-	int rc = -ENOTTY;
+	struct cxl_memdev *cxlmd = file->private_data;
+	int rc, idx;
 
-	inode = file_inode(file);
-	cxlmd = container_of(inode->i_cdev, typeof(*cxlmd), cdev);
+	idx = srcu_read_lock(&cxl_memdev_srcu);
+	rc = __cxl_memdev_ioctl(cxlmd, cmd, arg);
+	srcu_read_unlock(&cxl_memdev_srcu, idx);
 
-	if (!percpu_ref_tryget_live(&cxlmd->ops_active))
-		return -ENXIO;
+	return rc;
+}
 
-	rc = __cxl_memdev_ioctl(cxlmd, cmd, arg);
+static int cxl_memdev_open(struct inode *inode, struct file *file)
+{
+	struct cxl_memdev *cxlmd =
+		container_of(inode->i_cdev, typeof(*cxlmd), cdev);
 
-	percpu_ref_put(&cxlmd->ops_active);
+	get_device(&cxlmd->dev);
+	file->private_data = cxlmd;
 
-	return rc;
+	return 0;
+}
+
+static int cxl_memdev_release_file(struct inode *inode, struct file *file)
+{
+	struct cxl_memdev *cxlmd =
+		container_of(inode->i_cdev, typeof(*cxlmd), cdev);
+
+	put_device(&cxlmd->dev);
+
+	return 0;
 }
 
 static const struct file_operations cxl_memdev_fops = {
 	.owner = THIS_MODULE,
 	.unlocked_ioctl = cxl_memdev_ioctl,
+	.open = cxl_memdev_open,
+	.release = cxl_memdev_release_file,
 	.compat_ioctl = compat_ptr_ioctl,
 	.llseek = noop_llseek,
 };
@@ -1049,7 +1070,6 @@  static void cxl_memdev_release(struct device *dev)
 {
 	struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
 
-	percpu_ref_exit(&cxlmd->ops_active);
 	ida_free(&cxl_memdev_ida, cxlmd->id);
 	kfree(cxlmd);
 }
@@ -1155,21 +1175,12 @@  static void cxlmdev_unregister(void *_cxlmd)
 	struct cxl_memdev *cxlmd = _cxlmd;
 	struct device *dev = &cxlmd->dev;
 
-	percpu_ref_kill(&cxlmd->ops_active);
 	cdev_device_del(&cxlmd->cdev, dev);
-	wait_for_completion(&cxlmd->ops_dead);
+	synchronize_srcu(&cxl_memdev_srcu);
 	cxlmd->cxlm = NULL;
 	put_device(dev);
 }
 
-static void cxlmdev_ops_active_release(struct percpu_ref *ref)
-{
-	struct cxl_memdev *cxlmd =
-		container_of(ref, typeof(*cxlmd), ops_active);
-
-	complete(&cxlmd->ops_dead);
-}
-
 static int cxl_mem_add_memdev(struct cxl_mem *cxlm)
 {
 	struct pci_dev *pdev = cxlm->pdev;
@@ -1181,17 +1192,12 @@  static int cxl_mem_add_memdev(struct cxl_mem *cxlm)
 	cxlmd = kzalloc(sizeof(*cxlmd), GFP_KERNEL);
 	if (!cxlmd)
 		return -ENOMEM;
-	init_completion(&cxlmd->ops_dead);
 
 	/*
-	 * @cxlm is deallocated when the driver unbinds so operations
-	 * that are using it need to hold a live reference.
+	 * @cxlm is released when the driver unbinds so srcu and
+	 * device_is_registered() protect usage of this through @cxlmd.
 	 */
 	cxlmd->cxlm = cxlm;
-	rc = percpu_ref_init(&cxlmd->ops_active, cxlmdev_ops_active_release, 0,
-			     GFP_KERNEL);
-	if (rc)
-		goto err_ref;
 
 	rc = ida_alloc_range(&cxl_memdev_ida, 0, CXL_MEM_MAX_DEVS, GFP_KERNEL);
 	if (rc < 0)
@@ -1216,16 +1222,13 @@  static int cxl_mem_add_memdev(struct cxl_mem *cxlm)
 	return devm_add_action_or_reset(dev->parent, cxlmdev_unregister, cxlmd);
 
 err_add:
-	ida_free(&cxl_memdev_ida, cxlmd->id);
-err_id:
 	/*
-	 * Theoretically userspace could have already entered the fops,
-	 * so flush ops_active.
+	 * The cdev was briefly live, flush any ioctl operations that
+	 * saw that state.
 	 */
-	percpu_ref_kill(&cxlmd->ops_active);
-	wait_for_completion(&cxlmd->ops_dead);
-	percpu_ref_exit(&cxlmd->ops_active);
-err_ref:
+	synchronize_srcu(&cxl_memdev_srcu);
+	ida_free(&cxl_memdev_ida, cxlmd->id);
+err_id:
 	kfree(cxlmd);
 
 	return rc;