mbox series

[for-rc,0/2] Fix poor reference counting in advice_mr work

Message ID 20190112023124.21647-1-jgg@ziepe.ca (mailing list archive)
Headers show
Series Fix poor reference counting in advice_mr work | expand

Message

Jason Gunthorpe Jan. 12, 2019, 2:31 a.m. UTC
From: Jason Gunthorpe <jgg@mellanox.com>

Work needs to hold a reference on the pointers that it plans to use otherwise
they can become free'd or unregistered. Expose the existing registration lock
that netlink uses for this purpose.

Jason Gunthorpe (2):
  RDMA/device: Expose ib_device_try_get(()
  IB/mlx5: Fix how advise_mr() launches async work

 drivers/infiniband/core/core_priv.h |  1 -
 drivers/infiniband/core/device.c    |  6 +++---
 drivers/infiniband/hw/mlx5/odp.c    | 10 +++++-----
 include/rdma/ib_verbs.h             | 24 ++++++++++++++++++++++--
 4 files changed, 30 insertions(+), 11 deletions(-)

Comments

Jason Gunthorpe Jan. 21, 2019, 9:43 p.m. UTC | #1
On Fri, Jan 11, 2019 at 07:31:22PM -0700, Jason Gunthorpe wrote:
> From: Jason Gunthorpe <jgg@mellanox.com>
> 
> Work needs to hold a reference on the pointers that it plans to use otherwise
> they can become free'd or unregistered. Expose the existing registration lock
> that netlink uses for this purpose.
> 
> Jason Gunthorpe (2):
>   RDMA/device: Expose ib_device_try_get(()
>   IB/mlx5: Fix how advise_mr() launches async work

I've applied both of these to for-rc, with the change in comment that
Parav gave.

We can revisit if get_device/put_device is the best way to handle this
when the work queue issue is resolved, most probably we can drop the
get/put in the driver once the work queue is flushed prior to
deallocing the device.

Jason
Leon Romanovsky Jan. 22, 2019, 5:52 a.m. UTC | #2
On Mon, Jan 21, 2019 at 02:43:41PM -0700, Jason Gunthorpe wrote:
> On Fri, Jan 11, 2019 at 07:31:22PM -0700, Jason Gunthorpe wrote:
> > From: Jason Gunthorpe <jgg@mellanox.com>
> >
> > Work needs to hold a reference on the pointers that it plans to use otherwise
> > they can become free'd or unregistered. Expose the existing registration lock
> > that netlink uses for this purpose.
> >
> > Jason Gunthorpe (2):
> >   RDMA/device: Expose ib_device_try_get(()
> >   IB/mlx5: Fix how advise_mr() launches async work
>
> I've applied both of these to for-rc, with the change in comment that
> Parav gave.
>
> We can revisit if get_device/put_device is the best way to handle this
> when the work queue issue is resolved, most probably we can drop the
> get/put in the driver once the work queue is flushed prior to
> deallocing the device.

Didn't we agree that the right fix is to delete advise_mr and sync
system queue on exit?

It will be much more clean to do it right from the beginning instead of
adding new get_device/put_device and removing them immediately in
following patch.

Thanks

>
> Jason
Jason Gunthorpe Jan. 22, 2019, 4:04 p.m. UTC | #3
On Tue, Jan 22, 2019 at 07:52:20AM +0200, Leon Romanovsky wrote:
> On Mon, Jan 21, 2019 at 02:43:41PM -0700, Jason Gunthorpe wrote:
> > On Fri, Jan 11, 2019 at 07:31:22PM -0700, Jason Gunthorpe wrote:
> > > From: Jason Gunthorpe <jgg@mellanox.com>
> > >
> > > Work needs to hold a reference on the pointers that it plans to use otherwise
> > > they can become free'd or unregistered. Expose the existing registration lock
> > > that netlink uses for this purpose.
> > >
> > > Jason Gunthorpe (2):
> > >   RDMA/device: Expose ib_device_try_get(()
> > >   IB/mlx5: Fix how advise_mr() launches async work
> >
> > I've applied both of these to for-rc, with the change in comment that
> > Parav gave.
> >
> > We can revisit if get_device/put_device is the best way to handle this
> > when the work queue issue is resolved, most probably we can drop the
> > get/put in the driver once the work queue is flushed prior to
> > deallocing the device.
> 
> Didn't we agree that the right fix is to delete advise_mr and sync
> system queue on exit?

I need something for -rc, and that fix was way too big.

> It will be much more clean to do it right from the beginning instead of
> adding new get_device/put_device and removing them immediately in
> following patch.

The WQ cleanup patch will be in -next..

Jason