Message ID | 20190112023124.21647-3-jgg@ziepe.ca (mailing list archive) |
---|---|
State | Accepted |
Delegated to: | Jason Gunthorpe |
Headers | show |
Series | Fix poor reference counting in advice_mr work | expand |
On Fri, Jan 11, 2019 at 07:31:24PM -0700, Jason Gunthorpe wrote: > From: Jason Gunthorpe <jgg@mellanox.com> > > Work must hold a kref on the ib_device otherwise the dev pointer can > become free before the work runs. How can it occur exactly? "dev" will be released after mlx5_ib will exist, mlx5_ib exist will drain advise_mr queue before it. Such drain will ensure that no works are running. > > Remove the bogus use of 'reg_state': > - While in uverbs the reg_state is guaranteed to always be > REGISTERED > - Testing reg_state with no locking is bogus. Use ib_device_try_get() > to get back into a region that prevents unregistration. > > Fixes: 813e90b1aeaa ("IB/mlx5: Add advise_mr() support") > Signed-off-by: Jason Gunthorpe <jgg@mellanox.com> > --- > drivers/infiniband/hw/mlx5/odp.c | 10 +++++----- > 1 file changed, 5 insertions(+), 5 deletions(-) > > diff --git a/drivers/infiniband/hw/mlx5/odp.c b/drivers/infiniband/hw/mlx5/odp.c > index 8d46b1dc56580f..a241a65834d4ca 100644 > --- a/drivers/infiniband/hw/mlx5/odp.c > +++ b/drivers/infiniband/hw/mlx5/odp.c > @@ -1595,10 +1595,12 @@ static void mlx5_ib_prefetch_mr_work(struct work_struct *work) > struct prefetch_mr_work *w = > container_of(work, struct prefetch_mr_work, work); > > - if (w->dev->ib_dev.reg_state == IB_DEV_REGISTERED) > + if (ib_device_try_get(&w->dev->ib_dev)) { > mlx5_ib_prefetch_sg_list(w->dev, w->pf_flags, w->sg_list, > w->num_sge); > - > + ib_device_put(&w->dev->ib_dev); > + } > + put_device(&w->dev->ib_dev.dev); > kfree(w); > } > > @@ -1617,15 +1619,13 @@ int mlx5_ib_advise_mr_prefetch(struct ib_pd *pd, > return mlx5_ib_prefetch_sg_list(dev, pf_flags, sg_list, > num_sge); > > - if (dev->ib_dev.reg_state != IB_DEV_REGISTERED) > - return -ENODEV; > - > work = kvzalloc(struct_size(work, sg_list, num_sge), GFP_KERNEL); > if (!work) > return -ENOMEM; > > memcpy(work->sg_list, sg_list, num_sge * sizeof(struct ib_sge)); > > + get_device(&dev->ib_dev.dev); It looks extremely wrong to sprinkle driver code with general get_device/put_device. > work->dev = dev; > work->pf_flags = pf_flags; > work->num_sge = num_sge; > -- > 2.20.1 >
On Sat, Jan 12, 2019 at 05:06:02PM +0200, Leon Romanovsky wrote: > On Fri, Jan 11, 2019 at 07:31:24PM -0700, Jason Gunthorpe wrote: > > From: Jason Gunthorpe <jgg@mellanox.com> > > > > Work must hold a kref on the ib_device otherwise the dev pointer can > > become free before the work runs. > > How can it occur exactly? > > "dev" will be released after mlx5_ib will exist, mlx5_ib exist will > drain advise_mr queue before it. Such drain will ensure that no works > are running. Okay, that is fine for the get_device, it can be removed. Are we OK with the reg_state hunk? Jason
> -----Original Message----- > From: linux-rdma-owner@vger.kernel.org <linux-rdma- > owner@vger.kernel.org> On Behalf Of Jason Gunthorpe > Sent: Saturday, January 12, 2019 11:06 AM > To: Leon Romanovsky <leon@kernel.org> > Cc: linux-rdma@vger.kernel.org; Moni Shoua <monis@mellanox.com> > Subject: Re: [PATCH for-rc 2/2] IB/mlx5: Fix how advise_mr() launches async > work > > On Sat, Jan 12, 2019 at 05:06:02PM +0200, Leon Romanovsky wrote: > > On Fri, Jan 11, 2019 at 07:31:24PM -0700, Jason Gunthorpe wrote: > > > From: Jason Gunthorpe <jgg@mellanox.com> > > > > > > Work must hold a kref on the ib_device otherwise the dev pointer can > > > become free before the work runs. > > > > How can it occur exactly? > > > > "dev" will be released after mlx5_ib will exist, mlx5_ib exist will > > drain advise_mr queue before it. Such drain will ensure that no works > > are running. > > Okay, that is fine for the get_device, it can be removed. > > Are we OK with the reg_state hunk? > > Jason Unrelated to your question, I noticed that advise_mr does uverbs_attr_ptr_get_array_size() for the SGEs. I think this is incorrect. It should have been mr handles whose ownership can be check against the PID or ucontext that registered those MRs. With ib_sge, now one user can start page faulting MRs of another process of another user? How is this protected? Uverbs shouldn't be deriving uverbs handle from mr key of ib_sge. Who enqueues work to dev->advise_mr_wq? schedule_work() in mlx5_ib_advise_mr_prefetch() does to global workqueue. It should be done to advise_mr_wq(). This will give chance to flush the wq when IB device is unregistered by the core.
On Sat, Jan 12, 2019 at 10:32:54AM -0700, Parav Pandit wrote: > > > > From: linux-rdma-owner@vger.kernel.org <linux-rdma- > > owner@vger.kernel.org> On Behalf Of Jason Gunthorpe > > Sent: Saturday, January 12, 2019 11:06 AM > > To: Leon Romanovsky <leon@kernel.org> > > Cc: linux-rdma@vger.kernel.org; Moni Shoua <monis@mellanox.com> > > Subject: Re: [PATCH for-rc 2/2] IB/mlx5: Fix how advise_mr() launches async > > work > > > > On Sat, Jan 12, 2019 at 05:06:02PM +0200, Leon Romanovsky wrote: > > > On Fri, Jan 11, 2019 at 07:31:24PM -0700, Jason Gunthorpe wrote: > > > > From: Jason Gunthorpe <jgg@mellanox.com> > > > > > > > > Work must hold a kref on the ib_device otherwise the dev pointer can > > > > become free before the work runs. > > > > > > How can it occur exactly? > > > > > > "dev" will be released after mlx5_ib will exist, mlx5_ib exist will > > > drain advise_mr queue before it. Such drain will ensure that no works > > > are running. > > > > Okay, that is fine for the get_device, it can be removed. > > > > Are we OK with the reg_state hunk? > > > > Jason > > Unrelated to your question, I noticed that advise_mr does > uverbs_attr_ptr_get_array_size() for the SGEs. I think this is > incorrect. It should have been mr handles whose ownership can be > check against the PID or ucontext that registered those MRs. The SGE's have lkeys in them, and the lkeys are local to the PD passed in as argument. So the caller cannot touch resources outside the PD, hopefully the driver implementation checks this.. > Who enqueues work to dev->advise_mr_wq? schedule_work() in > mlx5_ib_advise_mr_prefetch() does to global workqueue. Ah! I *thought* I checked this, yes, using the system work queue is why I added the put_device :) > It should be done to advise_mr_wq(). This will give chance to flush > the wq when IB device is unregistered by the core. Good question - Moni?? Jason
> -----Original Message----- > From: Jason Gunthorpe > Sent: Saturday, January 12, 2019 11:39 AM > To: Parav Pandit <parav@mellanox.com> > Cc: Leon Romanovsky <leon@kernel.org>; linux-rdma@vger.kernel.org; Moni > Shoua <monis@mellanox.com> > Subject: Re: [PATCH for-rc 2/2] IB/mlx5: Fix how advise_mr() launches async > work > > On Sat, Jan 12, 2019 at 10:32:54AM -0700, Parav Pandit wrote: > > > > > > > From: linux-rdma-owner@vger.kernel.org <linux-rdma- > > > owner@vger.kernel.org> On Behalf Of Jason Gunthorpe > > > Sent: Saturday, January 12, 2019 11:06 AM > > > To: Leon Romanovsky <leon@kernel.org> > > > Cc: linux-rdma@vger.kernel.org; Moni Shoua <monis@mellanox.com> > > > Subject: Re: [PATCH for-rc 2/2] IB/mlx5: Fix how advise_mr() > > > launches async work > > > > > > On Sat, Jan 12, 2019 at 05:06:02PM +0200, Leon Romanovsky wrote: > > > > On Fri, Jan 11, 2019 at 07:31:24PM -0700, Jason Gunthorpe wrote: > > > > > From: Jason Gunthorpe <jgg@mellanox.com> > > > > > > > > > > Work must hold a kref on the ib_device otherwise the dev pointer > > > > > can become free before the work runs. > > > > > > > > How can it occur exactly? > > > > > > > > "dev" will be released after mlx5_ib will exist, mlx5_ib exist > > > > will drain advise_mr queue before it. Such drain will ensure that > > > > no works are running. > > > > > > Okay, that is fine for the get_device, it can be removed. > > > > > > Are we OK with the reg_state hunk? > > > > > > Jason > > > > Unrelated to your question, I noticed that advise_mr does > > uverbs_attr_ptr_get_array_size() for the SGEs. I think this is > > incorrect. It should have been mr handles whose ownership can be > > check against the PID or ucontext that registered those MRs. > > The SGE's have lkeys in them, and the lkeys are local to the PD passed in as > argument. > > So the caller cannot touch resources outside the PD, hopefully the driver > implementation checks this.. > pagefault_single_data_segment () invokes __mlx5_mr_lookup() to lookup a MR per device and not per PD. mlx5_ib_advise_mr_prefetch() doesn't touch pd. work item doesn't have PD in it. Lets say PD check is added and it passes in pagefault_single_data_segment() under device overloaded srcu lock in work item context. After that MR got reallocated to other process before work item can do the work and now it fault's some other process. To avoid this, instead of overloading dev->mr_srcu, this work item should be enqueued to advise_mr_wq and that should be flushed during destroy_mr() which ensures that nothing dangling for this MR. Such code will create right synchronization without depending on randomness of key, number of MR etc parameters. > > Who enqueues work to dev->advise_mr_wq? schedule_work() in > > mlx5_ib_advise_mr_prefetch() does to global workqueue. > > Ah! I *thought* I checked this, yes, using the system work queue is why I > added the put_device :) > > > It should be done to advise_mr_wq(). This will give chance to flush > > the wq when IB device is unregistered by the core. > > Good question - Moni?? > > Jason
On Sat, Jan 12, 2019 at 11:06:51AM -0700, Parav Pandit wrote: > > > > From: Jason Gunthorpe > > Sent: Saturday, January 12, 2019 11:39 AM > > To: Parav Pandit <parav@mellanox.com> > > Cc: Leon Romanovsky <leon@kernel.org>; linux-rdma@vger.kernel.org; Moni > > Shoua <monis@mellanox.com> > > Subject: Re: [PATCH for-rc 2/2] IB/mlx5: Fix how advise_mr() launches async > > work > > > > On Sat, Jan 12, 2019 at 10:32:54AM -0700, Parav Pandit wrote: > > > > > > > > > > From: linux-rdma-owner@vger.kernel.org <linux-rdma- > > > > owner@vger.kernel.org> On Behalf Of Jason Gunthorpe > > > > Sent: Saturday, January 12, 2019 11:06 AM > > > > To: Leon Romanovsky <leon@kernel.org> > > > > Cc: linux-rdma@vger.kernel.org; Moni Shoua <monis@mellanox.com> > > > > Subject: Re: [PATCH for-rc 2/2] IB/mlx5: Fix how advise_mr() > > > > launches async work > > > > > > > > On Sat, Jan 12, 2019 at 05:06:02PM +0200, Leon Romanovsky wrote: > > > > > On Fri, Jan 11, 2019 at 07:31:24PM -0700, Jason Gunthorpe wrote: > > > > > > From: Jason Gunthorpe <jgg@mellanox.com> > > > > > > > > > > > > Work must hold a kref on the ib_device otherwise the dev pointer > > > > > > can become free before the work runs. > > > > > > > > > > How can it occur exactly? > > > > > > > > > > "dev" will be released after mlx5_ib will exist, mlx5_ib exist > > > > > will drain advise_mr queue before it. Such drain will ensure that > > > > > no works are running. > > > > > > > > Okay, that is fine for the get_device, it can be removed. > > > > > > > > Are we OK with the reg_state hunk? > > > > > > > > Jason > > > > > > Unrelated to your question, I noticed that advise_mr does > > > uverbs_attr_ptr_get_array_size() for the SGEs. I think this is > > > incorrect. It should have been mr handles whose ownership can be > > > check against the PID or ucontext that registered those MRs. > > > > The SGE's have lkeys in them, and the lkeys are local to the PD passed in as > > argument. > > > > So the caller cannot touch resources outside the PD, hopefully the driver > > implementation checks this.. > > > pagefault_single_data_segment () invokes __mlx5_mr_lookup() to > lookup a MR per device and not per PD. mlx5_ib_advise_mr_prefetch() > doesn't touch pd. work item doesn't have PD in it. Ugh, Moni? I don't think we should alow processes to page fault outside their PD, that sounds like an avenue for a side-channel security problem. > After that MR got reallocated to other process before work item can > do the work and now it fault's some other process. The locking around mr_srcu and MR lifetime is all broken, so this is not too surprising.. > To avoid this, instead of overloading dev->mr_srcu, this work item > should be enqueued to advise_mr_wq and that should be flushed during > destroy_mr() which ensures that nothing dangling for this MR. Such > code will create right synchronization without depending on > randomness of key, number of MR etc parameters. Moni? Jason
On Sat, Jan 12, 2019 at 05:05:49PM +0000, Jason Gunthorpe wrote: > On Sat, Jan 12, 2019 at 05:06:02PM +0200, Leon Romanovsky wrote: > > On Fri, Jan 11, 2019 at 07:31:24PM -0700, Jason Gunthorpe wrote: > > > From: Jason Gunthorpe <jgg@mellanox.com> > > > > > > Work must hold a kref on the ib_device otherwise the dev pointer can > > > become free before the work runs. > > > > How can it occur exactly? > > > > "dev" will be released after mlx5_ib will exist, mlx5_ib exist will > > drain advise_mr queue before it. Such drain will ensure that no works > > are running. > > Okay, that is fine for the get_device, it can be removed. > > Are we OK with the reg_state hunk? I'm not sure about that Maybe current code is not nice as you suggested but IB_DEV_REGISTERED check also valid approach. More general, I don't like how IB/core handles states and IMHO we should push our stack to be similar to block layer elevator. Block layer implements clear stages and calls to callbacks provided by drivers. It removes the need to manage and check any state in those drivers. If we follow this approach, we will fix natively all those unregister/reset bogus flows and eliminate the need to do kref on everything. Thanks > > Jason
On Sat, Jan 12, 2019 at 05:38:52PM +0000, Jason Gunthorpe wrote: > On Sat, Jan 12, 2019 at 10:32:54AM -0700, Parav Pandit wrote: > > > > > > > From: linux-rdma-owner@vger.kernel.org <linux-rdma- > > > owner@vger.kernel.org> On Behalf Of Jason Gunthorpe > > > Sent: Saturday, January 12, 2019 11:06 AM > > > To: Leon Romanovsky <leon@kernel.org> > > > Cc: linux-rdma@vger.kernel.org; Moni Shoua <monis@mellanox.com> > > > Subject: Re: [PATCH for-rc 2/2] IB/mlx5: Fix how advise_mr() launches async > > > work > > > > > > On Sat, Jan 12, 2019 at 05:06:02PM +0200, Leon Romanovsky wrote: > > > > On Fri, Jan 11, 2019 at 07:31:24PM -0700, Jason Gunthorpe wrote: > > > > > From: Jason Gunthorpe <jgg@mellanox.com> > > > > > > > > > > Work must hold a kref on the ib_device otherwise the dev pointer can > > > > > become free before the work runs. > > > > > > > > How can it occur exactly? > > > > > > > > "dev" will be released after mlx5_ib will exist, mlx5_ib exist will > > > > drain advise_mr queue before it. Such drain will ensure that no works > > > > are running. > > > > > > Okay, that is fine for the get_device, it can be removed. > > > > > > Are we OK with the reg_state hunk? > > > > > > Jason > > > > Unrelated to your question, I noticed that advise_mr does > > uverbs_attr_ptr_get_array_size() for the SGEs. I think this is > > incorrect. It should have been mr handles whose ownership can be > > check against the PID or ucontext that registered those MRs. > > The SGE's have lkeys in them, and the lkeys are local to the PD passed > in as argument. > > So the caller cannot touch resources outside the PD, hopefully the > driver implementation checks this.. > > > Who enqueues work to dev->advise_mr_wq? schedule_work() in > > mlx5_ib_advise_mr_prefetch() does to global workqueue. > > Ah! I *thought* I checked this, yes, using the system work queue is > why I added the put_device :) > > > It should be done to advise_mr_wq(). This will give chance to flush > > the wq when IB device is unregistered by the core. > > Good question - Moni?? It is definitely missed in my review, but if the IB/core doesn't flush general system queue on driver removal, we are better to fix this. First addition of new works should be stopped and second system queue should be flushed. My general guideline is to avoid creating extra workqueues and I saw number of drivers are using system queues. Thanks > > Jason
> Ah! I *thought* I checked this, yes, using the system work queue is > why I added the put_device :) > > > It should be done to advise_mr_wq(). This will give chance to flush > > the wq when IB device is unregistered by the core. > > Good question - Moni?? > > Jason Thanks. Using schedule_work() is obviously a bug. The intention was to use the advise_mr_wq. I'll send a fix for that. However, After this will be fixed there is no need to get_device() since it is assured that no work items will be processed after mlx5_ib_stage_init_cleanup() finishes.
On Sat, Jan 12, 2019 at 8:42 PM Jason Gunthorpe <jgg@mellanox.com> wrote: > > On Sat, Jan 12, 2019 at 11:06:51AM -0700, Parav Pandit wrote: > > > > > > > From: Jason Gunthorpe > > > Sent: Saturday, January 12, 2019 11:39 AM > > > To: Parav Pandit <parav@mellanox.com> > > > Cc: Leon Romanovsky <leon@kernel.org>; linux-rdma@vger.kernel.org; Moni > > > Shoua <monis@mellanox.com> > > > Subject: Re: [PATCH for-rc 2/2] IB/mlx5: Fix how advise_mr() launches async > > > work > > > > > > On Sat, Jan 12, 2019 at 10:32:54AM -0700, Parav Pandit wrote: > > > > > > > > > > > > > From: linux-rdma-owner@vger.kernel.org <linux-rdma- > > > > > owner@vger.kernel.org> On Behalf Of Jason Gunthorpe > > > > > Sent: Saturday, January 12, 2019 11:06 AM > > > > > To: Leon Romanovsky <leon@kernel.org> > > > > > Cc: linux-rdma@vger.kernel.org; Moni Shoua <monis@mellanox.com> > > > > > Subject: Re: [PATCH for-rc 2/2] IB/mlx5: Fix how advise_mr() > > > > > launches async work > > > > > > > > > > On Sat, Jan 12, 2019 at 05:06:02PM +0200, Leon Romanovsky wrote: > > > > > > On Fri, Jan 11, 2019 at 07:31:24PM -0700, Jason Gunthorpe wrote: > > > > > > > From: Jason Gunthorpe <jgg@mellanox.com> > > > > > > > > > > > > > > Work must hold a kref on the ib_device otherwise the dev pointer > > > > > > > can become free before the work runs. > > > > > > > > > > > > How can it occur exactly? > > > > > > > > > > > > "dev" will be released after mlx5_ib will exist, mlx5_ib exist > > > > > > will drain advise_mr queue before it. Such drain will ensure that > > > > > > no works are running. > > > > > > > > > > Okay, that is fine for the get_device, it can be removed. > > > > > > > > > > Are we OK with the reg_state hunk? > > > > > > > > > > Jason > > > > > > > > Unrelated to your question, I noticed that advise_mr does > > > > uverbs_attr_ptr_get_array_size() for the SGEs. I think this is > > > > incorrect. It should have been mr handles whose ownership can be > > > > check against the PID or ucontext that registered those MRs. > > > > > > The SGE's have lkeys in them, and the lkeys are local to the PD passed in as > > > argument. > > > > > > So the caller cannot touch resources outside the PD, hopefully the driver > > > implementation checks this.. > > > > > pagefault_single_data_segment () invokes __mlx5_mr_lookup() to > > lookup a MR per device and not per PD. mlx5_ib_advise_mr_prefetch() > > doesn't touch pd. work item doesn't have PD in it. > > Ugh, Moni? > > I don't think we should alow processes to page fault outside their PD, > that sounds like an avenue for a side-channel security problem. > I can pass pd from prefetch_mr down to pagefault_single_data_segment() and force that that lookup will be also for pd. However there is another flow that goes through pagefault_single_data_segment() that starts from hardware page-fault. What will be the pd in this case? Perhaps the pd of the source/destination QP of the packet that generated the pagefault? But I can do this only if the type of page-fault is WQE. In this case the hardware provides the QPN. For page-fault of type RDMA the hardware doesn't provide the QPN, only the lkey, address and size (dev is the one that fired the event). So if there is a potential security risk in the advise_mr flow doesn't it exist in the page-fault flow? A malicious packet can do the same thing a malicious process does. What do you think? > > After that MR got reallocated to other process before work item can > > do the work and now it fault's some other process. > > The locking around mr_srcu and MR lifetime is all broken, so this is > not too surprising.. > > > To avoid this, instead of overloading dev->mr_srcu, this work item > > should be enqueued to advise_mr_wq and that should be flushed during > > destroy_mr() which ensures that nothing dangling for this MR. Such > > code will create right synchronization without depending on > > randomness of key, number of MR etc parameters. > > Moni? > > Jason
> -----Original Message----- > From: Moni Shoua <monis@mellanox.com> > Sent: Sunday, January 13, 2019 11:19 AM > To: Jason Gunthorpe <jgg@mellanox.com> > Cc: Parav Pandit <parav@mellanox.com>; Leon Romanovsky > <leon@kernel.org>; linux-rdma@vger.kernel.org > Subject: Re: [PATCH for-rc 2/2] IB/mlx5: Fix how advise_mr() launches async > work > > On Sat, Jan 12, 2019 at 8:42 PM Jason Gunthorpe <jgg@mellanox.com> > wrote: > > > > On Sat, Jan 12, 2019 at 11:06:51AM -0700, Parav Pandit wrote: > > > > > > > > > > From: Jason Gunthorpe > > > > Sent: Saturday, January 12, 2019 11:39 AM > > > > To: Parav Pandit <parav@mellanox.com> > > > > Cc: Leon Romanovsky <leon@kernel.org>; linux-rdma@vger.kernel.org; > > > > Moni Shoua <monis@mellanox.com> > > > > Subject: Re: [PATCH for-rc 2/2] IB/mlx5: Fix how advise_mr() > > > > launches async work > > > > > > > > On Sat, Jan 12, 2019 at 10:32:54AM -0700, Parav Pandit wrote: > > > > > > > > > > > > > > > > From: linux-rdma-owner@vger.kernel.org <linux-rdma- > > > > > > owner@vger.kernel.org> On Behalf Of Jason Gunthorpe > > > > > > Sent: Saturday, January 12, 2019 11:06 AM > > > > > > To: Leon Romanovsky <leon@kernel.org> > > > > > > Cc: linux-rdma@vger.kernel.org; Moni Shoua > > > > > > <monis@mellanox.com> > > > > > > Subject: Re: [PATCH for-rc 2/2] IB/mlx5: Fix how advise_mr() > > > > > > launches async work > > > > > > > > > > > > On Sat, Jan 12, 2019 at 05:06:02PM +0200, Leon Romanovsky wrote: > > > > > > > On Fri, Jan 11, 2019 at 07:31:24PM -0700, Jason Gunthorpe wrote: > > > > > > > > From: Jason Gunthorpe <jgg@mellanox.com> > > > > > > > > > > > > > > > > Work must hold a kref on the ib_device otherwise the dev > > > > > > > > pointer can become free before the work runs. > > > > > > > > > > > > > > How can it occur exactly? > > > > > > > > > > > > > > "dev" will be released after mlx5_ib will exist, mlx5_ib > > > > > > > exist will drain advise_mr queue before it. Such drain will > > > > > > > ensure that no works are running. > > > > > > > > > > > > Okay, that is fine for the get_device, it can be removed. > > > > > > > > > > > > Are we OK with the reg_state hunk? > > > > > > > > > > > > Jason > > > > > > > > > > Unrelated to your question, I noticed that advise_mr does > > > > > uverbs_attr_ptr_get_array_size() for the SGEs. I think this is > > > > > incorrect. It should have been mr handles whose ownership can > > > > > be check against the PID or ucontext that registered those MRs. > > > > > > > > The SGE's have lkeys in them, and the lkeys are local to the PD > > > > passed in as argument. > > > > > > > > So the caller cannot touch resources outside the PD, hopefully the > > > > driver implementation checks this.. > > > > > > > pagefault_single_data_segment () invokes __mlx5_mr_lookup() to > > > lookup a MR per device and not per PD. mlx5_ib_advise_mr_prefetch() > > > doesn't touch pd. work item doesn't have PD in it. > > > > Ugh, Moni? > > > > I don't think we should alow processes to page fault outside their PD, > > that sounds like an avenue for a side-channel security problem. > > > > I can pass pd from prefetch_mr down to pagefault_single_data_segment() > and force that that lookup will be also for pd. > However there is another flow that goes through > pagefault_single_data_segment() that starts from hardware page-fault. > What will be the pd in this case? Perhaps the pd of the source/destination > QP of the packet that generated the pagefault? But I can do this only if the > type of page-fault is WQE. In this case the hardware provides the QPN. For > page-fault of type RDMA the hardware doesn't provide the QPN, only the > lkey, address and size (dev is the one that fired the event). > So if there is a potential security risk in the advise_mr flow doesn't it exist in > the page-fault flow? A malicious packet can do the same thing a malicious > process does. > What do you think? When hw triggered the page fault it has passed the checks of PD for a QP and MR from the QP and MR context including honoring read/write/atomic access rights. Once those checks pass, it should ask for page fault. If MR gets destroyed in this process before page fault, host driver will be terminating the MR and all pending transactions.
> -----Original Message----- > From: Moni Shoua <monis@mellanox.com> > Sent: Sunday, January 13, 2019 10:57 AM > To: Jason Gunthorpe <jgg@mellanox.com> > Cc: Parav Pandit <parav@mellanox.com>; Leon Romanovsky > <leon@kernel.org>; linux-rdma@vger.kernel.org > Subject: Re: [PATCH for-rc 2/2] IB/mlx5: Fix how advise_mr() launches async > work > > > Ah! I *thought* I checked this, yes, using the system work queue is > > why I added the put_device :) > > > > > It should be done to advise_mr_wq(). This will give chance to flush > > > the wq when IB device is unregistered by the core. > > > > Good question - Moni?? > > > > Jason > Thanks. > Using schedule_work() is obviously a bug. The intention was to use the > advise_mr_wq. I'll send a fix for that. > However, After this will be fixed there is no need to get_device() since it is > assured that no work items will be processed after > mlx5_ib_stage_init_cleanup() finishes. Also you either need to flush the advise_mr_wq or cancel the pending work item when MR is destroyed to make sure that whatever page fault user triggered are canceled. Otherwise a MR can get recycled (dealloc, alloc) (looked up by MR key) for same PD which is page faulting now, which user didn't ask to.
On Sun, Jan 13, 2019 at 8:18 PM Parav Pandit <parav@mellanox.com> wrote: > > > > > -----Original Message----- > > From: Moni Shoua <monis@mellanox.com> > > Sent: Sunday, January 13, 2019 10:57 AM > > To: Jason Gunthorpe <jgg@mellanox.com> > > Cc: Parav Pandit <parav@mellanox.com>; Leon Romanovsky > > <leon@kernel.org>; linux-rdma@vger.kernel.org > > Subject: Re: [PATCH for-rc 2/2] IB/mlx5: Fix how advise_mr() launches async > > work > > > > > Ah! I *thought* I checked this, yes, using the system work queue is > > > why I added the put_device :) > > > > > > > It should be done to advise_mr_wq(). This will give chance to flush > > > > the wq when IB device is unregistered by the core. > > > > > > Good question - Moni?? > > > > > > Jason > > Thanks. > > Using schedule_work() is obviously a bug. The intention was to use the > > advise_mr_wq. I'll send a fix for that. > > However, After this will be fixed there is no need to get_device() since it is > > assured that no work items will be processed after > > mlx5_ib_stage_init_cleanup() finishes. > > Also you either need to flush the advise_mr_wq or cancel the pending work item when MR is destroyed to make sure that whatever page fault user triggered are canceled. Otherwise a MR can get recycled (dealloc, alloc) (looked up by MR key) for same PD which is page faulting now, which user didn't ask to.
On Mon, Jan 14, 2019 at 3:09 PM Moni Shoua <monis@mellanox.com> wrote: > > On Sun, Jan 13, 2019 at 8:18 PM Parav Pandit <parav@mellanox.com> wrote: > > > > > > > > > -----Original Message----- > > > From: Moni Shoua <monis@mellanox.com> > > > Sent: Sunday, January 13, 2019 10:57 AM > > > To: Jason Gunthorpe <jgg@mellanox.com> > > > Cc: Parav Pandit <parav@mellanox.com>; Leon Romanovsky > > > <leon@kernel.org>; linux-rdma@vger.kernel.org > > > Subject: Re: [PATCH for-rc 2/2] IB/mlx5: Fix how advise_mr() launches async > > > work > > > > > > > Ah! I *thought* I checked this, yes, using the system work queue is > > > > why I added the put_device :) > > > > > > > > > It should be done to advise_mr_wq(). This will give chance to flush > > > > > the wq when IB device is unregistered by the core. > > > > > > > > Good question - Moni?? > > > > > > > > Jason > > > Thanks. > > > Using schedule_work() is obviously a bug. The intention was to use the > > > advise_mr_wq. I'll send a fix for that. > > > However, After this will be fixed there is no need to get_device() since it is > > > assured that no work items will be processed after > > > mlx5_ib_stage_init_cleanup() finishes. > > > > Also you either need to flush the advise_mr_wq or cancel the pending work item when MR is destroyed to make sure that whatever page fault user triggered are canceled. Otherwise a MR can get recycled (dealloc, alloc) (looked up by MR key) for same PD which is page faulting now, which user didn't ask to. There is also a check for valid address. So, if the the new MR has different address range then the operation will not take and if the address is in range of the new MR so there is no difference.
On Sun, Jan 13, 2019 at 10:26:03AM -0700, Parav Pandit wrote: > > > > From: Moni Shoua <monis@mellanox.com> > > Sent: Sunday, January 13, 2019 11:19 AM > > To: Jason Gunthorpe <jgg@mellanox.com> > > Cc: Parav Pandit <parav@mellanox.com>; Leon Romanovsky > > <leon@kernel.org>; linux-rdma@vger.kernel.org > > Subject: Re: [PATCH for-rc 2/2] IB/mlx5: Fix how advise_mr() launches async > > work > > > > On Sat, Jan 12, 2019 at 8:42 PM Jason Gunthorpe <jgg@mellanox.com> > > wrote: > > > > > > On Sat, Jan 12, 2019 at 11:06:51AM -0700, Parav Pandit wrote: > > > > > > > > > > > > > From: Jason Gunthorpe > > > > > Sent: Saturday, January 12, 2019 11:39 AM > > > > > To: Parav Pandit <parav@mellanox.com> > > > > > Cc: Leon Romanovsky <leon@kernel.org>; linux-rdma@vger.kernel.org; > > > > > Moni Shoua <monis@mellanox.com> > > > > > Subject: Re: [PATCH for-rc 2/2] IB/mlx5: Fix how advise_mr() > > > > > launches async work > > > > > > > > > > On Sat, Jan 12, 2019 at 10:32:54AM -0700, Parav Pandit wrote: > > > > > > > > > > > > > > > > > > > From: linux-rdma-owner@vger.kernel.org <linux-rdma- > > > > > > > owner@vger.kernel.org> On Behalf Of Jason Gunthorpe > > > > > > > Sent: Saturday, January 12, 2019 11:06 AM > > > > > > > To: Leon Romanovsky <leon@kernel.org> > > > > > > > Cc: linux-rdma@vger.kernel.org; Moni Shoua > > > > > > > <monis@mellanox.com> > > > > > > > Subject: Re: [PATCH for-rc 2/2] IB/mlx5: Fix how advise_mr() > > > > > > > launches async work > > > > > > > > > > > > > > On Sat, Jan 12, 2019 at 05:06:02PM +0200, Leon Romanovsky wrote: > > > > > > > > On Fri, Jan 11, 2019 at 07:31:24PM -0700, Jason Gunthorpe wrote: > > > > > > > > > From: Jason Gunthorpe <jgg@mellanox.com> > > > > > > > > > > > > > > > > > > Work must hold a kref on the ib_device otherwise the dev > > > > > > > > > pointer can become free before the work runs. > > > > > > > > > > > > > > > > How can it occur exactly? > > > > > > > > > > > > > > > > "dev" will be released after mlx5_ib will exist, mlx5_ib > > > > > > > > exist will drain advise_mr queue before it. Such drain will > > > > > > > > ensure that no works are running. > > > > > > > > > > > > > > Okay, that is fine for the get_device, it can be removed. > > > > > > > > > > > > > > Are we OK with the reg_state hunk? > > > > > > > > > > > > > > Jason > > > > > > > > > > > > Unrelated to your question, I noticed that advise_mr does > > > > > > uverbs_attr_ptr_get_array_size() for the SGEs. I think this is > > > > > > incorrect. It should have been mr handles whose ownership can > > > > > > be check against the PID or ucontext that registered those MRs. > > > > > > > > > > The SGE's have lkeys in them, and the lkeys are local to the PD > > > > > passed in as argument. > > > > > > > > > > So the caller cannot touch resources outside the PD, hopefully the > > > > > driver implementation checks this.. > > > > > > > > > pagefault_single_data_segment () invokes __mlx5_mr_lookup() to > > > > lookup a MR per device and not per PD. mlx5_ib_advise_mr_prefetch() > > > > doesn't touch pd. work item doesn't have PD in it. > > > > > > Ugh, Moni? > > > > > > I don't think we should alow processes to page fault outside their PD, > > > that sounds like an avenue for a side-channel security problem. > > > > > > > I can pass pd from prefetch_mr down to pagefault_single_data_segment() > > and force that that lookup will be also for pd. > > However there is another flow that goes through > > pagefault_single_data_segment() that starts from hardware page-fault. > > What will be the pd in this case? Perhaps the pd of the source/destination > > QP of the packet that generated the pagefault? But I can do this only if the > > type of page-fault is WQE. In this case the hardware provides the QPN. For > > page-fault of type RDMA the hardware doesn't provide the QPN, only the > > lkey, address and size (dev is the one that fired the event). > > So if there is a potential security risk in the advise_mr flow doesn't it exist in > > the page-fault flow? A malicious packet can do the same thing a malicious > > process does. > > What do you think? > > When hw triggered the page fault it has passed the checks of PD for > a QP and MR from the QP and MR context including honoring > read/write/atomic access rights. Once those checks pass, it should > ask for page fault. If MR gets destroyed in this process before > page fault, host driver will be terminating the MR and all pending > transactions. I agree, the HW needs to validate all on-the-wire rkeys against the Rx'ing QP's PD before passing them into the driver. So if the HW is right then the HW driver flow should ignore the PD as it is already checked. Jason
On Sun, Jan 13, 2019 at 10:46:08AM +0200, Leon Romanovsky wrote: > > > Who enqueues work to dev->advise_mr_wq? schedule_work() in > > > mlx5_ib_advise_mr_prefetch() does to global workqueue. > > > > Ah! I *thought* I checked this, yes, using the system work queue is > > why I added the put_device :) > > > > > It should be done to advise_mr_wq(). This will give chance to flush > > > the wq when IB device is unregistered by the core. > > > > Good question - Moni?? > > It is definitely missed in my review, but if the IB/core doesn't flush > general system queue on driver removal, we are better to fix this. > > First addition of new works should be stopped and second system > queue should be flushed. My general guideline is to avoid creating > extra workqueues and I saw number of drivers are using system > queues. It is up to each driver to flush all the work queues it uses after ib_unregister_driver completes and before it allowed module unload to proceed. Jason
On Sun, Jan 13, 2019 at 10:26:27AM +0200, Leon Romanovsky wrote: > On Sat, Jan 12, 2019 at 05:05:49PM +0000, Jason Gunthorpe wrote: > > On Sat, Jan 12, 2019 at 05:06:02PM +0200, Leon Romanovsky wrote: > > > On Fri, Jan 11, 2019 at 07:31:24PM -0700, Jason Gunthorpe wrote: > > > > From: Jason Gunthorpe <jgg@mellanox.com> > > > > > > > > Work must hold a kref on the ib_device otherwise the dev pointer can > > > > become free before the work runs. > > > > > > How can it occur exactly? > > > > > > "dev" will be released after mlx5_ib will exist, mlx5_ib exist will > > > drain advise_mr queue before it. Such drain will ensure that no works > > > are running. > > > > Okay, that is fine for the get_device, it can be removed. > > > > Are we OK with the reg_state hunk? > > I'm not sure about that > Maybe current code is not nice as you suggested but IB_DEV_REGISTERED > check also valid approach. It is not. Checking the state without holding the appropriate locks does nothing useful. We need to delete the test or properly hold the registration lock. > More general, I don't like how IB/core handles states and IMHO we should > push our stack to be similar to block layer elevator. Block layer implements > clear stages and calls to callbacks provided by drivers. It removes the need > to manage and check any state in those drivers. I'm not super familiar with the block layer, but I don't think more callbacks will be in anyway more clear here. We have a very complicated lifetime model already.. Jason
On Mon, Jan 14, 2019 at 05:36:52PM +0000, Jason Gunthorpe wrote: > On Sun, Jan 13, 2019 at 10:26:27AM +0200, Leon Romanovsky wrote: > > On Sat, Jan 12, 2019 at 05:05:49PM +0000, Jason Gunthorpe wrote: > > > On Sat, Jan 12, 2019 at 05:06:02PM +0200, Leon Romanovsky wrote: > > > > On Fri, Jan 11, 2019 at 07:31:24PM -0700, Jason Gunthorpe wrote: > > > > > From: Jason Gunthorpe <jgg@mellanox.com> > > > > > > > > > > Work must hold a kref on the ib_device otherwise the dev pointer can > > > > > become free before the work runs. > > > > > > > > How can it occur exactly? > > > > > > > > "dev" will be released after mlx5_ib will exist, mlx5_ib exist will > > > > drain advise_mr queue before it. Such drain will ensure that no works > > > > are running. > > > > > > Okay, that is fine for the get_device, it can be removed. > > > > > > Are we OK with the reg_state hunk? > > > > I'm not sure about that > > Maybe current code is not nice as you suggested but IB_DEV_REGISTERED > > check also valid approach. > > It is not. Checking the state without holding the appropriate locks > does nothing useful. > > We need to delete the test or properly hold the registration lock. > > > More general, I don't like how IB/core handles states and IMHO we should > > push our stack to be similar to block layer elevator. Block layer implements > > clear stages and calls to callbacks provided by drivers. It removes the need > > to manage and check any state in those drivers. > > I'm not super familiar with the block layer, but I don't think more > callbacks will be in anyway more clear here. We have a very > complicated lifetime model already.. The idea is to provide very clear device state machine which will ensure that once driver was called, IB/core won't change state without need to annotate driver code with various get/put. Thanks > > Jason
On Mon, Jan 14, 2019 at 08:26:20PM +0200, Leon Romanovsky wrote: > On Mon, Jan 14, 2019 at 05:36:52PM +0000, Jason Gunthorpe wrote: > > On Sun, Jan 13, 2019 at 10:26:27AM +0200, Leon Romanovsky wrote: > > > On Sat, Jan 12, 2019 at 05:05:49PM +0000, Jason Gunthorpe wrote: > > > > On Sat, Jan 12, 2019 at 05:06:02PM +0200, Leon Romanovsky wrote: > > > > > On Fri, Jan 11, 2019 at 07:31:24PM -0700, Jason Gunthorpe wrote: > > > > > > From: Jason Gunthorpe <jgg@mellanox.com> > > > > > > > > > > > > Work must hold a kref on the ib_device otherwise the dev pointer can > > > > > > become free before the work runs. > > > > > > > > > > How can it occur exactly? > > > > > > > > > > "dev" will be released after mlx5_ib will exist, mlx5_ib exist will > > > > > drain advise_mr queue before it. Such drain will ensure that no works > > > > > are running. > > > > > > > > Okay, that is fine for the get_device, it can be removed. > > > > > > > > Are we OK with the reg_state hunk? > > > > > > I'm not sure about that > > > Maybe current code is not nice as you suggested but IB_DEV_REGISTERED > > > check also valid approach. > > > > It is not. Checking the state without holding the appropriate locks > > does nothing useful. > > > > We need to delete the test or properly hold the registration lock. > > > > > More general, I don't like how IB/core handles states and IMHO we should > > > push our stack to be similar to block layer elevator. Block layer implements > > > clear stages and calls to callbacks provided by drivers. It removes the need > > > to manage and check any state in those drivers. > > > > I'm not super familiar with the block layer, but I don't think more > > callbacks will be in anyway more clear here. We have a very > > complicated lifetime model already.. > > The idea is to provide very clear device state machine which will ensure > that once driver was called, IB/core won't change state without need to > annotate driver code with various get/put. We do that. Today the driver is in control of the lifetime of both the memory and the 'registration', as it is the one that calls ib_device_unregister and the badly named ib_dealloc The registration lock is a tool the driver can use to prevent progress through unregistration in cases where it cares about that. This is much better than having every driver try and build their own lock, as I saw in bxnt and rxe for instance. Jason
On Sat, Jan 12, 2019 at 4:32 AM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > From: Jason Gunthorpe <jgg@mellanox.com> > > Work must hold a kref on the ib_device otherwise the dev pointer can > become free before the work runs. > > Remove the bogus use of 'reg_state': > - While in uverbs the reg_state is guaranteed to always be > REGISTERED > - Testing reg_state with no locking is bogus. Use ib_device_try_get() > to get back into a region that prevents unregistration. > > Fixes: 813e90b1aeaa ("IB/mlx5: Add advise_mr() support") > Signed-off-by: Jason Gunthorpe <jgg@mellanox.com> > --- > drivers/infiniband/hw/mlx5/odp.c | 10 +++++----- > 1 file changed, 5 insertions(+), 5 deletions(-) > > diff --git a/drivers/infiniband/hw/mlx5/odp.c b/drivers/infiniband/hw/mlx5/odp.c > index 8d46b1dc56580f..a241a65834d4ca 100644 > --- a/drivers/infiniband/hw/mlx5/odp.c > +++ b/drivers/infiniband/hw/mlx5/odp.c > @@ -1595,10 +1595,12 @@ static void mlx5_ib_prefetch_mr_work(struct work_struct *work) > struct prefetch_mr_work *w = > container_of(work, struct prefetch_mr_work, work); > > - if (w->dev->ib_dev.reg_state == IB_DEV_REGISTERED) > + if (ib_device_try_get(&w->dev->ib_dev)) { > mlx5_ib_prefetch_sg_list(w->dev, w->pf_flags, w->sg_list, > w->num_sge); > - > + ib_device_put(&w->dev->ib_dev); > + } > + put_device(&w->dev->ib_dev.dev); > kfree(w); > } > > @@ -1617,15 +1619,13 @@ int mlx5_ib_advise_mr_prefetch(struct ib_pd *pd, > return mlx5_ib_prefetch_sg_list(dev, pf_flags, sg_list, > num_sge); > > - if (dev->ib_dev.reg_state != IB_DEV_REGISTERED) > - return -ENODEV; > - > work = kvzalloc(struct_size(work, sg_list, num_sge), GFP_KERNEL); > if (!work) > return -ENOMEM; > > memcpy(work->sg_list, sg_list, num_sge * sizeof(struct ib_sge)); > > + get_device(&dev->ib_dev.dev); > work->dev = dev; > work->pf_flags = pf_flags; > work->num_sge = num_sge; > -- > 2.20.1 > Looks good Reviewed-by: Moni Shoua <monis@mellanox.com>
diff --git a/drivers/infiniband/hw/mlx5/odp.c b/drivers/infiniband/hw/mlx5/odp.c index 8d46b1dc56580f..a241a65834d4ca 100644 --- a/drivers/infiniband/hw/mlx5/odp.c +++ b/drivers/infiniband/hw/mlx5/odp.c @@ -1595,10 +1595,12 @@ static void mlx5_ib_prefetch_mr_work(struct work_struct *work) struct prefetch_mr_work *w = container_of(work, struct prefetch_mr_work, work); - if (w->dev->ib_dev.reg_state == IB_DEV_REGISTERED) + if (ib_device_try_get(&w->dev->ib_dev)) { mlx5_ib_prefetch_sg_list(w->dev, w->pf_flags, w->sg_list, w->num_sge); - + ib_device_put(&w->dev->ib_dev); + } + put_device(&w->dev->ib_dev.dev); kfree(w); } @@ -1617,15 +1619,13 @@ int mlx5_ib_advise_mr_prefetch(struct ib_pd *pd, return mlx5_ib_prefetch_sg_list(dev, pf_flags, sg_list, num_sge); - if (dev->ib_dev.reg_state != IB_DEV_REGISTERED) - return -ENODEV; - work = kvzalloc(struct_size(work, sg_list, num_sge), GFP_KERNEL); if (!work) return -ENOMEM; memcpy(work->sg_list, sg_list, num_sge * sizeof(struct ib_sge)); + get_device(&dev->ib_dev.dev); work->dev = dev; work->pf_flags = pf_flags; work->num_sge = num_sge;