Message ID | 20191015181242.8343-1-jgg@ziepe.ca (mailing list archive)
---|---
Series | Consolidate the mmu notifier interval_tree and locking
Am 15.10.19 um 20:12 schrieb Jason Gunthorpe:
> From: Jason Gunthorpe <jgg@mellanox.com>
>
> 8 of the mmu_notifier using drivers (i915_gem, radeon_mn, umem_odp, hfi1,
> scif_dma, vhost, gntdev, hmm) are using a common pattern where
> they only use invalidate_range_start/end and immediately check the
> invalidating range against some driver data structure to tell if the
> driver is interested. Half of them use an interval_tree, the others are
> simple linear search lists.
>
> Of the ones I checked they largely seem to have various kinds of races,
> bugs and poor implementation. This is a result of the complexity in how
> the notifier interacts with get_user_pages(). It is extremely difficult to
> use it correctly.
>
> Consolidate all of this code together into the core mmu_notifier and
> provide a locking scheme similar to hmm_mirror that allows the user to
> safely use get_user_pages() and reliably know if the page list still
> matches the mm.

That sounds really good, but could you outline for a moment how that is
achieved?

Please keep in mind that the page reference get_user_pages() grabs is
*NOT* sufficient to guarantee coherency here.

Regards,
Christian.

> This new arrangement plays nicely with the !blockable mode for
> OOM. Scanning the interval tree is done such that the intersection test
> will always succeed, and since there is no invalidate_range_end exposed to
> drivers the scheme safely allows multiple drivers to be subscribed.
>
> Four places are converted as an example of how the new API is used.
> Four are left for future patches:
>  - i915_gem has complex locking around destruction of a registration,
>    needs more study
>  - hfi1 (2nd user) needs access to the rbtree
>  - scif_dma has a complicated logic flow
>  - vhost's mmu notifiers are already being rewritten
>
> This is still being tested, but I figured to send it to start getting help
> from the xen, amd and hfi drivers which I cannot test here.
>
> It would be intended for the hmm tree.
> > Jason Gunthorpe (15): > mm/mmu_notifier: define the header pre-processor parts even if > disabled > mm/mmu_notifier: add an interval tree notifier > mm/hmm: allow hmm_range to be used with a mmu_range_notifier or > hmm_mirror > mm/hmm: define the pre-processor related parts of hmm.h even if > disabled > RDMA/odp: Use mmu_range_notifier_insert() > RDMA/hfi1: Use mmu_range_notifier_inset for user_exp_rcv > drm/radeon: use mmu_range_notifier_insert > xen/gntdev: Use select for DMA_SHARED_BUFFER > xen/gntdev: use mmu_range_notifier_insert > nouveau: use mmu_notifier directly for invalidate_range_start > nouveau: use mmu_range_notifier instead of hmm_mirror > drm/amdgpu: Call find_vma under mmap_sem > drm/amdgpu: Use mmu_range_insert instead of hmm_mirror > drm/amdgpu: Use mmu_range_notifier instead of hmm_mirror > mm/hmm: remove hmm_mirror and related > > Documentation/vm/hmm.rst | 105 +--- > drivers/gpu/drm/amd/amdgpu/amdgpu.h | 2 + > .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 9 +- > drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 14 +- > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 1 + > drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c | 445 ++------------ > drivers/gpu/drm/amd/amdgpu/amdgpu_mn.h | 53 -- > drivers/gpu/drm/amd/amdgpu/amdgpu_object.h | 13 +- > drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 111 ++-- > drivers/gpu/drm/nouveau/nouveau_svm.c | 229 +++++--- > drivers/gpu/drm/radeon/radeon.h | 9 +- > drivers/gpu/drm/radeon/radeon_mn.c | 218 ++----- > drivers/infiniband/core/device.c | 1 - > drivers/infiniband/core/umem_odp.c | 288 +--------- > drivers/infiniband/hw/hfi1/file_ops.c | 2 +- > drivers/infiniband/hw/hfi1/hfi.h | 2 +- > drivers/infiniband/hw/hfi1/user_exp_rcv.c | 144 ++--- > drivers/infiniband/hw/hfi1/user_exp_rcv.h | 3 +- > drivers/infiniband/hw/mlx5/mlx5_ib.h | 7 +- > drivers/infiniband/hw/mlx5/mr.c | 3 +- > drivers/infiniband/hw/mlx5/odp.c | 48 +- > drivers/xen/Kconfig | 3 +- > drivers/xen/gntdev-common.h | 8 +- > drivers/xen/gntdev.c | 179 ++---- > include/linux/hmm.h | 195 +------ > include/linux/mmu_notifier.h | 124 +++- > include/rdma/ib_umem_odp.h | 65 +-- > include/rdma/ib_verbs.h | 2 - > kernel/fork.c | 1 - > mm/Kconfig | 2 +- > mm/hmm.c | 275 +-------- > mm/mmu_notifier.c | 542 +++++++++++++++++- > 32 files changed, 1180 insertions(+), 1923 deletions(-) >
On Wed, Oct 16, 2019 at 10:58:02AM +0200, Christian König wrote:
> Am 15.10.19 um 20:12 schrieb Jason Gunthorpe:
> > From: Jason Gunthorpe <jgg@mellanox.com>
> >
> > 8 of the mmu_notifier using drivers (i915_gem, radeon_mn, umem_odp, hfi1,
> > scif_dma, vhost, gntdev, hmm) are using a common pattern where
> > they only use invalidate_range_start/end and immediately check the
> > invalidating range against some driver data structure to tell if the
> > driver is interested. Half of them use an interval_tree, the others are
> > simple linear search lists.
> >
> > Of the ones I checked they largely seem to have various kinds of races,
> > bugs and poor implementation. This is a result of the complexity in how
> > the notifier interacts with get_user_pages(). It is extremely difficult to
> > use it correctly.
> >
> > Consolidate all of this code together into the core mmu_notifier and
> > provide a locking scheme similar to hmm_mirror that allows the user to
> > safely use get_user_pages() and reliably know if the page list still
> > matches the mm.
>
> That sounds really good, but could you outline for a moment how that is
> achieved?

It uses the same basic scheme as hmm and rdma odp, outlined in the
revisions to hmm.rst later on.

Basically,

    seq = mmu_range_read_begin(&mrn);

    // This is a speculative region
    .. get_user_pages()/hmm_range_fault() ..
    // Result cannot be dereferenced

    take_lock(driver->update);
    if (mmu_range_read_retry(&mrn, range.notifier_seq)) {
        // collision! The results are not correct
        goto again;
    }

    // no collision, and now under lock. Now we can de-reference the pages/etc
    // program HW
    // Now the invalidate callback is responsible to synchronize against changes
    unlock(driver->update);

Basically, anything that was using hmm_mirror correctly transitions
over fairly trivially, just with the modification to store a sequence
number to close that race described in the hmm commit.

For something like AMD gpu I expect it to transition to use dma_fence
from the notifier for coherency right before it unlocks driver->update.

Jason
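To make the shape of that loop concrete for readers skimming the thread, here
is a rough, self-contained user-space analogue of the collision-retry pattern
Jason sketches above. The names (fault_pages, program_hw, driver_lock, etc.)
are illustrative stand-ins rather than the series' actual API, and an atomic
counter plus a mutex stand in for the notifier sequence number and the driver
lock.

    #include <pthread.h>
    #include <stdatomic.h>

    /* The invalidate side (not shown) bumps invalidate_seq and takes
     * driver_lock before it changes any ptes. */
    static _Atomic unsigned long invalidate_seq;
    static pthread_mutex_t driver_lock = PTHREAD_MUTEX_INITIALIZER;

    static void fault_pages(void) { /* stand-in for get_user_pages()/hmm_range_fault() */ }
    static void program_hw(void)  { /* stand-in for programming the hardware */ }

    static void map_range(void)
    {
        unsigned long seq;
    again:
        seq = atomic_load(&invalidate_seq);        /* ~ mmu_range_read_begin() */

        fault_pages();                             /* speculative: result may already be stale */

        pthread_mutex_lock(&driver_lock);
        if (atomic_load(&invalidate_seq) != seq) { /* ~ mmu_range_read_retry() */
            pthread_mutex_unlock(&driver_lock);
            goto again;                            /* collision: discard and restart */
        }
        program_hw();  /* safe: an invalidation cannot progress while we hold the lock */
        pthread_mutex_unlock(&driver_lock);
    }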
Am 16.10.19 um 18:04 schrieb Jason Gunthorpe:
> On Wed, Oct 16, 2019 at 10:58:02AM +0200, Christian König wrote:
>> Am 15.10.19 um 20:12 schrieb Jason Gunthorpe:
>>> From: Jason Gunthorpe <jgg@mellanox.com>
>>> [SNIP]
>>> Consolidate all of this code together into the core mmu_notifier and
>>> provide a locking scheme similar to hmm_mirror that allows the user to
>>> safely use get_user_pages() and reliably know if the page list still
>>> matches the mm.
>> That sounds really good, but could you outline for a moment how that is
>> achieved?
> It uses the same basic scheme as hmm and rdma odp, outlined in the
> revisions to hmm.rst later on.
>
> Basically,
>
>     seq = mmu_range_read_begin(&mrn);
>
>     // This is a speculative region
>     .. get_user_pages()/hmm_range_fault() ..

How do we enforce that this get_user_pages()/hmm_range_fault() doesn't
see outdated page table information?

In other words, how is the following race prevented:

    CPU A                                  CPU B
    invalidate_range_start()
                                           mmu_range_read_begin()
                                           get_user_pages()/hmm_range_fault()
    Updating the ptes
    invalidate_range_end()

I mean get_user_pages() tries to circumvent this issue by grabbing a
reference to the pages in question, but that isn't sufficient for the
SVM use case.

That's the reason why we had this horrible solution with a r/w lock and
a linked list of BOs in an interval tree.

Regards,
Christian.

>     // Result cannot be dereferenced
>
>     take_lock(driver->update);
>     if (mmu_range_read_retry(&mrn, range.notifier_seq)) {
>         // collision! The results are not correct
>         goto again;
>     }
>
>     // no collision, and now under lock. Now we can de-reference the pages/etc
>     // program HW
>     // Now the invalidate callback is responsible to synchronize against changes
>     unlock(driver->update);
>
> Basically, anything that was using hmm_mirror correctly transitions
> over fairly trivially, just with the modification to store a sequence
> number to close that race described in the hmm commit.
>
> For something like AMD gpu I expect it to transition to use dma_fence
> from the notifier for coherency right before it unlocks driver->update.
>
> Jason
On 2019-10-17 4:54 a.m., Christian König wrote: > Am 16.10.19 um 18:04 schrieb Jason Gunthorpe: >> On Wed, Oct 16, 2019 at 10:58:02AM +0200, Christian König wrote: >>> Am 15.10.19 um 20:12 schrieb Jason Gunthorpe: >>>> From: Jason Gunthorpe <jgg@mellanox.com> >>>> >>>> 8 of the mmu_notifier using drivers (i915_gem, radeon_mn, umem_odp, >>>> hfi1, >>>> scif_dma, vhost, gntdev, hmm) drivers are using a common pattern where >>>> they only use invalidate_range_start/end and immediately check the >>>> invalidating range against some driver data structure to tell if the >>>> driver is interested. Half of them use an interval_tree, the others are >>>> simple linear search lists. >>>> >>>> Of the ones I checked they largely seem to have various kinds of races, >>>> bugs and poor implementation. This is a result of the complexity in how >>>> the notifier interacts with get_user_pages(). It is extremely >>>> difficult to >>>> use it correctly. >>>> >>>> Consolidate all of this code together into the core mmu_notifier and >>>> provide a locking scheme similar to hmm_mirror that allows the user to >>>> safely use get_user_pages() and reliably know if the page list still >>>> matches the mm. >>> That sounds really good, but could you outline for a moment how that is >>> archived? >> It uses the same basic scheme as hmm and rdma odp, outlined in the >> revisions to hmm.rst later on. >> >> Basically, >> >> seq = mmu_range_read_begin(&mrn); >> >> // This is a speculative region >> .. get_user_pages()/hmm_range_fault() .. > > How do we enforce that this get_user_pages()/hmm_range_fault() doesn't > see outdated page table information? > > In other words how the the following race prevented: > > CPU A CPU B > invalidate_range_start() > mmu_range_read_begin() > get_user_pages()/hmm_range_fault() > Updating the ptes > invalidate_range_end() > > > I mean get_user_pages() tries to circumvent this issue by grabbing a > reference to the pages in question, but that isn't sufficient for the > SVM use case. > > That's the reason why we had this horrible solution with a r/w lock and > a linked list of BOs in an interval tree. > > Regards, > Christian. get_user_pages/hmm_range_fault() and invalidate_range_start() both are called while holding mm->map_sem, so they are always serialized. Philip > >> // Result cannot be derferenced >> >> take_lock(driver->update); >> if (mmu_range_read_retry(&mrn, range.notifier_seq) { >> // collision! The results are not correct >> goto again >> } >> >> // no collision, and now under lock. Now we can de-reference the >> pages/etc >> // program HW >> // Now the invalidate callback is responsible to synchronize against >> changes >> unlock(driver->update) >> >> Basically, anything that was using hmm_mirror correctly transisions >> over fairly trivially, just with the modification to store a sequence >> number to close that race described in the hmm commit. >> >> For something like AMD gpu I expect it to transition to use dma_fence >> from the notifier for coherency right before it unlocks driver->update. >> >> Jason >> _______________________________________________ >> amd-gfx mailing list >> amd-gfx@lists.freedesktop.org >> https://lists.freedesktop.org/mailman/listinfo/amd-gfx > > _______________________________________________ > amd-gfx mailing list > amd-gfx@lists.freedesktop.org > https://lists.freedesktop.org/mailman/listinfo/amd-gfx
Sending once more as text. Am 17.10.19 um 18:26 schrieb Yang, Philip: > On 2019-10-17 4:54 a.m., Christian König wrote: >> Am 16.10.19 um 18:04 schrieb Jason Gunthorpe: >>> On Wed, Oct 16, 2019 at 10:58:02AM +0200, Christian König wrote: >>>> Am 15.10.19 um 20:12 schrieb Jason Gunthorpe: >>>>> From: Jason Gunthorpe <jgg@mellanox.com> >>>>> >>>>> 8 of the mmu_notifier using drivers (i915_gem, radeon_mn, umem_odp, >>>>> hfi1, >>>>> scif_dma, vhost, gntdev, hmm) drivers are using a common pattern where >>>>> they only use invalidate_range_start/end and immediately check the >>>>> invalidating range against some driver data structure to tell if the >>>>> driver is interested. Half of them use an interval_tree, the others are >>>>> simple linear search lists. >>>>> >>>>> Of the ones I checked they largely seem to have various kinds of races, >>>>> bugs and poor implementation. This is a result of the complexity in how >>>>> the notifier interacts with get_user_pages(). It is extremely >>>>> difficult to >>>>> use it correctly. >>>>> >>>>> Consolidate all of this code together into the core mmu_notifier and >>>>> provide a locking scheme similar to hmm_mirror that allows the user to >>>>> safely use get_user_pages() and reliably know if the page list still >>>>> matches the mm. >>>> That sounds really good, but could you outline for a moment how that is >>>> archived? >>> It uses the same basic scheme as hmm and rdma odp, outlined in the >>> revisions to hmm.rst later on. >>> >>> Basically, >>> >>> seq = mmu_range_read_begin(&mrn); >>> >>> // This is a speculative region >>> .. get_user_pages()/hmm_range_fault() .. >> How do we enforce that this get_user_pages()/hmm_range_fault() doesn't >> see outdated page table information? >> >> In other words how the the following race prevented: >> >> CPU A CPU B >> invalidate_range_start() >> mmu_range_read_begin() >> get_user_pages()/hmm_range_fault() >> Updating the ptes >> invalidate_range_end() >> >> >> I mean get_user_pages() tries to circumvent this issue by grabbing a >> reference to the pages in question, but that isn't sufficient for the >> SVM use case. >> >> That's the reason why we had this horrible solution with a r/w lock and >> a linked list of BOs in an interval tree. >> >> Regards, >> Christian. > get_user_pages/hmm_range_fault() and invalidate_range_start() both are > called while holding mm->map_sem, so they are always serialized. Not even remotely. For calling get_user_pages()/hmm_range_fault() you only need to hold the mmap_sem in read mode. And IIRC invalidate_range_start() is sometimes called without holding the mmap_sem at all. So again how are they serialized? Regards, Christian. > > Philip >>> // Result cannot be derferenced >>> >>> take_lock(driver->update); >>> if (mmu_range_read_retry(&mrn, range.notifier_seq) { >>> // collision! The results are not correct >>> goto again >>> } >>> >>> // no collision, and now under lock. Now we can de-reference the >>> pages/etc >>> // program HW >>> // Now the invalidate callback is responsible to synchronize against >>> changes >>> unlock(driver->update) >>> >>> Basically, anything that was using hmm_mirror correctly transisions >>> over fairly trivially, just with the modification to store a sequence >>> number to close that race described in the hmm commit. >>> >>> For something like AMD gpu I expect it to transition to use dma_fence >>> from the notifier for coherency right before it unlocks driver->update. 
>>> >>> Jason >>> _______________________________________________ >>> amd-gfx mailing list >>> amd-gfx@lists.freedesktop.org >>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx >> _______________________________________________ >> amd-gfx mailing list >> amd-gfx@lists.freedesktop.org >> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
On Thu, Oct 17, 2019 at 04:47:20PM +0000, Koenig, Christian wrote:
> > get_user_pages/hmm_range_fault() and invalidate_range_start() both are
> > called while holding mm->map_sem, so they are always serialized.
>
> Not even remotely.
>
> For calling get_user_pages()/hmm_range_fault() you only need to hold the
> mmap_sem in read mode.

Right

> And IIRC invalidate_range_start() is sometimes called without holding
> the mmap_sem at all.

Yep

> So again how are they serialized?

The 'driver lock' thing does it, read the hmm documentation, the hmm
approach is basically the only approach that was correct of all the
drivers..

So long as the 'driver lock' is held the range cannot become
invalidated as the 'driver lock' prevents progress of invalidation.

Holding the driver lock and using the seq based mmu_range_read_retry()
tells if the previous unlocked get_user_pages() is still valid or needs
to be discarded.

So it doesn't matter if get_user_pages() races or not, the result is not
to be used until the driver lock is held and mmu_range_read_retry()
called, which provides the coherence.

It is the usual seqlock pattern.

Jason
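For comparison, the "usual seqlock pattern" Jason refers to is the kernel's
ordinary seqcount read-retry loop from <linux/seqlock.h>, shown here only to
highlight the analogy. The twist in this series is that the retry check is
deferred until the driver lock is held, and the "write side" spans the whole
invalidate_range_start/end window rather than a short critical section.

    seqcount_t sc;      /* seqcount_init(&sc) somewhere during setup */

    /* writer (writers serialize against each other with their own lock) */
    write_seqcount_begin(&sc);
    /* ... modify the protected data ... */
    write_seqcount_end(&sc);

    /* reader: loop until an unmodified snapshot was observed */
    unsigned int seq;
    do {
        seq = read_seqcount_begin(&sc);
        /* ... read the protected data ... */
    } while (read_seqcount_retry(&sc, seq));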
Am 18.10.19 um 22:36 schrieb Jason Gunthorpe:
> On Thu, Oct 17, 2019 at 04:47:20PM +0000, Koenig, Christian wrote:
>
>>> get_user_pages/hmm_range_fault() and invalidate_range_start() both are
>>> called while holding mm->map_sem, so they are always serialized.
>> Not even remotely.
>>
>> For calling get_user_pages()/hmm_range_fault() you only need to hold the
>> mmap_sem in read mode.
> Right
>
>> And IIRC invalidate_range_start() is sometimes called without holding
>> the mmap_sem at all.
> Yep
>
>> So again how are they serialized?
> The 'driver lock' thing does it, read the hmm documentation, the hmm
> approach is basically the only approach that was correct of all the
> drivers..

Well that's what I did, but what HMM does still doesn't look correct
to me.

> So long as the 'driver lock' is held the range cannot become
> invalidated as the 'driver lock' prevents progress of invalidation.

Correct, but the problem is it doesn't wait for ongoing operations to
complete.

See I'm talking about the following case:

    Thread A                               Thread B
    invalidate_range_start()
                                           mmu_range_read_begin()
                                           get_user_pages()/hmm_range_fault()
                                           grab_driver_lock()
    Updating the ptes
    invalidate_range_end()

As far as I can see, in invalidate_range_start() the driver lock is taken
to make sure that we can't start any invalidation while the driver is
using the pages for a command submission.

But the pages we got from get_user_pages()/hmm_range_fault() might not be
up to date because the driver lock is also dropped again in
invalidate_range_start() and not in invalidate_range_end().

> Holding the driver lock and using the seq based mmu_range_read_retry()
> tells if the previous unlocked get_user_pages() is still valid or needs
> to be discarded.
>
> So it doesn't matter if get_user_pages() races or not, the result is not
> to be used until the driver lock is held and mmu_range_read_retry()
> called, which provides the coherence.
>
> It is the usual seqlock pattern.

Well we don't update the seqlock after the update to the protected data
structure (the page table) happened, but rather before that. That doesn't
look like the normal pattern for a seqlock to me and, as far as I can see,
that is quite a bug in the HMM design/logic.

Regards,
Christian.

> Jason
On Sun, Oct 20, 2019 at 02:21:42PM +0000, Koenig, Christian wrote:
> Am 18.10.19 um 22:36 schrieb Jason Gunthorpe:
> > On Thu, Oct 17, 2019 at 04:47:20PM +0000, Koenig, Christian wrote:
> > [SNIP]
> >> So again how are they serialized?
> > The 'driver lock' thing does it, read the hmm documentation, the hmm
> > approach is basically the only approach that was correct of all the
> > drivers..
>
> Well that's what I did, but what HMM does still doesn't look correct
> to me.

It has a bug, but the basic flow seems to work.

https://patchwork.kernel.org/patch/11191

> > So long as the 'driver lock' is held the range cannot become
> > invalidated as the 'driver lock' prevents progress of invalidation.
>
> Correct, but the problem is it doesn't wait for ongoing operations to
> complete.
>
> See I'm talking about the following case:
>
>     Thread A                               Thread B
>     invalidate_range_start()
>                                            mmu_range_read_begin()
>                                            get_user_pages()/hmm_range_fault()
>                                            grab_driver_lock()
>     Updating the ptes
>     invalidate_range_end()
>
> As far as I can see, in invalidate_range_start() the driver lock is taken
> to make sure that we can't start any invalidation while the driver is
> using the pages for a command submission.

Again, this uses the seqlock like scheme *and* the driver lock.

In this case after grab_driver_lock() mmu_range_read_retry() will
return false if Thread A has progressed to 'Updating the ptes'.

For instance here is how the concurrency resolves for retry:

    CPU1                                   CPU2
    seq = mmu_range_read_begin()
                                           invalidate_range_start()
                                             invalidate_seq++
    get_user_pages()/hmm_range_fault()
                                             grab_driver_lock()
                                             ungrab_driver_lock()
    grab_driver_lock()
    seq != invalidate_seq, so retry
                                           Updating the ptes
                                           invalidate_range_end()
                                             invalidate_seq++

And here is how it resolves for blocking:

    CPU1                                   CPU2
    seq = mmu_range_read_begin()
                                           invalidate_range_start()
    get_user_pages()/hmm_range_fault()
    grab_driver_lock()
    seq == invalidate_seq, so continue
                                             invalidate_seq++
    ungrab_driver_lock()
                                             grab_driver_lock()
                                             // Cannot progress to 'Updating the
                                             // ptes' while the driver_lock is held
                                             ungrab_driver_lock()
                                           Updating the ptes
                                           invalidate_range_end()
                                             invalidate_seq++

For the above I've simplified the mechanics of the invalidate_seq, you
need to look through the patch to see how it actually works.

> Well we don't update the seqlock after the update to the protected data
> structure (the page table) happened, but rather before that.

??? This is what mn_itree_inv_end() does, it is called by
invalidate_range_end

> That doesn't look like the normal pattern for a seqlock to me and, as far
> as I can see, that is quite a bug in the HMM design/logic.

Well, hmm has a bug because it doesn't use a seqlock pattern, see the
above URL.

One of the motivations for this work is to squash that bug by adding a
seqlock like pattern. But the basic hmm flow and collision-retry
approach seems sound.

Do you see a problem with this patch?

Jason
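A compressed user-space model of the sequence-number lifecycle behind those
two diagrams may help. A mutex and condition variable stand in for the
kernel's spinlock and wait queue, the per-range interval-tree lookup is
collapsed into one global counter, and all names are illustrative; the
begin() side additionally waits out a running invalidation, which is the
wait_event() Christian quotes in the next message.

    #include <pthread.h>
    #include <stdbool.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;   /* ~ mmn_mm->lock */
    static pthread_cond_t  wq   = PTHREAD_COND_INITIALIZER;    /* ~ the wait queue */
    static unsigned long invalidate_seq = 2;   /* even: no invalidation running */
    static unsigned int  active_invalidate_ranges;

    static void invalidate_start(void)
    {
        pthread_mutex_lock(&lock);
        active_invalidate_ranges++;
        invalidate_seq |= 1;               /* odd: 'currently invalidating' */
        pthread_mutex_unlock(&lock);
        /* ... caller then cycles the driver lock and updates the ptes ... */
    }

    static void invalidate_end(void)
    {
        pthread_mutex_lock(&lock);
        if (--active_invalidate_ranges == 0) {
            invalidate_seq++;              /* back to even: nothing in flight */
            pthread_cond_broadcast(&wq);
        }
        pthread_mutex_unlock(&lock);
    }

    /* Never hands out a seq that is currently under invalidation. */
    static unsigned long range_read_begin(void)
    {
        unsigned long seq;

        pthread_mutex_lock(&lock);
        while (invalidate_seq & 1)         /* wait out any running invalidation */
            pthread_cond_wait(&wq, &lock);
        seq = invalidate_seq;
        pthread_mutex_unlock(&lock);
        return seq;
    }

    /* Call with the driver lock held: did any invalidation happen since begin()? */
    static bool range_read_retry(unsigned long seq)
    {
        pthread_mutex_lock(&lock);
        bool retry = (invalidate_seq != seq);
        pthread_mutex_unlock(&lock);
        return retry;
    }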
Am 21.10.19 um 15:57 schrieb Jason Gunthorpe: > On Sun, Oct 20, 2019 at 02:21:42PM +0000, Koenig, Christian wrote: >> Am 18.10.19 um 22:36 schrieb Jason Gunthorpe: >>> On Thu, Oct 17, 2019 at 04:47:20PM +0000, Koenig, Christian wrote: >>> [SNIP] >>> >>>> So again how are they serialized? >>> The 'driver lock' thing does it, read the hmm documentation, the hmm >>> approach is basically the only approach that was correct of all the >>> drivers.. >> Well that's what I've did, but what HMM does still doesn't looks correct >> to me. > It has a bug, but the basic flow seems to work. > > https://patchwork.kernel.org/patch/11191 Maybe wrong link? That link looks like an unrelated discussion on kernel image relocation. >>> So long as the 'driver lock' is held the range cannot become >>> invalidated as the 'driver lock' prevents progress of invalidation. >> Correct, but the problem is it doesn't wait for ongoing operations to >> complete. >> >> See I'm talking about the following case: >> >> Thread A Thread B >> invalidate_range_start() >> mmu_range_read_begin() >> get_user_pages()/hmm_range_fault() >> grab_driver_lock() >> Updating the ptes >> invalidate_range_end() >> >> As far as I can see in invalidate_range_start() the driver lock is taken >> to make sure that we can't start any invalidation while the driver is >> using the pages for a command submission. > Again, this uses the seqlock like scheme *and* the driver lock. > > In this case after grab_driver_lock() mmu_range_read_retry() will > return false if Thread A has progressed to 'updating the ptes. > > For instance here is how the concurrency resolves for retry: > > CPU1 CPU2 > seq = mmu_range_read_begin() > invalidate_range_start() > invalidate_seq++ How that was order was what confusing me. But I've read up on the code in mmu_range_read_begin() and found the lines I was looking for: + if (is_invalidating) + wait_event(mmn_mm->wq, + READ_ONCE(mmn_mm->invalidate_seq) != seq); [SNIP] > For the above I've simplified the mechanics of the invalidate_seq, you > need to look through the patch to see how it actually works. Yea, that you also allow multiple write sides is pretty neat. >> Well we don't update the seqlock after the update to the protected data >> structure (the page table) happened, but rather before that. > ??? This is what mn_itree_inv_end() does, it is called by > invalidate_range_end > >> That doesn't looks like the normal patter for a seqlock to me and as far >> as I can see that is quite a bug in the HMM design/logic. > Well, hmm has a bug because it doesn't use a seqlock pattern, see the > above URL. > > One of the motivations for this work is to squash that bug by adding a > seqlock like pattern. But the basic hmm flow and collision-retry > approach seems sound. > > Do you see a problem with this patch? No, not any more. Essentially you are doing the same thing I've tried to do with the original amdgpu implementation. The difference is that you don't try to use a per range sequence (which is a good idea, we never got that fully working) and you allow multiple writers at the same time. Feel free to stitch an Acked-by: Christian König <christian.koenig@amd.com> on patch #2, but you still doing a bunch of things in there which are way beyond my understanding (e.g. where are all the SMP barriers?). Cheers, Christian. > > Jason
On Mon, Oct 21, 2019 at 02:28:46PM +0000, Koenig, Christian wrote: > Am 21.10.19 um 15:57 schrieb Jason Gunthorpe: > > On Sun, Oct 20, 2019 at 02:21:42PM +0000, Koenig, Christian wrote: > >> Am 18.10.19 um 22:36 schrieb Jason Gunthorpe: > >>> On Thu, Oct 17, 2019 at 04:47:20PM +0000, Koenig, Christian wrote: > >>> [SNIP] > >>> > >>>> So again how are they serialized? > >>> The 'driver lock' thing does it, read the hmm documentation, the hmm > >>> approach is basically the only approach that was correct of all the > >>> drivers.. > >> Well that's what I've did, but what HMM does still doesn't looks correct > >> to me. > > It has a bug, but the basic flow seems to work. > > > > https://patchwork.kernel.org/patch/11191 > > Maybe wrong link? That link looks like an unrelated discussion on kernel > image relocation. Sorry, it got corrupted: https://patchwork.kernel.org/patch/11191397/ > >>> So long as the 'driver lock' is held the range cannot become > >>> invalidated as the 'driver lock' prevents progress of invalidation. > >> Correct, but the problem is it doesn't wait for ongoing operations to > >> complete. > >> > >> See I'm talking about the following case: > >> > >> Thread A Thread B > >> invalidate_range_start() > >> mmu_range_read_begin() > >> get_user_pages()/hmm_range_fault() > >> grab_driver_lock() > >> Updating the ptes > >> invalidate_range_end() > >> > >> As far as I can see in invalidate_range_start() the driver lock is taken > >> to make sure that we can't start any invalidation while the driver is > >> using the pages for a command submission. > > Again, this uses the seqlock like scheme *and* the driver lock. > > > > In this case after grab_driver_lock() mmu_range_read_retry() will > > return false if Thread A has progressed to 'updating the ptes. > > > > For instance here is how the concurrency resolves for retry: > > > > CPU1 CPU2 > > seq = mmu_range_read_begin() > > invalidate_range_start() > > invalidate_seq++ > > How that was order was what confusing me. But I've read up on the code > in mmu_range_read_begin() and found the lines I was looking for: > > + if (is_invalidating) > + wait_event(mmn_mm->wq, > + READ_ONCE(mmn_mm->invalidate_seq) != seq); > > [SNIP] Right, the basic design is that the 'seq' returned by mmu_range_read_begin() is never currently under invalidation. Thus if the starting seq is not invalidating, then if it doesn't change we are guaranteed the ptes haven't changed either. > > For the above I've simplified the mechanics of the invalidate_seq, you > > need to look through the patch to see how it actually works. > > Yea, that you also allow multiple write sides is pretty neat. Complicated, but necessary to make the non-blocking OOM stuff able to read the interval tree under all conditions :\ > > One of the motivations for this work is to squash that bug by adding a > > seqlock like pattern. But the basic hmm flow and collision-retry > > approach seems sound. > > > > Do you see a problem with this patch? > > No, not any more. Okay, great, thanks > Essentially you are doing the same thing I've tried to do with the > original amdgpu implementation. The difference is that you don't try to > use a per range sequence (which is a good idea, we never got that fully > working) and you allow multiple writers at the same time. Yes, ODP had the per-range sequence and it never worked right either. Keeping track of things during the invalidate_end was too complex > Feel free to stitch an Acked-by: Christian König > <christian.koenig@amd.com> on patch #2, Thanks! 
Can you also take some review and test for the AMD related
patches? These were quite hard to make, it is very likely I've made an
error..

> but you still doing a bunch of things in there which are way beyond
> my understanding (e.g. where are all the SMP barriers?).

The only non-locked data is 'struct mmu_range_notifier->invalidate_seq'

On the write side it uses a WRITE_ONCE. The 'seq' it writes is
generated under the mmn_mm->lock spinlock (held before and after the
WRITE_ONCE) such that all concurrent WRITE_ONCE's are storing the same
value.

Essentially the spinlock is providing the barrier to order write:

   invalidate_range_start():
     spin_lock(&mmn_mm->lock);
     mmn_mm->active_invalidate_ranges++;
     mmn_mm->invalidate_seq |= 1;
     *seq = mmn_mm->invalidate_seq;
     spin_unlock(&mmn_mm->lock);

     WRITE_ONCE(mrn->invalidate_seq, cur_seq);

   invalidate_range_end():
     spin_lock(&mmn_mm->lock);
     if (!--mmn_mm->active_invalidate_ranges)
         mmn_mm->invalidate_seq++;
     spin_unlock(&mmn_mm->lock);

ie when we do invalidate_seq++, due to the active_invalidate_ranges
counter and the spinlock, we know all other WRITE_ONCE's have passed
their spin_unlock and no concurrent ones are starting. The spinlock is
providing the barrier here.

On the read side.. It is a similar argument, except with the
driver_lock:

   mmu_range_read_begin()
     seq = READ_ONCE(mrn->invalidate_seq);

Here 'seq' may be the current value, or it may be an older
value. Ordering is eventually provided by the driver_lock:

   mn_tree_invalidate_start()
     [..]
     WRITE_ONCE(mrn->invalidate_seq, cur_seq);
     driver_lock()
     driver_unlock()

vs

   driver_lock()
     mmu_range_read_begin()
       return seq != READ_ONCE(mrn->invalidate_seq);
   driver_unlock()

Here if mn_tree_invalidate_start() has passed driver_unlock() then
because we do driver_lock() before mmu_range_read_begin() we must
observe the WRITE_ONCE. ie the driver_unlock()/driver_lock() provide
the paired barrier.

If mn_tree_invalidate_start() has not passed driver_lock() then the
mmu_range_read_begin() side prevents it from passing driver_lock()
while it holds the lock. In this case it is OK if we don't observe the
WRITE_ONCE() that was done just before, as the invalidate_start()
thread can't proceed to invalidation.

It is very unusual locking, it would be great if others could help
look at it!

The unusual bit in all of this is using a lock's critical region to
'protect' data for read, but updating that same data before the lock's
critical section. ie relying on the unlock barrier to 'release' program
ordered stores done before the lock's own critical region, and the
lock side barrier to 'acquire' those stores.

This approach is borrowed from the hmm mirror implementation..

If for some reason the scheme doesn't work, then the fallback is to
expand the mmn_mm->lock spinlock to protect the mrn->invalidate_seq at
some cost in performance.

Jason
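Since this ordering argument is the subtle part of the design, here is the
same write/read pairing rewritten as a small annotated sketch. Pthread
mutexes and C11 relaxed atomics stand in for the kernel locks and
WRITE_ONCE/READ_ONCE, the names are illustrative, and the comments spell out
which unlock/lock pair does the ordering work.

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdbool.h>

    static pthread_mutex_t driver_lock = PTHREAD_MUTEX_INITIALIZER;
    static _Atomic unsigned long mrn_invalidate_seq;   /* the only unlocked field */

    /* Invalidate side (simplified): publish the new seq, then cycle the
     * driver lock.  The unlock below is the 'release' for the relaxed store:
     * any reader that takes driver_lock afterwards must observe it. */
    static void invalidate_start(unsigned long cur_seq)
    {
        /* Like WRITE_ONCE(): concurrent invalidations all store the same
         * cur_seq (generated under a separate lock), so they need no
         * ordering among themselves. */
        atomic_store_explicit(&mrn_invalidate_seq, cur_seq,
                              memory_order_relaxed);

        pthread_mutex_lock(&driver_lock);
        /* If a reader already held driver_lock and validated against the old
         * seq, that is fine: we block here, the ptes are untouched so its
         * pages were still correct, and this invalidation cleans up whatever
         * it programmed once we get the lock. */
        pthread_mutex_unlock(&driver_lock);
        /* ... only now may the caller go on to update the ptes ... */
    }

    /* Read side: the caller must already hold driver_lock.  Taking that lock
     * pairs with the unlock in invalidate_start() (acquire/release), which is
     * why no explicit SMP barrier is needed around this load. */
    static bool read_retry(unsigned long seq)
    {
        return atomic_load_explicit(&mrn_invalidate_seq,
                                    memory_order_relaxed) != seq;
    }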
On 10/15/2019 2:12 PM, Jason Gunthorpe wrote:
> This is still being tested, but I figured to send it to start getting help
> from the xen, amd and hfi drivers which I cannot test here.

Sorry for the delay, I never saw this. I was not on the Cc list and it
didn't register to me that it impacted hfi. I'll take a look and run it
through some hfi1 tests.

-Denny
On Mon, Oct 21, 2019 at 11:55:51AM -0400, Dennis Dalessandro wrote: > On 10/15/2019 2:12 PM, Jason Gunthorpe wrote: > > This is still being tested, but I figured to send it to start getting help > > from the xen, amd and hfi drivers which I cannot test here. > > Sorry for the delay, I never seen this. Was not on Cc list and didn't > register to me it impacted hfi. I'll take a look and run it through some > hfi1 tests. Hm, you were cc'd on the hfi1 patch of the series: https://patchwork.kernel.org/patch/11191395/ So you saw that, right? But it seems that git send-email didn't pull all the cc's together? Jason
On Tue, Oct 15, 2019 at 03:12:27PM -0300, Jason Gunthorpe wrote: > From: Jason Gunthorpe <jgg@mellanox.com> > > 8 of the mmu_notifier using drivers (i915_gem, radeon_mn, umem_odp, hfi1, > scif_dma, vhost, gntdev, hmm) drivers are using a common pattern where > they only use invalidate_range_start/end and immediately check the > invalidating range against some driver data structure to tell if the > driver is interested. Half of them use an interval_tree, the others are > simple linear search lists. > > Of the ones I checked they largely seem to have various kinds of races, > bugs and poor implementation. This is a result of the complexity in how > the notifier interacts with get_user_pages(). It is extremely difficult to > use it correctly. > > Consolidate all of this code together into the core mmu_notifier and > provide a locking scheme similar to hmm_mirror that allows the user to > safely use get_user_pages() and reliably know if the page list still > matches the mm. > > This new arrangment plays nicely with the !blockable mode for > OOM. Scanning the interval tree is done such that the intersection test > will always succeed, and since there is no invalidate_range_end exposed to > drivers the scheme safely allows multiple drivers to be subscribed. > > Four places are converted as an example of how the new API is used. > Four are left for future patches: > - i915_gem has complex locking around destruction of a registration, > needs more study > - hfi1 (2nd user) needs access to the rbtree > - scif_dma has a complicated logic flow > - vhost's mmu notifiers are already being rewritten > > This is still being tested, but I figured to send it to start getting help > from the xen, amd and hfi drivers which I cannot test here. It might be a good oportunity to also switch those users to hmm_range_fault() instead of GUP as GUP is pointless for those users. In fact the GUP is an impediment to normal mm operations. I will test on nouveau. Cheers, Jérôme
On Mon, Oct 21, 2019 at 02:40:41PM -0400, Jerome Glisse wrote: > On Tue, Oct 15, 2019 at 03:12:27PM -0300, Jason Gunthorpe wrote: > > From: Jason Gunthorpe <jgg@mellanox.com> > > > > 8 of the mmu_notifier using drivers (i915_gem, radeon_mn, umem_odp, hfi1, > > scif_dma, vhost, gntdev, hmm) drivers are using a common pattern where > > they only use invalidate_range_start/end and immediately check the > > invalidating range against some driver data structure to tell if the > > driver is interested. Half of them use an interval_tree, the others are > > simple linear search lists. > > > > Of the ones I checked they largely seem to have various kinds of races, > > bugs and poor implementation. This is a result of the complexity in how > > the notifier interacts with get_user_pages(). It is extremely difficult to > > use it correctly. > > > > Consolidate all of this code together into the core mmu_notifier and > > provide a locking scheme similar to hmm_mirror that allows the user to > > safely use get_user_pages() and reliably know if the page list still > > matches the mm. > > > > This new arrangment plays nicely with the !blockable mode for > > OOM. Scanning the interval tree is done such that the intersection test > > will always succeed, and since there is no invalidate_range_end exposed to > > drivers the scheme safely allows multiple drivers to be subscribed. > > > > Four places are converted as an example of how the new API is used. > > Four are left for future patches: > > - i915_gem has complex locking around destruction of a registration, > > needs more study > > - hfi1 (2nd user) needs access to the rbtree > > - scif_dma has a complicated logic flow > > - vhost's mmu notifiers are already being rewritten > > > > This is still being tested, but I figured to send it to start getting help > > from the xen, amd and hfi drivers which I cannot test here. > > It might be a good oportunity to also switch those users to > hmm_range_fault() instead of GUP as GUP is pointless for those > users. In fact the GUP is an impediment to normal mm operations. I think vhost can use hmm_range_fault hfi1 does actually need to have the page pin, it doesn't fence DMA during invalidate. i915_gem feels alot like amdgpu, so probably it would benefit No idea about scif_dma > I will test on nouveau. Thanks, hopefully it still works, I think Ralph was able to do some basic checks. But it is a pretty complicated series, I probably made some mistakes. FWIW, I know that nouveau gets a lockdep splat now from Daniel Vetter's recent changes, it tries to do GFP_KERENEL allocations under a lock also held by the invalidate_range_start path. Thanks for looking at it! Regards, Jason
On Mon, Oct 21, 2019 at 03:12:26PM +0000, Jason Gunthorpe wrote: > On Mon, Oct 21, 2019 at 02:28:46PM +0000, Koenig, Christian wrote: > > Am 21.10.19 um 15:57 schrieb Jason Gunthorpe: > > > On Sun, Oct 20, 2019 at 02:21:42PM +0000, Koenig, Christian wrote: > > >> Am 18.10.19 um 22:36 schrieb Jason Gunthorpe: > > >>> On Thu, Oct 17, 2019 at 04:47:20PM +0000, Koenig, Christian wrote: > > >>> [SNIP] > > >>> > > >>>> So again how are they serialized? > > >>> The 'driver lock' thing does it, read the hmm documentation, the hmm > > >>> approach is basically the only approach that was correct of all the > > >>> drivers.. > > >> Well that's what I've did, but what HMM does still doesn't looks correct > > >> to me. > > > It has a bug, but the basic flow seems to work. > > > > > > https://patchwork.kernel.org/patch/11191 > > > > Maybe wrong link? That link looks like an unrelated discussion on kernel > > image relocation. > > Sorry, it got corrupted: > > https://patchwork.kernel.org/patch/11191397/ > > > >>> So long as the 'driver lock' is held the range cannot become > > >>> invalidated as the 'driver lock' prevents progress of invalidation. > > >> Correct, but the problem is it doesn't wait for ongoing operations to > > >> complete. > > >> > > >> See I'm talking about the following case: > > >> > > >> Thread A Thread B > > >> invalidate_range_start() > > >> mmu_range_read_begin() > > >> get_user_pages()/hmm_range_fault() > > >> grab_driver_lock() > > >> Updating the ptes > > >> invalidate_range_end() > > >> > > >> As far as I can see in invalidate_range_start() the driver lock is taken > > >> to make sure that we can't start any invalidation while the driver is > > >> using the pages for a command submission. > > > Again, this uses the seqlock like scheme *and* the driver lock. > > > > > > In this case after grab_driver_lock() mmu_range_read_retry() will > > > return false if Thread A has progressed to 'updating the ptes. > > > > > > For instance here is how the concurrency resolves for retry: > > > > > > CPU1 CPU2 > > > seq = mmu_range_read_begin() > > > invalidate_range_start() > > > invalidate_seq++ > > > > How that was order was what confusing me. But I've read up on the code > > in mmu_range_read_begin() and found the lines I was looking for: > > > > + if (is_invalidating) > > + wait_event(mmn_mm->wq, > > + READ_ONCE(mmn_mm->invalidate_seq) != seq); > > > > [SNIP] > > Right, the basic design is that the 'seq' returned by > mmu_range_read_begin() is never currently under invalidation. > > Thus if the starting seq is not invalidating, then if it doesn't > change we are guaranteed the ptes haven't changed either. > > > > For the above I've simplified the mechanics of the invalidate_seq, you > > > need to look through the patch to see how it actually works. > > > > Yea, that you also allow multiple write sides is pretty neat. > > Complicated, but necessary to make the non-blocking OOM stuff able to > read the interval tree under all conditions :\ > > > > One of the motivations for this work is to squash that bug by adding a > > > seqlock like pattern. But the basic hmm flow and collision-retry > > > approach seems sound. > > > > > > Do you see a problem with this patch? > > > > No, not any more. > > Okay, great, thanks > > > Essentially you are doing the same thing I've tried to do with the > > original amdgpu implementation. 
The difference is that you don't try to > > use a per range sequence (which is a good idea, we never got that fully > > working) and you allow multiple writers at the same time. > > Yes, ODP had the per-range sequence and it never worked right > either. Keeping track of things during the invalidate_end was too complex > > > Feel free to stitch an Acked-by: Christian König > > <christian.koenig@amd.com> on patch #2, > > Thanks! Can you also take some review and test for the AMD related > patches? These were quite hard to make, it is very likely I've made an > error.. > > > but you still doing a bunch of things in there which are way beyond > > my understanding (e.g. where are all the SMP barriers?). > > The only non-locked data is 'struct mmu_range_notifier->invalidate_seq' > > On the write side it uses a WRITE_ONCE. The 'seq' it writes is > generated under the mmn_mm->lock spinlock (held before and after the > WRITE_ONCE) such that all concurrent WRITE_ONCE's are storing the same > value. > > Essentially the spinlock is providing the barrier to order write: > > invalidate_range_start(): > spin_lock(&mmn_mm->lock); > mmn_mm->active_invalidate_ranges++; > mmn_mm->invalidate_seq |= 1; > *seq = mmn_mm->invalidate_seq; > spin_unlock(&mmn_mm->lock); > > WRITE_ONCE(mrn->invalidate_seq, cur_seq); > > invalidate_range_end() > spin_lock(&mmn_mm->lock); > if (--mmn_mm->active_invalidate_ranges) > mmn_mm->invalidate_seq++ > spin_unlock(&mmn_mm->lock); > > ie when we do invalidate_seq++, due to the active_invalidate_ranges > counter and the spinlock, we know all other WRITE_ONCE's have passed > their spin_unlock and no concurrent ones are starting. The spinlock is > providing the barrier here. > > On the read side.. It is a similar argument, except with the > driver_lock: > > mmu_range_read_begin() > seq = READ_ONCE(mrn->invalidate_seq); > > Here 'seq' may be the current value, or it may be an older > value. Ordering is eventually provided by the driver_lock: > > mn_tree_invalidate_start() > [..] > WRITE_ONCE(mrn->invalidate_seq, cur_seq); > driver_lock() > driver_unlock() > > vs > driver_lock() > mmu_range_read_begin() > return seq != READ_ONCE(mrn->invalidate_seq); > driver_unlock() > > Here if mn_tree_invalidate_start() has passed driver_unlock() then > because we do driver_lock() before mmu_range_read_begin() we must > observe the WRITE_ONCE. ie the driver_unlock()/driver_lock() provide > the pair'd barrier. > > If mn_tree_invalidate_start() has not passed driver_lock() then the > mmu_range_read_begin() side prevents it from passing driver_lock() > while it holds the lock. In this case it is OK if we don't observe the > WRITE_ONCE() that was done just before as the invalidate_start() > thread can't proceed to invalidation. > > It is very unusual locking, it would be great if others could help > look at it! > > The unusual bit in all of this is using a lock's critical region to > 'protect' data for read, but updating that same data before the lock's > critical secion. ie relying on the unlock barrier to 'release' program > ordered stores done before the lock's own critical region, and the > lock side barrier to 'acquire' those stores. I think this unusual use of locks as barriers for other unlocked accesses deserves comments even more than just normal barriers. Can you pls add them? I think the design seeems sound ... 
Also the comment on the driver's lock hopefully prevents driver maintainers from moving the driver_lock around in a way that would very subtle break the scheme, so I think having the acquire barrier commented in each place would be really good. Cheers, Daniel > > This approach is borrowed from the hmm mirror implementation.. > > If for some reason the scheme doesn't work, then the fallback is to > expand the mmn_mm->lock spinlock to protect the mrn->invalidate_seq at > some cost in performance. > > Jason > _______________________________________________ > dri-devel mailing list > dri-devel@lists.freedesktop.org > https://lists.freedesktop.org/mailman/listinfo/dri-devel
On 10/21/2019 12:58 PM, Jason Gunthorpe wrote: > On Mon, Oct 21, 2019 at 11:55:51AM -0400, Dennis Dalessandro wrote: >> On 10/15/2019 2:12 PM, Jason Gunthorpe wrote: >>> This is still being tested, but I figured to send it to start getting help >>> from the xen, amd and hfi drivers which I cannot test here. >> >> Sorry for the delay, I never seen this. Was not on Cc list and didn't >> register to me it impacted hfi. I'll take a look and run it through some >> hfi1 tests. > > Hm, you were cc'd on the hfi1 patch of the series: > > https://patchwork.kernel.org/patch/11191395/ > > So you saw that, right? I do now. > But it seems that git send-email didn't pull all the cc's together? I don't know. I thought it did, at one time I recall trying to get it *not* to do that, when preparing some internal reviews. Haven't used it for a long time though, I've been using stgit. At any rate can you give me a SHA or branch this applies on top of? I have pulled rdma/hmm, rdma/wip/jgg, linus/master but seem to have conflicts. -Denny
On Tue, Oct 22, 2019 at 07:56:12AM -0400, Dennis Dalessandro wrote: > On 10/21/2019 12:58 PM, Jason Gunthorpe wrote: > > On Mon, Oct 21, 2019 at 11:55:51AM -0400, Dennis Dalessandro wrote: > > > On 10/15/2019 2:12 PM, Jason Gunthorpe wrote: > > > > This is still being tested, but I figured to send it to start getting help > > > > from the xen, amd and hfi drivers which I cannot test here. > > > > > > Sorry for the delay, I never seen this. Was not on Cc list and didn't > > > register to me it impacted hfi. I'll take a look and run it through some > > > hfi1 tests. > > > > Hm, you were cc'd on the hfi1 patch of the series: > > > > https://patchwork.kernel.org/patch/11191395/ > > > > So you saw that, right? > > I do now. > > > But it seems that git send-email didn't pull all the cc's together? > > I don't know. I thought it did, at one time I recall trying to get it *not* > to do that, when preparing some internal reviews. Haven't used it for a long > time though, I've been using stgit. > > At any rate can you give me a SHA or branch this applies on top of? I have > pulled rdma/hmm, rdma/wip/jgg, linus/master but seem to have conflicts. Here is a full tree: https://github.com/jgunthorpe/linux/commits/mmu_notifier It needs the ODP series on the mailing list to apply Thanks, Jason
On Tue, Oct 22, 2019 at 09:57:35AM +0200, Daniel Vetter wrote:

> > The unusual bit in all of this is using a lock's critical region to
> > 'protect' data for read, but updating that same data before the lock's
> > critical section. ie relying on the unlock barrier to 'release' program
> > ordered stores done before the lock's own critical region, and the
> > lock side barrier to 'acquire' those stores.
>
> I think this unusual use of locks as barriers for other unlocked accesses
> deserves comments even more than just normal barriers. Can you pls add
> them? I think the design seems sound ...
>
> Also the comment on the driver's lock hopefully prevents driver
> maintainers from moving the driver_lock around in a way that would very
> subtly break the scheme, so I think having the acquire barrier commented
> in each place would be really good.

There is already a lot of documentation, I think it would be helpful
if you could suggest some specific places where you think an addition
would help? I think the perspective of someone less familiar with this
design would really improve the documentation.

I've been tempted to force the driver to store the seq number directly
under the driver lock - this makes the scheme much clearer, ie
something like this:

diff --git a/drivers/gpu/drm/nouveau/nouveau_svm.c b/drivers/gpu/drm/nouveau/nouveau_svm.c
index 712c99918551bc..738fa670dcfb19 100644
--- a/drivers/gpu/drm/nouveau/nouveau_svm.c
+++ b/drivers/gpu/drm/nouveau/nouveau_svm.c
@@ -488,7 +488,8 @@ struct svm_notifier {
 };
 
 static bool nouveau_svm_range_invalidate(struct mmu_range_notifier *mrn,
-                                         const struct mmu_notifier_range *range)
+                                         const struct mmu_notifier_range *range,
+                                         unsigned long seq)
 {
 	struct svm_notifier *sn =
 		container_of(mrn, struct svm_notifier, notifier);
@@ -504,6 +505,7 @@ static bool nouveau_svm_range_invalidate(struct mmu_range_notifier *mrn,
 		mutex_lock(&sn->svmm->mutex);
 	else if (!mutex_trylock(&sn->svmm->mutex))
 		return false;
+	mmu_range_notifier_update_seq(mrn, seq);
 	mutex_unlock(&sn->svmm->mutex);
 	return true;
 }

At the cost of making the driver a bit more complex, what do you
think?

Jason
On Tue, Oct 22, 2019 at 03:01:13PM +0000, Jason Gunthorpe wrote: > On Tue, Oct 22, 2019 at 09:57:35AM +0200, Daniel Vetter wrote: > > > > The unusual bit in all of this is using a lock's critical region to > > > 'protect' data for read, but updating that same data before the lock's > > > critical secion. ie relying on the unlock barrier to 'release' program > > > ordered stores done before the lock's own critical region, and the > > > lock side barrier to 'acquire' those stores. > > > > I think this unusual use of locks as barriers for other unlocked accesses > > deserves comments even more than just normal barriers. Can you pls add > > them? I think the design seeems sound ... > > > > Also the comment on the driver's lock hopefully prevents driver > > maintainers from moving the driver_lock around in a way that would very > > subtle break the scheme, so I think having the acquire barrier commented > > in each place would be really good. > > There is already a lot of documentation, I think it would be helpful > if you could suggest some specific places where you think an addition > would help? I think the perspective of someone less familiar with this > design would really improve the documentation Hm I just meant the usual recommendation that "barriers must have comments explaining what they order, and where the other side of the barrier is". Using unlock/lock as a barrier imo just makes that an even better idea. Usually what I do is something like "we need to order $this against $that below, and the other side of this barrier is in function()." With maybe a bit more if it's not obvious how things go wrong if the orderin is broken. Ofc seqlock.h itself skimps on that rule and doesn't bother explaining its barriers :-/ > I've been tempted to force the driver to store the seq number directly > under the driver lock - this makes the scheme much clearer, ie > something like this: > > diff --git a/drivers/gpu/drm/nouveau/nouveau_svm.c b/drivers/gpu/drm/nouveau/nouveau_svm.c > index 712c99918551bc..738fa670dcfb19 100644 > --- a/drivers/gpu/drm/nouveau/nouveau_svm.c > +++ b/drivers/gpu/drm/nouveau/nouveau_svm.c > @@ -488,7 +488,8 @@ struct svm_notifier { > }; > > static bool nouveau_svm_range_invalidate(struct mmu_range_notifier *mrn, > - const struct mmu_notifier_range *range) > + const struct mmu_notifier_range *range, > + unsigned long seq) > { > struct svm_notifier *sn = > container_of(mrn, struct svm_notifier, notifier); > @@ -504,6 +505,7 @@ static bool nouveau_svm_range_invalidate(struct mmu_range_notifier *mrn, > mutex_lock(&sn->svmm->mutex); > else if (!mutex_trylock(&sn->svmm->mutex)) > return false; > + mmu_range_notifier_update_seq(mrn, seq); > mutex_unlock(&sn->svmm->mutex); > return true; > } > > > At the cost of making the driver a bit more complex, what do you > think? Hm, spinning this further ... could we initialize the mmu range notifier with a pointer to the driver lock, so that we could put a lockdep_assert_held into mmu_range_notifier_update_seq? I think that would make this scheme substantially more driver-hacker proof :-) Cheers, Daniel
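To sketch what Daniel is suggesting here: the notifier could be given a
pointer to the driver's lock at insert time so the update helper can assert
it is held. This is not in the posted series; the names follow the proposed
API from Jason's diff above, and it assumes for simplicity that the driver
lock is a struct mutex (a spinlock-based driver would need a variant).

    struct mmu_range_notifier {
        /* ... fields from the proposed series ... */
        unsigned long invalidate_seq;
        struct mutex *driver_lock;    /* optional, debug aid only */
    };

    static inline void
    mmu_range_notifier_update_seq(struct mmu_range_notifier *mrn,
                                  unsigned long seq)
    {
        /* Catches a driver calling this without its serializing lock held,
         * e.g. after the lock was moved around during a refactor. */
        if (mrn->driver_lock)
            lockdep_assert_held(mrn->driver_lock);
        WRITE_ONCE(mrn->invalidate_seq, seq);
    }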
On 23.10.19 at 11:08, Daniel Vetter wrote:
> On Tue, Oct 22, 2019 at 03:01:13PM +0000, Jason Gunthorpe wrote:
>> On Tue, Oct 22, 2019 at 09:57:35AM +0200, Daniel Vetter wrote:
>>
>>>> The unusual bit in all of this is using a lock's critical region to
>>>> 'protect' data for read, but updating that same data before the lock's
>>>> critical section. ie relying on the unlock barrier to 'release' program
>>>> ordered stores done before the lock's own critical region, and the
>>>> lock side barrier to 'acquire' those stores.
>>> I think this unusual use of locks as barriers for other unlocked accesses
>>> deserves comments even more than just normal barriers. Can you pls add
>>> them? I think the design seems sound ...
>>>
>>> Also the comment on the driver's lock hopefully prevents driver
>>> maintainers from moving the driver_lock around in a way that would very
>>> subtly break the scheme, so I think having the acquire barrier commented
>>> in each place would be really good.
>> There is already a lot of documentation, I think it would be helpful
>> if you could suggest some specific places where you think an addition
>> would help? I think the perspective of someone less familiar with this
>> design would really improve the documentation
> Hm I just meant the usual recommendation that "barriers must have comments
> explaining what they order, and where the other side of the barrier is".
> Using unlock/lock as a barrier imo just makes that an even better idea.
> Usually what I do is something like "we need to order $this against $that
> below, and the other side of this barrier is in function()." With maybe a
> bit more if it's not obvious how things go wrong if the ordering is broken.
>
> Ofc seqlock.h itself skimps on that rule and doesn't bother explaining its
> barriers :-/
>
>> I've been tempted to force the driver to store the seq number directly
>> under the driver lock - this makes the scheme much clearer, ie
>> something like this:
>>
>> diff --git a/drivers/gpu/drm/nouveau/nouveau_svm.c b/drivers/gpu/drm/nouveau/nouveau_svm.c
>> index 712c99918551bc..738fa670dcfb19 100644
>> --- a/drivers/gpu/drm/nouveau/nouveau_svm.c
>> +++ b/drivers/gpu/drm/nouveau/nouveau_svm.c
>> @@ -488,7 +488,8 @@ struct svm_notifier {
>>  };
>>  
>>  static bool nouveau_svm_range_invalidate(struct mmu_range_notifier *mrn,
>> -					 const struct mmu_notifier_range *range)
>> +					 const struct mmu_notifier_range *range,
>> +					 unsigned long seq)
>>  {
>>  	struct svm_notifier *sn =
>>  		container_of(mrn, struct svm_notifier, notifier);
>> @@ -504,6 +505,7 @@ static bool nouveau_svm_range_invalidate(struct mmu_range_notifier *mrn,
>>  		mutex_lock(&sn->svmm->mutex);
>>  	else if (!mutex_trylock(&sn->svmm->mutex))
>>  		return false;
>> +	mmu_range_notifier_update_seq(mrn, seq);
>>  	mutex_unlock(&sn->svmm->mutex);
>>  	return true;
>>  }
>>
>>
>> At the cost of making the driver a bit more complex, what do you
>> think?
> Hm, spinning this further ... could we initialize the mmu range notifier
> with a pointer to the driver lock, so that we could put a
> lockdep_assert_held into mmu_range_notifier_update_seq? I think that would
> make this scheme substantially more driver-hacker proof :-)

Going another step further.... what keeps us from putting the lock into
the mmu range notifier itself and having _lock()/_unlock() helpers?

I mean having the lock in the driver only makes sense when the driver
would be using the same lock for multiple things, e.g. multiple MMU range
notifiers under the same lock. But I really don't see that use case here.

Regards,
Christian.

> 
> Cheers, Daniel
On Wed, Oct 23, 2019 at 11:32:16AM +0200, Christian König wrote:
> On 23.10.19 at 11:08, Daniel Vetter wrote:
> > On Tue, Oct 22, 2019 at 03:01:13PM +0000, Jason Gunthorpe wrote:
> > > On Tue, Oct 22, 2019 at 09:57:35AM +0200, Daniel Vetter wrote:
> > > 
> > > > > The unusual bit in all of this is using a lock's critical region to
> > > > > 'protect' data for read, but updating that same data before the lock's
> > > > > critical section. ie relying on the unlock barrier to 'release' program
> > > > > ordered stores done before the lock's own critical region, and the
> > > > > lock side barrier to 'acquire' those stores.
> > > > I think this unusual use of locks as barriers for other unlocked accesses
> > > > deserves comments even more than just normal barriers. Can you pls add
> > > > them? I think the design seems sound ...
> > > > 
> > > > Also the comment on the driver's lock hopefully prevents driver
> > > > maintainers from moving the driver_lock around in a way that would very
> > > > subtly break the scheme, so I think having the acquire barrier commented
> > > > in each place would be really good.
> > > There is already a lot of documentation, I think it would be helpful
> > > if you could suggest some specific places where you think an addition
> > > would help? I think the perspective of someone less familiar with this
> > > design would really improve the documentation
> > Hm I just meant the usual recommendation that "barriers must have comments
> > explaining what they order, and where the other side of the barrier is".
> > Using unlock/lock as a barrier imo just makes that an even better idea.
> > Usually what I do is something like "we need to order $this against $that
> > below, and the other side of this barrier is in function()." With maybe a
> > bit more if it's not obvious how things go wrong if the ordering is broken.
> > 
> > Ofc seqlock.h itself skimps on that rule and doesn't bother explaining its
> > barriers :-/
> > 
> > > I've been tempted to force the driver to store the seq number directly
> > > under the driver lock - this makes the scheme much clearer, ie
> > > something like this:
> > > 
> > > diff --git a/drivers/gpu/drm/nouveau/nouveau_svm.c b/drivers/gpu/drm/nouveau/nouveau_svm.c
> > > index 712c99918551bc..738fa670dcfb19 100644
> > > --- a/drivers/gpu/drm/nouveau/nouveau_svm.c
> > > +++ b/drivers/gpu/drm/nouveau/nouveau_svm.c
> > > @@ -488,7 +488,8 @@ struct svm_notifier {
> > >  };
> > >  
> > >  static bool nouveau_svm_range_invalidate(struct mmu_range_notifier *mrn,
> > > -					 const struct mmu_notifier_range *range)
> > > +					 const struct mmu_notifier_range *range,
> > > +					 unsigned long seq)
> > >  {
> > >  	struct svm_notifier *sn =
> > >  		container_of(mrn, struct svm_notifier, notifier);
> > > @@ -504,6 +505,7 @@ static bool nouveau_svm_range_invalidate(struct mmu_range_notifier *mrn,
> > >  		mutex_lock(&sn->svmm->mutex);
> > >  	else if (!mutex_trylock(&sn->svmm->mutex))
> > >  		return false;
> > > +	mmu_range_notifier_update_seq(mrn, seq);
> > >  	mutex_unlock(&sn->svmm->mutex);
> > >  	return true;
> > >  }
> > > 
> > > 
> > > At the cost of making the driver a bit more complex, what do you
> > > think?
> > Hm, spinning this further ... could we initialize the mmu range notifier
> > with a pointer to the driver lock, so that we could put a
> > lockdep_assert_held into mmu_range_notifier_update_seq? I think that would
> > make this scheme substantially more driver-hacker proof :-)
> 
> Going another step further.... what keeps us from putting the lock into
> the mmu range notifier itself and having _lock()/_unlock() helpers?
> 
> I mean having the lock in the driver only makes sense when the driver
> would be using the same lock for multiple things, e.g. multiple MMU range
> notifiers under the same lock. But I really don't see that use case here.

I actually do: nouveau uses one lock to protect the page table, and that's
the lock that matters. You can have multiple ranges for a single page
table, the idea being that only a sub-set of the process address space is
ever accessed by the GPU, so it is better to focus on that sub-set and
track invalidation at a finer grain.

Cheers,
Jérôme
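As an aside, the arrangement Jérôme describes (several ranges, one page-table
lock) could look roughly like this in a driver. Every identifier below is made
up for illustration, and the invalidate callback signature is assumed from this
series:

#include <linux/kernel.h>
#include <linux/mmu_notifier.h>
#include <linux/mutex.h>

/* One lock for the whole GPU page table ... */
struct gpu_vm {
	struct mutex pt_lock;		/* protects the HW page table */
};

/* ... but one notifier per actively used sub-range of the address space. */
struct gpu_vm_range {
	struct mmu_range_notifier notifier;
	struct gpu_vm *vm;
};

static bool gpu_vm_range_invalidate(struct mmu_range_notifier *mrn,
				    const struct mmu_notifier_range *range)
{
	struct gpu_vm_range *r =
		container_of(mrn, struct gpu_vm_range, notifier);

	/* All ranges funnel through the single page-table lock. */
	if (mmu_notifier_range_blockable(range))
		mutex_lock(&r->vm->pt_lock);
	else if (!mutex_trylock(&r->vm->pt_lock))
		return false;

	/* ... zap only the HW PTEs covering this sub-range ... */

	mutex_unlock(&r->vm->pt_lock);
	return true;
}

Each actively used sub-range gets its own notifier, but every invalidation
serializes on the one pt_lock that also orders the driver's page-table updates,
which is exactly the "driver lock" of the scheme being discussed.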
On Wed, Oct 23, 2019 at 12:52:23PM -0400, Jerome Glisse wrote:
> > Going another step further.... what keeps us from putting the lock into
> > the mmu range notifier itself and having _lock()/_unlock() helpers?
> > 
> > I mean having the lock in the driver only makes sense when the driver
> > would be using the same lock for multiple things, e.g. multiple MMU range
> > notifiers under the same lock. But I really don't see that use case here.
> 
> I actually do: nouveau uses one lock to protect the page table, and that's
> the lock that matters. You can have multiple ranges for a single page
> table, the idea being that only a sub-set of the process address space is
> ever accessed by the GPU, so it is better to focus on that sub-set and
> track invalidation at a finer grain.

mlx5 is similar, but not currently coded quite right, there is one
lock that protects the command queue for submitting invalidations to
the HW and it doesn't make a lot of sense to have additional fine
grained locking beyond that.

So I suppose the intent here is that most drivers would have a single
'page table lock' that protects the HW's page table update, and this
lock is the one that should be held while updating and checking the
sequence number.

dma_fence based drivers are possibly a little different, I think they
can just use a spinlock; their pattern should probably be something
like:

 fault:
    hmm_range_fault()
    spin_lock()
    if (mmu_range_read_retry())
          goto again
    dma_fence_init(mrn->fence)
    spin_unlock()

 invalidate:
    spin_lock()
    is_inited = 'dma fence init has been called'
    spin_unlock()
    if (is_inited)
         dma_fence_wait(fence)

I'm not sure, never used dma_fence before. The key thing is that the
dma_fence_wait() cannot block until after the mmu_range_read_retry() &
unlock completes. Otherwise it can deadlock with hmm_range_fault().

It would be nice to figure this out and add it to the hmm.rst as we do
have two drivers using the dma_fence scheme.

Also, the use of a spinlock here probably says we should keep the lock
external.

But, it sounds like the mmu_range_notifier_update_seq() is a good idea,
so let me add that in v2.

Jason
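Fleshing that pseudocode out a little, a hedged C sketch of the dma_fence
flavour might look like the following. The struct, field and function names on
the driver side are invented, the mmu_range_read_begin()/mmu_range_read_retry()
names are assumed from this series, and only the dma_fence calls are the stock
ones from linux/dma-fence.h:

#include <linux/dma-fence.h>
#include <linux/kernel.h>
#include <linux/mmu_notifier.h>
#include <linux/spinlock.h>

struct fenced_range {			/* hypothetical driver state */
	struct mmu_range_notifier notifier;
	spinlock_t lock;		/* the 'driver lock' of the scheme */
	struct dma_fence *active_fence;	/* non-NULL while HW may use the pages */
};

/* Fault side: only publish the mapping if no invalidation raced with us. */
static int fenced_range_fault(struct fenced_range *r, struct hmm_range *range,
			      struct dma_fence *fence)
{
	unsigned long seq;

again:
	seq = mmu_range_read_begin(&r->notifier);
	/* ... hmm_range_fault(range, ...) and build the HW mapping here,
	 * sleeping is fine because the spinlock is not yet held ... */

	spin_lock(&r->lock);
	if (mmu_range_read_retry(&r->notifier, seq)) {
		spin_unlock(&r->lock);
		goto again;			/* an invalidation won the race */
	}
	r->active_fence = dma_fence_get(fence);	/* HW may now touch the pages */
	spin_unlock(&r->lock);
	return 0;
}

/*
 * Invalidate side: the wait happens strictly after the unlock, so the fault
 * side's retry-and-publish under the same lock cannot deadlock against it.
 */
static bool fenced_range_invalidate(struct mmu_range_notifier *mrn,
				    const struct mmu_notifier_range *range)
{
	struct fenced_range *r =
		container_of(mrn, struct fenced_range, notifier);
	struct dma_fence *fence;

	if (!mmu_notifier_range_blockable(range))
		return false;		/* we may have to wait on the fence */

	spin_lock(&r->lock);
	fence = r->active_fence;
	r->active_fence = NULL;
	spin_unlock(&r->lock);

	if (fence) {
		dma_fence_wait(fence, false);
		dma_fence_put(fence);
	}
	return true;
}

This is only a sketch of the ordering constraint described above, not a
reference implementation; a real driver would also have to signal and recycle
the fence when the HW is done with the pages.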
On Mon, Oct 21, 2019 at 07:06:00PM +0000, Jason Gunthorpe wrote:
> On Mon, Oct 21, 2019 at 02:40:41PM -0400, Jerome Glisse wrote:
> > On Tue, Oct 15, 2019 at 03:12:27PM -0300, Jason Gunthorpe wrote:
> > > From: Jason Gunthorpe <jgg@mellanox.com>
> > > 
> > > 8 of the mmu_notifier using drivers (i915_gem, radeon_mn, umem_odp, hfi1,
> > > scif_dma, vhost, gntdev, hmm) drivers are using a common pattern where
> > > they only use invalidate_range_start/end and immediately check the
> > > invalidating range against some driver data structure to tell if the
> > > driver is interested. Half of them use an interval_tree, the others are
> > > simple linear search lists.
> > > 
> > > Of the ones I checked they largely seem to have various kinds of races,
> > > bugs and poor implementation. This is a result of the complexity in how
> > > the notifier interacts with get_user_pages(). It is extremely difficult to
> > > use it correctly.
> > > 
> > > Consolidate all of this code together into the core mmu_notifier and
> > > provide a locking scheme similar to hmm_mirror that allows the user to
> > > safely use get_user_pages() and reliably know if the page list still
> > > matches the mm.
> > > 
> > > This new arrangement plays nicely with the !blockable mode for
> > > OOM. Scanning the interval tree is done such that the intersection test
> > > will always succeed, and since there is no invalidate_range_end exposed to
> > > drivers the scheme safely allows multiple drivers to be subscribed.
> > > 
> > > Four places are converted as an example of how the new API is used.
> > > Four are left for future patches:
> > >  - i915_gem has complex locking around destruction of a registration,
> > >    needs more study
> > >  - hfi1 (2nd user) needs access to the rbtree
> > >  - scif_dma has a complicated logic flow
> > >  - vhost's mmu notifiers are already being rewritten
> > > 
> > > This is still being tested, but I figured to send it to start getting help
> > > from the xen, amd and hfi drivers which I cannot test here.
> > 
> > It might be a good opportunity to also switch those users to
> > hmm_range_fault() instead of GUP as GUP is pointless for those
> > users. In fact the GUP is an impediment to normal mm operations.
> 
> I think vhost can use hmm_range_fault
> 
> hfi1 does actually need to have the page pin, it doesn't fence DMA
> during invalidate.
> 
> i915_gem feels a lot like amdgpu, so probably it would benefit
> 
> No idea about scif_dma
> 
> > I will test on nouveau.
> 
> Thanks, hopefully it still works, I think Ralph was able to do some
> basic checks. But it is a pretty complicated series, I probably made
> some mistakes.

So it seems to work OK with nouveau; I will let tests run in a loop,
though the tests are not very advanced.

> 
> FWIW, I know that nouveau gets a lockdep splat now from Daniel
> Vetter's recent changes, it tries to do GFP_KERNEL allocations under
> a lock also held by the invalidate_range_start path.

I have not seen any splat so far, is it through some new kernel config?

Cheers,
Jérôme
On Wed, Oct 23, 2019 at 05:24:45PM +0000, Jason Gunthorpe wrote:
> mlx5 is similar, but not currently coded quite right, there is one
> lock that protects the command queue for submitting invalidations to
> the HW and it doesn't make a lot of sense to have additional fine
> grained locking beyond that.

IFF all drivers could agree on a lock type (rw_semaphore?) for this
protection you could add a pointer to the range which would clear
things up a lot.

I'm just not sure you could get everyone to agree.
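Just to make that suggestion concrete, if such an agreement existed the range
could carry the lock by pointer roughly like this. The wrapper struct, field
and helper below are purely hypothetical, and, as noted, it is far from clear
every driver could live with an rw_semaphore here:

#include <linux/lockdep.h>
#include <linux/rwsem.h>

struct mmu_range_notifier_locked {		/* hypothetical wrapper */
	struct mmu_range_notifier mrn;
	struct rw_semaphore *driver_sem;	/* supplied at insert time */
};

/* The core could then assert the locking convention in one place: */
static inline void
mmu_range_locked_update_seq(struct mmu_range_notifier_locked *mln,
			    unsigned long seq)
{
	lockdep_assert_held(mln->driver_sem);
	WRITE_ONCE(mln->mrn.invalidate_seq, seq);	/* field name assumed */
}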