diff mbox series

[RFC,3/5] mm/vma: add support for peer to peer to device vma

Message ID 20190129174728.6430-4-jglisse@redhat.com (mailing list archive)
State New, archived
Headers show
Series Device peer to peer (p2p) through vma | expand

Commit Message

Jerome Glisse Jan. 29, 2019, 5:47 p.m. UTC
From: Jérôme Glisse <jglisse@redhat.com>

Allow mmap of device file to export device memory to peer to peer
devices. This will allow for instance a network device to access a
GPU memory or to access a storage device queue directly.

The common case will be a vma created by userspace device driver
that is then share to another userspace device driver which call
in its kernel device driver to map that vma.

The vma does not need to have any valid CPU mapping so that only
peer to peer device might access its content. Or it could have
valid CPU mapping too in that case it should point to same memory
for consistency.

Note that peer to peer mapping is highly platform and device
dependent and it might not work in all the cases. However we do
expect supports for this to grow on more hardware platform.

This patch only adds new call backs to vm_operations_struct bulk
of code light within common bus driver (like pci) and device
driver (both the exporting and importing device).

Current design mandate that the importer must obey mmu_notifier
and invalidate any peer to peer mapping anytime a notification
of invalidation happens for a range that have been peer to peer
mapped. This allows exporter device to easily invalidate mapping
for any importer device.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Rafael J. Wysocki <rafael@kernel.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Christian Koenig <christian.koenig@amd.com>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-pci@vger.kernel.org
Cc: dri-devel@lists.freedesktop.org
Cc: Christoph Hellwig <hch@lst.de>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: iommu@lists.linux-foundation.org
---
 include/linux/mm.h | 38 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 38 insertions(+)

Comments

Logan Gunthorpe Jan. 29, 2019, 6:36 p.m. UTC | #1
On 2019-01-29 10:47 a.m., jglisse@redhat.com wrote:

> +	/*
> +	 * Optional for device driver that want to allow peer to peer (p2p)
> +	 * mapping of their vma (which can be back by some device memory) to
> +	 * another device.
> +	 *
> +	 * Note that the exporting device driver might not have map anything
> +	 * inside the vma for the CPU but might still want to allow a peer
> +	 * device to access the range of memory corresponding to a range in
> +	 * that vma.
> +	 *
> +	 * FOR PREDICTABILITY IF DRIVER SUCCESSFULY MAP A RANGE ONCE FOR A
> +	 * DEVICE THEN FURTHER MAPPING OF THE SAME IF THE VMA IS STILL VALID
> +	 * SHOULD ALSO BE SUCCESSFUL. Following this rule allow the importing
> +	 * device to map once during setup and report any failure at that time
> +	 * to the userspace. Further mapping of the same range might happen
> +	 * after mmu notifier invalidation over the range. The exporting device
> +	 * can use this to move things around (defrag BAR space for instance)
> +	 * or do other similar task.
> +	 *
> +	 * IMPORTER MUST OBEY mmu_notifier NOTIFICATION AND CALL p2p_unmap()
> +	 * WHEN A NOTIFIER IS CALL FOR THE RANGE ! THIS CAN HAPPEN AT ANY
> +	 * POINT IN TIME WITH NO LOCK HELD.
> +	 *
> +	 * In below function, the device argument is the importing device,
> +	 * the exporting device is the device to which the vma belongs.
> +	 */
> +	long (*p2p_map)(struct vm_area_struct *vma,
> +			struct device *device,
> +			unsigned long start,
> +			unsigned long end,
> +			dma_addr_t *pa,
> +			bool write);
> +	long (*p2p_unmap)(struct vm_area_struct *vma,
> +			  struct device *device,
> +			  unsigned long start,
> +			  unsigned long end,
> +			  dma_addr_t *pa);

I don't understand why we need new p2p_[un]map function pointers for
this. In subsequent patches, they never appear to be set anywhere and
are only called by the HMM code. I'd have expected it to be called by
some core VMA code and set by HMM as that's what vm_operations_struct is
for.

But the code as all very confusing, hard to follow and seems to be
missing significant chunks. So I'm not really sure what is going on.

Logan
Jerome Glisse Jan. 29, 2019, 7:11 p.m. UTC | #2
On Tue, Jan 29, 2019 at 11:36:29AM -0700, Logan Gunthorpe wrote:
> 
> 
> On 2019-01-29 10:47 a.m., jglisse@redhat.com wrote:
> 
> > +	/*
> > +	 * Optional for device driver that want to allow peer to peer (p2p)
> > +	 * mapping of their vma (which can be back by some device memory) to
> > +	 * another device.
> > +	 *
> > +	 * Note that the exporting device driver might not have map anything
> > +	 * inside the vma for the CPU but might still want to allow a peer
> > +	 * device to access the range of memory corresponding to a range in
> > +	 * that vma.
> > +	 *
> > +	 * FOR PREDICTABILITY IF DRIVER SUCCESSFULY MAP A RANGE ONCE FOR A
> > +	 * DEVICE THEN FURTHER MAPPING OF THE SAME IF THE VMA IS STILL VALID
> > +	 * SHOULD ALSO BE SUCCESSFUL. Following this rule allow the importing
> > +	 * device to map once during setup and report any failure at that time
> > +	 * to the userspace. Further mapping of the same range might happen
> > +	 * after mmu notifier invalidation over the range. The exporting device
> > +	 * can use this to move things around (defrag BAR space for instance)
> > +	 * or do other similar task.
> > +	 *
> > +	 * IMPORTER MUST OBEY mmu_notifier NOTIFICATION AND CALL p2p_unmap()
> > +	 * WHEN A NOTIFIER IS CALL FOR THE RANGE ! THIS CAN HAPPEN AT ANY
> > +	 * POINT IN TIME WITH NO LOCK HELD.
> > +	 *
> > +	 * In below function, the device argument is the importing device,
> > +	 * the exporting device is the device to which the vma belongs.
> > +	 */
> > +	long (*p2p_map)(struct vm_area_struct *vma,
> > +			struct device *device,
> > +			unsigned long start,
> > +			unsigned long end,
> > +			dma_addr_t *pa,
> > +			bool write);
> > +	long (*p2p_unmap)(struct vm_area_struct *vma,
> > +			  struct device *device,
> > +			  unsigned long start,
> > +			  unsigned long end,
> > +			  dma_addr_t *pa);
> 
> I don't understand why we need new p2p_[un]map function pointers for
> this. In subsequent patches, they never appear to be set anywhere and
> are only called by the HMM code. I'd have expected it to be called by
> some core VMA code and set by HMM as that's what vm_operations_struct is
> for.
> 
> But the code as all very confusing, hard to follow and seems to be
> missing significant chunks. So I'm not really sure what is going on.

It is set by device driver when userspace do mmap(fd) where fd comes
from open("/dev/somedevicefile"). So it is set by device driver. HMM
has nothing to do with this. It must be set by device driver mmap
call back (mmap callback of struct file_operations). For this patch
you can completely ignore all the HMM patches. Maybe posting this as
2 separate patchset would make it clearer.

For instance see [1] for how a non HMM driver can export its memory
by just setting those callback. Note that a proper implementation of
this should also include some kind of driver policy on what to allow
to map and what to not allow ... All this is driver specific in any
way.

Cheers,
Jérôme

[1] https://cgit.freedesktop.org/~glisse/linux/commit/?h=wip-p2p-showcase&id=964214dcd4df96f296e2214042e8cfce135ae3d4
Logan Gunthorpe Jan. 29, 2019, 7:24 p.m. UTC | #3
On 2019-01-29 12:11 p.m., Jerome Glisse wrote:
> On Tue, Jan 29, 2019 at 11:36:29AM -0700, Logan Gunthorpe wrote:
>>
>>
>> On 2019-01-29 10:47 a.m., jglisse@redhat.com wrote:
>>
>>> +	/*
>>> +	 * Optional for device driver that want to allow peer to peer (p2p)
>>> +	 * mapping of their vma (which can be back by some device memory) to
>>> +	 * another device.
>>> +	 *
>>> +	 * Note that the exporting device driver might not have map anything
>>> +	 * inside the vma for the CPU but might still want to allow a peer
>>> +	 * device to access the range of memory corresponding to a range in
>>> +	 * that vma.
>>> +	 *
>>> +	 * FOR PREDICTABILITY IF DRIVER SUCCESSFULY MAP A RANGE ONCE FOR A
>>> +	 * DEVICE THEN FURTHER MAPPING OF THE SAME IF THE VMA IS STILL VALID
>>> +	 * SHOULD ALSO BE SUCCESSFUL. Following this rule allow the importing
>>> +	 * device to map once during setup and report any failure at that time
>>> +	 * to the userspace. Further mapping of the same range might happen
>>> +	 * after mmu notifier invalidation over the range. The exporting device
>>> +	 * can use this to move things around (defrag BAR space for instance)
>>> +	 * or do other similar task.
>>> +	 *
>>> +	 * IMPORTER MUST OBEY mmu_notifier NOTIFICATION AND CALL p2p_unmap()
>>> +	 * WHEN A NOTIFIER IS CALL FOR THE RANGE ! THIS CAN HAPPEN AT ANY
>>> +	 * POINT IN TIME WITH NO LOCK HELD.
>>> +	 *
>>> +	 * In below function, the device argument is the importing device,
>>> +	 * the exporting device is the device to which the vma belongs.
>>> +	 */
>>> +	long (*p2p_map)(struct vm_area_struct *vma,
>>> +			struct device *device,
>>> +			unsigned long start,
>>> +			unsigned long end,
>>> +			dma_addr_t *pa,
>>> +			bool write);
>>> +	long (*p2p_unmap)(struct vm_area_struct *vma,
>>> +			  struct device *device,
>>> +			  unsigned long start,
>>> +			  unsigned long end,
>>> +			  dma_addr_t *pa);
>>
>> I don't understand why we need new p2p_[un]map function pointers for
>> this. In subsequent patches, they never appear to be set anywhere and
>> are only called by the HMM code. I'd have expected it to be called by
>> some core VMA code and set by HMM as that's what vm_operations_struct is
>> for.
>>
>> But the code as all very confusing, hard to follow and seems to be
>> missing significant chunks. So I'm not really sure what is going on.
> 
> It is set by device driver when userspace do mmap(fd) where fd comes
> from open("/dev/somedevicefile"). So it is set by device driver. HMM
> has nothing to do with this. It must be set by device driver mmap
> call back (mmap callback of struct file_operations). For this patch
> you can completely ignore all the HMM patches. Maybe posting this as
> 2 separate patchset would make it clearer.
> 
> For instance see [1] for how a non HMM driver can export its memory
> by just setting those callback. Note that a proper implementation of
> this should also include some kind of driver policy on what to allow
> to map and what to not allow ... All this is driver specific in any
> way.

I'd suggest [1] should be a part of the patchset so we can actually see
a user of the stuff you're adding.

But it still doesn't explain everything as without the HMM code nothing
calls the new vm_ops. And there's still no callers for the p2p_test
functions you added. And I still don't understand why we need the new
vm_ops or who calls them and when. Why can't drivers use the existing
'fault' vm_op and call a new helper function to map p2p when appropriate
or a different helper function to map a large range in its mmap
operation? Just like regular mmap code...

Logan
Jason Gunthorpe Jan. 29, 2019, 7:32 p.m. UTC | #4
On Tue, Jan 29, 2019 at 02:11:23PM -0500, Jerome Glisse wrote:
> On Tue, Jan 29, 2019 at 11:36:29AM -0700, Logan Gunthorpe wrote:
> > 
> > 
> > On 2019-01-29 10:47 a.m., jglisse@redhat.com wrote:
> > 
> > > +	/*
> > > +	 * Optional for device driver that want to allow peer to peer (p2p)
> > > +	 * mapping of their vma (which can be back by some device memory) to
> > > +	 * another device.
> > > +	 *
> > > +	 * Note that the exporting device driver might not have map anything
> > > +	 * inside the vma for the CPU but might still want to allow a peer
> > > +	 * device to access the range of memory corresponding to a range in
> > > +	 * that vma.
> > > +	 *
> > > +	 * FOR PREDICTABILITY IF DRIVER SUCCESSFULY MAP A RANGE ONCE FOR A
> > > +	 * DEVICE THEN FURTHER MAPPING OF THE SAME IF THE VMA IS STILL VALID
> > > +	 * SHOULD ALSO BE SUCCESSFUL. Following this rule allow the importing
> > > +	 * device to map once during setup and report any failure at that time
> > > +	 * to the userspace. Further mapping of the same range might happen
> > > +	 * after mmu notifier invalidation over the range. The exporting device
> > > +	 * can use this to move things around (defrag BAR space for instance)
> > > +	 * or do other similar task.
> > > +	 *
> > > +	 * IMPORTER MUST OBEY mmu_notifier NOTIFICATION AND CALL p2p_unmap()
> > > +	 * WHEN A NOTIFIER IS CALL FOR THE RANGE ! THIS CAN HAPPEN AT ANY
> > > +	 * POINT IN TIME WITH NO LOCK HELD.
> > > +	 *
> > > +	 * In below function, the device argument is the importing device,
> > > +	 * the exporting device is the device to which the vma belongs.
> > > +	 */
> > > +	long (*p2p_map)(struct vm_area_struct *vma,
> > > +			struct device *device,
> > > +			unsigned long start,
> > > +			unsigned long end,
> > > +			dma_addr_t *pa,
> > > +			bool write);
> > > +	long (*p2p_unmap)(struct vm_area_struct *vma,
> > > +			  struct device *device,
> > > +			  unsigned long start,
> > > +			  unsigned long end,
> > > +			  dma_addr_t *pa);
> > 
> > I don't understand why we need new p2p_[un]map function pointers for
> > this. In subsequent patches, they never appear to be set anywhere and
> > are only called by the HMM code. I'd have expected it to be called by
> > some core VMA code and set by HMM as that's what vm_operations_struct is
> > for.
> > 
> > But the code as all very confusing, hard to follow and seems to be
> > missing significant chunks. So I'm not really sure what is going on.
> 
> It is set by device driver when userspace do mmap(fd) where fd comes
> from open("/dev/somedevicefile"). So it is set by device driver. HMM
> has nothing to do with this. It must be set by device driver mmap
> call back (mmap callback of struct file_operations). For this patch
> you can completely ignore all the HMM patches. Maybe posting this as
> 2 separate patchset would make it clearer.
> 
> For instance see [1] for how a non HMM driver can export its memory
> by just setting those callback. Note that a proper implementation of
> this should also include some kind of driver policy on what to allow
> to map and what to not allow ... All this is driver specific in any
> way.

I'm imagining that the RDMA drivers would use this interface on their
per-process 'doorbell' BAR pages - we also wish to have P2P DMA to
this memory. Also the entire VFIO PCI BAR mmap would be good to cover
with this too.

Jerome, I think it would be nice to have a helper scheme - I think the
simple case would be simple remapping of PCI BAR memory, so if we
could have, say something like:

static const struct vm_operations_struct my_ops {
  .p2p_map = p2p_ioremap_map_op,
  .p2p_unmap = p2p_ioremap_unmap_op,
}

struct ioremap_data {
  [..]
}

fops_mmap() {
   vma->private_data = &driver_priv->ioremap_data;
   return p2p_ioremap_device_memory(vma, exporting_device, [..]);
}

Which closely matches at least what the RDMA drivers do. Where
p2p_ioremap_device_memory populates p2p_map and p2p_unmap pointers
with sensible functions, etc.

It looks like vfio would be able to use this as well (though I am
unsure why vfio uses remap_pfn_range instead of io_remap_pfn range for
BAR memory..)

Do any drivers need more control than this?

Jason
Jerome Glisse Jan. 29, 2019, 7:44 p.m. UTC | #5
On Tue, Jan 29, 2019 at 12:24:04PM -0700, Logan Gunthorpe wrote:
> 
> 
> On 2019-01-29 12:11 p.m., Jerome Glisse wrote:
> > On Tue, Jan 29, 2019 at 11:36:29AM -0700, Logan Gunthorpe wrote:
> >>
> >>
> >> On 2019-01-29 10:47 a.m., jglisse@redhat.com wrote:
> >>
> >>> +	/*
> >>> +	 * Optional for device driver that want to allow peer to peer (p2p)
> >>> +	 * mapping of their vma (which can be back by some device memory) to
> >>> +	 * another device.
> >>> +	 *
> >>> +	 * Note that the exporting device driver might not have map anything
> >>> +	 * inside the vma for the CPU but might still want to allow a peer
> >>> +	 * device to access the range of memory corresponding to a range in
> >>> +	 * that vma.
> >>> +	 *
> >>> +	 * FOR PREDICTABILITY IF DRIVER SUCCESSFULY MAP A RANGE ONCE FOR A
> >>> +	 * DEVICE THEN FURTHER MAPPING OF THE SAME IF THE VMA IS STILL VALID
> >>> +	 * SHOULD ALSO BE SUCCESSFUL. Following this rule allow the importing
> >>> +	 * device to map once during setup and report any failure at that time
> >>> +	 * to the userspace. Further mapping of the same range might happen
> >>> +	 * after mmu notifier invalidation over the range. The exporting device
> >>> +	 * can use this to move things around (defrag BAR space for instance)
> >>> +	 * or do other similar task.
> >>> +	 *
> >>> +	 * IMPORTER MUST OBEY mmu_notifier NOTIFICATION AND CALL p2p_unmap()
> >>> +	 * WHEN A NOTIFIER IS CALL FOR THE RANGE ! THIS CAN HAPPEN AT ANY
> >>> +	 * POINT IN TIME WITH NO LOCK HELD.
> >>> +	 *
> >>> +	 * In below function, the device argument is the importing device,
> >>> +	 * the exporting device is the device to which the vma belongs.
> >>> +	 */
> >>> +	long (*p2p_map)(struct vm_area_struct *vma,
> >>> +			struct device *device,
> >>> +			unsigned long start,
> >>> +			unsigned long end,
> >>> +			dma_addr_t *pa,
> >>> +			bool write);
> >>> +	long (*p2p_unmap)(struct vm_area_struct *vma,
> >>> +			  struct device *device,
> >>> +			  unsigned long start,
> >>> +			  unsigned long end,
> >>> +			  dma_addr_t *pa);
> >>
> >> I don't understand why we need new p2p_[un]map function pointers for
> >> this. In subsequent patches, they never appear to be set anywhere and
> >> are only called by the HMM code. I'd have expected it to be called by
> >> some core VMA code and set by HMM as that's what vm_operations_struct is
> >> for.
> >>
> >> But the code as all very confusing, hard to follow and seems to be
> >> missing significant chunks. So I'm not really sure what is going on.
> > 
> > It is set by device driver when userspace do mmap(fd) where fd comes
> > from open("/dev/somedevicefile"). So it is set by device driver. HMM
> > has nothing to do with this. It must be set by device driver mmap
> > call back (mmap callback of struct file_operations). For this patch
> > you can completely ignore all the HMM patches. Maybe posting this as
> > 2 separate patchset would make it clearer.
> > 
> > For instance see [1] for how a non HMM driver can export its memory
> > by just setting those callback. Note that a proper implementation of
> > this should also include some kind of driver policy on what to allow
> > to map and what to not allow ... All this is driver specific in any
> > way.
> 
> I'd suggest [1] should be a part of the patchset so we can actually see
> a user of the stuff you're adding.

I did not wanted to clutter patchset with device driver specific usage
of this. As the API can be reason about in abstract way.

> 
> But it still doesn't explain everything as without the HMM code nothing
> calls the new vm_ops. And there's still no callers for the p2p_test
> functions you added. And I still don't understand why we need the new
> vm_ops or who calls them and when. Why can't drivers use the existing
> 'fault' vm_op and call a new helper function to map p2p when appropriate
> or a different helper function to map a large range in its mmap
> operation? Just like regular mmap code...

HMM code is just one user, if you have a driver that use HMM mirror
then your driver get support for this for free. If you do not want to
use HMM then you can directly call this in your driver.

The flow is, device driver want to setup some mapping for a range of
virtual address [va_start, va_end]:
    1 - Lookup vma for the range
    2 - If vma is regular vma (not an mmap of a file) then use GUP
        If vma is a mmap of a file and they are p2p_map call back
        then call p2p_map()
    3 - Use either the result of GUP or p2p_map() to program the
        device

The p2p test function is use by device driver implementing the call
back for instance see:

https://cgit.freedesktop.org/~glisse/linux/commit/?h=hmm-p2p&id=401a567696eafb1d4faf7054ab0d7c3a16a5ef06

The vm_fault callback is not suited because here we are mapping to
another device so this will need special handling, someone must have
both device struct pointer in end and someone must be allow to make
decission on what to allow and what not to allow.

Moreover exporting driver like GPU driver might have complex policy
in place for which they will only allow export of some memory to a
peer device but not other.

In the end it means that it easier and simpler to add new call back
specificaly for that, so the intention is clear to both the caller
and the callee. The exporting device can then do the proper check
using the core helper (ie checking that the device can actually peer
to each other) and if that works it can then decide wether or not
it wants to allow this other device to access its memory or if it
prefer to use main memory for this.

To add this to the fault callback we would need to define a bunch
of new flags, setup fake page table so that we can populate pte and
then have the importing device re-interpret everything specialy
because it comes from another device. It would look ugly and it
would need to modify bunch of core mm code.

Note that this callback solution also allow an exporting device to
allow peer access while CPU can not access the memory ie the pte
entry for the range are pte_none.

Cheers,
Jérôme
Jerome Glisse Jan. 29, 2019, 7:50 p.m. UTC | #6
On Tue, Jan 29, 2019 at 07:32:57PM +0000, Jason Gunthorpe wrote:
> On Tue, Jan 29, 2019 at 02:11:23PM -0500, Jerome Glisse wrote:
> > On Tue, Jan 29, 2019 at 11:36:29AM -0700, Logan Gunthorpe wrote:
> > > 
> > > 
> > > On 2019-01-29 10:47 a.m., jglisse@redhat.com wrote:
> > > 
> > > > +	/*
> > > > +	 * Optional for device driver that want to allow peer to peer (p2p)
> > > > +	 * mapping of their vma (which can be back by some device memory) to
> > > > +	 * another device.
> > > > +	 *
> > > > +	 * Note that the exporting device driver might not have map anything
> > > > +	 * inside the vma for the CPU but might still want to allow a peer
> > > > +	 * device to access the range of memory corresponding to a range in
> > > > +	 * that vma.
> > > > +	 *
> > > > +	 * FOR PREDICTABILITY IF DRIVER SUCCESSFULY MAP A RANGE ONCE FOR A
> > > > +	 * DEVICE THEN FURTHER MAPPING OF THE SAME IF THE VMA IS STILL VALID
> > > > +	 * SHOULD ALSO BE SUCCESSFUL. Following this rule allow the importing
> > > > +	 * device to map once during setup and report any failure at that time
> > > > +	 * to the userspace. Further mapping of the same range might happen
> > > > +	 * after mmu notifier invalidation over the range. The exporting device
> > > > +	 * can use this to move things around (defrag BAR space for instance)
> > > > +	 * or do other similar task.
> > > > +	 *
> > > > +	 * IMPORTER MUST OBEY mmu_notifier NOTIFICATION AND CALL p2p_unmap()
> > > > +	 * WHEN A NOTIFIER IS CALL FOR THE RANGE ! THIS CAN HAPPEN AT ANY
> > > > +	 * POINT IN TIME WITH NO LOCK HELD.
> > > > +	 *
> > > > +	 * In below function, the device argument is the importing device,
> > > > +	 * the exporting device is the device to which the vma belongs.
> > > > +	 */
> > > > +	long (*p2p_map)(struct vm_area_struct *vma,
> > > > +			struct device *device,
> > > > +			unsigned long start,
> > > > +			unsigned long end,
> > > > +			dma_addr_t *pa,
> > > > +			bool write);
> > > > +	long (*p2p_unmap)(struct vm_area_struct *vma,
> > > > +			  struct device *device,
> > > > +			  unsigned long start,
> > > > +			  unsigned long end,
> > > > +			  dma_addr_t *pa);
> > > 
> > > I don't understand why we need new p2p_[un]map function pointers for
> > > this. In subsequent patches, they never appear to be set anywhere and
> > > are only called by the HMM code. I'd have expected it to be called by
> > > some core VMA code and set by HMM as that's what vm_operations_struct is
> > > for.
> > > 
> > > But the code as all very confusing, hard to follow and seems to be
> > > missing significant chunks. So I'm not really sure what is going on.
> > 
> > It is set by device driver when userspace do mmap(fd) where fd comes
> > from open("/dev/somedevicefile"). So it is set by device driver. HMM
> > has nothing to do with this. It must be set by device driver mmap
> > call back (mmap callback of struct file_operations). For this patch
> > you can completely ignore all the HMM patches. Maybe posting this as
> > 2 separate patchset would make it clearer.
> > 
> > For instance see [1] for how a non HMM driver can export its memory
> > by just setting those callback. Note that a proper implementation of
> > this should also include some kind of driver policy on what to allow
> > to map and what to not allow ... All this is driver specific in any
> > way.
> 
> I'm imagining that the RDMA drivers would use this interface on their
> per-process 'doorbell' BAR pages - we also wish to have P2P DMA to
> this memory. Also the entire VFIO PCI BAR mmap would be good to cover
> with this too.

Correct, you would set those callback on the mmap of your doorbell.

> 
> Jerome, I think it would be nice to have a helper scheme - I think the
> simple case would be simple remapping of PCI BAR memory, so if we
> could have, say something like:
> 
> static const struct vm_operations_struct my_ops {
>   .p2p_map = p2p_ioremap_map_op,
>   .p2p_unmap = p2p_ioremap_unmap_op,
> }
> 
> struct ioremap_data {
>   [..]
> }
> 
> fops_mmap() {
>    vma->private_data = &driver_priv->ioremap_data;
>    return p2p_ioremap_device_memory(vma, exporting_device, [..]);
> }
> 
> Which closely matches at least what the RDMA drivers do. Where
> p2p_ioremap_device_memory populates p2p_map and p2p_unmap pointers
> with sensible functions, etc.
> 
> It looks like vfio would be able to use this as well (though I am
> unsure why vfio uses remap_pfn_range instead of io_remap_pfn range for
> BAR memory..)

Yes simple helper that implement a sane default implementation is
definitly a good idea. As i was working with GPU it was not something
that immediatly poped to mind (see below). But i can certainly do
a sane set of default helper that simple device driver can use right
away without too much thinking on there part. I will add this for
next posting.

> Do any drivers need more control than this?

GPU driver do want more control :) GPU driver are moving things around
all the time and they have more memory than bar space (on newer platform
AMD GPU do resize the bar but it is not the rule for all GPUs). So
GPU driver do actualy manage their BAR address space and they map and
unmap thing there. They can not allow someone to just pin stuff there
randomly or this would disrupt their regular work flow. Hence they need
control and they might implement threshold for instance if they have
more than N pages of bar space map for peer to peer then they can decide
to fall back to main memory for any new peer mapping.

Cheers,
Jérôme
Jason Gunthorpe Jan. 29, 2019, 8:24 p.m. UTC | #7
On Tue, Jan 29, 2019 at 02:50:55PM -0500, Jerome Glisse wrote:

> GPU driver do want more control :) GPU driver are moving things around
> all the time and they have more memory than bar space (on newer platform
> AMD GPU do resize the bar but it is not the rule for all GPUs). So
> GPU driver do actualy manage their BAR address space and they map and
> unmap thing there. They can not allow someone to just pin stuff there
> randomly or this would disrupt their regular work flow. Hence they need
> control and they might implement threshold for instance if they have
> more than N pages of bar space map for peer to peer then they can decide
> to fall back to main memory for any new peer mapping.

But this API doesn't seem to offer any control - I thought that
control was all coming from the mm/hmm notifiers triggering p2p_unmaps?

I would think that the importing driver can assume the BAR page is
kept alive until it calls unmap (presumably triggered by notifiers)?

ie the exporting driver sees the BAR page as pinned until unmap.

Jason
Logan Gunthorpe Jan. 29, 2019, 8:39 p.m. UTC | #8
On 2019-01-29 12:32 p.m., Jason Gunthorpe wrote:
> Jerome, I think it would be nice to have a helper scheme - I think the
> simple case would be simple remapping of PCI BAR memory, so if we
> could have, say something like:
> 
> static const struct vm_operations_struct my_ops {
>   .p2p_map = p2p_ioremap_map_op,
>   .p2p_unmap = p2p_ioremap_unmap_op,
> }
> 
> struct ioremap_data {
>   [..]
> }
> 
> fops_mmap() {
>    vma->private_data = &driver_priv->ioremap_data;
>    return p2p_ioremap_device_memory(vma, exporting_device, [..]);
> }

This is roughly what I was expecting, except I don't see exactly what
the p2p_map and p2p_unmap callbacks are for. The importing driver should
see p2pdma/hmm struct pages and use the appropriate function to map
them. It shouldn't be the responsibility of the exporting driver to
implement the mapping. And I don't think we should have 'special' vma's
for this (though we may need something to ensure we don't get mapping
requests mixed with different types of pages...).

I also figured there'd be a fault version of p2p_ioremap_device_memory()
for when you are mapping P2P memory and you want to assign the pages
lazily. Though, this can come later when someone wants to implement that.

Logan
Logan Gunthorpe Jan. 29, 2019, 8:43 p.m. UTC | #9
On 2019-01-29 12:44 p.m., Jerome Glisse wrote:
>> I'd suggest [1] should be a part of the patchset so we can actually see
>> a user of the stuff you're adding.
> 
> I did not wanted to clutter patchset with device driver specific usage
> of this. As the API can be reason about in abstract way.

It's hard to reason about an interface when you can't see what all the
layers want to do with it. Most maintainers (I'd hope) would certainly
never merge code that has no callers, and for much the same reason, I'd
rather not review patches that don't have real use case examples.

Logan
Jerome Glisse Jan. 29, 2019, 8:44 p.m. UTC | #10
On Tue, Jan 29, 2019 at 08:24:36PM +0000, Jason Gunthorpe wrote:
> On Tue, Jan 29, 2019 at 02:50:55PM -0500, Jerome Glisse wrote:
> 
> > GPU driver do want more control :) GPU driver are moving things around
> > all the time and they have more memory than bar space (on newer platform
> > AMD GPU do resize the bar but it is not the rule for all GPUs). So
> > GPU driver do actualy manage their BAR address space and they map and
> > unmap thing there. They can not allow someone to just pin stuff there
> > randomly or this would disrupt their regular work flow. Hence they need
> > control and they might implement threshold for instance if they have
> > more than N pages of bar space map for peer to peer then they can decide
> > to fall back to main memory for any new peer mapping.
> 
> But this API doesn't seem to offer any control - I thought that
> control was all coming from the mm/hmm notifiers triggering p2p_unmaps?

The control is within the driver implementation of those callbacks. So
driver implementation can refuse to map by returning an error on p2p_map
or it can decide to use main memory by migrating its object to main memory
and populating the dma address array with dma_page_map() of the main memory
pages. Driver like GPU can have policy on top of that for instance they
will only allow p2p map to succeed for objects that have been tagged by the
userspace in some way ie the userspace application is in control of what
can be map to peer device. This is needed for GPU driver as we do want
userspace involvement on what object are allowed to have p2p access and
also so that we can report to userspace when we are running out of BAR
addresses for this to work as intended (ie not falling back to main memory)
so that application can take appropriate actions (like decide what to
prioritize).

For moving things around after a successful p2p_map yes the exporting
device have to call for instance zap_vma_ptes() or something similar.
This will trigger notifier call and the importing device will invalidate
its mapping. Once it is invalidated then the exporting device can
point new call of p2p_map (for the same range) to new memory (obviously
the exporting device have to synchronize any concurrent call to p2p_map
with the invalidation).

> 
> I would think that the importing driver can assume the BAR page is
> kept alive until it calls unmap (presumably triggered by notifiers)?
> 
> ie the exporting driver sees the BAR page as pinned until unmap.

The intention with this patchset is that it is not pin ie the importer
device _must_ abide by all mmu notifier invalidations and they can
happen at anytime. The importing device can however re-p2p_map the
same range after an invalidation.

I would like to restrict this to importer that can invalidate for
now because i believe all the first device to use can support the
invalidation.

Also when using HMM private device memory we _can not_ pin virtual
address to device memory as otherwise CPU access would have to SIGBUS
or SEGFAULT and we do not want that. So this was also a motivation to
keep thing consistent for the importer for both cases.

Cheers,
Jérôme
Jerome Glisse Jan. 29, 2019, 8:57 p.m. UTC | #11
On Tue, Jan 29, 2019 at 01:39:49PM -0700, Logan Gunthorpe wrote:
> 
> 
> On 2019-01-29 12:32 p.m., Jason Gunthorpe wrote:
> > Jerome, I think it would be nice to have a helper scheme - I think the
> > simple case would be simple remapping of PCI BAR memory, so if we
> > could have, say something like:
> > 
> > static const struct vm_operations_struct my_ops {
> >   .p2p_map = p2p_ioremap_map_op,
> >   .p2p_unmap = p2p_ioremap_unmap_op,
> > }
> > 
> > struct ioremap_data {
> >   [..]
> > }
> > 
> > fops_mmap() {
> >    vma->private_data = &driver_priv->ioremap_data;
> >    return p2p_ioremap_device_memory(vma, exporting_device, [..]);
> > }
> 
> This is roughly what I was expecting, except I don't see exactly what
> the p2p_map and p2p_unmap callbacks are for. The importing driver should
> see p2pdma/hmm struct pages and use the appropriate function to map
> them. It shouldn't be the responsibility of the exporting driver to
> implement the mapping. And I don't think we should have 'special' vma's
> for this (though we may need something to ensure we don't get mapping
> requests mixed with different types of pages...).

GPU driver must be in control and must be call to. Here there is 2 cases
in this patchset and i should have instead posted 2 separate patchset as
it seems that it is confusing things.

For the HMM page, the physical address of the page ie the pfn does not
correspond to anything ie there is nothing behind it. So the importing
device has no idea how to get a valid physical address from an HMM page
only the device driver exporting its memory with HMM device memory knows
that.


For the special vma ie mmap of a device file. GPU driver do manage their
BAR ie the GPU have a page table that map BAR page to GPU memory and the
driver _constantly_ update this page table, it is reflected by invalidating
the CPU mapping. In fact most of the time the CPU mapping of GPU object are
invalid they are valid only a small fraction of their lifetime. So you
_must_ have some call to inform the exporting device driver that another
device would like to map one of its vma. The exporting device can then
try to avoid as much churn as possible for the importing device. But this
has consequence and the exporting device driver must be allow to apply
policy and make decission on wether or not it authorize the other device
to peer map its memory. For GPU the userspace application have to call
specific API that translate into specific ioctl which themself set flags
on object (in the kernel struct tracking the user space object). The only
way to allow program predictability is if the application can ask and know
if it can peer export an object (ie is there enough BAR space left).

Moreover i would like to be able to use this API between GPUs that are
inter-connected between each other and for those the CPU page table are
just invalid and the physical address to use are only meaning full to the
exporting and importing device. So again here core kernel has no idea of
what the physical address would be.


So in both cases, at very least for GPU, we do want total control to be
given to the exporter.

> 
> I also figured there'd be a fault version of p2p_ioremap_device_memory()
> for when you are mapping P2P memory and you want to assign the pages
> lazily. Though, this can come later when someone wants to implement that.

For GPU the BAR address space is manage page by page and thus you do not
want to map a range of BAR addresses but you want to allow mapping of
multiple page of BAR address that are not adjacent to each other nor
ordered in anyway. But providing helper for simpler device does make sense.

Cheers,
Jérôme
Jason Gunthorpe Jan. 29, 2019, 8:58 p.m. UTC | #12
On Tue, Jan 29, 2019 at 01:39:49PM -0700, Logan Gunthorpe wrote:

> implement the mapping. And I don't think we should have 'special' vma's
> for this (though we may need something to ensure we don't get mapping
> requests mixed with different types of pages...).

I think Jerome explained the point here is to have a 'special vma'
rather than a 'special struct page' as, really, we don't need a
struct page at all to make this work.

If I recall your earlier attempts at adding struct page for BAR
memory, it ran aground on issues related to O_DIRECT/sgls, etc, etc.

This does seem to avoid that pitfall entirely as we can never
accidently get into the SGL system with this kind of memory or VMA?

Jason
Logan Gunthorpe Jan. 29, 2019, 9:30 p.m. UTC | #13
On 2019-01-29 1:57 p.m., Jerome Glisse wrote:
> GPU driver must be in control and must be call to. Here there is 2 cases
> in this patchset and i should have instead posted 2 separate patchset as
> it seems that it is confusing things.
> 
> For the HMM page, the physical address of the page ie the pfn does not
> correspond to anything ie there is nothing behind it. So the importing
> device has no idea how to get a valid physical address from an HMM page
> only the device driver exporting its memory with HMM device memory knows
> that.
> 
> 
> For the special vma ie mmap of a device file. GPU driver do manage their
> BAR ie the GPU have a page table that map BAR page to GPU memory and the
> driver _constantly_ update this page table, it is reflected by invalidating
> the CPU mapping. In fact most of the time the CPU mapping of GPU object are
> invalid they are valid only a small fraction of their lifetime. So you
> _must_ have some call to inform the exporting device driver that another
> device would like to map one of its vma. The exporting device can then
> try to avoid as much churn as possible for the importing device. But this
> has consequence and the exporting device driver must be allow to apply
> policy and make decission on wether or not it authorize the other device
> to peer map its memory. For GPU the userspace application have to call
> specific API that translate into specific ioctl which themself set flags
> on object (in the kernel struct tracking the user space object). The only
> way to allow program predictability is if the application can ask and know
> if it can peer export an object (ie is there enough BAR space left).

This all seems like it's an HMM problem and not related to mapping
BARs/"potential BARs" to userspace. If some code wants to DMA map HMM
pages, it calls an HMM function to map them. If HMM needs to consult
with the driver on aspects of how that's mapped, then that's between HMM
and the driver and not something I really care about. But making the
entire mapping stuff tied to userspace VMAs does not make sense to me.
What if somebody wants to map some HMM pages in the same way but from
kernel space and they therefore don't have a VMA?


>> I also figured there'd be a fault version of p2p_ioremap_device_memory()
>> for when you are mapping P2P memory and you want to assign the pages
>> lazily. Though, this can come later when someone wants to implement that.
> 
> For GPU the BAR address space is manage page by page and thus you do not
> want to map a range of BAR addresses but you want to allow mapping of
> multiple page of BAR address that are not adjacent to each other nor
> ordered in anyway. But providing helper for simpler device does make sense.

Well, this has little do with the backing device but how the memory is
mapped into userspace. With p2p_ioremap_device_memory() the entire range
is mapped into the userspace VMA immediately during the call to mmap().
With p2p_fault_device_memory(), mmap() would not actually map anything
and a page in the VMA would be mapped only when userspace accesses it
(using fault()). It seems to me like GPUs would prefer the latter but if
HMM takes care of the mapping from userspace potential pages to actual
GPU pages through the BAR then that may not be true.

Logan
Jerome Glisse Jan. 29, 2019, 9:50 p.m. UTC | #14
On Tue, Jan 29, 2019 at 02:30:49PM -0700, Logan Gunthorpe wrote:
> 
> 
> On 2019-01-29 1:57 p.m., Jerome Glisse wrote:
> > GPU driver must be in control and must be call to. Here there is 2 cases
> > in this patchset and i should have instead posted 2 separate patchset as
> > it seems that it is confusing things.
> > 
> > For the HMM page, the physical address of the page ie the pfn does not
> > correspond to anything ie there is nothing behind it. So the importing
> > device has no idea how to get a valid physical address from an HMM page
> > only the device driver exporting its memory with HMM device memory knows
> > that.
> > 
> > 
> > For the special vma ie mmap of a device file. GPU driver do manage their
> > BAR ie the GPU have a page table that map BAR page to GPU memory and the
> > driver _constantly_ update this page table, it is reflected by invalidating
> > the CPU mapping. In fact most of the time the CPU mapping of GPU object are
> > invalid they are valid only a small fraction of their lifetime. So you
> > _must_ have some call to inform the exporting device driver that another
> > device would like to map one of its vma. The exporting device can then
> > try to avoid as much churn as possible for the importing device. But this
> > has consequence and the exporting device driver must be allow to apply
> > policy and make decission on wether or not it authorize the other device
> > to peer map its memory. For GPU the userspace application have to call
> > specific API that translate into specific ioctl which themself set flags
> > on object (in the kernel struct tracking the user space object). The only
> > way to allow program predictability is if the application can ask and know
> > if it can peer export an object (ie is there enough BAR space left).
> 
> This all seems like it's an HMM problem and not related to mapping
> BARs/"potential BARs" to userspace. If some code wants to DMA map HMM
> pages, it calls an HMM function to map them. If HMM needs to consult
> with the driver on aspects of how that's mapped, then that's between HMM
> and the driver and not something I really care about. But making the
> entire mapping stuff tied to userspace VMAs does not make sense to me.
> What if somebody wants to map some HMM pages in the same way but from
> kernel space and they therefore don't have a VMA?

No this is the non HMM case i am talking about here. Fully ignore HMM
in this frame. A GPU driver that do not support or use HMM in anyway
has all the properties and requirement i do list above. So all the points
i was making are without HMM in the picture whatsoever. I should have
posted this a separate patches to avoid this confusion.

Regarding your HMM question. You can not map HMM pages, all code path
that would try that would trigger a migration back to regular memory
and will use the regular memory for CPU access.


> >> I also figured there'd be a fault version of p2p_ioremap_device_memory()
> >> for when you are mapping P2P memory and you want to assign the pages
> >> lazily. Though, this can come later when someone wants to implement that.
> > 
> > For GPU the BAR address space is manage page by page and thus you do not
> > want to map a range of BAR addresses but you want to allow mapping of
> > multiple page of BAR address that are not adjacent to each other nor
> > ordered in anyway. But providing helper for simpler device does make sense.
> 
> Well, this has little do with the backing device but how the memory is
> mapped into userspace. With p2p_ioremap_device_memory() the entire range
> is mapped into the userspace VMA immediately during the call to mmap().
> With p2p_fault_device_memory(), mmap() would not actually map anything
> and a page in the VMA would be mapped only when userspace accesses it
> (using fault()). It seems to me like GPUs would prefer the latter but if
> HMM takes care of the mapping from userspace potential pages to actual
> GPU pages through the BAR then that may not be true.

Again HMM has nothing to do here, ignore HMM it does not play any role
and it is not involve in anyway here. GPU want to control what object
they allow other device to access and object they do not allow. GPU driver
_constantly_ invalidate the CPU page table and in fact the CPU page table
do not have any valid pte for a vma that is an mmap of GPU device file
for most of the vma lifetime. Changing that would highly disrupt and
break GPU drivers. They need to control that, they need to control what
to do if another device tries to peer map some of their memory. Hence
why they need to implement the callback and decide on wether or not they
allow the peer mapping or use device memory for it (they can decide to
fallback to main memory).

If the exporter can not control than this is useless to GPU driver. I
would rather not exclude GPU driver from this.

Cheers,
Jérôme
Logan Gunthorpe Jan. 29, 2019, 10:58 p.m. UTC | #15
On 2019-01-29 2:50 p.m., Jerome Glisse wrote:
> No this is the non HMM case i am talking about here. Fully ignore HMM
> in this frame. A GPU driver that do not support or use HMM in anyway
> has all the properties and requirement i do list above. So all the points
> i was making are without HMM in the picture whatsoever. I should have
> posted this a separate patches to avoid this confusion.
> 
> Regarding your HMM question. You can not map HMM pages, all code path
> that would try that would trigger a migration back to regular memory
> and will use the regular memory for CPU access.
> 

I thought this was the whole point of HMM... And eventually it would
support being able to map the pages through the BAR in cooperation with
the driver. If not, what's that whole layer for? Why not just have HMM
handle this situation?

And what struct pages are actually going to be backing these VMAs if
it's not using HMM?


> Again HMM has nothing to do here, ignore HMM it does not play any role
> and it is not involve in anyway here. GPU want to control what object
> they allow other device to access and object they do not allow. GPU driver
> _constantly_ invalidate the CPU page table and in fact the CPU page table
> do not have any valid pte for a vma that is an mmap of GPU device file
> for most of the vma lifetime. Changing that would highly disrupt and
> break GPU drivers. They need to control that, they need to control what
> to do if another device tries to peer map some of their memory. Hence
> why they need to implement the callback and decide on wether or not they
> allow the peer mapping or use device memory for it (they can decide to
> fallback to main memory).

But mapping is an operation of the memory/struct pages behind the VMA;
not of the VMA itself and I think that's evident by the code in that the
only way the VMA layer is involved is the fact that you're abusing
vm_ops by adding new ops there and calling it by other layers.

Logan
Jason Gunthorpe Jan. 29, 2019, 11:02 p.m. UTC | #16
On Tue, Jan 29, 2019 at 03:44:00PM -0500, Jerome Glisse wrote:

> > But this API doesn't seem to offer any control - I thought that
> > control was all coming from the mm/hmm notifiers triggering p2p_unmaps?
> 
> The control is within the driver implementation of those callbacks. 

Seems like what you mean by control is 'the exporter gets to choose
the physical address at the instant of map' - which seems reasonable
for GPU.


> will only allow p2p map to succeed for objects that have been tagged by the
> userspace in some way ie the userspace application is in control of what
> can be map to peer device.

I would have thought this means the VMA for the object is created
without the map/unmap ops? Or are GPU objects and VMAs unrelated?

> For moving things around after a successful p2p_map yes the exporting
> device have to call for instance zap_vma_ptes() or something
> similar.

Okay, great, RDMA needs this flow for hotplug - we zap the VMA's when
unplugging the PCI device and we can delay the PCI unplug completion
until all the p2p_unmaps are called...

But in this case a future p2p_map will have to fail as the BAR no
longer exists. How to handle this?

> > I would think that the importing driver can assume the BAR page is
> > kept alive until it calls unmap (presumably triggered by notifiers)?
> > 
> > ie the exporting driver sees the BAR page as pinned until unmap.
> 
> The intention with this patchset is that it is not pin ie the importer
> device _must_ abide by all mmu notifier invalidations and they can
> happen at anytime. The importing device can however re-p2p_map the
> same range after an invalidation.
>
> I would like to restrict this to importer that can invalidate for
> now because i believe all the first device to use can support the
> invalidation.

This seems reasonable (and sort of says importers not getting this
from HMM need careful checking), was this in the comment above the
ops?

Jason
Jerome Glisse Jan. 29, 2019, 11:47 p.m. UTC | #17
On Tue, Jan 29, 2019 at 03:58:45PM -0700, Logan Gunthorpe wrote:
> 
> 
> On 2019-01-29 2:50 p.m., Jerome Glisse wrote:
> > No this is the non HMM case i am talking about here. Fully ignore HMM
> > in this frame. A GPU driver that do not support or use HMM in anyway
> > has all the properties and requirement i do list above. So all the points
> > i was making are without HMM in the picture whatsoever. I should have
> > posted this a separate patches to avoid this confusion.
> > 
> > Regarding your HMM question. You can not map HMM pages, all code path
> > that would try that would trigger a migration back to regular memory
> > and will use the regular memory for CPU access.
> > 
> 
> I thought this was the whole point of HMM... And eventually it would
> support being able to map the pages through the BAR in cooperation with
> the driver. If not, what's that whole layer for? Why not just have HMM
> handle this situation?

The whole point is to allow to use device memory for range of virtual
address of a process when it does make sense to use device memory for
that range. So they are multiple cases where it does make sense:
[1] - Only the device is accessing the range and they are no CPU access
      For instance the program is executing/running a big function on
      the GPU and they are not concurrent CPU access, this is very
      common in all the existing GPGPU code. In fact AFAICT It is the
      most common pattern. So here you can use HMM private or public
      memory.
[2] - Both device and CPU access a common range of virtul address
      concurrently. In that case if you are on a platform with cache
      coherent inter-connect like OpenCAPI or CCIX then you can use
      HMM public device memory and have both access the same memory.
      You can not use HMM private memory.

So far on x86 we only have PCIE and thus so far on x86 we only have
private HMM device memory that is not accessible by the CPU in any
way.

It does not make that memory useless, far from it. Having only the
device work on the dataset while CPU is either waiting or accessing
something else is very common.


Then HMM is a toolbox, so here are some of the tools:
    HMM mirror - helper to mirror process address on to a device
    ie this is SVM(Share Virtual Memory)/SVA(Share Virtual Address)
    in software

    HMM private memory - allow to register device memory with the linux
    kernel. The memory is not CPU accessible. The memory is fully manage
    by the device driver. What and when to migrate is under the control
    of the device driver.

    HMM public memory - allow to register device memory with the linux
    kernel. The memory must be CPU accessible and cache coherent and
    abide by the platform memory model. The memory is fully manage by
    the device driver because otherwise it would disrupt the device
    driver operation (for instance GPU can also be use for graphics).

    Migration - helper to perform migration to and from device memory.
    It does not make any decission on itself it just perform all the
    steps in the right order and call back into the driver to get the
    migration going.

It is up to device driver to implement heuristic and provide userspace
API to control memory migration to and from device memory. For device
private memory on CPU page fault the kernel will force a migration back
to system memory so that the CPU can access the memory. What matter here
is that the memory model of the platform is intact and thus you can
safely use CPU atomic operation or rely on your platform memory model
for your program. Note that long term i would like to define common API
to expose to userspace to manage memory binding to specific device
memory so that we can mix and match multiple device memory into a single
process and define policy too.

Also CPU atomic instruction to PCIE BAR gives _undefined_ results and in
fact on some AMD/Intel platform it leads to weirdness/crash/freeze. So
obviously we can not map PCIE BAR to CPU without breaking the memory
model. More over on PCIE you might not be able to resize the BAR to
expose all the device memory. GPU can have several giga bytes of memory
and not all of them support PCIE bar resize, and sometimes PCIE bar
resize does not work either because of bios/firmware issue or simply
because you are running out of IO space.

So on x86 we are stuck with HMM private memory, i am hopping that some
day in the future we will have CCIX or something similar. But for now
we have to work with what we have.

> And what struct pages are actually going to be backing these VMAs if
> it's not using HMM?

When you have some range of virtual address migrated to HMM private
memory then the CPU pte are special swap entry and they behave just
as if the memory was swapped to disk. So CPU access to those will
fault and trigger a migration back to main memory.

We still want to allow peer to peer to exist when using HMM memory
for a range of virtual address (of a vma that is not an mmap of a
device file) because the peer device do not rely on atomic or on the
platform memory model. In those cases we assume that the importer is
aware of the limitation and is asking access in good faith and thus
we want to allow the exporting device to either allow the peer mapping
(because it has enough BAR address to map) or fall back to main memory.


> > Again HMM has nothing to do here, ignore HMM it does not play any role
> > and it is not involve in anyway here. GPU want to control what object
> > they allow other device to access and object they do not allow. GPU driver
> > _constantly_ invalidate the CPU page table and in fact the CPU page table
> > do not have any valid pte for a vma that is an mmap of GPU device file
> > for most of the vma lifetime. Changing that would highly disrupt and
> > break GPU drivers. They need to control that, they need to control what
> > to do if another device tries to peer map some of their memory. Hence
> > why they need to implement the callback and decide on wether or not they
> > allow the peer mapping or use device memory for it (they can decide to
> > fallback to main memory).
> 
> But mapping is an operation of the memory/struct pages behind the VMA;
> not of the VMA itself and I think that's evident by the code in that the
> only way the VMA layer is involved is the fact that you're abusing
> vm_ops by adding new ops there and calling it by other layers.

For GPU driver the vma pte are populated on CPU page fault and they get
clear quickly after. A very usual pattern is:
    - CPU write something to the object through the object mapping ie
      through a vma. This trigger page fault which call the fault()
      callback from vm_operations struct. This populate the page table
      for the vma.
    - Userspace launch commands on the GPU, first thing kernel do is
      clear all CPU page table entry for objects listed in the commands
      ie we do not except any further CPU access nor do we want it.

GPU driver have always been geared toward minimizing CPU access to GPU
memory. For object that need to be access by both concurrently we use the
main memory and not the device memory.

So in fact you will almost never have valid pte for an mmap of a GPU
object (done throught the GPU device file). However it does not mean that
we want to block peer to peer to happen. Today the use cases we know for
peer to peer are with GPUDirect (NVidia) or ROCmDMA (AMD) roughly the
same thing. Most common use cases i am aware are:
    - RDMA is streaming in input directly into GPU memory avoiding the
      need to have a bounce buffer into memory (this save both main
      memory and PCIE bandwidth by avoiding RDMA->main then main->GPU).
    - RDMA is streaming out result (same idea as streaming in but in
      the other direction :))
    - RDMA is use to monitor computation progress on the GPU and it
      tries to do so with minimal disruption to the GPU. So RDMA would
      like to be able to peek into GPU memory to fetch some values
      and transmit them over the network.

I believe people would like to have more complex use case, like for
instance having the GPU be able to directly control some RDMA queue
to request data to some other host on the networ, or control some
block device queue to read data from block device directly. I believe
those can be implemented with the API set forward in those patches.

So for those above use cases it is fine to not have valid CPU pte and
only have peer to peer mapping. The CPU is not expected to be involve
and we should not make it a requirement. Hence we should not expect
to have valid pte.


Also another common use case is that GPU driver might leave pte that
points to main memory while the GPU is using device memory for the
object corresponding to the vma those pte are in. Expectation is that
the CPU access are synchronized with the device access through the
API use by the application. Note here we are talking non HMM, non SVM
case ie special object that are allocated through API specific functions
that result in driver ioctl and mmap of device file.


Hopes this helps understand the big picture from GPU driver point of
view :)

Cheers,
Jérôme
Jerome Glisse Jan. 30, 2019, 12:08 a.m. UTC | #18
On Tue, Jan 29, 2019 at 11:02:25PM +0000, Jason Gunthorpe wrote:
> On Tue, Jan 29, 2019 at 03:44:00PM -0500, Jerome Glisse wrote:
> 
> > > But this API doesn't seem to offer any control - I thought that
> > > control was all coming from the mm/hmm notifiers triggering p2p_unmaps?
> > 
> > The control is within the driver implementation of those callbacks. 
> 
> Seems like what you mean by control is 'the exporter gets to choose
> the physical address at the instant of map' - which seems reasonable
> for GPU.
> 
> 
> > will only allow p2p map to succeed for objects that have been tagged by the
> > userspace in some way ie the userspace application is in control of what
> > can be map to peer device.
> 
> I would have thought this means the VMA for the object is created
> without the map/unmap ops? Or are GPU objects and VMAs unrelated?

GPU object and VMA are unrelated in all open source GPU driver i am
somewhat familiar with (AMD, Intel, NVidia). You can create a GPU
object and never map it (and thus never have it associated with a
vma) and in fact this is very common. For graphic you usualy only
have hand full of the hundreds of GPU object your application have
mapped.

The control for peer to peer can also be a mutable properties of the
object ie userspace do ioctl on the GPU driver which create an object;
Some times after the object is created the userspace do others ioctl
to allow to export the object to another specific device again this
result in ioctl to the device driver, those ioctl set flags and
update GPU object kernel structure with all the info.

In the meantime you have no control on when other driver might call
the vma p2p call backs. So you must have register the vma with
vm_operations that include the p2p_map and p2p_unmap. Those driver
function will check the object kernel structure each time they get
call and act accordingly.



> > For moving things around after a successful p2p_map yes the exporting
> > device have to call for instance zap_vma_ptes() or something
> > similar.
> 
> Okay, great, RDMA needs this flow for hotplug - we zap the VMA's when
> unplugging the PCI device and we can delay the PCI unplug completion
> until all the p2p_unmaps are called...
> 
> But in this case a future p2p_map will have to fail as the BAR no
> longer exists. How to handle this?

So the comment above the callback (i should write more thorough guideline
and documentation) state that export should/(must?) be predictable ie
if an importer device calls p2p_map() once on a vma and it does succeed
then if the same device calls again p2p_map() on the same vma and if the
vma is still valid (ie no unmap or does not correspond to a different
object ...) then the p2p_map() should/(must?) succeed.

The idea is that the importer would do a first call to p2p_map() when it
setup its own object, report failure to userspace if that fails. If it
does succeed then we should never have an issue next time we call p2p_map()
(after mapping being invalidated by mmu notifier for instance). So it will
succeed just like the first call (again assuming the vma is still valid).

Idea is that we can only ask exporter to be predictable and still allow
them to fail if things are really going bad.


> > > I would think that the importing driver can assume the BAR page is
> > > kept alive until it calls unmap (presumably triggered by notifiers)?
> > > 
> > > ie the exporting driver sees the BAR page as pinned until unmap.
> > 
> > The intention with this patchset is that it is not pin ie the importer
> > device _must_ abide by all mmu notifier invalidations and they can
> > happen at anytime. The importing device can however re-p2p_map the
> > same range after an invalidation.
> >
> > I would like to restrict this to importer that can invalidate for
> > now because i believe all the first device to use can support the
> > invalidation.
> 
> This seems reasonable (and sort of says importers not getting this
> from HMM need careful checking), was this in the comment above the
> ops?

I think i put it in the comment above the ops but in any cases i should
write something in documentation with example and thorough guideline.
Note that there won't be any mmu notifier to mmap of a device file
unless the device driver calls for it or there is a syscall like munmap
or mremap or mprotect well any syscall that work on vma.

So assuming the application is no doing something stupid, nor the
driver. Then the result of p2p_map can stay valid until the importer
is done and call p2p_unmap of its own free will. This is what i do
expect for this. But for GPU i would still like to allow GPU driver
to evict (and thus invalidate importer mapping) to main memory or
defragment their BAR address space if the GPU driver feels a pressing
need to do so.

If we ever want to support full pin then we might have to add a
flag so that GPU driver can refuse an importer that wants things
pin forever.

Cheers,
Jérôme
Logan Gunthorpe Jan. 30, 2019, 1:17 a.m. UTC | #19
On 2019-01-29 4:47 p.m., Jerome Glisse wrote:
> The whole point is to allow to use device memory for range of virtual
> address of a process when it does make sense to use device memory for
> that range. So they are multiple cases where it does make sense:
> [1] - Only the device is accessing the range and they are no CPU access
>       For instance the program is executing/running a big function on
>       the GPU and they are not concurrent CPU access, this is very
>       common in all the existing GPGPU code. In fact AFAICT It is the
>       most common pattern. So here you can use HMM private or public
>       memory.
> [2] - Both device and CPU access a common range of virtul address
>       concurrently. In that case if you are on a platform with cache
>       coherent inter-connect like OpenCAPI or CCIX then you can use
>       HMM public device memory and have both access the same memory.
>       You can not use HMM private memory.
> 
> So far on x86 we only have PCIE and thus so far on x86 we only have
> private HMM device memory that is not accessible by the CPU in any
> way.

I feel like you're just moving the rug out from under us... Before you
said ignore HMM and I was asking about the use case that wasn't using
HMM and how it works without HMM. In response, you just give me *way*
too much information describing HMM. And still, as best as I can see,
managing DMA mappings (which is different from the userspace mappings)
for GPU P2P should be handled by HMM and the userspace mappings should
*just* link VMAs to HMM pages using the standard infrastructure we
already have.

>> And what struct pages are actually going to be backing these VMAs if
>> it's not using HMM?
> 
> When you have some range of virtual address migrated to HMM private
> memory then the CPU pte are special swap entry and they behave just
> as if the memory was swapped to disk. So CPU access to those will
> fault and trigger a migration back to main memory.

This isn't answering my question at all... I specifically asked what is
backing the VMA when we are *not* using HMM.

Logan
Jerome Glisse Jan. 30, 2019, 2:48 a.m. UTC | #20
On Tue, Jan 29, 2019 at 06:17:43PM -0700, Logan Gunthorpe wrote:
> 
> 
> On 2019-01-29 4:47 p.m., Jerome Glisse wrote:
> > The whole point is to allow to use device memory for range of virtual
> > address of a process when it does make sense to use device memory for
> > that range. So they are multiple cases where it does make sense:
> > [1] - Only the device is accessing the range and they are no CPU access
> >       For instance the program is executing/running a big function on
> >       the GPU and they are not concurrent CPU access, this is very
> >       common in all the existing GPGPU code. In fact AFAICT It is the
> >       most common pattern. So here you can use HMM private or public
> >       memory.
> > [2] - Both device and CPU access a common range of virtul address
> >       concurrently. In that case if you are on a platform with cache
> >       coherent inter-connect like OpenCAPI or CCIX then you can use
> >       HMM public device memory and have both access the same memory.
> >       You can not use HMM private memory.
> > 
> > So far on x86 we only have PCIE and thus so far on x86 we only have
> > private HMM device memory that is not accessible by the CPU in any
> > way.
> 
> I feel like you're just moving the rug out from under us... Before you
> said ignore HMM and I was asking about the use case that wasn't using
> HMM and how it works without HMM. In response, you just give me *way*
> too much information describing HMM. And still, as best as I can see,
> managing DMA mappings (which is different from the userspace mappings)
> for GPU P2P should be handled by HMM and the userspace mappings should
> *just* link VMAs to HMM pages using the standard infrastructure we
> already have.

For HMM P2P mapping we need to call into the driver to know if driver
wants to fallback to main memory (running out of BAR addresses) or if
it can allow a peer device to directly access its memory. We also need
the call to exporting device driver as only the exporting device driver
can map the HMM page pfn to some physical BAR address (which would be
allocated by driver for GPU).

I wanted to make sure the HMM case was understood too, sorry if it
caused confusion with the non HMM case which i describe below.


> >> And what struct pages are actually going to be backing these VMAs if
> >> it's not using HMM?
> > 
> > When you have some range of virtual address migrated to HMM private
> > memory then the CPU pte are special swap entry and they behave just
> > as if the memory was swapped to disk. So CPU access to those will
> > fault and trigger a migration back to main memory.
> 
> This isn't answering my question at all... I specifically asked what is
> backing the VMA when we are *not* using HMM.

So when you are not using HMM ie existing GPU object without HMM then
like i said you do not have any valid pte most of the time inside the
CPU page table ie the GPU driver only populate the pte with valid
entry when they are CPU page fault and it clear those as soon as the
corresponding object is use by the GPU. In fact some driver also unmap
it agressively from the BAR making the memory totaly un-accessible to
anything but the GPU.

GPU driver do not like CPU mapping, they are quite aggressive about
clearing them. Then everything i said about having userspace deciding
which object can be share, and, with who, do apply here. So for GPU you
do want to give control to GPU driver and you do not want to require valid
CPU pte for the vma so that the exporting driver can return valid
address to the importing peer device only.

Also exporting device driver might decide to fallback to main memory
(running out of BAR addresses for instance). So again here we want to
go through the exporting device driver so that it can take the right
action.

So the expected pattern (for GPU driver) is:
    - no valid pte for the special vma (mmap of device file)
    - importing device call p2p_map() for the vma if it succeed the
      first time then we expect it will succeed for the same vma and
      range next time we call it.
    - exporting driver can either return physical address to page
      into its BAR space that point to the correct device memory or
      fallback to main memory

Then at any point in time:
    - if GPU driver want to move the object around (for whatever
      reasons) it calls zap_vma_ptes() the fact that there is no
      valid CPU pte does not matter it will call mmu notifier and thus
      any importing device driver will invalidate its mapping
    - importing device driver that lost the mapping due to mmu
      notification can re-map by re-calling p2p_map() (it should
      check that the vma is still valid ...) and guideline is for
      the exporting device driver to succeed and return valid
      address to the new memory use for the object

This allow device driver like GPU to keep control. The expected
pattern is still the p2p mapping to stay undisrupted for their
whole lifetime. Invalidation should only be triggered if GPU driver
do need to move things around.

All the above is for the no HMM case ie mmap of a device file so
for any existing open source GPU device driver that do not support
HMM.

Cheers,
Jérôme
Jason Gunthorpe Jan. 30, 2019, 4:18 a.m. UTC | #21
On Tue, Jan 29, 2019 at 06:17:43PM -0700, Logan Gunthorpe wrote:

> This isn't answering my question at all... I specifically asked what is
> backing the VMA when we are *not* using HMM.

At least for RDMA what backs the VMA today is non-struct-page BAR
memory filled in with io_remap_pfn.

And we want to expose this for P2P DMA. None of the HMM stuff applies
here and the p2p_map/unmap are a nice simple approach that covers all
the needs RDMA has, at least.

Every attempt to give BAR memory to struct page has run into major
trouble, IMHO, so I like that this approach avoids that.

And if you don't have struct page then the only kernel object left to
hang meta data off is the VMA itself.

It seems very similar to the existing P2P work between in-kernel
consumers, just that VMA is now mediating a general user space driven
discovery process instead of being hard wired into a driver.

Jason
Jason Gunthorpe Jan. 30, 2019, 4:30 a.m. UTC | #22
On Tue, Jan 29, 2019 at 07:08:06PM -0500, Jerome Glisse wrote:
> On Tue, Jan 29, 2019 at 11:02:25PM +0000, Jason Gunthorpe wrote:
> > On Tue, Jan 29, 2019 at 03:44:00PM -0500, Jerome Glisse wrote:
> > 
> > > > But this API doesn't seem to offer any control - I thought that
> > > > control was all coming from the mm/hmm notifiers triggering p2p_unmaps?
> > > 
> > > The control is within the driver implementation of those callbacks. 
> > 
> > Seems like what you mean by control is 'the exporter gets to choose
> > the physical address at the instant of map' - which seems reasonable
> > for GPU.
> > 
> > 
> > > will only allow p2p map to succeed for objects that have been tagged by the
> > > userspace in some way ie the userspace application is in control of what
> > > can be map to peer device.
> > 
> > I would have thought this means the VMA for the object is created
> > without the map/unmap ops? Or are GPU objects and VMAs unrelated?
> 
> GPU object and VMA are unrelated in all open source GPU driver i am
> somewhat familiar with (AMD, Intel, NVidia). You can create a GPU
> object and never map it (and thus never have it associated with a
> vma) and in fact this is very common. For graphic you usualy only
> have hand full of the hundreds of GPU object your application have
> mapped.

I mean the other way does every VMA with a p2p_map/unmap point to
exactly one GPU object?

ie I'm surprised you say that p2p_map needs to have policy, I would
have though the policy is applied when the VMA is created (ie objects
that are not for p2p do not have p2p_map set), and even for GPU
p2p_map should really only have to do with window allocation and pure
'can I even do p2p' type functionality.

> Idea is that we can only ask exporter to be predictable and still allow
> them to fail if things are really going bad.

I think hot unplug / PCI error recovery is one of the 'really going
bad' cases..

> I think i put it in the comment above the ops but in any cases i should
> write something in documentation with example and thorough guideline.
> Note that there won't be any mmu notifier to mmap of a device file
> unless the device driver calls for it or there is a syscall like munmap
> or mremap or mprotect well any syscall that work on vma.

This is something we might need to explore, does calling
zap_vma_ptes() invoke enough notifiers that a MMU notifiers or HMM
mirror consumer will release any p2p maps on that VMA?

> If we ever want to support full pin then we might have to add a
> flag so that GPU driver can refuse an importer that wants things
> pin forever.

This would become interesting for VFIO and RDMA at least - I don't
think VFIO has anything like SVA so it would want to import a p2p_map
and indicate that it will not respond to MMU notifiers.

GPU can refuse, but maybe RDMA would allow it...

Jason
Christoph Hellwig Jan. 30, 2019, 7:52 a.m. UTC | #23
On Tue, Jan 29, 2019 at 01:43:02PM -0700, Logan Gunthorpe wrote:
> It's hard to reason about an interface when you can't see what all the
> layers want to do with it. Most maintainers (I'd hope) would certainly
> never merge code that has no callers, and for much the same reason, I'd
> rather not review patches that don't have real use case examples.

Yes, we should never review, nevermind merge code without users.
We had one example recently where this was not followed, which was HMM
and that turned out to be a desaster.
Christoph Hellwig Jan. 30, 2019, 8 a.m. UTC | #24
On Wed, Jan 30, 2019 at 04:18:48AM +0000, Jason Gunthorpe wrote:
> Every attempt to give BAR memory to struct page has run into major
> trouble, IMHO, so I like that this approach avoids that.

Way less problems than not having struct page for doing anything
non-trivial.  If you map the BAR to userspace with remap_pfn_range
and friends the mapping is indeed very simple.  But any operation
that expects a page structure, which is at least everything using
get_user_pages won't work.

So you can't do direct I/O to your remapped BAR, you can't create MRs
on it, etc, etc.
Christoph Hellwig Jan. 30, 2019, 8:02 a.m. UTC | #25
On Tue, Jan 29, 2019 at 08:58:35PM +0000, Jason Gunthorpe wrote:
> On Tue, Jan 29, 2019 at 01:39:49PM -0700, Logan Gunthorpe wrote:
> 
> > implement the mapping. And I don't think we should have 'special' vma's
> > for this (though we may need something to ensure we don't get mapping
> > requests mixed with different types of pages...).
> 
> I think Jerome explained the point here is to have a 'special vma'
> rather than a 'special struct page' as, really, we don't need a
> struct page at all to make this work.
> 
> If I recall your earlier attempts at adding struct page for BAR
> memory, it ran aground on issues related to O_DIRECT/sgls, etc, etc.

Struct page is what makes O_DIRECT work, using sgls or biovecs, etc on
it work.  Without struct page none of the above can work at all.  That
is why we use struct page for backing BARs in the existing P2P code.
Not that I'm a particular fan of creating struct page for this device
memory, but without major invasive surgery to large parts of the kernel
it is the only way to make it work.
Christian König Jan. 30, 2019, 10:33 a.m. UTC | #26
Am 30.01.19 um 09:02 schrieb Christoph Hellwig:
> On Tue, Jan 29, 2019 at 08:58:35PM +0000, Jason Gunthorpe wrote:
>> On Tue, Jan 29, 2019 at 01:39:49PM -0700, Logan Gunthorpe wrote:
>>
>>> implement the mapping. And I don't think we should have 'special' vma's
>>> for this (though we may need something to ensure we don't get mapping
>>> requests mixed with different types of pages...).
>> I think Jerome explained the point here is to have a 'special vma'
>> rather than a 'special struct page' as, really, we don't need a
>> struct page at all to make this work.
>>
>> If I recall your earlier attempts at adding struct page for BAR
>> memory, it ran aground on issues related to O_DIRECT/sgls, etc, etc.
> Struct page is what makes O_DIRECT work, using sgls or biovecs, etc on
> it work.  Without struct page none of the above can work at all.  That
> is why we use struct page for backing BARs in the existing P2P code.
> Not that I'm a particular fan of creating struct page for this device
> memory, but without major invasive surgery to large parts of the kernel
> it is the only way to make it work.

The problem seems to be that struct page does two things:

1. Memory management for system memory.
2. The object to work with in the I/O layer.

This was done because a good part of that stuff overlaps, like reference 
counting how often a page is used.  The problem now is that this doesn't 
work very well for device memory in some cases.

For example on GPUs you usually have a large amount of memory which is 
not even accessible by the CPU. In other words you can't easily create a 
struct page for it because you can't reference it with a physical CPU 
address.

Maybe struct page should be split up into smaller structures? I mean 
it's really overloaded with data.

Christian.
Jerome Glisse Jan. 30, 2019, 3:43 p.m. UTC | #27
On Wed, Jan 30, 2019 at 04:30:27AM +0000, Jason Gunthorpe wrote:
> On Tue, Jan 29, 2019 at 07:08:06PM -0500, Jerome Glisse wrote:
> > On Tue, Jan 29, 2019 at 11:02:25PM +0000, Jason Gunthorpe wrote:
> > > On Tue, Jan 29, 2019 at 03:44:00PM -0500, Jerome Glisse wrote:
> > > 
> > > > > But this API doesn't seem to offer any control - I thought that
> > > > > control was all coming from the mm/hmm notifiers triggering p2p_unmaps?
> > > > 
> > > > The control is within the driver implementation of those callbacks. 
> > > 
> > > Seems like what you mean by control is 'the exporter gets to choose
> > > the physical address at the instant of map' - which seems reasonable
> > > for GPU.
> > > 
> > > 
> > > > will only allow p2p map to succeed for objects that have been tagged by the
> > > > userspace in some way ie the userspace application is in control of what
> > > > can be map to peer device.
> > > 
> > > I would have thought this means the VMA for the object is created
> > > without the map/unmap ops? Or are GPU objects and VMAs unrelated?
> > 
> > GPU object and VMA are unrelated in all open source GPU driver i am
> > somewhat familiar with (AMD, Intel, NVidia). You can create a GPU
> > object and never map it (and thus never have it associated with a
> > vma) and in fact this is very common. For graphic you usualy only
> > have hand full of the hundreds of GPU object your application have
> > mapped.
> 
> I mean the other way does every VMA with a p2p_map/unmap point to
> exactly one GPU object?
> 
> ie I'm surprised you say that p2p_map needs to have policy, I would
> have though the policy is applied when the VMA is created (ie objects
> that are not for p2p do not have p2p_map set), and even for GPU
> p2p_map should really only have to do with window allocation and pure
> 'can I even do p2p' type functionality.

All userspace API to enable p2p happens after object creation and in
some case they are mutable ie you can decide to no longer share the
object (userspace application decision). The BAR address space is a
resource from GPU driver point of view and thus from userspace point
of view. As such decissions that affect how it is use an what object
can use it, can change over application lifetime.

This is why i would like to allow kernel driver to apply any such
access policy, decided by the application on its object (on top of
which the kernel GPU driver can apply its own policy for GPU resource
sharing by forcing some object to main memory).


> 
> > Idea is that we can only ask exporter to be predictable and still allow
> > them to fail if things are really going bad.
> 
> I think hot unplug / PCI error recovery is one of the 'really going
> bad' cases..

GPU can hang and all data becomes _undefined_, it can also be suspended
to save power (think laptop with discret GPU for instance). GPU threads
can be kill ... So they are few cases i can think of where either you
want to kill the p2p mapping and make sure the importer is aware and
might have a change to report back through its own userspace API, or
at very least fallback to dummy pages. In some of the above cases, for
instance suspend, you just want to move thing around to allow to shut
down device memory.


> > I think i put it in the comment above the ops but in any cases i should
> > write something in documentation with example and thorough guideline.
> > Note that there won't be any mmu notifier to mmap of a device file
> > unless the device driver calls for it or there is a syscall like munmap
> > or mremap or mprotect well any syscall that work on vma.
> 
> This is something we might need to explore, does calling
> zap_vma_ptes() invoke enough notifiers that a MMU notifiers or HMM
> mirror consumer will release any p2p maps on that VMA?

Yes it does.

> 
> > If we ever want to support full pin then we might have to add a
> > flag so that GPU driver can refuse an importer that wants things
> > pin forever.
> 
> This would become interesting for VFIO and RDMA at least - I don't
> think VFIO has anything like SVA so it would want to import a p2p_map
> and indicate that it will not respond to MMU notifiers.
> 
> GPU can refuse, but maybe RDMA would allow it...

Ok i will add a flag field in next post. GPU could allow pin but they
would most likely use main memory for any such object, hence it is no
longer really p2p but at least both device look at the same data.

Cheers,
Jérôme
Jerome Glisse Jan. 30, 2019, 3:49 p.m. UTC | #28
On Wed, Jan 30, 2019 at 09:00:06AM +0100, Christoph Hellwig wrote:
> On Wed, Jan 30, 2019 at 04:18:48AM +0000, Jason Gunthorpe wrote:
> > Every attempt to give BAR memory to struct page has run into major
> > trouble, IMHO, so I like that this approach avoids that.
> 
> Way less problems than not having struct page for doing anything
> non-trivial.  If you map the BAR to userspace with remap_pfn_range
> and friends the mapping is indeed very simple.  But any operation
> that expects a page structure, which is at least everything using
> get_user_pages won't work.
> 
> So you can't do direct I/O to your remapped BAR, you can't create MRs
> on it, etc, etc.

We do not want direct I/O, in fact at least for GPU we want to seldomly
allow access to object vma, so the less thing can access it the happier
we are :) All the GPU userspace driver API (OpenGL, OpenCL, Vulkan, ...)
that expose any such mapping with the application are very clear on the
limitation which is often worded: the only valid thing is direct CPU
access (no syscall can be use with those pointers).

So application developer already have low expectation on what is valid
and allowed to do.

Cheers,
Jérôme
Jerome Glisse Jan. 30, 2019, 3:55 p.m. UTC | #29
On Wed, Jan 30, 2019 at 10:33:39AM +0000, Koenig, Christian wrote:
> Am 30.01.19 um 09:02 schrieb Christoph Hellwig:
> > On Tue, Jan 29, 2019 at 08:58:35PM +0000, Jason Gunthorpe wrote:
> >> On Tue, Jan 29, 2019 at 01:39:49PM -0700, Logan Gunthorpe wrote:
> >>
> >>> implement the mapping. And I don't think we should have 'special' vma's
> >>> for this (though we may need something to ensure we don't get mapping
> >>> requests mixed with different types of pages...).
> >> I think Jerome explained the point here is to have a 'special vma'
> >> rather than a 'special struct page' as, really, we don't need a
> >> struct page at all to make this work.
> >>
> >> If I recall your earlier attempts at adding struct page for BAR
> >> memory, it ran aground on issues related to O_DIRECT/sgls, etc, etc.
> > Struct page is what makes O_DIRECT work, using sgls or biovecs, etc on
> > it work.  Without struct page none of the above can work at all.  That
> > is why we use struct page for backing BARs in the existing P2P code.
> > Not that I'm a particular fan of creating struct page for this device
> > memory, but without major invasive surgery to large parts of the kernel
> > it is the only way to make it work.
> 
> The problem seems to be that struct page does two things:
> 
> 1. Memory management for system memory.
> 2. The object to work with in the I/O layer.
> 
> This was done because a good part of that stuff overlaps, like reference 
> counting how often a page is used.  The problem now is that this doesn't 
> work very well for device memory in some cases.
> 
> For example on GPUs you usually have a large amount of memory which is 
> not even accessible by the CPU. In other words you can't easily create a 
> struct page for it because you can't reference it with a physical CPU 
> address.
> 
> Maybe struct page should be split up into smaller structures? I mean 
> it's really overloaded with data.

I think the simpler answer is that we do not want to allow GUP or any-
thing similar to pin BAR or device memory. Doing so can only hurt us
long term by fragmenting the GPU memory and forbidding us to move thing
around. For transparent use of device memory within a process this is
definitly forbidden to pin.

I do not see any good reasons we would like to pin device memory for
the existing GPU GEM objects. Userspace always had a very low expectation
on what it can do with mmap of those object and i believe it is better
to keep expectation low here and says nothing will work with those
pointer. I just do not see a valid and compelling use case to change
that :)

Even outside GPU driver, device driver like RDMA just want to share their
doorbell to other device and they do not want to see those doorbell page
use in direct I/O or anything similar AFAICT.

Cheers,
Jérôme
Logan Gunthorpe Jan. 30, 2019, 5:17 p.m. UTC | #30
On 2019-01-29 9:18 p.m., Jason Gunthorpe wrote:
> Every attempt to give BAR memory to struct page has run into major
> trouble, IMHO, so I like that this approach avoids that.
> 
> And if you don't have struct page then the only kernel object left to
> hang meta data off is the VMA itself.
> 
> It seems very similar to the existing P2P work between in-kernel
> consumers, just that VMA is now mediating a general user space driven
> discovery process instead of being hard wired into a driver.

But the kernel now has P2P bars backed by struct pages and it works
well. And that's what we are doing in-kernel. We even have a hacky
out-of-tree module which exposes these pages and it also works (but
would need Jerome's solution for denying those pages in GUP, etc). So
why do something completely different in userspace so they can't share
any of the DMA map infrastructure?

Logan
Christoph Hellwig Jan. 30, 2019, 5:26 p.m. UTC | #31
On Wed, Jan 30, 2019 at 10:55:43AM -0500, Jerome Glisse wrote:
> Even outside GPU driver, device driver like RDMA just want to share their
> doorbell to other device and they do not want to see those doorbell page
> use in direct I/O or anything similar AFAICT.

At least Mellanox HCA support and inline data feature where you
can copy data directly into the BAR.  For something like a usrspace
NVMe target it might be very useful to do direct I/O straight into
the BAR for that.
Logan Gunthorpe Jan. 30, 2019, 5:32 p.m. UTC | #32
On 2019-01-30 10:26 a.m., Christoph Hellwig wrote:
> On Wed, Jan 30, 2019 at 10:55:43AM -0500, Jerome Glisse wrote:
>> Even outside GPU driver, device driver like RDMA just want to share their
>> doorbell to other device and they do not want to see those doorbell page
>> use in direct I/O or anything similar AFAICT.
> 
> At least Mellanox HCA support and inline data feature where you
> can copy data directly into the BAR.  For something like a usrspace
> NVMe target it might be very useful to do direct I/O straight into
> the BAR for that.

Yup, these are things we definitely want to be able to do, and have done
with hacky garbage code: Direct I/O from NVMe to P2P BAR, then we could
Direct I/O to another drive or map it as an MR and send it over an RNIC.

We'd definitely like to move in that direction. And a world where such
userspace mappings are gimpped by the fact that they are only some
special feature of userspace VMAs that can only be used in specialized
userspace interfaces is not useful to us.

Logan
Jason Gunthorpe Jan. 30, 2019, 5:39 p.m. UTC | #33
On Wed, Jan 30, 2019 at 06:26:53PM +0100, Christoph Hellwig wrote:
> On Wed, Jan 30, 2019 at 10:55:43AM -0500, Jerome Glisse wrote:
> > Even outside GPU driver, device driver like RDMA just want to share their
> > doorbell to other device and they do not want to see those doorbell page
> > use in direct I/O or anything similar AFAICT.
> 
> At least Mellanox HCA support and inline data feature where you
> can copy data directly into the BAR.  For something like a usrspace
> NVMe target it might be very useful to do direct I/O straight into
> the BAR for that.

It doesn't really work like that. 

The PCI-E TLP sequence to trigger this feature is very precise, and
the data requires the right headers/etc. Mixing that with O_DIRECT
seems very unlikely.

Jason
Jason Gunthorpe Jan. 30, 2019, 5:44 p.m. UTC | #34
On Wed, Jan 30, 2019 at 09:02:08AM +0100, Christoph Hellwig wrote:
> On Tue, Jan 29, 2019 at 08:58:35PM +0000, Jason Gunthorpe wrote:
> > On Tue, Jan 29, 2019 at 01:39:49PM -0700, Logan Gunthorpe wrote:
> > 
> > > implement the mapping. And I don't think we should have 'special' vma's
> > > for this (though we may need something to ensure we don't get mapping
> > > requests mixed with different types of pages...).
> > 
> > I think Jerome explained the point here is to have a 'special vma'
> > rather than a 'special struct page' as, really, we don't need a
> > struct page at all to make this work.
> > 
> > If I recall your earlier attempts at adding struct page for BAR
> > memory, it ran aground on issues related to O_DIRECT/sgls, etc, etc.
> 
> Struct page is what makes O_DIRECT work, using sgls or biovecs, etc on
> it work.  Without struct page none of the above can work at all.  That
> is why we use struct page for backing BARs in the existing P2P code.
> Not that I'm a particular fan of creating struct page for this device
> memory, but without major invasive surgery to large parts of the kernel
> it is the only way to make it work.

I don't think anyone is interested in O_DIRECT/etc for RDMA doorbell
pages.

.. and again, I recall Logan already attempted to mix non-CPU memory
into sgls and it was a disaster. You pointed out that one cannot just
put iomem addressed into a SGL without auditing basically the entire
block stack to prove that nothing uses iomem without an iomem
accessor.

I recall that proposal veered into a direction where the block layer
would just fail very early if there was iomem in the sgl, so generally
no O_DIRECT support anyhow.

We already accepted the P2P stuff from Logan as essentially a giant
special case - it only works with RDMA and only because RDMA MR was
hacked up with a special p2p callback.

I don't see why a special case with a VMA is really that different.

If someone figures out the struct page path down the road it can
obviously be harmonized with this VMA approach pretty easily.

Jason
Jerome Glisse Jan. 30, 2019, 6:05 p.m. UTC | #35
On Wed, Jan 30, 2019 at 06:26:53PM +0100, Christoph Hellwig wrote:
> On Wed, Jan 30, 2019 at 10:55:43AM -0500, Jerome Glisse wrote:
> > Even outside GPU driver, device driver like RDMA just want to share their
> > doorbell to other device and they do not want to see those doorbell page
> > use in direct I/O or anything similar AFAICT.
> 
> At least Mellanox HCA support and inline data feature where you
> can copy data directly into the BAR.  For something like a usrspace
> NVMe target it might be very useful to do direct I/O straight into
> the BAR for that.

And what i am proposing is not exclusive of that. If exporting device
wants to have struct page for its BAR than it can do so. What i do not
want is imposing that burden on everyone as many devices do not want
or do not care for that. Moreover having struct page and allowing that
struct page to trickle know in obscure corner of the kernel means that
exporter that want that will also have the burden to check that what
they are doing does not end up in something terribly bad.

While i would like a one API fits all i do not think that we can sanely
do that for P2P. They are too much differences between how different
devices expose and manage their BAR to make any such attempt reasonably
sane.

Maybe thing will evolve oragnicaly, but for now i do not see a way out
side the API i am proposing (again this is not exclusive of the struct
page API that is upstream both can co-exist and a device can use both
or just one).

Cheers,
Jérôme
Logan Gunthorpe Jan. 30, 2019, 6:13 p.m. UTC | #36
On 2019-01-30 10:44 a.m., Jason Gunthorpe wrote:
> I don't see why a special case with a VMA is really that different.

Well one *really* big difference is the VMA changes necessarily expose
specialized new functionality to userspace which has to be supported
forever and may be difficult to change. The p2pdma code is largely
in-kernel and we can rework and change the interfaces all we want as we
improve our struct page infrastructure.

I'd also argue that p2pdma isn't nearly as specialized as this VMA thing
and can be used pretty generically to do other things. Though, the other
ideas we've talked about doing are pretty far off and may have other
challenges.

Logan
Jerome Glisse Jan. 30, 2019, 6:50 p.m. UTC | #37
On Wed, Jan 30, 2019 at 11:13:11AM -0700, Logan Gunthorpe wrote:
> 
> 
> On 2019-01-30 10:44 a.m., Jason Gunthorpe wrote:
> > I don't see why a special case with a VMA is really that different.
> 
> Well one *really* big difference is the VMA changes necessarily expose
> specialized new functionality to userspace which has to be supported
> forever and may be difficult to change. The p2pdma code is largely
> in-kernel and we can rework and change the interfaces all we want as we
> improve our struct page infrastructure.

I do not see how VMA changes are any different than using struct page
in respect to userspace exposure. Those vma callback do not need to be
set by everyone, in fact expectation is that only handful of driver
will set those.

How can we do p2p between RDMA and GPU for instance, without exposure
to userspace ? At some point you need to tell userspace hey this kernel
does allow you to do that :)

RDMA works on vma, and GPU driver can easily setup vma for an object
hence why vma sounds like a logical place. In fact vma (mmap of a device
file) is very common device driver pattern.

In the model i am proposing the exporting device is in control of
policy ie wether to allow or not the peer to peer mapping. So each
device driver can define proper device specific API to enable and
expose that feature to userspace.

If they do, the only thing we have to preserve is the end result for
the user. The userspace does not care one bit if we achieve this in
the kernel with a set of new callback within the vm_operations struct
or in some other way. Only the end result matter.

So question is do we want to allow RDMA to access GPU driver object ?
I believe we do, they are people using non upstream solution with open
source driver to do just that, so it is a testimony that they are
users for this. More use case have been propose too.

> 
> I'd also argue that p2pdma isn't nearly as specialized as this VMA thing
> and can be used pretty generically to do other things. Though, the other
> ideas we've talked about doing are pretty far off and may have other
> challenges.

I believe p2p is highly specialize on non cache-coherent inter-connect
platform like x86 with PCIE. So i do not think that using struct page
for this is a good idea, it is not warranted/needed, and it can only be
problematic if some random kernel code get holds of those struct page
without understanding it is not regular memory.

I believe the vma callback are the simplest solution with the minimum
burden for the device driver and for the kernel. If they are any better
solution that emerge there is nothing that would block us to remove
this to replace it with the other solution.

Cheers,
Jérôme
Jason Gunthorpe Jan. 30, 2019, 6:56 p.m. UTC | #38
On Wed, Jan 30, 2019 at 10:17:27AM -0700, Logan Gunthorpe wrote:
> 
> 
> On 2019-01-29 9:18 p.m., Jason Gunthorpe wrote:
> > Every attempt to give BAR memory to struct page has run into major
> > trouble, IMHO, so I like that this approach avoids that.
> > 
> > And if you don't have struct page then the only kernel object left to
> > hang meta data off is the VMA itself.
> > 
> > It seems very similar to the existing P2P work between in-kernel
> > consumers, just that VMA is now mediating a general user space driven
> > discovery process instead of being hard wired into a driver.
> 
> But the kernel now has P2P bars backed by struct pages and it works
> well. 

I don't think it works that well..

We ended up with a 'sgl' that is not really a sgl, and doesn't work
with many of the common SGL patterns. sg_copy_buffer doesn't work,
dma_map, doesn't work, sg_page doesn't work quite right, etc.

Only nvme and rdma got the special hacks to make them understand these
p2p-sgls, and I'm still not convinced some of the RDMA drivers that
want access to CPU addresses from the SGL (rxe, usnic, hfi, qib) don't
break in this scenario.

Since the SGLs become broken, it pretty much means there is no path to
make GUP work generically, we have to go through and make everything
safe to use with p2p-sgls before allowing GUP. Which, frankly, sounds
impossible with all the competing objections.

But GPU seems to have a problem unrelated to this - what Jerome wants
is to have two faulting domains for VMA's - visible-to-cpu and
visible-to-dma. The new op is essentially faulting the pages into the
visible-to-dma category and leaving them invisible-to-cpu.

So that duality would still have to exists, and I think p2p_map/unmap
is a much simpler implementation than trying to create some kind of
special PTE in the VMA..

At least for RDMA, struct page or not doesn't really matter. 

We can make struct pages for the BAR the same way NVMe does.  GPU is
probably the same, just with more mememory at stake?  

And maybe this should be the first implementation. The p2p_map VMA
operation should return a SGL and the caller should do the existing
pci_p2pdma_map_sg() flow.. 

Worry about optimizing away the struct page overhead later?

Jason
Jason Gunthorpe Jan. 30, 2019, 7:06 p.m. UTC | #39
On Wed, Jan 30, 2019 at 09:00:06AM +0100, Christoph Hellwig wrote:
> On Wed, Jan 30, 2019 at 04:18:48AM +0000, Jason Gunthorpe wrote:
> > Every attempt to give BAR memory to struct page has run into major
> > trouble, IMHO, so I like that this approach avoids that.
> 
> Way less problems than not having struct page for doing anything
> non-trivial.  If you map the BAR to userspace with remap_pfn_range
> and friends the mapping is indeed very simple.  But any operation
> that expects a page structure, which is at least everything using
> get_user_pages won't work.

GUP doesn't work anyhow today, and won't work with BAR struct pages in
the forseeable future (Logan has sent attempts on this before).

So nothing seems lost..

> So you can't do direct I/O to your remapped BAR, you can't create MRs
> on it, etc, etc.

Jerome made the HMM mirror API use this flow, so afer his patch to
switch the ODP MR to use HMM, and to switch GPU drivers, it will work
for those cases. Which is more than the zero cases than we have today
:)

Jason
Jason Gunthorpe Jan. 30, 2019, 7:19 p.m. UTC | #40
On Wed, Jan 30, 2019 at 11:13:11AM -0700, Logan Gunthorpe wrote:
> 
> 
> On 2019-01-30 10:44 a.m., Jason Gunthorpe wrote:
> > I don't see why a special case with a VMA is really that different.
> 
> Well one *really* big difference is the VMA changes necessarily expose
> specialized new functionality to userspace which has to be supported
> forever and may be difficult to change. 

The only user change here is that more things will succeed when
creating RDMA MRs (and vice versa to GPU). I don't think this
restricts the kernel implementation at all, unless we intend to
remove P2P entirely..

Jason
Jerome Glisse Jan. 30, 2019, 7:22 p.m. UTC | #41
On Wed, Jan 30, 2019 at 06:56:59PM +0000, Jason Gunthorpe wrote:
> On Wed, Jan 30, 2019 at 10:17:27AM -0700, Logan Gunthorpe wrote:
> > 
> > 
> > On 2019-01-29 9:18 p.m., Jason Gunthorpe wrote:
> > > Every attempt to give BAR memory to struct page has run into major
> > > trouble, IMHO, so I like that this approach avoids that.
> > > 
> > > And if you don't have struct page then the only kernel object left to
> > > hang meta data off is the VMA itself.
> > > 
> > > It seems very similar to the existing P2P work between in-kernel
> > > consumers, just that VMA is now mediating a general user space driven
> > > discovery process instead of being hard wired into a driver.
> > 
> > But the kernel now has P2P bars backed by struct pages and it works
> > well. 
> 
> I don't think it works that well..
> 
> We ended up with a 'sgl' that is not really a sgl, and doesn't work
> with many of the common SGL patterns. sg_copy_buffer doesn't work,
> dma_map, doesn't work, sg_page doesn't work quite right, etc.
> 
> Only nvme and rdma got the special hacks to make them understand these
> p2p-sgls, and I'm still not convinced some of the RDMA drivers that
> want access to CPU addresses from the SGL (rxe, usnic, hfi, qib) don't
> break in this scenario.
> 
> Since the SGLs become broken, it pretty much means there is no path to
> make GUP work generically, we have to go through and make everything
> safe to use with p2p-sgls before allowing GUP. Which, frankly, sounds
> impossible with all the competing objections.
> 
> But GPU seems to have a problem unrelated to this - what Jerome wants
> is to have two faulting domains for VMA's - visible-to-cpu and
> visible-to-dma. The new op is essentially faulting the pages into the
> visible-to-dma category and leaving them invisible-to-cpu.
> 
> So that duality would still have to exists, and I think p2p_map/unmap
> is a much simpler implementation than trying to create some kind of
> special PTE in the VMA..
> 
> At least for RDMA, struct page or not doesn't really matter. 
> 
> We can make struct pages for the BAR the same way NVMe does.  GPU is
> probably the same, just with more mememory at stake?  
> 
> And maybe this should be the first implementation. The p2p_map VMA
> operation should return a SGL and the caller should do the existing
> pci_p2pdma_map_sg() flow.. 

For GPU it would not work, GPU might want to use main memory (because
it is running out of BAR space) it is a lot easier if the p2p_map
callback calls the right dma map function (for page or io) rather than
having to define some format that would pass down the information.

> 
> Worry about optimizing away the struct page overhead later?

Struct page do not fit well for GPU as the BAR address can be reprogram
to point to any page inside the device memory (think 256M BAR versus
16GB device memory). Forcing struct page on GPU driver would require
major surgery to the GPU driver inner working and there is no benefit
to have from the struct page. So it is hard to justify this.

Cheers,
Jérôme
Jason Gunthorpe Jan. 30, 2019, 7:38 p.m. UTC | #42
On Wed, Jan 30, 2019 at 02:22:34PM -0500, Jerome Glisse wrote:

> For GPU it would not work, GPU might want to use main memory (because
> it is running out of BAR space) it is a lot easier if the p2p_map
> callback calls the right dma map function (for page or io) rather than
> having to define some format that would pass down the information.

This is already sort of built into the sgl, you are supposed to use
is_pci_p2pdma_page() and pci_p2pdma_map_sg() and somehow it is supposed
to work out - but I think this is also fairly incomplete.

ie the current APIs seem to assume the SGL is homogeneous :(

> > Worry about optimizing away the struct page overhead later?
> 
> Struct page do not fit well for GPU as the BAR address can be reprogram
> to point to any page inside the device memory (think 256M BAR versus
> 16GB device memory).

The struct page only points to the BAR - it is not related to the
actual GPU memory in any way. The struct page is just an alternative
way to specify the physical address of the BAR page.

I think this boils down to one call to setup the entire BAR, like nvme
does, and then using the struct page in the p2p_map SGL??

Jason
Logan Gunthorpe Jan. 30, 2019, 7:45 p.m. UTC | #43
On 2019-01-30 12:06 p.m., Jason Gunthorpe wrote:
>> Way less problems than not having struct page for doing anything
>> non-trivial.  If you map the BAR to userspace with remap_pfn_range
>> and friends the mapping is indeed very simple.  But any operation
>> that expects a page structure, which is at least everything using
>> get_user_pages won't work.
> 
> GUP doesn't work anyhow today, and won't work with BAR struct pages in
> the forseeable future (Logan has sent attempts on this before).

I don't recall ever attempting that... But patching GUP for special
pages or VMAS; or working around by not calling it in some cases seems
like the thing that's going to need to be done one way or another.

> Jerome made the HMM mirror API use this flow, so afer his patch to
> switch the ODP MR to use HMM, and to switch GPU drivers, it will work
> for those cases. Which is more than the zero cases than we have today
> :)

But we're getting the same bait and switch here... If you are using HMM
you are using struct pages, but we're told we need this special VMA hack
for cases that don't use HMM and thus don't have struct pages...

Logan
Logan Gunthorpe Jan. 30, 2019, 7:48 p.m. UTC | #44
On 2019-01-30 12:19 p.m., Jason Gunthorpe wrote:
> On Wed, Jan 30, 2019 at 11:13:11AM -0700, Logan Gunthorpe wrote:
>>
>>
>> On 2019-01-30 10:44 a.m., Jason Gunthorpe wrote:
>>> I don't see why a special case with a VMA is really that different.
>>
>> Well one *really* big difference is the VMA changes necessarily expose
>> specialized new functionality to userspace which has to be supported
>> forever and may be difficult to change. 
> 
> The only user change here is that more things will succeed when
> creating RDMA MRs (and vice versa to GPU). I don't think this
> restricts the kernel implementation at all, unless we intend to
> remove P2P entirely..

Well for MRs I'd expect you are using struct pages to track the memory
some how.... VMAs that aren't backed by pages and use this special
interface must therefore be creating new special interfaces that can
call p2p_[un]map...

I'd much rather see special cases around struct page so we can find ways
to generalize it in the future than create special cases tied to random
userspace interfaces.

Logan
Logan Gunthorpe Jan. 30, 2019, 7:52 p.m. UTC | #45
On 2019-01-30 12:22 p.m., Jerome Glisse wrote:
> On Wed, Jan 30, 2019 at 06:56:59PM +0000, Jason Gunthorpe wrote:
>> On Wed, Jan 30, 2019 at 10:17:27AM -0700, Logan Gunthorpe wrote:
>>>
>>>
>>> On 2019-01-29 9:18 p.m., Jason Gunthorpe wrote:
>>>> Every attempt to give BAR memory to struct page has run into major
>>>> trouble, IMHO, so I like that this approach avoids that.
>>>>
>>>> And if you don't have struct page then the only kernel object left to
>>>> hang meta data off is the VMA itself.
>>>>
>>>> It seems very similar to the existing P2P work between in-kernel
>>>> consumers, just that VMA is now mediating a general user space driven
>>>> discovery process instead of being hard wired into a driver.
>>>
>>> But the kernel now has P2P bars backed by struct pages and it works
>>> well. 
>>
>> I don't think it works that well..
>>
>> We ended up with a 'sgl' that is not really a sgl, and doesn't work
>> with many of the common SGL patterns. sg_copy_buffer doesn't work,
>> dma_map, doesn't work, sg_page doesn't work quite right, etc.
>>
>> Only nvme and rdma got the special hacks to make them understand these
>> p2p-sgls, and I'm still not convinced some of the RDMA drivers that
>> want access to CPU addresses from the SGL (rxe, usnic, hfi, qib) don't
>> break in this scenario.
>>
>> Since the SGLs become broken, it pretty much means there is no path to
>> make GUP work generically, we have to go through and make everything
>> safe to use with p2p-sgls before allowing GUP. Which, frankly, sounds
>> impossible with all the competing objections.
>>
>> But GPU seems to have a problem unrelated to this - what Jerome wants
>> is to have two faulting domains for VMA's - visible-to-cpu and
>> visible-to-dma. The new op is essentially faulting the pages into the
>> visible-to-dma category and leaving them invisible-to-cpu.
>>
>> So that duality would still have to exists, and I think p2p_map/unmap
>> is a much simpler implementation than trying to create some kind of
>> special PTE in the VMA..
>>
>> At least for RDMA, struct page or not doesn't really matter. 
>>
>> We can make struct pages for the BAR the same way NVMe does.  GPU is
>> probably the same, just with more mememory at stake?  
>>
>> And maybe this should be the first implementation. The p2p_map VMA
>> operation should return a SGL and the caller should do the existing
>> pci_p2pdma_map_sg() flow.. 
> 
> For GPU it would not work, GPU might want to use main memory (because
> it is running out of BAR space) it is a lot easier if the p2p_map
> callback calls the right dma map function (for page or io) rather than
> having to define some format that would pass down the information.

>>
>> Worry about optimizing away the struct page overhead later?
> 
> Struct page do not fit well for GPU as the BAR address can be reprogram
> to point to any page inside the device memory (think 256M BAR versus
> 16GB device memory). Forcing struct page on GPU driver would require
> major surgery to the GPU driver inner working and there is no benefit
> to have from the struct page. So it is hard to justify this.

I think we have to consider the struct pages to track the address space,
not what backs it (essentially what HMM is doing). If we need to add
operations for the driver to map the address space/struct pages back to
physical memory then do that. Creating a whole new idea that's tied to
userspace VMAs still seems wrong to me.

Logan
Jason Gunthorpe Jan. 30, 2019, 7:59 p.m. UTC | #46
On Wed, Jan 30, 2019 at 12:45:46PM -0700, Logan Gunthorpe wrote:
> 
> 
> On 2019-01-30 12:06 p.m., Jason Gunthorpe wrote:
> >> Way less problems than not having struct page for doing anything
> >> non-trivial.  If you map the BAR to userspace with remap_pfn_range
> >> and friends the mapping is indeed very simple.  But any operation
> >> that expects a page structure, which is at least everything using
> >> get_user_pages won't work.
> > 
> > GUP doesn't work anyhow today, and won't work with BAR struct pages in
> > the forseeable future (Logan has sent attempts on this before).
> 
> I don't recall ever attempting that... But patching GUP for special
> pages or VMAS; or working around by not calling it in some cases seems
> like the thing that's going to need to be done one way or another.

Remember, the long discussion we had about how to get the IOMEM
annotation into SGL? That is a necessary pre-condition to doing
anything with GUP in DMA using drivers as GUP -> SGL -> DMA map is
pretty much the standard flow.
 
> > Jerome made the HMM mirror API use this flow, so afer his patch to
> > switch the ODP MR to use HMM, and to switch GPU drivers, it will work
> > for those cases. Which is more than the zero cases than we have today
> > :)
> 
> But we're getting the same bait and switch here... If you are using HMM
> you are using struct pages, but we're told we need this special VMA hack
> for cases that don't use HMM and thus don't have struct pages...

Well, I don't know much about HMM, but the HMM mirror API looks like a
version of MMU notifiers that offloads a bunch of dreck to the HMM
code core instead of drivers. The RDMA code got hundreds of lines
shorter by using it.

Some of that dreck is obtaining a DMA address for the user VMAs,
including using multiple paths to get them. A driver using HMM mirror
doesn't seem to call GUP at all, HMM mirror handles that, along with
various special cases, including calling out to these new VMA ops.

I don't really know how mirror relates to other parts of HMM, like the
bits that use struct pages. Maybe it also knows about more special
cases created by other parts of HMM?

So, I see Jerome solving the GUP problem by replacing GUP entirely
using an API that is more suited to what these sorts of drivers
actually need.

Jason
Logan Gunthorpe Jan. 30, 2019, 8 p.m. UTC | #47
On 2019-01-30 12:38 p.m., Jason Gunthorpe wrote:
> On Wed, Jan 30, 2019 at 02:22:34PM -0500, Jerome Glisse wrote:
> 
>> For GPU it would not work, GPU might want to use main memory (because
>> it is running out of BAR space) it is a lot easier if the p2p_map
>> callback calls the right dma map function (for page or io) rather than
>> having to define some format that would pass down the information.
> 
> This is already sort of built into the sgl, you are supposed to use
> is_pci_p2pdma_page() and pci_p2pdma_map_sg() and somehow it is supposed
> to work out - but I think this is also fairly incomplete.


> ie the current APIs seem to assume the SGL is homogeneous :(

We never changed SGLs. We still use them to pass p2pdma pages, only we
need to be a bit careful where we send the entire SGL. I see no reason
why we can't continue to be careful once their in userspace if there's
something in GUP to deny them.

It would be nice to have heterogeneous SGLs and it is something we
should work toward but in practice they aren't really necessary at the
moment.

>>> Worry about optimizing away the struct page overhead later?
>>
>> Struct page do not fit well for GPU as the BAR address can be reprogram
>> to point to any page inside the device memory (think 256M BAR versus
>> 16GB device memory).
> 
> The struct page only points to the BAR - it is not related to the
> actual GPU memory in any way. The struct page is just an alternative
> way to specify the physical address of the BAR page.

That doesn't even necessarily need to be the case. For HMM, I
understand, struct pages may not point to any accessible memory and the
memory that backs it (or not) may change over the life time of it. So
they don't have to be strictly tied to BARs addresses. p2pdma pages are
strictly tied to BAR addresses though.

Logan
Jason Gunthorpe Jan. 30, 2019, 8:11 p.m. UTC | #48
On Wed, Jan 30, 2019 at 01:00:02PM -0700, Logan Gunthorpe wrote:

> We never changed SGLs. We still use them to pass p2pdma pages, only we
> need to be a bit careful where we send the entire SGL. I see no reason
> why we can't continue to be careful once their in userspace if there's
> something in GUP to deny them.
> 
> It would be nice to have heterogeneous SGLs and it is something we
> should work toward but in practice they aren't really necessary at the
> moment.

RDMA generally cannot cope well with an API that requires homogeneous
SGLs.. User space can construct complex MRs (particularly with the
proposed SGL MR flow) and we must marshal that into a single SGL or
the drivers fall apart.

Jerome explained that GPU is worse, a single VMA may have a random mix
of CPU or device pages..

This is a pretty big blocker that would have to somehow be fixed.

> That doesn't even necessarily need to be the case. For HMM, I
> understand, struct pages may not point to any accessible memory and the
> memory that backs it (or not) may change over the life time of it. So
> they don't have to be strictly tied to BARs addresses. p2pdma pages are
> strictly tied to BAR addresses though.

No idea, but at least for this case I don't think we need magic HMM
pages to make simple VMA ops p2p_map/umap work..

Jason
Jerome Glisse Jan. 30, 2019, 8:35 p.m. UTC | #49
On Wed, Jan 30, 2019 at 12:52:44PM -0700, Logan Gunthorpe wrote:
> 
> 
> On 2019-01-30 12:22 p.m., Jerome Glisse wrote:
> > On Wed, Jan 30, 2019 at 06:56:59PM +0000, Jason Gunthorpe wrote:
> >> On Wed, Jan 30, 2019 at 10:17:27AM -0700, Logan Gunthorpe wrote:
> >>>
> >>>
> >>> On 2019-01-29 9:18 p.m., Jason Gunthorpe wrote:
> >>>> Every attempt to give BAR memory to struct page has run into major
> >>>> trouble, IMHO, so I like that this approach avoids that.
> >>>>
> >>>> And if you don't have struct page then the only kernel object left to
> >>>> hang meta data off is the VMA itself.
> >>>>
> >>>> It seems very similar to the existing P2P work between in-kernel
> >>>> consumers, just that VMA is now mediating a general user space driven
> >>>> discovery process instead of being hard wired into a driver.
> >>>
> >>> But the kernel now has P2P bars backed by struct pages and it works
> >>> well. 
> >>
> >> I don't think it works that well..
> >>
> >> We ended up with a 'sgl' that is not really a sgl, and doesn't work
> >> with many of the common SGL patterns. sg_copy_buffer doesn't work,
> >> dma_map, doesn't work, sg_page doesn't work quite right, etc.
> >>
> >> Only nvme and rdma got the special hacks to make them understand these
> >> p2p-sgls, and I'm still not convinced some of the RDMA drivers that
> >> want access to CPU addresses from the SGL (rxe, usnic, hfi, qib) don't
> >> break in this scenario.
> >>
> >> Since the SGLs become broken, it pretty much means there is no path to
> >> make GUP work generically, we have to go through and make everything
> >> safe to use with p2p-sgls before allowing GUP. Which, frankly, sounds
> >> impossible with all the competing objections.
> >>
> >> But GPU seems to have a problem unrelated to this - what Jerome wants
> >> is to have two faulting domains for VMA's - visible-to-cpu and
> >> visible-to-dma. The new op is essentially faulting the pages into the
> >> visible-to-dma category and leaving them invisible-to-cpu.
> >>
> >> So that duality would still have to exists, and I think p2p_map/unmap
> >> is a much simpler implementation than trying to create some kind of
> >> special PTE in the VMA..
> >>
> >> At least for RDMA, struct page or not doesn't really matter. 
> >>
> >> We can make struct pages for the BAR the same way NVMe does.  GPU is
> >> probably the same, just with more mememory at stake?  
> >>
> >> And maybe this should be the first implementation. The p2p_map VMA
> >> operation should return a SGL and the caller should do the existing
> >> pci_p2pdma_map_sg() flow.. 
> > 
> > For GPU it would not work, GPU might want to use main memory (because
> > it is running out of BAR space) it is a lot easier if the p2p_map
> > callback calls the right dma map function (for page or io) rather than
> > having to define some format that would pass down the information.
> 
> >>
> >> Worry about optimizing away the struct page overhead later?
> > 
> > Struct page do not fit well for GPU as the BAR address can be reprogram
> > to point to any page inside the device memory (think 256M BAR versus
> > 16GB device memory). Forcing struct page on GPU driver would require
> > major surgery to the GPU driver inner working and there is no benefit
> > to have from the struct page. So it is hard to justify this.
> 
> I think we have to consider the struct pages to track the address space,
> not what backs it (essentially what HMM is doing). If we need to add
> operations for the driver to map the address space/struct pages back to
> physical memory then do that. Creating a whole new idea that's tied to
> userspace VMAs still seems wrong to me.

VMA is the object RDMA works on, GPU driver have been working with
VMA too, where VMA is tie to only one specific GPU object. So the most
disrupting approach here is using struct page. It was never use and
will not be use in many driver. Updating those to struct page is too
risky and too much changes. The vma call back is something you can
remove at any time if you have something better that do not need major
surgery to GPU driver.

Cheers,
Jérôme
Jerome Glisse Jan. 30, 2019, 8:43 p.m. UTC | #50
On Wed, Jan 30, 2019 at 08:11:19PM +0000, Jason Gunthorpe wrote:
> On Wed, Jan 30, 2019 at 01:00:02PM -0700, Logan Gunthorpe wrote:
> 
> > We never changed SGLs. We still use them to pass p2pdma pages, only we
> > need to be a bit careful where we send the entire SGL. I see no reason
> > why we can't continue to be careful once their in userspace if there's
> > something in GUP to deny them.
> > 
> > It would be nice to have heterogeneous SGLs and it is something we
> > should work toward but in practice they aren't really necessary at the
> > moment.
> 
> RDMA generally cannot cope well with an API that requires homogeneous
> SGLs.. User space can construct complex MRs (particularly with the
> proposed SGL MR flow) and we must marshal that into a single SGL or
> the drivers fall apart.
> 
> Jerome explained that GPU is worse, a single VMA may have a random mix
> of CPU or device pages..
> 
> This is a pretty big blocker that would have to somehow be fixed.

Note that HMM takes care of that RDMA ODP with my ODP to HMM patch,
so what you get for an ODP umem is just a list of dma address you
can program your device to. The aim is to avoid the driver to care
about that. The access policy when the UMEM object is created by
userspace through verbs API should however ascertain that for mmap
of device file it is only creating a UMEM that is fully covered by
one and only one vma. GPU device driver will have one vma per logical
GPU object. I expect other kind of device do that same so that they
can match a vma to a unique object in their driver.

> 
> > That doesn't even necessarily need to be the case. For HMM, I
> > understand, struct pages may not point to any accessible memory and the
> > memory that backs it (or not) may change over the life time of it. So
> > they don't have to be strictly tied to BARs addresses. p2pdma pages are
> > strictly tied to BAR addresses though.
> 
> No idea, but at least for this case I don't think we need magic HMM
> pages to make simple VMA ops p2p_map/umap work..

Yes, you do not need page for simple driver, if we start creating struct
page for all PCIE BAR we are gonna waste a lot of memory and resources
for no good reason. I doubt all of the PCIE BAR of a device enabling p2p
will ever be map as p2p. So simple driver do not need struct page, GPU
driver that do not use HMM (all GPU that are more than 2 years old) do
not need struct page. Struct page is a burden here more than anything
else. I have not seen one good thing the struct page gives you.

Cheers,
Jérôme
Jason Gunthorpe Jan. 30, 2019, 8:44 p.m. UTC | #51
On Wed, Jan 30, 2019 at 12:48:33PM -0700, Logan Gunthorpe wrote:
> 
> 
> On 2019-01-30 12:19 p.m., Jason Gunthorpe wrote:
> > On Wed, Jan 30, 2019 at 11:13:11AM -0700, Logan Gunthorpe wrote:
> >>
> >>
> >> On 2019-01-30 10:44 a.m., Jason Gunthorpe wrote:
> >>> I don't see why a special case with a VMA is really that different.
> >>
> >> Well one *really* big difference is the VMA changes necessarily expose
> >> specialized new functionality to userspace which has to be supported
> >> forever and may be difficult to change. 
> > 
> > The only user change here is that more things will succeed when
> > creating RDMA MRs (and vice versa to GPU). I don't think this
> > restricts the kernel implementation at all, unless we intend to
> > remove P2P entirely..
> 
> Well for MRs I'd expect you are using struct pages to track the memory
> some how.... 

Not really, for MRs most drivers care about DMA addresses only. The
only reason struct page ever gets involved is because it is part of
the GUP, SGL and dma_map family of APIs.

> VMAs that aren't backed by pages and use this special interface must
> therefore be creating new special interfaces that can call
> p2p_[un]map...

Well, those are kernel internal interfaces, so they can be changed

No matter what we do, code that wants to DMA to user BAR pages must
take *some kind* of special care - either it needs to use a special
GUP and SGL flow, or a mixed GUP, SGL and p2p_map flow. 

I don't really see why one is better than the other at this point, or
why doing one means we can't do the other some day later. They are
fairly similar.

O_DIRECT seems to be the justification for struct page, but nobody is
signing up to make O_DIRECT have the required special GUP/SGL/P2P flow
that would be needed to *actually* make that work - so it really isn't
a justification today.

Jason
Jason Gunthorpe Jan. 30, 2019, 8:50 p.m. UTC | #52
On Wed, Jan 30, 2019 at 03:43:32PM -0500, Jerome Glisse wrote:
> On Wed, Jan 30, 2019 at 08:11:19PM +0000, Jason Gunthorpe wrote:
> > On Wed, Jan 30, 2019 at 01:00:02PM -0700, Logan Gunthorpe wrote:
> > 
> > > We never changed SGLs. We still use them to pass p2pdma pages, only we
> > > need to be a bit careful where we send the entire SGL. I see no reason
> > > why we can't continue to be careful once their in userspace if there's
> > > something in GUP to deny them.
> > > 
> > > It would be nice to have heterogeneous SGLs and it is something we
> > > should work toward but in practice they aren't really necessary at the
> > > moment.
> > 
> > RDMA generally cannot cope well with an API that requires homogeneous
> > SGLs.. User space can construct complex MRs (particularly with the
> > proposed SGL MR flow) and we must marshal that into a single SGL or
> > the drivers fall apart.
> > 
> > Jerome explained that GPU is worse, a single VMA may have a random mix
> > of CPU or device pages..
> > 
> > This is a pretty big blocker that would have to somehow be fixed.
> 
> Note that HMM takes care of that RDMA ODP with my ODP to HMM patch,
> so what you get for an ODP umem is just a list of dma address you
> can program your device to. The aim is to avoid the driver to care
> about that. The access policy when the UMEM object is created by
> userspace through verbs API should however ascertain that for mmap
> of device file it is only creating a UMEM that is fully covered by
> one and only one vma. GPU device driver will have one vma per logical
> GPU object. I expect other kind of device do that same so that they
> can match a vma to a unique object in their driver.

A one VMA rule is not really workable.

With ODP VMA boundaries can move around across the lifetime of the MR
and we have no obvious way to fail anything if userpace puts a VMA
boundary in the middle of an existing ODP MR address range.

I think the HMM mirror API really needs to deal with this for the
driver somehow.

Jason
Logan Gunthorpe Jan. 30, 2019, 9:01 p.m. UTC | #53
On 2019-01-30 12:59 p.m., Jason Gunthorpe wrote:
> On Wed, Jan 30, 2019 at 12:45:46PM -0700, Logan Gunthorpe wrote:
>>
>>
>> On 2019-01-30 12:06 p.m., Jason Gunthorpe wrote:
>>>> Way less problems than not having struct page for doing anything
>>>> non-trivial.  If you map the BAR to userspace with remap_pfn_range
>>>> and friends the mapping is indeed very simple.  But any operation
>>>> that expects a page structure, which is at least everything using
>>>> get_user_pages won't work.
>>>
>>> GUP doesn't work anyhow today, and won't work with BAR struct pages in
>>> the forseeable future (Logan has sent attempts on this before).
>>
>> I don't recall ever attempting that... But patching GUP for special
>> pages or VMAS; or working around by not calling it in some cases seems
>> like the thing that's going to need to be done one way or another.
> 
> Remember, the long discussion we had about how to get the IOMEM
> annotation into SGL? That is a necessary pre-condition to doing
> anything with GUP in DMA using drivers as GUP -> SGL -> DMA map is
> pretty much the standard flow.

Yes, but that was unrelated to GUP even if that might have been the
eventual direction.

And I feel the GUP->SGL->DMA flow should still be what we are aiming
for. Even if we need a special GUP for special pages, and a special DMA
map; and the SGL still has to be homogenous....

> So, I see Jerome solving the GUP problem by replacing GUP entirely
> using an API that is more suited to what these sorts of drivers
> actually need.

Yes, this is what I'm expecting and what I want. Not bypassing the whole
thing by doing special things with VMAs.

Logan
Jerome Glisse Jan. 30, 2019, 9:45 p.m. UTC | #54
On Wed, Jan 30, 2019 at 08:50:00PM +0000, Jason Gunthorpe wrote:
> On Wed, Jan 30, 2019 at 03:43:32PM -0500, Jerome Glisse wrote:
> > On Wed, Jan 30, 2019 at 08:11:19PM +0000, Jason Gunthorpe wrote:
> > > On Wed, Jan 30, 2019 at 01:00:02PM -0700, Logan Gunthorpe wrote:
> > > 
> > > > We never changed SGLs. We still use them to pass p2pdma pages, only we
> > > > need to be a bit careful where we send the entire SGL. I see no reason
> > > > why we can't continue to be careful once their in userspace if there's
> > > > something in GUP to deny them.
> > > > 
> > > > It would be nice to have heterogeneous SGLs and it is something we
> > > > should work toward but in practice they aren't really necessary at the
> > > > moment.
> > > 
> > > RDMA generally cannot cope well with an API that requires homogeneous
> > > SGLs.. User space can construct complex MRs (particularly with the
> > > proposed SGL MR flow) and we must marshal that into a single SGL or
> > > the drivers fall apart.
> > > 
> > > Jerome explained that GPU is worse, a single VMA may have a random mix
> > > of CPU or device pages..
> > > 
> > > This is a pretty big blocker that would have to somehow be fixed.
> > 
> > Note that HMM takes care of that RDMA ODP with my ODP to HMM patch,
> > so what you get for an ODP umem is just a list of dma address you
> > can program your device to. The aim is to avoid the driver to care
> > about that. The access policy when the UMEM object is created by
> > userspace through verbs API should however ascertain that for mmap
> > of device file it is only creating a UMEM that is fully covered by
> > one and only one vma. GPU device driver will have one vma per logical
> > GPU object. I expect other kind of device do that same so that they
> > can match a vma to a unique object in their driver.
> 
> A one VMA rule is not really workable.
> 
> With ODP VMA boundaries can move around across the lifetime of the MR
> and we have no obvious way to fail anything if userpace puts a VMA
> boundary in the middle of an existing ODP MR address range.

This is true only for vma that are not mmap of a device file. This is
what i was trying to get accross. An mmap of a file is never merge
so it can only get split/butcher by munmap/mremap but when that happen
you also need to reflect the virtual address space change to the
device ie any access to a now invalid range must trigger error.

> 
> I think the HMM mirror API really needs to deal with this for the
> driver somehow.

Yes the HMM does deal with this for you, you do not have to worry about
it. Sorry if that was not clear. I just wanted to stress that vma that
are mmap of a file do not behave like other vma hence when you create
the UMEM you can check for those if you feel the need.

Cheers,
Jérôme
Jason Gunthorpe Jan. 30, 2019, 9:50 p.m. UTC | #55
On Wed, Jan 30, 2019 at 02:01:35PM -0700, Logan Gunthorpe wrote:

> And I feel the GUP->SGL->DMA flow should still be what we are aiming
> for. Even if we need a special GUP for special pages, and a special DMA
> map; and the SGL still has to be homogenous....

*shrug* so what if the special GUP called a VMA op instead of
traversing the VMA PTEs today? Why does it really matter? It could
easily change to a struct page flow tomorrow..

> > So, I see Jerome solving the GUP problem by replacing GUP entirely
> > using an API that is more suited to what these sorts of drivers
> > actually need.
> 
> Yes, this is what I'm expecting and what I want. Not bypassing the whole
> thing by doing special things with VMAs.

IMHO struct page is a big pain for this application, and if we can
build flows that don't actually need it then we shouldn't require it
just because the old flows needed it.

HMM mirror is a new flow that doesn't need struct page.

Would you feel better if this also came along with a:

  struct dma_sg_table *sgl_dma_map_user(struct device *dma_device, 
             void __user *prt, size_t len)

flow which returns a *DMA MAPPED* sgl that does not have struct page
pointers as another interface?

We can certainly call an API like this from RDMA for non-ODP MRs.

Eliminating the page pointers also eliminates the __iomem
problem. However this sgl object is not copyable or accessible from
the CPU, so the caller must be sure it doesn't need CPU access when
using this API. 

For RDMA I'd include some flag in the struct ib_device if the driver
requires CPU accessible SGLs and call the right API. Maybe the block
layer could do the same trick for O_DIRECT?

This would also directly solve the P2P problem with hfi1/qib/rxe, as
I'd likely also say that pci_p2pdma_map_sg() returns the same DMA only
sgl thing.

Jason
Jason Gunthorpe Jan. 30, 2019, 9:56 p.m. UTC | #56
On Wed, Jan 30, 2019 at 04:45:25PM -0500, Jerome Glisse wrote:
> On Wed, Jan 30, 2019 at 08:50:00PM +0000, Jason Gunthorpe wrote:
> > On Wed, Jan 30, 2019 at 03:43:32PM -0500, Jerome Glisse wrote:
> > > On Wed, Jan 30, 2019 at 08:11:19PM +0000, Jason Gunthorpe wrote:
> > > > On Wed, Jan 30, 2019 at 01:00:02PM -0700, Logan Gunthorpe wrote:
> > > > 
> > > > > We never changed SGLs. We still use them to pass p2pdma pages, only we
> > > > > need to be a bit careful where we send the entire SGL. I see no reason
> > > > > why we can't continue to be careful once their in userspace if there's
> > > > > something in GUP to deny them.
> > > > > 
> > > > > It would be nice to have heterogeneous SGLs and it is something we
> > > > > should work toward but in practice they aren't really necessary at the
> > > > > moment.
> > > > 
> > > > RDMA generally cannot cope well with an API that requires homogeneous
> > > > SGLs.. User space can construct complex MRs (particularly with the
> > > > proposed SGL MR flow) and we must marshal that into a single SGL or
> > > > the drivers fall apart.
> > > > 
> > > > Jerome explained that GPU is worse, a single VMA may have a random mix
> > > > of CPU or device pages..
> > > > 
> > > > This is a pretty big blocker that would have to somehow be fixed.
> > > 
> > > Note that HMM takes care of that RDMA ODP with my ODP to HMM patch,
> > > so what you get for an ODP umem is just a list of dma address you
> > > can program your device to. The aim is to avoid the driver to care
> > > about that. The access policy when the UMEM object is created by
> > > userspace through verbs API should however ascertain that for mmap
> > > of device file it is only creating a UMEM that is fully covered by
> > > one and only one vma. GPU device driver will have one vma per logical
> > > GPU object. I expect other kind of device do that same so that they
> > > can match a vma to a unique object in their driver.
> > 
> > A one VMA rule is not really workable.
> > 
> > With ODP VMA boundaries can move around across the lifetime of the MR
> > and we have no obvious way to fail anything if userpace puts a VMA
> > boundary in the middle of an existing ODP MR address range.
> 
> This is true only for vma that are not mmap of a device file. This is
> what i was trying to get accross. An mmap of a file is never merge
> so it can only get split/butcher by munmap/mremap but when that happen
> you also need to reflect the virtual address space change to the
> device ie any access to a now invalid range must trigger error.

Why is it invalid? The address range still has valid process memory?

What is the problem in the HMM mirror that it needs this restriction?

There is also the situation where we create an ODP MR that spans 0 ->
U64_MAX in the process address space. In this case there are lots of
different VMAs it covers and we expect it to fully track all changes
to all VMAs.

So we have to spin up dedicated umem_odps that carefully span single
VMAs, and somehow track changes to VMA ?

mlx5 odp does some of this already.. But yikes, this needs some pretty
careful testing in all these situations.

> > I think the HMM mirror API really needs to deal with this for the
> > driver somehow.
> 
> Yes the HMM does deal with this for you, you do not have to worry about
> it. Sorry if that was not clear. I just wanted to stress that vma that
> are mmap of a file do not behave like other vma hence when you create
> the UMEM you can check for those if you feel the need.

What properties do we get from HMM mirror? Will it tell us when to
create more umems to cover VMA seams or will it just cause undesired
no-mapped failures in some cases?

Jason
Jerome Glisse Jan. 30, 2019, 10:30 p.m. UTC | #57
On Wed, Jan 30, 2019 at 09:56:07PM +0000, Jason Gunthorpe wrote:
> On Wed, Jan 30, 2019 at 04:45:25PM -0500, Jerome Glisse wrote:
> > On Wed, Jan 30, 2019 at 08:50:00PM +0000, Jason Gunthorpe wrote:
> > > On Wed, Jan 30, 2019 at 03:43:32PM -0500, Jerome Glisse wrote:
> > > > On Wed, Jan 30, 2019 at 08:11:19PM +0000, Jason Gunthorpe wrote:
> > > > > On Wed, Jan 30, 2019 at 01:00:02PM -0700, Logan Gunthorpe wrote:
> > > > > 
> > > > > > We never changed SGLs. We still use them to pass p2pdma pages, only we
> > > > > > need to be a bit careful where we send the entire SGL. I see no reason
> > > > > > why we can't continue to be careful once their in userspace if there's
> > > > > > something in GUP to deny them.
> > > > > > 
> > > > > > It would be nice to have heterogeneous SGLs and it is something we
> > > > > > should work toward but in practice they aren't really necessary at the
> > > > > > moment.
> > > > > 
> > > > > RDMA generally cannot cope well with an API that requires homogeneous
> > > > > SGLs.. User space can construct complex MRs (particularly with the
> > > > > proposed SGL MR flow) and we must marshal that into a single SGL or
> > > > > the drivers fall apart.
> > > > > 
> > > > > Jerome explained that GPU is worse, a single VMA may have a random mix
> > > > > of CPU or device pages..
> > > > > 
> > > > > This is a pretty big blocker that would have to somehow be fixed.
> > > > 
> > > > Note that HMM takes care of that RDMA ODP with my ODP to HMM patch,
> > > > so what you get for an ODP umem is just a list of dma address you
> > > > can program your device to. The aim is to avoid the driver to care
> > > > about that. The access policy when the UMEM object is created by
> > > > userspace through verbs API should however ascertain that for mmap
> > > > of device file it is only creating a UMEM that is fully covered by
> > > > one and only one vma. GPU device driver will have one vma per logical
> > > > GPU object. I expect other kind of device do that same so that they
> > > > can match a vma to a unique object in their driver.
> > > 
> > > A one VMA rule is not really workable.
> > > 
> > > With ODP VMA boundaries can move around across the lifetime of the MR
> > > and we have no obvious way to fail anything if userpace puts a VMA
> > > boundary in the middle of an existing ODP MR address range.
> > 
> > This is true only for vma that are not mmap of a device file. This is
> > what i was trying to get accross. An mmap of a file is never merge
> > so it can only get split/butcher by munmap/mremap but when that happen
> > you also need to reflect the virtual address space change to the
> > device ie any access to a now invalid range must trigger error.
> 
> Why is it invalid? The address range still has valid process memory?

If you do munmap(A, size) then all address in the range [A, A+size]
are invalid. This is what i am refering too here. Same for mremap.

> 
> What is the problem in the HMM mirror that it needs this restriction?

No restriction at all here. I think i just wasn't understood.

> There is also the situation where we create an ODP MR that spans 0 ->
> U64_MAX in the process address space. In this case there are lots of
> different VMAs it covers and we expect it to fully track all changes
> to all VMAs.

Yes and that works however any memory access above TASK_SIZE will
return -EFAULT as this is kernel address space so you can only access
anything that is a valid process virtual address.

> 
> So we have to spin up dedicated umem_odps that carefully span single
> VMAs, and somehow track changes to VMA ?

No you do not.

> 
> mlx5 odp does some of this already.. But yikes, this needs some pretty
> careful testing in all these situations.

Sorry if i confused you even more than the first time. Everything works
you have nothing to worry about :)

> 
> > > I think the HMM mirror API really needs to deal with this for the
> > > driver somehow.
> > 
> > Yes the HMM does deal with this for you, you do not have to worry about
> > it. Sorry if that was not clear. I just wanted to stress that vma that
> > are mmap of a file do not behave like other vma hence when you create
> > the UMEM you can check for those if you feel the need.
> 
> What properties do we get from HMM mirror? Will it tell us when to
> create more umems to cover VMA seams or will it just cause undesired
> no-mapped failures in some cases?

You do not get anything from HMM mirror, i might add a flag so that
HMM can report this special condition if driver wants to know. If
you want to know you have to look at the vma by yourself. GPU driver
will definitly want to know whem importing so i might add a flag so
that they do not have to lookup the vma themself to know.

Again if you do not care then just ignore everything here, it is
handled by HMM and you do not have to worry one bit. If it worked
with GUP it will work with HMM and with those p2p patches if it
will even works against vma that are mmap of a file and that set
the p2p_map function.

Cheers,
Jérôme
Jason Gunthorpe Jan. 30, 2019, 10:33 p.m. UTC | #58
On Wed, Jan 30, 2019 at 05:30:27PM -0500, Jerome Glisse wrote:

> > What is the problem in the HMM mirror that it needs this restriction?
> 
> No restriction at all here. I think i just wasn't understood.

Are you are talking about from the exporting side - where the thing
creating the VMA can really only put one distinct object into it?

Jason
Jerome Glisse Jan. 30, 2019, 10:47 p.m. UTC | #59
On Wed, Jan 30, 2019 at 10:33:04PM +0000, Jason Gunthorpe wrote:
> On Wed, Jan 30, 2019 at 05:30:27PM -0500, Jerome Glisse wrote:
> 
> > > What is the problem in the HMM mirror that it needs this restriction?
> > 
> > No restriction at all here. I think i just wasn't understood.
> 
> Are you are talking about from the exporting side - where the thing
> creating the VMA can really only put one distinct object into it?

The message i was trying to get accross is that HMM mirror will
always succeed for everything* except for special vma ie mmap of
device file. For those it can only succeed if a p2p_map() call
succeed.

So any user of HMM mirror might to know why the mirroring fail ie
was it because something exceptional is happening ? Or is it because
i was trying to map a special vma which can be forbiden.

Hence why i assume that you might want to know about such p2p_map
failure at the time you create the umem odp object as it might be
some failure you might want to report differently and handle
differently. If you do not care about differentiating OOM or
exceptional failure from p2p_map failure than you have nothing to
worry about you will get the same error from HMM for both.

Cheers,
Jérôme

* Everything except when they are exceptional condition like OOM or
  poisonous memory.
Jason Gunthorpe Jan. 30, 2019, 10:51 p.m. UTC | #60
On Wed, Jan 30, 2019 at 05:47:05PM -0500, Jerome Glisse wrote:
> On Wed, Jan 30, 2019 at 10:33:04PM +0000, Jason Gunthorpe wrote:
> > On Wed, Jan 30, 2019 at 05:30:27PM -0500, Jerome Glisse wrote:
> > 
> > > > What is the problem in the HMM mirror that it needs this restriction?
> > > 
> > > No restriction at all here. I think i just wasn't understood.
> > 
> > Are you are talking about from the exporting side - where the thing
> > creating the VMA can really only put one distinct object into it?
> 
> The message i was trying to get accross is that HMM mirror will
> always succeed for everything* except for special vma ie mmap of
> device file. For those it can only succeed if a p2p_map() call
> succeed.
> 
> So any user of HMM mirror might to know why the mirroring fail ie
> was it because something exceptional is happening ? Or is it because
> i was trying to map a special vma which can be forbiden.
> 
> Hence why i assume that you might want to know about such p2p_map
> failure at the time you create the umem odp object as it might be
> some failure you might want to report differently and handle
> differently. If you do not care about differentiating OOM or
> exceptional failure from p2p_map failure than you have nothing to
> worry about you will get the same error from HMM for both.

I think my hope here was that we could have some kind of 'trial'
interface where very eary users can call
'hmm_mirror_is_maybe_supported(dev, user_ptr, len)' and get a failure
indication.

We probably wouldn't call this on the full address space though

Beyond that it is just inevitable there can be problems faulting if
the memory map is messed with after MR is created.

And here again, I don't want to worry about any particular VMA
boundaries..

Jason
Logan Gunthorpe Jan. 30, 2019, 10:52 p.m. UTC | #61
On 2019-01-30 2:50 p.m., Jason Gunthorpe wrote:
> On Wed, Jan 30, 2019 at 02:01:35PM -0700, Logan Gunthorpe wrote:
> 
>> And I feel the GUP->SGL->DMA flow should still be what we are aiming
>> for. Even if we need a special GUP for special pages, and a special DMA
>> map; and the SGL still has to be homogenous....
> 
> *shrug* so what if the special GUP called a VMA op instead of
> traversing the VMA PTEs today? Why does it really matter? It could
> easily change to a struct page flow tomorrow..

Well it's so that it's composable. We want the SGL->DMA side to work for
APIs from kernel space and not have to run a completely different flow
for kernel drivers than from userspace memory.

For GUP to do a special VMA traversal it would now need to return
something besides struct pages which means no SGL and it means a
completely different DMA mapping call.
> Would you feel better if this also came along with a:
> 
>   struct dma_sg_table *sgl_dma_map_user(struct device *dma_device, 
>              void __user *prt, size_t len)

That seems like a nice API. But certainly the implementation would need
to use existing dma_map or pci_p2pdma_map calls, or whatever as part of
it...

,
> flow which returns a *DMA MAPPED* sgl that does not have struct page
> pointers as another interface?
> 
> We can certainly call an API like this from RDMA for non-ODP MRs.
> 
> Eliminating the page pointers also eliminates the __iomem
> problem. However this sgl object is not copyable or accessible from
> the CPU, so the caller must be sure it doesn't need CPU access when
> using this API. 

We actually stopped caring about the __iomem problem. We are working
under the assumption that pages returned by devm_memremap_pages() can be
accessed as normal RAM and does not need the __iomem designation. The
main problem now is that code paths need to know to use pci_p2pdma_map
or not. And in theory this could be pushed into regular dma_map
implementations but we'd have to get it into all of them which is a pain.

Logan
Jerome Glisse Jan. 30, 2019, 10:58 p.m. UTC | #62
On Wed, Jan 30, 2019 at 10:51:55PM +0000, Jason Gunthorpe wrote:
> On Wed, Jan 30, 2019 at 05:47:05PM -0500, Jerome Glisse wrote:
> > On Wed, Jan 30, 2019 at 10:33:04PM +0000, Jason Gunthorpe wrote:
> > > On Wed, Jan 30, 2019 at 05:30:27PM -0500, Jerome Glisse wrote:
> > > 
> > > > > What is the problem in the HMM mirror that it needs this restriction?
> > > > 
> > > > No restriction at all here. I think i just wasn't understood.
> > > 
> > > Are you are talking about from the exporting side - where the thing
> > > creating the VMA can really only put one distinct object into it?
> > 
> > The message i was trying to get accross is that HMM mirror will
> > always succeed for everything* except for special vma ie mmap of
> > device file. For those it can only succeed if a p2p_map() call
> > succeed.
> > 
> > So any user of HMM mirror might to know why the mirroring fail ie
> > was it because something exceptional is happening ? Or is it because
> > i was trying to map a special vma which can be forbiden.
> > 
> > Hence why i assume that you might want to know about such p2p_map
> > failure at the time you create the umem odp object as it might be
> > some failure you might want to report differently and handle
> > differently. If you do not care about differentiating OOM or
> > exceptional failure from p2p_map failure than you have nothing to
> > worry about you will get the same error from HMM for both.
> 
> I think my hope here was that we could have some kind of 'trial'
> interface where very eary users can call
> 'hmm_mirror_is_maybe_supported(dev, user_ptr, len)' and get a failure
> indication.
> 
> We probably wouldn't call this on the full address space though

Yes we can do special wrapper around the general case that allow
caller to differentiate failure. So at creation you call the
special flavor and get proper distinction between error. Afterward
during normal operation any failure is just treated in a same way
no matter what is the reasons (munmap, mremap, mprotect, ...).


> Beyond that it is just inevitable there can be problems faulting if
> the memory map is messed with after MR is created.
> 
> And here again, I don't want to worry about any particular VMA
> boundaries..

You do not have to worry about boundaries HMM will return -EFAULT
if there is no valid vma behind the address you are trying to map
(or if the vma prot does not allow you to access it). So then you
can handle that failure just like you do now and as my ODP HMM
patch preserve.

Cheers,
Jérôme
Jason Gunthorpe Jan. 30, 2019, 11:30 p.m. UTC | #63
On Wed, Jan 30, 2019 at 03:52:13PM -0700, Logan Gunthorpe wrote:
> 
> 
> On 2019-01-30 2:50 p.m., Jason Gunthorpe wrote:
> > On Wed, Jan 30, 2019 at 02:01:35PM -0700, Logan Gunthorpe wrote:
> > 
> >> And I feel the GUP->SGL->DMA flow should still be what we are aiming
> >> for. Even if we need a special GUP for special pages, and a special DMA
> >> map; and the SGL still has to be homogenous....
> > 
> > *shrug* so what if the special GUP called a VMA op instead of
> > traversing the VMA PTEs today? Why does it really matter? It could
> > easily change to a struct page flow tomorrow..
> 
> Well it's so that it's composable. We want the SGL->DMA side to work for
> APIs from kernel space and not have to run a completely different flow
> for kernel drivers than from userspace memory.

If we want to have these DMA-only SGLs anyhow, then the kernel flow
can use them too.

In the kernel it easier because the 'exporter' already knows it is
working with BAR memory, so it can just do something like this:

struct dma_sg_table *sgl_dma_map_pci_bar(struct pci_device *from,
                                         struct device *to,
                                         unsigned long bar_ptr,
                                         size_t length)

And then it falls down the same DMA-SGL-only kind of flow that would
exist to support the user side. ie it is the kernel version of the API
I described below.

> For GUP to do a special VMA traversal it would now need to return
> something besides struct pages which means no SGL and it means a
> completely different DMA mapping call.

GUP cannot support BAR memory because it must only return CPU memory -
I think too many callers make this assumption for it to be possible to
change it.. (see below)

A new-GUP can return DMA addresses - so it doesn't have this problem.

> > Would you feel better if this also came along with a:
> > 
> >   struct dma_sg_table *sgl_dma_map_user(struct device *dma_device, 
> >              void __user *prt, size_t len)
> 
> That seems like a nice API. But certainly the implementation would need
> to use existing dma_map or pci_p2pdma_map calls, or whatever as part of
> it...

I wonder how Jerome worked the translation, I haven't looked yet..

> We actually stopped caring about the __iomem problem. We are working
> under the assumption that pages returned by devm_memremap_pages() can be
> accessed as normal RAM and does not need the __iomem designation.

As far as CPU mapped uncached BAR memory goes, this is broadly not
true.

s390x for instance requires dedicated CPU instructions to access BAR
memory.

x86 will fault if you attempt to use a SSE algorithm on uncached BAR
memory. The kernel has many SSE accelerated algorithsm so you can
never pass these special p2p SGL's through to them either. (think
parity or encryption transformations through the block stack)

Many platforms have limitations on alignment and access size for BAR
memory - you can't just blindly call memcpy and expect it to work.
(TPM is actually struggling with this now, confusingly different
versions of memcpy are giving different results on some x86 io memory)

Other platforms might fault or corrupt if an unaligned access is
attempted to BAR memory.

In short - I don't see a way to avoid knowing about __iomem in the
sgl. There are too many real use cases that require this knowledge,
and too many places that touch the SGL pages with the CPU.

I think we must have 'DMA only' SGLs and code paths that are known
DMA-only clean to make it work properly with BAR memory.

Jason
Christoph Hellwig Jan. 31, 2019, 8:02 a.m. UTC | #64
On Wed, Jan 30, 2019 at 01:50:27PM -0500, Jerome Glisse wrote:
> I do not see how VMA changes are any different than using struct page
> in respect to userspace exposure. Those vma callback do not need to be
> set by everyone, in fact expectation is that only handful of driver
> will set those.
> 
> How can we do p2p between RDMA and GPU for instance, without exposure
> to userspace ? At some point you need to tell userspace hey this kernel
> does allow you to do that :)

To do RDMA on a memory region you need struct page backіng to start
with..
Christoph Hellwig Jan. 31, 2019, 8:05 a.m. UTC | #65
On Wed, Jan 30, 2019 at 08:44:20PM +0000, Jason Gunthorpe wrote:
> Not really, for MRs most drivers care about DMA addresses only. The
> only reason struct page ever gets involved is because it is part of
> the GUP, SGL and dma_map family of APIs.

And the only way you get the DMA address is through the dma mapping
APIs.  Which except for the little oddball dma_map_resource expect
a struct page in some form.  And dma_map_resource isn't really up
to speed for full blown P2P.

Now we could and maybe eventually should change all this.  But that
is a pre-requisitive for doing anything more fancy, and not something
to be hacked around.

> O_DIRECT seems to be the justification for struct page, but nobody is
> signing up to make O_DIRECT have the required special GUP/SGL/P2P flow
> that would be needed to *actually* make that work - so it really isn't
> a justification today.

O_DIRECT is just the messenger.  Anything using GUP will need a struct
page, which is all our interfaces that do I/O directly to user pages.
Christoph Hellwig Jan. 31, 2019, 8:13 a.m. UTC | #66
On Wed, Jan 30, 2019 at 03:52:13PM -0700, Logan Gunthorpe wrote:
> > *shrug* so what if the special GUP called a VMA op instead of
> > traversing the VMA PTEs today? Why does it really matter? It could
> > easily change to a struct page flow tomorrow..
> 
> Well it's so that it's composable. We want the SGL->DMA side to work for
> APIs from kernel space and not have to run a completely different flow
> for kernel drivers than from userspace memory.

Yes, I think that is the important point.

All the other struct page discussion is not about anyone of us wanting
struct page - heck it is a pain to deal with, but then again it is
there for a reason.

In the typical GUP flows we have three uses of a struct page:

 (1) to carry a physical address.  This is mostly through
     struct scatterlist and struct bio_vec.  We could just store
     a magic PFN-like value that encodes the physical address
     and allow looking up a page if it exists, and we had at least
     two attempts at it.  In some way I think that would actually
     make the interfaces cleaner, but Linus has NACKed it in the
     past, so we'll have to convince him first that this is the
     way forward
 (2) to keep a reference to the memory so that it doesn't go away
     under us due to swapping, process exit, unmapping, etc.
     No idea how we want to solve this, but I guess you have
     some smart ideas?
 (3) to make the PTEs dirty after writing to them.  Again no sure
     what our preferred interface here would be

If we solve all of the above problems I'd be more than happy to
go with a non-struct page based interface for BAR P2P.  But we'll
have to solve these issues in a generic way first.
Jerome Glisse Jan. 31, 2019, 3:03 p.m. UTC | #67
On Thu, Jan 31, 2019 at 09:02:03AM +0100, Christoph Hellwig wrote:
> On Wed, Jan 30, 2019 at 01:50:27PM -0500, Jerome Glisse wrote:
> > I do not see how VMA changes are any different than using struct page
> > in respect to userspace exposure. Those vma callback do not need to be
> > set by everyone, in fact expectation is that only handful of driver
> > will set those.
> > 
> > How can we do p2p between RDMA and GPU for instance, without exposure
> > to userspace ? At some point you need to tell userspace hey this kernel
> > does allow you to do that :)
> 
> To do RDMA on a memory region you need struct page backіng to start
> with..

No you do not with this patchset and there is no reason to tie RDMA to
struct page it does not provide a single feature we would need. So as
it can be done without and they are not benefit of using one i do not
see why we should use one.

Cheers,
Jérôme
Jerome Glisse Jan. 31, 2019, 3:11 p.m. UTC | #68
On Thu, Jan 31, 2019 at 09:05:01AM +0100, Christoph Hellwig wrote:
> On Wed, Jan 30, 2019 at 08:44:20PM +0000, Jason Gunthorpe wrote:
> > Not really, for MRs most drivers care about DMA addresses only. The
> > only reason struct page ever gets involved is because it is part of
> > the GUP, SGL and dma_map family of APIs.
> 
> And the only way you get the DMA address is through the dma mapping
> APIs.  Which except for the little oddball dma_map_resource expect
> a struct page in some form.  And dma_map_resource isn't really up
> to speed for full blown P2P.
> 
> Now we could and maybe eventually should change all this.  But that
> is a pre-requisitive for doing anything more fancy, and not something
> to be hacked around.
> 
> > O_DIRECT seems to be the justification for struct page, but nobody is
> > signing up to make O_DIRECT have the required special GUP/SGL/P2P flow
> > that would be needed to *actually* make that work - so it really isn't
> > a justification today.
> 
> O_DIRECT is just the messenger.  Anything using GUP will need a struct
> page, which is all our interfaces that do I/O directly to user pages.

I do not want to allow GUP to pin I/O space this would open a pandora
box that we do not want to open at all. Many driver manage their IO
space and if they get random pinning because some other kernel bits
they never heard of starts to do GUP on their stuff it is gonna cause
havoc.

So far mmap of device file have always been special and it has been
reflected to userspace in all the instance i know of (media and GPU).
Pretending we can handle them like any other vma is a lie because
they were never designed that way in the first place and it would be
disruptive to all those driver.

Minimum disruption with minimun changes is what we should aim for and
is what i am trying to do with this patchset. Using struct page and
allowing GUP would mean rewritting huge chunk of GPU drivers (pretty
much rewritting their whole memory management) with no benefit at the
end.

When something is special it is better to leave it that way.

Cheers,
Jérôme
Jerome Glisse Jan. 31, 2019, 3:37 p.m. UTC | #69
On Thu, Jan 31, 2019 at 09:13:55AM +0100, Christoph Hellwig wrote:
> On Wed, Jan 30, 2019 at 03:52:13PM -0700, Logan Gunthorpe wrote:
> > > *shrug* so what if the special GUP called a VMA op instead of
> > > traversing the VMA PTEs today? Why does it really matter? It could
> > > easily change to a struct page flow tomorrow..
> > 
> > Well it's so that it's composable. We want the SGL->DMA side to work for
> > APIs from kernel space and not have to run a completely different flow
> > for kernel drivers than from userspace memory.
> 
> Yes, I think that is the important point.
> 
> All the other struct page discussion is not about anyone of us wanting
> struct page - heck it is a pain to deal with, but then again it is
> there for a reason.
> 
> In the typical GUP flows we have three uses of a struct page:

We do not want GUP. Yes some RDMA driver and other use GUP but they
should only use GUP on regular vma not on special vma (ie mmap of a
device file). Allowing GUP on those is insane. It is better to special
case the peer to peer mapping because _it is_ special, nothing inside
those are manage by core mm and driver can deal with them in weird
way (GPU certainly do and for very good reasons without which they
would perform badly).

> 
>  (1) to carry a physical address.  This is mostly through
>      struct scatterlist and struct bio_vec.  We could just store
>      a magic PFN-like value that encodes the physical address
>      and allow looking up a page if it exists, and we had at least
>      two attempts at it.  In some way I think that would actually
>      make the interfaces cleaner, but Linus has NACKed it in the
>      past, so we'll have to convince him first that this is the
>      way forward

Wasting 64bytes just to carry address is a waste for everyone.

>  (2) to keep a reference to the memory so that it doesn't go away
>      under us due to swapping, process exit, unmapping, etc.
>      No idea how we want to solve this, but I guess you have
>      some smart ideas?

The DMA API has _never_ dealt with page refcount and it have always
been up to the user of the DMA API to ascertain that it is safe for
them to map/unmap page/resource they are providing to the DMA API.

The lifetime management of page or resource provided to the DMA API
should remain the problem of the caller and not be something the DMA
API cares one bit about.

>  (3) to make the PTEs dirty after writing to them.  Again no sure
>      what our preferred interface here would be

Again the DMA API has never dealt with that nor should he. What does
dirty pte means for a special mapping (mmap of device file) ? There is
no single common definition for that, most driver do not care about it
and it get fully ignore.

> 
> If we solve all of the above problems I'd be more than happy to
> go with a non-struct page based interface for BAR P2P.  But we'll
> have to solve these issues in a generic way first.

None of the above are problems the DMA API need to solve. The DMA API
is about mapping some memory resource to a device. For regular main
memory it is easy on most architecture (anything with a sane IOMMU).
For IO resources it is not as straight forward as it was often left
undefined in the architecture platform documentation or the inter-
connect standard. AFAIK mapping BAR from one PCIE device to another
through IOMMU works well on recent Intel and AMD platform. We will
probably need to use some whitelist at i am not sure this is something
Intel or AMD guarantee, i believe they want to start guaranteeing it.

So having one DMA API for regular memory and one for IO memory aka
resource (dma_map_resource()) sounds like the only sane approach here.
It is fundamentally different memory and we should not try to muddle
the water by having it go through a single common API. There is no
benefit to that beside saving couple hundred of lines of code to some
driver and this couple hundred lines of code can be move to a common
helpers.

So to me it is lot sane to provide an helper that would deal with
the different vma type on behalf of device than forcing down struct
page. Something like:

vma_dma_map_range(vma, device, start, end, flags, pa[])
vma_dma_unmap_range(vma, device, start, end, flags, pa[])

VMA_DMA_MAP_FLAG_WRITE
VMA_DMA_MAP_FLAG_PIN

Which would use GUP or special vma handling on behalf of the calling
device or use a special p2p code path for special vma. Device that
need pinning set the flag and it is up to the exporting device to
accept or not. Pinning when using GUP is obvious.

When the vma goes away the importing device must update its device
page table to some dummy page or do something sane, because keeping
things map after that point does not make sense anymore. Device is
no longer operating on a range of virtual address that make sense.

So instead of pushing p2p handling within GUP to not disrupt existing
driver workflow. It is better to provide an helper that handle all
the gory details for the device driver. It does not change things for
the driver and allows proper special casing.

Cheers,
Jérôme
Jason Gunthorpe Jan. 31, 2019, 7:02 p.m. UTC | #70
On Thu, Jan 31, 2019 at 09:13:55AM +0100, Christoph Hellwig wrote:
> On Wed, Jan 30, 2019 at 03:52:13PM -0700, Logan Gunthorpe wrote:
> > > *shrug* so what if the special GUP called a VMA op instead of
> > > traversing the VMA PTEs today? Why does it really matter? It could
> > > easily change to a struct page flow tomorrow..
> > 
> > Well it's so that it's composable. We want the SGL->DMA side to work for
> > APIs from kernel space and not have to run a completely different flow
> > for kernel drivers than from userspace memory.
> 
> Yes, I think that is the important point.
> 
> All the other struct page discussion is not about anyone of us wanting
> struct page - heck it is a pain to deal with, but then again it is
> there for a reason.
> 
> In the typical GUP flows we have three uses of a struct page:
> 
>  (1) to carry a physical address.  This is mostly through
>      struct scatterlist and struct bio_vec.  We could just store
>      a magic PFN-like value that encodes the physical address
>      and allow looking up a page if it exists, and we had at least
>      two attempts at it.  In some way I think that would actually
>      make the interfaces cleaner, but Linus has NACKed it in the
>      past, so we'll have to convince him first that this is the
>      way forward

Something like this (and more) has always been the roadblock with
trying to mix BAR memory into SGL. I think it is such a big problem as
to be unsolvable in one step.. 

Struct page doesn't even really help anything beyond dma_map as we
still can't pretend that __iomem is normal memory for general SGL
users.

>  (2) to keep a reference to the memory so that it doesn't go away
>      under us due to swapping, process exit, unmapping, etc.
>      No idea how we want to solve this, but I guess you have
>      some smart ideas?

Jerome, how does this work anyhow? Did you do something to make the
VMA lifetime match the p2p_map/unmap? Or can we get into a situation
were the VMA is destroyed and the importing driver can't call the
unmap anymore?

I know in the case of notifiers the VMA liftime should be strictly
longer than the map/unmap - but does this mean we can never support
non-notifier users via this scheme?

>  (3) to make the PTEs dirty after writing to them.  Again no sure
>      what our preferred interface here would be

This need doesn't really apply to BAR memory..

> If we solve all of the above problems I'd be more than happy to
> go with a non-struct page based interface for BAR P2P.  But we'll
> have to solve these issues in a generic way first.

I still think the right direction is to build on what Logan has done -
realize that he created a DMA-only SGL - make that a formal type of
the kernel and provide the right set of APIs to work with this type,
without being forced to expose struct page.

Basically invert the API flow - the DMA map would be done close to
GUP, not buried in the driver. This absolutely doesn't work for every
flow we have, but it does enable the ones that people seem to care
about when talking about P2P.

To get to where we are today we'd need a few new IB APIs, and some
nvme change to work with DMA-only SGL's and so forth, but that doesn't
seem so bad. The API also seems much more safe and understandable than
todays version that is trying to hope that the SGL is never touched by
the CPU.

It also does present a path to solve some cases of the O_DIRECT
problems if the block stack can develop some way to know if an IO will
go down a DMA-only IO path or not... This seems less challenging that
auditing every SGL user for iomem safety??

Yes we end up with a duality, but we already basically have that with
the p2p flow today..

Jason
Logan Gunthorpe Jan. 31, 2019, 7:19 p.m. UTC | #71
On 2019-01-31 12:02 p.m., Jason Gunthorpe wrote:
> I still think the right direction is to build on what Logan has done -
> realize that he created a DMA-only SGL - make that a formal type of
> the kernel and provide the right set of APIs to work with this type,
> without being forced to expose struct page.

> Basically invert the API flow - the DMA map would be done close to
> GUP, not buried in the driver. This absolutely doesn't work for every
> flow we have, but it does enable the ones that people seem to care
> about when talking about P2P.
> It also does present a path to solve some cases of the O_DIRECT
> problems if the block stack can develop some way to know if an IO will
> go down a DMA-only IO path or not... This seems less challenging that
> auditing every SGL user for iomem safety??


The DMA-only SGL will work for some use cases, but I think it's going to
be a challenge for others. We care most about NVMe and, therefore, the
block layer.

Given my understanding of the block layer, and it's queuing
infrastructure, I don't think having a DMA-only IO path makes sense. I
think it has to be the same path, but with a special DMA-only bio; and
endpoints would have to indicate support for that bio. I can't say I
have a deep enough understanding of the block layer to know how possible
that would be.

Logan
Jerome Glisse Jan. 31, 2019, 7:35 p.m. UTC | #72
On Thu, Jan 31, 2019 at 07:02:15PM +0000, Jason Gunthorpe wrote:
> On Thu, Jan 31, 2019 at 09:13:55AM +0100, Christoph Hellwig wrote:
> > On Wed, Jan 30, 2019 at 03:52:13PM -0700, Logan Gunthorpe wrote:
> > > > *shrug* so what if the special GUP called a VMA op instead of
> > > > traversing the VMA PTEs today? Why does it really matter? It could
> > > > easily change to a struct page flow tomorrow..
> > > 
> > > Well it's so that it's composable. We want the SGL->DMA side to work for
> > > APIs from kernel space and not have to run a completely different flow
> > > for kernel drivers than from userspace memory.
> > 
> > Yes, I think that is the important point.
> > 
> > All the other struct page discussion is not about anyone of us wanting
> > struct page - heck it is a pain to deal with, but then again it is
> > there for a reason.
> > 
> > In the typical GUP flows we have three uses of a struct page:
> > 
> >  (1) to carry a physical address.  This is mostly through
> >      struct scatterlist and struct bio_vec.  We could just store
> >      a magic PFN-like value that encodes the physical address
> >      and allow looking up a page if it exists, and we had at least
> >      two attempts at it.  In some way I think that would actually
> >      make the interfaces cleaner, but Linus has NACKed it in the
> >      past, so we'll have to convince him first that this is the
> >      way forward
> 
> Something like this (and more) has always been the roadblock with
> trying to mix BAR memory into SGL. I think it is such a big problem as
> to be unsolvable in one step.. 
> 
> Struct page doesn't even really help anything beyond dma_map as we
> still can't pretend that __iomem is normal memory for general SGL
> users.
> 
> >  (2) to keep a reference to the memory so that it doesn't go away
> >      under us due to swapping, process exit, unmapping, etc.
> >      No idea how we want to solve this, but I guess you have
> >      some smart ideas?
> 
> Jerome, how does this work anyhow? Did you do something to make the
> VMA lifetime match the p2p_map/unmap? Or can we get into a situation
> were the VMA is destroyed and the importing driver can't call the
> unmap anymore?
> 
> I know in the case of notifiers the VMA liftime should be strictly
> longer than the map/unmap - but does this mean we can never support
> non-notifier users via this scheme?

So in this version the requirement is that the importer also have a mmu
notifier registered and that's what all GPU driver do already. Any
driver that map some range of vma to a device should register itself as
a mmu notifier listener to do something when vma goes away. I posted a
patchset a while ago to allow listener to differentiate when the vma is
going away from other type of invalidation [1]

With that in place you can easily handle the pin case. Driver really
need to do something when the vma goes away with GUP or not. As the
device is then writing/reading to/from something that does not match
anything in the process address space.

So user that want pin would register notifier, call p2p_map with pin
flag and ignore all notifier callback except the unmap one when the
unmap one happens they have the vma and they should call p2p_unmap
from their invalidate callback and update their device to either some
dummy memory or program it in a way that the userspace application
will notice.

This can all be handled by some helper so that driver do not have to
write more than 5 lines of code and function to update their device
mapping to something of their choosing.


> 
> >  (3) to make the PTEs dirty after writing to them.  Again no sure
> >      what our preferred interface here would be
> 
> This need doesn't really apply to BAR memory..
> 
> > If we solve all of the above problems I'd be more than happy to
> > go with a non-struct page based interface for BAR P2P.  But we'll
> > have to solve these issues in a generic way first.
> 
> I still think the right direction is to build on what Logan has done -
> realize that he created a DMA-only SGL - make that a formal type of
> the kernel and provide the right set of APIs to work with this type,
> without being forced to expose struct page.
> 
> Basically invert the API flow - the DMA map would be done close to
> GUP, not buried in the driver. This absolutely doesn't work for every
> flow we have, but it does enable the ones that people seem to care
> about when talking about P2P.

This does not work for GPU really i do not want to have to rewrite GPU
driver for this. Struct page is a burden and it does not bring anything
to the table. I rather provide an all in one stop for driver to use
this without having to worry between regular vma and special vma.

Note that in this patchset i reuse chunk of Logan works and intention is
to also allow PCI struct page to work too. But they should not be the
only mechanisms.

> 
> To get to where we are today we'd need a few new IB APIs, and some
> nvme change to work with DMA-only SGL's and so forth, but that doesn't
> seem so bad. The API also seems much more safe and understandable than
> todays version that is trying to hope that the SGL is never touched by
> the CPU.
> 
> It also does present a path to solve some cases of the O_DIRECT
> problems if the block stack can develop some way to know if an IO will
> go down a DMA-only IO path or not... This seems less challenging that
> auditing every SGL user for iomem safety??

So what is this O_DIRECT thing that keep coming again and again here :)
What is the use case ? Note that bio will always have valid struct page
of regular memory as using PCIE BAR for filesystem is crazy (you do not
have atomic or cache coherence and many CPU instruction have _undefined_
effect so what ever the userspace would do might do nothing.

Now if you want to use BAR address as destination or source of directIO
then let just update the directIO code to handle this. There is no need
to go hack every single place in the kernel that might deal with struct
page or sgl. Just update the place that need to understand this. We can
even update directIO to work on weird platform. The change to directIO
will be small, couple hundred line of code at best.

Cheers,
Jérôme

[1] https://lore.kernel.org/linux-fsdevel/20190123222315.1122-1-jglisse@redhat.com/T/#m69e8f589240e18acbf196a1c8aa1d6fc97bd3565
Logan Gunthorpe Jan. 31, 2019, 7:44 p.m. UTC | #73
On 2019-01-31 12:35 p.m., Jerome Glisse wrote:
> So what is this O_DIRECT thing that keep coming again and again here :)
> What is the use case ? Note that bio will always have valid struct page
> of regular memory as using PCIE BAR for filesystem is crazy (you do not
> have atomic or cache coherence and many CPU instruction have _undefined_
> effect so what ever the userspace would do might do nothing.

The point is to be able to use a BAR as the source of data to write/read
from a file system. So as a simple example, if an NVMe drive had a CMB,
and you could map that CMB to userspace, you could do an O_DIRECT read
to the BAR on one drive and an O_DIRECT write from the BAR on another
drive. Thus you could bypass the upstream port of a switch (and
therefore all CPU resources) altogether.

For the most part nobody would want to put a filesystem on a BAR.
(Though there have been some crazy ideas to put persistent memory behind
a CMB...)

> Now if you want to use BAR address as destination or source of directIO
> then let just update the directIO code to handle this. There is no need
> to go hack every single place in the kernel that might deal with struct
> page or sgl. Just update the place that need to understand this. We can
> even update directIO to work on weird platform. The change to directIO
> will be small, couple hundred line of code at best.

Well if you want to figure out how to remove struct page from the entire
block layer that would help everybody. But until then, it's pretty much
impossible to use the block layer (and therefore O_DIRECT) without
struct page.

Logan
Jason Gunthorpe Jan. 31, 2019, 7:54 p.m. UTC | #74
On Thu, Jan 31, 2019 at 12:19:31PM -0700, Logan Gunthorpe wrote:
> 
> 
> On 2019-01-31 12:02 p.m., Jason Gunthorpe wrote:
> > I still think the right direction is to build on what Logan has done -
> > realize that he created a DMA-only SGL - make that a formal type of
> > the kernel and provide the right set of APIs to work with this type,
> > without being forced to expose struct page.
> 
> > Basically invert the API flow - the DMA map would be done close to
> > GUP, not buried in the driver. This absolutely doesn't work for every
> > flow we have, but it does enable the ones that people seem to care
> > about when talking about P2P.
> > It also does present a path to solve some cases of the O_DIRECT
> > problems if the block stack can develop some way to know if an IO will
> > go down a DMA-only IO path or not... This seems less challenging that
> > auditing every SGL user for iomem safety??
> 
> 
> The DMA-only SGL will work for some use cases, but I think it's going to
> be a challenge for others. We care most about NVMe and, therefore, the
> block layer.

The exercise here is not to enable O_DIRECT for P2P, it is to allow
certain much simpler users to use P2P. We should not be saying that
someone has to solve these complicated problems in the entire block
stack just to make RDMA work. :(

If the block stack can use a 'dma sgl' or not, I don't know.

However, it does look like it fits these RDMA, GPU and VFIO cases
fairly well, and looks better than the hacky
sgl-but-really-special-p2p hack we have in RDMA today.

Jason
Jason Gunthorpe Jan. 31, 2019, 7:58 p.m. UTC | #75
On Thu, Jan 31, 2019 at 02:35:14PM -0500, Jerome Glisse wrote:

> > Basically invert the API flow - the DMA map would be done close to
> > GUP, not buried in the driver. This absolutely doesn't work for every
> > flow we have, but it does enable the ones that people seem to care
> > about when talking about P2P.
> 
> This does not work for GPU really i do not want to have to rewrite GPU
> driver for this. Struct page is a burden and it does not bring anything
> to the table. I rather provide an all in one stop for driver to use
> this without having to worry between regular vma and special vma.

I'm talking about almost exactly what you've done in here - make a
'sgl' that is dma addresses only. 

In these VMA patches you used a simple array of physical addreses -
I'm only talking about moving that array into a 'dma sgl'.

The flow is still basically the same - the driver directly gets DMA
physical addresses with no possibility to get a struct page or CPU
memory.

And then we can build more stuff around the 'dma sgl', including
the in-kernel users Logan is worrying about.

Jason
diff mbox series

Patch

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 80bb6408fe73..1bd60a90e575 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -429,6 +429,44 @@  struct vm_operations_struct {
 			pgoff_t start_pgoff, pgoff_t end_pgoff);
 	unsigned long (*pagesize)(struct vm_area_struct * area);
 
+	/*
+	 * Optional for device driver that want to allow peer to peer (p2p)
+	 * mapping of their vma (which can be back by some device memory) to
+	 * another device.
+	 *
+	 * Note that the exporting device driver might not have map anything
+	 * inside the vma for the CPU but might still want to allow a peer
+	 * device to access the range of memory corresponding to a range in
+	 * that vma.
+	 *
+	 * FOR PREDICTABILITY IF DRIVER SUCCESSFULY MAP A RANGE ONCE FOR A
+	 * DEVICE THEN FURTHER MAPPING OF THE SAME IF THE VMA IS STILL VALID
+	 * SHOULD ALSO BE SUCCESSFUL. Following this rule allow the importing
+	 * device to map once during setup and report any failure at that time
+	 * to the userspace. Further mapping of the same range might happen
+	 * after mmu notifier invalidation over the range. The exporting device
+	 * can use this to move things around (defrag BAR space for instance)
+	 * or do other similar task.
+	 *
+	 * IMPORTER MUST OBEY mmu_notifier NOTIFICATION AND CALL p2p_unmap()
+	 * WHEN A NOTIFIER IS CALL FOR THE RANGE ! THIS CAN HAPPEN AT ANY
+	 * POINT IN TIME WITH NO LOCK HELD.
+	 *
+	 * In below function, the device argument is the importing device,
+	 * the exporting device is the device to which the vma belongs.
+	 */
+	long (*p2p_map)(struct vm_area_struct *vma,
+			struct device *device,
+			unsigned long start,
+			unsigned long end,
+			dma_addr_t *pa,
+			bool write);
+	long (*p2p_unmap)(struct vm_area_struct *vma,
+			  struct device *device,
+			  unsigned long start,
+			  unsigned long end,
+			  dma_addr_t *pa);
+
 	/* notification that a previously read-only page is about to become
 	 * writable, if an error is returned it will cause a SIGBUS */
 	vm_fault_t (*page_mkwrite)(struct vm_fault *vmf);