Message ID | 20220306053211.135762-1-jarkko@kernel.org (mailing list archive) |
---|---|
Headers | show |
Series | MAP_POPULATE for device memory | expand |
From: Jarkko Sakkinen > Sent: 06 March 2022 05:32 > > For device memory (aka VM_IO | VM_PFNMAP) MAP_POPULATE does nothing. Allow > to use that for initializing the device memory by providing a new callback > f_ops->populate() for the purpose. > > SGX patches are provided to show the callback in context. > > An obvious alternative is a ioctl but it is less elegant and requires > two syscalls (mmap + ioctl) per memory range, instead of just one > (mmap). Is this all about trying to stop the vm_operations_struct.fault() function being called? It is pretty easy to ensure the mappings are setup in the driver's mmap() function. Then the fault() function can just return -VM_FAULT_SIGBUS; If it is actually device memory you just need to call vm_iomap_memory() That quite nicely mmap()s PCIe memory space into a user process. Mapping driver memory is slightly more difficult. For buffers allocated with dma_alloc_coherent() you can probably use dma_mmap_coherent(). But I have a loop calling remap_pfn_range() because the buffer area is made of multiple 16kB kernel buffers that need to be mapped to contiguous user pages. David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
On Sun, Mar 06, 2022 at 07:32:04AM +0200, Jarkko Sakkinen wrote: > For device memory (aka VM_IO | VM_PFNMAP) MAP_POPULATE does nothing. Allow > to use that for initializing the device memory by providing a new callback > f_ops->populate() for the purpose. As I said, NAK.
On Sun, Mar 06, 2022 at 08:30:14AM +0000, David Laight wrote: > From: Jarkko Sakkinen > > Sent: 06 March 2022 05:32 > > > > For device memory (aka VM_IO | VM_PFNMAP) MAP_POPULATE does nothing. Allow > > to use that for initializing the device memory by providing a new callback > > f_ops->populate() for the purpose. > > > > SGX patches are provided to show the callback in context. > > > > An obvious alternative is a ioctl but it is less elegant and requires > > two syscalls (mmap + ioctl) per memory range, instead of just one > > (mmap). > > Is this all about trying to stop the vm_operations_struct.fault() > function being called? In SGX protected memory is actually encrypted normal memory and CPU access control semantics (marked as reserved, e.g. struct page's). In SGX you need call ENCLS[EAUG] outside the protected memory to add new pages to the protected memory. Then when CPU is executing inside this protected memory, also known as enclaves, it commits the memory as part of the enclave either with ENCLU[EACCEPT] or ENCLU[EACCEPTCOPY]. So the point is not prevent page faults but to prepare the memory for pending state so that the enclave can then accept them without round-trips, and in some cases thus improve performance (in our case in enarx.dev platform that we are developing). In fact, #PF handler in SGX driver in the current SGX2 patch set also does EAUG on-demand. Optimal is to have both routes available. And said, this can be of course also implemented as ioctl. BR, Jarkko
On Sun, Mar 06, 2022 at 11:33:02AM +0000, Matthew Wilcox wrote: > On Sun, Mar 06, 2022 at 07:32:04AM +0200, Jarkko Sakkinen wrote: > > For device memory (aka VM_IO | VM_PFNMAP) MAP_POPULATE does nothing. Allow > > to use that for initializing the device memory by providing a new callback > > f_ops->populate() for the purpose. > > As I said, NAK. Agreed. This is an amazingly bad interface.
On 06.03.22 06:32, Jarkko Sakkinen wrote: > For device memory (aka VM_IO | VM_PFNMAP) MAP_POPULATE does nothing. Allow > to use that for initializing the device memory by providing a new callback > f_ops->populate() for the purpose. > > SGX patches are provided to show the callback in context. > > An obvious alternative is a ioctl but it is less elegant and requires > two syscalls (mmap + ioctl) per memory range, instead of just one > (mmap). What about extending MADV_POPULATE_READ | MADV_POPULATE_WRITE to support VM_IO | VM_PFNMAP (as well?) ?
On Sun, Mar 06, 2022 at 11:48:26PM -0800, Christoph Hellwig wrote: > On Sun, Mar 06, 2022 at 11:33:02AM +0000, Matthew Wilcox wrote: > > On Sun, Mar 06, 2022 at 07:32:04AM +0200, Jarkko Sakkinen wrote: > > > For device memory (aka VM_IO | VM_PFNMAP) MAP_POPULATE does nothing. Allow > > > to use that for initializing the device memory by providing a new callback > > > f_ops->populate() for the purpose. > > > > As I said, NAK. > > Agreed. This is an amazingly bad interface. So what would you suggest to sort out the issue? I'm happy to go with ioctl if nothing else is acceptable. BR, Jarkko
On Mon, Mar 07, 2022 at 11:12:44AM +0100, David Hildenbrand wrote: > On 06.03.22 06:32, Jarkko Sakkinen wrote: > > For device memory (aka VM_IO | VM_PFNMAP) MAP_POPULATE does nothing. Allow > > to use that for initializing the device memory by providing a new callback > > f_ops->populate() for the purpose. > > > > SGX patches are provided to show the callback in context. > > > > An obvious alternative is a ioctl but it is less elegant and requires > > two syscalls (mmap + ioctl) per memory range, instead of just one > > (mmap). > > What about extending MADV_POPULATE_READ | MADV_POPULATE_WRITE to support > VM_IO | VM_PFNMAP (as well?) ? What would be a proper point to bind that behaviour? For mmap/mprotect it'd be probably populate_vma_page_range() because that would span both mmap() and mprotect() (Dave's suggestion in this thread). For MAP_POPULATE I did not have hard proof to show that it would be used by other drivers but for madvice() you can find at least a few ioctl based implementations: $ git grep -e madv --and \( -e ioc \) drivers/ drivers/gpu/drm/i915/gem/i915_gem_ioctls.h:int i915_gem_madvise_ioctl(struct drm_device *dev, void *data, drivers/gpu/drm/i915/i915_driver.c: DRM_IOCTL_DEF_DRV(I915_GEM_MADVISE, i915_gem_madvise_ioctl, DRM_RENDER_ALLOW), drivers/gpu/drm/i915/i915_gem.c:i915_gem_madvise_ioctl(struct drm_device *dev, void *data, drivers/gpu/drm/msm/msm_drv.c:static int msm_ioctl_gem_madvise(struct drm_device *dev, void *data, drivers/gpu/drm/msm/msm_drv.c: DRM_IOCTL_DEF_DRV(MSM_GEM_MADVISE, msm_ioctl_gem_madvise, DRM_RENDER_ALLOW), drivers/gpu/drm/panfrost/panfrost_drv.c:static int panfrost_ioctl_madvise(struct drm_device *dev, void *data, drivers/gpu/drm/vc4/vc4_drv.c: DRM_IOCTL_DEF_DRV(VC4_GEM_MADVISE, vc4_gem_madvise_ioctl, DRM_RENDER_ALLOW), drivers/gpu/drm/vc4/vc4_drv.h:int vc4_gem_madvise_ioctl(struct drm_device *dev, void *data, drivers/gpu/drm/vc4/vc4_gem.c:int vc4_gem_madvise_ioctl(struct drm_device *dev, void *data, IMHO this also provides supportive claim for MAP_POPULATE, and yeah, I agree that to be consistent implementation, both madvice() and MAP_POPULATE should work. > -- > Thanks, > > David / dhildenb BR, Jarkko
On 07.03.22 15:22, Jarkko Sakkinen wrote: > On Mon, Mar 07, 2022 at 11:12:44AM +0100, David Hildenbrand wrote: >> On 06.03.22 06:32, Jarkko Sakkinen wrote: >>> For device memory (aka VM_IO | VM_PFNMAP) MAP_POPULATE does nothing. Allow >>> to use that for initializing the device memory by providing a new callback >>> f_ops->populate() for the purpose. >>> >>> SGX patches are provided to show the callback in context. >>> >>> An obvious alternative is a ioctl but it is less elegant and requires >>> two syscalls (mmap + ioctl) per memory range, instead of just one >>> (mmap). >> >> What about extending MADV_POPULATE_READ | MADV_POPULATE_WRITE to support >> VM_IO | VM_PFNMAP (as well?) ? > > What would be a proper point to bind that behaviour? For mmap/mprotect it'd > be probably populate_vma_page_range() because that would span both mmap() > and mprotect() (Dave's suggestion in this thread). MADV_POPULATE_* ends up in faultin_vma_page_range(), right next to populate_vma_page_range(). So it might require a similar way to hook into the driver I guess. > > For MAP_POPULATE I did not have hard proof to show that it would be used > by other drivers but for madvice() you can find at least a few ioctl > based implementations: > > $ git grep -e madv --and \( -e ioc \) drivers/ > drivers/gpu/drm/i915/gem/i915_gem_ioctls.h:int i915_gem_madvise_ioctl(struct drm_device *dev, void *data, > drivers/gpu/drm/i915/i915_driver.c: DRM_IOCTL_DEF_DRV(I915_GEM_MADVISE, i915_gem_madvise_ioctl, DRM_RENDER_ALLOW), > drivers/gpu/drm/i915/i915_gem.c:i915_gem_madvise_ioctl(struct drm_device *dev, void *data, > drivers/gpu/drm/msm/msm_drv.c:static int msm_ioctl_gem_madvise(struct drm_device *dev, void *data, > drivers/gpu/drm/msm/msm_drv.c: DRM_IOCTL_DEF_DRV(MSM_GEM_MADVISE, msm_ioctl_gem_madvise, DRM_RENDER_ALLOW), > drivers/gpu/drm/panfrost/panfrost_drv.c:static int panfrost_ioctl_madvise(struct drm_device *dev, void *data, > drivers/gpu/drm/vc4/vc4_drv.c: DRM_IOCTL_DEF_DRV(VC4_GEM_MADVISE, vc4_gem_madvise_ioctl, DRM_RENDER_ALLOW), > drivers/gpu/drm/vc4/vc4_drv.h:int vc4_gem_madvise_ioctl(struct drm_device *dev, void *data, > drivers/gpu/drm/vc4/vc4_gem.c:int vc4_gem_madvise_ioctl(struct drm_device *dev, void *data, > > IMHO this also provides supportive claim for MAP_POPULATE, and yeah, I > agree that to be consistent implementation, both madvice() and MAP_POPULATE > should work. MADV_POPULATE_WRITE + MADV_DONTNEED/FALLOC_FL_PUNCH_HOLE is one way to dynamically manage memory consumption inside a sparse memory mapping (preallocate/populate via MADV_POPULATE_WRITE, discard via MADV_DONTNEED/FALLOC_FL_PUNCH_HOLE). Extending that whole mechanism to deal with VM_IO | VM_PFNMAP mappings as well could be interesting. At least I herd about some ideas where we might want to dynamically expose memory to a VM (via virtio-mem) inside a sparse memory mapping, and the memory in that sparse memory mapping is provided from a dedicated memory pool managed by a device driver -- not just using ordinary anonymous/file/hugetlb memory as we do right now. Now, this is certainly stuff for the future, I just wanted to mention it.
On Mon, Mar 07, 2022 at 03:33:52PM +0100, David Hildenbrand wrote: > On 07.03.22 15:22, Jarkko Sakkinen wrote: > > On Mon, Mar 07, 2022 at 11:12:44AM +0100, David Hildenbrand wrote: > >> On 06.03.22 06:32, Jarkko Sakkinen wrote: > >>> For device memory (aka VM_IO | VM_PFNMAP) MAP_POPULATE does nothing. Allow > >>> to use that for initializing the device memory by providing a new callback > >>> f_ops->populate() for the purpose. > >>> > >>> SGX patches are provided to show the callback in context. > >>> > >>> An obvious alternative is a ioctl but it is less elegant and requires > >>> two syscalls (mmap + ioctl) per memory range, instead of just one > >>> (mmap). > >> > >> What about extending MADV_POPULATE_READ | MADV_POPULATE_WRITE to support > >> VM_IO | VM_PFNMAP (as well?) ? > > > > What would be a proper point to bind that behaviour? For mmap/mprotect it'd > > be probably populate_vma_page_range() because that would span both mmap() > > and mprotect() (Dave's suggestion in this thread). > > MADV_POPULATE_* ends up in faultin_vma_page_range(), right next to > populate_vma_page_range(). So it might require a similar way to hook > into the driver I guess. > > > > > For MAP_POPULATE I did not have hard proof to show that it would be used > > by other drivers but for madvice() you can find at least a few ioctl > > based implementations: > > > > $ git grep -e madv --and \( -e ioc \) drivers/ > > drivers/gpu/drm/i915/gem/i915_gem_ioctls.h:int i915_gem_madvise_ioctl(struct drm_device *dev, void *data, > > drivers/gpu/drm/i915/i915_driver.c: DRM_IOCTL_DEF_DRV(I915_GEM_MADVISE, i915_gem_madvise_ioctl, DRM_RENDER_ALLOW), > > drivers/gpu/drm/i915/i915_gem.c:i915_gem_madvise_ioctl(struct drm_device *dev, void *data, > > drivers/gpu/drm/msm/msm_drv.c:static int msm_ioctl_gem_madvise(struct drm_device *dev, void *data, > > drivers/gpu/drm/msm/msm_drv.c: DRM_IOCTL_DEF_DRV(MSM_GEM_MADVISE, msm_ioctl_gem_madvise, DRM_RENDER_ALLOW), > > drivers/gpu/drm/panfrost/panfrost_drv.c:static int panfrost_ioctl_madvise(struct drm_device *dev, void *data, > > drivers/gpu/drm/vc4/vc4_drv.c: DRM_IOCTL_DEF_DRV(VC4_GEM_MADVISE, vc4_gem_madvise_ioctl, DRM_RENDER_ALLOW), > > drivers/gpu/drm/vc4/vc4_drv.h:int vc4_gem_madvise_ioctl(struct drm_device *dev, void *data, > > drivers/gpu/drm/vc4/vc4_gem.c:int vc4_gem_madvise_ioctl(struct drm_device *dev, void *data, > > > > IMHO this also provides supportive claim for MAP_POPULATE, and yeah, I > > agree that to be consistent implementation, both madvice() and MAP_POPULATE > > should work. > > MADV_POPULATE_WRITE + MADV_DONTNEED/FALLOC_FL_PUNCH_HOLE is one way to > dynamically manage memory consumption inside a sparse memory mapping > (preallocate/populate via MADV_POPULATE_WRITE, discard via > MADV_DONTNEED/FALLOC_FL_PUNCH_HOLE). Extending that whole mechanism to > deal with VM_IO | VM_PFNMAP mappings as well could be interesting. > > At least I herd about some ideas where we might want to dynamically > expose memory to a VM (via virtio-mem) inside a sparse memory mapping, > and the memory in that sparse memory mapping is provided from a > dedicated memory pool managed by a device driver -- not just using > ordinary anonymous/file/hugetlb memory as we do right now. > > Now, this is certainly stuff for the future, I just wanted to mention it. For SGX purposes I'm now studying the possibly to use ra_state to get idea where do "prefetching" (EAUG's) in batches, as it is something that would not require any intrusive changes to mm but thank you for sharing this. Looking into implementing this properly is the 2nd option, if that does not work out. > -- > Thanks, > > David / dhildenb BR, Jarkko
On Mon, Mar 07, 2022 at 03:29:35PM +0200, Jarkko Sakkinen wrote: > So what would you suggest to sort out the issue? I'm happy to go with > ioctl if nothing else is acceptable. PLenty of drivers treat all mmaps as if MAP_POPULATE was specified, typically by using (io_)remap_pfn_range. If there any reason to only optionally have the pre-fault semantics for sgx? If not this should be really simple. And if we have a real need for it to be optional we'll just need to find a sane way to pass that information to ->mmap.
On Mon, Mar 07, 2022 at 07:56:53AM -0800, Christoph Hellwig wrote: > On Mon, Mar 07, 2022 at 03:29:35PM +0200, Jarkko Sakkinen wrote: > > So what would you suggest to sort out the issue? I'm happy to go with > > ioctl if nothing else is acceptable. > > PLenty of drivers treat all mmaps as if MAP_POPULATE was specified, > typically by using (io_)remap_pfn_range. If there any reason to only > optionally have the pre-fault semantics for sgx? If not this should > be really simple. And if we have a real need for it to be optional > we'll just need to find a sane way to pass that information to ->mmap. Dave, what if mmap() would just unconditionally EAUG after initialization? It's an option, yes. BR, Jarkko
From: Christoph Hellwig > Sent: 07 March 2022 15:57 > > On Mon, Mar 07, 2022 at 03:29:35PM +0200, Jarkko Sakkinen wrote: > > So what would you suggest to sort out the issue? I'm happy to go with > > ioctl if nothing else is acceptable. > > PLenty of drivers treat all mmaps as if MAP_POPULATE was specified, > typically by using (io_)remap_pfn_range. If there any reason to only > optionally have the pre-fault semantics for sgx? If not this should > be really simple. And if we have a real need for it to be optional > we'll just need to find a sane way to pass that information to ->mmap. Is there any space in vma->vm_flags ? That would be better than an extra argument or function. David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
On Mon, Mar 07, 2022 at 10:11:19PM +0000, David Laight wrote: > From: Christoph Hellwig > > Sent: 07 March 2022 15:57 > > > > On Mon, Mar 07, 2022 at 03:29:35PM +0200, Jarkko Sakkinen wrote: > > > So what would you suggest to sort out the issue? I'm happy to go with > > > ioctl if nothing else is acceptable. > > > > PLenty of drivers treat all mmaps as if MAP_POPULATE was specified, > > typically by using (io_)remap_pfn_range. If there any reason to only > > optionally have the pre-fault semantics for sgx? If not this should > > be really simple. And if we have a real need for it to be optional > > we'll just need to find a sane way to pass that information to ->mmap. > > Is there any space in vma->vm_flags ? > > That would be better than an extra argument or function. It's very dense but I'll give a shot for callback route based on Dave's comments in this thread. I.e. use it as filter inside __mm_populate() and populate_vma_page_range(). For Enarx, which we are implementing being able to use MAP_POPULATE and get the full range EAUG'd would be best way to optimize the performance of wasm JIT (Enarx is a wasm run-time capable of running inside an SGX enclave, AMD SEV-SNP VM etc.). More so than any predictor (ra_state, madvice etc.) inside #PF handler, which have been suggested in this thread. After some research on how we implement user space, I'd rather keep the #PF handler working on a single page (EAUG a single page) and have either ioctl or MAP_POPULATE to do the batch fill. We can still "not trust the user space" i.e. the populate does not have to guarantee to do the full length since the #PF handler will then fill the holes. This was one concern in this thread but it is not hard to address. BR, Jarkko