Message ID | cover.1709635535.git.leon@kernel.org |
---|---|
Series | Split IOMMU DMA mapping operation to two steps |
On 2024-03-05 11:18 am, Leon Romanovsky wrote: > This is complimentary part to the proposed LSF/MM topic. > https://lore.kernel.org/linux-rdma/22df55f8-cf64-4aa8-8c0b-b556c867b926@linux.dev/T/#m85672c860539fdbbc8fe0f5ccabdc05b40269057 > > This is posted as RFC to get a feedback on proposed split, but RDMA, VFIO and > DMA patches are ready for review and inclusion, the NVMe patches are still in > progress as they require agreement on API first. > > Thanks > > ------------------------------------------------------------------------------- > The DMA mapping operation performs two steps at one same time: allocates > IOVA space and actually maps DMA pages to that space. This one shot > operation works perfectly for non-complex scenarios, where callers use > that DMA API in control path when they setup hardware. > > However in more complex scenarios, when DMA mapping is needed in data > path and especially when some sort of specific datatype is involved, > such one shot approach has its drawbacks. > > That approach pushes developers to introduce new DMA APIs for specific > datatype. For example existing scatter-gather mapping functions, or > latest Chuck's RFC series to add biovec related DMA mapping [1] and > probably struct folio will need it too. > > These advanced DMA mapping APIs are needed to calculate IOVA size to > allocate it as one chunk and some sort of offset calculations to know > which part of IOVA to map. I don't follow this part at all - at *some* point, something must know a range of memory addresses involved in a DMA transfer, so that's where it should map that range for DMA. Even in a badly-designed system where the point it's most practical to make the mapping is further out and only knows that DMA will touch some subset of a buffer, but doesn't know exactly what subset yet, you'd usually just map the whole buffer. I don't see why the DMA API would ever need to know about anything other than pages/PFNs and dma_addr_ts (yes, it does also accept them being wrapped together in scatterlists; yes, scatterlists are awful and it would be nice to replace them with a better general DMA descriptor; that is a whole other subject of its own). > Instead of teaching DMA to know these specific datatypes, let's separate > existing DMA mapping routine to two steps and give an option to advanced > callers (subsystems) perform all calculations internally in advance and > map pages later when it is needed. From a brief look, this is clearly an awkward reinvention of the IOMMU API. If IOMMU-aware drivers/subsystems want to explicitly manage IOMMU address spaces then they can and should use the IOMMU API. Perhaps there's room for some quality-of-life additions to the IOMMU API to help with common usage patterns, but the generic DMA mapping API is absolutely not the place for it. Thanks, Robin. > In this series, three users are converted and each of such conversion > presents different positive gain: > 1. RDMA simplifies and speeds up its pagefault handling for > on-demand-paging (ODP) mode. > 2. VFIO PCI live migration code saves huge chunk of memory. > 3. NVMe PCI avoids intermediate SG table manipulation and operates > directly on BIOs. 
> > Thanks > > [1] https://lore.kernel.org/all/169772852492.5232.17148564580779995849.stgit@klimt.1015granger.net > > Chaitanya Kulkarni (2): > block: add dma_link_range() based API > nvme-pci: use blk_rq_dma_map() for NVMe SGL > > Leon Romanovsky (14): > mm/hmm: let users to tag specific PFNs > dma-mapping: provide an interface to allocate IOVA > dma-mapping: provide callbacks to link/unlink pages to specific IOVA > iommu/dma: Provide an interface to allow preallocate IOVA > iommu/dma: Prepare map/unmap page functions to receive IOVA > iommu/dma: Implement link/unlink page callbacks > RDMA/umem: Preallocate and cache IOVA for UMEM ODP > RDMA/umem: Store ODP access mask information in PFN > RDMA/core: Separate DMA mapping to caching IOVA and page linkage > RDMA/umem: Prevent UMEM ODP creation with SWIOTLB > vfio/mlx5: Explicitly use number of pages instead of allocated length > vfio/mlx5: Rewrite create mkey flow to allow better code reuse > vfio/mlx5: Explicitly store page list > vfio/mlx5: Convert vfio to use DMA link API > > Documentation/core-api/dma-attributes.rst | 7 + > block/blk-merge.c | 156 ++++++++++++++ > drivers/infiniband/core/umem_odp.c | 219 +++++++------------ > drivers/infiniband/hw/mlx5/mlx5_ib.h | 1 + > drivers/infiniband/hw/mlx5/odp.c | 59 +++-- > drivers/iommu/dma-iommu.c | 129 ++++++++--- > drivers/nvme/host/pci.c | 220 +++++-------------- > drivers/vfio/pci/mlx5/cmd.c | 252 ++++++++++++---------- > drivers/vfio/pci/mlx5/cmd.h | 22 +- > drivers/vfio/pci/mlx5/main.c | 136 +++++------- > include/linux/blk-mq.h | 9 + > include/linux/dma-map-ops.h | 13 ++ > include/linux/dma-mapping.h | 39 ++++ > include/linux/hmm.h | 3 + > include/rdma/ib_umem_odp.h | 22 +- > include/rdma/ib_verbs.h | 54 +++++ > kernel/dma/debug.h | 2 + > kernel/dma/direct.h | 7 +- > kernel/dma/mapping.c | 91 ++++++++ > mm/hmm.c | 34 +-- > 20 files changed, 870 insertions(+), 605 deletions(-) >
On Tue, Mar 05, 2024 at 12:05:23PM +0000, Robin Murphy wrote: > On 2024-03-05 11:18 am, Leon Romanovsky wrote: > > This is complimentary part to the proposed LSF/MM topic. > > https://lore.kernel.org/linux-rdma/22df55f8-cf64-4aa8-8c0b-b556c867b926@linux.dev/T/#m85672c860539fdbbc8fe0f5ccabdc05b40269057 > > > > This is posted as RFC to get a feedback on proposed split, but RDMA, VFIO and > > DMA patches are ready for review and inclusion, the NVMe patches are still in > > progress as they require agreement on API first. > > > > Thanks > > > > ------------------------------------------------------------------------------- > > The DMA mapping operation performs two steps at one same time: allocates > > IOVA space and actually maps DMA pages to that space. This one shot > > operation works perfectly for non-complex scenarios, where callers use > > that DMA API in control path when they setup hardware. > > > > However in more complex scenarios, when DMA mapping is needed in data > > path and especially when some sort of specific datatype is involved, > > such one shot approach has its drawbacks. > > > > That approach pushes developers to introduce new DMA APIs for specific > > datatype. For example existing scatter-gather mapping functions, or > > latest Chuck's RFC series to add biovec related DMA mapping [1] and > > probably struct folio will need it too. > > > > These advanced DMA mapping APIs are needed to calculate IOVA size to > > allocate it as one chunk and some sort of offset calculations to know > > which part of IOVA to map. > > I don't follow this part at all - at *some* point, something must know a > range of memory addresses involved in a DMA transfer, so that's where it > should map that range for DMA. In all presented cases in this series, the overall DMA size is known in advance. In RDMA case, it is known when user registers the memory, in VFIO, when live migration is happening and in NVMe, when BIO is created. So once we allocated IOVA, we will need to link ranges, which si the same as map but without IOVA allocation. > Even in a badly-designed system where the > point it's most practical to make the mapping is further out and only knows > that DMA will touch some subset of a buffer, but doesn't know exactly what > subset yet, you'd usually just map the whole buffer. I don't see why the DMA > API would ever need to know about anything other than pages/PFNs and > dma_addr_ts (yes, it does also accept them being wrapped together in > scatterlists; yes, scatterlists are awful and it would be nice to replace > them with a better general DMA descriptor; that is a whole other subject of > its own). This is exactly what was done here, we got rid of scatterlists. > > > Instead of teaching DMA to know these specific datatypes, let's separate > > existing DMA mapping routine to two steps and give an option to advanced > > callers (subsystems) perform all calculations internally in advance and > > map pages later when it is needed. > > From a brief look, this is clearly an awkward reinvention of the IOMMU API. > If IOMMU-aware drivers/subsystems want to explicitly manage IOMMU address > spaces then they can and should use the IOMMU API. Perhaps there's room for > some quality-of-life additions to the IOMMU API to help with common usage > patterns, but the generic DMA mapping API is absolutely not the place for > it. 
DMA mapping gives nice abstraction from IOMMU, and allows us to have same flow for IOMMU and non-IOMMU flows without duplicating code, while you suggest to teach almost every part in the kernel to know about IOMMU. In this series, we changed RDMA, VFIO and NVMe, and in all cases we removed more code than added. From what I saw, VDPA and virito-blk will benefit from proposed API too. Even in this RFC, where Chaitanya did partial job and didn't convert whole driver, the gain is pretty obvious: https://lore.kernel.org/linux-rdma/016fc02cbfa9be3c156a6f74df38def1e09c08f1.1709635535.git.leon@kernel.org/T/#u drivers/nvme/host/pci.c | 220 ++++++++++++++++++++++++++++++++++++++++++++++---------------------------------------------------------------------------------------------------------------------------------------------------------------- 1 file changed, 49 insertions(+), 171 deletions(-) Thanks > > Thanks, > Robin. > > > In this series, three users are converted and each of such conversion > > presents different positive gain: > > 1. RDMA simplifies and speeds up its pagefault handling for > > on-demand-paging (ODP) mode. > > 2. VFIO PCI live migration code saves huge chunk of memory. > > 3. NVMe PCI avoids intermediate SG table manipulation and operates > > directly on BIOs. > > > > Thanks > > > > [1] https://lore.kernel.org/all/169772852492.5232.17148564580779995849.stgit@klimt.1015granger.net > > > > Chaitanya Kulkarni (2): > > block: add dma_link_range() based API > > nvme-pci: use blk_rq_dma_map() for NVMe SGL > > > > Leon Romanovsky (14): > > mm/hmm: let users to tag specific PFNs > > dma-mapping: provide an interface to allocate IOVA > > dma-mapping: provide callbacks to link/unlink pages to specific IOVA > > iommu/dma: Provide an interface to allow preallocate IOVA > > iommu/dma: Prepare map/unmap page functions to receive IOVA > > iommu/dma: Implement link/unlink page callbacks > > RDMA/umem: Preallocate and cache IOVA for UMEM ODP > > RDMA/umem: Store ODP access mask information in PFN > > RDMA/core: Separate DMA mapping to caching IOVA and page linkage > > RDMA/umem: Prevent UMEM ODP creation with SWIOTLB > > vfio/mlx5: Explicitly use number of pages instead of allocated length > > vfio/mlx5: Rewrite create mkey flow to allow better code reuse > > vfio/mlx5: Explicitly store page list > > vfio/mlx5: Convert vfio to use DMA link API > > > > Documentation/core-api/dma-attributes.rst | 7 + > > block/blk-merge.c | 156 ++++++++++++++ > > drivers/infiniband/core/umem_odp.c | 219 +++++++------------ > > drivers/infiniband/hw/mlx5/mlx5_ib.h | 1 + > > drivers/infiniband/hw/mlx5/odp.c | 59 +++-- > > drivers/iommu/dma-iommu.c | 129 ++++++++--- > > drivers/nvme/host/pci.c | 220 +++++-------------- > > drivers/vfio/pci/mlx5/cmd.c | 252 ++++++++++++---------- > > drivers/vfio/pci/mlx5/cmd.h | 22 +- > > drivers/vfio/pci/mlx5/main.c | 136 +++++------- > > include/linux/blk-mq.h | 9 + > > include/linux/dma-map-ops.h | 13 ++ > > include/linux/dma-mapping.h | 39 ++++ > > include/linux/hmm.h | 3 + > > include/rdma/ib_umem_odp.h | 22 +- > > include/rdma/ib_verbs.h | 54 +++++ > > kernel/dma/debug.h | 2 + > > kernel/dma/direct.h | 7 +- > > kernel/dma/mapping.c | 91 ++++++++ > > mm/hmm.c | 34 +-- > > 20 files changed, 870 insertions(+), 605 deletions(-) > >
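To make the proposed split concrete, here is a minimal sketch of the allocate-then-link flow being argued for, using the function names from the patch subjects (dma_alloc_iova(), dma_link_range()). The dma_iova_attrs structure and the exact signatures below are assumptions for illustration, not the code in the series:

/* Illustrative only: structures and signatures are assumed, not taken from the series. */
static int example_two_step_map(struct device *dev, struct page **pages,
				int nr_pages, dma_addr_t *hw_sgl)
{
	struct dma_iova_attrs iova = {
		.dev  = dev,
		.size = (u64)nr_pages << PAGE_SHIFT,	/* total size is known up front */
		.dir  = DMA_TO_DEVICE,
	};
	int i, ret;

	ret = dma_alloc_iova(&iova);		/* step 1: reserve IOVA once */
	if (ret)
		return ret;

	for (i = 0; i < nr_pages; i++) {
		/* step 2: link each page at its offset inside the reservation */
		hw_sgl[i] = dma_link_range(&iova, page_to_phys(pages[i]),
					   (u64)i << PAGE_SHIFT);
		if (dma_mapping_error(dev, hw_sgl[i])) {
			dma_free_iova(&iova);	/* assumed to also unlink */
			return -EIO;
		}
	}
	return 0;
}

The point of the split shows in the shape: the IOVA allocator is touched once per transfer, while the per-page work is only the page-table update.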
On Tue, Mar 05, 2024 at 02:29:35PM +0200, Leon Romanovsky wrote: > > > These advanced DMA mapping APIs are needed to calculate IOVA size to > > > allocate it as one chunk and some sort of offset calculations to know > > > which part of IOVA to map. > > > > I don't follow this part at all - at *some* point, something must know a > > range of memory addresses involved in a DMA transfer, so that's where it > > should map that range for DMA. > > In all presented cases in this series, the overall DMA size is known in > advance. In RDMA case, it is known when user registers the memory, in > VFIO, when live migration is happening and in NVMe, when BIO is created. > > So once we allocated IOVA, we will need to link ranges, which si the > same as map but without IOVA allocation. But above you say: "These advanced DMA mapping APIs are needed to calculate IOVA size to allocate it as one chunk and some sort of offset calculations to know which part of IOVA to map." this suggests you need helpers to calculate the len and offset. I can't see where that would ever make sense. The total transfer size should just be passed in by the callers and be known, and there should be no offset. > > > Instead of teaching DMA to know these specific datatypes, let's separate > > > existing DMA mapping routine to two steps and give an option to advanced > > > callers (subsystems) perform all calculations internally in advance and > > > map pages later when it is needed. > > > > From a brief look, this is clearly an awkward reinvention of the IOMMU API. > > If IOMMU-aware drivers/subsystems want to explicitly manage IOMMU address > > spaces then they can and should use the IOMMU API. Perhaps there's room for > > some quality-of-life additions to the IOMMU API to help with common usage > > patterns, but the generic DMA mapping API is absolutely not the place for > > it. > > DMA mapping gives nice abstraction from IOMMU, and allows us to have > same flow for IOMMU and non-IOMMU flows without duplicating code, while > you suggest to teach almost every part in the kernel to know about IOMMU. Except that the flows are fundamentally different for the "can coalesce" vs "can't coalesce" case. In the former we have one dma_addr_t range, and in the latter as many as there are input vectors (this is ignoring the weird iommu merging case where we we coalesce some but not all segments, but I'd rather not have that in a new API). So if we want to efficiently be able to handle these cases we need two APIs in the driver and a good framework to switch between them. Robins makes a point here that the iommu API handles the can coalesce case and he has a point as that's exactly how the IOMMU API works. I'd still prefer to wrap it with dma callers to handle things like swiotlb and maybe Xen grant tables and to avoid the type confusion between dma_addr_t and then untyped iova in the iommu layer, but having this layer or not is probably worth a discussion. > > In this series, we changed RDMA, VFIO and NVMe, and in all cases we > removed more code than added. From what I saw, VDPA and virito-blk will > benefit from proposed API too. > > Even in this RFC, where Chaitanya did partial job and didn't convert > whole driver, the gain is pretty obvious: > https://lore.kernel.org/linux-rdma/016fc02cbfa9be3c156a6f74df38def1e09c08f1.1709635535.git.leon@kernel.org/T/#u > I have no idea how that nvme patch is even supposed to work. 
It removes the PRP path in nvme-pci, which is not only the most common I/O path but also required for the admin queue, as NVMe doesn't support SGLs for the admin queue.
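For context on why that path cannot simply go away, this is roughly what PRP setup does. A simplified sketch, not the nvme-pci code (the real driver allocates PRP lists from DMA pools and chains them when they overflow a page):

/* Simplified: PRP1 points at the first page, PRP2 at the second page or a PRP list. */
static void example_setup_prps(struct nvme_command *cmd, dma_addr_t *page_dma,
			       int nr_pages, __le64 *prp_list,
			       dma_addr_t prp_list_dma)
{
	int i;

	cmd->common.dptr.prp1 = cpu_to_le64(page_dma[0]);
	if (nr_pages == 1) {
		cmd->common.dptr.prp2 = 0;
	} else if (nr_pages == 2) {
		cmd->common.dptr.prp2 = cpu_to_le64(page_dma[1]);
	} else {
		/* remaining pages go into a separate, DMA-visible PRP list */
		for (i = 1; i < nr_pages; i++)
			prp_list[i - 1] = cpu_to_le64(page_dma[i]);
		cmd->common.dptr.prp2 = cpu_to_le64(prp_list_dma);
	}
}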
On Wed, Mar 06, 2024 at 03:44:16PM +0100, Christoph Hellwig wrote: > Except that the flows are fundamentally different for the "can coalesce" > vs "can't coalesce" case. In the former we have one dma_addr_t range, > and in the latter as many as there are input vectors (this is ignoring > the weird iommu merging case where we we coalesce some but not all > segments, but I'd rather not have that in a new API). I don't think they are so fundamentally different, at least in our past conversations I never came out with the idea we should burden the driver with two different flows based on what kind of alignment the transfer happens to have. Certainly if we split the API to focus one API on doing only page-aligned transfers the aligned part does become a little. At least the RDMA drivers could productively use just a page aligned interface. But I didn't think this would make BIO users happy so never even thought about it.. > The total transfer size should just be passed in by the callers and > be known, and there should be no offset. The API needs the caller to figure out the total number of IOVA pages it needs, rounding up the CPU ranges to full aligned pages. That becomes the IOVA allocation. offset is something that arises to support non-aligned transfers. > So if we want to efficiently be able to handle these cases we need > two APIs in the driver and a good framework to switch between them. But, what does the non-page-aligned version look like? Doesn't it still look basically like this? And what is the actual difference if the input is aligned? The caller can assume it doesn't need to provide a per-range dma_addr_t during unmap. It still can't assume the HW programming will be linear due to the P2P !ACS support. And it still has to call an API per-cpu range to actually program the IOMMU. So are they really so different to want different APIs? That strikes me as a big driver cost. > I'd still prefer to wrap it with dma callers to handle things like > swiotlb and maybe Xen grant tables and to avoid the type confusion > between dma_addr_t and then untyped iova in the iommu layer, but > having this layer or not is probably worth a discussion. I'm surprised by the idea of random drivers reaching past dma-iommu.c and into the iommu layer to setup DMA directly on the DMA API's iommu_domain?? That seems like completely giving up on the DMA API abstraction to me. :( IMHO, it needs to be wrapped, the wrapper needs to do all the special P2P stuff, at a minimum. The wrapper should multiplex to all the non-iommu cases for the driver too. We still need to achieve some kind of abstraction here that doesn't bruden every driver with different code paths for each DMA back end! Don't we?? Jason
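The rounding and offset Jason describes is plain arithmetic; a minimal sketch with a hypothetical helper (not from the series):

static size_t iova_alloc_size(phys_addr_t first, size_t total_len,
			      unsigned long granule, unsigned long *offset)
{
	/* the first CPU range may start mid-granule ... */
	*offset = first & (granule - 1);

	/* ... so the reservation must cover whole granules including that offset */
	return ALIGN(*offset + total_len, granule);
}

/* Example: a 6 KiB transfer starting 1 KiB into a 4 KiB granule needs 8 KiB of
 * IOVA, and the returned dma handle is iova + 1 KiB. */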
On Wed, Mar 06, 2024 at 11:43:28AM -0400, Jason Gunthorpe wrote: > I don't think they are so fundamentally different, at least in our > past conversations I never came out with the idea we should burden the > driver with two different flows based on what kind of alignment the > transfer happens to have. Then we talked past each other.. > At least the RDMA drivers could productively use just a page aligned > interface. But I didn't think this would make BIO users happy so never > even thought about it.. page aligned is generally the right thing for the block layer. NVMe for example already requires that anyway due to PRPs. > > The total transfer size should just be passed in by the callers and > > be known, and there should be no offset. > > The API needs the caller to figure out the total number of IOVA pages > it needs, rounding up the CPU ranges to full aligned pages. That > becomes the IOVA allocation. Yes, it's a basic align up to the granularity asuming we don't bother with non-aligned transfers. > > > So if we want to efficiently be able to handle these cases we need > > two APIs in the driver and a good framework to switch between them. > > But, what does the non-page-aligned version look like? Doesn't it > still look basically like this? I'd just rather have the non-aligned case for those who really need it be the loop over map single region that is needed for the direct mapping anyway. > > And what is the actual difference if the input is aligned? The caller > can assume it doesn't need to provide a per-range dma_addr_t during > unmap. A per-range dma_addr_t doesn't really make sense for the aligned and coalesced case. > It still can't assume the HW programming will be linear due to the P2P > !ACS support. > > And it still has to call an API per-cpu range to actually program the > IOMMU. > > So are they really so different to want different APIs? That strikes > me as a big driver cost. To not have to store a dma_address range per CPU range that doesn't actually get used at all.
On Wed, Mar 06, 2024 at 05:20:22PM +0100, Christoph Hellwig wrote: > On Wed, Mar 06, 2024 at 11:43:28AM -0400, Jason Gunthorpe wrote: > > I don't think they are so fundamentally different, at least in our > > past conversations I never came out with the idea we should burden the > > driver with two different flows based on what kind of alignment the > > transfer happens to have. > > Then we talked past each other.. Well, we never talked to such detail > > > So if we want to efficiently be able to handle these cases we need > > > two APIs in the driver and a good framework to switch between them. > > > > But, what does the non-page-aligned version look like? Doesn't it > > still look basically like this? > > I'd just rather have the non-aligned case for those who really need > it be the loop over map single region that is needed for the direct > mapping anyway. There is a list of interesting cases this has to cover: 1. Direct map. No dma_addr_t at unmap, multiple HW SGLs 2. IOMMU aligned map, no P2P. Only IOVA range at unmap, single HW SGLs 3. IOMMU aligned map, P2P. Only IOVA range at unmap, multiple HW SGLs 4. swiotlb single range. Only IOVA range at unmap, single HW SGL 5. swiotlb multi-range. All dma_addr_t's at unmap, multiple HW SGLs. 6. Unaligned IOMMU. Only IOVA range at unmap, multiple HW SGLs I think we agree that 1 and 2 should be optimized highly as they are the common case. That mainly means no dma_addr_t storage in either 5 is the slowest and has the most overhead. 4 is basically the same as 2 from the driver's viewpoint 3 is quite similar to 1, but it has the IOVA range at unmap. 6 doesn't have to be optimal, from the driver perspective it can be like 5 That is three basic driver flows 1/3, 2/4 and 5/6 So are you thinking something more like a driver flow of: .. extent IO and get # aligned pages and know if there is P2P .. dma_init_io(state, num_pages, p2p_flag) if (dma_io_single_range(state)) { // #2, #4 for each io() dma_link_aligned_pages(state, io range) hw_sgl = (state->iova, state->len) } else { // #1, #3, #5, #6 hw_sgls = alloc_hw_sgls(num_ios) if (dma_io_needs_dma_addr_unmap(state)) dma_addr_storage = alloc_num_ios(); // #5 only for each io() hw_sgl[i] = dma_map_single(state, io range) if (dma_addr_storage) dma_addr_storage[i] = hw_sgl[i]; // #5 only } ? This is not quite what you said, we split the driver flow based on needing 1 HW SGL vs need many HW SGL. > > So are they really so different to want different APIs? That strikes > > me as a big driver cost. > > To not have to store a dma_address range per CPU range that doesn't > actually get used at all. Right, that is a nice optimization we should reach for. Jason
On Wed, Mar 06, 2024 at 01:44:56PM -0400, Jason Gunthorpe wrote: > There is a list of interesting cases this has to cover: > > 1. Direct map. No dma_addr_t at unmap, multiple HW SGLs > 2. IOMMU aligned map, no P2P. Only IOVA range at unmap, single HW SGLs > 3. IOMMU aligned map, P2P. Only IOVA range at unmap, multiple HW SGLs > 4. swiotlb single range. Only IOVA range at unmap, single HW SGL > 5. swiotlb multi-range. All dma_addr_t's at unmap, multiple HW SGLs. > 6. Unaligned IOMMU. Only IOVA range at unmap, multiple HW SGLs > > I think we agree that 1 and 2 should be optimized highly as they are > the common case. That mainly means no dma_addr_t storage in either I don't think you can do without dma_addr_t storage. In most cases you can just store the dma_addr_t in the LE/BE encoded hardware SGL, so no extra storage should be needed though. > 3 is quite similar to 1, but it has the IOVA range at unmap. Can you explain what P2P case you mean? The switch one with the bus address is indeed basically the same, just with potentially a different offset, while the through host bridge case is the same as a normal iommu map. > > 4 is basically the same as 2 from the driver's viewpoint I'd actually treat it the same as one. > 5 is the slowest and has the most overhead. and 5 could be broken into multiple 4s at least for now. Or do you have a different definition of range here? > So are you thinking something more like a driver flow of: > > .. extent IO and get # aligned pages and know if there is P2P .. > dma_init_io(state, num_pages, p2p_flag) > if (dma_io_single_range(state)) { > // #2, #4 > for each io() > dma_link_aligned_pages(state, io range) > hw_sgl = (state->iova, state->len) > } else { I think what you have as dma_io_single_range should come before the dma_init_io. If we know we can't coalesce it really just is a dma_map_{single,page,bvec} loop, no need for any extra state. And we're back to roughly the proposal I sent out years ago. > This is not quite what you said, we split the driver flow based on > needing 1 HW SGL vs need many HW SGL. That's at least what I intended to say, and I'm a little curious how it came across.
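Storing the dma_addr_t directly in the LE-encoded hardware SGL, as suggested above, looks roughly like what nvme-pci already does for its data descriptors; a condensed sketch:

static void example_set_sgl(struct nvme_sgl_desc *sge, dma_addr_t addr,
			    u32 length)
{
	sge->addr = cpu_to_le64(addr);		/* the mapping result lives here */
	sge->length = cpu_to_le32(length);
	sge->type = NVME_SGL_FMT_DATA_DESC << 4;
}

/* At unmap time the address can be read back from the same descriptor,
 * so no separate dma_addr_t array is kept: */
static dma_addr_t example_sgl_addr(const struct nvme_sgl_desc *sge)
{
	return le64_to_cpu(sge->addr);
}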
On Wed, Mar 06, 2024 at 11:14:00PM +0100, Christoph Hellwig wrote: > On Wed, Mar 06, 2024 at 01:44:56PM -0400, Jason Gunthorpe wrote: > > There is a list of interesting cases this has to cover: > > > > 1. Direct map. No dma_addr_t at unmap, multiple HW SGLs > > 2. IOMMU aligned map, no P2P. Only IOVA range at unmap, single HW SGLs > > 3. IOMMU aligned map, P2P. Only IOVA range at unmap, multiple HW SGLs > > 4. swiotlb single range. Only IOVA range at unmap, single HW SGL > > 5. swiotlb multi-range. All dma_addr_t's at unmap, multiple HW SGLs. > > 6. Unaligned IOMMU. Only IOVA range at unmap, multiple HW SGLs > > > > I think we agree that 1 and 2 should be optimized highly as they are > > the common case. That mainly means no dma_addr_t storage in either > > I don't think you can do without dma_addr_t storage. In most cases > your can just store the dma_addr_t in the LE/BE encoded hardware > SGL, so no extra storage should be needed though. RDMA (and often DRM too) generally doesn't work like that, the driver copies the page table into the device and then the only reason to have a dma_addr_t storage is to pass that to the dma unmap API. Optionally eliminating long term dma_addr_t storage would be a worthwhile memory savings for large long lived user space memory registrations. > > 3 is quite similar to 1, but it has the IOVA range at unmap. > > Can you explain what P2P case you mean? The switch one with the > bus address is indeed basically the same, just with potentioally a > different offset, while the through host bridge case is the same > as a normal iommu map. Yes, the bus address case. The IOMMU is turned on, ACS on a local switch is off. All pages go through the IOMMU in the normal way except P2P pages between devices on the same switch. (ie the dma_addr_t is CPU physical of the P2P plus an offset). RDMA must support a mixture of IOVA and P2P addresses in the same IO operation. I suppose it would make more sense to say it is similar to 6. > > 5 is the slowest and has the most overhead. > > and 5 could be broken into multiple 4s at least for now. Or do you > have a different dfinition of range here? I wrote the list as from a single IO operation perspective, so all but 5 need to store a single IOVA range that could be stored in some simple non-dynamic memory along with whatever HW SGLs/etc are needed. The point of 5 being different is because the driver has to provide a dynamically sized list of dma_addr_t's as storage until unmap. 5 is the only case that requires that full list. So yes, 5 could be broken up into multiple IOs, but then the specialness of 5 is the driver must keep track of multiple IOs.. > > So are you thinking something more like a driver flow of: > > > > .. extent IO and get # aligned pages and know if there is P2P .. > > dma_init_io(state, num_pages, p2p_flag) > > if (dma_io_single_range(state)) { > > // #2, #4 > > for each io() > > dma_link_aligned_pages(state, io range) > > hw_sgl = (state->iova, state->len) > > } else { > > I think what you have a dma_io_single_range should become before > the dma_init_io. If we know we can't coalesce it really just is a > dma_map_{single,page,bvec} loop, no need for any extra state. I imagine dma_io_single_range() to just check a flag in state. I still want to call dma_init_io() for the non-coalescing cases because all the flows, regardless of composition, should be about as fast as dma_map_sg is today. That means we need to always pre-allocate the IOVA in any case where the IOMMU might be active - even on a non-coalescing flow. 
IOW, dma_init_io() always pre-allocates IOVA if the iommu is going to be used and we can't just call today's dma_map_page() in a loop on the non-coalescing side and pay the overhead of Nx IOVA allocations. In large part this is for RDMA, were a single P2P page in a large multi-gigabyte user memory registration shouldn't drastically harm the registration performance by falling down to doing dma_map_page, and an IOVA allocation, on a 4k page by page basis. The other thing that got hand waved here is how does dma_init_io() know which of the 6 states we are looking at? I imagine we probably want to do something like: struct dma_io_summarize summary = {}; for each io() dma_io_summarize_range(&summary, io range) dma_init_io(dev, &state, &summary); if (state->single_range) { } else { } dma_io_done_mapping(&state); <-- flush IOTLB once At least this way the DMA API still has some decent opportunity for abstraction and future growth using state to pass bits of information between the API family. There is some swiotlb complexity that needs something like this, a system with iommu can still fail to coalesce if the pages are encrypted and the device doesn't support DMA from encrypted pages. We need to check for P2P pages, encrypted memory pages, and who knows what else. > And we're back to roughly the proposal I sent out years ago. Well, all of this is roughly your original proposal, just with different optimization choices and some enhancement to also cover hmm_range_fault() users. Enhancing the single sgl case is not a big change, I think. It does seem simplifying for the driver to not have to coalesce SGLs to detect the single-SGL fast-path. > > This is not quite what you said, we split the driver flow based on > > needing 1 HW SGL vs need many HW SGL. > > That's at least what I intended to say, and I'm a little curious as what > it came across. Ok, I was reading the discussion more about as alignment than single HW SGL, I think you ment alignment as implying coalescing behavior implying single HW SGL.. Jason
On 2024/3/5 12:18, Leon Romanovsky wrote: > This is complimentary part to the proposed LSF/MM topic. > https://lore.kernel.org/linux-rdma/22df55f8-cf64-4aa8-8c0b-b556c867b926@linux.dev/T/#m85672c860539fdbbc8fe0f5ccabdc05b40269057 I am interested in this topic. Hope I can join the meeting to discuss this topic. Zhu Yanjun > > This is posted as RFC to get a feedback on proposed split, but RDMA, VFIO and > DMA patches are ready for review and inclusion, the NVMe patches are still in > progress as they require agreement on API first. > > Thanks > > ------------------------------------------------------------------------------- > The DMA mapping operation performs two steps at one same time: allocates > IOVA space and actually maps DMA pages to that space. This one shot > operation works perfectly for non-complex scenarios, where callers use > that DMA API in control path when they setup hardware. > > However in more complex scenarios, when DMA mapping is needed in data > path and especially when some sort of specific datatype is involved, > such one shot approach has its drawbacks. > > That approach pushes developers to introduce new DMA APIs for specific > datatype. For example existing scatter-gather mapping functions, or > latest Chuck's RFC series to add biovec related DMA mapping [1] and > probably struct folio will need it too. > > These advanced DMA mapping APIs are needed to calculate IOVA size to > allocate it as one chunk and some sort of offset calculations to know > which part of IOVA to map. > > Instead of teaching DMA to know these specific datatypes, let's separate > existing DMA mapping routine to two steps and give an option to advanced > callers (subsystems) perform all calculations internally in advance and > map pages later when it is needed. > > In this series, three users are converted and each of such conversion > presents different positive gain: > 1. RDMA simplifies and speeds up its pagefault handling for > on-demand-paging (ODP) mode. > 2. VFIO PCI live migration code saves huge chunk of memory. > 3. NVMe PCI avoids intermediate SG table manipulation and operates > directly on BIOs.
> > Thanks > > [1] https://lore.kernel.org/all/169772852492.5232.17148564580779995849.stgit@klimt.1015granger.net > > Chaitanya Kulkarni (2): > block: add dma_link_range() based API > nvme-pci: use blk_rq_dma_map() for NVMe SGL > > Leon Romanovsky (14): > mm/hmm: let users to tag specific PFNs > dma-mapping: provide an interface to allocate IOVA > dma-mapping: provide callbacks to link/unlink pages to specific IOVA > iommu/dma: Provide an interface to allow preallocate IOVA > iommu/dma: Prepare map/unmap page functions to receive IOVA > iommu/dma: Implement link/unlink page callbacks > RDMA/umem: Preallocate and cache IOVA for UMEM ODP > RDMA/umem: Store ODP access mask information in PFN > RDMA/core: Separate DMA mapping to caching IOVA and page linkage > RDMA/umem: Prevent UMEM ODP creation with SWIOTLB > vfio/mlx5: Explicitly use number of pages instead of allocated length > vfio/mlx5: Rewrite create mkey flow to allow better code reuse > vfio/mlx5: Explicitly store page list > vfio/mlx5: Convert vfio to use DMA link API > > Documentation/core-api/dma-attributes.rst | 7 + > block/blk-merge.c | 156 ++++++++++++++ > drivers/infiniband/core/umem_odp.c | 219 +++++++------------ > drivers/infiniband/hw/mlx5/mlx5_ib.h | 1 + > drivers/infiniband/hw/mlx5/odp.c | 59 +++-- > drivers/iommu/dma-iommu.c | 129 ++++++++--- > drivers/nvme/host/pci.c | 220 +++++-------------- > drivers/vfio/pci/mlx5/cmd.c | 252 ++++++++++++---------- > drivers/vfio/pci/mlx5/cmd.h | 22 +- > drivers/vfio/pci/mlx5/main.c | 136 +++++------- > include/linux/blk-mq.h | 9 + > include/linux/dma-map-ops.h | 13 ++ > include/linux/dma-mapping.h | 39 ++++ > include/linux/hmm.h | 3 + > include/rdma/ib_umem_odp.h | 22 +- > include/rdma/ib_verbs.h | 54 +++++ > kernel/dma/debug.h | 2 + > kernel/dma/direct.h | 7 +- > kernel/dma/mapping.c | 91 ++++++++ > mm/hmm.c | 34 +-- > 20 files changed, 870 insertions(+), 605 deletions(-) >
On Wed, Mar 06, 2024 at 08:00:36PM -0400, Jason Gunthorpe wrote: > > > > I don't think you can do without dma_addr_t storage. In most cases > > your can just store the dma_addr_t in the LE/BE encoded hardware > > SGL, so no extra storage should be needed though. > > RDMA (and often DRM too) generally doesn't work like that, the driver > copies the page table into the device and then the only reason to have > a dma_addr_t storage is to pass that to the dma unmap API. Optionally > eliminating long term dma_addr_t storage would be a worthwhile memory > savings for large long lived user space memory registrations. It's just kinda hard to do. For aligned IOMMU mapping you'd only have one dma_addr_t mappings (or maybe a few if P2P regions are involved), so this probably doesn't matter. For direct mappings you'd have a few, but maybe the better answer is to use THP more aggressively and reduce the number of segments. > I wrote the list as from a single IO operation perspective, so all but > 5 need to store a single IOVA range that could be stored in some > simple non-dynamic memory along with whatever HW SGLs/etc are needed. > > The point of 5 being different is because the driver has to provide a > dynamically sized list of dma_addr_t's as storage until unmap. 5 is > the only case that requires that full list. No, all cases need to store one or more ranges. > > > So are you thinking something more like a driver flow of: > > > > > > .. extent IO and get # aligned pages and know if there is P2P .. > > > dma_init_io(state, num_pages, p2p_flag) > > > if (dma_io_single_range(state)) { > > > // #2, #4 > > > for each io() > > > dma_link_aligned_pages(state, io range) > > > hw_sgl = (state->iova, state->len) > > > } else { > > > > I think what you have a dma_io_single_range should become before > > the dma_init_io. If we know we can't coalesce it really just is a > > dma_map_{single,page,bvec} loop, no need for any extra state. > > I imagine dma_io_single_range() to just check a flag in state. > > I still want to call dma_init_io() for the non-coalescing cases > because all the flows, regardless of composition, should be about as > fast as dma_map_sg is today. If all flows includes multiple non-coalesced regions that just makes things very complicated, and that's exactly what I'd want to avoid. > That means we need to always pre-allocate the IOVA in any case where > the IOMMU might be active - even on a non-coalescing flow. > > IOW, dma_init_io() always pre-allocates IOVA if the iommu is going to > be used and we can't just call today's dma_map_page() in a loop on the > non-coalescing side and pay the overhead of Nx IOVA allocations. > > In large part this is for RDMA, were a single P2P page in a large > multi-gigabyte user memory registration shouldn't drastically harm the > registration performance by falling down to doing dma_map_page, and an > IOVA allocation, on a 4k page by page basis. But that P2P page needs to be handled very differently, as with it we can't actually use a single iova range. So I'm not sure how that is even supposed to work. If you have +-------+-----+-------+ | local | P2P | local | +-------+-----+-------+ you need at least 3 hw SGL entries, as the IOVA won't be contigous. > The other thing that got hand waved here is how does dma_init_io() > know which of the 6 states we are looking at? 
I imagine we probably > want to do something like: > > struct dma_io_summarize summary = {}; > for each io() > dma_io_summarize_range(&summary, io range) > dma_init_io(dev, &state, &summary); > if (state->single_range) { > } else { > } > dma_io_done_mapping(&state); <-- flush IOTLB once That's why I really just want 2 cases. If the caller guarantees the range is coalescable and there is an IOMMU use the iommu-API like API, else just iter over map_single/page. > Enhancing the single sgl case is not a big change, I think. It does > seem simplifying for the driver to not have to coalesce SGLs to detect > the single-SGL fast-path. > > > > This is not quite what you said, we split the driver flow based on > > > needing 1 HW SGL vs need many HW SGL. > > > > That's at least what I intended to say, and I'm a little curious as what > > it came across. > > Ok, I was reading the discussion more about as alignment than single > HW SGL, I think you ment alignment as implying coalescing behavior > implying single HW SGL.. Yes.
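To spell out why the local | P2P | local layout breaks into three hardware SGL entries even when the local parts get contiguous IOVA: the device-visible addresses stop forming one run at the P2P segment, which is programmed with a PCI bus address. A sketch of the entry-emitting loop; the io_range fields and the bus_addr_of()/emit_hw_sgl() helpers are assumptions:

static void example_build_hw_sgl(struct io_range *r, int nr,
				 dma_addr_t iova_base)
{
	dma_addr_t run_start = 0;
	u64 run_len = 0;
	int i;

	for (i = 0; i < nr; i++) {
		dma_addr_t addr = r[i].is_p2p ? bus_addr_of(&r[i]) :
						iova_base + r[i].iova_offset;

		if (run_len && addr == run_start + run_len) {
			run_len += r[i].len;	/* still contiguous: extend */
			continue;
		}
		if (run_len)
			emit_hw_sgl(run_start, run_len);
		run_start = addr;		/* contiguity broke: start a new entry */
		run_len = r[i].len;
	}
	if (run_len)
		emit_hw_sgl(run_start, run_len);
}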
On Thu, Mar 07, 2024 at 04:05:05PM +0100, Christoph Hellwig wrote: > On Wed, Mar 06, 2024 at 08:00:36PM -0400, Jason Gunthorpe wrote: > > > > > > I don't think you can do without dma_addr_t storage. In most cases > > > your can just store the dma_addr_t in the LE/BE encoded hardware > > > SGL, so no extra storage should be needed though. > > > > RDMA (and often DRM too) generally doesn't work like that, the driver > > copies the page table into the device and then the only reason to have > > a dma_addr_t storage is to pass that to the dma unmap API. Optionally > > eliminating long term dma_addr_t storage would be a worthwhile memory > > savings for large long lived user space memory registrations. > > It's just kinda hard to do. For aligned IOMMU mapping you'd only > have one dma_addr_t mappings (or maybe a few if P2P regions are > involved), so this probably doesn't matter. For direct mappings > you'd have a few, but maybe the better answer is to use THP > more aggressively and reduce the number of segments. Right, those things have all been done. 100GB of huge pages is still using a fair amount of memory for storing dma_addr_t's. It is hard to do perfectly, but I think it is not so bad if we focus on the direct only case and simple systems that can exclude swiotlb early on. > > > > So are you thinking something more like a driver flow of: > > > > > > > > .. extent IO and get # aligned pages and know if there is P2P .. > > > > dma_init_io(state, num_pages, p2p_flag) > > > > if (dma_io_single_range(state)) { > > > > // #2, #4 > > > > for each io() > > > > dma_link_aligned_pages(state, io range) > > > > hw_sgl = (state->iova, state->len) > > > > } else { > > > > > > I think what you have a dma_io_single_range should become before > > > the dma_init_io. If we know we can't coalesce it really just is a > > > dma_map_{single,page,bvec} loop, no need for any extra state. > > > > I imagine dma_io_single_range() to just check a flag in state. > > > > I still want to call dma_init_io() for the non-coalescing cases > > because all the flows, regardless of composition, should be about as > > fast as dma_map_sg is today. > > If all flows includes multiple non-coalesced regions that just makes > things very complicated, and that's exactly what I'd want to avoid. I don't see how to avoid it unless we say RDMA shouldn't use this API, which is kind of the whole point from my perspective.. I want an API that can handle all the same complexity as dma_map_sg() without forcing the use of scatterlist. Instead "bring your own datastructure". This is the essence of what we discussed. An API that is inferior to dma_map_sg() is really problematic to use with RDMA. > > That means we need to always pre-allocate the IOVA in any case where > > the IOMMU might be active - even on a non-coalescing flow. > > > > IOW, dma_init_io() always pre-allocates IOVA if the iommu is going to > > be used and we can't just call today's dma_map_page() in a loop on the > > non-coalescing side and pay the overhead of Nx IOVA allocations. > > > > In large part this is for RDMA, were a single P2P page in a large > > multi-gigabyte user memory registration shouldn't drastically harm the > > registration performance by falling down to doing dma_map_page, and an > > IOVA allocation, on a 4k page by page basis. > > But that P2P page needs to be handled very differently, as with it > we can't actually use a single iova range. So I'm not sure how that > is even supposed to work. 
If you have > > +-------+-----+-------+ > | local | P2P | local | > +-------+-----+-------+ > > you need at least 3 hw SGL entries, as the IOVA won't be contigous. Sure, 3 SGL entries is fine, that isn't what I'm pointing at I'm saying that today if you give such a scatterlist to dma_map_sg() it scans it and computes the IOVA space need, allocates one IOVA space, then subdivides that single space up into the 3 HW SGLs you show. If you don't preserve that then we are calling, 4k at a time, a dma_map_page() which is not anywhere close to the same outcome as what dma_map_sg did. I may not get contiguous IOVA, I may not get 3 SGLs, and we call into the IOVA allocator a huge number of times. It needs to work following the same basic structure of dma_map_sg, unfolding that logic into helpers so that the driver can provide the data structure: - Scan the io ranges and figure out how much IOVA needed (dma_io_summarize_range) - Allocate the IOVA (dma_init_io) - Scan the io ranges again generate the final HW SGL (dma_io_link_page) - Finish the iommu batch (dma_io_done_mapping) And you can make that pattern work for all the other cases too. So I don't see this as particularly worse, calling some other API instead of dma_map_page is not really a complexity on the driver. Calling dma_init_io every time is also not a complexity. The DMA API side is a bit more, but not substantively different logic from what dma_map_sg already does. Otherwise what is the alternative? How do I keep these complex things working in RDMA and remove scatterlist? > > The other thing that got hand waved here is how does dma_init_io() > > know which of the 6 states we are looking at? I imagine we probably > > want to do something like: > > > > struct dma_io_summarize summary = {}; > > for each io() > > dma_io_summarize_range(&summary, io range) > > dma_init_io(dev, &state, &summary); > > if (state->single_range) { > > } else { > > } > > dma_io_done_mapping(&state); <-- flush IOTLB once > > That's why I really just want 2 cases. If the caller guarantees the > range is coalescable and there is an IOMMU use the iommu-API like > API, else just iter over map_single/page. But how does the caller even know if it is coalescable? Other than the trivial case of a single CPU range, that is a complicated detail based on what pages are inside the range combined with the capability of the device doing DMA. I don't see a simple way for the caller to figure this out. You need to sweep every page and collect some information on it. The above is to abstract that detail. It was simpler before the confidential compute stuff :( Jason
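The four steps listed above, written out as one flow with the hypothetical names used in this thread (none of these functions exist today; this only shows the shape of the proposed driver code):

static int example_four_step(struct device *dev, struct io_range *ranges,
			     int nr, struct hw_sgl *sgl)
{
	struct dma_io_summarize sum = {};
	struct dma_io_state state;
	int i, ret;

	for (i = 0; i < nr; i++)		/* 1: size up the IOVA need, spot P2P etc. */
		dma_io_summarize_range(&sum, &ranges[i]);

	ret = dma_init_io(dev, &state, &sum);	/* 2: allocate the IOVA once */
	if (ret)
		return ret;

	for (i = 0; i < nr; i++)		/* 3: generate the final HW SGL */
		hw_sgl_add(sgl, dma_io_link_page(&state, &ranges[i]),
			   ranges[i].len);

	dma_io_done_mapping(&state);		/* 4: one IOTLB flush at the end */
	return 0;
}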
On Thu, Mar 07, 2024 at 05:01:16PM -0400, Jason Gunthorpe wrote: > > > > It's just kinda hard to do. For aligned IOMMU mapping you'd only > > have one dma_addr_t mappings (or maybe a few if P2P regions are > > involved), so this probably doesn't matter. For direct mappings > > you'd have a few, but maybe the better answer is to use THP > > more aggressively and reduce the number of segments. > > Right, those things have all been done. 100GB of huge pages is still > using a fair amount of memory for storing dma_addr_t's. > > It is hard to do perfectly, but I think it is not so bad if we focus > on the direct only case and simple systems that can exclude swiotlb > early on. Even with direct mappings only we still need to take care of cache synchronization. > > If all flows includes multiple non-coalesced regions that just makes > > things very complicated, and that's exactly what I'd want to avoid. > > I don't see how to avoid it unless we say RDMA shouldn't use this API, > which is kind of the whole point from my perspective.. The DMA API callers really need to know what is P2P or not for various reasons. And they should generally have that information available, either from pin_user_pages that needs to special case it or from the in-kernel I/O submitter that build it from P2P and normal memory. > Sure, 3 SGL entries is fine, that isn't what I'm pointing at > > I'm saying that today if you give such a scatterlist to dma_map_sg() > it scans it and computes the IOVA space need, allocates one IOVA > space, then subdivides that single space up into the 3 HW SGLs you > show. > > If you don't preserve that then we are calling, 4k at a time, a > dma_map_page() which is not anywhere close to the same outcome as what > dma_map_sg did. I may not get contiguous IOVA, I may not get 3 SGLs, > and we call into the IOVA allocator a huge number of times. Again, your callers must know what is a P2P region and what is not. I don't think it is a hard burdern to do mappings at that granularity, and we can encapsulate this in nice helpes for say the block layer and pin_user_pages callers to start. > > It needs to work following the same basic structure of dma_map_sg, > unfolding that logic into helpers so that the driver can provide > the data structure: > > - Scan the io ranges and figure out how much IOVA needed > (dma_io_summarize_range) That is in general a function of the upper layer and not the DMA code. > - Allocate the IOVA (dma_init_io) And this step is only needed for the iommu case. > > That's why I really just want 2 cases. If the caller guarantees the > > range is coalescable and there is an IOMMU use the iommu-API like > > API, else just iter over map_single/page. > > But how does the caller even know if it is coalescable? Other than the > trivial case of a single CPU range, that is a complicated detail based > on what pages are inside the range combined with the capability of the > device doing DMA. I don't see a simple way for the caller to figure > this out. You need to sweep every page and collect some information on > it. The above is to abstract that detail. dma_get_merge_boundary already provides this information in terms of the device capabilities. And given that the callers knows what is P2P and what is not we have all the information that is needed.
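dma_get_merge_boundary() is an existing API: it returns 0 when segments can never be merged, otherwise the merge boundary (in the dma-iommu implementation, the IOMMU granule minus one). A sketch of the up-front check Christoph describes; the range helpers are assumed caller-side code:

static bool example_can_coalesce(struct device *dev,
				 struct io_range *ranges, int nr)
{
	/* 0 means this device/IOMMU setup can never merge segments */
	unsigned long boundary = dma_get_merge_boundary(dev);

	if (!boundary)
		return false;

	/* caller-side knowledge: granule-aligned ranges, no P2P mixed in */
	return ranges_aligned_to(ranges, nr, boundary + 1) &&
	       !ranges_have_p2p(ranges, nr);
}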
On Fri, Mar 08, 2024 at 05:49:20PM +0100, Christoph Hellwig wrote: > On Thu, Mar 07, 2024 at 05:01:16PM -0400, Jason Gunthorpe wrote: > > > > > > It's just kinda hard to do. For aligned IOMMU mapping you'd only > > > have one dma_addr_t mappings (or maybe a few if P2P regions are > > > involved), so this probably doesn't matter. For direct mappings > > > you'd have a few, but maybe the better answer is to use THP > > > more aggressively and reduce the number of segments. > > > > Right, those things have all been done. 100GB of huge pages is still > > using a fair amount of memory for storing dma_addr_t's. > > > > It is hard to do perfectly, but I think it is not so bad if we focus > > on the direct only case and simple systems that can exclude swiotlb > > early on. > > Even with direct mappings only we still need to take care of > cache synchronization. Yes, we still have to unmap, but the unmap for cache synchronization doesn't need the dma_addr_t to flush the CPU cache. > > > If all flows includes multiple non-coalesced regions that just makes > > > things very complicated, and that's exactly what I'd want to avoid. > > > > I don't see how to avoid it unless we say RDMA shouldn't use this API, > > which is kind of the whole point from my perspective.. > > The DMA API callers really need to know what is P2P or not for > various reasons. And they should generally have that information > available, either from pin_user_pages that needs to special case > it or from the in-kernel I/O submitter that build it from P2P and > normal memory. I think that is a BIO thing. RDMA just calls with FOLL_PCI_P2PDMA and shoves the resulting page list into in a scattertable. It never checks if any returned page is P2P - it has no reason to care. dma_map_sg() does all the work. That is the kind of abstraction I am coming to this problem with. You are looking at BIO where you already needed to split things up for other reasons, but I think that is a uniquely block thing that will not be shared in other subsystems. > > If you don't preserve that then we are calling, 4k at a time, a > > dma_map_page() which is not anywhere close to the same outcome as what > > dma_map_sg did. I may not get contiguous IOVA, I may not get 3 SGLs, > > and we call into the IOVA allocator a huge number of times. > > Again, your callers must know what is a P2P region and what is not. I don't see this at all. We don't do this today in RDMA. There is no "P2P region". > > > That's why I really just want 2 cases. If the caller guarantees the > > > range is coalescable and there is an IOMMU use the iommu-API like > > > API, else just iter over map_single/page. > > > > But how does the caller even know if it is coalescable? Other than the > > trivial case of a single CPU range, that is a complicated detail based > > on what pages are inside the range combined with the capability of the > > device doing DMA. I don't see a simple way for the caller to figure > > this out. You need to sweep every page and collect some information on > > it. The above is to abstract that detail. > > dma_get_merge_boundary already provides this information in terms > of the device capabilities. And given that the callers knows what > is P2P and what is not we have all the information that is needed. Encrypted memory too. RDMA also doesn't call dma_get_merge_boundary(). It doesn't keep track of P2P regions. It doesn't break out encrypted memory. It has no purpose to do any of those things. You fundamentally cannot subdivide a memory registration. 
So we could artificially introduce the concept of limited coalescing into RDMA, dmabuf and others just to drive this new API - but really that feels much much worse than just making the DMA API still able to do IOMMU coalescing in more cases. Even if we did that, it will still be less efficient than today where we just call dma_map_sg() on the jumble of pages. Jason
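For reference, the status quo Jason is describing, condensed into one flow (an approximation of the umem path, not the actual ib_umem_get() code; error unwinding omitted):

static int example_current_flow(struct device *dev, unsigned long addr,
				size_t len, struct page **pages,
				struct sg_table *sgt)
{
	int npages = DIV_ROUND_UP(len + (addr & ~PAGE_MASK), PAGE_SIZE);
	int pinned, ret;

	/* P2P pages are accepted; nothing here looks at what came back */
	pinned = pin_user_pages_fast(addr, npages,
				     FOLL_WRITE | FOLL_LONGTERM |
				     FOLL_PCI_P2PDMA, pages);
	if (pinned != npages)
		return pinned < 0 ? pinned : -EFAULT;

	ret = sg_alloc_table_from_pages(sgt, pages, npages,
					addr & ~PAGE_MASK, len, GFP_KERNEL);
	if (ret)
		return ret;

	/* dma_map_sgtable() sorts out IOVA, P2P bus addresses, swiotlb, ... */
	return dma_map_sgtable(dev, sgt, DMA_BIDIRECTIONAL, 0);
}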
On Fri, Mar 08, 2024 at 04:23:42PM -0400, Jason Gunthorpe wrote: > > The DMA API callers really need to know what is P2P or not for > > various reasons. And they should generally have that information > > available, either from pin_user_pages that needs to special case > > it or from the in-kernel I/O submitter that build it from P2P and > > normal memory. > > I think that is a BIO thing. RDMA just calls with FOLL_PCI_P2PDMA and > shoves the resulting page list into in a scattertable. It never checks > if any returned page is P2P - it has no reason to care. dma_map_sg() > does all the work. Right now it does, but that's not really a good interface. If we have a pin_user_pages variant that only pins until the next relevant P2P boundary and tells you about it, we can significantly simplify the overall interface.
On Sat, Mar 09, 2024 at 05:14:18PM +0100, Christoph Hellwig wrote: > On Fri, Mar 08, 2024 at 04:23:42PM -0400, Jason Gunthorpe wrote: > > > The DMA API callers really need to know what is P2P or not for > > > various reasons. And they should generally have that information > > > available, either from pin_user_pages that needs to special case > > > it or from the in-kernel I/O submitter that build it from P2P and > > > normal memory. > > > > I think that is a BIO thing. RDMA just calls with FOLL_PCI_P2PDMA and > > shoves the resulting page list into in a scattertable. It never checks > > if any returned page is P2P - it has no reason to care. dma_map_sg() > > does all the work. > > Right now it does, but that's not really a good interface. If we have > a pin_user_pages variant that only pins until the next relevant P2P > boundary and tells you about we can significantly simplify the overall > interface. And you will need to have a way to instruct that pin_user_pages() variant to continue anyway, because you asked for FOLL_PCI_P2PDMA. Without that force, you will have !FOLL_PCI_P2PDMA behaviour. When you say "simplify the overall interface", which interface do you mean? Thanks
On Sun, Mar 10, 2024 at 11:35:13AM +0200, Leon Romanovsky wrote: > And you will need to have a way to instruct that pin_user_pages() variant > to continue anyway, because you asked for FOLL_PCI_P2PDMA. Without that > force, you will have !FOLL_PCI_P2PDMA behaviour. I don't understand what you mean. > When you say "simplify the overall interface", which interface do you mean? Primarily the dma mapping interface. Secondarily also everything around it.
On Tue, Mar 12, 2024 at 10:28:44PM +0100, Christoph Hellwig wrote: > On Sun, Mar 10, 2024 at 11:35:13AM +0200, Leon Romanovsky wrote: > > And you will need to have a way to instruct that pin_user_pages() variant > > to continue anyway, because you asked for FOLL_PCI_P2PDMA. Without that > > force, you will have !FOLL_PCI_P2PDMA behaviour. > > I don't understand what you mean. Jason talked about the need to call to pin_user_pages(..., gup_flags | FOLL_PCI_P2PDMA, ...), but in your proposal this call won't be possible anymore. > > > When you say "simplify the overall interface", which interface do you mean? > > Primarily the dma mapping interface. Secondarily also everything around > it. OK, thanks
On Wed, Mar 13, 2024 at 09:46:36AM +0200, Leon Romanovsky wrote: > On Tue, Mar 12, 2024 at 10:28:44PM +0100, Christoph Hellwig wrote: > > On Sun, Mar 10, 2024 at 11:35:13AM +0200, Leon Romanovsky wrote: > > > And you will need to have a way to instruct that pin_user_pages() variant > > > to continue anyway, because you asked for FOLL_PCI_P2PDMA. Without that > > > force, you will have !FOLL_PCI_P2PDMA behaviour. > > > > I don't understand what you mean. > > Jason talked about the need to call to pin_user_pages(..., gup_flags | FOLL_PCI_P2PDMA, ...), > but in your proposal this call won't be possible anymore. Why?
On Sat, Mar 09, 2024 at 05:14:18PM +0100, Christoph Hellwig wrote: > On Fri, Mar 08, 2024 at 04:23:42PM -0400, Jason Gunthorpe wrote: > > > The DMA API callers really need to know what is P2P or not for > > > various reasons. And they should generally have that information > > > available, either from pin_user_pages that needs to special case > > > it or from the in-kernel I/O submitter that build it from P2P and > > > normal memory. > > > > I think that is a BIO thing. RDMA just calls with FOLL_PCI_P2PDMA and > > shoves the resulting page list into in a scattertable. It never checks > > if any returned page is P2P - it has no reason to care. dma_map_sg() > > does all the work. > > Right now it does, but that's not really a good interface. If we have > a pin_user_pages variant that only pins until the next relevant P2P > boundary and tells you about we can significantly simplify the overall > interface. Sorry for the delay, I was away.. I kind of understand your thinking on the DMA side, but I don't see how this is good for users of the API beyond BIO. How will this make RDMA better? We have one MR, the MR has pages, the HW doesn't care about the SW distinction of p2p, swiotlb, direct, encrypted, iommu, etc. It needs to create one HW page list for whatever user VA range was given. Or worse, whatever thing is inside a DMABUF from a DRM driver. DMABUF's can have a (dynamic!) mixture of P2P and regular AFAIK based on the GPU's migration behavior. Or triple worse, ODP can dynamically change on a page by page basis the type depending on what hmm_range_fault() returns. So I take it as a requirement that RDMA MUST make single MR's out of a hodgepodge of page types. RDMA MRs cannot be split. Multiple MR's are not a functional replacement for a single MR. Go back to the start of what are we trying to do here: 1) Make a DMA API that can support hmm_range_fault() users in a sensible and performant way 2) Make a DMA API that can support RDMA MR's backed by DMABUF's, and user VA's without restriction 3) Allow to remove scatterlist from BIO paths 4) Provide a DMABUF API that is not scatterlist that can feed into the new DMA API - again supporting DMABUF's hodgepodge of types. I'd like to do all of these things. I know 3 is your highest priority, but it is my lowest :) So, if the new API can only do uniformity I immediately loose #1 - hmm_range_fault() can't guarentee anything, so it looses the IOVA optimization that Leon's patches illustrate. For uniformity #2 probably needs multiple DMA API "transactions". This is doable, but it is less performant than one "transaction". #3 is perfectly happy because BIO already creates uniformity #4 is like #2, there is not guarenteed uniformity inside DMABUF so every DMABUF importer needs to take some complexity to deal with it. There are many DMABUF importers so this feels like a poor API abstraction if we force everyone there to take on complexity. So I'm just not seeing why this would be better. I think Leon's series shows the cost of non-uniformity support is actually pretty small. Still, we could do better, if the caller can optionally indicate it knows it has uniformity then that can be optimized futher. I'd like to find something that works well for all of the above, and I think abstracting non-uniformity at the API level is important for the above reasons. Can we tweak what Leon has done to keep the hmm_range_fault support and non-uniformity for RDMA but add a uniformity optimized flow for BIO? Jason
On Tue, Mar 19, 2024 at 12:36:20PM -0300, Jason Gunthorpe wrote: > On Sat, Mar 09, 2024 at 05:14:18PM +0100, Christoph Hellwig wrote: > > On Fri, Mar 08, 2024 at 04:23:42PM -0400, Jason Gunthorpe wrote: > > > > The DMA API callers really need to know what is P2P or not for > > > > various reasons. And they should generally have that information > > > > available, either from pin_user_pages that needs to special case > > > > it or from the in-kernel I/O submitter that build it from P2P and > > > > normal memory. > > > > > > I think that is a BIO thing. RDMA just calls with FOLL_PCI_P2PDMA and > > > shoves the resulting page list into in a scattertable. It never checks > > > if any returned page is P2P - it has no reason to care. dma_map_sg() > > > does all the work. > > > > Right now it does, but that's not really a good interface. If we have > > a pin_user_pages variant that only pins until the next relevant P2P > > boundary and tells you about we can significantly simplify the overall > > interface. > > Sorry for the delay, I was away.. <...> > Can we tweak what Leon has done to keep the hmm_range_fault support > and non-uniformity for RDMA but add a uniformity optimized flow for > BIO? Something like this will do the trick. From 45e739e7073fb04bc168624f77320130bb3f9267 Mon Sep 17 00:00:00 2001 Message-ID: <45e739e7073fb04bc168624f77320130bb3f9267.1710924764.git.leonro@nvidia.com> From: Leon Romanovsky <leonro@nvidia.com> Date: Mon, 18 Mar 2024 11:16:41 +0200 Subject: [PATCH] mm/gup: add strict interface to pin user pages according to FOLL flag All pin_user_pages*() and get_user_pages*() callbacks allocate user pages by partially taking into account their p2p vs. non-p2p properties. In case, user sets FOLL_PCI_P2PDMA flag, the allocated pages will include both p2p and "regular" pages, while if FOLL_PCI_P2PDMA flag is not provided, only regular pages are returned. In order to make sure that with FOLL_PCI_P2PDMA flag, only p2p pages are returned, let's introduce new internal FOLL_STRICT flag and provide special pin_user_pages_fast_strict() API call. 
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 include/linux/mm.h |  3 +++
 mm/gup.c           | 36 +++++++++++++++++++++++++++++++++++-
 mm/internal.h      |  4 +++-
 3 files changed, 41 insertions(+), 2 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index f5a97dec5169..910b65dde24a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2491,6 +2491,9 @@ int pin_user_pages_fast(unsigned long start, int nr_pages,
 			unsigned int gup_flags, struct page **pages);
 void folio_add_pin(struct folio *folio);
 
+int pin_user_pages_fast_strict(unsigned long start, int nr_pages,
+			       unsigned int gup_flags, struct page **pages);
+
 int account_locked_vm(struct mm_struct *mm, unsigned long pages, bool inc);
 int __account_locked_vm(struct mm_struct *mm, unsigned long pages, bool inc,
 			struct task_struct *task, bool bypass_rlim);
diff --git a/mm/gup.c b/mm/gup.c
index df83182ec72d..11b5c626a4ab 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -133,6 +133,10 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
 	if (unlikely(!(flags & FOLL_PCI_P2PDMA) && is_pci_p2pdma_page(page)))
 		return NULL;
 
+	if (flags & FOLL_STRICT)
+		if (flags & FOLL_PCI_P2PDMA && !is_pci_p2pdma_page(page))
+			return NULL;
+
 	if (flags & FOLL_GET)
 		return try_get_folio(page, refs);
@@ -232,6 +236,10 @@ int __must_check try_grab_page(struct page *page, unsigned int flags)
 	if (unlikely(!(flags & FOLL_PCI_P2PDMA) && is_pci_p2pdma_page(page)))
 		return -EREMOTEIO;
 
+	if (flags & FOLL_STRICT)
+		if (flags & FOLL_PCI_P2PDMA && !is_pci_p2pdma_page(page))
+			return -EREMOTEIO;
+
 	if (flags & FOLL_GET)
 		folio_ref_inc(folio);
 	else if (flags & FOLL_PIN) {
@@ -2243,6 +2251,8 @@ static bool is_valid_gup_args(struct page **pages, int *locked,
 	 * - FOLL_TOUCH/FOLL_PIN/FOLL_TRIED/FOLL_FAST_ONLY are internal only
 	 * - FOLL_REMOTE is internal only and used on follow_page()
 	 * - FOLL_UNLOCKABLE is internal only and used if locked is !NULL
+	 * - FOLL_STRICT is internal only and used to distinguish between p2p
+	 *   and "regular" pages.
 	 */
 	if (WARN_ON_ONCE(gup_flags & INTERNAL_GUP_FLAGS))
 		return false;
@@ -3187,7 +3197,8 @@ static int internal_get_user_pages_fast(unsigned long start,
 	if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM |
 				       FOLL_FORCE | FOLL_PIN | FOLL_GET |
 				       FOLL_FAST_ONLY | FOLL_NOFAULT |
-				       FOLL_PCI_P2PDMA | FOLL_HONOR_NUMA_FAULT)))
+				       FOLL_PCI_P2PDMA | FOLL_HONOR_NUMA_FAULT |
+				       FOLL_STRICT)))
 		return -EINVAL;
 
 	if (gup_flags & FOLL_PIN)
@@ -3322,6 +3333,29 @@ int pin_user_pages_fast(unsigned long start, int nr_pages,
 }
 EXPORT_SYMBOL_GPL(pin_user_pages_fast);
 
+/**
+ * pin_user_pages_fast_strict() - this is pin_user_pages_fast() variant, which
+ * makes sure that only pages with same properties are pinned.
+ *
+ * @start:      starting user address
+ * @nr_pages:   number of pages from start to pin
+ * @gup_flags:  flags modifying pin behaviour
+ * @pages:      array that receives pointers to the pages pinned.
+ *              Should be at least nr_pages long.
+ *
+ * Nearly the same as pin_user_pages_fast(), except that FOLL_STRICT is set.
+ *
+ * FOLL_STRICT means that the pages are allocated with specific FOLL_* properties.
+ */
+int pin_user_pages_fast_strict(unsigned long start, int nr_pages,
+			       unsigned int gup_flags, struct page **pages)
+{
+	if (!is_valid_gup_args(pages, NULL, &gup_flags, FOLL_PIN | FOLL_STRICT))
+		return -EINVAL;
+	return internal_get_user_pages_fast(start, nr_pages, gup_flags, pages);
+}
+EXPORT_SYMBOL_GPL(pin_user_pages_fast_strict);
+
 /**
  * pin_user_pages_remote() - pin pages of a remote process
  *
diff --git a/mm/internal.h b/mm/internal.h
index f309a010d50f..7578837a0444 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1031,10 +1031,12 @@ enum {
 	FOLL_FAST_ONLY = 1 << 20,
 	/* allow unlocking the mmap lock */
 	FOLL_UNLOCKABLE = 1 << 21,
+	/* don't mix pages with different properties, e.g. p2p with "regular" ones */
+	FOLL_STRICT = 1 << 22,
 };
 
 #define INTERNAL_GUP_FLAGS (FOLL_TOUCH | FOLL_TRIED | FOLL_REMOTE | FOLL_PIN | \
-			    FOLL_FAST_ONLY | FOLL_UNLOCKABLE)
+			    FOLL_FAST_ONLY | FOLL_UNLOCKABLE | FOLL_STRICT)
 
 /*
  * Indicates for which pages that are write-protected in the page table,
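[Editorial illustration: a caller of the pin_user_pages_fast_strict() helper proposed in the patch above might look roughly like the sketch below. The wrapper function, its error handling and the fallback policy are assumptions made only for this example and are not part of the patch.]

	/*
	 * Sketch of a hypothetical caller of the proposed
	 * pin_user_pages_fast_strict(). With FOLL_STRICT set internally, a
	 * pin requested with FOLL_PCI_P2PDMA is expected to stop or fail
	 * rather than return a mix of P2P and regular pages. The fallback
	 * policy below is an assumption for illustration only.
	 */
	static int pin_p2p_chunk(unsigned long start, int nr_pages,
				 struct page **pages)
	{
		int pinned;

		pinned = pin_user_pages_fast_strict(start, nr_pages,
						    FOLL_WRITE | FOLL_PCI_P2PDMA,
						    pages);
		if (pinned <= 0)
			/* No P2P pages at 'start'; retry as regular memory. */
			return pin_user_pages_fast(start, nr_pages,
						   FOLL_WRITE, pages);

		return pinned;	/* P2P-only chunk */
	}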
On Tue, Mar 19, 2024 at 12:36:20PM -0300, Jason Gunthorpe wrote: > I kind of understand your thinking on the DMA side, but I don't see > how this is good for users of the API beyond BIO. > > How will this make RDMA better? We have one MR, the MR has pages, the > HW doesn't care about the SW distinction of p2p, swiotlb, direct, > encrypted, iommu, etc. It needs to create one HW page list for > whatever user VA range was given. Well, the hardware (as in the PCIe card) never cares. But the setup path for the IOMMU does, and something in the OS needs to know about it. So unless we want to stash away a 'is this P2P' flag in every page / SG entry / bvec, or a do a lookup to find that out for each of them we need to manage chunks at these boundaries. And that's what I'm proposing. > Or worse, whatever thing is inside a DMABUF from a DRM > driver. DMABUF's can have a (dynamic!) mixture of P2P and regular > AFAIK based on the GPU's migration behavior. And that's fine. We just need to track it efficiently. > > Or triple worse, ODP can dynamically change on a page by page basis > the type depending on what hmm_range_fault() returns. Same. If this changes all the time you need to track it. And we should find a way to shared the code if we have multiple users for it. But most DMA API consumers will never see P2P, and when they see it it will be static. So don't build the DMA API to automically do the (not exactly super cheap) checks and add complexity for it. > So I take it as a requirement that RDMA MUST make single MR's out of a > hodgepodge of page types. RDMA MRs cannot be split. Multiple MR's are > not a functional replacement for a single MR. But MRs consolidate multiple dma addresses anyway. > Go back to the start of what are we trying to do here: > 1) Make a DMA API that can support hmm_range_fault() users in a > sensible and performant way > 2) Make a DMA API that can support RDMA MR's backed by DMABUF's, and > user VA's without restriction > 3) Allow to remove scatterlist from BIO paths > 4) Provide a DMABUF API that is not scatterlist that can feed into > the new DMA API - again supporting DMABUF's hodgepodge of types. > > I'd like to do all of these things. I know 3 is your highest priority, > but it is my lowest :) Well, 3 an 4. And 3 is not just limited to bio, but all the other pointless scatterlist uses.
On Wed, Mar 20, 2024 at 10:55:36AM +0200, Leon Romanovsky wrote:
> Something like this will do the trick.
As far as I can tell it totally misses the point. Which is not to never
return non-P2P if the flag is set, but to return either all P2P or
non-P2P and not create a boundary in the single call.
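[Editorial illustration: to make the calling convention described here concrete, a caller-side loop could look roughly like the sketch below. The helper pin_user_pages_fast_uniform() and its is_p2p out-parameter are hypothetical names invented for this illustration; they do not exist in the kernel.]

	/*
	 * Hypothetical sketch: a pin helper that returns either all-P2P or
	 * all-regular pages per call, never a mix, and never crosses the
	 * boundary between the two in one call. Neither the helper nor
	 * 'is_p2p' exist today; they only illustrate the interface being
	 * discussed.
	 */
	int pin_user_pages_fast_uniform(unsigned long start, int nr_pages,
					unsigned int gup_flags,
					struct page **pages, bool *is_p2p);

	static int pin_user_range_in_chunks(unsigned long start, int nr_pages,
					    struct page **pages)
	{
		while (nr_pages) {
			bool is_p2p;
			int pinned;

			pinned = pin_user_pages_fast_uniform(start, nr_pages,
							     FOLL_WRITE | FOLL_PCI_P2PDMA,
							     pages, &is_p2p);
			if (pinned <= 0)
				return pinned ? pinned : -EFAULT;

			/* Each chunk is uniform, so the caller can pick the
			 * P2P or the regular mapping path per chunk without
			 * inspecting individual pages. */

			start += (unsigned long)pinned << PAGE_SHIFT;
			pages += pinned;
			nr_pages -= pinned;
		}
		return 0;
	}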
On Thu, Mar 21, 2024 at 11:40:13PM +0100, Christoph Hellwig wrote: > On Wed, Mar 20, 2024 at 10:55:36AM +0200, Leon Romanovsky wrote: > > Something like this will do the trick. > > As far as I can tell it totally misses the point. Which is not to never > return non-P2P if the flag is set, but to return either all P2P or non-P2 > P and not create a boundary in the single call. You are treating FOLL_PCI_P2PDMA as a hint, but in iov_iter_extract_user_pages() you set it only for p2p queues. I was under impression that you want only p2p pages in these queues. Anyway, I can prepare other patch that will return or p2p or non-p2p pages in one shot. Thanks
On Thu, Mar 21, 2024 at 11:39:10PM +0100, Christoph Hellwig wrote: > On Tue, Mar 19, 2024 at 12:36:20PM -0300, Jason Gunthorpe wrote: > > I kind of understand your thinking on the DMA side, but I don't see > > how this is good for users of the API beyond BIO. > > > > How will this make RDMA better? We have one MR, the MR has pages, the > > HW doesn't care about the SW distinction of p2p, swiotlb, direct, > > encrypted, iommu, etc. It needs to create one HW page list for > > whatever user VA range was given. > > Well, the hardware (as in the PCIe card) never cares. But the setup > path for the IOMMU does, and something in the OS needs to know about > it. So unless we want to stash away a 'is this P2P' flag in every > page / SG entry / bvec, or a do a lookup to find that out for each > of them we need to manage chunks at these boundaries. And that's > what I'm proposing. Okay, if we look at the struct-page-less world (which we want for DMABUF) then we need to keep track for sure. What I had drafted was to keep track in the new "per-SG entry" because that seemed easiest to migrate existing code into. Though the datastructure could also be written to be a list of uniform memory types and then a list of SG entries. (more like how bio is organized) No idea right now which is better, and I'm happy make it go either way. But Leon's series is not quite getting to this, it it still struct page based and struct page itself has all the metadata - though as you say it is a bit expensive to access. > > Or worse, whatever thing is inside a DMABUF from a DRM > > driver. DMABUF's can have a (dynamic!) mixture of P2P and regular > > AFAIK based on the GPU's migration behavior. > > And that's fine. We just need to track it efficiently. Right, DMABUF/etc will return a something that has a list of physical addresses and some meta-data to indicate the "p2p memory provider" for the P2P part. Perhaps it could be as simple as 1 bit in the physical address/length and a global "P2P memory provider" pointer for the entire DMA BUF. Unclear to me right now, but sure. > > Or triple worse, ODP can dynamically change on a page by page basis > > the type depending on what hmm_range_fault() returns. > > Same. If this changes all the time you need to track it. And we > should find a way to shared the code if we have multiple users for it. ODP (for at least the forseeable furture) is simpler because it is always struct page based so we don't need more metadata if we pay the cost to reach into the struct page. I suspect that is the right trade off for hmm_range_fault users. > But most DMA API consumers will never see P2P, and when they see it > it will be static. So don't build the DMA API to automically do > the (not exactly super cheap) checks and add complexity for it. Okay, I think I get what you'd like to see. If we are going to make caller provided uniformity a requirement, lets imagine a formal memory type idea to help keep this a little abstracted? DMA_MEMORY_TYPE_NORMAL DMA_MEMORY_TYPE_P2P_NOT_ACS DMA_MEMORY_TYPE_ENCRYPTED DMA_MEMORY_TYPE_BOUNCE_BUFFER // ?? 
Then maybe the driver flow looks like:

	if (transaction.memory_type == DMA_MEMORY_TYPE_NORMAL && dma_api_has_iommu(dev)) {
		struct dma_api_iommu_state state;

		dma_api_iommu_start(&state, transaction.num_pages);
		for_each_range(transaction, range)
			dma_api_iommu_map_range(&state, range.start_page, range.length);
		num_hwsgls = 1;
		hwsgl.addr = state.iova;
		hwsgl.length = transaction.length;
		dma_api_iommu_batch_done(&state);
	} else if (transaction.memory_type == DMA_MEMORY_TYPE_P2P_NOT_ACS) {
		num_hwsgls = transaction.num_sgls;
		for_each_range(transaction, range) {
			hwsgl[i].addr = dma_api_p2p_not_acs_map(range.start_physical,
								range.length, p2p_memory_provider);
			hwsgl[i].len = range.size;
		}
	} else {
		/* Must be DMA_MEMORY_TYPE_NORMAL, DMA_MEMORY_TYPE_ENCRYPTED,
		   DMA_MEMORY_TYPE_BOUNCE_BUFFER? */
		num_hwsgls = transaction.num_sgls;
		for_each_range(transaction, range) {
			hwsgl[i].addr = dma_api_map_cpu_page(range.start_page, range.length);
			hwsgl[i].len = range.size;
		}
	}

And the hmm_range_fault case is sort of like:

	struct dma_api_iommu_state state;

	dma_api_iommu_start(&state, mr.num_pages);

	[..]

	hmm_range_fault(...)
	if (present)
		dma_link_page(&state, faulting_address_offset, page);
	else
		dma_unlink_page(&state, faulting_address_offset, page);

Is this looking closer?

> > So I take it as a requirement that RDMA MUST make single MR's out of a
> > hodgepodge of page types. RDMA MRs cannot be split. Multiple MR's are
> > not a functional replacement for a single MR.
>
> But MRs consolidate multiple dma addresses anyway.

I'm not sure I understand this?

> > Go back to the start of what are we trying to do here:
> >  1) Make a DMA API that can support hmm_range_fault() users in a
> >     sensible and performant way
> >  2) Make a DMA API that can support RDMA MR's backed by DMABUF's, and
> >     user VA's without restriction
> >  3) Allow to remove scatterlist from BIO paths
> >  4) Provide a DMABUF API that is not scatterlist that can feed into
> >     the new DMA API - again supporting DMABUF's hodgepodge of types.
> >
> > I'd like to do all of these things. I know 3 is your highest priority,
> > but it is my lowest :)
>
> Well, 3 an 4. And 3 is not just limited to bio, but all the other
> pointless scatterlist uses.

Well, I didn't write a '5) remove all the other pointless scatterlist
cases' :)

Anyhow, I think we all agree on the high level objective, we just need
to get to an API that fuses all of these goals together.

To go back to my main thesis - I would like a high performance low
level DMA API that is capable enough that it could implement
scatterlist dma_map_sg() and thus also implement any future
scatterlist_v2, bio, hmm_range_fault or any other thing we come up
with on top of it. This is broadly what I thought we agreed to at LSF
last year.

Jason
On Fri, Mar 22, 2024 at 07:46:17PM +0200, Leon Romanovsky wrote: > > As far as I can tell it totally misses the point. Which is not to never > > return non-P2P if the flag is set, but to return either all P2P or non-P2 > > P and not create a boundary in the single call. > > You are treating FOLL_PCI_P2PDMA as a hint, but in iov_iter_extract_user_pages() > you set it only for p2p queues. I was under impression that you want only p2p pages > in these queues. FOLL_PCI_P2PDMA is an indicator that the caller can cope with P2P pages. Most callers simply can't.
On Fri, Mar 22, 2024 at 03:43:30PM -0300, Jason Gunthorpe wrote: > If we are going to make caller provided uniformity a requirement, lets > imagine a formal memory type idea to help keep this a little > abstracted? > > DMA_MEMORY_TYPE_NORMAL > DMA_MEMORY_TYPE_P2P_NOT_ACS > DMA_MEMORY_TYPE_ENCRYPTED > DMA_MEMORY_TYPE_BOUNCE_BUFFER // ?? > > Then maybe the driver flow looks like: > > if (transaction.memory_type == DMA_MEMORY_TYPE_NORMAL && dma_api_has_iommu(dev)) { Add a nice helper to make this somewhat readable, but yes. > } else if (transaction.memory_type == DMA_MEMORY_TYPE_P2P_NOT_ACS) { > num_hwsgls = transcation.num_sgls; > for_each_range(transaction, range) { > hwsgl[i].addr = dma_api_p2p_not_acs_map(range.start_physical, range.length, p2p_memory_provider); > hwsgl[i].len = range.size; > } > } else { > /* Must be DMA_MEMORY_TYPE_NORMAL, DMA_MEMORY_TYPE_ENCRYPTED, DMA_MEMORY_TYPE_BOUNCE_BUFFER? */ > num_hwsgls = transcation.num_sgls; > for_each_range(transaction, range) { > hwsgl[i].addr = dma_api_map_cpu_page(range.start_page, range.length); > hwsgl[i].len = range.size; > } > And these two are really the same except that we call a different map helper underneath. So I think as far as the driver is concerned they should be the same, the DMA API just needs to key off the memory tap. > And the hmm_range_fault case is sort of like: > > struct dma_api_iommu_state state; > dma_api_iommu_start(&state, mr.num_pages); > > [..] > hmm_range_fault(...) > if (present) > dma_link_page(&state, faulting_address_offset, page); > else > dma_unlink_page(&state, faulting_address_offset, page); > > Is this looking closer? Yes. > > > So I take it as a requirement that RDMA MUST make single MR's out of a > > > hodgepodge of page types. RDMA MRs cannot be split. Multiple MR's are > > > not a functional replacement for a single MR. > > > > But MRs consolidate multiple dma addresses anyway. > > I'm not sure I understand this? The RDMA MRs take a a list of PFNish address, (or SGLs with the enhanced MRs from Mellanox) and give you back a single rkey/lkey. > To go back to my main thesis - I would like a high performance low > level DMA API that is capable enough that it could implement > scatterlist dma_map_sg() and thus also implement any future > scatterlist_v2, bio, hmm_range_fault or any other thing we come up > with on top of it. This is broadly what I thought we agreed to at LSF > last year. I think the biggest underlying problem of the scatterlist based DMA implementation for IOMMUs is that it's trying to handle to much, that is magic coalescing even if the segments boundaries don't align with the IOMMU page size. If we can get rid of that misfeature I think we'd greatly simply the API and implementation.
On Mon, Mar 25, 2024 at 12:22:15AM +0100, Christoph Hellwig wrote: > On Fri, Mar 22, 2024 at 03:43:30PM -0300, Jason Gunthorpe wrote: > > If we are going to make caller provided uniformity a requirement, lets > > imagine a formal memory type idea to help keep this a little > > abstracted? > > > > DMA_MEMORY_TYPE_NORMAL > > DMA_MEMORY_TYPE_P2P_NOT_ACS > > DMA_MEMORY_TYPE_ENCRYPTED > > DMA_MEMORY_TYPE_BOUNCE_BUFFER // ?? > > > > Then maybe the driver flow looks like: > > > > if (transaction.memory_type == DMA_MEMORY_TYPE_NORMAL && dma_api_has_iommu(dev)) { > > Add a nice helper to make this somewhat readable, but yes. > > > } else if (transaction.memory_type == DMA_MEMORY_TYPE_P2P_NOT_ACS) { > > num_hwsgls = transcation.num_sgls; > > for_each_range(transaction, range) { > > hwsgl[i].addr = dma_api_p2p_not_acs_map(range.start_physical, range.length, p2p_memory_provider); > > hwsgl[i].len = range.size; > > } > > } else { > > /* Must be DMA_MEMORY_TYPE_NORMAL, DMA_MEMORY_TYPE_ENCRYPTED, DMA_MEMORY_TYPE_BOUNCE_BUFFER? */ > > num_hwsgls = transcation.num_sgls; > > for_each_range(transaction, range) { > > hwsgl[i].addr = dma_api_map_cpu_page(range.start_page, range.length); > > hwsgl[i].len = range.size; > > } > > > > And these two are really the same except that we call a different map > helper underneath. So I think as far as the driver is concerned > they should be the same, the DMA API just needs to key off the > memory tap. Yeah.. If the caller is going to have compute the memory type of the range then lets pass it to the helper dma_api_map_memory_type(transaction.memory_type, range.start_page, range.length); Then we can just hide all the differences under the API without doing duplicated work. Function names need some work ... > > > > So I take it as a requirement that RDMA MUST make single MR's out of a > > > > hodgepodge of page types. RDMA MRs cannot be split. Multiple MR's are > > > > not a functional replacement for a single MR. > > > > > > But MRs consolidate multiple dma addresses anyway. > > > > I'm not sure I understand this? > > The RDMA MRs take a a list of PFNish address, (or SGLs with the > enhanced MRs from Mellanox) and give you back a single rkey/lkey. Yes, that is the desire. > > To go back to my main thesis - I would like a high performance low > > level DMA API that is capable enough that it could implement > > scatterlist dma_map_sg() and thus also implement any future > > scatterlist_v2, bio, hmm_range_fault or any other thing we come up > > with on top of it. This is broadly what I thought we agreed to at LSF > > last year. > > I think the biggest underlying problem of the scatterlist based > DMA implementation for IOMMUs is that it's trying to handle to much, > that is magic coalescing even if the segments boundaries don't align > with the IOMMU page size. If we can get rid of that misfeature I > think we'd greatly simply the API and implementation. Yeah, that stuff is not easy at all and takes extra computation to figure out. I always assumed it was there for block... Leon & Chaitanya will make a RFC v2 along these lines, lets see how it goes. Thanks, Jason
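[Editorial illustration: a rough sketch of the single memory-type keyed helper being converged on above. Every identifier below (the enum, dma_api_map_memory_type() and the two low-level calls) is a placeholder taken from this thread's pseudocode, not an existing kernel API.]

	/*
	 * Sketch only. All names here are placeholders from the discussion
	 * above; none of this exists in the kernel today.
	 */
	enum dma_memory_type {
		DMA_MEMORY_TYPE_NORMAL,
		DMA_MEMORY_TYPE_P2P_NOT_ACS,
		DMA_MEMORY_TYPE_ENCRYPTED,
		DMA_MEMORY_TYPE_BOUNCE_BUFFER,
	};

	struct p2p_memory_provider;

	/* Hypothetical low-level helpers from the pseudocode above */
	dma_addr_t dma_api_p2p_not_acs_map(phys_addr_t phys, size_t len,
					   struct p2p_memory_provider *provider);
	dma_addr_t dma_api_map_cpu_page(struct device *dev, struct page *page,
					size_t len, enum dma_data_direction dir);

	/*
	 * One entry point keyed off the caller-supplied memory type, so the
	 * non-IOMMU branches in the driver loop collapse into a single call
	 * and the differences between the types stay below the DMA API.
	 */
	static dma_addr_t dma_api_map_memory_type(struct device *dev,
						  enum dma_memory_type type,
						  struct page *page,
						  phys_addr_t phys, size_t len,
						  struct p2p_memory_provider *provider,
						  enum dma_data_direction dir)
	{
		if (type == DMA_MEMORY_TYPE_P2P_NOT_ACS)
			/* Peer-to-peer that bypasses the host bridge: use the
			 * peer's bus address, no IOMMU or SWIOTLB involved. */
			return dma_api_p2p_not_acs_map(phys, len, provider);

		/* NORMAL, ENCRYPTED and BOUNCE_BUFFER all go through the
		 * regular per-page path; bouncing/encryption handling lives
		 * inside the helper, not in the driver. */
		return dma_api_map_cpu_page(dev, page, len, dir);
	}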
在 2024/3/7 7:01, Zhu Yanjun 写道: > 在 2024/3/5 12:18, Leon Romanovsky 写道: >> This is complimentary part to the proposed LSF/MM topic. >> https://lore.kernel.org/linux-rdma/22df55f8-cf64-4aa8-8c0b-b556c867b926@linux.dev/T/#m85672c860539fdbbc8fe0f5ccabdc05b40269057 > > I am interested in this topic. Hope I can join the meeting to discuss > this topic. > With the same idea, in the IDPF driver, the function dma_alloc_coherent which is called in the IDPF driver can be devided into the following 2 functions: iommu_dma_alloc_pages and iommu_dma_map_page So the function iommu_dma_alloc_pages allocates pages, iommu_dma_map_page makes mapping between pages and IOVA. Now the above idea is implemented in the NIC driver. Currently it can work well. Next the above idea will be implemented in the block device. Hope this can increase the performance of the block device. Best Regards, Zhu Yanjun > Zhu Yanjun > >> >> This is posted as RFC to get a feedback on proposed split, but RDMA, >> VFIO and >> DMA patches are ready for review and inclusion, the NVMe patches are >> still in >> progress as they require agreement on API first. >> >> Thanks >> >> ------------------------------------------------------------------------------- >> The DMA mapping operation performs two steps at one same time: allocates >> IOVA space and actually maps DMA pages to that space. This one shot >> operation works perfectly for non-complex scenarios, where callers use >> that DMA API in control path when they setup hardware. >> >> However in more complex scenarios, when DMA mapping is needed in data >> path and especially when some sort of specific datatype is involved, >> such one shot approach has its drawbacks. >> >> That approach pushes developers to introduce new DMA APIs for specific >> datatype. For example existing scatter-gather mapping functions, or >> latest Chuck's RFC series to add biovec related DMA mapping [1] and >> probably struct folio will need it too. >> >> These advanced DMA mapping APIs are needed to calculate IOVA size to >> allocate it as one chunk and some sort of offset calculations to know >> which part of IOVA to map. >> >> Instead of teaching DMA to know these specific datatypes, let's separate >> existing DMA mapping routine to two steps and give an option to advanced >> callers (subsystems) perform all calculations internally in advance and >> map pages later when it is needed. >> >> In this series, three users are converted and each of such conversion >> presents different positive gain: >> 1. RDMA simplifies and speeds up its pagefault handling for >> on-demand-paging (ODP) mode. >> 2. VFIO PCI live migration code saves huge chunk of memory. >> 3. NVMe PCI avoids intermediate SG table manipulation and operates >> directly on BIOs. 
>> [remainder of quoted cover letter, patch list and diffstat trimmed]
Hi Leon, Jason > -----Original Message----- > From: Leon Romanovsky <leon@kernel.org> > Sent: Tuesday, March 5, 2024 6:19 AM > To: Christoph Hellwig <hch@lst.de>; Robin Murphy > <robin.murphy@arm.com>; Marek Szyprowski > <m.szyprowski@samsung.com>; Joerg Roedel <joro@8bytes.org>; Will > Deacon <will@kernel.org>; Jason Gunthorpe <jgg@ziepe.ca>; Chaitanya > Kulkarni <chaitanyak@nvidia.com> > Cc: Jonathan Corbet <corbet@lwn.net>; Jens Axboe <axboe@kernel.dk>; > Keith Busch <kbusch@kernel.org>; Sagi Grimberg <sagi@grimberg.me>; > Yishai Hadas <yishaih@nvidia.com>; Shameer Kolothum > <shameerali.kolothum.thodi@huawei.com>; Kevin Tian > <kevin.tian@intel.com>; Alex Williamson <alex.williamson@redhat.com>; > Jérôme Glisse <jglisse@redhat.com>; Andrew Morton <akpm@linux- > foundation.org>; linux-doc@vger.kernel.org; linux-kernel@vger.kernel.org; > linux-block@vger.kernel.org; linux-rdma@vger.kernel.org; > iommu@lists.linux.dev; linux-nvme@lists.infradead.org; > kvm@vger.kernel.org; linux-mm@kvack.org; Bart Van Assche > <bvanassche@acm.org>; Damien Le Moal > <damien.lemoal@opensource.wdc.com>; Amir Goldstein > <amir73il@gmail.com>; josef@toxicpanda.com; Martin K. Petersen > <martin.petersen@oracle.com>; daniel@iogearbox.net; Dan Williams > <dan.j.williams@intel.com>; jack@suse.com; Leon Romanovsky > <leonro@nvidia.com>; Zhu Yanjun <zyjzyj2000@gmail.com> > Subject: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two > steps > > This is complimentary part to the proposed LSF/MM topic. > https://lore.kernel.org/linux-rdma/22df55f8-cf64-4aa8-8c0b- > b556c867b926@linux.dev/T/#m85672c860539fdbbc8fe0f5ccabdc05b40269057 > > This is posted as RFC to get a feedback on proposed split, but RDMA, VFIO > and > DMA patches are ready for review and inclusion, the NVMe patches are still > in > progress as they require agreement on API first. > > Thanks > > ------------------------------------------------------------------------------- > The DMA mapping operation performs two steps at one same time: allocates > IOVA space and actually maps DMA pages to that space. This one shot > operation works perfectly for non-complex scenarios, where callers use > that DMA API in control path when they setup hardware. > > However in more complex scenarios, when DMA mapping is needed in data > path and especially when some sort of specific datatype is involved, > such one shot approach has its drawbacks. > > That approach pushes developers to introduce new DMA APIs for specific > datatype. For example existing scatter-gather mapping functions, or > latest Chuck's RFC series to add biovec related DMA mapping [1] and > probably struct folio will need it too. > > These advanced DMA mapping APIs are needed to calculate IOVA size to > allocate it as one chunk and some sort of offset calculations to know > which part of IOVA to map. > > Instead of teaching DMA to know these specific datatypes, let's separate > existing DMA mapping routine to two steps and give an option to advanced > callers (subsystems) perform all calculations internally in advance and > map pages later when it is needed. I looked into how this scheme can be applied to DRM subsystem and GPU drivers. I figured RDMA can apply this scheme because RDMA can calculate the iova size. Per my limited knowledge of rdma, user can register a memory region (the reg_user_mr vfunc) and memory region's sized is used to pre-allocate iova space. 
And in the RDMA use case, it seems the user registered region can be very big, e.g., 512MiB or even GiB In GPU driver, we have a few use cases where we need dma-mapping. Just name two: 1) userptr: it is user malloc'ed/mmap'ed memory and registers to gpu (in Intel's driver it is through a vm_bind api, similar to mmap). A userptr can be of any random size, depending on user malloc size. Today we use dma-map-sg for this use case. The down side of our approach is, during userptr invalidation, even if user only munmap partially of an userptr, we invalidate the whole userptr from gpu page table, because there is no way for us to partially dma-unmap the whole sg list. I think we can try your new API in this case. The main benefit of the new approach is the partial munmap case. We will have to pre-allocate iova for each userptr, and we have many userptrs of random size... So we might be not as efficient as RDMA case where I assume user register a few big memory regions. 2) system allocator: it is malloc'ed/mmap'ed memory be used for GPU program directly, without any other extra driver API call. We call this use case system allocator. For system allocator, driver have no knowledge of which virtual address range is valid in advance. So when GPU access a malloc'ed/mmap'ed address, we have a page fault. We then look up a CPU vma which contains the fault address. I guess we can use the CPU vma size to allocate the iova space of the same size? But there will be a true difficulty to apply your scheme to this use case. It is related to the STICKY flag. As I understand it, the sticky flag is designed for driver to mark "this page/pfn has been populated, no need to re-populate again", roughly...Unlike userptr and RDMA use cases where the backing store of a buffer is always in system memory, in the system allocator use case, the backing store can be changing b/t system memory and GPU's device private memory. Even worse, we have to assume the data migration b/t system and GPU is dynamic. When data is migrated to GPU, we don't need dma-map. And when migration happens to a pfn with STICKY flag, we still need to repopulate this pfn. So you can see, it is not easy to apply this scheme to this use case. At least I can't see an obvious way. Oak > > In this series, three users are converted and each of such conversion > presents different positive gain: > 1. RDMA simplifies and speeds up its pagefault handling for > on-demand-paging (ODP) mode. > 2. VFIO PCI live migration code saves huge chunk of memory. > 3. NVMe PCI avoids intermediate SG table manipulation and operates > directly on BIOs. 
> [remainder of quoted cover letter, patch list and diffstat trimmed]
On 03.05.24 01:32, Zeng, Oak wrote: > Hi Leon, Jason > >> -----Original Message----- >> From: Leon Romanovsky <leon@kernel.org> >> Sent: Tuesday, March 5, 2024 6:19 AM >> To: Christoph Hellwig <hch@lst.de>; Robin Murphy >> <robin.murphy@arm.com>; Marek Szyprowski >> <m.szyprowski@samsung.com>; Joerg Roedel <joro@8bytes.org>; Will >> Deacon <will@kernel.org>; Jason Gunthorpe <jgg@ziepe.ca>; Chaitanya >> Kulkarni <chaitanyak@nvidia.com> >> Cc: Jonathan Corbet <corbet@lwn.net>; Jens Axboe <axboe@kernel.dk>; >> Keith Busch <kbusch@kernel.org>; Sagi Grimberg <sagi@grimberg.me>; >> Yishai Hadas <yishaih@nvidia.com>; Shameer Kolothum >> <shameerali.kolothum.thodi@huawei.com>; Kevin Tian >> <kevin.tian@intel.com>; Alex Williamson <alex.williamson@redhat.com>; >> Jérôme Glisse <jglisse@redhat.com>; Andrew Morton <akpm@linux- >> foundation.org>; linux-doc@vger.kernel.org; linux-kernel@vger.kernel.org; >> linux-block@vger.kernel.org; linux-rdma@vger.kernel.org; >> iommu@lists.linux.dev; linux-nvme@lists.infradead.org; >> kvm@vger.kernel.org; linux-mm@kvack.org; Bart Van Assche >> <bvanassche@acm.org>; Damien Le Moal >> <damien.lemoal@opensource.wdc.com>; Amir Goldstein >> <amir73il@gmail.com>; josef@toxicpanda.com; Martin K. Petersen >> <martin.petersen@oracle.com>; daniel@iogearbox.net; Dan Williams >> <dan.j.williams@intel.com>; jack@suse.com; Leon Romanovsky >> <leonro@nvidia.com>; Zhu Yanjun <zyjzyj2000@gmail.com> >> Subject: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two >> steps >> >> This is complimentary part to the proposed LSF/MM topic. >> https://lore.kernel.org/linux-rdma/22df55f8-cf64-4aa8-8c0b- >> b556c867b926@linux.dev/T/#m85672c860539fdbbc8fe0f5ccabdc05b40269057 >> >> This is posted as RFC to get a feedback on proposed split, but RDMA, VFIO >> and >> DMA patches are ready for review and inclusion, the NVMe patches are still >> in >> progress as they require agreement on API first. >> >> Thanks >> >> ------------------------------------------------------------------------------- >> The DMA mapping operation performs two steps at one same time: allocates >> IOVA space and actually maps DMA pages to that space. This one shot >> operation works perfectly for non-complex scenarios, where callers use >> that DMA API in control path when they setup hardware. >> >> However in more complex scenarios, when DMA mapping is needed in data >> path and especially when some sort of specific datatype is involved, >> such one shot approach has its drawbacks. >> >> That approach pushes developers to introduce new DMA APIs for specific >> datatype. For example existing scatter-gather mapping functions, or >> latest Chuck's RFC series to add biovec related DMA mapping [1] and >> probably struct folio will need it too. >> >> These advanced DMA mapping APIs are needed to calculate IOVA size to >> allocate it as one chunk and some sort of offset calculations to know >> which part of IOVA to map. >> >> Instead of teaching DMA to know these specific datatypes, let's separate >> existing DMA mapping routine to two steps and give an option to advanced >> callers (subsystems) perform all calculations internally in advance and >> map pages later when it is needed. > I looked into how this scheme can be applied to DRM subsystem and GPU drivers. > > I figured RDMA can apply this scheme because RDMA can calculate the iova size. Per my limited knowledge of rdma, user can register a memory region (the reg_user_mr vfunc) and memory region's sized is used to pre-allocate iova space. 
And in the RDMA use case, it seems the user registered region can be very big, e.g., 512MiB or even GiB > > In GPU driver, we have a few use cases where we need dma-mapping. Just name two: > > 1) userptr: it is user malloc'ed/mmap'ed memory and registers to gpu (in Intel's driver it is through a vm_bind api, similar to mmap). A userptr can be of any random size, depending on user malloc size. Today we use dma-map-sg for this use case. The down side of our approach is, during userptr invalidation, even if user only munmap partially of an userptr, we invalidate the whole userptr from gpu page table, because there is no way for us to partially dma-unmap the whole sg list. I think we can try your new API in this case. The main benefit of the new approach is the partial munmap case. > > We will have to pre-allocate iova for each userptr, and we have many userptrs of random size... So we might be not as efficient as RDMA case where I assume user register a few big memory regions. > > 2) system allocator: it is malloc'ed/mmap'ed memory be used for GPU program directly, without any other extra driver API call. We call this use case system allocator. > > For system allocator, driver have no knowledge of which virtual address range is valid in advance. So when GPU access a malloc'ed/mmap'ed address, we have a page fault. We then look up a CPU vma which contains the fault address. I guess we can use the CPU vma size to allocate the iova space of the same size? > > But there will be a true difficulty to apply your scheme to this use case. It is related to the STICKY flag. As I understand it, the sticky flag is designed for driver to mark "this page/pfn has been populated, no need to re-populate again", roughly...Unlike userptr and RDMA use cases where the backing store of a buffer is always in system memory, in the system allocator use case, the backing store can be changing b/t system memory and GPU's device private memory. Even worse, we have to assume the data migration b/t system and GPU is dynamic. When data is migrated to GPU, we don't need dma-map. And when migration happens to a pfn with STICKY flag, we still need to repopulate this pfn. So you can see, it is not easy to apply this scheme to this use case. At least I can't see an obvious way. Not sure if GPU peer to peer dma mapping GPU memory for use can use this scheme or not. If I remember it correctly, Intel Gaudi GPU supports peer 2 peer dma mapping in GPU Direct RDMA. Not sure if this scheme can be applied in that place or not. Just my 2 cent suggestions. Zhu Yanjun > > > Oak > > >> In this series, three users are converted and each of such conversion >> presents different positive gain: >> 1. RDMA simplifies and speeds up its pagefault handling for >> on-demand-paging (ODP) mode. >> 2. VFIO PCI live migration code saves huge chunk of memory. >> 3. NVMe PCI avoids intermediate SG table manipulation and operates >> directly on BIOs. 
>> [remainder of quoted cover letter, patch list and diffstat trimmed]
On Thu, May 02, 2024 at 11:32:55PM +0000, Zeng, Oak wrote: > > Instead of teaching DMA to know these specific datatypes, let's separate > > existing DMA mapping routine to two steps and give an option to advanced > > callers (subsystems) perform all calculations internally in advance and > > map pages later when it is needed. > > I looked into how this scheme can be applied to DRM subsystem and GPU drivers. > > I figured RDMA can apply this scheme because RDMA can calculate the > iova size. Per my limited knowledge of rdma, user can register a > memory region (the reg_user_mr vfunc) and memory region's sized is > used to pre-allocate iova space. And in the RDMA use case, it seems > the user registered region can be very big, e.g., 512MiB or even GiB In RDMA the iova would be linked to the SVA granual we discussed previously. > In GPU driver, we have a few use cases where we need dma-mapping. Just name two: > > 1) userptr: it is user malloc'ed/mmap'ed memory and registers to gpu > (in Intel's driver it is through a vm_bind api, similar to mmap). A > userptr can be of any random size, depending on user malloc > size. Today we use dma-map-sg for this use case. The down side of > our approach is, during userptr invalidation, even if user only > munmap partially of an userptr, we invalidate the whole userptr from > gpu page table, because there is no way for us to partially > dma-unmap the whole sg list. I think we can try your new API in this > case. The main benefit of the new approach is the partial munmap > case. Yes, this is one of the main things it will improve. > We will have to pre-allocate iova for each userptr, and we have many > userptrs of random size... So we might be not as efficient as RDMA > case where I assume user register a few big memory regions. You are already doing this. dma_map_sg() does exactly the same IOVA allocation under the covers. > 2) system allocator: it is malloc'ed/mmap'ed memory be used for GPU > program directly, without any other extra driver API call. We call > this use case system allocator. > For system allocator, driver have no knowledge of which virtual > address range is valid in advance. So when GPU access a > malloc'ed/mmap'ed address, we have a page fault. We then look up a > CPU vma which contains the fault address. I guess we can use the CPU > vma size to allocate the iova space of the same size? No. You'd follow what we discussed in the other thread. If you do a full SVA then you'd split your MM space into granuals and when a fault hits a granual you'd allocate the IOVA for the whole granual. RDMA ODP is using a 512M granual currently. If you are doing sub ranges then you'd probably allocate the IOVA for the well defined sub range (assuming the typical use case isn't huge) > But there will be a true difficulty to apply your scheme to this use > case. It is related to the STICKY flag. As I understand it, the > sticky flag is designed for driver to mark "this page/pfn has been > populated, no need to re-populate again", roughly...Unlike userptr > and RDMA use cases where the backing store of a buffer is always in > system memory, in the system allocator use case, the backing store > can be changing b/t system memory and GPU's device private > memory. Even worse, we have to assume the data migration b/t system > and GPU is dynamic. When data is migrated to GPU, we don't need > dma-map. And when migration happens to a pfn with STICKY flag, we > still need to repopulate this pfn. 
So you can see, it is not easy to > apply this scheme to this use case. At least I can't see an obvious > way. You are already doing this today, you are keeping the sg list around until you unmap it. Instead of keeping the sg list you'd keep a much smaller datastructure per-granual. The sticky bit is simply a convient way for ODP to manage the smaller data structure, you don't have to use it. But you do need to keep track of what pages in the granual have been DMA mapped - sg list was doing this before. This could be a simple bitmap array matching the granual size. Looking (far) forward we may be able to have a "replace" API that allows installing a new page unconditionally regardless of what is already there. Jason
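[Editorial illustration: one minimal way to picture the per-granule tracking suggested above, assuming a fixed granule (e.g. 512M) and one bit per page. The structure and helper names are invented for the example; only the bitmap_*()/set_bit()/test_bit() calls are existing kernel APIs.]

	/*
	 * Minimal sketch of per-granule DMA-mapping state: a bitmap with one
	 * bit per page in the granule, replacing the sg list kept around
	 * today. Structure and helper names are assumptions for illustration.
	 */
	#include <linux/bitmap.h>
	#include <linux/slab.h>

	struct dma_granule {
		unsigned long	iova;		/* IOVA block covering the granule */
		unsigned long	npages;		/* e.g. SZ_512M >> PAGE_SHIFT */
		unsigned long	*mapped;	/* one bit per page: linked or not */
	};

	static struct dma_granule *dma_granule_alloc(unsigned long iova,
						     unsigned long npages)
	{
		struct dma_granule *g = kzalloc(sizeof(*g), GFP_KERNEL);

		if (!g)
			return NULL;
		g->iova = iova;
		g->npages = npages;
		g->mapped = bitmap_zalloc(npages, GFP_KERNEL);
		if (!g->mapped) {
			kfree(g);
			return NULL;
		}
		return g;
	}

	/* Record that page 'idx' was linked into the granule's IOVA range */
	static void dma_granule_mark_mapped(struct dma_granule *g, unsigned long idx)
	{
		set_bit(idx, g->mapped);
	}

	/* Check before unlink/relink whether page 'idx' is currently mapped */
	static bool dma_granule_is_mapped(struct dma_granule *g, unsigned long idx)
	{
		return test_bit(idx, g->mapped);
	}

	static void dma_granule_free(struct dma_granule *g)
	{
		bitmap_free(g->mapped);
		kfree(g);
	}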
> -----Original Message----- > From: Jason Gunthorpe <jgg@ziepe.ca> > Sent: Friday, May 3, 2024 12:43 PM > To: Zeng, Oak <oak.zeng@intel.com> > Cc: leon@kernel.org; Christoph Hellwig <hch@lst.de>; Robin Murphy > <robin.murphy@arm.com>; Marek Szyprowski > <m.szyprowski@samsung.com>; Joerg Roedel <joro@8bytes.org>; Will > Deacon <will@kernel.org>; Chaitanya Kulkarni <chaitanyak@nvidia.com>; > Brost, Matthew <matthew.brost@intel.com>; Hellstrom, Thomas > <thomas.hellstrom@intel.com>; Jonathan Corbet <corbet@lwn.net>; Jens > Axboe <axboe@kernel.dk>; Keith Busch <kbusch@kernel.org>; Sagi > Grimberg <sagi@grimberg.me>; Yishai Hadas <yishaih@nvidia.com>; > Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>; Tian, Kevin > <kevin.tian@intel.com>; Alex Williamson <alex.williamson@redhat.com>; > Jérôme Glisse <jglisse@redhat.com>; Andrew Morton <akpm@linux- > foundation.org>; linux-doc@vger.kernel.org; linux-kernel@vger.kernel.org; > linux-block@vger.kernel.org; linux-rdma@vger.kernel.org; > iommu@lists.linux.dev; linux-nvme@lists.infradead.org; > kvm@vger.kernel.org; linux-mm@kvack.org; Bart Van Assche > <bvanassche@acm.org>; Damien Le Moal > <damien.lemoal@opensource.wdc.com>; Amir Goldstein > <amir73il@gmail.com>; josef@toxicpanda.com; Martin K. Petersen > <martin.petersen@oracle.com>; daniel@iogearbox.net; Williams, Dan J > <dan.j.williams@intel.com>; jack@suse.com; Leon Romanovsky > <leonro@nvidia.com>; Zhu Yanjun <zyjzyj2000@gmail.com> > Subject: Re: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to > two steps > > On Thu, May 02, 2024 at 11:32:55PM +0000, Zeng, Oak wrote: > > > > Instead of teaching DMA to know these specific datatypes, let's separate > > > existing DMA mapping routine to two steps and give an option to > advanced > > > callers (subsystems) perform all calculations internally in advance and > > > map pages later when it is needed. > > > > I looked into how this scheme can be applied to DRM subsystem and GPU > drivers. > > > > I figured RDMA can apply this scheme because RDMA can calculate the > > iova size. Per my limited knowledge of rdma, user can register a > > memory region (the reg_user_mr vfunc) and memory region's sized is > > used to pre-allocate iova space. And in the RDMA use case, it seems > > the user registered region can be very big, e.g., 512MiB or even GiB > > In RDMA the iova would be linked to the SVA granual we discussed > previously. I need to learn more of this scheme. Let's say 512MiB granual... In a 57-bit virtual address machine, the use space can address space can be up to 56 bit (e.g., half-half split b/t kernel and user) So you would end up with 134,217, 728 sub-regions (2 to the power of 27), which is huge... Is that RDMA use a much smaller virtual address space? With 512MiB granual, do you fault-in or map 512MiB virtual address range to RDMA page table? E.g., when page fault happens at address A, do you fault-in the whole 512MiB region to RDMA page table? How do you make sure all addresses in this 512MiB region are valid virtual address? > > > In GPU driver, we have a few use cases where we need dma-mapping. Just > name two: > > > > 1) userptr: it is user malloc'ed/mmap'ed memory and registers to gpu > > (in Intel's driver it is through a vm_bind api, similar to mmap). A > > userptr can be of any random size, depending on user malloc > > size. Today we use dma-map-sg for this use case. 
The down side of > > our approach is, during userptr invalidation, even if user only > > munmap partially of an userptr, we invalidate the whole userptr from > > gpu page table, because there is no way for us to partially > > dma-unmap the whole sg list. I think we can try your new API in this > > case. The main benefit of the new approach is the partial munmap > > case. > > Yes, this is one of the main things it will improve. > > > We will have to pre-allocate iova for each userptr, and we have many > > userptrs of random size... So we might be not as efficient as RDMA > > case where I assume user register a few big memory regions. > > You are already doing this. dma_map_sg() does exactly the same IOVA > allocation under the covers. Sure. Then we can replace our _sg with your new DMA Api once it is merged. We will gain a benefit with a little more codes > > > 2) system allocator: it is malloc'ed/mmap'ed memory be used for GPU > > program directly, without any other extra driver API call. We call > > this use case system allocator. > > > For system allocator, driver have no knowledge of which virtual > > address range is valid in advance. So when GPU access a > > malloc'ed/mmap'ed address, we have a page fault. We then look up a > > CPU vma which contains the fault address. I guess we can use the CPU > > vma size to allocate the iova space of the same size? > > No. You'd follow what we discussed in the other thread. > > If you do a full SVA then you'd split your MM space into granuals and > when a fault hits a granual you'd allocate the IOVA for the whole > granual. RDMA ODP is using a 512M granual currently. Per system allocator requirement, we have to do full SVA (which means ANY valid CPU virtual address is a valid GPU virtual address). Per my above calculation, with 512M granual, we will end up a huge number of sub-regions.... > > If you are doing sub ranges then you'd probably allocate the IOVA for > the well defined sub range (assuming the typical use case isn't huge) Can you explain what is sub ranges? Is that device only mirror partially of the CPU virtual address space? How do we decide which part to mirror? > > > But there will be a true difficulty to apply your scheme to this use > > case. It is related to the STICKY flag. As I understand it, the > > sticky flag is designed for driver to mark "this page/pfn has been > > populated, no need to re-populate again", roughly...Unlike userptr > > and RDMA use cases where the backing store of a buffer is always in > > system memory, in the system allocator use case, the backing store > > can be changing b/t system memory and GPU's device private > > memory. Even worse, we have to assume the data migration b/t system > > and GPU is dynamic. When data is migrated to GPU, we don't need > > dma-map. And when migration happens to a pfn with STICKY flag, we > > still need to repopulate this pfn. So you can see, it is not easy to > > apply this scheme to this use case. At least I can't see an obvious > > way. > > You are already doing this today, you are keeping the sg list around > until you unmap it. > > Instead of keeping the sg list you'd keep a much smaller datastructure > per-granual. The sticky bit is simply a convient way for ODP to manage > the smaller data structure, you don't have to use it. > > But you do need to keep track of what pages in the granual have been > DMA mapped - sg list was doing this before. This could be a simple > bitmap array matching the granual size. Make sense. We can try once your API is ready. 
I still don't figure out the granular scheme. Please help with above questions. Thanks, Oak > > Looking (far) forward we may be able to have a "replace" API that > allows installing a new page unconditionally regardless of what is > already there. > > Jason
Hi Jason, Leon, I come back to this thread to ask a question. Per the discussion in another thread, I have integrated the new dma-mapping API (the first 6 patches of this series) to DRM subsystem. The new API seems fit pretty good to our purpose, better than scatter-gather dma-mapping. So we want to continue work with you to adopt this new API. Did you test the new API in RDMA subsystem? Or this RFC series was just some untested codes sending out to get people's design feedback? Do you have refined version for us to try? I ask because we are seeing some issues but not sure whether it is caused by the new API. We are debugging but it would be good to also ask at the same time. Cc Himal/Krishna who are also working/testing the new API. Thanks, Oak > -----Original Message----- > From: Jason Gunthorpe <jgg@ziepe.ca> > Sent: Friday, May 3, 2024 12:43 PM > To: Zeng, Oak <oak.zeng@intel.com> > Cc: leon@kernel.org; Christoph Hellwig <hch@lst.de>; Robin Murphy > <robin.murphy@arm.com>; Marek Szyprowski > <m.szyprowski@samsung.com>; Joerg Roedel <joro@8bytes.org>; Will > Deacon <will@kernel.org>; Chaitanya Kulkarni <chaitanyak@nvidia.com>; > Brost, Matthew <matthew.brost@intel.com>; Hellstrom, Thomas > <thomas.hellstrom@intel.com>; Jonathan Corbet <corbet@lwn.net>; Jens > Axboe <axboe@kernel.dk>; Keith Busch <kbusch@kernel.org>; Sagi > Grimberg <sagi@grimberg.me>; Yishai Hadas <yishaih@nvidia.com>; > Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>; Tian, Kevin > <kevin.tian@intel.com>; Alex Williamson <alex.williamson@redhat.com>; > Jérôme Glisse <jglisse@redhat.com>; Andrew Morton <akpm@linux- > foundation.org>; linux-doc@vger.kernel.org; linux-kernel@vger.kernel.org; > linux-block@vger.kernel.org; linux-rdma@vger.kernel.org; > iommu@lists.linux.dev; linux-nvme@lists.infradead.org; > kvm@vger.kernel.org; linux-mm@kvack.org; Bart Van Assche > <bvanassche@acm.org>; Damien Le Moal > <damien.lemoal@opensource.wdc.com>; Amir Goldstein > <amir73il@gmail.com>; josef@toxicpanda.com; Martin K. Petersen > <martin.petersen@oracle.com>; daniel@iogearbox.net; Williams, Dan J > <dan.j.williams@intel.com>; jack@suse.com; Leon Romanovsky > <leonro@nvidia.com>; Zhu Yanjun <zyjzyj2000@gmail.com> > Subject: Re: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to > two steps > > On Thu, May 02, 2024 at 11:32:55PM +0000, Zeng, Oak wrote: > > > > Instead of teaching DMA to know these specific datatypes, let's separate > > > existing DMA mapping routine to two steps and give an option to > advanced > > > callers (subsystems) perform all calculations internally in advance and > > > map pages later when it is needed. > > > > I looked into how this scheme can be applied to DRM subsystem and GPU > drivers. > > > > I figured RDMA can apply this scheme because RDMA can calculate the > > iova size. Per my limited knowledge of rdma, user can register a > > memory region (the reg_user_mr vfunc) and memory region's sized is > > used to pre-allocate iova space. And in the RDMA use case, it seems > > the user registered region can be very big, e.g., 512MiB or even GiB > > In RDMA the iova would be linked to the SVA granual we discussed > previously. > > > In GPU driver, we have a few use cases where we need dma-mapping. Just > name two: > > > > 1) userptr: it is user malloc'ed/mmap'ed memory and registers to gpu > > (in Intel's driver it is through a vm_bind api, similar to mmap). A > > userptr can be of any random size, depending on user malloc > > size. Today we use dma-map-sg for this use case. 
The down side of > > our approach is, during userptr invalidation, even if user only > > munmap partially of an userptr, we invalidate the whole userptr from > > gpu page table, because there is no way for us to partially > > dma-unmap the whole sg list. I think we can try your new API in this > > case. The main benefit of the new approach is the partial munmap > > case. > > Yes, this is one of the main things it will improve. > > > We will have to pre-allocate iova for each userptr, and we have many > > userptrs of random size... So we might be not as efficient as RDMA > > case where I assume user register a few big memory regions. > > You are already doing this. dma_map_sg() does exactly the same IOVA > allocation under the covers. > > > 2) system allocator: it is malloc'ed/mmap'ed memory be used for GPU > > program directly, without any other extra driver API call. We call > > this use case system allocator. > > > For system allocator, driver have no knowledge of which virtual > > address range is valid in advance. So when GPU access a > > malloc'ed/mmap'ed address, we have a page fault. We then look up a > > CPU vma which contains the fault address. I guess we can use the CPU > > vma size to allocate the iova space of the same size? > > No. You'd follow what we discussed in the other thread. > > If you do a full SVA then you'd split your MM space into granuals and > when a fault hits a granual you'd allocate the IOVA for the whole > granual. RDMA ODP is using a 512M granual currently. > > If you are doing sub ranges then you'd probably allocate the IOVA for > the well defined sub range (assuming the typical use case isn't huge) > > > But there will be a true difficulty to apply your scheme to this use > > case. It is related to the STICKY flag. As I understand it, the > > sticky flag is designed for driver to mark "this page/pfn has been > > populated, no need to re-populate again", roughly...Unlike userptr > > and RDMA use cases where the backing store of a buffer is always in > > system memory, in the system allocator use case, the backing store > > can be changing b/t system memory and GPU's device private > > memory. Even worse, we have to assume the data migration b/t system > > and GPU is dynamic. When data is migrated to GPU, we don't need > > dma-map. And when migration happens to a pfn with STICKY flag, we > > still need to repopulate this pfn. So you can see, it is not easy to > > apply this scheme to this use case. At least I can't see an obvious > > way. > > You are already doing this today, you are keeping the sg list around > until you unmap it. > > Instead of keeping the sg list you'd keep a much smaller datastructure > per-granual. The sticky bit is simply a convient way for ODP to manage > the smaller data structure, you don't have to use it. > > But you do need to keep track of what pages in the granual have been > DMA mapped - sg list was doing this before. This could be a simple > bitmap array matching the granual size. > > Looking (far) forward we may be able to have a "replace" API that > allows installing a new page unconditionally regardless of what is > already there. > > Jason
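As a rough sketch of the per-granule bookkeeping Jason describes above (the 512M granule and 4K page sizes come from the discussion; the structure and helper names are hypothetical illustrations, not code from the series):

#include <stdbool.h>
#include <stdint.h>

/* Sizes taken from the discussion: 512M granule, 4K pages. */
#define GRANULE_SHIFT		29
#define GR_PAGE_SHIFT		12
#define PAGES_PER_GRANULE	(1UL << (GRANULE_SHIFT - GR_PAGE_SHIFT))
#define BITS_PER_LONG_		(8 * sizeof(unsigned long))
#define GRANULE_BITMAP_LONGS	(PAGES_PER_GRANULE / BITS_PER_LONG_)

/*
 * Hypothetical per-granule state that replaces the per-buffer sg list:
 * one preallocated IOVA base plus a bitmap of which 4K pages inside the
 * granule are currently DMA mapped.
 */
struct granule_state {
	uint64_t iova_base;
	unsigned long mapped[GRANULE_BITMAP_LONGS];
};

static inline bool granule_page_mapped(const struct granule_state *g, unsigned long idx)
{
	return g->mapped[idx / BITS_PER_LONG_] & (1UL << (idx % BITS_PER_LONG_));
}

static inline void granule_set_mapped(struct granule_state *g, unsigned long idx)
{
	g->mapped[idx / BITS_PER_LONG_] |= 1UL << (idx % BITS_PER_LONG_);
}

/*
 * On migration to device memory or a partial munmap, clear the bits so a
 * later fault re-links just those pages; no sticky flag is required.
 */
static inline void granule_clear_range(struct granule_state *g,
				       unsigned long first, unsigned long nr)
{
	for (; nr; nr--, first++)
		g->mapped[first / BITS_PER_LONG_] &= ~(1UL << (first % BITS_PER_LONG_));
}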
On 10.06.24 17:12, Zeng, Oak wrote:
> Hi Jason, Leon,
>
> I come back to this thread to ask a question. Per the discussion in another thread, I have integrated the new dma-mapping API (the first 6 patches of this series) to DRM subsystem. The new API seems fit pretty good to our purpose, better than scatter-gather dma-mapping. So we want to continue work with you to adopt this new API.
>
> Did you test the new API in RDMA subsystem? Or this RFC series was just some untested codes sending out to get people's design feedback? Do you have refined version for us to try? I ask because we are seeing some issues but not sure whether it is caused by the new API. We are debugging but it would be good to also ask at the same time.

Hi, Zeng

I have tested this patch series. The NVMe patch causes a call trace, but if you revert that one patch, the rest of the series works well. You can develop your patches on top of this series.

It seems that agreement has not yet been reached on the NVMe patch, so it does not work yet; I have not looked into it in detail.

Zhu Yanjun

> Cc Himal/Krishna who are also working/testing the new API.
>
> Thanks,
> Oak
On Mon, Jun 10, 2024 at 03:12:25PM +0000, Zeng, Oak wrote:
> Hi Jason, Leon,
>
> I come back to this thread to ask a question. Per the discussion in another thread, I have integrated the new dma-mapping API (the first 6 patches of this series) to DRM subsystem. The new API seems fit pretty good to our purpose, better than scatter-gather dma-mapping. So we want to continue work with you to adopt this new API.

Sounds great, thanks for the feedback.

> Did you test the new API in RDMA subsystem?

This version was tested in our regression tests, but there is a chance
that you are hitting flows that were not relevant for the RDMA case.

> Or this RFC series was just some untested codes sending out to get people's design feedback?

The RFC was fully tested in the VFIO and RDMA paths, but not the NVMe patch.

> Do you have refined version for us to try? I ask because we are seeing some issues but not sure whether it is caused by the new API. We are debugging but it would be good to also ask at the same time.

Yes, as an outcome of the feedback in this thread, I implemented a new
version. Unfortunately, some personal matters are preventing me from
sending it right away.
https://git.kernel.org/pub/scm/linux/kernel/git/leon/linux-rdma.git/log/?h=dma-split-v1

There are some differences in the API, but the main idea is the same.
This version is not fully tested yet.

Thanks

> Cc Himal/Krishna who are also working/testing the new API.
>
> Thanks,
> Oak
Thanks Leon and Yanjun for the reply!

Based on the reply, we will continue to use the current version for testing (as it is tested for vfio and rdma). We will switch to v1 once it is fully tested/reviewed.

Thanks,
Oak
On Mon, Jun 10, 2024 at 04:40:19PM +0000, Zeng, Oak wrote:
> Thanks Leon and Yanjun for the reply!
>
> Based on the reply, we will continue to use the current version for
> testing (as it is tested for vfio and rdma). We will switch to v1 once
> it is fully tested/reviewed.

I'm glad you are finding it useful; one of my interests with this work
is to improve all the HMM users.

Jason
Hi Jason, Leon,

I was able to fix the issue on my side. Things work fine now. I have two questions though:

1) The value returned from the dma_link_range function is not contiguous, see the print below. The "linked pa" is the function return value. I think the dma_map_sgtable API would return contiguous dma addresses. Is the dma_map_sgtable API more efficient with respect to the iommu page table, i.e., does it try to use a bigger page size, such as a 2M page size, when possible? Does your new API have the same consideration? I vaguely remember Jason mentioned such a thing, but my print below doesn't look like it. Maybe I need to test a bigger range (only a 16-page range in the test below). Comments?

[17584.665126] drm_svm_hmmptr_map_dma_pages iova.dma_addr = 0x0, linked pa = 18ef3f000
[17584.665146] drm_svm_hmmptr_map_dma_pages iova.dma_addr = 0x0, linked pa = 190d00000
[17584.665150] drm_svm_hmmptr_map_dma_pages iova.dma_addr = 0x0, linked pa = 190024000
[17584.665153] drm_svm_hmmptr_map_dma_pages iova.dma_addr = 0x0, linked pa = 178e89000

2) The comment on the dma_link_range function says: "@dma_offset needs to be advanced by the caller with the size of previous page that was linked + DMA address returned for the previous page". Is this description correct? I don't understand the part "+ DMA address returned for the previous page". In my code, say I call this function to link 10 pages: the first dma_offset is 0, the second 4k, the third 8k. This worked for me; I didn't add the previously returned dma address. Maybe I need more testing, but any comment?

Thanks,
Oak
On 10.06.24 23:28, Zeng, Oak wrote:
> Hi Jason, Leon,
>
> I was able to fix the issue on my side. Things work fine now.

Can you enlarge the dma list and then run tests with fio? I am not sure
whether the performance is better or not.

Thanks,
Zhu Yanjun
On Mon, Jun 10, 2024 at 04:40:19PM +0000, Zeng, Oak wrote:
> Thanks Leon and Yanjun for the reply!
>
> Based on the reply, we will continue to use the current version for testing (as it is tested for vfio and rdma). We will switch to v1 once it is fully tested/reviewed.

Sounds good. If v0 fits your need, v1 will fit it too. From the HMM
perspective, the change between them is minimal. In v0 I called
dma_link_page() here, and now it is called dma_hmm_link_page().
https://git.kernel.org/pub/scm/linux/kernel/git/leon/linux-rdma.git/diff/drivers/infiniband/hw/mlx5/odp.c?h=dma-split-v1&id=a0d719a406133cdc3ef2328dda3ef082a034c45e

> Thanks,
> Oak
On Mon, Jun 10, 2024 at 09:28:04PM +0000, Zeng, Oak wrote:
> Hi Jason, Leon,
>
> I was able to fix the issue on my side. Things work fine now. I have two questions though:
>
> 1) The value returned from the dma_link_range function is not contiguous, see the print below. The "linked pa" is the function return value. I think the dma_map_sgtable API would return contiguous dma addresses. Is the dma_map_sgtable API more efficient with respect to the iommu page table, i.e., does it try to use a bigger page size, such as a 2M page size, when possible? Does your new API have the same consideration? I vaguely remember Jason mentioned such a thing, but my print below doesn't look like it. Maybe I need to test a bigger range (only a 16-page range in the test below). Comments?

My API gives you the flexibility to use any page size you want. You can
use 2M pages instead of 4K pages. The API doesn't enforce any page size.

> [17584.665126] drm_svm_hmmptr_map_dma_pages iova.dma_addr = 0x0, linked pa = 18ef3f000
> [17584.665146] drm_svm_hmmptr_map_dma_pages iova.dma_addr = 0x0, linked pa = 190d00000
> [17584.665150] drm_svm_hmmptr_map_dma_pages iova.dma_addr = 0x0, linked pa = 190024000
> [17584.665153] drm_svm_hmmptr_map_dma_pages iova.dma_addr = 0x0, linked pa = 178e89000
>
> 2) The comment on the dma_link_range function says: "@dma_offset needs to be advanced by the caller with the size of previous page that was linked + DMA address returned for the previous page". Is this description correct? I don't understand the part "+ DMA address returned for the previous page". In my code, say I call this function to link 10 pages: the first dma_offset is 0, the second 4k, the third 8k. This worked for me; I didn't add the previously returned dma address. Maybe I need more testing, but any comment?

You did it perfectly right. This is the correct way to advance dma_offset.

Thanks
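The calling pattern Oak describes and Leon confirms looks roughly like the sketch below. This is an illustration only: the dma_alloc_iova()/dma_link_range() signatures and the dma_iova_attrs fields are assumed from the discussion in this thread, not copied from the patches, and error unwinding is omitted.

#include <linux/dma-mapping.h>	/* dma_mapping_error(); the RFC's two-step API is assumed here */

/*
 * Sketch only: dma_alloc_iova()/dma_link_range() below use signatures
 * assumed from this thread (RFC v0 naming); the real series may differ.
 */
static int sketch_link_pages(struct device *dev, struct dma_iova_attrs *iova,
			     struct page **pages, unsigned long npages)
{
	unsigned long i;
	dma_addr_t addr;

	/* Step 1: one IOVA allocation covering the whole range. */
	iova->dev = dev;
	iova->size = npages << PAGE_SHIFT;
	if (dma_alloc_iova(iova))
		return -ENOMEM;

	/*
	 * Step 2: link page by page. As Oak did, dma_offset starts at 0 and
	 * advances by the size of the previously linked page: 0, 4K, 8K, ...
	 */
	for (i = 0; i < npages; i++) {
		addr = dma_link_range(iova, page_to_phys(pages[i]),
				      i << PAGE_SHIFT);
		if (dma_mapping_error(dev, addr))
			return -EIO;
		/* addr is what the caller programs into its page table. */
	}
	return 0;
}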
Thank you Leon. That is helpful.

I also have another very naïve question: I don't understand what the iova address is. I previously thought that the iova address space is the same as the dma_address space when an iommu is involved: dma_alloc_iova would allocate a contiguous iova address range, and later the dma_link_range function would link a physical page to that iova address and return the iova address. In other words, I thought the dma_address is the iova address, and the iommu page table translates a dma_address (iova address) to a physical address.

But from my print earlier in the thread, that understanding is obviously wrong: iova.dma_addr is 0 and the dma_address returned from dma_link_range is non-zero... Can you help me understand what the iova address is? Is the iova address iommu related? Since dma_link_range returns a non-iova address, does this function allocate the dma address itself? Is the dma address correlated with the iova address?

Oak
On Tue, Jun 11, 2024 at 06:26:23PM +0000, Zeng, Oak wrote:
> Thank you Leon. That is helpful.
>
> I also have another very naïve question. I don't understand what the iova address is. I previously thought the iova address space is the same as the dma_address space when an iommu is involved: dma_alloc_iova would allocate some contiguous iova address range, and the later dma_link_range call would link a physical page to that iova address and return the iova address. In other words, I thought the dma_address is the iova address, and the iommu page table translates a dma_address (iova address) to a physical address.

This is the right understanding.

> But from my print below, that understanding is obviously wrong: iova.dma_addr is 0 and the dma_address returned from dma_link_range is non-zero... Can you help me understand what the iova address is? Is the iova address iommu related? Since dma_link_range returns a non-iova address, does this function allocate the dma address itself? Is the dma address correlated with the iova address?

This is a combination of two things:
1. The need to support HMM-specific logic.
2. The outcome of the v0 version, where I implemented dma_link_range() to fall back to DMA direct mode. See patches 2 and 3:
https://lore.kernel.org/all/54a3554639bfb963c9919c5d7c1f449021bebdb3.1709635535.git.leon@kernel.org/
https://lore.kernel.org/all/f1049f0fc280288ae2f0c1e02388cde91b0f7876.1709635535.git.leon@kernel.org/

So dma-iova == 0 means that you are working in direct mode and not with an IOMMU, i.e. you can translate from a physical address to a DMA address with a simple call to phys_to_dma().

One of the review comments was that this is not the desired behaviour and that I need to create separate functions which are used only when an IOMMU is present. See the difference between v0 and v1 of the dma_link_range() function.

v0: https://lore.kernel.org/all/f1049f0fc280288ae2f0c1e02388cde91b0f7876.1709635535.git.leon@kernel.org/
v1: https://git.kernel.org/pub/scm/linux/kernel/git/leon/linux-rdma.git/commit/?h=dma-split-v1&id=5aa29f2620ef86ac58c17a0297929a0b9e8d7790

And there is an HMM variant of the dma_link_range() function, which saves you from copying the same HMM logic from RDMA to DRM:
https://git.kernel.org/pub/scm/linux/kernel/git/leon/linux-rdma.git/commit/?h=dma-split-v1&id=4d8d8d4fbe7891b1412f03ecaff88bc492e2e4eb

Thanks
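To make the direct-mode point above concrete, the sketch below shows the conceptual effect of the v0 fallback: a zero iova.dma_addr means no IOVA range was allocated (no IOMMU in the picture), so the DMA address handed back is just the physical address translated for the device, which is why the prints earlier in the thread show iova.dma_addr == 0 together with non-contiguous "linked pa" values. Apart from phys_to_dma() and page_to_phys(), the names and fields used here are assumptions, not the real implementation.

#include <linux/dma-direct.h>	/* phys_to_dma() */
#include <linux/mm.h>

/*
 * Conceptual sketch of the v0 fallback behaviour described above; the
 * struct dma_iova_attrs fields are assumed, and this is not the real code.
 */
static dma_addr_t sketch_link_one_page(struct dma_iova_attrs *iova,
				       struct page *page, dma_addr_t dma_offset)
{
	if (!iova->dma_addr) {
		/*
		 * DMA direct mode: no IOVA was allocated, so the DMA
		 * address is the page's physical address translated for
		 * the device, and consecutive pages need not be
		 * contiguous in DMA space.
		 */
		return phys_to_dma(iova->dev, page_to_phys(page));
	}

	/*
	 * IOMMU mode: the page is mapped at iova->dma_addr + dma_offset,
	 * so the returned DMA addresses are contiguous by construction.
	 */
	return iova->dma_addr + dma_offset;
}

In v1 this implicit fallback is dropped: the IOVA-based link/unlink helpers are intended to be used only when an IOMMU is actually present, which is the difference the v0/v1 links above point at.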