Message ID | 20230612130256.4572-5-linyunsheng@huawei.com (mailing list archive) |
---|---|
State | Not Applicable |
Headers | show |
Series | [net-next,v4,1/5] page_pool: frag API support for 32-bit arch with 64-bit DMA | expand |
On Mon, 12 Jun 2023 21:02:55 +0800 Yunsheng Lin wrote: > struct page_pool_params pp_params = { > - .flags = PP_FLAG_DMA_MAP | PP_FLAG_PAGE_FRAG | > - PP_FLAG_DMA_SYNC_DEV, > + .flags = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV, > .order = hns3_page_order(ring), Does hns3_page_order() set a good example for the users? static inline unsigned int hns3_page_order(struct hns3_enet_ring *ring) { #if (PAGE_SIZE < 8192) if (ring->buf_size > (PAGE_SIZE / 2)) return 1; #endif return 0; } Why allocate order 1 pages for buffers which would fit in a single page? I feel like this soft of heuristic should be built into the API itself.
On 2023/6/15 1:19, Jakub Kicinski wrote: > On Mon, 12 Jun 2023 21:02:55 +0800 Yunsheng Lin wrote: >> struct page_pool_params pp_params = { >> - .flags = PP_FLAG_DMA_MAP | PP_FLAG_PAGE_FRAG | >> - PP_FLAG_DMA_SYNC_DEV, >> + .flags = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV, >> .order = hns3_page_order(ring), > > Does hns3_page_order() set a good example for the users? > > static inline unsigned int hns3_page_order(struct hns3_enet_ring *ring) > { > #if (PAGE_SIZE < 8192) > if (ring->buf_size > (PAGE_SIZE / 2)) > return 1; > #endif > return 0; > } > > Why allocate order 1 pages for buffers which would fit in a single page? > I feel like this soft of heuristic should be built into the API itself. hns3 only support fixed buf size per desc by 512 byte, 1024 bytes, 2048 bytes 4096 bytes, see hns3_buf_size2type(), I think the order 1 pages is for buf size with 4096 bytes and system page size with 4K, as hns3 driver still support the per-desc ping-pong way of page splitting when page_pool_enabled is false. With page pool enabled, you are right that order 0 pages is enough, and I am not sure about the exact reason we use the some order as the ping-pong way of page splitting now. As 2048 bytes buf size seems to be the default one, and I has not heard any one changing it. Also, it caculates the pool_size using something as below, so the memory usage is almost the same for order 0 and order 1: .pool_size = ring->desc_num * hns3_buf_size(ring) / (PAGE_SIZE << hns3_page_order(ring)), I am not sure it worth changing it, maybe just change it to set good example for the users:) anyway I need to discuss this with other colleague internally and do some testing before doing the change. > . >
From: Jakub Kicinski <kuba@kernel.org> Date: Wed, 14 Jun 2023 10:19:54 -0700 > On Mon, 12 Jun 2023 21:02:55 +0800 Yunsheng Lin wrote: >> struct page_pool_params pp_params = { >> - .flags = PP_FLAG_DMA_MAP | PP_FLAG_PAGE_FRAG | >> - PP_FLAG_DMA_SYNC_DEV, >> + .flags = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV, >> .order = hns3_page_order(ring), > > Does hns3_page_order() set a good example for the users? > > static inline unsigned int hns3_page_order(struct hns3_enet_ring *ring) > { > #if (PAGE_SIZE < 8192) > if (ring->buf_size > (PAGE_SIZE / 2)) > return 1; > #endif > return 0; > } Oh lol, just what Intel drivers do. They don't have a pool to keep some bunch of pages (they can recycle a page only within its buffer), so in order to still recycle them, they allocate order-1 pages to be able to flip the halves >_< > > Why allocate order 1 pages for buffers which would fit in a single page? > I feel like this soft of heuristic should be built into the API itself. Offtop: I tested this series with IAVF: very little perf regression* (almost stddev) comparing to just 1-page-per-frame Page Pool series, but 21 Mb less RAM taken comparing to both "old" PP series and baseline, nice :D (+Cc David Christensen, he'll be glad to hear we're stopping eating 64Kb pages) * this might be caused by that in the previous version I was hardcoding truesize, but now it depends on what page_pool_alloc() returns. Same for Rx offset: it was always 0 previously, as every frame was placed at the start of page, now depends on how PP places** it. With MTU of 1500 and no XDP, two frames fit into one 4k page. With XDP on (increased headroom) or increased MTU, PP starts effectively do 1-frame-per-page with literally no changes in performance (increased RAM usage obviously -- I mean, it gets restored to the baseline numbers). ** BTW, instead of 2048 + 2048, I'm getting 1920 + 2176. Maybe the stack would be happier to see more consistent truesize for cache purposes. I'll try to play with it. Thanks, Olek
On Thu, 15 Jun 2023 15:17:39 +0800 Yunsheng Lin wrote: > > Does hns3_page_order() set a good example for the users? > > > > static inline unsigned int hns3_page_order(struct hns3_enet_ring *ring) > > { > > #if (PAGE_SIZE < 8192) > > if (ring->buf_size > (PAGE_SIZE / 2)) > > return 1; > > #endif > > return 0; > > } > > > > Why allocate order 1 pages for buffers which would fit in a single page? > > I feel like this soft of heuristic should be built into the API itself. > > hns3 only support fixed buf size per desc by 512 byte, 1024 bytes, 2048 bytes > 4096 bytes, see hns3_buf_size2type(), I think the order 1 pages is for buf size > with 4096 bytes and system page size with 4K, as hns3 driver still support the > per-desc ping-pong way of page splitting when page_pool_enabled is false. > > With page pool enabled, you are right that order 0 pages is enough, and I am not > sure about the exact reason we use the some order as the ping-pong way of page > splitting now. > As 2048 bytes buf size seems to be the default one, and I has not heard any one > changing it. Also, it caculates the pool_size using something as below, so the > memory usage is almost the same for order 0 and order 1: > > .pool_size = ring->desc_num * hns3_buf_size(ring) / > (PAGE_SIZE << hns3_page_order(ring)), > > I am not sure it worth changing it, maybe just change it to set good example for > the users:) anyway I need to discuss this with other colleague internally and do > some testing before doing the change. Right, I think this may be a leftover from the page flipping mode of operation. But AFAIU we should leave the recycling fully to the page pool now. If we make any improvements try to make them at the page pool level. I like your patches as they isolate the drivers from having to make the fragmentation decisions based on the system page size (4k vs 64k but we're hearing more and more about ARM w/ 16k pages). For that use case this is great. What we don't want is drivers to start requesting larger page sizes because it looks good in iperf on a freshly booted, idle system :(
On Thu, Jun 15, 2023 at 9:51 AM Jakub Kicinski <kuba@kernel.org> wrote: > > On Thu, 15 Jun 2023 15:17:39 +0800 Yunsheng Lin wrote: > > > Does hns3_page_order() set a good example for the users? > > > > > > static inline unsigned int hns3_page_order(struct hns3_enet_ring *ring) > > > { > > > #if (PAGE_SIZE < 8192) > > > if (ring->buf_size > (PAGE_SIZE / 2)) > > > return 1; > > > #endif > > > return 0; > > > } > > > > > > Why allocate order 1 pages for buffers which would fit in a single page? > > > I feel like this soft of heuristic should be built into the API itself. > > > > hns3 only support fixed buf size per desc by 512 byte, 1024 bytes, 2048 bytes > > 4096 bytes, see hns3_buf_size2type(), I think the order 1 pages is for buf size > > with 4096 bytes and system page size with 4K, as hns3 driver still support the > > per-desc ping-pong way of page splitting when page_pool_enabled is false. > > > > With page pool enabled, you are right that order 0 pages is enough, and I am not > > sure about the exact reason we use the some order as the ping-pong way of page > > splitting now. > > As 2048 bytes buf size seems to be the default one, and I has not heard any one > > changing it. Also, it caculates the pool_size using something as below, so the > > memory usage is almost the same for order 0 and order 1: > > > > .pool_size = ring->desc_num * hns3_buf_size(ring) / > > (PAGE_SIZE << hns3_page_order(ring)), > > > > I am not sure it worth changing it, maybe just change it to set good example for > > the users:) anyway I need to discuss this with other colleague internally and do > > some testing before doing the change. > > Right, I think this may be a leftover from the page flipping mode of > operation. But AFAIU we should leave the recycling fully to the page > pool now. If we make any improvements try to make them at the page pool > level. > > I like your patches as they isolate the drivers from having to make the > fragmentation decisions based on the system page size (4k vs 64k but > we're hearing more and more about ARM w/ 16k pages). For that use case > this is great. > > What we don't want is drivers to start requesting larger page sizes > because it looks good in iperf on a freshly booted, idle system :( Actually that would be a really good direction for this patch set to look at going into. Rather than having us always allocate a "page" it would make sense for most drivers to allocate a 4K fragment or the like in the case that the base page size is larger than 4K. That might be a good use case to justify doing away with the standard page pool page and look at making them all fragmented. In the case of the standard page size being 4K a standard page would just have to take on the CPU overhead of the atomic_set and atomic_read for pp_ref_count (new name) which should be minimal as on most sane systems those just end up being a memory write and read.
On 2023/6/16 2:26, Alexander Duyck wrote: > On Thu, Jun 15, 2023 at 9:51 AM Jakub Kicinski <kuba@kernel.org> wrote: >> >> On Thu, 15 Jun 2023 15:17:39 +0800 Yunsheng Lin wrote: >>>> Does hns3_page_order() set a good example for the users? >>>> >>>> static inline unsigned int hns3_page_order(struct hns3_enet_ring *ring) >>>> { >>>> #if (PAGE_SIZE < 8192) >>>> if (ring->buf_size > (PAGE_SIZE / 2)) >>>> return 1; >>>> #endif >>>> return 0; >>>> } >>>> >>>> Why allocate order 1 pages for buffers which would fit in a single page? >>>> I feel like this soft of heuristic should be built into the API itself. >>> >>> hns3 only support fixed buf size per desc by 512 byte, 1024 bytes, 2048 bytes >>> 4096 bytes, see hns3_buf_size2type(), I think the order 1 pages is for buf size >>> with 4096 bytes and system page size with 4K, as hns3 driver still support the >>> per-desc ping-pong way of page splitting when page_pool_enabled is false. >>> >>> With page pool enabled, you are right that order 0 pages is enough, and I am not >>> sure about the exact reason we use the some order as the ping-pong way of page >>> splitting now. >>> As 2048 bytes buf size seems to be the default one, and I has not heard any one >>> changing it. Also, it caculates the pool_size using something as below, so the >>> memory usage is almost the same for order 0 and order 1: >>> >>> .pool_size = ring->desc_num * hns3_buf_size(ring) / >>> (PAGE_SIZE << hns3_page_order(ring)), >>> >>> I am not sure it worth changing it, maybe just change it to set good example for >>> the users:) anyway I need to discuss this with other colleague internally and do >>> some testing before doing the change. >> >> Right, I think this may be a leftover from the page flipping mode of >> operation. But AFAIU we should leave the recycling fully to the page >> pool now. If we make any improvements try to make them at the page pool >> level. I checked, the per-desc buf with 4096 bytes for hnse does not seem to be used mainly because of the larger memory usage you mentioned below. >> >> I like your patches as they isolate the drivers from having to make the >> fragmentation decisions based on the system page size (4k vs 64k but >> we're hearing more and more about ARM w/ 16k pages). For that use case >> this is great. Yes, That is my point. For hw case, the page splitting in page pool is mainly to enble multi-descs to use the same page as my understanding. >> >> What we don't want is drivers to start requesting larger page sizes >> because it looks good in iperf on a freshly booted, idle system :( > > Actually that would be a really good direction for this patch set to > look at going into. Rather than having us always allocate a "page" it > would make sense for most drivers to allocate a 4K fragment or the > like in the case that the base page size is larger than 4K. That might > be a good use case to justify doing away with the standard page pool > page and look at making them all fragmented. I am not sure if I understand the above, isn't the frag API able to support allocating a 4K fragment when base page size is larger than 4K before or after this patch? what more do we need to do? > > In the case of the standard page size being 4K a standard page would > just have to take on the CPU overhead of the atomic_set and > atomic_read for pp_ref_count (new name) which should be minimal as on > most sane systems those just end up being a memory write and read. If I understand you correctly, I think what you are trying to do may break some of Jesper' benchmarking:) [1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/bench_page_pool_simple.c > . >
On Fri, Jun 16, 2023 at 5:21 AM Yunsheng Lin <linyunsheng@huawei.com> wrote: > > On 2023/6/16 2:26, Alexander Duyck wrote: > > On Thu, Jun 15, 2023 at 9:51 AM Jakub Kicinski <kuba@kernel.org> wrote: > >> > >> On Thu, 15 Jun 2023 15:17:39 +0800 Yunsheng Lin wrote: > >>>> Does hns3_page_order() set a good example for the users? > >>>> > >>>> static inline unsigned int hns3_page_order(struct hns3_enet_ring *ring) > >>>> { > >>>> #if (PAGE_SIZE < 8192) > >>>> if (ring->buf_size > (PAGE_SIZE / 2)) > >>>> return 1; > >>>> #endif > >>>> return 0; > >>>> } > >>>> > >>>> Why allocate order 1 pages for buffers which would fit in a single page? > >>>> I feel like this soft of heuristic should be built into the API itself. > >>> > >>> hns3 only support fixed buf size per desc by 512 byte, 1024 bytes, 2048 bytes > >>> 4096 bytes, see hns3_buf_size2type(), I think the order 1 pages is for buf size > >>> with 4096 bytes and system page size with 4K, as hns3 driver still support the > >>> per-desc ping-pong way of page splitting when page_pool_enabled is false. > >>> > >>> With page pool enabled, you are right that order 0 pages is enough, and I am not > >>> sure about the exact reason we use the some order as the ping-pong way of page > >>> splitting now. > >>> As 2048 bytes buf size seems to be the default one, and I has not heard any one > >>> changing it. Also, it caculates the pool_size using something as below, so the > >>> memory usage is almost the same for order 0 and order 1: > >>> > >>> .pool_size = ring->desc_num * hns3_buf_size(ring) / > >>> (PAGE_SIZE << hns3_page_order(ring)), > >>> > >>> I am not sure it worth changing it, maybe just change it to set good example for > >>> the users:) anyway I need to discuss this with other colleague internally and do > >>> some testing before doing the change. > >> > >> Right, I think this may be a leftover from the page flipping mode of > >> operation. But AFAIU we should leave the recycling fully to the page > >> pool now. If we make any improvements try to make them at the page pool > >> level. > > I checked, the per-desc buf with 4096 bytes for hnse does not seem to > be used mainly because of the larger memory usage you mentioned below. > > >> > >> I like your patches as they isolate the drivers from having to make the > >> fragmentation decisions based on the system page size (4k vs 64k but > >> we're hearing more and more about ARM w/ 16k pages). For that use case > >> this is great. > > Yes, That is my point. For hw case, the page splitting in page pool is > mainly to enble multi-descs to use the same page as my understanding. > > >> > >> What we don't want is drivers to start requesting larger page sizes > >> because it looks good in iperf on a freshly booted, idle system :( > > > > Actually that would be a really good direction for this patch set to > > look at going into. Rather than having us always allocate a "page" it > > would make sense for most drivers to allocate a 4K fragment or the > > like in the case that the base page size is larger than 4K. That might > > be a good use case to justify doing away with the standard page pool > > page and look at making them all fragmented. > > I am not sure if I understand the above, isn't the frag API able to > support allocating a 4K fragment when base page size is larger than > 4K before or after this patch? what more do we need to do? I'm not talking about the frag API. I am talking about the non-fragmented case. Right now standard page_pool will allocate an order 0 page. So if a driver is using just pages expecting 4K pages that isn't true on these ARM or PowerPC systems where the page size is larger than 4K. For a bit of historical reference on igb/ixgbe they had a known issue where they would potentially run a system out of memory when page size was larger than 4K. I had originally implemented things with just the refcounting hack and at the time it worked great on systems with 4K pages. However on a PowerPC it would trigger OOM errors because they could run with 64K pages. To fix that I started adding all the PAGE_SIZE checks in the driver and moved over to a striping model for those that would free the page when it reached the end in order to force it to free the page and make better use of the available memory. > > > > In the case of the standard page size being 4K a standard page would > > just have to take on the CPU overhead of the atomic_set and > > atomic_read for pp_ref_count (new name) which should be minimal as on > > most sane systems those just end up being a memory write and read. > > If I understand you correctly, I think what you are trying to do > may break some of Jesper' benchmarking:) > > [1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/bench_page_pool_simple.c So? If it breaks an out-of-tree benchmark the benchmark can always be fixed. The point is enabling a use case that can add value across the board instead of trying to force the community to support a niche use case. Ideally we should get away from using the pages directly for most cases in page pool. In my mind the page pool should start operating more like __get_free_pages where what you get is a virtual address instead of the actual page. That way we could start abstracting it away and eventually get to something more like a true page_pool api instead of what feels like a set of add-ons for the page allocator. Although at the end of the day this still feels more like we are just reimplementing slab so it is hard for me to say this is necessarily the best solution either.
On 16/06/2023 17.01, Alexander Duyck wrote: > On Fri, Jun 16, 2023 at 5:21 AM Yunsheng Lin <linyunsheng@huawei.com> wrote: >> >> On 2023/6/16 2:26, Alexander Duyck wrote: >>> On Thu, Jun 15, 2023 at 9:51 AM Jakub Kicinski <kuba@kernel.org> wrote: >>>> >>>> On Thu, 15 Jun 2023 15:17:39 +0800 Yunsheng Lin wrote: [...] >>>> >>>> I like your patches as they isolate the drivers from having to make the >>>> fragmentation decisions based on the system page size (4k vs 64k but >>>> we're hearing more and more about ARM w/ 16k pages). For that use case >>>> this is great. +1 [...] >>> >>> In the case of the standard page size being 4K a standard page would >>> just have to take on the CPU overhead of the atomic_set and >>> atomic_read for pp_ref_count (new name) which should be minimal as on >>> most sane systems those just end up being a memory write and read. >> >> If I understand you correctly, I think what you are trying to do >> may break some of Jesper' benchmarking:) >> >> [1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/bench_page_pool_simple.c > > So? If it breaks an out-of-tree benchmark the benchmark can always be > fixed. It doesn't matter if this is out-of-tree (I should have upstreamed it when AKPM asked me to.) Point is don't break my page_pool fast-path!!! :-P > The point is enabling a use case that can add value across the > board instead of trying to force the community to support a niche use > case. I'm all for creating a new API, lets call it netmem, that takes care of this use-case. I'm *not* okay with this new API slowing down the page_pool fast-path. Why not multiplex on a MEM_TYPE, like XDP_MEM_TYPE is prepared for?!? Meaning the caller can choose which is the correct API call. (thus, we can stay away from adding code to fast-path case) See below, copy-paste of code that shows what I mean by multiplex on a MEM_TYPE. > > Ideally we should get away from using the pages directly for most > cases in page pool. In my mind the page pool should start operating > more like __get_free_pages where what you get is a virtual address > instead of the actual page. That way we could start abstracting it > away and eventually get to something more like a true page_pool api > instead of what feels like a set of add-ons for the page allocator. Yes, I agree with Alex Duyck here. Like when I looked at veth proposed changes, it also felt like a virtual address would be better than a page. addr = netmem_alloc(rq->page_pool, &truesize); > Although at the end of the day this still feels more like we are just > reimplementing slab so it is hard for me to say this is necessarily > the best solution either. Yes, we have to be careful not to re-implement the MM layer in network land ;-) (below code copy-paste broke whitespaces) $ git show commit fe38c642d629f8361f76b25aa8732e5e331d0925 (HEAD -> pp_rm_workqueue04) Author: Jesper Dangaard Brouer <brouer@redhat.com> Date: Fri Jun 16 20:54:08 2023 +0200 page_pool: code examplifying multiplexing on mem_type Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> diff --git a/include/net/xdp.h b/include/net/xdp.h index d1c5381fc95f..c02ac82a1d79 100644 --- a/include/net/xdp.h +++ b/include/net/xdp.h @@ -42,6 +42,7 @@ enum xdp_mem_type { MEM_TYPE_PAGE_ORDER0, /* Orig XDP full page model */ MEM_TYPE_PAGE_POOL, MEM_TYPE_XSK_BUFF_POOL, + MEM_TYPE_PP_NETMEM, MEM_TYPE_MAX, }; diff --git a/net/core/page_pool.c b/net/core/page_pool.c index d03448a4c411..68be76efef00 100644 --- a/net/core/page_pool.c +++ b/net/core/page_pool.c @@ -353,7 +353,7 @@ static void page_pool_set_pp_info(struct page_pool *pool, struct page *page) { page->pp = pool; - page->pp_magic |= PP_SIGNATURE; + page->pp_magic |= PP_SIGNATURE | (MEM_TYPE_PAGE_POOL << 8); if (pool->p.init_callback) pool->p.init_callback(page, pool->p.init_arg); } @@ -981,6 +981,7 @@ bool page_pool_return_skb_page(struct page *page, bool napi_safe) struct napi_struct *napi; struct page_pool *pp; bool allow_direct; + int mem_type; page = compound_head(page); @@ -991,9 +992,10 @@ bool page_pool_return_skb_page(struct page *page, bool napi_safe) * and page_is_pfmemalloc() is checked in __page_pool_put_page() * to avoid recycling the pfmemalloc page. */ - if (unlikely((page->pp_magic & ~0x3UL) != PP_SIGNATURE)) + if (unlikely((page->pp_magic & ~0xF03UL) != PP_SIGNATURE)) return false; + mem_type = (page->pp_magic & 0xF00) >> 8; pp = page->pp; /* Allow direct recycle if we have reasons to believe that we are @@ -1009,7 +1011,10 @@ bool page_pool_return_skb_page(struct page *page, bool napi_safe) * The page will be returned to the pool here regardless of the * 'flipped' fragment being in use or not. */ - page_pool_put_full_page(pp, page, allow_direct); + if (mem_type == MEM_TYPE_PP_NETMEM) + pp_netmem_put_page(pp, page, allow_direct); + else + page_pool_put_full_page(pp, page, allow_direct); return true; } diff --git a/net/core/xdp.c b/net/core/xdp.c index 41e5ca8643ec..dc4bfbe8f002 100644 --- a/net/core/xdp.c +++ b/net/core/xdp.c @@ -380,6 +380,11 @@ void __xdp_return(void *data, struct xdp_mem_info *mem, bool napi_direct, struct page *page; switch (mem->type) { + case MEM_TYPE_PP_NETMEM: + if (napi_direct && xdp_return_frame_no_direct()) + napi_direct = false; + pp_netmem_put(page->pp, data, napi_direct); + break; case MEM_TYPE_PAGE_POOL: page = virt_to_head_page(data); if (napi_direct && xdp_return_frame_no_direct())
On Fri, 16 Jun 2023 20:59:12 +0200 Jesper Dangaard Brouer wrote: > + if (mem_type == MEM_TYPE_PP_NETMEM) > + pp_netmem_put_page(pp, page, allow_direct); > + else > + page_pool_put_full_page(pp, page, allow_direct); Interesting, what is the netmem type? I was thinking about extending page pool for other mem providers and what came to mind was either optionally replacing the free / alloc with a function pointer: https://github.com/torvalds/linux/commit/578ebda5607781c0abb26c1feae7ec8b83840768 or wrapping the PP calls with static inlines which can direct to a different implementation completely (like zctap / io_uring zc). Former is better for huge pages, latter is better for IO mem (peer-to-peer DMA). I wonder if you have different use case which requires a different model :(
On 16/06/2023 21.21, Jakub Kicinski wrote: > On Fri, 16 Jun 2023 20:59:12 +0200 Jesper Dangaard Brouer wrote: >> + if (mem_type == MEM_TYPE_PP_NETMEM) >> + pp_netmem_put_page(pp, page, allow_direct); >> + else >> + page_pool_put_full_page(pp, page, allow_direct); > > Interesting, what is the netmem type? I was thinking about extending > page pool for other mem providers and what came to mind was either > optionally replacing the free / alloc with a function pointer: > > https://github.com/torvalds/linux/commit/578ebda5607781c0abb26c1feae7ec8b83840768 > > or wrapping the PP calls with static inlines which can direct to > a different implementation completely (like zctap / io_uring zc). > I *LOVE* this idea!!! It have been my master plan since day-1 to have other mem providers. Notice how ZC xsk/AF_XDP have it's own memory allocator implementation. The page_pool was never meant to be the final and best solution, I want to see other, better and faster solutions competing with page_pool and maybe some day replacing page_pool (I even see it as a success if PP get depreciated and remove from the kernel due to a better solution). See[1] how net/core/xdp.c simply have a switch statement (is fast, because ASM wise it becomes a jump table): [1] https://github.com/torvalds/linux/blob/v6.4-rc6/net/core/xdp.c#L382-L402 > Former is better for huge pages, latter is better for IO mem > (peer-to-peer DMA). I wonder if you have different use case which > requires a different model :( > I want for the network stack SKBs (and XDP) to support different memory types for the "head" frame and "data-frags". Eric have described this idea before, that hardware will do header-split, and we/he can get TCP data part is another page/frag, making it faster for TCP-streams, but this can be used for much more. My proposed use-cases involves more that TCP. We can easily imagine NVMe protocol header-split, and the data-frag could be a mem_type that actually belongs to the harddisk (maybe CPU cannot even read this). The same scenario goes for GPU memory, which is for the AI use-case. IIRC then Jonathan have previously send patches for the GPU use-case. I really hope we can work in this direction together, --Jesper
On 2023/6/16 23:01, Alexander Duyck wrote: ... >>> Actually that would be a really good direction for this patch set to >>> look at going into. Rather than having us always allocate a "page" it >>> would make sense for most drivers to allocate a 4K fragment or the >>> like in the case that the base page size is larger than 4K. That might >>> be a good use case to justify doing away with the standard page pool >>> page and look at making them all fragmented. >> >> I am not sure if I understand the above, isn't the frag API able to >> support allocating a 4K fragment when base page size is larger than >> 4K before or after this patch? what more do we need to do? > > I'm not talking about the frag API. I am talking about the > non-fragmented case. Right now standard page_pool will allocate an > order 0 page. So if a driver is using just pages expecting 4K pages > that isn't true on these ARM or PowerPC systems where the page size is > larger than 4K. > > For a bit of historical reference on igb/ixgbe they had a known issue > where they would potentially run a system out of memory when page size > was larger than 4K. I had originally implemented things with just the > refcounting hack and at the time it worked great on systems with 4K > pages. However on a PowerPC it would trigger OOM errors because they > could run with 64K pages. To fix that I started adding all the > PAGE_SIZE checks in the driver and moved over to a striping model for > those that would free the page when it reached the end in order to > force it to free the page and make better use of the available memory. Isn't the page_pool_alloc() or page_pool_alloc_frag() API also solve the above problem? I think what you really want is another layer of subdividing support in the driver on top of the above, right?
On Fri, 16 Jun 2023 22:42:35 +0200 Jesper Dangaard Brouer wrote: > > Former is better for huge pages, latter is better for IO mem > > (peer-to-peer DMA). I wonder if you have different use case which > > requires a different model :( > > I want for the network stack SKBs (and XDP) to support different memory > types for the "head" frame and "data-frags". Eric have described this > idea before, that hardware will do header-split, and we/he can get TCP > data part is another page/frag, making it faster for TCP-streams, but > this can be used for much more. > > My proposed use-cases involves more that TCP. We can easily imagine > NVMe protocol header-split, and the data-frag could be a mem_type that > actually belongs to the harddisk (maybe CPU cannot even read this). The > same scenario goes for GPU memory, which is for the AI use-case. IIRC > then Jonathan have previously send patches for the GPU use-case. > > I really hope we can work in this direction together, Perfect, that's also the use case I had in mind. The huge page thing was just a quick thing to implement as a PoC (although useful in its own right, one day I'll find the time to finish it, sigh). That said I couldn't convince myself that for a peer-to-peer setup we have enough space in struct page to store all the information we need. Or that we'd get a struct page at all, and not just a region of memory with no struct page * allocated :S That'd require serious surgery on the page pool's fast paths to work around. I haven't dug into the details, tho. If you think we can use page pool as a frontend for iouring and/or p2p memory that'd be awesome! The workaround solution I had in mind would be to create a narrower API for just data pages. Since we'd need to sprinkle ifs anyway, pull them up close to the call site. Allowing to switch page pool for a completely different implementation, like the one Jonathan coded up for iouring. Basically $name_alloc_page(queue) { if (queue->pp) return page_pool_dev_alloc_pages(queue->pp); else if (queue->iouring..) ... }
On 19/06/2023 20.07, Jakub Kicinski wrote: > On Fri, 16 Jun 2023 22:42:35 +0200 Jesper Dangaard Brouer wrote: >>> Former is better for huge pages, latter is better for IO mem >>> (peer-to-peer DMA). I wonder if you have different use case which >>> requires a different model :( >> >> I want for the network stack SKBs (and XDP) to support different memory >> types for the "head" frame and "data-frags". Eric have described this >> idea before, that hardware will do header-split, and we/he can get TCP >> data part is another page/frag, making it faster for TCP-streams, but >> this can be used for much more. >> >> My proposed use-cases involves more that TCP. We can easily imagine >> NVMe protocol header-split, and the data-frag could be a mem_type that >> actually belongs to the harddisk (maybe CPU cannot even read this). The >> same scenario goes for GPU memory, which is for the AI use-case. IIRC >> then Jonathan have previously send patches for the GPU use-case. >> >> I really hope we can work in this direction together, > > Perfect, that's also the use case I had in mind. The huge page thing > was just a quick thing to implement as a PoC (although useful in its > own right, one day I'll find the time to finish it, sigh). > > That said I couldn't convince myself that for a peer-to-peer setup we > have enough space in struct page to store all the information we need. > Or that we'd get a struct page at all, and not just a region of memory > with no struct page * allocated :S Good with big ideas, but I think we should start smaller and evolve. > > That'd require serious surgery on the page pool's fast paths to work > around. > > I haven't dug into the details, tho. If you think we can use page pool > as a frontend for iouring and/or p2p memory that'd be awesome! > Hmm... I don't like the sound of this. My point is that we should create a more plug-able memory system for netstack. And NOT try to extend page_pool to cover all use-cases. > The workaround solution I had in mind would be to create a narrower API > for just data pages. Since we'd need to sprinkle ifs anyway, pull them > up close to the call site. Allowing to switch page pool for a > completely different implementation, like the one Jonathan coded up for > iouring. Basically > > $name_alloc_page(queue) > { > if (queue->pp) > return page_pool_dev_alloc_pages(queue->pp); > else if (queue->iouring..) > ... > } Yes, this is more the direction I'm thinking. In many cases, you don't need this if-statement helper in the driver, as driver RX side code will know the API used upfront. The TX completion side will need this kind of multiplexing return helper, to return the pages to the correct memory allocator type (e.g. page_pool being one). See concept in [1] __xdp_return(). Performance wise, function pointers are slow due to RETPOLINE, but switch-case statements (below certain size) becomes a jump table, which is fast. See[1]. [1] https://elixir.bootlin.com/linux/v6.4-rc7/source/net/core/xdp.c#L377 Regarding room in "struct page", notice that page->pp_magic will have plenty room for e.g. storing xdp_mem_type or even xdp_mem_info (which also contains an ID). --Jesper
On Tue, 20 Jun 2023 17:12:41 +0200 Jesper Dangaard Brouer wrote: > > The workaround solution I had in mind would be to create a narrower API > > for just data pages. Since we'd need to sprinkle ifs anyway, pull them > > up close to the call site. Allowing to switch page pool for a > > completely different implementation, like the one Jonathan coded up for > > iouring. Basically > > > > $name_alloc_page(queue) > > { > > if (queue->pp) > > return page_pool_dev_alloc_pages(queue->pp); > > else if (queue->iouring..) > > ... > > } > > Yes, this is more the direction I'm thinking. > In many cases, you don't need this if-statement helper in the driver, as > driver RX side code will know the API used upfront. Meaning that the driver "knows" if it's in the XDP, AF_XDP, iouring or "normal" Rx path? I hope we can avoid extra code in the driver completely, for data pages. > The TX completion side will need this kind of multiplexing return > helper, to return the pages to the correct memory allocator type (e.g. > page_pool being one). See concept in [1] __xdp_return(). > > Performance wise, function pointers are slow due to RETPOLINE, but > switch-case statements (below certain size) becomes a jump table, which > is fast. See[1]. > > [1] https://elixir.bootlin.com/linux/v6.4-rc7/source/net/core/xdp.c#L377 SG! > Regarding room in "struct page", notice that page->pp_magic will have > plenty room for e.g. storing xdp_mem_type or even xdp_mem_info (which > also contains an ID). I was worried about fitting the DMA address, if the pages code from user space.
On Mon, Jun 19, 2023 at 11:07 AM Jakub Kicinski <kuba@kernel.org> wrote: > > On Fri, 16 Jun 2023 22:42:35 +0200 Jesper Dangaard Brouer wrote: > > > Former is better for huge pages, latter is better for IO mem > > > (peer-to-peer DMA). I wonder if you have different use case which > > > requires a different model :( > > > > I want for the network stack SKBs (and XDP) to support different memory > > types for the "head" frame and "data-frags". Eric have described this > > idea before, that hardware will do header-split, and we/he can get TCP > > data part is another page/frag, making it faster for TCP-streams, but > > this can be used for much more. > > > > My proposed use-cases involves more that TCP. We can easily imagine > > NVMe protocol header-split, and the data-frag could be a mem_type that > > actually belongs to the harddisk (maybe CPU cannot even read this). The > > same scenario goes for GPU memory, which is for the AI use-case. IIRC > > then Jonathan have previously send patches for the GPU use-case. > > > > I really hope we can work in this direction together, > > Perfect, that's also the use case I had in mind. The huge page thing > was just a quick thing to implement as a PoC (although useful in its > own right, one day I'll find the time to finish it, sigh). > > That said I couldn't convince myself that for a peer-to-peer setup we > have enough space in struct page to store all the information we need. > Or that we'd get a struct page at all, and not just a region of memory > with no struct page * allocated :S > > That'd require serious surgery on the page pool's fast paths to work > around. > > I haven't dug into the details, tho. If you think we can use page pool > as a frontend for iouring and/or p2p memory that'd be awesome! > Hello Jakub, I'm looking into device memory (peer-to-peer) networking actually, and I plan to pursue using the page pool as a front end. Quick description of what I have so far: current implementation uses device memory with struct pages; I am putting all those pages in a gen_pool, and we have written an allocator that allocates pages from the gen_pool. In the driver, we use this allocator instead of alloc_page() (the driver in question is gve which currently doesn't use the page pool). When the driver is done with the p2p page, it simply decrements the refcount on it and the page is freed back to the gen_pool. Test results are good; our best results we're able to achieve ~96% line rate with incoming packets going straight to device memory and without bouncing the memory to a host buffer (albeit these results are on our slighly older, production LTS, I need to work on getting results from linus/master). I've discussed your page pool frontend idea with our gve owners and the idea is attractive. In particular it would be good not to insert much custom code into the driver to support device memory pages or other page types. I plan on trying to change my approach to match the page pool provider you have in progress here: https://github.com/kuba-moo/linux/tree/pp-providers In particular the main challenge right now seems to be that my device memory pages are ZONE_DEVICE pages, which can't be inserted to the page pool as-is due to the union in struct page between the page pool entries and the ZONE_DEVICE entries. I have some ideas on how to work around that I'm looking into. It sounds like you don't have the time at the moment to work on the page pool provider idea; I plan to try and get my code working with that model and propose it if it's successful. Let me know if you have concerns here. > The workaround solution I had in mind would be to create a narrower API > for just data pages. Since we'd need to sprinkle ifs anyway, pull them > up close to the call site. Allowing to switch page pool for a > completely different implementation, like the one Jonathan coded up for > iouring. Basically > > $name_alloc_page(queue) > { > if (queue->pp) > return page_pool_dev_alloc_pages(queue->pp); > else if (queue->iouring..) > ... > } >
On 6/29/23 8:27 PM, Mina Almasry wrote: > > Hello Jakub, I'm looking into device memory (peer-to-peer) networking > actually, and I plan to pursue using the page pool as a front end. > > Quick description of what I have so far: > current implementation uses device memory with struct pages; I am > putting all those pages in a gen_pool, and we have written an > allocator that allocates pages from the gen_pool. In the driver, we > use this allocator instead of alloc_page() (the driver in question is > gve which currently doesn't use the page pool). When the driver is > done with the p2p page, it simply decrements the refcount on it and > the page is freed back to the gen_pool. I take it these are typical Linux networking applications using standard socket APIs (not dpdk or xdp sockets or such)? If so, what does tcpdump show for those skbs with pages for the device memory?
On Sun, Jul 2, 2023 at 9:20 PM David Ahern <dsahern@kernel.org> wrote: > > On 6/29/23 8:27 PM, Mina Almasry wrote: > > > > Hello Jakub, I'm looking into device memory (peer-to-peer) networking > > actually, and I plan to pursue using the page pool as a front end. > > > > Quick description of what I have so far: > > current implementation uses device memory with struct pages; I am > > putting all those pages in a gen_pool, and we have written an > > allocator that allocates pages from the gen_pool. In the driver, we > > use this allocator instead of alloc_page() (the driver in question is > > gve which currently doesn't use the page pool). When the driver is > > done with the p2p page, it simply decrements the refcount on it and > > the page is freed back to the gen_pool. Quick update here, I was able to get my implementation working with the page pool as a front end with the memory provider API Jakub wrote here: https://github.com/kuba-moo/linux/tree/pp-providers The main complication indeed was the fact that my device memory pages are ZONE_DEVICE pages, which are incompatible with the page_pool due to the union in struct page. I thought of a couple of approaches to resolve that. 1. Make my device memory pages non-ZONE_DEVICE pages. The issue with that is that if the page is not ZONE_DEVICE, put_page(page) will attempt to free it to the buddy allocator I think, which is not correct. The only places where the mm stack currently allow custom freeing callback (AFAIK), are for ZONE_DEVICE page where free_zone_device_page() will call the provided callback in page->pgmap->ops->page_free, and compound pages where the compound_dtor is specified. My device memory pages aren't compound pages so only ZONE_DEVICE pages do what I want. 2. Convert the pages from ZONE_DEVICE pages to page_pool pages and vice versa as they're being inserted and removed from the page pool. This, I think, works elegantly without any issue, and is the option I went with. The info from ZONE_DEVICE that I care about for device memory TCP is the page->zone_device_data which holds the dma_addr, and the page->pgmap which holds the page_free op. I'm able to store both in my memory provider so I can swap pages from ZONE_DEVICE and page_pool back and forth. So far I haven't needed to make any modifications to the memory provider implementation Jakub has pretty much, and my functionality tests are passing. If there are no major objections I'll look into cleaning up the interface a bit and propose it for merge. This is a prerequisite of device memory TCP via the page_pool. > > I take it these are typical Linux networking applications using standard > socket APIs (not dpdk or xdp sockets or such)? If so, what does tcpdump > show for those skbs with pages for the device memory? > Yes these are using (mostly) standing socket APIs. We have small extensions to sendmsg() and recvmsg() to pass a reference to the device memory in both these cases, but that's about it. tcpdump is able to access the header of these skbs which is in host memory, but not the payload in device memory. Here is an example session with my netcat-like test for device memory TCP: https://pastebin.com/raw/FRjKf0kv tcpdump seems to work, and the length of the packets above is correct. tcpdump -A however isn't able to print the payload of the packets: https://pastebin.com/raw/2PcNxaZV
On 7/3/23 12:22 AM, Mina Almasry wrote: > tcpdump is able to access the header of these skbs which is in host > memory, but not the payload in device memory. Here is an example > session with my netcat-like test for device memory TCP: > https://pastebin.com/raw/FRjKf0kv > > tcpdump seems to work, and the length of the packets above is correct. > tcpdump -A however isn't able to print the payload of the packets: > https://pastebin.com/raw/2PcNxaZV That is my expectation. The tcpdump is just an easy example of accessing the skb page frags. skb_copy_and_csum_bits used by icmp is another example that can walk frags wanting access to device memory. You did not cause a panic or trip a WARN_ON for example with the tcpdump? Thanks for checking.
On Mon, Jul 3, 2023 at 4:45 PM David Ahern <dsahern@kernel.org> wrote: > > On 7/3/23 12:22 AM, Mina Almasry wrote: > > tcpdump is able to access the header of these skbs which is in host > > memory, but not the payload in device memory. Here is an example > > session with my netcat-like test for device memory TCP: > > https://pastebin.com/raw/FRjKf0kv > > > > tcpdump seems to work, and the length of the packets above is correct. > > tcpdump -A however isn't able to print the payload of the packets: > > https://pastebin.com/raw/2PcNxaZV > > That is my expectation. The tcpdump is just an easy example of accessing > the skb page frags. skb_copy_and_csum_bits used by icmp is another > example that can walk frags wanting access to device memory. You did not > cause a panic or trip a WARN_ON for example with the tcpdump? > Change for af_packet was not too hard :) diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c index a2dbeb264f260e5b8923ece9aac99fe19ddfeb62..aa4133d1b1e0676e408499ea4534b51262394432 100644 --- a/net/packet/af_packet.c +++ b/net/packet/af_packet.c @@ -2152,7 +2152,7 @@ static int packet_rcv(struct sk_buff *skb, struct net_device *dev, } } - snaplen = skb->len; + snaplen = skb->devmem ? skb_headlen(skb) : skb->len; res = run_filter(skb, sk, snaplen); if (!res) @@ -2275,7 +2275,7 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev, } } - snaplen = skb->len; + snaplen = skb->devmem ? skb_headlen(skb) : skb->len; res = run_filter(skb, sk, snaplen); if (!res) And a generic change in pskb_may_pull() ( __pskb_pull_tail() more exactly) was enough to cover most other cases.
On Mon, Jul 3, 2023 at 4:45 PM David Ahern <dsahern@kernel.org> wrote: > That is my expectation. The tcpdump is just an easy example of accessing > the skb page frags. skb_copy_and_csum_bits used by icmp is another > example that can walk frags wanting access to device memory. You did not > cause a panic or trip a WARN_ON for example with the tcpdump? > ICMP packets do not land on the queues having devmem buffers, for obvious reasons. Only chosen TCP flows are steered to these queues.
On 7/3/23 11:13 AM, Eric Dumazet wrote: > diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c > index a2dbeb264f260e5b8923ece9aac99fe19ddfeb62..aa4133d1b1e0676e408499ea4534b51262394432 > 100644 > --- a/net/packet/af_packet.c > +++ b/net/packet/af_packet.c > @@ -2152,7 +2152,7 @@ static int packet_rcv(struct sk_buff *skb, > struct net_device *dev, > } > } > > - snaplen = skb->len; > + snaplen = skb->devmem ? skb_headlen(skb) : skb->len; > Ok, so you expect a flag on the skb noting the use of 'untouchable' memory. That aligns with my expectations based on POCs. Based on the above: 1) skb->head is expected to be host memory, and 2) the flag is a global for all frags, so no mix and match.
On 7/3/23 11:15 AM, Eric Dumazet wrote: > On Mon, Jul 3, 2023 at 4:45 PM David Ahern <dsahern@kernel.org> wrote: > >> That is my expectation. The tcpdump is just an easy example of accessing >> the skb page frags. skb_copy_and_csum_bits used by icmp is another >> example that can walk frags wanting access to device memory. You did not >> cause a panic or trip a WARN_ON for example with the tcpdump? >> > > ICMP packets do not land on the queues having devmem buffers, for > obvious reasons. I was not referring to ICMP packets coming in to the nic, but rather an icmp getting triggered during stack processing. > > Only chosen TCP flows are steered to these queues. of course.
On Sun, Jul 02, 2023 at 11:22:33PM -0700, Mina Almasry wrote: > On Sun, Jul 2, 2023 at 9:20 PM David Ahern <dsahern@kernel.org> wrote: > > > > On 6/29/23 8:27 PM, Mina Almasry wrote: > > > > > > Hello Jakub, I'm looking into device memory (peer-to-peer) networking > > > actually, and I plan to pursue using the page pool as a front end. > > > > > > Quick description of what I have so far: > > > current implementation uses device memory with struct pages; I am > > > putting all those pages in a gen_pool, and we have written an > > > allocator that allocates pages from the gen_pool. In the driver, we > > > use this allocator instead of alloc_page() (the driver in question is > > > gve which currently doesn't use the page pool). When the driver is > > > done with the p2p page, it simply decrements the refcount on it and > > > the page is freed back to the gen_pool. > > Quick update here, I was able to get my implementation working with > the page pool as a front end with the memory provider API Jakub wrote > here: > https://github.com/kuba-moo/linux/tree/pp-providers > > The main complication indeed was the fact that my device memory pages > are ZONE_DEVICE pages, which are incompatible with the page_pool due > to the union in struct page. I thought of a couple of approaches to > resolve that. > > 1. Make my device memory pages non-ZONE_DEVICE pages. Hard no on this from a mm perspective.. We need P2P memory to be properly tagged and have the expected struct pages to be DMA mappable and otherwise, you totally break everything if you try to do this.. > 2. Convert the pages from ZONE_DEVICE pages to page_pool pages and > vice versa as they're being inserted and removed from the page pool. This is kind of scary, it is very, very, fragile to rework the pages like this. Eg what happens when the owning device unplugs and needs to revoke these pages? I think it would likely crash.. I think it also technically breaks the DMA API as we may need to look into the pgmap to do cache ops on some architectures. I suggest you try to work with 8k folios and then the tail page's struct page is empty enough to store the information you need.. Or allocate per page memory and do a memdesc like thing.. Though overall, you won't find devices creating struct pages for their P2P memory today, so I'm not sure what the purpose is. Jonathan already got highly slammed for proposing code to the kernel that was unusable. Please don't repeat that. Other than a special NVMe use case the interface for P2P is DMABUF right now and it is not struct page backed. Even if we did get to struct pages for device memory, it is highly likely cases you are interested in will be using larger than 4k folios, so page pool would need to cope with this nicely as well. Jason
On Mon, Jul 3, 2023 at 2:43 PM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > On Sun, Jul 02, 2023 at 11:22:33PM -0700, Mina Almasry wrote: > > On Sun, Jul 2, 2023 at 9:20 PM David Ahern <dsahern@kernel.org> wrote: > > > > > > On 6/29/23 8:27 PM, Mina Almasry wrote: > > > > > > > > Hello Jakub, I'm looking into device memory (peer-to-peer) networking > > > > actually, and I plan to pursue using the page pool as a front end. > > > > > > > > Quick description of what I have so far: > > > > current implementation uses device memory with struct pages; I am > > > > putting all those pages in a gen_pool, and we have written an > > > > allocator that allocates pages from the gen_pool. In the driver, we > > > > use this allocator instead of alloc_page() (the driver in question is > > > > gve which currently doesn't use the page pool). When the driver is > > > > done with the p2p page, it simply decrements the refcount on it and > > > > the page is freed back to the gen_pool. > > > > Quick update here, I was able to get my implementation working with > > the page pool as a front end with the memory provider API Jakub wrote > > here: > > https://github.com/kuba-moo/linux/tree/pp-providers > > > > The main complication indeed was the fact that my device memory pages > > are ZONE_DEVICE pages, which are incompatible with the page_pool due > > to the union in struct page. I thought of a couple of approaches to > > resolve that. > > > > 1. Make my device memory pages non-ZONE_DEVICE pages. > > Hard no on this from a mm perspective.. We need P2P memory to be > properly tagged and have the expected struct pages to be DMA mappable > and otherwise, you totally break everything if you try to do this.. > > > 2. Convert the pages from ZONE_DEVICE pages to page_pool pages and > > vice versa as they're being inserted and removed from the page pool. > > This is kind of scary, it is very, very, fragile to rework the pages > like this. Eg what happens when the owning device unplugs and needs to > revoke these pages? I think it would likely crash.. > > I think it also technically breaks the DMA API as we may need to look > into the pgmap to do cache ops on some architectures. > > I suggest you try to work with 8k folios and then the tail page's > struct page is empty enough to store the information you need.. Hi Jason, sorry for the late reply, I think this could work, and the page pool already supports > order 0 allocations. It may end up being a big change to the GVE driver which as I understand currently deals with order 0 allocations exclusively. Another issue is that in networks with low MTU, we could be DMAing 1400/1500 bytes into each allocation, which is problematic if the allocation is 8K+. I would need to investigate a bit to see if/how to solve that, and we may end up having to split the page and again run into the 'not enough room in struct page' problem. > Or allocate per page memory and do a memdesc like thing.. > I need to review memdesc more closely. Do you imagine I add a pointer in struct page that points to the memdesc? Or implement a page to memdesc mapping in the page_pool? Either approach could work. I think the concern would be accessing the memdesc entries may be a cache miss unacceptable in fast paths, but I think I already dereference page->pgmap in a few places and it doesn't seem to be an issue. > Though overall, you won't find devices creating struct pages for their > P2P memory today, so I'm not sure what the purpose is. Jonathan > already got highly slammed for proposing code to the kernel that was > unusable. Please don't repeat that. Other than a special NVMe use case > the interface for P2P is DMABUF right now and it is not struct page > backed. > Our approach is actually to extend DMABUF to provide struct page backed attachment mappings, which as far as I understand sidesteps the issues Jonathan ran into. Our code is fully functional with any device that supports dmabuf and in fact a lot of my tests use udmabuf to minimize the dependencies. The RFC may come with a udmabuf selftest to showcase that any dmabuf, even a mocked one, would be supported. > Even if we did get to struct pages for device memory, it is highly > likely cases you are interested in will be using larger than 4k > folios, so page pool would need to cope with this nicely as well. > -- Thanks, Mina
On Mon, Jul 3, 2023 at 10:23 AM David Ahern <dsahern@kernel.org> wrote: > > On 7/3/23 11:13 AM, Eric Dumazet wrote: > > diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c > > index a2dbeb264f260e5b8923ece9aac99fe19ddfeb62..aa4133d1b1e0676e408499ea4534b51262394432 > > 100644 > > --- a/net/packet/af_packet.c > > +++ b/net/packet/af_packet.c > > @@ -2152,7 +2152,7 @@ static int packet_rcv(struct sk_buff *skb, > > struct net_device *dev, > > } > > } > > > > - snaplen = skb->len; > > + snaplen = skb->devmem ? skb_headlen(skb) : skb->len; > > > > Ok, so you expect a flag on the skb noting the use of 'untouchable' > memory. That aligns with my expectations based on POCs. > > Based on the above: 1) skb->head is expected to be host memory, and 2) > the flag is a global for all frags, so no mix and match. Yes, both are correct (i.e. what I plan to propose).
On Thu, 29 Jun 2023 19:27:46 -0700 Mina Almasry wrote: > I've discussed your page pool frontend idea with our gve owners and > the idea is attractive. In particular it would be good not to insert > much custom code into the driver to support device memory pages or > other page types. I plan on trying to change my approach to match the > page pool provider you have in progress here: > https://github.com/kuba-moo/linux/tree/pp-providers > > In particular the main challenge right now seems to be that my device > memory pages are ZONE_DEVICE pages, which can't be inserted to the > page pool as-is due to the union in struct page between the page pool > entries and the ZONE_DEVICE entries. I have some ideas on how to work > around that I'm looking into. Right, have you talked to Willem? I mentioned to him that I initially pursued the idea of using the page pool as an abstraction but I realized that fitting both the dma addr and the frag reference count into struct page which isn't just a basic page will be a challenge. So the second best thing seems to be to create an API which matches the page pool API, and have it select between a page pool and another provider based on user configuration. As you said the main goal is to be able to feed kernel / user / device memory to the driver without having to modify driver code. I thought this would live "above" the page pool (perhaps that's what you mean by gen_pool) but Jesper brought up integrating it into the (middle of?) page pool and have page pool switch on the pp_magic. No strong preference on which on is better, seems like something that can only be ironed out by modifying a couple of drivers to find out what fits best :( I'm hoping to jump back into the pp_provider work and finish that up over the next few days, to at least post a PoC. Even if it doesn't work for user / device memory - being able to feed the page pool from huge pages is already a big win for the IOTLB.
On Wed, Jul 05, 2023 at 06:17:39PM -0700, Mina Almasry wrote: > Another issue is that in networks with low MTU, we could be DMAing > 1400/1500 bytes into each allocation, which is problematic if the > allocation is 8K+. I would need to investigate a bit to see if/how to > solve that, and we may end up having to split the page and again run > into the 'not enough room in struct page' problem. You don't have an intree driver to use this with, so who knows, but the out of tree GPU drivers tend to use a 64k memory management page size, and I don't expect you'd make progress with a design where a 64K naturaly sized allocator is producing 4k/8k non-compound pages just for netdev. We are still struggling with pagemap support for variable page size folios, so there is a bunch of technical blockers before drivers could do this. This is why it is so important to come with a complete in-tree solution, as we cannot review this design if your work is done with hacked up out of tree drivers. Fully and properly adding P2P ZONE_DEVICE to a real world driver is a pretty big ask still. > > Or allocate per page memory and do a memdesc like thing.. > > I need to review memdesc more closely. Do you imagine I add a pointer > in struct page that points to the memdesc? Pointer to extra memory from the PFN has been the usual meaning of memdesc, so doing an interm where the pointer is in the struct page is a reasonable starting point. > > Though overall, you won't find devices creating struct pages for their > > P2P memory today, so I'm not sure what the purpose is. Jonathan > > already got highly slammed for proposing code to the kernel that was > > unusable. Please don't repeat that. Other than a special NVMe use case > > the interface for P2P is DMABUF right now and it is not struct page > > backed. > > > > Our approach is actually to extend DMABUF to provide struct page > backed attachment mappings, which as far as I understand sidesteps the > issues Jonathan ran into. No DMABUF exporters do this today, so your patch series is just as incomplete as the prior ones. Please don't post it as non-RFC, unusable code like this must not be merged. > that supports dmabuf and in fact a lot of my tests use udmabuf to > minimize the dependencies. The RFC may come with a udmabuf selftest to > showcase that any dmabuf, even a mocked one, would be supported. That is not good enough to get merged. You need to get agreement and coded merged from actual driver owners of dmabuf exporters that they want to support this direction. As above it has surprising road blocks outside netdev :\ Jason
On Mon, Jul 10, 2023 at 10:44 AM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > On Wed, Jul 05, 2023 at 06:17:39PM -0700, Mina Almasry wrote: > > > Another issue is that in networks with low MTU, we could be DMAing > > 1400/1500 bytes into each allocation, which is problematic if the > > allocation is 8K+. I would need to investigate a bit to see if/how to > > solve that, and we may end up having to split the page and again run > > into the 'not enough room in struct page' problem. > > You don't have an intree driver to use this with, so who knows, but > the out of tree GPU drivers tend to use a 64k memory management page > size, and I don't expect you'd make progress with a design where a 64K > naturaly sized allocator is producing 4k/8k non-compound pages just > for netdev. We are still struggling with pagemap support for variable > page size folios, so there is a bunch of technical blockers before > drivers could do this. > > This is why it is so important to come with a complete in-tree > solution, as we cannot review this design if your work is done with > hacked up out of tree drivers. > I think you're assuming the proposal requires dma-buf exporter driver changes, and I have a 'hacked up out of tree driver' not visible to you. Both are not quite right. The proposal requires no changes to the dma-buf exporter, and works with udmabuf _as is_, proving that. Please do review the proposal: https://lore.kernel.org/netdev/20230710223304.1174642-1-almasrymina@google.com/ If you still don't like the approach, we can try something else. > Fully and properly adding P2P ZONE_DEVICE to a real world driver is a > pretty big ask still. > There is no such ask. > > > Or allocate per page memory and do a memdesc like thing.. > > > > I need to review memdesc more closely. Do you imagine I add a pointer > > in struct page that points to the memdesc? > > Pointer to extra memory from the PFN has been the usual meaning of > memdesc, so doing an interm where the pointer is in the struct page is > a reasonable starting point. > > > > Though overall, you won't find devices creating struct pages for their > > > P2P memory today, so I'm not sure what the purpose is. Jonathan > > > already got highly slammed for proposing code to the kernel that was > > > unusable. Please don't repeat that. Other than a special NVMe use case > > > the interface for P2P is DMABUF right now and it is not struct page > > > backed. > > > > > > > Our approach is actually to extend DMABUF to provide struct page > > backed attachment mappings, which as far as I understand sidesteps the > > issues Jonathan ran into. > > No DMABUF exporters do this today, so your patch series is just as > incomplete as the prior ones. Please don't post it as non-RFC, > unusable code like this must not be merged. > > > that supports dmabuf and in fact a lot of my tests use udmabuf to > > minimize the dependencies. The RFC may come with a udmabuf selftest to > > showcase that any dmabuf, even a mocked one, would be supported. > > That is not good enough to get merged. You need to get agreement and > coded merged from actual driver owners of dmabuf exporters that they > want to support this direction. As above it has surprising road > blocks outside netdev :\ > The current proposal requires no changes to the dma-buf exporters: https://lore.kernel.org/netdev/20230710223304.1174642-1-almasrymina@google.com/ On dma-buf changes required. I do need approval from the dma-buf maintainers, but AFAICT, no approval from the dma-buf exporters (all I need is already supported). If we need to change direction to a proposal that needs additional support from the driver owners, yes, we'd need their approval, but this is not the case at the moment.
On Mon, Jul 10, 2023 at 04:02:59PM -0700, Mina Almasry wrote: > On Mon, Jul 10, 2023 at 10:44 AM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > > > On Wed, Jul 05, 2023 at 06:17:39PM -0700, Mina Almasry wrote: > > > > > Another issue is that in networks with low MTU, we could be DMAing > > > 1400/1500 bytes into each allocation, which is problematic if the > > > allocation is 8K+. I would need to investigate a bit to see if/how to > > > solve that, and we may end up having to split the page and again run > > > into the 'not enough room in struct page' problem. > > > > You don't have an intree driver to use this with, so who knows, but > > the out of tree GPU drivers tend to use a 64k memory management page > > size, and I don't expect you'd make progress with a design where a 64K > > naturaly sized allocator is producing 4k/8k non-compound pages just > > for netdev. We are still struggling with pagemap support for variable > > page size folios, so there is a bunch of technical blockers before > > drivers could do this. > > > > This is why it is so important to come with a complete in-tree > > solution, as we cannot review this design if your work is done with > > hacked up out of tree drivers. > > > > I think you're assuming the proposal requires dma-buf exporter driver > changes, and I have a 'hacked up out of tree driver' not visible to > you. Oh, I thought it was obvious what you did in patch 1 was a total non-starter when I said you can't abuse the ZONE_DEVICE pages like this. You must create ZONE_DEVICE P2P pages, not MEMORY_DEVICE_PRIVATE to represent P2P memory, and you can't do that automatically from the dmabuf core code. Without doing this the DMA API doesn't actually work properly because it doesn't have enough information to know about what the underlying exporter is. The entire point of DEVICE_PRIVATE is that the page content, and access to the page's physical location, is *explicitly* unavailable to anyone but the pgmap owner. > > Fully and properly adding P2P ZONE_DEVICE to a real world driver is a > > pretty big ask still. > > There is no such ask. Well, there is from me if you want to use stuct pages as handles for P2P memory. That is what we have defined in the pgmap area. Also I should warn you that your 'option 2' is being NAK'd by Christoph for now, we are not adding any new code around DMABUF's hacky use of NULL sg_page scatterlist for P2P memory either. I've been working on solutions here but it is slow going. > On dma-buf changes required. I do need approval from the dma-buf > maintainers, It is a neat hack, of sorts, to abuse DEVICE_PRIVATE to create struct pages for the exclusive use of pagepool - but you need more approval than just dmabuf maintainers to abuse the pgmap framework like this. At least from my position I want to see MEMORY_DEVICE_PCI_P2PDMA used to represent P2P memory. You haven't CC'd anyone from the mm community so I've added some people here to see if there are other opinions. To be clear, you need an explicit ack from mm people on the abusive use of pgmap in patch 1. I know it is not what you want to hear, but there are actual reasons why the P2P DMA problem has been festering for so long, and hacky quick fixes like this are not going to be enough.. Jason
On Mon, Jul 10, 2023 at 4:49 PM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > On Mon, Jul 10, 2023 at 04:02:59PM -0700, Mina Almasry wrote: > > On Mon, Jul 10, 2023 at 10:44 AM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > > > > > On Wed, Jul 05, 2023 at 06:17:39PM -0700, Mina Almasry wrote: > > > > > > > Another issue is that in networks with low MTU, we could be DMAing > > > > 1400/1500 bytes into each allocation, which is problematic if the > > > > allocation is 8K+. I would need to investigate a bit to see if/how to > > > > solve that, and we may end up having to split the page and again run > > > > into the 'not enough room in struct page' problem. > > > > > > You don't have an intree driver to use this with, so who knows, but > > > the out of tree GPU drivers tend to use a 64k memory management page > > > size, and I don't expect you'd make progress with a design where a 64K > > > naturaly sized allocator is producing 4k/8k non-compound pages just > > > for netdev. We are still struggling with pagemap support for variable > > > page size folios, so there is a bunch of technical blockers before > > > drivers could do this. > > > > > > This is why it is so important to come with a complete in-tree > > > solution, as we cannot review this design if your work is done with > > > hacked up out of tree drivers. > > > > > > > I think you're assuming the proposal requires dma-buf exporter driver > > changes, and I have a 'hacked up out of tree driver' not visible to > > you. > > Oh, I thought it was obvious what you did in patch 1 was a total > non-starter when I said you can't abuse the ZONE_DEVICE pages like > this. > > You must create ZONE_DEVICE P2P pages, not MEMORY_DEVICE_PRIVATE to > represent P2P memory, We can extend the memory_type enum, with something like MEMORY_DEVICE_DMABUF? > and you can't do that automatically from the > dmabuf core code. > Sorry, why not? > Without doing this the DMA API doesn't actually work properly because > it doesn't have enough information to know about what the underlying > exporter is. > I didn't think it would matter that the DMA API doesn't work with these pages, because the mapping already exists courtesy of dma_buf_map_attachment(), and these dma-buf pages don't need to be mapped (the dma_addr is already available). I'm providing dma_buf_map_sg() in that patch to use instead of the DMA API. Would be nice to help me understand why we would care why the DMA API doesn't work with these pages when we're asking code using these pages to use the provided API instead? > The entire point of DEVICE_PRIVATE is that the page content, and > access to the page's physical location, is *explicitly* unavailable to > anyone but the pgmap owner. > This seems to be a semantics issue. Would setting the pages as MEMORY_DEVICE_DMABUF address this? > > > Fully and properly adding P2P ZONE_DEVICE to a real world driver is a > > > pretty big ask still. > > > > There is no such ask. > > Well, there is from me if you want to use stuct pages as handles for > P2P memory. That is what we have defined in the pgmap area. > > Also I should warn you that your 'option 2' is being NAK'd by > Christoph for now, we are not adding any new code around DMABUF's > hacky use of NULL sg_page scatterlist for P2P memory either. I've been > working on solutions here but it is slow going. > Option 2, I think, doesn't care about the sg_page NULL. But it may be pointless to discuss, I don't see how we could modify the page pool, networking stack, and all the networking drivers to do anything other than struct page (someone feel free to correct). > > On dma-buf changes required. I do need approval from the dma-buf > > maintainers, > > It is a neat hack, of sorts, to abuse DEVICE_PRIVATE to create struct > pages for the exclusive use of pagepool - but you need more approval > than just dmabuf maintainers to abuse the pgmap framework like > this. > Who? I'd love to add them on subsequent proposals. > At least from my position I want to see MEMORY_DEVICE_PCI_P2PDMA used > to represent P2P memory. Would using p2pdma API instead of dmabuf be an acceptable direction? The nice thing about what I'm doing is that any existing dmabuf exported would be supported, AFAICT. To use p2pdma I think I would need to modify the drivers providing the device memory to do a pci_p2pdma_add_resource(), and use the pci_p2pdma_map_sg() on the importer driver (NIC), but that may work, with driver changes. Would that be an acceptable direction to you? Or are you against the approach in the RFC and don't have any alternative path forward for this? Is this not a use case we want to support in some way in the kernel? > You haven't CC'd anyone from the mm community > so I've added some people here to see if there are other opinions. > I CC'd get_maintainers.pl output and you because you showed interest and get_maintainers did not add you automatically. > To be clear, you need an explicit ack from mm people on the abusive > use of pgmap in patch 1. > Who? I would love to add them on subsequent proposals. > I know it is not what you want to hear, but there are actual reasons > why the P2P DMA problem has been festering for so long, and hacky > quick fixes like this are not going to be enough.. > -- Thanks, Mina
On Mon, Jul 10, 2023 at 08:49:25PM -0300, Jason Gunthorpe wrote: > The entire point of DEVICE_PRIVATE is that the page content, and > access to the page's physical location, is *explicitly* unavailable to > anyone but the pgmap owner. DEVICE_PRIVATE is in fact availably to no one at all. It is a place holder for memory not addressable at all. Not going to comment on the rest of this as it seems bat shit crazy hacks for out of tree junk. Why is anyone even wasting time on this?
On Tue, 11 Jul 2023 06:27:08 +0200 Christoph Hellwig wrote: > Not going to comment on the rest of this as it seems bat shit crazy > hacks for out of tree junk. Why is anyone even wasting time on this? Noob question - how does RDMA integrate with the out of tree junk? AFAIU it's possible to run the "in-tree" RDMA stack and get "GPU direct". Both Jonathan in the past (Meta) and now Mina (Google) are trying to move the needle and at least feed the GPUs over TCP, instead of patented, proprietary and closed RDMA transports.
On Mon, Jul 10, 2023 at 09:59:06PM -0700, Jakub Kicinski wrote: > On Tue, 11 Jul 2023 06:27:08 +0200 Christoph Hellwig wrote: > > Not going to comment on the rest of this as it seems bat shit crazy > > hacks for out of tree junk. Why is anyone even wasting time on this? > > Noob question - how does RDMA integrate with the out of tree junk? > AFAIU it's possible to run the "in-tree" RDMA stack and get "GPU > direct". I don't care and it has absolutel no business being discussed here. FYI at leat iWarp is a totally open standard.
Jason Gunthorpe wrote: > On Mon, Jul 10, 2023 at 04:02:59PM -0700, Mina Almasry wrote: > > On Mon, Jul 10, 2023 at 10:44 AM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > > > > > On Wed, Jul 05, 2023 at 06:17:39PM -0700, Mina Almasry wrote: > > > > > > > Another issue is that in networks with low MTU, we could be DMAing > > > > 1400/1500 bytes into each allocation, which is problematic if the > > > > allocation is 8K+. I would need to investigate a bit to see if/how to > > > > solve that, and we may end up having to split the page and again run > > > > into the 'not enough room in struct page' problem. > > > > > > You don't have an intree driver to use this with, so who knows, but > > > the out of tree GPU drivers tend to use a 64k memory management page > > > size, and I don't expect you'd make progress with a design where a 64K > > > naturaly sized allocator is producing 4k/8k non-compound pages just > > > for netdev. We are still struggling with pagemap support for variable > > > page size folios, so there is a bunch of technical blockers before > > > drivers could do this. > > > > > > This is why it is so important to come with a complete in-tree > > > solution, as we cannot review this design if your work is done with > > > hacked up out of tree drivers. > > > > > > > I think you're assuming the proposal requires dma-buf exporter driver > > changes, and I have a 'hacked up out of tree driver' not visible to > > you. > > Oh, I thought it was obvious what you did in patch 1 was a total > non-starter when I said you can't abuse the ZONE_DEVICE pages like > this. > > You must create ZONE_DEVICE P2P pages, not MEMORY_DEVICE_PRIVATE to > represent P2P memory, and you can't do that automatically from the > dmabuf core code. > > Without doing this the DMA API doesn't actually work properly because > it doesn't have enough information to know about what the underlying > exporter is. > > The entire point of DEVICE_PRIVATE is that the page content, and > access to the page's physical location, is *explicitly* unavailable to > anyone but the pgmap owner. > > > > Fully and properly adding P2P ZONE_DEVICE to a real world driver is a > > > pretty big ask still. > > > > There is no such ask. > > Well, there is from me if you want to use stuct pages as handles for > P2P memory. That is what we have defined in the pgmap area. > > Also I should warn you that your 'option 2' is being NAK'd by > Christoph for now, we are not adding any new code around DMABUF's > hacky use of NULL sg_page scatterlist for P2P memory either. I've been > working on solutions here but it is slow going. > > > On dma-buf changes required. I do need approval from the dma-buf > > maintainers, > > It is a neat hack, of sorts, to abuse DEVICE_PRIVATE to create struct > pages for the exclusive use of pagepool - but you need more approval > than just dmabuf maintainers to abuse the pgmap framework like > this. > > At least from my position I want to see MEMORY_DEVICE_PCI_P2PDMA used > to represent P2P memory. You haven't CC'd anyone from the mm community > so I've added some people here to see if there are other opinions. > > To be clear, you need an explicit ack from mm people on the abusive > use of pgmap in patch 1. This thread is long so I am only reacting to this first message I am copied on, but yes agree with Jason anything peer-to-peer DMA needs to reuse p2pdma and it definitely needs in-tree producers and consumers for all infrastructure. One piece of technical debt standing in the way of any proposed expansion of pgmap usage is the final resolution of this topic: https://lore.kernel.org/all/166579181584.2236710.17813547487183983273.stgit@dwillia2-xfh.jf.intel.com/ I am also generally reticent to entertain taking on new pgmap maintenance. I.e. now that accelerator memory attached over a coherent link is an open hardware standard (CXL) that addresses what pgmap infrastructure enabled in software. > I know it is not what you want to hear, but there are actual reasons > why the P2P DMA problem has been festering for so long, and hacky > quick fixes like this are not going to be enough.. > > Jason
On Tue, Jul 11, 2023 at 07:04:45AM +0200, Christoph Hellwig wrote: > On Mon, Jul 10, 2023 at 09:59:06PM -0700, Jakub Kicinski wrote: > > On Tue, 11 Jul 2023 06:27:08 +0200 Christoph Hellwig wrote: > > > Not going to comment on the rest of this as it seems bat shit crazy > > > hacks for out of tree junk. Why is anyone even wasting time on this? > > > > Noob question - how does RDMA integrate with the out of tree junk? > > AFAIU it's possible to run the "in-tree" RDMA stack and get "GPU > > direct". > > I don't care and it has absolutel no business being discussed here. > > FYI at leat iWarp is a totally open standard. So is Infiniband, Jakub has a unique definition of "proprietary". RDMA works with the AMD and Intel intree drivers using DMABUF without requiring struct pages using the DRM hacky scatterlist approach. Jason
On Mon, Jul 10, 2023 at 05:45:05PM -0700, Mina Almasry wrote: > > At least from my position I want to see MEMORY_DEVICE_PCI_P2PDMA used > > to represent P2P memory. > > Would using p2pdma API instead of dmabuf be an acceptable direction? "p2pdma API" is really just using MEMORY_DEVICE_PCI_P2PDMA and teaching the pagepool how to work with ZONE_DEVICE pages. I suspect this will clash badly with Matthew's work here: https://lore.kernel.org/all/20230111042214.907030-1-willy@infradead.org/ As from a mm side we haven't ever considered that ZONE_DEVICE and "netmem" can be composed together. The entire point of netmem like stuff is that the allocator hands over the majority of struct page to the allocatee, and ZONE_DEVICE can't work like that. However, assuming that can be solved in some agreeable way then it would be OK to go down this path. But, I feel like this is just overall too hard a direction from the mm perspective. I don't know anything about page pool, but the main sticking point is its reliance on struct page. If it can find another way to locate its meta data (eg an xarray), at least for some cases, it would make things alot easier. Jason
On Tue, 11 Jul 2023 09:05:02 -0300 Jason Gunthorpe wrote: > On Tue, Jul 11, 2023 at 07:04:45AM +0200, Christoph Hellwig wrote: > > On Mon, Jul 10, 2023 at 09:59:06PM -0700, Jakub Kicinski wrote: > > > Noob question - how does RDMA integrate with the out of tree junk? > > > AFAIU it's possible to run the "in-tree" RDMA stack and get "GPU > > > direct". > > > > I don't care and it has absolutel no business being discussed here. My question was genuine. If you think the code is dog shit I will make no argument for merging it. I just get hives from looking at proprietary code so I was hoping someone could explain how the proprietary stacks get it done. > > FYI at leat iWarp is a totally open standard. And I'm sure someone cares about that. I care about open source. I think people in the networking world have a deeper understanding of standardization processes, and their practical implication for open source and hack-ability. If all usable implementations are proprietary the standard could as well not exist. That may be a little hard to understand for folks coming from storage and fabric worlds where the interesting bits of the implementation was pretty much always closed. > So is Infiniband, Jakub has a unique definition of "proprietary". For IB AFAIU there's only one practically usable vendor, such an impressive ecosystem!! > RDMA works with the AMD and Intel intree drivers using DMABUF without > requiring struct pages using the DRM hacky scatterlist approach. I see, thanks. We need pages primarily for refcounting. Avoiding all the infamous problems with memory pins. Oh well.
On 7/11/23 10:00 AM, Jakub Kicinski wrote: >> RDMA works with the AMD and Intel intree drivers using DMABUF without >> requiring struct pages using the DRM hacky scatterlist approach. > I see, thanks. We need pages primarily for refcounting. Avoiding all > the infamous problems with memory pins. Oh well. io_uring for example already manages the page pinning. An skb flag was added for ZC Tx API to avoid refcounting in the core networking layer. Any reason not to allow an alternative representation for skb frags than struct page?
On Tue, 11 Jul 2023 10:20:58 -0600 David Ahern wrote: > On 7/11/23 10:00 AM, Jakub Kicinski wrote: > >> RDMA works with the AMD and Intel intree drivers using DMABUF without > >> requiring struct pages using the DRM hacky scatterlist approach. > > I see, thanks. We need pages primarily for refcounting. Avoiding all > > the infamous problems with memory pins. Oh well. > > io_uring for example already manages the page pinning. An skb flag was > added for ZC Tx API to avoid refcounting in the core networking layer. Right, we can refcount in similar fashion. Still tracking explicitly when buffers are handed over to the NIC. > Any reason not to allow an alternative representation for skb frags than > struct page? I don't think there's a hard technical reason. We can make it work.
On Tue, Jul 11, 2023 at 09:00:47AM -0700, Jakub Kicinski wrote: > > So is Infiniband, Jakub has a unique definition of "proprietary". > > For IB AFAIU there's only one practically usable vendor, such an > impressive ecosystem!! IB has roce (RDMA Over Converged Ethernet, better thought of as IB over Ethernet) under it's standization umbrella and every major commodity NIC vendor (mellanox, broadcom, intel) now implements roce. We also have the roce support in the switch from all major switch vendors. IB as a link layer "failed" with most implementors leaving the ecosystem broadly because Ethernet link layer always wins in networking - this is more of an economic outcome than a standardization outcome in my mind. Due to roce, IB as a transport and software protocol has a solid multi-vendor ecosystem. Jason
On Tue, Jul 11, 2023 at 9:32 AM Jakub Kicinski <kuba@kernel.org> wrote: > > On Tue, 11 Jul 2023 10:20:58 -0600 David Ahern wrote: > > On 7/11/23 10:00 AM, Jakub Kicinski wrote: > > >> RDMA works with the AMD and Intel intree drivers using DMABUF without > > >> requiring struct pages using the DRM hacky scatterlist approach. > > > I see, thanks. We need pages primarily for refcounting. Avoiding all > > > the infamous problems with memory pins. Oh well. > > > > io_uring for example already manages the page pinning. An skb flag was > > added for ZC Tx API to avoid refcounting in the core networking layer. > > Right, we can refcount in similar fashion. Still tracking explicitly > when buffers are handed over to the NIC. > > > Any reason not to allow an alternative representation for skb frags than > > struct page? > > I don't think there's a hard technical reason. We can make it work. I also think we can switch the representation for skb frags to something else. However - please do correct me if I'm wrong - I don't think that is sufficient for device memory TCP. My understanding is that we also need to modify any NIC drivers that want to use device memory TCP to understand a new memory type, and the page pool as well if that's involved. I think in particular modifying the memory type in all the NIC drivers that want to do device memory TCP is difficult. Do you think this is feasible? -- Thanks, Mina
On Tue, 11 Jul 2023 13:42:47 -0300 Jason Gunthorpe wrote: > On Tue, Jul 11, 2023 at 09:00:47AM -0700, Jakub Kicinski wrote: > > > So is Infiniband, Jakub has a unique definition of "proprietary". > > > > For IB AFAIU there's only one practically usable vendor, such an > > impressive ecosystem!! > > IB has roce (RDMA Over Converged Ethernet, better thought of as IB > over Ethernet) under it's standization umbrella and every major > commodity NIC vendor (mellanox, broadcom, intel) now implements > roce. Now we're getting into our favorite argument and completely sidetracking the conversation, aren't we? :) And as usual our ability to present facts is limited by various NDAs.. > We also have the roce support in the switch from all major > switch vendors. By which you mean all major switch vendors should support basic RoCE requirements. But most vendors will try to put special features into their switches trying to make the full NIC + switch solution as sticky as possible. > IB as a link layer "failed" with most implementors leaving the Interesting. > ecosystem broadly because Ethernet link layer always wins in > networking - this is more of an economic outcome than a > standardization outcome in my mind. > > Due to roce, IB as a transport and software protocol has a solid > multi-vendor ecosystem. Last I checked every generation of HW from even a single vendor came out with a new congestion control algorithm and add-ons. All closed source and usually patented. That is the core of a transport protocol, the intelligence, not the frame format (which may or may not itself deviate thru various encaps and stolen bits).
On Tue, Jul 11, 2023 at 6:11 AM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > On Mon, Jul 10, 2023 at 05:45:05PM -0700, Mina Almasry wrote: > > > > At least from my position I want to see MEMORY_DEVICE_PCI_P2PDMA used > > > to represent P2P memory. > > > > Would using p2pdma API instead of dmabuf be an acceptable direction? > > "p2pdma API" is really just using MEMORY_DEVICE_PCI_P2PDMA and > teaching the pagepool how to work with ZONE_DEVICE pages. > Yes, that's what I have in mind. Roughly something along the lines where the device memory provider (GPU or what not) does something like a pci_p2pdma_add_resource(), and the NIC client driver allocates the pages from the resource, and if is_pci_p2pdma_page() use pci_p2pdma_map_sg() instead of dma_map_sg() and whatnot. > I suspect this will clash badly with Matthew's work here: > > https://lore.kernel.org/all/20230111042214.907030-1-willy@infradead.org/ > > As from a mm side we haven't ever considered that ZONE_DEVICE and > "netmem" can be composed together. The entire point of netmem like > stuff is that the allocator hands over the majority of struct page to > the allocatee, and ZONE_DEVICE can't work like that. > > However, assuming that can be solved in some agreeable way then it > would be OK to go down this path. > I think there is potential to solve this in an agreeable way. We may be able to get netmem ZONE_DEVICE pages using the memdesc idea you proposed, or something like the xarray meta data you mention below. If not that, I think Jakub already said that he is considering coming up with a 'page pool like' API that drivers can use, with an implementation that is compatible with ZONE_DEVICE pages. > But, I feel like this is just overall too hard a direction from the mm > perspective. > > I don't know anything about page pool, but the main sticking point is > its reliance on struct page. If it can find another way to locate its > meta data (eg an xarray), at least for some cases, it would make > things alot easier. >
On Tue, Jul 11, 2023 at 10:06:36AM -0700, Jakub Kicinski wrote: > Now we're getting into our favorite argument and completely > sidetracking the conversation, aren't we? :) And as usual > our ability to present facts is limited by various NDAs.. Yes, well, maybe I should stop taking the bait everytime you write "proprietary" :) > > We also have the roce support in the switch from all major > > switch vendors. > > By which you mean all major switch vendors should support basic RoCE > requirements. But most vendors will try to put special features into > their switches trying to make the full NIC + switch solution as sticky > as possible. Yep. At the high end open standards based ethernet has also notably "failed" as well. Every switch vendor now offers their own proprietary ecosystem on a whole bunch of different axis. They all present "ethernet" toward the host but the host often needs to work in a special way to really take full advantage of the proprietary fabric behaviors. > Last I checked every generation of HW from even a single vendor came out > with a new congestion control algorithm and add-ons. Probably, but I don't really view this as an IB or roce issue. Back in the day, there was "data center ethernet" which was a standardization effort to try and tame some of these problems. roce was imagined as an important workload over DCE, but the effort was ethernet focused and generic. Sadly DCE and successor standard based congestion mangement approaches did not work, or were "standardized" in a way that had a big hole that needed to be filled with proprietary algorithms. Eventualy the interest in standardization seems to have waned and several of the big network operators seem to be valuing their unique congestion management as a proprietary element. From a vendor perspective this is has turned into an interop train wreck. Sigh. roce is just highly sensitive to loss - which is managed in ethernet through congestion management. This is why you see roce and congestion management so tightly linked, and perhaps in some deployments becomes the motivating reason to look at congestion management. However, TCP under congestion management is also very interesting and is a motivation to deploy congestion management in its own right in some cases. Jason
On Tue, 11 Jul 2023 15:52:37 -0300 Jason Gunthorpe wrote: > > Now we're getting into our favorite argument and completely > > sidetracking the conversation, aren't we? :) And as usual > > our ability to present facts is limited by various NDAs.. > > Yes, well, maybe I should stop taking the bait everytime you write > "proprietary" :) > > > > We also have the roce support in the switch from all major > > > switch vendors. > > > > By which you mean all major switch vendors should support basic RoCE > > requirements. But most vendors will try to put special features into > > their switches trying to make the full NIC + switch solution as sticky > > as possible. > > Yep. At the high end open standards based ethernet has also notably > "failed" as well. Every switch vendor now offers their own proprietary > ecosystem on a whole bunch of different axis. They all present > "ethernet" toward the host but the host often needs to work in a > special way to really take full advantage of the proprietary fabric > behaviors. I'm not familiar with "high end open standards based on ethernet", would those be some RDMA / storage things? For TCP/IP networks pretty much the only things that matter in a switch are bandwidth, size of buffers, power... Implementation stuff. > > Last I checked every generation of HW from even a single vendor came out > > with a new congestion control algorithm and add-ons. > > Probably, but I don't really view this as an IB or roce issue. > > Back in the day, there was "data center ethernet" which was a > standardization effort to try and tame some of these problems. roce > was imagined as an important workload over DCE, but the effort was > ethernet focused and generic. Sadly DCE and successor standard based > congestion mangement approaches did not work, or were "standardized" > in a way that had a big hole that needed to be filled with proprietary > algorithms. Eventualy the interest in standardization seems to have > waned and several of the big network operators seem to be valuing > their unique congestion management as a proprietary element. From a > vendor perspective this is has turned into an interop train > wreck. Sigh. > > roce is just highly sensitive to loss - which is managed in ethernet > through congestion management. This is why you see roce and congestion > management so tightly linked, and perhaps in some deployments becomes > the motivating reason to look at congestion management. A lot of "standardization" efforts are just attempts to prove to a buyers that an ecosystem exists. Open source the firmware. Let people actually hack on it and when the users bring their own algorithms de facto standardization will happen. Short of that it's all smoke and mirrors.
On Tue, 11 Jul 2023 10:06:28 -0700 Mina Almasry wrote: > > > Any reason not to allow an alternative representation for skb frags than > > > struct page? > > > > I don't think there's a hard technical reason. We can make it work. > > I also think we can switch the representation for skb frags to > something else. However - please do correct me if I'm wrong - I don't > think that is sufficient for device memory TCP. My understanding is > that we also need to modify any NIC drivers that want to use device > memory TCP to understand a new memory type, and the page pool as well > if that's involved. I think in particular modifying the memory type in > all the NIC drivers that want to do device memory TCP is difficult. Do > you think this is feasible? That's why I was thinking about adding an abstraction between the page pool and the driver. Instead of feeding driver pages a new abstraction could feed the driver just an identifier and a PA. Whether we want to support fragmentation in that model or not would have to be decided. We can take pages from the page pool and feed them to drivers via such an API, but drivers need to stop expecting pages. That's for data buffers only, obviously. We can keep using pages and raw page pool for headers.
On 7/11/23 2:39 PM, Jakub Kicinski wrote: > On Tue, 11 Jul 2023 10:06:28 -0700 Mina Almasry wrote: >>>> Any reason not to allow an alternative representation for skb frags than >>>> struct page? >>> >>> I don't think there's a hard technical reason. We can make it work. >> >> I also think we can switch the representation for skb frags to >> something else. However - please do correct me if I'm wrong - I don't >> think that is sufficient for device memory TCP. My understanding is >> that we also need to modify any NIC drivers that want to use device >> memory TCP to understand a new memory type, and the page pool as well >> if that's involved. I think in particular modifying the memory type in >> all the NIC drivers that want to do device memory TCP is difficult. Do >> you think this is feasible? > > That's why I was thinking about adding an abstraction between > the page pool and the driver. Instead of feeding driver pages > a new abstraction could feed the driver just an identifier and a PA. skb frag is currently a bio_vec. Overloading the 'struct page' address in that struct with another address is easy to do. Requiring a certain alignment on the address gives you a few low bits to use a flags / magic / etc. Overloading len and offset is not really possible - way too much code is affected (e.g., iov walking and MSS / TSO segmenting). ie., you could overload page address with a pointer to an object in your new abstraction layer and the struct has the other meta data. typedef struct skb_frag { union { struct bio_vec bvec; struct new_abstraction abs; }; } skb_frag_t; where struct new_abstraction { void *addr, unsigned int len; unsigned int offset; }; I have been playing with a similar and it co-exists with the existing code quite well with the constraint on location of len and offset. > > Whether we want to support fragmentation in that model or not would > have to be decided. > > We can take pages from the page pool and feed them to drivers via > such an API, but drivers need to stop expecting pages. yes, drivers would have to be updated to understand the new format. A downside, but again relatively easy to manage. > > That's for data buffers only, obviously. We can keep using pages > and raw page pool for headers. yes.
On Tue, Jul 11, 2023 at 01:34:20PM -0700, Jakub Kicinski wrote: > > Yep. At the high end open standards based ethernet has also notably > > "failed" as well. Every switch vendor now offers their own proprietary > > ecosystem on a whole bunch of different axis. They all present > > "ethernet" toward the host but the host often needs to work in a > > special way to really take full advantage of the proprietary fabric > > behaviors. > > I'm not familiar with "high end open standards based on ethernet", would > those be some RDMA / storage things? For TCP/IP networks pretty much > the only things that matter in a switch are bandwidth, size of buffers, > power... Implementation stuff. I would say when you are getting into ethernet deployments with 25 or 51 Tbps switches directly connected to hosts running at >100G you are getting into the high end side of things. These are very expensive networks. They run complex congestion provoking workloads. They have sophisticated multi-pathing. They often use use a non-blocking topology. Congestion management is important. Making this work with good utilization, and low tail latency is a really hard problem. Many of the solutions come with switch features supporting it. You'd proably say these are not TCP focused networks, even though they are based on ethernet and IP. So I think of them as high end "standards based" ethernet and IP looking networks that have proprietary elements mixed in throughout. Right now there is a bit of a press war between vendors on 'ethernet for AI'. Both Broadcom and NVIDIA are taking techonlogies that were originally built for TCP ethernet networks and remixing/speeding them up to run roce workloads effectively. There is alot more information available now without NDA that shows some detail on this space. AWS's SRD multipathing, Broadcom "AI Ethernet" and NVIDIA's Spectrum-X spring to mind as topical to what these sorts of ethernet networks are. > A lot of "standardization" efforts are just attempts to prove to > a buyers that an ecosystem exists. Heh, that was probably more true years ago. These days it seems like some standardization is also being done so the large hyperscalers can improve their Approved Vendors List. I suppose as long as the result is something we can implement openly in Linux the motivation for standardization is less important. Jason
On Tue, Jul 11, 2023 at 2:39 PM David Ahern <dsahern@kernel.org> wrote: > > On 7/11/23 2:39 PM, Jakub Kicinski wrote: > > On Tue, 11 Jul 2023 10:06:28 -0700 Mina Almasry wrote: > >>>> Any reason not to allow an alternative representation for skb frags than > >>>> struct page? > >>> > >>> I don't think there's a hard technical reason. We can make it work. > >> > >> I also think we can switch the representation for skb frags to > >> something else. However - please do correct me if I'm wrong - I don't > >> think that is sufficient for device memory TCP. My understanding is > >> that we also need to modify any NIC drivers that want to use device > >> memory TCP to understand a new memory type, and the page pool as well > >> if that's involved. I think in particular modifying the memory type in > >> all the NIC drivers that want to do device memory TCP is difficult. Do > >> you think this is feasible? > > > > That's why I was thinking about adding an abstraction between > > the page pool and the driver. Instead of feeding driver pages > > a new abstraction could feed the driver just an identifier and a PA. > > skb frag is currently a bio_vec. Overloading the 'struct page' address > in that struct with another address is easy to do. Requiring a certain > alignment on the address gives you a few low bits to use a flags / magic > / etc. > > Overloading len and offset is not really possible - way too much code is > affected (e.g., iov walking and MSS / TSO segmenting). > > ie., you could overload page address with a pointer to an object in your > new abstraction layer and the struct has the other meta data. > > typedef struct skb_frag { > union { > struct bio_vec bvec; > struct new_abstraction abs; > }; > } skb_frag_t; > > where > > struct new_abstraction { > void *addr, > unsigned int len; > unsigned int offset; > }; > > I have been playing with a similar and it co-exists with the existing > code quite well with the constraint on location of len and offset. > > > > > Whether we want to support fragmentation in that model or not would > > have to be decided. > > > > We can take pages from the page pool and feed them to drivers via > > such an API, but drivers need to stop expecting pages. > > yes, drivers would have to be updated to understand the new format. A > downside, but again relatively easy to manage. > I'm glad to see that you're open to this approach. As far as I understand, getting device memory in a struct page form would still be preferred, no? And the approach you point to would be a backup plan I presume? Since the good folks on this thread have pointed me to p2pdma to address my use case, I've been doing some homework to see if it can apply. AFACT so far, it applies, and Willem actually had a prototype of it working a while back. The rough approach Willem and I are thinking of would be something like: 1. The device memory driver would be the p2pdma provider. It would expose a user API which allocates a device memory region, calls pci_p2pdma_add_resource() and pci_p2pmem_publish() on it, and returns a reference to it to the userspace. 2. The NIC driver would be the p2pdma client and orchestrator. It would expose a user API which binds an rxq to a pci device. Prior to the bind the user API would check that the pci device has published p2p memory (pci_has_p2pmem()), and check the the p2p mem is accessible to the driver (pci_p2pdma_distance() I think), etc. 3. The NIC would allocate pages from the p2pdma provider for incoming packets, and create devmem skbs, and deliver the devmem skbs to the user using the support in my RFC. AFACT all that code need not be changed. AFAICT, all the concerns brought up in this thread are sidestepped by using p2pdma. I need not allocate struct pages in the core dma-buf code anymore (or anywhere), and I need not allocate pgmaps. I would just re-use the p2pdma support. Anyone see any glaring issues with this approach? I plan on trying to implement a PoC and sending an RFC v2. The only pending concern is integration with the page pool, but we already have some ideas on how to solve that. > > > > That's for data buffers only, obviously. We can keep using pages > > and raw page pool for headers. > > yes.
Am 12.07.23 um 05:42 schrieb Mina Almasry: > On Tue, Jul 11, 2023 at 2:39 PM David Ahern <dsahern@kernel.org> wrote: >> On 7/11/23 2:39 PM, Jakub Kicinski wrote: >>> On Tue, 11 Jul 2023 10:06:28 -0700 Mina Almasry wrote: >>>>>> Any reason not to allow an alternative representation for skb frags than >>>>>> struct page? >>>>> I don't think there's a hard technical reason. We can make it work. >>>> I also think we can switch the representation for skb frags to >>>> something else. However - please do correct me if I'm wrong - I don't >>>> think that is sufficient for device memory TCP. My understanding is >>>> that we also need to modify any NIC drivers that want to use device >>>> memory TCP to understand a new memory type, and the page pool as well >>>> if that's involved. I think in particular modifying the memory type in >>>> all the NIC drivers that want to do device memory TCP is difficult. Do >>>> you think this is feasible? >>> That's why I was thinking about adding an abstraction between >>> the page pool and the driver. Instead of feeding driver pages >>> a new abstraction could feed the driver just an identifier and a PA. >> skb frag is currently a bio_vec. Overloading the 'struct page' address >> in that struct with another address is easy to do. Requiring a certain >> alignment on the address gives you a few low bits to use a flags / magic >> / etc. >> >> Overloading len and offset is not really possible - way too much code is >> affected (e.g., iov walking and MSS / TSO segmenting). >> >> ie., you could overload page address with a pointer to an object in your >> new abstraction layer and the struct has the other meta data. >> >> typedef struct skb_frag { >> union { >> struct bio_vec bvec; >> struct new_abstraction abs; >> }; >> } skb_frag_t; >> >> where >> >> struct new_abstraction { >> void *addr, >> unsigned int len; >> unsigned int offset; >> }; >> >> I have been playing with a similar and it co-exists with the existing >> code quite well with the constraint on location of len and offset. >> >>> Whether we want to support fragmentation in that model or not would >>> have to be decided. >>> >>> We can take pages from the page pool and feed them to drivers via >>> such an API, but drivers need to stop expecting pages. >> yes, drivers would have to be updated to understand the new format. A >> downside, but again relatively easy to manage. >> > I'm glad to see that you're open to this approach. As far as I > understand, getting device memory in a struct page form would still be > preferred, no? And the approach you point to would be a backup plan I > presume? Well yes and no, if you need struct pages depends on what you want to do. struct pages are an approach to manage memory regions, but P2PDMA is essentially about pumping data between devices. If your data is for example organized in files on a filesystem then having struct pages is a must have because you need to be able to manage references, address spaces and so one. If a pages was acquired with alloc_pages() in a driver then you have struct pages, but you can't necessary use them the way you want to use them because the first few dwords have different meaning depending on the use case. And then you have the use case where you have for example micro controllers using P2PDMA to talk with each other. In this case your DMA address might not be memory at all, but rather MMIO. E.g. doorbells on graphics hw is such a case as well as cameras and encoders which communicate directly with each other. > Since the good folks on this thread have pointed me to p2pdma to > address my use case, I've been doing some homework to see if it can > apply. AFACT so far, it applies, and Willem actually had a prototype > of it working a while back. The rough approach Willem and I are > thinking of would be something like: > > 1. The device memory driver would be the p2pdma provider. It would > expose a user API which allocates a device memory region, calls > pci_p2pdma_add_resource() and pci_p2pmem_publish() on it, and returns > a reference to it to the userspace. > > 2. The NIC driver would be the p2pdma client and orchestrator. It > would expose a user API which binds an rxq to a pci device. Prior to > the bind the user API would check that the pci device has published > p2p memory (pci_has_p2pmem()), and check the the p2p mem is accessible > to the driver (pci_p2pdma_distance() I think), etc. > > 3. The NIC would allocate pages from the p2pdma provider for incoming > packets, and create devmem skbs, and deliver the devmem skbs to the > user using the support in my RFC. AFACT all that code need not be > changed. > > AFAICT, all the concerns brought up in this thread are sidestepped by > using p2pdma. I need not allocate struct pages in the core dma-buf > code anymore (or anywhere), and I need not allocate pgmaps. I would > just re-use the p2pdma support. > > Anyone see any glaring issues with this approach? I plan on trying to > implement a PoC and sending an RFC v2. Well we already have DMA-buf as user API for this use case, which is perfectly supported by RDMA if I'm not completely mistaken. So what problem do you try to solve here actually? Regards, Christian. > > The only pending concern is integration with the page pool, but we > already have some ideas on how to solve that. > >>> That's for data buffers only, obviously. We can keep using pages >>> and raw page pool for headers. >> yes. > >
On Tue, Jul 11, 2023 at 08:42:24PM -0700, Mina Almasry wrote: > 1. The device memory driver would be the p2pdma provider. It would > expose a user API which allocates a device memory region, calls > pci_p2pdma_add_resource() and pci_p2pmem_publish() on it, and returns > a reference to it to the userspace. This is not quite right, if you convert any of the GPU drivers to use P2PDMA you are going to need to restructure the p2pmem stuff to seperate the genalloc. The GPU driver must continue to be the owner and allocator of the MMIO memory it already controls, we can't have two allocators working in parallel. The genalloc stuff supports the special NVMe use case, I don't know of anything else that would like to work that way. > 2. The NIC driver would be the p2pdma client and orchestrator. It > would expose a user API which binds an rxq to a pci device. Prior to > the bind the user API would check that the pci device has published > p2p memory (pci_has_p2pmem()), and check the the p2p mem is accessible > to the driver (pci_p2pdma_distance() I think), etc. This doesn't fit the programming model for GPUs at all. You don't want to get packets landing in random GPU memory that a kernel side allocator selects, you want packets landing in GPU memory owned by a specific process that owns the TCP connection. This is why DMABUF is used here as it gives a handle to the GPU memory. What you want is to get the P2P pages either directly from the DMABUF or via pin_user_pages() on the DMABUF's mmap. > AFAICT, all the concerns brought up in this thread are sidestepped by > using p2pdma. I need not allocate struct pages in the core dma-buf > code anymore (or anywhere), and I need not allocate pgmaps. I would > just re-use the p2pdma support. Well, as I said it is going to be a big ask to P2P enable any of the DRM drivers. And you still have the netmem vs zone_device struct page conflict to figure out But it is alot closer to reasonable than this RFC. Jason
On Wed, Jul 12, 2023 at 09:55:51AM +0200, Christian König wrote: > > Anyone see any glaring issues with this approach? I plan on trying to > > implement a PoC and sending an RFC v2. > > Well we already have DMA-buf as user API for this use case, which is > perfectly supported by RDMA if I'm not completely mistaken. > > So what problem do you try to solve here actually? In a nutshell, netdev's design currently needs struct pages to do DMA to it's packet buffers. So it cannot consume the scatterlist that dmabuf puts out RDMA doesn't need struct pages at all, so it is fine. If Mina can go down the path of changing netdev to avoid needing struct pages then no changes to DRM side things. Otherwise a P2P struct page and a co-existance with netmem on a ZONE_DEVICE page would be required. :\ Jason
Am 12.07.23 um 15:03 schrieb Jason Gunthorpe: > On Wed, Jul 12, 2023 at 09:55:51AM +0200, Christian König wrote: > >>> Anyone see any glaring issues with this approach? I plan on trying to >>> implement a PoC and sending an RFC v2. >> Well we already have DMA-buf as user API for this use case, which is >> perfectly supported by RDMA if I'm not completely mistaken. >> >> So what problem do you try to solve here actually? > In a nutshell, netdev's design currently needs struct pages to do DMA > to it's packet buffers. > > So it cannot consume the scatterlist that dmabuf puts out > > RDMA doesn't need struct pages at all, so it is fine. > > If Mina can go down the path of changing netdev to avoid needing > struct pages then no changes to DRM side things. > > Otherwise a P2P struct page and a co-existance with netmem on a > ZONE_DEVICE page would be required. :\ Uff, depending on why netdev needs struct page (I think I have a good idea why) this isn't really going to work generically either way. What we maybe able to do is to allow copy_file_range() between DMA-buf file descriptor and a TCP socket. If I'm not completely mistaken that should then end up in DMA-bufs file_operations->copy_file_range callback (maybe with some minor change to allows this). The DMA-buf framework could then forward this to the exporter of the memory which owns the backing memory could then do the necessary steps. Regards, Christian. > > Jason
On Wed, Jul 12, 2023 at 6:01 AM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > On Tue, Jul 11, 2023 at 08:42:24PM -0700, Mina Almasry wrote: > > > 1. The device memory driver would be the p2pdma provider. It would > > expose a user API which allocates a device memory region, calls > > pci_p2pdma_add_resource() and pci_p2pmem_publish() on it, and returns > > a reference to it to the userspace. > > This is not quite right, if you convert any of the GPU drivers to use > P2PDMA you are going to need to restructure the p2pmem stuff to > seperate the genalloc. The GPU driver must continue to be the owner > and allocator of the MMIO memory it already controls, we can't have > two allocators working in parallel. > > The genalloc stuff supports the special NVMe use case, I don't know of > anything else that would like to work that way. > I think maybe you misunderstood the proposal. AFAICT the genalloc stuff works for us, although there are other issues with p2pdma that I need to solve. The proposal was that the uapi in step #1 allocates a region of GPU memory, and sets up a p2pdma provider for this region of memory. From the perspective of the GPU, the memory is allocated, and in use by the user. The p2pdma provider uses genalloc to give out 4K regions with struct pages to in-kernel allocators from this memory region. Why would that not work? Looking at the code, that seems to be how p2pdma works today. The p2pdma provider does p2pdma_add_resource() on a chunk of its memory, and the genalloc allocates memory from that chunk? The actual issues I see with this approach are: 1. There is no way for the p2pdma provider to relinquish back the memory it has provided via pci_p2pdma_add_resource(), in the case that the user crashed or would like to free the GPU buffer. I would need to add a pci_p2pdma_remove_resource(). Would that be acceptable? 2. The p2pdma semantics seem to be global to the pci device. I.e., 1 GPU can export 1 p2pdma resource at a time (the way I'm reading the API). This is not usable for my use case. I would need multiple users to be able to use the uapi in step #1 simultaneously. I would need the same pci device to export different p2pdma resources simultaneously and the p2pdma clients would need to be able to import some of the resources. I would likely need to add an api like this: diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h index 8318a97c9c61..c9d754713fdc 100644 --- a/include/linux/pci-p2pdma.h +++ b/include/linux/pci-p2pdma.h @@ -19,6 +19,33 @@ struct scatterlist; #ifdef CONFIG_PCI_P2PDMA int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size, u64 offset); + +/* Adds a resource similar to pci_p2pdma_add_resource, and returns a file + * handle referring to this resource. Multiple such resources can be exported + * by the same pci device. + */ +struct file *pci_p2pdma_add_resource_with_handle(struct pci_dev *pdev, + int bar, + size_t size, + u64 offset); + +/* Remove a resource added via pci_p2pdma_add_resource_with_handle() */ +struct file *pci_p2pdma_remove_resource_with_handle( + struct file *p2pdma_resource_file); + +/* Allocates memory from a resource created using + * pci_p2pdma_add_resource_with_handle() + */ +void *pci_alloc_p2pmem_from_handle(struct file *p2pdma_resource_file, + size_t size); + +/* Frees p2pmem to a resource created using + * pci_p2pdma_add_resource_with_handle() + */ +void pci_free_p2pmem_to_handle(struct pci_dev *p2pdma_resource_file, + void *addr, + size_t size); + Looking for feedback from anyone knowledgeable, but the p2pdma maintainers as well if possibl. > > 2. The NIC driver would be the p2pdma client and orchestrator. It > > would expose a user API which binds an rxq to a pci device. Prior to > > the bind the user API would check that the pci device has published > > p2p memory (pci_has_p2pmem()), and check the the p2p mem is accessible > > to the driver (pci_p2pdma_distance() I think), etc. > > This doesn't fit the programming model for GPUs at all. You don't want > to get packets landing in random GPU memory that a kernel side > allocator selects, you want packets landing in GPU memory owned by a > specific process that owns the TCP connection. > I think this comment is maybe a side effect of the misunderstanding. In the proposal, the user allocates a GPU buffer using the API in step #1, and then binds the memory to the NIC rx queue using the API specified in step #2. We use flow steering & RSS to steer this user's TCP traffic to the buffer owned by them. > This is why DMABUF is used here as it gives a handle to the GPU > memory. What you want is to get the P2P pages either directly from the > DMABUF or via pin_user_pages() on the DMABUF's mmap. > > > AFAICT, all the concerns brought up in this thread are sidestepped by > > using p2pdma. I need not allocate struct pages in the core dma-buf > > code anymore (or anywhere), and I need not allocate pgmaps. I would > > just re-use the p2pdma support. > > Well, as I said it is going to be a big ask to P2P enable any of the > DRM drivers. > > And you still have the netmem vs zone_device struct page conflict to > figure out > > But it is alot closer to reasonable than this RFC. > > Jason
On Wed, Jul 12, 2023 at 6:35 AM Christian König <christian.koenig@amd.com> wrote: > > Am 12.07.23 um 15:03 schrieb Jason Gunthorpe: > > On Wed, Jul 12, 2023 at 09:55:51AM +0200, Christian König wrote: > > > >>> Anyone see any glaring issues with this approach? I plan on trying to > >>> implement a PoC and sending an RFC v2. > >> Well we already have DMA-buf as user API for this use case, which is > >> perfectly supported by RDMA if I'm not completely mistaken. > >> > >> So what problem do you try to solve here actually? > > In a nutshell, netdev's design currently needs struct pages to do DMA > > to it's packet buffers. > > > > So it cannot consume the scatterlist that dmabuf puts out > > > > RDMA doesn't need struct pages at all, so it is fine. > > > > If Mina can go down the path of changing netdev to avoid needing > > struct pages then no changes to DRM side things. > > > > Otherwise a P2P struct page and a co-existance with netmem on a > > ZONE_DEVICE page would be required. :\ > > Uff, depending on why netdev needs struct page (I think I have a good > idea why) this isn't really going to work generically either way. > > What we maybe able to do is to allow copy_file_range() between DMA-buf > file descriptor and a TCP socket. > > If I'm not completely mistaken that should then end up in DMA-bufs > file_operations->copy_file_range callback (maybe with some minor change > to allows this). > > The DMA-buf framework could then forward this to the exporter of the > memory which owns the backing memory could then do the necessary steps. > I may be missing something, but the way it works on our end for receive is that we give a list of buffers (dma_addr + length + other metadata) to the network card, and the network card writes incoming packets to these dma_addrs and gives us an rx completion pointing to the data it DMA'd. Usually the network card does something like an alloc_page() + dma_map_page() and provides the to the network card. Transmit path works similarly. Not sure that adding copy_file_range() support to dma-buf enables this in some way.
On Wed, Jul 12, 2023 at 01:16:00PM -0700, Mina Almasry wrote: > The proposal was that the uapi in step #1 allocates a region of GPU > memory, and sets up a p2pdma provider for this region of memory. Honestly that feels too hacky, which is why I've said a few times you'd actually need to integrate p2p pages into a DRM driver properly. > 2. The p2pdma semantics seem to be global to the pci device. I.e., 1 > GPU can export 1 p2pdma resource at a time (the way I'm reading the > API). All of this is just a reflection that you are trying to use it in a wrong hacky way. The driver is supposed to create a p2p registration for its entire BAR at startup. Not on demand and not in fragments. Jason
Am 12.07.23 um 22:16 schrieb Mina Almasry: > On Wed, Jul 12, 2023 at 6:01 AM Jason Gunthorpe <jgg@ziepe.ca> wrote: >> On Tue, Jul 11, 2023 at 08:42:24PM -0700, Mina Almasry wrote: >> >>> 1. The device memory driver would be the p2pdma provider. It would >>> expose a user API which allocates a device memory region, calls >>> pci_p2pdma_add_resource() and pci_p2pmem_publish() on it, and returns >>> a reference to it to the userspace. >> This is not quite right, if you convert any of the GPU drivers to use >> P2PDMA you are going to need to restructure the p2pmem stuff to >> seperate the genalloc. The GPU driver must continue to be the owner >> and allocator of the MMIO memory it already controls, we can't have >> two allocators working in parallel. >> >> The genalloc stuff supports the special NVMe use case, I don't know of >> anything else that would like to work that way. >> > I think maybe you misunderstood the proposal. AFAICT the genalloc > stuff works for us, although there are other issues with p2pdma that I > need to solve. > > The proposal was that the uapi in step #1 allocates a region of GPU > memory, and sets up a p2pdma provider for this region of memory. From > the perspective of the GPU, the memory is allocated, and in use by the > user. The p2pdma provider uses genalloc to give out 4K regions with > struct pages to in-kernel allocators from this memory region. Why > would that not work? Oh well, where should I start. struct page is used in the various I/O subsystems instead of DMA addresses because it allows for a wider range of operations. For example when a page is acquired using get_user_pages() somebody can use the rmap to figure out where a page is mapped and eventually unmap it, map it read only or change the caching attributes etc... Then you have the ability to grab a reference to a page, this for example allows I/O operations to complete and not access freed memory even when the application has already long died. Then a very common use case is that you need to fallback to a CPU copy because the data inside the page isn't aligned or outside the physical reach of a device. The are just numerous issues with what I listed above, for example some of those use cases only work with pagecache pages. Approaching it from the user side, with GPUs there is usually no guarantee that stuff is coherent. E.g. a network card wouldn't automatically see the results of a calculation. Then GPUs usually have tons of memory which isn't CPU accessible or even PCIe bus accessible. So a bounce buffer done with a CPU copy won't work, you need to bounce this with a hw assisted copy. Or you have inter device connections. For example ethernet over HDMI links would be able to access all of the internal GPU resources. Then GPUs often need to shuffle memory around, e.g. similar functionality to ___GFP_MOVABLE. Just with stuff not CPU accessible nor mapped into CPU page tables. ... I mean I can go with this list for quite some time :) > Looking at the code, that seems to be how p2pdma > works today. The p2pdma provider does p2pdma_add_resource() on a chunk > of its memory, and the genalloc allocates memory from that chunk? Well this is how it works for NVMe, that doesn't mean this way works for GPUs or acceleration devices. Regards, Christian. > > The actual issues I see with this approach are: > > 1. There is no way for the p2pdma provider to relinquish back the > memory it has provided via pci_p2pdma_add_resource(), in the case that > the user crashed or would like to free the GPU buffer. I would need to > add a pci_p2pdma_remove_resource(). Would that be acceptable? > > 2. The p2pdma semantics seem to be global to the pci device. I.e., 1 > GPU can export 1 p2pdma resource at a time (the way I'm reading the > API). This is not usable for my use case. I would need multiple users > to be able to use the uapi in step #1 simultaneously. I would need the > same pci device to export different p2pdma resources simultaneously > and the p2pdma clients would need to be able to import some of the > resources. I would likely need to add an api like this: > > diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h > index 8318a97c9c61..c9d754713fdc 100644 > --- a/include/linux/pci-p2pdma.h > +++ b/include/linux/pci-p2pdma.h > @@ -19,6 +19,33 @@ struct scatterlist; > #ifdef CONFIG_PCI_P2PDMA > int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size, > u64 offset); > + > +/* Adds a resource similar to pci_p2pdma_add_resource, and returns a file > + * handle referring to this resource. Multiple such resources can be exported > + * by the same pci device. > + */ > +struct file *pci_p2pdma_add_resource_with_handle(struct pci_dev *pdev, > + int bar, > + size_t size, > + u64 offset); > + > +/* Remove a resource added via pci_p2pdma_add_resource_with_handle() */ > +struct file *pci_p2pdma_remove_resource_with_handle( > + struct file *p2pdma_resource_file); > + > +/* Allocates memory from a resource created using > + * pci_p2pdma_add_resource_with_handle() > + */ > +void *pci_alloc_p2pmem_from_handle(struct file *p2pdma_resource_file, > + size_t size); > + > +/* Frees p2pmem to a resource created using > + * pci_p2pdma_add_resource_with_handle() > + */ > +void pci_free_p2pmem_to_handle(struct pci_dev *p2pdma_resource_file, > + void *addr, > + size_t size); > + > > Looking for feedback from anyone knowledgeable, but the p2pdma > maintainers as well if possibl. > >>> 2. The NIC driver would be the p2pdma client and orchestrator. It >>> would expose a user API which binds an rxq to a pci device. Prior to >>> the bind the user API would check that the pci device has published >>> p2p memory (pci_has_p2pmem()), and check the the p2p mem is accessible >>> to the driver (pci_p2pdma_distance() I think), etc. >> This doesn't fit the programming model for GPUs at all. You don't want >> to get packets landing in random GPU memory that a kernel side >> allocator selects, you want packets landing in GPU memory owned by a >> specific process that owns the TCP connection. >> > I think this comment is maybe a side effect of the misunderstanding. > In the proposal, the user allocates a GPU buffer using the API in step > #1, and then binds the memory to the NIC rx queue using the API > specified in step #2. We use flow steering & RSS to steer this user's > TCP traffic to the buffer owned by them. > >> This is why DMABUF is used here as it gives a handle to the GPU >> memory. What you want is to get the P2P pages either directly from the >> DMABUF or via pin_user_pages() on the DMABUF's mmap. >> >>> AFAICT, all the concerns brought up in this thread are sidestepped by >>> using p2pdma. I need not allocate struct pages in the core dma-buf >>> code anymore (or anywhere), and I need not allocate pgmaps. I would >>> just re-use the p2pdma support. >> Well, as I said it is going to be a big ask to P2P enable any of the >> DRM drivers. >> >> And you still have the netmem vs zone_device struct page conflict to >> figure out >> >> But it is alot closer to reasonable than this RFC. >> >> Jason
On Thu, Jul 13, 2023 at 12:56 AM Christian König <christian.koenig@amd.com> wrote: > > Am 12.07.23 um 22:16 schrieb Mina Almasry: > > On Wed, Jul 12, 2023 at 6:01 AM Jason Gunthorpe <jgg@ziepe.ca> wrote: > >> On Tue, Jul 11, 2023 at 08:42:24PM -0700, Mina Almasry wrote: > >> > >>> 1. The device memory driver would be the p2pdma provider. It would > >>> expose a user API which allocates a device memory region, calls > >>> pci_p2pdma_add_resource() and pci_p2pmem_publish() on it, and returns > >>> a reference to it to the userspace. > >> This is not quite right, if you convert any of the GPU drivers to use > >> P2PDMA you are going to need to restructure the p2pmem stuff to > >> seperate the genalloc. The GPU driver must continue to be the owner > >> and allocator of the MMIO memory it already controls, we can't have > >> two allocators working in parallel. > >> > >> The genalloc stuff supports the special NVMe use case, I don't know of > >> anything else that would like to work that way. > >> > > I think maybe you misunderstood the proposal. AFAICT the genalloc > > stuff works for us, although there are other issues with p2pdma that I > > need to solve. > > > > The proposal was that the uapi in step #1 allocates a region of GPU > > memory, and sets up a p2pdma provider for this region of memory. From > > the perspective of the GPU, the memory is allocated, and in use by the > > user. The p2pdma provider uses genalloc to give out 4K regions with > > struct pages to in-kernel allocators from this memory region. Why > > would that not work? > > Oh well, where should I start. > > struct page is used in the various I/O subsystems instead of DMA > addresses because it allows for a wider range of operations. > > For example when a page is acquired using get_user_pages() somebody can > use the rmap to figure out where a page is mapped and eventually unmap > it, map it read only or change the caching attributes etc... > > Then you have the ability to grab a reference to a page, this for > example allows I/O operations to complete and not access freed memory > even when the application has already long died. > > Then a very common use case is that you need to fallback to a CPU copy > because the data inside the page isn't aligned or outside the physical > reach of a device. > > The are just numerous issues with what I listed above, for example some > of those use cases only work with pagecache pages. > > Approaching it from the user side, with GPUs there is usually no > guarantee that stuff is coherent. E.g. a network card wouldn't > automatically see the results of a calculation. > > Then GPUs usually have tons of memory which isn't CPU accessible or even > PCIe bus accessible. So a bounce buffer done with a CPU copy won't work, > you need to bounce this with a hw assisted copy. Or you have inter > device connections. For example ethernet over HDMI links would be able > to access all of the internal GPU resources. > > Then GPUs often need to shuffle memory around, e.g. similar > functionality to ___GFP_MOVABLE. Just with stuff not CPU accessible nor > mapped into CPU page tables. > > ... > > I mean I can go with this list for quite some time :) > > > Looking at the code, that seems to be how p2pdma > > works today. The p2pdma provider does p2pdma_add_resource() on a chunk > > of its memory, and the genalloc allocates memory from that chunk? > > Well this is how it works for NVMe, that doesn't mean this way works for > GPUs or acceleration devices. > Thanks Christian, yes, I misunderstood how this works and I apologize for that. From reading the p2pdma interface it looked to me like it takes a buffer as input and goes ahead and allocates the struct pages for it and exports them as a provider. As you and Jason have repeatedly tried to explain to me this bit is NVMe specific and as Jason puts it it is a "big ask to P2P enable any of the DRM drivers". I am facing some logistical issues as well. My use case requires NIC with special features, and even getting access to the appropriate hardware may be an issue for me. I guess the remaining option not fully explored is the idea of getting the networking stack to consume the scatterlist that dma_buf_map_attachment() provides for the device memory. The very rough approach I have in mind (for the RX path) is: 1. Some uapi that binds a dmabuf to an RX queue. It will do a dma_buf_map_attachment() and get the sg table. 2. We need to feed the scratterlist entries to some allocator that will chunk it up into pieces that can be allocated by the NIC for incoming traffic. I'm thinking genalloc may work for this as-is, but I may need to add one or use something else if I run into some issue. 3. We can implement a memory_provider that allocates these chunks and wraps them in a struct new_abstraction (as David called it) and feeds those into the page pool. 4. The page pool would need to be able to process these struct new_abstraction alongside the struct pages it normally gets from providers. This is maybe the most complicated part, but looking at the page pool code it doesn't seem that big of a hurdle (but I have not tried a POC yet). 5. The drivers (I looked at mlx5) seem to avoid making any mm calls on the struct pages returned by the pool; the pool abstracts everything already. The changes to the drivers may be minimal..? 6. We would need to add a new helper, skb_add_rx_new_abstraction_frag that creates a frag out of new_abstraction rather than a struct page. Once the skb frags with struct new_abstraction are in the TCP stack, they will need some special handling in code accessing the frags. But my RFC already addressed that somewhat because the frags were inaccessible in that case. In this case the frags will be both inaccessible and will not be struct pages at all (things like get_page() will not work), so more special handling will be required, maybe. I imagine the TX path would be considerably less complicated because the allocator and page pool are not involved (I think). Anyone see any glaring issues with this approach? > Regards, > Christian. > > > > > The actual issues I see with this approach are: > > > > 1. There is no way for the p2pdma provider to relinquish back the > > memory it has provided via pci_p2pdma_add_resource(), in the case that > > the user crashed or would like to free the GPU buffer. I would need to > > add a pci_p2pdma_remove_resource(). Would that be acceptable? > > > > 2. The p2pdma semantics seem to be global to the pci device. I.e., 1 > > GPU can export 1 p2pdma resource at a time (the way I'm reading the > > API). This is not usable for my use case. I would need multiple users > > to be able to use the uapi in step #1 simultaneously. I would need the > > same pci device to export different p2pdma resources simultaneously > > and the p2pdma clients would need to be able to import some of the > > resources. I would likely need to add an api like this: > > > > diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h > > index 8318a97c9c61..c9d754713fdc 100644 > > --- a/include/linux/pci-p2pdma.h > > +++ b/include/linux/pci-p2pdma.h > > @@ -19,6 +19,33 @@ struct scatterlist; > > #ifdef CONFIG_PCI_P2PDMA > > int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size, > > u64 offset); > > + > > +/* Adds a resource similar to pci_p2pdma_add_resource, and returns a file > > + * handle referring to this resource. Multiple such resources can be exported > > + * by the same pci device. > > + */ > > +struct file *pci_p2pdma_add_resource_with_handle(struct pci_dev *pdev, > > + int bar, > > + size_t size, > > + u64 offset); > > + > > +/* Remove a resource added via pci_p2pdma_add_resource_with_handle() */ > > +struct file *pci_p2pdma_remove_resource_with_handle( > > + struct file *p2pdma_resource_file); > > + > > +/* Allocates memory from a resource created using > > + * pci_p2pdma_add_resource_with_handle() > > + */ > > +void *pci_alloc_p2pmem_from_handle(struct file *p2pdma_resource_file, > > + size_t size); > > + > > +/* Frees p2pmem to a resource created using > > + * pci_p2pdma_add_resource_with_handle() > > + */ > > +void pci_free_p2pmem_to_handle(struct pci_dev *p2pdma_resource_file, > > + void *addr, > > + size_t size); > > + > > > > Looking for feedback from anyone knowledgeable, but the p2pdma > > maintainers as well if possibl. > > > >>> 2. The NIC driver would be the p2pdma client and orchestrator. It > >>> would expose a user API which binds an rxq to a pci device. Prior to > >>> the bind the user API would check that the pci device has published > >>> p2p memory (pci_has_p2pmem()), and check the the p2p mem is accessible > >>> to the driver (pci_p2pdma_distance() I think), etc. > >> This doesn't fit the programming model for GPUs at all. You don't want > >> to get packets landing in random GPU memory that a kernel side > >> allocator selects, you want packets landing in GPU memory owned by a > >> specific process that owns the TCP connection. > >> > > I think this comment is maybe a side effect of the misunderstanding. > > In the proposal, the user allocates a GPU buffer using the API in step > > #1, and then binds the memory to the NIC rx queue using the API > > specified in step #2. We use flow steering & RSS to steer this user's > > TCP traffic to the buffer owned by them. > > > >> This is why DMABUF is used here as it gives a handle to the GPU > >> memory. What you want is to get the P2P pages either directly from the > >> DMABUF or via pin_user_pages() on the DMABUF's mmap. > >> > >>> AFAICT, all the concerns brought up in this thread are sidestepped by > >>> using p2pdma. I need not allocate struct pages in the core dma-buf > >>> code anymore (or anywhere), and I need not allocate pgmaps. I would > >>> just re-use the p2pdma support. > >> Well, as I said it is going to be a big ask to P2P enable any of the > >> DRM drivers. > >> > >> And you still have the netmem vs zone_device struct page conflict to > >> figure out > >> > >> But it is alot closer to reasonable than this RFC. > >> > >> Jason >
On 7/14/23 8:55 AM, Mina Almasry wrote: > > I guess the remaining option not fully explored is the idea of getting > the networking stack to consume the scatterlist that > dma_buf_map_attachment() provides for the device memory. The very > rough approach I have in mind (for the RX path) is: > > 1. Some uapi that binds a dmabuf to an RX queue. It will do a > dma_buf_map_attachment() and get the sg table. > > 2. We need to feed the scratterlist entries to some allocator that > will chunk it up into pieces that can be allocated by the NIC for > incoming traffic. I'm thinking genalloc may work for this as-is, but I > may need to add one or use something else if I run into some issue. > > 3. We can implement a memory_provider that allocates these chunks and > wraps them in a struct new_abstraction (as David called it) and feeds > those into the page pool. > > 4. The page pool would need to be able to process these struct > new_abstraction alongside the struct pages it normally gets from > providers. This is maybe the most complicated part, but looking at the > page pool code it doesn't seem that big of a hurdle (but I have not > tried a POC yet). > > 5. The drivers (I looked at mlx5) seem to avoid making any mm calls on > the struct pages returned by the pool; the pool abstracts everything > already. The changes to the drivers may be minimal..? > > 6. We would need to add a new helper, skb_add_rx_new_abstraction_frag > that creates a frag out of new_abstraction rather than a struct page. > > Once the skb frags with struct new_abstraction are in the TCP stack, > they will need some special handling in code accessing the frags. But > my RFC already addressed that somewhat because the frags were > inaccessible in that case. In this case the frags will be both > inaccessible and will not be struct pages at all (things like > get_page() will not work), so more special handling will be required, > maybe. > > I imagine the TX path would be considerably less complicated because > the allocator and page pool are not involved (I think). > > Anyone see any glaring issues with this approach? Moving skb_frags to an alternative scheme is essential to make this work. The current page scheme to go from user virtual to pages to physical is not needed for the dmabuf use case. For the driver and hardware queue: don't you need a dedicated queue for the flow(s) in question? If not, how can you properly handle the teardown case (e.g., app crashes and you need to ensure all references to GPU memory are removed from NIC descriptors)? If you agree on this point, then you can require the dedicated queue management in the driver to use and expect only the alternative frag addressing scheme. ie., it knows the address is not struct page (validates by checking skb flag or frag flag or address magic), but a reference to say a page_pool entry (if you are using page_pool for management of the dmabuf slices) which contains the metadata needed for the use case.
On Fri, Jul 14, 2023 at 07:55:15AM -0700, Mina Almasry wrote: > Once the skb frags with struct new_abstraction are in the TCP stack, > they will need some special handling in code accessing the frags. But > my RFC already addressed that somewhat because the frags were > inaccessible in that case. In this case the frags will be both > inaccessible and will not be struct pages at all (things like > get_page() will not work), so more special handling will be required, > maybe. It seems sort of reasonable, though there will be interesting concerns about coherence and synchronization with generial purpose DMABUFs that will need tackling. Still it is such a lot of churn and weridness in the netdev side, I think you'd do well to present an actual full application as justification. Yes, you showed you can stick unordered TCP data frags into GPU memory sort of quickly, but have you gone further with this to actually show it is useful for a real world GPU centric application? BTW your cover letter said 96% utilization, the usual server configuation is one NIC per GPU, so you were able to hit 1500Gb/sec of TCP BW with this? Jason
On Fri, Jul 14, 2023 at 8:55 AM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > On Fri, Jul 14, 2023 at 07:55:15AM -0700, Mina Almasry wrote: > > > Once the skb frags with struct new_abstraction are in the TCP stack, > > they will need some special handling in code accessing the frags. But > > my RFC already addressed that somewhat because the frags were > > inaccessible in that case. In this case the frags will be both > > inaccessible and will not be struct pages at all (things like > > get_page() will not work), so more special handling will be required, > > maybe. > > It seems sort of reasonable, though there will be interesting concerns > about coherence and synchronization with generial purpose DMABUFs that > will need tackling. > > Still it is such a lot of churn and weridness in the netdev side, I > think you'd do well to present an actual full application as > justification. > > Yes, you showed you can stick unordered TCP data frags into GPU memory > sort of quickly, but have you gone further with this to actually show > it is useful for a real world GPU centric application? > > BTW your cover letter said 96% utilization, the usual server > configuation is one NIC per GPU, so you were able to hit 1500Gb/sec of > TCP BW with this? > I do notice that the number of NICs is missing from our public documentation so far, so I will refrain from specifying how many NICs are on those A3 VMs until the information is public. But I think I can confirm that your general thinking is correct, the perf that we're getting is 96.6% line rate of each GPU/NIC pair, and scales linearly for each NIC/GPU pair we've tested with so far. Line rate of each NIC/GPU pair is 200 Gb/sec. So if we have 8 NIC/GPU pairs we'd be hitting 96.6% * 200 * 8 = 1545 GB/sec. If we have, say, 2 NIC/GPU pairs, we'd be hitting 96.6% * 200 * 2 = 384 GB/sec ... etc.
On Fri, Jul 14, 2023 at 8:19 AM David Ahern <dsahern@kernel.org> wrote: > > On 7/14/23 8:55 AM, Mina Almasry wrote: > > > > I guess the remaining option not fully explored is the idea of getting > > the networking stack to consume the scatterlist that > > dma_buf_map_attachment() provides for the device memory. The very > > rough approach I have in mind (for the RX path) is: > > > > 1. Some uapi that binds a dmabuf to an RX queue. It will do a > > dma_buf_map_attachment() and get the sg table. > > > > 2. We need to feed the scratterlist entries to some allocator that > > will chunk it up into pieces that can be allocated by the NIC for > > incoming traffic. I'm thinking genalloc may work for this as-is, but I > > may need to add one or use something else if I run into some issue. > > > > 3. We can implement a memory_provider that allocates these chunks and > > wraps them in a struct new_abstraction (as David called it) and feeds > > those into the page pool. > > > > 4. The page pool would need to be able to process these struct > > new_abstraction alongside the struct pages it normally gets from > > providers. This is maybe the most complicated part, but looking at the > > page pool code it doesn't seem that big of a hurdle (but I have not > > tried a POC yet). > > > > 5. The drivers (I looked at mlx5) seem to avoid making any mm calls on > > the struct pages returned by the pool; the pool abstracts everything > > already. The changes to the drivers may be minimal..? > > > > 6. We would need to add a new helper, skb_add_rx_new_abstraction_frag > > that creates a frag out of new_abstraction rather than a struct page. > > > > Once the skb frags with struct new_abstraction are in the TCP stack, > > they will need some special handling in code accessing the frags. But > > my RFC already addressed that somewhat because the frags were > > inaccessible in that case. In this case the frags will be both > > inaccessible and will not be struct pages at all (things like > > get_page() will not work), so more special handling will be required, > > maybe. > > > > I imagine the TX path would be considerably less complicated because > > the allocator and page pool are not involved (I think). > > > > Anyone see any glaring issues with this approach? > > Moving skb_frags to an alternative scheme is essential to make this > work. The current page scheme to go from user virtual to pages to > physical is not needed for the dmabuf use case. > > For the driver and hardware queue: don't you need a dedicated queue for > the flow(s) in question? In the RFC and the implementation I'm thinking of, the queue is 'dedicated' in that each queue will be a devmem TCP queue or a regular queue. devmem queues generate devmem skbs and non-devmem queues generate non-devmem skbs. We support switching queues between devmem mode and non-devmem mode via a uapi. > If not, how can you properly handle the > teardown case (e.g., app crashes and you need to ensure all references > to GPU memory are removed from NIC descriptors)? Jason and Christian will correct me if I'm wrong, but AFAICT the dma-buf API requires the dma-buf provider to keep the attachment mapping alive as long as the importer requires it. The dma-buf API gives the importer dma_buf_map_attachment() and dma_buf_unmap_attachment() APIs, but there is no callback for the exporter to inform the importer that it has to take the mapping away. The closest thing I saw was the move_notify() callback, but that is optional. In my mind the way it works is that there will be some uapi that binds a dma-buf to an RX queue, that will create the attachment and the mapping. If the user crashes or closes the dma-buf handle then that will unbind the dma-buf from the RX queue, but the mapping will remain alive (via some refcounting) until all the NIC descriptors are freed and the mapping is not under use anymore. Usually this will happen next driver reset which destroys and recreates rx queues thereby freeing all the NIC descriptors (but could be a new API so that we don't rely on a driver reset). > If you agree on this > point, then you can require the dedicated queue management in the driver > to use and expect only the alternative frag addressing scheme. ie., it > knows the address is not struct page (validates by checking skb flag or > frag flag or address magic), but a reference to say a page_pool entry > (if you are using page_pool for management of the dmabuf slices) which > contains the metadata needed for the use case. Honestly if my understanding above doesn't match what you want, I could implement 'dedicated queues' instead, just let me know what you want at some future iteration. Now, I'm more worried about this memory format issue and I'm working on an RX prototype without struct pages. So far purely technically speaking it seems possible.
On 7/16/23 8:05 PM, Mina Almasry wrote: >> >> For the driver and hardware queue: don't you need a dedicated queue for >> the flow(s) in question? > > In the RFC and the implementation I'm thinking of, the queue is > 'dedicated' in that each queue will be a devmem TCP queue or a regular > queue. devmem queues generate devmem skbs and non-devmem queues > generate non-devmem skbs. We support switching queues between devmem > mode and non-devmem mode via a uapi. ethtool APIs or something else? > >> If not, how can you properly handle the >> teardown case (e.g., app crashes and you need to ensure all references >> to GPU memory are removed from NIC descriptors)? > > Jason and Christian will correct me if I'm wrong, but AFAICT the > dma-buf API requires the dma-buf provider to keep the attachment > mapping alive as long as the importer requires it. The dma-buf API > gives the importer dma_buf_map_attachment() and > dma_buf_unmap_attachment() APIs, but there is no callback for the > exporter to inform the importer that it has to take the mapping away. Isn't the importer that application that terminated (cleanly or other)? That was my thinking but I guess there are other designs that can cross a single application. > The closest thing I saw was the move_notify() callback, but that is > optional. > > In my mind the way it works is that there will be some uapi that binds > a dma-buf to an RX queue, that will create the attachment and the > mapping. If the user crashes or closes the dma-buf handle then that > will unbind the dma-buf from the RX queue, but the mapping will remain > alive (via some refcounting) until all the NIC descriptors are freed > and the mapping is not under use anymore. Usually this will happen > next driver reset which destroys and recreates rx queues thereby > freeing all the NIC descriptors (but could be a new API so that we > don't rely on a driver reset). > >> If you agree on this >> point, then you can require the dedicated queue management in the driver >> to use and expect only the alternative frag addressing scheme. ie., it >> knows the address is not struct page (validates by checking skb flag or >> frag flag or address magic), but a reference to say a page_pool entry >> (if you are using page_pool for management of the dmabuf slices) which >> contains the metadata needed for the use case. > > Honestly if my understanding above doesn't match what you want, I > could implement 'dedicated queues' instead, just let me know what you > want at some future iteration. Now, I'm more worried about this memory > format issue and I'm working on an RX prototype without struct pages. > So far purely technically speaking it seems possible. > > My comment was only a suggestion on how to simplify driver changes. ie., a queue is either pages (based on standard page_pool or alloc_pages) or some "special" page_pool (ie., new abstraction) but not mixed. In that case it knows how to handle the overloaded 'address' in skb_frag in a clean manner.
On 17/07/2023 03.53, Mina Almasry wrote: > On Fri, Jul 14, 2023 at 8:55 AM Jason Gunthorpe <jgg@ziepe.ca> wrote: >> >> On Fri, Jul 14, 2023 at 07:55:15AM -0700, Mina Almasry wrote: >> >>> Once the skb frags with struct new_abstraction are in the TCP stack, >>> they will need some special handling in code accessing the frags. But >>> my RFC already addressed that somewhat because the frags were >>> inaccessible in that case. In this case the frags will be both >>> inaccessible and will not be struct pages at all (things like >>> get_page() will not work), so more special handling will be required, >>> maybe. >> >> It seems sort of reasonable, though there will be interesting concerns >> about coherence and synchronization with generial purpose DMABUFs that >> will need tackling. >> >> Still it is such a lot of churn and weridness in the netdev side, I >> think you'd do well to present an actual full application as >> justification. >> >> Yes, you showed you can stick unordered TCP data frags into GPU memory >> sort of quickly, but have you gone further with this to actually show >> it is useful for a real world GPU centric application? >> >> BTW your cover letter said 96% utilization, the usual server >> configuation is one NIC per GPU, so you were able to hit 1500Gb/sec of >> TCP BW with this? >> > > I do notice that the number of NICs is missing from our public > documentation so far, so I will refrain from specifying how many NICs > are on those A3 VMs until the information is public. But I think I can > confirm that your general thinking is correct, the perf that we're > getting is 96.6% line rate of each GPU/NIC pair, What do you mean by 96.6% "line rate". Is is the Ethernet line-rate? Is the measured throughput the measured TCP data "goodput"? Assuming - MTU 1500 bytes (1514 on wire). - Ethernet header 14 bytes - IP header 20 bytes - TCP header 20 bytes Due to header overhead the goodput will be approx 96.4%. - (1514-(14+20+20))/1514 = 0.9643 - (Not taking Ethernet interframe gap into account). Thus, maybe you have hit Ethernet wire line-rate already? > and scales linearly > for each NIC/GPU pair we've tested with so far. Line rate of each > NIC/GPU pair is 200 Gb/sec. > > So if we have 8 NIC/GPU pairs we'd be hitting 96.6% * 200 * 8 = 1545 GB/sec. Lets keep our units straight. Here you mean 1545 Gbit/sec, which is 193 GBytes/s > If we have, say, 2 NIC/GPU pairs, we'd be hitting 96.6% * 200 * 2 = 384 GB/sec Here you mean 384 Gbit/sec, which is 48 GBytes/sec. > ... > etc. > These massive throughput numbers are important, because they *exceed* the physical host RAM/DIMM memory speeds. This is the *real argument* why software cannot afford to do a single copy of the data from host-RAM into GPU-memory, because the CPU memory throughput to DRAM/DIMM are insufficient. My testlab CPU E5-1650 have 4 DIMM slots DDR4 - Data Width: 64 bits (= 8 bytes) - Configured Memory Speed: 2400 MT/s - Theoretical maximum memory bandwidth: 76.8 GBytes/s (2400*8*4) Even the theoretical max 76.8 GBytes/s (614 Gbit/s) is not enough for the 193 GBytes/s or 1545 Gbit/s (8 NIC/GPU pairs). When testing this with lmbench tool bw_mem, the results (below signature) are in the area 14.8 GBytes/sec (118 Gbit/s), as soon as exceeding L3 cache size. In practice it looks like main memory is limited to reading 118 Gbit/s *once*. (Mina's NICs run at 200 Gbit/s) Given DDIO can deliver network packets into L3, I also tried to figure out what the L3 read bandwidth, which I measured to be 42.4 GBits/sec (339 Gbit/s), in hopes that it would be enough, but it was not. --Jesper (data below signature) CPU under test: $ cat /proc/cpuinfo | egrep -e 'model name|cache size' | head -2 model name : Intel(R) Xeon(R) CPU E5-1650 v4 @ 3.60GHz cache size : 15360 KB Providing some cmdline outputs from lmbench "bw_mem" tool. (Output format is "%0.2f %.2f\n", megabytes, megabytes_per_second) $ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 256M rd 256.00 14924.50 $ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 256M wr 256.00 9895.25 $ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 256M rdwr 256.00 9737.54 $ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 256M bcopy 256.00 12462.88 $ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 256M bzero 256.00 14869.89 Next output shows reducing size below L3 cache size, which shows an increase in speed, likely the L3 bandwidth. $ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 64M rd 64.00 14840.58 $ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 32M rd 32.00 14823.97 $ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 16M rd 16.00 24743.86 $ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 8M rd 8.00 40852.26 $ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 4M rd 4.00 42545.65 $ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 2M rd 2.00 42447.82 $ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 1M rd 1.00 42447.82
On Mon, Jul 24, 2023 at 04:56:27PM +0200, Jesper Dangaard Brouer wrote: > These massive throughput numbers are important, because they *exceed* > the physical host RAM/DIMM memory speeds. That's right, this HW is all designed to use the high memory bandwidth of the parallel GPUs. The CPU must not be involved in the data movement. If you look at the reference block diagrams for a DGX-H100 you can see that the each GPU is directly wired to a 400Gb/s NIC, and there is even software that allows the GPU to directly operate its attached NIC interface. Each of the 8 GPUs in the block diagram has 800 Gb/s full duplex RDMA, and 7200 Gb/sec full duplex on the nvlink interconnect directly connected to it. The GPU is the center of all the interconnect in these systems. Jason
On Mon, Jul 24, 2023 at 7:56 AM Jesper Dangaard Brouer <jbrouer@redhat.com> wrote: > > > > On 17/07/2023 03.53, Mina Almasry wrote: > > On Fri, Jul 14, 2023 at 8:55 AM Jason Gunthorpe <jgg@ziepe.ca> wrote: > >> > >> On Fri, Jul 14, 2023 at 07:55:15AM -0700, Mina Almasry wrote: > >> > >>> Once the skb frags with struct new_abstraction are in the TCP stack, > >>> they will need some special handling in code accessing the frags. But > >>> my RFC already addressed that somewhat because the frags were > >>> inaccessible in that case. In this case the frags will be both > >>> inaccessible and will not be struct pages at all (things like > >>> get_page() will not work), so more special handling will be required, > >>> maybe. > >> > >> It seems sort of reasonable, though there will be interesting concerns > >> about coherence and synchronization with generial purpose DMABUFs that > >> will need tackling. > >> > >> Still it is such a lot of churn and weridness in the netdev side, I > >> think you'd do well to present an actual full application as > >> justification. > >> > >> Yes, you showed you can stick unordered TCP data frags into GPU memory > >> sort of quickly, but have you gone further with this to actually show > >> it is useful for a real world GPU centric application? > >> > >> BTW your cover letter said 96% utilization, the usual server > >> configuation is one NIC per GPU, so you were able to hit 1500Gb/sec of > >> TCP BW with this? > >> > > > > I do notice that the number of NICs is missing from our public > > documentation so far, so I will refrain from specifying how many NICs > > are on those A3 VMs until the information is public. But I think I can > > confirm that your general thinking is correct, the perf that we're > > getting is 96.6% line rate of each GPU/NIC pair, > > What do you mean by 96.6% "line rate". > Is is the Ethernet line-rate? > Yes I believe this is the ethernet line-rate. I.e. the 200 Gbits/sec that my NICs run. > Is the measured throughput the measured TCP data "goodput"? Yes, it is goodput. Roughly I believe we add up the return values of recvmsg() and divide that number by time (very roughly, I think). > Assuming > - MTU 1500 bytes (1514 on wire). > - Ethernet header 14 bytes > - IP header 20 bytes > - TCP header 20 bytes > > Due to header overhead the goodput will be approx 96.4%. > - (1514-(14+20+20))/1514 = 0.9643 > - (Not taking Ethernet interframe gap into account). > > Thus, maybe you have hit Ethernet wire line-rate already? My MTU is 8244 actually, which gives me 8192 mss/payload for my connections. By my math the theoretical max would be 1 - 52/8244 = ~99.3%. So it looks like I'm dropping ~3% line rate somewhere in the implementation. > > > and scales linearly > > for each NIC/GPU pair we've tested with so far. Line rate of each > > NIC/GPU pair is 200 Gb/sec. > > > > So if we have 8 NIC/GPU pairs we'd be hitting 96.6% * 200 * 8 = 1545 GB/sec. > > Lets keep our units straight. > Here you mean 1545 Gbit/sec, which is 193 GBytes/s > Yes! Sorry! I definitely meant 1545 Gbits/sec, sorry! > > If we have, say, 2 NIC/GPU pairs, we'd be hitting 96.6% * 200 * 2 = 384 GB/sec > > Here you mean 384 Gbit/sec, which is 48 GBytes/sec. > Correct again! > > ... > > etc. > > > > These massive throughput numbers are important, because they *exceed* > the physical host RAM/DIMM memory speeds. > > This is the *real argument* why software cannot afford to do a single > copy of the data from host-RAM into GPU-memory, because the CPU memory > throughput to DRAM/DIMM are insufficient. > > My testlab CPU E5-1650 have 4 DIMM slots DDR4 > - Data Width: 64 bits (= 8 bytes) > - Configured Memory Speed: 2400 MT/s > - Theoretical maximum memory bandwidth: 76.8 GBytes/s (2400*8*4) > > Even the theoretical max 76.8 GBytes/s (614 Gbit/s) is not enough for > the 193 GBytes/s or 1545 Gbit/s (8 NIC/GPU pairs). > > When testing this with lmbench tool bw_mem, the results (below > signature) are in the area 14.8 GBytes/sec (118 Gbit/s), as soon as > exceeding L3 cache size. In practice it looks like main memory is > limited to reading 118 Gbit/s *once*. (Mina's NICs run at 200 Gbit/s) > > Given DDIO can deliver network packets into L3, I also tried to figure > out what the L3 read bandwidth, which I measured to be 42.4 GBits/sec > (339 Gbit/s), in hopes that it would be enough, but it was not. > > Yes, avoiding any memory speed bottleneck as you note is important, but the second point mentioned in my cover letter is also impactful: " Alleviate PCIe BW pressure, by limiting data transfer to the lowest level of the PCIe tree, compared to traditional path which sends data through the root complex." Depending on the hardware, this is a bottleneck that we avoid with device memory TCP. NIC/GPU copies occupy the PCIe link bandwidth. In a hierarchy like this: root complex | (uplink) PCIe switch / \ NIC GPU I believe the uplink from the PCIe switch to the root complex is used up 2 times for TX and 2 times for RX if the data needs to go through host memory: RX: NIC -> root complex -> GPU TX: GPU -> root complex -> NIC With device memory TCP, and enabling PCI P2P communication between the devices under the same PCIe switch, the payload flows directly from/to the NIC/GPU through the PCIe switch, and the payload never goes to the root complex, alleviating pressure/bottleneck on that link between the PCIe switch/root complex. I believe this is a core reason we're able to scale throughput linearly with NIC/GPU pairs, because we don't stress share uplink connections and all the payload data transfer happens beneath the PCIe switch. > --Jesper > (data below signature) > > CPU under test: > > $ cat /proc/cpuinfo | egrep -e 'model name|cache size' | head -2 > model name : Intel(R) Xeon(R) CPU E5-1650 v4 @ 3.60GHz > cache size : 15360 KB > > > Providing some cmdline outputs from lmbench "bw_mem" tool. > (Output format is "%0.2f %.2f\n", megabytes, megabytes_per_second) > > $ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 256M rd > 256.00 14924.50 > > $ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 256M wr > 256.00 9895.25 > > $ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 256M rdwr > 256.00 9737.54 > > $ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 256M bcopy > 256.00 12462.88 > > $ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 256M bzero > 256.00 14869.89 > > > Next output shows reducing size below L3 cache size, which shows an > increase in speed, likely the L3 bandwidth. > > $ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 64M rd > 64.00 14840.58 > > $ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 32M rd > 32.00 14823.97 > > $ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 16M rd > 16.00 24743.86 > > $ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 8M rd > 8.00 40852.26 > > $ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 4M rd > 4.00 42545.65 > > $ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 2M rd > 2.00 42447.82 > > $ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 1M rd > 1.00 42447.82 >
On 25/07/2023 06.04, Mina Almasry wrote: > On Mon, Jul 24, 2023 at 7:56 AM Jesper Dangaard Brouer > <jbrouer@redhat.com> wrote: >> >> >> >> On 17/07/2023 03.53, Mina Almasry wrote: >>> On Fri, Jul 14, 2023 at 8:55 AM Jason Gunthorpe <jgg@ziepe.ca> wrote: >>>> >>>> On Fri, Jul 14, 2023 at 07:55:15AM -0700, Mina Almasry wrote: >>>> >>>>> Once the skb frags with struct new_abstraction are in the TCP stack, >>>>> they will need some special handling in code accessing the frags. But >>>>> my RFC already addressed that somewhat because the frags were >>>>> inaccessible in that case. In this case the frags will be both >>>>> inaccessible and will not be struct pages at all (things like >>>>> get_page() will not work), so more special handling will be required, >>>>> maybe. >>>> >>>> It seems sort of reasonable, though there will be interesting concerns >>>> about coherence and synchronization with generial purpose DMABUFs that >>>> will need tackling. >>>> >>>> Still it is such a lot of churn and weridness in the netdev side, I >>>> think you'd do well to present an actual full application as >>>> justification. >>>> >>>> Yes, you showed you can stick unordered TCP data frags into GPU memory >>>> sort of quickly, but have you gone further with this to actually show >>>> it is useful for a real world GPU centric application? >>>> >>>> BTW your cover letter said 96% utilization, the usual server >>>> configuation is one NIC per GPU, so you were able to hit 1500Gb/sec of >>>> TCP BW with this? >>>> >>> >>> I do notice that the number of NICs is missing from our public >>> documentation so far, so I will refrain from specifying how many NICs >>> are on those A3 VMs until the information is public. But I think I can >>> confirm that your general thinking is correct, the perf that we're >>> getting is 96.6% line rate of each GPU/NIC pair, >> >> What do you mean by 96.6% "line rate". >> Is is the Ethernet line-rate? >> > > Yes I believe this is the ethernet line-rate. I.e. the 200 Gbits/sec > that my NICs run. > >> Is the measured throughput the measured TCP data "goodput"? > > Yes, it is goodput. Roughly I believe we add up the return values of > recvmsg() and divide that number by time (very roughly, I think). > >> Assuming >> - MTU 1500 bytes (1514 on wire). >> - Ethernet header 14 bytes >> - IP header 20 bytes >> - TCP header 20 bytes >> >> Due to header overhead the goodput will be approx 96.4%. >> - (1514-(14+20+20))/1514 = 0.9643 >> - (Not taking Ethernet interframe gap into account). >> >> Thus, maybe you have hit Ethernet wire line-rate already? > > My MTU is 8244 actually, which gives me 8192 mss/payload for my > connections. By my math the theoretical max would be 1 - 52/8244 = > ~99.3%. So it looks like I'm dropping ~3% line rate somewhere in the > implementation. > Close enough, my math I would have added the 4 byte FCS checksum (1-56/8244), but it makes no real difference at MTU 8244. >> >>> and scales linearly >>> for each NIC/GPU pair we've tested with so far. Line rate of each >>> NIC/GPU pair is 200 Gb/sec. >>> >>> So if we have 8 NIC/GPU pairs we'd be hitting 96.6% * 200 * 8 = 1545 GB/sec. >> >> Lets keep our units straight. >> Here you mean 1545 Gbit/sec, which is 193 GBytes/s >> > > Yes! Sorry! I definitely meant 1545 Gbits/sec, sorry! > >>> If we have, say, 2 NIC/GPU pairs, we'd be hitting 96.6% * 200 * 2 = 384 GB/sec >> >> Here you mean 384 Gbit/sec, which is 48 GBytes/sec. >> > > Correct again! > >>> ... >>> etc. >>> >> >> These massive throughput numbers are important, because they *exceed* >> the physical host RAM/DIMM memory speeds. >> >> This is the *real argument* why software cannot afford to do a single >> copy of the data from host-RAM into GPU-memory, because the CPU memory >> throughput to DRAM/DIMM are insufficient. >> >> My testlab CPU E5-1650 have 4 DIMM slots DDR4 >> - Data Width: 64 bits (= 8 bytes) >> - Configured Memory Speed: 2400 MT/s >> - Theoretical maximum memory bandwidth: 76.8 GBytes/s (2400*8*4) >> >> Even the theoretical max 76.8 GBytes/s (614 Gbit/s) is not enough for >> the 193 GBytes/s or 1545 Gbit/s (8 NIC/GPU pairs). >> >> When testing this with lmbench tool bw_mem, the results (below >> signature) are in the area 14.8 GBytes/sec (118 Gbit/s), as soon as >> exceeding L3 cache size. In practice it looks like main memory is >> limited to reading 118 Gbit/s *once*. (Mina's NICs run at 200 Gbit/s) >> Some more insights. I couldn't believe this (single CPU) test was so far from the theoretical max (76.8 vs. 14.8 GBytes/s). This smells like a per CPU core limitation. The lmbench tool bw_mem have an option for parallelism (-P) for testing this. My testlab CPU only have 6 cores (as I have disabled HT). Testing on more CPU cores show an increase in scaling mem bandwidth: Cores 1 = 15.0 GB/s - scale: 1.00 (one core as scale point) Cores 2 = 26.9 GB/s - scale: 1.79 Cores 3 = 36.3 GB/s - scale: 2.42 Cores 4 = 44.0 GB/s - scale: 2.93 Cores 5 = 48.9 GB/s - scale: 3.26 Cores 6 = 49.4 GB/s - scale: 3.29 Thus, the practical test show CPU memory DIMM read bandwidth scales to 49.4 GB/s (395.2 Gbit/s), so there is still hope of 400Gbit/s devices, when utilizing more CPU cores. I don't have a clear explanation why there is a limit per core. The [toplev] tool with (bw_mem -P2) says: Backend_Bound = 90.7% of the time. Backend_Bound.Memory_Bound = 71.8% of these 90.7% Backend_Bound.Core_Bound = remaining 18.9% Backend_Bound.Memory_Bound is split into two main "stalls": Backend_Bound.Memory_Bound.L3_Bound = 7.9% Backend_Bound.Memory_Bound.DRAM_Bound = 58.5% [toplev] https://github.com/andikleen/pmu-tools >> Given DDIO can deliver network packets into L3, I also tried to figure >> out what the L3 read bandwidth, which I measured to be 42.4 GBits/sec >> (339 Gbit/s), in hopes that it would be enough, but it was not. >> The memory bandwidth to L3 cache scales up per CPU core: Cores 1 = 42.35 GB/s = scale: 1.00 (one core as scale point) Cores 2 = 86.38 GB/s = scale: 2.04 Cores 3 = 126.96 GB/s = scale: 3.00 Cores 4 = 168.48 GB/s = scale: 3.98 Cores 5 = 211.77 GB/s = scale: 5.00 Cores 6 = 244.95 GB/s = scale: 5.78 Nice to see how well this scales up per core. Fairly impressive total max L3 bandwidth of 244.95 GB/s (1959.6 Gbit/s). >> > > Yes, avoiding any memory speed bottleneck as you note is important, > but the second point mentioned in my cover letter is also impactful: > > " Alleviate PCIe BW pressure, by limiting data transfer to the lowest level > of the PCIe tree, compared to traditional path which sends data through the > root complex." > This is a good and important point. > Depending on the hardware, this is a bottleneck that we avoid with > device memory TCP. NIC/GPU copies occupy the PCIe link bandwidth. In a > hierarchy like this: > > root complex > | (uplink) > PCIe switch > / \ > NIC GPU > > I believe the uplink from the PCIe switch to the root complex is used > up 2 times for TX and 2 times for RX if the data needs to go through > host memory: > > RX: NIC -> root complex -> GPU > TX: GPU -> root complex -> NIC > > With device memory TCP, and enabling PCI P2P communication between the > devices under the same PCIe switch, the payload flows directly from/to > the NIC/GPU through the PCIe switch, and the payload never goes to the > root complex, alleviating pressure/bottleneck on that link between the > PCIe switch/root complex. I believe this is a core reason we're able > to scale throughput linearly with NIC/GPU pairs, because we don't > stress share uplink connections and all the payload data transfer > happens beneath the PCIe switch. > Good points, and I guess this is what Jason was hinting to. And illustrated in this picture[1] (I googled): [1] https://www.servethehome.com/wp-content/uploads/2022/08/HC34-NVIDIA-DGX-H100-Data-network-configuration.jpg The GPUs "internally" have switched nvlink connections. As Jason said, these nvlinks have an impressive bandwidth[2] of 900GB/s (7200 Gbit/s). [2] https://www.nvidia.com/en-us/data-center/nvlink/ >> --Jesper >> (data below signature) >> Added raw commands and data below. >> CPU under test: >> >> $ cat /proc/cpuinfo | egrep -e 'model name|cache size' | head -2 >> model name : Intel(R) Xeon(R) CPU E5-1650 v4 @ 3.60GHz >> cache size : 15360 KB >> >> >> Providing some cmdline outputs from lmbench "bw_mem" tool. >> (Output format is "%0.2f %.2f\n", megabytes, megabytes_per_second) >> Running bw_mem with parallelism utilizing more cores: $ /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem -W2 -N4 -P1 256m rd 268.44 15015.69 $ /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem -W2 -N4 -P2 256m rd 268.44 26896.42 $ /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem -W2 -N4 -P3 256m rd 268.44 36347.36 $ /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem -W2 -N4 -P4 256m rd 268.44 44073.72 $ /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem -W2 -N4 -P5 256m rd 268.44 48872.02 $ /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem -W2 -N4 -P6 256m rd 268.44 49426.76 >> $ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 256M rd >> 256.00 14924.50 >> >> $ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 256M wr >> 256.00 9895.25 >> >> $ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 256M rdwr >> 256.00 9737.54 >> >> $ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 256M bcopy >> 256.00 12462.88 >> >> $ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 256M bzero >> 256.00 14869.89 >> >> >> Next output shows reducing size below L3 cache size, which shows an >> increase in speed, likely the L3 bandwidth. >> >> $ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 64M rd >> 64.00 14840.58 >> >> $ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 32M rd >> 32.00 14823.97 >> >> $ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 16M rd >> 16.00 24743.86 >> >> $ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 8M rd >> 8.00 40852.26 >> >> $ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 4M rd >> 4.00 42545.65 >> >> $ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 2M rd >> 2.00 42447.82 >> >> $ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 1M rd >> 1.00 42447.82 >> > Tests for testing L3 per core scaling. $ taskset -c 0 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem -W2 -N4 -P1 512K rd 0.512000 42357.43 $ taskset -c 0-1 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem -W2 -N4 -P2 512K rd 0.512000 86380.09 $ taskset -c 0-2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem -W2 -N4 -P3 512K rd 0.512000 126960.94 $ taskset -c 0-3 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem -W2 -N4 -P4 512K rd 0.512000 168485.49 $ taskset -c 0-4 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem -W2 -N4 -P5 512K rd 0.512000 211770.67 $ taskset -c 0-5 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem -W2 -N4 -P6 512K rd 0.512000 244959.35
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c index b676496ec6d7..4e613d5bf1fd 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c @@ -4925,8 +4925,7 @@ static void hns3_put_ring_config(struct hns3_nic_priv *priv) static void hns3_alloc_page_pool(struct hns3_enet_ring *ring) { struct page_pool_params pp_params = { - .flags = PP_FLAG_DMA_MAP | PP_FLAG_PAGE_FRAG | - PP_FLAG_DMA_SYNC_DEV, + .flags = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV, .order = hns3_page_order(ring), .pool_size = ring->desc_num * hns3_buf_size(ring) / (PAGE_SIZE << hns3_page_order(ring)), diff --git a/drivers/net/ethernet/marvell/octeontx2/nic/otx2_common.c b/drivers/net/ethernet/marvell/octeontx2/nic/otx2_common.c index a79cb680bb23..404caec467af 100644 --- a/drivers/net/ethernet/marvell/octeontx2/nic/otx2_common.c +++ b/drivers/net/ethernet/marvell/octeontx2/nic/otx2_common.c @@ -1426,7 +1426,7 @@ int otx2_pool_init(struct otx2_nic *pfvf, u16 pool_id, return 0; } - pp_params.flags = PP_FLAG_PAGE_FRAG | PP_FLAG_DMA_MAP; + pp_params.flags = PP_FLAG_DMA_MAP; pp_params.pool_size = numptrs; pp_params.nid = NUMA_NO_NODE; pp_params.dev = pfvf->dev; diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c index 593cdfbfc035..79f2f5e51ae0 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c @@ -843,7 +843,7 @@ static int mlx5e_alloc_rq(struct mlx5e_params *params, } pp_params.order = 0; - pp_params.flags = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV | PP_FLAG_PAGE_FRAG; + pp_params.flags = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV; pp_params.pool_size = pool_size; pp_params.nid = node; pp_params.dev = rq->pdev; diff --git a/drivers/net/wireless/mediatek/mt76/mac80211.c b/drivers/net/wireless/mediatek/mt76/mac80211.c index 467afef98ba2..ee72869e5572 100644 --- a/drivers/net/wireless/mediatek/mt76/mac80211.c +++ b/drivers/net/wireless/mediatek/mt76/mac80211.c @@ -566,7 +566,7 @@ int mt76_create_page_pool(struct mt76_dev *dev, struct mt76_queue *q) { struct page_pool_params pp_params = { .order = 0, - .flags = PP_FLAG_PAGE_FRAG, + .flags = 0, .nid = NUMA_NO_NODE, .dev = dev->dma_dev, }; diff --git a/include/net/page_pool.h b/include/net/page_pool.h index c135cd157cea..f4fc339ff020 100644 --- a/include/net/page_pool.h +++ b/include/net/page_pool.h @@ -46,10 +46,8 @@ * Please note DMA-sync-for-CPU is still * device driver responsibility */ -#define PP_FLAG_PAGE_FRAG BIT(2) /* for page frag feature */ #define PP_FLAG_ALL (PP_FLAG_DMA_MAP |\ - PP_FLAG_DMA_SYNC_DEV |\ - PP_FLAG_PAGE_FRAG) + PP_FLAG_DMA_SYNC_DEV) #define PAGE_POOL_DMA_USE_PP_FRAG_COUNT \ (sizeof(dma_addr_t) > sizeof(unsigned long)) @@ -235,8 +233,7 @@ static inline struct page *page_pool_alloc_frag(struct page_pool *pool, size = ALIGN(size, dma_get_cache_alignment()); - if (WARN_ON(!(pool->p.flags & PP_FLAG_PAGE_FRAG) || - size > max_size)) + if (WARN_ON(size > max_size)) return NULL; /* Don't allow page splitting and allocate one big frag diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 7c4338221b17..ca2316cc1e7e 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -5652,7 +5652,7 @@ bool skb_try_coalesce(struct sk_buff *to, struct sk_buff *from, /* In general, avoid mixing page_pool and non-page_pool allocated * pages within the same SKB. Additionally avoid dealing with clones * with page_pool pages, in case the SKB is using page_pool fragment - * references (PP_FLAG_PAGE_FRAG). Since we only take full page + * references (page_pool_alloc_frag()). Since we only take full page * references for cloned SKBs at the moment that would result in * inconsistent reference counts. * In theory we could take full references if @from is cloned and
PP_FLAG_PAGE_FRAG is not really needed after pp_frag_count handling is unified and page_pool_alloc_frag() is supported in 32-bit arch with 64-bit DMA, so remove it. Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> CC: Lorenzo Bianconi <lorenzo@kernel.org> CC: Alexander Duyck <alexander.duyck@gmail.com> --- drivers/net/ethernet/hisilicon/hns3/hns3_enet.c | 3 +-- drivers/net/ethernet/marvell/octeontx2/nic/otx2_common.c | 2 +- drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 2 +- drivers/net/wireless/mediatek/mt76/mac80211.c | 2 +- include/net/page_pool.h | 7 ++----- net/core/skbuff.c | 2 +- 6 files changed, 7 insertions(+), 11 deletions(-)