Message ID | 20230810015751.3297321-1-almasrymina@google.com (mailing list archive)
---|---
Series | Device Memory TCP
Am 10.08.23 um 03:57 schrieb Mina Almasry: > Changes in RFC v2: > ------------------ > > The sticking point in RFC v1[1] was the dma-buf pages approach we used to > deliver the device memory to the TCP stack. RFC v2 is a proof-of-concept > that attempts to resolve this by implementing scatterlist support in the > networking stack, such that we can import the dma-buf scatterlist > directly. Impressive work, I didn't thought that this would be possible that "easily". Please note that we have considered replacing scatterlists with simple arrays of DMA-addresses in the DMA-buf framework to avoid people trying to access the struct page inside the scatterlist. It might be a good idea to push for that first before this here is finally implemented. GPU drivers already convert the scatterlist used to arrays of DMA-addresses as soon as they get them. This leaves RDMA and V4L as the other two main users which would need to be converted. > This is the approach proposed at a high level here[2]. > > Detailed changes: > 1. Replaced dma-buf pages approach with importing scatterlist into the > page pool. > 2. Replace the dma-buf pages centric API with a netlink API. > 3. Removed the TX path implementation - there is no issue with > implementing the TX path with scatterlist approach, but leaving > out the TX path makes it easier to review. > 4. Functionality is tested with this proposal, but I have not conducted > perf testing yet. I'm not sure there are regressions, but I removed > perf claims from the cover letter until they can be re-confirmed. > 5. Added Signed-off-by: contributors to the implementation. > 6. Fixed some bugs with the RX path since RFC v1. > > Any feedback welcome, but specifically the biggest pending questions > needing feedback IMO are: > > 1. Feedback on the scatterlist-based approach in general. As far as I can see this sounds like the right thing to do in general. Question is rather if we should stick with scatterlist, use array of DMA-addresses or maybe even come up with a completely new structure. > 2. Netlink API (Patch 1 & 2). How does netlink manage the lifetime of objects? > 3. Approach to handle all the drivers that expect to receive pages from > the page pool (Patch 6). Can't say anything about that. I know TCP/IP inside out, but I'm a GPU and not a network driver author. Regards, Christian. > > [1] https://lore.kernel.org/netdev/dfe4bae7-13a0-3c5d-d671-f61b375cb0b4@gmail.com/T/ > [2] https://lore.kernel.org/netdev/CAHS8izPm6XRS54LdCDZVd0C75tA1zHSu6jLVO8nzTLXCc=H7Nw@mail.gmail.com/ > > ---------------------- > > * TL;DR: > > Device memory TCP (devmem TCP) is a proposal for transferring data to and/or > from device memory efficiently, without bouncing the data to a host memory > buffer. > > * Problem: > > A large amount of data transfers have device memory as the source and/or > destination. Accelerators drastically increased the volume of such transfers. > Some examples include: > - ML accelerators transferring large amounts of training data from storage into > GPU/TPU memory. In some cases ML training setup time can be as long as 50% of > TPU compute time, improving data transfer throughput & efficiency can help > improving GPU/TPU utilization. > > - Distributed training, where ML accelerators, such as GPUs on different hosts, > exchange data among them. > > - Distributed raw block storage applications transfer large amounts of data with > remote SSDs, much of this data does not require host processing. 
> > Today, the majority of the Device-to-Device data transfers the network are > implemented as the following low level operations: Device-to-Host copy, > Host-to-Host network transfer, and Host-to-Device copy. > > The implementation is suboptimal, especially for bulk data transfers, and can > put significant strains on system resources, such as host memory bandwidth, > PCIe bandwidth, etc. One important reason behind the current state is the > kernel’s lack of semantics to express device to network transfers. > > * Proposal: > > In this patch series we attempt to optimize this use case by implementing > socket APIs that enable the user to: > > 1. send device memory across the network directly, and > 2. receive incoming network packets directly into device memory. > > Packet _payloads_ go directly from the NIC to device memory for receive and from > device memory to NIC for transmit. > Packet _headers_ go to/from host memory and are processed by the TCP/IP stack > normally. The NIC _must_ support header split to achieve this. > > Advantages: > > - Alleviate host memory bandwidth pressure, compared to existing > network-transfer + device-copy semantics. > > - Alleviate PCIe BW pressure, by limiting data transfer to the lowest level > of the PCIe tree, compared to traditional path which sends data through the > root complex. > > * Patch overview: > > ** Part 1: netlink API > > Gives user ability to bind dma-buf to an RX queue. > > ** Part 2: scatterlist support > > Currently the standard for device memory sharing is DMABUF, which doesn't > generate struct pages. On the other hand, networking stack (skbs, drivers, and > page pool) operate on pages. We have 2 options: > > 1. Generate struct pages for dmabuf device memory, or, > 2. Modify the networking stack to process scatterlist. > > Approach #1 was attempted in RFC v1. RFC v2 implements approach #2. > > ** part 3: page pool support > > We piggy back on page pool memory providers proposal: > https://github.com/kuba-moo/linux/tree/pp-providers > > It allows the page pool to define a memory provider that provides the > page allocation and freeing. It helps abstract most of the device memory > TCP changes from the driver. > > ** part 4: support for unreadable skb frags > > Page pool iovs are not accessible by the host; we implement changes > throughput the networking stack to correctly handle skbs with unreadable > frags. > > ** Part 5: recvmsg() APIs > > We define user APIs for the user to send and receive device memory. > > Not included with this RFC is the GVE devmem TCP support, just to > simplify the review. Code available here if desired: > https://github.com/mina/linux/tree/tcpdevmem > > This RFC is built on top of net-next with Jakub's pp-providers changes > cherry-picked. > > * NIC dependencies: > > 1. (strict) Devmem TCP require the NIC to support header split, i.e. the > capability to split incoming packets into a header + payload and to put > each into a separate buffer. Devmem TCP works by using device memory > for the packet payload, and host memory for the packet headers. > > 2. (optional) Devmem TCP works better with flow steering support & RSS support, > i.e. the NIC's ability to steer flows into certain rx queues. This allows the > sysadmin to enable devmem TCP on a subset of the rx queues, and steer > devmem TCP traffic onto these queues and non devmem TCP elsewhere. > > The NIC I have access to with these properties is the GVE with DQO support > running in Google Cloud, but any NIC that supports these features would suffice. 
> I may be able to help reviewers bring up devmem TCP on their NICs. > > * Testing: > > The series includes a udmabuf kselftest that show a simple use case of > devmem TCP and validates the entire data path end to end without > a dependency on a specific dmabuf provider. > > ** Test Setup > > Kernel: net-next with this RFC and memory provider API cherry-picked > locally. > > Hardware: Google Cloud A3 VMs. > > NIC: GVE with header split & RSS & flow steering support. > > Mina Almasry (11): > net: add netdev netlink api to bind dma-buf to a net device > netdev: implement netlink api to bind dma-buf to netdevice > netdev: implement netdevice devmem allocator > memory-provider: updates to core provider API for devmem TCP > memory-provider: implement dmabuf devmem memory provider > page-pool: add device memory support > net: support non paged skb frags > net: add support for skbs with unreadable frags > tcp: implement recvmsg() RX path for devmem TCP > net: add SO_DEVMEM_DONTNEED setsockopt to release RX pages > selftests: add ncdevmem, netcat for devmem TCP > > Documentation/netlink/specs/netdev.yaml | 27 ++ > include/linux/netdevice.h | 61 +++ > include/linux/skbuff.h | 54 ++- > include/linux/socket.h | 1 + > include/net/page_pool.h | 186 ++++++++- > include/net/sock.h | 2 + > include/net/tcp.h | 5 +- > include/uapi/asm-generic/socket.h | 6 + > include/uapi/linux/netdev.h | 10 + > include/uapi/linux/uio.h | 10 + > net/core/datagram.c | 6 + > net/core/dev.c | 214 ++++++++++ > net/core/gro.c | 2 +- > net/core/netdev-genl-gen.c | 14 + > net/core/netdev-genl-gen.h | 1 + > net/core/netdev-genl.c | 103 +++++ > net/core/page_pool.c | 171 ++++++-- > net/core/skbuff.c | 80 +++- > net/core/sock.c | 36 ++ > net/ipv4/tcp.c | 196 ++++++++- > net/ipv4/tcp_input.c | 13 +- > net/ipv4/tcp_ipv4.c | 7 + > net/ipv4/tcp_output.c | 5 +- > net/packet/af_packet.c | 4 +- > tools/include/uapi/linux/netdev.h | 10 + > tools/net/ynl/generated/netdev-user.c | 41 ++ > tools/net/ynl/generated/netdev-user.h | 46 ++ > tools/testing/selftests/net/.gitignore | 1 + > tools/testing/selftests/net/Makefile | 5 + > tools/testing/selftests/net/ncdevmem.c | 534 ++++++++++++++++++++++++ > 30 files changed, 1787 insertions(+), 64 deletions(-) > create mode 100644 tools/testing/selftests/net/ncdevmem.c >
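For readers coming from the networking side, the DMA-buf import that Part 2 of the cover letter and Christian's reply revolve around looks roughly like the sketch below: the importer attaches to the dma-buf, maps it, and then consumes only sg_dma_address()/sg_dma_len() from the resulting scatterlist, never the struct page behind it. This is an illustrative sketch only; the devmem_binding container and the error handling are not taken from the series.

```c
/* Illustrative sketch only: importing a dma-buf fd and reading back the DMA
 * addresses/lengths of its scatterlist. The devmem_binding container is
 * hypothetical; the actual series differs in structure and error handling.
 */
#include <linux/dma-buf.h>
#include <linux/dma-mapping.h>
#include <linux/printk.h>
#include <linux/scatterlist.h>

struct devmem_binding {			/* hypothetical container */
	struct dma_buf *dmabuf;
	struct dma_buf_attachment *attach;
	struct sg_table *sgt;
};

static int devmem_import_dmabuf(struct devmem_binding *b, struct device *dev,
				int dmabuf_fd)
{
	struct scatterlist *sg;
	int i, ret;

	b->dmabuf = dma_buf_get(dmabuf_fd);
	if (IS_ERR(b->dmabuf))
		return PTR_ERR(b->dmabuf);

	b->attach = dma_buf_attach(b->dmabuf, dev);
	if (IS_ERR(b->attach)) {
		ret = PTR_ERR(b->attach);
		goto err_put;
	}

	b->sgt = dma_buf_map_attachment(b->attach, DMA_FROM_DEVICE);
	if (IS_ERR(b->sgt)) {
		ret = PTR_ERR(b->sgt);
		goto err_detach;
	}

	/* Only the DMA address and length are consumed; the struct page
	 * backing (if any) inside the scatterlist is never dereferenced.
	 */
	for_each_sgtable_dma_sg(b->sgt, sg, i)
		pr_debug("chunk %d: dma 0x%llx len %u\n", i,
			 (unsigned long long)sg_dma_address(sg),
			 sg_dma_len(sg));
	return 0;

err_detach:
	dma_buf_detach(b->dmabuf, b->attach);
err_put:
	dma_buf_put(b->dmabuf);
	return ret;
}
```

Whichever container DMA-buf eventually settles on, an sg_table or a flat array, the information the networking side consumes is the same (address, length) pairs.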
On Thu, Aug 10, 2023 at 12:29:08PM +0200, Christian König wrote: > Am 10.08.23 um 03:57 schrieb Mina Almasry: > > Changes in RFC v2: > > ------------------ > > > > The sticking point in RFC v1[1] was the dma-buf pages approach we used to > > deliver the device memory to the TCP stack. RFC v2 is a proof-of-concept > > that attempts to resolve this by implementing scatterlist support in the > > networking stack, such that we can import the dma-buf scatterlist > > directly. > > Impressive work, I didn't thought that this would be possible that "easily". > > Please note that we have considered replacing scatterlists with simple > arrays of DMA-addresses in the DMA-buf framework to avoid people trying to > access the struct page inside the scatterlist. > > It might be a good idea to push for that first before this here is finally > implemented. > > GPU drivers already convert the scatterlist used to arrays of DMA-addresses > as soon as they get them. This leaves RDMA and V4L as the other two main > users which would need to be converted. Oh that would be a nightmare for RDMA. We need a standard based way to have scalable lists of DMA addresses :( > > 2. Netlink API (Patch 1 & 2). > > How does netlink manage the lifetime of objects? And access control.. Jason
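To make the alternative concrete: the "array of DMA-addresses" Christian refers to, and that Jason is wary of for RDMA, would essentially flatten the mapped sg_table into plain (address, length) pairs, much as GPU drivers already do internally. The structure below is purely hypothetical (no such API exists in the DMA-buf framework today) and is only meant to illustrate the shape of such an interface.

```c
/* Hypothetical illustration of the "array of DMA addresses" idea; no such
 * structure exists in the DMA-buf framework today.
 */
#include <linux/dma-mapping.h>
#include <linux/scatterlist.h>
#include <linux/slab.h>

struct dma_addr_array {
	unsigned int nents;
	struct {
		dma_addr_t addr;
		unsigned int len;
	} *ents;
};

/* Flatten a mapped sg_table into (addr, len) pairs, mirroring what GPU
 * drivers do today with imported dma-buf scatterlists.
 */
static struct dma_addr_array *dma_addr_array_from_sgt(struct sg_table *sgt)
{
	struct dma_addr_array *arr;
	struct scatterlist *sg;
	int i, n = 0;

	arr = kzalloc(sizeof(*arr), GFP_KERNEL);
	if (!arr)
		return NULL;

	arr->ents = kcalloc(sgt->nents, sizeof(*arr->ents), GFP_KERNEL);
	if (!arr->ents) {
		kfree(arr);
		return NULL;
	}

	for_each_sgtable_dma_sg(sgt, sg, i) {
		arr->ents[n].addr = sg_dma_address(sg);
		arr->ents[n].len = sg_dma_len(sg);
		n++;
	}
	arr->nents = n;
	return arr;
}
```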
On Thu, Aug 10, 2023 at 3:29 AM Christian König <christian.koenig@amd.com> wrote: > > Am 10.08.23 um 03:57 schrieb Mina Almasry: > > Changes in RFC v2: > > ------------------ > > > > The sticking point in RFC v1[1] was the dma-buf pages approach we used to > > deliver the device memory to the TCP stack. RFC v2 is a proof-of-concept > > that attempts to resolve this by implementing scatterlist support in the > > networking stack, such that we can import the dma-buf scatterlist > > directly. > > Impressive work, I didn't thought that this would be possible that "easily". > > Please note that we have considered replacing scatterlists with simple > arrays of DMA-addresses in the DMA-buf framework to avoid people trying > to access the struct page inside the scatterlist. > FWIW, I'm not doing anything with the struct pages inside the scatterlist. All I need from the scatterlist are the sg_dma_address(sg) and the sg_dma_len(sg), and I'm guessing the array you're describing will provide exactly those, but let me know if I misunderstood. > It might be a good idea to push for that first before this here is > finally implemented. > > GPU drivers already convert the scatterlist used to arrays of > DMA-addresses as soon as they get them. This leaves RDMA and V4L as the > other two main users which would need to be converted. > > > This is the approach proposed at a high level here[2]. > > > > Detailed changes: > > 1. Replaced dma-buf pages approach with importing scatterlist into the > > page pool. > > 2. Replace the dma-buf pages centric API with a netlink API. > > 3. Removed the TX path implementation - there is no issue with > > implementing the TX path with scatterlist approach, but leaving > > out the TX path makes it easier to review. > > 4. Functionality is tested with this proposal, but I have not conducted > > perf testing yet. I'm not sure there are regressions, but I removed > > perf claims from the cover letter until they can be re-confirmed. > > 5. Added Signed-off-by: contributors to the implementation. > > 6. Fixed some bugs with the RX path since RFC v1. > > > > Any feedback welcome, but specifically the biggest pending questions > > needing feedback IMO are: > > > > 1. Feedback on the scatterlist-based approach in general. > > As far as I can see this sounds like the right thing to do in general. > > Question is rather if we should stick with scatterlist, use array of > DMA-addresses or maybe even come up with a completely new structure. > As far as I can tell, it should be trivial to switch this device memory TCP implementation to anything that provides: 1. DMA-addresses (sg_dma_address() equivalent) 2. lengths (sg_dma_len() equivalent) if you go that route. Specifically, I think it will be pretty much a localized change to netdev_bind_dmabuf_to_queue() implemented in this patch: https://lore.kernel.org/netdev/ZNULIDzuVVyfyMq2@ziepe.ca/T/#m2d344b08f54562cc9155c3f5b018cbfaed96036f > > 2. Netlink API (Patch 1 & 2). > > How does netlink manage the lifetime of objects? > Netlink itself doesn't handle the lifetime of the binding. However, the API I implemented unbinds the dma-buf when the netlink socket is destroyed. I do this so that even if the user process crashes or forgets to unbind, the dma-buf will still be unbound once the netlink socket is closed on the process exit. 
Details in this patch: https://lore.kernel.org/netdev/ZNULIDzuVVyfyMq2@ziepe.ca/T/#m2d344b08f54562cc9155c3f5b018cbfaed96036f On Thu, Aug 10, 2023 at 9:07 AM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > On Thu, Aug 10, 2023 at 12:29:08PM +0200, Christian König wrote: > > Am 10.08.23 um 03:57 schrieb Mina Almasry: > > > Changes in RFC v2: > > > ------------------ > > > > > > The sticking point in RFC v1[1] was the dma-buf pages approach we used to > > > deliver the device memory to the TCP stack. RFC v2 is a proof-of-concept > > > that attempts to resolve this by implementing scatterlist support in the > > > networking stack, such that we can import the dma-buf scatterlist > > > directly. > > > > Impressive work, I didn't thought that this would be possible that "easily". > > > > Please note that we have considered replacing scatterlists with simple > > arrays of DMA-addresses in the DMA-buf framework to avoid people trying to > > access the struct page inside the scatterlist. > > > > It might be a good idea to push for that first before this here is finally > > implemented. > > > > GPU drivers already convert the scatterlist used to arrays of DMA-addresses > > as soon as they get them. This leaves RDMA and V4L as the other two main > > users which would need to be converted. > > Oh that would be a nightmare for RDMA. > > We need a standard based way to have scalable lists of DMA addresses :( > > > > 2. Netlink API (Patch 1 & 2). > > > > How does netlink manage the lifetime of objects? > > And access control.. > Someone will correct me if I'm wrong but I'm not sure netlink itself will do (sufficient) access control. However I meant for the netlink API to bind dma-bufs to be a CAP_NET_ADMIN API, and I forgot to add this check in this proof-of-concept, sorry. I'll add a CAP_NET_ADMIN check in netdev_bind_dmabuf_to_queue() in the next iteration.
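For reference, the privilege check Mina says is missing from the proof-of-concept can be added either as an explicit capable() test in the request handler or declaratively via the GENL_ADMIN_PERM flag on the generic netlink op, which makes the netlink core reject unprivileged callers before the handler runs. The sketch below is illustrative: the command constant and handler are hypothetical; only netdev_bind_dmabuf_to_queue() is named in the series.

```c
/* Sketch only: two equivalent ways to require CAP_NET_ADMIN for the dma-buf
 * bind request. netdev_bind_dmabuf_to_queue() is the helper named in the
 * series; the command constant and handler below are hypothetical.
 */
#include <linux/capability.h>
#include <net/genetlink.h>

#define NETDEV_CMD_BIND_DMABUF 1	/* hypothetical command number */

static int netdev_nl_bind_dmabuf_doit(struct sk_buff *skb,
				      struct genl_info *info)
{
	/* Option 1: explicit check at the top of the handler. */
	if (!capable(CAP_NET_ADMIN))
		return -EPERM;

	/* ... parse attributes, then call netdev_bind_dmabuf_to_queue() ... */
	return 0;
}

/* Option 2: let the generic netlink core enforce the capability before the
 * handler is ever invoked.
 */
static const struct genl_ops netdev_dmabuf_genl_ops[] = {
	{
		.cmd   = NETDEV_CMD_BIND_DMABUF,
		.doit  = netdev_nl_bind_dmabuf_doit,
		.flags = GENL_ADMIN_PERM,	/* requires CAP_NET_ADMIN */
	},
};
```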
On Thu, Aug 10, 2023 at 11:44:53AM -0700, Mina Almasry wrote: > Someone will correct me if I'm wrong but I'm not sure netlink itself > will do (sufficient) access control. However I meant for the netlink > API to bind dma-bufs to be a CAP_NET_ADMIN API, and I forgot to add > this check in this proof-of-concept, sorry. I'll add a CAP_NET_ADMIN > check in netdev_bind_dmabuf_to_queue() in the next iteration. Can some other process that does not have the netlink fd manage to recv packets that were stored into the dmabuf? Jason
On Thu, Aug 10, 2023 at 11:58 AM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > On Thu, Aug 10, 2023 at 11:44:53AM -0700, Mina Almasry wrote: > > > Someone will correct me if I'm wrong but I'm not sure netlink itself > > will do (sufficient) access control. However I meant for the netlink > > API to bind dma-bufs to be a CAP_NET_ADMIN API, and I forgot to add > > this check in this proof-of-concept, sorry. I'll add a CAP_NET_ADMIN > > check in netdev_bind_dmabuf_to_queue() in the next iteration. > > Can some other process that does not have the netlink fd manage to > recv packets that were stored into the dmabuf? > The process needs to have the dma-buf fd to receive packets, and not necessarily the netlink fd. It should be possible for: 1. a CAP_NET_ADMIN process to create a dma-buf, bind it using a netlink fd, then share the dma-buf with another normal process that receives packets on it. 2. a normal process to create a dma-buf, share it with a privileged CAP_NET_ADMIN process that creates the binding via the netlink fd, then the owner of the dma-buf can receive data on the dma-buf fd. 3. a CAP_NET_ADMIN process to create the dma-buf, create the binding itself, and receive data. We in particular plan to use devmem TCP in the first mode, but this detail is specific to us so I've largely omitted it from the cover letter. In case our setup is of interest: the CAP_NET_ADMIN process I describe in #1 is a 'tcpdevmem daemon' which allocates the GPU memory, creates a dma-buf, creates an RX queue binding, and shares the dma-buf with the ML application(s) running on our instance. The ML application receives data on the dma-buf via recvmsg(). The 'tcpdevmem daemon' takes care of the binding but also configures RSS & flow steering. The dma-buf fd sharing happens over a unix domain socket.
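Since the first mode above hinges on handing the dma-buf fd from the privileged daemon to the unprivileged ML process, here is a hedged userspace sketch of that hand-off using the standard SCM_RIGHTS ancillary-data mechanism over an already-connected unix domain socket. Nothing in it is specific to the series; the function name is made up for illustration.

```c
/* Sketch: handing a dma-buf fd from the privileged 'tcpdevmem daemon' to an
 * unprivileged consumer over an already-connected AF_UNIX socket.
 */
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

static int send_dmabuf_fd(int unix_sock, int dmabuf_fd)
{
	char byte = 'D';	/* SCM_RIGHTS needs at least one data byte */
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;	/* force correct alignment */
	} u;
	struct msghdr msg = {
		.msg_iov        = &iov,
		.msg_iovlen     = 1,
		.msg_control    = u.buf,
		.msg_controllen = sizeof(u.buf),
	};
	struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type  = SCM_RIGHTS;
	cmsg->cmsg_len   = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &dmabuf_fd, sizeof(int));

	return sendmsg(unix_sock, &msg, 0) < 0 ? -1 : 0;
}
```

The receiving process does the mirror-image recvmsg() call and extracts the fd from the SCM_RIGHTS control message; from then on it holds the dma-buf fd without ever touching the netlink socket.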
Am 10.08.23 um 20:44 schrieb Mina Almasry: > On Thu, Aug 10, 2023 at 3:29 AM Christian König > <christian.koenig@amd.com> wrote: >> Am 10.08.23 um 03:57 schrieb Mina Almasry: >>> Changes in RFC v2: >>> ------------------ >>> >>> The sticking point in RFC v1[1] was the dma-buf pages approach we used to >>> deliver the device memory to the TCP stack. RFC v2 is a proof-of-concept >>> that attempts to resolve this by implementing scatterlist support in the >>> networking stack, such that we can import the dma-buf scatterlist >>> directly. >> Impressive work, I didn't thought that this would be possible that "easily". >> >> Please note that we have considered replacing scatterlists with simple >> arrays of DMA-addresses in the DMA-buf framework to avoid people trying >> to access the struct page inside the scatterlist. >> > FWIW, I'm not doing anything with the struct pages inside the > scatterlist. All I need from the scatterlist are the > sg_dma_address(sg) and the sg_dma_len(sg), and I'm guessing the array > you're describing will provide exactly those, but let me know if I > misunderstood. Your understanding is perfectly correct. > >> It might be a good idea to push for that first before this here is >> finally implemented. >> >> GPU drivers already convert the scatterlist used to arrays of >> DMA-addresses as soon as they get them. This leaves RDMA and V4L as the >> other two main users which would need to be converted. >> >>> This is the approach proposed at a high level here[2]. >>> >>> Detailed changes: >>> 1. Replaced dma-buf pages approach with importing scatterlist into the >>> page pool. >>> 2. Replace the dma-buf pages centric API with a netlink API. >>> 3. Removed the TX path implementation - there is no issue with >>> implementing the TX path with scatterlist approach, but leaving >>> out the TX path makes it easier to review. >>> 4. Functionality is tested with this proposal, but I have not conducted >>> perf testing yet. I'm not sure there are regressions, but I removed >>> perf claims from the cover letter until they can be re-confirmed. >>> 5. Added Signed-off-by: contributors to the implementation. >>> 6. Fixed some bugs with the RX path since RFC v1. >>> >>> Any feedback welcome, but specifically the biggest pending questions >>> needing feedback IMO are: >>> >>> 1. Feedback on the scatterlist-based approach in general. >> As far as I can see this sounds like the right thing to do in general. >> >> Question is rather if we should stick with scatterlist, use array of >> DMA-addresses or maybe even come up with a completely new structure. >> > As far as I can tell, it should be trivial to switch this device > memory TCP implementation to anything that provides: > > 1. DMA-addresses (sg_dma_address() equivalent) > 2. lengths (sg_dma_len() equivalent) > > if you go that route. Specifically, I think it will be pretty much a > localized change to netdev_bind_dmabuf_to_queue() implemented in this > patch: > https://lore.kernel.org/netdev/ZNULIDzuVVyfyMq2@ziepe.ca/T/#m2d344b08f54562cc9155c3f5b018cbfaed96036f Thanks, that's exactly what I wanted to hear. > >>> 2. Netlink API (Patch 1 & 2). >> How does netlink manage the lifetime of objects? >> > Netlink itself doesn't handle the lifetime of the binding. However, > the API I implemented unbinds the dma-buf when the netlink socket is > destroyed. I do this so that even if the user process crashes or > forgets to unbind, the dma-buf will still be unbound once the netlink > socket is closed on the process exit. 
Details in this patch: > https://lore.kernel.org/netdev/ZNULIDzuVVyfyMq2@ziepe.ca/T/#m2d344b08f54562cc9155c3f5b018cbfaed96036f I need to double check the details, but at least of hand that sounds sufficient for the lifetime requirements of DMA-buf. Thanks, Christian. > > On Thu, Aug 10, 2023 at 9:07 AM Jason Gunthorpe <jgg@ziepe.ca> wrote: >> On Thu, Aug 10, 2023 at 12:29:08PM +0200, Christian König wrote: >>> Am 10.08.23 um 03:57 schrieb Mina Almasry: >>>> Changes in RFC v2: >>>> ------------------ >>>> >>>> The sticking point in RFC v1[1] was the dma-buf pages approach we used to >>>> deliver the device memory to the TCP stack. RFC v2 is a proof-of-concept >>>> that attempts to resolve this by implementing scatterlist support in the >>>> networking stack, such that we can import the dma-buf scatterlist >>>> directly. >>> Impressive work, I didn't thought that this would be possible that "easily". >>> >>> Please note that we have considered replacing scatterlists with simple >>> arrays of DMA-addresses in the DMA-buf framework to avoid people trying to >>> access the struct page inside the scatterlist. >>> >>> It might be a good idea to push for that first before this here is finally >>> implemented. >>> >>> GPU drivers already convert the scatterlist used to arrays of DMA-addresses >>> as soon as they get them. This leaves RDMA and V4L as the other two main >>> users which would need to be converted. >> Oh that would be a nightmare for RDMA. >> >> We need a standard based way to have scalable lists of DMA addresses :( >> >>>> 2. Netlink API (Patch 1 & 2). >>> How does netlink manage the lifetime of objects? >> And access control.. >> > Someone will correct me if I'm wrong but I'm not sure netlink itself > will do (sufficient) access control. However I meant for the netlink > API to bind dma-bufs to be a CAP_NET_ADMIN API, and I forgot to add > this check in this proof-of-concept, sorry. I'll add a CAP_NET_ADMIN > check in netdev_bind_dmabuf_to_queue() in the next iteration.
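For readers wondering how "unbind when the netlink socket goes away" can be wired up at all: one generic kernel mechanism is the NETLINK_URELEASE notifier, which fires when a netlink socket is released and lets a subsystem tear down state keyed on that socket's portid. The sketch below is illustrative only and is not necessarily how the series implements it; netdev_unbind_dmabuf() is a hypothetical stand-in for the actual unbind path.

```c
/* Sketch: tearing down dma-buf bindings when the owning netlink socket is
 * released, via the NETLINK_URELEASE notifier. netdev_unbind_dmabuf() is a
 * hypothetical helper standing in for the series' unbind path.
 */
#include <linux/netlink.h>
#include <linux/notifier.h>
#include <net/genetlink.h>

static void netdev_unbind_dmabuf(u32 portid)
{
	/* Hypothetical: walk the bindings owned by @portid and release them. */
}

static int netdev_dmabuf_netlink_notify(struct notifier_block *nb,
					unsigned long state, void *_notify)
{
	struct netlink_notify *notify = _notify;

	if (state != NETLINK_URELEASE || notify->protocol != NETLINK_GENERIC)
		return NOTIFY_DONE;

	/* Drop every binding created by the socket that just went away. */
	netdev_unbind_dmabuf(notify->portid);
	return NOTIFY_DONE;
}

static struct notifier_block netdev_dmabuf_notifier = {
	.notifier_call = netdev_dmabuf_netlink_notify,
};

/* Registered from the subsystem's init path. */
static int __init netdev_dmabuf_init(void)
{
	return netlink_register_notifier(&netdev_dmabuf_notifier);
}
```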
On 8/9/23 7:57 PM, Mina Almasry wrote: > Changes in RFC v2: > ------------------ > > The sticking point in RFC v1[1] was the dma-buf pages approach we used to > deliver the device memory to the TCP stack. RFC v2 is a proof-of-concept > that attempts to resolve this by implementing scatterlist support in the > networking stack, such that we can import the dma-buf scatterlist > directly. This is the approach proposed at a high level here[2]. > > Detailed changes: > 1. Replaced dma-buf pages approach with importing scatterlist into the > page pool. > 2. Replace the dma-buf pages centric API with a netlink API. > 3. Removed the TX path implementation - there is no issue with > implementing the TX path with scatterlist approach, but leaving > out the TX path makes it easier to review. > 4. Functionality is tested with this proposal, but I have not conducted > perf testing yet. I'm not sure there are regressions, but I removed > perf claims from the cover letter until they can be re-confirmed. > 5. Added Signed-off-by: contributors to the implementation. > 6. Fixed some bugs with the RX path since RFC v1. > > Any feedback welcome, but specifically the biggest pending questions > needing feedback IMO are: > > 1. Feedback on the scatterlist-based approach in general. > 2. Netlink API (Patch 1 & 2). > 3. Approach to handle all the drivers that expect to receive pages from > the page pool (Patch 6). > > [1] https://lore.kernel.org/netdev/dfe4bae7-13a0-3c5d-d671-f61b375cb0b4@gmail.com/T/ > [2] https://lore.kernel.org/netdev/CAHS8izPm6XRS54LdCDZVd0C75tA1zHSu6jLVO8nzTLXCc=H7Nw@mail.gmail.com/ > > ---------------------- > > * TL;DR: > > Device memory TCP (devmem TCP) is a proposal for transferring data to and/or > from device memory efficiently, without bouncing the data to a host memory > buffer. > > * Problem: > > A large amount of data transfers have device memory as the source and/or > destination. Accelerators drastically increased the volume of such transfers. > Some examples include: > - ML accelerators transferring large amounts of training data from storage into > GPU/TPU memory. In some cases ML training setup time can be as long as 50% of > TPU compute time, improving data transfer throughput & efficiency can help > improving GPU/TPU utilization. > > - Distributed training, where ML accelerators, such as GPUs on different hosts, > exchange data among them. > > - Distributed raw block storage applications transfer large amounts of data with > remote SSDs, much of this data does not require host processing. > > Today, the majority of the Device-to-Device data transfers the network are > implemented as the following low level operations: Device-to-Host copy, > Host-to-Host network transfer, and Host-to-Device copy. > > The implementation is suboptimal, especially for bulk data transfers, and can > put significant strains on system resources, such as host memory bandwidth, > PCIe bandwidth, etc. One important reason behind the current state is the > kernel’s lack of semantics to express device to network transfers. > > * Proposal: > > In this patch series we attempt to optimize this use case by implementing > socket APIs that enable the user to: > > 1. send device memory across the network directly, and > 2. receive incoming network packets directly into device memory. > > Packet _payloads_ go directly from the NIC to device memory for receive and from > device memory to NIC for transmit. > Packet _headers_ go to/from host memory and are processed by the TCP/IP stack > normally. 
The NIC _must_ support header split to achieve this. > > Advantages: > > - Alleviate host memory bandwidth pressure, compared to existing > network-transfer + device-copy semantics. > > - Alleviate PCIe BW pressure, by limiting data transfer to the lowest level > of the PCIe tree, compared to traditional path which sends data through the > root complex. > > * Patch overview: > > ** Part 1: netlink API > > Gives user ability to bind dma-buf to an RX queue. > > ** Part 2: scatterlist support > > Currently the standard for device memory sharing is DMABUF, which doesn't > generate struct pages. On the other hand, networking stack (skbs, drivers, and > page pool) operate on pages. We have 2 options: > > 1. Generate struct pages for dmabuf device memory, or, > 2. Modify the networking stack to process scatterlist. > > Approach #1 was attempted in RFC v1. RFC v2 implements approach #2. > > ** part 3: page pool support > > We piggy back on page pool memory providers proposal: > https://github.com/kuba-moo/linux/tree/pp-providers > > It allows the page pool to define a memory provider that provides the > page allocation and freeing. It helps abstract most of the device memory > TCP changes from the driver. > > ** part 4: support for unreadable skb frags > > Page pool iovs are not accessible by the host; we implement changes > throughput the networking stack to correctly handle skbs with unreadable > frags. > > ** Part 5: recvmsg() APIs > > We define user APIs for the user to send and receive device memory. > > Not included with this RFC is the GVE devmem TCP support, just to > simplify the review. Code available here if desired: > https://github.com/mina/linux/tree/tcpdevmem > > This RFC is built on top of net-next with Jakub's pp-providers changes > cherry-picked. > > * NIC dependencies: > > 1. (strict) Devmem TCP require the NIC to support header split, i.e. the > capability to split incoming packets into a header + payload and to put > each into a separate buffer. Devmem TCP works by using device memory > for the packet payload, and host memory for the packet headers. > > 2. (optional) Devmem TCP works better with flow steering support & RSS support, > i.e. the NIC's ability to steer flows into certain rx queues. This allows the > sysadmin to enable devmem TCP on a subset of the rx queues, and steer > devmem TCP traffic onto these queues and non devmem TCP elsewhere. > > The NIC I have access to with these properties is the GVE with DQO support > running in Google Cloud, but any NIC that supports these features would suffice. > I may be able to help reviewers bring up devmem TCP on their NICs. > > * Testing: > > The series includes a udmabuf kselftest that show a simple use case of > devmem TCP and validates the entire data path end to end without > a dependency on a specific dmabuf provider. > > ** Test Setup > > Kernel: net-next with this RFC and memory provider API cherry-picked > locally. > > Hardware: Google Cloud A3 VMs. > > NIC: GVE with header split & RSS & flow steering support. This set seems to depend on Jakub's memory provider patches and a netdev driver change which is not included. For the testing mentioned here, you must have a tree + branch with all of the patches. Is it publicly available? It would be interesting to see how well (easy) this integrates with io_uring. 
Besides avoiding all of the syscalls for receiving the iov and releasing the buffers back to the pool, io_uring also brings in the ability to seed a page_pool with registered buffers which provides a means to get simpler Rx ZC for host memory. Overall I like the intent and possibilities for extensions, but a lot of details are missing - perhaps some are answered by seeing an end-to-end implementation.
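For context on the page_pool angle raised here: a conventional driver-owned page_pool is created through page_pool_create() roughly as sketched below, and both Jakub's memory-provider branch and this series hook in behind that same allocation interface. The sketch uses only the stock in-tree page_pool API; the provider hooks are not shown, and the ring size and device pointer are illustrative.

```c
/* Sketch: a conventional driver-owned page_pool for an rx queue, using the
 * stock page_pool API. Memory-provider hooks are not shown.
 */
#include <linux/dma-mapping.h>
#include <net/page_pool.h>

static struct page_pool *rxq_create_page_pool(struct device *dev,
					      unsigned int ring_size)
{
	struct page_pool_params pp = {
		.flags     = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV,
		.order     = 0,
		.pool_size = ring_size,
		.nid       = NUMA_NO_NODE,
		.dev       = dev,
		.dma_dir   = DMA_FROM_DEVICE,
		.max_len   = PAGE_SIZE,
		.offset    = 0,
	};

	/* Returns an ERR_PTR() on failure; the caller must check IS_ERR(). */
	return page_pool_create(&pp);
}

static int rxq_alloc_and_recycle_demo(struct page_pool *pool)
{
	struct page *page = page_pool_dev_alloc_pages(pool);

	if (!page)
		return -ENOMEM;

	/* A real driver would post page_pool_get_dma_addr(page) to its rx
	 * descriptor ring here; this demo just recycles the page immediately.
	 */
	page_pool_put_full_page(pool, page, false /* allow_direct */);
	return 0;
}
```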
On Sun, Aug 13, 2023 at 6:12 PM David Ahern <dsahern@kernel.org> wrote: > > On 8/9/23 7:57 PM, Mina Almasry wrote: > > Changes in RFC v2: > > ------------------ > > > > The sticking point in RFC v1[1] was the dma-buf pages approach we used to > > deliver the device memory to the TCP stack. RFC v2 is a proof-of-concept > > that attempts to resolve this by implementing scatterlist support in the > > networking stack, such that we can import the dma-buf scatterlist > > directly. This is the approach proposed at a high level here[2]. > > > > Detailed changes: > > 1. Replaced dma-buf pages approach with importing scatterlist into the > > page pool. > > 2. Replace the dma-buf pages centric API with a netlink API. > > 3. Removed the TX path implementation - there is no issue with > > implementing the TX path with scatterlist approach, but leaving > > out the TX path makes it easier to review. > > 4. Functionality is tested with this proposal, but I have not conducted > > perf testing yet. I'm not sure there are regressions, but I removed > > perf claims from the cover letter until they can be re-confirmed. > > 5. Added Signed-off-by: contributors to the implementation. > > 6. Fixed some bugs with the RX path since RFC v1. > > > > Any feedback welcome, but specifically the biggest pending questions > > needing feedback IMO are: > > > > 1. Feedback on the scatterlist-based approach in general. > > 2. Netlink API (Patch 1 & 2). > > 3. Approach to handle all the drivers that expect to receive pages from > > the page pool (Patch 6). > > > > [1] https://lore.kernel.org/netdev/dfe4bae7-13a0-3c5d-d671-f61b375cb0b4@gmail.com/T/ > > [2] https://lore.kernel.org/netdev/CAHS8izPm6XRS54LdCDZVd0C75tA1zHSu6jLVO8nzTLXCc=H7Nw@mail.gmail.com/ > > > > ---------------------- > > > > * TL;DR: > > > > Device memory TCP (devmem TCP) is a proposal for transferring data to and/or > > from device memory efficiently, without bouncing the data to a host memory > > buffer. > > > > * Problem: > > > > A large amount of data transfers have device memory as the source and/or > > destination. Accelerators drastically increased the volume of such transfers. > > Some examples include: > > - ML accelerators transferring large amounts of training data from storage into > > GPU/TPU memory. In some cases ML training setup time can be as long as 50% of > > TPU compute time, improving data transfer throughput & efficiency can help > > improving GPU/TPU utilization. > > > > - Distributed training, where ML accelerators, such as GPUs on different hosts, > > exchange data among them. > > > > - Distributed raw block storage applications transfer large amounts of data with > > remote SSDs, much of this data does not require host processing. > > > > Today, the majority of the Device-to-Device data transfers the network are > > implemented as the following low level operations: Device-to-Host copy, > > Host-to-Host network transfer, and Host-to-Device copy. > > > > The implementation is suboptimal, especially for bulk data transfers, and can > > put significant strains on system resources, such as host memory bandwidth, > > PCIe bandwidth, etc. One important reason behind the current state is the > > kernel’s lack of semantics to express device to network transfers. > > > > * Proposal: > > > > In this patch series we attempt to optimize this use case by implementing > > socket APIs that enable the user to: > > > > 1. send device memory across the network directly, and > > 2. receive incoming network packets directly into device memory. 
> > > > Packet _payloads_ go directly from the NIC to device memory for receive and from > > device memory to NIC for transmit. > > Packet _headers_ go to/from host memory and are processed by the TCP/IP stack > > normally. The NIC _must_ support header split to achieve this. > > > > Advantages: > > > > - Alleviate host memory bandwidth pressure, compared to existing > > network-transfer + device-copy semantics. > > > > - Alleviate PCIe BW pressure, by limiting data transfer to the lowest level > > of the PCIe tree, compared to traditional path which sends data through the > > root complex. > > > > * Patch overview: > > > > ** Part 1: netlink API > > > > Gives user ability to bind dma-buf to an RX queue. > > > > ** Part 2: scatterlist support > > > > Currently the standard for device memory sharing is DMABUF, which doesn't > > generate struct pages. On the other hand, networking stack (skbs, drivers, and > > page pool) operate on pages. We have 2 options: > > > > 1. Generate struct pages for dmabuf device memory, or, > > 2. Modify the networking stack to process scatterlist. > > > > Approach #1 was attempted in RFC v1. RFC v2 implements approach #2. > > > > ** part 3: page pool support > > > > We piggy back on page pool memory providers proposal: > > https://github.com/kuba-moo/linux/tree/pp-providers > > > > It allows the page pool to define a memory provider that provides the > > page allocation and freeing. It helps abstract most of the device memory > > TCP changes from the driver. > > > > ** part 4: support for unreadable skb frags > > > > Page pool iovs are not accessible by the host; we implement changes > > throughput the networking stack to correctly handle skbs with unreadable > > frags. > > > > ** Part 5: recvmsg() APIs > > > > We define user APIs for the user to send and receive device memory. > > > > Not included with this RFC is the GVE devmem TCP support, just to > > simplify the review. Code available here if desired: > > https://github.com/mina/linux/tree/tcpdevmem > > > > This RFC is built on top of net-next with Jakub's pp-providers changes > > cherry-picked. > > > > * NIC dependencies: > > > > 1. (strict) Devmem TCP require the NIC to support header split, i.e. the > > capability to split incoming packets into a header + payload and to put > > each into a separate buffer. Devmem TCP works by using device memory > > for the packet payload, and host memory for the packet headers. > > > > 2. (optional) Devmem TCP works better with flow steering support & RSS support, > > i.e. the NIC's ability to steer flows into certain rx queues. This allows the > > sysadmin to enable devmem TCP on a subset of the rx queues, and steer > > devmem TCP traffic onto these queues and non devmem TCP elsewhere. > > > > The NIC I have access to with these properties is the GVE with DQO support > > running in Google Cloud, but any NIC that supports these features would suffice. > > I may be able to help reviewers bring up devmem TCP on their NICs. > > > > * Testing: > > > > The series includes a udmabuf kselftest that show a simple use case of > > devmem TCP and validates the entire data path end to end without > > a dependency on a specific dmabuf provider. > > > > ** Test Setup > > > > Kernel: net-next with this RFC and memory provider API cherry-picked > > locally. > > > > Hardware: Google Cloud A3 VMs. > > > > NIC: GVE with header split & RSS & flow steering support. > > This set seems to depend on Jakub's memory provider patches and a netdev > driver change which is not included. 
For the testing mentioned here, you > must have a tree + branch with all of the patches. Is it publicly available? > Yes, the net-next based branch is right here: https://github.com/mina/linux/tree/tcpdevmem Here is the git log of that branch: https://github.com/mina/linux/commits/tcpdevmem FWIW, it's already linked from the (long) cover letter, at the end of the '* Patch overview:' section. The branch includes everything you mentioned above. The netdev driver I'm using is the GVE. It also includes patches to implement header split & flow steering for GVE (being upstreamed separately), and some debug changes. > It would be interesting to see how well (easy) this integrates with > io_uring. Besides avoiding all of the syscalls for receiving the iov and > releasing the buffers back to the pool, io_uring also brings in the > ability to seed a page_pool with registered buffers which provides a > means to get simpler Rx ZC for host memory. > > Overall I like the intent and possibilities for extensions, but a lot of > details are missing - perhaps some are answered by seeing an end-to-end > implementation.
From: Mina Almasry > Sent: 10 August 2023 02:58 ... > * TL;DR: > > Device memory TCP (devmem TCP) is a proposal for transferring data to and/or > from device memory efficiently, without bouncing the data to a host memory > buffer. Doesn't that really require peer-to-peer PCIe transfers? IIRC these aren't supported by many root hubs and have fundamental flow control and/or TLP credit problems. I'd guess they are also pretty incompatible with IOMMU? I can see how you might manage to transmit frames from some external memory (eg after encryption) but surely processing receive data that way needs the packets to be filtered by both IP addresses and port numbers before being redirected to the (presumably limited) external memory. OTOH isn't the kernel going to need to run code before the packet is actually sent and just after it is received? So all you might gain is a bit of latency? And a bit less utilisation of host memory?? But if your system is really limited by cpu-memory bandwidth you need more cache :-) So how much benefit is there over efficient use of host memory bounce buffers?? David
On Tue, Aug 15, 2023 at 9:38 AM David Laight <David.Laight@aculab.com> wrote: > > From: Mina Almasry > > Sent: 10 August 2023 02:58 > ... > > * TL;DR: > > > > Device memory TCP (devmem TCP) is a proposal for transferring data to and/or > > from device memory efficiently, without bouncing the data to a host memory > > buffer. > > Doesn't that really require peer-to-peer PCIe transfers? > IIRC these aren't supported by many root hubs and have > fundamental flow control and/or TLP credit problems. > > I'd guess they are also pretty incompatible with IOMMU? Yes, this is a form of PCI_P2PDMA and all the limitations of that apply. > I can see how you might manage to transmit frames from > some external memory (eg after encryption) but surely > processing receive data that way needs the packets > be filtered by both IP addresses and port numbers before > being redirected to the (presumably limited) external > memory. This feature depends on NIC receive header split. The TCP/IP headers are stored to host memory, the payload to device memory. Optionally, on devices that do not support explicit header-split, but do support scatter-gather I/O, if the header size is constant and known, that can be used as a weak substitute. This has additional caveats wrt unexpected traffic for which payload must be host visible (e.g., ICMP). > OTOH isn't the kernel going to need to run code before > the packet is actually sent and just after it is received? > So all you might gain is a bit of latency? > And a bit less utilisation of host memory?? > But if your system is really limited by cpu-memory bandwidth > you need more cache :-) > > > So how much benefit is there over efficient use of host > memory bounce buffers?? Among other things, on a PCIe tree this makes it possible to load up machines with many NICs + GPUs.
On 8/14/23 02:12, David Ahern wrote: > On 8/9/23 7:57 PM, Mina Almasry wrote: >> Changes in RFC v2: >> ------------------ ... >> ** Test Setup >> >> Kernel: net-next with this RFC and memory provider API cherry-picked >> locally. >> >> Hardware: Google Cloud A3 VMs. >> >> NIC: GVE with header split & RSS & flow steering support. > > This set seems to depend on Jakub's memory provider patches and a netdev > driver change which is not included. For the testing mentioned here, you > must have a tree + branch with all of the patches. Is it publicly available? > > It would be interesting to see how well (easy) this integrates with > io_uring. Besides avoiding all of the syscalls for receiving the iov and > releasing the buffers back to the pool, io_uring also brings in the > ability to seed a page_pool with registered buffers which provides a > means to get simpler Rx ZC for host memory. The patchset sounds pretty interesting. I've been working with David Wei (CC'ing) on io_uring zc rx (currently at the prototype-polishing stage), which is based on a similar approach of allocating an rx queue. It targets host memory, with device memory as an extra feature; the uapi is different, and lifetimes are managed by/bound to io_uring. Completions/buffers are returned to the user via a separate queue instead of cmsg, and pushed back granularly to the kernel via another queue. I'll leave it to David to elaborate. It sounds like we have space for collaboration here, if not merging then reusing internals as much as we can, but we'd need to look into the details more deeply. > Overall I like the intent and possibilities for extensions, but a lot of > details are missing - perhaps some are answered by seeing an end-to-end > implementation.
On Thu, Aug 17, 2023 at 11:04 AM Pavel Begunkov <asml.silence@gmail.com> wrote: > > On 8/14/23 02:12, David Ahern wrote: > > On 8/9/23 7:57 PM, Mina Almasry wrote: > >> Changes in RFC v2: > >> ------------------ > ... > >> ** Test Setup > >> > >> Kernel: net-next with this RFC and memory provider API cherry-picked > >> locally. > >> > >> Hardware: Google Cloud A3 VMs. > >> > >> NIC: GVE with header split & RSS & flow steering support. > > > > This set seems to depend on Jakub's memory provider patches and a netdev > > driver change which is not included. For the testing mentioned here, you > > must have a tree + branch with all of the patches. Is it publicly available? > > > > It would be interesting to see how well (easy) this integrates with > > io_uring. Besides avoiding all of the syscalls for receiving the iov and > > releasing the buffers back to the pool, io_uring also brings in the > > ability to seed a page_pool with registered buffers which provides a > > means to get simpler Rx ZC for host memory. > > The patchset sounds pretty interesting. I've been working with David Wei > (CC'ing) on io_uring zc rx (prototype polishing stage) all that is old > similar approaches based on allocating an rx queue. It targets host > memory and device memory as an extra feature, uapi is different, lifetimes > are managed/bound to io_uring. Completions/buffers are returned to user via > a separate queue instead of cmsg, and pushed back granularly to the kernel > via another queue. I'll leave it to David to elaborate > > It sounds like we have space for collaboration here, if not merging then > reusing internals as much as we can, but we'd need to look into the > details deeper. > I'm happy to look at your implementation and collaborate on something that works for both use cases. Feel free to share unpolished prototype so I can start having a general idea if possible. > > Overall I like the intent and possibilities for extensions, but a lot of > > details are missing - perhaps some are answered by seeing an end-to-end > > implementation. > > -- > Pavel Begunkov
On 17/08/2023 15:18, Mina Almasry wrote: > On Thu, Aug 17, 2023 at 11:04 AM Pavel Begunkov <asml.silence@gmail.com> wrote: >> >> On 8/14/23 02:12, David Ahern wrote: >>> On 8/9/23 7:57 PM, Mina Almasry wrote: >>>> Changes in RFC v2: >>>> ------------------ >> ... >>>> ** Test Setup >>>> >>>> Kernel: net-next with this RFC and memory provider API cherry-picked >>>> locally. >>>> >>>> Hardware: Google Cloud A3 VMs. >>>> >>>> NIC: GVE with header split & RSS & flow steering support. >>> >>> This set seems to depend on Jakub's memory provider patches and a netdev >>> driver change which is not included. For the testing mentioned here, you >>> must have a tree + branch with all of the patches. Is it publicly available? >>> >>> It would be interesting to see how well (easy) this integrates with >>> io_uring. Besides avoiding all of the syscalls for receiving the iov and >>> releasing the buffers back to the pool, io_uring also brings in the >>> ability to seed a page_pool with registered buffers which provides a >>> means to get simpler Rx ZC for host memory. >> >> The patchset sounds pretty interesting. I've been working with David Wei >> (CC'ing) on io_uring zc rx (prototype polishing stage) all that is old >> similar approaches based on allocating an rx queue. It targets host >> memory and device memory as an extra feature, uapi is different, lifetimes >> are managed/bound to io_uring. Completions/buffers are returned to user via >> a separate queue instead of cmsg, and pushed back granularly to the kernel >> via another queue. I'll leave it to David to elaborate >> >> It sounds like we have space for collaboration here, if not merging then >> reusing internals as much as we can, but we'd need to look into the >> details deeper. >> > > I'm happy to look at your implementation and collaborate on something > that works for both use cases. Feel free to share unpolished prototype > so I can start having a general idea if possible. Hi I'm David and I am working with Pavel on this. We will have something to share with you on the mailing list before the end of the week. I'm also preparing a submission for NetDev conf. I wonder if you and others at Google plan to present there as well? If so, then we may want to coordinate our submissions and talks (if accepted). Please let me know this week, thanks! > >>> Overall I like the intent and possibilities for extensions, but a lot of >>> details are missing - perhaps some are answered by seeing an end-to-end >>> implementation. >> >> -- >> Pavel Begunkov > > >
On 8/23/23 3:52 PM, David Wei wrote: > I'm also preparing a submission for NetDev conf. I wonder if you and others at > Google plan to present there as well? If so, then we may want to coordinate our > submissions and talks (if accepted). Personally, I see them as related but separate topics, with Mina's proposal as infra that io_uring builds on. Both are interesting and needed discussions.