
[net-next,v3,0/6] Device memory TCP TX

Message ID 20250203223916.1064540-1-almasrymina@google.com (mailing list archive)

Mina Almasry Feb. 3, 2025, 10:39 p.m. UTC
v3: https://patchwork.kernel.org/project/netdevbpf/list/?series=929401&state=*
===

Addresses minor comments from RFC v2 and fixes a few build warnings and
ynl-regen issues. No major changes.

RFC v2: https://patchwork.kernel.org/project/netdevbpf/list/?series=920056&state=*
=======

RFC v2 addresses much of the feedback from RFC v1. I plan on sending
something close to this once net-next reopens; I'm sending it slightly
early to get any feedback.

Major changes:
--------------

- Much improved UAPI, as suggested by Stan. We now interpret the iov_base
  of the iov passed in from userspace as the offset into the dmabuf to
  send from. This removes the need to set iov.iov_base = NULL, which may
  be confusing to users, and enables us to send multiple iovs in the
  same sendmsg() call. ncdevmem and the docs show a sample use of that.

- Removed the duplicate dmabuf iov_iter in binding->iov_iter. I think
  this is a good improvement, as it was confusing to keep track of
  2 iterators for the same sendmsg, and mistracking them caused a
  couple of bugs reported in the last iteration that are now resolved
  with this streamlining.

- Improved test coverage in ncdevmem. Multiple sendmsg() calls are now
  tested, as is sending multiple iovs in the same sendmsg() call.

- Fixed issue where dmabuf unmapping was happening in invalid context
  (Stan).
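The offset-based UAPI above can be sketched from userspace as below. This is an illustrative sketch only: SCM_DEVMEM_DMABUF, its numeric value, and the u32 dmabuf-id cmsg layout mirror the RX side of the series and are assumptions here; verify them against the merged uapi headers before relying on them.

```c
/* Sketch of the devmem TX sendmsg() request described above: iov_base is
 * interpreted as an offset into the bound dmabuf, so two iovs can name two
 * regions of the same dmabuf in one call. The cmsg carries the dmabuf
 * binding id. SCM_DEVMEM_DMABUF's value is an assumption from recent
 * uapi headers; check your local asm-generic/socket.h. */
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

#ifndef SCM_DEVMEM_DMABUF
#define SCM_DEVMEM_DMABUF 79	/* assumed value; verify locally */
#endif

struct devmem_tx {
	struct msghdr msg;
	struct iovec iov[2];
	char control[CMSG_SPACE(sizeof(uint32_t))];
};

/* Build a request sending two chunks of the dmabuf identified by
 * dmabuf_id: len0 bytes at offset off0 and len1 bytes at offset off1. */
static void build_devmem_msg(struct devmem_tx *tx, uint32_t dmabuf_id,
			     size_t off0, size_t len0,
			     size_t off1, size_t len1)
{
	struct cmsghdr *cm;

	memset(tx, 0, sizeof(*tx));
	tx->iov[0].iov_base = (void *)off0;	/* an offset, not a pointer */
	tx->iov[0].iov_len = len0;
	tx->iov[1].iov_base = (void *)off1;
	tx->iov[1].iov_len = len1;

	tx->msg.msg_iov = tx->iov;
	tx->msg.msg_iovlen = 2;
	tx->msg.msg_control = tx->control;
	tx->msg.msg_controllen = sizeof(tx->control);

	cm = CMSG_FIRSTHDR(&tx->msg);
	cm->cmsg_level = SOL_SOCKET;
	cm->cmsg_type = SCM_DEVMEM_DMABUF;
	cm->cmsg_len = CMSG_LEN(sizeof(uint32_t));
	memcpy(CMSG_DATA(cm), &dmabuf_id, sizeof(uint32_t));
}
```

The caller would then pass &tx->msg to sendmsg() with MSG_ZEROCOPY on a socket whose netdevice has the dmabuf bound.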

====================================================================

The TX path had been dropped from the Device Memory TCP patch series
post RFCv1 [1], to make that series slightly easier to review. This
series rebases the implementation of the TX path on top of the
net_iov/netmem framework agreed upon and merged. The motivation for
the feature is thoroughly described in the docs & cover letter of the
original proposal, so I don't repeat the lengthy descriptions here, but
they are available in [1].

Sending this series as RFC as the merge window closure is imminent. I plan on
reposting as non-RFC once the tree re-opens, addressing any feedback
I receive in the meantime.

Full outline on usage of the TX path is detailed in the documentation
added in the first patch.

Test example is available via the kselftest included in the series as well.

The series is relatively small, as the TX path for this feature largely
piggybacks on the existing MSG_ZEROCOPY implementation.

Patch Overview:
---------------

1. Documentation & tests to give high level overview of the feature
   being added.

2. Add netmem refcounting needed for the TX path.

3. Devmem TX netlink API.

4. Devmem TX net stack implementation.

Testing:
--------

Testing is very similar to the devmem TCP RX path. The ncdevmem test used
for the RX path is now augmented with client functionality to test the TX
path.

* Test Setup:

Kernel: net-next with this RFC and memory provider API cherry-picked
locally.

Hardware: Google Cloud A3 VMs.

NIC: GVE with header split & RSS & flow steering support.

Performance results are not included with this version, unfortunately.
I'm having issues running the dma-buf exporter driver against the
upstream kernel on my test setup. The issues are specific to that
dma-buf exporter and do not affect this patch series. I plan to follow
up this series with perf fixes if the tests point to issues once they're
up and running.

Special thanks to Stan, who took a stab at rebasing the TX implementation
on top of the merged netmem/net_iov framework. Parts of his proposal [2]
that are reused as-is are forked off into their own patches to give full
credit.

[1] https://lore.kernel.org/netdev/20240909054318.1809580-1-almasrymina@google.com/
[2] https://lore.kernel.org/netdev/20240913150913.1280238-2-sdf@fomichev.me/T/#m066dd407fbed108828e2c40ae50e3f4376ef57fd

Cc: sdf@fomichev.me
Cc: asml.silence@gmail.com
Cc: dw@davidwei.uk
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Victor Nogueira <victor@mojatatu.com>
Cc: Pedro Tammela <pctammela@mojatatu.com>
Cc: Samiullah Khawaja <skhawaja@google.com>


Mina Almasry (5):
  net: add devmem TCP TX documentation
  selftests: ncdevmem: Implement devmem TCP TX
  net: add get_netmem/put_netmem support
  net: devmem: Implement TX path
  net: devmem: make dmabuf unbinding scheduled work

Stanislav Fomichev (1):
  net: devmem: TCP tx netlink api

 Documentation/netlink/specs/netdev.yaml       |  12 +
 Documentation/networking/devmem.rst           | 144 ++++++++-
 include/linux/skbuff.h                        |  15 +-
 include/linux/skbuff_ref.h                    |   4 +-
 include/net/netmem.h                          |   3 +
 include/net/sock.h                            |   1 +
 include/uapi/linux/netdev.h                   |   1 +
 include/uapi/linux/uio.h                      |   6 +-
 net/core/datagram.c                           |  41 ++-
 net/core/devmem.c                             | 111 ++++++-
 net/core/devmem.h                             |  70 +++-
 net/core/netdev-genl-gen.c                    |  13 +
 net/core/netdev-genl-gen.h                    |   1 +
 net/core/netdev-genl.c                        |  66 +++-
 net/core/skbuff.c                             |  36 ++-
 net/core/sock.c                               |   8 +
 net/ipv4/tcp.c                                |  36 ++-
 net/vmw_vsock/virtio_transport_common.c       |   3 +-
 tools/include/uapi/linux/netdev.h             |   1 +
 .../selftests/drivers/net/hw/ncdevmem.c       | 300 +++++++++++++++++-
 20 files changed, 819 insertions(+), 53 deletions(-)

Comments

Paolo Abeni Feb. 4, 2025, 12:32 p.m. UTC | #1
On 2/3/25 11:39 PM, Mina Almasry wrote:
> The TX path had been dropped from the Device Memory TCP patch series
> post RFCv1 [1], to make that series slightly easier to review. This
> series rebases the implementation of the TX path on top of the
> net_iov/netmem framework agreed upon and merged. The motivation for
> the feature is thoroughly described in the docs & cover letter of the
> original proposal, so I don't repeat the lengthy descriptions here, but
> they are available in [1].
> 
> Sending this series as RFC as the winder closure is immenient. I plan on
> reposting as non-RFC once the tree re-opens, addressing any feedback
> I receive in the meantime.

I guess you should drop this paragraph.

> Full outline on usage of the TX path is detailed in the documentation
> added in the first patch.
> 
> Test example is available via the kselftest included in the series as well.
> 
> The series is relatively small, as the TX path for this feature largely
> piggybacks on the existing MSG_ZEROCOPY implementation.

It looks like no additional device level support is required. That is
IMHO so good up to suspicious level :)

> Patch Overview:
> ---------------
> 
> 1. Documentation & tests to give high level overview of the feature
>    being added.
> 
> 2. Add netmem refcounting needed for the TX path.
> 
> 3. Devmem TX netlink API.
> 
> 4. Devmem TX net stack implementation.

It looks like even the above section needs some update.

/P
Mina Almasry Feb. 4, 2025, 5:27 p.m. UTC | #2
On Tue, Feb 4, 2025 at 4:32 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On 2/3/25 11:39 PM, Mina Almasry wrote:
> > The TX path had been dropped from the Device Memory TCP patch series
> > post RFCv1 [1], to make that series slightly easier to review. This
> > series rebases the implementation of the TX path on top of the
> > net_iov/netmem framework agreed upon and merged. The motivation for
> > the feature is thoroughly described in the docs & cover letter of the
> > original proposal, so I don't repeat the lengthy descriptions here, but
> > they are available in [1].
> >
> > Sending this series as RFC as the winder closure is immenient. I plan on
> > reposting as non-RFC once the tree re-opens, addressing any feedback
> > I receive in the meantime.
>
> I guess you should drop this paragraph.
>
> > Full outline on usage of the TX path is detailed in the documentation
> > added in the first patch.
> >
> > Test example is available via the kselftest included in the series as well.
> >
> > The series is relatively small, as the TX path for this feature largely
> > piggybacks on the existing MSG_ZEROCOPY implementation.
>
> It looks like no additional device level support is required. That is
> IMHO so good up to suspicious level :)
>

It is correct no additional device level support is required. I don't
have any local changes to my driver to make this work. I think Stan
on-list was able to run the TX path (he commented on fixes to the test
but didn't say it doesn't work :D) and one other person was able to
run it offlist.

> > Patch Overview:
> > ---------------
> >
> > 1. Documentation & tests to give high level overview of the feature
> >    being added.
> >
> > 2. Add netmem refcounting needed for the TX path.
> >
> > 3. Devmem TX netlink API.
> >
> > 4. Devmem TX net stack implementation.
>
> It looks like even the above section needs some update.
>

Ah, I usually keep the original cover letter untouched and put the
updates under the version labels. Looks like you expect the full cover
letter to be updated. Will do. Thanks for looking.
Stanislav Fomichev Feb. 4, 2025, 6:06 p.m. UTC | #3
On 02/04, Mina Almasry wrote:
> On Tue, Feb 4, 2025 at 4:32 AM Paolo Abeni <pabeni@redhat.com> wrote:
> >
> > On 2/3/25 11:39 PM, Mina Almasry wrote:
> > > The TX path had been dropped from the Device Memory TCP patch series
> > > post RFCv1 [1], to make that series slightly easier to review. This
> > > series rebases the implementation of the TX path on top of the
> > > net_iov/netmem framework agreed upon and merged. The motivation for
> > > the feature is thoroughly described in the docs & cover letter of the
> > > original proposal, so I don't repeat the lengthy descriptions here, but
> > > they are available in [1].
> > >
> > > Sending this series as RFC as the winder closure is immenient. I plan on
> > > reposting as non-RFC once the tree re-opens, addressing any feedback
> > > I receive in the meantime.
> >
> > I guess you should drop this paragraph.
> >
> > > Full outline on usage of the TX path is detailed in the documentation
> > > added in the first patch.
> > >
> > > Test example is available via the kselftest included in the series as well.
> > >
> > > The series is relatively small, as the TX path for this feature largely
> > > piggybacks on the existing MSG_ZEROCOPY implementation.
> >
> > It looks like no additional device level support is required. That is
> > IMHO so good up to suspicious level :)
> >
> 
> It is correct no additional device level support is required. I don't
> have any local changes to my driver to make this work. I think Stan
> on-list was able to run the TX path (he commented on fixes to the test
> but didn't say it doesn't work :D) and one other person was able to
> run it offlist.

For BRCM I had shared this: https://lore.kernel.org/netdev/ZxAfWHk3aRWl-F31@mini-arch/
I have similar internal patch for mlx5 (will share after RX part gets
in). I agree that it seems like gve_unmap_packet needs some work to be more
careful to not unmap NIOVs (if you were testing against gve).
Paolo Abeni Feb. 4, 2025, 6:32 p.m. UTC | #4
On 2/4/25 7:06 PM, Stanislav Fomichev wrote:
> On 02/04, Mina Almasry wrote:
>> On Tue, Feb 4, 2025 at 4:32 AM Paolo Abeni <pabeni@redhat.com> wrote:
>>>
>>> On 2/3/25 11:39 PM, Mina Almasry wrote:
>>>> The TX path had been dropped from the Device Memory TCP patch series
>>>> post RFCv1 [1], to make that series slightly easier to review. This
>>>> series rebases the implementation of the TX path on top of the
>>>> net_iov/netmem framework agreed upon and merged. The motivation for
>>>> the feature is thoroughly described in the docs & cover letter of the
>>>> original proposal, so I don't repeat the lengthy descriptions here, but
>>>> they are available in [1].
>>>>
>>>> Sending this series as RFC as the winder closure is immenient. I plan on
>>>> reposting as non-RFC once the tree re-opens, addressing any feedback
>>>> I receive in the meantime.
>>>
>>> I guess you should drop this paragraph.
>>>
>>>> Full outline on usage of the TX path is detailed in the documentation
>>>> added in the first patch.
>>>>
>>>> Test example is available via the kselftest included in the series as well.
>>>>
>>>> The series is relatively small, as the TX path for this feature largely
>>>> piggybacks on the existing MSG_ZEROCOPY implementation.
>>>
>>> It looks like no additional device level support is required. That is
>>> IMHO so good up to suspicious level :)
>>>
>>
>> It is correct no additional device level support is required. I don't
>> have any local changes to my driver to make this work. I think Stan
>> on-list was able to run the TX path (he commented on fixes to the test
>> but didn't say it doesn't work :D) and one other person was able to
>> run it offlist.
> 
> For BRCM I had shared this: https://lore.kernel.org/netdev/ZxAfWHk3aRWl-F31@mini-arch/
> I have similar internal patch for mlx5 (will share after RX part gets
> in). I agree that it seems like gve_unmap_packet needs some work to be more
> careful to not unmap NIOVs (if you were testing against gve).

What happens if a user tries to use devmem TX on a device not really
supporting it? Silent data corruption?

Don't we need some way for the device to opt-in (or opt-out) and avoid
such issues?

/P
Mina Almasry Feb. 4, 2025, 6:38 p.m. UTC | #5
On Tue, Feb 4, 2025 at 10:06 AM Stanislav Fomichev <stfomichev@gmail.com> wrote:
>
> On 02/04, Mina Almasry wrote:
> > On Tue, Feb 4, 2025 at 4:32 AM Paolo Abeni <pabeni@redhat.com> wrote:
> > >
> > > On 2/3/25 11:39 PM, Mina Almasry wrote:
> > > > The TX path had been dropped from the Device Memory TCP patch series
> > > > post RFCv1 [1], to make that series slightly easier to review. This
> > > > series rebases the implementation of the TX path on top of the
> > > > net_iov/netmem framework agreed upon and merged. The motivation for
> > > > the feature is thoroughly described in the docs & cover letter of the
> > > > original proposal, so I don't repeat the lengthy descriptions here, but
> > > > they are available in [1].
> > > >
> > > > Sending this series as RFC as the winder closure is immenient. I plan on
> > > > reposting as non-RFC once the tree re-opens, addressing any feedback
> > > > I receive in the meantime.
> > >
> > > I guess you should drop this paragraph.
> > >
> > > > Full outline on usage of the TX path is detailed in the documentation
> > > > added in the first patch.
> > > >
> > > > Test example is available via the kselftest included in the series as well.
> > > >
> > > > The series is relatively small, as the TX path for this feature largely
> > > > piggybacks on the existing MSG_ZEROCOPY implementation.
> > >
> > > It looks like no additional device level support is required. That is
> > > IMHO so good up to suspicious level :)
> > >
> >
> > It is correct no additional device level support is required. I don't
> > have any local changes to my driver to make this work. I think Stan
> > on-list was able to run the TX path (he commented on fixes to the test
> > but didn't say it doesn't work :D) and one other person was able to
> > run it offlist.
>
> For BRCM I had shared this: https://lore.kernel.org/netdev/ZxAfWHk3aRWl-F31@mini-arch/
> I have similar internal patch for mlx5 (will share after RX part gets
> in). I agree that it seems like gve_unmap_packet needs some work to be more
> careful to not unmap NIOVs (if you were testing against gve).

Hmm. I think you're right. We ran into a similar issue with the RX
path. The RX path worked 'fine' on initial merge, but it was passing
dmabuf dma-addrs to the dma-mapping API which Jason later called out
to be unsafe. The dma-mapping API calls with dmabuf dma-addrs will
boil down into no-ops for a lot of setups I think which is why I'm not
running into any issues in testing, but upon closer look, I think yes,
we need to make sure the driver doesn't end up passing these niov
dma-addrs to functions like dma_unmap_*() and dma_sync_*().

Stan, do you run into issues (crashes/warnings/bugs) in your setup
when the driver tries to unmap niovs? Or did you implement these
changes purely for safety?

Let me take a deeper look here and suggest something for the next
version. I think we may indeed need the driver to declare that it can
handle niovs in the TX path correctly (i.e. not accidentally pass niov
dma-addrs to the dma-mapping API).

--
Thanks,
Mina
Mina Almasry Feb. 4, 2025, 6:47 p.m. UTC | #6
On Tue, Feb 4, 2025 at 10:32 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On 2/4/25 7:06 PM, Stanislav Fomichev wrote:
> > On 02/04, Mina Almasry wrote:
> >> On Tue, Feb 4, 2025 at 4:32 AM Paolo Abeni <pabeni@redhat.com> wrote:
> >>>
> >>> On 2/3/25 11:39 PM, Mina Almasry wrote:
> >>>> The TX path had been dropped from the Device Memory TCP patch series
> >>>> post RFCv1 [1], to make that series slightly easier to review. This
> >>>> series rebases the implementation of the TX path on top of the
> >>>> net_iov/netmem framework agreed upon and merged. The motivation for
> >>>> the feature is thoroughly described in the docs & cover letter of the
> >>>> original proposal, so I don't repeat the lengthy descriptions here, but
> >>>> they are available in [1].
> >>>>
> >>>> Sending this series as RFC as the winder closure is immenient. I plan on
> >>>> reposting as non-RFC once the tree re-opens, addressing any feedback
> >>>> I receive in the meantime.
> >>>
> >>> I guess you should drop this paragraph.
> >>>
> >>>> Full outline on usage of the TX path is detailed in the documentation
> >>>> added in the first patch.
> >>>>
> >>>> Test example is available via the kselftest included in the series as well.
> >>>>
> >>>> The series is relatively small, as the TX path for this feature largely
> >>>> piggybacks on the existing MSG_ZEROCOPY implementation.
> >>>
> >>> It looks like no additional device level support is required. That is
> >>> IMHO so good up to suspicious level :)
> >>>
> >>
> >> It is correct no additional device level support is required. I don't
> >> have any local changes to my driver to make this work. I think Stan
> >> on-list was able to run the TX path (he commented on fixes to the test
> >> but didn't say it doesn't work :D) and one other person was able to
> >> run it offlist.
> >
> > For BRCM I had shared this: https://lore.kernel.org/netdev/ZxAfWHk3aRWl-F31@mini-arch/
> > I have similar internal patch for mlx5 (will share after RX part gets
> > in). I agree that it seems like gve_unmap_packet needs some work to be more
> > careful to not unmap NIOVs (if you were testing against gve).
>
> What happen if an user try to use devmem TX on a device not really
> supporting it? Silent data corruption?
>

So the tx dma-buf binding netlink API will bind the dma-buf to the
netdevice. If that fails, the uapi will return failure and devmem tx
will not be enabled.

If the dma-binding succeeds, then the device can indeed DMA into the
dma-addrs in the device. The TX path will dma from the dma-addrs in
the device just fine and it need not be aware that the dma-addrs are
coming from a device and not from host memory.

The only issue that Stan's patches are pointing to is that the driver
will likely be passing these dma-buf addresses into dma-mapping APIs
like the dma_unmap_*() and dma_sync_*() functions. Those, AFAIU, will be
no-ops with dma-buf addresses in most setups, but it's not 100% safe
to pass those dma-buf addresses to these dma-mapping APIs, so we
should avoid these calls entirely.
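The guard in question can be modeled in a few lines. This is a userspace model, not driver code: frag->is_net_iov and the stub unmap stand in for whatever predicate and dma-mapping calls the driver actually uses (the helper name and layout are assumptions).

```c
/* Model of the TX-completion guard: a driver must not hand dmabuf
 * (net_iov) addresses to dma_unmap_*()/dma_sync_*(). Userspace model;
 * is_net_iov stands in for a skb_frag_is_net_iov()-style predicate. */
#include <stdbool.h>
#include <stdint.h>

struct tx_frag {
	uint64_t dma_addr;
	bool is_net_iov;	/* backed by a dmabuf, not host pages */
};

static int unmap_calls;	/* counts stand-in dma_unmap_page() invocations */

static void dma_unmap_page_stub(uint64_t dma_addr)
{
	(void)dma_addr;
	unmap_calls++;
}

/* Unmap a completed TX frag only when it is host memory; dmabuf
 * addresses stay mapped until the dmabuf attachment is unmapped at
 * unbind time. */
static void tx_complete_frag(const struct tx_frag *frag)
{
	if (frag->is_net_iov)
		return;		/* never unmap/sync a niov dma-addr */
	dma_unmap_page_stub(frag->dma_addr);
}
```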

> Don't we need some way for the device to opt-in (or opt-out) and avoid
> such issues?
>

Yeah, I think likely the driver needs to declare support (i.e. it's
not using dma-mapping API with dma-buf addresses).

--
Thanks,
Mina
Stanislav Fomichev Feb. 4, 2025, 7:41 p.m. UTC | #7
On 02/04, Mina Almasry wrote:
> On Tue, Feb 4, 2025 at 10:32 AM Paolo Abeni <pabeni@redhat.com> wrote:
> >
> > On 2/4/25 7:06 PM, Stanislav Fomichev wrote:
> > > On 02/04, Mina Almasry wrote:
> > >> On Tue, Feb 4, 2025 at 4:32 AM Paolo Abeni <pabeni@redhat.com> wrote:
> > >>>
> > >>> On 2/3/25 11:39 PM, Mina Almasry wrote:
> > >>>> The TX path had been dropped from the Device Memory TCP patch series
> > >>>> post RFCv1 [1], to make that series slightly easier to review. This
> > >>>> series rebases the implementation of the TX path on top of the
> > >>>> net_iov/netmem framework agreed upon and merged. The motivation for
> > >>>> the feature is thoroughly described in the docs & cover letter of the
> > >>>> original proposal, so I don't repeat the lengthy descriptions here, but
> > >>>> they are available in [1].
> > >>>>
> > >>>> Sending this series as RFC as the winder closure is immenient. I plan on
> > >>>> reposting as non-RFC once the tree re-opens, addressing any feedback
> > >>>> I receive in the meantime.
> > >>>
> > >>> I guess you should drop this paragraph.
> > >>>
> > >>>> Full outline on usage of the TX path is detailed in the documentation
> > >>>> added in the first patch.
> > >>>>
> > >>>> Test example is available via the kselftest included in the series as well.
> > >>>>
> > >>>> The series is relatively small, as the TX path for this feature largely
> > >>>> piggybacks on the existing MSG_ZEROCOPY implementation.
> > >>>
> > >>> It looks like no additional device level support is required. That is
> > >>> IMHO so good up to suspicious level :)
> > >>>
> > >>
> > >> It is correct no additional device level support is required. I don't
> > >> have any local changes to my driver to make this work. I think Stan
> > >> on-list was able to run the TX path (he commented on fixes to the test
> > >> but didn't say it doesn't work :D) and one other person was able to
> > >> run it offlist.
> > >
> > > For BRCM I had shared this: https://lore.kernel.org/netdev/ZxAfWHk3aRWl-F31@mini-arch/
> > > I have similar internal patch for mlx5 (will share after RX part gets
> > > in). I agree that it seems like gve_unmap_packet needs some work to be more
> > > careful to not unmap NIOVs (if you were testing against gve).
> >
> > What happen if an user try to use devmem TX on a device not really
> > supporting it? Silent data corruption?
> >
> 
> So the tx dma-buf binding netlink API will bind the dma-buf to the
> netdevice. If that fails, the uapi will return failure and devmem tx
> will not be enabled.
> 
> If the dma-binding succeeds, then the device can indeed DMA into the
> dma-addrs in the device. The TX path will dma from the dma-addrs in
> the device just fine and it need not be aware that the dma-addrs are
> coming from a device and not from host memory.
> 
> The only issue that Stan's patches is pointing to, is that the driver
> will likely be passing these dma-buf addresses into dma-mapping APIs
> like dma_unmap_*() and dma_sync_*() functions. Those, AFAIU, will be
> no-ops with dma-buf addresses in most setups, but it's not 100% safe
> to pass those dma-buf addresses to these dma-mapping APIs, so we
> should avoid these calls entirely.
> 
> > Don't we need some way for the device to opt-in (or opt-out) and avoid
> > such issues?
> >
> 
> Yeah, I think likely the driver needs to declare support (i.e. it's
> not using dma-mapping API with dma-buf addresses).

netif_skb_features/ndo_features_check seems like a good fit?
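That suggestion could look roughly like the model below; NETIF_F_DEVMEM_TX and has_devmem_frags are hypothetical names for illustration, and this is a userspace sketch of the decision, not a real ndo_features_check() implementation.

```c
/* Sketch of the opt-in idea: the core could refuse (or fall back for)
 * skbs carrying devmem frags unless the driver advertised support, so a
 * devmem skb never reaches a driver that might dma_unmap its frags. */
#include <stdbool.h>
#include <stdint.h>

#define NETIF_F_SG		(1ULL << 0)
#define NETIF_F_DEVMEM_TX	(1ULL << 1)	/* hypothetical flag */

struct fake_skb {
	bool has_devmem_frags;	/* any frag backed by a dmabuf net_iov */
};

/* Analogue of the check netif_skb_features()/ndo_features_check() would
 * perform: devmem frags are unreadable, so they cannot be linearized
 * away; the send must be rejected if the driver did not opt in. */
static bool devmem_tx_allowed(const struct fake_skb *skb,
			      uint64_t dev_features)
{
	return !skb->has_devmem_frags ||
	       (dev_features & NETIF_F_DEVMEM_TX);
}
```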
Stanislav Fomichev Feb. 4, 2025, 7:43 p.m. UTC | #8
On 02/04, Mina Almasry wrote:
> On Tue, Feb 4, 2025 at 10:06 AM Stanislav Fomichev <stfomichev@gmail.com> wrote:
> >
> > On 02/04, Mina Almasry wrote:
> > > On Tue, Feb 4, 2025 at 4:32 AM Paolo Abeni <pabeni@redhat.com> wrote:
> > > >
> > > > On 2/3/25 11:39 PM, Mina Almasry wrote:
> > > > > The TX path had been dropped from the Device Memory TCP patch series
> > > > > post RFCv1 [1], to make that series slightly easier to review. This
> > > > > series rebases the implementation of the TX path on top of the
> > > > > net_iov/netmem framework agreed upon and merged. The motivation for
> > > > > the feature is thoroughly described in the docs & cover letter of the
> > > > > original proposal, so I don't repeat the lengthy descriptions here, but
> > > > > they are available in [1].
> > > > >
> > > > > Sending this series as RFC as the winder closure is immenient. I plan on
> > > > > reposting as non-RFC once the tree re-opens, addressing any feedback
> > > > > I receive in the meantime.
> > > >
> > > > I guess you should drop this paragraph.
> > > >
> > > > > Full outline on usage of the TX path is detailed in the documentation
> > > > > added in the first patch.
> > > > >
> > > > > Test example is available via the kselftest included in the series as well.
> > > > >
> > > > > The series is relatively small, as the TX path for this feature largely
> > > > > piggybacks on the existing MSG_ZEROCOPY implementation.
> > > >
> > > > It looks like no additional device level support is required. That is
> > > > IMHO so good up to suspicious level :)
> > > >
> > >
> > > It is correct no additional device level support is required. I don't
> > > have any local changes to my driver to make this work. I think Stan
> > > on-list was able to run the TX path (he commented on fixes to the test
> > > but didn't say it doesn't work :D) and one other person was able to
> > > run it offlist.
> >
> > For BRCM I had shared this: https://lore.kernel.org/netdev/ZxAfWHk3aRWl-F31@mini-arch/
> > I have similar internal patch for mlx5 (will share after RX part gets
> > in). I agree that it seems like gve_unmap_packet needs some work to be more
> > careful to not unmap NIOVs (if you were testing against gve).
> 
> Hmm. I think you're right. We ran into a similar issue with the RX
> path. The RX path worked 'fine' on initial merge, but it was passing
> dmabuf dma-addrs to the dma-mapping API which Jason later called out
> to be unsafe. The dma-mapping API calls with dmabuf dma-addrs will
> boil down into no-ops for a lot of setups I think which is why I'm not
> running into any issues in testing, but upon closer look, I think yes,
> we need to make sure the driver doesn't end up passing these niov
> dma-addrs to functions like dma_unmap_*() and dma_sync_*().
> 
> Stan, do you run into issues (crashes/warnings/bugs) in your setup
> when the driver tries to unmap niovs? Or did you implement these
> changes purely for safety?

I don't run into any issues with those unmaps in place, but I'm running x86
with iommu bypass (and as you mention in the other thread, those
calls are no-ops in this case).
Samiullah Khawaja Feb. 5, 2025, 12:47 a.m. UTC | #9
On Tue, Feb 4, 2025 at 11:43 AM Stanislav Fomichev <stfomichev@gmail.com> wrote:
>
> On 02/04, Mina Almasry wrote:
> > On Tue, Feb 4, 2025 at 10:06 AM Stanislav Fomichev <stfomichev@gmail.com> wrote:
> > >
> > > On 02/04, Mina Almasry wrote:
> > > > On Tue, Feb 4, 2025 at 4:32 AM Paolo Abeni <pabeni@redhat.com> wrote:
> > > > >
> > > > > On 2/3/25 11:39 PM, Mina Almasry wrote:
> > > > > > The TX path had been dropped from the Device Memory TCP patch series
> > > > > > post RFCv1 [1], to make that series slightly easier to review. This
> > > > > > series rebases the implementation of the TX path on top of the
> > > > > > net_iov/netmem framework agreed upon and merged. The motivation for
> > > > > > the feature is thoroughly described in the docs & cover letter of the
> > > > > > original proposal, so I don't repeat the lengthy descriptions here, but
> > > > > > they are available in [1].
> > > > > >
> > > > > > Sending this series as RFC as the winder closure is immenient. I plan on
> > > > > > reposting as non-RFC once the tree re-opens, addressing any feedback
> > > > > > I receive in the meantime.
> > > > >
> > > > > I guess you should drop this paragraph.
> > > > >
> > > > > > Full outline on usage of the TX path is detailed in the documentation
> > > > > > added in the first patch.
> > > > > >
> > > > > > Test example is available via the kselftest included in the series as well.
> > > > > >
> > > > > > The series is relatively small, as the TX path for this feature largely
> > > > > > piggybacks on the existing MSG_ZEROCOPY implementation.
> > > > >
> > > > > It looks like no additional device level support is required. That is
> > > > > IMHO so good up to suspicious level :)
> > > > >
> > > >
> > > > It is correct no additional device level support is required. I don't
> > > > have any local changes to my driver to make this work. I think Stan
> > > > on-list was able to run the TX path (he commented on fixes to the test
> > > > but didn't say it doesn't work :D) and one other person was able to
> > > > run it offlist.
> > >
> > > For BRCM I had shared this: https://lore.kernel.org/netdev/ZxAfWHk3aRWl-F31@mini-arch/
> > > I have similar internal patch for mlx5 (will share after RX part gets
> > > in). I agree that it seems like gve_unmap_packet needs some work to be more
> > > careful to not unmap NIOVs (if you were testing against gve).
> >
> > Hmm. I think you're right. We ran into a similar issue with the RX
> > path. The RX path worked 'fine' on initial merge, but it was passing
> > dmabuf dma-addrs to the dma-mapping API which Jason later called out
> > to be unsafe. The dma-mapping API calls with dmabuf dma-addrs will
> > boil down into no-ops for a lot of setups I think which is why I'm not
> > running into any issues in testing, but upon closer look, I think yes,
> > we need to make sure the driver doesn't end up passing these niov
> > dma-addrs to functions like dma_unmap_*() and dma_sync_*().
> >
> > Stan, do you run into issues (crashes/warnings/bugs) in your setup
> > when the driver tries to unmap niovs? Or did you implement these
> > changes purely for safety?
>
> I don't run into any issues with those unmaps in place, but I'm running x86
> with iommu bypass (and as you mention in the other thread, those
> calls are no-ops in this case).
The dma_addr from a dma-buf should never enter the dma_* APIs. dma-buf
exporters have their own implementations of these ops, and they could be
no-ops for identity mappings or when the IOMMU is disabled (in a VM with
no IOMMU enabled, GPA=IOVA). So if we really want to map/unmap/sync
these addresses, the dma-buf APIs should be used to do that. Maybe some
glue with a memory provider is required for these net_iovs? I think
the safest option is that mappings are never unmapped manually by the
driver until dma_buf_unmap_attachment() is called during unbinding. But
maybe that complicates things for io_uring?
Stanislav Fomichev Feb. 5, 2025, 1:05 a.m. UTC | #10
On 02/04, Samiullah Khawaja wrote:
> On Tue, Feb 4, 2025 at 11:43 AM Stanislav Fomichev <stfomichev@gmail.com> wrote:
> >
> > On 02/04, Mina Almasry wrote:
> > > On Tue, Feb 4, 2025 at 10:06 AM Stanislav Fomichev <stfomichev@gmail.com> wrote:
> > > >
> > > > On 02/04, Mina Almasry wrote:
> > > > > On Tue, Feb 4, 2025 at 4:32 AM Paolo Abeni <pabeni@redhat.com> wrote:
> > > > > >
> > > > > > On 2/3/25 11:39 PM, Mina Almasry wrote:
> > > > > > > The TX path had been dropped from the Device Memory TCP patch series
> > > > > > > post RFCv1 [1], to make that series slightly easier to review. This
> > > > > > > series rebases the implementation of the TX path on top of the
> > > > > > > net_iov/netmem framework agreed upon and merged. The motivation for
> > > > > > > the feature is thoroughly described in the docs & cover letter of the
> > > > > > > original proposal, so I don't repeat the lengthy descriptions here, but
> > > > > > > they are available in [1].
> > > > > > >
> > > > > > > Sending this series as RFC as the winder closure is immenient. I plan on
> > > > > > > reposting as non-RFC once the tree re-opens, addressing any feedback
> > > > > > > I receive in the meantime.
> > > > > >
> > > > > > I guess you should drop this paragraph.
> > > > > >
> > > > > > > Full outline on usage of the TX path is detailed in the documentation
> > > > > > > added in the first patch.
> > > > > > >
> > > > > > > Test example is available via the kselftest included in the series as well.
> > > > > > >
> > > > > > > The series is relatively small, as the TX path for this feature largely
> > > > > > > piggybacks on the existing MSG_ZEROCOPY implementation.
> > > > > >
> > > > > > It looks like no additional device level support is required. That is
> > > > > > IMHO so good up to suspicious level :)
> > > > > >
> > > > >
> > > > > It is correct no additional device level support is required. I don't
> > > > > have any local changes to my driver to make this work. I think Stan
> > > > > on-list was able to run the TX path (he commented on fixes to the test
> > > > > but didn't say it doesn't work :D) and one other person was able to
> > > > > run it offlist.
> > > >
> > > > For BRCM I had shared this: https://lore.kernel.org/netdev/ZxAfWHk3aRWl-F31@mini-arch/
> > > > I have similar internal patch for mlx5 (will share after RX part gets
> > > > in). I agree that it seems like gve_unmap_packet needs some work to be more
> > > > careful to not unmap NIOVs (if you were testing against gve).
> > >
> > > Hmm. I think you're right. We ran into a similar issue with the RX
> > > path. The RX path worked 'fine' on initial merge, but it was passing
> > > dmabuf dma-addrs to the dma-mapping API which Jason later called out
> > > to be unsafe. The dma-mapping API calls with dmabuf dma-addrs will
> > > boil down into no-ops for a lot of setups I think which is why I'm not
> > > running into any issues in testing, but upon closer look, I think yes,
> > > we need to make sure the driver doesn't end up passing these niov
> > > dma-addrs to functions like dma_unmap_*() and dma_sync_*().
> > >
> > > Stan, do you run into issues (crashes/warnings/bugs) in your setup
> > > when the driver tries to unmap niovs? Or did you implement these
> > > changes purely for safety?
> >
> > I don't run into any issues with those unmaps in place, but I'm running x86
> > with iommu bypass (and as you mention in the other thread, those
> > calls are no-ops in this case).
> The dma_addr from dma-buf should never enter dma_* APIs. dma-bufs
> exporters have their own implementation of these ops and they could be
> no-op for identity mappings or when iommu is disabled (in a VM? with
> no IOMMU enabled GPA=IOVA). so if we really want to map/unmap/sync
> these addresses the dma-buf APIs should be used to do that. Maybe some
> glue with a memory provider is required for these net_iovs? I think
> the safest option with these is that mappings are never unmapped
> manually by driver until the dma_buf_unmap_attachment is called during
> unbinding? But maybe that complicates things for io_uring?

Correct, we don't want to call dma_* APIs on NIOVs, but currently we
do (unmap on tx completion). I mentioned [0] in another thread; we need
something similar for gve (and eventually mlx5). skb_frag_dma_map() hides
the mapping, but the unmapping path unconditionally calls the dma_* APIs
directly (in most drivers I've looked at).
 
0: https://lore.kernel.org/netdev/ZxAfWHk3aRWl-F31@mini-arch/
Jakub Kicinski Feb. 5, 2025, 2:06 a.m. UTC | #11
On Tue, 4 Feb 2025 11:41:09 -0800 Stanislav Fomichev wrote:
> > > Don't we need some way for the device to opt-in (or opt-out) and avoid
> > > such issues?
> > >  
> > 
> > Yeah, I think likely the driver needs to declare support (i.e. it's
> > not using dma-mapping API with dma-buf addresses).  
> 
> netif_skb_features/ndo_features_check seems like a good fit?

validate_xmit_skb()
Jakub Kicinski Feb. 5, 2025, 2:08 a.m. UTC | #12
On Mon,  3 Feb 2025 22:39:10 +0000 Mina Almasry wrote:
> v3: https://patchwork.kernel.org/project/netdevbpf/list/?series=929401&state=*
> ===
> 
> RFC v2: https://patchwork.kernel.org/project/netdevbpf/list/?series=920056&state=*

nit: lore links are better

please stick to RFC until a driver implementation is ready and
included