[RFC,-next,00/10] Add ZC notifications to splice and sendfile

Message ID: 20250319001521.53249-1-jdamato@fastly.com

Joe Damato March 19, 2025, 12:15 a.m. UTC

Greetings:

Welcome to the RFC.

Currently, when a user app calls sendfile, it has no way to know
whether the bytes were transmitted; sendfile simply returns, but a
slow client on the other side may take time to receive and ACK the
bytes. In the meantime, the app which called sendfile has no way to
know whether it can safely overwrite the data on disk that it just
sendfile'd.

One way to fix this is to add zerocopy notifications to sendfile similar
to how MSG_ZEROCOPY works with sendmsg. This is possible thanks to the
extensive work done by Pavel [1].

To support this, two important user ABI changes are proposed:

  - A new splice flag, SPLICE_F_ZC, which allows users to signal that
    splice should generate zerocopy notifications if possible.

  - A new system call, sendfile2, which is similar to sendfile64 except
    that it takes an additional argument, flags, which allows the user
    to specify either a "regular" sendfile or a sendfile with zerocopy
    notifications enabled.

In either case, user apps can read notifications from the error queue
(like they would with MSG_ZEROCOPY) to determine when their call to
sendfile has completed.
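
As an illustration (not part of the series), harvesting one notification
from the error queue might look like the sketch below. It follows the
format described in the MSG_ZEROCOPY documentation; that notifications
generated via SPLICE_F_ZC or sendfile2 carry the same fields is an
assumption based on the description above, and read_zc_completion is a
hypothetical helper name.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <linux/errqueue.h>

#ifndef SO_EE_ORIGIN_ZEROCOPY
#define SO_EE_ORIGIN_ZEROCOPY 5	/* fallback for older headers */
#endif

/*
 * Try to read one zerocopy completion from the socket error queue.
 * Returns 1 and stores the inclusive range [*lo, *hi] of completed
 * send calls, 0 if nothing is pending, -1 on error (errno is set).
 */
int read_zc_completion(int fd, uint32_t *lo, uint32_t *hi)
{
	union {
		char buf[128];
		struct cmsghdr align;	/* keep the buffer cmsg-aligned */
	} control;
	struct sock_extended_err *serr;
	struct cmsghdr *cm;
	struct msghdr msg;

	assert(lo && hi);

	memset(&msg, 0, sizeof(msg));
	msg.msg_control = control.buf;
	msg.msg_controllen = sizeof(control.buf);

	/* Reads with MSG_ERRQUEUE do not block; EAGAIN means "empty". */
	if (recvmsg(fd, &msg, MSG_ERRQUEUE) < 0)
		return errno == EAGAIN ? 0 : -1;

	cm = CMSG_FIRSTHDR(&msg);
	if (!cm ||
	    !((cm->cmsg_level == SOL_IP && cm->cmsg_type == IP_RECVERR) ||
	      (cm->cmsg_level == SOL_IPV6 && cm->cmsg_type == IPV6_RECVERR))) {
		errno = EPROTO;
		return -1;
	}

	serr = (struct sock_extended_err *)CMSG_DATA(cm);
	if (serr->ee_errno != 0 || serr->ee_origin != SO_EE_ORIGIN_ZEROCOPY) {
		errno = EPROTO;	/* some other kind of error-queue message */
		return -1;
	}

	*lo = serr->ee_info;
	*hi = serr->ee_data;
	return 1;
}
```

A caller would typically wait for POLLERR on the socket and then call
the helper in a loop until it returns 0.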

I tested this RFC using the selftest modified in the last patch and also
by using the selftest between two different physical hosts:

# server
./msg_zerocopy -4 -i eth0 -t 2 -v -r tcp

# client (does the sendfiling)
dd if=/dev/zero of=sendfile_data bs=1M count=8
./msg_zerocopy -4 -i eth0 -D $SERVER_IP -v -l 1 -t 2 -z -f sendfile_data tcp

I would love to get high level feedback from folks on a few things:

  - Is this functionality, at a high level, something that would be
    desirable / useful? I think so, but of course I am biased ;)

  - Is this approach generally headed in the right direction? Are the
    proposed user ABI changes reasonable?

If the above two points are generally agreed upon then I'd welcome
feedback on the patches themselves :)

This is kind of a net thing, but also kind of a splice thing, so I hope I
am sending this to the right places to get appropriate feedback. I based my
code on the vfs/for-next tree, but am happy to rebase on another tree if
desired. The cc-list got a little out of control, so I manually trimmed
it down quite a bit; sorry if I missed anyone I should have CC'd in the
process.

Thanks,
Joe

[1]: https://lore.kernel.org/netdev/cover.1657643355.git.asml.silence@gmail.com/

Joe Damato (10):
  splice: Add ubuf_info to prepare for ZC
  splice: Add helper that passes through splice_desc
  splice: Factor splice_socket into a helper
  splice: Add SPLICE_F_ZC and attach ubuf
  fs: Add splice_write_sd to file operations
  fs: Extend do_sendfile to take a flags argument
  fs: Add sendfile2 which accepts a flags argument
  fs: Add sendfile flags for sendfile2
  fs: Add sendfile2 syscall
  selftests: Add sendfile zerocopy notification test

 arch/alpha/kernel/syscalls/syscall.tbl      |  1 +
 arch/arm/tools/syscall.tbl                  |  1 +
 arch/arm64/tools/syscall_32.tbl             |  1 +
 arch/m68k/kernel/syscalls/syscall.tbl       |  1 +
 arch/microblaze/kernel/syscalls/syscall.tbl |  1 +
 arch/mips/kernel/syscalls/syscall_n32.tbl   |  1 +
 arch/mips/kernel/syscalls/syscall_n64.tbl   |  1 +
 arch/mips/kernel/syscalls/syscall_o32.tbl   |  1 +
 arch/parisc/kernel/syscalls/syscall.tbl     |  1 +
 arch/powerpc/kernel/syscalls/syscall.tbl    |  1 +
 arch/s390/kernel/syscalls/syscall.tbl       |  1 +
 arch/sh/kernel/syscalls/syscall.tbl         |  1 +
 arch/sparc/kernel/syscalls/syscall.tbl      |  1 +
 arch/x86/entry/syscalls/syscall_32.tbl      |  1 +
 arch/x86/entry/syscalls/syscall_64.tbl      |  1 +
 arch/xtensa/kernel/syscalls/syscall.tbl     |  1 +
 fs/read_write.c                             | 40 +++++++---
 fs/splice.c                                 | 87 +++++++++++++++++----
 include/linux/fs.h                          |  2 +
 include/linux/sendfile.h                    | 10 +++
 include/linux/splice.h                      |  7 +-
 include/linux/syscalls.h                    |  2 +
 include/uapi/asm-generic/unistd.h           |  4 +-
 net/socket.c                                |  1 +
 scripts/syscall.tbl                         |  1 +
 tools/testing/selftests/net/msg_zerocopy.c  | 54 ++++++++++++-
 tools/testing/selftests/net/msg_zerocopy.sh |  5 ++
 27 files changed, 200 insertions(+), 29 deletions(-)
 create mode 100644 include/linux/sendfile.h


base-commit: 2e72b1e0aac24a12f3bf3eec620efaca7ab7d4de

Christoph Hellwig March 19, 2025, 8:04 a.m. UTC | #1

On Wed, Mar 19, 2025 at 12:15:11AM +0000, Joe Damato wrote:
> One way to fix this is to add zerocopy notifications to sendfile similar
> to how MSG_ZEROCOPY works with sendmsg. This is possible thanks to the
> extensive work done by Pavel [1].

What is a "zerocopy notification" and why aren't you simply plugging
this into io_uring and generate a CQE so that it works like all other
asynchronous operations?

Joe Damato March 19, 2025, 3:32 p.m. UTC | #2

On Wed, Mar 19, 2025 at 01:04:48AM -0700, Christoph Hellwig wrote:
> On Wed, Mar 19, 2025 at 12:15:11AM +0000, Joe Damato wrote:
> > One way to fix this is to add zerocopy notifications to sendfile similar
> > to how MSG_ZEROCOPY works with sendmsg. This is possible thanks to the
> > extensive work done by Pavel [1].
> 
> What is a "zerocopy notification" 

See the docs on MSG_ZEROCOPY [1], but in short when a user app calls
sendmsg and passes MSG_ZEROCOPY a completion notification is added
to the error queue. The user app can poll for these to find out when
the TX has completed and the buffer it passed to the kernel can be
overwritten.

My series provides the same functionality via splice and sendfile2.

[1]: https://www.kernel.org/doc/html/v6.13/networking/msg_zerocopy.html
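
For reference, the existing sendmsg flow described above can be sketched
as follows. It uses only the documented SO_ZEROCOPY / MSG_ZEROCOPY
interface; connect_zerocopy and send_zerocopy are hypothetical helper
names, and the fallback #defines assume the asm-generic values for
toolchains whose headers predate them.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <netinet/in.h>
#include <unistd.h>

#ifndef SO_ZEROCOPY
#define SO_ZEROCOPY 60		/* asm-generic value (assumption) */
#endif
#ifndef MSG_ZEROCOPY
#define MSG_ZEROCOPY 0x4000000
#endif

/*
 * Create a TCP socket, opt in to zerocopy notifications, then connect.
 * The option is set before connect() because early kernels with
 * MSG_ZEROCOPY support only accepted it on a closed socket.
 * Returns the connected fd, or -1 on error (errno is set).
 */
int connect_zerocopy(const struct sockaddr_in *dst)
{
	int one = 1;
	int fd, saved;

	assert(dst);
	fd = socket(AF_INET, SOCK_STREAM, 0);
	if (fd < 0)
		return -1;

	if (setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)) < 0 ||
	    connect(fd, (const struct sockaddr *)dst, sizeof(*dst)) < 0) {
		saved = errno;
		close(fd);
		errno = saved;
		return -1;
	}
	return fd;
}

/*
 * Queue buf for transmission without copying it into the kernel.
 * Returns the number of bytes accepted, or -1 on error.
 */
ssize_t send_zerocopy(int fd, const void *buf, size_t len)
{
	struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
	struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1 };

	return sendmsg(fd, &msg, MSG_ZEROCOPY | MSG_NOSIGNAL);
}
```

After send_zerocopy returns, buf must not be modified until the matching
completion has been read from the error queue.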

> and why aren't you simply plugging this into io_uring and generate
> a CQE so that it works like all other asynchronous operations?

I linked to the iouring work that Pavel did in the cover letter.
Please take a look.

That work refactored the internals of how zerocopy completion
notifications are wired up, allowing other pieces of code to use the
same infrastructure and extend it, if needed.

My series is using the same internals that iouring (and others) use
to generate zerocopy completion notifications. Unlike iouring,
though, I don't need a fully customized implementation with a new
user API for harvesting completion events; I can use the existing
mechanism already in the kernel that user apps already use for
sendmsg (the error queue, as explained above and in the
MSG_ZEROCOPY documentation).

Let me know if that answers your question or if you have other
questions.

Thanks,
Joe

Jens Axboe March 19, 2025, 4:07 p.m. UTC | #3

On 3/19/25 9:32 AM, Joe Damato wrote:
> On Wed, Mar 19, 2025 at 01:04:48AM -0700, Christoph Hellwig wrote:
>> On Wed, Mar 19, 2025 at 12:15:11AM +0000, Joe Damato wrote:
>>> One way to fix this is to add zerocopy notifications to sendfile similar
>>> to how MSG_ZEROCOPY works with sendmsg. This is possible thanks to the
>>> extensive work done by Pavel [1].
>>
>> What is a "zerocopy notification" 
> 
> See the docs on MSG_ZEROCOPY [1], but in short when a user app calls
> sendmsg and passes MSG_ZEROCOPY a completion notification is added
> to the error queue. The user app can poll for these to find out when
> the TX has completed and the buffer it passed to the kernel can be
> overwritten.
> 
> My series provides the same functionality via splice and sendfile2.
> 
> [1]: https://www.kernel.org/doc/html/v6.13/networking/msg_zerocopy.html
> 
>> and why aren't you simply plugging this into io_uring and generate
>> a CQE so that it works like all other asynchronous operations?
> 
> I linked to the iouring work that Pavel did in the cover letter.
> Please take a look.
> 
> That work refactored the internals of how zerocopy completion
> notifications are wired up, allowing other pieces of code to use the
> same infrastructure and extend it, if needed.
> 
> My series is using the same internals that iouring (and others) use
> to generate zerocopy completion notifications. Unlike iouring,
> though, I don't need a fully customized implementation with a new
> user API for harvesting completion events; I can use the existing
> mechanism already in the kernel that user apps already use for
> sendmsg (the error queue, as explained above and in the
> MSG_ZEROCOPY documentation).

The error queue is arguably a work-around for _not_ having a delivery
mechanism that works with a sync syscall in the first place. The main
question here imho would be "why add a whole new syscall etc when
there's already an existing way to accomplish this, with
free-to-reuse notifications". If the answer is "because splice", then it
would seem saner to plumb up those bits only. Would be much simpler
too...

Joe Damato March 19, 2025, 5:04 p.m. UTC | #4

On Wed, Mar 19, 2025 at 10:07:27AM -0600, Jens Axboe wrote:
> On 3/19/25 9:32 AM, Joe Damato wrote:
> > On Wed, Mar 19, 2025 at 01:04:48AM -0700, Christoph Hellwig wrote:
> >> On Wed, Mar 19, 2025 at 12:15:11AM +0000, Joe Damato wrote:
> >>> One way to fix this is to add zerocopy notifications to sendfile similar
> >>> to how MSG_ZEROCOPY works with sendmsg. This is possible thanks to the
> >>> extensive work done by Pavel [1].
> >>
> >> What is a "zerocopy notification" 
> > 
> > See the docs on MSG_ZEROCOPY [1], but in short when a user app calls
> > sendmsg and passes MSG_ZEROCOPY a completion notification is added
> > to the error queue. The user app can poll for these to find out when
> > the TX has completed and the buffer it passed to the kernel can be
> > overwritten.
> > 
> > My series provides the same functionality via splice and sendfile2.
> > 
> > [1]: https://www.kernel.org/doc/html/v6.13/networking/msg_zerocopy.html
> > 
> >> and why aren't you simply plugging this into io_uring and generate
> >> a CQE so that it works like all other asynchronous operations?
> > 
> > I linked to the iouring work that Pavel did in the cover letter.
> > Please take a look.
> > 
> > That work refactored the internals of how zerocopy completion
> > notifications are wired up, allowing other pieces of code to use the
> > same infrastructure and extend it, if needed.
> > 
> > My series is using the same internals that iouring (and others) use
> > to generate zerocopy completion notifications. Unlike iouring,
> > though, I don't need a fully customized implementation with a new
> > user API for harvesting completion events; I can use the existing
> > mechanism already in the kernel that user apps already use for
> > sendmsg (the error queue, as explained above and in the
> > MSG_ZEROCOPY documentation).
> 
> The error queue is arguably a work-around for _not_ having a delivery
> mechanism that works with a sync syscall in the first place. The main
> question here imho would be "why add a whole new syscall etc when
> > there's already an existing way to accomplish this, with
> free-to-reuse notifications". If the answer is "because splice", then it
> would seem saner to plumb up those bits only. Would be much simpler
> too...

I may be misunderstanding your comment, but my response would be:

  There are existing apps which use sendfile today unsafely and
  it would be very nice to have a safe sendfile equivalent. Converting
  existing apps to using iouring (if I understood your suggestion?)
  would be significantly more work compared to calling sendfile2 and
  adding code to check the error queue.

I would also argue that there are likely user apps out there that
use both sendmsg MSG_ZEROCOPY for certain writes (for data in
memory) and also use sendfile (for data on disk). One example would
be a reverse proxy that might write HTTP headers to clients via
sendmsg but transmit the response body with sendfile.

For those apps, the code to check the error queue already exists for
sendmsg + MSG_ZEROCOPY, so swapping in sendfile2 seems like an easy
way to ensure safe sendfile usage.
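
A minimal sketch of the reverse-proxy pattern mentioned above, using
only today's send and sendfile calls (nothing from this series):
headers come from memory, the body comes from a file. send_all and
send_response are hypothetical names. Note what a return of 0 does and
does not mean here; that gap is what the thread is about.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/sendfile.h>
#include <sys/socket.h>
#include <sys/stat.h>
#include <unistd.h>

/* Write all len bytes of buf to sock. Returns 0 on success, -1 on error. */
static int send_all(int sock, const char *buf, size_t len)
{
	while (len > 0) {
		ssize_t n = send(sock, buf, len, MSG_NOSIGNAL);

		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		buf += n;
		len -= (size_t)n;
	}
	return 0;
}

/*
 * Send a NUL-terminated header string followed by the contents of the
 * file at path. Returns 0 on success, -1 on error.
 *
 * Success only means the kernel accepted the data. The file's pages may
 * still be referenced by the socket until the peer ACKs them, and plain
 * sendfile offers no way to learn when that has happened.
 */
int send_response(int sock, const char *hdr, const char *path)
{
	struct stat st;
	off_t off = 0;
	int ret = -1;
	int file_fd;

	assert(hdr && path);

	file_fd = open(path, O_RDONLY);
	if (file_fd < 0)
		return -1;

	if (fstat(file_fd, &st) < 0 || send_all(sock, hdr, strlen(hdr)) < 0)
		goto out;

	while (off < st.st_size) {
		/* sendfile advances off by the number of bytes it queued. */
		ssize_t n = sendfile(sock, file_fd, &off,
				     (size_t)(st.st_size - off));

		if (n < 0 && errno == EINTR)
			continue;
		if (n <= 0)
			goto out;	/* error, or the file shrank underneath us */
	}
	ret = 0;
out:
	close(file_fd);
	return ret;
}
```

An app that already drains the error queue for its MSG_ZEROCOPY header
writes would, under the proposal, reuse that same loop for the body.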

As far as the bit about plumbing only the splice bits, sorry if I'm
being dense here, do you mean plumbing the error queue through to
splice only and dropping sendfile2?

That is an option. Then the apps currently using sendfile could use
splice instead and get completion notifications on the error queue.
That would probably work and be less work than rewriting to use
iouring, but probably a bit more work than using a new syscall.

Thanks for taking a look and responding.

Jens Axboe March 19, 2025, 5:20 p.m. UTC | #5

On 3/19/25 11:04 AM, Joe Damato wrote:
> On Wed, Mar 19, 2025 at 10:07:27AM -0600, Jens Axboe wrote:
>> On 3/19/25 9:32 AM, Joe Damato wrote:
>>> On Wed, Mar 19, 2025 at 01:04:48AM -0700, Christoph Hellwig wrote:
>>>> On Wed, Mar 19, 2025 at 12:15:11AM +0000, Joe Damato wrote:
>>>>> One way to fix this is to add zerocopy notifications to sendfile similar
>>>>> to how MSG_ZEROCOPY works with sendmsg. This is possible thanks to the
>>>>> extensive work done by Pavel [1].
>>>>
>>>> What is a "zerocopy notification" 
>>>
>>> See the docs on MSG_ZEROCOPY [1], but in short when a user app calls
>>> sendmsg and passes MSG_ZEROCOPY a completion notification is added
>>> to the error queue. The user app can poll for these to find out when
>>> the TX has completed and the buffer it passed to the kernel can be
>>> overwritten.
>>>
>>> My series provides the same functionality via splice and sendfile2.
>>>
>>> [1]: https://www.kernel.org/doc/html/v6.13/networking/msg_zerocopy.html
>>>
>>>> and why aren't you simply plugging this into io_uring and generate
>>>> a CQE so that it works like all other asynchronous operations?
>>>
>>> I linked to the iouring work that Pavel did in the cover letter.
>>> Please take a look.
>>>
>>> That work refactored the internals of how zerocopy completion
>>> notifications are wired up, allowing other pieces of code to use the
>>> same infrastructure and extend it, if needed.
>>>
>>> My series is using the same internals that iouring (and others) use
>>> to generate zerocopy completion notifications. Unlike iouring,
>>> though, I don't need a fully customized implementation with a new
>>> user API for harvesting completion events; I can use the existing
>>> mechanism already in the kernel that user apps already use for
>>> sendmsg (the error queue, as explained above and in the
>>> MSG_ZEROCOPY documentation).
>>
>> The error queue is arguably a work-around for _not_ having a delivery
>> mechanism that works with a sync syscall in the first place. The main
>> question here imho would be "why add a whole new syscall etc when
>> there's already an existing way to accomplish this, with
>> free-to-reuse notifications". If the answer is "because splice", then it
>> would seem saner to plumb up those bits only. Would be much simpler
>> too...
> 
> I may be misunderstanding your comment, but my response would be:
> 
>   There are existing apps which use sendfile today unsafely and
>   it would be very nice to have a safe sendfile equivalent. Converting
>   existing apps to using iouring (if I understood your suggestion?)
>   would be significantly more work compared to calling sendfile2 and
>   adding code to check the error queue.

It's really not, if you just want to use it as a sync kind of thing. If
you want to have multiple things in flight etc, yeah it could be more
work, you'd also get better performance that way. And you could use
things like registered buffers for either of them, which again would
likely make it more efficient.

If you just use it as a sync thing, it'd be pretty trivial to just wrap
a my_sendfile_foo() in a submit_and_wait operation, which issues and
waits on the completion in a single syscall. And if you want to wait on
the notification too, you could even do that in the same syscall and
wait on 2 CQEs. That'd be a downright trivial way to provide a sync way
of doing the same thing.

> I would also argue that there are likely user apps out there that
> use both sendmsg MSG_ZEROCOPY for certain writes (for data in
> memory) and also use sendfile (for data on disk). One example would
> be a reverse proxy that might write HTTP headers to clients via
> sendmsg but transmit the response body with sendfile.
> 
> For those apps, the code to check the error queue already exists for
> sendmsg + MSG_ZEROCOPY, so swapping in sendfile2 seems like an easy
> way to ensure safe sendfile usage.

Sure that is certainly possible. I didn't say that wasn't the case,
rather that the error queue approach is a work-around in the first place
for not having some kind of async notification mechanism for when it's
free to reuse.

> As far as the bit about plumbing only the splice bits, sorry if I'm
> being dense here, do you mean plumbing the error queue through to
> splice only and dropping sendfile2?
> 
> That is an option. Then the apps currently using sendfile could use
> splice instead and get completion notifications on the error queue.
> That would probably work and be less work than rewriting to use
> iouring, but probably a bit more work than using a new syscall.

Yep

Joe Damato March 19, 2025, 5:45 p.m. UTC | #6

On Wed, Mar 19, 2025 at 11:20:50AM -0600, Jens Axboe wrote:
> On 3/19/25 11:04 AM, Joe Damato wrote:
> > On Wed, Mar 19, 2025 at 10:07:27AM -0600, Jens Axboe wrote:
> >> On 3/19/25 9:32 AM, Joe Damato wrote:
> >>> On Wed, Mar 19, 2025 at 01:04:48AM -0700, Christoph Hellwig wrote:
> >>>> On Wed, Mar 19, 2025 at 12:15:11AM +0000, Joe Damato wrote:
> >>>>> One way to fix this is to add zerocopy notifications to sendfile similar
> >>>>> to how MSG_ZEROCOPY works with sendmsg. This is possible thanks to the
> >>>>> extensive work done by Pavel [1].
> >>>>
> >>>> What is a "zerocopy notification" 
> >>>
> >>> See the docs on MSG_ZEROCOPY [1], but in short when a user app calls
> >>> sendmsg and passes MSG_ZEROCOPY a completion notification is added
> >>> to the error queue. The user app can poll for these to find out when
> >>> the TX has completed and the buffer it passed to the kernel can be
> >>> overwritten.
> >>>
> >>> My series provides the same functionality via splice and sendfile2.
> >>>
> >>> [1]: https://www.kernel.org/doc/html/v6.13/networking/msg_zerocopy.html
> >>>
> >>>> and why aren't you simply plugging this into io_uring and generate
> >>>> a CQE so that it works like all other asynchronous operations?
> >>>
> >>> I linked to the iouring work that Pavel did in the cover letter.
> >>> Please take a look.
> >>>
> >>> That work refactored the internals of how zerocopy completion
> >>> notifications are wired up, allowing other pieces of code to use the
> >>> same infrastructure and extend it, if needed.
> >>>
> >>> My series is using the same internals that iouring (and others) use
> >>> to generate zerocopy completion notifications. Unlike iouring,
> >>> though, I don't need a fully customized implementation with a new
> >>> user API for harvesting completion events; I can use the existing
> >>> mechanism already in the kernel that user apps already use for
> >>> sendmsg (the error queue, as explained above and in the
> >>> MSG_ZEROCOPY documentation).
> >>
> >> The error queue is arguably a work-around for _not_ having a delivery
> >> mechanism that works with a sync syscall in the first place. The main
> >> question here imho would be "why add a whole new syscall etc when
> >> there's already an existing way to accomplish this, with
> >> free-to-reuse notifications". If the answer is "because splice", then it
> >> would seem saner to plumb up those bits only. Would be much simpler
> >> too...
> > 
> > I may be misunderstanding your comment, but my response would be:
> > 
> >   There are existing apps which use sendfile today unsafely and
> >   it would be very nice to have a safe sendfile equivalent. Converting
> >   existing apps to using iouring (if I understood your suggestion?)
> >   would be significantly more work compared to calling sendfile2 and
> >   adding code to check the error queue.
> 
> It's really not, if you just want to use it as a sync kind of thing. If
> you want to have multiple things in flight etc, yeah it could be more
> work, you'd also get better performance that way. And you could use
> things like registered buffers for either of them, which again would
> likely make it more efficient.

I haven't argued that performance would be better using sendfile2
compared to iouring, just that existing apps which already use
sendfile (but do so unsafely) would probably be more likely to use a
safe alternative with existing examples of how to harvest completion
notifications vs something more complex, like wrapping iouring.

> If you just use it as a sync thing, it'd be pretty trivial to just wrap
> a my_sendfile_foo() in a submit_and_wait operation, which issues and
> waits on the completion in a single syscall. And if you want to wait on
> the notification too, you could even do that in the same syscall and
> wait on 2 CQEs. That'd be a downright trivial way to provide a sync way
> of doing the same thing.

I don't disagree; I just don't know if app developers:
  a.) know that this is possible to do, and
  b.) know how to do it

In general: it does seem a bit odd to me that there isn't a safe
sendfile syscall in Linux that uses existing completion notification
mechanisms.

> > I would also argue that there are likely user apps out there that
> > use both sendmsg MSG_ZEROCOPY for certain writes (for data in
> > memory) and also use sendfile (for data on disk). One example would
> > be a reverse proxy that might write HTTP headers to clients via
> > sendmsg but transmit the response body with sendfile.
> > 
> > For those apps, the code to check the error queue already exists for
> > sendmsg + MSG_ZEROCOPY, so swapping in sendfile2 seems like an easy
> > way to ensure safe sendfile usage.
> 
> Sure that is certainly possible. I didn't say that wasn't the case,
> rather that the error queue approach is a work-around in the first place
> for not having some kind of async notification mechanism for when it's
> free to reuse.

Of course, I certainly agree that the error queue is a work around.
But it works, apps use it, and it's fairly well known. I don't see any
reason, other than historical context, why sendmsg can use this
mechanism, splice can, but sendfile shouldn't?

> > As far as the bit about plumbing only the splice bits, sorry if I'm
> > being dense here, do you mean plumbing the error queue through to
> > splice only and dropping sendfile2?
> > 
> > That is an option. Then the apps currently using sendfile could use
> > splice instead and get completion notifications on the error queue.
> > That would probably work and be less work than rewriting to use
> > iouring, but probably a bit more work than using a new syscall.
> 
> Yep

I'm not opposed to dropping the sendfile2 part of the series for the
official submission. I do think it is a bit odd to add the
functionality to splice only, though, when probably many apps are
using splice via calls to sendfile and there is no way to safely use
sendfile.

If you feel very strongly that this cannot be merged without
dropping sendfile2 and only plumbing this through for splice, then
I'll drop the sendfile2 syscall when I submit officially (probably
next week?).

I do feel pretty strongly that it's more likely apps would use
sendfile2 and we'd have safer apps out in the wild. But, I could be
wrong.

That said: if the new syscall is the blocker, I'll drop it and
offer a change to the sendfile man page suggesting users swap it
with calls to splice + error queue for safety.

I greatly appreciate you taking a look and your feedback.

Thanks,
Joe

Jens Axboe March 19, 2025, 6:37 p.m. UTC | #7

On 3/19/25 11:45 AM, Joe Damato wrote:
> On Wed, Mar 19, 2025 at 11:20:50AM -0600, Jens Axboe wrote:
>> On 3/19/25 11:04 AM, Joe Damato wrote:
>>> On Wed, Mar 19, 2025 at 10:07:27AM -0600, Jens Axboe wrote:
>>>> On 3/19/25 9:32 AM, Joe Damato wrote:
>>>>> On Wed, Mar 19, 2025 at 01:04:48AM -0700, Christoph Hellwig wrote:
>>>>>> On Wed, Mar 19, 2025 at 12:15:11AM +0000, Joe Damato wrote:
>>>>>>> One way to fix this is to add zerocopy notifications to sendfile similar
>>>>>>> to how MSG_ZEROCOPY works with sendmsg. This is possible thanks to the
>>>>>>> extensive work done by Pavel [1].
>>>>>>
>>>>>> What is a "zerocopy notification" 
>>>>>
>>>>> See the docs on MSG_ZEROCOPY [1], but in short when a user app calls
>>>>> sendmsg and passes MSG_ZEROCOPY a completion notification is added
>>>>> to the error queue. The user app can poll for these to find out when
>>>>> the TX has completed and the buffer it passed to the kernel can be
>>>>> overwritten.
>>>>>
>>>>> My series provides the same functionality via splice and sendfile2.
>>>>>
>>>>> [1]: https://www.kernel.org/doc/html/v6.13/networking/msg_zerocopy.html
>>>>>
>>>>>> and why aren't you simply plugging this into io_uring and generate
>>>>>> a CQE so that it works like all other asynchronous operations?
>>>>>
>>>>> I linked to the iouring work that Pavel did in the cover letter.
>>>>> Please take a look.
>>>>>
>>>>> That work refactored the internals of how zerocopy completion
>>>>> notifications are wired up, allowing other pieces of code to use the
>>>>> same infrastructure and extend it, if needed.
>>>>>
>>>>> My series is using the same internals that iouring (and others) use
>>>>> to generate zerocopy completion notifications. Unlike iouring,
>>>>> though, I don't need a fully customized implementation with a new
>>>>> user API for harvesting completion events; I can use the existing
>>>>> mechanism already in the kernel that user apps already use for
>>>>> sendmsg (the error queue, as explained above and in the
>>>>> MSG_ZEROCOPY documentation).
>>>>
>>>> The error queue is arguably a work-around for _not_ having a delivery
>>>> mechanism that works with a sync syscall in the first place. The main
>>>> question here imho would be "why add a whole new syscall etc when
>>>> there's already an existing way to accomplish this, with
>>>> free-to-reuse notifications". If the answer is "because splice", then it
>>>> would seem saner to plumb up those bits only. Would be much simpler
>>>> too...
>>>
>>> I may be misunderstanding your comment, but my response would be:
>>>
>>>   There are existing apps which use sendfile today unsafely and
>>>   it would be very nice to have a safe sendfile equivalent. Converting
>>>   existing apps to using iouring (if I understood your suggestion?)
>>>   would be significantly more work compared to calling sendfile2 and
>>>   adding code to check the error queue.
>>
>> It's really not, if you just want to use it as a sync kind of thing. If
>> you want to have multiple things in flight etc, yeah it could be more
>> work, you'd also get better performance that way. And you could use
>> things like registered buffers for either of them, which again would
>> likely make it more efficient.
> 
> I haven't argued that performance would be better using sendfile2
> compared to iouring, just that existing apps which already use
> sendfile (but do so unsafely) would probably be more likely to use a
> safe alternative with existing examples of how to harvest completion
> notifications vs something more complex, like wrapping iouring.

Sure and I get that, just not sure it'd be worth doing on the kernel
side for such (fairly) weak reasoning. The performance benefit is just a
side note in that if you did do it this way, you'd potentially be able
to run it more efficiently too. And regardless what people do or use
now, they are generally always interested in that aspect.

>> If you just use it as a sync thing, it'd be pretty trivial to just wrap
>> a my_sendfile_foo() in a submit_and_wait operation, which issues and
>> waits on the completion in a single syscall. And if you want to wait on
>> the notification too, you could even do that in the same syscall and
>> wait on 2 CQEs. That'd be a downright trivial way to provide a sync way
>> of doing the same thing.
> 
> I don't disagree; I just don't know if app developers:
>   a.) know that this is possible to do, and
>   b.) know how to do it

Writing that wrapper would be not even a screenful of code. Yes maybe
they don't know how to do it now, but it's _really_ trivial to do. It'd
take me roughly 1 min to do that, would be happy to help out with that
side so it could go into a commit or man page or whatever.

> In general: it does seem a bit odd to me that there isn't a safe
> sendfile syscall in Linux that uses existing completion notification
> mechanisms.

Pretty natural, I think. sendfile(2) predates that by quite a bit, and
the last real change to sendfile was using splice underneath. Which I
did, and that was probably almost 20 years ago at this point...

I do think it makes sense to have a sendfile that's both fast and
efficient, and can be used sanely with buffer reuse without relying on
odd heuristics.

>>> I would also argue that there are likely user apps out there that
>>> use both sendmsg MSG_ZEROCOPY for certain writes (for data in
>>> memory) and also use sendfile (for data on disk). One example would
>>> be a reverse proxy that might write HTTP headers to clients via
>>> sendmsg but transmit the response body with sendfile.
>>>
>>> For those apps, the code to check the error queue already exists for
>>> sendmsg + MSG_ZEROCOPY, so swapping in sendfile2 seems like an easy
>>> way to ensure safe sendfile usage.
>>
>> Sure that is certainly possible. I didn't say that wasn't the case,
>> rather that the error queue approach is a work-around in the first place
>> for not having some kind of async notification mechanism for when it's
>> free to reuse.
> 
> Of course, I certainly agree that the error queue is a work around.
> But it works, apps use it, and it's fairly well known. I don't see any
> reason, other than historical context, why sendmsg can use this
> mechanism, splice can, but sendfile shouldn't?

My argument would be the same as for other features - if you can do it
simpler this other way, why not consider that? The end result would be
the same, you can do fast sendfile() with sane buffer reuse. But the
kernel side would be simpler, which is always a kernel main goal for
those of us that have to maintain it.

Just adding sendfile2() works in the sense that it's an easier drop in
replacement for an app, though the error queue side does mean it needs
to change anyway - it's not just replacing one syscall with another. And
if we want to be lazy, sure that's fine. I just don't think it's the
best way to do it when we literally have a mechanism that's designed for
this and works with reuse already with normal send zc (and receive side
too, in the next kernel).

Stefan Metzmacher March 19, 2025, 7:15 p.m. UTC | #8

Am 19.03.25 um 19:37 schrieb Jens Axboe:
> On 3/19/25 11:45 AM, Joe Damato wrote:
>> On Wed, Mar 19, 2025 at 11:20:50AM -0600, Jens Axboe wrote:
>>> On 3/19/25 11:04 AM, Joe Damato wrote:
>>>> On Wed, Mar 19, 2025 at 10:07:27AM -0600, Jens Axboe wrote:
>>>>> On 3/19/25 9:32 AM, Joe Damato wrote:
>>>>>> On Wed, Mar 19, 2025 at 01:04:48AM -0700, Christoph Hellwig wrote:
>>>>>>> On Wed, Mar 19, 2025 at 12:15:11AM +0000, Joe Damato wrote:
>>>>>>>> One way to fix this is to add zerocopy notifications to sendfile similar
>>>>>>>> to how MSG_ZEROCOPY works with sendmsg. This is possible thanks to the
>>>>>>>> extensive work done by Pavel [1].
>>>>>>>
>>>>>>> What is a "zerocopy notification"
>>>>>>
>>>>>> See the docs on MSG_ZEROCOPY [1], but in short when a user app calls
>>>>>> sendmsg and passes MSG_ZEROCOPY a completion notification is added
>>>>>> to the error queue. The user app can poll for these to find out when
>>>>>> the TX has completed and the buffer it passed to the kernel can be
>>>>>> overwritten.
>>>>>>
>>>>>> My series provides the same functionality via splice and sendfile2.
>>>>>>
>>>>>> [1]: https://www.kernel.org/doc/html/v6.13/networking/msg_zerocopy.html
>>>>>>
>>>>>>> and why aren't you simply plugging this into io_uring and generate
>>>>>>> a CQE so that it works like all other asynchronous operations?
>>>>>>
>>>>>> I linked to the iouring work that Pavel did in the cover letter.
>>>>>> Please take a look.
>>>>>>
>>>>>> That work refactored the internals of how zerocopy completion
>>>>>> notifications are wired up, allowing other pieces of code to use the
>>>>>> same infrastructure and extend it, if needed.
>>>>>>
>>>>>> My series is using the same internals that iouring (and others) use
>>>>>> to generate zerocopy completion notifications. Unlike iouring,
>>>>>> though, I don't need a fully customized implementation with a new
>>>>>> user API for harvesting completion events; I can use the existing
>>>>>> mechanism already in the kernel that user apps already use for
>>>>>> sendmsg (the error queue, as explained above and in the
>>>>>> MSG_ZEROCOPY documentation).
>>>>>
>>>>> The error queue is arguably a work-around for _not_ having a delivery
>>>>> mechanism that works with a sync syscall in the first place. The main
>>>>> question here imho would be "why add a whole new syscall etc when
>>>>> there's already an existing way to accomplish this, with
>>>>> free-to-reuse notifications". If the answer is "because splice", then it
>>>>> would seem saner to plumb up those bits only. Would be much simpler
>>>>> too...
>>>>
>>>> I may be misunderstanding your comment, but my response would be:
>>>>
>>>>    There are existing apps which use sendfile today unsafely and
>>>>    it would be very nice to have a safe sendfile equivalent. Converting
>>>>    existing apps to using iouring (if I understood your suggestion?)
>>>>    would be significantly more work compared to calling sendfile2 and
>>>>    adding code to check the error queue.
>>>
>>> It's really not, if you just want to use it as a sync kind of thing. If
>>> you want to have multiple things in flight etc, yeah it could be more
>>> work, you'd also get better performance that way. And you could use
>>> things like registered buffers for either of them, which again would
>>> likely make it more efficient.
>>
>> I haven't argued that performance would be better using sendfile2
>> compared to iouring, just that existing apps which already use
>> sendfile (but do so unsafely) would probably be more likely to use a
>> safe alternative with existing examples of how to harvest completion
>> notifications vs something more complex, like wrapping iouring.
> 
> Sure and I get that, just not sure it'd be worth doing on the kernel
> side for such (fairly) weak reasoning. The performance benefit is just a
> side note in that if you did do it this way, you'd potentially be able
> to run it more efficiently too. And regardless what people do or use
> now, they are generally always interested in that aspect.
> 
>>> If you just use it as a sync thing, it'd be pretty trivial to just wrap
>>> a my_sendfile_foo() in a submit_and_wait operation, which issues and
>>> waits on the completion in a single syscall. And if you want to wait on
>>> the notification too, you could even do that in the same syscall and
>>> wait on 2 CQEs. That'd be a downright trivial way to provide a sync way
>>> of doing the same thing.
>>
>> I don't disagree; I just don't know if app developers:
>>    a.) know that this is possible to do, and
>>    b.) know how to do it
> 
> Writing that wrapper would be not even a screenful of code. Yes maybe
> they don't know how to do it now, but it's _really_ trivial to do. It'd
> take me roughly 1 min to do that, would be happy to help out with that
> side so it could go into a commit or man page or whatever.
> 
>> In general: it does seem a bit odd to me that there isn't a safe
>> sendfile syscall in Linux that uses existing completion notification
>> mechanisms.
> 
> Pretty natural, I think. sendfile(2) predates that by quite a bit, and
> the last real change to sendfile was using splice underneath. Which I
> did, and that was probably almost 20 years ago at this point...
> 
> I do think it makes sense to have a sendfile that's both fast and
> efficient, and can be used sanely with buffer reuse without relying on
> odd heuristics.
> 
>>>> I would also argue that there are likely user apps out there that
>>>> use both sendmsg MSG_ZEROCOPY for certain writes (for data in
>>>> memory) and also use sendfile (for data on disk). One example would
>>>> be a reverse proxy that might write HTTP headers to clients via
>>>> sendmsg but transmit the response body with sendfile.
>>>>
>>>> For those apps, the code to check the error queue already exists for
>>>> sendmsg + MSG_ZEROCOPY, so swapping in sendfile2 seems like an easy
>>>> way to ensure safe sendfile usage.
>>>
>>> Sure that is certainly possible. I didn't say that wasn't the case,
>>> rather that the error queue approach is a work-around in the first place
>>> for not having some kind of async notification mechanism for when it's
>>> free to reuse.
>>
>> Of course, I certainly agree that the error queue is a work around.
>> But it works, apps use it, and it's fairly well known. I don't see any
>> reason, other than historical context, why sendmsg can use this
>> mechanism, splice can, but sendfile shouldn't?
> 
> My argument would be the same as for other features - if you can do it
> simpler this other way, why not consider that? The end result would be
> the same, you can do fast sendfile() with sane buffer reuse. But the
> kernel side would be simpler, which is always a kernel main goal for
> those of us that have to maintain it.
> 
> Just adding sendfile2() works in the sense that it's an easier drop in
> replacement for an app, though the error queue side does mean it needs
> to change anyway - it's not just replacing one syscall with another. And
> if we want to be lazy, sure that's fine. I just don't think it's the
> best way to do it when we literally have a mechanism that's designed for
> this and works with reuse already with normal send zc (and receive side
> too, in the next kernel).

A few months (or even years) back, Pavel came up with an idea
to implement some kind of splice into a fixed buffer. If that
were implemented, I guess it would help me in Samba too.
My first usage was on the receive side (from the network).

But the other side might also be possible now that we have RWF_DONTCACHE.
Instead of dropping the pages from the page cache, it might
be possible to move them to a fixed buffer instead.
It would mean the pages would be 'stable' when they are
no longer part of the pagecache.
But maybe my assumption for that is too naive...

Anyway, that splice into a fixed buffer would be great to have,
as the new IORING_OP_RECV_ZC requires control over the
hardware queues of the NIC and only allows a single process
to provide buffers for that receive queue (at least that's how
I understand it). And that's not possible for multiple processes
(maybe not belonging to the same high-level application and likely
non-root applications). So it would be great to have splice into
a fixed buffer as an alternative to IORING_OP_SPLICE/IORING_OP_TEE,
as it would be more flexible to use in combination with
IORING_OP_SENDMSG_ZC as well as IORING_OP_WRITE[V]_FIXED with RWF_DONTCACHE.

I guess such a splice into a fixed buffer linked to IORING_OP_SENDMSG_ZC
would be the way to simulate sendfile2() in userspace?

Thanks!
metze

Joe Damato March 19, 2025, 7:16 p.m. UTC | #9

On Wed, Mar 19, 2025 at 12:37:29PM -0600, Jens Axboe wrote:
> On 3/19/25 11:45 AM, Joe Damato wrote:
> > On Wed, Mar 19, 2025 at 11:20:50AM -0600, Jens Axboe wrote:
> >> On 3/19/25 11:04 AM, Joe Damato wrote:
> >>> On Wed, Mar 19, 2025 at 10:07:27AM -0600, Jens Axboe wrote:
> >>>> On 3/19/25 9:32 AM, Joe Damato wrote:
> >>>>> On Wed, Mar 19, 2025 at 01:04:48AM -0700, Christoph Hellwig wrote:
> >>>>>> On Wed, Mar 19, 2025 at 12:15:11AM +0000, Joe Damato wrote:
> >>>>>>> One way to fix this is to add zerocopy notifications to sendfile similar
> >>>>>>> to how MSG_ZEROCOPY works with sendmsg. This is possible thanks to the
> >>>>>>> extensive work done by Pavel [1].
> >>>>>>
> >>>>>> What is a "zerocopy notification" 
> >>>>>
> >>>>> See the docs on MSG_ZEROCOPY [1], but in short when a user app calls
> >>>>> sendmsg and passes MSG_ZEROCOPY a completion notification is added
> >>>>> to the error queue. The user app can poll for these to find out when
> >>>>> the TX has completed and the buffer it passed to the kernel can be
> >>>>> overwritten.
> >>>>>
> >>>>> My series provides the same functionality via splice and sendfile2.
> >>>>>
> >>>>> [1]: https://www.kernel.org/doc/html/v6.13/networking/msg_zerocopy.html
> >>>>>
> >>>>>> and why aren't you simply plugging this into io_uring and generate
> >>>>>> a CQE so that it works like all other asynchronous operations?
> >>>>>
> >>>>> I linked to the iouring work that Pavel did in the cover letter.
> >>>>> Please take a look.
> >>>>>
> >>>>> That work refactored the internals of how zerocopy completion
> >>>>> notifications are wired up, allowing other pieces of code to use the
> >>>>> same infrastructure and extend it, if needed.
> >>>>>
> >>>>> My series is using the same internals that iouring (and others) use
> >>>>> to generate zerocopy completion notifications. Unlike iouring,
> >>>>> though, I don't need a fully customized implementation with a new
> >>>>> user API for harvesting completion events; I can use the existing
> >>>>> mechanism already in the kernel that user apps already use for
> >>>>> sendmsg (the error queue, as explained above and in the
> >>>>> MSG_ZEROCOPY documentation).
> >>>>
> >>>> The error queue is arguably a work-around for _not_ having a delivery
> >>>> mechanism that works with a sync syscall in the first place. The main
> >>>> question here imho would be "why add a whole new syscall etc when
> >>>> there's already an existing way to accomplish this, with
> >>>> free-to-reuse notifications". If the answer is "because splice", then it
> >>>> would seem saner to plumb up those bits only. Would be much simpler
> >>>> too...
> >>>
> >>> I may be misunderstanding your comment, but my response would be:
> >>>
> >>>   There are existing apps which use sendfile today unsafely and
> >>>   it would be very nice to have a safe sendfile equivalent. Converting
> >>>   existing apps to using iouring (if I understood your suggestion?)
> >>>   would be significantly more work compared to calling sendfile2 and
> >>>   adding code to check the error queue.
> >>
> >> It's really not, if you just want to use it as a sync kind of thing. If
> >> you want to have multiple things in flight etc, yeah it could be more
> >> work, you'd also get better performance that way. And you could use
> >> things like registered buffers for either of them, which again would
> >> likely make it more efficient.
> > 
> > I haven't argued that performance would be better using sendfile2
> > compared to iouring, just that existing apps which already use
> > sendfile (but do so unsafely) would probably be more likely to use a
> > safe alternative with existing examples of how to harvest completion
> > notifications vs something more complex, like wrapping iouring.
> 
> Sure and I get that, just not sure it'd be worth doing on the kernel
> side for such (fairly) weak reasoning. The performance benefit is just a
> side note in that if you did do it this way, you'd potentially be able
> to run it more efficiently too. And regardless what people do or use
> now, they are generally always interested in that aspect.

Fair enough.

> >> If you just use it as a sync thing, it'd be pretty trivial to just wrap
> >> a my_sendfile_foo() in a submit_and_wait operation, which issues and
> >> waits on the completion in a single syscall. And if you want to wait on
> >> the notification too, you could even do that in the same syscall and
> >> wait on 2 CQEs. That'd be a downright trivial way to provide a sync way
> >> of doing the same thing.
> > 
> > I don't disagree; I just don't know if app developers:
> >   a.) know that this is possible to do, and
> >   b.) know how to do it
> 
> Writing that wrapper would be not even a screenful of code. Yes maybe
> they don't know how to do it now, but it's _really_ trivial to do. It'd
> take me roughly 1 min to do that, would be happy to help out with that
> side so it could go into a commit or man page or whatever.

I'd never be opposed to more documentation ;)

> > In general: it does seem a bit odd to me that there isn't a safe
> > sendfile syscall in Linux that uses existing completion notification
> > mechanisms.
> 
> Pretty natural, I think. sendfile(2) predates that by quite a bit, and
> the last real change to sendfile was using splice underneath. Which I
> did, and that was probably almost 20 years ago at this point...
> 
> I do think it makes sense to have a sendfile that's both fast and
> efficient, and can be used sanely with buffer reuse without relying on
> odd heuristics.

Just trying to tie this together in my head -- are you saying that
you think the kernel internals of sendfile could be changed in a
different way, or that this is a userland problem (and they should use
the io_uring wrapper you suggested above)?

> >>> I would also argue that there are likely user apps out there that
> >>> use both sendmsg MSG_ZEROCOPY for certain writes (for data in
> >>> memory) and also use sendfile (for data on disk). One example would
> >>> be a reverse proxy that might write HTTP headers to clients via
> >>> sendmsg but transmit the response body with sendfile.
> >>>
> >>> For those apps, the code to check the error queue already exists for
> >>> sendmsg + MSG_ZEROCOPY, so swapping in sendfile2 seems like an easy
> >>> way to ensure safe sendfile usage.
> >>
> >> Sure that is certainly possible. I didn't say that wasn't the case,
> >> rather that the error queue approach is a work-around in the first place
> >> for not having some kind of async notification mechanism for when it's
> >> free to reuse.
> > 
> > Of course, I certainly agree that the error queue is a work around.
> > But it works, apps use it, and it's fairly well known. I don't see any
> > reason, other than historical context, why sendmsg can use this
> > mechanism, splice can, but sendfile shouldn't?
> 
> My argument would be the same as for other features - if you can do it
> simpler this other way, why not consider that? The end result would be
> the same, you can do fast sendfile() with sane buffer reuse. But the
> kernel side would be simpler, which is always a kernel main goal for
> those of us that have to maintain it.
>
> Just adding sendfile2() works in the sense that it's an easier drop in
> replacement for an app, though the error queue side does mean it needs
> to change anyway - it's not just replacing one syscall with another. And
> if we want to be lazy, sure that's fine. I just don't think it's the
> best way to do it when we literally have a mechanism that's designed for
> this and works with reuse already with normal send zc (and receive side
> too, in the next kernel).

It seems like you've answered the question I asked above and that
you are suggesting there might be a better and simpler sendfile2
kernel-side implementation that doesn't rely on splice internals at
all.

Am I following you? If so, I'll drop the sendfile2 stuff from this
series and stick with the splice changes only, if you are (at a high
level) OK with the idea of adding a flag for this to splice.

In the meantime, I'll take a few more reads through the iouring code
to see if I can work out how sendfile2 might be built on top of that
instead of splice in the kernel.

Thank you very much for your time, feedback, and attention,
Joe

Joe Damato March 19, 2025, 11:22 p.m. UTC | #10

On Wed, Mar 19, 2025 at 10:07:27AM -0600, Jens Axboe wrote:
> On 3/19/25 9:32 AM, Joe Damato wrote:
> > On Wed, Mar 19, 2025 at 01:04:48AM -0700, Christoph Hellwig wrote:
> >> On Wed, Mar 19, 2025 at 12:15:11AM +0000, Joe Damato wrote:
> >>> One way to fix this is to add zerocopy notifications to sendfile similar
> >>> to how MSG_ZEROCOPY works with sendmsg. This is possible thanks to the
> >>> extensive work done by Pavel [1].
> >>
> >> What is a "zerocopy notification" 
> > 
> > See the docs on MSG_ZEROCOPY [1], but in short when a user app calls
> > sendmsg and passes MSG_ZEROCOPY a completion notification is added
> > to the error queue. The user app can poll for these to find out when
> > the TX has completed and the buffer it passed to the kernel can be
> > overwritten.
> > 
> > My series provides the same functionality via splice and sendfile2.
> > 
> > [1]: https://www.kernel.org/doc/html/v6.13/networking/msg_zerocopy.html
> > 
> >> and why aren't you simply plugging this into io_uring and generate
> >> a CQE so that it works like all other asynchronous operations?
> > 
> > I linked to the iouring work that Pavel did in the cover letter.
> > Please take a look.
> > 
> > That work refactored the internals of how zerocopy completion
> > notifications are wired up, allowing other pieces of code to use the
> > same infrastructure and extend it, if needed.
> > 
> > My series is using the same internals that iouring (and others) use
> > to generate zerocopy completion notifications. Unlike iouring,
> > though, I don't need a fully customized implementation with a new
> > user API for harvesting completion events; I can use the existing
> > mechanism already in the kernel that user apps already use for
> > sendmsg (the error queue, as explained above and in the
> > MSG_ZEROCOPY documentation).
> 
> The error queue is arguably a work-around for _not_ having a delivery
> mechanism that works with a sync syscall in the first place. The main
> question here imho would be "why add a whole new syscall etc when
> there's already an existing way to accomplish this, with
> free-to-reuse notifications". If the answer is "because splice", then it
> would seem saner to plumb up those bits only. Would be much simpler
> too...

OK, I reworked the patches to drop all the sendfile2 stuff so no new
system call is added. Only a flag for splice, SPLICE_F_ZC.

It feels weird to add this to the splice path but not the path that
sendfile takes through splice.

I understand and agree with you: if we are adding a new system
call, like sendfile2, it should probably be done as you've
described in your other messages.

What about an alternative?

Would you be open to the idea that sendfile could be extended to
generate error queue completions if the network socket has
SO_ZEROCOPY set?

If so, that would solve the original problem without introducing a
new system call and still leaves the door open for a more efficient
sendfile2 based on iouring internals later.

What do you think?

Christoph Hellwig March 20, 2025, 5:50 a.m. UTC | #11

On Wed, Mar 19, 2025 at 08:32:19AM -0700, Joe Damato wrote:
> See the docs on MSG_ZEROCOPY [1], but in short when a user app calls
> sendmsg and passes MSG_ZEROCOPY a completion notification is added
> to the error queue. The user app can poll for these to find out when
> the TX has completed and the buffer it passed to the kernel can be
> overwritten.

Yikes.  That's not just an ugly interface, but something entirely
specific to sockets and incompatible with all other asynchronous I/O
interfaces.

> > and why aren't you simply plugging this into io_uring and generate
> > a CQE so that it works like all other asynchronous operations?
> 
> I linked to the iouring work that Pavel did in the cover letter.
> Please take a look.

Please write down what matters in the cover letter, including all the
important tradeoffs.

Christoph Hellwig March 20, 2025, 5:57 a.m. UTC | #12

On Wed, Mar 19, 2025 at 10:45:22AM -0700, Joe Damato wrote:
> I don't disagree; I just don't know if app developers:
>   a.) know that this is possible to do, and
>   b.) know how to do it

So if you don't know that why do you even do the work?

> In general: it does seem a bit odd to me that there isn't a safe
> sendfile syscall in Linux that uses existing completion notification
> mechanisms.

Agreed.  Where the existing notification mechanism is called io_uring.

> Of course, I certainly agree that the error queue is a work around.
> But it works, apps use it, and it's fairly well known. I don't see any
> reason, other than historical context, why sendmsg can use this
> mechanism, splice can, but sendfile shouldn't?

Because sendmsg should never have done that, and it certainly should not
spread beyond purely socket-specific syscalls.

> If you feel very strongly that this cannot be merged without
> dropping sendfile2 and only plumbing this through for splice, then
> I'll drop the sendfile2 syscall when I submit officially (probably
> next week?).

Splice should also not do "error queue notifications".  Nothing
new and certainly nothing outside of net/ should.

> I do feel pretty strongly that it's more likely apps would use
> sendfile2 and we'd have safer apps out in the wild. But, I could be
> wrong.

A purely synchronous sendfile that is safe is a good thing.  Spreading
non-standard out of band notifications is not.  How to build that
safe sendmsg is a good question, and a sendmsg2 might be a sane
option for that.  The important thing is that the underlying code
should use iocbs and ki_complete to notify I/O completion so that
all the existing infrastructure like io_uring and in-kernel callers
can reuse this.

Note that this also ties into the currently broken memory management
in the networking code that directly messes with page references
rather than notifying the caller about I/O completion.

Pavel Begunkov March 20, 2025, 10:46 a.m. UTC | #13

On 3/19/25 19:15, Stefan Metzmacher wrote:
> Am 19.03.25 um 19:37 schrieb Jens Axboe:
>> On 3/19/25 11:45 AM, Joe Damato wrote:
>>> On Wed, Mar 19, 2025 at 11:20:50AM -0600, Jens Axboe wrote:
...
>> My argument would be the same as for other features - if you can do it
>> simpler this other way, why not consider that? The end result would be
>> the same, you can do fast sendfile() with sane buffer reuse. But the
>> kernel side would be simpler, which is always a kernel main goal for
>> those of us that have to maintain it.
>>
>> Just adding sendfile2() works in the sense that it's an easier drop in
>> replacement for an app, though the error queue side does mean it needs
>> to change anyway - it's not just replacing one syscall with another. And
>> if we want to be lazy, sure that's fine. I just don't think it's the
>> best way to do it when we literally have a mechanism that's designed for
>> this and works with reuse already with normal send zc (and receive side
>> too, in the next kernel).
> 
> A few months (or even years) back, Pavel came up with an idea
> to implement some kind of splice into a fixed buffer. If that
> were implemented, I guess it would help me in Samba too.
> My first usage was on the receive side (from the network).

I did it as a testing ground for infra needed for ublk zerocopy,
but if that's of interest I can resurrect the patches and see
where it goes, especially since the aforementioned infra just got
queued.

> But the other side might also be possible now that we have RWF_DONTCACHE.
> Instead of dropping the pages from the page cache, it might
> be possible to move them to a fixed buffer instead.
> It would mean the pages would be 'stable' when they are
> no longer part of the pagecache.
> But maybe my assumption for that is too naive...

That's an interesting idea

> Anyway, that splice into a fixed buffer would be great to have,
> as the new IORING_OP_RECV_ZC requires control over the
> hardware queues of the NIC and only allows a single process

Right, it basically borrows a hardware rx queue and that
needs CAP_NET_ADMIN, and the user also has to set up steering
rules.

> to provide buffers for that receive queue (at least that's how
> I understand it). And that's not possible for multiple processes
> (maybe not belonging to the same high-level application and likely

It's up to the user to decide who returns buffers back (and how to
synchronise that) as the API is just a user mapped ring. Regardless,
it's not a finished project, David and I looked at features we want
to add to make life easier for multithreaded apps that can't throw
that many queues. I see your point though.

> non-root applications). So it would be great to have splice into
> a fixed buffer as an alternative to IORING_OP_SPLICE/IORING_OP_TEE,
> as it would be more flexible to use in combination with
> IORING_OP_SENDMSG_ZC as well as IORING_OP_WRITE[V]_FIXED with RWF_DONTCACHE.
> 
> I guess such a splice into a fixed buffer linked to IORING_OP_SENDMSG_ZC
> would be the way to simulate sendfile2() in userspace?

Right, and that approach allows handling intermediate errors,
which is why it doesn't need to put restrictions on the input
file.

Joe Damato March 20, 2025, 6:05 p.m. UTC | #14

On Wed, Mar 19, 2025 at 10:50:18PM -0700, Christoph Hellwig wrote:
> On Wed, Mar 19, 2025 at 08:32:19AM -0700, Joe Damato wrote:
> > See the docs on MSG_ZEROCOPY [1], but in short when a user app calls
> > sendmsg and passes MSG_ZEROCOPY a completion notification is added
> > to the error queue. The user app can poll for these to find out when
> > the TX has completed and the buffer it passed to the kernel can be
> > overwritten.
> 
> Yikes.  That's not just an ugly interface, but something entirely
> specific to sockets and incompatible with all other asynchronous I/O
> interfaces.

I don't really know but I would assume it was introduced, as Jens
said, as a work-around long before other completion mechanisms
existed.

> > > and why aren't you simply plugging this into io_uring and generate
> > > a CQE so that it works like all other asynchronous operations?
> > 
> > I linked to the iouring work that Pavel did in the cover letter.
> > Please take a look.
> 
> Please write down what matters in the cover letter, including all the
> important tradeoffs.

OK, I will enhance the cover letter for the next submission. I had
originally thought I'd submit something officially, but I think I'll
probably submit another RFC with some of the changes I've made based
on the discussion with Jens.

Namely: dropping sendfile2 completely and plumbing the bits through
for splice. I'll wait a bit to hear what Jens thinks about the
SO_ZEROCOPY thing (basically: if a network socket has that option
set, maybe the existing sendfile can generate error queue
completions without needing a separate system call?).

I agree overall that sendfile2 or sendmsg2 or whatever else could
likely be built differently now that better interfaces and
mechanisms exist in the kernel - but I still think there's room to
improve existing system calls so they can be used safely.

Joe Damato March 20, 2025, 6:23 p.m. UTC | #15

On Wed, Mar 19, 2025 at 10:57:29PM -0700, Christoph Hellwig wrote:
> On Wed, Mar 19, 2025 at 10:45:22AM -0700, Joe Damato wrote:
> > I don't disagree; I just don't know if app developers:
> >   a.) know that this is possible to do, and
> >   b.) know how to do it
> 
> So if you don't know that why do you even do the work?

I am doing the work because I use splice and sendfile and it seems
relatively straightforward to make them safer using an existing
mechanism, at least for network sockets.

After dropping the sendfile2 patches completely, it looks like in my
new set all of the code is within CONFIG_NET defines in fs/splice.c.

> > In general: it does seem a bit odd to me that there isn't a safe
> > sendfile syscall in Linux that uses existing completion notification
> > mechanisms.
> 
> Agreed.  Where the existing notification mechanism is called io_uring.

Sure. As I mentioned to Jens: I agree that any new system call
should be built differently.

But does that mean we should leave splice and sendfile as-is when
there is a way to potentially make them safer?

In my other message to Jens I proposed:
  - SPLICE_F_ZC for splice to generate zc completion notifications
    to the error queue
  - Modifying sendfile so that if SO_ZEROCOPY (which already exists)
    is set on a network socket, zc completion notifications are
    generated.

In both cases no new system call is needed and both splice and
sendfile become safer to use. 

At some point in the future, new system calls (sendmsg2, sendfile2,
splice2, etc) can be built on top of iouring.

> > Of course, I certainly agree that the error queue is a work around.
> > But it works, apps use it, and it's fairly well known. I don't see any
> > reason, other than historical context, why sendmsg can use this
> > mechanism, splice can, but sendfile shouldn't?
> 
> Because sendmsg should never have done that, and it certainly should not
> spread beyond purely socket-specific syscalls.

I don't know the entire historical context, but I presume sendmsg
did that because there was no other mechanism at the time.

I will explain it more clearly in the next cover letter, but the way
I see the situation is:
  - There are existing system calls which operate on network sockets
    (splice and sendfile) that avoid copies
  - There is a mechanism already in the kernel in the networking
    stack for generating completion notifications
  - Both splice and sendfile could be extended to support this for
    network sockets so they can be used more safely, without
    introducing a new system call

> > If you feel very strongly that this cannot be merged without
> > dropping sendfile2 and only plumbing this through for splice, then
> > I'll drop the sendfile2 syscall when I submit officially (probably
> > next week?).
> 
> Splice should also not do "error queue notifications".  Nothing
> new and certainly nothing outside of net/ should.

It seems like Jens suggested that plumbing this through for splice
was a possibility, but sounds like you disagree.

Not really sure how to proceed here?

If code I am modifying is within CONFIG_NET defines, but lives in
fs/splice.c ... is that within the realm of net or fs?

I am asking because I genuinely don't know.

As mentioned above and in other messages, it seems like it is
possible to improve the networking parts of splice (and therefore
sendfile) to make them safer to use without introducing a new system
call.

Are you saying that you are against doing that, even if the code is
network specific (but lives in fs/)?

> > I do feel pretty strongly that it's more likely apps would use
> > sendfile2 and we'd have safer apps out in the wild. But, I could be
> > wrong.
> 
> A purely synchronous sendfile that is safe is a good thing.  Spreading
> non-standard out of band notifications is not.  How to build that
> safe sendmsg is a good question, and a sendmsg2 might be a sane
> option for that.  The important thing is that the underlying code
> should use iocbs and ki_complete to notify I/O completion so that
> all the existing infrastructure like io_uring and in-kernel callers
> can reuse this.

I'm not currently planning to build sendmsg2 (and, as I've already
mentioned to Jens and above, I will drop sendfile2), but if I have
the time it sounds like an interesting project.

Christoph Hellwig March 21, 2025, 5:56 a.m. UTC | #16

On Thu, Mar 20, 2025 at 11:23:57AM -0700, Joe Damato wrote:
> In my other message to Jens I proposed:
>   - SPLICE_F_ZC for splice to generate zc completion notifications
>     to the error queue
>   - Modifying sendfile so that if SO_ZEROCOPY (which already exists)
>     is set on a network socket, zc completion notifications are
>     generated.
> 
> In both cases no new system call is needed and both splice and
> sendfile become safer to use. 
> 
> At some point in the future, a mechanism built on top of io_uring
> could be introduced as new system calls (sendmsg2, sendfile2,
> splice2, etc).

I strongly disagree with this.  This is spreading the broken
SO_ZEROCOPY to further places outside the pure networking realm.  Don't
do that.

It also doesn't help that more than 7 years after adding it,
SO_ZEROCOPY is still completely undocumented.

> > Because sendmsg should never have done that, it certainly should not
> > spread beyond purely socket specific syscalls.
> 
> I don't know the entire historical context, but I presume sendmsg
> did that because there was no other mechanism at the time.

At least aio had been around for about 15 years at that point, but
networking folks tend to be pretty insular and reinvent things.

> It seems like Jens suggested that plumbing this through for splice
> was a possibility, but sounds like you disagree.

Yes, very strongly.

> As mentioned above and in other messages, it seems like it is
> possible to improve the networking parts of splice (and therefore
> sendfile) to make them safer to use without introducing a new system
> call.
> 
> Are you saying that you are against doing that, even if the code is
> network specific (but lives in fs/)?

Yes.

Please take the work and integrate it with the kiocb-based system
we use for all other in-kernel I/O that needs completion notifications
and which makes it trivial to integrate with io_uring instead of
spreading an incompatible and inferior event system.

Stefan Metzmacher March 21, 2025, 7:55 a.m. UTC | #17

Am 20.03.25 um 11:46 schrieb Pavel Begunkov:
> On 3/19/25 19:15, Stefan Metzmacher wrote:
>> Am 19.03.25 um 19:37 schrieb Jens Axboe:
>>> On 3/19/25 11:45 AM, Joe Damato wrote:
>>>> On Wed, Mar 19, 2025 at 11:20:50AM -0600, Jens Axboe wrote:
> ...
>>> My argument would be the same as for other features - if you can do it
>>> simpler this other way, why not consider that? The end result would be
>>> the same, you can do fast sendfile() with sane buffer reuse. But the
>>> kernel side would be simpler, which is always a kernel main goal for
>>> those of us that have to maintain it.
>>>
>>> Just adding sendfile2() works in the sense that it's an easier drop in
>>> replacement for an app, though the error queue side does mean it needs
>>> to change anyway - it's not just replacing one syscall with another. And
>>> if we want to be lazy, sure that's fine. I just don't think it's the
>>> best way to do it when we literally have a mechanism that's designed for
>>> this and works with reuse already with normal send zc (and receive side
>>> too, in the next kernel).
>>
>> A few months (or even years) back, Pavel came up with an idea
>> to implement some kind of splice into a fixed buffer, if that
>> would be implemented I guess it would help me in Samba too.
>> My first usage was on the receive side (from the network).
> 
> I did it as a testing ground for infra needed for ublk zerocopy,
> but if that's of interest I can resurrect the patches and see
> where it goes, especially since the aforementioned infra just got
> queued.

Would be great!

Have you posted the work in progress somewhere?

Thanks!
metze

Jens Axboe March 21, 2025, 11:11 a.m. UTC | #18

On 3/19/25 1:16 PM, Joe Damato wrote:
>>> In general: it does seem a bit odd to me that there isn't a safe
>>> sendfile syscall in Linux that uses existing completion notification
>>> mechanisms.
>>
>> Pretty natural, I think. sendfile(2) predates that by quite a bit, and
>> the last real change to sendfile was using splice underneath. Which I
>> did, and that was probably almost 20 years ago at this point...
>>
>> I do think it makes sense to have a sendfile that's both fast and
>> efficient, and can be used sanely with buffer reuse without relying on
>> odd heuristics.
> 
> Just trying to tie this together in my head -- are you saying that
> you think the kernel internals of sendfile could be changed in a
> different way or that this is a userland problem (and they should use
> the io_uring wrapper you suggested above)?

I'm saying that it of course makes sense to have a way to do sendfile
where you know when reuse is safe, and that we have an API that provides
that very nicely already without needing to add syscalls. If you used
io_uring for this, then the "tx is done, reuse is fine" notification is
just another notification, not anything special that needs new plumbing.
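
As a rough illustration of that point (not code from this thread): with io_uring zero-copy send, a request normally posts two completions that share the same user_data. The first carries the send result and has IORING_CQE_F_MORE set; the second has IORING_CQE_F_NOTIF set and signals that the buffer can be reused. The sketch below only classifies CQE flags and does not set up a ring; the enum and function names are hypothetical, and the flag values are copied from the uapi header linux/io_uring.h.

```c
#include <stdint.h>

/* Flag values as defined in the uapi header linux/io_uring.h. */
#ifndef IORING_CQE_F_MORE
#define IORING_CQE_F_MORE	(1U << 1)
#endif
#ifndef IORING_CQE_F_NOTIF
#define IORING_CQE_F_NOTIF	(1U << 3)
#endif

/* Hypothetical classification of a completion for a zero-copy send. */
enum zc_event {
	ZC_SEND_RESULT,		/* cqe->res holds bytes sent or -errno */
	ZC_BUFFER_REUSABLE,	/* kernel is done with the buffer */
};

/* Decide what a CQE for a zero-copy send means from its flags. */
enum zc_event zc_classify(uint32_t cqe_flags)
{
	if (cqe_flags & IORING_CQE_F_NOTIF)
		return ZC_BUFFER_REUSABLE;
	return ZC_SEND_RESULT;
}

/* Nonzero if another CQE with the same user_data will follow. */
int zc_more_expected(uint32_t cqe_flags)
{
	return (cqe_flags & IORING_CQE_F_MORE) != 0;
}
```

An application would call these on each CQE it reaps in its normal completion loop, so the reuse notification needs no second queue or extra syscall to poll.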

>>>>> I would also argue that there are likely user apps out there that
>>>>> use both sendmsg MSG_ZEROCOPY for certain writes (for data in
>>>>> memory) and also use sendfile (for data on disk). One example would
>>>>> be a reverse proxy that might write HTTP headers to clients via
>>>>> sendmsg but transmit the response body with sendfile.
>>>>>
>>>>> For those apps, the code to check the error queue already exists for
>>>>> sendmsg + MSG_ZEROCOPY, so swapping in sendfile2 seems like an easy
>>>>> way to ensure safe sendfile usage.
>>>>
>>>> Sure that is certainly possible. I didn't say that wasn't the case,
>>>> rather that the error queue approach is a work-around in the first place
>>>> for not having some kind of async notification mechanism for when it's
>>>> free to reuse.
>>>
>>> Of course, I certainly agree that the error queue is a work around.
>>> But it works, apps use it, and it's fairly well known. I don't see any
>>> reason, other than historical context, why sendmsg can use this
>>> mechanism, splice can, but sendfile shouldn't?
>>
>> My argument would be the same as for other features - if you can do it
>> simpler this other way, why not consider that? The end result would be
>> the same, you can do fast sendfile() with sane buffer reuse. But the
>> kernel side would be simpler, which is always a kernel main goal for
>> those of us that have to maintain it.
>>
>> Just adding sendfile2() works in the sense that it's an easier drop in
>> replacement for an app, though the error queue side does mean it needs
>> to change anyway - it's not just replacing one syscall with another. And
>> if we want to be lazy, sure that's fine. I just don't think it's the
>> best way to do it when we literally have a mechanism that's designed for
>> this and works with reuse already with normal send zc (and receive side
>> too, in the next kernel).
> 
> It seems like you've answered the question I asked above and that
> you are suggesting there might be a better and simpler sendfile2
> kernel-side implementation that doesn't rely on splice internals at
> all.
> 
> Am I following you? If so, I'll drop the sendfile2 stuff from this
> series and stick with the splice changes only, if you are (at a high
> level) OK with the idea of adding a flag for this to splice.
> 
> In the meantime, I'll take a few more reads through the io_uring code
> to see if I can work out how sendfile2 might be built on top of that
> instead of splice in the kernel.

Heh, I don't know how you jumped to that conclusion based on my feedback,
and it seems like it's solidified through other replies. No, I'm not saying
that the approach makes sense for the kernel, it makes some vague amount
of sense only on the premise of "oh but this is easy for applications as
they already know how to use sendfile(2)".

Jens Axboe March 21, 2025, 11:13 a.m. UTC | #19

On 3/19/25 5:22 PM, Joe Damato wrote:
> Would you be open to the idea that sendfile could be extended to
> generate error queue completions if the network socket has
> SO_ZEROCOPY set?

I thought I was quite clear on my view of SO_ZEROCOPY and its error
queue usage, I guess I was not. No, I don't think this is a good path at
all, when the whole issue is that pretending to handle two different
types of completions via two different interfaces is pretty dumb and
inefficient to begin with, particularly when we have a method of doing
exactly that where the reuse notifications arrive in the normal
completion stream.

Jens Axboe March 21, 2025, 11:14 a.m. UTC | #20

On 3/20/25 11:56 PM, Christoph Hellwig wrote:
>> I don't know the entire historical context, but I presume sendmsg
>> did that because there was no other mechanism at the time.
> 
> At least aio had been around for about 15 years at that point, but
> networking folks tend to be pretty insular and reinvent things.

Yep...

>> It seems like Jens suggested that plumbing this through for splice
>> was a possibility, but sounds like you disagree.
> 
> Yes, very strongly.

And that is very much not what I suggested, fwiw.

>> As mentioned above and in other messages, it seems like it is
>> possible to improve the networking parts of splice (and therefore
>> sendfile) to make them safer to use without introducing a new system
>> call.
>>
>> Are you saying that you are against doing that, even if the code is
>> network specific (but lives in fs/)?
> 
> Yes.
> 
> Please take the work and integrate it with the kiocb-based system
> we use for all other in-kernel I/O that needs completion notifications
> and which makes it trivial to integrate with io_uring instead of
> spreading an incompatible and inferior event system.

Exactly, this is how we do async IO elsewhere, not sure why networking
needs to be special here, and definitely not special in a good way.
