Message ID | 20250319001521.53249-1-jdamato@fastly.com (mailing list archive) |
---|---|
Series | Add ZC notifications to splice and sendfile |
On Wed, Mar 19, 2025 at 12:15:11AM +0000, Joe Damato wrote:
> One way to fix this is to add zerocopy notifications to sendfile similar
> to how MSG_ZEROCOPY works with sendmsg. This is possible thanks to the
> extensive work done by Pavel [1].

What is a "zerocopy notification" and why aren't you simply plugging this into io_uring and generate a CQE so that it works like all other asynchronous operations?
On Wed, Mar 19, 2025 at 01:04:48AM -0700, Christoph Hellwig wrote:
> On Wed, Mar 19, 2025 at 12:15:11AM +0000, Joe Damato wrote:
> > One way to fix this is to add zerocopy notifications to sendfile similar
> > to how MSG_ZEROCOPY works with sendmsg. This is possible thanks to the
> > extensive work done by Pavel [1].
>
> What is a "zerocopy notification"

See the docs on MSG_ZEROCOPY [1], but in short when a user app calls sendmsg and passes MSG_ZEROCOPY a completion notification is added to the error queue. The user app can poll for these to find out when the TX has completed and the buffer it passed to the kernel can be overwritten.

My series provides the same functionality via splice and sendfile2.

[1]: https://www.kernel.org/doc/html/v6.13/networking/msg_zerocopy.html

> and why aren't you simply plugging this into io_uring and generate
> a CQE so that it works like all other asynchronous operations?

I linked to the iouring work that Pavel did in the cover letter. Please take a look.

That work refactored the internals of how zerocopy completion notifications are wired up, allowing other pieces of code to use the same infrastructure and extend it, if needed.

My series is using the same internals that iouring (and others) use to generate zerocopy completion notifications. Unlike iouring, though, I don't need a fully customized implementation with a new user API for harvesting completion events; I can use the existing mechanism already in the kernel that user apps already use for sendmsg (the error queue, as explained above and in the MSG_ZEROCOPY documentation).

Let me know if that answers your question or if you have other questions.

Thanks,
Joe
On 3/19/25 9:32 AM, Joe Damato wrote:
> On Wed, Mar 19, 2025 at 01:04:48AM -0700, Christoph Hellwig wrote:
>> On Wed, Mar 19, 2025 at 12:15:11AM +0000, Joe Damato wrote:
>>> One way to fix this is to add zerocopy notifications to sendfile similar
>>> to how MSG_ZEROCOPY works with sendmsg. This is possible thanks to the
>>> extensive work done by Pavel [1].
>>
>> What is a "zerocopy notification"
>
> See the docs on MSG_ZEROCOPY [1], but in short when a user app calls
> sendmsg and passes MSG_ZEROCOPY a completion notification is added
> to the error queue. The user app can poll for these to find out when
> the TX has completed and the buffer it passed to the kernel can be
> overwritten.
>
> My series provides the same functionality via splice and sendfile2.
>
> [1]: https://www.kernel.org/doc/html/v6.13/networking/msg_zerocopy.html
>
>> and why aren't you simply plugging this into io_uring and generate
>> a CQE so that it works like all other asynchronous operations?
>
> I linked to the iouring work that Pavel did in the cover letter.
> Please take a look.
>
> That work refactored the internals of how zerocopy completion
> notifications are wired up, allowing other pieces of code to use the
> same infrastructure and extend it, if needed.
>
> My series is using the same internals that iouring (and others) use
> to generate zerocopy completion notifications. Unlike iouring,
> though, I don't need a fully customized implementation with a new
> user API for harvesting completion events; I can use the existing
> mechanism already in the kernel that user apps already use for
> sendmsg (the error queue, as explained above and in the
> MSG_ZEROCOPY documentation).

The error queue is arguably a work-around for _not_ having a delivery mechanism that works with a sync syscall in the first place. The main question here imho would be "why add a whole new syscall etc when there's already an existing way to accomplish this, with free-to-reuse notifications". If the answer is "because splice", then it would seem saner to plumb up those bits only. Would be much simpler too...
On Wed, Mar 19, 2025 at 10:07:27AM -0600, Jens Axboe wrote: > On 3/19/25 9:32 AM, Joe Damato wrote: > > On Wed, Mar 19, 2025 at 01:04:48AM -0700, Christoph Hellwig wrote: > >> On Wed, Mar 19, 2025 at 12:15:11AM +0000, Joe Damato wrote: > >>> One way to fix this is to add zerocopy notifications to sendfile similar > >>> to how MSG_ZEROCOPY works with sendmsg. This is possible thanks to the > >>> extensive work done by Pavel [1]. > >> > >> What is a "zerocopy notification" > > > > See the docs on MSG_ZEROCOPY [1], but in short when a user app calls > > sendmsg and passes MSG_ZEROCOPY a completion notification is added > > to the error queue. The user app can poll for these to find out when > > the TX has completed and the buffer it passed to the kernel can be > > overwritten. > > > > My series provides the same functionality via splice and sendfile2. > > > > [1]: https://www.kernel.org/doc/html/v6.13/networking/msg_zerocopy.html > > > >> and why aren't you simply plugging this into io_uring and generate > >> a CQE so that it works like all other asynchronous operations? > > > > I linked to the iouring work that Pavel did in the cover letter. > > Please take a look. > > > > That work refactored the internals of how zerocopy completion > > notifications are wired up, allowing other pieces of code to use the > > same infrastructure and extend it, if needed. > > > > My series is using the same internals that iouring (and others) use > > to generate zerocopy completion notifications. Unlike iouring, > > though, I don't need a fully customized implementation with a new > > user API for harvesting completion events; I can use the existing > > mechanism already in the kernel that user apps already use for > > sendmsg (the error queue, as explained above and in the > > MSG_ZEROCOPY documentation). > > The error queue is arguably a work-around for _not_ having a delivery > mechanism that works with a sync syscall in the first place. 
The main > question here imho would be "why add a whole new syscall etc when > there's already an existing way to do accomplish this, with > free-to-reuse notifications". If the answer is "because splice", then it > would seem saner to plumb up those bits only. Would be much simpler > too... I may be misunderstanding your comment, but my response would be: There are existing apps which use sendfile today unsafely and it would be very nice to have a safe sendfile equivalent. Converting existing apps to using iouring (if I understood your suggestion?) would be significantly more work compared to calling sendfile2 and adding code to check the error queue. I would also argue that there are likely user apps out there that use both sendmsg MSG_ZEROCOPY for certain writes (for data in memory) and also use sendfile (for data on disk). One example would be a reverse proxy that might write HTTP headers to clients via sendmsg but transmit the response body with sendfile. For those apps, the code to check the error queue already exists for sendmsg + MSG_ZEROCOPY, so swapping in sendfile2 seems like an easy way to ensure safe sendfile usage. As far as the bit about plumbing only the splice bits, sorry if I'm being dense here, do you mean plumbing the error queue through to splice only and dropping sendfile2? That is an option. Then the apps currently using sendfile could use splice instead and get completion notifications on the error queue. That would probably work and be less work than rewriting to use iouring, but probably a bit more work than using a new syscall. Thanks for taking a look and responding.
On 3/19/25 11:04 AM, Joe Damato wrote: > On Wed, Mar 19, 2025 at 10:07:27AM -0600, Jens Axboe wrote: >> On 3/19/25 9:32 AM, Joe Damato wrote: >>> On Wed, Mar 19, 2025 at 01:04:48AM -0700, Christoph Hellwig wrote: >>>> On Wed, Mar 19, 2025 at 12:15:11AM +0000, Joe Damato wrote: >>>>> One way to fix this is to add zerocopy notifications to sendfile similar >>>>> to how MSG_ZEROCOPY works with sendmsg. This is possible thanks to the >>>>> extensive work done by Pavel [1]. >>>> >>>> What is a "zerocopy notification" >>> >>> See the docs on MSG_ZEROCOPY [1], but in short when a user app calls >>> sendmsg and passes MSG_ZEROCOPY a completion notification is added >>> to the error queue. The user app can poll for these to find out when >>> the TX has completed and the buffer it passed to the kernel can be >>> overwritten. >>> >>> My series provides the same functionality via splice and sendfile2. >>> >>> [1]: https://www.kernel.org/doc/html/v6.13/networking/msg_zerocopy.html >>> >>>> and why aren't you simply plugging this into io_uring and generate >>>> a CQE so that it works like all other asynchronous operations? >>> >>> I linked to the iouring work that Pavel did in the cover letter. >>> Please take a look. >>> >>> That work refactored the internals of how zerocopy completion >>> notifications are wired up, allowing other pieces of code to use the >>> same infrastructure and extend it, if needed. >>> >>> My series is using the same internals that iouring (and others) use >>> to generate zerocopy completion notifications. Unlike iouring, >>> though, I don't need a fully customized implementation with a new >>> user API for harvesting completion events; I can use the existing >>> mechanism already in the kernel that user apps already use for >>> sendmsg (the error queue, as explained above and in the >>> MSG_ZEROCOPY documentation). >> >> The error queue is arguably a work-around for _not_ having a delivery >> mechanism that works with a sync syscall in the first place. 
The main >> question here imho would be "why add a whole new syscall etc when >> there's already an existing way to do accomplish this, with >> free-to-reuse notifications". If the answer is "because splice", then it >> would seem saner to plumb up those bits only. Would be much simpler >> too... > > I may be misunderstanding your comment, but my response would be: > > There are existing apps which use sendfile today unsafely and > it would be very nice to have a safe sendfile equivalent. Converting > existing apps to using iouring (if I understood your suggestion?) > would be significantly more work compared to calling sendfile2 and > adding code to check the error queue. It's really not, if you just want to use it as a sync kind of thing. If you want to have multiple things in flight etc, yeah it could be more work, you'd also get better performance that way. And you could use things like registered buffers for either of them, which again would likely make it more efficient. If you just use it as a sync thing, it'd be pretty trivial to just wrap a my_sendfile_foo() in a submit_and_wait operation, which issues and waits on the completion in a single syscall. And if you want to wait on the notification too, you could even do that in the same syscall and wait on 2 CQEs. That'd be a downright trivial way to provide a sync way of doing the same thing. > I would also argue that there are likely user apps out there that > use both sendmsg MSG_ZEROCOPY for certain writes (for data in > memory) and also use sendfile (for data on disk). One example would > be a reverse proxy that might write HTTP headers to clients via > sendmsg but transmit the response body with sendfile. > > For those apps, the code to check the error queue already exists for > sendmsg + MSG_ZEROCOPY, so swapping in sendfile2 seems like an easy > way to ensure safe sendfile usage. Sure that is certainly possible. 
I didn't say that wasn't the case, rather that the error queue approach is a work-around in the first place for not having some kind of async notification mechanism for when it's free to reuse. > As far as the bit about plumbing only the splice bits, sorry if I'm > being dense here, do you mean plumbing the error queue through to > splice only and dropping sendfile2? > > That is an option. Then the apps currently using sendfile could use > splice instead and get completion notifications on the error queue. > That would probably work and be less work than rewriting to use > iouring, but probably a bit more work than using a new syscall. Yep
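To make the "wait on 2 CQEs" point concrete, here is a small sketch of the completion bookkeeping such a wrapper would need. It assumes the zero-copy send opcode (`IORING_OP_SEND_ZC`), which sends from memory rather than from a file, so it only illustrates the notification pattern and is not a sendfile replacement. Per the io_uring UAPI, the first CQE carries the send result and has `IORING_CQE_F_MORE` set when a second CQE will follow; that second CQE is flagged `IORING_CQE_F_NOTIF` and means the buffer can be reused. The `zc_send_state` type and `zc_send_on_cqe()` helper are hypothetical names, and the submit/wait calls themselves (for example liburing's `io_uring_submit_and_wait(ring, 2)`) are left out so the sketch builds without liburing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <linux/io_uring.h>

/* Fallbacks for older UAPI headers; values mirror linux/io_uring.h. */
#ifndef IORING_CQE_F_MORE
#define IORING_CQE_F_MORE (1U << 1)
#endif
#ifndef IORING_CQE_F_NOTIF
#define IORING_CQE_F_NOTIF (1U << 3)
#endif

/* Hypothetical per-request state for one zero-copy send. */
struct zc_send_state {
    bool have_result; /* the CQE with the send result was seen */
    bool buffer_free; /* the kernel no longer references the buffer */
    int32_t result;   /* bytes sent, or a negative errno */
};

/* Feed the res/flags of one CQE belonging to this request.
 * Returns true once the result is known and the buffer may be reused. */
bool zc_send_on_cqe(struct zc_send_state *st, int32_t res, uint32_t flags)
{
    assert(st);
    if (flags & IORING_CQE_F_NOTIF) {
        st->buffer_free = true;
    } else {
        st->have_result = true;
        st->result = res;
        if (!(flags & IORING_CQE_F_MORE))
            st->buffer_free = true; /* no notification CQE will follow */
    }
    return st->have_result && st->buffer_free;
}
```

A synchronous wrapper would submit the request, wait for completions, and feed each CQE for that request into `zc_send_on_cqe()` until it returns true.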
On Wed, Mar 19, 2025 at 11:20:50AM -0600, Jens Axboe wrote: > On 3/19/25 11:04 AM, Joe Damato wrote: > > On Wed, Mar 19, 2025 at 10:07:27AM -0600, Jens Axboe wrote: > >> On 3/19/25 9:32 AM, Joe Damato wrote: > >>> On Wed, Mar 19, 2025 at 01:04:48AM -0700, Christoph Hellwig wrote: > >>>> On Wed, Mar 19, 2025 at 12:15:11AM +0000, Joe Damato wrote: > >>>>> One way to fix this is to add zerocopy notifications to sendfile similar > >>>>> to how MSG_ZEROCOPY works with sendmsg. This is possible thanks to the > >>>>> extensive work done by Pavel [1]. > >>>> > >>>> What is a "zerocopy notification" > >>> > >>> See the docs on MSG_ZEROCOPY [1], but in short when a user app calls > >>> sendmsg and passes MSG_ZEROCOPY a completion notification is added > >>> to the error queue. The user app can poll for these to find out when > >>> the TX has completed and the buffer it passed to the kernel can be > >>> overwritten. > >>> > >>> My series provides the same functionality via splice and sendfile2. > >>> > >>> [1]: https://www.kernel.org/doc/html/v6.13/networking/msg_zerocopy.html > >>> > >>>> and why aren't you simply plugging this into io_uring and generate > >>>> a CQE so that it works like all other asynchronous operations? > >>> > >>> I linked to the iouring work that Pavel did in the cover letter. > >>> Please take a look. > >>> > >>> That work refactored the internals of how zerocopy completion > >>> notifications are wired up, allowing other pieces of code to use the > >>> same infrastructure and extend it, if needed. > >>> > >>> My series is using the same internals that iouring (and others) use > >>> to generate zerocopy completion notifications. Unlike iouring, > >>> though, I don't need a fully customized implementation with a new > >>> user API for harvesting completion events; I can use the existing > >>> mechanism already in the kernel that user apps already use for > >>> sendmsg (the error queue, as explained above and in the > >>> MSG_ZEROCOPY documentation). 
> >> > >> The error queue is arguably a work-around for _not_ having a delivery > >> mechanism that works with a sync syscall in the first place. The main > >> question here imho would be "why add a whole new syscall etc when > >> there's already an existing way to do accomplish this, with > >> free-to-reuse notifications". If the answer is "because splice", then it > >> would seem saner to plumb up those bits only. Would be much simpler > >> too... > > > > I may be misunderstanding your comment, but my response would be: > > > > There are existing apps which use sendfile today unsafely and > > it would be very nice to have a safe sendfile equivalent. Converting > > existing apps to using iouring (if I understood your suggestion?) > > would be significantly more work compared to calling sendfile2 and > > adding code to check the error queue. > > It's really not, if you just want to use it as a sync kind of thing. If > you want to have multiple things in flight etc, yeah it could be more > work, you'd also get better performance that way. And you could use > things like registered buffers for either of them, which again would > likely make it more efficient. I haven't argued that performance would be better using sendfile2 compared to iouring, just that existing apps which already use sendfile (but do so unsafely) would probably be more likely to use a safe alternative with existing examples of how to harvest completion notifications vs something more complex, like wrapping iouring. > If you just use it as a sync thing, it'd be pretty trivial to just wrap > a my_sendfile_foo() in a submit_and_wait operation, which issues and > waits on the completion in a single syscall. And if you want to wait on > the notification too, you could even do that in the same syscall and > wait on 2 CQEs. That'd be a downright trivial way to provide a sync way > of doing the same thing. I don't disagree; I just don't know if app developers: a.) know that this is possible to do, and b.) 
know how to do it In general: it does seem a bit odd to me that there isn't a safe sendfile syscall in Linux that uses existing completion notification mechanisms. > > I would also argue that there are likely user apps out there that > > use both sendmsg MSG_ZEROCOPY for certain writes (for data in > > memory) and also use sendfile (for data on disk). One example would > > be a reverse proxy that might write HTTP headers to clients via > > sendmsg but transmit the response body with sendfile. > > > > For those apps, the code to check the error queue already exists for > > sendmsg + MSG_ZEROCOPY, so swapping in sendfile2 seems like an easy > > way to ensure safe sendfile usage. > > Sure that is certainly possible. I didn't say that wasn't the case, > rather that the error queue approach is a work-around in the first place > for not having some kind of async notification mechanism for when it's > free to reuse. Of course, I certainly agree that the error queue is a workaround. But it works, apps use it, and it's fairly well known. I don't see any reason, other than historical context, why sendmsg can use this mechanism, splice can, but sendfile shouldn't? > > As far as the bit about plumbing only the splice bits, sorry if I'm > > being dense here, do you mean plumbing the error queue through to > > splice only and dropping sendfile2? > > > > That is an option. Then the apps currently using sendfile could use > > splice instead and get completion notifications on the error queue. > > That would probably work and be less work than rewriting to use > > iouring, but probably a bit more work than using a new syscall. > > Yep I'm not opposed to dropping the sendfile2 part of the series for the official submission. I do think it is a bit odd to add the functionality to splice only, though, when probably many apps are using splice via calls to sendfile and there is no way to safely use sendfile. 
If you feel very strongly that this cannot be merged without dropping sendfile2 and only plumbing this through for splice, then I'll drop the sendfile2 syscall when I submit officially (probably next week?). I do feel pretty strongly that it's more likely apps would use sendfile2 and we'd have safer apps out in the wild. But, I could be wrong. That said: if the new syscall is the blocker, I'll drop it and offer a change to the sendfile man page suggesting users swap it with calls to splice + error queue for safety. I greatly appreciate you taking a look and your feedback. Thanks, Joe
On 3/19/25 11:45 AM, Joe Damato wrote: > On Wed, Mar 19, 2025 at 11:20:50AM -0600, Jens Axboe wrote: >> On 3/19/25 11:04 AM, Joe Damato wrote: >>> On Wed, Mar 19, 2025 at 10:07:27AM -0600, Jens Axboe wrote: >>>> On 3/19/25 9:32 AM, Joe Damato wrote: >>>>> On Wed, Mar 19, 2025 at 01:04:48AM -0700, Christoph Hellwig wrote: >>>>>> On Wed, Mar 19, 2025 at 12:15:11AM +0000, Joe Damato wrote: >>>>>>> One way to fix this is to add zerocopy notifications to sendfile similar >>>>>>> to how MSG_ZEROCOPY works with sendmsg. This is possible thanks to the >>>>>>> extensive work done by Pavel [1]. >>>>>> >>>>>> What is a "zerocopy notification" >>>>> >>>>> See the docs on MSG_ZEROCOPY [1], but in short when a user app calls >>>>> sendmsg and passes MSG_ZEROCOPY a completion notification is added >>>>> to the error queue. The user app can poll for these to find out when >>>>> the TX has completed and the buffer it passed to the kernel can be >>>>> overwritten. >>>>> >>>>> My series provides the same functionality via splice and sendfile2. >>>>> >>>>> [1]: https://www.kernel.org/doc/html/v6.13/networking/msg_zerocopy.html >>>>> >>>>>> and why aren't you simply plugging this into io_uring and generate >>>>>> a CQE so that it works like all other asynchronous operations? >>>>> >>>>> I linked to the iouring work that Pavel did in the cover letter. >>>>> Please take a look. >>>>> >>>>> That work refactored the internals of how zerocopy completion >>>>> notifications are wired up, allowing other pieces of code to use the >>>>> same infrastructure and extend it, if needed. >>>>> >>>>> My series is using the same internals that iouring (and others) use >>>>> to generate zerocopy completion notifications. 
Unlike iouring, >>>>> though, I don't need a fully customized implementation with a new >>>>> user API for harvesting completion events; I can use the existing >>>>> mechanism already in the kernel that user apps already use for >>>>> sendmsg (the error queue, as explained above and in the >>>>> MSG_ZEROCOPY documentation). >>>> >>>> The error queue is arguably a work-around for _not_ having a delivery >>>> mechanism that works with a sync syscall in the first place. The main >>>> question here imho would be "why add a whole new syscall etc when >>>> there's already an existing way to do accomplish this, with >>>> free-to-reuse notifications". If the answer is "because splice", then it >>>> would seem saner to plumb up those bits only. Would be much simpler >>>> too... >>> >>> I may be misunderstanding your comment, but my response would be: >>> >>> There are existing apps which use sendfile today unsafely and >>> it would be very nice to have a safe sendfile equivalent. Converting >>> existing apps to using iouring (if I understood your suggestion?) >>> would be significantly more work compared to calling sendfile2 and >>> adding code to check the error queue. >> >> It's really not, if you just want to use it as a sync kind of thing. If >> you want to have multiple things in flight etc, yeah it could be more >> work, you'd also get better performance that way. And you could use >> things like registered buffers for either of them, which again would >> likely make it more efficient. > > I haven't argued that performance would be better using sendfile2 > compared to iouring, just that existing apps which already use > sendfile (but do so unsafely) would probably be more likely to use a > safe alternative with existing examples of how to harvest completion > notifications vs something more complex, like wrapping iouring. Sure and I get that, just not sure it'd be worth doing on the kernel side for such (fairly) weak reasoning. 
The performance benefit is just a side note in that if you did do it this way, you'd potentially be able to run it more efficiently too. And regardless what people do or use now, they are generally always interested in that aspect. >> If you just use it as a sync thing, it'd be pretty trivial to just wrap >> a my_sendfile_foo() in a submit_and_wait operation, which issues and >> waits on the completion in a single syscall. And if you want to wait on >> the notification too, you could even do that in the same syscall and >> wait on 2 CQEs. That'd be a downright trivial way to provide a sync way >> of doing the same thing. > > I don't disagree; I just don't know if app developers: > a.) know that this is possible to do, and > b.) know how to do it Writing that wrapper would be not even a screenful of code. Yes maybe they don't know how to do it now, but it's _really_ trivial to do. It'd take me roughly 1 min to do that, would be happy to help out with that side so it could go into a commit or man page or whatever. > In general: it does seem a bit odd to me that there isn't a safe > sendfile syscall in Linux that uses existing completion notification > mechanisms. Pretty natural, I think. sendfile(2) predates that by quite a bit, and the last real change to sendfile was using splice underneath. Which I did, and that was probably almost 20 years ago at this point... I do think it makes sense to have a sendfile that's both fast and efficient, and can be used sanely with buffer reuse without relying on odd heuristics. >>> I would also argue that there are likely user apps out there that >>> use both sendmsg MSG_ZEROCOPY for certain writes (for data in >>> memory) and also use sendfile (for data on disk). One example would >>> be a reverse proxy that might write HTTP headers to clients via >>> sendmsg but transmit the response body with sendfile. 
>>> >>> For those apps, the code to check the error queue already exists for >>> sendmsg + MSG_ZEROCOPY, so swapping in sendfile2 seems like an easy >>> way to ensure safe sendfile usage. >> >> Sure that is certainly possible. I didn't say that wasn't the case, >> rather that the error queue approach is a work-around in the first place >> for not having some kind of async notification mechanism for when it's >> free to reuse. > > Of course, I certainly agree that the error queue is a work around. > But it works, app use it, and its fairly well known. I don't see any > reason, other than historical context, why sendmsg can use this > mechanism, splice can, but sendfile shouldn't? My argument would be the same as for other features - if you can do it simpler this other way, why not consider that? The end result would be the same, you can do fast sendfile() with sane buffer reuse. But the kernel side would be simpler, which is always a kernel main goal for those of us that have to maintain it. Just adding sendfile2() works in the sense that it's an easier drop in replacement for an app, though the error queue side does mean it needs to change anyway - it's not just replacing one syscall with another. And if we want to be lazy, sure that's fine. I just don't think it's the best way to do it when we literally have a mechanism that's designed for this and works with reuse already with normal send zc (and receive side too, in the next kernel).
On 19.03.25 at 19:37, Jens Axboe wrote: > On 3/19/25 11:45 AM, Joe Damato wrote: >> On Wed, Mar 19, 2025 at 11:20:50AM -0600, Jens Axboe wrote: >>> On 3/19/25 11:04 AM, Joe Damato wrote: >>>> On Wed, Mar 19, 2025 at 10:07:27AM -0600, Jens Axboe wrote: >>>>> On 3/19/25 9:32 AM, Joe Damato wrote: >>>>>> On Wed, Mar 19, 2025 at 01:04:48AM -0700, Christoph Hellwig wrote: >>>>>>> On Wed, Mar 19, 2025 at 12:15:11AM +0000, Joe Damato wrote: >>>>>>>> One way to fix this is to add zerocopy notifications to sendfile similar >>>>>>>> to how MSG_ZEROCOPY works with sendmsg. This is possible thanks to the >>>>>>>> extensive work done by Pavel [1]. >>>>>>> >>>>>>> What is a "zerocopy notification" >>>>>> >>>>>> See the docs on MSG_ZEROCOPY [1], but in short when a user app calls >>>>>> sendmsg and passes MSG_ZEROCOPY a completion notification is added >>>>>> to the error queue. The user app can poll for these to find out when >>>>>> the TX has completed and the buffer it passed to the kernel can be >>>>>> overwritten. >>>>>> >>>>>> My series provides the same functionality via splice and sendfile2. >>>>>> >>>>>> [1]: https://www.kernel.org/doc/html/v6.13/networking/msg_zerocopy.html >>>>>> >>>>>>> and why aren't you simply plugging this into io_uring and generate >>>>>>> a CQE so that it works like all other asynchronous operations? >>>>>> >>>>>> I linked to the iouring work that Pavel did in the cover letter. >>>>>> Please take a look. >>>>>> >>>>>> That work refactored the internals of how zerocopy completion >>>>>> notifications are wired up, allowing other pieces of code to use the >>>>>> same infrastructure and extend it, if needed. >>>>>> >>>>>> My series is using the same internals that iouring (and others) use >>>>>> to generate zerocopy completion notifications. 
Unlike iouring, >>>>>> though, I don't need a fully customized implementation with a new >>>>>> user API for harvesting completion events; I can use the existing >>>>>> mechanism already in the kernel that user apps already use for >>>>>> sendmsg (the error queue, as explained above and in the >>>>>> MSG_ZEROCOPY documentation). >>>>> >>>>> The error queue is arguably a work-around for _not_ having a delivery >>>>> mechanism that works with a sync syscall in the first place. The main >>>>> question here imho would be "why add a whole new syscall etc when >>>>> there's already an existing way to do accomplish this, with >>>>> free-to-reuse notifications". If the answer is "because splice", then it >>>>> would seem saner to plumb up those bits only. Would be much simpler >>>>> too... >>>> >>>> I may be misunderstanding your comment, but my response would be: >>>> >>>> There are existing apps which use sendfile today unsafely and >>>> it would be very nice to have a safe sendfile equivalent. Converting >>>> existing apps to using iouring (if I understood your suggestion?) >>>> would be significantly more work compared to calling sendfile2 and >>>> adding code to check the error queue. >>> >>> It's really not, if you just want to use it as a sync kind of thing. If >>> you want to have multiple things in flight etc, yeah it could be more >>> work, you'd also get better performance that way. And you could use >>> things like registered buffers for either of them, which again would >>> likely make it more efficient. >> >> I haven't argued that performance would be better using sendfile2 >> compared to iouring, just that existing apps which already use >> sendfile (but do so unsafely) would probably be more likely to use a >> safe alternative with existing examples of how to harvest completion >> notifications vs something more complex, like wrapping iouring. > > Sure and I get that, just not sure it'd be worth doing on the kernel > side for such (fairly) weak reasoning. 
The performance benefit is just a > side note in that if you did do it this way, you'd potentially be able > to run it more efficiently too. And regardless what people do or use > now, they are generally always interested in that aspect. > >>> If you just use it as a sync thing, it'd be pretty trivial to just wrap >>> a my_sendfile_foo() in a submit_and_wait operation, which issues and >>> waits on the completion in a single syscall. And if you want to wait on >>> the notification too, you could even do that in the same syscall and >>> wait on 2 CQEs. That'd be a downright trivial way to provide a sync way >>> of doing the same thing. >> >> I don't disagree; I just don't know if app developers: >> a.) know that this is possible to do, and >> b.) know how to do it > > Writing that wrapper would be not even a screenful of code. Yes maybe > they don't know how to do it now, but it's _really_ trivial to do. It'd > take me roughly 1 min to do that, would be happy to help out with that > side so it could go into a commit or man page or whatever. > >> In general: it does seem a bit odd to me that there isn't a safe >> sendfile syscall in Linux that uses existing completion notification >> mechanisms. > > Pretty natural, I think. sendfile(2) predates that by quite a bit, and > the last real change to sendfile was using splice underneath. Which I > did, and that was probably almost 20 years ago at this point... > > I do think it makes sense to have a sendfile that's both fast and > efficient, and can be used sanely with buffer reuse without relying on > odd heuristics. > >>>> I would also argue that there are likely user apps out there that >>>> use both sendmsg MSG_ZEROCOPY for certain writes (for data in >>>> memory) and also use sendfile (for data on disk). One example would >>>> be a reverse proxy that might write HTTP headers to clients via >>>> sendmsg but transmit the response body with sendfile. 
>>>> >>>> For those apps, the code to check the error queue already exists for >>>> sendmsg + MSG_ZEROCOPY, so swapping in sendfile2 seems like an easy >>>> way to ensure safe sendfile usage. >>> >>> Sure that is certainly possible. I didn't say that wasn't the case, >>> rather that the error queue approach is a work-around in the first place >>> for not having some kind of async notification mechanism for when it's >>> free to reuse. >> >> Of course, I certainly agree that the error queue is a work around. >> But it works, app use it, and its fairly well known. I don't see any >> reason, other than historical context, why sendmsg can use this >> mechanism, splice can, but sendfile shouldn't? > > My argument would be the same as for other features - if you can do it > simpler this other way, why not consider that? The end result would be > the same, you can do fast sendfile() with sane buffer reuse. But the > kernel side would be simpler, which is always a kernel main goal for > those of us that have to maintain it. > > Just adding sendfile2() works in the sense that it's an easier drop in > replacement for an app, though the error queue side does mean it needs > to change anyway - it's not just replacing one syscall with another. And > if we want to be lazy, sure that's fine. I just don't think it's the > best way to do it when we literally have a mechanism that's designed for > this and works with reuse already with normal send zc (and receive side > too, in the next kernel). A few months (or even years) back, Pavel came up with an idea to implement some kind of splice into a fixed buffer. If that were implemented, I guess it would help me in Samba too. My first usage was on the receive side (from the network), but the other side might also be possible now that we have RWF_DONTCACHE. Instead of dropping the pages from the page cache, it might be possible to move them to a fixed buffer. 
It would mean the pages would be 'stable' when they are no longer part of the pagecache. But maybe my assumption for that is too naive... Anyway that splice into a fixed buffer would great to have, as the new IORING_OP_RECV_ZC, requires control over the hardware queues of the nic and only allows a single process to provide buffers for that receive queue (at least that's how I understand it). And that's not possible for multiple process (maybe not belonging to the same high level application and likely non-root applications). So it would be great have splice into fixed buffer as alternative to IORING_OP_SPLICE/IORING_OP_TEE, as it would be more flexible to use in combination with IORING_OP_SENDMSG_ZC as well as IORING_OP_WRITE[V]_FIXED with RWF_DONTCACHE. I guess such a splice into fixed buffer linked to IORING_OP_SENDMSG_ZC would be the way to simulate the sendfile2() in userspace? Thanks! metze
On Wed, Mar 19, 2025 at 12:37:29PM -0600, Jens Axboe wrote:
> On 3/19/25 11:45 AM, Joe Damato wrote:
> > On Wed, Mar 19, 2025 at 11:20:50AM -0600, Jens Axboe wrote:
> >> On 3/19/25 11:04 AM, Joe Damato wrote:
> >>> On Wed, Mar 19, 2025 at 10:07:27AM -0600, Jens Axboe wrote:
> >>>> On 3/19/25 9:32 AM, Joe Damato wrote:
> >>>>> On Wed, Mar 19, 2025 at 01:04:48AM -0700, Christoph Hellwig wrote:
> >>>>>> On Wed, Mar 19, 2025 at 12:15:11AM +0000, Joe Damato wrote:
> >>>>>>> One way to fix this is to add zerocopy notifications to sendfile similar
> >>>>>>> to how MSG_ZEROCOPY works with sendmsg. This is possible thanks to the
> >>>>>>> extensive work done by Pavel [1].
> >>>>>>
> >>>>>> What is a "zerocopy notification"
> >>>>>
> >>>>> See the docs on MSG_ZEROCOPY [1], but in short when a user app calls
> >>>>> sendmsg and passes MSG_ZEROCOPY a completion notification is added
> >>>>> to the error queue. The user app can poll for these to find out when
> >>>>> the TX has completed and the buffer it passed to the kernel can be
> >>>>> overwritten.
> >>>>>
> >>>>> My series provides the same functionality via splice and sendfile2.
> >>>>>
> >>>>> [1]: https://www.kernel.org/doc/html/v6.13/networking/msg_zerocopy.html
> >>>>>
> >>>>>> and why aren't you simply plugging this into io_uring and generate
> >>>>>> a CQE so that it works like all other asynchronous operations?
> >>>>>
> >>>>> I linked to the iouring work that Pavel did in the cover letter.
> >>>>> Please take a look.
> >>>>>
> >>>>> That work refactored the internals of how zerocopy completion
> >>>>> notifications are wired up, allowing other pieces of code to use the
> >>>>> same infrastructure and extend it, if needed.
> >>>>>
> >>>>> My series is using the same internals that iouring (and others) use
> >>>>> to generate zerocopy completion notifications. Unlike iouring,
> >>>>> though, I don't need a fully customized implementation with a new
> >>>>> user API for harvesting completion events; I can use the existing
> >>>>> mechanism already in the kernel that user apps already use for
> >>>>> sendmsg (the error queue, as explained above and in the
> >>>>> MSG_ZEROCOPY documentation).
> >>>>
> >>>> The error queue is arguably a work-around for _not_ having a delivery
> >>>> mechanism that works with a sync syscall in the first place. The main
> >>>> question here imho would be "why add a whole new syscall etc when
> >>>> there's already an existing way to do accomplish this, with
> >>>> free-to-reuse notifications". If the answer is "because splice", then it
> >>>> would seem saner to plumb up those bits only. Would be much simpler
> >>>> too...
> >>>
> >>> I may be misunderstanding your comment, but my response would be:
> >>>
> >>> There are existing apps which use sendfile today unsafely and
> >>> it would be very nice to have a safe sendfile equivalent. Converting
> >>> existing apps to using iouring (if I understood your suggestion?)
> >>> would be significantly more work compared to calling sendfile2 and
> >>> adding code to check the error queue.
> >>
> >> It's really not, if you just want to use it as a sync kind of thing. If
> >> you want to have multiple things in flight etc, yeah it could be more
> >> work, you'd also get better performance that way. And you could use
> >> things like registered buffers for either of them, which again would
> >> likely make it more efficient.
> >
> > I haven't argued that performance would be better using sendfile2
> > compared to iouring, just that existing apps which already use
> > sendfile (but do so unsafely) would probably be more likely to use a
> > safe alternative with existing examples of how to harvest completion
> > notifications vs something more complex, like wrapping iouring.
>
> Sure and I get that, just not sure it'd be worth doing on the kernel
> side for such (fairly) weak reasoning. The performance benefit is just a
> side note in that if you did do it this way, you'd potentially be able
> to run it more efficiently too. And regardless what people do or use
> now, they are generally always interested in that aspect.

Fair enough.

> >> If you just use it as a sync thing, it'd be pretty trivial to just wrap
> >> a my_sendfile_foo() in a submit_and_wait operation, which issues and
> >> waits on the completion in a single syscall. And if you want to wait on
> >> the notification too, you could even do that in the same syscall and
> >> wait on 2 CQEs. That'd be a downright trivial way to provide a sync way
> >> of doing the same thing.
> >
> > I don't disagree; I just don't know if app developers:
> > a.) know that this is possible to do, and
> > b.) know how to do it
>
> Writing that wrapper would be not even a screenful of code. Yes maybe
> they don't know how to do it now, but it's _really_ trivial to do. It'd
> take me roughly 1 min to do that, would be happy to help out with that
> side so it could go into a commit or man page or whatever.

I'd never be opposed to more documentation ;)

> > In general: it does seem a bit odd to me that there isn't a safe
> > sendfile syscall in Linux that uses existing completion notification
> > mechanisms.
>
> Pretty natural, I think. sendfile(2) predates that by quite a bit, and
> the last real change to sendfile was using splice underneath. Which I
> did, and that was probably almost 20 years ago at this point...
>
> I do think it makes sense to have a sendfile that's both fast and
> efficient, and can be used sanely with buffer reuse without relying on
> odd heuristics.

Just trying to tie this together in my head -- are you saying that
you think the kernel internals of sendfile could be changed in a
different way, or that this is a userland problem (and they should use
the io_uring wrapper you suggested above)?

> >>> I would also argue that there are likely user apps out there that
> >>> use both sendmsg MSG_ZEROCOPY for certain writes (for data in
> >>> memory) and also use sendfile (for data on disk). One example would
> >>> be a reverse proxy that might write HTTP headers to clients via
> >>> sendmsg but transmit the response body with sendfile.
> >>>
> >>> For those apps, the code to check the error queue already exists for
> >>> sendmsg + MSG_ZEROCOPY, so swapping in sendfile2 seems like an easy
> >>> way to ensure safe sendfile usage.
> >>
> >> Sure that is certainly possible. I didn't say that wasn't the case,
> >> rather that the error queue approach is a work-around in the first place
> >> for not having some kind of async notification mechanism for when it's
> >> free to reuse.
> >
> > Of course, I certainly agree that the error queue is a work around.
> > But it works, app use it, and its fairly well known. I don't see any
> > reason, other than historical context, why sendmsg can use this
> > mechanism, splice can, but sendfile shouldn't?
>
> My argument would be the same as for other features - if you can do it
> simpler this other way, why not consider that? The end result would be
> the same, you can do fast sendfile() with sane buffer reuse. But the
> kernel side would be simpler, which is always a kernel main goal for
> those of us that have to maintain it.
>
> Just adding sendfile2() works in the sense that it's an easier drop in
> replacement for an app, though the error queue side does mean it needs
> to change anyway - it's not just replacing one syscall with another. And
> if we want to be lazy, sure that's fine. I just don't think it's the
> best way to do it when we literally have a mechanism that's designed for
> this and works with reuse already with normal send zc (and receive side
> too, in the next kernel).

It seems like you've answered the question I asked above and that
you are suggesting there might be a better and simpler sendfile2
kernel-side implementation that doesn't rely on splice internals at
all.

Am I following you? If so, I'll drop the sendfile2 stuff from this
series and stick with the splice changes only, if you are (at a high
level) OK with the idea of adding a flag for this to splice.

In the meantime, I'll take a few more reads through the iouring code
to see if I can work out how sendfile2 might be built on top of that
instead of splice in the kernel.

Thank you very much for your time, feedback, and attention,
Joe
On Wed, Mar 19, 2025 at 10:07:27AM -0600, Jens Axboe wrote:
> On 3/19/25 9:32 AM, Joe Damato wrote:
> > On Wed, Mar 19, 2025 at 01:04:48AM -0700, Christoph Hellwig wrote:
> >> On Wed, Mar 19, 2025 at 12:15:11AM +0000, Joe Damato wrote:
> >>> One way to fix this is to add zerocopy notifications to sendfile similar
> >>> to how MSG_ZEROCOPY works with sendmsg. This is possible thanks to the
> >>> extensive work done by Pavel [1].
> >>
> >> What is a "zerocopy notification"
> >
> > See the docs on MSG_ZEROCOPY [1], but in short when a user app calls
> > sendmsg and passes MSG_ZEROCOPY a completion notification is added
> > to the error queue. The user app can poll for these to find out when
> > the TX has completed and the buffer it passed to the kernel can be
> > overwritten.
> >
> > My series provides the same functionality via splice and sendfile2.
> >
> > [1]: https://www.kernel.org/doc/html/v6.13/networking/msg_zerocopy.html
> >
> >> and why aren't you simply plugging this into io_uring and generate
> >> a CQE so that it works like all other asynchronous operations?
> >
> > I linked to the iouring work that Pavel did in the cover letter.
> > Please take a look.
> >
> > That work refactored the internals of how zerocopy completion
> > notifications are wired up, allowing other pieces of code to use the
> > same infrastructure and extend it, if needed.
> >
> > My series is using the same internals that iouring (and others) use
> > to generate zerocopy completion notifications. Unlike iouring,
> > though, I don't need a fully customized implementation with a new
> > user API for harvesting completion events; I can use the existing
> > mechanism already in the kernel that user apps already use for
> > sendmsg (the error queue, as explained above and in the
> > MSG_ZEROCOPY documentation).
>
> The error queue is arguably a work-around for _not_ having a delivery
> mechanism that works with a sync syscall in the first place. The main
> question here imho would be "why add a whole new syscall etc when
> there's already an existing way to do accomplish this, with
> free-to-reuse notifications". If the answer is "because splice", then it
> would seem saner to plumb up those bits only. Would be much simpler
> too...

OK, I reworked the patches to drop all the sendfile2 stuff, so no new
system call is added; the only addition is a flag for splice,
SPLICE_F_ZC.

It feels weird to add this to the splice path but not to the path that
sendfile takes through splice. I understand and agree with you: if we
are adding a new system call, like sendfile2, it should probably be
done as you've described in your other messages.

What about an alternative? Would you be open to the idea that sendfile
could be extended to generate error queue completions if the network
socket has SO_ZEROCOPY set?

If so, that would solve the original problem without introducing a new
system call and still leave the door open for a more efficient
sendfile2 based on iouring internals later.

What do you think?