Message ID | 20250209150749.GA16999@redhat.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | pipe: change pipe_write() to never add a zero-sized buffer | expand |
On Sun, 9 Feb 2025 at 07:08, Oleg Nesterov <oleg@redhat.com> wrote: > > This is no longer necessary after c73be61cede5 ("pipe: Add general notification > queue support"), pipe_write() checks pipe_has_watch_queue() and returns -EXDEV > at the start. And can't help in any case, pipe_write() no longer takes this > rd_wait.lock spinlock. Ack. This code all goes back to the two-level locking thing with notifications using just the spinlock side. The locking was removed from this code in commit dfaabf916b1c ("fs/pipe: remove unnecessary spinlock from pipe_write()"), but the "pre-allocate" logic remained. This patch seems to be the right thing to do and removes the vestiges of the old model. But I don't think you need that pipe_buf_assert_len() thing. And if you do, please don't make it a pointless inline helper that only hides what it does. Linus
On 02/09, Linus Torvalds wrote: > > This patch seems to be the right thing to do and removes the vestiges > of the old model. OK, thanks. > But I don't think you need that pipe_buf_assert_len() thing. Well, I'd prefer to keep this WARN_ON_ONCE() for some time... If nobody hits this warning we can kill eat_empty_buffer() and more hopefully dead checks, for example /* zero-length bvecs are not supported, skip them */ if (!this_len) continue; in iter_file_splice_write(). > And if > you do, please don't make it a pointless inline helper that only hides > what it does. Could you explain what do you think should I do if I keep this check? make pipe_buf_assert_len() return void? or just replace it with WARN_ON_ONCE(!buf->len) in its callers? Oleg.
On Sun, 9 Feb 2025 at 10:02, Oleg Nesterov <oleg@redhat.com> wrote: > > Could you explain what do you think should I do if I keep this check? > make pipe_buf_assert_len() return void? or just replace it with > WARN_ON_ONCE(!buf->len) in its callers? Just replace it with WARN_ON_ONCE() in any place where you really think it's needed. And honestly, if you worry so much about it, just allow the zero-sized case. I don't see why you want to special-case it in the first place. Yes, the zero sized buffer *used* to be a special case, but it was a special case for writes. And yes, splice wants to actually wait for "do we have real data" and has that eat_empty_buffer() case, but just *keep* it. Keeping that check is *better* and clearer than adding some pointless warning. IOW, why warn for a case that isn't a problem, and you're only making it a problem by thinking it is? Linus
On 02/09, Linus Torvalds wrote: > > On Sun, 9 Feb 2025 at 10:02, Oleg Nesterov <oleg@redhat.com> wrote: > > > > Could you explain what do you think should I do if I keep this check? > > make pipe_buf_assert_len() return void? or just replace it with > > WARN_ON_ONCE(!buf->len) in its callers? > > Just replace it with WARN_ON_ONCE() in any place where you really > think it's needed. OK, will do. > IOW, why warn for a case that isn't a problem, and you're only making > it a problem by thinking it is? Again, lets look eat_empty_buffer(). The comment says "maybe it's empty" but how/why can this happen ? The changelog for d1a819a2ec2d3 ("splice: teach splice pipe reading about empty pipe buffers") says "you can trigger it by doing a write to a pipe that fails" but if someone looks at anon_pipe_write() after 1/2 this case is not possible. And if eat_empty_buffer() flushes the buffer and updates pipe->tail, why doesn't it wake the writers? WARN_ON_ONCE() makes it clear that we do not expect !buf->len == 0, and the kernel will complain if it does happen. So unless you have a strong opinion, I'd prefer to keep it for now. Oleg.
On Sun, 9 Feb 2025 at 10:45, Oleg Nesterov <oleg@redhat.com> wrote: > > Again, lets look eat_empty_buffer(). > > The comment says "maybe it's empty" but how/why can this happen ? WHY DO YOU CARE? Even if it's just defensive programming, it's just plain good code, when the rule is "the caller is waiting for data". And if it means that you don't need to add random WARN_ON_ONCE() statements. So here's the deal: either you (a) convince yourself that the length really can never be true, and you write a comment about why that is the case. or (b) you DON'T convince yourself that that is true, and you leave eat_empty_buffer() alone. and both of those situations are fine. But adding a random sprinkling of WARN_ON_ONCE() statements is worse than *either* of those two cases. See what my argument is? Adding a WARN_ON is just making the code worse. Don't do it. It's the worst of both worlds: it adds code to generate, and it adds confusion to readers ("why are we warning?" or "when can this happen"). In contrast, the "eat_empty_buffer()" case just saying "if it's an empty buffer, it doesn't satisfy my needs, so I'll just release the empty buffer and go on". That's simple and straightforward. Easy to maintain, and more efficient than your WARN_ONs. And if you can convince yourself it's not needed, you remove it. EXACTLY LIKE THE WARN_ON. So there's simply never any upside to the WARN_ON. Linus
On 02/09, Linus Torvalds wrote: > > On Sun, 9 Feb 2025 at 10:45, Oleg Nesterov <oleg@redhat.com> wrote: > > > > Again, lets look eat_empty_buffer(). > > > > The comment says "maybe it's empty" but how/why can this happen ? > > WHY DO YOU CARE? Because it looks unclear/confusing, and I think it can confuse other readers of this code. Especially after 1/2. > So here's the deal: either you ... > (b) you DON'T convince yourself that that is true, and you leave > eat_empty_buffer() alone. Yes, I failed to convince myself that fs/splice.c can never add an empty bufer. Although it seems to me it should not. > In contrast, the "eat_empty_buffer()" case just saying "if it's an > empty buffer, it doesn't satisfy my needs, so I'll just release the > empty buffer and go on". ... without wakeup_pipe_writers(). OK, nevermind, I see your point even if I do not 100% agree. I'll send v2 without WARN_ON() and without 2/2. Oleg.
Hello Oleg, On 2/10/2025 12:45 AM, Oleg Nesterov wrote: > On 02/09, Linus Torvalds wrote: >> >> On Sun, 9 Feb 2025 at 10:45, Oleg Nesterov <oleg@redhat.com> wrote: >>> >>> Again, lets look eat_empty_buffer(). >>> >>> The comment says "maybe it's empty" but how/why can this happen ? >> >> WHY DO YOU CARE? > > Because it looks unclear/confusing, and I think it can confuse other > readers of this code. Especially after 1/2. > >> So here's the deal: either you > ... >> (b) you DON'T convince yourself that that is true, and you leave >> eat_empty_buffer() alone. > > Yes, I failed to convince myself that fs/splice.c can never add an > empty bufer. Although it seems to me it should not. > >> In contrast, the "eat_empty_buffer()" case just saying "if it's an >> empty buffer, it doesn't satisfy my needs, so I'll just release the >> empty buffer and go on". > > ... without wakeup_pipe_writers(). > > OK, nevermind, I see your point even if I do not 100% agree. > > I'll send v2 without WARN_ON() and without 2/2. Went ahead and tested that version on top of mainline with your previous change to skip updating {a,c,m}time, here are the results: ================================================================== Test : sched-messaging Units : Normalized time in seconds Interpretation: Lower is better Statistic : AMean ================================================================== Case: mainline + no-acm_time[pct imp](CV) patched[pct imp](CV) 1-groups 1.00 [ -0.00]( 7.19) 0.95 [ 4.90](12.39) 2-groups 1.00 [ -0.00]( 3.54) 1.02 [ -1.92]( 6.55) 4-groups 1.00 [ -0.00]( 2.78) 1.01 [ -0.85]( 2.18) 8-groups 1.00 [ -0.00]( 1.04) 0.99 [ 0.63]( 0.77) 16-groups 1.00 [ -0.00]( 1.02) 1.00 [ -0.26]( 0.98) I don't see any regression / improvements from a performance standpoint on my 3rd Generation EPYC system (2 x 64C/128T. boost on, C2 disabled) Feel free to include: Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> > > Oleg. >
Hi Prateek, On 02/10, K Prateek Nayak wrote: > > 1-groups 1.00 [ -0.00]( 7.19) 0.95 [ 4.90](12.39) > 2-groups 1.00 [ -0.00]( 3.54) 1.02 [ -1.92]( 6.55) > 4-groups 1.00 [ -0.00]( 2.78) 1.01 [ -0.85]( 2.18) > 8-groups 1.00 [ -0.00]( 1.04) 0.99 [ 0.63]( 0.77) > 16-groups 1.00 [ -0.00]( 1.02) 1.00 [ -0.26]( 0.98) > > I don't see any regression / improvements from a performance standpoint Yes, this patch shouldn't make any difference performance-wise, at least in this case. Although I was thinking the same thing when I sent "pipe_read: don't wake up the writer if the pipe is still full" ;) > Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Thanks! Please see v2, I've included you tag. Any chance you can also test the patch below? To me it looks like a cleanup which makes the "merge small writes" logic more understandable. And note that "page-align the rest of the writes" doesn't work anyway if "total_len & (PAGE_SIZE-1)" can't fit in the last buffer. However, in this particular case with DATASIZE=100 this patch can increase the number of copy_page_from_iter()'s in pipe_write(). And with this change receiver() can certainly get the short reads, so this can increase the number of sys_read() calls. So I am just curious if this change can cause any noticeable regression on your machine. Thank you! Oleg. --- a/fs/pipe.c +++ b/fs/pipe.c @@ -459,16 +459,16 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from) was_empty = pipe_empty(head, pipe->tail); chars = total_len & (PAGE_SIZE-1); if (chars && !was_empty) { - unsigned int mask = pipe->ring_size - 1; - struct pipe_buffer *buf = &pipe->bufs[(head - 1) & mask]; + struct pipe_buffer *buf = pipe_buf(pipe, head - 1); int offset = buf->offset + buf->len; + int avail = PAGE_SIZE - offset; - if ((buf->flags & PIPE_BUF_FLAG_CAN_MERGE) && - offset + chars <= PAGE_SIZE) { + if (avail && (buf->flags & PIPE_BUF_FLAG_CAN_MERGE)) { ret = pipe_buf_confirm(pipe, buf); if (ret) goto out; + chars = min_t(ssize_t, chars, avail); ret = copy_page_from_iter(buf->page, offset, chars, from); if (unlikely(ret < chars)) { ret = -EFAULT;
On Mon, 10 Feb 2025 at 09:22, Oleg Nesterov <oleg@redhat.com> wrote: > > + int avail = PAGE_SIZE - offset; > > - if ((buf->flags & PIPE_BUF_FLAG_CAN_MERGE) && > - offset + chars <= PAGE_SIZE) { > + if (avail && (buf->flags & PIPE_BUF_FLAG_CAN_MERGE)) { > ret = pipe_buf_confirm(pipe, buf); > if (ret) > goto out; > > + chars = min_t(ssize_t, chars, avail); If I read this correctly, this patch is horribly broken. You can't do partial writes. Pipes have one very core atomicity guarantee: from the man-pages: PIPE_BUF POSIX.1 says that writes of less than PIPE_BUF bytes must be atomic: the output data is written to the pipe as a contiguous sequence. Writes of more than PIPE_BUF bytes may be nonatomic: the kernel may interleave the data with data written by other processes. POSIX.1 requires PIPE_BUF to be at least 512 bytes. (On Linux, PIPE_BUF is 4096 bytes.) IOW, that whole "try to write as many chars as there is room" is very very broken. You have to write all or nothing. So you can't (say) first write just 50 bytes of a 100-byte pipe write because it fits in the last buffer, and then wait for another buffer to become free to write the rest. Not just because another writer might come in and start mixing in data, but because *readers* may well expect to get 100 bytes or nothing. And to make matters worse, you'll never notice the bug until something breaks very subtly (unless we happen to have a good test-case for this somewhere - there might be a test for this in LTP). And yes, this is actually something I know for a fact that people depend on. Lots of traditional UNIX "send packet commands over pipes" programs around, which expect the packets to be atomic. So things *will* break, but it might take a while before you hit just the right race condition for things to go south, and the errors migth end up being very non-obvious indeed. Note that the initial chars = total_len & (PAGE_SIZE-1); before the whole test for "can we merge" is fine, because if total_len is larger than a page, it's no longer a write we need to worry about atomicity with. Maybe we should add a comment somewhere about this. Linus
On 02/10, Linus Torvalds wrote: > > On Mon, 10 Feb 2025 at 09:22, Oleg Nesterov <oleg@redhat.com> wrote: > > > > + int avail = PAGE_SIZE - offset; > > > > - if ((buf->flags & PIPE_BUF_FLAG_CAN_MERGE) && > > - offset + chars <= PAGE_SIZE) { > > + if (avail && (buf->flags & PIPE_BUF_FLAG_CAN_MERGE)) { > > ret = pipe_buf_confirm(pipe, buf); > > if (ret) > > goto out; > > > > + chars = min_t(ssize_t, chars, avail); > > If I read this correctly, this patch is horribly broken. > > You can't do partial writes. Pipes have one very core atomicity > guarantee: from the man-pages: > > PIPE_BUF > POSIX.1 says that writes of less than PIPE_BUF bytes must be > atomic: Ah, I didn't know! Thanks for your explanation and the quick NACK! > Maybe we should add a comment somewhere about this. Agreed. It would certainly help the ignorant readers like me ;) So I guess that the "goto again" logic in sender/receiver in tools/perf/bench/sched-messaging.c is currently pointless, DATASIZE == 100 < PIPE_BUF. Thanks. Oleg.
Hello Oleg, On 2/10/2025 10:52 PM, Oleg Nesterov wrote: > Hi Prateek, > > On 02/10, K Prateek Nayak wrote: >> >> 1-groups 1.00 [ -0.00]( 7.19) 0.95 [ 4.90](12.39) >> 2-groups 1.00 [ -0.00]( 3.54) 1.02 [ -1.92]( 6.55) >> 4-groups 1.00 [ -0.00]( 2.78) 1.01 [ -0.85]( 2.18) >> 8-groups 1.00 [ -0.00]( 1.04) 0.99 [ 0.63]( 0.77) >> 16-groups 1.00 [ -0.00]( 1.02) 1.00 [ -0.26]( 0.98) >> >> I don't see any regression / improvements from a performance standpoint > > Yes, this patch shouldn't make any difference performance-wise, at least > in this case. Although I was thinking the same thing when I sent "pipe_read: > don't wake up the writer if the pipe is still full" ;) > >> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> > > Thanks! Please see v2, I've included you tag. Thank you. I can confirm it is same as the variant I tested. > > Any chance you can also test the patch below? > > To me it looks like a cleanup which makes the "merge small writes" logic > more understandable. And note that "page-align the rest of the writes" > doesn't work anyway if "total_len & (PAGE_SIZE-1)" can't fit in the last > buffer. > > However, in this particular case with DATASIZE=100 this patch can increase > the number of copy_page_from_iter()'s in pipe_write(). And with this change > receiver() can certainly get the short reads, so this can increase the > number of sys_read() calls. > > So I am just curious if this change can cause any noticeable regression on > your machine. For the sake of science: ================================================================== Test : sched-messaging Units : Normalized time in seconds Interpretation: Lower is better Statistic : AMean ================================================================== Case: baseline[pct imp](CV) merge_writes[pct imp](CV) 1-groups 1.00 [ -0.00](12.39) 1.08 [ -7.62](11.73) 2-groups 1.00 [ -0.00]( 6.55) 0.97 [ 2.52]( 3.01) 4-groups 1.00 [ -0.00]( 2.18) 1.00 [ 0.42]( 1.97) 8-groups 1.00 [ -0.00]( 0.77) 1.03 [ -3.35]( 5.07) 16-groups 1.00 [ -0.00]( 0.98) 1.01 [ -1.37]( 2.20) I see some improvements up until 4 groups (160 tasks) but beyond that it goes into a slight regression territory but the variance is large to draw any conclusions. Science experiment concluded. > > Thank you! > > Oleg. > > --- a/fs/pipe.c > +++ b/fs/pipe.c > @@ -459,16 +459,16 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from) > was_empty = pipe_empty(head, pipe->tail); > chars = total_len & (PAGE_SIZE-1); > if (chars && !was_empty) { > - unsigned int mask = pipe->ring_size - 1; > - struct pipe_buffer *buf = &pipe->bufs[(head - 1) & mask]; > + struct pipe_buffer *buf = pipe_buf(pipe, head - 1); > int offset = buf->offset + buf->len; > + int avail = PAGE_SIZE - offset; > > - if ((buf->flags & PIPE_BUF_FLAG_CAN_MERGE) && > - offset + chars <= PAGE_SIZE) { > + if (avail && (buf->flags & PIPE_BUF_FLAG_CAN_MERGE)) { > ret = pipe_buf_confirm(pipe, buf); > if (ret) > goto out; > > + chars = min_t(ssize_t, chars, avail); > ret = copy_page_from_iter(buf->page, offset, chars, from); > if (unlikely(ret < chars)) { > ret = -EFAULT; >
diff --git a/fs/pipe.c b/fs/pipe.c index 2ae75adfba64..7f24d707f6a1 100644 --- a/fs/pipe.c +++ b/fs/pipe.c @@ -303,7 +303,7 @@ anon_pipe_read(struct kiocb *iocb, struct iov_iter *to) if (!pipe_empty(head, tail)) { struct pipe_buffer *buf = &pipe->bufs[tail & mask]; - size_t chars = buf->len; + size_t chars = pipe_buf_assert_len(buf); size_t written; int error; @@ -360,29 +360,9 @@ anon_pipe_read(struct kiocb *iocb, struct iov_iter *to) break; } mutex_unlock(&pipe->mutex); - /* * We only get here if we didn't actually read anything. * - * However, we could have seen (and removed) a zero-sized - * pipe buffer, and might have made space in the buffers - * that way. - * - * You can't make zero-sized pipe buffers by doing an empty - * write (not even in packet mode), but they can happen if - * the writer gets an EFAULT when trying to fill a buffer - * that already got allocated and inserted in the buffer - * array. - * - * So we still need to wake up any pending writers in the - * _very_ unlikely case that the pipe was full, but we got - * no data. - */ - if (unlikely(wake_writer)) - wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM); - kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT); - - /* * But because we didn't read anything, at this point we can * just return directly with -ERESTARTSYS if we're interrupted, * since we've done any required wakeups and there's no need @@ -391,7 +371,6 @@ anon_pipe_read(struct kiocb *iocb, struct iov_iter *to) if (wait_event_interruptible_exclusive(pipe->rd_wait, pipe_readable(pipe)) < 0) return -ERESTARTSYS; - wake_writer = false; wake_next_reader = true; mutex_lock(&pipe->mutex); } @@ -526,33 +505,27 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from) pipe->tmp_page = page; } - /* Allocate a slot in the ring in advance and attach an - * empty buffer. If we fault or otherwise fail to use - * it, either the reader will consume it or it'll still - * be there for the next write. - */ - pipe->head = head + 1; + copied = copy_page_from_iter(page, 0, PAGE_SIZE, from); + if (unlikely(copied < PAGE_SIZE && iov_iter_count(from))) { + if (!ret) + ret = -EFAULT; + break; + } + pipe->head = head + 1; + pipe->tmp_page = NULL; /* Insert it into the buffer array */ buf = &pipe->bufs[head & mask]; buf->page = page; buf->ops = &anon_pipe_buf_ops; buf->offset = 0; - buf->len = 0; if (is_packetized(filp)) buf->flags = PIPE_BUF_FLAG_PACKET; else buf->flags = PIPE_BUF_FLAG_CAN_MERGE; - pipe->tmp_page = NULL; - copied = copy_page_from_iter(page, 0, PAGE_SIZE, from); - if (unlikely(copied < PAGE_SIZE && iov_iter_count(from))) { - if (!ret) - ret = -EFAULT; - break; - } - ret += copied; buf->len = copied; + ret += copied; if (!iov_iter_count(from)) break; diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h index 8ff23bf5a819..4174429a3e0e 100644 --- a/include/linux/pipe_fs_i.h +++ b/include/linux/pipe_fs_i.h @@ -31,6 +31,12 @@ struct pipe_buffer { unsigned long private; }; +static inline unsigned int pipe_buf_assert_len(struct pipe_buffer *buf) +{ + WARN_ON_ONCE(!buf->len); + return buf->len; +} + /** * struct pipe_inode_info - a linux kernel pipe * @mutex: mutex protecting the whole thing
a194dfe6e6f6 ("pipe: Rearrange sequence in pipe_write() to preallocate slot") changed pipe_write() to increment pipe->head in advance. IIUC to avoid the race with the post_one_notification()-like code which can add another buffer under pipe->rd_wait.lock without pipe->mutex. This is no longer necessary after c73be61cede5 ("pipe: Add general notification queue support"), pipe_write() checks pipe_has_watch_queue() and returns -EXDEV at the start. And can't help in any case, pipe_write() no longer takes this rd_wait.lock spinlock. Change pipe_write() to call copy_page_from_iter() first and do nothing if it fails. This way pipe_write() can't add a zero-sized buffer and we can simplify pipe_read() which currently has to take care of this very unlikely case. Signed-off-by: Oleg Nesterov <oleg@redhat.com> --- fs/pipe.c | 47 +++++++++------------------------------ include/linux/pipe_fs_i.h | 6 +++++ 2 files changed, 16 insertions(+), 37 deletions(-)