pipes && EPOLLET, again

Message ID	20250304154410.GB5756@redhat.com (mailing list archive)
State	New
Headers	show Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 980D2201256 for <linux-fsdevel@vger.kernel.org>; Tue, 4 Mar 2025 15:44:53 +0000 (UTC) Date: Tue, 4 Mar 2025 16:44:11 +0100 From: Oleg Nesterov <oleg@redhat.com> To: Linus Torvalds <torvalds@linux-foundation.org>, Mateusz Guzik <mjguzik@gmail.com> Cc: brauner@kernel.org, mingo@redhat.com, peterz@infradead.org, rostedt@goodmis.org, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org Subject: pipes && EPOLLET, again Message-ID: <20250304154410.GB5756@redhat.com> References: <20250303230409.452687-1-mjguzik@gmail.com> <20250303230409.452687-2-mjguzik@gmail.com> <20250304140726.GD26141@redhat.com> Precedence: bulk MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20250304140726.GD26141@redhat.com> User-Agent: Mutt/1.5.24 (2015-08-30)
Series	pipes && EPOLLET, again \| expand pipes && EPOLLET, again

Message ID

20250304154410.GB5756@redhat.com (mailing list archive)

State

New

Headers

Date: Tue, 4 Mar 2025 16:44:11 +0100
From: Oleg Nesterov <oleg@redhat.com>
To: Linus Torvalds <torvalds@linux-foundation.org>,
	Mateusz Guzik <mjguzik@gmail.com>
Cc: brauner@kernel.org, mingo@redhat.com, peterz@infradead.org,
	rostedt@goodmis.org, linux-kernel@vger.kernel.org,
	linux-fsdevel@vger.kernel.org
Subject: pipes && EPOLLET, again
Message-ID: <20250304154410.GB5756@redhat.com>
References: <20250303230409.452687-1-mjguzik@gmail.com>
 <20250303230409.452687-2-mjguzik@gmail.com>
 <20250304140726.GD26141@redhat.com>
Precedence: bulk
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20250304140726.GD26141@redhat.com>
User-Agent: Mutt/1.5.24 (2015-08-30)

Series

pipes && EPOLLET, again | expand

Commit Message

Oleg Nesterov March 4, 2025, 3:44 p.m. UTC

Linus,

On 03/04, Oleg Nesterov wrote:
>
> and we need to cleanup the poll_usage
> logic first.

We have already discussed this before, I'll probably do this later,
but lets forget it for now.

Don't we need the trivial one-liner below anyway?

I am not saying this is a bug, but please consider

	#include <stdio.h>
	#include <stdbool.h>
	#include <stdlib.h>
	#include <unistd.h>
	#include <sys/epoll.h>
	#include <assert.h>

	static char buf[16 * 4096];

	int main(void)
	{
		int pfd[2], efd;
		struct epoll_event evt = { .events = EPOLLIN | EPOLLET };

		pipe(pfd);
		efd = epoll_create1(0);
		epoll_ctl(efd, EPOLL_CTL_ADD, pfd[0], &evt);

		write(pfd[1], buf, 4096);
		assert(epoll_wait(efd, &evt, 1, 0) == 1);

		if (!fork()) {
			write(pfd[1], buf, sizeof(buf));
			assert(0);
		}

		sleep(1);
		assert(epoll_wait(efd, &evt, 1, 0) == 1);

		return 0;
	}

the 2nd epoll_wait() fails, despite the fact that the child has already
written 15 * PAGE_SIZE bytes. This doesn't look consistent to me...

Oleg.
---

Comments

Linus Torvalds March 4, 2025, 6:12 p.m. UTC | #1

On Tue, 4 Mar 2025 at 05:45, Oleg Nesterov <oleg@redhat.com> wrote:
>
> Linus,
>
> On 03/04, Oleg Nesterov wrote:
> >
> > and we need to cleanup the poll_usage
> > logic first.
>
> We have already discussed this before, I'll probably do this later,
> but lets forget it for now.
>
> Don't we need the trivial one-liner below anyway?

See this email of mine:

  https://lore.kernel.org/all/CAHk-=wiCRwRFi0kGwd_Uv+Xv4HOB-ivHyUp9it6CNSmrKT4gOA@mail.gmail.com/

and the last paragraph in particular.

The whole "poll_usage" thing is a kernel *hack* to deal with broken
user space that expected garbage semantics that aren't real, and were
never real.

They were a random implementation detail, and they were *wrong*.

But because we have a strict "we don't break user space", we
introduced that completely bogus hack to say "ok, we'll send these
completely wrong extraneous events despite the fact that nothing has
changed, because some broken user space program was written to expect
them".

> I am not saying this is a bug, but please consider

That program is buggy, and we're not adding new hacks for new bugs.

If you ask for an edge-triggered EPOLL event, you get an *edge*
triggered EPOLL event. And there is no edge - the pipe was already
readable, no edge anywhere in sight.

So to clarify: we added the hack to work around *existing* user space
bugs that happened to work before.

If anything, we might consider removing the crazy "poll_usage" hack in
the (probably futile) hope that user space has been fixed.

But we're not encouraging *new* bugs.

               Linus

Oleg Nesterov March 4, 2025, 7:32 p.m. UTC | #2

On 03/04, Linus Torvalds wrote:
>
> On Tue, 4 Mar 2025 at 05:45, Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > Don't we need the trivial one-liner below anyway?
>
> See this email of mine:
>
>   https://lore.kernel.org/all/CAHk-=wiCRwRFi0kGwd_Uv+Xv4HOB-ivHyUp9it6CNSmrKT4gOA@mail.gmail.com/
>
> and the last paragraph in particular.
>
> The whole "poll_usage" thing is a kernel *hack* to deal with broken
> user space that expected garbage semantics that aren't real, and were
> never real.

Yes agreed. But we can make this hack more understandable. But as I said,
this is off-topic right now.

> introduced that completely bogus hack to say "ok, we'll send these
> completely wrong extraneous events despite the fact that nothing has
> changed, because some broken user space program was written to expect
> them".

Yes, but since we already have this hack:

> That program is buggy, and we're not adding new hacks for new bugs.

Yes, but see below...

> If you ask for an edge-triggered EPOLL event, you get an *edge*
> triggered EPOLL event. And there is no edge - the pipe was already
> readable, no edge anywhere in sight.

Yes, the pipe was already readable before before fork, but this condition
was already "consumed" by the 1st epoll_wait(EPOLLET). Please see below.

> If anything, we might consider removing the crazy "poll_usage" hack in
> the (probably futile) hope that user space has been fixed.

This would be the best option ;) Until then:

I agree that my test case is "buggy", but afaics it is not buggier than
userspace programs which rely on the unconditional kill_fasync()'s in
pipe_read/pipe_write?

So. anon_pipe_write() does

	if (was_empty)
		wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM);
	kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);

before wait_event(pipe->wr_wait), but before return it does

	if (was_empty || pipe->poll_usage)
		wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM);
	kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);

and this looks confusing to me.

If pipe_write() doesn't take poll_usage into account before wait_event(wr_wait),
then it doesn't need kill_fasync() too?

So I won't argue, but why not make both cases more consistent?

Oleg.

Linus Torvalds March 4, 2025, 7:49 p.m. UTC | #3

On Tue, 4 Mar 2025 at 09:32, Oleg Nesterov <oleg@redhat.com> wrote:
>
> I agree that my test case is "buggy", but afaics it is not buggier than
> userspace programs which rely on the unconditional kill_fasync()'s in
> pipe_read/pipe_write?

I'm not convinced any such users actually exist.

The reason kill_fasync() is unconditional is that it's cheap. The
normal situation is "nobody there", and we test that without any
locking.

So we've never bothered to make any changes to that path, and there's
never been any real reason to have any "was_empty" like conditionals.

               Linus

diff --git a/fs/pipe.c b/fs/pipe.c
index b0641f75b1ba..8a32257cc74f 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -554,7 +554,7 @@  anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
 		 * become empty while we dropped the lock.
 		 */
 		mutex_unlock(&pipe->mutex);
-		if (was_empty)
+		if (was_empty || pipe->poll_usage)
 			wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM);
 		kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
 		wait_event_interruptible_exclusive(pipe->wr_wait, pipe_writable(pipe));

pipes && EPOLLET, again

Commit Message

Comments

Patch