diff mbox series

[RFC,04/14] pipe: Add O_NOTIFICATION_PIPE [ver #2]

Message ID 157313375678.29677.15875689548927466028.stgit@warthog.procyon.org.uk (mailing list archive)
State New, archived
Headers show
Series pipe: Keyrings, Block and USB notifications [ver #2] | expand

Commit Message

David Howells Nov. 7, 2019, 1:35 p.m. UTC
Add an O_NOTIFICATION_PIPE flag that can be passed to pipe2() to indicate
that the pipe being created is going to be used for notifications.  This
suppresses the use of splice(), vmsplice(), tee() and sendfile() on the
pipe as calling iov_iter_revert() on a pipe when a kernel notification
message has been inserted into the middle of a multi-buffer splice will be
messy.

The flag is given the same value as O_EXCL as it seems unlikely that
this flag will ever be applicable to pipes and I don't want to use up
another O_* bit unnecessarily.  An alternative could be to add a pipe3()
system call.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 include/uapi/linux/watch_queue.h |    3 +++
 1 file changed, 3 insertions(+)

Comments

Andy Lutomirski Nov. 7, 2019, 6:16 p.m. UTC | #1
On Thu, Nov 7, 2019 at 5:39 AM David Howells <dhowells@redhat.com> wrote:
>
> Add an O_NOTIFICATION_PIPE flag that can be passed to pipe2() to indicate
> that the pipe being created is going to be used for notifications.  This
> suppresses the use of splice(), vmsplice(), tee() and sendfile() on the
> pipe as calling iov_iter_revert() on a pipe when a kernel notification
> message has been inserted into the middle of a multi-buffer splice will be
> messy.

How messy?  And is there some way to make it impossible for this to
happen?  Adding a new flag to pipe2() to avoid messy kernel code seems
like a poor tradeoff.
David Howells Nov. 7, 2019, 6:48 p.m. UTC | #2
Andy Lutomirski <luto@kernel.org> wrote:

> > Add an O_NOTIFICATION_PIPE flag that can be passed to pipe2() to indicate
> > that the pipe being created is going to be used for notifications.  This
> > suppresses the use of splice(), vmsplice(), tee() and sendfile() on the
> > pipe as calling iov_iter_revert() on a pipe when a kernel notification
> > message has been inserted into the middle of a multi-buffer splice will be
> > messy.
>
> How messy?

Well, iov_iter_revert() on a pipe iterator simply walks backwards along the
ring discarding the last N contiguous slots (where N is normally the number of
slots that were filled by whatever operation is being reverted).

However, unless the code that transfers stuff into the pipe takes the spinlock
spinlock and disables softirqs for the duration of its ring filling, what were
N contiguous slots may now have kernel notifications interspersed - even if it
has been holding the pipe mutex.

So, now what do you do?  You have to free up just the buffers relevant to the
iterator and then you can either compact down the ring to free up the space or
you can leave null slots and let the read side clean them up, thereby
reducing the capacity of the pipe temporarily.

Either way, iov_iter_revert() gets more complex and has to hold the spinlock.

And if you don't take the spinlock whilst you're reverting, more notifications
can come in to make your life more interesting.

There's also a problem with splicing out from a notification pipe that the
messages are scribed onto preallocated buffers, but now the buffers need
refcounts and, in any case, are of limited quantity.

> And is there some way to make it impossible for this to happen?

Yes.  That's what I'm doing by declaring the pipe to be unspliceable up front.

> Adding a new flag to pipe2() to avoid messy kernel code seems
> like a poor tradeoff.

By far the easiest place to check whether a pipe can be spliced to is in
get_pipe_info().  That's checking the file anyway.  After that, you can't make
the check until the pipe is locked.

Furthermore, if it's not done upfront, the change to the pipe might happen
during a splicing operation that's residing in pipe_wait()... which drops the
pipe mutex.

David
Andy Lutomirski Nov. 8, 2019, 5:06 a.m. UTC | #3
On Thu, Nov 7, 2019 at 10:48 AM David Howells <dhowells@redhat.com> wrote:
>
> Andy Lutomirski <luto@kernel.org> wrote:
>
> > > Add an O_NOTIFICATION_PIPE flag that can be passed to pipe2() to indicate
> > > that the pipe being created is going to be used for notifications.  This
> > > suppresses the use of splice(), vmsplice(), tee() and sendfile() on the
> > > pipe as calling iov_iter_revert() on a pipe when a kernel notification
> > > message has been inserted into the middle of a multi-buffer splice will be
> > > messy.
> >
> > How messy?
>
> Well, iov_iter_revert() on a pipe iterator simply walks backwards along the
> ring discarding the last N contiguous slots (where N is normally the number of
> slots that were filled by whatever operation is being reverted).
>
> However, unless the code that transfers stuff into the pipe takes the spinlock
> spinlock and disables softirqs for the duration of its ring filling, what were
> N contiguous slots may now have kernel notifications interspersed - even if it
> has been holding the pipe mutex.
>
> So, now what do you do?  You have to free up just the buffers relevant to the
> iterator and then you can either compact down the ring to free up the space or
> you can leave null slots and let the read side clean them up, thereby
> reducing the capacity of the pipe temporarily.
>
> Either way, iov_iter_revert() gets more complex and has to hold the spinlock.

I feel like I'm missing something fundamental here.

I can open a normal pipe from userspace (with pipe() or pipe2()), and
I can have two threads.  One thread writes to the pipe with write().
The other thread writes with splice().  Everything works fine.  What's
special about notifications?
David Howells Nov. 8, 2019, 6:42 a.m. UTC | #4
Andy Lutomirski <luto@kernel.org> wrote:

> I can open a normal pipe from userspace (with pipe() or pipe2()), and
> I can have two threads.  One thread writes to the pipe with write().
> The other thread writes with splice().  Everything works fine.

Yes.  Every operation you do on a pipe from userspace is serialised with the
pipe mutex - and both ends share the same pipe.

> What's special about notifications?

The post_notification() cannot take the pipe mutex.  It has to be callable
from softirq context.  Linus's idea is that when you're actually altering the
ring pointers you should hold the wake-queue spinlock, and post_notification()
holds the wake queue spinlock for the duration of the operation.

This means that post_notification() can be writing to the pipe whilst a
userspace-invoked operation is holding the pipe mutex and is also doing
something to the ring.

David
diff mbox series

Patch

diff --git a/include/uapi/linux/watch_queue.h b/include/uapi/linux/watch_queue.h
index 5f3d21e8a34b..9df72227f515 100644
--- a/include/uapi/linux/watch_queue.h
+++ b/include/uapi/linux/watch_queue.h
@@ -3,6 +3,9 @@ 
 #define _UAPI_LINUX_WATCH_QUEUE_H
 
 #include <linux/types.h>
+#include <linux/fcntl.h>
+
+#define O_NOTIFICATION_PIPE	O_EXCL	/* Parameter to pipe2() selecting notification pipe */
 
 enum watch_notification_type {
 	WATCH_TYPE_META		= 0,	/* Special record */