[RFC,00/11] pipe: Notification queue preparation [ver #3]

Message ID 157262963995.13142.5568934007158044624.stgit@warthog.procyon.org.uk

Message

David Howells Nov. 1, 2019, 5:34 p.m. UTC
Here's a set of preparatory patches for building a general notification
queue on top of pipes.  It makes a number of significant changes:

 (1) It removes the nr_exclusive argument from __wake_up_sync_key() as this
     is always 1.  This prepares for step 2.

 (2) Adds wake_up_interruptible_sync_poll_locked() so that poll can be
     woken up from a function that's holding the poll waitqueue spinlock.

 (3) Change the pipe buffer ring to be managed in terms of unbounded head
     and tail indices rather than bounded index and length.  This means
     that reading the pipe only needs to modify one index, not two.

 (4) A selection of helper functions are provided to query the state of the
     pipe buffer, plus a couple to apply updates to the pipe indices.

 (5) The pipe ring is allowed to have kernel-reserved slots.  This allows
     many notification messages to be spliced in by the kernel without
     allowing userspace to pin too many pages if it writes to the same
     pipe.

 (6) Advance the head and tail indices inside the pipe waitqueue lock and
     use step 2 to poke poll without having to take the lock twice.

 (7) Rearrange pipe_write() to preallocate the buffer it is going to write
     into and then drop the spinlock.  This allows kernel notifications to
     then be added to the ring whilst it is filling the buffer it allocated.
     The read side is stalled because the pipe mutex is still held.

 (8) Don't wake up readers on a pipe if there was already data in it when
     we added more.

 (9) Don't wake up writers on a pipe if the ring wasn't full before we
     removed a buffer.
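
As an illustration of steps 3, 4, 8 and 9, here is a minimal userspace sketch of a ring managed by free-running head and tail indices. It is not the kernel code from this series: the names (struct ring, ring_push(), ring_pop(), RING_SLOTS) and the fixed ring size are assumptions made for the example. It shows that occupancy falls out of an unsigned subtraction, that a read only advances the tail, and how the "was it empty/full beforehand" checks decide whether a wakeup would be needed.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed ring size for this sketch; it must be a power of two so that
 * (index & (RING_SLOTS - 1)) selects a slot. */
#define RING_SLOTS 8u

static_assert((RING_SLOTS & (RING_SLOTS - 1)) == 0,
	      "ring size must be a power of two");

struct ring {
	unsigned int head;	/* next slot to fill; only ever incremented */
	unsigned int tail;	/* next slot to empty; only ever incremented */
	int slot[RING_SLOTS];
};

/* Number of used slots; unsigned subtraction stays correct across wraparound. */
static unsigned int ring_occupancy(const struct ring *r)
{
	return r->head - r->tail;
}

static bool ring_empty(const struct ring *r)
{
	return r->head == r->tail;
}

static bool ring_full(const struct ring *r)
{
	return ring_occupancy(r) >= RING_SLOTS;
}

/* Add a value.  Returns 0 on success or -1 if the ring is full.
 * *wake_reader is set only if the ring was empty beforehand (cf. step 8). */
static int ring_push(struct ring *r, int value, bool *wake_reader)
{
	if (ring_full(r))
		return -1;
	*wake_reader = ring_empty(r);
	r->slot[r->head & (RING_SLOTS - 1)] = value;
	r->head++;		/* the only index a writer modifies */
	return 0;
}

/* Remove a value.  Returns 0 on success or -1 if the ring is empty.
 * *wake_writer is set only if the ring was full beforehand (cf. step 9). */
static int ring_pop(struct ring *r, int *value, bool *wake_writer)
{
	if (ring_empty(r))
		return -1;
	*wake_writer = ring_full(r);
	*value = r->slot[r->tail & (RING_SLOTS - 1)];
	r->tail++;		/* the only index a reader modifies */
	return 0;
}
```

Starting both indices just below UINT_MAX is a convenient way to exercise the wraparound: the ring behaves identically whether or not head has wrapped past zero, because only the difference between the indices is ever interpreted.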

The patches can be found here also:

	http://git.kernel.org/cgit/linux/kernel/git/dhowells/linux-fs.git/log/?h=notifications-pipe-prep

PATCHES	BENCHMARK	BEST		TOTAL BYTES	AVG BYTES	STDDEV
=======	===============	===============	===============	===============	===============
-	pipe		      307457969	    36348556755	      302904639	       10622403
-	splice		      287117614	    26933658717	      224447155	      160777958
-	vmsplice	      435180375	    51302964090	      427524700	       19083037

rm-nrx	pipe		      311091179	    37093181356	      309109844	        7221622
rm-nrx	splice		      285628049	    27916298942	      232635824	      158296431
rm-nrx	vmsplice	      417703153	    47570362546	      396419687	       33960822

wakesl	pipe		      310698731	    36772541631	      306437846	        8249347
wakesl	splice		      286193726	    28600435451	      238336962	      141169318
wakesl	vmsplice	      436175803	    50723895824	      422699131	       40724240

ht	pipe		      305534565	    36426079543	      303550662	        5673885
ht	splice		      243632025	    23319439010	      194328658	      150479853
ht	vmsplice	      432825176	    49101781001	      409181508	       44102509

k-rsv	pipe		      308691523	    36652267561	      305435563	       12972559
k-rsv	splice		      244793528	    23625172865	      196876440	      125319143
k-rsv	vmsplice	      436119082	    49460808579	      412173404	       55547525

r-adv-t	pipe		      310094218	    36860182219	      307168185	        8081101
r-adv-t	splice		      285527382	    27085052687	      225708772	      206918887
r-adv-t	vmsplice	      336885948	    40128756927	      334406307	        5895935

r-cond	pipe		      308727804	    36635828180	      305298568	        9976806
r-cond	splice		      284467568	    28445793054	      237048275	      200284329
r-cond	vmsplice	      449679489	    51134833848	      426123615	       66790875

w-preal	pipe		      307416578	    36662086426	      305517386	        6216663
w-preal	splice		      282655051	    28455249109	      237127075	      194154549
w-preal	vmsplice	      437002601	    47832160621	      398601338	       96513019

w-redun	pipe		      307279630	    36329750422	      302747920	        8913567
w-redun	splice		      284324488	    27327152734	      227726272	      219735663
w-redun	vmsplice	      451141971	    51485257719	      429043814	       51388217

w-ckful	pipe		      305055247	    36374947350	      303124561	        5400728
w-ckful	splice		      281575308	    26841554544	      223679621	      215942886
w-ckful	vmsplice	      436653588	    47564907110	      396374225	       82255342

The patches column indicates the point in the patchset at which the benchmarks
were taken:

	-	No patches
	rm-nrx	"Remove the nr_exclusive argument from __wake_up_sync_key()"
	wakesl	"Add wake_up_interruptible_sync_poll_locked()"
	ht	"pipe: Use head and tail pointers for the ring, not cursor and length"
	k-rsv	"pipe: Allow pipes to have kernel-reserved slots"
	r-adv-t	"pipe: Advance tail pointer inside of wait spinlock in pipe_read()"
	r-cond	"pipe: Conditionalise wakeup in pipe_read()"
	w-preal	"pipe: Rearrange sequence in pipe_write() to preallocate slot"
	w-redun	"pipe: Remove redundant wakeup from pipe_write()"
	w-ckful	"pipe: Check for ring full inside of the spinlock in pipe_write()"

Changes:

 ver #3:

 (*) Get rid of pipe_commit_{read,write}.

 (*) Port the virtio_console driver.

 (*) Fix pipe_zero().

 (*) Amend some comments.

 (*) Added an additional patch that changes the threshold at which readers
     wake writers for Konstantin Khlebnikov.

 ver #2:

 (*) Split the notification patches out into a separate branch.

 (*) Removed the nr_exclusive parameter from __wake_up_sync_key().

 (*) Renamed the locked wakeup function.

 (*) Add helpers for empty, full, occupancy.

 (*) Split the addition of ->max_usage out into its own patch.

 (*) Fixed some bits pointed out by Rasmus Villemoes.

 ver #1:

 (*) Build on top of standard pipes instead of having a driver.

David
---
David Howells (11):
      pipe: Reduce #inclusion of pipe_fs_i.h
      Remove the nr_exclusive argument from __wake_up_sync_key()
      Add wake_up_interruptible_sync_poll_locked()
      pipe: Use head and tail pointers for the ring, not cursor and length
      pipe: Allow pipes to have kernel-reserved slots
      pipe: Advance tail pointer inside of wait spinlock in pipe_read()
      pipe: Conditionalise wakeup in pipe_read()
      pipe: Rearrange sequence in pipe_write() to preallocate slot
      pipe: Remove redundant wakeup from pipe_write()
      pipe: Check for ring full inside of the spinlock in pipe_write()
      pipe: Increase the writer-wakeup threshold to reduce context-switch count


 drivers/char/virtio_console.c |   16 +-
 fs/exec.c                     |    1 
 fs/fuse/dev.c                 |   31 +++--
 fs/ocfs2/aops.c               |    1 
 fs/pipe.c                     |  228 +++++++++++++++++++++--------------
 fs/splice.c                   |  190 ++++++++++++++++++-----------
 include/linux/pipe_fs_i.h     |   64 +++++++++-
 include/linux/uio.h           |    4 -
 include/linux/wait.h          |   11 +-
 kernel/exit.c                 |    2 
 kernel/sched/wait.c           |   37 ++++--
 lib/iov_iter.c                |  269 +++++++++++++++++++++++------------------
 security/smack/smack_lsm.c    |    1 
 13 files changed, 527 insertions(+), 328 deletions(-)

Comments

Linus Torvalds Nov. 1, 2019, 7:24 p.m. UTC | #1
On Fri, Nov 1, 2019 at 10:34 AM David Howells <dhowells@redhat.com> wrote:
>  (1) It removes the nr_exclusive argument from __wake_up_sync_key() as this
>      is always 1.  This prepares for step 2.
>
>  (2) Adds wake_up_interruptible_sync_poll_locked() so that poll can be
>      woken up from a function that's holding the poll waitqueue spinlock.

Side note: we have a couple of cases where I don't think we should use
the "sync" version at all.

Both pipe_read() and pipe_write() have that

        if (do_wakeup) {
                wake_up_interruptible_sync_poll(&pipe->wait, ...

code at the end, outside the loop. But those two wake-ups aren't
actually synchronous.

A sync wake is supposedly something where the waker is just about to
go to sleep, telling the scheduler that "don't bother trying to pick
another cpu, this process is going to sleep and you can stay here".

I'm not sure how much this matters, but it does strike me that it's
wrong. We're not going to sleep at all in that case - this is not the
"I filled the whole buffer, so I'm going to sleep" case (or the "I've
read all the data, I'm waiting for more" case).

It's entirely possible that we always make pipe wakeups sync
just because it's a common pattern (and a common benchmark), but this
series made me look at it again. Particularly since David has
benchmarks that don't seem to show a lot of fluctuation with his
changes - I wonder how much the sync logic buys us (or hurts us)?

               Linus
David Howells Nov. 1, 2019, 10:05 p.m. UTC | #2
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> Side note: we have a couple of cases where I don't think we should use
> the "sync" version at all.
> 
> Both pipe_read() and pipe_write() have that
> 
>         if (do_wakeup) {
>                 wake_up_interruptible_sync_poll(&pipe->wait, ...
> 
> code at the end, outside the loop. But those two wake-ups aren't
> actually synchronous.

Changing those to non-sync:

BENCHMARK       BEST            TOTAL BYTES     AVG BYTES       STDDEV
=============== =============== =============== =============== ===============
pipe                  305816126     36255936983       302132808         8880788
splice                282402106     27102249370       225852078       210033443
vmsplice              440022611     48896995196       407474959        59906438

Changing the others in pipe_read() and pipe_write() too:

pipe                  305609682     36285967942       302383066         7415744
splice                282475690     27891475073       232428958       201687522
vmsplice              451458280     51949421503       432911845        34925242

The cumulative patch is attached below.  I'm not sure how much of a difference
this should make with my benchmark programs since each thread can run on its
own CPU.

David
---
diff --git a/fs/pipe.c b/fs/pipe.c
index 9cd5cbef9552..c5e3765465f0 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -332,7 +332,7 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to)
 				do_wakeup = 1;
 				wake = head - (tail - 1) == pipe->max_usage / 2;
 				if (wake)
-					wake_up_interruptible_sync_poll_locked(
+					wake_up_locked_poll(
 						&pipe->wait, EPOLLOUT | EPOLLWRNORM);
 				spin_unlock_irq(&pipe->wait.lock);
 				if (wake)
@@ -371,7 +371,7 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to)
 
 	/* Signal writers asynchronously that there is more room. */
 	if (do_wakeup) {
-		wake_up_interruptible_sync_poll(&pipe->wait, EPOLLOUT | EPOLLWRNORM);
+		wake_up_interruptible_poll(&pipe->wait, EPOLLOUT | EPOLLWRNORM);
 		kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT);
 	}
 	if (ret > 0)
@@ -477,7 +477,7 @@ pipe_write(struct kiocb *iocb, struct iov_iter *from)
 			 * syscall merging.
 			 * FIXME! Is this really true?
 			 */
-			wake_up_interruptible_sync_poll_locked(
+			wake_up_locked_poll(
 				&pipe->wait, EPOLLIN | EPOLLRDNORM);
 
 			spin_unlock_irq(&pipe->wait.lock);
@@ -531,7 +531,7 @@ pipe_write(struct kiocb *iocb, struct iov_iter *from)
 out:
 	__pipe_unlock(pipe);
 	if (do_wakeup) {
-		wake_up_interruptible_sync_poll(&pipe->wait, EPOLLIN | EPOLLRDNORM);
+		wake_up_interruptible_poll(&pipe->wait, EPOLLIN | EPOLLRDNORM);
 		kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
 	}
 	if (ret > 0 && sb_start_write_trylock(file_inode(filp)->i_sb)) {
Linus Torvalds Nov. 1, 2019, 10:12 p.m. UTC | #3
On Fri, Nov 1, 2019 at 3:05 PM David Howells <dhowells@redhat.com> wrote:
>
> Changing those to non-sync:

Your benchmark seems very insensitive to just about any changes.

I suspect it is because you only test throughput. Latency is what the
pipe wakeup has been optimized for, and which tends to be much more
sensitive to other changes too (eg locking).

That said, I'm not convinced a latency test would show much either.

               Linus
David Howells Nov. 5, 2019, 4:02 p.m. UTC | #4
So to implement notifications on top of pipes, I've hacked it together a bit
in the following ways:

 (1) I'm passing O_TMPFILE to the pipe2() system call to indicate that you
     want a notifications pipe.  This prohibits splice and co. from being
     called on it as I don't want to have to try to fix iov_iter_revert() to
     handle kernel notifications being intermixed with splices.

     The choice of O_TMPFILE was just for convenience, but it needs to be
     something different.  I could, for instance, add a constant,
     O_NOTIFICATION_PIPE with the same *value* as O_TMPFILE.  I don't think
     it's likely that it will make sense to use O_TMPFILE with a pipe, but I
     also don't want to eat up another O_* constant just for this.

     Unfortunately, pipe2() doesn't have any other arguments from which I
     can steal a bit.

 (2) I've added a pair of ioctls to configure the notifications bits.  They're
     ioctls as I just reused the ioctl code from my devmisc driver.  Should I
     use fcntl() instead, such as is done for F_SETPIPE_SZ?

     The ioctls do two things: set the ring size to a number of slots (so
     similarish to F_SETPIPE_SZ) and set filters.
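
For comparison, here is a minimal sketch of the existing fcntl()-based interface that point 2 refers to. It uses only pipe2(), F_SETPIPE_SZ and F_GETPIPE_SZ as they exist today; it does not use O_TMPFILE or the proposed notification ioctls, which belong to the unmerged hack described above. The helper name make_sized_pipe() is hypothetical.

```c
#define _GNU_SOURCE		/* for pipe2() and F_SETPIPE_SZ */
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: create a pipe and ask for at least 'bytes' of
 * capacity.  Returns the capacity the kernel actually granted (it may
 * round the request up), or -1 on failure. */
static int make_sized_pipe(int fds[2], int bytes)
{
	int capacity;

	assert(bytes > 0);

	if (pipe2(fds, O_CLOEXEC) < 0)
		return -1;

	/* The size is a property of the pipe, so either end may be used. */
	if (fcntl(fds[1], F_SETPIPE_SZ, bytes) < 0)
		goto fail;

	capacity = fcntl(fds[1], F_GETPIPE_SZ);
	if (capacity < 0)
		goto fail;
	return capacity;

fail:
	close(fds[0]);
	close(fds[1]);
	return -1;
}
```

Note that F_SETPIPE_SZ takes a size in bytes, whereas the ioctl described above sets a number of ring slots, so the two are similar in purpose but not interchangeable.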

Any thoughts on how better to represent these bits?

Thanks,
David