Message ID | 157262963995.13142.5568934007158044624.stgit@warthog.procyon.org.uk (mailing list archive) |
---|---|
Headers | show |
Series | pipe: Notification queue preparation [ver #3] | expand |
On Fri, Nov 1, 2019 at 10:34 AM David Howells <dhowells@redhat.com> wrote: > (1) It removes the nr_exclusive argument from __wake_up_sync_key() as this > is always 1. This prepares for step 2. > > (2) Adds wake_up_interruptible_sync_poll_locked() so that poll can be > woken up from a function that's holding the poll waitqueue spinlock. Side note: we have a couple of cases where I don't think we should use the "sync" version at all. Both pipe_read() and pipe_write() have that if (do_wakeup) { wake_up_interruptible_sync_poll(&pipe->wait, ... code at the end, outside the loop. But those two wake-ups aren't actually synchronous. A sync wake is supposedly something where the waker is just about to go to sleep, telling the scheduler that "don't bother trying to pick another cpu, this process is going to sleep and you can stay here". I'm not sure how much this matters, but it does strike me that it's wrong. We're not going to sleep at all in that case - this is not the "I filled the whole buffer, so I'm going to sleep" case (or the "I've read all the data, I'm waiting for more". It's entirely possible that we always wake pipe wakeups to be sync just because it's a common pattern (and a common benchmark), but this series made me look at it again. Particularly since David has benchmarks that don't seem to show a lot of fluctuation with his changes - I wonder how much the sync logic buys us (or hurts us)? Linus
Linus Torvalds <torvalds@linux-foundation.org> wrote: > Side note: we have a couple of cases where I don't think we should use > the "sync" version at all. > > Both pipe_read() and pipe_write() have that > > if (do_wakeup) { > wake_up_interruptible_sync_poll(&pipe->wait, ... > > code at the end, outside the loop. But those two wake-ups aren't > actually synchronous. Changing those to non-sync: BENCHMARK BEST TOTAL BYTES AVG BYTES STDDEV =============== =============== =============== =============== =============== pipe 305816126 36255936983 302132808 8880788 splice 282402106 27102249370 225852078 210033443 vmsplice 440022611 48896995196 407474959 59906438 Changing the others in pipe_read() and pipe_write() too: pipe 305609682 36285967942 302383066 7415744 splice 282475690 27891475073 232428958 201687522 vmsplice 451458280 51949421503 432911845 34925242 The cumulative patch is attached below. I'm not sure how well this should make a difference with my benchmark programs since each thread can run on its own CPU. David --- diff --git a/fs/pipe.c b/fs/pipe.c index 9cd5cbef9552..c5e3765465f0 100644 --- a/fs/pipe.c +++ b/fs/pipe.c @@ -332,7 +332,7 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to) do_wakeup = 1; wake = head - (tail - 1) == pipe->max_usage / 2; if (wake) - wake_up_interruptible_sync_poll_locked( + wake_up_locked_poll( &pipe->wait, EPOLLOUT | EPOLLWRNORM); spin_unlock_irq(&pipe->wait.lock); if (wake) @@ -371,7 +371,7 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to) /* Signal writers asynchronously that there is more room. */ if (do_wakeup) { - wake_up_interruptible_sync_poll(&pipe->wait, EPOLLOUT | EPOLLWRNORM); + wake_up_interruptible_poll(&pipe->wait, EPOLLOUT | EPOLLWRNORM); kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT); } if (ret > 0) @@ -477,7 +477,7 @@ pipe_write(struct kiocb *iocb, struct iov_iter *from) * syscall merging. * FIXME! Is this really true? */ - wake_up_interruptible_sync_poll_locked( + wake_up_locked_poll( &pipe->wait, EPOLLIN | EPOLLRDNORM); spin_unlock_irq(&pipe->wait.lock); @@ -531,7 +531,7 @@ pipe_write(struct kiocb *iocb, struct iov_iter *from) out: __pipe_unlock(pipe); if (do_wakeup) { - wake_up_interruptible_sync_poll(&pipe->wait, EPOLLIN | EPOLLRDNORM); + wake_up_interruptible_poll(&pipe->wait, EPOLLIN | EPOLLRDNORM); kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN); } if (ret > 0 && sb_start_write_trylock(file_inode(filp)->i_sb)) {
On Fri, Nov 1, 2019 at 3:05 PM David Howells <dhowells@redhat.com> wrote: > > Changing those to non-sync: Your benchmark seems very insensitive to just about any changes. I suspect it is because you only test throughput. Latency is what the pipe wakeup has been optimized for, and which tends to be much more sensitive to other changes too (eg locking). That said, I'm not convinced a latency test would show much either. Linus
So to implement notifications on top of pipes, I've hacked it together a bit in the following ways: (1) I'm passing O_TMPFILE to the pipe2() system call to indicate that you want a notifications pipe. This prohibits splice and co. from being called on it as I don't want to have to try to fix iov_iter_revert() to handle kernel notifications being intermixed with splices. The choice of O_TMPFILE was just for convenience, but it needs to be something different. I could, for instance, add a constant, O_NOTIFICATION_PIPE with the same *value* as O_TMPFILE. I don't think it's likely that it will make sense to use O_TMPFILE with a pipe, but I also don't want to eat up another O_* constant just for this. Unfortunately, pipe2() doesn't have any other arguments into from which I can steal a bit. (2) I've added a pair of ioctls to configure the notifications bits. They're ioctls as I just reused the ioctl code from my devmisc driver. Should I use fcntl() instead, such as is done for F_SETPIPE_SZ? The ioctls do two things: set the ring size to a number of slots (so similarish to F_SETPIPE_SZ) and set filters. Any thoughts on how better to represent these bits? Thanks, David