
[03/32] fs: introduce new ->get_poll_head and ->poll_mask methods

Message ID 20180110155853.32348-4-hch@lst.de (mailing list archive)
State New, archived

Commit Message

Christoph Hellwig Jan. 10, 2018, 3:58 p.m. UTC
->get_poll_head returns the waitqueue that the poll operation is going
to sleep on.  Note that this means we can only use a single waitqueue
for the poll, unlike some current drivers that use two waitqueues for
different events.  But now that we have keyed wakeups and heavily use
those for poll there aren't that many good reasons left to keep the
multiple waitqueues, and if there are any, ->poll is still around; the
driver just won't support aio poll.

->poll_mask is called after the wakeup to return the actual mask of
events reported by poll.  It can be called with the waitqueue lock
held in some cases.
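
As a sketch of what a conversion to the new methods looks like
(hypothetical "mydev" driver; the names, fields and the kfifo-based
event check are invented, not part of this patch):

#include <linux/fs.h>
#include <linux/kfifo.h>
#include <linux/module.h>
#include <linux/poll.h>
#include <linux/wait.h>

struct mydev {
	struct wait_queue_head	waitq;	/* single queue for all events */
	struct kfifo		fifo;
	bool			write_space;
};

static struct wait_queue_head *
mydev_get_poll_head(struct file *file, __poll_t events)
{
	struct mydev *dev = file->private_data;

	/* just name the queue; the caller does the poll_wait()/sleeping */
	return &dev->waitq;
}

static __poll_t mydev_poll_mask(struct file *file, __poll_t events)
{
	struct mydev *dev = file->private_data;
	__poll_t mask = 0;

	/* may be called with the waitqueue lock held: must not block */
	if (!kfifo_is_empty(&dev->fifo))
		mask |= POLLIN | POLLRDNORM;
	if (dev->write_space)
		mask |= POLLOUT | POLLWRNORM;
	return mask;
}

static const struct file_operations mydev_fops = {
	.owner		= THIS_MODULE,
	.get_poll_head	= mydev_get_poll_head,
	.poll_mask	= mydev_poll_mask,
	/* no ->poll needed once both new methods are implemented */
};

The wakeup side would then use keyed wakeups, e.g.
wake_up_interruptible_poll(&dev->waitq, POLLIN), so waiters are only
woken for events they asked for.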

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 Documentation/filesystems/Locking |  7 ++++++-
 Documentation/filesystems/vfs.txt | 11 +++++++++++
 include/linux/fs.h                |  2 ++
 include/linux/poll.h              | 21 ++++++++++++++++++---
 4 files changed, 37 insertions(+), 4 deletions(-)

Comments

Al Viro Jan. 10, 2018, 9:04 p.m. UTC | #1
On Wed, Jan 10, 2018 at 04:58:24PM +0100, Christoph Hellwig wrote:
> ->get_poll_head returns the waitqueue that the poll operation is going
> to sleep on.  Note that this means we can only use a single waitqueue
> for the poll, unlike some current drivers that use two waitqueues for
> different events.  But now that we have keyed wakeups and heavily use
> those for poll there aren't that many good reasons left to keep the
> multiple waitqueues, and if there are any, ->poll is still around; the
> driver just won't support aio poll.

There's another problem with that - currently ->poll() may tell you "sod off,
I've got nothing for you to sleep on, eat your POLLHUP|POLLERR|something
and don't pester me again".  With your API that's hard to express sanely.

Another piece of fun related to that is handling of disconnects in general -

static __poll_t proc_reg_poll(struct file *file, struct poll_table_struct *pts)
{
        struct proc_dir_entry *pde = PDE(file_inode(file));
        __poll_t rv = DEFAULT_POLLMASK;
        __poll_t (*poll)(struct file *, struct poll_table_struct *);
        if (use_pde(pde)) {
                poll = pde->proc_fops->poll;
                if (poll)
                        rv = poll(file, pts);
                unuse_pde(pde);
        }
        return rv;
}

and similar in sysfs.  

Note, BTW, the places like wait->_qproc = NULL; in do_select() and its ilk.
Some of them are "don't bother putting me on any queues, I won't be sleeping
anyway".  Some are "I'm already on all queues I care about, I'm going to
sleep now and then query everything again once woken up".  It would be nice
to have the method split-up reflect that kind of logic...

What about af_alg_poll(), BTW?  Looks like you've missed that one...

Another thing: IMO file_can_poll() should use FMODE_CAN_POLL - either as
"true if set, otherwise check ->f_op and set accordingly" or set in
do_dentry_open() and just check it in file_can_poll()...
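
Roughly, for the second variant (sketch only; FMODE_CAN_POLL would be a
new f_mode bit next to FMODE_CAN_READ/FMODE_CAN_WRITE, it does not
exist yet):

	/* in do_dentry_open(), next to the FMODE_CAN_READ/WRITE setup */
	if (f->f_op->poll ||
	    (f->f_op->get_poll_head && f->f_op->poll_mask))
		f->f_mode |= FMODE_CAN_POLL;

	/* file_can_poll() then reduces to a flag test */
	static inline bool file_can_poll(struct file *file)
	{
		return file->f_mode & FMODE_CAN_POLL;
	}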
Al Viro Jan. 11, 2018, 5:22 a.m. UTC | #2
On Wed, Jan 10, 2018 at 09:04:16PM +0000, Al Viro wrote:
> On Wed, Jan 10, 2018 at 04:58:24PM +0100, Christoph Hellwig wrote:
> > ->get_poll_head returns the waitqueue that the poll operation is going
> > to sleep on.  Note that this means we can only use a single waitqueue
> > for the poll, unlike some current drivers that use two waitqueues for
> > different events.  But now that we have keyed wakeups and heavily use
> > those for poll there aren't that many good reasons left to keep the
> > multiple waitqueues, and if there are any, ->poll is still around; the
> > driver just won't support aio poll.
> 
> There's another problem with that - currently ->poll() may tell you "sod off,
> I've got nothing for you to sleep on, eat your POLLHUP|POLLERR|something
> and don't pester me again".  With your API that's hard to express sanely.
> 
> Another piece of fun related to that is handling of disconnects in general -
> 
> static __poll_t proc_reg_poll(struct file *file, struct poll_table_struct *pts)
> {
>         struct proc_dir_entry *pde = PDE(file_inode(file));
>         __poll_t rv = DEFAULT_POLLMASK;
>         __poll_t (*poll)(struct file *, struct poll_table_struct *);
>         if (use_pde(pde)) {
>                 poll = pde->proc_fops->poll;
>                 if (poll)
>                         rv = poll(file, pts);
>                 unuse_pde(pde);
>         }
>         return rv;
> }
> 
> and similar in sysfs.  
> 
> Note, BTW, the places like wait->_qproc = NULL; in do_select() and its ilk.
> Some of them are "don't bother putting me on any queues, I won't be sleeping
> anyway".  Some are "I'm already on all queues I care about, I'm going to
> sleep now and then query everything again once woken up".  It would be nice
> to have the method split-up reflect that kind of logic...
> 
> What about af_alg_poll(), BTW?  Looks like you've missed that one...
> 
> Another thing: IMO file_can_poll() should use FMODE_CAN_POLL - either as
> "true if set, otherwise check ->f_op and set accordingly" or set in
> do_dentry_open() and just check it in file_can_poll()...

Whee...  The very first ->poll() instance in alphabetic order on pathnames:
in arch/cris/arch-v10/drivers/gpio.c

static __poll_t gpio_poll(struct file *file, poll_table *wait)
{
        __poll_t mask = 0;
        struct gpio_private *priv = file->private_data;
        unsigned long data;
        unsigned long flags;

        spin_lock_irqsave(&gpio_lock, flags);

        poll_wait(file, &priv->alarm_wq, wait);

IOW, we are doing poll_wait() (== possible GFP_KERNEL __get_free_page()) under
a spinlock...
Christoph Hellwig Jan. 11, 2018, 8:28 a.m. UTC | #3
On Thu, Jan 11, 2018 at 05:22:00AM +0000, Al Viro wrote:
> Whee...  The very first ->poll() instance in alphabetic order on pathnames:
> in arch/cris/arch-v10/drivers/gpio.c
> 
> static __poll_t gpio_poll(struct file *file, poll_table *wait)
> {
>         __poll_t mask = 0;
>         struct gpio_private *priv = file->private_data;
>         unsigned long data;
>         unsigned long flags;
> 
>         spin_lock_irqsave(&gpio_lock, flags);
> 
>         poll_wait(file, &priv->alarm_wq, wait);
> 
> IOW, we are doing poll_wait() (== possible GFP_KERNEL __get_free_page()) under
> a spinlock...

Yes.  Another good reason to separate poll_wait and the actual
event check callback.
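
A split version of the quoted gpio_poll would keep poll_wait (and its
possible allocation) outside the lock entirely - sketch only, the
actual event check is guessed since the function body is cut off above:

static struct wait_queue_head *
gpio_get_poll_head(struct file *file, __poll_t events)
{
	struct gpio_private *priv = file->private_data;

	/* no lock needed just to name the queue to sleep on */
	return &priv->alarm_wq;
}

static __poll_t gpio_poll_mask(struct file *file, __poll_t events)
{
	struct gpio_private *priv = file->private_data;
	unsigned long flags;
	__poll_t mask = 0;

	/* the spinlock now only covers the (non-blocking) event check */
	spin_lock_irqsave(&gpio_lock, flags);
	if (priv->alarm_pending)	/* hypothetical condition */
		mask |= POLLIN | POLLRDNORM;
	spin_unlock_irqrestore(&gpio_lock, flags);
	return mask;
}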
Christoph Hellwig Jan. 11, 2018, 11:32 a.m. UTC | #4
For other horrors that are even worse than any given ->poll instance,
take a look at scif_poll and friends.
Christoph Hellwig Jan. 11, 2018, 11:36 a.m. UTC | #5
On Wed, Jan 10, 2018 at 09:04:16PM +0000, Al Viro wrote:
> There's another problem with that - currently ->poll() may tell you "sod off,
> I've got nothing for you to sleep on, eat your POLLHUP|POLLERR|something
> and don't pester me again".  With your API that's hard to express sanely.

And what exactly can tell 'sod off' right now?  ->poll
can only return the (E)POLL* mask.  But what would probably be sane
is to do the same thing in vfs_poll that I already do in aio poll:  call
->poll_mask once before calling poll_wait to clear any
already pending events.  That way any early error gets instantly
propagated.
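
I.e., roughly this on top of the vfs_poll helper in this patch (sketch):

static inline __poll_t vfs_poll(struct file *file, struct poll_table_struct *pt)
{
	unsigned int events = poll_requested_events(pt);
	__poll_t mask;

	if (unlikely(!file_can_poll(file)))
		return DEFAULT_POLLMASK;
	if (file->f_op->poll)
		return file->f_op->poll(file, pt);

	/* check for already pending events (and early errors) first */
	mask = file->f_op->poll_mask(file, events);
	if (mask)
		return mask;

	/* only register on the waitqueue if we may have to sleep */
	poll_wait(file, file->f_op->get_poll_head(file, events), pt);
	return file->f_op->poll_mask(file, events);
}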

> Another piece of fun related to that is handling of disconnects in general -
> 
> static __poll_t proc_reg_poll(struct file *file, struct poll_table_struct *pts)
> {
>         struct proc_dir_entry *pde = PDE(file_inode(file));
>         __poll_t rv = DEFAULT_POLLMASK;
>         __poll_t (*poll)(struct file *, struct poll_table_struct *);
>         if (use_pde(pde)) {
>                 poll = pde->proc_fops->poll;
>                 if (poll)
>                         rv = poll(file, pts);
>                 unuse_pde(pde);
>         }
>         return rv;
> }
> 
> and similar in sysfs.  

Can't find anything in sysfs, but debugfs has an amazingly bad variant
of the above, including confidence-ensuring commit description bits like:

    In order not to pollute debugfs with wrapper definitions that aren't ever
    needed, I chose not to define a wrapper for every struct file_operations
    method possible. Instead, a wrapper is defined only for the subset of
    methods which are actually set by any debugfs users.
    Currently, these are:
		     
      ->llseek()
      ->read()
      ->write()
      ->unlocked_ioctl()
      ->poll()

So anyone implementing, say, read_iter/write_iter or compat_ioctl
silently doesn't get the magic protection.

Either way - those two will need updating for the new scheme if
we add proc/debugfs ops, so I'd better do them now and convert at least
one example each.

> Note, BTW, the places like wait->_qproc = NULL; in do_select() and its ilk.
> Some of them are "don't bother putting me on any queues, I won't be sleeping
> anyway".  Some are "I'm already on all queues I care about, I'm going to
> sleep now and then query everything again once woken up".  It would be nice
> to have the method split-up reflect that kind of logic...

Hmm.  ->poll_mask already is a simple 'are these events pending'
method, and thus should deal perfectly fine with both cases.  What
additional split do you think would be helpful?

> What about af_alg_poll(), BTW?  Looks like you've missed that one...

Converted for the next iteration.

> Another thing: IMO file_can_poll() should use FMODE_CAN_POLL - either as
> "true if set, otherwise check ->f_op and set accordingly" or set in
> do_dentry_open() and just check it in file_can_poll()...

I don't really see the point of wasting a fmode bit for it.  But
if you really want that I can do it.
Al Viro Jan. 11, 2018, 5:47 p.m. UTC | #6
On Thu, Jan 11, 2018 at 12:36:00PM +0100, Christoph Hellwig wrote:
> On Wed, Jan 10, 2018 at 09:04:16PM +0000, Al Viro wrote:
> > There's another problem with that - currently ->poll() may tell you "sod off,
> > I've got nothing for you to sleep on, eat your POLLHUP|POLLERR|something
> > and don't pester me again".  With your API that's hard to express sanely.
> 
> And what exactly can currently tell 'sod off' right now?  ->poll
> can only return the (E)POLL* mask.  But what would probably be sane
> is to do the same thing in vfs_poll I already do in aio poll:  call
> ->poll_mask a first time before calling poll_wait to clear any
> already pending events.  That way any early error gets instantly
> propagated.

static __poll_t
capi_poll(struct file *file, poll_table *wait)
{
        struct capidev *cdev = file->private_data;
        __poll_t mask = 0;

        if (!cdev->ap.applid)
                return POLLERR;

        poll_wait(file, &(cdev->recvwait), wait);
        mask = POLLOUT | POLLWRNORM;
        if (!skb_queue_empty(&cdev->recvqueue))
                mask |= POLLIN | POLLRDNORM;
        return mask;
}

and a bunch of similar beasts.  FWIW, I'm going through that zoo, looking for
existing patterns.

BTW, consider this:
static __poll_t sync_serial_poll(struct file *file, poll_table *wait)
{
        int dev = iminor(file_inode(file));
        __poll_t mask = 0;
        struct sync_port *port;
        DEBUGPOLL(
        static __poll_t prev_mask;
        );

        port = &ports[dev];

        if (!port->started)
                sync_serial_start_port(port);

        poll_wait(file, &port->out_wait_q, wait);
        poll_wait(file, &port->in_wait_q, wait);

        /* No active transfer, descriptors are available */
        if (port->output && !port->tr_running)
                mask |= POLLOUT | POLLWRNORM;
	...
}

Besides having two queues, note the one-time sync_serial_start_port()
there.  Where would you map such things?  First ->poll_mask()?

> Can't find anything in sysfs,

Large chunk of sysfs is in fs/kernfs/*.c; it's there.

> > Note, BTW, the places like wait->_qproc = NULL; in do_select() and its ilk.
> > Some of them are "don't bother putting me on any queues, I won't be sleeping
> > anyway".  Some are "I'm already on all queues I care about, I'm going to
> > sleep now and then query everything again once woken up".  It would be nice
> > to have the method split-up reflect that kind of logic...
> 
> Hmm.  ->poll_mask already is a simple 'are these events pending'
> > method, and thus should deal perfectly fine with both cases.  What
> additional split do you think would be helpful?

What I mean is that it would be nice to have do_select() and friends aware of that.
You are hiding the whole thing behind vfs_poll(); sure, we can't really exploit
that while we have the mix of converted and unconverted instances, but it would
be a nice payoff.

As for calling ->poll_mask() first...  Three method calls per descriptor on the
first pass?  Overhead might get painful...

FWIW, the problem with "sod off early" ones is not the cost of poll_wait() -
it's that sometimes we might not _have_ a queue to sleep on.  Hell knows, I need
to finish the walk through that zoo to see what's out there...  Pox on
drivers/media - that's where the bulk of instances is, and they are fairly
convoluted...

wait_on_event_..._key() might be a good idea; we probably want comments from
Peter on that one.  An interesting testcase would be tty - the amount of
threads sleeping on those queues is going to be large; can we combine
->read_wait and ->write_wait without serious PITA?  Another issue is
ldisc handling - the first thing tty_poll() is doing is
        ld = tty_ldisc_ref_wait(tty);
and it really waits for ldisc changes in progress to settle.  Hell knows
whether anything relies on that, but I wouldn't be surprised if it did -
tty handling is one of the areas where select(2)/poll(2) get non-trivial
use...
Christoph Hellwig Jan. 12, 2018, 9:06 a.m. UTC | #7
On Thu, Jan 11, 2018 at 05:47:50PM +0000, Al Viro wrote:
> Besides having two queues, note the one-time sync_serial_start_port()
> there.  Where would you map such things?  First ->poll_mask()?

->get_poll_head.  These sorts of calls are the prime reason why
the events argument is passed to it.
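
For the quoted sync_serial case that would look roughly like this
(sketch; the port names come from the quote, the events check is made
up, and the driver's two queues would still need to be merged):

static struct wait_queue_head *
sync_serial_get_poll_head(struct file *file, __poll_t events)
{
	struct sync_port *port = &ports[iminor(file_inode(file))];

	/* the one-time setup moves here; the events argument lets the
	 * driver skip it when it isn't needed */
	if ((events & (POLLIN | POLLOUT)) && !port->started)
		sync_serial_start_port(port);

	return &port->in_wait_q;	/* out_wait_q would need merging in */
}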

> 
> > Can't find anything in sysfs,
> 
> Large chunk of sysfs is in fs/kernfs/*.c; it's there.

Still can't find it - there is exactly one poll implementation in there,
and it's not a forwarder.

> > Hmm.  ->poll_mask already is a simple 'are these events pending'
> > method, and thus should deal perfectly fine with both cases.  What
> > additional split do you think would be helpful?
> 
> What I mean is that it would be nice to have do_select() and friends aware of that.
> You are hiding the whole thing behind vfs_poll(); sure, we can't really exploit
> that while we have the mix of converted and unconverted instances, but it would
> be a nice payoff.

Yes.  I think we can actually get rid of vfs_poll again rather soon;
the prime reason for it was to isolate non-core callers.  But most
of them should use better primitives anyway, and even until then we
can get rid of vfs_poll for the core to make things cleaner.

> As for calling ->poll_mask() first...  Three method calls per descriptor on the
> first pass?  Overhead might get painful...

I would have said no until about a week ago.  But with branch prediction
on indirect branches basically gone on x86 now, it's not going to get
better.  That being said, we already do three method calls for the aio
poll case and it hasn't been an issue.

> FWIW, the problem with "sod off early" ones is not the cost of poll_wait() -
> it's that sometimes we might not _have_ a queue to sleep on.  Hell knows, I need
> to finish the walk through that zoo to see what's out there...  Pox on
> drivers/media - that's where the bulk of instances is, and they are fairly
> convoluted...

I've seen a few instances that return errors before poll_wait, but
most of them seemed buggy.

> wait_on_event_..._key() might be a good idea; we probably want comments from
> Peter on that one.  An interesting testcase would be tty - the amount of
> threads sleeping on those queues is going to be large; can we combine
> ->read_wait and ->write_wait without serious PITA?  Another issue is
> ldisc handling - the first thing tty_poll() is doing is
>         ld = tty_ldisc_ref_wait(tty);
> and it really waits for ldisc changes in progress to settle.  Hell knows
> whether anything relies on that, but I wouldn't be surprised if it did -
> tty handling is one of the areas where select(2)/poll(2) get non-trivial
> use...

Yes. I'll look into it.

In the meantime I think that poll_mask semantics are a big improvement
for the common case, and they enable doing aio poll.  So I'd rather
move ahead and fix any details of method signatures once we're down
to just the hard cases instead of trying to do everything at once.
Christoph Hellwig Jan. 17, 2018, 4:05 p.m. UTC | #8
On Thu, Jan 11, 2018 at 05:47:50PM +0000, Al Viro wrote:
> What I mean is that it would be nice to have do_select() and friends aware of that.
> You are hiding the whole thing behind vfs_poll(); sure, we can't really exploit
> that while we have the mix of converted and unconverted instances, but it would
> be a nice payoff.

I have tried to move it out, but I don't think it makes sense yet,
given that we rely on passing the poll_table_struct everywhere.

Btw, another issue: the proc and debugfs forwarders are in fact
already broken right now for epoll and in-kernel callers.  ->poll
returning does not mean it is safe to release the module - it
may still have an active wait_queue_entry on the waitqueue.

For now I think I'll drop the support for the proc/debugfs forwarders,
because they will need a real fix independent of this series.

Patch

diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
index 220bba28f72b..6d227f9d7bd9 100644
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -440,6 +440,8 @@  prototypes:
 	ssize_t (*write_iter) (struct kiocb *, struct iov_iter *);
 	int (*iterate) (struct file *, struct dir_context *);
 	__poll_t (*poll) (struct file *, struct poll_table_struct *);
+	struct wait_queue_head * (*get_poll_head)(struct file *, __poll_t);
+	__poll_t (*poll_mask) (struct file *, __poll_t);
 	long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
 	long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
 	int (*mmap) (struct file *, struct vm_area_struct *);
@@ -470,7 +472,7 @@  prototypes:
 };
 
 locking rules:
-	All may block.
+	All except for ->poll_mask may block.
 
 ->llseek() locking has moved from llseek to the individual llseek
 implementations.  If your fs is not using generic_file_llseek, you
@@ -498,6 +500,9 @@  in sys_read() and friends.
 the lease within the individual filesystem to record the result of the
 operation
 
+->poll_mask can be called with or without the waitqueue lock for the waitqueue
+returned from ->get_poll_head.
+
 --------------------------- dquot_operations -------------------------------
 prototypes:
 	int (*write_dquot) (struct dquot *);
diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index f608180ad59d..bafb5c749443 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -857,6 +857,8 @@  struct file_operations {
 	ssize_t (*write_iter) (struct kiocb *, struct iov_iter *);
 	int (*iterate) (struct file *, struct dir_context *);
 	__poll_t (*poll) (struct file *, struct poll_table_struct *);
+	struct wait_queue_head * (*get_poll_head)(struct file *, __poll_t);
+	__poll_t (*poll_mask) (struct file *, __poll_t);
 	long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
 	long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
 	int (*mmap) (struct file *, struct vm_area_struct *);
@@ -901,6 +903,15 @@  otherwise noted.
 	activity on this file and (optionally) go to sleep until there
 	is activity. Called by the select(2) and poll(2) system calls
 
+  get_poll_head: Returns the struct wait_queue_head that poll, select,
+  epoll or aio poll should wait on in case this instance only has a single
+  waitqueue.
+
+  poll_mask: return the mask of POLL* values describing the file descriptor
+  state.  Called before going to sleep on the waitqueue returned by
+  get_poll_head, and after it has been woken.  If ->get_poll_head and
+  ->poll_mask are implemented, ->poll does not need to be implemented.
+
   unlocked_ioctl: called by the ioctl(2) system call.
 
   compat_ioctl: called by the ioctl(2) system call when 32 bit system calls
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 34c0434511c7..f7dd8eb1eb85 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1699,6 +1699,8 @@  struct file_operations {
 	int (*iterate) (struct file *, struct dir_context *);
 	int (*iterate_shared) (struct file *, struct dir_context *);
 	__poll_t (*poll) (struct file *, struct poll_table_struct *);
+	struct wait_queue_head * (*get_poll_head)(struct file *, __poll_t);
+	__poll_t (*poll_mask) (struct file *, __poll_t);
 	long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
 	long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
 	int (*mmap) (struct file *, struct vm_area_struct *);
diff --git a/include/linux/poll.h b/include/linux/poll.h
index 6eb3343c2ee1..a9abcbc06fc2 100644
--- a/include/linux/poll.h
+++ b/include/linux/poll.h
@@ -75,14 +75,29 @@  static inline void init_poll_funcptr(poll_table *pt, poll_queue_proc qproc)
 
 static inline bool file_can_poll(struct file *file)
 {
-	return file->f_op->poll;
+	return file->f_op->poll ||
+		(file->f_op->get_poll_head && file->f_op->poll_mask);
 }
 
 static inline __poll_t vfs_poll(struct file *file, struct poll_table_struct *pt)
 {
-	if (unlikely(!file->f_op->poll))
+	unsigned int events = poll_requested_events(pt);
+
+	if (unlikely(!file_can_poll(file)))
+		return DEFAULT_POLLMASK;
+
+	if (file->f_op->poll)
+		return file->f_op->poll(file, pt);
+
+	poll_wait(file, file->f_op->get_poll_head(file, events), pt);
+	return file->f_op->poll_mask(file, events);
+}
+
+static inline __poll_t vfs_poll_mask(struct file *file, __poll_t events)
+{
+	if (unlikely(!file->f_op->poll_mask))
 		return DEFAULT_POLLMASK;
-	return file->f_op->poll(file, pt);
+	return file->f_op->poll_mask(file, events) & events;
 }
 
 struct poll_table_entry {