Message ID | 20180628142059.10017-7-hch@lst.de (mailing list archive) |
---|---|
State | New, archived |
On Thu, Jun 28, 2018 at 7:21 AM Christoph Hellwig <hch@lst.de> wrote:
>
> Note that this removes the possibility of actually returning an
> error before waiting in poll.  We could still do this with an ERR_PTR
> in f_poll_head with a little bit of WRITE_ONCE/READ_ONCE magic, but
> I'd like to defer that until actually required.

I'm still going to just revert the whole poll mess for now.

It's still completely broken.  This helps things, but it doesn't fix
the fundamental issue: the new interface is strictly worse than the
old interface ever was.

So Christoph, if you don't like the traditional ->poll() method, and
you want something else for aio polling, I seriously will suggest that
you introduce a new f_op for *that*.  Don't mess with the existing
->poll() function, don't make select() and poll() use a slower and
less capable function just because aio wants something else.

In other words, you need to see AIO as the less important case, not as
the driver for this.

I also want to understand what the AIO race was, and what the problem
with the poll() thing was.  You claimed it was racy.  I don't see it,
and it was never ever explained in the whole series.  I should never
have pulled it in the first place if only for that reason, but I tend
to trust Al when it comes to the VFS layer, so I did.  My bad.

So before we try this again, I most definitely want _explanations_.
And I want the whole approach to be very clear that AIO is the ugly
step-sister, not the driving force.

                Linus
On Thu, Jun 28, 2018 at 09:40:21AM -0700, Linus Torvalds wrote:
> On Thu, Jun 28, 2018 at 7:21 AM Christoph Hellwig <hch@lst.de> wrote:
> >
> > Note that this removes the possibility of actually returning an
> > error before waiting in poll.  We could still do this with an ERR_PTR
> > in f_poll_head with a little bit of WRITE_ONCE/READ_ONCE magic, but
> > I'd like to defer that until actually required.
>
> I'm still going to just revert the whole poll mess for now.
>
> It's still completely broken.  This helps things, but it doesn't fix
> the fundamental issue: the new interface is strictly worse than the
> old interface ever was.
>
> So Christoph, if you don't like the traditional ->poll() method, and
> you want something else for aio polling, I seriously will suggest that
> you introduce a new f_op for *that*.  Don't mess with the existing
> ->poll() function, don't make select() and poll() use a slower and
> less capable function just because aio wants something else.
>
> In other words, you need to see AIO as the less important case, not as
> the driver for this.
>
> I also want to understand what the AIO race was, and what the problem
> with the poll() thing was.  You claimed it was racy.  I don't see it,
> and it was never ever explained in the whole series.  I should never
> have pulled it in the first place if only for that reason, but I tend
> to trust Al when it comes to the VFS layer, so I did.  My bad.

... and I should have pushed back harder, rather than getting sidetracked
into fixing the fs/aio.c-side races in this series ;-/

As for what can be salvaged out of the whole mess,

	* probably the most serious lesson is that INDIRECT CALLS ARE
COSTLY NOW and shouldn't be used lightly.  That had been slow to sink in
and we'd better all realize how much the things have changed.  That, BTW,
has implications going a lot further than poll-related stuff - e.g. the
whole "we'll deal with revoke-like issues in procfs/sysfs/debugfs by
wrapping method calls" needs to be reexamined.  And in the poll-related
area, note that we have a lot of indirection levels for socket poll.

	* having an optional pointer to wait_queue_head in struct file
is probably a good idea; a lot of ->poll() instances do have the same
form.  Even if sockets do not (and I'm not all that happy about the
solution in the latest series), the instances that do are common and
important enough.

	* a *LOT* of ->poll() instances only block in __pollwait()
called (indirectly) on the first pass.  If we annotate those in some way
(a flag set in ->open(), presence of a new method, whatever) and limit
aio-poll to just those, we could deal with the aio side without
disrupting select/poll at all; just use (in place of __pollwait) a
different callback that wouldn't try to allocate a poll_table_entry and
worked with the stuff embedded into the aio-poll iocb.

How much do you intend to revert?  Untangling just the ->poll()-related
parts from the rest of changes in fs/aio.c will be unpleasant; we might
end up reverting the whole tail of the series and redoing the things
that are not poll-related on top of the revert... ;-/
On Thu, Jun 28, 2018 at 11:17 AM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> As for what can be salvaged out of the whole mess,
> 	* probably the most serious lesson is that INDIRECT CALLS ARE
> COSTLY NOW and shouldn't be used lightly.

Note that this has always been true, it's just _way_ more obvious now.

Indirect calls are costly not just because of the nasty 20+ cycle cost
of the stupid spectre overhead (which hopefully will be a thing of the
past in a few years as people upgrade), but because they are a pretty
basic barrier to optimization, both for compilers but also just for
_people_.

Look at a lot of vfs optimization we've done, like all the name hashing
optimization etc. We basically fall flat on our face if a filesystem
implements its own name hash function, not _just_ because of the cost of
the indirect function call, but because it suddenly means that the
filesystem is doing its own thing and all the clever work we did to
integrate name hashing with copying the name no longer works.

So I really want to avoid indirect calls. And when they *do* happen, I
want to avoid the model where people think of them as low-level object
accessor functions - the C++ disease. I want indirect function calls to
make sense at a higher level, and do some real operation.

End result: I really despised the new poll() model. Yes, the performance
report was what made me *notice*, but then when I looked at the code I
went "No". Using an indirect call as some low-level accessor function
really is fundamentally wrong. Don't do it. It's badly designed. Our VFS
operations are _high-level_ operations, where we do one single call for
a whole "read()" operation.

"->poll()" used to be the same. The new "->get_poll_head()" and
"->poll_mask()" was just bad, bad, bad.

> 	* having an optional pointer to wait_queue_head in struct file
> is probably a good idea; a lot of ->poll() instances do have the same
> form.  Even if sockets do not (and I'm not all that happy about the
> solution in the latest series), the instances that do are common and
> important enough.

Right. I don't hate the poll wait-queue pointer.

That said, I do hope that we can simply write things so as to not even
need it.

> 	* a *LOT* of ->poll() instances only block in __pollwait()
> called (indirectly) on the first pass.

They are *all* supposed to do it.

The whole idea with "->poll()" is that the model of operation is:

 - add_wait_queue() and return state on the first pass

 - on subsequent passes (or if somebody else already returned a state
   that means we already know we're not going to wait), the poll table
   is NULL, so you *CANNOT* add_wait_queue again, so you just return
   state.

Additionally, once _anybody_ has returned a "I already have the event",
we also clear the poll table queue, so subsequent accesses will purely
be for returning the poll state.

So I don't understand why you're asking for annotation. The whole "you
add yourself to the poll table" is *fundamentally* only done on the
first pass. You should never do it for later passes.

> How much do you intend to revert?  Untangling just the ->poll()-related
> parts from the rest of changes in fs/aio.c will be unpleasant; we might
> end up reverting the whole tail of the series and redoing the things
> that are not poll-related on top of the revert... ;-/

I pushed out my revert. It was fairly straightforward, it just reverted
all the poll_mask/get_poll_head changes, and the aio code that depended
on them.

Btw, I really don't understand the "aio has a race".

The old poll() interface was fundamentally race-free. There simply *is*
no way to race on it, exactly because of the whole "add yourself to the
wait queue first, then ask for state afterwards" model. The model may be
_odd_, but it has literally worked well for a quarter century exactly
because it's really simple and fundamentally cannot have races.

So I think it's the aio code that needs fixing, not the polling code.

I do want that explanation for why AIO is _so_ special that it can
introduce a race in poll(). Because I suspect it's not so special, and
it's just buggy.

Maybe Christoph didn't understand the two-phase model (how you call
->poll() _twice_ - first to add yourself to the queue, later to check
status). Or maybe AIO interfaces are just shit (again!) and don't work
right.

                Linus
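For reference, the canonical shape of that two-pass model - a minimal
sketch for a hypothetical "foo" device, not code from any real driver:

static __poll_t foo_poll(struct file *filp, struct poll_table_struct *pt)
{
	struct foo_dev *foo = filp->private_data;
	__poll_t mask = 0;

	/* first pass: __pollwait() (via pt->_qproc) adds us to foo->waitq;
	 * later passes: pt is NULL or _qproc is cleared, so this is a no-op */
	poll_wait(filp, &foo->waitq, pt);

	/* both passes: just report the current state */
	if (foo->data_ready)
		mask |= EPOLLIN | EPOLLRDNORM;
	if (foo->write_space)
		mask |= EPOLLOUT | EPOLLWRNORM;
	return mask;
}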
On Thu, Jun 28, 2018 at 12:31:14PM -0700, Linus Torvalds wrote:
> > 	* a *LOT* of ->poll() instances only block in __pollwait()
> > called (indirectly) on the first pass.
>
> They are *all* supposed to do it.

Sure, but...

static __poll_t binder_poll(struct file *filp,
				struct poll_table_struct *wait)
{
	struct binder_proc *proc = filp->private_data;
	struct binder_thread *thread = NULL;
	bool wait_for_proc_work;

	thread = binder_get_thread(proc);
	if (!thread)
		return POLLERR;
...
static struct binder_thread *binder_get_thread(struct binder_proc *proc)
{
	struct binder_thread *thread;
	struct binder_thread *new_thread;

	binder_inner_proc_lock(proc);
	thread = binder_get_thread_ilocked(proc, NULL);
	binder_inner_proc_unlock(proc);
	if (!thread) {
		new_thread = kzalloc(sizeof(*thread), GFP_KERNEL);

And that's hardly unique - we have instances playing with timers,
allocations, whatnot.  Even straight mutex_lock(), as in

static __poll_t i915_perf_poll(struct file *file, poll_table *wait)
{
	struct i915_perf_stream *stream = file->private_data;
	struct drm_i915_private *dev_priv = stream->dev_priv;
	__poll_t ret;

	mutex_lock(&dev_priv->perf.lock);
	ret = i915_perf_poll_locked(dev_priv, stream, file, wait);
	mutex_unlock(&dev_priv->perf.lock);

	return ret;
}

or

static __poll_t cec_poll(struct file *filp,
			 struct poll_table_struct *poll)
{
	struct cec_fh *fh = filp->private_data;
	struct cec_adapter *adap = fh->adap;
	__poll_t res = 0;

	if (!cec_is_registered(adap))
		return EPOLLERR | EPOLLHUP;
	mutex_lock(&adap->lock);
	if (adap->is_configured &&
	    adap->transmit_queue_sz < CEC_MAX_MSG_TX_QUEUE_SZ)
		res |= EPOLLOUT | EPOLLWRNORM;
	if (fh->queued_msgs)
		res |= EPOLLIN | EPOLLRDNORM;
	if (fh->total_queued_events)
		res |= EPOLLPRI;
	poll_wait(filp, &fh->wait, poll);
	mutex_unlock(&adap->lock);
	return res;
}

etc.

> The whole idea with "->poll()" is that the model of operation is:
>
>  - add_wait_queue() and return state on the first pass
>
>  - on subsequent passes (or if somebody else already returned a state
> that means we already know we're not going to wait), the poll table is
> NULL, so you *CANNOT* add_wait_queue again, so you just return state.
>
> Additionally, once _anybody_ has returned a "I already have the
> event", we also clear the poll table queue, so subsequent accesses
> will purely be for returning the poll state.
>
> So I don't understand why you're asking for annotation. The whole "you
> add yourself to the poll table" is *fundamentally* only done on the
> first pass. You should never do it for later passes.

Sure.  Unfortunately, adding yourself to the poll table is not the only
way to block.  And plenty of instances in e.g. drivers/media (where
the bulk of ->poll() instances lives) are very free with grabbing mutexes
as they go.
On Thu, Jun 28, 2018 at 09:28:37PM +0100, Al Viro wrote:
> Sure.  Unfortunately, adding yourself to the poll table is not the only
> way to block.  And plenty of instances in e.g. drivers/media (where
> the bulk of ->poll() instances lives) are very free with grabbing mutexes
> as they go.

Speaking of obvious bogosities (a lot more so than a blocking allocation
several calls down into a helper):

static __poll_t ca8210_test_int_poll(
	struct file *filp,
	struct poll_table_struct *ptable
)
{
	__poll_t return_flags = 0;
	struct ca8210_priv *priv = filp->private_data;

	poll_wait(filp, &priv->test.readq, ptable);
	if (!kfifo_is_empty(&priv->test.up_fifo))
		return_flags |= (EPOLLIN | EPOLLRDNORM);
	if (wait_event_interruptible(
		priv->test.readq,
		!kfifo_is_empty(&priv->test.up_fifo))) {
		return EPOLLERR;
	}
	return return_flags;
}

In a sense, poll_wait() had been an unfortunate choice of name - it's
really "arrange to be woken up by", not "wait for something now", and
while outright confusion like the above is rare, general "blocking is OK
in that area" is not rare at all...
On Thu, Jun 28, 2018 at 1:28 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
>
> Sure, but...
>
> static __poll_t binder_poll(struct file *filp,
>				struct poll_table_struct *wait)
> {
>	struct binder_proc *proc = filp->private_data;
>	struct binder_thread *thread = NULL;
>	bool wait_for_proc_work;
>
>	thread = binder_get_thread(proc);
>	if (!thread)
>		return POLLERR;

That's actually fine.

In particular, it's ok to *not* add yourself to the wait-queues if you
already return the value that you will always return.

For example, the poll function for a regular file could just always
return EPOLLIN | EPOLLOUT | EPOLLRDNORM | EPOLLWRNORM because it never
changes. Now, it doesn't actually *do* that, because we have the rule
that "no ->poll function" means exactly that, but it's not wrong.

Similarly, if a poll handler has an error that will not go away, the
above is actually the *right* thing to do. There simply isn't anything
to wait for.

Arguably, it probably should return not just POLLERR, but also
POLLIN/POLLOUT, since an error *also* means that it's going to return
immediately from read/write, but that's a separate thing entirely.

> And that's hardly unique - we have instances playing with timers,
> allocations, whatnot.  Even straight mutex_lock(), as in

So?

Again, locking is permitted. It's not great, but it's not against the rules.

So none of the things you point to are excuses for changing interfaces
or adding any flags.

And no, we don't care at all about "blocking" in general. Somebody who
cares about _performance_ may care, but it's not fundamental.

Even select() and poll() itself will block for allocating select tables
etc, but also for reading (and updating) user space.

Anybody who thinks "select cannot block" or "->poll() mustn't block" is
just confused. It has *never* been about that. It waits asynchronously
for IO, but it may well wait synchronously for locks or memory or just
"lazy implementation".

And none of this has anything to do with the ->poll() interface itself.
If you want to improve performance on some _particular_ file and
actually worry about locking etc, you fix that particular
implementation. You don't go and change the interface for everybody.

The fact is, those interface changes were just broken shit. They were
confused. I don't actually believe that AIO even needed them.

Christoph, do you have a test program for IOCB_CMD_POLL and what it's
actually supposed to do?

Because I think that what it can do is simply to do the ->poll() calls
outside the iocb locks, and then just attach the poll table to the
kioctx afterwards.

This whole "poll must not block" is a complete red herring. It doesn't
come from any other requirements than BAD AIO GARBAGE CODE.

Seriously. Stop thinking this has to happen inside some spinlocked
region. That's AIO braindamage, it's irrelevant.

                Linus
On Thu, Jun 28, 2018 at 1:37 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Speaking of obvious bogosities (a lot more so than a blocking allocation
> several calls down into a helper):
>
> static __poll_t ca8210_test_int_poll(
>	struct file *filp,
>	struct poll_table_struct *ptable
> )

Ok, that's just garbage.

Again, this is not an excuse for changing "->poll()". This is just a
bogus driver that does stupid things.

And again, don't use this as an example of "poll must not block". The
fact is, poll() *can* block for locking and other such things, and it's
perfectly normal and ok. It just shouldn't block for IO.

I will not take any misguided patches to try to "fix" poll functions
that might block. But I will take patches to fix bugs in drivers. In
this case, I think the fix is to just remove that crazy
"wait_event_interruptible()" entirely.

Not that anybody cares, obviously. Apparently nobody has ever noticed
how broken that poll function is.

                Linus
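With the bogus wait removed, the function would presumably shrink to the
usual pattern - an untested sketch of the fix Linus describes, not a
patch that was actually posted:

static __poll_t ca8210_test_int_poll(struct file *filp,
				     struct poll_table_struct *ptable)
{
	__poll_t return_flags = 0;
	struct ca8210_priv *priv = filp->private_data;

	/* register for wakeups; report state without blocking */
	poll_wait(filp, &priv->test.readq, ptable);
	if (!kfifo_is_empty(&priv->test.up_fifo))
		return_flags |= (EPOLLIN | EPOLLRDNORM);
	return return_flags;
}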
On Thu, Jun 28, 2018 at 02:11:17PM -0700, Linus Torvalds wrote:
> On Thu, Jun 28, 2018 at 1:28 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> >
> > Sure, but...
> >
> > static __poll_t binder_poll(struct file *filp,
> >				struct poll_table_struct *wait)
> > {
> >	struct binder_proc *proc = filp->private_data;
> >	struct binder_thread *thread = NULL;
> >	bool wait_for_proc_work;
> >
> >	thread = binder_get_thread(proc);
> >	if (!thread)
> >		return POLLERR;
>
> That's actually fine.
>
> In particular, it's ok to *not* add yourself to the wait-queues if you
> already return the value that you will always return.

Sure (and that's one of the problems I mentioned with the
->get_poll_head() model).  But in this case I was referring to the
GFP_KERNEL allocation down there.

> > And that's hardly unique - we have instances playing with timers,
> > allocations, whatnot.  Even straight mutex_lock(), as in
>
> So?
>
> Again, locking is permitted. It's not great, but it's not against the rules.

Me: a *LOT* of ->poll() instances only block in __pollwait() called (indirectly)
on the first pass.

You: They are *all* supposed to do it.

Me: <examples of instances that block elsewhere>

I'm not saying that blocking on other things is a bug; some of such *are* bogus,
but a lot aren't really broken.  What I said is that in a lot of cases we really
have hard "no blocking other than in callback" (and on subsequent passes there's
no callback at all).  Which is just about perfect for AIO purposes, so *IF* we
go for "new method just for AIO, those who don't have it can take a hike", we might
as well indicate that "can take a hike" in some way (be it opt-in or opt-out) and
use straight unchanged ->poll(), with alternative callback.

Looks like we were talking past each other for the last couple of rounds...

> So none of the things you point to are excuses for changing interfaces
> or adding any flags.
>
> Anybody who thinks "select cannot block" or "->poll() mustn't block" is
> just confused. It has *never* been about that. It waits asynchronously
> for IO, but it may well wait synchronously for locks or memory or just
> "lazy implementation".

Obviously.  I *do* understand how poll() works, really.

> The fact is, those interface changes were just broken shit. They were
> confused. I don't actually believe that AIO even needed them.
>
> Christoph, do you have a test program for IOCB_CMD_POLL and what it's
> actually supposed to do?
>
> Because I think that what it can do is simply to do the ->poll() calls
> outside the iocb locks, and then just attach the poll table to the
> kioctx afterwards.

I'd do a bit more - embed the first poll_table_entry into poll iocb itself,
so that the instances that use only one queue wouldn't need any allocations
at all.
On Thu, Jun 28, 2018 at 2:30 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> > Again, locking is permitted. It's not great, but it's not against the rules.
>
> Me: a *LOT* of ->poll() instances only block in __pollwait() called (indirectly)
> on the first pass.
>
> You: They are *all* supposed to do it.
>
> Me: <examples of instances that block elsewhere>

Oh, I thought you were talking about the whole "first pass" adding to
wait queues, as opposed to doing it on the second pass.

The *blocking* is entirely immaterial. I didn't even react to it,
because it's simply not an issue. I don't understand why you're even
hung up about it.

The only reason "blocking" seems to be an issue is because AIO has
shit-for-brains and wanted to do poll() under the spinlock.

But that's literally just AIO being confused garbage. It has zero
relevance for anything else.

                Linus
On Thu, Jun 28, 2018 at 10:30:27PM +0100, Al Viro wrote:
> I'm not saying that blocking on other things is a bug; some of such *are* bogus,
> but a lot aren't really broken.  What I said is that in a lot of cases we really
> have hard "no blocking other than in callback" (and on subsequent passes there's
> no callback at all).  Which is just about perfect for AIO purposes, so *IF* we
> go for "new method just for AIO, those who don't have it can take a hike", we might
> as well indicate that "can take a hike" in some way (be it opt-in or opt-out) and
> use straight unchanged ->poll(), with alternative callback.

PS: one way of doing that would be to steal a flag from pt->_key and
have ->poll() instances do an equivalent of

	if (flags & LOOKUP_RCU)
		return -ECHILD;

we have in a lot of ->d_revalidate() instances for the "need to block"
case.  Only here they would've returned EPOLLNVAL.  Most of the ->poll()
instances wouldn't care at all - they do not block unless the callback
does (and in this case it wouldn't have).  Normal poll(2)/select(2) are
completely unaffected.  And AIO would just have that bit set in its
poll_table_struct.

The rules for drivers change only in one respect - if your ->poll() is going to
need to block, check poll_requested_events(pt) & EPOLL_ATOMIC and return EPOLLNVAL
in such case.
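Concretely, a blocking ->poll() instance might then look something like
this - a sketch only, where EPOLL_ATOMIC is Al's hypothetical flag (it
was never merged) and foo_poll_locked() is a made-up helper:

static __poll_t foo_poll(struct file *filp, struct poll_table_struct *pt)
{
	struct foo_dev *foo = filp->private_data;
	__poll_t mask;

	/* hypothetical bit stolen from pt->_key, set only by aio */
	if (poll_requested_events(pt) & EPOLL_ATOMIC)
		return EPOLLNVAL;	/* we are about to block */

	mutex_lock(&foo->lock);
	mask = foo_poll_locked(foo, filp, pt);
	mutex_unlock(&foo->lock);
	return mask;
}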
On Thu, Jun 28, 2018 at 3:20 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> The rules for drivers change only in one respect - if your ->poll() is going to
> need to block, check poll_requested_events(pt) & EPOLL_ATOMIC and return EPOLLNVAL
> in such case.

I still don't even understand why you care.

Yes, the AIO poll implementation did it under the spinlock.

But there's no good *reason* for that. The "aio_poll()" function itself
is called in perfectly fine blocking context.

The only reason it does it under the spinlock is that apparently
Christoph didn't understand how poll() worked.

As far as I can tell, Christoph could have just done the first pass
'->poll()' *without* taking a spinlock, and that adds the table entry
to the table. Then, *under the spinlock*, you associate the table with
the kioctx. And then *after* the spinlock, you can call "->poll()"
again (now with a NULL table pointer), to verify that the state is
still not triggered. That's the whole point of the two-phase poll
thing - the first phase adds the entry to the wait queues, and the
second phase checks for the race of "did the event happen in the
meantime".

There is absolutely no excuse for calling '->poll()' itself under the
spinlock. I don't see any reason for it. The whole "AIO needs this to
avoid races" was always complete and utter bullshit, as far as I can
tell.

So stop it with this crazy and pointless "poll() might block". IT DAMN
WELL SHOULD BE ABLE TO BLOCK, AND NOBODY SANE WILL EVER CARE!

If somebody cares, they are doing things wrong. So fix the AIO code,
don't look at the poll() code, for chrissake!

                Linus
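In outline, the flow Linus describes might look like this inside
aio_poll() - a sketch of the control flow only, not the code that was
eventually merged: aio_poll_queue_proc is an assumed helper name, and
error handling, cancel setup and dequeueing on early completion are all
omitted:

static ssize_t aio_poll(struct aio_kiocb *aiocb, struct iocb *iocb)
{
	struct poll_iocb *req = &aiocb->poll;
	struct kioctx *ctx = aiocb->ki_ctx;
	poll_table pt;
	__poll_t mask;

	/* phase 1: may block; _qproc queues req->wait on the file's head(s) */
	init_poll_funcptr(&pt, aio_poll_queue_proc);	/* assumed helper */
	mask = vfs_poll(req->file, &pt) & req->events;

	/* under the lock, only associate the iocb with the kioctx */
	spin_lock_irq(&ctx->ctx_lock);
	if (!mask)
		list_add_tail(&aiocb->ki_list, &ctx->active_reqs);
	spin_unlock_irq(&ctx->ctx_lock);

	/* phase 2: NULL poll table - recheck for an event that raced in */
	if (!mask)
		mask = vfs_poll(req->file, NULL) & req->events;
	if (mask)
		__aio_poll_complete(aiocb, mask);
	return 0;
}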
On Thu, Jun 28, 2018 at 03:35:03PM -0700, Linus Torvalds wrote:
> Yes, the AIO poll implementation did it under the spinlock.
>
> But there's no good *reason* for that. The "aio_poll()" function itself
> is called in perfectly fine blocking context.

aio_poll() is not a problem.  It's the wakeup callback that is one.

> As far as I can tell, Christoph could have just done the first pass
> '->poll()' *without* taking a spinlock, and that adds the table entry
> to the table. Then, *under the spinlock*, you associate the table with
> the kioctx. And then *after* the spinlock, you can call "->poll()"
> again (now with a NULL table pointer), to verify that the state is
> still not triggered. That's the whole point of the two-phase poll
> thing - the first phase adds the entry to the wait queues, and the
> second phase checks for the race of "did the event happen in the
> meantime".

You are misreading that mess.  What he's trying to do (other than surviving
the awful clusterfuck around cancels) is to handle the decision what to
report to userland right in the wakeup callback.  *That* is what really
drives the "make the second-pass ->poll() or something similar to it
non-blocking" (in addition to the fact that it is such in a considerable
majority of instances).
On Thu, Jun 28, 2018 at 3:49 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> aio_poll() is not a problem.  It's the wakeup callback that is one.

No, it's not a problem either.

The only thing the wakeup callback wants to do is to remove the wait
queue entry. And *that* doesn't need to sleep, and it has absolutely
nothing to do with whether "->poll()" itself sleeps or not.

All that needs to do is "poll_freewait()". Done. In practice, it's
probably only going to be a single free_poll_entry(), but who cares?
The AIO code should handle the "maybe ->poll() added multiple
wait-queues" just fine.

> You are misreading that mess.  What he's trying to do (other than surviving
> the awful clusterfuck around cancels) is to handle the decision what to
> report to userland right in the wakeup callback.  *That* is what really
> drives the "make the second-pass ->poll() or something similar to it
> non-blocking" (in addition to the fact that it is such in a considerable
> majority of instances).

That's just crazy BS.

Just call poll() again when you copy the data to userland (which by
definition can block, again).

Stop the idiotic "let's break poll for stupid AIO reasons, because the
AIO people are morons". Just make the AIO code do the sane thing.

                Linus
On Thu, Jun 28, 2018 at 03:55:35PM -0700, Linus Torvalds wrote:
> > You are misreading that mess.  What he's trying to do (other than surviving
> > the awful clusterfuck around cancels) is to handle the decision what to
> > report to userland right in the wakeup callback.  *That* is what really
> > drives the "make the second-pass ->poll() or something similar to it
> > non-blocking" (in addition to the fact that it is such in a considerable
> > majority of instances).
>
> That's just crazy BS.
>
> Just call poll() again when you copy the data to userland (which by
> definition can block, again).
>
> Stop the idiotic "let's break poll for stupid AIO reasons, because the
> AIO people are morons".

You underestimate the nastiness of that thing (and for the record, I'm
a lot *less* fond of AIO than you are, what with having had to read that
nest of horrors lately).  It does not "copy the data to userland"; what it
does instead is copying into an array of pages it keeps, right from
the IO completion callback.  In the read/write case, this

	ev_page = kmap_atomic(ctx->ring_pages[pos / AIO_EVENTS_PER_PAGE]);
	event = ev_page + pos % AIO_EVENTS_PER_PAGE;

	event->obj = (u64)(unsigned long)iocb->ki_user_iocb;
	event->data = iocb->ki_user_data;
	event->res = res;
	event->res2 = res2;

	kunmap_atomic(ev_page);
	flush_dcache_page(ctx->ring_pages[pos / AIO_EVENTS_PER_PAGE]);

is what does the copying.  And that might be done from IRQ context.
Yes, really.  They do have a slightly saner syscall that does copying
from the damn ring buffer, but its use is optional - userland can (and
does) direct read access to the mmapped buffer.  Single-consumer ABIs
suck and AIO is one such...

It could do schedule_work() and do the blocking stuff from that - it
does so, in case it can't grab ->ctx_lock.  An earlier iteration used
to try doing everything straight from the wakeup callback, and *that*
was racy as hell; I'd rather have Christoph explain which races he'd
been referring to, but there had been a whole lot of that.

The solution I suggested in the last round of that was to offload
__aio_poll_complete() via schedule_work() both for the cancel and poll
wakeup cases.  Doing the common case right from the poll wakeup callback
was argued to avoid noticeable overhead in the common situation - that's
what "aio: try to complete poll iocbs without context switch" is about.
I'm more than slightly unhappy about the lack of performance regression
testing in the non-AIO case...

At that point I would really like to see replies from Christoph - he's
on CET usually, no idea what his effective timezone is...
On Thu, Jun 28, 2018 at 4:37 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> You underestimate the nastiness of that thing (and for the record, I'm
> a lot *less* fond of AIO than you are, what with having had to read that
> nest of horrors lately).  It does not "copy the data to userland"; what it
> does instead is copying into an array of pages it keeps, right from
> the IO completion callback.

Ugh. Oh well.

I'd be perfectly happy to have somebody re-write and re-architect the
aio code entirely. Much rather that than the ->poll() code. Because I
know which one I think is well-designed with a nice usable interface,
and which one is a pile of shit.

In the meantime, if AIO wants to do poll() in some irq callback, may I
suggest just using workqueues.

                Linus
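A sketch of that shape: the wait-queue callback does nothing that can
block and punts the rest to process context.  The "work" member and
aio_poll_complete_work() are assumed names for illustration, not the
actual fs/aio.c code:

static int aio_poll_wake(struct wait_queue_entry *wait, unsigned mode,
			 int sync, void *key)
{
	struct poll_iocb *req = container_of(wait, struct poll_iocb, wait);

	/* may run in irq context: just detach from the queue and defer;
	 * req->work was set up at submission time with
	 * INIT_WORK(&req->work, aio_poll_complete_work) */
	list_del_init(&wait->entry);
	schedule_work(&req->work);
	return 1;
}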
On Thu, Jun 28, 2018 at 02:11:17PM -0700, Linus Torvalds wrote:
> Christoph, do you have a test program for IOCB_CMD_POLL and what it's
> actually supposed to do?

https://pagure.io/libaio/c/9c6935e81854d1585bbfa48c35b185849d746864?branch=aio-poll

is the actual test in libaio.  In addition to that the seastar library
actually has a real-life user.  But given that it is C++ with all the
modern bells and whistles you'll probably have as hard a time as me
actually understanding it.

> Because I think that what it can do is simply to do the ->poll() calls
> outside the iocb locks, and then just attach the poll table to the
> kioctx afterwards.

We could do that on the submit side fairly easily.  The problem is
really the completion side, where I'd much rather avoid introducing a
spurious context switch.  Right now even with a NULL qproc we can't
guarantee any of that.  So we'd need to schedule out to a workqueue,
and then from that schedule the potential multiple NULL-qproc calls,
which might actually block elsewhere even if __pollwait is never
called.

> This whole "poll must not block" is a complete red herring. It doesn't
> come from any other requirements than BAD AIO GARBAGE CODE.

It comes from wanting to avoid a totally pointless context switch.  The
aio code itself works just fine when called from a workqueue - we have
exactly that case when file systems do non-trivial operations in their
end_io handler.
On Thu, Jun 28, 2018 at 10:30:27PM +0100, Al Viro wrote:
> > Because I think that what it can do is simply to do the ->poll() calls
> > outside the iocb locks, and then just attach the poll table to the
> > kioctx afterwards.
>
> I'd do a bit more - embed the first poll_table_entry into poll iocb itself,
> so that the instances that use only one queue wouldn't need any allocations
> at all.

No need for poll_table_entry, we just need a wait_queue_head.
poll_table_entry is a select.c internal (except for two nasty drivers) -
neither epoll nor most in-kernel callers use it.
On Fri, Jun 29, 2018 at 6:29 AM Christoph Hellwig <hch@lst.de> wrote:
>
> No need for poll_table_entry, we just need a wait_queue_head.
> poll_table_entry is a select.c internal (except for two nasty drivers) -
> neither epoll nor most in-kernel callers use it.

Well, you need the poll_table for the "poll_wait()", and you would need
to be able to have multiple wait-queues even though you only have one
file you're waiting for.

But yes, it doesn't really need to be the full complex poll_table_entry
model.

                Linus
diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
index 2c391338c675..4d183e8258b6 100644
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -441,7 +441,6 @@ prototypes:
 	int (*iterate) (struct file *, struct dir_context *);
 	int (*iterate_shared) (struct file *, struct dir_context *);
 	__poll_t (*poll) (struct file *, struct poll_table_struct *);
-	struct wait_queue_head * (*get_poll_head)(struct file *, __poll_t);
 	__poll_t (*poll_mask) (struct file *, __poll_t);
 	long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
 	long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
@@ -505,8 +504,7 @@ in sys_read() and friends.
 	the lease within the individual filesystem to record the result of
 	the operation
 
--->poll_mask can be called with or without the waitqueue lock for the waitqueue
-returned from ->get_poll_head.
+->poll_mask can be called with or without the waitqueue lock
 
 --------------------------- dquot_operations -------------------------------
 prototypes:
diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index 829a7b7857a4..2d2f07acafa8 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -857,7 +857,6 @@ struct file_operations {
 	ssize_t (*write_iter) (struct kiocb *, struct iov_iter *);
 	int (*iterate) (struct file *, struct dir_context *);
 	__poll_t (*poll) (struct file *, struct poll_table_struct *);
-	struct wait_queue_head * (*get_poll_head)(struct file *, __poll_t);
 	__poll_t (*poll_mask) (struct file *, __poll_t);
 	long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
 	long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
@@ -903,16 +902,12 @@ otherwise noted.
 	activity on this file and (optionally) go to sleep until there is
 	activity. Called by the select(2) and poll(2) system calls
 
-  get_poll_head: Returns the struct wait_queue_head that callers can
-  wait on.  Callers need to check the returned events using ->poll_mask
-  once woken.  Can return NULL to indicate polling is not supported,
-  or any error code using the ERR_PTR convention to indicate that a
-  grave error occured and ->poll_mask shall not be called.
-
   poll_mask: return the mask of EPOLL* values describing the file descriptor
-  state.  Called either before going to sleep on the waitqueue returned by
-  get_poll_head, or after it has been woken.  If ->get_poll_head and
-  ->poll_mask are implemented ->poll does not need to be implement.
+  state.  Called either before going to sleep on ->f_poll_head, or after it
+  has been woken.  On files that support ->poll_mask the ->f_poll_head field
+  must point to a wait_queue_head that poll can wait on.  The waitqueue must
+  have the same (or a longer) life time as the struct file itself.
+  If ->poll_mask is implemented ->poll does not need to be implement.
 
   unlocked_ioctl: called by the ioctl(2) system call.
diff --git a/drivers/char/random.c b/drivers/char/random.c
index a8fb0020ba5c..e6260a9fa3b9 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -1875,10 +1875,10 @@ urandom_read(struct file *file, char __user *buf, size_t nbytes, loff_t *ppos)
 	return ret;
 }
 
-static struct wait_queue_head *
-random_get_poll_head(struct file *file, __poll_t events)
+static int random_open(struct inode *inode, struct file *file)
 {
-	return &random_wait;
+	file->f_poll_head = &random_wait;
+	return 0;
 }
 
 static __poll_t
@@ -1990,9 +1990,9 @@ static int random_fasync(int fd, struct file *filp, int on)
 }
 
 const struct file_operations random_fops = {
+	.open = random_open,
 	.read = random_read,
 	.write = random_write,
-	.get_poll_head = random_get_poll_head,
 	.poll_mask = random_poll_mask,
 	.unlocked_ioctl = random_ioctl,
 	.fasync = random_fasync,
diff --git a/fs/aio.c b/fs/aio.c
index e1d20124ec0e..84f498df8afc 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -168,7 +168,6 @@ struct fsync_iocb {
 struct poll_iocb {
 	struct file		*file;
 	__poll_t		events;
-	struct wait_queue_head	*head;
 
 	union {
 		struct wait_queue_entry	wait;
@@ -1632,7 +1631,7 @@ static int aio_poll_cancel(struct kiocb *iocb)
 {
 	struct aio_kiocb *aiocb = container_of(iocb, struct aio_kiocb, rw);
 	struct poll_iocb *req = &aiocb->poll;
-	struct wait_queue_head *head = req->head;
+	struct wait_queue_head *head = req->file->f_poll_head;
 	bool found = false;
 
 	spin_lock(&head->lock);
@@ -1655,7 +1654,7 @@ static int aio_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync,
 	struct file *file = req->file;
 	__poll_t mask = key_to_poll(key);
 
-	assert_spin_locked(&req->head->lock);
+	assert_spin_locked(&file->f_poll_head->lock);
 
 	/* for instances that support it check for an event match first: */
 	if (mask && !(mask & req->events))
@@ -1703,30 +1702,21 @@ static ssize_t aio_poll(struct aio_kiocb *aiocb, struct iocb *iocb)
 	req->file = fget(iocb->aio_fildes);
 	if (unlikely(!req->file))
 		return -EBADF;
-	if (!file_has_poll_mask(req->file))
+	if (!req->file->f_poll_head)
 		goto out_fail;
 
-	req->head = req->file->f_op->get_poll_head(req->file, req->events);
-	if (!req->head)
-		goto out_fail;
-	if (IS_ERR(req->head)) {
-		mask = EPOLLERR;
-		goto done;
-	}
-
 	init_waitqueue_func_entry(&req->wait, aio_poll_wake);
 	aiocb->ki_cancel = aio_poll_cancel;
 
 	spin_lock_irq(&ctx->ctx_lock);
-	spin_lock(&req->head->lock);
+	spin_lock(&req->file->f_poll_head->lock);
 	mask = req->file->f_op->poll_mask(req->file, req->events) & req->events;
 	if (!mask) {
-		__add_wait_queue(req->head, &req->wait);
+		__add_wait_queue(req->file->f_poll_head, &req->wait);
 		list_add_tail(&aiocb->ki_list, &ctx->active_reqs);
 	}
-	spin_unlock(&req->head->lock);
+	spin_unlock(&req->file->f_poll_head->lock);
 	spin_unlock_irq(&ctx->ctx_lock);
-done:
 	if (mask)
 		__aio_poll_complete(aiocb, mask);
 	return 0;
diff --git a/fs/eventfd.c b/fs/eventfd.c
index ceb1031f1cac..8904b127577b 100644
--- a/fs/eventfd.c
+++ b/fs/eventfd.c
@@ -101,14 +101,6 @@ static int eventfd_release(struct inode *inode, struct file *file)
 	return 0;
 }
 
-static struct wait_queue_head *
-eventfd_get_poll_head(struct file *file, __poll_t events)
-{
-	struct eventfd_ctx *ctx = file->private_data;
-
-	return &ctx->wqh;
-}
-
 static __poll_t eventfd_poll_mask(struct file *file, __poll_t eventmask)
 {
 	struct eventfd_ctx *ctx = file->private_data;
@@ -311,7 +303,6 @@ static const struct file_operations eventfd_fops = {
 	.show_fdinfo	= eventfd_show_fdinfo,
 #endif
 	.release	= eventfd_release,
-	.get_poll_head	= eventfd_get_poll_head,
 	.poll_mask	= eventfd_poll_mask,
 	.read		= eventfd_read,
 	.write		= eventfd_write,
@@ -390,7 +381,8 @@ EXPORT_SYMBOL_GPL(eventfd_ctx_fileget);
 static int do_eventfd(unsigned int count, int flags)
 {
 	struct eventfd_ctx *ctx;
-	int fd;
+	struct file *file;
+	int fd, error;
 
 	/* Check the EFD_* constants for consistency.  */
 	BUILD_BUG_ON(EFD_CLOEXEC != O_CLOEXEC);
@@ -408,12 +400,27 @@ static int do_eventfd(unsigned int count, int flags)
 	ctx->count = count;
 	ctx->flags = flags;
 
-	fd = anon_inode_getfd("[eventfd]", &eventfd_fops, ctx,
+	error = get_unused_fd_flags(O_RDWR | (flags & EFD_SHARED_FCNTL_FLAGS));
+	if (error < 0)
+		goto err_free_ctx;
+	fd = error;
+
+	file = anon_inode_getfile("[eventfd]", &eventfd_fops, ctx,
 			      O_RDWR | (flags & EFD_SHARED_FCNTL_FLAGS));
-	if (fd < 0)
-		eventfd_free_ctx(ctx);
+	if (IS_ERR(file)) {
+		error = PTR_ERR(file);
+		goto err_put_unused_fd;
+	}
 
+	file->f_poll_head = &ctx->wqh;
+	fd_install(fd, file);
 	return fd;
+
+err_put_unused_fd:
+	put_unused_fd(fd);
+err_free_ctx:
+	eventfd_free_ctx(ctx);
+	return error;
 }
 
 SYSCALL_DEFINE2(eventfd2, unsigned int, count, int, flags)
diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index ea4436f409fb..a0d199be91db 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -922,13 +922,6 @@ static __poll_t ep_read_events_proc(struct eventpoll *ep, struct list_head *head
 	return 0;
 }
 
-static struct wait_queue_head *ep_eventpoll_get_poll_head(struct file *file,
-		__poll_t eventmask)
-{
-	struct eventpoll *ep = file->private_data;
-	return &ep->poll_wait;
-}
-
 static __poll_t ep_eventpoll_poll_mask(struct file *file, __poll_t eventmask)
 {
 	struct eventpoll *ep = file->private_data;
@@ -972,7 +965,6 @@ static const struct file_operations eventpoll_fops = {
 	.show_fdinfo	= ep_show_fdinfo,
 #endif
 	.release	= ep_eventpoll_release,
-	.get_poll_head	= ep_eventpoll_get_poll_head,
 	.poll_mask	= ep_eventpoll_poll_mask,
 	.llseek		= noop_llseek,
 };
@@ -1973,6 +1965,7 @@ static int do_epoll_create(int flags)
 		goto out_free_fd;
 	}
 	ep->file = file;
+	file->f_poll_head = &ep->poll_wait;
 	fd_install(fd, file);
 	return fd;
diff --git a/fs/pipe.c b/fs/pipe.c
index bb0840e234f3..6817a60bebc9 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -509,14 +509,6 @@ static long pipe_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
 	}
 }
 
-static struct wait_queue_head *
-pipe_get_poll_head(struct file *filp, __poll_t events)
-{
-	struct pipe_inode_info *pipe = filp->private_data;
-
-	return &pipe->wait;
-}
-
 /* No kernel lock held - fine */
 static __poll_t pipe_poll_mask(struct file *filp, __poll_t events)
 {
@@ -768,6 +760,7 @@ int create_pipe_files(struct file **res, int flags)
 
 	f->f_flags = O_WRONLY | (flags & (O_NONBLOCK | O_DIRECT));
 	f->private_data = inode->i_pipe;
+	f->f_poll_head = &inode->i_pipe->wait;
 
 	res[0] = alloc_file(&path, FMODE_READ, &pipefifo_fops);
 	if (IS_ERR(res[0])) {
@@ -778,6 +771,7 @@ int create_pipe_files(struct file **res, int flags)
 	path_get(&path);
 	res[0]->private_data = inode->i_pipe;
 	res[0]->f_flags = O_RDONLY | (flags & O_NONBLOCK);
+	res[0]->f_poll_head = &inode->i_pipe->wait;
 	res[1] = f;
 	return 0;
@@ -930,6 +924,7 @@ static int fifo_open(struct inode *inode, struct file *filp)
 
 	/* We can only do regular read/write on fifos */
 	filp->f_mode &= (FMODE_READ | FMODE_WRITE);
+	filp->f_poll_head = &pipe->wait;
 
 	switch (filp->f_mode) {
 	case FMODE_READ:
@@ -1023,7 +1018,6 @@ const struct file_operations pipefifo_fops = {
 	.llseek		= no_llseek,
 	.read_iter	= pipe_read,
 	.write_iter	= pipe_write,
-	.get_poll_head	= pipe_get_poll_head,
 	.poll_mask	= pipe_poll_mask,
 	.unlocked_ioctl	= pipe_ioctl,
 	.release	= pipe_release,
diff --git a/fs/select.c b/fs/select.c
index 317891ff8165..26e20063454a 100644
--- a/fs/select.c
+++ b/fs/select.c
@@ -38,20 +38,10 @@ __poll_t vfs_poll(struct file *file, struct poll_table_struct *pt)
 {
 	if (file->f_op->poll) {
 		return file->f_op->poll(file, pt);
-	} else if (file_has_poll_mask(file)) {
-		unsigned int events = poll_requested_events(pt);
-		struct wait_queue_head *head;
-
-		if (pt && pt->_qproc) {
-			head = file->f_op->get_poll_head(file, events);
-			if (!head)
-				return DEFAULT_POLLMASK;
-			if (IS_ERR(head))
-				return EPOLLERR;
-			pt->_qproc(file, head, pt);
-		}
-
-		return file->f_op->poll_mask(file, events);
+	} else if (file->f_poll_head) {
+		if (pt && pt->_qproc)
+			pt->_qproc(file, file->f_poll_head, pt);
+		return file->f_op->poll_mask(file, poll_requested_events(pt));
 	} else {
 		return DEFAULT_POLLMASK;
 	}
diff --git a/fs/timerfd.c b/fs/timerfd.c
index d84a2bee4f82..08e820166c4d 100644
--- a/fs/timerfd.c
+++ b/fs/timerfd.c
@@ -227,14 +227,6 @@ static int timerfd_release(struct inode *inode, struct file *file)
 	return 0;
 }
 
-static struct wait_queue_head *timerfd_get_poll_head(struct file *file,
-		__poll_t eventmask)
-{
-	struct timerfd_ctx *ctx = file->private_data;
-
-	return &ctx->wqh;
-}
-
 static __poll_t timerfd_poll_mask(struct file *file, __poll_t eventmask)
 {
 	struct timerfd_ctx *ctx = file->private_data;
@@ -363,7 +355,6 @@ static long timerfd_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 
 static const struct file_operations timerfd_fops = {
 	.release	= timerfd_release,
-	.get_poll_head	= timerfd_get_poll_head,
 	.poll_mask	= timerfd_poll_mask,
 	.read		= timerfd_read,
 	.llseek		= noop_llseek,
@@ -386,8 +377,9 @@ static int timerfd_fget(int fd, struct fd *p)
 
 SYSCALL_DEFINE2(timerfd_create, int, clockid, int, flags)
 {
-	int ufd;
+	int ufd, error;
 	struct timerfd_ctx *ctx;
+	struct file *file;
 
 	/* Check the TFD_* constants for consistency.  */
 	BUILD_BUG_ON(TFD_CLOEXEC != O_CLOEXEC);
@@ -424,12 +416,25 @@ SYSCALL_DEFINE2(timerfd_create, int, clockid, int, flags)
 
 	ctx->moffs = ktime_mono_to_real(0);
 
-	ufd = anon_inode_getfd("[timerfd]", &timerfd_fops, ctx,
+	error = get_unused_fd_flags(O_RDWR | (flags & TFD_SHARED_FCNTL_FLAGS));
+	if (error < 0)
+		goto out_free_ctx;
+	ufd = error;
+
+	file = anon_inode_getfile("[timerfd]", &timerfd_fops, ctx,
 			       O_RDWR | (flags & TFD_SHARED_FCNTL_FLAGS));
-	if (ufd < 0)
-		kfree(ctx);
+	if (IS_ERR(file)) {
+		error = PTR_ERR(file);
+		goto err_put_unused_fd;
+	}
 
+	file->f_poll_head = &ctx->wqh;
+	fd_install(ufd, file);
 	return ufd;
+
+err_put_unused_fd:
+	put_unused_fd(ufd);
+out_free_ctx:
+	return error;
 }
 
 static int do_timerfd_settime(int ufd, int flags,
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 5c91108846db..cc0fb83d3743 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -878,6 +878,7 @@ struct file {
 	loff_t			f_pos;
 	struct fown_struct	f_owner;
 	const struct cred	*f_cred;
+	struct wait_queue_head	*f_poll_head;
 	struct file_ra_state	f_ra;
 
 	u64			f_version;
@@ -1720,7 +1721,6 @@ struct file_operations {
 	int (*iterate) (struct file *, struct dir_context *);
 	int (*iterate_shared) (struct file *, struct dir_context *);
 	__poll_t (*poll) (struct file *, struct poll_table_struct *);
-	struct wait_queue_head * (*get_poll_head)(struct file *, __poll_t);
 	__poll_t (*poll_mask) (struct file *, __poll_t);
 	long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
 	long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
diff --git a/include/linux/poll.h b/include/linux/poll.h
index fdf86b4cbc71..e3dd9dfee20a 100644
--- a/include/linux/poll.h
+++ b/include/linux/poll.h
@@ -74,14 +74,9 @@ static inline void init_poll_funcptr(poll_table *pt, poll_queue_proc qproc)
 	pt->_key   = ~(__poll_t)0; /* all events enabled */
 }
 
-static inline bool file_has_poll_mask(struct file *file)
-{
-	return file->f_op->get_poll_head && file->f_op->poll_mask;
-}
-
 static inline bool file_can_poll(struct file *file)
 {
-	return file->f_op->poll || file_has_poll_mask(file);
+	return file->f_op->poll || file->f_poll_head;
 }
 
 __poll_t vfs_poll(struct file *file, struct poll_table_struct *pt);
diff --git a/net/socket.c b/net/socket.c
index 4354afe8ad96..155d4ba38645 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -117,8 +117,6 @@ static ssize_t sock_write_iter(struct kiocb *iocb, struct iov_iter *from);
 static int sock_mmap(struct file *file, struct vm_area_struct *vma);
 static int sock_close(struct inode *inode, struct file *file);
-static struct wait_queue_head *sock_get_poll_head(struct file *file,
-		__poll_t events);
 static __poll_t sock_poll_mask(struct file *file, __poll_t);
 static __poll_t sock_poll(struct file *file, struct poll_table_struct *wait);
 static long sock_ioctl(struct file *file, unsigned int cmd, unsigned long arg);
@@ -143,7 +141,6 @@ static const struct file_operations socket_file_ops = {
 	.llseek =	no_llseek,
 	.read_iter =	sock_read_iter,
 	.write_iter =	sock_write_iter,
-	.get_poll_head = sock_get_poll_head,
 	.poll_mask =	sock_poll_mask,
 	.poll =		sock_poll,
 	.unlocked_ioctl = sock_ioctl,
@@ -422,6 +419,8 @@ struct file *sock_alloc_file(struct socket *sock, int flags, const char *dname)
 
 	sock->file = file;
 	file->f_flags = O_RDWR | (flags & O_NONBLOCK);
+	if (sock->ops->poll_mask)
+		file->f_poll_head = &sock->wq->wait;
 	file->private_data = sock;
 	return file;
 }
@@ -1128,16 +1127,6 @@ int sock_create_lite(int family, int type, int protocol, struct socket **res)
 }
 EXPORT_SYMBOL(sock_create_lite);
 
-static struct wait_queue_head *sock_get_poll_head(struct file *file,
-		__poll_t events)
-{
-	struct socket *sock = file->private_data;
-
-	if (!sock->ops->poll_mask)
-		return NULL;
-	return &sock->wq->wait;
-}
-
 static __poll_t sock_poll_mask(struct file *file, __poll_t events)
 {
 	struct socket *sock = file->private_data;
@@ -1167,7 +1156,7 @@ static __poll_t sock_poll(struct file *file, poll_table *wait)
 	if (sock->ops->poll) {
 		mask = sock->ops->poll(file, sock, wait);
 	} else if (sock->ops->poll_mask) {
-		sock_poll_wait(file, sock_get_poll_head(file, events), wait);
+		sock_poll_wait(file, &sock->wq->wait, wait);
 		mask = sock->ops->poll_mask(sock, events);
 	}
The doubling of the indirect calls caused a too big regression for some
benchmarks in our post-spectre world.  With some of the nastiness in the
networking code moved out of the way we can instead stick a pointer to a
waitqueue into struct file and avoid one of the indirect calls.

This actually happens to simplify the code quite a bit as well.

Note that this removes the possibility of actually returning an error
before waiting in poll.  We could still do this with an ERR_PTR in
f_poll_head with a little bit of WRITE_ONCE/READ_ONCE magic, but I'd
like to defer that until actually required.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 Documentation/filesystems/Locking |  4 +---
 Documentation/filesystems/vfs.txt | 15 +++++---------
 drivers/char/random.c             |  8 ++++----
 fs/aio.c                          | 22 ++++++---------------
 fs/eventfd.c                      | 33 +++++++++++++++++++------------
 fs/eventpoll.c                    |  9 +--------
 fs/pipe.c                         | 12 +++---------
 fs/select.c                       | 18 ++++-------------
 fs/timerfd.c                      | 31 +++++++++++++++++-----------
 include/linux/fs.h                |  2 +-
 include/linux/poll.h              |  7 +------
 net/socket.c                      | 17 +++-------------
 12 files changed, 67 insertions(+), 111 deletions(-)