Message ID | 20200721063258.17140-1-mhocko@kernel.org (mailing list archive)
---|---
State | New, archived
Series | [RFC] mm: silence soft lockups from unlock_page
On Tue 21-07-20 07:10:14, Qian Cai wrote:
> 
> 
> > On Jul 21, 2020, at 2:33 AM, Michal Hocko <mhocko@kernel.org> wrote:
> > 
> > on a large ppc machine. The very likely cause is a suboptimal
> > configuration when systemd-udev spawns way too many workers to bring
> > the system up.
> 
> This is strange. The problem description is missing quite a few
> important details. For example, what systems exactly are those? How
> many CPUs, memory and NUMA nodes were you talking about?

Are these really important? I believe I can dig that out from the bug
report but I didn't really consider that important enough.

> Which kernel version was it reported?

It is a SLES 4.12 based kernel with the said commit backported. The page
lock internals are thus in line with the upstream kernel.

> How many workers from systemd-udev were “misconfigured”?

I do not know that information. I believe that it used whatever systemd
comes with as a default. And that can be a lot.

Do you have any actual feedback on the patch?
On Tue 21-07-20 07:44:07, Qian Cai wrote:
> 
> 
> > On Jul 21, 2020, at 7:25 AM, Michal Hocko <mhocko@kernel.org> wrote:
> > 
> > Are these really important? I believe I can dig that out from the bug
> > report but I didn't really consider that important enough.
> 
> Please dig them out. We have also been running those things on
> “large” powerpc as well and never saw such soft-lockups. Those
> details may give us some clues about the actual problem.

I strongly suspect this is not really relevant, but just FYI this is a
16-node, 11.9TB system with 1536 CPUs.

> Once we
> understand the problem better, we may judge if this “hack” is
> really worth it.

I do not have access to the machine so I can only judge from the boot
log I have in hand. And from that it is pretty clear that

$ grep BUG tmp/attachment.txt | wc -l
896
$ grep BUG tmp/attachment.txt | grep "\[systemd" | wc -l
860
$ grep do_fault+0x448 tmp/attachment.txt | wc -l
860

the boot struggles, the lockups happen in udev workers, and most of them
are stuck at the very same place, which is unlock_page. The rest is part
of the changelog.
On Tue 21-07-20 09:23:44, Qian Cai wrote:
> On Tue, Jul 21, 2020 at 02:17:52PM +0200, Michal Hocko wrote:
> > On Tue 21-07-20 07:44:07, Qian Cai wrote:
> > > 
> > > 
> > > > On Jul 21, 2020, at 7:25 AM, Michal Hocko <mhocko@kernel.org> wrote:
> > > > 
> > > > Are these really important? I believe I can dig that out from the bug
> > > > report but I didn't really consider that important enough.
> > > 
> > > Please dig them out. We have also been running those things on
> > > “large” powerpc as well and never saw such soft-lockups. Those
> > > details may give us some clues about the actual problem.
> > 
> > I strongly suspect this is not really relevant but just FYI this is
> > a 16-node, 11.9TB system with 1536 CPUs.
> 
> Okay, we are now talking about the HPC special case. Just brain-storming some
> ideas here.
> 
> 1) What about increasing the soft-lockup threshold early at boot and restoring
> it afterwards? As far as I can tell, those soft-lockups are just a few bursts
> of things and then cure themselves after booting.

Is this really a better option than silencing the soft lockup from the
code itself? What if the same access pattern happens later on?

> 2) Reading through the comments above page_waitqueue(), it said rare hash
> collisions could happen, so it sounds like in this HPC case it is rather easy
> to hit those hash collisions. Thus, need to deal with that instead?

As all of those seem to be the same class of process, I suspect it is
more likely that many processes are hitting the page fault on the same
file page, e.g. a code/library page.

> 3) The commit 62906027091f ("mm: add PageWaiters indicating tasks are waiting
> for a page bit") mentioned that,
> 
> "Putting two bits in the same word opens the opportunity to remove the memory
> barrier between clearing the lock bit and testing the waiters bit, after some
> work on the arch primitives (e.g., ensuring memory operand widths match and
> cover both bits)."
> 
> Do you happen to know if this only happens on powerpc?

I have only seen this single instance on that machine. I do not think
this is very much HW specific, but the ppc platform is likely more prone
to it. Just think of the memory itself. Each memory block is notified
via udev, and ppc has very small memblocks (16M to 256M). x86 will use
2G blocks on large machines.

> Also, we probably need to
> dig out whether that memory barrier is still there and could be removed to
> speed things up.

I would be really surprised if memory barriers mattered much here. It
sounds much more likely that there is the same underlying problem as in
11a19c7b099f. There are just too many waiters on the page. That commit
prevents just the hard lockup part of the problem by dropping the lock
and continuing after the bookmark. But, as mentioned in the changelog,
cond_resched is not really an option because this path is called from
atomic context as well. So !PREEMPT kernels are still in the same boat.

I might have misunderstood something, of course, and would like to hear
where my thinking is wrong.
I understand the pragmatic considerations here, but I'm quite concerned
about the maintainability and long-term ability to reason about a patch
like this. For example, how do we know when this patch is safe to
remove? Also, what other precedent does this set for us covering for
poor userspace behaviour?

Speaking as a systemd maintainer, if udev could be doing something
better on these machines, we'd be more than receptive to help fix it. In
general I am against explicit watchdog tweaking here because a.) there's
potential to mask other problems, and b.) it seems like the kind of
one-off trivia nobody is going to remember exists when doing complex
debugging in future.

Is there anything preventing this being remedied in udev, instead of the
kernel?
On Tue 21-07-20 15:17:49, Chris Down wrote:
> I understand the pragmatic considerations here, but I'm quite concerned
> about the maintainability and long-term ability to reason about a patch like
> this. For example, how do we know when this patch is safe to remove? Also,
> what other precedent does this set for us covering for poor userspace
> behaviour?
> 
> Speaking as a systemd maintainer, if udev could be doing something better on
> these machines, we'd be more than receptive to help fix it. In general I am
> against explicit watchdog tweaking here because a.) there's potential to
> mask other problems, and b.) it seems like the kind of one-off trivia nobody
> is going to remember exists when doing complex debugging in future.
> 
> Is there anything preventing this being remedied in udev, instead of the
> kernel?

Yes, I believe there is a configuration option to cap the maximum number
of workers. This is not my area, but my understanding is that the
maximum is tuned based on the available memory and/or cpus. We have been
hit by this quite heavily on SLES. Maybe newer versions of systemd have
better tuning.

But it seems that udev is just a messenger here. There is nothing really
fundamentally udev specific in the underlying problem, unless I am
missing something. It is quite possible that this could be triggered by
other userspace which happens to fire many workers at the same time, all
contending on a shared page.

Not that I like this workaround in the first place, but it seems that
the existing code allows very long wait chains, and !PREEMPT kernels
simply do not have any scheduling point for a potentially long time. I
believe we should focus on that, even if systemd, as the current
trigger, can be tuned better. I do not insist on this patch, hence RFC,
but I am simply not seeing a much better, yet not convoluted, solution.
On Mon, Jul 20, 2020 at 11:33 PM Michal Hocko <mhocko@kernel.org> wrote:
>
> The lockup is in page_unlock in do_read_fault and I suspect that this is
> yet another effect of a very long waitqueue chain which has been
> addressed by 11a19c7b099f ("sched/wait: Introduce wakeup boomark in
> wake_up_page_bit") previously.

Hmm.

I do not believe that you can actually get to the point where you have a
million waiters and it takes 20+ seconds to wake everybody up.

More likely, it's actually *caused* by that commit 11a19c7b099f, and
what might be happening is that other CPU's are just adding new waiters
to the list *while* we're waking things up, because somebody else
already got the page lock again.

Humor me.. Does something like this work instead?

It's whitespace-damaged because of just a cut-and-paste, but it's
entirely untested, and I haven't really thought about any memory
ordering issues, but I think it's ok.

The logic is that anybody who called wake_up_page_bit() _must_ have
cleared that bit before that. So if we ever see it set again (and memory
ordering doesn't matter), then clearly somebody else got access to the
page bit (whichever it was), and we should not

 (a) waste time waking up people who can't get the bit anyway

 (b) be in a livelock where other CPU's continually add themselves to
     the wait queue because somebody else got the bit.

and it's that (b) case that I think happens for you.

NOTE! Totally UNTESTED patch follows. I think it's good, but maybe
somebody sees some problem with this approach? I realize that people can
wait for other bits than the unlocked one, and if you're waiting for
writeback to complete maybe you don't care if somebody else then started
writeback *AGAIN* on the page and you'd actually want to be woken up
regardless, but honestly, I don't think it really matters.

                Linus

--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1054,6 +1054,15 @@ static void wake_up_page_bit(struct page *page, int bit_nr)
 		 * from wait queue
 		 */
 		spin_unlock_irqrestore(&q->lock, flags);
+
+		/*
+		 * If somebody else set the bit again, stop waking
+		 * people up. It's now the responsibility of that
+		 * other page bit owner to do so.
+		 */
+		if (test_bit(bit_nr, &page->flags))
+			return;
+
 		cpu_relax();
 		spin_lock_irqsave(&q->lock, flags);
 		__wake_up_locked_key_bookmark(q, TASK_NORMAL, &key, &bookmark);
On Tue 21-07-20 08:33:33, Linus Torvalds wrote:
> On Mon, Jul 20, 2020 at 11:33 PM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > The lockup is in page_unlock in do_read_fault and I suspect that this is
> > yet another effect of a very long waitqueue chain which has been
> > addressed by 11a19c7b099f ("sched/wait: Introduce wakeup boomark in
> > wake_up_page_bit") previously.
>
> Hmm.
>
> I do not believe that you can actually get to the point where you have
> a million waiters and it takes 20+ seconds to wake everybody up.

I was really surprised as well!

> More likely, it's actually *caused* by that commit 11a19c7b099f, and
> what might be happening is that other CPU's are just adding new
> waiters to the list *while* we're waking things up, because somebody
> else already got the page lock again.
>
> Humor me.. Does something like this work instead? It's
> whitespace-damaged because of just a cut-and-paste, but it's entirely
> untested, and I haven't really thought about any memory ordering
> issues, but I think it's ok.
>
> The logic is that anybody who called wake_up_page_bit() _must_ have
> cleared that bit before that. So if we ever see it set again (and
> memory ordering doesn't matter), then clearly somebody else got access
> to the page bit (whichever it was), and we should not
>
>  (a) waste time waking up people who can't get the bit anyway
>
>  (b) be in a livelock where other CPU's continually add themselves to
>      the wait queue because somebody else got the bit.
>
> and it's that (b) case that I think happens for you.
>
> NOTE! Totally UNTESTED patch follows. I think it's good, but maybe
> somebody sees some problem with this approach?

I can ask them to give it a try.
On Tue, Jul 21, 2020 at 8:33 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> More likely, it's actually *caused* by that commit 11a19c7b099f, and
> what might be happening is that other CPU's are just adding new
> waiters to the list *while* we're waking things up, because somebody
> else already got the page lock again.
>
> Humor me.. Does something like this work instead?

I went back and looked at this, because it bothered me. And I'm no
longer convinced it can possibly make a difference.

Why?

Because __wake_up_locked_key_bookmark() just calls __wake_up_common(),
and that one checks the return value of the wakeup function:

                ret = curr->func(curr, mode, wake_flags, key);
                if (ret < 0)
                        break;

and will not add the bookmark back to the list if this triggers.

And the wakeup function does that same "stop walking" thing:

        if (test_bit(key->bit_nr, &key->page->flags))
                return -1;

So if somebody else took the page lock, I think we should already have
stopped walking the list.

Of course, the page waitqueue hash table is very small. It's only 256
entries. So maybe the list is basically all aliases for another page
entirely that is being hammered by that load, and we're just unlucky.

Because the wakeup function only does that "stop walking" if the page
key matched. So wait queue entries for another page that just hashes to
the same bucket (or even the same page, but a different bit in the page)
will confuse that logic.

Hmm.

I still can't see how you'd get so many entries (without re-adding them)
that you'd hit the softlockup timer. So I wonder if maybe we really do
hit the "aliasing with a really hot page that gets re-added in the page
wait table" case, but it seems a bit contrived.

So I think that patch is still worth testing, but I'm not quite as
hopeful about it as I was originally.

I do wonder if we should make that PAGE_WAIT_TABLE_SIZE be larger. 256
entries seems potentially ridiculously small, and aliasing not only
increases the waitqueue length, it also potentially causes more
contention on the waitqueue spinlock (which is already likely seeing
some false sharing on a cacheline basis due to the fairly dense array of
waitqueue entries: wait_queue_head is intentionally fairly small and
dense unless you have lots of spinlock debugging options enabled).

That hashed wait-queue size is an independent issue, though. But it
might be part of "some loads can get into some really nasty behavior in
corner cases".

              Linus
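For reference, the hashed page wait table being discussed looks roughly
like this in mm/filemap.c of this era (paraphrased, not quoted verbatim
from any particular tree):

    #define PAGE_WAIT_TABLE_BITS 8
    #define PAGE_WAIT_TABLE_SIZE (1 << PAGE_WAIT_TABLE_BITS)
    static wait_queue_head_t page_wait_table[PAGE_WAIT_TABLE_SIZE] __cacheline_aligned;

    static wait_queue_head_t *page_waitqueue(struct page *page)
    {
            /* every page in the system hashes into one of only 256 buckets */
            return &page_wait_table[hash_ptr(page, PAGE_WAIT_TABLE_BITS)];
    }

Two hot but unrelated pages that hash to the same bucket share one
wait_queue_head_t (and its spinlock), which is the aliasing Linus is
worrying about above.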
On Wed, 22 Jul 2020, Linus Torvalds wrote:
>
> I do wonder if we should make that PAGE_WAIT_TABLE_SIZE be larger. 256
> entries seems potentially ridiculously small, and aliasing not only
> increases the waitqueue length, it also potentially causes more
> contention on the waitqueue spinlock (which is already likely seeing
> some false sharing on a cacheline basis due to the fairly dense array
> of waitqueue entries: wait_queue_head is intentionally fairly small
> and dense unless you have lots of spinlock debugging options enabled).
>
> That hashed wait-queue size is an independent issue, though. But it
> might be part of "some loads can get into some really nasty behavior
> in corner cases"

I don't think we've ever suffered from unlock_page() softlockups when
booting, as Michal reports; but we do have a forkbomby test (and a few
others) which occasionally suffer that way, on machines with many cpus.

We run with the three imperfect patches appended below, which together
seemed to improve, but not entirely solve, the situation:

mm,sched: make page_wait_table[] four times bigger
mm,sched: wait_on_page_bit_common() add back to head
mm,sched: __wake_up_common() break only after waking

 kernel/sched/wait.c |  5 ++++-
 mm/filemap.c        | 10 ++++++++--
 2 files changed, 12 insertions(+), 3 deletions(-)

I've rediffed them to 5.8-rc6, and checked that they build and sustain a
load for a few minutes: so they're not entirely ridiculous on latest
kernels, but certainly not thoroughly retested.

I'm rather desperate to stay out of the discussion here, given how far
behind I am on responding to other threads; and may not be able to
defend them beyond what I said in the original commit messages. But
seeing this thread, thought I'd better put them up for your eyes.

(And, no, I don't think I have a Copyright on changing an 8 to a 10:
you've probably gone further already; just included for completeness.)

Hugh

[PATCH] mm,sched: make page_wait_table[] four times bigger

Current page_wait_table[] size is 256 entries: bump that up to 1024
to reduce latency from frequent page hash collisions.

No science in choosing fourfold: just "a little bigger".

Signed-off-by: Hugh Dickins <hughd@google.com>
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -968,7 +968,7 @@ EXPORT_SYMBOL(__page_cache_alloc);
  * at a cost of "thundering herd" phenomena during rare hash
  * collisions.
  */
-#define PAGE_WAIT_TABLE_BITS 8
+#define PAGE_WAIT_TABLE_BITS 10
 #define PAGE_WAIT_TABLE_SIZE (1 << PAGE_WAIT_TABLE_BITS)
 static wait_queue_head_t page_wait_table[PAGE_WAIT_TABLE_SIZE] __cacheline_aligned;

[PATCH] mm,sched: wait_on_page_bit_common() add back to head

wait_on_page_bit_common() always adds to tail of wait queue. That is of
course right when adding an entry for the first time; but when woken,
and bit found already set, so adding back again? Looks unfair: it would
get placed after recent arrivals, and in danger of indefinite
starvation.

Instead, add back to head on repeat passes: not quite right either, but
if that happens again and again, the ordering will be reversed each
time, so it should work out reasonably fair.
Signed-off-by: Hugh Dickins <hughd@google.com>
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1109,6 +1109,7 @@ static inline int wait_on_page_bit_commo
 	struct wait_page_queue wait_page;
 	wait_queue_entry_t *wait = &wait_page.wait;
 	bool bit_is_set;
+	bool first_time = true;
 	bool thrashing = false;
 	bool delayacct = false;
 	unsigned long pflags;
@@ -1134,7 +1135,12 @@ static inline int wait_on_page_bit_commo
 		spin_lock_irq(&q->lock);

 		if (likely(list_empty(&wait->entry))) {
-			__add_wait_queue_entry_tail(q, wait);
+			if (first_time) {
+				__add_wait_queue_entry_tail(q, wait);
+				first_time = false;
+			} else {
+				__add_wait_queue(q, wait);
+			}
 			SetPageWaiters(page);
 		}

[PATCH] mm,sched: __wake_up_common() break only after waking

4.14 commit 2554db916586 ("sched/wait: Break up long wake list walk")
added WQ_FLAG_BOOKMARK early breakout from __wake_up_common(): it lets
the waker drop and retake the irq-safe wait queue lock every 64 entries.
It was introduced to handle an Intel customer issue with long page wait
queues, but it also applies to all paths using __wake_up_common_lock().

It would probably be more useful to our non-preemptible kernel if it
could do a cond_resched() along with its cpu_relax(); but although most
unlock_page() calls would be fine with that, some would not -
page_endio(), for example; and it would be a big job to sort them, and
it is not clear that doing some but not others would really help anyway.

A patch that I've been running with, that does help somewhat to reduce
those unlock_page() softlockups, is this weakening of 2554db916586:
don't break out until at least one wakeup has been issued for the page.
In the worst case (waking a page at the very end of a hash queue shared
with many waiters on another page), this would simply revert to the old
behavior, where there was no WQ_FLAG_BOOKMARK early breakout at all.

Whilst it did not set out to change behavior for __wake_up_common_lock()
users, review suggests that this change is probably good for them too.

Signed-off-by: Hugh Dickins <hughd@google.com>
--- a/kernel/sched/wait.c
+++ b/kernel/sched/wait.c
@@ -68,6 +68,7 @@ static int __wake_up_common(struct wait_
 	wait_queue_entry_t *curr, *next;
+	int woken = 0;
 	int cnt = 0;

 	lockdep_assert_held(&wq_head->lock);
@@ -95,8 +96,10 @@ static int __wake_up_common(struct wait_
 			break;
 		if (ret && (flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)
 			break;
+		if (ret)
+			woken++;

-		if (bookmark && (++cnt > WAITQUEUE_WALK_BREAK_CNT) &&
+		if (bookmark && (++cnt > WAITQUEUE_WALK_BREAK_CNT) && woken &&
 				(&next->entry != &wq_head->head)) {
 			bookmark->flags = WQ_FLAG_BOOKMARK;
 			list_add_tail(&bookmark->entry, &next->entry);
On Wed, Jul 22, 2020 at 2:29 PM Hugh Dickins <hughd@google.com> wrote:
>
> -#define PAGE_WAIT_TABLE_BITS 8
> +#define PAGE_WAIT_TABLE_BITS 10

Well, that seems harmless even on small machines.

> +       bool first_time = true;
>         bool thrashing = false;
>         bool delayacct = false;
>         unsigned long pflags;
> @@ -1134,7 +1135,12 @@ static inline int wait_on_page_bit_commo
>         spin_lock_irq(&q->lock);
>
>         if (likely(list_empty(&wait->entry))) {
> -               __add_wait_queue_entry_tail(q, wait);
> +               if (first_time) {
> +                       __add_wait_queue_entry_tail(q, wait);
> +                       first_time = false;
> +               } else {
> +                       __add_wait_queue(q, wait);
> +               }
>                 SetPageWaiters(page);
>         }

This seems very hacky.

And in fact, looking closer, I'd say that there are more serious
problems here.

Look at that WQ_FLAG_EXCLUSIVE thing: non-exclusive waits should always
go at the head (because they're not going to steal the bit, they just
want to know when it got cleared), and exclusive waits should always go
at the tail (because of fairness).

But that's not at all what we do.

Your patch adds even more confusion to this nasty area.

And your third one:

> +               if (ret)
> +                       woken++;
>
> -               if (bookmark && (++cnt > WAITQUEUE_WALK_BREAK_CNT) &&
> +               if (bookmark && (++cnt > WAITQUEUE_WALK_BREAK_CNT) && woken &&

I've got two reactions to this

 (a) we should not need a new "woken" variable, we should just set a
     high bit of "cnt" and make WAITQUEUE_WALK_BREAK_CNT contain that
     high bit

     (Tune "high bit" to whatever you want: it could be either the
     _real_ high bit of the variable, or it could be something like
     "128", which would mean that you'd break out after 128 non-waking
     entries).

 (b) Ugh, what hackery and magic behavior regardless

I'm really starting to hate that wait_on_page_bit_common() function.

See a few weeks ago how the function looks buggy to begin with

    https://lore.kernel.org/lkml/CAHk-=wjJA2Z3kUFb-5s=6+n0qbTs8ELqKFt9B3pH85a8fGD73w@mail.gmail.com/

and that never got resolved either (but probably never happens in
practice).

             Linus
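One possible reading of suggestion (a), as a fragment of the
__wake_up_common() walk; the macro names and values here are invented
for illustration, they are not from any tree:

    /* fold "have we woken anyone yet" into a high bit of the walk counter */
    #define WAITQUEUE_WALK_WOKEN_BIT    (1 << 30)
    #define WAITQUEUE_WALK_BREAK_CNT    (WAITQUEUE_WALK_WOKEN_BIT + 64)

            ret = curr->func(curr, mode, wake_flags, key);
            if (ret < 0)
                    break;
            if (ret && (flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)
                    break;
            if (ret)
                    cnt |= WAITQUEUE_WALK_WOKEN_BIT; /* at least one real wakeup */

            /* only true once the woken bit is set AND 64+ entries were walked */
            if (bookmark && (++cnt > WAITQUEUE_WALK_BREAK_CNT) &&
                            (&next->entry != &wq_head->head)) {
                    bookmark->flags = WQ_FLAG_BOOKMARK;
                    list_add_tail(&bookmark->entry, &next->entry);
                    break;
            }

The "128" alternative Linus mentions would use slightly different
arithmetic (counting non-waking entries instead), but the shape is the
same: one counter, no extra variable.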
On Wed, Jul 22, 2020 at 3:10 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> > +       bool first_time = true;
> >         bool thrashing = false;
> >         bool delayacct = false;
> >         unsigned long pflags;
> > @@ -1134,7 +1135,12 @@ static inline int wait_on_page_bit_commo
> >         spin_lock_irq(&q->lock);
> >
> >         if (likely(list_empty(&wait->entry))) {
> > -               __add_wait_queue_entry_tail(q, wait);
> > +               if (first_time) {
> > +                       __add_wait_queue_entry_tail(q, wait);
> > +                       first_time = false;
> > +               } else {
> > +                       __add_wait_queue(q, wait);
> > +               }
> >                 SetPageWaiters(page);
> >         }
>
> This seems very hacky.
>
> And in fact, looking closer, I'd say that there are more serious problems here.
>
> Look at that WQ_FLAG_EXCLUSIVE thing: non-exclusive waits should
> always go at the head (because they're not going to steal the bit,
> they just want to know when it got cleared), and exclusive waits
> should always go at the tail (because of fairness).
>
> But that's not at all what we do.
>
> Your patch adds even more confusion to this nasty area.

Actually, I think the right thing to do is to just say "we should never
add ourselves back to the list".

We have three cases:

 - DROP: we'll never loop

 - SHARED: we'll break if we see the bit clear - so if we're no longer
   on the list, we should break out

 - EXCLUSIVE: we should remain on the list until we get the lock.

and we should just handle these three cases in the wakeup function
instead, I feel.

And then to make it fair to people, let's just always add things to the
head of the queue, whether exclusive or not. AND NEVER RE-QUEUE.

IOW, something like the attached patch.

NOTE NOTE NOTE! This is both somewhat subtle code, and ENTIRELY
UNTESTED. But it actually fixes the bug mentioned a few weeks ago, it
makes the code simpler in one sense, and it avoids the whole re-queueing
thing.

It removes all the "is the page locked" testing from the actual wait
loop, and replaces it with "is the wait queue entry still on the list"
instead.

Comments? Oleg, this should fix the race you talked about too.

Note that this patch really is meant as an RFC, and "something like
this". It builds. I tried to think it through. But it might have some
obvious thinko that means that it really really doesn't work.

               Linus
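For readers following along, the three cases correspond to the existing
behavior enum in mm/filemap.c; this is a paraphrase of that era's
source, reproduced here for context rather than quoted from the patch:

    enum behavior {
            EXCLUSIVE,      /* Hold ref to page and take the bit when woken,
                             * like __lock_page() waiting on then setting
                             * PG_locked.
                             */
            SHARED,         /* Hold ref to page and check the bit when woken,
                             * like wait_on_page_writeback() waiting on
                             * PG_writeback.
                             */
            DROP,           /* Drop ref to page before wait, no check when
                             * woken, like put_and_wait_on_page_locked() on
                             * PG_locked.
                             */
    };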
On Wed, Jul 22, 2020 at 4:42 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> NOTE NOTE NOTE! This is both somewhat subtle code, and ENTIRELY
> UNTESTED.

It seems to boot.

It adds more lines than it removes, but a lot of it is comments, and
while it's somewhat subtle, I think it's actually conceptually simpler
than what we had before.

The actual waiting loop itself, for example, is now entirely and utterly
trivial.

The DROP behavior is also a lot more straightforward and logical,
imnsho.

The biggest annoyance I have with it is how it open-codes
"finish_wait()", and honestly, the "proper" fix for that would likely be
to simply instead make "finish_wait()" return the "did I need to remove
it from the list or not" value. That's the only reason that patch
open-codes it right now.

It _feels_ like the right solution to this thing. But maybe that's just
the "pee in the snow" effect, where I like it only because I've put my
mark on it.

So it would be good to get more opinions, and perhaps more testing than
my limited "it boots and works for me, and I can still build kernels and
run a browser".

            Linus
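A sketch of what a return-value-bearing finish_wait() could look like.
This helper does not exist in the kernel; the name and signature are
made up purely to illustrate the idea:

    /*
     * Hypothetical variant of finish_wait(): returns true if the entry was
     * still queued, i.e. nobody had woken/removed us.
     */
    static bool finish_wait_was_queued(struct wait_queue_head *wq_head,
                                       struct wait_queue_entry *wq_entry)
    {
            unsigned long flags;
            bool was_queued = false;

            __set_current_state(TASK_RUNNING);
            if (!list_empty_careful(&wq_entry->entry)) {
                    spin_lock_irqsave(&wq_head->lock, flags);
                    if (!list_empty(&wq_entry->entry)) {
                            list_del_init(&wq_entry->entry);
                            was_queued = true;
                    }
                    spin_unlock_irqrestore(&wq_head->lock, flags);
            }
            return was_queued;
    }

The real finish_wait() in kernel/sched/wait.c already does the same
unlocked list_empty_careful() check followed by a locked
list_del_init(); it just throws that information away.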
On Wed 22-07-20 11:29:20, Linus Torvalds wrote:
> On Tue, Jul 21, 2020 at 8:33 AM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > More likely, it's actually *caused* by that commit 11a19c7b099f, and
> > what might be happening is that other CPU's are just adding new
> > waiters to the list *while* we're waking things up, because somebody
> > else already got the page lock again.
> >
> > Humor me.. Does something like this work instead?
>
> I went back and looked at this, because it bothered me.

Thanks for pursuing this. I have learned that the affected system is in
fact a production machine which doesn't seem to have any downtime window
planned soon. Moreover, the issue is not always reproducible. So I
cannot guarantee that I can have this or other patches tested soon,
which is really unfortunate.

> And I'm no longer convinced it can possibly make a difference.
>
> Why?
>
> Because __wake_up_locked_key_bookmark() just calls __wake_up_common(),
> and that one checks the return value of the wakeup function:
>
>                 ret = curr->func(curr, mode, wake_flags, key);
>                 if (ret < 0)
>                         break;
>
> and will not add the bookmark back to the list if this triggers.
>
> And the wakeup function does that same "stop walking" thing:
>
>         if (test_bit(key->bit_nr, &key->page->flags))
>                 return -1;
>
> So if somebody else took the page lock, I think we should already have
> stopped walking the list.

Right! I didn't bother to look at the wakeup callback, so I missed this.
For completeness, this behavior has been there since 3510ca20ece01,
which we have in our 4.12 based kernel as well.
On 07/22, Linus Torvalds wrote:
>
> Comments? Oleg, this should fix the race you talked about too.

Yes. I still can't convince myself that I fully understand this patch,
but I see nothing really wrong after a quick glance...

> +	 * We can no longer use 'wait' after we've done the
> +	 * list_del_init(&wait->entry),

Yes, but see below,

> +	 * the target may decide it's all done with no
> +	 * other locking, and 'wait' has been allocated on
> +	 * the stack of the target.
>	 */
> -	if (test_bit(key->bit_nr, &key->page->flags))
> -		return -1;
> +	target = wait->private;
> +	smp_mb();
>
> -	return autoremove_wake_function(wait, mode, sync, key);
> +	/*
> +	 * Ok, we have successfully done what we're waiting for.
> +	 *
> +	 * Now unconditionally remove the wait entry, so that the
> +	 * waiter can use that to see success or not.
> +	 *
> +	 * We _really_ should have a "list_del_init_careful()"
> +	 * to properly pair with an unlocked "list_empty_careful()".
> +	 */
> +	list_del_init(&wait->entry);
> +
> +	/*
> +	 * Theres's another memory barrier in the wakup path, that
> +	 * makes sure the wakup happens after the above is visible
> +	 * to the target.
> +	 */
> +	wake_up_state(target, mode);

We can no longer use 'target'. If it was already woken up it can notice
list_empty_careful(), return without taking q->lock, and exit. Of
course, this is purely theoretical... rcu_read_lock() should help, but
perhaps we can avoid it somehow?

Say, can't we abuse WQ_FLAG_WOKEN?

	wake_page_function:

		wait->flags |= WQ_FLAG_WOKEN;
		wmb();
		autoremove_wake_function(...);

	wait_on_page_bit_common:

		for (;;) {
			set_current_state();
			if (wait.flags & WQ_FLAG_WOKEN)
				break;
			schedule();
		}

		finish_wait();

		rmb();
		return wait.flags & WQ_FLAG_WOKEN ? 0 : -EINTR;

Another (cosmetic) problem is that wake_up_state(mode) looks confusing.
It is correct, but only because we know that mode == TASK_NORMAL and
thus wake_up_state() can't fail if the target is still blocked.

> +	spin_lock_irq(&q->lock);
> +	SetPageWaiters(page);
> +	if (!trylock_page_bit_common(page, bit_nr, behavior))
> +		__add_wait_queue_entry_tail(q, wait);

do we need SetPageWaiters() if trylock() succeeds?

Oleg.
On Thu, Jul 23, 2020 at 5:47 AM Oleg Nesterov <oleg@redhat.com> wrote:
>
> I still can't convince myself that I fully understand this patch but I
> see nothing really wrong after a quick glance...

I guess my comments should be extended further then. Is there anything
in particular you think is unclear?

> > +	wake_up_state(target, mode);
>
> We can no longer use 'target'. If it was already woken up it can notice
> list_empty_careful(), return without taking q->lock, and exit.

Good point.

And yes, I think using WQ_FLAG_WOKEN is the right thing to do, and I
wouldn't call it "abuse". It's exactly what it's for.

And that also allows us to just use finish_wait(), since we no longer
care as deeply about the waitlist state, we can just test that
WQ_FLAG_WOKEN at the end instead.

So that actually makes the patch much more straightforward too. I really
disliked my open-coding there. Your suggestion fixes everything.

> do we need SetPageWaiters() if trylock() succeeds?

We need to set it before the final page flag test, because otherwise we
might miss somebody just about to wake us up (ie we see the bit set, but
it's getting cleared on another CPU, and if PageWaiters isn't set then
that other CPU won't do the wakeup).

So here's a v2, now as a "real" commit with a commit message and
everything. Is there anything in particular you would like clarified, or
something else you find in this?

Hugh - this should make your "patch 2" redundant. Is there any way to
test that in your environment that triggered it? This v2 isn't tested,
but the core of it is the same, just with nice cleanups from Oleg's
suggestion, and an added comment about that SetPageWaiters() thing.

             Linus
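The waiter side that falls out of this is then roughly the following (a
condensed sketch, not the exact committed hunk; the DROP case and
error-return details are omitted):

	for (;;) {
		set_current_state(state);

		/* the wakeup function sets WQ_FLAG_WOKEN under q->lock */
		if (wait->flags & WQ_FLAG_WOKEN)
			break;

		if (signal_pending_state(state, current))
			break;

		io_schedule();
	}

	finish_wait(q, wait);

	/* WQ_FLAG_WOKEN set: we saw the bit handled for us; else -EINTR */

The loop never looks at the page flags any more; it only looks at its
own on-stack wait entry.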
On 07/23, Linus Torvalds wrote:
>
> So here's a v2, now as a "real" commit with a commit message and everything.

I am already sleeping, will read it tomorrow, but at first glance...

> @@ -1013,18 +1014,40 @@ static int wake_page_function(wait_queue_entry_t *wait, unsigned mode, int sync,
>  	if (wait_page->bit_nr != key->bit_nr)
>  		return 0;
>
> +	/* Stop walking if it's locked */
> +	if (wait->flags & WQ_FLAG_EXCLUSIVE) {
> +		if (test_and_set_bit(key->bit_nr, &key->page->flags))
> +			return -1;
> +	} else {
> +		if (test_bit(key->bit_nr, &key->page->flags))
> +			return -1;
> +	}
> +
>  	/*
> -	 * Stop walking if it's locked.
> -	 * Is this safe if put_and_wait_on_page_locked() is in use?
> -	 * Yes: the waker must hold a reference to this page, and if PG_locked
> -	 * has now already been set by another task, that task must also hold
> -	 * a reference to the *same usage* of this page; so there is no need
> -	 * to walk on to wake even the put_and_wait_on_page_locked() callers.
> +	 * Let the waiter know we have done the page flag
> +	 * handling for it (and the return value lets the
> +	 * wakeup logic count exclusive wakeup events).
>  	 */
> -	if (test_bit(key->bit_nr, &key->page->flags))
> -		return -1;
> +	ret = (wait->flags & WQ_FLAG_EXCLUSIVE) != 0;
> +	wait->flags |= WQ_FLAG_WOKEN;
> +	wake_up_state(wait->private, mode);
>
> -	return autoremove_wake_function(wait, mode, sync, key);
> +	/*
> +	 * Ok, we have successfully done what we're waiting for,
> +	 * and we can unconditionally remove the wait entry.
> +	 *
> +	 * Note that this has to be the absolute last thing we do,
> +	 * since after list_del_init(&wait->entry) the wait entry
> +	 * might be de-allocated and the process might even have
> +	 * exited.
> +	 *
> +	 * We _really_ should have a "list_del_init_careful()" to
> +	 * properly pair with the unlocked "list_empty_careful()"
> +	 * in finish_wait().
> +	 */
> +	smp_mb();
> +	list_del_init(&wait->entry);

I think smp_wmb() would be enough, but this is minor. We need a barrier
between "wait->flags |= WQ_FLAG_WOKEN" and list_del_init().

But afaics we need another barrier, rmb(), in wait_on_page_bit_common()
for the case when wait->private was not blocked; we need to ensure that
if finish_wait() sees list_empty_careful() == T then we can't miss
WQ_FLAG_WOKEN.

Oleg.
On Thu, Jul 23, 2020 at 11:01 AM Oleg Nesterov <oleg@redhat.com> wrote:
>
> > +	 * We _really_ should have a "list_del_init_careful()" to
> > +	 * properly pair with the unlocked "list_empty_careful()"
> > +	 * in finish_wait().
> > +	 */
> > +	smp_mb();
> > +	list_del_init(&wait->entry);
>
> I think smp_wmb() would be enough, but this is minor.

Well, what we _really_ want (and what that comment is about) would be
for that list_del_init_careful() to use a "smp_store_release()" for the
last store, and then "list_empty_careful()" would use a
"smp_load_acquire()" for the corresponding first load.

On x86, that's free. On most other architectures, it's the minimal
ordering requirement.

And no, I don't think "smp_wmb()" is technically enough.

With crazy memory ordering, one of the earlier *reads* (eg loading
"wait->private" when waking things up) could have been delayed until
after the stores that initialize the list - and thus read stack contents
from another process after it's been released and re-used.

Does that happen in reality? No. There are various conditionals in there
which mean that the stores end up being gated on the loads and cannot
actually be re-ordered, but it's the kind of subtlety we shouldn't rely
on.

So we actually do want to constrain all earlier reads and writes wrt the
final write. Which is exactly what "smp_store_release()" does.

But with our current list_empty_careful(), the smp_mb() is I think
technically sufficient.

> We need a barrier between "wait->flags |= WQ_FLAG_WOKEN" and list_del_init().

See above: we need more than just that write barrier, although in
_practice_ you're right, and the other barriers actually all already
exist and are part of wake_up_state().

So the smp_mb() is unnecessary, and in fact your smp_wmb() would be too.
But I left it there basically as "documentation".

> But afaics we need another barrier, rmb(), in wait_on_page_bit_common()
> for the case when wait->private was not blocked; we need to ensure that
> if finish_wait() sees list_empty_careful() == T then we can't miss
> WQ_FLAG_WOKEN.

Again, this is what a proper list_empty_careful() with a
smp_load_acquire() would have automatically gotten for us.

But yes, I think that without that, and with the explicit barriers, we
need an smp_rmb() after the list_empty_careful(). I really think it
should be _in_ list_empty_careful(), though. Or maybe finish_wait().
Hmm.

Because looking at all the other finish_wait() uses, the fact that the
waitqueue _list_ is properly ordered isn't really a guarantee that the
rest of the stack space is. In practice, it will be, but I think this
lack of serialization is a potential real bug elsewhere too.

(Obviously none of this would show on x86, where we already *get* that
smp_store_release/smp_load_acquire behavior for the existing
list_del_init()/list_empty_careful(), since all stores are releases, and
all loads are acquires.)

So I think that is a separate issue, generic to our finish_wait() uses.

           Linus
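Spelled out, the pairing Linus describes would look something like this
(a sketch; essentially the same code appears in the follow-up patch
quoted later in the thread):

	static inline void list_del_init_careful(struct list_head *entry)
	{
		__list_del_entry(entry);
		entry->prev = entry;
		/* release: orders all earlier accesses before this store */
		smp_store_release(&entry->next, entry);
	}

	static inline int list_empty_careful(const struct list_head *head)
	{
		/* acquire: pairs with the release above */
		struct list_head *next = smp_load_acquire(&head->next);
		return (next == head) && (next == head->prev);
	}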
On Thu, Jul 23, 2020 at 11:22 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> So I think that is a separate issue, generic to our finish_wait() uses.

IOW, I think we should do something like this (this is on top of my
patch, since it has that wake_page_function() change in it, but notice
how we have the exact same issue in our traditional
autoremove_wake_function() usage).

Comments?

            Linus
On Thu, Jul 23, 2020 at 10:32 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> So here's a v2, now as a "real" commit with a commit message and everything.

Oh, except it's broken.

Switching from the "am I still on the list" logic to the "WQ_FLAG_WOKEN
is set if woken" logic was all well and good, but I missed the case
where we did that trylock_page_bit_common().

It used to just not add the thing to the list if it would get the page
bit, and then the rest of the waiting logic looked at that and was
happy.

But now it needs to actually fake that WQ_FLAG_WOKEN flag. So that patch
needs to do something like this:

        if (!trylock_page_bit_common(page, bit_nr, behavior))
                __add_wait_queue_entry_tail(q, wait);
        else
                wait->flags |= WQ_FLAG_WOKEN;

in there. Or maybe have that bit set originally, and clear it when we
add to the wait queue.

I'll send a new version after I actually test it.

            Linus
On Thu, 23 Jul 2020, Linus Torvalds wrote:
>
> I'll send a new version after I actually test it.

I'll give it a try when you're happy with it. I did try yesterday's with
my swapping loads on home machines (3 of 4 survived 16 hours), and with
some google stresstests on work machines (0 of 10 survived).

I've not spent long analyzing the crashes, all of them in or below
__wake_up_common() called from __wake_up_locked_key_bookmark():
sometimes gets to run the curr->func() and crashes on something inside
there (often list_del's lib/list_debug.c:53!), sometimes cannot get that
far. Looks like the wait queue entries on the list were not entirely
safe with that patch.

Hugh
On Thu, Jul 23, 2020 at 4:11 PM Hugh Dickins <hughd@google.com> wrote:
>
> On Thu, 23 Jul 2020, Linus Torvalds wrote:
> >
> > I'll send a new version after I actually test it.
>
> I'll give it a try when you're happy with it.

Ok, what I described is what I've been running for a while now. But I
don't put much stress on my system with my normal workload, so..

> I did try yesterday's
> with my swapping loads on home machines (3 of 4 survived 16 hours),
> and with some google stresstests on work machines (0 of 10 survived).
>
> I've not spent long analyzing the crashes, all of them in or below
> __wake_up_common() called from __wake_up_locked_key_bookmark():
> sometimes gets to run the curr->func() and crashes on something
> inside there (often list_del's lib/list_debug.c:53!), sometimes
> cannot get that far. Looks like the wait queue entries on the list
> were not entirely safe with that patch.

Hmm. The bug Oleg pointed out should be pretty theoretical. But I think
the new approach with WQ_FLAG_WOKEN was much better anyway, despite me
missing that one spot in the first version of the patch.

So here's two patches - the first one does that wake_page_function()
conversion, and the second one just does the memory ordering cleanup I
mentioned.

I don't think the second one should matter on x86, but who knows.

I don't enable list debugging, but I find list corruption surprising.
All of _that_ should be inside the page waitqueue lock, the only
unlocked part was the "list_empty_careful()" part.

But I'll walk over my patch mentally one more time. Here's the current
version, anyway.

             Linus
On Thu, 23 Jul 2020, Linus Torvalds wrote:
> On Thu, Jul 23, 2020 at 4:11 PM Hugh Dickins <hughd@google.com> wrote:
> > On Thu, 23 Jul 2020, Linus Torvalds wrote:
> > >
> > > I'll send a new version after I actually test it.
> >
> > I'll give it a try when you're happy with it.
>
> Ok, what I described is what I've been running for a while now. But I
> don't put much stress on my system with my normal workload, so..
>
> > I did try yesterday's
> > with my swapping loads on home machines (3 of 4 survived 16 hours),
> > and with some google stresstests on work machines (0 of 10 survived).
> >
> > I've not spent long analyzing the crashes, all of them in or below
> > __wake_up_common() called from __wake_up_locked_key_bookmark():
> > sometimes gets to run the curr->func() and crashes on something
> > inside there (often list_del's lib/list_debug.c:53!), sometimes
> > cannot get that far. Looks like the wait queue entries on the list
> > were not entirely safe with that patch.
>
> Hmm. The bug Oleg pointed out should be pretty theoretical. But I
> think the new approach with WQ_FLAG_WOKEN was much better anyway,
> despite me missing that one spot in the first version of the patch.
>
> So here's two patches - the first one does that wake_page_function()
> conversion, and the second one just does the memory ordering cleanup I
> mentioned.
>
> I don't think the second one should matter on x86, but who knows.
>
> I don't enable list debugging, but I find list corruption surprising.
> All of _that_ should be inside the page waitqueue lock, the only
> unlocked part was the "list_empty_careful()" part.
>
> But I'll walk over my patch mentally one more time. Here's the current
> version, anyway.

Thanks, I'll start some tests going shortly.

I do have to "port" these patches to a different kernel, and my first
assumption on seeing crashes was that I'd screwed that up; but that
seemed much less likely once the home test on top of v5.8-rc5 crashed in
much the same way. The latter was not a list_del() crash, but on
curr->func itself; but I take them all as just indicating that the wait
queue entry can in rare cases be freed and reused.

(And the amount of "port"ing was close to nil here: our trees did differ
on an "unlikely" that one end had added or removed, plus I did start off
by reverting two of my three patches. But perhaps I'm missing a subtle
dependence on differences elsewhere in the tree.)

I say that for full disclosure, so you don't wrack your brains too much,
when it may still turn out to be a screwup on my part.

Hugh
On Thu, Jul 23, 2020 at 5:07 PM Hugh Dickins <hughd@google.com> wrote:
>
> I say that for full disclosure, so you don't wrack your brains
> too much, when it may still turn out to be a screwup on my part.

Sounds unlikely.

If that patch applied even reasonably closely, I don't see how you'd see
a list corruption that wasn't due to the patch.

You'd have had to use the wrong spinlock by mistake due to munging it,
or something crazy like that.

The main list-handling change is

 (a) open-coding of that finish_wait()

 (b) slightly different heuristics for removal in the wakeup function

where (a) was because my original version of finishing the wait needed
to do that return code checking.

So a normal "finish_wait()" just does

        list_del_init(&wait->entry);

where-as my open-coded one replaced that with

        if (!list_empty(&wait->entry)) {
                list_del(&wait->entry);
                ret = -EINTR;
        }

and apart from that "set return to -EINTR because nobody woke us up", it
also uses just a regular "list_del()" rather than a "list_del_init()".
That causes the next/prev fields to be poisoned rather than
re-initialized. But that second change was because the list entry is on
the stack, and we're not touching it any more and are about to return,
so I removed the "init" part.

Anyway, (a) really looks harmless. Unless the poisoning now triggers
some racy debug test that had been hidden by the "init". Hmm.

In contrast, (b) means that the likely access patterns of irqs removing
the wait entry from the list might be very different from before. The
old "autoremove" semantics would only remove the entry from the list
when try_to_wake_up() actually woke things up. Now, a successful bit
state _always_ removes it, which was kind of the point. But it might
cause very different list handling patterns.

All the actual list handling looks "obviously safe" because it's
protected by the spinlock, though...

If you do get oopses with the new patch too, try to send me a copy, and
maybe I'll stare at exactly where it happens and the register contents
and go "aah".

            Linus
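For context on the "poisoned rather than re-initialized" remark, this is
the standard include/linux/list.h behaviour (paraphrased here, not part
of the patch under discussion):

	static inline void list_del(struct list_head *entry)
	{
		__list_del_entry(entry);
		entry->next = LIST_POISON1;	/* later list use oopses loudly */
		entry->prev = LIST_POISON2;
	}

	static inline void list_del_init(struct list_head *entry)
	{
		__list_del_entry(entry);
		INIT_LIST_HEAD(entry);		/* entry points back at itself */
	}

So with plain list_del(), a stale reference to the on-stack wait entry
shows up as an obvious poison-pointer crash instead of silently looking
like an empty, re-initialized entry.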
On Thu, Jul 23, 2020 at 5:47 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Thu, Jul 23, 2020 at 5:07 PM Hugh Dickins <hughd@google.com> wrote:
> >
> > I say that for full disclosure, so you don't wrack your brains
> > too much, when it may still turn out to be a screwup on my part.
>
> Sounds unlikely.
>
> If that patch applied even reasonably closely, I don't see how you'd
> see a list corruption that wasn't due to the patch.
>
> You'd have had to use the wrong spinlock by mistake due to munging it,
> or something crazy like that.
>
> The main list-handling change is
>
>  (a) open-coding of that finish_wait()
>
>  (b) slightly different heuristics for removal in the wakeup function
>
> where (a) was because my original version of finishing the wait needed
> to do that return code checking.
>
> So a normal "finish_wait()" just does
>
>         list_del_init(&wait->entry);
>
> where-as my open-coded one replaced that with
>
>         if (!list_empty(&wait->entry)) {
>                 list_del(&wait->entry);
>                 ret = -EINTR;
>         }
>
> and apart from that "set return to -EINTR because nobody woke us up",
> it also uses just a regular "list_del()" rather than a
> "list_del_init()". That causes the next/prev fields to be poisoned
> rather than re-initialized. But that second change was because the
> list entry is on the stack, and we're not touching it any more and are
> about to return, so I removed the "init" part.
>
> Anyway, (a) really looks harmless. Unless the poisoning now triggers
> some racy debug test that had been hidden by the "init". Hmm.
>
> In contrast, (b) means that the likely access patterns of irqs
> removing the wait entry from the list might be very different from
> before. The old "autoremove" semantics would only remove the entry
> from the list when try_to_wake_up() actually woke things up. Now, a
> successful bit state _always_ removes it, which was kind of the point.
> But it might cause very different list handling patterns.
>
> All the actual list handling looks "obviously safe" because it's
> protected by the spinlock, though...
>
> If you do get oopses with the new patch too, try to send me a copy,
> and maybe I'll stare at exactly where it happens and the register
> contents and go "aah".

This new version is doing much better: many hours to go, but all
machines have got beyond the danger point where yesterday's version was
crashing - phew!

Hugh
On 07/23, Linus Torvalds wrote:
>
> IOW, I think we should do something like this (this is on top of my
> patch, since it has that wake_page_function() change in it, but notice
> how we have the exact same issue in our traditional
> autoremove_wake_function() usage).

...

> +static inline void list_del_init_careful(struct list_head *entry)
> +{
> +	__list_del_entry(entry);
> +	entry->prev = entry;
> +	smp_store_release(&entry->next, entry);
> +}
> +

...

>  static inline int list_empty_careful(const struct list_head *head)
>  {
> -	struct list_head *next = head->next;
> +	struct list_head *next = smp_load_acquire(&head->next);
>  	return (next == head) && (next == head->prev);
>  }

This (and your previous email) answers my concerns about memory
barriers.

IIUC, finish_wait() could even use this version of list_empty_careful(),

	struct list_head *next = smp_load_acquire(&head->next);
	return (next == head) && !WARN_ON(next != head->prev);

iow, it doesn't really need to check next == head->prev as long as only
list_del_init_careful() can remove it from the list.

Thanks!

Oleg.
On 07/23, Linus Torvalds wrote:
>
> But I'll walk over my patch mentally one more time. Here's the current
> version, anyway.

Both patches look correct to me, feel free to add

Reviewed-by: Oleg Nesterov <oleg@redhat.com>

> @@ -1013,18 +1014,40 @@ static int wake_page_function(wait_queue_entry_t *wait, unsigned mode, int sync,
>  	if (wait_page->bit_nr != key->bit_nr)
>  		return 0;
>
> +	/* Stop walking if it's locked */
> +	if (wait->flags & WQ_FLAG_EXCLUSIVE) {
> +		if (test_and_set_bit(key->bit_nr, &key->page->flags))
> +			return -1;
> +	} else {
> +		if (test_bit(key->bit_nr, &key->page->flags))
> +			return -1;
> +	}

not sure this makes any sense, but this looks like another user of
trylock_page_bit_common(), see the patch below on top of 1/2.

Oleg.

--- mm/filemap.c~	2020-07-24 17:09:34.728133846 +0200
+++ mm/filemap.c	2020-07-24 17:23:52.279185374 +0200
@@ -1000,6 +1000,14 @@
 	wait_queue_entry_t wait;
 };

+static int trylock_page_bit_common(struct page *page, int bit_nr,
+				   struct wait_queue_entry *wait)
+{
+	return wait->flags & WQ_FLAG_EXCLUSIVE ?
+		!test_and_set_bit(bit_nr, &page->flags) :
+		!test_bit(bit_nr, &page->flags);
+}
+
 static int wake_page_function(wait_queue_entry_t *wait, unsigned mode, int sync, void *arg)
 {
 	int ret;
@@ -1015,13 +1023,8 @@
 		return 0;

 	/* Stop walking if it's locked */
-	if (wait->flags & WQ_FLAG_EXCLUSIVE) {
-		if (test_and_set_bit(key->bit_nr, &key->page->flags))
-			return -1;
-	} else {
-		if (test_bit(key->bit_nr, &key->page->flags))
-			return -1;
-	}
+	if (!trylock_page_bit_common(key->page, key->bit_nr, wait))
+		return -1;

 	/*
 	 * Let the waiter know we have done the page flag
@@ -1126,14 +1129,6 @@
 	 */
 };

-static inline int trylock_page_bit_common(struct page *page, int bit_nr,
-					  enum behavior behavior)
-{
-	return behavior == EXCLUSIVE ?
-		!test_and_set_bit(bit_nr, &page->flags) :
-		!test_bit(bit_nr, &page->flags);
-}
-
 static inline int wait_on_page_bit_common(wait_queue_head_t *q, struct page *page,
 	int bit_nr, int state, enum behavior behavior)
 {
@@ -1170,7 +1165,7 @@
 	 */
 	spin_lock_irq(&q->lock);
 	SetPageWaiters(page);
-	if (!trylock_page_bit_common(page, bit_nr, behavior))
+	if (!trylock_page_bit_common(page, bit_nr, wait))
 		__add_wait_queue_entry_tail(q, wait);
 	spin_unlock_irq(&q->lock);
On Fri, Jul 24, 2020 at 8:24 AM Oleg Nesterov <oleg@redhat.com> wrote:
>
> not sure this makes any sense, but this looks like another user of
> trylock_page_bit_common(), see the patch below on top of 1/2.

Ok, that makes sense. Except you did it on top of the original patch
without the fix to set WQ_FLAG_WOKEN for the non-wakeup case.

And in fact, once you do it on top of that, it becomes obvious that we
can share even more code: move the WQ_FLAG_WOKEN logic _into_ the
trylock_page_bit_common() function.

Then the whole thing becomes something like the attached.

I added your reviewed-by, but maybe you should double-check my changes.

            Linus
On Fri, Jul 24, 2020 at 10:32 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> Ok, that makes sense. Except you did it on top of the original patch
> without the fix to set WQ_FLAG_WOKEN for the non-wakeup case.

Hmm.

I just realized that one thing we could do is to not even test the page
bit for the shared case in the wakeup path.

Because anybody who uses the non-exclusive "wait_on_page_locked()" or
"wait_on_page_writeback()" isn't actually interested in the bit state
any more at that point. All they care about is that somebody cleared it
- not whether it was then re-taken again.

So instead of keeping them on the list - or stopping the waitqueue walk
because somebody else got the bit - we could just mark them successfully
done, wake them up, and remove those entries from the list.

That would be better for everybody - less pointless waiting for a new
lock or writeback event, but also fewer entries on the wait queues as we
get rid of them more quickly instead of walking them over and over just
because somebody else re-took the page lock.

Generally "wait_on_page_locked()" is used for two things

 - either wait for the IO to then check if it's now uptodate

 - throttle things that can't afford to lock the page (eg page faults
   that dropped the mm lock, and as such need to go through the whole
   restart logic, but that don't want to lock the page because it's now
   too late, but also the page migration things)

In the first case, waiting to actually see the locked bit clear is
pointless - the code only cared about the "wait for IO in progress", not
about the lock bit itself.

And that second case generally might want to retry, but doesn't want to
busy-loop.

And "wait_on_page_writeback()" is basically used for similar reasons (ie
check if there were IO errors, but also possibly to throttle writeback
traffic).

Saying "stop walking, keep it on the list" seems wrong. It makes IO
error handling and retries much worse, for example.

So it turns out that the wakeup logic and the initial wait logic don't
have so much in common after all, and there is a fundamental conceptual
difference between that "check bit one last time" case, and the "we got
woken up, now what" case..

End result: one final (yes, hopefully - I think I'm done) version of
this patch-series.

This not only makes the code cleaner (the generated code for
wake_up_page() is really quite nice now), but getting rid of extra
waiting might help the load that Michal reported.

Because a lot of page waiting might be that non-exclusive
"wait_on_page_locked()" kind, particularly in the thundering herd kind
of situation where one process starts IO, and then other processes wait
for it to finish.

Those users don't even care if somebody else then did a "lock_page()"
for some other reason (maybe for writeback). They are generally
perfectly happy with a locked page, as long as it's now up-to-date.

So this not only simplifies the code, it really might avoid some
problems too.

           Linus
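A typical instance of the first, non-exclusive pattern, simplified from
common "wait for IO then check the result" callers rather than any one
function (so treat it as a hedged illustration, not quoted code):

	/* IO was kicked off earlier; wait for it, not for ownership */
	wait_on_page_locked(page);
	if (!PageUptodate(page))
		return -EIO;	/* the IO completed with an error */
	/* page contents are valid; we never needed to own PG_locked */

Such a caller is indifferent to whether another task has locked the page
again in the meantime, which is exactly why waking it as soon as the bit
has been cleared once is safe.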
On Fri, 24 Jul 2020, Linus Torvalds wrote:
> On Fri, Jul 24, 2020 at 10:32 AM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> > Ok, that makes sense. Except you did it on top of the original patch
> > without the fix to set WQ_FLAG_WOKEN for the non-wakeup case.
>
> Hmm.
>
> I just realized that one thing we could do is to not even test the
> page bit for the shared case in the wakeup path.
>
> Because anybody who uses the non-exclusive "wait_on_page_locked()" or
> "wait_on_page_writeback()" isn't actually interested in the bit state
> any more at that point. All they care about is that somebody cleared
> it - not whether it was then re-taken again.
>
> So instead of keeping them on the list - or stopping the waitqueue
> walk because somebody else got the bit - we could just mark them
> successfully done, wake them up, and remove those entries from the
> list.
>
> That would be better for everybody - less pointless waiting for a new
> lock or writeback event, but also fewer entries on the wait queues as
> we get rid of them more quickly instead of walking them over and over
> just because somebody else re-took the page lock.
>
> Generally "wait_on_page_locked()" is used for two things
>
>  - either wait for the IO to then check if it's now uptodate
>
>  - throttle things that can't afford to lock the page (eg page faults
>    that dropped the mm lock, and as such need to go through the whole
>    restart logic, but that don't want to lock the page because it's
>    now too late, but also the page migration things)
>
> In the first case, waiting to actually see the locked bit clear is
> pointless - the code only cared about the "wait for IO in progress",
> not about the lock bit itself.
>
> And that second case generally might want to retry, but doesn't want
> to busy-loop.
>
> And "wait_on_page_writeback()" is basically used for similar reasons
> (ie check if there were IO errors, but also possibly to throttle
> writeback traffic).
>
> Saying "stop walking, keep it on the list" seems wrong. It makes IO
> error handling and retries much worse, for example.
>
> So it turns out that the wakeup logic and the initial wait logic don't
> have so much in common after all, and there is a fundamental
> conceptual difference between that "check bit one last time" case, and
> the "we got woken up, now what" case..
>
> End result: one final (yes, hopefully - I think I'm done) version of
> this patch-series.
>
> This not only makes the code cleaner (the generated code for
> wake_up_page() is really quite nice now), but getting rid of extra
> waiting might help the load that Michal reported.
>
> Because a lot of page waiting might be that non-exclusive
> "wait_on_page_locked()" kind, particularly in the thundering herd kind
> of situation where one process starts IO, and then other processes
> wait for it to finish.
>
> Those users don't even care if somebody else then did a "lock_page()"
> for some other reason (maybe for writeback). They are generally
> perfectly happy with a locked page, as long as it's now up-to-date.
>
> So this not only simplifies the code, it really might avoid some
> problems too.

That set of tests I started yesterday has now completed: no crashes due
to your changes (though one machine crashed with an entirely unrelated
list_del corruption: over in driverland, just a coincidence).
I do see more of these stresstests reporting failure than I remember from the last time I ran them, and I wasn't quickly able to work out why (usually I just care about not crashing or hanging, rarely delve deeper into what they actually call success). The most likely cause would be some internal infrastructural oddity, and nothing for you to worry about; but there is a possibility that it's meaningful. But whatever, what happens on the next run, with these latest patches, will be more important; and I'll follow this next run with a run on the baseline without them, to compare results. Hugh
On Fri, Jul 24, 2020 at 7:08 PM Hugh Dickins <hughd@google.com> wrote: > > But whatever, what happens on the next run, with these latest patches, > will be more important; and I'll follow this next run with a run on > the baseline without them, to compare results. So the loads you are running are known to have sensitivity to this particular area, and are why you've done your patches to the page wait bit code? Because yes, I think that last version in particular might make a big difference with the "let people continue even if they only saw the wakeup event, and never actually tested and saw the bit clear". Of course, there's a possibility that "big difference" ends up being a negative one. I think it will make the page wait queues shorter (which is good for that latency and lockup thing), but waking things up more aggressively _may_ also end up adding more CPU load, if they then all decide to retry the operation for whatever reason. And hey, it's also possible that it makes no difference at all, because your load mainly tests the exclusive "lock_page()" case, which won't have changed. And that's all assuming they don't show some instability, of course. But the code actually seems fairly simple now, and the basic synchronization hasn't changed, just a behavioral optimization. Famous last words. Linus
On 07/24, Linus Torvalds wrote: > > And in fact, once you do it on top of that, it becomes obvious that we > can share even more code: move the WQ_FLAG_WOKEN logic _into_ the > trylock_page_bit_common() function. Ah, indeed, somehow I didn't realize this, > I added your reviewed-by, but maybe you should double-check my changes. Still looks correct to me, thanks. Oleg.
On 07/24, Linus Torvalds wrote: > > I just realized that one thing we could do is to not even test the > page bit for the shared case in the wakeup path. > > Because anybody who uses the non-exclusive "wait_on_page_locked()" or > "wait_on_page_writeback()" isn't actually interested in the bit state > any more at that point. All they care about is that somebody cleared > it - not whether it was then re-taken again. > > So instead of keeping them on the list - or stopping the waitqueue > walk because somebody else got the bit - we could just mark them > successfully done, wake them up, and remove those entries from the > list. Heh. I too thought about this. And just in case, your patch looks correct to me. But I can't really comment this behavioural change. Perhaps it should come in a separate patch? In essense, this partly reverts your commit 3510ca20ece0150 ("Minor page waitqueue cleanups"). I mean this part: (b) we don't want to put the non-locking waiters always on the front of the queue, and the locking waiters always on the back. Not only is that unfair, it means that we wake up thousands of reading threads that will just end up being blocked by the writer later anyway. ... @@ -972,10 +976,7 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q, spin_lock_irq(&q->lock); if (likely(list_empty(&wait->entry))) { - if (lock) - __add_wait_queue_entry_tail_exclusive(q, wait); - else - __add_wait_queue(q, wait); + __add_wait_queue_entry_tail(q, wait); SetPageWaiters(page); } Oleg.
On Sat, Jul 25, 2020 at 3:14 AM Oleg Nesterov <oleg@redhat.com> wrote:
>
> Heh. I too thought about this. And just in case, your patch looks correct
> to me. But I can't really comment this behavioural change. Perhaps it
> should come in a separate patch?

We could do that. At the same time, I think both parts change how the waitqueue works that it might as well just be one "fix page_bit_wait waitqueue usage".

But let's wait to see what Hugh's numbers say.

> In essense, this partly reverts your commit 3510ca20ece0150
> ("Minor page waitqueue cleanups"). I mean this part:

Well, no. I mean, it keeps the "always add to the tail" behavior. But some of the reasons for it have gone away. Now we could just make it go back to always doing non-exclusive waits at the head.

The non-exclusive waiters _used_ to re-insert themselves on the queue until they saw the bit clear, so waking them up if the bit was just going to be set again was just going to make for unnecessary scheduling and waitlist noise. That reason is gone.

But I think the fundamental fairness issue might still be there. So I'll keep the simpler "always add at the end".

But you're right - we could expedite the non-exclusive waiters even more.

Linus
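For reference, the head-versus-tail question being debated here is purely about list ordering on the waitqueue; roughly (my sketch of the wait.h helpers of that era, shown for illustration), the two helpers differ only in list_add() versus list_add_tail():

/* Roughly what the two queueing helpers do (sketch, not a quote): */
static inline void __add_wait_queue(struct wait_queue_head *wq_head,
                                    struct wait_queue_entry *wq_entry)
{
        list_add(&wq_entry->entry, &wq_head->head);        /* head: woken first */
}

static inline void __add_wait_queue_entry_tail(struct wait_queue_head *wq_head,
                                               struct wait_queue_entry *wq_entry)
{
        list_add_tail(&wq_entry->entry, &wq_head->head);   /* tail: FIFO, fair  */
}

So "expedite the non-exclusive waiters" would simply mean picking the first helper for them, at the cost of the fairness argument above.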
Firstly, to avoid the confusion, let me repeat that I think your patch is fine. I too thought that non-exclusive waiters do not care about the bit state and thus wake_page_function() can simply wake them all up. But then I did "git blame", found your commit 3510ca20ece0150 and came to the conclusion there are reasons we should not do this.

On 07/25, Linus Torvalds wrote:
>
> On Sat, Jul 25, 2020 at 3:14 AM Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > In essense, this partly reverts your commit 3510ca20ece0150
> > ("Minor page waitqueue cleanups"). I mean this part:
>
> Well, no. I mean, it keeps the "always add to the tail" behavior.

Ah, sorry for the confusion, this doesn't matter. I didn't mean "fairness".

What I tried to say: AFAICS before that commit we had (almost) the same behaviour you propose now: unlock_page/etc wakes all the non-exclusive waiters up.

No? Or I misunderstood your reply? Quite possibly, too late for me...

Oleg.
On Sat, Jul 25, 2020 at 12:28 PM Oleg Nesterov <oleg@redhat.com> wrote:
>
> What I tried to say: AFAICS before that commit we had (almost) the same
> behaviour you propose now: unlock_page/etc wakes all the non-exclusive
> waiters up.
>
> No?

Yes, but no.

We'd wake them _up_ fairly aggressively, but then they'd be caught on the bit being set again by the exclusive locker (that we also woke up).

So they'd get woken up, and then go to sleep again.

So the new behavior wakes things up more aggressively (but in a different way), but not by letting them go out of order and early, but simply by not going back to sleep again.

So the "wake up more" is very different - now it's about not going to sleep again, rather than about ordering the wakeup queue.

We _could_ order the wakeup queue too, and put all non-exclusive waiters at the head again. And make it *really* aggressive.

But since one of our issues has been "latency of walking the wait queue", I'm not sure we want that. Interspersing any blocking waiters - and stopping the waitqueue walking as a result - might be better under load.

Wild handwaving. We could try it, but I think that really would be a separate "try this out" patch.

Right now, I think my patch will likely make for _better_ latencies for everything.

Lower latency of non-exclusive waiters (because not going back to sleep), but also lower latency of walking the wait queue (because fewer entries, hopefully, and also less contention due to the "not going back to sleep" noise)

Linus
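Concretely, the "not going back to sleep" behaviour falls out of the new wait loop no longer re-testing the page bit at all: once the wake function has set WQ_FLAG_WOKEN, the loop exits. A trimmed sketch of that loop, condensed from the patch quoted later in the thread (illustrative, not the exact code):

        for (;;) {
                set_current_state(state);

                if (signal_pending_state(state, current))
                        break;

                /* Set once by the wake function, so no second sleep. */
                if (wait->flags & WQ_FLAG_WOKEN)
                        break;

                io_schedule();
        }
        finish_wait(q, wait);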
On Sat, 25 Jul 2020, Linus Torvalds wrote: > On Sat, Jul 25, 2020 at 3:14 AM Oleg Nesterov <oleg@redhat.com> wrote: > > > > Heh. I too thought about this. And just in case, your patch looks correct > > to me. But I can't really comment this behavioural change. Perhaps it > > should come in a separate patch? > > We could do that. At the same time, I think both parts change how the > waitqueue works that it might as well just be one "fix page_bit_wait > waitqueue usage". > > But let's wait to see what Hugh's numbers say. Oh no, no no: sorry for getting your hopes up there, I won't come up with any numbers more significant than "0 out of 10" machines crashed. I know it would be *really* useful if I could come up with performance comparisons, or steer someone else to do so: but I'm sorry, cannot. Currently it's actually 1 out of 10 machines crashed, for the same driverland issue seen last time, maybe it's a bad machine; and another 1 out of the 10 machines went AWOL for unknown reasons, but probably something outside the kernel got confused by the stress. No reason to suspect your changes at all (but some unanalyzed "failure"s, of dubious significance, accumulating like last time). I'm optimistic: nothing has happened to warn us off your changes. And on Fri, 24 Jul 2020, Linus Torvalds had written: > So the loads you are running are known to have sensitivity to this > particular area, and are why you've done your patches to the page wait > bit code? Yes. It's a series of nineteen ~hour-long tests, of which about five exhibited wake_up_page_bit problems in the past, and one has remained intermittently troublesome that way. Intermittently: usually it does get through, so getting through yesterday and today won't even tell us that your changes fixed it - that we shall learn over time later. Hugh
On Sat, 25 Jul 2020, Hugh Dickins wrote: > On Sat, 25 Jul 2020, Linus Torvalds wrote: > > On Sat, Jul 25, 2020 at 3:14 AM Oleg Nesterov <oleg@redhat.com> wrote: > > > > > > Heh. I too thought about this. And just in case, your patch looks correct > > > to me. But I can't really comment this behavioural change. Perhaps it > > > should come in a separate patch? > > > > We could do that. At the same time, I think both parts change how the > > waitqueue works that it might as well just be one "fix page_bit_wait > > waitqueue usage". > > > > But let's wait to see what Hugh's numbers say. > > Oh no, no no: sorry for getting your hopes up there, I won't come up > with any numbers more significant than "0 out of 10" machines crashed. > I know it would be *really* useful if I could come up with performance > comparisons, or steer someone else to do so: but I'm sorry, cannot. > > Currently it's actually 1 out of 10 machines crashed, for the same > driverland issue seen last time, maybe it's a bad machine; and another > 1 out of the 10 machines went AWOL for unknown reasons, but probably > something outside the kernel got confused by the stress. No reason > to suspect your changes at all (but some unanalyzed "failure"s, of > dubious significance, accumulating like last time). > > I'm optimistic: nothing has happened to warn us off your changes. Less optimistic now, I'm afraid. The machine I said had (twice) crashed coincidentally in driverland (some USB completion thing): that machine I set running a comparison kernel without your changes this morning, while the others still running with your changes; and it has now passed the point where it twice crashed before (the most troublesome test), without crashing. Surprising: maybe still just coincidence, but I must look closer at the crashes. The others have now completed, and one other crashed in that troublesome test, but sadly without yielding any crash info. I've just set comparison runs going on them all, to judge whether to take the "failure"s seriously; and I'll look more closely at them. But hungry and tired now: unlikely to have more to say tonight. > > And on Fri, 24 Jul 2020, Linus Torvalds had written: > > So the loads you are running are known to have sensitivity to this > > particular area, and are why you've done your patches to the page wait > > bit code? > > Yes. It's a series of nineteen ~hour-long tests, of which about five > exhibited wake_up_page_bit problems in the past, and one has remained > intermittently troublesome that way. Intermittently: usually it does > get through, so getting through yesterday and today won't even tell > us that your changes fixed it - that we shall learn over time later. > > Hugh
Linus,

I was greatly confused and tried to confuse you. Somehow I misunderstood your last version and didn't bother to read it again until now.

Sorry for the noise and thanks for your explanations.

Oleg.

On 07/25, Linus Torvalds wrote:
>
> On Sat, Jul 25, 2020 at 12:28 PM Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > What I tried to say: AFAICS before that commit we had (almost) the same
> > behaviour you propose now: unlock_page/etc wakes all the non-exclusive
> > waiters up.
> >
> > No?
>
> Yes, but no.
>
> We'd wake them _up_ fairly aggressively, but then they'd be caught on
> the bit being set again by the exclusive locker (that we also woke
> up).
>
> So they'd get woken up, and then go to sleep again.
>
> So the new behavior wakes things up more aggressively (but in a different
> way), but not by letting them go out of order and early, but simply by
> not going back to sleep again.
>
> So the "wake up more" is very different - now it's about not going to
> sleep again, rather than about ordering the wakeup queue.
>
> We _could_ order the wakeup queue too, and put all non-exclusive
> waiters at the head again. And make it *really* aggressive.
>
> But since one of our issues has been "latency of walking the wait
> queue", I'm not sure we want that. Interspersing any blocking waiters -
> and stopping the waitqueue walking as a result - might be better under
> load.
>
> Wild handwaving. We could try it, but I think that really would be a
> separate "try this out" patch.
>
> Right now, I think my patch will likely make for _better_ latencies
> for everything.
>
> Lower latency of non-exclusive waiters (because not going back to
> sleep), but also lower latency of walking the wait queue (because
> fewer entries, hopefully, and also less contention due to the "not
> going back to sleep" noise)
>
> Linus
>
On Sat, 25 Jul 2020, Hugh Dickins wrote: > On Sat, 25 Jul 2020, Hugh Dickins wrote: > > On Sat, 25 Jul 2020, Linus Torvalds wrote: > > > On Sat, Jul 25, 2020 at 3:14 AM Oleg Nesterov <oleg@redhat.com> wrote: > > > > > > > > Heh. I too thought about this. And just in case, your patch looks correct > > > > to me. But I can't really comment this behavioural change. Perhaps it > > > > should come in a separate patch? > > > > > > We could do that. At the same time, I think both parts change how the > > > waitqueue works that it might as well just be one "fix page_bit_wait > > > waitqueue usage". > > > > > > But let's wait to see what Hugh's numbers say. > > > > Oh no, no no: sorry for getting your hopes up there, I won't come up > > with any numbers more significant than "0 out of 10" machines crashed. > > I know it would be *really* useful if I could come up with performance > > comparisons, or steer someone else to do so: but I'm sorry, cannot. > > > > Currently it's actually 1 out of 10 machines crashed, for the same > > driverland issue seen last time, maybe it's a bad machine; and another > > 1 out of the 10 machines went AWOL for unknown reasons, but probably > > something outside the kernel got confused by the stress. No reason > > to suspect your changes at all (but some unanalyzed "failure"s, of > > dubious significance, accumulating like last time). > > > > I'm optimistic: nothing has happened to warn us off your changes. > > Less optimistic now, I'm afraid. > > The machine I said had (twice) crashed coincidentally in driverland > (some USB completion thing): that machine I set running a comparison > kernel without your changes this morning, while the others still > running with your changes; and it has now passed the point where it > twice crashed before (the most troublesome test), without crashing. > > Surprising: maybe still just coincidence, but I must look closer at > the crashes. > > The others have now completed, and one other crashed in that > troublesome test, but sadly without yielding any crash info. > > I've just set comparison runs going on them all, to judge whether > to take the "failure"s seriously; and I'll look more closely at them. The comparison runs have not yet completed (except for the one started early), but they have all got past the most interesting tests, and it's clear that they do not have the "failure"s seen with your patches. From that I can only conclude that your patches make a difference. I've deduced nothing useful from the logs, will have to leave that to others here with more experience of them. But my assumption now is that you have successfully removed one bottleneck, so the tests get somewhat further and now stick in the next bottleneck, whatever that may be. Which shows up as "failure", where the unlock_page() wake_up_page_bit() bottleneck had allowed the tests to proceed in a more serially sedate way. The xhci handle_cmd_completion list_del bugs (on an older version of the driver): weird, nothing to do with page wakeups, I'll just have to assume that it's some driver bug exposed by the greater stress allowed down, and let driver people investigate (if it still manifests) when we take in your improvements. One nice thing from the comparison runs without your patches: watchdog panic did crash one of those with exactly the unlock_page() wake_up_page_bit() softlockup symptom we've been fighting, that did not appear with your patches. 
So although the sample size is much too small to justify a conclusion, it does tend towards confirming your changes. Thank you for your work on this! And I'm sure you'd have preferred some hard data back, rather than a diary of my mood swings, but... we do what we can. Hugh
On Sun, Jul 26, 2020 at 1:30 PM Hugh Dickins <hughd@google.com> wrote:
>
> I've deduced nothing useful from the logs, will have to leave that
> to others here with more experience of them. But my assumption now
> is that you have successfully removed one bottleneck, so the tests
> get somewhat further and now stick in the next bottleneck, whatever
> that may be. Which shows up as "failure", where the unlock_page()
> wake_up_page_bit() bottleneck had allowed the tests to proceed in
> a more serially sedate way.

Well, that's the very optimistic reading.

As the optimistic and happy person I am (hah!) I'm going to agree with you, and plan on just merging that patch early in the next merge window. It may fix a real bug in the current tree, but it's much too late to apply right now, particularly with your somewhat ambiguous results.

Oleg's theoretical race has probably never been seen, and while the watchdog triggering is clearly a real bug, it's also extreme enough not to really be a strong argument for merging this out-of-window..

> The xhci handle_cmd_completion list_del bugs (on an older version
> of the driver): weird, nothing to do with page wakeups, I'll just
> have to assume that it's some driver bug exposed by the greater
> stress allowed down, and let driver people investigate (if it
> still manifests) when we take in your improvements.

Do you have the bug-report, just to google against anybody else reporting something similar?

> One nice thing from the comparison runs without your patches:
> watchdog panic did crash one of those with exactly the unlock_page()
> wake_up_page_bit() softlockup symptom we've been fighting, that did
> not appear with your patches. So although the sample size is much
> too small to justify a conclusion, it does tend towards confirming
> your changes.

You win some, you lose some. But yes, I'll take that as a tentative success and that the approach is valid.

Thanks,

Linus
On Sun, 26 Jul 2020, Linus Torvalds wrote: > On Sun, Jul 26, 2020 at 1:30 PM Hugh Dickins <hughd@google.com> wrote: > > > > I've deduced nothing useful from the logs, will have to leave that > > to others here with more experience of them. But my assumption now > > is that you have successfully removed one bottleneck, so the tests > > get somewhat further and now stick in the next bottleneck, whatever > > that may be. Which shows up as "failure", where the unlock_page() > > wake_up_page_bit() bottleneck had allowed the tests to proceed in > > a more serially sedate way. > > Well, that's the very optimistic reading. > > As the optimistic and happy person I am (hah!) I'm going to agree with > you, and plan on just merging that patch early in the next merge > window. It may fix a real bug in the current trere, but it's much too > late to apply right now, particularly with your somewhat ambiguous > results. Absolutely: it should be good to see it in v5.9, but much too late for a patch like this in v5.8. > > Oleg's theoretical race has probably never been seen, and while the > watchdog triggering is clearly a real bug, it's also extreme enough > not to really be a strong argument for merging this out-of-window.. > > > The xhci handle_cmd_completion list_del bugs (on an older version > > of the driver): weird, nothing to do with page wakeups, I'll just > > have to assume that it's some driver bug exposed by the greater > > stress allowed down, and let driver people investigate (if it > > still manifests) when we take in your improvements. > > Do you have the bug-report, just to google against anybody else > reporting something simialr> Okay, just on that basis, with some reluctance an edited extract: certainly not asking you or anyone on the list to investigate further. [35196.140502] kernel BUG at lib/list_debug.c:53! [35196.141448] RIP: 0010:__list_del_entry_valid+0x8e/0xb0 [35196.141534] Call Trace: [35196.141538] <IRQ> [35196.141557] [<ffffffffc01bc8b4>] handle_cmd_completion+0x7d4/0x14f0 [xhci_hcd] [35196.141578] [<ffffffffc01bda22>] xhci_irq+0x242/0x1ea0 [xhci_hcd] [35196.141608] [<ffffffffc01bf691>] xhci_msi_irq+0x11/0x20 [xhci_hcd] [35196.141622] [<ffffffffb9ff27f8>] __handle_irq_event_percpu+0x48/0x2c0 [35196.141636] [<ffffffffb9ff2aa2>] handle_irq_event_percpu+0x32/0x80 [35196.141651] [<ffffffffb9ff2b3a>] handle_irq_event+0x4a/0x80 [35196.141680] [<ffffffffb9ff6b08>] handle_edge_irq+0xd8/0x1b0 [35196.141697] [<ffffffffb9ec22ab>] handle_irq+0x2b/0x50 [35196.141712] [<ffffffffbaa02766>] do_IRQ+0xb6/0x1c0 [35196.141725] [<ffffffffbaa00990>] common_interrupt+0x90/0x90 [35196.141732] </IRQ> > > > One nice thing from the comparison runs without your patches: > > watchdog panic did crash one of those with exactly the unlock_page() > > wake_up_page_bit() softlockup symptom we've been fighting, that did > > not appear with your patches. So although the sample size is much > > too small to justify a conclusion, it does tend towards confirming > > your changes. > > You win some, you lose some. But yes, I'll take that as a tentative > success and that the approach is valid. Great, yes, tentative success: and we have three months in which to change our minds if any real trouble surfaces; and I wouldn't call anything I've seen (since that very first version) *real* trouble. > > Thanks, > > Linus
On Sun, Jul 26, 2020 at 01:30:32PM -0700, Hugh Dickins wrote: > On Sat, 25 Jul 2020, Hugh Dickins wrote: > > On Sat, 25 Jul 2020, Hugh Dickins wrote: > > > On Sat, 25 Jul 2020, Linus Torvalds wrote: > > > > On Sat, Jul 25, 2020 at 3:14 AM Oleg Nesterov <oleg@redhat.com> wrote: > > > > > > > > > > Heh. I too thought about this. And just in case, your patch looks correct > > > > > to me. But I can't really comment this behavioural change. Perhaps it > > > > > should come in a separate patch? > > > > > > > > We could do that. At the same time, I think both parts change how the > > > > waitqueue works that it might as well just be one "fix page_bit_wait > > > > waitqueue usage". > > > > > > > > But let's wait to see what Hugh's numbers say. > > > > > > Oh no, no no: sorry for getting your hopes up there, I won't come up > > > with any numbers more significant than "0 out of 10" machines crashed. > > > I know it would be *really* useful if I could come up with performance > > > comparisons, or steer someone else to do so: but I'm sorry, cannot. > > > > > > Currently it's actually 1 out of 10 machines crashed, for the same > > > driverland issue seen last time, maybe it's a bad machine; and another > > > 1 out of the 10 machines went AWOL for unknown reasons, but probably > > > something outside the kernel got confused by the stress. No reason > > > to suspect your changes at all (but some unanalyzed "failure"s, of > > > dubious significance, accumulating like last time). > > > > > > I'm optimistic: nothing has happened to warn us off your changes. > > > > Less optimistic now, I'm afraid. > > > > The machine I said had (twice) crashed coincidentally in driverland > > (some USB completion thing): that machine I set running a comparison > > kernel without your changes this morning, while the others still > > running with your changes; and it has now passed the point where it > > twice crashed before (the most troublesome test), without crashing. > > > > Surprising: maybe still just coincidence, but I must look closer at > > the crashes. > > > > The others have now completed, and one other crashed in that > > troublesome test, but sadly without yielding any crash info. > > > > I've just set comparison runs going on them all, to judge whether > > to take the "failure"s seriously; and I'll look more closely at them. > > The comparison runs have not yet completed (except for the one started > early), but they have all got past the most interesting tests, and it's > clear that they do not have the "failure"s seen with your patches. > > >From that I can only conclude that your patches make a difference. > > I've deduced nothing useful from the logs, will have to leave that > to others here with more experience of them. But my assumption now > is that you have successfully removed one bottleneck, so the tests > get somewhat further and now stick in the next bottleneck, whatever > that may be. Which shows up as "failure", where the unlock_page() > wake_up_page_bit() bottleneck had allowed the tests to proceed in > a more serially sedate way. > > The xhci handle_cmd_completion list_del bugs (on an older version > of the driver): weird, nothing to do with page wakeups, I'll just > have to assume that it's some driver bug exposed by the greater > stress allowed down, and let driver people investigate (if it > still manifests) when we take in your improvements. Linus just pointed me at this thread. 
If you could run: echo -n 'module xhci_hcd =p' > /sys/kernel/debug/dynamic_debug/control and run the same workload to see if anything shows up in the log when xhci crashes, that would be great. Although if you are using an "older version" of the driver, there's not much I can suggest except update to a newer one :) thanks, greg k-h
Hi, sorry for being late in discussion (was offline or very busy with other stuff). I hope I got it right and this is the latest version of your patches. Btw. do you still think that increasing PAGE_WAIT_TABLE_BITS is reasonable. In the meantime I have learned that the customer suffering from the issue is very unlikely to reboot the machine anytime soon or even willing to test a non-trivial patch. We do not own any machine which exhibit this problem unfortunately. So it is quite unlikely I can help with testing. Also does it make sense to put this into mmotm tree for a while to get a larger testing coverage? I didn't get to review these patch and it will take some time to do that because this is really subtle area. Thanks for helping out and sorry that I cannot really help much more now. On Fri 24-07-20 16:25:56, Linus Torvalds wrote: > From 0bccb60841cc52a9aa6e9cc6b7eff59d1983e8fa Mon Sep 17 00:00:00 2001 > From: Linus Torvalds <torvalds@linux-foundation.org> > Date: Thu, 23 Jul 2020 10:16:49 -0700 > Subject: [PATCH 1/2] mm: rewrite wait_on_page_bit_common() logic > > It turns out that wait_on_page_bit_common() had several problems, > ranging from just unfair behavioe due to re-queueing at the end of the > wait queue when re-trying, and an outright bug that could result in > missed wakeups (but probably never happened in practice). > > This rewrites the whole logic to avoid both issues, by simply moving the > logic to check (and possibly take) the bit lock into the wakeup path > instead. > > That makes everything much more straightforward, and means that we never > need to re-queue the wait entry: if we get woken up, we'll be notified > through WQ_FLAG_WOKEN, and the wait queue entry will have been removed, > and everything will have been done for us. > > Link: https://lore.kernel.org/lkml/CAHk-=wjJA2Z3kUFb-5s=6+n0qbTs8ELqKFt9B3pH85a8fGD73w@mail.gmail.com/ > Link: https://lore.kernel.org/lkml/alpine.LSU.2.11.2007221359450.1017@eggly.anvils/ > Reported-by: Oleg Nesterov <oleg@redhat.com> > Reported-by: Hugh Dickins <hughd@google.com> > Cc: Michal Hocko <mhocko@suse.com> > Reviewed-by: Oleg Nesterov <oleg@redhat.com> > Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> > --- > mm/filemap.c | 132 +++++++++++++++++++++++++++++++++------------------ > 1 file changed, 85 insertions(+), 47 deletions(-) > > diff --git a/mm/filemap.c b/mm/filemap.c > index 385759c4ce4b..8c3d3e233d37 100644 > --- a/mm/filemap.c > +++ b/mm/filemap.c > @@ -1002,6 +1002,7 @@ struct wait_page_queue { > > static int wake_page_function(wait_queue_entry_t *wait, unsigned mode, int sync, void *arg) > { > + int ret; > struct wait_page_key *key = arg; > struct wait_page_queue *wait_page > = container_of(wait, struct wait_page_queue, wait); > @@ -1014,17 +1015,40 @@ static int wake_page_function(wait_queue_entry_t *wait, unsigned mode, int sync, > return 0; > > /* > - * Stop walking if it's locked. > - * Is this safe if put_and_wait_on_page_locked() is in use? > - * Yes: the waker must hold a reference to this page, and if PG_locked > - * has now already been set by another task, that task must also hold > - * a reference to the *same usage* of this page; so there is no need > - * to walk on to wake even the put_and_wait_on_page_locked() callers. > + * If it's an exclusive wait, we get the bit for it, and > + * stop walking if we can't. 
> + * > + * If it's a non-exclusive wait, then the fact that this > + * wake function was called means that the bit already > + * was cleared, and we don't care if somebody then > + * re-took it. > */ > - if (test_bit(key->bit_nr, &key->page->flags)) > - return -1; > + ret = 0; > + if (wait->flags & WQ_FLAG_EXCLUSIVE) { > + if (test_and_set_bit(key->bit_nr, &key->page->flags)) > + return -1; > + ret = 1; > + } > + wait->flags |= WQ_FLAG_WOKEN; > > - return autoremove_wake_function(wait, mode, sync, key); > + wake_up_state(wait->private, mode); > + > + /* > + * Ok, we have successfully done what we're waiting for, > + * and we can unconditionally remove the wait entry. > + * > + * Note that this has to be the absolute last thing we do, > + * since after list_del_init(&wait->entry) the wait entry > + * might be de-allocated and the process might even have > + * exited. > + * > + * We _really_ should have a "list_del_init_careful()" to > + * properly pair with the unlocked "list_empty_careful()" > + * in finish_wait(). > + */ > + smp_mb(); > + list_del_init(&wait->entry); > + return ret; > } > > static void wake_up_page_bit(struct page *page, int bit_nr) > @@ -1103,16 +1127,31 @@ enum behavior { > */ > }; > > +/* > + * Attempt to check (or get) the page bit, and mark the > + * waiter woken if successful. > + */ > +static inline bool trylock_page_bit_common(struct page *page, int bit_nr, > + struct wait_queue_entry *wait) > +{ > + if (wait->flags & WQ_FLAG_EXCLUSIVE) { > + if (test_and_set_bit(bit_nr, &page->flags)) > + return false; > + } else if (test_bit(bit_nr, &page->flags)) > + return false; > + > + wait->flags |= WQ_FLAG_WOKEN; > + return true; > +} > + > static inline int wait_on_page_bit_common(wait_queue_head_t *q, > struct page *page, int bit_nr, int state, enum behavior behavior) > { > struct wait_page_queue wait_page; > wait_queue_entry_t *wait = &wait_page.wait; > - bool bit_is_set; > bool thrashing = false; > bool delayacct = false; > unsigned long pflags; > - int ret = 0; > > if (bit_nr == PG_locked && > !PageUptodate(page) && PageWorkingset(page)) { > @@ -1130,48 +1169,47 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q, > wait_page.page = page; > wait_page.bit_nr = bit_nr; > > - for (;;) { > - spin_lock_irq(&q->lock); > + /* > + * Do one last check whether we can get the > + * page bit synchronously. > + * > + * Do the SetPageWaiters() marking before that > + * to let any waker we _just_ missed know they > + * need to wake us up (otherwise they'll never > + * even go to the slow case that looks at the > + * page queue), and add ourselves to the wait > + * queue if we need to sleep. > + * > + * This part needs to be done under the queue > + * lock to avoid races. > + */ > + spin_lock_irq(&q->lock); > + SetPageWaiters(page); > + if (!trylock_page_bit_common(page, bit_nr, wait)) > + __add_wait_queue_entry_tail(q, wait); > + spin_unlock_irq(&q->lock); > > - if (likely(list_empty(&wait->entry))) { > - __add_wait_queue_entry_tail(q, wait); > - SetPageWaiters(page); > - } > + /* > + * From now on, all the logic will be based on > + * the WQ_FLAG_WOKEN flag, and the and the page > + * bit testing (and setting) will be - or has > + * already been - done by the wake function. > + * > + * We can drop our reference to the page. 
> + */ > + if (behavior == DROP) > + put_page(page); > > + for (;;) { > set_current_state(state); > > - spin_unlock_irq(&q->lock); > - > - bit_is_set = test_bit(bit_nr, &page->flags); > - if (behavior == DROP) > - put_page(page); > - > - if (likely(bit_is_set)) > - io_schedule(); > - > - if (behavior == EXCLUSIVE) { > - if (!test_and_set_bit_lock(bit_nr, &page->flags)) > - break; > - } else if (behavior == SHARED) { > - if (!test_bit(bit_nr, &page->flags)) > - break; > - } > - > - if (signal_pending_state(state, current)) { > - ret = -EINTR; > + if (signal_pending_state(state, current)) > break; > - } > > - if (behavior == DROP) { > - /* > - * We can no longer safely access page->flags: > - * even if CONFIG_MEMORY_HOTREMOVE is not enabled, > - * there is a risk of waiting forever on a page reused > - * for something that keeps it locked indefinitely. > - * But best check for -EINTR above before breaking. > - */ > + if (wait->flags & WQ_FLAG_WOKEN) > break; > - } > + > + io_schedule(); > } > > finish_wait(q, wait); > @@ -1190,7 +1228,7 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q, > * bother with signals either. > */ > > - return ret; > + return wait->flags & WQ_FLAG_WOKEN ? 0 : -EINTR; > } > > void wait_on_page_bit(struct page *page, int bit_nr) > -- > 2.28.0.rc0.3.g1e25d3a62f > > From 93f0263b9b952a1c449cec56a6aadf6320e821f9 Mon Sep 17 00:00:00 2001 > From: Linus Torvalds <torvalds@linux-foundation.org> > Date: Thu, 23 Jul 2020 12:33:41 -0700 > Subject: [PATCH 2/2] list: add "list_del_init_careful()" to go with > "list_empty_careful()" > > That gives us ordering guarantees around the pair. > > Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> > --- > include/linux/list.h | 20 +++++++++++++++++++- > kernel/sched/wait.c | 2 +- > mm/filemap.c | 7 +------ > 3 files changed, 21 insertions(+), 8 deletions(-) > > diff --git a/include/linux/list.h b/include/linux/list.h > index aff44d34f4e4..0d0d17a10d25 100644 > --- a/include/linux/list.h > +++ b/include/linux/list.h > @@ -282,6 +282,24 @@ static inline int list_empty(const struct list_head *head) > return READ_ONCE(head->next) == head; > } > > +/** > + * list_del_init_careful - deletes entry from list and reinitialize it. > + * @entry: the element to delete from the list. > + * > + * This is the same as list_del_init(), except designed to be used > + * together with list_empty_careful() in a way to guarantee ordering > + * of other memory operations. > + * > + * Any memory operations done before a list_del_init_careful() are > + * guaranteed to be visible after a list_empty_careful() test. 
> + */ > +static inline void list_del_init_careful(struct list_head *entry) > +{ > + __list_del_entry(entry); > + entry->prev = entry; > + smp_store_release(&entry->next, entry); > +} > + > /** > * list_empty_careful - tests whether a list is empty and not being modified > * @head: the list to test > @@ -297,7 +315,7 @@ static inline int list_empty(const struct list_head *head) > */ > static inline int list_empty_careful(const struct list_head *head) > { > - struct list_head *next = head->next; > + struct list_head *next = smp_load_acquire(&head->next); > return (next == head) && (next == head->prev); > } > > diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c > index ba059fbfc53a..01f5d3020589 100644 > --- a/kernel/sched/wait.c > +++ b/kernel/sched/wait.c > @@ -389,7 +389,7 @@ int autoremove_wake_function(struct wait_queue_entry *wq_entry, unsigned mode, i > int ret = default_wake_function(wq_entry, mode, sync, key); > > if (ret) > - list_del_init(&wq_entry->entry); > + list_del_init_careful(&wq_entry->entry); > > return ret; > } > diff --git a/mm/filemap.c b/mm/filemap.c > index 8c3d3e233d37..991503bbf922 100644 > --- a/mm/filemap.c > +++ b/mm/filemap.c > @@ -1041,13 +1041,8 @@ static int wake_page_function(wait_queue_entry_t *wait, unsigned mode, int sync, > * since after list_del_init(&wait->entry) the wait entry > * might be de-allocated and the process might even have > * exited. > - * > - * We _really_ should have a "list_del_init_careful()" to > - * properly pair with the unlocked "list_empty_careful()" > - * in finish_wait(). > */ > - smp_mb(); > - list_del_init(&wait->entry); > + list_del_init_careful(&wait->entry); > return ret; > } > > -- > 2.28.0.rc0.3.g1e25d3a62f >
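The ordering that patch 2 documents ("any memory operations done before a list_del_init_careful() are guaranteed to be visible after a list_empty_careful() test") is an ordinary release/acquire pairing: the release store that publishes entry->next == entry makes the earlier WQ_FLAG_WOKEN store visible to whoever observes the list as empty with an acquire load. A tiny userspace C model of the same idea (the names and harness are mine and purely illustrative; this is not kernel code):

#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static int woken_flag;                    /* plain store made before the "publish" */
static _Atomic int entry_points_to_self;  /* models entry->next == entry           */

static void *waker(void *arg)
{
        (void)arg;
        woken_flag = 1;                   /* e.g. setting WQ_FLAG_WOKEN */
        /* list_del_init_careful(): smp_store_release() on entry->next */
        atomic_store_explicit(&entry_points_to_self, 1, memory_order_release);
        return NULL;
}

static void *waiter(void *arg)
{
        (void)arg;
        /* list_empty_careful(): smp_load_acquire() on the next pointer */
        while (!atomic_load_explicit(&entry_points_to_self, memory_order_acquire))
                ;
        /* Release/acquire guarantees the earlier plain store is visible here. */
        assert(woken_flag == 1);
        return NULL;
}

int main(void)
{
        pthread_t a, b;

        pthread_create(&a, NULL, waker, NULL);
        pthread_create(&b, NULL, waiter, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        puts("release/acquire ordering held");
        return 0;
}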
On Mon, Aug 3, 2020 at 6:14 AM Michal Hocko <mhocko@suse.com> wrote: > > I hope I got it right and this is the latest version of your patches. Btw. > do you still think that increasing PAGE_WAIT_TABLE_BITS is reasonable. I suspect it's still very reasonable, but I'd love to have numbers for it. > In the meantime I have learned that the customer suffering from the > issue is very unlikely to reboot the machine anytime soon or even > willing to test a non-trivial patch. We do not own any machine which > exhibit this problem unfortunately. So it is quite unlikely I can > help with testing. Ok. > Also does it make sense to put this into mmotm tree for a while to get a > larger testing coverage? Well, I did the 5.8 release yesterday, so I just put it in the tree for the 5.9 merge window - I've been running it locally since I posted it, and while Hugh couldn't prove it improved anything, his results certainly also didn't say it was bad. So anybody that tests my top-of-tree will be testing that thing now, which is likely more than linux-next or mmotm gets (outside of build testing and the robots). Of course, I don't know how many people run my development tree, particularly during the busy merge window, but hey, at worst it will be in the next linux-next that way. At best, it's not just me, but a number of other developers. Linus
Nice to see the +130.0% this morning. I got back on to this on Monday, here's some follow-up. On Sun, 26 Jul 2020, Hugh Dickins wrote: > > The comparison runs have not yet completed (except for the one started > early), but they have all got past the most interesting tests, and it's > clear that they do not have the "failure"s seen with your patches. > > From that I can only conclude that your patches make a difference. > > I've deduced nothing useful from the logs, will have to leave that > to others here with more experience of them. But my assumption now > is that you have successfully removed one bottleneck, so the tests > get somewhat further and now stick in the next bottleneck, whatever > that may be. Which shows up as "failure", where the unlock_page() > wake_up_page_bit() bottleneck had allowed the tests to proceed in > a more serially sedate way. Yes, that's still how it appears to me. The test failures, all of them, came from fork() returning ENOSPC, which originated from alloc_pid()'s idr_alloc_cyclic(). I did try doubling our already large pid_max, that did not work out well, there are probably good reasons for it to be where it is and I was wrong to dabble with it. I also tried an rcu_barrier() and retry when getting -ENOSPC there, thinking maybe RCU was not freeing them up fast enough, but that didn't help either. I think (but didn't quite make the effort to double-check with an independent count) it was simply running out of pids: that your change speeds up the forking enough, that exiting could not quite keep up (SIGCHLD is SIG_IGNed); whereas before your change, the unlock_page() in do_wp_page(), on a PageAnon stack page, slowed the forking down enough when heavily contended. (I think we could improve the checks there, to avoid taking page lock in more cases; but I don't know if that would help any real-life workload - I see now that Michal's case is do_read_fault() not do_wp_page().) And FWIW a further speedup there is the opposite of what these tests are wanting: for the moment I've enabled a delay to get them passing as before. Something I was interested to realize in looking at this: trylock_page() on a contended lock is now much less likely to jump the queue and succeed than before, since your lock holder hands off the page lock to the next holder: much smaller window than waiting for the next to wake to take it. Nothing wrong with that, but effect might be seen somewhere. > > The xhci handle_cmd_completion list_del bugs (on an older version > of the driver): weird, nothing to do with page wakeups, I'll just > have to assume that it's some driver bug exposed by the greater > stress allowed down, and let driver people investigate (if it > still manifests) when we take in your improvements. Complete red herring. I'll give Greg more info in response to his mail, and there may be an xhci bug in there; but when I looked back, found I'd come across the same bug back in October, and find that occasionally it's been seen in our fleet. Yes, it's odd that your change coincided with it becoming more common on that machine (which I've since replaced by another), yes it's funny that it's in __list_del_entry_valid(), which is exactly where I got crashes on pages with your initial patch; but it's just a distraction. > > One nice thing from the comparison runs without your patches: > watchdog panic did crash one of those with exactly the unlock_page() > wake_up_page_bit() softlockup symptom we've been fighting, that did > not appear with your patches. 
So although the sample size is much > too small to justify a conclusion, it does tend towards confirming > your changes. > > Thank you for your work on this! And I'm sure you'd have preferred > some hard data back, rather than a diary of my mood swings, but... > we do what we can. > > Hugh
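For readers following the trylock_page() point above: trylock_page() boils down to a test_and_set_bit_lock() on PG_locked (roughly the sketch below, modulo details of that era's pagemap.h), so whether it can jump the queue depends entirely on how long the bit is observably clear. With the handoff, that window is only the gap between unlock_page() clearing the bit and the first exclusive waiter's wake callback re-setting it under the waitqueue lock, rather than lasting until the woken task actually gets scheduled.

/* Roughly what trylock_page() does (sketch, for illustration): */
static inline int trylock_page(struct page *page)
{
        page = compound_head(page);
        return (likely(!test_and_set_bit_lock(PG_locked, &page->flags)));
}

/*
 * Rough timeline with the handoff (illustrative comments, not a code path):
 *   unlock_page():       clear_bit_unlock(PG_locked, &page->flags)
 *   wake_up_page_bit():  wake function does test_and_set_bit(PG_locked)
 *                        for the first exclusive waiter, under q->lock
 * A trylock_page() can only "win" in the gap between those two steps.
 */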
On Mon, 27 Jul 2020, Greg KH wrote: > > Linus just pointed me at this thread. > > If you could run: > echo -n 'module xhci_hcd =p' > /sys/kernel/debug/dynamic_debug/control > and run the same workload to see if anything shows up in the log when > xhci crashes, that would be great. Thanks, I tried that, and indeed it did have a story to tell: ep 0x81 - asked for 16 bytes, 10 bytes untransferred ep 0x81 - asked for 16 bytes, 10 bytes untransferred ep 0x81 - asked for 16 bytes, 10 bytes untransferred a very large number of lines like the above, then Cancel URB 00000000d81602f7, dev 4, ep 0x0, starting at offset 0xfffd42c0 // Ding dong! ep 0x81 - asked for 16 bytes, 10 bytes untransferred Stopped on No-op or Link TRB for slot 1 ep 0 xhci_drop_endpoint called for udev 000000005bc07fa6 drop ep 0x81, slot id 1, new drop flags = 0x8, new add flags = 0x0 add ep 0x81, slot id 1, new drop flags = 0x8, new add flags = 0x8 xhci_check_bandwidth called for udev 000000005bc07fa6 // Ding dong! Successful Endpoint Configure command Cancel URB 000000006b77d490, dev 4, ep 0x81, starting at offset 0x0 // Ding dong! Stopped on No-op or Link TRB for slot 1 ep 2 Removing canceled TD starting at 0x0 (dma). list_del corruption: prev(ffff8fdb4de7a130)->next should be ffff8fdb41697f88, but is 6b6b6b6b6b6b6b6b; next(ffff8fdb4de7a130)->prev is 6b6b6b6b6b6b6b6b. ------------[ cut here ]------------ kernel BUG at lib/list_debug.c:53! RIP: 0010:__list_del_entry_valid+0x8e/0xb0 Call Trace: <IRQ> handle_cmd_completion+0x7d4/0x14f0 [xhci_hcd] xhci_irq+0x242/0x1ea0 [xhci_hcd] xhci_msi_irq+0x11/0x20 [xhci_hcd] __handle_irq_event_percpu+0x48/0x2c0 handle_irq_event_percpu+0x32/0x80 handle_irq_event+0x4a/0x80 handle_edge_irq+0xd8/0x1b0 handle_irq+0x2b/0x50 do_IRQ+0xb6/0x1c0 common_interrupt+0x90/0x90 </IRQ> Info provided for your interest, not expecting any response. The list_del info in there is non-standard, from a patch of mine: I find hashed addresses in debug output less than helpful. > > Although if you are using an "older version" of the driver, there's not > much I can suggest except update to a newer one :) Yes, I was reluctant to post any info, since really the ball is at our end of the court, not yours. I did have a go at bringing in the latest xhci driver instead, but quickly saw that was not a sensible task for me. And I did scan the git log of xhci changes (especially xhci-ring.c changes): thought I saw a likely relevant and easily applied fix commit, but in fact it made no difference here. I suspect it's in part a hardware problem, but driver not recovering correctly. I've replaced the machine (but also noticed that the same crash has occasionally been seen on other machines). I'm sure it has no relevance to this unlock_page() thread, though it's quite possible that it's triggered under stress, and Linus's changes allowed greater stress. Hugh
On Wed, Aug 5, 2020 at 10:21 PM Hugh Dickins <hughd@google.com> wrote: > > Something I was interested to realize in looking at this: trylock_page() > on a contended lock is now much less likely to jump the queue and > succeed than before, since your lock holder hands off the page lock to > the next holder: much smaller window than waiting for the next to wake > to take it. Nothing wrong with that, but effect might be seen somewhere. Yeah, the window is smaller, but it's not gone. It *might* be interesting to actually do the handover directly from "unlock_page()", and avoid clearing (and then setting) the bit entirely. Something like the attached TOTALLY UNTESTED patch. NOTE! Sometimes when I say something is untested, I think the patch is fine because it's simple and straightforward, I just didn't test it. This time it's both untested and very very subtle indeed. Did I get the hand-over case SMP memory barriers right? Did I screw something else up? So this might be complete garbage. I really don't know. But it might close the window for an unfair trylock (or lucky page_lock()) entirely. Or maybe it just makes page locking break entirely. It's a very real risk. The upside may not be worth it. Linus
On Thu, Aug 06, 2020 at 10:07:07AM -0700, Linus Torvalds wrote: > On Wed, Aug 5, 2020 at 10:21 PM Hugh Dickins <hughd@google.com> wrote: > > Something I was interested to realize in looking at this: trylock_page() > > on a contended lock is now much less likely to jump the queue and > > succeed than before, since your lock holder hands off the page lock to > > the next holder: much smaller window than waiting for the next to wake > > to take it. Nothing wrong with that, but effect might be seen somewhere. > > Yeah, the window is smaller, but it's not gone. > > It *might* be interesting to actually do the handover directly from > "unlock_page()", and avoid clearing (and then setting) the bit > entirely. > > Something like the attached TOTALLY UNTESTED patch. > > NOTE! Sometimes when I say something is untested, I think the patch is > fine because it's simple and straightforward, I just didn't test it. > > This time it's both untested and very very subtle indeed. Did I get > the hand-over case SMP memory barriers right? Did I screw something > else up? > > So this might be complete garbage. I really don't know. But it might > close the window for an unfair trylock (or lucky page_lock()) > entirely. It wasn't clear to me whether Hugh thought it was an improvement or not that trylock was now less likely to jump the queue. There're the usual "fair is the opposite of throughput" kind of arguments.
On Thu, Aug 6, 2020 at 11:00 AM Matthew Wilcox <willy@infradead.org> wrote: > > It wasn't clear to me whether Hugh thought it was an improvement or > not that trylock was now less likely to jump the queue. There're > the usual "fair is the opposite of throughput" kind of arguments. Yeah, it could go either way. But on the whole, if the lock bit is getting any contention, I think we'd rather have it be fair for latency reasons. That said, I'm not convinced about my patch, and I actually threw it away without even testing it (sometimes I keep patches around in my private tree for testing, and they can live there for months or even years when I wonder if they are worth it, but this time I didn't bother to go to the trouble). If somebody is interested in pursuing this, I think that patch might be a good starting point (and it _might_ even work), but it seemed to be too subtle to really worry about unless somebody finds an actual acute reason for it. I think the existing patch narrowing the window is good, and it clearly didn't hurt throughput (although that was almost certainly for other reasons). Linus
On Thu, 6 Aug 2020, Linus Torvalds wrote: > On Thu, Aug 6, 2020 at 11:00 AM Matthew Wilcox <willy@infradead.org> wrote: > > > > It wasn't clear to me whether Hugh thought it was an improvement or > > not that trylock was now less likely to jump the queue. There're > > the usual "fair is the opposite of throughput" kind of arguments. I don't know. I'm inclined to think that keeping just a small chance of jumping the queue is probably good; but pay no attention to me, that's an opinion based on ignorance. Thanks for mentioning the lucky lock_page(): I was quite wrong to single out trylock_page() - I suppose its lack of a slow path just helped it spring to mind more easily. > > Yeah, it could go either way. But on the whole, if the lock bit is > getting any contention, I think we'd rather have it be fair for > latency reasons. > > That said, I'm not convinced about my patch, and I actually threw it > away without even testing it (sometimes I keep patches around in my > private tree for testing, and they can live there for months or even > years when I wonder if they are worth it, but this time I didn't > bother to go to the trouble). > > If somebody is interested in pursuing this, I think that patch might > be a good starting point (and it _might_ even work), but it seemed to > be too subtle to really worry about unless somebody finds an actual > acute reason for it. It is an attractive idea, passing the baton from one to the next; and a logical destination from where you had already arrived. Maybe somebody, not me, should pursue it. I tried to ignore it, but eventually succumbed to temptation, and ran it overnight at home (my usual tmpfs swapping), and on one machine at work (fork/exit stress etc). It did need one fixup, below: then home testing went fine. But the fork/exit stress hit that old familiar unlock_page wake_up_page_bit softlockup that your existing patch has appeared to fix. I can't afford to investigate further (though might regret that: I keep wondering if the small dose of unfairness in your existing patch is enough to kick the test when it otherwise would hang, and this new stricter patch be pointing to that). My home testing was, effectively, on top of c6fe44d96fc1 (v5.8 plus your two patches): I did not have in the io_uring changes from the current tree. In glancing there, I noticed one new and one previous instance of SetPageWaiters() *after* __add_wait_queue_entry_tail(): are those correct? > > I think the existing patch narrowing the window is good, and it > clearly didn't hurt throughput (although that was almost certainly for > other reasons). Agreed. > > Linus > --- linus/mm/filemap.c 2020-08-07 02:08:13.727709186 -0700 +++ hughd/mm/filemap.c 2020-08-07 02:16:10.960331473 -0700 @@ -1044,7 +1044,7 @@ static int wake_page_function(wait_queue return ret; } -static int wake_up_page_bit(struct page *page, int bit_nr) +static void wake_up_page_bit(struct page *page, int bit_nr, bool exclude) { wait_queue_head_t *q = page_waitqueue(page); struct wait_page_key key; @@ -1096,15 +1096,28 @@ static int wake_up_page_bit(struct page * That's okay, it's a rare case. The next waker will clear it. */ } + + /* + * If we hoped to pass PG_locked on to the next locker, but found + * no exclusive taker, then we must clear it before dropping q->lock. + * Otherwise unlock_page() can race trylock_page_bit_common(), and if + * PageWaiters was already set from before, then cmpxchg sees no + * difference to send it back to wake_up_page_bit(). 
+ * + * We may have already dropped and retaken q->lock on the way here, but + * I think this works out because new entries are always added at tail. + */ + if (exclude && !exclusive) + clear_bit(bit_nr, &page->flags); + spin_unlock_irqrestore(&q->lock, flags); - return exclusive; } static void wake_up_page(struct page *page, int bit) { if (!PageWaiters(page)) return; - wake_up_page_bit(page, bit); + wake_up_page_bit(page, bit, false); } /* @@ -1339,8 +1352,8 @@ void unlock_page(struct page *page) * spinlock wake_up_page_bit() will do. */ smp_mb__before_atomic(); - if (wake_up_page_bit(page, PG_locked)) - return; + wake_up_page_bit(page, PG_locked, true); + return; } new = cmpxchg_release(&page->flags, flags, flags & ~(1 << PG_locked)); if (likely(new == flags))
On Fri, Aug 7, 2020 at 11:41 AM Hugh Dickins <hughd@google.com> wrote: > > + > + /* > + * If we hoped to pass PG_locked on to the next locker, but found > + * no exclusive taker, then we must clear it before dropping q->lock. > + * Otherwise unlock_page() can race trylock_page_bit_common(), and if > + * PageWaiters was already set from before, then cmpxchg sees no > + * difference to send it back to wake_up_page_bit(). > + * > + * We may have already dropped and retaken q->lock on the way here, but > + * I think this works out because new entries are always added at tail. > + */ > + if (exclude && !exclusive) > + clear_bit(bit_nr, &page->flags); > + > spin_unlock_irqrestore(&q->lock, flags); Yeah, I started thinking about this, and that's when I decided to just throw away the patch. Because now it clears the bit *after* it has woken people up, and that just made me go "Eww". You did add a comment about the whole "always added to the tail" thing, and the spinlock should serialize everything, so I guess it's ok (because the spinlock should serialize things that care), but it just feels wrong. I also started worrying about other people waiting on the page lock (ie we now have that whole io_uring case), and interaction with the PG_writeback case etc, and it just ended up feeling very messy. I think it was all fine - other cases won't hit that exclusive case at all - but I had a hard time wrapping my little mind around the exact handoff rules, and the cmpxchg behavior when other bits changed (eg PG_waiters), so I gave up. > My home testing was, effectively, on top of c6fe44d96fc1 (v5.8 plus > your two patches): I did not have in the io_uring changes from the > current tree. In glancing there, I noticed one new and one previous > instance of SetPageWaiters() *after* __add_wait_queue_entry_tail(): > are those correct? I don't think SetPageWaiters() has any ordering constraints with the wait queue. Nobody looks at the waitqueue unless they already got to the slowpath. The only ordering constraint is with the testing of the bit we're going to wait on. Because doing if (test_and_set_bit()) SetPageWaiters(page); is wrong - there's a race in there when somebody can clear the bit, and not see that there are waiters. And obviously that needs to be done inside the spinlock, but once you do that, the ordering of the actual wait-queue is entirely irrelevant. The spinlock will order _that_ part. The only thing the spinlock won't order and serialize is the PG_lock bit and making sure we get to the slowpath in the first place. So if you're talking about __wait_on_page_locked_async(), then I think that part is ok. Or am I missing something? Linus
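Spelled out, the constraint above is only between PG_waiters and the page bit, not the waitqueue list. A sketch of the broken ordering versus what the patch earlier in the thread actually does (illustrative, using the helper names from that patch):

        /* Broken: the unlocker can clear the bit and see !PageWaiters before
         * we manage to set it, so nobody wakes us: a missed wakeup. */
        if (test_and_set_bit(bit_nr, &page->flags))
                SetPageWaiters(page);

        /* The patch instead advertises the waiter first, under the waitqueue
         * lock, and only then tests (or takes) the bit. */
        spin_lock_irq(&q->lock);
        SetPageWaiters(page);
        if (!trylock_page_bit_common(page, bit_nr, wait))
                __add_wait_queue_entry_tail(q, wait);
        spin_unlock_irq(&q->lock);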
On Fri, Aug 07, 2020 at 11:41:09AM -0700, Hugh Dickins wrote: > My home testing was, effectively, on top of c6fe44d96fc1 (v5.8 plus > your two patches): I did not have in the io_uring changes from the > current tree. In glancing there, I noticed one new and one previous > instance of SetPageWaiters() *after* __add_wait_queue_entry_tail(): > are those correct? By the way, don't do any performance testing on current Linus tree. This commit: commit 37f4a24c2469a10a4c16c641671bd766e276cf9f (refs/bisect/bad) Author: Ming Lei <ming.lei@redhat.com> Date: Tue Jun 30 22:03:57 2020 +0800 blk-mq: centralise related handling into blk_mq_get_driver_tag Move .nr_active update and request assignment into blk_mq_get_driver_tag(), all are good to do during getting driver tag. Meantime blk-flush related code is simplified and flush request needn't to update the request table manually any more. Signed-off-by: Ming Lei <ming.lei@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> slows everything down, at least on my test qemu system. Seems like it's losing block queue tags.
On Wed, Aug 05, 2020 at 10:46:12PM -0700, Hugh Dickins wrote:
> On Mon, 27 Jul 2020, Greg KH wrote:
> >
> > Linus just pointed me at this thread.
> >
> > If you could run:
> > 	echo -n 'module xhci_hcd =p' > /sys/kernel/debug/dynamic_debug/control
> > and run the same workload to see if anything shows up in the log when
> > xhci crashes, that would be great.
>
> Thanks, I tried that, and indeed it did have a story to tell:
>
> ep 0x81 - asked for 16 bytes, 10 bytes untransferred
> ep 0x81 - asked for 16 bytes, 10 bytes untransferred
> ep 0x81 - asked for 16 bytes, 10 bytes untransferred
> a very large number of lines like the above, then
> Cancel URB 00000000d81602f7, dev 4, ep 0x0, starting at offset 0xfffd42c0
> // Ding dong!
> ep 0x81 - asked for 16 bytes, 10 bytes untransferred
> Stopped on No-op or Link TRB for slot 1 ep 0
> xhci_drop_endpoint called for udev 000000005bc07fa6
> drop ep 0x81, slot id 1, new drop flags = 0x8, new add flags = 0x0
> add ep 0x81, slot id 1, new drop flags = 0x8, new add flags = 0x8
> xhci_check_bandwidth called for udev 000000005bc07fa6
> // Ding dong!
> Successful Endpoint Configure command
> Cancel URB 000000006b77d490, dev 4, ep 0x81, starting at offset 0x0
> // Ding dong!
> Stopped on No-op or Link TRB for slot 1 ep 2
> Removing canceled TD starting at 0x0 (dma).
> list_del corruption: prev(ffff8fdb4de7a130)->next should be ffff8fdb41697f88,
> but is 6b6b6b6b6b6b6b6b; next(ffff8fdb4de7a130)->prev is 6b6b6b6b6b6b6b6b.
> ------------[ cut here ]------------
> kernel BUG at lib/list_debug.c:53!
> RIP: 0010:__list_del_entry_valid+0x8e/0xb0
> Call Trace:
>  <IRQ>
>  handle_cmd_completion+0x7d4/0x14f0 [xhci_hcd]
>  xhci_irq+0x242/0x1ea0 [xhci_hcd]
>  xhci_msi_irq+0x11/0x20 [xhci_hcd]
>  __handle_irq_event_percpu+0x48/0x2c0
>  handle_irq_event_percpu+0x32/0x80
>  handle_irq_event+0x4a/0x80
>  handle_edge_irq+0xd8/0x1b0
>  handle_irq+0x2b/0x50
>  do_IRQ+0xb6/0x1c0
>  common_interrupt+0x90/0x90
>  </IRQ>
>
> Info provided for your interest, not expecting any response.
> The list_del info in there is non-standard, from a patch of mine:
> I find hashed addresses in debug output less than helpful.

Thanks for this, that is really odd.

> > Although if you are using an "older version" of the driver, there's not
> > much I can suggest except update to a newer one :)
>
> Yes, I was reluctant to post any info, since really the ball is at our
> end of the court, not yours. I did have a go at bringing in the latest
> xhci driver instead, but quickly saw that was not a sensible task for
> me. And I did scan the git log of xhci changes (especially xhci-ring.c
> changes): thought I saw a likely relevant and easily applied fix commit,
> but in fact it made no difference here.
>
> I suspect it's in part a hardware problem, but driver not recovering
> correctly. I've replaced the machine (but also noticed that the same
> crash has occasionally been seen on other machines). I'm sure it has
> no relevance to this unlock_page() thread, though it's quite possible
> that it's triggered under stress, and Linus's changes allowed greater
> stress.

I will be willing to blame hardware problems for this as well, but will
save this report in case something else shows up in the future, thanks!

greg k-h
diff --git a/mm/filemap.c b/mm/filemap.c
index 385759c4ce4b..74681c40a6e5 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -41,6 +41,7 @@
 #include <linux/delayacct.h>
 #include <linux/psi.h>
 #include <linux/ramfs.h>
+#include <linux/nmi.h>
 #include "internal.h"

 #define CREATE_TRACE_POINTS
@@ -1055,6 +1056,7 @@ static void wake_up_page_bit(struct page *page, int bit_nr)
 		 */
 		spin_unlock_irqrestore(&q->lock, flags);
 		cpu_relax();
+		touch_softlockup_watchdog();
 		spin_lock_irqsave(&q->lock, flags);
 		__wake_up_locked_key_bookmark(q, TASK_NORMAL, &key, &bookmark);
 	}
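[Editorial note] For context, the one-line hunk above lands in the bookmark walk that commit 11a19c7b099f added to wake_up_page_bit(). The sketch below is reconstructed from memory of the v5.8 loop rather than copied verbatim (the comment wording is mine); it shows that the waker drops and retakes q->lock every WAITQUEUE_WALK_BREAK_CNT waiters, so with the patch the soft lockup watchdog is now touched on every such breather while a very long wait queue is drained.

	spin_lock_irqsave(&q->lock, flags);
	__wake_up_locked_key_bookmark(q, TASK_NORMAL, &key, &bookmark);

	/*
	 * The bookmark entry records where the walk stopped (after
	 * WAITQUEUE_WALK_BREAK_CNT waiters were woken), so the waker can
	 * drop the lock, let other q->lock contenders in, and then resume
	 * walking from the bookmark rather than from the head.
	 */
	while (bookmark.flags & WQ_FLAG_BOOKMARK) {
		spin_unlock_irqrestore(&q->lock, flags);
		cpu_relax();
		touch_softlockup_watchdog();	/* the line added by the hunk above */
		spin_lock_irqsave(&q->lock, flags);
		__wake_up_locked_key_bookmark(q, TASK_NORMAL, &key, &bookmark);
	}

	/* ... PG_waiters cleanup and the final spin_unlock_irqrestore() follow ... */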