
[RFC] mm: silence soft lockups from unlock_page

Message ID 20200721063258.17140-1-mhocko@kernel.org (mailing list archive)
State New, archived
Series [RFC] mm: silence soft lockups from unlock_page

Commit Message

Michal Hocko July 21, 2020, 6:32 a.m. UTC
From: Michal Hocko <mhocko@suse.com>

We have seen a bug report with a huge number of soft lockups during
system boot on a !PREEMPT kernel
NMI watchdog: BUG: soft lockup - CPU#1291 stuck for 22s! [systemd-udevd:43283]
[...]
NIP [c00000000094e66c] _raw_spin_lock_irqsave+0xac/0x100
LR [c00000000094e654] _raw_spin_lock_irqsave+0x94/0x100
Call Trace:
[c00002293d883b30] [c0000000012cdee8] __pmd_index_size+0x0/0x8 (unreliable)
[c00002293d883b70] [c0000000002b7490] wake_up_page_bit+0xc0/0x150
[c00002293d883bf0] [c00000000030ceb8] do_fault+0x448/0x870
[c00002293d883c40] [c000000000310080] __handle_mm_fault+0x880/0x16b0
[c00002293d883d10] [c000000000311018] handle_mm_fault+0x168/0x250
[c00002293d883d50] [c00000000006c488] do_page_fault+0x568/0x8d0
[c00002293d883e30] [c00000000000a534] handle_page_fault+0x18/0x38

on a large ppc machine. The very likely cause is a suboptimal
configuration in which systemd-udev spawns way too many workers to bring
the system up.

The lockup is in unlock_page in do_read_fault and I suspect that this is
yet another effect of a very long waitqueue chain which has previously
been addressed by 11a19c7b099f ("sched/wait: Introduce wakeup boomark in
wake_up_page_bit"). That commit primarily aimed at hard lockup
prevention but it doesn't really help the !PREEMPT case, which still has
to process all the work without any rescheduling point. Fixing that is
not trivial, though, because unlock_page is called from many contexts,
many of which are likely atomic.

Introducing an unlock_page_sleepable is certainly an option but it seems
hard to maintain and it wouldn't really fix the underlying problem, as
the same might happen from other unlock_page callers. This patch doesn't
address the underlying problem either, but it reduces the visible effect:
tell the soft lockup detector to shut up when retrying batches of queued
waiters. This will also allow systems configured to panic on warning to
proceed with the boot.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/filemap.c | 2 ++
 1 file changed, 2 insertions(+)
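
(The patch body is not reproduced on this page. Going by the diffstat and
the changelog, the two added lines are presumably a watchdog touch inside
the wake_up_page_bit() bookmark retry loop, roughly as in this hypothetical
sketch; this is a reconstruction for context, not the submitted diff:)

--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ ... @@ static void wake_up_page_bit(struct page *page, int bit_nr)
 		 */
 		spin_unlock_irqrestore(&q->lock, flags);
+		/* we may be in atomic context, so just keep the watchdog quiet */
+		touch_softlockup_watchdog();
 		cpu_relax();
 		spin_lock_irqsave(&q->lock, flags);
 		__wake_up_locked_key_bookmark(q, TASK_NORMAL, &key, &bookmark);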

Comments

Michal Hocko July 21, 2020, 11:25 a.m. UTC | #1
On Tue 21-07-20 07:10:14, Qian Cai wrote:
> 
> 
> > On Jul 21, 2020, at 2:33 AM, Michal Hocko <mhocko@kernel.org> wrote:
> > 
> > on a large ppc machine. The very likely cause is a suboptimal
> > configuration in which systemd-udev spawns way too many workers to bring
> > the system up.
> 
> This is strange. The problem description is missing quite a few
> important details. For example, what systems exactly are those? How
> many CPUs, memory and NUMA nodes were you talking about?

Are these really important? I believe I can dig that out from the bug
report but I didn't really consider that important enough.

> Which kernel version was it reported?

It is a SLES 4.12 based kernel with the said commit backported. The page
lock internals are thus in line with the upstream kernel.

> How many workers from systemd-udev were “misconfigured”?

I do not know that information. I believe that it used whatever systemd
comes with as a default. And that can be a lot.

Do you have any actual feedback to the patch?
Michal Hocko July 21, 2020, 12:17 p.m. UTC | #2
On Tue 21-07-20 07:44:07, Qian Cai wrote:
> 
> 
> > On Jul 21, 2020, at 7:25 AM, Michal Hocko <mhocko@kernel.org> wrote:
> > 
> > Are these really important? I believe I can dig that out from the bug
> > report but I didn't really consider that important enough.
> 
> Please dig them out. We have also been running those things on
> “large” powerpc as well and never saw such soft-lockups. Those
> details may give us some clues about the actual problem.

I strongly suspect this is not really relevant but just FYI this is a
16-node, 11.9TB system with 1536 CPUs.

> Once we
> understand the problem better, we may judge if this “hack” is
> really worth it.

I do not have access to the machine so I can only judge from the boot
log I have in hand. And from that it is pretty clear that
$ grep BUG tmp/attachment.txt | wc -l
896

$ grep BUG tmp/attachment.txt | grep "\[systemd" | wc -l
860

$ grep do_fault+0x448 tmp/attachment.txt | wc -l
860

the boot struggles: lockups happen in udev workers and most of them are
stuck at the very same place, which is unlock_page. The rest is part of
the changelog.
Michal Hocko July 21, 2020, 1:38 p.m. UTC | #3
On Tue 21-07-20 09:23:44, Qian Cai wrote:
> On Tue, Jul 21, 2020 at 02:17:52PM +0200, Michal Hocko wrote:
> > On Tue 21-07-20 07:44:07, Qian Cai wrote:
> > > 
> > > 
> > > > On Jul 21, 2020, at 7:25 AM, Michal Hocko <mhocko@kernel.org> wrote:
> > > > 
> > > > Are these really important? I believe I can dig that out from the bug
> > > > report but I didn't really consider that important enough.
> > > 
> > > Please dig them out. We have also been running those things on
> > > “large” powerpc as well and never saw such soft-lockups. Those
> > > details may give us some clues about the actual problem.
> > 
> > I strongly suspect this is not really relevant but just FYI this is a
> > 16-node, 11.9TB system with 1536 CPUs.
> 
> Okay, we are now talking about the HPC special case. Just brain-storming some
> ideas here.
> 
> 
> 1) What about increasing the soft-lockup threshold early at boot and restoring
> it afterwards? As far as I can tell, those soft-lockups are just a few bursts of
> things and then cure themselves after booting.

Is this really a better option than silencing the soft lockup from the
code itself? What if the same access pattern happens later on?

> 2) Reading through the comments above page_waitqueue(), it says rare hash
> collisions could happen, so it sounds like in this HPC case it is rather easy to
> hit those hash collisions. Thus, do we need to deal with that instead?

As all of those seem to be the same class of process I suspect it is
more likely that many processes are hitting a page fault on the same
file page, e.g. a code/library page.

> 3) The commit 62906027091f ("mm: add PageWaiters indicating tasks are waiting
> for a page bit") mentioned that,
> 
> "Putting two bits in the same word opens the opportunity to remove the memory
> barrier between clearing the lock bit and testing the waiters bit, after some
> work on the arch primitives (e.g., ensuring memory operand widths match and
> cover both bits)."
> 
> Do you happen to know if this only happen on powerpc?

I have only seen this single instance on that machine. I do not think
this is very HW specific but the ppc platform is likely more prone to
it. Just think of the memory itself. Each memory block is notified via
udev and ppc has very small memblocks (16M to 256M), while x86 will use
2G blocks on large machines.

> Also, we probably need to
> dig out whether those memory barriers are still there and could be removed to
> speed things up.

I would be really surprised if memory barriers matter much. It sounds
much more likely that there is the same underlying problem as in
11a19c7b099f: there are just too many waiters on the page. That commit
prevents just the hard lockup part of the problem by dropping the lock
and continuing after the bookmark. But, as mentioned in the changelog,
cond_resched is not really an option because this path is called from
atomic context as well. So !PREEMPT kernels are still in the same boat.
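
For reference, the bookmark retry loop that commit added to
wake_up_page_bit() looks roughly like this (abridged from mm/filemap.c of
that era; the ClearPageWaiters handling at the end is omitted):

static void wake_up_page_bit(struct page *page, int bit_nr)
{
	wait_queue_head_t *q = page_waitqueue(page);
	struct wait_page_key key;
	unsigned long flags;
	wait_queue_entry_t bookmark;

	key.page = page;
	key.bit_nr = bit_nr;
	key.page_match = 0;

	bookmark.flags = 0;
	bookmark.private = NULL;
	bookmark.func = NULL;
	INIT_LIST_HEAD(&bookmark.entry);

	spin_lock_irqsave(&q->lock, flags);
	__wake_up_locked_key_bookmark(q, TASK_NORMAL, &key, &bookmark);

	while (bookmark.flags & WQ_FLAG_BOOKMARK) {
		/*
		 * Drop the lock for a moment so that already-woken waiters
		 * can take it, remove themselves and make progress, then
		 * resume the walk at the bookmark. Note there is no
		 * cond_resched() here, hence the soft lockups on !PREEMPT.
		 */
		spin_unlock_irqrestore(&q->lock, flags);
		cpu_relax();
		spin_lock_irqsave(&q->lock, flags);
		__wake_up_locked_key_bookmark(q, TASK_NORMAL, &key, &bookmark);
	}
	/* ... */
	spin_unlock_irqrestore(&q->lock, flags);
}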

I might have misunderstood something, of course, and would like to hear
where is my thinking wrong.
Chris Down July 21, 2020, 2:17 p.m. UTC | #4
I understand the pragmatic considerations here, but I'm quite concerned about 
the maintainability and long-term ability to reason about a patch like this.  
For example, how do we know when this patch is safe to remove? Also, what other 
precedent does this set for us covering for poor userspace behaviour?

Speaking as a systemd maintainer, if udev could be doing something better on 
these machines, we'd be more than receptive to help fix it. In general I am 
against explicit watchdog tweaking here because a.) there's potential to mask 
other problems, and b.) it seems like the kind of one-off trivia nobody is 
going to remember exists when doing complex debugging in future.

Is there anything preventing this being remedied in udev, instead of the 
kernel?
Michal Hocko July 21, 2020, 3 p.m. UTC | #5
On Tue 21-07-20 15:17:49, Chris Down wrote:
> I understand the pragmatic considerations here, but I'm quite concerned
> about the maintainability and long-term ability to reason about a patch like
> this.  For example, how do we know when this patch is safe to remove? Also,
> what other precedent does this set for us covering for poor userspace
> behaviour?
> 
> Speaking as a systemd maintainer, if udev could be doing something better on
> these machines, we'd be more than receptive to help fix it. In general I am
> against explicit watchdog tweaking here because a.) there's potential to
> mask other problems, and b.) it seems like the kind of one-off trivia nobody
> is going to remember exists when doing complex debugging in future.
> 
> Is there anything preventing this being remedied in udev, instead of the
> kernel?

Yes, I believe that there is a configuration to cap the maximum number
of workers. This is not my area but my understanding is that the maximum
is tuned based on available memory and/or cpus. We have been hit by
this quite heavily on SLES. Maybe newer versions of systemd have better
tuning.

But it seems that udev is just a messenger here. There is nothing
fundamentally udev specific in the underlying problem unless I am
missing something. It is quite possible that this could be triggered by
other userspace which happens to fire many workers at the same time, all
contending on a shared page.

Not that I like this workaround in the first place, but it seems that the
existing code allows very long wait chains and !PREEMPT kernels simply
may not hit any scheduling point for a long time. I believe we should
focus on that even if systemd, as the current trigger, can be tuned
better. I do not insist on this patch, hence RFC, but I am simply not
seeing a much better solution that isn't convoluted.
Linus Torvalds July 21, 2020, 3:33 p.m. UTC | #6
On Mon, Jul 20, 2020 at 11:33 PM Michal Hocko <mhocko@kernel.org> wrote:
>
> The lockup is in unlock_page in do_read_fault and I suspect that this is
> yet another effect of a very long waitqueue chain which has previously
> been addressed by 11a19c7b099f ("sched/wait: Introduce wakeup boomark in
> wake_up_page_bit").

Hmm.

I do not believe that you can actually get to the point where you have
a million waiters and it takes 20+ seconds to wake everybody up.

More likely, it's actually *caused* by that commit 11a19c7b099f, and
what might be happening is that other CPU's are just adding new
waiters to the list *while* we're waking things up, because somebody
else already got the page lock again.

Humor me.. Does something like this work instead? It's
whitespace-damaged because of just a cut-and-paste, but it's entirely
untested, and I haven't really thought about any memory ordering
issues, but I think it's ok.

The logic is that anybody who called wake_up_page_bit() _must_ have
cleared that bit before that. So if we ever see it set again (and
memory ordering doesn't matter), then clearly somebody else got access
to the page bit (whichever it was), and we should not

 (a) waste time waking up people who can't get the bit anyway

 (b) be in a  livelock where other CPU's continually add themselves to
the wait queue because somebody else got the bit.

and it's that (b) case that I think happens for you.

NOTE! Totally UNTESTED patch follows. I think it's good, but maybe
somebody sees some problem with this approach?

I realize that people can wait for other bits than the unlocked, and
if you're waiting for writeback to complete maybe you don't care if
somebody else then started writeback *AGAIN* on the page and you'd
actually want to be woken up regardless, but honestly, I don't think
it really matters.

                Linus

--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1054,6 +1054,15 @@ static void wake_up_page_bit(struct page *page, int bit_nr)
                 * from wait queue
                 */
                spin_unlock_irqrestore(&q->lock, flags);
+
+               /*
+                * If somebody else set the bit again, stop waking
+                * people up. It's now the responsibility of that
+                * other page bit owner to do so.
+                */
+               if (test_bit(bit_nr, &page->flags))
+                       return;
+
                cpu_relax();
                spin_lock_irqsave(&q->lock, flags);
                __wake_up_locked_key_bookmark(q, TASK_NORMAL, &key, &bookmark);
Michal Hocko July 21, 2020, 3:49 p.m. UTC | #7
On Tue 21-07-20 08:33:33, Linus Torvalds wrote:
> On Mon, Jul 20, 2020 at 11:33 PM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > The lockup is in unlock_page in do_read_fault and I suspect that this is
> > yet another effect of a very long waitqueue chain which has previously
> > been addressed by 11a19c7b099f ("sched/wait: Introduce wakeup boomark in
> > wake_up_page_bit").
> 
> Hmm.
> 
> I do not believe that you can actually get to the point where you have
> a million waiters and it takes 20+ seconds to wake everybody up.

I was really surprised as well!

> More likely, it's actually *caused* by that commit 11a19c7b099f, and
> what might be happening is that other CPU's are just adding new
> waiters to the list *while* we're waking things up, because somebody
> else already got the page lock again.
> 
> Humor me.. Does something like this work instead? It's
> whitespace-damaged because of just a cut-and-paste, but it's entirely
> untested, and I haven't really thought about any memory ordering
> issues, but I think it's ok.
> 
> The logic is that anybody who called wake_up_page_bit() _must_ have
> cleared that bit before that. So if we ever see it set again (and
> memory ordering doesn't matter), then clearly somebody else got access
> to the page bit (whichever it was), and we should not
> 
>  (a) waste time waking up people who can't get the bit anyway
> 
>  (b) be in a  livelock where other CPU's continually add themselves to
> the wait queue because somebody else got the bit.
> 
> and it's that (b) case that I think happens for you.
> 
> NOTE! Totally UNTESTED patch follows. I think it's good, but maybe
> somebody sees some problem with this approach?

I can ask them to give it a try.
Linus Torvalds July 22, 2020, 6:29 p.m. UTC | #8
On Tue, Jul 21, 2020 at 8:33 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> More likely, it's actually *caused* by that commit 11a19c7b099f, and
> what might be happening is that other CPU's are just adding new
> waiters to the list *while* we're waking things up, because somebody
> else already got the page lock again.
>
> Humor me.. Does something like this work instead?

I went back and looked at this, because it bothered me.

And I'm no longer convinced it can possibly make a difference.

Why?

Because __wake_up_locked_key_bookmark() just calls __wake_up_common(),
and that one checks the return value of the wakeup function:

                ret = curr->func(curr, mode, wake_flags, key);
                if (ret < 0)
                        break;

and will not add the bookmark back to the list if this triggers.

And the wakeup function does that same "stop walking" thing:

        if (test_bit(key->bit_nr, &key->page->flags))
                return -1;

So if somebody else took the page lock, I think we should already have
stopped walking the list.

Of course, the page waitqueue hash table is very small. It's only 256
entries. So maybe the list is basically all aliases for another page
entirely that is being hammered by that load, and we're just unlucky.

Because the wakeup function only does that "stop walking" if the page
key matched. So wait queue entries for another page that just hashes
to the same bucket (or even the same page, but a different bit in the
page) will confuse that logic.
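
For context, the waitqueue for a page is chosen by hashing the struct
page pointer into that small shared table, which is what makes this kind
of aliasing possible (from mm/filemap.c at the time, roughly):

#define PAGE_WAIT_TABLE_BITS 8
#define PAGE_WAIT_TABLE_SIZE (1 << PAGE_WAIT_TABLE_BITS)
static wait_queue_head_t page_wait_table[PAGE_WAIT_TABLE_SIZE] __cacheline_aligned;

static wait_queue_head_t *page_waitqueue(struct page *page)
{
	/* all pages that hash to the same bucket share one wait_queue_head */
	return &page_wait_table[hash_ptr(page, PAGE_WAIT_TABLE_BITS)];
}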

Hmm.

I still can't see how you'd get so many entries (without re-adding
them) that you'd hit the softlockup timer.

So I wonder if maybe we really do hit the "aliasing with a really hot
page that gets re-added in the page wait table" case, but it seems a
bit contrived.

So I think that patch is still worth testing, but I'm not quite as
hopeful about it as I was originally.

I do wonder if we should make that PAGE_WAIT_TABLE_SIZE be larger. 256
entries seems potentially ridiculously small, and aliasing not only
increases the waitqueue length, it also potentially causes more
contention on the waitqueue spinlock (which is already likely seeing
some false sharing on a cacheline basis due to the fairly dense array
of waitqueue entries: wait_queue_head is intentionally fairly small
and dense unless you have lots of spinlock debugging options enabled).

That hashed wait-queue size is an independent issue, though. But it
might be part of "some loads can get into some really nasty behavior
in corner cases"

               Linus
Hugh Dickins July 22, 2020, 9:29 p.m. UTC | #9
On Wed, 22 Jul 2020, Linus Torvalds wrote:
> 
> I do wonder if we should make that PAGE_WAIT_TABLE_SIZE be larger. 256
> entries seems potentially ridiculously small, and aliasing not only
> increases the waitqueue length, it also potentially causes more
> contention on the waitqueue spinlock (which is already likely seeing
> some false sharing on a cacheline basis due to the fairly dense array
> of waitqueue entries: wait_queue_head is intentionally fairly small
> and dense unless you have lots of spinlock debugging options enabled).
> 
> That hashed wait-queue size is an independent issue, though. But it
> might be part of "some loads can get into some really nasty behavior
> in corner cases"

I don't think we've ever suffered from unlock_page() softlockups when
booting, as Michal reports; but we do have a forkbomby test (and a few
others) which occasionally suffer that way, on machines with many cpus.

We run with the three imperfect patches appended below, which together
seemed to improve, but not entirely solve, the situation:

mm,sched: make page_wait_table[] four times bigger
mm,sched: wait_on_page_bit_common() add back to head
mm,sched: __wake_up_common() break only after waking
 kernel/sched/wait.c |    5 ++++-
 mm/filemap.c        |   10 ++++++++--
 2 files changed, 12 insertions(+), 3 deletions(-)

I've rediffed them to 5.8-rc6, and checked that they build and sustain
a load for a few minutes: so they're not entirely ridiculous on latest
kernels, but certainly not thoroughly retested.

I'm rather desperate to stay out of the discussion here, given how far
behind I am on responding to other threads; and may not be able to
defend them beyond what I said in the original commit messages.
But seeing this thread, thought I'd better put them up for your eyes.

(And, no, I don't think I have a Copyright on changing an 8 to a 10:
you've probably gone further already; just included for completeness.)

Hugh

[PATCH] mm,sched: make page_wait_table[] four times bigger

Current page_wait_table[] size is 256 entries: bump that up to 1024
to reduce latency from frequent page hash collisions.  No science in
choosing fourfold: just "a little bigger".

Signed-off-by: Hugh Dickins <hughd@google.com>

--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -968,7 +968,7 @@ EXPORT_SYMBOL(__page_cache_alloc);
  * at a cost of "thundering herd" phenomena during rare hash
  * collisions.
  */
-#define PAGE_WAIT_TABLE_BITS 8
+#define PAGE_WAIT_TABLE_BITS 10
 #define PAGE_WAIT_TABLE_SIZE (1 << PAGE_WAIT_TABLE_BITS)
 static wait_queue_head_t page_wait_table[PAGE_WAIT_TABLE_SIZE] __cacheline_aligned;
 
 
[PATCH] mm,sched: wait_on_page_bit_common() add back to head

wait_on_page_bit_common() always adds to tail of wait queue.  That is
of course right when adding an entry for the first time; but when woken,
and bit found already set, so adding back again?  Looks unfair: it would
get placed after recent arrivals, and in danger of indefinite starvation.

Instead, add back to head on repeat passes: not quite right either, but
if that happens again and again, the ordering will be reversed each time,
so it should work out reasonably fair.

Signed-off-by: Hugh Dickins <hughd@google.com>

--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1109,6 +1109,7 @@ static inline int wait_on_page_bit_commo
 	struct wait_page_queue wait_page;
 	wait_queue_entry_t *wait = &wait_page.wait;
 	bool bit_is_set;
+	bool first_time = true;
 	bool thrashing = false;
 	bool delayacct = false;
 	unsigned long pflags;
@@ -1134,7 +1135,12 @@ static inline int wait_on_page_bit_commo
 		spin_lock_irq(&q->lock);
 
 		if (likely(list_empty(&wait->entry))) {
-			__add_wait_queue_entry_tail(q, wait);
+			if (first_time) {
+				__add_wait_queue_entry_tail(q, wait);
+				first_time = false;
+			} else {
+				__add_wait_queue(q, wait);
+			}
 			SetPageWaiters(page);
 		}
 

[PATCH] mm,sched: __wake_up_common() break only after waking

4.14 commit 2554db916586 ("sched/wait: Break up long wake list walk")
added WQ_FLAG_BOOKMARK early breakout from __wake_up_common(): it lets
the waker drop and retake the irq-safe wait queue lock every 64 entries.
It was introduced to handle an Intel customer issue with long page wait
queues, but it also applies to all paths using __wake_up_common_lock().

It would probably be more useful to our non-preemptible kernel if it
could do a cond_resched() along with its cpu_relax(); but although
most unlock_page() calls would be fine with that, some would not -
page_endio(), for example; and it would be a big job to sort them,
and not clear that doing some but not others would really help anyway.

A patch that I've been running with, that does help somewhat to reduce
those unlock_page() softlockups, is this weakening of 2554db916586:
don't break out until at least one wakeup has been issued for the page.

In the worst case (waking a page at the very end of a hash queue shared
with many waiters on another page), this would simply revert to the old
behavior, where there was no WQ_FLAG_BOOKMARK early breakout at all.

Whilst it did not set out to change behavior for __wake_up_common_lock()
users, review suggests that this change is probably good for them too.

Signed-off-by: Hugh Dickins <hughd@google.com>

--- a/kernel/sched/wait.c
+++ b/kernel/sched/wait.c
@@ -68,6 +68,7 @@ static int __wake_up_common(struct wait_
 			wait_queue_entry_t *bookmark)
 {
 	wait_queue_entry_t *curr, *next;
+	int woken = 0;
 	int cnt = 0;
 
 	lockdep_assert_held(&wq_head->lock);
@@ -95,8 +96,10 @@ static int __wake_up_common(struct wait_
 			break;
 		if (ret && (flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)
 			break;
+		if (ret)
+			woken++;
 
-		if (bookmark && (++cnt > WAITQUEUE_WALK_BREAK_CNT) &&
+		if (bookmark && (++cnt > WAITQUEUE_WALK_BREAK_CNT) && woken &&
 				(&next->entry != &wq_head->head)) {
 			bookmark->flags = WQ_FLAG_BOOKMARK;
 			list_add_tail(&bookmark->entry, &next->entry);
Linus Torvalds July 22, 2020, 10:10 p.m. UTC | #10
On Wed, Jul 22, 2020 at 2:29 PM Hugh Dickins <hughd@google.com> wrote:
>
> -#define PAGE_WAIT_TABLE_BITS 8
> +#define PAGE_WAIT_TABLE_BITS 10

Well, that seems harmless even on small machines.

> +       bool first_time = true;
>         bool thrashing = false;
>         bool delayacct = false;
>         unsigned long pflags;
> @@ -1134,7 +1135,12 @@ static inline int wait_on_page_bit_commo
>                 spin_lock_irq(&q->lock);
>
>                 if (likely(list_empty(&wait->entry))) {
> -                       __add_wait_queue_entry_tail(q, wait);
> +                       if (first_time) {
> +                               __add_wait_queue_entry_tail(q, wait);
> +                               first_time = false;
> +                       } else {
> +                               __add_wait_queue(q, wait);
> +                       }
>                         SetPageWaiters(page);
>                 }

This seems very hacky.

And in fact, looking closer, I'd say that there are more serious problems here.

Look at that WQ_FLAG_EXCLUSIVE thing: non-exclusive waits should
always go at the head (because they're not going to steal the bit,
they just want to know when it got cleared), and exclusive waits
should always go at the tail (because of fairness).

But that's not at all what we do.

Your patch adds even more confusion to this nasty area.

And your third one:

> +               if (ret)
> +                       woken++;
>
> -               if (bookmark && (++cnt > WAITQUEUE_WALK_BREAK_CNT) &&
> +               if (bookmark && (++cnt > WAITQUEUE_WALK_BREAK_CNT) && woken &&

I've got two reactions to this

 (a) we should not need a new "woken" variable, we should just set a
high bit of "cnt" and make WAITQUEUE_WALK_BREAK_CNT contain that high
bit

     (Tune "high bit" to whatever you want: it could be either the
_real_ high bit of the variable, or it could be something like "128",
which would mean that you'd break out after 128 non-waking entries).

 (b) Ugh, what hackery and magic behavior regardless

I'm really starting to hate that wait_on_page_bit_common() function.

See a few weeks ago how the function looks buggy to begin with

  https://lore.kernel.org/lkml/CAHk-=wjJA2Z3kUFb-5s=6+n0qbTs8ELqKFt9B3pH85a8fGD73w@mail.gmail.com/

and that never got resolved either (but probably never happens in practice).

              Linus
Linus Torvalds July 22, 2020, 11:42 p.m. UTC | #11
On Wed, Jul 22, 2020 at 3:10 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> > +       bool first_time = true;
> >         bool thrashing = false;
> >         bool delayacct = false;
> >         unsigned long pflags;
> > @@ -1134,7 +1135,12 @@ static inline int wait_on_page_bit_commo
> >                 spin_lock_irq(&q->lock);
> >
> >                 if (likely(list_empty(&wait->entry))) {
> > -                       __add_wait_queue_entry_tail(q, wait);
> > +                       if (first_time) {
> > +                               __add_wait_queue_entry_tail(q, wait);
> > +                               first_time = false;
> > +                       } else {
> > +                               __add_wait_queue(q, wait);
> > +                       }
> >                         SetPageWaiters(page);
> >                 }
>
> This seems very hacky.
>
> And in fact, looking closer, I'd say that there are more serious problems here.
>
> Look at that WQ_FLAG_EXCLUSIVE thing: non-exclusive waits should
> always go at the head (because they're not going to steal the bit,
> they just want to know when it got cleared), and exclusive waits
> should always go at the tail (because of fairness).
>
> But that's not at all what we do.
>
> Your patch adds even more confusion to this nasty area.

Actually, I think the right thing to do is to just say "we should
never add ourselves back to the list".

We have three cases:

 - DROP: we'll never loop

 - SHARED: we'll break if we see the bit clear - so if we're no longer
on the list, we should break out

 - EXCLUSIVE: we should remain on the list until we get the lock.

and we should just handle these three cases in the wakeup function
instead, I feel.
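
(These three cases correspond to the existing "enum behavior" that
wait_on_page_bit_common() already takes; quoted here for reference,
roughly as it reads in mm/filemap.c:)

enum behavior {
	EXCLUSIVE,	/* Hold ref to page and take the bit when woken, like
			 * __lock_page() waiting on then setting PG_locked.
			 */
	SHARED,		/* Hold ref to page and check the bit when woken, like
			 * wait_on_page_writeback() waiting on PG_writeback.
			 */
	DROP,		/* Drop ref to page before wait, no check when woken,
			 * like put_and_wait_on_page_locked() on PG_locked.
			 */
};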

And then to make it fair to people, let's just always add things to
the head of the queue, whether exclusive or not.

AND NEVER RE-QUEUE.

IOW, something like the attached patch.

NOTE NOTE NOTE! This is both somewhat subtle code, and ENTIRELY
UNTESTED. But it actually fixes the bug mentioned a few weeks ago, it
makes the code simpler in one sense, and it avoids the whole
re-queueing thing.

It removes all the "is the page locked" testing from the actual wait
loop, and replaces it with "is the wait queue entry still on the list"
instead.

Comments? Oleg, this should fix the race you talked about too.

Note that this patch really is meant as a RFC, and "something like
this". It builds. I tried to think it through. But it might have some
obvious thinko that means that it really really doesn't work.

                    Linus
Linus Torvalds July 23, 2020, 12:23 a.m. UTC | #12
On Wed, Jul 22, 2020 at 4:42 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> NOTE NOTE NOTE! This is both somewhat subtle code, and ENTIRELY
> UNTESTED.

It seems to boot.

It adds more lines than it removes, but a lot of it is comments, and
while it's somewhat subtle, I think it's actually conceptually simpler
than what we had before.

The actual waiting loop itself, for example, is now entirely and
utterly trivial.

The DROP behavior is also a lot more straightforward and logical, imnsho.

The biggest annoyance I have with it is how it open-codes
"finish_wait()", and honestly, the "proper" fix for that would likely
be to simply instead make "finish_wait()" return the "did I need to
remove it from the list or not" value. That's the only reason that
patch open-codes it right now.

It _feels_ like the right solution to this thing.

But maybe that's just the "pee in the snow" effect, where I like it
only because I've put my mark on it.

So it would be good to get more opinions, and perhaps more testing
than my limited "it boots and works for me, and I can still build
kernels and run a browser".

                Linus
Michal Hocko July 23, 2020, 8:03 a.m. UTC | #13
On Wed 22-07-20 11:29:20, Linus Torvalds wrote:
> On Tue, Jul 21, 2020 at 8:33 AM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > More likely, it's actually *caused* by that commit 11a19c7b099f, and
> > what might be happening is that other CPU's are just adding new
> > waiters to the list *while* we're waking things up, because somebody
> > else already got the page lock again.
> >
> > Humor me.. Does something like this work instead?
> 
> I went back and looked at this, because it bothered me.

Thanks for pursuing this. I have learned that the affected system is in
fact a production machine which doesn't seem to have any downtime window
planned soon. Moreover the issue is not always reproducible. So I cannot
guarantee I can have this or other patches tested soon which is really
unfortunate.

> And I'm no longer convinced it can possibly make a difference.
> 
> Why?
> 
> Because __wake_up_locked_key_bookmark() just calls __wake_up_common(),
> and that one checks the return value of the wakeup function:
> 
>                 ret = curr->func(curr, mode, wake_flags, key);
>                 if (ret < 0)
>                         break;
> 
> and will not add the bookmark back to the list if this triggers.
> 
> And the wakeup function does that same "stop walking" thing:
> 
>         if (test_bit(key->bit_nr, &key->page->flags))
>                 return -1;
> 
> So if somebody else took the page lock, I think we should already have
> stopped walking the list.

Right! I didn't bother to look at the wakeup callback so I missed
this. For completeness, this behavior has been there since 3510ca20ece01,
which we have in our 4.12 based kernel as well.
Oleg Nesterov July 23, 2020, 12:47 p.m. UTC | #14
On 07/22, Linus Torvalds wrote:
>
> Comments? Oleg, this should fix the race you talked about too.

Yes.

I still can't convince myself that I fully understand this patch but I see
nothing really wrong after a quick glance...

> +	 * We can no longer use 'wait' after we've done the
> +	 * list_del_init(&wait->entry),

Yes, but see below,

> +	 * the target may decide it's all done with no
> +	 * other locking, and 'wait' has been allocated on
> +	 * the stack of the target.
>  	 */
> -	if (test_bit(key->bit_nr, &key->page->flags))
> -		return -1;
> +	target = wait->private;
> +	smp_mb();
>
> -	return autoremove_wake_function(wait, mode, sync, key);
> +	/*
> +	 * Ok, we have successfully done what we're waiting for.
> +	 *
> +	 * Now unconditionally remove the wait entry, so that the
> +	 * waiter can use that to see success or not.
> +	 *
> +	 * We _really_ should have a "list_del_init_careful()"
> +	 * to properly pair with an unlocked "list_empty_careful()".
> +	 */
> +	list_del_init(&wait->entry);
> +
> +	/*
> +	 * There's another memory barrier in the wakeup path, that
> +	 * makes sure the wakeup happens after the above is visible
> +	 * to the target.
> +	 */
> +	wake_up_state(target, mode);

We can no longer use 'target'. If it was already woken up it can notice
list_empty_careful(), return without taking q->lock, and exit.

Of course, this is purely theoretical... rcu_read_lock() should help
but perhaps we can avoid it somehow?

Say, can't we abuse WQ_FLAG_WOKEN?

	wake_page_function:
		wait->flags |= WQ_FLAG_WOKEN;
		wmb();
		autoremove_wake_function(...);

	wait_on_page_bit_common:

		for (;;) {
			set_current_state();
			if (wait.flags & WQ_FLAG_WOKEN)
				break;
			schedule();
		}

		finish_wait();

		rmb();
		return wait.flags & WQ_FLAG_WOKEN ? 0 : -EINTR;

Another (cosmetic) problem is that wake_up_state(mode) looks confusing.
It is correct but only because we know that mode == TASK_NORMAL and thus
wake_up_state() can't fail if the target is still blocked.

> +	spin_lock_irq(&q->lock);
> +	SetPageWaiters(page);
> +	if (!trylock_page_bit_common(page, bit_nr, behavior))
> +		__add_wait_queue_entry_tail(q, wait);

do we need SetPageWaiters() if trylock() succeeds ?

Oleg.
Linus Torvalds July 23, 2020, 5:32 p.m. UTC | #15
On Thu, Jul 23, 2020 at 5:47 AM Oleg Nesterov <oleg@redhat.com> wrote:
>
> I still can't convince myself thatI fully understand this patch but I see
> nothing really wrong after a quick glance...

I guess my comments should be extended further then.

Is there anything in particular you think is unclear?

> > +     wake_up_state(target, mode);
>
> We can no longer use 'target'. If it was already woken up it can notice
> list_empty_careful(), return without taking q->lock, and exit.

Good point.

And yes, I think using WQ_FLAG_WOKEN is the right thing to do, and I
wouldn't call it "abuse". It's exactly what it's for.

And that also allows us to just use finish_wait(), since we no longer
care as deeply about the waitlist state, we can just test that
WQ_FLAG_WOKEN at the end instead.
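
(With that, the core of the wait loop reduces to something like the
sketch below; the thrashing/delayacct accounting and the behavior-specific
return handling are left out, so this is illustrative only, not the
actual patch:)

	for (;;) {
		set_current_state(state);

		/* the waker sets WQ_FLAG_WOKEN and removes us from the list */
		if (wait->flags & WQ_FLAG_WOKEN)
			break;

		if (signal_pending_state(state, current))
			break;

		io_schedule();
	}

	finish_wait(q, wait);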

So that actually makes the patch much more straightforward too. I
really disliked my open-coding there. Your suggestion fixes
everything.

> do we need SetPageWaiters() if trylock() succeeds ?

We need to set it before the final page flag test, because otherwise
we might miss somebody just about to wake us up (ie we see the bit
set, but it's getting cleared on another CPU, and if PageWaiters isn't
set then that other CPU won't do the wakeup).
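
(For reference, the waker side: unlock_page() clears PG_locked and only
enters the slow wakeup path if the waiters bit, which lives in the same
byte, was set at that point; this is what the ordering above is paired
with. Roughly, from mm/filemap.c:)

void unlock_page(struct page *page)
{
	BUILD_BUG_ON(PG_waiters != 7);
	page = compound_head(page);
	VM_BUG_ON_PAGE(!PageLocked(page), page);
	/* clear PG_locked and test the waiters bit in one atomic operation */
	if (clear_bit_unlock_is_negative_byte(PG_locked, &page->flags))
		wake_up_page_bit(page, PG_locked);
}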

So here's a v2, now as a "real" commit with a commit message and everything.

Is there anything in particular you would like clarified, or something
else you find in this?

Hugh - this should make your "patch 2" redundant. Is there any way to
test that in your environment that triggered it?

This v2 isn't tested, but the core of it is the same, just with nice
cleanups from Oleg's suggestion, and an added comment about that
SetPageWaiters() thing.

            Linus
Oleg Nesterov July 23, 2020, 6:01 p.m. UTC | #16
On 07/23, Linus Torvalds wrote:
>
> So here's a v2, now as a "real" commit with a commit message and everything.

I am already sleeping, will read it tomorrow, but at first glance...

> @@ -1013,18 +1014,40 @@ static int wake_page_function(wait_queue_entry_t *wait, unsigned mode, int sync,
>  	if (wait_page->bit_nr != key->bit_nr)
>  		return 0;
>  
> +	/* Stop walking if it's locked */
> +	if (wait->flags & WQ_FLAG_EXCLUSIVE) {
> +		if (test_and_set_bit(key->bit_nr, &key->page->flags))
> +			return -1;
> +	} else {
> +		if (test_bit(key->bit_nr, &key->page->flags))
> +			return -1;
> +	}
> +
>  	/*
> -	 * Stop walking if it's locked.
> -	 * Is this safe if put_and_wait_on_page_locked() is in use?
> -	 * Yes: the waker must hold a reference to this page, and if PG_locked
> -	 * has now already been set by another task, that task must also hold
> -	 * a reference to the *same usage* of this page; so there is no need
> -	 * to walk on to wake even the put_and_wait_on_page_locked() callers.
> +	 * Let the waiter know we have done the page flag
> +	 * handling for it (and the return value lets the
> +	 * wakeup logic count exclusive wakeup events).
>  	 */
> -	if (test_bit(key->bit_nr, &key->page->flags))
> -		return -1;
> +	ret = (wait->flags & WQ_FLAG_EXCLUSIVE) != 0;
> +	wait->flags |= WQ_FLAG_WOKEN;
> +	wake_up_state(wait->private, mode);
>  
> -	return autoremove_wake_function(wait, mode, sync, key);
> +	/*
> +	 * Ok, we have successfully done what we're waiting for,
> +	 * and we can unconditionally remove the wait entry.
> +	 *
> +	 * Note that this has to be the absolute last thing we do,
> +	 * since after list_del_init(&wait->entry) the wait entry
> +	 * might be de-allocated and the process might even have
> +	 * exited.
> +	 *
> +	 * We _really_ should have a "list_del_init_careful()" to
> +	 * properly pair with the unlocked "list_empty_careful()"
> +	 * in finish_wait().
> +	 */
> +	smp_mb();
> +	list_del_init(&wait->entry);

I think smp_wmb() would be enough, but this is minor.

We need a barrier between "wait->flags |= WQ_FLAG_WOKEN" and list_del_init(),

But afaics we need another barrier, rmb(), in wait_on_page_bit_common() for
the case when wait->private was not blocked; we need to ensure that if
finish_wait() sees list_empty_careful() == T then we can't miss WQ_FLAG_WOKEN.

Oleg.
Linus Torvalds July 23, 2020, 6:22 p.m. UTC | #17
On Thu, Jul 23, 2020 at 11:01 AM Oleg Nesterov <oleg@redhat.com> wrote:
>
> > +      *
> > +      * We _really_ should have a "list_del_init_careful()" to
> > +      * properly pair with the unlocked "list_empty_careful()"
> > +      * in finish_wait().
> > +      */
> > +     smp_mb();
> > +     list_del_init(&wait->entry);
>
> I think smp_wmb() would be enough, but this is minor.

Well, what we _really_ want (and what that comment is about) would be
got that list_del_init_careful() to use a "smp_store_release()" for
the last store, and then "list_empty_careful()" would use a
"smp_load_acquire()" for the corresponding first load.

On x86, that's free. On most other architectures, it's the minimal
ordering requirement.

And no, I don't think "smp_wmb()" is technically enough.

With crazy memory ordering, one of the earlier *reads* (eg loading
"wait->private" when waking things up) could have been delayed until
after the stores that initialize the list - and thus read stack
contents from another process after it's been released and re-used.

Does that happen in reality? No. There are various conditionals in
there which mean that the stores end up being gated on the loads and
cannot actually be re-ordered, but it's the kind of subtlety I'd rather
not rely on.

So we actually do want to constrain all earlier reads and writes wrt
the final write. Which is exactly what "smp_store_release()" does.

But with our current list_empty_careful(), the smp_mb() is I think
technically sufficient.

> We need a barrier between "wait->flags |= WQ_FLAG_WOKEN" and list_del_init(),

See above: we need more than just that write barrier, although in
_practice_ you're right, and the other barriers actually all already
exist and are part of wake_up_state().

So the smp_mb() is unnecessary, and in fact your smp_wmb() would be
too. But I left it there basically as "documentation".

> But afaics we need another barrier, rmb(), in wait_on_page_bit_common() for
> the case when wait->private was not blocked; we need to ensure that if
> finish_wait() sees list_empty_careful() == T then we can't miss WQ_FLAG_WOKEN.

Again, this is what a proper list_empty_careful() with a
smp_load_acquire() would have automatically gotten for us.

But yes, I think that without that, and with the explicit barriers, we
need an smp_rmb() after the list_empty_careful().

I really think it should be _in_ list_empty_careful(), though. Or
maybe finish_wait(). Hmm.

Because looking at all the other finish_wait() uses, the fact that the
waitqueue _list_ is properly ordered isn't really a guarantee that the
rest of the stack space is.

In practice, it will be, but I think this lack of serialization is a
potential real bug elsewhere too.

(Obviously none of this would show on x86, where we already *get* that
smp_store_release/smp_load_acquire behavior for the existing
list_del_init()/list_empty_careful(), since all stores are releases,
and all loads are acquires)

So I think that is a separate issue, generic to our finish_wait() uses.

             Linus
Linus Torvalds July 23, 2020, 7:03 p.m. UTC | #18
On Thu, Jul 23, 2020 at 11:22 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> So I think that is a separate issue, generic to our finish_wait() uses.

IOW, I think we should do something like this (this is on top of my
patch, since it has that wake_page_function() change in it, but notice
how we have the exact same issue in our traditional
autoremove_wake_function() usage).

Comments?

                   Linus
Linus Torvalds July 23, 2020, 8:03 p.m. UTC | #19
On Thu, Jul 23, 2020 at 10:32 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> So here's a v2, now as a "real" commit with a commit message and everything.

Oh, except it's broken.

Switching from the "am I still on the list" logic to the
"WQ_FLAG_WOKEN is set if woken" logic was all well and good, but I
missed the case where we did that trylock_page_bit_common().

It used to just not add the thing to the list if it would get the page
bit, and then the rest of the waiting logic looked at that and was
happy.

But now it needs to actually fake that WQ_FLAG_WOKEN flag.

So that patch needs to do something like this:

        if (!trylock_page_bit_common(page, bit_nr, behavior))
                __add_wait_queue_entry_tail(q, wait);
        else
                wait->flags |= WQ_FLAG_WOKEN;

in there. Or maybe have that bit set originally, and clear it when we
add to the wait queue.

I'll send a new version after I actually test it.

                Linus
Hugh Dickins July 23, 2020, 11:11 p.m. UTC | #20
On Thu, 23 Jul 2020, Linus Torvalds wrote:
> 
> I'll send a new version after I actually test it.

I'll give it a try when you're happy with it. I did try yesterday's
with my swapping loads on home machines (3 of 4 survived 16 hours),
and with some google stresstests on work machines (0 of 10 survived).

I've not spent long analyzing the crashes, all of them in or below
__wake_up_common() called from __wake_up_locked_key_bookmark():
sometimes gets to run the curr->func() and crashes on something
inside there (often list_del's lib/list_debug.c:53!), sometimes
cannot get that far. Looks like the wait queue entries on the list
were not entirely safe with that patch.

Hugh
Linus Torvalds July 23, 2020, 11:43 p.m. UTC | #21
On Thu, Jul 23, 2020 at 4:11 PM Hugh Dickins <hughd@google.com> wrote:
>
> On Thu, 23 Jul 2020, Linus Torvalds wrote:
> >
> > I'll send a new version after I actually test it.
>
> I'll give it a try when you're happy with it.

Ok, what I described is what I've been running for a while now. But I
don't put much stress on my system with my normal workload, so..

> I did try yesterday's
> with my swapping loads on home machines (3 of 4 survived 16 hours),
> and with some google stresstests on work machines (0 of 10 survived).
>
> I've not spent long analyzing the crashes, all of them in or below
> __wake_up_common() called from __wake_up_locked_key_bookmark():
> sometimes gets to run the curr->func() and crashes on something
> inside there (often list_del's lib/list_debug.c:53!), sometimes
> cannot get that far. Looks like the wait queue entries on the list
> were not entirely safe with that patch.

Hmm. The bug Oleg pointed out should be pretty theoretical. But I
think the new approach with WQ_FLAG_WOKEN was much better anyway,
despite me missing that one spot in the first version of the patch.

So here's two patches - the first one does that wake_page_function()
conversion, and the second one just does the memory ordering cleanup I
mentioned.

I don't think the second one should matter on x86, but who knows.

I don't enable list debugging, but I find list corruption surprising.
All of _that_ should be inside the page waitqueue lock; the only
unlocked part was the "list_empty_careful()" part.

But I'll walk over my patch mentally one more time. Here's the current
version, anyway.

                Linus
Hugh Dickins July 24, 2020, 12:07 a.m. UTC | #22
On Thu, 23 Jul 2020, Linus Torvalds wrote:
> On Thu, Jul 23, 2020 at 4:11 PM Hugh Dickins <hughd@google.com> wrote:
> > On Thu, 23 Jul 2020, Linus Torvalds wrote:
> > >
> > > I'll send a new version after I actually test it.
> >
> > I'll give it a try when you're happy with it.
> 
> Ok, what I described is what I've been running for a while now. But I
> don't put much stress on my system with my normal workload, so..
> 
> > I did try yesterday's
> > with my swapping loads on home machines (3 of 4 survived 16 hours),
> > and with some google stresstests on work machines (0 of 10 survived).
> >
> > I've not spent long analyzing the crashes, all of them in or below
> > __wake_up_common() called from __wake_up_locked_key_bookmark():
> > sometimes gets to run the curr->func() and crashes on something
> > inside there (often list_del's lib/list_debug.c:53!), sometimes
> > cannot get that far. Looks like the wait queue entries on the list
> > were not entirely safe with that patch.
> 
> Hmm. The bug Oleg pointed out should be pretty theoretical. But I
> think the new approach with WQ_FLAG_WOKEN was much better anyway,
> despite me missing that one spot in the first version of the patch.
> 
> So here's two patches - the first one does that wake_page_function()
> conversion, and the second one just does the memory ordering cleanup I
> mentioned.
> 
> I don't think the second one should matter on x86, but who knows.
> 
> I don't enable list debugging, but I find list corruption surprising.
> All of _that_ should be inside the page waitqueue lock; the only
> unlocked part was the "list_empty_careful()" part.
> 
> But I'll walk over my patch mentally one more time. Here's the current
> version, anyway.

Thanks, I'll start some tests going shortly.

I do have to "port" these patches to a different kernel, and my first
assumption on seeing crashes was that I'd screwed that up; but that
seemed much less likely once the home test on top of v5.8-rc5 crashed
in much the same way.  The latter was not a list_del() crash, but on
curr->func itself; but I take them all as just indicating that the
wait queue entry can in rare cases be freed and reused.

(And the amount of "port"ing was close to nil here: our trees did
differ on an "unlikely" that one end had added or removed, plus I
did start off by reverting two of my three patches. But perhaps I'm
missing a subtle dependence on differences elsewhere in the tree.)

I say that for full disclosure, so you don't wrack your brains
too much, when it may still turn out to be a screwup on my part.

Hugh
Linus Torvalds July 24, 2020, 12:46 a.m. UTC | #23
On Thu, Jul 23, 2020 at 5:07 PM Hugh Dickins <hughd@google.com> wrote:
>
> I say that for full disclosure, so you don't wrack your brains
> too much, when it may still turn out to be a screwup on my part.

Sounds unlikely.

If that patch applied even reasonably closely, I don't see how you'd
see a list corruption that wasn't due to the patch.

You'd have had to use the wrong spinlock by mistake due to munging it,
or something crazy like that.

The main list-handling change is

 (a) open-coding of that finish_wait()

 (b) slightly different heuristics for removal in the wakeup function

where (a) was because my original version of finishing the wait needed
to do that return code checking.

So a normal "finish_wait()" just does

        list_del_init(&wait->entry);

where-as my open-coded one replaced that with

        if (!list_empty(&wait->entry)) {
                list_del(&wait->entry);
                ret = -EINTR;
        }

and apart from that "set return to -EINTR because nobody woke us up",
it also uses just a regular "list_del()" rather than a
"list_del_init()". That causes the next/prev field to be poisoned
rather than re-initialized. But that second change was because the
list entry is on the stack, and we're not touching it any more and are
about to return, so I removed the "init" part.

Anyway, (a) really looks harmless. Unless the poisoning now triggers
some racy debug test that had been hidden by the "init". Hmm.

In contrast, (b) means that the likely access patterns of irqs
removing the wait entry from the list might be very different from
before. The old "autoremove" semantics would only remove the entry
from the list when try_to_wake_up() actually woke things up. Now, a
successful bit state _always_ removes it, which was kind of the point.
But it might cause very different list handling patterns.

All the actual list handling looks "obviously safe" because it's
protected by the spinlock, though...

If you do get oopses with the new patch too, try to send me a copy,
and maybe I'll stare at exactly where it happens register contents and
go "aah".

        Linus
Hugh Dickins July 24, 2020, 3:45 a.m. UTC | #24
On Thu, Jul 23, 2020 at 5:47 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Thu, Jul 23, 2020 at 5:07 PM Hugh Dickins <hughd@google.com> wrote:
> >
> > I say that for full disclosure, so you don't wrack your brains
> > too much, when it may still turn out to be a screwup on my part.
>
> Sounds unlikely.
>
> If that patch applied even reasonably closely, I don't see how you'd
> see a list corruption that wasn't due to the patch.
>
> You'd have had to use the wrong spinlock by mistake due to munging it,
> or something crazy like that.
>
> The main list-handling change is
>
>  (a) open-coding of that finish_wait()
>
>  (b) slightly different heuristics for removal in the wakeup function
>
> where (a) was because my original version of finishing the wait needed
> to do that return code checking.
>
> So a normal "finish_wait()" just does
>
>         list_del_init(&wait->entry);
>
> where-as my open-coded one replaced that with
>
>         if (!list_empty(&wait->entry)) {
>                 list_del(&wait->entry);
>                 ret = -EINTR;
>         }
>
> and apart from that "set return to -EINTR because nobody woke us up",
> it also uses just a regular "list_del()" rather than a
> "list_del_init()". That causes the next/prev field to be poisoned
> rather than re-initialized. But that second change was because the
> list entry is on the stack, and we're not touching it any more and are
> about to return, so I removed the "init" part.
>
> Anyway, (a) really looks harmless. Unless the poisoning now triggers
> some racy debug test that had been hidden by the "init". Hmm.
>
> In contrast, (b) means that the likely access patterns of irqs
> removing the wait entry from the list might be very different from
> before. The old "autoremove" semantics would only remove the entry
> from the list when try_to_wake_up() actually woke things up. Now, a
> successful bit state _always_ removes it, which was kind of the point.
> But it might cause very different list handling patterns.
>
> All the actual list handling looks "obviously safe" because it's
> protected by the spinlock, though...
>
> If you do get oopses with the new patch too, try to send me a copy,
> and maybe I'll stare at exactly where it happens register contents and
> go "aah".

This new version is doing much better: many hours to go, but all
machines have got beyond the danger point where yesterday's version
was crashing - phew!

Hugh
Oleg Nesterov July 24, 2020, 2:45 p.m. UTC | #25
On 07/23, Linus Torvalds wrote:
>
> IOW, I think we should do something like this (this is on top of my
> patch, since it has that wake_page_function() change in it, but notice
> how we have the exact same issue in our traditional
> autoremove_wake_function() usage).

...

> +static inline void list_del_init_careful(struct list_head *entry)
> +{
> +	__list_del_entry(entry);
> +	entry->prev = entry;
> +	smp_store_release(&entry->next, entry);
> +}
> +
...
>  static inline int list_empty_careful(const struct list_head *head)
>  {
> -	struct list_head *next = head->next;
> +	struct list_head *next = smp_load_acquire(&head->next);
>  	return (next == head) && (next == head->prev);
>  }

This (and your previous email) answers my concerns about memory barriers.

IIUC, finish_wait() could even use this version of list_empty_careful(),

	struct list_head *next = smp_load_acquire(&head->next);
	return (next == head) && !WARN_ON(next != head->prev);

iow, it doesn't really need to check next == head->prev as long as only
list_del_init_careful() can remove it from the list.

Thanks!

Oleg.
Oleg Nesterov July 24, 2020, 3:24 p.m. UTC | #26
On 07/23, Linus Torvalds wrote:
>
> But I'll walk over my patch mentally one more time. Here's the current
> version, anyway.

Both patches look correct to me, feel free to add

Reviewed-by: Oleg Nesterov <oleg@redhat.com>

> @@ -1013,18 +1014,40 @@ static int wake_page_function(wait_queue_entry_t *wait, unsigned mode, int sync,
>  	if (wait_page->bit_nr != key->bit_nr)
>  		return 0;
>
> +	/* Stop walking if it's locked */
> +	if (wait->flags & WQ_FLAG_EXCLUSIVE) {
> +		if (test_and_set_bit(key->bit_nr, &key->page->flags))
> +			return -1;
> +	} else {
> +		if (test_bit(key->bit_nr, &key->page->flags))
> +			return -1;
> +	}

not sure this makes any sense, but this looks like another user of
trylock_page_bit_common(), see the patch below on top of 1/2.

Oleg.

--- mm/filemap.c~	2020-07-24 17:09:34.728133846 +0200
+++ mm/filemap.c	2020-07-24 17:23:52.279185374 +0200
@@ -1000,6 +1000,14 @@
 	wait_queue_entry_t wait;
 };
 
+static int trylock_page_bit_common(struct page *page, int bit_nr,
+					struct wait_queue_entry *wait)
+{
+	return wait->flags & WQ_FLAG_EXCLUSIVE ?
+		!test_and_set_bit(bit_nr, &page->flags) :
+		!test_bit(bit_nr, &page->flags);
+}
+
 static int wake_page_function(wait_queue_entry_t *wait, unsigned mode, int sync, void *arg)
 {
 	int ret;
@@ -1015,13 +1023,8 @@
 		return 0;
 
 	/* Stop walking if it's locked */
-	if (wait->flags & WQ_FLAG_EXCLUSIVE) {
-		if (test_and_set_bit(key->bit_nr, &key->page->flags))
-			return -1;
-	} else {
-		if (test_bit(key->bit_nr, &key->page->flags))
-			return -1;
-	}
+	if (!trylock_page_bit_common(key->page, key->bit_nr, wait))
+		return -1;
 
 	/*
 	 * Let the waiter know we have done the page flag
@@ -1126,14 +1129,6 @@
 			 */
 };
 
-static inline int trylock_page_bit_common(struct page *page, int bit_nr,
-	enum behavior behavior)
-{
-	return behavior == EXCLUSIVE ?
-		!test_and_set_bit(bit_nr, &page->flags) :
-		!test_bit(bit_nr, &page->flags);
-}
-
 static inline int wait_on_page_bit_common(wait_queue_head_t *q,
 	struct page *page, int bit_nr, int state, enum behavior behavior)
 {
@@ -1170,7 +1165,7 @@
 	 */
 	spin_lock_irq(&q->lock);
 	SetPageWaiters(page);
-	if (!trylock_page_bit_common(page, bit_nr, behavior))
+	if (!trylock_page_bit_common(page, bit_nr, wait))
 		__add_wait_queue_entry_tail(q, wait);
 	spin_unlock_irq(&q->lock);
Linus Torvalds July 24, 2020, 5:32 p.m. UTC | #27
On Fri, Jul 24, 2020 at 8:24 AM Oleg Nesterov <oleg@redhat.com> wrote:
>
> not sure this makes any sense, but this looks like another user of
> trylock_page_bit_common(), see the patch below on top of 1/2.

Ok, that makes sense. Except you did it on top of the original patch
without the fix to set WQ_FLAG_WOKEN for the non-wakeup case.

And in fact, once you do it on top of that, it becomes obvious that we
can share even more code: move the WQ_FLAG_WOKEN logic _into_ the
trylock_page_bit_common() function.

Then the whole thing becomes something like the attached.

I added your reviewed-by, but maybe you should double-check my changes.

                Linus
Linus Torvalds July 24, 2020, 11:25 p.m. UTC | #28
On Fri, Jul 24, 2020 at 10:32 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> Ok, that makes sense. Except you did it on top of the original patch
> without the fix to set WQ_FLAG_WOKEN for the non-wakeup case.

Hmm.

I just realized that one thing we could do is to not even test the
page bit for the shared case in the wakeup path.

Because anybody who uses the non-exclusive "wait_on_page_locked()" or
"wait_on_page_writeback()" isn't actually interested in the bit state
any more at that point. All they care about is that somebody cleared
it - not whether it was then re-taken again.

So instead of keeping them on the list - or stopping the waitqueue
walk because somebody else got the bit - we could just mark them
successfully done, wake them up, and remove those entries from the
list.

That would be better for everybody - less pointless waiting for a new
lock or writeback event, but also fewer entries on the wait queues as
we get rid of them more quickly instead of walking them over and over
just because somebody else re-took the page lock.

Generally "wait_on_page_locked()" is used for two things

 - either wait for the IO to then check if it's now uptodate

 - throttle things that can't afford to lock the page (eg page faults
that dropped the mm lock, and as such need to go through the whole
restart logic, but that don't want to lock the page because it's now
too late, but also the page migration things)

In the first case, waiting to actually seeing the locked bit clear is
pointless - the code only cared about the "wait for IO in progress"
not about the lock bit itself.

And that second case generally might want to retry, but doesn't want
to busy-loop.

And "wait_on_page_writeback()" is basically used for similar reasons
(ie check if there were IO errors, but also possibly to throttle
writeback traffic).

Saying "stop walking, keep it on the list" seems wrong. It makes IO
error handling and retries much worse, for example.

So it turns out that the wakeup logic and the initial wait logic don't
have so much in common after all, and there is a fundamental
conceptual difference between that "check bit one last time" case, and
the "we got woken up, now what" case..

End result: one final (yes, hopefully - I think I'm done) version of
this patch-series.

This not only makes the code cleaner (the generated code for
wake_up_page() is really quite nice now), but getting rid of extra
waiting might help the load that Michal reported.

Because a lot of page waiting might be that non-exclusive
"wait_on_page_locked()" kind, particularly in the thundering herd kind
of situation where one process starts IO, and then other processes
wait for it to finish.

Those users don't even care if somebody else then did a "lock_page()"
for some other reason (maybe for writeback). They are generally
perfectly happy with a locked page, as long as it's now up-to-date.

So this not only simplifies the code, it really might avoid some problems too.
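
Condensed, the wake side described above ends up looking roughly like
this (the full patch, with its real comments, is quoted later in the
thread; the comments below are added here):

static int wake_page_function(wait_queue_entry_t *wait, unsigned mode,
			      int sync, void *arg)
{
	struct wait_page_key *key = arg;
	struct wait_page_queue *wait_page
		= container_of(wait, struct wait_page_queue, wait);
	int ret = 0;

	/* Page/bit identity checks, condensed. */
	if (wait_page->page != key->page || wait_page->bit_nr != key->bit_nr)
		return 0;

	if (wait->flags & WQ_FLAG_EXCLUSIVE) {
		/* Exclusive waiter: take the bit on its behalf, or stop the walk. */
		if (test_and_set_bit(key->bit_nr, &key->page->flags))
			return -1;
		ret = 1;
	}

	/* Shared waiters always get here: the bit was cleared, and that is all they need. */
	wait->flags |= WQ_FLAG_WOKEN;
	wake_up_state(wait->private, mode);

	/*
	 * Removing the entry must be done last: once it is off the list,
	 * the woken task may free it (or exit) immediately.
	 */
	smp_mb();
	list_del_init(&wait->entry);
	return ret;
}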

               Linus
Hugh Dickins July 25, 2020, 2:08 a.m. UTC | #29
On Fri, 24 Jul 2020, Linus Torvalds wrote:

> On Fri, Jul 24, 2020 at 10:32 AM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> > Ok, that makes sense. Except you did it on top of the original patch
> > without the fix to set WQ_FLAG_WOKEN for the non-wakeup case.
> 
> Hmm.
> 
> I just realized that one thing we could do is to not even test the
> page bit for the shared case in the wakeup path.
> 
> Because anybody who uses the non-exclusive "wait_on_page_locked()" or
> "wait_on_page_writeback()" isn't actually interested in the bit state
> any more at that point. All they care about is that somebody cleared
> it - not whether it was then re-taken again.
> 
> So instead of keeping them on the list - or stopping the waitqueue
> walk because somebody else got the bit - we could just mark them
> successfully done, wake them up, and remove those entries from the
> list.
> 
> That would be better for everybody - less pointless waiting for a new
> lock or writeback event, but also fewer entries on the wait queues as
> we get rid of them more quickly instead of walking them over and over
> just because somebody else re-took the page lock.
> 
> Generally "wait_on_page_locked()" is used for two things
> 
>  - either wait for the IO to then check if it's now uptodate
> 
>  - throttle things that can't afford to lock the page (eg page faults
> that dropped the mm lock, and as such need to go through the whole
> restart logic, but that don't want to lock the page because it's now
> too late, but also the page migration things)
> 
> In the first case, waiting to actually seeing the locked bit clear is
> pointless - the code only cared about the "wait for IO in progress"
> not about the lock bit itself.
> 
> And that second case generally might want to retry, but doesn't want
> to busy-loop.
> 
> And "wait_on_page_writeback()" is basically used for similar reasons
> (ie check if there were IO errors, but also possibly to throttle
> writeback traffic).
> 
> Saying "stop walking, keep it on the list" seems wrong. It makes IO
> error handling and retries much worse, for example.
> 
> So it turns out that the wakeup logic and the initial wait logic don't
> have so much in common after all, and there is a fundamental
> conceptual difference between that "check bit one last time" case, and
> the "we got woken up, now what" case..
> 
> End result: one final (yes, hopefully - I think I'm done) version of
> this patch-series.
> 
> This not only makes the code cleaner (the generated code for
> wake_up_page() is really quite nice now), but getting rid of extra
> waiting might help the load that Michal reported.
> 
> Because a lot of page waiting might be that non-exclusive
> "wait_on_page_locked()" kind, particularly in the thundering herd kind
> of situation where one process starts IO, and then other processes
> wait for it to finish.
> 
> Those users don't even care if somebody else then did a "lock_page()"
> for some other reason (maybe for writeback). They are generally
> perfectly happy with a locked page, as long as it's now up-to-date.
> 
> So this not only simplifies the code, it really might avoid some problems too.

That set of tests I started yesterday has now completed: no crashes
due to your changes (though, one machine crashed with an entirely
unrelated list_del corruption: over in driverland, just a coincidence).

I do see more of these stresstests reporting failure than I remember
from the last time I ran them, and I wasn't quickly able to work out
why (usually I just care about not crashing or hanging, rarely delve
deeper into what they actually call success).  The most likely cause
would be some internal infrastructural oddity, and nothing for you
to worry about; but there is a possibility that it's meaningful.

But whatever, what happens on the next run, with these latest patches,
will be more important; and I'll follow this next run with a run on
the baseline without them, to compare results.

Hugh
Linus Torvalds July 25, 2020, 2:46 a.m. UTC | #30
On Fri, Jul 24, 2020 at 7:08 PM Hugh Dickins <hughd@google.com> wrote:
>
> But whatever, what happens on the next run, with these latest patches,
> will be more important; and I'll follow this next run with a run on
> the baseline without them, to compare results.

So the loads you are running are known to have sensitivity to this
particular area, and are why you've done your patches to the page wait
bit code?

Because yes, I think that last version in particular might make a big
difference with the "let people continue even if they only saw the
wakeup event, and never actually tested and saw the bit clear".

Of course, there's a possibility that "big difference" ends up being a
negative one. I think it will make the page wait queues shorter (which
is good for that latency and lockup thing), but waking things up more
aggressively _may_ also end up adding more CPU load, if they then all
decide to retry the operation for whatever reason.

And hey, it's also possible that it makes no difference at all,
because your load mainly tests the exclusive "lock_page()" case, which
won't have changed.

And that's all assuming they don't show some instability, of course.
But the code actually seems fairly simple now, and the basic
synchronization hasn't changed, just a behavioral optimization.

Famous last words.

               Linus
Oleg Nesterov July 25, 2020, 9:39 a.m. UTC | #31
On 07/24, Linus Torvalds wrote:
>
> And in fact, once you do it on top of that, it becomes obvious that we
> can share even more code: move the WQ_FLAG_WOKEN logic _into_ the
> trylock_page_bit_common() function.

Ah, indeed, somehow I didn't realize this.

> I added your reviewed-by, but maybe you should double-check my changes.

Still looks correct to me, thanks.

Oleg.
Oleg Nesterov July 25, 2020, 10:14 a.m. UTC | #32
On 07/24, Linus Torvalds wrote:
>
> I just realized that one thing we could do is to not even test the
> page bit for the shared case in the wakeup path.
>
> Because anybody who uses the non-exclusive "wait_on_page_locked()" or
> "wait_on_page_writeback()" isn't actually interested in the bit state
> any more at that point. All they care about is that somebody cleared
> it - not whether it was then re-taken again.
>
> So instead of keeping them on the list - or stopping the waitqueue
> walk because somebody else got the bit - we could just mark them
> successfully done, wake them up, and remove those entries from the
> list.

Heh. I too thought about this. And just in case, your patch looks correct
to me. But I can't really comment on this behavioural change. Perhaps it
should come in a separate patch?

In essense, this partly reverts your commit 3510ca20ece0150
("Minor page waitqueue cleanups"). I mean this part:


     (b) we don't want to put the non-locking waiters always on the front of
         the queue, and the locking waiters always on the back.  Not only is
         that unfair, it means that we wake up thousands of reading threads
         that will just end up being blocked by the writer later anyway.
...

	@@ -972,10 +976,7 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q,
			spin_lock_irq(&q->lock);
	 
			if (likely(list_empty(&wait->entry))) {
	-			if (lock)
	-				__add_wait_queue_entry_tail_exclusive(q, wait);
	-			else
	-				__add_wait_queue(q, wait);
	+			__add_wait_queue_entry_tail(q, wait);
				SetPageWaiters(page);
			}

Oleg.
Linus Torvalds July 25, 2020, 6:48 p.m. UTC | #33
On Sat, Jul 25, 2020 at 3:14 AM Oleg Nesterov <oleg@redhat.com> wrote:
>
> Heh. I too thought about this. And just in case, your patch looks correct
> to me. But I can't really comment on this behavioural change. Perhaps it
> should come in a separate patch?

We could do that. At the same time, I think both parts change how the
waitqueue works that it might as well just be one "fix page_bit_wait
waitqueue usage".

But let's wait to see what Hugh's numbers say.

> In essense, this partly reverts your commit 3510ca20ece0150
> ("Minor page waitqueue cleanups"). I mean this part:

Well, no. I mean, it keeps the "always add to the tail" behavior.

But some of the reasons for it have gone away. Now we could just make
it go back to always doing non-exclusive waits at the head.

The non-exclusive waiters _used_ to re-insert themselves on the queue
until they saw the bit clear, so waking them up if the bit was just
going to be set again was just going to make for unnecessary
scheduling and waitlist noise.

That reason is gone.

But I think the fundamental fairness issue might still be there. So
I'll keep the simpler "always add at the end".

But you're right - we could expedite the non-exclusive waiters even more.

             Linus
Oleg Nesterov July 25, 2020, 7:27 p.m. UTC | #34
Firstly, to avoid the confusion, let me repeat I think your patch is fine.

I too thought that non-exclusive waiters do not care about the bit state
and thus wake_page_function() can simply wake them all up.

But then I did "git blame", found your commit 3510ca20ece0150 and came to
the conclusion that there are reasons we should not do this.

On 07/25, Linus Torvalds wrote:
>
> On Sat, Jul 25, 2020 at 3:14 AM Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > In essense, this partly reverts your commit 3510ca20ece0150
> > ("Minor page waitqueue cleanups"). I mean this part:
>
> Well, no. I mean, it keeps the "always add to the tail" behavior.

Ah, sorry for confusion, this doesn't matter. I didn't mean "fairness".

What I tried to say. AFAICS before that commit we had (almost) the same
behaviour you propose now: unlock_page/etc wakes all the non-exclusive
waiters up.

No?

Or I misunderstood your reply? Quite possibly, too late for me...

Oleg.
Linus Torvalds July 25, 2020, 7:51 p.m. UTC | #35
On Sat, Jul 25, 2020 at 12:28 PM Oleg Nesterov <oleg@redhat.com> wrote:
>
> What I tried to say. AFAICS before that commit we had (almost) the same
> behaviour you propose now: unlock_page/etc wakes all the non-exclusive
> waiters up.
>
> No?

Yes, but no.

We'd wake them _up_ fairly aggressively, but then they'd be caught on
the bit being set again by the exclusive locker (that we also woke
up).

So they'd get woken up, and then go to sleep again.

So the new behavior wakes things up more aggressively (but a different
way), but not by letting them go out of order and early, but simply by
not going back to sleep again.

So the "wake up more" is very different - now it's about not going to
sleep again, rather than by ordering the wakeup queue.

We _could_ order the wakeup queue too, and put all non-exclusive
waiters at the head again. And make it *really* aggressive.

But since one of our issues has been "latency of walking the wait
queue", I'm not sure we want that. Interspersing any blocking waiters -
and stopping the waitqueue walking as a result - might be better under
load.

Wild handwaving. We could try it, but I think that really would be a
separate "try this out" patch.

Right now, I think my patch will likely make for _better_ latencies
for everything.

Lower latency of non-exclusive waiters (because not going back to
sleep), but also lower latency of walking the wait queue (because
fewer entries, hopefully, and also less contention due to the "not
going back to sleep" noise)

           Linus
Hugh Dickins July 25, 2020, 9:19 p.m. UTC | #36
On Sat, 25 Jul 2020, Linus Torvalds wrote:
> On Sat, Jul 25, 2020 at 3:14 AM Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > Heh. I too thought about this. And just in case, your patch looks correct
> > to me. But I can't really comment on this behavioural change. Perhaps it
> > should come in a separate patch?
> 
> We could do that. At the same time, I think both parts change how the
> waitqueue works that it might as well just be one "fix page_bit_wait
> waitqueue usage".
> 
> But let's wait to see what Hugh's numbers say.

Oh no, no no: sorry for getting your hopes up there, I won't come up
with any numbers more significant than "0 out of 10" machines crashed.
I know it would be *really* useful if I could come up with performance
comparisons, or steer someone else to do so: but I'm sorry, cannot.

Currently it's actually 1 out of 10 machines crashed, for the same
driverland issue seen last time, maybe it's a bad machine; and another
1 out of the 10 machines went AWOL for unknown reasons, but probably
something outside the kernel got confused by the stress.  No reason
to suspect your changes at all (but some unanalyzed "failure"s, of
dubious significance, accumulating like last time).

I'm optimistic: nothing has happened to warn us off your changes.

And on Fri, 24 Jul 2020, Linus Torvalds had written:
> So the loads you are running are known to have sensitivity to this
> particular area, and are why you've done your patches to the page wait
> bit code?

Yes. It's a series of nineteen ~hour-long tests, of which about five
exhibited wake_up_page_bit problems in the past, and one has remained
intermittently troublesome that way.  Intermittently: usually it does
get through, so getting through yesterday and today won't even tell
us that your changes fixed it - that we shall learn over time later.

Hugh
Hugh Dickins July 26, 2020, 4:22 a.m. UTC | #37
On Sat, 25 Jul 2020, Hugh Dickins wrote:
> On Sat, 25 Jul 2020, Linus Torvalds wrote:
> > On Sat, Jul 25, 2020 at 3:14 AM Oleg Nesterov <oleg@redhat.com> wrote:
> > >
> > > Heh. I too thought about this. And just in case, your patch looks correct
> > > to me. But I can't really comment on this behavioural change. Perhaps it
> > > should come in a separate patch?
> > 
> > We could do that. At the same time, I think both parts change how the
> > waitqueue works that it might as well just be one "fix page_bit_wait
> > waitqueue usage".
> > 
> > But let's wait to see what Hugh's numbers say.
> 
> Oh no, no no: sorry for getting your hopes up there, I won't come up
> with any numbers more significant than "0 out of 10" machines crashed.
> I know it would be *really* useful if I could come up with performance
> comparisons, or steer someone else to do so: but I'm sorry, cannot.
> 
> Currently it's actually 1 out of 10 machines crashed, for the same
> driverland issue seen last time, maybe it's a bad machine; and another
> 1 out of the 10 machines went AWOL for unknown reasons, but probably
> something outside the kernel got confused by the stress.  No reason
> to suspect your changes at all (but some unanalyzed "failure"s, of
> dubious significance, accumulating like last time).
> 
> I'm optimistic: nothing has happened to warn us off your changes.

Less optimistic now, I'm afraid.

The machine I said had (twice) crashed coincidentally in driverland
(some USB completion thing): that machine I set running a comparison
kernel without your changes this morning, while the others still
running with your changes; and it has now passed the point where it
twice crashed before (the most troublesome test), without crashing.

Surprising: maybe still just coincidence, but I must look closer at
the crashes.

The others have now completed, and one other crashed in that
troublesome test, but sadly without yielding any crash info.

I've just set comparison runs going on them all, to judge whether
to take the "failure"s seriously; and I'll look more closely at them.

But hungry and tired now: unlikely to have more to say tonight.

> 
> And on Fri, 24 Jul 2020, Linus Torvalds had written:
> > So the loads you are running are known to have sensitivity to this
> > particular area, and are why you've done your patches to the page wait
> > bit code?
> 
> Yes. It's a series of nineteen ~hour-long tests, of which about five
> exhibited wake_up_page_bit problems in the past, and one has remained
> intermittently troublesome that way.  Intermittently: usually it does
> get through, so getting through yesterday and today won't even tell
> us that your changes fixed it - that we shall learn over time later.
> 
> Hugh
Oleg Nesterov July 26, 2020, 1:57 p.m. UTC | #38
Linus,

I was greatly confused and tried to confuse you.

Somehow I misunderstood your last version and didn't bother to read it
again until now.

Sorry for the noise, and thanks for your explanations.

Oleg.


On 07/25, Linus Torvalds wrote:
>
> On Sat, Jul 25, 2020 at 12:28 PM Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > What I tried to say. AFAICS before that commit we had (almost) the same
> > behaviour you propose now: unlock_page/etc wakes all the non-exclusive
> > waiters up.
> >
> > No?
>
> Yes, but no.
>
> We'd wake them _up_ fairly aggressively, but then they'd be caught on
> the bit being set again by the exclusive locker (that we also woke
> up).
>
> So they'd get woken up, and then go to sleep again.
>
> So the new behavior wakes things up more aggressively (but a different
> way), but not by letting them go out of order and early, but simply by
> not going back to sleep again.
>
> So the "wake up more" is very different - now it's about not going to
> sleep again, rather than by ordering the wakeup queue.
>
> We _could_ order the wakeup queue too, and put all non-exclusive
> waiters at the head again. And make it *really* aggressive.
>
> But since one of our issues has been "latency of walking the wait
> queue", I'm not sure we want that. Interspersing any blocking waiters -
> and stopping the waitqueue walking as a result - might be better under
> load.
>
> Wild handwaving. We could try it, but I think that really would be a
> separate "try this out" patch.
>
> Right now, I think my patch will likely make for _better_ latencies
> for everything.
>
> Lower latency of non-exclusive waiters (because not going back to
> sleep), but also lower latency of walking the wait queue (because
> fewer entries, hopefully, and also less contention due to the "not
> going back to sleep" noise)
>
>            Linus
>
Hugh Dickins July 26, 2020, 8:30 p.m. UTC | #39
On Sat, 25 Jul 2020, Hugh Dickins wrote:
> On Sat, 25 Jul 2020, Hugh Dickins wrote:
> > On Sat, 25 Jul 2020, Linus Torvalds wrote:
> > > On Sat, Jul 25, 2020 at 3:14 AM Oleg Nesterov <oleg@redhat.com> wrote:
> > > >
> > > > Heh. I too thought about this. And just in case, your patch looks correct
> > > > to me. But I can't really comment on this behavioural change. Perhaps it
> > > > should come in a separate patch?
> > > 
> > > We could do that. At the same time, I think both parts change how the
> > > waitqueue works that it might as well just be one "fix page_bit_wait
> > > waitqueue usage".
> > > 
> > > But let's wait to see what Hugh's numbers say.
> > 
> > Oh no, no no: sorry for getting your hopes up there, I won't come up
> > with any numbers more significant than "0 out of 10" machines crashed.
> > I know it would be *really* useful if I could come up with performance
> > comparisons, or steer someone else to do so: but I'm sorry, cannot.
> > 
> > Currently it's actually 1 out of 10 machines crashed, for the same
> > driverland issue seen last time, maybe it's a bad machine; and another
> > 1 out of the 10 machines went AWOL for unknown reasons, but probably
> > something outside the kernel got confused by the stress.  No reason
> > to suspect your changes at all (but some unanalyzed "failure"s, of
> > dubious significance, accumulating like last time).
> > 
> > I'm optimistic: nothing has happened to warn us off your changes.
> 
> Less optimistic now, I'm afraid.
> 
> The machine I said had (twice) crashed coincidentally in driverland
> (some USB completion thing): that machine I set running a comparison
> kernel without your changes this morning, while the others still
> running with your changes; and it has now passed the point where it
> twice crashed before (the most troublesome test), without crashing.
> 
> Surprising: maybe still just coincidence, but I must look closer at
> the crashes.
> 
> The others have now completed, and one other crashed in that
> troublesome test, but sadly without yielding any crash info.
> 
> I've just set comparison runs going on them all, to judge whether
> to take the "failure"s seriously; and I'll look more closely at them.

The comparison runs have not yet completed (except for the one started
early), but they have all got past the most interesting tests, and it's
clear that they do not have the "failure"s seen with your patches.

From that I can only conclude that your patches make a difference.

I've deduced nothing useful from the logs, will have to leave that
to others here with more experience of them.  But my assumption now
is that you have successfully removed one bottleneck, so the tests
get somewhat further and now stick in the next bottleneck, whatever
that may be.  Which shows up as "failure", where the unlock_page()
wake_up_page_bit() bottleneck had allowed the tests to proceed in
a more serially sedate way.

The xhci handle_cmd_completion list_del bugs (on an older version
of the driver): weird, nothing to do with page wakeups, I'll just
have to assume that it's some driver bug exposed by the greater
stress allowed down, and let driver people investigate (if it
still manifests) when we take in your improvements.

One nice thing from the comparison runs without your patches:
watchdog panic did crash one of those with exactly the unlock_page()
wake_up_page_bit() softlockup symptom we've been fighting, that did
not appear with your patches.  So although the sample size is much
too small to justify a conclusion, it does tend towards confirming
your changes.

Thank you for your work on this! And I'm sure you'd have preferred
some hard data back, rather than a diary of my mood swings, but...
we do what we can.

Hugh
Linus Torvalds July 26, 2020, 8:41 p.m. UTC | #40
On Sun, Jul 26, 2020 at 1:30 PM Hugh Dickins <hughd@google.com> wrote:
>
> I've deduced nothing useful from the logs, will have to leave that
> to others here with more experience of them.  But my assumption now
> is that you have successfully removed one bottleneck, so the tests
> get somewhat further and now stick in the next bottleneck, whatever
> that may be.  Which shows up as "failure", where the unlock_page()
> wake_up_page_bit() bottleneck had allowed the tests to proceed in
> a more serially sedate way.

Well, that's the very optimistic reading.

As the optimistic and happy person I am (hah!) I'm going to agree with
you, and plan on just merging that patch early in the next merge
window. It may fix a real bug in the current tree, but it's much too
late to apply right now, particularly with your somewhat ambiguous
results.

Oleg's theoretical race has probably never been seen, and while the
watchdog triggering is clearly a real bug, it's also extreme enough
not to really be a strong argument for merging this out-of-window..

> The xhci handle_cmd_completion list_del bugs (on an older version
> of the driver): weird, nothing to do with page wakeups, I'll just
> have to assume that it's some driver bug exposed by the greater
> stress allowed down, and let driver people investigate (if it
> still manifests) when we take in your improvements.

Do you have the bug-report, just to google against anybody else
reporting something similar?

> One nice thing from the comparison runs without your patches:
> watchdog panic did crash one of those with exactly the unlock_page()
> wake_up_page_bit() softlockup symptom we've been fighting, that did
> not appear with your patches.  So although the sample size is much
> too small to justify a conclusion, it does tend towards confirming
> your changes.

You win some, you lose some. But yes, I'll take that as a tentative
success and that the approach is valid.

Thanks,

               Linus
Hugh Dickins July 26, 2020, 10:09 p.m. UTC | #41
On Sun, 26 Jul 2020, Linus Torvalds wrote:
> On Sun, Jul 26, 2020 at 1:30 PM Hugh Dickins <hughd@google.com> wrote:
> >
> > I've deduced nothing useful from the logs, will have to leave that
> > to others here with more experience of them.  But my assumption now
> > is that you have successfully removed one bottleneck, so the tests
> > get somewhat further and now stick in the next bottleneck, whatever
> > that may be.  Which shows up as "failure", where the unlock_page()
> > wake_up_page_bit() bottleneck had allowed the tests to proceed in
> > a more serially sedate way.
> 
> Well, that's the very optimistic reading.
> 
> As the optimistic and happy person I am (hah!) I'm going to agree with
> you, and plan on just merging that patch early in the next merge
> window. It may fix a real bug in the current tree, but it's much too
> late to apply right now, particularly with your somewhat ambiguous
> results.

Absolutely: it should be good to see it in v5.9,
but much too late for a patch like this in v5.8.

> 
> Oleg's theoretical race has probably never been seen, and while the
> watchdog triggering is clearly a real bug, it's also extreme enough
> not to really be a strong argument for merging this out-of-window..
> 
> > The xhci handle_cmd_completion list_del bugs (on an older version
> > of the driver): weird, nothing to do with page wakeups, I'll just
> > have to assume that it's some driver bug exposed by the greater
> > stress allowed down, and let driver people investigate (if it
> > still manifests) when we take in your improvements.
> 
> Do you have the bug-report, just to google against anybody else
> reporting something similar?

Okay, just on that basis, with some reluctance, an edited extract:
certainly not asking you or anyone on the list to investigate further.

[35196.140502] kernel BUG at lib/list_debug.c:53!
[35196.141448] RIP: 0010:__list_del_entry_valid+0x8e/0xb0
[35196.141534] Call Trace:
[35196.141538]  <IRQ>
[35196.141557]  [<ffffffffc01bc8b4>] handle_cmd_completion+0x7d4/0x14f0 [xhci_hcd]
[35196.141578]  [<ffffffffc01bda22>] xhci_irq+0x242/0x1ea0 [xhci_hcd]
[35196.141608]  [<ffffffffc01bf691>] xhci_msi_irq+0x11/0x20 [xhci_hcd]
[35196.141622]  [<ffffffffb9ff27f8>] __handle_irq_event_percpu+0x48/0x2c0
[35196.141636]  [<ffffffffb9ff2aa2>] handle_irq_event_percpu+0x32/0x80
[35196.141651]  [<ffffffffb9ff2b3a>] handle_irq_event+0x4a/0x80
[35196.141680]  [<ffffffffb9ff6b08>] handle_edge_irq+0xd8/0x1b0
[35196.141697]  [<ffffffffb9ec22ab>] handle_irq+0x2b/0x50
[35196.141712]  [<ffffffffbaa02766>] do_IRQ+0xb6/0x1c0
[35196.141725]  [<ffffffffbaa00990>] common_interrupt+0x90/0x90
[35196.141732]  </IRQ>
> 
> > One nice thing from the comparison runs without your patches:
> > watchdog panic did crash one of those with exactly the unlock_page()
> > wake_up_page_bit() softlockup symptom we've been fighting, that did
> > not appear with your patches.  So although the sample size is much
> > too small to justify a conclusion, it does tend towards confirming
> > your changes.
> 
> You win some, you lose some. But yes, I'll take that as a tentative
> success and that the approach is valid.

Great, yes, tentative success: and we have three months in which to
change our minds if any real trouble surfaces; and I wouldn't call
anything I've seen (since that very first version) *real* trouble.

> 
> Thanks,
> 
>                Linus
Greg KH July 27, 2020, 7:35 p.m. UTC | #42
On Sun, Jul 26, 2020 at 01:30:32PM -0700, Hugh Dickins wrote:
> On Sat, 25 Jul 2020, Hugh Dickins wrote:
> > On Sat, 25 Jul 2020, Hugh Dickins wrote:
> > > On Sat, 25 Jul 2020, Linus Torvalds wrote:
> > > > On Sat, Jul 25, 2020 at 3:14 AM Oleg Nesterov <oleg@redhat.com> wrote:
> > > > >
> > > > > Heh. I too thought about this. And just in case, your patch looks correct
> > > > > to me. But I can't really comment on this behavioural change. Perhaps it
> > > > > should come in a separate patch?
> > > > 
> > > > We could do that. At the same time, I think both parts change how the
> > > > waitqueue works that it might as well just be one "fix page_bit_wait
> > > > waitqueue usage".
> > > > 
> > > > But let's wait to see what Hugh's numbers say.
> > > 
> > > Oh no, no no: sorry for getting your hopes up there, I won't come up
> > > with any numbers more significant than "0 out of 10" machines crashed.
> > > I know it would be *really* useful if I could come up with performance
> > > comparisons, or steer someone else to do so: but I'm sorry, cannot.
> > > 
> > > Currently it's actually 1 out of 10 machines crashed, for the same
> > > driverland issue seen last time, maybe it's a bad machine; and another
> > > 1 out of the 10 machines went AWOL for unknown reasons, but probably
> > > something outside the kernel got confused by the stress.  No reason
> > > to suspect your changes at all (but some unanalyzed "failure"s, of
> > > dubious significance, accumulating like last time).
> > > 
> > > I'm optimistic: nothing has happened to warn us off your changes.
> > 
> > Less optimistic now, I'm afraid.
> > 
> > The machine I said had (twice) crashed coincidentally in driverland
> > (some USB completion thing): that machine I set running a comparison
> > kernel without your changes this morning, while the others still
> > running with your changes; and it has now passed the point where it
> > twice crashed before (the most troublesome test), without crashing.
> > 
> > Surprising: maybe still just coincidence, but I must look closer at
> > the crashes.
> > 
> > The others have now completed, and one other crashed in that
> > troublesome test, but sadly without yielding any crash info.
> > 
> > I've just set comparison runs going on them all, to judge whether
> > to take the "failure"s seriously; and I'll look more closely at them.
> 
> The comparison runs have not yet completed (except for the one started
> early), but they have all got past the most interesting tests, and it's
> clear that they do not have the "failure"s seen with your patches.
> 
> From that I can only conclude that your patches make a difference.
> 
> I've deduced nothing useful from the logs, will have to leave that
> to others here with more experience of them.  But my assumption now
> is that you have successfully removed one bottleneck, so the tests
> get somewhat further and now stick in the next bottleneck, whatever
> that may be.  Which shows up as "failure", where the unlock_page()
> wake_up_page_bit() bottleneck had allowed the tests to proceed in
> a more serially sedate way.
> 
> The xhci handle_cmd_completion list_del bugs (on an older version
> of the driver): weird, nothing to do with page wakeups, I'll just
> have to assume that it's some driver bug exposed by the greater
> stress allowed down, and let driver people investigate (if it
> still manifests) when we take in your improvements.

Linus just pointed me at this thread.

If you could run:
	echo -n 'module xhci_hcd =p' > /sys/kernel/debug/dynamic_debug/control
and run the same workload to see if anything shows up in the log when
xhci crashes, that would be great.

Although if you are using an "older version" of the driver, there's not
much I can suggest except update to a newer one :)

thanks,

greg k-h
Michal Hocko Aug. 3, 2020, 1:14 p.m. UTC | #43
Hi,
sorry for being late to the discussion (was offline or very busy with other
stuff).

I hope I got it right and this is the latest version of your patches. Btw.
do you still think that increasing PAGE_WAIT_TABLE_BITS is reasonable?
In the meantime I have learned that the customer suffering from the
issue is very unlikely to reboot the machine anytime soon or even
willing to test a non-trivial patch. We do not own any machine which
exhibits this problem, unfortunately. So it is quite unlikely I can
help with testing.

Also, does it make sense to put this into the mmotm tree for a while to get
larger testing coverage?

I didn't get to review these patches and it will take some time to do that
because this is a really subtle area.

Thanks for helping out and sorry that I cannot really help much more
now.

On Fri 24-07-20 16:25:56, Linus Torvalds wrote:
> From 0bccb60841cc52a9aa6e9cc6b7eff59d1983e8fa Mon Sep 17 00:00:00 2001
> From: Linus Torvalds <torvalds@linux-foundation.org>
> Date: Thu, 23 Jul 2020 10:16:49 -0700
> Subject: [PATCH 1/2] mm: rewrite wait_on_page_bit_common() logic
> 
> It turns out that wait_on_page_bit_common() had several problems,
> ranging from just unfair behavior due to re-queueing at the end of the
> wait queue when re-trying, to an outright bug that could result in
> missed wakeups (but probably never happened in practice).
> 
> This rewrites the whole logic to avoid both issues, by simply moving the
> logic to check (and possibly take) the bit lock into the wakeup path
> instead.
> 
> That makes everything much more straightforward, and means that we never
> need to re-queue the wait entry: if we get woken up, we'll be notified
> through WQ_FLAG_WOKEN, and the wait queue entry will have been removed,
> and everything will have been done for us.
> 
> Link: https://lore.kernel.org/lkml/CAHk-=wjJA2Z3kUFb-5s=6+n0qbTs8ELqKFt9B3pH85a8fGD73w@mail.gmail.com/
> Link: https://lore.kernel.org/lkml/alpine.LSU.2.11.2007221359450.1017@eggly.anvils/
> Reported-by: Oleg Nesterov <oleg@redhat.com>
> Reported-by: Hugh Dickins <hughd@google.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Reviewed-by: Oleg Nesterov <oleg@redhat.com>
> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
> ---
>  mm/filemap.c | 132 +++++++++++++++++++++++++++++++++------------------
>  1 file changed, 85 insertions(+), 47 deletions(-)
> 
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 385759c4ce4b..8c3d3e233d37 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1002,6 +1002,7 @@ struct wait_page_queue {
>  
>  static int wake_page_function(wait_queue_entry_t *wait, unsigned mode, int sync, void *arg)
>  {
> +	int ret;
>  	struct wait_page_key *key = arg;
>  	struct wait_page_queue *wait_page
>  		= container_of(wait, struct wait_page_queue, wait);
> @@ -1014,17 +1015,40 @@ static int wake_page_function(wait_queue_entry_t *wait, unsigned mode, int sync,
>  		return 0;
>  
>  	/*
> -	 * Stop walking if it's locked.
> -	 * Is this safe if put_and_wait_on_page_locked() is in use?
> -	 * Yes: the waker must hold a reference to this page, and if PG_locked
> -	 * has now already been set by another task, that task must also hold
> -	 * a reference to the *same usage* of this page; so there is no need
> -	 * to walk on to wake even the put_and_wait_on_page_locked() callers.
> +	 * If it's an exclusive wait, we get the bit for it, and
> +	 * stop walking if we can't.
> +	 *
> +	 * If it's a non-exclusive wait, then the fact that this
> +	 * wake function was called means that the bit already
> +	 * was cleared, and we don't care if somebody then
> +	 * re-took it.
>  	 */
> -	if (test_bit(key->bit_nr, &key->page->flags))
> -		return -1;
> +	ret = 0;
> +	if (wait->flags & WQ_FLAG_EXCLUSIVE) {
> +		if (test_and_set_bit(key->bit_nr, &key->page->flags))
> +			return -1;
> +		ret = 1;
> +	}
> +	wait->flags |= WQ_FLAG_WOKEN;
>  
> -	return autoremove_wake_function(wait, mode, sync, key);
> +	wake_up_state(wait->private, mode);
> +
> +	/*
> +	 * Ok, we have successfully done what we're waiting for,
> +	 * and we can unconditionally remove the wait entry.
> +	 *
> +	 * Note that this has to be the absolute last thing we do,
> +	 * since after list_del_init(&wait->entry) the wait entry
> +	 * might be de-allocated and the process might even have
> +	 * exited.
> +	 *
> +	 * We _really_ should have a "list_del_init_careful()" to
> +	 * properly pair with the unlocked "list_empty_careful()"
> +	 * in finish_wait().
> +	 */
> +	smp_mb();
> +	list_del_init(&wait->entry);
> +	return ret;
>  }
>  
>  static void wake_up_page_bit(struct page *page, int bit_nr)
> @@ -1103,16 +1127,31 @@ enum behavior {
>  			 */
>  };
>  
> +/*
> + * Attempt to check (or get) the page bit, and mark the
> + * waiter woken if successful.
> + */
> +static inline bool trylock_page_bit_common(struct page *page, int bit_nr,
> +					struct wait_queue_entry *wait)
> +{
> +	if (wait->flags & WQ_FLAG_EXCLUSIVE) {
> +		if (test_and_set_bit(bit_nr, &page->flags))
> +			return false;
> +	} else if (test_bit(bit_nr, &page->flags))
> +		return false;
> +
> +	wait->flags |= WQ_FLAG_WOKEN;
> +	return true;
> +}
> +
>  static inline int wait_on_page_bit_common(wait_queue_head_t *q,
>  	struct page *page, int bit_nr, int state, enum behavior behavior)
>  {
>  	struct wait_page_queue wait_page;
>  	wait_queue_entry_t *wait = &wait_page.wait;
> -	bool bit_is_set;
>  	bool thrashing = false;
>  	bool delayacct = false;
>  	unsigned long pflags;
> -	int ret = 0;
>  
>  	if (bit_nr == PG_locked &&
>  	    !PageUptodate(page) && PageWorkingset(page)) {
> @@ -1130,48 +1169,47 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q,
>  	wait_page.page = page;
>  	wait_page.bit_nr = bit_nr;
>  
> -	for (;;) {
> -		spin_lock_irq(&q->lock);
> +	/*
> +	 * Do one last check whether we can get the
> +	 * page bit synchronously.
> +	 *
> +	 * Do the SetPageWaiters() marking before that
> +	 * to let any waker we _just_ missed know they
> +	 * need to wake us up (otherwise they'll never
> +	 * even go to the slow case that looks at the
> +	 * page queue), and add ourselves to the wait
> +	 * queue if we need to sleep.
> +	 *
> +	 * This part needs to be done under the queue
> +	 * lock to avoid races.
> +	 */
> +	spin_lock_irq(&q->lock);
> +	SetPageWaiters(page);
> +	if (!trylock_page_bit_common(page, bit_nr, wait))
> +		__add_wait_queue_entry_tail(q, wait);
> +	spin_unlock_irq(&q->lock);
>  
> -		if (likely(list_empty(&wait->entry))) {
> -			__add_wait_queue_entry_tail(q, wait);
> -			SetPageWaiters(page);
> -		}
> +	/*
> +	 * From now on, all the logic will be based on
> +	 * the WQ_FLAG_WOKEN flag, and the page
> +	 * bit testing (and setting) will be - or has
> +	 * already been - done by the wake function.
> +	 *
> +	 * We can drop our reference to the page.
> +	 */
> +	if (behavior == DROP)
> +		put_page(page);
>  
> +	for (;;) {
>  		set_current_state(state);
>  
> -		spin_unlock_irq(&q->lock);
> -
> -		bit_is_set = test_bit(bit_nr, &page->flags);
> -		if (behavior == DROP)
> -			put_page(page);
> -
> -		if (likely(bit_is_set))
> -			io_schedule();
> -
> -		if (behavior == EXCLUSIVE) {
> -			if (!test_and_set_bit_lock(bit_nr, &page->flags))
> -				break;
> -		} else if (behavior == SHARED) {
> -			if (!test_bit(bit_nr, &page->flags))
> -				break;
> -		}
> -
> -		if (signal_pending_state(state, current)) {
> -			ret = -EINTR;
> +		if (signal_pending_state(state, current))
>  			break;
> -		}
>  
> -		if (behavior == DROP) {
> -			/*
> -			 * We can no longer safely access page->flags:
> -			 * even if CONFIG_MEMORY_HOTREMOVE is not enabled,
> -			 * there is a risk of waiting forever on a page reused
> -			 * for something that keeps it locked indefinitely.
> -			 * But best check for -EINTR above before breaking.
> -			 */
> +		if (wait->flags & WQ_FLAG_WOKEN)
>  			break;
> -		}
> +
> +		io_schedule();
>  	}
>  
>  	finish_wait(q, wait);
> @@ -1190,7 +1228,7 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q,
>  	 * bother with signals either.
>  	 */
>  
> -	return ret;
> +	return wait->flags & WQ_FLAG_WOKEN ? 0 : -EINTR;
>  }
>  
>  void wait_on_page_bit(struct page *page, int bit_nr)
> -- 
> 2.28.0.rc0.3.g1e25d3a62f
> 

> From 93f0263b9b952a1c449cec56a6aadf6320e821f9 Mon Sep 17 00:00:00 2001
> From: Linus Torvalds <torvalds@linux-foundation.org>
> Date: Thu, 23 Jul 2020 12:33:41 -0700
> Subject: [PATCH 2/2] list: add "list_del_init_careful()" to go with
>  "list_empty_careful()"
> 
> That gives us ordering guarantees around the pair.
> 
> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
> ---
>  include/linux/list.h | 20 +++++++++++++++++++-
>  kernel/sched/wait.c  |  2 +-
>  mm/filemap.c         |  7 +------
>  3 files changed, 21 insertions(+), 8 deletions(-)
> 
> diff --git a/include/linux/list.h b/include/linux/list.h
> index aff44d34f4e4..0d0d17a10d25 100644
> --- a/include/linux/list.h
> +++ b/include/linux/list.h
> @@ -282,6 +282,24 @@ static inline int list_empty(const struct list_head *head)
>  	return READ_ONCE(head->next) == head;
>  }
>  
> +/**
> + * list_del_init_careful - deletes entry from list and reinitialize it.
> + * @entry: the element to delete from the list.
> + *
> + * This is the same as list_del_init(), except designed to be used
> + * together with list_empty_careful() in a way to guarantee ordering
> + * of other memory operations.
> + *
> + * Any memory operations done before a list_del_init_careful() are
> + * guaranteed to be visible after a list_empty_careful() test.
> + */
> +static inline void list_del_init_careful(struct list_head *entry)
> +{
> +	__list_del_entry(entry);
> +	entry->prev = entry;
> +	smp_store_release(&entry->next, entry);
> +}
> +
>  /**
>   * list_empty_careful - tests whether a list is empty and not being modified
>   * @head: the list to test
> @@ -297,7 +315,7 @@ static inline int list_empty(const struct list_head *head)
>   */
>  static inline int list_empty_careful(const struct list_head *head)
>  {
> -	struct list_head *next = head->next;
> +	struct list_head *next = smp_load_acquire(&head->next);
>  	return (next == head) && (next == head->prev);
>  }
>  
> diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
> index ba059fbfc53a..01f5d3020589 100644
> --- a/kernel/sched/wait.c
> +++ b/kernel/sched/wait.c
> @@ -389,7 +389,7 @@ int autoremove_wake_function(struct wait_queue_entry *wq_entry, unsigned mode, i
>  	int ret = default_wake_function(wq_entry, mode, sync, key);
>  
>  	if (ret)
> -		list_del_init(&wq_entry->entry);
> +		list_del_init_careful(&wq_entry->entry);
>  
>  	return ret;
>  }
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 8c3d3e233d37..991503bbf922 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1041,13 +1041,8 @@ static int wake_page_function(wait_queue_entry_t *wait, unsigned mode, int sync,
>  	 * since after list_del_init(&wait->entry) the wait entry
>  	 * might be de-allocated and the process might even have
>  	 * exited.
> -	 *
> -	 * We _really_ should have a "list_del_init_careful()" to
> -	 * properly pair with the unlocked "list_empty_careful()"
> -	 * in finish_wait().
>  	 */
> -	smp_mb();
> -	list_del_init(&wait->entry);
> +	list_del_init_careful(&wait->entry);
>  	return ret;
>  }
>  
> -- 
> 2.28.0.rc0.3.g1e25d3a62f
>
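
As context for the pairing the second patch above is about: the
unlocked check it pairs with lives in finish_wait(), which, condensed,
looks roughly like this (comments added here):

void finish_wait(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry)
{
	unsigned long flags;

	__set_current_state(TASK_RUNNING);
	/*
	 * If the waker already removed us (now via list_del_init_careful()),
	 * skip taking the queue lock.  The acquire in list_empty_careful()
	 * pairs with that release, so everything the waker wrote before the
	 * removal (e.g. WQ_FLAG_WOKEN) is guaranteed to be visible here.
	 */
	if (!list_empty_careful(&wq_entry->entry)) {
		spin_lock_irqsave(&wq_head->lock, flags);
		list_del_init(&wq_entry->entry);
		spin_unlock_irqrestore(&wq_head->lock, flags);
	}
}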
Linus Torvalds Aug. 3, 2020, 5:56 p.m. UTC | #44
On Mon, Aug 3, 2020 at 6:14 AM Michal Hocko <mhocko@suse.com> wrote:
>
> I hope I got it right and this is the latest version of your patches. Btw.
> do you still think that increasing PAGE_WAIT_TABLE_BITS is reasonable?

I suspect it's still very reasonable, but I'd love to have numbers for it.

> In the meantime I have learned that the customer suffering from the
> issue is very unlikely to reboot the machine anytime soon or even
> willing to test a non-trivial patch. We do not own any machine which
> exhibits this problem, unfortunately. So it is quite unlikely I can
> help with testing.

Ok.

> Also does it make sense to put this into mmotm tree for a while to get a
> larger testing coverage?

Well, I did the 5.8 release yesterday, so I just put it in the tree
for the 5.9 merge window - I've been running it locally since I posted
it, and while Hugh couldn't prove it improved anything, his results
certainly also didn't say it was bad.

So anybody that tests my top-of-tree will be testing that thing now,
which is likely more than linux-next or mmotm gets (outside of build
testing and the robots).

Of course, I don't know how many people run my development tree,
particularly during the busy merge window, but hey, at worst it will
be in the next linux-next that way. At best, it's not just me, but a
number of other developers.

                   Linus
Hugh Dickins Aug. 6, 2020, 5:21 a.m. UTC | #45
Nice to see the +130.0% this morning.

I got back on to this on Monday, here's some follow-up.

On Sun, 26 Jul 2020, Hugh Dickins wrote:
> 
> The comparison runs have not yet completed (except for the one started
> early), but they have all got past the most interesting tests, and it's
> clear that they do not have the "failure"s seen with your patches.
> 
> From that I can only conclude that your patches make a difference.
> 
> I've deduced nothing useful from the logs, will have to leave that
> to others here with more experience of them.  But my assumption now
> is that you have successfully removed one bottleneck, so the tests
> get somewhat further and now stick in the next bottleneck, whatever
> that may be.  Which shows up as "failure", where the unlock_page()
> wake_up_page_bit() bottleneck had allowed the tests to proceed in
> a more serially sedate way.

Yes, that's still how it appears to me. The test failures, all
of them, came from fork() returning ENOSPC, which originated from
alloc_pid()'s idr_alloc_cyclic(). I did try doubling our already
large pid_max, that did not work out well, there are probably good
reasons for it to be where it is and I was wrong to dabble with it.
I also tried an rcu_barrier() and retry when getting -ENOSPC there,
thinking maybe RCU was not freeing them up fast enough, but that
didn't help either.

I think (but didn't quite make the effort to double-check with
an independent count) it was simply running out of pids: that your
change speeds up the forking enough, that exiting could not quite keep
up (SIGCHLD is SIG_IGNed); whereas before your change, the unlock_page()
in do_wp_page(), on a PageAnon stack page, slowed the forking down enough
when heavily contended.

(I think we could improve the checks there, to avoid taking page lock in
more cases; but I don't know if that would help any real-life workload -
I see now that Michal's case is do_read_fault() not do_wp_page().)

And FWIW a further speedup there is the opposite of what these tests
are wanting: for the moment I've enabled a delay to get them passing
as before.

Something I was interested to realize in looking at this: trylock_page()
on a contended lock is now much less likely to jump the queue and
succeed than before, since your lock holder hands off the page lock to
the next holder: much smaller window than waiting for the next to wake
to take it. Nothing wrong with that, but effect might be seen somewhere.
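
For reference, trylock_page() is essentially just a
test_and_set_bit_lock() on PG_locked, roughly:

static inline int trylock_page(struct page *page)
{
	page = compound_head(page);
	return (likely(!test_and_set_bit_lock(PG_locked, &page->flags)));
}

so it can only succeed while the bit happens to be clear; with the
hand-off done inside wake_page_function(), that window is now roughly
the length of the waitqueue walk rather than a full wakeup latency.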

> 
> The xhci handle_cmd_completion list_del bugs (on an older version
> of the driver): weird, nothing to do with page wakeups, I'll just
> have to assume that it's some driver bug exposed by the greater
> stress allowed down, and let driver people investigate (if it
> still manifests) when we take in your improvements.

Complete red herring. I'll give Greg more info in response to his
mail, and there may be an xhci bug in there; but when I looked back,
found I'd come across the same bug back in October, and find that
occasionally it's been seen in our fleet. Yes, it's odd that your
change coincided with it becoming more common on that machine
(which I've since replaced by another), yes it's funny that it's
in __list_del_entry_valid(), which is exactly where I got crashes
on pages with your initial patch; but it's just a distraction.

> 
> One nice thing from the comparison runs without your patches:
> watchdog panic did crash one of those with exactly the unlock_page()
> wake_up_page_bit() softlockup symptom we've been fighting, that did
> not appear with your patches.  So although the sample size is much
> too small to justify a conclusion, it does tend towards confirming
> your changes.
> 
> Thank you for your work on this! And I'm sure you'd have preferred
> some hard data back, rather than a diary of my mood swings, but...
> we do what we can.
> 
> Hugh
Hugh Dickins Aug. 6, 2020, 5:46 a.m. UTC | #46
On Mon, 27 Jul 2020, Greg KH wrote:
> 
> Linus just pointed me at this thread.
> 
> If you could run:
> 	echo -n 'module xhci_hcd =p' > /sys/kernel/debug/dynamic_debug/control
> and run the same workload to see if anything shows up in the log when
> xhci crashes, that would be great.

Thanks, I tried that, and indeed it did have a story to tell:

ep 0x81 - asked for 16 bytes, 10 bytes untransferred
ep 0x81 - asked for 16 bytes, 10 bytes untransferred
ep 0x81 - asked for 16 bytes, 10 bytes untransferred
   a very large number of lines like the above, then
Cancel URB 00000000d81602f7, dev 4, ep 0x0, starting at offset 0xfffd42c0
// Ding dong!
ep 0x81 - asked for 16 bytes, 10 bytes untransferred
Stopped on No-op or Link TRB for slot 1 ep 0
xhci_drop_endpoint called for udev 000000005bc07fa6
drop ep 0x81, slot id 1, new drop flags = 0x8, new add flags = 0x0
add ep 0x81, slot id 1, new drop flags = 0x8, new add flags = 0x8
xhci_check_bandwidth called for udev 000000005bc07fa6
// Ding dong!
Successful Endpoint Configure command
Cancel URB 000000006b77d490, dev 4, ep 0x81, starting at offset 0x0
// Ding dong!
Stopped on No-op or Link TRB for slot 1 ep 2
Removing canceled TD starting at 0x0 (dma).
list_del corruption: prev(ffff8fdb4de7a130)->next should be ffff8fdb41697f88,
   but is 6b6b6b6b6b6b6b6b; next(ffff8fdb4de7a130)->prev is 6b6b6b6b6b6b6b6b.
------------[ cut here ]------------
kernel BUG at lib/list_debug.c:53!
RIP: 0010:__list_del_entry_valid+0x8e/0xb0
Call Trace:
 <IRQ>
 handle_cmd_completion+0x7d4/0x14f0 [xhci_hcd]
 xhci_irq+0x242/0x1ea0 [xhci_hcd]
 xhci_msi_irq+0x11/0x20 [xhci_hcd]
 __handle_irq_event_percpu+0x48/0x2c0
 handle_irq_event_percpu+0x32/0x80
 handle_irq_event+0x4a/0x80
 handle_edge_irq+0xd8/0x1b0
 handle_irq+0x2b/0x50
 do_IRQ+0xb6/0x1c0
 common_interrupt+0x90/0x90
 </IRQ>

Info provided for your interest, not expecting any response.
The list_del info in there is non-standard, from a patch of mine:
I find hashed addresses in debug output less than helpful.

> 
> Although if you are using an "older version" of the driver, there's not
> much I can suggest except update to a newer one :)

Yes, I was reluctant to post any info, since really the ball is at our
end of the court, not yours. I did have a go at bringing in the latest
xhci driver instead, but quickly saw that was not a sensible task for
me. And I did scan the git log of xhci changes (especially xhci-ring.c
changes): thought I saw a likely relevant and easily applied fix commit,
but in fact it made no difference here.

I suspect it's in part a hardware problem, but driver not recovering
correctly. I've replaced the machine (but also noticed that the same
crash has occasionally been seen on other machines). I'm sure it has
no relevance to this unlock_page() thread, though it's quite possible
that it's triggered under stress, and Linus's changes allowed greater
stress.

Hugh
Linus Torvalds Aug. 6, 2020, 5:07 p.m. UTC | #47
On Wed, Aug 5, 2020 at 10:21 PM Hugh Dickins <hughd@google.com> wrote:
>
> Something I was interested to realize in looking at this: trylock_page()
> on a contended lock is now much less likely to jump the queue and
> succeed than before, since your lock holder hands off the page lock to
> the next holder: much smaller window than waiting for the next to wake
> to take it. Nothing wrong with that, but effect might be seen somewhere.

Yeah, the window is smaller, but it's not gone.

It *might* be interesting to actually do the handover directly from
"unlock_page()", and avoid clearing (and then setting) the bit
entirely.

Something like the attached TOTALLY UNTESTED patch.

NOTE! Sometimes when I say something is untested, I think the patch is
fine because it's simple and straightforward, I just didn't test it.

This time it's both untested and very very subtle indeed. Did I get
the hand-over case SMP memory barriers right? Did I screw something
else up?

So this might be complete garbage. I really don't know. But it might
close the window for an unfair trylock (or lucky lock_page())
entirely.

Or maybe it just makes page locking break entirely. It's a very real risk.

The upside may not be worth it.

               Linus
Matthew Wilcox Aug. 6, 2020, 6 p.m. UTC | #48
On Thu, Aug 06, 2020 at 10:07:07AM -0700, Linus Torvalds wrote:
> On Wed, Aug 5, 2020 at 10:21 PM Hugh Dickins <hughd@google.com> wrote:
> > Something I was interested to realize in looking at this: trylock_page()
> > on a contended lock is now much less likely to jump the queue and
> > succeed than before, since your lock holder hands off the page lock to
> > the next holder: much smaller window than waiting for the next to wake
> > to take it. Nothing wrong with that, but effect might be seen somewhere.
> 
> Yeah, the window is smaller, but it's not gone.
> 
> It *might* be interesting to actually do the handover directly from
> "unlock_page()", and avoid clearing (and then setting) the bit
> entirely.
> 
> Something like the attached TOTALLY UNTESTED patch.
> 
> NOTE! Sometimes when I say something is untested, I think the patch is
> fine because it's simple and straightforward, I just didn't test it.
> 
> This time it's both untested and very very subtle indeed. Did I get
> the hand-over case SMP memory barriers right? Did I screw something
> else up?
> 
> So this might be complete garbage. I really don't know. But it might
> close the window for an unfair trylock (or lucky lock_page())
> entirely.

It wasn't clear to me whether Hugh thought it was an improvement or
not that trylock was now less likely to jump the queue.  There're
the usual "fair is the opposite of throughput" kind of arguments.
Linus Torvalds Aug. 6, 2020, 6:32 p.m. UTC | #49
On Thu, Aug 6, 2020 at 11:00 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> It wasn't clear to me whether Hugh thought it was an improvement or
> not that trylock was now less likely to jump the queue.  There're
> the usual "fair is the opposite of throughput" kind of arguments.

Yeah, it could go either way. But on the whole, if the lock bit is
getting any contention, I think we'd rather have it be fair for
latency reasons.

That said, I'm not convinced about my patch, and I actually threw it
away without even testing it (sometimes I keep patches around in my
private tree for testing, and they can live there for months or even
years when I wonder if they are worth it, but this time I didn't
bother to go to the trouble).

If somebody is interested in pursuing this, I think that patch might
be a good starting point (and it _might_ even work), but it seemed to
be too subtle to really worry about unless somebody finds an actual
acute reason for it.

I think the existing patch narrowing the window is good, and it
clearly didn't hurt throughput (although that was almost certainly for
other reasons).

                Linus
Hugh Dickins Aug. 7, 2020, 6:41 p.m. UTC | #50
On Thu, 6 Aug 2020, Linus Torvalds wrote:
> On Thu, Aug 6, 2020 at 11:00 AM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > It wasn't clear to me whether Hugh thought it was an improvement or
> > not that trylock was now less likely to jump the queue.  There're
> > the usual "fair is the opposite of throughput" kind of arguments.

I don't know.  I'm inclined to think that keeping just a small chance
of jumping the queue is probably good; but pay no attention to me,
that's an opinion based on ignorance.

Thanks for mentioning the lucky lock_page(): I was quite wrong to
single out trylock_page() - I suppose its lack of a slow path just
helped it spring to mind more easily.

> 
> Yeah, it could go either way. But on the whole, if the lock bit is
> getting any contention, I think we'd rather have it be fair for
> latency reasons.
> 
> That said, I'm not convinced about my patch, and I actually threw it
> away without even testing it (sometimes I keep patches around in my
> private tree for testing, and they can live there for months or even
> years when I wonder if they are worth it, but this time I didn't
> bother to go to the trouble).
> 
> If somebody is interested in pursuing this, I think that patch might
> be a good starting point (and it _might_ even work), but it seemed to
> be too subtle to really worry about unless somebody finds an actual
> acute reason for it.

It is an attractive idea, passing the baton from one to the next;
and a logical destination from where you had already arrived.
Maybe somebody, not me, should pursue it.

I tried to ignore it, but eventually succumbed to temptation, and ran
it overnight at home (my usual tmpfs swapping), and on one machine at
work (fork/exit stress etc).  It did need one fixup, below: then home
testing went fine.  But the fork/exit stress hit that old familiar
unlock_page wake_up_page_bit softlockup that your existing patch
has appeared to fix.  I can't afford to investigate further (though
might regret that: I keep wondering if the small dose of unfairness
in your existing patch is enough to kick the test when it otherwise
would hang, and this new stricter patch be pointing to that).

My home testing was, effectively, on top of c6fe44d96fc1 (v5.8 plus
your two patches): I did not have in the io_uring changes from the
current tree. In glancing there, I noticed one new and one previous
instance of SetPageWaiters() *after* __add_wait_queue_entry_tail():
are those correct?
 
> 
> I think the existing patch narrowing the window is good, and it
> clearly didn't hurt throughput (although that was almost certainly for
> other reasons).

Agreed.

> 
>                 Linus
> 

--- linus/mm/filemap.c	2020-08-07 02:08:13.727709186 -0700
+++ hughd/mm/filemap.c	2020-08-07 02:16:10.960331473 -0700
@@ -1044,7 +1044,7 @@ static int wake_page_function(wait_queue
 	return ret;
 }
 
-static int wake_up_page_bit(struct page *page, int bit_nr)
+static void wake_up_page_bit(struct page *page, int bit_nr, bool exclude)
 {
 	wait_queue_head_t *q = page_waitqueue(page);
 	struct wait_page_key key;
@@ -1096,15 +1096,28 @@ static int wake_up_page_bit(struct page
 		 * That's okay, it's a rare case. The next waker will clear it.
 		 */
 	}
+
+	/*
+	 * If we hoped to pass PG_locked on to the next locker, but found
+	 * no exclusive taker, then we must clear it before dropping q->lock.
+	 * Otherwise unlock_page() can race trylock_page_bit_common(), and if
+	 * PageWaiters was already set from before, then cmpxchg sees no
+	 * difference to send it back to wake_up_page_bit().
+	 *
+	 * We may have already dropped and retaken q->lock on the way here, but
+	 * I think this works out because new entries are always added at tail.
+	 */
+	if (exclude && !exclusive)
+		clear_bit(bit_nr, &page->flags);
+
 	spin_unlock_irqrestore(&q->lock, flags);
-	return exclusive;
 }
 
 static void wake_up_page(struct page *page, int bit)
 {
 	if (!PageWaiters(page))
 		return;
-	wake_up_page_bit(page, bit);
+	wake_up_page_bit(page, bit, false);
 }
 
 /*
@@ -1339,8 +1352,8 @@ void unlock_page(struct page *page)
 			 * spinlock wake_up_page_bit() will do.
 			 */
 			smp_mb__before_atomic();
-			if (wake_up_page_bit(page, PG_locked))
-				return;
+			wake_up_page_bit(page, PG_locked, true);
+			return;
 		}
 		new = cmpxchg_release(&page->flags, flags, flags & ~(1 << PG_locked));
 		if (likely(new == flags))
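
Since the attached patch itself is not reproduced here, the following is a condensed sketch of what the wakeup side looks like with the handoff plus the fixup above, inferred only from the hunks shown: wake_one_exclusive_waiter() is a hypothetical stand-in for the waitqueue walk, and the memory barriers and bookmark batching are left out.

/*
 * Sketch only, inferred from the hunks above; not the real function.
 * The unlocker calls this with PG_locked still set when it hopes to
 * hand the lock straight to the next exclusive waiter.
 */
static void wake_up_page_bit_sketch(struct page *page, int bit_nr, bool exclude)
{
	wait_queue_head_t *q = page_waitqueue(page);
	unsigned long flags;
	bool exclusive;

	spin_lock_irqsave(&q->lock, flags);

	/*
	 * Hypothetical helper: walk the queue, wake waiters, and report
	 * whether an exclusive waiter accepted ownership of the still-set bit.
	 */
	exclusive = wake_one_exclusive_waiter(q, page, bit_nr);

	/*
	 * The fixup above: if we hoped to pass the bit on but found no
	 * exclusive taker, clear it before dropping q->lock, so a racing
	 * unlock_page()/trylock does not see a stale PG_locked.
	 */
	if (exclude && !exclusive)
		clear_bit(bit_nr, &page->flags);

	spin_unlock_irqrestore(&q->lock, flags);
}

The key point is the last step: when nobody accepts the handoff, the bit must be cleared before q->lock is dropped, which is exactly what the fixup above adds.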
Linus Torvalds Aug. 7, 2020, 7:07 p.m. UTC | #51
On Fri, Aug 7, 2020 at 11:41 AM Hugh Dickins <hughd@google.com> wrote:
>
> +
> +       /*
> +        * If we hoped to pass PG_locked on to the next locker, but found
> +        * no exclusive taker, then we must clear it before dropping q->lock.
> +        * Otherwise unlock_page() can race trylock_page_bit_common(), and if
> +        * PageWaiters was already set from before, then cmpxchg sees no
> +        * difference to send it back to wake_up_page_bit().
> +        *
> +        * We may have already dropped and retaken q->lock on the way here, but
> +        * I think this works out because new entries are always added at tail.
> +        */
> +       if (exclude && !exclusive)
> +               clear_bit(bit_nr, &page->flags);
> +
>         spin_unlock_irqrestore(&q->lock, flags);

Yeah, I started thinking about this, and that's when I decided to just
throw away the patch.

Because now it clears the bit *after* it has woken people up, and that
just made me go "Eww".

You did add a comment about the whole "always added to the tail"
thing, and the spinlock should serialize everything, so I guess it's
ok (because the spinlock should serialize things that care), but it
just feels wrong.

I also started worrying about other people waiting on the page lock
(ie we now have that whole io_uring case), and interaction with the
PG_writeback case etc, and it just ended up feeling very messy.

I think it was all fine - other cases won't hit that exclusive case at
all - but I had a hard time wrapping my little mind around the exact
handoff rules, and the cmpxchg behavior when other bits changed (eg
PG_waiters), so I gave up.

> My home testing was, effectively, on top of c6fe44d96fc1 (v5.8 plus
> your two patches): I did not have in the io_uring changes from the
> current tree. In glancing there, I noticed one new and one previous
> instance of SetPageWaiters() *after* __add_wait_queue_entry_tail():
> are those correct?

I don't think SetPageWaiters() has any ordering constraints with the
wait queue. Nobody looks at the waitqueue unless they already got to
the slowpath.

The only ordering constraint is with the testing of the bit we're
going to wait on. Because doing

     if (test_and_set_bit(PG_locked, &page->flags))
          SetPageWaiters(page);

is wrong - there's a race in there when somebody can clear the bit,
and not see that there are waiters.

And obviously that needs to be done inside the spinlock, but once you
do that, the ordering of the actual wait-queue is entirely irrelevant.
The spinlock will order _that_ part. The only thing the spinlock won't
order and serialize is the PG_lock bit and making sure we get to the
slowpath in the first place.

So if you're talking about __wait_on_page_locked_async(), then I think
that part is ok.

Or am I missing something?

                Linus
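
A minimal sketch of the ordering constraint described above, assuming a much simplified slowpath; this is a stand-in, not the actual mm/filemap.c code.

/*
 * Minimal sketch of the ordering constraint: PG_waiters must be
 * published before the final test of the contended bit.
 */
static void lock_page_slowpath_sketch(struct page *page)
{
	wait_queue_head_t *q = page_waitqueue(page);
	DEFINE_WAIT(wait);

	/*
	 * Broken ordering (the race described above):
	 *
	 *	if (test_and_set_bit(PG_locked, &page->flags))
	 *		SetPageWaiters(page);
	 *
	 * The unlocker can clear PG_locked between the failed trylock and
	 * SetPageWaiters(), see no waiters, and skip the wakeup.
	 */

	for (;;) {
		spin_lock_irq(&q->lock);
		if (list_empty(&wait.entry))
			__add_wait_queue_entry_tail(q, &wait);
		/* Publish PG_waiters before the final test of the bit below. */
		SetPageWaiters(page);
		set_current_state(TASK_UNINTERRUPTIBLE);
		spin_unlock_irq(&q->lock);

		if (!test_and_set_bit_lock(PG_locked, &page->flags))
			break;		/* got the lock */
		io_schedule();
	}
	finish_wait(q, &wait);	/* removes the entry, sets TASK_RUNNING */
}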
Matthew Wilcox Aug. 7, 2020, 7:35 p.m. UTC | #52
On Fri, Aug 07, 2020 at 11:41:09AM -0700, Hugh Dickins wrote:
> My home testing was, effectively, on top of c6fe44d96fc1 (v5.8 plus
> your two patches): I did not have in the io_uring changes from the
> current tree. In glancing there, I noticed one new and one previous
> instance of SetPageWaiters() *after* __add_wait_queue_entry_tail():
> are those correct?

By the way, don't do any performance testing on current Linus tree.
This commit:

commit 37f4a24c2469a10a4c16c641671bd766e276cf9f (refs/bisect/bad)
Author: Ming Lei <ming.lei@redhat.com>
Date:   Tue Jun 30 22:03:57 2020 +0800

    blk-mq: centralise related handling into blk_mq_get_driver_tag
    
    Move .nr_active update and request assignment into blk_mq_get_driver_tag(),
    all are good to do during getting driver tag.
    
    Meantime blk-flush related code is simplified and flush request needn't
    to update the request table manually any more.
    
    Signed-off-by: Ming Lei <ming.lei@redhat.com>
    Cc: Christoph Hellwig <hch@infradead.org>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

slows everything down, at least on my test qemu system.  Seems like it's
losing block queue tags.
Greg KH Aug. 18, 2020, 1:50 p.m. UTC | #53
On Wed, Aug 05, 2020 at 10:46:12PM -0700, Hugh Dickins wrote:
> On Mon, 27 Jul 2020, Greg KH wrote:
> > 
> > Linus just pointed me at this thread.
> > 
> > If you could run:
> > 	echo -n 'module xhci_hcd =p' > /sys/kernel/debug/dynamic_debug/control
> > and run the same workload to see if anything shows up in the log when
> > xhci crashes, that would be great.
> 
> Thanks, I tried that, and indeed it did have a story to tell:
> 
> ep 0x81 - asked for 16 bytes, 10 bytes untransferred
> ep 0x81 - asked for 16 bytes, 10 bytes untransferred
> ep 0x81 - asked for 16 bytes, 10 bytes untransferred
>    a very large number of lines like the above, then
> Cancel URB 00000000d81602f7, dev 4, ep 0x0, starting at offset 0xfffd42c0
> // Ding dong!
> ep 0x81 - asked for 16 bytes, 10 bytes untransferred
> Stopped on No-op or Link TRB for slot 1 ep 0
> xhci_drop_endpoint called for udev 000000005bc07fa6
> drop ep 0x81, slot id 1, new drop flags = 0x8, new add flags = 0x0
> add ep 0x81, slot id 1, new drop flags = 0x8, new add flags = 0x8
> xhci_check_bandwidth called for udev 000000005bc07fa6
> // Ding dong!
> Successful Endpoint Configure command
> Cancel URB 000000006b77d490, dev 4, ep 0x81, starting at offset 0x0
> // Ding dong!
> Stopped on No-op or Link TRB for slot 1 ep 2
> Removing canceled TD starting at 0x0 (dma).
> list_del corruption: prev(ffff8fdb4de7a130)->next should be ffff8fdb41697f88,
>    but is 6b6b6b6b6b6b6b6b; next(ffff8fdb4de7a130)->prev is 6b6b6b6b6b6b6b6b.
> ------------[ cut here ]------------
> kernel BUG at lib/list_debug.c:53!
> RIP: 0010:__list_del_entry_valid+0x8e/0xb0
> Call Trace:
>  <IRQ>
>  handle_cmd_completion+0x7d4/0x14f0 [xhci_hcd]
>  xhci_irq+0x242/0x1ea0 [xhci_hcd]
>  xhci_msi_irq+0x11/0x20 [xhci_hcd]
>  __handle_irq_event_percpu+0x48/0x2c0
>  handle_irq_event_percpu+0x32/0x80
>  handle_irq_event+0x4a/0x80
>  handle_edge_irq+0xd8/0x1b0
>  handle_irq+0x2b/0x50
>  do_IRQ+0xb6/0x1c0
>  common_interrupt+0x90/0x90
>  </IRQ>
> 
> Info provided for your interest, not expecting any response.
> The list_del info in there is non-standard, from a patch of mine:
> I find hashed addresses in debug output less than helpful.

Thanks for this, that is really odd.

> > 
> > Although if you are using an "older version" of the driver, there's not
> > much I can suggest except update to a newer one :)
> 
> Yes, I was reluctant to post any info, since really the ball is at our
> end of the court, not yours. I did have a go at bringing in the latest
> xhci driver instead, but quickly saw that was not a sensible task for
> me. And I did scan the git log of xhci changes (especially xhci-ring.c
> changes): thought I saw a likely relevant and easily applied fix commit,
> but in fact it made no difference here.
> 
> I suspect it's in part a hardware problem, but driver not recovering
> correctly. I've replaced the machine (but also noticed that the same
> crash has occasionally been seen on other machines). I'm sure it has
> no relevance to this unlock_page() thread, though it's quite possible
> that it's triggered under stress, and Linus's changes allowed greater
> stress.

I will be willing to blame hardware problems for this as well, but will
save this report in case something else shows up in the future, thanks!

greg k-h
diff mbox series

Patch

diff --git a/mm/filemap.c b/mm/filemap.c
index 385759c4ce4b..74681c40a6e5 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -41,6 +41,7 @@ 
 #include <linux/delayacct.h>
 #include <linux/psi.h>
 #include <linux/ramfs.h>
+#include <linux/nmi.h>
 #include "internal.h"
 
 #define CREATE_TRACE_POINTS
@@ -1055,6 +1056,7 @@  static void wake_up_page_bit(struct page *page, int bit_nr)
 		 */
 		spin_unlock_irqrestore(&q->lock, flags);
 		cpu_relax();
+		touch_softlockup_watchdog();
 		spin_lock_irqsave(&q->lock, flags);
 		__wake_up_locked_key_bookmark(q, TASK_NORMAL, &key, &bookmark);
 	}
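
For readers new to the bookmark mechanism, here is a condensed sketch of the batched-wakeup loop this hunk lands in, simplified from the context lines above; it is not the complete wake_up_page_bit().

/*
 * Condensed sketch of the batched wakeup pattern the hunk above plugs
 * into; simplified from the context lines, not the complete function.
 */
static void wake_up_page_bit_batched_sketch(wait_queue_head_t *q,
					    struct wait_page_key *key)
{
	wait_queue_entry_t bookmark;
	unsigned long flags;

	bookmark.flags = 0;
	bookmark.private = NULL;
	bookmark.func = NULL;
	INIT_LIST_HEAD(&bookmark.entry);

	spin_lock_irqsave(&q->lock, flags);
	__wake_up_locked_key_bookmark(q, TASK_NORMAL, key, &bookmark);

	/*
	 * Each call wakes at most one batch of waiters and records where it
	 * stopped in 'bookmark'.  Between batches the lock is dropped so
	 * other CPUs (and freshly woken waiters) can make progress; the added
	 * touch_softlockup_watchdog() tells the watchdog this CPU is still
	 * making forward progress even though it never schedules here.
	 */
	while (bookmark.flags & WQ_FLAG_BOOKMARK) {
		spin_unlock_irqrestore(&q->lock, flags);
		cpu_relax();
		touch_softlockup_watchdog();
		spin_lock_irqsave(&q->lock, flags);
		__wake_up_locked_key_bookmark(q, TASK_NORMAL, key, &bookmark);
	}
	spin_unlock_irqrestore(&q->lock, flags);
}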