Message ID | aef9de29-e9f5-259a-f8be-12d1b734e72@google.com (mailing list archive) |
---|---|
State | New, archived |
Series | [next] sbitmap: fix lockup while swapping |
On Sun, Sep 18, 2022 at 02:10:51PM -0700, Hugh Dickins wrote: > I have almost no grasp of all the possible sbitmap races, and their > consequences: but using the same !waitqueue_active() check as used > elsewhere, fixes the lockup and shows no adverse consequence for me. > Fixes: 4acb83417cad ("sbitmap: fix batched wait_cnt accounting") > Signed-off-by: Hugh Dickins <hughd@google.com> > --- > > lib/sbitmap.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > --- a/lib/sbitmap.c > +++ b/lib/sbitmap.c > @@ -620,7 +620,7 @@ static bool __sbq_wake_up(struct sbitmap > * function again to wakeup a new batch on a different 'ws'. > */ > if (cur == 0) > - return true; > + return !waitqueue_active(&ws->wait); If it's 0, that is supposed to mean another thread is about to make it not zero as well as increment the wakestate index. That should be happening after patch 48c033314f37 was included, at least. Prior to 4acb83417cad, the code would also return 'true' if the count was already zero, and this batched code wasn't supposed to behave any different in that condition. Anyway, I don't think the waitqueue_active criteria of the current waitstate is correct either. The next waitstate may have waiters too, so we still may need to account for this batch's count in order to wake them.
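For context on the code being discussed: the reason a wait_cnt stuck at 0 can turn into a spin is the retry loop in the caller. A minimal sketch, abridged from the lib/sbitmap.c under discussion rather than quoted verbatim:

void sbitmap_queue_wake_up(struct sbitmap_queue *sbq, int nr)
{
	/* retry until __sbq_wake_up() reports there is nothing left to do */
	while (__sbq_wake_up(sbq, &nr))
		;
}

Any path in __sbq_wake_up() that keeps returning true without making progress, such as repeatedly observing cur == 0, therefore keeps the caller looping.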
On Mon, 19 Sep 2022, Keith Busch wrote: > On Sun, Sep 18, 2022 at 02:10:51PM -0700, Hugh Dickins wrote: > > I have almost no grasp of all the possible sbitmap races, and their > > consequences: but using the same !waitqueue_active() check as used > > elsewhere, fixes the lockup and shows no adverse consequence for me. > > > > Fixes: 4acb83417cad ("sbitmap: fix batched wait_cnt accounting") > > Signed-off-by: Hugh Dickins <hughd@google.com> > > --- > > > > lib/sbitmap.c | 2 +- > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > --- a/lib/sbitmap.c > > +++ b/lib/sbitmap.c > > @@ -620,7 +620,7 @@ static bool __sbq_wake_up(struct sbitmap > > * function again to wakeup a new batch on a different 'ws'. > > */ > > if (cur == 0) > > - return true; > > + return !waitqueue_active(&ws->wait); > > If it's 0, that is supposed to mean another thread is about to make it not zero > as well as increment the wakestate index. That should be happening after patch > 48c033314f37 was included, at least. I believe that the thread about to make wait_cnt not zero (and increment the wakestate index) is precisely this interrupted thread: the backtrace shows that it had just done its wakeups, so has not yet reached making wait_cnt not zero; and I suppose that either its wakeups did not empty the waitqueue completely, or another waiter got added as soon as it dropped the spinlock. > > Prior to 4acb83417cad, the code would also return 'true' if the count was > already zero, and this batched code wasn't supposed to behave any different in > that condition. In principle yes, but in practice no. Prior to 4acb83417cad, the swapping load would run okayish for a few minutes, before freezing up mysteriously (presumably due to missed wakeups). The "ish" in "okayish" because the system time was abnormally high, and occasionally there was an odd message from systemd about killing its journal or something - 4acb83417cad saved me from having to investigate that further. Prior to 4acb83417cad, it never locked up looping on wait_cnt < 0; after 4acb83417cad, it would lock up on wait_cnt 0 in a few seconds. But in writing that, and remembering the abnormal systime, I begin to suspect that it might have often been in a tight loop on wait_cnt < 0, but the batch accounting sufficiently wrong that it always got rescued by an unrelated wakeup (shifting wakestate index), before any lockup ever got observed and reported. Or something like that. (And I'm trying to avoid making a fool of myself with the arithmetic: how quickly would wait_cnt 0 have got decremented to positive before?) I won't mind Jens deleting that "Fixes: 4acb83417cad" if it's unfair. > > Anyway, I don't think the waitqueue_active criteria of the current waitstate is > correct either. The next waitstate may have waiters too, so we still may need > to account for this batch's count in order to wake them. I cannot usefully comment on that, it's all rather too subtle for me. But I did wonder if each of those !waitqueue_active()s would be better replaced just by "false"s: we only get that far into __sbq_wake_up() if waitqueue_active(), so the !waitqueue_active() case reflects a race: a possible race, yes, but a race that wants precise accounting at a few imprecise locations? Hugh
On Mon 19-09-22 16:01:39, Hugh Dickins wrote: > On Mon, 19 Sep 2022, Keith Busch wrote: > > On Sun, Sep 18, 2022 at 02:10:51PM -0700, Hugh Dickins wrote: > > > I have almost no grasp of all the possible sbitmap races, and their > > > consequences: but using the same !waitqueue_active() check as used > > > elsewhere, fixes the lockup and shows no adverse consequence for me. > > > > > > > Fixes: 4acb83417cad ("sbitmap: fix batched wait_cnt accounting") > > > Signed-off-by: Hugh Dickins <hughd@google.com> > > > --- > > > > > > lib/sbitmap.c | 2 +- > > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > > > --- a/lib/sbitmap.c > > > +++ b/lib/sbitmap.c > > > @@ -620,7 +620,7 @@ static bool __sbq_wake_up(struct sbitmap > > > * function again to wakeup a new batch on a different 'ws'. > > > */ > > > if (cur == 0) > > > - return true; > > > + return !waitqueue_active(&ws->wait); > > > > If it's 0, that is supposed to mean another thread is about to make it not zero > > as well as increment the wakestate index. That should be happening after patch > > 48c033314f37 was included, at least. > > I believe that the thread about to make wait_cnt not zero (and increment the > wakestate index) is precisely this interrupted thread: the backtrace shows > that it had just done its wakeups, so has not yet reached making wait_cnt > not zero; and I suppose that either its wakeups did not empty the waitqueue > completely, or another waiter got added as soon as it dropped the spinlock. > > > > > Prior to 4acb83417cad, the code would also return 'true' if the count was > > already zero, and this batched code wasn't supposed to behave any different in > > that condition. > > In principle yes, but in practice no. Prior to 4acb83417cad, the swapping > load would run okayish for a few minutes, before freezing up mysteriously > (presumably due to missed wakeups). The "ish" in "okayish" because the > system time was abnormally high, and occasionally there was an odd message > from systemd about killing its journal or something - 4acb83417cad saved > me from having to investigate that further. > > Prior to 4acb83417cad, it never locked up looping on wait_cnt < 0; > after 4acb83417cad, it would lock up on wait_cnt 0 in a few seconds. > > But in writing that, and remembering the abnormal systime, I begin to > suspect that it might have often been in a tight loop on wait_cnt < 0, > but the batch accounting sufficiently wrong that it always got rescued > by an unrelated wakeup (shifting wakestate index), before any lockup > ever got observed and reported. Or something like that. > > (And I'm trying to avoid making a fool of myself with the arithmetic: > how quickly would wait_cnt 0 have got decremented to positive before?) > > I won't mind Jens deleting that "Fixes: 4acb83417cad" if it's unfair. > > > > > Anyway, I don't think the waitqueue_active criteria of the current waitstate is > > correct either. The next waitstate may have waiters too, so we still may need > > to account for this batch's count in order to wake them. > > I cannot usefully comment on that, it's all rather too subtle for me. > > But I did wonder if each of those !waitqueue_active()s would be better > replaced just by "false"s: we only get that far into __sbq_wake_up() if > waitqueue_active(), so the !waitqueue_active() case reflects a race: > a possible race, yes, but a race that wants precise accounting at a > few imprecise locations? 
I think the code is actually too subtle and complex for anybody to sensibly reason about it (as the continuous stream of bugs demonstrates ;)). That being said, I agree with Keith that your "fix" looks suspicious and it is not how things are expected to work and we can possibly lose wakeups with your change. So we need to understand better how the code can be looping on your system. I'll think more about it tomorrow... Honza
On Wed 21-09-22 18:40:12, Jan Kara wrote: > On Mon 19-09-22 16:01:39, Hugh Dickins wrote: > > On Mon, 19 Sep 2022, Keith Busch wrote: > > > On Sun, Sep 18, 2022 at 02:10:51PM -0700, Hugh Dickins wrote: > > > > I have almost no grasp of all the possible sbitmap races, and their > > > > consequences: but using the same !waitqueue_active() check as used > > > > elsewhere, fixes the lockup and shows no adverse consequence for me. > > > > > > > > > > Fixes: 4acb83417cad ("sbitmap: fix batched wait_cnt accounting") > > > > Signed-off-by: Hugh Dickins <hughd@google.com> > > > > --- > > > > > > > > lib/sbitmap.c | 2 +- > > > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > > > > > --- a/lib/sbitmap.c > > > > +++ b/lib/sbitmap.c > > > > @@ -620,7 +620,7 @@ static bool __sbq_wake_up(struct sbitmap > > > > * function again to wakeup a new batch on a different 'ws'. > > > > */ > > > > if (cur == 0) > > > > - return true; > > > > + return !waitqueue_active(&ws->wait); > > > > > > If it's 0, that is supposed to mean another thread is about to make it not zero > > > as well as increment the wakestate index. That should be happening after patch > > > 48c033314f37 was included, at least. > > > > I believe that the thread about to make wait_cnt not zero (and increment the > > wakestate index) is precisely this interrupted thread: the backtrace shows > > that it had just done its wakeups, so has not yet reached making wait_cnt > > not zero; and I suppose that either its wakeups did not empty the waitqueue > > completely, or another waiter got added as soon as it dropped the spinlock. I was trying to wrap my head around this but I am failing to see how we could have wait_cnt == 0 for long enough to cause any kind of stall let alone a lockup in sbitmap_queue_wake_up() as you describe. I can understand we have: CPU1 CPU2 sbitmap_queue_wake_up() ws = sbq_wake_ptr(sbq); cur = atomic_read(&ws->wait_cnt); do { ... wait_cnt = cur - sub; /* this will be 0 */ } while (!atomic_try_cmpxchg(&ws->wait_cnt, &cur, wait_cnt)); ... /* Gets the same waitqueue */ ws = sbq_wake_ptr(sbq); cur = atomic_read(&ws->wait_cnt); do { if (cur == 0) return true; /* loop */ wake_up_nr(&ws->wait, wake_batch); smp_mb__before_atomic(); sbq_index_atomic_inc(&sbq->wake_index); atomic_set(&ws->wait_cnt, wake_batch); /* This stops looping on CPU2 */ So until CPU1 reaches the atomic_set(), CPU2 can be looping. But how come this takes so long that is causes a hang as you describe? Hum... So either CPU1 takes really long to get to atomic_set(): - can CPU1 get preempted? Likely not at least in the context you show in your message - can CPU1 spend so long in wake_up_nr()? Maybe the waitqueue lock is contended but still... or CPU2 somehow sees cur==0 for longer than it should. The whole sequence executed in a loop on CPU2 does not contain anything that would force CPU2 to refresh its cache and get new ws->wait_cnt value so we are at the mercy of CPU cache coherency mechanisms to stage the write on CPU1 and propagate it to other CPUs. But still I would not expect that to take significantly long. Any other ideas? Honza
On Fri, Sep 23, 2022 at 04:43:03PM +0200, Jan Kara wrote: > On Wed 21-09-22 18:40:12, Jan Kara wrote: > > On Mon 19-09-22 16:01:39, Hugh Dickins wrote: > > > On Mon, 19 Sep 2022, Keith Busch wrote: > > > > On Sun, Sep 18, 2022 at 02:10:51PM -0700, Hugh Dickins wrote: > > > > > I have almost no grasp of all the possible sbitmap races, and their > > > > > consequences: but using the same !waitqueue_active() check as used > > > > > elsewhere, fixes the lockup and shows no adverse consequence for me. > > > > > > > > > > > > > Fixes: 4acb83417cad ("sbitmap: fix batched wait_cnt accounting") > > > > > Signed-off-by: Hugh Dickins <hughd@google.com> > > > > > --- > > > > > > > > > > lib/sbitmap.c | 2 +- > > > > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > > > > > > > --- a/lib/sbitmap.c > > > > > +++ b/lib/sbitmap.c > > > > > @@ -620,7 +620,7 @@ static bool __sbq_wake_up(struct sbitmap > > > > > * function again to wakeup a new batch on a different 'ws'. > > > > > */ > > > > > if (cur == 0) > > > > > - return true; > > > > > + return !waitqueue_active(&ws->wait); > > > > > > > > If it's 0, that is supposed to mean another thread is about to make it not zero > > > > as well as increment the wakestate index. That should be happening after patch > > > > 48c033314f37 was included, at least. > > > > > > I believe that the thread about to make wait_cnt not zero (and increment the > > > wakestate index) is precisely this interrupted thread: the backtrace shows > > > that it had just done its wakeups, so has not yet reached making wait_cnt > > > not zero; and I suppose that either its wakeups did not empty the waitqueue > > > completely, or another waiter got added as soon as it dropped the spinlock. > > I was trying to wrap my head around this but I am failing to see how we > could have wait_cnt == 0 for long enough to cause any kind of stall let > alone a lockup in sbitmap_queue_wake_up() as you describe. I can understand > we have: > > CPU1 CPU2 > sbitmap_queue_wake_up() > ws = sbq_wake_ptr(sbq); > cur = atomic_read(&ws->wait_cnt); > do { > ... > wait_cnt = cur - sub; /* this will be 0 */ > } while (!atomic_try_cmpxchg(&ws->wait_cnt, &cur, wait_cnt)); > ... > /* Gets the same waitqueue */ > ws = sbq_wake_ptr(sbq); > cur = atomic_read(&ws->wait_cnt); > do { > if (cur == 0) > return true; /* loop */ > wake_up_nr(&ws->wait, wake_batch); > smp_mb__before_atomic(); > sbq_index_atomic_inc(&sbq->wake_index); > atomic_set(&ws->wait_cnt, wake_batch); /* This stops looping on CPU2 */ > > So until CPU1 reaches the atomic_set(), CPU2 can be looping. But how come > this takes so long that is causes a hang as you describe? Hum... So either > CPU1 takes really long to get to atomic_set(): > - can CPU1 get preempted? Likely not at least in the context you show in > your message > - can CPU1 spend so long in wake_up_nr()? Maybe the waitqueue lock is > contended but still... > > or CPU2 somehow sees cur==0 for longer than it should. The whole sequence > executed in a loop on CPU2 does not contain anything that would force CPU2 > to refresh its cache and get new ws->wait_cnt value so we are at the mercy > of CPU cache coherency mechanisms to stage the write on CPU1 and propagate > it to other CPUs. But still I would not expect that to take significantly > long. Any other ideas? Thank you for the analysis. I arrived at the same conclusions. If this is a preempt enabled context, and there's just one CPU, I suppose the 2nd task could spin in the while(), blocking the 1st task from resetting the wait_cnt. 
I doubt that's happening though, at least for nvme where we call this function in irq context.
On Fri, 23 Sep 2022, Jan Kara wrote: > On Wed 21-09-22 18:40:12, Jan Kara wrote: > > On Mon 19-09-22 16:01:39, Hugh Dickins wrote: > > > On Mon, 19 Sep 2022, Keith Busch wrote: > > > > On Sun, Sep 18, 2022 at 02:10:51PM -0700, Hugh Dickins wrote: > > > > > I have almost no grasp of all the possible sbitmap races, and their > > > > > consequences: but using the same !waitqueue_active() check as used > > > > > elsewhere, fixes the lockup and shows no adverse consequence for me. > > > > > > > > > > > > > Fixes: 4acb83417cad ("sbitmap: fix batched wait_cnt accounting") > > > > > Signed-off-by: Hugh Dickins <hughd@google.com> > > > > > --- > > > > > > > > > > lib/sbitmap.c | 2 +- > > > > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > > > > > > > --- a/lib/sbitmap.c > > > > > +++ b/lib/sbitmap.c > > > > > @@ -620,7 +620,7 @@ static bool __sbq_wake_up(struct sbitmap > > > > > * function again to wakeup a new batch on a different 'ws'. > > > > > */ > > > > > if (cur == 0) > > > > > - return true; > > > > > + return !waitqueue_active(&ws->wait); > > > > > > > > If it's 0, that is supposed to mean another thread is about to make it not zero > > > > as well as increment the wakestate index. That should be happening after patch > > > > 48c033314f37 was included, at least. > > > > > > I believe that the thread about to make wait_cnt not zero (and increment the > > > wakestate index) is precisely this interrupted thread: the backtrace shows > > > that it had just done its wakeups, so has not yet reached making wait_cnt > > > not zero; and I suppose that either its wakeups did not empty the waitqueue > > > completely, or another waiter got added as soon as it dropped the spinlock. > > I was trying to wrap my head around this but I am failing to see how we > could have wait_cnt == 0 for long enough to cause any kind of stall let > alone a lockup in sbitmap_queue_wake_up() as you describe. I can understand > we have: > > CPU1 CPU2 > sbitmap_queue_wake_up() > ws = sbq_wake_ptr(sbq); > cur = atomic_read(&ws->wait_cnt); > do { > ... > wait_cnt = cur - sub; /* this will be 0 */ > } while (!atomic_try_cmpxchg(&ws->wait_cnt, &cur, wait_cnt)); > ... > /* Gets the same waitqueue */ > ws = sbq_wake_ptr(sbq); > cur = atomic_read(&ws->wait_cnt); > do { > if (cur == 0) > return true; /* loop */ > wake_up_nr(&ws->wait, wake_batch); > smp_mb__before_atomic(); > sbq_index_atomic_inc(&sbq->wake_index); > atomic_set(&ws->wait_cnt, wake_batch); /* This stops looping on CPU2 */ > > So until CPU1 reaches the atomic_set(), CPU2 can be looping. But how come > this takes so long that is causes a hang as you describe? Hum... So either > CPU1 takes really long to get to atomic_set(): > - can CPU1 get preempted? Likely not at least in the context you show in > your message > - can CPU1 spend so long in wake_up_nr()? Maybe the waitqueue lock is > contended but still... > > or CPU2 somehow sees cur==0 for longer than it should. The whole sequence > executed in a loop on CPU2 does not contain anything that would force CPU2 > to refresh its cache and get new ws->wait_cnt value so we are at the mercy > of CPU cache coherency mechanisms to stage the write on CPU1 and propagate > it to other CPUs. But still I would not expect that to take significantly > long. Any other ideas? Rushed reply (hoping) to help your thinking, I'll read you more closely two hours later. 
You're writing of CPU1 and CPU2, but in my case it was just the one CPU interrupted - see again sbitmap_queue_wake_up twice in the backtrace: sbitmap_queue_wake_up < sbitmap_queue_clear < blk_mq_put_tag < __blk_mq_free_request < blk_mq_free_request < __blk_mq_end_request < scsi_end_request < scsi_io_completion < scsi_finish_command < scsi_complete < blk_complete_reqs < blk_done_softirq < __do_softirq < __irq_exit_rcu < irq_exit_rcu < common_interrupt < asm_common_interrupt < _raw_spin_unlock_irqrestore < __wake_up_common_lock < __wake_up < sbitmap_queue_wake_up < sbitmap_queue_clear < blk_mq_put_tag < __blk_mq_free_request < blk_mq_free_request < dd_bio_merge < blk_mq_sched_bio_merge < blk_mq_attempt_bio_merge < blk_mq_submit_bio < __submit_bio < submit_bio_noacct_nocheck < submit_bio_noacct < submit_bio < __swap_writepage < swap_writepage < pageout < shrink_folio_list < evict_folios < lru_gen_shrink_lruvec < shrink_lruvec < shrink_node < do_try_to_free_pages < try_to_free_pages < __alloc_pages_slowpath < __alloc_pages < folio_alloc < vma_alloc_folio < do_anonymous_page < __handle_mm_fault < handle_mm_fault < do_user_addr_fault < exc_page_fault < asm_exc_page_fault And in one case I did study stack contents carefully, confirming the same sbq pointer used in upper and lower sbitmap_queue_wake_up. And on Keith's points: yes, I do have preemption enabled; but no, this machine does not have nvme, and I never hit this on the nvme laptop. Thanks, Hugh
Does the following fix the observation? Rationale being that there's no reason to spin on the current wait state that is already under handling; let subsequent clearings proceed to the next inevitable wait state immediately. --- diff --git a/lib/sbitmap.c b/lib/sbitmap.c index 624fa7f118d1..47bf7882210b 100644 --- a/lib/sbitmap.c +++ b/lib/sbitmap.c @@ -634,6 +634,13 @@ static bool __sbq_wake_up(struct sbitmap_queue *sbq, int *nr) *nr -= sub; + /* + * Increase wake_index before updating wait_cnt, otherwise concurrent + * callers can see valid wait_cnt in old waitqueue, which can cause + * invalid wakeup on the old waitqueue. + */ + sbq_index_atomic_inc(&sbq->wake_index); + /* * When wait_cnt == 0, we have to be particularly careful as we are * responsible to reset wait_cnt regardless whether we've actually @@ -660,13 +667,6 @@ static bool __sbq_wake_up(struct sbitmap_queue *sbq, int *nr) * of atomic_set(). */ smp_mb__before_atomic(); - - /* - * Increase wake_index before updating wait_cnt, otherwise concurrent - * callers can see valid wait_cnt in old waitqueue, which can cause - * invalid wakeup on the old waitqueue. - */ - sbq_index_atomic_inc(&sbq->wake_index); atomic_set(&ws->wait_cnt, wake_batch); return ret || *nr; --
On Fri, 23 Sep 2022, Keith Busch wrote: > Does the following fix the observation? Rational being that there's no reason > to spin on the current wait state that is already under handling; let > subsequent clearings proceed to the next inevitable wait state immediately. It's running fine without lockup so far; but doesn't this change merely narrow the window? If this is interrupted in between atomic_try_cmpxchg() setting wait_cnt to 0 and sbq_index_atomic_inc() advancing wake_index, don't we run the same risk as before, of sbitmap_queue_wake_up() from the interrupt handler getting stuck on that wait_cnt 0? > > --- > diff --git a/lib/sbitmap.c b/lib/sbitmap.c > index 624fa7f118d1..47bf7882210b 100644 > --- a/lib/sbitmap.c > +++ b/lib/sbitmap.c > @@ -634,6 +634,13 @@ static bool __sbq_wake_up(struct sbitmap_queue *sbq, int *nr) > > *nr -= sub; > > + /* > + * Increase wake_index before updating wait_cnt, otherwise concurrent > + * callers can see valid wait_cnt in old waitqueue, which can cause > + * invalid wakeup on the old waitqueue. > + */ > + sbq_index_atomic_inc(&sbq->wake_index); > + > /* > * When wait_cnt == 0, we have to be particularly careful as we are > * responsible to reset wait_cnt regardless whether we've actually > @@ -660,13 +667,6 @@ static bool __sbq_wake_up(struct sbitmap_queue *sbq, int *nr) > * of atomic_set(). > */ > smp_mb__before_atomic(); > - > - /* > - * Increase wake_index before updating wait_cnt, otherwise concurrent > - * callers can see valid wait_cnt in old waitqueue, which can cause > - * invalid wakeup on the old waitqueue. > - */ > - sbq_index_atomic_inc(&sbq->wake_index); > atomic_set(&ws->wait_cnt, wake_batch); > > return ret || *nr; > --
On Fri, 23 Sep 2022, Hugh Dickins wrote: > On Fri, 23 Sep 2022, Keith Busch wrote: > > > Does the following fix the observation? Rational being that there's no reason > > to spin on the current wait state that is already under handling; let > > subsequent clearings proceed to the next inevitable wait state immediately. > > It's running fine without lockup so far; but doesn't this change merely > narrow the window? If this is interrupted in between atomic_try_cmpxchg() > setting wait_cnt to 0 and sbq_index_atomic_inc() advancing wake_index, > don't we run the same risk as before, of sbitmap_queue_wake_up() from > the interrupt handler getting stuck on that wait_cnt 0? Yes, it ran successfully for 50 minutes, then an interrupt came in immediately after the cmpxchg, and it locked up just as before. Easily dealt with by disabling interrupts, no doubt, but I assume it's a badge of honour not to disable interrupts here (except perhaps in waking). Some clever way to make the wait_cnt and wake_index adjustments atomic? Or is this sbitmap_queue_wake_up() interrupting sbitmap_queue_wake_up() just supposed never to happen, the counts preventing it: but some misaccounting letting it happen by mistake? > > > > > --- > > diff --git a/lib/sbitmap.c b/lib/sbitmap.c > > index 624fa7f118d1..47bf7882210b 100644 > > --- a/lib/sbitmap.c > > +++ b/lib/sbitmap.c > > @@ -634,6 +634,13 @@ static bool __sbq_wake_up(struct sbitmap_queue *sbq, int *nr) > > > > *nr -= sub; > > > > + /* > > + * Increase wake_index before updating wait_cnt, otherwise concurrent > > + * callers can see valid wait_cnt in old waitqueue, which can cause > > + * invalid wakeup on the old waitqueue. > > + */ > > + sbq_index_atomic_inc(&sbq->wake_index); > > + > > /* > > * When wait_cnt == 0, we have to be particularly careful as we are > > * responsible to reset wait_cnt regardless whether we've actually > > @@ -660,13 +667,6 @@ static bool __sbq_wake_up(struct sbitmap_queue *sbq, int *nr) > > * of atomic_set(). > > */ > > smp_mb__before_atomic(); > > - > > - /* > > - * Increase wake_index before updating wait_cnt, otherwise concurrent > > - * callers can see valid wait_cnt in old waitqueue, which can cause > > - * invalid wakeup on the old waitqueue. > > - */ > > - sbq_index_atomic_inc(&sbq->wake_index); > > atomic_set(&ws->wait_cnt, wake_batch); > > > > return ret || *nr; > > -- >
On Fri 23-09-22 16:15:29, Hugh Dickins wrote: > On Fri, 23 Sep 2022, Hugh Dickins wrote: > > On Fri, 23 Sep 2022, Keith Busch wrote: > > > > > Does the following fix the observation? Rational being that there's no reason > > > to spin on the current wait state that is already under handling; let > > > subsequent clearings proceed to the next inevitable wait state immediately. > > > > It's running fine without lockup so far; but doesn't this change merely > > narrow the window? If this is interrupted in between atomic_try_cmpxchg() > > setting wait_cnt to 0 and sbq_index_atomic_inc() advancing wake_index, > > don't we run the same risk as before, of sbitmap_queue_wake_up() from > > the interrupt handler getting stuck on that wait_cnt 0? > > Yes, it ran successfully for 50 minutes, then an interrupt came in > immediately after the cmpxchg, and it locked up just as before. > > Easily dealt with by disabling interrupts, no doubt, but I assume it's a > badge of honour not to disable interrupts here (except perhaps in waking). I don't think any magic with sbq_index_atomic_inc() is going to reliably fix this. After all the current waitqueue may be the only one that has active waiters so sbq_wake_ptr() will always end up returning this waitqueue regardless of the current value of sbq->wake_index. Honestly, this whole code needs a serious redesign. I have some simplifications in mind but it will take some thinking and benchmarking so we need some fix for the interim. I was pondering for quite some time about some band aid to the problem you've found but didn't find anything satisfactory. In the end I see two options: 1) Take your patch (as wrong as it is ;). Yes, it can lead to lost wakeups but we were living with those for a relatively long time so probably we can live with them for some longer. 2) Revert Yu Kuai's original fix 040b83fcecfb8 ("sbitmap: fix possible io hung due to lost wakeup") and my fixup 48c033314f37 ("sbitmap: Avoid leaving waitqueue in invalid state in __sbq_wake_up()"). But then Keith would have to redo his batched accounting patches on top. > Some clever way to make the wait_cnt and wake_index adjustments atomic? > > Or is this sbitmap_queue_wake_up() interrupting sbitmap_queue_wake_up() > just supposed never to happen, the counts preventing it: but some > misaccounting letting it happen by mistake? No, I think that is in principle a situation that we have to accommodate. Honza
Hi, 在 2022/09/26 19:44, Jan Kara 写道: > On Fri 23-09-22 16:15:29, Hugh Dickins wrote: >> On Fri, 23 Sep 2022, Hugh Dickins wrote: >>> On Fri, 23 Sep 2022, Keith Busch wrote: >>> >>>> Does the following fix the observation? Rational being that there's no reason >>>> to spin on the current wait state that is already under handling; let >>>> subsequent clearings proceed to the next inevitable wait state immediately. >>> >>> It's running fine without lockup so far; but doesn't this change merely >>> narrow the window? If this is interrupted in between atomic_try_cmpxchg() >>> setting wait_cnt to 0 and sbq_index_atomic_inc() advancing wake_index, >>> don't we run the same risk as before, of sbitmap_queue_wake_up() from >>> the interrupt handler getting stuck on that wait_cnt 0? >> >> Yes, it ran successfully for 50 minutes, then an interrupt came in >> immediately after the cmpxchg, and it locked up just as before. >> >> Easily dealt with by disabling interrupts, no doubt, but I assume it's a >> badge of honour not to disable interrupts here (except perhaps in waking). > > I don't think any magic with sbq_index_atomic_inc() is going to reliably > fix this. After all the current waitqueue may be the only one that has active > waiters so sbq_wake_ptr() will always end up returning this waitqueue > regardless of the current value of sbq->wake_index. > > Honestly, this whole code needs a serious redesign. I have some > simplifications in mind but it will take some thinking and benchmarking so > we need some fix for the interim. I was pondering for quite some time about > some band aid to the problem you've found but didn't find anything > satisfactory. > > In the end I see two options: > > 1) Take your patch (as wrong as it is ;). Yes, it can lead to lost wakeups > but we were living with those for a relatively long time so probably we can > live with them for some longer. > > 2) Revert Yu Kuai's original fix 040b83fcecfb8 ("sbitmap: fix possible io > hung due to lost wakeup") and my fixup 48c033314f37 ("sbitmap: Avoid leaving > waitqueue in invalid state in __sbq_wake_up()"). But then Keith would have > to redo his batched accounting patches on top. > >> Some clever way to make the wait_cnt and wake_index adjustments atomic? I'm thinking about a hacky way to make the update of wake_cnt and wake_index atomic, however, redesign of sbitmap_queue is probably better.
On Mon, 26 Sep 2022, Jan Kara wrote: > On Fri 23-09-22 16:15:29, Hugh Dickins wrote: > > I don't think any magic with sbq_index_atomic_inc() is going to reliably > fix this. After all the current waitqueue may be the only one that has active > waiters so sbq_wake_ptr() will always end up returning this waitqueue > regardless of the current value of sbq->wake_index. > > Honestly, this whole code needs a serious redesign. I was pleased to see you say so, Jan: I do agree. > I have some > simplifications in mind but it will take some thinking and benchmarking I'm definitely not the right person to take it on, and glad if you can. But I did have some thoughts and experiments over the weekend, and would like to throw a couple of suggestions into the pot. One, not a big issue, but I think sbq_index_atomic_inc() is misconceived. It's unhelpful for multiple racers to be adjusting sbq->wake_index, and wake_index = ws - sbq->ws; atomic_cmpxchg(&sbq->wake_index, wake_index, sbq_index_inc(wake_index)); seems to me a better way for __sbq_wake_up() to increment it. Two, and here the depths of my naivete and incomprehension may be on display, but: I get the impression that __sbq_wake_up() is intended to accumulate wake_batch-1 wakeups while doing nothing, then on the wake_batch'th it hopes to do all those wake_batch wakeups. I assume someone in the past has established that that's a safe way to procede here, though it's not obviously safe to me. Now, those !waitqueue_active(&ws->wait) checks are good for catching when the hoped-for procedure has gone so "wrong" that there's actually nothing to be woken on this ws (so proceed to the next); but they give no clue as to when there are some but not enough wakeups done. It is very easy to add a wake_up_nr_return() to kernel/sched/wait.c, which returns the nr_exclusive still not woken (__wake_up_common_lock() merely has to return nr_exclusive itself); and then __sbq_wake_up() can be recalled until wake_batch have been woken (or all queues empty). I do have an experiment running that way: but my testing is much too limited to draw serious conclusions from, and I've already admitted that I may just be misunderstanding the whole thing. But, just maybe, a wake_up_nr_return() might be useful. End of those suggestions. > so > we need some fix for the interim. I was pondering for quite some time about > some band aid to the problem you've found but didn't find anything > satisfactory. > > In the end I see two options: > > 1) Take your patch (as wrong as it is ;). Yes, it can lead to lost wakeups > but we were living with those for a relatively long time so probably we can > live with them for some longer. In getting that experiment above going, I did have to make this change below: and it looks to me now as better than my original patch - since this one does try all SBQ_WAIT_QUEUES before giving up, whereas my first patch immediately gave up on the waitqueue_active !wait_cnt case. --- a/lib/sbitmap.c +++ b/lib/sbitmap.c @@ -587,7 +587,7 @@ static struct sbq_wait_state *sbq_wake_p for (i = 0; i < SBQ_WAIT_QUEUES; i++) { struct sbq_wait_state *ws = &sbq->ws[wake_index]; - if (waitqueue_active(&ws->wait)) { + if (waitqueue_active(&ws->wait) && atomic_read(&ws->wait_cnt)) { if (wake_index != atomic_read(&sbq->wake_index)) atomic_set(&sbq->wake_index, wake_index); return ws; TBH I have not tested this one outside of that experiment: would you prefer this patch to my first one, I test and sign this off and send? 
> > 2) Revert Yu Kuai's original fix 040b83fcecfb8 ("sbitmap: fix possible io > hung due to lost wakeup") and my fixup 48c033314f37 ("sbitmap: Avoid leaving > waitqueue in invalid state in __sbq_wake_up()"). But then Keith would have > to redo his batched accounting patches on top. I know much too little to help make that choice. Hugh
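A rough sketch of the wake_up_nr_return() idea Hugh describes above. This is hypothetical: no such helper exists, and it assumes __wake_up_common_lock() in kernel/sched/wait.c is changed from returning void to returning the leftover nr_exclusive, as the mail suggests:

int __wake_up_return(struct wait_queue_head *wq_head, unsigned int mode,
		     int nr_exclusive, void *key)
{
	/* nr_exclusive minus the exclusive waiters actually woken */
	return __wake_up_common_lock(wq_head, mode, nr_exclusive, 0, key);
}

/* mirroring wake_up_nr() in include/linux/wait.h */
#define wake_up_nr_return(x, nr)	__wake_up_return((x), TASK_NORMAL, (nr), NULL)

__sbq_wake_up() could then call wake_up_nr_return() and retry, on this or the next waitqueue, until the whole batch has been woken or every waitqueue turns out to be empty.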
On Sat, 24 Sep 2022, Hillf Danton wrote: > > I think the lockup can be avoided by > a) either advancing wake_index as early as I can [1], > b) or doing wakeup in case of zero wait_cnt to kill all cases of waitqueue_active(). > > Only for thoughts now. Thanks Hillf: I gave your __sbq_wake_up() patch below several tries, and as far as I could tell, it works just as well as my one-liner. But I don't think it's what we would want to do: doesn't it increment wake_index on every call to __sbq_wake_up()? whereas I thought it was intended to be incremented only after wake_batch calls (thinking in terms of nr 1). I'll not be surprised if your advance-wake_index-earlier idea ends up as a part of the solution: but mainly I agree with Jan that the whole code needs a serious redesign (or perhaps the whole design needs a serious recode). So I didn't give your version more thought. Hugh > > Hillf > > [1] https://lore.kernel.org/lkml/afe5b403-4e37-80fd-643d-79e0876a7047@linux.alibaba.com/ > > +++ b/lib/sbitmap.c > @@ -613,6 +613,16 @@ static bool __sbq_wake_up(struct sbitmap > if (!ws) > return false; > > + do { > + /* open code sbq_index_atomic_inc(&sbq->wake_index) to avoid race */ > + int old = atomic_read(&sbq->wake_index); > + int new = sbq_index_inc(old); > + > + /* try another ws if someone else takes care of this one */ > + if (old != atomic_cmpxchg(&sbq->wake_index, old, new)) > + return true; > + } while (0); > + > cur = atomic_read(&ws->wait_cnt); > do { > /* > @@ -620,7 +630,7 @@ static bool __sbq_wake_up(struct sbitmap > * function again to wakeup a new batch on a different 'ws'. > */ > if (cur == 0) > - return true; > + goto out; > sub = min(*nr, cur); > wait_cnt = cur - sub; > } while (!atomic_try_cmpxchg(&ws->wait_cnt, &cur, wait_cnt)); > @@ -634,6 +644,7 @@ static bool __sbq_wake_up(struct sbitmap > > *nr -= sub; > > +out: > /* > * When wait_cnt == 0, we have to be particularly careful as we are > * responsible to reset wait_cnt regardless whether we've actually > @@ -661,12 +672,6 @@ static bool __sbq_wake_up(struct sbitmap > */ > smp_mb__before_atomic(); > > - /* > - * Increase wake_index before updating wait_cnt, otherwise concurrent > - * callers can see valid wait_cnt in old waitqueue, which can cause > - * invalid wakeup on the old waitqueue. > - */ > - sbq_index_atomic_inc(&sbq->wake_index); > atomic_set(&ws->wait_cnt, wake_batch); > > return ret || *nr;
On Mon 26-09-22 20:39:03, Hugh Dickins wrote: > On Mon, 26 Sep 2022, Jan Kara wrote: > > On Fri 23-09-22 16:15:29, Hugh Dickins wrote: > > > > I don't think any magic with sbq_index_atomic_inc() is going to reliably > > fix this. After all the current waitqueue may be the only one that has active > > waiters so sbq_wake_ptr() will always end up returning this waitqueue > > regardless of the current value of sbq->wake_index. > > > > Honestly, this whole code needs a serious redesign. > > I was pleased to see you say so, Jan: I do agree. > > > I have some > > simplifications in mind but it will take some thinking and benchmarking > > I'm definitely not the right person to take it on, and glad if you can. > But I did have some thoughts and experiments over the weekend, and would > like to throw a couple of suggestions into the pot. > > One, not a big issue, but I think sbq_index_atomic_inc() is misconceived. > It's unhelpful for multiple racers to be adjusting sbq->wake_index, and > wake_index = ws - sbq->ws; > atomic_cmpxchg(&sbq->wake_index, wake_index, sbq_index_inc(wake_index)); > seems to me a better way for __sbq_wake_up() to increment it. So my thinking was that instead of having multiple counters, we'd have just two - one counting completions and the other one counting wakeups and if completions - wakeups > batch, we search for waiters in the wait queues, wake them up so that 'wakeups' counter catches up. That also kind of alleviates the 'wake_index' issue because racing updates to it will lead to reordering of wakeups but not to lost wakeups, retries, or anything. I also agree with your wake_up_nr_return() idea below, that is part of this solution (reliably waking given number of waiters) and in fact I have already coded that yesterday while thinking about the problem ;) > Two, and here the depths of my naivete and incomprehension may be on > display, but: I get the impression that __sbq_wake_up() is intended > to accumulate wake_batch-1 wakeups while doing nothing, then on the > wake_batch'th it hopes to do all those wake_batch wakeups. Correct. That is the idea of this code as far as I understand it as well (but bear in mind that I'm digging into this code for only about 50 days longer than you ;). > I assume someone in the past has established that that's a safe way to > procede here, though it's not obviously safe to me. Yes, it is not obvious and even not universally safe. wake_batch has to be suitably tuned based on the number of available bits in the bitmap to avoid deadlocks (freeing of a bit is what generates a wakeup). For numbers of bits smaller than the number of waitqueues we have, we are using wake_batch == 1, which is obviously safe. As the number of bits grows larger we can set wake batch as wake_batch =~ bits/waitqueues. That makes sure all the waitqueues will be woken up and because new waiters are added only when all bits are used, this even makes sure all waiters will eventually wake up as the bits get used / freed. I won't comment on fairness, that has apparently not been a design consideration. > Now, those !waitqueue_active(&ws->wait) checks are good for catching > when the hoped-for procedure has gone so "wrong" that there's actually > nothing to be woken on this ws (so proceed to the next); but they give > no clue as to when there are some but not enough wakeups done. 
> > It is very easy to add a wake_up_nr_return() to kernel/sched/wait.c, > which returns the nr_exclusive still not woken (__wake_up_common_lock() > merely has to return nr_exclusive itself); and then __sbq_wake_up() can > be recalled until wake_batch have been woken (or all queues empty). Fully agreed about this. I'd just note that the waitqueue_active() checks are enough for the above reasoning to guarantee waking all the wait queues in principle. They just break down if multiple processes want to wakeup on the same waitqueue because of TTCTTU races. And that was the last straw that made me go with wake_up_nr_return() as you describe it. > I do have an experiment running that way: but my testing is much too > limited to draw serious conclusions from, and I've already admitted > that I may just be misunderstanding the whole thing. But, just maybe, > a wake_up_nr_return() might be useful. End of those suggestions. > > > so > > we need some fix for the interim. I was pondering for quite some time about > > some band aid to the problem you've found but didn't find anything > > satisfactory. > > > > In the end I see two options: > > > > 1) Take your patch (as wrong as it is ;). Yes, it can lead to lost wakeups > > but we were living with those for a relatively long time so probably we can > > live with them for some longer. > > In getting that experiment above going, I did have to make this change > below: and it looks to me now as better than my original patch - since > this one does try all SBQ_WAIT_QUEUES before giving up, whereas my first > patch immediately gave up on the waitqueue_active !wait_cnt case. > > --- a/lib/sbitmap.c > +++ b/lib/sbitmap.c > @@ -587,7 +587,7 @@ static struct sbq_wait_state *sbq_wake_p > for (i = 0; i < SBQ_WAIT_QUEUES; i++) { > struct sbq_wait_state *ws = &sbq->ws[wake_index]; > > - if (waitqueue_active(&ws->wait)) { > + if (waitqueue_active(&ws->wait) && atomic_read(&ws->wait_cnt)) { > if (wake_index != atomic_read(&sbq->wake_index)) > atomic_set(&sbq->wake_index, wake_index); > return ws; > > TBH I have not tested this one outside of that experiment: would you > prefer this patch to my first one, I test and sign this off and send? Yes, actually this is an elegant solution. It has the same inherent raciness as your waitqueue_active() patch so wakeups could be lost even though some waiters need them but that seems pretty unlikely. So yes, if you can submit this, I guess this is a good band aid for the coming merge window. > > 2) Revert Yu Kuai's original fix 040b83fcecfb8 ("sbitmap: fix possible io > > hung due to lost wakeup") and my fixup 48c033314f37 ("sbitmap: Avoid leaving > > waitqueue in invalid state in __sbq_wake_up()"). But then Keith would have > > to redo his batched accounting patches on top. > > I know much too little to help make that choice. Yeah, I guess it is Jens' call in the end. I'm fine with both options. Honza
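A rough sketch of the two-counter idea Jan outlines above, for illustration only: the completion_cnt and wakeup_cnt fields are invented names, the waitqueue scan is elided, and the real redesign may look different.

static void sbq_wake_batched(struct sbitmap_queue *sbq)
{
	unsigned int wake_batch = READ_ONCE(sbq->wake_batch);
	unsigned int done, woken;

	done = atomic_inc_return(&sbq->completion_cnt);	/* one freed bit */
	woken = atomic_read(&sbq->wakeup_cnt);

	/* nothing to do until completions lead issued wakeups by a batch */
	if (done - woken < wake_batch)
		return;

	/* let one caller claim the batch, so racers can at worst reorder wakeups */
	if (!atomic_try_cmpxchg(&sbq->wakeup_cnt, &woken, woken + wake_batch))
		return;

	/* ... find a waitqueue with waiters and wake_up_nr(wake_batch) on it ... */
}

Because wake_index would then only influence which waiters get woken, not whether the accounting balances, racing updates to it could reorder wakeups but not lose them.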
On Tue, 27 Sep 2022, Jan Kara wrote: > On Mon 26-09-22 20:39:03, Hugh Dickins wrote: > > So my thinking was that instead of having multiple counters, we'd have just > two - one counting completions and the other one counting wakeups and if > completions - wakeups > batch, we search for waiters in the wait queues, > wake them up so that 'wakeups' counter catches up. That also kind of > alleviates the 'wake_index' issue because racing updates to it will lead to > reordering of wakeups but not to lost wakeups, retries, or anything. > > I also agree with your wake_up_nr_return() idea below, that is part of this > solution (reliably waking given number of waiters) and in fact I have > already coded that yesterday while thinking about the problem ;) Great - I'm pleasantly surprised to have been not so far off, and we seem to be much in accord. (What I called wake_up_nr_return() can perfectly well be wake_up_nr() itself: I had just been temporarily avoiding a void to int change in a header file, recompiling the world.) Many thanks for your detailed elucidation of the batch safety, in particular: I won't pretend to have absorbed it completely yet, but it's there in your mail for me and all of us to refer back to. > > TBH I have not tested this one outside of that experiment: would you > > prefer this patch to my first one, I test and sign this off and send? > > Yes, actually this is an elegant solution. It has the same inherent > raciness as your waitqueue_active() patch so wakeups could be lost even > though some waiters need them but that seems pretty unlikely. So yes, if > you can submit this, I guess this is a good band aid for the coming merge > window. No problem in the testing, the v2 patch follows now. > > > > 2) Revert Yu Kuai's original fix 040b83fcecfb8 ("sbitmap: fix possible io > > > hung due to lost wakeup") and my fixup 48c033314f37 ("sbitmap: Avoid leaving > > > waitqueue in invalid state in __sbq_wake_up()"). But then Keith would have > > > to redo his batched accounting patches on top. > > > > I know much too little to help make that choice. > > Yeah, I guess it is Jens' call in the end. I'm fine with both options. > > Honza > -- > Jan Kara <jack@suse.com> > SUSE Labs, CR
Commit 4acb83417cad ("sbitmap: fix batched wait_cnt accounting") is a big improvement: without it, I had to revert to before commit 040b83fcecfb ("sbitmap: fix possible io hung due to lost wakeup") to avoid the high system time and freezes which that had introduced. Now okay on the NVME laptop, but 4acb83417cad is a disaster for heavy swapping (kernel builds in low memory) on another: soon locking up in sbitmap_queue_wake_up() (into which __sbq_wake_up() is inlined), cycling around with waitqueue_active() but wait_cnt 0. Here is a backtrace, showing the common pattern of outer sbitmap_queue_wake_up() interrupted before setting wait_cnt 0 back to wake_batch (in some cases other CPUs are idle, in other cases they're spinning for a lock in dd_bio_merge()): sbitmap_queue_wake_up < sbitmap_queue_clear < blk_mq_put_tag < __blk_mq_free_request < blk_mq_free_request < __blk_mq_end_request < scsi_end_request < scsi_io_completion < scsi_finish_command < scsi_complete < blk_complete_reqs < blk_done_softirq < __do_softirq < __irq_exit_rcu < irq_exit_rcu < common_interrupt < asm_common_interrupt < _raw_spin_unlock_irqrestore < __wake_up_common_lock < __wake_up < sbitmap_queue_wake_up < sbitmap_queue_clear < blk_mq_put_tag < __blk_mq_free_request < blk_mq_free_request < dd_bio_merge < blk_mq_sched_bio_merge < blk_mq_attempt_bio_merge < blk_mq_submit_bio < __submit_bio < submit_bio_noacct_nocheck < submit_bio_noacct < submit_bio < __swap_writepage < swap_writepage < pageout < shrink_folio_list < evict_folios < lru_gen_shrink_lruvec < shrink_lruvec < shrink_node < do_try_to_free_pages < try_to_free_pages < __alloc_pages_slowpath < __alloc_pages < folio_alloc < vma_alloc_folio < do_anonymous_page < __handle_mm_fault < handle_mm_fault < do_user_addr_fault < exc_page_fault < asm_exc_page_fault I have almost no grasp of all the possible sbitmap races, and their consequences: but using the same !waitqueue_active() check as used elsewhere, fixes the lockup and shows no adverse consequence for me. Fixes: 4acb83417cad ("sbitmap: fix batched wait_cnt accounting") Signed-off-by: Hugh Dickins <hughd@google.com> --- lib/sbitmap.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
--- a/lib/sbitmap.c +++ b/lib/sbitmap.c @@ -620,7 +620,7 @@ static bool __sbq_wake_up(struct sbitmap * function again to wakeup a new batch on a different 'ws'. */ if (cur == 0) - return true; + return !waitqueue_active(&ws->wait); sub = min(*nr, cur); wait_cnt = cur - sub; } while (!atomic_try_cmpxchg(&ws->wait_cnt, &cur, wait_cnt));