Message ID | 20240617024556.211451-1-yang.yang@vivo.com (mailing list archive) |
---|---|
State | New, archived |
Series | [v4] sbitmap: fix io hung due to race on sbitmap_word::cleared |
On 2024/6/17 10:45, Yang Yang wrote:
> [v4 patch quoted in full in the original mail; the commit message and
>  diff are reproduced in full at the bottom of this page]

Gentle ping. Thanks.
On Mon, Jun 17, 2024 at 10:45:51AM +0800, Yang Yang wrote:
> [v4 patch quoted in full in the original mail; trimmed here -- the
>  commit message and diff are reproduced at the bottom of this page]
>
> +#if BITS_PER_LONG == 64
> +			word_mask = U64_MAX >> (BITS_PER_LONG - depth);
> +#else
> +			word_mask = U32_MAX >> (BITS_PER_LONG - depth);
> +#endif

Can we avoid the above conditional compiling by the following way?

	word_mask = (~0UL) >> (BITS_PER_LONG - depth);

> +			if (!wrap && alloc_hint)
> +				word_mask &= ~((1UL << alloc_hint) - 1);

The current behavior is to always retry after moving ->cleared to word,
and we change it to retry in case of any free bits. So is this for
avoiding a dead loop by taking wrap & alloc_hint into account? If yes,
this way looks fine, but I'd suggest documenting this change.

Thanks,
Ming
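To double-check the suggested simplification, here is a small illustrative
userspace program (plain C, nothing from the kernel tree; ULONG_MAX stands
in for U64_MAX/U32_MAX). It verifies that the single word-size-independent
expression yields the same mask as the #if variant for every depth from 1
to BITS_PER_LONG. depth == 0 is deliberately excluded, since the patch only
computes the mask under "if (depth > 0)", and a right shift by
BITS_PER_LONG would be undefined.

/*
 * Illustrative-only check, not kernel code.
 */
#include <assert.h>
#include <limits.h>
#include <stdio.h>

#define BITS_PER_LONG ((unsigned int)(sizeof(unsigned long) * CHAR_BIT))

int main(void)
{
	for (unsigned int depth = 1; depth <= BITS_PER_LONG; depth++) {
		/* v4 patch: ULONG_MAX stands in for U64_MAX/U32_MAX */
		unsigned long a = ULONG_MAX >> (BITS_PER_LONG - depth);
		/* suggested replacement, independent of the word size */
		unsigned long b = (~0UL) >> (BITS_PER_LONG - depth);

		assert(a == b);
	}
	/*
	 * depth == 0 is never reached here: the patch guards the mask
	 * computation with "if (depth > 0)", and shifting by
	 * BITS_PER_LONG would be undefined behaviour.
	 */
	printf("masks agree for depth 1..%u\n", BITS_PER_LONG);
	return 0;
}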
On 2024/7/2 11:34, Ming Lei wrote:
> On Mon, Jun 17, 2024 at 10:45:51AM +0800, Yang Yang wrote:
>> [v4 patch quoted in full in the original mail; trimmed here -- the
>>  commit message and diff are reproduced at the bottom of this page]
>>
>> +#if BITS_PER_LONG == 64
>> +			word_mask = U64_MAX >> (BITS_PER_LONG - depth);
>> +#else
>> +			word_mask = U32_MAX >> (BITS_PER_LONG - depth);
>> +#endif
>
> Can we avoid the above conditional compiling by the following way?
>
> 	word_mask = (~0UL) >> (BITS_PER_LONG - depth);

Sure, this looks better.

>> +			if (!wrap && alloc_hint)
>> +				word_mask &= ~((1UL << alloc_hint) - 1);
>
> The current behavior is to always retry after moving ->cleared to word,
> and we change it to retry in case of any free bits. So is this for
> avoiding a dead loop by taking wrap & alloc_hint into account? If yes,
> this way looks fine, but I'd suggest documenting this change.

Yes. Thank you for the suggestion. I will include the following
description in the next version:

	/*
	 * The current behavior is to always retry after moving
	 * ->cleared to word, and we change it to retry in case
	 * of any free bits. To avoid a dead loop, we need to take
	 * wrap & alloc_hint into account. Without this, a soft
	 * lockup was detected in our test environment.
	 */

Thanks.

> Thanks,
> Ming
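The intent of that comment can be made concrete with a small userspace
sketch (the word_has_free_bits() helper and the sample values are made up
for illustration; they are not part of the sbitmap API). When the search
cannot wrap, free bits below alloc_hint are unreachable, so counting them
as "usable" would make the caller retry forever, which is the dead loop /
soft lockup the comment describes.

/*
 * Illustrative-only sketch of the "does ->word still have usable free
 * bits?" test added by this patch, lifted out of the kernel so the
 * wrap/alloc_hint handling is easy to see.
 */
#include <limits.h>
#include <stdbool.h>
#include <stdio.h>

#define BITS_PER_LONG ((unsigned int)(sizeof(unsigned long) * CHAR_BIT))

static bool word_has_free_bits(unsigned long word, unsigned int depth,
			       unsigned int alloc_hint, bool wrap)
{
	unsigned long word_mask = (~0UL) >> (BITS_PER_LONG - depth);

	/*
	 * Without wrap, the search never revisits bits below alloc_hint,
	 * so free bits there must not count as "usable"; otherwise the
	 * caller would retry forever without ever finding them.
	 */
	if (!wrap && alloc_hint)
		word_mask &= ~((1UL << alloc_hint) - 1);

	return (word & word_mask) != word_mask;
}

int main(void)
{
	/* depth = 8: bits 0-2 are free, bits 3-7 are allocated */
	unsigned long word = 0xF8;

	/* search started at bit 3 and may not wrap: no retry, bits 0-2 unreachable */
	printf("wrap=0, hint=3: %d\n", word_has_free_bits(word, 8, 3, false));
	/* search may wrap back to bit 0: retry, the free bits are reachable */
	printf("wrap=1, hint=3: %d\n", word_has_free_bits(word, 8, 3, true));
	return 0;
}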
diff --git a/include/linux/sbitmap.h b/include/linux/sbitmap.h
index d662cf136021..ec0b0e73c906 100644
--- a/include/linux/sbitmap.h
+++ b/include/linux/sbitmap.h
@@ -36,6 +36,11 @@ struct sbitmap_word {
 	 * @cleared: word holding cleared bits
 	 */
 	unsigned long cleared ____cacheline_aligned_in_smp;
+
+	/**
+	 * @swap_lock: Held while swapping word <-> cleared
+	 */
+	spinlock_t swap_lock;
 } ____cacheline_aligned_in_smp;
 
 /**
diff --git a/lib/sbitmap.c b/lib/sbitmap.c
index 1e453f825c05..9bd85a9b74b9 100644
--- a/lib/sbitmap.c
+++ b/lib/sbitmap.c
@@ -60,12 +60,32 @@ static inline void update_alloc_hint_after_get(struct sbitmap *sb,
 /*
  * See if we have deferred clears that we can batch move
  */
-static inline bool sbitmap_deferred_clear(struct sbitmap_word *map)
-{
-	unsigned long mask;
+static inline bool sbitmap_deferred_clear(struct sbitmap_word *map,
+		unsigned int depth, unsigned int alloc_hint, bool wrap)
+{
+	unsigned long mask, flags, word_mask;
+	bool ret = false;
+
+	spin_lock_irqsave(&map->swap_lock, flags);
+
+	if (!map->cleared) {
+		if (depth > 0) {
+#if BITS_PER_LONG == 64
+			word_mask = U64_MAX >> (BITS_PER_LONG - depth);
+#else
+			word_mask = U32_MAX >> (BITS_PER_LONG - depth);
+#endif
+			if (!wrap && alloc_hint)
+				word_mask &= ~((1UL << alloc_hint) - 1);
+
+			if ((READ_ONCE(map->word) & word_mask) == word_mask)
+				ret = false;
+			else
+				ret = true;
+		}
 
-	if (!READ_ONCE(map->cleared))
-		return false;
+		goto out_unlock;
+	}
 
 	/*
 	 * First get a stable cleared mask, setting the old mask to 0.
@@ -77,7 +97,10 @@ static inline bool sbitmap_deferred_clear(struct sbitmap_word *map)
 	 */
 	atomic_long_andnot(mask, (atomic_long_t *)&map->word);
 	BUILD_BUG_ON(sizeof(atomic_long_t) != sizeof(map->word));
-	return true;
+	ret = true;
+out_unlock:
+	spin_unlock_irqrestore(&map->swap_lock, flags);
+	return ret;
 }
 
 int sbitmap_init_node(struct sbitmap *sb, unsigned int depth, int shift,
@@ -85,6 +108,7 @@ int sbitmap_init_node(struct sbitmap *sb, unsigned int depth, int shift,
 		      bool alloc_hint)
 {
 	unsigned int bits_per_word;
+	int i;
 
 	if (shift < 0)
 		shift = sbitmap_calculate_shift(depth);
@@ -116,6 +140,9 @@ int sbitmap_init_node(struct sbitmap *sb, unsigned int depth, int shift,
 		return -ENOMEM;
 	}
 
+	for (i = 0; i < sb->map_nr; i++)
+		spin_lock_init(&sb->map[i].swap_lock);
+
 	return 0;
 }
 EXPORT_SYMBOL_GPL(sbitmap_init_node);
@@ -126,7 +153,7 @@ void sbitmap_resize(struct sbitmap *sb, unsigned int depth)
 	unsigned int i;
 
 	for (i = 0; i < sb->map_nr; i++)
-		sbitmap_deferred_clear(&sb->map[i]);
+		sbitmap_deferred_clear(&sb->map[i], 0, 0, 0);
 
 	sb->depth = depth;
 	sb->map_nr = DIV_ROUND_UP(sb->depth, bits_per_word);
@@ -179,7 +206,7 @@ static int sbitmap_find_bit_in_word(struct sbitmap_word *map,
 						alloc_hint, wrap);
 		if (nr != -1)
 			break;
-		if (!sbitmap_deferred_clear(map))
+		if (!sbitmap_deferred_clear(map, depth, alloc_hint, wrap))
 			break;
 	} while (1);
 
@@ -496,7 +523,7 @@ unsigned long __sbitmap_queue_get_batch(struct sbitmap_queue *sbq, int nr_tags,
 		unsigned int map_depth = __map_depth(sb, index);
 		unsigned long val;
 
-		sbitmap_deferred_clear(map);
+		sbitmap_deferred_clear(map, 0, 0, 0);
 		val = READ_ONCE(map->word);
 		if (val == (1UL << (map_depth - 1)) - 1)
 			goto next;
Configuration for sbq:
  depth=64, wake_batch=6, shift=6, map_nr=1

1. There are 64 requests in progress:
   map->word = 0xFFFFFFFFFFFFFFFF
2. After all the 64 requests complete, and no more requests come:
   map->word = 0xFFFFFFFFFFFFFFFF, map->cleared = 0xFFFFFFFFFFFFFFFF
3. Now two tasks try to allocate requests:
   T1:                                      T2:
   __blk_mq_get_tag                         .
   __sbitmap_queue_get                      .
   sbitmap_get                              .
   sbitmap_find_bit                         .
   sbitmap_find_bit_in_word                 .
   __sbitmap_get_word -> nr=-1              __blk_mq_get_tag
   sbitmap_deferred_clear                   __sbitmap_queue_get
   /* map->cleared=0xFFFFFFFFFFFFFFFF */    sbitmap_find_bit
     if (!READ_ONCE(map->cleared))          sbitmap_find_bit_in_word
       return false;                        __sbitmap_get_word -> nr=-1
     mask = xchg(&map->cleared, 0)          sbitmap_deferred_clear
     atomic_long_andnot()                   /* map->cleared=0 */
                                              if (!(map->cleared))
                                                return false;
                                            /*
                                             * map->cleared is cleared by T1
                                             * T2 fails to acquire the tag
                                             */
4. T2 is the sole tag waiter. When T1 puts the tag, T2 cannot be woken
   up because wake_batch is set to 6. If no more requests come, T2 will
   wait for a tag indefinitely.

This patch achieves two purposes:
1. The check on ->cleared and the update of both ->cleared and ->word
   need to be done atomically, and using a spinlock is the simplest
   solution. So revert commit 661d4f55a794 ("sbitmap: remove
   swap_lock"), which may cause this race.
2. Add an extra check in sbitmap_deferred_clear() to identify whether
   ->word still has free bits.

Fixes: 661d4f55a794 ("sbitmap: remove swap_lock")
Signed-off-by: Yang Yang <yang.yang@vivo.com>

---
Changes from v3:
  - Add more arguments to sbitmap_deferred_clear(); callers that don't
    care about the return value just pass 0
  - Consider the situation when using sbitmap_get_shallow()
  - Consider the situation when ->round_robin is true
  - Modify commit message
Changes from v2:
  - Modify commit message by suggestion
  - Add extra check in sbitmap_deferred_clear() by suggestion
Changes from v1:
  - Simply revert commit 661d4f55a794 ("sbitmap: remove swap_lock")
---
 include/linux/sbitmap.h |  5 +++++
 lib/sbitmap.c           | 45 ++++++++++++++++++++++++++++++++---------
 2 files changed, 41 insertions(+), 9 deletions(-)
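For readers who prefer the new control flow without diff markup, here is a
hedged userspace model of the patched sbitmap_deferred_clear() (a pthread
mutex stands in for the kernel spinlock, C11 atomics for the kernel atomic
helpers, and the mask computation already uses the simplification discussed
in the review above). It is a sketch of the two ideas, serializing the
->cleared check with the ->cleared/->word swap and reporting whether a
retry can still succeed, not a drop-in replacement for the kernel code. The
short main() replays the T1/T2 scenario from the commit message
sequentially: the second caller still gets a retry even though ->cleared
has already been consumed.

/* Compile with: cc -pthread model.c */
#include <limits.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

#define BITS_PER_LONG ((unsigned int)(sizeof(unsigned long) * CHAR_BIT))

struct map_word {
	_Atomic unsigned long word;	/* allocated bits */
	_Atomic unsigned long cleared;	/* deferred-cleared bits */
	pthread_mutex_t swap_lock;	/* serializes check and swap */
};

static bool deferred_clear(struct map_word *map, unsigned int depth,
			   unsigned int alloc_hint, bool wrap)
{
	unsigned long mask, word_mask;
	bool ret = false;

	pthread_mutex_lock(&map->swap_lock);

	if (!atomic_load(&map->cleared)) {
		/*
		 * Nothing deferred: report whether ->word still has free
		 * bits the caller could actually reach, so it knows
		 * whether retrying the search makes sense.
		 */
		if (depth > 0) {
			word_mask = (~0UL) >> (BITS_PER_LONG - depth);
			if (!wrap && alloc_hint)
				word_mask &= ~((1UL << alloc_hint) - 1);
			ret = (atomic_load(&map->word) & word_mask) != word_mask;
		}
		goto out_unlock;
	}

	/* Move the deferred clears back into ->word, then retry. */
	mask = atomic_exchange(&map->cleared, 0);
	atomic_fetch_and(&map->word, ~mask);
	ret = true;

out_unlock:
	pthread_mutex_unlock(&map->swap_lock);
	return ret;
}

int main(void)
{
	static struct map_word m = {
		.word = ~0UL,
		.cleared = ~0UL,
		.swap_lock = PTHREAD_MUTEX_INITIALIZER,
	};

	/* "T1": moves ->cleared into ->word and is told to retry */
	printf("first  call: %d\n", deferred_clear(&m, BITS_PER_LONG, 0, true));
	/* "T2": ->cleared is already 0, but ->word now has free bits,
	 * so it is still told to retry instead of giving up */
	printf("second call: %d\n", deferred_clear(&m, BITS_PER_LONG, 0, true));
	return 0;
}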