[v6] sbitmap: fix io hung due to race on sbitmap_word::cleared

Message ID 20240710065616.1060803-1-yang.yang@vivo.com (mailing list archive)
State New, archived
Series [v6] sbitmap: fix io hung due to race on sbitmap_word::cleared

Commit Message

YangYang July 10, 2024, 6:56 a.m. UTC
Configuration for sbq:
  depth=64, wake_batch=6, shift=6, map_nr=1

1. There are 64 requests in progress:
  map->word = 0xFFFFFFFFFFFFFFFF
2. After all the 64 requests complete, and no more requests come:
  map->word = 0xFFFFFFFFFFFFFFFF, map->cleared = 0xFFFFFFFFFFFFFFFF
3. Now two tasks try to allocate requests:
  T1:                                       T2:
  __blk_mq_get_tag                          .
  __sbitmap_queue_get                       .
  sbitmap_get                               .
  sbitmap_find_bit                          .
  sbitmap_find_bit_in_word                  .
  __sbitmap_get_word  -> nr=-1              __blk_mq_get_tag
  sbitmap_deferred_clear                    __sbitmap_queue_get
  /* map->cleared=0xFFFFFFFFFFFFFFFF */     sbitmap_find_bit
    if (!READ_ONCE(map->cleared))           sbitmap_find_bit_in_word
      return false;                         __sbitmap_get_word -> nr=-1
    mask = xchg(&map->cleared, 0)           sbitmap_deferred_clear
    atomic_long_andnot()                    /* map->cleared=0 */
                                              if (!(map->cleared))
                                                return false;
                                     /*
                                      * map->cleared is cleared by T1
                                      * T2 fail to acquire the tag
                                      */

4. T2 is the sole tag waiter. When T1 puts the tag, T2 cannot be woken
up because wake_batch is set to 6. If no more requests come, T2 will
wait here indefinitely.

This patch achieves two purposes:
1. The check on ->cleared and the update of both ->cleared and ->word
need to be done atomically; a spinlock is the simplest way to do that.
2. Add an extra check in sbitmap_deferred_clear() to identify whether
->word still has free bits.

Fixes: ea86ea2cdced ("sbitmap: ammortize cost of clearing bits")
Signed-off-by: Yang Yang <yang.yang@vivo.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>

---
Changes from v5:
  - Modify commit message
  - Change the fixes tag
Changes from v4:
  - Add some comments according to suggestion
Changes from v3:
  - Add more arguments to sbitmap_deferred_clear(); callers that don't
    care about the return value just pass 0
  - Consider the situation when using sbitmap_get_shallow()
  - Consider the situation when ->round_robin is true
  - Modify commit message
Changes from v2:
  - Modify commit message by suggestion
  - Add extra check in sbitmap_deferred_clear() by suggestion
Changes from v1:
  - simply revert commit 661d4f55a794 ("sbitmap: remove swap_lock")
---
 include/linux/sbitmap.h |  5 +++++
 lib/sbitmap.c           | 46 ++++++++++++++++++++++++++++++++++-------
 2 files changed, 43 insertions(+), 8 deletions(-)

Comments

Bart Van Assche July 10, 2024, 4:41 p.m. UTC | #1
On 7/9/24 11:56 PM, Yang Yang wrote:
> +	spin_lock_irqsave(&map->swap_lock, flags);

Please use guard(spinlock_irqsave) in new code instead of 
spin_lock_irqsave() + goto out_unlock + spin_unlock_irqrestore().
That will make this function significantly easier to read and to
maintain.
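
For reference, a minimal sketch of the shape this suggests, assuming the
guard() helpers from <linux/cleanup.h>. It only restructures the logic of
the posted patch; the out_unlock label disappears because the lock is
dropped automatically on every return path:

static inline bool sbitmap_deferred_clear(struct sbitmap_word *map,
		unsigned int depth, unsigned int alloc_hint, bool wrap)
{
	unsigned long mask, word_mask;

	/* Released automatically when the function returns. */
	guard(spinlock_irqsave)(&map->swap_lock);

	if (!map->cleared) {
		if (depth == 0)
			return false;

		word_mask = (~0UL) >> (BITS_PER_LONG - depth);
		if (!wrap && alloc_hint)
			word_mask &= ~((1UL << alloc_hint) - 1);

		/* Retry only if ->word still has a free bit in range. */
		return (READ_ONCE(map->word) & word_mask) != word_mask;
	}

	/* Get a stable cleared mask, then drop those bits from ->word. */
	mask = xchg(&map->cleared, 0);
	atomic_long_andnot(mask, (atomic_long_t *)&map->word);
	BUILD_BUG_ON(sizeof(atomic_long_t) != sizeof(map->word));
	return true;
}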

> +
> +	if (!map->cleared) {
> +		if (depth > 0) {
> +			word_mask = (~0UL) >> (BITS_PER_LONG - depth);
> +			/*
> +			 * The current behavior is to always retry after moving
> +			 * ->cleared to word, and we change it to retry in case
> +			 * of any free bits. To avoid dead loop, we need to take

What is a "dead loop"? Did you perhaps want to write "infinite loop"?

> +			 * wrap & alloc_hint into account. Without this, a soft
> +			 * lockup was detected in our test environment.

Source code comments should not refer to "our test environment". Code
that is intended for upstream inclusion should work for all setups.

> +			 */
> +			if (!wrap && alloc_hint)
> +				word_mask &= ~((1UL << alloc_hint) - 1);

Above I see an open-coded __clear_bit() operation. Has it been
considered to use __clear_bit() instead of open-coding it?

Thanks,

Bart.
Bart Van Assche July 10, 2024, 7:54 p.m. UTC | #2
On 7/9/24 11:56 PM, Yang Yang wrote:
> +	/**
> +	 * @swap_lock: Held while swapping word <-> cleared
> +	 */
> +	spinlock_t swap_lock;

Why is only swapping 'word' with 'cleared' protected by the spinlock?
If all 'cleared' changes would be protected by this spinlock then
that would allow to eliminate the expensive xchg() call from
sbitmap_deferred_clear().
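
For illustration only, one hypothetical shape of what this would mean on
the writer side: sbitmap_deferred_clear_bit() taking swap_lock as well, so
that sbitmap_deferred_clear() could read and zero ->cleared with plain
accesses instead of xchg(). This is not part of the posted patch:

static inline void sbitmap_deferred_clear_bit(struct sbitmap *sb, unsigned int bitnr)
{
	struct sbitmap_word *map = &sb->map[SB_NR_TO_INDEX(sb, bitnr)];
	unsigned long flags;

	/* Every ->cleared writer now contends on swap_lock. */
	spin_lock_irqsave(&map->swap_lock, flags);
	map->cleared |= 1UL << SB_NR_TO_BIT(sb, bitnr);
	spin_unlock_irqrestore(&map->swap_lock, flags);
}

Whether the extra locking in the hot clear path is worth removing the
xchg() from the slow path is the trade-off weighed in the replies.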

Thanks,

Bart.
YangYang July 11, 2024, 9:47 a.m. UTC | #3
On 2024/7/11 0:41, Bart Van Assche wrote:
> On 7/9/24 11:56 PM, Yang Yang wrote:
>> +    spin_lock_irqsave(&map->swap_lock, flags);
> 
> Please use guard(spinlock_irqsave) in new code instead of spin_lock_irqsave() + goto out_unlock 
> + spin_unlock_irqrestore().
> That will make this function significantly easier to read and to
> maintain.

Got it.

> 
>> +
>> +    if (!map->cleared) {
>> +        if (depth > 0) {
>> +            word_mask = (~0UL) >> (BITS_PER_LONG - depth);
>> +            /*
>> +             * The current behavior is to always retry after moving
>> +             * ->cleared to word, and we change it to retry in case
>> +             * of any free bits. To avoid dead loop, we need to take
> 
> What is a "dead loop"? Did you perhaps want to write "infinite loop"?

Yeah. I suppose so.

> 
>> +             * wrap & alloc_hint into account. Without this, a soft
>> +             * lockup was detected in our test environment.
> 
> Source code comments should not refer to "our test environment". Code
> that is intended for upstream inclusion should work for all setups.

Got it.

> 
>> +             */
>> +            if (!wrap && alloc_hint)
>> +                word_mask &= ~((1UL << alloc_hint) - 1);
> 
> Above I see an open-coded __clear_bit() operation. Has it been
> considered to use __clear_bit() instead of open-coding it?

It is meant to reset multiple bits to zero, but __clear_bit() is only
capable of resetting one bit to zero.
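
A two-line illustration of the difference, using alloc_hint as the example
index:

	word_mask &= ~((1UL << alloc_hint) - 1);  /* clears all bits below alloc_hint at once */
	__clear_bit(alloc_hint, &word_mask);      /* clears a single bit per call */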

Thanks.

> 
> Thanks,
> 
> Bart.
YangYang July 11, 2024, 12:48 p.m. UTC | #4
On 2024/7/11 3:54, Bart Van Assche wrote:
> On 7/9/24 11:56 PM, Yang Yang wrote:
>> +    /**
>> +     * @swap_lock: Held while swapping word <-> cleared
>> +     */
>> +    spinlock_t swap_lock;
> 
> Why is only swapping 'word' with 'cleared' protected by the spinlock?
> If all 'cleared' changes would be protected by this spinlock then
> that would allow to eliminate the expensive xchg() call from
> sbitmap_deferred_clear().

The spinlock was initially introduced in ea86ea2cdced ("sbitmap:
ammortize cost of clearing bits").
I think if all 'cleared' changes are protected by the spinlock, the
overhead of clearing tags would definitely increase.

Thanks.

> 
> Thanks,
> 
> Bart.
Ming Lei July 11, 2024, 1:39 p.m. UTC | #5
On Thu, Jul 11, 2024 at 08:48:03PM +0800, YangYang wrote:
> On 2024/7/11 3:54, Bart Van Assche wrote:
> > On 7/9/24 11:56 PM, Yang Yang wrote:
> > > +    /**
> > > +     * @swap_lock: Held while swapping word <-> cleared
> > > +     */
> > > +    spinlock_t swap_lock;
> > 
> > Why is only swapping 'word' with 'cleared' protected by the spinlock?
> > If all 'cleared' changes would be protected by this spinlock then
> > that would allow to eliminate the expensive xchg() call from
> > sbitmap_deferred_clear().
> 
> The spinlock was initially introduced in ea86ea2cdced ("sbitmap:
> ammortize cost of clearing bits").
> I think if all 'cleared' changes are protected by the spinlock, the
> overhead of clearing tags would definitely increase.

There are only two WRITEs on 'cleared':

- xchg(&map->cleared, 0) in sbitmap_deferred_clear()

- set_bit() in sbitmap_deferred_clear_bit()

xchg() is supposed to provide such protection already.

Thanks,
Ming
Bart Van Assche July 11, 2024, 4:33 p.m. UTC | #6
On 7/11/24 6:39 AM, Ming Lei wrote:
> There are only two WRITEs on 'cleared':
> 
> - xchg(&map->cleared, 0) in sbitmap_deferred_clear()
> 
> - set_bit() in sbitmap_deferred_clear_bit()
> 
> xchg() is supposed to provide such protection already.

Hi Ming,

The comment above 'swap_lock' in this patch is as follows:

	/**
	 * @swap_lock: Held while swapping word <-> cleared
	 */

In other words, 'swap_lock' is used to serialize *code*. Using
synchronization objects to serialize code is known as an anti-pattern,
something that shouldn't be done. Synchronization objects should be used
to serialize access to data. Hence my question whether it would be
appropriate to protect all 'cleared' changes with the newly introduced
spinlock.

Thanks,

Bart.
Ming Lei July 12, 2024, 1:01 a.m. UTC | #7
On Fri, Jul 12, 2024 at 12:33 AM Bart Van Assche <bvanassche@acm.org> wrote:
>
> On 7/11/24 6:39 AM, Ming Lei wrote:
> > There are only two WRITE on 'cleared':
> >
> > - xchg(&map->cleared, 0) in sbitmap_deferred_clear()
> >
> > - set_bit() in sbitmap_deferred_clear_bit()
> >
> > xchg() is supposed to provide such protection already.
>
> Hi Ming,
>
> The comment above 'swap_lock' in this patch is as follows:
>
>         /**
>          * @swap_lock: Held while swapping word <-> cleared
>          */
>
> In other words, 'swap_lock' is used to serialize *code*. Using
> synchronization objects to serialize code is known as an anti-pattern,
> something that shouldn't be done.

> Synchronization objects should be used
> to serialize access to data.

It can be thought of as serializing UPDATEs of both ->word and ->cleared,
instead of code.

> Hence my question whether it would be
> appropriate to protect all 'cleared' changes with the newly introduced
> spinlock.

Wrt. updates of ->cleared alone, xchg() is enough.

Thanks,
Ming
Bart Van Assche July 12, 2024, 5:53 p.m. UTC | #8
On 7/11/24 6:01 PM, Ming Lei wrote:
> On Fri, Jul 12, 2024 at 12:33 AM Bart Van Assche <bvanassche@acm.org> wrote:
>> The comment above 'swap_lock' in this patch is as follows:
>>
>>          /**
>>           * @swap_lock: Held while swapping word <-> cleared
>>           */
>>
>> In other words, 'swap_lock' is used to serialize *code*. Using
>> synchronization objects to serialize code is known as an anti-pattern,
>> something that shouldn't be done.
> 
>> Synchronization objects should be used
>> to serialize access to data.
> 
> It can be thought of as serializing UPDATEs of both ->word and ->cleared,
> instead of code.

It would be appreciated if the comment above swap_lock would be modified
such that it reflects that it serializes simultaneous updates of ->word
and ->cleared.
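
One possible wording, for illustration only:

	/**
	 * @swap_lock: serializes simultaneous updates of ->word and ->cleared
	 */
	spinlock_t swap_lock;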

Thanks,

Bart.

Patch

diff --git a/include/linux/sbitmap.h b/include/linux/sbitmap.h
index d662cf136021..ec0b0e73c906 100644
--- a/include/linux/sbitmap.h
+++ b/include/linux/sbitmap.h
@@ -36,6 +36,11 @@  struct sbitmap_word {
 	 * @cleared: word holding cleared bits
 	 */
 	unsigned long cleared ____cacheline_aligned_in_smp;
+
+	/**
+	 * @swap_lock: Held while swapping word <-> cleared
+	 */
+	spinlock_t swap_lock;
 } ____cacheline_aligned_in_smp;
 
 /**
diff --git a/lib/sbitmap.c b/lib/sbitmap.c
index 1e453f825c05..22d6e86ba87f 100644
--- a/lib/sbitmap.c
+++ b/lib/sbitmap.c
@@ -60,12 +60,35 @@  static inline void update_alloc_hint_after_get(struct sbitmap *sb,
 /*
  * See if we have deferred clears that we can batch move
  */
-static inline bool sbitmap_deferred_clear(struct sbitmap_word *map)
+static inline bool sbitmap_deferred_clear(struct sbitmap_word *map,
+		unsigned int depth, unsigned int alloc_hint, bool wrap)
 {
-	unsigned long mask;
+	unsigned long mask, flags, word_mask;
+	bool ret = false;
 
-	if (!READ_ONCE(map->cleared))
-		return false;
+	spin_lock_irqsave(&map->swap_lock, flags);
+
+	if (!map->cleared) {
+		if (depth > 0) {
+			word_mask = (~0UL) >> (BITS_PER_LONG - depth);
+			/*
+			 * The current behavior is to always retry after moving
+			 * ->cleared to word, and we change it to retry in case
+			 * of any free bits. To avoid dead loop, we need to take
+			 * wrap & alloc_hint into account. Without this, a soft
+			 * lockup was detected in our test environment.
+			 */
+			if (!wrap && alloc_hint)
+				word_mask &= ~((1UL << alloc_hint) - 1);
+
+			if ((READ_ONCE(map->word) & word_mask) == word_mask)
+				ret = false;
+			else
+				ret = true;
+		}
+
+		goto out_unlock;
+	}
 
 	/*
 	 * First get a stable cleared mask, setting the old mask to 0.
@@ -77,7 +100,10 @@  static inline bool sbitmap_deferred_clear(struct sbitmap_word *map)
 	 */
 	atomic_long_andnot(mask, (atomic_long_t *)&map->word);
 	BUILD_BUG_ON(sizeof(atomic_long_t) != sizeof(map->word));
-	return true;
+	ret = true;
+out_unlock:
+	spin_unlock_irqrestore(&map->swap_lock, flags);
+	return ret;
 }
 
 int sbitmap_init_node(struct sbitmap *sb, unsigned int depth, int shift,
@@ -85,6 +111,7 @@  int sbitmap_init_node(struct sbitmap *sb, unsigned int depth, int shift,
 		      bool alloc_hint)
 {
 	unsigned int bits_per_word;
+	int i;
 
 	if (shift < 0)
 		shift = sbitmap_calculate_shift(depth);
@@ -116,6 +143,9 @@  int sbitmap_init_node(struct sbitmap *sb, unsigned int depth, int shift,
 		return -ENOMEM;
 	}
 
+	for (i = 0; i < sb->map_nr; i++)
+		spin_lock_init(&sb->map[i].swap_lock);
+
 	return 0;
 }
 EXPORT_SYMBOL_GPL(sbitmap_init_node);
@@ -126,7 +156,7 @@  void sbitmap_resize(struct sbitmap *sb, unsigned int depth)
 	unsigned int i;
 
 	for (i = 0; i < sb->map_nr; i++)
-		sbitmap_deferred_clear(&sb->map[i]);
+		sbitmap_deferred_clear(&sb->map[i], 0, 0, 0);
 
 	sb->depth = depth;
 	sb->map_nr = DIV_ROUND_UP(sb->depth, bits_per_word);
@@ -179,7 +209,7 @@  static int sbitmap_find_bit_in_word(struct sbitmap_word *map,
 					alloc_hint, wrap);
 		if (nr != -1)
 			break;
-		if (!sbitmap_deferred_clear(map))
+		if (!sbitmap_deferred_clear(map, depth, alloc_hint, wrap))
 			break;
 	} while (1);
 
@@ -496,7 +526,7 @@  unsigned long __sbitmap_queue_get_batch(struct sbitmap_queue *sbq, int nr_tags,
 		unsigned int map_depth = __map_depth(sb, index);
 		unsigned long val;
 
-		sbitmap_deferred_clear(map);
+		sbitmap_deferred_clear(map, 0, 0, 0);
 		val = READ_ONCE(map->word);
 		if (val == (1UL << (map_depth - 1)) - 1)
 			goto next;