
sbitmap: amortize cost of clearing bits

Message ID 91b96e21-45c9-63ca-334a-6f597b34123e@kernel.dk (mailing list archive)
State New, archived
Series: sbitmap: amortize cost of clearing bits

Commit Message

Jens Axboe Nov. 29, 2018, 8 p.m. UTC
sbitmap maintains a set of words that we use to set and clear bits, with
each bit representing a tag for blk-mq. Even though we spread the bits
out and maintain a hint cache, any given bit that gets allocated will
end up being cleared in the exact same spot.

This introduces batched clearing of bits. Instead of clearing a given
bit, we set the same bit in a cleared/free mask. If we fail to allocate
a bit from a given word, we check the free mask and batch-move those
cleared bits over at that time. This trades 64 atomic bitops for 2
cmpxchg() calls. On top of that, the two cmpxchg() calls are done back
to back, hopefully making them a bit cheaper as well.

In a threaded poll test case, half the overhead of getting and clearing
tags is removed with this change. On another poll test case with a
single thread, performance is unchanged.
    
Signed-off-by: Jens Axboe <axboe@kernel.dk>

---

This patch is on top of the round robin fix for sbitmap just posted.
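
For readers outside of blk-mq, the shape of the scheme in standalone
form looks roughly like the below (a sketch with hypothetical names,
using GCC atomic builtins in place of the kernel atomics; the real code
is in the patch):

#include <stdbool.h>

/* One word of the bitmap, with frees diverted to a separate mask. */
struct word {
	unsigned long bits;	/* 1 = tag in use */
	unsigned long cleared;	/* freed, not yet folded back into ->bits */
};

/* Free side: a single atomic OR into ->cleared, instead of an atomic
 * clear_bit() on the ->bits word that allocators are hammering. */
static void put_tag(struct word *w, unsigned int nr)
{
	__atomic_fetch_or(&w->cleared, 1UL << nr, __ATOMIC_RELEASE);
}

/* Alloc slow path: when ->bits looks full, grab the whole batch of
 * deferred frees in one go and fold it back into ->bits. Returns true
 * if anything was reclaimed and the allocation should be retried. */
static bool fold_cleared(struct word *w)
{
	unsigned long mask;

	mask = __atomic_exchange_n(&w->cleared, 0UL, __ATOMIC_ACQUIRE);
	if (!mask)
		return false;
	__atomic_fetch_and(&w->bits, ~mask, __ATOMIC_RELEASE);
	return true;
}

Each free is still one atomic, but it lands on ->cleared rather than on
the ->bits word the allocators are hitting, and a full batch of frees
is then retired with just two read-modify-writes at fold time.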

Comments

Omar Sandoval Nov. 29, 2018, 9:53 p.m. UTC | #1
On Thu, Nov 29, 2018 at 01:00:25PM -0700, Jens Axboe wrote:
> sbitmap maintains a set of words that we use to set and clear bits, with
> each bit representing a tag for blk-mq. Even though we spread the bits
> out and maintain a hint cache, one particular bit allocated will end up
> being cleared in the exact same spot.
> 
> This introduces batched clearing of bits. Instead of clearing a given
> bit, the same bit is set in a cleared/free mask instead. If we fail
> allocating a bit from a given word, then we check the free mask, and
> batch move those cleared bits at that time. This trades 64 atomic bitops
> for 2 cmpxchg(). On top of that, we do those sequentially, hopefully
> making that a bit cheaper as well.
> 
> In a threaded poll test case, half the overhead of getting and clearing
> tags is removed with this change. On another poll test case with a
> single thread, performance is unchanged.
>     
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> 
> ---
> 
> This patch is on top of the round robin fix for sbitmap just posted.

So I think this can lead to hangs due to interactions with wake_batch.
Bear with me:

Let's say the sbitmap_queue was created with depth=64 and shift=6, i.e.,
64 bits all in one word. In this case, wake_batch=8, because clearing
all 64 bits should be enough to trigger 8 batches of 8 wakeups.
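
Roughly how that batch is derived (a sketch of the sbq_calc_wake_batch()
logic in lib/sbitmap.c; take the exact constants and clamping as my
reading of the code):

/* Approximation of sbq_calc_wake_batch(); SBQ_WAIT_QUEUES and
 * SBQ_WAKE_BATCH are both 8 in lib/sbitmap.c. */
static unsigned int calc_wake_batch(unsigned int depth)
{
	unsigned int batch = depth / 8;		/* SBQ_WAIT_QUEUES */

	if (batch < 1)				/* clamp to [1, 8] */
		batch = 1;
	if (batch > 8)				/* SBQ_WAKE_BATCH */
		batch = 8;
	return batch;				/* depth=64 -> 8 */
}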

Let's say all 64 bits are allocated and there are 8 waiters. All 64 bits
are then freed (in the cleared word), so we wake up the 8 waiters. Let's
say they all attempt __sbitmap_get_word(), fail, and get to
sbitmap_deferred_clear() at the same time. One of them does the
cmpxchg() on cleared, setting it to zero, and then the rest see that
cleared is zero, so they return because there don't appear to be any
cleared bits. The first thread succeeds, and the rest go to sleep. 

Now, when the first thread eventually clears its bit, that only
subtracts one from the batch, which isn't enough to trigger a wakeup.
Hang!

This example is contrived, but in general I think that the window
between the cleared mask being zeroed and the word being cleared is
going to cause problems.
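
As a concrete timeline (T0..T7 being the eight woken waiters, all of
whom found ->word full and entered sbitmap_deferred_clear()):

	T0:     mask = ->cleared                (mask = all 64 bits)
	T0:     cmpxchg(->cleared, mask, 0)     succeeds, ->cleared now 0
	T1..T7: see ->cleared == 0, return false, go back to sleep
	T0:     ->word &= ~mask, retries, allocates one tag
	later:  T0 frees that tag, subtracting just 1 from the wake
	        batch of 8 -- no wakeup, and T1..T7 sleep forever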

> diff --git a/include/linux/sbitmap.h b/include/linux/sbitmap.h
> index 804a50983ec5..cec685b89998 100644
> --- a/include/linux/sbitmap.h
> +++ b/include/linux/sbitmap.h
> @@ -30,14 +30,19 @@ struct seq_file;
>   */
>  struct sbitmap_word {
>  	/**
> -	 * @word: The bitmap word itself.
> +	 * @depth: Number of bits being used in @word/@cleared
>  	 */
> -	unsigned long word;
> +	unsigned long depth;
>  
>  	/**
> -	 * @depth: Number of bits being used in @word.
> +	 * @word: word holding free bits
>  	 */
> -	unsigned long depth;
> +	unsigned long word ____cacheline_aligned_in_smp;

On a completely separate note (assuming that we can iron out the race
above): this triples the size of each map. Does the word have to be in a
separate cacheline from the depth? Ignoring resize, depth and word are
always accessed together, so it seems to me that they would benefit from
sharing a cacheline now that clearing happens elsewhere.

> +	/**
> +	 * @cleared: word holding cleared bits
> +	 */
> +	unsigned long cleared ____cacheline_aligned_in_smp;
>  } ____cacheline_aligned_in_smp;
Jens Axboe Nov. 29, 2018, 10:19 p.m. UTC | #2
On 11/29/18 2:53 PM, Omar Sandoval wrote:
> On Thu, Nov 29, 2018 at 01:00:25PM -0700, Jens Axboe wrote:
>> sbitmap maintains a set of words that we use to set and clear bits, with
>> each bit representing a tag for blk-mq. Even though we spread the bits
>> out and maintain a hint cache, one particular bit allocated will end up
>> being cleared in the exact same spot.
>>
>> This introduces batched clearing of bits. Instead of clearing a given
>> bit, the same bit is set in a cleared/free mask instead. If we fail
>> allocating a bit from a given word, then we check the free mask, and
>> batch move those cleared bits at that time. This trades 64 atomic bitops
>> for 2 cmpxchg(). On top of that, we do those sequentially, hopefully
>> making that a bit cheaper as well.
>>
>> In a threaded poll test case, half the overhead of getting and clearing
>> tags is removed with this change. On another poll test case with a
>> single thread, performance is unchanged.
>>     
>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>>
>> ---
>>
>> This patch is on top of the round robin fix for sbitmap just posted.
> 
> So I think this can lead to hangs due to interactions with wake_batch.
> Bear with me:
> 
> Let's say the sbitmap_queue was created with depth=64 and shift=6,
> i.e., 64 bits all in one word. In this case, wake_batch=8, because 64
> bits clearing should be enough to wake up in 8 batches of 8.
> 
> Let's say all 64 bits are allocated and there are 8 waiters. All 64
> bits are then freed (in the cleared word), so we wake up the 8
> waiters. Let's say they all attempt __sbitmap_get_word(), fail, and
> get to sbitmap_deferred_clear() at the same time. One of them does the
> cmpxchg() on cleared, setting it to zero, and then the rest see that
> cleared is zero, so they return because there don't appear to be any
> cleared bits. The first thread succeeds, and the rest go to sleep. 
> 
> Now, when the first thread clears the bit, it only subtracts one from
> the batch, which isn't enough to do a wakeup. Hang!
> 
> This example is contrived, but in general I think that the window
> between the cleared mask being zeroed and the word being cleared is
> going to cause problems.

Honestly, I would have been surprised if the initial version was
bulletproof! I think it should be workable, though; we probably want
to clear in a different order. IOW, clear ->word first, and then update
->cleared, to avoid someone finding ->cleared == 0 and ending up
sleeping. Even now the race is tiny, especially considering we have
multiple maps.

I'll take a look at clearing in the opposite order in a bit.
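
For illustration, the reordering might look something like the below
(an untested sketch: it folds only the snapshotted bits out of ->cleared
so that frees racing in are kept, though whether this closes every
window still needs a hard look):

static inline bool sbitmap_deferred_clear(struct sbitmap *sb, int index)
{
	unsigned long mask, val;

	mask = READ_ONCE(sb->map[index].cleared);
	if (!mask)
		return false;

	/*
	 * Clear the snapshotted bits in ->word first, so nobody can
	 * observe ->cleared == 0 while the freed bits still look
	 * allocated in ->word.
	 */
	do {
		val = sb->map[index].word;
	} while (cmpxchg(&sb->map[index].word, val, val & ~mask) != val);

	/*
	 * Then drop exactly those bits from ->cleared; bits set by
	 * frees that raced in after the snapshot are left in place.
	 */
	do {
		val = sb->map[index].cleared;
	} while (cmpxchg(&sb->map[index].cleared, val, val & ~mask) != val);

	return true;
}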

>> diff --git a/include/linux/sbitmap.h b/include/linux/sbitmap.h
>> index 804a50983ec5..cec685b89998 100644
>> --- a/include/linux/sbitmap.h
>> +++ b/include/linux/sbitmap.h
>> @@ -30,14 +30,19 @@ struct seq_file;
>>   */
>>  struct sbitmap_word {
>>  	/**
>> -	 * @word: The bitmap word itself.
>> +	 * @depth: Number of bits being used in @word/@cleared
>>  	 */
>> -	unsigned long word;
>> +	unsigned long depth;
>>  
>>  	/**
>> -	 * @depth: Number of bits being used in @word.
>> +	 * @word: word holding free bits
>>  	 */
>> -	unsigned long depth;
>> +	unsigned long word ____cacheline_aligned_in_smp;
> 
> Completely separate note (assuming that we can iron out the race above),
> this triples the size of each map. Does the word have to be in a
> separate cacheline from the depth? Ignoring resize, depth and word are
> always accessed together, so it seems to me that it would benefit from
> sharing a cacheline now that clearing happens elsewhere.

We can have depth and word share, that's not a concern. I'll make that
change.
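
With that change, the layout would presumably end up something like the
below (a sketch; two cachelines per map instead of three):

struct sbitmap_word {
	/* The alloc side reads these together; let them share a line. */
	unsigned long depth;
	unsigned long word;

	/* The free side writes here; keep it off the alloc-side line. */
	unsigned long cleared ____cacheline_aligned_in_smp;
} ____cacheline_aligned_in_smp;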

Patch

diff --git a/include/linux/sbitmap.h b/include/linux/sbitmap.h
index 804a50983ec5..cec685b89998 100644
--- a/include/linux/sbitmap.h
+++ b/include/linux/sbitmap.h
@@ -30,14 +30,19 @@ struct seq_file;
  */
 struct sbitmap_word {
 	/**
-	 * @word: The bitmap word itself.
+	 * @depth: Number of bits being used in @word/@cleared
 	 */
-	unsigned long word;
+	unsigned long depth;
 
 	/**
-	 * @depth: Number of bits being used in @word.
+	 * @word: word holding free bits
 	 */
-	unsigned long depth;
+	unsigned long word ____cacheline_aligned_in_smp;
+
+	/**
+	 * @cleared: word holding cleared bits
+	 */
+	unsigned long cleared ____cacheline_aligned_in_smp;
 } ____cacheline_aligned_in_smp;
 
 /**
@@ -310,6 +315,19 @@ static inline void sbitmap_clear_bit(struct sbitmap *sb, unsigned int bitnr)
 	clear_bit(SB_NR_TO_BIT(sb, bitnr), __sbitmap_word(sb, bitnr));
 }
 
+/*
+ * This one is special, since it doesn't actually clear the bit, rather it
+ * sets the corresponding bit in the ->cleared mask instead. Paired with
+ * the caller doing sbitmap_batch_clear() if a given index is full, which
+ * will clear the previously freed entries in the corresponding ->word.
+ */
+static inline void sbitmap_deferred_clear_bit(struct sbitmap *sb, unsigned int bitnr)
+{
+	unsigned long *addr = &sb->map[SB_NR_TO_INDEX(sb, bitnr)].cleared;
+
+	set_bit(SB_NR_TO_BIT(sb, bitnr), addr);
+}
+
 static inline void sbitmap_clear_bit_unlock(struct sbitmap *sb,
 					    unsigned int bitnr)
 {
diff --git a/lib/sbitmap.c b/lib/sbitmap.c
index 45cab6bbc1c7..3007b89f3e33 100644
--- a/lib/sbitmap.c
+++ b/lib/sbitmap.c
@@ -111,6 +111,51 @@ static int __sbitmap_get_word(unsigned long *word, unsigned long depth,
 	return nr;
 }
 
+/*
+ * See if we have deferred clears that we can batch move
+ */
+static inline bool sbitmap_deferred_clear(struct sbitmap *sb, int index)
+{
+	unsigned long mask, val;
+
+	if (!sb->map[index].cleared)
+		return false;
+
+	/*
+	 * First get a stable cleared mask, setting the old mask to 0.
+	 */
+	do {
+		mask = sb->map[index].cleared;
+	} while (cmpxchg(&sb->map[index].cleared, mask, 0) != mask);
+
+	/*
+	 * Now clear the masked bits in our free word
+	 */
+	do {
+		val = sb->map[index].word;
+	} while (cmpxchg(&sb->map[index].word, val, val & ~mask) != val);
+
+	return true;
+}
+
+static int sbitmap_find_bit_in_index(struct sbitmap *sb, int index,
+				     unsigned int alloc_hint, bool round_robin)
+{
+	int nr;
+
+	do {
+		nr = __sbitmap_get_word(&sb->map[index].word,
+					sb->map[index].depth, alloc_hint,
+					!round_robin);
+		if (nr != -1)
+			break;
+		if (!sbitmap_deferred_clear(sb, index))
+			break;
+	} while (1);
+
+	return nr;
+}
+
 int sbitmap_get(struct sbitmap *sb, unsigned int alloc_hint, bool round_robin)
 {
 	unsigned int i, index;
@@ -129,9 +174,8 @@ int sbitmap_get(struct sbitmap *sb, unsigned int alloc_hint, bool round_robin)
 		alloc_hint = 0;
 
 	for (i = 0; i < sb->map_nr; i++) {
-		nr = __sbitmap_get_word(&sb->map[index].word,
-					sb->map[index].depth, alloc_hint,
-					!round_robin);
+		nr = sbitmap_find_bit_in_index(sb, index, alloc_hint,
+						round_robin);
 		if (nr != -1) {
 			nr += index << sb->shift;
 			break;
@@ -514,7 +558,8 @@ EXPORT_SYMBOL_GPL(sbitmap_queue_wake_up);
 void sbitmap_queue_clear(struct sbitmap_queue *sbq, unsigned int nr,
 			 unsigned int cpu)
 {
-	sbitmap_clear_bit_unlock(&sbq->sb, nr);
+	sbitmap_deferred_clear_bit(&sbq->sb, nr);
+
 	/*
 	 * Pairs with the memory barrier in set_current_state() to ensure the
 	 * proper ordering of clear_bit_unlock()/waitqueue_active() in the waker