
lib/sbitmap: kill 'depth' from sbitmap_word

Message ID 20220110015007.326561-1-ming.lei@redhat.com (mailing list archive)
State New, archived
Series lib/sbitmap: kill 'depth' from sbitmap_word

Commit Message

Ming Lei Jan. 10, 2022, 1:50 a.m. UTC
Only the last sbitmap_word can have a different depth; all the others
must have the same depth of 1U << sb->shift. There is therefore no need
to store it in sbitmap_word, and it can be retrieved easily and
efficiently by adding one internal helper, __map_depth(sb, index).

No performance effect was observed when running an IOPS test on
null_blk.

This saves one cacheline (usually 64 bytes) per sbitmap_word.

Cc: Martin Wilck <martin.wilck@suse.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 include/linux/sbitmap.h | 15 +++++++++------
 lib/sbitmap.c           | 34 ++++++++++++++--------------------
 2 files changed, 23 insertions(+), 26 deletions(-)
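
To make the invariant concrete, here is a minimal stand-alone sketch of
the per-word depth computation, in plain user-space C mirroring the new
helper; the depth/shift values (100, 6) are made up for illustration,
not taken from a real workload:

    #include <stdio.h>

    /* Mirror of __map_depth(): every word holds 1U << shift bits,
     * except possibly the last one, which holds the remainder.
     */
    static unsigned int map_depth(unsigned int total_depth,
                                  unsigned int shift,
                                  unsigned int map_nr,
                                  unsigned int index)
    {
            if (index == map_nr - 1)
                    return total_depth - (index << shift);
            return 1U << shift;
    }

    int main(void)
    {
            unsigned int depth = 100, shift = 6; /* 64 bits per word */
            unsigned int map_nr = (depth + (1U << shift) - 1) >> shift;
            unsigned int i;

            /* Prints "word 0: depth 64" and "word 1: depth 36" */
            for (i = 0; i < map_nr; i++)
                    printf("word %u: depth %u\n", i,
                           map_depth(depth, shift, map_nr, i));
            return 0;
    }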

Comments

Jens Axboe Jan. 10, 2022, 1:54 a.m. UTC | #1
On 1/9/22 6:50 PM, Ming Lei wrote:
> Only the last sbitmap_word can have a different depth; all the others
> must have the same depth of 1U << sb->shift. There is therefore no need
> to store it in sbitmap_word, and it can be retrieved easily and
> efficiently by adding one internal helper, __map_depth(sb, index).
>
> No performance effect was observed when running an IOPS test on
> null_blk.
>
> This saves one cacheline (usually 64 bytes) per sbitmap_word.

We probably want to kill the ____cacheline_aligned_in_smp from 'word' as
well.
Ming Lei Jan. 10, 2022, 2:38 a.m. UTC | #2
On Sun, Jan 09, 2022 at 06:54:21PM -0700, Jens Axboe wrote:
> On 1/9/22 6:50 PM, Ming Lei wrote:
> > Only the last sbitmap_word can have a different depth; all the others
> > must have the same depth of 1U << sb->shift. There is therefore no need
> > to store it in sbitmap_word, and it can be retrieved easily and
> > efficiently by adding one internal helper, __map_depth(sb, index).
> >
> > No performance effect was observed when running an IOPS test on
> > null_blk.
> >
> > This saves one cacheline (usually 64 bytes) per sbitmap_word.
> 
> We probably want to kill the ____cacheline_aligned_in_smp from 'word' as
> well.

But sbitmap_deferred_clear_bit() is called in the put fast path, so the
cacheline would become shared with the get path, and I guess that isn't
expected.

Thanks,
Ming
Jens Axboe Jan. 10, 2022, 2:43 a.m. UTC | #3
On 1/9/22 7:38 PM, Ming Lei wrote:
> On Sun, Jan 09, 2022 at 06:54:21PM -0700, Jens Axboe wrote:
>> On 1/9/22 6:50 PM, Ming Lei wrote:
>>> Only the last sbitmap_word can have a different depth; all the others
>>> must have the same depth of 1U << sb->shift. There is therefore no need
>>> to store it in sbitmap_word, and it can be retrieved easily and
>>> efficiently by adding one internal helper, __map_depth(sb, index).
>>>
>>> No performance effect was observed when running an IOPS test on
>>> null_blk.
>>>
>>> This saves one cacheline (usually 64 bytes) per sbitmap_word.
>>
>> We probably want to kill the ____cacheline_aligned_in_smp from 'word' as
>> well.
> 
> But sbitmap_deferred_clear_bit() is called in the put fast path, so the
> cacheline would become shared with the get path, and I guess that isn't
> expected.

Just from 'word', not from 'cleared'. They will still be in separate
cache lines, but it usually doesn't make sense to have the leading
member marked as cacheline aligned; that's a whole-struct property at
that point.
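
For reference, a sketch of what the sbitmap_word layout would
presumably look like with this patch plus the suggestion applied (this
is an illustration of the annotation placement, not the literal V2
hunk): 'word' is touched by the get path, 'cleared' by the put path,
and the annotation on 'cleared' alone is enough to keep the two in
separate cache lines:

    struct sbitmap_word {
            /**
             * @word: word holding free bits (get path)
             */
            unsigned long word;

            /**
             * @cleared: word holding cleared bits (put path); the
             * annotation keeps it off @word's cache line.
             */
            unsigned long cleared ____cacheline_aligned_in_smp;
    };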
Ming Lei Jan. 10, 2022, 2:47 a.m. UTC | #4
On Sun, Jan 09, 2022 at 07:43:26PM -0700, Jens Axboe wrote:
> On 1/9/22 7:38 PM, Ming Lei wrote:
> > On Sun, Jan 09, 2022 at 06:54:21PM -0700, Jens Axboe wrote:
> >> On 1/9/22 6:50 PM, Ming Lei wrote:
> >>> Only the last sbitmap_word can have a different depth; all the others
> >>> must have the same depth of 1U << sb->shift. There is therefore no need
> >>> to store it in sbitmap_word, and it can be retrieved easily and
> >>> efficiently by adding one internal helper, __map_depth(sb, index).
> >>>
> >>> No performance effect was observed when running an IOPS test on
> >>> null_blk.
> >>>
> >>> This saves one cacheline (usually 64 bytes) per sbitmap_word.
> >>
> >> We probably want to kill the ____cacheline_aligned_in_smp from 'word' as
> >> well.
> > 
> > But sbitmap_deferred_clear_bit() is called in the put fast path, so the
> > cacheline would become shared with the get path, and I guess that isn't
> > expected.
> 
> Just from 'word', not from 'cleared'. They will still be in separate
> cache lines, but it usually doesn't make sense to have the leading
> member marked as cacheline aligned; that's a whole-struct property at
> that point.

OK, got it, it is just a code style thing, not a functional change.

Will include it in V2.


Thanks,
Ming
Jens Axboe Jan. 10, 2022, 2:49 a.m. UTC | #5
On 1/9/22 7:47 PM, Ming Lei wrote:
> On Sun, Jan 09, 2022 at 07:43:26PM -0700, Jens Axboe wrote:
>> On 1/9/22 7:38 PM, Ming Lei wrote:
>>> On Sun, Jan 09, 2022 at 06:54:21PM -0700, Jens Axboe wrote:
>>>> On 1/9/22 6:50 PM, Ming Lei wrote:
>>>>> Only the last sbitmap_word can have a different depth; all the others
>>>>> must have the same depth of 1U << sb->shift. There is therefore no need
>>>>> to store it in sbitmap_word, and it can be retrieved easily and
>>>>> efficiently by adding one internal helper, __map_depth(sb, index).
>>>>>
>>>>> No performance effect was observed when running an IOPS test on
>>>>> null_blk.
>>>>>
>>>>> This saves one cacheline (usually 64 bytes) per sbitmap_word.
>>>>
>>>> We probably want to kill the ____cacheline_aligned_in_smp from 'word' as
>>>> well.
>>>
>>> But sbitmap_deferred_clear_bit() is called in the put fast path, so the
>>> cacheline would become shared with the get path, and I guess that isn't
>>> expected.
>>
>> Just from 'word', not from 'cleared'. They will still be in separate
>> cache lines, but it usually doesn't make sense to have the leading
>> member marked as cacheline aligned; that's a whole-struct property at
>> that point.
> 
> OK, got it, it is just a code style thing, not a functional change.

Right, it's more of a cleanup. It would be a bit misleading to have
it on the first member, making you think you could rely on it when
external factors come into play there.

> Will include it in V2.

Thanks!
Martin Wilck Jan. 10, 2022, 4:22 p.m. UTC | #6
Hello Jens,

On Sun, 2022-01-09 at 19:43 -0700, Jens Axboe wrote:
> On 1/9/22 7:38 PM, Ming Lei wrote:
> > On Sun, Jan 09, 2022 at 06:54:21PM -0700, Jens Axboe wrote:
> > > On 1/9/22 6:50 PM, Ming Lei wrote:
> > > > Only the last sbitmap_word can have a different depth; all the
> > > > others must have the same depth of 1U << sb->shift. There is
> > > > therefore no need to store it in sbitmap_word, and it can be
> > > > retrieved easily and efficiently by adding one internal helper,
> > > > __map_depth(sb, index).
> > > > 
> > > > No performance effect was observed when running an IOPS test on
> > > > null_blk.
> > > > 
> > > > This saves one cacheline (usually 64 bytes) per sbitmap_word.
> > > 
> > > We probably want to kill the ____cacheline_aligned_in_smp from
> > > 'word' as
> > > well.
> > 
> > But sbitmap_deferred_clear_bit() is called in the put fast path, so
> > the cacheline would become shared with the get path, and I guess
> > that isn't expected.
> 
> Just from 'word', not from 'cleared'. They will still be in separate
> cache lines, but it usually doesn't make sense to have the leading
> member marked as cacheline aligned; that's a whole-struct property at
> that point.
> 

While discussing this: is there any data on how many separate cache
lines (for either "word" or "cleared") are beneficial for performance?

For bitmap sizes between 4 and 512 bits (on x86_64), the code generates
layouts with 4-8 cache lines, but above that, the number of cache lines
grows linearly with bitmap size. I am wondering whether we should
consider utilizing more of the allocated memory once a certain number
of separate cache lines is exceeded, by accessing additional words in
the existing cache lines (see the sketch after this mail for the
growth pattern).

Could you comment on that?

Thanks,
Martin
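
To make the growth pattern concrete, here is a minimal user-space
sketch of the shift-selection policy, modeled on
sbitmap_calculate_shift() in lib/sbitmap.c; the BITS_PER_LONG value and
the sample depths are illustrative assumptions:

    #include <stdio.h>

    #define BITS_PER_LONG   64
    #define DEF_SHIFT       6       /* ilog2(BITS_PER_LONG) */

    /* Small bitmaps shrink the per-word bit count so the bits spread
     * over at least about four words; larger bitmaps keep
     * BITS_PER_LONG bits per word, so the word count (and with it the
     * cache-line count) grows linearly with depth.
     */
    static int calculate_shift(unsigned int depth)
    {
            int shift = DEF_SHIFT;

            if (depth >= 4) {
                    while ((4U << shift) > depth)
                            shift--;
            }
            return shift;
    }

    int main(void)
    {
            unsigned int depths[] = { 4, 64, 256, 512, 1024, 4096 };
            unsigned int i;

            for (i = 0; i < sizeof(depths) / sizeof(depths[0]); i++) {
                    int shift = calculate_shift(depths[i]);
                    unsigned int map_nr =
                            (depths[i] + (1U << shift) - 1) >> shift;

                    /* Depths of 4..512 bits stay at 4-8 words; beyond
                     * that, map_nr scales linearly with depth.
                     */
                    printf("depth %4u: shift %d, %2u words\n",
                           depths[i], shift, map_nr);
            }
            return 0;
    }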

Patch

diff --git a/include/linux/sbitmap.h b/include/linux/sbitmap.h
index fc0357a6e19b..a2b86760f985 100644
--- a/include/linux/sbitmap.h
+++ b/include/linux/sbitmap.h
@@ -27,11 +27,6 @@  struct seq_file;
  * struct sbitmap_word - Word in a &struct sbitmap.
  */
 struct sbitmap_word {
-	/**
-	 * @depth: Number of bits being used in @word/@cleared
-	 */
-	unsigned long depth;
-
 	/**
 	 * @word: word holding free bits
 	 */
@@ -164,6 +159,14 @@  struct sbitmap_queue {
 int sbitmap_init_node(struct sbitmap *sb, unsigned int depth, int shift,
 		      gfp_t flags, int node, bool round_robin, bool alloc_hint);
 
+/* sbitmap internal helper */
+static inline unsigned int __map_depth(const struct sbitmap *sb, int index)
+{
+	if (index == sb->map_nr - 1)
+		return sb->depth - (index << sb->shift);
+	return 1U << sb->shift;
+}
+
 /**
  * sbitmap_free() - Free memory used by a &struct sbitmap.
  * @sb: Bitmap to free.
@@ -251,7 +254,7 @@  static inline void __sbitmap_for_each_set(struct sbitmap *sb,
 	while (scanned < sb->depth) {
 		unsigned long word;
 		unsigned int depth = min_t(unsigned int,
-					   sb->map[index].depth - nr,
+					   __map_depth(sb, index) - nr,
 					   sb->depth - scanned);
 
 		scanned += depth;
diff --git a/lib/sbitmap.c b/lib/sbitmap.c
index 2709ab825499..e9a51337a2a3 100644
--- a/lib/sbitmap.c
+++ b/lib/sbitmap.c
@@ -85,7 +85,6 @@  int sbitmap_init_node(struct sbitmap *sb, unsigned int depth, int shift,
 		      bool alloc_hint)
 {
 	unsigned int bits_per_word;
-	unsigned int i;
 
 	if (shift < 0)
 		shift = sbitmap_calculate_shift(depth);
@@ -117,10 +116,6 @@  int sbitmap_init_node(struct sbitmap *sb, unsigned int depth, int shift,
 		return -ENOMEM;
 	}
 
-	for (i = 0; i < sb->map_nr; i++) {
-		sb->map[i].depth = min(depth, bits_per_word);
-		depth -= sb->map[i].depth;
-	}
 	return 0;
 }
 EXPORT_SYMBOL_GPL(sbitmap_init_node);
@@ -135,11 +130,6 @@  void sbitmap_resize(struct sbitmap *sb, unsigned int depth)
 
 	sb->depth = depth;
 	sb->map_nr = DIV_ROUND_UP(sb->depth, bits_per_word);
-
-	for (i = 0; i < sb->map_nr; i++) {
-		sb->map[i].depth = min(depth, bits_per_word);
-		depth -= sb->map[i].depth;
-	}
 }
 EXPORT_SYMBOL_GPL(sbitmap_resize);
 
@@ -184,8 +174,8 @@  static int sbitmap_find_bit_in_index(struct sbitmap *sb, int index,
 	int nr;
 
 	do {
-		nr = __sbitmap_get_word(&map->word, map->depth, alloc_hint,
-					!sb->round_robin);
+		nr = __sbitmap_get_word(&map->word, __map_depth(sb, index),
+					alloc_hint, !sb->round_robin);
 		if (nr != -1)
 			break;
 		if (!sbitmap_deferred_clear(map))
@@ -257,7 +247,9 @@  static int __sbitmap_get_shallow(struct sbitmap *sb,
 	for (i = 0; i < sb->map_nr; i++) {
 again:
 		nr = __sbitmap_get_word(&sb->map[index].word,
-					min(sb->map[index].depth, shallow_depth),
+					min_t(unsigned int,
+					      __map_depth(sb, index),
+					      shallow_depth),
 					SB_NR_TO_BIT(sb, alloc_hint), true);
 		if (nr != -1) {
 			nr += index << sb->shift;
@@ -315,11 +307,12 @@  static unsigned int __sbitmap_weight(const struct sbitmap *sb, bool set)
 
 	for (i = 0; i < sb->map_nr; i++) {
 		const struct sbitmap_word *word = &sb->map[i];
+		unsigned int word_depth = __map_depth(sb, i);
 
 		if (set)
-			weight += bitmap_weight(&word->word, word->depth);
+			weight += bitmap_weight(&word->word, word_depth);
 		else
-			weight += bitmap_weight(&word->cleared, word->depth);
+			weight += bitmap_weight(&word->cleared, word_depth);
 	}
 	return weight;
 }
@@ -367,7 +360,7 @@  void sbitmap_bitmap_show(struct sbitmap *sb, struct seq_file *m)
 	for (i = 0; i < sb->map_nr; i++) {
 		unsigned long word = READ_ONCE(sb->map[i].word);
 		unsigned long cleared = READ_ONCE(sb->map[i].cleared);
-		unsigned int word_bits = READ_ONCE(sb->map[i].depth);
+		unsigned int word_bits = __map_depth(sb, i);
 
 		word &= ~cleared;
 
@@ -508,15 +501,16 @@  unsigned long __sbitmap_queue_get_batch(struct sbitmap_queue *sbq, int nr_tags,
 	for (i = 0; i < sb->map_nr; i++) {
 		struct sbitmap_word *map = &sb->map[index];
 		unsigned long get_mask;
+		unsigned int map_depth = __map_depth(sb, index);
 
 		sbitmap_deferred_clear(map);
-		if (map->word == (1UL << (map->depth - 1)) - 1)
+		if (map->word == (1UL << (map_depth - 1)) - 1)
 			continue;
 
-		nr = find_first_zero_bit(&map->word, map->depth);
-		if (nr + nr_tags <= map->depth) {
+		nr = find_first_zero_bit(&map->word, map_depth);
+		if (nr + nr_tags <= map_depth) {
 			atomic_long_t *ptr = (atomic_long_t *) &map->word;
-			int map_tags = min_t(int, nr_tags, map->depth);
+			int map_tags = min_t(int, nr_tags, map_depth);
 			unsigned long val, ret;
 
 			get_mask = ((1UL << map_tags) - 1) << nr;