| Message ID | 20220110015007.326561-1-ming.lei@redhat.com (mailing list archive) |
|---|---|
| State | New, archived |
| Series | lib/sbitmap: kill 'depth' from sbitmap_word |
On 1/9/22 6:50 PM, Ming Lei wrote:
> Only the last sbitmap_word can have a different depth; all the others
> must have the same depth of 1U << sb->shift, so it is not necessary to
> store it in sbitmap_word, and it can be retrieved easily and efficiently
> by adding one internal helper, __map_depth(sb, index).
>
> No performance effect was seen when running an IOPS test on null_blk.
>
> This way saves us one cacheline (usually 64 bytes) per sbitmap_word.

We probably want to kill the ____cacheline_aligned_in_smp from 'word' as
well.
On Sun, Jan 09, 2022 at 06:54:21PM -0700, Jens Axboe wrote:
> On 1/9/22 6:50 PM, Ming Lei wrote:
> > Only the last sbitmap_word can have a different depth; all the others
> > must have the same depth of 1U << sb->shift, so it is not necessary to
> > store it in sbitmap_word, and it can be retrieved easily and efficiently
> > by adding one internal helper, __map_depth(sb, index).
> >
> > No performance effect was seen when running an IOPS test on null_blk.
> >
> > This way saves us one cacheline (usually 64 bytes) per sbitmap_word.
>
> We probably want to kill the ____cacheline_aligned_in_smp from 'word' as
> well.

But sbitmap_deferred_clear_bit() is called in the put fast path, so the
cacheline would become shared with the get path, and I guess that isn't
expected.

Thanks,
Ming
On 1/9/22 7:38 PM, Ming Lei wrote:
> On Sun, Jan 09, 2022 at 06:54:21PM -0700, Jens Axboe wrote:
>> On 1/9/22 6:50 PM, Ming Lei wrote:
>>> Only the last sbitmap_word can have a different depth; all the others
>>> must have the same depth of 1U << sb->shift, so it is not necessary to
>>> store it in sbitmap_word, and it can be retrieved easily and efficiently
>>> by adding one internal helper, __map_depth(sb, index).
>>>
>>> No performance effect was seen when running an IOPS test on null_blk.
>>>
>>> This way saves us one cacheline (usually 64 bytes) per sbitmap_word.
>>
>> We probably want to kill the ____cacheline_aligned_in_smp from 'word' as
>> well.
>
> But sbitmap_deferred_clear_bit() is called in the put fast path, so the
> cacheline would become shared with the get path, and I guess that isn't
> expected.

Just from 'word', not from 'cleared'. They will still be in separate
cache lines, but it usually doesn't make sense to have the leading member
marked as cacheline aligned; that's a whole-struct property at that
point.
On Sun, Jan 09, 2022 at 07:43:26PM -0700, Jens Axboe wrote:
> On 1/9/22 7:38 PM, Ming Lei wrote:
> > On Sun, Jan 09, 2022 at 06:54:21PM -0700, Jens Axboe wrote:
> >> On 1/9/22 6:50 PM, Ming Lei wrote:
> >>> Only the last sbitmap_word can have a different depth; all the others
> >>> must have the same depth of 1U << sb->shift, so it is not necessary to
> >>> store it in sbitmap_word, and it can be retrieved easily and efficiently
> >>> by adding one internal helper, __map_depth(sb, index).
> >>>
> >>> No performance effect was seen when running an IOPS test on null_blk.
> >>>
> >>> This way saves us one cacheline (usually 64 bytes) per sbitmap_word.
> >>
> >> We probably want to kill the ____cacheline_aligned_in_smp from 'word' as
> >> well.
> >
> > But sbitmap_deferred_clear_bit() is called in the put fast path, so the
> > cacheline would become shared with the get path, and I guess that isn't
> > expected.
>
> Just from 'word', not from 'cleared'. They will still be in separate
> cache lines, but it usually doesn't make sense to have the leading member
> marked as cacheline aligned; that's a whole-struct property at that
> point.

OK, got it; it is just a code style thing, not a functional change.

Will include it in V2.

Thanks,
Ming
On 1/9/22 7:47 PM, Ming Lei wrote:
> On Sun, Jan 09, 2022 at 07:43:26PM -0700, Jens Axboe wrote:
>> On 1/9/22 7:38 PM, Ming Lei wrote:
>>> On Sun, Jan 09, 2022 at 06:54:21PM -0700, Jens Axboe wrote:
>>>> On 1/9/22 6:50 PM, Ming Lei wrote:
>>>>> Only the last sbitmap_word can have a different depth; all the others
>>>>> must have the same depth of 1U << sb->shift, so it is not necessary to
>>>>> store it in sbitmap_word, and it can be retrieved easily and efficiently
>>>>> by adding one internal helper, __map_depth(sb, index).
>>>>>
>>>>> No performance effect was seen when running an IOPS test on null_blk.
>>>>>
>>>>> This way saves us one cacheline (usually 64 bytes) per sbitmap_word.
>>>>
>>>> We probably want to kill the ____cacheline_aligned_in_smp from 'word' as
>>>> well.
>>>
>>> But sbitmap_deferred_clear_bit() is called in the put fast path, so the
>>> cacheline would become shared with the get path, and I guess that isn't
>>> expected.
>>
>> Just from 'word', not from 'cleared'. They will still be in separate
>> cache lines, but it usually doesn't make sense to have the leading member
>> marked as cacheline aligned; that's a whole-struct property at that
>> point.
>
> OK, got it; it is just a code style thing, not a functional change.

Right, it's more of a cleanup. It would be a bit misleading to have it on
the first member, suggesting you could rely on it, when external factors
come into play there.

> Will include it in V2.

Thanks!
Hello Jens,

On Sun, 2022-01-09 at 19:43 -0700, Jens Axboe wrote:
> On 1/9/22 7:38 PM, Ming Lei wrote:
> > On Sun, Jan 09, 2022 at 06:54:21PM -0700, Jens Axboe wrote:
> > > On 1/9/22 6:50 PM, Ming Lei wrote:
> > > > Only the last sbitmap_word can have a different depth; all the
> > > > others must have the same depth of 1U << sb->shift, so it is not
> > > > necessary to store it in sbitmap_word, and it can be retrieved
> > > > easily and efficiently by adding one internal helper,
> > > > __map_depth(sb, index).
> > > >
> > > > No performance effect was seen when running an IOPS test on
> > > > null_blk.
> > > >
> > > > This way saves us one cacheline (usually 64 bytes) per
> > > > sbitmap_word.
> > >
> > > We probably want to kill the ____cacheline_aligned_in_smp from
> > > 'word' as well.
> >
> > But sbitmap_deferred_clear_bit() is called in the put fast path, so
> > the cacheline would become shared with the get path, and I guess that
> > isn't expected.
>
> Just from 'word', not from 'cleared'. They will still be in separate
> cache lines, but it usually doesn't make sense to have the leading
> member marked as cacheline aligned; that's a whole-struct property at
> that point.

While discussing this - is there any data about how many separate cache
lines (for either "word" or "cleared") are beneficial for performance?
For bitmap sizes between 4 and 512 bits (on x86_64), the code generates
layouts with 4-8 cache lines, but above that, the number of cache lines
grows linearly with the bitmap size. I am wondering whether we should
consider utilizing more of the allocated memory once a certain number of
separate cache lines is exceeded, by accessing additional words in the
existing cache lines.

Could you comment on that?

Thanks,
Martin
```diff
diff --git a/include/linux/sbitmap.h b/include/linux/sbitmap.h
index fc0357a6e19b..a2b86760f985 100644
--- a/include/linux/sbitmap.h
+++ b/include/linux/sbitmap.h
@@ -27,11 +27,6 @@ struct seq_file;
  * struct sbitmap_word - Word in a &struct sbitmap.
  */
 struct sbitmap_word {
-	/**
-	 * @depth: Number of bits being used in @word/@cleared
-	 */
-	unsigned long depth;
-
 	/**
 	 * @word: word holding free bits
 	 */
@@ -164,6 +159,14 @@ struct sbitmap_queue {
 int sbitmap_init_node(struct sbitmap *sb, unsigned int depth, int shift,
 		      gfp_t flags, int node, bool round_robin, bool alloc_hint);
 
+/* sbitmap internal helper */
+static inline unsigned int __map_depth(const struct sbitmap *sb, int index)
+{
+	if (index == sb->map_nr - 1)
+		return sb->depth - (index << sb->shift);
+	return 1U << sb->shift;
+}
+
 /**
  * sbitmap_free() - Free memory used by a &struct sbitmap.
  * @sb: Bitmap to free.
@@ -251,7 +254,7 @@ static inline void __sbitmap_for_each_set(struct sbitmap *sb,
 	while (scanned < sb->depth) {
 		unsigned long word;
 		unsigned int depth = min_t(unsigned int,
-					   sb->map[index].depth - nr,
+					   __map_depth(sb, index) - nr,
 					   sb->depth - scanned);
 
 		scanned += depth;
diff --git a/lib/sbitmap.c b/lib/sbitmap.c
index 2709ab825499..e9a51337a2a3 100644
--- a/lib/sbitmap.c
+++ b/lib/sbitmap.c
@@ -85,7 +85,6 @@ int sbitmap_init_node(struct sbitmap *sb, unsigned int depth, int shift,
 		      bool alloc_hint)
 {
 	unsigned int bits_per_word;
-	unsigned int i;
 
 	if (shift < 0)
 		shift = sbitmap_calculate_shift(depth);
@@ -117,10 +116,6 @@ int sbitmap_init_node(struct sbitmap *sb, unsigned int depth, int shift,
 		return -ENOMEM;
 	}
 
-	for (i = 0; i < sb->map_nr; i++) {
-		sb->map[i].depth = min(depth, bits_per_word);
-		depth -= sb->map[i].depth;
-	}
 	return 0;
 }
 EXPORT_SYMBOL_GPL(sbitmap_init_node);
@@ -135,11 +130,6 @@ void sbitmap_resize(struct sbitmap *sb, unsigned int depth)
 
 	sb->depth = depth;
 	sb->map_nr = DIV_ROUND_UP(sb->depth, bits_per_word);
-
-	for (i = 0; i < sb->map_nr; i++) {
-		sb->map[i].depth = min(depth, bits_per_word);
-		depth -= sb->map[i].depth;
-	}
 }
 EXPORT_SYMBOL_GPL(sbitmap_resize);
@@ -184,8 +174,8 @@ static int sbitmap_find_bit_in_index(struct sbitmap *sb, int index,
 	int nr;
 
 	do {
-		nr = __sbitmap_get_word(&map->word, map->depth, alloc_hint,
-					!sb->round_robin);
+		nr = __sbitmap_get_word(&map->word, __map_depth(sb, index),
+					alloc_hint, !sb->round_robin);
 		if (nr != -1)
 			break;
 		if (!sbitmap_deferred_clear(map))
@@ -257,7 +247,9 @@ static int __sbitmap_get_shallow(struct sbitmap *sb,
 	for (i = 0; i < sb->map_nr; i++) {
 again:
 		nr = __sbitmap_get_word(&sb->map[index].word,
-					min(sb->map[index].depth, shallow_depth),
+					min_t(unsigned int,
+					      __map_depth(sb, index),
+					      shallow_depth),
 					SB_NR_TO_BIT(sb, alloc_hint), true);
 		if (nr != -1) {
 			nr += index << sb->shift;
@@ -315,11 +307,12 @@ static unsigned int __sbitmap_weight(const struct sbitmap *sb, bool set)
 
 	for (i = 0; i < sb->map_nr; i++) {
 		const struct sbitmap_word *word = &sb->map[i];
+		unsigned int word_depth = __map_depth(sb, i);
 
 		if (set)
-			weight += bitmap_weight(&word->word, word->depth);
+			weight += bitmap_weight(&word->word, word_depth);
 		else
-			weight += bitmap_weight(&word->cleared, word->depth);
+			weight += bitmap_weight(&word->cleared, word_depth);
 	}
 	return weight;
 }
@@ -367,7 +360,7 @@ void sbitmap_bitmap_show(struct sbitmap *sb, struct seq_file *m)
 	for (i = 0; i < sb->map_nr; i++) {
 		unsigned long word = READ_ONCE(sb->map[i].word);
 		unsigned long cleared = READ_ONCE(sb->map[i].cleared);
-		unsigned int word_bits = READ_ONCE(sb->map[i].depth);
+		unsigned int word_bits = __map_depth(sb, i);
 
 		word &= ~cleared;
 
@@ -508,15 +501,16 @@ unsigned long __sbitmap_queue_get_batch(struct sbitmap_queue *sbq, int nr_tags,
 	for (i = 0; i < sb->map_nr; i++) {
 		struct sbitmap_word *map = &sb->map[index];
 		unsigned long get_mask;
+		unsigned int map_depth = __map_depth(sb, index);
 
 		sbitmap_deferred_clear(map);
-		if (map->word == (1UL << (map->depth - 1)) - 1)
+		if (map->word == (1UL << (map_depth - 1)) - 1)
 			continue;
 
-		nr = find_first_zero_bit(&map->word, map->depth);
-		if (nr + nr_tags <= map->depth) {
+		nr = find_first_zero_bit(&map->word, map_depth);
+		if (nr + nr_tags <= map_depth) {
 			atomic_long_t *ptr = (atomic_long_t *) &map->word;
-			int map_tags = min_t(int, nr_tags, map->depth);
+			int map_tags = min_t(int, nr_tags, map_depth);
 			unsigned long val, ret;
 
 			get_mask = ((1UL << map_tags) - 1) << nr;
```
Only the last sbitmap_word can have a different depth; all the others
must have the same depth of 1U << sb->shift, so it is not necessary to
store it in sbitmap_word, and it can be retrieved easily and efficiently
by adding one internal helper, __map_depth(sb, index).

No performance effect was seen when running an IOPS test on null_blk.

This way saves us one cacheline (usually 64 bytes) per sbitmap_word.

Cc: Martin Wilck <martin.wilck@suse.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 include/linux/sbitmap.h | 15 +++++++++------
 lib/sbitmap.c           | 34 ++++++++++++++--------------------
 2 files changed, 23 insertions(+), 26 deletions(-)