
[v4,1/5] lib/bitmap: add bitmap_{set,get}_value()

Message ID 20230720173956.3674987-2-glider@google.com (mailing list archive)
State New, archived
Series Implement MTE tag compression for swapped pages

Commit Message

Alexander Potapenko July 20, 2023, 5:39 p.m. UTC
From: Syed Nayyar Waris <syednwaris@gmail.com>

The two new functions allow setting/getting values of length up to
BITS_PER_LONG bits at an arbitrary position in the bitmap.

The code was taken from "bitops: Introduce the for_each_set_clump macro"
by Syed Nayyar Waris with a couple of minor changes:
 - instead of using roundup(), which adds an unnecessary dependency
   on <linux/math.h>, we calculate space as BITS_PER_LONG-offset;
 - indentation is reduced by not using else-clauses (suggested by
   checkpatch for bitmap_get_value())

Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Syed Nayyar Waris <syednwaris@gmail.com>
Signed-off-by: William Breathitt Gray <william.gray@linaro.org>
Link: https://lore.kernel.org/lkml/fe12eedf3666f4af5138de0e70b67a07c7f40338.1592224129.git.syednwaris@gmail.com/
Suggested-by: Yury Norov <yury.norov@gmail.com>
Co-developed-by: Alexander Potapenko <glider@google.com>
Signed-off-by: Alexander Potapenko <glider@google.com>

---
v4:
 - Address comments by Andy Shevchenko and Yury Norov:
   - prevent passing values >= 64 to GENMASK()
   - fix commit authorship
   - change comments
   - check for unlikely(nbits==0)
   - drop unnecessary const declarations
   - fix kernel-doc comments
   - rename bitmap_{get,set}_value() to bitmap_{read,write}()
---
 include/linux/bitmap.h | 60 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 60 insertions(+)

Comments

Yury Norov July 23, 2023, 1:57 a.m. UTC | #1
On Thu, Jul 20, 2023 at 07:39:52PM +0200, Alexander Potapenko wrote:
> +/**
> + * bitmap_write - write n-bit value within a memory region
> + * @map: address to the bitmap memory region
> + * @value: value of nbits
> + * @start: bit offset of the n-bit value
> + * @nbits: size of value in bits, up to BITS_PER_LONG
> + */
> +static inline void bitmap_write(unsigned long *map,
> +				unsigned long value,
> +				unsigned long start, unsigned long nbits)
> +{
> +	size_t index = BIT_WORD(start);
> +	unsigned long offset = start % BITS_PER_LONG;
> +	unsigned long space = BITS_PER_LONG - offset;
> +
> +	if (unlikely(!nbits))
> +		return;
> +	value &= GENMASK(nbits - 1, 0);

Strictly speaking, a 'value' shouldn't contain set bits beyond nbits
because otherwise it's an out-of-bounds type of error.

This is kind of gray zone to me, because it's a caller's responsibility
to provide correct input. But on the other hand, it would be a great
headache debugging corrupted bitmaps.

Now that we've got a single user of bitmap_write, and even more,
it's wrapped with a helper, I think it would be reasonable to trim a
'value' in the helper, if needed.

Anyways, the comment must warn about that loudly...

> +	if (space >= nbits) {
> +		map[index] &= ~(GENMASK(nbits - 1, 0) << offset);

'GENMASK(nbits - 1, 0) << offset' looks really silly.

> +		map[index] |= value << offset;

Here you commit 2 reads and 2 writes for a word instead of one.

> +		return;
> +	}
> +	map[index] &= ~BITMAP_FIRST_WORD_MASK(start);

~FIRST_WORD is the same as LAST_WORD. I tried the replacement, and it saves
~25 bytes of .text on x86_64.

> +	map[index] |= value << offset;
> +	map[index + 1] &= ~BITMAP_LAST_WORD_MASK(start + nbits);
> +	map[index + 1] |= (value >> space);
> +}

With all that I think the implementation should look something like
this:

        unsigned long w, mask;

        if (unlikely(nbits == 0))
                return;

        map += BIT_WORD(start);
        start %= BITS_PER_LONG;
        end = start + nbits - 1;

        w = *map & (end < BITS_PER_LONG ? ~GENMASK(end, start) : BITMAP_LAST_WORD_MASK(start));
        *map = w | (value << start);

        if (end < BITS_PER_LONG)
                return;

        w = *++map & BITMAP_FIRST_WORD_MASK(end);
        *map = w | (value >> BITS_PER_LONG * 2 - end);

It's not tested, except the /s/~FIRST_WORD/LAST_WORD/ part, but I expect
it should be more efficient.

Alexander, can you please try the above and compare?

In previous iteration, I asked you to share disassembly listings for the
functions. Can you please do that now?

Regarding the rest of the series:
 - I still see Evgenii's name in mtecomp.c, and EA0 references;
 - git-am throws warning about trailing line;
 - checkpatch warns 7 times;

Can you fix all the above before sending the new version?

Have you tested the generic part against BE32, BE64 and LE32 architectures,
and the arch part against BE64? If not, please do.

You're mentioning that the compression ratio is 2 to 20x. Can you
share the absolute numbers? If it's 1k vs 2k, I think most people
just don't care...

Can you share the code that you used to measure the compression ratio?
Would it make sense to export the numbers via sysfs?

Thanks,
Yury
Yury Norov July 27, 2023, 12:14 a.m. UTC | #2
On Wed, Jul 26, 2023 at 10:08:28AM +0200, Alexander Potapenko wrote:
> On Sun, Jul 23, 2023 at 3:57 AM Yury Norov <yury.norov@gmail.com> wrote:
> >
> > On Thu, Jul 20, 2023 at 07:39:52PM +0200, Alexander Potapenko wrote:
> > > +/**
> > > + * bitmap_write - write n-bit value within a memory region
> > > + * @map: address to the bitmap memory region
> > > + * @value: value of nbits
> > > + * @start: bit offset of the n-bit value
> > > + * @nbits: size of value in bits, up to BITS_PER_LONG
> > > + */
> > > +static inline void bitmap_write(unsigned long *map,
> > > +                             unsigned long value,
> > > +                             unsigned long start, unsigned long nbits)
> > > +{
> > > +     size_t index = BIT_WORD(start);
> > > +     unsigned long offset = start % BITS_PER_LONG;
> > > +     unsigned long space = BITS_PER_LONG - offset;
> > > +
> > > +     if (unlikely(!nbits))
> > > +             return;
> > > +     value &= GENMASK(nbits - 1, 0);
> >
> > Strictly speaking, a 'value' shouldn't contain set bits beyond nbits
> > because otherwise it's an out-of-bounds type of error.
> 
> I can easily imagine someone passing -1 (or ~0) as a value, but
> wanting to only write n bits of it.

This is an abuse of new API because we've got a bitmap_set(). But
whatever, let's keep that masking.

...
 
> I like the idea of sharing the first write between the branches, and
> it can be made even shorter:
> 
> ===========================================================
> void bitmap_write_new(unsigned long *map, unsigned long value,
>                       unsigned long start, unsigned long nbits)
> {
>         unsigned long offset;
>         unsigned long space;
>         size_t index;
>         bool fit;
> 
>         if (unlikely(!nbits))
>                 return;
> 
>         value &= GENMASK(nbits - 1, 0);
>         offset = start % BITS_PER_LONG;
>         space = BITS_PER_LONG - offset;
>         index = BIT_WORD(start);
>         fit = space >= nbits;

        space >= nbits <=>
        BITS_PER_LONG - offset >= nbits <=> 
        offset + nbits <= BITS_PER_LONG

>         map[index] &= (fit ? (~(GENMASK(nbits - 1, 0) << offset)) :

So here GENMASK(nbits + offset - 1, offset) is at max:
GENMASK(BITS_PER_LONG - 1, offset). And it never overflows, which is my
point. Does it make sense?

> ~BITMAP_FIRST_WORD_MASK(start));

As I said, ~BITMAP_FIRST_WORD_MASK() is the same as BITMAP_LAST_WORD_MASK()
and vice versa.

>         map[index] |= value << offset;
>         if (fit)
>                 return;
> 
>         map[index + 1] &= ~BITMAP_LAST_WORD_MASK(start + nbits);
>         map[index + 1] |= (value >> space);
> }
> ===========================================================
> 
> According to Godbolt (https://godbolt.org/z/n5Te779bf), this function
> is 32 bytes shorter than yours under x86 Clang, and 8 bytes - under
> GCC (which on the other hand does a poor job optimizing both).
> 
> Overall, given that there's currently a single user of these
> functions, isn't it premature to optimize them without knowing
> anything about their performance?
> 
> > In previous iteration, I asked you to share disassembly listings for the
> > functions. Can you please do that now?
> 
> Will godbolt work for you (see above)?

I don't know for how long an external resource will keep the reference
alive. My SSD keeps emails long enough.

...

> > You're mentioning that the compression ratio is 2 to 20x. Can you
> > share the absolute numbers? If it's 1k vs 2k, I think most people
> > just don't care...
> 
> I'll provide the exact numbers with the next patch series. Last time I
> checked, the order of magnitude was tens of megabytes.

That's impressive. Fruitful idea. It would be important for embedded guys
who may disable MTE because of the memory overhead. I think it's worth
mentioning that in Kconfig, together with the associated performance
overhead, if it's ever measurable.

> > Can you share the code that you used to measure the compression ratio?
> > Would it make sense to export the numbers via sysfs?
> 
> For out-of-line allocations the data can be derived from
> /proc/slabinfo, but we don't calculate inline allocations.
> Agreed, a debugfs interface won't hurt.
Alexander Potapenko Aug. 4, 2023, 4:07 p.m. UTC | #3
>         space >= nbits <=>
>         BITS_PER_LONG - offset >= nbits <=>
>         offset + nbits <= BITS_PER_LONG
>
> >         map[index] &= (fit ? (~(GENMASK(nbits - 1, 0) << offset)) :
>
> So here GENMASK(nbits + offset - 1, offset) is at max:
> GENMASK(BITS_PER_LONG - 1, offset). And it never overflows, which is my
> point. Does it make sense?

It indeed does. Perhaps pulling offset inside GENMASK is not a bug
after all (a simple test does not show any difference between their
behavior).
But `GENMASK(nbits - 1 + offset, offset)` blows up the code (see below).
My guess is that this happens because the compiler fails to reuse the
value of `GENMASK(nbits - 1, 0)` used to clamp the value to write, and
calculates `GENMASK(nbits - 1 + offset, offset)` from scratch.

>
> > ~BITMAP_FIRST_WORD_MASK(start));
>
> As I said, ~BITMAP_FIRST_WORD_MASK() is the same as BITMAP_LAST_WORD_MASK()
> and vice versa.

Surprisingly, ~BITMAP_FIRST_WORD_MASK() generates better code than
BITMAP_LAST_WORD_MASK().

>
> >         map[index] |= value << offset;
> >         if (fit)
> >                 return;
> >
> >         map[index + 1] &= ~BITMAP_LAST_WORD_MASK(start + nbits);

OTOH I managed to shave three more bytes off by replacing
~BITMAP_LAST_WORD_MASK with a BITMAP_FIRST_WORD_MASK here.

> >         map[index + 1] |= (value >> space);
> > }

I'll post the implementations together with the disassembly below.
I used some Clang 17.0.0 version that is a couple months behind
upstream, but that still produces consistently shorter code (~48 bytes
less) than the trunk GCC on Godbolt.

1. Original implementation of bitmap_write() from this patch - 164
bytes (interestingly, it's 157 bytes with Clang 14.0.6)

==================================================================
void bitmap_write(unsigned long *map, unsigned long value,
                  unsigned long start, unsigned long nbits)
{
        size_t index = BIT_WORD(start);
        unsigned long offset = start % BITS_PER_LONG;
        unsigned long space = BITS_PER_LONG - offset;

        if (unlikely(!nbits))
                return;
        value &= GENMASK(nbits - 1, 0);
        if (space >= nbits) {
                map[index] &= ~(GENMASK(nbits - 1, 0) << offset);
                map[index] |= value << offset;
                return;
        }
        map[index] &= ~BITMAP_FIRST_WORD_MASK(start);
        map[index] |= value << offset;
        map[index + 1] &= ~BITMAP_LAST_WORD_MASK(start + nbits);
        map[index + 1] |= (value >> space);
}
==================================================================
0000000000001200 <bitmap_write>:
    1200: 41 57                push   %r15
    1202: 41 56                push   %r14
    1204: 53                    push   %rbx
    1205: 48 85 c9              test   %rcx,%rcx
    1208: 74 7b                je     1285 <bitmap_write+0x85>
    120a: 48 89 c8              mov    %rcx,%rax
    120d: 49 89 d2              mov    %rdx,%r10
    1210: 49 c1 ea 06          shr    $0x6,%r10
    1214: 41 89 d1              mov    %edx,%r9d
    1217: f6 d9                neg    %cl
    1219: 49 c7 c7 ff ff ff ff mov    $0xffffffffffffffff,%r15
    1220: 49 d3 ef              shr    %cl,%r15
    1223: 41 83 e1 3f          and    $0x3f,%r9d
    1227: 41 b8 40 00 00 00    mov    $0x40,%r8d
    122d: 4c 21 fe              and    %r15,%rsi
    1230: 48 89 f3              mov    %rsi,%rbx
    1233: 44 89 c9              mov    %r9d,%ecx
    1236: 48 d3 e3              shl    %cl,%rbx
    1239: 4d 29 c8              sub    %r9,%r8
    123c: 49 c7 c3 ff ff ff ff mov    $0xffffffffffffffff,%r11
    1243: 4e 8b 34 d7          mov    (%rdi,%r10,8),%r14
    1247: 49 39 c0              cmp    %rax,%r8
    124a: 73 3f                jae    128b <bitmap_write+0x8b>
    124c: 49 c7 c7 ff ff ff ff mov    $0xffffffffffffffff,%r15
    1253: 44 89 c9              mov    %r9d,%ecx
    1256: 49 d3 e7              shl    %cl,%r15
    1259: 49 f7 d7              not    %r15
    125c: 4d 21 fe              and    %r15,%r14
    125f: 49 09 de              or     %rbx,%r14
    1262: 4e 89 34 d7          mov    %r14,(%rdi,%r10,8)
    1266: 01 c2                add    %eax,%edx
    1268: f6 da                neg    %dl
    126a: 89 d1                mov    %edx,%ecx
    126c: 49 d3 eb              shr    %cl,%r11
    126f: 49 f7 d3              not    %r11
    1272: 44 89 c1              mov    %r8d,%ecx
    1275: 48 d3 ee              shr    %cl,%rsi
    1278: 4e 23 5c d7 08        and    0x8(%rdi,%r10,8),%r11
    127d: 4c 09 de              or     %r11,%rsi
    1280: 4a 89 74 d7 08        mov    %rsi,0x8(%rdi,%r10,8)
    1285: 5b                    pop    %rbx
    1286: 41 5e                pop    %r14
    1288: 41 5f                pop    %r15
    128a: c3                    ret
    128b: 44 89 c9              mov    %r9d,%ecx
    128e: 49 d3 e7              shl    %cl,%r15
    1291: 49 f7 d7              not    %r15
    1294: 4d 21 fe              and    %r15,%r14
    1297: 49 09 de              or     %rbx,%r14
    129a: 4e 89 34 d7          mov    %r14,(%rdi,%r10,8)
    129e: 5b                    pop    %rbx
    129f: 41 5e                pop    %r14
    12a1: 41 5f                pop    %r15
    12a3: c3                    ret
==================================================================

2. Your implementation, my_bitmap_write() - 152 bytes (used to be
slightly worse with Clang 14.0.6 - 159 bytes):

==================================================================
void my_bitmap_write(unsigned long *map, unsigned long value,
                     unsigned long start, unsigned long nbits)
{
        unsigned long w, end;

        if (unlikely(nbits == 0))
                return;

        value &= GENMASK(nbits - 1, 0);

        map += BIT_WORD(start);
        start %= BITS_PER_LONG;
        end = start + nbits - 1;

        w = *map & (end < BITS_PER_LONG ? ~GENMASK(end, start) :
                    BITMAP_LAST_WORD_MASK(start));
        *map = w | (value << start);

        if (end < BITS_PER_LONG)
                return;

        w = *++map & BITMAP_LAST_WORD_MASK(end + 1 - BITS_PER_LONG);
        *map = w | (value >> (BITS_PER_LONG - start));
}
==================================================================
0000000000001160 <my_bitmap_write>:
    1160: 48 85 c9              test   %rcx,%rcx
    1163: 0f 84 8e 00 00 00    je     11f7 <my_bitmap_write+0x97>
    1169: 49 89 c9              mov    %rcx,%r9
    116c: f6 d9                neg    %cl
    116e: 48 d3 e6              shl    %cl,%rsi
    1171: 48 d3 ee              shr    %cl,%rsi
    1174: 49 89 d2              mov    %rdx,%r10
    1177: 49 c1 ea 06          shr    $0x6,%r10
    117b: 89 d0                mov    %edx,%eax
    117d: 83 e0 3f              and    $0x3f,%eax
    1180: 4e 8d 04 08          lea    (%rax,%r9,1),%r8
    1184: 4a 8d 0c 08          lea    (%rax,%r9,1),%rcx
    1188: 48 ff c9              dec    %rcx
    118b: 4e 8b 0c d7          mov    (%rdi,%r10,8),%r9
    118f: 48 83 f9 3f          cmp    $0x3f,%rcx
    1193: 77 29                ja     11be <my_bitmap_write+0x5e>
    1195: 41 f6 d8              neg    %r8b
    1198: 48 c7 c2 ff ff ff ff mov    $0xffffffffffffffff,%rdx
    119f: 44 89 c1              mov    %r8d,%ecx
    11a2: 48 d3 ea              shr    %cl,%rdx
    11a5: 89 c1                mov    %eax,%ecx
    11a7: 48 d3 ea              shr    %cl,%rdx
    11aa: 48 d3 e2              shl    %cl,%rdx
    11ad: 48 d3 e6              shl    %cl,%rsi
    11b0: 48 f7 d2              not    %rdx
    11b3: 49 21 d1              and    %rdx,%r9
    11b6: 4c 09 ce              or     %r9,%rsi
    11b9: 4a 89 34 d7          mov    %rsi,(%rdi,%r10,8)
    11bd: c3                    ret
    11be: f6 da                neg    %dl
    11c0: 89 d1                mov    %edx,%ecx
    11c2: 49 d3 e1              shl    %cl,%r9
    11c5: 49 d3 e9              shr    %cl,%r9
    11c8: 48 89 f2              mov    %rsi,%rdx
    11cb: 89 c1                mov    %eax,%ecx
    11cd: 48 d3 e2              shl    %cl,%rdx
    11d0: 4c 09 ca              or     %r9,%rdx
    11d3: 4a 89 14 d7          mov    %rdx,(%rdi,%r10,8)
    11d7: 4a 8b 54 d7 08        mov    0x8(%rdi,%r10,8),%rdx
    11dc: 41 f6 d8              neg    %r8b
    11df: 44 89 c1              mov    %r8d,%ecx
    11e2: 48 d3 e2              shl    %cl,%rdx
    11e5: 48 d3 ea              shr    %cl,%rdx
    11e8: f6 d8                neg    %al
    11ea: 89 c1                mov    %eax,%ecx
    11ec: 48 d3 ee              shr    %cl,%rsi
    11ef: 48 09 d6              or     %rdx,%rsi
    11f2: 4a 89 74 d7 08        mov    %rsi,0x8(%rdi,%r10,8)
    11f7: c3                    ret
==================================================================

3. My improved version built on top of yours and mentioned above under
the name bitmap_write_new() - 116 bytes:

==================================================================
void bitmap_write_new(unsigned long *map, unsigned long value,
                      unsigned long start, unsigned long nbits)
{
        unsigned long offset;
        unsigned long space;
        size_t index;
        bool fit;

        if (unlikely(!nbits))
                return;

        value &= GENMASK(nbits - 1, 0);
        offset = start % BITS_PER_LONG;
        space = BITS_PER_LONG - offset;
        index = BIT_WORD(start);
        fit = space >= nbits;

        map[index] &= (fit ? (~(GENMASK(nbits - 1, 0) << offset)) :
                       ~BITMAP_FIRST_WORD_MASK(start));
        map[index] |= value << offset;
        if (fit)
                return;

        map[index + 1] &= BITMAP_FIRST_WORD_MASK(start + nbits);
        map[index + 1] |= (value >> space);
}
==================================================================
00000000000012b0 <bitmap_write_new>:
    12b0: 48 85 c9              test   %rcx,%rcx
    12b3: 74 6e                je     1323 <bitmap_write_new+0x73>
    12b5: 48 89 c8              mov    %rcx,%rax
    12b8: f6 d9                neg    %cl
    12ba: 49 c7 c2 ff ff ff ff mov    $0xffffffffffffffff,%r10
    12c1: 49 d3 ea              shr    %cl,%r10
    12c4: 49 c7 c3 ff ff ff ff mov    $0xffffffffffffffff,%r11
    12cb: 4c 21 d6              and    %r10,%rsi
    12ce: 89 d1                mov    %edx,%ecx
    12d0: 83 e1 3f              and    $0x3f,%ecx
    12d3: 41 b8 40 00 00 00    mov    $0x40,%r8d
    12d9: 49 29 c8              sub    %rcx,%r8
    12dc: 49 89 d1              mov    %rdx,%r9
    12df: 49 c1 e9 06          shr    $0x6,%r9
    12e3: 49 39 c0              cmp    %rax,%r8
    12e6: 4d 0f 42 d3          cmovb  %r11,%r10
    12ea: 49 d3 e2              shl    %cl,%r10
    12ed: 49 f7 d2              not    %r10
    12f0: 4e 23 14 cf          and    (%rdi,%r9,8),%r10
    12f4: 49 89 f3              mov    %rsi,%r11
    12f7: 49 d3 e3              shl    %cl,%r11
    12fa: 4d 09 d3              or     %r10,%r11
    12fd: 4e 89 1c cf          mov    %r11,(%rdi,%r9,8)
    1301: 49 39 c0              cmp    %rax,%r8
    1304: 73 1d                jae    1323 <bitmap_write_new+0x73>
    1306: 01 d0                add    %edx,%eax
    1308: 4a 8b 54 cf 08        mov    0x8(%rdi,%r9,8),%rdx
    130d: 89 c1                mov    %eax,%ecx
    130f: 48 d3 ea              shr    %cl,%rdx
    1312: 48 d3 e2              shl    %cl,%rdx
    1315: 44 89 c1              mov    %r8d,%ecx
    1318: 48 d3 ee              shr    %cl,%rsi
    131b: 48 09 d6              or     %rdx,%rsi
    131e: 4a 89 74 cf 08        mov    %rsi,0x8(%rdi,%r9,8)
    1323: c3                    ret
==================================================================

4. The version of bitmap_write_new() that uses `(GENMASK(nbits - 1 +
offset, offset)` - 146 bytes:

==================================================================
void bitmap_write_new_shift(unsigned long *map, unsigned long value,
                      unsigned long start, unsigned long nbits)
{
        unsigned long offset;
        unsigned long space;
        size_t index;
        bool fit;

        if (unlikely(!nbits))
                return;

        value &= GENMASK(nbits - 1, 0);
        offset = start % BITS_PER_LONG;
        space = BITS_PER_LONG - offset;
        index = BIT_WORD(start);
        fit = space >= nbits;

        map[index] &= (fit ? ~(GENMASK(nbits - 1 + offset, offset)) :
                       ~BITMAP_FIRST_WORD_MASK(start));
        map[index] |= value << offset;
        if (fit)
                return;

        map[index + 1] &= BITMAP_FIRST_WORD_MASK(start + nbits);
        map[index + 1] |= (value >> space);
}
==================================================================
0000000000001330 <bitmap_write_new_shift>:
    1330: 48 85 c9              test   %rcx,%rcx
    1333: 74 6a                je     139f <bitmap_write_new_shift+0x6f>
    1335: 48 89 c8              mov    %rcx,%rax
    1338: f6 d9                neg    %cl
    133a: 48 d3 e6              shl    %cl,%rsi
    133d: 48 d3 ee              shr    %cl,%rsi
    1340: 41 89 d0              mov    %edx,%r8d
    1343: 41 83 e0 3f          and    $0x3f,%r8d
    1347: 41 b9 40 00 00 00    mov    $0x40,%r9d
    134d: 4d 29 c1              sub    %r8,%r9
    1350: 49 89 d2              mov    %rdx,%r10
    1353: 49 c1 ea 06          shr    $0x6,%r10
    1357: 49 c7 c3 ff ff ff ff mov    $0xffffffffffffffff,%r11
    135e: 44 89 c1              mov    %r8d,%ecx
    1361: 49 d3 e3              shl    %cl,%r11
    1364: 49 39 c1              cmp    %rax,%r9
    1367: 73 37                jae    13a0 <bitmap_write_new_shift+0x70>
    1369: 53                    push   %rbx
    136a: 49 f7 d3              not    %r11
    136d: 4e 23 1c d7          and    (%rdi,%r10,8),%r11
    1371: 48 89 f3              mov    %rsi,%rbx
    1374: 44 89 c1              mov    %r8d,%ecx
    1377: 48 d3 e3              shl    %cl,%rbx
    137a: 4c 09 db              or     %r11,%rbx
    137d: 4a 89 1c d7          mov    %rbx,(%rdi,%r10,8)
    1381: 01 d0                add    %edx,%eax
    1383: 4a 8b 54 d7 08        mov    0x8(%rdi,%r10,8),%rdx
    1388: 89 c1                mov    %eax,%ecx
    138a: 48 d3 ea              shr    %cl,%rdx
    138d: 48 d3 e2              shl    %cl,%rdx
    1390: 44 89 c9              mov    %r9d,%ecx
    1393: 48 d3 ee              shr    %cl,%rsi
    1396: 48 09 d6              or     %rdx,%rsi
    1399: 4a 89 74 d7 08        mov    %rsi,0x8(%rdi,%r10,8)
    139e: 5b                    pop    %rbx
    139f: c3                    ret
    13a0: 44 01 c0              add    %r8d,%eax
    13a3: f6 d8                neg    %al
    13a5: 89 c1                mov    %eax,%ecx
    13a7: 49 d3 e3              shl    %cl,%r11
    13aa: 49 d3 eb              shr    %cl,%r11
    13ad: 49 f7 d3              not    %r11
    13b0: 44 89 c1              mov    %r8d,%ecx
    13b3: 48 d3 e6              shl    %cl,%rsi
    13b6: 4e 23 1c d7          and    (%rdi,%r10,8),%r11
    13ba: 4c 09 de              or     %r11,%rsi
    13bd: 4a 89 34 d7          mov    %rsi,(%rdi,%r10,8)
    13c1: c3                    ret
==================================================================

For posterity, I am also attaching the C file containing the four
implementations together with some code testing them.

PS. I won't be actively working on the patch series till the end of
August, so feel free to ignore this letter until then.
Yury Norov Aug. 4, 2023, 7:55 p.m. UTC | #4
On Fri, Aug 04, 2023 at 06:07:00PM +0200, Alexander Potapenko wrote:
> >         space >= nbits <=>
> >         BITS_PER_LONG - offset >= nbits <=>
> >         offset + nbits <= BITS_PER_LONG
> >
> > >         map[index] &= (fit ? (~(GENMASK(nbits - 1, 0) << offset)) :
> >
> > So here GENMASK(nbits + offset - 1, offset) is at max:
> > GENMASK(BITS_PER_LONG - 1, offset). And it never overflows, which is my
> > point. Does it make sense?
> 
> It indeed does. Perhaps pulling offset inside GENMASK is not a bug
> after all (a simple test does not show any difference between their
> behavior).
> But `GENMASK(nbits - 1 + offset, offset)` blows up the code (see below).
> My guess is that this happens because the compiler fails to reuse the
> value of `GENMASK(nbits - 1, 0)` used to clamp the value to write, and
> calculates `GENMASK(nbits - 1 + offset, offset)` from scratch.

OK. Can you put a comment explaining this? Or maybe would be even
better to use BITMAP_LAST_WORD_MASK() here:

         mask = BITMAP_LAST_WORD_MASK(nbits);
         value &= mask;
         ...
         map[index] &= (fit ? (~mask << offset)) :

> > > ~BITMAP_FIRST_WORD_MASK(start));
> >
> > As I said, ~BITMAP_FIRST_WORD_MASK() is the same as BITMAP_LAST_WORD_MASK()
> > and vice versa.
> 
> Surprisingly, ~BITMAP_FIRST_WORD_MASK() generates better code than
> BITMAP_LAST_WORD_MASK().
 
Wow... If that's consistent across different compilers/arches, we'd
just drop the latter. Thanks for pointing that out. I'll check.

> > >         map[index] |= value << offset;
> > >         if (fit)
> > >                 return;
> > >
> > >         map[index + 1] &= ~BITMAP_LAST_WORD_MASK(start + nbits);
> 
> OTOH I managed to shave three more bytes off by replacing
> ~BITMAP_LAST_WORD_MASK with a BITMAP_FIRST_WORD_MASK here.
> 
> > >         map[index + 1] |= (value >> space);
> > > }
> 
> I'll post the implementations together with the disassembly below.
> I used some Clang 17.0.0 version that is a couple months behind
> upstream, but that still produces consistently shorter code (~48 bytes
> less) than the trunk GCC on Godbolt.
> 
> 1. Original implementation of bitmap_write() from this patch - 164
> bytes (interestingly, it's 157 bytes with Clang 14.0.6)

I spotted that too in some other case. Newer compilers tend to
generate bigger code, but the result usually works faster. One
particular reason in my case was loop unrolling.

[...]

> 3. My improved version built on top of yours and mentioned above under
> the name bitmap_write_new() - 116 bytes:

30% better in size - that's impressive!
 
> ==================================================================
> void bitmap_write_new(unsigned long *map, unsigned long value,
>                       unsigned long start, unsigned long nbits)
> {
>         unsigned long offset;
>         unsigned long space;
>         size_t index;
>         bool fit;
> 
>         if (unlikely(!nbits))
>                 return;
> 
>         value &= GENMASK(nbits - 1, 0);
>         offset = start % BITS_PER_LONG;
>         space = BITS_PER_LONG - offset;
>         index = BIT_WORD(start);
>         fit = space >= nbits;
> 
>         map[index] &= (fit ? (~(GENMASK(nbits - 1, 0) << offset)) :
> ~BITMAP_FIRST_WORD_MASK(start));
>         map[index] |= value << offset;
>         if (fit)
>                 return;
> 
>         map[index + 1] &= BITMAP_FIRST_WORD_MASK(start + nbits);
>         map[index + 1] |= (value >> space);
> }

Thanks,
Yury
Andy Shevchenko Aug. 4, 2023, 8:05 p.m. UTC | #5
On Fri, Aug 04, 2023 at 12:55:01PM -0700, Yury Norov wrote:
> On Fri, Aug 04, 2023 at 06:07:00PM +0200, Alexander Potapenko wrote:

> > > > ~BITMAP_FIRST_WORD_MASK(start));
> > >
> > > As I said, ~BITMAP_FIRST_WORD_MASK() is the same as BITMAP_LAST_WORD_MASK()
> > > and vice versa.
> > 
> > Surprisingly, ~BITMAP_FIRST_WORD_MASK() generates better code than
> > BITMAP_LAST_WORD_MASK().
>  
> Wow... If that's consistent across different compilers/arches, we'd
> just drop the latter. Thanks for pointing that out. I'll check.

It might be...

...

> >         map[index] &= (fit ? (~(GENMASK(nbits - 1, 0) << offset)) :
> > ~BITMAP_FIRST_WORD_MASK(start));

...that both parts of the ternary are using negation, so the compiler may use
the negation only once. (Haven't looked at the assembly.)
Alexander Potapenko Sept. 22, 2023, 7:47 a.m. UTC | #6
>
> Regarding the rest of the series:
>  - I still see Evgenii's name in mtecomp.c, and EA0 references;

Double-checked there are none in v5 (only the Suggested-by: tag)

>  - git-am throws warning about trailing line;

I checked locally that `git am` does not warn about the v5 patches. But
given that the patches are generated with `git format-patch`, I suspect
they get garbled when you download them; could that be the case?

>  - checkpatch warns 7 times;

It now warns 4 times: three warnings are about updating MAINTAINERS (I
don't think there's a need for this); the last one is about
CONFIG_ARM64_MTE_COMP_KUNIT_TEST not having three lines of description
text in Kconfig.

> Can you fix all the above before sending the new version?

> Have you tested generic part against BE32, BE64 and LE32 architectures;
> and arch part against BE64? If not, please do.

I did now.


> You're mentioning that the compression ratio is 2 to 20x. Can you
> share the absolute numbers? If it's 1k vs 2k, I think most people
> just don't care...

In the other thread I mentioned that although 20x compression is
reachable, it may not lead to practical savings. I reworded the
description, having added the absolute numbers.

> Can you share the code that you used to measure the compression ratio?
> Would it make sense to export the numbers via sysfs?

Done in v5

> Thanks,
> Yury
Alexander Potapenko Sept. 22, 2023, 7:48 a.m. UTC | #7
> Either way, whatever we decide, let's stay clear with our intentions
> and mention explicitly that tail bits are either must be zero, or
> ignored.
>
> Alexander, can you add the snippet above to the comments for the
> bitmap_write() and bitmap_read(), as well as in the test? Also, if we
> decide to clear tail of the input value, would BITMAP_LAST_WORD_MASK()
> generate better code than GENMASK(nbits - 1, 0) does?

Added the snippet above to bitmap_write(). I think, however, that it is
obvious that bitmap_read() reads only nbits?

> Comments are very appreciated.
>
> Thanks,
> Yury
Alexander Potapenko Sept. 22, 2023, 7:49 a.m. UTC | #8
> OK. Can you put a comment explaining this? Or maybe would be even
> better to use BITMAP_LAST_WORD_MASK() here:
>
>          mask = BITMAP_LAST_WORD_MASK(nbits);
>          value &= mask;
>          ...
>          map[index] &= (fit ? (~mask << offset)) :

I changed GENMASK to BITMAP_LAST_WORD_MASK here, and I think it's
self-explanatory now, WDYT?

Patch

diff --git a/include/linux/bitmap.h b/include/linux/bitmap.h
index 03644237e1efb..bc21c09a2e038 100644
--- a/include/linux/bitmap.h
+++ b/include/linux/bitmap.h
@@ -76,7 +76,11 @@  struct device;
  *  bitmap_to_arr32(buf, src, nbits)            Copy nbits from buf to u32[] dst
  *  bitmap_to_arr64(buf, src, nbits)            Copy nbits from buf to u64[] dst
  *  bitmap_get_value8(map, start)               Get 8bit value from map at start
+ *  bitmap_read(map, start, nbits)              Read an nbits-sized value from
+ *                                              map at start
  *  bitmap_set_value8(map, value, start)        Set 8bit value to map at start
+ *  bitmap_write(map, value, start, nbits)      Write an nbits-sized value to
+ *                                              map at start
  *
  * Note, bitmap_zero() and bitmap_fill() operate over the region of
  * unsigned longs, that is, bits behind bitmap till the unsigned long
@@ -583,6 +587,33 @@  static inline unsigned long bitmap_get_value8(const unsigned long *map,
 	return (map[index] >> offset) & 0xFF;
 }
 
+/**
+ * bitmap_read - read a value of n-bits from the memory region
+ * @map: address to the bitmap memory region
+ * @start: bit offset of the n-bit value
+ * @nbits: size of value in bits, up to BITS_PER_LONG
+ *
+ * Returns: value of nbits located at the @start bit offset within the @map
+ * memory region.
+ */
+static inline unsigned long bitmap_read(const unsigned long *map,
+					unsigned long start,
+					unsigned long nbits)
+{
+	size_t index = BIT_WORD(start);
+	unsigned long offset = start % BITS_PER_LONG;
+	unsigned long space = BITS_PER_LONG - offset;
+	unsigned long value_low, value_high;
+
+	if (unlikely(!nbits))
+		return 0;
+	if (space >= nbits)
+		return (map[index] >> offset) & GENMASK(nbits - 1, 0);
+	value_low = map[index] & BITMAP_FIRST_WORD_MASK(start);
+	value_high = map[index + 1] & BITMAP_LAST_WORD_MASK(start + nbits);
+	return (value_low >> offset) | (value_high << space);
+}
+
 /**
  * bitmap_set_value8 - set an 8-bit value within a memory region
  * @map: address to the bitmap memory region
@@ -599,6 +630,35 @@  static inline void bitmap_set_value8(unsigned long *map, unsigned long value,
 	map[index] |= value << offset;
 }
 
+/**
+ * bitmap_write - write n-bit value within a memory region
+ * @map: address to the bitmap memory region
+ * @value: value of nbits
+ * @start: bit offset of the n-bit value
+ * @nbits: size of value in bits, up to BITS_PER_LONG
+ */
+static inline void bitmap_write(unsigned long *map,
+				unsigned long value,
+				unsigned long start, unsigned long nbits)
+{
+	size_t index = BIT_WORD(start);
+	unsigned long offset = start % BITS_PER_LONG;
+	unsigned long space = BITS_PER_LONG - offset;
+
+	if (unlikely(!nbits))
+		return;
+	value &= GENMASK(nbits - 1, 0);
+	if (space >= nbits) {
+		map[index] &= ~(GENMASK(nbits - 1, 0) << offset);
+		map[index] |= value << offset;
+		return;
+	}
+	map[index] &= ~BITMAP_FIRST_WORD_MASK(start);
+	map[index] |= value << offset;
+	map[index + 1] &= ~BITMAP_LAST_WORD_MASK(start + nbits);
+	map[index + 1] |= (value >> space);
+}
+
 #endif /* __ASSEMBLY__ */
 
 #endif /* __LINUX_BITMAP_H */