[v2] rwlock: allow recursive read locking when already locked in write mode

Message ID 20200220173116.88915-1-roger.pau@citrix.com (mailing list archive)
State Superseded

Commit Message

Roger Pau Monné Feb. 20, 2020, 5:31 p.m. UTC
Allow a CPU already holding the lock in write mode to also lock it in
read mode. There's no harm in allowing read locking of an rwlock that's
already owned by the caller (i.e. CPU) in write mode. Allowing such
accesses is required at least for the CPU maps use-case.

In order to do this, reserve 14 bits of the lock for the processor ID
of the write-lock holder; this allows supporting up to 16384 CPUs.
Also reduce the write lock mask to 2 bits: one to signal there are
pending writers waiting on the lock, and the other to signal the lock
is owned in write mode. Note that the write-related data uses 16 bits;
this is done so that it can be cleared (and the lock thus released)
using a 16-bit atomic write.

This reduces the maximum number of concurrent readers from 16777216 to
65536; I think this should still be enough. Otherwise the lock field
can be expanded from 32 to 64 bits, provided all architectures support
atomic operations on 64-bit integers.

Fixes: 5872c83b42c608 ('smp: convert the cpu maps lock into a rw lock')
Reported-by: Jan Beulich <jbeulich@suse.com>
Reported-by: Jürgen Groß <jgross@suse.com>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Changes since v1:
 - Move the processor ID field to start at bit 0, and hence avoid the
   shift when checking it.
 - Enlarge the write-related fields to 16 bits, and hence allow them
   to be cleared using a single atomic write.
---
 xen/common/rwlock.c      |  4 +--
 xen/include/xen/rwlock.h | 53 +++++++++++++++++++++++++---------------
 2 files changed, 35 insertions(+), 22 deletions(-)

Comments

Julien Grall Feb. 20, 2020, 7:20 p.m. UTC | #1
Hi,

On 20/02/2020 17:31, Roger Pau Monne wrote:
> Allow a CPU already holding the lock in write mode to also lock it in
> read mode. There's no harm in allowing read locking a rwlock that's
> already owned by the caller (ie: CPU) in write mode. Allowing such
> accesses is required at least for the CPU maps use-case.
> 
> In order to do this reserve 14bits of the lock, this allows to support
> up to 16384 CPUs. Also reduce the write lock mask to 2 bits: one to
> signal there are pending writers waiting on the lock and the other to
> signal the lock is owned in write mode. Note the write related data
> is using 16bits, this is done in order to be able to clear it (and
> thus release the lock) using a 16bit atomic write.
> 
> This reduces the maximum number of concurrent readers from 16777216 to
> 65536, I think this should still be enough, or else the lock field
> can be expanded from 32 to 64bits if all architectures support atomic
> operations on 64bit integers.

FWIW, arm32 is able to support atomic operations on 64-bit integers.

>   static inline void _write_unlock(rwlock_t *lock)
>   {
> -    /*
> -     * If the writer field is atomic, it can be cleared directly.
> -     * Otherwise, an atomic subtraction will be used to clear it.
> -     */
> -    atomic_sub(_QW_LOCKED, &lock->cnts);
> +    /* Since the writer field is atomic, it can be cleared directly. */
> +    ASSERT(_is_write_locked_by_me(atomic_read(&lock->cnts)));
> +    BUILD_BUG_ON(_QR_SHIFT != 16);
> +    write_atomic((uint16_t *)&lock->cnts, 0);

I think this is an abuse to cast an atomic_t() directly into a uint16_t. 
You would at least want to use &lock->cnts.counter here.

The rest of the code looks good to me.

Cheers,
Roger Pau Monné Feb. 21, 2020, 9:10 a.m. UTC | #2
On Thu, Feb 20, 2020 at 07:20:06PM +0000, Julien Grall wrote:
> Hi,
> 
> [...]
> 
> >   static inline void _write_unlock(rwlock_t *lock)
> >   {
> > -    /*
> > -     * If the writer field is atomic, it can be cleared directly.
> > -     * Otherwise, an atomic subtraction will be used to clear it.
> > -     */
> > -    atomic_sub(_QW_LOCKED, &lock->cnts);
> > +    /* Since the writer field is atomic, it can be cleared directly. */
> > +    ASSERT(_is_write_locked_by_me(atomic_read(&lock->cnts)));
> > +    BUILD_BUG_ON(_QR_SHIFT != 16);
> > +    write_atomic((uint16_t *)&lock->cnts, 0);
> 
> I think this is an abuse to cast an atomic_t() directly into a uint16_t. You
> would at least want to use &lock->cnts.counter here.

Sure, I was wondering about this myself.

Will wait for more comments, not sure whether this can be fixed upon
commit if there are no other issues.

Thanks, Roger.
Julien Grall Feb. 21, 2020, 12:42 p.m. UTC | #3
Hi,

On 21/02/2020 09:10, Roger Pau Monné wrote:
> On Thu, Feb 20, 2020 at 07:20:06PM +0000, Julien Grall wrote:
>> Hi,
>>
>> [...]
>>
>>>    static inline void _write_unlock(rwlock_t *lock)
>>>    {
>>> -    /*
>>> -     * If the writer field is atomic, it can be cleared directly.
>>> -     * Otherwise, an atomic subtraction will be used to clear it.
>>> -     */
>>> -    atomic_sub(_QW_LOCKED, &lock->cnts);
>>> +    /* Since the writer field is atomic, it can be cleared directly. */
>>> +    ASSERT(_is_write_locked_by_me(atomic_read(&lock->cnts)));
>>> +    BUILD_BUG_ON(_QR_SHIFT != 16);
>>> +    write_atomic((uint16_t *)&lock->cnts, 0);
>>
>> I think this is an abuse to cast an atomic_t() directly into a uint16_t. You
>> would at least want to use &lock->cnts.counter here.
> 
> Sure, I was wondering about this myself.
> 
> Will wait for more comments, not sure whether this can be fixed upon
> commit if there are no other issues.

It is trivial enough to be fixed while committing. Assuming this is 
going to be fixed:

Reviewed-by: Julien Grall <julien@xen.org>

Cheers,
Jan Beulich Feb. 21, 2020, 1:36 p.m. UTC | #4
On 21.02.2020 10:10, Roger Pau Monné wrote:
> On Thu, Feb 20, 2020 at 07:20:06PM +0000, Julien Grall wrote:
>> Hi,
>>
>> [...]
>>
>>>   static inline void _write_unlock(rwlock_t *lock)
>>>   {
>>> -    /*
>>> -     * If the writer field is atomic, it can be cleared directly.
>>> -     * Otherwise, an atomic subtraction will be used to clear it.
>>> -     */
>>> -    atomic_sub(_QW_LOCKED, &lock->cnts);
>>> +    /* Since the writer field is atomic, it can be cleared directly. */
>>> +    ASSERT(_is_write_locked_by_me(atomic_read(&lock->cnts)));
>>> +    BUILD_BUG_ON(_QR_SHIFT != 16);
>>> +    write_atomic((uint16_t *)&lock->cnts, 0);
>>
>> I think this is an abuse to cast an atomic_t() directly into a uint16_t. You
>> would at least want to use &lock->cnts.counter here.
> 
> Sure, I was wondering about this myself.
> 
> Will wait for more comments, not sure whether this can be fixed upon
> commit if there are no other issues.

It's more than just adding another field specifier here. A cast like
this one is endianness-unsafe, and hence a trap waiting for a big
endian port attempt to fall into. At the very least this should cause
a build failure on big endian systems, even better would be if it was
endianness-safe.

Jan
Jürgen Groß Feb. 21, 2020, 1:46 p.m. UTC | #5
On 21.02.20 14:36, Jan Beulich wrote:
> On 21.02.2020 10:10, Roger Pau Monné wrote:
>> On Thu, Feb 20, 2020 at 07:20:06PM +0000, Julien Grall wrote:
>>> Hi,
>>>
>>> [...]
>>>
>>>>    static inline void _write_unlock(rwlock_t *lock)
>>>>    {
>>>> -    /*
>>>> -     * If the writer field is atomic, it can be cleared directly.
>>>> -     * Otherwise, an atomic subtraction will be used to clear it.
>>>> -     */
>>>> -    atomic_sub(_QW_LOCKED, &lock->cnts);
>>>> +    /* Since the writer field is atomic, it can be cleared directly. */
>>>> +    ASSERT(_is_write_locked_by_me(atomic_read(&lock->cnts)));
>>>> +    BUILD_BUG_ON(_QR_SHIFT != 16);
>>>> +    write_atomic((uint16_t *)&lock->cnts, 0);
>>>
>>> I think this is an abuse to cast an atomic_t() directly into a uint16_t. You
>>> would at least want to use &lock->cnts.counter here.
>>
>> Sure, I was wondering about this myself.
>>
>> Will wait for more comments, not sure whether this can be fixed upon
>> commit if there are no other issues.
> 
> It's more than just adding another field specifier here. A cast like
> this one is endianness-unsafe, and hence a trap waiting for a big
> endian port attempt to fall into. At the very least this should cause
> a build failure on big endian systems, even better would be if it was
> endianness-safe.

Wouldn't a union be the better choice?


Juergen
Julien Grall Feb. 21, 2020, 1:49 p.m. UTC | #6
On 21/02/2020 13:46, Jürgen Groß wrote:
> On 21.02.20 14:36, Jan Beulich wrote:
>> On 21.02.2020 10:10, Roger Pau Monné wrote:
>>> On Thu, Feb 20, 2020 at 07:20:06PM +0000, Julien Grall wrote:
>>>> Hi,
>>>>
>>>> [...]
>>>>
>>>>>    static inline void _write_unlock(rwlock_t *lock)
>>>>>    {
>>>>> -    /*
>>>>> -     * If the writer field is atomic, it can be cleared directly.
>>>>> -     * Otherwise, an atomic subtraction will be used to clear it.
>>>>> -     */
>>>>> -    atomic_sub(_QW_LOCKED, &lock->cnts);
>>>>> +    /* Since the writer field is atomic, it can be cleared 
>>>>> directly. */
>>>>> +    ASSERT(_is_write_locked_by_me(atomic_read(&lock->cnts)));
>>>>> +    BUILD_BUG_ON(_QR_SHIFT != 16);
>>>>> +    write_atomic((uint16_t *)&lock->cnts, 0);
>>>>
>>>> I think this is an abuse to cast an atomic_t() directly into a 
>>>> uint16_t. You
>>>> would at least want to use &lock->cnts.counter here.
>>>
>>> Sure, I was wondering about this myself.
>>>
>>> Will wait for more comments, not sure whether this can be fixed upon
>>> commit if there are no other issues.
>>
>> It's more than just adding another field specifier here. A cast like
>> this one is endianness-unsafe, and hence a trap waiting for a big
>> endian port attempt to fall into. At the very least this should cause
>> a build failure on big endian systems, even better would be if it was
>> endianness-safe.
> 
> Wouldn't a union be the better choice?

You would not be able to use atomic_t in that case as you can't assume 
the layout of the structure.

Cheers,
Jürgen Groß Feb. 21, 2020, 2:06 p.m. UTC | #7
On 21.02.20 14:49, Julien Grall wrote:
> 
> 
> On 21/02/2020 13:46, Jürgen Groß wrote:
>> On 21.02.20 14:36, Jan Beulich wrote:
>>> On 21.02.2020 10:10, Roger Pau Monné wrote:
>>>> On Thu, Feb 20, 2020 at 07:20:06PM +0000, Julien Grall wrote:
>>>>> Hi,
>>>>>
>>>>> [...]
>>>>>
>>>>>>    static inline void _write_unlock(rwlock_t *lock)
>>>>>>    {
>>>>>> -    /*
>>>>>> -     * If the writer field is atomic, it can be cleared directly.
>>>>>> -     * Otherwise, an atomic subtraction will be used to clear it.
>>>>>> -     */
>>>>>> -    atomic_sub(_QW_LOCKED, &lock->cnts);
>>>>>> +    /* Since the writer field is atomic, it can be cleared 
>>>>>> directly. */
>>>>>> +    ASSERT(_is_write_locked_by_me(atomic_read(&lock->cnts)));
>>>>>> +    BUILD_BUG_ON(_QR_SHIFT != 16);
>>>>>> +    write_atomic((uint16_t *)&lock->cnts, 0);
>>>>>
>>>>> I think this is an abuse to cast an atomic_t() directly into a 
>>>>> uint16_t. You
>>>>> would at least want to use &lock->cnts.counter here.
>>>>
>>>> Sure, I was wondering about this myself.
>>>>
>>>> Will wait for more comments, not sure whether this can be fixed upon
>>>> commit if there are no other issues.
>>>
>>> It's more than just adding another field specifier here. A cast like
>>> this one is endianness-unsafe, and hence a trap waiting for a big
>>> endian port attempt to fall into. At the very least this should cause
>>> a build failure on big endian systems, even better would be if it was
>>> endianness-safe.
>>
>> Wouldn't a union be the better choice?
> 
> You would not be able to use atomic_t in that case as you can't assume 
> the layout of the structure.

union rwlockword {
     atomic_t cnts;
     uint32_t val;
     struct {
         uint16_t write;
         uint16_t readers;
     };
};

static inline uint32_t _qr_bias(void)
{
    const union rwlockword x = {
        .write = 0,
        .readers = 1
    };

    return x.val;
}

...
     cnts = atomic_add_return(_qr_bias(), &lock->cnts);
...

I guess this should do the trick, no?


Juergen
Jan Beulich Feb. 21, 2020, 2:06 p.m. UTC | #8
On 21.02.2020 14:46, Jürgen Groß wrote:
> On 21.02.20 14:36, Jan Beulich wrote:
>> On 21.02.2020 10:10, Roger Pau Monné wrote:
>>> On Thu, Feb 20, 2020 at 07:20:06PM +0000, Julien Grall wrote:
>>>> Hi,
>>>>
>>>> [...]
>>>>
>>>>>    static inline void _write_unlock(rwlock_t *lock)
>>>>>    {
>>>>> -    /*
>>>>> -     * If the writer field is atomic, it can be cleared directly.
>>>>> -     * Otherwise, an atomic subtraction will be used to clear it.
>>>>> -     */
>>>>> -    atomic_sub(_QW_LOCKED, &lock->cnts);
>>>>> +    /* Since the writer field is atomic, it can be cleared directly. */
>>>>> +    ASSERT(_is_write_locked_by_me(atomic_read(&lock->cnts)));
>>>>> +    BUILD_BUG_ON(_QR_SHIFT != 16);
>>>>> +    write_atomic((uint16_t *)&lock->cnts, 0);
>>>>
>>>> I think this is an abuse to cast an atomic_t() directly into a uint16_t. You
>>>> would at least want to use &lock->cnts.counter here.
>>>
>>> Sure, I was wondering about this myself.
>>>
>>> Will wait for more comments, not sure whether this can be fixed upon
>>> commit if there are no other issues.
>>
>> It's more than just adding another field specifier here. A cast like
>> this one is endianness-unsafe, and hence a trap waiting for a big
>> endian port attempt to fall into. At the very least this should cause
>> a build failure on big endian systems, even better would be if it was
>> endianness-safe.
> 
> Wouldn't a union be the better choice?

Well, a union (with the second half being an array of uint16_t) could
be the mechanism to implement this in an endian safe way, but the
union alone won't help - you'd still have to decide which of the array
elements you actually need to write to. And of course you'd then also
make assumptions on sizeof(int) which are more strict than our current
ones (which I'd like to avoid if possible).

Jan
Julien Grall Feb. 21, 2020, 2:11 p.m. UTC | #9
Hi Juergen,

On 21/02/2020 14:06, Jürgen Groß wrote:
> On 21.02.20 14:49, Julien Grall wrote:
>>
>>
>> On 21/02/2020 13:46, Jürgen Groß wrote:
>>> On 21.02.20 14:36, Jan Beulich wrote:
>>>> On 21.02.2020 10:10, Roger Pau Monné wrote:
>>>>> On Thu, Feb 20, 2020 at 07:20:06PM +0000, Julien Grall wrote:
>>>>>> Hi,
>>>>>>
>>>>>> [...]
>>>>>>
>>>>>>>    static inline void _write_unlock(rwlock_t *lock)
>>>>>>>    {
>>>>>>> -    /*
>>>>>>> -     * If the writer field is atomic, it can be cleared directly.
>>>>>>> -     * Otherwise, an atomic subtraction will be used to clear it.
>>>>>>> -     */
>>>>>>> -    atomic_sub(_QW_LOCKED, &lock->cnts);
>>>>>>> +    /* Since the writer field is atomic, it can be cleared 
>>>>>>> directly. */
>>>>>>> +    ASSERT(_is_write_locked_by_me(atomic_read(&lock->cnts)));
>>>>>>> +    BUILD_BUG_ON(_QR_SHIFT != 16);
>>>>>>> +    write_atomic((uint16_t *)&lock->cnts, 0);
>>>>>>
>>>>>> I think this is an abuse to cast an atomic_t() directly into a 
>>>>>> uint16_t. You
>>>>>> would at least want to use &lock->cnts.counter here.
>>>>>
>>>>> Sure, I was wondering about this myself.
>>>>>
>>>>> Will wait for more comments, not sure whether this can be fixed upon
>>>>> commit if there are no other issues.
>>>>
>>>> It's more than just adding another field specifier here. A cast like
>>>> this one is endianness-unsafe, and hence a trap waiting for a big
>>>> endian port attempt to fall into. At the very least this should cause
>>>> a build failure on big endian systems, even better would be if it was
>>>> endianness-safe.
>>>
>>> Wouldn't a union be the better choice?
>>
>> You would not be able to use atomic_t in that case as you can't assume 
>> the layout of the structure.
> 
> union rwlockword {
>      atomic_t cnts;
>      uint32_t val;
>      struct {
>          uint16_t write;
>          uint16_t readers;
>      };
> };
> 
> static inline const uint32_t _qr_bias(
>      const union rwlockword {
>          .write = 0,
>          .readers = 1
>      } x;
> 
>      return x.val;
> }
> 
> ...
>      cnts = atomic_add_return(_qr_bias(), &lock->cnts);
> ...
> 
> I guess this should do the trick, no?

You are assuming the atomic_t layout which I would rather no want to happen.

Cheers,
Jan Beulich Feb. 21, 2020, 2:16 p.m. UTC | #10
On 21.02.2020 15:06, Jürgen Groß wrote:
> On 21.02.20 14:49, Julien Grall wrote:
>>
>>
>> On 21/02/2020 13:46, Jürgen Groß wrote:
>>> On 21.02.20 14:36, Jan Beulich wrote:
>>>> On 21.02.2020 10:10, Roger Pau Monné wrote:
>>>>> On Thu, Feb 20, 2020 at 07:20:06PM +0000, Julien Grall wrote:
>>>>>> Hi,
>>>>>>
>>>>>> [...]
>>>>>>
>>>>>>>    static inline void _write_unlock(rwlock_t *lock)
>>>>>>>    {
>>>>>>> -    /*
>>>>>>> -     * If the writer field is atomic, it can be cleared directly.
>>>>>>> -     * Otherwise, an atomic subtraction will be used to clear it.
>>>>>>> -     */
>>>>>>> -    atomic_sub(_QW_LOCKED, &lock->cnts);
>>>>>>> +    /* Since the writer field is atomic, it can be cleared 
>>>>>>> directly. */
>>>>>>> +    ASSERT(_is_write_locked_by_me(atomic_read(&lock->cnts)));
>>>>>>> +    BUILD_BUG_ON(_QR_SHIFT != 16);
>>>>>>> +    write_atomic((uint16_t *)&lock->cnts, 0);
>>>>>>
>>>>>> I think this is an abuse to cast an atomic_t() directly into a 
>>>>>> uint16_t. You
>>>>>> would at least want to use &lock->cnts.counter here.
>>>>>
>>>>> Sure, I was wondering about this myself.
>>>>>
>>>>> Will wait for more comments, not sure whether this can be fixed upon
>>>>> commit if there are no other issues.
>>>>
>>>> It's more than just adding another field specifier here. A cast like
>>>> this one is endianness-unsafe, and hence a trap waiting for a big
>>>> endian port attempt to fall into. At the very least this should cause
>>>> a build failure on big endian systems, even better would be if it was
>>>> endianness-safe.
>>>
>>> Wouldn't a union be the better choice?
>>
>> You would not be able to use atomic_t in that case as you can't assume 
>> the layout of the structure.
> 
> union rwlockword {
>      atomic_t cnts;
>      uint32_t val;
>      struct {
>          uint16_t write;
>          uint16_t readers;
>      };
> };
> 
> static inline const uint32_t _qr_bias(
>      const union rwlockword {
>          .write = 0,
>          .readers = 1
>      } x;
> 
>      return x.val;
> }
> 
> ...
>      cnts = atomic_add_return(_qr_bias(), &lock->cnts);
> ...
> 
> I guess this should do the trick, no?

I'm afraid it won't, and not just because of the sizeof() aspect
already pointed out. Your x variable would end up like this in
memory:

little:	00 00 01 00
big:	00 00 00 01

which, read as 32-bit value, then ends up being

little:	00010000
big:	00000001

The add therefore would be able to spill into the high 16 bits.

Jan
Jürgen Groß Feb. 21, 2020, 2:16 p.m. UTC | #11
On 21.02.20 15:11, Julien Grall wrote:
> Hi Juergen,
> 
> On 21/02/2020 14:06, Jürgen Groß wrote:
>> On 21.02.20 14:49, Julien Grall wrote:
>>>
>>>
>>> On 21/02/2020 13:46, Jürgen Groß wrote:
>>>> On 21.02.20 14:36, Jan Beulich wrote:
>>>>> On 21.02.2020 10:10, Roger Pau Monné wrote:
>>>>>> On Thu, Feb 20, 2020 at 07:20:06PM +0000, Julien Grall wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> [...]
>>>>>>>
>>>>>>>>    static inline void _write_unlock(rwlock_t *lock)
>>>>>>>>    {
>>>>>>>> -    /*
>>>>>>>> -     * If the writer field is atomic, it can be cleared directly.
>>>>>>>> -     * Otherwise, an atomic subtraction will be used to clear it.
>>>>>>>> -     */
>>>>>>>> -    atomic_sub(_QW_LOCKED, &lock->cnts);
>>>>>>>> +    /* Since the writer field is atomic, it can be cleared 
>>>>>>>> directly. */
>>>>>>>> +    ASSERT(_is_write_locked_by_me(atomic_read(&lock->cnts)));
>>>>>>>> +    BUILD_BUG_ON(_QR_SHIFT != 16);
>>>>>>>> +    write_atomic((uint16_t *)&lock->cnts, 0);
>>>>>>>
>>>>>>> I think this is an abuse to cast an atomic_t() directly into a 
>>>>>>> uint16_t. You
>>>>>>> would at least want to use &lock->cnts.counter here.
>>>>>>
>>>>>> Sure, I was wondering about this myself.
>>>>>>
>>>>>> Will wait for more comments, not sure whether this can be fixed upon
>>>>>> commit if there are no other issues.
>>>>>
>>>>> It's more than just adding another field specifier here. A cast like
>>>>> this one is endianness-unsafe, and hence a trap waiting for a big
>>>>> endian port attempt to fall into. At the very least this should cause
>>>>> a build failure on big endian systems, even better would be if it was
>>>>> endianness-safe.
>>>>
>>>> Wouldn't a union be the better choice?
>>>
>>> You would not be able to use atomic_t in that case as you can't 
>>> assume the layout of the structure.
>>
>> union rwlockword {
>>      atomic_t cnts;
>>      uint32_t val;
>>      struct {
>>          uint16_t write;
>>          uint16_t readers;
>>      };
>> };
>>
>> static inline const uint32_t _qr_bias(
>>      const union rwlockword {
>>          .write = 0,
>>          .readers = 1
>>      } x;
>>
>>      return x.val;
>> }
>>
>> ...
>>      cnts = atomic_add_return(_qr_bias(), &lock->cnts);
>> ...
>>
>> I guess this should do the trick, no?
> 
> You are assuming the atomic_t layout which I would rather no want to 
> happen.

That already happened. rwlock.h already assumes atomic_t to be 32 bits
wide. I agree it would be better to have an atomic32_t type for this
use case.


Juergen
Jan Beulich Feb. 21, 2020, 2:17 p.m. UTC | #12
On 21.02.2020 15:16, Jürgen Groß wrote:
> On 21.02.20 15:11, Julien Grall wrote:
>> Hi Juergen,
>>
>> On 21/02/2020 14:06, Jürgen Groß wrote:
>>> On 21.02.20 14:49, Julien Grall wrote:
>>>>
>>>>
>>>> On 21/02/2020 13:46, Jürgen Groß wrote:
>>>>> On 21.02.20 14:36, Jan Beulich wrote:
>>>>>> On 21.02.2020 10:10, Roger Pau Monné wrote:
>>>>>>> On Thu, Feb 20, 2020 at 07:20:06PM +0000, Julien Grall wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> On 20/02/2020 17:31, Roger Pau Monne wrote:
>>>>>>>>> Allow a CPU already holding the lock in write mode to also lock 
>>>>>>>>> it in
>>>>>>>>> read mode. There's no harm in allowing read locking a rwlock that's
>>>>>>>>> already owned by the caller (ie: CPU) in write mode. Allowing such
>>>>>>>>> accesses is required at least for the CPU maps use-case.
>>>>>>>>>
>>>>>>>>> In order to do this reserve 14bits of the lock, this allows to 
>>>>>>>>> support
>>>>>>>>> up to 16384 CPUs. Also reduce the write lock mask to 2 bits: one to
>>>>>>>>> signal there are pending writers waiting on the lock and the 
>>>>>>>>> other to
>>>>>>>>> signal the lock is owned in write mode. Note the write related data
>>>>>>>>> is using 16bits, this is done in order to be able to clear it (and
>>>>>>>>> thus release the lock) using a 16bit atomic write.
>>>>>>>>>
>>>>>>>>> This reduces the maximum number of concurrent readers from 
>>>>>>>>> 16777216 to
>>>>>>>>> 65536, I think this should still be enough, or else the lock field
>>>>>>>>> can be expanded from 32 to 64bits if all architectures support 
>>>>>>>>> atomic
>>>>>>>>> operations on 64bit integers.
>>>>>>>>
>>>>>>>> FWIW, arm32 is able to support atomic operations on 64-bit integers.
>>>>>>>>
>>>>>>>>>    static inline void _write_unlock(rwlock_t *lock)
>>>>>>>>>    {
>>>>>>>>> -    /*
>>>>>>>>> -     * If the writer field is atomic, it can be cleared directly.
>>>>>>>>> -     * Otherwise, an atomic subtraction will be used to clear it.
>>>>>>>>> -     */
>>>>>>>>> -    atomic_sub(_QW_LOCKED, &lock->cnts);
>>>>>>>>> +    /* Since the writer field is atomic, it can be cleared 
>>>>>>>>> directly. */
>>>>>>>>> +    ASSERT(_is_write_locked_by_me(atomic_read(&lock->cnts)));
>>>>>>>>> +    BUILD_BUG_ON(_QR_SHIFT != 16);
>>>>>>>>> +    write_atomic((uint16_t *)&lock->cnts, 0);
>>>>>>>>
>>>>>>>> I think this is an abuse to cast an atomic_t() directly into a 
>>>>>>>> uint16_t. You
>>>>>>>> would at least want to use &lock->cnts.counter here.
>>>>>>>
>>>>>>> Sure, I was wondering about this myself.
>>>>>>>
>>>>>>> Will wait for more comments, not sure whether this can be fixed upon
>>>>>>> commit if there are no other issues.
>>>>>>
>>>>>> It's more than just adding another field specifier here. A cast like
>>>>>> this one is endianness-unsafe, and hence a trap waiting for a big
>>>>>> endian port attempt to fall into. At the very least this should cause
>>>>>> a build failure on big endian systems, even better would be if it was
>>>>>> endianness-safe.
>>>>>
>>>>> Wouldn't a union be the better choice?
>>>>
>>>> You would not be able to use atomic_t in that case as you can't 
>>>> assume the layout of the structure.
>>>
>>> union rwlockword {
>>>      atomic_t cnts;
>>>      uint32_t val;
>>>      struct {
>>>          uint16_t write;
>>>          uint16_t readers;
>>>      };
>>> };
>>>
>>> static inline const uint32_t _qr_bias(
>>>      const union rwlockword {
>>>          .write = 0,
>>>          .readers = 1
>>>      } x;
>>>
>>>      return x.val;
>>> }
>>>
>>> ...
>>>      cnts = atomic_add_return(_qr_bias(), &lock->cnts);
>>> ...
>>>
>>> I guess this should do the trick, no?
>>
>> You are assuming the atomic_t layout which I would rather no want to 
>> happen.
> 
> That already happened. rwlock.h already assumes atomic_t to be 32 bits
> wide.

Exactly 32 bits, or at least as many?

Jan
Jürgen Groß Feb. 21, 2020, 2:19 p.m. UTC | #13
On 21.02.20 15:16, Jan Beulich wrote:
> On 21.02.2020 15:06, Jürgen Groß wrote:
>> On 21.02.20 14:49, Julien Grall wrote:
>>>
>>>
>>> On 21/02/2020 13:46, Jürgen Groß wrote:
>>>> On 21.02.20 14:36, Jan Beulich wrote:
>>>>> On 21.02.2020 10:10, Roger Pau Monné wrote:
>>>>>> On Thu, Feb 20, 2020 at 07:20:06PM +0000, Julien Grall wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> On 20/02/2020 17:31, Roger Pau Monne wrote:
>>>>>>>> Allow a CPU already holding the lock in write mode to also lock it in
>>>>>>>> read mode. There's no harm in allowing read locking a rwlock that's
>>>>>>>> already owned by the caller (ie: CPU) in write mode. Allowing such
>>>>>>>> accesses is required at least for the CPU maps use-case.
>>>>>>>>
>>>>>>>> In order to do this reserve 14bits of the lock, this allows to
>>>>>>>> support
>>>>>>>> up to 16384 CPUs. Also reduce the write lock mask to 2 bits: one to
>>>>>>>> signal there are pending writers waiting on the lock and the other to
>>>>>>>> signal the lock is owned in write mode. Note the write related data
>>>>>>>> is using 16bits, this is done in order to be able to clear it (and
>>>>>>>> thus release the lock) using a 16bit atomic write.
>>>>>>>>
>>>>>>>> This reduces the maximum number of concurrent readers from
>>>>>>>> 16777216 to
>>>>>>>> 65536, I think this should still be enough, or else the lock field
>>>>>>>> can be expanded from 32 to 64bits if all architectures support atomic
>>>>>>>> operations on 64bit integers.
>>>>>>>
>>>>>>> FWIW, arm32 is able to support atomic operations on 64-bit integers.
>>>>>>>
>>>>>>>>     static inline void _write_unlock(rwlock_t *lock)
>>>>>>>>     {
>>>>>>>> -    /*
>>>>>>>> -     * If the writer field is atomic, it can be cleared directly.
>>>>>>>> -     * Otherwise, an atomic subtraction will be used to clear it.
>>>>>>>> -     */
>>>>>>>> -    atomic_sub(_QW_LOCKED, &lock->cnts);
>>>>>>>> +    /* Since the writer field is atomic, it can be cleared
>>>>>>>> directly. */
>>>>>>>> +    ASSERT(_is_write_locked_by_me(atomic_read(&lock->cnts)));
>>>>>>>> +    BUILD_BUG_ON(_QR_SHIFT != 16);
>>>>>>>> +    write_atomic((uint16_t *)&lock->cnts, 0);
>>>>>>>
>>>>>>> I think this is an abuse to cast an atomic_t() directly into a
>>>>>>> uint16_t. You
>>>>>>> would at least want to use &lock->cnts.counter here.
>>>>>>
>>>>>> Sure, I was wondering about this myself.
>>>>>>
>>>>>> Will wait for more comments, not sure whether this can be fixed upon
>>>>>> commit if there are no other issues.
>>>>>
>>>>> It's more than just adding another field specifier here. A cast like
>>>>> this one is endianness-unsafe, and hence a trap waiting for a big
>>>>> endian port attempt to fall into. At the very least this should cause
>>>>> a build failure on big endian systems, even better would be if it was
>>>>> endianness-safe.
>>>>
>>>> Wouldn't a union be the better choice?
>>>
>>> You would not be able to use atomic_t in that case as you can't assume
>>> the layout of the structure.
>>
>> union rwlockword {
>>       atomic_t cnts;
>>       uint32_t val;
>>       struct {
>>           uint16_t write;
>>           uint16_t readers;
>>       };
>> };
>>
>> static inline const uint32_t _qr_bias(
>>       const union rwlockword {
>>           .write = 0,
>>           .readers = 1
>>       } x;
>>
>>       return x.val;
>> }
>>
>> ...
>>       cnts = atomic_add_return(_qr_bias(), &lock->cnts);
>> ...
>>
>> I guess this should do the trick, no?
> 
> I'm afraid it won't, and not just because of the sizeof() aspect
> already pointed out. Your x variable would end up like this in
> memory:
> 
> little:	00 00 01 00
> big:	00 00 00 01 => 00000001
> 
> which, read as 32-bit value, then ends up being
> 
> little:	00010000
> big:	00000001
> 
> The add therefore would be able to spill into the high 16 bits.

And why exactly is this worse than just dropping spilled bits?
Both cases will explode rather soon. Both cases can be avoided by,
e.g., ASSERTing that the reader count isn't ~0 before incrementing
it (or that it isn't zero afterwards).


Juergen
Jürgen Groß Feb. 21, 2020, 2:24 p.m. UTC | #14
On 21.02.20 15:17, Jan Beulich wrote:
> On 21.02.2020 15:16, Jürgen Groß wrote:
>> On 21.02.20 15:11, Julien Grall wrote:
>>> Hi Juergen,
>>>
>>> On 21/02/2020 14:06, Jürgen Groß wrote:
>>>> On 21.02.20 14:49, Julien Grall wrote:
>>>>>
>>>>>
>>>>> On 21/02/2020 13:46, Jürgen Groß wrote:
>>>>>> On 21.02.20 14:36, Jan Beulich wrote:
>>>>>>> On 21.02.2020 10:10, Roger Pau Monné wrote:
>>>>>>>> On Thu, Feb 20, 2020 at 07:20:06PM +0000, Julien Grall wrote:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> On 20/02/2020 17:31, Roger Pau Monne wrote:
>>>>>>>>>> Allow a CPU already holding the lock in write mode to also lock
>>>>>>>>>> it in
>>>>>>>>>> read mode. There's no harm in allowing read locking a rwlock that's
>>>>>>>>>> already owned by the caller (ie: CPU) in write mode. Allowing such
>>>>>>>>>> accesses is required at least for the CPU maps use-case.
>>>>>>>>>>
>>>>>>>>>> In order to do this reserve 14bits of the lock, this allows to
>>>>>>>>>> support
>>>>>>>>>> up to 16384 CPUs. Also reduce the write lock mask to 2 bits: one to
>>>>>>>>>> signal there are pending writers waiting on the lock and the
>>>>>>>>>> other to
>>>>>>>>>> signal the lock is owned in write mode. Note the write related data
>>>>>>>>>> is using 16bits, this is done in order to be able to clear it (and
>>>>>>>>>> thus release the lock) using a 16bit atomic write.
>>>>>>>>>>
>>>>>>>>>> This reduces the maximum number of concurrent readers from
>>>>>>>>>> 16777216 to
>>>>>>>>>> 65536, I think this should still be enough, or else the lock field
>>>>>>>>>> can be expanded from 32 to 64bits if all architectures support
>>>>>>>>>> atomic
>>>>>>>>>> operations on 64bit integers.
>>>>>>>>>
>>>>>>>>> FWIW, arm32 is able to support atomic operations on 64-bit integers.
>>>>>>>>>
>>>>>>>>>>     static inline void _write_unlock(rwlock_t *lock)
>>>>>>>>>>     {
>>>>>>>>>> -    /*
>>>>>>>>>> -     * If the writer field is atomic, it can be cleared directly.
>>>>>>>>>> -     * Otherwise, an atomic subtraction will be used to clear it.
>>>>>>>>>> -     */
>>>>>>>>>> -    atomic_sub(_QW_LOCKED, &lock->cnts);
>>>>>>>>>> +    /* Since the writer field is atomic, it can be cleared
>>>>>>>>>> directly. */
>>>>>>>>>> +    ASSERT(_is_write_locked_by_me(atomic_read(&lock->cnts)));
>>>>>>>>>> +    BUILD_BUG_ON(_QR_SHIFT != 16);
>>>>>>>>>> +    write_atomic((uint16_t *)&lock->cnts, 0);
>>>>>>>>>
>>>>>>>>> I think this is an abuse to cast an atomic_t() directly into a
>>>>>>>>> uint16_t. You
>>>>>>>>> would at least want to use &lock->cnts.counter here.
>>>>>>>>
>>>>>>>> Sure, I was wondering about this myself.
>>>>>>>>
>>>>>>>> Will wait for more comments, not sure whether this can be fixed upon
>>>>>>>> commit if there are no other issues.
>>>>>>>
>>>>>>> It's more than just adding another field specifier here. A cast like
>>>>>>> this one is endianness-unsafe, and hence a trap waiting for a big
>>>>>>> endian port attempt to fall into. At the very least this should cause
>>>>>>> a build failure on big endian systems, even better would be if it was
>>>>>>> endianness-safe.
>>>>>>
>>>>>> Wouldn't a union be the better choice?
>>>>>
>>>>> You would not be able to use atomic_t in that case as you can't
>>>>> assume the layout of the structure.
>>>>
>>>> union rwlockword {
>>>>       atomic_t cnts;
>>>>       uint32_t val;
>>>>       struct {
>>>>           uint16_t write;
>>>>           uint16_t readers;
>>>>       };
>>>> };
>>>>
>>>> static inline const uint32_t _qr_bias(
>>>>       const union rwlockword {
>>>>           .write = 0,
>>>>           .readers = 1
>>>>       } x;
>>>>
>>>>       return x.val;
>>>> }
>>>>
>>>> ...
>>>>       cnts = atomic_add_return(_qr_bias(), &lock->cnts);
>>>> ...
>>>>
>>>> I guess this should do the trick, no?
>>>
>>> You are assuming the atomic_t layout which I would rather no want to
>>> happen.
>>
>> That already happened. rwlock.h already assumes atomic_t to be 32 bits
>> wide.
> 
> Exactly 32 bits, or at least as many?

The value is read into a u32 variable and then an update via cmpxchg()
is done using that value for comparison. So any additional bits would be
lost and the update would never succeed (see queue_write_lock_slowpath()).


Juergen
Roger Pau Monné Feb. 21, 2020, 2:26 p.m. UTC | #15
On Fri, Feb 21, 2020 at 02:36:52PM +0100, Jan Beulich wrote:
> On 21.02.2020 10:10, Roger Pau Monné wrote:
> > On Thu, Feb 20, 2020 at 07:20:06PM +0000, Julien Grall wrote:
> >> Hi,
> >>
> >> On 20/02/2020 17:31, Roger Pau Monne wrote:
> >>> Allow a CPU already holding the lock in write mode to also lock it in
> >>> read mode. There's no harm in allowing read locking a rwlock that's
> >>> already owned by the caller (ie: CPU) in write mode. Allowing such
> >>> accesses is required at least for the CPU maps use-case.
> >>>
> >>> In order to do this reserve 14bits of the lock, this allows to support
> >>> up to 16384 CPUs. Also reduce the write lock mask to 2 bits: one to
> >>> signal there are pending writers waiting on the lock and the other to
> >>> signal the lock is owned in write mode. Note the write related data
> >>> is using 16bits, this is done in order to be able to clear it (and
> >>> thus release the lock) using a 16bit atomic write.
> >>>
> >>> This reduces the maximum number of concurrent readers from 16777216 to
> >>> 65536, I think this should still be enough, or else the lock field
> >>> can be expanded from 32 to 64bits if all architectures support atomic
> >>> operations on 64bit integers.
> >>
> >> FWIW, arm32 is able to support atomic operations on 64-bit integers.
> >>
> >>>   static inline void _write_unlock(rwlock_t *lock)
> >>>   {
> >>> -    /*
> >>> -     * If the writer field is atomic, it can be cleared directly.
> >>> -     * Otherwise, an atomic subtraction will be used to clear it.
> >>> -     */
> >>> -    atomic_sub(_QW_LOCKED, &lock->cnts);
> >>> +    /* Since the writer field is atomic, it can be cleared directly. */
> >>> +    ASSERT(_is_write_locked_by_me(atomic_read(&lock->cnts)));
> >>> +    BUILD_BUG_ON(_QR_SHIFT != 16);
> >>> +    write_atomic((uint16_t *)&lock->cnts, 0);
> >>
> >> I think this is an abuse to cast an atomic_t() directly into a uint16_t. You
> >> would at least want to use &lock->cnts.counter here.
> > 
> > Sure, I was wondering about this myself.
> > 
> > Will wait for more comments, not sure whether this can be fixed upon
> > commit if there are no other issues.
> 
> It's more than just adding another field specifier here. A cast like
> this one is endianness-unsafe, and hence a trap waiting for a big
> endian port attempt to fall into. At the very least this should cause
> a build failure on big endian systems, even better would be if it was
> endianness-safe.

Why don't we leave the atomic_dec then?

The ASSERT I've added can be turned into a BUG_ON, and that's likely
even safer than what we currently have.

Thanks, Roger.
Roger Pau Monné Feb. 21, 2020, 2:32 p.m. UTC | #16
On Fri, Feb 21, 2020 at 03:26:35PM +0100, Roger Pau Monné wrote:
> On Fri, Feb 21, 2020 at 02:36:52PM +0100, Jan Beulich wrote:
> > On 21.02.2020 10:10, Roger Pau Monné wrote:
> > > On Thu, Feb 20, 2020 at 07:20:06PM +0000, Julien Grall wrote:
> > >> Hi,
> > >>
> > >> On 20/02/2020 17:31, Roger Pau Monne wrote:
> > >>> Allow a CPU already holding the lock in write mode to also lock it in
> > >>> read mode. There's no harm in allowing read locking a rwlock that's
> > >>> already owned by the caller (ie: CPU) in write mode. Allowing such
> > >>> accesses is required at least for the CPU maps use-case.
> > >>>
> > >>> In order to do this reserve 14bits of the lock, this allows to support
> > >>> up to 16384 CPUs. Also reduce the write lock mask to 2 bits: one to
> > >>> signal there are pending writers waiting on the lock and the other to
> > >>> signal the lock is owned in write mode. Note the write related data
> > >>> is using 16bits, this is done in order to be able to clear it (and
> > >>> thus release the lock) using a 16bit atomic write.
> > >>>
> > >>> This reduces the maximum number of concurrent readers from 16777216 to
> > >>> 65536, I think this should still be enough, or else the lock field
> > >>> can be expanded from 32 to 64bits if all architectures support atomic
> > >>> operations on 64bit integers.
> > >>
> > >> FWIW, arm32 is able to support atomic operations on 64-bit integers.
> > >>
> > >>>   static inline void _write_unlock(rwlock_t *lock)
> > >>>   {
> > >>> -    /*
> > >>> -     * If the writer field is atomic, it can be cleared directly.
> > >>> -     * Otherwise, an atomic subtraction will be used to clear it.
> > >>> -     */
> > >>> -    atomic_sub(_QW_LOCKED, &lock->cnts);
> > >>> +    /* Since the writer field is atomic, it can be cleared directly. */
> > >>> +    ASSERT(_is_write_locked_by_me(atomic_read(&lock->cnts)));
> > >>> +    BUILD_BUG_ON(_QR_SHIFT != 16);
> > >>> +    write_atomic((uint16_t *)&lock->cnts, 0);
> > >>
> > >> I think this is an abuse to cast an atomic_t() directly into a uint16_t. You
> > >> would at least want to use &lock->cnts.counter here.
> > > 
> > > Sure, I was wondering about this myself.
> > > 
> > > Will wait for more comments, not sure whether this can be fixed upon
> > > commit if there are no other issues.
> > 
> > It's more than just adding another field specifier here. A cast like
> > this one is endianness-unsafe, and hence a trap waiting for a big
> > endian port attempt to fall into. At the very least this should cause
> > a build failure on big endian systems, even better would be if it was
> > endianness-safe.
> 
> Why don't we leave the atomic_dec then?

Sorry, that should be atomic_sub, not atomic_dec.

Thanks, Roger.
Julien Grall Feb. 21, 2020, 2:32 p.m. UTC | #17
On 21/02/2020 14:16, Jürgen Groß wrote:
> On 21.02.20 15:11, Julien Grall wrote:
>> Hi Juergen,
>>
>> On 21/02/2020 14:06, Jürgen Groß wrote:
>>> On 21.02.20 14:49, Julien Grall wrote:
>>>>
>>>>
>>>> On 21/02/2020 13:46, Jürgen Groß wrote:
>>>>> On 21.02.20 14:36, Jan Beulich wrote:
>>>>>> On 21.02.2020 10:10, Roger Pau Monné wrote:
>>>>>>> On Thu, Feb 20, 2020 at 07:20:06PM +0000, Julien Grall wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> On 20/02/2020 17:31, Roger Pau Monne wrote:
>>>>>>>>> Allow a CPU already holding the lock in write mode to also lock 
>>>>>>>>> it in
>>>>>>>>> read mode. There's no harm in allowing read locking a rwlock 
>>>>>>>>> that's
>>>>>>>>> already owned by the caller (ie: CPU) in write mode. Allowing such
>>>>>>>>> accesses is required at least for the CPU maps use-case.
>>>>>>>>>
>>>>>>>>> In order to do this reserve 14bits of the lock, this allows to 
>>>>>>>>> support
>>>>>>>>> up to 16384 CPUs. Also reduce the write lock mask to 2 bits: 
>>>>>>>>> one to
>>>>>>>>> signal there are pending writers waiting on the lock and the 
>>>>>>>>> other to
>>>>>>>>> signal the lock is owned in write mode. Note the write related 
>>>>>>>>> data
>>>>>>>>> is using 16bits, this is done in order to be able to clear it (and
>>>>>>>>> thus release the lock) using a 16bit atomic write.
>>>>>>>>>
>>>>>>>>> This reduces the maximum number of concurrent readers from 
>>>>>>>>> 16777216 to
>>>>>>>>> 65536, I think this should still be enough, or else the lock field
>>>>>>>>> can be expanded from 32 to 64bits if all architectures support 
>>>>>>>>> atomic
>>>>>>>>> operations on 64bit integers.
>>>>>>>>
>>>>>>>> FWIW, arm32 is able to support atomic operations on 64-bit 
>>>>>>>> integers.
>>>>>>>>
>>>>>>>>>    static inline void _write_unlock(rwlock_t *lock)
>>>>>>>>>    {
>>>>>>>>> -    /*
>>>>>>>>> -     * If the writer field is atomic, it can be cleared directly.
>>>>>>>>> -     * Otherwise, an atomic subtraction will be used to clear it.
>>>>>>>>> -     */
>>>>>>>>> -    atomic_sub(_QW_LOCKED, &lock->cnts);
>>>>>>>>> +    /* Since the writer field is atomic, it can be cleared 
>>>>>>>>> directly. */
>>>>>>>>> +    ASSERT(_is_write_locked_by_me(atomic_read(&lock->cnts)));
>>>>>>>>> +    BUILD_BUG_ON(_QR_SHIFT != 16);
>>>>>>>>> +    write_atomic((uint16_t *)&lock->cnts, 0);
>>>>>>>>
>>>>>>>> I think this is an abuse to cast an atomic_t() directly into a 
>>>>>>>> uint16_t. You
>>>>>>>> would at least want to use &lock->cnts.counter here.
>>>>>>>
>>>>>>> Sure, I was wondering about this myself.
>>>>>>>
>>>>>>> Will wait for more comments, not sure whether this can be fixed upon
>>>>>>> commit if there are no other issues.
>>>>>>
>>>>>> It's more than just adding another field specifier here. A cast like
>>>>>> this one is endianness-unsafe, and hence a trap waiting for a big
>>>>>> endian port attempt to fall into. At the very least this should cause
>>>>>> a build failure on big endian systems, even better would be if it was
>>>>>> endianness-safe.
>>>>>
>>>>> Wouldn't a union be the better choice?
>>>>
>>>> You would not be able to use atomic_t in that case as you can't 
>>>> assume the layout of the structure.
>>>
>>> union rwlockword {
>>>      atomic_t cnts;
>>>      uint32_t val;
>>>      struct {
>>>          uint16_t write;
>>>          uint16_t readers;
>>>      };
>>> };
>>>
>>> static inline const uint32_t _qr_bias(
>>>      const union rwlockword {
>>>          .write = 0,
>>>          .readers = 1
>>>      } x;
>>>
>>>      return x.val;
>>> }
>>>
>>> ...
>>>      cnts = atomic_add_return(_qr_bias(), &lock->cnts);
>>> ...
>>>
>>> I guess this should do the trick, no?
>>
>> You are assuming the atomic_t layout which I would rather no want to 
>> happen.
> 
> That already happened. rwlock.h already assumes atomic_t to be 32 bits
> wide. I agree it would be better to have an atomic32_t type for this
> use case.

I don't think you understood my point here. An atomic32_t will not be 
any better, as you would still assume the layout of the structure, i.e. 
that the structure has only one field.

I don't know if this is going to bite us in the future, but I don't want 
to attempt to use such a union, at least not without a safeguard in Xen 
(i.e. a BUILD_BUG_ON(...)).

Cheers,
Jürgen Groß Feb. 21, 2020, 2:35 p.m. UTC | #18
On 21.02.20 15:32, Julien Grall wrote:
> 
> 
> On 21/02/2020 14:16, Jürgen Groß wrote:
>> On 21.02.20 15:11, Julien Grall wrote:
>>> Hi Juergen,
>>>
>>> On 21/02/2020 14:06, Jürgen Groß wrote:
>>>> On 21.02.20 14:49, Julien Grall wrote:
>>>>>
>>>>>
>>>>> On 21/02/2020 13:46, Jürgen Groß wrote:
>>>>>> On 21.02.20 14:36, Jan Beulich wrote:
>>>>>>> On 21.02.2020 10:10, Roger Pau Monné wrote:
>>>>>>>> On Thu, Feb 20, 2020 at 07:20:06PM +0000, Julien Grall wrote:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> On 20/02/2020 17:31, Roger Pau Monne wrote:
>>>>>>>>>> Allow a CPU already holding the lock in write mode to also 
>>>>>>>>>> lock it in
>>>>>>>>>> read mode. There's no harm in allowing read locking a rwlock 
>>>>>>>>>> that's
>>>>>>>>>> already owned by the caller (ie: CPU) in write mode. Allowing 
>>>>>>>>>> such
>>>>>>>>>> accesses is required at least for the CPU maps use-case.
>>>>>>>>>>
>>>>>>>>>> In order to do this reserve 14bits of the lock, this allows to 
>>>>>>>>>> support
>>>>>>>>>> up to 16384 CPUs. Also reduce the write lock mask to 2 bits: 
>>>>>>>>>> one to
>>>>>>>>>> signal there are pending writers waiting on the lock and the 
>>>>>>>>>> other to
>>>>>>>>>> signal the lock is owned in write mode. Note the write related 
>>>>>>>>>> data
>>>>>>>>>> is using 16bits, this is done in order to be able to clear it 
>>>>>>>>>> (and
>>>>>>>>>> thus release the lock) using a 16bit atomic write.
>>>>>>>>>>
>>>>>>>>>> This reduces the maximum number of concurrent readers from 
>>>>>>>>>> 16777216 to
>>>>>>>>>> 65536, I think this should still be enough, or else the lock 
>>>>>>>>>> field
>>>>>>>>>> can be expanded from 32 to 64bits if all architectures support 
>>>>>>>>>> atomic
>>>>>>>>>> operations on 64bit integers.
>>>>>>>>>
>>>>>>>>> FWIW, arm32 is able to support atomic operations on 64-bit 
>>>>>>>>> integers.
>>>>>>>>>
>>>>>>>>>>    static inline void _write_unlock(rwlock_t *lock)
>>>>>>>>>>    {
>>>>>>>>>> -    /*
>>>>>>>>>> -     * If the writer field is atomic, it can be cleared 
>>>>>>>>>> directly.
>>>>>>>>>> -     * Otherwise, an atomic subtraction will be used to clear 
>>>>>>>>>> it.
>>>>>>>>>> -     */
>>>>>>>>>> -    atomic_sub(_QW_LOCKED, &lock->cnts);
>>>>>>>>>> +    /* Since the writer field is atomic, it can be cleared 
>>>>>>>>>> directly. */
>>>>>>>>>> +    ASSERT(_is_write_locked_by_me(atomic_read(&lock->cnts)));
>>>>>>>>>> +    BUILD_BUG_ON(_QR_SHIFT != 16);
>>>>>>>>>> +    write_atomic((uint16_t *)&lock->cnts, 0);
>>>>>>>>>
>>>>>>>>> I think this is an abuse to cast an atomic_t() directly into a 
>>>>>>>>> uint16_t. You
>>>>>>>>> would at least want to use &lock->cnts.counter here.
>>>>>>>>
>>>>>>>> Sure, I was wondering about this myself.
>>>>>>>>
>>>>>>>> Will wait for more comments, not sure whether this can be fixed 
>>>>>>>> upon
>>>>>>>> commit if there are no other issues.
>>>>>>>
>>>>>>> It's more than just adding another field specifier here. A cast like
>>>>>>> this one is endianness-unsafe, and hence a trap waiting for a big
>>>>>>> endian port attempt to fall into. At the very least this should 
>>>>>>> cause
>>>>>>> a build failure on big endian systems, even better would be if it 
>>>>>>> was
>>>>>>> endianness-safe.
>>>>>>
>>>>>> Wouldn't a union be the better choice?
>>>>>
>>>>> You would not be able to use atomic_t in that case as you can't 
>>>>> assume the layout of the structure.
>>>>
>>>> union rwlockword {
>>>>      atomic_t cnts;
>>>>      uint32_t val;
>>>>      struct {
>>>>          uint16_t write;
>>>>          uint16_t readers;
>>>>      };
>>>> };
>>>>
>>>> static inline const uint32_t _qr_bias(
>>>>      const union rwlockword {
>>>>          .write = 0,
>>>>          .readers = 1
>>>>      } x;
>>>>
>>>>      return x.val;
>>>> }
>>>>
>>>> ...
>>>>      cnts = atomic_add_return(_qr_bias(), &lock->cnts);
>>>> ...
>>>>
>>>> I guess this should do the trick, no?
>>>
>>> You are assuming the atomic_t layout which I would rather no want to 
>>> happen.
>>
>> That already happened. rwlock.h already assumes atomic_t to be 32 bits
>> wide. I agree it would be better to have an atomic32_t type for this
>> use case.
> 
> I don't think you understood my point here. An atomic32_t will not be 
> better as you still assume the layout of the structure. I.e the 
> structure has only one field.

I don't understand your reasoning here, sorry.

Are you saying that a structure of two uint16_t isn't known to be 32
bits long?


Juergen
Jan Beulich Feb. 21, 2020, 2:37 p.m. UTC | #19
On 21.02.2020 15:24, Jürgen Groß wrote:
> On 21.02.20 15:17, Jan Beulich wrote:
>> On 21.02.2020 15:16, Jürgen Groß wrote:
>>> On 21.02.20 15:11, Julien Grall wrote:
>>>> Hi Juergen,
>>>>
>>>> On 21/02/2020 14:06, Jürgen Groß wrote:
>>>>> On 21.02.20 14:49, Julien Grall wrote:
>>>>>>
>>>>>>
>>>>>> On 21/02/2020 13:46, Jürgen Groß wrote:
>>>>>>> On 21.02.20 14:36, Jan Beulich wrote:
>>>>>>>> On 21.02.2020 10:10, Roger Pau Monné wrote:
>>>>>>>>> On Thu, Feb 20, 2020 at 07:20:06PM +0000, Julien Grall wrote:
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> On 20/02/2020 17:31, Roger Pau Monne wrote:
>>>>>>>>>>> Allow a CPU already holding the lock in write mode to also lock
>>>>>>>>>>> it in
>>>>>>>>>>> read mode. There's no harm in allowing read locking a rwlock that's
>>>>>>>>>>> already owned by the caller (ie: CPU) in write mode. Allowing such
>>>>>>>>>>> accesses is required at least for the CPU maps use-case.
>>>>>>>>>>>
>>>>>>>>>>> In order to do this reserve 14bits of the lock, this allows to
>>>>>>>>>>> support
>>>>>>>>>>> up to 16384 CPUs. Also reduce the write lock mask to 2 bits: one to
>>>>>>>>>>> signal there are pending writers waiting on the lock and the
>>>>>>>>>>> other to
>>>>>>>>>>> signal the lock is owned in write mode. Note the write related data
>>>>>>>>>>> is using 16bits, this is done in order to be able to clear it (and
>>>>>>>>>>> thus release the lock) using a 16bit atomic write.
>>>>>>>>>>>
>>>>>>>>>>> This reduces the maximum number of concurrent readers from
>>>>>>>>>>> 16777216 to
>>>>>>>>>>> 65536, I think this should still be enough, or else the lock field
>>>>>>>>>>> can be expanded from 32 to 64bits if all architectures support
>>>>>>>>>>> atomic
>>>>>>>>>>> operations on 64bit integers.
>>>>>>>>>>
>>>>>>>>>> FWIW, arm32 is able to support atomic operations on 64-bit integers.
>>>>>>>>>>
>>>>>>>>>>>     static inline void _write_unlock(rwlock_t *lock)
>>>>>>>>>>>     {
>>>>>>>>>>> -    /*
>>>>>>>>>>> -     * If the writer field is atomic, it can be cleared directly.
>>>>>>>>>>> -     * Otherwise, an atomic subtraction will be used to clear it.
>>>>>>>>>>> -     */
>>>>>>>>>>> -    atomic_sub(_QW_LOCKED, &lock->cnts);
>>>>>>>>>>> +    /* Since the writer field is atomic, it can be cleared
>>>>>>>>>>> directly. */
>>>>>>>>>>> +    ASSERT(_is_write_locked_by_me(atomic_read(&lock->cnts)));
>>>>>>>>>>> +    BUILD_BUG_ON(_QR_SHIFT != 16);
>>>>>>>>>>> +    write_atomic((uint16_t *)&lock->cnts, 0);
>>>>>>>>>>
>>>>>>>>>> I think this is an abuse to cast an atomic_t() directly into a
>>>>>>>>>> uint16_t. You
>>>>>>>>>> would at least want to use &lock->cnts.counter here.
>>>>>>>>>
>>>>>>>>> Sure, I was wondering about this myself.
>>>>>>>>>
>>>>>>>>> Will wait for more comments, not sure whether this can be fixed upon
>>>>>>>>> commit if there are no other issues.
>>>>>>>>
>>>>>>>> It's more than just adding another field specifier here. A cast like
>>>>>>>> this one is endianness-unsafe, and hence a trap waiting for a big
>>>>>>>> endian port attempt to fall into. At the very least this should cause
>>>>>>>> a build failure on big endian systems, even better would be if it was
>>>>>>>> endianness-safe.
>>>>>>>
>>>>>>> Wouldn't a union be the better choice?
>>>>>>
>>>>>> You would not be able to use atomic_t in that case as you can't
>>>>>> assume the layout of the structure.
>>>>>
>>>>> union rwlockword {
>>>>>       atomic_t cnts;
>>>>>       uint32_t val;
>>>>>       struct {
>>>>>           uint16_t write;
>>>>>           uint16_t readers;
>>>>>       };
>>>>> };
>>>>>
>>>>> static inline const uint32_t _qr_bias(
>>>>>       const union rwlockword {
>>>>>           .write = 0,
>>>>>           .readers = 1
>>>>>       } x;
>>>>>
>>>>>       return x.val;
>>>>> }
>>>>>
>>>>> ...
>>>>>       cnts = atomic_add_return(_qr_bias(), &lock->cnts);
>>>>> ...
>>>>>
>>>>> I guess this should do the trick, no?
>>>>
>>>> You are assuming the atomic_t layout which I would rather no want to
>>>> happen.
>>>
>>> That already happened. rwlock.h already assumes atomic_t to be 32 bits
>>> wide.
>>
>> Exactly 32 bits, or at least as many?
> 
> Value is read into a u32 variable and then an update via cmpxchg() is
> done using that value for comparison. So any additional bits will be
> lost and update will never succeed (see queue_write_lock_slowpath()).

Well, the local variables should be of type int (or possibly unsigned
int), but the logic there isn't tied to anything else being exactly
32 bits wide. These local variables are merely an example of misguided
use of fixed-width types.

Jan
Jan Beulich Feb. 21, 2020, 2:41 p.m. UTC | #20
On 21.02.2020 15:26, Roger Pau Monné wrote:
> On Fri, Feb 21, 2020 at 02:36:52PM +0100, Jan Beulich wrote:
>> On 21.02.2020 10:10, Roger Pau Monné wrote:
>>> On Thu, Feb 20, 2020 at 07:20:06PM +0000, Julien Grall wrote:
>>>> Hi,
>>>>
>>>> On 20/02/2020 17:31, Roger Pau Monne wrote:
>>>>> Allow a CPU already holding the lock in write mode to also lock it in
>>>>> read mode. There's no harm in allowing read locking a rwlock that's
>>>>> already owned by the caller (ie: CPU) in write mode. Allowing such
>>>>> accesses is required at least for the CPU maps use-case.
>>>>>
>>>>> In order to do this reserve 14bits of the lock, this allows to support
>>>>> up to 16384 CPUs. Also reduce the write lock mask to 2 bits: one to
>>>>> signal there are pending writers waiting on the lock and the other to
>>>>> signal the lock is owned in write mode. Note the write related data
>>>>> is using 16bits, this is done in order to be able to clear it (and
>>>>> thus release the lock) using a 16bit atomic write.
>>>>>
>>>>> This reduces the maximum number of concurrent readers from 16777216 to
>>>>> 65536, I think this should still be enough, or else the lock field
>>>>> can be expanded from 32 to 64bits if all architectures support atomic
>>>>> operations on 64bit integers.
>>>>
>>>> FWIW, arm32 is able to support atomic operations on 64-bit integers.
>>>>
>>>>>   static inline void _write_unlock(rwlock_t *lock)
>>>>>   {
>>>>> -    /*
>>>>> -     * If the writer field is atomic, it can be cleared directly.
>>>>> -     * Otherwise, an atomic subtraction will be used to clear it.
>>>>> -     */
>>>>> -    atomic_sub(_QW_LOCKED, &lock->cnts);
>>>>> +    /* Since the writer field is atomic, it can be cleared directly. */
>>>>> +    ASSERT(_is_write_locked_by_me(atomic_read(&lock->cnts)));
>>>>> +    BUILD_BUG_ON(_QR_SHIFT != 16);
>>>>> +    write_atomic((uint16_t *)&lock->cnts, 0);
>>>>
>>>> I think this is an abuse to cast an atomic_t() directly into a uint16_t. You
>>>> would at least want to use &lock->cnts.counter here.
>>>
>>> Sure, I was wondering about this myself.
>>>
>>> Will wait for more comments, not sure whether this can be fixed upon
>>> commit if there are no other issues.
>>
>> It's more than just adding another field specifier here. A cast like
>> this one is endianness-unsafe, and hence a trap waiting for a big
>> endian port attempt to fall into. At the very least this should cause
>> a build failure on big endian systems, even better would be if it was
>> endianness-safe.
> 
> Why don't we leave the atomic_dec then?

Because you need to invoke smp_processor_id() to calculate the value
to use in the subtraction. I'm not meaning to say I'm entirely
opposed (seeing how much of a discussion we're having), but the
"simple write of zero" approach is certainly appealing.

Jan
Roger Pau Monné Feb. 21, 2020, 2:49 p.m. UTC | #21
On Fri, Feb 21, 2020 at 03:41:59PM +0100, Jan Beulich wrote:
> On 21.02.2020 15:26, Roger Pau Monné wrote:
> > On Fri, Feb 21, 2020 at 02:36:52PM +0100, Jan Beulich wrote:
> >> On 21.02.2020 10:10, Roger Pau Monné wrote:
> >>> On Thu, Feb 20, 2020 at 07:20:06PM +0000, Julien Grall wrote:
> >>>> Hi,
> >>>>
> >>>> On 20/02/2020 17:31, Roger Pau Monne wrote:
> >>>>> Allow a CPU already holding the lock in write mode to also lock it in
> >>>>> read mode. There's no harm in allowing read locking a rwlock that's
> >>>>> already owned by the caller (ie: CPU) in write mode. Allowing such
> >>>>> accesses is required at least for the CPU maps use-case.
> >>>>>
> >>>>> In order to do this reserve 14bits of the lock, this allows to support
> >>>>> up to 16384 CPUs. Also reduce the write lock mask to 2 bits: one to
> >>>>> signal there are pending writers waiting on the lock and the other to
> >>>>> signal the lock is owned in write mode. Note the write related data
> >>>>> is using 16bits, this is done in order to be able to clear it (and
> >>>>> thus release the lock) using a 16bit atomic write.
> >>>>>
> >>>>> This reduces the maximum number of concurrent readers from 16777216 to
> >>>>> 65536, I think this should still be enough, or else the lock field
> >>>>> can be expanded from 32 to 64bits if all architectures support atomic
> >>>>> operations on 64bit integers.
> >>>>
> >>>> FWIW, arm32 is able to support atomic operations on 64-bit integers.
> >>>>
> >>>>>   static inline void _write_unlock(rwlock_t *lock)
> >>>>>   {
> >>>>> -    /*
> >>>>> -     * If the writer field is atomic, it can be cleared directly.
> >>>>> -     * Otherwise, an atomic subtraction will be used to clear it.
> >>>>> -     */
> >>>>> -    atomic_sub(_QW_LOCKED, &lock->cnts);
> >>>>> +    /* Since the writer field is atomic, it can be cleared directly. */
> >>>>> +    ASSERT(_is_write_locked_by_me(atomic_read(&lock->cnts)));
> >>>>> +    BUILD_BUG_ON(_QR_SHIFT != 16);
> >>>>> +    write_atomic((uint16_t *)&lock->cnts, 0);
> >>>>
> >>>> I think this is an abuse to cast an atomic_t() directly into a uint16_t. You
> >>>> would at least want to use &lock->cnts.counter here.
> >>>
> >>> Sure, I was wondering about this myself.
> >>>
> >>> Will wait for more comments, not sure whether this can be fixed upon
> >>> commit if there are no other issues.
> >>
> >> It's more than just adding another field specifier here. A cast like
> >> this one is endianness-unsafe, and hence a trap waiting for a big
> >> endian port attempt to fall into. At the very least this should cause
> >> a build failure on big endian systems, even better would be if it was
> >> endianness-safe.
> > 
> > Why don't we leave the atomic_dec then?
> 
> Because you need to invoke smp_processor_id() to calculate the value
> to use in the subtraction. I'm not meaning to say I'm entirely
> opposed (seeing how much of a discussion we're having), but the
> "simple write of zero" approach is certainly appealing.

Well, we could avoid the smp_processor_id() call and instead use:

atomic_sub(atomic_read(&lock->cnts) & 0xffff, &lock->cnts);

Note mask is using the low 16bits now, but if we go the atomic_sub
route we could change the CPU ID fields to be 12 bits again and thus
have some more room for the readers count.

Thanks, Roger.
Jan Beulich Feb. 21, 2020, 2:50 p.m. UTC | #22
On 21.02.2020 15:19, Jürgen Groß wrote:
> On 21.02.20 15:16, Jan Beulich wrote:
>> On 21.02.2020 15:06, Jürgen Groß wrote:
>>> On 21.02.20 14:49, Julien Grall wrote:
>>>>
>>>>
>>>> On 21/02/2020 13:46, Jürgen Groß wrote:
>>>>> On 21.02.20 14:36, Jan Beulich wrote:
>>>>>> On 21.02.2020 10:10, Roger Pau Monné wrote:
>>>>>>> On Thu, Feb 20, 2020 at 07:20:06PM +0000, Julien Grall wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> On 20/02/2020 17:31, Roger Pau Monne wrote:
>>>>>>>>> Allow a CPU already holding the lock in write mode to also lock it in
>>>>>>>>> read mode. There's no harm in allowing read locking a rwlock that's
>>>>>>>>> already owned by the caller (ie: CPU) in write mode. Allowing such
>>>>>>>>> accesses is required at least for the CPU maps use-case.
>>>>>>>>>
>>>>>>>>> In order to do this reserve 14bits of the lock, this allows to
>>>>>>>>> support
>>>>>>>>> up to 16384 CPUs. Also reduce the write lock mask to 2 bits: one to
>>>>>>>>> signal there are pending writers waiting on the lock and the other to
>>>>>>>>> signal the lock is owned in write mode. Note the write related data
>>>>>>>>> is using 16bits, this is done in order to be able to clear it (and
>>>>>>>>> thus release the lock) using a 16bit atomic write.
>>>>>>>>>
>>>>>>>>> This reduces the maximum number of concurrent readers from
>>>>>>>>> 16777216 to
>>>>>>>>> 65536, I think this should still be enough, or else the lock field
>>>>>>>>> can be expanded from 32 to 64bits if all architectures support atomic
>>>>>>>>> operations on 64bit integers.
>>>>>>>>
>>>>>>>> FWIW, arm32 is able to support atomic operations on 64-bit integers.
>>>>>>>>
>>>>>>>>>     static inline void _write_unlock(rwlock_t *lock)
>>>>>>>>>     {
>>>>>>>>> -    /*
>>>>>>>>> -     * If the writer field is atomic, it can be cleared directly.
>>>>>>>>> -     * Otherwise, an atomic subtraction will be used to clear it.
>>>>>>>>> -     */
>>>>>>>>> -    atomic_sub(_QW_LOCKED, &lock->cnts);
>>>>>>>>> +    /* Since the writer field is atomic, it can be cleared
>>>>>>>>> directly. */
>>>>>>>>> +    ASSERT(_is_write_locked_by_me(atomic_read(&lock->cnts)));
>>>>>>>>> +    BUILD_BUG_ON(_QR_SHIFT != 16);
>>>>>>>>> +    write_atomic((uint16_t *)&lock->cnts, 0);
>>>>>>>>
>>>>>>>> I think this is an abuse to cast an atomic_t() directly into a
>>>>>>>> uint16_t. You
>>>>>>>> would at least want to use &lock->cnts.counter here.
>>>>>>>
>>>>>>> Sure, I was wondering about this myself.
>>>>>>>
>>>>>>> Will wait for more comments, not sure whether this can be fixed upon
>>>>>>> commit if there are no other issues.
>>>>>>
>>>>>> It's more than just adding another field specifier here. A cast like
>>>>>> this one is endianness-unsafe, and hence a trap waiting for a big
>>>>>> endian port attempt to fall into. At the very least this should cause
>>>>>> a build failure on big endian systems, even better would be if it was
>>>>>> endianness-safe.
>>>>>
>>>>> Wouldn't a union be the better choice?
>>>>
>>>> You would not be able to use atomic_t in that case as you can't assume
>>>> the layout of the structure.
>>>
>>> union rwlockword {
>>>       atomic_t cnts;
>>>       uint32_t val;
>>>       struct {
>>>           uint16_t write;
>>>           uint16_t readers;
>>>       };
>>> };
>>>
>>> static inline const uint32_t _qr_bias(
>>>       const union rwlockword {
>>>           .write = 0,
>>>           .readers = 1
>>>       } x;
>>>
>>>       return x.val;
>>> }
>>>
>>> ...
>>>       cnts = atomic_add_return(_qr_bias(), &lock->cnts);
>>> ...
>>>
>>> I guess this should do the trick, no?
>>
>> I'm afraid it won't, and not just because of the sizeof() aspect
>> already pointed out. Your x variable would end up like this in
>> memory:
>>
>> little:	00 00 01 00
>> big:	00 00 00 01 => 00000001
>>
>> which, read as 32-bit value, then ends up being
>>
>> little:	00010000
>> big:	00000001
>>
>> The add therefore would be able to spill into the high 16 bits.
> 
> And why exactly is this worse than just dropping spilled bits?
> Both cases will explode rather soon. Both cases can be avoided by
> introduction of e.g. ASSERTing that reader isn't ~0 before
> incrementing it (or to be zero afterwards).

Hmm, yes. I can't reconstruct the thoughts of the people who laid
out the bit assignments the way they are right now. I can
only assume there was a reason to put the reader count in the
high bits. But it looks like my previous remark wasn't even
pointing at the worst effect. The resulting big endian 32-bit
layout also is not in line with the _QW_* constants, due to
mis-ordering of the two halves.

Jan
Julien Grall Feb. 21, 2020, 2:51 p.m. UTC | #23
On 21/02/2020 14:35, Jürgen Groß wrote:
> On 21.02.20 15:32, Julien Grall wrote:
>>
>>
>> On 21/02/2020 14:16, Jürgen Groß wrote:
>>> On 21.02.20 15:11, Julien Grall wrote:
>>>> Hi Juergen,
>>>>
>>>> On 21/02/2020 14:06, Jürgen Groß wrote:
>>>>> On 21.02.20 14:49, Julien Grall wrote:
>>>>>>
>>>>>>
>>>>>> On 21/02/2020 13:46, Jürgen Groß wrote:
>>>>>>> On 21.02.20 14:36, Jan Beulich wrote:
>>>>>>>> On 21.02.2020 10:10, Roger Pau Monné wrote:
>>>>>>>>> On Thu, Feb 20, 2020 at 07:20:06PM +0000, Julien Grall wrote:
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> On 20/02/2020 17:31, Roger Pau Monne wrote:
>>>>>>>>>>> Allow a CPU already holding the lock in write mode to also 
>>>>>>>>>>> lock it in
>>>>>>>>>>> read mode. There's no harm in allowing read locking a rwlock 
>>>>>>>>>>> that's
>>>>>>>>>>> already owned by the caller (ie: CPU) in write mode. Allowing 
>>>>>>>>>>> such
>>>>>>>>>>> accesses is required at least for the CPU maps use-case.
>>>>>>>>>>>
>>>>>>>>>>> In order to do this reserve 14bits of the lock, this allows 
>>>>>>>>>>> to support
>>>>>>>>>>> up to 16384 CPUs. Also reduce the write lock mask to 2 bits: 
>>>>>>>>>>> one to
>>>>>>>>>>> signal there are pending writers waiting on the lock and the 
>>>>>>>>>>> other to
>>>>>>>>>>> signal the lock is owned in write mode. Note the write 
>>>>>>>>>>> related data
>>>>>>>>>>> is using 16bits, this is done in order to be able to clear it 
>>>>>>>>>>> (and
>>>>>>>>>>> thus release the lock) using a 16bit atomic write.
>>>>>>>>>>>
>>>>>>>>>>> This reduces the maximum number of concurrent readers from 
>>>>>>>>>>> 16777216 to
>>>>>>>>>>> 65536, I think this should still be enough, or else the lock 
>>>>>>>>>>> field
>>>>>>>>>>> can be expanded from 32 to 64bits if all architectures 
>>>>>>>>>>> support atomic
>>>>>>>>>>> operations on 64bit integers.
>>>>>>>>>>
>>>>>>>>>> FWIW, arm32 is able to support atomic operations on 64-bit 
>>>>>>>>>> integers.
>>>>>>>>>>
>>>>>>>>>>>    static inline void _write_unlock(rwlock_t *lock)
>>>>>>>>>>>    {
>>>>>>>>>>> -    /*
>>>>>>>>>>> -     * If the writer field is atomic, it can be cleared 
>>>>>>>>>>> directly.
>>>>>>>>>>> -     * Otherwise, an atomic subtraction will be used to 
>>>>>>>>>>> clear it.
>>>>>>>>>>> -     */
>>>>>>>>>>> -    atomic_sub(_QW_LOCKED, &lock->cnts);
>>>>>>>>>>> +    /* Since the writer field is atomic, it can be cleared 
>>>>>>>>>>> directly. */
>>>>>>>>>>> +    ASSERT(_is_write_locked_by_me(atomic_read(&lock->cnts)));
>>>>>>>>>>> +    BUILD_BUG_ON(_QR_SHIFT != 16);
>>>>>>>>>>> +    write_atomic((uint16_t *)&lock->cnts, 0);
>>>>>>>>>>
>>>>>>>>>> I think this is an abuse to cast an atomic_t() directly into a 
>>>>>>>>>> uint16_t. You
>>>>>>>>>> would at least want to use &lock->cnts.counter here.
>>>>>>>>>
>>>>>>>>> Sure, I was wondering about this myself.
>>>>>>>>>
>>>>>>>>> Will wait for more comments, not sure whether this can be fixed 
>>>>>>>>> upon
>>>>>>>>> commit if there are no other issues.
>>>>>>>>
>>>>>>>> It's more than just adding another field specifier here. A cast 
>>>>>>>> like
>>>>>>>> this one is endianness-unsafe, and hence a trap waiting for a big
>>>>>>>> endian port attempt to fall into. At the very least this should 
>>>>>>>> cause
>>>>>>>> a build failure on big endian systems, even better would be if 
>>>>>>>> it was
>>>>>>>> endianness-safe.
>>>>>>>
>>>>>>> Wouldn't a union be the better choice?
>>>>>>
>>>>>> You would not be able to use atomic_t in that case as you can't 
>>>>>> assume the layout of the structure.
>>>>>
>>>>> union rwlockword {
>>>>>      atomic_t cnts;
>>>>>      uint32_t val;
>>>>>      struct {
>>>>>          uint16_t write;
>>>>>          uint16_t readers;
>>>>>      };
>>>>> };
>>>>>
>>>>> static inline const uint32_t _qr_bias(
>>>>>      const union rwlockword {
>>>>>          .write = 0,
>>>>>          .readers = 1
>>>>>      } x;
>>>>>
>>>>>      return x.val;
>>>>> }
>>>>>
>>>>> ...
>>>>>      cnts = atomic_add_return(_qr_bias(), &lock->cnts);
>>>>> ...
>>>>>
>>>>> I guess this should do the trick, no?
>>>>
>>>> You are assuming the atomic_t layout which I would rather no want to 
>>>> happen.
>>>
>>> That already happened. rwlock.h already assumes atomic_t to be 32 bits
>>> wide. I agree it would be better to have an atomic32_t type for this
>>> use case.
>>
>> I don't think you understood my point here. An atomic32_t will not be 
>> better as you still assume the layout of the structure. I.e the 
>> structure has only one field.
> 
> I don't understand your reasoning here, sorry.
> 
> Are you saying that a structure of two uint16_t isn't known to be 32
> bits long?

You are assuming that atomic_t will always be:

struct
{
   int counter;
}

What if we decide to turn into

struct
{
   bool a;
   int counter;
}

You would at least want a BUILD_BUG_ON() so that the compiler will throw
an error rather than happily continuing.

Cheers,
Jan Beulich Feb. 21, 2020, 2:52 p.m. UTC | #24
On 21.02.2020 15:49, Roger Pau Monné wrote:
> On Fri, Feb 21, 2020 at 03:41:59PM +0100, Jan Beulich wrote:
>> On 21.02.2020 15:26, Roger Pau Monné wrote:
>>> On Fri, Feb 21, 2020 at 02:36:52PM +0100, Jan Beulich wrote:
>>>> On 21.02.2020 10:10, Roger Pau Monné wrote:
>>>>> On Thu, Feb 20, 2020 at 07:20:06PM +0000, Julien Grall wrote:
>>>>>> Hi,
>>>>>>
>>>>>> On 20/02/2020 17:31, Roger Pau Monne wrote:
>>>>>>> Allow a CPU already holding the lock in write mode to also lock it in
>>>>>>> read mode. There's no harm in allowing read locking a rwlock that's
>>>>>>> already owned by the caller (ie: CPU) in write mode. Allowing such
>>>>>>> accesses is required at least for the CPU maps use-case.
>>>>>>>
>>>>>>> In order to do this reserve 14bits of the lock, this allows to support
>>>>>>> up to 16384 CPUs. Also reduce the write lock mask to 2 bits: one to
>>>>>>> signal there are pending writers waiting on the lock and the other to
>>>>>>> signal the lock is owned in write mode. Note the write related data
>>>>>>> is using 16bits, this is done in order to be able to clear it (and
>>>>>>> thus release the lock) using a 16bit atomic write.
>>>>>>>
>>>>>>> This reduces the maximum number of concurrent readers from 16777216 to
>>>>>>> 65536, I think this should still be enough, or else the lock field
>>>>>>> can be expanded from 32 to 64bits if all architectures support atomic
>>>>>>> operations on 64bit integers.
>>>>>>
>>>>>> FWIW, arm32 is able to support atomic operations on 64-bit integers.
>>>>>>
>>>>>>>   static inline void _write_unlock(rwlock_t *lock)
>>>>>>>   {
>>>>>>> -    /*
>>>>>>> -     * If the writer field is atomic, it can be cleared directly.
>>>>>>> -     * Otherwise, an atomic subtraction will be used to clear it.
>>>>>>> -     */
>>>>>>> -    atomic_sub(_QW_LOCKED, &lock->cnts);
>>>>>>> +    /* Since the writer field is atomic, it can be cleared directly. */
>>>>>>> +    ASSERT(_is_write_locked_by_me(atomic_read(&lock->cnts)));
>>>>>>> +    BUILD_BUG_ON(_QR_SHIFT != 16);
>>>>>>> +    write_atomic((uint16_t *)&lock->cnts, 0);
>>>>>>
>>>>>> I think this is an abuse to cast an atomic_t() directly into a uint16_t. You
>>>>>> would at least want to use &lock->cnts.counter here.
>>>>>
>>>>> Sure, I was wondering about this myself.
>>>>>
>>>>> Will wait for more comments, not sure whether this can be fixed upon
>>>>> commit if there are no other issues.
>>>>
>>>> It's more than just adding another field specifier here. A cast like
>>>> this one is endianness-unsafe, and hence a trap waiting for a big
>>>> endian port attempt to fall into. At the very least this should cause
>>>> a build failure on big endian systems, even better would be if it was
>>>> endianness-safe.
>>>
>>> Why don't we leave the atomic_dec then?
>>
>> Because you need to invoke smp_processor_id() to calculate the value
>> to use in the subtraction. I'm not meaning to say I'm entirely
>> opposed (seeing how much of a discussion we're having), but the
>> "simple write of zero" approach is certainly appealing.
> 
> Well, we could avoid the smp_processor_id() call and instead use:
> 
> atomic_sub(atomic_read(&lock->cnts) & 0xffff, &lock->cnts);

Which would make me suggest atomic_and() again.

Jan
Julien Grall Feb. 21, 2020, 2:56 p.m. UTC | #25
On 21/02/2020 14:49, Roger Pau Monné wrote:
> On Fri, Feb 21, 2020 at 03:41:59PM +0100, Jan Beulich wrote:
>> Because you need to invoke smp_processor_id() to calculate the value
>> to use in the subtraction. I'm not meaning to say I'm entirely
>> opposed (seeing how much of a discussion we're having), but the
>> "simple write of zero" approach is certainly appealing.
> 
> Well, we could avoid the smp_processor_id() call and instead use:
> 
> atomic_sub(atomic_read(&lock->cnts) & 0xffff, &lock->cnts);

AFAICT, this would not be safe because the top 16 bits may change behind 
your back (via a read_lock).

Cheers,
Roger Pau Monné Feb. 21, 2020, 2:58 p.m. UTC | #26
On Fri, Feb 21, 2020 at 03:52:28PM +0100, Jan Beulich wrote:
> On 21.02.2020 15:49, Roger Pau Monné wrote:
> > On Fri, Feb 21, 2020 at 03:41:59PM +0100, Jan Beulich wrote:
> >> On 21.02.2020 15:26, Roger Pau Monné wrote:
> >>> On Fri, Feb 21, 2020 at 02:36:52PM +0100, Jan Beulich wrote:
> >>>> On 21.02.2020 10:10, Roger Pau Monné wrote:
> >>>>> On Thu, Feb 20, 2020 at 07:20:06PM +0000, Julien Grall wrote:
> >>>>>> Hi,
> >>>>>>
> >>>>>> On 20/02/2020 17:31, Roger Pau Monne wrote:
> >>>>>>> Allow a CPU already holding the lock in write mode to also lock it in
> >>>>>>> read mode. There's no harm in allowing read locking a rwlock that's
> >>>>>>> already owned by the caller (ie: CPU) in write mode. Allowing such
> >>>>>>> accesses is required at least for the CPU maps use-case.
> >>>>>>>
> >>>>>>> In order to do this reserve 14bits of the lock, this allows to support
> >>>>>>> up to 16384 CPUs. Also reduce the write lock mask to 2 bits: one to
> >>>>>>> signal there are pending writers waiting on the lock and the other to
> >>>>>>> signal the lock is owned in write mode. Note the write related data
> >>>>>>> is using 16bits, this is done in order to be able to clear it (and
> >>>>>>> thus release the lock) using a 16bit atomic write.
> >>>>>>>
> >>>>>>> This reduces the maximum number of concurrent readers from 16777216 to
> >>>>>>> 65536, I think this should still be enough, or else the lock field
> >>>>>>> can be expanded from 32 to 64bits if all architectures support atomic
> >>>>>>> operations on 64bit integers.
> >>>>>>
> >>>>>> FWIW, arm32 is able to support atomic operations on 64-bit integers.
> >>>>>>
> >>>>>>>   static inline void _write_unlock(rwlock_t *lock)
> >>>>>>>   {
> >>>>>>> -    /*
> >>>>>>> -     * If the writer field is atomic, it can be cleared directly.
> >>>>>>> -     * Otherwise, an atomic subtraction will be used to clear it.
> >>>>>>> -     */
> >>>>>>> -    atomic_sub(_QW_LOCKED, &lock->cnts);
> >>>>>>> +    /* Since the writer field is atomic, it can be cleared directly. */
> >>>>>>> +    ASSERT(_is_write_locked_by_me(atomic_read(&lock->cnts)));
> >>>>>>> +    BUILD_BUG_ON(_QR_SHIFT != 16);
> >>>>>>> +    write_atomic((uint16_t *)&lock->cnts, 0);
> >>>>>>
> >>>>>> I think this is an abuse to cast an atomic_t() directly into a uint16_t. You
> >>>>>> would at least want to use &lock->cnts.counter here.
> >>>>>
> >>>>> Sure, I was wondering about this myself.
> >>>>>
> >>>>> Will wait for more comments, not sure whether this can be fixed upon
> >>>>> commit if there are no other issues.
> >>>>
> >>>> It's more than just adding another field specifier here. A cast like
> >>>> this one is endianness-unsafe, and hence a trap waiting for a big
> >>>> endian port attempt to fall into. At the very least this should cause
> >>>> a build failure on big endian systems, even better would be if it was
> >>>> endianness-safe.
> >>>
> >>> Why don't we leave the atomic_dec then?
> >>
> >> Because you need to invoke smp_processor_id() to calculate the value
> >> to use in the subtraction. I'm not meaning to say I'm entirely
> >> opposed (seeing how much of a discussion we're having), but the
> >> "simple write of zero" approach is certainly appealing.
> > 
> > Well, we could avoid the smp_processor_id() call and instead use:
> > 
> > atomic_sub(atomic_read(&lock->cnts) & 0xffff, &lock->cnts);
> 
> Which would make me suggest atomic_and() again.

I'm certainly not opposed to that, but in order to get this regression
fixed I would argue that such an atomic_sub is no worse than what's
currently done.

I can look into adding an atomic_and operation to all arches, but this
is likely to take time and I would like to get this sorted sooner
rather than later.

Thanks, Roger.
Roger Pau Monné Feb. 21, 2020, 2:59 p.m. UTC | #27
On Fri, Feb 21, 2020 at 02:56:20PM +0000, Julien Grall wrote:
> 
> 
> On 21/02/2020 14:49, Roger Pau Monné wrote:
> > On Fri, Feb 21, 2020 at 03:41:59PM +0100, Jan Beulich wrote:
> > > Because you need to invoke smp_processor_id() to calculate the value
> > > to use in the subtraction. I'm not meaning to say I'm entirely
> > > opposed (seeing how much of a discussion we're having), but the
> > > "simple write of zero" approach is certainly appealing.
> > 
> > Well, we could avoid the smp_processor_id() call and instead use:
> > 
> > atomic_sub(atomic_read(&lock->cnts) & 0xffff, &lock->cnts);
> 
> AFAICT, this would not be safe because the top 16-bit may change behind your
> back (via a read_lock).

Why not? We don't touch the top 16 bits in any way; the subtraction
only affects the low 16 bits and is done in an atomic manner.

Thanks, Roger.
Julien Grall Feb. 21, 2020, 3:02 p.m. UTC | #28
On 21/02/2020 14:59, Roger Pau Monné wrote:
> On Fri, Feb 21, 2020 at 02:56:20PM +0000, Julien Grall wrote:
>>
>>
>> On 21/02/2020 14:49, Roger Pau Monné wrote:
>>> On Fri, Feb 21, 2020 at 03:41:59PM +0100, Jan Beulich wrote:
>>>> Because you need to invoke smp_processor_id() to calculate the value
>>>> to use in the subtraction. I'm not meaning to say I'm entirely
>>>> opposed (seeing how much of a discussion we're having), but the
>>>> "simple write of zero" approach is certainly appealing.
>>>
>>> Well, we could avoid the smp_processor_id() call and instead use:
>>>
>>> atomic_sub(atomic_read(&lock->cnts) & 0xffff, &lock->cnts);
>>
>> AFAICT, this would not be safe because the top 16-bit may change behind your
>> back (via a read_lock).
> 
> Why not? We don't touch the top 16bits in any way, the subtraction
> only affects the low 16bits and is done in an atomic manner.

Hmmm, I did misread the code :(. Sorry for that.

> 
> Thanks, Roger.
>
Jan Beulich Feb. 21, 2020, 3:11 p.m. UTC | #29
On 21.02.2020 15:56, Julien Grall wrote:
> On 21/02/2020 14:49, Roger Pau Monné wrote:
>> On Fri, Feb 21, 2020 at 03:41:59PM +0100, Jan Beulich wrote:
>>> Because you need to invoke smp_processor_id() to calculate the value
>>> to use in the subtraction. I'm not meaning to say I'm entirely
>>> opposed (seeing how much of a discussion we're having), but the
>>> "simple write of zero" approach is certainly appealing.
>>
>> Well, we could avoid the smp_processor_id() call and instead use:
>>
>> atomic_sub(atomic_read(&lock->cnts) & 0xffff, &lock->cnts);
> 
> AFAICT, this would not be safe because the top 16-bit may change behind 
> your back (via a read_lock).

But them changing between the atomic_read() and the atomic_sub()
is fine. The sub then will still only affect the low bits, leaving
the high ones as they were (potentially as updated after the read).

Jan
Jürgen Groß Feb. 21, 2020, 3:13 p.m. UTC | #30
On 21.02.20 15:51, Julien Grall wrote:
> 
> 
> On 21/02/2020 14:35, Jürgen Groß wrote:
>> On 21.02.20 15:32, Julien Grall wrote:
>>>
>>>
>>> On 21/02/2020 14:16, Jürgen Groß wrote:
>>>> On 21.02.20 15:11, Julien Grall wrote:
>>>>> Hi Juergen,
>>>>>
>>>>> On 21/02/2020 14:06, Jürgen Groß wrote:
>>>>>> On 21.02.20 14:49, Julien Grall wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 21/02/2020 13:46, Jürgen Groß wrote:
>>>>>>>> On 21.02.20 14:36, Jan Beulich wrote:
>>>>>>>>> On 21.02.2020 10:10, Roger Pau Monné wrote:
>>>>>>>>>> On Thu, Feb 20, 2020 at 07:20:06PM +0000, Julien Grall wrote:
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> On 20/02/2020 17:31, Roger Pau Monne wrote:
>>>>>>>>>>>> Allow a CPU already holding the lock in write mode to also 
>>>>>>>>>>>> lock it in
>>>>>>>>>>>> read mode. There's no harm in allowing read locking a rwlock 
>>>>>>>>>>>> that's
>>>>>>>>>>>> already owned by the caller (ie: CPU) in write mode. 
>>>>>>>>>>>> Allowing such
>>>>>>>>>>>> accesses is required at least for the CPU maps use-case.
>>>>>>>>>>>>
>>>>>>>>>>>> In order to do this reserve 14bits of the lock, this allows 
>>>>>>>>>>>> to support
>>>>>>>>>>>> up to 16384 CPUs. Also reduce the write lock mask to 2 bits: 
>>>>>>>>>>>> one to
>>>>>>>>>>>> signal there are pending writers waiting on the lock and the 
>>>>>>>>>>>> other to
>>>>>>>>>>>> signal the lock is owned in write mode. Note the write 
>>>>>>>>>>>> related data
>>>>>>>>>>>> is using 16bits, this is done in order to be able to clear 
>>>>>>>>>>>> it (and
>>>>>>>>>>>> thus release the lock) using a 16bit atomic write.
>>>>>>>>>>>>
>>>>>>>>>>>> This reduces the maximum number of concurrent readers from 
>>>>>>>>>>>> 16777216 to
>>>>>>>>>>>> 65536, I think this should still be enough, or else the lock 
>>>>>>>>>>>> field
>>>>>>>>>>>> can be expanded from 32 to 64bits if all architectures 
>>>>>>>>>>>> support atomic
>>>>>>>>>>>> operations on 64bit integers.
>>>>>>>>>>>
>>>>>>>>>>> FWIW, arm32 is able to support atomic operations on 64-bit 
>>>>>>>>>>> integers.
>>>>>>>>>>>
>>>>>>>>>>>>    static inline void _write_unlock(rwlock_t *lock)
>>>>>>>>>>>>    {
>>>>>>>>>>>> -    /*
>>>>>>>>>>>> -     * If the writer field is atomic, it can be cleared 
>>>>>>>>>>>> directly.
>>>>>>>>>>>> -     * Otherwise, an atomic subtraction will be used to 
>>>>>>>>>>>> clear it.
>>>>>>>>>>>> -     */
>>>>>>>>>>>> -    atomic_sub(_QW_LOCKED, &lock->cnts);
>>>>>>>>>>>> +    /* Since the writer field is atomic, it can be cleared 
>>>>>>>>>>>> directly. */
>>>>>>>>>>>> +    ASSERT(_is_write_locked_by_me(atomic_read(&lock->cnts)));
>>>>>>>>>>>> +    BUILD_BUG_ON(_QR_SHIFT != 16);
>>>>>>>>>>>> +    write_atomic((uint16_t *)&lock->cnts, 0);
>>>>>>>>>>>
>>>>>>>>>>> I think this is an abuse to cast an atomic_t() directly into 
>>>>>>>>>>> a uint16_t. You
>>>>>>>>>>> would at least want to use &lock->cnts.counter here.
>>>>>>>>>>
>>>>>>>>>> Sure, I was wondering about this myself.
>>>>>>>>>>
>>>>>>>>>> Will wait for more comments, not sure whether this can be 
>>>>>>>>>> fixed upon
>>>>>>>>>> commit if there are no other issues.
>>>>>>>>>
>>>>>>>>> It's more than just adding another field specifier here. A cast 
>>>>>>>>> like
>>>>>>>>> this one is endianness-unsafe, and hence a trap waiting for a big
>>>>>>>>> endian port attempt to fall into. At the very least this should 
>>>>>>>>> cause
>>>>>>>>> a build failure on big endian systems, even better would be if 
>>>>>>>>> it was
>>>>>>>>> endianness-safe.
>>>>>>>>
>>>>>>>> Wouldn't a union be the better choice?
>>>>>>>
>>>>>>> You would not be able to use atomic_t in that case as you can't 
>>>>>>> assume the layout of the structure.
>>>>>>
>>>>>> union rwlockword {
>>>>>>      atomic_t cnts;
>>>>>>      uint32_t val;
>>>>>>      struct {
>>>>>>          uint16_t write;
>>>>>>          uint16_t readers;
>>>>>>      };
>>>>>> };
>>>>>>
>>>>>> static inline const uint32_t _qr_bias(
>>>>>>      const union rwlockword {
>>>>>>          .write = 0,
>>>>>>          .readers = 1
>>>>>>      } x;
>>>>>>
>>>>>>      return x.val;
>>>>>> }
>>>>>>
>>>>>> ...
>>>>>>      cnts = atomic_add_return(_qr_bias(), &lock->cnts);
>>>>>> ...
>>>>>>
>>>>>> I guess this should do the trick, no?
>>>>>
>>>>> You are assuming the atomic_t layout which I would rather no want 
>>>>> to happen.
>>>>
>>>> That already happened. rwlock.h already assumes atomic_t to be 32 bits
>>>> wide. I agree it would be better to have an atomic32_t type for this
>>>> use case.
>>>
>>> I don't think you understood my point here. An atomic32_t will not be 
>>> better as you still assume the layout of the structure. I.e the 
>>> structure has only one field.
>>
>> I don't understand your reasoning here, sorry.
>>
>> Are you saying that a structure of two uint16_t isn't known to be 32
>> bits long?
> 
> You are assuming that atomic_t will always be:
> 
> struct
> {
>    int counter;
> }
> 
> What if we decide to turn into
> 
> struct
> {
>    bool a;
>    int counter;
> }

As said before: then queue_write_lock_slowpath() is already broken.


Juergen
Jan Beulich Feb. 21, 2020, 3:15 p.m. UTC | #31
On 21.02.2020 15:58, Roger Pau Monné wrote:
> On Fri, Feb 21, 2020 at 03:52:28PM +0100, Jan Beulich wrote:
>> On 21.02.2020 15:49, Roger Pau Monné wrote:
>>> On Fri, Feb 21, 2020 at 03:41:59PM +0100, Jan Beulich wrote:
>>>> On 21.02.2020 15:26, Roger Pau Monné wrote:
>>>>> On Fri, Feb 21, 2020 at 02:36:52PM +0100, Jan Beulich wrote:
>>>>>> On 21.02.2020 10:10, Roger Pau Monné wrote:
>>>>>>> On Thu, Feb 20, 2020 at 07:20:06PM +0000, Julien Grall wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> On 20/02/2020 17:31, Roger Pau Monne wrote:
>>>>>>>>> Allow a CPU already holding the lock in write mode to also lock it in
>>>>>>>>> read mode. There's no harm in allowing read locking a rwlock that's
>>>>>>>>> already owned by the caller (ie: CPU) in write mode. Allowing such
>>>>>>>>> accesses is required at least for the CPU maps use-case.
>>>>>>>>>
>>>>>>>>> In order to do this reserve 14bits of the lock, this allows to support
>>>>>>>>> up to 16384 CPUs. Also reduce the write lock mask to 2 bits: one to
>>>>>>>>> signal there are pending writers waiting on the lock and the other to
>>>>>>>>> signal the lock is owned in write mode. Note the write related data
>>>>>>>>> is using 16bits, this is done in order to be able to clear it (and
>>>>>>>>> thus release the lock) using a 16bit atomic write.
>>>>>>>>>
>>>>>>>>> This reduces the maximum number of concurrent readers from 16777216 to
>>>>>>>>> 65536, I think this should still be enough, or else the lock field
>>>>>>>>> can be expanded from 32 to 64bits if all architectures support atomic
>>>>>>>>> operations on 64bit integers.
>>>>>>>>
>>>>>>>> FWIW, arm32 is able to support atomic operations on 64-bit integers.
>>>>>>>>
>>>>>>>>>   static inline void _write_unlock(rwlock_t *lock)
>>>>>>>>>   {
>>>>>>>>> -    /*
>>>>>>>>> -     * If the writer field is atomic, it can be cleared directly.
>>>>>>>>> -     * Otherwise, an atomic subtraction will be used to clear it.
>>>>>>>>> -     */
>>>>>>>>> -    atomic_sub(_QW_LOCKED, &lock->cnts);
>>>>>>>>> +    /* Since the writer field is atomic, it can be cleared directly. */
>>>>>>>>> +    ASSERT(_is_write_locked_by_me(atomic_read(&lock->cnts)));
>>>>>>>>> +    BUILD_BUG_ON(_QR_SHIFT != 16);
>>>>>>>>> +    write_atomic((uint16_t *)&lock->cnts, 0);
>>>>>>>>
>>>>>>>> I think this is an abuse to cast an atomic_t() directly into a uint16_t. You
>>>>>>>> would at least want to use &lock->cnts.counter here.
>>>>>>>
>>>>>>> Sure, I was wondering about this myself.
>>>>>>>
>>>>>>> Will wait for more comments, not sure whether this can be fixed upon
>>>>>>> commit if there are no other issues.
>>>>>>
>>>>>> It's more than just adding another field specifier here. A cast like
>>>>>> this one is endianness-unsafe, and hence a trap waiting for a big
>>>>>> endian port attempt to fall into. At the very least this should cause
>>>>>> a build failure on big endian systems, even better would be if it was
>>>>>> endianness-safe.
>>>>>
>>>>> Why don't we leave the atomic_dec then?
>>>>
>>>> Because you need to invoke smp_processor_id() to calculate the value
>>>> to use in the subtraction. I'm not meaning to say I'm entirely
>>>> opposed (seeing how much of a discussion we're having), but the
>>>> "simple write of zero" approach is certainly appealing.
>>>
>>> Well, we could avoid the smp_processor_id() call and instead use:
>>>
>>> atomic_sub(atomic_read(&lock->cnts) & 0xffff, &lock->cnts);
>>
>> Which would make me suggest atomic_and() again.
> 
> I'm certainly not opposed to that, but in order to get this regression
> fixed I would argue that such atomic_sub is no worse than what's
> currently done.

It's one more read of a potentially heavily contended memory location.

> I can look into adding an atomic_and operation to all arches, but this
> is likely to take time and I would like to get this sorted sooner
> rather than later.

It was because this would presumably take time that I also brought up
reverting as an option. Then again, the Arm atomic_and() can, afaict, be
trivially cloned from their "add" counterparts, by replacing the
middle 'd' both in the function names and in the insn mnemonics. It's
just that there shouldn't be atomic_and_return() (for not being
useful, as it doesn't allow reconstructing the original value).

Jan
Jan Beulich Feb. 21, 2020, 3:17 p.m. UTC | #32
On 21.02.2020 16:13, Jürgen Groß wrote:
> On 21.02.20 15:51, Julien Grall wrote:
>>
>>
>> On 21/02/2020 14:35, Jürgen Groß wrote:
>>> On 21.02.20 15:32, Julien Grall wrote:
>>>>
>>>>
>>>> On 21/02/2020 14:16, Jürgen Groß wrote:
>>>>> On 21.02.20 15:11, Julien Grall wrote:
>>>>>> Hi Juergen,
>>>>>>
>>>>>> On 21/02/2020 14:06, Jürgen Groß wrote:
>>>>>>> On 21.02.20 14:49, Julien Grall wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> On 21/02/2020 13:46, Jürgen Groß wrote:
>>>>>>>>> On 21.02.20 14:36, Jan Beulich wrote:
>>>>>>>>>> On 21.02.2020 10:10, Roger Pau Monné wrote:
>>>>>>>>>>> On Thu, Feb 20, 2020 at 07:20:06PM +0000, Julien Grall wrote:
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> On 20/02/2020 17:31, Roger Pau Monne wrote:
>>>>>>>>>>>>> Allow a CPU already holding the lock in write mode to also 
>>>>>>>>>>>>> lock it in
>>>>>>>>>>>>> read mode. There's no harm in allowing read locking a rwlock 
>>>>>>>>>>>>> that's
>>>>>>>>>>>>> already owned by the caller (ie: CPU) in write mode. 
>>>>>>>>>>>>> Allowing such
>>>>>>>>>>>>> accesses is required at least for the CPU maps use-case.
>>>>>>>>>>>>>
>>>>>>>>>>>>> In order to do this reserve 14bits of the lock, this allows 
>>>>>>>>>>>>> to support
>>>>>>>>>>>>> up to 16384 CPUs. Also reduce the write lock mask to 2 bits: 
>>>>>>>>>>>>> one to
>>>>>>>>>>>>> signal there are pending writers waiting on the lock and the 
>>>>>>>>>>>>> other to
>>>>>>>>>>>>> signal the lock is owned in write mode. Note the write 
>>>>>>>>>>>>> related data
>>>>>>>>>>>>> is using 16bits, this is done in order to be able to clear 
>>>>>>>>>>>>> it (and
>>>>>>>>>>>>> thus release the lock) using a 16bit atomic write.
>>>>>>>>>>>>>
>>>>>>>>>>>>> This reduces the maximum number of concurrent readers from 
>>>>>>>>>>>>> 16777216 to
>>>>>>>>>>>>> 65536, I think this should still be enough, or else the lock 
>>>>>>>>>>>>> field
>>>>>>>>>>>>> can be expanded from 32 to 64bits if all architectures 
>>>>>>>>>>>>> support atomic
>>>>>>>>>>>>> operations on 64bit integers.
>>>>>>>>>>>>
>>>>>>>>>>>> FWIW, arm32 is able to support atomic operations on 64-bit 
>>>>>>>>>>>> integers.
>>>>>>>>>>>>
>>>>>>>>>>>>>    static inline void _write_unlock(rwlock_t *lock)
>>>>>>>>>>>>>    {
>>>>>>>>>>>>> -    /*
>>>>>>>>>>>>> -     * If the writer field is atomic, it can be cleared 
>>>>>>>>>>>>> directly.
>>>>>>>>>>>>> -     * Otherwise, an atomic subtraction will be used to 
>>>>>>>>>>>>> clear it.
>>>>>>>>>>>>> -     */
>>>>>>>>>>>>> -    atomic_sub(_QW_LOCKED, &lock->cnts);
>>>>>>>>>>>>> +    /* Since the writer field is atomic, it can be cleared 
>>>>>>>>>>>>> directly. */
>>>>>>>>>>>>> +    ASSERT(_is_write_locked_by_me(atomic_read(&lock->cnts)));
>>>>>>>>>>>>> +    BUILD_BUG_ON(_QR_SHIFT != 16);
>>>>>>>>>>>>> +    write_atomic((uint16_t *)&lock->cnts, 0);
>>>>>>>>>>>>
>>>>>>>>>>>> I think this is an abuse to cast an atomic_t() directly into 
>>>>>>>>>>>> a uint16_t. You
>>>>>>>>>>>> would at least want to use &lock->cnts.counter here.
>>>>>>>>>>>
>>>>>>>>>>> Sure, I was wondering about this myself.
>>>>>>>>>>>
>>>>>>>>>>> Will wait for more comments, not sure whether this can be 
>>>>>>>>>>> fixed upon
>>>>>>>>>>> commit if there are no other issues.
>>>>>>>>>>
>>>>>>>>>> It's more than just adding another field specifier here. A cast 
>>>>>>>>>> like
>>>>>>>>>> this one is endianness-unsafe, and hence a trap waiting for a big
>>>>>>>>>> endian port attempt to fall into. At the very least this should 
>>>>>>>>>> cause
>>>>>>>>>> a build failure on big endian systems, even better would be if 
>>>>>>>>>> it was
>>>>>>>>>> endianness-safe.
>>>>>>>>>
>>>>>>>>> Wouldn't a union be the better choice?
>>>>>>>>
>>>>>>>> You would not be able to use atomic_t in that case as you can't 
>>>>>>>> assume the layout of the structure.
>>>>>>>
>>>>>>> union rwlockword {
>>>>>>>      atomic_t cnts;
>>>>>>>      uint32_t val;
>>>>>>>      struct {
>>>>>>>          uint16_t write;
>>>>>>>          uint16_t readers;
>>>>>>>      };
>>>>>>> };
>>>>>>>
>>>>>>> static inline uint32_t _qr_bias(void)
>>>>>>> {
>>>>>>>      const union rwlockword x = {
>>>>>>>          .write = 0,
>>>>>>>          .readers = 1
>>>>>>>      };
>>>>>>>
>>>>>>>      return x.val;
>>>>>>> }
>>>>>>>
>>>>>>> ...
>>>>>>>      cnts = atomic_add_return(_qr_bias(), &lock->cnts);
>>>>>>> ...
>>>>>>>
>>>>>>> I guess this should do the trick, no?
>>>>>>
>>>>>> You are assuming the atomic_t layout, which I would rather not want
>>>>>> to happen.
>>>>>
>>>>> That already happened. rwlock.h already assumes atomic_t to be 32 bits
>>>>> wide. I agree it would be better to have an atomic32_t type for this
>>>>> use case.
>>>>
>>>> I don't think you understood my point here. An atomic32_t will not be 
>>>> better as you still assume the layout of the structure. I.e the 
>>>> structure has only one field.
>>>
>>> I don't understand your reasoning here, sorry.
>>>
>>> Are you saying that a structure of two uint16_t isn't known to be 32
>>> bits long?
>>
>> You are assuming that atomic_t will always be:
>>
>> struct
>> {
>>    int counter;
>> }
>>
>> What if we decide to turn into
>>
>> struct
>> {
>>    bool a;
>>    int counter;
>> }
> 
> As said before: then queue_write_lock_slowpath() is already broken.

Why? The atomic_*() used would still only act on the counter field
(for their actual operation, i.e. what matters to callers; the
other field(s) could be statistics or whatever).

Jan
Jürgen Groß Feb. 21, 2020, 3:23 p.m. UTC | #33
On 21.02.20 16:17, Jan Beulich wrote:
> On 21.02.2020 16:13, Jürgen Groß wrote:
>> On 21.02.20 15:51, Julien Grall wrote:
>>>
>>>
>>> On 21/02/2020 14:35, Jürgen Groß wrote:
>>>> On 21.02.20 15:32, Julien Grall wrote:
>>>>>
>>>>>
>>>>> On 21/02/2020 14:16, Jürgen Groß wrote:
>>>>>> On 21.02.20 15:11, Julien Grall wrote:
>>>>>>> Hi Juergen,
>>>>>>>
>>>>>>> On 21/02/2020 14:06, Jürgen Groß wrote:
>>>>>>>> On 21.02.20 14:49, Julien Grall wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 21/02/2020 13:46, Jürgen Groß wrote:
>>>>>>>>>> On 21.02.20 14:36, Jan Beulich wrote:
>>>>>>>>>>> On 21.02.2020 10:10, Roger Pau Monné wrote:
>>>>>>>>>>>> On Thu, Feb 20, 2020 at 07:20:06PM +0000, Julien Grall wrote:
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 20/02/2020 17:31, Roger Pau Monne wrote:
>>>>>>>>>>>>>> Allow a CPU already holding the lock in write mode to also
>>>>>>>>>>>>>> lock it in
>>>>>>>>>>>>>> read mode. There's no harm in allowing read locking a rwlock
>>>>>>>>>>>>>> that's
>>>>>>>>>>>>>> already owned by the caller (ie: CPU) in write mode.
>>>>>>>>>>>>>> Allowing such
>>>>>>>>>>>>>> accesses is required at least for the CPU maps use-case.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> In order to do this reserve 14bits of the lock, this allows
>>>>>>>>>>>>>> to support
>>>>>>>>>>>>>> up to 16384 CPUs. Also reduce the write lock mask to 2 bits:
>>>>>>>>>>>>>> one to
>>>>>>>>>>>>>> signal there are pending writers waiting on the lock and the
>>>>>>>>>>>>>> other to
>>>>>>>>>>>>>> signal the lock is owned in write mode. Note the write
>>>>>>>>>>>>>> related data
>>>>>>>>>>>>>> is using 16bits, this is done in order to be able to clear
>>>>>>>>>>>>>> it (and
>>>>>>>>>>>>>> thus release the lock) using a 16bit atomic write.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This reduces the maximum number of concurrent readers from
>>>>>>>>>>>>>> 16777216 to
>>>>>>>>>>>>>> 65536, I think this should still be enough, or else the lock
>>>>>>>>>>>>>> field
>>>>>>>>>>>>>> can be expanded from 32 to 64bits if all architectures
>>>>>>>>>>>>>> support atomic
>>>>>>>>>>>>>> operations on 64bit integers.
>>>>>>>>>>>>>
>>>>>>>>>>>>> FWIW, arm32 is able to support atomic operations on 64-bit
>>>>>>>>>>>>> integers.
>>>>>>>>>>>>>
>>>>>>>>>>>>>>     static inline void _write_unlock(rwlock_t *lock)
>>>>>>>>>>>>>>     {
>>>>>>>>>>>>>> -    /*
>>>>>>>>>>>>>> -     * If the writer field is atomic, it can be cleared
>>>>>>>>>>>>>> directly.
>>>>>>>>>>>>>> -     * Otherwise, an atomic subtraction will be used to
>>>>>>>>>>>>>> clear it.
>>>>>>>>>>>>>> -     */
>>>>>>>>>>>>>> -    atomic_sub(_QW_LOCKED, &lock->cnts);
>>>>>>>>>>>>>> +    /* Since the writer field is atomic, it can be cleared
>>>>>>>>>>>>>> directly. */
>>>>>>>>>>>>>> +    ASSERT(_is_write_locked_by_me(atomic_read(&lock->cnts)));
>>>>>>>>>>>>>> +    BUILD_BUG_ON(_QR_SHIFT != 16);
>>>>>>>>>>>>>> +    write_atomic((uint16_t *)&lock->cnts, 0);
>>>>>>>>>>>>>
>>>>>>>>>>>>> I think this is an abuse to cast an atomic_t() directly into
>>>>>>>>>>>>> a uint16_t. You
>>>>>>>>>>>>> would at least want to use &lock->cnts.counter here.
>>>>>>>>>>>>
>>>>>>>>>>>> Sure, I was wondering about this myself.
>>>>>>>>>>>>
>>>>>>>>>>>> Will wait for more comments, not sure whether this can be
>>>>>>>>>>>> fixed upon
>>>>>>>>>>>> commit if there are no other issues.
>>>>>>>>>>>
>>>>>>>>>>> It's more than just adding another field specifier here. A cast
>>>>>>>>>>> like
>>>>>>>>>>> this one is endianness-unsafe, and hence a trap waiting for a big
>>>>>>>>>>> endian port attempt to fall into. At the very least this should
>>>>>>>>>>> cause
>>>>>>>>>>> a build failure on big endian systems, even better would be if
>>>>>>>>>>> it was
>>>>>>>>>>> endianness-safe.
>>>>>>>>>>
>>>>>>>>>> Wouldn't a union be the better choice?
>>>>>>>>>
>>>>>>>>> You would not be able to use atomic_t in that case as you can't
>>>>>>>>> assume the layout of the structure.
>>>>>>>>
>>>>>>>> union rwlockword {
>>>>>>>>       atomic_t cnts;
>>>>>>>>       uint32_t val;
>>>>>>>>       struct {
>>>>>>>>           uint16_t write;
>>>>>>>>           uint16_t readers;
>>>>>>>>       };
>>>>>>>> };
>>>>>>>>
>>>>>>>> static inline uint32_t _qr_bias(void)
>>>>>>>> {
>>>>>>>>       const union rwlockword x = {
>>>>>>>>           .write = 0,
>>>>>>>>           .readers = 1
>>>>>>>>       };
>>>>>>>>
>>>>>>>>       return x.val;
>>>>>>>> }
>>>>>>>>
>>>>>>>> ...
>>>>>>>>       cnts = atomic_add_return(_qr_bias(), &lock->cnts);
>>>>>>>> ...
>>>>>>>>
>>>>>>>> I guess this should do the trick, no?
>>>>>>>
>>>>>>> You are assuming the atomic_t layout, which I would rather not want
>>>>>>> to happen.
>>>>>>
>>>>>> That already happened. rwlock.h already assumes atomic_t to be 32 bits
>>>>>> wide. I agree it would be better to have an atomic32_t type for this
>>>>>> use case.
>>>>>
>>>>> I don't think you understood my point here. An atomic32_t will not be
>>>>> better as you still assume the layout of the structure. I.e the
>>>>> structure has only one field.
>>>>
>>>> I don't understand your reasoning here, sorry.
>>>>
>>>> Are you saying that a structure of two uint16_t isn't known to be 32
>>>> bits long?
>>>
>>> You are assuming that atomic_t will always be:
>>>
>>> struct
>>> {
>>>     int counter;
>>> }
>>>
>>> What if we decide to turn into
>>>
>>> struct
>>> {
>>>     bool a;
>>>     int counter;
>>> }
>>
>> As said before: then queue_write_lock_slowpath() is already broken.
> 
> Why? The atomic_*() used would still only act on the counter field
> (for their actual operation, i.e. what matters to callers; the
> other field(s) could be statistics or whatever).

No:

u32 cnts;
...
if ( !(cnts & _QW_WMASK) &&
      (atomic_cmpxchg(&lock->cnts, cnts,
                      cnts | _QW_WAITING) == cnts) )


Juergen
Jan Beulich Feb. 21, 2020, 3:27 p.m. UTC | #34
On 21.02.2020 16:23, Jürgen Groß wrote:
> On 21.02.20 16:17, Jan Beulich wrote:
>> On 21.02.2020 16:13, Jürgen Groß wrote:
>>> On 21.02.20 15:51, Julien Grall wrote:
>>>> You are assuming that atomic_t will always be:
>>>>
>>>> struct
>>>> {
>>>>     int counter;
>>>> }
>>>>
>>>> What if we decide to turn into
>>>>
>>>> struct
>>>> {
>>>>     bool a;
>>>>     int counter;
>>>> }
>>>
>>> As said before: then queue_write_lock_slowpath() is already broken.
>>
>> Why? The atomic_*() used would still only act on the counter field
>> (for their actual operation, i.e. what matters to callers; the
>> other field(s) could be statistics or whatever).
> 
> No:
> 
> u32 cnts;
> ...
> if ( !(cnts & _QW_WMASK) &&
>       (atomic_cmpxchg(&lock->cnts, cnts,
>                       cnts | _QW_WAITING) == cnts) )

I must be blind then. As said, atomic_cmpxchg() would still (for the
purpose of consuming "cnts") act on only the "counter" field. It may
additionally e.g. increment a stats counter.

Jan
Jürgen Groß Feb. 21, 2020, 3:33 p.m. UTC | #35
On 21.02.20 16:27, Jan Beulich wrote:
> On 21.02.2020 16:23, Jürgen Groß wrote:
>> On 21.02.20 16:17, Jan Beulich wrote:
>>> On 21.02.2020 16:13, Jürgen Groß wrote:
>>>> On 21.02.20 15:51, Julien Grall wrote:
>>>>> You are assuming that atomic_t will always be:
>>>>>
>>>>> struct
>>>>> {
>>>>>      int counter;
>>>>> }
>>>>>
>>>>> What if we decide to turn into
>>>>>
>>>>> struct
>>>>> {
>>>>>      bool a;
>>>>>      int counter;
>>>>> }
>>>>
>>>> As said before: then queue_write_lock_slowpath() is already broken.
>>>
>>> Why? The atomic_*() used would still only act on the counter field
>>> (for their actual operation, i.e. what matters to callers; the
>>> other field(s) could be statistics or whatever).
>>
>> No:
>>
>> u32 cnts;
>> ...
>> if ( !(cnts & _QW_WMASK) &&
>>        (atomic_cmpxchg(&lock->cnts, cnts,
>>                        cnts | _QW_WAITING) == cnts) )
> 
> I must be blind then. As said, atomic_cmpxchg() would still (for the
> purpose of consuming "cnts") act on only the "counter" field. It may
> additionally e.g. increment a stats counter.

Oh, sorry, I misunderstood you.


Juergen
Roger Pau Monné Feb. 21, 2020, 4:22 p.m. UTC | #36
On Fri, Feb 21, 2020 at 04:15:39PM +0100, Jan Beulich wrote:
> On 21.02.2020 15:58, Roger Pau Monné wrote:
> > On Fri, Feb 21, 2020 at 03:52:28PM +0100, Jan Beulich wrote:
> >> On 21.02.2020 15:49, Roger Pau Monné wrote:
> >>> On Fri, Feb 21, 2020 at 03:41:59PM +0100, Jan Beulich wrote:
> >>>> On 21.02.2020 15:26, Roger Pau Monné wrote:
> >>>>> On Fri, Feb 21, 2020 at 02:36:52PM +0100, Jan Beulich wrote:
> >>>>>> On 21.02.2020 10:10, Roger Pau Monné wrote:
> >>>>>>> On Thu, Feb 20, 2020 at 07:20:06PM +0000, Julien Grall wrote:
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>> On 20/02/2020 17:31, Roger Pau Monne wrote:
> >>>>>>>>> Allow a CPU already holding the lock in write mode to also lock it in
> >>>>>>>>> read mode. There's no harm in allowing read locking a rwlock that's
> >>>>>>>>> already owned by the caller (ie: CPU) in write mode. Allowing such
> >>>>>>>>> accesses is required at least for the CPU maps use-case.
> >>>>>>>>>
> >>>>>>>>> In order to do this reserve 14bits of the lock, this allows to support
> >>>>>>>>> up to 16384 CPUs. Also reduce the write lock mask to 2 bits: one to
> >>>>>>>>> signal there are pending writers waiting on the lock and the other to
> >>>>>>>>> signal the lock is owned in write mode. Note the write related data
> >>>>>>>>> is using 16bits, this is done in order to be able to clear it (and
> >>>>>>>>> thus release the lock) using a 16bit atomic write.
> >>>>>>>>>
> >>>>>>>>> This reduces the maximum number of concurrent readers from 16777216 to
> >>>>>>>>> 65536, I think this should still be enough, or else the lock field
> >>>>>>>>> can be expanded from 32 to 64bits if all architectures support atomic
> >>>>>>>>> operations on 64bit integers.
> >>>>>>>>
> >>>>>>>> FWIW, arm32 is able to support atomic operations on 64-bit integers.
> >>>>>>>>
> >>>>>>>>>   static inline void _write_unlock(rwlock_t *lock)
> >>>>>>>>>   {
> >>>>>>>>> -    /*
> >>>>>>>>> -     * If the writer field is atomic, it can be cleared directly.
> >>>>>>>>> -     * Otherwise, an atomic subtraction will be used to clear it.
> >>>>>>>>> -     */
> >>>>>>>>> -    atomic_sub(_QW_LOCKED, &lock->cnts);
> >>>>>>>>> +    /* Since the writer field is atomic, it can be cleared directly. */
> >>>>>>>>> +    ASSERT(_is_write_locked_by_me(atomic_read(&lock->cnts)));
> >>>>>>>>> +    BUILD_BUG_ON(_QR_SHIFT != 16);
> >>>>>>>>> +    write_atomic((uint16_t *)&lock->cnts, 0);
> >>>>>>>>
> >>>>>>>> I think this is an abuse to cast an atomic_t() directly into a uint16_t. You
> >>>>>>>> would at least want to use &lock->cnts.counter here.
> >>>>>>>
> >>>>>>> Sure, I was wondering about this myself.
> >>>>>>>
> >>>>>>> Will wait for more comments, not sure whether this can be fixed upon
> >>>>>>> commit if there are no other issues.
> >>>>>>
> >>>>>> It's more than just adding another field specifier here. A cast like
> >>>>>> this one is endianness-unsafe, and hence a trap waiting for a big
> >>>>>> endian port attempt to fall into. At the very least this should cause
> >>>>>> a build failure on big endian systems, even better would be if it was
> >>>>>> endianness-safe.
> >>>>>
> >>>>> Why don't we leave the atomic_dec then?
> >>>>
> >>>> Because you need to invoke smp_processor_id() to calculate the value
> >>>> to use in the subtraction. I'm not meaning to say I'm entirely
> >>>> opposed (seeing how much of a discussion we're having), but the
> >>>> "simple write of zero" approach is certainly appealing.
> >>>
> >>> Well, we could avoid the smp_processor_id() call and instead use:
> >>>
> >>> atomic_sub(atomic_read(&lock->cnts) & 0xffff, &lock->cnts);
> >>
> >> Which would make me suggest atomic_and() again.
> > 
> > I'm certainly not opposed to that, but in order to get this regression
> > fixed I would argue that such atomic_sub is no worse than what's
> > currently done.
> 
> It's one more read of a potentially heavily contended memory location.
> 
> > I can look into adding an atomic_and operation to all arches, but this
> > is likely to take time and I would like to get this sorted sooner
> > rather than later.
> 
> Because of it presumably taking time it was that I also brought up to
> consider reverting. Then again, the Arm atomic_and() can, afaict, be
> trivially cloned from their "add" counterparts, by replacing the
> middle 'd' both in the function names and in the insn mnemonics. It's
> just that there shouldn't be atomic_and_return() (for not being
> useful, as it doesn't allow reconstructing the original value).

I've added atomic_and in a pre-patch and I'm currently waiting for
gitlab to finish testing to assert that I'm not breaking the build.
Will post once that's finished if successful.

FWIW, the branch is at:

http://xenbits.xen.org/gitweb/?p=people/royger/xen.git;a=shortlog;h=refs/heads/rwlock3

Thanks, Roger.
Patch

diff --git a/xen/common/rwlock.c b/xen/common/rwlock.c
index d568bbf6de..dadab372b5 100644
--- a/xen/common/rwlock.c
+++ b/xen/common/rwlock.c
@@ -69,7 +69,7 @@  void queue_write_lock_slowpath(rwlock_t *lock)
 
     /* Try to acquire the lock directly if no reader is present. */
     if ( !atomic_read(&lock->cnts) &&
-         (atomic_cmpxchg(&lock->cnts, 0, _QW_LOCKED) == 0) )
+         (atomic_cmpxchg(&lock->cnts, 0, _write_lock_val()) == 0) )
         goto unlock;
 
     /*
@@ -93,7 +93,7 @@  void queue_write_lock_slowpath(rwlock_t *lock)
         cnts = atomic_read(&lock->cnts);
         if ( (cnts == _QW_WAITING) &&
              (atomic_cmpxchg(&lock->cnts, _QW_WAITING,
-                             _QW_LOCKED) == _QW_WAITING) )
+                             _write_lock_val()) == _QW_WAITING) )
             break;
 
         cpu_relax();
diff --git a/xen/include/xen/rwlock.h b/xen/include/xen/rwlock.h
index 3dfea1ac2a..59f0d21075 100644
--- a/xen/include/xen/rwlock.h
+++ b/xen/include/xen/rwlock.h
@@ -20,21 +20,30 @@  typedef struct {
 #define DEFINE_RWLOCK(l) rwlock_t l = RW_LOCK_UNLOCKED
 #define rwlock_init(l) (*(l) = (rwlock_t)RW_LOCK_UNLOCKED)
 
-/*
- * Writer states & reader shift and bias.
- *
- * Writer field is 8 bit to allow for potential optimisation, see
- * _write_unlock().
- */
-#define    _QW_WAITING  1               /* A writer is waiting     */
-#define    _QW_LOCKED   0xff            /* A writer holds the lock */
-#define    _QW_WMASK    0xff            /* Writer mask.*/
-#define    _QR_SHIFT    8               /* Reader count shift      */
+/* Writer states & reader shift and bias. */
+#define    _QW_CPUMASK  0x3fff              /* Writer CPU mask */
+#define    _QW_SHIFT    14                  /* Writer flags shift */
+#define    _QW_WAITING  (1u << _QW_SHIFT)   /* A writer is waiting */
+#define    _QW_LOCKED   (3u << _QW_SHIFT)   /* A writer holds the lock */
+#define    _QW_WMASK    (3u << _QW_SHIFT)   /* Writer mask */
+#define    _QR_SHIFT    16                  /* Reader count shift */
 #define    _QR_BIAS     (1U << _QR_SHIFT)
 
 void queue_read_lock_slowpath(rwlock_t *lock);
 void queue_write_lock_slowpath(rwlock_t *lock);
 
+static inline bool _is_write_locked_by_me(uint32_t cnts)
+{
+    BUILD_BUG_ON(_QW_CPUMASK < NR_CPUS);
+    return (cnts & _QW_WMASK) == _QW_LOCKED &&
+           (cnts & _QW_CPUMASK) == smp_processor_id();
+}
+
+static inline bool _can_read_lock(uint32_t cnts)
+{
+    return !(cnts & _QW_WMASK) || _is_write_locked_by_me(cnts);
+}
+
 /*
  * _read_trylock - try to acquire read lock of a queue rwlock.
  * @lock : Pointer to queue rwlock structure.
@@ -45,10 +54,10 @@  static inline int _read_trylock(rwlock_t *lock)
     u32 cnts;
 
     cnts = atomic_read(&lock->cnts);
-    if ( likely(!(cnts & _QW_WMASK)) )
+    if ( likely(_can_read_lock(cnts)) )
     {
         cnts = (u32)atomic_add_return(_QR_BIAS, &lock->cnts);
-        if ( likely(!(cnts & _QW_WMASK)) )
+        if ( likely(_can_read_lock(cnts)) )
             return 1;
         atomic_sub(_QR_BIAS, &lock->cnts);
     }
@@ -64,7 +73,7 @@  static inline void _read_lock(rwlock_t *lock)
     u32 cnts;
 
     cnts = atomic_add_return(_QR_BIAS, &lock->cnts);
-    if ( likely(!(cnts & _QW_WMASK)) )
+    if ( likely(_can_read_lock(cnts)) )
         return;
 
     /* The slowpath will decrement the reader count, if necessary. */
@@ -115,6 +124,11 @@  static inline int _rw_is_locked(rwlock_t *lock)
     return atomic_read(&lock->cnts);
 }
 
+static inline uint32_t _write_lock_val(void)
+{
+    return _QW_LOCKED | smp_processor_id();
+}
+
 /*
  * queue_write_lock - acquire write lock of a queue rwlock.
  * @lock : Pointer to queue rwlock structure.
@@ -122,7 +136,7 @@  static inline int _rw_is_locked(rwlock_t *lock)
 static inline void _write_lock(rwlock_t *lock)
 {
     /* Optimize for the unfair lock case where the fair flag is 0. */
-    if ( atomic_cmpxchg(&lock->cnts, 0, _QW_LOCKED) == 0 )
+    if ( atomic_cmpxchg(&lock->cnts, 0, _write_lock_val()) == 0 )
         return;
 
     queue_write_lock_slowpath(lock);
@@ -157,16 +171,15 @@  static inline int _write_trylock(rwlock_t *lock)
     if ( unlikely(cnts) )
         return 0;
 
-    return likely(atomic_cmpxchg(&lock->cnts, 0, _QW_LOCKED) == 0);
+    return likely(atomic_cmpxchg(&lock->cnts, 0, _write_lock_val()) == 0);
 }
 
 static inline void _write_unlock(rwlock_t *lock)
 {
-    /*
-     * If the writer field is atomic, it can be cleared directly.
-     * Otherwise, an atomic subtraction will be used to clear it.
-     */
-    atomic_sub(_QW_LOCKED, &lock->cnts);
+    /* Since the writer field is atomic, it can be cleared directly. */
+    ASSERT(_is_write_locked_by_me(atomic_read(&lock->cnts)));
+    BUILD_BUG_ON(_QR_SHIFT != 16);
+    write_atomic((uint16_t *)&lock->cnts, 0);
 }
 
 static inline void _write_unlock_irq(rwlock_t *lock)