
rwlock: allow recursive read locking when already locked in write mode

Message ID 20200220120231.86907-1-roger.pau@citrix.com (mailing list archive)
State Superseded
Series rwlock: allow recursive read locking when already locked in write mode

Commit Message

Roger Pau Monné Feb. 20, 2020, 12:02 p.m. UTC
Allow a CPU already holding the lock in write mode to also lock it in
read mode. There's no harm in allowing a CPU to read-lock an rwlock it
already owns in write mode, and allowing such accesses is required at
least for the CPU maps use case.

In order to do this, reserve 12 bits of the lock to store the ID of the
CPU that holds it in write mode, which allows supporting up to 4096
CPUs. Also reduce the write lock mask to 2 bits: one to
signal there are pending writers waiting on the lock and the other to
signal the lock is owned in write mode.

This reduces the maximum number of concurrent readers from 16777216 to
262144. I think this should still be enough; if not, the lock field
can be expanded from 32 to 64 bits, provided all architectures support
atomic operations on 64-bit integers.
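
For illustration, a minimal standalone sketch of the resulting 32-bit
layout of lock->cnts (the constants mirror the patch below; the demo
program itself is only a sketch, not part of the patch):

#include <stdint.h>
#include <stdio.h>

/* Constants as proposed by the patch below. */
#define _QW_WAITING  1          /* A writer is waiting */
#define _QW_LOCKED   3          /* A writer holds the lock */
#define _QW_WMASK    3          /* Writer mask */
#define _QW_CPUSHIFT 2          /* Writer CPU shift */
#define _QW_CPUMASK  0x3ffcU    /* Writer CPU mask (12 bits -> 4096 CPUs) */
#define _QR_SHIFT    14         /* Reader count shift (18 bits -> 262144 readers) */
#define _QR_BIAS     (1U << _QR_SHIFT)

int main(void)
{
    /* Example: CPU 5 holds the lock in write mode and has additionally
     * taken it twice in read mode. */
    uint32_t cnts = _QW_LOCKED | (5U << _QW_CPUSHIFT) | 2 * _QR_BIAS;

    printf("writer state: %u\n", cnts & _QW_WMASK);                     /* 3 */
    printf("write owner : %u\n", (cnts & _QW_CPUMASK) >> _QW_CPUSHIFT); /* 5 */
    printf("readers     : %u\n", cnts >> _QR_SHIFT);                    /* 2 */

    return 0;
}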

Fixes: 5872c83b42c608 ('smp: convert the cpu maps lock into a rw lock')
Reported-by: Jan Beulich <jbeulich@suse.com>
Reported-by: Jürgen Groß <jgross@suse.com>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
I've done some testing and at least the CPU down case is fixed now.
Posting early in order to get feedback on the approach taken.
---
 xen/common/rwlock.c      |  4 ++--
 xen/include/xen/rwlock.h | 47 ++++++++++++++++++++++++++--------------
 2 files changed, 33 insertions(+), 18 deletions(-)

Comments

Jan Beulich Feb. 20, 2020, 12:48 p.m. UTC | #1
On 20.02.2020 13:02, Roger Pau Monne wrote:
> I've done some testing and at least the CPU down case is fixed now.
> Posting early in order to get feedback on the approach taken.

Looks good, thanks, just a question and two comments:

> --- a/xen/include/xen/rwlock.h
> +++ b/xen/include/xen/rwlock.h
> @@ -20,21 +20,30 @@ typedef struct {
>  #define DEFINE_RWLOCK(l) rwlock_t l = RW_LOCK_UNLOCKED
>  #define rwlock_init(l) (*(l) = (rwlock_t)RW_LOCK_UNLOCKED)
>  
> -/*
> - * Writer states & reader shift and bias.
> - *
> - * Writer field is 8 bit to allow for potential optimisation, see
> - * _write_unlock().
> - */
> -#define    _QW_WAITING  1               /* A writer is waiting     */
> -#define    _QW_LOCKED   0xff            /* A writer holds the lock */
> -#define    _QW_WMASK    0xff            /* Writer mask.*/
> -#define    _QR_SHIFT    8               /* Reader count shift      */
> +/* Writer states & reader shift and bias. */
> +#define    _QW_WAITING  1                       /* A writer is waiting */
> +#define    _QW_LOCKED   3                       /* A writer holds the lock */

Aiui things would work equally well if 2 was used here?

> +#define    _QW_WMASK    3                       /* Writer mask */
> +#define    _QW_CPUSHIFT 2                       /* Writer CPU shift */
> +#define    _QW_CPUMASK  0x3ffc                  /* Writer CPU mask */

At least on x86, the shift involved here is quite certainly
more expensive than using wider immediates on AND and CMP
resulting from the _QW_MASK-based comparisons. I'd therefore
like to suggest to put the CPU in the low 12 bits.

Another option is to use the recurse_cpu field of the
associated spin lock: The field is used for recursive locks
only, and hence the only conflict would be with
_spin_is_locked(), which we don't (and in the future then
also shouldn't) use on this lock.
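
A minimal sketch of the first suggestion, with the writer CPU kept in
the low 12 bits (the constant values are assumed for illustration and
differ from the patch):

/* Hypothetical layout with the writer CPU in bits 0-11, so no shift is
 * needed when comparing it against smp_processor_id(). */
#define _QW_CPUMASK  0x0fffU    /* Writer CPU mask */
#define _QW_WAITING  0x1000U    /* A writer is waiting */
#define _QW_LOCKED   0x3000U    /* A writer holds the lock */
#define _QW_WMASK    0x3000U    /* Writer mask */
#define _QR_SHIFT    14         /* Reader count shift */

static inline bool _is_write_locked_by_me(uint32_t cnts)
{
    /* The owner CPU sits in the low bits, so it can be masked and
     * compared directly, without MASK_EXTR()/shifting. */
    return (cnts & _QW_WMASK) == _QW_LOCKED &&
           (cnts & _QW_CPUMASK) == smp_processor_id();
}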

> @@ -166,7 +180,8 @@ static inline void _write_unlock(rwlock_t *lock)
>       * If the writer field is atomic, it can be cleared directly.
>       * Otherwise, an atomic subtraction will be used to clear it.
>       */
> -    atomic_sub(_QW_LOCKED, &lock->cnts);
> +    ASSERT(_is_write_locked_by_me(atomic_read(&lock->cnts)));
> +    atomic_sub(_write_lock_val(), &lock->cnts);

I think this would be more efficient with atomic_and(), not
the least because of the then avoided smp_processor_id().
Whether to mask off just _QW_WMASK or also the CPU number of
the last write owner would need to be determined. But with
using subtraction, in case of problems it'll likely be
harder to understand what actually went on, from looking at
the resulting state of the lock (this is in part a pre-
existing problem, but gets worse with subtraction of CPU
numbers).

Jan
Roger Pau Monné Feb. 20, 2020, 2:11 p.m. UTC | #2
On Thu, Feb 20, 2020 at 01:48:54PM +0100, Jan Beulich wrote:
> On 20.02.2020 13:02, Roger Pau Monne wrote:
> > I've done some testing and at least the CPU down case is fixed now.
> > Posting early in order to get feedback on the approach taken.
> 
> Looks good, thanks, just a question and two comments:
> 
> > --- a/xen/include/xen/rwlock.h
> > +++ b/xen/include/xen/rwlock.h
> > @@ -20,21 +20,30 @@ typedef struct {
> >  #define DEFINE_RWLOCK(l) rwlock_t l = RW_LOCK_UNLOCKED
> >  #define rwlock_init(l) (*(l) = (rwlock_t)RW_LOCK_UNLOCKED)
> >  
> > -/*
> > - * Writer states & reader shift and bias.
> > - *
> > - * Writer field is 8 bit to allow for potential optimisation, see
> > - * _write_unlock().
> > - */
> > -#define    _QW_WAITING  1               /* A writer is waiting     */
> > -#define    _QW_LOCKED   0xff            /* A writer holds the lock */
> > -#define    _QW_WMASK    0xff            /* Writer mask.*/
> > -#define    _QR_SHIFT    8               /* Reader count shift      */
> > +/* Writer states & reader shift and bias. */
> > +#define    _QW_WAITING  1                       /* A writer is waiting */
> > +#define    _QW_LOCKED   3                       /* A writer holds the lock */
> 
> Aiui things would work equally well if 2 was used here?

I think so, I left it as 3 because previously LOCKED would also
include WAITING, and I didn't want to change it in case I've missed
some code path that was relying on that.

> 
> > +#define    _QW_WMASK    3                       /* Writer mask */
> > +#define    _QW_CPUSHIFT 2                       /* Writer CPU shift */
> > +#define    _QW_CPUMASK  0x3ffc                  /* Writer CPU mask */
> 
> At least on x86, the shift involved here is quite certainly
> more expensive than using wider immediates on AND and CMP
> resulting from the _QW_MASK-based comparisons. I'd therefore
> like to suggest to put the CPU in the low 12 bits.

Hm right. The LOCKED and WAITING bits don't need shifting anyway.

> 
> Another option is to use the recurse_cpu field of the
> associated spin lock: The field is used for recursive locks
> only, and hence the only conflict would be with
> _spin_is_locked(), which we don't (and in the future then
> also shouldn't) use on this lock.

I looked into that also, but things get more complicated AFAICT, as it's
not possible to atomically fetch the state of the lock and the owner
CPU at the same time. Nor could you set the LOCKED bit and the CPU
at the same time.

> 
> > @@ -166,7 +180,8 @@ static inline void _write_unlock(rwlock_t *lock)
> >       * If the writer field is atomic, it can be cleared directly.
> >       * Otherwise, an atomic subtraction will be used to clear it.
> >       */
> > -    atomic_sub(_QW_LOCKED, &lock->cnts);
> > +    ASSERT(_is_write_locked_by_me(atomic_read(&lock->cnts)));
> > +    atomic_sub(_write_lock_val(), &lock->cnts);
> 
> I think this would be more efficient with atomic_and(), not
> the least because of the then avoided smp_processor_id().
> Whether to mask off just _QW_WMASK or also the CPU number of
> the last write owner would need to be determined. But with
> using subtraction, in case of problems it'll likely be
> harder to understand what actually went on, from looking at
> the resulting state of the lock (this is in part a pre-
> existing problem, but gets worse with subtraction of CPU
> numbers).

Right, a mask would be better. Right now both need to be cleared (the
LOCKED and the CPU fields) as there's code that relies on !lock->cnts
as a way to determine that the lock is not read or write locked. If we
left the CPU lying around those checks would need to be adjusted.

Thanks, Roger.
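
A minimal sketch of the atomic_and() based unlock being discussed,
clearing both the LOCKED and CPU fields so that checks relying on
!lock->cnts keep working (it assumes an atomic_and() helper operating
on atomic_t is available):

static inline void _write_unlock(rwlock_t *lock)
{
    ASSERT(_is_write_locked_by_me(atomic_read(&lock->cnts)));
    /*
     * Clear both the writer state and the owner CPU in a single atomic
     * AND, leaving the reader count untouched. No smp_processor_id()
     * call is needed, and the value goes back to 0 once there are no
     * readers either.
     */
    atomic_and(~(_QW_WMASK | _QW_CPUMASK), &lock->cnts);
}
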
Jürgen Groß Feb. 20, 2020, 2:23 p.m. UTC | #3
On 20.02.20 15:11, Roger Pau Monné wrote:
> On Thu, Feb 20, 2020 at 01:48:54PM +0100, Jan Beulich wrote:
>> On 20.02.2020 13:02, Roger Pau Monne wrote:
>>> I've done some testing and at least the CPU down case is fixed now.
>>> Posting early in order to get feedback on the approach taken.
>>
>> Looks good, thanks, just a question and two comments:
>>
>>> --- a/xen/include/xen/rwlock.h
>>> +++ b/xen/include/xen/rwlock.h
>>> @@ -20,21 +20,30 @@ typedef struct {
>>>   #define DEFINE_RWLOCK(l) rwlock_t l = RW_LOCK_UNLOCKED
>>>   #define rwlock_init(l) (*(l) = (rwlock_t)RW_LOCK_UNLOCKED)
>>>   
>>> -/*
>>> - * Writer states & reader shift and bias.
>>> - *
>>> - * Writer field is 8 bit to allow for potential optimisation, see
>>> - * _write_unlock().
>>> - */
>>> -#define    _QW_WAITING  1               /* A writer is waiting     */
>>> -#define    _QW_LOCKED   0xff            /* A writer holds the lock */
>>> -#define    _QW_WMASK    0xff            /* Writer mask.*/
>>> -#define    _QR_SHIFT    8               /* Reader count shift      */
>>> +/* Writer states & reader shift and bias. */
>>> +#define    _QW_WAITING  1                       /* A writer is waiting */
>>> +#define    _QW_LOCKED   3                       /* A writer holds the lock */
>>
>> Aiui things would work equally well if 2 was used here?
> 
> I think so, I left it as 3 because previously LOCKED would also
> include WAITING, and I didn't want to change it in case I've missed
> some code path that was relying on that.
> 
>>
>>> +#define    _QW_WMASK    3                       /* Writer mask */
>>> +#define    _QW_CPUSHIFT 2                       /* Writer CPU shift */
>>> +#define    _QW_CPUMASK  0x3ffc                  /* Writer CPU mask */
>>
>> At least on x86, the shift involved here is quite certainly
>> more expensive than using wider immediates on AND and CMP
>> resulting from the _QW_MASK-based comparisons. I'd therefore
>> like to suggest to put the CPU in the low 12 bits.
> 
> Hm right. The LOCKED and WAITING bits don't need shifting anyway.
> 
>>
>> Another option is to use the recurse_cpu field of the
>> associated spin lock: The field is used for recursive locks
>> only, and hence the only conflict would be with
>> _spin_is_locked(), which we don't (and in the future then
>> also shouldn't) use on this lock.
> 
> I looked into that also, but things get more complicated AFAICT, as it's
> not possible to atomically fetch the state of the lock and the owner
> CPU at the same time. Neither you could set the LOCKED bit and the CPU
> at the same time.
> 
>>
>>> @@ -166,7 +180,8 @@ static inline void _write_unlock(rwlock_t *lock)
>>>        * If the writer field is atomic, it can be cleared directly.
>>>        * Otherwise, an atomic subtraction will be used to clear it.
>>>        */
>>> -    atomic_sub(_QW_LOCKED, &lock->cnts);
>>> +    ASSERT(_is_write_locked_by_me(atomic_read(&lock->cnts)));
>>> +    atomic_sub(_write_lock_val(), &lock->cnts);
>>
>> I think this would be more efficient with atomic_and(), not
>> the least because of the then avoided smp_processor_id().
>> Whether to mask off just _QW_WMASK or also the CPU number of
>> the last write owner would need to be determined. But with
>> using subtraction, in case of problems it'll likely be
>> harder to understand what actually went on, from looking at
>> the resulting state of the lock (this is in part a pre-
>> existing problem, but gets worse with subtraction of CPU
>> numbers).
> 
> Right, a mask would be better. Right now both need to be cleared (the
> LOCKED and the CPU fields) as there's code that relies on !lock->cnts
> as a way to determine that the lock is not read or write locked. If we
> left the CPU lying around those checks would need to be adjusted.

In case you make _QR_SHIFT 16 it is possible to just write a 2-byte zero
value for write_unlock() (like it's possible to do today using a
single-byte write).


Juergen
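
A minimal sketch of that variant, assuming _QR_SHIFT becomes 16, a
little-endian lock->cnts, and a write_atomic() that accepts a uint16_t
pointer:

static inline void _write_unlock(rwlock_t *lock)
{
    ASSERT(_is_write_locked_by_me(atomic_read(&lock->cnts)));
    /*
     * With the writer state and owner CPU packed into the low 16 bits,
     * both can be cleared with one 16-bit store; readers only ever
     * add/subtract multiples of _QR_BIAS (1 << 16), so they never
     * touch these bits.
     */
    write_atomic((uint16_t *)&lock->cnts, (uint16_t)0);
}
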
Roger Pau Monné Feb. 20, 2020, 2:38 p.m. UTC | #4
On Thu, Feb 20, 2020 at 03:23:38PM +0100, Jürgen Groß wrote:
> On 20.02.20 15:11, Roger Pau Monné wrote:
> > On Thu, Feb 20, 2020 at 01:48:54PM +0100, Jan Beulich wrote:
> > > On 20.02.2020 13:02, Roger Pau Monne wrote:
> > > > I've done some testing and at least the CPU down case is fixed now.
> > > > Posting early in order to get feedback on the approach taken.
> > > 
> > > Looks good, thanks, just a question and two comments:
> > > 
> > > > --- a/xen/include/xen/rwlock.h
> > > > +++ b/xen/include/xen/rwlock.h
> > > > @@ -20,21 +20,30 @@ typedef struct {
> > > >   #define DEFINE_RWLOCK(l) rwlock_t l = RW_LOCK_UNLOCKED
> > > >   #define rwlock_init(l) (*(l) = (rwlock_t)RW_LOCK_UNLOCKED)
> > > > -/*
> > > > - * Writer states & reader shift and bias.
> > > > - *
> > > > - * Writer field is 8 bit to allow for potential optimisation, see
> > > > - * _write_unlock().
> > > > - */
> > > > -#define    _QW_WAITING  1               /* A writer is waiting     */
> > > > -#define    _QW_LOCKED   0xff            /* A writer holds the lock */
> > > > -#define    _QW_WMASK    0xff            /* Writer mask.*/
> > > > -#define    _QR_SHIFT    8               /* Reader count shift      */
> > > > +/* Writer states & reader shift and bias. */
> > > > +#define    _QW_WAITING  1                       /* A writer is waiting */
> > > > +#define    _QW_LOCKED   3                       /* A writer holds the lock */
> > > 
> > > Aiui things would work equally well if 2 was used here?
> > 
> > I think so, I left it as 3 because previously LOCKED would also
> > include WAITING, and I didn't want to change it in case I've missed
> > some code path that was relying on that.
> > 
> > > 
> > > > +#define    _QW_WMASK    3                       /* Writer mask */
> > > > +#define    _QW_CPUSHIFT 2                       /* Writer CPU shift */
> > > > +#define    _QW_CPUMASK  0x3ffc                  /* Writer CPU mask */
> > > 
> > > At least on x86, the shift involved here is quite certainly
> > > more expensive than using wider immediates on AND and CMP
> > > resulting from the _QW_MASK-based comparisons. I'd therefore
> > > like to suggest to put the CPU in the low 12 bits.
> > 
> > Hm right. The LOCKED and WAITING bits don't need shifting anyway.
> > 
> > > 
> > > Another option is to use the recurse_cpu field of the
> > > associated spin lock: The field is used for recursive locks
> > > only, and hence the only conflict would be with
> > > _spin_is_locked(), which we don't (and in the future then
> > > also shouldn't) use on this lock.
> > 
> > I looked into that also, but things get more complicated AFAICT, as it's
> > not possible to atomically fetch the state of the lock and the owner
> > CPU at the same time. Neither you could set the LOCKED bit and the CPU
> > at the same time.
> > 
> > > 
> > > > @@ -166,7 +180,8 @@ static inline void _write_unlock(rwlock_t *lock)
> > > >        * If the writer field is atomic, it can be cleared directly.
> > > >        * Otherwise, an atomic subtraction will be used to clear it.
> > > >        */
> > > > -    atomic_sub(_QW_LOCKED, &lock->cnts);
> > > > +    ASSERT(_is_write_locked_by_me(atomic_read(&lock->cnts)));
> > > > +    atomic_sub(_write_lock_val(), &lock->cnts);
> > > 
> > > I think this would be more efficient with atomic_and(), not
> > > the least because of the then avoided smp_processor_id().
> > > Whether to mask off just _QW_WMASK or also the CPU number of
> > > the last write owner would need to be determined. But with
> > > using subtraction, in case of problems it'll likely be
> > > harder to understand what actually went on, from looking at
> > > the resulting state of the lock (this is in part a pre-
> > > existing problem, but gets worse with subtraction of CPU
> > > numbers).
> > 
> > Right, a mask would be better. Right now both need to be cleared (the
> > LOCKED and the CPU fields) as there's code that relies on !lock->cnts
> > as a way to determine that the lock is not read or write locked. If we
> > left the CPU lying around those checks would need to be adjusted.
> 
> In case you make _QR_SHIFT 16 it is possible to just write a 2-byte zero
> value for write_unlock() (like its possible to do so today using a
> single byte write).

That would limit the reader count to 65536; what do you think, Jan?

Thanks, Roger.
Jan Beulich Feb. 20, 2020, 3:02 p.m. UTC | #5
On 20.02.2020 15:11, Roger Pau Monné wrote:
> On Thu, Feb 20, 2020 at 01:48:54PM +0100, Jan Beulich wrote:
>> Another option is to use the recurse_cpu field of the
>> associated spin lock: The field is used for recursive locks
>> only, and hence the only conflict would be with
>> _spin_is_locked(), which we don't (and in the future then
>> also shouldn't) use on this lock.
> 
> I looked into that also, but things get more complicated AFAICT, as it's
> not possible to atomically fetch the state of the lock and the owner
> CPU at the same time. Neither you could set the LOCKED bit and the CPU
> at the same time.

There's no need to atomically fetch both afaics: The field is
valid only if the LOCKED bit is set. And when reading the
field you only care about the value being equal to
smp_processor_id(), i.e. it is fine to set LOCKED before
updating the CPU field on lock, and to reset the CPU field to
SPINLOCK_NO_CPU (or whatever it's called) before clearing
LOCKED.

Jan
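
A rough sketch of the ordering described above, using the embedded
spinlock's recurse_cpu field as the write-owner record (member and
constant names are assumed from Xen's spinlock, and any needed barriers
are omitted):

static inline void _write_lock(rwlock_t *lock)
{
    if ( atomic_cmpxchg(&lock->cnts, 0, _QW_LOCKED) != 0 )
        queue_write_lock_slowpath(lock);
    /*
     * recurse_cpu is only meaningful while _QW_LOCKED is set, so it can
     * be filled in after the lock has been acquired.
     */
    lock->lock.recurse_cpu = smp_processor_id();
}

static inline bool _is_write_locked_by_me(rwlock_t *lock)
{
    return (atomic_read(&lock->cnts) & _QW_WMASK) == _QW_LOCKED &&
           lock->lock.recurse_cpu == smp_processor_id();
}

static inline void _write_unlock(rwlock_t *lock)
{
    /* Invalidate the owner record before dropping the write lock. */
    lock->lock.recurse_cpu = SPINLOCK_NO_CPU;
    atomic_sub(_QW_LOCKED, &lock->cnts);
}
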
Jan Beulich Feb. 20, 2020, 3:11 p.m. UTC | #6
On 20.02.2020 15:38, Roger Pau Monné wrote:
> On Thu, Feb 20, 2020 at 03:23:38PM +0100, Jürgen Groß wrote:
>> On 20.02.20 15:11, Roger Pau Monné wrote:
>>> On Thu, Feb 20, 2020 at 01:48:54PM +0100, Jan Beulich wrote:
>>>> On 20.02.2020 13:02, Roger Pau Monne wrote:
>>>>> @@ -166,7 +180,8 @@ static inline void _write_unlock(rwlock_t *lock)
>>>>>        * If the writer field is atomic, it can be cleared directly.
>>>>>        * Otherwise, an atomic subtraction will be used to clear it.
>>>>>        */
>>>>> -    atomic_sub(_QW_LOCKED, &lock->cnts);
>>>>> +    ASSERT(_is_write_locked_by_me(atomic_read(&lock->cnts)));
>>>>> +    atomic_sub(_write_lock_val(), &lock->cnts);
>>>>
>>>> I think this would be more efficient with atomic_and(), not
>>>> the least because of the then avoided smp_processor_id().
>>>> Whether to mask off just _QW_WMASK or also the CPU number of
>>>> the last write owner would need to be determined. But with
>>>> using subtraction, in case of problems it'll likely be
>>>> harder to understand what actually went on, from looking at
>>>> the resulting state of the lock (this is in part a pre-
>>>> existing problem, but gets worse with subtraction of CPU
>>>> numbers).
>>>
>>> Right, a mask would be better. Right now both need to be cleared (the
>>> LOCKED and the CPU fields) as there's code that relies on !lock->cnts
>>> as a way to determine that the lock is not read or write locked. If we
>>> left the CPU lying around those checks would need to be adjusted.
>>
>> In case you make _QR_SHIFT 16 it is possible to just write a 2-byte zero
>> value for write_unlock() (like its possible to do so today using a
>> single byte write).
> 
> That would limit the readers count to 65536, what do you think Jan?

If the recurse_cpu approach is considered bad, I think this would
be acceptable. But of course you'll need to consult with the Arm
guys regarding the correctness of such a "half" store in their
memory model; I would hope this to be universally okay, but I'm
not entirely certain.

Jan
Roger Pau Monné Feb. 20, 2020, 3:17 p.m. UTC | #7
On Thu, Feb 20, 2020 at 04:02:55PM +0100, Jan Beulich wrote:
> On 20.02.2020 15:11, Roger Pau Monné wrote:
> > On Thu, Feb 20, 2020 at 01:48:54PM +0100, Jan Beulich wrote:
> >> Another option is to use the recurse_cpu field of the
> >> associated spin lock: The field is used for recursive locks
> >> only, and hence the only conflict would be with
> >> _spin_is_locked(), which we don't (and in the future then
> >> also shouldn't) use on this lock.
> > 
> > I looked into that also, but things get more complicated AFAICT, as it's
> > not possible to atomically fetch the state of the lock and the owner
> > CPU at the same time. Neither you could set the LOCKED bit and the CPU
> > at the same time.
> 
> There's no need to atomically fetch both afaics: The field is
> valid only if the LOCKED bit is set. And when reading the
> field you only care about the value being equal to
> smp_processor_id(), i.e. it is fine to set LOCKED before
> updating the CPU field on lock, and to reset the CPU field to
> SPINLOCK_NO_CPU (or whatever it's called) before clearing
> LOCKED.

Yes, that would require using a sentinel value, as 0 would be a
valid processor ID.

I would try to refrain from (abu)sing internal spinlock fields, as I
think we can solve this just by using the cnts field on the current
rwlock implementation.

What issue do you have in mind that would prevent storing the CPU
write locked in the cnts field?

Thanks, Roger.
Roger Pau Monné Feb. 20, 2020, 3:20 p.m. UTC | #8
On Thu, Feb 20, 2020 at 04:11:08PM +0100, Jan Beulich wrote:
> On 20.02.2020 15:38, Roger Pau Monné wrote:
> > On Thu, Feb 20, 2020 at 03:23:38PM +0100, Jürgen Groß wrote:
> >> On 20.02.20 15:11, Roger Pau Monné wrote:
> >>> On Thu, Feb 20, 2020 at 01:48:54PM +0100, Jan Beulich wrote:
> >>>> On 20.02.2020 13:02, Roger Pau Monne wrote:
> >>>>> @@ -166,7 +180,8 @@ static inline void _write_unlock(rwlock_t *lock)
> >>>>>        * If the writer field is atomic, it can be cleared directly.
> >>>>>        * Otherwise, an atomic subtraction will be used to clear it.
> >>>>>        */
> >>>>> -    atomic_sub(_QW_LOCKED, &lock->cnts);
> >>>>> +    ASSERT(_is_write_locked_by_me(atomic_read(&lock->cnts)));
> >>>>> +    atomic_sub(_write_lock_val(), &lock->cnts);
> >>>>
> >>>> I think this would be more efficient with atomic_and(), not
> >>>> the least because of the then avoided smp_processor_id().
> >>>> Whether to mask off just _QW_WMASK or also the CPU number of
> >>>> the last write owner would need to be determined. But with
> >>>> using subtraction, in case of problems it'll likely be
> >>>> harder to understand what actually went on, from looking at
> >>>> the resulting state of the lock (this is in part a pre-
> >>>> existing problem, but gets worse with subtraction of CPU
> >>>> numbers).
> >>>
> >>> Right, a mask would be better. Right now both need to be cleared (the
> >>> LOCKED and the CPU fields) as there's code that relies on !lock->cnts
> >>> as a way to determine that the lock is not read or write locked. If we
> >>> left the CPU lying around those checks would need to be adjusted.
> >>
> >> In case you make _QR_SHIFT 16 it is possible to just write a 2-byte zero
> >> value for write_unlock() (like its possible to do so today using a
> >> single byte write).
> > 
> > That would limit the readers count to 65536, what do you think Jan?
> 
> If the recurse_cpu approach is considered bad, I think this would
> be acceptable. But of course you'll need to consult with the Arm
> guys regarding the correctness of such a "half" store in their
> memory model; I would hope this to be universally okay, but I'm
> not entirely certain.

I would like to get confirmation from the Arm folks, but I see Arm has
write_atomic and supports such an operation on a uint16_t, so I
don't see why it wouldn't work.

Thanks, Roger.
Jan Beulich Feb. 20, 2020, 3:50 p.m. UTC | #9
On 20.02.2020 16:17, Roger Pau Monné wrote:
> On Thu, Feb 20, 2020 at 04:02:55PM +0100, Jan Beulich wrote:
>> On 20.02.2020 15:11, Roger Pau Monné wrote:
>>> On Thu, Feb 20, 2020 at 01:48:54PM +0100, Jan Beulich wrote:
>>>> Another option is to use the recurse_cpu field of the
>>>> associated spin lock: The field is used for recursive locks
>>>> only, and hence the only conflict would be with
>>>> _spin_is_locked(), which we don't (and in the future then
>>>> also shouldn't) use on this lock.
>>>
>>> I looked into that also, but things get more complicated AFAICT, as it's
>>> not possible to atomically fetch the state of the lock and the owner
>>> CPU at the same time. Neither you could set the LOCKED bit and the CPU
>>> at the same time.
>>
>> There's no need to atomically fetch both afaics: The field is
>> valid only if the LOCKED bit is set. And when reading the
>> field you only care about the value being equal to
>> smp_processor_id(), i.e. it is fine to set LOCKED before
>> updating the CPU field on lock, and to reset the CPU field to
>> SPINLOCK_NO_CPU (or whatever it's called) before clearing
>> LOCKED.
> 
> Yes that would require the usage of a sentinel value as 0 would be a
> valid processor ID.
> 
> I would try to refrain from (abu)sing internal spinlock fields, as I
> think we can solve this just by using the cnts field on the current
> rwlock implementation.
> 
> What issue do you have in mind that would prevent storing the CPU
> write locked in the cnts field?

The reduction of the number of bits used for other purposes.

Jan
Jan Beulich Feb. 20, 2020, 3:52 p.m. UTC | #10
On 20.02.2020 16:20, Roger Pau Monné wrote:
> On Thu, Feb 20, 2020 at 04:11:08PM +0100, Jan Beulich wrote:
>> On 20.02.2020 15:38, Roger Pau Monné wrote:
>>> On Thu, Feb 20, 2020 at 03:23:38PM +0100, Jürgen Groß wrote:
>>>> On 20.02.20 15:11, Roger Pau Monné wrote:
>>>>> On Thu, Feb 20, 2020 at 01:48:54PM +0100, Jan Beulich wrote:
>>>>>> On 20.02.2020 13:02, Roger Pau Monne wrote:
>>>>>>> @@ -166,7 +180,8 @@ static inline void _write_unlock(rwlock_t *lock)
>>>>>>>        * If the writer field is atomic, it can be cleared directly.
>>>>>>>        * Otherwise, an atomic subtraction will be used to clear it.
>>>>>>>        */
>>>>>>> -    atomic_sub(_QW_LOCKED, &lock->cnts);
>>>>>>> +    ASSERT(_is_write_locked_by_me(atomic_read(&lock->cnts)));
>>>>>>> +    atomic_sub(_write_lock_val(), &lock->cnts);
>>>>>>
>>>>>> I think this would be more efficient with atomic_and(), not
>>>>>> the least because of the then avoided smp_processor_id().
>>>>>> Whether to mask off just _QW_WMASK or also the CPU number of
>>>>>> the last write owner would need to be determined. But with
>>>>>> using subtraction, in case of problems it'll likely be
>>>>>> harder to understand what actually went on, from looking at
>>>>>> the resulting state of the lock (this is in part a pre-
>>>>>> existing problem, but gets worse with subtraction of CPU
>>>>>> numbers).
>>>>>
>>>>> Right, a mask would be better. Right now both need to be cleared (the
>>>>> LOCKED and the CPU fields) as there's code that relies on !lock->cnts
>>>>> as a way to determine that the lock is not read or write locked. If we
>>>>> left the CPU lying around those checks would need to be adjusted.
>>>>
>>>> In case you make _QR_SHIFT 16 it is possible to just write a 2-byte zero
>>>> value for write_unlock() (like its possible to do so today using a
>>>> single byte write).
>>>
>>> That would limit the readers count to 65536, what do you think Jan?
>>
>> If the recurse_cpu approach is considered bad, I think this would
>> be acceptable. But of course you'll need to consult with the Arm
>> guys regarding the correctness of such a "half" store in their
>> memory model; I would hope this to be universally okay, but I'm
>> not entirely certain.
> 
> I would like to get confirmation from the Arm folks, but I see Arm has
> write_atomic and supports such operation against a uint16_t, so I
> don't see why it wouldn't work.

The operations individually being available doesn't mean that
a combination of 32-bit locked accesses and a 16-bit plain
store is going to produce a consistent result. Perhaps I'm
over-cautious here, but I've been bitten by a vaguely similar
issue on x86 back in the ticket lock (in Linux) days.

Jan
Roger Pau Monné Feb. 20, 2020, 3:57 p.m. UTC | #11
On Thu, Feb 20, 2020 at 04:50:22PM +0100, Jan Beulich wrote:
> On 20.02.2020 16:17, Roger Pau Monné wrote:
> > On Thu, Feb 20, 2020 at 04:02:55PM +0100, Jan Beulich wrote:
> >> On 20.02.2020 15:11, Roger Pau Monné wrote:
> >>> On Thu, Feb 20, 2020 at 01:48:54PM +0100, Jan Beulich wrote:
> >>>> Another option is to use the recurse_cpu field of the
> >>>> associated spin lock: The field is used for recursive locks
> >>>> only, and hence the only conflict would be with
> >>>> _spin_is_locked(), which we don't (and in the future then
> >>>> also shouldn't) use on this lock.
> >>>
> >>> I looked into that also, but things get more complicated AFAICT, as it's
> >>> not possible to atomically fetch the state of the lock and the owner
> >>> CPU at the same time. Neither you could set the LOCKED bit and the CPU
> >>> at the same time.
> >>
> >> There's no need to atomically fetch both afaics: The field is
> >> valid only if the LOCKED bit is set. And when reading the
> >> field you only care about the value being equal to
> >> smp_processor_id(), i.e. it is fine to set LOCKED before
> >> updating the CPU field on lock, and to reset the CPU field to
> >> SPINLOCK_NO_CPU (or whatever it's called) before clearing
> >> LOCKED.
> > 
> > Yes that would require the usage of a sentinel value as 0 would be a
> > valid processor ID.
> > 
> > I would try to refrain from (abu)sing internal spinlock fields, as I
> > think we can solve this just by using the cnts field on the current
> > rwlock implementation.
> > 
> > What issue do you have in mind that would prevent storing the CPU
> > write locked in the cnts field?
> 
> The reduction of the number of bits used for other purposes.

I think it should be fine for now, as we would support systems with up
to 16384 CPU IDs, and 65536 concurrent read lockers, which means each
CPU could take the lock up to 4 times recursively. Note that
supporting 16384 CPUs is still very, very far off the radar.

Thanks, Roger.
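
A compile-time check of that arithmetic, assuming a split of 2
writer-state bits, 14 CPU bits and 16 reader bits in the 32-bit cnts
field:

#include <assert.h>

#define WRITER_BITS  2
#define CPU_BITS    14
#define READER_BITS 16

static_assert(WRITER_BITS + CPU_BITS + READER_BITS == 32,
              "the three fields fill the 32-bit cnts word");
static_assert((1 << CPU_BITS) == 16384, "supported CPU IDs");
static_assert((1 << READER_BITS) == 65536, "concurrent read lockers");
static_assert((1 << READER_BITS) / (1 << CPU_BITS) == 4,
              "read-lock recursion depth per CPU if all CPUs read-lock");
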
Jürgen Groß Feb. 20, 2020, 4 p.m. UTC | #12
On 20.02.20 16:57, Roger Pau Monné wrote:
> On Thu, Feb 20, 2020 at 04:50:22PM +0100, Jan Beulich wrote:
>> On 20.02.2020 16:17, Roger Pau Monné wrote:
>>> On Thu, Feb 20, 2020 at 04:02:55PM +0100, Jan Beulich wrote:
>>>> On 20.02.2020 15:11, Roger Pau Monné wrote:
>>>>> On Thu, Feb 20, 2020 at 01:48:54PM +0100, Jan Beulich wrote:
>>>>>> Another option is to use the recurse_cpu field of the
>>>>>> associated spin lock: The field is used for recursive locks
>>>>>> only, and hence the only conflict would be with
>>>>>> _spin_is_locked(), which we don't (and in the future then
>>>>>> also shouldn't) use on this lock.
>>>>>
>>>>> I looked into that also, but things get more complicated AFAICT, as it's
>>>>> not possible to atomically fetch the state of the lock and the owner
>>>>> CPU at the same time. Neither you could set the LOCKED bit and the CPU
>>>>> at the same time.
>>>>
>>>> There's no need to atomically fetch both afaics: The field is
>>>> valid only if the LOCKED bit is set. And when reading the
>>>> field you only care about the value being equal to
>>>> smp_processor_id(), i.e. it is fine to set LOCKED before
>>>> updating the CPU field on lock, and to reset the CPU field to
>>>> SPINLOCK_NO_CPU (or whatever it's called) before clearing
>>>> LOCKED.
>>>
>>> Yes that would require the usage of a sentinel value as 0 would be a
>>> valid processor ID.
>>>
>>> I would try to refrain from (abu)sing internal spinlock fields, as I
>>> think we can solve this just by using the cnts field on the current
>>> rwlock implementation.
>>>
>>> What issue do you have in mind that would prevent storing the CPU
>>> write locked in the cnts field?
>>
>> The reduction of the number of bits used for other purposes.
> 
> I think it should be fine for now, as we would support systems with up
> to 16384 CPU IDs, and 65536 concurrent read lockers, which mean each
> CPU could take the lock up to 4 times recursively. Note that
> supporting 16384 CPUs is still very, very far off the radar.

Current spinlock implementation limits NR_CPUS to 4096.


Juergen
Roger Pau Monné Feb. 20, 2020, 4:54 p.m. UTC | #13
On Thu, Feb 20, 2020 at 05:00:37PM +0100, Jürgen Groß wrote:
> On 20.02.20 16:57, Roger Pau Monné wrote:
> > On Thu, Feb 20, 2020 at 04:50:22PM +0100, Jan Beulich wrote:
> > > On 20.02.2020 16:17, Roger Pau Monné wrote:
> > > > On Thu, Feb 20, 2020 at 04:02:55PM +0100, Jan Beulich wrote:
> > > > > On 20.02.2020 15:11, Roger Pau Monné wrote:
> > > > > > On Thu, Feb 20, 2020 at 01:48:54PM +0100, Jan Beulich wrote:
> > > > > > > Another option is to use the recurse_cpu field of the
> > > > > > > associated spin lock: The field is used for recursive locks
> > > > > > > only, and hence the only conflict would be with
> > > > > > > _spin_is_locked(), which we don't (and in the future then
> > > > > > > also shouldn't) use on this lock.
> > > > > > 
> > > > > > I looked into that also, but things get more complicated AFAICT, as it's
> > > > > > not possible to atomically fetch the state of the lock and the owner
> > > > > > CPU at the same time. Neither you could set the LOCKED bit and the CPU
> > > > > > at the same time.
> > > > > 
> > > > > There's no need to atomically fetch both afaics: The field is
> > > > > valid only if the LOCKED bit is set. And when reading the
> > > > > field you only care about the value being equal to
> > > > > smp_processor_id(), i.e. it is fine to set LOCKED before
> > > > > updating the CPU field on lock, and to reset the CPU field to
> > > > > SPINLOCK_NO_CPU (or whatever it's called) before clearing
> > > > > LOCKED.
> > > > 
> > > > Yes that would require the usage of a sentinel value as 0 would be a
> > > > valid processor ID.
> > > > 
> > > > I would try to refrain from (abu)sing internal spinlock fields, as I
> > > > think we can solve this just by using the cnts field on the current
> > > > rwlock implementation.
> > > > 
> > > > What issue do you have in mind that would prevent storing the CPU
> > > > write locked in the cnts field?
> > > 
> > > The reduction of the number of bits used for other purposes.
> > 
> > I think it should be fine for now, as we would support systems with up
> > to 16384 CPU IDs, and 65536 concurrent read lockers, which mean each
> > CPU could take the lock up to 4 times recursively. Note that
> > supporting 16384 CPUs is still very, very far off the radar.
> 
> Current spinlock implementation limits NR_CPUS to 4096.

Right, but here we can use up to 14 bits, so I think it's better to
assign them now than to leave them as padding?

Thanks, Roger.

Patch

diff --git a/xen/common/rwlock.c b/xen/common/rwlock.c
index d568bbf6de..dadab372b5 100644
--- a/xen/common/rwlock.c
+++ b/xen/common/rwlock.c
@@ -69,7 +69,7 @@  void queue_write_lock_slowpath(rwlock_t *lock)
 
     /* Try to acquire the lock directly if no reader is present. */
     if ( !atomic_read(&lock->cnts) &&
-         (atomic_cmpxchg(&lock->cnts, 0, _QW_LOCKED) == 0) )
+         (atomic_cmpxchg(&lock->cnts, 0, _write_lock_val()) == 0) )
         goto unlock;
 
     /*
@@ -93,7 +93,7 @@  void queue_write_lock_slowpath(rwlock_t *lock)
         cnts = atomic_read(&lock->cnts);
         if ( (cnts == _QW_WAITING) &&
              (atomic_cmpxchg(&lock->cnts, _QW_WAITING,
-                             _QW_LOCKED) == _QW_WAITING) )
+                             _write_lock_val()) == _QW_WAITING) )
             break;
 
         cpu_relax();
diff --git a/xen/include/xen/rwlock.h b/xen/include/xen/rwlock.h
index 3dfea1ac2a..b430ebd846 100644
--- a/xen/include/xen/rwlock.h
+++ b/xen/include/xen/rwlock.h
@@ -20,21 +20,30 @@  typedef struct {
 #define DEFINE_RWLOCK(l) rwlock_t l = RW_LOCK_UNLOCKED
 #define rwlock_init(l) (*(l) = (rwlock_t)RW_LOCK_UNLOCKED)
 
-/*
- * Writer states & reader shift and bias.
- *
- * Writer field is 8 bit to allow for potential optimisation, see
- * _write_unlock().
- */
-#define    _QW_WAITING  1               /* A writer is waiting     */
-#define    _QW_LOCKED   0xff            /* A writer holds the lock */
-#define    _QW_WMASK    0xff            /* Writer mask.*/
-#define    _QR_SHIFT    8               /* Reader count shift      */
+/* Writer states & reader shift and bias. */
+#define    _QW_WAITING  1                       /* A writer is waiting */
+#define    _QW_LOCKED   3                       /* A writer holds the lock */
+#define    _QW_WMASK    3                       /* Writer mask */
+#define    _QW_CPUSHIFT 2                       /* Writer CPU shift */
+#define    _QW_CPUMASK  0x3ffc                  /* Writer CPU mask */
+#define    _QR_SHIFT    14                      /* Reader count shift */
 #define    _QR_BIAS     (1U << _QR_SHIFT)
 
 void queue_read_lock_slowpath(rwlock_t *lock);
 void queue_write_lock_slowpath(rwlock_t *lock);
 
+static inline bool _is_write_locked_by_me(uint32_t cnts)
+{
+    BUILD_BUG_ON((_QW_CPUMASK >> _QW_CPUSHIFT) < NR_CPUS);
+    return (cnts & _QW_WMASK) == _QW_LOCKED &&
+           MASK_EXTR(cnts, _QW_CPUMASK) == smp_processor_id();
+}
+
+static inline bool _can_read_lock(uint32_t cnts)
+{
+    return !(cnts & _QW_WMASK) || _is_write_locked_by_me(cnts);
+}
+
 /*
  * _read_trylock - try to acquire read lock of a queue rwlock.
  * @lock : Pointer to queue rwlock structure.
@@ -45,10 +54,10 @@  static inline int _read_trylock(rwlock_t *lock)
     u32 cnts;
 
     cnts = atomic_read(&lock->cnts);
-    if ( likely(!(cnts & _QW_WMASK)) )
+    if ( likely(_can_read_lock(cnts)) )
     {
         cnts = (u32)atomic_add_return(_QR_BIAS, &lock->cnts);
-        if ( likely(!(cnts & _QW_WMASK)) )
+        if ( likely(_can_read_lock(cnts)) )
             return 1;
         atomic_sub(_QR_BIAS, &lock->cnts);
     }
@@ -64,7 +73,7 @@  static inline void _read_lock(rwlock_t *lock)
     u32 cnts;
 
     cnts = atomic_add_return(_QR_BIAS, &lock->cnts);
-    if ( likely(!(cnts & _QW_WMASK)) )
+    if ( likely(_can_read_lock(cnts)) )
         return;
 
     /* The slowpath will decrement the reader count, if necessary. */
@@ -115,6 +124,11 @@  static inline int _rw_is_locked(rwlock_t *lock)
     return atomic_read(&lock->cnts);
 }
 
+static inline uint32_t _write_lock_val(void)
+{
+    return _QW_LOCKED | MASK_INSR(smp_processor_id(), _QW_CPUMASK);
+}
+
 /*
  * queue_write_lock - acquire write lock of a queue rwlock.
  * @lock : Pointer to queue rwlock structure.
@@ -122,7 +136,7 @@  static inline int _rw_is_locked(rwlock_t *lock)
 static inline void _write_lock(rwlock_t *lock)
 {
     /* Optimize for the unfair lock case where the fair flag is 0. */
-    if ( atomic_cmpxchg(&lock->cnts, 0, _QW_LOCKED) == 0 )
+    if ( atomic_cmpxchg(&lock->cnts, 0, _write_lock_val()) == 0 )
         return;
 
     queue_write_lock_slowpath(lock);
@@ -157,7 +171,7 @@  static inline int _write_trylock(rwlock_t *lock)
     if ( unlikely(cnts) )
         return 0;
 
-    return likely(atomic_cmpxchg(&lock->cnts, 0, _QW_LOCKED) == 0);
+    return likely(atomic_cmpxchg(&lock->cnts, 0, _write_lock_val()) == 0);
 }
 
 static inline void _write_unlock(rwlock_t *lock)
@@ -166,7 +180,8 @@  static inline void _write_unlock(rwlock_t *lock)
      * If the writer field is atomic, it can be cleared directly.
      * Otherwise, an atomic subtraction will be used to clear it.
      */
-    atomic_sub(_QW_LOCKED, &lock->cnts);
+    ASSERT(_is_write_locked_by_me(atomic_read(&lock->cnts)));
+    atomic_sub(_write_lock_val(), &lock->cnts);
 }
 
 static inline void _write_unlock_irq(rwlock_t *lock)