Message ID | 20200220120231.86907-1-roger.pau@citrix.com (mailing list archive) |
---|---|
State | Superseded |
Series | rwlock: allow recursive read locking when already locked in write mode |
On 20.02.2020 13:02, Roger Pau Monne wrote:
> I've done some testing and at least the CPU down case is fixed now.
> Posting early in order to get feedback on the approach taken.

Looks good, thanks, just a question and two comments:

> --- a/xen/include/xen/rwlock.h
> +++ b/xen/include/xen/rwlock.h
> @@ -20,21 +20,30 @@ typedef struct {
>  #define DEFINE_RWLOCK(l) rwlock_t l = RW_LOCK_UNLOCKED
>  #define rwlock_init(l) (*(l) = (rwlock_t)RW_LOCK_UNLOCKED)
> 
> -/*
> - * Writer states & reader shift and bias.
> - *
> - * Writer field is 8 bit to allow for potential optimisation, see
> - * _write_unlock().
> - */
> -#define _QW_WAITING 1 /* A writer is waiting */
> -#define _QW_LOCKED 0xff /* A writer holds the lock */
> -#define _QW_WMASK 0xff /* Writer mask.*/
> -#define _QR_SHIFT 8 /* Reader count shift */
> +/* Writer states & reader shift and bias. */
> +#define _QW_WAITING 1 /* A writer is waiting */
> +#define _QW_LOCKED 3 /* A writer holds the lock */

Aiui things would work equally well if 2 was used here?

> +#define _QW_WMASK 3 /* Writer mask */
> +#define _QW_CPUSHIFT 2 /* Writer CPU shift */
> +#define _QW_CPUMASK 0x3ffc /* Writer CPU mask */

At least on x86, the shift involved here is quite certainly
more expensive than using wider immediates on AND and CMP
resulting from the _QW_MASK-based comparisons. I'd therefore
like to suggest to put the CPU in the low 12 bits.

Another option is to use the recurse_cpu field of the
associated spin lock: The field is used for recursive locks
only, and hence the only conflict would be with
_spin_is_locked(), which we don't (and in the future then
also shouldn't) use on this lock.

> @@ -166,7 +180,8 @@ static inline void _write_unlock(rwlock_t *lock)
>       * If the writer field is atomic, it can be cleared directly.
>       * Otherwise, an atomic subtraction will be used to clear it.
>       */
> -    atomic_sub(_QW_LOCKED, &lock->cnts);
> +    ASSERT(_is_write_locked_by_me(atomic_read(&lock->cnts)));
> +    atomic_sub(_write_lock_val(), &lock->cnts);

I think this would be more efficient with atomic_and(), not
the least because of the then avoided smp_processor_id().
Whether to mask off just _QW_WMASK or also the CPU number of
the last write owner would need to be determined. But with
using subtraction, in case of problems it'll likely be
harder to understand what actually went on, from looking at
the resulting state of the lock (this is in part a pre-
existing problem, but gets worse with subtraction of CPU
numbers).

Jan
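For illustration, a minimal sketch of the alternative Jan outlines above: CPU
number in the low 12 bits, and unlock via a mask. The constant values and the
atomic_and() helper (assumed to follow the atomic_sub() calling convention)
are assumptions for the sketch, not part of the posted patch:

    /* Sketch only: alternative layout, not the one in the posted patch. */
    #define _QW_CPUMASK  0xfffU          /* Writer CPU mask (low 12 bits) */
    #define _QW_WAITING  (1U << 12)      /* A writer is waiting */
    #define _QW_LOCKED   (3U << 12)      /* A writer holds the lock */
    #define _QW_WMASK    (3U << 12)      /* Writer mask */
    #define _QR_SHIFT    14              /* Reader count shift */

    static inline void _write_unlock(rwlock_t *lock)
    {
        /* Clear writer state and owner CPU in one go, without needing
         * smp_processor_id() on the unlock path. */
        atomic_and(~(_QW_WMASK | _QW_CPUMASK), &lock->cnts);
    }

With such a layout the owner check would need no shift at all: a
_is_write_locked_by_me() style helper could compare cnts & _QW_CPUMASK
directly against smp_processor_id().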
On Thu, Feb 20, 2020 at 01:48:54PM +0100, Jan Beulich wrote:
> On 20.02.2020 13:02, Roger Pau Monne wrote:
> > I've done some testing and at least the CPU down case is fixed now.
> > Posting early in order to get feedback on the approach taken.
> 
> Looks good, thanks, just a question and two comments:
> 
> > --- a/xen/include/xen/rwlock.h
> > +++ b/xen/include/xen/rwlock.h
> > @@ -20,21 +20,30 @@ typedef struct {
> >  #define DEFINE_RWLOCK(l) rwlock_t l = RW_LOCK_UNLOCKED
> >  #define rwlock_init(l) (*(l) = (rwlock_t)RW_LOCK_UNLOCKED)
> > 
> > -/*
> > - * Writer states & reader shift and bias.
> > - *
> > - * Writer field is 8 bit to allow for potential optimisation, see
> > - * _write_unlock().
> > - */
> > -#define _QW_WAITING 1 /* A writer is waiting */
> > -#define _QW_LOCKED 0xff /* A writer holds the lock */
> > -#define _QW_WMASK 0xff /* Writer mask.*/
> > -#define _QR_SHIFT 8 /* Reader count shift */
> > +/* Writer states & reader shift and bias. */
> > +#define _QW_WAITING 1 /* A writer is waiting */
> > +#define _QW_LOCKED 3 /* A writer holds the lock */
> 
> Aiui things would work equally well if 2 was used here?

I think so, I left it as 3 because previously LOCKED would also
include WAITING, and I didn't want to change it in case I've missed
some code path that was relying on that.

> 
> > +#define _QW_WMASK 3 /* Writer mask */
> > +#define _QW_CPUSHIFT 2 /* Writer CPU shift */
> > +#define _QW_CPUMASK 0x3ffc /* Writer CPU mask */
> 
> At least on x86, the shift involved here is quite certainly
> more expensive than using wider immediates on AND and CMP
> resulting from the _QW_MASK-based comparisons. I'd therefore
> like to suggest to put the CPU in the low 12 bits.

Hm right. The LOCKED and WAITING bits don't need shifting anyway.

> 
> Another option is to use the recurse_cpu field of the
> associated spin lock: The field is used for recursive locks
> only, and hence the only conflict would be with
> _spin_is_locked(), which we don't (and in the future then
> also shouldn't) use on this lock.

I looked into that also, but things get more complicated AFAICT, as it's
not possible to atomically fetch the state of the lock and the owner
CPU at the same time. Neither you could set the LOCKED bit and the CPU
at the same time.

> 
> > @@ -166,7 +180,8 @@ static inline void _write_unlock(rwlock_t *lock)
> >       * If the writer field is atomic, it can be cleared directly.
> >       * Otherwise, an atomic subtraction will be used to clear it.
> >       */
> > -    atomic_sub(_QW_LOCKED, &lock->cnts);
> > +    ASSERT(_is_write_locked_by_me(atomic_read(&lock->cnts)));
> > +    atomic_sub(_write_lock_val(), &lock->cnts);
> 
> I think this would be more efficient with atomic_and(), not
> the least because of the then avoided smp_processor_id().
> Whether to mask off just _QW_WMASK or also the CPU number of
> the last write owner would need to be determined. But with
> using subtraction, in case of problems it'll likely be
> harder to understand what actually went on, from looking at
> the resulting state of the lock (this is in part a pre-
> existing problem, but gets worse with subtraction of CPU
> numbers).

Right, a mask would be better. Right now both need to be cleared (the
LOCKED and the CPU fields) as there's code that relies on !lock->cnts
as a way to determine that the lock is not read or write locked. If we
left the CPU lying around those checks would need to be adjusted.

Thanks, Roger.
On 20.02.20 15:11, Roger Pau Monné wrote:
> On Thu, Feb 20, 2020 at 01:48:54PM +0100, Jan Beulich wrote:
>> On 20.02.2020 13:02, Roger Pau Monne wrote:
>>> @@ -166,7 +180,8 @@ static inline void _write_unlock(rwlock_t *lock)
>>>       * If the writer field is atomic, it can be cleared directly.
>>>       * Otherwise, an atomic subtraction will be used to clear it.
>>>       */
>>> -    atomic_sub(_QW_LOCKED, &lock->cnts);
>>> +    ASSERT(_is_write_locked_by_me(atomic_read(&lock->cnts)));
>>> +    atomic_sub(_write_lock_val(), &lock->cnts);
>>
>> I think this would be more efficient with atomic_and(), not
>> the least because of the then avoided smp_processor_id().
>> Whether to mask off just _QW_WMASK or also the CPU number of
>> the last write owner would need to be determined. But with
>> using subtraction, in case of problems it'll likely be
>> harder to understand what actually went on, from looking at
>> the resulting state of the lock (this is in part a pre-
>> existing problem, but gets worse with subtraction of CPU
>> numbers).
>
> Right, a mask would be better. Right now both need to be cleared (the
> LOCKED and the CPU fields) as there's code that relies on !lock->cnts
> as a way to determine that the lock is not read or write locked. If we
> left the CPU lying around those checks would need to be adjusted.

In case you make _QR_SHIFT 16 it is possible to just write a 2-byte zero
value for write_unlock() (like its possible to do so today using a
single byte write).


Juergen
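For illustration, a rough sketch of the 2-byte store Jürgen suggests. It
assumes _QR_SHIFT were 16, a little-endian layout (as on x86 and Arm in Xen),
and that write_atomic() can be used through a uint16_t pointer here; this is
not code from the series:

    static inline void _write_unlock(rwlock_t *lock)
    {
        /* With a 16/16 split the writer state and owner CPU live entirely
         * in the low 16 bits of cnts, so a single half-word store clears
         * both, mirroring the old single-byte clear of _QW_LOCKED. */
        write_atomic((uint16_t *)&lock->cnts, 0);
    }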
On Thu, Feb 20, 2020 at 03:23:38PM +0100, Jürgen Groß wrote:
> On 20.02.20 15:11, Roger Pau Monné wrote:
> > On Thu, Feb 20, 2020 at 01:48:54PM +0100, Jan Beulich wrote:
> > > I think this would be more efficient with atomic_and(), not
> > > the least because of the then avoided smp_processor_id().
> > > Whether to mask off just _QW_WMASK or also the CPU number of
> > > the last write owner would need to be determined. But with
> > > using subtraction, in case of problems it'll likely be
> > > harder to understand what actually went on, from looking at
> > > the resulting state of the lock (this is in part a pre-
> > > existing problem, but gets worse with subtraction of CPU
> > > numbers).
> >
> > Right, a mask would be better. Right now both need to be cleared (the
> > LOCKED and the CPU fields) as there's code that relies on !lock->cnts
> > as a way to determine that the lock is not read or write locked. If we
> > left the CPU lying around those checks would need to be adjusted.
>
> In case you make _QR_SHIFT 16 it is possible to just write a 2-byte zero
> value for write_unlock() (like its possible to do so today using a
> single byte write).

That would limit the readers count to 65536, what do you think Jan?

Thanks, Roger.
On 20.02.2020 15:11, Roger Pau Monné wrote:
> On Thu, Feb 20, 2020 at 01:48:54PM +0100, Jan Beulich wrote:
>> Another option is to use the recurse_cpu field of the
>> associated spin lock: The field is used for recursive locks
>> only, and hence the only conflict would be with
>> _spin_is_locked(), which we don't (and in the future then
>> also shouldn't) use on this lock.
> 
> I looked into that also, but things get more complicated AFAICT, as it's
> not possible to atomically fetch the state of the lock and the owner
> CPU at the same time. Neither you could set the LOCKED bit and the CPU
> at the same time.

There's no need to atomically fetch both afaics: The field is
valid only if the LOCKED bit is set. And when reading the
field you only care about the value being equal to
smp_processor_id(), i.e. it is fine to set LOCKED before
updating the CPU field on lock, and to reset the CPU field to
SPINLOCK_NO_CPU (or whatever it's called) before clearing
LOCKED.

Jan
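For illustration, the ordering Jan describes might look roughly like this. It
is a sketch of the alternative only, not the posted patch, and it assumes the
rwlock's embedded spinlock exposes a recurse_cpu field plus a SPINLOCK_NO_CPU
sentinel:

    static inline void _write_lock(rwlock_t *lock)
    {
        if ( atomic_cmpxchg(&lock->cnts, 0, _QW_LOCKED) != 0 )
            queue_write_lock_slowpath(lock);
        /* Publish the owner only after LOCKED is set; the field is only
         * trusted while LOCKED is observed, and only ever compared against
         * the local CPU's own ID. */
        lock->lock.recurse_cpu = smp_processor_id();
    }

    static inline void _write_unlock(rwlock_t *lock)
    {
        /* Invalidate the owner before dropping the LOCKED bit. */
        lock->lock.recurse_cpu = SPINLOCK_NO_CPU;
        atomic_sub(_QW_LOCKED, &lock->cnts);
    }

The idea being that a remote CPU may well observe a stale recurse_cpu value,
but since each CPU only compares the field against its own smp_processor_id()
while LOCKED is visible, it cannot mistake someone else's ownership for its
own.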
On 20.02.2020 15:38, Roger Pau Monné wrote:
> On Thu, Feb 20, 2020 at 03:23:38PM +0100, Jürgen Groß wrote:
>> On 20.02.20 15:11, Roger Pau Monné wrote:
>>> Right, a mask would be better. Right now both need to be cleared (the
>>> LOCKED and the CPU fields) as there's code that relies on !lock->cnts
>>> as a way to determine that the lock is not read or write locked. If we
>>> left the CPU lying around those checks would need to be adjusted.
>>
>> In case you make _QR_SHIFT 16 it is possible to just write a 2-byte zero
>> value for write_unlock() (like its possible to do so today using a
>> single byte write).
> 
> That would limit the readers count to 65536, what do you think Jan?

If the recurse_cpu approach is considered bad, I think this would
be acceptable. But of course you'll need to consult with the Arm
guys regarding the correctness of such a "half" store in their
memory model; I would hope this to be universally okay, but I'm
not entirely certain.

Jan
On Thu, Feb 20, 2020 at 04:02:55PM +0100, Jan Beulich wrote:
> On 20.02.2020 15:11, Roger Pau Monné wrote:
> > On Thu, Feb 20, 2020 at 01:48:54PM +0100, Jan Beulich wrote:
> >> Another option is to use the recurse_cpu field of the
> >> associated spin lock: The field is used for recursive locks
> >> only, and hence the only conflict would be with
> >> _spin_is_locked(), which we don't (and in the future then
> >> also shouldn't) use on this lock.
> > 
> > I looked into that also, but things get more complicated AFAICT, as it's
> > not possible to atomically fetch the state of the lock and the owner
> > CPU at the same time. Neither you could set the LOCKED bit and the CPU
> > at the same time.
> 
> There's no need to atomically fetch both afaics: The field is
> valid only if the LOCKED bit is set. And when reading the
> field you only care about the value being equal to
> smp_processor_id(), i.e. it is fine to set LOCKED before
> updating the CPU field on lock, and to reset the CPU field to
> SPINLOCK_NO_CPU (or whatever it's called) before clearing
> LOCKED.

Yes that would require the usage of a sentinel value as 0 would be a
valid processor ID.

I would try to refrain from (abu)sing internal spinlock fields, as I
think we can solve this just by using the cnts field on the current
rwlock implementation.

What issue do you have in mind that would prevent storing the CPU
write locked in the cnts field?

Thanks, Roger.
On Thu, Feb 20, 2020 at 04:11:08PM +0100, Jan Beulich wrote:
> On 20.02.2020 15:38, Roger Pau Monné wrote:
> > On Thu, Feb 20, 2020 at 03:23:38PM +0100, Jürgen Groß wrote:
> >> In case you make _QR_SHIFT 16 it is possible to just write a 2-byte zero
> >> value for write_unlock() (like its possible to do so today using a
> >> single byte write).
> > 
> > That would limit the readers count to 65536, what do you think Jan?
> 
> If the recurse_cpu approach is considered bad, I think this would
> be acceptable. But of course you'll need to consult with the Arm
> guys regarding the correctness of such a "half" store in their
> memory model; I would hope this to be universally okay, but I'm
> not entirely certain.

I would like to get confirmation from the Arm folks, but I see Arm has
write_atomic and supports such operation against a uint16_t, so I
don't see why it wouldn't work.

Thanks, Roger.
On 20.02.2020 16:17, Roger Pau Monné wrote:
> I would try to refrain from (abu)sing internal spinlock fields, as I
> think we can solve this just by using the cnts field on the current
> rwlock implementation.
> 
> What issue do you have in mind that would prevent storing the CPU
> write locked in the cnts field?

The reduction of the number of bits used for other purposes.

Jan
On 20.02.2020 16:20, Roger Pau Monné wrote:
> On Thu, Feb 20, 2020 at 04:11:08PM +0100, Jan Beulich wrote:
>> On 20.02.2020 15:38, Roger Pau Monné wrote:
>>> On Thu, Feb 20, 2020 at 03:23:38PM +0100, Jürgen Groß wrote:
>>>> In case you make _QR_SHIFT 16 it is possible to just write a 2-byte zero
>>>> value for write_unlock() (like its possible to do so today using a
>>>> single byte write).
>>>
>>> That would limit the readers count to 65536, what do you think Jan?
>>
>> If the recurse_cpu approach is considered bad, I think this would
>> be acceptable. But of course you'll need to consult with the Arm
>> guys regarding the correctness of such a "half" store in their
>> memory model; I would hope this to be universally okay, but I'm
>> not entirely certain.
> 
> I would like to get confirmation from the Arm folks, but I see Arm has
> write_atomic and supports such operation against a uint16_t, so I
> don't see why it wouldn't work.

The operations individually being available doesn't mean that a
combination of 32-bit locked accesses and a 16-bit plain store are
going to produce a consistent result. Perhaps I'm over-cautious here,
but I've been bitten by a vaguely similar issue on x86 back in the
ticket lock (in Linux) days.

Jan
On Thu, Feb 20, 2020 at 04:50:22PM +0100, Jan Beulich wrote:
> On 20.02.2020 16:17, Roger Pau Monné wrote:
> > I would try to refrain from (abu)sing internal spinlock fields, as I
> > think we can solve this just by using the cnts field on the current
> > rwlock implementation.
> > 
> > What issue do you have in mind that would prevent storing the CPU
> > write locked in the cnts field?
> 
> The reduction of the number of bits used for other purposes.

I think it should be fine for now, as we would support systems with up
to 16384 CPU IDs, and 65536 concurrent read lockers, which mean each
CPU could take the lock up to 4 times recursively. Note that
supporting 16384 CPUs is still very, very far off the radar.

Thanks, Roger.
On 20.02.20 16:57, Roger Pau Monné wrote:
> On Thu, Feb 20, 2020 at 04:50:22PM +0100, Jan Beulich wrote:
>> On 20.02.2020 16:17, Roger Pau Monné wrote:
>>> What issue do you have in mind that would prevent storing the CPU
>>> write locked in the cnts field?
>>
>> The reduction of the number of bits used for other purposes.
> 
> I think it should be fine for now, as we would support systems with up
> to 16384 CPU IDs, and 65536 concurrent read lockers, which mean each
> CPU could take the lock up to 4 times recursively. Note that
> supporting 16384 CPUs is still very, very far off the radar.

Current spinlock implementation limits NR_CPUS to 4096.


Juergen
On Thu, Feb 20, 2020 at 05:00:37PM +0100, Jürgen Groß wrote:
> On 20.02.20 16:57, Roger Pau Monné wrote:
> > On Thu, Feb 20, 2020 at 04:50:22PM +0100, Jan Beulich wrote:
> > > The reduction of the number of bits used for other purposes.
> > 
> > I think it should be fine for now, as we would support systems with up
> > to 16384 CPU IDs, and 65536 concurrent read lockers, which mean each
> > CPU could take the lock up to 4 times recursively. Note that
> > supporting 16384 CPUs is still very, very far off the radar.
> 
> Current spinlock implementation limits NR_CPUS to 4096.

Right, but here we can use up to 14 bits, so I think it's best to
assign them now than leave them as padding?

Thanks, Roger.
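For reference, the 16/16 split being discussed would amount to a layout along
these lines (illustrative values only, not necessarily what was committed):

    #define _QW_WAITING  1        /* A writer is waiting */
    #define _QW_LOCKED   3        /* A writer holds the lock */
    #define _QW_WMASK    3        /* Writer mask */
    #define _QW_CPUSHIFT 2        /* Writer CPU shift */
    #define _QW_CPUMASK  0xfffc   /* Writer CPU mask: 14 bits, 16384 CPUs */
    #define _QR_SHIFT    16       /* Reader count shift: 65536 readers */
    #define _QR_BIAS     (1U << _QR_SHIFT)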
diff --git a/xen/common/rwlock.c b/xen/common/rwlock.c
index d568bbf6de..dadab372b5 100644
--- a/xen/common/rwlock.c
+++ b/xen/common/rwlock.c
@@ -69,7 +69,7 @@ void queue_write_lock_slowpath(rwlock_t *lock)
 
     /* Try to acquire the lock directly if no reader is present. */
     if ( !atomic_read(&lock->cnts) &&
-         (atomic_cmpxchg(&lock->cnts, 0, _QW_LOCKED) == 0) )
+         (atomic_cmpxchg(&lock->cnts, 0, _write_lock_val()) == 0) )
         goto unlock;
 
     /*
@@ -93,7 +93,7 @@
         cnts = atomic_read(&lock->cnts);
         if ( (cnts == _QW_WAITING) &&
              (atomic_cmpxchg(&lock->cnts, _QW_WAITING,
-                             _QW_LOCKED) == _QW_WAITING) )
+                             _write_lock_val()) == _QW_WAITING) )
             break;
 
         cpu_relax();
diff --git a/xen/include/xen/rwlock.h b/xen/include/xen/rwlock.h
index 3dfea1ac2a..b430ebd846 100644
--- a/xen/include/xen/rwlock.h
+++ b/xen/include/xen/rwlock.h
@@ -20,21 +20,30 @@ typedef struct {
 #define DEFINE_RWLOCK(l) rwlock_t l = RW_LOCK_UNLOCKED
 #define rwlock_init(l) (*(l) = (rwlock_t)RW_LOCK_UNLOCKED)
 
-/*
- * Writer states & reader shift and bias.
- *
- * Writer field is 8 bit to allow for potential optimisation, see
- * _write_unlock().
- */
-#define _QW_WAITING 1 /* A writer is waiting */
-#define _QW_LOCKED 0xff /* A writer holds the lock */
-#define _QW_WMASK 0xff /* Writer mask.*/
-#define _QR_SHIFT 8 /* Reader count shift */
+/* Writer states & reader shift and bias. */
+#define _QW_WAITING 1 /* A writer is waiting */
+#define _QW_LOCKED 3 /* A writer holds the lock */
+#define _QW_WMASK 3 /* Writer mask */
+#define _QW_CPUSHIFT 2 /* Writer CPU shift */
+#define _QW_CPUMASK 0x3ffc /* Writer CPU mask */
+#define _QR_SHIFT 14 /* Reader count shift */
 #define _QR_BIAS (1U << _QR_SHIFT)
 
 void queue_read_lock_slowpath(rwlock_t *lock);
 void queue_write_lock_slowpath(rwlock_t *lock);
 
+static inline bool _is_write_locked_by_me(uint32_t cnts)
+{
+    BUILD_BUG_ON((_QW_CPUMASK >> _QW_CPUSHIFT) < NR_CPUS);
+    return (cnts & _QW_WMASK) == _QW_LOCKED &&
+           MASK_EXTR(cnts, _QW_CPUMASK) == smp_processor_id();
+}
+
+static inline bool _can_read_lock(uint32_t cnts)
+{
+    return !(cnts & _QW_WMASK) || _is_write_locked_by_me(cnts);
+}
+
 /*
  * _read_trylock - try to acquire read lock of a queue rwlock.
  * @lock : Pointer to queue rwlock structure.
@@ -45,10 +54,10 @@ static inline int _read_trylock(rwlock_t *lock)
     u32 cnts;
 
     cnts = atomic_read(&lock->cnts);
-    if ( likely(!(cnts & _QW_WMASK)) )
+    if ( likely(_can_read_lock(cnts)) )
     {
         cnts = (u32)atomic_add_return(_QR_BIAS, &lock->cnts);
-        if ( likely(!(cnts & _QW_WMASK)) )
+        if ( likely(_can_read_lock(cnts)) )
             return 1;
         atomic_sub(_QR_BIAS, &lock->cnts);
     }
@@ -64,7 +73,7 @@ static inline void _read_lock(rwlock_t *lock)
     u32 cnts;
 
     cnts = atomic_add_return(_QR_BIAS, &lock->cnts);
-    if ( likely(!(cnts & _QW_WMASK)) )
+    if ( likely(_can_read_lock(cnts)) )
         return;
 
     /* The slowpath will decrement the reader count, if necessary. */
@@ -115,6 +124,11 @@ static inline int _rw_is_locked(rwlock_t *lock)
     return atomic_read(&lock->cnts);
 }
 
+static inline uint32_t _write_lock_val(void)
+{
+    return _QW_LOCKED | MASK_INSR(smp_processor_id(), _QW_CPUMASK);
+}
+
 /*
  * queue_write_lock - acquire write lock of a queue rwlock.
  * @lock : Pointer to queue rwlock structure.
@@ -122,7 +136,7 @@ static inline int _rw_is_locked(rwlock_t *lock)
 static inline void _write_lock(rwlock_t *lock)
 {
     /* Optimize for the unfair lock case where the fair flag is 0. */
-    if ( atomic_cmpxchg(&lock->cnts, 0, _QW_LOCKED) == 0 )
+    if ( atomic_cmpxchg(&lock->cnts, 0, _write_lock_val()) == 0 )
         return;
 
     queue_write_lock_slowpath(lock);
@@ -157,7 +171,7 @@ static inline int _write_trylock(rwlock_t *lock)
     if ( unlikely(cnts) )
         return 0;
 
-    return likely(atomic_cmpxchg(&lock->cnts, 0, _QW_LOCKED) == 0);
+    return likely(atomic_cmpxchg(&lock->cnts, 0, _write_lock_val()) == 0);
 }
 
 static inline void _write_unlock(rwlock_t *lock)
@@ -166,7 +180,8 @@ static inline void _write_unlock(rwlock_t *lock)
      * If the writer field is atomic, it can be cleared directly.
      * Otherwise, an atomic subtraction will be used to clear it.
      */
-    atomic_sub(_QW_LOCKED, &lock->cnts);
+    ASSERT(_is_write_locked_by_me(atomic_read(&lock->cnts)));
+    atomic_sub(_write_lock_val(), &lock->cnts);
 }
 
 static inline void _write_unlock_irq(rwlock_t *lock)
Allow a CPU already holding the lock in write mode to also lock it in
read mode. There's no harm in allowing read locking a rwlock that's
already owned by the caller (ie: CPU) in write mode. Allowing such
accesses is required at least for the CPU maps use-case.

In order to do this, reserve 12 bits of the lock for the CPU number,
which allows supporting up to 4096 CPUs. Also reduce the write lock
mask to 2 bits: one to signal there are pending writers waiting on the
lock and the other to signal the lock is owned in write mode.

This reduces the maximum number of concurrent readers from 16777216 to
262144, which I think should still be enough, or else the lock field
can be expanded from 32 to 64 bits if all architectures support atomic
operations on 64-bit integers.

Fixes: 5872c83b42c608 ('smp: convert the cpu maps lock into a rw lock')
Reported-by: Jan Beulich <jbeulich@suse.com>
Reported-by: Jürgen Groß <jgross@suse.com>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
I've done some testing and at least the CPU down case is fixed now.
Posting early in order to get feedback on the approach taken.
---
 xen/common/rwlock.c      |  4 ++--
 xen/include/xen/rwlock.h | 47 ++++++++++++++++++++++++++--------------
 2 files changed, 33 insertions(+), 18 deletions(-)
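As a minimal illustration of the behaviour the patch allows (a generic lock
rather than an actual Xen call site):

    static DEFINE_RWLOCK(example_lock);

    void example(void)
    {
        write_lock(&example_lock);
        /* With this change the same CPU may take the lock for reading
         * while it already holds it for writing, as the CPU-maps code
         * does during CPU down. */
        read_lock(&example_lock);
        read_unlock(&example_lock);
        write_unlock(&example_lock);
    }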