[PATCHv5,1/3] rwlock: Add per-cpu reader-writer lock infrastructure

Message ID 1450454920-11036-2-git-send-email-malcolm.crossley@citrix.com (mailing list archive)
State New, archived

Commit Message

Malcolm Crossley Dec. 18, 2015, 4:08 p.m. UTC
Per-cpu read-write locks allow for the fast path read case to have
low overhead by only setting/clearing a per-cpu variable for using
the read lock. The per-cpu read fast path also avoids locked
compare-and-swap operations, which can be particularly slow on coherent
multi-socket systems, especially if there is heavy usage of the
read lock itself.

The per-cpu reader-writer lock uses a per-lock variable to control
the read lock fast path. This allows a writer to disable the fast
path and ensures the readers switch to using the underlying
read-write lock implementation instead of the per-cpu variable.

Once the writer has taken the write lock and disabled the fast path,
it must poll the per-cpu variable for all CPUs which have entered
the critical section for the specific read-write lock the writer is
attempting to take. This design allows for a single per-cpu variable
to be used for read/write locks belonging to separate data structures.
If two or more different per-cpu read locks are taken
simultaneously, then the per-cpu data structure is not used and the
implementation takes the read lock of the underlying read-write lock;
this behaviour is equivalent to the slow path in terms of performance.
The per-cpu rwlock is not recursion safe for taking the per-cpu
read lock because there is no recursion count variable; this is
functionally equivalent to standard spin locks.

Slow path readers, once unblocked, set the per-cpu variable and
drop the read lock. This simplifies the implementation and allows
for fairness in the underlying read-write lock to be taken
advantage of.

There is more overhead on the per-cpu write lock path due to checking
each CPU's fast path per-cpu variable, but this overhead is likely to be
hidden by the required delay of waiting for readers to exit the
critical section. The loop is optimised to only iterate over
the per-cpu data of active readers of the rwlock. The cpumask_t for
tracking the active readers is stored in a single per-cpu data
location and thus the write lock is not pre-emption safe. Therefore
the per-cpu write lock can only be used with interrupts disabled.

Signed-off-by: Malcolm Crossley <malcolm.crossley@citrix.com>
--
Changes since v4:
- Use static inlines for the percpu_owner check

Changes since v3:
- Fix the ASSERTS for the percpu_owner check

Changes since v2:
- Remove stray hard tabs
- Added owner field to percpu_rwlock_t for debug builds. This allows
  for ASSERTs to be added which verify the correct global percpu
  structure is used for the local initialised percpu_rwlock lock

Changes since v1:
- Removed restriction on taking two or more different percpu rw
locks simultaneously
- Moved fast-path/slow-path barrier to be per lock instead of global
- Created separate percpu_rwlock_t type and added macros to
initialise new type
- Added helper macros for using the percpu rwlock itself
- Moved active_readers cpumask off the stack and into a percpu
variable
---
 xen/common/spinlock.c        |  45 +++++++++++++++++
 xen/include/asm-arm/percpu.h |   5 ++
 xen/include/asm-x86/percpu.h |   6 +++
 xen/include/xen/percpu.h     |   4 ++
 xen/include/xen/spinlock.h   | 115 +++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 175 insertions(+)
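
For illustration, a minimal usage sketch of the interface introduced by this
patch follows. The structure and function names here are hypothetical; only
the percpu_rwlock_t type and the percpu rwlock macros come from the patch.

    /* Hypothetical example -- "foo" and "foo_rwlock" are made-up names. */
    DEFINE_PERCPU_RWLOCK_GLOBAL(foo_rwlock);

    struct foo {
        percpu_rwlock_t lock;
        /* ... data protected by the lock ... */
    };

    static void foo_init(struct foo *f)
    {
        percpu_rwlock_resource_init(&f->lock, foo_rwlock);
    }

    static void foo_reader(struct foo *f)
    {
        percpu_read_lock(foo_rwlock, &f->lock);
        /* Read-side critical section: fast path when no writer is active. */
        percpu_read_unlock(foo_rwlock, &f->lock);
    }

    static void foo_writer(struct foo *f)
    {
        /* The write side uses a per-cpu cpumask, so it must not be used
         * from interrupt context. */
        percpu_write_lock(foo_rwlock, &f->lock);
        /* Write-side critical section: all percpu readers have drained. */
        percpu_write_unlock(foo_rwlock, &f->lock);
    }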

Comments

Jan Beulich Dec. 18, 2015, 4:39 p.m. UTC | #1
>>> On 18.12.15 at 17:08, <malcolm.crossley@citrix.com> wrote:
> Per-cpu read-write locks allow for the fast path read case to have
> low overhead by only setting/clearing a per-cpu variable for using
> the read lock. The per-cpu read fast path also avoids locked
> compare-and-swap operations, which can be particularly slow on coherent
> multi-socket systems, especially if there is heavy usage of the
> read lock itself.
> 
> The per-cpu reader-writer lock uses a per-lock variable to control
> the read lock fast path. This allows a writer to disable the fast
> path and ensures the readers switch to using the underlying
> read-write lock implementation instead of the per-cpu variable.
> 
> Once the writer has taken the write lock and disabled the fast path,
> it must poll the per-cpu variable for all CPUs which have entered
> the critical section for the specific read-write lock the writer is
> attempting to take. This design allows for a single per-cpu variable
> to be used for read/write locks belonging to separate data structures.
> If two or more different per-cpu read locks are taken
> simultaneously, then the per-cpu data structure is not used and the
> implementation takes the read lock of the underlying read-write lock;
> this behaviour is equivalent to the slow path in terms of performance.
> The per-cpu rwlock is not recursion safe for taking the per-cpu
> read lock because there is no recursion count variable; this is
> functionally equivalent to standard spin locks.
> 
> Slow path readers, once unblocked, set the per-cpu variable and
> drop the read lock. This simplifies the implementation and allows
> for fairness in the underlying read-write lock to be taken
> advantage of.
> 
> There is more overhead on the per-cpu write lock path due to checking
> each CPU's fast path per-cpu variable, but this overhead is likely to be
> hidden by the required delay of waiting for readers to exit the
> critical section. The loop is optimised to only iterate over
> the per-cpu data of active readers of the rwlock. The cpumask_t for
> tracking the active readers is stored in a single per-cpu data
> location and thus the write lock is not pre-emption safe. Therefore
> the per-cpu write lock can only be used with interrupts disabled.
> 
> Signed-off-by: Malcolm Crossley <malcolm.crossley@citrix.com>

Reviewed-by: Jan Beulich <jbeulich@suse.com>
George Dunlap Dec. 22, 2015, 11:56 a.m. UTC | #2
On 18/12/15 16:08, Malcolm Crossley wrote:
> Per-cpu read-write locks allow for the fast path read case to have
> low overhead by only setting/clearing a per-cpu variable for using
> the read lock. The per-cpu read fast path also avoids locked
> compare-and-swap operations, which can be particularly slow on coherent
> multi-socket systems, especially if there is heavy usage of the
> read lock itself.
> 
> The per-cpu reader-writer lock uses a per-lock variable to control
> the read lock fast path. This allows a writer to disable the fast
> path and ensures the readers switch to using the underlying
> read-write lock implementation instead of the per-cpu variable.
> 
> Once the writer has taken the write lock and disabled the fast path,
> it must poll the per-cpu variable for all CPUs which have entered
> the critical section for the specific read-write lock the writer is
> attempting to take. This design allows for a single per-cpu variable
> to be used for read/write locks belonging to separate data structures.
> If two or more different per-cpu read locks are taken
> simultaneously, then the per-cpu data structure is not used and the
> implementation takes the read lock of the underlying read-write lock;
> this behaviour is equivalent to the slow path in terms of performance.
> The per-cpu rwlock is not recursion safe for taking the per-cpu
> read lock because there is no recursion count variable; this is
> functionally equivalent to standard spin locks.
> 
> Slow path readers, once unblocked, set the per-cpu variable and
> drop the read lock. This simplifies the implementation and allows
> for fairness in the underlying read-write lock to be taken
> advantage of.
> 
> There is more overhead on the per-cpu write lock path due to checking
> each CPU's fast path per-cpu variable, but this overhead is likely to be
> hidden by the required delay of waiting for readers to exit the
> critical section. The loop is optimised to only iterate over
> the per-cpu data of active readers of the rwlock. The cpumask_t for
> tracking the active readers is stored in a single per-cpu data
> location and thus the write lock is not pre-emption safe. Therefore
> the per-cpu write lock can only be used with interrupts disabled.
> 
> Signed-off-by: Malcolm Crossley <malcolm.crossley@citrix.com>
> --
> Changes since v4:
> - Use static inlines for the percpu_owner check
> 
> Changes since v3:
> - Fix the ASSERTS for the percpu_owner check
> 
> Changes since v2:
> - Remove stray hard tabs
> - Added owner field to percpu_rwlock_t for debug builds. This allows
>   for ASSERTs to be added which verify the correct global percpu
>   structure is used for the local initialised percpu_rwlock lock
> 
> Changes since v1:
> - Removed restriction on taking two or more different percpu rw
> locks simultaneously
> - Moved fast-path/slow-path barrier to be per lock instead of global
> - Created separate percpu_rwlock_t type and added macros to
> initialise new type
> - Added helper macros for using the percpu rwlock itself
> - Moved active_readers cpumask off the stack and into a percpu
> variable
> ---
>  xen/common/spinlock.c        |  45 +++++++++++++++++
>  xen/include/asm-arm/percpu.h |   5 ++
>  xen/include/asm-x86/percpu.h |   6 +++
>  xen/include/xen/percpu.h     |   4 ++
>  xen/include/xen/spinlock.h   | 115 +++++++++++++++++++++++++++++++++++++++++++
>  5 files changed, 175 insertions(+)
> 
> diff --git a/xen/common/spinlock.c b/xen/common/spinlock.c
> index 7f89694..901c2b2 100644
> --- a/xen/common/spinlock.c
> +++ b/xen/common/spinlock.c
> @@ -10,6 +10,8 @@
>  #include <asm/processor.h>
>  #include <asm/atomic.h>
>  
> +static DEFINE_PER_CPU(cpumask_t, percpu_rwlock_readers);
> +
>  #ifndef NDEBUG
>  
>  static atomic_t spin_debug __read_mostly = ATOMIC_INIT(0);
> @@ -492,6 +494,49 @@ int _rw_is_write_locked(rwlock_t *lock)
>      return (lock->lock == RW_WRITE_FLAG); /* writer in critical section? */
>  }
>  
> +void _percpu_write_lock(percpu_rwlock_t **per_cpudata,
> +                percpu_rwlock_t *percpu_rwlock)
> +{
> +    unsigned int cpu;
> +    cpumask_t *rwlock_readers = &this_cpu(percpu_rwlock_readers);
> +
> +    /* Validate the correct per_cpudata variable has been provided. */
> +    _percpu_rwlock_owner_check(per_cpudata, percpu_rwlock);
> +
> +    /* 
> +     * First take the write lock to protect against other writers or slow 
> +     * path readers.
> +     */
> +    write_lock(&percpu_rwlock->rwlock);
> +
> +    /* Now set the global variable so that readers start using read_lock. */
> +    percpu_rwlock->writer_activating = 1;
> +    smp_mb();
> +
> +    /* Using a per cpu cpumask is only safe if there is no nesting. */
> +    ASSERT(!in_irq());
> +    cpumask_copy(rwlock_readers, &cpu_online_map);
> +
> +    /* Check if there are any percpu readers in progress on this rwlock. */
> +    for ( ; ; )
> +    {
> +        for_each_cpu(cpu, rwlock_readers)
> +        {
> +            /* 
> +             * Remove any percpu readers not contending on this rwlock
> +             * from our check mask.
> +             */
> +            if ( per_cpu_ptr(per_cpudata, cpu) != percpu_rwlock )
> +                __cpumask_clear_cpu(cpu, rwlock_readers);
> +        }
> +        /* Check if we've cleared all percpu readers from check mask. */
> +        if ( cpumask_empty(rwlock_readers) )
> +            break;
> +        /* Give the coherency fabric a break. */
> +        cpu_relax();
> +    };
> +}
> +
>  #ifdef LOCK_PROFILE
>  
>  struct lock_profile_anc {
> diff --git a/xen/include/asm-arm/percpu.h b/xen/include/asm-arm/percpu.h
> index 71e7649..c308a56 100644
> --- a/xen/include/asm-arm/percpu.h
> +++ b/xen/include/asm-arm/percpu.h
> @@ -27,6 +27,11 @@ void percpu_init_areas(void);
>  #define __get_cpu_var(var) \
>      (*RELOC_HIDE(&per_cpu__##var, READ_SYSREG(TPIDR_EL2)))
>  
> +#define per_cpu_ptr(var, cpu)  \
> +    (*RELOC_HIDE(&var, __per_cpu_offset[cpu]))
> +#define __get_cpu_ptr(var) \
> +    (*RELOC_HIDE(&var, READ_SYSREG(TPIDR_EL2)))
> +
>  #define DECLARE_PER_CPU(type, name) extern __typeof__(type) per_cpu__##name
>  
>  DECLARE_PER_CPU(unsigned int, cpu_id);
> diff --git a/xen/include/asm-x86/percpu.h b/xen/include/asm-x86/percpu.h
> index 604ff0d..51562b9 100644
> --- a/xen/include/asm-x86/percpu.h
> +++ b/xen/include/asm-x86/percpu.h
> @@ -20,4 +20,10 @@ void percpu_init_areas(void);
>  
>  #define DECLARE_PER_CPU(type, name) extern __typeof__(type) per_cpu__##name
>  
> +#define __get_cpu_ptr(var) \
> +    (*RELOC_HIDE(var, get_cpu_info()->per_cpu_offset))
> +
> +#define per_cpu_ptr(var, cpu)  \
> +    (*RELOC_HIDE(var, __per_cpu_offset[cpu]))
> +
>  #endif /* __X86_PERCPU_H__ */
> diff --git a/xen/include/xen/percpu.h b/xen/include/xen/percpu.h
> index abe0b11..c896863 100644
> --- a/xen/include/xen/percpu.h
> +++ b/xen/include/xen/percpu.h
> @@ -16,6 +16,10 @@
>  /* Preferred on Xen. Also see arch-defined per_cpu(). */
>  #define this_cpu(var)    __get_cpu_var(var)
>  
> +#define this_cpu_ptr(ptr)    __get_cpu_ptr(ptr)
> +
> +#define get_per_cpu_var(var)  (per_cpu__##var)
> +
>  /* Linux compatibility. */
>  #define get_cpu_var(var) this_cpu(var)
>  #define put_cpu_var(var)
> diff --git a/xen/include/xen/spinlock.h b/xen/include/xen/spinlock.h
> index fb0438e..7338328 100644
> --- a/xen/include/xen/spinlock.h
> +++ b/xen/include/xen/spinlock.h
> @@ -3,6 +3,7 @@
>  
>  #include <asm/system.h>
>  #include <asm/spinlock.h>
> +#include <asm/types.h>
>  
>  #ifndef NDEBUG
>  struct lock_debug {
> @@ -261,4 +262,118 @@ int _rw_is_write_locked(rwlock_t *lock);
>  #define rw_is_locked(l)               _rw_is_locked(l)
>  #define rw_is_write_locked(l)         _rw_is_write_locked(l)
>  
> +typedef struct percpu_rwlock percpu_rwlock_t;
> +
> +struct percpu_rwlock {
> +    rwlock_t            rwlock;
> +    bool_t              writer_activating;
> +#ifndef NDEBUG
> +    percpu_rwlock_t     **percpu_owner;
> +#endif
> +};
> +
> +#ifndef NDEBUG
> +#define PERCPU_RW_LOCK_UNLOCKED(owner) { RW_LOCK_UNLOCKED, 0, owner }
> +static inline void _percpu_rwlock_owner_check(percpu_rwlock_t **per_cpudata,
> +                                         percpu_rwlock_t *percpu_rwlock)
> +{
> +    ASSERT(per_cpudata == percpu_rwlock->percpu_owner);
> +}
> +#else
> +#define PERCPU_RW_LOCK_UNLOCKED(owner) { RW_LOCK_UNLOCKED, 0 }
> +#define _percpu_rwlock_owner_check(data, lock) ((void)0)
> +#endif
> +
> +#define DEFINE_PERCPU_RWLOCK_RESOURCE(l, owner) \
> +    percpu_rwlock_t l = PERCPU_RW_LOCK_UNLOCKED(&get_per_cpu_var(owner))
> +#define percpu_rwlock_resource_init(l, owner) \
> +    (*(l) = (percpu_rwlock_t)PERCPU_RW_LOCK_UNLOCKED(&get_per_cpu_var(owner)))
> +
> +static inline void _percpu_read_lock(percpu_rwlock_t **per_cpudata,
> +                                         percpu_rwlock_t *percpu_rwlock)

Is there a particular reason you chose to only use the "owner" value in
the struct to verify that the "per_cpudata" argument passed matched the
one you expected, rather than just getting rid of the "per_cpudata"
argument altogether and always using the pointer in the struct?

(i.e., _percpu_read_lock(percpu_rwlock_t *percpu_rwlock) { ...
per_cpudata = percpu_rwlock->percpu_owner; ... })

I'm not an expert in this sort of micro-optimization, but it seems like
you're trading off storing a pointer in your rwlock struct for storing a
pointer at every call site.  Since you have to read writer_activating
for every lock or unlock anyway, it doesn't seem like you'd actually be
saving that many memory fetches; but having only one copy in the cache,
rather than one copy per call site, would on the whole reduce both the
cache footprint and the total memory used (if only by a few bytes).

It also makes the code cleaner to have only one argument, rather than
two which must match; but since in all the places you use it you end up
using a wrapper to give you a single argument anyway, I don't think that
matters in this case.  (i.e., if there's a good reason for having it at
the call site instead of in the struct, I'm fine with this approach).

Everything else looks good, thanks.

 -George
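
For concreteness, the single-argument variant George sketches above would look
roughly like this (a hypothetical illustration, assuming percpu_owner were
present in all builds rather than only in debug builds; this is not part of
the patch):

    static inline void _percpu_read_lock(percpu_rwlock_t *percpu_rwlock)
    {
        /* Fetch the per-cpu data pointer from the lock itself. */
        percpu_rwlock_t **per_cpudata = percpu_rwlock->percpu_owner;

        /* From here on the body mirrors the patch's _percpu_read_lock(). */
        ASSERT(this_cpu_ptr(per_cpudata) != percpu_rwlock);
        if ( unlikely(this_cpu_ptr(per_cpudata) != NULL) )
        {
            read_lock(&percpu_rwlock->rwlock);
            return;
        }
        this_cpu_ptr(per_cpudata) = percpu_rwlock;
        smp_mb();
        if ( unlikely(percpu_rwlock->writer_activating) )
        {
            this_cpu_ptr(per_cpudata) = NULL;
            read_lock(&percpu_rwlock->rwlock);
            this_cpu_ptr(per_cpudata) = percpu_rwlock;
            read_unlock(&percpu_rwlock->rwlock);
        }
    }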
Malcolm Crossley Jan. 11, 2016, 3:06 p.m. UTC | #3
On 22/12/15 11:56, George Dunlap wrote:
> On 18/12/15 16:08, Malcolm Crossley wrote:
>> <snip>
>> +
>> +#ifndef NDEBUG
>> +#define PERCPU_RW_LOCK_UNLOCKED(owner) { RW_LOCK_UNLOCKED, 0, owner }
>> +static inline void _percpu_rwlock_owner_check(percpu_rwlock_t **per_cpudata,
>> +                                         percpu_rwlock_t *percpu_rwlock)
>> +{
>> +    ASSERT(per_cpudata == percpu_rwlock->percpu_owner);
>> +}
>> +#else
>> +#define PERCPU_RW_LOCK_UNLOCKED(owner) { RW_LOCK_UNLOCKED, 0 }
>> +#define _percpu_rwlock_owner_check(data, lock) ((void)0)
>> +#endif
>> +
>> +#define DEFINE_PERCPU_RWLOCK_RESOURCE(l, owner) \
>> +    percpu_rwlock_t l = PERCPU_RW_LOCK_UNLOCKED(&get_per_cpu_var(owner))
>> +#define percpu_rwlock_resource_init(l, owner) \
>> +    (*(l) = (percpu_rwlock_t)PERCPU_RW_LOCK_UNLOCKED(&get_per_cpu_var(owner)))
>> +
>> +static inline void _percpu_read_lock(percpu_rwlock_t **per_cpudata,
>> +                                         percpu_rwlock_t *percpu_rwlock)
> 
> Is there a particular reason you chose to only use the "owner" value in
> the struct to verify that the "per_cpudata" argument passed matched the
> one you expected, rather than just getting rid of the "per_cpudata"
> argument altogether and always using the pointer in the struct?

Initially I was aiming to add percpu aspects to the rwlock without increasing
the size of the rwlock structure itself, this was to keep data cache usage and
memory allocations the same.
It became clear that having a global writer_activating barrier would cause the
read_lock to enter the slow path far too often. So I put the writer_activating
variable in the percpu_rwlock_t, as writer_activating is just a bool then the
additional data overhead should be small. Always having an 8 byte pointer may
add a lot of overhead to data structures containing multiple rwlocks and thus
cause additional allocation overhead.
> 
> (i.e., _percpu_read_lock(percpu_rwlock_t *percpu_rwlock) { ...
> per_cpudata = percpu_rwlock->percpu_owner; ... })
> 
> I'm not an expert in this sort of micro-optimization, but it seems like
> you're trading off storing a pointer in your rwlock struct for storing a
> pointer at every call site.  Since you have to read writer_activating
> for every lock or unlock anyway,

writer_activating is not read on the read_unlock path. As these are rwlocks
then I'm assuming the read lock/unlock paths are more critical for performance.
So I'd prefer to not do a read of the percpu_rwlock structure if it's not
required (i.e. on the read unlock path)
Furthermore, the single byte for the writer_activating variable is likely
to have been read into cache by accesses to other parts of the data structure
near the percpu_rwlock_t. If we add additional 8 bytes to the percpu_rwlock_t
then this may not happen and it may also adjust the cache line alignment as well.

> it doesn't seem like you'd actually be
> saving that many memory fetches; but having only one copy in the cache,
> rather than one copy per call site, would on the whole reduce both the
> cache footprint and the total memory used (if only by a few bytes).

If you put the owner pointer in the percpu_rwlock_t then wouldn't you have
a copy per instance of percpu_rwlock_t? Surely this would use more cache than
the handful of call site references to a global variable.

> 
> It also makes the code cleaner to have only one argument, rather than
> two which must match; but since in all the places you use it you end up
> using a wrapper to give you a single argument anyway, I don't think that
> matters in this case.  (i.e., if there's a good reason for having it at
> the call site instead of in the struct, I'm fine with this approach).

If you agree with my reasoning for the cache overhead and performance of the
read unlock path being better with passing the percpu_data as an argument then
I propose we keep the patches as is.

> 
> Everything else looks good, thanks.
> 
>  -George
> 

Thanks for the review and sorry about the delayed reply.

Malcolm
Malcolm Crossley Jan. 19, 2016, 10:29 a.m. UTC | #4
On 11/01/16 15:06, Malcolm Crossley wrote:
> On 22/12/15 11:56, George Dunlap wrote:
>> On 18/12/15 16:08, Malcolm Crossley wrote:
>>> <snip>
>>> +
>>> +#ifndef NDEBUG
>>> +#define PERCPU_RW_LOCK_UNLOCKED(owner) { RW_LOCK_UNLOCKED, 0, owner }
>>> +static inline void _percpu_rwlock_owner_check(percpu_rwlock_t **per_cpudata,
>>> +                                         percpu_rwlock_t *percpu_rwlock)
>>> +{
>>> +    ASSERT(per_cpudata == percpu_rwlock->percpu_owner);
>>> +}
>>> +#else
>>> +#define PERCPU_RW_LOCK_UNLOCKED(owner) { RW_LOCK_UNLOCKED, 0 }
>>> +#define _percpu_rwlock_owner_check(data, lock) ((void)0)
>>> +#endif
>>> +
>>> +#define DEFINE_PERCPU_RWLOCK_RESOURCE(l, owner) \
>>> +    percpu_rwlock_t l = PERCPU_RW_LOCK_UNLOCKED(&get_per_cpu_var(owner))
>>> +#define percpu_rwlock_resource_init(l, owner) \
>>> +    (*(l) = (percpu_rwlock_t)PERCPU_RW_LOCK_UNLOCKED(&get_per_cpu_var(owner)))
>>> +
>>> +static inline void _percpu_read_lock(percpu_rwlock_t **per_cpudata,
>>> +                                         percpu_rwlock_t *percpu_rwlock)
>>
>> Is there a particular reason you chose to only use the "owner" value in
>> the struct to verify that the "per_cpudata" argument passed matched the
>> one you expected, rather than just getting rid of the "per_cpudata"
>> argument altogether and always using the pointer in the struct?
> 
> Initially I was aiming to add percpu aspects to the rwlock without increasing
> the size of the rwlock structure itself, this was to keep data cache usage and
> memory allocations the same.
> It became clear that having a global writer_activating barrier would cause the
> read_lock to enter the slow path far too often. So I put the writer_activating
> variable in the percpu_rwlock_t, as writer_activating is just a bool then the
> additional data overhead should be small. Always having an 8 byte pointer may
> add a lot of overhead to data structures containing multiple rwlocks and thus
> cause additional allocation overhead.
>>
>> (i.e., _percpu_read_lock(percpu_rwlock_t *percpu_rwlock) { ...
>> per_cpudata = percpu_rwlock->percpu_owner; ... })
>>
>> I'm not an expert in this sort of micro-optimization, but it seems like
>> you're trading off storing a pointer in your rwlock struct for storing a
>> pointer at every call site.  Since you have to read writer_activating
>> for every lock or unlock anyway,
> 
> writer_activating is not read on the read_unlock path. As these are rwlocks
> then I'm assuming the read lock/unlock paths are more critical for performance.
> So I'd prefer to not do a read of the percpu_rwlock structure if it's not
> required (i.e. on the read unlock path)
> Furthermore, the single byte for the writer_activating variable is likely
> to have been read into cache by accesses to other parts of the data structure
> near the percpu_rwlock_t. If we add additional 8 bytes to the percpu_rwlock_t
> then this may not happen and it may also adjust the cache line alignment as well.
> 
>> it doesn't seem like you'd actually be
>> saving that many memory fetches; but having only one copy in the cache,
>> rather than one copy per call site, would on the whole reduce both the
>> cache footprint and the total memory used (if only by a few bytes).
> 
> If you put the owner pointer in the percpu_rwlock_t then wouldn't you have
> a copy per instance of percpu_rwlock_t? Surely this would use more cache than
> the handful of call site references to a global variable.
> 
>>
>> It also makes the code cleaner to have only one argument, rather than
>> two which must match; but since in all the places you use it you end up
>> using a wrapper to give you a single argument anyway, I don't think that
>> matters in this case.  (i.e., if there's a good reason for having it at
>> the call site instead of in the struct, I'm fine with this approach).
> 
> If you agree with my reasoning for the cache overhead and performance of the
> read unlock path being better with passing the percpu_data as an argument then
> I propose we keep the patches as is.
> 
Ping? I believe this is the last point of discussion before the patches can go in.

Malcolm

>>
>> Everything else looks good, thanks.
>>
>>  -George
>>
> 
>
George Dunlap Jan. 19, 2016, 12:25 p.m. UTC | #5
On 19/01/16 10:29, Malcolm Crossley wrote:
> On 11/01/16 15:06, Malcolm Crossley wrote:
>> On 22/12/15 11:56, George Dunlap wrote:
>>> On 18/12/15 16:08, Malcolm Crossley wrote:
>>>> <snip>
>>>> +
>>>> +#ifndef NDEBUG
>>>> +#define PERCPU_RW_LOCK_UNLOCKED(owner) { RW_LOCK_UNLOCKED, 0, owner }
>>>> +static inline void _percpu_rwlock_owner_check(percpu_rwlock_t **per_cpudata,
>>>> +                                         percpu_rwlock_t *percpu_rwlock)
>>>> +{
>>>> +    ASSERT(per_cpudata == percpu_rwlock->percpu_owner);
>>>> +}
>>>> +#else
>>>> +#define PERCPU_RW_LOCK_UNLOCKED(owner) { RW_LOCK_UNLOCKED, 0 }
>>>> +#define _percpu_rwlock_owner_check(data, lock) ((void)0)
>>>> +#endif
>>>> +
>>>> +#define DEFINE_PERCPU_RWLOCK_RESOURCE(l, owner) \
>>>> +    percpu_rwlock_t l = PERCPU_RW_LOCK_UNLOCKED(&get_per_cpu_var(owner))
>>>> +#define percpu_rwlock_resource_init(l, owner) \
>>>> +    (*(l) = (percpu_rwlock_t)PERCPU_RW_LOCK_UNLOCKED(&get_per_cpu_var(owner)))
>>>> +
>>>> +static inline void _percpu_read_lock(percpu_rwlock_t **per_cpudata,
>>>> +                                         percpu_rwlock_t *percpu_rwlock)
>>>
>>> Is there a particular reason you chose to only use the "owner" value in
>>> the struct to verify that the "per_cpudata" argument passed matched the
>>> one you expected, rather than just getting rid of the "per_cpudata"
>>> argument altogether and always using the pointer in the struct?
>>
>> Initially I was aiming to add percpu aspects to the rwlock without increasing
>> the size of the rwlock structure itself, this was to keep data cache usage and
>> memory allocations the same.
>> It became clear that having a global writer_activating barrier would cause the
>> read_lock to enter the slow path far too often. So I put the writer_activating
>> variable in the percpu_rwlock_t, as writer_activating is just a bool then the
>> additional data overhead should be small. Always having an 8 byte pointer may
>> add a lot of overhead to data structures containing multiple rwlocks and thus
>> cause additional allocation overhead.
>>>
>>> (i.e., _percpu_read_lock(percpu_rwlock_t *percpu_rwlock) { ...
>>> per_cpudata = percpu_rwlock->percpu_owner; ... })
>>>
>>> I'm not an expert in this sort of micro-optimization, but it seems like
>>> you're trading off storing a pointer in your rwlock struct for storing a
>>> pointer at every call site.  Since you have to read writer_activating
>>> for every lock or unlock anyway,
>>
>> writer_activating is not read on the read_unlock path. As these are rwlocks
>> then I'm assuming the read lock/unlock paths are more critical for performance.
>> So I'd prefer to not do a read of the percpu_rwlock structure if it's not
>> required (i.e. on the read unlock path)
>> Furthermore, the single byte for the writer_activating variable is likely
>> to have been read into cache by accesses to other parts of the data structure
>> near the percpu_rwlock_t. If we add additional 8 bytes to the percpu_rwlock_t
>> then this may not happen and it may also adjust the cache line alignment as well.
>>
>>> it doesn't seem like you'd actually be
>>> saving that many memory fetches; but having only one copy in the cache,
>>> rather than one copy per call site, would on the whole reduce both the
>>> cache footprint and the total memory used (if only by a few bytes).
>>
>> If you put the owner pointer in the percpu_rwlock_t then wouldn't you have
>> a copy per instance of percpu_rwlock_t? Surely this would use more cache than
>> the handful of call site references to a global variable.
>>
>>>
>>> It also makes the code cleaner to have only one argument, rather than
>>> two which must match; but since in all the places you use it you end up
>>> using a wrapper to give you a single argument anyway, I don't think that
>>> matters in this case.  (i.e., if there's a good reason for having it at
>>> the call site instead of in the struct, I'm fine with this approach).
>>
>> If you agree with my reasoning for the cache overhead and performance of the
>> read unlock path being better with passing the percpu_data as an argument then
>> I propose we keep the patches as is.
>>
> Ping? I believe this is the last point of discussion before the patches can go in.

Sorry -- I did skim this, and intended to give it another once-over last
week, but some other stuff came up.  I should get a chance to take a
look at it sometime this week.

 -George
George Dunlap Jan. 20, 2016, 3:30 p.m. UTC | #6
On Mon, Jan 11, 2016 at 3:06 PM, Malcolm Crossley
<malcolm.crossley@citrix.com> wrote:
> On 22/12/15 11:56, George Dunlap wrote:
>> On 18/12/15 16:08, Malcolm Crossley wrote:
>>> <snip>
>>> +
>>> +#ifndef NDEBUG
>>> +#define PERCPU_RW_LOCK_UNLOCKED(owner) { RW_LOCK_UNLOCKED, 0, owner }
>>> +static inline void _percpu_rwlock_owner_check(percpu_rwlock_t **per_cpudata,
>>> +                                         percpu_rwlock_t *percpu_rwlock)
>>> +{
>>> +    ASSERT(per_cpudata == percpu_rwlock->percpu_owner);
>>> +}
>>> +#else
>>> +#define PERCPU_RW_LOCK_UNLOCKED(owner) { RW_LOCK_UNLOCKED, 0 }
>>> +#define _percpu_rwlock_owner_check(data, lock) ((void)0)
>>> +#endif
>>> +
>>> +#define DEFINE_PERCPU_RWLOCK_RESOURCE(l, owner) \
>>> +    percpu_rwlock_t l = PERCPU_RW_LOCK_UNLOCKED(&get_per_cpu_var(owner))
>>> +#define percpu_rwlock_resource_init(l, owner) \
>>> +    (*(l) = (percpu_rwlock_t)PERCPU_RW_LOCK_UNLOCKED(&get_per_cpu_var(owner)))
>>> +
>>> +static inline void _percpu_read_lock(percpu_rwlock_t **per_cpudata,
>>> +                                         percpu_rwlock_t *percpu_rwlock)
>>
>> Is there a particular reason you chose to only use the "owner" value in
>> the struct to verify that the "per_cpudata" argument passed matched the
>> one you expected, rather than just getting rid of the "per_cpudata"
>> argument altogether and always using the pointer in the struct?
>
> Initially I was aiming to add percpu aspects to the rwlock without increasing
> the size of the rwlock structure itself, this was to keep data cache usage and
> memory allocations the same.
> It became clear that having a global writer_activating barrier would cause the
> read_lock to enter the slow path far too often. So I put the writer_activating
> variable in the percpu_rwlock_t, as writer_activating is just a bool then the
> additional data overhead should be small. Always having an 8 byte pointer may
> add a lot of overhead to data structures containing multiple rwlocks and thus
> cause additional allocation overhead.
>>
>> (i.e., _percpu_read_lock(percpu_rwlock_t *percpu_rwlock) { ...
>> per_cpudata = percpu_rwlock->percpu_owner; ... })
>>
>> I'm not an expert in this sort of micro-optimization, but it seems like
>> you're trading off storing a pointer in your rwlock struct for storing a
>> pointer at every call site.  Since you have to read writer_activating
>> for every lock or unlock anyway,
>
> writer_activating is not read on the read_unlock path. As these are rwlocks
> then I'm assuming the read lock/unlock paths are more critical for performance.
> So I'd prefer to not do a read of the percpu_rwlock structure if it's not
> required (i.e. on the read unlock path)
> Furthermore, the single byte for the writer_activating variable is likely
> to have been read into cache by accesses to other parts of the data structure
> near the percpu_rwlock_t. If we add additional 8 bytes to the percpu_rwlock_t
> then this may not happen and it may also adjust the cache line alignment as well.
>
>> it doesn't seem like you'd actually be
>> saving that many memory fetches; but having only one copy in the cache,
>> rather than one copy per call site, would on the whole reduce both the
>> cache footprint and the total memory used (if only by a few bytes).
>
> If you put the owner pointer in the percpu_rwlock_t then wouldn't you have
> a copy per instance of percpu_rwlock_t? Surely this would use more cache than
> the handful of call site references to a global variable.
>
>>
>> It also makes the code cleaner to have only one argument, rather than
>> two which must match; but since in all the places you use it you end up
>> using a wrapper to give you a single argument anyway, I don't think that
>> matters in this case.  (i.e., if there's a good reason for having it at
>> the call site instead of in the struct, I'm fine with this approach).
>
> If you agree with my reasoning for the cache overhead and performance of the
> read unlock path being better with passing the percpu_data as an argument then
> I propose we keep the patches as is.

Yeah, I think you convinced me:

Reviewed-by: George Dunlap <george.dunlap@citrix.com>

Thanks!
 -George
Ian Campbell Jan. 21, 2016, 3:17 p.m. UTC | #7
On Fri, 2015-12-18 at 16:08 +0000, Malcolm Crossley wrote:
> diff --git a/xen/include/asm-arm/percpu.h b/xen/include/asm-arm/percpu.h
> index 71e7649..c308a56 100644
> --- a/xen/include/asm-arm/percpu.h
> +++ b/xen/include/asm-arm/percpu.h
> @@ -27,6 +27,11 @@ void percpu_init_areas(void);
>  #define __get_cpu_var(var) \
>      (*RELOC_HIDE(&per_cpu__##var, READ_SYSREG(TPIDR_EL2)))
>  
> +#define per_cpu_ptr(var, cpu)  \
> +    (*RELOC_HIDE(&var, __per_cpu_offset[cpu]))
> +#define __get_cpu_ptr(var) \
> +    (*RELOC_HIDE(&var, READ_SYSREG(TPIDR_EL2)))
> +
>  #define DECLARE_PER_CPU(type, name) extern __typeof__(type) per_cpu__##name
>  

For ARM: Acked-by: Ian Campbell <ian.campbell@citrix.com>

Patch

diff --git a/xen/common/spinlock.c b/xen/common/spinlock.c
index 7f89694..901c2b2 100644
--- a/xen/common/spinlock.c
+++ b/xen/common/spinlock.c
@@ -10,6 +10,8 @@ 
 #include <asm/processor.h>
 #include <asm/atomic.h>
 
+static DEFINE_PER_CPU(cpumask_t, percpu_rwlock_readers);
+
 #ifndef NDEBUG
 
 static atomic_t spin_debug __read_mostly = ATOMIC_INIT(0);
@@ -492,6 +494,49 @@  int _rw_is_write_locked(rwlock_t *lock)
     return (lock->lock == RW_WRITE_FLAG); /* writer in critical section? */
 }
 
+void _percpu_write_lock(percpu_rwlock_t **per_cpudata,
+                percpu_rwlock_t *percpu_rwlock)
+{
+    unsigned int cpu;
+    cpumask_t *rwlock_readers = &this_cpu(percpu_rwlock_readers);
+
+    /* Validate the correct per_cpudata variable has been provided. */
+    _percpu_rwlock_owner_check(per_cpudata, percpu_rwlock);
+
+    /* 
+     * First take the write lock to protect against other writers or slow 
+     * path readers.
+     */
+    write_lock(&percpu_rwlock->rwlock);
+
+    /* Now set the global variable so that readers start using read_lock. */
+    percpu_rwlock->writer_activating = 1;
+    smp_mb();
+
+    /* Using a per cpu cpumask is only safe if there is no nesting. */
+    ASSERT(!in_irq());
+    cpumask_copy(rwlock_readers, &cpu_online_map);
+
+    /* Check if there are any percpu readers in progress on this rwlock. */
+    for ( ; ; )
+    {
+        for_each_cpu(cpu, rwlock_readers)
+        {
+            /* 
+             * Remove any percpu readers not contending on this rwlock
+             * from our check mask.
+             */
+            if ( per_cpu_ptr(per_cpudata, cpu) != percpu_rwlock )
+                __cpumask_clear_cpu(cpu, rwlock_readers);
+        }
+        /* Check if we've cleared all percpu readers from check mask. */
+        if ( cpumask_empty(rwlock_readers) )
+            break;
+        /* Give the coherency fabric a break. */
+        cpu_relax();
+    };
+}
+
 #ifdef LOCK_PROFILE
 
 struct lock_profile_anc {
diff --git a/xen/include/asm-arm/percpu.h b/xen/include/asm-arm/percpu.h
index 71e7649..c308a56 100644
--- a/xen/include/asm-arm/percpu.h
+++ b/xen/include/asm-arm/percpu.h
@@ -27,6 +27,11 @@  void percpu_init_areas(void);
 #define __get_cpu_var(var) \
     (*RELOC_HIDE(&per_cpu__##var, READ_SYSREG(TPIDR_EL2)))
 
+#define per_cpu_ptr(var, cpu)  \
+    (*RELOC_HIDE(&var, __per_cpu_offset[cpu]))
+#define __get_cpu_ptr(var) \
+    (*RELOC_HIDE(&var, READ_SYSREG(TPIDR_EL2)))
+
 #define DECLARE_PER_CPU(type, name) extern __typeof__(type) per_cpu__##name
 
 DECLARE_PER_CPU(unsigned int, cpu_id);
diff --git a/xen/include/asm-x86/percpu.h b/xen/include/asm-x86/percpu.h
index 604ff0d..51562b9 100644
--- a/xen/include/asm-x86/percpu.h
+++ b/xen/include/asm-x86/percpu.h
@@ -20,4 +20,10 @@  void percpu_init_areas(void);
 
 #define DECLARE_PER_CPU(type, name) extern __typeof__(type) per_cpu__##name
 
+#define __get_cpu_ptr(var) \
+    (*RELOC_HIDE(var, get_cpu_info()->per_cpu_offset))
+
+#define per_cpu_ptr(var, cpu)  \
+    (*RELOC_HIDE(var, __per_cpu_offset[cpu]))
+
 #endif /* __X86_PERCPU_H__ */
diff --git a/xen/include/xen/percpu.h b/xen/include/xen/percpu.h
index abe0b11..c896863 100644
--- a/xen/include/xen/percpu.h
+++ b/xen/include/xen/percpu.h
@@ -16,6 +16,10 @@ 
 /* Preferred on Xen. Also see arch-defined per_cpu(). */
 #define this_cpu(var)    __get_cpu_var(var)
 
+#define this_cpu_ptr(ptr)    __get_cpu_ptr(ptr)
+
+#define get_per_cpu_var(var)  (per_cpu__##var)
+
 /* Linux compatibility. */
 #define get_cpu_var(var) this_cpu(var)
 #define put_cpu_var(var)
diff --git a/xen/include/xen/spinlock.h b/xen/include/xen/spinlock.h
index fb0438e..7338328 100644
--- a/xen/include/xen/spinlock.h
+++ b/xen/include/xen/spinlock.h
@@ -3,6 +3,7 @@ 
 
 #include <asm/system.h>
 #include <asm/spinlock.h>
+#include <asm/types.h>
 
 #ifndef NDEBUG
 struct lock_debug {
@@ -261,4 +262,118 @@  int _rw_is_write_locked(rwlock_t *lock);
 #define rw_is_locked(l)               _rw_is_locked(l)
 #define rw_is_write_locked(l)         _rw_is_write_locked(l)
 
+typedef struct percpu_rwlock percpu_rwlock_t;
+
+struct percpu_rwlock {
+    rwlock_t            rwlock;
+    bool_t              writer_activating;
+#ifndef NDEBUG
+    percpu_rwlock_t     **percpu_owner;
+#endif
+};
+
+#ifndef NDEBUG
+#define PERCPU_RW_LOCK_UNLOCKED(owner) { RW_LOCK_UNLOCKED, 0, owner }
+static inline void _percpu_rwlock_owner_check(percpu_rwlock_t **per_cpudata,
+                                         percpu_rwlock_t *percpu_rwlock)
+{
+    ASSERT(per_cpudata == percpu_rwlock->percpu_owner);
+}
+#else
+#define PERCPU_RW_LOCK_UNLOCKED(owner) { RW_LOCK_UNLOCKED, 0 }
+#define _percpu_rwlock_owner_check(data, lock) ((void)0)
+#endif
+
+#define DEFINE_PERCPU_RWLOCK_RESOURCE(l, owner) \
+    percpu_rwlock_t l = PERCPU_RW_LOCK_UNLOCKED(&get_per_cpu_var(owner))
+#define percpu_rwlock_resource_init(l, owner) \
+    (*(l) = (percpu_rwlock_t)PERCPU_RW_LOCK_UNLOCKED(&get_per_cpu_var(owner)))
+
+static inline void _percpu_read_lock(percpu_rwlock_t **per_cpudata,
+                                         percpu_rwlock_t *percpu_rwlock)
+{
+    /* Validate the correct per_cpudata variable has been provided. */
+    _percpu_rwlock_owner_check(per_cpudata, percpu_rwlock);
+
+    /* We cannot support recursion on the same lock. */
+    ASSERT(this_cpu_ptr(per_cpudata) != percpu_rwlock);
+    /* 
+     * Detect using a second percpu_rwlock_t simultaneously and fall back
+     * to standard read_lock.
+     */
+    if ( unlikely(this_cpu_ptr(per_cpudata) != NULL ) )
+    {
+        read_lock(&percpu_rwlock->rwlock);
+        return;
+    }
+
+    /* Indicate this cpu is reading. */
+    this_cpu_ptr(per_cpudata) = percpu_rwlock;
+    smp_mb();
+    /* Check if a writer is waiting. */
+    if ( unlikely(percpu_rwlock->writer_activating) )
+    {
+        /* Let the waiting writer know we aren't holding the lock. */
+        this_cpu_ptr(per_cpudata) = NULL;
+        /* Wait using the read lock to keep the lock fair. */
+        read_lock(&percpu_rwlock->rwlock);
+        /* Set the per CPU data again and continue. */
+        this_cpu_ptr(per_cpudata) = percpu_rwlock;
+        /* Drop the read lock because we don't need it anymore. */
+        read_unlock(&percpu_rwlock->rwlock);
+    }
+}
+
+static inline void _percpu_read_unlock(percpu_rwlock_t **per_cpudata,
+                percpu_rwlock_t *percpu_rwlock)
+{
+    /* Validate the correct per_cpudata variable has been provided. */
+    _percpu_rwlock_owner_check(per_cpudata, percpu_rwlock);
+
+    /* Verify the read lock was taken for this lock */
+    ASSERT(this_cpu_ptr(per_cpudata) != NULL);
+    /* 
+     * Detect using a second percpu_rwlock_t simultaneously and fall back
+     * to standard read_unlock.
+     */
+    if ( unlikely(this_cpu_ptr(per_cpudata) != percpu_rwlock ) )
+    {
+        read_unlock(&percpu_rwlock->rwlock);
+        return;
+    }
+    this_cpu_ptr(per_cpudata) = NULL;
+    smp_wmb();
+}
+
+/* Don't inline percpu write lock as it's a complex function. */
+void _percpu_write_lock(percpu_rwlock_t **per_cpudata,
+                        percpu_rwlock_t *percpu_rwlock);
+
+static inline void _percpu_write_unlock(percpu_rwlock_t **per_cpudata,
+                percpu_rwlock_t *percpu_rwlock)
+{
+    /* Validate the correct per_cpudata variable has been provided. */
+    _percpu_rwlock_owner_check(per_cpudata, percpu_rwlock);
+
+    ASSERT(percpu_rwlock->writer_activating);
+    percpu_rwlock->writer_activating = 0;
+    write_unlock(&percpu_rwlock->rwlock);
+}
+
+#define percpu_rw_is_write_locked(l)         _rw_is_write_locked(&((l)->rwlock))
+
+#define percpu_read_lock(percpu, lock) \
+    _percpu_read_lock(&get_per_cpu_var(percpu), lock)
+#define percpu_read_unlock(percpu, lock) \
+    _percpu_read_unlock(&get_per_cpu_var(percpu), lock)
+#define percpu_write_lock(percpu, lock) \
+    _percpu_write_lock(&get_per_cpu_var(percpu), lock)
+#define percpu_write_unlock(percpu, lock) \
+    _percpu_write_unlock(&get_per_cpu_var(percpu), lock)
+
+#define DEFINE_PERCPU_RWLOCK_GLOBAL(name) DEFINE_PER_CPU(percpu_rwlock_t *, \
+                                                         name)
+#define DECLARE_PERCPU_RWLOCK_GLOBAL(name) DECLARE_PER_CPU(percpu_rwlock_t *, \
+                                                         name)
+
 #endif /* __SPINLOCK_H__ */