Message ID | 20190131030136.56999-4-alex.kogan@oracle.com (mailing list archive) |
---|---|
State | New, archived |
Series | Add NUMA-awareness to qspinlock |
On Wed, Jan 30, 2019 at 10:01:35PM -0500, Alex Kogan wrote:
> Choose the next lock holder among spinning threads running on the same
> socket with high probability rather than always. With small probability,
> hand the lock to the first thread in the secondary queue or, if that
> queue is empty, to the immediate successor of the current lock holder
> in the main queue. Thus, assuming no failures while threads hold the
> lock, every thread would be able to acquire the lock after a bounded
> number of lock transitions, with high probability.
>
> Note that we could make the inter-socket transition deterministic,
> by sticking a counter of intra-socket transitions in the head node
> of the secondary queue. At the handoff time, we could increment
> the counter and check if it is below a threshold. This adds another
> field to queue nodes and nearly-certain local cache miss to read and
> update this counter during the handoff. While still beating stock,
> this variant adds certain overhead over the probabilistic variant.

(also heavily suffers from the socket == node confusion)

How would you suggest RT 'tunes' this?

RT relies on FIFO fairness of the basic spinlock primitives; you just
completely wrecked that.

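For reference, the deterministic variant mentioned in the quoted changelog might look roughly like the sketch below. This is only an illustration, not code from the series: `struct sec_head`, `keep_on_socket()` and the threshold value are hypothetical names and numbers, since the changelog gives neither.

```c
#include <stdbool.h>

/* Hypothetical threshold; the changelog does not name a value. */
#define INTRA_SOCKET_HANDOFF_THRESHOLD 1024

/*
 * Hypothetical stand-in for the head node of the secondary queue,
 * carrying the extra counter field the changelog talks about.
 */
struct sec_head {
	unsigned int intra_count;	/* intra-socket handoffs so far */
};

/*
 * Called at handoff time. Increment the counter and check whether it
 * is still below the threshold. Return true to keep the lock on the
 * current socket; return false (and reset the counter) to force an
 * inter-socket handoff.
 */
static bool keep_on_socket(struct sec_head *h)
{
	if (++h->intra_count < INTRA_SOCKET_HANDOFF_THRESHOLD)
		return true;

	h->intra_count = 0;
	return false;
}
```

The read-modify-write of `intra_count` at every handoff is the extra memory access the changelog attributes the overhead to.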
> On Jan 31, 2019, at 5:00 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Wed, Jan 30, 2019 at 10:01:35PM -0500, Alex Kogan wrote:
>> Choose the next lock holder among spinning threads running on the same
>> socket with high probability rather than always. With small probability,
>> hand the lock to the first thread in the secondary queue or, if that
>> queue is empty, to the immediate successor of the current lock holder
>> in the main queue. Thus, assuming no failures while threads hold the
>> lock, every thread would be able to acquire the lock after a bounded
>> number of lock transitions, with high probability.
>>
>> Note that we could make the inter-socket transition deterministic,
>> by sticking a counter of intra-socket transitions in the head node
>> of the secondary queue. At the handoff time, we could increment
>> the counter and check if it is below a threshold. This adds another
>> field to queue nodes and nearly-certain local cache miss to read and
>> update this counter during the handoff. While still beating stock,
>> this variant adds certain overhead over the probabilistic variant.
>
> (also heavily suffers from the socket == node confusion)
>
> How would you suggest RT 'tunes' this?
>
> RT relies on FIFO fairness of the basic spinlock primitives; you just
> completely wrecked that.

It is true that CNA trades some fairness for shorter lock handover
latency, much like any other NUMA-aware lock.

Can you explain, however, what exactly breaks here?

It seems that even today, qspinlock does not support RT_PREEMPT, given
that it uses per-CPU queue nodes.

Thank you,
-- Alex

On Mon, Feb 04, 2019 at 10:35:09PM -0500, Alex Kogan wrote:
>
> > On Jan 31, 2019, at 5:00 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Wed, Jan 30, 2019 at 10:01:35PM -0500, Alex Kogan wrote:
> >> Choose the next lock holder among spinning threads running on the same
> >> socket with high probability rather than always. With small probability,
> >> hand the lock to the first thread in the secondary queue or, if that
> >> queue is empty, to the immediate successor of the current lock holder
> >> in the main queue. Thus, assuming no failures while threads hold the
> >> lock, every thread would be able to acquire the lock after a bounded
> >> number of lock transitions, with high probability.
> >>
> >> Note that we could make the inter-socket transition deterministic,
> >> by sticking a counter of intra-socket transitions in the head node
> >> of the secondary queue. At the handoff time, we could increment
> >> the counter and check if it is below a threshold. This adds another
> >> field to queue nodes and nearly-certain local cache miss to read and
> >> update this counter during the handoff. While still beating stock,
> >> this variant adds certain overhead over the probabilistic variant.
> >
> > (also heavily suffers from the socket == node confusion)
> >
> > How would you suggest RT 'tunes' this?
> >
> > RT relies on FIFO fairness of the basic spinlock primitives; you just
> > completely wrecked that.
>
> It is true that CNA trades some fairness for shorter lock handover
> latency, much like any other NUMA-aware lock.
>
> Can you explain, however, what exactly breaks here?

Timeliness guarantees. FIFO-fair has well defined time behaviour; you
know exactly how long you get to wait before you acquire the lock,
namely however many waiters are in front of you multiplied by the worst
case wait time.

Doing time analysis on a randomized algorithm isn't my idea of fun.

> It seems that even today, qspinlock does not support RT_PREEMPT, given
> that it uses per-CPU queue nodes.

It does work with RT, commit:

  7aa54be29765 ("locking/qspinlock, x86: Provide liveness guarantee")

is a direct result of RT observing funnies with it. I've no idea why you
think it would not work.

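To put numbers on the timeliness point, here is a small sketch contrasting the two cases. It is an idealized model, not kernel code: the function names are made up for illustration, and the probabilistic part assumes independent draws and that a same-socket waiter is always available to take the lock.

```c
#include <stdint.h>

/*
 * FIFO-fair bound as described above: the number of waiters in front
 * of you multiplied by the worst case wait time per waiter.
 */
static uint64_t fifo_wait_bound_ns(unsigned int waiters_ahead,
				   uint64_t worst_case_ns)
{
	return (uint64_t)waiters_ahead * worst_case_ns;
}

/*
 * Probabilistic handoff: chance that @k consecutive handoffs all stay
 * on the same socket when each one goes off-socket with probability
 * 1 / @range. This is (1 - 1/range)^k, which shrinks with k but never
 * reaches zero, so there is no hard upper bound on the wait.
 */
static double prob_run_exceeds(unsigned int range, unsigned long k)
{
	double stay = 1.0 - 1.0 / (double)range;
	double p = 1.0;
	unsigned long i;

	for (i = 0; i < k; i++)
		p *= stay;

	return p;
}
```

With a range of 0x10000, as in the patch, the expected run of on-socket handoffs under this model is on the order of 65536, and a remote waiter can only be given a bound that holds with some probability, not a worst case.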
On 02/05/2019 04:22 AM, Peter Zijlstra wrote:
> On Mon, Feb 04, 2019 at 10:35:09PM -0500, Alex Kogan wrote:
>>> On Jan 31, 2019, at 5:00 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>>>
>>> On Wed, Jan 30, 2019 at 10:01:35PM -0500, Alex Kogan wrote:
>>>> Choose the next lock holder among spinning threads running on the same
>>>> socket with high probability rather than always. With small probability,
>>>> hand the lock to the first thread in the secondary queue or, if that
>>>> queue is empty, to the immediate successor of the current lock holder
>>>> in the main queue. Thus, assuming no failures while threads hold the
>>>> lock, every thread would be able to acquire the lock after a bounded
>>>> number of lock transitions, with high probability.
>>>>
>>>> Note that we could make the inter-socket transition deterministic,
>>>> by sticking a counter of intra-socket transitions in the head node
>>>> of the secondary queue. At the handoff time, we could increment
>>>> the counter and check if it is below a threshold. This adds another
>>>> field to queue nodes and nearly-certain local cache miss to read and
>>>> update this counter during the handoff. While still beating stock,
>>>> this variant adds certain overhead over the probabilistic variant.
>>> (also heavily suffers from the socket == node confusion)
>>>
>>> How would you suggest RT 'tunes' this?
>>>
>>> RT relies on FIFO fairness of the basic spinlock primitives; you just
>>> completely wrecked that.
>> It is true that CNA trades some fairness for shorter lock handover
>> latency, much like any other NUMA-aware lock.
>>
>> Can you explain, however, what exactly breaks here?
> Timeliness guarantees. FIFO-fair has well defined time behaviour; you
> know exactly how long you get to wait before you acquire the lock,
> namely however many waiters are in front of you multiplied by the worst
> case wait time.
>
> Doing time analysis on a randomized algorithm isn't my idea of fun.

That RT doesn't work well with NUMA qspinlock is another reason why I
want it to be a separate slow path. We will disable it on an RT kernel
where guaranteed low latency is a must and throughput isn't as important.

Cheers,
Longman

[ Resending after correcting an issue with the included URL and correcting
a typo in Waiman's name -- sorry about that! ]

> On Feb 5, 2019, at 4:22 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Mon, Feb 04, 2019 at 10:35:09PM -0500, Alex Kogan wrote:
>>
>>> On Jan 31, 2019, at 5:00 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>>>
>>> On Wed, Jan 30, 2019 at 10:01:35PM -0500, Alex Kogan wrote:
>>>> Choose the next lock holder among spinning threads running on the same
>>>> socket with high probability rather than always. With small probability,
>>>> hand the lock to the first thread in the secondary queue or, if that
>>>> queue is empty, to the immediate successor of the current lock holder
>>>> in the main queue. Thus, assuming no failures while threads hold the
>>>> lock, every thread would be able to acquire the lock after a bounded
>>>> number of lock transitions, with high probability.
>>>>
>>>> Note that we could make the inter-socket transition deterministic,
>>>> by sticking a counter of intra-socket transitions in the head node
>>>> of the secondary queue. At the handoff time, we could increment
>>>> the counter and check if it is below a threshold. This adds another
>>>> field to queue nodes and nearly-certain local cache miss to read and
>>>> update this counter during the handoff. While still beating stock,
>>>> this variant adds certain overhead over the probabilistic variant.
>>>
>>> (also heavily suffers from the socket == node confusion)
>>>
>>> How would you suggest RT 'tunes' this?
>>>
>>> RT relies on FIFO fairness of the basic spinlock primitives; you just
>>> completely wrecked that.
>>
>> It is true that CNA trades some fairness for shorter lock handover
>> latency, much like any other NUMA-aware lock.
>>
>> Can you explain, however, what exactly breaks here?
>
> Timeliness guarantees. FIFO-fair has well defined time behaviour; you
> know exactly how long you get to wait before you acquire the lock,
> namely however many waiters are in front of you multiplied by the worst
> case wait time.

Got it, thanks for the clarification!

> Doing time analysis on a randomized algorithm isn't my idea of fun.
>
>> It seems that even today, qspinlock does not support RT_PREEMPT, given
>> that it uses per-CPU queue nodes.
>
> It does work with RT, commit:
>
>   7aa54be29765 ("locking/qspinlock, x86: Provide liveness guarantee")
>
> is a direct result of RT observing funnies with it. I've no idea why you
> think it would not work.

Just trying to get to the bottom of it -- as of today, qspinlock explicitly
assumes no preemption while waiting for the lock.

Here is what Waiman had to say about that in https://lwn.net/Articles/561775:

"The idea behind this spinlock implementation is the fact that spinlocks
are acquired with preemption disabled. In other words, the process
will not be migrated to another CPU while it is trying to get a
spinlock."

This was back in 2013, but the code still uses per-CPU queue nodes,
and AFAICT, preemption will break things up.

So what you are saying is that RT would be fine assuming no preemption in
the spinlock as long as it provides FIFO? Or is there some future code patch
that will take care of the "no preemption" assumption (but still assume FIFO)?

Thanks,
-- Alex

On 02/05/2019 04:07 PM, Alex Kogan wrote:
>> Doing time analysis on a randomized algorithm isn't my idea of fun.
>>
>>> It seems that even today, qspinlock does not support RT_PREEMPT, given
>>> that it uses per-CPU queue nodes.
>> It does work with RT, commit:
>>
>>   7aa54be29765 ("locking/qspinlock, x86: Provide liveness guarantee")
>>
>> is a direct result of RT observing funnies with it. I've no idea why you
>> think it would not work.
> Just trying to get to the bottom of it -- as of today, qspinlock explicitly
> assumes no preemption while waiting for the lock.
>
> Here is what Waiman had to say about that in https://lwn.net/Articles/561775:
>
> "The idea behind this spinlock implementation is the fact that spinlocks
> are acquired with preemption disabled. In other words, the process
> will not be migrated to another CPU while it is trying to get a
> spinlock."
>
> This was back in 2013, but the code still uses per-CPU queue nodes,
> and AFAICT, preemption will break things up.
>
> So what you are saying is that RT would be fine assuming no preemption in
> the spinlock as long as it provides FIFO? Or is there some future code patch
> that will take care of the "no preemption" assumption (but still assume FIFO)?
>
> Thanks,
> -- Alex

Some of the critical sections protected by spinlocks may have execution
times that are much longer than desired. That is why they are converted
to rt-mutex in the RT kernel. There is another class of spinlocks called
raw spinlocks. They are the same as regular spinlocks in a non-RT kernel,
but remain spinlocks with no preemption allowed in the RT kernel, as
sleeping locks can't be used in atomic context. This is where the
replacement of the current qspinlock code by your NUMA-aware qspinlock
may screw up the timing guarantee that can be provided by the RT kernel.

Cheers,
Longman

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 6addc24f219d..d3caef4f84e2 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -31,6 +31,7 @@
 #include <linux/prefetch.h>
 #include <asm/byteorder.h>
 #include <asm/qspinlock.h>
+#include <linux/random.h>
 
 /*
  * Include queued spinlock statistics code
@@ -112,6 +113,18 @@ struct qnode {
  */
 static DEFINE_PER_CPU_ALIGNED(struct qnode, qnodes[MAX_NODES]);
 
+/* Per-CPU pseudo-random number seed */
+static DEFINE_PER_CPU(u32, seed);
+
+/*
+ * Controls the probability for intra-socket lock hand-off. It can be
+ * tuned and depend, e.g., on the number of CPUs per socket. For now,
+ * choose a value that provides reasonable long-term fairness without
+ * sacrificing performance compared to a version that does not have any
+ * fairness guarantees.
+ */
+#define INTRA_SOCKET_HANDOFF_PROB_ARG 0x10000
+
 /*
  * We must be able to distinguish between no-tail and the tail at 0:0,
  * therefore increment the cpu number by one.
@@ -369,6 +382,35 @@ static struct mcs_spinlock *find_successor(struct mcs_spinlock *me,
 	return NULL;
 }
 
+/*
+ * xorshift function for generating pseudo-random numbers:
+ * https://en.wikipedia.org/wiki/Xorshift
+ */
+static inline u32 xor_random(void)
+{
+	u32 v;
+
+	v = this_cpu_read(seed);
+	if (v == 0)
+		get_random_bytes(&v, sizeof(u32));
+
+	v ^= v << 6;
+	v ^= v >> 21;
+	v ^= v << 7;
+	this_cpu_write(seed, v);
+
+	return v;
+}
+
+/*
+ * Return false with probability 1 / @range.
+ * @range must be a power of 2.
+ */
+static bool probably(unsigned int range)
+{
+	return xor_random() & (range - 1);
+}
+
 #endif /* _GEN_PV_LOCK_SLOWPATH */
 
 /**
@@ -647,8 +689,15 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 	if (!next)
 		next = smp_cond_load_relaxed(&node->next, (VAL));
 
-	/* Try to pass the lock to a thread running on the same socket. */
-	succ = find_successor(node, cpuid);
+	/*
+	 * Try to pass the lock to a thread running on the same socket.
+	 * For long-term fairness, search for such a thread with high
+	 * probability rather than always.
+	 */
+	succ = NULL;
+	if (probably(INTRA_SOCKET_HANDOFF_PROB_ARG))
+		succ = find_successor(node, cpuid);
+
 	if (succ) {
 		arch_mcs_spin_unlock_contended(&succ->locked, node->locked);
 	} else if (node->locked > 1) {
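
For anyone who wants to poke at `probably()` outside the kernel, a user-space approximation could look like the sketch below. It keeps the shift constants and the mask test from the patch, but everything else is an assumption made for illustration: a single global seed with a fixed non-zero starting value replaces the per-CPU seed and `get_random_bytes()`, and `count_false()` is a helper that is not part of the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for the per-CPU seed; fixed non-zero value (assumption). */
static uint32_t seed = 0x12345678u;

/* Same shift sequence as xor_random() in the patch. */
static uint32_t xor_random(void)
{
	uint32_t v = seed;

	v ^= v << 6;
	v ^= v >> 21;
	v ^= v << 7;
	seed = v;

	return v;
}

/*
 * Returns false only when the low log2(@range) bits of the random
 * value are all zero. @range must be a power of 2.
 */
static bool probably(unsigned int range)
{
	return xor_random() & (range - 1);
}

/* Hypothetical helper: count how many of @trials calls return false. */
static unsigned long count_false(unsigned int range, unsigned long trials)
{
	unsigned long n = 0;
	unsigned long i;

	for (i = 0; i < trials; i++)
		if (!probably(range))
			n++;

	return n;
}
```

Calling `count_false(0x10000, trials)` with a large trial count gives an empirical feel for how rarely the lock is forced off-socket; how close that gets to 1 / @range depends on the quality of this particular shift triple, which the sketch does not attempt to evaluate.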