[v16,13/14] pvqspinlock: Improve slowpath performance by avoiding cmpxchg

Message ID 1429901803-29771-14-git-send-email-Waiman.Long@hp.com (mailing list archive)
State New, archived

Commit Message

Waiman Long April 24, 2015, 6:56 p.m. UTC
In the pv_scan_next() function, the slow cmpxchg atomic operation is
performed even if the other CPU is not even close to being halted. This
extra cmpxchg can harm slowpath performance.

This patch introduces the new mayhalt flag to indicate if the other
spinning CPU is close to being halted or not. The current threshold
for x86 is 2k cpu_relax() calls. If this flag is not set, the other
spinning CPU will have at least 2k more cpu_relax() calls before
it can enter the halt state. This should give enough time for the
setting of the locked flag in struct mcs_spinlock to propagate to
that CPU without using atomic op.
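
In code terms, the mechanism is roughly this (a condensed sketch of the
patch below, not the complete code):

	/* Waiter side (pv_wait_node): announce that halting is near. */
	for (loop = SPIN_THRESHOLD; loop; loop--) {
		if (READ_ONCE(node->locked))
			return;
		if (loop == MAYHALT_THRESHOLD)
			xchg(&pn->mayhalt, true);	/* store + full barrier */
		cpu_relax();
	}

	/* Holder side (pv_scan_next): if the waiter still has at least
	 * MAYHALT_THRESHOLD cpu_relax() calls left, skip the cmpxchg and
	 * rely on the plain store to node->locked reaching it in time.
	 */
	if (!READ_ONCE(pn->mayhalt))
		return;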

Signed-off-by: Waiman Long <Waiman.Long@hp.com>
---
 kernel/locking/qspinlock_paravirt.h |   28 +++++++++++++++++++++++++---
 1 files changed, 25 insertions(+), 3 deletions(-)

Comments

Peter Zijlstra April 29, 2015, 6:11 p.m. UTC | #1
On Fri, Apr 24, 2015 at 02:56:42PM -0400, Waiman Long wrote:
> In the pv_scan_next() function, the slow cmpxchg atomic operation is
> performed even if the other CPU is not even close to being halted. This
> extra cmpxchg can harm slowpath performance.
> 
> This patch introduces the new mayhalt flag to indicate if the other
> spinning CPU is close to being halted or not. The current threshold
> for x86 is 2k cpu_relax() calls. If this flag is not set, the other
> spinning CPU will have at least 2k more cpu_relax() calls before
> it can enter the halt state. This should give enough time for the
> setting of the locked flag in struct mcs_spinlock to propagate to
> that CPU without using atomic op.

Yuck! I'm not at all sure you can make assumptions like that. And the
worst part is, if it goes wrong the borkage is subtle and painful.
Linus Torvalds April 29, 2015, 6:27 p.m. UTC | #2
On Wed, Apr 29, 2015 at 11:11 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, Apr 24, 2015 at 02:56:42PM -0400, Waiman Long wrote:
>> In the pv_scan_next() function, the slow cmpxchg atomic operation is
>> performed even if the other CPU is not even close to being halted. This
>> extra cmpxchg can harm slowpath performance.
>>
>> This patch introduces the new mayhalt flag to indicate if the other
>> spinning CPU is close to being halted or not. The current threshold
>> for x86 is 2k cpu_relax() calls. If this flag is not set, the other
>> spinning CPU will have at least 2k more cpu_relax() calls before
>> it can enter the halt state. This should give enough time for the
>> setting of the locked flag in struct mcs_spinlock to propagate to
>> that CPU without using atomic op.
>
> Yuck! I'm not at all sure you can make assumptions like that. And the
> worst part is, if it goes wrong the borkage is subtle and painful.

I have to agree with Peter.

But it goes beyond this particular patch. Patterns like this:

       xchg(&pn->mayhalt, true);

are just evil and disgusting. Even before this patch, that code had

                (void)xchg(&pn->state, vcpu_halted);

which is *wrong* and should never be done.

If you want it to be "set_mb()" (which sets a value and has a memory
barrier), then use set_mb(). Yes, it happens to use a "xchg()" to do
so, but dammit, it documents that whole "this is a memory barrier" in
the name.
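
For instance, the substitution would look like this (a sketch only;
set_mb() was later renamed smp_store_mb(), with the same semantics):

	/* Opaque: an xchg() whose return value is discarded; the reader
	 * must know that xchg() implies a full memory barrier. */
	(void)xchg(&pn->state, vcpu_halted);

	/* Self-documenting: store the value, then a full memory barrier.
	 * On x86 this may still compile to an xchg, but the intent is
	 * spelled out in the name. */
	set_mb(pn->state, vcpu_halted);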

Also, anybody who does this should damn well document why the memory
barrier is needed. The xchg(&pn->state, vcpu_halted) at least is
preceded by a comment about the barriers. The new mayhalt has no sane
comment in it, and the reason seems to be that no sane comment is
possible. The xchg() seems to be some black magic thing.

Let's not introduce magic stuff in our locking primitives. At least
not undocumented magic that makes no sense.

                               Linus
Waiman Long April 30, 2015, 6:49 p.m. UTC | #3
On 04/29/2015 02:11 PM, Peter Zijlstra wrote:
> On Fri, Apr 24, 2015 at 02:56:42PM -0400, Waiman Long wrote:
>> In the pv_scan_next() function, the slow cmpxchg atomic operation is
>> performed even if the other CPU is not even close to being halted. This
>> extra cmpxchg can harm slowpath performance.
>>
>> This patch introduces the new mayhalt flag to indicate if the other
>> spinning CPU is close to being halted or not. The current threshold
>> for x86 is 2k cpu_relax() calls. If this flag is not set, the other
>> spinning CPU will have at least 2k more cpu_relax() calls before
>> it can enter the halt state. This should give enough time for the
>> setting of the locked flag in struct mcs_spinlock to propagate to
>> that CPU without using atomic op.
> Yuck! I'm not at all sure you can make assumptions like that. And the
> worst part is, if it goes wrong the borkage is subtle and painful.

I do think the code is OK. However, you are right that if my reasoning
is incorrect, the resulting bug will be really subtle. So I am going to
withdraw this particular patch as it has no functional impact on the
overall patch series. Please let me know if you have any other comments
on other parts of the series and I will send out a new series without
this particular patch.

Cheers,
Longman
Waiman Long April 30, 2015, 6:56 p.m. UTC | #4
On 04/29/2015 02:27 PM, Linus Torvalds wrote:
> On Wed, Apr 29, 2015 at 11:11 AM, Peter Zijlstra<peterz@infradead.org>  wrote:
>> On Fri, Apr 24, 2015 at 02:56:42PM -0400, Waiman Long wrote:
>>> In the pv_scan_next() function, the slow cmpxchg atomic operation is
>>> performed even if the other CPU is not even close to being halted. This
>>> extra cmpxchg can harm slowpath performance.
>>>
>>> This patch introduces the new mayhalt flag to indicate if the other
>>> spinning CPU is close to being halted or not. The current threshold
>>> for x86 is 2k cpu_relax() calls. If this flag is not set, the other
>>> spinning CPU will have at least 2k more cpu_relax() calls before
>>> it can enter the halt state. This should give enough time for the
>>> setting of the locked flag in struct mcs_spinlock to propagate to
>>> that CPU without using atomic op.
>> Yuck! I'm not at all sure you can make assumptions like that. And the
>> worst part is, if it goes wrong the borkage is subtle and painful.
> I have to agree with Peter.
>
> But it goes beyond this particular patch. Patterns like this:
>
>         xchg(&pn->mayhalt, true);
>
> are just evil and disgusting. Even before this patch, that code had
>
>                  (void)xchg(&pn->state, vcpu_halted);
>
> which is *wrong* and should never be done.
>
> If you want it to be "set_mb()" (which sets a value and has a memory
> barrier), then use set_mb(). Yes, it happens to use a "xchg()" to do
> so, but dammit, it documents that whole "this is a memory barrier" in
> the name.
> Also, anybody who does this should damn well document why the memory
> barrier is needed. The xchg(&pn->state, vcpu_halted) at least is
> preceded by a comment about the barriers. The new mayhalt has no sane
> comment in it, and the reason seems to be that no sane comment is
> possible. The xchg() seems to be some black magic thing.
>
> Let's not introduce magic stuff in our locking primitives. At least
> not undocumented magic that makes no sense.
>
>                                 Linus

Thanks for the comments. I will withdraw this patch and use set_mb() in 
the code as suggested for better readability.

Cheers,
Longman
Peter Zijlstra May 4, 2015, 2:05 p.m. UTC | #5
On Thu, Apr 30, 2015 at 02:49:26PM -0400, Waiman Long wrote:
> On 04/29/2015 02:11 PM, Peter Zijlstra wrote:
> >On Fri, Apr 24, 2015 at 02:56:42PM -0400, Waiman Long wrote:
> >>In the pv_scan_next() function, the slow cmpxchg atomic operation is
> >>performed even if the other CPU is not even close to being halted. This
> >>extra cmpxchg can harm slowpath performance.
> >>
> >>This patch introduces the new mayhalt flag to indicate if the other
> >>spinning CPU is close to being halted or not. The current threshold
> >>for x86 is 2k cpu_relax() calls. If this flag is not set, the other
> >>spinning CPU will have at least 2k more cpu_relax() calls before
> >>it can enter the halt state. This should give enough time for the
> >>setting of the locked flag in struct mcs_spinlock to propagate to
> >>that CPU without using atomic op.
> >Yuck! I'm not at all sure you can make assumptions like that. And the
> >worst part is, if it goes wrong the borkage is subtle and painful.
> 
> I do think the code is OK. However, you are right that if my reasoning is
> incorrect, the resulting bug will be really subtle. 

So I do not think it's correct: imagine the fabrics used for the 4096-CPU
SGI machine, now add some serious traffic to them. There is no saying
your random 2k relax loop will be enough to propagate the change.

Equally, another arch (this is generic code) might have starvation
issues on its inter-cpu fabric and delay the store just long enough.

The thing is, one should _never_ rely on timing for correctness, _ever_.

> So I am going to
> withdraw this particular patch as it has no functional impact on the overall
> patch series. Please let me know if you have any other comments on other
> parts of the series and I will send out a new series without this
> particular patch.

Please wait a little while, I've queued the 'basic' patches, once that
settles in tip we can look at the others.

Also, I have some local changes (sorry, I could not help myself) I should
post; I've been somewhat delayed by illness.
Waiman Long May 4, 2015, 5:18 p.m. UTC | #6
On 05/04/2015 10:05 AM, Peter Zijlstra wrote:
> On Thu, Apr 30, 2015 at 02:49:26PM -0400, Waiman Long wrote:
>> On 04/29/2015 02:11 PM, Peter Zijlstra wrote:
>>> On Fri, Apr 24, 2015 at 02:56:42PM -0400, Waiman Long wrote:
>>>> In the pv_scan_next() function, the slow cmpxchg atomic operation is
>>>> performed even if the other CPU is not even close to being halted. This
>>>> extra cmpxchg can harm slowpath performance.
>>>>
>>>> This patch introduces the new mayhalt flag to indicate if the other
>>>> spinning CPU is close to being halted or not. The current threshold
>>>> for x86 is 2k cpu_relax() calls. If this flag is not set, the other
>>>> spinning CPU will have at least 2k more cpu_relax() calls before
>>>> it can enter the halt state. This should give enough time for the
>>>> setting of the locked flag in struct mcs_spinlock to propagate to
>>>> that CPU without using atomic op.
>>> Yuck! I'm not at all sure you can make assumptions like that. And the
>>> worst part is, if it goes wrong the borkage is subtle and painful.
>> I do think the code is OK. However, you are right that if my reasoning is
>> incorrect, the resulting bug will be really subtle.
> So I do not think it's correct: imagine the fabrics used for the 4096-CPU
> SGI machine, now add some serious traffic to them. There is no saying
> your random 2k relax loop will be enough to propagate the change.
>
> Equally, another arch (this is generic code) might have starvation
> issues on its inter-cpu fabric and delay the store just long enough.
>
> The thing is, one should _never_ rely on timing for correctness, _ever_.
>

Yes, you are right. Having a dependency on timing can be dangerous.

>> So I am going to
>> withdraw this particular patch as it has no functional impact on the overall
>> patch series. Please let me know if you have any other comments on other
>> parts of the series and I will send out a new series without this
>> particular patch.
> Please wait a little while, I've queued the 'basic' patches, once that
> settles in tip we can look at the others.
>
> Also, I have some local changes (sorry, I could not help myself) I should
> post; I've been somewhat delayed by illness.

Sure. I will wait until you finish your tip test.

I am sorry to hear that you have been ill. I hope you are feeling
better by now.

Cheers,
Longman

Patch

diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index 9b4ac3d..41ee033 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -19,7 +19,8 @@ 
  * native_queue_spin_unlock().
  */
 
-#define _Q_SLOW_VAL	(3U << _Q_LOCKED_OFFSET)
+#define _Q_SLOW_VAL		(3U << _Q_LOCKED_OFFSET)
+#define MAYHALT_THRESHOLD	(SPIN_THRESHOLD >> 4)
 
 /*
  * The vcpu_hashed is a special state that is set by the new lock holder on
@@ -39,6 +40,7 @@  struct pv_node {
 
 	int			cpu;
 	u8			state;
+	u8			mayhalt;
 };
 
 /*
@@ -198,6 +200,7 @@  static void pv_init_node(struct mcs_spinlock *node)
 
 	pn->cpu = smp_processor_id();
 	pn->state = vcpu_running;
+	pn->mayhalt = false;
 }
 
 /*
@@ -214,23 +217,34 @@  static void pv_wait_node(struct mcs_spinlock *node)
 		for (loop = SPIN_THRESHOLD; loop; loop--) {
 			if (READ_ONCE(node->locked))
 				return;
+			if (loop == MAYHALT_THRESHOLD)
+				xchg(&pn->mayhalt, true);
 			cpu_relax();
 		}
 
 		/*
-		 * Order pn->state vs pn->locked thusly:
+		 * Order pn->state/pn->mayhalt vs pn->locked thusly:
 		 *
-		 * [S] pn->state = vcpu_halted	  [S] next->locked = 1
+		 * [S] pn->mayhalt = 1            [S] next->locked = 1
+		 *     MB, delay                      barrier()
+		 * [S] pn->state = vcpu_halted    [L] pn->mayhalt
 		 *     MB			      MB
 		 * [L] pn->locked		[RmW] pn->state = vcpu_hashed
 		 *
 		 * Matches the cmpxchg() from pv_scan_next().
+		 *
+		 * As the new lock holder may quit (when pn->mayhalt is not
+		 * set) without memory barrier, a sufficiently long delay is
+		 * inserted between the setting of pn->mayhalt and pn->state
+		 * to ensure that there is enough time for the new pn->locked
+		 * value to be propagated here to be checked below.
 		 */
 		(void)xchg(&pn->state, vcpu_halted);
 
 		if (!READ_ONCE(node->locked))
 			pv_wait(&pn->state, vcpu_halted);
 
+		pn->mayhalt = false;
 		/*
 		 * Reset the state except when vcpu_hashed is set.
 		 */
@@ -263,6 +277,14 @@  static void pv_scan_next(struct qspinlock *lock, struct mcs_spinlock *node)
 	struct __qspinlock *l = (void *)lock;
 
 	/*
+	 * If mayhalt is not set, there is enough time for the just set value
+	 * in pn->locked to be propagated to the other CPU before it is time
+	 * to halt.
+	 */
+	if (!READ_ONCE(pn->mayhalt))
+		return;
+
+	/*
 	 * Transition CPU state: halted => hashed
 	 * Quit if the transition failed.
 	 */