
[1/1,RFC] workqueue: fix ghost PENDING flag while doing MQ IO

Message ID 20160425154847.GZ7822@mtj.duckdns.org (mailing list archive)
State New, archived

Commit Message

Tejun Heo April 25, 2016, 3:48 p.m. UTC
Hello, Roman.

On Mon, Apr 25, 2016 at 05:22:51PM +0200, Roman Pen wrote:
...
>   CPU#6                                CPU#2
>   request ffff884000343600 inserted
>   hctx marked as pended
>   kblockd_schedule...() returns 1
>   <schedule to kblockd workqueue>
>   *** WORK_STRUCT_PENDING_BIT is cleared ***
>   flush_busy_ctxs() is executed
>                                        request ffff884000343cc0 inserted
>                                        hctx marked as pended
>                                        kblockd_schedule...() returns 0
>                                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>                                        WTF?
> 
> As a result, request ffff884000343cc0 remained pending forever.
> 
> According to the trace output I see that another CPU _always_ observes
> WORK_STRUCT_PENDING_BIT as set for that hctx->run_work, even though it
> was cleared on another CPU.
> 
> Checking the workqueue.c code I see that clearing the bit is nothing
> more than atomic_long_set(), which is a <mov> instruction, in this
> function:
> 
>   static inline void set_work_data()
> 
> In an attempt to "fix" the mystery I replaced the atomic_long_set() call
> with atomic_long_xchg(), which is an <xchg> instruction.
> 
> The problem went away.
> 
> To me it looks as if test_and_set_bit() (the <lock btsl> instruction)
> does not require a flush of all CPU caches, which can be dirty after
> executing <mov> on another CPU.  But <xchg> really updates the memory
> and the following execution of <lock btsl> observes that the bit was
> cleared.
> 
> In conclusion I can say that I am lucky enough to be able to reproduce
> this bug within several minutes on a specific load (I tried many other
> simple loads using fio, even using btrecord/btreplay, with no success).
> And that easy reproduction on a specific load gives me some freedom
> to test and then to be sure that the problem is gone.

Heh, excellent debugging.  I wonder how old this bug is.  cc'ing David
Howells who ISTR to have reported a similar issue.  The root problem
here, I think, is that PENDING is used to synchronize between
different queueing instances but we don't have a proper memory barrier
after it.

	A				B

	queue (test_and_sets PENDING)
	dispatch (clears PENDING)
	execute				queue (test_and_sets PENDING)

So, for B, the guarantee must be that either A starts executing after
B's test_and_set or B's test_and_set succeeds; however, as we don't
have any memory barrier between dispatch and execute, there's nothing
preventing the processor from scheduling some memory fetch operations
from the execute stage before the clearing of PENDING - ie. A might
not see what B has done prior to queue even if B's test_and_set fails
indicating that A should.  Can you please test whether the following
patch fixes the issue?
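
The reordering described above is the classic store-buffering pattern: a
plain store (clearing PENDING) followed by loads on one CPU, racing against
data stores followed by an atomic test-and-set on the other.  A minimal
kernel-style sketch of that pattern (stand-in variables only, not the real
work_struct fields):

/*
 * Sketch of the race with stand-in variables.  CPU A clears PENDING
 * with a plain store and then loads the queuer's data; CPU B stores
 * its data and then atomically test-and-sets PENDING.
 */
static atomic_long_t pending;		/* stands in for the PENDING bit */
static int queuer_data;			/* data written by the queuer (B) */

/* CPU A: dispatch (clear PENDING), then execute the work item */
static void cpu_a(void)
{
	int seen;

	atomic_long_set(&pending, 0);	/* plain store, no barrier implied */
	/*
	 * Without a full barrier here, the load below may be satisfied
	 * before the store above becomes visible to CPU B.
	 */
	seen = READ_ONCE(queuer_data);	/* "execute": may still see 0 */
	(void)seen;
}

/* CPU B: make updates, then try to queue the work item again */
static void cpu_b(void)
{
	WRITE_ONCE(queuer_data, 1);	/* updates made before queueing */
	if (test_and_set_bit(0, (unsigned long *)&pending))
		return;	/* queueing lost: B now relies on CPU A's execution
			 * seeing queuer_data == 1, which the reordering on
			 * CPU A can break */
}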

Comments

Tejun Heo April 25, 2016, 4 p.m. UTC | #1
On Mon, Apr 25, 2016 at 11:48:47AM -0400, Tejun Heo wrote:
> Heh, excellent debugging.  I wonder how old this bug is.  cc'ing David

I just went through the history.  So, we used to have clear_bit()
instead of atomic_long_set() but clear_bit() doesn't imply any barrier
either, so kudos to you.  You just debugged an issue which most likely
has been there for over a decade! :)
Roman Pen April 25, 2016, 4:34 p.m. UTC | #2
Hello, Tejun,

On Mon, Apr 25, 2016 at 5:48 PM, Tejun Heo <tj@kernel.org> wrote:
> Hello, Roman.
>
> On Mon, Apr 25, 2016 at 05:22:51PM +0200, Roman Pen wrote:
> ...
>>   CPU#6                                CPU#2
>>   request ffff884000343600 inserted
>>   hctx marked as pended
>>   kblockd_schedule...() returns 1
>>   <schedule to kblockd workqueue>
>>   *** WORK_STRUCT_PENDING_BIT is cleared ***
>>   flush_busy_ctxs() is executed
>>                                        request ffff884000343cc0 inserted
>>                                        hctx marked as pended
>>                                        kblockd_schedule...() returns 0
>>                                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>                                        WTF?
>>
>> As a result, request ffff884000343cc0 remained pending forever.
>>
>> According to the trace output I see that another CPU _always_ observes
>> WORK_STRUCT_PENDING_BIT as set for that hctx->run_work, even though it
>> was cleared on another CPU.
>>
>> Checking the workqueue.c code I see that clearing the bit is nothing
>> more than atomic_long_set(), which is a <mov> instruction, in this
>> function:
>>
>>   static inline void set_work_data()
>>
>> In an attempt to "fix" the mystery I replaced the atomic_long_set() call
>> with atomic_long_xchg(), which is an <xchg> instruction.
>>
>> The problem went away.
>>
>> To me it looks as if test_and_set_bit() (the <lock btsl> instruction)
>> does not require a flush of all CPU caches, which can be dirty after
>> executing <mov> on another CPU.  But <xchg> really updates the memory
>> and the following execution of <lock btsl> observes that the bit was
>> cleared.
>>
>> In conclusion I can say that I am lucky enough to be able to reproduce
>> this bug within several minutes on a specific load (I tried many other
>> simple loads using fio, even using btrecord/btreplay, with no success).
>> And that easy reproduction on a specific load gives me some freedom
>> to test and then to be sure that the problem is gone.
>
> Heh, excellent debugging.  I wonder how old this bug is.  cc'ing David
> Howells who ISTR to have reported a similar issue.  The root problem
> here, I think, is that PENDING is used to synchronize between
> different queueing instances but we don't have a proper memory barrier
> after it.
>
>         A                               B
>
>         queue (test_and_sets PENDING)
>         dispatch (clears PENDING)
>         execute                         queue (test_and_sets PENDING)
>
> So, for B, the guarantee must be that either A starts executing after
> B's test_and_set or B's test_and_set succeeds; however, as we don't
> have any memory barrier between dispatch and execute, there's nothing
> preventing the processor from scheduling some memory fetch operations
> from the execute stage before the clearing of PENDING - ie. A might
> not see what B has done prior to queue even if B's test_and_set fails
> indicating that A should.  Can you please test whether the following
> patch fixes the issue?

I can assure you that smp_mb() helps (at least when running for 30
minutes under IO). That was my first variant, but I did not like it
because I could not explain to myself why:

1. Why not smp_wmb()? We need to flush after an update.
   (I tried that also, and it does not help.)

2. What protects us from this situation?

  CPU#0                  CPU#1
                         set_work_data()
  test_and_set_bit()
                         smp_mb()

And question 2 was crucial to me, because even a tiny delay "fixes" the
problem, e.g. ndelay() also "fixes" the bug:

         smp_wmb();
         set_work_data(work, (unsigned long)pool_id << WORK_OFFQ_POOL_SHIFT, 0);
 +       ndelay(40);
  }

Why ndelay(40)? Because on this machine smp_mb() takes 40 ns on average.

--
Roman

>
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index 2232ae3..8ec2b5e 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -666,6 +666,7 @@ static void set_work_pool_and_clear_pending(struct work_struct *work,
>          */
>         smp_wmb();
>         set_work_data(work, (unsigned long)pool_id << WORK_OFFQ_POOL_SHIFT, 0);
> +       smp_mb();
>  }
>
>  static void clear_work_data(struct work_struct *work)
>
>
> --
> tejun
Roman Pen April 25, 2016, 4:40 p.m. UTC | #3
On Mon, Apr 25, 2016 at 6:00 PM, Tejun Heo <tj@kernel.org> wrote:
> On Mon, Apr 25, 2016 at 11:48:47AM -0400, Tejun Heo wrote:
>> Heh, excellent debugging.  I wonder how old this bug is.  cc'ing David
>
> I just went through the history.  So, we used to have clear_bit()
> instead of atomic_long_set() but clear_bit() doesn't imply any barrier
> either, so kudos to you.  You just debugged an issue which most likely
> has been there for over a decade! :)

That's nice that we can finally fix it.
But as I was saying, I was just lucky enough to have a very easy
reproduction, so I could check all the crazy ideas I had :)

--
Roman
Tejun Heo April 25, 2016, 5:03 p.m. UTC | #4
Hello, Roman.

On Mon, Apr 25, 2016 at 06:34:45PM +0200, Roman Penyaev wrote:
> I can assure you that smp_mb() helps (at least when running for 30
> minutes under IO). That was my first variant, but I did not like it
> because I could not explain to myself why:
> 
> 1. Why not smp_wmb()? We need to flush after an update.
>    (I tried that also, and it does not help.)

Regardless of the success of queue_work(), the interface guarantees
that there will be at least one execution instance which sees whatever
updates the queuer has made prior to calling queue_work().  The
PENDING bit is what synchronizes these operations.

	A				B

					Make updates
	clear PENDING			test_and_set PENDING
	start execution

So, if B's test_and_set takes place before clearing of PENDING, what
should be guaranteed is that A's execution must be able to see B's
updates; however, as there's no barrier between "clear PENDING" and
"start execution", memory loads of execution can be scheduled before
clearing of PENDING which leads to a situation where B loses queueing
but its updates are not seen by the prior instance's execution.  It's
a classic "either a sees b (clear PENDING) or b sees a (prior
updates)" interlocking situation.

> 2. What protects us from this situation?
> 
>   CPU#0                  CPU#1
>                          set_work_data()
>   test_and_set_bit()
>                          smp_mb()

The above would be completely fine as CPU#1's execution would see all
the changes CPU#0 has made up to that point.

> And question 2 was crucial to me, because even a tiny delay "fixes" the
> problem, e.g. ndelay() also "fixes" the bug:
> 
>          smp_wmb();
>          set_work_data(work, (unsigned long)pool_id << WORK_OFFQ_POOL_SHIFT, 0);
>  +       ndelay(40);
>   }
> 
> Why ndelay(40)? Because on this machine smp_mb() takes 40 ns on average.

Yeah, this is the CPU rescheduling loads for the execution ahead of
the clearing of PENDING, and doing anything in between is likely to
reduce the chance of it happening drastically, but smp_mb() in between
is actually the right solution here.

Thanks.
Roman Pen April 25, 2016, 5:39 p.m. UTC | #5
On Mon, Apr 25, 2016 at 7:03 PM, Tejun Heo <tj@kernel.org> wrote:
> Hello, Roman.
>
> On Mon, Apr 25, 2016 at 06:34:45PM +0200, Roman Penyaev wrote:
>> I can assure you that smp_mb() helps (at least when running for 30
>> minutes under IO). That was my first variant, but I did not like it
>> because I could not explain to myself why:
>>
>> 1. Why not smp_wmb()? We need to flush after an update.
>>    (I tried that also, and it does not help.)
>
> Regardless of the success of queue_work(), the interface guarantees
> that there will be at least one execution instance which sees whatever
> updates the queuer has made prior to calling queue_work().  The
> PENDING bit is what synchronizes these operations.
>
>         A                               B
>
>                                         Make updates
>         clear PENDING                   test_and_set PENDING
>         start execution
>
> So, if B's test_and_set takes place before clearing of PENDING, what
> should be guaranteed is that A's execution must be able to see B's
> updates; however, as there's no barrier between "clear PENDING" and
> "start execution", memory loads of execution can be scheduled before
> clearing of PENDING which leads to a situation where B loses queueing
> but its updates are not seen by the prior instance's execution.  It's
> a classic "either a sees b (clear PENDING) or b sees a (prior
> updates)" interlocking situation.

Ok, that's clear now. Thanks. I was also confused by the spin lock,
which is released just after clearing the pending bit:

   set_work_pool_and_clear_pending(work, pool->id);
   spin_unlock_irq(&pool->lock);
   ...
   worker->current_func(work);

But it seems memory operations of the execution can leak in and appear
before the pending bit is cleared and the spin lock is released.
(according to Documentation/memory-barriers.txt, (6) RELEASE operations)
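
In other words, the unlock is only a RELEASE: it keeps the accesses before
it from moving past it, but does nothing to stop later loads from being
hoisted above it.  Annotating the same path (comments are illustrative
only):

   set_work_pool_and_clear_pending(work, pool->id); /* plain store clears PENDING */
   spin_unlock_irq(&pool->lock);  /* RELEASE: only orders the accesses above it */
   /* ... */
   worker->current_func(work);    /* loads issued here may be satisfied before
                                   * both the unlock and the PENDING-clearing
                                   * store */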

>> 2. What protects us from this situation?
>>
>>   CPU#0                  CPU#1
>>                          set_work_data()
>>   test_and_set_bit()
>>                          smp_mb()
>
> The above would be completely fine as CPU#1's execution would see all
> the changes CPU#0 has made up to that point.
>
>> And question 2 was crucial to me, because even a tiny delay "fixes" the
>> problem, e.g. ndelay() also "fixes" the bug:
>>
>>          smp_wmb();
>>          set_work_data(work, (unsigned long)pool_id << WORK_OFFQ_POOL_SHIFT, 0);
>>  +       ndelay(40);
>>   }
>>
>> Why ndelay(40)? Because on this machine smp_mb() takes 40 ns on average.
>
> Yeah, this is the CPU rescheduling loads for the execution ahead of
> the clearing of PENDING, and doing anything in between is likely to
> reduce the chance of it happening drastically, but smp_mb() in between
> is actually the right solution here.

Tejun, do you need an updated patch for that? With a proper smp_mb()?

--
Roman
Tejun Heo April 25, 2016, 5:51 p.m. UTC | #6
Hello, Roman.

On Mon, Apr 25, 2016 at 07:39:52PM +0200, Roman Penyaev wrote:
> Ok, that's clear now. Thanks. I was also confused by the spin lock,
> which is released just after clearing the pending bit:
> 
>    set_work_pool_and_clear_pending(work, pool->id);
>    spin_unlock_irq(&pool->lock);
>    ...
>    worker->current_func(work);
> 
> But it seems memory operations of the execution can leak in and appear
> before the pending bit is cleared and the spin lock is released.
> (according to Documentation/memory-barriers.txt, (6) RELEASE operations)

Yeap, release doesn't prevent following reads from creeping ahead of
it.

> Tejun, do you need an updated patch for that? With a proper smp_mb()?

Yes, can you please also add a comment explaining what the mb is
interlocking?

Thanks a lot!
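
For illustration only, the kind of comment being asked for might read
something like this (a sketch, not the wording that was eventually merged):

static void set_work_pool_and_clear_pending(struct work_struct *work,
					    int pool_id)
{
	/* ... existing comment and smp_wmb() as in the patch context ... */
	smp_wmb();
	set_work_data(work, (unsigned long)pool_id << WORK_OFFQ_POOL_SHIFT, 0);
	/*
	 * Illustrative comment: this mb pairs with the full barrier
	 * implied by test_and_set_bit() in queue_work().  Either a
	 * concurrent queuer sees PENDING still set and backs off, in
	 * which case the execution that follows here is ordered after
	 * that queuer's prior updates and will see them, or the queuer's
	 * test_and_set_bit() succeeds and a new execution is scheduled.
	 */
	smp_mb();
}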
Peter Hurley April 26, 2016, 1:22 a.m. UTC | #7
On 04/25/2016 08:48 AM, Tejun Heo wrote:
> Hello, Roman.
> 
> On Mon, Apr 25, 2016 at 05:22:51PM +0200, Roman Pen wrote:
> ...
>>   CPU#6                                CPU#2
>>   request ffff884000343600 inserted
>>   hctx marked as pended
>>   kblockd_schedule...() returns 1
>>   <schedule to kblockd workqueue>
>>   *** WORK_STRUCT_PENDING_BIT is cleared ***
>>   flush_busy_ctxs() is executed
>>                                        request ffff884000343cc0 inserted
>>                                        hctx marked as pended
>>                                        kblockd_schedule...() returns 0
>>                                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>                                        WTF?
>>
>> As a result, request ffff884000343cc0 remained pending forever.
>>
>> According to the trace output I see that another CPU _always_ observes
>> WORK_STRUCT_PENDING_BIT as set for that hctx->run_work, even though it
>> was cleared on another CPU.
>>
>> Checking the workqueue.c code I see that clearing the bit is nothing
>> more than atomic_long_set(), which is a <mov> instruction, in this
>> function:
>>
>>   static inline void set_work_data()
>>
>> In an attempt to "fix" the mystery I replaced the atomic_long_set() call
>> with atomic_long_xchg(), which is an <xchg> instruction.
>>
>> The problem went away.
>>
>> To me it looks as if test_and_set_bit() (the <lock btsl> instruction)
>> does not require a flush of all CPU caches, which can be dirty after
>> executing <mov> on another CPU.  But <xchg> really updates the memory
>> and the following execution of <lock btsl> observes that the bit was
>> cleared.
>>
>> In conclusion I can say that I am lucky enough to be able to reproduce
>> this bug within several minutes on a specific load (I tried many other
>> simple loads using fio, even using btrecord/btreplay, with no success).
>> And that easy reproduction on a specific load gives me some freedom
>> to test and then to be sure that the problem is gone.
> 
> Heh, excellent debugging.  I wonder how old this bug is.

This is the same bug I wrote about 2 yrs ago (but with the wrong fix).

http://lkml.iu.edu/hypermail/linux/kernel/1402.2/04697.html

Unfortunately I didn't have a reproducer at all :/


> cc'ing David
> Howells who ISTR to have reported a similar issue.  The root problem
> here, I think, is that PENDING is used to synchronize between
> different queueing instances but we don't have a proper memory barrier
> after it.
> 
> 	A				B
> 
> 	queue (test_and_sets PENDING)
> 	dispatch (clears PENDING)
> 	execute				queue (test_and_sets PENDING)
> 
> So, for B, the guarantee must be that either A starts executing after
> B's test_and_set or B's test_and_set succeeds; however, as we don't
> have any memory barrier between dispatch and execute, there's nothing
> preventing the processor from scheduling some memory fetch operations
> from the execute stage before the clearing of PENDING - ie. A might
> not see what B has done prior to queue even if B's test_and_set fails
> indicating that A should.  Can you please test whether the following
> patch fixes the issue?
> 
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index 2232ae3..8ec2b5e 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -666,6 +666,7 @@ static void set_work_pool_and_clear_pending(struct work_struct *work,
>  	 */
>  	smp_wmb();
>  	set_work_data(work, (unsigned long)pool_id << WORK_OFFQ_POOL_SHIFT, 0);
> +	smp_mb();
>  }

The atomic_long_xchg() patch has several benefits over the naked barrier:

1. set_work_pool_and_clear_pending() has the same requirements as
   clear_work_data(); note that both require write barrier before and
   full barrier after.

2. xchg() et al imply full barrier before and full barrier after.

3. The naked barriers could be removed, while improving efficiency.
   On x86, mov + mfence => xchg

4. Maybe fixes other hidden bugs.
   For example, I'm wondering if reordering with set_work_pwq/list_add_tail
   would be a problem; ie., what if work is visible on the worklist _before_
   data is initialized by set_work_pwq()?

Regards,
Peter Hurley
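
For concreteness, the xchg-based variant being argued for might look
roughly like this (a sketch only, relying on the full barriers implied
before and after xchg; the debug-objects bits that set_work_data() folds
in are left out):

static void set_work_pool_and_clear_pending(struct work_struct *work,
					    int pool_id)
{
	/*
	 * atomic_long_xchg() implies a full barrier before and after the
	 * store, so neither the smp_wmb() before nor an explicit smp_mb()
	 * after would be needed.
	 */
	atomic_long_xchg(&work->data,
			 (unsigned long)pool_id << WORK_OFFQ_POOL_SHIFT);
}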



Tejun Heo April 26, 2016, 3:15 p.m. UTC | #8
Hello, Peter.

On Mon, Apr 25, 2016 at 06:22:01PM -0700, Peter Hurley wrote:
> This is the same bug I wrote about 2 yrs ago (but with the wrong fix).
> 
> http://lkml.iu.edu/hypermail/linux/kernel/1402.2/04697.html
> 
> Unfortunately I didn't have a reproducer at all :/

Ah, bummer.

> The atomic_long_xchg() patch has several benefits over the naked barrier:
> 
> 1. set_work_pool_and_clear_pending() has the same requirements as
>    clear_work_data(); note that both require write barrier before and
>    full barrier after.

clear_work_data() is only used by __cancel_work_timer() and there's no
following execution or anything where rescheduling memory loads can
cause any issue.

> 2. xchg() et al imply full barrier before and full barrier after.
> 
> 3. The naked barriers could be removed, while improving efficiency.
>    On x86, mov + mfence => xchg

It's unlikely to make any measurable difference.  Is xchg() actually
cheaper than store + rmb?

> 4. Maybe fixes other hidden bugs.
>    For example, I'm wondering if reordering with set_work_pwq/list_add_tail
>    would be a problem; ie., what if work is visible on the worklist _before_
>    data is initialized by set_work_pwq()?

Worklist is always accessed under the pool lock.  The barrier comes
into play only because we're using bare PENDING bit for
synchronization.  I'm not necessarily against making all clearings of
PENDING be followed by an rmb or use xchg.  Reasons 2-4 are pretty
weak tho.

Thanks.
Peter Hurley April 26, 2016, 5:27 p.m. UTC | #9
Hi Tejun,

On 04/26/2016 08:15 AM, Tejun Heo wrote:
> Hello, Peter.
> 
> On Mon, Apr 25, 2016 at 06:22:01PM -0700, Peter Hurley wrote:
>> This is the same bug I wrote about 2 yrs ago (but with the wrong fix).
>>
>> http://lkml.iu.edu/hypermail/linux/kernel/1402.2/04697.html
>>
>> Unfortunately I didn't have a reproducer at all :/
> 
> Ah, bummer.
> 
>> The atomic_long_xchg() patch has several benefits over the naked barrier:
>>
>> 1. set_work_pool_and_clear_pending() has the same requirements as
>>    clear_work_data(); note that both require write barrier before and
>>    full barrier after.
> 
> clear_work_data() is only used by __cancel_work_timer() and there's no
> following execution or anything where rescheduling memory loads can
> cause any issue.

I meant this block:

	clear_work_data(work);

	/*
	 * Paired with prepare_to_wait() above so that either
	 * waitqueue_active() is visible here or !work_is_canceling() is
	 * visible there.
	 */
	smp_mb();


>> 2. xchg() et al imply full barrier before and full barrier after.
>>
>> 3. The naked barriers could be removed, while improving efficiency.
>>    On x86, mov + mfence => xchg
> 
> It's unlikely to make any measurable difference.  Is xchg() actually
> cheaper than store + rmb?

store + mfence (full barrier), yes. Roughly 2x faster.

https://lkml.org/lkml/2015/11/2/607


>> 4. Maybe fixes other hidden bugs.
>>    For example, I'm wondering if reordering with set_work_pwq/list_add_tail
>>    would be a problem; ie., what if work is visible on the worklist _before_
>>    data is initialized by set_work_pwq()?
> 
> Worklist is always accessed under the pool lock.

I see a lot of list_empty() checks while not holding pool lock, but I get
what you mean: that work visible on the worklist isn't actually inspected
without holding the pool lock.


> The barrier comes
> into play only because we're using bare PENDING bit for
> synchronization.

I see that it's only transitions of PENDING that need full barriers, but
it's not clear from the semantics under which conditions PENDING is
already set. If performance isn't a concern there, maybe those uses
(where PENDING is expected to already be set) should check and warn?
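
A hypothetical form of that check, dropped in wherever PENDING is expected
to already be set, might be no more than:

	/* hypothetical sanity check at a site that assumes PENDING is set */
	WARN_ON_ONCE(!test_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work)));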


> I'm not necessarily against making all clearings of
> PENDING be followed by an rmb or use xchg.  Reasons 2-4 are pretty
> weak tho.

I agree 2 and 3 are not the best reasons.
Actually, it looks like I'm in the minority anyway, and that, style-wise,
a naked barrier is preferred.

Regards,
Peter Hurley
Tejun Heo April 26, 2016, 5:45 p.m. UTC | #10
Hello, Peter.

On Tue, Apr 26, 2016 at 10:27:59AM -0700, Peter Hurley wrote:
> > It's unlikely to make any measurable difference.  Is xchg() actually
> > cheaper than store + rmb?
> 
> store + mfence (full barrier), yes. Roughly 2x faster.
> 
> https://lkml.org/lkml/2015/11/2/607

Ah, didn't know that.  Thanks for the pointer.

> > I'm not necessarily against making all clearings of
> > PENDING be followed by an rmb or use xchg.  Reasons 2-4 are pretty
> > weak tho.
> 
> I agree 2 and 3 are not the best reasons.
> Actually, it looks like I'm in the minority anyway, and that, style-wise,
> a naked barrier is preferred.

As long as what's happening is clearly documented, I think either is
fine.  I'm gonna go with Roman's mb patch for -stable fix but think
it'd be nice to have a separate patch to consolidate the paths which
clear PENDING and make them use xchg.  If you can spin up a patch for
that, I'd be happy to apply it to wq/for-3.7.

Thanks.
Peter Hurley April 26, 2016, 8:07 p.m. UTC | #11
On 04/26/2016 10:45 AM, Tejun Heo wrote:
> As long as what's happening is clearly documented, I think either is
> fine.  I'm gonna go with Roman's mb patch for -stable fix but think
> it'd be nice to have a separate patch to consolidate the paths which
> clear PENDING and make them use xchg.  If you can spin up a patch for
> that, I'd be happy to apply it to wq/for-3.7.

Ok.

Hannes Reinecke April 27, 2016, 5:50 a.m. UTC | #12
On 04/26/2016 07:45 PM, Tejun Heo wrote:
> Hello, Peter.
> 
> On Tue, Apr 26, 2016 at 10:27:59AM -0700, Peter Hurley wrote:
>>> It's unlikely to make any measurable difference.  Is xchg() actually
>>> cheaper than store + rmb?
>>
>> store + mfence (full barrier), yes. Roughly 2x faster.
>>
>> https://lkml.org/lkml/2015/11/2/607
> 
> Ah, didn't know that.  Thanks for the pointer.
> 
>>> I'm not necessarily against making all clearings of
>>> PENDING be followed by an rmb or use xchg.  Reasons 2-4 are pretty
>>> weak tho.
>>
>> I agree 2 and 3 are not the best reasons.
>> Actually, it looks like I'm in the minority anyway, and that, style-wise,
>> a naked barrier is preferred.
> 
> As long as what's happening is clearly documented, I think either is
> fine.  I'm gonna go with Roman's mb patch for -stable fix but think
> it'd be nice to have a separate patch to consolidate the paths which
> clear PENDING and make them use xchg.  If you can spin up a patch for
> that, I'd be happy to apply it to wq/for-3.7.
>                                          ^^^
Ah. Time warp.
I knew it would happen eventually :-)

Cheers,

Hannes
Tejun Heo April 27, 2016, 7:05 p.m. UTC | #13
On Wed, Apr 27, 2016 at 1:50 AM, Hannes Reinecke <hare@suse.de> wrote:
>> that, I'd be happy to apply it to wq/for-3.7.
>>                                          ^^^
> Ah. Time warp.
> I knew it would happen eventually :-)

lol wrong tree too. Sorry about that.

Patch

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 2232ae3..8ec2b5e 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -666,6 +666,7 @@  static void set_work_pool_and_clear_pending(struct work_struct *work,
 	 */
 	smp_wmb();
 	set_work_data(work, (unsigned long)pool_id << WORK_OFFQ_POOL_SHIFT, 0);
+	smp_mb();
 }
 
 static void clear_work_data(struct work_struct *work)