
block: fix a crash in do_task_dead()

Message ID 1559161526-618-1-git-send-email-cai@lca.pw (mailing list archive)
State New, archived
Series block: fix a crash in do_task_dead()

Commit Message

Qian Cai May 29, 2019, 8:25 p.m. UTC
Commit 0619317ff8ba ("block: add polled wakeup task helper")
replaced wake_up_process() with blk_wake_io_task() in
end_swap_bio_read(), which triggers a crash when running heavy swapping
workloads.

[T114538] kernel BUG at kernel/sched/core.c:3462!
[T114538] Process oom01 (pid: 114538, stack limit = 0x000000004f40e0c1)
[T114538] Call trace:
[T114538]  do_task_dead+0xf0/0xf8
[T114538]  do_exit+0xd5c/0x10fc
[T114538]  do_group_exit+0xf4/0x110
[T114538]  get_signal+0x280/0xdd8
[T114538]  do_notify_resume+0x720/0x968
[T114538]  work_pending+0x8/0x10

This is because, shortly after set_special_state(TASK_DEAD),
end_swap_bio_read() is called from an interrupt handler and revives the
task state to TASK_RUNNING, which causes __schedule() to return and trip
the BUG() later.

[  C206] Call trace:
[  C206]  dump_backtrace+0x0/0x268
[  C206]  show_stack+0x20/0x2c
[  C206]  dump_stack+0xb4/0x108
[  C206]  blk_wake_io_task+0x7c/0x80
[  C206]  end_swap_bio_read+0x22c/0x31c
[  C206]  bio_endio+0x3d8/0x414
[  C206]  dec_pending+0x280/0x378 [dm_mod]
[  C206]  clone_endio+0x128/0x2ac [dm_mod]
[  C206]  bio_endio+0x3d8/0x414
[  C206]  blk_update_request+0x3ac/0x924
[  C206]  scsi_end_request+0x54/0x350
[  C206]  scsi_io_completion+0xf0/0x6f4
[  C206]  scsi_finish_command+0x214/0x228
[  C206]  scsi_softirq_done+0x170/0x1a4
[  C206]  blk_done_softirq+0x100/0x194
[  C206]  __do_softirq+0x350/0x790
[  C206]  irq_exit+0x200/0x26c
[  C206]  handle_IPI+0x2e8/0x514
[  C206]  gic_handle_irq+0x224/0x228
[  C206]  el1_irq+0xb8/0x140
[  C206]  _raw_spin_unlock_irqrestore+0x3c/0x74
[  C206]  do_task_dead+0x88/0xf8
[  C206]  do_exit+0xd5c/0x10fc
[  C206]  do_group_exit+0xf4/0x110
[  C206]  get_signal+0x280/0xdd8
[  C206]  do_notify_resume+0x720/0x968
[  C206]  work_pending+0x8/0x10
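
For reference, do_task_dead() expects __schedule() never to return once
TASK_DEAD has been set; roughly (a simplified sketch of the
kernel/sched/core.c path, details vary by kernel version):

void __noreturn do_task_dead(void)
{
	/* Causes the final put_task_struct() in finish_task_switch(): */
	set_special_state(TASK_DEAD);

	/* Tell the freezer to ignore us: */
	current->flags |= PF_NOFREEZE;

	__schedule(false);

	/* Only reached if something wrongly set us back to TASK_RUNNING: */
	BUG();
}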

Before the offending commit, wake_up_process() prevented this from
happening by taking pi_lock and bailing out immediately if TASK_DEAD is
set:

if (!(p->state & TASK_NORMAL))
	goto out;

Fix it by falling back to wake_up_process() when blk_wake_io_task() is
called from a non-task (interrupt) context.
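
With that change the helper reads roughly as follows (a sketch of the
patched include/linux/blkdev.h helper; the actual diff is at the end of
this thread):

static inline void blk_wake_io_task(struct task_struct *waiter)
{
	/*
	 * When polling, the waiter is the current task on this CPU, so
	 * marking it RUNNING is enough -- but only from task context.
	 * From interrupt context, fall back to wake_up_process(), which
	 * bails out when the task state is TASK_DEAD.
	 */
	if (waiter == current && in_task())
		__set_current_state(TASK_RUNNING);
	else
		wake_up_process(waiter);
}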

Fixes: 0619317ff8ba ("block: add polled wakeup task helper")
Signed-off-by: Qian Cai <cai@lca.pw>
---
 include/linux/blkdev.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Jens Axboe May 29, 2019, 8:31 p.m. UTC | #1
On 5/29/19 2:25 PM, Qian Cai wrote:
> The commit 0619317ff8ba ("block: add polled wakeup task helper")
> replaced wake_up_process() with blk_wake_io_task() in
> end_swap_bio_read() which triggers a crash when running heavy swapping
> workloads.
> 
> [T114538] kernel BUG at kernel/sched/core.c:3462!
> [T114538] Process oom01 (pid: 114538, stack limit = 0x000000004f40e0c1)
> [T114538] Call trace:
> [T114538]  do_task_dead+0xf0/0xf8
> [T114538]  do_exit+0xd5c/0x10fc
> [T114538]  do_group_exit+0xf4/0x110
> [T114538]  get_signal+0x280/0xdd8
> [T114538]  do_notify_resume+0x720/0x968
> [T114538]  work_pending+0x8/0x10
> 
> This is because shortly after set_special_state(TASK_DEAD),
> end_swap_bio_read() is called from an interrupt handler that revive the
> task state to TASK_RUNNING causes __schedule() to return and trip the
> BUG() later.
> 
> [  C206] Call trace:
> [  C206]  dump_backtrace+0x0/0x268
> [  C206]  show_stack+0x20/0x2c
> [  C206]  dump_stack+0xb4/0x108
> [  C206]  blk_wake_io_task+0x7c/0x80
> [  C206]  end_swap_bio_read+0x22c/0x31c
> [  C206]  bio_endio+0x3d8/0x414
> [  C206]  dec_pending+0x280/0x378 [dm_mod]
> [  C206]  clone_endio+0x128/0x2ac [dm_mod]
> [  C206]  bio_endio+0x3d8/0x414
> [  C206]  blk_update_request+0x3ac/0x924
> [  C206]  scsi_end_request+0x54/0x350
> [  C206]  scsi_io_completion+0xf0/0x6f4
> [  C206]  scsi_finish_command+0x214/0x228
> [  C206]  scsi_softirq_done+0x170/0x1a4
> [  C206]  blk_done_softirq+0x100/0x194
> [  C206]  __do_softirq+0x350/0x790
> [  C206]  irq_exit+0x200/0x26c
> [  C206]  handle_IPI+0x2e8/0x514
> [  C206]  gic_handle_irq+0x224/0x228
> [  C206]  el1_irq+0xb8/0x140
> [  C206]  _raw_spin_unlock_irqrestore+0x3c/0x74
> [  C206]  do_task_dead+0x88/0xf8
> [  C206]  do_exit+0xd5c/0x10fc
> [  C206]  do_group_exit+0xf4/0x110
> [  C206]  get_signal+0x280/0xdd8
> [  C206]  do_notify_resume+0x720/0x968
> [  C206]  work_pending+0x8/0x10
> 
> Before the offensive commit, wake_up_process() will prevent this from
> happening by taking the pi_lock and bail out immediately if TASK_DEAD is
> set.
> 
> if (!(p->state & TASK_NORMAL))
> 	goto out;
> 
> Fix it by calling wake_up_process() if it is in a non-task context.

I like this one a lot better than the previous fix. Unless folks
object, I'll queue this up for 5.2, thanks.
Peter Zijlstra May 30, 2019, 8:03 a.m. UTC | #2
On Wed, May 29, 2019 at 04:25:26PM -0400, Qian Cai wrote:

> Fixes: 0619317ff8ba ("block: add polled wakeup task helper")

What is the purpose of that patch ?! The Changelog doesn't mention any
benefit or performance gain. So why not revert that?

> Signed-off-by: Qian Cai <cai@lca.pw>
> ---
>  include/linux/blkdev.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 592669bcc536..290eb7528f54 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -1803,7 +1803,7 @@ static inline void blk_wake_io_task(struct task_struct *waiter)
>  	 * that case, we don't need to signal a wakeup, it's enough to just
>  	 * mark us as RUNNING.
>  	 */
> -	if (waiter == current)
> +	if (waiter == current && in_task())
>  		__set_current_state(TASK_RUNNING);

NAK, No that's broken too.

The right fix is something like:

	if (waiter == current) {
		barrier();
		if (current->state & TASK_NORAL)
			__set_current_state(TASK_RUNNING);
	}

But even that is yuck to do outside of the scheduler code, as it loses
tracepoints and stats.

So can we please just revert that original patch and start over -- if
needed?

>  	else
>  		wake_up_process(waiter);
> -- 
> 1.8.3.1
>
Oleg Nesterov May 30, 2019, 11:15 a.m. UTC | #3
On 05/29, Qian Cai wrote:
>
> The commit 0619317ff8ba ("block: add polled wakeup task helper")
> replaced wake_up_process() with blk_wake_io_task() in
> end_swap_bio_read() which triggers a crash when running heavy swapping
> workloads.
> 
> [T114538] kernel BUG at kernel/sched/core.c:3462!
> [T114538] Process oom01 (pid: 114538, stack limit = 0x000000004f40e0c1)
> [T114538] Call trace:
> [T114538]  do_task_dead+0xf0/0xf8
> [T114538]  do_exit+0xd5c/0x10fc
> [T114538]  do_group_exit+0xf4/0x110
> [T114538]  get_signal+0x280/0xdd8
> [T114538]  do_notify_resume+0x720/0x968
> [T114538]  work_pending+0x8/0x10
> 
> This is because shortly after set_special_state(TASK_DEAD),
> end_swap_bio_read() is called from an interrupt handler that revive the
> task state to TASK_RUNNING causes __schedule() to return and trip the
> BUG() later.
> 
> [  C206] Call trace:
> [  C206]  dump_backtrace+0x0/0x268
> [  C206]  show_stack+0x20/0x2c
> [  C206]  dump_stack+0xb4/0x108
> [  C206]  blk_wake_io_task+0x7c/0x80
> [  C206]  end_swap_bio_read+0x22c/0x31c
> [  C206]  bio_endio+0x3d8/0x414
> [  C206]  dec_pending+0x280/0x378 [dm_mod]
> [  C206]  clone_endio+0x128/0x2ac [dm_mod]
> [  C206]  bio_endio+0x3d8/0x414
> [  C206]  blk_update_request+0x3ac/0x924
> [  C206]  scsi_end_request+0x54/0x350
> [  C206]  scsi_io_completion+0xf0/0x6f4
> [  C206]  scsi_finish_command+0x214/0x228
> [  C206]  scsi_softirq_done+0x170/0x1a4
> [  C206]  blk_done_softirq+0x100/0x194
> [  C206]  __do_softirq+0x350/0x790
> [  C206]  irq_exit+0x200/0x26c
> [  C206]  handle_IPI+0x2e8/0x514
> [  C206]  gic_handle_irq+0x224/0x228
> [  C206]  el1_irq+0xb8/0x140
> [  C206]  _raw_spin_unlock_irqrestore+0x3c/0x74
> [  C206]  do_task_dead+0x88/0xf8
> [  C206]  do_exit+0xd5c/0x10fc
> [  C206]  do_group_exit+0xf4/0x110
> [  C206]  get_signal+0x280/0xdd8
> [  C206]  do_notify_resume+0x720/0x968
> [  C206]  work_pending+0x8/0x10
> 
> Before the offensive commit, wake_up_process() will prevent this from
> happening by taking the pi_lock and bail out immediately if TASK_DEAD is
> set.
> 
> if (!(p->state & TASK_NORMAL))
> 	goto out;

I don't understand this code at all but I am just curious, can we do
something like incomplete patch below ?

Oleg.

--- x/mm/page_io.c
+++ x/mm/page_io.c
@@ -140,8 +140,10 @@ int swap_readpage(struct page *page, bool synchronous)
 	unlock_page(page);
 	WRITE_ONCE(bio->bi_private, NULL);
 	bio_put(bio);
-	blk_wake_io_task(waiter);
-	put_task_struct(waiter);
+	if (waiter) {
+		blk_wake_io_task(waiter);
+		put_task_struct(waiter);
+	}
 }
 
 int generic_swapfile_activate(struct swap_info_struct *sis,
@@ -398,11 +400,12 @@ int swap_readpage(struct page *page, boo
 	 * Keep this task valid during swap readpage because the oom killer may
 	 * attempt to access it in the page fault retry time check.
 	 */
-	get_task_struct(current);
-	bio->bi_private = current;
 	bio_set_op_attrs(bio, REQ_OP_READ, 0);
-	if (synchronous)
+	if (synchronous) {
 		bio->bi_opf |= REQ_HIPRI;
+		get_task_struct(current);
+		bio->bi_private = current;
+	}
 	count_vm_event(PSWPIN);
 	bio_get(bio);
 	qc = submit_bio(bio);
Jens Axboe May 31, 2019, 9:10 p.m. UTC | #4
On 5/30/19 5:15 AM, Oleg Nesterov wrote:
> On 05/29, Qian Cai wrote:
>>
>> The commit 0619317ff8ba ("block: add polled wakeup task helper")
>> replaced wake_up_process() with blk_wake_io_task() in
>> end_swap_bio_read() which triggers a crash when running heavy swapping
>> workloads.
>>
>> [T114538] kernel BUG at kernel/sched/core.c:3462!
>> [T114538] Process oom01 (pid: 114538, stack limit = 0x000000004f40e0c1)
>> [T114538] Call trace:
>> [T114538]  do_task_dead+0xf0/0xf8
>> [T114538]  do_exit+0xd5c/0x10fc
>> [T114538]  do_group_exit+0xf4/0x110
>> [T114538]  get_signal+0x280/0xdd8
>> [T114538]  do_notify_resume+0x720/0x968
>> [T114538]  work_pending+0x8/0x10
>>
>> This is because shortly after set_special_state(TASK_DEAD),
>> end_swap_bio_read() is called from an interrupt handler that revive the
>> task state to TASK_RUNNING causes __schedule() to return and trip the
>> BUG() later.
>>
>> [  C206] Call trace:
>> [  C206]  dump_backtrace+0x0/0x268
>> [  C206]  show_stack+0x20/0x2c
>> [  C206]  dump_stack+0xb4/0x108
>> [  C206]  blk_wake_io_task+0x7c/0x80
>> [  C206]  end_swap_bio_read+0x22c/0x31c
>> [  C206]  bio_endio+0x3d8/0x414
>> [  C206]  dec_pending+0x280/0x378 [dm_mod]
>> [  C206]  clone_endio+0x128/0x2ac [dm_mod]
>> [  C206]  bio_endio+0x3d8/0x414
>> [  C206]  blk_update_request+0x3ac/0x924
>> [  C206]  scsi_end_request+0x54/0x350
>> [  C206]  scsi_io_completion+0xf0/0x6f4
>> [  C206]  scsi_finish_command+0x214/0x228
>> [  C206]  scsi_softirq_done+0x170/0x1a4
>> [  C206]  blk_done_softirq+0x100/0x194
>> [  C206]  __do_softirq+0x350/0x790
>> [  C206]  irq_exit+0x200/0x26c
>> [  C206]  handle_IPI+0x2e8/0x514
>> [  C206]  gic_handle_irq+0x224/0x228
>> [  C206]  el1_irq+0xb8/0x140
>> [  C206]  _raw_spin_unlock_irqrestore+0x3c/0x74
>> [  C206]  do_task_dead+0x88/0xf8
>> [  C206]  do_exit+0xd5c/0x10fc
>> [  C206]  do_group_exit+0xf4/0x110
>> [  C206]  get_signal+0x280/0xdd8
>> [  C206]  do_notify_resume+0x720/0x968
>> [  C206]  work_pending+0x8/0x10
>>
>> Before the offensive commit, wake_up_process() will prevent this from
>> happening by taking the pi_lock and bail out immediately if TASK_DEAD is
>> set.
>>
>> if (!(p->state & TASK_NORMAL))
>> 	goto out;
> 
> I don't understand this code at all but I am just curious, can we do
> something like incomplete patch below ?
> 
> Oleg.
> 
> --- x/mm/page_io.c
> +++ x/mm/page_io.c
> @@ -140,8 +140,10 @@ int swap_readpage(struct page *page, bool synchronous)
>   	unlock_page(page);
>   	WRITE_ONCE(bio->bi_private, NULL);
>   	bio_put(bio);
> -	blk_wake_io_task(waiter);
> -	put_task_struct(waiter);
> +	if (waiter) {
> +		blk_wake_io_task(waiter);
> +		put_task_struct(waiter);
> +	}
>   }
>   
>   int generic_swapfile_activate(struct swap_info_struct *sis,
> @@ -398,11 +400,12 @@ int swap_readpage(struct page *page, boo
>   	 * Keep this task valid during swap readpage because the oom killer may
>   	 * attempt to access it in the page fault retry time check.
>   	 */
> -	get_task_struct(current);
> -	bio->bi_private = current;
>   	bio_set_op_attrs(bio, REQ_OP_READ, 0);
> -	if (synchronous)
> +	if (synchronous) {
>   		bio->bi_opf |= REQ_HIPRI;
> +		get_task_struct(current);
> +		bio->bi_private = current;
> +	}
>   	count_vm_event(PSWPIN);
>   	bio_get(bio);
>   	qc = submit_bio(bio);

I think this would solve it for swap.
Jens Axboe May 31, 2019, 9:12 p.m. UTC | #5
On 5/30/19 2:03 AM, Peter Zijlstra wrote:
> On Wed, May 29, 2019 at 04:25:26PM -0400, Qian Cai wrote:
> 
>> Fixes: 0619317ff8ba ("block: add polled wakeup task helper")
> 
> What is the purpose of that patch ?! The Changelog doesn't mention any
> benefit or performance gain. So why not revert that?

Yeah that is actually pretty weak. There are substantial performance
gains for small IOs using this trick, the changelog should have
included those. I guess that was left on the list...

>> Signed-off-by: Qian Cai <cai@lca.pw>
>> ---
>>   include/linux/blkdev.h | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
>> index 592669bcc536..290eb7528f54 100644
>> --- a/include/linux/blkdev.h
>> +++ b/include/linux/blkdev.h
>> @@ -1803,7 +1803,7 @@ static inline void blk_wake_io_task(struct task_struct *waiter)
>>   	 * that case, we don't need to signal a wakeup, it's enough to just
>>   	 * mark us as RUNNING.
>>   	 */
>> -	if (waiter == current)
>> +	if (waiter == current && in_task())
>>   		__set_current_state(TASK_RUNNING);
> 
> NAK, No that's broken too.
> 
> The right fix is something like:
> 
> 	if (waiter == current) {
> 		barrier();
> 		if (current->state & TASK_NORAL)
> 			__set_current_state(TASK_RUNNING);
> 	}
> 
> But even that is yuck to do outside of the scheduler code, as it looses
> tracepoints and stats.
> 
> So can we please just revert that original patch and start over -- if
> needed?

How about we just use your above approach? It looks fine to me (except
the obvious typo). I'd hate to give up this gain, in the realm of
fighting against stupid kernel offload solutions we need every cycle we
can get.

I know it's not super kosher, your patch, but I don't think it's that
bad hidden in a generic helper.
Peter Zijlstra June 3, 2019, 12:37 p.m. UTC | #6
On Fri, May 31, 2019 at 03:12:13PM -0600, Jens Axboe wrote:
> On 5/30/19 2:03 AM, Peter Zijlstra wrote:

> > What is the purpose of that patch ?! The Changelog doesn't mention any
> > benefit or performance gain. So why not revert that?
> 
> Yeah that is actually pretty weak. There are substantial performance
> gains for small IOs using this trick, the changelog should have
> included those. I guess that was left on the list...

OK. I've looked at the try_to_wake_up() path for these exact
conditions and we're certainly sub-optimal there, and I think we can put
much of this special case in there. Please see below.

> I know it's not super kosher, your patch, but I don't think it's that
> bad hidden in a generic helper.

How about the thing that Oleg proposed? That is, not set a waiter when
we know the loop is polling? That would avoid the need for this
alltogether, it would also avoid any set_current_state() on the wait
side of things.

Anyway, Oleg, do you see anything blatantly buggered with this patch?

(the stats were already dodgy for rq-stats, this patch makes them dodgy
for task-stats too)

---
 kernel/sched/core.c | 38 ++++++++++++++++++++++++++++++++------
 1 file changed, 32 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 102dfcf0a29a..474aa4c8e9d2 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1990,6 +1990,28 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 	unsigned long flags;
 	int cpu, success = 0;
 
+	if (p == current) {
+		/*
+		 * We're waking current, this means 'p->on_rq' and 'task_cpu(p)
+		 * == smp_processor_id()'. Together this means we can special
+		 * case the whole 'p->on_rq && ttwu_remote()' case below
+		 * without taking any locks.
+		 *
+		 * In particular:
+		 *  - we rely on Program-Order guarantees for all the ordering,
+		 *  - we're serialized against set_special_state() by virtue of
+		 *    it disabling IRQs (this allows not taking ->pi_lock).
+		 */
+		if (!(p->state & state))
+			goto out;
+
+		success = 1;
+		trace_sched_waking(p);
+		p->state = TASK_RUNNING;
+		trace_sched_woken(p);
+		goto out;
+	}
+
 	/*
 	 * If we are going to wake up a thread waiting for CONDITION we
 	 * need to ensure that CONDITION=1 done by the caller can not be
@@ -1999,7 +2021,7 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 	raw_spin_lock_irqsave(&p->pi_lock, flags);
 	smp_mb__after_spinlock();
 	if (!(p->state & state))
-		goto out;
+		goto unlock;
 
 	trace_sched_waking(p);
 
@@ -2029,7 +2051,7 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 	 */
 	smp_rmb();
 	if (p->on_rq && ttwu_remote(p, wake_flags))
-		goto stat;
+		goto unlock;
 
 #ifdef CONFIG_SMP
 	/*
@@ -2089,12 +2111,16 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 #endif /* CONFIG_SMP */
 
 	ttwu_queue(p, cpu, wake_flags);
-stat:
-	ttwu_stat(p, cpu, wake_flags);
-out:
+unlock:
 	raw_spin_unlock_irqrestore(&p->pi_lock, flags);
 
-	return success;
+out:
+	if (success) {
+		ttwu_stat(p, cpu, wake_flags);
+		return true;
+	}
+
+	return false;
 }
 
 /**
Peter Zijlstra June 3, 2019, 12:44 p.m. UTC | #7
On Mon, Jun 03, 2019 at 02:37:05PM +0200, Peter Zijlstra wrote:

> Anyway, Oleg, do you see anything blatantly buggered with this patch?
> 
> (the stats were already dodgy for rq-stats, this patch makes them dodgy
> for task-stats too)

It now also has concurrency on wakeup; but afaict that's harmless, we'll
get racing stores of p->state = TASK_RUNNING, much the same as if there
was a remote wakeup vs a wait-loop terminating early.

I suppose the tracepoint consumers might have to deal with some
artifacts there, but that's their problem.

> ---
>  kernel/sched/core.c | 38 ++++++++++++++++++++++++++++++++------
>  1 file changed, 32 insertions(+), 6 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 102dfcf0a29a..474aa4c8e9d2 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1990,6 +1990,28 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
>  	unsigned long flags;
>  	int cpu, success = 0;
>  
> +	if (p == current) {
> +		/*
> +		 * We're waking current, this means 'p->on_rq' and 'task_cpu(p)
> +		 * == smp_processor_id()'. Together this means we can special
> +		 * case the whole 'p->on_rq && ttwu_remote()' case below
> +		 * without taking any locks.
> +		 *
> +		 * In particular:
> +		 *  - we rely on Program-Order guarantees for all the ordering,
> +		 *  - we're serialized against set_special_state() by virtue of
> +		 *    it disabling IRQs (this allows not taking ->pi_lock).
> +		 */
> +		if (!(p->state & state))
> +			goto out;
> +
> +		success = 1;
> +		trace_sched_waking(p);
> +		p->state = TASK_RUNNING;
> +		trace_sched_woken(p);
> +		goto out;
> +	}
> +
>  	/*
>  	 * If we are going to wake up a thread waiting for CONDITION we
>  	 * need to ensure that CONDITION=1 done by the caller can not be
> @@ -1999,7 +2021,7 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
>  	raw_spin_lock_irqsave(&p->pi_lock, flags);
>  	smp_mb__after_spinlock();
>  	if (!(p->state & state))
> -		goto out;
> +		goto unlock;
>  
>  	trace_sched_waking(p);
>  
> @@ -2029,7 +2051,7 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
>  	 */
>  	smp_rmb();
>  	if (p->on_rq && ttwu_remote(p, wake_flags))
> -		goto stat;
> +		goto unlock;
>  
>  #ifdef CONFIG_SMP
>  	/*
> @@ -2089,12 +2111,16 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
>  #endif /* CONFIG_SMP */
>  
>  	ttwu_queue(p, cpu, wake_flags);
> -stat:
> -	ttwu_stat(p, cpu, wake_flags);
> -out:
> +unlock:
>  	raw_spin_unlock_irqrestore(&p->pi_lock, flags);
>  
> -	return success;
> +out:
> +	if (success) {
> +		ttwu_stat(p, cpu, wake_flags);
> +		return true;
> +	}
> +
> +	return false;
>  }
>  
>  /**
Oleg Nesterov June 3, 2019, 4:09 p.m. UTC | #8
On 06/03, Peter Zijlstra wrote:
>
> It now also has concurrency on wakeup; but afaict that's harmless, we'll
> get racing stores of p->state = TASK_RUNNING, much the same as if there
> was a remote wakeup vs a wait-loop terminating early.
>
> I suppose the tracepoint consumers might have to deal with some
> artifacts there, but that's their problem.

I guess you mean that trace_sched_waking/wakeup can be reported twice if
try_to_wake_up(current) races with ttwu_remote(). And ttwu_stat().

> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -1990,6 +1990,28 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
> >  	unsigned long flags;
> >  	int cpu, success = 0;
> >  
> > +	if (p == current) {
> > +		/*
> > +		 * We're waking current, this means 'p->on_rq' and 'task_cpu(p)
> > +		 * == smp_processor_id()'. Together this means we can special
> > +		 * case the whole 'p->on_rq && ttwu_remote()' case below
> > +		 * without taking any locks.
> > +		 *
> > +		 * In particular:
> > +		 *  - we rely on Program-Order guarantees for all the ordering,
> > +		 *  - we're serialized against set_special_state() by virtue of
> > +		 *    it disabling IRQs (this allows not taking ->pi_lock).
> > +		 */
> > +		if (!(p->state & state))
> > +			goto out;
> > +
> > +		success = 1;
> > +		trace_sched_waking(p);
> > +		p->state = TASK_RUNNING;
> > +		trace_sched_woken(p);
                ^^^^^^^^^^^^^^^^^
trace_sched_wakeup(p) ?

I see nothing wrong... but probably this is because I don't fully understand
this change. In particular, I don't really understand who else can benefit from
this optimization...

Oleg.
Peter Zijlstra June 3, 2019, 4:19 p.m. UTC | #9
On Mon, Jun 03, 2019 at 06:09:53PM +0200, Oleg Nesterov wrote:
> On 06/03, Peter Zijlstra wrote:
> >
> > It now also has concurrency on wakeup; but afaict that's harmless, we'll
> > get racing stores of p->state = TASK_RUNNING, much the same as if there
> > was a remote wakeup vs a wait-loop terminating early.
> >
> > I suppose the tracepoint consumers might have to deal with some
> > artifacts there, but that's their problem.
> 
> I guess you mean that trace_sched_waking/wakeup can be reported twice if
> try_to_wake_up(current) races with ttwu_remote(). And ttwu_stat().

Right, one local one remote, and you get them things twice.

> > > --- a/kernel/sched/core.c
> > > +++ b/kernel/sched/core.c
> > > @@ -1990,6 +1990,28 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
> > >  	unsigned long flags;
> > >  	int cpu, success = 0;
> > >  
> > > +	if (p == current) {
> > > +		/*
> > > +		 * We're waking current, this means 'p->on_rq' and 'task_cpu(p)
> > > +		 * == smp_processor_id()'. Together this means we can special
> > > +		 * case the whole 'p->on_rq && ttwu_remote()' case below
> > > +		 * without taking any locks.
> > > +		 *
> > > +		 * In particular:
> > > +		 *  - we rely on Program-Order guarantees for all the ordering,
> > > +		 *  - we're serialized against set_special_state() by virtue of
> > > +		 *    it disabling IRQs (this allows not taking ->pi_lock).
> > > +		 */
> > > +		if (!(p->state & state))
> > > +			goto out;
> > > +
> > > +		success = 1;
> > > +		trace_sched_waking(p);
> > > +		p->state = TASK_RUNNING;
> > > +		trace_sched_woken(p);
>                 ^^^^^^^^^^^^^^^^^
> trace_sched_wakeup(p) ?

Uhm,, yah.

> I see nothing wrong... but probably this is because I don't fully understand
> this change. In particular, I don't really understand who else can benefit from
> this optimization...

Pretty much every wait-loop, where the wakeup happens from IRQ context
on the same CPU, before we've hit schedule().

Now, I've no idea if that's many, but I much prefer to keep this magic
inside try_to_wake_up() than spreading it around.
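
For example, the synchronous swap-in path this thread is about polls for
completion like this (simplified from mm/page_io.c:swap_readpage(); an
IRQ on the same CPU can complete the bio anywhere in this loop):

	qc = submit_bio(bio);
	while (synchronous) {
		set_current_state(TASK_UNINTERRUPTIBLE);
		if (!READ_ONCE(bio->bi_private))
			break;

		if (!blk_poll(disk->queue, qc, true))
			io_schedule();
	}
	__set_current_state(TASK_RUNNING);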
Jens Axboe June 3, 2019, 4:23 p.m. UTC | #10
On 6/3/19 6:37 AM, Peter Zijlstra wrote:
> On Fri, May 31, 2019 at 03:12:13PM -0600, Jens Axboe wrote:
>> On 5/30/19 2:03 AM, Peter Zijlstra wrote:
> 
>>> What is the purpose of that patch ?! The Changelog doesn't mention any
>>> benefit or performance gain. So why not revert that?
>>
>> Yeah that is actually pretty weak. There are substantial performance
>> gains for small IOs using this trick, the changelog should have
>> included those. I guess that was left on the list...
> 
> OK. I've looked at the try_to_wake_up() path for these exact
> conditions and we're certainly sub-optimal there, and I think we can put
> much of this special case in there. Please see below.
> 
>> I know it's not super kosher, your patch, but I don't think it's that
>> bad hidden in a generic helper.
> 
> How about the thing that Oleg proposed? That is, not set a waiter when
> we know the loop is polling? That would avoid the need for this
> alltogether, it would also avoid any set_current_state() on the wait
> side of things.
> 
> Anyway, Oleg, do you see anything blatantly buggered with this patch?
> 
> (the stats were already dodgy for rq-stats, this patch makes them dodgy
> for task-stats too)
> 
> ---
>   kernel/sched/core.c | 38 ++++++++++++++++++++++++++++++++------
>   1 file changed, 32 insertions(+), 6 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 102dfcf0a29a..474aa4c8e9d2 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1990,6 +1990,28 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
>   	unsigned long flags;
>   	int cpu, success = 0;
>   
> +	if (p == current) {
> +		/*
> +		 * We're waking current, this means 'p->on_rq' and 'task_cpu(p)
> +		 * == smp_processor_id()'. Together this means we can special
> +		 * case the whole 'p->on_rq && ttwu_remote()' case below
> +		 * without taking any locks.
> +		 *
> +		 * In particular:
> +		 *  - we rely on Program-Order guarantees for all the ordering,
> +		 *  - we're serialized against set_special_state() by virtue of
> +		 *    it disabling IRQs (this allows not taking ->pi_lock).
> +		 */
> +		if (!(p->state & state))
> +			goto out;
> +
> +		success = 1;
> +		trace_sched_waking(p);
> +		p->state = TASK_RUNNING;
> +		trace_sched_woken(p);
> +		goto out;
> +	}
> +
>   	/*
>   	 * If we are going to wake up a thread waiting for CONDITION we
>   	 * need to ensure that CONDITION=1 done by the caller can not be
> @@ -1999,7 +2021,7 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
>   	raw_spin_lock_irqsave(&p->pi_lock, flags);
>   	smp_mb__after_spinlock();
>   	if (!(p->state & state))
> -		goto out;
> +		goto unlock;
>   
>   	trace_sched_waking(p);
>   
> @@ -2029,7 +2051,7 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
>   	 */
>   	smp_rmb();
>   	if (p->on_rq && ttwu_remote(p, wake_flags))
> -		goto stat;
> +		goto unlock;
>   
>   #ifdef CONFIG_SMP
>   	/*
> @@ -2089,12 +2111,16 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
>   #endif /* CONFIG_SMP */
>   
>   	ttwu_queue(p, cpu, wake_flags);
> -stat:
> -	ttwu_stat(p, cpu, wake_flags);
> -out:
> +unlock:
>   	raw_spin_unlock_irqrestore(&p->pi_lock, flags);
>   
> -	return success;
> +out:
> +	if (success) {
> +		ttwu_stat(p, cpu, wake_flags);
> +		return true;
> +	}
> +
> +	return false;
>   }
>   
>   /**

Let me run some tests with this vs mainline vs blk wakeup hack removed.
Jens Axboe June 5, 2019, 3:04 p.m. UTC | #11
On 6/3/19 6:37 AM, Peter Zijlstra wrote:
> On Fri, May 31, 2019 at 03:12:13PM -0600, Jens Axboe wrote:
>> On 5/30/19 2:03 AM, Peter Zijlstra wrote:
> 
>>> What is the purpose of that patch ?! The Changelog doesn't mention any
>>> benefit or performance gain. So why not revert that?
>>
>> Yeah that is actually pretty weak. There are substantial performance
>> gains for small IOs using this trick, the changelog should have
>> included those. I guess that was left on the list...
> 
> OK. I've looked at the try_to_wake_up() path for these exact
> conditions and we're certainly sub-optimal there, and I think we can put
> much of this special case in there. Please see below.
> 
>> I know it's not super kosher, your patch, but I don't think it's that
>> bad hidden in a generic helper.
> 
> How about the thing that Oleg proposed? That is, not set a waiter when
> we know the loop is polling? That would avoid the need for this
> alltogether, it would also avoid any set_current_state() on the wait
> side of things.
> 
> Anyway, Oleg, do you see anything blatantly buggered with this patch?
> 
> (the stats were already dodgy for rq-stats, this patch makes them dodgy
> for task-stats too)

Tested this patch, looks good to me. Made the trace change to make it
compile, and also moved the cpu = task_cpu() assignment earlier to
avoid uninitialized use of that variable.
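
(Roughly, those adjustments look like this in the new fast path -- a
sketch, not the exact diff:)

	if (p == current) {
		if (!(p->state & state))
			goto out;

		success = 1;
		cpu = task_cpu(p);	/* set before the out: path uses it in ttwu_stat() */
		trace_sched_waking(p);
		p->state = TASK_RUNNING;
		trace_sched_wakeup(p);
		goto out;
	}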

How about the following plan - if folks are happy with this sched patch,
we can queue it up for 5.3. Once that is in, I'll kill the block change
that special cases the polled task wakeup. For 5.2, we go with Oleg's
patch for the swap case.
Peter Zijlstra June 7, 2019, 1:35 p.m. UTC | #12
On Wed, Jun 05, 2019 at 09:04:02AM -0600, Jens Axboe wrote:
> How about the following plan - if folks are happy with this sched patch,
> we can queue it up for 5.3. Once that is in, I'll kill the block change
> that special cases the polled task wakeup. For 5.2, we go with Oleg's
> patch for the swap case.

OK, works for me. I'll go write a proper patch.

Thanks!
Peter Zijlstra June 7, 2019, 2:23 p.m. UTC | #13
On Fri, Jun 07, 2019 at 03:35:41PM +0200, Peter Zijlstra wrote:
> On Wed, Jun 05, 2019 at 09:04:02AM -0600, Jens Axboe wrote:
> > How about the following plan - if folks are happy with this sched patch,
> > we can queue it up for 5.3. Once that is in, I'll kill the block change
> > that special cases the polled task wakeup. For 5.2, we go with Oleg's
> > patch for the swap case.
> 
> OK, works for me. I'll go write a proper patch.

I now have the below; I'll queue that after the long weekend and let
0-day chew on it for a while and then push it out to tip or something.


---
Subject: sched: Optimize try_to_wake_up() for local wakeups
From: Peter Zijlstra <peterz@infradead.org>
Date: Fri Jun 7 15:39:49 CEST 2019

Jens reported that significant performance can be had on some block
workloads (XXX numbers?) by special casing local wakeups. That is,
wakeups on the current task before it schedules out. Given something
like the normal wait pattern:

	for (;;) {
		set_current_state(TASK_UNINTERRUPTIBLE);

		if (cond)
			break;

		schedule();
	}
	__set_current_state(TASK_RUNNING);

Any wakeup (on this CPU) after set_current_state() and before
schedule() would benefit from this.

Normal wakeups take p->pi_lock, which serializes wakeups to the same
task. By eliding that we gain concurrency on:

 - ttwu_stat(); we already had concurrency on rq stats, this now also
   brings it to task stats. -ENOCARE

 - tracepoints; it is now possible to get multiple instances of
   trace_sched_waking() (and possibly trace_sched_wakeup()) for the
   same task. Tracers will have to learn to cope.

Furthermore, p->pi_lock is used by set_special_state(), to order
against TASK_RUNNING stores from other CPUs. But since this is
strictly CPU local, we don't need the lock, and set_special_state()'s
disabling of IRQs is sufficient.

After the normal wakeup takes p->pi_lock it issues
smp_mb__after_spinlock(), in order to ensure the woken task must
observe prior stores before we observe the p->state. If this is CPU
local, this will be satisfied with a compiler barrier, and we rely on
try_to_wake_up() being a function call, which implies such.

Since, when 'p == current', 'p->on_rq' must be true, the normal wakeup
would continue into the ttwu_remote() branch, which normally is
concerned with exactly this wakeup scenario, except from a remote CPU.
IOW we're waking a task that is still running. In this case, we can
trivially avoid taking rq->lock, all that's left from this is to set
p->state.

This then yields an extremely simple and fast path for 'p == current'.

Cc: Qian Cai <cai@lca.pw>
Cc: mingo@redhat.com
Cc: akpm@linux-foundation.org
Cc: hch@lst.de
Cc: gkohli@codeaurora.org
Cc: oleg@redhat.com
Reported-by: Jens Axboe <axboe@kernel.dk>
Tested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/core.c |   33 ++++++++++++++++++++++++++++-----
 1 file changed, 28 insertions(+), 5 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1991,6 +1991,28 @@ try_to_wake_up(struct task_struct *p, un
 	unsigned long flags;
 	int cpu, success = 0;
 
+	if (p == current) {
+		/*
+		 * We're waking current, this means 'p->on_rq' and 'task_cpu(p)
+		 * == smp_processor_id()'. Together this means we can special
+		 * case the whole 'p->on_rq && ttwu_remote()' case below
+		 * without taking any locks.
+		 *
+		 * In particular:
+		 *  - we rely on Program-Order guarantees for all the ordering,
+		 *  - we're serialized against set_special_state() by virtue of
+		 *    it disabling IRQs (this allows not taking ->pi_lock).
+		 */
+		if (!(p->state & state))
+			return false;
+
+		success = 1;
+		trace_sched_waking(p);
+		p->state = TASK_RUNNING;
+		trace_sched_wakeup(p);
+		goto out;
+	}
+
 	/*
 	 * If we are going to wake up a thread waiting for CONDITION we
 	 * need to ensure that CONDITION=1 done by the caller can not be
@@ -2000,7 +2022,7 @@ try_to_wake_up(struct task_struct *p, un
 	raw_spin_lock_irqsave(&p->pi_lock, flags);
 	smp_mb__after_spinlock();
 	if (!(p->state & state))
-		goto out;
+		goto unlock;
 
 	trace_sched_waking(p);
 
@@ -2030,7 +2052,7 @@ try_to_wake_up(struct task_struct *p, un
 	 */
 	smp_rmb();
 	if (p->on_rq && ttwu_remote(p, wake_flags))
-		goto stat;
+		goto unlock;
 
 #ifdef CONFIG_SMP
 	/*
@@ -2090,10 +2112,11 @@ try_to_wake_up(struct task_struct *p, un
 #endif /* CONFIG_SMP */
 
 	ttwu_queue(p, cpu, wake_flags);
-stat:
-	ttwu_stat(p, cpu, wake_flags);
-out:
+unlock:
 	raw_spin_unlock_irqrestore(&p->pi_lock, flags);
+out:
+	if (success)
+		ttwu_stat(p, cpu, wake_flags);
 
 	return success;
 }
Jens Axboe June 8, 2019, 8:39 a.m. UTC | #14
On 6/7/19 8:23 AM, Peter Zijlstra wrote:
> On Fri, Jun 07, 2019 at 03:35:41PM +0200, Peter Zijlstra wrote:
>> On Wed, Jun 05, 2019 at 09:04:02AM -0600, Jens Axboe wrote:
>>> How about the following plan - if folks are happy with this sched patch,
>>> we can queue it up for 5.3. Once that is in, I'll kill the block change
>>> that special cases the polled task wakeup. For 5.2, we go with Oleg's
>>> patch for the swap case.
>>
>> OK, works for me. I'll go write a proper patch.
> 
> I now have the below; I'll queue that after the long weekend and let
> 0-day chew on it for a while and then push it out to tip or something.

LGTM, thanks Peter!
Gaurav Kohli June 10, 2019, 1:13 p.m. UTC | #15
On 6/7/2019 7:53 PM, Peter Zijlstra wrote:
> On Fri, Jun 07, 2019 at 03:35:41PM +0200, Peter Zijlstra wrote:
>> On Wed, Jun 05, 2019 at 09:04:02AM -0600, Jens Axboe wrote:
>>> How about the following plan - if folks are happy with this sched patch,
>>> we can queue it up for 5.3. Once that is in, I'll kill the block change
>>> that special cases the polled task wakeup. For 5.2, we go with Oleg's
>>> patch for the swap case.
>>
>> OK, works for me. I'll go write a proper patch.
> 
> I now have the below; I'll queue that after the long weekend and let
> 0-day chew on it for a while and then push it out to tip or something.
> 
> 
> ---
> Subject: sched: Optimize try_to_wake_up() for local wakeups
> From: Peter Zijlstra <peterz@infradead.org>
> Date: Fri Jun 7 15:39:49 CEST 2019
> 
> Jens reported that significant performance can be had on some block
> workloads (XXX numbers?) by special casing local wakeups. That is,
> wakeups on the current task before it schedules out. Given something
> like the normal wait pattern:
> 
> 	for (;;) {
> 		set_current_state(TASK_UNINTERRUPTIBLE);
> 
> 		if (cond)
> 			break;
> 
> 		schedule();
> 	}
> 	__set_current_state(TASK_RUNNING);
> 
> Any wakeup (on this CPU) after set_current_state() and before
> schedule() would benefit from this.
> 
> Normal wakeups take p->pi_lock, which serializes wakeups to the same
> task. By eliding that we gain concurrency on:
> 
>   - ttwu_stat(); we already had concurrency on rq stats, this now also
>     brings it to task stats. -ENOCARE
> 
>   - tracepoints; it is now possible to get multiple instances of
>     trace_sched_waking() (and possibly trace_sched_wakeup()) for the
>     same task. Tracers will have to learn to cope.
> 
> Furthermore, p->pi_lock is used by set_special_state(), to order
> against TASK_RUNNING stores from other CPUs. But since this is
> strictly CPU local, we don't need the lock, and set_special_state()'s
> disabling of IRQs is sufficient.
> 
> After the normal wakeup takes p->pi_lock it issues
> smp_mb__after_spinlock(), in order to ensure the woken task must
> observe prior stores before we observe the p->state. If this is CPU
> local, this will be satisfied with a compiler barrier, and we rely on
> try_to_wake_up() being a funcation call, which implies such.
> 
> Since, when 'p == current', 'p->on_rq' must be true, the normal wakeup
> would continue into the ttwu_remote() branch, which normally is
> concerned with exactly this wakeup scenario, except from a remote CPU.
> IOW we're waking a task that is still running. In this case, we can
> trivially avoid taking rq->lock, all that's left from this is to set
> p->state.
> 
> This then yields an extremely simple and fast path for 'p == current'.
> 
> Cc: Qian Cai <cai@lca.pw>
> Cc: mingo@redhat.com
> Cc: akpm@linux-foundation.org
> Cc: hch@lst.de
> Cc: gkohli@codeaurora.org
> Cc: oleg@redhat.com
> Reported-by: Jens Axboe <axboe@kernel.dk>
> Tested-by: Jens Axboe <axboe@kernel.dk>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>   kernel/sched/core.c |   33 ++++++++++++++++++++++++++++-----
>   1 file changed, 28 insertions(+), 5 deletions(-)
> 
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1991,6 +1991,28 @@ try_to_wake_up(struct task_struct *p, un
>   	unsigned long flags;
>   	int cpu, success = 0;
>   
> +	if (p == current) {
> +		/*
> +		 * We're waking current, this means 'p->on_rq' and 'task_cpu(p)
> +		 * == smp_processor_id()'. Together this means we can special
> +		 * case the whole 'p->on_rq && ttwu_remote()' case below
> +		 * without taking any locks.
> +		 *
> +		 * In particular:
> +		 *  - we rely on Program-Order guarantees for all the ordering,
> +		 *  - we're serialized against set_special_state() by virtue of
> +		 *    it disabling IRQs (this allows not taking ->pi_lock).
> +		 */
> +		if (!(p->state & state))
> +			return false;
> +

Hi Peter, Jens,

As we are not taking pi_lock here, is there a possibility that the same
task-dead path runs for the current thread at this point in time? That
is how we saw the earlier issue after commit 0619317ff8ba:
[T114538]  do_task_dead+0xf0/0xf8
[T114538]  do_exit+0xd5c/0x10fc
[T114538]  do_group_exit+0xf4/0x110
[T114538]  get_signal+0x280/0xdd8
[T114538]  do_notify_resume+0x720/0x968
[T114538]  work_pending+0x8/0x10

Is there a chance of TASK_DEAD being set at this point in time?


> +		success = 1;
> +		trace_sched_waking(p);
> +		p->state = TASK_RUNNING;
> +		trace_sched_wakeup(p);
> +		goto out;
> +	}
> +
>   	/*
>   	 * If we are going to wake up a thread waiting for CONDITION we
>   	 * need to ensure that CONDITION=1 done by the caller can not be
> @@ -2000,7 +2022,7 @@ try_to_wake_up(struct task_struct *p, un
>   	raw_spin_lock_irqsave(&p->pi_lock, flags);
>   	smp_mb__after_spinlock();
>   	if (!(p->state & state))
> -		goto out;
> +		goto unlock;
>   
>   	trace_sched_waking(p);
>   
> @@ -2030,7 +2052,7 @@ try_to_wake_up(struct task_struct *p, un
>   	 */
>   	smp_rmb();
>   	if (p->on_rq && ttwu_remote(p, wake_flags))
> -		goto stat;
> +		goto unlock;
>   
>   #ifdef CONFIG_SMP
>   	/*
> @@ -2090,10 +2112,11 @@ try_to_wake_up(struct task_struct *p, un
>   #endif /* CONFIG_SMP */
>   
>   	ttwu_queue(p, cpu, wake_flags);
> -stat:
> -	ttwu_stat(p, cpu, wake_flags);
> -out:
> +unlock:
>   	raw_spin_unlock_irqrestore(&p->pi_lock, flags);
> +out:
> +	if (success)
> +		ttwu_stat(p, cpu, wake_flags);
>   
>   	return success;
>   }
>
Oleg Nesterov June 10, 2019, 2:46 p.m. UTC | #16
On 06/10, Gaurav Kohli wrote:
>
> >@@ -1991,6 +1991,28 @@ try_to_wake_up(struct task_struct *p, un
> >  	unsigned long flags;
> >  	int cpu, success = 0;
> >+	if (p == current) {
> >+		/*
> >+		 * We're waking current, this means 'p->on_rq' and 'task_cpu(p)
> >+		 * == smp_processor_id()'. Together this means we can special
> >+		 * case the whole 'p->on_rq && ttwu_remote()' case below
> >+		 * without taking any locks.
> >+		 *
> >+		 * In particular:
> >+		 *  - we rely on Program-Order guarantees for all the ordering,
> >+		 *  - we're serialized against set_special_state() by virtue of
> >+		 *    it disabling IRQs (this allows not taking ->pi_lock).
> >+		 */
> >+		if (!(p->state & state))
> >+			return false;
> >+
>
> Hi Peter, Jen,
>
> As we are not taking pi_lock here , is there possibility of same task dead
> call comes as this point of time for current thread, bcoz of which we have
> seen earlier issue after this commit 0619317ff8ba
> [T114538]  do_task_dead+0xf0/0xf8
> [T114538]  do_exit+0xd5c/0x10fc
> [T114538]  do_group_exit+0xf4/0x110
> [T114538]  get_signal+0x280/0xdd8
> [T114538]  do_notify_resume+0x720/0x968
> [T114538]  work_pending+0x8/0x10
>
> Is there a chance of TASK_DEAD set at this point of time?

In this case try_to_wake_up(current, TASK_NORMAL) will do nothing, see the
if (!(p->state & state)) above.

See also the comment about set_special_state() above. It disables irqs and
this is enough to ensure that try_to_wake_up(current) from irq can't race
with set_special_state(TASK_DEAD).
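
For reference, set_special_state() stores the state with IRQs disabled
(a simplified sketch of the include/linux/sched.h macro; the exact form
varies by version):

#define set_special_state(state_value)					\
	do {								\
		unsigned long flags; /* may shadow */			\
		raw_spin_lock_irqsave(&current->pi_lock, flags);	\
		current->state = (state_value);				\
		raw_spin_unlock_irqrestore(&current->pi_lock, flags);	\
	} while (0)

So a local interrupt cannot run try_to_wake_up(current) in the middle of
setting TASK_DEAD, and once TASK_DEAD is set, the (p->state & state)
check above rejects the wakeup.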

Oleg.
Gaurav Kohli June 11, 2019, 4:39 a.m. UTC | #17
>>> +
>>
>> Hi Peter, Jen,
>>
>> As we are not taking pi_lock here , is there possibility of same task dead
>> call comes as this point of time for current thread, bcoz of which we have
>> seen earlier issue after this commit 0619317ff8ba
>> [T114538]  do_task_dead+0xf0/0xf8
>> [T114538]  do_exit+0xd5c/0x10fc
>> [T114538]  do_group_exit+0xf4/0x110
>> [T114538]  get_signal+0x280/0xdd8
>> [T114538]  do_notify_resume+0x720/0x968
>> [T114538]  work_pending+0x8/0x10
>>
>> Is there a chance of TASK_DEAD set at this point of time?
> 
> In this case try_to_wake_up(current, TASK_NORMAL) will do nothing, see the
> if (!(p->state & state)) above.
> 
> See also the comment about set_special_state() above. It disables irqs and
> this is enough to ensure that try_to_wake_up(current) from irq can't race
> with set_special_state(TASK_DEAD).

Thanks Oleg,

I missed that part (both the thread and the interrupt are on the same
core), so that situation can never arise.
> 
> Oleg.
>
Hugh Dickins June 30, 2019, 11:06 p.m. UTC | #18
On Wed, 5 Jun 2019, Jens Axboe wrote:
> 
> How about the following plan - if folks are happy with this sched patch,
> we can queue it up for 5.3. Once that is in, I'll kill the block change
> that special cases the polled task wakeup. For 5.2, we go with Oleg's
> patch for the swap case.

I just hit the do_task_dead() kernel BUG at kernel/sched/core.c:3463!
while swapping heavily on 5.2-rc7: it looks like Oleg's patch intended
for 5.2 was not signed off, and got forgotten.

I did hit the do_task_dead() BUG (but not at all easily) on early -rcs
before seeing Oleg's patch, then folded it in and didn't hit the BUG
again; then just tried again without it, and luckily hit it in a few hours.

So I can give it an enthusiastic
Acked-by: Hugh Dickins <hughd@google.com>
because it makes good sense to avoid the get/blk_wake/put overhead on
the asynch path anyway, even if it didn't work around a bug; but only
Half-Tested-by: Hugh Dickins <hughd@google.com>
since I have not been exercising the synchronous path at all.

Hugh, requoting Oleg:

> 
> I don't understand this code at all but I am just curious, can we do
> something like incomplete patch below ?
> 
> Oleg.
> 
> --- x/mm/page_io.c
> +++ x/mm/page_io.c
> @@ -140,8 +140,10 @@ int swap_readpage(struct page *page, bool synchronous)
>  	unlock_page(page);
>  	WRITE_ONCE(bio->bi_private, NULL);
>  	bio_put(bio);
> -	blk_wake_io_task(waiter);
> -	put_task_struct(waiter);
> +	if (waiter) {
> +		blk_wake_io_task(waiter);
> +		put_task_struct(waiter);
> +	}
>  }
>  
>  int generic_swapfile_activate(struct swap_info_struct *sis,
> @@ -398,11 +400,12 @@ int swap_readpage(struct page *page, boo
>  	 * Keep this task valid during swap readpage because the oom killer may
>  	 * attempt to access it in the page fault retry time check.
>  	 */
> -	get_task_struct(current);
> -	bio->bi_private = current;
>  	bio_set_op_attrs(bio, REQ_OP_READ, 0);
> -	if (synchronous)
> +	if (synchronous) {
>  		bio->bi_opf |= REQ_HIPRI;
> +		get_task_struct(current);
> +		bio->bi_private = current;
> +	}
>  	count_vm_event(PSWPIN);
>  	bio_get(bio);
>  	qc = submit_bio(bio);
Jens Axboe July 1, 2019, 2:22 p.m. UTC | #19
On 6/30/19 5:06 PM, Hugh Dickins wrote:
> On Wed, 5 Jun 2019, Jens Axboe wrote:
>>
>> How about the following plan - if folks are happy with this sched patch,
>> we can queue it up for 5.3. Once that is in, I'll kill the block change
>> that special cases the polled task wakeup. For 5.2, we go with Oleg's
>> patch for the swap case.
> 
> I just hit the do_task_dead() kernel BUG at kernel/sched/core.c:3463!
> while heavy swapping on 5.2-rc7: it looks like Oleg's patch intended
> for 5.2 was not signed off, and got forgotten.
> 
> I did hit the do_task_dead() BUG (but not at all easily) on early -rcs
> before seeing Oleg's patch, then folded it in and and didn't hit the BUG
> again; then just tried again without it, and luckily hit in a few hours.
> 
> So I can give it an enthusiastic
> Acked-by: Hugh Dickins <hughd@google.com>
> because it makes good sense to avoid the get/blk_wake/put overhead on
> the asynch path anyway, even if it didn't work around a bug; but only
> Half-Tested-by: Hugh Dickins <hughd@google.com>
> since I have not been exercising the synchronous path at all.

I'll take the blame for that, went away on vacation for 3 weeks...
But yes, for 5.2, the patch from Oleg looks fine. Once Peter's other
change is in mainline, I'll go through and remove these special cases.

Andrew, can you queue Oleg's patch for 5.2? You can also add my:

Reviewed-by: Jens Axboe <axboe@kernel.dk>

to it.

> 
> Hugh, requoting Oleg:
> 
>>
>> I don't understand this code at all but I am just curious, can we do
>> something like incomplete patch below ?
>>
>> Oleg.
>>
>> --- x/mm/page_io.c
>> +++ x/mm/page_io.c
>> @@ -140,8 +140,10 @@ int swap_readpage(struct page *page, bool synchronous)
>>   	unlock_page(page);
>>   	WRITE_ONCE(bio->bi_private, NULL);
>>   	bio_put(bio);
>> -	blk_wake_io_task(waiter);
>> -	put_task_struct(waiter);
>> +	if (waiter) {
>> +		blk_wake_io_task(waiter);
>> +		put_task_struct(waiter);
>> +	}
>>   }
>>   
>>   int generic_swapfile_activate(struct swap_info_struct *sis,
>> @@ -398,11 +400,12 @@ int swap_readpage(struct page *page, boo
>>   	 * Keep this task valid during swap readpage because the oom killer may
>>   	 * attempt to access it in the page fault retry time check.
>>   	 */
>> -	get_task_struct(current);
>> -	bio->bi_private = current;
>>   	bio_set_op_attrs(bio, REQ_OP_READ, 0);
>> -	if (synchronous)
>> +	if (synchronous) {
>>   		bio->bi_opf |= REQ_HIPRI;
>> +		get_task_struct(current);
>> +		bio->bi_private = current;
>> +	}
>>   	count_vm_event(PSWPIN);
>>   	bio_get(bio);
>>   	qc = submit_bio(bio);
Andrew Morton July 2, 2019, 10:06 p.m. UTC | #20
On Mon, 1 Jul 2019 08:22:32 -0600 Jens Axboe <axboe@kernel.dk> wrote:

> Andrew, can you queue Oleg's patch for 5.2? You can also add my:
> 
> Reviewed-by: Jens Axboe <axboe@kernel.dk>

Sure.  Although things are a bit of a mess.  Oleg, can we please have a
clean resend with signoffs and acks, etc?
Oleg Nesterov July 3, 2019, 5:35 p.m. UTC | #21
On 07/02, Andrew Morton wrote:

> On Mon, 1 Jul 2019 08:22:32 -0600 Jens Axboe <axboe@kernel.dk> wrote:
> 
> > Andrew, can you queue Oleg's patch for 5.2? You can also add my:
> > 
> > Reviewed-by: Jens Axboe <axboe@kernel.dk>
> 
> Sure.  Although things are a bit of a mess.  Oleg, can we please have a
> clean resend with signoffs and acks, etc?

OK, will do tomorrow. This cleanup can be improved, we can avoid get/put_task_struct
altogether, but need to recheck.

Oleg.
Hugh Dickins July 3, 2019, 5:44 p.m. UTC | #22
On Wed, 3 Jul 2019, Oleg Nesterov wrote:
> On 07/02, Andrew Morton wrote:
> > On Mon, 1 Jul 2019 08:22:32 -0600 Jens Axboe <axboe@kernel.dk> wrote:
> > 
> > > Andrew, can you queue Oleg's patch for 5.2? You can also add my:
> > > 
> > > Reviewed-by: Jens Axboe <axboe@kernel.dk>
> > 
> > Sure.  Although things are a bit of a mess.  Oleg, can we please have a
> > clean resend with signoffs and acks, etc?
> 
> OK, will do tomorrow. This cleanup can be improved, we can avoid get/put_task_struct
> altogether, but need to recheck.

Thank you, Oleg. But, with respect, I'd caution against making it cleverer
at the last minute: what you posted already is understandable, works, has
Jens's Reviewed-by and my Acked-by: it just lacks a description and signoff.

Hugh
Jens Axboe July 3, 2019, 5:52 p.m. UTC | #23
On 7/3/19 11:35 AM, Oleg Nesterov wrote:
> On 07/02, Andrew Morton wrote:
> 
>> On Mon, 1 Jul 2019 08:22:32 -0600 Jens Axboe <axboe@kernel.dk> wrote:
>>
>>> Andrew, can you queue Oleg's patch for 5.2? You can also add my:
>>>
>>> Reviewed-by: Jens Axboe <axboe@kernel.dk>
>>
>> Sure.  Although things are a bit of a mess.  Oleg, can we please have a
>> clean resend with signoffs and acks, etc?
> 
> OK, will do tomorrow. This cleanup can be improved, we can avoid get/put_task_struct
> altogether, but need to recheck.

I'd just send it again as-is. We're removing the blk wakeup special case
anyway for 5.3.
Oleg Nesterov July 4, 2019, 4 p.m. UTC | #24
On 07/03, Hugh Dickins wrote:
>
> Thank you, Oleg. But, with respect, I'd caution against making it cleverer
> at the last minute: what you posted already is understandable, works, has
> Jen's Reviewed-by and my Acked-by: it just lacks a description and signoff.

OK, agreed. I am sending that patch as-is with yours and Jens's acks applied.

Oleg.

Patch

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 592669bcc536..290eb7528f54 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1803,7 +1803,7 @@  static inline void blk_wake_io_task(struct task_struct *waiter)
 	 * that case, we don't need to signal a wakeup, it's enough to just
 	 * mark us as RUNNING.
 	 */
-	if (waiter == current)
+	if (waiter == current && in_task())
 		__set_current_state(TASK_RUNNING);
 	else
 		wake_up_process(waiter);