diff mbox series

rcu-tasks: Delay rcu_tasks_verify_self_tests() to avoid missed callbacks

Message ID 20220610184212.822113-1-longman@redhat.com (mailing list archive)
State New, archived
Headers show
Series rcu-tasks: Delay rcu_tasks_verify_self_tests() to avoid missed callbacks | expand

Commit Message

Waiman Long June 10, 2022, 6:42 p.m. UTC
Even though rcu_tasks selftest is initiated early in the boot process,
the verification done at late initcall time may not be late enough to
catch all the callbacks especially on systems with just a few cpus and
small memory.

After 12 bootup's On a s390x system, 1 of them had failed rcu_tasks
verification test.

[    8.183013] call_rcu_tasks() has been failed.
[    8.183041] WARNING: CPU: 0 PID: 1 at kernel/rcu/tasks.h:1696 rcu_tasks_verify_self_tests+0x64/0xd0
[    8.203246] Callback from call_rcu_tasks() invoked.

In this particular case, the callback missed the check by about
20ms. Similar rcu_tasks selftest failures are also seen in ppc64le
systems.

[    0.313391] call_rcu_tasks() has been failed.
[    0.313407] WARNING: CPU: 0 PID: 1 at kernel/rcu/tasks.h:1696 rcu_tasks_verify_self_tests+0x5c/0xa0
[    0.335569] Callback from call_rcu_tasks() invoked.

Avoid this missed callback by delaying the verification using
delayed_work. The delay is set to be about 0.1s which hopefully will
be long enough to catch all the callbacks on systems with few cpus and
small memory.

Fixes: bfba7ed084f8 ("rcu-tasks: Add RCU-tasks self tests")
Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/rcu/tasks.h | 19 ++++++++++++++++---
 1 file changed, 16 insertions(+), 3 deletions(-)

Comments

Paul E. McKenney June 10, 2022, 8:58 p.m. UTC | #1
On Fri, Jun 10, 2022 at 02:42:12PM -0400, Waiman Long wrote:
> Even though rcu_tasks selftest is initiated early in the boot process,
> the verification done at late initcall time may not be late enough to
> catch all the callbacks especially on systems with just a few cpus and
> small memory.
> 
> After 12 bootup's On a s390x system, 1 of them had failed rcu_tasks
> verification test.
> 
> [    8.183013] call_rcu_tasks() has been failed.
> [    8.183041] WARNING: CPU: 0 PID: 1 at kernel/rcu/tasks.h:1696 rcu_tasks_verify_self_tests+0x64/0xd0
> [    8.203246] Callback from call_rcu_tasks() invoked.
> 
> In this particular case, the callback missed the check by about
> 20ms. Similar rcu_tasks selftest failures are also seen in ppc64le
> systems.
> 
> [    0.313391] call_rcu_tasks() has been failed.
> [    0.313407] WARNING: CPU: 0 PID: 1 at kernel/rcu/tasks.h:1696 rcu_tasks_verify_self_tests+0x5c/0xa0
> [    0.335569] Callback from call_rcu_tasks() invoked.
> 
> Avoid this missed callback by delaying the verification using
> delayed_work. The delay is set to be about 0.1s which hopefully will
> be long enough to catch all the callbacks on systems with few cpus and
> small memory.
> 
> Fixes: bfba7ed084f8 ("rcu-tasks: Add RCU-tasks self tests")
> Signed-off-by: Waiman Long <longman@redhat.com>

Good catch, thank you!

A few days ago, I queued this:

2585014188d5 ("rcu-tasks: Be more patient for RCU Tasks boot-time testing")

This is shown in full at the end of this email.  Does this fix this
problem for you?

							Thanx, Paul

> ---
>  kernel/rcu/tasks.h | 19 ++++++++++++++++---
>  1 file changed, 16 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
> index 3925e32159b5..25f964a671ba 100644
> --- a/kernel/rcu/tasks.h
> +++ b/kernel/rcu/tasks.h
> @@ -1735,7 +1735,7 @@ static void rcu_tasks_initiate_self_tests(void)
>  #endif
>  }
>  
> -static int rcu_tasks_verify_self_tests(void)
> +static void rcu_tasks_verify_self_tests(struct work_struct *work __maybe_unused)
>  {
>  	int ret = 0;
>  	int i;
> @@ -1749,10 +1749,23 @@ static int rcu_tasks_verify_self_tests(void)
>  
>  	if (ret)
>  		WARN_ON(1);
> +}
> +
> +static struct delayed_work rcu_tasks_verify_work;
>  
> -	return ret;
> +/*
> + * The rcu_tasks verification is done indirectly via the work queue to
> + * introduce an additional 0.1s delay to catch all the callbacks before
> + * the verification is done as late_initcall time may not be late enough
> + * to have all the callbacks fired.
> + */
> +static int rcu_tasks_verify_schedule_work(void)
> +{
> +	INIT_DELAYED_WORK(&rcu_tasks_verify_work, rcu_tasks_verify_self_tests);
> +	schedule_delayed_work(&rcu_tasks_verify_work, HZ/10);
> +	return 0;
>  }
> -late_initcall(rcu_tasks_verify_self_tests);
> +late_initcall(rcu_tasks_verify_schedule_work);
>  #else /* #ifdef CONFIG_PROVE_RCU */
>  static void rcu_tasks_initiate_self_tests(void) { }
>  #endif /* #else #ifdef CONFIG_PROVE_RCU */
> -- 
> 2.31.1

------------------------------------------------------------------------

commit 2585014188d5e66052b4226b42602b6f3d921389
Author: Paul E. McKenney <paulmck@kernel.org>
Date:   Tue Jun 7 15:23:52 2022 -0700

    rcu-tasks: Be more patient for RCU Tasks boot-time testing
    
    The RCU-Tasks family of grace-period primitives can take some time to
    complete, and the amount of time can depend on the exact hardware and
    software configuration.  Some configurations boot up fast enough that the
    RCU-Tasks verification process gets false-positive failures.  This commit
    therefore allows up to 30 seconds for the grace periods to complete, with
    this value adjustable downwards using the rcupdate.rcu_task_stall_timeout
    kernel boot parameter.
    
    Reported-by: Matthew Wilcox <willy@infradead.org>
    Reported-by: Zhouyi Zhou <zhouzhouyi@gmail.com>
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
    Tested-by: Zhouyi Zhou <zhouzhouyi@gmail.com>

diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index df6b2cb2f205d..fcbd0ec33c866 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -145,6 +145,7 @@ static int rcu_task_ipi_delay __read_mostly = RCU_TASK_IPI_DELAY;
 module_param(rcu_task_ipi_delay, int, 0644);
 
 /* Control stall timeouts.  Disable with <= 0, otherwise jiffies till stall. */
+#define RCU_TASK_BOOT_STALL_TIMEOUT (HZ * 30)
 #define RCU_TASK_STALL_TIMEOUT (HZ * 60 * 10)
 static int rcu_task_stall_timeout __read_mostly = RCU_TASK_STALL_TIMEOUT;
 module_param(rcu_task_stall_timeout, int, 0644);
@@ -1776,23 +1777,24 @@ struct rcu_tasks_test_desc {
 	struct rcu_head rh;
 	const char *name;
 	bool notrun;
+	unsigned long runstart;
 };
 
 static struct rcu_tasks_test_desc tests[] = {
 	{
 		.name = "call_rcu_tasks()",
 		/* If not defined, the test is skipped. */
-		.notrun = !IS_ENABLED(CONFIG_TASKS_RCU),
+		.notrun = IS_ENABLED(CONFIG_TASKS_RCU),
 	},
 	{
 		.name = "call_rcu_tasks_rude()",
 		/* If not defined, the test is skipped. */
-		.notrun = !IS_ENABLED(CONFIG_TASKS_RUDE_RCU),
+		.notrun = IS_ENABLED(CONFIG_TASKS_RUDE_RCU),
 	},
 	{
 		.name = "call_rcu_tasks_trace()",
 		/* If not defined, the test is skipped. */
-		.notrun = !IS_ENABLED(CONFIG_TASKS_TRACE_RCU)
+		.notrun = IS_ENABLED(CONFIG_TASKS_TRACE_RCU)
 	}
 };
 
@@ -1803,23 +1805,28 @@ static void test_rcu_tasks_callback(struct rcu_head *rhp)
 
 	pr_info("Callback from %s invoked.\n", rttd->name);
 
-	rttd->notrun = true;
+	rttd->notrun = false;
 }
 
 static void rcu_tasks_initiate_self_tests(void)
 {
+	unsigned long j = jiffies;
+
 	pr_info("Running RCU-tasks wait API self tests\n");
 #ifdef CONFIG_TASKS_RCU
+	tests[0].runstart = j;
 	synchronize_rcu_tasks();
 	call_rcu_tasks(&tests[0].rh, test_rcu_tasks_callback);
 #endif
 
 #ifdef CONFIG_TASKS_RUDE_RCU
+	tests[1].runstart = j;
 	synchronize_rcu_tasks_rude();
 	call_rcu_tasks_rude(&tests[1].rh, test_rcu_tasks_callback);
 #endif
 
 #ifdef CONFIG_TASKS_TRACE_RCU
+	tests[2].runstart = j;
 	synchronize_rcu_tasks_trace();
 	call_rcu_tasks_trace(&tests[2].rh, test_rcu_tasks_callback);
 #endif
@@ -1829,11 +1836,18 @@ static int rcu_tasks_verify_self_tests(void)
 {
 	int ret = 0;
 	int i;
+	unsigned long bst = rcu_task_stall_timeout;
 
+	if (bst <= 0 || bst > RCU_TASK_BOOT_STALL_TIMEOUT)
+		bst = RCU_TASK_BOOT_STALL_TIMEOUT;
 	for (i = 0; i < ARRAY_SIZE(tests); i++) {
-		if (!tests[i].notrun) {		// still hanging.
-			pr_err("%s has been failed.\n", tests[i].name);
-			ret = -1;
+		while (tests[i].notrun) {		// still hanging.
+			if (time_after(jiffies, tests[i].runstart + bst)) {
+				pr_err("%s has failed boot-time tests.\n", tests[i].name);
+				ret = -1;
+				break;
+			}
+			schedule_timeout_uninterruptible(1);
 		}
 	}
Waiman Long June 13, 2022, 4:01 p.m. UTC | #2
On 6/10/22 16:58, Paul E. McKenney wrote:
> On Fri, Jun 10, 2022 at 02:42:12PM -0400, Waiman Long wrote:
>> Even though rcu_tasks selftest is initiated early in the boot process,
>> the verification done at late initcall time may not be late enough to
>> catch all the callbacks especially on systems with just a few cpus and
>> small memory.
>>
>> After 12 bootup's On a s390x system, 1 of them had failed rcu_tasks
>> verification test.
>>
>> [    8.183013] call_rcu_tasks() has been failed.
>> [    8.183041] WARNING: CPU: 0 PID: 1 at kernel/rcu/tasks.h:1696 rcu_tasks_verify_self_tests+0x64/0xd0
>> [    8.203246] Callback from call_rcu_tasks() invoked.
>>
>> In this particular case, the callback missed the check by about
>> 20ms. Similar rcu_tasks selftest failures are also seen in ppc64le
>> systems.
>>
>> [    0.313391] call_rcu_tasks() has been failed.
>> [    0.313407] WARNING: CPU: 0 PID: 1 at kernel/rcu/tasks.h:1696 rcu_tasks_verify_self_tests+0x5c/0xa0
>> [    0.335569] Callback from call_rcu_tasks() invoked.
>>
>> Avoid this missed callback by delaying the verification using
>> delayed_work. The delay is set to be about 0.1s which hopefully will
>> be long enough to catch all the callbacks on systems with few cpus and
>> small memory.
>>
>> Fixes: bfba7ed084f8 ("rcu-tasks: Add RCU-tasks self tests")
>> Signed-off-by: Waiman Long <longman@redhat.com>
> Good catch, thank you!
>
> A few days ago, I queued this:
>
> 2585014188d5 ("rcu-tasks: Be more patient for RCU Tasks boot-time testing")
>
> This is shown in full at the end of this email.  Does this fix this
> problem for you?

I think your patch should fix the false positive warning and it give 
plenty of time for this to happen.

I do have one question though. rcu_tasks_verify_selft_tests() is called 
from do_initcalls(). Since it may not be the last late initcall, does 
that mean other late initcalls queued after that may be delayed by a 
second or more?

Thanks,
Longman

> 							Thanx, Paul
>
>> ---
>>   kernel/rcu/tasks.h | 19 ++++++++++++++++---
>>   1 file changed, 16 insertions(+), 3 deletions(-)
>>
>> diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
>> index 3925e32159b5..25f964a671ba 100644s
>> --- a/kernel/rcu/tasks.h
>> +++ b/kernel/rcu/tasks.h
>> @@ -1735,7 +1735,7 @@ static void rcu_tasks_initiate_self_tests(void)
>>   #endif
>>   }
>>   
>> -static int rcu_tasks_verify_self_tests(void)
>> +static void rcu_tasks_verify_self_tests(struct work_struct *work __maybe_unused)
>>   {
>>   	int ret = 0;
>>   	int i;
>> @@ -1749,10 +1749,23 @@ static int rcu_tasks_verify_self_tests(void)
>>   
>>   	if (ret)
>>   		WARN_ON(1);
>> +}
>> +
>> +static struct delayed_work rcu_tasks_verify_work;
>>   
>> -	return ret;
>> +/*
>> + * The rcu_tasks verification is done indirectly via the work queue to
>> + * introduce an additional 0.1s delay to catch all the callbacks before
>> + * the verification is done as late_initcall time may not be late enough
>> + * to have all the callbacks fired.
>> + */
>> +static int rcu_tasks_verify_schedule_work(void)
>> +{
>> +	INIT_DELAYED_WORK(&rcu_tasks_verify_work, rcu_tasks_verify_self_tests);
>> +	schedule_delayed_work(&rcu_tasks_verify_work, HZ/10);
>> +	return 0;
>>   }
>> -late_initcall(rcu_tasks_verify_self_tests);
>> +late_initcall(rcu_tasks_verify_schedule_work);
>>   #else /* #ifdef CONFIG_PROVE_RCU */
>>   static void rcu_tasks_initiate_self_tests(void) { }
>>   #endif /* #else #ifdef CONFIG_PROVE_RCU */
>> -- 
>> 2.31.1
> ------------------------------------------------------------------------
>
> commit 2585014188d5e66052b4226b42602b6f3d921389
> Author: Paul E. McKenney <paulmck@kernel.org>
> Date:   Tue Jun 7 15:23:52 2022 -0700
>
>      rcu-tasks: Be more patient for RCU Tasks boot-time testing
>      
>      The RCU-Tasks family of grace-period primitives can take some time to
>      complete, and the amount of time can depend on the exact hardware and
>      software configuration.  Some configurations boot up fast enough that the
>      RCU-Tasks verification process gets false-positive failures.  This commit
>      therefore allows up to 30 seconds for the grace periods to complete, with
>      this value adjustable downwards using the rcupdate.rcu_task_stall_timeout
>      kernel boot parameter.
>      
>      Reported-by: Matthew Wilcox <willy@infradead.org>
>      Reported-by: Zhouyi Zhou <zhouzhouyi@gmail.com>
>      Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
>      Tested-by: Zhouyi Zhou <zhouzhouyi@gmail.com>
>
> diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
> index df6b2cb2f205d..fcbd0ec33c866 100644
> --- a/kernel/rcu/tasks.h
> +++ b/kernel/rcu/tasks.h
> @@ -145,6 +145,7 @@ static int rcu_task_ipi_delay __read_mostly = RCU_TASK_IPI_DELAY;
>   module_param(rcu_task_ipi_delay, int, 0644);
>   
>   /* Control stall timeouts.  Disable with <= 0, otherwise jiffies till stall. */
> +#define RCU_TASK_BOOT_STALL_TIMEOUT (HZ * 30)
>   #define RCU_TASK_STALL_TIMEOUT (HZ * 60 * 10)
>   static int rcu_task_stall_timeout __read_mostly = RCU_TASK_STALL_TIMEOUT;
>   module_param(rcu_task_stall_timeout, int, 0644);
> @@ -1776,23 +1777,24 @@ struct rcu_tasks_test_desc {
>   	struct rcu_head rh;
>   	const char *name;
>   	bool notrun;
> +	unsigned long runstart;
>   };
>   
>   static struct rcu_tasks_test_desc tests[] = {
>   	{
>   		.name = "call_rcu_tasks()",
>   		/* If not defined, the test is skipped. */
> -		.notrun = !IS_ENABLED(CONFIG_TASKS_RCU),
> +		.notrun = IS_ENABLED(CONFIG_TASKS_RCU),
>   	},
>   	{
>   		.name = "call_rcu_tasks_rude()",
>   		/* If not defined, the test is skipped. */
> -		.notrun = !IS_ENABLED(CONFIG_TASKS_RUDE_RCU),
> +		.notrun = IS_ENABLED(CONFIG_TASKS_RUDE_RCU),
>   	},
>   	{
>   		.name = "call_rcu_tasks_trace()",
>   		/* If not defined, the test is skipped. */
> -		.notrun = !IS_ENABLED(CONFIG_TASKS_TRACE_RCU)
> +		.notrun = IS_ENABLED(CONFIG_TASKS_TRACE_RCU)
>   	}
>   };
>   
> @@ -1803,23 +1805,28 @@ static void test_rcu_tasks_callback(struct rcu_head *rhp)
>   
>   	pr_info("Callback from %s invoked.\n", rttd->name);
>   
> -	rttd->notrun = true;
> +	rttd->notrun = false;
>   }
>   
>   static void rcu_tasks_initiate_self_tests(void)
>   {
> +	unsigned long j = jiffies;
> +
>   	pr_info("Running RCU-tasks wait API self tests\n");
>   #ifdef CONFIG_TASKS_RCU
> +	tests[0].runstart = j;
>   	synchronize_rcu_tasks();
>   	call_rcu_tasks(&tests[0].rh, test_rcu_tasks_callback);
>   #endif
>   
>   #ifdef CONFIG_TASKS_RUDE_RCU
> +	tests[1].runstart = j;
>   	synchronize_rcu_tasks_rude();
>   	call_rcu_tasks_rude(&tests[1].rh, test_rcu_tasks_callback);
>   #endif
>   
>   #ifdef CONFIG_TASKS_TRACE_RCU
> +	tests[2].runstart = j;
>   	synchronize_rcu_tasks_trace();
>   	call_rcu_tasks_trace(&tests[2].rh, test_rcu_tasks_callback);
>   #endif
> @@ -1829,11 +1836,18 @@ static int rcu_tasks_verify_self_tests(void)
>   {
>   	int ret = 0;
>   	int i;
> +	unsigned long bst = rcu_task_stall_timeout;
>   
> +	if (bst <= 0 || bst > RCU_TASK_BOOT_STALL_TIMEOUT)
> +		bst = RCU_TASK_BOOT_STALL_TIMEOUT;
>   	for (i = 0; i < ARRAY_SIZE(tests); i++) {
> -		if (!tests[i].notrun) {		// still hanging.
> -			pr_err("%s has been failed.\n", tests[i].name);
> -			ret = -1;
> +		while (tests[i].notrun) {		// still hanging.
> +			if (time_after(jiffies, tests[i].runstart + bst)) {
> +				pr_err("%s has failed boot-time tests.\n", tests[i].name);
> +				ret = -1;
> +				break;
> +			}
> +			schedule_timeout_uninterruptible(1);
>   		}
>   	}
>   
>
Paul E. McKenney June 13, 2022, 5:56 p.m. UTC | #3
On Mon, Jun 13, 2022 at 12:01:24PM -0400, Waiman Long wrote:
> On 6/10/22 16:58, Paul E. McKenney wrote:
> > On Fri, Jun 10, 2022 at 02:42:12PM -0400, Waiman Long wrote:
> > > Even though rcu_tasks selftest is initiated early in the boot process,
> > > the verification done at late initcall time may not be late enough to
> > > catch all the callbacks especially on systems with just a few cpus and
> > > small memory.
> > > 
> > > After 12 bootup's On a s390x system, 1 of them had failed rcu_tasks
> > > verification test.
> > > 
> > > [    8.183013] call_rcu_tasks() has been failed.
> > > [    8.183041] WARNING: CPU: 0 PID: 1 at kernel/rcu/tasks.h:1696 rcu_tasks_verify_self_tests+0x64/0xd0
> > > [    8.203246] Callback from call_rcu_tasks() invoked.
> > > 
> > > In this particular case, the callback missed the check by about
> > > 20ms. Similar rcu_tasks selftest failures are also seen in ppc64le
> > > systems.
> > > 
> > > [    0.313391] call_rcu_tasks() has been failed.
> > > [    0.313407] WARNING: CPU: 0 PID: 1 at kernel/rcu/tasks.h:1696 rcu_tasks_verify_self_tests+0x5c/0xa0
> > > [    0.335569] Callback from call_rcu_tasks() invoked.
> > > 
> > > Avoid this missed callback by delaying the verification using
> > > delayed_work. The delay is set to be about 0.1s which hopefully will
> > > be long enough to catch all the callbacks on systems with few cpus and
> > > small memory.
> > > 
> > > Fixes: bfba7ed084f8 ("rcu-tasks: Add RCU-tasks self tests")
> > > Signed-off-by: Waiman Long <longman@redhat.com>
> > Good catch, thank you!
> > 
> > A few days ago, I queued this:
> > 
> > 2585014188d5 ("rcu-tasks: Be more patient for RCU Tasks boot-time testing")
> > 
> > This is shown in full at the end of this email.  Does this fix this
> > problem for you?
> 
> I think your patch should fix the false positive warning and it give plenty
> of time for this to happen.
> 
> I do have one question though. rcu_tasks_verify_selft_tests() is called from
> do_initcalls(). Since it may not be the last late initcall, does that mean
> other late initcalls queued after that may be delayed by a second or more?

Indeed.  Which is why I would welcome the workqueues portion of your
patch on top of the above patch in -rcu.  ;-)

							Thanx, Paul

> > > ---
> > >   kernel/rcu/tasks.h | 19 ++++++++++++++++---
> > >   1 file changed, 16 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
> > > index 3925e32159b5..25f964a671ba 100644s
> > > --- a/kernel/rcu/tasks.h
> > > +++ b/kernel/rcu/tasks.h
> > > @@ -1735,7 +1735,7 @@ static void rcu_tasks_initiate_self_tests(void)
> > >   #endif
> > >   }
> > > -static int rcu_tasks_verify_self_tests(void)
> > > +static void rcu_tasks_verify_self_tests(struct work_struct *work __maybe_unused)
> > >   {
> > >   	int ret = 0;
> > >   	int i;
> > > @@ -1749,10 +1749,23 @@ static int rcu_tasks_verify_self_tests(void)
> > >   	if (ret)
> > >   		WARN_ON(1);
> > > +}
> > > +
> > > +static struct delayed_work rcu_tasks_verify_work;
> > > -	return ret;
> > > +/*
> > > + * The rcu_tasks verification is done indirectly via the work queue to
> > > + * introduce an additional 0.1s delay to catch all the callbacks before
> > > + * the verification is done as late_initcall time may not be late enough
> > > + * to have all the callbacks fired.
> > > + */
> > > +static int rcu_tasks_verify_schedule_work(void)
> > > +{
> > > +	INIT_DELAYED_WORK(&rcu_tasks_verify_work, rcu_tasks_verify_self_tests);
> > > +	schedule_delayed_work(&rcu_tasks_verify_work, HZ/10);
> > > +	return 0;
> > >   }
> > > -late_initcall(rcu_tasks_verify_self_tests);
> > > +late_initcall(rcu_tasks_verify_schedule_work);
> > >   #else /* #ifdef CONFIG_PROVE_RCU */
> > >   static void rcu_tasks_initiate_self_tests(void) { }
> > >   #endif /* #else #ifdef CONFIG_PROVE_RCU */
> > > -- 
> > > 2.31.1
> > ------------------------------------------------------------------------
> > 
> > commit 2585014188d5e66052b4226b42602b6f3d921389
> > Author: Paul E. McKenney <paulmck@kernel.org>
> > Date:   Tue Jun 7 15:23:52 2022 -0700
> > 
> >      rcu-tasks: Be more patient for RCU Tasks boot-time testing
> >      The RCU-Tasks family of grace-period primitives can take some time to
> >      complete, and the amount of time can depend on the exact hardware and
> >      software configuration.  Some configurations boot up fast enough that the
> >      RCU-Tasks verification process gets false-positive failures.  This commit
> >      therefore allows up to 30 seconds for the grace periods to complete, with
> >      this value adjustable downwards using the rcupdate.rcu_task_stall_timeout
> >      kernel boot parameter.
> >      Reported-by: Matthew Wilcox <willy@infradead.org>
> >      Reported-by: Zhouyi Zhou <zhouzhouyi@gmail.com>
> >      Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
> >      Tested-by: Zhouyi Zhou <zhouzhouyi@gmail.com>
> > 
> > diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
> > index df6b2cb2f205d..fcbd0ec33c866 100644
> > --- a/kernel/rcu/tasks.h
> > +++ b/kernel/rcu/tasks.h
> > @@ -145,6 +145,7 @@ static int rcu_task_ipi_delay __read_mostly = RCU_TASK_IPI_DELAY;
> >   module_param(rcu_task_ipi_delay, int, 0644);
> >   /* Control stall timeouts.  Disable with <= 0, otherwise jiffies till stall. */
> > +#define RCU_TASK_BOOT_STALL_TIMEOUT (HZ * 30)
> >   #define RCU_TASK_STALL_TIMEOUT (HZ * 60 * 10)
> >   static int rcu_task_stall_timeout __read_mostly = RCU_TASK_STALL_TIMEOUT;
> >   module_param(rcu_task_stall_timeout, int, 0644);
> > @@ -1776,23 +1777,24 @@ struct rcu_tasks_test_desc {
> >   	struct rcu_head rh;
> >   	const char *name;
> >   	bool notrun;
> > +	unsigned long runstart;
> >   };
> >   static struct rcu_tasks_test_desc tests[] = {
> >   	{
> >   		.name = "call_rcu_tasks()",
> >   		/* If not defined, the test is skipped. */
> > -		.notrun = !IS_ENABLED(CONFIG_TASKS_RCU),
> > +		.notrun = IS_ENABLED(CONFIG_TASKS_RCU),
> >   	},
> >   	{
> >   		.name = "call_rcu_tasks_rude()",
> >   		/* If not defined, the test is skipped. */
> > -		.notrun = !IS_ENABLED(CONFIG_TASKS_RUDE_RCU),
> > +		.notrun = IS_ENABLED(CONFIG_TASKS_RUDE_RCU),
> >   	},
> >   	{
> >   		.name = "call_rcu_tasks_trace()",
> >   		/* If not defined, the test is skipped. */
> > -		.notrun = !IS_ENABLED(CONFIG_TASKS_TRACE_RCU)
> > +		.notrun = IS_ENABLED(CONFIG_TASKS_TRACE_RCU)
> >   	}
> >   };
> > @@ -1803,23 +1805,28 @@ static void test_rcu_tasks_callback(struct rcu_head *rhp)
> >   	pr_info("Callback from %s invoked.\n", rttd->name);
> > -	rttd->notrun = true;
> > +	rttd->notrun = false;
> >   }
> >   static void rcu_tasks_initiate_self_tests(void)
> >   {
> > +	unsigned long j = jiffies;
> > +
> >   	pr_info("Running RCU-tasks wait API self tests\n");
> >   #ifdef CONFIG_TASKS_RCU
> > +	tests[0].runstart = j;
> >   	synchronize_rcu_tasks();
> >   	call_rcu_tasks(&tests[0].rh, test_rcu_tasks_callback);
> >   #endif
> >   #ifdef CONFIG_TASKS_RUDE_RCU
> > +	tests[1].runstart = j;
> >   	synchronize_rcu_tasks_rude();
> >   	call_rcu_tasks_rude(&tests[1].rh, test_rcu_tasks_callback);
> >   #endif
> >   #ifdef CONFIG_TASKS_TRACE_RCU
> > +	tests[2].runstart = j;
> >   	synchronize_rcu_tasks_trace();
> >   	call_rcu_tasks_trace(&tests[2].rh, test_rcu_tasks_callback);
> >   #endif
> > @@ -1829,11 +1836,18 @@ static int rcu_tasks_verify_self_tests(void)
> >   {
> >   	int ret = 0;
> >   	int i;
> > +	unsigned long bst = rcu_task_stall_timeout;
> > +	if (bst <= 0 || bst > RCU_TASK_BOOT_STALL_TIMEOUT)
> > +		bst = RCU_TASK_BOOT_STALL_TIMEOUT;
> >   	for (i = 0; i < ARRAY_SIZE(tests); i++) {
> > -		if (!tests[i].notrun) {		// still hanging.
> > -			pr_err("%s has been failed.\n", tests[i].name);
> > -			ret = -1;
> > +		while (tests[i].notrun) {		// still hanging.
> > +			if (time_after(jiffies, tests[i].runstart + bst)) {
> > +				pr_err("%s has failed boot-time tests.\n", tests[i].name);
> > +				ret = -1;
> > +				break;
> > +			}
> > +			schedule_timeout_uninterruptible(1);
> >   		}
> >   	}
> > 
>
Waiman Long June 13, 2022, 6:54 p.m. UTC | #4
On 6/13/22 13:56, Paul E. McKenney wrote:
> On Mon, Jun 13, 2022 at 12:01:24PM -0400, Waiman Long wrote:
>> On 6/10/22 16:58, Paul E. McKenney wrote:
>>> On Fri, Jun 10, 2022 at 02:42:12PM -0400, Waiman Long wrote:
>>>> Even though rcu_tasks selftest is initiated early in the boot process,
>>>> the verification done at late initcall time may not be late enough to
>>>> catch all the callbacks especially on systems with just a few cpus and
>>>> small memory.
>>>>
>>>> After 12 bootup's On a s390x system, 1 of them had failed rcu_tasks
>>>> verification test.
>>>>
>>>> [    8.183013] call_rcu_tasks() has been failed.
>>>> [    8.183041] WARNING: CPU: 0 PID: 1 at kernel/rcu/tasks.h:1696 rcu_tasks_verify_self_tests+0x64/0xd0
>>>> [    8.203246] Callback from call_rcu_tasks() invoked.
>>>>
>>>> In this particular case, the callback missed the check by about
>>>> 20ms. Similar rcu_tasks selftest failures are also seen in ppc64le
>>>> systems.
>>>>
>>>> [    0.313391] call_rcu_tasks() has been failed.
>>>> [    0.313407] WARNING: CPU: 0 PID: 1 at kernel/rcu/tasks.h:1696 rcu_tasks_verify_self_tests+0x5c/0xa0
>>>> [    0.335569] Callback from call_rcu_tasks() invoked.
>>>>
>>>> Avoid this missed callback by delaying the verification using
>>>> delayed_work. The delay is set to be about 0.1s which hopefully will
>>>> be long enough to catch all the callbacks on systems with few cpus and
>>>> small memory.
>>>>
>>>> Fixes: bfba7ed084f8 ("rcu-tasks: Add RCU-tasks self tests")
>>>> Signed-off-by: Waiman Long <longman@redhat.com>
>>> Good catch, thank you!
>>>
>>> A few days ago, I queued this:
>>>
>>> 2585014188d5 ("rcu-tasks: Be more patient for RCU Tasks boot-time testing")
>>>
>>> This is shown in full at the end of this email.  Does this fix this
>>> problem for you?
>> I think your patch should fix the false positive warning and it give plenty
>> of time for this to happen.
>>
>> I do have one question though. rcu_tasks_verify_selft_tests() is called from
>> do_initcalls(). Since it may not be the last late initcall, does that mean
>> other late initcalls queued after that may be delayed by a second or more?
> Indeed.  Which is why I would welcome the workqueues portion of your
> patch on top of the above patch in -rcu.  ;-)

Sure. I will work on such a follow-up patch.

Cheers,
Longman
diff mbox series

Patch

diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index 3925e32159b5..25f964a671ba 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -1735,7 +1735,7 @@  static void rcu_tasks_initiate_self_tests(void)
 #endif
 }
 
-static int rcu_tasks_verify_self_tests(void)
+static void rcu_tasks_verify_self_tests(struct work_struct *work __maybe_unused)
 {
 	int ret = 0;
 	int i;
@@ -1749,10 +1749,23 @@  static int rcu_tasks_verify_self_tests(void)
 
 	if (ret)
 		WARN_ON(1);
+}
+
+static struct delayed_work rcu_tasks_verify_work;
 
-	return ret;
+/*
+ * The rcu_tasks verification is done indirectly via the work queue to
+ * introduce an additional 0.1s delay to catch all the callbacks before
+ * the verification is done as late_initcall time may not be late enough
+ * to have all the callbacks fired.
+ */
+static int rcu_tasks_verify_schedule_work(void)
+{
+	INIT_DELAYED_WORK(&rcu_tasks_verify_work, rcu_tasks_verify_self_tests);
+	schedule_delayed_work(&rcu_tasks_verify_work, HZ/10);
+	return 0;
 }
-late_initcall(rcu_tasks_verify_self_tests);
+late_initcall(rcu_tasks_verify_schedule_work);
 #else /* #ifdef CONFIG_PROVE_RCU */
 static void rcu_tasks_initiate_self_tests(void) { }
 #endif /* #else #ifdef CONFIG_PROVE_RCU */