rcu: Reduce synchronize_rcu() delays when all wait heads are in use

Message ID 20240313083228.233247-1-Neeraj.Upadhyay@amd.com (mailing list archive)
State Superseded
Series rcu: Reduce synchronize_rcu() delays when all wait heads are in use

Commit Message

Neeraj Upadhyay March 13, 2024, 8:32 a.m. UTC
When all wait heads are in use, which can happen when
rcu_sr_normal_gp_cleanup_work()'s callback processing
is slow, any new synchronize_rcu() user's rcu_synchronize
node's processing is deferred to future grace periods. This
can result in a long list of synchronize_rcu() invocations
waiting for full grace period processing, which can delay
freeing of memory. Mitigate this problem by using the first
node in the list as the wait tail when all wait heads are in
use. While methods to speed up callback processing would be
needed to recover from this situation, allowing new nodes to
complete their grace period can help prevent delays due to a
fixed number of wait head nodes.

Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
---
 kernel/rcu/tree.c | 27 +++++++++++++--------------
 1 file changed, 13 insertions(+), 14 deletions(-)
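
(For readers unfamiliar with this machinery, a rough sketch of the fixed
wait-head pool the commit message refers to follows. The struct layout,
pool size and locking details here are illustrative approximations, not
code from this patch; the only behaviour the patch relies on is
rcu_sr_get_wait_head() returning NULL once the pool is exhausted.)

        #define RCU_SR_WAIT_HEAD_MAX	5	/* hypothetical pool size */

        struct rcu_sr_wait_node {
                atomic_t inuse;
                struct llist_node node;
        };

        static struct rcu_sr_wait_node rcu_sr_wait_nodes[RCU_SR_WAIT_HEAD_MAX];

        /* Hand out an unused wait head, or NULL if the kworker still owns them all. */
        static struct llist_node *rcu_sr_get_wait_head(void)
        {
                int i;

                for (i = 0; i < RCU_SR_WAIT_HEAD_MAX; i++) {
                        if (!atomic_cmpxchg_acquire(&rcu_sr_wait_nodes[i].inuse, 0, 1))
                                return &rcu_sr_wait_nodes[i].node;
                }

                return NULL;	/* all wait heads in use: the case this patch handles */
        }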

Comments

Joel Fernandes March 13, 2024, 2:40 p.m. UTC | #1
Hi Neeraj,

On 3/13/2024 4:32 AM, Neeraj Upadhyay wrote:
> When all wait heads are in use, which can happen when
> rcu_sr_normal_gp_cleanup_work()'s callback processing
> is slow, any new synchronize_rcu() user's rcu_synchronize
> node's processing is deferred to future GP periods. This
> can result in long list of synchronize_rcu() invocations
> waiting for full grace period processing, which can delay
> freeing of memory. Mitigate this problem by using first
> node in the list as wait tail when all wait heads are in use.
> While methods to speed up callback processing would be needed
> to recover from this situation, allowing new nodes to complete
> their grace period can help prevent delays due to a fixed
> number of wait head nodes.
> 
> Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
> ---
>  kernel/rcu/tree.c | 27 +++++++++++++--------------
>  1 file changed, 13 insertions(+), 14 deletions(-)
> 
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 9fbb5ab57c84..bdccce1ed62f 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -1470,14 +1470,11 @@ static void rcu_poll_gp_seq_end_unlocked(unsigned long *snap)
>   * for this new grace period. Given that there are a fixed
>   * number of wait nodes, if all wait nodes are in use
>   * (which can happen when kworker callback processing
> - * is delayed) and additional grace period is requested.
> - * This means, a system is slow in processing callbacks.
> - *
> - * TODO: If a slow processing is detected, a first node
> - * in the llist should be used as a wait-tail for this
> - * grace period, therefore users which should wait due
> - * to a slow process are handled by _this_ grace period
> - * and not next.
> + * is delayed), first node in the llist is used as wait
> + * tail for this grace period. This means, the first node
> + * has to go through additional grace periods before it is
> + * part of the wait callbacks. This should be ok, as
> + * the system is slow in processing callbacks anyway.
>   *
>   * Below is an illustration of how the done and wait
>   * tail pointers move from one set of rcu_synchronize nodes
> @@ -1725,15 +1722,17 @@ static bool rcu_sr_normal_gp_init(void)
>  		return start_new_poll;
>  
>  	wait_head = rcu_sr_get_wait_head();
> -	if (!wait_head) {
> -		// Kick another GP to retry.
> +	if (wait_head) {
> +		/* Inject a wait-dummy-node. */
> +		llist_add(wait_head, &rcu_state.srs_next);
> +	} else {
> +		// Kick another GP for first node.
>  		start_new_poll = true;
> -		return start_new_poll;
> +		if (first == rcu_state.srs_done_tail)

small nit:
Does done_tail access here need smp_load_acquire() or READ_ONCE() to match the
other users?

Also if you don't mind could you please rebase your patch on top of mine [1] ? I
think it will otherwise trigger this warning in my patch:

WARN_ON_ONCE(!rcu);

Because I always assume there to be at least 2 wait heads at clean up time.

[1] https://lore.kernel.org/all/20240308224439.281349-1-joel@joelfernandes.org/

Thanks!

 - Joel


> +			return start_new_poll;
> +		wait_head = first;
>  	}
>  
> -	/* Inject a wait-dummy-node. */
> -	llist_add(wait_head, &rcu_state.srs_next);
> -
>  	/*
>  	 * A waiting list of rcu_synchronize nodes should be empty on
>  	 * this step, since a GP-kthread, rcu_gp_init() -> gp_cleanup(),
Frederic Weisbecker March 13, 2024, 3:18 p.m. UTC | #2
On Wed, Mar 13, 2024 at 02:02:28PM +0530, Neeraj Upadhyay wrote:
> When all wait heads are in use, which can happen when
> rcu_sr_normal_gp_cleanup_work()'s callback processing
> is slow, any new synchronize_rcu() user's rcu_synchronize
> node's processing is deferred to future GP periods. This
> can result in long list of synchronize_rcu() invocations
> waiting for full grace period processing, which can delay
> freeing of memory. Mitigate this problem by using first
> node in the list as wait tail when all wait heads are in use.
> While methods to speed up callback processing would be needed
> to recover from this situation, allowing new nodes to complete
> their grace period can help prevent delays due to a fixed
> number of wait head nodes.
> 
> Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
> ---
>  kernel/rcu/tree.c | 27 +++++++++++++--------------
>  1 file changed, 13 insertions(+), 14 deletions(-)
> 
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 9fbb5ab57c84..bdccce1ed62f 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -1470,14 +1470,11 @@ static void rcu_poll_gp_seq_end_unlocked(unsigned long *snap)
>   * for this new grace period. Given that there are a fixed
>   * number of wait nodes, if all wait nodes are in use
>   * (which can happen when kworker callback processing
> - * is delayed) and additional grace period is requested.
> - * This means, a system is slow in processing callbacks.
> - *
> - * TODO: If a slow processing is detected, a first node
> - * in the llist should be used as a wait-tail for this
> - * grace period, therefore users which should wait due
> - * to a slow process are handled by _this_ grace period
> - * and not next.
> + * is delayed), first node in the llist is used as wait
> + * tail for this grace period. This means, the first node
> + * has to go through additional grace periods before it is
> + * part of the wait callbacks. This should be ok, as
> + * the system is slow in processing callbacks anyway.
>   *
>   * Below is an illustration of how the done and wait
>   * tail pointers move from one set of rcu_synchronize nodes
> @@ -1725,15 +1722,17 @@ static bool rcu_sr_normal_gp_init(void)
>  		return start_new_poll;
>  
>  	wait_head = rcu_sr_get_wait_head();
> -	if (!wait_head) {
> -		// Kick another GP to retry.
> +	if (wait_head) {
> +		/* Inject a wait-dummy-node. */
> +		llist_add(wait_head, &rcu_state.srs_next);
> +	} else {
> +		// Kick another GP for first node.
>  		start_new_poll = true;
> -		return start_new_poll;
> +		if (first == rcu_state.srs_done_tail)
> +			return start_new_poll;
> +		wait_head = first;

This means you're setting a non-wait-head as srs_wait_tail, right?
Doesn't it trigger the following warning in rcu_sr_normal_gp_cleanup():

	WARN_ON_ONCE(!rcu_sr_is_wait_head(wait_tail));

Also there is a risk that this non-wait-head gets later assigned as
rcu_state.srs_done_tail. And then this pending sr may not be completed
until the next grace period calling rcu_sr_normal_gp_cleanup()? (Because
the work doesn't take care of rcu_state.srs_done_tail itself). And then
the delay can be arbitrary.

And the next grace period completing this sr (that non-wait-head set
as rcu_state.srs_done_tail) and thus allowing its caller to wipe it out
of its stack may race with the cleanup work dereferencing it?
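
(For context, the lifetime concern here comes from the request node living
on the synchronize_rcu() caller's stack. A simplified sketch, assuming the
mainline synchronize_rcu()/rcu_sr_normal_add_req() flow, with debug-objects
handling and other details omitted:)

        struct rcu_synchronize rs;

        init_completion(&rs.completion);
        rcu_sr_normal_add_req(&rs);		/* llist_add()s the node onto rcu_state.srs_next */
        (void) start_poll_synchronize_rcu();	/* make sure a GP is in flight */
        wait_for_completion(&rs.completion);	/* returns once the node is completed */

        /*
         * From this point the stack frame, and therefore the llist node
         * embedded in rs, can be reused at any time. Nothing may dereference
         * the node after complete() has been called on it, which is exactly
         * the window described above.
         */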

Thanks.



>  	}
>  
> -	/* Inject a wait-dummy-node. */
> -	llist_add(wait_head, &rcu_state.srs_next);
> -
>  	/*
>  	 * A waiting list of rcu_synchronize nodes should be empty on
>  	 * this step, since a GP-kthread, rcu_gp_init() -> gp_cleanup(),
> -- 
> 2.34.1
> 
>
Neeraj Upadhyay March 13, 2024, 4:04 p.m. UTC | #3
Hi Joel,

On 3/13/2024 8:10 PM, Joel Fernandes wrote:
> Hi Neeraj,
> 
> On 3/13/2024 4:32 AM, Neeraj Upadhyay wrote:
>> When all wait heads are in use, which can happen when
>> rcu_sr_normal_gp_cleanup_work()'s callback processing
>> is slow, any new synchronize_rcu() user's rcu_synchronize
>> node's processing is deferred to future GP periods. This
>> can result in long list of synchronize_rcu() invocations
>> waiting for full grace period processing, which can delay
>> freeing of memory. Mitigate this problem by using first
>> node in the list as wait tail when all wait heads are in use.
>> While methods to speed up callback processing would be needed
>> to recover from this situation, allowing new nodes to complete
>> their grace period can help prevent delays due to a fixed
>> number of wait head nodes.
>>
>> Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
>> ---
>>  kernel/rcu/tree.c | 27 +++++++++++++--------------
>>  1 file changed, 13 insertions(+), 14 deletions(-)
>>
>> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
>> index 9fbb5ab57c84..bdccce1ed62f 100644
>> --- a/kernel/rcu/tree.c
>> +++ b/kernel/rcu/tree.c
>> @@ -1470,14 +1470,11 @@ static void rcu_poll_gp_seq_end_unlocked(unsigned long *snap)
>>   * for this new grace period. Given that there are a fixed
>>   * number of wait nodes, if all wait nodes are in use
>>   * (which can happen when kworker callback processing
>> - * is delayed) and additional grace period is requested.
>> - * This means, a system is slow in processing callbacks.
>> - *
>> - * TODO: If a slow processing is detected, a first node
>> - * in the llist should be used as a wait-tail for this
>> - * grace period, therefore users which should wait due
>> - * to a slow process are handled by _this_ grace period
>> - * and not next.
>> + * is delayed), first node in the llist is used as wait
>> + * tail for this grace period. This means, the first node
>> + * has to go through additional grace periods before it is
>> + * part of the wait callbacks. This should be ok, as
>> + * the system is slow in processing callbacks anyway.
>>   *
>>   * Below is an illustration of how the done and wait
>>   * tail pointers move from one set of rcu_synchronize nodes
>> @@ -1725,15 +1722,17 @@ static bool rcu_sr_normal_gp_init(void)
>>  		return start_new_poll;
>>  
>>  	wait_head = rcu_sr_get_wait_head();
>> -	if (!wait_head) {
>> -		// Kick another GP to retry.
>> +	if (wait_head) {
>> +		/* Inject a wait-dummy-node. */
>> +		llist_add(wait_head, &rcu_state.srs_next);
>> +	} else {
>> +		// Kick another GP for first node.
>>  		start_new_poll = true;
>> -		return start_new_poll;
>> +		if (first == rcu_state.srs_done_tail)
> 
> small nit:
> Does done_tail access here need smp_load_acquire() or READ_ONCE() to match the
> other users?
> 

As srs_done_tail is only updated in RCU GP thread context, I think it is not required.
Please correct me if I am wrong here.
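
(For illustration, the annotated form Joel is asking about would be roughly
the following; whether the annotation is actually needed here is exactly the
open question being discussed:)

        } else {
                // Kick another GP for first node.
                start_new_poll = true;
                /* READ_ONCE() would mark the read for KCSAN. */
                if (first == READ_ONCE(rcu_state.srs_done_tail))
                        return start_new_poll;
                wait_head = first;
        }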

> Also if you don't mind could you please rebase your patch on top of mine [1] ? I
> think it will otherwise trigger this warning in my patch:

Sure!


Thanks
Neeraj

> 
> WARN_ON_ONCE(!rcu);
> 
> Because I always assume there to be at least 2 wait heads at clean up time.
> 
> [1] https://lore.kernel.org/all/20240308224439.281349-1-joel@joelfernandes.org/
> 
> Thanks!
> 
>  - Joel
> 
> 
>> +			return start_new_poll;
>> +		wait_head = first;
>>  	}
>>  
>> -	/* Inject a wait-dummy-node. */
>> -	llist_add(wait_head, &rcu_state.srs_next);
>> -
>>  	/*
>>  	 * A waiting list of rcu_synchronize nodes should be empty on
>>  	 * this step, since a GP-kthread, rcu_gp_init() -> gp_cleanup(),
Neeraj Upadhyay March 13, 2024, 4:11 p.m. UTC | #4
Hi Frederic,

On 3/13/2024 8:48 PM, Frederic Weisbecker wrote:
> On Wed, Mar 13, 2024 at 02:02:28PM +0530, Neeraj Upadhyay wrote:
>> When all wait heads are in use, which can happen when
>> rcu_sr_normal_gp_cleanup_work()'s callback processing
>> is slow, any new synchronize_rcu() user's rcu_synchronize
>> node's processing is deferred to future GP periods. This
>> can result in long list of synchronize_rcu() invocations
>> waiting for full grace period processing, which can delay
>> freeing of memory. Mitigate this problem by using first
>> node in the list as wait tail when all wait heads are in use.
>> While methods to speed up callback processing would be needed
>> to recover from this situation, allowing new nodes to complete
>> their grace period can help prevent delays due to a fixed
>> number of wait head nodes.
>>
>> Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
>> ---
>>  kernel/rcu/tree.c | 27 +++++++++++++--------------
>>  1 file changed, 13 insertions(+), 14 deletions(-)
>>
>> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
>> index 9fbb5ab57c84..bdccce1ed62f 100644
>> --- a/kernel/rcu/tree.c
>> +++ b/kernel/rcu/tree.c
>> @@ -1470,14 +1470,11 @@ static void rcu_poll_gp_seq_end_unlocked(unsigned long *snap)
>>   * for this new grace period. Given that there are a fixed
>>   * number of wait nodes, if all wait nodes are in use
>>   * (which can happen when kworker callback processing
>> - * is delayed) and additional grace period is requested.
>> - * This means, a system is slow in processing callbacks.
>> - *
>> - * TODO: If a slow processing is detected, a first node
>> - * in the llist should be used as a wait-tail for this
>> - * grace period, therefore users which should wait due
>> - * to a slow process are handled by _this_ grace period
>> - * and not next.
>> + * is delayed), first node in the llist is used as wait
>> + * tail for this grace period. This means, the first node
>> + * has to go through additional grace periods before it is
>> + * part of the wait callbacks. This should be ok, as
>> + * the system is slow in processing callbacks anyway.
>>   *
>>   * Below is an illustration of how the done and wait
>>   * tail pointers move from one set of rcu_synchronize nodes
>> @@ -1725,15 +1722,17 @@ static bool rcu_sr_normal_gp_init(void)
>>  		return start_new_poll;
>>  
>>  	wait_head = rcu_sr_get_wait_head();
>> -	if (!wait_head) {
>> -		// Kick another GP to retry.
>> +	if (wait_head) {
>> +		/* Inject a wait-dummy-node. */
>> +		llist_add(wait_head, &rcu_state.srs_next);
>> +	} else {
>> +		// Kick another GP for first node.
>>  		start_new_poll = true;
>> -		return start_new_poll;
>> +		if (first == rcu_state.srs_done_tail)
>> +			return start_new_poll;
>> +		wait_head = first;
> 
> This means you're setting a non-wait-head as srs_wait_tail, right?
> Doesn't it trigger the following warning in rcu_sr_normal_gp_cleanup():
> 
> 	WARN_ON_ONCE(!rcu_sr_is_wait_head(wait_tail));
> 

Oh I missed it. Will fix it, thanks!

> Also there is a risk that this non-wait-head gets later assigned as
> rcu_state.srs_done_tail. And then this pending sr may not be completed
> until the next grace period calling rcu_sr_normal_gp_cleanup()? (Because
> the work doesn't take care of rcu_state.srs_done_tail itself). And then
> the delay can be arbitrary.
> 

That is correct. Only the first node suffers from a deferred GP.
If there are a large number of callbacks which got added after the
last available wait head was queued, all those callbacks (except one)
can still have a GP assigned to them.

> And the next grace period completing this sr (that non-wait-head set
> as rcu_state.srs_done_tail) and thus allowing its caller to wipe it out
> of its stack may race with the cleanup work dereferencing it?
> 

This sr can only be completed when the done tail moves to a new node. Till
then, it gets deferred continuously. So, we won't enter the situation
where the sr processing is complete while the done tail is still pointing
to it. Please correct me if I am missing something here.




Thanks
Neeraj

> Thanks.
> 
> 
> 
>>  	}
>>  
>> -	/* Inject a wait-dummy-node. */
>> -	llist_add(wait_head, &rcu_state.srs_next);
>> -
>>  	/*
>>  	 * A waiting list of rcu_synchronize nodes should be empty on
>>  	 * this step, since a GP-kthread, rcu_gp_init() -> gp_cleanup(),
>> -- 
>> 2.34.1
>>
>>
Joel Fernandes March 13, 2024, 4:13 p.m. UTC | #5
On 3/13/2024 12:04 PM, Neeraj Upadhyay wrote:
> Hi Joel,
> 
> On 3/13/2024 8:10 PM, Joel Fernandes wrote:
>> Hi Neeraj,
>>
>> On 3/13/2024 4:32 AM, Neeraj Upadhyay wrote:
>>> When all wait heads are in use, which can happen when
>>> rcu_sr_normal_gp_cleanup_work()'s callback processing
>>> is slow, any new synchronize_rcu() user's rcu_synchronize
>>> node's processing is deferred to future GP periods. This
>>> can result in long list of synchronize_rcu() invocations
>>> waiting for full grace period processing, which can delay
>>> freeing of memory. Mitigate this problem by using first
>>> node in the list as wait tail when all wait heads are in use.
>>> While methods to speed up callback processing would be needed
>>> to recover from this situation, allowing new nodes to complete
>>> their grace period can help prevent delays due to a fixed
>>> number of wait head nodes.
>>>
>>> Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
>>> ---
>>>  kernel/rcu/tree.c | 27 +++++++++++++--------------
>>>  1 file changed, 13 insertions(+), 14 deletions(-)
>>>
>>> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
>>> index 9fbb5ab57c84..bdccce1ed62f 100644
>>> --- a/kernel/rcu/tree.c
>>> +++ b/kernel/rcu/tree.c
>>> @@ -1470,14 +1470,11 @@ static void rcu_poll_gp_seq_end_unlocked(unsigned long *snap)
>>>   * for this new grace period. Given that there are a fixed
>>>   * number of wait nodes, if all wait nodes are in use
>>>   * (which can happen when kworker callback processing
>>> - * is delayed) and additional grace period is requested.
>>> - * This means, a system is slow in processing callbacks.
>>> - *
>>> - * TODO: If a slow processing is detected, a first node
>>> - * in the llist should be used as a wait-tail for this
>>> - * grace period, therefore users which should wait due
>>> - * to a slow process are handled by _this_ grace period
>>> - * and not next.
>>> + * is delayed), first node in the llist is used as wait
>>> + * tail for this grace period. This means, the first node
>>> + * has to go through additional grace periods before it is
>>> + * part of the wait callbacks. This should be ok, as
>>> + * the system is slow in processing callbacks anyway.
>>>   *
>>>   * Below is an illustration of how the done and wait
>>>   * tail pointers move from one set of rcu_synchronize nodes
>>> @@ -1725,15 +1722,17 @@ static bool rcu_sr_normal_gp_init(void)
>>>  		return start_new_poll;
>>>  
>>>  	wait_head = rcu_sr_get_wait_head();
>>> -	if (!wait_head) {
>>> -		// Kick another GP to retry.
>>> +	if (wait_head) {
>>> +		/* Inject a wait-dummy-node. */
>>> +		llist_add(wait_head, &rcu_state.srs_next);
>>> +	} else {
>>> +		// Kick another GP for first node.
>>>  		start_new_poll = true;
>>> -		return start_new_poll;
>>> +		if (first == rcu_state.srs_done_tail)
>>
>> small nit:
>> Does done_tail access here need smp_load_acquire() or READ_ONCE() to match the
>> other users?
>>
> 
> As srs_done_tail is only updated in RCU GP thread context, I think it is not required.
> Please correct me if I am wrong here.

But will KCSAN not scream that it's a data race?

thanks,

 - Joel
Neeraj Upadhyay March 13, 2024, 4:19 p.m. UTC | #6
On 3/13/2024 9:43 PM, Joel Fernandes wrote:
> 
> 
> On 3/13/2024 12:04 PM, Neeraj Upadhyay wrote:
>> Hi Joel,
>>
>> On 3/13/2024 8:10 PM, Joel Fernandes wrote:
>>> Hi Neeraj,
>>>
>>> On 3/13/2024 4:32 AM, Neeraj Upadhyay wrote:
>>>> When all wait heads are in use, which can happen when
>>>> rcu_sr_normal_gp_cleanup_work()'s callback processing
>>>> is slow, any new synchronize_rcu() user's rcu_synchronize
>>>> node's processing is deferred to future GP periods. This
>>>> can result in long list of synchronize_rcu() invocations
>>>> waiting for full grace period processing, which can delay
>>>> freeing of memory. Mitigate this problem by using first
>>>> node in the list as wait tail when all wait heads are in use.
>>>> While methods to speed up callback processing would be needed
>>>> to recover from this situation, allowing new nodes to complete
>>>> their grace period can help prevent delays due to a fixed
>>>> number of wait head nodes.
>>>>
>>>> Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
>>>> ---
>>>>  kernel/rcu/tree.c | 27 +++++++++++++--------------
>>>>  1 file changed, 13 insertions(+), 14 deletions(-)
>>>>
>>>> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
>>>> index 9fbb5ab57c84..bdccce1ed62f 100644
>>>> --- a/kernel/rcu/tree.c
>>>> +++ b/kernel/rcu/tree.c
>>>> @@ -1470,14 +1470,11 @@ static void rcu_poll_gp_seq_end_unlocked(unsigned long *snap)
>>>>   * for this new grace period. Given that there are a fixed
>>>>   * number of wait nodes, if all wait nodes are in use
>>>>   * (which can happen when kworker callback processing
>>>> - * is delayed) and additional grace period is requested.
>>>> - * This means, a system is slow in processing callbacks.
>>>> - *
>>>> - * TODO: If a slow processing is detected, a first node
>>>> - * in the llist should be used as a wait-tail for this
>>>> - * grace period, therefore users which should wait due
>>>> - * to a slow process are handled by _this_ grace period
>>>> - * and not next.
>>>> + * is delayed), first node in the llist is used as wait
>>>> + * tail for this grace period. This means, the first node
>>>> + * has to go through additional grace periods before it is
>>>> + * part of the wait callbacks. This should be ok, as
>>>> + * the system is slow in processing callbacks anyway.
>>>>   *
>>>>   * Below is an illustration of how the done and wait
>>>>   * tail pointers move from one set of rcu_synchronize nodes
>>>> @@ -1725,15 +1722,17 @@ static bool rcu_sr_normal_gp_init(void)
>>>>  		return start_new_poll;
>>>>  
>>>>  	wait_head = rcu_sr_get_wait_head();
>>>> -	if (!wait_head) {
>>>> -		// Kick another GP to retry.
>>>> +	if (wait_head) {
>>>> +		/* Inject a wait-dummy-node. */
>>>> +		llist_add(wait_head, &rcu_state.srs_next);
>>>> +	} else {
>>>> +		// Kick another GP for first node.
>>>>  		start_new_poll = true;
>>>> -		return start_new_poll;
>>>> +		if (first == rcu_state.srs_done_tail)
>>>
>>> small nit:
>>> Does done_tail access here need smp_load_acquire() or READ_ONCE() to match the
>>> other users?
>>>
>>
>> As srs_done_tail is only updated in RCU GP thread context, I think it is not required.
>> Please correct me if I am wrong here.
> 
> But will KCSAN not scream that its a data race?
> 

For reads from the exclusive writer context? Interesting, let me check that...


Thanks
Neeraj

> thanks,
> 
>  - Joel
>
Frederic Weisbecker March 13, 2024, 4:43 p.m. UTC | #7
On Wed, Mar 13, 2024 at 09:41:58PM +0530, Neeraj Upadhyay wrote:
> Hi Frederic,
> 
> On 3/13/2024 8:48 PM, Frederic Weisbecker wrote:
> > On Wed, Mar 13, 2024 at 02:02:28PM +0530, Neeraj Upadhyay wrote:
> >> When all wait heads are in use, which can happen when
> >> rcu_sr_normal_gp_cleanup_work()'s callback processing
> >> is slow, any new synchronize_rcu() user's rcu_synchronize
> >> node's processing is deferred to future GP periods. This
> >> can result in long list of synchronize_rcu() invocations
> >> waiting for full grace period processing, which can delay
> >> freeing of memory. Mitigate this problem by using first
> >> node in the list as wait tail when all wait heads are in use.
> >> While methods to speed up callback processing would be needed
> >> to recover from this situation, allowing new nodes to complete
> >> their grace period can help prevent delays due to a fixed
> >> number of wait head nodes.
> >>
> >> Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
> >> ---
> >>  kernel/rcu/tree.c | 27 +++++++++++++--------------
> >>  1 file changed, 13 insertions(+), 14 deletions(-)
> >>
> >> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> >> index 9fbb5ab57c84..bdccce1ed62f 100644
> >> --- a/kernel/rcu/tree.c
> >> +++ b/kernel/rcu/tree.c
> >> @@ -1470,14 +1470,11 @@ static void rcu_poll_gp_seq_end_unlocked(unsigned long *snap)
> >>   * for this new grace period. Given that there are a fixed
> >>   * number of wait nodes, if all wait nodes are in use
> >>   * (which can happen when kworker callback processing
> >> - * is delayed) and additional grace period is requested.
> >> - * This means, a system is slow in processing callbacks.
> >> - *
> >> - * TODO: If a slow processing is detected, a first node
> >> - * in the llist should be used as a wait-tail for this
> >> - * grace period, therefore users which should wait due
> >> - * to a slow process are handled by _this_ grace period
> >> - * and not next.
> >> + * is delayed), first node in the llist is used as wait
> >> + * tail for this grace period. This means, the first node
> >> + * has to go through additional grace periods before it is
> >> + * part of the wait callbacks. This should be ok, as
> >> + * the system is slow in processing callbacks anyway.
> >>   *
> >>   * Below is an illustration of how the done and wait
> >>   * tail pointers move from one set of rcu_synchronize nodes
> >> @@ -1725,15 +1722,17 @@ static bool rcu_sr_normal_gp_init(void)
> >>  		return start_new_poll;
> >>  
> >>  	wait_head = rcu_sr_get_wait_head();
> >> -	if (!wait_head) {
> >> -		// Kick another GP to retry.
> >> +	if (wait_head) {
> >> +		/* Inject a wait-dummy-node. */
> >> +		llist_add(wait_head, &rcu_state.srs_next);
> >> +	} else {
> >> +		// Kick another GP for first node.
> >>  		start_new_poll = true;
> >> -		return start_new_poll;
> >> +		if (first == rcu_state.srs_done_tail)
> >> +			return start_new_poll;
> >> +		wait_head = first;
> > 
> > This means you're setting a non-wait-head as srs_wait_tail, right?
> > Doesn't it trigger the following warning in rcu_sr_normal_gp_cleanup():
> > 
> > 	WARN_ON_ONCE(!rcu_sr_is_wait_head(wait_tail));
> > 
> 
> Oh I missed it. Will fix it, thanks!
> 
> > Also there is a risk that this non-wait-head gets later assigned as
> > rcu_state.srs_done_tail. And then this pending sr may not be completed
> > until the next grace period calling rcu_sr_normal_gp_cleanup()? (Because
> > the work doesn't take care of rcu_state.srs_done_tail itself). And then
> > the delay can be arbitrary.
> > 
> 
> That is correct. Only the first node suffers from deferred GP.
> If there are large number of callbacks which got added after
> last available wait head was queued, all those callbacks (except one)
> can still have a GP assigned to them.
> 
> > And the next grace period completing this sr (that non-wait-head set
> > as rcu_state.srs_done_tail) and thus allowing its caller to wipe it out
> > of its stack may race with the cleanup work dereferencing it?
> > 
> 
> This sr can only be completed when done tail moves to new node. Till
> then, it gets deferred continuously. So, we won't be entering into
> the situation where the sr processing is complete while done tail is pointing
> to it. Please correct me if I am missing something here.

Ok I'm confused as usual. Let's take a practical case. Is the following
sequence possible?

1) wait_tail = NULL
   done_tail = WH4->SR4->WH3->SR3->WH2->SR2->WH1->SR1...

Initial layout

2) wait_tail = SR5 -> WH4...
   done_tail = WH4->SR4->WH3->SR3->WH2->SR2->WH1->SR1...

New GP

3) wait_tail = NULL
   done_tail = SR5->WH4->SR4->WH3->SR3->WH2->SR2->WH1->SR1...

GP completes, normal cleanup

3) wait_tail = SR6->SR5...
   done_tail = SR5->WH4->SR4->WH3->SR2->WH2->SR1->WH1->SR1...

New GP

4) GP completes and SR5 is completed by rcu_sr_normal_gp_cleanup(). So
   the caller releases it from the stack. But before rcu_sr_normal_gp_cleanup()
   overwrites done_tail to SR6->WH4->SR4.... , the workqueue manages to run
   and concurrently dereferences SR5.

But I bet I'm missing something obvious in the middle, preventing that...
Neeraj Upadhyay March 13, 2024, 4:54 p.m. UTC | #8
Hi Frederic,

On 3/13/2024 10:13 PM, Frederic Weisbecker wrote:
> On Wed, Mar 13, 2024 at 09:41:58PM +0530, Neeraj Upadhyay wrote:
>> Hi Frederic,
>>
>> On 3/13/2024 8:48 PM, Frederic Weisbecker wrote:
>>> On Wed, Mar 13, 2024 at 02:02:28PM +0530, Neeraj Upadhyay wrote:
>>>> When all wait heads are in use, which can happen when
>>>> rcu_sr_normal_gp_cleanup_work()'s callback processing
>>>> is slow, any new synchronize_rcu() user's rcu_synchronize
>>>> node's processing is deferred to future GP periods. This
>>>> can result in long list of synchronize_rcu() invocations
>>>> waiting for full grace period processing, which can delay
>>>> freeing of memory. Mitigate this problem by using first
>>>> node in the list as wait tail when all wait heads are in use.
>>>> While methods to speed up callback processing would be needed
>>>> to recover from this situation, allowing new nodes to complete
>>>> their grace period can help prevent delays due to a fixed
>>>> number of wait head nodes.
>>>>
>>>> Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
>>>> ---
>>>>  kernel/rcu/tree.c | 27 +++++++++++++--------------
>>>>  1 file changed, 13 insertions(+), 14 deletions(-)
>>>>
>>>> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
>>>> index 9fbb5ab57c84..bdccce1ed62f 100644
>>>> --- a/kernel/rcu/tree.c
>>>> +++ b/kernel/rcu/tree.c
>>>> @@ -1470,14 +1470,11 @@ static void rcu_poll_gp_seq_end_unlocked(unsigned long *snap)
>>>>   * for this new grace period. Given that there are a fixed
>>>>   * number of wait nodes, if all wait nodes are in use
>>>>   * (which can happen when kworker callback processing
>>>> - * is delayed) and additional grace period is requested.
>>>> - * This means, a system is slow in processing callbacks.
>>>> - *
>>>> - * TODO: If a slow processing is detected, a first node
>>>> - * in the llist should be used as a wait-tail for this
>>>> - * grace period, therefore users which should wait due
>>>> - * to a slow process are handled by _this_ grace period
>>>> - * and not next.
>>>> + * is delayed), first node in the llist is used as wait
>>>> + * tail for this grace period. This means, the first node
>>>> + * has to go through additional grace periods before it is
>>>> + * part of the wait callbacks. This should be ok, as
>>>> + * the system is slow in processing callbacks anyway.
>>>>   *
>>>>   * Below is an illustration of how the done and wait
>>>>   * tail pointers move from one set of rcu_synchronize nodes
>>>> @@ -1725,15 +1722,17 @@ static bool rcu_sr_normal_gp_init(void)
>>>>  		return start_new_poll;
>>>>  
>>>>  	wait_head = rcu_sr_get_wait_head();
>>>> -	if (!wait_head) {
>>>> -		// Kick another GP to retry.
>>>> +	if (wait_head) {
>>>> +		/* Inject a wait-dummy-node. */
>>>> +		llist_add(wait_head, &rcu_state.srs_next);
>>>> +	} else {
>>>> +		// Kick another GP for first node.
>>>>  		start_new_poll = true;
>>>> -		return start_new_poll;
>>>> +		if (first == rcu_state.srs_done_tail)
>>>> +			return start_new_poll;
>>>> +		wait_head = first;
>>>
>>> This means you're setting a non-wait-head as srs_wait_tail, right?
>>> Doesn't it trigger the following warning in rcu_sr_normal_gp_cleanup():
>>>
>>> 	WARN_ON_ONCE(!rcu_sr_is_wait_head(wait_tail));
>>>
>>
>> Oh I missed it. Will fix it, thanks!
>>
>>> Also there is a risk that this non-wait-head gets later assigned as
>>> rcu_state.srs_done_tail. And then this pending sr may not be completed
>>> until the next grace period calling rcu_sr_normal_gp_cleanup()? (Because
>>> the work doesn't take care of rcu_state.srs_done_tail itself). And then
>>> the delay can be arbitrary.
>>>
>>
>> That is correct. Only the first node suffers from deferred GP.
>> If there are large number of callbacks which got added after
>> last available wait head was queued, all those callbacks (except one)
>> can still have a GP assigned to them.
>>
>>> And the next grace period completing this sr (that non-wait-head set
>>> as rcu_state.srs_done_tail) and thus allowing its caller to wipe it out
>>> of its stack may race with the cleanup work dereferencing it?
>>>
>>
>> This sr can only be completed when done tail moves to new node. Till
>> then, it gets deferred continuously. So, we won't be entering into
>> the situation where the sr processing is complete while done tail is pointing
>> to it. Please correct me if I am missing something here.
> 
> Ok I'm confused as usual. Let's take a practical case. Is the following
> sequence possible?
> 
> 1) wait_tail = NULL
>    done_tail = WH4->SR4->WH3->SR3->WH2->SR2->WH1->SR1...
> 
> Initial layout
> 
> 2) wait_tail = SR5 -> WH4...
>    done_tail = WH4->SR4->WH3->SR3->WH2->SR2->WH1->SR1...
> 
> New GP
> 
> 3) wait_tail = NULL
>    done_tail = SR5->WH4->SR4->WH3->SR3->WH2->SR2->WH1->SR1...
> 
> GP completes, normal cleanup
> 
> 3) wait_tail = SR6->SR5...
>    done_tail = SR5->WH4->SR4->WH3->SR2->WH2->SR1->WH1->SR1...
> 
> New GP
> 
> 4) GP completes and SR5 is completed by rcu_sr_normal_gp_cleanup(). So
>    the caller releases it from the stack. But before rcu_sr_normal_gp_cleanup()
>    overwrites done_tail to SR6->WH4->SR4.... , the workqueue manages to run
>    and concurrently dereferences SR5.
> 
> But I bet I'm missing something obvious in the middle, preventing that...

Your analysis looks correct to me. Maybe one way to fix this could be for
rcu_sr_normal_gp_cleanup() to stop processing nodes in its context
when it reaches the done tail and the done tail is not a wait head. I will
think more on this, thanks!
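
(Very roughly, and purely as a sketch of that proposal rather than tested
code, the guard could look like the following in the loop that completes
done-segment nodes, assuming the existing rcu_sr_is_wait_head() helper and
with head being the start of the done segment:)

        struct llist_node *rcu, *next;

        llist_for_each_safe(rcu, next, head) {
                /*
                 * Proposed guard: never complete a node that is still
                 * rcu_state.srs_done_tail and is not a wait head; leave
                 * it for a later grace period / cleanup pass instead.
                 */
                if (!rcu_sr_is_wait_head(rcu) &&
                    rcu == READ_ONCE(rcu_state.srs_done_tail))
                        break;

                /* ... existing completion / wait-head recycling ... */
        }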


Thanks
Neeraj
Frederic Weisbecker March 13, 2024, 5:05 p.m. UTC | #9
On Wed, Mar 13, 2024 at 09:41:58PM +0530, Neeraj Upadhyay wrote:
> > Also there is a risk that this non-wait-head gets later assigned as
> > rcu_state.srs_done_tail. And then this pending sr may not be completed
> > until the next grace period calling rcu_sr_normal_gp_cleanup()? (Because
> > the work doesn't take care of rcu_state.srs_done_tail itself). And then
> > the delay can be arbitrary.
> > 
> 
> That is correct. Only the first node suffers from deferred GP.
> If there are large number of callbacks which got added after
> last available wait head was queued, all those callbacks (except one)
> can still have a GP assigned to them.

Oh and yes I missed the fact that you still reissue a GP while queueing
a non wait-head as ->srs_wait_tail. That point looks good.

Thanks.
Frederic Weisbecker March 13, 2024, 5:15 p.m. UTC | #10
On Wed, Mar 13, 2024 at 10:24:58PM +0530, Neeraj Upadhyay wrote:
> Hi Frederic,
> 
> On 3/13/2024 10:13 PM, Frederic Weisbecker wrote:
> On Wed, Mar 13, 2024 at 09:41:58PM +0530, Neeraj Upadhyay wrote:
> >> Hi Frederic,
> >>
> >> On 3/13/2024 8:48 PM, Frederic Weisbecker wrote:
>>> On Wed, Mar 13, 2024 at 02:02:28PM +0530, Neeraj Upadhyay wrote:
> >>>> When all wait heads are in use, which can happen when
> >>>> rcu_sr_normal_gp_cleanup_work()'s callback processing
> >>>> is slow, any new synchronize_rcu() user's rcu_synchronize
> >>>> node's processing is deferred to future GP periods. This
> >>>> can result in long list of synchronize_rcu() invocations
> >>>> waiting for full grace period processing, which can delay
> >>>> freeing of memory. Mitigate this problem by using first
> >>>> node in the list as wait tail when all wait heads are in use.
> >>>> While methods to speed up callback processing would be needed
> >>>> to recover from this situation, allowing new nodes to complete
> >>>> their grace period can help prevent delays due to a fixed
> >>>> number of wait head nodes.
> >>>>
> >>>> Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
> >>>> ---
> >>>>  kernel/rcu/tree.c | 27 +++++++++++++--------------
> >>>>  1 file changed, 13 insertions(+), 14 deletions(-)
> >>>>
> >>>> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> >>>> index 9fbb5ab57c84..bdccce1ed62f 100644
> >>>> --- a/kernel/rcu/tree.c
> >>>> +++ b/kernel/rcu/tree.c
> >>>> @@ -1470,14 +1470,11 @@ static void rcu_poll_gp_seq_end_unlocked(unsigned long *snap)
> >>>>   * for this new grace period. Given that there are a fixed
> >>>>   * number of wait nodes, if all wait nodes are in use
> >>>>   * (which can happen when kworker callback processing
> >>>> - * is delayed) and additional grace period is requested.
> >>>> - * This means, a system is slow in processing callbacks.
> >>>> - *
> >>>> - * TODO: If a slow processing is detected, a first node
> >>>> - * in the llist should be used as a wait-tail for this
> >>>> - * grace period, therefore users which should wait due
> >>>> - * to a slow process are handled by _this_ grace period
> >>>> - * and not next.
> >>>> + * is delayed), first node in the llist is used as wait
> >>>> + * tail for this grace period. This means, the first node
> >>>> + * has to go through additional grace periods before it is
> >>>> + * part of the wait callbacks. This should be ok, as
> >>>> + * the system is slow in processing callbacks anyway.
> >>>>   *
> >>>>   * Below is an illustration of how the done and wait
> >>>>   * tail pointers move from one set of rcu_synchronize nodes
> >>>> @@ -1725,15 +1722,17 @@ static bool rcu_sr_normal_gp_init(void)
> >>>>  		return start_new_poll;
> >>>>  
> >>>>  	wait_head = rcu_sr_get_wait_head();
> >>>> -	if (!wait_head) {
> >>>> -		// Kick another GP to retry.
> >>>> +	if (wait_head) {
> >>>> +		/* Inject a wait-dummy-node. */
> >>>> +		llist_add(wait_head, &rcu_state.srs_next);
> >>>> +	} else {
> >>>> +		// Kick another GP for first node.
> >>>>  		start_new_poll = true;
> >>>> -		return start_new_poll;
> >>>> +		if (first == rcu_state.srs_done_tail)
> >>>> +			return start_new_poll;
> >>>> +		wait_head = first;
> >>>
> >>> This means you're setting a non-wait-head as srs_wait_tail, right?
> >>> Doesn't it trigger the following warning in rcu_sr_normal_gp_cleanup():
> >>>
> >>> 	WARN_ON_ONCE(!rcu_sr_is_wait_head(wait_tail));
> >>>
> >>
> >> Oh I missed it. Will fix it, thanks!
> >>
> >>> Also there is a risk that this non-wait-head gets later assigned as
> >>> rcu_state.srs_done_tail. And then this pending sr may not be completed
> >>> until the next grace period calling rcu_sr_normal_gp_cleanup()? (Because
> >>> the work doesn't take care of rcu_state.srs_done_tail itself). And then
> >>> the delay can be arbitrary.
> >>>
> >>
> >> That is correct. Only the first node suffers from deferred GP.
> >> If there are large number of callbacks which got added after
> >> last available wait head was queued, all those callbacks (except one)
> >> can still have a GP assigned to them.
> >>
> >>> And the next grace period completing this sr (that non-wait-head set
> >>> as rcu_state.srs_done_tail) and thus allowing its caller to wipe it out
> >>> of its stack may race with the cleanup work dereferencing it?
> >>>
> >>
> >> This sr can only be completed when done tail moves to new node. Till
> >> then, it gets deferred continuously. So, we won't be entering into
> >> the situation where the sr processing is complete while done tail is pointing
> >> to it. Please correct me if I am missing something here.
> > 
> > Ok I'm confused as usual. Let's take a practical case. Is the following
> > sequence possible?
> > 
> > 1) wait_tail = NULL
> >    done_tail = WH4->SR4->WH3->SR3->WH2->SR2->WH1->SR1...
> > 
> > Initial layout
> > 
> > 2) wait_tail = SR5 -> WH4...
> >    done_tail = WH4->SR4->WH3->SR3->WH2->SR2->WH1->SR1...
> > 
> > New GP
> > 
> > 3) wait_tail = NULL
> >    done_tail = SR5->WH4->SR4->WH3->SR3->WH2->SR2->WH1->SR1...
> > 
> > GP completes, normal cleanup
> > 
> > 3) wait_tail = SR6->SR5...
> >    done_tail = SR5->WH4->SR4->WH3->SR2->WH2->SR1->WH1->SR1...
> > 
> > New GP
> > 
> > 4) GP completes and SR5 is completed by rcu_sr_normal_gp_cleanup(). So
> >    the caller releases it from the stack. But before rcu_sr_normal_gp_cleanup()
> >    overwrites done_tail to SR6->WH4->SR4.... , the workqueue manages to run
> >    and concurrently dereferences SR5.
> > 
> > But I bet I'm missing something obvious in the middle, preventing that...
> 
> Your analysis looks correct to me. Maybe, one way to fix this can be that
> rcu_sr_normal_gp_cleanup() stops processing nodes in its context,
> when it reaches done tail and done tail is not a wait head. I will
> think more on this, thanks!

That alone is probably not enough. In the end you may end up with a real
pending sr stuck as done tail without completion, until one day a
new real queue comes up, preferably with a real wait head in order not
to get stuck with a new sr as done tail.

> 
> 
> Thanks
> Neeraj
Neeraj Upadhyay March 13, 2024, 5:26 p.m. UTC | #11
On 3/13/2024 10:45 PM, Frederic Weisbecker wrote:
> On Wed, Mar 13, 2024 at 10:24:58PM +0530, Neeraj Upadhyay wrote:
>> Hi Frederic,
>>
>> On 3/13/2024 10:13 PM, Frederic Weisbecker wrote:
>>> On Wed, Mar 13, 2024 at 09:41:58PM +0530, Neeraj Upadhyay wrote:
>>>> Hi Frederic,
>>>>
>>>> On 3/13/2024 8:48 PM, Frederic Weisbecker wrote:
>>>>> On Wed, Mar 13, 2024 at 02:02:28PM +0530, Neeraj Upadhyay wrote:
>>>>>> When all wait heads are in use, which can happen when
>>>>>> rcu_sr_normal_gp_cleanup_work()'s callback processing
>>>>>> is slow, any new synchronize_rcu() user's rcu_synchronize
>>>>>> node's processing is deferred to future GP periods. This
>>>>>> can result in long list of synchronize_rcu() invocations
>>>>>> waiting for full grace period processing, which can delay
>>>>>> freeing of memory. Mitigate this problem by using first
>>>>>> node in the list as wait tail when all wait heads are in use.
>>>>>> While methods to speed up callback processing would be needed
>>>>>> to recover from this situation, allowing new nodes to complete
>>>>>> their grace period can help prevent delays due to a fixed
>>>>>> number of wait head nodes.
>>>>>>
>>>>>> Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
>>>>>> ---
>>>>>>  kernel/rcu/tree.c | 27 +++++++++++++--------------
>>>>>>  1 file changed, 13 insertions(+), 14 deletions(-)
>>>>>>
>>>>>> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
>>>>>> index 9fbb5ab57c84..bdccce1ed62f 100644
>>>>>> --- a/kernel/rcu/tree.c
>>>>>> +++ b/kernel/rcu/tree.c
>>>>>> @@ -1470,14 +1470,11 @@ static void rcu_poll_gp_seq_end_unlocked(unsigned long *snap)
>>>>>>   * for this new grace period. Given that there are a fixed
>>>>>>   * number of wait nodes, if all wait nodes are in use
>>>>>>   * (which can happen when kworker callback processing
>>>>>> - * is delayed) and additional grace period is requested.
>>>>>> - * This means, a system is slow in processing callbacks.
>>>>>> - *
>>>>>> - * TODO: If a slow processing is detected, a first node
>>>>>> - * in the llist should be used as a wait-tail for this
>>>>>> - * grace period, therefore users which should wait due
>>>>>> - * to a slow process are handled by _this_ grace period
>>>>>> - * and not next.
>>>>>> + * is delayed), first node in the llist is used as wait
>>>>>> + * tail for this grace period. This means, the first node
>>>>>> + * has to go through additional grace periods before it is
>>>>>> + * part of the wait callbacks. This should be ok, as
>>>>>> + * the system is slow in processing callbacks anyway.
>>>>>>   *
>>>>>>   * Below is an illustration of how the done and wait
>>>>>>   * tail pointers move from one set of rcu_synchronize nodes
>>>>>> @@ -1725,15 +1722,17 @@ static bool rcu_sr_normal_gp_init(void)
>>>>>>  		return start_new_poll;
>>>>>>  
>>>>>>  	wait_head = rcu_sr_get_wait_head();
>>>>>> -	if (!wait_head) {
>>>>>> -		// Kick another GP to retry.
>>>>>> +	if (wait_head) {
>>>>>> +		/* Inject a wait-dummy-node. */
>>>>>> +		llist_add(wait_head, &rcu_state.srs_next);
>>>>>> +	} else {
>>>>>> +		// Kick another GP for first node.
>>>>>>  		start_new_poll = true;
>>>>>> -		return start_new_poll;
>>>>>> +		if (first == rcu_state.srs_done_tail)
>>>>>> +			return start_new_poll;
>>>>>> +		wait_head = first;
>>>>>
>>>>> This means you're setting a non-wait-head as srs_wait_tail, right?
>>>>> Doesn't it trigger the following warning in rcu_sr_normal_gp_cleanup():
>>>>>
>>>>> 	WARN_ON_ONCE(!rcu_sr_is_wait_head(wait_tail));
>>>>>
>>>>
>>>> Oh I missed it. Will fix it, thanks!
>>>>
>>>>> Also there is a risk that this non-wait-head gets later assigned as
>>>>> rcu_state.srs_done_tail. And then this pending sr may not be completed
>>>>> until the next grace period calling rcu_sr_normal_gp_cleanup()? (Because
>>>>> the work doesn't take care of rcu_state.srs_done_tail itself). And then
>>>>> the delay can be arbitrary.
>>>>>
>>>>
>>>> That is correct. Only the first node suffers from deferred GP.
>>>> If there are large number of callbacks which got added after
>>>> last available wait head was queued, all those callbacks (except one)
>>>> can still have a GP assigned to them.
>>>>
>>>>> And the next grace period completing this sr (that non-wait-head set
>>>>> as rcu_state.srs_done_tail) and thus allowing its caller to wipe it out
>>>>> of its stack may race with the cleanup work dereferencing it?
>>>>>
>>>>
>>>> This sr can only be completed when done tail moves to new node. Till
>>>> then, it gets deferred continuously. So, we won't be entering into
>>>> the situation where the sr processing is complete while done tail is pointing
>>>> to it. Please correct me if I am missing something here.
>>>
>>> Ok I'm confused as usual. Let's take a practical case. Is the following
>>> sequence possible?
>>>
>>> 1) wait_tail = NULL
>>>    done_tail = WH4->SR4->WH3->SR3->WH2->SR2->WH1->SR1...
>>>
>>> Initial layout
>>>
>>> 2) wait_tail = SR5 -> WH4...
>>>    done_tail = WH4->SR4->WH3->SR3->WH2->SR2->WH1->SR1...
>>>
>>> New GP
>>>
>>> 3) wait_tail = NULL
>>>    done_tail = SR5->WH4->SR4->WH3->SR3->WH2->SR2->WH1->SR1...
>>>
>>> GP completes, normal cleanup
>>>
>>> 3) wait_tail = SR6->SR5...
>>>    done_tail = SR5->WH4->SR4->WH3->SR2->WH2->SR1->WH1->SR1...
>>>
>>> New GP
>>>
>>> 4) GP completes and SR5 is completed by rcu_sr_normal_gp_cleanup(). So
>>>    the caller releases it from the stack. But before rcu_sr_normal_gp_cleanup()
>>>    overwrites done_tail to SR6->WH4->SR4.... , the workqueue manages to run
>>>    and concurrently dereferences SR5.
>>>
>>> But I bet I'm missing something obvious in the middle, preventing that...
>>
>> Your analysis looks correct to me. Maybe, one way to fix this can be that
>> rcu_sr_normal_gp_cleanup() stops processing nodes in its context,
>> when it reaches done tail and done tail is not a wait head. I will
>> think more on this, thanks!
> 
> That alone is probably not enough. In the end you may end up with a real
> pending sr stuck as done tail without completion, until one day a
> new real queue comes up, preferably with a real wait head in order not
> to get stuck with a new sr as done tail.
> 

But after point 4 in your sequence, rcu_sr_normal_gp_cleanup() would move
the done tail to SR6 and queue new work, which will process SR5,
so we will be able to make progress on the real pending srs?


Thanks
Neeraj

>>
>>
>> Thanks
>> Neeraj
Neeraj Upadhyay March 13, 2024, 5:49 p.m. UTC | #12
On 3/13/2024 10:56 PM, Neeraj Upadhyay wrote:
> 
> 
> On 3/13/2024 10:45 PM, Frederic Weisbecker wrote:
>> On Wed, Mar 13, 2024 at 10:24:58PM +0530, Neeraj Upadhyay wrote:
>>> Hi Frederic,
>>>
>>> On 3/13/2024 10:13 PM, Frederic Weisbecker wrote:
>>>> On Wed, Mar 13, 2024 at 09:41:58PM +0530, Neeraj Upadhyay wrote:
>>>>> Hi Frederic,
>>>>>
>>>>> On 3/13/2024 8:48 PM, Frederic Weisbecker wrote:
>>>>>> On Wed, Mar 13, 2024 at 02:02:28PM +0530, Neeraj Upadhyay wrote:
>>>>>>> When all wait heads are in use, which can happen when
>>>>>>> rcu_sr_normal_gp_cleanup_work()'s callback processing
>>>>>>> is slow, any new synchronize_rcu() user's rcu_synchronize
>>>>>>> node's processing is deferred to future GP periods. This
>>>>>>> can result in long list of synchronize_rcu() invocations
>>>>>>> waiting for full grace period processing, which can delay
>>>>>>> freeing of memory. Mitigate this problem by using first
>>>>>>> node in the list as wait tail when all wait heads are in use.
>>>>>>> While methods to speed up callback processing would be needed
>>>>>>> to recover from this situation, allowing new nodes to complete
>>>>>>> their grace period can help prevent delays due to a fixed
>>>>>>> number of wait head nodes.
>>>>>>>
>>>>>>> Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
>>>>>>> ---
>>>>>>>  kernel/rcu/tree.c | 27 +++++++++++++--------------
>>>>>>>  1 file changed, 13 insertions(+), 14 deletions(-)
>>>>>>>
>>>>>>> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
>>>>>>> index 9fbb5ab57c84..bdccce1ed62f 100644
>>>>>>> --- a/kernel/rcu/tree.c
>>>>>>> +++ b/kernel/rcu/tree.c
>>>>>>> @@ -1470,14 +1470,11 @@ static void rcu_poll_gp_seq_end_unlocked(unsigned long *snap)
>>>>>>>   * for this new grace period. Given that there are a fixed
>>>>>>>   * number of wait nodes, if all wait nodes are in use
>>>>>>>   * (which can happen when kworker callback processing
>>>>>>> - * is delayed) and additional grace period is requested.
>>>>>>> - * This means, a system is slow in processing callbacks.
>>>>>>> - *
>>>>>>> - * TODO: If a slow processing is detected, a first node
>>>>>>> - * in the llist should be used as a wait-tail for this
>>>>>>> - * grace period, therefore users which should wait due
>>>>>>> - * to a slow process are handled by _this_ grace period
>>>>>>> - * and not next.
>>>>>>> + * is delayed), first node in the llist is used as wait
>>>>>>> + * tail for this grace period. This means, the first node
>>>>>>> + * has to go through additional grace periods before it is
>>>>>>> + * part of the wait callbacks. This should be ok, as
>>>>>>> + * the system is slow in processing callbacks anyway.
>>>>>>>   *
>>>>>>>   * Below is an illustration of how the done and wait
>>>>>>>   * tail pointers move from one set of rcu_synchronize nodes
>>>>>>> @@ -1725,15 +1722,17 @@ static bool rcu_sr_normal_gp_init(void)
>>>>>>>  		return start_new_poll;
>>>>>>>  
>>>>>>>  	wait_head = rcu_sr_get_wait_head();
>>>>>>> -	if (!wait_head) {
>>>>>>> -		// Kick another GP to retry.
>>>>>>> +	if (wait_head) {
>>>>>>> +		/* Inject a wait-dummy-node. */
>>>>>>> +		llist_add(wait_head, &rcu_state.srs_next);
>>>>>>> +	} else {
>>>>>>> +		// Kick another GP for first node.
>>>>>>>  		start_new_poll = true;
>>>>>>> -		return start_new_poll;
>>>>>>> +		if (first == rcu_state.srs_done_tail)
>>>>>>> +			return start_new_poll;
>>>>>>> +		wait_head = first;
>>>>>>
>>>>>> This means you're setting a non-wait-head as srs_wait_tail, right?
>>>>>> Doesn't it trigger the following warning in rcu_sr_normal_gp_cleanup():
>>>>>>
>>>>>> 	WARN_ON_ONCE(!rcu_sr_is_wait_head(wait_tail));
>>>>>>
>>>>>
>>>>> Oh I missed it. Will fix it, thanks!
>>>>>
>>>>>> Also there is a risk that this non-wait-head gets later assigned as
>>>>>> rcu_state.srs_done_tail. And then this pending sr may not be completed
>>>>>> until the next grace period calling rcu_sr_normal_gp_cleanup()? (Because
>>>>>> the work doesn't take care of rcu_state.srs_done_tail itself). And then
>>>>>> the delay can be arbitrary.
>>>>>>
>>>>>
>>>>> That is correct. Only the first node suffers from deferred GP.
>>>>> If there are large number of callbacks which got added after
>>>>> last available wait head was queued, all those callbacks (except one)
>>>>> can still have a GP assigned to them.
>>>>>
>>>>>> And the next grace period completing this sr (that non-wait-head set
>>>>>> as rcu_state.srs_done_tail) and thus allowing its caller to wipe it out
>>>>>> of its stack may race with the cleanup work dereferencing it?
>>>>>>
>>>>>
>>>>> This sr can only be completed when done tail moves to new node. Till
>>>>> then, it gets deferred continuously. So, we won't be entering into
>>>>> the situation where the sr processing is complete while done tail is pointing
>>>>> to it. Please correct me if I am missing something here.
>>>>
>>>> Ok I'm confused as usual. Let's take a practical case. Is the following
>>>> sequence possible?
>>>>
>>>> 1) wait_tail = NULL
>>>>    done_tail = WH4->SR4->WH3->SR3->WH2->SR2->WH1->SR1...
>>>>
>>>> Initial layout
>>>>
>>>> 2) wait_tail = SR5 -> WH4...
>>>>    done_tail = WH4->SR4->WH3->SR3->WH2->SR2->WH1->SR1...
>>>>
>>>> New GP
>>>>
>>>> 3) wait_tail = NULL
>>>>    done_tail = SR5->WH4->SR4->WH3->SR3->WH2->SR2->WH1->SR1...
>>>>
>>>> GP completes, normal cleanup
>>>>
>>>> 3) wait_tail = SR6->SR5...
>>>>    done_tail = SR5->WH4->SR4->WH3->SR2->WH2->SR1->WH1->SR1...
>>>>
>>>> New GP
>>>>
>>>> 4) GP completes and SR5 is completed by rcu_sr_normal_gp_cleanup(). So
>>>>    the caller releases it from the stack. But before rcu_sr_normal_gp_cleanup()
>>>>    overwrites done_tail to SR6->WH4->SR4.... , the workqueue manages to run
>>>>    and concurrently dereferences SR5.
>>>>
>>>> But I bet I'm missing something obvious in the middle, preventing that...
>>>
>>> Your analysis looks correct to me. Maybe, one way to fix this can be that
>>> rcu_sr_normal_gp_cleanup() stops processing nodes in its context,
>>> when it reaches done tail and done tail is not a wait head. I will
>>> think more on this, thanks!
>>
>> That alone is probably not enough. In the end you may end up with a real
>> pending sr stuck as done tail without completion, until one day a
>> new real queue comes up, preferably with a real wait head in order not
>> to get stuck with a new sr as done tail.
>>
> 
> But after point 4 in your sequence, rcu_sr_normal_gp_cleanup() would move
> the done tail to SR6 and queue a new work, which will process SR5,
> so, we will be able to progress real pending srs?
> 
> 

Missed one point. We will continue initiating new GPs until the first node's
cb is processed, which can happen when a wait head becomes available.
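
(For reference, the retry happens because rcu_sr_normal_gp_init()'s return
value feeds back into grace-period start; roughly, in rcu_gp_init(), it is
something along these lines, paraphrasing the mainline code:)

        start_new_poll = rcu_sr_normal_gp_init();

        /* ... rest of grace-period initialization ... */

        if (start_new_poll)
                (void) start_poll_synchronize_rcu();	/* kick yet another GP */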



Thanks
Neeraj

> Thanks
> Neeraj
> 
>>>
>>>
>>> Thanks
>>> Neeraj
Frederic Weisbecker March 13, 2024, 11:22 p.m. UTC | #13
On Wed, Mar 13, 2024 at 11:19:08PM +0530, Neeraj Upadhyay wrote:
> Missed one point. We will continue initiating new GP until first node cb
> is processed, which can happen when a wait head becomes available.

Ah yes, I missed that indeed. Ok that seem to make sense.

Patch

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 9fbb5ab57c84..bdccce1ed62f 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1470,14 +1470,11 @@  static void rcu_poll_gp_seq_end_unlocked(unsigned long *snap)
  * for this new grace period. Given that there are a fixed
  * number of wait nodes, if all wait nodes are in use
  * (which can happen when kworker callback processing
- * is delayed) and additional grace period is requested.
- * This means, a system is slow in processing callbacks.
- *
- * TODO: If a slow processing is detected, a first node
- * in the llist should be used as a wait-tail for this
- * grace period, therefore users which should wait due
- * to a slow process are handled by _this_ grace period
- * and not next.
+ * is delayed), first node in the llist is used as wait
+ * tail for this grace period. This means, the first node
+ * has to go through additional grace periods before it is
+ * part of the wait callbacks. This should be ok, as
+ * the system is slow in processing callbacks anyway.
  *
  * Below is an illustration of how the done and wait
  * tail pointers move from one set of rcu_synchronize nodes
@@ -1725,15 +1722,17 @@  static bool rcu_sr_normal_gp_init(void)
 		return start_new_poll;
 
 	wait_head = rcu_sr_get_wait_head();
-	if (!wait_head) {
-		// Kick another GP to retry.
+	if (wait_head) {
+		/* Inject a wait-dummy-node. */
+		llist_add(wait_head, &rcu_state.srs_next);
+	} else {
+		// Kick another GP for first node.
 		start_new_poll = true;
-		return start_new_poll;
+		if (first == rcu_state.srs_done_tail)
+			return start_new_poll;
+		wait_head = first;
 	}
 
-	/* Inject a wait-dummy-node. */
-	llist_add(wait_head, &rcu_state.srs_next);
-
 	/*
 	 * A waiting list of rcu_synchronize nodes should be empty on
 	 * this step, since a GP-kthread, rcu_gp_init() -> gp_cleanup(),