diff mbox series

[v3,1/1] rcu: Reduce synchronize_rcu() waiting time

Message ID 20231016173004.14148-1-urezki@gmail.com (mailing list archive)
State Superseded
Headers show
Series [v3,1/1] rcu: Reduce synchronize_rcu() waiting time | expand

Commit Message

Uladzislau Rezki Oct. 16, 2023, 5:30 p.m. UTC
A call to a synchronize_rcu() can be optimized from time point of
view. Different workloads can be affected by this especially the
ones which use this API in its time critical sections.

For example if CONFIG_RCU_NOCB_CPU is set, the wakeme_after_rcu()
callback can be delayed and such delay depends on where in a nocb
list it is located.

1. On our Android devices i can easily trigger the scenario when
it is a last in the list out of ~3600 callbacks:

<snip>
  <...>-29      [001] d..1. 21950.145313: rcu_batch_start: rcu_preempt CBs=3613 bl=28
...
  <...>-29      [001] ..... 21950.152578: rcu_invoke_callback: rcu_preempt rhp=00000000b2d6dee8 func=__free_vm_area_struct.cfi_jt
  <...>-29      [001] ..... 21950.152579: rcu_invoke_callback: rcu_preempt rhp=00000000a446f607 func=__free_vm_area_struct.cfi_jt
  <...>-29      [001] ..... 21950.152580: rcu_invoke_callback: rcu_preempt rhp=00000000a5cab03b func=__free_vm_area_struct.cfi_jt
  <...>-29      [001] ..... 21950.152581: rcu_invoke_callback: rcu_preempt rhp=0000000013b7e5ee func=__free_vm_area_struct.cfi_jt
  <...>-29      [001] ..... 21950.152582: rcu_invoke_callback: rcu_preempt rhp=000000000a8ca6f9 func=__free_vm_area_struct.cfi_jt
  <...>-29      [001] ..... 21950.152583: rcu_invoke_callback: rcu_preempt rhp=000000008f162ca8 func=wakeme_after_rcu.cfi_jt
  <...>-29      [001] d..1. 21950.152625: rcu_batch_end: rcu_preempt CBs-invoked=3612 idle=....
<snip>

2. We use cpuset/cgroup to classify tasks and assign them into
different cgroups. For example "backgrond" group which binds tasks
only to little CPUs or "foreground" which makes use of all CPUs.
Tasks can be migrated between groups by a request if an acceleration
is needed.

See below an example how "surfaceflinger" task gets migrated.
Initially it is located in the "system-background" cgroup which
allows to run only on little cores. In order to speed it up it
can be temporary moved into "foreground" cgroup which allows
to use big/all CPUs:

cgroup_attach_task():
 -> cgroup_migrate_execute()
   -> cpuset_can_attach()
     -> percpu_down_write()
       -> rcu_sync_enter()
         -> synchronize_rcu()
   -> now move tasks to the new cgroup.
 -> cgroup_migrate_finish()

<snip>
         rcuop/1-29      [000] .....  7030.528570: rcu_invoke_callback: rcu_preempt rhp=00000000461605e0 func=wakeme_after_rcu.cfi_jt
    PERFD-SERVER-1855    [000] d..1.  7030.530293: cgroup_attach_task: dst_root=3 dst_id=22 dst_level=1 dst_path=/foreground pid=1900 comm=surfaceflinger
    PERFD-SERVER-1855    [000] d..1.  7030.530383: cgroup_attach_task: dst_root=3 dst_id=22 dst_level=1 dst_path=/foreground pid=1900 comm=surfaceflinger
   TimerDispatch-2768    [002] d..5.  7030.537542: sched_migrate_task: comm=surfaceflinger pid=1900 prio=98 orig_cpu=0 dest_cpu=4
<snip>

"A moving time" depends on how fast synchronize_rcu() completes. See
the first trace line. The migration has not occurred until the sync
was done first. Please note, number of different callbacks to be
invoked can be thousands.

3. To address this drawback, maintain a separate track that consists
of synchronize_rcu() callers only. The GP-kthread, that drivers a GP
either wake-ups a worker to drain all list or directly wakes-up end
user if it is one in the drain list.

4. This patch improves the performance of synchronize_rcu() approximately
by 30% on synthetic tests. The real test case, camera launch time, shows
below figures(time is in milliseconds):

542 vs 489 diff: 9%
540 vs 466 diff: 13%
518 vs 468 diff: 9%
531 vs 457 diff: 13%
548 vs 475 diff: 13%
509 vs 484 diff: 4%

Synthetic test:

Hardware: x86_64 64 CPUs, 64GB of memory

- 60.000 tasks(simultaneous);
- each task does(1000 loops)
     synchronize_rcu();
     kfree(p);

default: CONFIG_RCU_NOCB_CPU: takes 306 seconds to complete all users;
patch: CONFIG_RCU_NOCB_CPU: takes 201 seconds to complete all users.

Please note, by default this functionality is OFF and the old way is
still used instead, In order to activate it, please do:

echo 1 > /sys/module/rcutree/parameters/rcu_normal_wake_from_gp

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
---
 kernel/rcu/tree.c     | 151 +++++++++++++++++++++++++++++++++++++++++-
 kernel/rcu/tree_exp.h |   2 +-
 2 files changed, 151 insertions(+), 2 deletions(-)

Comments

Joel Fernandes Oct. 17, 2023, 1:28 a.m. UTC | #1
Hello, Vlad,

Looks like a nice patch, I am still looking at this more and just
started, but a few small comments:

On Mon, Oct 16, 2023 at 1:30 PM Uladzislau Rezki (Sony)
<urezki@gmail.com> wrote:
>
> A call to a synchronize_rcu() can be optimized from time point of
> view. Different workloads can be affected by this especially the
> ones which use this API in its time critical sections.
>
> For example if CONFIG_RCU_NOCB_CPU is set, the wakeme_after_rcu()
> callback can be delayed and such delay depends on where in a nocb
> list it is located.
>
> 1. On our Android devices i can easily trigger the scenario when
> it is a last in the list out of ~3600 callbacks:
>
> <snip>
>   <...>-29      [001] d..1. 21950.145313: rcu_batch_start: rcu_preempt CBs=3613 bl=28
> ...
>   <...>-29      [001] ..... 21950.152578: rcu_invoke_callback: rcu_preempt rhp=00000000b2d6dee8 func=__free_vm_area_struct.cfi_jt
>   <...>-29      [001] ..... 21950.152579: rcu_invoke_callback: rcu_preempt rhp=00000000a446f607 func=__free_vm_area_struct.cfi_jt
>   <...>-29      [001] ..... 21950.152580: rcu_invoke_callback: rcu_preempt rhp=00000000a5cab03b func=__free_vm_area_struct.cfi_jt
>   <...>-29      [001] ..... 21950.152581: rcu_invoke_callback: rcu_preempt rhp=0000000013b7e5ee func=__free_vm_area_struct.cfi_jt
>   <...>-29      [001] ..... 21950.152582: rcu_invoke_callback: rcu_preempt rhp=000000000a8ca6f9 func=__free_vm_area_struct.cfi_jt
>   <...>-29      [001] ..... 21950.152583: rcu_invoke_callback: rcu_preempt rhp=000000008f162ca8 func=wakeme_after_rcu.cfi_jt
>   <...>-29      [001] d..1. 21950.152625: rcu_batch_end: rcu_preempt CBs-invoked=3612 idle=....
> <snip>
>
> 2. We use cpuset/cgroup to classify tasks and assign them into
> different cgroups. For example "backgrond" group which binds tasks
> only to little CPUs or "foreground" which makes use of all CPUs.
> Tasks can be migrated between groups by a request if an acceleration
> is needed.
>
> See below an example how "surfaceflinger" task gets migrated.
> Initially it is located in the "system-background" cgroup which
> allows to run only on little cores. In order to speed it up it
> can be temporary moved into "foreground" cgroup which allows
> to use big/all CPUs:
>
[...]
> 3. To address this drawback, maintain a separate track that consists
> of synchronize_rcu() callers only. The GP-kthread, that drivers a GP
> either wake-ups a worker to drain all list or directly wakes-up end
> user if it is one in the drain list.

I wonder if there is a cut-off N, where waking up N ("a few") inline
instead of just 1, yields similar results. For small values of N, that
keeps the total number of wakeups lower than pushing off to a kworker.
For instance, if 2 users, then you get 3 wakeups instead of just 2 (1
for the kworker and another 2 for the synchronize). But if you had a
cut off like N=5, then 2 users results in just 2 wakeups.
I don't feel too strongly about it, but for small values of N, I am
interested in a slightly better optimization if we can squeeze it.

[...]
> + * There are three lists for handling synchronize_rcu() users.
> + * A first list corresponds to new coming users, second for users
> + * which wait for a grace period and third is for which a grace
> + * period is passed.
> + */
> +static struct sr_normal_state {
> +       struct llist_head curr; /* request a GP users. */
> +       struct llist_head wait; /* wait for GP users. */
> +       struct llist_head done; /* ready for GP users. */
> +       struct llist_node *curr_tail;
> +       struct llist_node *wait_tail;

Just for clarification, the only reason you need 'tail' is because you
can use llist_add_batch() to quickly splice the list, instead of
finding the tail each time (expensive for a large list), correct? It
would be good to mention that in a comment here.

> +       atomic_t active;
> +} sr;
> +
> +/* Disable it by default. */
> +static int rcu_normal_wake_from_gp = 0;
> +module_param(rcu_normal_wake_from_gp, int, 0644);
> +
> +static void rcu_sr_normal_complete(struct llist_node *node)
> +{
> +       struct rcu_synchronize *rs = container_of(
> +               (struct rcu_head *) node, struct rcu_synchronize, head);
> +       unsigned long oldstate = (unsigned long) rs->head.func;
> +
> +       if (!poll_state_synchronize_rcu(oldstate))
> +               WARN_ONCE(1, "A full grace period is not passed yet: %lu",
> +                       rcu_seq_diff(get_state_synchronize_rcu(), oldstate));

nit: WARN_ONCE(!poll_state_synchronize_rcu(oldstate)), ...) and get
rid of if() ?

> +
> +       /* Finally. */
> +       complete(&rs->completion);
> +}
> +
> +static void rcu_sr_normal_gp_cleanup_work(struct work_struct *work)
> +{
> +       struct llist_node *done, *rcu, *next;
> +
> +       done = llist_del_all(&sr.done);
> +       if (!done)
> +               return;
> +
> +       llist_for_each_safe(rcu, next, done)
> +               rcu_sr_normal_complete(rcu);
> +}
[...]
> +static void rcu_sr_normal_add_req(struct rcu_synchronize *rs)
> +{
> +       atomic_inc(&sr.active);
> +       if (llist_add((struct llist_node *) &rs->head, &sr.curr))
> +               /* Set the tail. Only first and one user can do that. */
> +               WRITE_ONCE(sr.curr_tail, (struct llist_node *) &rs->head);
> +       atomic_dec(&sr.active);

Here there is no memory ordering provided by the atomic ops. Is that really Ok?

And same question for other usages of .active.

Per: https://www.kernel.org/doc/Documentation/atomic_t.txt
"RMW operations that have no return value are unordered;"
--
Note to self to ping my Android friends as well about this improvement
:-). Especially the -mm Android folks are interested in app startup
time.

thanks,

 - Joel



> +}
> +
>  /*
>   * Initialize a new grace period.  Return false if no grace period required.
>   */
> @@ -1418,6 +1534,7 @@ static noinline_for_stack bool rcu_gp_init(void)
>         /* Record GP times before starting GP, hence rcu_seq_start(). */
>         rcu_seq_start(&rcu_state.gp_seq);
>         ASSERT_EXCLUSIVE_WRITER(rcu_state.gp_seq);
> +       rcu_sr_normal_gp_init();
>         trace_rcu_grace_period(rcu_state.name, rcu_state.gp_seq, TPS("start"));
>         rcu_poll_gp_seq_start(&rcu_state.gp_seq_polled_snap);
>         raw_spin_unlock_irq_rcu_node(rnp);
> @@ -1787,6 +1904,9 @@ static noinline void rcu_gp_cleanup(void)
>         }
>         raw_spin_unlock_irq_rcu_node(rnp);
>
> +       // Make synchronize_rcu() users aware of the end of old grace period.
> +       rcu_sr_normal_gp_cleanup();
> +
>         // If strict, make all CPUs aware of the end of the old grace period.
>         if (IS_ENABLED(CONFIG_RCU_STRICT_GRACE_PERIOD))
>                 on_each_cpu(rcu_strict_gp_boundary, NULL, 0);
> @@ -3500,6 +3620,35 @@ static int rcu_blocking_is_gp(void)
>         return true;
>  }
>
> +/*
> + * Helper function for the synchronize_rcu() API.
> + */
> +static void synchronize_rcu_normal(void)
> +{
> +       struct rcu_synchronize rs;
> +
> +       if (READ_ONCE(rcu_normal_wake_from_gp)) {
> +               init_rcu_head_on_stack(&rs.head);
> +               init_completion(&rs.completion);
> +
> +               /*
> +                * This code might be preempted, therefore take a GP
> +                * snapshot before adding a request.
> +                */
> +               rs.head.func = (void *) get_state_synchronize_rcu();
> +               rcu_sr_normal_add_req(&rs);
> +
> +               /* Kick a GP and start waiting. */
> +               (void) start_poll_synchronize_rcu();
> +
> +               /* Now we can wait. */
> +               wait_for_completion(&rs.completion);
> +               destroy_rcu_head_on_stack(&rs.head);
> +       } else {
> +               wait_rcu_gp(call_rcu_hurry);
> +       }
> +}
> +
>  /**
>   * synchronize_rcu - wait until a grace period has elapsed.
>   *
> @@ -3551,7 +3700,7 @@ void synchronize_rcu(void)
>                 if (rcu_gp_is_expedited())
>                         synchronize_rcu_expedited();
>                 else
> -                       wait_rcu_gp(call_rcu_hurry);
> +                       synchronize_rcu_normal();
>                 return;
>         }
>
> diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
> index 6d7cea5d591f..279a37beb05a 100644
> --- a/kernel/rcu/tree_exp.h
> +++ b/kernel/rcu/tree_exp.h
> @@ -987,7 +987,7 @@ void synchronize_rcu_expedited(void)
>
>         /* If expedited grace periods are prohibited, fall back to normal. */
>         if (rcu_gp_is_normal()) {
> -               wait_rcu_gp(call_rcu_hurry);
> +               synchronize_rcu_normal();
>                 return;
>         }
>
> --
> 2.30.2
>
Uladzislau Rezki Oct. 17, 2023, 2:06 p.m. UTC | #2
Hello, Joel!

> Hello, Vlad,
> 
> Looks like a nice patch, I am still looking at this more and just
> started, but a few small comments:
> 
Thank you for looking at it :) And thank you for the comments.


> On Mon, Oct 16, 2023 at 1:30 PM Uladzislau Rezki (Sony)
> <urezki@gmail.com> wrote:
> >
> > A call to a synchronize_rcu() can be optimized from time point of
> > view. Different workloads can be affected by this especially the
> > ones which use this API in its time critical sections.
> >
> > For example if CONFIG_RCU_NOCB_CPU is set, the wakeme_after_rcu()
> > callback can be delayed and such delay depends on where in a nocb
> > list it is located.
> >
> > 1. On our Android devices i can easily trigger the scenario when
> > it is a last in the list out of ~3600 callbacks:
> >
> > <snip>
> >   <...>-29      [001] d..1. 21950.145313: rcu_batch_start: rcu_preempt CBs=3613 bl=28
> > ...
> >   <...>-29      [001] ..... 21950.152578: rcu_invoke_callback: rcu_preempt rhp=00000000b2d6dee8 func=__free_vm_area_struct.cfi_jt
> >   <...>-29      [001] ..... 21950.152579: rcu_invoke_callback: rcu_preempt rhp=00000000a446f607 func=__free_vm_area_struct.cfi_jt
> >   <...>-29      [001] ..... 21950.152580: rcu_invoke_callback: rcu_preempt rhp=00000000a5cab03b func=__free_vm_area_struct.cfi_jt
> >   <...>-29      [001] ..... 21950.152581: rcu_invoke_callback: rcu_preempt rhp=0000000013b7e5ee func=__free_vm_area_struct.cfi_jt
> >   <...>-29      [001] ..... 21950.152582: rcu_invoke_callback: rcu_preempt rhp=000000000a8ca6f9 func=__free_vm_area_struct.cfi_jt
> >   <...>-29      [001] ..... 21950.152583: rcu_invoke_callback: rcu_preempt rhp=000000008f162ca8 func=wakeme_after_rcu.cfi_jt
> >   <...>-29      [001] d..1. 21950.152625: rcu_batch_end: rcu_preempt CBs-invoked=3612 idle=....
> > <snip>
> >
> > 2. We use cpuset/cgroup to classify tasks and assign them into
> > different cgroups. For example "backgrond" group which binds tasks
> > only to little CPUs or "foreground" which makes use of all CPUs.
> > Tasks can be migrated between groups by a request if an acceleration
> > is needed.
> >
> > See below an example how "surfaceflinger" task gets migrated.
> > Initially it is located in the "system-background" cgroup which
> > allows to run only on little cores. In order to speed it up it
> > can be temporary moved into "foreground" cgroup which allows
> > to use big/all CPUs:
> >
> [...]
> > 3. To address this drawback, maintain a separate track that consists
> > of synchronize_rcu() callers only. The GP-kthread, that drivers a GP
> > either wake-ups a worker to drain all list or directly wakes-up end
> > user if it is one in the drain list.
> 
> I wonder if there is a cut-off N, where waking up N ("a few") inline
> instead of just 1, yields similar results. For small values of N, that
> keeps the total number of wakeups lower than pushing off to a kworker.
> For instance, if 2 users, then you get 3 wakeups instead of just 2 (1
> for the kworker and another 2 for the synchronize). But if you had a
> cut off like N=5, then 2 users results in just 2 wakeups.
> I don't feel too strongly about it, but for small values of N, I am
> interested in a slightly better optimization if we can squeeze it.
> 
This makes sense to me. We can add a threshold to wake-up several users.
Like you mentioned. We can do 2 wake-ups instead of 3.

> [...]
> > + * There are three lists for handling synchronize_rcu() users.
> > + * A first list corresponds to new coming users, second for users
> > + * which wait for a grace period and third is for which a grace
> > + * period is passed.
> > + */
> > +static struct sr_normal_state {
> > +       struct llist_head curr; /* request a GP users. */
> > +       struct llist_head wait; /* wait for GP users. */
> > +       struct llist_head done; /* ready for GP users. */
> > +       struct llist_node *curr_tail;
> > +       struct llist_node *wait_tail;
> 
> Just for clarification, the only reason you need 'tail' is because you
> can use llist_add_batch() to quickly splice the list, instead of
> finding the tail each time (expensive for a large list), correct? It
> would be good to mention that in a comment here.
> 
Right. I do not want to scan the list. I will comment it.

> > +       atomic_t active;
> > +} sr;
> > +
> > +/* Disable it by default. */
> > +static int rcu_normal_wake_from_gp = 0;
> > +module_param(rcu_normal_wake_from_gp, int, 0644);
> > +
> > +static void rcu_sr_normal_complete(struct llist_node *node)
> > +{
> > +       struct rcu_synchronize *rs = container_of(
> > +               (struct rcu_head *) node, struct rcu_synchronize, head);
> > +       unsigned long oldstate = (unsigned long) rs->head.func;
> > +
> > +       if (!poll_state_synchronize_rcu(oldstate))
> > +               WARN_ONCE(1, "A full grace period is not passed yet: %lu",
> > +                       rcu_seq_diff(get_state_synchronize_rcu(), oldstate));
> 
> nit: WARN_ONCE(!poll_state_synchronize_rcu(oldstate)), ...) and get
> rid of if() ?
>
Initially i had it written as you proposed i reworked it because i
wanted to see the delta. I do not have a strong opinion, so i can
fix it as you proposed.

> 
> > +
> > +       /* Finally. */
> > +       complete(&rs->completion);
> > +}
> > +
> > +static void rcu_sr_normal_gp_cleanup_work(struct work_struct *work)
> > +{
> > +       struct llist_node *done, *rcu, *next;
> > +
> > +       done = llist_del_all(&sr.done);
> > +       if (!done)
> > +               return;
> > +
> > +       llist_for_each_safe(rcu, next, done)
> > +               rcu_sr_normal_complete(rcu);
> > +}
> [...]
> > +static void rcu_sr_normal_add_req(struct rcu_synchronize *rs)
> > +{
> > +       atomic_inc(&sr.active);
> > +       if (llist_add((struct llist_node *) &rs->head, &sr.curr))
> > +               /* Set the tail. Only first and one user can do that. */
> > +               WRITE_ONCE(sr.curr_tail, (struct llist_node *) &rs->head);
> > +       atomic_dec(&sr.active);
> 
> Here there is no memory ordering provided by the atomic ops. Is that really Ok?
> 
This needs to be reworked since there is no ordering guaranteed. I think
there is a version of "atomic_inc_something" that guarantees it?

> And same question for other usages of .active.
> 
> Per: https://www.kernel.org/doc/Documentation/atomic_t.txt
> "RMW operations that have no return value are unordered;"
> --
> Note to self to ping my Android friends as well about this improvement
> :-). Especially the -mm Android folks are interested in app startup
> time.
> 
That is good. Thank you :)

--
Uladzislau Rezki
Uladzislau Rezki Oct. 18, 2023, 11:20 a.m. UTC | #3
Hello, Hillf!

> Hi Ulad
> 
> Good work with a nit.
> 
Thank you :)

> On Mon, Oct 16, 2023 at 1:30 PM Uladzislau Rezki (Sony) <urezki@gmail.com> wrote:
> > +static void rcu_sr_normal_gp_init(void)
> > +{
> > +	struct llist_node *llnode, *rcu;
> > +	int ret;
> > +
> > +	if (llist_empty(&sr.curr))
> > +		return;
> 
> This empty check erases the curr_tail race below instead of
> atomic_inc_return(&sr.active), because llist_add() will never return true
> after this check.
>
I use "active" counter to guarantee that a tail was updated in the
rcu_sr_normal_add_req(), i.e. the list might be not empty whereas the
tail updating might be in progress. llist_add() success and the task gets
preemted as an example.

Or i miss your point? If so, i appreciate if you clarify it in more
detail.

> > +
> > +	/*
> > +	 * A waiting list of GP should be empty on this step,
> > +	 * since a GP-kthread, rcu_gp_init() -> gp_cleanup(),
> > +	 * rolls it over. If not, it is a BUG, warn a user.
> > +	 */
> > +	WARN_ON_ONCE(!llist_empty(&sr.wait));
> > +
> > +	/*
> > +	 * Obtain a tail of current active users. It is guaranteed
> > +	 * that if we are only one active user and the list is not
> > +	 * empty, the tail has already been updated.
> > +	 */
> > +	ret = atomic_inc_return(&sr.active);
> 
> Replacing atomic_inc_return() with smp_mb() cuts sr.active off.
> 
But here we would like to know that we were only one user + not
empty list gurantees that a tail is ready.

Thank you for your comments!

--
Uladzislau Rezki
Joel Fernandes Oct. 18, 2023, 2:32 p.m. UTC | #4
On Tue, Oct 17, 2023 at 10:06 AM Uladzislau Rezki <urezki@gmail.com> wrote:
[...]
> > > +
> > > +       /* Finally. */
> > > +       complete(&rs->completion);
> > > +}
> > > +
> > > +static void rcu_sr_normal_gp_cleanup_work(struct work_struct *work)
> > > +{
> > > +       struct llist_node *done, *rcu, *next;
> > > +
> > > +       done = llist_del_all(&sr.done);
> > > +       if (!done)
> > > +               return;
> > > +
> > > +       llist_for_each_safe(rcu, next, done)
> > > +               rcu_sr_normal_complete(rcu);
> > > +}
> > [...]
> > > +static void rcu_sr_normal_add_req(struct rcu_synchronize *rs)
> > > +{
> > > +       atomic_inc(&sr.active);
> > > +       if (llist_add((struct llist_node *) &rs->head, &sr.curr))
> > > +               /* Set the tail. Only first and one user can do that. */
> > > +               WRITE_ONCE(sr.curr_tail, (struct llist_node *) &rs->head);
> > > +       atomic_dec(&sr.active);
> >
> > Here there is no memory ordering provided by the atomic ops. Is that really Ok?
> >
> This needs to be reworked since there is no ordering guaranteed. I think
> there is a version of "atomic_inc_something" that guarantees it?

Yeah there is atomic_fetch_{inc,dec}{_acquire,_release}()

Or:
  atomic_inc(&sr.active);
  smp_mb__after_atomic();

  smp_mb__before_atomic();
  atomic_dec(&sr.active);

?

That's probably better because we don't need ordering before the inc
or after the dec, AFAICS.

I am actually a bit surprised there is no atomic_inc_acquire() yet. :-)

Thanks.
Uladzislau Rezki Oct. 18, 2023, 5:36 p.m. UTC | #5
On Wed, Oct 18, 2023 at 10:32:22AM -0400, Joel Fernandes wrote:
> On Tue, Oct 17, 2023 at 10:06 AM Uladzislau Rezki <urezki@gmail.com> wrote:
> [...]
> > > > +
> > > > +       /* Finally. */
> > > > +       complete(&rs->completion);
> > > > +}
> > > > +
> > > > +static void rcu_sr_normal_gp_cleanup_work(struct work_struct *work)
> > > > +{
> > > > +       struct llist_node *done, *rcu, *next;
> > > > +
> > > > +       done = llist_del_all(&sr.done);
> > > > +       if (!done)
> > > > +               return;
> > > > +
> > > > +       llist_for_each_safe(rcu, next, done)
> > > > +               rcu_sr_normal_complete(rcu);
> > > > +}
> > > [...]
> > > > +static void rcu_sr_normal_add_req(struct rcu_synchronize *rs)
> > > > +{
> > > > +       atomic_inc(&sr.active);
> > > > +       if (llist_add((struct llist_node *) &rs->head, &sr.curr))
> > > > +               /* Set the tail. Only first and one user can do that. */
> > > > +               WRITE_ONCE(sr.curr_tail, (struct llist_node *) &rs->head);
> > > > +       atomic_dec(&sr.active);
> > >
> > > Here there is no memory ordering provided by the atomic ops. Is that really Ok?
> > >
> > This needs to be reworked since there is no ordering guaranteed. I think
> > there is a version of "atomic_inc_something" that guarantees it?
> 
> Yeah there is atomic_fetch_{inc,dec}{_acquire,_release}()
> 
> Or:
>   atomic_inc(&sr.active);
>   smp_mb__after_atomic();
> 
>   smp_mb__before_atomic();
>   atomic_dec(&sr.active);
> 
> ?
>
> That's probably better because we don't need ordering before the inc
> or after the dec, AFAICS.
> 
There are two variants, atomic_inc_return() and atomic_dec_return()
which are fully ordered. Any thoughts about them? One line instead of
two as in your case.

Your concern is about, that atomic_inc() can be reodered? There is a
llist_add() that has inside the try_cmpxchg() that should have barrier.

Any thoughts?

Thank you for the review and help, Joel!

--
Uladzislau Rezki
Uladzislau Rezki Oct. 19, 2023, 3:44 p.m. UTC | #6
On Thu, Oct 19, 2023 at 07:44:32PM +0800, Hillf Danton wrote:
> On Wed, 18 Oct 2023 13:20:49 +0200 Uladzislau Rezki (Sony) <urezki@gmail.com> wrote:
> > > On Mon, Oct 16, 2023 at 1:30 PM Uladzislau Rezki (Sony) <urezki@gmail.com> wrote:
> > > > +static void rcu_sr_normal_gp_init(void)
> > > > +{
> > > > +	struct llist_node *llnode, *rcu;
> > > > +	int ret;
> > > > +
> > > > +	if (llist_empty(&sr.curr))
> > > > +		return;
> > > 
> > > This empty check erases the curr_tail race below instead of
> > > atomic_inc_return(&sr.active), because llist_add() will never return true
> > > after this check.
> > >
> > I use "active" counter to guarantee that a tail was updated in the
> > rcu_sr_normal_add_req(), i.e. the list might be not empty whereas the
> > tail updating might be in progress. llist_add() success and the task gets
> > preemted as an example.
> 
> You are right - the preempt is what I missed.
> 
> Then another question rising - the adding order of sync requests is changed
> at wakeup time, as shown by the two functions below with sr.curr_tail and
> sr.active cut off.
> 
> static void rcu_sr_normal_gp_init(void)
> {
> 	struct llist_node *llnode, *rcu;
> 
> 	llnode = llist_del_all(&sr.curr);
> 	rcu = llist_reverse_order(llnode);
> 	if (!rcu)
> 		return;
> 	/*
> 	 * A waiting list of GP should be empty on this step,
> 	 * since a GP-kthread, rcu_gp_init() -> gp_cleanup(),
> 	 * rolls it over. If not, it is a BUG, warn a user.
> 	 */
> 	WARN_ON_ONCE(!llist_empty(&sr.wait));
> 
> 	WRITE_ONCE(sr.wait_tail, llnode);
> 
> 	llist_add_batch(rcu, llnode, &sr.wait);
> }
> 
> static void rcu_sr_normal_add_req(struct rcu_synchronize *rs)
> {
> 	llist_add((struct llist_node *) &rs->head, &sr.curr);
> }
>
This is what i was thinking of in the beginning but i had decided
to make it more complex and maintain the current tail. We will get
a slight penalty in performance on a synthetic test but i have compared
it and this is negligible in fact.

Another good thing with it is we process users in a reverse order,
i.e. the most waiting users to less ones, this is from a waiting time
point of view.

I like it better and it is more simple. Anyway we can improve it
further.

Thank you for the proposal!

--
Uladzislau Rezki
diff mbox series

Patch

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 78554e7181dd..42a36d09cf3a 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1384,6 +1384,122 @@  static void rcu_poll_gp_seq_end_unlocked(unsigned long *snap)
 		raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
 }
 
+/*
+ * There are three lists for handling synchronize_rcu() users.
+ * A first list corresponds to new coming users, second for users
+ * which wait for a grace period and third is for which a grace
+ * period is passed.
+ */
+static struct sr_normal_state {
+	struct llist_head curr;	/* request a GP users. */
+	struct llist_head wait;	/* wait for GP users. */
+	struct llist_head done;	/* ready for GP users. */
+	struct llist_node *curr_tail;
+	struct llist_node *wait_tail;
+	atomic_t active;
+} sr;
+
+/* Disable it by default. */
+static int rcu_normal_wake_from_gp = 0;
+module_param(rcu_normal_wake_from_gp, int, 0644);
+
+static void rcu_sr_normal_complete(struct llist_node *node)
+{
+	struct rcu_synchronize *rs = container_of(
+		(struct rcu_head *) node, struct rcu_synchronize, head);
+	unsigned long oldstate = (unsigned long) rs->head.func;
+
+	if (!poll_state_synchronize_rcu(oldstate))
+		WARN_ONCE(1, "A full grace period is not passed yet: %lu",
+			rcu_seq_diff(get_state_synchronize_rcu(), oldstate));
+
+	/* Finally. */
+	complete(&rs->completion);
+}
+
+static void rcu_sr_normal_gp_cleanup_work(struct work_struct *work)
+{
+	struct llist_node *done, *rcu, *next;
+
+	done = llist_del_all(&sr.done);
+	if (!done)
+		return;
+
+	llist_for_each_safe(rcu, next, done)
+		rcu_sr_normal_complete(rcu);
+}
+static DECLARE_WORK(sr_normal_gp_cleanup, rcu_sr_normal_gp_cleanup_work);
+
+/*
+ * Helper function for rcu_gp_cleanup().
+ */
+static void rcu_sr_normal_gp_cleanup(void)
+{
+	struct llist_node *first, *tail;
+
+	tail = READ_ONCE(sr.wait_tail);
+	first = llist_del_all(&sr.wait);
+	if (!first)
+		return;
+
+	/* Only one user? */
+	if (!first->next) {
+		rcu_sr_normal_complete(first);
+		return;
+	}
+
+	/* Can be not empty. */
+	llist_add_batch(first, tail, &sr.done);
+	queue_work(system_highpri_wq, &sr_normal_gp_cleanup);
+}
+
+/*
+ * Helper function for rcu_gp_init().
+ */
+static void rcu_sr_normal_gp_init(void)
+{
+	struct llist_node *llnode, *rcu;
+	int ret;
+
+	if (llist_empty(&sr.curr))
+		return;
+
+	/*
+	 * A waiting list of GP should be empty on this step,
+	 * since a GP-kthread, rcu_gp_init() -> gp_cleanup(),
+	 * rolls it over. If not, it is a BUG, warn a user.
+	 */
+	WARN_ON_ONCE(!llist_empty(&sr.wait));
+
+	/*
+	 * Obtain a tail of current active users. It is guaranteed
+	 * that if we are only one active user and the list is not
+	 * empty, the tail has already been updated.
+	 */
+	ret = atomic_inc_return(&sr.active);
+	WRITE_ONCE(sr.wait_tail, (ret == 1) ? READ_ONCE(sr.curr_tail):NULL);
+	llnode = llist_del_all(&sr.curr);
+	atomic_dec(&sr.active);
+
+	if (ret != 1) {
+		llist_for_each(rcu, llnode) {
+			if (!rcu->next)
+				WRITE_ONCE(sr.wait_tail, rcu);
+		}
+	}
+
+	llist_add_batch(llnode, READ_ONCE(sr.wait_tail), &sr.wait);
+}
+
+static void rcu_sr_normal_add_req(struct rcu_synchronize *rs)
+{
+	atomic_inc(&sr.active);
+	if (llist_add((struct llist_node *) &rs->head, &sr.curr))
+		/* Set the tail. Only first and one user can do that. */
+		WRITE_ONCE(sr.curr_tail, (struct llist_node *) &rs->head);
+	atomic_dec(&sr.active);
+}
+
 /*
  * Initialize a new grace period.  Return false if no grace period required.
  */
@@ -1418,6 +1534,7 @@  static noinline_for_stack bool rcu_gp_init(void)
 	/* Record GP times before starting GP, hence rcu_seq_start(). */
 	rcu_seq_start(&rcu_state.gp_seq);
 	ASSERT_EXCLUSIVE_WRITER(rcu_state.gp_seq);
+	rcu_sr_normal_gp_init();
 	trace_rcu_grace_period(rcu_state.name, rcu_state.gp_seq, TPS("start"));
 	rcu_poll_gp_seq_start(&rcu_state.gp_seq_polled_snap);
 	raw_spin_unlock_irq_rcu_node(rnp);
@@ -1787,6 +1904,9 @@  static noinline void rcu_gp_cleanup(void)
 	}
 	raw_spin_unlock_irq_rcu_node(rnp);
 
+	// Make synchronize_rcu() users aware of the end of old grace period.
+	rcu_sr_normal_gp_cleanup();
+
 	// If strict, make all CPUs aware of the end of the old grace period.
 	if (IS_ENABLED(CONFIG_RCU_STRICT_GRACE_PERIOD))
 		on_each_cpu(rcu_strict_gp_boundary, NULL, 0);
@@ -3500,6 +3620,35 @@  static int rcu_blocking_is_gp(void)
 	return true;
 }
 
+/*
+ * Helper function for the synchronize_rcu() API.
+ */
+static void synchronize_rcu_normal(void)
+{
+	struct rcu_synchronize rs;
+
+	if (READ_ONCE(rcu_normal_wake_from_gp)) {
+		init_rcu_head_on_stack(&rs.head);
+		init_completion(&rs.completion);
+
+		/*
+		 * This code might be preempted, therefore take a GP
+		 * snapshot before adding a request.
+		 */
+		rs.head.func = (void *) get_state_synchronize_rcu();
+		rcu_sr_normal_add_req(&rs);
+
+		/* Kick a GP and start waiting. */
+		(void) start_poll_synchronize_rcu();
+
+		/* Now we can wait. */
+		wait_for_completion(&rs.completion);
+		destroy_rcu_head_on_stack(&rs.head);
+	} else {
+		wait_rcu_gp(call_rcu_hurry);
+	}
+}
+
 /**
  * synchronize_rcu - wait until a grace period has elapsed.
  *
@@ -3551,7 +3700,7 @@  void synchronize_rcu(void)
 		if (rcu_gp_is_expedited())
 			synchronize_rcu_expedited();
 		else
-			wait_rcu_gp(call_rcu_hurry);
+			synchronize_rcu_normal();
 		return;
 	}
 
diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
index 6d7cea5d591f..279a37beb05a 100644
--- a/kernel/rcu/tree_exp.h
+++ b/kernel/rcu/tree_exp.h
@@ -987,7 +987,7 @@  void synchronize_rcu_expedited(void)
 
 	/* If expedited grace periods are prohibited, fall back to normal. */
 	if (rcu_gp_is_normal()) {
-		wait_rcu_gp(call_rcu_hurry);
+		synchronize_rcu_normal();
 		return;
 	}