Message ID: 20250324170156.469763-2-joelagnelf@nvidia.com (mailing list archive)
State: New
Series: [1/3] rcu: Replace magic number with meaningful constant in rcu_seq_done_exact()
On Mon, Mar 24, 2025 at 01:01:53PM -0400, Joel Fernandes wrote: > The rcu_seq_done_exact() function checks if a grace period has completed by > comparing sequence numbers. It includes a guard band to handle sequence number > wraparound, which was previously expressed using the magic number calculation > '3 * RCU_SEQ_STATE_MASK + 1'. > > This magic number is not immediately obvious in terms of what it represents. > > Instead, the reason we need this tiny guardband is because of the lag between > the setting of rcu_state.gp_seq_polled and root rnp's gp_seq in rcu_gp_init(). > > This guardband needs to be at least 2 GPs worth of counts, to avoid recognizing > the newly started GP as completed immediately, due to the following sequence > which arises due to the delay between update of rcu_state.gp_seq_polled and > root rnp's gp_seq: > > rnp->gp_seq = rcu_state.gp_seq = 0 > > CPU 0 CPU 1 > ----- ----- > // rcu_state.gp_seq = 1 > rcu_seq_start(&rcu_state.gp_seq) > // snap = 8 > snap = rcu_seq_snap(&rcu_state.gp_seq) > // Two full GP differences > rcu_seq_done_exact(&rnp->gp_seq, snap) > // rnp->gp_seq = 1 > WRITE_ONCE(rnp->gp_seq, rcu_state.gp_seq); > > This can happen due to get_state_synchronize_rcu_full() sampling > rcu_state.gp_seq_polled, however the poll_state_synchronize_rcu_full() > sampling the root rnp's gp_seq. The delay between the update of the 2 > counters occurs in rcu_gp_init() during which the counters briefly go > out of sync. > > Make the guardband explictly 2 GPs. This improves code readability and > maintainability by making the intent clearer as well. > > Suggested-by: Frederic Weisbecker <frederic@kernel.org> > Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com> One concern is that a small error anywhere in the code could cause this minimal guard band to be too small. This is not a problem for some use cases (rcu_barrier() just does an extra operation, and normal grace periods are protected from forever-idle CPUs by ->gpwrap), but could be an issue on 32-bit systems for user of polled RCU grace periods. In contrast, making the guard band a bit longer than it needs to be has little or no downside. Thanx, Paul > --- > kernel/rcu/rcu.h | 5 ++++- > 1 file changed, 4 insertions(+), 1 deletion(-) > > diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h > index eed2951a4962..5e1ee570bb27 100644 > --- a/kernel/rcu/rcu.h > +++ b/kernel/rcu/rcu.h > @@ -57,6 +57,9 @@ > /* Low-order bit definition for polled grace-period APIs. */ > #define RCU_GET_STATE_COMPLETED 0x1 > > +/* A complete grace period count */ > +#define RCU_SEQ_GP (RCU_SEQ_STATE_MASK + 1) > + > extern int sysctl_sched_rt_runtime; > > /* > @@ -162,7 +165,7 @@ static inline bool rcu_seq_done_exact(unsigned long *sp, unsigned long s) > { > unsigned long cur_s = READ_ONCE(*sp); > > - return ULONG_CMP_GE(cur_s, s) || ULONG_CMP_LT(cur_s, s - (3 * RCU_SEQ_STATE_MASK + 1)); > + return ULONG_CMP_GE(cur_s, s) || ULONG_CMP_LT(cur_s, s - (2 * RCU_SEQ_GP)); > } > > /* > -- > 2.43.0 >
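A minimal standalone sketch of the logic under discussion (ordinary userspace C, not kernel code; the constants and wraparound-safe comparisons are modeled on kernel/rcu/rcu.h and the kernel's ULONG_CMP_*() helpers, and seq_snap()/seq_done_exact() are simplified stand-ins for rcu_seq_snap()/rcu_seq_done_exact()). It replays the changelog's race and shows why the guard band must span at least two full grace periods: with only one GP of guard, a snapshot taken right after rcu_seq_start() would be reported as done against the still-lagging root rnp->gp_seq.

#include <stdio.h>
#include <limits.h>

/* Low two bits hold the GP state, the rest is the GP counter (as in kernel/rcu/rcu.h). */
#define RCU_SEQ_CTR_SHIFT	2
#define RCU_SEQ_STATE_MASK	((1UL << RCU_SEQ_CTR_SHIFT) - 1)
#define RCU_SEQ_GP		(RCU_SEQ_STATE_MASK + 1)	/* one full grace period of counts */

/* Wraparound-safe comparisons, modeled on the kernel's ULONG_CMP_GE()/ULONG_CMP_LT(). */
#define ULONG_CMP_GE(a, b)	(ULONG_MAX / 2 >= (a) - (b))
#define ULONG_CMP_LT(a, b)	(ULONG_MAX / 2 < (a) - (b))

/* Stand-in for rcu_seq_snap(): first value at which a full GP has elapsed after s. */
static unsigned long seq_snap(unsigned long s)
{
	return (s + 2 * RCU_SEQ_STATE_MASK + 1) & ~RCU_SEQ_STATE_MASK;
}

/* Stand-in for rcu_seq_done_exact(), with the guard band expressed in GPs. */
static int seq_done_exact(unsigned long cur_s, unsigned long s, unsigned long guard_gps)
{
	return ULONG_CMP_GE(cur_s, s) || ULONG_CMP_LT(cur_s, s - guard_gps * RCU_SEQ_GP);
}

int main(void)
{
	unsigned long rnp_gp_seq = 0;			/* root rnp->gp_seq not yet updated */
	unsigned long gp_seq = 1;			/* rcu_state.gp_seq after rcu_seq_start() */
	unsigned long snap = seq_snap(gp_seq);		/* (1 + 7) & ~3 = 8, as in the changelog */

	printf("snap = %lu\n", snap);
	printf("2-GP guard: done = %d (correctly not done)\n", seq_done_exact(rnp_gp_seq, snap, 2));
	printf("1-GP guard: done = %d (false positive)\n", seq_done_exact(rnp_gp_seq, snap, 1));
	return 0;
}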
> On Mar 26, 2025, at 6:33 PM, Paul E. McKenney <paulmck@kernel.org> wrote: > > On Mon, Mar 24, 2025 at 01:01:53PM -0400, Joel Fernandes wrote: >> The rcu_seq_done_exact() function checks if a grace period has completed by >> comparing sequence numbers. It includes a guard band to handle sequence number >> wraparound, which was previously expressed using the magic number calculation >> '3 * RCU_SEQ_STATE_MASK + 1'. >> >> This magic number is not immediately obvious in terms of what it represents. >> >> Instead, the reason we need this tiny guardband is because of the lag between >> the setting of rcu_state.gp_seq_polled and root rnp's gp_seq in rcu_gp_init(). >> >> This guardband needs to be at least 2 GPs worth of counts, to avoid recognizing >> the newly started GP as completed immediately, due to the following sequence >> which arises due to the delay between update of rcu_state.gp_seq_polled and >> root rnp's gp_seq: >> >> rnp->gp_seq = rcu_state.gp_seq = 0 >> >> CPU 0 CPU 1 >> ----- ----- >> // rcu_state.gp_seq = 1 >> rcu_seq_start(&rcu_state.gp_seq) >> // snap = 8 >> snap = rcu_seq_snap(&rcu_state.gp_seq) >> // Two full GP differences >> rcu_seq_done_exact(&rnp->gp_seq, snap) >> // rnp->gp_seq = 1 >> WRITE_ONCE(rnp->gp_seq, rcu_state.gp_seq); >> >> This can happen due to get_state_synchronize_rcu_full() sampling >> rcu_state.gp_seq_polled, however the poll_state_synchronize_rcu_full() >> sampling the root rnp's gp_seq. The delay between the update of the 2 >> counters occurs in rcu_gp_init() during which the counters briefly go >> out of sync. >> >> Make the guardband explictly 2 GPs. This improves code readability and >> maintainability by making the intent clearer as well. >> >> Suggested-by: Frederic Weisbecker <frederic@kernel.org> >> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com> > > One concern is that a small error anywhere in the code could cause this > minimal guard band to be too small. This is not a problem for some > use cases (rcu_barrier() just does an extra operation, and normal grace > periods are protected from forever-idle CPUs by ->gpwrap), but could be > an issue on 32-bit systems for user of polled RCU grace periods. Could you provide more details of the usecase (sequence of steps) causing an issue for 32 bit polled RCU users? I am not able to see how this patch can affect them. > > In contrast, making the guard band a bit longer than it needs to be > has little or no downside. Making it 3 GP instead of 2 should be ok with me as long as we document it but at least it will not be a magic number based on an equation. I feel we should not put random magic numbers which is more dangerous since it is hard to explain (and hence debug — just my 2 cents). Thanks. > > Thanx, Paul > >> --- >> kernel/rcu/rcu.h | 5 ++++- >> 1 file changed, 4 insertions(+), 1 deletion(-) >> >> diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h >> index eed2951a4962..5e1ee570bb27 100644 >> --- a/kernel/rcu/rcu.h >> +++ b/kernel/rcu/rcu.h >> @@ -57,6 +57,9 @@ >> /* Low-order bit definition for polled grace-period APIs. 
*/ >> #define RCU_GET_STATE_COMPLETED 0x1 >> >> +/* A complete grace period count */ >> +#define RCU_SEQ_GP (RCU_SEQ_STATE_MASK + 1) >> + >> extern int sysctl_sched_rt_runtime; >> >> /* >> @@ -162,7 +165,7 @@ static inline bool rcu_seq_done_exact(unsigned long *sp, unsigned long s) >> { >> unsigned long cur_s = READ_ONCE(*sp); >> >> - return ULONG_CMP_GE(cur_s, s) || ULONG_CMP_LT(cur_s, s - (3 * RCU_SEQ_STATE_MASK + 1)); >> + return ULONG_CMP_GE(cur_s, s) || ULONG_CMP_LT(cur_s, s - (2 * RCU_SEQ_GP)); >> } >> >> /* >> -- >> 2.43.0 >>
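If the guard band were widened to three grace periods as floated above, one documented form it could take is sketched below. It follows the shape of the posted patch, but the RCU_SEQ_DONE_EXACT_GUARD name and the comment wording are illustrative only and not part of that patch.

/* A complete grace period count. */
#define RCU_SEQ_GP			(RCU_SEQ_STATE_MASK + 1)

/*
 * Guard band used by rcu_seq_done_exact(): at least two grace periods are
 * needed to cover the window in rcu_gp_init() where rcu_state.gp_seq_polled
 * has advanced but the root rnp->gp_seq has not, plus one extra grace period
 * of slack.  (Illustrative constant, not from the posted patch.)
 */
#define RCU_SEQ_DONE_EXACT_GUARD	(3 * RCU_SEQ_GP)

static inline bool rcu_seq_done_exact(unsigned long *sp, unsigned long s)
{
	unsigned long cur_s = READ_ONCE(*sp);

	return ULONG_CMP_GE(cur_s, s) || ULONG_CMP_LT(cur_s, s - RCU_SEQ_DONE_EXACT_GUARD);
}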
On Wed, Mar 26, 2025 at 10:42:52PM +0000, Joel Fernandes wrote: > > > > On Mar 26, 2025, at 6:33 PM, Paul E. McKenney <paulmck@kernel.org> wrote: > > > > On Mon, Mar 24, 2025 at 01:01:53PM -0400, Joel Fernandes wrote: > >> The rcu_seq_done_exact() function checks if a grace period has completed by > >> comparing sequence numbers. It includes a guard band to handle sequence number > >> wraparound, which was previously expressed using the magic number calculation > >> '3 * RCU_SEQ_STATE_MASK + 1'. > >> > >> This magic number is not immediately obvious in terms of what it represents. > >> > >> Instead, the reason we need this tiny guardband is because of the lag between > >> the setting of rcu_state.gp_seq_polled and root rnp's gp_seq in rcu_gp_init(). > >> > >> This guardband needs to be at least 2 GPs worth of counts, to avoid recognizing > >> the newly started GP as completed immediately, due to the following sequence > >> which arises due to the delay between update of rcu_state.gp_seq_polled and > >> root rnp's gp_seq: > >> > >> rnp->gp_seq = rcu_state.gp_seq = 0 > >> > >> CPU 0 CPU 1 > >> ----- ----- > >> // rcu_state.gp_seq = 1 > >> rcu_seq_start(&rcu_state.gp_seq) > >> // snap = 8 > >> snap = rcu_seq_snap(&rcu_state.gp_seq) > >> // Two full GP differences > >> rcu_seq_done_exact(&rnp->gp_seq, snap) > >> // rnp->gp_seq = 1 > >> WRITE_ONCE(rnp->gp_seq, rcu_state.gp_seq); > >> > >> This can happen due to get_state_synchronize_rcu_full() sampling > >> rcu_state.gp_seq_polled, however the poll_state_synchronize_rcu_full() > >> sampling the root rnp's gp_seq. The delay between the update of the 2 > >> counters occurs in rcu_gp_init() during which the counters briefly go > >> out of sync. > >> > >> Make the guardband explictly 2 GPs. This improves code readability and > >> maintainability by making the intent clearer as well. > >> > >> Suggested-by: Frederic Weisbecker <frederic@kernel.org> > >> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com> > > > > One concern is that a small error anywhere in the code could cause this > > minimal guard band to be too small. This is not a problem for some > > use cases (rcu_barrier() just does an extra operation, and normal grace > > periods are protected from forever-idle CPUs by ->gpwrap), but could be > > an issue on 32-bit systems for user of polled RCU grace periods. > > Could you provide more details of the usecase (sequence of steps) causing an issue for 32 bit polled RCU users? I am not able to see how this patch can affect them. > > > > > In contrast, making the guard band a bit longer than it needs to be > > has little or no downside. > > Making it 3 GP instead of 2 should be ok with me as long as we document it but at least it will not be a magic number based on an equation. I feel we should not put random magic numbers which is more dangerous since it is hard to explain (and hence debug — just my 2 cents). Apologies, I was getting ahead of us on this one. That third (and maybe also a fourth) grace period becomes important if we start getting memory contention on rcu_state.gp_seq, in which case we would want the polled grace periods to look at the leaf rcu_node structure's gp_seq, which would introduce extra slop. For now, for this patch, assuming you guys promise to remember this should such memory contention prove problematic: Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Thanx, Paul > Thanks. 
> > > > > Thanx, Paul > > > >> --- > >> kernel/rcu/rcu.h | 5 ++++- > >> 1 file changed, 4 insertions(+), 1 deletion(-) > >> > >> diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h > >> index eed2951a4962..5e1ee570bb27 100644 > >> --- a/kernel/rcu/rcu.h > >> +++ b/kernel/rcu/rcu.h > >> @@ -57,6 +57,9 @@ > >> /* Low-order bit definition for polled grace-period APIs. */ > >> #define RCU_GET_STATE_COMPLETED 0x1 > >> > >> +/* A complete grace period count */ > >> +#define RCU_SEQ_GP (RCU_SEQ_STATE_MASK + 1) > >> + > >> extern int sysctl_sched_rt_runtime; > >> > >> /* > >> @@ -162,7 +165,7 @@ static inline bool rcu_seq_done_exact(unsigned long *sp, unsigned long s) > >> { > >> unsigned long cur_s = READ_ONCE(*sp); > >> > >> - return ULONG_CMP_GE(cur_s, s) || ULONG_CMP_LT(cur_s, s - (3 * RCU_SEQ_STATE_MASK + 1)); > >> + return ULONG_CMP_GE(cur_s, s) || ULONG_CMP_LT(cur_s, s - (2 * RCU_SEQ_GP)); > >> } > >> > >> /* > >> -- > >> 2.43.0 > >>
On 3/27/2025 1:09 PM, Paul E. McKenney wrote: >>> In contrast, making the guard band a bit longer than it needs to be >>> has little or no downside. >> Making it 3 GP instead of 2 should be ok with me as long as we document it but at least it will not be a magic number based on an equation. I feel we should not put random magic numbers which is more dangerous since it is hard to explain (and hence debug — just my 2 cents). > Apologies, I was getting ahead of us on this one. > > That third (and maybe also a fourth) grace period becomes important if > we start getting memory contention on rcu_state.gp_seq, in which case > we would want the polled grace periods to look at the leaf rcu_node > structure's gp_seq, which would introduce extra slop. Makes sense. > For now, for this patch, assuming you guys promise to remember this > should such memory contention prove problematic: > > Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Thanks! Will add the tag. - Joel
Le Thu, Mar 27, 2025 at 10:09:48AM -0700, Paul E. McKenney a écrit : > On Wed, Mar 26, 2025 at 10:42:52PM +0000, Joel Fernandes wrote: > > > > > > > On Mar 26, 2025, at 6:33 PM, Paul E. McKenney <paulmck@kernel.org> wrote: > > > > > > On Mon, Mar 24, 2025 at 01:01:53PM -0400, Joel Fernandes wrote: > > >> The rcu_seq_done_exact() function checks if a grace period has completed by > > >> comparing sequence numbers. It includes a guard band to handle sequence number > > >> wraparound, which was previously expressed using the magic number calculation > > >> '3 * RCU_SEQ_STATE_MASK + 1'. > > >> > > >> This magic number is not immediately obvious in terms of what it represents. > > >> > > >> Instead, the reason we need this tiny guardband is because of the lag between > > >> the setting of rcu_state.gp_seq_polled and root rnp's gp_seq in rcu_gp_init(). > > >> > > >> This guardband needs to be at least 2 GPs worth of counts, to avoid recognizing > > >> the newly started GP as completed immediately, due to the following sequence > > >> which arises due to the delay between update of rcu_state.gp_seq_polled and > > >> root rnp's gp_seq: > > >> > > >> rnp->gp_seq = rcu_state.gp_seq = 0 > > >> > > >> CPU 0 CPU 1 > > >> ----- ----- > > >> // rcu_state.gp_seq = 1 > > >> rcu_seq_start(&rcu_state.gp_seq) > > >> // snap = 8 > > >> snap = rcu_seq_snap(&rcu_state.gp_seq) > > >> // Two full GP differences > > >> rcu_seq_done_exact(&rnp->gp_seq, snap) > > >> // rnp->gp_seq = 1 > > >> WRITE_ONCE(rnp->gp_seq, rcu_state.gp_seq); > > >> > > >> This can happen due to get_state_synchronize_rcu_full() sampling > > >> rcu_state.gp_seq_polled, however the poll_state_synchronize_rcu_full() > > >> sampling the root rnp's gp_seq. The delay between the update of the 2 > > >> counters occurs in rcu_gp_init() during which the counters briefly go > > >> out of sync. > > >> > > >> Make the guardband explictly 2 GPs. This improves code readability and > > >> maintainability by making the intent clearer as well. > > >> > > >> Suggested-by: Frederic Weisbecker <frederic@kernel.org> > > >> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com> > > > > > > One concern is that a small error anywhere in the code could cause this > > > minimal guard band to be too small. This is not a problem for some > > > use cases (rcu_barrier() just does an extra operation, and normal grace > > > periods are protected from forever-idle CPUs by ->gpwrap), but could be > > > an issue on 32-bit systems for user of polled RCU grace periods. > > > > Could you provide more details of the usecase (sequence of steps) causing an issue for 32 bit polled RCU users? I am not able to see how this patch can affect them. > > > > > > > > In contrast, making the guard band a bit longer than it needs to be > > > has little or no downside. > > > > Making it 3 GP instead of 2 should be ok with me as long as we document it but at least it will not be a magic number based on an equation. I feel we should not put random magic numbers which is more dangerous since it is hard to explain (and hence debug — just my 2 cents). > > Apologies, I was getting ahead of us on this one. > > That third (and maybe also a fourth) grace period becomes important if > we start getting memory contention on rcu_state.gp_seq, in which case > we would want the polled grace periods to look at the leaf rcu_node > structure's gp_seq, which would introduce extra slop. 
> > For now, for this patch, assuming you guys promise to remember this > should such memory contention prove problematic: > > Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Taking again a step back on this. This all happens because we have two different global gp sequence counter, and the two are used on different occasions, even mixed up together since 85aad7cc4178 ("rcu: Fix get_state_synchronize_rcu_full() GP-start detection") even though they have different constraints and reliability. This can be error prone and confusing. So I'm unfortunately asking again if it wouldn't be a good idea to have a single global state counter that lives in the root node so that we don't have it duplicated in rcu_state.gp_seq. This involves some special care: 1) As you pointed out, the root can also be the unique leaf. So care must be taken that no QS is reported before new online CPUs have been handled and QS masks initialized. In order for __note_gp_changes() to handle it gracefully we must have two seq states: one to tell the grace period has started (before checking new online CPUs): RCU_STATE_STARTED and one to tell that grace period is ready to receive quiescent states: RCU_STATE_WAIT_QS. Although not strictly required, __note_gp_changes() can ignore simple state changes to RCU_STATE_STARTED in order to spare useless work and rnp locking. No QS will be reported before ->qsmask are initialized in any case. 2) rcu_gp_cleanup() currently ends the grace period top-down through the rnp tree and only then on rcu_state. With this change it's simply top-down. I have audited rcu_seq_snap() and rcu_start_this_gp() uses and I haven't found an obvious issue with that. The draft below holds 30 minutes of TREE01: diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h index feb3ac1dc5d5..c4bb3eb92a58 100644 --- a/kernel/rcu/rcu.h +++ b/kernel/rcu/rcu.h @@ -19,6 +19,12 @@ * The two least significant bits contain the control flags. * The most significant bits contain the grace-period sequence counter. * + * RCU_STATE_STARTED : The gp has started but QS are not + * requested yet. + * + * RCU_STATE_WAIT_QS : The gp has started and waits for + * QS to be reported. + * * When both control flags are zero, no grace period is in progress. * When either bit is non-zero, a grace period has started and is in * progress. When the grace period completes, the control flags are reset @@ -54,6 +60,9 @@ * grace-period sequence number. */ +#define RCU_STATE_STARTED 0x1 +#define RCU_STATE_WAIT_QS 0x2 + /* Low-order bit definition for polled grace-period APIs. */ #define RCU_GET_STATE_COMPLETED 0x1 @@ -92,7 +101,13 @@ static inline void rcu_seq_start(unsigned long *sp) { WRITE_ONCE(*sp, *sp + 1); smp_mb(); /* Ensure update-side operation after counter increment. */ - WARN_ON_ONCE(rcu_seq_state(*sp) != 1); + WARN_ON_ONCE(rcu_seq_state(*sp) != RCU_STATE_STARTED); +} + +static inline void rcu_seq_wait_qs(unsigned long *sp) +{ + WRITE_ONCE(*sp, *sp + 1); + WARN_ON_ONCE(rcu_seq_state(*sp) != RCU_STATE_WAIT_QS); } /* Compute the end-of-grace-period value for the specified sequence number. */ @@ -135,6 +150,11 @@ static inline unsigned long rcu_seq_current(unsigned long *sp) return READ_ONCE(*sp); } +static inline bool rcu_seq_changed(unsigned long old, unsigned long new) +{ + return (old & ~RCU_STATE_STARTED) != (new & ~RCU_STATE_STARTED); +} + /* * Given a snapshot from rcu_seq_snap(), determine whether or not the * corresponding update-side operation has started. 
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 475f31deed14..54480ee9599f 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -81,9 +81,9 @@ static DEFINE_PER_CPU_SHARED_ALIGNED(struct rcu_data, rcu_data) = { .gpwrap = true, }; static struct rcu_state rcu_state = { + .node[0].gp_seq = (0UL - 300UL) << RCU_SEQ_CTR_SHIFT, .level = { &rcu_state.node[0] }, .gp_state = RCU_GP_IDLE, - .gp_seq = (0UL - 300UL) << RCU_SEQ_CTR_SHIFT, .barrier_mutex = __MUTEX_INITIALIZER(rcu_state.barrier_mutex), .barrier_lock = __RAW_SPIN_LOCK_UNLOCKED(rcu_state.barrier_lock), .name = RCU_NAME, @@ -192,6 +192,21 @@ int rcu_get_gp_kthreads_prio(void) } EXPORT_SYMBOL_GPL(rcu_get_gp_kthreads_prio); +/* + * Return the root node of the rcu_state structure. + */ +static struct rcu_node *rcu_get_root(void) +{ + return &rcu_state.node[0]; +} + +static unsigned long *rcu_root_seq(void) +{ + struct rcu_node *root = rcu_get_root(); + + return &root->gp_seq; +} + /* * Number of grace periods between delays, normalized by the duration of * the delay. The longer the delay, the more the grace periods between @@ -210,7 +225,7 @@ EXPORT_SYMBOL_GPL(rcu_get_gp_kthreads_prio); */ static int rcu_gp_in_progress(void) { - return rcu_seq_state(rcu_seq_current(&rcu_state.gp_seq)); + return rcu_seq_state(rcu_seq_current(rcu_root_seq())); } /* @@ -504,7 +519,7 @@ static int rcu_pending(int user); */ unsigned long rcu_get_gp_seq(void) { - return READ_ONCE(rcu_state.gp_seq); + return READ_ONCE(*rcu_root_seq()); } EXPORT_SYMBOL_GPL(rcu_get_gp_seq); @@ -520,21 +535,13 @@ unsigned long rcu_exp_batches_completed(void) } EXPORT_SYMBOL_GPL(rcu_exp_batches_completed); -/* - * Return the root node of the rcu_state structure. - */ -static struct rcu_node *rcu_get_root(void) -{ - return &rcu_state.node[0]; -} - /* * Send along grace-period-related data for rcutorture diagnostics. */ void rcutorture_get_gp_data(int *flags, unsigned long *gp_seq) { *flags = READ_ONCE(rcu_state.gp_flags); - *gp_seq = rcu_seq_current(&rcu_state.gp_seq); + *gp_seq = rcu_seq_current(rcu_root_seq()); } EXPORT_SYMBOL_GPL(rcutorture_get_gp_data); @@ -1010,7 +1017,7 @@ static bool rcu_start_this_gp(struct rcu_node *rnp_start, struct rcu_data *rdp, trace_rcu_this_gp(rnp, rdp, gp_seq_req, TPS("NoGPkthread")); goto unlock_out; } - trace_rcu_grace_period(rcu_state.name, data_race(rcu_state.gp_seq), TPS("newreq")); + trace_rcu_grace_period(rcu_state.name, data_race(*rcu_root_seq()), TPS("newreq")); ret = true; /* Caller must wake GP kthread. */ unlock_out: /* Push furthest requested GP to leaf node and rcu_data structure. */ @@ -1095,7 +1102,7 @@ static void rcu_gp_kthread_wake(void) !READ_ONCE(rcu_state.gp_flags) || !t) return; WRITE_ONCE(rcu_state.gp_wake_time, jiffies); - WRITE_ONCE(rcu_state.gp_wake_seq, READ_ONCE(rcu_state.gp_seq)); + WRITE_ONCE(rcu_state.gp_wake_seq, READ_ONCE(*rcu_root_seq())); swake_up_one_online(&rcu_state.gp_wq); } @@ -1135,7 +1142,7 @@ static bool rcu_accelerate_cbs(struct rcu_node *rnp, struct rcu_data *rdp) * accelerating callback invocation to an earlier grace-period * number. 
*/ - gp_seq_req = rcu_seq_snap(&rcu_state.gp_seq); + gp_seq_req = rcu_seq_snap(rcu_root_seq()); if (rcu_segcblist_accelerate(&rdp->cblist, gp_seq_req)) ret = rcu_start_this_gp(rnp, rdp, gp_seq_req); @@ -1164,7 +1171,7 @@ static void rcu_accelerate_cbs_unlocked(struct rcu_node *rnp, bool needwake; rcu_lockdep_assert_cblist_protected(rdp); - c = rcu_seq_snap(&rcu_state.gp_seq); + c = rcu_seq_snap(rcu_root_seq()); if (!READ_ONCE(rdp->gpwrap) && ULONG_CMP_GE(rdp->gp_seq_needed, c)) { /* Old request still live, so mark recent callbacks. */ (void)rcu_segcblist_accelerate(&rdp->cblist, c); @@ -1249,7 +1256,7 @@ static bool __note_gp_changes(struct rcu_node *rnp, struct rcu_data *rdp) raw_lockdep_assert_held_rcu_node(rnp); - if (rdp->gp_seq == rnp->gp_seq) + if (!rcu_seq_changed(rdp->gp_seq, rnp->gp_seq)) return false; /* Nothing to do. */ /* Handle the ends of any preceding grace periods first. */ @@ -1298,7 +1305,7 @@ static void note_gp_changes(struct rcu_data *rdp) local_irq_save(flags); rnp = rdp->mynode; - if ((rdp->gp_seq == rcu_seq_current(&rnp->gp_seq) && + if ((!rcu_seq_changed(rdp->gp_seq, rcu_seq_current(&rnp->gp_seq)) && !unlikely(READ_ONCE(rdp->gpwrap))) || /* w/out lock. */ !raw_spin_trylock_rcu_node(rnp)) { /* irqs already off, so later. */ local_irq_restore(flags); @@ -1341,7 +1348,7 @@ static bool rcu_gp_slow_is_suppressed(void) static void rcu_gp_slow(int delay) { if (!rcu_gp_slow_is_suppressed() && delay > 0 && - !(rcu_seq_ctr(rcu_state.gp_seq) % (rcu_num_nodes * PER_RCU_NODE_PERIOD * delay))) + !(rcu_seq_ctr(*rcu_root_seq()) % (rcu_num_nodes * PER_RCU_NODE_PERIOD * delay))) schedule_timeout_idle(delay); } @@ -1802,10 +1809,9 @@ static noinline_for_stack bool rcu_gp_init(void) /* Advance to a new grace period and initialize state. */ record_gp_stall_check_time(); /* Record GP times before starting GP, hence rcu_seq_start(). */ - rcu_seq_start(&rcu_state.gp_seq); - ASSERT_EXCLUSIVE_WRITER(rcu_state.gp_seq); + rcu_seq_start(rcu_root_seq()); start_new_poll = rcu_sr_normal_gp_init(); - trace_rcu_grace_period(rcu_state.name, rcu_state.gp_seq, TPS("start")); + trace_rcu_grace_period(rcu_state.name, *rcu_root_seq(), TPS("start")); rcu_poll_gp_seq_start(&rcu_state.gp_seq_polled_snap); raw_spin_unlock_irq_rcu_node(rnp); @@ -1902,7 +1908,10 @@ static noinline_for_stack bool rcu_gp_init(void) rdp = this_cpu_ptr(&rcu_data); rcu_preempt_check_blocked_tasks(rnp); rnp->qsmask = rnp->qsmaskinit; - WRITE_ONCE(rnp->gp_seq, rcu_state.gp_seq); + if (!rnp->parent) + rcu_seq_wait_qs(&rnp->gp_seq); + else + WRITE_ONCE(rnp->gp_seq, *rcu_root_seq()); if (rnp == rdp->mynode) (void)__note_gp_changes(rnp, rdp); rcu_preempt_boost_start_gp(rnp); @@ -2018,7 +2027,7 @@ static noinline_for_stack void rcu_gp_fqs_loop(void) WRITE_ONCE(rcu_state.jiffies_kick_kthreads, jiffies + (j ? 3 * j : 2)); } - trace_rcu_grace_period(rcu_state.name, rcu_state.gp_seq, + trace_rcu_grace_period(rcu_state.name, *rcu_root_seq(), TPS("fqswait")); WRITE_ONCE(rcu_state.gp_state, RCU_GP_WAIT_FQS); (void)swait_event_idle_timeout_exclusive(rcu_state.gp_wq, @@ -2041,7 +2050,7 @@ static noinline_for_stack void rcu_gp_fqs_loop(void) /* If time for quiescent-state forcing, do it. 
*/ if (!time_after(rcu_state.jiffies_force_qs, jiffies) || (gf & (RCU_GP_FLAG_FQS | RCU_GP_FLAG_OVLD))) { - trace_rcu_grace_period(rcu_state.name, rcu_state.gp_seq, + trace_rcu_grace_period(rcu_state.name, *rcu_root_seq(), TPS("fqsstart")); rcu_gp_fqs(first_gp_fqs); gf = 0; @@ -2049,7 +2058,7 @@ static noinline_for_stack void rcu_gp_fqs_loop(void) first_gp_fqs = false; gf = rcu_state.cbovld ? RCU_GP_FLAG_OVLD : 0; } - trace_rcu_grace_period(rcu_state.name, rcu_state.gp_seq, + trace_rcu_grace_period(rcu_state.name, *rcu_root_seq(), TPS("fqsend")); cond_resched_tasks_rcu_qs(); WRITE_ONCE(rcu_state.gp_activity, jiffies); @@ -2060,7 +2069,7 @@ static noinline_for_stack void rcu_gp_fqs_loop(void) cond_resched_tasks_rcu_qs(); WRITE_ONCE(rcu_state.gp_activity, jiffies); WARN_ON(signal_pending(current)); - trace_rcu_grace_period(rcu_state.name, rcu_state.gp_seq, + trace_rcu_grace_period(rcu_state.name, *rcu_root_seq(), TPS("fqswaitsig")); ret = 1; /* Keep old FQS timing. */ j = jiffies; @@ -2081,7 +2090,6 @@ static noinline void rcu_gp_cleanup(void) int cpu; bool needgp = false; unsigned long gp_duration; - unsigned long new_gp_seq; bool offloaded; struct rcu_data *rdp; struct rcu_node *rnp = rcu_get_root(); @@ -2103,6 +2111,8 @@ static noinline void rcu_gp_cleanup(void) * period as completed in all of the rcu_node structures. */ rcu_poll_gp_seq_end(&rcu_state.gp_seq_polled_snap); + rcu_seq_end(rcu_root_seq()); + smp_mb(); // Order against failing poll_state_synchronize_rcu_full(). raw_spin_unlock_irq_rcu_node(rnp); /* @@ -2114,16 +2124,12 @@ static noinline void rcu_gp_cleanup(void) * the rcu_node structures before the beginning of the next grace * period is recorded in any of the rcu_node structures. */ - new_gp_seq = rcu_state.gp_seq; - rcu_seq_end(&new_gp_seq); rcu_for_each_node_breadth_first(rnp) { raw_spin_lock_irq_rcu_node(rnp); if (WARN_ON_ONCE(rcu_preempt_blocked_readers_cgp(rnp))) dump_blkd_tasks(rnp, 10); WARN_ON_ONCE(rnp->qsmask); - WRITE_ONCE(rnp->gp_seq, new_gp_seq); - if (!rnp->parent) - smp_mb(); // Order against failing poll_state_synchronize_rcu_full(). + WRITE_ONCE(rnp->gp_seq, *rcu_root_seq()); rdp = this_cpu_ptr(&rcu_data); if (rnp == rdp->mynode) needgp = __note_gp_changes(rnp, rdp) || needgp; @@ -2146,9 +2152,7 @@ static noinline void rcu_gp_cleanup(void) raw_spin_lock_irq_rcu_node(rnp); /* GP before ->gp_seq update. */ /* Declare grace period done, trace first to use old GP number. */ - trace_rcu_grace_period(rcu_state.name, rcu_state.gp_seq, TPS("end")); - rcu_seq_end(&rcu_state.gp_seq); - ASSERT_EXCLUSIVE_WRITER(rcu_state.gp_seq); + trace_rcu_grace_period(rcu_state.name, *rcu_root_seq(), TPS("end")); WRITE_ONCE(rcu_state.gp_state, RCU_GP_IDLE); /* Check for GP requests since above loop. */ rdp = this_cpu_ptr(&rcu_data); @@ -2173,7 +2177,7 @@ static noinline void rcu_gp_cleanup(void) WRITE_ONCE(rcu_state.gp_flags, RCU_GP_FLAG_INIT); WRITE_ONCE(rcu_state.gp_req_activity, jiffies); - trace_rcu_grace_period(rcu_state.name, rcu_state.gp_seq, TPS("newreq")); + trace_rcu_grace_period(rcu_state.name, *rcu_root_seq(), TPS("newreq")); } else { // We get here either if there is no need for an @@ -2204,7 +2208,7 @@ static int __noreturn rcu_gp_kthread(void *unused) /* Handle grace-period start. 
*/ for (;;) { - trace_rcu_grace_period(rcu_state.name, rcu_state.gp_seq, + trace_rcu_grace_period(rcu_state.name, *rcu_root_seq(), TPS("reqwait")); WRITE_ONCE(rcu_state.gp_state, RCU_GP_WAIT_GPS); swait_event_idle_exclusive(rcu_state.gp_wq, @@ -2218,7 +2222,7 @@ static int __noreturn rcu_gp_kthread(void *unused) cond_resched_tasks_rcu_qs(); WRITE_ONCE(rcu_state.gp_activity, jiffies); WARN_ON(signal_pending(current)); - trace_rcu_grace_period(rcu_state.name, rcu_state.gp_seq, + trace_rcu_grace_period(rcu_state.name, *rcu_root_seq(), TPS("reqwaitsig")); } @@ -3299,9 +3303,10 @@ void synchronize_rcu(void) // these counters. local_irq_save(flags); WARN_ON_ONCE(num_online_cpus() > 1); - rcu_state.gp_seq += (1 << RCU_SEQ_CTR_SHIFT); + rnp = rcu_get_root(); + rnp->gp_seq += (1 << RCU_SEQ_CTR_SHIFT); for (rnp = this_cpu_ptr(&rcu_data)->mynode; rnp; rnp = rnp->parent) - rnp->gp_seq_needed = rnp->gp_seq = rcu_state.gp_seq; + rnp->gp_seq_needed = rnp->gp_seq = *rcu_root_seq(); local_irq_restore(flags); } EXPORT_SYMBOL_GPL(synchronize_rcu); @@ -3390,7 +3395,7 @@ static void start_poll_synchronize_rcu_common(void) // get a grace period that no one needed. These accesses are ordered // by smp_mb(), and we are accessing them in the opposite order // from which they are updated at grace-period start, as required. - needwake = rcu_start_this_gp(rnp, rdp, rcu_seq_snap(&rcu_state.gp_seq)); + needwake = rcu_start_this_gp(rnp, rdp, rcu_seq_snap(rcu_root_seq())); raw_spin_unlock_irqrestore_rcu_node(rnp, flags); if (needwake) rcu_gp_kthread_wake(); @@ -3512,11 +3517,9 @@ EXPORT_SYMBOL_GPL(poll_state_synchronize_rcu); */ bool poll_state_synchronize_rcu_full(struct rcu_gp_oldstate *rgosp) { - struct rcu_node *rnp = rcu_get_root(); - smp_mb(); // Order against root rcu_node structure grace-period cleanup. if (rgosp->rgos_norm == RCU_GET_STATE_COMPLETED || - rcu_seq_done_exact(&rnp->gp_seq, rgosp->rgos_norm) || + rcu_seq_done_exact(rcu_root_seq(), rgosp->rgos_norm) || rgosp->rgos_exp == RCU_GET_STATE_COMPLETED || rcu_seq_done_exact(&rcu_state.expedited_sequence, rgosp->rgos_exp)) { smp_mb(); /* Ensure GP ends before subsequent accesses. */ @@ -4062,9 +4065,9 @@ rcu_boot_init_percpu_data(int cpu) WARN_ON_ONCE(ct->nesting != 1); WARN_ON_ONCE(rcu_watching_snap_in_eqs(ct_rcu_watching_cpu(cpu))); rdp->barrier_seq_snap = rcu_state.barrier_sequence; - rdp->rcu_ofl_gp_seq = rcu_state.gp_seq; + rdp->rcu_ofl_gp_seq = *rcu_root_seq(); rdp->rcu_ofl_gp_state = RCU_GP_CLEANED; - rdp->rcu_onl_gp_seq = rcu_state.gp_seq; + rdp->rcu_onl_gp_seq = *rcu_root_seq(); rdp->rcu_onl_gp_state = RCU_GP_CLEANED; rdp->last_sched_clock = jiffies; rdp->cpu = cpu; @@ -4271,7 +4274,7 @@ void rcutree_report_cpu_starting(unsigned int cpu) smp_store_release(&rcu_state.ncpus, rcu_state.ncpus + newcpu); /* ^^^ */ ASSERT_EXCLUSIVE_WRITER(rcu_state.ncpus); rcu_gpnum_ovf(rnp, rdp); /* Offline-induced counter wrap? */ - rdp->rcu_onl_gp_seq = READ_ONCE(rcu_state.gp_seq); + rdp->rcu_onl_gp_seq = READ_ONCE(*rcu_root_seq()); rdp->rcu_onl_gp_state = READ_ONCE(rcu_state.gp_state); /* An incoming CPU should never be blocking a grace period. */ @@ -4322,7 +4325,7 @@ void rcutree_report_cpu_dead(void) mask = rdp->grpmask; arch_spin_lock(&rcu_state.ofl_lock); raw_spin_lock_irqsave_rcu_node(rnp, flags); /* Enforce GP memory-order guarantee. */ - rdp->rcu_ofl_gp_seq = READ_ONCE(rcu_state.gp_seq); + rdp->rcu_ofl_gp_seq = READ_ONCE(*rcu_root_seq()); rdp->rcu_ofl_gp_state = READ_ONCE(rcu_state.gp_state); if (rnp->qsmask & mask) { /* RCU waiting on outgoing CPU? 
*/ /* Report quiescent state -before- changing ->qsmaskinitnext! */ @@ -4534,7 +4537,7 @@ void rcu_scheduler_starting(void) // Fix up the ->gp_seq counters. local_irq_save(flags); rcu_for_each_node_breadth_first(rnp) - rnp->gp_seq_needed = rnp->gp_seq = rcu_state.gp_seq; + rnp->gp_seq_needed = rnp->gp_seq = *rcu_root_seq(); local_irq_restore(flags); // Switch out of early boot mode. @@ -4583,9 +4586,9 @@ static void __init rcu_init_one(void) raw_spin_lock_init(&rnp->fqslock); lockdep_set_class_and_name(&rnp->fqslock, &rcu_fqs_class[i], fqs[i]); - rnp->gp_seq = rcu_state.gp_seq; - rnp->gp_seq_needed = rcu_state.gp_seq; - rnp->completedqs = rcu_state.gp_seq; + rnp->gp_seq = *rcu_root_seq(); + rnp->gp_seq_needed = *rcu_root_seq(); + rnp->completedqs = *rcu_root_seq(); rnp->qsmask = 0; rnp->qsmaskinit = 0; rnp->grplo = j * cpustride; diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h index a9a811d9d7a3..9d811270b808 100644 --- a/kernel/rcu/tree.h +++ b/kernel/rcu/tree.h @@ -347,7 +347,6 @@ struct rcu_state { /* The following fields are guarded by the root rcu_node's lock. */ - unsigned long gp_seq ____cacheline_internodealigned_in_smp; /* Grace-period sequence #. */ unsigned long gp_max; /* Maximum GP duration in */ /* jiffies. */ diff --git a/kernel/rcu/tree_stall.h b/kernel/rcu/tree_stall.h index 925fcdad5dea..367adb062a6d 100644 --- a/kernel/rcu/tree_stall.h +++ b/kernel/rcu/tree_stall.h @@ -344,7 +344,7 @@ static void rcu_dump_cpu_stacks(unsigned long gp_seq) rcu_for_each_leaf_node(rnp) { printk_deferred_enter(); for_each_leaf_node_possible_cpu(rnp, cpu) { - if (gp_seq != data_race(rcu_state.gp_seq)) { + if (gp_seq != data_race(*rcu_root_seq())) { printk_deferred_exit(); pr_err("INFO: Stall ended during stack backtracing.\n"); return; @@ -474,7 +474,7 @@ static void print_cpu_stall_info(int cpu) */ touch_nmi_watchdog(); - ticks_value = rcu_seq_ctr(rcu_state.gp_seq - rdp->gp_seq); + ticks_value = rcu_seq_ctr(*rcu_root_seq() - rdp->gp_seq); if (ticks_value) { ticks_title = "GPs behind"; } else { @@ -518,7 +518,7 @@ static void rcu_check_gp_kthread_starvation(void) cpu = gpk ? task_cpu(gpk) : -1; pr_err("%s kthread starved for %ld jiffies! g%ld f%#x %s(%d) ->state=%#x ->cpu=%d\n", rcu_state.name, j, - (long)rcu_seq_current(&rcu_state.gp_seq), + (long)rcu_seq_current(rcu_root_seq()), data_race(READ_ONCE(rcu_state.gp_flags)), gp_state_getname(rcu_state.gp_state), data_race(READ_ONCE(rcu_state.gp_state)), @@ -561,7 +561,7 @@ static void rcu_check_gp_kthread_expired_fqs_timer(void) cpu = task_cpu(gpk); pr_err("%s kthread timer wakeup didn't happen for %ld jiffies! 
g%ld f%#x %s(%d) ->state=%#x\n", rcu_state.name, (jiffies - jiffies_fqs), - (long)rcu_seq_current(&rcu_state.gp_seq), + (long)rcu_seq_current(rcu_root_seq()), data_race(READ_ONCE(rcu_state.gp_flags)), // Diagnostic read gp_state_getname(RCU_GP_WAIT_FQS), RCU_GP_WAIT_FQS, data_race(READ_ONCE(gpk->__state))); @@ -613,7 +613,7 @@ static void print_other_cpu_stall(unsigned long gp_seq, unsigned long gps) totqlen += rcu_get_n_cbs_cpu(cpu); pr_err("\t(detected by %d, t=%ld jiffies, g=%ld, q=%lu ncpus=%d)\n", smp_processor_id(), (long)(jiffies - gps), - (long)rcu_seq_current(&rcu_state.gp_seq), totqlen, + (long)rcu_seq_current(rcu_root_seq()), totqlen, data_race(rcu_state.n_online_cpus)); // Diagnostic read if (ndetected) { rcu_dump_cpu_stacks(gp_seq); @@ -622,7 +622,7 @@ static void print_other_cpu_stall(unsigned long gp_seq, unsigned long gps) rcu_for_each_leaf_node(rnp) rcu_print_detail_task_stall_rnp(rnp); } else { - if (rcu_seq_current(&rcu_state.gp_seq) != gp_seq) { + if (rcu_seq_current(rcu_root_seq()) != gp_seq) { pr_err("INFO: Stall ended before state dump start\n"); } else { j = jiffies; @@ -679,7 +679,7 @@ static void print_cpu_stall(unsigned long gp_seq, unsigned long gps) totqlen += rcu_get_n_cbs_cpu(cpu); pr_err("\t(t=%lu jiffies g=%ld q=%lu ncpus=%d)\n", jiffies - gps, - (long)rcu_seq_current(&rcu_state.gp_seq), totqlen, + (long)rcu_seq_current(rcu_root_seq()), totqlen, data_race(rcu_state.n_online_cpus)); // Diagnostic read rcu_check_gp_kthread_expired_fqs_timer(); @@ -743,28 +743,28 @@ static void check_cpu_stall(struct rcu_data *rdp) /* * Lots of memory barriers to reject false positives. * - * The idea is to pick up rcu_state.gp_seq, then + * The idea is to pick up rcu_root_seq(), then * rcu_state.jiffies_stall, then rcu_state.gp_start, and finally - * another copy of rcu_state.gp_seq. These values are updated in + * another copy of rcu_root_seq(). These values are updated in * the opposite order with memory barriers (or equivalent) during * grace-period initialization and cleanup. Now, a false positive * can occur if we get an new value of rcu_state.gp_start and a old * value of rcu_state.jiffies_stall. But given the memory barriers, * the only way that this can happen is if one grace period ends * and another starts between these two fetches. This is detected - * by comparing the second fetch of rcu_state.gp_seq with the - * previous fetch from rcu_state.gp_seq. + * by comparing the second fetch of rcu_root_seq() with the + * previous fetch from rcu_root_seq(). * * Given this check, comparisons of jiffies, rcu_state.jiffies_stall, * and rcu_state.gp_start suffice to forestall false positives. */ - gs1 = READ_ONCE(rcu_state.gp_seq); + gs1 = READ_ONCE(*rcu_root_seq()); smp_rmb(); /* Pick up ->gp_seq first... */ js = READ_ONCE(rcu_state.jiffies_stall); smp_rmb(); /* ...then ->jiffies_stall before the rest... */ gps = READ_ONCE(rcu_state.gp_start); smp_rmb(); /* ...and finally ->gp_start before ->gp_seq again. */ - gs2 = READ_ONCE(rcu_state.gp_seq); + gs2 = READ_ONCE(*rcu_root_seq()); if (gs1 != gs2 || ULONG_CMP_LT(j, js) || ULONG_CMP_GE(gps, js) || @@ -893,12 +893,12 @@ void show_rcu_gp_kthreads(void) data_race(READ_ONCE(rcu_state.gp_state)), t ? data_race(READ_ONCE(t->__state)) : 0x1ffff, t ? 
t->rt_priority : 0xffU, js, ja, jr, jw, (long)data_race(READ_ONCE(rcu_state.gp_wake_seq)), - (long)data_race(READ_ONCE(rcu_state.gp_seq)), + (long)data_race(READ_ONCE(*rcu_root_seq())), (long)data_race(READ_ONCE(rcu_get_root()->gp_seq_needed)), data_race(READ_ONCE(rcu_state.gp_max)), data_race(READ_ONCE(rcu_state.gp_flags))); rcu_for_each_node_breadth_first(rnp) { - if (ULONG_CMP_GE(READ_ONCE(rcu_state.gp_seq), READ_ONCE(rnp->gp_seq_needed)) && + if (ULONG_CMP_GE(READ_ONCE(*rcu_root_seq()), READ_ONCE(rnp->gp_seq_needed)) && !data_race(READ_ONCE(rnp->qsmask)) && !data_race(READ_ONCE(rnp->boost_tasks)) && !data_race(READ_ONCE(rnp->exp_tasks)) && !data_race(READ_ONCE(rnp->gp_tasks))) continue; @@ -917,7 +917,7 @@ void show_rcu_gp_kthreads(void) for_each_leaf_node_possible_cpu(rnp, cpu) { rdp = per_cpu_ptr(&rcu_data, cpu); if (READ_ONCE(rdp->gpwrap) || - ULONG_CMP_GE(READ_ONCE(rcu_state.gp_seq), + ULONG_CMP_GE(READ_ONCE(*rcu_root_seq()), READ_ONCE(rdp->gp_seq_needed))) continue; pr_info("\tcpu %d ->gp_seq_needed %ld\n",
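A small standalone sketch (userspace C, not part of the draft) of the two-phase encoding the draft introduces: rcu_seq_start() moves the root counter to RCU_STATE_STARTED, rcu_seq_wait_qs() then moves it to RCU_STATE_WAIT_QS once ->qsmask is initialized, and rcu_seq_changed() masks out the STARTED bit so __note_gp_changes() ignores the intermediate step.

#include <stdio.h>

#define RCU_SEQ_CTR_SHIFT	2
#define RCU_SEQ_STATE_MASK	((1UL << RCU_SEQ_CTR_SHIFT) - 1)

/* Grace-period phases from the draft, encoded in the low state bits. */
#define RCU_STATE_STARTED	0x1	/* GP started, quiescent states not requested yet */
#define RCU_STATE_WAIT_QS	0x2	/* ->qsmask initialized, quiescent states welcome */

static unsigned long seq_state(unsigned long s)
{
	return s & RCU_SEQ_STATE_MASK;
}

/* Mirrors the draft's rcu_seq_changed(): the STARTED-only step is invisible. */
static int seq_changed(unsigned long old, unsigned long new)
{
	return (old & ~RCU_STATE_STARTED) != (new & ~RCU_STATE_STARTED);
}

int main(void)
{
	unsigned long rdp_gp_seq = 8;	/* this CPU last saw a completed GP */
	unsigned long root_gp_seq = 8;	/* single root counter, currently idle */

	root_gp_seq += 1;	/* rcu_seq_start(): state becomes RCU_STATE_STARTED */
	printf("state=%lu changed=%d\n", seq_state(root_gp_seq), seq_changed(rdp_gp_seq, root_gp_seq));

	root_gp_seq += 1;	/* rcu_seq_wait_qs(): state becomes RCU_STATE_WAIT_QS */
	printf("state=%lu changed=%d\n", seq_state(root_gp_seq), seq_changed(rdp_gp_seq, root_gp_seq));
	return 0;
}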
On Mon, Mar 31, 2025 at 12:10:58AM +0200, Frederic Weisbecker wrote: > Le Thu, Mar 27, 2025 at 10:09:48AM -0700, Paul E. McKenney a écrit : > > On Wed, Mar 26, 2025 at 10:42:52PM +0000, Joel Fernandes wrote: > > > > > > > > > > On Mar 26, 2025, at 6:33 PM, Paul E. McKenney <paulmck@kernel.org> wrote: > > > > > > > > On Mon, Mar 24, 2025 at 01:01:53PM -0400, Joel Fernandes wrote: > > > >> The rcu_seq_done_exact() function checks if a grace period has completed by > > > >> comparing sequence numbers. It includes a guard band to handle sequence number > > > >> wraparound, which was previously expressed using the magic number calculation > > > >> '3 * RCU_SEQ_STATE_MASK + 1'. > > > >> > > > >> This magic number is not immediately obvious in terms of what it represents. > > > >> > > > >> Instead, the reason we need this tiny guardband is because of the lag between > > > >> the setting of rcu_state.gp_seq_polled and root rnp's gp_seq in rcu_gp_init(). > > > >> > > > >> This guardband needs to be at least 2 GPs worth of counts, to avoid recognizing > > > >> the newly started GP as completed immediately, due to the following sequence > > > >> which arises due to the delay between update of rcu_state.gp_seq_polled and > > > >> root rnp's gp_seq: > > > >> > > > >> rnp->gp_seq = rcu_state.gp_seq = 0 > > > >> > > > >> CPU 0 CPU 1 > > > >> ----- ----- > > > >> // rcu_state.gp_seq = 1 > > > >> rcu_seq_start(&rcu_state.gp_seq) > > > >> // snap = 8 > > > >> snap = rcu_seq_snap(&rcu_state.gp_seq) > > > >> // Two full GP differences > > > >> rcu_seq_done_exact(&rnp->gp_seq, snap) > > > >> // rnp->gp_seq = 1 > > > >> WRITE_ONCE(rnp->gp_seq, rcu_state.gp_seq); > > > >> > > > >> This can happen due to get_state_synchronize_rcu_full() sampling > > > >> rcu_state.gp_seq_polled, however the poll_state_synchronize_rcu_full() > > > >> sampling the root rnp's gp_seq. The delay between the update of the 2 > > > >> counters occurs in rcu_gp_init() during which the counters briefly go > > > >> out of sync. > > > >> > > > >> Make the guardband explictly 2 GPs. This improves code readability and > > > >> maintainability by making the intent clearer as well. > > > >> > > > >> Suggested-by: Frederic Weisbecker <frederic@kernel.org> > > > >> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com> > > > > > > > > One concern is that a small error anywhere in the code could cause this > > > > minimal guard band to be too small. This is not a problem for some > > > > use cases (rcu_barrier() just does an extra operation, and normal grace > > > > periods are protected from forever-idle CPUs by ->gpwrap), but could be > > > > an issue on 32-bit systems for user of polled RCU grace periods. > > > > > > Could you provide more details of the usecase (sequence of steps) causing an issue for 32 bit polled RCU users? I am not able to see how this patch can affect them. > > > > > > > > > > > In contrast, making the guard band a bit longer than it needs to be > > > > has little or no downside. > > > > > > Making it 3 GP instead of 2 should be ok with me as long as we document it but at least it will not be a magic number based on an equation. I feel we should not put random magic numbers which is more dangerous since it is hard to explain (and hence debug — just my 2 cents). > > > > Apologies, I was getting ahead of us on this one. 
> > > > That third (and maybe also a fourth) grace period becomes important if > > we start getting memory contention on rcu_state.gp_seq, in which case > > we would want the polled grace periods to look at the leaf rcu_node > > structure's gp_seq, which would introduce extra slop. > > > > For now, for this patch, assuming you guys promise to remember this > > should such memory contention prove problematic: > > > > Reviewed-by: Paul E. McKenney <paulmck@kernel.org> > > Taking again a step back on this. This all happens because we have two different > global gp sequence counter, and the two are used on different occasions, even mixed > up together since 85aad7cc4178 ("rcu: Fix get_state_synchronize_rcu_full() > GP-start detection") even though they have different constraints and > reliability. > > This can be error prone and confusing. Fair enough, though given the need for the grace-period sequence number to be replicated across the leaf rcu_node structures, we might well find that we are facing a law of conservation of confusion... Nevertheless, I do fully support at least exploring this sort of thing. After all, once I found something that worked, my incentive to further improve things was a bit limited. Especially given that there was never a shortage of other things demanding improvement. ;-) Please see more comments interspersed. Thanx, Paul > So I'm unfortunately asking again if it wouldn't be a good idea to have a single > global state counter that lives in the root node so that we don't have it > duplicated in rcu_state.gp_seq. > > This involves some special care: > > 1) As you pointed out, the root can also be the unique leaf. So care must > be taken that no QS is reported before new online CPUs have been handled > and QS masks initialized. In order for __note_gp_changes() to handle > it gracefully we must have two seq states: one to tell the > grace period has started (before checking new online CPUs): RCU_STATE_STARTED > and one to tell that grace period is ready to receive quiescent states: > RCU_STATE_WAIT_QS. Although not strictly required, __note_gp_changes() can > ignore simple state changes to RCU_STATE_STARTED in order to spare useless > work and rnp locking. No QS will be reported before ->qsmask are initialized > in any case. We have available up to three stages of grace-period processing, which can be expanded relatively easily. My concern is that we will be replacing "which ->gp_seq to use" confusion with "which gp_seq state to pay attention to" confusion. > 2) rcu_gp_cleanup() currently ends the grace period top-down through the rnp tree > and only then on rcu_state. With this change it's simply top-down. I have audited > rcu_seq_snap() and rcu_start_this_gp() uses and I haven't found an obvious > issue with that. That would result in the root rcu_node structure's ->gp_seq declaring the grace period over before the rest of the rcu_node structures were aware of it having ended. I guess that the last leaf rcu_node structure's ->gp_seq field would be what you would look at if you were looking for the actual grace period end. Of course, you could argue that the grace period isn't fully over until all the CPUs are aware of it, but that can take a very long time, especially if there are CPUs staying idle for seconds (let alone minutes) at at time. 
> The draft below holds 30 minutes of TREE01: > > diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h > index feb3ac1dc5d5..c4bb3eb92a58 100644 > --- a/kernel/rcu/rcu.h > +++ b/kernel/rcu/rcu.h > @@ -19,6 +19,12 @@ > * The two least significant bits contain the control flags. > * The most significant bits contain the grace-period sequence counter. > * > + * RCU_STATE_STARTED : The gp has started but QS are not > + * requested yet. > + * > + * RCU_STATE_WAIT_QS : The gp has started and waits for > + * QS to be reported. Do we want RCU_STATE_IDLE, by analogy with SRCU? > * When both control flags are zero, no grace period is in progress. > * When either bit is non-zero, a grace period has started and is in > * progress. When the grace period completes, the control flags are reset > @@ -54,6 +60,9 @@ > * grace-period sequence number. > */ > > +#define RCU_STATE_STARTED 0x1 > +#define RCU_STATE_WAIT_QS 0x2 > + > /* Low-order bit definition for polled grace-period APIs. */ > #define RCU_GET_STATE_COMPLETED 0x1 > > @@ -92,7 +101,13 @@ static inline void rcu_seq_start(unsigned long *sp) > { > WRITE_ONCE(*sp, *sp + 1); > smp_mb(); /* Ensure update-side operation after counter increment. */ > - WARN_ON_ONCE(rcu_seq_state(*sp) != 1); > + WARN_ON_ONCE(rcu_seq_state(*sp) != RCU_STATE_STARTED); > +} > + > +static inline void rcu_seq_wait_qs(unsigned long *sp) > +{ > + WRITE_ONCE(*sp, *sp + 1); > + WARN_ON_ONCE(rcu_seq_state(*sp) != RCU_STATE_WAIT_QS); Or we could have the WARN_ON_ONCE() first, and use rcu_seq_set_state(). > } > > /* Compute the end-of-grace-period value for the specified sequence number. */ > @@ -135,6 +150,11 @@ static inline unsigned long rcu_seq_current(unsigned long *sp) > return READ_ONCE(*sp); > } > > +static inline bool rcu_seq_changed(unsigned long old, unsigned long new) > +{ > + return (old & ~RCU_STATE_STARTED) != (new & ~RCU_STATE_STARTED); > +} > + > /* > * Given a snapshot from rcu_seq_snap(), determine whether or not the > * corresponding update-side operation has started. > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c > index 475f31deed14..54480ee9599f 100644 > --- a/kernel/rcu/tree.c > +++ b/kernel/rcu/tree.c > @@ -81,9 +81,9 @@ static DEFINE_PER_CPU_SHARED_ALIGNED(struct rcu_data, rcu_data) = { > .gpwrap = true, > }; > static struct rcu_state rcu_state = { > + .node[0].gp_seq = (0UL - 300UL) << RCU_SEQ_CTR_SHIFT, > .level = { &rcu_state.node[0] }, > .gp_state = RCU_GP_IDLE, > - .gp_seq = (0UL - 300UL) << RCU_SEQ_CTR_SHIFT, > .barrier_mutex = __MUTEX_INITIALIZER(rcu_state.barrier_mutex), > .barrier_lock = __RAW_SPIN_LOCK_UNLOCKED(rcu_state.barrier_lock), > .name = RCU_NAME, > @@ -192,6 +192,21 @@ int rcu_get_gp_kthreads_prio(void) > } > EXPORT_SYMBOL_GPL(rcu_get_gp_kthreads_prio); > > +/* > + * Return the root node of the rcu_state structure. > + */ > +static struct rcu_node *rcu_get_root(void) > +{ > + return &rcu_state.node[0]; > +} OK, just code movement. > +static unsigned long *rcu_root_seq(void) > +{ > + struct rcu_node *root = rcu_get_root(); > + > + return &root->gp_seq; > +} Adding this function might be useful in any case, even if we keep the current rcu_state.gp_seq counter. A single place to add comments that address the chosen type of confusion, if nothing else. ;-) 13 calls want the pointer and 13 want the value. An even split! Not sure what lesson to draw from this, though indirecting a function call will likely cause some confusion. Some might argue for two access functions, one for the pointer and the other for the value. 
Me, I am just glad that I included the word "might" in the first sentence of the prior paragraph. > /* > * Number of grace periods between delays, normalized by the duration of > * the delay. The longer the delay, the more the grace periods between > @@ -210,7 +225,7 @@ EXPORT_SYMBOL_GPL(rcu_get_gp_kthreads_prio); > */ > static int rcu_gp_in_progress(void) > { > - return rcu_seq_state(rcu_seq_current(&rcu_state.gp_seq)); > + return rcu_seq_state(rcu_seq_current(rcu_root_seq())); > } > > /* > @@ -504,7 +519,7 @@ static int rcu_pending(int user); > */ > unsigned long rcu_get_gp_seq(void) > { > - return READ_ONCE(rcu_state.gp_seq); > + return READ_ONCE(*rcu_root_seq()); > } > EXPORT_SYMBOL_GPL(rcu_get_gp_seq); > > @@ -520,21 +535,13 @@ unsigned long rcu_exp_batches_completed(void) > } > EXPORT_SYMBOL_GPL(rcu_exp_batches_completed); > > -/* > - * Return the root node of the rcu_state structure. > - */ > -static struct rcu_node *rcu_get_root(void) > -{ > - return &rcu_state.node[0]; > -} > - > /* > * Send along grace-period-related data for rcutorture diagnostics. > */ > void rcutorture_get_gp_data(int *flags, unsigned long *gp_seq) > { > *flags = READ_ONCE(rcu_state.gp_flags); > - *gp_seq = rcu_seq_current(&rcu_state.gp_seq); > + *gp_seq = rcu_seq_current(rcu_root_seq()); > } > EXPORT_SYMBOL_GPL(rcutorture_get_gp_data); > > @@ -1010,7 +1017,7 @@ static bool rcu_start_this_gp(struct rcu_node *rnp_start, struct rcu_data *rdp, > trace_rcu_this_gp(rnp, rdp, gp_seq_req, TPS("NoGPkthread")); > goto unlock_out; > } > - trace_rcu_grace_period(rcu_state.name, data_race(rcu_state.gp_seq), TPS("newreq")); > + trace_rcu_grace_period(rcu_state.name, data_race(*rcu_root_seq()), TPS("newreq")); > ret = true; /* Caller must wake GP kthread. */ > unlock_out: > /* Push furthest requested GP to leaf node and rcu_data structure. */ > @@ -1095,7 +1102,7 @@ static void rcu_gp_kthread_wake(void) > !READ_ONCE(rcu_state.gp_flags) || !t) > return; > WRITE_ONCE(rcu_state.gp_wake_time, jiffies); > - WRITE_ONCE(rcu_state.gp_wake_seq, READ_ONCE(rcu_state.gp_seq)); > + WRITE_ONCE(rcu_state.gp_wake_seq, READ_ONCE(*rcu_root_seq())); > swake_up_one_online(&rcu_state.gp_wq); > } > > @@ -1135,7 +1142,7 @@ static bool rcu_accelerate_cbs(struct rcu_node *rnp, struct rcu_data *rdp) > * accelerating callback invocation to an earlier grace-period > * number. > */ > - gp_seq_req = rcu_seq_snap(&rcu_state.gp_seq); > + gp_seq_req = rcu_seq_snap(rcu_root_seq()); > if (rcu_segcblist_accelerate(&rdp->cblist, gp_seq_req)) > ret = rcu_start_this_gp(rnp, rdp, gp_seq_req); > > @@ -1164,7 +1171,7 @@ static void rcu_accelerate_cbs_unlocked(struct rcu_node *rnp, > bool needwake; > > rcu_lockdep_assert_cblist_protected(rdp); > - c = rcu_seq_snap(&rcu_state.gp_seq); > + c = rcu_seq_snap(rcu_root_seq()); > if (!READ_ONCE(rdp->gpwrap) && ULONG_CMP_GE(rdp->gp_seq_needed, c)) { > /* Old request still live, so mark recent callbacks. */ > (void)rcu_segcblist_accelerate(&rdp->cblist, c); > @@ -1249,7 +1256,7 @@ static bool __note_gp_changes(struct rcu_node *rnp, struct rcu_data *rdp) > > raw_lockdep_assert_held_rcu_node(rnp); > > - if (rdp->gp_seq == rnp->gp_seq) > + if (!rcu_seq_changed(rdp->gp_seq, rnp->gp_seq)) The old code would detect a bunch of elapsed grace periods, starting and ending in the same state. The new code does not. Does the new code correctly handle CPUs that go idle for many grace periods? > return false; /* Nothing to do. */ > > /* Handle the ends of any preceding grace periods first. 
*/ > @@ -1298,7 +1305,7 @@ static void note_gp_changes(struct rcu_data *rdp) > > local_irq_save(flags); > rnp = rdp->mynode; > - if ((rdp->gp_seq == rcu_seq_current(&rnp->gp_seq) && > + if ((!rcu_seq_changed(rdp->gp_seq, rcu_seq_current(&rnp->gp_seq)) && Ditto. > !unlikely(READ_ONCE(rdp->gpwrap))) || /* w/out lock. */ > !raw_spin_trylock_rcu_node(rnp)) { /* irqs already off, so later. */ > local_irq_restore(flags); > @@ -1341,7 +1348,7 @@ static bool rcu_gp_slow_is_suppressed(void) > static void rcu_gp_slow(int delay) > { > if (!rcu_gp_slow_is_suppressed() && delay > 0 && > - !(rcu_seq_ctr(rcu_state.gp_seq) % (rcu_num_nodes * PER_RCU_NODE_PERIOD * delay))) > + !(rcu_seq_ctr(*rcu_root_seq()) % (rcu_num_nodes * PER_RCU_NODE_PERIOD * delay))) > schedule_timeout_idle(delay); > } > > @@ -1802,10 +1809,9 @@ static noinline_for_stack bool rcu_gp_init(void) > /* Advance to a new grace period and initialize state. */ > record_gp_stall_check_time(); > /* Record GP times before starting GP, hence rcu_seq_start(). */ > - rcu_seq_start(&rcu_state.gp_seq); > - ASSERT_EXCLUSIVE_WRITER(rcu_state.gp_seq); Why are we dropping the ASSERT_EXCLUSIVE_WRITER()? As opposed to moving it to the setting of the rcu_node structures' ->gp_seq updates? > + rcu_seq_start(rcu_root_seq()); > start_new_poll = rcu_sr_normal_gp_init(); > - trace_rcu_grace_period(rcu_state.name, rcu_state.gp_seq, TPS("start")); > + trace_rcu_grace_period(rcu_state.name, *rcu_root_seq(), TPS("start")); > rcu_poll_gp_seq_start(&rcu_state.gp_seq_polled_snap); > raw_spin_unlock_irq_rcu_node(rnp); > > @@ -1902,7 +1908,10 @@ static noinline_for_stack bool rcu_gp_init(void) > rdp = this_cpu_ptr(&rcu_data); > rcu_preempt_check_blocked_tasks(rnp); > rnp->qsmask = rnp->qsmaskinit; > - WRITE_ONCE(rnp->gp_seq, rcu_state.gp_seq); > + if (!rnp->parent) > + rcu_seq_wait_qs(&rnp->gp_seq); > + else > + WRITE_ONCE(rnp->gp_seq, *rcu_root_seq()); This asymmetry is necessary because if I understand correctly, the non-root rcu_node structures still have zero state. But this is one argument for use of rcu_seq_set_state(), because then you could just do rcu_seq_set_state(rnp->gp_seq, RCU_STATE_WAIT_QS) in both cases. Or am I missing something that prevents this from working? > if (rnp == rdp->mynode) > (void)__note_gp_changes(rnp, rdp); > rcu_preempt_boost_start_gp(rnp); > @@ -2018,7 +2027,7 @@ static noinline_for_stack void rcu_gp_fqs_loop(void) > WRITE_ONCE(rcu_state.jiffies_kick_kthreads, > jiffies + (j ? 3 * j : 2)); > } > - trace_rcu_grace_period(rcu_state.name, rcu_state.gp_seq, > + trace_rcu_grace_period(rcu_state.name, *rcu_root_seq(), > TPS("fqswait")); > WRITE_ONCE(rcu_state.gp_state, RCU_GP_WAIT_FQS); > (void)swait_event_idle_timeout_exclusive(rcu_state.gp_wq, > @@ -2041,7 +2050,7 @@ static noinline_for_stack void rcu_gp_fqs_loop(void) > /* If time for quiescent-state forcing, do it. */ > if (!time_after(rcu_state.jiffies_force_qs, jiffies) || > (gf & (RCU_GP_FLAG_FQS | RCU_GP_FLAG_OVLD))) { > - trace_rcu_grace_period(rcu_state.name, rcu_state.gp_seq, > + trace_rcu_grace_period(rcu_state.name, *rcu_root_seq(), > TPS("fqsstart")); > rcu_gp_fqs(first_gp_fqs); > gf = 0; > @@ -2049,7 +2058,7 @@ static noinline_for_stack void rcu_gp_fqs_loop(void) > first_gp_fqs = false; > gf = rcu_state.cbovld ? 
RCU_GP_FLAG_OVLD : 0; > } > - trace_rcu_grace_period(rcu_state.name, rcu_state.gp_seq, > + trace_rcu_grace_period(rcu_state.name, *rcu_root_seq(), > TPS("fqsend")); > cond_resched_tasks_rcu_qs(); > WRITE_ONCE(rcu_state.gp_activity, jiffies); > @@ -2060,7 +2069,7 @@ static noinline_for_stack void rcu_gp_fqs_loop(void) > cond_resched_tasks_rcu_qs(); > WRITE_ONCE(rcu_state.gp_activity, jiffies); > WARN_ON(signal_pending(current)); > - trace_rcu_grace_period(rcu_state.name, rcu_state.gp_seq, > + trace_rcu_grace_period(rcu_state.name, *rcu_root_seq(), > TPS("fqswaitsig")); > ret = 1; /* Keep old FQS timing. */ > j = jiffies; > @@ -2081,7 +2090,6 @@ static noinline void rcu_gp_cleanup(void) > int cpu; > bool needgp = false; > unsigned long gp_duration; > - unsigned long new_gp_seq; > bool offloaded; > struct rcu_data *rdp; > struct rcu_node *rnp = rcu_get_root(); > @@ -2103,6 +2111,8 @@ static noinline void rcu_gp_cleanup(void) > * period as completed in all of the rcu_node structures. > */ > rcu_poll_gp_seq_end(&rcu_state.gp_seq_polled_snap); > + rcu_seq_end(rcu_root_seq()); > + smp_mb(); // Order against failing poll_state_synchronize_rcu_full(). > raw_spin_unlock_irq_rcu_node(rnp); > > /* > @@ -2114,16 +2124,12 @@ static noinline void rcu_gp_cleanup(void) > * the rcu_node structures before the beginning of the next grace > * period is recorded in any of the rcu_node structures. > */ > - new_gp_seq = rcu_state.gp_seq; > - rcu_seq_end(&new_gp_seq); > rcu_for_each_node_breadth_first(rnp) { > raw_spin_lock_irq_rcu_node(rnp); > if (WARN_ON_ONCE(rcu_preempt_blocked_readers_cgp(rnp))) > dump_blkd_tasks(rnp, 10); > WARN_ON_ONCE(rnp->qsmask); > - WRITE_ONCE(rnp->gp_seq, new_gp_seq); > - if (!rnp->parent) > - smp_mb(); // Order against failing poll_state_synchronize_rcu_full(). OK, this got moved up. > + WRITE_ONCE(rnp->gp_seq, *rcu_root_seq()); The idea being that the root rcu_node ->gp_seq is just overwritten with itself, correct? > rdp = this_cpu_ptr(&rcu_data); > if (rnp == rdp->mynode) > needgp = __note_gp_changes(rnp, rdp) || needgp; > @@ -2146,9 +2152,7 @@ static noinline void rcu_gp_cleanup(void) > raw_spin_lock_irq_rcu_node(rnp); /* GP before ->gp_seq update. */ > > /* Declare grace period done, trace first to use old GP number. */ > - trace_rcu_grace_period(rcu_state.name, rcu_state.gp_seq, TPS("end")); > - rcu_seq_end(&rcu_state.gp_seq); > - ASSERT_EXCLUSIVE_WRITER(rcu_state.gp_seq); Same question on not moving ASSERT_EXCLUSIVE_WRITER() to the rcu_node structures' ->gp_seq updates. > + trace_rcu_grace_period(rcu_state.name, *rcu_root_seq(), TPS("end")); > WRITE_ONCE(rcu_state.gp_state, RCU_GP_IDLE); > /* Check for GP requests since above loop. */ > rdp = this_cpu_ptr(&rcu_data); > @@ -2173,7 +2177,7 @@ static noinline void rcu_gp_cleanup(void) > > WRITE_ONCE(rcu_state.gp_flags, RCU_GP_FLAG_INIT); > WRITE_ONCE(rcu_state.gp_req_activity, jiffies); > - trace_rcu_grace_period(rcu_state.name, rcu_state.gp_seq, TPS("newreq")); > + trace_rcu_grace_period(rcu_state.name, *rcu_root_seq(), TPS("newreq")); > } else { > > // We get here either if there is no need for an > @@ -2204,7 +2208,7 @@ static int __noreturn rcu_gp_kthread(void *unused) > > /* Handle grace-period start. 
*/ > for (;;) { > - trace_rcu_grace_period(rcu_state.name, rcu_state.gp_seq, > + trace_rcu_grace_period(rcu_state.name, *rcu_root_seq(), > TPS("reqwait")); > WRITE_ONCE(rcu_state.gp_state, RCU_GP_WAIT_GPS); > swait_event_idle_exclusive(rcu_state.gp_wq, > @@ -2218,7 +2222,7 @@ static int __noreturn rcu_gp_kthread(void *unused) > cond_resched_tasks_rcu_qs(); > WRITE_ONCE(rcu_state.gp_activity, jiffies); > WARN_ON(signal_pending(current)); > - trace_rcu_grace_period(rcu_state.name, rcu_state.gp_seq, > + trace_rcu_grace_period(rcu_state.name, *rcu_root_seq(), > TPS("reqwaitsig")); > } > > @@ -3299,9 +3303,10 @@ void synchronize_rcu(void) > // these counters. > local_irq_save(flags); > WARN_ON_ONCE(num_online_cpus() > 1); > - rcu_state.gp_seq += (1 << RCU_SEQ_CTR_SHIFT); > + rnp = rcu_get_root(); > + rnp->gp_seq += (1 << RCU_SEQ_CTR_SHIFT); > for (rnp = this_cpu_ptr(&rcu_data)->mynode; rnp; rnp = rnp->parent) > - rnp->gp_seq_needed = rnp->gp_seq = rcu_state.gp_seq; > + rnp->gp_seq_needed = rnp->gp_seq = *rcu_root_seq(); > local_irq_restore(flags); > } > EXPORT_SYMBOL_GPL(synchronize_rcu); > @@ -3390,7 +3395,7 @@ static void start_poll_synchronize_rcu_common(void) > // get a grace period that no one needed. These accesses are ordered > // by smp_mb(), and we are accessing them in the opposite order > // from which they are updated at grace-period start, as required. > - needwake = rcu_start_this_gp(rnp, rdp, rcu_seq_snap(&rcu_state.gp_seq)); > + needwake = rcu_start_this_gp(rnp, rdp, rcu_seq_snap(rcu_root_seq())); > raw_spin_unlock_irqrestore_rcu_node(rnp, flags); > if (needwake) > rcu_gp_kthread_wake(); > @@ -3512,11 +3517,9 @@ EXPORT_SYMBOL_GPL(poll_state_synchronize_rcu); > */ > bool poll_state_synchronize_rcu_full(struct rcu_gp_oldstate *rgosp) > { > - struct rcu_node *rnp = rcu_get_root(); > - > smp_mb(); // Order against root rcu_node structure grace-period cleanup. > if (rgosp->rgos_norm == RCU_GET_STATE_COMPLETED || > - rcu_seq_done_exact(&rnp->gp_seq, rgosp->rgos_norm) || > + rcu_seq_done_exact(rcu_root_seq(), rgosp->rgos_norm) || > rgosp->rgos_exp == RCU_GET_STATE_COMPLETED || > rcu_seq_done_exact(&rcu_state.expedited_sequence, rgosp->rgos_exp)) { > smp_mb(); /* Ensure GP ends before subsequent accesses. */ > @@ -4062,9 +4065,9 @@ rcu_boot_init_percpu_data(int cpu) > WARN_ON_ONCE(ct->nesting != 1); > WARN_ON_ONCE(rcu_watching_snap_in_eqs(ct_rcu_watching_cpu(cpu))); > rdp->barrier_seq_snap = rcu_state.barrier_sequence; > - rdp->rcu_ofl_gp_seq = rcu_state.gp_seq; > + rdp->rcu_ofl_gp_seq = *rcu_root_seq(); > rdp->rcu_ofl_gp_state = RCU_GP_CLEANED; > - rdp->rcu_onl_gp_seq = rcu_state.gp_seq; > + rdp->rcu_onl_gp_seq = *rcu_root_seq(); > rdp->rcu_onl_gp_state = RCU_GP_CLEANED; > rdp->last_sched_clock = jiffies; > rdp->cpu = cpu; > @@ -4271,7 +4274,7 @@ void rcutree_report_cpu_starting(unsigned int cpu) > smp_store_release(&rcu_state.ncpus, rcu_state.ncpus + newcpu); /* ^^^ */ > ASSERT_EXCLUSIVE_WRITER(rcu_state.ncpus); > rcu_gpnum_ovf(rnp, rdp); /* Offline-induced counter wrap? */ > - rdp->rcu_onl_gp_seq = READ_ONCE(rcu_state.gp_seq); > + rdp->rcu_onl_gp_seq = READ_ONCE(*rcu_root_seq()); > rdp->rcu_onl_gp_state = READ_ONCE(rcu_state.gp_state); > > /* An incoming CPU should never be blocking a grace period. */ > @@ -4322,7 +4325,7 @@ void rcutree_report_cpu_dead(void) > mask = rdp->grpmask; > arch_spin_lock(&rcu_state.ofl_lock); > raw_spin_lock_irqsave_rcu_node(rnp, flags); /* Enforce GP memory-order guarantee. 
*/ > - rdp->rcu_ofl_gp_seq = READ_ONCE(rcu_state.gp_seq); > + rdp->rcu_ofl_gp_seq = READ_ONCE(*rcu_root_seq()); > rdp->rcu_ofl_gp_state = READ_ONCE(rcu_state.gp_state); > if (rnp->qsmask & mask) { /* RCU waiting on outgoing CPU? */ > /* Report quiescent state -before- changing ->qsmaskinitnext! */ > @@ -4534,7 +4537,7 @@ void rcu_scheduler_starting(void) > // Fix up the ->gp_seq counters. > local_irq_save(flags); > rcu_for_each_node_breadth_first(rnp) > - rnp->gp_seq_needed = rnp->gp_seq = rcu_state.gp_seq; > + rnp->gp_seq_needed = rnp->gp_seq = *rcu_root_seq(); > local_irq_restore(flags); > > // Switch out of early boot mode. > @@ -4583,9 +4586,9 @@ static void __init rcu_init_one(void) > raw_spin_lock_init(&rnp->fqslock); > lockdep_set_class_and_name(&rnp->fqslock, > &rcu_fqs_class[i], fqs[i]); > - rnp->gp_seq = rcu_state.gp_seq; > - rnp->gp_seq_needed = rcu_state.gp_seq; > - rnp->completedqs = rcu_state.gp_seq; > + rnp->gp_seq = *rcu_root_seq(); > + rnp->gp_seq_needed = *rcu_root_seq(); > + rnp->completedqs = *rcu_root_seq(); > rnp->qsmask = 0; > rnp->qsmaskinit = 0; > rnp->grplo = j * cpustride; > diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h > index a9a811d9d7a3..9d811270b808 100644 > --- a/kernel/rcu/tree.h > +++ b/kernel/rcu/tree.h > @@ -347,7 +347,6 @@ struct rcu_state { > > /* The following fields are guarded by the root rcu_node's lock. */ > > - unsigned long gp_seq ____cacheline_internodealigned_in_smp; > /* Grace-period sequence #. */ Heh! Does the rcu_node structure's ->gp_seq need alignment? > unsigned long gp_max; /* Maximum GP duration in */ > /* jiffies. */ > diff --git a/kernel/rcu/tree_stall.h b/kernel/rcu/tree_stall.h > index 925fcdad5dea..367adb062a6d 100644 > --- a/kernel/rcu/tree_stall.h > +++ b/kernel/rcu/tree_stall.h > @@ -344,7 +344,7 @@ static void rcu_dump_cpu_stacks(unsigned long gp_seq) > rcu_for_each_leaf_node(rnp) { > printk_deferred_enter(); > for_each_leaf_node_possible_cpu(rnp, cpu) { > - if (gp_seq != data_race(rcu_state.gp_seq)) { > + if (gp_seq != data_race(*rcu_root_seq())) { > printk_deferred_exit(); > pr_err("INFO: Stall ended during stack backtracing.\n"); > return; > @@ -474,7 +474,7 @@ static void print_cpu_stall_info(int cpu) > */ > touch_nmi_watchdog(); > > - ticks_value = rcu_seq_ctr(rcu_state.gp_seq - rdp->gp_seq); > + ticks_value = rcu_seq_ctr(*rcu_root_seq() - rdp->gp_seq); > if (ticks_value) { > ticks_title = "GPs behind"; > } else { > @@ -518,7 +518,7 @@ static void rcu_check_gp_kthread_starvation(void) > cpu = gpk ? task_cpu(gpk) : -1; > pr_err("%s kthread starved for %ld jiffies! g%ld f%#x %s(%d) ->state=%#x ->cpu=%d\n", > rcu_state.name, j, > - (long)rcu_seq_current(&rcu_state.gp_seq), > + (long)rcu_seq_current(rcu_root_seq()), > data_race(READ_ONCE(rcu_state.gp_flags)), > gp_state_getname(rcu_state.gp_state), > data_race(READ_ONCE(rcu_state.gp_state)), > @@ -561,7 +561,7 @@ static void rcu_check_gp_kthread_expired_fqs_timer(void) > cpu = task_cpu(gpk); > pr_err("%s kthread timer wakeup didn't happen for %ld jiffies! 
g%ld f%#x %s(%d) ->state=%#x\n", > rcu_state.name, (jiffies - jiffies_fqs), > - (long)rcu_seq_current(&rcu_state.gp_seq), > + (long)rcu_seq_current(rcu_root_seq()), > data_race(READ_ONCE(rcu_state.gp_flags)), // Diagnostic read > gp_state_getname(RCU_GP_WAIT_FQS), RCU_GP_WAIT_FQS, > data_race(READ_ONCE(gpk->__state))); > @@ -613,7 +613,7 @@ static void print_other_cpu_stall(unsigned long gp_seq, unsigned long gps) > totqlen += rcu_get_n_cbs_cpu(cpu); > pr_err("\t(detected by %d, t=%ld jiffies, g=%ld, q=%lu ncpus=%d)\n", > smp_processor_id(), (long)(jiffies - gps), > - (long)rcu_seq_current(&rcu_state.gp_seq), totqlen, > + (long)rcu_seq_current(rcu_root_seq()), totqlen, > data_race(rcu_state.n_online_cpus)); // Diagnostic read > if (ndetected) { > rcu_dump_cpu_stacks(gp_seq); > @@ -622,7 +622,7 @@ static void print_other_cpu_stall(unsigned long gp_seq, unsigned long gps) > rcu_for_each_leaf_node(rnp) > rcu_print_detail_task_stall_rnp(rnp); > } else { > - if (rcu_seq_current(&rcu_state.gp_seq) != gp_seq) { > + if (rcu_seq_current(rcu_root_seq()) != gp_seq) { > pr_err("INFO: Stall ended before state dump start\n"); > } else { > j = jiffies; > @@ -679,7 +679,7 @@ static void print_cpu_stall(unsigned long gp_seq, unsigned long gps) > totqlen += rcu_get_n_cbs_cpu(cpu); > pr_err("\t(t=%lu jiffies g=%ld q=%lu ncpus=%d)\n", > jiffies - gps, > - (long)rcu_seq_current(&rcu_state.gp_seq), totqlen, > + (long)rcu_seq_current(rcu_root_seq()), totqlen, > data_race(rcu_state.n_online_cpus)); // Diagnostic read > > rcu_check_gp_kthread_expired_fqs_timer(); > @@ -743,28 +743,28 @@ static void check_cpu_stall(struct rcu_data *rdp) > /* > * Lots of memory barriers to reject false positives. > * > - * The idea is to pick up rcu_state.gp_seq, then > + * The idea is to pick up rcu_root_seq(), then > * rcu_state.jiffies_stall, then rcu_state.gp_start, and finally > - * another copy of rcu_state.gp_seq. These values are updated in > + * another copy of rcu_root_seq(). These values are updated in > * the opposite order with memory barriers (or equivalent) during > * grace-period initialization and cleanup. Now, a false positive > * can occur if we get an new value of rcu_state.gp_start and a old > * value of rcu_state.jiffies_stall. But given the memory barriers, > * the only way that this can happen is if one grace period ends > * and another starts between these two fetches. This is detected > - * by comparing the second fetch of rcu_state.gp_seq with the > - * previous fetch from rcu_state.gp_seq. > + * by comparing the second fetch of rcu_root_seq() with the > + * previous fetch from rcu_root_seq(). > * > * Given this check, comparisons of jiffies, rcu_state.jiffies_stall, > * and rcu_state.gp_start suffice to forestall false positives. > */ > - gs1 = READ_ONCE(rcu_state.gp_seq); > + gs1 = READ_ONCE(*rcu_root_seq()); > smp_rmb(); /* Pick up ->gp_seq first... */ > js = READ_ONCE(rcu_state.jiffies_stall); > smp_rmb(); /* ...then ->jiffies_stall before the rest... */ > gps = READ_ONCE(rcu_state.gp_start); > smp_rmb(); /* ...and finally ->gp_start before ->gp_seq again. */ > - gs2 = READ_ONCE(rcu_state.gp_seq); > + gs2 = READ_ONCE(*rcu_root_seq()); > if (gs1 != gs2 || > ULONG_CMP_LT(j, js) || > ULONG_CMP_GE(gps, js) || > @@ -893,12 +893,12 @@ void show_rcu_gp_kthreads(void) > data_race(READ_ONCE(rcu_state.gp_state)), > t ? data_race(READ_ONCE(t->__state)) : 0x1ffff, t ? 
t->rt_priority : 0xffU, > js, ja, jr, jw, (long)data_race(READ_ONCE(rcu_state.gp_wake_seq)), > - (long)data_race(READ_ONCE(rcu_state.gp_seq)), > + (long)data_race(READ_ONCE(*rcu_root_seq())), > (long)data_race(READ_ONCE(rcu_get_root()->gp_seq_needed)), > data_race(READ_ONCE(rcu_state.gp_max)), > data_race(READ_ONCE(rcu_state.gp_flags))); > rcu_for_each_node_breadth_first(rnp) { > - if (ULONG_CMP_GE(READ_ONCE(rcu_state.gp_seq), READ_ONCE(rnp->gp_seq_needed)) && > + if (ULONG_CMP_GE(READ_ONCE(*rcu_root_seq()), READ_ONCE(rnp->gp_seq_needed)) && > !data_race(READ_ONCE(rnp->qsmask)) && !data_race(READ_ONCE(rnp->boost_tasks)) && > !data_race(READ_ONCE(rnp->exp_tasks)) && !data_race(READ_ONCE(rnp->gp_tasks))) > continue; > @@ -917,7 +917,7 @@ void show_rcu_gp_kthreads(void) > for_each_leaf_node_possible_cpu(rnp, cpu) { > rdp = per_cpu_ptr(&rcu_data, cpu); > if (READ_ONCE(rdp->gpwrap) || > - ULONG_CMP_GE(READ_ONCE(rcu_state.gp_seq), > + ULONG_CMP_GE(READ_ONCE(*rcu_root_seq()), > READ_ONCE(rdp->gp_seq_needed))) > continue; > pr_info("\tcpu %d ->gp_seq_needed %ld\n",
Le Mon, Mar 31, 2025 at 11:28:47AM -0700, Paul E. McKenney a écrit : > > So I'm unfortunately asking again if it wouldn't be a good idea to have a single > > global state counter that lives in the root node so that we don't have it > > duplicated in rcu_state.gp_seq. > > > > This involves some special care: > > > > 1) As you pointed out, the root can also be the unique leaf. So care must > > be taken that no QS is reported before new online CPUs have been handled > > and QS masks initialized. In order for __note_gp_changes() to handle > > it gracefully we must have two seq states: one to tell the > > grace period has started (before checking new online CPUs): RCU_STATE_STARTED > > and one to tell that grace period is ready to receive quiescent states: > > RCU_STATE_WAIT_QS. Although not strictly required, __note_gp_changes() can > > ignore simple state changes to RCU_STATE_STARTED in order to spare useless > > work and rnp locking. No QS will be reported before ->qsmask are initialized > > in any case. > > We have available up to three stages of grace-period processing, which > can be expanded relatively easily. My concern is that we will be > replacing "which ->gp_seq to use" confusion with "which gp_seq state > to pay attention to" confusion. True, this comes at the cost of a bit more complex state space. > > > 2) rcu_gp_cleanup() currently ends the grace period top-down through the rnp tree > > and only then on rcu_state. With this change it's simply top-down. I have audited > > rcu_seq_snap() and rcu_start_this_gp() uses and I haven't found an obvious > > issue with that. > > That would result in the root rcu_node structure's ->gp_seq declaring the > grace period over before the rest of the rcu_node structures were aware > of it having ended. I guess that the last leaf rcu_node structure's > ->gp_seq field would be what you would look at if you were looking for > the actual grace period end. > > Of course, you could argue that the grace period isn't fully over > until all the CPUs are aware of it, but that can take a very long time, > especially if there are CPUs staying idle for seconds (let alone minutes) > at at time. Right. And it seems not all parts of RCU agree on a common definition of a complete grace period. Polling seem to be fine with just the root rnp to reflect rcu_seq_end() (which happens before the leaves do). Snapshot takers though rely on the whole tree to reflect the GP end... As far as I audited the code, mostly rcu_seq_snap() callers, rcu_start_this_gp() and __note_gp_changes(), it looks fine to consider the GP is ended as soon as the root reflects it. I may have missed something of course... > > diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h > > index feb3ac1dc5d5..c4bb3eb92a58 100644 > > --- a/kernel/rcu/rcu.h > > +++ b/kernel/rcu/rcu.h > > @@ -19,6 +19,12 @@ > > * The two least significant bits contain the control flags. > > * The most significant bits contain the grace-period sequence counter. > > * > > + * RCU_STATE_STARTED : The gp has started but QS are not > > + * requested yet. > > + * > > + * RCU_STATE_WAIT_QS : The gp has started and waits for > > + * QS to be reported. > > Do we want RCU_STATE_IDLE, by analogy with SRCU? Makes sense. > > > * When both control flags are zero, no grace period is in progress. > > * When either bit is non-zero, a grace period has started and is in > > * progress. When the grace period completes, the control flags are reset > > @@ -54,6 +60,9 @@ > > * grace-period sequence number. 
> > */ > > > > +#define RCU_STATE_STARTED 0x1 > > +#define RCU_STATE_WAIT_QS 0x2 > > + > > /* Low-order bit definition for polled grace-period APIs. */ > > #define RCU_GET_STATE_COMPLETED 0x1 > > > > @@ -92,7 +101,13 @@ static inline void rcu_seq_start(unsigned long *sp) > > { > > WRITE_ONCE(*sp, *sp + 1); > > smp_mb(); /* Ensure update-side operation after counter increment. */ > > - WARN_ON_ONCE(rcu_seq_state(*sp) != 1); > > + WARN_ON_ONCE(rcu_seq_state(*sp) != RCU_STATE_STARTED); > > +} > > + > > +static inline void rcu_seq_wait_qs(unsigned long *sp) > > +{ > > + WRITE_ONCE(*sp, *sp + 1); > > + WARN_ON_ONCE(rcu_seq_state(*sp) != RCU_STATE_WAIT_QS); > > Or we could have the WARN_ON_ONCE() first, and use rcu_seq_set_state(). Right! > > @@ -192,6 +192,21 @@ int rcu_get_gp_kthreads_prio(void) > > } > > EXPORT_SYMBOL_GPL(rcu_get_gp_kthreads_prio); > > > > +/* > > + * Return the root node of the rcu_state structure. > > + */ > > +static struct rcu_node *rcu_get_root(void) > > +{ > > + return &rcu_state.node[0]; > > +} > > OK, just code movement. Right, if I proceed with this, a series will (try to) split more gently for review. > > > +static unsigned long *rcu_root_seq(void) > > +{ > > + struct rcu_node *root = rcu_get_root(); > > + > > + return &root->gp_seq; > > +} > > Adding this function might be useful in any case, even if we keep the > current rcu_state.gp_seq counter. A single place to add comments that > address the chosen type of confusion, if nothing else. ;-) ;-)) > > 13 calls want the pointer and 13 want the value. An even split! The split is but 13 itself is not even! And I know people who strongly want it to stay that way! > Not sure what lesson to draw from this, though indirecting a function call will > likely cause some confusion. Some might argue for two access functions, > one for the pointer and the other for the value. Me, I am just glad that > I included the word "might" in the first sentence of the prior paragraph. Right, I'm not proud of the *rcu_root_seq() things. I thought about adding rcu_root_seq_current() but I'm not entirely comfortable with having all readers under READ_ONCE(). I like when our assumptions of non concurrent plain accesses are defeatedly reported by KCSAN. > > @@ -1249,7 +1256,7 @@ static bool __note_gp_changes(struct rcu_node *rnp, struct rcu_data *rdp) > > > > raw_lockdep_assert_held_rcu_node(rnp); > > > > - if (rdp->gp_seq == rnp->gp_seq) > > + if (!rcu_seq_changed(rdp->gp_seq, rnp->gp_seq)) > > The old code would detect a bunch of elapsed grace periods, starting > and ending in the same state. The new code does not. Does the new code > correctly handle CPUs that go idle for many grace periods? I'm not sure I understand what you mean. rcu_seq_changed() is a comparison that simply clears RCU_SEQ_STARTED on both sides such that: * Different GP numbers with or without RCU_SEQ_STARTED/RCU_SEQ_WAIT_QS are always considered a change. * Same GP numbers but rnp has RCU_SEQ_WAIT_QS while rdp hasn't is considered a change but: * Same GP numbers but rnp has RCU_SEQ_STARTED while rdp hasn't is _NOT_ considered a change. This is not for correctness but rather to spare rnp locking and useless work. The ->qsmask are 0 anyway so no QS can be reported too early. > > @@ -1802,10 +1809,9 @@ static noinline_for_stack bool rcu_gp_init(void) > > /* Advance to a new grace period and initialize state. */ > > record_gp_stall_check_time(); > > /* Record GP times before starting GP, hence rcu_seq_start(). 
*/ > > - rcu_seq_start(&rcu_state.gp_seq); > > - ASSERT_EXCLUSIVE_WRITER(rcu_state.gp_seq); > > Why are we dropping the ASSERT_EXCLUSIVE_WRITER()? As opposed to moving > it to the setting of the rcu_node structures' ->gp_seq updates? I must confess I misunderstood that macro and thought it was to assert the variable has no other writer _at all_. But that seem to apply scopewise. So yes good point. > > @@ -1902,7 +1908,10 @@ static noinline_for_stack bool rcu_gp_init(void) > > rdp = this_cpu_ptr(&rcu_data); > > rcu_preempt_check_blocked_tasks(rnp); > > rnp->qsmask = rnp->qsmaskinit; > > - WRITE_ONCE(rnp->gp_seq, rcu_state.gp_seq); > > + if (!rnp->parent) > > + rcu_seq_wait_qs(&rnp->gp_seq); > > + else > > + WRITE_ONCE(rnp->gp_seq, *rcu_root_seq()); > > This asymmetry is necessary because if I understand correctly, the > non-root rcu_node structures still have zero state. But this is one > argument for use of rcu_seq_set_state(), because then you could just do > rcu_seq_set_state(rnp->gp_seq, RCU_STATE_WAIT_QS) in both cases. > > Or am I missing something that prevents this from working? Indeed, the sequence counter (but not the state) is expected to be the same on all nodes, so we can just modify the state as you said. And we can then assert WARN_ON_ONCE(rnp->gp_seq != *rcu_root_seq()) > > @@ -2114,16 +2124,12 @@ static noinline void rcu_gp_cleanup(void) > > * the rcu_node structures before the beginning of the next grace > > * period is recorded in any of the rcu_node structures. > > */ > > - new_gp_seq = rcu_state.gp_seq; > > - rcu_seq_end(&new_gp_seq); > > rcu_for_each_node_breadth_first(rnp) { > > raw_spin_lock_irq_rcu_node(rnp); > > if (WARN_ON_ONCE(rcu_preempt_blocked_readers_cgp(rnp))) > > dump_blkd_tasks(rnp, 10); > > WARN_ON_ONCE(rnp->qsmask); > > - WRITE_ONCE(rnp->gp_seq, new_gp_seq); > > - if (!rnp->parent) > > - smp_mb(); // Order against failing poll_state_synchronize_rcu_full(). > > OK, this got moved up. > > > + WRITE_ONCE(rnp->gp_seq, *rcu_root_seq()); > > The idea being that the root rcu_node ->gp_seq is just overwritten > with itself, correct? Right! > > > rdp = this_cpu_ptr(&rcu_data); > > if (rnp == rdp->mynode) > > needgp = __note_gp_changes(rnp, rdp) || needgp; > > @@ -2146,9 +2152,7 @@ static noinline void rcu_gp_cleanup(void) > > raw_spin_lock_irq_rcu_node(rnp); /* GP before ->gp_seq update. */ > > > > /* Declare grace period done, trace first to use old GP number. */ > > - trace_rcu_grace_period(rcu_state.name, rcu_state.gp_seq, TPS("end")); > > - rcu_seq_end(&rcu_state.gp_seq); > > - ASSERT_EXCLUSIVE_WRITER(rcu_state.gp_seq); > > Same question on not moving ASSERT_EXCLUSIVE_WRITER() to the rcu_node > structures' ->gp_seq updates. Indeed! > > diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h > > index a9a811d9d7a3..9d811270b808 100644 > > --- a/kernel/rcu/tree.h > > +++ b/kernel/rcu/tree.h > > @@ -347,7 +347,6 @@ struct rcu_state { > > > > /* The following fields are guarded by the root rcu_node's lock. */ > > > > - unsigned long gp_seq ____cacheline_internodealigned_in_smp; > > /* Grace-period sequence #. */ > > Heh! Does the rcu_node structure's ->gp_seq need alignment? Probably but then this will increase the size of all rnp's, right? Thanks!
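(Aside: the rcu_seq_changed() helper referred to above is not shown in full anywhere in this thread. A minimal sketch consistent with Frederic's description of its semantics, reusing the RCU_STATE_STARTED/RCU_STATE_WAIT_QS encoding from his draft, might look like the following; the actual patch may differ:)

#define RCU_STATE_STARTED	0x1	/* GP started, ->qsmask not yet initialized */
#define RCU_STATE_WAIT_QS	0x2	/* GP ready to receive quiescent states */

/*
 * Hypothetical sketch: report a change only when the GP number differs or
 * the node has reached RCU_STATE_WAIT_QS.  Masking RCU_STATE_STARTED out of
 * both values means a bare idle -> started transition is not treated as a
 * change, sparing rnp locking and useless work, while ->qsmask still being
 * zero prevents any premature quiescent-state reporting.
 */
static inline bool rcu_seq_changed(unsigned long a, unsigned long b)
{
	return (a & ~RCU_STATE_STARTED) != (b & ~RCU_STATE_STARTED);
}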
On Mon, Mar 31, 2025 at 11:14:10PM +0200, Frederic Weisbecker wrote: > Le Mon, Mar 31, 2025 at 11:28:47AM -0700, Paul E. McKenney a écrit : > > > So I'm unfortunately asking again if it wouldn't be a good idea to have a single > > > global state counter that lives in the root node so that we don't have it > > > duplicated in rcu_state.gp_seq. > > > > > > This involves some special care: > > > > > > 1) As you pointed out, the root can also be the unique leaf. So care must > > > be taken that no QS is reported before new online CPUs have been handled > > > and QS masks initialized. In order for __note_gp_changes() to handle > > > it gracefully we must have two seq states: one to tell the > > > grace period has started (before checking new online CPUs): RCU_STATE_STARTED > > > and one to tell that grace period is ready to receive quiescent states: > > > RCU_STATE_WAIT_QS. Although not strictly required, __note_gp_changes() can > > > ignore simple state changes to RCU_STATE_STARTED in order to spare useless > > > work and rnp locking. No QS will be reported before ->qsmask are initialized > > > in any case. > > > > We have available up to three stages of grace-period processing, which > > can be expanded relatively easily. My concern is that we will be > > replacing "which ->gp_seq to use" confusion with "which gp_seq state > > to pay attention to" confusion. > > True, this comes at the cost of a bit more complex state space. Choices, choices! ;-) > > > 2) rcu_gp_cleanup() currently ends the grace period top-down through the rnp tree > > > and only then on rcu_state. With this change it's simply top-down. I have audited > > > rcu_seq_snap() and rcu_start_this_gp() uses and I haven't found an obvious > > > issue with that. > > > > That would result in the root rcu_node structure's ->gp_seq declaring the > > grace period over before the rest of the rcu_node structures were aware > > of it having ended. I guess that the last leaf rcu_node structure's > > ->gp_seq field would be what you would look at if you were looking for > > the actual grace period end. > > > > Of course, you could argue that the grace period isn't fully over > > until all the CPUs are aware of it, but that can take a very long time, > > especially if there are CPUs staying idle for seconds (let alone minutes) > > at at time. > > Right. And it seems not all parts of RCU agree on a common definition of > a complete grace period. Polling seem to be fine with just the root rnp > to reflect rcu_seq_end() (which happens before the leaves do). Snapshot takers > though rely on the whole tree to reflect the GP end... > > As far as I audited the code, mostly rcu_seq_snap() callers, rcu_start_this_gp() > and __note_gp_changes(), it looks fine to consider the GP is ended as soon > as the root reflects it. I may have missed something of course... The disagreement is a feature, at least up to a point. That feature allows CPUs to go idle for long periods without RCU having to bother them or to mess with their per-CPU data (give or take ->gpwrap). It also allows per-rcu_node-leaf locking, which is important on large systems. Trying to make precisely globally agreed-on beginnings and ends of RCU grace periods will not end well from performance, scalability, or real-time-response viewpoints. ;-) But simplifications that don't hurt performance, scalability, and real-time-response are of course welcome. 
> > > diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h > > > index feb3ac1dc5d5..c4bb3eb92a58 100644 > > > --- a/kernel/rcu/rcu.h > > > +++ b/kernel/rcu/rcu.h > > > @@ -19,6 +19,12 @@ > > > * The two least significant bits contain the control flags. > > > * The most significant bits contain the grace-period sequence counter. > > > * > > > + * RCU_STATE_STARTED : The gp has started but QS are not > > > + * requested yet. > > > + * > > > + * RCU_STATE_WAIT_QS : The gp has started and waits for > > > + * QS to be reported. > > > > Do we want RCU_STATE_IDLE, by analogy with SRCU? > > Makes sense. > > > > > > * When both control flags are zero, no grace period is in progress. > > > * When either bit is non-zero, a grace period has started and is in > > > * progress. When the grace period completes, the control flags are reset > > > @@ -54,6 +60,9 @@ > > > * grace-period sequence number. > > > */ > > > > > > +#define RCU_STATE_STARTED 0x1 > > > +#define RCU_STATE_WAIT_QS 0x2 > > > + > > > /* Low-order bit definition for polled grace-period APIs. */ > > > #define RCU_GET_STATE_COMPLETED 0x1 > > > > > > @@ -92,7 +101,13 @@ static inline void rcu_seq_start(unsigned long *sp) > > > { > > > WRITE_ONCE(*sp, *sp + 1); > > > smp_mb(); /* Ensure update-side operation after counter increment. */ > > > - WARN_ON_ONCE(rcu_seq_state(*sp) != 1); > > > + WARN_ON_ONCE(rcu_seq_state(*sp) != RCU_STATE_STARTED); > > > +} > > > + > > > +static inline void rcu_seq_wait_qs(unsigned long *sp) > > > +{ > > > + WRITE_ONCE(*sp, *sp + 1); > > > + WARN_ON_ONCE(rcu_seq_state(*sp) != RCU_STATE_WAIT_QS); > > > > Or we could have the WARN_ON_ONCE() first, and use rcu_seq_set_state(). > > Right! > > > > @@ -192,6 +192,21 @@ int rcu_get_gp_kthreads_prio(void) > > > } > > > EXPORT_SYMBOL_GPL(rcu_get_gp_kthreads_prio); > > > > > > +/* > > > + * Return the root node of the rcu_state structure. > > > + */ > > > +static struct rcu_node *rcu_get_root(void) > > > +{ > > > + return &rcu_state.node[0]; > > > +} > > > > OK, just code movement. > > Right, if I proceed with this, a series will (try to) split more gently > for review. Fair enough! > > > +static unsigned long *rcu_root_seq(void) > > > +{ > > > + struct rcu_node *root = rcu_get_root(); > > > + > > > + return &root->gp_seq; > > > +} > > > > Adding this function might be useful in any case, even if we keep the > > current rcu_state.gp_seq counter. A single place to add comments that > > address the chosen type of confusion, if nothing else. ;-) > > ;-)) > > > > > 13 calls want the pointer and 13 want the value. An even split! > > The split is but 13 itself is not even! And I know people who strongly > want it to stay that way! Bad regex morning, apparently... > > Not sure what lesson to draw from this, though indirecting a function call will > > likely cause some confusion. Some might argue for two access functions, > > one for the pointer and the other for the value. Me, I am just glad that > > I included the word "might" in the first sentence of the prior paragraph. > > Right, I'm not proud of the *rcu_root_seq() things. > I thought about adding rcu_root_seq_current() but I'm not entirely comfortable > with having all readers under READ_ONCE(). I like when our assumptions of non > concurrent plain accesses are defeatedly reported by KCSAN. Good point. I believe that ASSERT_EXCLUSIVE_WRITER() can be used to make this work, but plain accesses are admittedly much less typing. And the small number of uses makes three different access functions less than fully attractive. 
> > > @@ -1249,7 +1256,7 @@ static bool __note_gp_changes(struct rcu_node *rnp, struct rcu_data *rdp) > > > > > > raw_lockdep_assert_held_rcu_node(rnp); > > > > > > - if (rdp->gp_seq == rnp->gp_seq) > > > + if (!rcu_seq_changed(rdp->gp_seq, rnp->gp_seq)) > > > > The old code would detect a bunch of elapsed grace periods, starting > > and ending in the same state. The new code does not. Does the new code > > correctly handle CPUs that go idle for many grace periods? > > I'm not sure I understand what you mean. rcu_seq_changed() is a comparison > that simply clears RCU_SEQ_STARTED on both sides such that: > > * Different GP numbers with or without RCU_SEQ_STARTED/RCU_SEQ_WAIT_QS > are always considered a change. > > * Same GP numbers but rnp has RCU_SEQ_WAIT_QS while rdp hasn't is considered a > change > > but: > > * Same GP numbers but rnp has RCU_SEQ_STARTED while rdp hasn't is _NOT_ > considered a change. This is not for correctness but rather to spare > rnp locking and useless work. The ->qsmask are 0 anyway so no QS can > be reported too early. Idiot here missed the pair of tildes ("~") in that function. I clearly need to revisit this code with that in mind... > > > @@ -1802,10 +1809,9 @@ static noinline_for_stack bool rcu_gp_init(void) > > > /* Advance to a new grace period and initialize state. */ > > > record_gp_stall_check_time(); > > > /* Record GP times before starting GP, hence rcu_seq_start(). */ > > > - rcu_seq_start(&rcu_state.gp_seq); > > > - ASSERT_EXCLUSIVE_WRITER(rcu_state.gp_seq); > > > > Why are we dropping the ASSERT_EXCLUSIVE_WRITER()? As opposed to moving > > it to the setting of the rcu_node structures' ->gp_seq updates? > > I must confess I misunderstood that macro and thought it was to assert > the variable has no other writer _at all_. But that seem to apply > scopewise. So yes good point. > > > > @@ -1902,7 +1908,10 @@ static noinline_for_stack bool rcu_gp_init(void) > > > rdp = this_cpu_ptr(&rcu_data); > > > rcu_preempt_check_blocked_tasks(rnp); > > > rnp->qsmask = rnp->qsmaskinit; > > > - WRITE_ONCE(rnp->gp_seq, rcu_state.gp_seq); > > > + if (!rnp->parent) > > > + rcu_seq_wait_qs(&rnp->gp_seq); > > > + else > > > + WRITE_ONCE(rnp->gp_seq, *rcu_root_seq()); > > > > This asymmetry is necessary because if I understand correctly, the > > non-root rcu_node structures still have zero state. But this is one > > argument for use of rcu_seq_set_state(), because then you could just do > > rcu_seq_set_state(rnp->gp_seq, RCU_STATE_WAIT_QS) in both cases. > > > > Or am I missing something that prevents this from working? > > Indeed, the sequence counter (but not the state) is expected to be the same on > all nodes, so we can just modify the state as you said. And we can then assert > WARN_ON_ONCE(rnp->gp_seq != *rcu_root_seq()) > > > > @@ -2114,16 +2124,12 @@ static noinline void rcu_gp_cleanup(void) > > > * the rcu_node structures before the beginning of the next grace > > > * period is recorded in any of the rcu_node structures. > > > */ > > > - new_gp_seq = rcu_state.gp_seq; > > > - rcu_seq_end(&new_gp_seq); > > > rcu_for_each_node_breadth_first(rnp) { > > > raw_spin_lock_irq_rcu_node(rnp); > > > if (WARN_ON_ONCE(rcu_preempt_blocked_readers_cgp(rnp))) > > > dump_blkd_tasks(rnp, 10); > > > WARN_ON_ONCE(rnp->qsmask); > > > - WRITE_ONCE(rnp->gp_seq, new_gp_seq); > > > - if (!rnp->parent) > > > - smp_mb(); // Order against failing poll_state_synchronize_rcu_full(). > > > > OK, this got moved up. 
> > > > > + WRITE_ONCE(rnp->gp_seq, *rcu_root_seq()); > > > > The idea being that the root rcu_node ->gp_seq is just overwritten > > with itself, correct? > > Right! > > > > > > rdp = this_cpu_ptr(&rcu_data); > > > if (rnp == rdp->mynode) > > > needgp = __note_gp_changes(rnp, rdp) || needgp; > > > @@ -2146,9 +2152,7 @@ static noinline void rcu_gp_cleanup(void) > > > raw_spin_lock_irq_rcu_node(rnp); /* GP before ->gp_seq update. */ > > > > > > /* Declare grace period done, trace first to use old GP number. */ > > > - trace_rcu_grace_period(rcu_state.name, rcu_state.gp_seq, TPS("end")); > > > - rcu_seq_end(&rcu_state.gp_seq); > > > - ASSERT_EXCLUSIVE_WRITER(rcu_state.gp_seq); > > > > Same question on not moving ASSERT_EXCLUSIVE_WRITER() to the rcu_node > > structures' ->gp_seq updates. > > Indeed! > > > > diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h > > > index a9a811d9d7a3..9d811270b808 100644 > > > --- a/kernel/rcu/tree.h > > > +++ b/kernel/rcu/tree.h > > > @@ -347,7 +347,6 @@ struct rcu_state { > > > > > > /* The following fields are guarded by the root rcu_node's lock. */ > > > > > > - unsigned long gp_seq ____cacheline_internodealigned_in_smp; > > > /* Grace-period sequence #. */ > > > > Heh! Does the rcu_node structure's ->gp_seq need alignment? > > Probably but then this will increase the size of all rnp's, right? Indeed, this probably needs actual performance results showing that it is needed. My guess is that only systems with a single rcu_node structure that is both leaf and root would have any chance of noticing. And those tend to have few CPUs, so they might not care. Thanx, Paul
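(Aside: the rcu_seq_set_state() helper discussed above does not exist in mainline; it is the suggested alternative to the rcu_seq_wait_qs()/WRITE_ONCE() asymmetry in rcu_gp_init(). A hedged sketch of what it might look like, reusing the existing RCU_SEQ_STATE_MASK definition from kernel/rcu/rcu.h:)

/*
 * Hypothetical helper: update only the low-order state bits of a ->gp_seq
 * value, leaving the grace-period counter untouched.  With this, the
 * breadth-first loop in rcu_gp_init() could treat root and non-root
 * rcu_node structures identically, since their counters already match.
 */
static inline void rcu_seq_set_state(unsigned long *sp, int newstate)
{
	WARN_ON_ONCE(newstate & ~RCU_SEQ_STATE_MASK);
	WRITE_ONCE(*sp, (*sp & ~RCU_SEQ_STATE_MASK) | newstate);
}

Both branches of the loop quoted earlier could then call rcu_seq_set_state(&rnp->gp_seq, RCU_STATE_WAIT_QS), optionally preceded by a WARN_ON_ONCE() that rcu_seq_ctr(rnp->gp_seq) matches the root's counter, along the lines suggested earlier in the thread.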
Le Mon, Mar 31, 2025 at 02:29:52PM -0700, Paul E. McKenney a écrit : > The disagreement is a feature, at least up to a point. That feature > allows CPUs to go idle for long periods without RCU having to bother > them or to mess with their per-CPU data (give or take ->gpwrap). It also > allows per-rcu_node-leaf locking, which is important on large systems. > > Trying to make precisely globally agreed-on beginnings and ends of > RCU grace periods will not end well from performance, scalability, > or real-time-response viewpoints. ;-) The distributed disagreement is definitely a feature. The duplicate root is more debatable. > But simplifications that don't hurt performance, scalability, and > real-time-response are of course welcome. I'm not even sure my proposal is a simplification. Perhaps it is. Another hope is that it could avoid future accidents. > Indeed, this probably needs actual performance results showing that > it is needed. My guess is that only systems with a single rcu_node > structure that is both leaf and root would have any chance of noticing. > And those tend to have few CPUs, so they might not care. Do you have any idea for a benchmark to test here? Thanks. > Thanx, Paul
On Tue, Apr 01, 2025 at 03:38:21PM +0200, Frederic Weisbecker wrote: > Le Mon, Mar 31, 2025 at 02:29:52PM -0700, Paul E. McKenney a écrit : > > The disagreement is a feature, at least up to a point. That feature > > allows CPUs to go idle for long periods without RCU having to bother > > them or to mess with their per-CPU data (give or take ->gpwrap). It also > > allows per-rcu_node-leaf locking, which is important on large systems. > > > > Trying to make precisely globally agreed-on beginnings and ends of > > RCU grace periods will not end well from performance, scalability, > > or real-time-response viewpoints. ;-) > > The distributed disagreement is definetly a feature. The duplicate root > is more debatable. > > > But simplifications that don't hurt performance, scalability, and > > real-time-response are of course welcome. > > I'm not even sure my proposal is a simplification. Perhaps it is. Another > hope is that it could avoid future accidents. > > > Indeed, this probably needs actual performance results showing that > > it is needed. My guess is that only systems with a single rcu_node > > structure that is both leaf and root would have any chance of noticing. > > And those tend to have few CPUs, so they might not care. > > Do you have any idea for a benchmark to test here? The best that I can come up with is hammering with a combination of expedited grace periods and polled grace periods. Thanx, Paul
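(Aside: a minimal sketch of such a stress loop, for instance as a throwaway kernel-module kthread; the function and counter names here are made up, wiring it into kthread_run() is omitted, and rcutorture or refscale would be the more usual vehicles:)

#include <linux/kthread.h>
#include <linux/rcupdate.h>

static int gp_hammer_fn(void *unused)
{
	struct rcu_gp_oldstate rgos;
	unsigned long n_done = 0, n_pending = 0;

	while (!kthread_should_stop()) {
		/* Take a full polled cookie, then force a fast grace period. */
		get_state_synchronize_rcu_full(&rgos);
		synchronize_rcu_expedited();
		if (poll_state_synchronize_rcu_full(&rgos))
			n_done++;
		else
			n_pending++;	/* would be surprising, but count it */
		cond_resched();
	}
	pr_info("gp_hammer: done=%lu pending=%lu\n", n_done, n_pending);
	return 0;
}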
diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
index eed2951a4962..5e1ee570bb27 100644
--- a/kernel/rcu/rcu.h
+++ b/kernel/rcu/rcu.h
@@ -57,6 +57,9 @@
 /* Low-order bit definition for polled grace-period APIs. */
 #define RCU_GET_STATE_COMPLETED 0x1
 
+/* A complete grace period count */
+#define RCU_SEQ_GP (RCU_SEQ_STATE_MASK + 1)
+
 extern int sysctl_sched_rt_runtime;
 
 /*
@@ -162,7 +165,7 @@ static inline bool rcu_seq_done_exact(unsigned long *sp, unsigned long s)
 {
 	unsigned long cur_s = READ_ONCE(*sp);
 
-	return ULONG_CMP_GE(cur_s, s) || ULONG_CMP_LT(cur_s, s - (3 * RCU_SEQ_STATE_MASK + 1));
+	return ULONG_CMP_GE(cur_s, s) || ULONG_CMP_LT(cur_s, s - (2 * RCU_SEQ_GP));
 }
 
 /*
The rcu_seq_done_exact() function checks if a grace period has completed by
comparing sequence numbers. It includes a guard band to handle sequence number
wraparound, which was previously expressed using the magic number calculation
'3 * RCU_SEQ_STATE_MASK + 1'.

This magic number is not immediately obvious in terms of what it represents.

Instead, the reason we need this tiny guard band is the lag between the
setting of rcu_state.gp_seq_polled and the root rnp's gp_seq in rcu_gp_init().

The guard band needs to be at least 2 GPs worth of counts to avoid recognizing
a newly started GP as completed immediately, due to the following sequence,
which arises from the delay between the update of rcu_state.gp_seq_polled and
the root rnp's gp_seq:

	rnp->gp_seq = rcu_state.gp_seq = 0

	CPU 0					CPU 1
	-----					-----
	// rcu_state.gp_seq = 1
	rcu_seq_start(&rcu_state.gp_seq)
						// snap = 8
						snap = rcu_seq_snap(&rcu_state.gp_seq)
						// Two full GP differences
						rcu_seq_done_exact(&rnp->gp_seq, snap)
	// rnp->gp_seq = 1
	WRITE_ONCE(rnp->gp_seq, rcu_state.gp_seq);

This can happen because get_state_synchronize_rcu_full() samples
rcu_state.gp_seq_polled, whereas poll_state_synchronize_rcu_full() samples the
root rnp's gp_seq. The delay between the updates of the two counters occurs in
rcu_gp_init(), during which the counters briefly go out of sync.

Make the guard band explicitly 2 GPs. This improves code readability and
maintainability by making the intent clearer as well.

Suggested-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
---
 kernel/rcu/rcu.h | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)
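(Aside: the arithmetic above can be checked with a small stand-alone userspace program. The constants and comparison macros mirror kernel/rcu/rcu.h, but pointers are replaced by plain values and the guard band is passed as a parameter so the 1-GP and 2-GP cases can be compared; this is an illustration only:)

#include <stdio.h>
#include <limits.h>

#define RCU_SEQ_CTR_SHIFT	2
#define RCU_SEQ_STATE_MASK	((1 << RCU_SEQ_CTR_SHIFT) - 1)
#define RCU_SEQ_GP		(RCU_SEQ_STATE_MASK + 1)
#define ULONG_CMP_GE(a, b)	(ULONG_MAX / 2 >= (a) - (b))
#define ULONG_CMP_LT(a, b)	(ULONG_MAX / 2 < (a) - (b))

static unsigned long rcu_seq_snap(unsigned long s)
{
	return (s + 2 * RCU_SEQ_STATE_MASK + 1) & ~RCU_SEQ_STATE_MASK;
}

static int done_exact(unsigned long cur_s, unsigned long s, unsigned long guard)
{
	return ULONG_CMP_GE(cur_s, s) || ULONG_CMP_LT(cur_s, s - guard);
}

int main(void)
{
	unsigned long rnp_gp_seq = 0;	/* root rnp->gp_seq, not yet updated */
	unsigned long gp_seq = 0;	/* rcu_state.gp_seq */
	unsigned long snap;

	gp_seq++;			/* rcu_seq_start(): 0 -> 1 */
	snap = rcu_seq_snap(gp_seq);	/* == 8, two full GPs ahead of rnp */

	/* With only 1 GP of guard band, the snapshot wrongly appears complete: */
	printf("1 GP guard : done=%d\n", done_exact(rnp_gp_seq, snap, 1 * RCU_SEQ_GP));
	/* With 2 GPs (the new RCU_SEQ_GP-based guard band), it does not: */
	printf("2 GP guard : done=%d\n", done_exact(rnp_gp_seq, snap, 2 * RCU_SEQ_GP));
	return 0;
}

Running this prints done=1 for the 1-GP guard band and done=0 for the 2-GP guard band, which is why the band must cover at least two grace periods' worth of counts.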