
[RFC,v1,0/2] Avoid rcu_core() if CPU just left guest vcpu

Message ID 20240328171949.743211-1-leobras@redhat.com (mailing list archive)

Message

Leonardo Bras March 28, 2024, 5:19 p.m. UTC
I am dealing with a latency issue inside a KVM guest, which is caused by
a sched_switch to rcuc[1].

During guest entry, kernel code will signal to RCU that the current CPU is
in a quiescent state, making sure no other CPU is left waiting for this one.

If a vcpu just stopped running (guest_exit), and a synchronize_rcu() was
issued somewhere since guest entry, there is a chance a timer interrupt
will happen on that CPU, which will cause rcu_sched_clock_irq() to run.

rcu_sched_clock_irq() will check rcu_pending(), which will return true
and cause invoke_rcu_core() to be called, which will (in the current
config) cause rcuc/N to be scheduled onto the current CPU.

In rcu_pending(), I noticed we can avoid returning true (and thus invoking
rcu_core()) if the current CPU is nohz_full and came from either idle or
userspace, since both are considered quiescent states.

Since this is also true for guest context, my idea is to solve this latency
issue by avoiding rcu_core() invocation if the CPU was recently running a
guest vcpu.

On the other hand, I could not find a way of reliably telling whether the
current CPU was running a guest vcpu, so patch #1 implements a per-cpu
variable for keeping the time (jiffies) of the last guest exit.

In patch #2 I compare the current time to that timestamp, and if less than
a second has passed, we just skip the rcu_core() invocation, since there is
a high chance the CPU will just go back to the guest in a moment.

What I know is weird with this patch:
1 - Not sure if this is the best way of finding out if the cpu was
    running a guest recently.

2 - This per-cpu variable needs to get set at each guest_exit(), so it
    adds overhead, even though it's supposed to be in local cache. If
    that's an issue, I would suggest having this part compiled out on
    !CONFIG_NO_HZ_FULL, but further checking whether each CPU is
    nohz_full enabled seems more expensive than just doing the write.

3 - It checks if the guest exit happened more than 1 second ago. This 1
    second value was copied from rcu_nohz_full_cpu(), which checks if the
    grace period started more than a second ago. If this value is bad,
    I have no issue changing it.

4 - Even though I could detect no issue, I included linux/kvm_host.h into
    rcu/tree_plugin.h, which is the first time it's getting included
    outside of kvm or arch code, and can be weird. An alternative would
    be to create a new header for providing data to non-kvm code.

Please provide feedback.

Thanks!
Leo
[1]: It uses a PREEMPT_RT kernel, with the guest cpus running on isolated,
rcu_nocbs, nohz_full cpus.
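For reference, the kind of CPU isolation described in [1] is typically
configured with kernel boot parameters along these lines (the CPU list
2-7 is purely illustrative, not taken from the reported setup):

```
isolcpus=2-7 rcu_nocbs=2-7 nohz_full=2-7
```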

Leonardo Bras (2):
  kvm: Implement guest_exit_last_time()
  rcu: Ignore RCU in nohz_full cpus if it was running a guest recently

 include/linux/kvm_host.h | 13 +++++++++++++
 kernel/rcu/tree_plugin.h | 14 ++++++++++++++
 kernel/rcu/tree.c        |  4 +++-
 virt/kvm/kvm_main.c      |  3 +++
 4 files changed, 33 insertions(+), 1 deletion(-)


base-commit: 8d025e2092e29bfd13e56c78e22af25fac83c8ec

Comments

Sean Christopherson April 1, 2024, 8:21 p.m. UTC | #1
On Thu, Mar 28, 2024, Leonardo Bras wrote:
> I am dealing with a latency issue inside a KVM guest, which is caused by
> a sched_switch to rcuc[1].
> 
> During guest entry, kernel code will signal to RCU that current CPU was on
> a quiescent state, making sure no other CPU is waiting for this one.
> 
> If a vcpu just stopped running (guest_exit), and a syncronize_rcu() was
> issued somewhere since guest entry, there is a chance a timer interrupt
> will happen in that CPU, which will cause rcu_sched_clock_irq() to run.
> 
> rcu_sched_clock_irq() will check rcu_pending() which will return true,
> and cause invoke_rcu_core() to be called, which will (in current config)
> cause rcuc/N to be scheduled into the current cpu.
> 
> On rcu_pending(), I noticed we can avoid returning true (and thus invoking
> rcu_core()) if the current cpu is nohz_full, and the cpu came from either
> idle or userspace, since both are considered quiescent states.
> 
> Since this is also true to guest context, my idea to solve this latency
> issue by avoiding rcu_core() invocation if it was running a guest vcpu.
> 
> On the other hand, I could not find a way of reliably saying the current
> cpu was running a guest vcpu, so patch #1 implements a per-cpu variable
> for keeping the time (jiffies) of the last guest exit.
> 
> In patch #2 I compare current time to that time, and if less than a second
> has past, we just skip rcu_core() invocation, since there is a high chance
> it will just go back to the guest in a moment.

What's the downside if there's a false positive?

> What I know it's weird with this patch:
> 1 - Not sure if this is the best way of finding out if the cpu was
>     running a guest recently.
> 
> 2 - This per-cpu variable needs to get set at each guest_exit(), so it's
>     overhead, even though it's supposed to be in local cache. If that's
>     an issue, I would suggest having this part compiled out on 
>     !CONFIG_NO_HZ_FULL, but further checking each cpu for being nohz_full
>     enabled seems more expensive than just setting this out.

A per-CPU write isn't problematic, but I suspect reading jiffies will be quite
imprecise, e.g. it'll be a full tick "behind" on many exits.

> 3 - It checks if the guest exit happened over than 1 second ago. This 1
>     second value was copied from rcu_nohz_full_cpu() which checks if the
>     grace period started over than a second ago. If this value is bad,
>     I have no issue changing it.

IMO, checking if a CPU "recently" ran a KVM vCPU is a suboptimal heuristic regardless
of what magic time threshold is used.  IIUC, what you want is a way to detect if
a CPU is likely to _run_ a KVM vCPU in the near future.  KVM can provide that
information with much better precision, e.g. KVM knows when it's in the core
vCPU run loop.

> 4 - Even though I could detect no issue, I included linux/kvm_host.h into 
>     rcu/tree_plugin.h, which is the first time it's getting included
>     outside of kvm or arch code, and can be weird.

Heh, kvm_host.h isn't included outside of KVM because several architectures can
build KVM as a module, which means referencing global KVM variables from the kernel
proper won't work.

>     An alternative would be to create a new header for providing data for
>     non-kvm code.

I doubt a new .h or .c file is needed just for this, there's gotta be a decent
landing spot for a one-off variable.  E.g. I wouldn't be at all surprised if there
is additional usefulness in knowing if a CPU is in KVM's core run loop and thus
likely to do a VM-Enter in the near future, at which point you could probably make
a good argument for adding a flag in "struct context_tracking".  Even without a
separate use case, there's a good argument for adding that info to context_tracking.
Marcelo Tosatti April 5, 2024, 1:45 p.m. UTC | #2
On Mon, Apr 01, 2024 at 01:21:25PM -0700, Sean Christopherson wrote:
> On Thu, Mar 28, 2024, Leonardo Bras wrote:
> > I am dealing with a latency issue inside a KVM guest, which is caused by
> > a sched_switch to rcuc[1].
> > 
> > During guest entry, kernel code will signal to RCU that current CPU was on
> > a quiescent state, making sure no other CPU is waiting for this one.
> > 
> > If a vcpu just stopped running (guest_exit), and a syncronize_rcu() was
> > issued somewhere since guest entry, there is a chance a timer interrupt
> > will happen in that CPU, which will cause rcu_sched_clock_irq() to run.
> > 
> > rcu_sched_clock_irq() will check rcu_pending() which will return true,
> > and cause invoke_rcu_core() to be called, which will (in current config)
> > cause rcuc/N to be scheduled into the current cpu.
> > 
> > On rcu_pending(), I noticed we can avoid returning true (and thus invoking
> > rcu_core()) if the current cpu is nohz_full, and the cpu came from either
> > idle or userspace, since both are considered quiescent states.
> > 
> > Since this is also true to guest context, my idea to solve this latency
> > issue by avoiding rcu_core() invocation if it was running a guest vcpu.
> > 
> > On the other hand, I could not find a way of reliably saying the current
> > cpu was running a guest vcpu, so patch #1 implements a per-cpu variable
> > for keeping the time (jiffies) of the last guest exit.
> > 
> > In patch #2 I compare current time to that time, and if less than a second
> > has past, we just skip rcu_core() invocation, since there is a high chance
> > it will just go back to the guest in a moment.
> 
> What's the downside if there's a false positive?

rcuc wakes up (which might exceed the allowed latency threshold
for certain realtime apps).

> > What I know it's weird with this patch:
> > 1 - Not sure if this is the best way of finding out if the cpu was
> >     running a guest recently.
> > 
> > 2 - This per-cpu variable needs to get set at each guest_exit(), so it's
> >     overhead, even though it's supposed to be in local cache. If that's
> >     an issue, I would suggest having this part compiled out on 
> >     !CONFIG_NO_HZ_FULL, but further checking each cpu for being nohz_full
> >     enabled seems more expensive than just setting this out.
> 
> A per-CPU write isn't problematic, but I suspect reading jiffies will be quite
> imprecise, e.g. it'll be a full tick "behind" on many exits.
> 
> > 3 - It checks if the guest exit happened over than 1 second ago. This 1
> >     second value was copied from rcu_nohz_full_cpu() which checks if the
> >     grace period started over than a second ago. If this value is bad,
> >     I have no issue changing it.
> 
> IMO, checking if a CPU "recently" ran a KVM vCPU is a suboptimal heuristic regardless
> of what magic time threshold is used.  

Why? It works for this particular purpose.

> IIUC, what you want is a way to detect if
> a CPU is likely to _run_ a KVM vCPU in the near future.  KVM can provide that
> information with much better precision, e.g. KVM knows when when it's in the core
> vCPU run loop.

ktime_t ktime_get(void)
{
        struct timekeeper *tk = &tk_core.timekeeper;
        unsigned int seq;
        ktime_t base;
        u64 nsecs;

        WARN_ON(timekeeping_suspended);

        do {
                seq = read_seqcount_begin(&tk_core.seq);
                base = tk->tkr_mono.base;
                nsecs = timekeeping_get_ns(&tk->tkr_mono);

        } while (read_seqcount_retry(&tk_core.seq, seq));

        return ktime_add_ns(base, nsecs);
}
EXPORT_SYMBOL_GPL(ktime_get);

ktime_get() is more expensive than unsigned long assignment.

What is done is: if the vcpu has entered guest mode in the past, then the
CPU has passed through an RCU extended quiescent state, and therefore it
is not necessary to wake up the RCU core.

The logic is copied from:

/*
 * Is this CPU a NO_HZ_FULL CPU that should ignore RCU so that the
 * grace-period kthread will do force_quiescent_state() processing?
 * The idea is to avoid waking up RCU core processing on such a
 * CPU unless the grace period has extended for too long.
 *
 * This code relies on the fact that all NO_HZ_FULL CPUs are also
 * RCU_NOCB_CPU CPUs.
 */
static bool rcu_nohz_full_cpu(void)
{
#ifdef CONFIG_NO_HZ_FULL
        if (tick_nohz_full_cpu(smp_processor_id()) &&
            (!rcu_gp_in_progress() ||
             time_before(jiffies, READ_ONCE(rcu_state.gp_start) + HZ)))
                return true;
#endif /* #ifdef CONFIG_NO_HZ_FULL */
        return false;
}

Note:

avoid waking up RCU core processing on such a
CPU unless the grace period has extended for too long.

> > 4 - Even though I could detect no issue, I included linux/kvm_host.h into 
> >     rcu/tree_plugin.h, which is the first time it's getting included
> >     outside of kvm or arch code, and can be weird.
> 
> Heh, kvm_host.h isn't included outside of KVM because several architectures can
> build KVM as a module, which means referencing global KVM varibles from the kernel
> proper won't work.
> 
> >     An alternative would be to create a new header for providing data for
> >     non-kvm code.
> 
> I doubt a new .h or .c file is needed just for this, there's gotta be a decent
> landing spot for a one-off variable.  E.g. I wouldn't be at all surprised if there
> is additional usefulness in knowing if a CPU is in KVM's core run loop and thus
> likely to do a VM-Enter in the near future, at which point you could probably make
> a good argument for adding a flag in "struct context_tracking".  Even without a
> separate use case, there's a good argument for adding that info to context_tracking.

Well, jiffies is cheap and just works. 

Perhaps can add higher resolution later if required?
Sean Christopherson April 5, 2024, 2:42 p.m. UTC | #3
On Fri, Apr 05, 2024, Marcelo Tosatti wrote:
> On Mon, Apr 01, 2024 at 01:21:25PM -0700, Sean Christopherson wrote:
> > On Thu, Mar 28, 2024, Leonardo Bras wrote:
> > > I am dealing with a latency issue inside a KVM guest, which is caused by
> > > a sched_switch to rcuc[1].
> > > 
> > > During guest entry, kernel code will signal to RCU that current CPU was on
> > > a quiescent state, making sure no other CPU is waiting for this one.
> > > 
> > > If a vcpu just stopped running (guest_exit), and a syncronize_rcu() was
> > > issued somewhere since guest entry, there is a chance a timer interrupt
> > > will happen in that CPU, which will cause rcu_sched_clock_irq() to run.
> > > 
> > > rcu_sched_clock_irq() will check rcu_pending() which will return true,
> > > and cause invoke_rcu_core() to be called, which will (in current config)
> > > cause rcuc/N to be scheduled into the current cpu.
> > > 
> > > On rcu_pending(), I noticed we can avoid returning true (and thus invoking
> > > rcu_core()) if the current cpu is nohz_full, and the cpu came from either
> > > idle or userspace, since both are considered quiescent states.
> > > 
> > > Since this is also true to guest context, my idea to solve this latency
> > > issue by avoiding rcu_core() invocation if it was running a guest vcpu.
> > > 
> > > On the other hand, I could not find a way of reliably saying the current
> > > cpu was running a guest vcpu, so patch #1 implements a per-cpu variable
> > > for keeping the time (jiffies) of the last guest exit.
> > > 
> > > In patch #2 I compare current time to that time, and if less than a second
> > > has past, we just skip rcu_core() invocation, since there is a high chance
> > > it will just go back to the guest in a moment.
> > 
> > What's the downside if there's a false positive?
> 
> rcuc wakes up (which might exceed the allowed latency threshold
> for certain realtime apps).

Isn't that a false negative? (RCU doesn't detect that a CPU is about to (re)enter
a guest)  I was trying to ask about the case where RCU thinks a CPU is about to
enter a guest, but the CPU never does (at least, not in the immediate future).

Or am I just not understanding how RCU's kthreads work?

> > > What I know it's weird with this patch:
> > > 1 - Not sure if this is the best way of finding out if the cpu was
> > >     running a guest recently.
> > > 
> > > 2 - This per-cpu variable needs to get set at each guest_exit(), so it's
> > >     overhead, even though it's supposed to be in local cache. If that's
> > >     an issue, I would suggest having this part compiled out on 
> > >     !CONFIG_NO_HZ_FULL, but further checking each cpu for being nohz_full
> > >     enabled seems more expensive than just setting this out.
> > 
> > A per-CPU write isn't problematic, but I suspect reading jiffies will be quite
> > imprecise, e.g. it'll be a full tick "behind" on many exits.
> > 
> > > 3 - It checks if the guest exit happened over than 1 second ago. This 1
> > >     second value was copied from rcu_nohz_full_cpu() which checks if the
> > >     grace period started over than a second ago. If this value is bad,
> > >     I have no issue changing it.
> > 
> > IMO, checking if a CPU "recently" ran a KVM vCPU is a suboptimal heuristic regardless
> > of what magic time threshold is used.  
> 
> Why? It works for this particular purpose.

Because maintaining magic numbers is no fun, AFAICT the heuristic doesn't guard
against edge cases, and I'm pretty sure we can do better with about the same amount
of effort/churn.

> > IIUC, what you want is a way to detect if a CPU is likely to _run_ a KVM
> > vCPU in the near future.  KVM can provide that information with much better
> > precision, e.g. KVM knows when when it's in the core vCPU run loop.
> 
> ktime_t ktime_get(void)
> {
>         struct timekeeper *tk = &tk_core.timekeeper;
>         unsigned int seq;
>         ktime_t base;
>         u64 nsecs;
> 
>         WARN_ON(timekeeping_suspended);
> 
>         do {
>                 seq = read_seqcount_begin(&tk_core.seq);
>                 base = tk->tkr_mono.base;
>                 nsecs = timekeeping_get_ns(&tk->tkr_mono);
> 
>         } while (read_seqcount_retry(&tk_core.seq, seq));
> 
>         return ktime_add_ns(base, nsecs);
> }
> EXPORT_SYMBOL_GPL(ktime_get);
> 
> ktime_get() is more expensive than unsigned long assignment.

Huh?  What does ktime_get() have to do with anything?  I'm suggesting something
like the below (wants_to_run is from an in-flight patch,
https://lore.kernel.org/all/20240307163541.92138-1-dmatlack@google.com).

---
 include/linux/context_tracking.h       | 12 ++++++++++++
 include/linux/context_tracking_state.h |  3 +++
 kernel/rcu/tree.c                      |  9 +++++++--
 virt/kvm/kvm_main.c                    |  7 +++++++
 4 files changed, 29 insertions(+), 2 deletions(-)

diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
index 6e76b9dba00e..59bc855701c5 100644
--- a/include/linux/context_tracking.h
+++ b/include/linux/context_tracking.h
@@ -86,6 +86,16 @@ static __always_inline void context_tracking_guest_exit(void)
 		__ct_user_exit(CONTEXT_GUEST);
 }
 
+static inline void context_tracking_guest_start_run_loop(void)
+{
+	__this_cpu_write(context_tracking.in_guest_run_loop, true);
+}
+
+static inline void context_tracking_guest_stop_run_loop(void)
+{
+	__this_cpu_write(context_tracking.in_guest_run_loop, false);
+}
+
 #define CT_WARN_ON(cond) WARN_ON(context_tracking_enabled() && (cond))
 
 #else
@@ -99,6 +109,8 @@ static inline int ct_state(void) { return -1; }
 static inline int __ct_state(void) { return -1; }
 static __always_inline bool context_tracking_guest_enter(void) { return false; }
 static __always_inline void context_tracking_guest_exit(void) { }
+static inline void context_tracking_guest_start_run_loop(void) { }
+static inline void context_tracking_guest_stop_run_loop(void) { }
 #define CT_WARN_ON(cond) do { } while (0)
 #endif /* !CONFIG_CONTEXT_TRACKING_USER */
 
diff --git a/include/linux/context_tracking_state.h b/include/linux/context_tracking_state.h
index bbff5f7f8803..629ada1a4d81 100644
--- a/include/linux/context_tracking_state.h
+++ b/include/linux/context_tracking_state.h
@@ -25,6 +25,9 @@ enum ctx_state {
 #define CT_DYNTICKS_MASK (~CT_STATE_MASK)
 
 struct context_tracking {
+#if IS_ENABLED(CONFIG_KVM)
+	bool in_guest_run_loop;
+#endif
 #ifdef CONFIG_CONTEXT_TRACKING_USER
 	/*
 	 * When active is false, probes are unset in order
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index d9642dd06c25..303ae9ae1c53 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3937,8 +3937,13 @@ static int rcu_pending(int user)
 	if (rcu_nocb_need_deferred_wakeup(rdp, RCU_NOCB_WAKE))
 		return 1;
 
-	/* Is this a nohz_full CPU in userspace or idle?  (Ignore RCU if so.) */
-	if ((user || rcu_is_cpu_rrupt_from_idle()) && rcu_nohz_full_cpu())
+	/*
+	 * Is this a nohz_full CPU in userspace, idle, or likely to enter a
+	 * guest in the near future?  (Ignore RCU if so.)
+	 */
+	if ((user || rcu_is_cpu_rrupt_from_idle() ||
+	     __this_cpu_read(context_tracking.in_guest_run_loop)) &&
+	    rcu_nohz_full_cpu())
 		return 0;
 
 	/* Is the RCU core waiting for a quiescent state from this CPU? */
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index bfb2b52a1416..5a7efc669a0f 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -209,6 +209,9 @@ void vcpu_load(struct kvm_vcpu *vcpu)
 {
 	int cpu = get_cpu();
 
+	if (vcpu->wants_to_run)
+		context_tracking_guest_start_run_loop();
+
 	__this_cpu_write(kvm_running_vcpu, vcpu);
 	preempt_notifier_register(&vcpu->preempt_notifier);
 	kvm_arch_vcpu_load(vcpu, cpu);
@@ -222,6 +225,10 @@ void vcpu_put(struct kvm_vcpu *vcpu)
 	kvm_arch_vcpu_put(vcpu);
 	preempt_notifier_unregister(&vcpu->preempt_notifier);
 	__this_cpu_write(kvm_running_vcpu, NULL);
+
+	if (vcpu->wants_to_run)
+		context_tracking_guest_stop_run_loop();
+
 	preempt_enable();
 }
 EXPORT_SYMBOL_GPL(vcpu_put);

base-commit: 619e56a3810c88b8d16d7b9553932ad05f0d4968
--
Paul E. McKenney April 6, 2024, 12:03 a.m. UTC | #4
On Fri, Apr 05, 2024 at 07:42:35AM -0700, Sean Christopherson wrote:
> On Fri, Apr 05, 2024, Marcelo Tosatti wrote:
> > On Mon, Apr 01, 2024 at 01:21:25PM -0700, Sean Christopherson wrote:
> > > On Thu, Mar 28, 2024, Leonardo Bras wrote:
> > > > I am dealing with a latency issue inside a KVM guest, which is caused by
> > > > a sched_switch to rcuc[1].
> > > > 
> > > > During guest entry, kernel code will signal to RCU that current CPU was on
> > > > a quiescent state, making sure no other CPU is waiting for this one.
> > > > 
> > > > If a vcpu just stopped running (guest_exit), and a syncronize_rcu() was
> > > > issued somewhere since guest entry, there is a chance a timer interrupt
> > > > will happen in that CPU, which will cause rcu_sched_clock_irq() to run.
> > > > 
> > > > rcu_sched_clock_irq() will check rcu_pending() which will return true,
> > > > and cause invoke_rcu_core() to be called, which will (in current config)
> > > > cause rcuc/N to be scheduled into the current cpu.
> > > > 
> > > > On rcu_pending(), I noticed we can avoid returning true (and thus invoking
> > > > rcu_core()) if the current cpu is nohz_full, and the cpu came from either
> > > > idle or userspace, since both are considered quiescent states.
> > > > 
> > > > Since this is also true to guest context, my idea to solve this latency
> > > > issue by avoiding rcu_core() invocation if it was running a guest vcpu.
> > > > 
> > > > On the other hand, I could not find a way of reliably saying the current
> > > > cpu was running a guest vcpu, so patch #1 implements a per-cpu variable
> > > > for keeping the time (jiffies) of the last guest exit.
> > > > 
> > > > In patch #2 I compare current time to that time, and if less than a second
> > > > has past, we just skip rcu_core() invocation, since there is a high chance
> > > > it will just go back to the guest in a moment.
> > > 
> > > What's the downside if there's a false positive?
> > 
> > rcuc wakes up (which might exceed the allowed latency threshold
> > for certain realtime apps).
> 
> Isn't that a false negative? (RCU doesn't detect that a CPU is about to (re)enter
> a guest)  I was trying to ask about the case where RCU thinks a CPU is about to
> enter a guest, but the CPU never does (at least, not in the immediate future).
> 
> Or am I just not understanding how RCU's kthreads work?

It is quite possible that the current rcu_pending() code needs help,
given the possibility of vCPU preemption.  I have heard of people doing
nested KVM virtualization -- or is that no longer a thing?

But the help might well involve RCU telling the hypervisor that a given
vCPU needs to run.  Not sure how that would go over, though it has been
prototyped a couple times in the context of RCU priority boosting.

> > > > What I know it's weird with this patch:
> > > > 1 - Not sure if this is the best way of finding out if the cpu was
> > > >     running a guest recently.
> > > > 
> > > > 2 - This per-cpu variable needs to get set at each guest_exit(), so it's
> > > >     overhead, even though it's supposed to be in local cache. If that's
> > > >     an issue, I would suggest having this part compiled out on 
> > > >     !CONFIG_NO_HZ_FULL, but further checking each cpu for being nohz_full
> > > >     enabled seems more expensive than just setting this out.
> > > 
> > > A per-CPU write isn't problematic, but I suspect reading jiffies will be quite
> > > imprecise, e.g. it'll be a full tick "behind" on many exits.
> > > 
> > > > 3 - It checks if the guest exit happened over than 1 second ago. This 1
> > > >     second value was copied from rcu_nohz_full_cpu() which checks if the
> > > >     grace period started over than a second ago. If this value is bad,
> > > >     I have no issue changing it.
> > > 
> > > IMO, checking if a CPU "recently" ran a KVM vCPU is a suboptimal heuristic regardless
> > > of what magic time threshold is used.  
> > 
> > Why? It works for this particular purpose.
> 
> Because maintaining magic numbers is no fun, AFAICT the heurisitic doesn't guard
> against edge cases, and I'm pretty sure we can do better with about the same amount
> of effort/churn.

Beyond a certain point, we have no choice.  How long should RCU let
a CPU run with preemption disabled before complaining?  We choose 21
seconds in mainline and some distros choose 60 seconds.  Android chooses
20 milliseconds for synchronize_rcu_expedited() grace periods.

> > > IIUC, what you want is a way to detect if a CPU is likely to _run_ a KVM
> > > vCPU in the near future.  KVM can provide that information with much better
> > > precision, e.g. KVM knows when when it's in the core vCPU run loop.
> > 
> > ktime_t ktime_get(void)
> > {
> >         struct timekeeper *tk = &tk_core.timekeeper;
> >         unsigned int seq;
> >         ktime_t base;
> >         u64 nsecs;
> > 
> >         WARN_ON(timekeeping_suspended);
> > 
> >         do {
> >                 seq = read_seqcount_begin(&tk_core.seq);
> >                 base = tk->tkr_mono.base;
> >                 nsecs = timekeeping_get_ns(&tk->tkr_mono);
> > 
> >         } while (read_seqcount_retry(&tk_core.seq, seq));
> > 
> >         return ktime_add_ns(base, nsecs);
> > }
> > EXPORT_SYMBOL_GPL(ktime_get);
> > 
> > ktime_get() is more expensive than unsigned long assignment.
> 
> Huh?  What does ktime_get() have to do with anything?  I'm suggesting something
> like the below (wants_to_run is from an in-flight patch,
> https://lore.kernel.org/all/20240307163541.92138-1-dmatlack@google.com).

Interesting.  Some questions below, especially if we are doing nested
virtualization.

> ---
>  include/linux/context_tracking.h       | 12 ++++++++++++
>  include/linux/context_tracking_state.h |  3 +++
>  kernel/rcu/tree.c                      |  9 +++++++--
>  virt/kvm/kvm_main.c                    |  7 +++++++
>  4 files changed, 29 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
> index 6e76b9dba00e..59bc855701c5 100644
> --- a/include/linux/context_tracking.h
> +++ b/include/linux/context_tracking.h
> @@ -86,6 +86,16 @@ static __always_inline void context_tracking_guest_exit(void)
>  		__ct_user_exit(CONTEXT_GUEST);
>  }
>  
> +static inline void context_tracking_guest_start_run_loop(void)
> +{
> +	__this_cpu_write(context_tracking.in_guest_run_loop, true);
> +}
> +
> +static inline void context_tracking_guest_stop_run_loop(void)
> +{
> +	__this_cpu_write(context_tracking.in_guest_run_loop, false);
> +}
> +
>  #define CT_WARN_ON(cond) WARN_ON(context_tracking_enabled() && (cond))
>  
>  #else
> @@ -99,6 +109,8 @@ static inline int ct_state(void) { return -1; }
>  static inline int __ct_state(void) { return -1; }
>  static __always_inline bool context_tracking_guest_enter(void) { return false; }
>  static __always_inline void context_tracking_guest_exit(void) { }
> +static inline void context_tracking_guest_start_run_loop(void) { }
> +static inline void context_tracking_guest_stop_run_loop(void) { }
>  #define CT_WARN_ON(cond) do { } while (0)
>  #endif /* !CONFIG_CONTEXT_TRACKING_USER */
>  
> diff --git a/include/linux/context_tracking_state.h b/include/linux/context_tracking_state.h
> index bbff5f7f8803..629ada1a4d81 100644
> --- a/include/linux/context_tracking_state.h
> +++ b/include/linux/context_tracking_state.h
> @@ -25,6 +25,9 @@ enum ctx_state {
>  #define CT_DYNTICKS_MASK (~CT_STATE_MASK)
>  
>  struct context_tracking {
> +#if IS_ENABLED(CONFIG_KVM)
> +	bool in_guest_run_loop;
> +#endif
>  #ifdef CONFIG_CONTEXT_TRACKING_USER
>  	/*
>  	 * When active is false, probes are unset in order
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index d9642dd06c25..303ae9ae1c53 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -3937,8 +3937,13 @@ static int rcu_pending(int user)
>  	if (rcu_nocb_need_deferred_wakeup(rdp, RCU_NOCB_WAKE))
>  		return 1;
>  
> -	/* Is this a nohz_full CPU in userspace or idle?  (Ignore RCU if so.) */
> -	if ((user || rcu_is_cpu_rrupt_from_idle()) && rcu_nohz_full_cpu())
> +	/*
> +	 * Is this a nohz_full CPU in userspace, idle, or likely to enter a
> +	 * guest in the near future?  (Ignore RCU if so.)
> +	 */
> +	if ((user || rcu_is_cpu_rrupt_from_idle() ||
> +	     __this_cpu_read(context_tracking.in_guest_run_loop)) &&

In the case of (user || rcu_is_cpu_rrupt_from_idle()), this CPU was in
a quiescent just before the current scheduling-clock interrupt and will
again be in a quiescent state right after return from this interrupt.
This means that the grace-period kthread will be able to remotely sense
this quiescent state, so that the current CPU need do nothing.

In contrast, it looks like context_tracking.in_guest_run_loop instead
means that when we return from this interrupt, this CPU will still be
in a non-quiescent state.

Now, in the nested-virtualization case, your point might be that the
lower-level hypervisor could preempt the vCPU in the interrupt handler
just as easily as in the .in_guest_run_loop code.  Which is a good point.
But I don't know of a way to handle this other than heuristics and maybe
hinting to the hypervisor (which has been prototyped for RCU priority
boosting).

Maybe the time for such hinting has come?

> +	    rcu_nohz_full_cpu())

And rcu_nohz_full_cpu() has a one-second timeout, and has for quite
some time.

>  		return 0;
>  
>  	/* Is the RCU core waiting for a quiescent state from this CPU? */
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index bfb2b52a1416..5a7efc669a0f 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -209,6 +209,9 @@ void vcpu_load(struct kvm_vcpu *vcpu)
>  {
>  	int cpu = get_cpu();
>  
> +	if (vcpu->wants_to_run)
> +		context_tracking_guest_start_run_loop();

At this point, if this is a nohz_full CPU, it will no longer report
quiescent states until the grace period is at least one second old.

> +
>  	__this_cpu_write(kvm_running_vcpu, vcpu);
>  	preempt_notifier_register(&vcpu->preempt_notifier);
>  	kvm_arch_vcpu_load(vcpu, cpu);
> @@ -222,6 +225,10 @@ void vcpu_put(struct kvm_vcpu *vcpu)
>  	kvm_arch_vcpu_put(vcpu);
>  	preempt_notifier_unregister(&vcpu->preempt_notifier);
>  	__this_cpu_write(kvm_running_vcpu, NULL);
> +

And also at this point, if this is a nohz_full CPU, it will no longer
report quiescent states until the grace period is at least one second old.

> +	if (vcpu->wants_to_run)
> +		context_tracking_guest_stop_run_loop();
> +
>  	preempt_enable();
>  }
>  EXPORT_SYMBOL_GPL(vcpu_put);
> 
> base-commit: 619e56a3810c88b8d16d7b9553932ad05f0d4968

All of which might be OK.  Just checking as to whether all of that was
in fact the intent.

							Thanx, Paul
Sean Christopherson April 8, 2024, 5:16 p.m. UTC | #5
On Fri, Apr 05, 2024, Paul E. McKenney wrote:
> On Fri, Apr 05, 2024 at 07:42:35AM -0700, Sean Christopherson wrote:
> > On Fri, Apr 05, 2024, Marcelo Tosatti wrote:
> > > rcuc wakes up (which might exceed the allowed latency threshold
> > > for certain realtime apps).
> > 
> > Isn't that a false negative? (RCU doesn't detect that a CPU is about to (re)enter
> > a guest)  I was trying to ask about the case where RCU thinks a CPU is about to
> > enter a guest, but the CPU never does (at least, not in the immediate future).
> > 
> > Or am I just not understanding how RCU's kthreads work?
> 
> It is quite possible that the current rcu_pending() code needs help,
> given the possibility of vCPU preemption.  I have heard of people doing
> nested KVM virtualization -- or is that no longer a thing?

Nested virtualization is still very much a thing, but I don't see how it is at
all unique with respect to RCU grace periods and quiescent states.  More below.

> But the help might well involve RCU telling the hypervisor that a given
> vCPU needs to run.  Not sure how that would go over, though it has been
> prototyped a couple times in the context of RCU priority boosting.
>
> > > > > 3 - It checks if the guest exit happened more than 1 second ago. This 1
> > > > >     second value was copied from rcu_nohz_full_cpu() which checks if the
> > > > >     grace period started more than a second ago. If this value is bad,
> > > > >     I have no issue changing it.
> > > > 
> > > > IMO, checking if a CPU "recently" ran a KVM vCPU is a suboptimal heuristic regardless
> > > > of what magic time threshold is used.  
> > > 
> > > Why? It works for this particular purpose.
> > 
> > Because maintaining magic numbers is no fun, AFAICT the heuristic doesn't guard
> > against edge cases, and I'm pretty sure we can do better with about the same amount
> > of effort/churn.
> 
> Beyond a certain point, we have no choice.  How long should RCU let
> a CPU run with preemption disabled before complaining?  We choose 21
> seconds in mainline and some distros choose 60 seconds.  Android chooses
> 20 milliseconds for synchronize_rcu_expedited() grace periods.

Issuing a warning based on an arbitrary time limit is wildly different than using
an arbitrary time window to make functional decisions.  My objection to the "assume
the CPU will enter a quiescent state if it exited a KVM guest in the last second"
is that there are plenty of scenarios where that assumption falls apart, i.e. where
_that_ physical CPU will not re-enter the guest.

Off the top of my head:

 - If the vCPU is migrated to a different physical CPU (pCPU), the *old* pCPU
   will get false positives, and the *new* pCPU will get false negatives (though
   the false negatives aren't all that problematic since the pCPU will enter a
>    quiescent state on the next VM-Enter.)

 - If the vCPU halts, in which case KVM will schedule out the vCPU/task, i.e.
   won't re-enter the guest.  And so the pCPU will get false positives until the
   vCPU gets a wake event or the 1 second window expires.

 - If the VM terminates, the pCPU will get false positives until the 1 second
   window expires.

The false positives are solvable problems, by hooking vcpu_put() to reset
kvm_last_guest_exit.  And to help with the false negatives when a vCPU task is
scheduled in on a different pCPU, KVM would hook vcpu_load().
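[Editorial aside: a user-space sketch of that fix, reusing the kvm_last_guest_exit name from the RFC patch. This is an illustrative simulation of the timestamp heuristic plus the suggested vcpu_put() reset, not kernel code.]

```c
#include <stdbool.h>

#define HZ 1000UL

static unsigned long jiffies;              /* simulated clock-tick counter */
static unsigned long kvm_last_guest_exit;  /* 0 == no recent guest exit */

/* Record the time of the most recent guest exit (as in the RFC patch). */
static void guest_exit(void)
{
	kvm_last_guest_exit = jiffies;
}

/* Suggested fix: clear the timestamp when the vCPU is put, so a halted
 * or terminated VM stops generating false positives. */
static void vcpu_put(void)
{
	kvm_last_guest_exit = 0;
}

/* Would this CPU be allowed to skip rcu_core() under the heuristic? */
static bool cpu_recently_ran_guest(void)
{
	return kvm_last_guest_exit &&
	       (long)(jiffies - (kvm_last_guest_exit + HZ)) < 0;
}
```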

> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index d9642dd06c25..303ae9ae1c53 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -3937,8 +3937,13 @@ static int rcu_pending(int user)
> >  	if (rcu_nocb_need_deferred_wakeup(rdp, RCU_NOCB_WAKE))
> >  		return 1;
> >  
> > -	/* Is this a nohz_full CPU in userspace or idle?  (Ignore RCU if so.) */
> > -	if ((user || rcu_is_cpu_rrupt_from_idle()) && rcu_nohz_full_cpu())
> > +	/*
> > +	 * Is this a nohz_full CPU in userspace, idle, or likely to enter a
> > +	 * guest in the near future?  (Ignore RCU if so.)
> > +	 */
> > +	if ((user || rcu_is_cpu_rrupt_from_idle() ||
> > +	     __this_cpu_read(context_tracking.in_guest_run_loop)) &&
> 
> In the case of (user || rcu_is_cpu_rrupt_from_idle()), this CPU was in
> a quiescent state just before the current scheduling-clock interrupt and will
> again be in a quiescent state right after return from this interrupt.
> This means that the grace-period kthread will be able to remotely sense
> this quiescent state, so that the current CPU need do nothing.
>
> In contrast, it looks like context_tracking.in_guest_run_loop instead
> means that when we return from this interrupt, this CPU will still be
> in a non-quiescent state.
> 
> Now, in the nested-virtualization case, your point might be that the
> lower-level hypervisor could preempt the vCPU in the interrupt handler
> just as easily as in the .in_guest_run_loop code.  Which is a good point.
> But I don't know of a way to handle this other than heuristics and maybe
> hinting to the hypervisor (which has been prototyped for RCU priority
> boosting).

Regarding nested virtualization, what exactly is your concern?  IIUC, you are
worried about this code running at L1, i.e. as a nested hypervisor, and L0, i.e.
the bare metal hypervisor, scheduling out the L1 CPU.  And because the L1 CPU
doesn't get run "soon", it won't enter a quiescent state as expected by RCU.

But that's 100% the case with RCU in a VM in general.  If an L1 CPU gets scheduled
out by L0, that L1 CPU won't participate in any RCU stuff until it gets scheduled
back in by L0.

E.g. throw away all of the special case checks for rcu_nohz_full_cpu() in
rcu_pending(), and the exact same problem exists.  The L1 CPU could get scheduled
out while trying to run the RCU core kthread just as easily as it could get
scheduled out while trying to run the vCPU task.  Or the L1 CPU could get scheduled
> out while it's still in the IRQ handler, before it even completes its rcu_pending().

And FWIW, it's not just L0 scheduling that is problematic.  If something in L0
prevents an L1 CPU (vCPU from L0's perspective) from making forward progress, e.g.
due to a bug in L0, or severe resource contention, from the L1 kernel's perspective,
the L1 CPU will appear stuck and trigger various warnings, e.g. soft-lockup,
need_resched, RCU stalls, etc.
 
> Maybe the time for such hinting has come?

That's a largely orthogonal discussion.  As above, boosting the scheduling priority
of a vCPU because that vCPU is in critical section of some form is not at all
unique to nested virtualization (or RCU).

For basic functional correctness, the L0 hypervisor already has the "hint" it 
needs.  L0 knows that the L1 CPU wants to run by virtue of the L1 CPU being
runnable, i.e. not halted, not in WFS, etc.

> > +	    rcu_nohz_full_cpu())
> 
> And rcu_nohz_full_cpu() has a one-second timeout, and has for quite
> some time.

That's not a good reason to use a suboptimal heuristic for determining whether
or not a CPU is likely to enter a KVM guest, it simply mitigates the worst case
scenario of a false positive.

> >  		return 0;
> >  
> >  	/* Is the RCU core waiting for a quiescent state from this CPU? */
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index bfb2b52a1416..5a7efc669a0f 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -209,6 +209,9 @@ void vcpu_load(struct kvm_vcpu *vcpu)
> >  {
> >  	int cpu = get_cpu();
> >  
> > +	if (vcpu->wants_to_run)
> > +		context_tracking_guest_start_run_loop();
> 
> At this point, if this is a nohz_full CPU, it will no longer report
> quiescent states until the grace period is at least one second old.

I don't think I follow the "will no longer report quiescent states" issue.  Are
you saying that this would prevent guest_context_enter_irqoff() from reporting
that the CPU is entering a quiescent state?  If so, that's an issue that would
need to be resolved regardless of what heuristic we use to determine whether or
not a CPU is likely to enter a KVM guest.

> >  	__this_cpu_write(kvm_running_vcpu, vcpu);
> >  	preempt_notifier_register(&vcpu->preempt_notifier);
> >  	kvm_arch_vcpu_load(vcpu, cpu);
> > @@ -222,6 +225,10 @@ void vcpu_put(struct kvm_vcpu *vcpu)
> >  	kvm_arch_vcpu_put(vcpu);
> >  	preempt_notifier_unregister(&vcpu->preempt_notifier);
> >  	__this_cpu_write(kvm_running_vcpu, NULL);
> > +
> 
> And also at this point, if this is a nohz_full CPU, it will no longer
> report quiescent states until the grace period is at least one second old.
> 
> > +	if (vcpu->wants_to_run)
> > +		context_tracking_guest_stop_run_loop();
> > +
> >  	preempt_enable();
> >  }
> >  EXPORT_SYMBOL_GPL(vcpu_put);
> > 
> > base-commit: 619e56a3810c88b8d16d7b9553932ad05f0d4968
> 
> All of which might be OK.  Just checking as to whether all of that was
> in fact the intent.
> 
> 							Thanx, Paul
Paul E. McKenney April 8, 2024, 6:42 p.m. UTC | #6
On Mon, Apr 08, 2024 at 10:16:24AM -0700, Sean Christopherson wrote:
> On Fri, Apr 05, 2024, Paul E. McKenney wrote:
> > On Fri, Apr 05, 2024 at 07:42:35AM -0700, Sean Christopherson wrote:
> > > On Fri, Apr 05, 2024, Marcelo Tosatti wrote:
> > > > rcuc wakes up (which might exceed the allowed latency threshold
> > > > for certain realtime apps).
> > > 
> > > Isn't that a false negative? (RCU doesn't detect that a CPU is about to (re)enter
> > > a guest)  I was trying to ask about the case where RCU thinks a CPU is about to
> > > enter a guest, but the CPU never does (at least, not in the immediate future).
> > > 
> > > Or am I just not understanding how RCU's kthreads work?
> > 
> > It is quite possible that the current rcu_pending() code needs help,
> > given the possibility of vCPU preemption.  I have heard of people doing
> > nested KVM virtualization -- or is that no longer a thing?
> 
> Nested virtualization is still very much a thing, but I don't see how it is at
> all unique with respect to RCU grace periods and quiescent states.  More below.

When the hypervisor runs on bare metal, the existing checks have
interrupts disabled.  Yes, you can still get added delays from NMIs and
SMIs, but excessively long NMI/SMI handlers are either considered to be
bugs or happen when the system is already in trouble (backtrace NMIs,
for an example of the latter).

But if the hypervisor is running on top of another hypervisor, then
the scheduling-clock interrupt handler is subject to vCPU preemption,
which can unduly delay reporting of RCU quiescent states.

And no, this is not exactly new, but your patch reminded me of it.

> > But the help might well involve RCU telling the hypervisor that a given
> > vCPU needs to run.  Not sure how that would go over, though it has been
> > prototyped a couple times in the context of RCU priority boosting.
> >
> > > > > > 3 - It checks if the guest exit happened more than 1 second ago. This 1
> > > > > >     second value was copied from rcu_nohz_full_cpu() which checks if the
> > > > > >     grace period started more than a second ago. If this value is bad,
> > > > > >     I have no issue changing it.
> > > > > 
> > > > > IMO, checking if a CPU "recently" ran a KVM vCPU is a suboptimal heuristic regardless
> > > > > of what magic time threshold is used.  
> > > > 
> > > > Why? It works for this particular purpose.
> > > 
> > > Because maintaining magic numbers is no fun, AFAICT the heuristic doesn't guard
> > > against edge cases, and I'm pretty sure we can do better with about the same amount
> > > of effort/churn.
> > 
> > Beyond a certain point, we have no choice.  How long should RCU let
> > a CPU run with preemption disabled before complaining?  We choose 21
> > seconds in mainline and some distros choose 60 seconds.  Android chooses
> > 20 milliseconds for synchronize_rcu_expedited() grace periods.
> 
> Issuing a warning based on an arbitrary time limit is wildly different than using
> an arbitrary time window to make functional decisions.  My objection to the "assume
> the CPU will enter a quiescent state if it exited a KVM guest in the last second"
> is that there are plenty of scenarios where that assumption falls apart, i.e. where
> _that_ physical CPU will not re-enter the guest.
> 
> Off the top of my head:
> 
>  - If the vCPU is migrated to a different physical CPU (pCPU), the *old* pCPU
>    will get false positives, and the *new* pCPU will get false negatives (though
>    the false negatives aren't all that problematic since the pCPU will enter a
>    quiescent state on the next VM-Enter.
> 
>  - If the vCPU halts, in which case KVM will schedule out the vCPU/task, i.e.
>    won't re-enter the guest.  And so the pCPU will get false positives until the
>    vCPU gets a wake event or the 1 second window expires.
> 
>  - If the VM terminates, the pCPU will get false positives until the 1 second
>    window expires.
> 
> The false positives are solvable problems, by hooking vcpu_put() to reset
> kvm_last_guest_exit.  And to help with the false negatives when a vCPU task is
> scheduled in on a different pCPU, KVM would hook vcpu_load().

Here you are arguing against the heuristic in the original patch, correct?
As opposed to the current RCU heuristic that ignores certain quiescent
states for nohz_full CPUs until the grace period reaches an age of
one second?

If so, no argument here.  In fact, please consider my ack cancelled.

> > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > index d9642dd06c25..303ae9ae1c53 100644
> > > --- a/kernel/rcu/tree.c
> > > +++ b/kernel/rcu/tree.c
> > > @@ -3937,8 +3937,13 @@ static int rcu_pending(int user)
> > >  	if (rcu_nocb_need_deferred_wakeup(rdp, RCU_NOCB_WAKE))
> > >  		return 1;
> > >  
> > > -	/* Is this a nohz_full CPU in userspace or idle?  (Ignore RCU if so.) */
> > > -	if ((user || rcu_is_cpu_rrupt_from_idle()) && rcu_nohz_full_cpu())
> > > +	/*
> > > +	 * Is this a nohz_full CPU in userspace, idle, or likely to enter a
> > > +	 * guest in the near future?  (Ignore RCU if so.)
> > > +	 */
> > > +	if ((user || rcu_is_cpu_rrupt_from_idle() ||
> > > +	     __this_cpu_read(context_tracking.in_guest_run_loop)) &&
> > 
> > In the case of (user || rcu_is_cpu_rrupt_from_idle()), this CPU was in
> > a quiescent state just before the current scheduling-clock interrupt and will
> > again be in a quiescent state right after return from this interrupt.
> > This means that the grace-period kthread will be able to remotely sense
> > this quiescent state, so that the current CPU need do nothing.
> >
> > In contrast, it looks like context_tracking.in_guest_run_loop instead
> > means that when we return from this interrupt, this CPU will still be
> > in a non-quiescent state.
> > 
> > Now, in the nested-virtualization case, your point might be that the
> > lower-level hypervisor could preempt the vCPU in the interrupt handler
> > just as easily as in the .in_guest_run_loop code.  Which is a good point.
> > But I don't know of a way to handle this other than heuristics and maybe
> > hinting to the hypervisor (which has been prototyped for RCU priority
> > boosting).
> 
> Regarding nested virtualization, what exactly is your concern?  IIUC, you are
> worried about this code running at L1, i.e. as a nested hypervisor, and L0, i.e.
> the bare metal hypervisor, scheduling out the L1 CPU.  And because the L1 CPU
> doesn't get run "soon", it won't enter a quiescent state as expected by RCU.

I don't believe that I have any additional concerns over and above
those for the current situation for nested virtualization.  But see my
additional question on your patch below for non-nested virtualization.

> But that's 100% the case with RCU in a VM in general.  If an L1 CPU gets scheduled
> out by L0, that L1 CPU won't participate in any RCU stuff until it gets scheduled
> back in by L0.
> 
> E.g. throw away all of the special case checks for rcu_nohz_full_cpu() in
> rcu_pending(), and the exact same problem exists.  The L1 CPU could get scheduled
> out while trying to run the RCU core kthread just as easily as it could get
> scheduled out while trying to run the vCPU task.  Or the L1 CPU could get scheduled
> > out while it's still in the IRQ handler, before it even completes its rcu_pending().
> 
> And FWIW, it's not just L0 scheduling that is problematic.  If something in L0
> prevents an L1 CPU (vCPU from L0's perspective) from making forward progress, e.g.
> due to a bug in L0, or severe resource contention, from the L1 kernel's perspective,
> the L1 CPU will appear stuck and trigger various warnings, e.g. soft-lockup,
> need_resched, RCU stalls, etc.

Indeed, there was a USENIX paper on some aspects of this topic some years back.
https://www.usenix.org/conference/atc17/technical-sessions/presentation/prasad

> > Maybe the time for such hinting has come?
> 
> That's a largely orthogonal discussion.  As above, boosting the scheduling priority
> of a vCPU because that vCPU is in critical section of some form is not at all
> unique to nested virtualization (or RCU).
> 
> For basic functional correctness, the L0 hypervisor already has the "hint" it 
> needs.  L0 knows that the L1 CPU wants to run by virtue of the L1 CPU being
> runnable, i.e. not halted, not in WFS, etc.

And if the system is sufficiently lightly loaded, all will be well, as is
the case with my rcutorture usage.  However, if the system is saturated,
that basic functional correctness might not be enough.  I haven't heard
many complaints, other than research work, so I have been assuming that
we do not yet need hinting.  But you guys tell me.  ;-)

> > > +	    rcu_nohz_full_cpu())
> > 
> > And rcu_nohz_full_cpu() has a one-second timeout, and has for quite
> > some time.
> 
> That's not a good reason to use a suboptimal heuristic for determining whether
> or not a CPU is likely to enter a KVM guest, it simply mitigates the worst case
> scenario of a false positive.

Again, are you referring to the current RCU code, or the original patch
that started this email thread?

> > >  		return 0;
> > >  
> > >  	/* Is the RCU core waiting for a quiescent state from this CPU? */
> > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > > index bfb2b52a1416..5a7efc669a0f 100644
> > > --- a/virt/kvm/kvm_main.c
> > > +++ b/virt/kvm/kvm_main.c
> > > @@ -209,6 +209,9 @@ void vcpu_load(struct kvm_vcpu *vcpu)
> > >  {
> > >  	int cpu = get_cpu();
> > >  
> > > +	if (vcpu->wants_to_run)
> > > +		context_tracking_guest_start_run_loop();
> > 
> > At this point, if this is a nohz_full CPU, it will no longer report
> > quiescent states until the grace period is at least one second old.
> 
> I don't think I follow the "will no longer report quiescent states" issue.  Are
> you saying that this would prevent guest_context_enter_irqoff() from reporting
> that the CPU is entering a quiescent state?  If so, that's an issue that would
> need to be resolved regardless of what heuristic we use to determine whether or
> not a CPU is likely to enter a KVM guest.

Please allow me to start over.  Are interrupts disabled at this point,
and, if so, will they remain disabled until the transfer of control to
the guest has become visible to RCU via the context-tracking code?

Or has the context-tracking code already made the transfer of control
to the guest visible to RCU?

> > >  	__this_cpu_write(kvm_running_vcpu, vcpu);
> > >  	preempt_notifier_register(&vcpu->preempt_notifier);
> > >  	kvm_arch_vcpu_load(vcpu, cpu);
> > > @@ -222,6 +225,10 @@ void vcpu_put(struct kvm_vcpu *vcpu)
> > >  	kvm_arch_vcpu_put(vcpu);
> > >  	preempt_notifier_unregister(&vcpu->preempt_notifier);
> > >  	__this_cpu_write(kvm_running_vcpu, NULL);
> > > +
> > 
> > And also at this point, if this is a nohz_full CPU, it will no longer
> > report quiescent states until the grace period is at least one second old.

And here, are interrupts disabled at this point, and if so, have they
been disabled since the time that the exit from the guest became
visible to RCU via the context-tracking code?

Or will the context-tracking code make the exit from the guest visible
to RCU at some later time?

							Thanx, Paul

> > > +	if (vcpu->wants_to_run)
> > > +		context_tracking_guest_stop_run_loop();
> > > +
> > >  	preempt_enable();
> > >  }
> > >  EXPORT_SYMBOL_GPL(vcpu_put);
> > > 
> > > base-commit: 619e56a3810c88b8d16d7b9553932ad05f0d4968
> > 
> > All of which might be OK.  Just checking as to whether all of that was
> > in fact the intent.
> > 
> > 							Thanx, Paul
Sean Christopherson April 8, 2024, 8:06 p.m. UTC | #7
On Mon, Apr 08, 2024, Paul E. McKenney wrote:
> On Mon, Apr 08, 2024 at 10:16:24AM -0700, Sean Christopherson wrote:
> > On Fri, Apr 05, 2024, Paul E. McKenney wrote:
> > Issuing a warning based on an arbitrary time limit is wildly different than using
> > an arbitrary time window to make functional decisions.  My objection to the "assume
> > the CPU will enter a quiescent state if it exited a KVM guest in the last second"
> > is that there are plenty of scenarios where that assumption falls apart, i.e. where
> > _that_ physical CPU will not re-enter the guest.
> > 
> > Off the top of my head:
> > 
> >  - If the vCPU is migrated to a different physical CPU (pCPU), the *old* pCPU
> >    will get false positives, and the *new* pCPU will get false negatives (though
> >    the false negatives aren't all that problematic since the pCPU will enter a
> >    quiescent state on the next VM-Enter.)
> > 
> >  - If the vCPU halts, in which case KVM will schedule out the vCPU/task, i.e.
> >    won't re-enter the guest.  And so the pCPU will get false positives until the
> >    vCPU gets a wake event or the 1 second window expires.
> > 
> >  - If the VM terminates, the pCPU will get false positives until the 1 second
> >    window expires.
> > 
> > The false positives are solvable problems, by hooking vcpu_put() to reset
> > kvm_last_guest_exit.  And to help with the false negatives when a vCPU task is
> > scheduled in on a different pCPU, KVM would hook vcpu_load().
> 
> Here you are arguing against the heuristic in the original patch, correct?

Yep, correct.

> As opposed to the current RCU heuristic that ignores certain quiescent
> states for nohz_full CPUs until the grace period reaches an age of
> one second?
> 
> If so, no argument here.  In fact, please consider my ack cancelled.

...

> > That's a largely orthogonal discussion.  As above, boosting the scheduling priority
> > of a vCPU because that vCPU is in critical section of some form is not at all
> > unique to nested virtualization (or RCU).
> > 
> > For basic functional correctness, the L0 hypervisor already has the "hint" it 
> > needs.  L0 knows that the L1 CPU wants to run by virtue of the L1 CPU being
> > runnable, i.e. not halted, not in WFS, etc.
> 
> And if the system is sufficiently lightly loaded, all will be well, as is
> the case with my rcutorture usage.  However, if the system is saturated,
> that basic functional correctness might not be enough.  I haven't heard
> many complaints, other than research work, so I have been assuming that
> we do not yet need hinting.  But you guys tell me.  ;-)

We should never use hinting for basic, *default* functionality.  If the host is
so overloaded that it can induce RCU stalls with the default threshold of 21
seconds, then something in the host's domain is broken/misconfigured.  E.g. it
> doesn't necessarily have to be a host kernel/userspace bug, it could be an issue
with VM scheduling at the control plane.  But it's still a host issue, and under
no circumstance should the host need a hint in order for the guest to not complain
after 20+ seconds.

And _if_ we were to push the default lower, e.g. all the way down to Android's
aggressive 20 milliseconds, a boosting hint would still be the wrong way to go
about it, because no sane hypervisor would ever back such a hint with strong
guarantees for all scenarios.

It's very much possible to achieve a 20ms deadline when running as a VM, but it
would require strong guarantees about the VM's configuration and environment,
e.g. that memory isn't overcommited, that each vCPU has a fully dedicated pCPU,
etc.

> > > > +	    rcu_nohz_full_cpu())
> > > 
> > > And rcu_nohz_full_cpu() has a one-second timeout, and has for quite
> > > some time.
> > 
> > That's not a good reason to use a suboptimal heuristic for determining whether
> > or not a CPU is likely to enter a KVM guest, it simply mitigates the worst case
> > scenario of a false positive.
> 
> Again, are you referring to the current RCU code, or the original patch
> that started this email thread?

Original patch.

> > > >  	/* Is the RCU core waiting for a quiescent state from this CPU? */
> > > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > > > index bfb2b52a1416..5a7efc669a0f 100644
> > > > --- a/virt/kvm/kvm_main.c
> > > > +++ b/virt/kvm/kvm_main.c
> > > > @@ -209,6 +209,9 @@ void vcpu_load(struct kvm_vcpu *vcpu)
> > > >  {
> > > >  	int cpu = get_cpu();
> > > >  
> > > > +	if (vcpu->wants_to_run)
> > > > +		context_tracking_guest_start_run_loop();
> > > 
> > > At this point, if this is a nohz_full CPU, it will no longer report
> > > quiescent states until the grace period is at least one second old.
> > 
> > I don't think I follow the "will no longer report quiescent states" issue.  Are
> > you saying that this would prevent guest_context_enter_irqoff() from reporting
> > that the CPU is entering a quiescent state?  If so, that's an issue that would
> > need to be resolved regardless of what heuristic we use to determine whether or
> > not a CPU is likely to enter a KVM guest.
> 
> Please allow me to start over.  Are interrupts disabled at this point,

Nope, IRQs are enabled.

Oof, I'm glad you asked, because I was going to say that there's one exception,
kvm_sched_in(), which is KVM's notifier for when a preempted task/vCPU is scheduled
back in.  But I forgot that kvm_sched_{in,out}() don't use vcpu_{load,put}(),
i.e. would need explicit calls to context_tracking_guest_{stop,start}_run_loop().

> and, if so, will they remain disabled until the transfer of control to
> the guest has become visible to RCU via the context-tracking code?
> 
> Or has the context-tracking code already made the transfer of control
> to the guest visible to RCU?

Nope.  The call to __ct_user_enter(CONTEXT_GUEST) or rcu_virt_note_context_switch()
happens later, just before the actual VM-Enter.  And that call does happen with
IRQs disabled (and IRQs stay disabled until the CPU enters the guest).

> > > >  	__this_cpu_write(kvm_running_vcpu, vcpu);
> > > >  	preempt_notifier_register(&vcpu->preempt_notifier);
> > > >  	kvm_arch_vcpu_load(vcpu, cpu);
> > > > @@ -222,6 +225,10 @@ void vcpu_put(struct kvm_vcpu *vcpu)
> > > >  	kvm_arch_vcpu_put(vcpu);
> > > >  	preempt_notifier_unregister(&vcpu->preempt_notifier);
> > > >  	__this_cpu_write(kvm_running_vcpu, NULL);
> > > > +
> > > 
> > > And also at this point, if this is a nohz_full CPU, it will no longer
> > > report quiescent states until the grace period is at least one second old.
> 
> And here, are interrupts disabled at this point, and if so, have they
> been disabled since the time that the exit from the guest became
> visible to RCU via the context-tracking code?

IRQs are enabled.

The gist of my suggestion is:

	ioctl(KVM_RUN) {

		context_tracking_guest_start_run_loop();

		for (;;) {

			vcpu_run();

			if (<need to return to userspace>)
				break;
		}

		context_tracking_guest_stop_run_loop();
	}

where vcpu_run() encompasses a fairly huge amount of code and functionality,
including the logic to do world switches between host and guest.

E.g. if a vCPU triggers a VM-Exit because it tried to access memory that has been
swapped out by the host, KVM could end up way down in mm/ doing I/O to bring a
page back into memory for the guest.  Immediately after VM-Exit, before enabling
IRQs, KVM will notify RCU that the CPU has exited the extended quiescent state
(this is what happens today).  But the "in KVM run loop" flag would stay set, and
RCU would rely on rcu_nohz_full_cpu() for protection, e.g. in case faulting in
memory somehow takes more than a second.

But, barring something that triggers a return to userspace, KVM _will_ re-enter
the guest as quickly as possible.  So it's still a heuristic in the sense that
the CPU isn't guaranteed to enter the guest, nor are there any enforceable SLOs
on how quickly the CPU will enter the guest, but I think it's the best tradeoff
between simplicity and functionality, especially since rcu_nohz_full_cpu() has
a one second timeout to safeguard against some unforeseen hiccup that prevents
KVM from re-entering the guest in a timely manner.
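[Editorial aside: the rcu_pending() condition in this proposal can be modeled as the small predicate below. Field and function names follow the quoted diff (context_tracking.in_guest_run_loop); the code is an illustrative user-space simulation, not the kernel implementation.]

```c
#include <stdbool.h>

/* Simulated slice of per-CPU context-tracking state. */
struct ct_state {
	bool in_guest_run_loop;   /* set between KVM_RUN loop start/stop */
};

static struct ct_state context_tracking;

/*
 * Model of the proposed rcu_pending() check: a nohz_full CPU may skip
 * invoking rcu_core() if it was interrupted from userspace, from idle,
 * or from inside KVM's vcpu run loop -- but only while the one-second
 * rcu_nohz_full_cpu() safeguard (passed in here as a flag) still holds.
 */
static bool may_skip_rcu_core(bool user, bool from_idle,
			      bool rcu_nohz_full_cpu)
{
	return (user || from_idle || context_tracking.in_guest_run_loop) &&
	       rcu_nohz_full_cpu;
}
```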

Note, as above, my intent is that there would also be hooks in kvm_sched_{in,out}()
to note that the guest run loop is starting/stopping if the vCPU task yields or
is preempted.
Paul E. McKenney April 8, 2024, 9:02 p.m. UTC | #8
On Mon, Apr 08, 2024 at 01:06:00PM -0700, Sean Christopherson wrote:
> On Mon, Apr 08, 2024, Paul E. McKenney wrote:
> > On Mon, Apr 08, 2024 at 10:16:24AM -0700, Sean Christopherson wrote:
> > > On Fri, Apr 05, 2024, Paul E. McKenney wrote:
> > > Issuing a warning based on an arbitrary time limit is wildly different than using
> > > an arbitrary time window to make functional decisions.  My objection to the "assume
> > > the CPU will enter a quiescent state if it exited a KVM guest in the last second"
> > > is that there are plenty of scenarios where that assumption falls apart, i.e. where
> > > _that_ physical CPU will not re-enter the guest.
> > > 
> > > Off the top of my head:
> > > 
> > >  - If the vCPU is migrated to a different physical CPU (pCPU), the *old* pCPU
> > >    will get false positives, and the *new* pCPU will get false negatives (though
> > >    the false negatives aren't all that problematic since the pCPU will enter a
> > >    quiescent state on the next VM-Enter.)
> > > 
> > >  - If the vCPU halts, in which case KVM will schedule out the vCPU/task, i.e.
> > >    won't re-enter the guest.  And so the pCPU will get false positives until the
> > >    vCPU gets a wake event or the 1 second window expires.
> > > 
> > >  - If the VM terminates, the pCPU will get false positives until the 1 second
> > >    window expires.
> > > 
> > > The false positives are solvable problems, by hooking vcpu_put() to reset
> > > kvm_last_guest_exit.  And to help with the false negatives when a vCPU task is
> > > scheduled in on a different pCPU, KVM would hook vcpu_load().
> > 
> > Here you are arguing against the heuristic in the original patch, correct?
> 
> Yep, correct.

Whew!!!  ;-)

> > As opposed to the current RCU heuristic that ignores certain quiescent
> > states for nohz_full CPUs until the grace period reaches an age of
> > one second?
> > 
> > If so, no argument here.  In fact, please consider my ack cancelled.
> 
> ...
> 
> > > That's a largely orthogonal discussion.  As above, boosting the scheduling priority
> > > of a vCPU because that vCPU is in critical section of some form is not at all
> > > unique to nested virtualization (or RCU).
> > > 
> > > For basic functional correctness, the L0 hypervisor already has the "hint" it 
> > > needs.  L0 knows that the L1 CPU wants to run by virtue of the L1 CPU being
> > > runnable, i.e. not halted, not in WFS, etc.
> > 
> > And if the system is sufficiently lightly loaded, all will be well, as is
> > the case with my rcutorture usage.  However, if the system is saturated,
> > that basic functional correctness might not be enough.  I haven't heard
> > many complaints, other than research work, so I have been assuming that
> > we do not yet need hinting.  But you guys tell me.  ;-)
> 
> We should never use hinting for basic, *default* functionality.  If the host is
> so overloaded that it can induce RCU stalls with the default threshold of 21
> seconds, then something in the host's domain is broken/misconfigured.  E.g. it
> > doesn't necessarily have to be a host kernel/userspace bug, it could be an issue
> with VM scheduling at the control plane.  But it's still a host issue, and under
> no circumstance should the host need a hint in order for the guest to not complain
> after 20+ seconds.
> 
> And _if_ we were to push the default lower, e.g. all the way down to Android's
> aggressive 20 milliseconds, a boosting hint would still be the wrong way to go
> about it, because no sane hypervisor would ever back such a hint with strong
> guarantees for all scenarios.
> 
> It's very much possible to achieve a 20ms deadline when running as a VM, but it
> would require strong guarantees about the VM's configuration and environment,
> e.g. that memory isn't overcommited, that each vCPU has a fully dedicated pCPU,
> etc.

Agreed, and again, you guys need to tell me what is necessary here.

> > > > > +	    rcu_nohz_full_cpu())
> > > > 
> > > > And rcu_nohz_full_cpu() has a one-second timeout, and has for quite
> > > > some time.
> > > 
> > > That's not a good reason to use a suboptimal heuristic for determining whether
> > > or not a CPU is likely to enter a KVM guest, it simply mitigates the worst case
> > > scenario of a false positive.
> > 
> > Again, are you referring to the current RCU code, or the original patch
> > that started this email thread?
> 
> Original patch.
> 
> > > > >  	/* Is the RCU core waiting for a quiescent state from this CPU? */
> > > > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > > > > index bfb2b52a1416..5a7efc669a0f 100644
> > > > > --- a/virt/kvm/kvm_main.c
> > > > > +++ b/virt/kvm/kvm_main.c
> > > > > @@ -209,6 +209,9 @@ void vcpu_load(struct kvm_vcpu *vcpu)
> > > > >  {
> > > > >  	int cpu = get_cpu();
> > > > >  
> > > > > +	if (vcpu->wants_to_run)
> > > > > +		context_tracking_guest_start_run_loop();
> > > > 
> > > > At this point, if this is a nohz_full CPU, it will no longer report
> > > > quiescent states until the grace period is at least one second old.
> > > 
> > > I don't think I follow the "will no longer report quiescent states" issue.  Are
> > > you saying that this would prevent guest_context_enter_irqoff() from reporting
> > > that the CPU is entering a quiescent state?  If so, that's an issue that would
> > > need to be resolved regardless of what heuristic we use to determine whether or
> > > not a CPU is likely to enter a KVM guest.
> > 
> > Please allow me to start over.  Are interrupts disabled at this point,
> 
> Nope, IRQs are enabled.
> 
> Oof, I'm glad you asked, because I was going to say that there's one exception,
> kvm_sched_in(), which is KVM's notifier for when a preempted task/vCPU is scheduled
> back in.  But I forgot that kvm_sched_{in,out}() don't use vcpu_{load,put}(),
> i.e. would need explicit calls to context_tracking_guest_{stop,start}_run_loop().
> 
> > and, if so, will they remain disabled until the transfer of control to
> > the guest has become visible to RCU via the context-tracking code?
> > 
> > Or has the context-tracking code already made the transfer of control
> > to the guest visible to RCU?
> 
> Nope.  The call to __ct_user_enter(CONTEXT_GUEST) or rcu_virt_note_context_switch()
> happens later, just before the actual VM-Enter.  And that call does happen with
> IRQs disabled (and IRQs stay disabled until the CPU enters the guest).

OK, then we can have difficulties with long-running interrupts hitting
this range of code.  It is unfortunately not unheard-of for interrupts
plus trailing softirqs to run for tens of seconds, even minutes.

One counter-argument is that that softirq would take scheduling-clock
interrupts, and would eventually make rcu_core() run.

But does a rcu_sched_clock_irq() from a guest OS have its "user"
argument set?

> > > > >  	__this_cpu_write(kvm_running_vcpu, vcpu);
> > > > >  	preempt_notifier_register(&vcpu->preempt_notifier);
> > > > >  	kvm_arch_vcpu_load(vcpu, cpu);
> > > > > @@ -222,6 +225,10 @@ void vcpu_put(struct kvm_vcpu *vcpu)
> > > > >  	kvm_arch_vcpu_put(vcpu);
> > > > >  	preempt_notifier_unregister(&vcpu->preempt_notifier);
> > > > >  	__this_cpu_write(kvm_running_vcpu, NULL);
> > > > > +
> > > > 
> > > > And also at this point, if this is a nohz_full CPU, it will no longer
> > > > report quiescent states until the grace period is at least one second old.
> > 
> > And here, are interrupts disabled at this point, and if so, have they
> > been disabled since the time that the exit from the guest become
> > visible to RCU via the context-tracking code?
> 
> IRQs are enabled.
> 
> The gist of my suggestion is:
> 
> 	ioctl(KVM_RUN) {
> 
> 		context_tracking_guest_start_run_loop();
> 
> 		for (;;) {
> 
> 			vcpu_run();
> 
> 			if (<need to return to userspace>)
> 				break;
> 		}
> 
> 		context_tracking_guest_stop_run_loop();
> 	}
> 
> where vcpu_run() encompasses a fairly huge amount of code and functionality,
> including the logic to do world switches between host and guest.
> 
> E.g. if a vCPU triggers a VM-Exit because it tried to access memory that has been
> swapped out by the host, KVM could end up way down in mm/ doing I/O to bring a
> page back into memory for the guest.  Immediately after VM-Exit, before enabling
> IRQs, KVM will notify RCU that the CPU has exited the extended quiescent state
> (this is what happens today).  But the "in KVM run loop" flag would stay set, and
> RCU would rely on rcu_nohz_full_cpu() for protection, e.g. in case faulting in
> memory somehow takes more than a second.
> 
> But, barring something that triggers a return to userspace, KVM _will_ re-enter
> the guest as quickly as possible.  So it's still a heuristic in the sense that
> the CPU isn't guaranteed to enter the guest, nor are there any enforceable SLOs
> on how quickly the CPU will enter the guest, but I think it's the best tradeoff
> between simplicity and functionality, especially since rcu_nohz_full_cpu() has
> a one second timeout to safeguard against some unforeseen hiccup that prevents
> KVM from re-entering the guest in a timely manner.
> 
> Note, as above, my intent is that there would also be hooks in kvm_sched_{in,out}()
> to note that the guest run loop is starting/stopping if the vCPU task yields or
> is preempted.

Very good, same responses as for the context_tracking_guest_start_run_loop()
case.

							Thanx, Paul
Sean Christopherson April 8, 2024, 9:56 p.m. UTC | #9
On Mon, Apr 08, 2024, Paul E. McKenney wrote:
> On Mon, Apr 08, 2024 at 01:06:00PM -0700, Sean Christopherson wrote:
> > On Mon, Apr 08, 2024, Paul E. McKenney wrote:
> > > > > > +	if (vcpu->wants_to_run)
> > > > > > +		context_tracking_guest_start_run_loop();
> > > > > 
> > > > > At this point, if this is a nohz_full CPU, it will no longer report
> > > > > quiescent states until the grace period is at least one second old.
> > > > 
> > > > I don't think I follow the "will no longer report quiescent states" issue.  Are
> > > > you saying that this would prevent guest_context_enter_irqoff() from reporting
> > > > that the CPU is entering a quiescent state?  If so, that's an issue that would
> > > > need to be resolved regardless of what heuristic we use to determine whether or
> > > > not a CPU is likely to enter a KVM guest.
> > > 
> > > Please allow me to start over.  Are interrupts disabled at this point,
> > 
> > Nope, IRQs are enabled.
> > 
> > Oof, I'm glad you asked, because I was going to say that there's one exception,
> > kvm_sched_in(), which is KVM's notifier for when a preempted task/vCPU is scheduled
> > back in.  But I forgot that kvm_sched_{in,out}() don't use vcpu_{load,put}(),
> > i.e. would need explicit calls to context_tracking_guest_{stop,start}_run_loop().
> > 
> > > and, if so, will they remain disabled until the transfer of control to
> > > the guest has become visible to RCU via the context-tracking code?
> > > 
> > > Or has the context-tracking code already made the transfer of control
> > > to the guest visible to RCU?
> > 
> > Nope.  The call to __ct_user_enter(CONTEXT_GUEST) or rcu_virt_note_context_switch()
> > happens later, just before the actual VM-Enter.  And that call does happen with
> > IRQs disabled (and IRQs stay disabled until the CPU enters the guest).
> 
> OK, then we can have difficulties with long-running interrupts hitting
> this range of code.  It is unfortunately not unheard-of for interrupts
> plus trailing softirqs to run for tens of seconds, even minutes.

Ah, and if that occurs, *and* KVM is slow to re-enter the guest, then there will
be a massive lag before the CPU gets back into a quiescent state.

> One counter-argument is that that softirq would take scheduling-clock
> interrupts, and would eventually make rcu_core() run.

Considering that this behavior would be unique to nohz_full CPUs, how much
responsibility does RCU have to ensure a sane setup?  E.g. if a softirq runs for
multiple seconds on a nohz_full CPU whose primary role is to run a KVM vCPU, then
whatever real-time workload the vCPU is running is already doomed.

> But does a rcu_sched_clock_irq() from a guest OS have its "user"
> argument set?

No, and it shouldn't, at least not on x86 (I assume other architectures are
similar, but I don't actually know for sure).

On x86, the IRQ that the kernel sees looks like it comes from host kernel
code.  And on AMD (SVM), the IRQ doesn't just "look" like it came from host kernel,
the IRQ really does get vectored/handled in the host kernel.  Intel CPUs have a
performance optimization where the IRQ gets "eaten" as part of the VM-Exit, and
so KVM synthesizes a stack frame and does a manual CALL to invoke the IRQ handler.

And that's just for IRQs that actually arrive while the guest is running.  IRQs
that arrive while KVM is active, e.g. running its large vcpu_run(), are "pure" host
IRQs.
Paul E. McKenney April 8, 2024, 10:35 p.m. UTC | #10
On Mon, Apr 08, 2024 at 02:56:29PM -0700, Sean Christopherson wrote:
> On Mon, Apr 08, 2024, Paul E. McKenney wrote:
> > On Mon, Apr 08, 2024 at 01:06:00PM -0700, Sean Christopherson wrote:
> > > On Mon, Apr 08, 2024, Paul E. McKenney wrote:
> > > > > > > +	if (vcpu->wants_to_run)
> > > > > > > +		context_tracking_guest_start_run_loop();
> > > > > > 
> > > > > > At this point, if this is a nohz_full CPU, it will no longer report
> > > > > > quiescent states until the grace period is at least one second old.
> > > > > 
> > > > > I don't think I follow the "will no longer report quiescent states" issue.  Are
> > > > > you saying that this would prevent guest_context_enter_irqoff() from reporting
> > > > > that the CPU is entering a quiescent state?  If so, that's an issue that would
> > > > > need to be resolved regardless of what heuristic we use to determine whether or
> > > > > not a CPU is likely to enter a KVM guest.
> > > > 
> > > > Please allow me to start over.  Are interrupts disabled at this point,
> > > 
> > > Nope, IRQs are enabled.
> > > 
> > > Oof, I'm glad you asked, because I was going to say that there's one exception,
> > > kvm_sched_in(), which is KVM's notifier for when a preempted task/vCPU is scheduled
> > > back in.  But I forgot that kvm_sched_{in,out}() don't use vcpu_{load,put}(),
> > > i.e. would need explicit calls to context_tracking_guest_{stop,start}_run_loop().
> > > 
> > > > and, if so, will they remain disabled until the transfer of control to
> > > > the guest has become visible to RCU via the context-tracking code?
> > > > 
> > > > Or has the context-tracking code already made the transfer of control
> > > > to the guest visible to RCU?
> > > 
> > > Nope.  The call to __ct_user_enter(CONTEXT_GUEST) or rcu_virt_note_context_switch()
> > > happens later, just before the actual VM-Enter.  And that call does happen with
> > > IRQs disabled (and IRQs stay disabled until the CPU enters the guest).
> > 
> > OK, then we can have difficulties with long-running interrupts hitting
> > this range of code.  It is unfortunately not unheard-of for interrupts
> > plus trailing softirqs to run for tens of seconds, even minutes.
> 
> Ah, and if that occurs, *and* KVM is slow to re-enter the guest, then there will
> be a massive lag before the CPU gets back into a quiescent state.

Exactly!

> > One counter-argument is that that softirq would take scheduling-clock
> > interrupts, and would eventually make rcu_core() run.
> 
> Considering that this behavior would be unique to nohz_full CPUs, how much
> responsibility does RCU have to ensure a sane setup?  E.g. if a softirq runs for
> multiple seconds on a nohz_full CPU whose primary role is to run a KVM vCPU, then
> whatever real-time workload the vCPU is running is already doomed.

True, but it is always good to be doing one's part.

> > But does a rcu_sched_clock_irq() from a guest OS have its "user"
> > argument set?
> 
> No, and it shouldn't, at least not on x86 (I assume other architectures are
> similar, but I don't actually know for sure).
> 
> On x86, the IRQ that the kernel sees looks like it comes from host kernel
> code.  And on AMD (SVM), the IRQ doesn't just "look" like it came from host kernel,
> the IRQ really does get vectored/handled in the host kernel.  Intel CPUs have a
> performance optimization where the IRQ gets "eaten" as part of the VM-Exit, and
> so KVM synthesizes a stack frame and does a manual CALL to invoke the IRQ handler.
> 
> And that's just for IRQs that actually arrive while the guest is running.  IRQs
> that arrive while KVM is active, e.g. running its large vcpu_run(), are "pure" host
> IRQs.

OK, then is it possible to get some other indication to the
rcu_sched_clock_irq() function that it has interrupted a guest OS?

Not an emergency, and maybe not even necessary, but it might well be
one hole that would be good to stop up.

							Thanx, Paul
Sean Christopherson April 8, 2024, 11:06 p.m. UTC | #11
On Mon, Apr 08, 2024, Paul E. McKenney wrote:
> On Mon, Apr 08, 2024 at 02:56:29PM -0700, Sean Christopherson wrote:
> > > OK, then we can have difficulties with long-running interrupts hitting
> > > this range of code.  It is unfortunately not unheard-of for interrupts
> > > plus trailing softirqs to run for tens of seconds, even minutes.
> > 
> > Ah, and if that occurs, *and* KVM is slow to re-enter the guest, then there will
> > be a massive lag before the CPU gets back into a quiescent state.
> 
> Exactly!

...

> OK, then is it possible to get some other indication to the
> rcu_sched_clock_irq() function that it has interrupted a guest OS?

It's certainly possible, but I don't think we want to go down that road.

Any functionality built on that would be strictly limited to Intel CPUs, because
AFAIK, only Intel VMX has the mode where an IRQ can be handled without enabling
IRQs (which sounds stupid when I write it like that).

E.g. on AMD SVM, if an IRQ interrupts the guest, KVM literally handles it by
doing:

	local_irq_enable();
	++vcpu->stat.exits;
	local_irq_disable();

which means there's no way for KVM to guarantee that the IRQ that leads to
rcu_sched_clock_irq() is the _only_ IRQ that is taken (or that what RCU sees was
even the IRQ that interrupted the guest, though that probably doesn't matter much).

Orthogonal to RCU, I do think it makes sense to have KVM VMX handle IRQs in its
fastpath for VM-Exit, i.e. handle the IRQ VM-Exit and re-enter the guest without
ever enabling IRQs.  But that's purely a KVM optimization, e.g. to avoid useless
work when the host has already done what it needed to do.

But even then, to make it so RCU could safely skip invoke_rcu_core(), KVM would
need to _guarantee_ re-entry to the guest, and I don't think we want to do that.
E.g. if there is some work that needs to be done on the CPU, re-entering the guest
is a huge waste of cycles, as KVM would need to do some shenanigans to immediately
force a VM-Exit.  It'd also require a moderate amount of complexity that I wouldn't
want to maintain, particularly since it'd be Intel-only.

> Not an emergency, and maybe not even necessary, but it might well be
> one hole that would be good to stop up.
> 
> 							Thanx, Paul
Paul E. McKenney April 8, 2024, 11:20 p.m. UTC | #12
On Mon, Apr 08, 2024 at 04:06:22PM -0700, Sean Christopherson wrote:
> On Mon, Apr 08, 2024, Paul E. McKenney wrote:
> > On Mon, Apr 08, 2024 at 02:56:29PM -0700, Sean Christopherson wrote:
> > > > OK, then we can have difficulties with long-running interrupts hitting
> > > > this range of code.  It is unfortunately not unheard-of for interrupts
> > > > plus trailing softirqs to run for tens of seconds, even minutes.
> > > 
> > > Ah, and if that occurs, *and* KVM is slow to re-enter the guest, then there will
> > > be a massive lag before the CPU gets back into a quiescent state.
> > 
> > Exactly!
> 
> ...
> 
> > OK, then is it possible to get some other indication to the
> > rcu_sched_clock_irq() function that it has interrupted a guest OS?
> 
> It's certainly possible, but I don't think we want to go down that road.
> 
> Any functionality built on that would be strictly limited to Intel CPUs, because
> AFAIK, only Intel VMX has the mode where an IRQ can be handled without enabling
> IRQs (which sounds stupid when I write it like that).
> 
> E.g. on AMD SVM, if an IRQ interrupts the guest, KVM literally handles it by
> doing:
> 
> 	local_irq_enable();
> 	++vcpu->stat.exits;
> 	local_irq_disable();
> 
> which means there's no way for KVM to guarantee that the IRQ that leads to
> rcu_sched_clock_irq() is the _only_ IRQ that is taken (or that what RCU sees was
> even the IRQ that interrupted the guest, though that probably doesn't matter much).
> 
> Orthogonal to RCU, I do think it makes sense to have KVM VMX handle IRQs in its
> fastpath for VM-Exit, i.e. handle the IRQ VM-Exit and re-enter the guest without
> ever enabling IRQs.  But that's purely a KVM optimization, e.g. to avoid useless
> work when the host has already done what it needed to do.
> 
> But even then, to make it so RCU could safely skip invoke_rcu_core(), KVM would
> need to _guarantee_ re-entry to the guest, and I don't think we want to do that.
> E.g. if there is some work that needs to be done on the CPU, re-entering the guest
> is a huge waste of cycles, as KVM would need to do some shenanigans to immediately
> force a VM-Exit.  It'd also require a moderate amount of complexity that I wouldn't
> want to maintain, particularly since it'd be Intel-only.

Thank you for the analysis!

It sounds like the current state, imperfect though it might be, is the
best of the known possible worlds at the moment.

But should anyone come up with something better, please do not keep it
a secret!

							Thanx, Paul

> > Not an emergency, and maybe not even necessary, but it might well be
> > one hole that would be good to stop up.
> > 
> > 							Thanx, Paul
Marcelo Tosatti April 10, 2024, 2:39 a.m. UTC | #13
On Mon, Apr 08, 2024 at 10:16:24AM -0700, Sean Christopherson wrote:
> On Fri, Apr 05, 2024, Paul E. McKenney wrote:
> > On Fri, Apr 05, 2024 at 07:42:35AM -0700, Sean Christopherson wrote:
> > > On Fri, Apr 05, 2024, Marcelo Tosatti wrote:
> > > > rcuc wakes up (which might exceed the allowed latency threshold
> > > > for certain realtime apps).
> > > 
> > > Isn't that a false negative? (RCU doesn't detect that a CPU is about to (re)enter
> > > a guest)  I was trying to ask about the case where RCU thinks a CPU is about to
> > > enter a guest, but the CPU never does (at least, not in the immediate future).
> > > 
> > > Or am I just not understanding how RCU's kthreads work?
> > 
> > It is quite possible that the current rcu_pending() code needs help,
> > given the possibility of vCPU preemption.  I have heard of people doing
> > nested KVM virtualization -- or is that no longer a thing?
> 
> Nested virtualization is still very much a thing, but I don't see how it is at
> all unique with respect to RCU grace periods and quiescent states.  More below.
> 
> > But the help might well involve RCU telling the hypervisor that a given
> > vCPU needs to run.  Not sure how that would go over, though it has been
> > prototyped a couple times in the context of RCU priority boosting.
> >
> > > > > > 3 - It checks if the guest exit happened more than 1 second ago. This 1
> > > > > >     second value was copied from rcu_nohz_full_cpu() which checks if the
> > > > > >     grace period started more than a second ago. If this value is bad,
> > > > > >     I have no issue changing it.
> > > > > 
> > > > > IMO, checking if a CPU "recently" ran a KVM vCPU is a suboptimal heuristic regardless
> > > > > of what magic time threshold is used.  
> > > > 
> > > > Why? It works for this particular purpose.
> > > 
> > > Because maintaining magic numbers is no fun, AFAICT the heuristic doesn't guard
> > > against edge cases, and I'm pretty sure we can do better with about the same amount
> > > of effort/churn.
> > 
> > Beyond a certain point, we have no choice.  How long should RCU let
> > a CPU run with preemption disabled before complaining?  We choose 21
> > seconds in mainline and some distros choose 60 seconds.  Android chooses
> > 20 milliseconds for synchronize_rcu_expedited() grace periods.
> 
> Issuing a warning based on an arbitrary time limit is wildly different than using
> an arbitrary time window to make functional decisions.  My objection to the "assume
> the CPU will enter a quiescent state if it exited a KVM guest in the last second"
> is that there are plenty of scenarios where that assumption falls apart, i.e. where
> _that_ physical CPU will not re-enter the guest.
> 
> Off the top of my head:
> 
>  - If the vCPU is migrated to a different physical CPU (pCPU), the *old* pCPU
>    will get false positives, and the *new* pCPU will get false negatives (though
>    the false negatives aren't all that problematic since the pCPU will enter a
>    quiescent state on the next VM-Enter.)
> 
>  - If the vCPU halts, in which case KVM will schedule out the vCPU/task, i.e.
>    won't re-enter the guest.  And so the pCPU will get false positives until the
>    vCPU gets a wake event or the 1 second window expires.
> 
>  - If the VM terminates, the pCPU will get false positives until the 1 second
>    window expires.
> 
> The false positives are solvable problems, by hooking vcpu_put() to reset
> kvm_last_guest_exit.  And to help with the false negatives when a vCPU task is
> scheduled in on a different pCPU, KVM would hook vcpu_load().

Sean,

It seems that fixing the problems you pointed out above is a way to go.
Marcelo Tosatti April 15, 2024, 7:47 p.m. UTC | #14
On Mon, Apr 08, 2024 at 10:16:24AM -0700, Sean Christopherson wrote:
> On Fri, Apr 05, 2024, Paul E. McKenney wrote:
> > On Fri, Apr 05, 2024 at 07:42:35AM -0700, Sean Christopherson wrote:
> > > On Fri, Apr 05, 2024, Marcelo Tosatti wrote:
> > > > rcuc wakes up (which might exceed the allowed latency threshold
> > > > for certain realtime apps).
> > > 
> > > Isn't that a false negative? (RCU doesn't detect that a CPU is about to (re)enter
> > > a guest)  I was trying to ask about the case where RCU thinks a CPU is about to
> > > enter a guest, but the CPU never does (at least, not in the immediate future).
> > > 
> > > Or am I just not understanding how RCU's kthreads work?
> > 
> > It is quite possible that the current rcu_pending() code needs help,
> > given the possibility of vCPU preemption.  I have heard of people doing
> > nested KVM virtualization -- or is that no longer a thing?
> 
> Nested virtualization is still very much a thing, but I don't see how it is at
> all unique with respect to RCU grace periods and quiescent states.  More below.
> 
> > But the help might well involve RCU telling the hypervisor that a given
> > vCPU needs to run.  Not sure how that would go over, though it has been
> > prototyped a couple times in the context of RCU priority boosting.
> >
> > > > > > 3 - It checks if the guest exit happened more than 1 second ago. This 1
> > > > > >     second value was copied from rcu_nohz_full_cpu() which checks if the
> > > > > >     grace period started more than a second ago. If this value is bad,
> > > > > >     I have no issue changing it.
> > > > > 
> > > > > IMO, checking if a CPU "recently" ran a KVM vCPU is a suboptimal heuristic regardless
> > > > > of what magic time threshold is used.  
> > > > 
> > > > Why? It works for this particular purpose.
> > > 
> > > Because maintaining magic numbers is no fun, AFAICT the heuristic doesn't guard
> > > against edge cases, and I'm pretty sure we can do better with about the same amount
> > > of effort/churn.
> > 
> > Beyond a certain point, we have no choice.  How long should RCU let
> > a CPU run with preemption disabled before complaining?  We choose 21
> > seconds in mainline and some distros choose 60 seconds.  Android chooses
> > 20 milliseconds for synchronize_rcu_expedited() grace periods.
> 
> Issuing a warning based on an arbitrary time limit is wildly different than using
> an arbitrary time window to make functional decisions.  My objection to the "assume
> the CPU will enter a quiescent state if it exited a KVM guest in the last second"
> is that there are plenty of scenarios where that assumption falls apart, i.e. where
> _that_ physical CPU will not re-enter the guest.
> 
> Off the top of my head:
> 
>  - If the vCPU is migrated to a different physical CPU (pCPU), the *old* pCPU
>    will get false positives, and the *new* pCPU will get false negatives (though
>    the false negatives aren't all that problematic since the pCPU will enter a
>    quiescent state on the next VM-Enter.)
> 
>  - If the vCPU halts, in which case KVM will schedule out the vCPU/task, i.e.
>    won't re-enter the guest.  And so the pCPU will get false positives until the
>    vCPU gets a wake event or the 1 second window expires.
> 
>  - If the VM terminates, the pCPU will get false positives until the 1 second
>    window expires.
> 
> The false positives are solvable problems, by hooking vcpu_put() to reset
> kvm_last_guest_exit.  And to help with the false negatives when a vCPU task is
> scheduled in on a different pCPU, KVM would hook vcpu_load().

Hi Sean,

So this should deal with it? (untested, don't apply...).

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 48f31dcd318a..be90d83d631a 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -477,6 +477,16 @@ static __always_inline void guest_state_enter_irqoff(void)
 	lockdep_hardirqs_on(CALLER_ADDR0);
 }
 
+DECLARE_PER_CPU(unsigned long, kvm_last_guest_exit);
+
+/*
+ * Returns time (jiffies) for the last guest exit in current cpu
+ */
+static inline unsigned long guest_exit_last_time(void)
+{
+	return this_cpu_read(kvm_last_guest_exit);
+}
+
 /*
  * Exit guest context and exit an RCU extended quiescent state.
  *
@@ -488,6 +498,9 @@ static __always_inline void guest_state_enter_irqoff(void)
 static __always_inline void guest_context_exit_irqoff(void)
 {
 	context_tracking_guest_exit();
+
+	/* Keeps track of last guest exit */
+	this_cpu_write(kvm_last_guest_exit, jiffies);
 }
 
 /*
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index fb49c2a60200..231d0e4d2cf1 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -110,6 +110,9 @@ static struct kmem_cache *kvm_vcpu_cache;
 static __read_mostly struct preempt_ops kvm_preempt_ops;
 static DEFINE_PER_CPU(struct kvm_vcpu *, kvm_running_vcpu);
 
+DEFINE_PER_CPU(unsigned long, kvm_last_guest_exit);
+EXPORT_SYMBOL_GPL(kvm_last_guest_exit);
+
 struct dentry *kvm_debugfs_dir;
 EXPORT_SYMBOL_GPL(kvm_debugfs_dir);
 
@@ -210,6 +213,7 @@ void vcpu_load(struct kvm_vcpu *vcpu)
 	int cpu = get_cpu();
 
 	__this_cpu_write(kvm_running_vcpu, vcpu);
+	__this_cpu_write(kvm_last_guest_exit, 0);
 	preempt_notifier_register(&vcpu->preempt_notifier);
 	kvm_arch_vcpu_load(vcpu, cpu);
 	put_cpu();
@@ -222,6 +226,7 @@ void vcpu_put(struct kvm_vcpu *vcpu)
 	kvm_arch_vcpu_put(vcpu);
 	preempt_notifier_unregister(&vcpu->preempt_notifier);
 	__this_cpu_write(kvm_running_vcpu, NULL);
+	__this_cpu_write(kvm_last_guest_exit, 0);
 	preempt_enable();
 }
 EXPORT_SYMBOL_GPL(vcpu_put);
Sean Christopherson April 15, 2024, 9:29 p.m. UTC | #15
On Mon, Apr 15, 2024, Marcelo Tosatti wrote:
> On Mon, Apr 08, 2024 at 10:16:24AM -0700, Sean Christopherson wrote:
> > On Fri, Apr 05, 2024, Paul E. McKenney wrote:
> > > Beyond a certain point, we have no choice.  How long should RCU let
> > > a CPU run with preemption disabled before complaining?  We choose 21
> > > seconds in mainline and some distros choose 60 seconds.  Android chooses
> > > 20 milliseconds for synchronize_rcu_expedited() grace periods.
> > 
> > Issuing a warning based on an arbitrary time limit is wildly different than using
> > an arbitrary time window to make functional decisions.  My objection to the "assume
> > the CPU will enter a quiescent state if it exited a KVM guest in the last second"
> > is that there are plenty of scenarios where that assumption falls apart, i.e. where
> > _that_ physical CPU will not re-enter the guest.
> > 
> > Off the top of my head:
> > 
> >  - If the vCPU is migrated to a different physical CPU (pCPU), the *old* pCPU
> >    will get false positives, and the *new* pCPU will get false negatives (though
> >    the false negatives aren't all that problematic since the pCPU will enter a
> >    quiescent state on the next VM-Enter.)
> > 
> >  - If the vCPU halts, in which case KVM will schedule out the vCPU/task, i.e.
> >    won't re-enter the guest.  And so the pCPU will get false positives until the
> >    vCPU gets a wake event or the 1 second window expires.
> > 
> >  - If the VM terminates, the pCPU will get false positives until the 1 second
> >    window expires.
> > 
> > The false positives are solvable problems, by hooking vcpu_put() to reset
> > kvm_last_guest_exit.  And to help with the false negatives when a vCPU task is
> > scheduled in on a different pCPU, KVM would hook vcpu_load().
> 
> Hi Sean,
> 
> So this should deal with it? (untested, don't apply...).

Not entirely.  As I belatedly noted, hooking vcpu_put() doesn't handle the case
where the vCPU is preempted, i.e. kvm_sched_out() would also need to zero out
kvm_last_guest_exit to avoid a false positive.  Going through the scheduler will
note the CPU is quiescent for the current grace period, but after that RCU will
still see a non-zero kvm_last_guest_exit even though the vCPU task isn't actively
running.

And snapshotting the VM-Exit time will get false negatives when the vCPU is about
to run, but for whatever reason has kvm_last_guest_exit=0, e.g. if a vCPU was
preempted and/or migrated to a different pCPU.

I don't understand the motivation for keeping the kvm_last_guest_exit logic.  My
understanding is that RCU already has a timeout to avoid stalling RCU.  I don't
see what is gained by effectively duplicating that timeout for KVM.  Why not have
KVM provide a "this task is in KVM_RUN" flag, and then let the existing timeout
handle the (hopefully rare) case where KVM doesn't "immediately" re-enter the guest?

> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 48f31dcd318a..be90d83d631a 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -477,6 +477,16 @@ static __always_inline void guest_state_enter_irqoff(void)
>  	lockdep_hardirqs_on(CALLER_ADDR0);
>  }
>  
> +DECLARE_PER_CPU(unsigned long, kvm_last_guest_exit);
> +
> +/*
> + * Returns time (jiffies) for the last guest exit in current cpu
> + */
> +static inline unsigned long guest_exit_last_time(void)
> +{
> +	return this_cpu_read(kvm_last_guest_exit);
> +}
> +
>  /*
>   * Exit guest context and exit an RCU extended quiescent state.
>   *
> @@ -488,6 +498,9 @@ static __always_inline void guest_state_enter_irqoff(void)
>  static __always_inline void guest_context_exit_irqoff(void)
>  {
>  	context_tracking_guest_exit();
> +
> +	/* Keeps track of last guest exit */
> +	this_cpu_write(kvm_last_guest_exit, jiffies);
>  }
>  
>  /*
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index fb49c2a60200..231d0e4d2cf1 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -110,6 +110,9 @@ static struct kmem_cache *kvm_vcpu_cache;
>  static __read_mostly struct preempt_ops kvm_preempt_ops;
>  static DEFINE_PER_CPU(struct kvm_vcpu *, kvm_running_vcpu);
>  
> +DEFINE_PER_CPU(unsigned long, kvm_last_guest_exit);
> +EXPORT_SYMBOL_GPL(kvm_last_guest_exit);
> +
>  struct dentry *kvm_debugfs_dir;
>  EXPORT_SYMBOL_GPL(kvm_debugfs_dir);
>  
> @@ -210,6 +213,7 @@ void vcpu_load(struct kvm_vcpu *vcpu)
>  	int cpu = get_cpu();
>  
>  	__this_cpu_write(kvm_running_vcpu, vcpu);
> +	__this_cpu_write(kvm_last_guest_exit, 0);
>  	preempt_notifier_register(&vcpu->preempt_notifier);
>  	kvm_arch_vcpu_load(vcpu, cpu);
>  	put_cpu();
> @@ -222,6 +226,7 @@ void vcpu_put(struct kvm_vcpu *vcpu)
>  	kvm_arch_vcpu_put(vcpu);
>  	preempt_notifier_unregister(&vcpu->preempt_notifier);
>  	__this_cpu_write(kvm_running_vcpu, NULL);
> +	__this_cpu_write(kvm_last_guest_exit, 0);
>  	preempt_enable();
>  }
>  EXPORT_SYMBOL_GPL(vcpu_put);
>
Marcelo Tosatti April 16, 2024, 12:36 p.m. UTC | #16
On Mon, Apr 15, 2024 at 02:29:32PM -0700, Sean Christopherson wrote:
> On Mon, Apr 15, 2024, Marcelo Tosatti wrote:
> > On Mon, Apr 08, 2024 at 10:16:24AM -0700, Sean Christopherson wrote:
> > > On Fri, Apr 05, 2024, Paul E. McKenney wrote:
> > > > Beyond a certain point, we have no choice.  How long should RCU let
> > > > a CPU run with preemption disabled before complaining?  We choose 21
> > > > seconds in mainline and some distros choose 60 seconds.  Android chooses
> > > > 20 milliseconds for synchronize_rcu_expedited() grace periods.
> > > 
> > > Issuing a warning based on an arbitrary time limit is wildly different than using
> > > an arbitrary time window to make functional decisions.  My objection to the "assume
> > > the CPU will enter a quiescent state if it exited a KVM guest in the last second"
> > > is that there are plenty of scenarios where that assumption falls apart, i.e. where
> > > _that_ physical CPU will not re-enter the guest.
> > > 
> > > Off the top of my head:
> > > 
> > >  - If the vCPU is migrated to a different physical CPU (pCPU), the *old* pCPU
> > >    will get false positives, and the *new* pCPU will get false negatives (though
> > >    the false negatives aren't all that problematic since the pCPU will enter a
> > >    quiescent state on the next VM-Enter).
> > > 
> > >  - If the vCPU halts, in which case KVM will schedule out the vCPU/task, i.e.
> > >    won't re-enter the guest.  And so the pCPU will get false positives until the
> > >    vCPU gets a wake event or the 1 second window expires.
> > > 
> > >  - If the VM terminates, the pCPU will get false positives until the 1 second
> > >    window expires.
> > > 
> > > The false positives are solvable problems, by hooking vcpu_put() to reset
> > > kvm_last_guest_exit.  And to help with the false negatives when a vCPU task is
> > > scheduled in on a different pCPU, KVM would hook vcpu_load().
> > 
> > Hi Sean,
> > 
> > So this should deal with it? (untested, don't apply...).
> 
> Not entirely.  As I belatedly noted, hooking vcpu_put() doesn't handle the case
> where the vCPU is preempted, i.e. kvm_sched_out() would also need to zero out
> kvm_last_guest_exit to avoid a false positive. 

True. Can fix that.

> Going through the scheduler will
> note the CPU is quiescent for the current grace period, but after that RCU will
> still see a non-zero kvm_last_guest_exit even though the vCPU task isn't actively
> running.

Right, can fix kvm_sched_out().

> And snapshotting the VM-Exit time will get false negatives when the vCPU is about
> to run, but for whatever reason has kvm_last_guest_exit=0, e.g. if a vCPU was
> preempted and/or migrated to a different pCPU.

Right, for the use-case where waking up rcuc is a problem, the pCPU is
isolated (there are no userspace processes and hopefully no kernel threads
executing there), with the vCPU pinned to that pCPU.

So there should be no preemptions or migrations.

> I don't understand the motivation for keeping the kvm_last_guest_exit logic.

The motivation is to _avoid_ waking up rcuc to perform RCU core
processing, in case the vCPU runs on a nohz full CPU, since
entering the VM is an extended quiescent state.

The logic for userspace/idle extended quiescent states is:

This is called from the sched clock interrupt.

/*
 * This function is invoked from each scheduling-clock interrupt,
 * and checks to see if this CPU is in a non-context-switch quiescent
 * state, for example, user mode or idle loop.  It also schedules RCU
 * core processing.  If the current grace period has gone on too long,
 * it will ask the scheduler to manufacture a context switch for the sole
 * purpose of providing the needed quiescent state.
 */
void rcu_sched_clock_irq(int user)
{
...
        if (rcu_pending(user))
                invoke_rcu_core();
...
}

And, from rcu_pending:

        /* Is this a nohz_full CPU in userspace or idle?  (Ignore RCU if so.) */
        if ((user || rcu_is_cpu_rrupt_from_idle()) && rcu_nohz_full_cpu())
                return 0;

/*
 * Is this CPU a NO_HZ_FULL CPU that should ignore RCU so that the
 * grace-period kthread will do force_quiescent_state() processing?
 * The idea is to avoid waking up RCU core processing on such a
 * CPU unless the grace period has extended for too long.
 *
 * This code relies on the fact that all NO_HZ_FULL CPUs are also
 * RCU_NOCB_CPU CPUs.
 */
static bool rcu_nohz_full_cpu(void)
{
#ifdef CONFIG_NO_HZ_FULL
        if (tick_nohz_full_cpu(smp_processor_id()) &&
            (!rcu_gp_in_progress() ||
             time_before(jiffies, READ_ONCE(rcu_state.gp_start) + HZ)))
                return true;
#endif /* #ifdef CONFIG_NO_HZ_FULL */
        return false;
}

Does that make sense?

> My understanding is that RCU already has a timeout to avoid stalling RCU.  I don't
> see what is gained by effectively duplicating that timeout for KVM.

The point is not to avoid stalling RCU. The point is to not perform RCU
core processing through the rcuc thread (because that interrupts execution
of the vCPU thread), if it is known that an extended quiescent state
will occur "soon" anyway (via VM-entry).

If the extended quiescent state does not occur in 1 second, then rcuc
will be woken up (the time_before call in rcu_nohz_full_cpu function 
above).

> Why not have
> KVM provide a "this task is in KVM_RUN" flag, and then let the existing timeout
> handle the (hopefully rare) case where KVM doesn't "immediately" re-enter the guest?

Do you mean something like:

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index d9642dd06c25..0ca5a6a45025 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3938,7 +3938,7 @@ static int rcu_pending(int user)
                return 1;
 
        /* Is this a nohz_full CPU in userspace or idle?  (Ignore RCU if so.) */
-       if ((user || rcu_is_cpu_rrupt_from_idle()) && rcu_nohz_full_cpu())
+       if ((user || rcu_is_cpu_rrupt_from_idle() || this_cpu->in_kvm_run) && rcu_nohz_full_cpu())
                return 0;
 
        /* Is the RCU core waiting for a quiescent state from this CPU? */

The problem is:

1) You should only set that flag, in the VM-entry path, after the point
where no use of RCU is made: close to the guest_state_enter_irqoff() call.

2) While handling a VM-exit, a host timer interrupt can occur before that,
or after the point where "this_cpu->in_kvm_run" is set to false.

And a host timer interrupt calls rcu_sched_clock_irq(), which is going to
wake up rcuc.

Or am I missing something?

Thanks.

> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 48f31dcd318a..be90d83d631a 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -477,6 +477,16 @@ static __always_inline void guest_state_enter_irqoff(void)
> >  	lockdep_hardirqs_on(CALLER_ADDR0);
> >  }
> >  
> > +DECLARE_PER_CPU(unsigned long, kvm_last_guest_exit);
> > +
> > +/*
> > + * Returns time (jiffies) for the last guest exit in current cpu
> > + */
> > +static inline unsigned long guest_exit_last_time(void)
> > +{
> > +	return this_cpu_read(kvm_last_guest_exit);
> > +}
> > +
> >  /*
> >   * Exit guest context and exit an RCU extended quiescent state.
> >   *
> > @@ -488,6 +498,9 @@ static __always_inline void guest_state_enter_irqoff(void)
> >  static __always_inline void guest_context_exit_irqoff(void)
> >  {
> >  	context_tracking_guest_exit();
> > +
> > +	/* Keeps track of last guest exit */
> > +	this_cpu_write(kvm_last_guest_exit, jiffies);
> >  }
> >  
> >  /*
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index fb49c2a60200..231d0e4d2cf1 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -110,6 +110,9 @@ static struct kmem_cache *kvm_vcpu_cache;
> >  static __read_mostly struct preempt_ops kvm_preempt_ops;
> >  static DEFINE_PER_CPU(struct kvm_vcpu *, kvm_running_vcpu);
> >  
> > +DEFINE_PER_CPU(unsigned long, kvm_last_guest_exit);
> > +EXPORT_SYMBOL_GPL(kvm_last_guest_exit);
> > +
> >  struct dentry *kvm_debugfs_dir;
> >  EXPORT_SYMBOL_GPL(kvm_debugfs_dir);
> >  
> > @@ -210,6 +213,7 @@ void vcpu_load(struct kvm_vcpu *vcpu)
> >  	int cpu = get_cpu();
> >  
> >  	__this_cpu_write(kvm_running_vcpu, vcpu);
> > +	__this_cpu_write(kvm_last_guest_exit, 0);
> >  	preempt_notifier_register(&vcpu->preempt_notifier);
> >  	kvm_arch_vcpu_load(vcpu, cpu);
> >  	put_cpu();
> > @@ -222,6 +226,7 @@ void vcpu_put(struct kvm_vcpu *vcpu)
> >  	kvm_arch_vcpu_put(vcpu);
> >  	preempt_notifier_unregister(&vcpu->preempt_notifier);
> >  	__this_cpu_write(kvm_running_vcpu, NULL);
> > +	__this_cpu_write(kvm_last_guest_exit, 0);
> >  	preempt_enable();
> >  }
> >  EXPORT_SYMBOL_GPL(vcpu_put);
> > 
> 
>
Sean Christopherson April 16, 2024, 2:07 p.m. UTC | #17
On Tue, Apr 16, 2024, Marcelo Tosatti wrote:
> On Mon, Apr 15, 2024 at 02:29:32PM -0700, Sean Christopherson wrote:
> > And snapshotting the VM-Exit time will get false negatives when the vCPU is about
> > to run, but for whatever reason has kvm_last_guest_exit=0, e.g. if a vCPU was
> > preempted and/or migrated to a different pCPU.
> 
> Right, for the use-case where waking up rcuc is a problem, the pCPU is
> isolated (there are no userspace processes and hopefully no kernel threads
> executing there), vCPU pinned to that pCPU.
> 
> So there should be no preemptions or migrations.

I understand that preemption/migration will not be problematic if the system is
configured "correctly", but we still need to play nice with other scenarios and/or
suboptimal setups.  While false positives aren't fatal, KVM still should do its
best to avoid them, especially when it's relatively easy to do so.

> > My understanding is that RCU already has a timeout to avoid stalling RCU.  I don't
> > see what is gained by effectively duplicating that timeout for KVM.
> 
> The point is not to avoid stalling RCU. The point is to not perform RCU
> core processing through rcuc thread (because that interrupts execution
> of the vCPU thread), if it is known that an extended quiescent state 
> will occur "soon" anyway (via VM-entry).

I know.  My point is that, as you note below, RCU will wake-up rcuc after 1 second
even if KVM is still reporting a VM-Enter is imminent, i.e. there's a 1 second
timeout to avoid an RCU stall to due to KVM never completing entry to the guest.

> If the extended quiescent state does not occur in 1 second, then rcuc
> will be woken up (the time_before call in rcu_nohz_full_cpu function 
> above).
> 
> > Why not have
> > KVM provide a "this task is in KVM_RUN" flag, and then let the existing timeout
> > handle the (hopefully rare) case where KVM doesn't "immediately" re-enter the guest?
> 
> Do you mean something like:
> 
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index d9642dd06c25..0ca5a6a45025 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -3938,7 +3938,7 @@ static int rcu_pending(int user)
>                 return 1;
>  
>         /* Is this a nohz_full CPU in userspace or idle?  (Ignore RCU if so.) */
> -       if ((user || rcu_is_cpu_rrupt_from_idle()) && rcu_nohz_full_cpu())
> +       if ((user || rcu_is_cpu_rrupt_from_idle() || this_cpu->in_kvm_run) && rcu_nohz_full_cpu())
>                 return 0;

Yes.  This, https://lore.kernel.org/all/ZhAN28BcMsfl4gm-@google.com, plus logic
in kvm_sched_{in,out}().

>         /* Is the RCU core waiting for a quiescent state from this CPU? */
> 
> The problem is:
> 
> 1) You should only set that flag, in the VM-entry path, after the point
> where no use of RCU is made: close to guest_state_enter_irqoff call.

Why?  As established above, KVM essentially has 1 second to enter the guest after
setting in_guest_run_loop (or whatever we call it).  In the vast majority of cases,
the time before KVM enters the guest can probably be measured in microseconds.

Snapshotting the exit time has the exact same problem of depending on KVM to
re-enter the guest soon-ish, so I don't understand why this would be considered
a problem with a flag to note the CPU is in KVM's run loop, but not with a
snapshot to say the CPU recently exited a KVM guest.

> 2) While handling a VM-exit, a host timer interrupt can occur before that,
> or after the point where "this_cpu->in_kvm_run" is set to false.
>
> And a host timer interrupt calls rcu_sched_clock_irq which is going to
> wake up rcuc.

If in_kvm_run is false when the IRQ is handled, then either KVM exited to userspace
or the vCPU was scheduled out.  In the former case, rcuc won't be woken up if the
CPU is in userspace.  And in the latter case, waking up rcuc is absolutely the
correct thing to do as VM-Enter is not imminent.

For exits to userspace, there would be a small window where an IRQ could arrive
between KVM putting the vCPU and the CPU actually returning to userspace, but
unless that's problematic in practice, I think it's a reasonable tradeoff.
Marcelo Tosatti April 17, 2024, 4:14 p.m. UTC | #18
On Tue, Apr 16, 2024 at 07:07:32AM -0700, Sean Christopherson wrote:
> On Tue, Apr 16, 2024, Marcelo Tosatti wrote:
> > On Mon, Apr 15, 2024 at 02:29:32PM -0700, Sean Christopherson wrote:
> > > And snapshotting the VM-Exit time will get false negatives when the vCPU is about
> > > to run, but for whatever reason has kvm_last_guest_exit=0, e.g. if a vCPU was
> > > preempted and/or migrated to a different pCPU.
> > 
> > Right, for the use-case where waking up rcuc is a problem, the pCPU is
> > isolated (there are no userspace processes and hopefully no kernel threads
> > executing there), vCPU pinned to that pCPU.
> > 
> > So there should be no preemptions or migrations.
> 
> I understand that preemption/migration will not be problematic if the system is
> configured "correctly", but we still need to play nice with other scenarios and/or
> suboptimal setups.  While false positives aren't fatal, KVM still should do its
> best to avoid them, especially when it's relatively easy to do so.

Sure.

> > > My understanding is that RCU already has a timeout to avoid stalling RCU.  I don't
> > > see what is gained by effectively duplicating that timeout for KVM.
> > 
> > The point is not to avoid stalling RCU. The point is to not perform RCU
> > core processing through rcuc thread (because that interrupts execution
> > of the vCPU thread), if it is known that an extended quiescent state 
> > will occur "soon" anyway (via VM-entry).
> 
> I know.  My point is that, as you note below, RCU will wake-up rcuc after 1 second
> even if KVM is still reporting a VM-Enter is imminent, i.e. there's a 1 second
> timeout to avoid an RCU stall to due to KVM never completing entry to the guest.

Right.

So a reply to the sentence:

"My understanding is that RCU already has a timeout to avoid stalling RCU.  I don't
 see what is gained by effectively duplicating that timeout for KVM."

is that the current RCU timeout is not functional for KVM VM entries, and
therefore it needs modification.

> > If the extended quiescent state does not occur in 1 second, then rcuc
> > will be woken up (the time_before call in rcu_nohz_full_cpu function 
> > above).
> > 
> > > Why not have
> > > KVM provide a "this task is in KVM_RUN" flag, and then let the existing timeout
> > > handle the (hopefully rare) case where KVM doesn't "immediately" re-enter the guest?
> > 
> > Do you mean something like:
> > 
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index d9642dd06c25..0ca5a6a45025 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -3938,7 +3938,7 @@ static int rcu_pending(int user)
> >                 return 1;
> >  
> >         /* Is this a nohz_full CPU in userspace or idle?  (Ignore RCU if so.) */
> > -       if ((user || rcu_is_cpu_rrupt_from_idle()) && rcu_nohz_full_cpu())
> > +       if ((user || rcu_is_cpu_rrupt_from_idle() || this_cpu->in_kvm_run) && rcu_nohz_full_cpu())
> >                 return 0;
> 
> Yes.  This, https://lore.kernel.org/all/ZhAN28BcMsfl4gm-@google.com, plus logic
> in kvm_sched_{in,out}().

Question: where is vcpu->wants_to_run set? (or, where is the full series
again?).

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index bfb2b52a1416..5a7efc669a0f 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -209,6 +209,9 @@ void vcpu_load(struct kvm_vcpu *vcpu)
 {
 	int cpu = get_cpu();
 
+	if (vcpu->wants_to_run)
+		context_tracking_guest_start_run_loop();
+
 	__this_cpu_write(kvm_running_vcpu, vcpu);
 	preempt_notifier_register(&vcpu->preempt_notifier);
 	kvm_arch_vcpu_load(vcpu, cpu);
@@ -222,6 +225,10 @@ void vcpu_put(struct kvm_vcpu *vcpu)
 	kvm_arch_vcpu_put(vcpu);
 	preempt_notifier_unregister(&vcpu->preempt_notifier);
 	__this_cpu_write(kvm_running_vcpu, NULL);
+
+	if (vcpu->wants_to_run)
+		context_tracking_guest_stop_run_loop();
+
 	preempt_enable();
 }
 EXPORT_SYMBOL_GPL(vcpu_put);

A little worried about guest HLT:

/**
 * rcu_is_cpu_rrupt_from_idle - see if 'interrupted' from idle
 *
 * If the current CPU is idle and running at a first-level (not nested)
 * interrupt, or directly, from idle, return true.
 *
 * The caller must have at least disabled IRQs.
 */
static int rcu_is_cpu_rrupt_from_idle(void)
{
        long nesting;

        /*
         * Usually called from the tick; but also used from smp_function_call()
         * for expedited grace periods. This latter can result in running from
         * the idle task, instead of an actual IPI.
         */
	...

        /* Does CPU appear to be idle from an RCU standpoint? */
        return ct_dynticks_nesting() == 0;
}

static __always_inline void ct_cpuidle_enter(void)
{
        lockdep_assert_irqs_disabled();
        /*
         * Idle is allowed to (temporary) enable IRQs. It
         * will return with IRQs disabled.
         *
         * Trace IRQs enable here, then switch off RCU, and have
         * arch_cpu_idle() use raw_local_irq_enable(). Note that
         * ct_idle_enter() relies on lockdep IRQ state, so switch that
         * last -- this is very similar to the entry code.
         */
        trace_hardirqs_on_prepare();
        lockdep_hardirqs_on_prepare();
        instrumentation_end();
        ct_idle_enter();
        lockdep_hardirqs_on(_RET_IP_);
}

So for guest HLT emulation, there is a window between

kvm_vcpu_block -> fire_sched_out_preempt_notifiers -> vcpu_put 
and
the idle's task call to ct_cpuidle_enter, where 

ct_dynticks_nesting() != 0 and vcpu_put has already executed.

Even for idle=poll, the race exists.

> >         /* Is the RCU core waiting for a quiescent state from this CPU? */
> > 
> > The problem is:
> > 
> > 1) You should only set that flag, in the VM-entry path, after the point
> > where no use of RCU is made: close to guest_state_enter_irqoff call.
> 
> Why?  As established above, KVM essentially has 1 second to enter the guest after
> setting in_guest_run_loop (or whatever we call it).  In the vast majority of cases,
> the time before KVM enters the guest can probably be measured in microseconds.

OK.

> Snapshotting the exit time has the exact same problem of depending on KVM to
> re-enter the guest soon-ish, so I don't understand why this would be considered
> a problem with a flag to note the CPU is in KVM's run loop, but not with a
> snapshot to say the CPU recently exited a KVM guest.

See the race above.

> > 2) While handling a VM-exit, a host timer interrupt can occur before that,
> > or after the point where "this_cpu->in_kvm_run" is set to false.
> >
> > And a host timer interrupt calls rcu_sched_clock_irq which is going to
> > wake up rcuc.
> 
> If in_kvm_run is false when the IRQ is handled, then either KVM exited to userspace
> or the vCPU was scheduled out.  In the former case, rcuc won't be woken up if the
> CPU is in userspace.  And in the latter case, waking up rcuc is absolutely the
> correct thing to do as VM-Enter is not imminent.
> 
> For exits to userspace, there would be a small window where an IRQ could arrive
> between KVM putting the vCPU and the CPU actually returning to userspace, but
> unless that's problematic in practice, I think it's a reasonable tradeoff.

OK, your proposal looks alright except for these races.

We don't want those races to occur in production (and they likely will).

Is there any way to fix the races? Perhaps cmpxchg?
Sean Christopherson April 17, 2024, 5:22 p.m. UTC | #19
On Wed, Apr 17, 2024, Marcelo Tosatti wrote:
> On Tue, Apr 16, 2024 at 07:07:32AM -0700, Sean Christopherson wrote:
> > On Tue, Apr 16, 2024, Marcelo Tosatti wrote:
> > > > Why not have
> > > > KVM provide a "this task is in KVM_RUN" flag, and then let the existing timeout
> > > > handle the (hopefully rare) case where KVM doesn't "immediately" re-enter the guest?
> > > 
> > > Do you mean something like:
> > > 
> > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > index d9642dd06c25..0ca5a6a45025 100644
> > > --- a/kernel/rcu/tree.c
> > > +++ b/kernel/rcu/tree.c
> > > @@ -3938,7 +3938,7 @@ static int rcu_pending(int user)
> > >                 return 1;
> > >  
> > >         /* Is this a nohz_full CPU in userspace or idle?  (Ignore RCU if so.) */
> > > -       if ((user || rcu_is_cpu_rrupt_from_idle()) && rcu_nohz_full_cpu())
> > > +       if ((user || rcu_is_cpu_rrupt_from_idle() || this_cpu->in_kvm_run) && rcu_nohz_full_cpu())
> > >                 return 0;
> > 
> > Yes.  This, https://lore.kernel.org/all/ZhAN28BcMsfl4gm-@google.com, plus logic
> > in kvm_sched_{in,out}().
> 
> Question: where is vcpu->wants_to_run set? (or, where is the full series
> again?).

Precisely around the call to kvm_arch_vcpu_ioctl_run().  I am planning on applying
the patch that introduces the code for 6.10[*], I just haven't yet for a variety
of reasons.

[*] https://lore.kernel.org/all/20240307163541.92138-1-dmatlack@google.com

> So for guest HLT emulation, there is a window between
> 
> kvm_vcpu_block -> fire_sched_out_preempt_notifiers -> vcpu_put 
> and the idle's task call to ct_cpuidle_enter, where 
> 
> ct_dynticks_nesting() != 0 and vcpu_put has already executed.
> 
> Even for idle=poll, the race exists.

Is waking rcuc actually problematic?  I agree it's not ideal, but it's a smallish
window, i.e. is unlikely to happen frequently, and if rcuc is awakened, it will
effectively steal cycles from the idle thread, not the vCPU thread.  If the vCPU
gets a wake event before rcuc completes, then the vCPU could experience jitter,
but that could also happen if the CPU ends up in a deep C-state.

And that race exists in general, i.e. any IRQ that arrives just as the idle task
is being scheduled in will unnecessarily wakeup rcuc.

> > >         /* Is the RCU core waiting for a quiescent state from this CPU? */
> > > 
> > > The problem is:
> > > 
> > > 1) You should only set that flag, in the VM-entry path, after the point
> > > where no use of RCU is made: close to guest_state_enter_irqoff call.
> > 
> > Why?  As established above, KVM essentially has 1 second to enter the guest after
> > setting in_guest_run_loop (or whatever we call it).  In the vast majority of cases,
> > the time before KVM enters the guest can probably be measured in microseconds.
> 
> OK.
> 
> > Snapshotting the exit time has the exact same problem of depending on KVM to
> > re-enter the guest soon-ish, so I don't understand why this would be considered
> > a problem with a flag to note the CPU is in KVM's run loop, but not with a
> > snapshot to say the CPU recently exited a KVM guest.
> 
> See the race above.

Ya, but if kvm_last_guest_exit is zeroed in kvm_sched_out(), then the snapshot
approach ends up with the same race.  And not zeroing kvm_last_guest_exit is
arguably much more problematic as encountering a false positive doesn't require
hitting a small window.

> > > 2) While handling a VM-exit, a host timer interrupt can occur before that,
> > > or after the point where "this_cpu->in_kvm_run" is set to false.
> > >
> > > And a host timer interrupt calls rcu_sched_clock_irq which is going to
> > > wake up rcuc.
> > 
> > If in_kvm_run is false when the IRQ is handled, then either KVM exited to userspace
> > or the vCPU was scheduled out.  In the former case, rcuc won't be woken up if the
> > CPU is in userspace.  And in the latter case, waking up rcuc is absolutely the
> > correct thing to do as VM-Enter is not imminent.
> > 
> > For exits to userspace, there would be a small window where an IRQ could arrive
> > between KVM putting the vCPU and the CPU actually returning to userspace, but
> > unless that's problematic in practice, I think it's a reasonable tradeoff.
> 
> OK, your proposal looks alright except these races.
> 
> We don't want those races to occur in production (and they likely will).
> 
> Is there any way to fix the races? Perhaps cmpxchg?

I don't think an atomic switch from the vCPU task to the idle task is feasible,
e.g. KVM would somehow have to know that the idle task is going to run next.
This seems like something that needs a generic solution, e.g. to prevent waking
rcuc if the idle task is in the process of being scheduled in.
Leonardo Bras May 3, 2024, 6:42 p.m. UTC | #20
Hello Sean, Marcelo and Paul,

Thank you for your comments on this thread!
I will try to reply some of the questions below:

(Sorry for the delay, I was OOO for a while.)


On Mon, Apr 01, 2024 at 01:21:25PM -0700, Sean Christopherson wrote:
> On Thu, Mar 28, 2024, Leonardo Bras wrote:
> > I am dealing with a latency issue inside a KVM guest, which is caused by
> > a sched_switch to rcuc[1].
> > 
> > During guest entry, kernel code will signal to RCU that current CPU was on
> > a quiescent state, making sure no other CPU is waiting for this one.
> > 
> > If a vcpu just stopped running (guest_exit), and a syncronize_rcu() was
> > issued somewhere since guest entry, there is a chance a timer interrupt
> > will happen in that CPU, which will cause rcu_sched_clock_irq() to run.
> > 
> > rcu_sched_clock_irq() will check rcu_pending() which will return true,
> > and cause invoke_rcu_core() to be called, which will (in current config)
> > cause rcuc/N to be scheduled into the current cpu.
> > 
> > On rcu_pending(), I noticed we can avoid returning true (and thus invoking
> > rcu_core()) if the current cpu is nohz_full, and the cpu came from either
> > idle or userspace, since both are considered quiescent states.
> > 
> > Since this is also true to guest context, my idea to solve this latency
> > issue by avoiding rcu_core() invocation if it was running a guest vcpu.
> > 
> > On the other hand, I could not find a way of reliably saying the current
> > cpu was running a guest vcpu, so patch #1 implements a per-cpu variable
> > for keeping the time (jiffies) of the last guest exit.
> > 
> > In patch #2 I compare current time to that time, and if less than a second
> > has past, we just skip rcu_core() invocation, since there is a high chance
> > it will just go back to the guest in a moment.
> 
> What's the downside if there's a false positive?

A false positive being a guest_exit without this CPU going back into the
guest, right?
If so, in the worst-case scenario, supposing no quiescent state happens
and there is a pending request, RCU will take a whole second to run again,
possibly making other CPUs wait this long for a synchronize_rcu().

This value (1 second) could be defined in .config or as a parameter if
needed, but it does not seem like a big deal.

> 
> > What I know it's weird with this patch:
> > 1 - Not sure if this is the best way of finding out if the cpu was
> >     running a guest recently.
> > 
> > 2 - This per-cpu variable needs to get set at each guest_exit(), so it's
> >     overhead, even though it's supposed to be in local cache. If that's
> >     an issue, I would suggest having this part compiled out on 
> >     !CONFIG_NO_HZ_FULL, but further checking each cpu for being nohz_full
> >     enabled seems more expensive than just setting this out.
> 
> A per-CPU write isn't problematic, but I suspect reading jiffies will be quite
> imprecise, e.g. it'll be a full tick "behind" on many exits.

That would not be a problem, as it would mean 1 tick less waiting in the
worst-case false positive, and the 1s amount is plenty.

> 
> > 3 - It checks if the guest exit happened over than 1 second ago. This 1
> >     second value was copied from rcu_nohz_full_cpu() which checks if the
> >     grace period started over than a second ago. If this value is bad,
> >     I have no issue changing it.
> 
> IMO, checking if a CPU "recently" ran a KVM vCPU is a suboptimal heuristic regardless
> of what magic time threshold is used.  IIUC, what you want is a way to detect if
> a CPU is likely to _run_ a KVM vCPU in the near future.

That's correct!

>  KVM can provide that
> information with much better precision, e.g. KVM knows when when it's in the core
> vCPU run loop.

That would not be enough.
I need to present the application/problem to make a point:

- There are multiple isolated physical CPUs (nohz_full) on which we want to
  run KVM_RT vcpus, which will be running a real-time (low latency) task.
- This task should not miss deadlines (RT), so we test the VM to make sure
  the maximum latency on a long run does not exceed the latency requirement.
- This vcpu will run with SCHED_FIFO, but has to run at a lower priority
  than rcuc, so we can avoid stalling other cpus.
- There may be some scenarios where the vcpu will go back to userspace
  (from the KVM_RUN ioctl), and that does not mean it's good to interrupt
  this CPU to run other stuff (like rcuc).

Now, I understand it will cover most of our issues if we have context
tracking around the vcpu_run loop, since we can use that to decide not to
run rcuc on the cpu if the interruption happened inside the loop.

But IIUC we can have a thread that "just got out of the loop" getting
interrupted by the timer, and asked to run rcu_core(), which will be bad
for latency.

I understand that the chance may be statistically low, but happening once
may be enough to crush the latency numbers.

Now, I can't think of a place to put these context trackers in KVM code
that would avoid the chance of rcuc running improperly; that's why I
suggested the timeout, even though it's ugly.

About the false positives, IIUC we could reduce them if we reset the
per-cpu last_guest_exit on kvm_put.

> 
> > 4 - Even though I could detect no issue, I included linux/kvm_host.h into 
> >     rcu/tree_plugin.h, which is the first time it's getting included
> >     outside of kvm or arch code, and can be weird.
> 
> Heh, kvm_host.h isn't included outside of KVM because several architectures can
> build KVM as a module, which means referencing global KVM variables from the kernel
> proper won't work.
> 
> >     An alternative would be to create a new header for providing data for
> >     non-kvm code.
> 
> I doubt a new .h or .c file is needed just for this, there's gotta be a decent
> landing spot for a one-off variable.

You are probably right

>  E.g. I wouldn't be at all surprised if there
> is additional usefulness in knowing if a CPU is in KVM's core run loop and thus
> likely to do a VM-Enter in the near future, at which point you could probably make
> a good argument for adding a flag in "struct context_tracking".  Even without a
> separate use case, there's a good argument for adding that info to context_tracking.

For the tracking solution, makes sense :)
Not sure if the 'timeout' alternative will be that useful outside rcu.

Thanks!
Leo
Leonardo Bras May 3, 2024, 7:09 p.m. UTC | #21
On Fri, May 03, 2024 at 03:42:38PM -0300, Leonardo Bras wrote:
> Hello Sean, Marcelo and Paul,
> 
> Thank you for your comments on this thread!
> I will try to reply some of the questions below:
> 
> (Sorry for the delay, I was OOO for a while.)
> 
> 
> On Mon, Apr 01, 2024 at 01:21:25PM -0700, Sean Christopherson wrote:
> > On Thu, Mar 28, 2024, Leonardo Bras wrote:
> > > I am dealing with a latency issue inside a KVM guest, which is caused by
> > > a sched_switch to rcuc[1].
> > > 
> > > During guest entry, kernel code will signal to RCU that current CPU was on
> > > a quiescent state, making sure no other CPU is waiting for this one.
> > > 
> > > If a vcpu just stopped running (guest_exit), and a synchronize_rcu() was
> > > issued somewhere since guest entry, there is a chance a timer interrupt
> > > will happen in that CPU, which will cause rcu_sched_clock_irq() to run.
> > > 
> > > rcu_sched_clock_irq() will check rcu_pending() which will return true,
> > > and cause invoke_rcu_core() to be called, which will (in current config)
> > > cause rcuc/N to be scheduled into the current cpu.
> > > 
> > > On rcu_pending(), I noticed we can avoid returning true (and thus invoking
> > > rcu_core()) if the current cpu is nohz_full, and the cpu came from either
> > > idle or userspace, since both are considered quiescent states.
> > > 
> > > Since this is also true for guest context, my idea is to solve this latency
> > > issue by avoiding rcu_core() invocation if the cpu was running a guest vcpu.
> > > 
> > > On the other hand, I could not find a way of reliably saying the current
> > > cpu was running a guest vcpu, so patch #1 implements a per-cpu variable
> > > for keeping the time (jiffies) of the last guest exit.
> > > 
> > > In patch #2 I compare current time to that time, and if less than a second
> > > has passed, we just skip rcu_core() invocation, since there is a high chance
> > > it will just go back to the guest in a moment.
> > 
> > What's the downside if there's a false positive?
> 
> False positive meaning a guest_exit without going back into the guest on
> this CPU, right?
> If so, in the worst-case scenario, supposing no quiescent state happens and
> there is a pending request, RCU will take a whole second to run again,
> possibly making other CPUs wait this long for a synchronize_rcu().

Just to make sure it's clear:
It will wait at most 1 second, if the grace period was requested just 
before the last_guest_exit update. It will never make the grace period 
be longer than the already defined 1 second. 

That's because in the timer interrupt we have:

	if (rcu_pending())
		invoke_rcu_core();

and on rcu_pending():

	if ((user || rcu_is_cpu_rrupt_from_idle() || rcu_recent_guest_exit()) &&
	    rcu_nohz_full_cpu())
		return 0;

Meaning that even if we allow 5 seconds after recent_guest_exit, it will 
only make rcu_nohz_full_cpu() run, and it will check if the grace period is 
younger than 1 second before skipping the rcu_core() invocation.



> 
> This value (1 second) could be defined in .config or as a parameter if 
> needed, but it does not seem a big deal.
> 
> > 
> > > What I know it's weird with this patch:
> > > 1 - Not sure if this is the best way of finding out if the cpu was
> > >     running a guest recently.
> > > 
> > > 2 - This per-cpu variable needs to get set at each guest_exit(), so it's
> > >     overhead, even though it's supposed to be in local cache. If that's
> > >     an issue, I would suggest having this part compiled out on 
> > >     !CONFIG_NO_HZ_FULL, but further checking each cpu for being nohz_full
> > >     enabled seems more expensive than just setting this out.
> > 
> > A per-CPU write isn't problematic, but I suspect reading jiffies will be quite
> > imprecise, e.g. it'll be a full tick "behind" on many exits.
> 
> That would not be a problem, as it would mean 1 tick less waiting in the 
> false positive WSC, and the 1s amount is plenty.

s/less/more/

> 
> > 
> > > 3 - It checks if the guest exit happened over than 1 second ago. This 1
> > >     second value was copied from rcu_nohz_full_cpu() which checks if the
> > >     grace period started over than a second ago. If this value is bad,
> > >     I have no issue changing it.
> > 
> > IMO, checking if a CPU "recently" ran a KVM vCPU is a suboptimal heuristic regardless
> > of what magic time threshold is used.  IIUC, what you want is a way to detect if
> > a CPU is likely to _run_ a KVM vCPU in the near future.
> 
> That's correct!
> 
> >  KVM can provide that
> > information with much better precision, e.g. KVM knows when it's in the core
> > vCPU run loop.
> 
> That would not be enough.
> I need to present the application/problem to make a point:
> 
> - There are multiple isolated physical CPUs (nohz_full) on which we want to 
>   run KVM_RT vcpus, which will be running a real-time (low latency) task.
> - This task should not miss deadlines (RT), so we test the VM to make sure 
>   the maximum latency on a long run does not exceed the latency requirement
> - This vcpu will run on SCHED_FIFO, but has to run on lower priority than
>   rcuc, so we can avoid stalling other cpus.
> - There may be some scenarios where the vcpu will go back to userspace
>   (from KVM_RUN ioctl), and that does not mean it's a good time to interrupt
>   this cpu to run other stuff (like rcuc).
> 
> Now, I understand it will cover most of our issues if we have context 
> tracking around the vcpu_run loop, since we can use that to decide not to 
> run rcuc on the cpu if the interruption happened inside the loop.
> 
> But IIUC we can have a thread that "just got out of the loop" getting 
> interrupted by the timer, and asked to run rcu_core which will be bad for 
> latency.
> 
> I understand that the chance may be statistically low, but happening once 
> may be enough to crush the latency numbers.
> 
> Now, I can't think of a place to put these context trackers in kvm code that 
> would avoid the chance of rcuc running improperly; that's why I suggested the 
> timeout, even though it's ugly.
> 
> About the false-positive, IIUC we could reduce it if we reset the per-cpu 
> last_guest_exit on kvm_put.
> 
> > 
> > > 4 - Even though I could detect no issue, I included linux/kvm_host.h into 
> > >     rcu/tree_plugin.h, which is the first time it's getting included
> > >     outside of kvm or arch code, and can be weird.
> > 
> > Heh, kvm_host.h isn't included outside of KVM because several architectures can
> > build KVM as a module, which means referencing global KVM variables from the kernel
> > proper won't work.
> > 
> > >     An alternative would be to create a new header for providing data for
> > >     non-kvm code.
> > 
> > I doubt a new .h or .c file is needed just for this, there's gotta be a decent
> > landing spot for a one-off variable.
> 
> You are probably right
> 
> >  E.g. I wouldn't be at all surprised if there
> > is additional usefulness in knowing if a CPU is in KVM's core run loop and thus
> > likely to do a VM-Enter in the near future, at which point you could probably make
> > a good argument for adding a flag in "struct context_tracking".  Even without a
> > separate use case, there's a good argument for adding that info to context_tracking.
> 
> For the tracking solution, makes sense :)
> Not sure if the 'timeout' alternative will be that useful outside rcu.
> 
> Thanks!
> Leo
Leonardo Bras May 3, 2024, 8:44 p.m. UTC | #22
On Wed, Apr 17, 2024 at 10:22:18AM -0700, Sean Christopherson wrote:
> On Wed, Apr 17, 2024, Marcelo Tosatti wrote:
> > On Tue, Apr 16, 2024 at 07:07:32AM -0700, Sean Christopherson wrote:
> > > On Tue, Apr 16, 2024, Marcelo Tosatti wrote:
> > > > > Why not have
> > > > > KVM provide a "this task is in KVM_RUN" flag, and then let the existing timeout
> > > > > handle the (hopefully rare) case where KVM doesn't "immediately" re-enter the guest?
> > > > 
> > > > Do you mean something like:
> > > > 
> > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > > index d9642dd06c25..0ca5a6a45025 100644
> > > > --- a/kernel/rcu/tree.c
> > > > +++ b/kernel/rcu/tree.c
> > > > @@ -3938,7 +3938,7 @@ static int rcu_pending(int user)
> > > >                 return 1;
> > > >  
> > > >         /* Is this a nohz_full CPU in userspace or idle?  (Ignore RCU if so.) */
> > > > -       if ((user || rcu_is_cpu_rrupt_from_idle()) && rcu_nohz_full_cpu())
> > > > +       if ((user || rcu_is_cpu_rrupt_from_idle() || this_cpu->in_kvm_run) && rcu_nohz_full_cpu())
> > > >                 return 0;
> > > 
> > > Yes.  This, https://lore.kernel.org/all/ZhAN28BcMsfl4gm-@google.com, plus logic
> > > in kvm_sched_{in,out}().
> > 
> > Question: where is vcpu->wants_to_run set? (or, where is the full series
> > again?).
> 
> Precisely around the call to kvm_arch_vcpu_ioctl_run().  I am planning on applying
> the patch that introduces the code for 6.10[*], I just haven't yet for a variety
> of reasons.
> 
> [*] https://lore.kernel.org/all/20240307163541.92138-1-dmatlack@google.com
> 
> > So for guest HLT emulation, there is a window between
> > 
> > kvm_vcpu_block -> fire_sched_out_preempt_notifiers -> vcpu_put 
> > and the idle's task call to ct_cpuidle_enter, where 
> > 
> > ct_dynticks_nesting() != 0 and vcpu_put has already executed.
> > 
> > Even for idle=poll, the race exists.
> 
> Is waking rcuc actually problematic?

Yeah, it may introduce a lot (30us) of latency in some cases, causing a 
missed deadline.

When dealing with RT tasks, missing a deadline can be really bad, so we 
need to make sure it will happen as rarely as possible.

>  I agree it's not ideal, but it's a smallish
> window, i.e. is unlikely to happen frequently, and if rcuc is awakened, it will
> effectively steal cycles from the idle thread, not the vCPU thread.

It would be fine, but sometimes the idle thread will run very briefly, and 
stealing microseconds from it will still steal enough time from the vcpu 
thread to become a problem.

>  If the vCPU
> gets a wake event before rcuc completes, then the vCPU could experience jitter,
> but that could also happen if the CPU ends up in a deep C-state.

IIUC, if the scenario calls for a very short HLT, which is kind of usual, 
then the CPU will not get into a deep C-state. 
For the scenarios where a longer HLT happens, it would be fine.

> 
> And that race exists in general, i.e. any IRQ that arrives just as the idle task
> is being scheduled in will unnecessarily wakeup rcuc.

That's a race that could be solved with the timeout (snapshot) solution, if 
we don't zero last_guest_exit on kvm_sched_out(), right?

> 
> > > >         /* Is the RCU core waiting for a quiescent state from this CPU? */
> > > > 
> > > > The problem is:
> > > > 
> > > > 1) You should only set that flag, in the VM-entry path, after the point
> > > > where no use of RCU is made: close to guest_state_enter_irqoff call.
> > > 
> > > Why?  As established above, KVM essentially has 1 second to enter the guest after
> > > setting in_guest_run_loop (or whatever we call it).  In the vast majority of cases,
> > > the time before KVM enters the guest can probably be measured in microseconds.
> > 
> > OK.
> > 
> > > Snapshotting the exit time has the exact same problem of depending on KVM to
> > > re-enter the guest soon-ish, so I don't understand why this would be considered
> > > a problem with a flag to note the CPU is in KVM's run loop, but not with a
> > > snapshot to say the CPU recently exited a KVM guest.
> > 
> > See the race above.
> 
> Ya, but if kvm_last_guest_exit is zeroed in kvm_sched_out(), then the snapshot
> approach ends up with the same race.  And not zeroing kvm_last_guest_exit is
> arguably much more problematic as encountering a false positive doesn't require
> hitting a small window.

For the false positive (only on nohz_full), the maximum delay for rcu_core() 
to run would be 1s, and that would only happen if we don't schedule out to 
some userspace task or the idle thread, in which case we have a quiescent 
state without needing rcu_core().

Now, for it to be neither a userspace nor the idle thread, it would need to 
be one or more kernel threads, which I suppose aren't usually many, and 
don't usually take that long to complete, considering we are running on an 
isolated (nohz_full) cpu.

So, for the kvm_sched_out() case, I don't actually think we are  
statistically introducing that much of a delay in the RCU mechanism.

(I may be missing some point, though)

Thanks!
Leo

> 
> > > > 2) While handling a VM-exit, a host timer interrupt can occur before that,
> > > > or after the point where "this_cpu->in_kvm_run" is set to false.
> > > >
> > > > And a host timer interrupt calls rcu_sched_clock_irq which is going to
> > > > wake up rcuc.
> > > 
> > > If in_kvm_run is false when the IRQ is handled, then either KVM exited to userspace
> > > or the vCPU was scheduled out.  In the former case, rcuc won't be woken up if the
> > > CPU is in userspace.  And in the latter case, waking up rcuc is absolutely the
> > > correct thing to do as VM-Enter is not imminent.
> > > 
> > > For exits to userspace, there would be a small window where an IRQ could arrive
> > > between KVM putting the vCPU and the CPU actually returning to userspace, but
> > > unless that's problematic in practice, I think it's a reasonable tradeoff.
> > 
> > OK, your proposal looks alright except these races.
> > 
> > We don't want those races to occur in production (and they likely will).
> > 
> > Is there any way to fix the races? Perhaps cmpxchg?
> 
> I don't think an atomic switch from the vCPU task to the idle task is feasible,
> e.g. KVM would somehow have to know that the idle task is going to run next.
> This seems like something that needs a generic solution, e.g. to prevent waking
> rcuc if the idle task is in the process of being scheduled in.
>
Sean Christopherson May 3, 2024, 9:29 p.m. UTC | #23
On Fri, May 03, 2024, Leonardo Bras wrote:
> > KVM can provide that information with much better precision, e.g. KVM
> > knows when it's in the core vCPU run loop.
> 
> That would not be enough.
> I need to present the application/problem to make a point:
> 
> - There are multiple isolated physical CPUs (nohz_full) on which we want to 
>   run KVM_RT vcpus, which will be running a real-time (low latency) task.
> - This task should not miss deadlines (RT), so we test the VM to make sure 
>   the maximum latency on a long run does not exceed the latency requirement
> - This vcpu will run on SCHED_FIFO, but has to run on lower priority than
>   rcuc, so we can avoid stalling other cpus.
> - There may be some scenarios where the vcpu will go back to userspace
>   (from KVM_RUN ioctl), and that does not mean it's a good time to interrupt
>   this cpu to run other stuff (like rcuc).
>
> Now, I understand it will cover most of our issues if we have context 
> tracking around the vcpu_run loop, since we can use that to decide not to 
> run rcuc on the cpu if the interruption happened inside the loop.
> 
> But IIUC we can have a thread that "just got out of the loop" getting 
> interrupted by the timer, and asked to run rcu_core which will be bad for 
> latency.
> 
> I understand that the chance may be statistically low, but happening once 
> may be enough to crush the latency numbers.
> 
> Now, I can't think of a place to put these context trackers in kvm code that 
> would avoid the chance of rcuc running improperly; that's why I suggested the 
> timeout, even though it's ugly.
> 
> About the false-positive, IIUC we could reduce it if we reset the per-cpu 
> last_guest_exit on kvm_put.

Which then opens up the window that you're trying to avoid (IRQ arriving just
after the vCPU is put, before the CPU exits to userspace).

If you want the "entry to guest is imminent" status to be preserved across an exit
to userspace, then it seems like the flag really should be a property of the task,
not a property of the physical CPU.  Similar to how rcu_is_cpu_rrupt_from_idle()
detects that an idle task was interrupted, the goal is to detect if a vCPU task
was interrupted.

PF_VCPU is already "taken" for similar tracking, but if we want to track "this
task will soon enter an extended quiescent state", I don't see any reason to make
it specific to vCPU tasks.  Unless the kernel/KVM dynamically manages the flag,
which as above will create windows for false negatives, the kernel needs to
trust userspace to a certain extent no matter what.  E.g. even if KVM sets a
PF_xxx flag on the first KVM_RUN, nothing would prevent userspace from calling
into KVM to get KVM to set the flag, and then doing something else entirely with
the task.

So if we're comfortable relying on the 1 second timeout to guard against a
misbehaving userspace, IMO we might as well fully rely on that guardrail.  I.e.
add a generic PF_xxx flag (or whatever flag location is most appropriate) to let
userspace communicate to the kernel that it's a real-time task that spends the
overwhelming majority of its time in userspace or guest context, i.e. should be
given extra leniency with respect to rcuc if the task happens to be interrupted
while it's in kernel context.
Leonardo Bras May 3, 2024, 10 p.m. UTC | #24
On Fri, May 03, 2024 at 02:29:57PM -0700, Sean Christopherson wrote:
> On Fri, May 03, 2024, Leonardo Bras wrote:
> > > KVM can provide that information with much better precision, e.g. KVM
> > > knows when it's in the core vCPU run loop.
> > 
> > That would not be enough.
> > I need to present the application/problem to make a point:
> > 
> > - There are multiple isolated physical CPUs (nohz_full) on which we want to 
> >   run KVM_RT vcpus, which will be running a real-time (low latency) task.
> > - This task should not miss deadlines (RT), so we test the VM to make sure 
> >   the maximum latency on a long run does not exceed the latency requirement
> > - This vcpu will run on SCHED_FIFO, but has to run on lower priority than
> >   rcuc, so we can avoid stalling other cpus.
> > - There may be some scenarios where the vcpu will go back to userspace
> >   (from KVM_RUN ioctl), and that does not mean it's a good time to interrupt
> >   this cpu to run other stuff (like rcuc).
> >
> > Now, I understand it will cover most of our issues if we have context 
> > tracking around the vcpu_run loop, since we can use that to decide not to 
> > run rcuc on the cpu if the interruption happened inside the loop.
> > 
> > But IIUC we can have a thread that "just got out of the loop" getting 
> > interrupted by the timer, and asked to run rcu_core which will be bad for 
> > latency.
> > 
> > I understand that the chance may be statistically low, but happening once 
> > may be enough to crush the latency numbers.
> > 
> > Now, I can't think of a place to put these context trackers in kvm code that 
> > would avoid the chance of rcuc running improperly; that's why I suggested the 
> > timeout, even though it's ugly.
> > 
> > About the false-positive, IIUC we could reduce it if we reset the per-cpu 
> > last_guest_exit on kvm_put.
> 
> Which then opens up the window that you're trying to avoid (IRQ arriving just
> after the vCPU is put, before the CPU exits to userspace).
> 
> If you want the "entry to guest is imminent" status to be preserved across an exit
> to userspace, then it seems like the flag really should be a property of the task,
> not a property of the physical CPU.  Similar to how rcu_is_cpu_rrupt_from_idle()
> detects that an idle task was interrupted, the goal is to detect if a vCPU task
> was interrupted.
> 
> PF_VCPU is already "taken" for similar tracking, but if we want to track "this
> task will soon enter an extended quiescent state", I don't see any reason to make
> it specific to vCPU tasks.  Unless the kernel/KVM dynamically manages the flag,
> which as above will create windows for false negatives, the kernel needs to
> trust userspace to a certain extent no matter what.  E.g. even if KVM sets a
> PF_xxx flag on the first KVM_RUN, nothing would prevent userspace from calling
> into KVM to get KVM to set the flag, and then doing something else entirely with
> the task.
> 
> So if we're comfortable relying on the 1 second timeout to guard against a
> misbehaving userspace, IMO we might as well fully rely on that guardrail.  I.e.
> add a generic PF_xxx flag (or whatever flag location is most appropriate) to let
> userspace communicate to the kernel that it's a real-time task that spends the
> overwhelming majority of its time in userspace or guest context, i.e. should be
> given extra leniency with respect to rcuc if the task happens to be interrupted
> while it's in kernel context.
> 


I think I understand what you propose here.

But I am not sure what would happen in this case:

- RT guest task calls short HLT
- Host schedule another kernel thread (other task)
- Timer interruption, rcu_pending() will check the task, which is not set 
  with the above flag.
- rcuc runs, introducing latency
- Goes back to previous kernel thread, finishes running with rcuc latency
- Goes back to vcpu thread

Isn't there any chance that, on a short guest HLT, the latency previously 
introduced by rcuc preempting another kernel thread ends up introducing 
latency to the RT task running in the vcpu?

Thanks!
Leo



Paul E. McKenney May 3, 2024, 10 p.m. UTC | #25
On Fri, May 03, 2024 at 02:29:57PM -0700, Sean Christopherson wrote:
> On Fri, May 03, 2024, Leonardo Bras wrote:
> > > KVM can provide that information with much better precision, e.g. KVM
> > > knows when it's in the core vCPU run loop.
> > 
> > That would not be enough.
> > I need to present the application/problem to make a point:
> > 
> > - There are multiple isolated physical CPUs (nohz_full) on which we want to 
> >   run KVM_RT vcpus, which will be running a real-time (low latency) task.
> > - This task should not miss deadlines (RT), so we test the VM to make sure 
> >   the maximum latency on a long run does not exceed the latency requirement
> > - This vcpu will run on SCHED_FIFO, but has to run on lower priority than
> >   rcuc, so we can avoid stalling other cpus.
> > - There may be some scenarios where the vcpu will go back to userspace
> >   (from KVM_RUN ioctl), and that does not mean it's a good time to interrupt
> >   this cpu to run other stuff (like rcuc).
> >
> > Now, I understand it will cover most of our issues if we have context 
> > tracking around the vcpu_run loop, since we can use that to decide not to 
> > run rcuc on the cpu if the interruption happened inside the loop.
> > 
> > But IIUC we can have a thread that "just got out of the loop" getting 
> > interrupted by the timer, and asked to run rcu_core which will be bad for 
> > latency.
> > 
> > I understand that the chance may be statistically low, but happening once 
> > may be enough to crush the latency numbers.
> > 
> > Now, I can't think of a place to put these context trackers in kvm code that 
> > would avoid the chance of rcuc running improperly; that's why I suggested the 
> > timeout, even though it's ugly.
> > 
> > About the false-positive, IIUC we could reduce it if we reset the per-cpu 
> > last_guest_exit on kvm_put.
> 
> Which then opens up the window that you're trying to avoid (IRQ arriving just
> after the vCPU is put, before the CPU exits to userspace).
> 
> If you want the "entry to guest is imminent" status to be preserved across an exit
> to userspace, then it seems like the flag really should be a property of the task,
> not a property of the physical CPU.  Similar to how rcu_is_cpu_rrupt_from_idle()
> detects that an idle task was interrupted, the goal is to detect if a vCPU task
> was interrupted.
> 
> PF_VCPU is already "taken" for similar tracking, but if we want to track "this
> task will soon enter an extended quiescent state", I don't see any reason to make
> it specific to vCPU tasks.  Unless the kernel/KVM dynamically manages the flag,
> which as above will create windows for false negatives, the kernel needs to
> trust userspace to a certain extent no matter what.  E.g. even if KVM sets a
> PF_xxx flag on the first KVM_RUN, nothing would prevent userspace from calling
> into KVM to get KVM to set the flag, and then doing something else entirely with
> the task.
> 
> So if we're comfortable relying on the 1 second timeout to guard against a
> misbehaving userspace, IMO we might as well fully rely on that guardrail.  I.e.
> add a generic PF_xxx flag (or whatever flag location is most appropriate) to let
> userspace communicate to the kernel that it's a real-time task that spends the
> overwhelming majority of its time in userspace or guest context, i.e. should be
> given extra leniency with respect to rcuc if the task happens to be interrupted
> while it's in kernel context.

But if the task is executing in host kernel context for quite some time,
then the host kernel's RCU really does need to take evasive action.

On the other hand, if that task is executing in guest context (either
kernel or userspace), then the host kernel's RCU can immediately report
that task's quiescent state.

Too much to ask for the host kernel's RCU to be able to sense the
difference?  ;-)

							Thanx, Paul
Marcelo Tosatti May 6, 2024, 6:47 p.m. UTC | #26
On Fri, May 03, 2024 at 05:44:22PM -0300, Leonardo Bras wrote:
> On Wed, Apr 17, 2024 at 10:22:18AM -0700, Sean Christopherson wrote:
> > On Wed, Apr 17, 2024, Marcelo Tosatti wrote:
> > > On Tue, Apr 16, 2024 at 07:07:32AM -0700, Sean Christopherson wrote:
> > > > On Tue, Apr 16, 2024, Marcelo Tosatti wrote:
> > > > > > Why not have
> > > > > > KVM provide a "this task is in KVM_RUN" flag, and then let the existing timeout
> > > > > > handle the (hopefully rare) case where KVM doesn't "immediately" re-enter the guest?
> > > > > 
> > > > > Do you mean something like:
> > > > > 
> > > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > > > index d9642dd06c25..0ca5a6a45025 100644
> > > > > --- a/kernel/rcu/tree.c
> > > > > +++ b/kernel/rcu/tree.c
> > > > > @@ -3938,7 +3938,7 @@ static int rcu_pending(int user)
> > > > >                 return 1;
> > > > >  
> > > > >         /* Is this a nohz_full CPU in userspace or idle?  (Ignore RCU if so.) */
> > > > > -       if ((user || rcu_is_cpu_rrupt_from_idle()) && rcu_nohz_full_cpu())
> > > > > +       if ((user || rcu_is_cpu_rrupt_from_idle() || this_cpu->in_kvm_run) && rcu_nohz_full_cpu())
> > > > >                 return 0;
> > > > 
> > > > Yes.  This, https://lore.kernel.org/all/ZhAN28BcMsfl4gm-@google.com, plus logic
> > > > in kvm_sched_{in,out}().
> > > 
> > > Question: where is vcpu->wants_to_run set? (or, where is the full series
> > > again?).
> > 
> > Precisely around the call to kvm_arch_vcpu_ioctl_run().  I am planning on applying
> > the patch that introduces the code for 6.10[*], I just haven't yet for a variety
> > of reasons.
> > 
> > [*] https://lore.kernel.org/all/20240307163541.92138-1-dmatlack@google.com
> > 
> > > So for guest HLT emulation, there is a window between
> > > 
> > > kvm_vcpu_block -> fire_sched_out_preempt_notifiers -> vcpu_put 
> > > and the idle's task call to ct_cpuidle_enter, where 
> > > 
> > > ct_dynticks_nesting() != 0 and vcpu_put has already executed.
> > > 
> > > Even for idle=poll, the race exists.
> > 
> > Is waking rcuc actually problematic?
> 
> Yeah, it may introduce a lot (30us) of latency in some cases, causing a 
> missed deadline.
> 
> When dealing with RT tasks, missing a deadline can be really bad, so we 
> need to make sure it will happen as rarely as possible.
> 
> >  I agree it's not ideal, but it's a smallish
> > window, i.e. is unlikely to happen frequently, and if rcuc is awakened, it will
> > effectively steal cycles from the idle thread, not the vCPU thread.
> 
> It would be fine, but sometimes the idle thread will run very briefly, and 
> stealing microseconds from it will still steal enough time from the vcpu 
> thread to become a problem.
> 
> >  If the vCPU
> > gets a wake event before rcuc completes, then the vCPU could experience jitter,
> > but that could also happen if the CPU ends up in a deep C-state.
> 
> IIUC, if the scenario calls for a very short HLT, which is kind of usual, 
> then the CPU will not get into a deep C-state. 
> For the scenarios where a longer HLT happens, it would be fine.

And it might be that the chosen idle state has low latency.

There is interest from customers in using realtime and saving energy as
well.

For example:

https://doc.dpdk.org/guides/sample_app_ug/l3_forward_power_man.html

> > And that race exists in general, i.e. any IRQ that arrives just as the idle task
> > is being scheduled in will unnecessarily wakeup rcuc.
> 
> That's a race that could be solved with the timeout (snapshot) solution, if 
> we don't zero last_guest_exit on kvm_sched_out(), right?

Yes.

> > > > >         /* Is the RCU core waiting for a quiescent state from this CPU? */
> > > > > 
> > > > > The problem is:
> > > > > 
> > > > > 1) You should only set that flag, in the VM-entry path, after the point
> > > > > where no use of RCU is made: close to guest_state_enter_irqoff call.
> > > > 
> > > > Why?  As established above, KVM essentially has 1 second to enter the guest after
> > > > setting in_guest_run_loop (or whatever we call it).  In the vast majority of cases,
> > > > the time before KVM enters the guest can probably be measured in microseconds.
> > > 
> > > OK.
> > > 
> > > > Snapshotting the exit time has the exact same problem of depending on KVM to
> > > > re-enter the guest soon-ish, so I don't understand why this would be considered
> > > > a problem with a flag to note the CPU is in KVM's run loop, but not with a
> > > > snapshot to say the CPU recently exited a KVM guest.
> > > 
> > > See the race above.
> > 
> > Ya, but if kvm_last_guest_exit is zeroed in kvm_sched_out(), then the snapshot
> > approach ends up with the same race.  And not zeroing kvm_last_guest_exit is
> > arguably much more problematic as encountering a false positive doesn't require
> > hitting a small window.
> 
> For the false positive (only on nohz_full), the maximum delay for rcu_core() 
> to run would be 1s, and that would only happen if we don't schedule out to 
> some userspace task or the idle thread, in which case we have a quiescent 
> state without needing rcu_core().
> 
> Now, for it to be neither a userspace nor the idle thread, it would need to 
> be one or more kernel threads, which I suppose aren't usually many, and 
> don't usually take that long to complete, considering we are running on an 
> isolated (nohz_full) cpu.
> 
> So, for the kvm_sched_out() case, I don't actually think we are  
> statistically introducing that much of a delay in the RCU mechanism.
> 
> (I may be missing some point, though)
> 
> Thanks!
> Leo
> 
> > 
> > > > > 2) While handling a VM-exit, a host timer interrupt can occur before that,
> > > > > or after the point where "this_cpu->in_kvm_run" is set to false.
> > > > >
> > > > > And a host timer interrupt calls rcu_sched_clock_irq which is going to
> > > > > wake up rcuc.
> > > > 
> > > > If in_kvm_run is false when the IRQ is handled, then either KVM exited to userspace
> > > > or the vCPU was scheduled out.  In the former case, rcuc won't be woken up if the
> > > > CPU is in userspace.  And in the latter case, waking up rcuc is absolutely the
> > > > correct thing to do as VM-Enter is not imminent.
> > > > 
> > > > For exits to userspace, there would be a small window where an IRQ could arrive
> > > > between KVM putting the vCPU and the CPU actually returning to userspace, but
> > > > unless that's problematic in practice, I think it's a reasonable tradeoff.
> > > 
> > > OK, your proposal looks alright except these races.
> > > 
> > > We don't want those races to occur in production (and they likely will).
> > > 
> > > Is there any way to fix the races? Perhaps cmpxchg?
> > 
> > I don't think an atomic switch from the vCPU task to the idle task is feasible,
> > e.g. KVM would somehow have to know that the idle task is going to run next.
> > This seems like something that needs a generic solution, e.g. to prevent waking
> > rcuc if the idle task is in the process of being scheduled in.
> > 
> 
>
Sean Christopherson May 7, 2024, 5:55 p.m. UTC | #27
On Fri, May 03, 2024, Paul E. McKenney wrote:
> On Fri, May 03, 2024 at 02:29:57PM -0700, Sean Christopherson wrote:
> > So if we're comfortable relying on the 1 second timeout to guard against a
> > misbehaving userspace, IMO we might as well fully rely on that guardrail.  I.e.
> > add a generic PF_xxx flag (or whatever flag location is most appropriate) to let
> > userspace communicate to the kernel that it's a real-time task that spends the
> > overwhelming majority of its time in userspace or guest context, i.e. should be
> > given extra leniency with respect to rcuc if the task happens to be interrupted
> > while it's in kernel context.
> 
> But if the task is executing in host kernel context for quite some time,
> then the host kernel's RCU really does need to take evasive action.

Agreed, but what I'm saying is that RCU already has the mechanism to do so in the
form of the 1 second timeout.

And while KVM does not guarantee that it will immediately resume the guest after
servicing the IRQ, neither does the existing userspace logic.  E.g. I don't see
anything that would prevent the kernel from preempting the interrupt task.

> On the other hand, if that task is executing in guest context (either
> kernel or userspace), then the host kernel's RCU can immediately report
> that task's quiescent state.
> 
> Too much to ask for the host kernel's RCU to be able to sense the
> difference?  ;-)

KVM already notifies RCU when it's entering/exiting an extended quiescent state,
via __ct_user_{enter,exit}().

When handling an IRQ that _probably_ triggered an exit from the guest, the CPU
has already exited the quiescent state.  And AFAIK, that can't be safely changed,
i.e. KVM must note the context switch before enabling IRQs.
Sean Christopherson May 7, 2024, 6:05 p.m. UTC | #28
On Mon, May 06, 2024, Marcelo Tosatti wrote:
> On Fri, May 03, 2024 at 05:44:22PM -0300, Leonardo Bras wrote:
> > > And that race exists in general, i.e. any IRQ that arrives just as the idle task
> > > is being scheduled in will unnecessarily wakeup rcuc.
> > 
> > That's a race could be solved with the timeout (snapshot) solution, if we 
> > don't zero last_guest_exit on kvm_sched_out(), right?
> 
> Yes.

And if KVM doesn't zero last_guest_exit on kvm_sched_out(), then we're right back
in the situation where RCU can get false positives (see below).

> > > > > >         /* Is the RCU core waiting for a quiescent state from this CPU? */
> > > > > > 
> > > > > > The problem is:
> > > > > > 
> > > > > > 1) You should only set that flag, in the VM-entry path, after the point
> > > > > > where no use of RCU is made: close to guest_state_enter_irqoff call.
> > > > > 
> > > > > Why?  As established above, KVM essentially has 1 second to enter the guest after
> > > > > setting in_guest_run_loop (or whatever we call it).  In the vast majority of cases,
> > > > > the time before KVM enters the guest can probably be measured in microseconds.
> > > > 
> > > > OK.
> > > > 
> > > > > Snapshotting the exit time has the exact same problem of depending on KVM to
> > > > > re-enter the guest soon-ish, so I don't understand why this would be considered
> > > > > a problem with a flag to note the CPU is in KVM's run loop, but not with a
> > > > > snapshot to say the CPU recently exited a KVM guest.
> > > > 
> > > > See the race above.
> > > 
> > > Ya, but if kvm_last_guest_exit is zeroed in kvm_sched_out(), then the snapshot
> > > approach ends up with the same race.  And not zeroing kvm_last_guest_exit is
> > > arguably much more problematic as encountering a false positive doesn't require
> > > hitting a small window.
> > 
> > For the false positive (only on nohz_full) the maximum delay for the
> > rcu_core() to be run would be 1s, and that would be in case we don't
> > schedule out for some userspace task or idle thread, in which case we have
> > a quiescent state without the need of rcu_core().
> > 
> > Now, for not being an userspace nor idle thread, it would need to be one or
> > more kernel threads, which I suppose aren't usually many, nor usually take
> > that long for completing, if we consider to be running on an isolated
> > (nohz_full) cpu. 
> > 
> > So, for the kvm_sched_out() case, I don't actually think we are  
> > statistically introducing that much of a delay in the RCU mechanism.
> > 
> > (I may be missing some point, though)

My point is that if kvm_last_guest_exit is left as-is on kvm_sched_out() and
vcpu_put(), then from a kernel/RCU safety perspective there is no meaningful
difference between KVM setting kvm_last_guest_exit and userspace being allowed
to mark a task as being exempt from being preempted by rcuc.  Userspace can
simply do KVM_RUN once to gain exemption from rcuc until the 1 second timeout
expires.

And if KVM does zero kvm_last_guest_exit on kvm_sched_out()/vcpu_put(), then the
approach has the exact same window as my in_guest_run_loop idea, i.e. rcuc can be
unnecessarily awakened in the window between KVM putting the vCPU and the CPU
exiting to userspace.
Paul E. McKenney May 7, 2024, 7:15 p.m. UTC | #29
On Tue, May 07, 2024 at 10:55:54AM -0700, Sean Christopherson wrote:
> On Fri, May 03, 2024, Paul E. McKenney wrote:
> > On Fri, May 03, 2024 at 02:29:57PM -0700, Sean Christopherson wrote:
> > > So if we're comfortable relying on the 1 second timeout to guard against a
> > > misbehaving userspace, IMO we might as well fully rely on that guardrail.  I.e.
> > > add a generic PF_xxx flag (or whatever flag location is most appropriate) to let
> > > userspace communicate to the kernel that it's a real-time task that spends the
> > > overwhelming majority of its time in userspace or guest context, i.e. should be
> > > given extra leniency with respect to rcuc if the task happens to be interrupted
> > > while it's in kernel context.
> > 
> > But if the task is executing in host kernel context for quite some time,
> > then the host kernel's RCU really does need to take evasive action.
> 
> Agreed, but what I'm saying is that RCU already has the mechanism to do so in the
> form of the 1 second timeout.

Plus RCU will force-enable that CPU's scheduler-clock tick after about
ten milliseconds of that CPU not being in a quiescent state, with
the time varying depending on the value of HZ and the number of CPUs.
After about ten seconds (halfway to the RCU CPU stall warning), it will
resched_cpu() that CPU every few milliseconds.

> And while KVM does not guarantee that it will immediately resume the guest after
> servicing the IRQ, neither does the existing userspace logic.  E.g. I don't see
> anything that would prevent the kernel from preempting the interrupt task.

Similarly, the hypervisor could preempt a guest OS's RCU read-side
critical section or its preempt_disable() code.

Or am I missing your point?

> > On the other hand, if that task is executing in guest context (either
> > kernel or userspace), then the host kernel's RCU can immediately report
> > that task's quiescent state.
> > 
> > Too much to ask for the host kernel's RCU to be able to sense the
> > difference?  ;-)
> 
> KVM already notifies RCU when its entering/exiting an extended quiescent state,
> via __ct_user_{enter,exit}().
> 
> When handling an IRQ that _probably_ triggered an exit from the guest, the CPU
> has already exited the quiescent state.  And AFAIK, that can't be safely changed,
> i.e. KVM must note the context switch before enabling IRQs.

Whew!!!  ;-)

Just to make sure that I understand, is there any part of the problem
to be solved that does not involve vCPU preemption?

							Thanx, Paul
Sean Christopherson May 7, 2024, 9 p.m. UTC | #30
On Tue, May 07, 2024, Paul E. McKenney wrote:
> On Tue, May 07, 2024 at 10:55:54AM -0700, Sean Christopherson wrote:
> > On Fri, May 03, 2024, Paul E. McKenney wrote:
> > > On Fri, May 03, 2024 at 02:29:57PM -0700, Sean Christopherson wrote:
> > > > So if we're comfortable relying on the 1 second timeout to guard against a
> > > > misbehaving userspace, IMO we might as well fully rely on that guardrail.  I.e.
> > > > add a generic PF_xxx flag (or whatever flag location is most appropriate) to let
> > > > userspace communicate to the kernel that it's a real-time task that spends the
> > > > overwhelming majority of its time in userspace or guest context, i.e. should be
> > > > given extra leniency with respect to rcuc if the task happens to be interrupted
> > > > while it's in kernel context.
> > > 
> > > But if the task is executing in host kernel context for quite some time,
> > > then the host kernel's RCU really does need to take evasive action.
> > 
> > Agreed, but what I'm saying is that RCU already has the mechanism to do so in the
> > form of the 1 second timeout.
> 
> Plus RCU will force-enable that CPU's scheduler-clock tick after about
> ten milliseconds of that CPU not being in a quiescent state, with
> the time varying depending on the value of HZ and the number of CPUs.
> After about ten seconds (halfway to the RCU CPU stall warning), it will
> resched_cpu() that CPU every few milliseconds.
> 
> > And while KVM does not guarantee that it will immediately resume the guest after
> > servicing the IRQ, neither does the existing userspace logic.  E.g. I don't see
> > anything that would prevent the kernel from preempting the interrupt task.
> 
> Similarly, the hypervisor could preempt a guest OS's RCU read-side
> critical section or its preempt_disable() code.
> 
> Or am I missing your point?

I think you're missing my point?  I'm talking specifically about host RCU, what
is or isn't happening in the guest is completely out of scope.

My overarching point is that the existing @user check in rcu_pending() is optimistic,
in the sense that the CPU is _likely_ to quickly enter a quiescent state if @user
is true, but it's not 100% guaranteed.  And because it's not guaranteed, RCU has
the aforementioned guardrails.

And I'm arguing that, since the @user check isn't bombproof, there's no reason to
try to harden against every possible edge case in an equivalent @guest check,
because it's unnecessary for kernel safety, thanks to the guardrails.
Paul E. McKenney May 7, 2024, 9:37 p.m. UTC | #31
On Tue, May 07, 2024 at 02:00:12PM -0700, Sean Christopherson wrote:
> On Tue, May 07, 2024, Paul E. McKenney wrote:
> > On Tue, May 07, 2024 at 10:55:54AM -0700, Sean Christopherson wrote:
> > > On Fri, May 03, 2024, Paul E. McKenney wrote:
> > > > On Fri, May 03, 2024 at 02:29:57PM -0700, Sean Christopherson wrote:
> > > > > So if we're comfortable relying on the 1 second timeout to guard against a
> > > > > misbehaving userspace, IMO we might as well fully rely on that guardrail.  I.e.
> > > > > add a generic PF_xxx flag (or whatever flag location is most appropriate) to let
> > > > > userspace communicate to the kernel that it's a real-time task that spends the
> > > > > overwhelming majority of its time in userspace or guest context, i.e. should be
> > > > > given extra leniency with respect to rcuc if the task happens to be interrupted
> > > > > while it's in kernel context.
> > > > 
> > > > But if the task is executing in host kernel context for quite some time,
> > > > then the host kernel's RCU really does need to take evasive action.
> > > 
> > > Agreed, but what I'm saying is that RCU already has the mechanism to do so in the
> > > form of the 1 second timeout.
> > 
> > Plus RCU will force-enable that CPU's scheduler-clock tick after about
> > ten milliseconds of that CPU not being in a quiescent state, with
> > the time varying depending on the value of HZ and the number of CPUs.
> > After about ten seconds (halfway to the RCU CPU stall warning), it will
> > resched_cpu() that CPU every few milliseconds.
> > 
> > > And while KVM does not guarantee that it will immediately resume the guest after
> > > servicing the IRQ, neither does the existing userspace logic.  E.g. I don't see
> > > anything that would prevent the kernel from preempting the interrupt task.
> > 
> > Similarly, the hypervisor could preempt a guest OS's RCU read-side
> > critical section or its preempt_disable() code.
> > 
> > Or am I missing your point?
> 
> I think you're missing my point?  I'm talking specifically about host RCU, what
> is or isn't happening in the guest is completely out of scope.

Ah, I was thinking of nested virtualization.

> My overarching point is that the existing @user check in rcu_pending() is optimistic,
> in the sense that the CPU is _likely_ to quickly enter a quiescent state if @user
> is true, but it's not 100% guaranteed.  And because it's not guaranteed, RCU has
> the aforementioned guardrails.

You lost me on this one.

The "user" argument to rcu_pending() comes from the context saved at
the time of the scheduling-clock interrupt.  In other words, the CPU
really was executing in user mode (which is an RCU quiescent state)
when the interrupt arrived.

And that suffices, 100% guaranteed.

The reason that it suffices is that other RCU code such as rcu_qs() and
rcu_note_context_switch() ensure that this CPU does not pay attention to
the user-argument-induced quiescent state unless this CPU had previously
acknowledged the current grace period.

And if the CPU has previously acknowledged the current grace period, that
acknowledgement must have preceded the interrupt from user-mode execution.
Thus the prior quiescent state represented by that user-mode execution
applies to that previously acknowledged grace period.

This is admittedly a bit indirect, but then again this is Linux-kernel
RCU that we are talking about.

> And I'm arguing that, since the @user check isn't bombproof, there's no reason to
> try to harden against every possible edge case in an equivalent @guest check,
> because it's unnecessary for kernel safety, thanks to the guardrails.

And the same argument above would also apply to an equivalent check for
execution in guest mode at the time of the interrupt.

Please understand that I am not saying that we absolutely need an
additional check (you tell me!).  But if we do need RCU to be more
aggressive about treating guest execution as an RCU quiescent state
within the host, that additional check would be an excellent way of
making that happen.

							Thanx, Paul
Leonardo Bras May 7, 2024, 10:36 p.m. UTC | #32
On Tue, May 07, 2024 at 11:05:55AM -0700, Sean Christopherson wrote:
> On Mon, May 06, 2024, Marcelo Tosatti wrote:
> > On Fri, May 03, 2024 at 05:44:22PM -0300, Leonardo Bras wrote:
> > > > And that race exists in general, i.e. any IRQ that arrives just as the idle task
> > > > is being scheduled in will unnecessarily wakeup rcuc.
> > > 
> > > That's a race could be solved with the timeout (snapshot) solution, if we 
> > > don't zero last_guest_exit on kvm_sched_out(), right?
> > 
> > Yes.
> 
> And if KVM doesn't zero last_guest_exit on kvm_sched_out(), then we're right back
> in the situation where RCU can get false positives (see below).
> 
> > > > > > >         /* Is the RCU core waiting for a quiescent state from this CPU? */
> > > > > > > 
> > > > > > > The problem is:
> > > > > > > 
> > > > > > > 1) You should only set that flag, in the VM-entry path, after the point
> > > > > > > where no use of RCU is made: close to guest_state_enter_irqoff call.
> > > > > > 
> > > > > > Why?  As established above, KVM essentially has 1 second to enter the guest after
> > > > > > setting in_guest_run_loop (or whatever we call it).  In the vast majority of cases,
> > > > > > the time before KVM enters the guest can probably be measured in microseconds.
> > > > > 
> > > > > OK.
> > > > > 
> > > > > > Snapshotting the exit time has the exact same problem of depending on KVM to
> > > > > > re-enter the guest soon-ish, so I don't understand why this would be considered
> > > > > > a problem with a flag to note the CPU is in KVM's run loop, but not with a
> > > > > > snapshot to say the CPU recently exited a KVM guest.
> > > > > 
> > > > > See the race above.
> > > > 
> > > > Ya, but if kvm_last_guest_exit is zeroed in kvm_sched_out(), then the snapshot
> > > > approach ends up with the same race.  And not zeroing kvm_last_guest_exit is
> > > > arguably much more problematic as encountering a false positive doesn't require
> > > > hitting a small window.
> > > 
> > > For the false positive (only on nohz_full) the maximum delay for the
> > > rcu_core() to be run would be 1s, and that would be in case we don't
> > > schedule out for some userspace task or idle thread, in which case we have
> > > a quiescent state without the need of rcu_core().
> > > 
> > > Now, for not being an userspace nor idle thread, it would need to be one or
> > > more kernel threads, which I suppose aren't usually many, nor usually take
> > > that long for completing, if we consider to be running on an isolated
> > > (nohz_full) cpu. 
> > > 
> > > So, for the kvm_sched_out() case, I don't actually think we are  
> > > statistically introducing that much of a delay in the RCU mechanism.
> > > 
> > > (I may be missing some point, though)
> 
> My point is that if kvm_last_guest_exit is left as-is on kvm_sched_out() and
> vcpu_put(), then from a kernel/RCU safety perspective there is no meaningful
> difference between KVM setting kvm_last_guest_exit and userspace being allowed
> to mark a task as being exempt from being preempted by rcuc.  Userspace can
> simply do KVM_RUN once to gain exemption from rcuc until the 1 second timeout
> expires.

Oh, I see. Your concern is that a user could exploit this to purposely
slow down the RCU mechanism on nohz_full isolated CPUs. Is that
it?

Even in this case, KVM_RUN would need to run every second, which would 
cause a quiescent state every second, and move other CPUs forward in RCU.

I don't get how this could be exploited. I mean, running idle tasks and 
userspace tasks would already cause a quiescent state, making this useless 
for that purpose. So the user would need to be willing to run kernel 
threads in between KVM_RUN calls, right?

Maybe this could be relevant in the scenario: 
"I want the other users of this machine to experience slowdown in their 
processes."
But that is already possible to reproduce by actually running a busy VM on the 
CPU anyway, even with the context-tracking solution, right?

I may have missed your point here. :/
Could you help me understand it, please?

Thanks!
Leo



> 
> And if KVM does zero kvm_last_guest_exit on kvm_sched_out()/vcpu_put(), then the
> approach has the exact same window as my in_guest_run_loop idea, i.e. rcuc can be
> unnecessarily awakened in the time between KVM puts the vCPU and the CPU exits to
> userspace.
>
Sean Christopherson May 7, 2024, 11:47 p.m. UTC | #33
On Tue, May 07, 2024, Paul E. McKenney wrote:
> On Tue, May 07, 2024 at 02:00:12PM -0700, Sean Christopherson wrote:
> > On Tue, May 07, 2024, Paul E. McKenney wrote:
> > > On Tue, May 07, 2024 at 10:55:54AM -0700, Sean Christopherson wrote:
> > > > On Fri, May 03, 2024, Paul E. McKenney wrote:
> > > > > On Fri, May 03, 2024 at 02:29:57PM -0700, Sean Christopherson wrote:
> > > > > > So if we're comfortable relying on the 1 second timeout to guard against a
> > > > > > misbehaving userspace, IMO we might as well fully rely on that guardrail.  I.e.
> > > > > > add a generic PF_xxx flag (or whatever flag location is most appropriate) to let
> > > > > > userspace communicate to the kernel that it's a real-time task that spends the
> > > > > > overwhelming majority of its time in userspace or guest context, i.e. should be
> > > > > > given extra leniency with respect to rcuc if the task happens to be interrupted
> > > > > > while it's in kernel context.
> > > > > 
> > > > > But if the task is executing in host kernel context for quite some time,
> > > > > then the host kernel's RCU really does need to take evasive action.
> > > > 
> > > > Agreed, but what I'm saying is that RCU already has the mechanism to do so in the
> > > > form of the 1 second timeout.
> > > 
> > > Plus RCU will force-enable that CPU's scheduler-clock tick after about
> > > ten milliseconds of that CPU not being in a quiescent state, with
> > > the time varying depending on the value of HZ and the number of CPUs.
> > > After about ten seconds (halfway to the RCU CPU stall warning), it will
> > > resched_cpu() that CPU every few milliseconds.
> > > 
> > > > And while KVM does not guarantee that it will immediately resume the guest after
> > > > servicing the IRQ, neither does the existing userspace logic.  E.g. I don't see
> > > > anything that would prevent the kernel from preempting the interrupt task.
> > > 
> > > Similarly, the hypervisor could preempt a guest OS's RCU read-side
> > > critical section or its preempt_disable() code.
> > > 
> > > Or am I missing your point?
> > 
> > I think you're missing my point?  I'm talking specifically about host RCU, what
> > is or isn't happening in the guest is completely out of scope.
> 
> Ah, I was thinking of nested virtualization.
> 
> > My overarching point is that the existing @user check in rcu_pending() is optimistic,
> > in the sense that the CPU is _likely_ to quickly enter a quiescent state if @user
> > is true, but it's not 100% guaranteed.  And because it's not guaranteed, RCU has
> > the aforementioned guardrails.
> 
> You lost me on this one.
> 
> The "user" argument to rcu_pending() comes from the context saved at
> the time of the scheduling-clock interrupt.  In other words, the CPU
> really was executing in user mode (which is an RCU quiescent state)
> when the interrupt arrived.
> 
> And that suffices, 100% guaranteed.

Ooh, that's where I'm off in the weeds.  I was viewing @user as "this CPU will be
quiescent", but it really means "this CPU _was_ quiescent".

> The reason that it suffices is that other RCU code such as rcu_qs() and
> rcu_note_context_switch() ensure that this CPU does not pay attention to
> the user-argument-induced quiescent state unless this CPU had previously
> acknowledged the current grace period.
> 
> And if the CPU has previously acknowledged the current grace period, that
> acknowledgement must have preceded the interrupt from user-mode execution.
> Thus the prior quiescent state represented by that user-mode execution
> applies to that previously acknowledged grace period.

To confirm my own understanding: 

  1. Acknowledging the current grace period means any future rcu_read_lock() on
     the CPU will be accounted to the next grace period.

  2. A CPU can acknowledge a grace period without being quiescent.

  3. Userspace can't acknowledge a grace period, because it doesn't run kernel
     code (stating the obvious).

  4. All RCU read-side critical sections must complete before exiting to userspace.

And so if an IRQ interrupts userspace, and the CPU previously acknowledged grace
period N, RCU can infer that grace period N elapsed on the CPU, because all
"locks" held on grace period N are guaranteed to have been dropped.

> This is admittedly a bit indirect, but then again this is Linux-kernel
> RCU that we are talking about.
> 
> > And I'm arguing that, since the @user check isn't bombproof, there's no reason to
> > try to harden against every possible edge case in an equivalent @guest check,
> > because it's unnecessary for kernel safety, thanks to the guardrails.
> 
> And the same argument above would also apply to an equivalent check for
> execution in guest mode at the time of the interrupt.

This is partly why I was off in the weeds.  KVM cannot guarantee that the
interrupt that leads to rcu_pending() actually interrupted the guest.  And the
original patch didn't help at all, because a time-based check doesn't come
remotely close to the guarantees that the @user check provides.

> Please understand that I am not saying that we absolutely need an
> additional check (you tell me!).

Heh, I don't think I'm qualified to answer that question, at least not yet.

> But if we do need RCU to be more aggressive about treating guest execution as
> an RCU quiescent state within the host, that additional check would be an
> excellent way of making that happen.

It's not clear to me that being more aggressive is warranted.  If my understanding
of the existing @user check is correct, we _could_ achieve similar functionality
for vCPU tasks by defining a rule that KVM must never enter an RCU critical section
with PF_VCPU set and IRQs enabled, and then rcu_pending() could check PF_VCPU.
On x86, this would be relatively straightforward (hack-a-patch below), but I've
no idea what it would look like on other architectures.

But the value added isn't entirely clear to me, probably because I'm still missing
something.  KVM will have *very* recently called __ct_user_exit(CONTEXT_GUEST) to
note the transition from guest to host kernel.  Why isn't that a sufficient hook
for RCU to infer grace period completion?

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 1a9e1e0c9f49..259b60adaad7 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -11301,6 +11301,11 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
        if (vcpu->arch.guest_fpu.xfd_err)
                wrmsrl(MSR_IA32_XFD_ERR, 0);
 
+       RCU_LOCKDEP_WARN(lock_is_held(&rcu_bh_lock_map) ||
+                        lock_is_held(&rcu_lock_map) ||
+                        lock_is_held(&rcu_sched_lock_map),
+                        "KVM in RCU read-side critical section with PF_VCPU set and IRQs enabled");
+
        /*
         * Consume any pending interrupts, including the possible source of
         * VM-Exit on SVM and any ticks that occur between VM-Exit and now.
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index b2bccfd37c38..cdb815105de4 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3929,7 +3929,8 @@ static int rcu_pending(int user)
                return 1;
 
        /* Is this a nohz_full CPU in userspace or idle?  (Ignore RCU if so.) */
-       if ((user || rcu_is_cpu_rrupt_from_idle()) && rcu_nohz_full_cpu())
+       if ((user || rcu_is_cpu_rrupt_from_idle() || (current->flags & PF_VCPU)) &&
+           rcu_nohz_full_cpu())
                return 0;
 
        /* Is the RCU core waiting for a quiescent state from this CPU? */
Sean Christopherson May 8, 2024, 12:08 a.m. UTC | #34
On Tue, May 07, 2024, Sean Christopherson wrote:
> On Tue, May 07, 2024, Paul E. McKenney wrote:
> > On Tue, May 07, 2024 at 02:00:12PM -0700, Sean Christopherson wrote:
> > > On Tue, May 07, 2024, Paul E. McKenney wrote:
> > > > On Tue, May 07, 2024 at 10:55:54AM -0700, Sean Christopherson wrote:
> > > > > On Fri, May 03, 2024, Paul E. McKenney wrote:
> > > > > > On Fri, May 03, 2024 at 02:29:57PM -0700, Sean Christopherson wrote:
> > > > > > > So if we're comfortable relying on the 1 second timeout to guard against a
> > > > > > > misbehaving userspace, IMO we might as well fully rely on that guardrail.  I.e.
> > > > > > > add a generic PF_xxx flag (or whatever flag location is most appropriate) to let
> > > > > > > userspace communicate to the kernel that it's a real-time task that spends the
> > > > > > > overwhelming majority of its time in userspace or guest context, i.e. should be
> > > > > > > given extra leniency with respect to rcuc if the task happens to be interrupted
> > > > > > > while it's in kernel context.
> > > > > > 
> > > > > > But if the task is executing in host kernel context for quite some time,
> > > > > > then the host kernel's RCU really does need to take evasive action.
> > > > > 
> > > > > Agreed, but what I'm saying is that RCU already has the mechanism to do so in the
> > > > > form of the 1 second timeout.
> > > > 
> > > > Plus RCU will force-enable that CPU's scheduler-clock tick after about
> > > > ten milliseconds of that CPU not being in a quiescent state, with
> > > > the time varying depending on the value of HZ and the number of CPUs.
> > > > After about ten seconds (halfway to the RCU CPU stall warning), it will
> > > > resched_cpu() that CPU every few milliseconds.
> > > > 
> > > > > And while KVM does not guarantee that it will immediately resume the guest after
> > > > > servicing the IRQ, neither does the existing userspace logic.  E.g. I don't see
> > > > > anything that would prevent the kernel from preempting the interrupt task.
> > > > 
> > > > Similarly, the hypervisor could preempt a guest OS's RCU read-side
> > > > critical section or its preempt_disable() code.
> > > > 
> > > > Or am I missing your point?
> > > 
> > > I think you're missing my point?  I'm talking specifically about host RCU, what
> > > is or isn't happening in the guest is completely out of scope.
> > 
> > Ah, I was thinking of nested virtualization.
> > 
> > > My overarching point is that the existing @user check in rcu_pending() is optimistic,
> > > in the sense that the CPU is _likely_ to quickly enter a quiescent state if @user
> > > is true, but it's not 100% guaranteed.  And because it's not guaranteed, RCU has
> > > the aforementioned guardrails.
> > 
> > You lost me on this one.
> > 
> > The "user" argument to rcu_pending() comes from the context saved at
> > the time of the scheduling-clock interrupt.  In other words, the CPU
> > really was executing in user mode (which is an RCU quiescent state)
> > when the interrupt arrived.
> > 
> > And that suffices, 100% guaranteed.
> 
> Ooh, that's where I'm off in the weeds.  I was viewing @user as "this CPU will be
> quiescent", but it really means "this CPU _was_ quiescent".

Hrm, I'm still confused though.  That's rock solid for this check:

	/* Is the RCU core waiting for a quiescent state from this CPU? */

But I don't understand how it plays into the next three checks that can result in
rcuc being awakened.  I suspect it's these checks that Leo and Marcelo are trying
to squash, and these _do_ seem like they are NOT 100% guaranteed by the @user check.

	/* Does this CPU have callbacks ready to invoke? */
	/* Has RCU gone idle with this CPU needing another grace period? */
	/* Have RCU grace period completed or started?  */

> > The reason that it suffices is that other RCU code such as rcu_qs() and
> > rcu_note_context_switch() ensure that this CPU does not pay attention to
> > the user-argument-induced quiescent state unless this CPU had previously
> > acknowledged the current grace period.
> > 
> > And if the CPU has previously acknowledged the current grace period, that
> > acknowledgement must have preceded the interrupt from user-mode execution.
> > Thus the prior quiescent state represented by that user-mode execution
> > applies to that previously acknowledged grace period.
> 
> To confirm my own understanding: 
> 
>   1. Acknowledging the current grace period means any future rcu_read_lock() on
>      the CPU will be accounted to the next grace period.
> 
>   2. A CPU can acknowledge a grace period without being quiescent.
> 
>   3. Userspace can't acknowledge a grace period, because it doesn't run kernel
>      code (stating the obvious).
> 
>   4. All RCU read-side critical sections must complete before exiting to usersepace.
> 
> And so if an IRQ interrupts userspace, and the CPU previously acknowledged grace
> period N, RCU can infer that grace period N elapsed on the CPU, because all
> "locks" held on grace period N are guaranteed to have been dropped.
> 
> > This is admittedly a bit indirect, but then again this is Linux-kernel
> > RCU that we are talking about.
> > 
> > > And I'm arguing that, since the @user check isn't bombproof, there's no reason to
> > > try to harden against every possible edge case in an equivalent @guest check,
> > > because it's unnecessary for kernel safety, thanks to the guardrails.
> > 
> > And the same argument above would also apply to an equivalent check for
> > execution in guest mode at the time of the interrupt.
> 
> This is partly why I was off in the weeds.  KVM cannot guarantee that the
> interrupt that leads to rcu_pending() actually interrupted the guest.  And the
> original patch didn't help at all, because a time-based check doesn't come
> remotely close to the guarantees that the @user check provides.
> 
> > Please understand that I am not saying that we absolutely need an
> > additional check (you tell me!).
> 
> Heh, I don't think I'm qualified to answer that question, at least not yet.
> 
> > But if we do need RCU to be more aggressive about treating guest execution as
> > an RCU quiescent state within the host, that additional check would be an
> > excellent way of making that happen.
> 
> It's not clear to me that being more aggressive is warranted.  If my understanding
> of the existing @user check is correct, we _could_ achieve similar functionality
> for vCPU tasks by defining a rule that KVM must never enter an RCU critical section
> with PF_VCPU set and IRQs enabled, and then rcu_pending() could check PF_VCPU.
> On x86, this would be relatively straightforward (hack-a-patch below), but I've
> no idea what it would look like on other architectures.
> 
> But the value added isn't entirely clear to me, probably because I'm still missing
> something.  KVM will have *very* recently called __ct_user_exit(CONTEXT_GUEST) to
> note the transition from guest to host kernel.  Why isn't that a sufficient hook
> for RCU to infer grace period completion?
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 1a9e1e0c9f49..259b60adaad7 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -11301,6 +11301,11 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>         if (vcpu->arch.guest_fpu.xfd_err)
>                 wrmsrl(MSR_IA32_XFD_ERR, 0);
>  
> +       RCU_LOCKDEP_WARN(lock_is_held(&rcu_bh_lock_map) ||
> +                        lock_is_held(&rcu_lock_map) ||
> +                        lock_is_held(&rcu_sched_lock_map),
> +                        "KVM in RCU read-side critical section with PF_VCPU set and IRQs enabled");
> +
>         /*
>          * Consume any pending interrupts, including the possible source of
>          * VM-Exit on SVM and any ticks that occur between VM-Exit and now.
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index b2bccfd37c38..cdb815105de4 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -3929,7 +3929,8 @@ static int rcu_pending(int user)
>                 return 1;
>  
>         /* Is this a nohz_full CPU in userspace or idle?  (Ignore RCU if so.) */
> -       if ((user || rcu_is_cpu_rrupt_from_idle()) && rcu_nohz_full_cpu())
> +       if ((user || rcu_is_cpu_rrupt_from_idle() || (current->flags & PF_VCPU)) &&
> +           rcu_nohz_full_cpu())
>                 return 0;
>  
>         /* Is the RCU core waiting for a quiescent state from this CPU? */
> 
>
Leonardo Bras May 8, 2024, 2:51 a.m. UTC | #35
On Tue, May 07, 2024 at 05:08:54PM -0700, Sean Christopherson wrote:
> On Tue, May 07, 2024, Sean Christopherson wrote:
> > On Tue, May 07, 2024, Paul E. McKenney wrote:
> > > On Tue, May 07, 2024 at 02:00:12PM -0700, Sean Christopherson wrote:
> > > > On Tue, May 07, 2024, Paul E. McKenney wrote:
> > > > > On Tue, May 07, 2024 at 10:55:54AM -0700, Sean Christopherson wrote:
> > > > > > On Fri, May 03, 2024, Paul E. McKenney wrote:
> > > > > > > On Fri, May 03, 2024 at 02:29:57PM -0700, Sean Christopherson wrote:
> > > > > > > > So if we're comfortable relying on the 1 second timeout to guard against a
> > > > > > > > misbehaving userspace, IMO we might as well fully rely on that guardrail.  I.e.
> > > > > > > > add a generic PF_xxx flag (or whatever flag location is most appropriate) to let
> > > > > > > > userspace communicate to the kernel that it's a real-time task that spends the
> > > > > > > > overwhelming majority of its time in userspace or guest context, i.e. should be
> > > > > > > > given extra leniency with respect to rcuc if the task happens to be interrupted
> > > > > > > > while it's in kernel context.
> > > > > > > 
> > > > > > > But if the task is executing in host kernel context for quite some time,
> > > > > > > then the host kernel's RCU really does need to take evasive action.
> > > > > > 
> > > > > > Agreed, but what I'm saying is that RCU already has the mechanism to do so in the
> > > > > > form of the 1 second timeout.
> > > > > 
> > > > > Plus RCU will force-enable that CPU's scheduler-clock tick after about
> > > > > ten milliseconds of that CPU not being in a quiescent state, with
> > > > > the time varying depending on the value of HZ and the number of CPUs.
> > > > > After about ten seconds (halfway to the RCU CPU stall warning), it will
> > > > > resched_cpu() that CPU every few milliseconds.
> > > > > 
> > > > > > And while KVM does not guarantee that it will immediately resume the guest after
> > > > > > servicing the IRQ, neither does the existing userspace logic.  E.g. I don't see
> > > > > > anything that would prevent the kernel from preempting the interrupt task.
> > > > > 
> > > > > Similarly, the hypervisor could preempt a guest OS's RCU read-side
> > > > > critical section or its preempt_disable() code.
> > > > > 
> > > > > Or am I missing your point?
> > > > 
> > > > I think you're missing my point?  I'm talking specifically about host RCU, what
> > > > is or isn't happening in the guest is completely out of scope.
> > > 
> > > Ah, I was thinking of nested virtualization.
> > > 
> > > > My overarching point is that the existing @user check in rcu_pending() is optimistic,
> > > > in the sense that the CPU is _likely_ to quickly enter a quiescent state if @user
> > > > is true, but it's not 100% guaranteed.  And because it's not guaranteed, RCU has
> > > > the aforementioned guardrails.
> > > 
> > > You lost me on this one.
> > > 
> > > The "user" argument to rcu_pending() comes from the context saved at
> > > the time of the scheduling-clock interrupt.  In other words, the CPU
> > > really was executing in user mode (which is an RCU quiescent state)
> > > when the interrupt arrived.
> > > 
> > > And that suffices, 100% guaranteed.
> > 
> > Ooh, that's where I'm off in the weeds.  I was viewing @user as "this CPU will be
> > quiescent", but it really means "this CPU _was_ quiescent".
> 
> Hrm, I'm still confused though.  That's rock solid for this check:
> 
> 	/* Is the RCU core waiting for a quiescent state from this CPU? */
> 
> But I don't understand how it plays into the next three checks that can result in
> rcuc being awakened.  I suspect it's these checks that Leo and Marcelo are trying
> to squash, and these _do_ seem like they are NOT 100% guaranteed by the @user check.
> 
> 	/* Does this CPU have callbacks ready to invoke? */
> 	/* Has RCU gone idle with this CPU needing another grace period? */
> 	/* Have RCU grace period completed or started?  */
> 
> > > The reason that it suffices is that other RCU code such as rcu_qs() and
> > > rcu_note_context_switch() ensure that this CPU does not pay attention to
> > > the user-argument-induced quiescent state unless this CPU had previously
> > > acknowledged the current grace period.
> > > 
> > > And if the CPU has previously acknowledged the current grace period, that
> > > acknowledgement must have preceded the interrupt from user-mode execution.
> > > Thus the prior quiescent state represented by that user-mode execution
> > > applies to that previously acknowledged grace period.
> > 
> > To confirm my own understanding: 
> > 
> >   1. Acknowledging the current grace period means any future rcu_read_lock() on
> >      the CPU will be accounted to the next grace period.
> > 
> >   2. A CPU can acknowledge a grace period without being quiescent.
> > 
> >   3. Userspace can't acknowledge a grace period, because it doesn't run kernel
> >      code (stating the obvious).
> > 
> >   4. All RCU read-side critical sections must complete before exiting to userspace.
> > 
> > And so if an IRQ interrupts userspace, and the CPU previously acknowledged grace
> > period N, RCU can infer that grace period N elapsed on the CPU, because all
> > "locks" held on grace period N are guaranteed to have been dropped.
> > 
> > > This is admittedly a bit indirect, but then again this is Linux-kernel
> > > RCU that we are talking about.
> > > 
> > > > And I'm arguing that, since the @user check isn't bombproof, there's no reason to
> > > > try to harden against every possible edge case in an equivalent @guest check,
> > > > because it's unnecessary for kernel safety, thanks to the guardrails.
> > > 
> > > And the same argument above would also apply to an equivalent check for
> > > execution in guest mode at the time of the interrupt.
> > 
> > This is partly why I was off in the weeds.  KVM cannot guarantee that the
> > interrupt that leads to rcu_pending() actually interrupted the guest.  And the
> > original patch didn't help at all, because a time-based check doesn't come
> > remotely close to the guarantees that the @user check provides.
> > 
> > > Please understand that I am not saying that we absolutely need an
> > > additional check (you tell me!).
> > 
> > Heh, I don't think I'm qualified to answer that question, at least not yet.
> > 
> > > But if we do need RCU to be more aggressive about treating guest execution as
> > > an RCU quiescent state within the host, that additional check would be an
> > > excellent way of making that happen.
> > 
> > It's not clear to me that being more aggressive is warranted.  If my understanding
> > of the existing @user check is correct, we _could_ achieve similar functionality
> > for vCPU tasks by defining a rule that KVM must never enter an RCU critical section
> > with PF_VCPU set and IRQs enabled, and then rcu_pending() could check PF_VCPU.
> > On x86, this would be relatively straightforward (hack-a-patch below), but I've
> > no idea what it would look like on other architectures.
> > 
> > But the value added isn't entirely clear to me, probably because I'm still missing
> > something.  KVM will have *very* recently called __ct_user_exit(CONTEXT_GUEST) to
> > note the transition from guest to host kernel.  Why isn't that a sufficient hook
> > for RCU to infer grace period completion?

This is one of the solutions I tested when I was trying to solve the bug:
- Report quiescent state both in guest entry & guest exit.

It improves the situation, but has 2 issues compared to the timing alternative:
1 - Saving jiffies to a per-cpu local variable is usually cheaper than 
    reporting a quiescent state
2 - If we report it on guest_exit() and some other cpu requests a grace 
    period in the next few cpu cycles, there is a chance a timer interrupt 
    can trigger rcu_core() before the next guest_entry, which would 
    introduce unnecessary latency and cause the very issue we are trying to 
    fix.

I mean, it makes the bug reproduce less often, but does not fix it.

Thx,
Leo

> > 
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 1a9e1e0c9f49..259b60adaad7 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -11301,6 +11301,11 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
> >         if (vcpu->arch.guest_fpu.xfd_err)
> >                 wrmsrl(MSR_IA32_XFD_ERR, 0);
> >  
> > +       RCU_LOCKDEP_WARN(lock_is_held(&rcu_bh_lock_map) ||
> > +                        lock_is_held(&rcu_lock_map) ||
> > +                        lock_is_held(&rcu_sched_lock_map),
> > +                        "KVM in RCU read-side critical section with PF_VCPU set and IRQs enabled");
> > +
> >         /*
> >          * Consume any pending interrupts, including the possible source of
> >          * VM-Exit on SVM and any ticks that occur between VM-Exit and now.
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index b2bccfd37c38..cdb815105de4 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -3929,7 +3929,8 @@ static int rcu_pending(int user)
> >                 return 1;
> >  
> >         /* Is this a nohz_full CPU in userspace or idle?  (Ignore RCU if so.) */
> > -       if ((user || rcu_is_cpu_rrupt_from_idle()) && rcu_nohz_full_cpu())
> > +       if ((user || rcu_is_cpu_rrupt_from_idle() || (current->flags & PF_VCPU)) &&
> > +           rcu_nohz_full_cpu())
> >                 return 0;
> >  
> >         /* Is the RCU core waiting for a quiescent state from this CPU? */
> > 
> > 
>
Paul E. McKenney May 8, 2024, 3:20 a.m. UTC | #36
On Tue, May 07, 2024 at 05:08:54PM -0700, Sean Christopherson wrote:
> On Tue, May 07, 2024, Sean Christopherson wrote:
> > On Tue, May 07, 2024, Paul E. McKenney wrote:
> > > On Tue, May 07, 2024 at 02:00:12PM -0700, Sean Christopherson wrote:
> > > > On Tue, May 07, 2024, Paul E. McKenney wrote:
> > > > > On Tue, May 07, 2024 at 10:55:54AM -0700, Sean Christopherson wrote:
> > > > > > On Fri, May 03, 2024, Paul E. McKenney wrote:
> > > > > > > On Fri, May 03, 2024 at 02:29:57PM -0700, Sean Christopherson wrote:
> > > > > > > > So if we're comfortable relying on the 1 second timeout to guard against a
> > > > > > > > misbehaving userspace, IMO we might as well fully rely on that guardrail.  I.e.
> > > > > > > > add a generic PF_xxx flag (or whatever flag location is most appropriate) to let
> > > > > > > > userspace communicate to the kernel that it's a real-time task that spends the
> > > > > > > > overwhelming majority of its time in userspace or guest context, i.e. should be
> > > > > > > > given extra leniency with respect to rcuc if the task happens to be interrupted
> > > > > > > > while it's in kernel context.
> > > > > > > 
> > > > > > > But if the task is executing in host kernel context for quite some time,
> > > > > > > then the host kernel's RCU really does need to take evasive action.
> > > > > > 
> > > > > > Agreed, but what I'm saying is that RCU already has the mechanism to do so in the
> > > > > > form of the 1 second timeout.
> > > > > 
> > > > > Plus RCU will force-enable that CPU's scheduler-clock tick after about
> > > > > ten milliseconds of that CPU not being in a quiescent state, with
> > > > > the time varying depending on the value of HZ and the number of CPUs.
> > > > > After about ten seconds (halfway to the RCU CPU stall warning), it will
> > > > > resched_cpu() that CPU every few milliseconds.
> > > > > 
> > > > > > And while KVM does not guarantee that it will immediately resume the guest after
> > > > > > servicing the IRQ, neither does the existing userspace logic.  E.g. I don't see
> > > > > > anything that would prevent the kernel from preempting the interrupt task.
> > > > > 
> > > > > Similarly, the hypervisor could preempt a guest OS's RCU read-side
> > > > > critical section or its preempt_disable() code.
> > > > > 
> > > > > Or am I missing your point?
> > > > 
> > > > I think you're missing my point?  I'm talking specifically about host RCU, what
> > > > is or isn't happening in the guest is completely out of scope.
> > > 
> > > Ah, I was thinking of nested virtualization.
> > > 
> > > > My overarching point is that the existing @user check in rcu_pending() is optimistic,
> > > > in the sense that the CPU is _likely_ to quickly enter a quiescent state if @user
> > > > is true, but it's not 100% guaranteed.  And because it's not guaranteed, RCU has
> > > > the aforementioned guardrails.
> > > 
> > > You lost me on this one.
> > > 
> > > The "user" argument to rcu_pending() comes from the context saved at
> > > the time of the scheduling-clock interrupt.  In other words, the CPU
> > > really was executing in user mode (which is an RCU quiescent state)
> > > when the interrupt arrived.
> > > 
> > > And that suffices, 100% guaranteed.
> > 
> > Ooh, that's where I'm off in the weeds.  I was viewing @user as "this CPU will be
> > quiescent", but it really means "this CPU _was_ quiescent".

Exactly!

> Hrm, I'm still confused though.  That's rock solid for this check:
> 
> 	/* Is the RCU core waiting for a quiescent state from this CPU? */
> 
> But I don't understand how it plays into the next three checks that can result in
> rcuc being awakened.  I suspect it's these checks that Leo and Marcelo are trying
> to squash, and these _do_ seem like they are NOT 100% guaranteed by the @user check.

The short answer is that RCU is a state machine.  These checks all
indicate that there is something for that state machine to do, so
rcu_core() (in the rcuc kthread in some configurations) is invoked to
make the per-CPU portion of this state machine take a step.  The state
machine's state will reject a quiescent-state report that does not
apply to the current grace period.  It will also recognize the case
where there is no quiescent-state report.

> 	/* Does this CPU have callbacks ready to invoke? */

If callbacks are not offloaded, then the state machine is in charge of
invoking them.

> 	/* Has RCU gone idle with this CPU needing another grace period? */

If this CPU needs a grace period and there is currently no grace
period in progress, the state machine will start a grace period.
(Though grace periods can also be started from elsewhere.)

> 	/* Have RCU grace period completed or started?  */

If this CPU is not yet aware of a grace period's start or completion,
the state machine takes care of it.

This state machine has per-task, per-CPU, and global components.
It optimizes to do its work locally.  This means that the implementation
of this state machine is distributed across quite a bit of code.
You won't likely understand it by looking at only a small piece of it.
You will instead need to go line-by-line through much of the contents
of kernel/rcu, starting with kernel/rcu/tree.c.

If you are interested, we have done quite a bit of work documenting it,
please see here:

https://docs.google.com/document/d/1GCdQC8SDbb54W1shjEXqGZ0Rq8a6kIeYutdSIajfpLA/edit?usp=sharing

If you do get a chance to look it over, feedback is welcome!

> > > The reason that it suffices is that other RCU code such as rcu_qs() and
> > > rcu_note_context_switch() ensure that this CPU does not pay attention to
> > > the user-argument-induced quiescent state unless this CPU had previously
> > > acknowledged the current grace period.
> > > 
> > > And if the CPU has previously acknowledged the current grace period, that
> > > acknowledgement must have preceded the interrupt from user-mode execution.
> > > Thus the prior quiescent state represented by that user-mode execution
> > > applies to that previously acknowledged grace period.
> > 
> > To confirm my own understanding: 
> > 
> >   1. Acknowledging the current grace period means any future rcu_read_lock() on
> >      the CPU will be accounted to the next grace period.

More or less.  Any uncertainty will cause RCU to err on the side of
accounting that rcu_read_lock() to the current grace period.  Why any
uncertainty?  Because certainty is exceedingly expensive in this game.
See for example the video of my Kernel Recipes talk from last year.

> >   2. A CPU can acknowledge a grace period without being quiescent.

Yes, and either the beginning or the end of that grace period.
(It clearly cannot acknowledge both without going quiescent at some
point in between times, because otherwise that grace period could not
be permitted to end.)

> >   3. Userspace can't acknowledge a grace period, because it doesn't run kernel
> >      code (stating the obvious).

Agreed.

> >   4. All RCU read-side critical sections must complete before exiting to userspace.

Agreed.  Any that try not to will hear from lockdep.

> > And so if an IRQ interrupts userspace, and the CPU previously acknowledged grace
> > period N, RCU can infer that grace period N elapsed on the CPU, because all
> > "locks" held on grace period N are guaranteed to have been dropped.

More precisely, previously noted the beginning of that grace period,
but yes.

> > > This is admittedly a bit indirect, but then again this is Linux-kernel
> > > RCU that we are talking about.
> > > 
> > > > And I'm arguing that, since the @user check isn't bombproof, there's no reason to
> > > > try to harden against every possible edge case in an equivalent @guest check,
> > > > because it's unnecessary for kernel safety, thanks to the guardrails.
> > > 
> > > And the same argument above would also apply to an equivalent check for
> > > execution in guest mode at the time of the interrupt.
> > 
> > This is partly why I was off in the weeds.  KVM cannot guarantee that the
> > interrupt that leads to rcu_pending() actually interrupted the guest.  And the
> > original patch didn't help at all, because a time-based check doesn't come
> > remotely close to the guarantees that the @user check provides.

Nothing in the registers from the interrupted context permits that
determination?

> > > Please understand that I am not saying that we absolutely need an
> > > additional check (you tell me!).
> > 
> > Heh, I don't think I'm qualified to answer that question, at least not yet.

Me, I would assume that we don't unless something says otherwise.  One
example of such a something is an RCU CPU stall warning.

> > > But if we do need RCU to be more aggressive about treating guest execution as
> > > an RCU quiescent state within the host, that additional check would be an
> > > excellent way of making that happen.
> > 
> > It's not clear to me that being more aggressive is warranted.  If my understanding
> > of the existing @user check is correct, we _could_ achieve similar functionality
> > for vCPU tasks by defining a rule that KVM must never enter an RCU critical section
> > with PF_VCPU set and IRQs enabled, and then rcu_pending() could check PF_VCPU.
> > On x86, this would be relatively straightforward (hack-a-patch below), but I've
> > no idea what it would look like on other architectures.

At first glance, this looks plausible.  I would guess that a real patch
would have to be architecture dependent, and that could simply involve
a Kconfig option (perhaps something like CONFIG_RCU_SENSE_GUEST), so
that the check you add to rcu_pending is conditioned on something like
IS_ENABLED(CONFIG_RCU_SENSE_GUEST).

There would also need to be a similar check in rcu_sched_clock_irq(),
or maybe in rcu_flavor_sched_clock_irq(), to force a call to rcu_qs()
in this situation.

> > But the value added isn't entirely clear to me, probably because I'm still missing
> > something.  KVM will have *very* recently called __ct_user_exit(CONTEXT_GUEST) to
> > note the transition from guest to host kernel.  Why isn't that a sufficient hook
> > for RCU to infer grace period completion?

Agreed, unless we are sure we need the change, we should not make it.
All I am going on is that I was sent a patch that looked to be intended to
make RCU more aggressive about finding quiescent states from guest OSes.
I suspect that some change like this might eventually be needed in the
non-nohz_full case, something about a 2017 USENIX paper.

But we should have hard evidence that we need a change before making one.
And you are more likely to come across such evidence than am I.  ;-)

							Thanx, Paul

> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 1a9e1e0c9f49..259b60adaad7 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -11301,6 +11301,11 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
> >         if (vcpu->arch.guest_fpu.xfd_err)
> >                 wrmsrl(MSR_IA32_XFD_ERR, 0);
> >  
> > +       RCU_LOCKDEP_WARN(lock_is_held(&rcu_bh_lock_map) ||
> > +                        lock_is_held(&rcu_lock_map) ||
> > +                        lock_is_held(&rcu_sched_lock_map),
> > +                        "KVM in RCU read-side critical section with PF_VCPU set and IRQs enabled");
> > +
> >         /*
> >          * Consume any pending interrupts, including the possible source of
> >          * VM-Exit on SVM and any ticks that occur between VM-Exit and now.
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index b2bccfd37c38..cdb815105de4 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -3929,7 +3929,8 @@ static int rcu_pending(int user)
> >                 return 1;
> >  
> >         /* Is this a nohz_full CPU in userspace or idle?  (Ignore RCU if so.) */
> > -       if ((user || rcu_is_cpu_rrupt_from_idle()) && rcu_nohz_full_cpu())
> > +       if ((user || rcu_is_cpu_rrupt_from_idle() || (current->flags & PF_VCPU)) &&
> > +           rcu_nohz_full_cpu())
> >                 return 0;
> >  
> >         /* Is the RCU core waiting for a quiescent state from this CPU? */
> > 
> >
Paul E. McKenney May 8, 2024, 3:22 a.m. UTC | #37
On Tue, May 07, 2024 at 11:51:15PM -0300, Leonardo Bras wrote:
> On Tue, May 07, 2024 at 05:08:54PM -0700, Sean Christopherson wrote:
> > On Tue, May 07, 2024, Sean Christopherson wrote:
> > > On Tue, May 07, 2024, Paul E. McKenney wrote:

[ . . . ]

> > > > But if we do need RCU to be more aggressive about treating guest execution as
> > > > an RCU quiescent state within the host, that additional check would be an
> > > > excellent way of making that happen.
> > > 
> > > It's not clear to me that being more aggressive is warranted.  If my understanding
> > > of the existing @user check is correct, we _could_ achieve similar functionality
> > > for vCPU tasks by defining a rule that KVM must never enter an RCU critical section
> > > with PF_VCPU set and IRQs enabled, and then rcu_pending() could check PF_VCPU.
> > > On x86, this would be relatively straightforward (hack-a-patch below), but I've
> > > no idea what it would look like on other architectures.
> > > 
> > > But the value added isn't entirely clear to me, probably because I'm still missing
> > > something.  KVM will have *very* recently called __ct_user_exit(CONTEXT_GUEST) to
> > > note the transition from guest to host kernel.  Why isn't that a sufficient hook
> > > for RCU to infer grace period completion?
> 
> This is one of the solutions I tested when I was trying to solve the bug:
> - Report quiescent state both in guest entry & guest exit.
> 
> It improves the situation, but has 2 issues compared to the timing alternative:
> 1 - Saving jiffies to a per-cpu local variable is usually cheaper than 
>     reporting a quiescent state
> 2 - If we report it on guest_exit() and some other cpu requests a grace 
>     period in the next few cpu cycles, there is a chance a timer interrupt 
>     can trigger rcu_core() before the next guest_entry, which would 
>     introduce unnecessary latency and cause the very issue we are trying to 
>     fix.
> 
> I mean, it makes the bug reproduce less often, but does not fix it.

OK, then it sounds like something might be needed, but again, I must
defer to you guys on the need.

If there is a need, what are your thoughts on the approach that Sean
suggested?

							Thanx, Paul
Paul E. McKenney May 8, 2024, 4:04 a.m. UTC | #38
On Tue, May 07, 2024 at 08:20:53PM -0700, Paul E. McKenney wrote:
> On Tue, May 07, 2024 at 05:08:54PM -0700, Sean Christopherson wrote:
> > On Tue, May 07, 2024, Sean Christopherson wrote:
> > > On Tue, May 07, 2024, Paul E. McKenney wrote:
> > > > On Tue, May 07, 2024 at 02:00:12PM -0700, Sean Christopherson wrote:
> > > > > On Tue, May 07, 2024, Paul E. McKenney wrote:
> > > > > > On Tue, May 07, 2024 at 10:55:54AM -0700, Sean Christopherson wrote:
> > > > > > > On Fri, May 03, 2024, Paul E. McKenney wrote:
> > > > > > > > On Fri, May 03, 2024 at 02:29:57PM -0700, Sean Christopherson wrote:
> > > > > > > > > So if we're comfortable relying on the 1 second timeout to guard against a
> > > > > > > > > misbehaving userspace, IMO we might as well fully rely on that guardrail.  I.e.
> > > > > > > > > add a generic PF_xxx flag (or whatever flag location is most appropriate) to let
> > > > > > > > > userspace communicate to the kernel that it's a real-time task that spends the
> > > > > > > > > overwhelming majority of its time in userspace or guest context, i.e. should be
> > > > > > > > > given extra leniency with respect to rcuc if the task happens to be interrupted
> > > > > > > > > while it's in kernel context.
> > > > > > > > 
> > > > > > > > But if the task is executing in host kernel context for quite some time,
> > > > > > > > then the host kernel's RCU really does need to take evasive action.
> > > > > > > 
> > > > > > > Agreed, but what I'm saying is that RCU already has the mechanism to do so in the
> > > > > > > form of the 1 second timeout.
> > > > > > 
> > > > > > Plus RCU will force-enable that CPU's scheduler-clock tick after about
> > > > > > ten milliseconds of that CPU not being in a quiescent state, with
> > > > > > the time varying depending on the value of HZ and the number of CPUs.
> > > > > > After about ten seconds (halfway to the RCU CPU stall warning), it will
> > > > > > resched_cpu() that CPU every few milliseconds.
> > > > > > 
> > > > > > > And while KVM does not guarantee that it will immediately resume the guest after
> > > > > > > servicing the IRQ, neither does the existing userspace logic.  E.g. I don't see
> > > > > > > anything that would prevent the kernel from preempting the interrupt task.
> > > > > > 
> > > > > > Similarly, the hypervisor could preempt a guest OS's RCU read-side
> > > > > > critical section or its preempt_disable() code.
> > > > > > 
> > > > > > Or am I missing your point?
> > > > > 
> > > > > I think you're missing my point?  I'm talking specifically about host RCU, what
> > > > > is or isn't happening in the guest is completely out of scope.
> > > > 
> > > > Ah, I was thinking of nested virtualization.
> > > > 
> > > > > My overarching point is that the existing @user check in rcu_pending() is optimistic,
> > > > > in the sense that the CPU is _likely_ to quickly enter a quiescent state if @user
> > > > > is true, but it's not 100% guaranteed.  And because it's not guaranteed, RCU has
> > > > > the aforementioned guardrails.
> > > > 
> > > > You lost me on this one.
> > > > 
> > > > The "user" argument to rcu_pending() comes from the context saved at
> > > > the time of the scheduling-clock interrupt.  In other words, the CPU
> > > > really was executing in user mode (which is an RCU quiescent state)
> > > > when the interrupt arrived.
> > > > 
> > > > And that suffices, 100% guaranteed.
> > > 
> > > Ooh, that's where I'm off in the weeds.  I was viewing @user as "this CPU will be
> > > quiescent", but it really means "this CPU _was_ quiescent".
> 
> Exactly!
> 
> > Hrm, I'm still confused though.  That's rock solid for this check:
> > 
> > 	/* Is the RCU core waiting for a quiescent state from this CPU? */
> > 
> > But I don't understand how it plays into the next three checks that can result in
> > rcuc being awakened.  I suspect it's these checks that Leo and Marcelo are trying
> > to squash, and these _do_ seem like they are NOT 100% guaranteed by the @user check.
> 
> The short answer is that RCU is a state machine.  These checks all
> indicate that there is something for that state machine to do, so
> rcu_core() (in the rcuc kthread in some configurations) is invoked to
> make the per-CPU portion of this state machine take a step.  The state
> machine's state will reject a quiescent-state report that does not
> apply to the current grace period.  It will also recognize the case
> where there is no quiescent-state report.
> 
> > 	/* Does this CPU have callbacks ready to invoke? */
> 
> If callbacks are not offloaded, then the state machine is in charge of
> invoking them.
> 
> > 	/* Has RCU gone idle with this CPU needing another grace period? */
> 
> If this CPU needs a grace period and there is currently no grace
> period in progress, the state machine will start a grace period.
> (Though grace periods can also be started from elsewhere.)
> 
> > 	/* Have RCU grace period completed or started?  */
> 
> If this CPU is not yet aware of a grace period's start or completion,
> the state machine takes care of it.
> 
> This state machine has per-task, per-CPU, and global components.
> It optimizes to do its work locally.  This means that the implementation
> of this state machine is distributed across quite a bit of code.
> You won't likely understand it by looking at only a small piece of it.
> You will instead need to go line-by-line through much of the contents
> of kernel/rcu, starting with kernel/rcu/tree.c.
> 
> If you are interested, we have done quite a bit of work documenting it,
> please see here:
> 
> https://docs.google.com/document/d/1GCdQC8SDbb54W1shjEXqGZ0Rq8a6kIeYutdSIajfpLA/edit?usp=sharing
> 
> If you do get a chance to look it over, feedback is welcome!
> 
> > > > The reason that it suffices is that other RCU code such as rcu_qs() and
> > > > rcu_note_context_switch() ensure that this CPU does not pay attention to
> > > > the user-argument-induced quiescent state unless this CPU had previously
> > > > acknowledged the current grace period.
> > > > 
> > > > And if the CPU has previously acknowledged the current grace period, that
> > > > acknowledgement must have preceded the interrupt from user-mode execution.
> > > > Thus the prior quiescent state represented by that user-mode execution
> > > > applies to that previously acknowledged grace period.
> > > 
> > > To confirm my own understanding: 
> > > 
> > >   1. Acknowledging the current grace period means any future rcu_read_lock() on
> > >      the CPU will be accounted to the next grace period.
> 
> More or less.  Any uncertainty will cause RCU to err on the side of
> accounting that rcu_read_lock() to the current grace period.  Why any
> uncertainty?  Because certainty is exceedingly expensive in this game.
> See for example the video of my Kernel Recipes talk from last year.
> 
> > >   2. A CPU can acknowledge a grace period without being quiescent.
> 
> Yes, and either the beginning or the end of that grace period.
> (It clearly cannot acknowledge both without going quiescent at some
> point in between times, because otherwise that grace period could not
> be permitted to end.)
> 
> > >   3. Userspace can't acknowledge a grace period, because it doesn't run kernel
> > >      code (stating the obvious).
> 
> Agreed.
> 
> > >   4. All RCU read-side critical sections must complete before exiting to userspace.
> 
> Agreed.  Any that try not to will hear from lockdep.
> 
> > > And so if an IRQ interrupts userspace, and the CPU previously acknowledged grace
> > > period N, RCU can infer that grace period N elapsed on the CPU, because all
> > > "locks" held on grace period N are guaranteed to have been dropped.
> 
> More precisely, previously noted the beginning of that grace period,
> but yes.
> 
> > > > This is admittedly a bit indirect, but then again this is Linux-kernel
> > > > RCU that we are talking about.
> > > > 
> > > > > And I'm arguing that, since the @user check isn't bombproof, there's no reason to
> > > > > try to harden against every possible edge case in an equivalent @guest check,
> > > > > because it's unnecessary for kernel safety, thanks to the guardrails.
> > > > 
> > > > And the same argument above would also apply to an equivalent check for
> > > > execution in guest mode at the time of the interrupt.
> > > 
> > > This is partly why I was off in the weeds.  KVM cannot guarantee that the
> > > interrupt that leads to rcu_pending() actually interrupted the guest.  And the
> > > original patch didn't help at all, because a time-based check doesn't come
> > > remotely close to the guarantees that the @user check provides.
> 
> Nothing in the registers from the interrupted context permits that
> determination?
> 
> > > > Please understand that I am not saying that we absolutely need an
> > > > additional check (you tell me!).
> > > 
> > > Heh, I don't think I'm qualified to answer that question, at least not yet.
> 
> Me, I would assume that we don't unless something says otherwise.  One
> example of such a something is an RCU CPU stall warning.
> 
> > > > But if we do need RCU to be more aggressive about treating guest execution as
> > > > an RCU quiescent state within the host, that additional check would be an
> > > > excellent way of making that happen.
> > > 
> > > It's not clear to me that being more aggressive is warranted.  If my understanding
> > > of the existing @user check is correct, we _could_ achieve similar functionality
> > > for vCPU tasks by defining a rule that KVM must never enter an RCU critical section
> > > with PF_VCPU set and IRQs enabled, and then rcu_pending() could check PF_VCPU.
> > > On x86, this would be relatively straightforward (hack-a-patch below), but I've
> > > no idea what it would look like on other architectures.
> 
> At first glance, this looks plausible.  I would guess that a real patch
> would have to be architecture dependent, and that could simply involve
> a Kconfig option (perhaps something like CONFIG_RCU_SENSE_GUEST), so
> that the check you add to rcu_pending is conditioned on something like
> IS_ENABLED(CONFIG_RCU_SENSE_GUEST).
> 
> There would also need to be a similar check in rcu_sched_clock_irq(),
> or maybe in rcu_flavor_sched_clock_irq(), to force a call to rcu_qs()
> in this situation.

Never mind this last paragraph.  It is clearly time for me to put down
the keyboard.  :-/

						Thanx, Paul

> > > But the value added isn't entirely clear to me, probably because I'm still missing
> > > something.  KVM will have *very* recently called __ct_user_exit(CONTEXT_GUEST) to
> > > note the transition from guest to host kernel.  Why isn't that a sufficient hook
> > > for RCU to infer grace period completion?
> 
> Agreed, unless we are sure we need the change, we should not make it.
> All I am going on is that I was sent a patch that looked to be intended to
> make RCU more aggressive about finding quiescent states from guest OSes.
> I suspect that some change like this might eventually be needed in the
> non-nohz_full case, something about a 2017 USENIX paper.
> 
> But we should have hard evidence that we need a change before making one.
> And you are more likely to come across such evidence than am I.  ;-)
> 
> 							Thanx, Paul
> 
> > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > index 1a9e1e0c9f49..259b60adaad7 100644
> > > --- a/arch/x86/kvm/x86.c
> > > +++ b/arch/x86/kvm/x86.c
> > > @@ -11301,6 +11301,11 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
> > >         if (vcpu->arch.guest_fpu.xfd_err)
> > >                 wrmsrl(MSR_IA32_XFD_ERR, 0);
> > >  
> > > +       RCU_LOCKDEP_WARN(lock_is_held(&rcu_bh_lock_map) ||
> > > +                        lock_is_held(&rcu_lock_map) ||
> > > +                        lock_is_held(&rcu_sched_lock_map),
> > > +                        "KVM in RCU read-side critical section with PF_VCPU set and IRQs enabled");
> > > +
> > >         /*
> > >          * Consume any pending interrupts, including the possible source of
> > >          * VM-Exit on SVM and any ticks that occur between VM-Exit and now.
> > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > index b2bccfd37c38..cdb815105de4 100644
> > > --- a/kernel/rcu/tree.c
> > > +++ b/kernel/rcu/tree.c
> > > @@ -3929,7 +3929,8 @@ static int rcu_pending(int user)
> > >                 return 1;
> > >  
> > >         /* Is this a nohz_full CPU in userspace or idle?  (Ignore RCU if so.) */
> > > -       if ((user || rcu_is_cpu_rrupt_from_idle()) && rcu_nohz_full_cpu())
> > > +       if ((user || rcu_is_cpu_rrupt_from_idle() || (current->flags & PF_VCPU)) &&
> > > +           rcu_nohz_full_cpu())
> > >                 return 0;
> > >  
> > >         /* Is the RCU core waiting for a quiescent state from this CPU? */
> > > 
> > >
Leonardo Bras May 8, 2024, 6:19 a.m. UTC | #39
On Tue, May 07, 2024 at 08:22:42PM -0700, Paul E. McKenney wrote:
> On Tue, May 07, 2024 at 11:51:15PM -0300, Leonardo Bras wrote:
> > On Tue, May 07, 2024 at 05:08:54PM -0700, Sean Christopherson wrote:
> > > On Tue, May 07, 2024, Sean Christopherson wrote:
> > > > On Tue, May 07, 2024, Paul E. McKenney wrote:
> 
> [ . . . ]
> 
> > > > > But if we do need RCU to be more aggressive about treating guest execution as
> > > > > an RCU quiescent state within the host, that additional check would be an
> > > > > excellent way of making that happen.
> > > > 
> > > > It's not clear to me that being more aggressive is warranted.  If my understanding
> > > > of the existing @user check is correct, we _could_ achieve similar functionality
> > > > for vCPU tasks by defining a rule that KVM must never enter an RCU critical section
> > > > with PF_VCPU set and IRQs enabled, and then rcu_pending() could check PF_VCPU.
> > > > On x86, this would be relatively straightforward (hack-a-patch below), but I've
> > > > no idea what it would look like on other architectures.
> > > > 
> > > > But the value added isn't entirely clear to me, probably because I'm still missing
> > > > something.  KVM will have *very* recently called __ct_user_exit(CONTEXT_GUEST) to
> > > > note the transition from guest to host kernel.  Why isn't that a sufficient hook
> > > > for RCU to infer grace period completion?
> > 
> > This is one of the solutions I tested when I was trying to solve the bug:
> > - Report quiescent state both in guest entry & guest exit.
> > 
> > It improves the bug, but has 2 issues compared to the timing alternative:
> > 1 - Saving jiffies to a per-cpu local variable is usually cheaper than 
> >     reporting a quiescent state
> > 2 - If we report it on guest_exit() and some other cpu requests a grace 
> >     period in the next few cpu cycles, there is a chance a timer interrupt 
> >     can trigger rcu_core() before the next guest_entry, which would 
> >     introduce unnecessary latency and cause the very issue we are trying to 
> >     fix.
> > 
> > I mean, it makes the bug reproduce less, but does not fix it.
> 
> OK, then it sounds like something might be needed, but again, I must
> defer to you guys on the need.
> 
> If there is a need, what are your thoughts on the approach that Sean
> suggested?

Something just hit me, and maybe I need to propose something more generic.

But I need some help with a question first:
- Let's forget about kvm for a few seconds, and focus on host userspace:
  If we have a high-priority (user) task running on a nohz_full cpu, and it 
  gets interrupted (IRQ, let's say). Is it possible that the interrupting task 
  gets interrupted by the timer interrupt, which will check 
  rcu_pending() and return true? (1)
  (or is there any protection for that kind of scenario?) (2)

1)
If there is any possibility of this happening, maybe we could consider 
fixing it by adding some kind of generic timeout in RCU code, to be used 
in nohz_full, so that it keeps track of the last time a quiescent state 
ran on this_cpu, and returns false on rcu_pending() if one happened in the 
last N jiffies.

In this case, we could also report a quiescent state in guest_exit, and 
make use of the above generic RCU timeout to avoid having any rcu_core() 
running in those switching moments.

2)
On the other hand, if there are mechanisms in place for avoiding such a 
scenario, it could justify adding a similar mechanism to KVM guest_exit 
/ guest_entry. In case adding such a mechanism is hard, or expensive, we 
could use the KVM-only timeout previously suggested to avoid what we are 
currently hitting.

Could we use both a timeout & context tracking in this scenario? Yes.
But why do that, if the timeout would work just as well?

If I missed something, please let me know. :)

Thanks!
Leo
Sean Christopherson May 8, 2024, 2:01 p.m. UTC | #40
On Wed, May 08, 2024, Leonardo Bras wrote:
> Something just hit me, and maybe I need to propose something more generic.

Yes.  This is what I was trying to get across with my complaints about keying off
of the last VM-Exit time.  It's effectively a broad stroke "this task will likely
be quiescent soon" and so the core concept/functionality belongs in common code,
not KVM.
Paul E. McKenney May 8, 2024, 2:36 p.m. UTC | #41
On Tue, May 07, 2024 at 09:04:22PM -0700, Paul E. McKenney wrote:
> On Tue, May 07, 2024 at 08:20:53PM -0700, Paul E. McKenney wrote:
> > On Tue, May 07, 2024 at 05:08:54PM -0700, Sean Christopherson wrote:
> > > On Tue, May 07, 2024, Sean Christopherson wrote:
> > > > On Tue, May 07, 2024, Paul E. McKenney wrote:
> > > > > On Tue, May 07, 2024 at 02:00:12PM -0700, Sean Christopherson wrote:
> > > > > > On Tue, May 07, 2024, Paul E. McKenney wrote:
> > > > > > > On Tue, May 07, 2024 at 10:55:54AM -0700, Sean Christopherson wrote:
> > > > > > > > On Fri, May 03, 2024, Paul E. McKenney wrote:
> > > > > > > > > On Fri, May 03, 2024 at 02:29:57PM -0700, Sean Christopherson wrote:
> > > > > > > > > > So if we're comfortable relying on the 1 second timeout to guard against a
> > > > > > > > > > misbehaving userspace, IMO we might as well fully rely on that guardrail.  I.e.
> > > > > > > > > > add a generic PF_xxx flag (or whatever flag location is most appropriate) to let
> > > > > > > > > > userspace communicate to the kernel that it's a real-time task that spends the
> > > > > > > > > > overwhelming majority of its time in userspace or guest context, i.e. should be
> > > > > > > > > > given extra leniency with respect to rcuc if the task happens to be interrupted
> > > > > > > > > > while it's in kernel context.
> > > > > > > > > 
> > > > > > > > > But if the task is executing in host kernel context for quite some time,
> > > > > > > > > then the host kernel's RCU really does need to take evasive action.
> > > > > > > > 
> > > > > > > > Agreed, but what I'm saying is that RCU already has the mechanism to do so in the
> > > > > > > > form of the 1 second timeout.
> > > > > > > 
> > > > > > > Plus RCU will force-enable that CPU's scheduler-clock tick after about
> > > > > > > ten milliseconds of that CPU not being in a quiescent state, with
> > > > > > > the time varying depending on the value of HZ and the number of CPUs.
> > > > > > > After about ten seconds (halfway to the RCU CPU stall warning), it will
> > > > > > > resched_cpu() that CPU every few milliseconds.
> > > > > > > 
> > > > > > > > And while KVM does not guarantee that it will immediately resume the guest after
> > > > > > > > servicing the IRQ, neither does the existing userspace logic.  E.g. I don't see
> > > > > > > > anything that would prevent the kernel from preempting the interrupt task.
> > > > > > > 
> > > > > > > Similarly, the hypervisor could preempt a guest OS's RCU read-side
> > > > > > > critical section or its preempt_disable() code.
> > > > > > > 
> > > > > > > Or am I missing your point?
> > > > > > 
> > > > > > I think you're missing my point?  I'm talking specifically about host RCU, what
> > > > > > is or isn't happening in the guest is completely out of scope.
> > > > > 
> > > > > Ah, I was thinking of nested virtualization.
> > > > > 
> > > > > > My overarching point is that the existing @user check in rcu_pending() is optimistic,
> > > > > > in the sense that the CPU is _likely_ to quickly enter a quiescent state if @user
> > > > > > is true, but it's not 100% guaranteed.  And because it's not guaranteed, RCU has
> > > > > > the aforementioned guardrails.
> > > > > 
> > > > > You lost me on this one.
> > > > > 
> > > > > The "user" argument to rcu_pending() comes from the context saved at
> > > > > the time of the scheduling-clock interrupt.  In other words, the CPU
> > > > > really was executing in user mode (which is an RCU quiescent state)
> > > > > when the interrupt arrived.
> > > > > 
> > > > > And that suffices, 100% guaranteed.
> > > > 
> > > > Ooh, that's where I'm off in the weeds.  I was viewing @user as "this CPU will be
> > > > quiescent", but it really means "this CPU _was_ quiescent".
> > 
> > Exactly!
> > 
> > > Hrm, I'm still confused though.  That's rock solid for this check:
> > > 
> > > 	/* Is the RCU core waiting for a quiescent state from this CPU? */
> > > 
> > > But I don't understand how it plays into the next three checks that can result in
> > > rcuc being awakened.  I suspect it's these checks that Leo and Marcelo are trying
> > > to squash, and these _do_ seem like they are NOT 100% guaranteed by the @user check.
> > 
> > The short answer is that RCU is a state machine.  These checks all
> > indicate that there is something for that state machine to do, so
> > rcu_core() (in the rcuc kthread in some configurations) is invoked to
> > make the per-CPU portion of this state machine take a step.  The state
> > machine's state will reject a quiescent-state report that does not
> > apply to the current grace period.  It will also recognize the case
> > where there is no quiescent-state report.
> > 
> > > 	/* Does this CPU have callbacks ready to invoke? */
> > 
> > If callbacks are not offloaded, then the state machine is in charge of
> > invoking them.
> > 
> > > 	/* Has RCU gone idle with this CPU needing another grace period? */
> > 
> > If this CPU needs a grace period and there is currently no grace
> > period in progress, the state machine will start a grace period.
> > (Though grace periods can also be started from elsewhere.)
> > 
> > > 	/* Have RCU grace period completed or started?  */
> > 
> > If this CPU is not yet aware of a grace period's start or completion,
> > the state machine takes care of it.
> > 
> > This state machine has per-task, per-CPU, and global components.
> > It optimizes to do its work locally.  This means that the implementation
> > of this state machine is distributed across quite a bit of code.
> > You won't likely understand it by looking at only a small piece of it.
> > You will instead need to go line-by-line through much of the contents
> > of kernel/rcu, starting with kernel/rcu/tree.c.
> > 
> > If you are interested, we have done quite a bit of work documenting it,
> > please see here:
> > 
> > https://docs.google.com/document/d/1GCdQC8SDbb54W1shjEXqGZ0Rq8a6kIeYutdSIajfpLA/edit?usp=sharing
> > 
> > If you do get a chance to look it over, feedback is welcome!
> > 
> > > > > The reason that it suffices is that other RCU code such as rcu_qs() and
> > > > > rcu_note_context_switch() ensure that this CPU does not pay attention to
> > > > > the user-argument-induced quiescent state unless this CPU had previously
> > > > > acknowledged the current grace period.
> > > > > 
> > > > > And if the CPU has previously acknowledged the current grace period, that
> > > > > acknowledgement must have preceded the interrupt from user-mode execution.
> > > > > Thus the prior quiescent state represented by that user-mode execution
> > > > > applies to that previously acknowledged grace period.
> > > > 
> > > > To confirm my own understanding: 
> > > > 
> > > >   1. Acknowledging the current grace period means any future rcu_read_lock() on
> > > >      the CPU will be accounted to the next grace period.
> > 
> > More or less.  Any uncertainty will cause RCU to err on the side of
> > accounting that rcu_read_lock() to the current grace period.  Why any
> > uncertainty?  Because certainty is exceedingly expensive in this game.
> > See for example the video of my Kernel Recipes talk from last year.
> > 
> > > >   2. A CPU can acknowledge a grace period without being quiescent.
> > 
> > Yes, and either the beginning or the end of that grace period.
> > (It clearly cannot acknowledge both without going quiescent at some
> > point in between times, because otherwise that grace period could not
> > be permitted to end.)
> > 
> > > >   3. Userspace can't acknowledge a grace period, because it doesn't run kernel
> > > >      code (stating the obvious).
> > 
> > Agreed.
> > 
> > > >   4. All RCU read-side critical sections must complete before exiting to userspace.
> > 
> > Agreed.  Any that try not to will hear from lockdep.
> > 
> > > > And so if an IRQ interrupts userspace, and the CPU previously acknowledged grace
> > > > period N, RCU can infer that grace period N elapsed on the CPU, because all
> > > > "locks" held on grace period N are guaranteed to have been dropped.
> > 
> > More precisely, previously noted the beginning of that grace period,
> > but yes.
> > 
> > > > > This is admittedly a bit indirect, but then again this is Linux-kernel
> > > > > RCU that we are talking about.
> > > > > 
> > > > > > And I'm arguing that, since the @user check isn't bombproof, there's no reason to
> > > > > > try to harden against every possible edge case in an equivalent @guest check,
> > > > > > because it's unnecessary for kernel safety, thanks to the guardrails.
> > > > > 
> > > > > And the same argument above would also apply to an equivalent check for
> > > > > execution in guest mode at the time of the interrupt.
> > > > 
> > > > This is partly why I was off in the weeds.  KVM cannot guarantee that the
> > > > interrupt that leads to rcu_pending() actually interrupted the guest.  And the
> > > > original patch didn't help at all, because a time-based check doesn't come
> > > > remotely close to the guarantees that the @user check provides.
> > 
> > Nothing in the registers from the interrupted context permits that
> > determination?
> > 
> > > > > Please understand that I am not saying that we absolutely need an
> > > > > additional check (you tell me!).
> > > > 
> > > > Heh, I don't think I'm qualified to answer that question, at least not yet.
> > 
> > Me, I would assume that we don't unless something says otherwise.  One
> > example of such a something is an RCU CPU stall warning.
> > 
> > > > > But if we do need RCU to be more aggressive about treating guest execution as
> > > > > an RCU quiescent state within the host, that additional check would be an
> > > > > excellent way of making that happen.
> > > > 
> > > > It's not clear to me that being more aggressive is warranted.  If my understanding
> > > > of the existing @user check is correct, we _could_ achieve similar functionality
> > > > for vCPU tasks by defining a rule that KVM must never enter an RCU critical section
> > > > with PF_VCPU set and IRQs enabled, and then rcu_pending() could check PF_VCPU.
> > > > On x86, this would be relatively straightforward (hack-a-patch below), but I've
> > > > no idea what it would look like on other architectures.
> > 
> > At first glance, this looks plausible.  I would guess that a real patch
> > would have to be architecture dependent, and that could simply involve
> > a Kconfig option (perhaps something like CONFIG_RCU_SENSE_GUEST), so
> > that the check you add to rcu_pending is conditioned on something like
> > IS_ENABLED(CONFIG_RCU_SENSE_GUEST).
> > 
> > There would also need to be a similar check in rcu_sched_clock_irq(),
> > or maybe in rcu_flavor_sched_clock_irq(), to force a call to rcu_qs()
> > in this situation.
> 
> Never mind this last paragraph.  It is clearly time for me to put down
> the keyboard.  :-/

But there is a real additional change to be made.  If RCU is not watching,
then tracing is disallowed, and all intervening functions must be either
inlined or marked noinstr.  This will likely be quite messy.

Also, if RCU is to not be watching, why not just move the context-tracking
transition?

						Thanx, Paul

> > > > But the value added isn't entirely clear to me, probably because I'm still missing
> > > > something.  KVM will have *very* recently called __ct_user_exit(CONTEXT_GUEST) to
> > > > note the transition from guest to host kernel.  Why isn't that a sufficient hook
> > > > for RCU to infer grace period completion?
> > 
> > Agreed, unless we are sure we need the change, we should not make it.
> > All I am going on is that I was sent a patch that looked to be intended to
> > make RCU more aggressive about finding quiescent states from guest OSes.
> > I suspect that some change like this might eventually be needed in the
> > non-nohz_full case, something about a 2017 USENIX paper.
> > 
> > But we should have hard evidence that we need a change before making one.
> > And you are more likely to come across such evidence than am I.  ;-)
> > 
> > 							Thanx, Paul
> > 
> > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > > index 1a9e1e0c9f49..259b60adaad7 100644
> > > > --- a/arch/x86/kvm/x86.c
> > > > +++ b/arch/x86/kvm/x86.c
> > > > @@ -11301,6 +11301,11 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
> > > >         if (vcpu->arch.guest_fpu.xfd_err)
> > > >                 wrmsrl(MSR_IA32_XFD_ERR, 0);
> > > >  
> > > > +       RCU_LOCKDEP_WARN(lock_is_held(&rcu_bh_lock_map) ||
> > > > +                        lock_is_held(&rcu_lock_map) ||
> > > > +                        lock_is_held(&rcu_sched_lock_map),
> > > > +                        "KVM in RCU read-side critical section with PF_VCPU set and IRQs enabled");
> > > > +
> > > >         /*
> > > >          * Consume any pending interrupts, including the possible source of
> > > >          * VM-Exit on SVM and any ticks that occur between VM-Exit and now.
> > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > > index b2bccfd37c38..cdb815105de4 100644
> > > > --- a/kernel/rcu/tree.c
> > > > +++ b/kernel/rcu/tree.c
> > > > @@ -3929,7 +3929,8 @@ static int rcu_pending(int user)
> > > >                 return 1;
> > > >  
> > > >         /* Is this a nohz_full CPU in userspace or idle?  (Ignore RCU if so.) */
> > > > -       if ((user || rcu_is_cpu_rrupt_from_idle()) && rcu_nohz_full_cpu())
> > > > +       if ((user || rcu_is_cpu_rrupt_from_idle() || (current->flags & PF_VCPU)) &&
> > > > +           rcu_nohz_full_cpu())
> > > >                 return 0;
> > > >  
> > > >         /* Is the RCU core waiting for a quiescent state from this CPU? */
> > > > 
> > > >
Sean Christopherson May 8, 2024, 3:35 p.m. UTC | #42
On Tue, May 07, 2024, Paul E. McKenney wrote:
> On Tue, May 07, 2024 at 05:08:54PM -0700, Sean Christopherson wrote:
> > > > This is admittedly a bit indirect, but then again this is Linux-kernel
> > > > RCU that we are talking about.
> > > > 
> > > > > And I'm arguing that, since the @user check isn't bombproof, there's no reason to
> > > > > try to harden against every possible edge case in an equivalent @guest check,
> > > > > because it's unnecessary for kernel safety, thanks to the guardrails.
> > > > 
> > > > And the same argument above would also apply to an equivalent check for
> > > > execution in guest mode at the time of the interrupt.
> > > 
> > > This is partly why I was off in the weeds.  KVM cannot guarantee that the
> > > interrupt that leads to rcu_pending() actually interrupted the guest.  And the
> > > original patch didn't help at all, because a time-based check doesn't come
> > > remotely close to the guarantees that the @user check provides.
> 
> Nothing in the registers from the interrupted context permits that
> determination?

No, because the interrupt/call chain that reaches rcu_pending() actually originates
in KVM host code, not guest code.  I.e. the eventual IRET will return control to
KVM, not to the guest.

On AMD, the interrupt quite literally interrupts the host, not the guest.  AMD
CPUs don't actually acknowledge/consume the physical interrupt when the guest is
running, the CPU simply generates a VM-Exit that says "there's an interrupt pending".
It's up to software, i.e. KVM, to enable IRQs and handle (all!) pending interrupts.

Intel CPUs have a mode where the CPU fully acknowledges the interrupt and reports
the exact vector that caused the VM-Exit, but it's still up to software to invoke
the interrupt handler, i.e. the interrupt trampolines through KVM.

And before handling/forwarding the interrupt, KVM exits its quiescent state,
leaves its no-instrumentation region, invokes tracepoints, etc.  So even my PF_VCPU
idea is _very_ different from the user/idle scenarios, where the interrupt really
truly does originate from an extended quiescent state.

> > > > But if we do need RCU to be more aggressive about treating guest execution as
> > > > an RCU quiescent state within the host, that additional check would be an
> > > > excellent way of making that happen.
> > > 
> > > It's not clear to me that being more aggressive is warranted.  If my understanding
> > > of the existing @user check is correct, we _could_ achieve similar functionality
> > > for vCPU tasks by defining a rule that KVM must never enter an RCU critical section
> > > with PF_VCPU set and IRQs enabled, and then rcu_pending() could check PF_VCPU.
> > > On x86, this would be relatively straightforward (hack-a-patch below), but I've
> > > no idea what it would look like on other architectures.
> 
> At first glance, this looks plausible.  I would guess that a real patch
> would have to be architecture dependent, and that could simply involve
> a Kconfig option (perhaps something like CONFIG_RCU_SENSE_GUEST), so
> that the check you add to rcu_pending is conditioned on something like
> IS_ENABLED(CONFIG_RCU_SENSE_GUEST).
> 
> There would also need to be a similar check in rcu_sched_clock_irq(),
> or maybe in rcu_flavor_sched_clock_irq(), to force a call to rcu_qs()
> in this situation.
> 
> > > But the value added isn't entirely clear to me, probably because I'm still missing
> > > something.  KVM will have *very* recently called __ct_user_exit(CONTEXT_GUEST) to
> > > note the transition from guest to host kernel.  Why isn't that a sufficient hook
> > > for RCU to infer grace period completion?
> 
> Agreed, unless we are sure we need the change, we should not make it.

+1.  And your comments about tracepoints, instrumentions, etc. makes me think
that trying to force the issue with PF_VCPU would be a bad idea.
Paul E. McKenney May 9, 2024, 3:32 a.m. UTC | #43
On Wed, May 08, 2024 at 07:01:29AM -0700, Sean Christopherson wrote:
> On Wed, May 08, 2024, Leonardo Bras wrote:
> > Something just hit me, and maybe I need to propose something more generic.
> 
> Yes.  This is what I was trying to get across with my complaints about keying off
> of the last VM-Exit time.  It's effectively a broad stroke "this task will likely
> be quiescent soon" and so the core concept/functionality belongs in common code,
> not KVM.

OK, we could do something like the following wholly within RCU, namely
to make rcu_pending() refrain from invoking rcu_core() until the grace
period is at least the specified age, defaulting to zero (and to the
current behavior).

Perhaps something like the patch shown below.

Thoughts?

							Thanx, Paul

------------------------------------------------------------------------

commit abc7cd2facdebf85aa075c567321589862f88542
Author: Paul E. McKenney <paulmck@kernel.org>
Date:   Wed May 8 20:11:58 2024 -0700

    rcu: Add rcutree.nocb_patience_delay to reduce nohz_full OS jitter
    
    If a CPU is running either a userspace application or a guest OS in
    nohz_full mode, it is possible for a system call to occur just as an
    RCU grace period is starting.  If that CPU also has the scheduling-clock
    tick enabled for any reason (such as a second runnable task), and if the
    system was booted with rcutree.use_softirq=0, then RCU can add insult to
    injury by awakening that CPU's rcuc kthread, resulting in yet another
    task and yet more OS jitter due to switching to that task, running it,
    and switching back.
    
    In addition, in the common case where that system call is not of
    excessively long duration, awakening the rcuc task is pointless.
    This pointlessness is due to the fact that the CPU will enter an extended
    quiescent state upon returning to the userspace application or guest OS.
    In this case, the rcuc kthread cannot do anything that the main RCU
    grace-period kthread cannot do on its behalf, at least if it is given
    a few additional milliseconds (for example, given the time duration
    specified by rcutree.jiffies_till_first_fqs, give or take scheduling
    delays).
    
    This commit therefore adds a rcutree.nocb_patience_delay kernel boot
    parameter that specifies the grace period age (in milliseconds)
    before which RCU will refrain from awakening the rcuc kthread.
    Preliminary experimentation suggests a value of 1000, that is,
    one second.  Increasing rcutree.nocb_patience_delay will increase
    grace-period latency and in turn increase memory footprint, so systems
    with constrained memory might choose a smaller value.  Systems with
    less-aggressive OS-jitter requirements might choose the default value
    of zero, which keeps the traditional immediate-wakeup behavior, thus
    avoiding increases in grace-period latency.
    
    Link: https://lore.kernel.org/all/20240328171949.743211-1-leobras@redhat.com/
    
    Reported-by: Leonardo Bras <leobras@redhat.com>
    Suggested-by: Leonardo Bras <leobras@redhat.com>
    Suggested-by: Sean Christopherson <seanjc@google.com>
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 0a3b0fd1910e6..42383986e692b 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -4981,6 +4981,13 @@
 			the ->nocb_bypass queue.  The definition of "too
 			many" is supplied by this kernel boot parameter.
 
+	rcutree.nocb_patience_delay= [KNL]
+			On callback-offloaded (rcu_nocbs) CPUs, avoid
+			disturbing RCU unless the grace period has
+			reached the specified age in milliseconds.
+			Defaults to zero.  Large values will be capped
+			at five seconds.
+
 	rcutree.qhimark= [KNL]
 			Set threshold of queued RCU callbacks beyond which
 			batch limiting is disabled.
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 7560e204198bb..6e4b8b43855a0 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -176,6 +176,8 @@ static int gp_init_delay;
 module_param(gp_init_delay, int, 0444);
 static int gp_cleanup_delay;
 module_param(gp_cleanup_delay, int, 0444);
+static int nocb_patience_delay;
+module_param(nocb_patience_delay, int, 0444);
 
 // Add delay to rcu_read_unlock() for strict grace periods.
 static int rcu_unlock_delay;
@@ -4334,6 +4336,8 @@ EXPORT_SYMBOL_GPL(cond_synchronize_rcu_full);
 static int rcu_pending(int user)
 {
 	bool gp_in_progress;
+	unsigned long j = jiffies;
+	unsigned int patience = msecs_to_jiffies(nocb_patience_delay);
 	struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
 	struct rcu_node *rnp = rdp->mynode;
 
@@ -4347,11 +4351,13 @@ static int rcu_pending(int user)
 		return 1;
 
 	/* Is this a nohz_full CPU in userspace or idle?  (Ignore RCU if so.) */
-	if ((user || rcu_is_cpu_rrupt_from_idle()) && rcu_nohz_full_cpu())
+	gp_in_progress = rcu_gp_in_progress();
+	if ((user || rcu_is_cpu_rrupt_from_idle() ||
+	     (gp_in_progress && time_before(j + patience, rcu_state.gp_start))) &&
+	    rcu_nohz_full_cpu())
 		return 0;
 
 	/* Is the RCU core waiting for a quiescent state from this CPU? */
-	gp_in_progress = rcu_gp_in_progress();
 	if (rdp->core_needs_qs && !rdp->cpu_no_qs.b.norm && gp_in_progress)
 		return 1;
 
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 340bbefe5f652..174333d0e9507 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -93,6 +93,15 @@ static void __init rcu_bootup_announce_oddness(void)
 		pr_info("\tRCU debug GP init slowdown %d jiffies.\n", gp_init_delay);
 	if (gp_cleanup_delay)
 		pr_info("\tRCU debug GP cleanup slowdown %d jiffies.\n", gp_cleanup_delay);
+	if (nocb_patience_delay < 0) {
+		pr_info("\tRCU NOCB CPU patience negative (%d), resetting to zero.\n", nocb_patience_delay);
+		nocb_patience_delay = 0;
+	} else if (nocb_patience_delay > 5 * MSEC_PER_SEC) {
+		pr_info("\tRCU NOCB CPU patience too large (%d), resetting to %ld.\n", nocb_patience_delay, 5 * MSEC_PER_SEC);
+		nocb_patience_delay = 5 * MSEC_PER_SEC;
+	} else if (nocb_patience_delay) {
+		pr_info("\tRCU NOCB CPU patience set to %d milliseconds.\n", nocb_patience_delay);
+	}
 	if (!use_softirq)
 		pr_info("\tRCU_SOFTIRQ processing moved to rcuc kthreads.\n");
 	if (IS_ENABLED(CONFIG_RCU_EQS_DEBUG))
Leonardo Bras May 9, 2024, 8:16 a.m. UTC | #44
On Wed, May 08, 2024 at 08:32:40PM -0700, Paul E. McKenney wrote:
> On Wed, May 08, 2024 at 07:01:29AM -0700, Sean Christopherson wrote:
> > On Wed, May 08, 2024, Leonardo Bras wrote:
> > > Something just hit me, and maybe I need to propose something more generic.
> > 
> > Yes.  This is what I was trying to get across with my complaints about keying off
> > of the last VM-Exit time.  It's effectively a broad stroke "this task will likely
> > be quiescent soon" and so the core concept/functionality belongs in common code,
> > not KVM.
> 
> OK, we could do something like the following wholly within RCU, namely
> to make rcu_pending() refrain from invoking rcu_core() until the grace
> period is at least the specified age, defaulting to zero (and to the
> current behavior).
> 
> Perhaps something like the patch shown below.

That's exactly what I was thinking :)

> 
> Thoughts?

Some suggestions below:

> 
> 							Thanx, Paul
> 
> ------------------------------------------------------------------------
> 
> commit abc7cd2facdebf85aa075c567321589862f88542
> Author: Paul E. McKenney <paulmck@kernel.org>
> Date:   Wed May 8 20:11:58 2024 -0700
> 
>     rcu: Add rcutree.nocb_patience_delay to reduce nohz_full OS jitter
>     
>     If a CPU is running either a userspace application or a guest OS in
>     nohz_full mode, it is possible for a system call to occur just as an
>     RCU grace period is starting.  If that CPU also has the scheduling-clock
>     tick enabled for any reason (such as a second runnable task), and if the
>     system was booted with rcutree.use_softirq=0, then RCU can add insult to
>     injury by awakening that CPU's rcuc kthread, resulting in yet another
>     task and yet more OS jitter due to switching to that task, running it,
>     and switching back.
>     
>     In addition, in the common case where that system call is not of
>     excessively long duration, awakening the rcuc task is pointless.
>     This pointlessness is due to the fact that the CPU will enter an extended
>     quiescent state upon returning to the userspace application or guest OS.
>     In this case, the rcuc kthread cannot do anything that the main RCU
>     grace-period kthread cannot do on its behalf, at least if it is given
>     a few additional milliseconds (for example, given the time duration
>     specified by rcutree.jiffies_till_first_fqs, give or take scheduling
>     delays).
>     
>     This commit therefore adds a rcutree.nocb_patience_delay kernel boot
>     parameter that specifies the grace period age (in milliseconds)
>     before which RCU will refrain from awakening the rcuc kthread.
>     Preliminary experimentation suggests a value of 1000, that is,
>     one second.  Increasing rcutree.nocb_patience_delay will increase
>     grace-period latency and in turn increase memory footprint, so systems
>     with constrained memory might choose a smaller value.  Systems with
>     less-aggressive OS-jitter requirements might choose the default value
>     of zero, which keeps the traditional immediate-wakeup behavior, thus
>     avoiding increases in grace-period latency.
>     
>     Link: https://lore.kernel.org/all/20240328171949.743211-1-leobras@redhat.com/
>     
>     Reported-by: Leonardo Bras <leobras@redhat.com>
>     Suggested-by: Leonardo Bras <leobras@redhat.com>
>     Suggested-by: Sean Christopherson <seanjc@google.com>
>     Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
> 
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index 0a3b0fd1910e6..42383986e692b 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -4981,6 +4981,13 @@
>  			the ->nocb_bypass queue.  The definition of "too
>  			many" is supplied by this kernel boot parameter.
>  
> +	rcutree.nocb_patience_delay= [KNL]
> +			On callback-offloaded (rcu_nocbs) CPUs, avoid
> +			disturbing RCU unless the grace period has
> +			reached the specified age in milliseconds.
> +			Defaults to zero.  Large values will be capped
> +			at five seconds.
> +
>  	rcutree.qhimark= [KNL]
>  			Set threshold of queued RCU callbacks beyond which
>  			batch limiting is disabled.
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 7560e204198bb..6e4b8b43855a0 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -176,6 +176,8 @@ static int gp_init_delay;
>  module_param(gp_init_delay, int, 0444);
>  static int gp_cleanup_delay;
>  module_param(gp_cleanup_delay, int, 0444);
> +static int nocb_patience_delay;
> +module_param(nocb_patience_delay, int, 0444);
>  
>  // Add delay to rcu_read_unlock() for strict grace periods.
>  static int rcu_unlock_delay;
> @@ -4334,6 +4336,8 @@ EXPORT_SYMBOL_GPL(cond_synchronize_rcu_full);
>  static int rcu_pending(int user)
>  {
>  	bool gp_in_progress;
> +	unsigned long j = jiffies;

I think this is probably taken care of by the compiler, but just in case
I would move the
	j = jiffies;
closer to its use, in order to avoid reading 'jiffies' if rcu_pending()
exits before the nohz_full test.


> +	unsigned int patience = msecs_to_jiffies(nocb_patience_delay);

What do you think about processing the new parameter at boot, and saving
it in jiffies already?

It would make it unnecessary to convert ms -> jiffies every time we run 
rcu_pending.

(Out-of-order execution will probably hide the extra division, but it
could still have an impact on some archs.)

>  	struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
>  	struct rcu_node *rnp = rdp->mynode;
>  
> @@ -4347,11 +4351,13 @@ static int rcu_pending(int user)
>  		return 1;
>  
>  	/* Is this a nohz_full CPU in userspace or idle?  (Ignore RCU if so.) */
> -	if ((user || rcu_is_cpu_rrupt_from_idle()) && rcu_nohz_full_cpu())
> +	gp_in_progress = rcu_gp_in_progress();
> +	if ((user || rcu_is_cpu_rrupt_from_idle() ||
> +	     (gp_in_progress && time_before(j + patience, rcu_state.gp_start))) &&

I think you meant:
	time_before(j, rcu_state.gp_start + patience)

or else this check always fails, since "now" can never come before a
previously started grace period, right?

Also, as is done in rcu_nohz_full_cpu(), we probably need to read it with
READ_ONCE():

	time_before(j, READ_ONCE(rcu_state.gp_start) + patience)

> +	    rcu_nohz_full_cpu())
>  		return 0;
>  
>  	/* Is the RCU core waiting for a quiescent state from this CPU? */
> -	gp_in_progress = rcu_gp_in_progress();
>  	if (rdp->core_needs_qs && !rdp->cpu_no_qs.b.norm && gp_in_progress)
>  		return 1;
>  
> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> index 340bbefe5f652..174333d0e9507 100644
> --- a/kernel/rcu/tree_plugin.h
> +++ b/kernel/rcu/tree_plugin.h
> @@ -93,6 +93,15 @@ static void __init rcu_bootup_announce_oddness(void)
>  		pr_info("\tRCU debug GP init slowdown %d jiffies.\n", gp_init_delay);
>  	if (gp_cleanup_delay)
>  		pr_info("\tRCU debug GP cleanup slowdown %d jiffies.\n", gp_cleanup_delay);
> +	if (nocb_patience_delay < 0) {
> +		pr_info("\tRCU NOCB CPU patience negative (%d), resetting to zero.\n", nocb_patience_delay);
> +		nocb_patience_delay = 0;
> +	} else if (nocb_patience_delay > 5 * MSEC_PER_SEC) {
> +		pr_info("\tRCU NOCB CPU patience too large (%d), resetting to %ld.\n", nocb_patience_delay, 5 * MSEC_PER_SEC);
> +		nocb_patience_delay = 5 * MSEC_PER_SEC;
> +	} else if (nocb_patience_delay) {

Here you suggest that we don't print if 'nocb_patience_delay == 0', 
as it's the default behavior, right?

I think printing on 0 could be useful to check if the feature exists, even 
though we are zeroing it, but this will probably add unnecessary verbosity.

> +		pr_info("\tRCU NOCB CPU patience set to %d milliseconds.\n", nocb_patience_delay);
> +	}

Here I suppose something like this can take care of not needing to convert
ms -> jiffies on every rcu_pending() call:

+	nocb_patience_delay = msecs_to_jiffies(nocb_patience_delay);

>  	if (!use_softirq)
>  		pr_info("\tRCU_SOFTIRQ processing moved to rcuc kthreads.\n");
>  	if (IS_ENABLED(CONFIG_RCU_EQS_DEBUG))
> 


Thanks!
Leo
Leonardo Bras May 9, 2024, 10:14 a.m. UTC | #45
On Thu, May 09, 2024 at 05:16:57AM -0300, Leonardo Bras wrote:
> On Wed, May 08, 2024 at 08:32:40PM -0700, Paul E. McKenney wrote:
> > On Wed, May 08, 2024 at 07:01:29AM -0700, Sean Christopherson wrote:
> > > On Wed, May 08, 2024, Leonardo Bras wrote:
> > > > Something just hit me, and maybe I need to propose something more generic.
> > > 
> > > Yes.  This is what I was trying to get across with my complaints about keying off
> > > of the last VM-Exit time.  It's effectively a broad stroke "this task will likely
> > > be quiescent soon" and so the core concept/functionality belongs in common code,
> > > not KVM.
> > 
> > OK, we could do something like the following wholly within RCU, namely
> > to make rcu_pending() refrain from invoking rcu_core() until the grace
> > period is at least the specified age, defaulting to zero (and to the
> > current behavior).
> > 
> > Perhaps something like the patch shown below.
> 
> That's exactly what I was thinking :)
> 
> > 
> > Thoughts?
> 
> Some suggestions below:
> 
> > 
> > 							Thanx, Paul
> > 
> > ------------------------------------------------------------------------
> > 
> > commit abc7cd2facdebf85aa075c567321589862f88542
> > Author: Paul E. McKenney <paulmck@kernel.org>
> > Date:   Wed May 8 20:11:58 2024 -0700
> > 
> >     rcu: Add rcutree.nocb_patience_delay to reduce nohz_full OS jitter
> >     
> >     If a CPU is running either a userspace application or a guest OS in
> >     nohz_full mode, it is possible for a system call to occur just as an
> >     RCU grace period is starting.  If that CPU also has the scheduling-clock
> >     tick enabled for any reason (such as a second runnable task), and if the
> >     system was booted with rcutree.use_softirq=0, then RCU can add insult to
> >     injury by awakening that CPU's rcuc kthread, resulting in yet another
> >     task and yet more OS jitter due to switching to that task, running it,
> >     and switching back.
> >     
> >     In addition, in the common case where that system call is not of
> >     excessively long duration, awakening the rcuc task is pointless.
> >     This pointlessness is due to the fact that the CPU will enter an extended
> >     quiescent state upon returning to the userspace application or guest OS.
> >     In this case, the rcuc kthread cannot do anything that the main RCU
> >     grace-period kthread cannot do on its behalf, at least if it is given
> >     a few additional milliseconds (for example, given the time duration
> >     specified by rcutree.jiffies_till_first_fqs, give or take scheduling
> >     delays).
> >     
> >     This commit therefore adds a rcutree.nocb_patience_delay kernel boot
> >     parameter that specifies the grace period age (in milliseconds)
> >     before which RCU will refrain from awakening the rcuc kthread.
> >     Preliminary experimentation suggests a value of 1000, that is,
> >     one second.  Increasing rcutree.nocb_patience_delay will increase
> >     grace-period latency and in turn increase memory footprint, so systems
> >     with constrained memory might choose a smaller value.  Systems with
> >     less-aggressive OS-jitter requirements might choose the default value
> >     of zero, which keeps the traditional immediate-wakeup behavior, thus
> >     avoiding increases in grace-period latency.
> >     
> >     Link: https://lore.kernel.org/all/20240328171949.743211-1-leobras@redhat.com/
> >     
> >     Reported-by: Leonardo Bras <leobras@redhat.com>
> >     Suggested-by: Leonardo Bras <leobras@redhat.com>
> >     Suggested-by: Sean Christopherson <seanjc@google.com>
> >     Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
> > 
> > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> > index 0a3b0fd1910e6..42383986e692b 100644
> > --- a/Documentation/admin-guide/kernel-parameters.txt
> > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > @@ -4981,6 +4981,13 @@
> >  			the ->nocb_bypass queue.  The definition of "too
> >  			many" is supplied by this kernel boot parameter.
> >  
> > +	rcutree.nocb_patience_delay= [KNL]
> > +			On callback-offloaded (rcu_nocbs) CPUs, avoid
> > +			disturbing RCU unless the grace period has
> > +			reached the specified age in milliseconds.
> > +			Defaults to zero.  Large values will be capped
> > +			at five seconds.
> > +
> >  	rcutree.qhimark= [KNL]
> >  			Set threshold of queued RCU callbacks beyond which
> >  			batch limiting is disabled.
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index 7560e204198bb..6e4b8b43855a0 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -176,6 +176,8 @@ static int gp_init_delay;
> >  module_param(gp_init_delay, int, 0444);
> >  static int gp_cleanup_delay;
> >  module_param(gp_cleanup_delay, int, 0444);
> > +static int nocb_patience_delay;
> > +module_param(nocb_patience_delay, int, 0444);
> >  
> >  // Add delay to rcu_read_unlock() for strict grace periods.
> >  static int rcu_unlock_delay;
> > @@ -4334,6 +4336,8 @@ EXPORT_SYMBOL_GPL(cond_synchronize_rcu_full);
> >  static int rcu_pending(int user)
> >  {
> >  	bool gp_in_progress;
> > +	unsigned long j = jiffies;
> 
> I think this is probably taken care of by the compiler, but just in case
> I would move the
> 	j = jiffies;
> closer to its use, in order to avoid reading 'jiffies' if rcu_pending()
> exits before the nohz_full test.
> 
> 
> > +	unsigned int patience = msecs_to_jiffies(nocb_patience_delay);
> 
> What do you think about processing the new parameter at boot, and saving
> it in jiffies already?
> 
> It would make it unnecessary to convert ms -> jiffies every time we run 
> rcu_pending.
> 
> (Out-of-order execution will probably hide the extra division, but it
> could still have an impact on some archs.)
> 
> >  	struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
> >  	struct rcu_node *rnp = rdp->mynode;
> >  
> > @@ -4347,11 +4351,13 @@ static int rcu_pending(int user)
> >  		return 1;
> >  
> >  	/* Is this a nohz_full CPU in userspace or idle?  (Ignore RCU if so.) */
> > -	if ((user || rcu_is_cpu_rrupt_from_idle()) && rcu_nohz_full_cpu())
> > +	gp_in_progress = rcu_gp_in_progress();
> > +	if ((user || rcu_is_cpu_rrupt_from_idle() ||
> > +	     (gp_in_progress && time_before(j + patience, rcu_state.gp_start))) &&
> 
> I think you meant:
> 	time_before(j, rcu_state.gp_start + patience)
> 
> or else this check always fails, since "now" can never come before a
> previously started grace period, right?
> 
> Also, as is done in rcu_nohz_full_cpu(), we probably need to read it with
> READ_ONCE():
> 
> 	time_before(j, READ_ONCE(rcu_state.gp_start) + patience)
> 
> > +	    rcu_nohz_full_cpu())
> >  		return 0;
> >  
> >  	/* Is the RCU core waiting for a quiescent state from this CPU? */
> > -	gp_in_progress = rcu_gp_in_progress();
> >  	if (rdp->core_needs_qs && !rdp->cpu_no_qs.b.norm && gp_in_progress)
> >  		return 1;
> >  
> > diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> > index 340bbefe5f652..174333d0e9507 100644
> > --- a/kernel/rcu/tree_plugin.h
> > +++ b/kernel/rcu/tree_plugin.h
> > @@ -93,6 +93,15 @@ static void __init rcu_bootup_announce_oddness(void)
> >  		pr_info("\tRCU debug GP init slowdown %d jiffies.\n", gp_init_delay);
> >  	if (gp_cleanup_delay)
> >  		pr_info("\tRCU debug GP cleanup slowdown %d jiffies.\n", gp_cleanup_delay);
> > +	if (nocb_patience_delay < 0) {
> > +		pr_info("\tRCU NOCB CPU patience negative (%d), resetting to zero.\n", nocb_patience_delay);
> > +		nocb_patience_delay = 0;
> > +	} else if (nocb_patience_delay > 5 * MSEC_PER_SEC) {
> > +		pr_info("\tRCU NOCB CPU patience too large (%d), resetting to %ld.\n", nocb_patience_delay, 5 * MSEC_PER_SEC);
> > +		nocb_patience_delay = 5 * MSEC_PER_SEC;
> > +	} else if (nocb_patience_delay) {
> 
> Here you suggest that we don't print if 'nocb_patience_delay == 0', 
> as it's the default behavior, right?
> 
> I think printing on 0 could be useful to check if the feature exists, even 
> though we are zeroing it, but this will probably add unnecessary verbosity.
> 
> > +		pr_info("\tRCU NOCB CPU patience set to %d milliseconds.\n", nocb_patience_delay);
> > +	}
> 
> Here I suppose something like this can take care of not needing to convert
> ms -> jiffies on every rcu_pending() call:
> 
> +	nocb_patience_delay = msecs_to_jiffies(nocb_patience_delay);
> 

Uh, there is more to it, actually. We need to make sure the user
understands that we are rounding the value down to a multiple of the
jiffy period, so it's not a surprise if the effective delay is not
exactly the value passed on the kernel cmdline.

So something like the diff below should be OK, as this behavior is
explained in the docs, and pr_info() will print the effective value.

What do you think?

Thanks!
Leo

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 0a3b0fd1910e..9a50be9fd9eb 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -4974,20 +4974,28 @@
                        otherwise be caused by callback floods through
                        use of the ->nocb_bypass list.  However, in the
                        common non-flooded case, RCU queues directly to
                        the main ->cblist in order to avoid the extra
                        overhead of the ->nocb_bypass list and its lock.
                        But if there are too many callbacks queued during
                        a single jiffy, RCU pre-queues the callbacks into
                        the ->nocb_bypass queue.  The definition of "too
                        many" is supplied by this kernel boot parameter.
 
+       rcutree.nocb_patience_delay= [KNL]
+                       On callback-offloaded (rcu_nocbs) CPUs, avoid
+                       disturbing RCU unless the grace period has
+                       reached the specified age in milliseconds.
+                       Defaults to zero.  Large values will be capped
+                       at five seconds.  Values are rounded down to a
+                       multiple of the jiffy period.
+
        rcutree.qhimark= [KNL]
                        Set threshold of queued RCU callbacks beyond which
                        batch limiting is disabled.
 
        rcutree.qlowmark= [KNL]
                        Set threshold of queued RCU callbacks below which
                        batch limiting is re-enabled.
 
        rcutree.qovld= [KNL]
                        Set threshold of queued RCU callbacks beyond which
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index fcf2b4aa3441..62ede401420f 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -512,20 +512,21 @@ do {                                                              \
        local_irq_save(flags);                                  \
        if (rcu_segcblist_is_offloaded(&(rdp)->cblist)) \
                raw_spin_lock(&(rdp)->nocb_lock);               \
 } while (0)
 #else /* #ifdef CONFIG_RCU_NOCB_CPU */
 #define rcu_nocb_lock_irqsave(rdp, flags) local_irq_save(flags)
 #endif /* #else #ifdef CONFIG_RCU_NOCB_CPU */
 
 static void rcu_bind_gp_kthread(void);
 static bool rcu_nohz_full_cpu(void);
+static bool rcu_on_patience_delay(void);
 
 /* Forward declarations for tree_stall.h */
 static void record_gp_stall_check_time(void);
 static void rcu_iw_handler(struct irq_work *iwp);
 static void check_cpu_stall(struct rcu_data *rdp);
 static void rcu_check_gp_start_stall(struct rcu_node *rnp, struct rcu_data *rdp,
                                     const unsigned long gpssdelay);
 
 /* Forward declarations for tree_exp.h. */
 static void sync_rcu_do_polled_gp(struct work_struct *wp);
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 340bbefe5f65..639243b0410f 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -5,20 +5,21 @@
  * or preemptible semantics.
  *
  * Copyright Red Hat, 2009
  * Copyright IBM Corporation, 2009
  *
  * Author: Ingo Molnar <mingo@elte.hu>
  *        Paul E. McKenney <paulmck@linux.ibm.com>
  */
 
 #include "../locking/rtmutex_common.h"
+#include <linux/jiffies.h>
 
 static bool rcu_rdp_is_offloaded(struct rcu_data *rdp)
 {
        /*
         * In order to read the offloaded state of an rdp in a safe
         * and stable way and prevent from its value to be changed
         * under us, we must either hold the barrier mutex, the cpu
         * hotplug lock (read or write) or the nocb lock. Local
         * non-preemptible reads are also safe. NOCB kthreads and
         * timers have their own means of synchronization against the
@@ -86,20 +87,33 @@ static void __init rcu_bootup_announce_oddness(void)
        if (rcu_kick_kthreads)
                pr_info("\tKick kthreads if too-long grace period.\n");
        if (IS_ENABLED(CONFIG_DEBUG_OBJECTS_RCU_HEAD))
                pr_info("\tRCU callback double-/use-after-free debug is enabled.\n");
        if (gp_preinit_delay)
                pr_info("\tRCU debug GP pre-init slowdown %d jiffies.\n", gp_preinit_delay);
        if (gp_init_delay)
                pr_info("\tRCU debug GP init slowdown %d jiffies.\n", gp_init_delay);
        if (gp_cleanup_delay)
                pr_info("\tRCU debug GP cleanup slowdown %d jiffies.\n", gp_cleanup_delay);
+       if (nocb_patience_delay < 0) {
+               pr_info("\tRCU NOCB CPU patience negative (%d), resetting to zero.\n",
+                       nocb_patience_delay);
+               nocb_patience_delay = 0;
+       } else if (nocb_patience_delay > 5 * MSEC_PER_SEC) {
+               pr_info("\tRCU NOCB CPU patience too large (%d), resetting to %ld.\n",
+                       nocb_patience_delay, 5 * MSEC_PER_SEC);
+               nocb_patience_delay = msecs_to_jiffies(5 * MSEC_PER_SEC);
+       } else if (nocb_patience_delay) {
+               nocb_patience_delay = msecs_to_jiffies(nocb_patience_delay);
+               pr_info("\tRCU NOCB CPU patience set to %d milliseconds.\n",
+                       jiffies_to_msecs(nocb_patience_delay));
+       }
        if (!use_softirq)
                pr_info("\tRCU_SOFTIRQ processing moved to rcuc kthreads.\n");
        if (IS_ENABLED(CONFIG_RCU_EQS_DEBUG))
                pr_info("\tRCU debug extended QS entry/exit.\n");
        rcupdate_announce_bootup_oddness();
 }
 
 #ifdef CONFIG_PREEMPT_RCU
 
 static void rcu_report_exp_rnp(struct rcu_node *rnp, bool wake);
@@ -1260,10 +1274,29 @@ static bool rcu_nohz_full_cpu(void)
 
 /*
  * Bind the RCU grace-period kthreads to the housekeeping CPU.
  */
 static void rcu_bind_gp_kthread(void)
 {
        if (!tick_nohz_full_enabled())
                return;
        housekeeping_affine(current, HK_TYPE_RCU);
 }
+
+/*
+ * Is this CPU a NO_HZ_FULL CPU that should ignore RCU if the time since the
+ * start of the current grace period is smaller than nocb_patience_delay?
+ *
+ * This code relies on the fact that all NO_HZ_FULL CPUs are also
+ * RCU_NOCB_CPU CPUs.
+ */
+static bool rcu_on_patience_delay(void)
+{
+#ifdef CONFIG_NO_HZ_FULL
+       if (!nocb_patience_delay)
+               return false;
+
+       if (time_before(jiffies, READ_ONCE(rcu_state.gp_start) + nocb_patience_delay))
+               return true;
+#endif /* #ifdef CONFIG_NO_HZ_FULL */
+       return false;
+}
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 7560e204198b..7a2d94370ab4 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -169,20 +169,22 @@ static int kthread_prio = IS_ENABLED(CONFIG_RCU_BOOST) ? 1 : 0;
 module_param(kthread_prio, int, 0444);
 
 /* Delay in jiffies for grace-period initialization delays, debug only. */
 
 static int gp_preinit_delay;
 module_param(gp_preinit_delay, int, 0444);
 static int gp_init_delay;
 module_param(gp_init_delay, int, 0444);
 static int gp_cleanup_delay;
 module_param(gp_cleanup_delay, int, 0444);
+static int nocb_patience_delay;
+module_param(nocb_patience_delay, int, 0444);
 
 // Add delay to rcu_read_unlock() for strict grace periods.
 static int rcu_unlock_delay;
 #ifdef CONFIG_RCU_STRICT_GRACE_PERIOD
 module_param(rcu_unlock_delay, int, 0444);
 #endif
 
 /*
  * This rcu parameter is runtime-read-only. It reflects
  * a minimum allowed number of objects which can be cached
@@ -4340,25 +4342,27 @@ static int rcu_pending(int user)
        lockdep_assert_irqs_disabled();
 
        /* Check for CPU stalls, if enabled. */
        check_cpu_stall(rdp);
 
        /* Does this CPU need a deferred NOCB wakeup? */
        if (rcu_nocb_need_deferred_wakeup(rdp, RCU_NOCB_WAKE))
                return 1;
 
        /* Is this a nohz_full CPU in userspace or idle?  (Ignore RCU if so.) */
-       if ((user || rcu_is_cpu_rrupt_from_idle()) && rcu_nohz_full_cpu())
+       gp_in_progress = rcu_gp_in_progress();
+       if ((user || rcu_is_cpu_rrupt_from_idle() ||
+            (gp_in_progress && rcu_on_patience_delay())) &&
+           rcu_nohz_full_cpu())
                return 0;
 
        /* Is the RCU core waiting for a quiescent state from this CPU? */
-       gp_in_progress = rcu_gp_in_progress();
        if (rdp->core_needs_qs && !rdp->cpu_no_qs.b.norm && gp_in_progress)
                return 1;
 
        /* Does this CPU have callbacks ready to invoke? */
        if (!rcu_rdp_is_offloaded(rdp) &&
            rcu_segcblist_ready_cbs(&rdp->cblist))
                return 1;
 
        /* Has RCU gone idle with this CPU needing another grace period? */
        if (!gp_in_progress && rcu_segcblist_is_enabled(&rdp->cblist) &&
Paul E. McKenney May 9, 2024, 10:41 p.m. UTC | #46
On Thu, May 09, 2024 at 05:16:57AM -0300, Leonardo Bras wrote:
> On Wed, May 08, 2024 at 08:32:40PM -0700, Paul E. McKenney wrote:
> > On Wed, May 08, 2024 at 07:01:29AM -0700, Sean Christopherson wrote:
> > > On Wed, May 08, 2024, Leonardo Bras wrote:
> > > > Something just hit me, and maybe I need to propose something more generic.
> > > 
> > > Yes.  This is what I was trying to get across with my complaints about keying off
> > > of the last VM-Exit time.  It's effectively a broad stroke "this task will likely
> > > be quiescent soon" and so the core concept/functionality belongs in common code,
> > > not KVM.
> > 
> > OK, we could do something like the following wholly within RCU, namely
> > to make rcu_pending() refrain from invoking rcu_core() until the grace
> > period is at least the specified age, defaulting to zero (and to the
> > current behavior).
> > 
> > Perhaps something like the patch shown below.
> 
> That's exactly what I was thinking :)
> 
> > 
> > Thoughts?
> 
> Some suggestions below:
> 
> > 
> > 							Thanx, Paul
> > 
> > ------------------------------------------------------------------------
> > 
> > commit abc7cd2facdebf85aa075c567321589862f88542
> > Author: Paul E. McKenney <paulmck@kernel.org>
> > Date:   Wed May 8 20:11:58 2024 -0700
> > 
> >     rcu: Add rcutree.nocb_patience_delay to reduce nohz_full OS jitter
> >     
> >     If a CPU is running either a userspace application or a guest OS in
> >     nohz_full mode, it is possible for a system call to occur just as an
> >     RCU grace period is starting.  If that CPU also has the scheduling-clock
> >     tick enabled for any reason (such as a second runnable task), and if the
> >     system was booted with rcutree.use_softirq=0, then RCU can add insult to
> >     injury by awakening that CPU's rcuc kthread, resulting in yet another
> >     task and yet more OS jitter due to switching to that task, running it,
> >     and switching back.
> >     
> >     In addition, in the common case where that system call is not of
> >     excessively long duration, awakening the rcuc task is pointless.
> >     This pointlessness is due to the fact that the CPU will enter an extended
> >     quiescent state upon returning to the userspace application or guest OS.
> >     In this case, the rcuc kthread cannot do anything that the main RCU
> >     grace-period kthread cannot do on its behalf, at least if it is given
> >     a few additional milliseconds (for example, given the time duration
> >     specified by rcutree.jiffies_till_first_fqs, give or take scheduling
> >     delays).
> >     
> >     This commit therefore adds a rcutree.nocb_patience_delay kernel boot
> >     parameter that specifies the grace period age (in milliseconds)
> >     before which RCU will refrain from awakening the rcuc kthread.
> >     Preliminary experimentation suggests a value of 1000, that is,
> >     one second.  Increasing rcutree.nocb_patience_delay will increase
> >     grace-period latency and in turn increase memory footprint, so systems
> >     with constrained memory might choose a smaller value.  Systems with
> >     less-aggressive OS-jitter requirements might choose the default value
> >     of zero, which keeps the traditional immediate-wakeup behavior, thus
> >     avoiding increases in grace-period latency.
> >     
> >     Link: https://lore.kernel.org/all/20240328171949.743211-1-leobras@redhat.com/
> >     
> >     Reported-by: Leonardo Bras <leobras@redhat.com>
> >     Suggested-by: Leonardo Bras <leobras@redhat.com>
> >     Suggested-by: Sean Christopherson <seanjc@google.com>
> >     Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
> > 
> > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> > index 0a3b0fd1910e6..42383986e692b 100644
> > --- a/Documentation/admin-guide/kernel-parameters.txt
> > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > @@ -4981,6 +4981,13 @@
> >  			the ->nocb_bypass queue.  The definition of "too
> >  			many" is supplied by this kernel boot parameter.
> >  
> > +	rcutree.nocb_patience_delay= [KNL]
> > +			On callback-offloaded (rcu_nocbs) CPUs, avoid
> > +			disturbing RCU unless the grace period has
> > +			reached the specified age in milliseconds.
> > +			Defaults to zero.  Large values will be capped
> > +			at five seconds.
> > +
> >  	rcutree.qhimark= [KNL]
> >  			Set threshold of queued RCU callbacks beyond which
> >  			batch limiting is disabled.
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index 7560e204198bb..6e4b8b43855a0 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -176,6 +176,8 @@ static int gp_init_delay;
> >  module_param(gp_init_delay, int, 0444);
> >  static int gp_cleanup_delay;
> >  module_param(gp_cleanup_delay, int, 0444);
> > +static int nocb_patience_delay;
> > +module_param(nocb_patience_delay, int, 0444);
> >  
> >  // Add delay to rcu_read_unlock() for strict grace periods.
> >  static int rcu_unlock_delay;
> > @@ -4334,6 +4336,8 @@ EXPORT_SYMBOL_GPL(cond_synchronize_rcu_full);
> >  static int rcu_pending(int user)
> >  {
> >  	bool gp_in_progress;
> > +	unsigned long j = jiffies;
> 
> I think this is probably taken care of by the compiler, but just in case I would move the 
> j = jiffies;
> closer to its use, in order to avoid reading 'jiffies' if rcu_pending() 
> exits before the nohz_full testing.

Good point!  I just removed j and used jiffies directly.

> > +	unsigned int patience = msecs_to_jiffies(nocb_patience_delay);
> 
> What do you think of processing the new parameter at boot, and saving it 
> in terms of jiffies already? 
> 
> It would make it unnecessary to convert ms -> jiffies every time we run 
> rcu_pending.
> 
> (Out-of-order execution will probably hide the extra division, but it may 
> still have some impact on some arches)

This isn't exactly a fastpath, but it is easy enough to do the conversion
in rcu_bootup_announce_oddness() and place it into another variable
(for the benefit of those using drgn or going through crash dumps).
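As a rough userspace sketch of that idea (the `HZ` value, the variable names, and the simplified `msecs_to_jiffies()` below are illustrative stand-ins for the kernel's, not the actual implementation):

```c
#include <assert.h>

#define HZ 250  /* illustrative; the kernel's HZ is a build-time config option */

/* Simplified stand-in for the kernel's msecs_to_jiffies(); the real one
 * has extra cases for HZ values that do not divide 1000 evenly. */
static unsigned long msecs_to_jiffies(unsigned int ms)
{
	return ((unsigned long)ms * HZ + 999) / 1000;
}

static int nocb_patience_delay;                    /* module parameter, in ms */
static unsigned long nocb_patience_delay_jiffies;  /* derived once at boot */

/* Convert once at boot time, so rcu_pending() never needs to divide. */
static void convert_patience_at_boot(void)
{
	nocb_patience_delay_jiffies = msecs_to_jiffies(nocb_patience_delay);
}
```

Keeping both variables around is what makes the converted value visible to drgn or a crash-dump reader alongside the raw module parameter.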

> >  	struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
> >  	struct rcu_node *rnp = rdp->mynode;
> >  
> > @@ -4347,11 +4351,13 @@ static int rcu_pending(int user)
> >  		return 1;
> >  
> >  	/* Is this a nohz_full CPU in userspace or idle?  (Ignore RCU if so.) */
> > -	if ((user || rcu_is_cpu_rrupt_from_idle()) && rcu_nohz_full_cpu())
> > +	gp_in_progress = rcu_gp_in_progress();
> > +	if ((user || rcu_is_cpu_rrupt_from_idle() ||
> > +	     (gp_in_progress && time_before(j + patience, rcu_state.gp_start))) &&
> 
> I think you meant:
> 	time_before(j, rcu_state.gp_start + patience)
> 
> or else this always fails, as 'now' can never come before a previously 
> started gp, right?
> 
> Also, as per rcu_nohz_full_cpu() we probably need it to be read with 
> READ_ONCE():
> 
> 	time_before(j, READ_ONCE(rcu_state.gp_start) + patience)

Good catch on both counts, fixed!
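For readers following along, the two forms differ because `time_before()` is a wraparound-safe signed comparison. A minimal userspace model (the helper names are made up for illustration; only the `time_before()` macro mirrors `include/linux/jiffies.h`):

```c
#include <assert.h>

/* Wraparound-safe "a happens before b", as in include/linux/jiffies.h. */
#define time_before(a, b) ((long)((a) - (b)) < 0)

/* Corrected form: patient while the GP is younger than 'patience' jiffies. */
static int gp_is_young(unsigned long now, unsigned long gp_start,
		       unsigned long patience)
{
	return time_before(now, gp_start + patience);
}

/* Original form: with now >= gp_start this can never be true. */
static int gp_is_young_buggy(unsigned long now, unsigned long gp_start,
			     unsigned long patience)
{
	return time_before(now + patience, gp_start);
}
```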

> > +	    rcu_nohz_full_cpu())
> >  		return 0;
> >  
> >  	/* Is the RCU core waiting for a quiescent state from this CPU? */
> > -	gp_in_progress = rcu_gp_in_progress();
> >  	if (rdp->core_needs_qs && !rdp->cpu_no_qs.b.norm && gp_in_progress)
> >  		return 1;
> >  
> > diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> > index 340bbefe5f652..174333d0e9507 100644
> > --- a/kernel/rcu/tree_plugin.h
> > +++ b/kernel/rcu/tree_plugin.h
> > @@ -93,6 +93,15 @@ static void __init rcu_bootup_announce_oddness(void)
> >  		pr_info("\tRCU debug GP init slowdown %d jiffies.\n", gp_init_delay);
> >  	if (gp_cleanup_delay)
> >  		pr_info("\tRCU debug GP cleanup slowdown %d jiffies.\n", gp_cleanup_delay);
> > +	if (nocb_patience_delay < 0) {
> > +		pr_info("\tRCU NOCB CPU patience negative (%d), resetting to zero.\n", nocb_patience_delay);
> > +		nocb_patience_delay = 0;
> > +	} else if (nocb_patience_delay > 5 * MSEC_PER_SEC) {
> > +		pr_info("\tRCU NOCB CPU patience too large (%d), resetting to %ld.\n", nocb_patience_delay, 5 * MSEC_PER_SEC);
> > +		nocb_patience_delay = 5 * MSEC_PER_SEC;
> > +	} else if (nocb_patience_delay) {
> 
> Here you suggest that we don't print if 'nocb_patience_delay == 0', 
> as it's the default behavior, right?

Exactly, in keeping with the function name rcu_bootup_announce_oddness().

This approach allows easy spotting of deviations from default settings,
which can be very helpful when debugging.

> I think printing on 0 could be useful to check if the feature exists, even 
> though we are zeroing it, but this will probably add unnecessary verbosity.

It could be quite useful to people learning the RCU implementation,
and I encourage those people to remove all those "if" statements from
rcu_bootup_announce_oddness() in order to get the full story.

> > +		pr_info("\tRCU NOCB CPU patience set to %d milliseconds.\n", nocb_patience_delay);
> > +	}
> 
> Here I suppose something like this can take care of not needing to convert 
> ms -> jiffies on every rcu_pending() call:
> 
> +	nocb_patience_delay = msecs_to_jiffies(nocb_patience_delay);

Agreed, but I used a separate variable to help people looking at crash
dumps or using drgn.

And thank you for your review and comments!  Applying these changes
with attribution.

							Thanx, Paul

> >  	if (!use_softirq)
> >  		pr_info("\tRCU_SOFTIRQ processing moved to rcuc kthreads.\n");
> >  	if (IS_ENABLED(CONFIG_RCU_EQS_DEBUG))
> > 
> 
> 
> Thanks!
> Leo
>
Leonardo Bras May 9, 2024, 11:07 p.m. UTC | #47
On Thu, May 9, 2024 at 7:44 PM Paul E. McKenney <paulmck@kernel.org> wrote:
>
> On Thu, May 09, 2024 at 05:16:57AM -0300, Leonardo Bras wrote:
> > On Wed, May 08, 2024 at 08:32:40PM -0700, Paul E. McKenney wrote:
> > > On Wed, May 08, 2024 at 07:01:29AM -0700, Sean Christopherson wrote:
> > > > On Wed, May 08, 2024, Leonardo Bras wrote:
> > > > > Something just hit me, and maybe I need to propose something more generic.
> > > >
> > > > Yes.  This is what I was trying to get across with my complaints about keying off
> > > > of the last VM-Exit time.  It's effectively a broad stroke "this task will likely
> > > > be quiescent soon" and so the core concept/functionality belongs in common code,
> > > > not KVM.
> > >
> > > OK, we could do something like the following wholly within RCU, namely
> > > to make rcu_pending() refrain from invoking rcu_core() until the grace
> > > period is at least the specified age, defaulting to zero (and to the
> > > current behavior).
> > >
> > > Perhaps something like the patch shown below.
> >
> > That's exactly what I was thinking :)
> >
> > >
> > > Thoughts?
> >
> > Some suggestions below:
> >
> > >
> > >                                                     Thanx, Paul
> > >
> > > ------------------------------------------------------------------------
> > >
> > > commit abc7cd2facdebf85aa075c567321589862f88542
> > > Author: Paul E. McKenney <paulmck@kernel.org>
> > > Date:   Wed May 8 20:11:58 2024 -0700
> > >
> > >     rcu: Add rcutree.nocb_patience_delay to reduce nohz_full OS jitter
> > >
> > >     If a CPU is running either a userspace application or a guest OS in
> > >     nohz_full mode, it is possible for a system call to occur just as an
> > >     RCU grace period is starting.  If that CPU also has the scheduling-clock
> > >     tick enabled for any reason (such as a second runnable task), and if the
> > >     system was booted with rcutree.use_softirq=0, then RCU can add insult to
> > >     injury by awakening that CPU's rcuc kthread, resulting in yet another
> > >     task and yet more OS jitter due to switching to that task, running it,
> > >     and switching back.
> > >
> > >     In addition, in the common case where that system call is not of
> > >     excessively long duration, awakening the rcuc task is pointless.
> > >     This pointlessness is due to the fact that the CPU will enter an extended
> > >     quiescent state upon returning to the userspace application or guest OS.
> > >     In this case, the rcuc kthread cannot do anything that the main RCU
> > >     grace-period kthread cannot do on its behalf, at least if it is given
> > >     a few additional milliseconds (for example, given the time duration
> > >     specified by rcutree.jiffies_till_first_fqs, give or take scheduling
> > >     delays).
> > >
> > >     This commit therefore adds a rcutree.nocb_patience_delay kernel boot
> > >     parameter that specifies the grace period age (in milliseconds)
> > >     before which RCU will refrain from awakening the rcuc kthread.
> > >     Preliminary experimentation suggests a value of 1000, that is,
> > >     one second.  Increasing rcutree.nocb_patience_delay will increase
> > >     grace-period latency and in turn increase memory footprint, so systems
> > >     with constrained memory might choose a smaller value.  Systems with
> > >     less-aggressive OS-jitter requirements might choose the default value
> > >     of zero, which keeps the traditional immediate-wakeup behavior, thus
> > >     avoiding increases in grace-period latency.
> > >
> > >     Link: https://lore.kernel.org/all/20240328171949.743211-1-leobras@redhat.com/
> > >
> > >     Reported-by: Leonardo Bras <leobras@redhat.com>
> > >     Suggested-by: Leonardo Bras <leobras@redhat.com>
> > >     Suggested-by: Sean Christopherson <seanjc@google.com>
> > >     Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
> > >
> > > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> > > index 0a3b0fd1910e6..42383986e692b 100644
> > > --- a/Documentation/admin-guide/kernel-parameters.txt
> > > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > > @@ -4981,6 +4981,13 @@
> > >                     the ->nocb_bypass queue.  The definition of "too
> > >                     many" is supplied by this kernel boot parameter.
> > >
> > > +   rcutree.nocb_patience_delay= [KNL]
> > > +                   On callback-offloaded (rcu_nocbs) CPUs, avoid
> > > +                   disturbing RCU unless the grace period has
> > > +                   reached the specified age in milliseconds.
> > > +                   Defaults to zero.  Large values will be capped
> > > +                   at five seconds.
> > > +
> > >     rcutree.qhimark= [KNL]
> > >                     Set threshold of queued RCU callbacks beyond which
> > >                     batch limiting is disabled.
> > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > index 7560e204198bb..6e4b8b43855a0 100644
> > > --- a/kernel/rcu/tree.c
> > > +++ b/kernel/rcu/tree.c
> > > @@ -176,6 +176,8 @@ static int gp_init_delay;
> > >  module_param(gp_init_delay, int, 0444);
> > >  static int gp_cleanup_delay;
> > >  module_param(gp_cleanup_delay, int, 0444);
> > > +static int nocb_patience_delay;
> > > +module_param(nocb_patience_delay, int, 0444);
> > >
> > >  // Add delay to rcu_read_unlock() for strict grace periods.
> > >  static int rcu_unlock_delay;
> > > @@ -4334,6 +4336,8 @@ EXPORT_SYMBOL_GPL(cond_synchronize_rcu_full);
> > >  static int rcu_pending(int user)
> > >  {
> > >     bool gp_in_progress;
> > > +   unsigned long j = jiffies;
> >
> > I think this is probably taken care of by the compiler, but just in case I would move the
> > j = jiffies;
> > closer to its use, in order to avoid reading 'jiffies' if rcu_pending()
> > exits before the nohz_full testing.
>
> Good point!  I just removed j and used jiffies directly.
>
> > > +   unsigned int patience = msecs_to_jiffies(nocb_patience_delay);
> >
> > What do you think of processing the new parameter at boot, and saving it
> > in terms of jiffies already?
> >
> > It would make it unnecessary to convert ms -> jiffies every time we run
> > rcu_pending.
> >
> > (Out-of-order execution will probably hide the extra division, but it may
> > still have some impact on some arches)
>
> This isn't exactly a fastpath, but it is easy enough to do the conversion
> in rcu_bootup_announce_oddness() and place it into another variable
> (for the benefit of those using drgn or going through crash dumps).
>
> > >     struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
> > >     struct rcu_node *rnp = rdp->mynode;
> > >
> > > @@ -4347,11 +4351,13 @@ static int rcu_pending(int user)
> > >             return 1;
> > >
> > >     /* Is this a nohz_full CPU in userspace or idle?  (Ignore RCU if so.) */
> > > -   if ((user || rcu_is_cpu_rrupt_from_idle()) && rcu_nohz_full_cpu())
> > > +   gp_in_progress = rcu_gp_in_progress();
> > > +   if ((user || rcu_is_cpu_rrupt_from_idle() ||
> > > +        (gp_in_progress && time_before(j + patience, rcu_state.gp_start))) &&
> >
> > I think you meant:
> >       time_before(j, rcu_state.gp_start + patience)
> >
> > or else this always fails, as 'now' can never come before a previously
> > started gp, right?
> >
> > Also, as per rcu_nohz_full_cpu() we probably need it to be read with
> > READ_ONCE():
> >
> >       time_before(j, READ_ONCE(rcu_state.gp_start) + patience)
>
> Good catch on both counts, fixed!
>
> > > +       rcu_nohz_full_cpu())
> > >             return 0;
> > >
> > >     /* Is the RCU core waiting for a quiescent state from this CPU? */
> > > -   gp_in_progress = rcu_gp_in_progress();
> > >     if (rdp->core_needs_qs && !rdp->cpu_no_qs.b.norm && gp_in_progress)
> > >             return 1;
> > >
> > > diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> > > index 340bbefe5f652..174333d0e9507 100644
> > > --- a/kernel/rcu/tree_plugin.h
> > > +++ b/kernel/rcu/tree_plugin.h
> > > @@ -93,6 +93,15 @@ static void __init rcu_bootup_announce_oddness(void)
> > >             pr_info("\tRCU debug GP init slowdown %d jiffies.\n", gp_init_delay);
> > >     if (gp_cleanup_delay)
> > >             pr_info("\tRCU debug GP cleanup slowdown %d jiffies.\n", gp_cleanup_delay);
> > > +   if (nocb_patience_delay < 0) {
> > > +           pr_info("\tRCU NOCB CPU patience negative (%d), resetting to zero.\n", nocb_patience_delay);
> > > +           nocb_patience_delay = 0;
> > > +   } else if (nocb_patience_delay > 5 * MSEC_PER_SEC) {
> > > +           pr_info("\tRCU NOCB CPU patience too large (%d), resetting to %ld.\n", nocb_patience_delay, 5 * MSEC_PER_SEC);
> > > +           nocb_patience_delay = 5 * MSEC_PER_SEC;
> > > +   } else if (nocb_patience_delay) {
> >
> > Here you suggest that we don't print if 'nocb_patience_delay == 0',
> > as it's the default behavior, right?
>
> Exactly, in keeping with the function name rcu_bootup_announce_oddness().
>
> This approach allows easy spotting of deviations from default settings,
> which can be very helpful when debugging.
>
> > I think printing on 0 could be useful to check if the feature exists, even
> > though we are zeroing it, but this will probably add unnecessary verbosity.
>
> It could be quite useful to people learning the RCU implementation,
> and I encourage those people to remove all those "if" statements from
> rcu_bootup_announce_oddness() in order to get the full story.
>
> > > +           pr_info("\tRCU NOCB CPU patience set to %d milliseconds.\n", nocb_patience_delay);
> > > +   }
> >
> > Here I suppose something like this can take care of not needing to convert
> > ms -> jiffies on every rcu_pending() call:
> >
> > +     nocb_patience_delay = msecs_to_jiffies(nocb_patience_delay);
>
> Agreed, but I used a separate variable to help people looking at crash
> dumps or using drgn.
>
> And thank you for your review and comments!  Applying these changes
> with attribution.
>

Thank you!
Leo

>                                                         Thanx, Paul
>
> > >     if (!use_softirq)
> > >             pr_info("\tRCU_SOFTIRQ processing moved to rcuc kthreads.\n");
> > >     if (IS_ENABLED(CONFIG_RCU_EQS_DEBUG))
> > >
> >
> >
> > Thanks!
> > Leo
> >
>
Paul E. McKenney May 9, 2024, 11:45 p.m. UTC | #48
On Thu, May 09, 2024 at 07:14:18AM -0300, Leonardo Bras wrote:
> On Thu, May 09, 2024 at 05:16:57AM -0300, Leonardo Bras wrote:

[ . . . ]

> > Here I suppose something like this can take care of not needing to convert 
> > ms -> jiffies on every rcu_pending() call:
> > 
> > +	nocb_patience_delay = msecs_to_jiffies(nocb_patience_delay);
> > 
> 
> Uh, there is more to it, actually. We need to make sure the user 
> understands that we are rounding down the value to a multiple of a jiffy 
> period, so it's not a surprise if the delay value is not exactly the same 
> as the one passed on the kernel cmdline.
> 
> So something like the diff below should be ok, as this behavior is explained 
> in the docs, and pr_info() will print the effective value.
> 
> What do you think?

Good point, and I have taken your advice on making the documentation
say what it does.
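As a concrete illustration of why printing the effective value matters, here is a userspace model of the round trip (the `HZ` choice and the simplified conversions are illustrative stand-ins; whether a given value rounds up or down depends on the kernel's exact conversion, the point is only that it can change):

```c
#include <assert.h>

#define HZ 250  /* illustrative config choice: one jiffy = 4 ms */

static unsigned long msecs_to_jiffies(unsigned int ms)
{
	return ((unsigned long)ms * HZ + 999) / 1000;  /* simplified */
}

static unsigned int jiffies_to_msecs(unsigned long j)
{
	return j * (1000 / HZ);
}

/* The delay the user asked for vs. the delay RCU will actually honor. */
static unsigned int effective_patience_ms(unsigned int requested_ms)
{
	return jiffies_to_msecs(msecs_to_jiffies(requested_ms));
}
```

With HZ=250, a request of 1000 ms comes back unchanged, but a request that is not a multiple of the jiffy period does not.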

> Thanks!
> Leo
> 
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index 0a3b0fd1910e..9a50be9fd9eb 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -4974,20 +4974,28 @@
>                         otherwise be caused by callback floods through
>                         use of the ->nocb_bypass list.  However, in the
>                         common non-flooded case, RCU queues directly to
>                         the main ->cblist in order to avoid the extra
>                         overhead of the ->nocb_bypass list and its lock.
>                         But if there are too many callbacks queued during
>                         a single jiffy, RCU pre-queues the callbacks into
>                         the ->nocb_bypass queue.  The definition of "too
>                         many" is supplied by this kernel boot parameter.
>  
> +       rcutree.nocb_patience_delay= [KNL]
> +                       On callback-offloaded (rcu_nocbs) CPUs, avoid
> +                       disturbing RCU unless the grace period has
> +                       reached the specified age in milliseconds.
> +                       Defaults to zero.  Large values will be capped
>                        at five seconds.  Values are rounded down to
>                        a multiple of a jiffy period.
> +
>         rcutree.qhimark= [KNL]
>                         Set threshold of queued RCU callbacks beyond which
>                         batch limiting is disabled.
>  
>         rcutree.qlowmark= [KNL]
>                         Set threshold of queued RCU callbacks below which
>                         batch limiting is re-enabled.
>  
>         rcutree.qovld= [KNL]
>                         Set threshold of queued RCU callbacks beyond which
> diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
> index fcf2b4aa3441..62ede401420f 100644
> --- a/kernel/rcu/tree.h
> +++ b/kernel/rcu/tree.h
> @@ -512,20 +512,21 @@ do {                                                              \
>         local_irq_save(flags);                                  \
>         if (rcu_segcblist_is_offloaded(&(rdp)->cblist)) \
>                 raw_spin_lock(&(rdp)->nocb_lock);               \
>  } while (0)
>  #else /* #ifdef CONFIG_RCU_NOCB_CPU */
>  #define rcu_nocb_lock_irqsave(rdp, flags) local_irq_save(flags)
>  #endif /* #else #ifdef CONFIG_RCU_NOCB_CPU */
>  
>  static void rcu_bind_gp_kthread(void);
>  static bool rcu_nohz_full_cpu(void);
> +static bool rcu_on_patience_delay(void);

I don't think we need an access function, but will check below.

>  /* Forward declarations for tree_stall.h */
>  static void record_gp_stall_check_time(void);
>  static void rcu_iw_handler(struct irq_work *iwp);
>  static void check_cpu_stall(struct rcu_data *rdp);
>  static void rcu_check_gp_start_stall(struct rcu_node *rnp, struct rcu_data *rdp,
>                                      const unsigned long gpssdelay);
>  
>  /* Forward declarations for tree_exp.h. */
>  static void sync_rcu_do_polled_gp(struct work_struct *wp);
> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> index 340bbefe5f65..639243b0410f 100644
> --- a/kernel/rcu/tree_plugin.h
> +++ b/kernel/rcu/tree_plugin.h
> @@ -5,20 +5,21 @@
>   * or preemptible semantics.
>   *
>   * Copyright Red Hat, 2009
>   * Copyright IBM Corporation, 2009
>   *
>   * Author: Ingo Molnar <mingo@elte.hu>
>   *        Paul E. McKenney <paulmck@linux.ibm.com>
>   */
>  
>  #include "../locking/rtmutex_common.h"
> +#include <linux/jiffies.h>

This is already pulled in by the enclosing tree.c file, so it should not
be necessary to include it again.  (Or did you get a build failure when
leaving this out?)

>  static bool rcu_rdp_is_offloaded(struct rcu_data *rdp)
>  {
>         /*
>          * In order to read the offloaded state of an rdp in a safe
>          * and stable way and prevent from its value to be changed
>          * under us, we must either hold the barrier mutex, the cpu
>          * hotplug lock (read or write) or the nocb lock. Local
>          * non-preemptible reads are also safe. NOCB kthreads and
>          * timers have their own means of synchronization against the
> @@ -86,20 +87,33 @@ static void __init rcu_bootup_announce_oddness(void)
>         if (rcu_kick_kthreads)
>                 pr_info("\tKick kthreads if too-long grace period.\n");
>         if (IS_ENABLED(CONFIG_DEBUG_OBJECTS_RCU_HEAD))
>                 pr_info("\tRCU callback double-/use-after-free debug is enabled.\n");
>         if (gp_preinit_delay)
>                 pr_info("\tRCU debug GP pre-init slowdown %d jiffies.\n", gp_preinit_delay);
>         if (gp_init_delay)
>                 pr_info("\tRCU debug GP init slowdown %d jiffies.\n", gp_init_delay);
>         if (gp_cleanup_delay)
>                 pr_info("\tRCU debug GP cleanup slowdown %d jiffies.\n", gp_cleanup_delay);
> +       if (nocb_patience_delay < 0) {
> +               pr_info("\tRCU NOCB CPU patience negative (%d), resetting to zero.\n",
> +                       nocb_patience_delay);
> +               nocb_patience_delay = 0;
> +       } else if (nocb_patience_delay > 5 * MSEC_PER_SEC) {
> +               pr_info("\tRCU NOCB CPU patience too large (%d), resetting to %ld.\n",
> +                       nocb_patience_delay, 5 * MSEC_PER_SEC);
> +               nocb_patience_delay = msecs_to_jiffies(5 * MSEC_PER_SEC);
> +       } else if (nocb_patience_delay) {
> +               nocb_patience_delay = msecs_to_jiffies(nocb_patience_delay);
> +               pr_info("\tRCU NOCB CPU patience set to %d milliseconds.\n",
> +                       jiffies_to_msecs(nocb_patience_delay));
> +       }

I just did this here at the end:

	nocb_patience_delay_jiffies = msecs_to_jiffies(nocb_patience_delay);

Ah, you are wanting to print out the milliseconds after the rounding
to jiffies.

I am going to hold off on that for the moment, but I hear your request
and I have not yet said "no".  ;-)

>         if (!use_softirq)
>                 pr_info("\tRCU_SOFTIRQ processing moved to rcuc kthreads.\n");
>         if (IS_ENABLED(CONFIG_RCU_EQS_DEBUG))
>                 pr_info("\tRCU debug extended QS entry/exit.\n");
>         rcupdate_announce_bootup_oddness();
>  }
>  
>  #ifdef CONFIG_PREEMPT_RCU
>  
>  static void rcu_report_exp_rnp(struct rcu_node *rnp, bool wake);
> @@ -1260,10 +1274,29 @@ static bool rcu_nohz_full_cpu(void)
>  
>  /*
>   * Bind the RCU grace-period kthreads to the housekeeping CPU.
>   */
>  static void rcu_bind_gp_kthread(void)
>  {
>         if (!tick_nohz_full_enabled())
>                 return;
>         housekeeping_affine(current, HK_TYPE_RCU);
>  }
> +
> +/*
> + * Is this CPU a NO_HZ_FULL CPU that should ignore RCU if the time since the
> + * start of the current grace period is smaller than nocb_patience_delay?
> + *
> + * This code relies on the fact that all NO_HZ_FULL CPUs are also
> + * RCU_NOCB_CPU CPUs.
> + */
> +static bool rcu_on_patience_delay(void)
> +{
> +#ifdef CONFIG_NO_HZ_FULL

You lost me on this one.  Why do we need the #ifdef instead of
IS_ENABLED()?  Also, please note that rcu_nohz_full_cpu() is already a
compile-time @false in CONFIG_NO_HZ_FULL=n kernels.

> +       if (!nocb_patience_delay)
> +               return false;

We get this automatically with the comparison below, right?  If so, we
are not gaining much by creating the helper function.  Or am I missing
some trick here?

							Thanx, Paul

> +       if (time_before(jiffies, READ_ONCE(rcu_state.gp_start) + nocb_patience_delay))
> +               return true;
> +#endif /* #ifdef CONFIG_NO_HZ_FULL */
> +       return false;
> +}
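Paul's two comments above (prefer `IS_ENABLED()` over `#ifdef`, and let the comparison subsume the explicit zero check) could be sketched together like this userspace model, where the config macro, the state variables, and the mocked `IS_ENABLED()` are simplified stand-ins for the kernel's:

```c
#include <assert.h>

/* Mocked config machinery; the kernel's IS_ENABLED() works on CONFIG_*
 * symbols and lets the compiler drop the branch when the option is off. */
#define CONFIG_NO_HZ_FULL 1
#define IS_ENABLED(option) (option)

#define time_before(a, b) ((long)((a) - (b)) < 0)

static unsigned long gp_start;                     /* mock rcu_state.gp_start */
static unsigned long nocb_patience_delay_jiffies;  /* already in jiffies */

/* Sketch of the helper without #ifdef and without the explicit zero check:
 * with a zero patience, time_before(now, gp_start) is false whenever the
 * current GP started at or before 'now', so the delay == 0 case falls out. */
static int rcu_on_patience_delay(unsigned long now)
{
	if (!IS_ENABLED(CONFIG_NO_HZ_FULL))
		return 0;
	return time_before(now, gp_start + nocb_patience_delay_jiffies);
}
```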
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 7560e204198b..7a2d94370ab4 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -169,20 +169,22 @@ static int kthread_prio = IS_ENABLED(CONFIG_RCU_BOOST) ? 1 : 0;
>  module_param(kthread_prio, int, 0444);
>  
>  /* Delay in jiffies for grace-period initialization delays, debug only. */
>  
>  static int gp_preinit_delay;
>  module_param(gp_preinit_delay, int, 0444);
>  static int gp_init_delay;
>  module_param(gp_init_delay, int, 0444);
>  static int gp_cleanup_delay;
>  module_param(gp_cleanup_delay, int, 0444);
> +static int nocb_patience_delay;
> +module_param(nocb_patience_delay, int, 0444);
>  
>  // Add delay to rcu_read_unlock() for strict grace periods.
>  static int rcu_unlock_delay;
>  #ifdef CONFIG_RCU_STRICT_GRACE_PERIOD
>  module_param(rcu_unlock_delay, int, 0444);
>  #endif
>  
>  /*
>   * This rcu parameter is runtime-read-only. It reflects
>   * a minimum allowed number of objects which can be cached
> @@ -4340,25 +4342,27 @@ static int rcu_pending(int user)
>         lockdep_assert_irqs_disabled();
>  
>         /* Check for CPU stalls, if enabled. */
>         check_cpu_stall(rdp);
>  
>         /* Does this CPU need a deferred NOCB wakeup? */
>         if (rcu_nocb_need_deferred_wakeup(rdp, RCU_NOCB_WAKE))
>                 return 1;
>  
>         /* Is this a nohz_full CPU in userspace or idle?  (Ignore RCU if so.) */
> -       if ((user || rcu_is_cpu_rrupt_from_idle()) && rcu_nohz_full_cpu())
> +       gp_in_progress = rcu_gp_in_progress();
> +       if ((user || rcu_is_cpu_rrupt_from_idle() ||
> +            (gp_in_progress && rcu_on_patience_delay())) &&
> +           rcu_nohz_full_cpu())
>                 return 0;
>  
>         /* Is the RCU core waiting for a quiescent state from this CPU? */
> -       gp_in_progress = rcu_gp_in_progress();
>         if (rdp->core_needs_qs && !rdp->cpu_no_qs.b.norm && gp_in_progress)
>                 return 1;
>  
>         /* Does this CPU have callbacks ready to invoke? */
>         if (!rcu_rdp_is_offloaded(rdp) &&
>             rcu_segcblist_ready_cbs(&rdp->cblist))
>                 return 1;
>  
>         /* Has RCU gone idle with this CPU needing another grace period? */
>         if (!gp_in_progress && rcu_segcblist_is_enabled(&rdp->cblist) &&
> 
> 
>
Leonardo Bras May 10, 2024, 4:06 p.m. UTC | #49
On Thu, May 09, 2024 at 04:45:53PM -0700, Paul E. McKenney wrote:
> On Thu, May 09, 2024 at 07:14:18AM -0300, Leonardo Bras wrote:
> > On Thu, May 09, 2024 at 05:16:57AM -0300, Leonardo Bras wrote:
> 
> [ . . . ]
> 
> > > Here I suppose something like this can take care of not needing to convert 
> > > ms -> jiffies on every rcu_pending() call:
> > > 
> > > +	nocb_patience_delay = msecs_to_jiffies(nocb_patience_delay);
> > > 
> > 
> > Uh, there is more to it, actually. We need to make sure the user 
> > understands that we are rounding down the value to a multiple of a jiffy 
> > period, so it's not a surprise if the delay value is not exactly the same 
> > as the one passed on the kernel cmdline.
> > 
> > So something like the diff below should be ok, as this behavior is explained 
> > in the docs, and pr_info() will print the effective value.
> > 
> > What do you think?
> 
> Good point, and I have taken your advice on making the documentation
> say what it does.

Thanks :)

> 
> > Thanks!
> > Leo
> > 
> > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> > index 0a3b0fd1910e..9a50be9fd9eb 100644
> > --- a/Documentation/admin-guide/kernel-parameters.txt
> > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > @@ -4974,20 +4974,28 @@
> >                         otherwise be caused by callback floods through
> >                         use of the ->nocb_bypass list.  However, in the
> >                         common non-flooded case, RCU queues directly to
> >                         the main ->cblist in order to avoid the extra
> >                         overhead of the ->nocb_bypass list and its lock.
> >                         But if there are too many callbacks queued during
> >                         a single jiffy, RCU pre-queues the callbacks into
> >                         the ->nocb_bypass queue.  The definition of "too
> >                         many" is supplied by this kernel boot parameter.
> >  
> > +       rcutree.nocb_patience_delay= [KNL]
> > +                       On callback-offloaded (rcu_nocbs) CPUs, avoid
> > +                       disturbing RCU unless the grace period has
> > +                       reached the specified age in milliseconds.
> > +                       Defaults to zero.  Large values will be capped
> > +                       at five seconds.  Values are rounded down to
> > +                       a multiple of a jiffy period.
> > +
> >         rcutree.qhimark= [KNL]
> >                         Set threshold of queued RCU callbacks beyond which
> >                         batch limiting is disabled.
> >  
> >         rcutree.qlowmark= [KNL]
> >                         Set threshold of queued RCU callbacks below which
> >                         batch limiting is re-enabled.
> >  
> >         rcutree.qovld= [KNL]
> >                         Set threshold of queued RCU callbacks beyond which
> > diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
> > index fcf2b4aa3441..62ede401420f 100644
> > --- a/kernel/rcu/tree.h
> > +++ b/kernel/rcu/tree.h
> > @@ -512,20 +512,21 @@ do {                                                              \
> >         local_irq_save(flags);                                  \
> >         if (rcu_segcblist_is_offloaded(&(rdp)->cblist)) \
> >                 raw_spin_lock(&(rdp)->nocb_lock);               \
> >  } while (0)
> >  #else /* #ifdef CONFIG_RCU_NOCB_CPU */
> >  #define rcu_nocb_lock_irqsave(rdp, flags) local_irq_save(flags)
> >  #endif /* #else #ifdef CONFIG_RCU_NOCB_CPU */
> >  
> >  static void rcu_bind_gp_kthread(void);
> >  static bool rcu_nohz_full_cpu(void);
> > +static bool rcu_on_patience_delay(void);
> 
> I don't think we need an access function, but will check below.
> 
> >  /* Forward declarations for tree_stall.h */
> >  static void record_gp_stall_check_time(void);
> >  static void rcu_iw_handler(struct irq_work *iwp);
> >  static void check_cpu_stall(struct rcu_data *rdp);
> >  static void rcu_check_gp_start_stall(struct rcu_node *rnp, struct rcu_data *rdp,
> >                                      const unsigned long gpssdelay);
> >  
> >  /* Forward declarations for tree_exp.h. */
> >  static void sync_rcu_do_polled_gp(struct work_struct *wp);
> > diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> > index 340bbefe5f65..639243b0410f 100644
> > --- a/kernel/rcu/tree_plugin.h
> > +++ b/kernel/rcu/tree_plugin.h
> > @@ -5,20 +5,21 @@
> >   * or preemptible semantics.
> >   *
> >   * Copyright Red Hat, 2009
> >   * Copyright IBM Corporation, 2009
> >   *
> >   * Author: Ingo Molnar <mingo@elte.hu>
> >   *        Paul E. McKenney <paulmck@linux.ibm.com>
> >   */
> >  
> >  #include "../locking/rtmutex_common.h"
> > +#include <linux/jiffies.h>
> 
> This is already pulled in by the enclosing tree.c file, so it should not
> be necessary to include it again. 

Even better :)

> (Or did you get a build failure when
> leaving this out?)

I didn't; it's just that my editor complained the symbols were not getting
properly resolved, so I included it and that fixed it. But since clangd is
known to make some mistakes, I should have compile-tested before adding it.

> 
> >  static bool rcu_rdp_is_offloaded(struct rcu_data *rdp)
> >  {
> >         /*
> >          * In order to read the offloaded state of an rdp in a safe
> >          * and stable way and prevent from its value to be changed
> >          * under us, we must either hold the barrier mutex, the cpu
> >          * hotplug lock (read or write) or the nocb lock. Local
> >          * non-preemptible reads are also safe. NOCB kthreads and
> >          * timers have their own means of synchronization against the
> > @@ -86,20 +87,33 @@ static void __init rcu_bootup_announce_oddness(void)
> >         if (rcu_kick_kthreads)
> >                 pr_info("\tKick kthreads if too-long grace period.\n");
> >         if (IS_ENABLED(CONFIG_DEBUG_OBJECTS_RCU_HEAD))
> >                 pr_info("\tRCU callback double-/use-after-free debug is enabled.\n");
> >         if (gp_preinit_delay)
> >                 pr_info("\tRCU debug GP pre-init slowdown %d jiffies.\n", gp_preinit_delay);
> >         if (gp_init_delay)
> >                 pr_info("\tRCU debug GP init slowdown %d jiffies.\n", gp_init_delay);
> >         if (gp_cleanup_delay)
> >                 pr_info("\tRCU debug GP cleanup slowdown %d jiffies.\n", gp_cleanup_delay);
> > +       if (nocb_patience_delay < 0) {
> > +               pr_info("\tRCU NOCB CPU patience negative (%d), resetting to zero.\n",
> > +                       nocb_patience_delay);
> > +               nocb_patience_delay = 0;
> > +       } else if (nocb_patience_delay > 5 * MSEC_PER_SEC) {
> > +               pr_info("\tRCU NOCB CPU patience too large (%d), resetting to %ld.\n",
> > +                       nocb_patience_delay, 5 * MSEC_PER_SEC);
> > +               nocb_patience_delay = msecs_to_jiffies(5 * MSEC_PER_SEC);
> > +       } else if (nocb_patience_delay) {
> > +               nocb_patience_delay = msecs_to_jiffies(nocb_patience_delay);
> > +               pr_info("\tRCU NOCB CPU patience set to %d milliseconds.\n",
> > +                       jiffies_to_msecs(nocb_patience_delay));
> > +       }
> 
> I just did this here at the end:
> 
> 	nocb_patience_delay_jiffies = msecs_to_jiffies(nocb_patience_delay);
> 
> Ah, you are wanting to print out the milliseconds after the rounding
> to jiffies.

That's right, just to make sure the user gets the effective patience time,
instead of the pre-rounding value that was passed as input.

> 
> I am going to hold off on that for the moment, but I hear your request
> and I have not yet said "no".  ;-)

Sure :)
It's just something I think is nice to have (as a user).

> 
> >         if (!use_softirq)
> >                 pr_info("\tRCU_SOFTIRQ processing moved to rcuc kthreads.\n");
> >         if (IS_ENABLED(CONFIG_RCU_EQS_DEBUG))
> >                 pr_info("\tRCU debug extended QS entry/exit.\n");
> >         rcupdate_announce_bootup_oddness();
> >  }
> >  
> >  #ifdef CONFIG_PREEMPT_RCU
> >  
> >  static void rcu_report_exp_rnp(struct rcu_node *rnp, bool wake);
> > @@ -1260,10 +1274,29 @@ static bool rcu_nohz_full_cpu(void)
> >  
> >  /*
> >   * Bind the RCU grace-period kthreads to the housekeeping CPU.
> >   */
> >  static void rcu_bind_gp_kthread(void)
> >  {
> >         if (!tick_nohz_full_enabled())
> >                 return;
> >         housekeeping_affine(current, HK_TYPE_RCU);
> >  }
> > +
> > +/*
> > + * Is this CPU a NO_HZ_FULL CPU that should ignore RCU if the time since the
> > + * start of current grace period is smaller than nocb_patience_delay ?
> > + *
> > + * This code relies on the fact that all NO_HZ_FULL CPUs are also
> > + * RCU_NOCB_CPU CPUs.
> > + */
> > +static bool rcu_on_patience_delay(void)
> > +{
> > +#ifdef CONFIG_NO_HZ_FULL
> 
> You lost me on this one.  Why do we need the #ifdef instead of
> IS_ENABLED()?  Also, please note that rcu_nohz_full_cpu() is already a
> compile-time @false in CONFIG_NO_HZ_FULL=n kernels.

You are right. rcu_nohz_full_cpu() has a high chance of being inlined in
	if ((...) && rcu_nohz_full_cpu())
And since it is compile-time false there, the whole statement will be
compiled out, the new function will not exist in CONFIG_NO_HZ_FULL=n
kernels, and there is no need to test for it.


> 
> > +       if (!nocb_patience_delay)
> > +               return false;
> 
> We get this automatically with the comparison below, right?

Right

>   If so, we
> are not gaining much by creating the helper function.  Or am I missing
> some trick here?

Well, it's a fast path. Up to here, we only need to read
nocb_patience_delay{,_jiffies} from memory.

If we don't include the fast path, we also have to read jiffies and
rcu_state.gp_start, which can take extra time: up to two cache misses.

I thought that could be relevant, since it reduces the overhead of the new
parameter when it is disabled (patience=0).

Do you think that could be relevant?

Thanks!
Leo

> 
> 							Thanx, Paul
> 
> > +       if (time_before(jiffies, READ_ONCE(rcu_state.gp_start) + nocb_patience_delay))
> > +               return true;
> > +#endif /* #ifdef CONFIG_NO_HZ_FULL */
> > +       return false;
> > +}
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index 7560e204198b..7a2d94370ab4 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -169,20 +169,22 @@ static int kthread_prio = IS_ENABLED(CONFIG_RCU_BOOST) ? 1 : 0;
> >  module_param(kthread_prio, int, 0444);
> >  
> >  /* Delay in jiffies for grace-period initialization delays, debug only. */
> >  
> >  static int gp_preinit_delay;
> >  module_param(gp_preinit_delay, int, 0444);
> >  static int gp_init_delay;
> >  module_param(gp_init_delay, int, 0444);
> >  static int gp_cleanup_delay;
> >  module_param(gp_cleanup_delay, int, 0444);
> > +static int nocb_patience_delay;
> > +module_param(nocb_patience_delay, int, 0444);
> >  
> >  // Add delay to rcu_read_unlock() for strict grace periods.
> >  static int rcu_unlock_delay;
> >  #ifdef CONFIG_RCU_STRICT_GRACE_PERIOD
> >  module_param(rcu_unlock_delay, int, 0444);
> >  #endif
> >  
> >  /*
> >   * This rcu parameter is runtime-read-only. It reflects
> >   * a minimum allowed number of objects which can be cached
> > @@ -4340,25 +4342,27 @@ static int rcu_pending(int user)
> >         lockdep_assert_irqs_disabled();
> >  
> >         /* Check for CPU stalls, if enabled. */
> >         check_cpu_stall(rdp);
> >  
> >         /* Does this CPU need a deferred NOCB wakeup? */
> >         if (rcu_nocb_need_deferred_wakeup(rdp, RCU_NOCB_WAKE))
> >                 return 1;
> >  
> >         /* Is this a nohz_full CPU in userspace or idle?  (Ignore RCU if so.) */
> > -       if ((user || rcu_is_cpu_rrupt_from_idle()) && rcu_nohz_full_cpu())
> > +       gp_in_progress = rcu_gp_in_progress();
> > +       if ((user || rcu_is_cpu_rrupt_from_idle() ||
> > +            (gp_in_progress && rcu_on_patience_delay())) &&
> > +           rcu_nohz_full_cpu())
> >                 return 0;
> >  
> >         /* Is the RCU core waiting for a quiescent state from this CPU? */
> > -       gp_in_progress = rcu_gp_in_progress();
> >         if (rdp->core_needs_qs && !rdp->cpu_no_qs.b.norm && gp_in_progress)
> >                 return 1;
> >  
> >         /* Does this CPU have callbacks ready to invoke? */
> >         if (!rcu_rdp_is_offloaded(rdp) &&
> >             rcu_segcblist_ready_cbs(&rdp->cblist))
> >                 return 1;
> >  
> >         /* Has RCU gone idle with this CPU needing another grace period? */
> >         if (!gp_in_progress && rcu_segcblist_is_enabled(&rdp->cblist) &&
> > 
> > 
> > 
>
Paul E. McKenney May 10, 2024, 4:21 p.m. UTC | #50
On Fri, May 10, 2024 at 01:06:40PM -0300, Leonardo Bras wrote:
> On Thu, May 09, 2024 at 04:45:53PM -0700, Paul E. McKenney wrote:
> > On Thu, May 09, 2024 at 07:14:18AM -0300, Leonardo Bras wrote:
> > > On Thu, May 09, 2024 at 05:16:57AM -0300, Leonardo Bras wrote:
> > 
> > [ . . . ]
> > 
> > > > Here I suppose something like this can take care of not needing to convert 
> > > > ms -> jiffies every rcu_pending():
> > > > 
> > > > +	nocb_patience_delay = msecs_to_jiffies(nocb_patience_delay);
> > > > 
> > > 
> > > Uh, there is more to it, actually. We need to make sure the user 
> > > understands that we are rounding-down the value to multiple of a jiffy 
> > > period, so it's not a surprise if the delay value is not exactly the same 
> > > as the passed on kernel cmdline.
> > > 
> > > So something like the diff below should be OK, as this behavior is
> > > explained in the docs, and pr_info() will print the effective value.
> > > 
> > > What do you think?
> > 
> > Good point, and I have taken your advice on making the documentation
> > say what it does.
> 
> Thanks :)
> 
> > 
> > > Thanks!
> > > Leo
> > > 
> > > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> > > index 0a3b0fd1910e..9a50be9fd9eb 100644
> > > --- a/Documentation/admin-guide/kernel-parameters.txt
> > > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > > @@ -4974,20 +4974,28 @@
> > >                         otherwise be caused by callback floods through
> > >                         use of the ->nocb_bypass list.  However, in the
> > >                         common non-flooded case, RCU queues directly to
> > >                         the main ->cblist in order to avoid the extra
> > >                         overhead of the ->nocb_bypass list and its lock.
> > >                         But if there are too many callbacks queued during
> > >                         a single jiffy, RCU pre-queues the callbacks into
> > >                         the ->nocb_bypass queue.  The definition of "too
> > >                         many" is supplied by this kernel boot parameter.
> > >  
> > > +       rcutree.nocb_patience_delay= [KNL]
> > > +                       On callback-offloaded (rcu_nocbs) CPUs, avoid
> > > +                       disturbing RCU unless the grace period has
> > > +                       reached the specified age in milliseconds.
> > > +                       Defaults to zero.  Large values will be capped
> > > +                       at five seconds.  Values are rounded down to a
> > > +                       multiple of a jiffy period.
> > > +
> > >         rcutree.qhimark= [KNL]
> > >                         Set threshold of queued RCU callbacks beyond which
> > >                         batch limiting is disabled.
> > >  
> > >         rcutree.qlowmark= [KNL]
> > >                         Set threshold of queued RCU callbacks below which
> > >                         batch limiting is re-enabled.
> > >  
> > >         rcutree.qovld= [KNL]
> > >                         Set threshold of queued RCU callbacks beyond which
> > > diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
> > > index fcf2b4aa3441..62ede401420f 100644
> > > --- a/kernel/rcu/tree.h
> > > +++ b/kernel/rcu/tree.h
> > > @@ -512,20 +512,21 @@ do {                                                              \
> > >         local_irq_save(flags);                                  \
> > >         if (rcu_segcblist_is_offloaded(&(rdp)->cblist)) \
> > >                 raw_spin_lock(&(rdp)->nocb_lock);               \
> > >  } while (0)
> > >  #else /* #ifdef CONFIG_RCU_NOCB_CPU */
> > >  #define rcu_nocb_lock_irqsave(rdp, flags) local_irq_save(flags)
> > >  #endif /* #else #ifdef CONFIG_RCU_NOCB_CPU */
> > >  
> > >  static void rcu_bind_gp_kthread(void);
> > >  static bool rcu_nohz_full_cpu(void);
> > > +static bool rcu_on_patience_delay(void);
> > 
> > I don't think we need an access function, but will check below.
> > 
> > >  /* Forward declarations for tree_stall.h */
> > >  static void record_gp_stall_check_time(void);
> > >  static void rcu_iw_handler(struct irq_work *iwp);
> > >  static void check_cpu_stall(struct rcu_data *rdp);
> > >  static void rcu_check_gp_start_stall(struct rcu_node *rnp, struct rcu_data *rdp,
> > >                                      const unsigned long gpssdelay);
> > >  
> > >  /* Forward declarations for tree_exp.h. */
> > >  static void sync_rcu_do_polled_gp(struct work_struct *wp);
> > > diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> > > index 340bbefe5f65..639243b0410f 100644
> > > --- a/kernel/rcu/tree_plugin.h
> > > +++ b/kernel/rcu/tree_plugin.h
> > > @@ -5,20 +5,21 @@
> > >   * or preemptible semantics.
> > >   *
> > >   * Copyright Red Hat, 2009
> > >   * Copyright IBM Corporation, 2009
> > >   *
> > >   * Author: Ingo Molnar <mingo@elte.hu>
> > >   *        Paul E. McKenney <paulmck@linux.ibm.com>
> > >   */
> > >  
> > >  #include "../locking/rtmutex_common.h"
> > > +#include <linux/jiffies.h>
> > 
> > This is already pulled in by the enclosing tree.c file, so it should not
> > be necessary to include it again. 
> 
> Even better :)
> 
> > (Or did you get a build failure when
> > leaving this out?)
> 
> I didn't; it's just that my editor complained the symbols were not getting
> properly resolved, so I included it and that fixed it. But since clangd is
> known to make some mistakes, I should have compile-tested before adding it.

Ah, got it!  ;-)

> > >  static bool rcu_rdp_is_offloaded(struct rcu_data *rdp)
> > >  {
> > >         /*
> > >          * In order to read the offloaded state of an rdp in a safe
> > >          * and stable way and prevent from its value to be changed
> > >          * under us, we must either hold the barrier mutex, the cpu
> > >          * hotplug lock (read or write) or the nocb lock. Local
> > >          * non-preemptible reads are also safe. NOCB kthreads and
> > >          * timers have their own means of synchronization against the
> > > @@ -86,20 +87,33 @@ static void __init rcu_bootup_announce_oddness(void)
> > >         if (rcu_kick_kthreads)
> > >                 pr_info("\tKick kthreads if too-long grace period.\n");
> > >         if (IS_ENABLED(CONFIG_DEBUG_OBJECTS_RCU_HEAD))
> > >                 pr_info("\tRCU callback double-/use-after-free debug is enabled.\n");
> > >         if (gp_preinit_delay)
> > >                 pr_info("\tRCU debug GP pre-init slowdown %d jiffies.\n", gp_preinit_delay);
> > >         if (gp_init_delay)
> > >                 pr_info("\tRCU debug GP init slowdown %d jiffies.\n", gp_init_delay);
> > >         if (gp_cleanup_delay)
> > >                 pr_info("\tRCU debug GP cleanup slowdown %d jiffies.\n", gp_cleanup_delay);
> > > +       if (nocb_patience_delay < 0) {
> > > +               pr_info("\tRCU NOCB CPU patience negative (%d), resetting to zero.\n",
> > > +                       nocb_patience_delay);
> > > +               nocb_patience_delay = 0;
> > > +       } else if (nocb_patience_delay > 5 * MSEC_PER_SEC) {
> > > +               pr_info("\tRCU NOCB CPU patience too large (%d), resetting to %ld.\n",
> > > +                       nocb_patience_delay, 5 * MSEC_PER_SEC);
> > > +               nocb_patience_delay = msecs_to_jiffies(5 * MSEC_PER_SEC);
> > > +       } else if (nocb_patience_delay) {
> > > +               nocb_patience_delay = msecs_to_jiffies(nocb_patience_delay);
> > > +               pr_info("\tRCU NOCB CPU patience set to %d milliseconds.\n",
> > > +                       jiffies_to_msecs(nocb_patience_delay));
> > > +       }
> > 
> > I just did this here at the end:
> > 
> > 	nocb_patience_delay_jiffies = msecs_to_jiffies(nocb_patience_delay);
> > 
> > Ah, you are wanting to print out the milliseconds after the rounding
> > to jiffies.
> 
> That's right, just to make sure the user gets the effective patience time,
> instead of the pre-rounding value that was passed as input.
> 
> > I am going to hold off on that for the moment, but I hear your request
> > and I have not yet said "no".  ;-)
> 
> Sure :)
> It's just something I think is nice to have (as a user).

If you would like to do a separate patch adding this, here are the
requirements:

o	If the current code prints nothing, nothing additional should
	be printed.

o	If the rounding ended up with the same value (as it should in
	systems with HZ=1000), nothing additional should be printed.

o	Your choice as to whether or not you want to print out the
	jiffies value.

o	If the additional message is on a new line, it needs to be
	indented so that it is clear that it is subordinate to the
	previous message.

	Otherwise, you can use pr_cont() to continue the previous
	line, of course being careful about "\n".

Probably also something that I am forgetting, but that is most of it.

> > >         if (!use_softirq)
> > >                 pr_info("\tRCU_SOFTIRQ processing moved to rcuc kthreads.\n");
> > >         if (IS_ENABLED(CONFIG_RCU_EQS_DEBUG))
> > >                 pr_info("\tRCU debug extended QS entry/exit.\n");
> > >         rcupdate_announce_bootup_oddness();
> > >  }
> > >  
> > >  #ifdef CONFIG_PREEMPT_RCU
> > >  
> > >  static void rcu_report_exp_rnp(struct rcu_node *rnp, bool wake);
> > > @@ -1260,10 +1274,29 @@ static bool rcu_nohz_full_cpu(void)
> > >  
> > >  /*
> > >   * Bind the RCU grace-period kthreads to the housekeeping CPU.
> > >   */
> > >  static void rcu_bind_gp_kthread(void)
> > >  {
> > >         if (!tick_nohz_full_enabled())
> > >                 return;
> > >         housekeeping_affine(current, HK_TYPE_RCU);
> > >  }
> > > +
> > > +/*
> > > + * Is this CPU a NO_HZ_FULL CPU that should ignore RCU if the time since the
> > > + * start of current grace period is smaller than nocb_patience_delay ?
> > > + *
> > > + * This code relies on the fact that all NO_HZ_FULL CPUs are also
> > > + * RCU_NOCB_CPU CPUs.
> > > + */
> > > +static bool rcu_on_patience_delay(void)
> > > +{
> > > +#ifdef CONFIG_NO_HZ_FULL
> > 
> > You lost me on this one.  Why do we need the #ifdef instead of
> > IS_ENABLED()?  Also, please note that rcu_nohz_full_cpu() is already a
> > compile-time @false in CONFIG_NO_HZ_FULL=n kernels.
> 
> You are right. rcu_nohz_full_cpu() has a high chance of being inlined in
> 	if ((...) && rcu_nohz_full_cpu())
> And since it is compile-time false there, the whole statement will be
> compiled out, the new function will not exist in CONFIG_NO_HZ_FULL=n
> kernels, and there is no need to test for it.

Very good!  You had me going there for a bit.  ;-)

> > > +       if (!nocb_patience_delay)
> > > +               return false;
> > 
> > We get this automatically with the comparison below, right?
> 
> Right
> 
> >   If so, we
> > are not gaining much by creating the helper function.  Or am I missing
> > some trick here?
> 
> Well, it's a fast path. Up to here, we only need to read
> nocb_patience_delay{,_jiffies} from memory.

Just nocb_patience_delay_jiffies, correct?  Unless I am missing something,
nocb_patience_delay is unused after boot.

> If we don't include the fast path, we also have to read jiffies and
> rcu_state.gp_start, which can take extra time: up to two cache misses.
> 
> I thought that could be relevant, since it reduces the overhead of the new
> parameter when it is disabled (patience=0).
> 
> Do you think that could be relevant?

Well, the hardware's opinion is what matters.  ;-)

But the caller's code path reads jiffies a few times, so it should
be hot in the cache, correct?

But that does lead to another topic, namely the possibility of tagging
nocb_patience_delay_jiffies with __read_mostly.  And there might be
a number of other of RCU's variables that could be similarly tagged
in order to avoid false sharing.  (But is there any false sharing?
This might be worth testing.)

							Thanx, Paul

> Thanks!
> Leo
> 
> > 
> > 							Thanx, Paul
> > 
> > > +       if (time_before(jiffies, READ_ONCE(rcu_state.gp_start) + nocb_patience_delay))
> > > +               return true;
> > > +#endif /* #ifdef CONFIG_NO_HZ_FULL */
> > > +       return false;
> > > +}
> > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > index 7560e204198b..7a2d94370ab4 100644
> > > --- a/kernel/rcu/tree.c
> > > +++ b/kernel/rcu/tree.c
> > > @@ -169,20 +169,22 @@ static int kthread_prio = IS_ENABLED(CONFIG_RCU_BOOST) ? 1 : 0;
> > >  module_param(kthread_prio, int, 0444);
> > >  
> > >  /* Delay in jiffies for grace-period initialization delays, debug only. */
> > >  
> > >  static int gp_preinit_delay;
> > >  module_param(gp_preinit_delay, int, 0444);
> > >  static int gp_init_delay;
> > >  module_param(gp_init_delay, int, 0444);
> > >  static int gp_cleanup_delay;
> > >  module_param(gp_cleanup_delay, int, 0444);
> > > +static int nocb_patience_delay;
> > > +module_param(nocb_patience_delay, int, 0444);
> > >  
> > >  // Add delay to rcu_read_unlock() for strict grace periods.
> > >  static int rcu_unlock_delay;
> > >  #ifdef CONFIG_RCU_STRICT_GRACE_PERIOD
> > >  module_param(rcu_unlock_delay, int, 0444);
> > >  #endif
> > >  
> > >  /*
> > >   * This rcu parameter is runtime-read-only. It reflects
> > >   * a minimum allowed number of objects which can be cached
> > > @@ -4340,25 +4342,27 @@ static int rcu_pending(int user)
> > >         lockdep_assert_irqs_disabled();
> > >  
> > >         /* Check for CPU stalls, if enabled. */
> > >         check_cpu_stall(rdp);
> > >  
> > >         /* Does this CPU need a deferred NOCB wakeup? */
> > >         if (rcu_nocb_need_deferred_wakeup(rdp, RCU_NOCB_WAKE))
> > >                 return 1;
> > >  
> > >         /* Is this a nohz_full CPU in userspace or idle?  (Ignore RCU if so.) */
> > > -       if ((user || rcu_is_cpu_rrupt_from_idle()) && rcu_nohz_full_cpu())
> > > +       gp_in_progress = rcu_gp_in_progress();
> > > +       if ((user || rcu_is_cpu_rrupt_from_idle() ||
> > > +            (gp_in_progress && rcu_on_patience_delay())) &&
> > > +           rcu_nohz_full_cpu())
> > >                 return 0;
> > >  
> > >         /* Is the RCU core waiting for a quiescent state from this CPU? */
> > > -       gp_in_progress = rcu_gp_in_progress();
> > >         if (rdp->core_needs_qs && !rdp->cpu_no_qs.b.norm && gp_in_progress)
> > >                 return 1;
> > >  
> > >         /* Does this CPU have callbacks ready to invoke? */
> > >         if (!rcu_rdp_is_offloaded(rdp) &&
> > >             rcu_segcblist_ready_cbs(&rdp->cblist))
> > >                 return 1;
> > >  
> > >         /* Has RCU gone idle with this CPU needing another grace period? */
> > >         if (!gp_in_progress && rcu_segcblist_is_enabled(&rdp->cblist) &&
> > > 
> > > 
> > > 
> > 
> 
>
Leonardo Bras May 10, 2024, 5:12 p.m. UTC | #51
On Fri, May 10, 2024 at 09:21:59AM -0700, Paul E. McKenney wrote:
> On Fri, May 10, 2024 at 01:06:40PM -0300, Leonardo Bras wrote:
> > On Thu, May 09, 2024 at 04:45:53PM -0700, Paul E. McKenney wrote:
> > > On Thu, May 09, 2024 at 07:14:18AM -0300, Leonardo Bras wrote:
> > > > On Thu, May 09, 2024 at 05:16:57AM -0300, Leonardo Bras wrote:
> > > 
> > > [ . . . ]
> > > 
> > > > > Here I suppose something like this can take care of not needing to convert 
> > > > > ms -> jiffies every rcu_pending():
> > > > > 
> > > > > +	nocb_patience_delay = msecs_to_jiffies(nocb_patience_delay);
> > > > > 
> > > > 
> > > > Uh, there is more to it, actually. We need to make sure the user 
> > > > understands that we are rounding-down the value to multiple of a jiffy 
> > > > period, so it's not a surprise if the delay value is not exactly the same 
> > > > as the passed on kernel cmdline.
> > > > 
> > > > So something like the diff below should be OK, as this behavior is
> > > > explained in the docs, and pr_info() will print the effective value.
> > > > 
> > > > What do you think?
> > > 
> > > Good point, and I have taken your advice on making the documentation
> > > say what it does.
> > 
> > Thanks :)
> > 
> > > 
> > > > Thanks!
> > > > Leo
> > > > 
> > > > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> > > > index 0a3b0fd1910e..9a50be9fd9eb 100644
> > > > --- a/Documentation/admin-guide/kernel-parameters.txt
> > > > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > > > @@ -4974,20 +4974,28 @@
> > > >                         otherwise be caused by callback floods through
> > > >                         use of the ->nocb_bypass list.  However, in the
> > > >                         common non-flooded case, RCU queues directly to
> > > >                         the main ->cblist in order to avoid the extra
> > > >                         overhead of the ->nocb_bypass list and its lock.
> > > >                         But if there are too many callbacks queued during
> > > >                         a single jiffy, RCU pre-queues the callbacks into
> > > >                         the ->nocb_bypass queue.  The definition of "too
> > > >                         many" is supplied by this kernel boot parameter.
> > > >  
> > > > +       rcutree.nocb_patience_delay= [KNL]
> > > > +                       On callback-offloaded (rcu_nocbs) CPUs, avoid
> > > > +                       disturbing RCU unless the grace period has
> > > > +                       reached the specified age in milliseconds.
> > > > +                       Defaults to zero.  Large values will be capped
> > > > +                       at five seconds.  Values are rounded down to a
> > > > +                       multiple of a jiffy period.
> > > > +
> > > >         rcutree.qhimark= [KNL]
> > > >                         Set threshold of queued RCU callbacks beyond which
> > > >                         batch limiting is disabled.
> > > >  
> > > >         rcutree.qlowmark= [KNL]
> > > >                         Set threshold of queued RCU callbacks below which
> > > >                         batch limiting is re-enabled.
> > > >  
> > > >         rcutree.qovld= [KNL]
> > > >                         Set threshold of queued RCU callbacks beyond which
> > > > diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
> > > > index fcf2b4aa3441..62ede401420f 100644
> > > > --- a/kernel/rcu/tree.h
> > > > +++ b/kernel/rcu/tree.h
> > > > @@ -512,20 +512,21 @@ do {                                                              \
> > > >         local_irq_save(flags);                                  \
> > > >         if (rcu_segcblist_is_offloaded(&(rdp)->cblist)) \
> > > >                 raw_spin_lock(&(rdp)->nocb_lock);               \
> > > >  } while (0)
> > > >  #else /* #ifdef CONFIG_RCU_NOCB_CPU */
> > > >  #define rcu_nocb_lock_irqsave(rdp, flags) local_irq_save(flags)
> > > >  #endif /* #else #ifdef CONFIG_RCU_NOCB_CPU */
> > > >  
> > > >  static void rcu_bind_gp_kthread(void);
> > > >  static bool rcu_nohz_full_cpu(void);
> > > > +static bool rcu_on_patience_delay(void);
> > > 
> > > I don't think we need an access function, but will check below.
> > > 
> > > >  /* Forward declarations for tree_stall.h */
> > > >  static void record_gp_stall_check_time(void);
> > > >  static void rcu_iw_handler(struct irq_work *iwp);
> > > >  static void check_cpu_stall(struct rcu_data *rdp);
> > > >  static void rcu_check_gp_start_stall(struct rcu_node *rnp, struct rcu_data *rdp,
> > > >                                      const unsigned long gpssdelay);
> > > >  
> > > >  /* Forward declarations for tree_exp.h. */
> > > >  static void sync_rcu_do_polled_gp(struct work_struct *wp);
> > > > diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> > > > index 340bbefe5f65..639243b0410f 100644
> > > > --- a/kernel/rcu/tree_plugin.h
> > > > +++ b/kernel/rcu/tree_plugin.h
> > > > @@ -5,20 +5,21 @@
> > > >   * or preemptible semantics.
> > > >   *
> > > >   * Copyright Red Hat, 2009
> > > >   * Copyright IBM Corporation, 2009
> > > >   *
> > > >   * Author: Ingo Molnar <mingo@elte.hu>
> > > >   *        Paul E. McKenney <paulmck@linux.ibm.com>
> > > >   */
> > > >  
> > > >  #include "../locking/rtmutex_common.h"
> > > > +#include <linux/jiffies.h>
> > > 
> > > This is already pulled in by the enclosing tree.c file, so it should not
> > > be necessary to include it again. 
> > 
> > Even better :)
> > 
> > > (Or did you get a build failure when
> > > leaving this out?)
> > 
> > I didn't; it's just that my editor complained the symbols were not getting
> > properly resolved, so I included it and that fixed it. But since clangd is
> > known to make some mistakes, I should have compile-tested before adding it.
> 
> Ah, got it!  ;-)
> 
> > > >  static bool rcu_rdp_is_offloaded(struct rcu_data *rdp)
> > > >  {
> > > >         /*
> > > >          * In order to read the offloaded state of an rdp in a safe
> > > >          * and stable way and prevent from its value to be changed
> > > >          * under us, we must either hold the barrier mutex, the cpu
> > > >          * hotplug lock (read or write) or the nocb lock. Local
> > > >          * non-preemptible reads are also safe. NOCB kthreads and
> > > >          * timers have their own means of synchronization against the
> > > > @@ -86,20 +87,33 @@ static void __init rcu_bootup_announce_oddness(void)
> > > >         if (rcu_kick_kthreads)
> > > >                 pr_info("\tKick kthreads if too-long grace period.\n");
> > > >         if (IS_ENABLED(CONFIG_DEBUG_OBJECTS_RCU_HEAD))
> > > >                 pr_info("\tRCU callback double-/use-after-free debug is enabled.\n");
> > > >         if (gp_preinit_delay)
> > > >                 pr_info("\tRCU debug GP pre-init slowdown %d jiffies.\n", gp_preinit_delay);
> > > >         if (gp_init_delay)
> > > >                 pr_info("\tRCU debug GP init slowdown %d jiffies.\n", gp_init_delay);
> > > >         if (gp_cleanup_delay)
> > > >                 pr_info("\tRCU debug GP cleanup slowdown %d jiffies.\n", gp_cleanup_delay);
> > > > +       if (nocb_patience_delay < 0) {
> > > > +               pr_info("\tRCU NOCB CPU patience negative (%d), resetting to zero.\n",
> > > > +                       nocb_patience_delay);
> > > > +               nocb_patience_delay = 0;
> > > > +       } else if (nocb_patience_delay > 5 * MSEC_PER_SEC) {
> > > > +               pr_info("\tRCU NOCB CPU patience too large (%d), resetting to %ld.\n",
> > > > +                       nocb_patience_delay, 5 * MSEC_PER_SEC);
> > > > +               nocb_patience_delay = msecs_to_jiffies(5 * MSEC_PER_SEC);
> > > > +       } else if (nocb_patience_delay) {
> > > > +               nocb_patience_delay = msecs_to_jiffies(nocb_patience_delay);
> > > > +               pr_info("\tRCU NOCB CPU patience set to %d milliseconds.\n",
> > > > +                       jiffies_to_msecs(nocb_patience_delay));
> > > > +       }
> > > 
> > > I just did this here at the end:
> > > 
> > > 	nocb_patience_delay_jiffies = msecs_to_jiffies(nocb_patience_delay);
> > > 
> > > Ah, you are wanting to print out the milliseconds after the rounding
> > > to jiffies.
> > 
> > That's right, just to make sure the user gets the effective patience time, 
> > instead of the before-rounding one, which was on input.
> > 
> > > I am going to hold off on that for the moment, but I hear your request
> > > and I have not yet said "no".  ;-)
> > 
> > Sure :)
> > It's just something I think is nice to have (as a user).
> 
> If you would like to do a separate patch adding this, here are the
> requirements:
> 
> o	If the current code prints nothing, nothing additional should
> 	be printed.
> 
> o	If the rounding ended up with the same value (as it should in
> 	systems with HZ=1000), nothing additional should be printed.
> 
> o	Your choice as to whether or not you want to print out the
> 	jiffies value.
> 
> o	If the additional message is on a new line, it needs to be
> 	indented so that it is clear that it is subordinate to the
> 	previous message.
> 
> 	Otherwise, you can use pr_cont() to continue the previous
> 	line, of course being careful about "\n".
> 
> Probably also something that I am forgetting, but that is most of it.
> 

Thanks!
I will work on a patch doing that :)

> > > >         if (!use_softirq)
> > > >                 pr_info("\tRCU_SOFTIRQ processing moved to rcuc kthreads.\n");
> > > >         if (IS_ENABLED(CONFIG_RCU_EQS_DEBUG))
> > > >                 pr_info("\tRCU debug extended QS entry/exit.\n");
> > > >         rcupdate_announce_bootup_oddness();
> > > >  }
> > > >  
> > > >  #ifdef CONFIG_PREEMPT_RCU
> > > >  
> > > >  static void rcu_report_exp_rnp(struct rcu_node *rnp, bool wake);
> > > > @@ -1260,10 +1274,29 @@ static bool rcu_nohz_full_cpu(void)
> > > >  
> > > >  /*
> > > >   * Bind the RCU grace-period kthreads to the housekeeping CPU.
> > > >   */
> > > >  static void rcu_bind_gp_kthread(void)
> > > >  {
> > > >         if (!tick_nohz_full_enabled())
> > > >                 return;
> > > >         housekeeping_affine(current, HK_TYPE_RCU);
> > > >  }
> > > > +
> > > > +/*
> > > > + * Is this CPU a NO_HZ_FULL CPU that should ignore RCU if the time since the
> > > > + * start of current grace period is smaller than nocb_patience_delay ?
> > > > + *
> > > > + * This code relies on the fact that all NO_HZ_FULL CPUs are also
> > > > + * RCU_NOCB_CPU CPUs.
> > > > + */
> > > > +static bool rcu_on_patience_delay(void)
> > > > +{
> > > > +#ifdef CONFIG_NO_HZ_FULL
> > > 
> > > You lost me on this one.  Why do we need the #ifdef instead of
> > > IS_ENABLED()?  Also, please note that rcu_nohz_full_cpu() is already a
> > > compile-time @false in CONFIG_NO_HZ_FULL=n kernels.
> > 
> > You are right. rcu_nohz_full_cpu() has a high chance of being inlined on
> > 	if ((...) && rcu_nohz_full_cpu())
> > And since it returns false, this whole statement will be compiled out, and 
> > the new function will not exist in CONFIG_NO_HZ_FULL=n, so there  is no 
> > need to test it.
> 
> Very good!  You had me going there for a bit.  ;-)
> 
> > > > +       if (!nocb_patience_delay)
> > > > +               return false;
> > > 
> > > We get this automatically with the comparison below, right?
> > 
> > Right
> > 
> > >   If so, we
> > > are not gaining much by creating the helper function.  Or am I missing
> > > some trick here?
> > 
> > Well, it's a fastpath. Up to here, we just need to read 
> > nocb_patience_delay{,_jiffies} from memory.
> 
> Just nocb_patience_delay_jiffies, correct?  Unless I am missing something,
> nocb_patience_delay is unused after boot.

Right, I used both because I was referring to the older version and the 
current version with _jiffies.
> 
> > If we don't include the fastpath we have to read jiffies and 
> > rcu_state.gp_start, which can take extra time: up to 2 cache misses.
> > 
> > I thought it could be relevant, as we reduce the overhead of the new 
> > parameter when it's disabled (patience=0). 
> > 
> > Do you think that could be relevant?
> 
> Well, the hardware's opinion is what matters.  ;-)
> 
> But the caller's code path reads jiffies a few times, so it should
> be hot in the cache, correct?

Right, but I wonder what the chances are of it getting updated between the
caller's use and this function's. Same for gp_start.

> 
> But that does lead to another topic, namely the possibility of tagging
> nocb_patience_delay_jiffies with __read_mostly. 

Oh, right. This was supposed to be in the diff I sent earlier, but I
completely forgot to change it before sending. So, yeah, I agree on
nocb_patience_delay being __read_mostly.

> And there might be
> a number of other of RCU's variables that could be similarly tagged
> in order to avoid false sharing.  (But is there any false sharing?
> This might be worth testing.)

Maybe there isn't, but I wonder if it would hurt performance if they were 
tagged as __read_only anyway. 


Thanks!
Leo

> 
> 							Thanx, Paul
> 
> > Thanks!
> > Leo
> > 
> > > 
> > > 							Thanx, Paul
> > > 
> > > > +       if (time_before(jiffies, READ_ONCE(rcu_state.gp_start) + nocb_patience_delay))
> > > > +               return true;
> > > > +#endif /* #ifdef CONFIG_NO_HZ_FULL */
> > > > +       return false;
> > > > +}
> > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > > index 7560e204198b..7a2d94370ab4 100644
> > > > --- a/kernel/rcu/tree.c
> > > > +++ b/kernel/rcu/tree.c
> > > > @@ -169,20 +169,22 @@ static int kthread_prio = IS_ENABLED(CONFIG_RCU_BOOST) ? 1 : 0;
> > > >  module_param(kthread_prio, int, 0444);
> > > >  
> > > >  /* Delay in jiffies for grace-period initialization delays, debug only. */
> > > >  
> > > >  static int gp_preinit_delay;
> > > >  module_param(gp_preinit_delay, int, 0444);
> > > >  static int gp_init_delay;
> > > >  module_param(gp_init_delay, int, 0444);
> > > >  static int gp_cleanup_delay;
> > > >  module_param(gp_cleanup_delay, int, 0444);
> > > > +static int nocb_patience_delay;
> > > > +module_param(nocb_patience_delay, int, 0444);
> > > >  
> > > >  // Add delay to rcu_read_unlock() for strict grace periods.
> > > >  static int rcu_unlock_delay;
> > > >  #ifdef CONFIG_RCU_STRICT_GRACE_PERIOD
> > > >  module_param(rcu_unlock_delay, int, 0444);
> > > >  #endif
> > > >  
> > > >  /*
> > > >   * This rcu parameter is runtime-read-only. It reflects
> > > >   * a minimum allowed number of objects which can be cached
> > > > @@ -4340,25 +4342,27 @@ static int rcu_pending(int user)
> > > >         lockdep_assert_irqs_disabled();
> > > >  
> > > >         /* Check for CPU stalls, if enabled. */
> > > >         check_cpu_stall(rdp);
> > > >  
> > > >         /* Does this CPU need a deferred NOCB wakeup? */
> > > >         if (rcu_nocb_need_deferred_wakeup(rdp, RCU_NOCB_WAKE))
> > > >                 return 1;
> > > >  
> > > >         /* Is this a nohz_full CPU in userspace or idle?  (Ignore RCU if so.) */
> > > > -       if ((user || rcu_is_cpu_rrupt_from_idle()) && rcu_nohz_full_cpu())
> > > > +       gp_in_progress = rcu_gp_in_progress();
> > > > +       if ((user || rcu_is_cpu_rrupt_from_idle() ||
> > > > +            (gp_in_progress && rcu_on_patience_delay())) &&
> > > > +           rcu_nohz_full_cpu())
> > > >                 return 0;
> > > >  
> > > >         /* Is the RCU core waiting for a quiescent state from this CPU? */
> > > > -       gp_in_progress = rcu_gp_in_progress();
> > > >         if (rdp->core_needs_qs && !rdp->cpu_no_qs.b.norm && gp_in_progress)
> > > >                 return 1;
> > > >  
> > > >         /* Does this CPU have callbacks ready to invoke? */
> > > >         if (!rcu_rdp_is_offloaded(rdp) &&
> > > >             rcu_segcblist_ready_cbs(&rdp->cblist))
> > > >                 return 1;
> > > >  
> > > >         /* Has RCU gone idle with this CPU needing another grace period? */
> > > >         if (!gp_in_progress && rcu_segcblist_is_enabled(&rdp->cblist) &&
> > > > 
> > > > 
> > > > 
> > > 
> > 
> > 
>
Paul E. McKenney May 10, 2024, 5:41 p.m. UTC | #52
On Fri, May 10, 2024 at 02:12:32PM -0300, Leonardo Bras wrote:
> On Fri, May 10, 2024 at 09:21:59AM -0700, Paul E. McKenney wrote:
> > On Fri, May 10, 2024 at 01:06:40PM -0300, Leonardo Bras wrote:
> > > On Thu, May 09, 2024 at 04:45:53PM -0700, Paul E. McKenney wrote:
> > > > On Thu, May 09, 2024 at 07:14:18AM -0300, Leonardo Bras wrote:
> > > > > On Thu, May 09, 2024 at 05:16:57AM -0300, Leonardo Bras wrote:
> > > > 
> > > > [ . . . ]
> > > > 
> > > > > > Here I suppose something like this can take care of not needing to convert 
> > > > > > ms -> jiffies every rcu_pending():
> > > > > > 
> > > > > > +	nocb_patience_delay = msecs_to_jiffies(nocb_patience_delay);
> > > > > > 
> > > > > 
> > > > > Uh, there is more to it, actually. We need to make sure the user 
> > > > > understands that we are rounding-down the value to multiple of a jiffy 
> > > > > period, so it's not a surprise if the delay value is not exactly the same 
> > > > > as the passed on kernel cmdline.
> > > > > 
> > > > > So something like the diff below should be OK, as this behavior is explained 
> > > > > in the docs, and pr_info() will print the effective value.
> > > > > 
> > > > > What do you think?
> > > > 
> > > > Good point, and I have taken your advice on making the documentation
> > > > say what it does.
> > > 
> > > Thanks :)
> > > 
> > > > 
> > > > > Thanks!
> > > > > Leo
> > > > > 
> > > > > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> > > > > index 0a3b0fd1910e..9a50be9fd9eb 100644
> > > > > --- a/Documentation/admin-guide/kernel-parameters.txt
> > > > > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > > > > @@ -4974,20 +4974,28 @@
> > > > >                         otherwise be caused by callback floods through
> > > > >                         use of the ->nocb_bypass list.  However, in the
> > > > >                         common non-flooded case, RCU queues directly to
> > > > >                         the main ->cblist in order to avoid the extra
> > > > >                         overhead of the ->nocb_bypass list and its lock.
> > > > >                         But if there are too many callbacks queued during
> > > > >                         a single jiffy, RCU pre-queues the callbacks into
> > > > >                         the ->nocb_bypass queue.  The definition of "too
> > > > >                         many" is supplied by this kernel boot parameter.
> > > > >  
> > > > > +       rcutree.nocb_patience_delay= [KNL]
> > > > > +                       On callback-offloaded (rcu_nocbs) CPUs, avoid
> > > > > +                       disturbing RCU unless the grace period has
> > > > > +                       reached the specified age in milliseconds.
> > > > > +                       Defaults to zero.  Large values will be capped
> > > > >                         at five seconds. Values are rounded down to a
> > > > >                         multiple of a jiffy period.
> > > > > +
> > > > >         rcutree.qhimark= [KNL]
> > > > >                         Set threshold of queued RCU callbacks beyond which
> > > > >                         batch limiting is disabled.
> > > > >  
> > > > >         rcutree.qlowmark= [KNL]
> > > > >                         Set threshold of queued RCU callbacks below which
> > > > >                         batch limiting is re-enabled.
> > > > >  
> > > > >         rcutree.qovld= [KNL]
> > > > >                         Set threshold of queued RCU callbacks beyond which
> > > > > diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
> > > > > index fcf2b4aa3441..62ede401420f 100644
> > > > > --- a/kernel/rcu/tree.h
> > > > > +++ b/kernel/rcu/tree.h
> > > > > @@ -512,20 +512,21 @@ do {                                                              \
> > > > >         local_irq_save(flags);                                  \
> > > > >         if (rcu_segcblist_is_offloaded(&(rdp)->cblist)) \
> > > > >                 raw_spin_lock(&(rdp)->nocb_lock);               \
> > > > >  } while (0)
> > > > >  #else /* #ifdef CONFIG_RCU_NOCB_CPU */
> > > > >  #define rcu_nocb_lock_irqsave(rdp, flags) local_irq_save(flags)
> > > > >  #endif /* #else #ifdef CONFIG_RCU_NOCB_CPU */
> > > > >  
> > > > >  static void rcu_bind_gp_kthread(void);
> > > > >  static bool rcu_nohz_full_cpu(void);
> > > > > +static bool rcu_on_patience_delay(void);
> > > > 
> > > > I don't think we need an access function, but will check below.
> > > > 
> > > > >  /* Forward declarations for tree_stall.h */
> > > > >  static void record_gp_stall_check_time(void);
> > > > >  static void rcu_iw_handler(struct irq_work *iwp);
> > > > >  static void check_cpu_stall(struct rcu_data *rdp);
> > > > >  static void rcu_check_gp_start_stall(struct rcu_node *rnp, struct rcu_data *rdp,
> > > > >                                      const unsigned long gpssdelay);
> > > > >  
> > > > >  /* Forward declarations for tree_exp.h. */
> > > > >  static void sync_rcu_do_polled_gp(struct work_struct *wp);
> > > > > diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> > > > > index 340bbefe5f65..639243b0410f 100644
> > > > > --- a/kernel/rcu/tree_plugin.h
> > > > > +++ b/kernel/rcu/tree_plugin.h
> > > > > @@ -5,20 +5,21 @@
> > > > >   * or preemptible semantics.
> > > > >   *
> > > > >   * Copyright Red Hat, 2009
> > > > >   * Copyright IBM Corporation, 2009
> > > > >   *
> > > > >   * Author: Ingo Molnar <mingo@elte.hu>
> > > > >   *        Paul E. McKenney <paulmck@linux.ibm.com>
> > > > >   */
> > > > >  
> > > > >  #include "../locking/rtmutex_common.h"
> > > > > +#include <linux/jiffies.h>
> > > > 
> > > > This is already pulled in by the enclosing tree.c file, so it should not
> > > > be necessary to include it again. 
> > > 
> > > Even better :)
> > > 
> > > > (Or did you get a build failure when
> > > > leaving this out?)
> > > 
> > > I didn't, it's just that my editor complained the symbols were not getting 
> > > properly resolved, so I included it and it was fixed. But since clangd is 
> > > known to make some mistakes, I should have compile-test'd before adding it.
> > 
> > Ah, got it!  ;-)
> > 
> > > > >  static bool rcu_rdp_is_offloaded(struct rcu_data *rdp)
> > > > >  {
> > > > >         /*
> > > > >          * In order to read the offloaded state of an rdp in a safe
> > > > >          * and stable way and prevent from its value to be changed
> > > > >          * under us, we must either hold the barrier mutex, the cpu
> > > > >          * hotplug lock (read or write) or the nocb lock. Local
> > > > >          * non-preemptible reads are also safe. NOCB kthreads and
> > > > >          * timers have their own means of synchronization against the
> > > > > @@ -86,20 +87,33 @@ static void __init rcu_bootup_announce_oddness(void)
> > > > >         if (rcu_kick_kthreads)
> > > > >                 pr_info("\tKick kthreads if too-long grace period.\n");
> > > > >         if (IS_ENABLED(CONFIG_DEBUG_OBJECTS_RCU_HEAD))
> > > > >                 pr_info("\tRCU callback double-/use-after-free debug is enabled.\n");
> > > > >         if (gp_preinit_delay)
> > > > >                 pr_info("\tRCU debug GP pre-init slowdown %d jiffies.\n", gp_preinit_delay);
> > > > >         if (gp_init_delay)
> > > > >                 pr_info("\tRCU debug GP init slowdown %d jiffies.\n", gp_init_delay);
> > > > >         if (gp_cleanup_delay)
> > > > >                 pr_info("\tRCU debug GP cleanup slowdown %d jiffies.\n", gp_cleanup_delay);
> > > > > +       if (nocb_patience_delay < 0) {
> > > > > +               pr_info("\tRCU NOCB CPU patience negative (%d), resetting to zero.\n",
> > > > > +                       nocb_patience_delay);
> > > > > +               nocb_patience_delay = 0;
> > > > > +       } else if (nocb_patience_delay > 5 * MSEC_PER_SEC) {
> > > > > +               pr_info("\tRCU NOCB CPU patience too large (%d), resetting to %ld.\n",
> > > > > +                       nocb_patience_delay, 5 * MSEC_PER_SEC);
> > > > > +               nocb_patience_delay = msecs_to_jiffies(5 * MSEC_PER_SEC);
> > > > > +       } else if (nocb_patience_delay) {
> > > > > +               nocb_patience_delay = msecs_to_jiffies(nocb_patience_delay);
> > > > > +               pr_info("\tRCU NOCB CPU patience set to %d milliseconds.\n",
> > > > > +                       jiffies_to_msecs(nocb_patience_delay));
> > > > > +       }
> > > > 
> > > > I just did this here at the end:
> > > > 
> > > > 	nocb_patience_delay_jiffies = msecs_to_jiffies(nocb_patience_delay);
> > > > 
> > > > Ah, you are wanting to print out the milliseconds after the rounding
> > > > to jiffies.
> > > 
> > > That's right, just to make sure the user gets the effective patience time, 
> > > instead of the before-rounding one, which was on input.
> > > 
> > > > I am going to hold off on that for the moment, but I hear your request
> > > > and I have not yet said "no".  ;-)
> > > 
> > > Sure :)
> > > It's just something I think is nice to have (as a user).
> > 
> > If you would like to do a separate patch adding this, here are the
> > requirements:
> > 
> > o	If the current code prints nothing, nothing additional should
> > 	be printed.
> > 
> > o	If the rounding ended up with the same value (as it should in
> > 	systems with HZ=1000), nothing additional should be printed.
> > 
> > o	Your choice as to whether or not you want to print out the
> > 	jiffies value.
> > 
> > o	If the additional message is on a new line, it needs to be
> > 	indented so that it is clear that it is subordinate to the
> > 	previous message.
> > 
> > 	Otherwise, you can use pr_cont() to continue the previous
> > 	line, of course being careful about "\n".
> > 
> > Probably also something that I am forgetting, but that is most of it.
> 
> Thanks!
> I will work on a patch doing that :)

Very good, looking forward to seeing what you come up with!

My current state is on the "dev" branch of the -rcu tree, so please base
on that.

> > > > >         if (!use_softirq)
> > > > >                 pr_info("\tRCU_SOFTIRQ processing moved to rcuc kthreads.\n");
> > > > >         if (IS_ENABLED(CONFIG_RCU_EQS_DEBUG))
> > > > >                 pr_info("\tRCU debug extended QS entry/exit.\n");
> > > > >         rcupdate_announce_bootup_oddness();
> > > > >  }
> > > > >  
> > > > >  #ifdef CONFIG_PREEMPT_RCU
> > > > >  
> > > > >  static void rcu_report_exp_rnp(struct rcu_node *rnp, bool wake);
> > > > > @@ -1260,10 +1274,29 @@ static bool rcu_nohz_full_cpu(void)
> > > > >  
> > > > >  /*
> > > > >   * Bind the RCU grace-period kthreads to the housekeeping CPU.
> > > > >   */
> > > > >  static void rcu_bind_gp_kthread(void)
> > > > >  {
> > > > >         if (!tick_nohz_full_enabled())
> > > > >                 return;
> > > > >         housekeeping_affine(current, HK_TYPE_RCU);
> > > > >  }
> > > > > +
> > > > > +/*
> > > > > + * Is this CPU a NO_HZ_FULL CPU that should ignore RCU if the time since the
> > > > > + * start of current grace period is smaller than nocb_patience_delay ?
> > > > > + *
> > > > > + * This code relies on the fact that all NO_HZ_FULL CPUs are also
> > > > > + * RCU_NOCB_CPU CPUs.
> > > > > + */
> > > > > +static bool rcu_on_patience_delay(void)
> > > > > +{
> > > > > +#ifdef CONFIG_NO_HZ_FULL
> > > > 
> > > > You lost me on this one.  Why do we need the #ifdef instead of
> > > > IS_ENABLED()?  Also, please note that rcu_nohz_full_cpu() is already a
> > > > compile-time @false in CONFIG_NO_HZ_FULL=n kernels.
> > > 
> > > You are right. rcu_nohz_full_cpu() has a high chance of being inlined on
> > > 	if ((...) && rcu_nohz_full_cpu())
> > > And since it returns false, this whole statement will be compiled out, and 
> > > the new function will not exist in CONFIG_NO_HZ_FULL=n, so there  is no 
> > > need to test it.
> > 
> > Very good!  You had me going there for a bit.  ;-)
> > 
> > > > > +       if (!nocb_patience_delay)
> > > > > +               return false;
> > > > 
> > > > We get this automatically with the comparison below, right?
> > > 
> > > Right
> > > 
> > > >   If so, we
> > > > are not gaining much by creating the helper function.  Or am I missing
> > > > some trick here?
> > > 
> > > Well, it's a fastpath. Up to here, we just need to read 
> > > nocb_patience_delay{,_jiffies} from memory.
> > 
> > Just nocb_patience_delay_jiffies, correct?  Unless I am missing something,
> > nocb_patience_delay is unused after boot.
> 
> Right, I used both because I was referring to the older version and the 
> current version with _jiffies.

Fair enough!

> > > If we don't include the fastpath we have to read jiffies and 
> > > rcu_state.gp_start, which can take extra time: up to 2 cache misses.
> > > 
> > > I thought it could be relevant, as we reduce the overhead of the new 
> > > parameter when it's disabled (patience=0). 
> > > 
> > > Do you think that could be relevant?
> > 
> > Well, the hardware's opinion is what matters.  ;-)
> > 
> > But the caller's code path reads jiffies a few times, so it should
> > be hot in the cache, correct?
> 
> Right, but I wonder what the chances are of it getting updated between the
> caller's use and this function's. Same for gp_start.

Well, jiffies is updated at most once per millisecond, and gp_start is
updated at most once per few milliseconds.  So the chances of it being
updated within that code sequence are quite small.

> > But that does lead to another topic, namely the possibility of tagging
> > nocb_patience_delay_jiffies with __read_mostly. 
> 
> Oh, right. This was supposed to be in the diff I sent earlier, but I
> completely forgot to change it before sending. So, yeah, I agree on
> nocb_patience_delay being __read_mostly.
> 
> > And there might be
> > a number of other of RCU's variables that could be similarly tagged
> > in order to avoid false sharing.  (But is there any false sharing?
> > This might be worth testing.)
> 
> Maybe there isn't, but I wonder if it would hurt performance if they were 
> tagged as __read_only anyway. 

Let's be at least a little careful here.  It is just as easy to hurt
performance by marking things __read_mostly or __read_only as it is
to help performance.  ;-)

							Thanx, Paul
Leonardo Bras May 10, 2024, 7:50 p.m. UTC | #53
On Fri, May 10, 2024 at 10:41:53AM -0700, Paul E. McKenney wrote:
> On Fri, May 10, 2024 at 02:12:32PM -0300, Leonardo Bras wrote:
> > On Fri, May 10, 2024 at 09:21:59AM -0700, Paul E. McKenney wrote:
> > > On Fri, May 10, 2024 at 01:06:40PM -0300, Leonardo Bras wrote:
> > > > On Thu, May 09, 2024 at 04:45:53PM -0700, Paul E. McKenney wrote:
> > > > > On Thu, May 09, 2024 at 07:14:18AM -0300, Leonardo Bras wrote:
> > > > > > On Thu, May 09, 2024 at 05:16:57AM -0300, Leonardo Bras wrote:
> > > > > 
> > > > > [ . . . ]
> > > > > 
> > > > > > > Here I suppose something like this can take care of not needing to convert 
> > > > > > > ms -> jiffies every rcu_pending():
> > > > > > > 
> > > > > > > +	nocb_patience_delay = msecs_to_jiffies(nocb_patience_delay);
> > > > > > > 
> > > > > > 
> > > > > > Uh, there is more to it, actually. We need to make sure the user 
> > > > > > understands that we are rounding down the value to a multiple of a jiffy 
> > > > > > period, so it's not a surprise if the delay value is not exactly the same 
> > > > > > as the passed on kernel cmdline.
> > > > > > 
> > > > > > So something like the diff below should be OK, as this behavior is explained 
> > > > > > in the docs, and pr_info() will print the effective value.
> > > > > > 
> > > > > > What do you think?
> > > > > 
> > > > > Good point, and I have taken your advice on making the documentation
> > > > > say what it does.
> > > > 
> > > > Thanks :)
> > > > 
> > > > > 
> > > > > > Thanks!
> > > > > > Leo
> > > > > > 
> > > > > > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> > > > > > index 0a3b0fd1910e..9a50be9fd9eb 100644
> > > > > > --- a/Documentation/admin-guide/kernel-parameters.txt
> > > > > > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > > > > > @@ -4974,20 +4974,28 @@
> > > > > >                         otherwise be caused by callback floods through
> > > > > >                         use of the ->nocb_bypass list.  However, in the
> > > > > >                         common non-flooded case, RCU queues directly to
> > > > > >                         the main ->cblist in order to avoid the extra
> > > > > >                         overhead of the ->nocb_bypass list and its lock.
> > > > > >                         But if there are too many callbacks queued during
> > > > > >                         a single jiffy, RCU pre-queues the callbacks into
> > > > > >                         the ->nocb_bypass queue.  The definition of "too
> > > > > >                         many" is supplied by this kernel boot parameter.
> > > > > >  
> > > > > > +       rcutree.nocb_patience_delay= [KNL]
> > > > > > +                       On callback-offloaded (rcu_nocbs) CPUs, avoid
> > > > > > +                       disturbing RCU unless the grace period has
> > > > > > +                       reached the specified age in milliseconds.
> > > > > > +                       Defaults to zero.  Large values will be capped
> > > > > >                         at five seconds. Values are rounded down to a
> > > > > >                         multiple of a jiffy period.
> > > > > > +
> > > > > >         rcutree.qhimark= [KNL]
> > > > > >                         Set threshold of queued RCU callbacks beyond which
> > > > > >                         batch limiting is disabled.
> > > > > >  
> > > > > >         rcutree.qlowmark= [KNL]
> > > > > >                         Set threshold of queued RCU callbacks below which
> > > > > >                         batch limiting is re-enabled.
> > > > > >  
> > > > > >         rcutree.qovld= [KNL]
> > > > > >                         Set threshold of queued RCU callbacks beyond which
> > > > > > diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
> > > > > > index fcf2b4aa3441..62ede401420f 100644
> > > > > > --- a/kernel/rcu/tree.h
> > > > > > +++ b/kernel/rcu/tree.h
> > > > > > @@ -512,20 +512,21 @@ do {                                                              \
> > > > > >         local_irq_save(flags);                                  \
> > > > > >         if (rcu_segcblist_is_offloaded(&(rdp)->cblist)) \
> > > > > >                 raw_spin_lock(&(rdp)->nocb_lock);               \
> > > > > >  } while (0)
> > > > > >  #else /* #ifdef CONFIG_RCU_NOCB_CPU */
> > > > > >  #define rcu_nocb_lock_irqsave(rdp, flags) local_irq_save(flags)
> > > > > >  #endif /* #else #ifdef CONFIG_RCU_NOCB_CPU */
> > > > > >  
> > > > > >  static void rcu_bind_gp_kthread(void);
> > > > > >  static bool rcu_nohz_full_cpu(void);
> > > > > > +static bool rcu_on_patience_delay(void);
> > > > > 
> > > > > I don't think we need an access function, but will check below.
> > > > > 
> > > > > >  /* Forward declarations for tree_stall.h */
> > > > > >  static void record_gp_stall_check_time(void);
> > > > > >  static void rcu_iw_handler(struct irq_work *iwp);
> > > > > >  static void check_cpu_stall(struct rcu_data *rdp);
> > > > > >  static void rcu_check_gp_start_stall(struct rcu_node *rnp, struct rcu_data *rdp,
> > > > > >                                      const unsigned long gpssdelay);
> > > > > >  
> > > > > >  /* Forward declarations for tree_exp.h. */
> > > > > >  static void sync_rcu_do_polled_gp(struct work_struct *wp);
> > > > > > diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> > > > > > index 340bbefe5f65..639243b0410f 100644
> > > > > > --- a/kernel/rcu/tree_plugin.h
> > > > > > +++ b/kernel/rcu/tree_plugin.h
> > > > > > @@ -5,20 +5,21 @@
> > > > > >   * or preemptible semantics.
> > > > > >   *
> > > > > >   * Copyright Red Hat, 2009
> > > > > >   * Copyright IBM Corporation, 2009
> > > > > >   *
> > > > > >   * Author: Ingo Molnar <mingo@elte.hu>
> > > > > >   *        Paul E. McKenney <paulmck@linux.ibm.com>
> > > > > >   */
> > > > > >  
> > > > > >  #include "../locking/rtmutex_common.h"
> > > > > > +#include <linux/jiffies.h>
> > > > > 
> > > > > This is already pulled in by the enclosing tree.c file, so it should not
> > > > > be necessary to include it again. 
> > > > 
> > > > Even better :)
> > > > 
> > > > > (Or did you get a build failure when
> > > > > leaving this out?)
> > > > 
> > > > I didn't, it's just that my editor complained the symbols were not getting 
> > > > properly resolved, so I included it and it was fixed. But since clangd is 
> > > > known to make some mistakes, I should have compile-test'd before adding it.
> > > 
> > > Ah, got it!  ;-)
> > > 
> > > > > >  static bool rcu_rdp_is_offloaded(struct rcu_data *rdp)
> > > > > >  {
> > > > > >         /*
> > > > > >          * In order to read the offloaded state of an rdp in a safe
> > > > > >          * and stable way and prevent from its value to be changed
> > > > > >          * under us, we must either hold the barrier mutex, the cpu
> > > > > >          * hotplug lock (read or write) or the nocb lock. Local
> > > > > >          * non-preemptible reads are also safe. NOCB kthreads and
> > > > > >          * timers have their own means of synchronization against the
> > > > > > @@ -86,20 +87,33 @@ static void __init rcu_bootup_announce_oddness(void)
> > > > > >         if (rcu_kick_kthreads)
> > > > > >                 pr_info("\tKick kthreads if too-long grace period.\n");
> > > > > >         if (IS_ENABLED(CONFIG_DEBUG_OBJECTS_RCU_HEAD))
> > > > > >                 pr_info("\tRCU callback double-/use-after-free debug is enabled.\n");
> > > > > >         if (gp_preinit_delay)
> > > > > >                 pr_info("\tRCU debug GP pre-init slowdown %d jiffies.\n", gp_preinit_delay);
> > > > > >         if (gp_init_delay)
> > > > > >                 pr_info("\tRCU debug GP init slowdown %d jiffies.\n", gp_init_delay);
> > > > > >         if (gp_cleanup_delay)
> > > > > >                 pr_info("\tRCU debug GP cleanup slowdown %d jiffies.\n", gp_cleanup_delay);
> > > > > > +       if (nocb_patience_delay < 0) {
> > > > > > +               pr_info("\tRCU NOCB CPU patience negative (%d), resetting to zero.\n",
> > > > > > +                       nocb_patience_delay);
> > > > > > +               nocb_patience_delay = 0;
> > > > > > +       } else if (nocb_patience_delay > 5 * MSEC_PER_SEC) {
> > > > > > +               pr_info("\tRCU NOCB CPU patience too large (%d), resetting to %ld.\n",
> > > > > > +                       nocb_patience_delay, 5 * MSEC_PER_SEC);
> > > > > > +               nocb_patience_delay = msecs_to_jiffies(5 * MSEC_PER_SEC);
> > > > > > +       } else if (nocb_patience_delay) {
> > > > > > +               nocb_patience_delay = msecs_to_jiffies(nocb_patience_delay);
> > > > > > +               pr_info("\tRCU NOCB CPU patience set to %d milliseconds.\n",
> > > > > > +                       jiffies_to_msecs(nocb_patience_delay));
> > > > > > +       }
> > > > > 
> > > > > I just did this here at the end:
> > > > > 
> > > > > 	nocb_patience_delay_jiffies = msecs_to_jiffies(nocb_patience_delay);
> > > > > 
> > > > > Ah, you are wanting to print out the milliseconds after the rounding
> > > > > to jiffies.
> > > > 
> > > > That's right, just to make sure the user gets the effective patience time, 
> > > > instead of the before-rounding one, which was on input.
> > > > 
> > > > > I am going to hold off on that for the moment, but I hear your request
> > > > > and I have not yet said "no".  ;-)
> > > > 
> > > > Sure :)
> > > > It's just something I think is nice to have (as a user).
> > > 
> > > If you would like to do a separate patch adding this, here are the
> > > requirements:
> > > 
> > > o	If the current code prints nothing, nothing additional should
> > > 	be printed.
> > > 
> > > o	If the rounding ended up with the same value (as it should in
> > > 	systems with HZ=1000), nothing additional should be printed.
> > > 
> > > o	Your choice as to whether or not you want to print out the
> > > 	jiffies value.
> > > 
> > > o	If the additional message is on a new line, it needs to be
> > > 	indented so that it is clear that it is subordinate to the
> > > 	previous message.
> > > 
> > > 	Otherwise, you can use pr_cont() to continue the previous
> > > 	line, of course being careful about "\n".
> > > 
> > > Probably also something that I am forgetting, but that is most of it.
> > 
> > Thanks!
> > I will work on a patch doing that :)
> 
> Very good, looking forward to seeing what you come up with!
> 
> My current state is on the "dev" branch of the -rcu tree, so please base
> on that.

Thanks! I used it earlier to send the previous diff :)

> 
> > > > > >         if (!use_softirq)
> > > > > >                 pr_info("\tRCU_SOFTIRQ processing moved to rcuc kthreads.\n");
> > > > > >         if (IS_ENABLED(CONFIG_RCU_EQS_DEBUG))
> > > > > >                 pr_info("\tRCU debug extended QS entry/exit.\n");
> > > > > >         rcupdate_announce_bootup_oddness();
> > > > > >  }
> > > > > >  
> > > > > >  #ifdef CONFIG_PREEMPT_RCU
> > > > > >  
> > > > > >  static void rcu_report_exp_rnp(struct rcu_node *rnp, bool wake);
> > > > > > @@ -1260,10 +1274,29 @@ static bool rcu_nohz_full_cpu(void)
> > > > > >  
> > > > > >  /*
> > > > > >   * Bind the RCU grace-period kthreads to the housekeeping CPU.
> > > > > >   */
> > > > > >  static void rcu_bind_gp_kthread(void)
> > > > > >  {
> > > > > >         if (!tick_nohz_full_enabled())
> > > > > >                 return;
> > > > > >         housekeeping_affine(current, HK_TYPE_RCU);
> > > > > >  }
> > > > > > +
> > > > > > +/*
> > > > > > + * Is this CPU a NO_HZ_FULL CPU that should ignore RCU if the time since the
> > > > > > + * start of current grace period is smaller than nocb_patience_delay ?
> > > > > > + *
> > > > > > + * This code relies on the fact that all NO_HZ_FULL CPUs are also
> > > > > > + * RCU_NOCB_CPU CPUs.
> > > > > > + */
> > > > > > +static bool rcu_on_patience_delay(void)
> > > > > > +{
> > > > > > +#ifdef CONFIG_NO_HZ_FULL
> > > > > 
> > > > > You lost me on this one.  Why do we need the #ifdef instead of
> > > > > IS_ENABLED()?  Also, please note that rcu_nohz_full_cpu() is already a
> > > > > compile-time @false in CONFIG_NO_HZ_FULL=n kernels.
> > > > 
> > > > You are right. rcu_nohz_full_cpu() has a high chance of being inlined in
> > > > 	if ((...) && rcu_nohz_full_cpu())
> > > > And since it returns false, this whole statement will be compiled out, and 
> > > > the new function will not exist in CONFIG_NO_HZ_FULL=n, so there is no 
> > > > need to test it.
> > > 
> > > Very good!  You had me going there for a bit.  ;-)
> > > 
> > > > > > +       if (!nocb_patience_delay)
> > > > > > +               return false;
> > > > > 
> > > > > We get this automatically with the comparison below, right?
> > > > 
> > > > Right
> > > > 
> > > > >   If so, we
> > > > > are not gaining much by creating the helper function.  Or am I missing
> > > > > some trick here?
> > > > 
> > > > Well, it's a fastpath. Up to here, we just need to read 
> > > > nocb_patience_delay{,_jiffies} from memory.
> > > 
> > > Just nocb_patience_delay_jiffies, correct?  Unless I am missing something,
> > > nocb_patience_delay is unused after boot.
> > 
> > Right, I used both because I was referring to the older version and the 
> > current version with _jiffies.
> 
> Fair enough!
> 
> > > > If we don't include the fastpath we have to read jiffies and 
> > > > rcu_state.gp_start, which can take extra time: up to 2 cache misses.
> > > > 
> > > > I thought it could be relevant, as we reduce the overhead of the new 
> > > > parameter when it's disabled (patience=0). 
> > > > 
> > > > Do you think that could be relevant?
> > > 
> > > Well, the hardware's opinion is what matters.  ;-)
> > > 
> > > But the caller's code path reads jiffies a few times, so it should
> > > be hot in the cache, correct?
> > 
> > Right, but I wonder what the chances are of it getting updated between the 
> > caller's use and this function's. Same for gp_start.
> 
> Well, jiffies is updated at most once per millisecond, and gp_start is
> updated at most once per few milliseconds.  So the chances of it being
> updated within that code sequence are quite small.

Fair enough, and we probably don't need to worry about it getting 
evicted from the cache in this sequence either. 

Also, time_before() is a macro, so there is no function-call overhead to 
worry about: we just spend two extra L1-cache reads and a couple of 
arithmetic instructions, which should not take long. So it's fair to 
assume the fast path would not be much faster than the slow path, which 
means we don't need a fast path after all.

Thanks for helping me notice that :)

> 
> > > But that does lead to another topic, namely the possibility of tagging
> > > nocb_patience_delay_jiffies with __read_mostly. 
> > 
> > Oh, right. This was supposed to be in the diff I sent earlier, but I 
> > completely forgot to change it before sending. So, yeah, I agree on 
> > nocb_patience_delay being __read_mostly. 
> > 
> > > And there might be
> > > a number of other of RCU's variables that could be similarly tagged
> > > in order to avoid false sharing.  (But is there any false sharing?
> > > This might be worth testing.)
> > 
> > Maybe there isn't, but I wonder if it would hurt performance if they were 
> > tagged as __read_only anyway. 
> 
> Let's be at least a little careful here.  It is just as easy to hurt
> performance by marking things __read_mostly or __read_only as it is
> to help performance.  ;-)

Fair enough :)

> 
> 							Thanx, Paul
> 

Thanks!
Leo
Leonardo Bras May 10, 2024, 9:15 p.m. UTC | #54
On Fri, May 10, 2024 at 04:50:41PM -0300, Leonardo Bras wrote:
> On Fri, May 10, 2024 at 10:41:53AM -0700, Paul E. McKenney wrote:
> > On Fri, May 10, 2024 at 02:12:32PM -0300, Leonardo Bras wrote:
> > > On Fri, May 10, 2024 at 09:21:59AM -0700, Paul E. McKenney wrote:
> > > > On Fri, May 10, 2024 at 01:06:40PM -0300, Leonardo Bras wrote:
> > > > > On Thu, May 09, 2024 at 04:45:53PM -0700, Paul E. McKenney wrote:
> > > > > > On Thu, May 09, 2024 at 07:14:18AM -0300, Leonardo Bras wrote:
> > > > > > > On Thu, May 09, 2024 at 05:16:57AM -0300, Leonardo Bras wrote:
> > > > > > 
> > > > > > [ . . . ]
> > > > > > 
> > > > > > > > Here I suppose something like this can take care of not needing to convert 
> > > > > > > > ms -> jiffies every rcu_pending():
> > > > > > > > 
> > > > > > > > +	nocb_patience_delay = msecs_to_jiffies(nocb_patience_delay);
> > > > > > > > 
> > > > > > > 
> > > > > > > Uh, there is more to it, actually. We need to make sure the user 
> > > > > > > understands that we are rounding down the value to a multiple of a jiffy 
> > > > > > > period, so it's not a surprise if the delay value is not exactly the same 
> > > > > > > as the one passed on the kernel cmdline.
> > > > > > > 
> > > > > > > So something like the below diff should be ok, as this behavior is explained 
> > > > > > > in the docs, and pr_info() will print the effective value.
> > > > > > > 
> > > > > > > What do you think?
> > > > > > 
> > > > > > Good point, and I have taken your advice on making the documentation
> > > > > > say what it does.
> > > > > 
> > > > > Thanks :)
> > > > > 
> > > > > > 
> > > > > > > Thanks!
> > > > > > > Leo
> > > > > > > 
> > > > > > > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> > > > > > > index 0a3b0fd1910e..9a50be9fd9eb 100644
> > > > > > > --- a/Documentation/admin-guide/kernel-parameters.txt
> > > > > > > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > > > > > > @@ -4974,20 +4974,28 @@
> > > > > > >                         otherwise be caused by callback floods through
> > > > > > >                         use of the ->nocb_bypass list.  However, in the
> > > > > > >                         common non-flooded case, RCU queues directly to
> > > > > > >                         the main ->cblist in order to avoid the extra
> > > > > > >                         overhead of the ->nocb_bypass list and its lock.
> > > > > > >                         But if there are too many callbacks queued during
> > > > > > >                         a single jiffy, RCU pre-queues the callbacks into
> > > > > > >                         the ->nocb_bypass queue.  The definition of "too
> > > > > > >                         many" is supplied by this kernel boot parameter.
> > > > > > >  
> > > > > > > +       rcutree.nocb_patience_delay= [KNL]
> > > > > > > +                       On callback-offloaded (rcu_nocbs) CPUs, avoid
> > > > > > > +                       disturbing RCU unless the grace period has
> > > > > > > +                       reached the specified age in milliseconds.
> > > > > > > +                       Defaults to zero.  Large values will be capped
> > > > > > > +                       at five seconds.  Values are rounded down to a
> > > > > > > +                       multiple of the jiffy period.
> > > > > > > +
> > > > > > >         rcutree.qhimark= [KNL]
> > > > > > >                         Set threshold of queued RCU callbacks beyond which
> > > > > > >                         batch limiting is disabled.
> > > > > > >  
> > > > > > >         rcutree.qlowmark= [KNL]
> > > > > > >                         Set threshold of queued RCU callbacks below which
> > > > > > >                         batch limiting is re-enabled.
> > > > > > >  
> > > > > > >         rcutree.qovld= [KNL]
> > > > > > >                         Set threshold of queued RCU callbacks beyond which
> > > > > > > diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
> > > > > > > index fcf2b4aa3441..62ede401420f 100644
> > > > > > > --- a/kernel/rcu/tree.h
> > > > > > > +++ b/kernel/rcu/tree.h
> > > > > > > @@ -512,20 +512,21 @@ do {                                                              \
> > > > > > >         local_irq_save(flags);                                  \
> > > > > > >         if (rcu_segcblist_is_offloaded(&(rdp)->cblist)) \
> > > > > > >                 raw_spin_lock(&(rdp)->nocb_lock);               \
> > > > > > >  } while (0)
> > > > > > >  #else /* #ifdef CONFIG_RCU_NOCB_CPU */
> > > > > > >  #define rcu_nocb_lock_irqsave(rdp, flags) local_irq_save(flags)
> > > > > > >  #endif /* #else #ifdef CONFIG_RCU_NOCB_CPU */
> > > > > > >  
> > > > > > >  static void rcu_bind_gp_kthread(void);
> > > > > > >  static bool rcu_nohz_full_cpu(void);
> > > > > > > +static bool rcu_on_patience_delay(void);
> > > > > > 
> > > > > > I don't think we need an access function, but will check below.
> > > > > > 
> > > > > > >  /* Forward declarations for tree_stall.h */
> > > > > > >  static void record_gp_stall_check_time(void);
> > > > > > >  static void rcu_iw_handler(struct irq_work *iwp);
> > > > > > >  static void check_cpu_stall(struct rcu_data *rdp);
> > > > > > >  static void rcu_check_gp_start_stall(struct rcu_node *rnp, struct rcu_data *rdp,
> > > > > > >                                      const unsigned long gpssdelay);
> > > > > > >  
> > > > > > >  /* Forward declarations for tree_exp.h. */
> > > > > > >  static void sync_rcu_do_polled_gp(struct work_struct *wp);
> > > > > > > diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> > > > > > > index 340bbefe5f65..639243b0410f 100644
> > > > > > > --- a/kernel/rcu/tree_plugin.h
> > > > > > > +++ b/kernel/rcu/tree_plugin.h
> > > > > > > @@ -5,20 +5,21 @@
> > > > > > >   * or preemptible semantics.
> > > > > > >   *
> > > > > > >   * Copyright Red Hat, 2009
> > > > > > >   * Copyright IBM Corporation, 2009
> > > > > > >   *
> > > > > > >   * Author: Ingo Molnar <mingo@elte.hu>
> > > > > > >   *        Paul E. McKenney <paulmck@linux.ibm.com>
> > > > > > >   */
> > > > > > >  
> > > > > > >  #include "../locking/rtmutex_common.h"
> > > > > > > +#include <linux/jiffies.h>
> > > > > > 
> > > > > > This is already pulled in by the enclosing tree.c file, so it should not
> > > > > > be necessary to include it again. 
> > > > > 
> > > > > Even better :)
> > > > > 
> > > > > > (Or did you get a build failure when
> > > > > > leaving this out?)
> > > > > 
> > > > > I didn't, it's just that my editor complained the symbols were not getting 
> > > > > properly resolved, so I included it and it was fixed. But since clangd is 
> > > > > known to make some mistakes, I should have compile-tested before adding it.
> > > > 
> > > > Ah, got it!  ;-)
> > > > 
> > > > > > >  static bool rcu_rdp_is_offloaded(struct rcu_data *rdp)
> > > > > > >  {
> > > > > > >         /*
> > > > > > >          * In order to read the offloaded state of an rdp in a safe
> > > > > > >          * and stable way and prevent from its value to be changed
> > > > > > >          * under us, we must either hold the barrier mutex, the cpu
> > > > > > >          * hotplug lock (read or write) or the nocb lock. Local
> > > > > > >          * non-preemptible reads are also safe. NOCB kthreads and
> > > > > > >          * timers have their own means of synchronization against the
> > > > > > > @@ -86,20 +87,33 @@ static void __init rcu_bootup_announce_oddness(void)
> > > > > > >         if (rcu_kick_kthreads)
> > > > > > >                 pr_info("\tKick kthreads if too-long grace period.\n");
> > > > > > >         if (IS_ENABLED(CONFIG_DEBUG_OBJECTS_RCU_HEAD))
> > > > > > >                 pr_info("\tRCU callback double-/use-after-free debug is enabled.\n");
> > > > > > >         if (gp_preinit_delay)
> > > > > > >                 pr_info("\tRCU debug GP pre-init slowdown %d jiffies.\n", gp_preinit_delay);
> > > > > > >         if (gp_init_delay)
> > > > > > >                 pr_info("\tRCU debug GP init slowdown %d jiffies.\n", gp_init_delay);
> > > > > > >         if (gp_cleanup_delay)
> > > > > > >                 pr_info("\tRCU debug GP cleanup slowdown %d jiffies.\n", gp_cleanup_delay);
> > > > > > > +       if (nocb_patience_delay < 0) {
> > > > > > > +               pr_info("\tRCU NOCB CPU patience negative (%d), resetting to zero.\n",
> > > > > > > +                       nocb_patience_delay);
> > > > > > > +               nocb_patience_delay = 0;
> > > > > > > +       } else if (nocb_patience_delay > 5 * MSEC_PER_SEC) {
> > > > > > > +               pr_info("\tRCU NOCB CPU patience too large (%d), resetting to %ld.\n",
> > > > > > > +                       nocb_patience_delay, 5 * MSEC_PER_SEC);
> > > > > > > +               nocb_patience_delay = msecs_to_jiffies(5 * MSEC_PER_SEC);
> > > > > > > +       } else if (nocb_patience_delay) {
> > > > > > > +               nocb_patience_delay = msecs_to_jiffies(nocb_patience_delay);
> > > > > > > +               pr_info("\tRCU NOCB CPU patience set to %d milliseconds.\n",
> > > > > > > +                       jiffies_to_msecs(nocb_patience_delay));
> > > > > > > +       }
> > > > > > 
> > > > > > I just did this here at the end:
> > > > > > 
> > > > > > 	nocb_patience_delay_jiffies = msecs_to_jiffies(nocb_patience_delay);
> > > > > > 
> > > > > > Ah, you are wanting to print out the milliseconds after the rounding
> > > > > > to jiffies.
> > > > > 
> > > > > That's right, just to make sure the user gets the effective patience time, 
> > > > > instead of the before-rounding value that was given as input.
> > > > > 
> > > > > > I am going to hold off on that for the moment, but I hear your request
> > > > > > and I have not yet said "no".  ;-)
> > > > > 
> > > > > Sure :)
> > > > > It's just something I think is nice to have (as a user).
> > > > 
> > > > If you would like to do a separate patch adding this, here are the
> > > > requirements:
> > > > 
> > > > o	If the current code prints nothing, nothing additional should
> > > > 	be printed.
> > > > 
> > > > o	If the rounding ended up with the same value (as it should in
> > > > 	systems with HZ=1000), nothing additional should be printed.
> > > > 
> > > > o	Your choice as to whether or not you want to print out the
> > > > 	jiffies value.
> > > > 
> > > > o	If the additional message is on a new line, it needs to be
> > > > 	indented so that it is clear that it is subordinate to the
> > > > 	previous message.
> > > > 
> > > > 	Otherwise, you can use pr_cont() to continue the previous
> > > > 	line, of course being careful about "\n".
> > > > 
> > > > Probably also something that I am forgetting, but that is most of it.
> > > 
> > > Thanks!
> > > I will work on a patch doing that :)
> > 
> > Very good, looking forward to seeing what you come up with!
> > 
> > My current state is on the "dev" branch of the -rcu tree, so please base
> > on that.
> 
> Thanks! I used it earlier to send the previous diff :)
> 
> > 
> > > > > > >         if (!use_softirq)
> > > > > > >                 pr_info("\tRCU_SOFTIRQ processing moved to rcuc kthreads.\n");
> > > > > > >         if (IS_ENABLED(CONFIG_RCU_EQS_DEBUG))
> > > > > > >                 pr_info("\tRCU debug extended QS entry/exit.\n");
> > > > > > >         rcupdate_announce_bootup_oddness();
> > > > > > >  }
> > > > > > >  
> > > > > > >  #ifdef CONFIG_PREEMPT_RCU
> > > > > > >  
> > > > > > >  static void rcu_report_exp_rnp(struct rcu_node *rnp, bool wake);
> > > > > > > @@ -1260,10 +1274,29 @@ static bool rcu_nohz_full_cpu(void)
> > > > > > >  
> > > > > > >  /*
> > > > > > >   * Bind the RCU grace-period kthreads to the housekeeping CPU.
> > > > > > >   */
> > > > > > >  static void rcu_bind_gp_kthread(void)
> > > > > > >  {
> > > > > > >         if (!tick_nohz_full_enabled())
> > > > > > >                 return;
> > > > > > >         housekeeping_affine(current, HK_TYPE_RCU);
> > > > > > >  }
> > > > > > > +
> > > > > > > +/*
> > > > > > > + * Is this CPU a NO_HZ_FULL CPU that should ignore RCU if the time since the
> > > > > > > + * start of current grace period is smaller than nocb_patience_delay ?
> > > > > > > + *
> > > > > > > + * This code relies on the fact that all NO_HZ_FULL CPUs are also
> > > > > > > + * RCU_NOCB_CPU CPUs.
> > > > > > > + */
> > > > > > > +static bool rcu_on_patience_delay(void)
> > > > > > > +{
> > > > > > > +#ifdef CONFIG_NO_HZ_FULL
> > > > > > 
> > > > > > You lost me on this one.  Why do we need the #ifdef instead of
> > > > > > IS_ENABLED()?  Also, please note that rcu_nohz_full_cpu() is already a
> > > > > > compile-time @false in CONFIG_NO_HZ_FULL=n kernels.
> > > > > 
> > > > > You are right. rcu_nohz_full_cpu() has a high chance of being inlined in
> > > > > 	if ((...) && rcu_nohz_full_cpu())
> > > > > And since it returns false, this whole statement will be compiled out, and 
> > > > > the new function will not exist in CONFIG_NO_HZ_FULL=n, so there is no 
> > > > > need to test it.
> > > > 
> > > > Very good!  You had me going there for a bit.  ;-)
> > > > 
> > > > > > > +       if (!nocb_patience_delay)
> > > > > > > +               return false;
> > > > > > 
> > > > > > We get this automatically with the comparison below, right?
> > > > > 
> > > > > Right
> > > > > 
> > > > > >   If so, we
> > > > > > are not gaining much by creating the helper function.  Or am I missing
> > > > > > some trick here?
> > > > > 
> > > > > Well, it's a fastpath. Up to here, we just need to read 
> > > > > nocb_patience_delay{,_jiffies} from memory.
> > > > 
> > > > Just nocb_patience_delay_jiffies, correct?  Unless I am missing something,
> > > > nocb_patience_delay is unused after boot.
> > > 
> > > Right, I used both because I was referring to the older version and the 
> > > current version with _jiffies.
> > 
> > Fair enough!
> > 
> > > > > If we don't include the fastpath we have to read jiffies and 
> > > > > rcu_state.gp_start, which can take extra time: up to 2 cache misses.
> > > > > 
> > > > > I thought it could be relevant, as we reduce the overhead of the new 
> > > > > parameter when it's disabled (patience=0). 
> > > > > 
> > > > > Do you think that could be relevant?
> > > > 
> > > > Well, the hardware's opinion is what matters.  ;-)
> > > > 
> > > > But the caller's code path reads jiffies a few times, so it should
> > > > be hot in the cache, correct?
> > > 
> > > Right, but I wonder what the chances are of it getting updated between the 
> > > caller's use and this function's. Same for gp_start.
> > 
> > Well, jiffies is updated at most once per millisecond, and gp_start is
> > updated at most once per few milliseconds.  So the chances of it being
> > updated within that code sequence are quite small.
> 
> Fair enough, and we probably don't need to worry about it getting 
> evicted from the cache in this sequence either. 
> 
> Also, time_before() is a macro, so there is no function-call overhead to 
> worry about: we just spend two extra L1-cache reads and a couple of 
> arithmetic instructions, which should not take long. So it's fair to 
> assume the fast path would not be much faster than the slow path, which 
> means we don't need a fast path after all.
> 
> Thanks for helping me notice that :)
> 
> > 
> > > > But that does lead to another topic, namely the possibility of tagging
> > > > nocb_patience_delay_jiffies with __read_mostly. 
> > > 
> > > Oh, right. This was supposed to be in the diff I sent earlier, but I 
> > > completely forgot to change it before sending. So, yeah, I agree on 
> > > nocb_patience_delay being __read_mostly. 
> > > 
> > > > And there might be
> > > > a number of other of RCU's variables that could be similarly tagged
> > > > in order to avoid false sharing.  (But is there any false sharing?
> > > > This might be worth testing.)
> > > 
> > > Maybe there isn't, but I wonder if it would hurt performance if they were 
> > > tagged as __read_only anyway. 
> > 
> > Let's be at least a little careful here.  It is just as easy to hurt
> > performance by marking things __read_mostly or __read_only as it is
> > to help performance.  ;-)
> 
> Fair enough :)
> 
> > 
> > 							Thanx, Paul
> > 
> 

Oh, btw, for what it's worth:
Reviewed-by: Leonardo Bras <leobras@redhat.com>

Thanks!
Leo
Paul E. McKenney May 10, 2024, 9:38 p.m. UTC | #55
On Fri, May 10, 2024 at 06:15:14PM -0300, Leonardo Bras wrote:
> On Fri, May 10, 2024 at 04:50:41PM -0300, Leonardo Bras wrote:
> > On Fri, May 10, 2024 at 10:41:53AM -0700, Paul E. McKenney wrote:
> > > On Fri, May 10, 2024 at 02:12:32PM -0300, Leonardo Bras wrote:
> > > > On Fri, May 10, 2024 at 09:21:59AM -0700, Paul E. McKenney wrote:
> > > > > On Fri, May 10, 2024 at 01:06:40PM -0300, Leonardo Bras wrote:
> > > > > > On Thu, May 09, 2024 at 04:45:53PM -0700, Paul E. McKenney wrote:
> > > > > > > On Thu, May 09, 2024 at 07:14:18AM -0300, Leonardo Bras wrote:
> > > > > > > > On Thu, May 09, 2024 at 05:16:57AM -0300, Leonardo Bras wrote:
> > > > > > > 
> > > > > > > [ . . . ]
> > > > > > > 
> > > > > > > > > Here I suppose something like this can take care of not needing to convert 
> > > > > > > > > ms -> jiffies every rcu_pending():
> > > > > > > > > 
> > > > > > > > > +	nocb_patience_delay = msecs_to_jiffies(nocb_patience_delay);
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > Uh, there is more to it, actually. We need to make sure the user 
> > > > > > > > understands that we are rounding down the value to a multiple of a jiffy 
> > > > > > > > period, so it's not a surprise if the delay value is not exactly the same 
> > > > > > > > as the one passed on the kernel cmdline.
> > > > > > > > 
> > > > > > > > So something like the below diff should be ok, as this behavior is explained 
> > > > > > > > in the docs, and pr_info() will print the effective value.
> > > > > > > > 
> > > > > > > > What do you think?
> > > > > > > 
> > > > > > > Good point, and I have taken your advice on making the documentation
> > > > > > > say what it does.
> > > > > > 
> > > > > > Thanks :)
> > > > > > 
> > > > > > > 
> > > > > > > > Thanks!
> > > > > > > > Leo
> > > > > > > > 
> > > > > > > > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> > > > > > > > index 0a3b0fd1910e..9a50be9fd9eb 100644
> > > > > > > > --- a/Documentation/admin-guide/kernel-parameters.txt
> > > > > > > > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > > > > > > > @@ -4974,20 +4974,28 @@
> > > > > > > >                         otherwise be caused by callback floods through
> > > > > > > >                         use of the ->nocb_bypass list.  However, in the
> > > > > > > >                         common non-flooded case, RCU queues directly to
> > > > > > > >                         the main ->cblist in order to avoid the extra
> > > > > > > >                         overhead of the ->nocb_bypass list and its lock.
> > > > > > > >                         But if there are too many callbacks queued during
> > > > > > > >                         a single jiffy, RCU pre-queues the callbacks into
> > > > > > > >                         the ->nocb_bypass queue.  The definition of "too
> > > > > > > >                         many" is supplied by this kernel boot parameter.
> > > > > > > >  
> > > > > > > > +       rcutree.nocb_patience_delay= [KNL]
> > > > > > > > +                       On callback-offloaded (rcu_nocbs) CPUs, avoid
> > > > > > > > +                       disturbing RCU unless the grace period has
> > > > > > > > +                       reached the specified age in milliseconds.
> > > > > > > > +                       Defaults to zero.  Large values will be capped
> > > > > > > > +                       at five seconds.  Values are rounded down to a
> > > > > > > > +                       multiple of the jiffy period.
> > > > > > > > +
> > > > > > > >         rcutree.qhimark= [KNL]
> > > > > > > >                         Set threshold of queued RCU callbacks beyond which
> > > > > > > >                         batch limiting is disabled.
> > > > > > > >  
> > > > > > > >         rcutree.qlowmark= [KNL]
> > > > > > > >                         Set threshold of queued RCU callbacks below which
> > > > > > > >                         batch limiting is re-enabled.
> > > > > > > >  
> > > > > > > >         rcutree.qovld= [KNL]
> > > > > > > >                         Set threshold of queued RCU callbacks beyond which
> > > > > > > > diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
> > > > > > > > index fcf2b4aa3441..62ede401420f 100644
> > > > > > > > --- a/kernel/rcu/tree.h
> > > > > > > > +++ b/kernel/rcu/tree.h
> > > > > > > > @@ -512,20 +512,21 @@ do {                                                              \
> > > > > > > >         local_irq_save(flags);                                  \
> > > > > > > >         if (rcu_segcblist_is_offloaded(&(rdp)->cblist)) \
> > > > > > > >                 raw_spin_lock(&(rdp)->nocb_lock);               \
> > > > > > > >  } while (0)
> > > > > > > >  #else /* #ifdef CONFIG_RCU_NOCB_CPU */
> > > > > > > >  #define rcu_nocb_lock_irqsave(rdp, flags) local_irq_save(flags)
> > > > > > > >  #endif /* #else #ifdef CONFIG_RCU_NOCB_CPU */
> > > > > > > >  
> > > > > > > >  static void rcu_bind_gp_kthread(void);
> > > > > > > >  static bool rcu_nohz_full_cpu(void);
> > > > > > > > +static bool rcu_on_patience_delay(void);
> > > > > > > 
> > > > > > > I don't think we need an access function, but will check below.
> > > > > > > 
> > > > > > > >  /* Forward declarations for tree_stall.h */
> > > > > > > >  static void record_gp_stall_check_time(void);
> > > > > > > >  static void rcu_iw_handler(struct irq_work *iwp);
> > > > > > > >  static void check_cpu_stall(struct rcu_data *rdp);
> > > > > > > >  static void rcu_check_gp_start_stall(struct rcu_node *rnp, struct rcu_data *rdp,
> > > > > > > >                                      const unsigned long gpssdelay);
> > > > > > > >  
> > > > > > > >  /* Forward declarations for tree_exp.h. */
> > > > > > > >  static void sync_rcu_do_polled_gp(struct work_struct *wp);
> > > > > > > > diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> > > > > > > > index 340bbefe5f65..639243b0410f 100644
> > > > > > > > --- a/kernel/rcu/tree_plugin.h
> > > > > > > > +++ b/kernel/rcu/tree_plugin.h
> > > > > > > > @@ -5,20 +5,21 @@
> > > > > > > >   * or preemptible semantics.
> > > > > > > >   *
> > > > > > > >   * Copyright Red Hat, 2009
> > > > > > > >   * Copyright IBM Corporation, 2009
> > > > > > > >   *
> > > > > > > >   * Author: Ingo Molnar <mingo@elte.hu>
> > > > > > > >   *        Paul E. McKenney <paulmck@linux.ibm.com>
> > > > > > > >   */
> > > > > > > >  
> > > > > > > >  #include "../locking/rtmutex_common.h"
> > > > > > > > +#include <linux/jiffies.h>
> > > > > > > 
> > > > > > > This is already pulled in by the enclosing tree.c file, so it should not
> > > > > > > be necessary to include it again. 
> > > > > > 
> > > > > > Even better :)
> > > > > > 
> > > > > > > (Or did you get a build failure when
> > > > > > > leaving this out?)
> > > > > > 
> > > > > > I didn't, it's just that my editor complained the symbols were not getting 
> > > > > > properly resolved, so I included it and it was fixed. But since clangd is 
> > > > > > known to make some mistakes, I should have compile-tested before adding it.
> > > > > 
> > > > > Ah, got it!  ;-)
> > > > > 
> > > > > > > >  static bool rcu_rdp_is_offloaded(struct rcu_data *rdp)
> > > > > > > >  {
> > > > > > > >         /*
> > > > > > > >          * In order to read the offloaded state of an rdp in a safe
> > > > > > > >          * and stable way and prevent from its value to be changed
> > > > > > > >          * under us, we must either hold the barrier mutex, the cpu
> > > > > > > >          * hotplug lock (read or write) or the nocb lock. Local
> > > > > > > >          * non-preemptible reads are also safe. NOCB kthreads and
> > > > > > > >          * timers have their own means of synchronization against the
> > > > > > > > @@ -86,20 +87,33 @@ static void __init rcu_bootup_announce_oddness(void)
> > > > > > > >         if (rcu_kick_kthreads)
> > > > > > > >                 pr_info("\tKick kthreads if too-long grace period.\n");
> > > > > > > >         if (IS_ENABLED(CONFIG_DEBUG_OBJECTS_RCU_HEAD))
> > > > > > > >                 pr_info("\tRCU callback double-/use-after-free debug is enabled.\n");
> > > > > > > >         if (gp_preinit_delay)
> > > > > > > >                 pr_info("\tRCU debug GP pre-init slowdown %d jiffies.\n", gp_preinit_delay);
> > > > > > > >         if (gp_init_delay)
> > > > > > > >                 pr_info("\tRCU debug GP init slowdown %d jiffies.\n", gp_init_delay);
> > > > > > > >         if (gp_cleanup_delay)
> > > > > > > >                 pr_info("\tRCU debug GP cleanup slowdown %d jiffies.\n", gp_cleanup_delay);
> > > > > > > > +       if (nocb_patience_delay < 0) {
> > > > > > > > +               pr_info("\tRCU NOCB CPU patience negative (%d), resetting to zero.\n",
> > > > > > > > +                       nocb_patience_delay);
> > > > > > > > +               nocb_patience_delay = 0;
> > > > > > > > +       } else if (nocb_patience_delay > 5 * MSEC_PER_SEC) {
> > > > > > > > +               pr_info("\tRCU NOCB CPU patience too large (%d), resetting to %ld.\n",
> > > > > > > > +                       nocb_patience_delay, 5 * MSEC_PER_SEC);
> > > > > > > > +               nocb_patience_delay = msecs_to_jiffies(5 * MSEC_PER_SEC);
> > > > > > > > +       } else if (nocb_patience_delay) {
> > > > > > > > +               nocb_patience_delay = msecs_to_jiffies(nocb_patience_delay);
> > > > > > > > +               pr_info("\tRCU NOCB CPU patience set to %d milliseconds.\n",
> > > > > > > > +                       jiffies_to_msecs(nocb_patience_delay));
> > > > > > > > +       }
> > > > > > > 
> > > > > > > I just did this here at the end:
> > > > > > > 
> > > > > > > 	nocb_patience_delay_jiffies = msecs_to_jiffies(nocb_patience_delay);
> > > > > > > 
> > > > > > > Ah, you are wanting to print out the milliseconds after the rounding
> > > > > > > to jiffies.
> > > > > > 
> > > > > > That's right, just to make sure the user gets the effective patience time, 
> > > > > > instead of the before-rounding value that was given as input.
> > > > > > 
> > > > > > > I am going to hold off on that for the moment, but I hear your request
> > > > > > > and I have not yet said "no".  ;-)
> > > > > > 
> > > > > > Sure :)
> > > > > > It's just something I think is nice to have (as a user).
> > > > > 
> > > > > If you would like to do a separate patch adding this, here are the
> > > > > requirements:
> > > > > 
> > > > > o	If the current code prints nothing, nothing additional should
> > > > > 	be printed.
> > > > > 
> > > > > o	If the rounding ended up with the same value (as it should in
> > > > > 	systems with HZ=1000), nothing additional should be printed.
> > > > > 
> > > > > o	Your choice as to whether or not you want to print out the
> > > > > 	jiffies value.
> > > > > 
> > > > > o	If the additional message is on a new line, it needs to be
> > > > > 	indented so that it is clear that it is subordinate to the
> > > > > 	previous message.
> > > > > 
> > > > > 	Otherwise, you can use pr_cont() to continue the previous
> > > > > 	line, of course being careful about "\n".
> > > > > 
> > > > > Probably also something that I am forgetting, but that is most of it.
> > > > 
> > > > Thanks!
> > > > I will work on a patch doing that :)
> > > 
> > > Very good, looking forward to seeing what you come up with!
> > > 
> > > My current state is on the "dev" branch of the -rcu tree, so please base
> > > on that.
> > 
> > Thanks! I used it earlier to send the previous diff :)
> > 
> > > 
> > > > > > > >         if (!use_softirq)
> > > > > > > >                 pr_info("\tRCU_SOFTIRQ processing moved to rcuc kthreads.\n");
> > > > > > > >         if (IS_ENABLED(CONFIG_RCU_EQS_DEBUG))
> > > > > > > >                 pr_info("\tRCU debug extended QS entry/exit.\n");
> > > > > > > >         rcupdate_announce_bootup_oddness();
> > > > > > > >  }
> > > > > > > >  
> > > > > > > >  #ifdef CONFIG_PREEMPT_RCU
> > > > > > > >  
> > > > > > > >  static void rcu_report_exp_rnp(struct rcu_node *rnp, bool wake);
> > > > > > > > @@ -1260,10 +1274,29 @@ static bool rcu_nohz_full_cpu(void)
> > > > > > > >  
> > > > > > > >  /*
> > > > > > > >   * Bind the RCU grace-period kthreads to the housekeeping CPU.
> > > > > > > >   */
> > > > > > > >  static void rcu_bind_gp_kthread(void)
> > > > > > > >  {
> > > > > > > >         if (!tick_nohz_full_enabled())
> > > > > > > >                 return;
> > > > > > > >         housekeeping_affine(current, HK_TYPE_RCU);
> > > > > > > >  }
> > > > > > > > +
> > > > > > > > +/*
> > > > > > > > + * Is this CPU a NO_HZ_FULL CPU that should ignore RCU if the time since the
> > > > > > > > + * start of current grace period is smaller than nocb_patience_delay ?
> > > > > > > > + *
> > > > > > > > + * This code relies on the fact that all NO_HZ_FULL CPUs are also
> > > > > > > > + * RCU_NOCB_CPU CPUs.
> > > > > > > > + */
> > > > > > > > +static bool rcu_on_patience_delay(void)
> > > > > > > > +{
> > > > > > > > +#ifdef CONFIG_NO_HZ_FULL
> > > > > > > 
> > > > > > > You lost me on this one.  Why do we need the #ifdef instead of
> > > > > > > IS_ENABLED()?  Also, please note that rcu_nohz_full_cpu() is already a
> > > > > > > compile-time @false in CONFIG_NO_HZ_FULL=n kernels.
> > > > > > 
> > > > > > You are right. rcu_nohz_full_cpu() has a high chance of being inlined on
> > > > > > 	if ((...) && rcu_nohz_full_cpu())
> > > > > > And since it returns false, this whole statement will be compiled out, and 
> > > > > > the new function will not exist in CONFIG_NO_HZ_FULL=n, so there  is no 
> > > > > > need to test it.
> > > > > 
> > > > > Very good!  You had me going there for a bit.  ;-)
> > > > > 
> > > > > > > > +       if (!nocb_patience_delay)
> > > > > > > > +               return false;
> > > > > > > 
> > > > > > > We get this automatically with the comparison below, right?
> > > > > > 
> > > > > > Right
> > > > > > 
> > > > > > >   If so, we
> > > > > > > are not gaining much by creating the helper function.  Or am I missing
> > > > > > > some trick here?
> > > > > > 
> > > > > > Well, it's a fastpath. Up to here, we just need to read 
> > > > > > nocb_patience_delay{,_jiffies} from memory.
> > > > > 
> > > > > Just nocb_patience_delay_jiffies, correct?  Unless I am missing something,
> > > > > nocb_patience_delay is unused after boot.
> > > > 
> > > > Right, I used both because I was referring to the older version and the 
> > > > current version with _jiffies.
> > > 
> > > Fair enough!
> > > 
> > > > > > If we don't include the fastpath we have to read jiffies and 
> > > > > > rcu_state.gp_start, which can take extra time: up to 2 cache misses.
> > > > > > 
> > > > > > I thought it could be relevant, as we reduce the overhead of the new 
> > > > > > parameter when it's disabled (patience=0). 
> > > > > > 
> > > > > > Do you think that could be relevant?
> > > > > 
> > > > > Well, the hardware's opinion is what matters.  ;-)
> > > > > 
> > > > > But the caller's code path reads jiffies a few times, so it should
> > > > > be hot in the cache, correct?
> > > > 
> > > > Right, but I wonder what the chances are of it getting updated between 
> > > > the caller's use and this function's. Same for gp_start.
> > > 
> > > Well, jiffies is updated at most once per millisecond, and gp_start is
> > > updated at most once per few milliseconds.  So the chances of it being
> > > updated within that code sequence are quite small.
> > 
> > Fair enough, and we probably don't need to worry about it getting 
> > evicted from the cache in this sequence, either. 
> > 
> > Also, time_before() is a macro, so there is no function call to worry 
> > about; the slow path only adds 2 extra L1-cache reads and a couple of 
> > arithmetic instructions, which should not take long. So it's fair to 
> > assume the fast path would not be much faster than the slow path, which 
> > means we don't need a fast path after all.
> > 
> > Thanks for helping me notice that :)
> > 
> > > 
> > > > > But that does lead to another topic, namely the possibility of tagging
> > > > > nocb_patience_delay_jiffies with __read_mostly. 
> > > > 
> > > > Oh, right. This was supposed to be in the diff I sent earlier, but I 
> > > > completely forgot to change it before sending. So, yeah, I agree on 
> > > > nocb_patience_delay being __read_mostly.
> > > > 
> > > > > And there might be
> > > > > a number of other of RCU's variables that could be similarly tagged
> > > > > in order to avoid false sharing.  (But is there any false sharing?
> > > > > This might be worth testing.)
> > > > 
> > > > Maybe there isn't, but I wonder if it would hurt performance if they were 
> > > > tagged as __read_mostly anyway. 
> > > 
> > > Let's be at least a little careful here.  It is just as easy to hurt
> > > performance by marking things __read_mostly or __read_only as it is
> > > to help performance.  ;-)
> > 
> > Fair enough :)
> > 
> > > 
> > > 							Thanx, Paul
> > > 
> > 
> 
> Oh, btw, for what it's worth:
> Reviewed-by: Leonardo Bras <leobras@redhat.com>

Applied, thank you!

							Thanx, Paul
Leonardo Bras May 11, 2024, 2:08 a.m. UTC | #56
On Wed, May 08, 2024 at 07:01:29AM -0700, Sean Christopherson wrote:
> On Wed, May 08, 2024, Leonardo Bras wrote:
> > Something just hit me, and maybe I need to propose something more generic.
> 
> Yes.  This is what I was trying to get across with my complaints about keying off
> of the last VM-Exit time.  It's effectively a broad stroke "this task will likely
> be quiescent soon" and so the core concept/functionality belongs in common code,
> not KVM.
> 

Hello Sean,

Paul implemented the RCU patience cmdline option, which helps avoid waking 
up rcuc if the grace period is younger than X milliseconds, meaning the 
last quiescent state needs to be at least X milliseconds old.

With that, I just have to add a quiescent state in guest_exit(), and we 
will get the same effect as the last_guest_exit patch. 

I sent this RFC patch doing that:
https://lore.kernel.org/all/20240511020557.1198200-1-leobras@redhat.com/

Please take a look.

Thanks!
Leo