[RFC,1/1] kvm: Note an RCU quiescent state on guest exit

Message ID	20240511020557.1198200-1-leobras@redhat.com (mailing list archive)
State	New
Headers	show Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id CE18D10958 for <kvm@vger.kernel.org>; Sat, 11 May 2024 02:06:31 +0000 (UTC) From: Leonardo Bras <leobras@redhat.com> To: Frederic Weisbecker <frederic@kernel.org>, "Paul E. McKenney" <paulmck@kernel.org>, Paolo Bonzini <pbonzini@redhat.com>, Sean Christopherson <seanjc@google.com>, Marcelo Tosatti <mtosatti@redhat.com> Cc: Leonardo Bras <leobras@redhat.com>, linux-kernel@vger.kernel.org, kvm@vger.kernel.org Subject: [RFC PATCH 1/1] kvm: Note an RCU quiescent state on guest exit Date: Fri, 10 May 2024 23:05:56 -0300 Message-ID: <20240511020557.1198200-1-leobras@redhat.com> Precedence: bulk MIME-Version: 1.0 Content-Transfer-Encoding: 8bit
Series	[RFC,1/1] kvm: Note an RCU quiescent state on guest exit \| expand [RFC,1/1] kvm: Note an RCU quiescent state on guest exit

Leonardo Bras May 11, 2024, 2:05 a.m. UTC

As of today, KVM notes a quiescent state only in guest entry, which is good
as it avoids the guest being interrupted for current RCU operations.

While the guest vcpu runs, it can be interrupted by a timer IRQ that will
check for any RCU operations waiting for this CPU. In case there are any of
such, it invokes rcu_core() in order to sched-out the current thread and
note a quiescent state.

This occasional schedule work will introduce tens of microsseconds of
latency, which is really bad for vcpus running latency-sensitive
applications, such as real-time workloads.

So, note a quiescent state in guest exit, so the interrupted guests is able
to deal with any pending RCU operations before being required to invoke
rcu_core(), and thus avoid the overhead of related scheduler work.

Signed-off-by: Leonardo Bras <leobras@redhat.com>
---

ps: A patch fixing this same issue was discussed in this thread:
https://lore.kernel.org/all/20240328171949.743211-1-leobras@redhat.com/

Also, this can be paired with a new RCU option (rcutree.nocb_patience_delay)
to avoid having invoke_rcu() being called on grace-periods starting between
guest exit and the timer IRQ. This RCU option is being discussed in a
sub-thread of this message:
https://lore.kernel.org/all/5fd66909-1250-4a91-aa71-93cb36ed4ad5@paulmck-laptop/


 include/linux/context_tracking.h |  6 ++++--
 include/linux/kvm_host.h         | 10 +++++++++-
 2 files changed, 13 insertions(+), 3 deletions(-)

Leonardo Bras May 11, 2024, 2:11 a.m. UTC | #1

On Fri, May 10, 2024 at 11:05:56PM -0300, Leonardo Bras wrote:
> As of today, KVM notes a quiescent state only in guest entry, which is good
> as it avoids the guest being interrupted for current RCU operations.
> 
> While the guest vcpu runs, it can be interrupted by a timer IRQ that will
> check for any RCU operations waiting for this CPU. In case there are any of
> such, it invokes rcu_core() in order to sched-out the current thread and
> note a quiescent state.
> 
> This occasional schedule work will introduce tens of microsseconds of
> latency, which is really bad for vcpus running latency-sensitive
> applications, such as real-time workloads.
> 
> So, note a quiescent state in guest exit, so the interrupted guests is able

s/guests/guest

> to deal with any pending RCU operations before being required to invoke
> rcu_core(), and thus avoid the overhead of related scheduler work.
> 
> Signed-off-by: Leonardo Bras <leobras@redhat.com>
> ---
> 
> ps: A patch fixing this same issue was discussed in this thread:
> https://lore.kernel.org/all/20240328171949.743211-1-leobras@redhat.com/
> 
> Also, this can be paired with a new RCU option (rcutree.nocb_patience_delay)
> to avoid having invoke_rcu() being called on grace-periods starting between
> guest exit and the timer IRQ. This RCU option is being discussed in a
> sub-thread of this message:
> https://lore.kernel.org/all/5fd66909-1250-4a91-aa71-93cb36ed4ad5@paulmck-laptop/
> 
> 
>  include/linux/context_tracking.h |  6 ++++--
>  include/linux/kvm_host.h         | 10 +++++++++-
>  2 files changed, 13 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
> index 6e76b9dba00e..8a78fabeafc3 100644
> --- a/include/linux/context_tracking.h
> +++ b/include/linux/context_tracking.h
> @@ -73,39 +73,41 @@ static inline void exception_exit(enum ctx_state prev_ctx)
>  }
>  
>  static __always_inline bool context_tracking_guest_enter(void)
>  {
>  	if (context_tracking_enabled())
>  		__ct_user_enter(CONTEXT_GUEST);
>  
>  	return context_tracking_enabled_this_cpu();
>  }
>  
> -static __always_inline void context_tracking_guest_exit(void)
> +static __always_inline bool context_tracking_guest_exit(void)
>  {
>  	if (context_tracking_enabled())
>  		__ct_user_exit(CONTEXT_GUEST);
> +
> +	return context_tracking_enabled_this_cpu();
>  }
>  
>  #define CT_WARN_ON(cond) WARN_ON(context_tracking_enabled() && (cond))
>  
>  #else
>  static inline void user_enter(void) { }
>  static inline void user_exit(void) { }
>  static inline void user_enter_irqoff(void) { }
>  static inline void user_exit_irqoff(void) { }
>  static inline int exception_enter(void) { return 0; }
>  static inline void exception_exit(enum ctx_state prev_ctx) { }
>  static inline int ct_state(void) { return -1; }
>  static inline int __ct_state(void) { return -1; }
>  static __always_inline bool context_tracking_guest_enter(void) { return false; }
> -static __always_inline void context_tracking_guest_exit(void) { }
> +static __always_inline bool context_tracking_guest_exit(void) { return false; }
>  #define CT_WARN_ON(cond) do { } while (0)
>  #endif /* !CONFIG_CONTEXT_TRACKING_USER */
>  
>  #ifdef CONFIG_CONTEXT_TRACKING_USER_FORCE
>  extern void context_tracking_init(void);
>  #else
>  static inline void context_tracking_init(void) { }
>  #endif /* CONFIG_CONTEXT_TRACKING_USER_FORCE */
>  
>  #ifdef CONFIG_CONTEXT_TRACKING_IDLE
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 48f31dcd318a..e37724c44843 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -480,21 +480,29 @@ static __always_inline void guest_state_enter_irqoff(void)
>  /*
>   * Exit guest context and exit an RCU extended quiescent state.
>   *
>   * Between guest_context_enter_irqoff() and guest_context_exit_irqoff() it is
>   * unsafe to use any code which may directly or indirectly use RCU, tracing
>   * (including IRQ flag tracing), or lockdep. All code in this period must be
>   * non-instrumentable.
>   */
>  static __always_inline void guest_context_exit_irqoff(void)
>  {
> -	context_tracking_guest_exit();
> +	/*
> +	 * Guest mode is treated as a quiescent state, see
> +	 * guest_context_enter_irqoff() for more details.
> +	 */
> +	if (!context_tracking_guest_exit()) {
> +		instrumentation_begin();
> +		rcu_virt_note_context_switch();
> +		instrumentation_end();
> +	}
>  }
>  
>  /*
>   * Stop accounting time towards a guest.
>   * Must be called after exiting guest context.
>   */
>  static __always_inline void guest_timing_exit_irqoff(void)
>  {
>  	instrumentation_begin();
>  	/* Flush the guest cputime we spent on the guest */
> -- 
> 2.45.0
>

Paul E. McKenney May 11, 2024, 2:55 p.m. UTC | #2

On Fri, May 10, 2024 at 11:05:56PM -0300, Leonardo Bras wrote:
> As of today, KVM notes a quiescent state only in guest entry, which is good
> as it avoids the guest being interrupted for current RCU operations.
> 
> While the guest vcpu runs, it can be interrupted by a timer IRQ that will
> check for any RCU operations waiting for this CPU. In case there are any of
> such, it invokes rcu_core() in order to sched-out the current thread and
> note a quiescent state.
> 
> This occasional schedule work will introduce tens of microsseconds of
> latency, which is really bad for vcpus running latency-sensitive
> applications, such as real-time workloads.
> 
> So, note a quiescent state in guest exit, so the interrupted guests is able
> to deal with any pending RCU operations before being required to invoke
> rcu_core(), and thus avoid the overhead of related scheduler work.
> 
> Signed-off-by: Leonardo Bras <leobras@redhat.com>

Acked-by: Paul E. McKenney <paulmck@kernel.org>

> ---
> 
> ps: A patch fixing this same issue was discussed in this thread:
> https://lore.kernel.org/all/20240328171949.743211-1-leobras@redhat.com/
> 
> Also, this can be paired with a new RCU option (rcutree.nocb_patience_delay)
> to avoid having invoke_rcu() being called on grace-periods starting between
> guest exit and the timer IRQ. This RCU option is being discussed in a
> sub-thread of this message:
> https://lore.kernel.org/all/5fd66909-1250-4a91-aa71-93cb36ed4ad5@paulmck-laptop/
> 
> 
>  include/linux/context_tracking.h |  6 ++++--
>  include/linux/kvm_host.h         | 10 +++++++++-
>  2 files changed, 13 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
> index 6e76b9dba00e..8a78fabeafc3 100644
> --- a/include/linux/context_tracking.h
> +++ b/include/linux/context_tracking.h
> @@ -73,39 +73,41 @@ static inline void exception_exit(enum ctx_state prev_ctx)
>  }
>  
>  static __always_inline bool context_tracking_guest_enter(void)
>  {
>  	if (context_tracking_enabled())
>  		__ct_user_enter(CONTEXT_GUEST);
>  
>  	return context_tracking_enabled_this_cpu();
>  }
>  
> -static __always_inline void context_tracking_guest_exit(void)
> +static __always_inline bool context_tracking_guest_exit(void)
>  {
>  	if (context_tracking_enabled())
>  		__ct_user_exit(CONTEXT_GUEST);
> +
> +	return context_tracking_enabled_this_cpu();
>  }
>  
>  #define CT_WARN_ON(cond) WARN_ON(context_tracking_enabled() && (cond))
>  
>  #else
>  static inline void user_enter(void) { }
>  static inline void user_exit(void) { }
>  static inline void user_enter_irqoff(void) { }
>  static inline void user_exit_irqoff(void) { }
>  static inline int exception_enter(void) { return 0; }
>  static inline void exception_exit(enum ctx_state prev_ctx) { }
>  static inline int ct_state(void) { return -1; }
>  static inline int __ct_state(void) { return -1; }
>  static __always_inline bool context_tracking_guest_enter(void) { return false; }
> -static __always_inline void context_tracking_guest_exit(void) { }
> +static __always_inline bool context_tracking_guest_exit(void) { return false; }
>  #define CT_WARN_ON(cond) do { } while (0)
>  #endif /* !CONFIG_CONTEXT_TRACKING_USER */
>  
>  #ifdef CONFIG_CONTEXT_TRACKING_USER_FORCE
>  extern void context_tracking_init(void);
>  #else
>  static inline void context_tracking_init(void) { }
>  #endif /* CONFIG_CONTEXT_TRACKING_USER_FORCE */
>  
>  #ifdef CONFIG_CONTEXT_TRACKING_IDLE
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 48f31dcd318a..e37724c44843 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -480,21 +480,29 @@ static __always_inline void guest_state_enter_irqoff(void)
>  /*
>   * Exit guest context and exit an RCU extended quiescent state.
>   *
>   * Between guest_context_enter_irqoff() and guest_context_exit_irqoff() it is
>   * unsafe to use any code which may directly or indirectly use RCU, tracing
>   * (including IRQ flag tracing), or lockdep. All code in this period must be
>   * non-instrumentable.
>   */
>  static __always_inline void guest_context_exit_irqoff(void)
>  {
> -	context_tracking_guest_exit();
> +	/*
> +	 * Guest mode is treated as a quiescent state, see
> +	 * guest_context_enter_irqoff() for more details.
> +	 */
> +	if (!context_tracking_guest_exit()) {
> +		instrumentation_begin();
> +		rcu_virt_note_context_switch();
> +		instrumentation_end();
> +	}
>  }
>  
>  /*
>   * Stop accounting time towards a guest.
>   * Must be called after exiting guest context.
>   */
>  static __always_inline void guest_timing_exit_irqoff(void)
>  {
>  	instrumentation_begin();
>  	/* Flush the guest cputime we spent on the guest */
> -- 
> 2.45.0
>

Leonardo Bras May 11, 2024, 8:31 p.m. UTC | #3

On Sat, May 11, 2024 at 07:55:55AM -0700, Paul E. McKenney wrote:
> On Fri, May 10, 2024 at 11:05:56PM -0300, Leonardo Bras wrote:
> > As of today, KVM notes a quiescent state only in guest entry, which is good
> > as it avoids the guest being interrupted for current RCU operations.
> > 
> > While the guest vcpu runs, it can be interrupted by a timer IRQ that will
> > check for any RCU operations waiting for this CPU. In case there are any of
> > such, it invokes rcu_core() in order to sched-out the current thread and
> > note a quiescent state.
> > 
> > This occasional schedule work will introduce tens of microsseconds of
> > latency, which is really bad for vcpus running latency-sensitive
> > applications, such as real-time workloads.
> > 
> > So, note a quiescent state in guest exit, so the interrupted guests is able
> > to deal with any pending RCU operations before being required to invoke
> > rcu_core(), and thus avoid the overhead of related scheduler work.
> > 
> > Signed-off-by: Leonardo Bras <leobras@redhat.com>
> 
> Acked-by: Paul E. McKenney <paulmck@kernel.org>

Thanks!
Leo

> 
> > ---
> > 
> > ps: A patch fixing this same issue was discussed in this thread:
> > https://lore.kernel.org/all/20240328171949.743211-1-leobras@redhat.com/
> > 
> > Also, this can be paired with a new RCU option (rcutree.nocb_patience_delay)
> > to avoid having invoke_rcu() being called on grace-periods starting between
> > guest exit and the timer IRQ. This RCU option is being discussed in a
> > sub-thread of this message:
> > https://lore.kernel.org/all/5fd66909-1250-4a91-aa71-93cb36ed4ad5@paulmck-laptop/
> > 
> > 
> >  include/linux/context_tracking.h |  6 ++++--
> >  include/linux/kvm_host.h         | 10 +++++++++-
> >  2 files changed, 13 insertions(+), 3 deletions(-)
> > 
> > diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
> > index 6e76b9dba00e..8a78fabeafc3 100644
> > --- a/include/linux/context_tracking.h
> > +++ b/include/linux/context_tracking.h
> > @@ -73,39 +73,41 @@ static inline void exception_exit(enum ctx_state prev_ctx)
> >  }
> >  
> >  static __always_inline bool context_tracking_guest_enter(void)
> >  {
> >  	if (context_tracking_enabled())
> >  		__ct_user_enter(CONTEXT_GUEST);
> >  
> >  	return context_tracking_enabled_this_cpu();
> >  }
> >  
> > -static __always_inline void context_tracking_guest_exit(void)
> > +static __always_inline bool context_tracking_guest_exit(void)
> >  {
> >  	if (context_tracking_enabled())
> >  		__ct_user_exit(CONTEXT_GUEST);
> > +
> > +	return context_tracking_enabled_this_cpu();
> >  }
> >  
> >  #define CT_WARN_ON(cond) WARN_ON(context_tracking_enabled() && (cond))
> >  
> >  #else
> >  static inline void user_enter(void) { }
> >  static inline void user_exit(void) { }
> >  static inline void user_enter_irqoff(void) { }
> >  static inline void user_exit_irqoff(void) { }
> >  static inline int exception_enter(void) { return 0; }
> >  static inline void exception_exit(enum ctx_state prev_ctx) { }
> >  static inline int ct_state(void) { return -1; }
> >  static inline int __ct_state(void) { return -1; }
> >  static __always_inline bool context_tracking_guest_enter(void) { return false; }
> > -static __always_inline void context_tracking_guest_exit(void) { }
> > +static __always_inline bool context_tracking_guest_exit(void) { return false; }
> >  #define CT_WARN_ON(cond) do { } while (0)
> >  #endif /* !CONFIG_CONTEXT_TRACKING_USER */
> >  
> >  #ifdef CONFIG_CONTEXT_TRACKING_USER_FORCE
> >  extern void context_tracking_init(void);
> >  #else
> >  static inline void context_tracking_init(void) { }
> >  #endif /* CONFIG_CONTEXT_TRACKING_USER_FORCE */
> >  
> >  #ifdef CONFIG_CONTEXT_TRACKING_IDLE
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 48f31dcd318a..e37724c44843 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -480,21 +480,29 @@ static __always_inline void guest_state_enter_irqoff(void)
> >  /*
> >   * Exit guest context and exit an RCU extended quiescent state.
> >   *
> >   * Between guest_context_enter_irqoff() and guest_context_exit_irqoff() it is
> >   * unsafe to use any code which may directly or indirectly use RCU, tracing
> >   * (including IRQ flag tracing), or lockdep. All code in this period must be
> >   * non-instrumentable.
> >   */
> >  static __always_inline void guest_context_exit_irqoff(void)
> >  {
> > -	context_tracking_guest_exit();
> > +	/*
> > +	 * Guest mode is treated as a quiescent state, see
> > +	 * guest_context_enter_irqoff() for more details.
> > +	 */
> > +	if (!context_tracking_guest_exit()) {
> > +		instrumentation_begin();
> > +		rcu_virt_note_context_switch();
> > +		instrumentation_end();
> > +	}
> >  }
> >  
> >  /*
> >   * Stop accounting time towards a guest.
> >   * Must be called after exiting guest context.
> >   */
> >  static __always_inline void guest_timing_exit_irqoff(void)
> >  {
> >  	instrumentation_begin();
> >  	/* Flush the guest cputime we spent on the guest */
> > -- 
> > 2.45.0
> > 
>

Marcelo Tosatti May 12, 2024, 9:44 p.m. UTC | #4

On Fri, May 10, 2024 at 11:05:56PM -0300, Leonardo Bras wrote:
> As of today, KVM notes a quiescent state only in guest entry, which is good
> as it avoids the guest being interrupted for current RCU operations.
> 
> While the guest vcpu runs, it can be interrupted by a timer IRQ that will
> check for any RCU operations waiting for this CPU. In case there are any of
> such, it invokes rcu_core() in order to sched-out the current thread and
> note a quiescent state.
> 
> This occasional schedule work will introduce tens of microsseconds of
> latency, which is really bad for vcpus running latency-sensitive
> applications, such as real-time workloads.
> 
> So, note a quiescent state in guest exit, so the interrupted guests is able
> to deal with any pending RCU operations before being required to invoke
> rcu_core(), and thus avoid the overhead of related scheduler work.

This does not properly fix the current problem, as RCU work might be
scheduled after the VM exit, followed by a timer interrupt.

Correct?

> 
> Signed-off-by: Leonardo Bras <leobras@redhat.com>
> ---
> 
> ps: A patch fixing this same issue was discussed in this thread:
> https://lore.kernel.org/all/20240328171949.743211-1-leobras@redhat.com/
> 
> Also, this can be paired with a new RCU option (rcutree.nocb_patience_delay)
> to avoid having invoke_rcu() being called on grace-periods starting between
> guest exit and the timer IRQ. This RCU option is being discussed in a
> sub-thread of this message:
> https://lore.kernel.org/all/5fd66909-1250-4a91-aa71-93cb36ed4ad5@paulmck-laptop/
> 
> 
>  include/linux/context_tracking.h |  6 ++++--
>  include/linux/kvm_host.h         | 10 +++++++++-
>  2 files changed, 13 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
> index 6e76b9dba00e..8a78fabeafc3 100644
> --- a/include/linux/context_tracking.h
> +++ b/include/linux/context_tracking.h
> @@ -73,39 +73,41 @@ static inline void exception_exit(enum ctx_state prev_ctx)
>  }
>  
>  static __always_inline bool context_tracking_guest_enter(void)
>  {
>  	if (context_tracking_enabled())
>  		__ct_user_enter(CONTEXT_GUEST);
>  
>  	return context_tracking_enabled_this_cpu();
>  }
>  
> -static __always_inline void context_tracking_guest_exit(void)
> +static __always_inline bool context_tracking_guest_exit(void)
>  {
>  	if (context_tracking_enabled())
>  		__ct_user_exit(CONTEXT_GUEST);
> +
> +	return context_tracking_enabled_this_cpu();
>  }
>  
>  #define CT_WARN_ON(cond) WARN_ON(context_tracking_enabled() && (cond))
>  
>  #else
>  static inline void user_enter(void) { }
>  static inline void user_exit(void) { }
>  static inline void user_enter_irqoff(void) { }
>  static inline void user_exit_irqoff(void) { }
>  static inline int exception_enter(void) { return 0; }
>  static inline void exception_exit(enum ctx_state prev_ctx) { }
>  static inline int ct_state(void) { return -1; }
>  static inline int __ct_state(void) { return -1; }
>  static __always_inline bool context_tracking_guest_enter(void) { return false; }
> -static __always_inline void context_tracking_guest_exit(void) { }
> +static __always_inline bool context_tracking_guest_exit(void) { return false; }
>  #define CT_WARN_ON(cond) do { } while (0)
>  #endif /* !CONFIG_CONTEXT_TRACKING_USER */
>  
>  #ifdef CONFIG_CONTEXT_TRACKING_USER_FORCE
>  extern void context_tracking_init(void);
>  #else
>  static inline void context_tracking_init(void) { }
>  #endif /* CONFIG_CONTEXT_TRACKING_USER_FORCE */
>  
>  #ifdef CONFIG_CONTEXT_TRACKING_IDLE
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 48f31dcd318a..e37724c44843 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -480,21 +480,29 @@ static __always_inline void guest_state_enter_irqoff(void)
>  /*
>   * Exit guest context and exit an RCU extended quiescent state.
>   *
>   * Between guest_context_enter_irqoff() and guest_context_exit_irqoff() it is
>   * unsafe to use any code which may directly or indirectly use RCU, tracing
>   * (including IRQ flag tracing), or lockdep. All code in this period must be
>   * non-instrumentable.
>   */
>  static __always_inline void guest_context_exit_irqoff(void)
>  {
> -	context_tracking_guest_exit();
> +	/*
> +	 * Guest mode is treated as a quiescent state, see
> +	 * guest_context_enter_irqoff() for more details.
> +	 */
> +	if (!context_tracking_guest_exit()) {
> +		instrumentation_begin();
> +		rcu_virt_note_context_switch();
> +		instrumentation_end();
> +	}
>  }
>  
>  /*
>   * Stop accounting time towards a guest.
>   * Must be called after exiting guest context.
>   */
>  static __always_inline void guest_timing_exit_irqoff(void)
>  {
>  	instrumentation_begin();
>  	/* Flush the guest cputime we spent on the guest */
> -- 
> 2.45.0
> 
> 
>

Marcelo Tosatti May 13, 2024, 1:06 a.m. UTC | #5

On Sun, May 12, 2024 at 06:44:23PM -0300, Marcelo Tosatti wrote:
> On Fri, May 10, 2024 at 11:05:56PM -0300, Leonardo Bras wrote:
> > As of today, KVM notes a quiescent state only in guest entry, which is good
> > as it avoids the guest being interrupted for current RCU operations.
> > 
> > While the guest vcpu runs, it can be interrupted by a timer IRQ that will
> > check for any RCU operations waiting for this CPU. In case there are any of
> > such, it invokes rcu_core() in order to sched-out the current thread and
> > note a quiescent state.
> > 
> > This occasional schedule work will introduce tens of microsseconds of
> > latency, which is really bad for vcpus running latency-sensitive
> > applications, such as real-time workloads.
> > 
> > So, note a quiescent state in guest exit, so the interrupted guests is able
> > to deal with any pending RCU operations before being required to invoke
> > rcu_core(), and thus avoid the overhead of related scheduler work.
> 
> This does not properly fix the current problem, as RCU work might be
> scheduled after the VM exit, followed by a timer interrupt.
> 
> Correct?

Not that i am against the patch... 

But, regarding the problem at hand, it does not fix it reliably.

Leonardo Bras May 13, 2024, 3:14 a.m. UTC | #6

On Sun, May 12, 2024 at 06:44:23PM -0300, Marcelo Tosatti wrote:
> On Fri, May 10, 2024 at 11:05:56PM -0300, Leonardo Bras wrote:
> > As of today, KVM notes a quiescent state only in guest entry, which is good
> > as it avoids the guest being interrupted for current RCU operations.
> > 
> > While the guest vcpu runs, it can be interrupted by a timer IRQ that will
> > check for any RCU operations waiting for this CPU. In case there are any of
> > such, it invokes rcu_core() in order to sched-out the current thread and
> > note a quiescent state.
> > 
> > This occasional schedule work will introduce tens of microsseconds of
> > latency, which is really bad for vcpus running latency-sensitive
> > applications, such as real-time workloads.
> > 
> > So, note a quiescent state in guest exit, so the interrupted guests is able
> > to deal with any pending RCU operations before being required to invoke
> > rcu_core(), and thus avoid the overhead of related scheduler work.
> 
> This does not properly fix the current problem, as RCU work might be
> scheduled after the VM exit, followed by a timer interrupt.
> 
> Correct?

Correct, for this case, check the note below:

> 
> > 
> > Signed-off-by: Leonardo Bras <leobras@redhat.com>
> > ---
> > 
> > ps: A patch fixing this same issue was discussed in this thread:
> > https://lore.kernel.org/all/20240328171949.743211-1-leobras@redhat.com/
> > 
> > Also, this can be paired with a new RCU option (rcutree.nocb_patience_delay)
> > to avoid having invoke_rcu() being called on grace-periods starting between
> > guest exit and the timer IRQ. This RCU option is being discussed in a
> > sub-thread of this message:
> > https://lore.kernel.org/all/5fd66909-1250-4a91-aa71-93cb36ed4ad5@paulmck-laptop/

^ This one above.
The idea is to use this rcutree.nocb_patience_delay=N :
a new option we added on RCU that allow us to avoid invoking rcu_core() if 
the grace_period < N miliseconds. This only works on nohz_full cpus.

So with both the current patch and the one in above link, we have the same 
effect as we previously had with last_guest_exit, with a cherry on top: we 
can avoid rcu_core() getting called in situations where a grace period just 
started after going into kernel code, and a timer interrupt happened before 
it can report quiescent state again. 

For our nohz_full vcpu thread scenario, we have:

- guest_exit note a quiescent state
- let's say we start a grace period in the next cycle
- If timer interrupts, it requires the grace period to be older than N 
  miliseconds
  - If we configure a proper value for patience, it will never reach the 
    end of patience before going guest_entry, and thus noting a quiescent 
    state

What do you think?

Thanks!
Leo

> > 
> > 
> >  include/linux/context_tracking.h |  6 ++++--
> >  include/linux/kvm_host.h         | 10 +++++++++-
> >  2 files changed, 13 insertions(+), 3 deletions(-)
> > 
> > diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
> > index 6e76b9dba00e..8a78fabeafc3 100644
> > --- a/include/linux/context_tracking.h
> > +++ b/include/linux/context_tracking.h
> > @@ -73,39 +73,41 @@ static inline void exception_exit(enum ctx_state prev_ctx)
> >  }
> >  
> >  static __always_inline bool context_tracking_guest_enter(void)
> >  {
> >  	if (context_tracking_enabled())
> >  		__ct_user_enter(CONTEXT_GUEST);
> >  
> >  	return context_tracking_enabled_this_cpu();
> >  }
> >  
> > -static __always_inline void context_tracking_guest_exit(void)
> > +static __always_inline bool context_tracking_guest_exit(void)
> >  {
> >  	if (context_tracking_enabled())
> >  		__ct_user_exit(CONTEXT_GUEST);
> > +
> > +	return context_tracking_enabled_this_cpu();
> >  }
> >  
> >  #define CT_WARN_ON(cond) WARN_ON(context_tracking_enabled() && (cond))
> >  
> >  #else
> >  static inline void user_enter(void) { }
> >  static inline void user_exit(void) { }
> >  static inline void user_enter_irqoff(void) { }
> >  static inline void user_exit_irqoff(void) { }
> >  static inline int exception_enter(void) { return 0; }
> >  static inline void exception_exit(enum ctx_state prev_ctx) { }
> >  static inline int ct_state(void) { return -1; }
> >  static inline int __ct_state(void) { return -1; }
> >  static __always_inline bool context_tracking_guest_enter(void) { return false; }
> > -static __always_inline void context_tracking_guest_exit(void) { }
> > +static __always_inline bool context_tracking_guest_exit(void) { return false; }
> >  #define CT_WARN_ON(cond) do { } while (0)
> >  #endif /* !CONFIG_CONTEXT_TRACKING_USER */
> >  
> >  #ifdef CONFIG_CONTEXT_TRACKING_USER_FORCE
> >  extern void context_tracking_init(void);
> >  #else
> >  static inline void context_tracking_init(void) { }
> >  #endif /* CONFIG_CONTEXT_TRACKING_USER_FORCE */
> >  
> >  #ifdef CONFIG_CONTEXT_TRACKING_IDLE
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 48f31dcd318a..e37724c44843 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -480,21 +480,29 @@ static __always_inline void guest_state_enter_irqoff(void)
> >  /*
> >   * Exit guest context and exit an RCU extended quiescent state.
> >   *
> >   * Between guest_context_enter_irqoff() and guest_context_exit_irqoff() it is
> >   * unsafe to use any code which may directly or indirectly use RCU, tracing
> >   * (including IRQ flag tracing), or lockdep. All code in this period must be
> >   * non-instrumentable.
> >   */
> >  static __always_inline void guest_context_exit_irqoff(void)
> >  {
> > -	context_tracking_guest_exit();
> > +	/*
> > +	 * Guest mode is treated as a quiescent state, see
> > +	 * guest_context_enter_irqoff() for more details.
> > +	 */
> > +	if (!context_tracking_guest_exit()) {
> > +		instrumentation_begin();
> > +		rcu_virt_note_context_switch();
> > +		instrumentation_end();
> > +	}
> >  }
> >  
> >  /*
> >   * Stop accounting time towards a guest.
> >   * Must be called after exiting guest context.
> >   */
> >  static __always_inline void guest_timing_exit_irqoff(void)
> >  {
> >  	instrumentation_begin();
> >  	/* Flush the guest cputime we spent on the guest */
> > -- 
> > 2.45.0
> > 
> > 
> > 
>

Marcelo Tosatti May 13, 2024, 7:14 p.m. UTC | #7

On Mon, May 13, 2024 at 12:14:32AM -0300, Leonardo Bras wrote:
> On Sun, May 12, 2024 at 06:44:23PM -0300, Marcelo Tosatti wrote:
> > On Fri, May 10, 2024 at 11:05:56PM -0300, Leonardo Bras wrote:
> > > As of today, KVM notes a quiescent state only in guest entry, which is good
> > > as it avoids the guest being interrupted for current RCU operations.
> > > 
> > > While the guest vcpu runs, it can be interrupted by a timer IRQ that will
> > > check for any RCU operations waiting for this CPU. In case there are any of
> > > such, it invokes rcu_core() in order to sched-out the current thread and
> > > note a quiescent state.
> > > 
> > > This occasional schedule work will introduce tens of microsseconds of
> > > latency, which is really bad for vcpus running latency-sensitive
> > > applications, such as real-time workloads.
> > > 
> > > So, note a quiescent state in guest exit, so the interrupted guests is able
> > > to deal with any pending RCU operations before being required to invoke
> > > rcu_core(), and thus avoid the overhead of related scheduler work.
> > 
> > This does not properly fix the current problem, as RCU work might be
> > scheduled after the VM exit, followed by a timer interrupt.
> > 
> > Correct?
> 
> Correct, for this case, check the note below:
> 
> > 
> > > 
> > > Signed-off-by: Leonardo Bras <leobras@redhat.com>
> > > ---
> > > 
> > > ps: A patch fixing this same issue was discussed in this thread:
> > > https://lore.kernel.org/all/20240328171949.743211-1-leobras@redhat.com/
> > > 
> > > Also, this can be paired with a new RCU option (rcutree.nocb_patience_delay)
> > > to avoid having invoke_rcu() being called on grace-periods starting between
> > > guest exit and the timer IRQ. This RCU option is being discussed in a
> > > sub-thread of this message:
> > > https://lore.kernel.org/all/5fd66909-1250-4a91-aa71-93cb36ed4ad5@paulmck-laptop/
> 
> ^ This one above.
> The idea is to use this rcutree.nocb_patience_delay=N :
> a new option we added on RCU that allow us to avoid invoking rcu_core() if 
> the grace_period < N miliseconds. This only works on nohz_full cpus.
> 
> So with both the current patch and the one in above link, we have the same 
> effect as we previously had with last_guest_exit, with a cherry on top: we 
> can avoid rcu_core() getting called in situations where a grace period just 
> started after going into kernel code, and a timer interrupt happened before 
> it can report quiescent state again. 
> 
> For our nohz_full vcpu thread scenario, we have:
> 
> - guest_exit note a quiescent state
> - let's say we start a grace period in the next cycle
> - If timer interrupts, it requires the grace period to be older than N 
>   miliseconds
>   - If we configure a proper value for patience, it will never reach the 
>     end of patience before going guest_entry, and thus noting a quiescent 
>     state
> 
> What do you think?

I don't fully understand all of the RCU details, but since RCU quiescent
state marking happens in IRQ disabled section, there is no chance for a
timer interrupt to conflict with the marking of quiescent state.

So seem to make sense to me.

Sean Christopherson May 13, 2024, 7:40 p.m. UTC | #8

On Fri, May 10, 2024, Leonardo Bras wrote:
> As of today, KVM notes a quiescent state only in guest entry, which is good
> as it avoids the guest being interrupted for current RCU operations.
> 
> While the guest vcpu runs, it can be interrupted by a timer IRQ that will
> check for any RCU operations waiting for this CPU. In case there are any of
> such, it invokes rcu_core() in order to sched-out the current thread and
> note a quiescent state.
> 
> This occasional schedule work will introduce tens of microsseconds of
> latency, which is really bad for vcpus running latency-sensitive
> applications, such as real-time workloads.
> 
> So, note a quiescent state in guest exit, so the interrupted guests is able
> to deal with any pending RCU operations before being required to invoke
> rcu_core(), and thus avoid the overhead of related scheduler work.

Are there any downsides to this?  E.g. extra latency or anything?  KVM will note
a context switch on the next VM-Enter, so even if there is extra latency or
something, KVM will eventually take the hit in the common case no matter what.
But I know some setups are sensitive to handling select VM-Exits as soon as possible.

I ask mainly because it seems like a no brainer to me to have both VM-Entry and
VM-Exit note the context switch, which begs the question of why KVM isn't already
doing that.  I assume it was just oversight when commit 126a6a542446 ("kvm,rcu,nohz:
use RCU extended quiescent state when running KVM guest") handled the VM-Entry
case?

Leonardo Bras May 13, 2024, 9:47 p.m. UTC | #9

On Mon, May 13, 2024 at 4:40 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Fri, May 10, 2024, Leonardo Bras wrote:
> > As of today, KVM notes a quiescent state only in guest entry, which is good
> > as it avoids the guest being interrupted for current RCU operations.
> >
> > While the guest vcpu runs, it can be interrupted by a timer IRQ that will
> > check for any RCU operations waiting for this CPU. In case there are any of
> > such, it invokes rcu_core() in order to sched-out the current thread and
> > note a quiescent state.
> >
> > This occasional schedule work will introduce tens of microsseconds of
> > latency, which is really bad for vcpus running latency-sensitive
> > applications, such as real-time workloads.
> >
> > So, note a quiescent state in guest exit, so the interrupted guests is able
> > to deal with any pending RCU operations before being required to invoke
> > rcu_core(), and thus avoid the overhead of related scheduler work.
>
> Are there any downsides to this?  E.g. extra latency or anything?  KVM will note
> a context switch on the next VM-Enter, so even if there is extra latency or
> something, KVM will eventually take the hit in the common case no matter what.
> But I know some setups are sensitive to handling select VM-Exits as soon as possible.
>
> I ask mainly because it seems like a no brainer to me to have both VM-Entry and
> VM-Exit note the context switch, which begs the question of why KVM isn't already
> doing that.  I assume it was just oversight when commit 126a6a542446 ("kvm,rcu,nohz:
> use RCU extended quiescent state when running KVM guest") handled the VM-Entry
> case?

I don't know, by the lore I see it happening in guest entry since the
first time it was introduced at
https://lore.kernel.org/all/1423167832-17609-5-git-send-email-riel@redhat.com/

Noting a quiescent state is cheap, but it may cost a few accesses to
possibly non-local cachelines. (Not an expert in this, Paul please let
me know if I got it wrong).

I don't have a historic context on why it was just implemented on
guest_entry, but it would make sense when we don't worry about latency
to take the entry-only approach:
- It saves the overhead of calling rcu_virt_note_context_switch()
twice per guest entry in the loop
- KVM will probably run guest entry soon after guest exit (in loop),
so there is no need to run it twice
- Eventually running rcu_core() may be cheaper than noting quiescent
state every guest entry/exit cycle

Upsides of the new strategy:
- Noting a quiescent state in guest exit avoids calling rcu_core() if
there was a grace period request while guest was running, and timer
interrupt hits the cpu.
- If the loop re-enter quickly there is a high chance that guest
entry's rcu_virt_note_context_switch() will be fast (local cacheline)
as there is low probability of a grace period request happening
between exit & re-entry.
- It allows us to use the rcu patience strategy to avoid rcu_core()
running if any grace period request happens between guest exit and
guest re-entry, which is very important for low latency workloads
running on guests as it reduces maximum latency in long runs.

What do you think?

Thanks!
Leo

Paul E. McKenney May 14, 2024, 10:54 p.m. UTC | #10

On Mon, May 13, 2024 at 06:47:13PM -0300, Leonardo Bras Soares Passos wrote:
> On Mon, May 13, 2024 at 4:40 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Fri, May 10, 2024, Leonardo Bras wrote:
> > > As of today, KVM notes a quiescent state only in guest entry, which is good
> > > as it avoids the guest being interrupted for current RCU operations.
> > >
> > > While the guest vcpu runs, it can be interrupted by a timer IRQ that will
> > > check for any RCU operations waiting for this CPU. In case there are any of
> > > such, it invokes rcu_core() in order to sched-out the current thread and
> > > note a quiescent state.
> > >
> > > This occasional schedule work will introduce tens of microsseconds of
> > > latency, which is really bad for vcpus running latency-sensitive
> > > applications, such as real-time workloads.
> > >
> > > So, note a quiescent state in guest exit, so the interrupted guests is able
> > > to deal with any pending RCU operations before being required to invoke
> > > rcu_core(), and thus avoid the overhead of related scheduler work.
> >
> > Are there any downsides to this?  E.g. extra latency or anything?  KVM will note
> > a context switch on the next VM-Enter, so even if there is extra latency or
> > something, KVM will eventually take the hit in the common case no matter what.
> > But I know some setups are sensitive to handling select VM-Exits as soon as possible.
> >
> > I ask mainly because it seems like a no brainer to me to have both VM-Entry and
> > VM-Exit note the context switch, which begs the question of why KVM isn't already
> > doing that.  I assume it was just oversight when commit 126a6a542446 ("kvm,rcu,nohz:
> > use RCU extended quiescent state when running KVM guest") handled the VM-Entry
> > case?
> 
> I don't know, by the lore I see it happening in guest entry since the
> first time it was introduced at
> https://lore.kernel.org/all/1423167832-17609-5-git-send-email-riel@redhat.com/
> 
> Noting a quiescent state is cheap, but it may cost a few accesses to
> possibly non-local cachelines. (Not an expert in this, Paul please let
> me know if I got it wrong).

Yes, it is cheap, especially if interrupts are already disabled.
(As in the scheduler asks RCU to do the same amount of work on its
context-switch fastpath.)

> I don't have a historic context on why it was just implemented on
> guest_entry, but it would make sense when we don't worry about latency
> to take the entry-only approach:
> - It saves the overhead of calling rcu_virt_note_context_switch()
> twice per guest entry in the loop
> - KVM will probably run guest entry soon after guest exit (in loop),
> so there is no need to run it twice
> - Eventually running rcu_core() may be cheaper than noting quiescent
> state every guest entry/exit cycle
> 
> Upsides of the new strategy:
> - Noting a quiescent state in guest exit avoids calling rcu_core() if
> there was a grace period request while guest was running, and timer
> interrupt hits the cpu.
> - If the loop re-enter quickly there is a high chance that guest
> entry's rcu_virt_note_context_switch() will be fast (local cacheline)
> as there is low probability of a grace period request happening
> between exit & re-entry.
> - It allows us to use the rcu patience strategy to avoid rcu_core()
> running if any grace period request happens between guest exit and
> guest re-entry, which is very important for low latency workloads
> running on guests as it reduces maximum latency in long runs.
> 
> What do you think?

Try both on the workload of interest with appropriate tracing and
see what happens?  The hardware's opinion overrides mine.  ;-)

							Thanx, Paul

Leonardo Bras May 15, 2024, 4:45 a.m. UTC | #11

On Tue, May 14, 2024 at 03:54:16PM -0700, Paul E. McKenney wrote:
> On Mon, May 13, 2024 at 06:47:13PM -0300, Leonardo Bras Soares Passos wrote:
> > On Mon, May 13, 2024 at 4:40 PM Sean Christopherson <seanjc@google.com> wrote:
> > >
> > > On Fri, May 10, 2024, Leonardo Bras wrote:
> > > > As of today, KVM notes a quiescent state only in guest entry, which is good
> > > > as it avoids the guest being interrupted for current RCU operations.
> > > >
> > > > While the guest vcpu runs, it can be interrupted by a timer IRQ that will
> > > > check for any RCU operations waiting for this CPU. In case there are any of
> > > > such, it invokes rcu_core() in order to sched-out the current thread and
> > > > note a quiescent state.
> > > >
> > > > This occasional schedule work will introduce tens of microsseconds of
> > > > latency, which is really bad for vcpus running latency-sensitive
> > > > applications, such as real-time workloads.
> > > >
> > > > So, note a quiescent state in guest exit, so the interrupted guests is able
> > > > to deal with any pending RCU operations before being required to invoke
> > > > rcu_core(), and thus avoid the overhead of related scheduler work.
> > >
> > > Are there any downsides to this?  E.g. extra latency or anything?  KVM will note
> > > a context switch on the next VM-Enter, so even if there is extra latency or
> > > something, KVM will eventually take the hit in the common case no matter what.
> > > But I know some setups are sensitive to handling select VM-Exits as soon as possible.
> > >
> > > I ask mainly because it seems like a no brainer to me to have both VM-Entry and
> > > VM-Exit note the context switch, which begs the question of why KVM isn't already
> > > doing that.  I assume it was just oversight when commit 126a6a542446 ("kvm,rcu,nohz:
> > > use RCU extended quiescent state when running KVM guest") handled the VM-Entry
> > > case?
> > 
> > I don't know, by the lore I see it happening in guest entry since the
> > first time it was introduced at
> > https://lore.kernel.org/all/1423167832-17609-5-git-send-email-riel@redhat.com/
> > 
> > Noting a quiescent state is cheap, but it may cost a few accesses to
> > possibly non-local cachelines. (Not an expert in this, Paul please let
> > me know if I got it wrong).
> 
> Yes, it is cheap, especially if interrupts are already disabled.
> (As in the scheduler asks RCU to do the same amount of work on its
> context-switch fastpath.)

Thanks!

> 
> > I don't have a historic context on why it was just implemented on
> > guest_entry, but it would make sense when we don't worry about latency
> > to take the entry-only approach:
> > - It saves the overhead of calling rcu_virt_note_context_switch()
> > twice per guest entry in the loop
> > - KVM will probably run guest entry soon after guest exit (in loop),
> > so there is no need to run it twice
> > - Eventually running rcu_core() may be cheaper than noting quiescent
> > state every guest entry/exit cycle
> > 
> > Upsides of the new strategy:
> > - Noting a quiescent state in guest exit avoids calling rcu_core() if
> > there was a grace period request while guest was running, and timer
> > interrupt hits the cpu.
> > - If the loop re-enter quickly there is a high chance that guest
> > entry's rcu_virt_note_context_switch() will be fast (local cacheline)
> > as there is low probability of a grace period request happening
> > between exit & re-entry.
> > - It allows us to use the rcu patience strategy to avoid rcu_core()
> > running if any grace period request happens between guest exit and
> > guest re-entry, which is very important for low latency workloads
> > running on guests as it reduces maximum latency in long runs.
> > 
> > What do you think?
> 
> Try both on the workload of interest with appropriate tracing and
> see what happens?  The hardware's opinion overrides mine.  ;-)

That's a great approach!

But in this case I think noting a quiescent state in guest exit is 
necessary to avoid a scenario in which a VM takes longer than RCU 
patience, and it ends up running rcuc in a nohz_full cpu, even if guest 
exit was quite brief. 

IIUC Sean's question is more on the tone of "Why KVM does not note a 
quiescent state in guest exit already, if it does in guest entry", and I 
just came with a few arguments to try finding a possible rationale, since 
I could find no discussion on that topic in the lore for the original 
commit.

Since noting a quiescent state in guest exit is cheap enough, avoids rcuc 
schedules when grace period starts during guest execution, and enables a 
much more rational usage of RCU patience, it's a safe to assume it's a 
better way of dealing with RCU compared to current implementation.

Sean, what do you think?

Thanks!
Leo

> 
> 							Thanx, Paul
>

Paul E. McKenney May 15, 2024, 2:57 p.m. UTC | #12

On Wed, May 15, 2024 at 01:45:33AM -0300, Leonardo Bras wrote:
> On Tue, May 14, 2024 at 03:54:16PM -0700, Paul E. McKenney wrote:
> > On Mon, May 13, 2024 at 06:47:13PM -0300, Leonardo Bras Soares Passos wrote:
> > > On Mon, May 13, 2024 at 4:40 PM Sean Christopherson <seanjc@google.com> wrote:
> > > >
> > > > On Fri, May 10, 2024, Leonardo Bras wrote:
> > > > > As of today, KVM notes a quiescent state only in guest entry, which is good
> > > > > as it avoids the guest being interrupted for current RCU operations.
> > > > >
> > > > > While the guest vcpu runs, it can be interrupted by a timer IRQ that will
> > > > > check for any RCU operations waiting for this CPU. In case there are any of
> > > > > such, it invokes rcu_core() in order to sched-out the current thread and
> > > > > note a quiescent state.
> > > > >
> > > > > This occasional schedule work will introduce tens of microsseconds of
> > > > > latency, which is really bad for vcpus running latency-sensitive
> > > > > applications, such as real-time workloads.
> > > > >
> > > > > So, note a quiescent state in guest exit, so the interrupted guests is able
> > > > > to deal with any pending RCU operations before being required to invoke
> > > > > rcu_core(), and thus avoid the overhead of related scheduler work.
> > > >
> > > > Are there any downsides to this?  E.g. extra latency or anything?  KVM will note
> > > > a context switch on the next VM-Enter, so even if there is extra latency or
> > > > something, KVM will eventually take the hit in the common case no matter what.
> > > > But I know some setups are sensitive to handling select VM-Exits as soon as possible.
> > > >
> > > > I ask mainly because it seems like a no brainer to me to have both VM-Entry and
> > > > VM-Exit note the context switch, which begs the question of why KVM isn't already
> > > > doing that.  I assume it was just oversight when commit 126a6a542446 ("kvm,rcu,nohz:
> > > > use RCU extended quiescent state when running KVM guest") handled the VM-Entry
> > > > case?
> > > 
> > > I don't know, by the lore I see it happening in guest entry since the
> > > first time it was introduced at
> > > https://lore.kernel.org/all/1423167832-17609-5-git-send-email-riel@redhat.com/
> > > 
> > > Noting a quiescent state is cheap, but it may cost a few accesses to
> > > possibly non-local cachelines. (Not an expert in this, Paul please let
> > > me know if I got it wrong).
> > 
> > Yes, it is cheap, especially if interrupts are already disabled.
> > (As in the scheduler asks RCU to do the same amount of work on its
> > context-switch fastpath.)
> 
> Thanks!
> 
> > 
> > > I don't have a historic context on why it was just implemented on
> > > guest_entry, but it would make sense when we don't worry about latency
> > > to take the entry-only approach:
> > > - It saves the overhead of calling rcu_virt_note_context_switch()
> > > twice per guest entry in the loop
> > > - KVM will probably run guest entry soon after guest exit (in loop),
> > > so there is no need to run it twice
> > > - Eventually running rcu_core() may be cheaper than noting quiescent
> > > state every guest entry/exit cycle
> > > 
> > > Upsides of the new strategy:
> > > - Noting a quiescent state in guest exit avoids calling rcu_core() if
> > > there was a grace period request while guest was running, and timer
> > > interrupt hits the cpu.
> > > - If the loop re-enter quickly there is a high chance that guest
> > > entry's rcu_virt_note_context_switch() will be fast (local cacheline)
> > > as there is low probability of a grace period request happening
> > > between exit & re-entry.
> > > - It allows us to use the rcu patience strategy to avoid rcu_core()
> > > running if any grace period request happens between guest exit and
> > > guest re-entry, which is very important for low latency workloads
> > > running on guests as it reduces maximum latency in long runs.
> > > 
> > > What do you think?
> > 
> > Try both on the workload of interest with appropriate tracing and
> > see what happens?  The hardware's opinion overrides mine.  ;-)
> 
> That's a great approach!
> 
> But in this case I think noting a quiescent state in guest exit is 
> necessary to avoid a scenario in which a VM takes longer than RCU 
> patience, and it ends up running rcuc in a nohz_full cpu, even if guest 
> exit was quite brief. 
> 
> IIUC Sean's question is more on the tone of "Why KVM does not note a 
> quiescent state in guest exit already, if it does in guest entry", and I 
> just came with a few arguments to try finding a possible rationale, since 
> I could find no discussion on that topic in the lore for the original 
> commit.

Understood, and maybe trying it would answer that question quickly.
Don't get me wrong, just because it appears to work in a few tests doesn't
mean that it really works, but if it visibly blows up, that answers the
question quite quickly and easily.  ;-)

But yes, if it appears to work, there must be a full investigation into
whether or not the change really is safe.

							Thanx, Paul

> Since noting a quiescent state in guest exit is cheap enough, avoids rcuc 
> schedules when grace period starts during guest execution, and enables a 
> much more rational usage of RCU patience, it's a safe to assume it's a 
> better way of dealing with RCU compared to current implementation.
> 
> Sean, what do you think?
> 
> Thanks!
> Leo
> 
> > 
> > 							Thanx, Paul
> > 
>

[RFC,1/1] kvm: Note an RCU quiescent state on guest exit

Commit Message

Comments

Patch