
[3/5] xen: RCU/x86/ARM: discount CPUs that were idle when grace period started.

Message ID 150114248433.22910.16140726025093688678.stgit@Solace (mailing list archive)
State New, archived

Commit Message

Dario Faggioli July 27, 2017, 8:01 a.m. UTC
Xen is a tickless (micro-)kernel. This means that, when a CPU
becomes idle, we stop all the activity on it, including any
periodic tick or timer.

When we imported RCU from Linux, Linux (x86) was a ticking
kernel, i.e., there was a periodic timer tick always running,
even on totally idle CPUs. This was bad from a power efficiency
perspective, but it's what made it possible to monitor the
quiescent states of all the CPUs, and hence tell when an RCU
grace period ends.

In Xen, that is impossible, and that's particularly problematic
on idle (or lightly loaded) systems, as CPUs that are idle may
never have the chance to tell RCU about their quiescence, and
grace periods could extend indefinitely!

This has led, on x86, to long (and unpredictable) delays between
RCU callback queueing and invocation. On ARM, we actually see
infinite grace periods (e.g., complete_domain_destroy() may
never actually be invoked on an idle system). See here:

 https://lists.xenproject.org/archives/html/xen-devel/2017-01/msg02454.html

The first step for fixing this situation is for RCU to record,
at the beginning of a grace period, which CPUs are already idle.
In fact, being idle, they can't be in the middle of any read-side
critical section, and we don't have to wait for them to declare
a grace period finished.

This is tracked in a cpumask, in a way that is very similar to
how Linux was achieving the same on s390 --which indeed was
tickless already, even back then-- and to what it started doing
for x86 from 2.6.21 on (see commit 79bf2bb3 "tick-management:
dyntick / highres functionality").
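
In code terms, the mechanism boils down to the following (condensed
verbatim from the full patch at the bottom of this page):

    /* Called from the arch-specific idle loops: */
    void rcu_idle_enter(unsigned int cpu)
    {
        cpumask_set_cpu(cpu, &rcu_ctrlblk.idle_cpumask);
    }

    void rcu_idle_exit(unsigned int cpu)
    {
        cpumask_clear_cpu(cpu, &rcu_ctrlblk.idle_cpumask);
    }

    /* And in rcu_start_batch(), idle CPUs are left out of the set of
     * CPUs the new grace period has to wait for: */
    cpumask_andnot(&rcp->cpumask, &cpu_online_map, &rcp->idle_cpumask);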

While there, also adopt the memory barrier introduced by Linux
commit c3f59023 ("Fix RCU race in access of nohz_cpu_mask").

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Julien Grall <julien.grall@arm.com>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
---
 xen/arch/arm/domain.c         |    2 ++
 xen/arch/x86/acpi/cpu_idle.c  |   25 +++++++++++++++++--------
 xen/arch/x86/cpu/mwait-idle.c |    9 ++++++++-
 xen/arch/x86/domain.c         |    8 +++++++-
 xen/common/rcupdate.c         |   28 ++++++++++++++++++++++++++--
 xen/include/xen/rcupdate.h    |    3 +++
 6 files changed, 63 insertions(+), 12 deletions(-)

Comments

Stefano Stabellini July 31, 2017, 9:17 p.m. UTC | #1
On Thu, 27 Jul 2017, Dario Faggioli wrote:
> Xen is a tickless (micro-)kernel. This means that, when a CPU
> becomes idle, we stop all the activity on it, including any
> periodic tick or timer.
> 
> When we imported RCU from Linux, Linux (x86) was a ticking
> kernel, i.e., there was a periodic timer tick always running,
> even on totally idle CPUs. This was bad from a power efficiency
> perspective, but it's what maked it possible to monitor the
> quiescent states of all the CPUs, and hence tell when an RCU
> grace period ends.
> 
> In Xen, that is impossible, and that's particularly problematic
> when system is idle (or lightly loaded) systems, as CPUs that
> are idle may never have the chance to tell RCU about their
> quiescence, and grace periods could extend indefinitely!
> 
> This has led, on x86, to long (an unpredictable) delays between
> RCU callbacks queueing and invokation. On ARM, we actually see
> infinite grace periods (e.g., complate_domain_destroy() may
> never be actually invoked on an idle system). See here:
> 
>  https://lists.xenproject.org/archives/html/xen-devel/2017-01/msg02454.html
> 
> The first step for fixing this situation is for RCU to record,
> at the beginning of a grace period, which CPUs are already idle.
> In fact, being idle, they can't be in the middle of any read-side
> critical section, and we don't have to wait for them to declare
> a grace period finished.
> 
> This is tracked in a cpumask, in a way that is very similar to
> how Linux also was achieving the same on s390 --which indeed was
> tickless already, even back then-- and to what it started to do
> for x86, from 2.6.21 on (see commit 79bf2bb3 "tick-management:
> dyntick / highres functionality").
> 
> While there, also adopt the memory barrier introduced by Linux
> commit commit c3f59023 ("Fix RCU race in access of nohz_cpu_mask").
> 
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> ---
> Cc: Stefano Stabellini <sstabellini@kernel.org>
> Cc: Julien Grall <julien.grall@arm.com>
> Cc: Jan Beulich <jbeulich@suse.com>
> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> ---
>  xen/arch/arm/domain.c         |    2 ++
>  xen/arch/x86/acpi/cpu_idle.c  |   25 +++++++++++++++++--------
>  xen/arch/x86/cpu/mwait-idle.c |    9 ++++++++-
>  xen/arch/x86/domain.c         |    8 +++++++-
>  xen/common/rcupdate.c         |   28 ++++++++++++++++++++++++++--
>  xen/include/xen/rcupdate.h    |    3 +++
>  6 files changed, 63 insertions(+), 12 deletions(-)
> 
> diff --git a/xen/arch/arm/domain.c b/xen/arch/arm/domain.c
> index fce29cb..666b7ef 100644
> --- a/xen/arch/arm/domain.c
> +++ b/xen/arch/arm/domain.c
> @@ -50,8 +50,10 @@ static void do_idle(void)
>      local_irq_disable();
>      if ( cpu_is_haltable(cpu) )
>      {
> +        rcu_idle_enter(cpu);
>          dsb(sy);
>          wfi();
> +        rcu_idle_exit(cpu);
>      }
>      local_irq_enable();
>  
> diff --git a/xen/common/rcupdate.c b/xen/common/rcupdate.c
> index 8cc5a82..f0fdc87 100644
> --- a/xen/common/rcupdate.c
> +++ b/xen/common/rcupdate.c
> @@ -52,7 +52,8 @@ static struct rcu_ctrlblk {
>      int  next_pending;  /* Is the next batch already waiting?         */
>  
>      spinlock_t  lock __cacheline_aligned;
> -    cpumask_t   cpumask; /* CPUs that need to switch in order    */
> +    cpumask_t   cpumask; /* CPUs that need to switch in order ... */
> +    cpumask_t   idle_cpumask; /* ... unless they are already idle */
>      /* for current batch to proceed.        */
>  } __cacheline_aligned rcu_ctrlblk = {
>      .cur = -300,
> @@ -248,7 +249,14 @@ static void rcu_start_batch(struct rcu_ctrlblk *rcp)
>          smp_wmb();
>          rcp->cur++;
>  
> -        cpumask_copy(&rcp->cpumask, &cpu_online_map);
> +       /*
> +        * Accessing idle_cpumask before incrementing rcp->cur needs a
> +        * Barrier  Otherwise it can cause tickless idle CPUs to be
                    ^ otherwise


> +        * included in rcp->cpumask, which will extend graceperiods
> +        * unnecessarily.
> +        */

It doesn't look like this comment applies to this code: we are accessing
idle_cpumask after rcp->cur here. Unless you meant "Accessing
idle_cpumask *after* incrementing rcp->cur."

Also could you please add a pointer to the other barrier in the pair
(barriers always go in pairs; for example, I think the smp_wmb() above in
rcu_start_batch is matched by the smp_rmb() in __rcu_process_callbacks).
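
(For reference, the pairing being asked about looks roughly like this; a
condensed sketch of the relevant lines of xen/common/rcupdate.c, pieced
together from this patch and the traces later in the thread, not the
literal code:)

    /* Writer side, in rcu_start_batch(): next_pending must be seen as 0
     * before the new value of cur. */
    rcp->next_pending = 0;
    smp_wmb();
    rcp->cur++;

    /* Reader side, in __rcu_process_callbacks(): cur is read first, then
     * next_pending; this smp_rmb() pairs with the smp_wmb() above. */
    rdp->batch = rcp->cur + 1;
    smp_rmb();
    if ( !rcp->next_pending )
    {
        rcp->next_pending = 1;
        rcu_start_batch(rcp);
    }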


> +        smp_mb();
> +        cpumask_andnot(&rcp->cpumask, &cpu_online_map, &rcp->idle_cpumask);
>      }
>  }
>  
> @@ -474,7 +482,23 @@ static struct notifier_block cpu_nfb = {
>  void __init rcu_init(void)
>  {
>      void *cpu = (void *)(long)smp_processor_id();
> +
> +    cpumask_setall(&rcu_ctrlblk.idle_cpumask);
>      cpu_callback(&cpu_nfb, CPU_UP_PREPARE, cpu);
>      register_cpu_notifier(&cpu_nfb);
>      open_softirq(RCU_SOFTIRQ, rcu_process_callbacks);
>  }
> +
> +/*
> + * The CPU is becoming idle, so no more read side critical
> + * sections, and one more step toward grace period.
> + */
> +void rcu_idle_enter(unsigned int cpu)
> +{
> +    cpumask_set_cpu(cpu, &rcu_ctrlblk.idle_cpumask);
> +}
> +
> +void rcu_idle_exit(unsigned int cpu)
> +{
> +    cpumask_clear_cpu(cpu, &rcu_ctrlblk.idle_cpumask);
> +}
> diff --git a/xen/include/xen/rcupdate.h b/xen/include/xen/rcupdate.h
> index 557a7b1..561ac43 100644
> --- a/xen/include/xen/rcupdate.h
> +++ b/xen/include/xen/rcupdate.h
> @@ -146,4 +146,7 @@ void call_rcu(struct rcu_head *head,
>  
>  int rcu_barrier(void);
>  
> +void rcu_idle_enter(unsigned int cpu);
> +void rcu_idle_exit(unsigned int cpu);
> +
>  #endif /* __XEN_RCUPDATE_H */
>
Jan Beulich Aug. 7, 2017, 8:35 a.m. UTC | #2
>>> Dario Faggioli <dario.faggioli@citrix.com> 07/27/17 10:01 AM >>>
>--- a/xen/arch/x86/acpi/cpu_idle.c
>+++ b/xen/arch/x86/acpi/cpu_idle.c
>@@ -418,14 +418,16 @@ static void acpi_processor_ffh_cstate_enter(struct acpi_processor_cx *cx)
>mwait_idle_with_hints(cx->address, MWAIT_ECX_INTERRUPT_BREAK);
>}
 >
>-static void acpi_idle_do_entry(struct acpi_processor_cx *cx)
>+static void acpi_idle_do_entry(unsigned int cpu, struct acpi_processor_cx *cx)
>{
>+    rcu_idle_enter(cpu);
>+
>switch ( cx->entry_method )
>{
>case ACPI_CSTATE_EM_FFH:
>/* Call into architectural FFH based C-state */
>acpi_processor_ffh_cstate_enter(cx);
>-        return;
>+        break;
>case ACPI_CSTATE_EM_SYSIO:
         >/* IO port based C-state */
>inb(cx->address);
>@@ -433,12 +435,14 @@ static void acpi_idle_do_entry(struct acpi_processor_cx *cx)
>because chipsets cannot guarantee that STPCLK# signal
>gets asserted in time to freeze execution properly. */
>inl(pmtmr_ioport);
>-        return;
>+        break;
     >case ACPI_CSTATE_EM_HALT:

Would be nice if you also added blank lines between the break-s and the
subsequent case labels.

>@@ -756,6 +759,8 @@ static void mwait_idle(void)
>return;
>}
 >
>+    rcu_idle_enter(cpu);
>+
>eax = cx->address;
>cstate = ((eax >> MWAIT_SUBSTATE_SIZE) & MWAIT_CSTATE_MASK) + 1;
 >
>@@ -787,6 +792,8 @@ static void mwait_idle(void)
>irq_traced[0], irq_traced[1], irq_traced[2], irq_traced[3]);
 >
>/* Now back in C0. */
>+    rcu_idle_exit(cpu);
>+
>update_idle_stats(power, cx, before, after);
>local_irq_enable();
 
Compared to the ACPI code, the window is quite a bit larger here. Any reason
you can't put these just around the mwait_idle_with_hints() invocation, which
would match what you do for the ACPI-based driver?
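
(For illustration, that would mean something like the following in
mwait_idle(); a sketch only, with the surrounding code elided:)

	rcu_idle_enter(cpu);
	mwait_idle_with_hints(eax, MWAIT_ECX_INTERRUPT_BREAK);
	rcu_idle_exit(cpu);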

>@@ -248,7 +249,14 @@ static void rcu_start_batch(struct rcu_ctrlblk *rcp)
>smp_wmb();
>rcp->cur++;
 >
>-        cpumask_copy(&rcp->cpumask, &cpu_online_map);
>+       /*
>+        * Accessing idle_cpumask before incrementing rcp->cur needs a
>+        * Barrier  Otherwise it can cause tickless idle CPUs to be
>+        * included in rcp->cpumask, which will extend graceperiods
>+        * unnecessarily.
>+        */
>+        smp_mb();
>+        cpumask_andnot(&rcp->cpumask, &cpu_online_map, &rcp->idle_cpumask);

I have some trouble with understanding the comment: You don't access
->idle_cpumask before you increment ->cur. (Also, as a general remark,
please go through the patch description and comments again to correct
spelling and similar issues.)

>@@ -474,7 +482,23 @@ static struct notifier_block cpu_nfb = {
>void __init rcu_init(void)
>{
>void *cpu = (void *)(long)smp_processor_id();
>+
>+    cpumask_setall(&rcu_ctrlblk.idle_cpumask);

The CPU you're running on surely is not idle initially?
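
(One possible way to address this, sketched here as an assumption rather
than an actual proposal, would be to clear the boot CPU right after
setting the mask:)

    void __init rcu_init(void)
    {
        void *cpu = (void *)(long)smp_processor_id();

        cpumask_setall(&rcu_ctrlblk.idle_cpumask);
        /* The boot CPU is certainly not idle while we are running here. */
        cpumask_clear_cpu(smp_processor_id(), &rcu_ctrlblk.idle_cpumask);
        cpu_callback(&cpu_nfb, CPU_UP_PREPARE, cpu);
        register_cpu_notifier(&cpu_nfb);
        open_softirq(RCU_SOFTIRQ, rcu_process_callbacks);
    }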

Jan
Dario Faggioli Aug. 9, 2017, 8:48 a.m. UTC | #3
On Mon, 2017-08-07 at 02:35 -0600, Jan Beulich wrote:
> > > > Dario Faggioli <dario.faggioli@citrix.com> 07/27/17 10:01 AM
> > > > >>>
> > --- a/xen/arch/x86/acpi/cpu_idle.c
> > +++ b/xen/arch/x86/acpi/cpu_idle.c
> > 
> > @@ -433,12 +435,14 @@ static void acpi_idle_do_entry(struct
> > acpi_processor_cx *cx)
> > because chipsets cannot guarantee that STPCLK# signal
> > gets asserted in time to freeze execution properly. */
> > inl(pmtmr_ioport);
> > -        return;
> > +        break;
> 
>      >case ACPI_CSTATE_EM_HALT:
> 
> Wiuld be nice if you also added blank lines between the break-s and
> the
> subsequent case labels.
> 
Ok.

> > @@ -787,6 +792,8 @@ static void mwait_idle(void)
> > irq_traced[0], irq_traced[1], irq_traced[2], irq_traced[3]);
> 
>  >
> > /* Now back in C0. */
> > +    rcu_idle_exit(cpu);
> > +
> > update_idle_stats(power, cx, before, after);
> > local_irq_enable();
> 
>  
> Compared to the ACPI code, the window is quite a bit larger here. Any
> reason
> you can't put these just around the mwait_idle_with_hints()
> invocation, which
> would match what you do for the ACPI-based driver?
> 
No, no reasons not to do that. I'll do it.

> > @@ -248,7 +249,14 @@ static void rcu_start_batch(struct rcu_ctrlblk
> > *rcp)
> > smp_wmb();
> > rcp->cur++;
> 
>  >
> > -        cpumask_copy(&rcp->cpumask, &cpu_online_map);
> > +       /*
> > +        * Accessing idle_cpumask before incrementing rcp->cur
> > needs a
> > +        * Barrier  Otherwise it can cause tickless idle CPUs to be
> > +        * included in rcp->cpumask, which will extend graceperiods
> > +        * unnecessarily.
> > +        */
> > +        smp_mb();
> > +        cpumask_andnot(&rcp->cpumask, &cpu_online_map, &rcp-
> > >idle_cpumask);
> 
> I have some trouble with understanding the comment: You don't access
> ->idle_cpumask before you increment ->cur.
>
It comes verbatim from the Linux commit. You're not the first one that
finds it unclear, and I don't like it either.

So, this is the Linux patch:

        if (rcp->next_pending &&
                        rcp->completed == rcp->cur) {
-               /* Can't change, since spin lock held. */
-               cpus_andnot(rsp->cpumask, cpu_online_map, nohz_cpu_mask);
-
                rcp->next_pending = 0;
-               /* next_pending == 0 must be visible in __rcu_process_callbacks()
-                * before it can see new value of cur.
+               /*
+                * next_pending == 0 must be visible in
+                * __rcu_process_callbacks() before it can see new value of cur.
                 */
                smp_wmb();
                rcp->cur++;
+
+               /*
+                * Accessing nohz_cpu_mask before incrementing rcp->cur needs a
+                * Barrier  Otherwise it can cause tickless idle CPUs to be
+                * included in rsp->cpumask, which will extend graceperiods
+                * unnecessarily.
+                */
+               smp_mb();
+               cpus_andnot(rsp->cpumask, cpu_online_map, nohz_cpu_mask);
+

_I_think_ what the original author meant was something along the lines
of <<Accessing nohz_cpu_mask before incrementing rcp->cur is unsafe.
Therefore, let's access it afterwards, and put a barrier in between.>>

But yeah, as said, I don't like it myself. In fact, it's the same exact
wording used in the changelog of the patch (Linux commit
c3f5902325d3053986e7359f706581d8f032e72f), but while it is fine there,
here it is completely misleading, as it does not comment/describe the
final look of the code.

I'm going to change it.

> (Also, as a general remark,
> please go through patch description and comments again to correct
> spelling and alike issues.)
> 
Sure.

Regards,
Dario
Jan Beulich Aug. 9, 2017, 8:57 a.m. UTC | #4
>>> On 09.08.17 at 10:48, <dario.faggioli@citrix.com> wrote:
> On Mon, 2017-08-07 at 02:35 -0600, Jan Beulich wrote:
>> > > > Dario Faggioli <dario.faggioli@citrix.com> 07/27/17 10:01 AM
>> > @@ -248,7 +249,14 @@ static void rcu_start_batch(struct rcu_ctrlblk
>> > *rcp)
>> > smp_wmb();
>> > rcp->cur++;
>> 
>>  >
>> > -        cpumask_copy(&rcp->cpumask, &cpu_online_map);
>> > +       /*
>> > +        * Accessing idle_cpumask before incrementing rcp->cur
>> > needs a
>> > +        * Barrier  Otherwise it can cause tickless idle CPUs to be
>> > +        * included in rcp->cpumask, which will extend graceperiods
>> > +        * unnecessarily.
>> > +        */
>> > +        smp_mb();
>> > +        cpumask_andnot(&rcp->cpumask, &cpu_online_map, &rcp-
>> > >idle_cpumask);
>> 
>> I have some trouble with understanding the comment: You don't access
>> ->idle_cpumask before you increment ->cur.
>>
> It comes verbatim from the Linux commit. You're not the first one that
> finds it unclear, and I don't like it either.
> 
> So, this is the Linux patch:
> 
>         if (rcp->next_pending &&
>                         rcp->completed == rcp->cur) {
> -               /* Can't change, since spin lock held. */
> -               cpus_andnot(rsp->cpumask, cpu_online_map, nohz_cpu_mask);
> -
>                 rcp->next_pending = 0;
> -               /* next_pending == 0 must be visible in __rcu_process_callbacks()
> -                * before it can see new value of cur.
> +               /*
> +                * next_pending == 0 must be visible in
> +                * __rcu_process_callbacks() before it can see new value of cur.
>                  */
>                 smp_wmb();
>                 rcp->cur++;
> +
> +               /*
> +                * Accessing nohz_cpu_mask before incrementing rcp->cur needs 
> a
> +                * Barrier  Otherwise it can cause tickless idle CPUs to be
> +                * included in rsp->cpumask, which will extend graceperiods
> +                * unnecessarily.
> +                */
> +               smp_mb();
> +               cpus_andnot(rsp->cpumask, cpu_online_map, nohz_cpu_mask);
> +
> 
> _I_think_ what the original author meant was something along the line
> of <<Accessing nohz_cpu_mask before incrementing rcp->cur is unsafe.
> Therefore, let's access it afterwords, and put a barrier in between.>>
> 
> But yeah, as said, I don't like it myself. In fact, it's the same exact
> wording used in the changelog of the patch (Linux commit
> c3f5902325d3053986e7359f706581d8f032e72f), but while it is fine there,
> here is completely misleading, as it does not comment/describe the
> final look of the code.
> 
> I'm going to change it.

Perhaps worth submitting a Linux patch too then?

Jan
Dario Faggioli Aug. 9, 2017, 9:20 a.m. UTC | #5
On Wed, 2017-08-09 at 02:57 -0600, Jan Beulich wrote:
> > > > On 09.08.17 at 10:48, <dario.faggioli@citrix.com> wrote:
> > 
> > _I_think_ what the original author meant was something along the
> > line
> > of <<Accessing nohz_cpu_mask before incrementing rcp->cur is
> > unsafe.
> > Therefore, let's access it afterwords, and put a barrier in
> > between.>>
> > 
> > But yeah, as said, I don't like it myself. In fact, it's the same
> > exact
> > wording used in the changelog of the patch (Linux commit
> > c3f5902325d3053986e7359f706581d8f032e72f), but while it is fine
> > there,
> > here is completely misleading, as it does not comment/describe the
> > final look of the code.
> > 
> > I'm going to change it.
> 
> Perhaps worth submitting a Linux patch too then?
> 
That code is long gone.

The file itself doesn't exist any longer, and the mask has been killed,
in favour of other mechanisms (depending on the variant of RCU, on the
degree of tickless idleness, etc.). :-)

Regards,
Dario
Jan Beulich Aug. 9, 2017, 9:26 a.m. UTC | #6
>>> On 09.08.17 at 11:20, <dario.faggioli@citrix.com> wrote:
> On Wed, 2017-08-09 at 02:57 -0600, Jan Beulich wrote:
>> > > > On 09.08.17 at 10:48, <dario.faggioli@citrix.com> wrote:
>> > 
>> > _I_think_ what the original author meant was something along the
>> > line
>> > of <<Accessing nohz_cpu_mask before incrementing rcp->cur is
>> > unsafe.
>> > Therefore, let's access it afterwords, and put a barrier in
>> > between.>>
>> > 
>> > But yeah, as said, I don't like it myself. In fact, it's the same
>> > exact
>> > wording used in the changelog of the patch (Linux commit
>> > c3f5902325d3053986e7359f706581d8f032e72f), but while it is fine
>> > there,
>> > here is completely misleading, as it does not comment/describe the
>> > final look of the code.
>> > 
>> > I'm going to change it.
>> 
>> Perhaps worth submitting a Linux patch too then?
>> 
> That code is long gone.
> 
> The file itself doesn't exist any longer, and the mask has been killed,
> in favour of other mechanisms (depending of the variant of RCU, of the
> degree of tickles-idleness, etc). :-)

Oh, I see - in that case we're fine with the comment being
whatever makes most sense for us.

Jan
Tim Deegan Aug. 9, 2017, 11:38 a.m. UTC | #7
Hi,

At 10:01 +0200 on 27 Jul (1501149684), Dario Faggioli wrote:
> In Xen, that is impossible, and that's particularly problematic
> when system is idle (or lightly loaded) systems, as CPUs that
> are idle may never have the chance to tell RCU about their
> quiescence, and grace periods could extend indefinitely!
[...]
> The first step for fixing this situation is for RCU to record,
> at the beginning of a grace period, which CPUs are already idle.
> In fact, being idle, they can't be in the middle of any read-side
> critical section, and we don't have to wait for them to declare
> a grace period finished.

AIUI this patch fixes a bug where:
 - a CPU is idle/asleep;
 - it is added to the cpumask of a new RCU grace period; and
 - because the CPU is asleep, the grace period never ends. 
Have I understood?

I think we might be left with a race condition:
 - CPU A is about to idle.  It runs the RCU softirq and
   clears itself from the current grace period.
 - CPU B ends the grace period and starts a new one.
 - CPU A calls rcu_idle_enter() and sleeps.
 - The new grace period never ends.

Is that fixed by your later rcu_idle_timer patch?  AIUI that's only
invoked when the calling CPU has pending RCU callbacks.

Or can it be fixed here by something like this in rcu_idle_enter?
 - lock rcp->lock
 - add ourselves to the idle mask
 - if we are in rcp->cpumask:
     - cpu_quiet()
 - unlock rcp->lock
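
(As a sketch only, and assuming rcu_idle_enter() may be called with
interrupts enabled, which, as Dario notes below, the current call sites
do not allow, that would look something like:)

    void rcu_idle_enter(unsigned int cpu)
    {
        struct rcu_ctrlblk *rcp = &rcu_ctrlblk;

        spin_lock(&rcp->lock);
        cpumask_set_cpu(cpu, &rcp->idle_cpumask);
        /* If the current batch is still waiting for us, report our
         * quiescence now, rather than leaving the grace period hanging. */
        if ( cpumask_test_cpu(cpu, &rcp->cpumask) )
            cpu_quiet(cpu, rcp);
        spin_unlock(&rcp->lock);
    }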

There's also the code at the top of rcu_check_quiescent_state() that
requires _two_ idle states per batch.  I don't know what race that's
protecting against so I don't know whether we need to worry about it
here as well. :)
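
(For reference, that code, reconstructed here as a rough sketch from the
traces in Dario's reply below rather than quoted from the source, goes
along these lines:)

    /* rcu_check_quiescent_state(), roughly: */
    if ( rdp->quiescbatch != rcp->cur )
    {
        /* First pass after a new batch was started: just take note of
         * the new grace period, and wait for a quiescent state. */
        rdp->qs_pending = 1;
        rdp->quiescbatch = rcp->cur;
        return;
    }
    if ( !rdp->qs_pending )
        return;
    /* Second pass: we have been through a quiescent state since noting
     * the batch, so report it. */
    rdp->qs_pending = 0;
    if ( likely(rdp->quiescbatch == rcp->cur) )
        cpu_quiet(rdp->cpu, rcp);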

Cheers,

Tim.
Dario Faggioli Aug. 11, 2017, 5:25 p.m. UTC | #8
On Wed, 2017-08-09 at 12:38 +0100, Tim Deegan wrote:
> Hi,
> 
Hey! :-)

> At 10:01 +0200 on 27 Jul (1501149684), Dario Faggioli wrote:
> > In Xen, that is impossible, and that's particularly problematic
> > when system is idle (or lightly loaded) systems, as CPUs that
> > are idle may never have the chance to tell RCU about their
> > quiescence, and grace periods could extend indefinitely!
> 
> [...]
> > The first step for fixing this situation is for RCU to record,
> > at the beginning of a grace period, which CPUs are already idle.
> > In fact, being idle, they can't be in the middle of any read-side
> > critical section, and we don't have to wait for them to declare
> > a grace period finished.
> 
> AIUI this patch fixes a bug where:
>  - a CPU is idle/asleep;
>  - it is added to the cpumask of a new RCU grace period; and
>  - because the CPU is asleep, the grace period never ends. 
> Have I understood?
> 
Yes indeed.

Basically, without this patch, all the CPUs are always added/considered
for all the grace periods. And if there are some that are idle/asleep,
we need to wait for them to wake up before declaring the grace period
ended.

So the duration of the grace period itself is out of our control (in
fact, it happens to be reasonable on x86, even on a fully idle system,
but not so much on ARM). It can even be virtually infinite... :-O

> I think we might be left with a race condition:
>  - CPU A is about to idle.  It runs the RCU softirq and
>    clears itself from the current grace period.
>  - CPU B ends the grace period and starts a new one.
>  - CPU A calls rcu_idle_enter() and sleeps.
>  - The new grace period never ends.
> 
Yes, I think this is a thing.

I've given it some thought, and will post something about it in another
email.

> Is that fixed by your later rcu_idle_timer patch?  AIUI that's only
> invoked when the calling CPU has pending RCU callbacks.
> 
The timer is meant to deal with CPUs that have pending callbacks, yes.
I don't think we can solve this issue with the timer, even if we extend
its scope to all CPUs with any pending RCU processing... we'd
still have problems with grace periods that start (on CPU B) after the
check for whether or not we need the timer has happened.

> Or can it be fixed here by something like this in rcu_idle_enter?
>  - lock rcp->lock
>  - add ourselves to the idle mask
>  - if we are in rcp->cpumask:
>      - cpu_quiet()
>  - unlock rcp->lock
> 
If rcu_idle_enter() lives inside an IRQ-disabled section (e.g., in
mwait_idle(), as it does right now), I can't take rcp->lock (because of
spinlock IRQ safety).

If I move it outside of that, then yes, the above may work. Or,
actually, we may not even need it... But as I said, more on this in
another message.

> There's also the code at the top of rcu_check_quiescent_state() that
> requres _two_ idle states per batch.  I don't know what race that's
> protecting against so I don't know whether we need to worry about it
> here as well. :)
> 
Not sure I'm getting this part. It's not a race, it is that, literally,
the first time (after someone has started a new batch) a CPU executes
rcu_check_quiescent_state(), it realizes we're in a new grace period:

   6.681421066 -.-.-.|.--x-|--- d0v7 rcu_call fn=0xffff82d080207c4c, rcp_cur=-300, rdp_quiescbatch=-300
   6.681422601 -.-.-.|.--x-|--- d0v7 do_softirq, pending = 0x00000010
   6.681422889 -.-.-.|.--x-|--- d0v7 rcu_pending? yes (ret=2): no pending entries but new entries
   6.681423214 -.-.-.|.--x-|--- d0v7 raise_softirq nr 3
   6.681423494 -.-.-.|.--x-|--- d0v7 softirq_handler nr 3
   6.681423494 -.-.-.|.--x-|--- d0v7 rcu_process_callbacks, rcp_completed=-300, rdp_batch=0, rdp_curlist: null, rdp_nxtlist: yes, rdp_donelist: null
   6.681423494 -.-.-.|.--x-|--- d0v7 rcu_next_batch, rcp_cur=-300, rdp_batch=-299, rcp_next_pending? no
   6.681423494 -.-.-.|.--x-|--- d0v7 rcu_start_batch, rcp_cur=-299, rcp_cpumask=0x00001442
1->6.681423494 -.-.-.|.--x-|--- d0v7 rcu_check_quiesc_state, rcp_cur=-299, rdp_quiescbatch=-300, rdp_qs_pending: no
   6.681423494 -.-.-.|.--x-|--- d0v7 rcu_grace_start, rcp_cur=-299

The second time, it realizes the grace period has ended:

   6.681425277 -.-.-.|.--x-|--- d0v7 do_softirq, pending = 0x00000010
   6.681425647 -.-.-.|.--x-|--- d0v7 rcu_pending? yes (ret=5): waiting for CPU to quiesce (rdp_qs_pending=1)
   6.681425872 -.-.-.|.--x-|--- d0v7 raise_softirq nr 3
   6.681426117 -.-.-.|.--x-|--- d0v7 softirq_handler nr 3
   6.681426117 -.-.-.|.--x-|--- d0v7 rcu_process_callbacks, rcp_completed=-300, rdp_batch=-299, rdp_curlist: yes, rdp_nxtlist: null, rdp_donelist: null
2->6.681426117 -.-.-.|.--x-|--- d0v7 rcu_check_quiesc_state, rcp_cur=-299, rdp_quiescbatch=-299, rdp_qs_pending: yes
   6.681426117 -.-.-.|.--x-|--- d0v7 rcu_grace_done, rcp_cur=-299, rcp_completed=-300, rdp_quiescbatch=-299
   6.681426117 -.-.-.|.--x-|--- d0v7 rcu_cpu_quiet, rcp_cur=-299, rcp_cpumask=0x00001042

The trace above was for the CPU which actually started the grace
period, and which has the callback queued. Here's the same for another
one:

   6.682213282 -.-.-.x.----|--- d0v2 vcpu_block d0v2
   6.682213549 -.-.-.x.----|--- d0v2 raise_softirq nr 1
   6.682213999 -.-.-.x.----|--- d0v2 do_softirq, pending = 0x00000002
   6.682214397 -.-.-.x.----|--- d0v2 rcu_pending? yes (ret=4): waiting for CPU to quiesce (rdp_qs_pending=0)
   6.682214667 -.-.-.x.----|--- d0v2 raise_softirq nr 3
   [...]
   6.682224404 -.-.-.x.----|--- d32767v6 softirq_handler nr 3
   6.682224404 -.-.-.x.----|--- d32767v6 rcu_process_callbacks, rcp_completed=-300, rdp_batch=0, rdp_curlist: null, rdp_nxtlist: null, rdp_donelist: null
1->6.682224404 -.-.-.x.----|--- d32767v6 rcu_check_quiesc_state, rcp_cur=-299, rdp_quiescbatch=-300, rdp_qs_pending: no
   6.682224404 -.-.-.x.----|--- d32767v6 rcu_grace_start, rcp_cur=-299
   6.682225310 -.-.-.x.----|--- d32767v6 do_softirq, pending = 0x00000000
   6.682225570 -.-.-.x.----|--- d32767v6 rcu_pending? yes (ret=5): waiting for CPU to quiesce (rdp_qs_pending=1)
   6.682225790 -.-.-.x.----|--- d32767v6 raise_softirq nr 3
   6.682226032 -.-.-.x.----|--- d32767v6 softirq_handler nr 3
   6.682226032 -.-.-.x.----|--- d32767v6 rcu_process_callbacks, rcp_completed=-300, rdp_batch=0, rdp_curlist: null, rdp_nxtlist: null, rdp_donelist: null
2->6.682226032 -.-.-.x.----|--- d32767v6 rcu_check_quiesc_state, rcp_cur=-299, rdp_quiescbatch=-299, rdp_qs_pending: yes
   6.682226032 -.-.-.x.----|--- d32767v6 rcu_grace_done, rcp_cur=-299, rcp_completed=-300, rdp_quiescbatch=-299
   6.682226032 -.-.-.x.----|--- d32767v6 rcu_cpu_quiet, rcp_cur=-299, rcp_cpumask=0x00000000

But maybe I've not understood properly what you meant...

Thanks and Regards,
Dario
Dario Faggioli Aug. 14, 2017, 9:19 a.m. UTC | #9
On Wed, 2017-08-09 at 12:38 +0100, Tim Deegan wrote:
> Hi,
> 
> At 10:01 +0200 on 27 Jul (1501149684), Dario Faggioli wrote:
> > In Xen, that is impossible, and that's particularly problematic
> > when system is idle (or lightly loaded) systems, as CPUs that
> > are idle may never have the chance to tell RCU about their
> > quiescence, and grace periods could extend indefinitely!
> 
> [...]
> > The first step for fixing this situation is for RCU to record,
> > at the beginning of a grace period, which CPUs are already idle.
> > In fact, being idle, they can't be in the middle of any read-side
> > critical section, and we don't have to wait for them to declare
> > a grace period finished.
> 
> AIUI this patch fixes a bug where:
>  - a CPU is idle/asleep;
>  - it is added to the cpumask of a new RCU grace period; and
>  - because the CPU is asleep, the grace period never ends. 
> Have I understood?
> 
> I think we might be left with a race condition:
>  - CPU A is about to idle.  It runs the RCU softirq and
>    clears itself from the current grace period.
>  - CPU B ends the grace period and starts a new one.
>  - CPU A calls rcu_idle_enter() and sleeps.
>  - The new grace period never ends.
> 
So, I could not make this happen artificially, but I put together a
possible sequence of events that depicts this situation. I'm putting it
here, but also pasting it at https://pastebin.com/PHu6Guq0 (in case the
email body is mangled).

Lines starting with a '|' are the output of tracing, triggered by the
execution of the code/functions reported in the other lines:

	CPU0				CPU1				CPU2
	.				.				.
	.				call_rcu(...)			.
	.				|rcu_call fn=xxx, rcp_cur=-300, rdp_quiescbatch=-300
	.				.				.
	.				|do_softirq			.
	.				|rcu_pending? yes (ret=2): no pending entries but new entries
	.				|raise_softirq nr 3		.
	.				|softirq_handler nr 3		.
	.				rcu_process_callbacks();	.
	.				|rcu_process_callbacks, rcp_completed=-300, rdp_batch=0, rdp_curlist: null, rdp_nxtlist: yes, rdp_donelist: null
	.				  rdp->batch = rcp->cur + 1;	.
	.				  smp_rmb;			.
   	.				|rcu_next_batch, rcp_cur=-300, rdp_batch=-299, rcp_next_pending? no
	.				  if (!rcp->next_pending) {	.
	.				   rcp->next_pending = 1;	.
	.				   rcu_start_batch();		.
	.				     if (rcp->next_pending && rcp->completed == rcp->cur) {
	.				      rcp->next_pending = 0;	.
	.				      smp_wmb;			.
	.				      rcp->cur++;		.
	.				      smp_mb;			.
	.				      cpumask_andnot(&rcp->cpumask, &cpu_online_map, &rcp->idle_cpumask);
	.				|rcu_start_batch, rcp_cur=-299, rcp_cpumask=0x00000006 //Active CPUs: 1,2
	.				  }				.
	.				  rcu_check_quiescent_state();	.
	.				|rcu_check_quiesc_state, rcp_cur=-299, rdp_quiescbatch=-300, rdp_qs_pending: no
	.				    if (rdp->quiescbatch != rcp->cur) {
	.				     rdp->qs_pending = 1;	.
	.				     rdp->quiescbatch = rcp->cur;
	.				|rcu_grace_start, rcp_cur=-299	.
	.				    }				.
	call_rcu(...)			.				.
	|rcu_call fn=xxx, rcp_cur=-299, rdp_quiescbatch=-300		.
	.				.				.
	|do_softirq			.				.
	|rcu_pending? yes (ret=2): no pending entries but new entries	.
	|raise_softirq nr 3		.				.
	|softirq_handler nr 3		.				.
	rcu_process_callbacks();	.				.
	|rcu_process_callbacks, rcp_completed=-300, rdp_batch=0, rdp_curlist: null, rdp_nxtlist: yes, rdp_donelist: null
	  rdp->batch = rcp->cur + 1;	.				.
	  smp_rmb;			.				.
   	|rcu_next_batch, rcp_cur=-299, rdp_batch=-298, rcp_next_pending? no
	  if (!rcp->next_pending) {	.				.
	   rcp->next_pending = 1;	.				.
	   rcu_start_batch();		.				.
	     if (rcp->next_pending && rcp->completed == rcp->cur) //NO!!!
          }				.				.
	  rcu_check_quiescent_state();	.				.
	|rcu_check_quiesc_state, rcp_cur=-299, rdp_quiescbatch=-300, rdp_qs_pending: no
	  if (rdp->quiescbatch != rcp->cur) {				.
	   rdp->qs_pending = 1;		.				.
	   rdp->quiescbatch = rcp->cur;	.				.
	|rcu_grace_start, rcp_cur=-299	.				.
	  }				.				.
	.				.				.
	.				.				|vcpu_block dXvY
	.				.				|raise_softirq nr 1
	.				.				|rcu_pending? yes (ret=4): waiting for CPU to quiesce (rdp_qs_pending=0)
	.				.				|raise_softirq nr 3
	.				.				|softirq_handler nr 1
	.				.				|//scheduler stuff
	.				.				|runstate_change dXvY running->blocked
	.				.				|runstate_change d32767v2 runnable->running
	.				.				idle_loop()
	.				.				  pm_idle() //e.g., mwait_idle()
	.				.				    cpufreq_dbs_timer_suspend();
	.				.				|timer_stop t=0xffff830319acebc0, fn=0xffff82d080252da9
	.				.				|timer_rm_entry t=0xffff830319acebc0, cpu=2, status=0x3
	.				.				    sched_tick_suspend();
	.				.				      rcu_idle_timer_start()
	.				.				    process_pending_softirqs();
	.				.				|do_softirq, pending = 0x00000008
	.				.				|rcu_pending? yes (ret=4): waiting for CPU to quiesce (rdp_qs_pending=0)
	.				.				|raise_softirq nr 3
	.				.				|softirq_handler nr 3
	.				.				      rcu_process_callbacks();
	.				.				|rcu_process_callbacks, rcp_completed=-300, rdp_batch=0, rdp_curlist: null, rdp_nxtlist: null, rdp_donelist: null
	.				.				        rcu_check_quiescent_state();
	.				.				|rcu_check_quiesc_state, rcp_cur=-299, rdp_quiescbatch=-300, rdp_qs_pending: no
	.				.				          if (rdp->quiescbatch != rcp->cur) {
	.				.				           rdp->qs_pending = 1;
	.				.				           rdp->quiescbatch = rcp->cur;
	.				.				|rcu_grace_start, rcp_cur=-299
	.				.				          }
	.				.				|do_softirq, pending = 0x00000000
	.				.				|rcu_pending? yes (ret=5): waiting for CPU to quiesce (rdp_qs_pending=1)
	.				.				|raise_softirq nr 3
	.				.				|softirq_handler nr 3
	.				.				      rcu_process_callbacks();
	.				.				|rcu_process_callbacks, rcp_completed=-300, rdp_batch=0, rdp_curlist: null, rdp_nxtlist: null, rdp_donelist: null
	.				.				        rcu_check_quiescent_state();
	.				.				|rcu_check_quiesc_state, rcp_cur=-299, rdp_quiescbatch=-299, rdp_qs_pending: yes
	.				.				          if (rdp->quiescbatch != rcp->cur) // NO!!!
	.				.				          if (!rdp->qs_pending) // NO!!!
	.				.				          rdp->qs_pending = 0;
	.				.				|rcu_grace_done, rcp_cur=-299, rcp_completed=-300, rdp_quiescbatch=-299
	.				.				          if (likely(rdp->quiescbatch == rcp->cur)) {
	.				.				           cpu_quiet();
	.				.				             cpumask_clear_cpu(cpu, &rcp->cpumask);
	.				.				|rcu_cpu_quiet, rcp_cur=-299, rcp_cpumask=0x00000002 // Active CPU: 1
	.				.				             if (cpumask_empty(&rcp->cpumask)) // NO!!!
	.				|do_softirq, pending = 0x00000010
	.				|rcu_pending? yes (ret=5): waiting for CPU to quiesce (rdp_qs_pending=1)
	.				|raise_softirq nr 3		.
	.				|softirq_handler nr 3		.
	.				rcu_process_callbacks();	.
	.				|rcu_process_callbacks, rcp_completed=-300, rdp_batch=-299, rdp_curlist: yes, rdp_nxtlist: null, rdp_donelist: null
	.				  rcu_check_quiescent_state();	.
	.				|rcu_check_quiesc_state, rcp_cur=-299, rdp_quiescbatch=-299, rdp_qs_pending: yes
	.				    if (rdp->quiescbatch != rcp->cur) // NO!!!
	.				    if (!rdp->qs_pending) // NO!!!
	.				|rcu_grace_done, rcp_cur=-299, rcp_completed=-300, rdp_quiescbatch=-299
	.				  if (likely(rdp->quiescbatch == rcp->cur)) {
	.				   cpu_quiet(rdp->cpu, rcp);	.
	.				     cpumask_clear_cpu(cpu, &rcp->cpumask); 
	.				|rcu_cpu_quiet, rcp_cur=-299, rcp_cpumask=0x00000000 //No more active CPU
	.				     if (cpumask_empty(&rcp->cpumask)) {
	.				      rcp->completed = rcp->cur;.
	.				      rcu_start_batch();	.
	.				        if (rcp->next_pending && rcp->completed == rcp->cur) {
	.				         rcp->next_pending = 0;	.
	.				         smp_wmb;		.
	.				.				|do_softirq, pending = 0x00000000
	.				.				|rcu_pending? no
	.				.				    local_irq_disable();
	.				.				    if (!cpu_is_haltable(cpu)) //NO!!!
	.				.				|rcu_pending? no
	.				.				|pm_idle_start c2, pm_tick=3720475837, exp=1708us, pred=1024us
	.				.				    if (cpu_is_haltable(cpu)) {  // <--- rcu_pending: quiescbatch=-299, rcp->cur=-299
[1]	.				.				|rcu_pending? no
[2]	.				         rcp->cur++;		.
	.				         smp_mb;		.
	.				         cpumask_andnot(&rcp->cpumask, &cpu_online_map, &rcp->idle_cpumask);
	.				|rcu_start_batch, rcp_cur=-298, rcp_cpumask=0x00000007
	.				        }
	.				     }
[3]	.				.				     rcu_idle_enter(cpu);
	.				.				     mwait_idle_with_hints();
	.				.				.

So, this is basically what's described above, with my CPU1 being Tim's
CPU B (the one that ends the grace period and starts a new one), and my
CPU2 being Tim's CPU A (the one that goes idle). CPU0 needs to issue
a call_rcu() and queue a new batch (or CPU1 won't start any new one).

Basically, for the race to occur (in [3]), it is necessary that [2]
(CPU1 doing rcp->cur++) happens after [1] (the last call to rcu_pending()
on CPU2, before clearing the mask and going idle). In fact, if that is
not the case, rcu_pending() on CPU2 will say 'yes', because of this
check:

 if (rdp->quiescbatch != rcp->cur || rdp->qs_pending)

which would prevent the CPU from going idle (and will, at least, lead
to it going through check_quiescent_state() twice, clearing itself from
the new grace period too).

Therefore, it looks to me that the race can be avoided by making sure
that there is at least one check of rcu_pending() after a CPU has
called rcu_idle_enter(). This basically means moving rcu_idle_enter()
up.

I was actually thinking of moving it inside the tick-stopping logic too.
This way, when CPU1 samples idle_cpumask for the new grace period,
either:
1) CPU2 will be in idle_cpumask already, and it actually goes idle;
2) CPU2 will be in idle_cpumask, but then it does not go idle
(cpu_is_haltable() returning false) for various reasons. This may look
non-ideal, but if it is here, trying to go idle, it is not holding
references to read-side critical sections, or at least we can consider
it to be quiescing, so it's ok to ignore it for the grace period;
3) CPU2 will not be in idle_cpumask, so it will not be in the sampled
mask, but its first check of rcu_pending() will lead to it acknowledging
quiescence, and to cpu_quiet() being called;

Which looks fine to me. Or am I still missing something?
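
(To make that concrete, here is a hypothetical, simplified sketch of
mwait_idle() with rcu_idle_enter() moved up into the tick-stopping
logic; it is based on the flow shown in the trace above, not on an
actual patch:)

	cpufreq_dbs_timer_suspend();
	sched_tick_suspend();
	/* Mark ourselves idle *before* the last rcu_pending() check, so a
	 * grace period sampled after this point either leaves us out, or
	 * is acknowledged by that very check. */
	rcu_idle_enter(cpu);
	process_pending_softirqs();

	local_irq_disable();
	if (!cpu_is_haltable(cpu)) {
		rcu_idle_exit(cpu);	/* not going to sleep after all */
		local_irq_enable();
		sched_tick_resume();
		/* ... */
		return;
	}

	/* ... C-state selection, tracing, etc. ... */
	mwait_idle_with_hints(eax, MWAIT_ECX_INTERRUPT_BREAK);
	rcu_idle_exit(cpu);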

Thanks and Regards,
Dario

Patch

diff --git a/xen/arch/arm/domain.c b/xen/arch/arm/domain.c
index fce29cb..666b7ef 100644
--- a/xen/arch/arm/domain.c
+++ b/xen/arch/arm/domain.c
@@ -50,8 +50,10 @@  static void do_idle(void)
     local_irq_disable();
     if ( cpu_is_haltable(cpu) )
     {
+        rcu_idle_enter(cpu);
         dsb(sy);
         wfi();
+        rcu_idle_exit(cpu);
     }
     local_irq_enable();
 
diff --git a/xen/arch/x86/acpi/cpu_idle.c b/xen/arch/x86/acpi/cpu_idle.c
index 482b8a7..04c52e8 100644
--- a/xen/arch/x86/acpi/cpu_idle.c
+++ b/xen/arch/x86/acpi/cpu_idle.c
@@ -418,14 +418,16 @@  static void acpi_processor_ffh_cstate_enter(struct acpi_processor_cx *cx)
     mwait_idle_with_hints(cx->address, MWAIT_ECX_INTERRUPT_BREAK);
 }
 
-static void acpi_idle_do_entry(struct acpi_processor_cx *cx)
+static void acpi_idle_do_entry(unsigned int cpu, struct acpi_processor_cx *cx)
 {
+    rcu_idle_enter(cpu);
+
     switch ( cx->entry_method )
     {
     case ACPI_CSTATE_EM_FFH:
         /* Call into architectural FFH based C-state */
         acpi_processor_ffh_cstate_enter(cx);
-        return;
+        break;
     case ACPI_CSTATE_EM_SYSIO:
         /* IO port based C-state */
         inb(cx->address);
@@ -433,12 +435,14 @@  static void acpi_idle_do_entry(struct acpi_processor_cx *cx)
            because chipsets cannot guarantee that STPCLK# signal
            gets asserted in time to freeze execution properly. */
         inl(pmtmr_ioport);
-        return;
+        break;
     case ACPI_CSTATE_EM_HALT:
         safe_halt();
         local_irq_disable();
-        return;
+        break;
     }
+
+    rcu_idle_exit(cpu);
 }
 
 static int acpi_idle_bm_check(void)
@@ -540,7 +544,8 @@  void update_idle_stats(struct acpi_processor_power *power,
 
 static void acpi_processor_idle(void)
 {
-    struct acpi_processor_power *power = processor_powers[smp_processor_id()];
+    unsigned int cpu = smp_processor_id();
+    struct acpi_processor_power *power = processor_powers[cpu];
     struct acpi_processor_cx *cx = NULL;
     int next_state;
     uint64_t t1, t2 = 0;
@@ -563,7 +568,11 @@  static void acpi_processor_idle(void)
         if ( pm_idle_save )
             pm_idle_save();
         else
+        {
+            rcu_idle_enter(cpu);
             safe_halt();
+            rcu_idle_exit(cpu);
+        }
         return;
     }
 
@@ -579,7 +588,7 @@  static void acpi_processor_idle(void)
      */
     local_irq_disable();
 
-    if ( !cpu_is_haltable(smp_processor_id()) )
+    if ( !cpu_is_haltable(cpu) )
     {
         local_irq_enable();
         sched_tick_resume();
@@ -610,7 +619,7 @@  static void acpi_processor_idle(void)
             update_last_cx_stat(power, cx, t1);
 
             /* Invoke C2 */
-            acpi_idle_do_entry(cx);
+            acpi_idle_do_entry(cpu, cx);
             /* Get end time (ticks) */
             t2 = cpuidle_get_tick();
             trace_exit_reason(irq_traced);
@@ -672,7 +681,7 @@  static void acpi_processor_idle(void)
         }
 
         /* Invoke C3 */
-        acpi_idle_do_entry(cx);
+        acpi_idle_do_entry(cpu, cx);
 
         if ( (cx->type == ACPI_STATE_C3) &&
              power->flags.bm_check && power->flags.bm_control )
diff --git a/xen/arch/x86/cpu/mwait-idle.c b/xen/arch/x86/cpu/mwait-idle.c
index 762dff1..ae9e92b 100644
--- a/xen/arch/x86/cpu/mwait-idle.c
+++ b/xen/arch/x86/cpu/mwait-idle.c
@@ -735,8 +735,11 @@  static void mwait_idle(void)
 	if (!cx) {
 		if (pm_idle_save)
 			pm_idle_save();
-		else
+		else {
+			rcu_idle_enter(cpu);
 			safe_halt();
+			rcu_idle_exit(cpu);
+		}
 		return;
 	}
 
@@ -756,6 +759,8 @@  static void mwait_idle(void)
 		return;
 	}
 
+	rcu_idle_enter(cpu);
+
 	eax = cx->address;
 	cstate = ((eax >> MWAIT_SUBSTATE_SIZE) & MWAIT_CSTATE_MASK) + 1;
 
@@ -787,6 +792,8 @@  static void mwait_idle(void)
 		irq_traced[0], irq_traced[1], irq_traced[2], irq_traced[3]);
 
 	/* Now back in C0. */
+	rcu_idle_exit(cpu);
+
 	update_idle_stats(power, cx, before, after);
 	local_irq_enable();
 
diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index dd8bf13..a6c0f66 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -73,9 +73,15 @@  void (*dead_idle) (void) __read_mostly = default_dead_idle;
 
 static void default_idle(void)
 {
+    unsigned int cpu = smp_processor_id();
+
     local_irq_disable();
-    if ( cpu_is_haltable(smp_processor_id()) )
+    if ( cpu_is_haltable(cpu) )
+    {
+        rcu_idle_enter(cpu);
         safe_halt();
+        rcu_idle_exit(cpu);
+    }
     else
         local_irq_enable();
 }
diff --git a/xen/common/rcupdate.c b/xen/common/rcupdate.c
index 8cc5a82..f0fdc87 100644
--- a/xen/common/rcupdate.c
+++ b/xen/common/rcupdate.c
@@ -52,7 +52,8 @@  static struct rcu_ctrlblk {
     int  next_pending;  /* Is the next batch already waiting?         */
 
     spinlock_t  lock __cacheline_aligned;
-    cpumask_t   cpumask; /* CPUs that need to switch in order    */
+    cpumask_t   cpumask; /* CPUs that need to switch in order ... */
+    cpumask_t   idle_cpumask; /* ... unless they are already idle */
     /* for current batch to proceed.        */
 } __cacheline_aligned rcu_ctrlblk = {
     .cur = -300,
@@ -248,7 +249,14 @@  static void rcu_start_batch(struct rcu_ctrlblk *rcp)
         smp_wmb();
         rcp->cur++;
 
-        cpumask_copy(&rcp->cpumask, &cpu_online_map);
+       /*
+        * Accessing idle_cpumask before incrementing rcp->cur needs a
+        * Barrier  Otherwise it can cause tickless idle CPUs to be
+        * included in rcp->cpumask, which will extend graceperiods
+        * unnecessarily.
+        */
+        smp_mb();
+        cpumask_andnot(&rcp->cpumask, &cpu_online_map, &rcp->idle_cpumask);
     }
 }
 
@@ -474,7 +482,23 @@  static struct notifier_block cpu_nfb = {
 void __init rcu_init(void)
 {
     void *cpu = (void *)(long)smp_processor_id();
+
+    cpumask_setall(&rcu_ctrlblk.idle_cpumask);
     cpu_callback(&cpu_nfb, CPU_UP_PREPARE, cpu);
     register_cpu_notifier(&cpu_nfb);
     open_softirq(RCU_SOFTIRQ, rcu_process_callbacks);
 }
+
+/*
+ * The CPU is becoming idle, so no more read side critical
+ * sections, and one more step toward grace period.
+ */
+void rcu_idle_enter(unsigned int cpu)
+{
+    cpumask_set_cpu(cpu, &rcu_ctrlblk.idle_cpumask);
+}
+
+void rcu_idle_exit(unsigned int cpu)
+{
+    cpumask_clear_cpu(cpu, &rcu_ctrlblk.idle_cpumask);
+}
diff --git a/xen/include/xen/rcupdate.h b/xen/include/xen/rcupdate.h
index 557a7b1..561ac43 100644
--- a/xen/include/xen/rcupdate.h
+++ b/xen/include/xen/rcupdate.h
@@ -146,4 +146,7 @@  void call_rcu(struct rcu_head *head,
 
 int rcu_barrier(void);
 
+void rcu_idle_enter(unsigned int cpu);
+void rcu_idle_exit(unsigned int cpu);
+
 #endif /* __XEN_RCUPDATE_H */