
[RFC,01/10] CPU hotplug: Introduce "stable" cpu online mask, for atomic hotplug readers

Message ID 20121204085324.25919.53090.stgit@srivatsabhat.in.ibm.com (mailing list archive)
State RFC, archived

Commit Message

Srivatsa S. Bhat Dec. 4, 2012, 8:53 a.m. UTC
From: Michael Wang <wangyun@linux.vnet.ibm.com>

There are places where preempt_disable() is used to prevent any CPU from
going offline during a critical section. Let us call them "atomic
hotplug readers" (atomic because they run in atomic contexts).

Often, these atomic hotplug readers have a simple need: they want the cpu
online mask that they work with (inside their critical section) to be
stable, i.e., it should be guaranteed that CPUs in that mask won't go
offline during the critical section. An important point here is that they
don't really care if such a "stable" mask is a subset of the actual
cpu_online_mask.

The intent of this patch is to provide such a "stable" cpu online mask
for that class of atomic hotplug readers.

Fundamental idea behind the design:
-----------------------------------

Simply put, have a separate mask called the stable cpu online mask; and
at the hotplug writer (cpu_down()), note down the CPU that is about to go
offline, and remove it from the stable cpu online mask. Then, feel free
to take that CPU offline, since the atomic hotplug readers won't see it
from now on. Also, don't start any new cpu_down() operations until all
existing atomic hotplug readers have completed (because they might still
be working with the old value of the stable online mask).

Some important design requirements and considerations:
-----------------------------------------------------

1. The atomic hotplug readers should ideally *never* wait for the hotplug
   writer (cpu_down()) for *anything*. These atomic hotplug readers can run
   in very hot paths like interrupt handling/IPI, so if they had to wait
   for an ongoing cpu_down() to complete, that would pretty much reintroduce
   the same performance/latency problems as stop_machine().

2. Any synchronization on the atomic hotplug reader side must be highly
   scalable - avoid global locks/counters etc. These paths currently
   use the extremely fast preempt_disable(); our replacement for
   preempt_disable() must not become ridiculously costly.

3. preempt_disable() is recursive. The replacement should be recursive too.

Implementation of the design:
-----------------------------

Atomic hotplug reader side:

We use per-cpu counters to mark the presence of atomic hotplug readers.
A reader simply increments its per-cpu counter and continues, without
waiting for anything. No locks are used in this path. Together, these
properties satisfy all three requirements mentioned above.

The hotplug writer reads the per-cpu counters of all CPUs in order to
ensure that all existing atomic hotplug readers have completed. Only after
that does it proceed to actually take the CPU offline.
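
To make the scheme concrete, here is a hypothetical userspace analogue in
C11 atomics. None of this is the kernel implementation: the array of
counters stands in for per-cpu variables, the explicit "cpu" argument and
the returned mask are simplifications, and seq_cst operations stand in for
the smp_mb() pairing used in the patch.

```c
#include <stdatomic.h>

#define NR_CPUS 4

/* One reader refcount per CPU (per-cpu variables in the kernel). */
static atomic_int reader_refcount[NR_CPUS];

/* Bitmask of "stably online" CPUs; all CPUs online initially. */
static atomic_uint stable_online_mask = (1u << NR_CPUS) - 1;

/* Reader side: mark our presence and continue without waiting.
 * The seq_cst ordering plays the role of the patch's smp_mb(),
 * preventing the read of the stable mask from being reordered
 * before the refcount increment. */
static unsigned int get_online_cpus_stable_atomic(int cpu)
{
	atomic_fetch_add(&reader_refcount[cpu], 1);
	return atomic_load(&stable_online_mask);
}

static void put_online_cpus_stable_atomic(int cpu)
{
	atomic_fetch_sub(&reader_refcount[cpu], 1);
}

/* Writer side: remove the CPU from the stable mask *first*, then
 * spin until every pre-existing reader has drained (this is
 * sync_hotplug_readers() in the patch). New readers already see
 * the updated mask, so only the old ones matter. */
static void prepare_cpu_take_down(int cpu)
{
	atomic_fetch_and(&stable_online_mask, ~(1u << cpu));

	for (int i = 0; i < NR_CPUS; i++)
		while (atomic_load(&reader_refcount[i]) != 0)
			;	/* cpu_relax() in the kernel */
}
```

Note the asymmetry that satisfies requirement 1: only the writer ever
spins; readers never wait for anything.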

[ Michael: Designed the synchronization for the IPI case ]
Signed-off-by: Michael Wang <wangyun@linux.vnet.ibm.com>
[ Srivatsa: Generalized it to work for all cases and wrote the changelog ]
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---

 include/linux/cpu.h     |    4 +
 include/linux/cpumask.h |    5 ++
 kernel/cpu.c            |  129 +++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 137 insertions(+), 1 deletion(-)


--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Tejun Heo Dec. 4, 2012, 3:17 p.m. UTC | #1
Hello, Srivatsa.

On Tue, Dec 04, 2012 at 02:23:41PM +0530, Srivatsa S. Bhat wrote:
>  extern const struct cpumask *const cpu_possible_mask;
>  extern const struct cpumask *const cpu_online_mask;
> +extern const struct cpumask *const cpu_online_stable_mask;
>  extern const struct cpumask *const cpu_present_mask;
>  extern const struct cpumask *const cpu_active_mask;

This is a bit nasty.  The distinction between cpu_online_mask and the
stable one is quite subtle and there's no mechanism to verify the
right one is in use.  IIUC, the only time cpu_online_mask and
cpu_online_stable_mask can deviate is during the final stage of CPU
take-down, right?  If so, why not just make cpu_online_mask the stable one
and the few cases where the actual online state matters deal with the
internal version?  Anyone outside cpu hotplug proper should be happy
with the stable version anyway, no?

Thanks.
Srivatsa S. Bhat Dec. 4, 2012, 9:14 p.m. UTC | #2
Hi Tejun,

On 12/04/2012 08:47 PM, Tejun Heo wrote:
> Hello, Srivatsa.
> 
> On Tue, Dec 04, 2012 at 02:23:41PM +0530, Srivatsa S. Bhat wrote:
>>  extern const struct cpumask *const cpu_possible_mask;
>>  extern const struct cpumask *const cpu_online_mask;
>> +extern const struct cpumask *const cpu_online_stable_mask;
>>  extern const struct cpumask *const cpu_present_mask;
>>  extern const struct cpumask *const cpu_active_mask;
> 
> This is a bit nasty.  The distinction between cpu_online_mask and the
> stable one is quite subtle and there's no mechanism to verify the
> right one is in use.  IIUC, the only time cpu_online_mask and
> cpu_online_stable_mask can deviate is during the final stage CPU take
> down, right?

No, actually they deviate in the initial stage itself. We flip the bit
in the stable mask right in the beginning, and then flip the bit in the
online mask slightly later, in __cpu_disable().

...which makes it look stupid to have a separate "stable" mask in the
first place! Hmm...

Thinking in this direction a bit more, I have written a patchset that
doesn't need a separate stable mask, but which works with the existing
cpu_online_mask itself. I'll post it tomorrow after testing and updating
the patch descriptions.

One of the things I'm trying to achieve is to identify 2 types of
hotplug readers: 

1. Readers who care only about synchronizing with the updates to
cpu_online_mask (light-weight readers)

2. Readers who really want full synchronization with the entire CPU
tear-down sequence.

The reason for doing this, instead of assuming every reader to be of
type 2, is that if we don't make this distinction, we can end up with
the very same latency issues and performance problems that we hit when
using stop_machine(), without even using stop_machine()!

[The readers can be in very hot paths, like interrupt handlers. So if
there is no distinction between light-weight readers and full-readers,
we can potentially slow down the entire machine unnecessarily, effectively
creating the same effect as stop_machine()]

IOW, IMHO, one of the goals of the replacement to stop_machine() should
be that it should not indirectly induce the "stop_machine() effect".

The new patchset that I have written takes care of this requirement
and provides APIs for both types of readers, and also doesn't use
any extra cpu masks. I'll post this patchset tomorrow, after taking a
careful look at it again.

Regards,
Srivatsa S. Bhat

Andrew Morton Dec. 4, 2012, 10:10 p.m. UTC | #3
On Tue, 04 Dec 2012 14:23:41 +0530
"Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com> wrote:

> From: Michael Wang <wangyun@linux.vnet.ibm.com>
> 
> There are places where preempt_disable() is used to prevent any CPU from
> going offline during the critical section. Let us call them as "atomic
> hotplug readers" (atomic because they run in atomic contexts).
> 
> Often, these atomic hotplug readers have a simple need : they want the cpu
> online mask that they work with (inside their critical section), to be
> stable, i.e., it should be guaranteed that CPUs in that mask won't go
> offline during the critical section. An important point here is that they
> don't really care if such a "stable" mask is a subset of the actual
> cpu_online_mask.
> 
> The intent of this patch is to provide such a "stable" cpu online mask
> for that class of atomic hotplug readers.
> 
> Fundamental idea behind the design:
> -----------------------------------
> 
> Simply put, have a separate mask called the stable cpu online mask; and
> at the hotplug writer (cpu_down()), note down the CPU that is about to go
> offline, and remove it from the stable cpu online mask. Then, feel free
> to take that CPU offline, since the atomic hotplug readers won't see it
> from now on. Also, don't start any new cpu_down() operations until all
> existing atomic hotplug readers have completed (because they might still
> be working with the old value of the stable online mask).
> 
> Some important design requirements and considerations:
> -----------------------------------------------------
> 
> 1. The atomic hotplug readers should ideally *never* wait for the hotplug
>    writer (cpu_down()) for *anything*. Because, these atomic hotplug readers
>    can be in very hot-paths like interrupt handling/IPI and hence, if they
>    have to wait for an ongoing cpu_down() to complete, it would pretty much
>    introduce the same performance/latency problems as stop_machine().
> 
> 2. Any synchronization at the atomic hotplug readers side must be highly
>    scalable - avoid global locks/counters etc. Because, these paths currently
>    use the extremely fast preempt_disable(); our replacement to
>    preempt_disable() should not become ridiculously costly.
> 
> 3. preempt_disable() was recursive. The replacement should also be recursive.
> 
> Implementation of the design:
> ----------------------------
> 
> Atomic hotplug reader side:
> 
> We use per-cpu counters to mark the presence of atomic hotplug readers.
> A reader would increment its per-cpu counter and continue, without waiting
> for anything. And no locks are used in this path. Together, these satisfy
> all the 3 requirements mentioned above.
> 
> The hotplug writer uses (reads) the per-cpu counters of all CPUs in order to
> ensure that all existing atomic hotplug readers have completed. Only after
> that, it will proceed to actually take the CPU offline.
> 
> [ Michael: Designed the synchronization for the IPI case ]

Like this:

[wangyun@linux.vnet.ibm.com: designed the synchronization for the IPI case]

> Signed-off-by: Michael Wang <wangyun@linux.vnet.ibm.com>
> [ Srivatsa: Generalized it to work for all cases and wrote the changelog ]
> Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
>
> ...
>
> --- a/include/linux/cpu.h
> +++ b/include/linux/cpu.h
> @@ -175,6 +175,8 @@ extern struct bus_type cpu_subsys;
>  
>  extern void get_online_cpus(void);
>  extern void put_online_cpus(void);
> +extern void get_online_cpus_stable_atomic(void);
> +extern void put_online_cpus_stable_atomic(void);
>  #define hotcpu_notifier(fn, pri)	cpu_notifier(fn, pri)
>  #define register_hotcpu_notifier(nb)	register_cpu_notifier(nb)
>  #define unregister_hotcpu_notifier(nb)	unregister_cpu_notifier(nb)
> @@ -198,6 +200,8 @@ static inline void cpu_hotplug_driver_unlock(void)
>  
>  #define get_online_cpus()	do { } while (0)
>  #define put_online_cpus()	do { } while (0)
> +#define get_online_cpus_stable_atomic()	do { } while (0)
> +#define put_online_cpus_stable_atomic()	do { } while (0)

static inline C functions would be preferred if possible.  Feel free to
fix up the wrong crufty surrounding code as well ;)

>
> ...
>
> @@ -705,12 +707,15 @@ extern const DECLARE_BITMAP(cpu_all_bits, NR_CPUS);
>  
>  #define for_each_possible_cpu(cpu) for_each_cpu((cpu), cpu_possible_mask)
>  #define for_each_online_cpu(cpu)   for_each_cpu((cpu), cpu_online_mask)
> +#define for_each_online_cpu_stable(cpu)					  \
> +				for_each_cpu((cpu), cpu_online_stable_mask)
>  #define for_each_present_cpu(cpu)  for_each_cpu((cpu), cpu_present_mask)
>  
>  /* Wrappers for arch boot code to manipulate normally-constant masks */
>  void set_cpu_possible(unsigned int cpu, bool possible);
>  void set_cpu_present(unsigned int cpu, bool present);
>  void set_cpu_online(unsigned int cpu, bool online);
> +void set_cpu_online_stable(unsigned int cpu, bool online);

The naming is inconsistent.  This is "cpu_online_stable", but
for_each_online_cpu_stable() is "online_cpu_stable".  Can we make
everything the same?

>  void set_cpu_active(unsigned int cpu, bool active);
>  void init_cpu_present(const struct cpumask *src);
>  void init_cpu_possible(const struct cpumask *src);
>
> ...
>
> +void get_online_cpus_stable_atomic(void)
> +{
> +	preempt_disable();
> +	this_cpu_inc(hotplug_reader_refcount);
> +
> +	/*
> +	 * Prevent reordering of writes to hotplug_reader_refcount and
> +	 * reads from cpu_online_stable_mask.
> +	 */
> +	smp_mb();
> +}
> +EXPORT_SYMBOL_GPL(get_online_cpus_stable_atomic);
> +
> +void put_online_cpus_stable_atomic(void)
> +{
> +	/*
> +	 * Prevent reordering of reads from cpu_online_stable_mask and
> +	 * writes to hotplug_reader_refcount.
> +	 */
> +	smp_mb();
> +	this_cpu_dec(hotplug_reader_refcount);
> +	preempt_enable();
> +}
> +EXPORT_SYMBOL_GPL(put_online_cpus_stable_atomic);
> +
>  static struct {
>  	struct task_struct *active_writer;
>  	struct mutex lock; /* Synchronizes accesses to refcount, */
> @@ -237,6 +304,44 @@ static inline void check_for_tasks(int cpu)
>  	write_unlock_irq(&tasklist_lock);
>  }
>  
> +/*
> + * We want all atomic hotplug readers to refer to the new value of
> + * cpu_online_stable_mask. So wait for the existing atomic hotplug readers
> + * to complete. Any new atomic hotplug readers will see the updated mask
> + * and hence pose no problems.
> + */
> +static void sync_hotplug_readers(void)
> +{
> +	unsigned int cpu;
> +
> +	for_each_online_cpu(cpu) {
> +		while (per_cpu(hotplug_reader_refcount, cpu))
> +			cpu_relax();
> +	}
> +}

That all looks solid to me.

> +/*
> + * We are serious about taking this CPU down. So clear the CPU from the
> + * stable online mask.
> + */
> +static void prepare_cpu_take_down(unsigned int cpu)
> +{
> +	set_cpu_online_stable(cpu, false);
> +
> +	/*
> +	 * Prevent reordering of writes to cpu_online_stable_mask and reads
> +	 * from hotplug_reader_refcount.
> +	 */
> +	smp_mb();
> +
> +	/*
> +	 * Wait for all active hotplug readers to complete, to ensure that
> +	 * there are no critical sections still referring to the old stable
> +	 * online mask.
> +	 */
> +	sync_hotplug_readers();
> +}

I wonder about the cpu-online case.  A typical caller might want to do:


/*
 * Set each online CPU's "foo" to "bar"
 */

int global_bar;

void set_cpu_foo(int bar)
{
	get_online_cpus_stable_atomic();
	global_bar = bar;
	for_each_online_cpu_stable()
		cpu->foo = bar;
	put_online_cpus_stable_atomic()
}

void cpu_online_notifier_handler(void)
{
	cpu->foo = global_bar;
}

And I think that set_cpu_foo() would be buggy, because a CPU could come
online before global_bar was altered, and that newly-online CPU would
pick up the old value of `bar'.

So what's the rule here?  global_bar must be written before we run
get_online_cpus_stable_atomic()?

Anyway, please have a think and spell all this out?
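
To make the window in the scenario above concrete, here is a hypothetical
single-threaded model of the interleaving. The race point is made explicit
with a test hook; all names here are illustrative, not kernel APIs.

```c
#define NR_CPUS 4

static int global_bar;
static int foo[NR_CPUS];
static unsigned int online_mask = 0x3;	/* CPUs 0 and 1 online */
static unsigned int stable_mask = 0x3;	/* mask seen by readers */

/* Online notifier: the new CPU copies whatever global_bar holds now. */
static void cpu_online_notifier(int cpu)
{
	foo[cpu] = global_bar;
	online_mask |= 1u << cpu;
	stable_mask |= 1u << cpu;
}

/* Hook standing in for a CPU coming online inside the window. */
static void (*race_window)(void);

static void simulate_cpu2_online(void)
{
	cpu_online_notifier(2);
}

/* set_cpu_foo() from the example: snapshot the stable mask, update
 * global_bar, then write foo on every CPU in the snapshot. A CPU that
 * comes online inside the window is in neither the snapshot nor the
 * new global_bar's history, so it keeps the stale value. */
static void set_cpu_foo(int bar)
{
	unsigned int snap = stable_mask; /* get_online_cpus_stable_atomic() */

	if (race_window)
		race_window();		/* CPU 2 comes online here */

	global_bar = bar;
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		if (snap & (1u << cpu))
			foo[cpu] = bar;
}
```

Running this with the hook installed shows exactly the bug described: the
newly-onlined CPU 2 ends up with the old value of foo.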

>  struct take_cpu_down_param {
>  	unsigned long mod;
>  	void *hcpu;
> @@ -246,7 +351,9 @@ struct take_cpu_down_param {
>  static int __ref take_cpu_down(void *_param)
>  {
>  	struct take_cpu_down_param *param = _param;
> -	int err;
> +	int err, cpu = (long)(param->hcpu);
> +

Like this please:

	int err;
	int cpu = (long)(param->hcpu);

> +	prepare_cpu_take_down(cpu);
>  
>  	/* Ensure this CPU doesn't handle any more interrupts. */
>  	err = __cpu_disable();
>
> ...
>

Michael Wang Dec. 5, 2012, 2:56 a.m. UTC | #4
On 12/05/2012 06:10 AM, Andrew Morton wrote:
> On Tue, 04 Dec 2012 14:23:41 +0530
> "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com> wrote:
> 
>> From: Michael Wang <wangyun@linux.vnet.ibm.com>
>>
>> There are places where preempt_disable() is used to prevent any CPU from
>> going offline during the critical section. Let us call them as "atomic
>> hotplug readers" (atomic because they run in atomic contexts).
>>
>> Often, these atomic hotplug readers have a simple need : they want the cpu
>> online mask that they work with (inside their critical section), to be
>> stable, i.e., it should be guaranteed that CPUs in that mask won't go
>> offline during the critical section. An important point here is that they
>> don't really care if such a "stable" mask is a subset of the actual
>> cpu_online_mask.
>>
>> The intent of this patch is to provide such a "stable" cpu online mask
>> for that class of atomic hotplug readers.
>>
>> Fundamental idea behind the design:
>> -----------------------------------
>>
>> Simply put, have a separate mask called the stable cpu online mask; and
>> at the hotplug writer (cpu_down()), note down the CPU that is about to go
>> offline, and remove it from the stable cpu online mask. Then, feel free
>> to take that CPU offline, since the atomic hotplug readers won't see it
>> from now on. Also, don't start any new cpu_down() operations until all
>> existing atomic hotplug readers have completed (because they might still
>> be working with the old value of the stable online mask).
>>
>> Some important design requirements and considerations:
>> -----------------------------------------------------
>>
>> 1. The atomic hotplug readers should ideally *never* wait for the hotplug
>>    writer (cpu_down()) for *anything*. Because, these atomic hotplug readers
>>    can be in very hot-paths like interrupt handling/IPI and hence, if they
>>    have to wait for an ongoing cpu_down() to complete, it would pretty much
>>    introduce the same performance/latency problems as stop_machine().
>>
>> 2. Any synchronization at the atomic hotplug readers side must be highly
>>    scalable - avoid global locks/counters etc. Because, these paths currently
>>    use the extremely fast preempt_disable(); our replacement to
>>    preempt_disable() should not become ridiculously costly.
>>
>> 3. preempt_disable() was recursive. The replacement should also be recursive.
>>
>> Implementation of the design:
>> ----------------------------
>>
>> Atomic hotplug reader side:
>>
>> We use per-cpu counters to mark the presence of atomic hotplug readers.
>> A reader would increment its per-cpu counter and continue, without waiting
>> for anything. And no locks are used in this path. Together, these satisfy
>> all the 3 requirements mentioned above.
>>
>> The hotplug writer uses (reads) the per-cpu counters of all CPUs in order to
>> ensure that all existing atomic hotplug readers have completed. Only after
>> that, it will proceed to actually take the CPU offline.
>>
>> [ Michael: Designed the synchronization for the IPI case ]
> 
> Like this:
> 
> [wangyun@linux.vnet.ibm.com: designed the synchronization for the IPI case]
> 
>> Signed-off-by: Michael Wang <wangyun@linux.vnet.ibm.com>
>> [ Srivatsa: Generalized it to work for all cases and wrote the changelog ]
>> Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
>>
>> ...
>>
>> --- a/include/linux/cpu.h
>> +++ b/include/linux/cpu.h
>> @@ -175,6 +175,8 @@ extern struct bus_type cpu_subsys;
>>  
>>  extern void get_online_cpus(void);
>>  extern void put_online_cpus(void);
>> +extern void get_online_cpus_stable_atomic(void);
>> +extern void put_online_cpus_stable_atomic(void);
>>  #define hotcpu_notifier(fn, pri)	cpu_notifier(fn, pri)
>>  #define register_hotcpu_notifier(nb)	register_cpu_notifier(nb)
>>  #define unregister_hotcpu_notifier(nb)	unregister_cpu_notifier(nb)
>> @@ -198,6 +200,8 @@ static inline void cpu_hotplug_driver_unlock(void)
>>  
>>  #define get_online_cpus()	do { } while (0)
>>  #define put_online_cpus()	do { } while (0)
>> +#define get_online_cpus_stable_atomic()	do { } while (0)
>> +#define put_online_cpus_stable_atomic()	do { } while (0)
> 
> static inline C functions would be preferred if possible.  Feel free to
> fix up the wrong crufty surrounding code as well ;)
> 
>>
>> ...
>>
>> @@ -705,12 +707,15 @@ extern const DECLARE_BITMAP(cpu_all_bits, NR_CPUS);
>>  
>>  #define for_each_possible_cpu(cpu) for_each_cpu((cpu), cpu_possible_mask)
>>  #define for_each_online_cpu(cpu)   for_each_cpu((cpu), cpu_online_mask)
>> +#define for_each_online_cpu_stable(cpu)					  \
>> +				for_each_cpu((cpu), cpu_online_stable_mask)
>>  #define for_each_present_cpu(cpu)  for_each_cpu((cpu), cpu_present_mask)
>>  
>>  /* Wrappers for arch boot code to manipulate normally-constant masks */
>>  void set_cpu_possible(unsigned int cpu, bool possible);
>>  void set_cpu_present(unsigned int cpu, bool present);
>>  void set_cpu_online(unsigned int cpu, bool online);
>> +void set_cpu_online_stable(unsigned int cpu, bool online);
> 
> The naming is inconsistent.  This is "cpu_online_stable", but
> for_each_online_cpu_stable() is "online_cpu_stable".  Can we make
> everything the same?
> 
>>  void set_cpu_active(unsigned int cpu, bool active);
>>  void init_cpu_present(const struct cpumask *src);
>>  void init_cpu_possible(const struct cpumask *src);
>>
>> ...
>>
>> +void get_online_cpus_stable_atomic(void)
>> +{
>> +	preempt_disable();
>> +	this_cpu_inc(hotplug_reader_refcount);
>> +
>> +	/*
>> +	 * Prevent reordering of writes to hotplug_reader_refcount and
>> +	 * reads from cpu_online_stable_mask.
>> +	 */
>> +	smp_mb();
>> +}
>> +EXPORT_SYMBOL_GPL(get_online_cpus_stable_atomic);
>> +
>> +void put_online_cpus_stable_atomic(void)
>> +{
>> +	/*
>> +	 * Prevent reordering of reads from cpu_online_stable_mask and
>> +	 * writes to hotplug_reader_refcount.
>> +	 */
>> +	smp_mb();
>> +	this_cpu_dec(hotplug_reader_refcount);
>> +	preempt_enable();
>> +}
>> +EXPORT_SYMBOL_GPL(put_online_cpus_stable_atomic);
>> +
>>  static struct {
>>  	struct task_struct *active_writer;
>>  	struct mutex lock; /* Synchronizes accesses to refcount, */
>> @@ -237,6 +304,44 @@ static inline void check_for_tasks(int cpu)
>>  	write_unlock_irq(&tasklist_lock);
>>  }
>>  
>> +/*
>> + * We want all atomic hotplug readers to refer to the new value of
>> + * cpu_online_stable_mask. So wait for the existing atomic hotplug readers
>> + * to complete. Any new atomic hotplug readers will see the updated mask
>> + * and hence pose no problems.
>> + */
>> +static void sync_hotplug_readers(void)
>> +{
>> +	unsigned int cpu;
>> +
>> +	for_each_online_cpu(cpu) {
>> +		while (per_cpu(hotplug_reader_refcount, cpu))
>> +			cpu_relax();
>> +	}
>> +}
> 
> That all looks solid to me.
> 
>> +/*
>> + * We are serious about taking this CPU down. So clear the CPU from the
>> + * stable online mask.
>> + */
>> +static void prepare_cpu_take_down(unsigned int cpu)
>> +{
>> +	set_cpu_online_stable(cpu, false);
>> +
>> +	/*
>> +	 * Prevent reordering of writes to cpu_online_stable_mask and reads
>> +	 * from hotplug_reader_refcount.
>> +	 */
>> +	smp_mb();
>> +
>> +	/*
>> +	 * Wait for all active hotplug readers to complete, to ensure that
>> +	 * there are no critical sections still referring to the old stable
>> +	 * online mask.
>> +	 */
>> +	sync_hotplug_readers();
>> +}
> 
> I wonder about the cpu-online case.  A typical caller might want to do:
> 
> 
> /*
>  * Set each online CPU's "foo" to "bar"
>  */
> 
> int global_bar;
> 
> void set_cpu_foo(int bar)
> {
> 	get_online_cpus_stable_atomic();
> 	global_bar = bar;
> 	for_each_online_cpu_stable()
> 		cpu->foo = bar;
> 	put_online_cpus_stable_atomic()
> }
> 
> void_cpu_online_notifier_handler(void)
> {
> 	cpu->foo = global_bar;
> }
> 
> And I think that set_cpu_foo() would be buggy, because a CPU could come
> online before global_bar was altered, and that newly-online CPU would
> pick up the old value of `bar'.
> 
> So what's the rule here?  global_bar must be written before we run
> get_online_cpus_stable_atomic()?
> 
> Anyway, please have a think and spell all this out?

That's right. Actually, this relates to one question: should hotplug be
allowed to happen between get_online() and put_online()?

According to the old API, which uses a mutex, the answer is no: hotplug
won't happen inside the critical section. But the cost is that
get_online() will block, which kills the performance.

So we designed this mechanism as an acceleration. But as you pointed
out, although the result will never be wrong, the 'stable' mask is not
truly stable, since it can change within the critical section.

And we have two solutions.

One, from Srivatsa, uses a read_lock/write_lock pair. It prevents
hotplug from happening, just like the old rule, but the cost is that we
need a global rwlock, which performs badly on NUMA systems; and, no
doubt, get_online() will block for a short time while hotplug is in
progress.

The other is to maintain a per-cpu cached mask. This mask will only be
updated in get_online() and will be used throughout the critical
section, so we get a truly stable mask. But one flaw is that, inside
their critical sections, different CPUs may see different online masks.

We would appreciate some comments on which one should be used in the
next version.
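
For concreteness, a minimal userspace sketch of the second (per-cpu
cached mask) option. The names, the array-indexed "cpu" parameter, and
the plain bitmask are illustrative assumptions, not a proposed kernel API.

```c
#include <stdatomic.h>

#define NR_CPUS 4

/* The live online mask, updated by the hotplug writer. */
static atomic_uint cpu_online_mask_live = (1u << NR_CPUS) - 1;

/* Per-cpu reader state: nesting depth plus the snapshot taken at
 * the outermost get_online(). */
struct reader_state {
	int depth;
	unsigned int cached_mask;
};
static struct reader_state reader[NR_CPUS];

/* Only the outermost get_online() refreshes the snapshot, so the
 * mask seen inside a critical section truly never changes, and the
 * API stays recursive like preempt_disable(). */
static unsigned int get_online_cached(int cpu)
{
	if (reader[cpu].depth++ == 0)
		reader[cpu].cached_mask =
			atomic_load(&cpu_online_mask_live);
	return reader[cpu].cached_mask;
}

static void put_online_cached(int cpu)
{
	reader[cpu].depth--;
}
```

The flaw mentioned above is visible here: two CPUs that entered their
critical sections at different times can hold different snapshots at the
same moment.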

Regards,
Michael Wang

> 
>>  struct take_cpu_down_param {
>>  	unsigned long mod;
>>  	void *hcpu;
>> @@ -246,7 +351,9 @@ struct take_cpu_down_param {
>>  static int __ref take_cpu_down(void *_param)
>>  {
>>  	struct take_cpu_down_param *param = _param;
>> -	int err;
>> +	int err, cpu = (long)(param->hcpu);
>> +
> 
> Like this please:
> 
> 	int err;
> 	int cpu = (long)(param->hcpu);
> 
>> +	prepare_cpu_take_down(cpu);
>>  
>>  	/* Ensure this CPU doesn't handle any more interrupts. */
>>  	err = __cpu_disable();
>>
>> ...
>>
> 
> 

Michael Wang Dec. 5, 2012, 3:28 a.m. UTC | #5
On 12/05/2012 10:56 AM, Michael Wang wrote:
[...]
>>
>> I wonder about the cpu-online case.  A typical caller might want to do:
>>
>>
>> /*
>>  * Set each online CPU's "foo" to "bar"
>>  */
>>
>> int global_bar;
>>
>> void set_cpu_foo(int bar)
>> {
>> 	get_online_cpus_stable_atomic();
>> 	global_bar = bar;
>> 	for_each_online_cpu_stable()
>> 		cpu->foo = bar;
>> 	put_online_cpus_stable_atomic()
>> }
>>
>> void_cpu_online_notifier_handler(void)
>> {
>> 	cpu->foo = global_bar;
>> }

Oh, forgive me for misunderstanding your question :(

In this case, we have to prevent hotplug from happening, not just
ensure that the online mask is correct.

Hmm..., we need to give this more thought.

Regards,
Michael Wang

>>
>> And I think that set_cpu_foo() would be buggy, because a CPU could come
>> online before global_bar was altered, and that newly-online CPU would
>> pick up the old value of `bar'.
>>
>> So what's the rule here?  global_bar must be written before we run
>> get_online_cpus_stable_atomic()?
>>
>> Anyway, please have a think and spell all this out?
> 
> That's right, actually this related to one question, should the hotplug
> happen during get_online and put_online?
> 
> Answer will be YES according to old API which using mutex, the hotplug
> won't happen in critical section, but the cost is get_online() will
> block, which will kill the performance.
> 
> So we designed this mechanism to do acceleration, but as you pointed
> out, although the result will never be wrong, but the 'stable' mask is
> not stable since it could be changed in critical section.
> 
> And we have two solution.
> 
> One is from Srivatsa, using 'read_lock' and 'write_lock', it will
> prevent hotplug happen just like the old rule, the cost is we need a
> global 'rw_lock' which perform bad on NUMA system, and no doubt,
> get_online() will block for short time when doing hotplug.
> 
> Another is to maintain a per-cpu cache mask, this mask will only be
> updated in get_online(), and be used in critical section, then we will
> get a real stable mask, but one flaw is, on different cpu in critical
> section, online mask will be different.
> 
> We will be appreciate if we could collect some comments on which one to
> be used in next version.
> 
> Regards,
> Michael Wang
> 
>>
>>>  struct take_cpu_down_param {
>>>  	unsigned long mod;
>>>  	void *hcpu;
>>> @@ -246,7 +351,9 @@ struct take_cpu_down_param {
>>>  static int __ref take_cpu_down(void *_param)
>>>  {
>>>  	struct take_cpu_down_param *param = _param;
>>> -	int err;
>>> +	int err, cpu = (long)(param->hcpu);
>>> +
>>
>> Like this please:
>>
>> 	int err;
>> 	int cpu = (long)(param->hcpu);
>>
>>> +	prepare_cpu_take_down(cpu);
>>>  
>>>  	/* Ensure this CPU doesn't handle any more interrupts. */
>>>  	err = __cpu_disable();
>>>
>>> ...
>>>
>>
>>
> 

Srivatsa S. Bhat Dec. 5, 2012, 12:38 p.m. UTC | #6
On 12/05/2012 03:40 AM, Andrew Morton wrote:
> On Tue, 04 Dec 2012 14:23:41 +0530
> "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com> wrote:
> 
>> From: Michael Wang <wangyun@linux.vnet.ibm.com>
>>
>> There are places where preempt_disable() is used to prevent any CPU from
>> going offline during the critical section. Let us call them as "atomic
>> hotplug readers" (atomic because they run in atomic contexts).
>>
>> Often, these atomic hotplug readers have a simple need : they want the cpu
>> online mask that they work with (inside their critical section), to be
>> stable, i.e., it should be guaranteed that CPUs in that mask won't go
>> offline during the critical section. An important point here is that they
>> don't really care if such a "stable" mask is a subset of the actual
>> cpu_online_mask.
>>
>> The intent of this patch is to provide such a "stable" cpu online mask
>> for that class of atomic hotplug readers.
>>
[...]
>>
>> ...
>>
>> @@ -705,12 +707,15 @@ extern const DECLARE_BITMAP(cpu_all_bits, NR_CPUS);
>>  
>>  #define for_each_possible_cpu(cpu) for_each_cpu((cpu), cpu_possible_mask)
>>  #define for_each_online_cpu(cpu)   for_each_cpu((cpu), cpu_online_mask)
>> +#define for_each_online_cpu_stable(cpu)					  \
>> +				for_each_cpu((cpu), cpu_online_stable_mask)
>>  #define for_each_present_cpu(cpu)  for_each_cpu((cpu), cpu_present_mask)
>>  
>>  /* Wrappers for arch boot code to manipulate normally-constant masks */
>>  void set_cpu_possible(unsigned int cpu, bool possible);
>>  void set_cpu_present(unsigned int cpu, bool present);
>>  void set_cpu_online(unsigned int cpu, bool online);
>> +void set_cpu_online_stable(unsigned int cpu, bool online);
> 
> The naming is inconsistent.  This is "cpu_online_stable", but
> for_each_online_cpu_stable() is "online_cpu_stable".  Can we make
> everything the same?
> 

I've actually gotten rid of the cpu_online_stable_mask in my new version
(which I'll post shortly), based on Tejun's suggestion.

>>  void set_cpu_active(unsigned int cpu, bool active);
>>  void init_cpu_present(const struct cpumask *src);
>>  void init_cpu_possible(const struct cpumask *src);
>>
>> ...
>>
>> +void get_online_cpus_stable_atomic(void)
>> +{
>> +	preempt_disable();
>> +	this_cpu_inc(hotplug_reader_refcount);
>> +
>> +	/*
>> +	 * Prevent reordering of writes to hotplug_reader_refcount and
>> +	 * reads from cpu_online_stable_mask.
>> +	 */
>> +	smp_mb();
>> +}
>> +EXPORT_SYMBOL_GPL(get_online_cpus_stable_atomic);
>> +
>> +void put_online_cpus_stable_atomic(void)
>> +{
>> +	/*
>> +	 * Prevent reordering of reads from cpu_online_stable_mask and
>> +	 * writes to hotplug_reader_refcount.
>> +	 */
>> +	smp_mb();
>> +	this_cpu_dec(hotplug_reader_refcount);
>> +	preempt_enable();
>> +}
>> +EXPORT_SYMBOL_GPL(put_online_cpus_stable_atomic);
>> +
>>  static struct {
>>  	struct task_struct *active_writer;
>>  	struct mutex lock; /* Synchronizes accesses to refcount, */
>> @@ -237,6 +304,44 @@ static inline void check_for_tasks(int cpu)
>>  	write_unlock_irq(&tasklist_lock);
>>  }
>>  
>> +/*
>> + * We want all atomic hotplug readers to refer to the new value of
>> + * cpu_online_stable_mask. So wait for the existing atomic hotplug readers
>> + * to complete. Any new atomic hotplug readers will see the updated mask
>> + * and hence pose no problems.
>> + */
>> +static void sync_hotplug_readers(void)
>> +{
>> +	unsigned int cpu;
>> +
>> +	for_each_online_cpu(cpu) {
>> +		while (per_cpu(hotplug_reader_refcount, cpu))
>> +			cpu_relax();
>> +	}
>> +}
> 
> That all looks solid to me.

Actually it isn't fully correct. For example, consider a reader such as this:

get_online_cpus_stable_atomic();

for_each_online_cpu_stable(cpu)
	do_operation_X();

for_each_online_cpu_stable(cpu)
	undo_operation_X();

put_online_cpus_stable_atomic();

Here, the stable mask is supposed to remain *unchanged* throughout the
entire duration of get/put_online_cpus_stable_atomic(). However, since the
hotplug writer updates the stable online mask without waiting for anything,
the reader can see the stable mask change in the *middle* of his critical
section! So he can end up doing operation_X() on CPUs 1, 2 and 3 but undoing
the operation on only CPUs 1 and 2, for example, because CPU 3 was removed
from the stable mask in the meantime by the hotplug writer!

IOW, it actually breaks the fundamental guarantee that we set out to provide
in the first place! Of course, the use case I gave above is hypothetical, but
it _is_ valid and important nevertheless, IMHO.

Anyway, the new version (which gets rid of the extra cpumask) won't run
into such issues. I actually have a version of the "extra cpumask" patchset
that fixes this particular issue using rwlocks (like Michael mentioned), but
I won't post it, because IMHO providing the necessary synchronization without
using any such extra cpumasks at all is a much superior approach.

> 
>> +/*
>> + * We are serious about taking this CPU down. So clear the CPU from the
>> + * stable online mask.
>> + */
>> +static void prepare_cpu_take_down(unsigned int cpu)
>> +{
>> +	set_cpu_online_stable(cpu, false);
>> +
>> +	/*
>> +	 * Prevent reordering of writes to cpu_online_stable_mask and reads
>> +	 * from hotplug_reader_refcount.
>> +	 */
>> +	smp_mb();
>> +
>> +	/*
>> +	 * Wait for all active hotplug readers to complete, to ensure that
>> +	 * there are no critical sections still referring to the old stable
>> +	 * online mask.
>> +	 */
>> +	sync_hotplug_readers();
>> +}
> 
> I wonder about the cpu-online case.  A typical caller might want to do:
> 
> 
> /*
>  * Set each online CPU's "foo" to "bar"
>  */
> 
> int global_bar;
> 
> void set_cpu_foo(int bar)
> {
> 	get_online_cpus_stable_atomic();
> 	global_bar = bar;
> 	for_each_online_cpu_stable()
> 		cpu->foo = bar;
> 	put_online_cpus_stable_atomic()
> }
> 
> void_cpu_online_notifier_handler(void)
> {
> 	cpu->foo = global_bar;
> }
> 
> And I think that set_cpu_foo() would be buggy, because a CPU could come
> online before global_bar was altered, and that newly-online CPU would
> pick up the old value of `bar'.
> 
> So what's the rule here?  global_bar must be written before we run
> get_online_cpus_stable_atomic()?
> 
> Anyway, please have a think and spell all this out?
> 

Can I please skip this issue of CPUs coming online for now?

preempt_disable() along with stop_machine() never prevented CPUs from coming
online; it only prevented CPUs from going offline. The drop-in replacement
for stop_machine() should provide that same guarantee, at minimum. Later we
can either extend it to also prevent CPU online, or come up with suitable
rules/conventions to deal with that case.

In summary, I would like to have a solid replacement for stop_machine() as
the first goal. I hope that sounds reasonable...?

Regards,
Srivatsa S. Bhat

--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Patch

diff --git a/include/linux/cpu.h b/include/linux/cpu.h
index ce7a074..c64b6ed 100644
--- a/include/linux/cpu.h
+++ b/include/linux/cpu.h
@@ -175,6 +175,8 @@  extern struct bus_type cpu_subsys;
 
 extern void get_online_cpus(void);
 extern void put_online_cpus(void);
+extern void get_online_cpus_stable_atomic(void);
+extern void put_online_cpus_stable_atomic(void);
 #define hotcpu_notifier(fn, pri)	cpu_notifier(fn, pri)
 #define register_hotcpu_notifier(nb)	register_cpu_notifier(nb)
 #define unregister_hotcpu_notifier(nb)	unregister_cpu_notifier(nb)
@@ -198,6 +200,8 @@  static inline void cpu_hotplug_driver_unlock(void)
 
 #define get_online_cpus()	do { } while (0)
 #define put_online_cpus()	do { } while (0)
+#define get_online_cpus_stable_atomic()	do { } while (0)
+#define put_online_cpus_stable_atomic()	do { } while (0)
 #define hotcpu_notifier(fn, pri)	do { (void)(fn); } while (0)
 /* These aren't inline functions due to a GCC bug. */
 #define register_hotcpu_notifier(nb)	({ (void)(nb); 0; })
diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
index 0325602..2148e60 100644
--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -78,6 +78,7 @@  extern int nr_cpu_ids;
 
 extern const struct cpumask *const cpu_possible_mask;
 extern const struct cpumask *const cpu_online_mask;
+extern const struct cpumask *const cpu_online_stable_mask;
 extern const struct cpumask *const cpu_present_mask;
 extern const struct cpumask *const cpu_active_mask;
 
@@ -87,6 +88,7 @@  extern const struct cpumask *const cpu_active_mask;
 #define num_present_cpus()	cpumask_weight(cpu_present_mask)
 #define num_active_cpus()	cpumask_weight(cpu_active_mask)
 #define cpu_online(cpu)		cpumask_test_cpu((cpu), cpu_online_mask)
+#define cpu_online_stable(cpu)	cpumask_test_cpu((cpu), cpu_online_stable_mask)
 #define cpu_possible(cpu)	cpumask_test_cpu((cpu), cpu_possible_mask)
 #define cpu_present(cpu)	cpumask_test_cpu((cpu), cpu_present_mask)
 #define cpu_active(cpu)		cpumask_test_cpu((cpu), cpu_active_mask)
@@ -705,12 +707,15 @@  extern const DECLARE_BITMAP(cpu_all_bits, NR_CPUS);
 
 #define for_each_possible_cpu(cpu) for_each_cpu((cpu), cpu_possible_mask)
 #define for_each_online_cpu(cpu)   for_each_cpu((cpu), cpu_online_mask)
+#define for_each_online_cpu_stable(cpu)					  \
+				for_each_cpu((cpu), cpu_online_stable_mask)
 #define for_each_present_cpu(cpu)  for_each_cpu((cpu), cpu_present_mask)
 
 /* Wrappers for arch boot code to manipulate normally-constant masks */
 void set_cpu_possible(unsigned int cpu, bool possible);
 void set_cpu_present(unsigned int cpu, bool present);
 void set_cpu_online(unsigned int cpu, bool online);
+void set_cpu_online_stable(unsigned int cpu, bool online);
 void set_cpu_active(unsigned int cpu, bool active);
 void init_cpu_present(const struct cpumask *src);
 void init_cpu_possible(const struct cpumask *src);
diff --git a/kernel/cpu.c b/kernel/cpu.c
index 42bd331..aaf2393 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -49,6 +49,73 @@  static int cpu_hotplug_disabled;
 
 #ifdef CONFIG_HOTPLUG_CPU
 
+/*
+ * Per-cpu counter to mark the presence of active atomic hotplug readers
+ * (those that run in atomic context and want to prevent CPUs from going
+ * offline).
+ */
+static DEFINE_PER_CPU(int, hotplug_reader_refcount);
+
+/*
+ * Hotplug readers (those that want to prevent CPUs from going offline)
+ * sometimes run from atomic contexts, and hence can't use
+ * get/put_online_cpus() because they can sleep. And often-times, all
+ * they really want is a stable (unchanging) online mask to work with, which
+ * could be a subset of the actual cpu_online_mask, but with a guarantee
+ * that all the CPUs in that stable mask stay online throughout the
+ * hotplug-read-side critical section.
+ *
+ * In such cases, these atomic hotplug readers can use the pair
+ * get/put_online_cpus_stable_atomic() around their critical section to
+ * ensure that the stable mask 'cpu_online_stable_mask' remains unaltered
+ * throughout that critical section. And of course, they should only use
+ * the stable mask in their critical section, and not the actual
+ * cpu_online_mask!
+ *
+ * Eg:
+ *
+ * Atomic hotplug read-side critical section:
+ * -----------------------------------------
+ *
+ * get_online_cpus_stable_atomic();
+ *
+ * for_each_online_cpu_stable(cpu) {
+ *         ... Do something...
+ * }
+ *    ...
+ *
+ * if (cpu_online_stable(other_cpu))
+ *         do_something();
+ *
+ * put_online_cpus_stable_atomic();
+ *
+ * You can call this function recursively.
+ */
+void get_online_cpus_stable_atomic(void)
+{
+	preempt_disable();
+	this_cpu_inc(hotplug_reader_refcount);
+
+	/*
+	 * Prevent reordering of writes to hotplug_reader_refcount and
+	 * reads from cpu_online_stable_mask.
+	 */
+	smp_mb();
+}
+EXPORT_SYMBOL_GPL(get_online_cpus_stable_atomic);
+
+void put_online_cpus_stable_atomic(void)
+{
+	/*
+	 * Prevent reordering of reads from cpu_online_stable_mask and
+	 * writes to hotplug_reader_refcount.
+	 */
+	smp_mb();
+	this_cpu_dec(hotplug_reader_refcount);
+	preempt_enable();
+}
+EXPORT_SYMBOL_GPL(put_online_cpus_stable_atomic);
+
 static struct {
 	struct task_struct *active_writer;
 	struct mutex lock; /* Synchronizes accesses to refcount, */
@@ -237,6 +304,44 @@  static inline void check_for_tasks(int cpu)
 	write_unlock_irq(&tasklist_lock);
 }
 
+/*
+ * We want all atomic hotplug readers to refer to the new value of
+ * cpu_online_stable_mask. So wait for the existing atomic hotplug readers
+ * to complete. Any new atomic hotplug readers will see the updated mask
+ * and hence pose no problems.
+ */
+static void sync_hotplug_readers(void)
+{
+	unsigned int cpu;
+
+	for_each_online_cpu(cpu) {
+		while (per_cpu(hotplug_reader_refcount, cpu))
+			cpu_relax();
+	}
+}
+
+/*
+ * We are serious about taking this CPU down. So clear the CPU from the
+ * stable online mask.
+ */
+static void prepare_cpu_take_down(unsigned int cpu)
+{
+	set_cpu_online_stable(cpu, false);
+
+	/*
+	 * Prevent reordering of writes to cpu_online_stable_mask and reads
+	 * from hotplug_reader_refcount.
+	 */
+	smp_mb();
+
+	/*
+	 * Wait for all active hotplug readers to complete, to ensure that
+	 * there are no critical sections still referring to the old stable
+	 * online mask.
+	 */
+	sync_hotplug_readers();
+}
+
 struct take_cpu_down_param {
 	unsigned long mod;
 	void *hcpu;
@@ -246,7 +351,9 @@  struct take_cpu_down_param {
 static int __ref take_cpu_down(void *_param)
 {
 	struct take_cpu_down_param *param = _param;
-	int err;
+	int err, cpu = (long)(param->hcpu);
+
+	prepare_cpu_take_down(cpu);
 
 	/* Ensure this CPU doesn't handle any more interrupts. */
 	err = __cpu_disable();
@@ -670,6 +777,11 @@  static DECLARE_BITMAP(cpu_online_bits, CONFIG_NR_CPUS) __read_mostly;
 const struct cpumask *const cpu_online_mask = to_cpumask(cpu_online_bits);
 EXPORT_SYMBOL(cpu_online_mask);
 
+static DECLARE_BITMAP(cpu_online_stable_bits, CONFIG_NR_CPUS) __read_mostly;
+const struct cpumask *const cpu_online_stable_mask =
+					to_cpumask(cpu_online_stable_bits);
+EXPORT_SYMBOL(cpu_online_stable_mask);
+
 static DECLARE_BITMAP(cpu_present_bits, CONFIG_NR_CPUS) __read_mostly;
 const struct cpumask *const cpu_present_mask = to_cpumask(cpu_present_bits);
 EXPORT_SYMBOL(cpu_present_mask);
@@ -694,12 +806,26 @@  void set_cpu_present(unsigned int cpu, bool present)
 		cpumask_clear_cpu(cpu, to_cpumask(cpu_present_bits));
 }
 
+void set_cpu_online_stable(unsigned int cpu, bool online)
+{
+	if (online)
+		cpumask_set_cpu(cpu, to_cpumask(cpu_online_stable_bits));
+	else
+		cpumask_clear_cpu(cpu, to_cpumask(cpu_online_stable_bits));
+}
+
 void set_cpu_online(unsigned int cpu, bool online)
 {
 	if (online)
 		cpumask_set_cpu(cpu, to_cpumask(cpu_online_bits));
 	else
 		cpumask_clear_cpu(cpu, to_cpumask(cpu_online_bits));
+
+	/*
+	 * Any changes to the online mask must also be propagated to the
+	 * stable online mask.
+	 */
+	set_cpu_online_stable(cpu, online);
 }
 
 void set_cpu_active(unsigned int cpu, bool active)
@@ -723,4 +849,5 @@  void init_cpu_possible(const struct cpumask *src)
 void init_cpu_online(const struct cpumask *src)
 {
 	cpumask_copy(to_cpumask(cpu_online_bits), src);
+	cpumask_copy(to_cpumask(cpu_online_stable_bits), src);
 }