
[RFC] kernel/cpu: Use lockref for online CPU reference counting

Message ID 20160215170618.GL6375@twins.programming.kicks-ass.net (mailing list archive)
State New, archived

Commit Message

Peter Zijlstra Feb. 15, 2016, 5:06 p.m. UTC
On Mon, Feb 15, 2016 at 03:17:55PM +0100, Peter Zijlstra wrote:
> On Mon, Feb 15, 2016 at 02:36:43PM +0200, Joonas Lahtinen wrote:
> > Instead of implementing custom locked reference counting, use lockref.
> > 
> > The current implementation leads to a deadlock splat on Intel SKL
> > platforms when lockdep debugging is enabled.
> > 
> > This is due to a few CPUfreq drivers (including Intel P-state) doing this:
> > policy->rwsem is locked during driver initialization, and the functions
> > called during init that actually apply CPU limits use get_online_cpus
> > (because they have other calling paths too), which briefly locks
> > cpu_hotplug.lock to increase cpu_hotplug.refcount.
> > 
> > On a later calling path, during suspend, cpu_hotplug_begin() is called in
> > disable_nonboot_cpus() and CPUfreq callbacks get called afterwards; these
> > lock policy->rwsem while cpu_hotplug.lock is already held by
> > cpu_hotplug_begin(), so we have a potential deadlock scenario reported by
> > our CI system (though it is a very unlikely one). See the Bugzilla link
> > for more details.
> 
> I've been meaning to change the thing into a percpu-rwsem, I just
> haven't had time to look into the lockdep splat that generated.


The below has plenty of lockdep issues because percpu-rwsem is
reader-writer fair (like the regular rwsem), so it does throw up a fair
number of very icky issues.

If at all possible, I'd really rather fix those and have a 'saner'
hotplug lock, rather than muddle on with the open-coded horror lock we
have now.
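
As a minimal sketch of the two orderings described above (the function
names here are illustrative stand-ins, not the actual kernel call
chains), the inversion looks like this:

#include <linux/cpu.h>
#include <linux/cpufreq.h>

/* Path A: cpufreq driver initialization. */
static void path_a_driver_init(struct cpufreq_policy *policy)
{
	down_write(&policy->rwsem);	/* 1st: policy->rwsem */
	get_online_cpus();		/* 2nd: briefly takes cpu_hotplug.lock */
	/* ... apply CPU limits ... */
	put_online_cpus();
	up_write(&policy->rwsem);
}

/* Path B: suspend, via disable_nonboot_cpus(). */
static void path_b_suspend(struct cpufreq_policy *policy)
{
	cpu_hotplug_begin();		/* 1st: cpu_hotplug.lock */
	/* CPU_DOWN callbacks end up in cpufreq, which then takes: */
	down_write(&policy->rwsem);	/* 2nd: policy->rwsem */
	/* ... */
	up_write(&policy->rwsem);
	cpu_hotplug_done();
}

The two paths take the same pair of locks in opposite order, the classic
AB-BA pattern that lockdep reports as a potential deadlock.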

Comments

Joonas Lahtinen Feb. 16, 2016, 8:49 a.m. UTC | #1
Hi,

On ma, 2016-02-15 at 18:06 +0100, Peter Zijlstra wrote:
> On Mon, Feb 15, 2016 at 03:17:55PM +0100, Peter Zijlstra wrote:
> > On Mon, Feb 15, 2016 at 02:36:43PM +0200, Joonas Lahtinen wrote:
> > > Instead of implementing custom locked reference counting, use lockref.
> > > 
> > > The current implementation leads to a deadlock splat on Intel SKL
> > > platforms when lockdep debugging is enabled.
> > > 
> > > This is due to a few CPUfreq drivers (including Intel P-state) doing this:
> > > policy->rwsem is locked during driver initialization, and the functions
> > > called during init that actually apply CPU limits use get_online_cpus
> > > (because they have other calling paths too), which briefly locks
> > > cpu_hotplug.lock to increase cpu_hotplug.refcount.
> > > 
> > > On a later calling path, during suspend, cpu_hotplug_begin() is called in
> > > disable_nonboot_cpus() and CPUfreq callbacks get called afterwards; these
> > > lock policy->rwsem while cpu_hotplug.lock is already held by
> > > cpu_hotplug_begin(), so we have a potential deadlock scenario reported by
> > > our CI system (though it is a very unlikely one). See the Bugzilla link
> > > for more details.
> > 
> > I've been meaning to change the thing into a percpu-rwsem, I just
> > haven't had time to look into the lockdep splat that generated.
> 
> 
> The below has plenty of lockdep issues because percpu-rwsem is
> reader-writer fair (like the regular rwsem), so it does throw up a fair
> number of very icky issues.
> 

I originally thought of implementing this more along the lines of what
you suggest, but then I came across a discussion on the mailing list
where adding more members to task_struct was NAKed;

http://comments.gmane.org/gmane.linux.kernel/970273

Adding proper recursion (the way my initial implementation was going)
got ugly without modifying task_struct, because get_online_cpus() is a
speed-critical code path.

So I'm all for fixing the current code in a different way if that will
then be merged.

Regards, Joonas

> If at all possible, I'd really rather fix those and have a 'saner'
> hotplug lock, rather than muddle on with the open-coded horror lock we
> have now.
> 
> 

<SNIP>
Peter Zijlstra Feb. 16, 2016, 9:14 a.m. UTC | #2
On Tue, Feb 16, 2016 at 10:49:36AM +0200, Joonas Lahtinen wrote:
> I originally thought of implementing this more along the lines of what
> you suggest, but then I came across a discussion on the mailing list
> where adding more members to task_struct was NAKed;
> 
> http://comments.gmane.org/gmane.linux.kernel/970273
> 
> Adding proper recursion (the way my initial implementation was going)
> got ugly without modifying task_struct, because get_online_cpus() is a
> speed-critical code path.

Yeah, just don't let Linus hear you say that. get_online_cpus() is _not_
considered performance critical.

> So I'm all for fixing the current code in a different way if that will
> then be merged.

So I'm not sure why you're poking at this horror show to begin with.
ISTR you mentioning a lockdep splat for SKL, but you failed to provide
details.

Making the hotplug lock _more_ special to fix that is just wrong. Fix
the retarded locking that led to it.
Joonas Lahtinen Feb. 16, 2016, 10:51 a.m. UTC | #3
On ti, 2016-02-16 at 10:14 +0100, Peter Zijlstra wrote:
> On Tue, Feb 16, 2016 at 10:49:36AM +0200, Joonas Lahtinen wrote:
> > I originally thought of implementing this more along the lines of what
> > you suggest, but then I came across a discussion on the mailing list
> > where adding more members to task_struct was NAKed;
> > 
> > http://comments.gmane.org/gmane.linux.kernel/970273
> > 
> > Adding proper recursion (the way my initial implementation was going)
> > got ugly without modifying task_struct, because get_online_cpus() is a
> > speed-critical code path.
> 
> Yeah, just don't let Linus hear you say that. get_online_cpus() is _not_
> considered performance critical.

Oh well, at least changes to it added quite noticeably to the bootup
time of a system.

> 
> > So I'm all for fixing the current code in a different way if that will
> > then be merged.
> 
> So I'm not sure why you're poking at this horror show to begin with.
> ISTR you mentioning a lockdep splat for SKL, but you failed to provide
> details.
> 

Quoting my original patch;

"See the Bugzilla link for more details.

Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=93294"

The improvement my patch implements is to use lockref for locked
reference counting (the hotplug code previously rolled its own mutex +
atomic combo), which gets rid of the deadlock scenario described and
linked in the initial patch. Trace for the scenario;

https://bugs.freedesktop.org/attachment.cgi?id=121490

I think using lockref makes it substantially less special, the lockref
code being a lot more battle-tested in the FS code than the previous
cpu_hotplug.lock mess.
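
As a rough sketch of that approach (this is not the actual patch; the
names cpu_hotplug_ref and cpu_hotplug_wq are made up for illustration),
the reader side could look roughly like this:

#include <linux/lockref.h>
#include <linux/spinlock.h>
#include <linux/wait.h>

/* Illustrative sketch only; count == 0 means "no readers". */
static struct lockref cpu_hotplug_ref = {
	.lock = __SPIN_LOCK_UNLOCKED(cpu_hotplug_ref.lock),
};
static DECLARE_WAIT_QUEUE_HEAD(cpu_hotplug_wq);

void get_online_cpus(void)
{
	/* Fast path is a cmpxchg on the combined lock + count word. */
	lockref_get(&cpu_hotplug_ref);
}

void put_online_cpus(void)
{
	/* Lockless decrement while other readers remain... */
	if (lockref_put_or_lock(&cpu_hotplug_ref))
		return;
	/* ...otherwise we now hold the spinlock and the count is still 1. */
	cpu_hotplug_ref.count--;
	spin_unlock(&cpu_hotplug_ref.lock);
	wake_up(&cpu_hotplug_wq);	/* last reader: let a waiting writer in */
}

cpu_hotplug_begin() would then take cpu_hotplug_ref.lock and sleep on
cpu_hotplug_wq until the count drains to zero, which is what the old
mutex + atomic combination was hand-rolling.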

> Making the hotplug lock _more_ special to fix that is just wrong. Fix
> the retarded locking that led to it.
> 

I do agree that it's still not pretty, but now it does correctly what
the previous code was trying to do with a custom mutex + atomic.

I'm all for fixing the code further, but prior to proceeding there
needs to be some sort of agreement on either making get_online_cpus()
slower (which does not seem like a good idea) or adding more members to
task_struct.

Regards, Joonas

>
Peter Zijlstra Feb. 16, 2016, 11:07 a.m. UTC | #4
On Tue, Feb 16, 2016 at 12:51:03PM +0200, Joonas Lahtinen wrote:
> Quoting my original patch;
> 
> "See the Bugzilla link for more details.

If it's not in the Changelog, it doesn't exist. Patches should be
self-contained and not refer to external sources for critical information.
Joonas Lahtinen Feb. 17, 2016, 12:47 p.m. UTC | #5
On ti, 2016-02-16 at 12:07 +0100, Peter Zijlstra wrote:
> On Tue, Feb 16, 2016 at 12:51:03PM +0200, Joonas Lahtinen wrote:
> > Quoting my original patch;
> > 
> > "See the Bugzilla link for more details.
> 
> If it's not in the Changelog, it doesn't exist. Patches should be
> self-contained and not refer to external sources for critical information.

The exact locking case in CPUfreq drivers causing a splat is described
in the patch. Details were already included; that's why the term "more
details" was used.

This is not exactly taking us closer to a fix, so I'd rather discuss
the possibility of adding one more member to task_struct. I added Oleg
as CC to catch his attention so he can re-evaluate, based on the
previous discussion in this thread.

Regards, Joonas
Peter Zijlstra Feb. 17, 2016, 2:20 p.m. UTC | #6
On Wed, Feb 17, 2016 at 02:47:31PM +0200, Joonas Lahtinen wrote:
> On ti, 2016-02-16 at 12:07 +0100, Peter Zijlstra wrote:
> > On Tue, Feb 16, 2016 at 12:51:03PM +0200, Joonas Lahtinen wrote:
> > > Quoting my original patch;
> > > 
> > > "See the Bugzilla link for more details.
> > 
> > If it's not in the Changelog, it doesn't exist. Patches should be
> > self-contained and not refer to external sources for critical information.
> 
> The exact locking case in CPUfreq drivers causing a splat is described
> in the patch. Details were already included; that's why the term "more
> details" was used.

Barely. What was not described was why you went to tinker with the
hotplug lock instead of sanitizing cpufreq. Nor why your chosen solution
is good.

> This is not exactly taking us closer to a fix, 

Why you think we can discuss fixes if you've not actually described your
problem is beyond me.
Daniel Vetter Feb. 17, 2016, 4:13 p.m. UTC | #7
On Wed, Feb 17, 2016 at 03:20:05PM +0100, Peter Zijlstra wrote:
> On Wed, Feb 17, 2016 at 02:47:31PM +0200, Joonas Lahtinen wrote:
> > On ti, 2016-02-16 at 12:07 +0100, Peter Zijlstra wrote:
> > > On Tue, Feb 16, 2016 at 12:51:03PM +0200, Joonas Lahtinen wrote:
> > > > Quoting my original patch;
> > > > 
> > > > "See the Bugzilla link for more details.
> > > 
> > > If it's not in the Changelog, it doesn't exist. Patches should be
> > > self-contained and not refer to external sources for critical information.
> > 
> > The exact locking case in CPUfreq drivers causing a splat is described
> > in the patch. Details were already included; that's why the term "more
> > details" was used.
> 
> Barely. What was not described was why you went to tinker with the
> hotplug lock instead of sanitizing cpufreq. Nor why your chosen solution
> is good.
> 
> > This is not exactly taking us closer to a fix, 
> 
> Why you think we can discuss fixes if you've not actually described your
> problem is beyond me.

Can we please stop the petty fights while we figure out a solution
here, thanks.

And for context we're hitting this on CI in a bunch of our machines, which
means no more lockdep checking for us. Which is, at least for me, pretty
serious, and why we're throwing complete cpu-anything newbies at that code
trying to come up with some solution to unblock our CI efforts for the
intel gfx driver. Unfortunately our attempts at just disabling lots of
Kconfig symbols proved futile, so ideas to avoid all that code are
highly welcome.

As soon as CI stops hitting this we'll jump out of your inbox, if you want
so.

Thanks, Daniel
Peter Zijlstra Feb. 17, 2016, 4:14 p.m. UTC | #8
On Wed, Feb 17, 2016 at 05:13:21PM +0100, Daniel Vetter wrote:
> And for context we're hitting this on CI in a bunch of our machines, which

What's CI ?
Daniel Vetter Feb. 17, 2016, 4:33 p.m. UTC | #9
On Wed, Feb 17, 2016 at 05:14:57PM +0100, Peter Zijlstra wrote:
> On Wed, Feb 17, 2016 at 05:13:21PM +0100, Daniel Vetter wrote:
> > And for context we're hitting this on CI in a bunch of our machines, which
> 
> What's CI ?

Continuous integration, aka our own farm of machines dedicated to running
i915.ko testcases. Kinda like 0day (it does pre-merge testing on the m-l
and also post-merge on our own little integration tree), but for just the
graphics team and our needs.
-Daniel
Peter Zijlstra Feb. 17, 2016, 4:37 p.m. UTC | #10
On Wed, Feb 17, 2016 at 05:33:51PM +0100, Daniel Vetter wrote:
> On Wed, Feb 17, 2016 at 05:14:57PM +0100, Peter Zijlstra wrote:
> > On Wed, Feb 17, 2016 at 05:13:21PM +0100, Daniel Vetter wrote:
> > > And for context we're hitting this on CI in a bunch of our machines, which
> > 
> > What's CI ?
> 
> Continuous integration, aka our own farm of machines dedicated to running
> i915.ko testcases. Kinda like 0day (it does pre-merge testing on the m-l
> and also post-merge on our own little integration tree), but for just the
> graphics team and our needs.

So what patch triggered this new issue? Did cpufreq change or what?
Joonas Lahtinen Feb. 18, 2016, 10:39 a.m. UTC | #11
On ke, 2016-02-17 at 17:37 +0100, Peter Zijlstra wrote:
> On Wed, Feb 17, 2016 at 05:33:51PM +0100, Daniel Vetter wrote:
> > On Wed, Feb 17, 2016 at 05:14:57PM +0100, Peter Zijlstra wrote:
> > > On Wed, Feb 17, 2016 at 05:13:21PM +0100, Daniel Vetter wrote:
> > > > And for context we're hitting this on CI in a bunch of our machines, which
> > > 
> > > What's CI ?
> > 
> > Continuous integration, aka our own farm of machines dedicated to running
> > i915.ko testcases. Kinda like 0day (it does pre-merge testing on the m-l
> > and also post-merge on our own little integration tree), but for just the
> > graphics team and our needs.
> 
> So what patch triggered this new issue? Did cpufreq change or what?

It appeared right after enabling lockdep debugging on the continuous
integration system, so we do not have a history of it not being there.

Taking another look at my code, it could indeed end up in a double-
wait-looping scenario if suspend and initialization were performed
simultaneously (it had a couple of other bugs too, fixed in v2). The
strange thing is, I think that should have been caught by the
cpuhp_lock_* lockdep tracking.

So I'll move the discussion to the linux-pm list to change the CPUfreq
code. Thanks for the comments.

Regards, Joonas
Joonas Lahtinen Feb. 18, 2016, 10:54 a.m. UTC | #12
Hi,

On ma, 2016-02-15 at 18:06 +0100, Peter Zijlstra wrote:
> On Mon, Feb 15, 2016 at 03:17:55PM +0100, Peter Zijlstra wrote:
> > On Mon, Feb 15, 2016 at 02:36:43PM +0200, Joonas Lahtinen wrote:
> > > Instead of implementing custom locked reference counting, use lockref.
> > > 
> > > The current implementation leads to a deadlock splat on Intel SKL
> > > platforms when lockdep debugging is enabled.
> > > 
> > > This is due to a few CPUfreq drivers (including Intel P-state) doing this:
> > > policy->rwsem is locked during driver initialization, and the functions
> > > called during init that actually apply CPU limits use get_online_cpus
> > > (because they have other calling paths too), which briefly locks
> > > cpu_hotplug.lock to increase cpu_hotplug.refcount.
> > > 
> > > On a later calling path, during suspend, cpu_hotplug_begin() is called in
> > > disable_nonboot_cpus() and CPUfreq callbacks get called afterwards; these
> > > lock policy->rwsem while cpu_hotplug.lock is already held by
> > > cpu_hotplug_begin(), so we have a potential deadlock scenario reported by
> > > our CI system (though it is a very unlikely one). See the Bugzilla link
> > > for more details.
> > 
> > I've been meaning to change the thing into a percpu-rwsem, I just
> > haven't had time to look into the lockdep splat that generated.
> 
> 
> The below has plenty of lockdep issues because percpu-rwsem is
> reader-writer fair (like the regular rwsem), so it does throw up a fair
> number of very icky issues.
> 
> If at all possible, I'd really rather fix those and have a 'saner'
> hotplug lock, rather than muddle on with the open-coded horror lock we
> have now.
> 

I do still agree the below would be a worthy change to proceed with.

I've CC'd Oleg here too for a comment.

Regards, Joonas

> 
> --- a/include/linux/cpu.h
> +++ b/include/linux/cpu.h
> @@ -225,6 +225,8 @@ extern struct bus_type cpu_subsys;
>  #ifdef CONFIG_HOTPLUG_CPU
>  /* Stop CPUs going up and down. */
>  
> +extern void cpu_hotplug_init_task(struct task_struct *p);
> +
>  extern void cpu_hotplug_begin(void);
>  extern void cpu_hotplug_done(void);
>  extern void get_online_cpus(void);
> @@ -242,6 +244,8 @@ int cpu_down(unsigned int cpu);
>  
>  #else		/* CONFIG_HOTPLUG_CPU */
>  
> +static inline void cpu_hotplug_init_task(struct task_struct *p) {}
> +
>  static inline void cpu_hotplug_begin(void) {}
>  static inline void cpu_hotplug_done(void) {}
>  #define get_online_cpus()	do { } while (0)
> --- a/include/linux/percpu-rwsem.h
> +++ b/include/linux/percpu-rwsem.h
> @@ -16,6 +16,15 @@ struct percpu_rw_semaphore {
>  	wait_queue_head_t	write_waitq;
>  };
>  
> +#define DEFINE_STATIC_PERCPU_RWSEM(name)				\
> +static DEFINE_PER_CPU(unsigned int, __percpu_rwsem_frc_##name);	\
> +static struct percpu_rw_semaphore name = {				\
> +	.rss = __RCU_SYNC_INITIALIZER(name.rss, RCU_SCHED_SYNC),	\
> +	.fast_read_ctr = &__percpu_rwsem_frc_##name,			\
> +	.rw_sem = __RWSEM_INITIALIZER(name.rw_sem),			\
> +	.write_waitq = __WAIT_QUEUE_HEAD_INITIALIZER(name.write_waitq),	\
> +}
> +
>  extern void percpu_down_read(struct percpu_rw_semaphore *);
>  extern int  percpu_down_read_trylock(struct percpu_rw_semaphore *);
>  extern void percpu_up_read(struct percpu_rw_semaphore *);
> @@ -33,9 +42,11 @@ extern void percpu_free_rwsem(struct per
>  	__percpu_init_rwsem(brw, #brw, &rwsem_key);		\
>  })
>  
> -
>  #define percpu_rwsem_is_held(sem) lockdep_is_held(&(sem)->rw_sem)
>  
> +#define percpu_rwsem_assert_held(sem)                          \
> +	lockdep_assert_held(&(sem)->rw_sem)
> +
>  static inline void percpu_rwsem_release(struct percpu_rw_semaphore *sem,
>  					bool read, unsigned long ip)
>  {
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1403,6 +1403,9 @@ struct task_struct {
>  	struct task_struct *last_wakee;
>  
>  	int wake_cpu;
> +#ifdef CONFIG_HOTPLUG_CPU
> +	int cpuhp_ref;
> +#endif
>  #endif
>  	int on_rq;
>  
> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -23,6 +23,7 @@
>  #include <linux/tick.h>
>  #include <linux/irq.h>
>  #include <trace/events/power.h>
> +#include <linux/percpu-rwsem.h>
>  
>  #include "smpboot.h"
>  
> @@ -51,121 +52,52 @@ EXPORT_SYMBOL(cpu_notifier_register_done
>  
>  static RAW_NOTIFIER_HEAD(cpu_chain);
>  
> -/* If set, cpu_up and cpu_down will return -EBUSY and do nothing.
> +/*
> + * If set, cpu_up and cpu_down will return -EBUSY and do nothing.
>   * Should always be manipulated under cpu_add_remove_lock
>   */
>  static int cpu_hotplug_disabled;
>  
>  #ifdef CONFIG_HOTPLUG_CPU
>  
> -static struct {
> -	struct task_struct *active_writer;
> -	/* wait queue to wake up the active_writer */
> -	wait_queue_head_t wq;
> -	/* verifies that no writer will get active while readers are active */
> -	struct mutex lock;
> -	/*
> -	 * Also blocks the new readers during
> -	 * an ongoing cpu hotplug operation.
> -	 */
> -	atomic_t refcount;
> -
> -#ifdef CONFIG_DEBUG_LOCK_ALLOC
> -	struct lockdep_map dep_map;
> -#endif
> -} cpu_hotplug = {
> -	.active_writer = NULL,
> -	.wq = __WAIT_QUEUE_HEAD_INITIALIZER(cpu_hotplug.wq),
> -	.lock = __MUTEX_INITIALIZER(cpu_hotplug.lock),
> -#ifdef CONFIG_DEBUG_LOCK_ALLOC
> -	.dep_map = {.name = "cpu_hotplug.lock" },
> -#endif
> -};
> -
> -/* Lockdep annotations for get/put_online_cpus() and cpu_hotplug_begin/end() */
> -#define cpuhp_lock_acquire_read() lock_map_acquire_read(&cpu_hotplug.dep_map)
> -#define cpuhp_lock_acquire_tryread() \
> -				  lock_map_acquire_tryread(&cpu_hotplug.dep_map)
> -#define cpuhp_lock_acquire()      lock_map_acquire(&cpu_hotplug.dep_map)
> -#define cpuhp_lock_release()      lock_map_release(&cpu_hotplug.dep_map)
> +DEFINE_STATIC_PERCPU_RWSEM(hotplug);
>  
> +void cpu_hotplug_init_task(struct task_struct *p)
> +{
> +	if (WARN_ON_ONCE(p->cpuhp_ref))
> +		p->cpuhp_ref = 0;
> +}
>  
>  void get_online_cpus(void)
>  {
>  	might_sleep();
> -	if (cpu_hotplug.active_writer == current)
> +
> +	if (current->cpuhp_ref++) /* read recursion */
>  		return;
> -	cpuhp_lock_acquire_read();
> -	mutex_lock(&cpu_hotplug.lock);
> -	atomic_inc(&cpu_hotplug.refcount);
> -	mutex_unlock(&cpu_hotplug.lock);
> +
> +	percpu_down_read(&hotplug);
>  }
>  EXPORT_SYMBOL_GPL(get_online_cpus);
>  
>  void put_online_cpus(void)
>  {
> -	int refcount;
> -
> -	if (cpu_hotplug.active_writer == current)
> +	if (--current->cpuhp_ref)
>  		return;
>  
> -	refcount = atomic_dec_return(&cpu_hotplug.refcount);
> -	if (WARN_ON(refcount < 0)) /* try to fix things up */
> -		atomic_inc(&cpu_hotplug.refcount);
> -
> -	if (refcount <= 0 && waitqueue_active(&cpu_hotplug.wq))
> -		wake_up(&cpu_hotplug.wq);
> -
> -	cpuhp_lock_release();
> -
> +	percpu_up_read(&hotplug);
>  }
>  EXPORT_SYMBOL_GPL(put_online_cpus);
>  
> -/*
> - * This ensures that the hotplug operation can begin only when the
> - * refcount goes to zero.
> - *
> - * Note that during a cpu-hotplug operation, the new readers, if any,
> - * will be blocked by the cpu_hotplug.lock
> - *
> - * Since cpu_hotplug_begin() is always called after invoking
> - * cpu_maps_update_begin(), we can be sure that only one writer is active.
> - *
> - * Note that theoretically, there is a possibility of a livelock:
> - * - Refcount goes to zero, last reader wakes up the sleeping
> - *   writer.
> - * - Last reader unlocks the cpu_hotplug.lock.
> - * - A new reader arrives at this moment, bumps up the refcount.
> - * - The writer acquires the cpu_hotplug.lock finds the refcount
> - *   non zero and goes to sleep again.
> - *
> - * However, this is very difficult to achieve in practice since
> - * get_online_cpus() not an api which is called all that often.
> - *
> - */
>  void cpu_hotplug_begin(void)
>  {
> -	DEFINE_WAIT(wait);
> -
> -	cpu_hotplug.active_writer = current;
> -	cpuhp_lock_acquire();
> -
> -	for (;;) {
> -		mutex_lock(&cpu_hotplug.lock);
> -		prepare_to_wait(&cpu_hotplug.wq, &wait, TASK_UNINTERRUPTIBLE);
> -		if (likely(!atomic_read(&cpu_hotplug.refcount)))
> -				break;
> -		mutex_unlock(&cpu_hotplug.lock);
> -		schedule();
> -	}
> -	finish_wait(&cpu_hotplug.wq, &wait);
> +	percpu_down_write(&hotplug);
> +	current->cpuhp_ref++; /* allow read-in-write recursion */
>  }
>  
>  void cpu_hotplug_done(void)
>  {
> -	cpu_hotplug.active_writer = NULL;
> -	mutex_unlock(&cpu_hotplug.lock);
> -	cpuhp_lock_release();
> +	current->cpuhp_ref--;
> +	percpu_up_write(&hotplug);
>  }
>  
>  /*
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1414,6 +1414,8 @@ static struct task_struct *copy_process(
>  	p->sequential_io_avg	= 0;
>  #endif
>  
> +	cpu_hotplug_init_task(p);
> +
>  	/* Perform scheduler related setup. Assign this task to a CPU. */
>  	retval = sched_fork(clone_flags, p);
>  	if (retval)
> --- a/lib/Kconfig
> +++ b/lib/Kconfig
> @@ -53,6 +53,11 @@ config GENERIC_IO
>  config STMP_DEVICE
>  	bool
>  
> +config PERCPU_RWSEM_HOTPLUG
> +	def_bool y
> +	depends on HOTPLUG_CPU
> +	select PERCPU_RWSEM
> +
>  config ARCH_USE_CMPXCHG_LOCKREF
>  	bool
>

Patch

--- a/include/linux/cpu.h
+++ b/include/linux/cpu.h
@@ -225,6 +225,8 @@  extern struct bus_type cpu_subsys;
 #ifdef CONFIG_HOTPLUG_CPU
 /* Stop CPUs going up and down. */
 
+extern void cpu_hotplug_init_task(struct task_struct *p);
+
 extern void cpu_hotplug_begin(void);
 extern void cpu_hotplug_done(void);
 extern void get_online_cpus(void);
@@ -242,6 +244,8 @@  int cpu_down(unsigned int cpu);
 
 #else		/* CONFIG_HOTPLUG_CPU */
 
+static inline void cpu_hotplug_init_task(struct task_struct *p) {}
+
 static inline void cpu_hotplug_begin(void) {}
 static inline void cpu_hotplug_done(void) {}
 #define get_online_cpus()	do { } while (0)
--- a/include/linux/percpu-rwsem.h
+++ b/include/linux/percpu-rwsem.h
@@ -16,6 +16,15 @@  struct percpu_rw_semaphore {
 	wait_queue_head_t	write_waitq;
 };
 
+#define DEFINE_STATIC_PERCPU_RWSEM(name)				\
+static DEFINE_PER_CPU(unsigned int, __percpu_rwsem_frc_##name);	\
+static struct percpu_rw_semaphore name = {				\
+	.rss = __RCU_SYNC_INITIALIZER(name.rss, RCU_SCHED_SYNC),	\
+	.fast_read_ctr = &__percpu_rwsem_frc_##name,			\
+	.rw_sem = __RWSEM_INITIALIZER(name.rw_sem),			\
+	.write_waitq = __WAIT_QUEUE_HEAD_INITIALIZER(name.write_waitq),	\
+}
+
 extern void percpu_down_read(struct percpu_rw_semaphore *);
 extern int  percpu_down_read_trylock(struct percpu_rw_semaphore *);
 extern void percpu_up_read(struct percpu_rw_semaphore *);
@@ -33,9 +42,11 @@  extern void percpu_free_rwsem(struct per
 	__percpu_init_rwsem(brw, #brw, &rwsem_key);		\
 })
 
-
 #define percpu_rwsem_is_held(sem) lockdep_is_held(&(sem)->rw_sem)
 
+#define percpu_rwsem_assert_held(sem)                          \
+	lockdep_assert_held(&(sem)->rw_sem)
+
 static inline void percpu_rwsem_release(struct percpu_rw_semaphore *sem,
 					bool read, unsigned long ip)
 {
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1403,6 +1403,9 @@  struct task_struct {
 	struct task_struct *last_wakee;
 
 	int wake_cpu;
+#ifdef CONFIG_HOTPLUG_CPU
+	int cpuhp_ref;
+#endif
 #endif
 	int on_rq;
 
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -23,6 +23,7 @@ 
 #include <linux/tick.h>
 #include <linux/irq.h>
 #include <trace/events/power.h>
+#include <linux/percpu-rwsem.h>
 
 #include "smpboot.h"
 
@@ -51,121 +52,52 @@  EXPORT_SYMBOL(cpu_notifier_register_done
 
 static RAW_NOTIFIER_HEAD(cpu_chain);
 
-/* If set, cpu_up and cpu_down will return -EBUSY and do nothing.
+/*
+ * If set, cpu_up and cpu_down will return -EBUSY and do nothing.
  * Should always be manipulated under cpu_add_remove_lock
  */
 static int cpu_hotplug_disabled;
 
 #ifdef CONFIG_HOTPLUG_CPU
 
-static struct {
-	struct task_struct *active_writer;
-	/* wait queue to wake up the active_writer */
-	wait_queue_head_t wq;
-	/* verifies that no writer will get active while readers are active */
-	struct mutex lock;
-	/*
-	 * Also blocks the new readers during
-	 * an ongoing cpu hotplug operation.
-	 */
-	atomic_t refcount;
-
-#ifdef CONFIG_DEBUG_LOCK_ALLOC
-	struct lockdep_map dep_map;
-#endif
-} cpu_hotplug = {
-	.active_writer = NULL,
-	.wq = __WAIT_QUEUE_HEAD_INITIALIZER(cpu_hotplug.wq),
-	.lock = __MUTEX_INITIALIZER(cpu_hotplug.lock),
-#ifdef CONFIG_DEBUG_LOCK_ALLOC
-	.dep_map = {.name = "cpu_hotplug.lock" },
-#endif
-};
-
-/* Lockdep annotations for get/put_online_cpus() and cpu_hotplug_begin/end() */
-#define cpuhp_lock_acquire_read() lock_map_acquire_read(&cpu_hotplug.dep_map)
-#define cpuhp_lock_acquire_tryread() \
-				  lock_map_acquire_tryread(&cpu_hotplug.dep_map)
-#define cpuhp_lock_acquire()      lock_map_acquire(&cpu_hotplug.dep_map)
-#define cpuhp_lock_release()      lock_map_release(&cpu_hotplug.dep_map)
+DEFINE_STATIC_PERCPU_RWSEM(hotplug);
 
+void cpu_hotplug_init_task(struct task_struct *p)
+{
+	if (WARN_ON_ONCE(p->cpuhp_ref))
+		p->cpuhp_ref = 0;
+}
 
 void get_online_cpus(void)
 {
 	might_sleep();
-	if (cpu_hotplug.active_writer == current)
+
+	if (current->cpuhp_ref++) /* read recursion */
 		return;
-	cpuhp_lock_acquire_read();
-	mutex_lock(&cpu_hotplug.lock);
-	atomic_inc(&cpu_hotplug.refcount);
-	mutex_unlock(&cpu_hotplug.lock);
+
+	percpu_down_read(&hotplug);
 }
 EXPORT_SYMBOL_GPL(get_online_cpus);
 
 void put_online_cpus(void)
 {
-	int refcount;
-
-	if (cpu_hotplug.active_writer == current)
+	if (--current->cpuhp_ref)
 		return;
 
-	refcount = atomic_dec_return(&cpu_hotplug.refcount);
-	if (WARN_ON(refcount < 0)) /* try to fix things up */
-		atomic_inc(&cpu_hotplug.refcount);
-
-	if (refcount <= 0 && waitqueue_active(&cpu_hotplug.wq))
-		wake_up(&cpu_hotplug.wq);
-
-	cpuhp_lock_release();
-
+	percpu_up_read(&hotplug);
 }
 EXPORT_SYMBOL_GPL(put_online_cpus);
 
-/*
- * This ensures that the hotplug operation can begin only when the
- * refcount goes to zero.
- *
- * Note that during a cpu-hotplug operation, the new readers, if any,
- * will be blocked by the cpu_hotplug.lock
- *
- * Since cpu_hotplug_begin() is always called after invoking
- * cpu_maps_update_begin(), we can be sure that only one writer is active.
- *
- * Note that theoretically, there is a possibility of a livelock:
- * - Refcount goes to zero, last reader wakes up the sleeping
- *   writer.
- * - Last reader unlocks the cpu_hotplug.lock.
- * - A new reader arrives at this moment, bumps up the refcount.
- * - The writer acquires the cpu_hotplug.lock finds the refcount
- *   non zero and goes to sleep again.
- *
- * However, this is very difficult to achieve in practice since
- * get_online_cpus() not an api which is called all that often.
- *
- */
 void cpu_hotplug_begin(void)
 {
-	DEFINE_WAIT(wait);
-
-	cpu_hotplug.active_writer = current;
-	cpuhp_lock_acquire();
-
-	for (;;) {
-		mutex_lock(&cpu_hotplug.lock);
-		prepare_to_wait(&cpu_hotplug.wq, &wait, TASK_UNINTERRUPTIBLE);
-		if (likely(!atomic_read(&cpu_hotplug.refcount)))
-				break;
-		mutex_unlock(&cpu_hotplug.lock);
-		schedule();
-	}
-	finish_wait(&cpu_hotplug.wq, &wait);
+	percpu_down_write(&hotplug);
+	current->cpuhp_ref++; /* allow read-in-write recursion */
 }
 
 void cpu_hotplug_done(void)
 {
-	cpu_hotplug.active_writer = NULL;
-	mutex_unlock(&cpu_hotplug.lock);
-	cpuhp_lock_release();
+	current->cpuhp_ref--;
+	percpu_up_write(&hotplug);
 }
 
 /*
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1414,6 +1414,8 @@  static struct task_struct *copy_process(
 	p->sequential_io_avg	= 0;
 #endif
 
+	cpu_hotplug_init_task(p);
+
 	/* Perform scheduler related setup. Assign this task to a CPU. */
 	retval = sched_fork(clone_flags, p);
 	if (retval)
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -53,6 +53,11 @@  config GENERIC_IO
 config STMP_DEVICE
 	bool
 
+config PERCPU_RWSEM_HOTPLUG
+	def_bool y
+	depends on HOTPLUG_CPU
+	select PERCPU_RWSEM
+
 config ARCH_USE_CMPXCHG_LOCKREF
 	bool
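
For reference, a couple of hypothetical callers (not part of the patch)
illustrating the read-recursion that the per-task cpuhp_ref counter
above allows:

#include <linux/cpu.h>

static void example_nested_reader(void)
{
	get_online_cpus();	/* cpuhp_ref 0 -> 1: percpu_down_read(&hotplug) */
	get_online_cpus();	/* cpuhp_ref 1 -> 2: counter only, no lock taken */
	put_online_cpus();	/* cpuhp_ref 2 -> 1 */
	put_online_cpus();	/* cpuhp_ref 1 -> 0: percpu_up_read(&hotplug) */
}

static void example_writer(void)
{
	cpu_hotplug_begin();	/* percpu_down_write(&hotplug), then cpuhp_ref -> 1 */
	get_online_cpus();	/* read-in-write recursion: counter only */
	put_online_cpus();
	cpu_hotplug_done();	/* cpuhp_ref -> 0, then percpu_up_write(&hotplug) */
}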