[4.2] commit d59cfc09c32 (sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem) causes regression for libvirt/kvm

Message ID 55F92904.4090206@redhat.com (mailing list archive)
State New, archived

Commit Message

Paolo Bonzini Sept. 16, 2015, 8:32 a.m. UTC
On 15/09/2015 19:38, Paul E. McKenney wrote:
> Excellent points!
> 
> Other options in such situations include the following:
> 
> o	Rework so that the code uses call_rcu*() instead of *_expedited().
> 
> o	Maintain a per-task or per-CPU counter so that every so many
> 	*_expedited() invocations instead uses the non-expedited
> 	counterpart.  (For example, synchronize_rcu instead of
> 	synchronize_rcu_expedited().)
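
A minimal sketch of the second option quoted above, purely for
illustration (the helper name, the per-CPU counter and the every-Nth
threshold are hypothetical, not existing kernel code):

static DEFINE_PER_CPU(unsigned int, sync_sched_count);
#define SYNC_SCHED_EVERY_N	16	/* hypothetical threshold */

static void sync_sched_mostly_expedited(void)
{
	/*
	 * Every Nth invocation falls back to a normal grace period so
	 * that a flood of callers cannot monopolize the expedited path.
	 */
	if (this_cpu_inc_return(sync_sched_count) % SYNC_SCHED_EVERY_N)
		synchronize_sched_expedited();
	else
		synchronize_sched();
}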

Or just use ratelimit (untested):



> Note that synchronize_srcu_expedited() is less troublesome than are the
> other *_expedited() functions, because synchronize_srcu_expedited() does
> not inflict OS jitter on other CPUs.

Yup, synchronize_srcu_expedited() is just a busy wait and it can
complete extremely fast if you use SRCU as a "local RCU" rather
than a "sleepable RCU".  However, it doesn't apply here, since you
want to avoid SRCU's 2 memory barriers per lock/unlock.
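
For comparison, this is roughly what an SRCU-based read side would look
like (hypothetical helper names, just to show where the cost sits): each
srcu_read_lock()/srcu_read_unlock() pair executes two full memory
barriers, one on each side, which is exactly what the percpu_rwsem
fast path avoids.

DEFINE_STATIC_SRCU(threadgroup_srcu);			/* hypothetical */

static inline int threadgroup_change_begin_srcu(void)
{
	return srcu_read_lock(&threadgroup_srcu);	/* implies smp_mb() */
}

static inline void threadgroup_change_end_srcu(int idx)
{
	srcu_read_unlock(&threadgroup_srcu, idx);	/* implies smp_mb() */
}

/*
 * The write side would use synchronize_srcu_expedited(&threadgroup_srcu),
 * which busy-waits on the readers' counters instead of disturbing other
 * CPUs.
 */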

Paolo

Comments

Christian Borntraeger Sept. 16, 2015, 8:57 a.m. UTC | #1
On 16.09.2015 at 10:32, Paolo Bonzini wrote:
> 
> 
> On 15/09/2015 19:38, Paul E. McKenney wrote:
>> Excellent points!
>>
>> Other options in such situations include the following:
>>
>> o	Rework so that the code uses call_rcu*() instead of *_expedited().
>>
>> o	Maintain a per-task or per-CPU counter so that every so many
>> 	*_expedited() invocations instead uses the non-expedited
>> 	counterpart.  (For example, synchronize_rcu instead of
>> 	synchronize_rcu_expedited().)
> 
> Or just use ratelimit (untested):

One of my tests was to always replace synchronize_sched_expedited with 
synchronize_sched and things turned out to be even worse. Not sure if
it makes sense to test your in-the-middle approach?

Christian

Paolo Bonzini Sept. 16, 2015, 9:12 a.m. UTC | #2
On 16/09/2015 10:57, Christian Borntraeger wrote:
> On 16.09.2015 at 10:32, Paolo Bonzini wrote:
>>
>>
>> On 15/09/2015 19:38, Paul E. McKenney wrote:
>>> Excellent points!
>>>
>>> Other options in such situations include the following:
>>>
>>> o	Rework so that the code uses call_rcu*() instead of *_expedited().
>>>
>>> o	Maintain a per-task or per-CPU counter so that every so many
>>> 	*_expedited() invocations instead uses the non-expedited
>>> 	counterpart.  (For example, synchronize_rcu instead of
>>> 	synchronize_rcu_expedited().)
>>
>> Or just use ratelimit (untested):
> 
> One of my tests was to always replace synchronize_sched_expedited with 
> synchronize_sched and things turned out to be even worse. Not sure if
> it makes sense to test your in-the-middle approach?

I don't think it applies here, since down_write/up_write is a
synchronous API.

If the revert isn't easy, I think backporting rcu_sync is the best bet.
 The issue is that rcu_sync doesn't eliminate synchronize_sched, it only
makes it more rare.  So it's possible that it isn't eliminating the root
cause of the problem.
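
For reference, the rcu_sync-based rework gives the write side roughly
this shape (a simplified sketch with illustrative field names, not the
actual percpu-rwsem or kernel/rcu/sync.c code):

void percpu_down_write(struct percpu_rw_semaphore *brw)
{
	/*
	 * Switch readers off their per-CPU fast path.  This needs a
	 * grace period only when the semaphore was idle; writers that
	 * pile up behind it can reuse the same one.
	 */
	rcu_sync_enter(&brw->rss);

	/* exclude other writers, then wait for fast-path readers to drain */
	down_write(&brw->rw_sem);
	/* ... */
}

void percpu_up_write(struct percpu_rw_semaphore *brw)
{
	up_write(&brw->rw_sem);

	/*
	 * Switch readers back to the fast path.  rcu_sync_exit() defers
	 * its grace period via call_rcu(), so there is no
	 * synchronize_sched() here at all.
	 */
	rcu_sync_exit(&brw->rss);
}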

Paolo
Oleg Nesterov Sept. 16, 2015, 12:22 p.m. UTC | #3
On 09/16, Paolo Bonzini wrote:
>
>
> On 16/09/2015 10:57, Christian Borntraeger wrote:
> > On 16.09.2015 at 10:32, Paolo Bonzini wrote:
> >>
> >>
> >> On 15/09/2015 19:38, Paul E. McKenney wrote:
> >>> Excellent points!
> >>>
> >>> Other options in such situations include the following:
> >>>
> >>> o	Rework so that the code uses call_rcu*() instead of *_expedited().
> >>>
> >>> o	Maintain a per-task or per-CPU counter so that every so many
> >>> 	*_expedited() invocations instead uses the non-expedited
> >>> 	counterpart.  (For example, synchronize_rcu instead of
> >>> 	synchronize_rcu_expedited().)
> >>
> >> Or just use ratelimit (untested):
> >
> > One of my tests was to always replace synchronize_sched_expedited with
> > synchronize_sched and things turned out to be even worse. Not sure if
> > it makes sense to test your in-the-middle approach?
>
> I don't think it applies here, since down_write/up_write is a
> synchronous API.
>
> If the revert isn't easy, I think backporting rcu_sync is the best bet.

I leave this to Paul and Tejun... at least I think this is not v4.2 material.

>  The issue is that rcu_sync doesn't eliminate synchronize_sched,

Yes, but it eliminates _expedited(). This is good, but otoh this means
that (say) individual __cgroup_procs_write() can take much more time.
However, it won't block the readers and/or disturb the whole system.
And percpu_up_write() doesn't do synchronize_sched() at all.

> it only
> makes it more rare.

Yes, so we can hope that multiple __cgroup_procs_write()'s can "share"
a single synchronize_sched().
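
The sharing comes about because only the first writer to arrive has to
pay for the grace period; a very rough sketch of the idea (hypothetical
type and fields, not the real rcu_sync state machine, and the exit/reset
side is omitted):

struct gp_share {
	spinlock_t		lock;
	int			enter_count;
	struct completion	gp_passed;
};

static void gp_share_enter(struct gp_share *s)
{
	bool first;

	spin_lock(&s->lock);
	first = (s->enter_count++ == 0);
	spin_unlock(&s->lock);

	if (first) {
		synchronize_sched();			/* one GP for everybody */
		complete_all(&s->gp_passed);
	} else {
		wait_for_completion(&s->gp_passed);	/* share that GP */
	}
}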

> So it's possible that it isn't eliminating the root
> cause of the problem.

We will see... Just in case, currently the usage of percpu_down_write()
is suboptimal. We do not need to do ->sync() under cgroup_mutex. But
this needs some WIP changes in rcu_sync. Plus we can do more improvements,
but this is off-topic right now.

Oleg.

Paolo Bonzini Sept. 16, 2015, 12:35 p.m. UTC | #4
On 16/09/2015 14:22, Oleg Nesterov wrote:
> >  The issue is that rcu_sync doesn't eliminate synchronize_sched,
> 
> Yes, but it eliminates _expedited(). This is good, but otoh this means
> that (say) individual __cgroup_procs_write() can take much more time.
> However, it won't block the readers and/or disturb the whole system.

According to Christian, removing the _expedited() "makes things worse"
in that the system takes ages to boot up and systemd times out.  So I'm
still a bit wary about anything that uses RCU for the cgroups write side.

However, rcu_sync is okay with him, so perhaps it is really really
effective.  Christian, can you instrument how many synchronize_sched calls
(out of the 6479 cgroup_procs_write calls) are actually executed at boot
with the rcu rework?

Paolo

> And percpu_up_write() doesn't do synchronize_sched() at all.
Oleg Nesterov Sept. 16, 2015, 12:43 p.m. UTC | #5
On 09/16, Paolo Bonzini wrote:
>
>
> On 16/09/2015 14:22, Oleg Nesterov wrote:
> > >  The issue is that rcu_sync doesn't eliminate synchronize_sched,
> >
> > Yes, but it eliminates _expedited(). This is good, but otoh this means
> > that (say) individual __cgroup_procs_write() can take much more time.
> > However, it won't block the readers and/or disturb the whole system.
>
> According to Christian, removing the _expedited() "makes things worse"

Yes, sure, we cannot just remove _expedited() from down/up_write().

> in that the system takes ages to boot up and systemd times out.

Yes, this is clear.

> So I'm
> still a bit wary about anything that uses RCU for the cgroups write side.
>
> However, rcu_sync is okay with him, so perhaps it is really really
> effective.  Christian, can you instrument how many synchronize_sched calls
> (out of the 6479 cgroup_procs_write calls) are actually executed at boot
> with the rcu rework?

Heh, that is another change I have in mind. It would be nice to add some
tracepoints, but first we should merge the current code.

Oleg.

Christian Borntraeger Sept. 16, 2015, 12:56 p.m. UTC | #6
On 16.09.2015 at 14:22, Oleg Nesterov wrote:
>>  The issue is that rcu_sync doesn't eliminate synchronize_sched,
> 
> Yes, but it eliminates _expedited(). This is good, but otoh this means
> that (say) individual __cgroup_procs_write() can take much more time.
> However, it won't block the readers and/or disturb the whole system.
> And percpu_up_write() doesn't do synchronize_sched() at all.
> 
>> it only
>> makes it more rare.
> 
> Yes, so we can hope that multiple __cgroup_procs_write()'s can "share"
> a single synchronize_sched().

And in fact it does. Paolo suggested tracing how often we call
synchronize_sched, so I applied some advanced printk debugging technology ;-)
Until login I see 41 calls, and after starting the 70 guests this went up to 48.
Nice work.
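
Presumably something along these lines, i.e. a counter next to the
grace-period call (hypothetical debug code, not the actual patch used):

static atomic_t gp_calls = ATOMIC_INIT(0);

static void counted_synchronize_sched(void)
{
	synchronize_sched();
	pr_info("percpu_rwsem: synchronize_sched() call #%d\n",
		atomic_inc_return(&gp_calls));
}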

> 
>> So it's possible that it isn't eliminating the root
>> cause of the problem.
> 
> We will see... Just in case, currently the usage of percpu_down_write()
> is suboptimal. We do not need to do ->sync() under cgroup_mutex. But
> this needs some WIP changes in rcu_sync. Plus we can do more improvements,
> but this is off-topic right now.
> 
> Oleg.
> 

Tejun Heo Sept. 16, 2015, 2:16 p.m. UTC | #7
Hello,

On Wed, Sep 16, 2015 at 02:22:49PM +0200, Oleg Nesterov wrote:
> > If the revert isn't easy, I think backporting rcu_sync is the best bet.
> 
> I leave this to Paul and Tejun... at least I think this is not v4.2 material.

Will route reverts through cgroup branch.  Should be pretty painless.
Nice job on percpu_rwsem.  It's so much better than having to come up
with a separate middleground solution.

Thanks.
Paolo Bonzini Sept. 16, 2015, 2:19 p.m. UTC | #8
On 16/09/2015 16:16, Tejun Heo wrote:
> On Wed, Sep 16, 2015 at 02:22:49PM +0200, Oleg Nesterov wrote:
>>> If the revert isn't easy, I think backporting rcu_sync is the best bet.
>>
>> I leave this to Paul and Tejun... at least I think this is not v4.2 material.
> Will route reverts through cgroup branch.  Should be pretty painless.
> Nice job on percpu_rwsem.  It's so much better than having to come up
> with a separate middleground solution.

Thanks all for the quick resolution!

Paolo

Patch

diff --git a/include/linux/percpu-rwsem.h b/include/linux/percpu-rwsem.h
index 834c4e52cb2d..8fb66b2aeed9 100644
--- a/include/linux/percpu-rwsem.h
+++ b/include/linux/percpu-rwsem.h
@@ -6,6 +6,7 @@ 
 #include <linux/percpu.h>
 #include <linux/wait.h>
 #include <linux/lockdep.h>
+#include <linux/ratelimit.h>
 
 struct percpu_rw_semaphore {
 	unsigned int __percpu	*fast_read_ctr;
@@ -13,6 +14,7 @@  struct percpu_rw_semaphore {
 	struct rw_semaphore	rw_sem;
 	atomic_t		slow_read_ctr;
 	wait_queue_head_t	write_waitq;
+	struct ratelimit_state	expedited_ratelimit;
 };
 
 extern void percpu_down_read(struct percpu_rw_semaphore *);
diff --git a/kernel/locking/percpu-rwsem.c b/kernel/locking/percpu-rwsem.c
index f32567254867..c33f8bc89384 100644
--- a/kernel/locking/percpu-rwsem.c
+++ b/kernel/locking/percpu-rwsem.c
@@ -20,6 +20,8 @@  int __percpu_init_rwsem(struct percpu_rw_semaphore *brw,
 	atomic_set(&brw->write_ctr, 0);
 	atomic_set(&brw->slow_read_ctr, 0);
 	init_waitqueue_head(&brw->write_waitq);
+	/* Expedite one down_write and one up_write per second.  */
+	ratelimit_state_init(&brw->expedited_ratelimit, HZ, 2);
 	return 0;
 }
 
@@ -152,7 +156,10 @@  void percpu_down_write(struct percpu_rw_semaphore *brw)
 	 *    fast-path, it executes a full memory barrier before we return.
 	 *    See R_W case in the comment above update_fast_ctr().
 	 */
-	synchronize_sched_expedited();
+	if (__ratelimit(&brw->expedited_ratelimit))
+		synchronize_sched_expedited();
+	else
+		synchronize_sched();
 
 	/* exclude other writers, and block the new readers completely */
 	down_write(&brw->rw_sem);
@@ -172,7 +179,10 @@  void percpu_up_write(struct percpu_rw_semaphore *brw)
 	 * Insert the barrier before the next fast-path in down_read,
 	 * see W_R case in the comment above update_fast_ctr().
 	 */
-	synchronize_sched_expedited();
+	if (__ratelimit(&brw->expedited_ratelimit))
+		synchronize_sched_expedited();
+	else
+		synchronize_sched();
 	/* the last writer unblocks update_fast_ctr() */
 	atomic_dec(&brw->write_ctr);
 }
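
A note on the ratelimit parameters above:
ratelimit_state_init(&brw->expedited_ratelimit, HZ, 2) sets an interval
of HZ jiffies (one second) with a burst of 2, so each semaphore gets at
most two expedited grace periods per second, enough for one
down_write()/up_write() pair as the comment says.  __ratelimit() returns
nonzero while the caller is still within that budget, so the expedited
path is used until the budget is exhausted and the plain
synchronize_sched() fallback kicks in.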