Message ID | 55F92904.4090206@redhat.com (mailing list archive)
---|---
State | New, archived
On 16.09.2015 at 10:32, Paolo Bonzini wrote:
>
> On 15/09/2015 19:38, Paul E. McKenney wrote:
>> Excellent points!
>>
>> Other options in such situations include the following:
>>
>> o	Rework so that the code uses call_rcu*() instead of *_expedited().
>>
>> o	Maintain a per-task or per-CPU counter so that every so many
>>	*_expedited() invocations instead uses the non-expedited
>>	counterpart. (For example, synchronize_rcu() instead of
>>	synchronize_rcu_expedited().)
>
> Or just use ratelimit (untested):

One of my tests was to always replace synchronize_sched_expedited()
with synchronize_sched(), and things turned out to be even worse. Not
sure if it makes sense to test your in-the-middle approach?

Christian
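(For illustration: Paul's per-CPU counter option, sketched as a minimal
hedged example. Everything below is invented for this sketch, including
the names expedited_budget, EXPEDITE_EVERY and sync_sched_throttled; no
such code was posted in the thread.)

```c
#include <linux/percpu.h>
#include <linux/rcupdate.h>

/*
 * Hypothetical sketch of the per-CPU counter idea: every Nth writer
 * pays for a normal grace period so that expedited grace periods stay
 * bounded.  All names and the ratio are invented for illustration.
 */
static DEFINE_PER_CPU(unsigned int, expedited_budget);
#define EXPEDITE_EVERY	8	/* arbitrary ratio */

static void sync_sched_throttled(void)
{
	if (this_cpu_inc_return(expedited_budget) % EXPEDITE_EVERY)
		synchronize_sched_expedited();	/* common, fast case */
	else
		synchronize_sched();		/* periodic slow fallback */
}
```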
On 16/09/2015 10:57, Christian Borntraeger wrote:
> On 16.09.2015 at 10:32, Paolo Bonzini wrote:
>> [...]
>> Or just use ratelimit (untested):
>
> One of my tests was to always replace synchronize_sched_expedited()
> with synchronize_sched(), and things turned out to be even worse. Not
> sure if it makes sense to test your in-the-middle approach?

I don't think it applies here, since down_write/up_write is a
synchronous API.

If the revert isn't easy, I think backporting rcu_sync is the best bet.
The issue is that rcu_sync doesn't eliminate synchronize_sched(), it
only makes it rarer. So it's possible that it isn't eliminating the
root cause of the problem.

Paolo
On 09/16, Paolo Bonzini wrote:
>
> On 16/09/2015 10:57, Christian Borntraeger wrote:
> > [...]
> > One of my tests was to always replace synchronize_sched_expedited()
> > with synchronize_sched(), and things turned out to be even worse.
> > Not sure if it makes sense to test your in-the-middle approach?
>
> I don't think it applies here, since down_write/up_write is a
> synchronous API.
>
> If the revert isn't easy, I think backporting rcu_sync is the best bet.

I leave this to Paul and Tejun... at least I think this is not v4.2
material.

> The issue is that rcu_sync doesn't eliminate synchronize_sched(),

Yes, but it eliminates _expedited(). This is good, but on the other
hand it means that (say) an individual __cgroup_procs_write() can take
much more time. However, it won't block the readers and/or disturb the
whole system. And percpu_up_write() doesn't do synchronize_sched() at
all.

> it only makes it rarer.

Yes, so we can hope that multiple __cgroup_procs_write()'s can "share"
a single synchronize_sched().

> So it's possible that it isn't eliminating the root cause of the
> problem.

We will see... Just in case: currently the usage of percpu_down_write()
is suboptimal, since we do not need to do ->sync() under cgroup_mutex.
But this needs some WIP changes in rcu_sync. Plus we can do more
improvements, but this is off-topic right now.

Oleg.
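(For readers following the thread: a rough sketch of the rcu_sync idea
under discussion. The struct layout below is invented and the whole
shape is simplified; rcu_sync_enter(), rcu_sync_exit() and
rcu_sync_is_idle() are the names used in Oleg's series, and the real
implementation flips back to the reader fast path asynchronously via
call_rcu() rather than as shown here.)

```c
#include <linux/percpu.h>
#include <linux/rcu_sync.h>
#include <linux/rwsem.h>

/* Invented struct; the real percpu_rw_semaphore layout differs. */
struct sketch_percpu_rwsem {
	struct rcu_sync		rss;		/* grace-period state machine */
	unsigned int __percpu	*read_ctr;	/* reader fast-path counter */
	struct rw_semaphore	rw_sem;		/* writer exclusion */
};

static void sketch_down_write(struct sketch_percpu_rwsem *sem)
{
	/* Blocks in synchronize_sched() only if the primitive was idle;
	 * back-to-back writers share the already-started grace period. */
	rcu_sync_enter(&sem->rss);
	down_write(&sem->rw_sem);
}

static void sketch_up_write(struct sketch_percpu_rwsem *sem)
{
	up_write(&sem->rw_sem);
	/* No synchronize_sched() here: rcu_sync switches back to the
	 * reader fast path asynchronously. */
	rcu_sync_exit(&sem->rss);
}

/* Returns true if the lockless fast path was taken; on false the
 * caller must fall back to a (hypothetical) slow path. */
static bool sketch_try_fast_read(struct sketch_percpu_rwsem *sem)
{
	bool idle;

	preempt_disable();
	idle = rcu_sync_is_idle(&sem->rss);
	if (idle)
		__this_cpu_inc(*sem->read_ctr);
	preempt_enable();
	return idle;
}
```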
On 16/09/2015 14:22, Oleg Nesterov wrote:
>> The issue is that rcu_sync doesn't eliminate synchronize_sched(),
>
> Yes, but it eliminates _expedited(). This is good, but on the other
> hand it means that (say) an individual __cgroup_procs_write() can take
> much more time. However, it won't block the readers and/or disturb the
> whole system. And percpu_up_write() doesn't do synchronize_sched() at
> all.

According to Christian, removing the _expedited() "makes things worse"
in that the system takes ages to boot up and systemd times out. So I'm
still a bit wary about anything that uses RCU for the cgroups write
side.

However, rcu_sync is okay with him, so perhaps it is really, really
effective. Christian, can you instrument how many synchronize_sched()
calls (out of the 6479 cgroup_procs_write() calls) are actually
executed at boot with the RCU rework?

Paolo
On 09/16, Paolo Bonzini wrote:
>
> On 16/09/2015 14:22, Oleg Nesterov wrote:
> > [...]
>
> According to Christian, removing the _expedited() "makes things worse"

Yes, sure, we cannot just remove _expedited() from down/up_read().

> in that the system takes ages to boot up and systemd times out.

Yes, this is clear.

> So I'm still a bit wary about anything that uses RCU for the cgroups
> write side.
>
> However, rcu_sync is okay with him, so perhaps it is really, really
> effective. Christian, can you instrument how many synchronize_sched()
> calls (out of the 6479 cgroup_procs_write() calls) are actually
> executed at boot with the RCU rework?

Heh, that is another change I have in mind. It would be nice to add
some trace points. But first we should merge the current code.

Oleg.
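(As an aside, such trace points might look roughly like the
hypothetical sketch below; nothing like this exists in the posted
series, and the event name, argument and placement are all invented.)

```c
/*
 * Hypothetical tracepoint sketch.  A real version would live in a
 * trace header (say include/trace/events/rcu_sync.h) and be pulled in
 * through the usual CREATE_TRACE_POINTS machinery.
 */
#undef TRACE_SYSTEM
#define TRACE_SYSTEM rcu_sync

#include <linux/tracepoint.h>

TRACE_EVENT(rcu_sync_gp_start,
	TP_PROTO(const void *rsp),
	TP_ARGS(rsp),

	TP_STRUCT__entry(
		__field(const void *, rsp)
	),

	TP_fast_assign(
		__entry->rsp = rsp;
	),

	TP_printk("rcu_sync %p starts a grace period", __entry->rsp)
);
```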
On 16.09.2015 at 14:22, Oleg Nesterov wrote:
>> The issue is that rcu_sync doesn't eliminate synchronize_sched(),
>
> Yes, but it eliminates _expedited(). This is good, but on the other
> hand it means that (say) an individual __cgroup_procs_write() can take
> much more time. However, it won't block the readers and/or disturb the
> whole system. And percpu_up_write() doesn't do synchronize_sched() at
> all.
>
>> it only makes it rarer.
>
> Yes, so we can hope that multiple __cgroup_procs_write()'s can "share"
> a single synchronize_sched().

And in fact it does. Paolo suggested tracing how often we call
synchronize_sched(), so I applied some advanced printk debugging
technology ;-) Up to login I count 41 calls, and after starting the 70
guests this only went up to 48. Nice work.

>> So it's possible that it isn't eliminating the root cause of the
>> problem.
>
> We will see... Just in case: currently the usage of percpu_down_write()
> is suboptimal, since we do not need to do ->sync() under cgroup_mutex.
> But this needs some WIP changes in rcu_sync. Plus we can do more
> improvements, but this is off-topic right now.
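(Christian's debugging patch was not posted; presumably it amounted to
something like this hypothetical sketch, with gp_count and
counted_synchronize_sched() invented here.)

```c
#include <linux/atomic.h>
#include <linux/printk.h>
#include <linux/rcupdate.h>

/* Hypothetical sketch of the "advanced printk debugging": count how
 * often the slow grace period actually runs in the write path. */
static atomic_t gp_count = ATOMIC_INIT(0);

static void counted_synchronize_sched(void)
{
	synchronize_sched();
	pr_info("percpu_rwsem: synchronize_sched() call #%d\n",
		atomic_inc_return(&gp_count));
}
```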
Hello,

On Wed, Sep 16, 2015 at 02:22:49PM +0200, Oleg Nesterov wrote:
> > If the revert isn't easy, I think backporting rcu_sync is the best bet.
>
> I leave this to Paul and Tejun... at least I think this is not v4.2
> material.

Will route the reverts through the cgroup branch. Should be pretty
painless. Nice job on percpu_rwsem. It's so much better than having to
come up with a separate middle-ground solution.

Thanks.
On 16/09/2015 16:16, Tejun Heo wrote:
> On Wed, Sep 16, 2015 at 02:22:49PM +0200, Oleg Nesterov wrote:
>>> If the revert isn't easy, I think backporting rcu_sync is the best bet.
>>
>> I leave this to Paul and Tejun... at least I think this is not v4.2
>> material.
>
> Will route the reverts through the cgroup branch. Should be pretty
> painless. Nice job on percpu_rwsem. It's so much better than having to
> come up with a separate middle-ground solution.

Thanks all for the quick resolution!

Paolo
```diff
diff --git a/include/linux/percpu-rwsem.h b/include/linux/percpu-rwsem.h
index 834c4e52cb2d..8fb66b2aeed9 100644
--- a/include/linux/percpu-rwsem.h
+++ b/include/linux/percpu-rwsem.h
@@ -6,6 +6,7 @@
 #include <linux/percpu.h>
 #include <linux/wait.h>
 #include <linux/lockdep.h>
+#include <linux/ratelimit.h>
 
 struct percpu_rw_semaphore {
 	unsigned int __percpu	*fast_read_ctr;
@@ -13,6 +14,7 @@ struct percpu_rw_semaphore {
 	struct rw_semaphore	rw_sem;
 	atomic_t		slow_read_ctr;
 	wait_queue_head_t	write_waitq;
+	struct ratelimit_state	expedited_ratelimit;
 };
 
 extern void percpu_down_read(struct percpu_rw_semaphore *);
diff --git a/kernel/locking/percpu-rwsem.c b/kernel/locking/percpu-rwsem.c
index f32567254867..c33f8bc89384 100644
--- a/kernel/locking/percpu-rwsem.c
+++ b/kernel/locking/percpu-rwsem.c
@@ -20,6 +20,8 @@ int __percpu_init_rwsem(struct percpu_rw_semaphore *brw,
 	atomic_set(&brw->write_ctr, 0);
 	atomic_set(&brw->slow_read_ctr, 0);
 	init_waitqueue_head(&brw->write_waitq);
+	/* Expedite one down_write and one up_write per second. */
+	ratelimit_state_init(&brw->expedited_ratelimit, HZ, 2);
 	return 0;
 }
 
@@ -152,7 +156,10 @@ void percpu_down_write(struct percpu_rw_semaphore *brw)
 	 * fast-path, it executes a full memory barrier before we return.
 	 * See R_W case in the comment above update_fast_ctr().
 	 */
-	synchronize_sched_expedited();
+	if (__ratelimit(&brw->expedited_ratelimit))
+		synchronize_sched_expedited();
+	else
+		synchronize_sched();
 
 	/* exclude other writers, and block the new readers completely */
 	down_write(&brw->rw_sem);
@@ -172,7 +179,10 @@ void percpu_up_write(struct percpu_rw_semaphore *brw)
 	 * Insert the barrier before the next fast-path in down_read,
 	 * see W_R case in the comment above update_fast_ctr().
 	 */
-	synchronize_sched_expedited();
+	if (__ratelimit(&brw->expedited_ratelimit))
+		synchronize_sched_expedited();
+	else
+		synchronize_sched();
 	/* the last writer unblocks update_fast_ctr() */
 	atomic_dec(&brw->write_ctr);
 }
```
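(For reference: ratelimit_state_init(&brw->expedited_ratelimit, HZ, 2)
configures a budget of two events per HZ jiffies, i.e. per second, so
each semaphore may take at most two expedited grace periods per second,
matching one down_write()/up_write() pair, before __ratelimit() starts
returning zero and the writer falls back to a plain
synchronize_sched().)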