Message ID: 55310CF2.6070107@redhat.com (mailing list archive)
On Fri, Apr 17, 2015 at 03:38:58PM +0200, Paolo Bonzini wrote:
> > The path this notifier is called from has nothing to do with those
> > costs.
>
> How not? The task is going to incur those costs, it's not like half
> a dozen extra instruction make any difference. But anyway...

It's attributed to the entity doing the migration, which can be the
wakeup path or a softirq. And we very much do care about the wakeup
path.

> ... that's a valid objection. Please look at the patch below.

Still a NAK on that; distros have no choice but to enable that CONFIG
option because people might want to run KVM. CONFIG options are
pointless if they end up being mandatory.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On 17/04/2015 15:43, Peter Zijlstra wrote:
> On Fri, Apr 17, 2015 at 03:38:58PM +0200, Paolo Bonzini wrote:
>>> The path this notifier is called from has nothing to do with those
>>> costs.
>
> Its attributed to the entity doing the migration, which can be the
> wakeup path or a softirq. And we very much do care about the wakeup
> path.

It's not run on all wakeups. It's within an "if (task_cpu(p) != new_cpu)".
WF_MIGRATED _is_ a slow path for wakeups, even though wakeup itself is of
course something we care about. For load balancing, calculate_imbalance
alone is orders of magnitude more expensive than this notifier (which can
be optimized to two instructions with at most one cache miss).

>> ... that's a valid objection. Please look at the patch below.
>
> Still a NAK on that, distros have no choice but to enable that CONFIG
> option because people might want to run KVM.

Again: running virtual machines does not require these notifiers. KVM
needs preempt and MMU notifiers, and also enables user return notifiers,
but does not need these ones. It's only paravirt that needs them. It's
perfectly fine for distros to disable paravirt. Some do, some don't.

Paolo
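Paolo's cost claim above is about the shape of the unregistered case: calling an empty atomic notifier chain is essentially one pointer load, one NULL test, and a return. A hypothetical userspace model of that fast path (the real kernel chain behind `ATOMIC_NOTIFIER_HEAD` in kernel/notifier.c is RCU-protected; the `notifier_head`, `call_chain`, and counting-notifier names here are made up for illustration):

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for the kernel's notifier chain; no RCU, no
 * locking, just the control flow of the hot path under discussion. */
struct notifier_block {
	int (*call)(struct notifier_block *nb, unsigned long action, void *data);
	struct notifier_block *next;
};

struct notifier_head {
	struct notifier_block *head;
};

static int call_chain(struct notifier_head *nh, unsigned long action, void *data)
{
	struct notifier_block *nb;

	/* Empty chain: the loop body never runs, we fall straight through. */
	for (nb = nh->head; nb != NULL; nb = nb->next) {
		int ret = nb->call(nb, action, data);
		if (ret)
			return ret;
	}
	return 0;
}

static int invocations;

static int counting_notifier(struct notifier_block *nb, unsigned long action,
			     void *data)
{
	(void)nb; (void)action; (void)data;
	invocations++;
	return 0;
}

/* No notifier registered: call_chain returns without calling anything. */
static int run_empty_chain(void)
{
	struct notifier_head nh = { NULL };

	invocations = 0;
	call_chain(&nh, 0, NULL);
	return invocations;
}

/* One registered notifier: it runs exactly once per call. */
static int run_one_notifier(void)
{
	struct notifier_block nb = { counting_notifier, NULL };
	struct notifier_head nh = { &nb };

	invocations = 0;
	call_chain(&nh, 0, NULL);
	return invocations;
}
```

The debate in the thread is precisely whether this empty-chain cost, plus filling the `tmn` structure on the stack, is negligible on the migration path or not.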
On Fri, Apr 17, 2015 at 03:38:58PM +0200, Paolo Bonzini wrote: > On 17/04/2015 15:10, Peter Zijlstra wrote: > > On Fri, Apr 17, 2015 at 02:46:57PM +0200, Paolo Bonzini wrote: > >> On 17/04/2015 12:55, Peter Zijlstra wrote: > >>> Also, it looks like you already do exactly this for other things, look > >>> at: > >>> > >>> kvm_sched_in() > >>> kvm_arch_vcpu_load() > >>> if (unlikely(vcpu->cpu != cpu) ... ) > >>> > >>> So no, I don't believe for one second you need this. > > > > This [...] brings us back to where we were last > > time. There is _0_ justification for this in the patches, that alone is > > grounds enough to reject it. > > Oh, we totally agree on that. I didn't commit that patch, but I already > said the commit message was insufficient. > > > Why should the guest task care about the physical cpu of the vcpu; > > that's a layering fail if ever there was one. > > It's totally within your right to not read the code, but then please > don't try commenting at it. > > This code: > > kvm_sched_in() > kvm_arch_vcpu_load() > if (unlikely(vcpu->cpu != cpu) ... ) > > runs in the host. The hypervisor obviously cares if the physical CPU of > the VCPU changes. It has to tell the source processor (vcpu->cpu) to > release the VCPU's data structure and only then it can use it in the > target processor (cpu). No layering violation here. > > The task migration notifier runs in the guest, whenever the VCPU of > a task changes. > > > Furthermore, the only thing that migration handler seems to do is > > increment a variable that is not actually used in that file. > > It's used in the vDSO, so you cannot increment it in the file that uses it. > > >> And frankly, I think the static key is snake oil. The cost of task > >> migration in terms of cache misses and TLB misses is in no way > >> comparable to the cost of filling in a structure on the stack, > >> dereferencing the head of the notifiers list and seeing that it's NULL. 
> > > > The path this notifier is called from has nothing to do with those > > costs. > > How not? The task is going to incur those costs, it's not like half > a dozen extra instruction make any difference. But anyway... > > > And the fact you're inflicting these costs on _everyone_ for a > > single x86_64-paravirt case is insane. > > ... that's a valid objection. Please look at the patch below. > > > I've had enough of this, the below goes into sched/urgent and you can > > come back with sane patches if and when you're ready. > > Oh, please, cut the alpha male crap. > > Paolo > > ------------------- 8< ---------------- > >From 4eb9d7132e1990c0586f28af3103675416d38974 Mon Sep 17 00:00:00 2001 > From: Paolo Bonzini <pbonzini@redhat.com> > Date: Fri, 17 Apr 2015 14:57:34 +0200 > Subject: [PATCH] sched: add CONFIG_TASK_MIGRATION_NOTIFIER > > The task migration notifier is only used in x86 paravirt. Make it > possible to compile it out. > > While at it, move some code around to ensure tmn is filled from CPU > registers. 
> > Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> > --- > arch/x86/Kconfig | 1 + > init/Kconfig | 3 +++ > kernel/sched/core.c | 9 ++++++++- > 3 files changed, 12 insertions(+), 1 deletion(-) > > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig > index d43e7e1c784b..9af252c8698d 100644 > --- a/arch/x86/Kconfig > +++ b/arch/x86/Kconfig > @@ -649,6 +649,7 @@ if HYPERVISOR_GUEST > > config PARAVIRT > bool "Enable paravirtualization code" > + select TASK_MIGRATION_NOTIFIER > ---help--- > This changes the kernel so it can modify itself when it is run > under a hypervisor, potentially improving performance significantly > diff --git a/init/Kconfig b/init/Kconfig > index 3b9df1aa35db..891917123338 100644 > --- a/init/Kconfig > +++ b/init/Kconfig > @@ -2016,6 +2016,9 @@ source "block/Kconfig" > config PREEMPT_NOTIFIERS > bool > > +config TASK_MIGRATION_NOTIFIER > + bool > + > config PADATA > depends on SMP > bool > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > index f9123a82cbb6..c07a53aa543c 100644 > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -1016,12 +1016,14 @@ void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags) > rq_clock_skip_update(rq, true); > } > > +#ifdef CONFIG_TASK_MIGRATION_NOTIFIER > static ATOMIC_NOTIFIER_HEAD(task_migration_notifier); > > void register_task_migration_notifier(struct notifier_block *n) > { > atomic_notifier_chain_register(&task_migration_notifier, n); > } > +#endif > > #ifdef CONFIG_SMP > void set_task_cpu(struct task_struct *p, unsigned int new_cpu) > @@ -1053,18 +1055,23 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu) > trace_sched_migrate_task(p, new_cpu); > > if (task_cpu(p) != new_cpu) { > +#ifdef CONFIG_TASK_MIGRATION_NOTIFIER > struct task_migration_notifier tmn; > + int from_cpu = task_cpu(p); > +#endif > > if (p->sched_class->migrate_task_rq) > p->sched_class->migrate_task_rq(p, new_cpu); > p->se.nr_migrations++; > perf_sw_event_sched(PERF_COUNT_SW_CPU_MIGRATIONS, 
>  						      1, 0);
>
> +#ifdef CONFIG_TASK_MIGRATION_NOTIFIER
>  		tmn.task = p;
> -		tmn.from_cpu = task_cpu(p);
> +		tmn.from_cpu = from_cpu;
>  		tmn.to_cpu = new_cpu;
>
>  		atomic_notifier_call_chain(&task_migration_notifier, 0, &tmn);
> +#endif
>  	}
>
>  	__set_task_cpu(p, new_cpu);
> --
> 2.3.5

Paolo,

Please revert the patch -- this can be fixed properly in the host, which
also conforms to the documented KVM guest/host protocol.

Radim submitted a patch to kvm@ to split kvm_write_guest in two with a
barrier in between, I think. I'll review that patch.
On Fri, Apr 17, 2015 at 12:01 PM, Marcelo Tosatti <mtosatti@redhat.com> wrote: > On Fri, Apr 17, 2015 at 03:38:58PM +0200, Paolo Bonzini wrote: >> On 17/04/2015 15:10, Peter Zijlstra wrote: >> > On Fri, Apr 17, 2015 at 02:46:57PM +0200, Paolo Bonzini wrote: >> >> On 17/04/2015 12:55, Peter Zijlstra wrote: >> >>> Also, it looks like you already do exactly this for other things, look >> >>> at: >> >>> >> >>> kvm_sched_in() >> >>> kvm_arch_vcpu_load() >> >>> if (unlikely(vcpu->cpu != cpu) ... ) >> >>> >> >>> So no, I don't believe for one second you need this. >> > >> > This [...] brings us back to where we were last >> > time. There is _0_ justification for this in the patches, that alone is >> > grounds enough to reject it. >> >> Oh, we totally agree on that. I didn't commit that patch, but I already >> said the commit message was insufficient. >> >> > Why should the guest task care about the physical cpu of the vcpu; >> > that's a layering fail if ever there was one. >> >> It's totally within your right to not read the code, but then please >> don't try commenting at it. >> >> This code: >> >> kvm_sched_in() >> kvm_arch_vcpu_load() >> if (unlikely(vcpu->cpu != cpu) ... ) >> >> runs in the host. The hypervisor obviously cares if the physical CPU of >> the VCPU changes. It has to tell the source processor (vcpu->cpu) to >> release the VCPU's data structure and only then it can use it in the >> target processor (cpu). No layering violation here. >> >> The task migration notifier runs in the guest, whenever the VCPU of >> a task changes. >> >> > Furthermore, the only thing that migration handler seems to do is >> > increment a variable that is not actually used in that file. >> >> It's used in the vDSO, so you cannot increment it in the file that uses it. >> >> >> And frankly, I think the static key is snake oil. 
The cost of task >> >> migration in terms of cache misses and TLB misses is in no way >> >> comparable to the cost of filling in a structure on the stack, >> >> dereferencing the head of the notifiers list and seeing that it's NULL. >> > >> > The path this notifier is called from has nothing to do with those >> > costs. >> >> How not? The task is going to incur those costs, it's not like half >> a dozen extra instruction make any difference. But anyway... >> >> > And the fact you're inflicting these costs on _everyone_ for a >> > single x86_64-paravirt case is insane. >> >> ... that's a valid objection. Please look at the patch below. >> >> > I've had enough of this, the below goes into sched/urgent and you can >> > come back with sane patches if and when you're ready. >> >> Oh, please, cut the alpha male crap. >> >> Paolo >> >> ------------------- 8< ---------------- >> >From 4eb9d7132e1990c0586f28af3103675416d38974 Mon Sep 17 00:00:00 2001 >> From: Paolo Bonzini <pbonzini@redhat.com> >> Date: Fri, 17 Apr 2015 14:57:34 +0200 >> Subject: [PATCH] sched: add CONFIG_TASK_MIGRATION_NOTIFIER >> >> The task migration notifier is only used in x86 paravirt. Make it >> possible to compile it out. >> >> While at it, move some code around to ensure tmn is filled from CPU >> registers. 
>> >> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> >> --- >> arch/x86/Kconfig | 1 + >> init/Kconfig | 3 +++ >> kernel/sched/core.c | 9 ++++++++- >> 3 files changed, 12 insertions(+), 1 deletion(-) >> >> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig >> index d43e7e1c784b..9af252c8698d 100644 >> --- a/arch/x86/Kconfig >> +++ b/arch/x86/Kconfig >> @@ -649,6 +649,7 @@ if HYPERVISOR_GUEST >> >> config PARAVIRT >> bool "Enable paravirtualization code" >> + select TASK_MIGRATION_NOTIFIER >> ---help--- >> This changes the kernel so it can modify itself when it is run >> under a hypervisor, potentially improving performance significantly >> diff --git a/init/Kconfig b/init/Kconfig >> index 3b9df1aa35db..891917123338 100644 >> --- a/init/Kconfig >> +++ b/init/Kconfig >> @@ -2016,6 +2016,9 @@ source "block/Kconfig" >> config PREEMPT_NOTIFIERS >> bool >> >> +config TASK_MIGRATION_NOTIFIER >> + bool >> + >> config PADATA >> depends on SMP >> bool >> diff --git a/kernel/sched/core.c b/kernel/sched/core.c >> index f9123a82cbb6..c07a53aa543c 100644 >> --- a/kernel/sched/core.c >> +++ b/kernel/sched/core.c >> @@ -1016,12 +1016,14 @@ void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags) >> rq_clock_skip_update(rq, true); >> } >> >> +#ifdef CONFIG_TASK_MIGRATION_NOTIFIER >> static ATOMIC_NOTIFIER_HEAD(task_migration_notifier); >> >> void register_task_migration_notifier(struct notifier_block *n) >> { >> atomic_notifier_chain_register(&task_migration_notifier, n); >> } >> +#endif >> >> #ifdef CONFIG_SMP >> void set_task_cpu(struct task_struct *p, unsigned int new_cpu) >> @@ -1053,18 +1055,23 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu) >> trace_sched_migrate_task(p, new_cpu); >> >> if (task_cpu(p) != new_cpu) { >> +#ifdef CONFIG_TASK_MIGRATION_NOTIFIER >> struct task_migration_notifier tmn; >> + int from_cpu = task_cpu(p); >> +#endif >> >> if (p->sched_class->migrate_task_rq) >> p->sched_class->migrate_task_rq(p, new_cpu); >> 
p->se.nr_migrations++; >> perf_sw_event_sched(PERF_COUNT_SW_CPU_MIGRATIONS, 1, 0); >> >> +#ifdef CONFIG_TASK_MIGRATION_NOTIFIER >> tmn.task = p; >> - tmn.from_cpu = task_cpu(p); >> + tmn.from_cpu = from_cpu; >> tmn.to_cpu = new_cpu; >> >> atomic_notifier_call_chain(&task_migration_notifier, 0, &tmn); >> +#endif >> } >> >> __set_task_cpu(p, new_cpu); >> -- >> 2.3.5 > > Paolo, > > Please revert the patch -- can fix properly in the host > which also conforms the KVM guest/host documented protocol. > > Radim submitted a patch to kvm@ to split > the kvm_write_guest in two with a barrier in between, i think. > > I'll review that patch. > Can you cc me on that? Thanks, Andy
>> From 4eb9d7132e1990c0586f28af3103675416d38974 Mon Sep 17 00:00:00 2001 >> From: Paolo Bonzini <pbonzini@redhat.com> >> Date: Fri, 17 Apr 2015 14:57:34 +0200 >> Subject: [PATCH] sched: add CONFIG_TASK_MIGRATION_NOTIFIER >> >> The task migration notifier is only used in x86 paravirt. Make it >> possible to compile it out. >> >> While at it, move some code around to ensure tmn is filled from CPU >> registers. >> >> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> >> --- >> arch/x86/Kconfig | 1 + >> init/Kconfig | 3 +++ >> kernel/sched/core.c | 9 ++++++++- >> 3 files changed, 12 insertions(+), 1 deletion(-) >> >> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig >> index d43e7e1c784b..9af252c8698d 100644 >> --- a/arch/x86/Kconfig >> +++ b/arch/x86/Kconfig >> @@ -649,6 +649,7 @@ if HYPERVISOR_GUEST >> >> config PARAVIRT >> bool "Enable paravirtualization code" >> + select TASK_MIGRATION_NOTIFIER >> ---help--- >> This changes the kernel so it can modify itself when it is run >> under a hypervisor, potentially improving performance significantly >> diff --git a/init/Kconfig b/init/Kconfig >> index 3b9df1aa35db..891917123338 100644 >> --- a/init/Kconfig >> +++ b/init/Kconfig >> @@ -2016,6 +2016,9 @@ source "block/Kconfig" >> config PREEMPT_NOTIFIERS >> bool >> >> +config TASK_MIGRATION_NOTIFIER >> + bool >> + >> config PADATA >> depends on SMP >> bool >> diff --git a/kernel/sched/core.c b/kernel/sched/core.c >> index f9123a82cbb6..c07a53aa543c 100644 >> --- a/kernel/sched/core.c >> +++ b/kernel/sched/core.c >> @@ -1016,12 +1016,14 @@ void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags) >> rq_clock_skip_update(rq, true); >> } >> >> +#ifdef CONFIG_TASK_MIGRATION_NOTIFIER >> static ATOMIC_NOTIFIER_HEAD(task_migration_notifier); >> >> void register_task_migration_notifier(struct notifier_block *n) >> { >> atomic_notifier_chain_register(&task_migration_notifier, n); >> } >> +#endif >> >> #ifdef CONFIG_SMP >> void set_task_cpu(struct task_struct *p, 
unsigned int new_cpu) >> @@ -1053,18 +1055,23 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu) >> trace_sched_migrate_task(p, new_cpu); >> >> if (task_cpu(p) != new_cpu) { >> +#ifdef CONFIG_TASK_MIGRATION_NOTIFIER >> struct task_migration_notifier tmn; >> + int from_cpu = task_cpu(p); >> +#endif >> >> if (p->sched_class->migrate_task_rq) >> p->sched_class->migrate_task_rq(p, new_cpu); >> p->se.nr_migrations++; >> perf_sw_event_sched(PERF_COUNT_SW_CPU_MIGRATIONS, 1, 0); >> >> +#ifdef CONFIG_TASK_MIGRATION_NOTIFIER >> tmn.task = p; >> - tmn.from_cpu = task_cpu(p); >> + tmn.from_cpu = from_cpu; >> tmn.to_cpu = new_cpu; >> >> atomic_notifier_call_chain(&task_migration_notifier, 0, &tmn); >> +#endif >> } >> >> __set_task_cpu(p, new_cpu); >> -- >> 2.3.5 > > Paolo, > > Please revert the patch -- can fix properly in the host > which also conforms the KVM guest/host documented protocol. > > Radim submitted a patch to kvm@ to split > the kvm_write_guest in two with a barrier in between, i think. > > I'll review that patch. You're thinking of http://article.gmane.org/gmane.linux.kernel.stable/129187, but see Andy's reply: > > I think there are at least two ways that would work: > > a) If KVM incremented version as advertised: > > cpu = getcpu(); > pvti = pvti for cpu; > > ver1 = pvti->version; > check stable bit; > rdtsc_barrier, rdtsc, read scale, shift, etc. > if (getcpu() != cpu) retry; > if (pvti->version != ver1) retry; > > I think this is safe because, we're guaranteed that there was an > interval (between the two version reads) in which the vcpu we think > we're on was running and the kvmclock data was valid and marked > stable, and we know that the tsc we read came from that interval. > > Note: rdtscp isn't needed. If we're stable, is makes no difference > which cpu's tsc we actually read. 
> b) If version remains buggy but we use this migrations_from hack:
>
>    cpu = getcpu();
>    pvti = pvti for cpu;
>    m1 = pvti->migrations_from;
>    barrier();
>
>    ver1 = pvti->version;
>    check stable bit;
>    rdtsc_barrier, rdtsc, read scale, shift, etc.
>    if (getcpu() != cpu) retry;
>    if (pvti->version != ver1) retry; /* probably not really needed */
>
>    barrier();
>    if (pvti->migrations_from != m1) retry;
>
> This is just like (a), except that we're using a guest kernel hack to
> ensure that no one migrated off the vcpu during the version-protected
> critical section and that we were, in fact, on that vcpu at some point
> during that critical section. Once we've ensured that we were on
> pvti's associated vcpu for the entire time we were reading it, then we
> are protected by the existing versioning in the host.

(a) is not going to happen until 4.2, and there are too many buggy hosts
around, so we'd have to define new ABI that lets the guest distinguish a
buggy host from a fixed one.

(b) works now, is not invasive, and I still maintain that the cost is
negligible. I'm going to run for a while with CONFIG_SCHEDSTATS to see
how often you have a migration.

Anyhow, if the task migration notifier is reverted we have to disable
the whole vsyscall support altogether.

Paolo
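Andy's scheme (b) above boils down to a retry loop over a versioned structure plus a migration counter. A single-threaded C sketch of just the retry logic (the struct layout and the `try_read` helper are assumptions for illustration; the getcpu()/rdtsc steps are replaced by a plain field read, and the compiler/CPU barriers a real implementation needs are deliberately omitted):

```c
#include <assert.h>

/* Toy model of the per-vcpu pvclock data the guest reads.  In the real
 * ABI this is pvclock_vcpu_time_info; migrations_from is the counter
 * the disputed task migration notifier would bump. */
struct pvti {
	unsigned int version;           /* odd while the host is updating */
	unsigned int migrations_from;   /* bumped when the task migrates */
	unsigned long long time;        /* stands in for the tsc+scale result */
};

/* One attempt at scheme (b): returns 1 and fills *out on success, 0 if
 * the caller should retry.  Real code also rechecks getcpu() and uses
 * barrier()/rdtsc_barrier() around the reads; both are elided here. */
static int try_read(const struct pvti *p, unsigned long long *out)
{
	unsigned int m1 = p->migrations_from;
	unsigned int v1 = p->version;

	if (v1 & 1)
		return 0;                /* update in progress */

	unsigned long long t = p->time;  /* the "rdtsc, scale, shift" step */

	if (p->version != v1)
		return 0;                /* version moved: torn read */
	if (p->migrations_from != m1)
		return 0;                /* migrated off the vcpu mid-read */

	*out = t;
	return 1;
}
```

The point of the second check is exactly Andy's: if `migrations_from` did not change across the critical section, the reader was on the pvti's vcpu the whole time and the host's existing versioning protects it.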
On Fri, Apr 17, 2015 at 09:57:12PM +0200, Paolo Bonzini wrote: > > > >> From 4eb9d7132e1990c0586f28af3103675416d38974 Mon Sep 17 00:00:00 2001 > >> From: Paolo Bonzini <pbonzini@redhat.com> > >> Date: Fri, 17 Apr 2015 14:57:34 +0200 > >> Subject: [PATCH] sched: add CONFIG_TASK_MIGRATION_NOTIFIER > >> > >> The task migration notifier is only used in x86 paravirt. Make it > >> possible to compile it out. > >> > >> While at it, move some code around to ensure tmn is filled from CPU > >> registers. > >> > >> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> > >> --- > >> arch/x86/Kconfig | 1 + > >> init/Kconfig | 3 +++ > >> kernel/sched/core.c | 9 ++++++++- > >> 3 files changed, 12 insertions(+), 1 deletion(-) > >> > >> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig > >> index d43e7e1c784b..9af252c8698d 100644 > >> --- a/arch/x86/Kconfig > >> +++ b/arch/x86/Kconfig > >> @@ -649,6 +649,7 @@ if HYPERVISOR_GUEST > >> > >> config PARAVIRT > >> bool "Enable paravirtualization code" > >> + select TASK_MIGRATION_NOTIFIER > >> ---help--- > >> This changes the kernel so it can modify itself when it is run > >> under a hypervisor, potentially improving performance significantly > >> diff --git a/init/Kconfig b/init/Kconfig > >> index 3b9df1aa35db..891917123338 100644 > >> --- a/init/Kconfig > >> +++ b/init/Kconfig > >> @@ -2016,6 +2016,9 @@ source "block/Kconfig" > >> config PREEMPT_NOTIFIERS > >> bool > >> > >> +config TASK_MIGRATION_NOTIFIER > >> + bool > >> + > >> config PADATA > >> depends on SMP > >> bool > >> diff --git a/kernel/sched/core.c b/kernel/sched/core.c > >> index f9123a82cbb6..c07a53aa543c 100644 > >> --- a/kernel/sched/core.c > >> +++ b/kernel/sched/core.c > >> @@ -1016,12 +1016,14 @@ void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags) > >> rq_clock_skip_update(rq, true); > >> } > >> > >> +#ifdef CONFIG_TASK_MIGRATION_NOTIFIER > >> static ATOMIC_NOTIFIER_HEAD(task_migration_notifier); > >> > >> void 
register_task_migration_notifier(struct notifier_block *n) > >> { > >> atomic_notifier_chain_register(&task_migration_notifier, n); > >> } > >> +#endif > >> > >> #ifdef CONFIG_SMP > >> void set_task_cpu(struct task_struct *p, unsigned int new_cpu) > >> @@ -1053,18 +1055,23 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu) > >> trace_sched_migrate_task(p, new_cpu); > >> > >> if (task_cpu(p) != new_cpu) { > >> +#ifdef CONFIG_TASK_MIGRATION_NOTIFIER > >> struct task_migration_notifier tmn; > >> + int from_cpu = task_cpu(p); > >> +#endif > >> > >> if (p->sched_class->migrate_task_rq) > >> p->sched_class->migrate_task_rq(p, new_cpu); > >> p->se.nr_migrations++; > >> perf_sw_event_sched(PERF_COUNT_SW_CPU_MIGRATIONS, 1, 0); > >> > >> +#ifdef CONFIG_TASK_MIGRATION_NOTIFIER > >> tmn.task = p; > >> - tmn.from_cpu = task_cpu(p); > >> + tmn.from_cpu = from_cpu; > >> tmn.to_cpu = new_cpu; > >> > >> atomic_notifier_call_chain(&task_migration_notifier, 0, &tmn); > >> +#endif > >> } > >> > >> __set_task_cpu(p, new_cpu); > >> -- > >> 2.3.5 > > > > Paolo, > > > > Please revert the patch -- can fix properly in the host > > which also conforms the KVM guest/host documented protocol. > > > > Radim submitted a patch to kvm@ to split > > the kvm_write_guest in two with a barrier in between, i think. > > > > I'll review that patch. > > You're thinking of > http://article.gmane.org/gmane.linux.kernel.stable/129187, but see > Andy's reply: > > > > > I think there are at least two ways that would work: > > > > a) If KVM incremented version as advertised: > > > > cpu = getcpu(); > > pvti = pvti for cpu; > > > > ver1 = pvti->version; > > check stable bit; > > rdtsc_barrier, rdtsc, read scale, shift, etc. 
> > if (getcpu() != cpu) retry; > > if (pvti->version != ver1) retry; > > > > I think this is safe because, we're guaranteed that there was an > > interval (between the two version reads) in which the vcpu we think > > we're on was running and the kvmclock data was valid and marked > > stable, and we know that the tsc we read came from that interval. > > > > Note: rdtscp isn't needed. If we're stable, is makes no difference > > which cpu's tsc we actually read. > > > > b) If version remains buggy but we use this migrations_from hack: > > > > cpu = getcpu(); > > pvti = pvti for cpu; > > m1 = pvti->migrations_from; > > barrier(); > > > > ver1 = pvti->version; > > check stable bit; > > rdtsc_barrier, rdtsc, read scale, shift, etc. > > if (getcpu() != cpu) retry; > > if (pvti->version != ver1) retry; /* probably not really needed */ > > > > barrier(); > > if (pvti->migrations_from != m1) retry; > > > > This is just like (a), except that we're using a guest kernel hack to > > ensure that no one migrated off the vcpu during the version-protected > > critical section and that we were, in fact, on that vcpu at some point > > during that critical section. Once we've ensured that we were on > > pvti's associated vcpu for the entire time we were reading it, then we > > are protected by the existing versioning in the host. > > (a) is not going to happen until 4.2, and there are too many buggy hosts > around so we'd have to define new ABI that lets the guest distinguish a > buggy host from a fixed one. > > (b) works now, is not invasive, and I still maintain that the cost is > negligible. I'm going to run for a while with CONFIG_SCHEDSTATS to see > how often you have a migration. > > Anyhow if the task migration notifier is reverted we have to disable the > whole vsyscall support altogether. The bug which this is fixing is very rare, have no memory of a report. In fact, its even difficult to create a synthetic reproducer. 
You need:

1) update of kvmclock data structure (happens once every 5 minutes).
2) migration of task from vcpu1 to vcpu2 back to vcpu1.
3) a data race between kvm_write_guest (string copy) and 2 above.

At the same time.
On Fri, Apr 17, 2015 at 1:18 PM, Marcelo Tosatti <mtosatti@redhat.com> wrote: > On Fri, Apr 17, 2015 at 09:57:12PM +0200, Paolo Bonzini wrote: >> >> >> >> From 4eb9d7132e1990c0586f28af3103675416d38974 Mon Sep 17 00:00:00 2001 >> >> From: Paolo Bonzini <pbonzini@redhat.com> >> >> Date: Fri, 17 Apr 2015 14:57:34 +0200 >> >> Subject: [PATCH] sched: add CONFIG_TASK_MIGRATION_NOTIFIER >> >> >> >> The task migration notifier is only used in x86 paravirt. Make it >> >> possible to compile it out. >> >> >> >> While at it, move some code around to ensure tmn is filled from CPU >> >> registers. >> >> >> >> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> >> >> --- >> >> arch/x86/Kconfig | 1 + >> >> init/Kconfig | 3 +++ >> >> kernel/sched/core.c | 9 ++++++++- >> >> 3 files changed, 12 insertions(+), 1 deletion(-) >> >> >> >> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig >> >> index d43e7e1c784b..9af252c8698d 100644 >> >> --- a/arch/x86/Kconfig >> >> +++ b/arch/x86/Kconfig >> >> @@ -649,6 +649,7 @@ if HYPERVISOR_GUEST >> >> >> >> config PARAVIRT >> >> bool "Enable paravirtualization code" >> >> + select TASK_MIGRATION_NOTIFIER >> >> ---help--- >> >> This changes the kernel so it can modify itself when it is run >> >> under a hypervisor, potentially improving performance significantly >> >> diff --git a/init/Kconfig b/init/Kconfig >> >> index 3b9df1aa35db..891917123338 100644 >> >> --- a/init/Kconfig >> >> +++ b/init/Kconfig >> >> @@ -2016,6 +2016,9 @@ source "block/Kconfig" >> >> config PREEMPT_NOTIFIERS >> >> bool >> >> >> >> +config TASK_MIGRATION_NOTIFIER >> >> + bool >> >> + >> >> config PADATA >> >> depends on SMP >> >> bool >> >> diff --git a/kernel/sched/core.c b/kernel/sched/core.c >> >> index f9123a82cbb6..c07a53aa543c 100644 >> >> --- a/kernel/sched/core.c >> >> +++ b/kernel/sched/core.c >> >> @@ -1016,12 +1016,14 @@ void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags) >> >> rq_clock_skip_update(rq, true); >> >> } >> >> >> >> +#ifdef 
CONFIG_TASK_MIGRATION_NOTIFIER >> >> static ATOMIC_NOTIFIER_HEAD(task_migration_notifier); >> >> >> >> void register_task_migration_notifier(struct notifier_block *n) >> >> { >> >> atomic_notifier_chain_register(&task_migration_notifier, n); >> >> } >> >> +#endif >> >> >> >> #ifdef CONFIG_SMP >> >> void set_task_cpu(struct task_struct *p, unsigned int new_cpu) >> >> @@ -1053,18 +1055,23 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu) >> >> trace_sched_migrate_task(p, new_cpu); >> >> >> >> if (task_cpu(p) != new_cpu) { >> >> +#ifdef CONFIG_TASK_MIGRATION_NOTIFIER >> >> struct task_migration_notifier tmn; >> >> + int from_cpu = task_cpu(p); >> >> +#endif >> >> >> >> if (p->sched_class->migrate_task_rq) >> >> p->sched_class->migrate_task_rq(p, new_cpu); >> >> p->se.nr_migrations++; >> >> perf_sw_event_sched(PERF_COUNT_SW_CPU_MIGRATIONS, 1, 0); >> >> >> >> +#ifdef CONFIG_TASK_MIGRATION_NOTIFIER >> >> tmn.task = p; >> >> - tmn.from_cpu = task_cpu(p); >> >> + tmn.from_cpu = from_cpu; >> >> tmn.to_cpu = new_cpu; >> >> >> >> atomic_notifier_call_chain(&task_migration_notifier, 0, &tmn); >> >> +#endif >> >> } >> >> >> >> __set_task_cpu(p, new_cpu); >> >> -- >> >> 2.3.5 >> > >> > Paolo, >> > >> > Please revert the patch -- can fix properly in the host >> > which also conforms the KVM guest/host documented protocol. >> > >> > Radim submitted a patch to kvm@ to split >> > the kvm_write_guest in two with a barrier in between, i think. >> > >> > I'll review that patch. >> >> You're thinking of >> http://article.gmane.org/gmane.linux.kernel.stable/129187, but see >> Andy's reply: >> >> > >> > I think there are at least two ways that would work: >> > >> > a) If KVM incremented version as advertised: >> > >> > cpu = getcpu(); >> > pvti = pvti for cpu; >> > >> > ver1 = pvti->version; >> > check stable bit; >> > rdtsc_barrier, rdtsc, read scale, shift, etc. 
>> > if (getcpu() != cpu) retry; >> > if (pvti->version != ver1) retry; >> > >> > I think this is safe because, we're guaranteed that there was an >> > interval (between the two version reads) in which the vcpu we think >> > we're on was running and the kvmclock data was valid and marked >> > stable, and we know that the tsc we read came from that interval. >> > >> > Note: rdtscp isn't needed. If we're stable, is makes no difference >> > which cpu's tsc we actually read. >> > >> > b) If version remains buggy but we use this migrations_from hack: >> > >> > cpu = getcpu(); >> > pvti = pvti for cpu; >> > m1 = pvti->migrations_from; >> > barrier(); >> > >> > ver1 = pvti->version; >> > check stable bit; >> > rdtsc_barrier, rdtsc, read scale, shift, etc. >> > if (getcpu() != cpu) retry; >> > if (pvti->version != ver1) retry; /* probably not really needed */ >> > >> > barrier(); >> > if (pvti->migrations_from != m1) retry; >> > >> > This is just like (a), except that we're using a guest kernel hack to >> > ensure that no one migrated off the vcpu during the version-protected >> > critical section and that we were, in fact, on that vcpu at some point >> > during that critical section. Once we've ensured that we were on >> > pvti's associated vcpu for the entire time we were reading it, then we >> > are protected by the existing versioning in the host. >> >> (a) is not going to happen until 4.2, and there are too many buggy hosts >> around so we'd have to define new ABI that lets the guest distinguish a >> buggy host from a fixed one. >> >> (b) works now, is not invasive, and I still maintain that the cost is >> negligible. I'm going to run for a while with CONFIG_SCHEDSTATS to see >> how often you have a migration. >> >> Anyhow if the task migration notifier is reverted we have to disable the >> whole vsyscall support altogether. > > The bug which this is fixing is very rare, have no memory of a report. > > In fact, its even difficult to create a synthetic reproducer. 
You need: > > 1) update of kvmclock data structure (happens once every 5 minutes). > 2) migration of task from vcpu1 to vcpu2 back to vcpu1. > 3) a data race between kvm_write_guest (string copy) and > 2 above. > > At the same time. Something maybe worth considering: On my box, vclock_gettime using kvm-clock is about 40 ns. An empty syscall is about 33 ns. clock_gettime *should* be around 17 ns. The clock_gettime syscall is about 73 ns. Could we figure out why clock_gettime (the syscall) is so slow, fix it, and then not be so sad about removing the existing kvm-clock vdso code? Once we fix the host for real (add a new feature bit and make it global instead of per-cpu), then we could have a really fast vdso implementation, too. --Andy > >
On Fri, Apr 17, 2015 at 4:39 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>
> On my box, vclock_gettime using kvm-clock is about 40 ns. An empty
> syscall is about 33 ns. clock_gettime *should* be around 17 ns.
>
> The clock_gettime syscall is about 73 ns.
>
> Could we figure out why clock_gettime (the syscall) is so slow

If we only could profile some random program (let's call it "a.out")
that did syscall(__NR_clock_gettime) a couple million times.

Oh, lookie here, Santa came around:

    21.83%  [k] system_call
    12.85%  [.] syscall
     9.76%  [k] __audit_syscall_exit
     9.55%  [k] copy_user_enhanced_fast_string
     4.68%  [k] __getnstimeofday64
     4.08%  [k] syscall_trace_enter_phase1
     3.85%  [k] __audit_syscall_entry
     3.77%  [k] unroll_tree_refs
     3.15%  [k] sys_clock_gettime
     2.92%  [k] int_very_careful
     2.73%  [.] main
     2.35%  [k] syscall_trace_leave
     2.28%  [k] read_tsc
     1.73%  [k] int_restore_rest
     1.73%  [k] int_with_check
     1.48%  [k] syscall_return
     1.32%  [k] dput
     1.24%  [k] system_call_fastpath
     1.21%  [k] syscall_return_via_sysret
     1.21%  [k] tracesys
     0.81%  [k] do_audit_syscall_entry
     0.80%  [k] current_kernel_time
     0.73%  [k] getnstimeofday64
     0.68%  [k] path_put
     0.66%  [k] posix_clock_realtime_get
     0.61%  [k] int_careful
     0.60%  [k] mntput
     0.49%  [k] kfree
     0.36%  [k] _copy_to_user
     0.31%  [k] int_ret_from_sys_call_irqs_off

looks to me like it's spending a lot of time in system call auditing.
Which makes no sense to me, since none of this should be triggering any
auditing. And there's a lot of time in low-level kernel system call
assembly code.

If I only remembered the name of the crazy person who said "Ok" when I
suggested he just be the maintainer of the code, since he has spent a
lot of time sending patches for it. Something like Amdy Letorsky. No,
that wasn't it. Hmm. It's on the tip of my tongue.

Oh well. Maybe somebody can remember the guy's name. It's familiar for
some reason.

Andy?
Linus -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, Apr 17, 2015 at 2:28 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Fri, Apr 17, 2015 at 4:39 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>
>> On my box, vclock_gettime using kvm-clock is about 40 ns.  An empty
>> syscall is about 33 ns.  clock_gettime *should* be around 17 ns.
>>
>> The clock_gettime syscall is about 73 ns.
>>
>> Could we figure out why clock_gettime (the syscall) is so slow
>
> If we only could profile some random program (let's call it "a.out")
> that did the syscall(__NR_gettime_syscall) a couple million times.
>
> Oh, lookie here, Santa came around:
>
>   21.83%  [k] system_call
>   12.85%  [.] syscall
>    9.76%  [k] __audit_syscall_exit
>    9.55%  [k] copy_user_enhanced_fast_string
>    4.68%  [k] __getnstimeofday64
>    4.08%  [k] syscall_trace_enter_phase1
>    3.85%  [k] __audit_syscall_entry
>    3.77%  [k] unroll_tree_refs
>    3.15%  [k] sys_clock_gettime
>    2.92%  [k] int_very_careful
>    2.73%  [.] main
>    2.35%  [k] syscall_trace_leave
>    2.28%  [k] read_tsc
>    1.73%  [k] int_restore_rest
>    1.73%  [k] int_with_check
>    1.48%  [k] syscall_return
>    1.32%  [k] dput
>    1.24%  [k] system_call_fastpath
>    1.21%  [k] syscall_return_via_sysret
>    1.21%  [k] tracesys
>    0.81%  [k] do_audit_syscall_entry
>    0.80%  [k] current_kernel_time
>    0.73%  [k] getnstimeofday64
>    0.68%  [k] path_put
>    0.66%  [k] posix_clock_realtime_get
>    0.61%  [k] int_careful
>    0.60%  [k] mntput
>    0.49%  [k] kfree
>    0.36%  [k] _copy_to_user
>    0.31%  [k] int_ret_from_sys_call_irqs_off
>
> looks to me like it's spending a lot of time in system call auditing.
> Which makes no sense to me, since none of this should be triggering
> any auditing.  And there's a lot of time in low-level kernel system
> call assembly code.

Muahaha.  The auditors have invaded your system.  (I did my little
benchmark with a more sensible configuration -- see way below.)

Can you send the output of:

# auditctl -s
# auditctl -l

Are you, perchance, using Fedora?

I lobbied rather heavily, and successfully, to get the default
configuration to stop auditing.  Unfortunately, the fix wasn't
retroactive, so, unless you have a very fresh install, you might want
to apply the fix yourself:

https://fedorahosted.org/fesco/ticket/1311

> If I only remembered the name of the crazy person who said "Ok" when I
> suggested he just be the maintainer of the code since he has spent a lot
> of time sending patches for it.  Something like Amdy Letorsky.  No, that
> wasn't it.  Hmm.  It's on the tip of my tongue.
>
> Oh well.  Maybe somebody can remember the guy's name.  It's familiar for
> some reason.  Andy?

Amdy Lumirtowsky thinks he meant to attach a condition to his
maintainerish activities: he will do his best to keep the audit code
*out* of the low-level stuff, but he's going to try to avoid ever
touching the audit code itself, because if he ever had to change it,
he might accidentally delete the entire file.

Seriously, wasn't there a TAINT_PERFORMANCE thing proposed at some
point?  I would love auditing to set some really loud global warning
that you've just done a Bad Thing (tm) performance-wise by enabling
it.

Back to timing.  With kvm-clock, I see:

 23.80%  timing_test_64  [kernel.kallsyms]  [k] pvclock_clocksource_read
 15.57%  timing_test_64  libc-2.20.so       [.] syscall
 12.39%  timing_test_64  [kernel.kallsyms]  [k] system_call
 12.35%  timing_test_64  [kernel.kallsyms]  [k] copy_user_generic_string
 10.95%  timing_test_64  [kernel.kallsyms]  [k] system_call_after_swapgs
  7.35%  timing_test_64  [kernel.kallsyms]  [k] ktime_get_ts64
  6.20%  timing_test_64  [kernel.kallsyms]  [k] sys_clock_gettime
  3.62%  timing_test_64  [kernel.kallsyms]  [k] system_call_fastpath
  2.08%  timing_test_64  timing_test_64     [.] main
  1.72%  timing_test_64  timing_test_64     [.] syscall@plt
  1.58%  timing_test_64  [kernel.kallsyms]  [k] kvm_clock_get_cycles
  1.22%  timing_test_64  [kernel.kallsyms]  [k] _copy_to_user
  0.65%  timing_test_64  [kernel.kallsyms]  [k] posix_ktime_get_ts
  0.13%  timing_test_64  [kernel.kallsyms]  [k] apic_timer_interrupt

We've got some silly indirection, a uaccess that probably didn't get
optimized very well, and the terrifying function
pvclock_clocksource_read.

By comparison, using tsc:

 19.51%  timing_test_64  libc-2.20.so       [.] syscall
 15.52%  timing_test_64  [kernel.kallsyms]  [k] system_call
 15.25%  timing_test_64  [kernel.kallsyms]  [k] copy_user_generic_string
 14.34%  timing_test_64  [kernel.kallsyms]  [k] system_call_after_swapgs
  8.66%  timing_test_64  [kernel.kallsyms]  [k] ktime_get_ts64
  6.95%  timing_test_64  [kernel.kallsyms]  [k] sys_clock_gettime
  5.93%  timing_test_64  [kernel.kallsyms]  [k] native_read_tsc
  5.12%  timing_test_64  [kernel.kallsyms]  [k] system_call_fastpath
  2.62%  timing_test_64  timing_test_64     [.] main

That's better, although the uaccess silliness is still there.  (No
PEBS -- I don't think it works right in KVM.)

--Andy

>
>            Linus
On Fri, Apr 17, 2015 at 5:42 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>
> Muahaha.  The auditors have invaded your system.  (I did my little
> benchmark with a more sensible configuration -- see way below.)
>
> Can you send the output of:
>
> # auditctl -s
> # auditctl -l

# auditctl -s
enabled 1
flag 1
pid 822
rate_limit 0
backlog_limit 320
lost 0
backlog 0
backlog_wait_time 60000
loginuid_immutable 0 unlocked
# auditctl -l
No rules

> Are you, perchance, using Fedora?

F21.  Yup.

I used to just disable auditing in the kernel entirely, but then I
ended up deciding that I need to run something closer to the broken
Fedora config (selinux in particular) in order to actually optimize
the real-world pathname handling situation rather than the _sane_ one.
Oh well.  I think audit support got enabled at the same time in my
kernels because I ended up using the default config and then taking
out the truly crazy stuff without noticing AUDITSYSCALL.

> I lobbied rather heavily, and successfully, to get the default
> configuration to stop auditing.  Unfortunately, the fix wasn't
> retroactive, so, unless you have a very fresh install, you might want
> to apply the fix yourself:

Is that fix happening in Fedora going forward, though?  Like F22?

> Amdy Lumirtowsky thinks he meant to attach a condition to his
> maintainerish activities: he will do his best to keep the audit code
> *out* of the low-level stuff, but he's going to try to avoid ever
> touching the audit code itself, because if he ever had to change it,
> he might accidentally delete the entire file.

Oooh.  That would be _such_ a shame.

Can we please do it by mistake?  "Oops, my fingers slipped"

> Seriously, wasn't there a TAINT_PERFORMANCE thing proposed at some
> point?  I would love auditing to set some really loud global warning
> that you've just done a Bad Thing (tm) performance-wise by enabling
> it.

Or even just a big fat warning in dmesg the first time auditing triggers.

> Back to timing.  With kvm-clock, I see:
>
>  23.80%  timing_test_64  [kernel.kallsyms]  [k] pvclock_clocksource_read

Oh wow.  How can that possibly be sane?

Isn't the *whole* point of pvclock_clocksource_read() to be a native
rdtsc with scaling?  How does it cause that kind of insane pain?

Oh well.  Some paravirt person would need to look and care.

            Linus
On Fri, Apr 17, 2015 at 3:04 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Fri, Apr 17, 2015 at 5:42 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>
>> Muahaha.  The auditors have invaded your system.  (I did my little
>> benchmark with a more sensible configuration -- see way below.)
>>
>> Can you send the output of:
>>
>> # auditctl -s
>> # auditctl -l
>
> # auditctl -s
> enabled 1
> flag 1
> pid 822
> rate_limit 0
> backlog_limit 320
> lost 0
> backlog 0
> backlog_wait_time 60000
> loginuid_immutable 0 unlocked
> # auditctl -l
> No rules

Yes, "No rules" doesn't mean "don't audit".

>
>> Are you, perchance, using Fedora?
>
> F21.  Yup.
>
> I used to just disable auditing in the kernel entirely, but then I
> ended up deciding that I need to run something closer to the broken
> Fedora config (selinux in particular) in order to actually optimize
> the real-world pathname handling situation rather than the _sane_ one.
> Oh well.  I think audit support got enabled at the same time in my
> kernels because I ended up using the default config and then taking
> out the truly crazy stuff without noticing AUDITSYSCALL.
>
>> I lobbied rather heavily, and successfully, to get the default
>> configuration to stop auditing.  Unfortunately, the fix wasn't
>> retroactive, so, unless you have a very fresh install, you might want
>> to apply the fix yourself:
>
> Is that fix happening in Fedora going forward, though?  Like F22?

It's in F21, actually.  Unfortunately, the audit package is really
weird.  It installs /etc/audit/rules.d/audit.rules, containing:

# This file contains the auditctl rules that are loaded
# whenever the audit daemon is started via the initscripts.
# The rules are simply the parameters that would be passed
# to auditctl.

# First rule - delete all
-D

# This suppresses syscall auditing for all tasks started
# with this rule in effect.  Remove it if you need syscall
# auditing.
-a task,never

Then, if it's a fresh install (i.e. /etc/audit/audit.rules doesn't
exist) it copies that file to /etc/audit/audit.rules post-install.
(No, I don't know why it works this way.)

IOW, you might want to copy your /etc/audit/rules.d/audit.rules to
/etc/audit/audit.rules.  I think you need to reboot to get the full
effect.  You could apply the rule manually (or maybe restart the audit
service), but it will only affect newly-started tasks.

>
>> Amdy Lumirtowsky thinks he meant to attach a condition to his
>> maintainerish activities: he will do his best to keep the audit code
>> *out* of the low-level stuff, but he's going to try to avoid ever
>> touching the audit code itself, because if he ever had to change it,
>> he might accidentally delete the entire file.
>
> Oooh.  That would be _such_ a shame.
>
> Can we please do it by mistake?  "Oops, my fingers slipped"
>
>> Seriously, wasn't there a TAINT_PERFORMANCE thing proposed at some
>> point?  I would love auditing to set some really loud global warning
>> that you've just done a Bad Thing (tm) performance-wise by enabling
>> it.
>
> Or even just a big fat warning in dmesg the first time auditing triggers.
>
>> Back to timing.  With kvm-clock, I see:
>>
>>  23.80%  timing_test_64  [kernel.kallsyms]  [k] pvclock_clocksource_read
>
> Oh wow.  How can that possibly be sane?
>
> Isn't the *whole* point of pvclock_clocksource_read() to be a native
> rdtsc with scaling?  How does it cause that kind of insane pain?

An unnecessarily complicated protocol, a buggy host implementation,
and an unnecessarily complicated guest implementation :(

>
> Oh well.  Some paravirt person would need to look and care.

The code there is a bit scary.

--Andy

>
>            Linus
On Fri, Apr 17, 2015 at 03:25:28PM -0700, Andy Lutomirski wrote:
> On Fri, Apr 17, 2015 at 3:04 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> > On Fri, Apr 17, 2015 at 5:42 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> >>
> >> Muahaha.  The auditors have invaded your system.  (I did my little
> >> benchmark with a more sensible configuration -- see way below.)
> >>
> >> Can you send the output of:
> >>
> >> # auditctl -s
> >> # auditctl -l
> >
> > # auditctl -s
> > enabled 1
> > flag 1
> > pid 822
> > rate_limit 0
> > backlog_limit 320
> > lost 0
> > backlog 0
> > backlog_wait_time 60000
> > loginuid_immutable 0 unlocked
> > # auditctl -l
> > No rules
>
> Yes, "No rules" doesn't mean "don't audit".
>
> >
> >> Are you, perchance, using Fedora?
> >
> > F21.  Yup.
> >
> > I used to just disable auditing in the kernel entirely, but then I
> > ended up deciding that I need to run something closer to the broken
> > Fedora config (selinux in particular) in order to actually optimize
> > the real-world pathname handling situation rather than the _sane_ one.
> > Oh well.  I think audit support got enabled at the same time in my
> > kernels because I ended up using the default config and then taking
> > out the truly crazy stuff without noticing AUDITSYSCALL.
> >
> >> I lobbied rather heavily, and successfully, to get the default
> >> configuration to stop auditing.  Unfortunately, the fix wasn't
> >> retroactive, so, unless you have a very fresh install, you might want
> >> to apply the fix yourself:
> >
> > Is that fix happening in Fedora going forward, though?  Like F22?
>
> It's in F21, actually.  Unfortunately, the audit package is really
> weird.  It installs /etc/audit/rules.d/audit.rules, containing:
>
> # This file contains the auditctl rules that are loaded
> # whenever the audit daemon is started via the initscripts.
> # The rules are simply the parameters that would be passed
> # to auditctl.
>
> # First rule - delete all
> -D
>
> # This suppresses syscall auditing for all tasks started
> # with this rule in effect.  Remove it if you need syscall
> # auditing.
> -a task,never
>
> Then, if it's a fresh install (i.e. /etc/audit/audit.rules doesn't
> exist) it copies that file to /etc/audit/audit.rules post-install.
> (No, I don't know why it works this way.)
>
> IOW, you might want to copy your /etc/audit/rules.d/audit.rules to
> /etc/audit/audit.rules.  I think you need to reboot to get the full
> effect.  You could apply the rule manually (or maybe restart the audit
> service), but it will only affect newly-started tasks.
>
> >
> >> Amdy Lumirtowsky thinks he meant to attach a condition to his
> >> maintainerish activities: he will do his best to keep the audit code
> >> *out* of the low-level stuff, but he's going to try to avoid ever
> >> touching the audit code itself, because if he ever had to change it,
> >> he might accidentally delete the entire file.
> >
> > Oooh.  That would be _such_ a shame.
> >
> > Can we please do it by mistake?  "Oops, my fingers slipped"
> >
> >> Seriously, wasn't there a TAINT_PERFORMANCE thing proposed at some
> >> point?  I would love auditing to set some really loud global warning
> >> that you've just done a Bad Thing (tm) performance-wise by enabling
> >> it.
> >
> > Or even just a big fat warning in dmesg the first time auditing triggers.
> >
> >> Back to timing.  With kvm-clock, I see:
> >>
> >>  23.80%  timing_test_64  [kernel.kallsyms]  [k] pvclock_clocksource_read
> >
> > Oh wow.  How can that possibly be sane?
> >
> > Isn't the *whole* point of pvclock_clocksource_read() to be a native
> > rdtsc with scaling?  How does it cause that kind of insane pain?
>
> An unnecessarily complicated protocol, a buggy host implementation,
> and an unnecessarily complicated guest implementation :(

How about starting by removing the unnecessary rdtsc-barrier?
On 18/04/2015 00:25, Andy Lutomirski wrote:
>> Isn't the *whole* point of pvclock_clocksource_read() to be a native
>> rdtsc with scaling?  How does it cause that kind of insane pain?

It's possible that your machine ends up with PVCLOCK_TSC_STABLE_BIT
clear, so you get an atomic cmpxchg in addition (and associated
cacheline bouncing, since anything reading the clocksource in the
kernel will cause that variable to bounce).  But that's not too common
on recent machines.  Is the vsyscall faster for you, or does it
degenerate to the syscall?  If so, you have PVCLOCK_TSC_STABLE_BIT
clear.

> An unnecessarily complicated protocol, a buggy host implementation,
> and an unnecessarily complicated guest implementation :(

pvclock_clocksource_read() itself is not scary and need not worry
about the buggy host implementation (preempt_disable makes things
easy).  It's the vDSO stuff that has the nice things.

There are a few micro-optimizations that we could do (the guest
implementation _is_ unnecessarily baroque), but it may not be enough
if the rdtsc_barrier()s (lfence) are the performance killers.  Will
look closely on Monday.

Paolo

>> Oh well.  Some paravirt person would need to look and care.
>
> The code there is a bit scary.
>
> --Andy
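[The seqcount-like protocol referred to in this thread can be sketched roughly as below.  This is a simplified, host-independent model of the guest-side read loop, not the real kernel code: the struct layout is abbreviated, rdtsc is stubbed out with a counter, and the required barriers are only noted in a comment.]

```c
#include <stdint.h>

/* Abbreviated model of the per-vcpu time info the host publishes. */
struct pvclock_vcpu_time_info {
	uint32_t version;		/* odd while the host is updating */
	uint64_t tsc_timestamp;
	uint64_t system_time;
	uint32_t tsc_to_system_mul;
	int8_t   tsc_shift;
};

/* Stand-in for rdtsc so the sketch runs anywhere. */
static uint64_t fake_rdtsc(void)
{
	static uint64_t t;
	return t += 1000;
}

/* Retry until the version field is even and unchanged across the read:
 * the same seqcount-like idea the guest implementation uses. */
static uint64_t pvclock_read(const volatile struct pvclock_vcpu_time_info *ti)
{
	uint32_t version;
	uint64_t delta, ns;

	do {
		version = ti->version;
		/* the real code needs rdtsc_barrier()/smp_rmb() here */
		delta = fake_rdtsc() - ti->tsc_timestamp;
		if (ti->tsc_shift >= 0)
			delta <<= ti->tsc_shift;
		else
			delta >>= -ti->tsc_shift;
		ns = ti->system_time +
		     ((delta * ti->tsc_to_system_mul) >> 32);
	} while ((version & 1) || version != ti->version);

	return ns;
}
```

[The lfence-based barriers elided above are exactly the ones the thread suspects of being the performance killers.]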
On 17/04/2015 22:18, Marcelo Tosatti wrote:
> The bug which this is fixing is very rare, have no memory of a report.
>
> In fact, its even difficult to create a synthetic reproducer.

But then why was the task migration notifier even in Jeremy's original
code for Xen?  Was it supposed to work even on non-synchronized TSC?

If that's the case, then it could be reverted indeed; but then why did
you commit this patch to 4.1?  Did you think of something that would
cause the seqcount-like protocol to fail, and that turned out not to be
the case later?  I was only following the mailing list sparsely in March.

Paolo
On Mon, Apr 20, 2015 at 9:59 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 17/04/2015 22:18, Marcelo Tosatti wrote:
>> The bug which this is fixing is very rare, have no memory of a report.
>>
>> In fact, its even difficult to create a synthetic reproducer.
>
> But then why was the task migration notifier even in Jeremy's original
> code for Xen?  Was it supposed to work even on non-synchronized TSC?
>
> If that's the case, then it could be reverted indeed; but then why did
> you commit this patch to 4.1?  Did you think of something that would
> cause the seqcount-like protocol to fail, and that turned out not to be
> the case later?  I was only following the mailing list sparsely in March.

I don't think anyone ever tried that hard to test this stuff.  There
was an infinite loop that Firefox was triggering as a KVM guest
somewhat reliably until a couple months ago in the same vdso code.  :(

--Andy
On Mon, Apr 20, 2015 at 06:59:04PM +0200, Paolo Bonzini wrote:
>
> On 17/04/2015 22:18, Marcelo Tosatti wrote:
> > The bug which this is fixing is very rare, have no memory of a report.
> >
> > In fact, its even difficult to create a synthetic reproducer.
>
> But then why was the task migration notifier even in Jeremy's original
> code for Xen?

To cover for the vcpu1 -> vcpu2 -> vcpu1 case, i believe.

> Was it supposed to work even on non-synchronized TSC?

Yes it is supposed to work on non-synchronized TSC.

> If that's the case, then it could be reverted indeed; but then why did
> you commit this patch to 4.1?

Because it fixes the problem Andy reported (see Subject: KVM: x86: fix
kvmclock write race (v2) on kvm@).  As long as you have Radim's
fix on top.

> Did you think of something that would
> cause the seqcount-like protocol to fail, and that turned out not to be
> the case later?  I was only following the mailing list sparsely in March.

No.
On 22/04/2015 22:56, Marcelo Tosatti wrote:
>>> But then why was the task migration notifier even in Jeremy's original
>>> code for Xen?
> To cover for the vcpu1 -> vcpu2 -> vcpu1 case, i believe.

Ok, to cover it for non-synchronized TSC.  While KVM requires
synchronized TSC.

>> If that's the case, then it could be reverted indeed; but then why did
>> you commit this patch to 4.1?
>
> Because it fixes the problem Andy reported (see Subject: KVM: x86: fix
> kvmclock write race (v2) on kvm@).  As long as you have Radim's
> fix on top.

But if it's so rare, and it was known that fixing the host protocol was
just as good a solution, why was the guest fix committed?

I'm just trying to understand.  I am worried that this patch was rushed
in; so far I had assumed it wasn't (a revert of a revert is rare enough
that you don't do it lightly...) but maybe I was wrong.

Right now I cannot even decide whether to revert it (and please Peter in
the process :)) or submit the Kconfig symbol patch officially.

Paolo
On Mon, Apr 20, 2015 at 01:27:58PM -0700, Andy Lutomirski wrote:
> On Mon, Apr 20, 2015 at 9:59 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> >
> > On 17/04/2015 22:18, Marcelo Tosatti wrote:
> >> The bug which this is fixing is very rare, have no memory of a report.
> >>
> >> In fact, its even difficult to create a synthetic reproducer.
> >
> > But then why was the task migration notifier even in Jeremy's original
> > code for Xen?  Was it supposed to work even on non-synchronized TSC?
> >
> > If that's the case, then it could be reverted indeed; but then why did
> > you commit this patch to 4.1?  Did you think of something that would
> > cause the seqcount-like protocol to fail, and that turned out not to be
> > the case later?  I was only following the mailing list sparsely in March.
>
> I don't think anyone ever tried that hard to test this stuff.  There
> was an infinite loop that Firefox was triggering as a KVM guest
> somewhat reliably until a couple months ago in the same vdso code.  :(

https://bugzilla.redhat.com/show_bug.cgi?id=1174664

--- Comment #5 from Juan Quintela <quintela@redhat.com> ---

Another round

# dmesg | grep msr
[ 0.000000] kvm-clock: Using msrs 4b564d01 and 4b564d00
[ 0.000000] kvm-clock: cpu 0, msr 1:1ffd8001, primary cpu clock
[ 0.000000] kvm-stealtime: cpu 0, msr 11fc0d100
[ 0.041174] kvm-clock: cpu 1, msr 1:1ffd8041, secondary cpu clock
[ 0.053011] kvm-stealtime: cpu 1, msr 11fc8d100

After start:

[root@trasno yum.repos.d]# virsh qemu-monitor-command --hmp browser 'xp /8x 0x1ffd8000'
000000001ffd8000: 0x3b401060 0xfffc7f4b 0x3b42d040 0xfffc7f4b
000000001ffd8010: 0x3b42d460 0xfffc7f4b 0x3b42d4c0 0xfffc7f4b
[root@trasno yum.repos.d]# virsh qemu-monitor-command --hmp browser 'xp /8x 0x1ffd8040'
000000001ffd8040: 0x3b42d700 0xfffc7f4b 0x3b42d760 0xfffc7f4b
000000001ffd8050: 0x3b42d7c0 0xfffc7f4b 0x3b42d820 0xfffc7f4b

When firefox hangs:

[root@trasno yum.repos.d]# virsh qemu-monitor-command --hmp browser 'xp /8x 0x1ffd8000'
000000001ffd8000: 0x5a5a5a5a 0x5a5a5a5a 0x5a5a5a5a 0x5a5a5a5a
000000001ffd8010: 0x5a5a5a5a 0x5a5a5a5a 0x5a5a5a5a 0x5a5a5a5a
[root@trasno yum.repos.d]# virsh qemu-monitor-command --hmp browser 'xp /8x 0x1ffd8040'
000000001ffd8040: 0x5a5a5a5a 0x5a5a5a5a 0x5a5a5a5a 0x5a5a5a5a
000000001ffd8050: 0x5a5a5a5a 0x5a5a5a5a 0x5a5a5a5a 0x5a5a5a5a
On Wed, Apr 22, 2015 at 11:01:49PM +0200, Paolo Bonzini wrote:
>
> On 22/04/2015 22:56, Marcelo Tosatti wrote:
> >>> But then why was the task migration notifier even in Jeremy's original
> >>> code for Xen?
> > To cover for the vcpu1 -> vcpu2 -> vcpu1 case, i believe.
>
> Ok, to cover it for non-synchronized TSC.  While KVM requires
> synchronized TSC.
>
> >> If that's the case, then it could be reverted indeed; but then why did
> >> you commit this patch to 4.1?
> >
> > Because it fixes the problem Andy reported (see Subject: KVM: x86: fix
> > kvmclock write race (v2) on kvm@).  As long as you have Radim's
> > fix on top.
>
> But if it's so rare, and it was known that fixing the host protocol was
> just as good a solution, why was the guest fix committed?

I don't know.  Should have fixed the host protocol.

> I'm just trying to understand.  I am worried that this patch was rushed
> in; so far I had assumed it wasn't (a revert of a revert is rare enough
> that you don't do it lightly...) but maybe I was wrong.

Yes it was rushed in.

> Right now I cannot even decide whether to revert it (and please Peter in
> the process :)) or submit the Kconfig symbol patch officially.
>
> Paolo
On 22/04/2015 23:21, Marcelo Tosatti wrote:
> On Mon, Apr 20, 2015 at 01:27:58PM -0700, Andy Lutomirski wrote:
>> On Mon, Apr 20, 2015 at 9:59 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>>>
>>> On 17/04/2015 22:18, Marcelo Tosatti wrote:
>>>> The bug which this is fixing is very rare, have no memory of a report.
>>>>
>>>> In fact, its even difficult to create a synthetic reproducer.
>>>
>>> But then why was the task migration notifier even in Jeremy's original
>>> code for Xen?  Was it supposed to work even on non-synchronized TSC?
>>>
>>> If that's the case, then it could be reverted indeed; but then why did
>>> you commit this patch to 4.1?  Did you think of something that would
>>> cause the seqcount-like protocol to fail, and that turned out not to be
>>> the case later?  I was only following the mailing list sparsely in March.
>>
>> I don't think anyone ever tried that hard to test this stuff.  There
>> was an infinite loop that Firefox was triggering as a KVM guest
>> somewhat reliably until a couple months ago in the same vdso code.  :(
>
> https://bugzilla.redhat.com/show_bug.cgi?id=1174664

That was the missing volatile in an asm.  Older compilers didn't catch
it.  :(

Paolo
On 23/04/2015 00:55, Marcelo Tosatti wrote:
> On Wed, Apr 22, 2015 at 11:01:49PM +0200, Paolo Bonzini wrote:
>>
>> On 22/04/2015 22:56, Marcelo Tosatti wrote:
>>>>> But then why was the task migration notifier even in Jeremy's original
>>>>> code for Xen?
>>> To cover for the vcpu1 -> vcpu2 -> vcpu1 case, i believe.
>>
>> Ok, to cover it for non-synchronized TSC.  While KVM requires
>> synchronized TSC.
>>
>>>> If that's the case, then it could be reverted indeed; but then why did
>>>> you commit this patch to 4.1?
>>>
>>> Because it fixes the problem Andy reported (see Subject: KVM: x86: fix
>>> kvmclock write race (v2) on kvm@).  As long as you have Radim's
>>> fix on top.
>>
>> But if it's so rare, and it was known that fixing the host protocol was
>> just as good a solution, why was the guest fix committed?
>
> I don't know.  Should have fixed the host protocol.

No problem.  Let's do the right thing now.

>> I'm just trying to understand.  I am worried that this patch was rushed
>> in; so far I had assumed it wasn't (a revert of a revert is rare enough
>> that you don't do it lightly...) but maybe I was wrong.
>
> Yes it was rushed in.

Ok, so re-reverted it will be.

Paolo
On Thu, Apr 23, 2015 at 11:13:23AM +0200, Paolo Bonzini wrote:
>
> On 22/04/2015 23:21, Marcelo Tosatti wrote:
> > On Mon, Apr 20, 2015 at 01:27:58PM -0700, Andy Lutomirski wrote:
> >> On Mon, Apr 20, 2015 at 9:59 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> >>>
> >>> On 17/04/2015 22:18, Marcelo Tosatti wrote:
> >>>> The bug which this is fixing is very rare, have no memory of a report.
> >>>>
> >>>> In fact, its even difficult to create a synthetic reproducer.
> >>>
> >>> But then why was the task migration notifier even in Jeremy's original
> >>> code for Xen?  Was it supposed to work even on non-synchronized TSC?
> >>>
> >>> If that's the case, then it could be reverted indeed; but then why did
> >>> you commit this patch to 4.1?  Did you think of something that would
> >>> cause the seqcount-like protocol to fail, and that turned out not to be
> >>> the case later?  I was only following the mailing list sparsely in March.
> >>
> >> I don't think anyone ever tried that hard to test this stuff.  There
> >> was an infinite loop that Firefox was triggering as a KVM guest
> >> somewhat reliably until a couple months ago in the same vdso code.  :(
> >
> > https://bugzilla.redhat.com/show_bug.cgi?id=1174664
>
> That was the missing volatile in an asm.  Older compilers didn't catch
> it.  :(

How do you know that?  It looks like memory corruption (look at the
pattern at the end).
On 23/04/2015 13:51, Marcelo Tosatti wrote:
>>> https://bugzilla.redhat.com/show_bug.cgi?id=1174664
>>
>> That was the missing volatile in an asm.  Older compilers didn't catch
>> it.  :(
>
> How do you know that?  It looks like memory corruption (look at the
> pattern at the end).

I suspect some kind of operator error there, it makes no sense.

On the other hand, bug 1178975 is much clearer and the symptoms are the
same.  In that bug, you can see that the same kernel source works on f20
(package version 3.17.7-200.fc20.x86_64) and fails on f21 (package
version 3.17.7-300.fc21.x86_64).  Of course the compiler is different.
The newer one hoists the lsl out of the loop; if you get a CPU migration
at the wrong time, the cpu != cpu1 condition will always be true and the
loop will never exit.

Paolo
On Thu, Apr 23, 2015 at 02:02:29PM +0200, Paolo Bonzini wrote:
>
> On 23/04/2015 13:51, Marcelo Tosatti wrote:
> >>> https://bugzilla.redhat.com/show_bug.cgi?id=1174664
> >>
> >> That was the missing volatile in an asm.  Older compilers didn't catch
> >> it.  :(
> > How do you know that?  It looks like memory corruption (look at the
> > pattern at the end).
>
> I suspect some kind of operator error there, it makes no sense.

	if (unlikely(s->flags & SLAB_POISON))
		memset(start, POISON_INUSE, PAGE_SIZE << order);

 * Padding is done using 0x5a (POISON_INUSE)

> On the other hand, bug 1178975 is much clearer and the symptoms are the
> same.  In that bug, you can see that the same kernel source works on f20
> (package version 3.17.7-200.fc20.x86_64) and fails on f21 (package
> version 3.17.7-300.fc21.x86_64).  Of course the compiler is different.
> The newer one hoists the lsl out of the loop; if you get a CPU migration
> at the wrong time, the cpu != cpu1 condition will always be true and the
> loop will never exit.
>
> Paolo
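[The hang being debated is easiest to see against the shape of the retry loop itself.  The sketch below is a simplified stand-in for the vdso code, not the real source: vgetcpu() and read_pvti_ns() are hypothetical stubs, and the volatile on current_cpu plays the role that the (previously missing) volatile on the LSL asm plays in the real code, forcing a fresh read on every loop iteration.  If the compiler were allowed to hoist the CPU read out of the loop, a stale value after a migration would keep the exit condition from ever becoming true.]

```c
#include <stdint.h>

/* Hypothetical stand-in for the per-cpu pvti read; returns a fake
 * timestamp that encodes which CPU's data was read. */
static uint64_t read_pvti_ns(int cpu)
{
	return 1000 + (uint64_t)cpu;
}

/* Stand-in for the LSL-based vgetcpu.  The volatile here models the
 * "asm volatile" the real code needs: the read must happen on every
 * iteration, not be hoisted out of the loop. */
static volatile int current_cpu;

static int vgetcpu(void)
{
	return current_cpu;
}

static uint64_t vdso_clock_read(void)
{
	int cpu, cpu1;
	uint64_t ns;

	do {
		cpu  = vgetcpu();	/* which CPU's data do we read? */
		ns   = read_pvti_ns(cpu);
		cpu1 = vgetcpu();	/* detect migration mid-read */
	} while (cpu != cpu1);		/* retry if we migrated */

	return ns;
}
```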
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index d43e7e1c784b..9af252c8698d 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -649,6 +649,7 @@ if HYPERVISOR_GUEST config PARAVIRT bool "Enable paravirtualization code" + select TASK_MIGRATION_NOTIFIER ---help--- This changes the kernel so it can modify itself when it is run under a hypervisor, potentially improving performance significantly diff --git a/init/Kconfig b/init/Kconfig index 3b9df1aa35db..891917123338 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -2016,6 +2016,9 @@ source "block/Kconfig" config PREEMPT_NOTIFIERS bool +config TASK_MIGRATION_NOTIFIER + bool + config PADATA depends on SMP bool diff --git a/kernel/sched/core.c b/kernel/sched/core.c index f9123a82cbb6..c07a53aa543c 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -1016,12 +1016,14 @@ void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags) rq_clock_skip_update(rq, true); } +#ifdef CONFIG_TASK_MIGRATION_NOTIFIER static ATOMIC_NOTIFIER_HEAD(task_migration_notifier); void register_task_migration_notifier(struct notifier_block *n) { atomic_notifier_chain_register(&task_migration_notifier, n); } +#endif #ifdef CONFIG_SMP void set_task_cpu(struct task_struct *p, unsigned int new_cpu) @@ -1053,18 +1055,23 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu) trace_sched_migrate_task(p, new_cpu); if (task_cpu(p) != new_cpu) { +#ifdef CONFIG_TASK_MIGRATION_NOTIFIER struct task_migration_notifier tmn; + int from_cpu = task_cpu(p); +#endif if (p->sched_class->migrate_task_rq) p->sched_class->migrate_task_rq(p, new_cpu); p->se.nr_migrations++; perf_sw_event_sched(PERF_COUNT_SW_CPU_MIGRATIONS, 1, 0); +#ifdef CONFIG_TASK_MIGRATION_NOTIFIER tmn.task = p; - tmn.from_cpu = task_cpu(p); + tmn.from_cpu = from_cpu; tmn.to_cpu = new_cpu; atomic_notifier_call_chain(&task_migration_notifier, 0, &tmn); +#endif } __set_task_cpu(p, new_cpu);