Message ID: 55310CF2.6070107@redhat.com (mailing list archive)
On Fri, Apr 17, 2015 at 03:38:58PM +0200, Paolo Bonzini wrote:
> > The path this notifier is called from has nothing to do with those
> > costs.
>
> How not? The task is going to incur those costs, it's not like half
> a dozen extra instruction make any difference. But anyway...

It's attributed to the entity doing the migration, which can be the
wakeup path or a softirq. And we very much do care about the wakeup
path.

> ... that's a valid objection. Please look at the patch below.

Still a NAK on that; distros have no choice but to enable that CONFIG
option because people might want to run KVM. CONFIG options are
pointless if they end up being mandatory.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On 17/04/2015 15:43, Peter Zijlstra wrote:
> On Fri, Apr 17, 2015 at 03:38:58PM +0200, Paolo Bonzini wrote:
>>> The path this notifier is called from has nothing to do with those
>>> costs.
>
> Its attributed to the entity doing the migration, which can be the
> wakeup path or a softirq. And we very much do care about the wakeup
> path.

It's not run on all wakeups. It's within an "if (task_cpu(p) != new_cpu)".
WF_MIGRATED _is_ a slow path for wakeups, even though wakeup itself is of
course something we care about. For load balancing, calculate_imbalance
alone is orders of magnitude more expensive than this notifier (which can
be optimized to two instructions with at most one cache miss).

>> ... that's a valid objection. Please look at the patch below.
>
> Still a NAK on that, distros have no choice but to enable that CONFIG
> option because people might want to run KVM.

Again: running virtual machines does not require these notifiers. KVM
needs preempt and MMU notifiers, and also enables user return notifiers,
but does not need these ones. It's only paravirt that needs them. It's
perfectly fine for distros to disable paravirt. Some do, some don't.

Paolo
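Paolo's cost claim above is about the shape of the unregistered case: calling an empty atomic notifier chain is essentially one pointer load, one NULL test, and a return. A hypothetical userspace model of that fast path (the real kernel chain behind `ATOMIC_NOTIFIER_HEAD` in kernel/notifier.c is RCU-protected; the `notifier_head`, `call_chain`, and counting-notifier names here are made up for illustration):

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for the kernel's notifier chain; no RCU, no
 * locking, just the control flow of the hot path under discussion. */
struct notifier_block {
	int (*call)(struct notifier_block *nb, unsigned long action, void *data);
	struct notifier_block *next;
};

struct notifier_head {
	struct notifier_block *head;
};

static int call_chain(struct notifier_head *nh, unsigned long action, void *data)
{
	struct notifier_block *nb;

	/* Empty chain: the loop body never runs, we fall straight through. */
	for (nb = nh->head; nb != NULL; nb = nb->next) {
		int ret = nb->call(nb, action, data);
		if (ret)
			return ret;
	}
	return 0;
}

static int invocations;

static int counting_notifier(struct notifier_block *nb, unsigned long action,
			     void *data)
{
	(void)nb; (void)action; (void)data;
	invocations++;
	return 0;
}

/* No notifier registered: call_chain returns without calling anything. */
static int run_empty_chain(void)
{
	struct notifier_head nh = { NULL };

	invocations = 0;
	call_chain(&nh, 0, NULL);
	return invocations;
}

/* One registered notifier: it runs exactly once per call. */
static int run_one_notifier(void)
{
	struct notifier_block nb = { counting_notifier, NULL };
	struct notifier_head nh = { &nb };

	invocations = 0;
	call_chain(&nh, 0, NULL);
	return invocations;
}
```

The debate in the thread is precisely whether this empty-chain cost, plus filling the `tmn` structure on the stack, is negligible on the migration path or not.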
On Fri, Apr 17, 2015 at 03:38:58PM +0200, Paolo Bonzini wrote: > On 17/04/2015 15:10, Peter Zijlstra wrote: > > On Fri, Apr 17, 2015 at 02:46:57PM +0200, Paolo Bonzini wrote: > >> On 17/04/2015 12:55, Peter Zijlstra wrote: > >>> Also, it looks like you already do exactly this for other things, look > >>> at: > >>> > >>> kvm_sched_in() > >>> kvm_arch_vcpu_load() > >>> if (unlikely(vcpu->cpu != cpu) ... ) > >>> > >>> So no, I don't believe for one second you need this. > > > > This [...] brings us back to where we were last > > time. There is _0_ justification for this in the patches, that alone is > > grounds enough to reject it. > > Oh, we totally agree on that. I didn't commit that patch, but I already > said the commit message was insufficient. > > > Why should the guest task care about the physical cpu of the vcpu; > > that's a layering fail if ever there was one. > > It's totally within your right to not read the code, but then please > don't try commenting at it. > > This code: > > kvm_sched_in() > kvm_arch_vcpu_load() > if (unlikely(vcpu->cpu != cpu) ... ) > > runs in the host. The hypervisor obviously cares if the physical CPU of > the VCPU changes. It has to tell the source processor (vcpu->cpu) to > release the VCPU's data structure and only then it can use it in the > target processor (cpu). No layering violation here. > > The task migration notifier runs in the guest, whenever the VCPU of > a task changes. > > > Furthermore, the only thing that migration handler seems to do is > > increment a variable that is not actually used in that file. > > It's used in the vDSO, so you cannot increment it in the file that uses it. > > >> And frankly, I think the static key is snake oil. The cost of task > >> migration in terms of cache misses and TLB misses is in no way > >> comparable to the cost of filling in a structure on the stack, > >> dereferencing the head of the notifiers list and seeing that it's NULL. 
> > > > The path this notifier is called from has nothing to do with those > > costs. > > How not? The task is going to incur those costs, it's not like half > a dozen extra instruction make any difference. But anyway... > > > And the fact you're inflicting these costs on _everyone_ for a > > single x86_64-paravirt case is insane. > > ... that's a valid objection. Please look at the patch below. > > > I've had enough of this, the below goes into sched/urgent and you can > > come back with sane patches if and when you're ready. > > Oh, please, cut the alpha male crap. > > Paolo > > ------------------- 8< ---------------- > >From 4eb9d7132e1990c0586f28af3103675416d38974 Mon Sep 17 00:00:00 2001 > From: Paolo Bonzini <pbonzini@redhat.com> > Date: Fri, 17 Apr 2015 14:57:34 +0200 > Subject: [PATCH] sched: add CONFIG_TASK_MIGRATION_NOTIFIER > > The task migration notifier is only used in x86 paravirt. Make it > possible to compile it out. > > While at it, move some code around to ensure tmn is filled from CPU > registers. 
> > Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> > --- > arch/x86/Kconfig | 1 + > init/Kconfig | 3 +++ > kernel/sched/core.c | 9 ++++++++- > 3 files changed, 12 insertions(+), 1 deletion(-) > > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig > index d43e7e1c784b..9af252c8698d 100644 > --- a/arch/x86/Kconfig > +++ b/arch/x86/Kconfig > @@ -649,6 +649,7 @@ if HYPERVISOR_GUEST > > config PARAVIRT > bool "Enable paravirtualization code" > + select TASK_MIGRATION_NOTIFIER > ---help--- > This changes the kernel so it can modify itself when it is run > under a hypervisor, potentially improving performance significantly > diff --git a/init/Kconfig b/init/Kconfig > index 3b9df1aa35db..891917123338 100644 > --- a/init/Kconfig > +++ b/init/Kconfig > @@ -2016,6 +2016,9 @@ source "block/Kconfig" > config PREEMPT_NOTIFIERS > bool > > +config TASK_MIGRATION_NOTIFIER > + bool > + > config PADATA > depends on SMP > bool > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > index f9123a82cbb6..c07a53aa543c 100644 > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -1016,12 +1016,14 @@ void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags) > rq_clock_skip_update(rq, true); > } > > +#ifdef CONFIG_TASK_MIGRATION_NOTIFIER > static ATOMIC_NOTIFIER_HEAD(task_migration_notifier); > > void register_task_migration_notifier(struct notifier_block *n) > { > atomic_notifier_chain_register(&task_migration_notifier, n); > } > +#endif > > #ifdef CONFIG_SMP > void set_task_cpu(struct task_struct *p, unsigned int new_cpu) > @@ -1053,18 +1055,23 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu) > trace_sched_migrate_task(p, new_cpu); > > if (task_cpu(p) != new_cpu) { > +#ifdef CONFIG_TASK_MIGRATION_NOTIFIER > struct task_migration_notifier tmn; > + int from_cpu = task_cpu(p); > +#endif > > if (p->sched_class->migrate_task_rq) > p->sched_class->migrate_task_rq(p, new_cpu); > p->se.nr_migrations++; > perf_sw_event_sched(PERF_COUNT_SW_CPU_MIGRATIONS, 
>  						      1, 0);
>
> +#ifdef CONFIG_TASK_MIGRATION_NOTIFIER
>  		tmn.task = p;
> -		tmn.from_cpu = task_cpu(p);
> +		tmn.from_cpu = from_cpu;
>  		tmn.to_cpu = new_cpu;
>
>  		atomic_notifier_call_chain(&task_migration_notifier, 0, &tmn);
> +#endif
>  	}
>
>  	__set_task_cpu(p, new_cpu);
> --
> 2.3.5

Paolo,

Please revert the patch -- this can be fixed properly in the host, which
also conforms to the documented KVM guest/host protocol.

Radim submitted a patch to kvm@ to split kvm_write_guest in two with a
barrier in between, I think. I'll review that patch.
On Fri, Apr 17, 2015 at 12:01 PM, Marcelo Tosatti <mtosatti@redhat.com> wrote: > On Fri, Apr 17, 2015 at 03:38:58PM +0200, Paolo Bonzini wrote: >> On 17/04/2015 15:10, Peter Zijlstra wrote: >> > On Fri, Apr 17, 2015 at 02:46:57PM +0200, Paolo Bonzini wrote: >> >> On 17/04/2015 12:55, Peter Zijlstra wrote: >> >>> Also, it looks like you already do exactly this for other things, look >> >>> at: >> >>> >> >>> kvm_sched_in() >> >>> kvm_arch_vcpu_load() >> >>> if (unlikely(vcpu->cpu != cpu) ... ) >> >>> >> >>> So no, I don't believe for one second you need this. >> > >> > This [...] brings us back to where we were last >> > time. There is _0_ justification for this in the patches, that alone is >> > grounds enough to reject it. >> >> Oh, we totally agree on that. I didn't commit that patch, but I already >> said the commit message was insufficient. >> >> > Why should the guest task care about the physical cpu of the vcpu; >> > that's a layering fail if ever there was one. >> >> It's totally within your right to not read the code, but then please >> don't try commenting at it. >> >> This code: >> >> kvm_sched_in() >> kvm_arch_vcpu_load() >> if (unlikely(vcpu->cpu != cpu) ... ) >> >> runs in the host. The hypervisor obviously cares if the physical CPU of >> the VCPU changes. It has to tell the source processor (vcpu->cpu) to >> release the VCPU's data structure and only then it can use it in the >> target processor (cpu). No layering violation here. >> >> The task migration notifier runs in the guest, whenever the VCPU of >> a task changes. >> >> > Furthermore, the only thing that migration handler seems to do is >> > increment a variable that is not actually used in that file. >> >> It's used in the vDSO, so you cannot increment it in the file that uses it. >> >> >> And frankly, I think the static key is snake oil. 
The cost of task >> >> migration in terms of cache misses and TLB misses is in no way >> >> comparable to the cost of filling in a structure on the stack, >> >> dereferencing the head of the notifiers list and seeing that it's NULL. >> > >> > The path this notifier is called from has nothing to do with those >> > costs. >> >> How not? The task is going to incur those costs, it's not like half >> a dozen extra instruction make any difference. But anyway... >> >> > And the fact you're inflicting these costs on _everyone_ for a >> > single x86_64-paravirt case is insane. >> >> ... that's a valid objection. Please look at the patch below. >> >> > I've had enough of this, the below goes into sched/urgent and you can >> > come back with sane patches if and when you're ready. >> >> Oh, please, cut the alpha male crap. >> >> Paolo >> >> ------------------- 8< ---------------- >> >From 4eb9d7132e1990c0586f28af3103675416d38974 Mon Sep 17 00:00:00 2001 >> From: Paolo Bonzini <pbonzini@redhat.com> >> Date: Fri, 17 Apr 2015 14:57:34 +0200 >> Subject: [PATCH] sched: add CONFIG_TASK_MIGRATION_NOTIFIER >> >> The task migration notifier is only used in x86 paravirt. Make it >> possible to compile it out. >> >> While at it, move some code around to ensure tmn is filled from CPU >> registers. 
>> >> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> >> --- >> arch/x86/Kconfig | 1 + >> init/Kconfig | 3 +++ >> kernel/sched/core.c | 9 ++++++++- >> 3 files changed, 12 insertions(+), 1 deletion(-) >> >> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig >> index d43e7e1c784b..9af252c8698d 100644 >> --- a/arch/x86/Kconfig >> +++ b/arch/x86/Kconfig >> @@ -649,6 +649,7 @@ if HYPERVISOR_GUEST >> >> config PARAVIRT >> bool "Enable paravirtualization code" >> + select TASK_MIGRATION_NOTIFIER >> ---help--- >> This changes the kernel so it can modify itself when it is run >> under a hypervisor, potentially improving performance significantly >> diff --git a/init/Kconfig b/init/Kconfig >> index 3b9df1aa35db..891917123338 100644 >> --- a/init/Kconfig >> +++ b/init/Kconfig >> @@ -2016,6 +2016,9 @@ source "block/Kconfig" >> config PREEMPT_NOTIFIERS >> bool >> >> +config TASK_MIGRATION_NOTIFIER >> + bool >> + >> config PADATA >> depends on SMP >> bool >> diff --git a/kernel/sched/core.c b/kernel/sched/core.c >> index f9123a82cbb6..c07a53aa543c 100644 >> --- a/kernel/sched/core.c >> +++ b/kernel/sched/core.c >> @@ -1016,12 +1016,14 @@ void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags) >> rq_clock_skip_update(rq, true); >> } >> >> +#ifdef CONFIG_TASK_MIGRATION_NOTIFIER >> static ATOMIC_NOTIFIER_HEAD(task_migration_notifier); >> >> void register_task_migration_notifier(struct notifier_block *n) >> { >> atomic_notifier_chain_register(&task_migration_notifier, n); >> } >> +#endif >> >> #ifdef CONFIG_SMP >> void set_task_cpu(struct task_struct *p, unsigned int new_cpu) >> @@ -1053,18 +1055,23 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu) >> trace_sched_migrate_task(p, new_cpu); >> >> if (task_cpu(p) != new_cpu) { >> +#ifdef CONFIG_TASK_MIGRATION_NOTIFIER >> struct task_migration_notifier tmn; >> + int from_cpu = task_cpu(p); >> +#endif >> >> if (p->sched_class->migrate_task_rq) >> p->sched_class->migrate_task_rq(p, new_cpu); >> 
p->se.nr_migrations++; >> perf_sw_event_sched(PERF_COUNT_SW_CPU_MIGRATIONS, 1, 0); >> >> +#ifdef CONFIG_TASK_MIGRATION_NOTIFIER >> tmn.task = p; >> - tmn.from_cpu = task_cpu(p); >> + tmn.from_cpu = from_cpu; >> tmn.to_cpu = new_cpu; >> >> atomic_notifier_call_chain(&task_migration_notifier, 0, &tmn); >> +#endif >> } >> >> __set_task_cpu(p, new_cpu); >> -- >> 2.3.5 > > Paolo, > > Please revert the patch -- can fix properly in the host > which also conforms the KVM guest/host documented protocol. > > Radim submitted a patch to kvm@ to split > the kvm_write_guest in two with a barrier in between, i think. > > I'll review that patch. > Can you cc me on that? Thanks, Andy
>> From 4eb9d7132e1990c0586f28af3103675416d38974 Mon Sep 17 00:00:00 2001 >> From: Paolo Bonzini <pbonzini@redhat.com> >> Date: Fri, 17 Apr 2015 14:57:34 +0200 >> Subject: [PATCH] sched: add CONFIG_TASK_MIGRATION_NOTIFIER >> >> The task migration notifier is only used in x86 paravirt. Make it >> possible to compile it out. >> >> While at it, move some code around to ensure tmn is filled from CPU >> registers. >> >> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> >> --- >> arch/x86/Kconfig | 1 + >> init/Kconfig | 3 +++ >> kernel/sched/core.c | 9 ++++++++- >> 3 files changed, 12 insertions(+), 1 deletion(-) >> >> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig >> index d43e7e1c784b..9af252c8698d 100644 >> --- a/arch/x86/Kconfig >> +++ b/arch/x86/Kconfig >> @@ -649,6 +649,7 @@ if HYPERVISOR_GUEST >> >> config PARAVIRT >> bool "Enable paravirtualization code" >> + select TASK_MIGRATION_NOTIFIER >> ---help--- >> This changes the kernel so it can modify itself when it is run >> under a hypervisor, potentially improving performance significantly >> diff --git a/init/Kconfig b/init/Kconfig >> index 3b9df1aa35db..891917123338 100644 >> --- a/init/Kconfig >> +++ b/init/Kconfig >> @@ -2016,6 +2016,9 @@ source "block/Kconfig" >> config PREEMPT_NOTIFIERS >> bool >> >> +config TASK_MIGRATION_NOTIFIER >> + bool >> + >> config PADATA >> depends on SMP >> bool >> diff --git a/kernel/sched/core.c b/kernel/sched/core.c >> index f9123a82cbb6..c07a53aa543c 100644 >> --- a/kernel/sched/core.c >> +++ b/kernel/sched/core.c >> @@ -1016,12 +1016,14 @@ void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags) >> rq_clock_skip_update(rq, true); >> } >> >> +#ifdef CONFIG_TASK_MIGRATION_NOTIFIER >> static ATOMIC_NOTIFIER_HEAD(task_migration_notifier); >> >> void register_task_migration_notifier(struct notifier_block *n) >> { >> atomic_notifier_chain_register(&task_migration_notifier, n); >> } >> +#endif >> >> #ifdef CONFIG_SMP >> void set_task_cpu(struct task_struct *p, 
unsigned int new_cpu) >> @@ -1053,18 +1055,23 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu) >> trace_sched_migrate_task(p, new_cpu); >> >> if (task_cpu(p) != new_cpu) { >> +#ifdef CONFIG_TASK_MIGRATION_NOTIFIER >> struct task_migration_notifier tmn; >> + int from_cpu = task_cpu(p); >> +#endif >> >> if (p->sched_class->migrate_task_rq) >> p->sched_class->migrate_task_rq(p, new_cpu); >> p->se.nr_migrations++; >> perf_sw_event_sched(PERF_COUNT_SW_CPU_MIGRATIONS, 1, 0); >> >> +#ifdef CONFIG_TASK_MIGRATION_NOTIFIER >> tmn.task = p; >> - tmn.from_cpu = task_cpu(p); >> + tmn.from_cpu = from_cpu; >> tmn.to_cpu = new_cpu; >> >> atomic_notifier_call_chain(&task_migration_notifier, 0, &tmn); >> +#endif >> } >> >> __set_task_cpu(p, new_cpu); >> -- >> 2.3.5 > > Paolo, > > Please revert the patch -- can fix properly in the host > which also conforms the KVM guest/host documented protocol. > > Radim submitted a patch to kvm@ to split > the kvm_write_guest in two with a barrier in between, i think. > > I'll review that patch. You're thinking of http://article.gmane.org/gmane.linux.kernel.stable/129187, but see Andy's reply: > > I think there are at least two ways that would work: > > a) If KVM incremented version as advertised: > > cpu = getcpu(); > pvti = pvti for cpu; > > ver1 = pvti->version; > check stable bit; > rdtsc_barrier, rdtsc, read scale, shift, etc. > if (getcpu() != cpu) retry; > if (pvti->version != ver1) retry; > > I think this is safe because, we're guaranteed that there was an > interval (between the two version reads) in which the vcpu we think > we're on was running and the kvmclock data was valid and marked > stable, and we know that the tsc we read came from that interval. > > Note: rdtscp isn't needed. If we're stable, is makes no difference > which cpu's tsc we actually read. 
> b) If version remains buggy but we use this migrations_from hack:
>
>    cpu = getcpu();
>    pvti = pvti for cpu;
>    m1 = pvti->migrations_from;
>    barrier();
>
>    ver1 = pvti->version;
>    check stable bit;
>    rdtsc_barrier, rdtsc, read scale, shift, etc.
>    if (getcpu() != cpu) retry;
>    if (pvti->version != ver1) retry; /* probably not really needed */
>
>    barrier();
>    if (pvti->migrations_from != m1) retry;
>
> This is just like (a), except that we're using a guest kernel hack to
> ensure that no one migrated off the vcpu during the version-protected
> critical section and that we were, in fact, on that vcpu at some point
> during that critical section. Once we've ensured that we were on
> pvti's associated vcpu for the entire time we were reading it, then we
> are protected by the existing versioning in the host.

(a) is not going to happen until 4.2, and there are too many buggy hosts
around, so we'd have to define new ABI that lets the guest distinguish a
buggy host from a fixed one.

(b) works now, is not invasive, and I still maintain that the cost is
negligible. I'm going to run for a while with CONFIG_SCHEDSTATS to see
how often you have a migration.

Anyhow, if the task migration notifier is reverted we have to disable
the whole vsyscall support altogether.

Paolo
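Andy's scheme (b) above boils down to a retry loop over a versioned structure plus a migration counter. A single-threaded C sketch of just the retry logic (the struct layout and the `try_read` helper are assumptions for illustration; the getcpu()/rdtsc steps are replaced by a plain field read, and the compiler/CPU barriers a real implementation needs are deliberately omitted):

```c
#include <assert.h>

/* Toy model of the per-vcpu pvclock data the guest reads.  In the real
 * ABI this is pvclock_vcpu_time_info; migrations_from is the counter
 * the disputed task migration notifier would bump. */
struct pvti {
	unsigned int version;           /* odd while the host is updating */
	unsigned int migrations_from;   /* bumped when the task migrates */
	unsigned long long time;        /* stands in for the tsc+scale result */
};

/* One attempt at scheme (b): returns 1 and fills *out on success, 0 if
 * the caller should retry.  Real code also rechecks getcpu() and uses
 * barrier()/rdtsc_barrier() around the reads; both are elided here. */
static int try_read(const struct pvti *p, unsigned long long *out)
{
	unsigned int m1 = p->migrations_from;
	unsigned int v1 = p->version;

	if (v1 & 1)
		return 0;                /* update in progress */

	unsigned long long t = p->time;  /* the "rdtsc, scale, shift" step */

	if (p->version != v1)
		return 0;                /* version moved: torn read */
	if (p->migrations_from != m1)
		return 0;                /* migrated off the vcpu mid-read */

	*out = t;
	return 1;
}
```

The point of the second check is exactly Andy's: if `migrations_from` did not change across the critical section, the reader was on the pvti's vcpu the whole time and the host's existing versioning protects it.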
On Fri, Apr 17, 2015 at 09:57:12PM +0200, Paolo Bonzini wrote: > > > >> From 4eb9d7132e1990c0586f28af3103675416d38974 Mon Sep 17 00:00:00 2001 > >> From: Paolo Bonzini <pbonzini@redhat.com> > >> Date: Fri, 17 Apr 2015 14:57:34 +0200 > >> Subject: [PATCH] sched: add CONFIG_TASK_MIGRATION_NOTIFIER > >> > >> The task migration notifier is only used in x86 paravirt. Make it > >> possible to compile it out. > >> > >> While at it, move some code around to ensure tmn is filled from CPU > >> registers. > >> > >> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> > >> --- > >> arch/x86/Kconfig | 1 + > >> init/Kconfig | 3 +++ > >> kernel/sched/core.c | 9 ++++++++- > >> 3 files changed, 12 insertions(+), 1 deletion(-) > >> > >> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig > >> index d43e7e1c784b..9af252c8698d 100644 > >> --- a/arch/x86/Kconfig > >> +++ b/arch/x86/Kconfig > >> @@ -649,6 +649,7 @@ if HYPERVISOR_GUEST > >> > >> config PARAVIRT > >> bool "Enable paravirtualization code" > >> + select TASK_MIGRATION_NOTIFIER > >> ---help--- > >> This changes the kernel so it can modify itself when it is run > >> under a hypervisor, potentially improving performance significantly > >> diff --git a/init/Kconfig b/init/Kconfig > >> index 3b9df1aa35db..891917123338 100644 > >> --- a/init/Kconfig > >> +++ b/init/Kconfig > >> @@ -2016,6 +2016,9 @@ source "block/Kconfig" > >> config PREEMPT_NOTIFIERS > >> bool > >> > >> +config TASK_MIGRATION_NOTIFIER > >> + bool > >> + > >> config PADATA > >> depends on SMP > >> bool > >> diff --git a/kernel/sched/core.c b/kernel/sched/core.c > >> index f9123a82cbb6..c07a53aa543c 100644 > >> --- a/kernel/sched/core.c > >> +++ b/kernel/sched/core.c > >> @@ -1016,12 +1016,14 @@ void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags) > >> rq_clock_skip_update(rq, true); > >> } > >> > >> +#ifdef CONFIG_TASK_MIGRATION_NOTIFIER > >> static ATOMIC_NOTIFIER_HEAD(task_migration_notifier); > >> > >> void 
register_task_migration_notifier(struct notifier_block *n) > >> { > >> atomic_notifier_chain_register(&task_migration_notifier, n); > >> } > >> +#endif > >> > >> #ifdef CONFIG_SMP > >> void set_task_cpu(struct task_struct *p, unsigned int new_cpu) > >> @@ -1053,18 +1055,23 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu) > >> trace_sched_migrate_task(p, new_cpu); > >> > >> if (task_cpu(p) != new_cpu) { > >> +#ifdef CONFIG_TASK_MIGRATION_NOTIFIER > >> struct task_migration_notifier tmn; > >> + int from_cpu = task_cpu(p); > >> +#endif > >> > >> if (p->sched_class->migrate_task_rq) > >> p->sched_class->migrate_task_rq(p, new_cpu); > >> p->se.nr_migrations++; > >> perf_sw_event_sched(PERF_COUNT_SW_CPU_MIGRATIONS, 1, 0); > >> > >> +#ifdef CONFIG_TASK_MIGRATION_NOTIFIER > >> tmn.task = p; > >> - tmn.from_cpu = task_cpu(p); > >> + tmn.from_cpu = from_cpu; > >> tmn.to_cpu = new_cpu; > >> > >> atomic_notifier_call_chain(&task_migration_notifier, 0, &tmn); > >> +#endif > >> } > >> > >> __set_task_cpu(p, new_cpu); > >> -- > >> 2.3.5 > > > > Paolo, > > > > Please revert the patch -- can fix properly in the host > > which also conforms the KVM guest/host documented protocol. > > > > Radim submitted a patch to kvm@ to split > > the kvm_write_guest in two with a barrier in between, i think. > > > > I'll review that patch. > > You're thinking of > http://article.gmane.org/gmane.linux.kernel.stable/129187, but see > Andy's reply: > > > > > I think there are at least two ways that would work: > > > > a) If KVM incremented version as advertised: > > > > cpu = getcpu(); > > pvti = pvti for cpu; > > > > ver1 = pvti->version; > > check stable bit; > > rdtsc_barrier, rdtsc, read scale, shift, etc. 
> > if (getcpu() != cpu) retry; > > if (pvti->version != ver1) retry; > > > > I think this is safe because, we're guaranteed that there was an > > interval (between the two version reads) in which the vcpu we think > > we're on was running and the kvmclock data was valid and marked > > stable, and we know that the tsc we read came from that interval. > > > > Note: rdtscp isn't needed. If we're stable, is makes no difference > > which cpu's tsc we actually read. > > > > b) If version remains buggy but we use this migrations_from hack: > > > > cpu = getcpu(); > > pvti = pvti for cpu; > > m1 = pvti->migrations_from; > > barrier(); > > > > ver1 = pvti->version; > > check stable bit; > > rdtsc_barrier, rdtsc, read scale, shift, etc. > > if (getcpu() != cpu) retry; > > if (pvti->version != ver1) retry; /* probably not really needed */ > > > > barrier(); > > if (pvti->migrations_from != m1) retry; > > > > This is just like (a), except that we're using a guest kernel hack to > > ensure that no one migrated off the vcpu during the version-protected > > critical section and that we were, in fact, on that vcpu at some point > > during that critical section. Once we've ensured that we were on > > pvti's associated vcpu for the entire time we were reading it, then we > > are protected by the existing versioning in the host. > > (a) is not going to happen until 4.2, and there are too many buggy hosts > around so we'd have to define new ABI that lets the guest distinguish a > buggy host from a fixed one. > > (b) works now, is not invasive, and I still maintain that the cost is > negligible. I'm going to run for a while with CONFIG_SCHEDSTATS to see > how often you have a migration. > > Anyhow if the task migration notifier is reverted we have to disable the > whole vsyscall support altogether. The bug which this is fixing is very rare, have no memory of a report. In fact, its even difficult to create a synthetic reproducer. 
You need:

1) update of kvmclock data structure (happens once every 5 minutes).
2) migration of task from vcpu1 to vcpu2 back to vcpu1.
3) a data race between kvm_write_guest (string copy) and 2 above.

At the same time.
On Fri, Apr 17, 2015 at 1:18 PM, Marcelo Tosatti <mtosatti@redhat.com> wrote: > On Fri, Apr 17, 2015 at 09:57:12PM +0200, Paolo Bonzini wrote: >> >> >> >> From 4eb9d7132e1990c0586f28af3103675416d38974 Mon Sep 17 00:00:00 2001 >> >> From: Paolo Bonzini <pbonzini@redhat.com> >> >> Date: Fri, 17 Apr 2015 14:57:34 +0200 >> >> Subject: [PATCH] sched: add CONFIG_TASK_MIGRATION_NOTIFIER >> >> >> >> The task migration notifier is only used in x86 paravirt. Make it >> >> possible to compile it out. >> >> >> >> While at it, move some code around to ensure tmn is filled from CPU >> >> registers. >> >> >> >> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> >> >> --- >> >> arch/x86/Kconfig | 1 + >> >> init/Kconfig | 3 +++ >> >> kernel/sched/core.c | 9 ++++++++- >> >> 3 files changed, 12 insertions(+), 1 deletion(-) >> >> >> >> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig >> >> index d43e7e1c784b..9af252c8698d 100644 >> >> --- a/arch/x86/Kconfig >> >> +++ b/arch/x86/Kconfig >> >> @@ -649,6 +649,7 @@ if HYPERVISOR_GUEST >> >> >> >> config PARAVIRT >> >> bool "Enable paravirtualization code" >> >> + select TASK_MIGRATION_NOTIFIER >> >> ---help--- >> >> This changes the kernel so it can modify itself when it is run >> >> under a hypervisor, potentially improving performance significantly >> >> diff --git a/init/Kconfig b/init/Kconfig >> >> index 3b9df1aa35db..891917123338 100644 >> >> --- a/init/Kconfig >> >> +++ b/init/Kconfig >> >> @@ -2016,6 +2016,9 @@ source "block/Kconfig" >> >> config PREEMPT_NOTIFIERS >> >> bool >> >> >> >> +config TASK_MIGRATION_NOTIFIER >> >> + bool >> >> + >> >> config PADATA >> >> depends on SMP >> >> bool >> >> diff --git a/kernel/sched/core.c b/kernel/sched/core.c >> >> index f9123a82cbb6..c07a53aa543c 100644 >> >> --- a/kernel/sched/core.c >> >> +++ b/kernel/sched/core.c >> >> @@ -1016,12 +1016,14 @@ void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags) >> >> rq_clock_skip_update(rq, true); >> >> } >> >> >> >> +#ifdef 
CONFIG_TASK_MIGRATION_NOTIFIER >> >> static ATOMIC_NOTIFIER_HEAD(task_migration_notifier); >> >> >> >> void register_task_migration_notifier(struct notifier_block *n) >> >> { >> >> atomic_notifier_chain_register(&task_migration_notifier, n); >> >> } >> >> +#endif >> >> >> >> #ifdef CONFIG_SMP >> >> void set_task_cpu(struct task_struct *p, unsigned int new_cpu) >> >> @@ -1053,18 +1055,23 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu) >> >> trace_sched_migrate_task(p, new_cpu); >> >> >> >> if (task_cpu(p) != new_cpu) { >> >> +#ifdef CONFIG_TASK_MIGRATION_NOTIFIER >> >> struct task_migration_notifier tmn; >> >> + int from_cpu = task_cpu(p); >> >> +#endif >> >> >> >> if (p->sched_class->migrate_task_rq) >> >> p->sched_class->migrate_task_rq(p, new_cpu); >> >> p->se.nr_migrations++; >> >> perf_sw_event_sched(PERF_COUNT_SW_CPU_MIGRATIONS, 1, 0); >> >> >> >> +#ifdef CONFIG_TASK_MIGRATION_NOTIFIER >> >> tmn.task = p; >> >> - tmn.from_cpu = task_cpu(p); >> >> + tmn.from_cpu = from_cpu; >> >> tmn.to_cpu = new_cpu; >> >> >> >> atomic_notifier_call_chain(&task_migration_notifier, 0, &tmn); >> >> +#endif >> >> } >> >> >> >> __set_task_cpu(p, new_cpu); >> >> -- >> >> 2.3.5 >> > >> > Paolo, >> > >> > Please revert the patch -- can fix properly in the host >> > which also conforms the KVM guest/host documented protocol. >> > >> > Radim submitted a patch to kvm@ to split >> > the kvm_write_guest in two with a barrier in between, i think. >> > >> > I'll review that patch. >> >> You're thinking of >> http://article.gmane.org/gmane.linux.kernel.stable/129187, but see >> Andy's reply: >> >> > >> > I think there are at least two ways that would work: >> > >> > a) If KVM incremented version as advertised: >> > >> > cpu = getcpu(); >> > pvti = pvti for cpu; >> > >> > ver1 = pvti->version; >> > check stable bit; >> > rdtsc_barrier, rdtsc, read scale, shift, etc. 
>> > if (getcpu() != cpu) retry; >> > if (pvti->version != ver1) retry; >> > >> > I think this is safe because, we're guaranteed that there was an >> > interval (between the two version reads) in which the vcpu we think >> > we're on was running and the kvmclock data was valid and marked >> > stable, and we know that the tsc we read came from that interval. >> > >> > Note: rdtscp isn't needed. If we're stable, is makes no difference >> > which cpu's tsc we actually read. >> > >> > b) If version remains buggy but we use this migrations_from hack: >> > >> > cpu = getcpu(); >> > pvti = pvti for cpu; >> > m1 = pvti->migrations_from; >> > barrier(); >> > >> > ver1 = pvti->version; >> > check stable bit; >> > rdtsc_barrier, rdtsc, read scale, shift, etc. >> > if (getcpu() != cpu) retry; >> > if (pvti->version != ver1) retry; /* probably not really needed */ >> > >> > barrier(); >> > if (pvti->migrations_from != m1) retry; >> > >> > This is just like (a), except that we're using a guest kernel hack to >> > ensure that no one migrated off the vcpu during the version-protected >> > critical section and that we were, in fact, on that vcpu at some point >> > during that critical section. Once we've ensured that we were on >> > pvti's associated vcpu for the entire time we were reading it, then we >> > are protected by the existing versioning in the host. >> >> (a) is not going to happen until 4.2, and there are too many buggy hosts >> around so we'd have to define new ABI that lets the guest distinguish a >> buggy host from a fixed one. >> >> (b) works now, is not invasive, and I still maintain that the cost is >> negligible. I'm going to run for a while with CONFIG_SCHEDSTATS to see >> how often you have a migration. >> >> Anyhow if the task migration notifier is reverted we have to disable the >> whole vsyscall support altogether. > > The bug which this is fixing is very rare, have no memory of a report. > > In fact, its even difficult to create a synthetic reproducer. 
You need: > > 1) update of kvmclock data structure (happens once every 5 minutes). > 2) migration of task from vcpu1 to vcpu2 back to vcpu1. > 3) a data race between kvm_write_guest (string copy) and > 2 above. > > At the same time. Something maybe worth considering: On my box, vclock_gettime using kvm-clock is about 40 ns. An empty syscall is about 33 ns. clock_gettime *should* be around 17 ns. The clock_gettime syscall is about 73 ns. Could we figure out why clock_gettime (the syscall) is so slow, fix it, and then not be so sad about removing the existing kvm-clock vdso code? Once we fix the host for real (add a new feature bit and make it global instead of per-cpu), then we could have a really fast vdso implementation, too. --Andy > >
On Fri, Apr 17, 2015 at 4:39 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>
> On my box, vclock_gettime using kvm-clock is about 40 ns. An empty
> syscall is about 33 ns. clock_gettime *should* be around 17 ns.
>
> The clock_gettime syscall is about 73 ns.
>
> Could we figure out why clock_gettime (the syscall) is so slow

If we only could profile some random program (let's call it "a.out")
that did syscall(__NR_clock_gettime) a couple million times.

Oh, lookie here, Santa came around:

    21.83%  [k] system_call
    12.85%  [.] syscall
     9.76%  [k] __audit_syscall_exit
     9.55%  [k] copy_user_enhanced_fast_string
     4.68%  [k] __getnstimeofday64
     4.08%  [k] syscall_trace_enter_phase1
     3.85%  [k] __audit_syscall_entry
     3.77%  [k] unroll_tree_refs
     3.15%  [k] sys_clock_gettime
     2.92%  [k] int_very_careful
     2.73%  [.] main
     2.35%  [k] syscall_trace_leave
     2.28%  [k] read_tsc
     1.73%  [k] int_restore_rest
     1.73%  [k] int_with_check
     1.48%  [k] syscall_return
     1.32%  [k] dput
     1.24%  [k] system_call_fastpath
     1.21%  [k] syscall_return_via_sysret
     1.21%  [k] tracesys
     0.81%  [k] do_audit_syscall_entry
     0.80%  [k] current_kernel_time
     0.73%  [k] getnstimeofday64
     0.68%  [k] path_put
     0.66%  [k] posix_clock_realtime_get
     0.61%  [k] int_careful
     0.60%  [k] mntput
     0.49%  [k] kfree
     0.36%  [k] _copy_to_user
     0.31%  [k] int_ret_from_sys_call_irqs_off

looks to me like it's spending a lot of time in system call auditing.
Which makes no sense to me, since none of this should be triggering any
auditing. And there's a lot of time in low-level kernel system call
assembly code.

If I only remembered the name of the crazy person who said "Ok" when I
suggested he just be the maintainer of the code, since he has spent a
lot of time sending patches for it. Something like Amdy Letorsky. No,
that wasn't it. Hmm. It's on the tip of my tongue.

Oh well. Maybe somebody can remember the guy's name. It's familiar for
some reason.

Andy?
Linus -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, Apr 17, 2015 at 2:28 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Fri, Apr 17, 2015 at 4:39 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>
>> On my box, vclock_gettime using kvm-clock is about 40 ns.  An empty
>> syscall is about 33 ns.  clock_gettime *should* be around 17 ns.
>>
>> The clock_gettime syscall is about 73 ns.
>>
>> Could we figure out why clock_gettime (the syscall) is so slow
>
> If we only could profile some random program (let's call it "a.out")
> that did the syscall(__NR_gettime_syscall) a couple million times.
>
> Oh, lookie here, Santa came around:
>
>   21.83%  [k] system_call
>   12.85%  [.] syscall
>    9.76%  [k] __audit_syscall_exit
>    9.55%  [k] copy_user_enhanced_fast_string
>    4.68%  [k] __getnstimeofday64
>    4.08%  [k] syscall_trace_enter_phase1
>    3.85%  [k] __audit_syscall_entry
>    3.77%  [k] unroll_tree_refs
>    3.15%  [k] sys_clock_gettime
>    2.92%  [k] int_very_careful
>    2.73%  [.] main
>    2.35%  [k] syscall_trace_leave
>    2.28%  [k] read_tsc
>    1.73%  [k] int_restore_rest
>    1.73%  [k] int_with_check
>    1.48%  [k] syscall_return
>    1.32%  [k] dput
>    1.24%  [k] system_call_fastpath
>    1.21%  [k] syscall_return_via_sysret
>    1.21%  [k] tracesys
>    0.81%  [k] do_audit_syscall_entry
>    0.80%  [k] current_kernel_time
>    0.73%  [k] getnstimeofday64
>    0.68%  [k] path_put
>    0.66%  [k] posix_clock_realtime_get
>    0.61%  [k] int_careful
>    0.60%  [k] mntput
>    0.49%  [k] kfree
>    0.36%  [k] _copy_to_user
>    0.31%  [k] int_ret_from_sys_call_irqs_off
>
> looks to me like it's spending a lot of time in system call auditing.
> Which makes no sense to me, since none of this should be triggering
> any auditing.  And there's a lot of time in low-level kernel system
> call assembly code.

Muahaha.  The auditors have invaded your system.  (I did my little
benchmark with a more sensible configuration -- see way below.)

Can you send the output of:

# auditctl -s
# auditctl -l

Are you, perchance, using Fedora?

I lobbied rather heavily, and successfully, to get the default
configuration to stop auditing.  Unfortunately, the fix wasn't
retroactive, so, unless you have a very fresh install, you might want
to apply the fix yourself:

https://fedorahosted.org/fesco/ticket/1311

> If I only remembered the name of the crazy person who said "Ok" when I
> suggested he just be the maintainer of the code since he has spent a lot
> of time sending patches for it.  Something like Amdy Letorsky.  No, that
> wasn't it.  Hmm.  It's on the tip of my tongue.
>
> Oh well.  Maybe somebody can remember the guy's name.  It's familiar for
> some reason.  Andy?

Amdy Lumirtowsky thinks he meant to attach a condition to his
maintainerish activities: he will do his best to keep the audit code
*out* of the low-level stuff, but he's going to try to avoid ever
touching the audit code itself, because if he ever had to change it,
he might accidentally delete the entire file.

Seriously, wasn't there a TAINT_PERFORMANCE thing proposed at some
point?  I would love auditing to set some really loud global warning
that you've just done a Bad Thing (tm) performance-wise by enabling
it.

Back to timing.  With kvm-clock, I see:

 23.80%  timing_test_64  [kernel.kallsyms]  [k] pvclock_clocksource_read
 15.57%  timing_test_64  libc-2.20.so       [.] syscall
 12.39%  timing_test_64  [kernel.kallsyms]  [k] system_call
 12.35%  timing_test_64  [kernel.kallsyms]  [k] copy_user_generic_string
 10.95%  timing_test_64  [kernel.kallsyms]  [k] system_call_after_swapgs
  7.35%  timing_test_64  [kernel.kallsyms]  [k] ktime_get_ts64
  6.20%  timing_test_64  [kernel.kallsyms]  [k] sys_clock_gettime
  3.62%  timing_test_64  [kernel.kallsyms]  [k] system_call_fastpath
  2.08%  timing_test_64  timing_test_64     [.] main
  1.72%  timing_test_64  timing_test_64     [.] syscall@plt
  1.58%  timing_test_64  [kernel.kallsyms]  [k] kvm_clock_get_cycles
  1.22%  timing_test_64  [kernel.kallsyms]  [k] _copy_to_user
  0.65%  timing_test_64  [kernel.kallsyms]  [k] posix_ktime_get_ts
  0.13%  timing_test_64  [kernel.kallsyms]  [k] apic_timer_interrupt

We've got some silly indirection, a uaccess that probably didn't get
optimized very well, and the terrifying function
pvclock_clocksource_read.

By comparison, using tsc:

 19.51%  timing_test_64  libc-2.20.so       [.] syscall
 15.52%  timing_test_64  [kernel.kallsyms]  [k] system_call
 15.25%  timing_test_64  [kernel.kallsyms]  [k] copy_user_generic_string
 14.34%  timing_test_64  [kernel.kallsyms]  [k] system_call_after_swapgs
  8.66%  timing_test_64  [kernel.kallsyms]  [k] ktime_get_ts64
  6.95%  timing_test_64  [kernel.kallsyms]  [k] sys_clock_gettime
  5.93%  timing_test_64  [kernel.kallsyms]  [k] native_read_tsc
  5.12%  timing_test_64  [kernel.kallsyms]  [k] system_call_fastpath
  2.62%  timing_test_64  timing_test_64     [.] main

That's better, although the uaccess silliness is still there.  (No
PEBS -- I don't think it works right in KVM.)

--Andy

>
>            Linus
On Fri, Apr 17, 2015 at 5:42 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>
> Muahaha.  The auditors have invaded your system.  (I did my little
> benchmark with a more sensible configuration -- see way below.)
>
> Can you send the output of:
>
> # auditctl -s
> # auditctl -l

# auditctl -s
enabled 1
flag 1
pid 822
rate_limit 0
backlog_limit 320
lost 0
backlog 0
backlog_wait_time 60000
loginuid_immutable 0 unlocked
# auditctl -l
No rules

> Are you, perchance, using Fedora?

F21.  Yup.

I used to just disable auditing in the kernel entirely, but then I
ended up deciding that I need to run something closer to the broken
Fedora config (selinux in particular) in order to actually optimize
the real-world pathname handling situation rather than the _sane_ one.
Oh well.  I think audit support got enabled at the same time in my
kernels because I ended up using the default config and then taking
out the truly crazy stuff without noticing AUDITSYSCALL.

> I lobbied rather heavily, and successfully, to get the default
> configuration to stop auditing.  Unfortunately, the fix wasn't
> retroactive, so, unless you have a very fresh install, you might want
> to apply the fix yourself:

Is that fix happening in Fedora going forward, though?  Like F22?

> Amdy Lumirtowsky thinks he meant to attach a condition to his
> maintainerish activities: he will do his best to keep the audit code
> *out* of the low-level stuff, but he's going to try to avoid ever
> touching the audit code itself, because if he ever had to change it,
> he might accidentally delete the entire file.

Oooh.  That would be _such_ a shame.

Can we please do it by mistake?  "Oops, my fingers slipped"

> Seriously, wasn't there a TAINT_PERFORMANCE thing proposed at some
> point?  I would love auditing to set some really loud global warning
> that you've just done a Bad Thing (tm) performance-wise by enabling
> it.

Or even just a big fat warning in dmesg the first time auditing triggers.

> Back to timing.  With kvm-clock, I see:
>
>  23.80%  timing_test_64  [kernel.kallsyms]  [k] pvclock_clocksource_read

Oh wow.  How can that possibly be sane?

Isn't the *whole* point of pvclock_clocksource_read() to be a native
rdtsc with scaling?  How does it cause that kind of insane pain?

Oh well.  Some paravirt person would need to look and care.

            Linus
On Fri, Apr 17, 2015 at 3:04 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Fri, Apr 17, 2015 at 5:42 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>
>> Muahaha.  The auditors have invaded your system.  (I did my little
>> benchmark with a more sensible configuration -- see way below.)
>>
>> Can you send the output of:
>>
>> # auditctl -s
>> # auditctl -l
>
> # auditctl -s
> enabled 1
> flag 1
> pid 822
> rate_limit 0
> backlog_limit 320
> lost 0
> backlog 0
> backlog_wait_time 60000
> loginuid_immutable 0 unlocked
> # auditctl -l
> No rules

Yes, "No rules" doesn't mean "don't audit".

>
>> Are you, perchance, using Fedora?
>
> F21.  Yup.
>
> I used to just disable auditing in the kernel entirely, but then I
> ended up deciding that I need to run something closer to the broken
> Fedora config (selinux in particular) in order to actually optimize
> the real-world pathname handling situation rather than the _sane_ one.
> Oh well.  I think audit support got enabled at the same time in my
> kernels because I ended up using the default config and then taking
> out the truly crazy stuff without noticing AUDITSYSCALL.
>
>> I lobbied rather heavily, and successfully, to get the default
>> configuration to stop auditing.  Unfortunately, the fix wasn't
>> retroactive, so, unless you have a very fresh install, you might want
>> to apply the fix yourself:
>
> Is that fix happening in Fedora going forward, though?  Like F22?

It's in F21, actually.  Unfortunately, the audit package is really
weird.  It installs /etc/audit/rules.d/audit.rules, containing:

# This file contains the auditctl rules that are loaded
# whenever the audit daemon is started via the initscripts.
# The rules are simply the parameters that would be passed
# to auditctl.

# First rule - delete all
-D

# This suppresses syscall auditing for all tasks started
# with this rule in effect.  Remove it if you need syscall
# auditing.
-a task,never

Then, if it's a fresh install (i.e. /etc/audit/audit.rules doesn't
exist) it copies that file to /etc/audit/audit.rules post-install.
(No, I don't know why it works this way.)

IOW, you might want to copy your /etc/audit/rules.d/audit.rules to
/etc/audit/audit.rules.  I think you need to reboot to get the full
effect.  You could apply the rule manually (or maybe restart the audit
service), but it will only affect newly-started tasks.

>
>> Amdy Lumirtowsky thinks he meant to attach a condition to his
>> maintainerish activities: he will do his best to keep the audit code
>> *out* of the low-level stuff, but he's going to try to avoid ever
>> touching the audit code itself, because if he ever had to change it,
>> he might accidentally delete the entire file.
>
> Oooh.  That would be _such_ a shame.
>
> Can we please do it by mistake?  "Oops, my fingers slipped"
>
>> Seriously, wasn't there a TAINT_PERFORMANCE thing proposed at some
>> point?  I would love auditing to set some really loud global warning
>> that you've just done a Bad Thing (tm) performance-wise by enabling
>> it.
>
> Or even just a big fat warning in dmesg the first time auditing triggers.
>
>> Back to timing.  With kvm-clock, I see:
>>
>>  23.80%  timing_test_64  [kernel.kallsyms]  [k] pvclock_clocksource_read
>
> Oh wow.  How can that possibly be sane?
>
> Isn't the *whole* point of pvclock_clocksource_read() to be a native
> rdtsc with scaling?  How does it cause that kind of insane pain?

An unnecessarily complicated protocol, a buggy host implementation,
and an unnecessarily complicated guest implementation :(

>
> Oh well.  Some paravirt person would need to look and care.

The code there is a bit scary.

--Andy

>
>            Linus
On Fri, Apr 17, 2015 at 03:25:28PM -0700, Andy Lutomirski wrote:
> On Fri, Apr 17, 2015 at 3:04 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> > On Fri, Apr 17, 2015 at 5:42 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> >>
> >> Muahaha.  The auditors have invaded your system.  (I did my little
> >> benchmark with a more sensible configuration -- see way below.)
> >>
> >> Can you send the output of:
> >>
> >> # auditctl -s
> >> # auditctl -l
> >
> > # auditctl -s
> > enabled 1
> > flag 1
> > pid 822
> > rate_limit 0
> > backlog_limit 320
> > lost 0
> > backlog 0
> > backlog_wait_time 60000
> > loginuid_immutable 0 unlocked
> > # auditctl -l
> > No rules
>
> Yes, "No rules" doesn't mean "don't audit".
>
> >
> >> Are you, perchance, using Fedora?
> >
> > F21.  Yup.
> >
> > I used to just disable auditing in the kernel entirely, but then I
> > ended up deciding that I need to run something closer to the broken
> > Fedora config (selinux in particular) in order to actually optimize
> > the real-world pathname handling situation rather than the _sane_ one.
> > Oh well.  I think audit support got enabled at the same time in my
> > kernels because I ended up using the default config and then taking
> > out the truly crazy stuff without noticing AUDITSYSCALL.
> >
> >> I lobbied rather heavily, and successfully, to get the default
> >> configuration to stop auditing.  Unfortunately, the fix wasn't
> >> retroactive, so, unless you have a very fresh install, you might want
> >> to apply the fix yourself:
> >
> > Is that fix happening in Fedora going forward, though?  Like F22?
>
> It's in F21, actually.  Unfortunately, the audit package is really
> weird.  It installs /etc/audit/rules.d/audit.rules, containing:
>
> # This file contains the auditctl rules that are loaded
> # whenever the audit daemon is started via the initscripts.
> # The rules are simply the parameters that would be passed
> # to auditctl.
>
> # First rule - delete all
> -D
>
> # This suppresses syscall auditing for all tasks started
> # with this rule in effect.  Remove it if you need syscall
> # auditing.
> -a task,never
>
> Then, if it's a fresh install (i.e. /etc/audit/audit.rules doesn't
> exist) it copies that file to /etc/audit/audit.rules post-install.
> (No, I don't know why it works this way.)
>
> IOW, you might want to copy your /etc/audit/rules.d/audit.rules to
> /etc/audit/audit.rules.  I think you need to reboot to get the full
> effect.  You could apply the rule manually (or maybe restart the audit
> service), but it will only affect newly-started tasks.
>
> >
> >> Amdy Lumirtowsky thinks he meant to attach a condition to his
> >> maintainerish activities: he will do his best to keep the audit code
> >> *out* of the low-level stuff, but he's going to try to avoid ever
> >> touching the audit code itself, because if he ever had to change it,
> >> he might accidentally delete the entire file.
> >
> > Oooh.  That would be _such_ a shame.
> >
> > Can we please do it by mistake?  "Oops, my fingers slipped"
> >
> >> Seriously, wasn't there a TAINT_PERFORMANCE thing proposed at some
> >> point?  I would love auditing to set some really loud global warning
> >> that you've just done a Bad Thing (tm) performance-wise by enabling
> >> it.
> >
> > Or even just a big fat warning in dmesg the first time auditing triggers.
> >
> >> Back to timing.  With kvm-clock, I see:
> >>
> >>  23.80%  timing_test_64  [kernel.kallsyms]  [k] pvclock_clocksource_read
> >
> > Oh wow.  How can that possibly be sane?
> >
> > Isn't the *whole* point of pvclock_clocksource_read() to be a native
> > rdtsc with scaling?  How does it cause that kind of insane pain?
>
> An unnecessarily complicated protocol, a buggy host implementation,
> and an unnecessarily complicated guest implementation :(

How about starting by removing the unnecessary rdtsc-barrier?
On 18/04/2015 00:25, Andy Lutomirski wrote:
>> Isn't the *whole* point of pvclock_clocksource_read() to be a native
>> rdtsc with scaling?  How does it cause that kind of insane pain?

It's possible that your machine ends up with PVCLOCK_TSC_STABLE_BIT
clear, so you get an atomic cmpxchg in addition (and associated
cacheline bouncing, since anything reading the clocksource in the
kernel will cause that variable to bounce).  But that's not too common
on recent machines.  Is the vsyscall faster for you, or does it
degenerate to the syscall?  If so, you have PVCLOCK_TSC_STABLE_BIT
clear.

> An unnecessarily complicated protocol, a buggy host implementation,
> and an unnecessarily complicated guest implementation :(

pvclock_clocksource_read() itself is not scary and need not worry
about the buggy host implementation (preempt_disable makes things
easy).  It's the vDSO stuff that has the nice things.

There are a few micro-optimizations that we could do (the guest
implementation _is_ unnecessarily baroque), but it may not be enough
if the rdtsc_barrier()s (lfence) are the performance killers.  Will
look closely on Monday.

Paolo

>> Oh well.  Some paravirt person would need to look and care.
>
> The code there is a bit scary.
>
> --Andy
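[The seqcount-like protocol referred to in this thread can be sketched roughly as below.  This is a simplified, host-independent model of the guest-side read loop, not the real kernel code: the struct layout is abbreviated, rdtsc is stubbed out with a counter, and the required barriers are only noted in a comment.]

```c
#include <stdint.h>

/* Abbreviated model of the per-vcpu time info the host publishes. */
struct pvclock_vcpu_time_info {
	uint32_t version;		/* odd while the host is updating */
	uint64_t tsc_timestamp;
	uint64_t system_time;
	uint32_t tsc_to_system_mul;
	int8_t   tsc_shift;
};

/* Stand-in for rdtsc so the sketch runs anywhere. */
static uint64_t fake_rdtsc(void)
{
	static uint64_t t;
	return t += 1000;
}

/* Retry until the version field is even and unchanged across the read:
 * the same seqcount-like idea the guest implementation uses. */
static uint64_t pvclock_read(const volatile struct pvclock_vcpu_time_info *ti)
{
	uint32_t version;
	uint64_t delta, ns;

	do {
		version = ti->version;
		/* the real code needs rdtsc_barrier()/smp_rmb() here */
		delta = fake_rdtsc() - ti->tsc_timestamp;
		if (ti->tsc_shift >= 0)
			delta <<= ti->tsc_shift;
		else
			delta >>= -ti->tsc_shift;
		ns = ti->system_time +
		     ((delta * ti->tsc_to_system_mul) >> 32);
	} while ((version & 1) || version != ti->version);

	return ns;
}
```

[The lfence-based barriers elided above are exactly the ones the thread suspects of being the performance killers.]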
On 17/04/2015 22:18, Marcelo Tosatti wrote:
> The bug which this is fixing is very rare, have no memory of a report.
>
> In fact, its even difficult to create a synthetic reproducer.

But then why was the task migration notifier even in Jeremy's original
code for Xen?  Was it supposed to work even on non-synchronized TSC?

If that's the case, then it could be reverted indeed; but then why did
you commit this patch to 4.1?  Did you think of something that would
cause the seqcount-like protocol to fail, and that turned out not to be
the case later?  I was only following the mailing list sparsely in March.

Paolo
On Mon, Apr 20, 2015 at 9:59 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 17/04/2015 22:18, Marcelo Tosatti wrote:
>> The bug which this is fixing is very rare, have no memory of a report.
>>
>> In fact, its even difficult to create a synthetic reproducer.
>
> But then why was the task migration notifier even in Jeremy's original
> code for Xen?  Was it supposed to work even on non-synchronized TSC?
>
> If that's the case, then it could be reverted indeed; but then why did
> you commit this patch to 4.1?  Did you think of something that would
> cause the seqcount-like protocol to fail, and that turned out not to be
> the case later?  I was only following the mailing list sparsely in March.

I don't think anyone ever tried that hard to test this stuff.  There
was an infinite loop that Firefox was triggering as a KVM guest
somewhat reliably until a couple months ago in the same vdso code.  :(

--Andy
On Mon, Apr 20, 2015 at 06:59:04PM +0200, Paolo Bonzini wrote:
>
> On 17/04/2015 22:18, Marcelo Tosatti wrote:
> > The bug which this is fixing is very rare, have no memory of a report.
> >
> > In fact, its even difficult to create a synthetic reproducer.
>
> But then why was the task migration notifier even in Jeremy's original
> code for Xen?

To cover for the vcpu1 -> vcpu2 -> vcpu1 case, i believe.

> Was it supposed to work even on non-synchronized TSC?

Yes it is supposed to work on non-synchronized TSC.

> If that's the case, then it could be reverted indeed; but then why did
> you commit this patch to 4.1?

Because it fixes the problem Andy reported (see Subject: KVM: x86: fix
kvmclock write race (v2) on kvm@).  As long as you have Radim's
fix on top.

> Did you think of something that would
> cause the seqcount-like protocol to fail, and that turned out not to be
> the case later?  I was only following the mailing list sparsely in March.

No.
On 22/04/2015 22:56, Marcelo Tosatti wrote:
>>> But then why was the task migration notifier even in Jeremy's original
>>> code for Xen?
> To cover for the vcpu1 -> vcpu2 -> vcpu1 case, i believe.

Ok, to cover it for non-synchronized TSC.  While KVM requires
synchronized TSC.

>> If that's the case, then it could be reverted indeed; but then why did
>> you commit this patch to 4.1?
>
> Because it fixes the problem Andy reported (see Subject: KVM: x86: fix
> kvmclock write race (v2) on kvm@).  As long as you have Radim's
> fix on top.

But if it's so rare, and it was known that fixing the host protocol was
just as good a solution, why was the guest fix committed?

I'm just trying to understand.  I am worried that this patch was rushed
in; so far I had assumed it wasn't (a revert of a revert is rare enough
that you don't do it lightly...) but maybe I was wrong.

Right now I cannot even decide whether to revert it (and please Peter in
the process :)) or submit the Kconfig symbol patch officially.

Paolo
On Mon, Apr 20, 2015 at 01:27:58PM -0700, Andy Lutomirski wrote:
> On Mon, Apr 20, 2015 at 9:59 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> >
> > On 17/04/2015 22:18, Marcelo Tosatti wrote:
> >> The bug which this is fixing is very rare, have no memory of a report.
> >>
> >> In fact, its even difficult to create a synthetic reproducer.
> >
> > But then why was the task migration notifier even in Jeremy's original
> > code for Xen?  Was it supposed to work even on non-synchronized TSC?
> >
> > If that's the case, then it could be reverted indeed; but then why did
> > you commit this patch to 4.1?  Did you think of something that would
> > cause the seqcount-like protocol to fail, and that turned out not to be
> > the case later?  I was only following the mailing list sparsely in March.
>
> I don't think anyone ever tried that hard to test this stuff.  There
> was an infinite loop that Firefox was triggering as a KVM guest
> somewhat reliably until a couple months ago in the same vdso code.  :(

https://bugzilla.redhat.com/show_bug.cgi?id=1174664

--- Comment #5 from Juan Quintela <quintela@redhat.com> ---

Another round

# dmesg | grep msr
[ 0.000000] kvm-clock: Using msrs 4b564d01 and 4b564d00
[ 0.000000] kvm-clock: cpu 0, msr 1:1ffd8001, primary cpu clock
[ 0.000000] kvm-stealtime: cpu 0, msr 11fc0d100
[ 0.041174] kvm-clock: cpu 1, msr 1:1ffd8041, secondary cpu clock
[ 0.053011] kvm-stealtime: cpu 1, msr 11fc8d100

After start:

[root@trasno yum.repos.d]# virsh qemu-monitor-command --hmp browser 'xp /8x 0x1ffd8000'
000000001ffd8000: 0x3b401060 0xfffc7f4b 0x3b42d040 0xfffc7f4b
000000001ffd8010: 0x3b42d460 0xfffc7f4b 0x3b42d4c0 0xfffc7f4b
[root@trasno yum.repos.d]# virsh qemu-monitor-command --hmp browser 'xp /8x 0x1ffd8040'
000000001ffd8040: 0x3b42d700 0xfffc7f4b 0x3b42d760 0xfffc7f4b
000000001ffd8050: 0x3b42d7c0 0xfffc7f4b 0x3b42d820 0xfffc7f4b

When firefox hangs:

[root@trasno yum.repos.d]# virsh qemu-monitor-command --hmp browser 'xp /8x 0x1ffd8000'
000000001ffd8000: 0x5a5a5a5a 0x5a5a5a5a 0x5a5a5a5a 0x5a5a5a5a
000000001ffd8010: 0x5a5a5a5a 0x5a5a5a5a 0x5a5a5a5a 0x5a5a5a5a
[root@trasno yum.repos.d]# virsh qemu-monitor-command --hmp browser 'xp /8x 0x1ffd8040'
000000001ffd8040: 0x5a5a5a5a 0x5a5a5a5a 0x5a5a5a5a 0x5a5a5a5a
000000001ffd8050: 0x5a5a5a5a 0x5a5a5a5a 0x5a5a5a5a 0x5a5a5a5a
On Wed, Apr 22, 2015 at 11:01:49PM +0200, Paolo Bonzini wrote:
>
> On 22/04/2015 22:56, Marcelo Tosatti wrote:
> >>> But then why was the task migration notifier even in Jeremy's original
> >>> code for Xen?
> > To cover for the vcpu1 -> vcpu2 -> vcpu1 case, i believe.
>
> Ok, to cover it for non-synchronized TSC.  While KVM requires
> synchronized TSC.
>
> >> If that's the case, then it could be reverted indeed; but then why did
> >> you commit this patch to 4.1?
> >
> > Because it fixes the problem Andy reported (see Subject: KVM: x86: fix
> > kvmclock write race (v2) on kvm@).  As long as you have Radim's
> > fix on top.
>
> But if it's so rare, and it was known that fixing the host protocol was
> just as good a solution, why was the guest fix committed?

I don't know.  Should have fixed the host protocol.

> I'm just trying to understand.  I am worried that this patch was rushed
> in; so far I had assumed it wasn't (a revert of a revert is rare enough
> that you don't do it lightly...) but maybe I was wrong.

Yes it was rushed in.

> Right now I cannot even decide whether to revert it (and please Peter in
> the process :)) or submit the Kconfig symbol patch officially.
>
> Paolo
On 22/04/2015 23:21, Marcelo Tosatti wrote:
> On Mon, Apr 20, 2015 at 01:27:58PM -0700, Andy Lutomirski wrote:
>> On Mon, Apr 20, 2015 at 9:59 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>>>
>>> On 17/04/2015 22:18, Marcelo Tosatti wrote:
>>>> The bug which this is fixing is very rare, have no memory of a report.
>>>>
>>>> In fact, its even difficult to create a synthetic reproducer.
>>>
>>> But then why was the task migration notifier even in Jeremy's original
>>> code for Xen?  Was it supposed to work even on non-synchronized TSC?
>>>
>>> If that's the case, then it could be reverted indeed; but then why did
>>> you commit this patch to 4.1?  Did you think of something that would
>>> cause the seqcount-like protocol to fail, and that turned out not to be
>>> the case later?  I was only following the mailing list sparsely in March.
>>
>> I don't think anyone ever tried that hard to test this stuff.  There
>> was an infinite loop that Firefox was triggering as a KVM guest
>> somewhat reliably until a couple months ago in the same vdso code.  :(
>
> https://bugzilla.redhat.com/show_bug.cgi?id=1174664

That was the missing volatile in an asm.  Older compilers didn't catch
it.  :(

Paolo
On 23/04/2015 00:55, Marcelo Tosatti wrote:
> On Wed, Apr 22, 2015 at 11:01:49PM +0200, Paolo Bonzini wrote:
>>
>> On 22/04/2015 22:56, Marcelo Tosatti wrote:
>>>>> But then why was the task migration notifier even in Jeremy's original
>>>>> code for Xen?
>>> To cover for the vcpu1 -> vcpu2 -> vcpu1 case, i believe.
>>
>> Ok, to cover it for non-synchronized TSC.  While KVM requires
>> synchronized TSC.
>>
>>>> If that's the case, then it could be reverted indeed; but then why did
>>>> you commit this patch to 4.1?
>>>
>>> Because it fixes the problem Andy reported (see Subject: KVM: x86: fix
>>> kvmclock write race (v2) on kvm@).  As long as you have Radim's
>>> fix on top.
>>
>> But if it's so rare, and it was known that fixing the host protocol was
>> just as good a solution, why was the guest fix committed?
>
> I don't know.  Should have fixed the host protocol.

No problem.  Let's do the right thing now.

>> I'm just trying to understand.  I am worried that this patch was rushed
>> in; so far I had assumed it wasn't (a revert of a revert is rare enough
>> that you don't do it lightly...) but maybe I was wrong.
>
> Yes it was rushed in.

Ok, so re-reverted it will be.

Paolo
On Thu, Apr 23, 2015 at 11:13:23AM +0200, Paolo Bonzini wrote:
>
> On 22/04/2015 23:21, Marcelo Tosatti wrote:
> > On Mon, Apr 20, 2015 at 01:27:58PM -0700, Andy Lutomirski wrote:
> >> On Mon, Apr 20, 2015 at 9:59 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> >>>
> >>> On 17/04/2015 22:18, Marcelo Tosatti wrote:
> >>>> The bug which this is fixing is very rare, have no memory of a report.
> >>>>
> >>>> In fact, its even difficult to create a synthetic reproducer.
> >>>
> >>> But then why was the task migration notifier even in Jeremy's original
> >>> code for Xen?  Was it supposed to work even on non-synchronized TSC?
> >>>
> >>> If that's the case, then it could be reverted indeed; but then why did
> >>> you commit this patch to 4.1?  Did you think of something that would
> >>> cause the seqcount-like protocol to fail, and that turned out not to be
> >>> the case later?  I was only following the mailing list sparsely in March.
> >>
> >> I don't think anyone ever tried that hard to test this stuff.  There
> >> was an infinite loop that Firefox was triggering as a KVM guest
> >> somewhat reliably until a couple months ago in the same vdso code.  :(
> >
> > https://bugzilla.redhat.com/show_bug.cgi?id=1174664
>
> That was the missing volatile in an asm.  Older compilers didn't catch
> it.  :(

How do you know that?  It looks like memory corruption (look at the
pattern at the end).
On 23/04/2015 13:51, Marcelo Tosatti wrote:
>>> https://bugzilla.redhat.com/show_bug.cgi?id=1174664
>>
>> That was the missing volatile in an asm.  Older compilers didn't catch
>> it.  :(
>
> How do you know that?  It looks like memory corruption (look at the
> pattern at the end).

I suspect some kind of operator error there, it makes no sense.

On the other hand, bug 1178975 is much clearer and the symptoms are the
same.  In that bug, you can see that the same kernel source works on f20
(package version 3.17.7-200.fc20.x86_64) and fails on f21 (package
version 3.17.7-300.fc21.x86_64).  Of course the compiler is different.
The newer one hoists the lsl out of the loop; if you get a CPU migration
at the wrong time, the cpu != cpu1 condition will always be true and the
loop will never exit.

Paolo
On Thu, Apr 23, 2015 at 02:02:29PM +0200, Paolo Bonzini wrote:
>
> On 23/04/2015 13:51, Marcelo Tosatti wrote:
> >>> https://bugzilla.redhat.com/show_bug.cgi?id=1174664
> >>
> >> That was the missing volatile in an asm.  Older compilers didn't catch
> >> it.  :(
> > How do you know that?  It looks like memory corruption (look at the
> > pattern at the end).
>
> I suspect some kind of operator error there, it makes no sense.

	if (unlikely(s->flags & SLAB_POISON))
		memset(start, POISON_INUSE, PAGE_SIZE << order);

 * Padding is done using 0x5a (POISON_INUSE)

> On the other hand, bug 1178975 is much clearer and the symptoms are the
> same.  In that bug, you can see that the same kernel source works on f20
> (package version 3.17.7-200.fc20.x86_64) and fails on f21 (package
> version 3.17.7-300.fc21.x86_64).  Of course the compiler is different.
> The newer one hoists the lsl out of the loop; if you get a CPU migration
> at the wrong time, the cpu != cpu1 condition will always be true and the
> loop will never exit.
>
> Paolo
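[The hang being debated is easiest to see against the shape of the retry loop itself.  The sketch below is a simplified stand-in for the vdso code, not the real source: vgetcpu() and read_pvti_ns() are hypothetical stubs, and the volatile on current_cpu plays the role that the (previously missing) volatile on the LSL asm plays in the real code, forcing a fresh read on every loop iteration.  If the compiler were allowed to hoist the CPU read out of the loop, a stale value after a migration would keep the exit condition from ever becoming true.]

```c
#include <stdint.h>

/* Hypothetical stand-in for the per-cpu pvti read; returns a fake
 * timestamp that encodes which CPU's data was read. */
static uint64_t read_pvti_ns(int cpu)
{
	return 1000 + (uint64_t)cpu;
}

/* Stand-in for the LSL-based vgetcpu.  The volatile here models the
 * "asm volatile" the real code needs: the read must happen on every
 * iteration, not be hoisted out of the loop. */
static volatile int current_cpu;

static int vgetcpu(void)
{
	return current_cpu;
}

static uint64_t vdso_clock_read(void)
{
	int cpu, cpu1;
	uint64_t ns;

	do {
		cpu  = vgetcpu();	/* which CPU's data do we read? */
		ns   = read_pvti_ns(cpu);
		cpu1 = vgetcpu();	/* detect migration mid-read */
	} while (cpu != cpu1);		/* retry if we migrated */

	return ns;
}
```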
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index d43e7e1c784b..9af252c8698d 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -649,6 +649,7 @@ if HYPERVISOR_GUEST config PARAVIRT bool "Enable paravirtualization code" + select TASK_MIGRATION_NOTIFIER ---help--- This changes the kernel so it can modify itself when it is run under a hypervisor, potentially improving performance significantly diff --git a/init/Kconfig b/init/Kconfig index 3b9df1aa35db..891917123338 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -2016,6 +2016,9 @@ source "block/Kconfig" config PREEMPT_NOTIFIERS bool +config TASK_MIGRATION_NOTIFIER + bool + config PADATA depends on SMP bool diff --git a/kernel/sched/core.c b/kernel/sched/core.c index f9123a82cbb6..c07a53aa543c 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -1016,12 +1016,14 @@ void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags) rq_clock_skip_update(rq, true); } +#ifdef CONFIG_TASK_MIGRATION_NOTIFIER static ATOMIC_NOTIFIER_HEAD(task_migration_notifier); void register_task_migration_notifier(struct notifier_block *n) { atomic_notifier_chain_register(&task_migration_notifier, n); } +#endif #ifdef CONFIG_SMP void set_task_cpu(struct task_struct *p, unsigned int new_cpu) @@ -1053,18 +1055,23 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu) trace_sched_migrate_task(p, new_cpu); if (task_cpu(p) != new_cpu) { +#ifdef CONFIG_TASK_MIGRATION_NOTIFIER struct task_migration_notifier tmn; + int from_cpu = task_cpu(p); +#endif if (p->sched_class->migrate_task_rq) p->sched_class->migrate_task_rq(p, new_cpu); p->se.nr_migrations++; perf_sw_event_sched(PERF_COUNT_SW_CPU_MIGRATIONS, 1, 0); +#ifdef CONFIG_TASK_MIGRATION_NOTIFIER tmn.task = p; - tmn.from_cpu = task_cpu(p); + tmn.from_cpu = from_cpu; tmn.to_cpu = new_cpu; atomic_notifier_call_chain(&task_migration_notifier, 0, &tmn); +#endif } __set_task_cpu(p, new_cpu);