Message ID | 20170210161928.GI6515@twins.programming.kicks-ass.net (mailing list archive)
---|---
State | New, archived
On 02/10/2017 11:19 AM, Peter Zijlstra wrote:
> On Fri, Feb 10, 2017 at 10:43:09AM -0500, Waiman Long wrote:
>> It was found when running fio sequential write test with a XFS ramdisk
>> on a VM running on a 2-socket x86-64 system, the %CPU times as reported
>> by perf were as follows:
>>
>>  69.75%  0.59%  fio  [k] down_write
>>  69.15%  0.01%  fio  [k] call_rwsem_down_write_failed
>>  67.12%  1.12%  fio  [k] rwsem_down_write_failed
>>  63.48% 52.77%  fio  [k] osq_lock
>>   9.46%  7.88%  fio  [k] __raw_callee_save___kvm_vcpu_is_preempt
>>   3.93%  3.93%  fio  [k] __kvm_vcpu_is_preempted
>>
> Thinking about this again, wouldn't something like the below also work?
>
>
> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> index 099fcba4981d..6aa33702c15c 100644
> --- a/arch/x86/kernel/kvm.c
> +++ b/arch/x86/kernel/kvm.c
> @@ -589,6 +589,7 @@ static void kvm_wait(u8 *ptr, u8 val)
>   local_irq_restore(flags);
>  }
>
> +#ifdef CONFIG_X86_32
>  __visible bool __kvm_vcpu_is_preempted(int cpu)
>  {
>   struct kvm_steal_time *src = &per_cpu(steal_time, cpu);
> @@ -597,6 +598,31 @@ __visible bool __kvm_vcpu_is_preempted(int cpu)
>  }
>  PV_CALLEE_SAVE_REGS_THUNK(__kvm_vcpu_is_preempted);
>
> +#else
> +
> +extern bool __raw_callee_save___kvm_vcpu_is_preempted(int);
> +
> +asm(
> +".pushsection .text;"
> +".global __raw_callee_save___kvm_vcpu_is_preempted;"
> +".type __raw_callee_save___kvm_vcpu_is_preempted, @function;"
> +"__raw_callee_save___kvm_vcpu_is_preempted:"
> +FRAME_BEGIN
> +"push %rdi;"
> +"push %rdx;"
> +"movslq %edi, %rdi;"
> +"movq $steal_time+16, %rax;"
> +"movq __per_cpu_offset(,%rdi,8), %rdx;"
> +"cmpb $0, (%rdx,%rax);"
> +"setne %al;"
> +"pop %rdx;"
> +"pop %rdi;"
> +FRAME_END
> +"ret;"
> +".popsection");
> +
> +#endif
> +
>  /*
>   * Setup pv_lock_ops to exploit KVM_FEATURE_PV_UNHALT if present.
>   */

That should work for now. I have done something similar for
__pv_queued_spin_unlock. However, this has the problem of creating a
dependency on the exact layout of the steal_time structure. Maybe the
constant 16 can be passed in as a parameter offsetof(struct
kvm_steal_time, preempted) to the asm call.

Cheers,
Longman
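Waiman's point is that the literal 16 silently encodes offsetof(struct kvm_steal_time, preempted). As a minimal stop-gap sketch (the helper name is hypothetical and the placement is illustrative; it would have to be called from an init path that is always built so the assertion is actually compiled), a build-time check would at least catch the structure layout drifting away from the hard-coded displacement:

#include <linux/bug.h>          /* BUILD_BUG_ON() */
#include <linux/stddef.h>       /* offsetof() */
#include <asm/kvm_para.h>       /* struct kvm_steal_time */

/* Hypothetical helper: call from e.g. kvm_guest_init() so the check is  */
/* really evaluated.  The 16 mirrors the constant in the asm above.      */
static inline void check_steal_time_layout(void)
{
        BUILD_BUG_ON(offsetof(struct kvm_steal_time, preempted) != 16);
}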
On 02/10/2017 11:35 AM, Waiman Long wrote:
> On 02/10/2017 11:19 AM, Peter Zijlstra wrote:
>> On Fri, Feb 10, 2017 at 10:43:09AM -0500, Waiman Long wrote:
>>> It was found when running fio sequential write test with a XFS ramdisk
>>> on a VM running on a 2-socket x86-64 system, the %CPU times as reported
>>> by perf were as follows:
>>>
>>>  69.75%  0.59%  fio  [k] down_write
>>>  69.15%  0.01%  fio  [k] call_rwsem_down_write_failed
>>>  67.12%  1.12%  fio  [k] rwsem_down_write_failed
>>>  63.48% 52.77%  fio  [k] osq_lock
>>>   9.46%  7.88%  fio  [k] __raw_callee_save___kvm_vcpu_is_preempt
>>>   3.93%  3.93%  fio  [k] __kvm_vcpu_is_preempted
>>>
>> Thinking about this again, wouldn't something like the below also work?
>>
>>
>> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
>> index 099fcba4981d..6aa33702c15c 100644
>> --- a/arch/x86/kernel/kvm.c
>> +++ b/arch/x86/kernel/kvm.c
>> @@ -589,6 +589,7 @@ static void kvm_wait(u8 *ptr, u8 val)
>>   local_irq_restore(flags);
>>  }
>>
>> +#ifdef CONFIG_X86_32
>>  __visible bool __kvm_vcpu_is_preempted(int cpu)
>>  {
>>   struct kvm_steal_time *src = &per_cpu(steal_time, cpu);
>> @@ -597,6 +598,31 @@ __visible bool __kvm_vcpu_is_preempted(int cpu)
>>  }
>>  PV_CALLEE_SAVE_REGS_THUNK(__kvm_vcpu_is_preempted);
>>
>> +#else
>> +
>> +extern bool __raw_callee_save___kvm_vcpu_is_preempted(int);
>> +
>> +asm(
>> +".pushsection .text;"
>> +".global __raw_callee_save___kvm_vcpu_is_preempted;"
>> +".type __raw_callee_save___kvm_vcpu_is_preempted, @function;"
>> +"__raw_callee_save___kvm_vcpu_is_preempted:"
>> +FRAME_BEGIN
>> +"push %rdi;"
>> +"push %rdx;"
>> +"movslq %edi, %rdi;"
>> +"movq $steal_time+16, %rax;"
>> +"movq __per_cpu_offset(,%rdi,8), %rdx;"
>> +"cmpb $0, (%rdx,%rax);"
>> +"setne %al;"
>> +"pop %rdx;"
>> +"pop %rdi;"
>> +FRAME_END
>> +"ret;"
>> +".popsection");
>> +
>> +#endif
>> +
>>  /*
>>   * Setup pv_lock_ops to exploit KVM_FEATURE_PV_UNHALT if present.
>>   */
> That should work for now. I have done something similar for
> __pv_queued_spin_unlock. However, this has the problem of creating a
> dependency on the exact layout of the steal_time structure. Maybe the
> constant 16 can be passed in as a parameter offsetof(struct
> kvm_steal_time, preempted) to the asm call.
>
> Cheers,
> Longman

One more thing: that will improve KVM performance, but it won't help Xen.

I looked into the assembly code for rwsem_spin_on_owner; it needs to save
and restore 2 additional registers with my patch. Doing it your way will
transfer the save and restore overhead to the assembly code. However,
__kvm_vcpu_is_preempted() is called multiple times per invocation of
rwsem_spin_on_owner. That function is simple enough that making
__kvm_vcpu_is_preempted() callee-save won't produce much compiler
optimization opportunity. The outer function rwsem_down_write_failed()
does appear to be a bit bigger (from 866 bytes to 884 bytes) though.

Cheers,
Longman
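For context on why the per-call overhead matters, here is a simplified sketch of the calling pattern Waiman describes. It is not the kernel's actual rwsem_spin_on_owner(); it only illustrates that the preemption check sits on every iteration of the spin loop (kernel context assumed):

static noinline bool spin_on_owner_sketch(struct rw_semaphore *sem,
                                          struct task_struct *owner)
{
        /* Each pass re-checks the owner; whatever register saving the  */
        /* vcpu_is_preempted() calling convention forces on the caller  */
        /* is therefore paid once per iteration.                        */
        while (READ_ONCE(sem->owner) == owner) {
                if (!owner->on_cpu || need_resched() ||
                    vcpu_is_preempted(task_cpu(owner)))
                        return false;   /* stop spinning, go to sleep */
                cpu_relax();
        }
        return true;
}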
On Fri, Feb 10, 2017 at 12:00:43PM -0500, Waiman Long wrote:
>>> +asm(
>>> +".pushsection .text;"
>>> +".global __raw_callee_save___kvm_vcpu_is_preempted;"
>>> +".type __raw_callee_save___kvm_vcpu_is_preempted, @function;"
>>> +"__raw_callee_save___kvm_vcpu_is_preempted:"
>>> +FRAME_BEGIN
>>> +"push %rdi;"
>>> +"push %rdx;"
>>> +"movslq %edi, %rdi;"
>>> +"movq $steal_time+16, %rax;"
>>> +"movq __per_cpu_offset(,%rdi,8), %rdx;"
>>> +"cmpb $0, (%rdx,%rax);"

Could we not put the $steal_time+16 displacement as an immediate in the
cmpb and save a whole register here?

That way we'd end up with something like:

asm("
push %rdi;
movslq %edi, %rdi;
movq __per_cpu_offset(,%rdi,8), %rax;
cmpb $0, %[offset](%rax);
setne %al;
pop %rdi;
" : : [offset] "i" (((unsigned long)&steal_time) + offsetof(struct steal_time, preempted)));

And if we could get rid of the sign extend on edi we could avoid all the
push-pop nonsense, but I'm not sure I see how to do that (then again,
this asm foo isn't my strongest point).

>>> +"setne %al;"
>>> +"pop %rdx;"
>>> +"pop %rdi;"
>>> +FRAME_END
>>> +"ret;"
>>> +".popsection");
>>> +
>>> +#endif
>>> +
>>>  /*
>>>   * Setup pv_lock_ops to exploit KVM_FEATURE_PV_UNHALT if present.
>>>   */
>> That should work for now. I have done something similar for
>> __pv_queued_spin_unlock. However, this has the problem of creating a
>> dependency on the exact layout of the steal_time structure. Maybe the
>> constant 16 can be passed in as a parameter offsetof(struct
>> kvm_steal_time, preempted) to the asm call.

Yeah it should be well possible to pass that in. But ideally we'd have
GCC grow something like __attribute__((callee_saved)) or somesuch and it
would do all this for us.

> One more thing, that will improve KVM performance, but it won't help Xen.

People still use Xen? ;-) In any case, their implementation looks very
similar and could easily crib this.

> I looked into the assembly code for rwsem_spin_on_owner, It need to save
> and restore 2 additional registers with my patch. Doing it your way,
> will transfer the save and restore overhead to the assembly code.
> However, __kvm_vcpu_is_preempted() is called multiple times per
> invocation of rwsem_spin_on_owner. That function is simple enough that
> making __kvm_vcpu_is_preempted() callee-save won't produce much compiler
> optimization opportunity.

This is because of that noinline, right? Otherwise it would've been
folded and register pressure would be much higher.

> The outer function rwsem_down_write_failed()
> does appear to be a bit bigger (from 866 bytes to 884 bytes) though.

I suspect GCC is being clever and since all this is static it plays
games with the calling convention and pushes these clobbers out.
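The attribute Peter wishes for is purely hypothetical: GCC has no __attribute__((callee_saved)) as of this thread. If it existed, the hand-written thunk could presumably collapse back to plain C along these lines:

/* Hypothetical: no such GCC attribute exists; shown only to sketch */
/* what the wish above would buy.                                   */
__attribute__((callee_saved)) __visible bool
__kvm_vcpu_is_preempted(int cpu)
{
        return !!per_cpu(steal_time, cpu).preempted;
}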
On Mon, Feb 13, 2017 at 11:47:16AM +0100, Peter Zijlstra wrote:
> That way we'd end up with something like:
>
> asm("
> push %rdi;
> movslq %edi, %rdi;
> movq __per_cpu_offset(,%rdi,8), %rax;
> cmpb $0, %[offset](%rax);
> setne %al;
> pop %rdi;
> " : : [offset] "i" (((unsigned long)&steal_time) + offsetof(struct steal_time, preempted)));
>
> And if we could get rid of the sign extend on edi we could avoid all the
> push-pop nonsense, but I'm not sure I see how to do that (then again,
> this asm foo isn't my strongest point).

Maybe:

movsql %edi, %rax;
movq __per_cpu_offset(,%rax,8), %rax;
cmpb $0, %[offset](%rax);
setne %al;

?
On 02/13/2017 05:47 AM, Peter Zijlstra wrote:
> On Fri, Feb 10, 2017 at 12:00:43PM -0500, Waiman Long wrote:
>
>>>> +asm(
>>>> +".pushsection .text;"
>>>> +".global __raw_callee_save___kvm_vcpu_is_preempted;"
>>>> +".type __raw_callee_save___kvm_vcpu_is_preempted, @function;"
>>>> +"__raw_callee_save___kvm_vcpu_is_preempted:"
>>>> +FRAME_BEGIN
>>>> +"push %rdi;"
>>>> +"push %rdx;"
>>>> +"movslq %edi, %rdi;"
>>>> +"movq $steal_time+16, %rax;"
>>>> +"movq __per_cpu_offset(,%rdi,8), %rdx;"
>>>> +"cmpb $0, (%rdx,%rax);"
> Could we not put the $steal_time+16 displacement as an immediate in the
> cmpb and save a whole register here?
>
> That way we'd end up with something like:
>
> asm("
> push %rdi;
> movslq %edi, %rdi;
> movq __per_cpu_offset(,%rdi,8), %rax;
> cmpb $0, %[offset](%rax);
> setne %al;
> pop %rdi;
> " : : [offset] "i" (((unsigned long)&steal_time) + offsetof(struct steal_time, preempted)));
>
> And if we could get rid of the sign extend on edi we could avoid all the
> push-pop nonsense, but I'm not sure I see how to do that (then again,
> this asm foo isn't my strongest point).

Yes, I think that can work. I will try to run this patch to see how
things go.

>>>> +"setne %al;"
>>>> +"pop %rdx;"
>>>> +"pop %rdi;"
>>>> +FRAME_END
>>>> +"ret;"
>>>> +".popsection");
>>>> +
>>>> +#endif
>>>> +
>>>>  /*
>>>>   * Setup pv_lock_ops to exploit KVM_FEATURE_PV_UNHALT if present.
>>>>   */
>>> That should work for now. I have done something similar for
>>> __pv_queued_spin_unlock. However, this has the problem of creating a
>>> dependency on the exact layout of the steal_time structure. Maybe the
>>> constant 16 can be passed in as a parameter offsetof(struct
>>> kvm_steal_time, preempted) to the asm call.
> Yeah it should be well possible to pass that in. But ideally we'd have
> GCC grow something like __attribute__((callee_saved)) or somesuch and it
> would do all this for us.

That will be really nice too. I am not too fond of working in assembly.

>> One more thing, that will improve KVM performance, but it won't help Xen.
> People still use Xen? ;-) In any case, their implementation looks very
> similar and could easily crib this.

At Red Hat, my focus will be on KVM performance. I do believe that there
are still Xen users out there, so we still need to take their interests
into consideration. Given that, I am OK with making it work better on KVM
first and then thinking about Xen later.

>> I looked into the assembly code for rwsem_spin_on_owner, It need to save
>> and restore 2 additional registers with my patch. Doing it your way,
>> will transfer the save and restore overhead to the assembly code.
>> However, __kvm_vcpu_is_preempted() is called multiple times per
>> invocation of rwsem_spin_on_owner. That function is simple enough that
>> making __kvm_vcpu_is_preempted() callee-save won't produce much compiler
>> optimization opportunity.
> This is because of that noinline, right? Otherwise it would've been
> folded and register pressure would be much higher.

Yes, I guess so. The noinline is there so that we know what the CPU time
is for spinning rather than other activities within the slowpath.

>
>> The outer function rwsem_down_write_failed()
>> does appear to be a bit bigger (from 866 bytes to 884 bytes) though.
> I suspect GCC is being clever and since all this is static it plays
> games with the calling convention and pushes these clobbers out.
>

Cheers,
Longman
On 02/13/2017 05:53 AM, Peter Zijlstra wrote:
> On Mon, Feb 13, 2017 at 11:47:16AM +0100, Peter Zijlstra wrote:
>> That way we'd end up with something like:
>>
>> asm("
>> push %rdi;
>> movslq %edi, %rdi;
>> movq __per_cpu_offset(,%rdi,8), %rax;
>> cmpb $0, %[offset](%rax);
>> setne %al;
>> pop %rdi;
>> " : : [offset] "i" (((unsigned long)&steal_time) + offsetof(struct steal_time, preempted)));
>>
>> And if we could get rid of the sign extend on edi we could avoid all the
>> push-pop nonsense, but I'm not sure I see how to do that (then again,
>> this asm foo isn't my strongest point).
> Maybe:
>
> movsql %edi, %rax;
> movq __per_cpu_offset(,%rax,8), %rax;
> cmpb $0, %[offset](%rax);
> setne %al;
>
> ?

Yes, that looks good to me.

Cheers,
Longman
On February 13, 2017 2:53:43 AM PST, Peter Zijlstra <peterz@infradead.org> wrote:
> On Mon, Feb 13, 2017 at 11:47:16AM +0100, Peter Zijlstra wrote:
>> That way we'd end up with something like:
>>
>> asm("
>> push %rdi;
>> movslq %edi, %rdi;
>> movq __per_cpu_offset(,%rdi,8), %rax;
>> cmpb $0, %[offset](%rax);
>> setne %al;
>> pop %rdi;
>> " : : [offset] "i" (((unsigned long)&steal_time) + offsetof(struct steal_time, preempted)));
>>
>> And if we could get rid of the sign extend on edi we could avoid all the
>> push-pop nonsense, but I'm not sure I see how to do that (then again,
>> this asm foo isn't my strongest point).
>
> Maybe:
>
> movsql %edi, %rax;
> movq __per_cpu_offset(,%rax,8), %rax;
> cmpb $0, %[offset](%rax);
> setne %al;
>
> ?

We could kill the zero or sign extend by changing the calling interface
to pass an unsigned long instead of an int. It is much more likely that
a zero extend is free for the caller than a sign extend.
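A hedged sketch of the interface change being suggested; the tree at this point still used 'int cpu', so the signature below is hypothetical:

/* With a 64-bit argument the asm thunk could index __per_cpu_offset */
/* with %rdi directly, no movslq needed; for a u32 argument the      */
/* zero-extension is usually free because any 32-bit mov already     */
/* clears bits 63:32 of the destination register.                    */
__visible bool __kvm_vcpu_is_preempted(long cpu)
{
        struct kvm_steal_time *src = &per_cpu(steal_time, cpu);

        return !!src->preempted;
}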
On 02/13/2017 02:42 PM, Waiman Long wrote:
> On 02/13/2017 05:53 AM, Peter Zijlstra wrote:
>> On Mon, Feb 13, 2017 at 11:47:16AM +0100, Peter Zijlstra wrote:
>>> That way we'd end up with something like:
>>>
>>> asm("
>>> push %rdi;
>>> movslq %edi, %rdi;
>>> movq __per_cpu_offset(,%rdi,8), %rax;
>>> cmpb $0, %[offset](%rax);
>>> setne %al;
>>> pop %rdi;
>>> " : : [offset] "i" (((unsigned long)&steal_time) + offsetof(struct steal_time, preempted)));
>>>
>>> And if we could get rid of the sign extend on edi we could avoid all the
>>> push-pop nonsense, but I'm not sure I see how to do that (then again,
>>> this asm foo isn't my strongest point).
>> Maybe:
>>
>> movsql %edi, %rax;
>> movq __per_cpu_offset(,%rax,8), %rax;
>> cmpb $0, %[offset](%rax);
>> setne %al;
>>
>> ?
> Yes, that looks good to me.
>
> Cheers,
> Longman
>

Sorry, I am going to take that back. The displacement or offset can only
be up to 32 bits. So we will still need to use at least one more
register, I think.

Cheers,
Longman
On Mon, Feb 13, 2017 at 03:12:45PM -0500, Waiman Long wrote:
> On 02/13/2017 02:42 PM, Waiman Long wrote:
>> On 02/13/2017 05:53 AM, Peter Zijlstra wrote:
>>> On Mon, Feb 13, 2017 at 11:47:16AM +0100, Peter Zijlstra wrote:
>>>> That way we'd end up with something like:
>>>>
>>>> asm("
>>>> push %rdi;
>>>> movslq %edi, %rdi;
>>>> movq __per_cpu_offset(,%rdi,8), %rax;
>>>> cmpb $0, %[offset](%rax);
>>>> setne %al;
>>>> pop %rdi;
>>>> " : : [offset] "i" (((unsigned long)&steal_time) + offsetof(struct steal_time, preempted)));
>>>>
>>>> And if we could get rid of the sign extend on edi we could avoid all the
>>>> push-pop nonsense, but I'm not sure I see how to do that (then again,
>>>> this asm foo isn't my strongest point).
>>> Maybe:
>>>
>>> movsql %edi, %rax;
>>> movq __per_cpu_offset(,%rax,8), %rax;
>>> cmpb $0, %[offset](%rax);
>>> setne %al;
>>>
>>> ?
>> Yes, that looks good to me.
>>
>> Cheers,
>> Longman
>>
> Sorry, I am going to take it back. The displacement or offset can only
> be up to 32-bit. So we will still need to use at least one more
> register, I think.

I don't think that would be a problem; I very much doubt we declare more
than 4G worth of per-cpu variables in the kernel.

In any case, use "e" or "Z" as the constraint (I never quite know when to
use which). Those are s32 and u32 displacement immediates respectively,
and should fail to compile with a semi-sensible error if the displacement
is too big.
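A tiny, self-contained illustration of the constraint Peter mentions (the function name is made up; extended asm with operands like this is only legal inside a function body, which becomes relevant later in the thread):

/* "e" accepts a sign-extended 32-bit immediate; an out-of-range     */
/* value fails at compile time instead of being silently truncated.  */
/* "Z" is the zero-extended counterpart.                             */
static inline unsigned long add_imm_s32(unsigned long x)
{
        asm("addq %1, %0"       /* 64-bit add sign-extends its imm32 */
            : "+r" (x)
            : "e" (-4096L));
        return x;
}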
On Mon, Feb 13, 2017 at 12:06:44PM -0800, hpa@zytor.com wrote:
>> Maybe:
>>
>> movsql %edi, %rax;
>> movq __per_cpu_offset(,%rax,8), %rax;
>> cmpb $0, %[offset](%rax);
>> setne %al;
>>
>> ?
>
> We could kill the zero or sign extend by changing the calling
> interface to pass an unsigned long instead of an int. It is much more
> likely that a zero extend is free for the caller than a sign extend.

Right, Boris and I talked about that on IRC. I was wondering whether, if
the argument were u32, we could assume the top 32 bits are 0 and then use
rdi without a prior movzx. That would allow shaving off one more
instruction.

Also, the PVOP_CALL_ARG#() macros have an (unsigned long) cast in them
that doesn't make sense. That cast ends up resulting in the calling code
doing explicit sign or zero extends into the full 64-bit register for no
good reason.

If one removes that cast things still compile, but I worry something
somehow relies on this weird behaviour and will come apart.
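The macros in question, paraphrased from the 64-bit side of arch/x86/include/asm/paravirt_types.h of that era (check the tree for the exact definitions); the (unsigned long) cast is what makes a caller widen a 32-bit argument into the full register before the call:

#define PVOP_CALL_ARG1(x)       "D" ((unsigned long)(x))        /* %rdi */
#define PVOP_CALL_ARG2(x)       "S" ((unsigned long)(x))        /* %rsi */
#define PVOP_CALL_ARG3(x)       "d" ((unsigned long)(x))        /* %rdx */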
On February 13, 2017 1:52:20 PM PST, Peter Zijlstra <peterz@infradead.org> wrote:
> On Mon, Feb 13, 2017 at 03:12:45PM -0500, Waiman Long wrote:
>> On 02/13/2017 02:42 PM, Waiman Long wrote:
>>> On 02/13/2017 05:53 AM, Peter Zijlstra wrote:
>>>> On Mon, Feb 13, 2017 at 11:47:16AM +0100, Peter Zijlstra wrote:
>>>>> That way we'd end up with something like:
>>>>>
>>>>> asm("
>>>>> push %rdi;
>>>>> movslq %edi, %rdi;
>>>>> movq __per_cpu_offset(,%rdi,8), %rax;
>>>>> cmpb $0, %[offset](%rax);
>>>>> setne %al;
>>>>> pop %rdi;
>>>>> " : : [offset] "i" (((unsigned long)&steal_time) + offsetof(struct steal_time, preempted)));
>>>>>
>>>>> And if we could get rid of the sign extend on edi we could avoid all the
>>>>> push-pop nonsense, but I'm not sure I see how to do that (then again,
>>>>> this asm foo isn't my strongest point).
>>>> Maybe:
>>>>
>>>> movsql %edi, %rax;
>>>> movq __per_cpu_offset(,%rax,8), %rax;
>>>> cmpb $0, %[offset](%rax);
>>>> setne %al;
>>>>
>>>> ?
>>> Yes, that looks good to me.
>>>
>>> Cheers,
>>> Longman
>>>
>> Sorry, I am going to take it back. The displacement or offset can only
>> be up to 32-bit. So we will still need to use at least one more
>> register, I think.
>
> I don't think that would be a problem, I very much doubt we declare more
> than 4G worth of per-cpu variables in the kernel.
>
> In any case, use "e" or "Z" as constraint (I never quite know when to
> use which). That are s32 and u32 displacement immediates resp. and
> should fail compile with a semi-sensible failure if the displacement is
> too big.

e for signed, Z for unsigned. Obviously you have to use a matching
instruction: an immediate or displacement in a 64-bit instruction is
sign-extended, in a 32-bit instruction zero-extended. E.g.:

movl %0,%%eax   # use Z, all of %rax will be set
movq %0,%%rax   # use e
On February 13, 2017 1:52:20 PM PST, Peter Zijlstra <peterz@infradead.org> wrote:
> On Mon, Feb 13, 2017 at 03:12:45PM -0500, Waiman Long wrote:
>> On 02/13/2017 02:42 PM, Waiman Long wrote:
>>> On 02/13/2017 05:53 AM, Peter Zijlstra wrote:
>>>> On Mon, Feb 13, 2017 at 11:47:16AM +0100, Peter Zijlstra wrote:
>>>>> That way we'd end up with something like:
>>>>>
>>>>> asm("
>>>>> push %rdi;
>>>>> movslq %edi, %rdi;
>>>>> movq __per_cpu_offset(,%rdi,8), %rax;
>>>>> cmpb $0, %[offset](%rax);
>>>>> setne %al;
>>>>> pop %rdi;
>>>>> " : : [offset] "i" (((unsigned long)&steal_time) + offsetof(struct steal_time, preempted)));
>>>>>
>>>>> And if we could get rid of the sign extend on edi we could avoid all the
>>>>> push-pop nonsense, but I'm not sure I see how to do that (then again,
>>>>> this asm foo isn't my strongest point).
>>>> Maybe:
>>>>
>>>> movsql %edi, %rax;
>>>> movq __per_cpu_offset(,%rax,8), %rax;
>>>> cmpb $0, %[offset](%rax);
>>>> setne %al;
>>>>
>>>> ?
>>> Yes, that looks good to me.
>>>
>>> Cheers,
>>> Longman
>>>
>> Sorry, I am going to take it back. The displacement or offset can only
>> be up to 32-bit. So we will still need to use at least one more
>> register, I think.
>
> I don't think that would be a problem, I very much doubt we declare more
> than 4G worth of per-cpu variables in the kernel.
>
> In any case, use "e" or "Z" as constraint (I never quite know when to
> use which). That are s32 and u32 displacement immediates resp. and
> should fail compile with a semi-sensible failure if the displacement is
> too big.

Oh, and unless you are explicitly forcing 32-bit addressing mode,
displacements are always "e" (or "m" if you let gcc pick the addressing
mode).
On 02/13/2017 03:06 PM, hpa@zytor.com wrote:
> On February 13, 2017 2:53:43 AM PST, Peter Zijlstra <peterz@infradead.org> wrote:
>> On Mon, Feb 13, 2017 at 11:47:16AM +0100, Peter Zijlstra wrote:
>>> That way we'd end up with something like:
>>>
>>> asm("
>>> push %rdi;
>>> movslq %edi, %rdi;
>>> movq __per_cpu_offset(,%rdi,8), %rax;
>>> cmpb $0, %[offset](%rax);
>>> setne %al;
>>> pop %rdi;
>>> " : : [offset] "i" (((unsigned long)&steal_time) + offsetof(struct steal_time, preempted)));
>>>
>>> And if we could get rid of the sign extend on edi we could avoid all the
>>> push-pop nonsense, but I'm not sure I see how to do that (then again,
>>> this asm foo isn't my strongest point).
>> Maybe:
>>
>> movsql %edi, %rax;
>> movq __per_cpu_offset(,%rax,8), %rax;
>> cmpb $0, %[offset](%rax);
>> setne %al;
>>
>> ?
> We could kill the zero or sign extend by changing the calling interface
> to pass an unsigned long instead of an int. It is much more likely that
> a zero extend is free for the caller than a sign extend.

I have thought of that too. However, the goal is to eliminate memory
reads/writes from/to the stack. Eliminating a register sign-extend
instruction won't help much in terms of performance.

Cheers,
Longman
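For reference, the stack traffic Waiman is talking about comes from the generic callee-save thunk. Roughly paraphrased (the real PV_CALLEE_SAVE_REGS_THUNK() macro in asm/paravirt.h also handles sections and frame annotations), the x86-64 version spills every caller-clobbered register around an ordinary C call:

/* Paraphrase only: eight 64-bit registers go to the stack and come */
/* back on every call to the thunk.                                 */
asm(
"__raw_callee_save___kvm_vcpu_is_preempted:"
"push %rcx; push %rdx; push %rsi; push %rdi;"
"push %r8;  push %r9;  push %r10; push %r11;"
"call __kvm_vcpu_is_preempted;"
"pop  %r11; pop  %r10; pop  %r9;  pop  %r8;"
"pop  %rdi; pop  %rsi; pop  %rdx; pop  %rcx;"
"ret;");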
On Mon, Feb 13, 2017 at 05:24:36PM -0500, Waiman Long wrote:
>>> movsql %edi, %rax;
>>> movq __per_cpu_offset(,%rax,8), %rax;
>>> cmpb $0, %[offset](%rax);
>>> setne %al;

> I have thought of that too. However, the goal is to eliminate memory
> read/write from/to stack. Eliminating a register sign-extend instruction
> won't help much in term of performance.

The problem here is that all instructions have dependencies, so if you
can get rid of the sign-extend mov you kill a bunch of stall cycles (I
would expect). But yes, peanuts vs the stack loads/stores.
On 02/13/2017 04:52 PM, Peter Zijlstra wrote:
> On Mon, Feb 13, 2017 at 03:12:45PM -0500, Waiman Long wrote:
>> On 02/13/2017 02:42 PM, Waiman Long wrote:
>>> On 02/13/2017 05:53 AM, Peter Zijlstra wrote:
>>>> On Mon, Feb 13, 2017 at 11:47:16AM +0100, Peter Zijlstra wrote:
>>>>> That way we'd end up with something like:
>>>>>
>>>>> asm("
>>>>> push %rdi;
>>>>> movslq %edi, %rdi;
>>>>> movq __per_cpu_offset(,%rdi,8), %rax;
>>>>> cmpb $0, %[offset](%rax);
>>>>> setne %al;
>>>>> pop %rdi;
>>>>> " : : [offset] "i" (((unsigned long)&steal_time) + offsetof(struct steal_time, preempted)));
>>>>>
>>>>> And if we could get rid of the sign extend on edi we could avoid all the
>>>>> push-pop nonsense, but I'm not sure I see how to do that (then again,
>>>>> this asm foo isn't my strongest point).
>>>> Maybe:
>>>>
>>>> movsql %edi, %rax;
>>>> movq __per_cpu_offset(,%rax,8), %rax;
>>>> cmpb $0, %[offset](%rax);
>>>> setne %al;
>>>>
>>>> ?
>>> Yes, that looks good to me.
>>>
>>> Cheers,
>>> Longman
>>>
>> Sorry, I am going to take it back. The displacement or offset can only
>> be up to 32-bit. So we will still need to use at least one more
>> register, I think.
> I don't think that would be a problem, I very much doubt we declare more
> than 4G worth of per-cpu variables in the kernel.
>
> In any case, use "e" or "Z" as constraint (I never quite know when to
> use which). That are s32 and u32 displacement immediates resp. and
> should fail compile with a semi-sensible failure if the displacement is
> too big.
>

It is the address of &steal_time that will exceed the 32-bit limit.

Cheers,
Longman
On February 13, 2017 2:34:01 PM PST, Waiman Long <longman@redhat.com> wrote:
> On 02/13/2017 04:52 PM, Peter Zijlstra wrote:
>> On Mon, Feb 13, 2017 at 03:12:45PM -0500, Waiman Long wrote:
>>> On 02/13/2017 02:42 PM, Waiman Long wrote:
>>>> On 02/13/2017 05:53 AM, Peter Zijlstra wrote:
>>>>> On Mon, Feb 13, 2017 at 11:47:16AM +0100, Peter Zijlstra wrote:
>>>>>> That way we'd end up with something like:
>>>>>>
>>>>>> asm("
>>>>>> push %rdi;
>>>>>> movslq %edi, %rdi;
>>>>>> movq __per_cpu_offset(,%rdi,8), %rax;
>>>>>> cmpb $0, %[offset](%rax);
>>>>>> setne %al;
>>>>>> pop %rdi;
>>>>>> " : : [offset] "i" (((unsigned long)&steal_time) + offsetof(struct steal_time, preempted)));
>>>>>>
>>>>>> And if we could get rid of the sign extend on edi we could avoid all the
>>>>>> push-pop nonsense, but I'm not sure I see how to do that (then again,
>>>>>> this asm foo isn't my strongest point).
>>>>> Maybe:
>>>>>
>>>>> movsql %edi, %rax;
>>>>> movq __per_cpu_offset(,%rax,8), %rax;
>>>>> cmpb $0, %[offset](%rax);
>>>>> setne %al;
>>>>>
>>>>> ?
>>>> Yes, that looks good to me.
>>>>
>>>> Cheers,
>>>> Longman
>>>>
>>> Sorry, I am going to take it back. The displacement or offset can only
>>> be up to 32-bit. So we will still need to use at least one more
>>> register, I think.
>> I don't think that would be a problem, I very much doubt we declare more
>> than 4G worth of per-cpu variables in the kernel.
>>
>> In any case, use "e" or "Z" as constraint (I never quite know when to
>> use which). That are s32 and u32 displacement immediates resp. and
>> should fail compile with a semi-sensible failure if the displacement is
>> too big.
>>
> It is the address of &steal_time that will exceed the 32-bit limit.
>
> Cheers,
> Longman

That seems odd in the extreme?
On Mon, Feb 13, 2017 at 05:34:01PM -0500, Waiman Long wrote:
> It is the address of &steal_time that will exceed the 32-bit limit.
That seems extremely unlikely. That would mean we have more than 4G
worth of per-cpu variables declared in the kernel.
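For context: on x86-64 SMP builds the per-cpu section is linked at a small base, so &steal_time is itself a small offset rather than a full runtime address; the real address is only formed when __per_cpu_offset[cpu] is added, which is what the (%rax) base register supplies in the asm. The equivalent C-level computation is simply (sketch, function name illustrative):

/* per_cpu_ptr() adds the CPU's per-cpu base to the small symbol offset. */
static struct kvm_steal_time *steal_time_of(int cpu)
{
        return per_cpu_ptr(&steal_time, cpu);
}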
On 02/14/2017 04:39 AM, Peter Zijlstra wrote:
> On Mon, Feb 13, 2017 at 05:34:01PM -0500, Waiman Long wrote:
>> It is the address of &steal_time that will exceed the 32-bit limit.
> That seems extremely unlikely. That would mean we have more than 4G
> worth of per-cpu variables declared in the kernel.

I have some doubt about whether the compiler is able to properly use
RIP-relative addressing for this. Anyway, it seems that constraints
aren't allowed for an asm() statement outside of function context, at
least for the compiler that I am using (4.8.5). So it is a moot point.

Cheers,
Longman
On 14/02/17 14:46, Waiman Long wrote:
> On 02/14/2017 04:39 AM, Peter Zijlstra wrote:
>> On Mon, Feb 13, 2017 at 05:34:01PM -0500, Waiman Long wrote:
>>> It is the address of &steal_time that will exceed the 32-bit limit.
>> That seems extremely unlikely. That would mean we have more than 4G
>> worth of per-cpu variables declared in the kernel.
> I have some doubt about if the compiler is able to properly use
> RIP-relative addressing for this. Anyway, it seems like constraints
> aren't allowed for asm() when not in the function context, at least for
> the the compiler that I am using (4.8.5). So it is a moot point.

You can work around the issue of not having parameters in a plain asm()
statement by using an asm-offset, stringizing it, and having C put the
string fragments back together:

"cmpb $0, " STR(STEAL_TIME_preempted) "(%rax);"

~Andrew
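Putting Andrew's asm-offsets route together with the shorter sequence discussed earlier gives a sketch along the following lines. This is an illustration, not necessarily the patch that was eventually merged: the OFFSET() entry would live in arch/x86/kernel/asm-offsets_64.c so the constant lands in asm-offsets.h, the kernel's __stringify() stands in for STR(), the KVM_STEAL_TIME_preempted name is illustrative, the movslq stays because the int-vs-long argument question was still open at this point, and the frame annotations are dropped since nothing is pushed.

/* arch/x86/kernel/asm-offsets_64.c (illustrative):                     */
/*      OFFSET(KVM_STEAL_TIME_preempted, kvm_steal_time, preempted);    */

/* arch/x86/kernel/kvm.c, x86-64 side of the #ifdef (sketch):           */
#include <linux/stringify.h>
#include <asm/asm-offsets.h>

extern bool __raw_callee_save___kvm_vcpu_is_preempted(int);

asm(
".pushsection .text;"
".global __raw_callee_save___kvm_vcpu_is_preempted;"
".type __raw_callee_save___kvm_vcpu_is_preempted, @function;"
"__raw_callee_save___kvm_vcpu_is_preempted:"
"movslq %edi, %rax;"                            /* sign-extend 'int cpu' */
"movq __per_cpu_offset(,%rax,8), %rax;"         /* that CPU's per-cpu base */
"cmpb $0, steal_time+" __stringify(KVM_STEAL_TIME_preempted) "(%rax);"
"setne %al;"                                    /* return preempted != 0 */
"ret;"
".popsection");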
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 099fcba4981d..6aa33702c15c 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -589,6 +589,7 @@ static void kvm_wait(u8 *ptr, u8 val)
 	local_irq_restore(flags);
 }
 
+#ifdef CONFIG_X86_32
 __visible bool __kvm_vcpu_is_preempted(int cpu)
 {
 	struct kvm_steal_time *src = &per_cpu(steal_time, cpu);
@@ -597,6 +598,31 @@ __visible bool __kvm_vcpu_is_preempted(int cpu)
 }
 PV_CALLEE_SAVE_REGS_THUNK(__kvm_vcpu_is_preempted);
 
+#else
+
+extern bool __raw_callee_save___kvm_vcpu_is_preempted(int);
+
+asm(
+".pushsection .text;"
+".global __raw_callee_save___kvm_vcpu_is_preempted;"
+".type __raw_callee_save___kvm_vcpu_is_preempted, @function;"
+"__raw_callee_save___kvm_vcpu_is_preempted:"
+FRAME_BEGIN
+"push %rdi;"
+"push %rdx;"
+"movslq %edi, %rdi;"
+"movq $steal_time+16, %rax;"
+"movq __per_cpu_offset(,%rdi,8), %rdx;"
+"cmpb $0, (%rdx,%rax);"
+"setne %al;"
+"pop %rdx;"
+"pop %rdi;"
+FRAME_END
+"ret;"
+".popsection");
+
+#endif
+
 /*
  * Setup pv_lock_ops to exploit KVM_FEATURE_PV_UNHALT if present.
  */