Message ID | 3c1a7eee-7909-4b18-9497-1d0a6ee4f17d@suse.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | x86/PV: simplify (and thus correct) guest accessor functions | expand |
On 02/09/2024 1:28 pm, Jan Beulich wrote: > Taking a fault on a non-byte-granular insn means that the "number of > bytes not handled" return value would need extra care in calculating, if > we want callers to be able to derive e.g. exception context (to be > injected to the guest) - CR2 for #PF in particular - from the value. To > simplify things rather than complicating them, reduce inline assembly to > just byte-granular string insns. On recent CPUs that's also supposed to > be more efficient anyway. > > For singular element accessors, however, alignment checks are added, > hence slightly complicating the code. Misaligned (user) buffer accesses > will now be forwarded to copy_{from,to}_guest_ll(). > > Naturally copy_{from,to}_unsafe_ll() accessors end up being adjusted the > same way, as they're produced by mere re-processing of the same code. > Otoh copy_{from,to}_unsafe() aren't similarly adjusted, but have their > comments made match reality; down the road we may want to change their > return types, e.g. to bool. > > Fixes: 76974398a63c ("Added user-memory accessing functionality for x86_64") > Fixes: 7b8c36701d26 ("Introduce clear_user and clear_guest") > Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> > Signed-off-by: Jan Beulich <jbeulich@suse.com> This is definitely simpler, however the code gen less so. add/remove: 0/0 grow/shrink: 42/9 up/down: 2759/-178 (2581) > --- a/xen/arch/x86/include/asm/uaccess.h > +++ b/xen/arch/x86/include/asm/uaccess.h > @@ -251,7 +251,8 @@ do { > static always_inline unsigned long > __copy_to_guest_pv(void __user *to, const void *from, unsigned long n) > { > - if (__builtin_constant_p(n)) { > + if ( __builtin_constant_p(n) && !((unsigned long)to & (n - 1)) ) > + { The problem is that we now emit this if condition unconditionally, because the alignment check isn't constant-foldable. This in turn forces setup for both the trivial case and the function call case, compounding the bloat. e.g. the same example: unsigned int foo(void *ptr) { uint16_t val; return __copy_from_guest_pv(&val, ptr, sizeof(val)); } now generates: <foo>: 48 89 f8 mov %rdi,%rax 48 83 ec 08 sub $0x8,%rsp 48 89 fe mov %rdi,%rsi 83 e0 01 and $0x1,%eax 75 31 jne <foo+0x40> 0f 1f 00 nopl (%rax) // STAC 48 89 fa mov %rdi,%rdx 49 b8 ff ff ff ff ff movabs $0xffff87ffffffffff,%r8 87 ff ff 48 c7 c7 ff ff ff ff mov $0xffffffffffffffff,%rdi 49 39 d0 cmp %rdx,%r8 48 d1 df rcr %rdi 48 21 fa and %rdi,%rdx 66 8b 0a mov (%rdx),%cx 66 89 4c 24 06 mov %cx,0x6(%rsp) 0f 1f 00 nopl (%rax) // CLAC 48 83 c4 08 add $0x8,%rsp c3 ret 90 nop 48 8d 7c 24 06 lea 0x6(%rsp),%rdi ba 02 00 00 00 mov $0x2,%edx e8 11 bc 03 00 call <copy_from_guest_ll> 48 83 c4 08 add $0x8,%rsp c3 ret If we're not sure of the alignment in the first place, then it's better to hand off to copy_*_guest_ll than to emit logic like this. However, I can't think of any way of doing this without forgoing the optimisation entirely. We can't base anything on the type of the guest-side pointer because while a guest expected to align it, there's no hard requirement. In turn, this means there's about nothing we can do with compiler heuristics in Xen. Perhaps the right thing to do is have copy_*_guest_ll have a fastpath for aligned single accesses, and forgo the inline code generation entirely. ~Andrew
On 02.09.2024 16:13, Andrew Cooper wrote: > On 02/09/2024 1:28 pm, Jan Beulich wrote: >> Taking a fault on a non-byte-granular insn means that the "number of >> bytes not handled" return value would need extra care in calculating, if >> we want callers to be able to derive e.g. exception context (to be >> injected to the guest) - CR2 for #PF in particular - from the value. To >> simplify things rather than complicating them, reduce inline assembly to >> just byte-granular string insns. On recent CPUs that's also supposed to >> be more efficient anyway. >> >> For singular element accessors, however, alignment checks are added, >> hence slightly complicating the code. Misaligned (user) buffer accesses >> will now be forwarded to copy_{from,to}_guest_ll(). >> >> Naturally copy_{from,to}_unsafe_ll() accessors end up being adjusted the >> same way, as they're produced by mere re-processing of the same code. >> Otoh copy_{from,to}_unsafe() aren't similarly adjusted, but have their >> comments made match reality; down the road we may want to change their >> return types, e.g. to bool. >> >> Fixes: 76974398a63c ("Added user-memory accessing functionality for x86_64") >> Fixes: 7b8c36701d26 ("Introduce clear_user and clear_guest") >> Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> >> Signed-off-by: Jan Beulich <jbeulich@suse.com> > > This is definitely simpler, however the code gen less so. > > add/remove: 0/0 grow/shrink: 42/9 up/down: 2759/-178 (2581) Not nice, but entirely expected. >> --- a/xen/arch/x86/include/asm/uaccess.h >> +++ b/xen/arch/x86/include/asm/uaccess.h >> @@ -251,7 +251,8 @@ do { >> static always_inline unsigned long >> __copy_to_guest_pv(void __user *to, const void *from, unsigned long n) >> { >> - if (__builtin_constant_p(n)) { >> + if ( __builtin_constant_p(n) && !((unsigned long)to & (n - 1)) ) >> + { > > The problem is that we now emit this if condition unconditionally, > because the alignment check isn't constant-foldable. This in turn > forces setup for both the trivial case and the function call case, > compounding the bloat. > > e.g. the same example: > > unsigned int foo(void *ptr) > { > uint16_t val; > return __copy_from_guest_pv(&val, ptr, sizeof(val)); > } > > now generates: > > <foo>: > 48 89 f8 mov %rdi,%rax > 48 83 ec 08 sub $0x8,%rsp > 48 89 fe mov %rdi,%rsi > 83 e0 01 and $0x1,%eax > 75 31 jne <foo+0x40> > 0f 1f 00 nopl (%rax) // STAC > 48 89 fa mov %rdi,%rdx > 49 b8 ff ff ff ff ff movabs $0xffff87ffffffffff,%r8 > 87 ff ff > 48 c7 c7 ff ff ff ff mov $0xffffffffffffffff,%rdi > 49 39 d0 cmp %rdx,%r8 > 48 d1 df rcr %rdi > 48 21 fa and %rdi,%rdx > 66 8b 0a mov (%rdx),%cx > 66 89 4c 24 06 mov %cx,0x6(%rsp) > 0f 1f 00 nopl (%rax) // CLAC > 48 83 c4 08 add $0x8,%rsp > c3 ret > 90 nop > 48 8d 7c 24 06 lea 0x6(%rsp),%rdi > ba 02 00 00 00 mov $0x2,%edx > e8 11 bc 03 00 call <copy_from_guest_ll> > 48 83 c4 08 add $0x8,%rsp > c3 ret > > > If we're not sure of the alignment in the first place, then it's better > to hand off to copy_*_guest_ll than to emit logic like this. > > However, I can't think of any way of doing this without forgoing the > optimisation entirely. We can't base anything on the type of the > guest-side pointer because while a guest expected to align it, there's > no hard requirement. In turn, this means there's about nothing we can > do with compiler heuristics in Xen. Well, I'm pretty sure we can retain the optimization, just that to do so the patch will need to be more intrusive. I'd need to fiddle with {get,put}_unsafe_asm() to make sure we produce a correct error indicator (rather than the build-time constants passed into the macro). Yet that'll still be more code per call site, even if most (all?) new code would then live in .fixup. If we went this route, we'd want to settle up front whether for the to-guest operations we actually need the correct return values. The problematic uses, after all, are all from-user if I'm not mistaken. And determining how much of a buffer we can write to (ideally without actually altering any data) is going to be a little more tricky than a simple REP LODSB that we could use for read probing. Of course we could take the position that since we were asked to write to the buffer, we could as well REP STOSB e.g. zeroes into it. Obviously there's then the corner case of the buffer becoming accessible between the 1st and 2nd access attempt ... Another option might be to actually make the CR2 value visible to the fixup code, for use in calculating the correct error indicator. (If the fault isn't #PF, returning the full size is imo good enough in any event, even if there's the corner case of crossing the canonical- noncanonical boundary.) > Perhaps the right thing to do is have copy_*_guest_ll have a fastpath > for aligned single accesses, and forgo the inline code generation entirely. I expect that'll still be an increase in code size, plus presumably some loss of performance (for the added calls and the new conditional). Jan
On 02/09/2024 4:09 pm, Jan Beulich wrote: > On 02.09.2024 16:13, Andrew Cooper wrote: >> On 02/09/2024 1:28 pm, Jan Beulich wrote: >>> Taking a fault on a non-byte-granular insn means that the "number of >>> bytes not handled" return value would need extra care in calculating, if >>> we want callers to be able to derive e.g. exception context (to be >>> injected to the guest) - CR2 for #PF in particular - from the value. To >>> simplify things rather than complicating them, reduce inline assembly to >>> just byte-granular string insns. On recent CPUs that's also supposed to >>> be more efficient anyway. >>> >>> For singular element accessors, however, alignment checks are added, >>> hence slightly complicating the code. Misaligned (user) buffer accesses >>> will now be forwarded to copy_{from,to}_guest_ll(). >>> >>> Naturally copy_{from,to}_unsafe_ll() accessors end up being adjusted the >>> same way, as they're produced by mere re-processing of the same code. >>> Otoh copy_{from,to}_unsafe() aren't similarly adjusted, but have their >>> comments made match reality; down the road we may want to change their >>> return types, e.g. to bool. >>> >>> Fixes: 76974398a63c ("Added user-memory accessing functionality for x86_64") >>> Fixes: 7b8c36701d26 ("Introduce clear_user and clear_guest") >>> Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> >>> Signed-off-by: Jan Beulich <jbeulich@suse.com> >> This is definitely simpler, however the code gen less so. >> >> add/remove: 0/0 grow/shrink: 42/9 up/down: 2759/-178 (2581) > Not nice, but entirely expected. Yes. Having considered this for a while, the usual rule prevails; get it correct first, fast second. So, lets go with it like this. I already want to get rid of .fixup for backtrace reasons, so don't want to go expanding our use of it. I'm also not thrilled with the idea of doing a second access. At the time any fault has occurred, we're delivering an error of some form, and finding unexpected success the second time around is about the worst available outcome. As to your comment about to-guest, that does matter for correct %cr2 on an emulated store. The INS emulation tries precisely this. I've got some ad-hoc XTF tests which I'll run this patch over, but I'm expecting to R-by/T-by in this form. ~Andrew
On 26/09/2024 6:59 pm, Andrew Cooper wrote: > On 02/09/2024 4:09 pm, Jan Beulich wrote: >> On 02.09.2024 16:13, Andrew Cooper wrote: >>> On 02/09/2024 1:28 pm, Jan Beulich wrote: >>>> Taking a fault on a non-byte-granular insn means that the "number of >>>> bytes not handled" return value would need extra care in calculating, if >>>> we want callers to be able to derive e.g. exception context (to be >>>> injected to the guest) - CR2 for #PF in particular - from the value. To >>>> simplify things rather than complicating them, reduce inline assembly to >>>> just byte-granular string insns. On recent CPUs that's also supposed to >>>> be more efficient anyway. >>>> >>>> For singular element accessors, however, alignment checks are added, >>>> hence slightly complicating the code. Misaligned (user) buffer accesses >>>> will now be forwarded to copy_{from,to}_guest_ll(). >>>> >>>> Naturally copy_{from,to}_unsafe_ll() accessors end up being adjusted the >>>> same way, as they're produced by mere re-processing of the same code. >>>> Otoh copy_{from,to}_unsafe() aren't similarly adjusted, but have their >>>> comments made match reality; down the road we may want to change their >>>> return types, e.g. to bool. >>>> >>>> Fixes: 76974398a63c ("Added user-memory accessing functionality for x86_64") >>>> Fixes: 7b8c36701d26 ("Introduce clear_user and clear_guest") >>>> Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> >>>> Signed-off-by: Jan Beulich <jbeulich@suse.com> >>> This is definitely simpler, however the code gen less so. >>> >>> add/remove: 0/0 grow/shrink: 42/9 up/down: 2759/-178 (2581) >> Not nice, but entirely expected. > Yes. Having considered this for a while, the usual rule prevails; get > it correct first, fast second. > > So, lets go with it like this. > > I already want to get rid of .fixup for backtrace reasons, so don't want > to go expanding our use of it. > > I'm also not thrilled with the idea of doing a second access. At the > time any fault has occurred, we're delivering an error of some form, and > finding unexpected success the second time around is about the worst > available outcome. > > As to your comment about to-guest, that does matter for correct %cr2 on > an emulated store. The INS emulation tries precisely this. > > I've got some ad-hoc XTF tests which I'll run this patch over, but I'm > expecting to R-by/T-by in this form. Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Tested-by: Andrew Cooper <andrew.cooper3@citrix.com> See the "x86/pv: Fixes to guest_io_okay()" for the other aspects of this. ~Andrew
--- a/xen/arch/x86/include/asm/uaccess.h +++ b/xen/arch/x86/include/asm/uaccess.h @@ -251,7 +251,8 @@ do { static always_inline unsigned long __copy_to_guest_pv(void __user *to, const void *from, unsigned long n) { - if (__builtin_constant_p(n)) { + if ( __builtin_constant_p(n) && !((unsigned long)to & (n - 1)) ) + { unsigned long ret; switch (n) { @@ -291,7 +292,8 @@ __copy_to_guest_pv(void __user *to, cons static always_inline unsigned long __copy_from_guest_pv(void *to, const void __user *from, unsigned long n) { - if (__builtin_constant_p(n)) { + if ( __builtin_constant_p(n) && !((unsigned long)from & (n - 1)) ) + { unsigned long ret; switch (n) { @@ -321,8 +323,7 @@ __copy_from_guest_pv(void *to, const voi * * Copy data from hypervisor space to a potentially unmapped area. * - * Returns number of bytes that could not be copied. - * On success, this will be zero. + * Returns zero on success and non-zero if some bytes could not be copied. */ static always_inline unsigned int copy_to_unsafe(void __user *to, const void *from, unsigned int n) @@ -358,8 +359,7 @@ copy_to_unsafe(void __user *to, const vo * * Copy data from a potentially unmapped area space to hypervisor space. * - * Returns number of bytes that could not be copied. - * On success, this will be zero. + * Returns zero on success and non-zero if some bytes could not be copied. * * If some data could not be copied, this function will pad the copied * data to the requested size using zero bytes. --- a/xen/arch/x86/usercopy.c +++ b/xen/arch/x86/usercopy.c @@ -16,42 +16,19 @@ unsigned int copy_to_guest_ll(void __user *to, const void *from, unsigned int n) { - unsigned dummy; + GUARD(unsigned dummy); stac(); asm volatile ( GUARD( " guest_access_mask_ptr %[to], %q[scratch1], %q[scratch2]\n" ) - " cmp $"STR(2*BYTES_PER_LONG-1)", %[cnt]\n" - " jbe 1f\n" - " mov %k[to], %[cnt]\n" - " neg %[cnt]\n" - " and $"STR(BYTES_PER_LONG-1)", %[cnt]\n" - " sub %[cnt], %[aux]\n" - "4: rep movsb\n" /* make 'to' address aligned */ - " mov %[aux], %[cnt]\n" - " shr $"STR(LONG_BYTEORDER)", %[cnt]\n" - " and $"STR(BYTES_PER_LONG-1)", %[aux]\n" - " .align 2,0x90\n" - "0: rep movs"__OS"\n" /* as many words as possible... */ - " mov %[aux],%[cnt]\n" - "1: rep movsb\n" /* ...remainder copied as bytes */ + "1: rep movsb\n" "2:\n" - ".section .fixup,\"ax\"\n" - "5: add %[aux], %[cnt]\n" - " jmp 2b\n" - "3: lea (%q[aux], %q[cnt], "STR(BYTES_PER_LONG)"), %[cnt]\n" - " jmp 2b\n" - ".previous\n" - _ASM_EXTABLE(4b, 5b) - _ASM_EXTABLE(0b, 3b) _ASM_EXTABLE(1b, 2b) - : [cnt] "+c" (n), [to] "+D" (to), [from] "+S" (from), - [aux] "=&r" (dummy) + : [cnt] "+c" (n), [to] "+D" (to), [from] "+S" (from) GUARD(, [scratch1] "=&r" (dummy), [scratch2] "=&r" (dummy)) - : "[aux]" (n) - : "memory" ); + :: "memory" ); clac(); return n; @@ -66,25 +43,9 @@ unsigned int copy_from_guest_ll(void *to GUARD( " guest_access_mask_ptr %[from], %q[scratch1], %q[scratch2]\n" ) - " cmp $"STR(2*BYTES_PER_LONG-1)", %[cnt]\n" - " jbe 1f\n" - " mov %k[to], %[cnt]\n" - " neg %[cnt]\n" - " and $"STR(BYTES_PER_LONG-1)", %[cnt]\n" - " sub %[cnt], %[aux]\n" - "4: rep movsb\n" /* make 'to' address aligned */ - " mov %[aux],%[cnt]\n" - " shr $"STR(LONG_BYTEORDER)", %[cnt]\n" - " and $"STR(BYTES_PER_LONG-1)", %[aux]\n" - " .align 2,0x90\n" - "0: rep movs"__OS"\n" /* as many words as possible... */ - " mov %[aux], %[cnt]\n" - "1: rep movsb\n" /* ...remainder copied as bytes */ + "1: rep movsb\n" "2:\n" ".section .fixup,\"ax\"\n" - "5: add %[aux], %[cnt]\n" - " jmp 6f\n" - "3: lea (%q[aux], %q[cnt], "STR(BYTES_PER_LONG)"), %[cnt]\n" "6: mov %[cnt], %k[from]\n" " xchg %%eax, %[aux]\n" " xor %%eax, %%eax\n" @@ -93,14 +54,11 @@ unsigned int copy_from_guest_ll(void *to " mov %k[from], %[cnt]\n" " jmp 2b\n" ".previous\n" - _ASM_EXTABLE(4b, 5b) - _ASM_EXTABLE(0b, 3b) _ASM_EXTABLE(1b, 6b) : [cnt] "+c" (n), [to] "+D" (to), [from] "+S" (from), [aux] "=&r" (dummy) GUARD(, [scratch1] "=&r" (dummy), [scratch2] "=&r" (dummy)) - : "[aux]" (n) - : "memory" ); + :: "memory" ); clac(); return n; @@ -145,20 +103,12 @@ unsigned int clear_guest_pv(void __user stac(); asm volatile ( " guest_access_mask_ptr %[to], %[scratch1], %[scratch2]\n" - "0: rep stos"__OS"\n" - " mov %[bytes], %[cnt]\n" "1: rep stosb\n" "2:\n" - ".section .fixup,\"ax\"\n" - "3: lea (%q[bytes], %q[longs], "STR(BYTES_PER_LONG)"), %[cnt]\n" - " jmp 2b\n" - ".previous\n" - _ASM_EXTABLE(0b,3b) _ASM_EXTABLE(1b,2b) - : [cnt] "=&c" (n), [to] "+D" (to), [scratch1] "=&r" (dummy), + : [cnt] "+c" (n), [to] "+D" (to), [scratch1] "=&r" (dummy), [scratch2] "=&r" (dummy) - : [bytes] "r" (n & (BYTES_PER_LONG - 1)), - [longs] "0" (n / BYTES_PER_LONG), "a" (0) ); + : "a" (0) ); clac(); }
Taking a fault on a non-byte-granular insn means that the "number of bytes not handled" return value would need extra care in calculating, if we want callers to be able to derive e.g. exception context (to be injected to the guest) - CR2 for #PF in particular - from the value. To simplify things rather than complicating them, reduce inline assembly to just byte-granular string insns. On recent CPUs that's also supposed to be more efficient anyway. For singular element accessors, however, alignment checks are added, hence slightly complicating the code. Misaligned (user) buffer accesses will now be forwarded to copy_{from,to}_guest_ll(). Naturally copy_{from,to}_unsafe_ll() accessors end up being adjusted the same way, as they're produced by mere re-processing of the same code. Otoh copy_{from,to}_unsafe() aren't similarly adjusted, but have their comments made match reality; down the road we may want to change their return types, e.g. to bool. Fixes: 76974398a63c ("Added user-memory accessing functionality for x86_64") Fixes: 7b8c36701d26 ("Introduce clear_user and clear_guest") Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com>