Message ID | 1dac86de-cb8c-d2b2-d0ab-bf76707d22d0@suse.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | x86emul: string insn corrections | expand |
On 06/10/2022 14:11, Jan Beulich wrote: > In an entirely different context I came across Linux commit 428e3d08574b > ("KVM: x86: Fix zero iterations REP-string"), which points out that > we're still doing things wrong: For one, there's no zero-extension at > all on AMD. And then while RCX is zero-extended from 32 bits uniformly > for all string instructions on newer hardware, RSI/RDI are only for MOVS > and STOS on the systems I have access to. (On an old family 0xf system > I've further found that for REP LODS even RCX is not zero-extended.) > > Fixes: 79e996a89f69 ("x86emul: correct 64-bit mode repeated string insn handling with zero count") > Signed-off-by: Jan Beulich <jbeulich@suse.com> > --- > Partly RFC for none of this being documented anywhere (and it partly > being model specific); inquiry pending. None of this surprises me. The rep instructions have always been microcoded, and 0 reps is a special case which has been largely ignored until recently. I wouldn't be surprised if the behaviour changes with MISC_ENABLE.FAST_STRINGS (given the KVM commit message) and I also wouldn't be surprised if it's different between Core and Atom too (given the Fam 0xf observation). It's almost worth executing a zero-length rep stub, except that may potentially go very wrong in certain ecx/rcx cases. I'm not sure how important these cases are to cover. Given that they do differ between vendors and generation, and that their use in compiled code is not going to consider the registers live after use, is the complexity really worth it? ~Andrew
On 10.10.2022 20:56, Andrew Cooper wrote: > On 06/10/2022 14:11, Jan Beulich wrote: >> In an entirely different context I came across Linux commit 428e3d08574b >> ("KVM: x86: Fix zero iterations REP-string"), which points out that >> we're still doing things wrong: For one, there's no zero-extension at >> all on AMD. And then while RCX is zero-extended from 32 bits uniformly >> for all string instructions on newer hardware, RSI/RDI are only for MOVS >> and STOS on the systems I have access to. (On an old family 0xf system >> I've further found that for REP LODS even RCX is not zero-extended.) >> >> Fixes: 79e996a89f69 ("x86emul: correct 64-bit mode repeated string insn handling with zero count") >> Signed-off-by: Jan Beulich <jbeulich@suse.com> >> --- >> Partly RFC for none of this being documented anywhere (and it partly >> being model specific); inquiry pending. > > None of this surprises me. The rep instructions have always been > microcoded, and 0 reps is a special case which has been largely ignored > until recently. > > I wouldn't be surprised if the behaviour changes with > MISC_ENABLE.FAST_STRINGS (given the KVM commit message) and I also > wouldn't be surprised if it's different between Core and Atom too (given > the Fam 0xf observation). > > It's almost worth executing a zero-length rep stub, except that may > potentially go very wrong in certain ecx/rcx cases. > > I'm not sure how important these cases are to cover. Given that they do > differ between vendors and generation, and that their use in compiled > code is not going to consider the registers live after use, is the > complexity really worth it? By "complexity", what do you mean? The patch doesn't add new complexity, it only converts "true" to "false" in several places, plus it updates a comment. I don't think we can legitimately simplify things (by removing logic), so the only thing I can think of is your thought towards executing a zero-length REP stub (which you say may be problematic in certain cases). Patch 2 makes clear why this wouldn't be a good idea for INS and OUTS. It also cannot possibly be got right when emulating 16-bit code (without switching to a 16-bit code segment), and it's uncertain whether a 32-bit address size override would actually yield the same behavior as a native address size operation in 32-bit code. Of course, if limiting this (the way we currently do) to just 32-bit addressing in 64-bit mode, then this ought to be representative (with the INS/OUTS caveat remaining), but - as you say - adding complexity for likely little gain. Jan
On 10.10.2022 20:56, Andrew Cooper wrote: > On 06/10/2022 14:11, Jan Beulich wrote: >> In an entirely different context I came across Linux commit 428e3d08574b >> ("KVM: x86: Fix zero iterations REP-string"), which points out that >> we're still doing things wrong: For one, there's no zero-extension at >> all on AMD. And then while RCX is zero-extended from 32 bits uniformly >> for all string instructions on newer hardware, RSI/RDI are only for MOVS >> and STOS on the systems I have access to. (On an old family 0xf system >> I've further found that for REP LODS even RCX is not zero-extended.) >> >> Fixes: 79e996a89f69 ("x86emul: correct 64-bit mode repeated string insn handling with zero count") >> Signed-off-by: Jan Beulich <jbeulich@suse.com> >> --- >> Partly RFC for none of this being documented anywhere (and it partly >> being model specific); inquiry pending. > > None of this surprises me. The rep instructions have always been > microcoded, and 0 reps is a special case which has been largely ignored > until recently. > > I wouldn't be surprised if the behaviour changes with > MISC_ENABLE.FAST_STRINGS (given the KVM commit message) I've tried this on a Skylake, and things don't change there when forcing the MSR bit off. Jan > and I also > wouldn't be surprised if it's different between Core and Atom too (given > the Fam 0xf observation). > > It's almost worth executing a zero-length rep stub, except that may > potentially go very wrong in certain ecx/rcx cases. > > I'm not sure how important these cases are to cover. Given that they do > differ between vendors and generation, and that their use in compiled > code is not going to consider the registers live after use, is the > complexity really worth it? > > ~Andrew
--- a/xen/arch/x86/x86_emulate/x86_emulate.c +++ b/xen/arch/x86/x86_emulate/x86_emulate.c @@ -1589,7 +1589,7 @@ static inline void put_loop_count( regs->r(cx) = ad_bytes == 4 ? (uint32_t)count : count; } -#define get_rep_prefix(using_si, using_di) ({ \ +#define get_rep_prefix(extend_si, extend_di) ({ \ unsigned long max_reps = 1; \ if ( rep_prefix() ) \ max_reps = get_loop_count(&_regs, ad_bytes); \ @@ -1597,14 +1597,14 @@ static inline void put_loop_count( { \ /* \ * Skip the instruction if no repetitions are required, but \ - * zero extend involved registers first when using 32-bit \ + * zero extend relevant registers first when using 32-bit \ * addressing in 64-bit mode. \ */ \ - if ( mode_64bit() && ad_bytes == 4 ) \ + if ( !amd_like(ctxt) && mode_64bit() && ad_bytes == 4 ) \ { \ _regs.r(cx) = 0; \ - if ( using_si ) _regs.r(si) = _regs.esi; \ - if ( using_di ) _regs.r(di) = _regs.edi; \ + if ( extend_si ) _regs.r(si) = _regs.esi; \ + if ( extend_di ) _regs.r(di) = _regs.edi; \ } \ goto complete_insn; \ } \ @@ -4248,7 +4248,7 @@ x86_emulate( goto imul; case 0x6c ... 0x6d: /* ins %dx,%es:%edi */ { - unsigned long nr_reps = get_rep_prefix(false, true); + unsigned long nr_reps = get_rep_prefix(false, false); unsigned int port = _regs.dx; dst.bytes = !(b & 1) ? 1 : (op_bytes == 8) ? 4 : op_bytes; @@ -4289,7 +4289,7 @@ x86_emulate( } case 0x6e ... 0x6f: /* outs %esi,%dx */ { - unsigned long nr_reps = get_rep_prefix(true, false); + unsigned long nr_reps = get_rep_prefix(false, false); unsigned int port = _regs.dx; dst.bytes = !(b & 1) ? 1 : (op_bytes == 8) ? 4 : op_bytes; @@ -4630,7 +4630,7 @@ x86_emulate( case 0xa6 ... 0xa7: /* cmps */ { unsigned long next_eip = _regs.r(ip); - get_rep_prefix(true, true); + get_rep_prefix(false, false); src.bytes = dst.bytes = (d & ByteOp) ? 1 : op_bytes; if ( (rc = read_ulong(ea.mem.seg, truncate_ea(_regs.r(si)), &dst.val, dst.bytes, ctxt, ops)) || @@ -4672,7 +4672,7 @@ x86_emulate( } case 0xac ... 0xad: /* lods */ - get_rep_prefix(true, false); + get_rep_prefix(false, false); if ( (rc = read_ulong(ea.mem.seg, truncate_ea(_regs.r(si)), &dst.val, dst.bytes, ctxt, ops)) != 0 ) goto done; @@ -4683,7 +4683,7 @@ x86_emulate( case 0xae ... 0xaf: /* scas */ { unsigned long next_eip = _regs.r(ip); - get_rep_prefix(false, true); + get_rep_prefix(false, false); if ( (rc = read_ulong(x86_seg_es, truncate_ea(_regs.r(di)), &dst.val, src.bytes, ctxt, ops)) != 0 ) goto done;
In an entirely different context I came across Linux commit 428e3d08574b ("KVM: x86: Fix zero iterations REP-string"), which points out that we're still doing things wrong: For one, there's no zero-extension at all on AMD. And then while RCX is zero-extended from 32 bits uniformly for all string instructions on newer hardware, RSI/RDI are only for MOVS and STOS on the systems I have access to. (On an old family 0xf system I've further found that for REP LODS even RCX is not zero-extended.) Fixes: 79e996a89f69 ("x86emul: correct 64-bit mode repeated string insn handling with zero count") Signed-off-by: Jan Beulich <jbeulich@suse.com> --- Partly RFC for none of this being documented anywhere (and it partly being model specific); inquiry pending. If I was asked, I would have claimed to have tested all string insns and for both vendors back at the time. But pretty clearly I didn't, and instead I did derive uniform behavior from just the MOVS and STOS observations on just Intel hardware; I'm sorry for that.