
[V2,0/3] improve -overcommit cpu-pm=on|off

Message ID 20240524200017.150339-1-zide.chen@intel.com (mailing list archive)

Message

Chen, Zide May 24, 2024, 8 p.m. UTC
Currently, if running "-overcommit cpu-pm=on" on hosts that don't
have MWAIT support, the MWAIT/MONITOR feature is advertised to the
guest and executing MWAIT/MONITOR on the guest triggers #UD.

V2:
- [PATCH 1]: took Thomas' suggestion for more generic fix
- [PATCH 2/3]: no changes

Zide Chen (3):
  vl: Allow multiple -overcommit commands
  target/i386: call cpu_exec_realizefn before x86_cpu_filter_features
  target/i386: Move host_cpu_enable_cpu_pm into kvm_cpu_realizefn()

 system/vl.c               |  4 ++--
 target/i386/cpu.c         | 24 ++++++++++++------------
 target/i386/host-cpu.c    | 12 ------------
 target/i386/kvm/kvm-cpu.c | 12 +++++++++---
 4 files changed, 23 insertions(+), 29 deletions(-)

Comments

Igor Mammedov May 28, 2024, 9:23 a.m. UTC | #1
On Fri, 24 May 2024 13:00:14 -0700
Zide Chen <zide.chen@intel.com> wrote:

> Currently, if running "-overcommit cpu-pm=on" on hosts that don't
> have MWAIT support, the MWAIT/MONITOR feature is advertised to the
> guest and executing MWAIT/MONITOR on the guest triggers #UD.

This is missing a proper description of how you trigger the issue,
with a reproducer, and a detailed explanation of why the guest sees
MWAIT when it's not supported by the host.

> 
> V2:
> - [PATCH 1]: took Thomas' suggestion for more generic fix
> - [PATCH 2/3]: no changes
> 
> Zide Chen (3):
>   vl: Allow multiple -overcommit commands
>   target/i386: call cpu_exec_realizefn before x86_cpu_filter_features
>   target/i386: Move host_cpu_enable_cpu_pm into kvm_cpu_realizefn()
> 
>  system/vl.c               |  4 ++--
>  target/i386/cpu.c         | 24 ++++++++++++------------
>  target/i386/host-cpu.c    | 12 ------------
>  target/i386/kvm/kvm-cpu.c | 12 +++++++++---
>  4 files changed, 23 insertions(+), 29 deletions(-)
>
Chen, Zide May 28, 2024, 6:16 p.m. UTC | #2
On 5/28/2024 2:23 AM, Igor Mammedov wrote:
> On Fri, 24 May 2024 13:00:14 -0700
> Zide Chen <zide.chen@intel.com> wrote:
> 
>> Currently, if running "-overcommit cpu-pm=on" on hosts that don't
>> have MWAIT support, the MWAIT/MONITOR feature is advertised to the
>> guest and executing MWAIT/MONITOR on the guest triggers #UD.
> 
> this is missing proper description how do you trigger issue
> with reproducer and detailed description why guest sees MWAIT
> when it's not supported by host.

If "-overcommit cpu-pm=on" and "-cpu host" are present, as shown in the
following, CPUID_EXT_MONITOR is set after x86_cpu_filter_features(), so
that it doesn't have a chance to check MWAIT against host features and
will be advertised to the guest regardless of whether it's supported by
the host or not.

x86_cpu_realizefn()
  x86_cpu_filter_features()
  cpu_exec_realizefn()
    kvm_cpu_realizefn
      host_cpu_realizefn
        host_cpu_enable_cpu_pm
          env->features[FEAT_1_ECX] |= CPUID_EXT_MONITOR;


If it's not supported by the host, executing MONITOR or MWAIT
instructions from the guest triggers #UD, regardless of whether the
MWAIT_EXITING control is set.
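
The ordering problem in the call chain above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not QEMU code: `struct toy_cpu` and the `toy_*` and `guest_sees_mwait_*` names are hypothetical, and a single 32-bit word stands in for env->features[FEAT_1_ECX]. The one fact taken from the architecture is that MONITOR/MWAIT is reported in CPUID.01H:ECX bit 3.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* CPUID.01H:ECX bit 3 reports MONITOR/MWAIT (CPUID_EXT_MONITOR in QEMU). */
#define CPUID_EXT_MONITOR (1U << 3)

/* Hypothetical stand-in for a vCPU: one feature word only. */
struct toy_cpu {
    uint32_t feat_1_ecx;
};

/* Drop every requested bit that the host does not support. */
static void toy_filter_features(struct toy_cpu *cpu, uint32_t host_ecx)
{
    cpu->feat_1_ecx &= host_ecx;
}

/* Models the effect of cpu-pm=on: request MWAIT unconditionally. */
static void toy_enable_cpu_pm(struct toy_cpu *cpu)
{
    cpu->feat_1_ecx |= CPUID_EXT_MONITOR;
}

/* Order described above: filter first, then set the bit. */
static bool guest_sees_mwait_filter_first(uint32_t host_ecx)
{
    struct toy_cpu cpu = { 0 };

    toy_filter_features(&cpu, host_ecx);
    toy_enable_cpu_pm(&cpu);
    return (cpu.feat_1_ecx & CPUID_EXT_MONITOR) != 0;
}

/* Reversed order: set the bit, then let the filter check it. */
static bool guest_sees_mwait_filter_last(uint32_t host_ecx)
{
    struct toy_cpu cpu = { 0 };

    toy_enable_cpu_pm(&cpu);
    toy_filter_features(&cpu, host_ecx);
    return (cpu.feat_1_ecx & CPUID_EXT_MONITOR) != 0;
}
```

With host_ecx == 0 (a host without MWAIT), the filter-first order still leaves the bit set for the guest, while the filter-last order clears it. Patch 2 of the series appears to aim for the second ordering.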
Igor Mammedov May 29, 2024, 12:46 p.m. UTC | #3
On Tue, 28 May 2024 11:16:59 -0700
"Chen, Zide" <zide.chen@intel.com> wrote:

> On 5/28/2024 2:23 AM, Igor Mammedov wrote:
> > On Fri, 24 May 2024 13:00:14 -0700
> > Zide Chen <zide.chen@intel.com> wrote:
> >   
> >> Currently, if running "-overcommit cpu-pm=on" on hosts that don't
> >> have MWAIT support, the MWAIT/MONITOR feature is advertised to the
> >> guest and executing MWAIT/MONITOR on the guest triggers #UD.  
> > 
> > this is missing proper description how do you trigger issue
> > with reproducer and detailed description why guest sees MWAIT
> > when it's not supported by host.  
> 
> If "-overcommit cpu-pm=on" and "-cpu host" are present, as shown in the
it's better to provide the full QEMU CLI, the host/guest kernels used, and
the hardware used if it's relevant, so others can reproduce the problem.

> following, CPUID_EXT_MONITOR is set after x86_cpu_filter_features(), so
> that it doesn't have a chance to check MWAIT against host features and
> will be advertised to the guest regardless of whether it's supported by
> the host or not.
> 
> x86_cpu_realizefn()
>   x86_cpu_filter_features()
>   cpu_exec_realizefn()
>     kvm_cpu_realizefn
>       host_cpu_realizefn
>         host_cpu_enable_cpu_pm
>           env->features[FEAT_1_ECX] |= CPUID_EXT_MONITOR;
> 
> 
> If it's not supported by the host, executing MONITOR or MWAIT
> instructions from the guest triggers #UD, no matter MWAIT_EXITING
> control is set or not.

If I recall correctly, KVM was able to emulate MWAIT/MONITOR.
So the question is why it leads to an exception instead.
Chen, Zide May 29, 2024, 5:31 p.m. UTC | #4
On 5/29/2024 5:46 AM, Igor Mammedov wrote:
> On Tue, 28 May 2024 11:16:59 -0700
> "Chen, Zide" <zide.chen@intel.com> wrote:
> 
>> On 5/28/2024 2:23 AM, Igor Mammedov wrote:
>>> On Fri, 24 May 2024 13:00:14 -0700
>>> Zide Chen <zide.chen@intel.com> wrote:
>>>   
>>>> Currently, if running "-overcommit cpu-pm=on" on hosts that don't
>>>> have MWAIT support, the MWAIT/MONITOR feature is advertised to the
>>>> guest and executing MWAIT/MONITOR on the guest triggers #UD.  
>>>
>>> this is missing proper description how do you trigger issue
>>> with reproducer and detailed description why guest sees MWAIT
>>> when it's not supported by host.  
>>
>> If "overcommit cpu-pm=on" and "-cpu host" are present, as shown in the
> it's better to provide full QEMU CLI and host/guest kernels used and what
> hardware was used if it's relevant so others can reproduce problem.

I have reproduced this on an older Intel Icelake machine, a
Sapphire Rapids and a Sierra Forest, but I believe this is a generic x86
issue, not specific to particular models.

For the CLI, I think the only command line options that matter are
 -overcommit cpu-pm=on: to set enable_cpu_pm
 -cpu host: so that cpu->max_features is set

For the QEMU version, any build that includes this commit should do:
662175b91ff2 ("i386: reorder call to cpu_exec_realizefn")
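
Since the report hinges on whether the host itself reports MONITOR/MWAIT, a quick host-side check may help others reproduce it. The following is a minimal sketch, assuming GCC or Clang on x86 (it relies on the compiler's <cpuid.h> helper); the function names ecx_has_monitor() and host_has_monitor() are made up for illustration and are not part of QEMU.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>   /* GCC/Clang wrapper around the CPUID instruction */
#endif

/* CPUID.01H:ECX bit 3 reports MONITOR/MWAIT. */
#define CPUID_EXT_MONITOR (1U << 3)

/* Pure helper: does this ECX value advertise MONITOR/MWAIT? */
static bool ecx_has_monitor(uint32_t ecx)
{
    return (ecx & CPUID_EXT_MONITOR) != 0;
}

/*
 * Returns 1 if the CPU this runs on reports MONITOR/MWAIT,
 * 0 if it does not, and -1 if CPUID leaf 1 cannot be queried
 * (for example on a non-x86 build).
 */
static int host_has_monitor(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
        return -1;
    }
    return ecx_has_monitor(ecx) ? 1 : 0;
#else
    return -1;
#endif
}
```

Calling host_has_monitor() from a small main() on the host gives the answer for that machine. Run inside a guest, it reports what the hypervisor advertises, which is exactly the bit under discussion here.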

The guest fails to boot:

[ 24.825568] smpboot: x86: Booting SMP configuration:
[ 24.826377] .... node #0, CPUs: #1 #2 #3 #4 #5 #6 #7 #8 #9 #10 #11 #12
#13 #14 #15 #17
[ 24.985799] .... node #1, CPUs: #128 #129 #130 #131 #132 #133 #134 #135
#136 #137 #138 #139 #140 #141 #142 #143 #145
[ 25.136955] invalid opcode: 0000 1 PREEMPT SMP NOPTI
[ 25.137790] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.8.0 #2
[ 25.137790] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS
rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/04
[ 25.137790] RIP: 0010:mwait_idle+0x35/0x80
[ 25.137790] Code: 6f f0 80 48 02 20 48 8b 10 83 e2 08 75 3e 65 48 8b 15
47 d6 56 6f 48 0f ba e2 27 72 41 31 d2 48 89 d8
[ 25.137790] RSP: 0000:ffffffff91403e70 EFLAGS: 00010046
[ 25.137790] RAX: ffffffff9140a980 RBX: ffffffff9140a980 RCX:
0000000000000000
[ 25.137790] RDX: 0000000000000000 RSI: ffff97f1ade21b20 RDI:
0000000000000004
[ 25.137790] RBP: 0000000000000000 R08: 00000005da4709cb R09:
0000000000000001
[ 25.137790] R10: 0000000000005da4 R11: 0000000000000009 R12:
0000000000000000
[ 25.137790] R13: ffff98573ff90fc0 R14: ffffffff9140a038 R15:
0000000000093ff0
[ 25.137790] FS: 0000000000000000(0000) GS:ffff97f1ade00000(0000)
knlGS:0000000000000000
[ 25.137790] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 25.137790] CR2: ffff97d8aa801000 CR3: 00000049e9430001 CR4:
0000000000770ef0
[ 25.137790] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[ 25.137790] DR3: 0000000000000000 DR6: 00000000ffff07f0 DR7:
0000000000000400
[ 25.137790] PKRU: 55555554
[ 25.137790] Call Trace:
[ 25.137790] <TASK>
[ 25.137790] ? die+0x37/0x90
[ 25.137790] ? do_trap+0xe3/0x110
[ 25.137790] ? mwait_idle+0x35/0x80
[ 25.137790] ? do_error_trap+0x6a/0x90
[ 25.137790] ? mwait_idle+0x35/0x80
[ 25.137790] ? exc_invalid_op+0x52/0x70
[ 25.137790] ? mwait_idle+0x35/0x80
[ 25.137790] ? asm_exc_invalid_op+0x1a/0x20
[ 25.137790] ? mwait_idle+0x35/0x80
[ 25.137790] default_idle_call+0x30/0x100
[ 25.137790] cpuidle_idle_call+0x12c/0x170
[ 25.137790] ? tsc_verify_tsc_adjust+0x73/0xd0
[ 25.137790] do_idle+0x7f/0xd0
[ 25.137790] cpu_startup_entry+0x29/0x30
[ 25.137790] rest_init+0xcc/0xd0
[ 25.137790] start_kernel+0x396/0x5d0
[ 25.137790] x86_64_start_reservations+0x18/0x30
[ 25.137790] x86_64_start_kernel+0xe7/0xf0
[ 25.137790] common_startup_64+0x13e/0x148
[ 25.137790] </TASK>
[ 25.137790] Modules linked in:
[ 25.137790] --[ end trace 0000000000000000 ]--
[ 25.137790] invalid opcode: 0000 2 PREEMPT SMP NOPTI
[ 25.137790] RIP: 0010:mwait_idle+0x35/0x80
[ 25.137790] Code: 6f f0 80 48 02 20 48 8b 10 83 e2 08 75 3e 65 48 8b 15
47 d6 56 6f 48 0f ba e2 27 72 41 31 d2 48 89 d8

> 
>> following, CPUID_EXT_MONITOR is set after x86_cpu_filter_features(), so
>> that it doesn't have a chance to check MWAIT against host features and
>> will be advertised to the guest regardless of whether it's supported by
>> the host or not.
>>
>> x86_cpu_realizefn()
>>   x86_cpu_filter_features()
>>   cpu_exec_realizefn()
>>     kvm_cpu_realizefn
>>       host_cpu_realizefn
>>         host_cpu_enable_cpu_pm
>>           env->features[FEAT_1_ECX] |= CPUID_EXT_MONITOR;
>>
>>
>> If it's not supported by the host, executing MONITOR or MWAIT
>> instructions from the guest triggers #UD, no matter MWAIT_EXITING
>> control is set or not.
> 
> If I recall right, kvm was able to emulate mwait/monitor.
> So question is why it leads to exception instead?

KVM can come into play only if the instruction triggers an MWAIT/MONITOR
VM exit. I didn't find explicit proof in the Intel SDM that #UD exceptions
take precedence over MWAIT/MONITOR VM exits, but this is my speculation.
For example, on ancient machines that don't support MWAIT at all, the only
possible outcome is #UD, not an MWAIT VM exit?
Zhao Liu May 30, 2024, 1:54 p.m. UTC | #5
Hi Zide,

On Wed, May 29, 2024 at 10:31:21AM -0700, Chen, Zide wrote:
> Date: Wed, 29 May 2024 10:31:21 -0700
> From: "Chen, Zide" <zide.chen@intel.com>
> Subject: Re: [PATCH V2 0/3] improve -overcommit cpu-pm=on|off
> 
> 
> 
> On 5/29/2024 5:46 AM, Igor Mammedov wrote:
> > On Tue, 28 May 2024 11:16:59 -0700
> > "Chen, Zide" <zide.chen@intel.com> wrote:
> > 
> >> On 5/28/2024 2:23 AM, Igor Mammedov wrote:
> >>> On Fri, 24 May 2024 13:00:14 -0700
> >>> Zide Chen <zide.chen@intel.com> wrote:
> >>>   
> >>>> Currently, if running "-overcommit cpu-pm=on" on hosts that don't
> >>>> have MWAIT support, the MWAIT/MONITOR feature is advertised to the
> >>>> guest and executing MWAIT/MONITOR on the guest triggers #UD.  
> >>>
> >>> this is missing proper description how do you trigger issue
> >>> with reproducer and detailed description why guest sees MWAIT
> >>> when it's not supported by host.  
> >>
> >> If "overcommit cpu-pm=on" and "-cpu host" are present, as shown in the
> > > it's better to provide full QEMU CLI and host/guest kernels used and what
> > hardware was used if it's relevant so others can reproduce problem.
> 
> I ever reproduced this on an older Intel Icelake machine, a
> Sapphire Rapids and a Sierra Forest, but I believe this is a x86 generic
> issue, not specific to particular models.
> 
> For the CLI, I think the only command line options that matter are
>  -overcommit cpu-pm=on: to set enable_cpu_pm
>  -cpu host: so that cpu->max_features is set
> 
> For QEMU version, as long as it's after this commit: 662175b91ff2
> ("i386: reorder call to cpu_exec_realizefn")
> 
> The guest fails to boot:
> 
> [ 24.825568] smpboot: x86: Booting SMP configuration:
> [ 24.826377] .... node #0, CPUs: #1 #2 #3 #4 #5 #6 #7 #8 #9 #10 #11 #12
> #13 #14 #15 #17
> [ 24.985799] .... node #1, CPUs: #128 #129 #130 #131 #132 #133 #134 #135
> #136 #137 #138 #139 #140 #141 #142 #143 #145
> [ 25.136955] invalid opcode: 0000 1 PREEMPT SMP NOPTI
> [ 25.137790] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.8.0 #2
> [ 25.137790] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS
> rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/04
> [ 25.137790] RIP: 0010:mwait_idle+0x35/0x80
> [ 25.137790] Code: 6f f0 80 48 02 20 48 8b 10 83 e2 08 75 3e 65 48 8b 15
> 47 d6 56 6f 48 0f ba e2 27 72 41 31 d2 48 89 d8
> [ 25.137790] RSP: 0000:ffffffff91403e70 EFLAGS: 00010046
> [ 25.137790] RAX: ffffffff9140a980 RBX: ffffffff9140a980 RCX:
> 0000000000000000
> [ 25.137790] RDX: 0000000000000000 RSI: ffff97f1ade21b20 RDI:
> 0000000000000004
> [ 25.137790] RBP: 0000000000000000 R08: 00000005da4709cb R09:
> 0000000000000001
> [ 25.137790] R10: 0000000000005da4 R11: 0000000000000009 R12:
> 0000000000000000
> [ 25.137790] R13: ffff98573ff90fc0 R14: ffffffff9140a038 R15:
> 0000000000093ff0
> [ 25.137790] FS: 0000000000000000(0000) GS:ffff97f1ade00000(0000)
> knlGS:0000000000000000
> [ 25.137790] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 25.137790] CR2: ffff97d8aa801000 CR3: 00000049e9430001 CR4:
> 0000000000770ef0
> [ 25.137790] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> 0000000000000000
> [ 25.137790] DR3: 0000000000000000 DR6: 00000000ffff07f0 DR7:
> 0000000000000400
> [ 25.137790] PKRU: 55555554
> [ 25.137790] Call Trace:
> [ 25.137790] <TASK>
> [ 25.137790] ? die+0x37/0x90
> [ 25.137790] ? do_trap+0xe3/0x110
> [ 25.137790] ? mwait_idle+0x35/0x80
> [ 25.137790] ? do_error_trap+0x6a/0x90
> [ 25.137790] ? mwait_idle+0x35/0x80
> [ 25.137790] ? exc_invalid_op+0x52/0x70
> [ 25.137790] ? mwait_idle+0x35/0x80
> [ 25.137790] ? asm_exc_invalid_op+0x1a/0x20
> [ 25.137790] ? mwait_idle+0x35/0x80
> [ 25.137790] default_idle_call+0x30/0x100
> [ 25.137790] cpuidle_idle_call+0x12c/0x170
> [ 25.137790] ? tsc_verify_tsc_adjust+0x73/0xd0
> [ 25.137790] do_idle+0x7f/0xd0
> [ 25.137790] cpu_startup_entry+0x29/0x30
> [ 25.137790] rest_init+0xcc/0xd0
> [ 25.137790] start_kernel+0x396/0x5d0
> [ 25.137790] x86_64_start_reservations+0x18/0x30
> [ 25.137790] x86_64_start_kernel+0xe7/0xf0
> [ 25.137790] common_startup_64+0x13e/0x148
> [ 25.137790] </TASK>
> [ 25.137790] Modules linked in:
> [ 25.137790] --[ end trace 0000000000000000 ]--
> [ 25.137790] invalid opcode: 0000 2 PREEMPT SMP NOPTI
> [ 25.137790] RIP: 0010:mwait_idle+0x35/0x80
> [ 25.137790] Code: 6f f0 80 48 02 20 48 8b 10 83 e2 08 75 3e 65 48 8b 15
> 47 d6 56 6f 48 0f ba e2 27 72 41 31 d2 48 89 d8
> 
> > 
> >> following, CPUID_EXT_MONITOR is set after x86_cpu_filter_features(), so
> >> that it doesn't have a chance to check MWAIT against host features and
> >> will be advertised to the guest regardless of whether it's supported by
> >> the host or not.
> >>
> >> x86_cpu_realizefn()
> >>   x86_cpu_filter_features()
> >>   cpu_exec_realizefn()
> >>     kvm_cpu_realizefn
> >>       host_cpu_realizefn
> >>         host_cpu_enable_cpu_pm
> >>           env->features[FEAT_1_ECX] |= CPUID_EXT_MONITOR;
> >>
> >>
> >> If it's not supported by the host, executing MONITOR or MWAIT
> >> instructions from the guest triggers #UD, no matter MWAIT_EXITING
> >> control is set or not.
> > 
> > If I recall right, kvm was able to emulate mwait/monitor.
> > So question is why it leads to exception instead?
> 
> KVM can come to play only iff it can trigger MWAIT/MONITOR VM exits. I
> didn't find explicit proof from Intel SDM that #UD exceptions take
> precedence over MWAIT/MONITOR VM exits, but this is my speculation. For
> example, in ancient machines which don't support MWAIT yet, the only way
> it can do is #UD, not MWAIT VM exit?

For a host that doesn't support MWAIT, it shouldn't have the VMX
control bit for MWAIT exiting either, right?

Could you please check this on your machine? If VMX doesn't support this
exit event, then triggering an exception makes sense.

-Zhao
Igor Mammedov May 30, 2024, 2:34 p.m. UTC | #6
On Thu, 30 May 2024 21:54:47 +0800
Zhao Liu <zhao1.liu@intel.com> wrote:

> Hi Zide,
> 
> On Wed, May 29, 2024 at 10:31:21AM -0700, Chen, Zide wrote:
> > Date: Wed, 29 May 2024 10:31:21 -0700
> > From: "Chen, Zide" <zide.chen@intel.com>
> > Subject: Re: [PATCH V2 0/3] improve -overcommit cpu-pm=on|off
> > 
> > 
> > 
> > On 5/29/2024 5:46 AM, Igor Mammedov wrote:  
> > > On Tue, 28 May 2024 11:16:59 -0700
> > > "Chen, Zide" <zide.chen@intel.com> wrote:
> > >   
> > >> On 5/28/2024 2:23 AM, Igor Mammedov wrote:  
> > >>> On Fri, 24 May 2024 13:00:14 -0700
> > >>> Zide Chen <zide.chen@intel.com> wrote:
> > >>>     
> > >>>> Currently, if running "-overcommit cpu-pm=on" on hosts that don't
> > >>>> have MWAIT support, the MWAIT/MONITOR feature is advertised to the
> > >>>> guest and executing MWAIT/MONITOR on the guest triggers #UD.    
> > >>>
> > >>> this is missing proper description how do you trigger issue
> > >>> with reproducer and detailed description why guest sees MWAIT
> > >>> when it's not supported by host.    
> > >>
> > >> If "overcommit cpu-pm=on" and "-cpu host" are present, as shown in the  
> > > > it's better to provide full QEMU CLI and host/guest kernels used and what
> > > hardware was used if it's relevant so others can reproduce problem.  
> > 
> > I ever reproduced this on an older Intel Icelake machine, a
> > Sapphire Rapids and a Sierra Forest, but I believe this is a x86 generic
> > issue, not specific to particular models.
> > 
> > For the CLI, I think the only command line options that matter are
> >  -overcommit cpu-pm=on: to set enable_cpu_pm
> >  -cpu host: so that cpu->max_features is set
> > 
> > For QEMU version, as long as it's after this commit: 662175b91ff2
> > ("i386: reorder call to cpu_exec_realizefn")
> > 
> > The guest fails to boot:
> > 
> > [ 24.825568] smpboot: x86: Booting SMP configuration:
> > [ 24.826377] .... node #0, CPUs: #1 #2 #3 #4 #5 #6 #7 #8 #9 #10 #11 #12
> > #13 #14 #15 #17
> > [ 24.985799] .... node #1, CPUs: #128 #129 #130 #131 #132 #133 #134 #135
> > #136 #137 #138 #139 #140 #141 #142 #143 #145
> > [ 25.136955] invalid opcode: 0000 1 PREEMPT SMP NOPTI
> > [ 25.137790] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.8.0 #2
> > [ 25.137790] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS
> > rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/04
> > [ 25.137790] RIP: 0010:mwait_idle+0x35/0x80
> > [ 25.137790] Code: 6f f0 80 48 02 20 48 8b 10 83 e2 08 75 3e 65 48 8b 15
> > 47 d6 56 6f 48 0f ba e2 27 72 41 31 d2 48 89 d8
> > [ 25.137790] RSP: 0000:ffffffff91403e70 EFLAGS: 00010046
> > [ 25.137790] RAX: ffffffff9140a980 RBX: ffffffff9140a980 RCX:
> > 0000000000000000
> > [ 25.137790] RDX: 0000000000000000 RSI: ffff97f1ade21b20 RDI:
> > 0000000000000004
> > [ 25.137790] RBP: 0000000000000000 R08: 00000005da4709cb R09:
> > 0000000000000001
> > [ 25.137790] R10: 0000000000005da4 R11: 0000000000000009 R12:
> > 0000000000000000
> > [ 25.137790] R13: ffff98573ff90fc0 R14: ffffffff9140a038 R15:
> > 0000000000093ff0
> > [ 25.137790] FS: 0000000000000000(0000) GS:ffff97f1ade00000(0000)
> > knlGS:0000000000000000
> > [ 25.137790] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [ 25.137790] CR2: ffff97d8aa801000 CR3: 00000049e9430001 CR4:
> > 0000000000770ef0
> > [ 25.137790] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> > 0000000000000000
> > [ 25.137790] DR3: 0000000000000000 DR6: 00000000ffff07f0 DR7:
> > 0000000000000400
> > [ 25.137790] PKRU: 55555554
> > [ 25.137790] Call Trace:
> > [ 25.137790] <TASK>
> > [ 25.137790] ? die+0x37/0x90
> > [ 25.137790] ? do_trap+0xe3/0x110
> > [ 25.137790] ? mwait_idle+0x35/0x80
> > [ 25.137790] ? do_error_trap+0x6a/0x90
> > [ 25.137790] ? mwait_idle+0x35/0x80
> > [ 25.137790] ? exc_invalid_op+0x52/0x70
> > [ 25.137790] ? mwait_idle+0x35/0x80
> > [ 25.137790] ? asm_exc_invalid_op+0x1a/0x20
> > [ 25.137790] ? mwait_idle+0x35/0x80
> > [ 25.137790] default_idle_call+0x30/0x100
> > [ 25.137790] cpuidle_idle_call+0x12c/0x170
> > [ 25.137790] ? tsc_verify_tsc_adjust+0x73/0xd0
> > [ 25.137790] do_idle+0x7f/0xd0
> > [ 25.137790] cpu_startup_entry+0x29/0x30
> > [ 25.137790] rest_init+0xcc/0xd0
> > [ 25.137790] start_kernel+0x396/0x5d0
> > [ 25.137790] x86_64_start_reservations+0x18/0x30
> > [ 25.137790] x86_64_start_kernel+0xe7/0xf0
> > [ 25.137790] common_startup_64+0x13e/0x148
> > [ 25.137790] </TASK>
> > [ 25.137790] Modules linked in:
> > [ 25.137790] --[ end trace 0000000000000000 ]--
> > [ 25.137790] invalid opcode: 0000 2 PREEMPT SMP NOPTI
> > [ 25.137790] RIP: 0010:mwait_idle+0x35/0x80
> > [ 25.137790] Code: 6f f0 80 48 02 20 48 8b 10 83 e2 08 75 3e 65 48 8b 15
> > 47 d6 56 6f 48 0f ba e2 27 72 41 31 d2 48 89 d8
> >   
> > >   
> > >> following, CPUID_EXT_MONITOR is set after x86_cpu_filter_features(), so
> > >> that it doesn't have a chance to check MWAIT against host features and
> > >> will be advertised to the guest regardless of whether it's supported by
> > >> the host or not.
> > >>
> > >> x86_cpu_realizefn()
> > >>   x86_cpu_filter_features()
> > >>   cpu_exec_realizefn()
> > >>     kvm_cpu_realizefn
> > >>       host_cpu_realizefn
> > >>         host_cpu_enable_cpu_pm
> > >>           env->features[FEAT_1_ECX] |= CPUID_EXT_MONITOR;
> > >>
> > >>
> > >> If it's not supported by the host, executing MONITOR or MWAIT
> > >> instructions from the guest triggers #UD, no matter MWAIT_EXITING
> > >> control is set or not.  
> > > 
> > > If I recall right, kvm was able to emulate mwait/monitor.
> > > So question is why it leads to exception instead?  
> > 
> > KVM can come to play only iff it can trigger MWAIT/MONITOR VM exits. I
> > didn't find explicit proof from Intel SDM that #UD exceptions take
> > precedence over MWAIT/MONITOR VM exits, but this is my speculation. For
> > example, in ancient machines which don't support MWAIT yet, the only way
> > it can do is #UD, not MWAIT VM exit?  
> 
> For the Host which doesn't support MWAIT, it shouldn't have the VMX
> control bit for mwait exit either, right?
> 
> Could you pls check this on your machine? If VMX doesn't support this
> exit event, then triggering an exception will make sense.

My assumption (probably wrong) was that KVM would emulate MWAIT if it's
unavailable, unless we have KVM_CAP_X86_DISABLE_EXITS enabled. In the
latter case it would explode as expected; however, then we shouldn't be
able to set KVM_CAP_X86_DISABLE_EXITS to begin with.

Recently Sean posted a patch related to that
[PATCH v2 12/49] KVM: x86: Reject disabling of MWAIT/HLT interception when not allowed
  https://lkml.org/lkml/2024/5/17/729

This needs someone with KVM expertise to chime in.
Perhaps Paolo/Sean could clarify the expected behavior.
Igor Mammedov May 30, 2024, 2:49 p.m. UTC | #7
On Thu, 30 May 2024 21:54:47 +0800
Zhao Liu <zhao1.liu@intel.com> wrote:

> Hi Zide,
> 
> On Wed, May 29, 2024 at 10:31:21AM -0700, Chen, Zide wrote:
> > Date: Wed, 29 May 2024 10:31:21 -0700
> > From: "Chen, Zide" <zide.chen@intel.com>
> > Subject: Re: [PATCH V2 0/3] improve -overcommit cpu-pm=on|off
> > 
> > 
> > 
> > On 5/29/2024 5:46 AM, Igor Mammedov wrote:  
> > > On Tue, 28 May 2024 11:16:59 -0700
> > > "Chen, Zide" <zide.chen@intel.com> wrote:
> > >   
> > >> On 5/28/2024 2:23 AM, Igor Mammedov wrote:  
> > >>> On Fri, 24 May 2024 13:00:14 -0700
> > >>> Zide Chen <zide.chen@intel.com> wrote:
> > >>>     
> > >>>> Currently, if running "-overcommit cpu-pm=on" on hosts that don't
> > >>>> have MWAIT support, the MWAIT/MONITOR feature is advertised to the
> > >>>> guest and executing MWAIT/MONITOR on the guest triggers #UD.    
> > >>>
> > >>> this is missing proper description how do you trigger issue
> > >>> with reproducer and detailed description why guest sees MWAIT
> > >>> when it's not supported by host.    
> > >>
> > >> If "overcommit cpu-pm=on" and "-cpu host" are present, as shown in the  
> > > > it's better to provide full QEMU CLI and host/guest kernels used and what
> > > hardware was used if it's relevant so others can reproduce problem.  
> > 
> > I ever reproduced this on an older Intel Icelake machine, a
> > Sapphire Rapids and a Sierra Forest, but I believe this is a x86 generic
> > issue, not specific to particular models.
> > 
> > For the CLI, I think the only command line options that matter are
> >  -overcommit cpu-pm=on: to set enable_cpu_pm
> >  -cpu host: so that cpu->max_features is set
> > 
> > For QEMU version, as long as it's after this commit: 662175b91ff2
> > ("i386: reorder call to cpu_exec_realizefn")
> > 
> > The guest fails to boot:
> > 
> > [ 24.825568] smpboot: x86: Booting SMP configuration:
> > [ 24.826377] .... node #0, CPUs: #1 #2 #3 #4 #5 #6 #7 #8 #9 #10 #11 #12
> > #13 #14 #15 #17
> > [ 24.985799] .... node #1, CPUs: #128 #129 #130 #131 #132 #133 #134 #135
> > #136 #137 #138 #139 #140 #141 #142 #143 #145
> > [ 25.136955] invalid opcode: 0000 1 PREEMPT SMP NOPTI
> > [ 25.137790] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.8.0 #2
> > [ 25.137790] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS
> > rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/04
> > [ 25.137790] RIP: 0010:mwait_idle+0x35/0x80
> > [ 25.137790] Code: 6f f0 80 48 02 20 48 8b 10 83 e2 08 75 3e 65 48 8b 15
> > 47 d6 56 6f 48 0f ba e2 27 72 41 31 d2 48 89 d8
> > [ 25.137790] RSP: 0000:ffffffff91403e70 EFLAGS: 00010046
> > [ 25.137790] RAX: ffffffff9140a980 RBX: ffffffff9140a980 RCX:
> > 0000000000000000
> > [ 25.137790] RDX: 0000000000000000 RSI: ffff97f1ade21b20 RDI:
> > 0000000000000004
> > [ 25.137790] RBP: 0000000000000000 R08: 00000005da4709cb R09:
> > 0000000000000001
> > [ 25.137790] R10: 0000000000005da4 R11: 0000000000000009 R12:
> > 0000000000000000
> > [ 25.137790] R13: ffff98573ff90fc0 R14: ffffffff9140a038 R15:
> > 0000000000093ff0
> > [ 25.137790] FS: 0000000000000000(0000) GS:ffff97f1ade00000(0000)
> > knlGS:0000000000000000
> > [ 25.137790] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [ 25.137790] CR2: ffff97d8aa801000 CR3: 00000049e9430001 CR4:
> > 0000000000770ef0
> > [ 25.137790] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> > 0000000000000000
> > [ 25.137790] DR3: 0000000000000000 DR6: 00000000ffff07f0 DR7:
> > 0000000000000400
> > [ 25.137790] PKRU: 55555554
> > [ 25.137790] Call Trace:
> > [ 25.137790] <TASK>
> > [ 25.137790] ? die+0x37/0x90
> > [ 25.137790] ? do_trap+0xe3/0x110
> > [ 25.137790] ? mwait_idle+0x35/0x80
> > [ 25.137790] ? do_error_trap+0x6a/0x90
> > [ 25.137790] ? mwait_idle+0x35/0x80
> > [ 25.137790] ? exc_invalid_op+0x52/0x70
> > [ 25.137790] ? mwait_idle+0x35/0x80
> > [ 25.137790] ? asm_exc_invalid_op+0x1a/0x20
> > [ 25.137790] ? mwait_idle+0x35/0x80
> > [ 25.137790] default_idle_call+0x30/0x100
> > [ 25.137790] cpuidle_idle_call+0x12c/0x170
> > [ 25.137790] ? tsc_verify_tsc_adjust+0x73/0xd0
> > [ 25.137790] do_idle+0x7f/0xd0
> > [ 25.137790] cpu_startup_entry+0x29/0x30
> > [ 25.137790] rest_init+0xcc/0xd0
> > [ 25.137790] start_kernel+0x396/0x5d0
> > [ 25.137790] x86_64_start_reservations+0x18/0x30
> > [ 25.137790] x86_64_start_kernel+0xe7/0xf0
> > [ 25.137790] common_startup_64+0x13e/0x148
> > [ 25.137790] </TASK>
> > [ 25.137790] Modules linked in:
> > [ 25.137790] --[ end trace 0000000000000000 ]--
> > [ 25.137790] invalid opcode: 0000 2 PREEMPT SMP NOPTI
> > [ 25.137790] RIP: 0010:mwait_idle+0x35/0x80
> > [ 25.137790] Code: 6f f0 80 48 02 20 48 8b 10 83 e2 08 75 3e 65 48 8b 15
> > 47 d6 56 6f 48 0f ba e2 27 72 41 31 d2 48 89 d8
> >   
> > >   
> > >> following, CPUID_EXT_MONITOR is set after x86_cpu_filter_features(), so
> > >> that it doesn't have a chance to check MWAIT against host features and
> > >> will be advertised to the guest regardless of whether it's supported by
> > >> the host or not.
> > >>
> > >> x86_cpu_realizefn()
> > >>   x86_cpu_filter_features()
> > >>   cpu_exec_realizefn()
> > >>     kvm_cpu_realizefn
> > >>       host_cpu_realizefn
> > >>         host_cpu_enable_cpu_pm
> > >>           env->features[FEAT_1_ECX] |= CPUID_EXT_MONITOR;
> > >>
> > >>
> > >> If it's not supported by the host, executing MONITOR or MWAIT
> > >> instructions from the guest triggers #UD, no matter MWAIT_EXITING
> > >> control is set or not.  
> > > 
> > > If I recall right, kvm was able to emulate mwait/monitor.
> > > So question is why it leads to exception instead?  
> > 
> > KVM can come to play only iff it can trigger MWAIT/MONITOR VM exits. I
> > didn't find explicit proof from Intel SDM that #UD exceptions take
> > precedence over MWAIT/MONITOR VM exits, but this is my speculation. For
> > example, in ancient machines which don't support MWAIT yet, the only way
> > it can do is #UD, not MWAIT VM exit?  
> 
> For the Host which doesn't support MWAIT, it shouldn't have the VMX
> control bit for mwait exit either, right?
> 
> Could you pls check this on your machine? If VMX doesn't support this
> exit event, then triggering an exception will make sense.

My assumption (probably wrong) was that KVM would emulate MWAIT if it's
unavailable, unless we have KVM_CAP_X86_DISABLE_EXITS enabled. In the
latter case it would explode as expected; however, then we shouldn't be
able to set KVM_CAP_X86_DISABLE_EXITS to begin with.

Recently Sean posted a patch related to that
[PATCH v2 12/49] KVM: x86: Reject disabling of MWAIT/HLT interception when not allowed
  https://lkml.org/lkml/2024/5/17/729

This needs someone with KVM expertise to chime in.
Perhaps Paolo/Sean could clarify the expected behavior.


> 
> -Zhao
>
Sean Christopherson May 30, 2024, 2:53 p.m. UTC | #8
On Thu, May 30, 2024, Igor Mammedov wrote:
> On Thu, 30 May 2024 21:54:47 +0800 Zhao Liu <zhao1.liu@intel.com> wrote:

...

> > > >> following, CPUID_EXT_MONITOR is set after x86_cpu_filter_features(), so
> > > >> that it doesn't have a chance to check MWAIT against host features and
> > > >> will be advertised to the guest regardless of whether it's supported by
> > > >> the host or not.
> > > >>
> > > >> x86_cpu_realizefn()
> > > >>   x86_cpu_filter_features()
> > > >>   cpu_exec_realizefn()
> > > >>     kvm_cpu_realizefn
> > > >>       host_cpu_realizefn
> > > >>         host_cpu_enable_cpu_pm
> > > >>           env->features[FEAT_1_ECX] |= CPUID_EXT_MONITOR;
> > > >>
> > > >>
> > > >> If it's not supported by the host, executing MONITOR or MWAIT
> > > >> instructions from the guest triggers #UD, no matter MWAIT_EXITING
> > > >> control is set or not.  
> > > > 
> > > > If I recall right, kvm was able to emulate mwait/monitor.
> > > > So question is why it leads to exception instead?

Because KVM doesn't emulate MONITOR/MWAIT on #UD.

> > > KVM can come to play only iff it can trigger MWAIT/MONITOR VM exits. I
> > > didn't find explicit proof from Intel SDM that #UD exceptions take
> > > precedence over MWAIT/MONITOR VM exits, but this is my speculation.

Yeah, typically #UD takes priority over VM-Exit interception checks.  AMD's APM
is much more explicit and states that all exceptions are checked on MONITOR/MWAIT
before the interception check.

> > > For example, in ancient machines which don't support MWAIT yet, the only
> > > way it can do is #UD, not MWAIT VM exit?  

Not really relevant, because such CPUs wouldn't have MWAIT-exiting.

> > For the Host which doesn't support MWAIT, it shouldn't have the VMX
> > control bit for mwait exit either, right?
> > 
> > Could you pls check this on your machine? If VMX doesn't support this
> > exit event, then triggering an exception will make sense.
> 
> My assumption (probably wrong) was that KVM would emulate mwait if it's unavailable,

Nope.  In order to limit the attack surface of the emulator on modern CPUs, KVM
only emulates select instructions in response to a #UD.

But even if KVM did emulate MONITOR/MWAIT on #UD, this is inarguably a QEMU bug,
e.g. QEMU will effectively coerce the guest into using an idle-polling mechanism.
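
The priority described in this reply can be summarized as a small decision function. This is a toy model of the behavior as explained above, not KVM or hardware-manual code: the enum, the function name toy_guest_mwait() and its two boolean inputs are hypothetical, and it assumes the "#UD is checked before interception" ordering that the reply calls typical.

```c
#include <assert.h>
#include <stdbool.h>

/* Possible results of a guest executing MONITOR/MWAIT in this toy model. */
enum mwait_outcome {
    GUEST_UD,        /* guest receives #UD, the hypervisor never sees it */
    VM_EXIT,         /* instruction is intercepted by the hypervisor */
    RUNS_NATIVELY    /* instruction executes directly on the CPU */
};

/*
 * Assumed ordering: the #UD check for an unsupported instruction happens
 * before the VM-exit interception check, so the exiting control only
 * matters when the host actually supports MONITOR/MWAIT.
 */
static enum mwait_outcome toy_guest_mwait(bool host_has_mwait,
                                          bool mwait_exiting)
{
    if (!host_has_mwait) {
        return GUEST_UD;
    }
    return mwait_exiting ? VM_EXIT : RUNS_NATIVELY;
}
```

In this model, toy_guest_mwait(false, true) and toy_guest_mwait(false, false) both yield GUEST_UD, which matches the boot failure reported earlier in the thread regardless of how the exiting control is set.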
Chen, Zide May 30, 2024, 4:15 p.m. UTC | #9
On 5/30/2024 6:54 AM, Zhao Liu wrote:
> Hi Zide,
> 
> On Wed, May 29, 2024 at 10:31:21AM -0700, Chen, Zide wrote:
>> Date: Wed, 29 May 2024 10:31:21 -0700
>> From: "Chen, Zide" <zide.chen@intel.com>
>> Subject: Re: [PATCH V2 0/3] improve -overcommit cpu-pm=on|off
>>
>>
>>
>> On 5/29/2024 5:46 AM, Igor Mammedov wrote:
>>> On Tue, 28 May 2024 11:16:59 -0700
>>> "Chen, Zide" <zide.chen@intel.com> wrote:
>>>
>>>> On 5/28/2024 2:23 AM, Igor Mammedov wrote:
>>>>> On Fri, 24 May 2024 13:00:14 -0700
>>>>> Zide Chen <zide.chen@intel.com> wrote:
>>>>>   
>>>>>> Currently, if running "-overcommit cpu-pm=on" on hosts that don't
>>>>>> have MWAIT support, the MWAIT/MONITOR feature is advertised to the
>>>>>> guest and executing MWAIT/MONITOR on the guest triggers #UD.  
>>>>>
>>>>> this is missing proper description how do you trigger issue
>>>>> with reproducer and detailed description why guest sees MWAIT
>>>>> when it's not supported by host.  
>>>>
>>>> If "overcommit cpu-pm=on" and "-cpu host" are present, as shown in the
>>> it's bette to provide full QEMU CLI and host/guest kernels used and what
>>> hardware was used if it's relevant so others can reproduce problem.
>>
>> I ever reproduced this on an older Intel Icelake machine, a
>> Sapphire Rapids and a Sierra Forest, but I believe this is a x86 generic
>> issue, not specific to particular models.
>>
>> For the CLI, I think the only command line options that matter are
>>  -overcommit cpu-pm=on: to set enable_cpu_pm
>>  -cpu host: so that cpu->max_features is set
>>
>> For QEMU version, as long as it's after this commit: 662175b91ff2
>> ("i386: reorder call to cpu_exec_realizefn")
>>
>> The guest fails to boot:
>>
>> [ 24.825568] smpboot: x86: Booting SMP configuration:
>> [ 24.826377] .... node #0, CPUs: #1 #2 #3 #4 #5 #6 #7 #8 #9 #10 #11 #12
>> #13 #14 #15 #17
>> [ 24.985799] .... node #1, CPUs: #128 #129 #130 #131 #132 #133 #134 #135
>> #136 #137 #138 #139 #140 #141 #142 #143 #145
>> [ 25.136955] invalid opcode: 0000 1 PREEMPT SMP NOPTI
>> [ 25.137790] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.8.0 #2
>> [ 25.137790] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS
>> rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/04
>> [ 25.137790] RIP: 0010:mwait_idle+0x35/0x80
>> [ 25.137790] Code: 6f f0 80 48 02 20 48 8b 10 83 e2 08 75 3e 65 48 8b 15
>> 47 d6 56 6f 48 0f ba e2 27 72 41 31 d2 48 89 d8
>> [ 25.137790] RSP: 0000:ffffffff91403e70 EFLAGS: 00010046
>> [ 25.137790] RAX: ffffffff9140a980 RBX: ffffffff9140a980 RCX:
>> 0000000000000000
>> [ 25.137790] RDX: 0000000000000000 RSI: ffff97f1ade21b20 RDI:
>> 0000000000000004
>> [ 25.137790] RBP: 0000000000000000 R08: 00000005da4709cb R09:
>> 0000000000000001
>> [ 25.137790] R10: 0000000000005da4 R11: 0000000000000009 R12:
>> 0000000000000000
>> [ 25.137790] R13: ffff98573ff90fc0 R14: ffffffff9140a038 R15:
>> 0000000000093ff0
>> [ 25.137790] FS: 0000000000000000(0000) GS:ffff97f1ade00000(0000)
>> knlGS:0000000000000000
>> [ 25.137790] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [ 25.137790] CR2: ffff97d8aa801000 CR3: 00000049e9430001 CR4:
>> 0000000000770ef0
>> [ 25.137790] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
>> 0000000000000000
>> [ 25.137790] DR3: 0000000000000000 DR6: 00000000ffff07f0 DR7:
>> 0000000000000400
>> [ 25.137790] PKRU: 55555554
>> [ 25.137790] Call Trace:
>> [ 25.137790] <TASK>
>> [ 25.137790] ? die+0x37/0x90
>> [ 25.137790] ? do_trap+0xe3/0x110
>> [ 25.137790] ? mwait_idle+0x35/0x80
>> [ 25.137790] ? do_error_trap+0x6a/0x90
>> [ 25.137790] ? mwait_idle+0x35/0x80
>> [ 25.137790] ? exc_invalid_op+0x52/0x70
>> [ 25.137790] ? mwait_idle+0x35/0x80
>> [ 25.137790] ? asm_exc_invalid_op+0x1a/0x20
>> [ 25.137790] ? mwait_idle+0x35/0x80
>> [ 25.137790] default_idle_call+0x30/0x100
>> [ 25.137790] cpuidle_idle_call+0x12c/0x170
>> [ 25.137790] ? tsc_verify_tsc_adjust+0x73/0xd0
>> [ 25.137790] do_idle+0x7f/0xd0
>> [ 25.137790] cpu_startup_entry+0x29/0x30
>> [ 25.137790] rest_init+0xcc/0xd0
>> [ 25.137790] start_kernel+0x396/0x5d0
>> [ 25.137790] x86_64_start_reservations+0x18/0x30
>> [ 25.137790] x86_64_start_kernel+0xe7/0xf0
>> [ 25.137790] common_startup_64+0x13e/0x148
>> [ 25.137790] </TASK>
>> [ 25.137790] Modules linked in:
>> [ 25.137790] --[ end trace 0000000000000000 ]--
>> [ 25.137790] invalid opcode: 0000 2 PREEMPT SMP NOPTI
>> [ 25.137790] RIP: 0010:mwait_idle+0x35/0x80
>> [ 25.137790] Code: 6f f0 80 48 02 20 48 8b 10 83 e2 08 75 3e 65 48 8b 15
>> 47 d6 56 6f 48 0f ba e2 27 72 41 31 d2 48 89 d8
>>
>>>
>>>> following, CPUID_EXT_MONITOR is set after x86_cpu_filter_features(), so
>>>> that it doesn't have a chance to check MWAIT against host features and
>>>> will be advertised to the guest regardless of whether it's supported by
>>>> the host or not.
>>>>
>>>> x86_cpu_realizefn()
>>>>   x86_cpu_filter_features()
>>>>   cpu_exec_realizefn()
>>>>     kvm_cpu_realizefn
>>>>       host_cpu_realizefn
>>>>         host_cpu_enable_cpu_pm
>>>>           env->features[FEAT_1_ECX] |= CPUID_EXT_MONITOR;
>>>>
>>>>
>>>> If it's not supported by the host, executing MONITOR or MWAIT
>>>> instructions from the guest triggers #UD, no matter MWAIT_EXITING
>>>> control is set or not.
>>>
>>> If I recall right, kvm was able to emulate mwait/monitor.
>>> So question is why it leads to exception instead?
>>
>> KVM can come to play only iff it can trigger MWAIT/MONITOR VM exits. I
>> didn't find explicit proof from Intel SDM that #UD exceptions take
>> precedence over MWAIT/MONITOR VM exits, but this is my speculation. For
>> example, in ancient machines which don't support MWAIT yet, the only way
>> it can do is #UD, not MWAIT VM exit?
> 
> For the Host which doesn't support MWAIT, it shouldn't have the VMX
> control bit for mwait exit either, right?
> 
> Could you pls check this on your machine? If VMX doesn't support this
> exit event, then triggering an exception will make sense.


As Sean just confirmed, #UD takes priority over the MWAIT-exiting VM exit.
So if the host doesn't support MWAIT, executing the MWAIT instruction from
the guest triggers #UD regardless of whether MWAIT exiting is set, and the
guest doesn't boot.

This is not desired, and the VMM should not advertise MWAIT to the guest in
this case.
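
To illustrate why the ordering matters, here is a minimal standalone sketch,
not the actual QEMU code. CPUID_EXT_MONITOR is CPUID.01H:ECX bit 3, as in
QEMU; the helper names (filter_features, enable_cpu_pm, realize_filter_first,
realize_set_first, advertises_mwait) are hypothetical stand-ins, and the
filter is assumed to simply mask the guest bits with the host bits.

```c
#include <stdbool.h>
#include <stdint.h>

/* CPUID.01H:ECX bit 3 reports MONITOR/MWAIT support. */
#define CPUID_EXT_MONITOR (1U << 3)

/* Hypothetical stand-in for the filtering step: drop bits the host lacks. */
static uint32_t filter_features(uint32_t guest_ecx, uint32_t host_ecx)
{
    return guest_ecx & host_ecx;
}

/* Hypothetical stand-in for what cpu-pm=on does to the guest feature word. */
static uint32_t enable_cpu_pm(uint32_t guest_ecx)
{
    return guest_ecx | CPUID_EXT_MONITOR;
}

/* Order described in the report: filter first, then set the MONITOR bit. */
static uint32_t realize_filter_first(uint32_t guest_ecx, uint32_t host_ecx)
{
    return enable_cpu_pm(filter_features(guest_ecx, host_ecx));
}

/* Reversed order: set the MONITOR bit first, so the filter can clear it. */
static uint32_t realize_set_first(uint32_t guest_ecx, uint32_t host_ecx)
{
    return filter_features(enable_cpu_pm(guest_ecx), host_ecx);
}

/* True if the resulting feature word advertises MONITOR/MWAIT. */
static bool advertises_mwait(uint32_t ecx)
{
    return (ecx & CPUID_EXT_MONITOR) != 0;
}
```

With host_ecx == 0 (no MWAIT on the host), realize_filter_first() still
advertises MWAIT to the guest, while realize_set_first() does not. When the
host has the bit, both orders advertise it.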
Michael S. Tsirkin June 2, 2024, 9:54 p.m. UTC | #10
On Thu, May 30, 2024 at 04:49:33PM +0200, Igor Mammedov wrote:
> On Thu, 30 May 2024 21:54:47 +0800
> Zhao Liu <zhao1.liu@intel.com> wrote:
> 
> > Hi Zide,
> > 
> > On Wed, May 29, 2024 at 10:31:21AM -0700, Chen, Zide wrote:
> > > Date: Wed, 29 May 2024 10:31:21 -0700
> > > From: "Chen, Zide" <zide.chen@intel.com>
> > > Subject: Re: [PATCH V2 0/3] improve -overcommit cpu-pm=on|off
> > > 
> > > 
> > > 
> > > On 5/29/2024 5:46 AM, Igor Mammedov wrote:  
> > > > On Tue, 28 May 2024 11:16:59 -0700
> > > > "Chen, Zide" <zide.chen@intel.com> wrote:
> > > >   
> > > >> On 5/28/2024 2:23 AM, Igor Mammedov wrote:  
> > > >>> On Fri, 24 May 2024 13:00:14 -0700
> > > >>> Zide Chen <zide.chen@intel.com> wrote:
> > > >>>     
> > > >>>> Currently, if running "-overcommit cpu-pm=on" on hosts that don't
> > > >>>> have MWAIT support, the MWAIT/MONITOR feature is advertised to the
> > > >>>> guest and executing MWAIT/MONITOR on the guest triggers #UD.    
> > > >>>
> > > >>> this is missing proper description how do you trigger issue
> > > >>> with reproducer and detailed description why guest sees MWAIT
> > > >>> when it's not supported by host.    
> > > >>
> > > >> If "overcommit cpu-pm=on" and "-cpu host" are present, as shown in the  
> > > > it's bette to provide full QEMU CLI and host/guest kernels used and what
> > > > hardware was used if it's relevant so others can reproduce problem.  
> > > 
> > > I ever reproduced this on an older Intel Icelake machine, a
> > > Sapphire Rapids and a Sierra Forest, but I believe this is a x86 generic
> > > issue, not specific to particular models.
> > > 
> > > For the CLI, I think the only command line options that matter are
> > >  -overcommit cpu-pm=on: to set enable_cpu_pm
> > >  -cpu host: so that cpu->max_features is set
> > > 
> > > For QEMU version, as long as it's after this commit: 662175b91ff2
> > > ("i386: reorder call to cpu_exec_realizefn")
> > > 
> > > The guest fails to boot:
> > > 
> > > [ 24.825568] smpboot: x86: Booting SMP configuration:
> > > [ 24.826377] .... node #0, CPUs: #1 #2 #3 #4 #5 #6 #7 #8 #9 #10 #11 #12
> > > #13 #14 #15 #17
> > > [ 24.985799] .... node #1, CPUs: #128 #129 #130 #131 #132 #133 #134 #135
> > > #136 #137 #138 #139 #140 #141 #142 #143 #145
> > > [ 25.136955] invalid opcode: 0000 1 PREEMPT SMP NOPTI
> > > [ 25.137790] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.8.0 #2
> > > [ 25.137790] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS
> > > rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/04
> > > [ 25.137790] RIP: 0010:mwait_idle+0x35/0x80
> > > [ 25.137790] Code: 6f f0 80 48 02 20 48 8b 10 83 e2 08 75 3e 65 48 8b 15
> > > 47 d6 56 6f 48 0f ba e2 27 72 41 31 d2 48 89 d8
> > > [ 25.137790] RSP: 0000:ffffffff91403e70 EFLAGS: 00010046
> > > [ 25.137790] RAX: ffffffff9140a980 RBX: ffffffff9140a980 RCX:
> > > 0000000000000000
> > > [ 25.137790] RDX: 0000000000000000 RSI: ffff97f1ade21b20 RDI:
> > > 0000000000000004
> > > [ 25.137790] RBP: 0000000000000000 R08: 00000005da4709cb R09:
> > > 0000000000000001
> > > [ 25.137790] R10: 0000000000005da4 R11: 0000000000000009 R12:
> > > 0000000000000000
> > > [ 25.137790] R13: ffff98573ff90fc0 R14: ffffffff9140a038 R15:
> > > 0000000000093ff0
> > > [ 25.137790] FS: 0000000000000000(0000) GS:ffff97f1ade00000(0000)
> > > knlGS:0000000000000000
> > > [ 25.137790] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > [ 25.137790] CR2: ffff97d8aa801000 CR3: 00000049e9430001 CR4:
> > > 0000000000770ef0
> > > [ 25.137790] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> > > 0000000000000000
> > > [ 25.137790] DR3: 0000000000000000 DR6: 00000000ffff07f0 DR7:
> > > 0000000000000400
> > > [ 25.137790] PKRU: 55555554
> > > [ 25.137790] Call Trace:
> > > [ 25.137790] <TASK>
> > > [ 25.137790] ? die+0x37/0x90
> > > [ 25.137790] ? do_trap+0xe3/0x110
> > > [ 25.137790] ? mwait_idle+0x35/0x80
> > > [ 25.137790] ? do_error_trap+0x6a/0x90
> > > [ 25.137790] ? mwait_idle+0x35/0x80
> > > [ 25.137790] ? exc_invalid_op+0x52/0x70
> > > [ 25.137790] ? mwait_idle+0x35/0x80
> > > [ 25.137790] ? asm_exc_invalid_op+0x1a/0x20
> > > [ 25.137790] ? mwait_idle+0x35/0x80
> > > [ 25.137790] default_idle_call+0x30/0x100
> > > [ 25.137790] cpuidle_idle_call+0x12c/0x170
> > > [ 25.137790] ? tsc_verify_tsc_adjust+0x73/0xd0
> > > [ 25.137790] do_idle+0x7f/0xd0
> > > [ 25.137790] cpu_startup_entry+0x29/0x30
> > > [ 25.137790] rest_init+0xcc/0xd0
> > > [ 25.137790] start_kernel+0x396/0x5d0
> > > [ 25.137790] x86_64_start_reservations+0x18/0x30
> > > [ 25.137790] x86_64_start_kernel+0xe7/0xf0
> > > [ 25.137790] common_startup_64+0x13e/0x148
> > > [ 25.137790] </TASK>
> > > [ 25.137790] Modules linked in:
> > > [ 25.137790] --[ end trace 0000000000000000 ]--
> > > [ 25.137790] invalid opcode: 0000 2 PREEMPT SMP NOPTI
> > > [ 25.137790] RIP: 0010:mwait_idle+0x35/0x80
> > > [ 25.137790] Code: 6f f0 80 48 02 20 48 8b 10 83 e2 08 75 3e 65 48 8b 15
> > > 47 d6 56 6f 48 0f ba e2 27 72 41 31 d2 48 89 d8
> > >   
> > > >   
> > > >> following, CPUID_EXT_MONITOR is set after x86_cpu_filter_features(), so
> > > >> that it doesn't have a chance to check MWAIT against host features and
> > > >> will be advertised to the guest regardless of whether it's supported by
> > > >> the host or not.
> > > >>
> > > >> x86_cpu_realizefn()
> > > >>   x86_cpu_filter_features()
> > > >>   cpu_exec_realizefn()
> > > >>     kvm_cpu_realizefn
> > > >>       host_cpu_realizefn
> > > >>         host_cpu_enable_cpu_pm
> > > >>           env->features[FEAT_1_ECX] |= CPUID_EXT_MONITOR;
> > > >>
> > > >>
> > > >> If it's not supported by the host, executing MONITOR or MWAIT
> > > >> instructions from the guest triggers #UD, no matter MWAIT_EXITING
> > > >> control is set or not.  
> > > > 
> > > > If I recall right, kvm was able to emulate mwait/monitor.
> > > > So question is why it leads to exception instead?  
> > > 
> > > KVM can come to play only iff it can trigger MWAIT/MONITOR VM exits. I
> > > didn't find explicit proof from Intel SDM that #UD exceptions take
> > > precedence over MWAIT/MONITOR VM exits, but this is my speculation. For
> > > example, in ancient machines which don't support MWAIT yet, the only way
> > > it can do is #UD, not MWAIT VM exit?  
> > 
> > For the Host which doesn't support MWAIT, it shouldn't have the VMX
> > control bit for mwait exit either, right?
> > 
> > Could you pls check this on your machine? If VMX doesn't support this
> > exit event, then triggering an exception will make sense.
> 
> My assumption (probably wrong) was that KVM would emulate mwait if it's unavailable,


Emulating mwait correctly is very hard. KVM does not try.

> unless we have KVM_CAP_X86_DISABLE_EXITS enabled. And in the later case it would
> explode as expected, however then we shouldn't be able to set KVM_CAP_X86_DISABLE_EXITS
> to begin with.
> 
> Recently Sean posted a patch related to that
> [PATCH v2 12/49] KVM: x86: Reject disabling of MWAIT/HLT interception when not allowed
>   https://lkml.org/lkml/2024/5/17/729
> 
> This needs someone with KVM expertise to chime in
> Perhaps Paolo/Sean could clarify expected behavior.
> 
> 
> > 
> > -Zhao
> >