Message ID | 20170112182446.9600-3-lersek@redhat.com (mailing list archive)
---|---
State | New, archived
On Thu, 12 Jan 2017 19:24:45 +0100 Laszlo Ersek <lersek@redhat.com> wrote: > The generic edk2 SMM infrastructure prefers > EFI_SMM_CONTROL2_PROTOCOL.Trigger() to inject an SMI on each processor. If > Trigger() only brings the current processor into SMM, then edk2 handles it > in the following ways: > > (1) If Trigger() is executed by the BSP (which is guaranteed before > ExitBootServices(), but is not necessarily true at runtime), then: > > (a) If edk2 has been configured for "traditional" SMM synchronization, > then the BSP sends directed SMIs to the APs with APIC delivery, > bringing them into SMM individually. Then the BSP runs the SMI > handler / dispatcher. > > (b) If edk2 has been configured for "relaxed" SMM synchronization, > then the APs that are not already in SMM are not brought in, and > the BSP runs the SMI handler / dispatcher. > > (2) If Trigger() is executed by an AP (which is possible after > ExitBootServices(), and can be forced e.g. by "taskset -c 1 > efibootmgr"), then the AP in question brings in the BSP with a > directed SMI, and the BSP runs the SMI handler / dispatcher. > > The smaller problem with (1a) and (2) is that the BSP and AP > synchronization is slow. For example, the "taskset -c 1 efibootmgr" > command from (2) can take more than 3 seconds to complete, because > efibootmgr accesses non-volatile UEFI variables intensively. > > The larger problem is that QEMU's current behavior diverges from the > behavior usually seen on physical hardware, and that keeps exposing > obscure corner cases, race conditions and other instabilities in edk2, > which generally expects / prefers a software SMI to affect all CPUs at > once. > > Therefore introduce the "broadcast SMI" feature that causes QEMU to inject > the SMI on all VCPUs. > > While the original posting of this patch > <http://lists.nongnu.org/archive/html/qemu-devel/2015-10/msg05658.html> > only intended to speed up (2), based on our recent "stress testing" of SMM > this patch actually provides functional improvements. > > Cc: "Michael S. Tsirkin" <mst@redhat.com> > Cc: Gerd Hoffmann <kraxel@redhat.com> > Cc: Igor Mammedov <imammedo@redhat.com> > Cc: Paolo Bonzini <pbonzini@redhat.com> > Signed-off-by: Laszlo Ersek <lersek@redhat.com> > Reviewed-by: Michael S. Tsirkin <mst@redhat.com> > --- > > Notes: > v6: > - no changes, pick up Michael's R-b > > v5: > - replace the ICH9_LPC_SMI_F_BROADCAST bit value with the > ICH9_LPC_SMI_F_BROADCAST_BIT bit position (necessary for > DEFINE_PROP_BIT() in the next patch) > > include/hw/i386/ich9.h | 3 +++ > hw/isa/lpc_ich9.c | 10 +++++++++- > 2 files changed, 12 insertions(+), 1 deletion(-) > > diff --git a/include/hw/i386/ich9.h b/include/hw/i386/ich9.h > index da1118727146..18dcca7ebcbf 100644 > --- a/include/hw/i386/ich9.h > +++ b/include/hw/i386/ich9.h > @@ -250,4 +250,7 @@ Object *ich9_lpc_find(void); > #define ICH9_SMB_HST_D1 0x06 > #define ICH9_SMB_HOST_BLOCK_DB 0x07 > > +/* bit positions used in fw_cfg SMI feature negotiation */ > +#define ICH9_LPC_SMI_F_BROADCAST_BIT 0 > + > #endif /* HW_ICH9_H */ > diff --git a/hw/isa/lpc_ich9.c b/hw/isa/lpc_ich9.c > index 376b7801a42c..ced6f803a4f2 100644 > --- a/hw/isa/lpc_ich9.c > +++ b/hw/isa/lpc_ich9.c > @@ -437,7 +437,15 @@ static void ich9_apm_ctrl_changed(uint32_t val, void *arg) > > /* SMI_EN = PMBASE + 30. 
SMI control and enable register */ > if (lpc->pm.smi_en & ICH9_PMIO_SMI_EN_APMC_EN) { > - cpu_interrupt(current_cpu, CPU_INTERRUPT_SMI); > + if (lpc->smi_negotiated_features & > + (UINT64_C(1) << ICH9_LPC_SMI_F_BROADCAST_BIT)) { > + CPUState *cs; > + CPU_FOREACH(cs) { > + cpu_interrupt(cs, CPU_INTERRUPT_SMI); > + } Shouldn't CPUs with default SMI base be excluded from broadcast? > + } else { > + cpu_interrupt(current_cpu, CPU_INTERRUPT_SMI); > + } > } > } >
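For reference, the negotiated-feature test in the hunk above amounts to the following predicate. This is only an illustrative refactoring (the helper name is not part of the patch), assuming the ICH9LPCState field smi_negotiated_features shown in the diff:

    /* Sketch only: returns true when the guest firmware has negotiated
     * broadcast SMI via the fw_cfg SMI feature bit defined in ich9.h. */
    static bool ich9_smi_broadcast_negotiated(const ICH9LPCState *lpc)
    {
        return (lpc->smi_negotiated_features &
                (UINT64_C(1) << ICH9_LPC_SMI_F_BROADCAST_BIT)) != 0;
    }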
On 01/13/17 14:09, Igor Mammedov wrote: > On Thu, 12 Jan 2017 19:24:45 +0100 > Laszlo Ersek <lersek@redhat.com> wrote: > >> The generic edk2 SMM infrastructure prefers >> EFI_SMM_CONTROL2_PROTOCOL.Trigger() to inject an SMI on each processor. If >> Trigger() only brings the current processor into SMM, then edk2 handles it >> in the following ways: >> >> (1) If Trigger() is executed by the BSP (which is guaranteed before >> ExitBootServices(), but is not necessarily true at runtime), then: >> >> (a) If edk2 has been configured for "traditional" SMM synchronization, >> then the BSP sends directed SMIs to the APs with APIC delivery, >> bringing them into SMM individually. Then the BSP runs the SMI >> handler / dispatcher. >> >> (b) If edk2 has been configured for "relaxed" SMM synchronization, >> then the APs that are not already in SMM are not brought in, and >> the BSP runs the SMI handler / dispatcher. >> >> (2) If Trigger() is executed by an AP (which is possible after >> ExitBootServices(), and can be forced e.g. by "taskset -c 1 >> efibootmgr"), then the AP in question brings in the BSP with a >> directed SMI, and the BSP runs the SMI handler / dispatcher. >> >> The smaller problem with (1a) and (2) is that the BSP and AP >> synchronization is slow. For example, the "taskset -c 1 efibootmgr" >> command from (2) can take more than 3 seconds to complete, because >> efibootmgr accesses non-volatile UEFI variables intensively. >> >> The larger problem is that QEMU's current behavior diverges from the >> behavior usually seen on physical hardware, and that keeps exposing >> obscure corner cases, race conditions and other instabilities in edk2, >> which generally expects / prefers a software SMI to affect all CPUs at >> once. >> >> Therefore introduce the "broadcast SMI" feature that causes QEMU to inject >> the SMI on all VCPUs. >> >> While the original posting of this patch >> <http://lists.nongnu.org/archive/html/qemu-devel/2015-10/msg05658.html> >> only intended to speed up (2), based on our recent "stress testing" of SMM >> this patch actually provides functional improvements. >> >> Cc: "Michael S. Tsirkin" <mst@redhat.com> >> Cc: Gerd Hoffmann <kraxel@redhat.com> >> Cc: Igor Mammedov <imammedo@redhat.com> >> Cc: Paolo Bonzini <pbonzini@redhat.com> >> Signed-off-by: Laszlo Ersek <lersek@redhat.com> >> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> >> --- >> >> Notes: >> v6: >> - no changes, pick up Michael's R-b >> >> v5: >> - replace the ICH9_LPC_SMI_F_BROADCAST bit value with the >> ICH9_LPC_SMI_F_BROADCAST_BIT bit position (necessary for >> DEFINE_PROP_BIT() in the next patch) >> >> include/hw/i386/ich9.h | 3 +++ >> hw/isa/lpc_ich9.c | 10 +++++++++- >> 2 files changed, 12 insertions(+), 1 deletion(-) >> >> diff --git a/include/hw/i386/ich9.h b/include/hw/i386/ich9.h >> index da1118727146..18dcca7ebcbf 100644 >> --- a/include/hw/i386/ich9.h >> +++ b/include/hw/i386/ich9.h >> @@ -250,4 +250,7 @@ Object *ich9_lpc_find(void); >> #define ICH9_SMB_HST_D1 0x06 >> #define ICH9_SMB_HOST_BLOCK_DB 0x07 >> >> +/* bit positions used in fw_cfg SMI feature negotiation */ >> +#define ICH9_LPC_SMI_F_BROADCAST_BIT 0 >> + >> #endif /* HW_ICH9_H */ >> diff --git a/hw/isa/lpc_ich9.c b/hw/isa/lpc_ich9.c >> index 376b7801a42c..ced6f803a4f2 100644 >> --- a/hw/isa/lpc_ich9.c >> +++ b/hw/isa/lpc_ich9.c >> @@ -437,7 +437,15 @@ static void ich9_apm_ctrl_changed(uint32_t val, void *arg) >> >> /* SMI_EN = PMBASE + 30. 
SMI control and enable register */ >> if (lpc->pm.smi_en & ICH9_PMIO_SMI_EN_APMC_EN) { >> - cpu_interrupt(current_cpu, CPU_INTERRUPT_SMI); >> + if (lpc->smi_negotiated_features & >> + (UINT64_C(1) << ICH9_LPC_SMI_F_BROADCAST_BIT)) { >> + CPUState *cs; >> + CPU_FOREACH(cs) { >> + cpu_interrupt(cs, CPU_INTERRUPT_SMI); >> + } > Shouldn't CPUs with default SMI base be excluded from broadcast? I don't think so; as far as I recall, in edk2 the initial SMM entry code -- before SMBASE relocation -- accommodates all VCPUs entering the same code area for execution. The VCPUs are told apart from each other by Initial APIC ID (that is, "who am I"), which is used as an index or search criterion into a global shared array. Once they find their respective private data blocks, the VCPUs don't step on each other's toes. Thanks Laszlo > >> + } else { >> + cpu_interrupt(current_cpu, CPU_INTERRUPT_SMI); >> + } >> } >> } >> >
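An illustrative sketch of the pre-relocation "who am I" lookup described above; every identifier here is a hypothetical stand-in, not an actual edk2 name:

    #include <stdint.h>

    /* Hypothetical per-CPU record; edk2 keeps comparable data in a shared array. */
    typedef struct {
        uint32_t InitialApicId;   /* identity of the CPU that owns this slot */
        /* ... per-CPU SMM private data ... */
    } CPU_PRIVATE;

    extern CPU_PRIVATE gCpuPrivate[];  /* shared array, one slot per CPU */
    extern unsigned    gNumCpus;

    /* Each CPU entering the common SMM entry code locates its own slot by
     * matching its Initial APIC ID, so CPUs do not touch each other's data. */
    static CPU_PRIVATE *FindMyPrivateBlock(uint32_t MyInitialApicId)
    {
        for (unsigned Index = 0; Index < gNumCpus; Index++) {
            if (gCpuPrivate[Index].InitialApicId == MyInitialApicId) {
                return &gCpuPrivate[Index];
            }
        }
        return NULL;
    }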
On Mon, 16 Jan 2017 21:46:55 +0100 Laszlo Ersek <lersek@redhat.com> wrote: > On 01/13/17 14:09, Igor Mammedov wrote: > > On Thu, 12 Jan 2017 19:24:45 +0100 > > Laszlo Ersek <lersek@redhat.com> wrote: > > > >> The generic edk2 SMM infrastructure prefers > >> EFI_SMM_CONTROL2_PROTOCOL.Trigger() to inject an SMI on each processor. If > >> Trigger() only brings the current processor into SMM, then edk2 handles it > >> in the following ways: > >> > >> (1) If Trigger() is executed by the BSP (which is guaranteed before > >> ExitBootServices(), but is not necessarily true at runtime), then: > >> > >> (a) If edk2 has been configured for "traditional" SMM synchronization, > >> then the BSP sends directed SMIs to the APs with APIC delivery, > >> bringing them into SMM individually. Then the BSP runs the SMI > >> handler / dispatcher. > >> > >> (b) If edk2 has been configured for "relaxed" SMM synchronization, > >> then the APs that are not already in SMM are not brought in, and > >> the BSP runs the SMI handler / dispatcher. > >> > >> (2) If Trigger() is executed by an AP (which is possible after > >> ExitBootServices(), and can be forced e.g. by "taskset -c 1 > >> efibootmgr"), then the AP in question brings in the BSP with a > >> directed SMI, and the BSP runs the SMI handler / dispatcher. > >> > >> The smaller problem with (1a) and (2) is that the BSP and AP > >> synchronization is slow. For example, the "taskset -c 1 efibootmgr" > >> command from (2) can take more than 3 seconds to complete, because > >> efibootmgr accesses non-volatile UEFI variables intensively. > >> > >> The larger problem is that QEMU's current behavior diverges from the > >> behavior usually seen on physical hardware, and that keeps exposing > >> obscure corner cases, race conditions and other instabilities in edk2, > >> which generally expects / prefers a software SMI to affect all CPUs at > >> once. > >> > >> Therefore introduce the "broadcast SMI" feature that causes QEMU to inject > >> the SMI on all VCPUs. > >> > >> While the original posting of this patch > >> <http://lists.nongnu.org/archive/html/qemu-devel/2015-10/msg05658.html> > >> only intended to speed up (2), based on our recent "stress testing" of SMM > >> this patch actually provides functional improvements. > >> > >> Cc: "Michael S. Tsirkin" <mst@redhat.com> > >> Cc: Gerd Hoffmann <kraxel@redhat.com> > >> Cc: Igor Mammedov <imammedo@redhat.com> > >> Cc: Paolo Bonzini <pbonzini@redhat.com> > >> Signed-off-by: Laszlo Ersek <lersek@redhat.com> > >> Reviewed-by: Michael S. 
Tsirkin <mst@redhat.com> > >> --- > >> > >> Notes: > >> v6: > >> - no changes, pick up Michael's R-b > >> > >> v5: > >> - replace the ICH9_LPC_SMI_F_BROADCAST bit value with the > >> ICH9_LPC_SMI_F_BROADCAST_BIT bit position (necessary for > >> DEFINE_PROP_BIT() in the next patch) > >> > >> include/hw/i386/ich9.h | 3 +++ > >> hw/isa/lpc_ich9.c | 10 +++++++++- > >> 2 files changed, 12 insertions(+), 1 deletion(-) > >> > >> diff --git a/include/hw/i386/ich9.h b/include/hw/i386/ich9.h > >> index da1118727146..18dcca7ebcbf 100644 > >> --- a/include/hw/i386/ich9.h > >> +++ b/include/hw/i386/ich9.h > >> @@ -250,4 +250,7 @@ Object *ich9_lpc_find(void); > >> #define ICH9_SMB_HST_D1 0x06 > >> #define ICH9_SMB_HOST_BLOCK_DB 0x07 > >> > >> +/* bit positions used in fw_cfg SMI feature negotiation */ > >> +#define ICH9_LPC_SMI_F_BROADCAST_BIT 0 > >> + > >> #endif /* HW_ICH9_H */ > >> diff --git a/hw/isa/lpc_ich9.c b/hw/isa/lpc_ich9.c > >> index 376b7801a42c..ced6f803a4f2 100644 > >> --- a/hw/isa/lpc_ich9.c > >> +++ b/hw/isa/lpc_ich9.c > >> @@ -437,7 +437,15 @@ static void ich9_apm_ctrl_changed(uint32_t val, void *arg) > >> > >> /* SMI_EN = PMBASE + 30. SMI control and enable register */ > >> if (lpc->pm.smi_en & ICH9_PMIO_SMI_EN_APMC_EN) { > >> - cpu_interrupt(current_cpu, CPU_INTERRUPT_SMI); > >> + if (lpc->smi_negotiated_features & > >> + (UINT64_C(1) << ICH9_LPC_SMI_F_BROADCAST_BIT)) { > >> + CPUState *cs; > >> + CPU_FOREACH(cs) { > >> + cpu_interrupt(cs, CPU_INTERRUPT_SMI); > >> + } > > Shouldn't CPUs with default SMI base be excluded from broadcast? > > I don't think so; as far as I recall, in edk2 the initial SMM entry code > -- before SMBASE relocation -- accommodates all VCPUs entering the same > code area for execution. The VCPUs are told apart from each other by > Initial APIC ID (that is, "who am I"), which is used as an index or > search criterion into a global shared array. Once they find their > respective private data blocks, the VCPUs don't step on each other's toes. That's what I'm not sure. If broadcast SMI is sent before relocation all CPUs will use the same SMBASE and as result save their state in the same memory (even if it's state after RESET/POWER ON) racing/overwriting each other's saved state and then on exit from SMI they all will restore whatever state that race produced so it seems plain wrong thing to do. Are you sure that edk2 does broadcast for SMI initialization or does it using directed SMIs? > > Thanks > Laszlo > > > > >> + } else { > >> + cpu_interrupt(current_cpu, CPU_INTERRUPT_SMI); > >> + } > >> } > >> } > >> > > >
On 01/17/17 12:21, Igor Mammedov wrote: > On Mon, 16 Jan 2017 21:46:55 +0100 > Laszlo Ersek <lersek@redhat.com> wrote: > >> On 01/13/17 14:09, Igor Mammedov wrote: >>> On Thu, 12 Jan 2017 19:24:45 +0100 >>> Laszlo Ersek <lersek@redhat.com> wrote: >>> >>>> The generic edk2 SMM infrastructure prefers >>>> EFI_SMM_CONTROL2_PROTOCOL.Trigger() to inject an SMI on each processor. If >>>> Trigger() only brings the current processor into SMM, then edk2 handles it >>>> in the following ways: >>>> >>>> (1) If Trigger() is executed by the BSP (which is guaranteed before >>>> ExitBootServices(), but is not necessarily true at runtime), then: >>>> >>>> (a) If edk2 has been configured for "traditional" SMM synchronization, >>>> then the BSP sends directed SMIs to the APs with APIC delivery, >>>> bringing them into SMM individually. Then the BSP runs the SMI >>>> handler / dispatcher. >>>> >>>> (b) If edk2 has been configured for "relaxed" SMM synchronization, >>>> then the APs that are not already in SMM are not brought in, and >>>> the BSP runs the SMI handler / dispatcher. >>>> >>>> (2) If Trigger() is executed by an AP (which is possible after >>>> ExitBootServices(), and can be forced e.g. by "taskset -c 1 >>>> efibootmgr"), then the AP in question brings in the BSP with a >>>> directed SMI, and the BSP runs the SMI handler / dispatcher. >>>> >>>> The smaller problem with (1a) and (2) is that the BSP and AP >>>> synchronization is slow. For example, the "taskset -c 1 efibootmgr" >>>> command from (2) can take more than 3 seconds to complete, because >>>> efibootmgr accesses non-volatile UEFI variables intensively. >>>> >>>> The larger problem is that QEMU's current behavior diverges from the >>>> behavior usually seen on physical hardware, and that keeps exposing >>>> obscure corner cases, race conditions and other instabilities in edk2, >>>> which generally expects / prefers a software SMI to affect all CPUs at >>>> once. >>>> >>>> Therefore introduce the "broadcast SMI" feature that causes QEMU to inject >>>> the SMI on all VCPUs. >>>> >>>> While the original posting of this patch >>>> <http://lists.nongnu.org/archive/html/qemu-devel/2015-10/msg05658.html> >>>> only intended to speed up (2), based on our recent "stress testing" of SMM >>>> this patch actually provides functional improvements. >>>> >>>> Cc: "Michael S. Tsirkin" <mst@redhat.com> >>>> Cc: Gerd Hoffmann <kraxel@redhat.com> >>>> Cc: Igor Mammedov <imammedo@redhat.com> >>>> Cc: Paolo Bonzini <pbonzini@redhat.com> >>>> Signed-off-by: Laszlo Ersek <lersek@redhat.com> >>>> Reviewed-by: Michael S. 
Tsirkin <mst@redhat.com> >>>> --- >>>> >>>> Notes: >>>> v6: >>>> - no changes, pick up Michael's R-b >>>> >>>> v5: >>>> - replace the ICH9_LPC_SMI_F_BROADCAST bit value with the >>>> ICH9_LPC_SMI_F_BROADCAST_BIT bit position (necessary for >>>> DEFINE_PROP_BIT() in the next patch) >>>> >>>> include/hw/i386/ich9.h | 3 +++ >>>> hw/isa/lpc_ich9.c | 10 +++++++++- >>>> 2 files changed, 12 insertions(+), 1 deletion(-) >>>> >>>> diff --git a/include/hw/i386/ich9.h b/include/hw/i386/ich9.h >>>> index da1118727146..18dcca7ebcbf 100644 >>>> --- a/include/hw/i386/ich9.h >>>> +++ b/include/hw/i386/ich9.h >>>> @@ -250,4 +250,7 @@ Object *ich9_lpc_find(void); >>>> #define ICH9_SMB_HST_D1 0x06 >>>> #define ICH9_SMB_HOST_BLOCK_DB 0x07 >>>> >>>> +/* bit positions used in fw_cfg SMI feature negotiation */ >>>> +#define ICH9_LPC_SMI_F_BROADCAST_BIT 0 >>>> + >>>> #endif /* HW_ICH9_H */ >>>> diff --git a/hw/isa/lpc_ich9.c b/hw/isa/lpc_ich9.c >>>> index 376b7801a42c..ced6f803a4f2 100644 >>>> --- a/hw/isa/lpc_ich9.c >>>> +++ b/hw/isa/lpc_ich9.c >>>> @@ -437,7 +437,15 @@ static void ich9_apm_ctrl_changed(uint32_t val, void *arg) >>>> >>>> /* SMI_EN = PMBASE + 30. SMI control and enable register */ >>>> if (lpc->pm.smi_en & ICH9_PMIO_SMI_EN_APMC_EN) { >>>> - cpu_interrupt(current_cpu, CPU_INTERRUPT_SMI); >>>> + if (lpc->smi_negotiated_features & >>>> + (UINT64_C(1) << ICH9_LPC_SMI_F_BROADCAST_BIT)) { >>>> + CPUState *cs; >>>> + CPU_FOREACH(cs) { >>>> + cpu_interrupt(cs, CPU_INTERRUPT_SMI); >>>> + } >>> Shouldn't CPUs with default SMI base be excluded from broadcast? >> >> I don't think so; as far as I recall, in edk2 the initial SMM entry code >> -- before SMBASE relocation -- accommodates all VCPUs entering the same >> code area for execution. The VCPUs are told apart from each other by >> Initial APIC ID (that is, "who am I"), which is used as an index or >> search criterion into a global shared array. Once they find their >> respective private data blocks, the VCPUs don't step on each other's toes. > That's what I'm not sure. > If broadcast SMI is sent before relocation all CPUs will use > the same SMBASE and as result save their state in the same > memory (even if it's state after RESET/POWER ON) racing/overwriting > each other's saved state Good point! > and then on exit from SMI they all will restore > whatever state that race produced so it seems plain wrong thing to do. > > Are you sure that edk2 does broadcast for SMI initialization or does it > using directed SMIs? Hmmm, you are right. The SmmRelocateBases() function in "UefiCpuPkg/PiSmmCpuDxeSmm/PiSmmCpuDxeSmm.c" first relocates SMBASE for each individual AP in succession, by sending SMI IPIs to them, one by one. Then it does the same for the BSP, by sending itself a similar SMI IPI. The above QEMU code is only exercised much later, after the SMBASE relocation, with the SMM stack up and running. It is used when a (preferably broadcast) SMI is triggered for firmware services (for example, the UEFI variable services). So, you are right. Considering edk2 only, the difference shouldn't matter -- when this code is invoked (via an IO port write), the SMBASEs will all have been relocated. Also, I've been informed that on real hardware, writing to APM_CNT triggers an SMI on all CPUs, regardless of the individual SMBASE values. So I believe the above code should be closest to real hardware, and the pre-patch code was only written in unicast form for SeaBIOS's sake. I think the code is safe above. 
If the guest injects an SMI via APM_CNT after negotiating SMI broadcast, it should have not left any VCPUs without an SMI handler by that time. edk2 / OVMF conforms to this. In the future we can tweak this further, if necessary; we have 63 more bits, and we'll be able to reject invalid combinations of bits. Do you feel that we should skip VCPUs whose SMBASE has not been changed from the default? If so, doesn't that run the risk of missing a VCPU that has an actual SMI handler installed, and it just happens to be placed at the default SMBASE location? ... I wouldn't mind skipping VCPUs with apparently unrelocated SMBASEs, but I think (a) that's not what real hardware does, and (b) it is risky if a VCPU's actual SMI handler has been installed while keeping the default SMBASE value. What do you think? Thanks! Laszlo > >> >> Thanks >> Laszlo >> >>> >>>> + } else { >>>> + cpu_interrupt(current_cpu, CPU_INTERRUPT_SMI); >>>> + } >>>> } >>>> } >>>> >>> >> >
On Tue, 17 Jan 2017 13:06:13 +0100 Laszlo Ersek <lersek@redhat.com> wrote: > On 01/17/17 12:21, Igor Mammedov wrote: > > On Mon, 16 Jan 2017 21:46:55 +0100 > > Laszlo Ersek <lersek@redhat.com> wrote: > > > >> On 01/13/17 14:09, Igor Mammedov wrote: > >>> On Thu, 12 Jan 2017 19:24:45 +0100 > >>> Laszlo Ersek <lersek@redhat.com> wrote: > >>> > >>>> The generic edk2 SMM infrastructure prefers > >>>> EFI_SMM_CONTROL2_PROTOCOL.Trigger() to inject an SMI on each processor. If > >>>> Trigger() only brings the current processor into SMM, then edk2 handles it > >>>> in the following ways: > >>>> > >>>> (1) If Trigger() is executed by the BSP (which is guaranteed before > >>>> ExitBootServices(), but is not necessarily true at runtime), then: > >>>> > >>>> (a) If edk2 has been configured for "traditional" SMM synchronization, > >>>> then the BSP sends directed SMIs to the APs with APIC delivery, > >>>> bringing them into SMM individually. Then the BSP runs the SMI > >>>> handler / dispatcher. > >>>> > >>>> (b) If edk2 has been configured for "relaxed" SMM synchronization, > >>>> then the APs that are not already in SMM are not brought in, and > >>>> the BSP runs the SMI handler / dispatcher. > >>>> > >>>> (2) If Trigger() is executed by an AP (which is possible after > >>>> ExitBootServices(), and can be forced e.g. by "taskset -c 1 > >>>> efibootmgr"), then the AP in question brings in the BSP with a > >>>> directed SMI, and the BSP runs the SMI handler / dispatcher. > >>>> > >>>> The smaller problem with (1a) and (2) is that the BSP and AP > >>>> synchronization is slow. For example, the "taskset -c 1 efibootmgr" > >>>> command from (2) can take more than 3 seconds to complete, because > >>>> efibootmgr accesses non-volatile UEFI variables intensively. > >>>> > >>>> The larger problem is that QEMU's current behavior diverges from the > >>>> behavior usually seen on physical hardware, and that keeps exposing > >>>> obscure corner cases, race conditions and other instabilities in edk2, > >>>> which generally expects / prefers a software SMI to affect all CPUs at > >>>> once. > >>>> > >>>> Therefore introduce the "broadcast SMI" feature that causes QEMU to inject > >>>> the SMI on all VCPUs. > >>>> > >>>> While the original posting of this patch > >>>> <http://lists.nongnu.org/archive/html/qemu-devel/2015-10/msg05658.html> > >>>> only intended to speed up (2), based on our recent "stress testing" of SMM > >>>> this patch actually provides functional improvements. > >>>> > >>>> Cc: "Michael S. Tsirkin" <mst@redhat.com> > >>>> Cc: Gerd Hoffmann <kraxel@redhat.com> > >>>> Cc: Igor Mammedov <imammedo@redhat.com> > >>>> Cc: Paolo Bonzini <pbonzini@redhat.com> > >>>> Signed-off-by: Laszlo Ersek <lersek@redhat.com> > >>>> Reviewed-by: Michael S. 
Tsirkin <mst@redhat.com> > >>>> --- > >>>> > >>>> Notes: > >>>> v6: > >>>> - no changes, pick up Michael's R-b > >>>> > >>>> v5: > >>>> - replace the ICH9_LPC_SMI_F_BROADCAST bit value with the > >>>> ICH9_LPC_SMI_F_BROADCAST_BIT bit position (necessary for > >>>> DEFINE_PROP_BIT() in the next patch) > >>>> > >>>> include/hw/i386/ich9.h | 3 +++ > >>>> hw/isa/lpc_ich9.c | 10 +++++++++- > >>>> 2 files changed, 12 insertions(+), 1 deletion(-) > >>>> > >>>> diff --git a/include/hw/i386/ich9.h b/include/hw/i386/ich9.h > >>>> index da1118727146..18dcca7ebcbf 100644 > >>>> --- a/include/hw/i386/ich9.h > >>>> +++ b/include/hw/i386/ich9.h > >>>> @@ -250,4 +250,7 @@ Object *ich9_lpc_find(void); > >>>> #define ICH9_SMB_HST_D1 0x06 > >>>> #define ICH9_SMB_HOST_BLOCK_DB 0x07 > >>>> > >>>> +/* bit positions used in fw_cfg SMI feature negotiation */ > >>>> +#define ICH9_LPC_SMI_F_BROADCAST_BIT 0 > >>>> + > >>>> #endif /* HW_ICH9_H */ > >>>> diff --git a/hw/isa/lpc_ich9.c b/hw/isa/lpc_ich9.c > >>>> index 376b7801a42c..ced6f803a4f2 100644 > >>>> --- a/hw/isa/lpc_ich9.c > >>>> +++ b/hw/isa/lpc_ich9.c > >>>> @@ -437,7 +437,15 @@ static void ich9_apm_ctrl_changed(uint32_t val, void *arg) > >>>> > >>>> /* SMI_EN = PMBASE + 30. SMI control and enable register */ > >>>> if (lpc->pm.smi_en & ICH9_PMIO_SMI_EN_APMC_EN) { > >>>> - cpu_interrupt(current_cpu, CPU_INTERRUPT_SMI); > >>>> + if (lpc->smi_negotiated_features & > >>>> + (UINT64_C(1) << ICH9_LPC_SMI_F_BROADCAST_BIT)) { > >>>> + CPUState *cs; > >>>> + CPU_FOREACH(cs) { > >>>> + cpu_interrupt(cs, CPU_INTERRUPT_SMI); > >>>> + } > >>> Shouldn't CPUs with default SMI base be excluded from broadcast? > >> > >> I don't think so; as far as I recall, in edk2 the initial SMM entry code > >> -- before SMBASE relocation -- accommodates all VCPUs entering the same > >> code area for execution. The VCPUs are told apart from each other by > >> Initial APIC ID (that is, "who am I"), which is used as an index or > >> search criterion into a global shared array. Once they find their > >> respective private data blocks, the VCPUs don't step on each other's toes. > > That's what I'm not sure. > > If broadcast SMI is sent before relocation all CPUs will use > > the same SMBASE and as result save their state in the same > > memory (even if it's state after RESET/POWER ON) racing/overwriting > > each other's saved state > > Good point! > > > and then on exit from SMI they all will restore > > whatever state that race produced so it seems plain wrong thing to do. > > > > Are you sure that edk2 does broadcast for SMI initialization or does it > > using directed SMIs? > > Hmmm, you are right. The SmmRelocateBases() function in > "UefiCpuPkg/PiSmmCpuDxeSmm/PiSmmCpuDxeSmm.c" first relocates SMBASE for > each individual AP in succession, by sending SMI IPIs to them, one by > one. Then it does the same for the BSP, by sending itself a similar SMI IPI. > > The above QEMU code is only exercised much later, after the SMBASE > relocation, with the SMM stack up and running. It is used when a > (preferably broadcast) SMI is triggered for firmware services (for > example, the UEFI variable services). > > So, you are right. > > Considering edk2 only, the difference shouldn't matter -- when this code > is invoked (via an IO port write), the SMBASEs will all have been relocated. > > Also, I've been informed that on real hardware, writing to APM_CNT > triggers an SMI on all CPUs, regardless of the individual SMBASE values. 
> So I believe the above code should be closest to real hardware, and the > pre-patch code was only written in unicast form for SeaBIOS's sake. > > I think the code is safe above. If the guest injects an SMI via APM_CNT > after negotiating SMI broadcast, it should have not left any VCPUs > without an SMI handler by that time. edk2 / OVMF conforms to this. In > the future we can tweak this further, if necessary; we have 63 more > bits, and we'll be able to reject invalid combinations of bits. > > Do you feel that we should skip VCPUs whose SMBASE has not been changed > from the default? > > If so, doesn't that run the risk of missing a VCPU that has an actual > SMI handler installed, and it just happens to be placed at the default > SMBASE location? > > ... I wouldn't mind skipping VCPUs with apparently unrelocated SMBASEs, > but I think (a) that's not what real hardware does, and (b) it is risky > if a VCPU's actual SMI handler has been installed while keeping the > default SMBASE value. > > What do you think? (a) probably doesn't consider hotplug, so it's totally fine for the case where firmware is the only one who wakes up/initializes CPUs. however consider 2 CPU hotplugged and then broadcast SMI happens to serve some other task (if there isn't unpark 'feature'). Even if real hw does it, it's behavior not defined by SDM with possibly bad consequences. (b) alone it's risky so I'd skip based on combination of if (default SMBASE & CPU is in reset state) skip; that should be safe and it leaves open possibility to avoid adding unpark 'feature' to CPU. > > Thanks! > Laszlo > > > > >> > >> Thanks > >> Laszlo > >> > >>> > >>>> + } else { > >>>> + cpu_interrupt(current_cpu, CPU_INTERRUPT_SMI); > >>>> + } > >>>> } > >>>> } > >>>> > >>> > >> > > >
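A sketch of how the skip proposed above could sit in the broadcast loop; cpu_is_in_reset_state() is a hypothetical placeholder for a reset-state check the thread has not yet settled on, and 0x30000 is the x86 default SMBASE:

    CPUState *cs;

    CPU_FOREACH(cs) {
        X86CPU *cpu = X86_CPU(cs);

        /* Per the proposal: leave alone CPUs that still have the default
         * SMBASE and have not been started yet (e.g. just hotplugged). */
        if (cpu->env.smbase == 0x30000 && cpu_is_in_reset_state(cs)) {
            continue;
        }
        cpu_interrupt(cs, CPU_INTERRUPT_SMI);
    }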
On 01/17/17 14:20, Igor Mammedov wrote: > On Tue, 17 Jan 2017 13:06:13 +0100 > Laszlo Ersek <lersek@redhat.com> wrote: > >> On 01/17/17 12:21, Igor Mammedov wrote: >>> On Mon, 16 Jan 2017 21:46:55 +0100 >>> Laszlo Ersek <lersek@redhat.com> wrote: >>> >>>> On 01/13/17 14:09, Igor Mammedov wrote: >>>>> On Thu, 12 Jan 2017 19:24:45 +0100 >>>>> Laszlo Ersek <lersek@redhat.com> wrote: >>>>> >>>>>> The generic edk2 SMM infrastructure prefers >>>>>> EFI_SMM_CONTROL2_PROTOCOL.Trigger() to inject an SMI on each processor. If >>>>>> Trigger() only brings the current processor into SMM, then edk2 handles it >>>>>> in the following ways: >>>>>> >>>>>> (1) If Trigger() is executed by the BSP (which is guaranteed before >>>>>> ExitBootServices(), but is not necessarily true at runtime), then: >>>>>> >>>>>> (a) If edk2 has been configured for "traditional" SMM synchronization, >>>>>> then the BSP sends directed SMIs to the APs with APIC delivery, >>>>>> bringing them into SMM individually. Then the BSP runs the SMI >>>>>> handler / dispatcher. >>>>>> >>>>>> (b) If edk2 has been configured for "relaxed" SMM synchronization, >>>>>> then the APs that are not already in SMM are not brought in, and >>>>>> the BSP runs the SMI handler / dispatcher. >>>>>> >>>>>> (2) If Trigger() is executed by an AP (which is possible after >>>>>> ExitBootServices(), and can be forced e.g. by "taskset -c 1 >>>>>> efibootmgr"), then the AP in question brings in the BSP with a >>>>>> directed SMI, and the BSP runs the SMI handler / dispatcher. >>>>>> >>>>>> The smaller problem with (1a) and (2) is that the BSP and AP >>>>>> synchronization is slow. For example, the "taskset -c 1 efibootmgr" >>>>>> command from (2) can take more than 3 seconds to complete, because >>>>>> efibootmgr accesses non-volatile UEFI variables intensively. >>>>>> >>>>>> The larger problem is that QEMU's current behavior diverges from the >>>>>> behavior usually seen on physical hardware, and that keeps exposing >>>>>> obscure corner cases, race conditions and other instabilities in edk2, >>>>>> which generally expects / prefers a software SMI to affect all CPUs at >>>>>> once. >>>>>> >>>>>> Therefore introduce the "broadcast SMI" feature that causes QEMU to inject >>>>>> the SMI on all VCPUs. >>>>>> >>>>>> While the original posting of this patch >>>>>> <http://lists.nongnu.org/archive/html/qemu-devel/2015-10/msg05658.html> >>>>>> only intended to speed up (2), based on our recent "stress testing" of SMM >>>>>> this patch actually provides functional improvements. >>>>>> >>>>>> Cc: "Michael S. Tsirkin" <mst@redhat.com> >>>>>> Cc: Gerd Hoffmann <kraxel@redhat.com> >>>>>> Cc: Igor Mammedov <imammedo@redhat.com> >>>>>> Cc: Paolo Bonzini <pbonzini@redhat.com> >>>>>> Signed-off-by: Laszlo Ersek <lersek@redhat.com> >>>>>> Reviewed-by: Michael S. 
Tsirkin <mst@redhat.com> >>>>>> --- >>>>>> >>>>>> Notes: >>>>>> v6: >>>>>> - no changes, pick up Michael's R-b >>>>>> >>>>>> v5: >>>>>> - replace the ICH9_LPC_SMI_F_BROADCAST bit value with the >>>>>> ICH9_LPC_SMI_F_BROADCAST_BIT bit position (necessary for >>>>>> DEFINE_PROP_BIT() in the next patch) >>>>>> >>>>>> include/hw/i386/ich9.h | 3 +++ >>>>>> hw/isa/lpc_ich9.c | 10 +++++++++- >>>>>> 2 files changed, 12 insertions(+), 1 deletion(-) >>>>>> >>>>>> diff --git a/include/hw/i386/ich9.h b/include/hw/i386/ich9.h >>>>>> index da1118727146..18dcca7ebcbf 100644 >>>>>> --- a/include/hw/i386/ich9.h >>>>>> +++ b/include/hw/i386/ich9.h >>>>>> @@ -250,4 +250,7 @@ Object *ich9_lpc_find(void); >>>>>> #define ICH9_SMB_HST_D1 0x06 >>>>>> #define ICH9_SMB_HOST_BLOCK_DB 0x07 >>>>>> >>>>>> +/* bit positions used in fw_cfg SMI feature negotiation */ >>>>>> +#define ICH9_LPC_SMI_F_BROADCAST_BIT 0 >>>>>> + >>>>>> #endif /* HW_ICH9_H */ >>>>>> diff --git a/hw/isa/lpc_ich9.c b/hw/isa/lpc_ich9.c >>>>>> index 376b7801a42c..ced6f803a4f2 100644 >>>>>> --- a/hw/isa/lpc_ich9.c >>>>>> +++ b/hw/isa/lpc_ich9.c >>>>>> @@ -437,7 +437,15 @@ static void ich9_apm_ctrl_changed(uint32_t val, void *arg) >>>>>> >>>>>> /* SMI_EN = PMBASE + 30. SMI control and enable register */ >>>>>> if (lpc->pm.smi_en & ICH9_PMIO_SMI_EN_APMC_EN) { >>>>>> - cpu_interrupt(current_cpu, CPU_INTERRUPT_SMI); >>>>>> + if (lpc->smi_negotiated_features & >>>>>> + (UINT64_C(1) << ICH9_LPC_SMI_F_BROADCAST_BIT)) { >>>>>> + CPUState *cs; >>>>>> + CPU_FOREACH(cs) { >>>>>> + cpu_interrupt(cs, CPU_INTERRUPT_SMI); >>>>>> + } >>>>> Shouldn't CPUs with default SMI base be excluded from broadcast? >>>> >>>> I don't think so; as far as I recall, in edk2 the initial SMM entry code >>>> -- before SMBASE relocation -- accommodates all VCPUs entering the same >>>> code area for execution. The VCPUs are told apart from each other by >>>> Initial APIC ID (that is, "who am I"), which is used as an index or >>>> search criterion into a global shared array. Once they find their >>>> respective private data blocks, the VCPUs don't step on each other's toes. >>> That's what I'm not sure. >>> If broadcast SMI is sent before relocation all CPUs will use >>> the same SMBASE and as result save their state in the same >>> memory (even if it's state after RESET/POWER ON) racing/overwriting >>> each other's saved state >> >> Good point! >> >>> and then on exit from SMI they all will restore >>> whatever state that race produced so it seems plain wrong thing to do. >>> >>> Are you sure that edk2 does broadcast for SMI initialization or does it >>> using directed SMIs? >> >> Hmmm, you are right. The SmmRelocateBases() function in >> "UefiCpuPkg/PiSmmCpuDxeSmm/PiSmmCpuDxeSmm.c" first relocates SMBASE for >> each individual AP in succession, by sending SMI IPIs to them, one by >> one. Then it does the same for the BSP, by sending itself a similar SMI IPI. >> >> The above QEMU code is only exercised much later, after the SMBASE >> relocation, with the SMM stack up and running. It is used when a >> (preferably broadcast) SMI is triggered for firmware services (for >> example, the UEFI variable services). >> >> So, you are right. >> >> Considering edk2 only, the difference shouldn't matter -- when this code >> is invoked (via an IO port write), the SMBASEs will all have been relocated. >> >> Also, I've been informed that on real hardware, writing to APM_CNT >> triggers an SMI on all CPUs, regardless of the individual SMBASE values. 
>> So I believe the above code should be closest to real hardware, and the >> pre-patch code was only written in unicast form for SeaBIOS's sake. >> >> I think the code is safe above. If the guest injects an SMI via APM_CNT >> after negotiating SMI broadcast, it should have not left any VCPUs >> without an SMI handler by that time. edk2 / OVMF conforms to this. In >> the future we can tweak this further, if necessary; we have 63 more >> bits, and we'll be able to reject invalid combinations of bits. >> >> Do you feel that we should skip VCPUs whose SMBASE has not been changed >> from the default? >> >> If so, doesn't that run the risk of missing a VCPU that has an actual >> SMI handler installed, and it just happens to be placed at the default >> SMBASE location? >> >> ... I wouldn't mind skipping VCPUs with apparently unrelocated SMBASEs, >> but I think (a) that's not what real hardware does, and (b) it is risky >> if a VCPU's actual SMI handler has been installed while keeping the >> default SMBASE value. >> >> What do you think? > (a) probably doesn't consider hotplug, so it's totally fine for the case > where firmware is the only one who wakes up/initializes CPUs. > however consider 2 CPU hotplugged and then broadcast SMI happens > to serve some other task (if there isn't unpark 'feature'). > Even if real hw does it, it's behavior not defined by SDM with possibly > bad consequences. > > (b) alone it's risky so I'd skip based on combination of > > if (default SMBASE & CPU is in reset state) > skip; > > that should be safe and it leaves open possibility to avoid adding > unpark 'feature' to CPU. The above attributes ("SMBASE" and "CPU is in reset state") are specific to x86 VCPUs. Should I use some dynamic QOM casting for this? Like the X86_CPU() macro? Can I assume here that the macro will never return NULL? (I think so, LPC is only used with x86.) ... I guess SMBASE can be accessed as X86_CPU(cs)->env.smbase How do I figure out if the CPU is in reset state ("waiting for first INIT") though? Note that this state should be distinguished from simply "halted" (i.e., after a HLT). After a HLT, the SMI should be injected and delivered; see commit a9bad65d2c1f. Edk2 regularly uses HLT to send APs to sleep. Thanks Laszlo > >> >> Thanks! >> Laszlo >> >>> >>>> >>>> Thanks >>>> Laszlo >>>> >>>>> >>>>>> + } else { >>>>>> + cpu_interrupt(current_cpu, CPU_INTERRUPT_SMI); >>>>>> + } >>>>>> } >>>>>> } >>>>>> >>>>> >>>> >>> >> >
On Tue, 17 Jan 2017 14:46:05 +0100 Laszlo Ersek <lersek@redhat.com> wrote: > On 01/17/17 14:20, Igor Mammedov wrote: > > On Tue, 17 Jan 2017 13:06:13 +0100 > > Laszlo Ersek <lersek@redhat.com> wrote: > > > >> On 01/17/17 12:21, Igor Mammedov wrote: > >>> On Mon, 16 Jan 2017 21:46:55 +0100 > >>> Laszlo Ersek <lersek@redhat.com> wrote: > >>> > >>>> On 01/13/17 14:09, Igor Mammedov wrote: > >>>>> On Thu, 12 Jan 2017 19:24:45 +0100 > >>>>> Laszlo Ersek <lersek@redhat.com> wrote: > >>>>> > >>>>>> The generic edk2 SMM infrastructure prefers > >>>>>> EFI_SMM_CONTROL2_PROTOCOL.Trigger() to inject an SMI on each processor. If > >>>>>> Trigger() only brings the current processor into SMM, then edk2 handles it > >>>>>> in the following ways: > >>>>>> > >>>>>> (1) If Trigger() is executed by the BSP (which is guaranteed before > >>>>>> ExitBootServices(), but is not necessarily true at runtime), then: > >>>>>> > >>>>>> (a) If edk2 has been configured for "traditional" SMM synchronization, > >>>>>> then the BSP sends directed SMIs to the APs with APIC delivery, > >>>>>> bringing them into SMM individually. Then the BSP runs the SMI > >>>>>> handler / dispatcher. > >>>>>> > >>>>>> (b) If edk2 has been configured for "relaxed" SMM synchronization, > >>>>>> then the APs that are not already in SMM are not brought in, and > >>>>>> the BSP runs the SMI handler / dispatcher. > >>>>>> > >>>>>> (2) If Trigger() is executed by an AP (which is possible after > >>>>>> ExitBootServices(), and can be forced e.g. by "taskset -c 1 > >>>>>> efibootmgr"), then the AP in question brings in the BSP with a > >>>>>> directed SMI, and the BSP runs the SMI handler / dispatcher. > >>>>>> > >>>>>> The smaller problem with (1a) and (2) is that the BSP and AP > >>>>>> synchronization is slow. For example, the "taskset -c 1 efibootmgr" > >>>>>> command from (2) can take more than 3 seconds to complete, because > >>>>>> efibootmgr accesses non-volatile UEFI variables intensively. > >>>>>> > >>>>>> The larger problem is that QEMU's current behavior diverges from the > >>>>>> behavior usually seen on physical hardware, and that keeps exposing > >>>>>> obscure corner cases, race conditions and other instabilities in edk2, > >>>>>> which generally expects / prefers a software SMI to affect all CPUs at > >>>>>> once. > >>>>>> > >>>>>> Therefore introduce the "broadcast SMI" feature that causes QEMU to inject > >>>>>> the SMI on all VCPUs. > >>>>>> > >>>>>> While the original posting of this patch > >>>>>> <http://lists.nongnu.org/archive/html/qemu-devel/2015-10/msg05658.html> > >>>>>> only intended to speed up (2), based on our recent "stress testing" of SMM > >>>>>> this patch actually provides functional improvements. > >>>>>> > >>>>>> Cc: "Michael S. Tsirkin" <mst@redhat.com> > >>>>>> Cc: Gerd Hoffmann <kraxel@redhat.com> > >>>>>> Cc: Igor Mammedov <imammedo@redhat.com> > >>>>>> Cc: Paolo Bonzini <pbonzini@redhat.com> > >>>>>> Signed-off-by: Laszlo Ersek <lersek@redhat.com> > >>>>>> Reviewed-by: Michael S. 
Tsirkin <mst@redhat.com> > >>>>>> --- > >>>>>> > >>>>>> Notes: > >>>>>> v6: > >>>>>> - no changes, pick up Michael's R-b > >>>>>> > >>>>>> v5: > >>>>>> - replace the ICH9_LPC_SMI_F_BROADCAST bit value with the > >>>>>> ICH9_LPC_SMI_F_BROADCAST_BIT bit position (necessary for > >>>>>> DEFINE_PROP_BIT() in the next patch) > >>>>>> > >>>>>> include/hw/i386/ich9.h | 3 +++ > >>>>>> hw/isa/lpc_ich9.c | 10 +++++++++- > >>>>>> 2 files changed, 12 insertions(+), 1 deletion(-) > >>>>>> > >>>>>> diff --git a/include/hw/i386/ich9.h b/include/hw/i386/ich9.h > >>>>>> index da1118727146..18dcca7ebcbf 100644 > >>>>>> --- a/include/hw/i386/ich9.h > >>>>>> +++ b/include/hw/i386/ich9.h > >>>>>> @@ -250,4 +250,7 @@ Object *ich9_lpc_find(void); > >>>>>> #define ICH9_SMB_HST_D1 0x06 > >>>>>> #define ICH9_SMB_HOST_BLOCK_DB 0x07 > >>>>>> > >>>>>> +/* bit positions used in fw_cfg SMI feature negotiation */ > >>>>>> +#define ICH9_LPC_SMI_F_BROADCAST_BIT 0 > >>>>>> + > >>>>>> #endif /* HW_ICH9_H */ > >>>>>> diff --git a/hw/isa/lpc_ich9.c b/hw/isa/lpc_ich9.c > >>>>>> index 376b7801a42c..ced6f803a4f2 100644 > >>>>>> --- a/hw/isa/lpc_ich9.c > >>>>>> +++ b/hw/isa/lpc_ich9.c > >>>>>> @@ -437,7 +437,15 @@ static void ich9_apm_ctrl_changed(uint32_t val, void *arg) > >>>>>> > >>>>>> /* SMI_EN = PMBASE + 30. SMI control and enable register */ > >>>>>> if (lpc->pm.smi_en & ICH9_PMIO_SMI_EN_APMC_EN) { > >>>>>> - cpu_interrupt(current_cpu, CPU_INTERRUPT_SMI); > >>>>>> + if (lpc->smi_negotiated_features & > >>>>>> + (UINT64_C(1) << ICH9_LPC_SMI_F_BROADCAST_BIT)) { > >>>>>> + CPUState *cs; > >>>>>> + CPU_FOREACH(cs) { > >>>>>> + cpu_interrupt(cs, CPU_INTERRUPT_SMI); > >>>>>> + } > >>>>> Shouldn't CPUs with default SMI base be excluded from broadcast? > >>>> > >>>> I don't think so; as far as I recall, in edk2 the initial SMM entry code > >>>> -- before SMBASE relocation -- accommodates all VCPUs entering the same > >>>> code area for execution. The VCPUs are told apart from each other by > >>>> Initial APIC ID (that is, "who am I"), which is used as an index or > >>>> search criterion into a global shared array. Once they find their > >>>> respective private data blocks, the VCPUs don't step on each other's toes. > >>> That's what I'm not sure. > >>> If broadcast SMI is sent before relocation all CPUs will use > >>> the same SMBASE and as result save their state in the same > >>> memory (even if it's state after RESET/POWER ON) racing/overwriting > >>> each other's saved state > >> > >> Good point! > >> > >>> and then on exit from SMI they all will restore > >>> whatever state that race produced so it seems plain wrong thing to do. > >>> > >>> Are you sure that edk2 does broadcast for SMI initialization or does it > >>> using directed SMIs? > >> > >> Hmmm, you are right. The SmmRelocateBases() function in > >> "UefiCpuPkg/PiSmmCpuDxeSmm/PiSmmCpuDxeSmm.c" first relocates SMBASE for > >> each individual AP in succession, by sending SMI IPIs to them, one by > >> one. Then it does the same for the BSP, by sending itself a similar SMI IPI. > >> > >> The above QEMU code is only exercised much later, after the SMBASE > >> relocation, with the SMM stack up and running. It is used when a > >> (preferably broadcast) SMI is triggered for firmware services (for > >> example, the UEFI variable services). > >> > >> So, you are right. > >> > >> Considering edk2 only, the difference shouldn't matter -- when this code > >> is invoked (via an IO port write), the SMBASEs will all have been relocated. 
> >> > >> Also, I've been informed that on real hardware, writing to APM_CNT > >> triggers an SMI on all CPUs, regardless of the individual SMBASE values. > >> So I believe the above code should be closest to real hardware, and the > >> pre-patch code was only written in unicast form for SeaBIOS's sake. > >> > >> I think the code is safe above. If the guest injects an SMI via APM_CNT > >> after negotiating SMI broadcast, it should have not left any VCPUs > >> without an SMI handler by that time. edk2 / OVMF conforms to this. In > >> the future we can tweak this further, if necessary; we have 63 more > >> bits, and we'll be able to reject invalid combinations of bits. > >> > >> Do you feel that we should skip VCPUs whose SMBASE has not been changed > >> from the default? > >> > >> If so, doesn't that run the risk of missing a VCPU that has an actual > >> SMI handler installed, and it just happens to be placed at the default > >> SMBASE location? > >> > >> ... I wouldn't mind skipping VCPUs with apparently unrelocated SMBASEs, > >> but I think (a) that's not what real hardware does, and (b) it is risky > >> if a VCPU's actual SMI handler has been installed while keeping the > >> default SMBASE value. > >> > >> What do you think? > > (a) probably doesn't consider hotplug, so it's totally fine for the case > > where firmware is the only one who wakes up/initializes CPUs. > > however consider 2 CPU hotplugged and then broadcast SMI happens > > to serve some other task (if there isn't unpark 'feature'). > > Even if real hw does it, it's behavior not defined by SDM with possibly > > bad consequences. > > > > (b) alone it's risky so I'd skip based on combination of > > > > if (default SMBASE & CPU is in reset state) > > skip; > > > > that should be safe and it leaves open possibility to avoid adding > > unpark 'feature' to CPU. > > The above attributes ("SMBASE" and "CPU is in reset state") are specific > to x86 VCPUs. Should I use some dynamic QOM casting for this? Like the > X86_CPU() macro? > > Can I assume here that the macro will never return NULL? (I think so, > LPC is only used with x86.) yep, that should work. > > ... I guess SMBASE can be accessed as > > X86_CPU(cs)->env.smbase preferred style: X86CPU *cpu = X86_CPU(cs); cpu->... > > How do I figure out if the CPU is in reset state ("waiting for first > INIT") though? Note that this state should be distinguished from simply > "halted" (i.e., after a HLT). After a HLT, the SMI should be injected > and delivered; see commit a9bad65d2c1f. Edk2 regularly uses HLT to send > APs to sleep. how about using EIP reset value? x86_cpu_reset() ... env->eip = 0xfff0; > > Thanks > Laszlo > > > > >> > >> Thanks! > >> Laszlo > >> > >>> > >>>> > >>>> Thanks > >>>> Laszlo > >>>> > >>>>> > >>>>>> + } else { > >>>>>> + cpu_interrupt(current_cpu, CPU_INTERRUPT_SMI); > >>>>>> + } > >>>>>> } > >>>>>> } > >>>>>> > >>>>> > >>>> > >>> > >> > > >
On 01/17/17 15:20, Igor Mammedov wrote: > On Tue, 17 Jan 2017 14:46:05 +0100 > Laszlo Ersek <lersek@redhat.com> wrote: > >> On 01/17/17 14:20, Igor Mammedov wrote: >>> On Tue, 17 Jan 2017 13:06:13 +0100 >>> Laszlo Ersek <lersek@redhat.com> wrote: >>> >>>> On 01/17/17 12:21, Igor Mammedov wrote: >>>>> On Mon, 16 Jan 2017 21:46:55 +0100 >>>>> Laszlo Ersek <lersek@redhat.com> wrote: >>>>> >>>>>> On 01/13/17 14:09, Igor Mammedov wrote: >>>>>>> On Thu, 12 Jan 2017 19:24:45 +0100 >>>>>>> Laszlo Ersek <lersek@redhat.com> wrote: >>>>>>> >>>>>>>> The generic edk2 SMM infrastructure prefers >>>>>>>> EFI_SMM_CONTROL2_PROTOCOL.Trigger() to inject an SMI on each processor. If >>>>>>>> Trigger() only brings the current processor into SMM, then edk2 handles it >>>>>>>> in the following ways: >>>>>>>> >>>>>>>> (1) If Trigger() is executed by the BSP (which is guaranteed before >>>>>>>> ExitBootServices(), but is not necessarily true at runtime), then: >>>>>>>> >>>>>>>> (a) If edk2 has been configured for "traditional" SMM synchronization, >>>>>>>> then the BSP sends directed SMIs to the APs with APIC delivery, >>>>>>>> bringing them into SMM individually. Then the BSP runs the SMI >>>>>>>> handler / dispatcher. >>>>>>>> >>>>>>>> (b) If edk2 has been configured for "relaxed" SMM synchronization, >>>>>>>> then the APs that are not already in SMM are not brought in, and >>>>>>>> the BSP runs the SMI handler / dispatcher. >>>>>>>> >>>>>>>> (2) If Trigger() is executed by an AP (which is possible after >>>>>>>> ExitBootServices(), and can be forced e.g. by "taskset -c 1 >>>>>>>> efibootmgr"), then the AP in question brings in the BSP with a >>>>>>>> directed SMI, and the BSP runs the SMI handler / dispatcher. >>>>>>>> >>>>>>>> The smaller problem with (1a) and (2) is that the BSP and AP >>>>>>>> synchronization is slow. For example, the "taskset -c 1 efibootmgr" >>>>>>>> command from (2) can take more than 3 seconds to complete, because >>>>>>>> efibootmgr accesses non-volatile UEFI variables intensively. >>>>>>>> >>>>>>>> The larger problem is that QEMU's current behavior diverges from the >>>>>>>> behavior usually seen on physical hardware, and that keeps exposing >>>>>>>> obscure corner cases, race conditions and other instabilities in edk2, >>>>>>>> which generally expects / prefers a software SMI to affect all CPUs at >>>>>>>> once. >>>>>>>> >>>>>>>> Therefore introduce the "broadcast SMI" feature that causes QEMU to inject >>>>>>>> the SMI on all VCPUs. >>>>>>>> >>>>>>>> While the original posting of this patch >>>>>>>> <http://lists.nongnu.org/archive/html/qemu-devel/2015-10/msg05658.html> >>>>>>>> only intended to speed up (2), based on our recent "stress testing" of SMM >>>>>>>> this patch actually provides functional improvements. >>>>>>>> >>>>>>>> Cc: "Michael S. Tsirkin" <mst@redhat.com> >>>>>>>> Cc: Gerd Hoffmann <kraxel@redhat.com> >>>>>>>> Cc: Igor Mammedov <imammedo@redhat.com> >>>>>>>> Cc: Paolo Bonzini <pbonzini@redhat.com> >>>>>>>> Signed-off-by: Laszlo Ersek <lersek@redhat.com> >>>>>>>> Reviewed-by: Michael S. 
Tsirkin <mst@redhat.com> >>>>>>>> --- >>>>>>>> >>>>>>>> Notes: >>>>>>>> v6: >>>>>>>> - no changes, pick up Michael's R-b >>>>>>>> >>>>>>>> v5: >>>>>>>> - replace the ICH9_LPC_SMI_F_BROADCAST bit value with the >>>>>>>> ICH9_LPC_SMI_F_BROADCAST_BIT bit position (necessary for >>>>>>>> DEFINE_PROP_BIT() in the next patch) >>>>>>>> >>>>>>>> include/hw/i386/ich9.h | 3 +++ >>>>>>>> hw/isa/lpc_ich9.c | 10 +++++++++- >>>>>>>> 2 files changed, 12 insertions(+), 1 deletion(-) >>>>>>>> >>>>>>>> diff --git a/include/hw/i386/ich9.h b/include/hw/i386/ich9.h >>>>>>>> index da1118727146..18dcca7ebcbf 100644 >>>>>>>> --- a/include/hw/i386/ich9.h >>>>>>>> +++ b/include/hw/i386/ich9.h >>>>>>>> @@ -250,4 +250,7 @@ Object *ich9_lpc_find(void); >>>>>>>> #define ICH9_SMB_HST_D1 0x06 >>>>>>>> #define ICH9_SMB_HOST_BLOCK_DB 0x07 >>>>>>>> >>>>>>>> +/* bit positions used in fw_cfg SMI feature negotiation */ >>>>>>>> +#define ICH9_LPC_SMI_F_BROADCAST_BIT 0 >>>>>>>> + >>>>>>>> #endif /* HW_ICH9_H */ >>>>>>>> diff --git a/hw/isa/lpc_ich9.c b/hw/isa/lpc_ich9.c >>>>>>>> index 376b7801a42c..ced6f803a4f2 100644 >>>>>>>> --- a/hw/isa/lpc_ich9.c >>>>>>>> +++ b/hw/isa/lpc_ich9.c >>>>>>>> @@ -437,7 +437,15 @@ static void ich9_apm_ctrl_changed(uint32_t val, void *arg) >>>>>>>> >>>>>>>> /* SMI_EN = PMBASE + 30. SMI control and enable register */ >>>>>>>> if (lpc->pm.smi_en & ICH9_PMIO_SMI_EN_APMC_EN) { >>>>>>>> - cpu_interrupt(current_cpu, CPU_INTERRUPT_SMI); >>>>>>>> + if (lpc->smi_negotiated_features & >>>>>>>> + (UINT64_C(1) << ICH9_LPC_SMI_F_BROADCAST_BIT)) { >>>>>>>> + CPUState *cs; >>>>>>>> + CPU_FOREACH(cs) { >>>>>>>> + cpu_interrupt(cs, CPU_INTERRUPT_SMI); >>>>>>>> + } >>>>>>> Shouldn't CPUs with default SMI base be excluded from broadcast? >>>>>> >>>>>> I don't think so; as far as I recall, in edk2 the initial SMM entry code >>>>>> -- before SMBASE relocation -- accommodates all VCPUs entering the same >>>>>> code area for execution. The VCPUs are told apart from each other by >>>>>> Initial APIC ID (that is, "who am I"), which is used as an index or >>>>>> search criterion into a global shared array. Once they find their >>>>>> respective private data blocks, the VCPUs don't step on each other's toes. >>>>> That's what I'm not sure. >>>>> If broadcast SMI is sent before relocation all CPUs will use >>>>> the same SMBASE and as result save their state in the same >>>>> memory (even if it's state after RESET/POWER ON) racing/overwriting >>>>> each other's saved state >>>> >>>> Good point! >>>> >>>>> and then on exit from SMI they all will restore >>>>> whatever state that race produced so it seems plain wrong thing to do. >>>>> >>>>> Are you sure that edk2 does broadcast for SMI initialization or does it >>>>> using directed SMIs? >>>> >>>> Hmmm, you are right. The SmmRelocateBases() function in >>>> "UefiCpuPkg/PiSmmCpuDxeSmm/PiSmmCpuDxeSmm.c" first relocates SMBASE for >>>> each individual AP in succession, by sending SMI IPIs to them, one by >>>> one. Then it does the same for the BSP, by sending itself a similar SMI IPI. >>>> >>>> The above QEMU code is only exercised much later, after the SMBASE >>>> relocation, with the SMM stack up and running. It is used when a >>>> (preferably broadcast) SMI is triggered for firmware services (for >>>> example, the UEFI variable services). >>>> >>>> So, you are right. >>>> >>>> Considering edk2 only, the difference shouldn't matter -- when this code >>>> is invoked (via an IO port write), the SMBASEs will all have been relocated. 
>>>> >>>> Also, I've been informed that on real hardware, writing to APM_CNT >>>> triggers an SMI on all CPUs, regardless of the individual SMBASE values. >>>> So I believe the above code should be closest to real hardware, and the >>>> pre-patch code was only written in unicast form for SeaBIOS's sake. >>>> >>>> I think the code is safe above. If the guest injects an SMI via APM_CNT >>>> after negotiating SMI broadcast, it should have not left any VCPUs >>>> without an SMI handler by that time. edk2 / OVMF conforms to this. In >>>> the future we can tweak this further, if necessary; we have 63 more >>>> bits, and we'll be able to reject invalid combinations of bits. >>>> >>>> Do you feel that we should skip VCPUs whose SMBASE has not been changed >>>> from the default? >>>> >>>> If so, doesn't that run the risk of missing a VCPU that has an actual >>>> SMI handler installed, and it just happens to be placed at the default >>>> SMBASE location? >>>> >>>> ... I wouldn't mind skipping VCPUs with apparently unrelocated SMBASEs, >>>> but I think (a) that's not what real hardware does, and (b) it is risky >>>> if a VCPU's actual SMI handler has been installed while keeping the >>>> default SMBASE value. >>>> >>>> What do you think? >>> (a) probably doesn't consider hotplug, so it's totally fine for the case >>> where firmware is the only one who wakes up/initializes CPUs. >>> however consider 2 CPU hotplugged and then broadcast SMI happens >>> to serve some other task (if there isn't unpark 'feature'). >>> Even if real hw does it, it's behavior not defined by SDM with possibly >>> bad consequences. >>> >>> (b) alone it's risky so I'd skip based on combination of >>> >>> if (default SMBASE & CPU is in reset state) >>> skip; >>> >>> that should be safe and it leaves open possibility to avoid adding >>> unpark 'feature' to CPU. >> >> The above attributes ("SMBASE" and "CPU is in reset state") are specific >> to x86 VCPUs. Should I use some dynamic QOM casting for this? Like the >> X86_CPU() macro? >> >> Can I assume here that the macro will never return NULL? (I think so, >> LPC is only used with x86.) > yep, that should work. > >> >> ... I guess SMBASE can be accessed as >> >> X86_CPU(cs)->env.smbase > preferred style: > X86CPU *cpu = X86_CPU(cs); > cpu->... > >> >> How do I figure out if the CPU is in reset state ("waiting for first >> INIT") though? Note that this state should be distinguished from simply >> "halted" (i.e., after a HLT). After a HLT, the SMI should be injected >> and delivered; see commit a9bad65d2c1f. Edk2 regularly uses HLT to send >> APs to sleep. > > how about using EIP reset value? > x86_cpu_reset() > ... > env->eip = 0xfff0; Okay, so I tried this. It doesn't work. This is the code I wrote (extracting a new ich9_apm_broadcast_smi() function from ich9_apm_ctrl_changed(), due to size reasons): > static void ich9_apm_broadcast_smi(void) > { > CPUState *cs; > > cpu_synchronize_all_states(); /* <--------- mark this */ > CPU_FOREACH(cs) { > X86CPU *cpu = X86_CPU(cs); > CPUX86State *env = &cpu->env; > > if (env->smbase == 0x30000 && env->eip == 0xfff0) { > CPUClass *k = CPU_GET_CLASS(cs); > uint64_t cpu_arch_id = k->get_arch_id(cs); > > trace_ich9_apm_broadcast_smi_skip(cpu_arch_id); > continue; > } > > cpu_interrupt(cs, CPU_INTERRUPT_SMI); > } > } There are two cases: (a) If I do *not* add the cpu_synchronize_all_states() call before the loop, then the filter condition matches *despite* the APM_CNT write being performed by OVMF *after* SMBASE relocation. 
I see the trace messages in the QEMU log (this is a four VCPU Ia32 guest): 28863@1484677204.715171:ich9_apm_broadcast_smi_skip cpu_arch_id=0x0 28863@1484677204.715182:ich9_apm_broadcast_smi_skip cpu_arch_id=0x1 28863@1484677204.715184:ich9_apm_broadcast_smi_skip cpu_arch_id=0x2 28863@1484677204.715185:ich9_apm_broadcast_smi_skip cpu_arch_id=0x3 28863@1484677204.724084:ich9_apm_broadcast_smi_skip cpu_arch_id=0x0 28863@1484677204.724088:ich9_apm_broadcast_smi_skip cpu_arch_id=0x1 28863@1484677204.724090:ich9_apm_broadcast_smi_skip cpu_arch_id=0x2 28863@1484677204.724091:ich9_apm_broadcast_smi_skip cpu_arch_id=0x3 That is, even though SMBASE relocation succeeds first, and APM_CNT is only written afterwards -- I proved this from the OVMF debug log --, the code above does not see an up-to-date CPU state for the KVM VCPUs, and it filters out the SMI for *all* VCPUs. Needless to say, this breaks the firmware entirely; it cannot boot. (b) If I add the cpu_synchronize_all_states() call before the loop, then the incorrect filter matches go away; QEMU sees the KVM VCPU states correctly, and the SMI is broad-cast. However, in this case, the boot slows to a *crawl*. It's unbearable. All VCPUs are pegged at 100%, and you can easily follow the OVMF debug log with the naked eye, almost line by line. I didn't expect that cpu_synchronize_all_states() would be so costly, but it makes the idea of checking VCPU state in the APM_CNT handler a non-starter. Can we stick with the current "v6 wave 2" in light of this? Thanks, Laszlo
On Tue, 17 Jan 2017 19:53:21 +0100 Laszlo Ersek <lersek@redhat.com> wrote: > On 01/17/17 15:20, Igor Mammedov wrote: > > On Tue, 17 Jan 2017 14:46:05 +0100 > > Laszlo Ersek <lersek@redhat.com> wrote: > > > >> On 01/17/17 14:20, Igor Mammedov wrote: > >>> On Tue, 17 Jan 2017 13:06:13 +0100 > >>> Laszlo Ersek <lersek@redhat.com> wrote: > >>> > >>>> On 01/17/17 12:21, Igor Mammedov wrote: > >>>>> On Mon, 16 Jan 2017 21:46:55 +0100 > >>>>> Laszlo Ersek <lersek@redhat.com> wrote: > >>>>> > >>>>>> On 01/13/17 14:09, Igor Mammedov wrote: > >>>>>>> On Thu, 12 Jan 2017 19:24:45 +0100 > >>>>>>> Laszlo Ersek <lersek@redhat.com> wrote: > >>>>>>> > >>>>>>>> The generic edk2 SMM infrastructure prefers > >>>>>>>> EFI_SMM_CONTROL2_PROTOCOL.Trigger() to inject an SMI on each processor. If > >>>>>>>> Trigger() only brings the current processor into SMM, then edk2 handles it > >>>>>>>> in the following ways: > >>>>>>>> > >>>>>>>> (1) If Trigger() is executed by the BSP (which is guaranteed before > >>>>>>>> ExitBootServices(), but is not necessarily true at runtime), then: > >>>>>>>> > >>>>>>>> (a) If edk2 has been configured for "traditional" SMM synchronization, > >>>>>>>> then the BSP sends directed SMIs to the APs with APIC delivery, > >>>>>>>> bringing them into SMM individually. Then the BSP runs the SMI > >>>>>>>> handler / dispatcher. > >>>>>>>> > >>>>>>>> (b) If edk2 has been configured for "relaxed" SMM synchronization, > >>>>>>>> then the APs that are not already in SMM are not brought in, and > >>>>>>>> the BSP runs the SMI handler / dispatcher. > >>>>>>>> > >>>>>>>> (2) If Trigger() is executed by an AP (which is possible after > >>>>>>>> ExitBootServices(), and can be forced e.g. by "taskset -c 1 > >>>>>>>> efibootmgr"), then the AP in question brings in the BSP with a > >>>>>>>> directed SMI, and the BSP runs the SMI handler / dispatcher. > >>>>>>>> > >>>>>>>> The smaller problem with (1a) and (2) is that the BSP and AP > >>>>>>>> synchronization is slow. For example, the "taskset -c 1 efibootmgr" > >>>>>>>> command from (2) can take more than 3 seconds to complete, because > >>>>>>>> efibootmgr accesses non-volatile UEFI variables intensively. > >>>>>>>> > >>>>>>>> The larger problem is that QEMU's current behavior diverges from the > >>>>>>>> behavior usually seen on physical hardware, and that keeps exposing > >>>>>>>> obscure corner cases, race conditions and other instabilities in edk2, > >>>>>>>> which generally expects / prefers a software SMI to affect all CPUs at > >>>>>>>> once. > >>>>>>>> > >>>>>>>> Therefore introduce the "broadcast SMI" feature that causes QEMU to inject > >>>>>>>> the SMI on all VCPUs. > >>>>>>>> > >>>>>>>> While the original posting of this patch > >>>>>>>> <http://lists.nongnu.org/archive/html/qemu-devel/2015-10/msg05658.html> > >>>>>>>> only intended to speed up (2), based on our recent "stress testing" of SMM > >>>>>>>> this patch actually provides functional improvements. > >>>>>>>> > >>>>>>>> Cc: "Michael S. Tsirkin" <mst@redhat.com> > >>>>>>>> Cc: Gerd Hoffmann <kraxel@redhat.com> > >>>>>>>> Cc: Igor Mammedov <imammedo@redhat.com> > >>>>>>>> Cc: Paolo Bonzini <pbonzini@redhat.com> > >>>>>>>> Signed-off-by: Laszlo Ersek <lersek@redhat.com> > >>>>>>>> Reviewed-by: Michael S. 
Tsirkin <mst@redhat.com> > >>>>>>>> --- > >>>>>>>> > >>>>>>>> Notes: > >>>>>>>> v6: > >>>>>>>> - no changes, pick up Michael's R-b > >>>>>>>> > >>>>>>>> v5: > >>>>>>>> - replace the ICH9_LPC_SMI_F_BROADCAST bit value with the > >>>>>>>> ICH9_LPC_SMI_F_BROADCAST_BIT bit position (necessary for > >>>>>>>> DEFINE_PROP_BIT() in the next patch) > >>>>>>>> > >>>>>>>> include/hw/i386/ich9.h | 3 +++ > >>>>>>>> hw/isa/lpc_ich9.c | 10 +++++++++- > >>>>>>>> 2 files changed, 12 insertions(+), 1 deletion(-) > >>>>>>>> > >>>>>>>> diff --git a/include/hw/i386/ich9.h b/include/hw/i386/ich9.h > >>>>>>>> index da1118727146..18dcca7ebcbf 100644 > >>>>>>>> --- a/include/hw/i386/ich9.h > >>>>>>>> +++ b/include/hw/i386/ich9.h > >>>>>>>> @@ -250,4 +250,7 @@ Object *ich9_lpc_find(void); > >>>>>>>> #define ICH9_SMB_HST_D1 0x06 > >>>>>>>> #define ICH9_SMB_HOST_BLOCK_DB 0x07 > >>>>>>>> > >>>>>>>> +/* bit positions used in fw_cfg SMI feature negotiation */ > >>>>>>>> +#define ICH9_LPC_SMI_F_BROADCAST_BIT 0 > >>>>>>>> + > >>>>>>>> #endif /* HW_ICH9_H */ > >>>>>>>> diff --git a/hw/isa/lpc_ich9.c b/hw/isa/lpc_ich9.c > >>>>>>>> index 376b7801a42c..ced6f803a4f2 100644 > >>>>>>>> --- a/hw/isa/lpc_ich9.c > >>>>>>>> +++ b/hw/isa/lpc_ich9.c > >>>>>>>> @@ -437,7 +437,15 @@ static void ich9_apm_ctrl_changed(uint32_t val, void *arg) > >>>>>>>> > >>>>>>>> /* SMI_EN = PMBASE + 30. SMI control and enable register */ > >>>>>>>> if (lpc->pm.smi_en & ICH9_PMIO_SMI_EN_APMC_EN) { > >>>>>>>> - cpu_interrupt(current_cpu, CPU_INTERRUPT_SMI); > >>>>>>>> + if (lpc->smi_negotiated_features & > >>>>>>>> + (UINT64_C(1) << ICH9_LPC_SMI_F_BROADCAST_BIT)) { > >>>>>>>> + CPUState *cs; > >>>>>>>> + CPU_FOREACH(cs) { > >>>>>>>> + cpu_interrupt(cs, CPU_INTERRUPT_SMI); > >>>>>>>> + } > >>>>>>> Shouldn't CPUs with default SMI base be excluded from broadcast? > >>>>>> > >>>>>> I don't think so; as far as I recall, in edk2 the initial SMM entry code > >>>>>> -- before SMBASE relocation -- accommodates all VCPUs entering the same > >>>>>> code area for execution. The VCPUs are told apart from each other by > >>>>>> Initial APIC ID (that is, "who am I"), which is used as an index or > >>>>>> search criterion into a global shared array. Once they find their > >>>>>> respective private data blocks, the VCPUs don't step on each other's toes. > >>>>> That's what I'm not sure. > >>>>> If broadcast SMI is sent before relocation all CPUs will use > >>>>> the same SMBASE and as result save their state in the same > >>>>> memory (even if it's state after RESET/POWER ON) racing/overwriting > >>>>> each other's saved state > >>>> > >>>> Good point! > >>>> > >>>>> and then on exit from SMI they all will restore > >>>>> whatever state that race produced so it seems plain wrong thing to do. > >>>>> > >>>>> Are you sure that edk2 does broadcast for SMI initialization or does it > >>>>> using directed SMIs? > >>>> > >>>> Hmmm, you are right. The SmmRelocateBases() function in > >>>> "UefiCpuPkg/PiSmmCpuDxeSmm/PiSmmCpuDxeSmm.c" first relocates SMBASE for > >>>> each individual AP in succession, by sending SMI IPIs to them, one by > >>>> one. Then it does the same for the BSP, by sending itself a similar SMI IPI. > >>>> > >>>> The above QEMU code is only exercised much later, after the SMBASE > >>>> relocation, with the SMM stack up and running. It is used when a > >>>> (preferably broadcast) SMI is triggered for firmware services (for > >>>> example, the UEFI variable services). > >>>> > >>>> So, you are right. 
> >>>> > >>>> Considering edk2 only, the difference shouldn't matter -- when this code > >>>> is invoked (via an IO port write), the SMBASEs will all have been relocated. > >>>> > >>>> Also, I've been informed that on real hardware, writing to APM_CNT > >>>> triggers an SMI on all CPUs, regardless of the individual SMBASE values. > >>>> So I believe the above code should be closest to real hardware, and the > >>>> pre-patch code was only written in unicast form for SeaBIOS's sake. > >>>> > >>>> I think the code is safe above. If the guest injects an SMI via APM_CNT > >>>> after negotiating SMI broadcast, it should have not left any VCPUs > >>>> without an SMI handler by that time. edk2 / OVMF conforms to this. In > >>>> the future we can tweak this further, if necessary; we have 63 more > >>>> bits, and we'll be able to reject invalid combinations of bits. > >>>> > >>>> Do you feel that we should skip VCPUs whose SMBASE has not been changed > >>>> from the default? > >>>> > >>>> If so, doesn't that run the risk of missing a VCPU that has an actual > >>>> SMI handler installed, and it just happens to be placed at the default > >>>> SMBASE location? > >>>> > >>>> ... I wouldn't mind skipping VCPUs with apparently unrelocated SMBASEs, > >>>> but I think (a) that's not what real hardware does, and (b) it is risky > >>>> if a VCPU's actual SMI handler has been installed while keeping the > >>>> default SMBASE value. > >>>> > >>>> What do you think? > >>> (a) probably doesn't consider hotplug, so it's totally fine for the case > >>> where firmware is the only one who wakes up/initializes CPUs. > >>> however consider 2 CPU hotplugged and then broadcast SMI happens > >>> to serve some other task (if there isn't unpark 'feature'). > >>> Even if real hw does it, it's behavior not defined by SDM with possibly > >>> bad consequences. > >>> > >>> (b) alone it's risky so I'd skip based on combination of > >>> > >>> if (default SMBASE & CPU is in reset state) > >>> skip; > >>> > >>> that should be safe and it leaves open possibility to avoid adding > >>> unpark 'feature' to CPU. > >> > >> The above attributes ("SMBASE" and "CPU is in reset state") are specific > >> to x86 VCPUs. Should I use some dynamic QOM casting for this? Like the > >> X86_CPU() macro? > >> > >> Can I assume here that the macro will never return NULL? (I think so, > >> LPC is only used with x86.) > > yep, that should work. > > > >> > >> ... I guess SMBASE can be accessed as > >> > >> X86_CPU(cs)->env.smbase > > preferred style: > > X86CPU *cpu = X86_CPU(cs); > > cpu->... > > > >> > >> How do I figure out if the CPU is in reset state ("waiting for first > >> INIT") though? Note that this state should be distinguished from simply > >> "halted" (i.e., after a HLT). After a HLT, the SMI should be injected > >> and delivered; see commit a9bad65d2c1f. Edk2 regularly uses HLT to send > >> APs to sleep. > > > > how about using EIP reset value? > > x86_cpu_reset() > > ... > > env->eip = 0xfff0; > > Okay, so I tried this. It doesn't work. > > This is the code I wrote (extracting a new ich9_apm_broadcast_smi() > function from ich9_apm_ctrl_changed(), due to size reasons): > > > static void ich9_apm_broadcast_smi(void) > > { > > CPUState *cs; > > > > cpu_synchronize_all_states(); /* <--------- mark this */ it probably should be executed after pause_all_vcpus(), maybe it will help. 
> > CPU_FOREACH(cs) { > > X86CPU *cpu = X86_CPU(cs); > > CPUX86State *env = &cpu->env; > > > > if (env->smbase == 0x30000 && env->eip == 0xfff0) { > > CPUClass *k = CPU_GET_CLASS(cs); > > uint64_t cpu_arch_id = k->get_arch_id(cs); > > > > trace_ich9_apm_broadcast_smi_skip(cpu_arch_id); > > continue; > > } > > > > cpu_interrupt(cs, CPU_INTERRUPT_SMI); > > } > > } > [...] > (b) If I add the cpu_synchronize_all_states() call before the loop, then > the incorrect filter matches go away; QEMU sees the KVM VCPU states > correctly, and the SMI is broad-cast. > > However, in this case, the boot slows to a *crawl*. It's unbearable. All > VCPUs are pegged at 100%, and you can easily follow the OVMF debug log > with the naked eye, almost line by line. > I didn't expect that cpu_synchronize_all_states() would be so costly, > but it makes the idea of checking VCPU state in the APM_CNT handler a > non-starter. I wonder where the bottleneck is that slows the guest down so much. What is the actual cost of calling cpu_synchronize_all_states()? Maybe calling pause_all_vcpus() before cpu_synchronize_all_states() would help. > Can we stick with the current "v6 wave 2" in light of this? > > Thanks, > Laszlo
On 01/18/17 11:03, Igor Mammedov wrote: > On Tue, 17 Jan 2017 19:53:21 +0100 > Laszlo Ersek <lersek@redhat.com> wrote: > >> On 01/17/17 15:20, Igor Mammedov wrote: >>> On Tue, 17 Jan 2017 14:46:05 +0100 >>> Laszlo Ersek <lersek@redhat.com> wrote: >>> >>>> On 01/17/17 14:20, Igor Mammedov wrote: >>>>> On Tue, 17 Jan 2017 13:06:13 +0100 >>>>> Laszlo Ersek <lersek@redhat.com> wrote: >>>>> >>>>>> On 01/17/17 12:21, Igor Mammedov wrote: >>>>>>> On Mon, 16 Jan 2017 21:46:55 +0100 >>>>>>> Laszlo Ersek <lersek@redhat.com> wrote: >>>>>>> >>>>>>>> On 01/13/17 14:09, Igor Mammedov wrote: >>>>>>>>> On Thu, 12 Jan 2017 19:24:45 +0100 >>>>>>>>> Laszlo Ersek <lersek@redhat.com> wrote: >>>>>>>>> >>>>>>>>>> The generic edk2 SMM infrastructure prefers >>>>>>>>>> EFI_SMM_CONTROL2_PROTOCOL.Trigger() to inject an SMI on each processor. If >>>>>>>>>> Trigger() only brings the current processor into SMM, then edk2 handles it >>>>>>>>>> in the following ways: >>>>>>>>>> >>>>>>>>>> (1) If Trigger() is executed by the BSP (which is guaranteed before >>>>>>>>>> ExitBootServices(), but is not necessarily true at runtime), then: >>>>>>>>>> >>>>>>>>>> (a) If edk2 has been configured for "traditional" SMM synchronization, >>>>>>>>>> then the BSP sends directed SMIs to the APs with APIC delivery, >>>>>>>>>> bringing them into SMM individually. Then the BSP runs the SMI >>>>>>>>>> handler / dispatcher. >>>>>>>>>> >>>>>>>>>> (b) If edk2 has been configured for "relaxed" SMM synchronization, >>>>>>>>>> then the APs that are not already in SMM are not brought in, and >>>>>>>>>> the BSP runs the SMI handler / dispatcher. >>>>>>>>>> >>>>>>>>>> (2) If Trigger() is executed by an AP (which is possible after >>>>>>>>>> ExitBootServices(), and can be forced e.g. by "taskset -c 1 >>>>>>>>>> efibootmgr"), then the AP in question brings in the BSP with a >>>>>>>>>> directed SMI, and the BSP runs the SMI handler / dispatcher. >>>>>>>>>> >>>>>>>>>> The smaller problem with (1a) and (2) is that the BSP and AP >>>>>>>>>> synchronization is slow. For example, the "taskset -c 1 efibootmgr" >>>>>>>>>> command from (2) can take more than 3 seconds to complete, because >>>>>>>>>> efibootmgr accesses non-volatile UEFI variables intensively. >>>>>>>>>> >>>>>>>>>> The larger problem is that QEMU's current behavior diverges from the >>>>>>>>>> behavior usually seen on physical hardware, and that keeps exposing >>>>>>>>>> obscure corner cases, race conditions and other instabilities in edk2, >>>>>>>>>> which generally expects / prefers a software SMI to affect all CPUs at >>>>>>>>>> once. >>>>>>>>>> >>>>>>>>>> Therefore introduce the "broadcast SMI" feature that causes QEMU to inject >>>>>>>>>> the SMI on all VCPUs. >>>>>>>>>> >>>>>>>>>> While the original posting of this patch >>>>>>>>>> <http://lists.nongnu.org/archive/html/qemu-devel/2015-10/msg05658.html> >>>>>>>>>> only intended to speed up (2), based on our recent "stress testing" of SMM >>>>>>>>>> this patch actually provides functional improvements. >>>>>>>>>> >>>>>>>>>> Cc: "Michael S. Tsirkin" <mst@redhat.com> >>>>>>>>>> Cc: Gerd Hoffmann <kraxel@redhat.com> >>>>>>>>>> Cc: Igor Mammedov <imammedo@redhat.com> >>>>>>>>>> Cc: Paolo Bonzini <pbonzini@redhat.com> >>>>>>>>>> Signed-off-by: Laszlo Ersek <lersek@redhat.com> >>>>>>>>>> Reviewed-by: Michael S. 
Tsirkin <mst@redhat.com> >>>>>>>>>> --- >>>>>>>>>> >>>>>>>>>> Notes: >>>>>>>>>> v6: >>>>>>>>>> - no changes, pick up Michael's R-b >>>>>>>>>> >>>>>>>>>> v5: >>>>>>>>>> - replace the ICH9_LPC_SMI_F_BROADCAST bit value with the >>>>>>>>>> ICH9_LPC_SMI_F_BROADCAST_BIT bit position (necessary for >>>>>>>>>> DEFINE_PROP_BIT() in the next patch) >>>>>>>>>> >>>>>>>>>> include/hw/i386/ich9.h | 3 +++ >>>>>>>>>> hw/isa/lpc_ich9.c | 10 +++++++++- >>>>>>>>>> 2 files changed, 12 insertions(+), 1 deletion(-) >>>>>>>>>> >>>>>>>>>> diff --git a/include/hw/i386/ich9.h b/include/hw/i386/ich9.h >>>>>>>>>> index da1118727146..18dcca7ebcbf 100644 >>>>>>>>>> --- a/include/hw/i386/ich9.h >>>>>>>>>> +++ b/include/hw/i386/ich9.h >>>>>>>>>> @@ -250,4 +250,7 @@ Object *ich9_lpc_find(void); >>>>>>>>>> #define ICH9_SMB_HST_D1 0x06 >>>>>>>>>> #define ICH9_SMB_HOST_BLOCK_DB 0x07 >>>>>>>>>> >>>>>>>>>> +/* bit positions used in fw_cfg SMI feature negotiation */ >>>>>>>>>> +#define ICH9_LPC_SMI_F_BROADCAST_BIT 0 >>>>>>>>>> + >>>>>>>>>> #endif /* HW_ICH9_H */ >>>>>>>>>> diff --git a/hw/isa/lpc_ich9.c b/hw/isa/lpc_ich9.c >>>>>>>>>> index 376b7801a42c..ced6f803a4f2 100644 >>>>>>>>>> --- a/hw/isa/lpc_ich9.c >>>>>>>>>> +++ b/hw/isa/lpc_ich9.c >>>>>>>>>> @@ -437,7 +437,15 @@ static void ich9_apm_ctrl_changed(uint32_t val, void *arg) >>>>>>>>>> >>>>>>>>>> /* SMI_EN = PMBASE + 30. SMI control and enable register */ >>>>>>>>>> if (lpc->pm.smi_en & ICH9_PMIO_SMI_EN_APMC_EN) { >>>>>>>>>> - cpu_interrupt(current_cpu, CPU_INTERRUPT_SMI); >>>>>>>>>> + if (lpc->smi_negotiated_features & >>>>>>>>>> + (UINT64_C(1) << ICH9_LPC_SMI_F_BROADCAST_BIT)) { >>>>>>>>>> + CPUState *cs; >>>>>>>>>> + CPU_FOREACH(cs) { >>>>>>>>>> + cpu_interrupt(cs, CPU_INTERRUPT_SMI); >>>>>>>>>> + } >>>>>>>>> Shouldn't CPUs with default SMI base be excluded from broadcast? >>>>>>>> >>>>>>>> I don't think so; as far as I recall, in edk2 the initial SMM entry code >>>>>>>> -- before SMBASE relocation -- accommodates all VCPUs entering the same >>>>>>>> code area for execution. The VCPUs are told apart from each other by >>>>>>>> Initial APIC ID (that is, "who am I"), which is used as an index or >>>>>>>> search criterion into a global shared array. Once they find their >>>>>>>> respective private data blocks, the VCPUs don't step on each other's toes. >>>>>>> That's what I'm not sure. >>>>>>> If broadcast SMI is sent before relocation all CPUs will use >>>>>>> the same SMBASE and as result save their state in the same >>>>>>> memory (even if it's state after RESET/POWER ON) racing/overwriting >>>>>>> each other's saved state >>>>>> >>>>>> Good point! >>>>>> >>>>>>> and then on exit from SMI they all will restore >>>>>>> whatever state that race produced so it seems plain wrong thing to do. >>>>>>> >>>>>>> Are you sure that edk2 does broadcast for SMI initialization or does it >>>>>>> using directed SMIs? >>>>>> >>>>>> Hmmm, you are right. The SmmRelocateBases() function in >>>>>> "UefiCpuPkg/PiSmmCpuDxeSmm/PiSmmCpuDxeSmm.c" first relocates SMBASE for >>>>>> each individual AP in succession, by sending SMI IPIs to them, one by >>>>>> one. Then it does the same for the BSP, by sending itself a similar SMI IPI. >>>>>> >>>>>> The above QEMU code is only exercised much later, after the SMBASE >>>>>> relocation, with the SMM stack up and running. It is used when a >>>>>> (preferably broadcast) SMI is triggered for firmware services (for >>>>>> example, the UEFI variable services). >>>>>> >>>>>> So, you are right. 
>>>>>> >>>>>> Considering edk2 only, the difference shouldn't matter -- when this code >>>>>> is invoked (via an IO port write), the SMBASEs will all have been relocated. >>>>>> >>>>>> Also, I've been informed that on real hardware, writing to APM_CNT >>>>>> triggers an SMI on all CPUs, regardless of the individual SMBASE values. >>>>>> So I believe the above code should be closest to real hardware, and the >>>>>> pre-patch code was only written in unicast form for SeaBIOS's sake. >>>>>> >>>>>> I think the code is safe above. If the guest injects an SMI via APM_CNT >>>>>> after negotiating SMI broadcast, it should have not left any VCPUs >>>>>> without an SMI handler by that time. edk2 / OVMF conforms to this. In >>>>>> the future we can tweak this further, if necessary; we have 63 more >>>>>> bits, and we'll be able to reject invalid combinations of bits. >>>>>> >>>>>> Do you feel that we should skip VCPUs whose SMBASE has not been changed >>>>>> from the default? >>>>>> >>>>>> If so, doesn't that run the risk of missing a VCPU that has an actual >>>>>> SMI handler installed, and it just happens to be placed at the default >>>>>> SMBASE location? >>>>>> >>>>>> ... I wouldn't mind skipping VCPUs with apparently unrelocated SMBASEs, >>>>>> but I think (a) that's not what real hardware does, and (b) it is risky >>>>>> if a VCPU's actual SMI handler has been installed while keeping the >>>>>> default SMBASE value. >>>>>> >>>>>> What do you think? >>>>> (a) probably doesn't consider hotplug, so it's totally fine for the case >>>>> where firmware is the only one who wakes up/initializes CPUs. >>>>> however consider 2 CPU hotplugged and then broadcast SMI happens >>>>> to serve some other task (if there isn't unpark 'feature'). >>>>> Even if real hw does it, it's behavior not defined by SDM with possibly >>>>> bad consequences. >>>>> >>>>> (b) alone it's risky so I'd skip based on combination of >>>>> >>>>> if (default SMBASE & CPU is in reset state) >>>>> skip; >>>>> >>>>> that should be safe and it leaves open possibility to avoid adding >>>>> unpark 'feature' to CPU. >>>> >>>> The above attributes ("SMBASE" and "CPU is in reset state") are specific >>>> to x86 VCPUs. Should I use some dynamic QOM casting for this? Like the >>>> X86_CPU() macro? >>>> >>>> Can I assume here that the macro will never return NULL? (I think so, >>>> LPC is only used with x86.) >>> yep, that should work. >>> >>>> >>>> ... I guess SMBASE can be accessed as >>>> >>>> X86_CPU(cs)->env.smbase >>> preferred style: >>> X86CPU *cpu = X86_CPU(cs); >>> cpu->... >>> >>>> >>>> How do I figure out if the CPU is in reset state ("waiting for first >>>> INIT") though? Note that this state should be distinguished from simply >>>> "halted" (i.e., after a HLT). After a HLT, the SMI should be injected >>>> and delivered; see commit a9bad65d2c1f. Edk2 regularly uses HLT to send >>>> APs to sleep. >>> >>> how about using EIP reset value? >>> x86_cpu_reset() >>> ... >>> env->eip = 0xfff0; >> >> Okay, so I tried this. It doesn't work. >> >> This is the code I wrote (extracting a new ich9_apm_broadcast_smi() >> function from ich9_apm_ctrl_changed(), due to size reasons): >> >>> static void ich9_apm_broadcast_smi(void) >>> { >>> CPUState *cs; >>> >>> cpu_synchronize_all_states(); /* <--------- mark this */ > it probably should be executed after pause_all_vcpus(), > maybe it will help. 
> >>> CPU_FOREACH(cs) { >>> X86CPU *cpu = X86_CPU(cs); >>> CPUX86State *env = &cpu->env; >>> >>> if (env->smbase == 0x30000 && env->eip == 0xfff0) { >>> CPUClass *k = CPU_GET_CLASS(cs); >>> uint64_t cpu_arch_id = k->get_arch_id(cs); >>> >>> trace_ich9_apm_broadcast_smi_skip(cpu_arch_id); >>> continue; >>> } >>> >>> cpu_interrupt(cs, CPU_INTERRUPT_SMI); >>> } >>> } >> > [...] >> (b) If I add the cpu_synchronize_all_states() call before the loop, then >> the incorrect filter matches go away; QEMU sees the KVM VCPU states >> correctly, and the SMI is broad-cast. >> >> However, in this case, the boot slows to a *crawl*. It's unbearable. All >> VCPUs are pegged at 100%, and you can easily follow the OVMF debug log >> with the naked eye, almost line by line. >> I didn't expect that cpu_synchronize_all_states() would be so costly, >> but it makes the idea of checking VCPU state in the APM_CNT handler a >> non-starter. > I wonder were bottleneck is to slow down guest so much. > What is actual cost of calling cpu_synchronize_all_states()? > > Maybe calling pause_all_vcpus() before cpu_synchronize_all_states() > would help. If I prepend just a pause_all_vcpus() function call, at the top, then the VM freezes (quite understandably) when the first SMI is raised via APM_CNT. If I, instead, bracket the function body with pause_all_vcpus() and resume_all_vcpus(), like this: static void ich9_apm_broadcast_smi(void) { CPUState *cs; pause_all_vcpus(); cpu_synchronize_all_states(); CPU_FOREACH(cs) { X86CPU *cpu = X86_CPU(cs); CPUX86State *env = &cpu->env; if (env->smbase == 0x30000 && env->eip == 0xfff0) { CPUClass *k = CPU_GET_CLASS(cs); uint64_t cpu_arch_id = k->get_arch_id(cs); trace_ich9_apm_broadcast_smi_skip(cpu_arch_id); continue; } cpu_interrupt(cs, CPU_INTERRUPT_SMI); } resume_all_vcpus(); } then I see the following symptoms: - rather than 4 VCPUs being pegged at 100%, only one VCPU is pegged at 100% - the boot process, as observed from the OVMF log, is just as slow (= crawling) as without the VCPU pausing/resuming (i.e., like with cpu_synchronize_all_states() only); so ultimately the pausing/resuming doesn't help. Thanks Laszlo
On 01/18/17 11:19, Laszlo Ersek wrote: > On 01/18/17 11:03, Igor Mammedov wrote: >> On Tue, 17 Jan 2017 19:53:21 +0100 >> Laszlo Ersek <lersek@redhat.com> wrote: >> [snip] >>> This is the code I wrote (extracting a new ich9_apm_broadcast_smi() >>> function from ich9_apm_ctrl_changed(), due to size reasons): >>> >>>> static void ich9_apm_broadcast_smi(void) >>>> { >>>> CPUState *cs; >>>> >>>> cpu_synchronize_all_states(); /* <--------- mark this */ >> it probably should be executed after pause_all_vcpus(), >> maybe it will help. >> >>>> CPU_FOREACH(cs) { >>>> X86CPU *cpu = X86_CPU(cs); >>>> CPUX86State *env = &cpu->env; >>>> >>>> if (env->smbase == 0x30000 && env->eip == 0xfff0) { >>>> CPUClass *k = CPU_GET_CLASS(cs); >>>> uint64_t cpu_arch_id = k->get_arch_id(cs); >>>> >>>> trace_ich9_apm_broadcast_smi_skip(cpu_arch_id); >>>> continue; >>>> } >>>> >>>> cpu_interrupt(cs, CPU_INTERRUPT_SMI); >>>> } >>>> } >>> >> [...] >>> (b) If I add the cpu_synchronize_all_states() call before the loop, then >>> the incorrect filter matches go away; QEMU sees the KVM VCPU states >>> correctly, and the SMI is broad-cast. >>> >>> However, in this case, the boot slows to a *crawl*. It's unbearable. All >>> VCPUs are pegged at 100%, and you can easily follow the OVMF debug log >>> with the naked eye, almost line by line. >>> I didn't expect that cpu_synchronize_all_states() would be so costly, >>> but it makes the idea of checking VCPU state in the APM_CNT handler a >>> non-starter. >> I wonder were bottleneck is to slow down guest so much. >> What is actual cost of calling cpu_synchronize_all_states()? >> >> Maybe calling pause_all_vcpus() before cpu_synchronize_all_states() >> would help. > > If I prepend just a pause_all_vcpus() function call, at the top, then > the VM freezes (quite understandably) when the first SMI is raised via > APM_CNT. > > If I, instead, bracket the function body with pause_all_vcpus() and > resume_all_vcpus(), like this: > > static void ich9_apm_broadcast_smi(void) > { > CPUState *cs; > > pause_all_vcpus(); > cpu_synchronize_all_states(); > CPU_FOREACH(cs) { > X86CPU *cpu = X86_CPU(cs); > CPUX86State *env = &cpu->env; > > if (env->smbase == 0x30000 && env->eip == 0xfff0) { > CPUClass *k = CPU_GET_CLASS(cs); > uint64_t cpu_arch_id = k->get_arch_id(cs); > > trace_ich9_apm_broadcast_smi_skip(cpu_arch_id); > continue; > } > > cpu_interrupt(cs, CPU_INTERRUPT_SMI); > } > resume_all_vcpus(); > } > > then I see the following symptoms: > - rather than 4 VCPUs being pegged at 100%, only one VCPU is pegged at > 100% > - the boot process, as observed from the OVMF log, is just as slow > (= crawling) as without the VCPU pausing/resuming (i.e., like with > cpu_synchronize_all_states() only); so ultimately the > pausing/resuming doesn't help. I also tried this variant (bracketing only the sync call with pause/resume): static void ich9_apm_broadcast_smi(void) { CPUState *cs; pause_all_vcpus(); cpu_synchronize_all_states(); resume_all_vcpus(); CPU_FOREACH(cs) { X86CPU *cpu = X86_CPU(cs); CPUX86State *env = &cpu->env; if (env->smbase == 0x30000 && env->eip == 0xfff0) { CPUClass *k = CPU_GET_CLASS(cs); uint64_t cpu_arch_id = k->get_arch_id(cs); trace_ich9_apm_broadcast_smi_skip(cpu_arch_id); continue; } cpu_interrupt(cs, CPU_INTERRUPT_SMI); } } This behaves identically to the version where pause/resume bracket the entire function body (i.e., 1 VCPU pegged, super-slow boot progress). Laszlo
On Wed, 18 Jan 2017 11:23:48 +0100 Laszlo Ersek <lersek@redhat.com> wrote: > On 01/18/17 11:19, Laszlo Ersek wrote: > > On 01/18/17 11:03, Igor Mammedov wrote: > >> On Tue, 17 Jan 2017 19:53:21 +0100 > >> Laszlo Ersek <lersek@redhat.com> wrote: > >> > > [snip] > > >>> This is the code I wrote (extracting a new ich9_apm_broadcast_smi() > >>> function from ich9_apm_ctrl_changed(), due to size reasons): > >>> > >>>> static void ich9_apm_broadcast_smi(void) > >>>> { > >>>> CPUState *cs; > >>>> > >>>> cpu_synchronize_all_states(); /* <--------- mark this */ > >> it probably should be executed after pause_all_vcpus(), > >> maybe it will help. > >> > >>>> CPU_FOREACH(cs) { > >>>> X86CPU *cpu = X86_CPU(cs); > >>>> CPUX86State *env = &cpu->env; > >>>> > >>>> if (env->smbase == 0x30000 && env->eip == 0xfff0) { > >>>> CPUClass *k = CPU_GET_CLASS(cs); > >>>> uint64_t cpu_arch_id = k->get_arch_id(cs); > >>>> > >>>> trace_ich9_apm_broadcast_smi_skip(cpu_arch_id); > >>>> continue; > >>>> } > >>>> > >>>> cpu_interrupt(cs, CPU_INTERRUPT_SMI); > >>>> } > >>>> } > >>> > >> [...] > >>> (b) If I add the cpu_synchronize_all_states() call before the loop, then > >>> the incorrect filter matches go away; QEMU sees the KVM VCPU states > >>> correctly, and the SMI is broad-cast. > >>> > >>> However, in this case, the boot slows to a *crawl*. It's unbearable. All > >>> VCPUs are pegged at 100%, and you can easily follow the OVMF debug log > >>> with the naked eye, almost line by line. > >>> I didn't expect that cpu_synchronize_all_states() would be so costly, > >>> but it makes the idea of checking VCPU state in the APM_CNT handler a > >>> non-starter. > >> I wonder were bottleneck is to slow down guest so much. > >> What is actual cost of calling cpu_synchronize_all_states()? > >> > >> Maybe calling pause_all_vcpus() before cpu_synchronize_all_states() > >> would help. > > > > If I prepend just a pause_all_vcpus() function call, at the top, then > > the VM freezes (quite understandably) when the first SMI is raised via > > APM_CNT. > > > > If I, instead, bracket the function body with pause_all_vcpus() and > > resume_all_vcpus(), like this: > > > > static void ich9_apm_broadcast_smi(void) > > { > > CPUState *cs; > > > > pause_all_vcpus(); > > cpu_synchronize_all_states(); > > CPU_FOREACH(cs) { > > X86CPU *cpu = X86_CPU(cs); > > CPUX86State *env = &cpu->env; > > > > if (env->smbase == 0x30000 && env->eip == 0xfff0) { > > CPUClass *k = CPU_GET_CLASS(cs); > > uint64_t cpu_arch_id = k->get_arch_id(cs); > > > > trace_ich9_apm_broadcast_smi_skip(cpu_arch_id); > > continue; > > } > > > > cpu_interrupt(cs, CPU_INTERRUPT_SMI); > > } > > resume_all_vcpus(); > > } > > > > then I see the following symptoms: > > - rather than 4 VCPUs being pegged at 100%, only one VCPU is pegged at > > 100% > > - the boot process, as observed from the OVMF log, is just as slow > > (= crawling) as without the VCPU pausing/resuming (i.e., like with > > cpu_synchronize_all_states() only); so ultimately the > > pausing/resuming doesn't help. 
> > I also tried this variant (bracketing only the sync call with pause/resume): > > static void ich9_apm_broadcast_smi(void) > { > CPUState *cs; > > pause_all_vcpus(); > cpu_synchronize_all_states(); > resume_all_vcpus(); > CPU_FOREACH(cs) { > X86CPU *cpu = X86_CPU(cs); > CPUX86State *env = &cpu->env; > > if (env->smbase == 0x30000 && env->eip == 0xfff0) { > CPUClass *k = CPU_GET_CLASS(cs); > uint64_t cpu_arch_id = k->get_arch_id(cs); > > trace_ich9_apm_broadcast_smi_skip(cpu_arch_id); > continue; > } > > cpu_interrupt(cs, CPU_INTERRUPT_SMI); > } > } > > This behaves identically to the version where pause/resume bracket the > entire function body (i.e., 1 VCPU pegged, super-slow boot progress). Wrapping the entire function into pause_all_vcpus()/resume_all_vcpus() looks better, as then cpu_interrupt() would not need to send IPIs to the CPUs, since pause_all_vcpus() would have done that already. So the remaining question is what this 1 CPU does to make things so slow. Does it send a lot of SMIs or something else?
On 01/18/17 13:38, Igor Mammedov wrote: > On Wed, 18 Jan 2017 11:23:48 +0100 > Laszlo Ersek <lersek@redhat.com> wrote: > >> On 01/18/17 11:19, Laszlo Ersek wrote: >>> On 01/18/17 11:03, Igor Mammedov wrote: >>>> On Tue, 17 Jan 2017 19:53:21 +0100 >>>> Laszlo Ersek <lersek@redhat.com> wrote: >>>> >> >> [snip] >> >>>>> This is the code I wrote (extracting a new ich9_apm_broadcast_smi() >>>>> function from ich9_apm_ctrl_changed(), due to size reasons): >>>>> >>>>>> static void ich9_apm_broadcast_smi(void) >>>>>> { >>>>>> CPUState *cs; >>>>>> >>>>>> cpu_synchronize_all_states(); /* <--------- mark this */ >>>> it probably should be executed after pause_all_vcpus(), >>>> maybe it will help. >>>> >>>>>> CPU_FOREACH(cs) { >>>>>> X86CPU *cpu = X86_CPU(cs); >>>>>> CPUX86State *env = &cpu->env; >>>>>> >>>>>> if (env->smbase == 0x30000 && env->eip == 0xfff0) { >>>>>> CPUClass *k = CPU_GET_CLASS(cs); >>>>>> uint64_t cpu_arch_id = k->get_arch_id(cs); >>>>>> >>>>>> trace_ich9_apm_broadcast_smi_skip(cpu_arch_id); >>>>>> continue; >>>>>> } >>>>>> >>>>>> cpu_interrupt(cs, CPU_INTERRUPT_SMI); >>>>>> } >>>>>> } >>>>> >>>> [...] >>>>> (b) If I add the cpu_synchronize_all_states() call before the loop, then >>>>> the incorrect filter matches go away; QEMU sees the KVM VCPU states >>>>> correctly, and the SMI is broad-cast. >>>>> >>>>> However, in this case, the boot slows to a *crawl*. It's unbearable. All >>>>> VCPUs are pegged at 100%, and you can easily follow the OVMF debug log >>>>> with the naked eye, almost line by line. >>>>> I didn't expect that cpu_synchronize_all_states() would be so costly, >>>>> but it makes the idea of checking VCPU state in the APM_CNT handler a >>>>> non-starter. >>>> I wonder were bottleneck is to slow down guest so much. >>>> What is actual cost of calling cpu_synchronize_all_states()? >>>> >>>> Maybe calling pause_all_vcpus() before cpu_synchronize_all_states() >>>> would help. >>> >>> If I prepend just a pause_all_vcpus() function call, at the top, then >>> the VM freezes (quite understandably) when the first SMI is raised via >>> APM_CNT. >>> >>> If I, instead, bracket the function body with pause_all_vcpus() and >>> resume_all_vcpus(), like this: >>> >>> static void ich9_apm_broadcast_smi(void) >>> { >>> CPUState *cs; >>> >>> pause_all_vcpus(); >>> cpu_synchronize_all_states(); >>> CPU_FOREACH(cs) { >>> X86CPU *cpu = X86_CPU(cs); >>> CPUX86State *env = &cpu->env; >>> >>> if (env->smbase == 0x30000 && env->eip == 0xfff0) { >>> CPUClass *k = CPU_GET_CLASS(cs); >>> uint64_t cpu_arch_id = k->get_arch_id(cs); >>> >>> trace_ich9_apm_broadcast_smi_skip(cpu_arch_id); >>> continue; >>> } >>> >>> cpu_interrupt(cs, CPU_INTERRUPT_SMI); >>> } >>> resume_all_vcpus(); >>> } >>> >>> then I see the following symptoms: >>> - rather than 4 VCPUs being pegged at 100%, only one VCPU is pegged at >>> 100% >>> - the boot process, as observed from the OVMF log, is just as slow >>> (= crawling) as without the VCPU pausing/resuming (i.e., like with >>> cpu_synchronize_all_states() only); so ultimately the >>> pausing/resuming doesn't help. 
>> >> I also tried this variant (bracketing only the sync call with pause/resume): >> >> static void ich9_apm_broadcast_smi(void) >> { >> CPUState *cs; >> >> pause_all_vcpus(); >> cpu_synchronize_all_states(); >> resume_all_vcpus(); >> CPU_FOREACH(cs) { >> X86CPU *cpu = X86_CPU(cs); >> CPUX86State *env = &cpu->env; >> >> if (env->smbase == 0x30000 && env->eip == 0xfff0) { >> CPUClass *k = CPU_GET_CLASS(cs); >> uint64_t cpu_arch_id = k->get_arch_id(cs); >> >> trace_ich9_apm_broadcast_smi_skip(cpu_arch_id); >> continue; >> } >> >> cpu_interrupt(cs, CPU_INTERRUPT_SMI); >> } >> } >> >> This behaves identically to the version where pause/resume bracket the >> entire function body (i.e., 1 VCPU pegged, super-slow boot progress). > wrapping entire function into pause_all_vcpus()/resume_all_vcpus() > looks better as then cpu_interrupt() would not need to send IPIs to CPUs > since pause_all_vcpus() would have done it. I don't understand, why would a pause operation send IPIs? Do you mean the resume? Even in that case, why would the resume send precisely SMIs (substituting for the specific CPU_INTERRUPT_SMI calls) and not some other kind of interrupt? > So the left question is what this 1 CPU does to make things so slow. > Does it send a lot of SMIs or something else? The firmware uses many SMIs / Trigger() calls, for example for the implementation of UEFI variable services, and SMM lockbox operations. However, it does not use *masses* of them. Following the OVMF log, I see that *individual* APM_CNT writes take very long (on the order of a second, even), which implies that a single cpu_synchronize_all_states() function call takes very long. I briefly checked what cpu_synchronize_all_states() does, digging down its call stack. I don't see anything in it that we should or could modify for the sake of this work. I think the current line of investigation is way out of scope for this work, and that we've lost focus. In the original "PATCH v6 wave 2 2/3", cpu_interrupt() just works as intended. Why are we branching out this far from that? Just to avoid the CPU unpark feature that Paolo suggested (which would also be negotiated)? What is so wrong with the planned CPU unpark stuff that justifies blocking this patch? Honestly I don't understand why we can't approach these features *incrementally*. Let's do the negotiable SMI broadcast now, then do VCPU hotplug later. I think the current negotiation pattern is flexible and future-proofed enough for that. It was you who suggested, for simplifying the last patch in the series, that we completely ignore VCPU hotplug for now (and I whole-heartedly agreed). Let me put it like this: what is in this patch, in your opinion, that *irrevocably* breaks the upcoming VCPU hotplug feature, or else limits our options in implementing it? In my opinion, we can later implement *anything at all*, dependent on new feature bits. We can even decide to reject guest feature requests that specify *only* bit#0; which practically means disabling guests (using SMIs) that don't conform to our VCPU hotlpug implementation. Laszlo
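A crude way to confirm how much of that per-write latency is spent in cpu_synchronize_all_states() itself would be to time the call directly. A rough sketch, purely for the experiment and not part of the series (it assumes "qemu/timer.h" for qemu_clock_get_ns(), and it omits the SMBASE/EIP filtering):

  static void ich9_apm_broadcast_smi(void)
  {
      CPUState *cs;
      int64_t t0, t1;

      t0 = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
      cpu_synchronize_all_states();
      t1 = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
      /* report how long the state synchronization took, in microseconds */
      fprintf(stderr, "cpu_synchronize_all_states: %" PRId64 " us\n",
              (t1 - t0) / 1000);

      CPU_FOREACH(cs) {
          cpu_interrupt(cs, CPU_INTERRUPT_SMI);
      }
  }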
On Wed, 18 Jan 2017 16:42:27 +0100 Laszlo Ersek <lersek@redhat.com> wrote: > On 01/18/17 13:38, Igor Mammedov wrote: > > On Wed, 18 Jan 2017 11:23:48 +0100 > > Laszlo Ersek <lersek@redhat.com> wrote: > > > >> On 01/18/17 11:19, Laszlo Ersek wrote: > >>> On 01/18/17 11:03, Igor Mammedov wrote: > >>>> On Tue, 17 Jan 2017 19:53:21 +0100 > >>>> Laszlo Ersek <lersek@redhat.com> wrote: > >>>> > >> > >> [snip] > >> > >>>>> This is the code I wrote (extracting a new ich9_apm_broadcast_smi() > >>>>> function from ich9_apm_ctrl_changed(), due to size reasons): > >>>>> > >>>>>> static void ich9_apm_broadcast_smi(void) > >>>>>> { > >>>>>> CPUState *cs; > >>>>>> > >>>>>> cpu_synchronize_all_states(); /* <--------- mark this */ > >>>> it probably should be executed after pause_all_vcpus(), > >>>> maybe it will help. > >>>> > >>>>>> CPU_FOREACH(cs) { > >>>>>> X86CPU *cpu = X86_CPU(cs); > >>>>>> CPUX86State *env = &cpu->env; > >>>>>> > >>>>>> if (env->smbase == 0x30000 && env->eip == 0xfff0) { > >>>>>> CPUClass *k = CPU_GET_CLASS(cs); > >>>>>> uint64_t cpu_arch_id = k->get_arch_id(cs); > >>>>>> > >>>>>> trace_ich9_apm_broadcast_smi_skip(cpu_arch_id); > >>>>>> continue; > >>>>>> } > >>>>>> > >>>>>> cpu_interrupt(cs, CPU_INTERRUPT_SMI); > >>>>>> } > >>>>>> } > >>>>> > >>>> [...] > >>>>> (b) If I add the cpu_synchronize_all_states() call before the loop, then > >>>>> the incorrect filter matches go away; QEMU sees the KVM VCPU states > >>>>> correctly, and the SMI is broad-cast. > >>>>> > >>>>> However, in this case, the boot slows to a *crawl*. It's unbearable. All > >>>>> VCPUs are pegged at 100%, and you can easily follow the OVMF debug log > >>>>> with the naked eye, almost line by line. > >>>>> I didn't expect that cpu_synchronize_all_states() would be so costly, > >>>>> but it makes the idea of checking VCPU state in the APM_CNT handler a > >>>>> non-starter. > >>>> I wonder were bottleneck is to slow down guest so much. > >>>> What is actual cost of calling cpu_synchronize_all_states()? > >>>> > >>>> Maybe calling pause_all_vcpus() before cpu_synchronize_all_states() > >>>> would help. > >>> > >>> If I prepend just a pause_all_vcpus() function call, at the top, then > >>> the VM freezes (quite understandably) when the first SMI is raised via > >>> APM_CNT. > >>> > >>> If I, instead, bracket the function body with pause_all_vcpus() and > >>> resume_all_vcpus(), like this: > >>> > >>> static void ich9_apm_broadcast_smi(void) > >>> { > >>> CPUState *cs; > >>> > >>> pause_all_vcpus(); > >>> cpu_synchronize_all_states(); > >>> CPU_FOREACH(cs) { > >>> X86CPU *cpu = X86_CPU(cs); > >>> CPUX86State *env = &cpu->env; > >>> > >>> if (env->smbase == 0x30000 && env->eip == 0xfff0) { > >>> CPUClass *k = CPU_GET_CLASS(cs); > >>> uint64_t cpu_arch_id = k->get_arch_id(cs); > >>> > >>> trace_ich9_apm_broadcast_smi_skip(cpu_arch_id); > >>> continue; > >>> } > >>> > >>> cpu_interrupt(cs, CPU_INTERRUPT_SMI); > >>> } > >>> resume_all_vcpus(); > >>> } > >>> > >>> then I see the following symptoms: > >>> - rather than 4 VCPUs being pegged at 100%, only one VCPU is pegged at > >>> 100% > >>> - the boot process, as observed from the OVMF log, is just as slow > >>> (= crawling) as without the VCPU pausing/resuming (i.e., like with > >>> cpu_synchronize_all_states() only); so ultimately the > >>> pausing/resuming doesn't help. 
> >> > >> I also tried this variant (bracketing only the sync call with pause/resume): > >> > >> static void ich9_apm_broadcast_smi(void) > >> { > >> CPUState *cs; > >> > >> pause_all_vcpus(); > >> cpu_synchronize_all_states(); > >> resume_all_vcpus(); > >> CPU_FOREACH(cs) { > >> X86CPU *cpu = X86_CPU(cs); > >> CPUX86State *env = &cpu->env; > >> > >> if (env->smbase == 0x30000 && env->eip == 0xfff0) { > >> CPUClass *k = CPU_GET_CLASS(cs); > >> uint64_t cpu_arch_id = k->get_arch_id(cs); > >> > >> trace_ich9_apm_broadcast_smi_skip(cpu_arch_id); > >> continue; > >> } > >> > >> cpu_interrupt(cs, CPU_INTERRUPT_SMI); > >> } > >> } > >> > >> This behaves identically to the version where pause/resume bracket the > >> entire function body (i.e., 1 VCPU pegged, super-slow boot progress). > > wrapping entire function into pause_all_vcpus()/resume_all_vcpus() > > looks better as then cpu_interrupt() would not need to send IPIs to CPUs > > since pause_all_vcpus() would have done it. > > I don't understand, why would a pause operation send IPIs? pause_all forces exit on all CPUs and cpu_interrupt() does the same to a CPU so if all CPUs were kicked out of running state then cpu_interrupt() would skip kick out step. > Do you mean the resume? Even in that case, why would the resume send > precisely SMIs (substituting for the specific CPU_INTERRUPT_SMI calls) > and not some other kind of interrupt? > > > So the left question is what this 1 CPU does to make things so slow. > > Does it send a lot of SMIs or something else? > > The firmware uses many SMIs / Trigger() calls, for example for the > implementation of UEFI variable services, and SMM lockbox operations. > However, it does not use *masses* of them. Following the OVMF log, I see > that *individual* APM_CNT writes take very long (on the order of a > second, even), which implies that a single cpu_synchronize_all_states() > function call takes very long. > > I briefly checked what cpu_synchronize_all_states() does, digging down > its call stack. I don't see anything in it that we should or could > modify for the sake of this work. > > > I think the current line of investigation is way out of scope for this > work, and that we've lost focus. In the original "PATCH v6 wave 2 2/3", > cpu_interrupt() just works as intended. It's not that I'm outright against of v6 as is, it just seems that something fundamentally broken on cpu_synchronize_all_states() call chain and instead of fixing issue the same SMBASE race case would be just ignored since CPU hotplug is broken with OVMF anyways. So I'm fine with applying v6 as is and as follow up fix to cpu_synchronize_all_states() with follow up filtering out of the same SMBASE in reset state CPUs if possible in 2.9 merge window. > Why are we branching out this far from that? Just to avoid the CPU > unpark feature that Paolo suggested (which would also be negotiated)? > > What is so wrong with the planned CPU unpark stuff that justifies > blocking this patch? > > Honestly I don't understand why we can't approach these features > *incrementally*. Let's do the negotiable SMI broadcast now, then do VCPU > hotplug later. I think the current negotiation pattern is flexible and > future-proofed enough for that. It was you who suggested, for > simplifying the last patch in the series, that we completely ignore VCPU > hotplug for now (and I whole-heartedly agreed). 
> > Let me put it like this: what is in this patch, in your opinion, that > *irrevocably* breaks the upcoming VCPU hotplug feature, or else limits > our options in implementing it? In my opinion, we can later implement > *anything at all*, dependent on new feature bits. We can even decide to > reject guest feature requests that specify *only* bit#0; which > practically means disabling guests (using SMIs) that don't conform to > our VCPU hotlpug implementation. Why do negotiation if hotplug could be done without it? On the hw (QEMU) side, hotplug seems to be implemented sufficiently (but we are still missing SMRR support). The only thing needed to make hotplug work with OVMF might be issuing an SMI, either directly from QEMU or from AML (a write into APM_CNT), after a CPU is hotplugged. > > Laszlo
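To make the "issue an SMI directly from QEMU after hotplug" idea above concrete, a hypothetical sketch could look like the following; the hook point and the function name are assumptions, not part of any posted patch:

  /* Hypothetical: raise a directed SMI on a freshly hotplugged CPU so the
   * firmware can bring it into SMM and relocate its SMBASE. */
  static void ich9_cpu_hotplug_raise_smi(CPUState *hotplugged_cpu)
  {
      cpu_interrupt(hotplugged_cpu, CPU_INTERRUPT_SMI);
  }

The AML alternative mentioned above would instead write to the APM control port (APM_CNT, I/O port 0xB2), which lands in ich9_apm_ctrl_changed() and therefore follows whatever unicast or broadcast behavior has been negotiated.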
On 01/18/17 17:26, Igor Mammedov wrote: > On Wed, 18 Jan 2017 16:42:27 +0100 > Laszlo Ersek <lersek@redhat.com> wrote: > >> On 01/18/17 13:38, Igor Mammedov wrote: >>> On Wed, 18 Jan 2017 11:23:48 +0100 >>> Laszlo Ersek <lersek@redhat.com> wrote: >>> >>>> On 01/18/17 11:19, Laszlo Ersek wrote: >>>>> On 01/18/17 11:03, Igor Mammedov wrote: >>>>>> On Tue, 17 Jan 2017 19:53:21 +0100 >>>>>> Laszlo Ersek <lersek@redhat.com> wrote: >>>>>> >>>> >>>> [snip] >>>> >>>>>>> This is the code I wrote (extracting a new ich9_apm_broadcast_smi() >>>>>>> function from ich9_apm_ctrl_changed(), due to size reasons): >>>>>>> >>>>>>>> static void ich9_apm_broadcast_smi(void) >>>>>>>> { >>>>>>>> CPUState *cs; >>>>>>>> >>>>>>>> cpu_synchronize_all_states(); /* <--------- mark this */ >>>>>> it probably should be executed after pause_all_vcpus(), >>>>>> maybe it will help. >>>>>> >>>>>>>> CPU_FOREACH(cs) { >>>>>>>> X86CPU *cpu = X86_CPU(cs); >>>>>>>> CPUX86State *env = &cpu->env; >>>>>>>> >>>>>>>> if (env->smbase == 0x30000 && env->eip == 0xfff0) { >>>>>>>> CPUClass *k = CPU_GET_CLASS(cs); >>>>>>>> uint64_t cpu_arch_id = k->get_arch_id(cs); >>>>>>>> >>>>>>>> trace_ich9_apm_broadcast_smi_skip(cpu_arch_id); >>>>>>>> continue; >>>>>>>> } >>>>>>>> >>>>>>>> cpu_interrupt(cs, CPU_INTERRUPT_SMI); >>>>>>>> } >>>>>>>> } >>>>>>> >>>>>> [...] >>>>>>> (b) If I add the cpu_synchronize_all_states() call before the loop, then >>>>>>> the incorrect filter matches go away; QEMU sees the KVM VCPU states >>>>>>> correctly, and the SMI is broad-cast. >>>>>>> >>>>>>> However, in this case, the boot slows to a *crawl*. It's unbearable. All >>>>>>> VCPUs are pegged at 100%, and you can easily follow the OVMF debug log >>>>>>> with the naked eye, almost line by line. >>>>>>> I didn't expect that cpu_synchronize_all_states() would be so costly, >>>>>>> but it makes the idea of checking VCPU state in the APM_CNT handler a >>>>>>> non-starter. >>>>>> I wonder were bottleneck is to slow down guest so much. >>>>>> What is actual cost of calling cpu_synchronize_all_states()? >>>>>> >>>>>> Maybe calling pause_all_vcpus() before cpu_synchronize_all_states() >>>>>> would help. >>>>> >>>>> If I prepend just a pause_all_vcpus() function call, at the top, then >>>>> the VM freezes (quite understandably) when the first SMI is raised via >>>>> APM_CNT. >>>>> >>>>> If I, instead, bracket the function body with pause_all_vcpus() and >>>>> resume_all_vcpus(), like this: >>>>> >>>>> static void ich9_apm_broadcast_smi(void) >>>>> { >>>>> CPUState *cs; >>>>> >>>>> pause_all_vcpus(); >>>>> cpu_synchronize_all_states(); >>>>> CPU_FOREACH(cs) { >>>>> X86CPU *cpu = X86_CPU(cs); >>>>> CPUX86State *env = &cpu->env; >>>>> >>>>> if (env->smbase == 0x30000 && env->eip == 0xfff0) { >>>>> CPUClass *k = CPU_GET_CLASS(cs); >>>>> uint64_t cpu_arch_id = k->get_arch_id(cs); >>>>> >>>>> trace_ich9_apm_broadcast_smi_skip(cpu_arch_id); >>>>> continue; >>>>> } >>>>> >>>>> cpu_interrupt(cs, CPU_INTERRUPT_SMI); >>>>> } >>>>> resume_all_vcpus(); >>>>> } >>>>> >>>>> then I see the following symptoms: >>>>> - rather than 4 VCPUs being pegged at 100%, only one VCPU is pegged at >>>>> 100% >>>>> - the boot process, as observed from the OVMF log, is just as slow >>>>> (= crawling) as without the VCPU pausing/resuming (i.e., like with >>>>> cpu_synchronize_all_states() only); so ultimately the >>>>> pausing/resuming doesn't help. 
>>>> >>>> I also tried this variant (bracketing only the sync call with pause/resume): >>>> >>>> static void ich9_apm_broadcast_smi(void) >>>> { >>>> CPUState *cs; >>>> >>>> pause_all_vcpus(); >>>> cpu_synchronize_all_states(); >>>> resume_all_vcpus(); >>>> CPU_FOREACH(cs) { >>>> X86CPU *cpu = X86_CPU(cs); >>>> CPUX86State *env = &cpu->env; >>>> >>>> if (env->smbase == 0x30000 && env->eip == 0xfff0) { >>>> CPUClass *k = CPU_GET_CLASS(cs); >>>> uint64_t cpu_arch_id = k->get_arch_id(cs); >>>> >>>> trace_ich9_apm_broadcast_smi_skip(cpu_arch_id); >>>> continue; >>>> } >>>> >>>> cpu_interrupt(cs, CPU_INTERRUPT_SMI); >>>> } >>>> } >>>> >>>> This behaves identically to the version where pause/resume bracket the >>>> entire function body (i.e., 1 VCPU pegged, super-slow boot progress). >>> wrapping entire function into pause_all_vcpus()/resume_all_vcpus() >>> looks better as then cpu_interrupt() would not need to send IPIs to CPUs >>> since pause_all_vcpus() would have done it. >> >> I don't understand, why would a pause operation send IPIs? > pause_all forces exit on all CPUs and cpu_interrupt() does the same to > a CPU so if all CPUs were kicked out of running state then cpu_interrupt() > would skip kick out step. Understood. > >> Do you mean the resume? Even in that case, why would the resume send >> precisely SMIs (substituting for the specific CPU_INTERRUPT_SMI calls) >> and not some other kind of interrupt? >> >>> So the left question is what this 1 CPU does to make things so slow. >>> Does it send a lot of SMIs or something else? >> >> The firmware uses many SMIs / Trigger() calls, for example for the >> implementation of UEFI variable services, and SMM lockbox operations. >> However, it does not use *masses* of them. Following the OVMF log, I see >> that *individual* APM_CNT writes take very long (on the order of a >> second, even), which implies that a single cpu_synchronize_all_states() >> function call takes very long. >> >> I briefly checked what cpu_synchronize_all_states() does, digging down >> its call stack. I don't see anything in it that we should or could >> modify for the sake of this work. >> >> >> I think the current line of investigation is way out of scope for this >> work, and that we've lost focus. In the original "PATCH v6 wave 2 2/3", >> cpu_interrupt() just works as intended. > It's not that I'm outright against of v6 as is, > it just seems that something fundamentally broken on > cpu_synchronize_all_states() call chain Maybe, but I don't see how fixing that is a prerequisite for this patch. > and instead of fixing issue > the same SMBASE race case would be just ignored I don't understand. What SMBASE race? Ignored by what and when? Sorry, I'm lost. If you are saying that, for the moment we can ignore any SMBASE races (whatever they may be) because CPU hotplug is broken with OVMF anyway, then I agree. I would like to limit the scope of this series as much as possible. > since CPU hotplug is > broken with OVMF anyways. > > So I'm fine with applying v6 as is > and as follow up fix to cpu_synchronize_all_states() I cannot promise to work on cpu_synchronize_all_states(). It ties into KVM and it goes over my head. In fact, if there is a problem with cpu_synchronize_all_states(), it must be a general one. The only reason I even thought of calling cpu_synchronize_all_states() here, after seeing the out-of-date register values, is that I recalled it from when I worked on dump-guest-memory. 
dump-guest-memory writes the register file to the dump, and so it needs cpu_synchronize_all_states(). But, again, cpu_synchronize_all_states() is a general facility, and I can't promise to work on it. > with > follow up filtering out of the same SMBASE in reset state CPUs > if possible in 2.9 merge window. I'm still not convinced that this filtering is absolutely necessary, in the face of feature negotiation; *but*, if someone figures out what's wrong with cpu_synchronize_all_states(), and manages to make its performance impact negligible, then I am certainly willing to add the filtering to the SMI broadcast. At that point, it shouldn't be much work, I hope. > >> Why are we branching out this far from that? Just to avoid the CPU >> unpark feature that Paolo suggested (which would also be negotiated)? >> >> What is so wrong with the planned CPU unpark stuff that justifies >> blocking this patch? >> >> Honestly I don't understand why we can't approach these features >> *incrementally*. Let's do the negotiable SMI broadcast now, then do VCPU >> hotplug later. I think the current negotiation pattern is flexible and >> future-proofed enough for that. It was you who suggested, for >> simplifying the last patch in the series, that we completely ignore VCPU >> hotplug for now (and I whole-heartedly agreed). >> >> Let me put it like this: what is in this patch, in your opinion, that >> *irrevocably* breaks the upcoming VCPU hotplug feature, or else limits >> our options in implementing it? In my opinion, we can later implement >> *anything at all*, dependent on new feature bits. We can even decide to >> reject guest feature requests that specify *only* bit#0; which >> practically means disabling guests (using SMIs) that don't conform to >> our VCPU hotlpug implementation. > Why do negotiation if hotplug could be done without it, Because it lets us segment the feature set into several small steps, where I have a chance of grasping the small pieces and maybe even contributing small, surgical patches? You and Paolo might be seeing the big picture, and I certainly don't want to go *against* the big picture. I just want to advance with as little steps as possible. There's the historical observation that a compiler has as many stages as there are sub-teams in the compiler team. (Please excuse me for not recalling the exact phrasing and the person who quipped it.) That observation exists for a reason. Feature and module boundaries tend to mirror human understanding. > on hw(QEMU) > side hotplug seems to be implemented sufficiently (but we still missing > SMRR support). The only thing to make hotplug work with OVMF might be > issuing SMI either directly from QEMU or from AML (write into APM_CNT) > after CPU is hotplugged. Maybe. For me, that looms in the future. If you agree with my participation as outlined above; that is, - I care about this exact patch, as posted, - someone else looks into cpu_synchronize_all_states(), - and then I'm willing to care about the incremental patch for the filtering, then I propose we go ahead with this patch. It's the last one in the series that needs your R-b. Thanks Laszlo
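As an illustration of the "reject guest feature requests that specify *only* bit#0" option discussed above, a hypothetical validation step in the negotiation path could look like this; the function name and the policy are invented for the example, and only ICH9_LPC_SMI_F_BROADCAST_BIT comes from the series:

  /* Hypothetical sketch; the policy shown is an example only. */
  static bool ich9_lpc_smi_features_ok(uint64_t requested)
  {
      const uint64_t supported = UINT64_C(1) << ICH9_LPC_SMI_F_BROADCAST_BIT;

      /* never accept feature bits that QEMU does not know about */
      if (requested & ~supported) {
          return false;
      }
      /* a future policy could additionally refuse requests that set only
       * bit#0, once a CPU hotplug feature bit exists and is required */
      return true;
  }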
On Wed, 18 Jan 2017 18:23:45 +0100 Laszlo Ersek <lersek@redhat.com> wrote: > On 01/18/17 17:26, Igor Mammedov wrote: > > On Wed, 18 Jan 2017 16:42:27 +0100 > > Laszlo Ersek <lersek@redhat.com> wrote: > > > >> On 01/18/17 13:38, Igor Mammedov wrote: > >>> On Wed, 18 Jan 2017 11:23:48 +0100 > >>> Laszlo Ersek <lersek@redhat.com> wrote: > >>> > >>>> On 01/18/17 11:19, Laszlo Ersek wrote: > >>>>> On 01/18/17 11:03, Igor Mammedov wrote: > >>>>>> On Tue, 17 Jan 2017 19:53:21 +0100 > >>>>>> Laszlo Ersek <lersek@redhat.com> wrote: > >>>>>> > >>>> > >>>> [snip] > >>>> > >>>>>>> This is the code I wrote (extracting a new ich9_apm_broadcast_smi() > >>>>>>> function from ich9_apm_ctrl_changed(), due to size reasons): > >>>>>>> > >>>>>>>> static void ich9_apm_broadcast_smi(void) > >>>>>>>> { > >>>>>>>> CPUState *cs; > >>>>>>>> > >>>>>>>> cpu_synchronize_all_states(); /* <--------- mark this */ > >>>>>> it probably should be executed after pause_all_vcpus(), > >>>>>> maybe it will help. > >>>>>> > >>>>>>>> CPU_FOREACH(cs) { > >>>>>>>> X86CPU *cpu = X86_CPU(cs); > >>>>>>>> CPUX86State *env = &cpu->env; > >>>>>>>> > >>>>>>>> if (env->smbase == 0x30000 && env->eip == 0xfff0) { > >>>>>>>> CPUClass *k = CPU_GET_CLASS(cs); > >>>>>>>> uint64_t cpu_arch_id = k->get_arch_id(cs); > >>>>>>>> > >>>>>>>> trace_ich9_apm_broadcast_smi_skip(cpu_arch_id); > >>>>>>>> continue; > >>>>>>>> } > >>>>>>>> > >>>>>>>> cpu_interrupt(cs, CPU_INTERRUPT_SMI); > >>>>>>>> } > >>>>>>>> } > >>>>>>> > >>>>>> [...] > >>>>>>> (b) If I add the cpu_synchronize_all_states() call before the loop, then > >>>>>>> the incorrect filter matches go away; QEMU sees the KVM VCPU states > >>>>>>> correctly, and the SMI is broad-cast. > >>>>>>> > >>>>>>> However, in this case, the boot slows to a *crawl*. It's unbearable. All > >>>>>>> VCPUs are pegged at 100%, and you can easily follow the OVMF debug log > >>>>>>> with the naked eye, almost line by line. > >>>>>>> I didn't expect that cpu_synchronize_all_states() would be so costly, > >>>>>>> but it makes the idea of checking VCPU state in the APM_CNT handler a > >>>>>>> non-starter. > >>>>>> I wonder were bottleneck is to slow down guest so much. > >>>>>> What is actual cost of calling cpu_synchronize_all_states()? > >>>>>> > >>>>>> Maybe calling pause_all_vcpus() before cpu_synchronize_all_states() > >>>>>> would help. > >>>>> > >>>>> If I prepend just a pause_all_vcpus() function call, at the top, then > >>>>> the VM freezes (quite understandably) when the first SMI is raised via > >>>>> APM_CNT. 
> >>>>>
> >>>>> If I, instead, bracket the function body with pause_all_vcpus() and
> >>>>> resume_all_vcpus(), like this:
> >>>>>
> >>>>>     static void ich9_apm_broadcast_smi(void)
> >>>>>     {
> >>>>>         CPUState *cs;
> >>>>>
> >>>>>         pause_all_vcpus();
> >>>>>         cpu_synchronize_all_states();
> >>>>>         CPU_FOREACH(cs) {
> >>>>>             X86CPU *cpu = X86_CPU(cs);
> >>>>>             CPUX86State *env = &cpu->env;
> >>>>>
> >>>>>             if (env->smbase == 0x30000 && env->eip == 0xfff0) {
> >>>>>                 CPUClass *k = CPU_GET_CLASS(cs);
> >>>>>                 uint64_t cpu_arch_id = k->get_arch_id(cs);
> >>>>>
> >>>>>                 trace_ich9_apm_broadcast_smi_skip(cpu_arch_id);
> >>>>>                 continue;
> >>>>>             }
> >>>>>
> >>>>>             cpu_interrupt(cs, CPU_INTERRUPT_SMI);
> >>>>>         }
> >>>>>         resume_all_vcpus();
> >>>>>     }
> >>>>>
> >>>>> then I see the following symptoms:
> >>>>> - rather than 4 VCPUs being pegged at 100%, only one VCPU is pegged at
> >>>>>   100%
> >>>>> - the boot process, as observed from the OVMF log, is just as slow
> >>>>>   (= crawling) as without the VCPU pausing/resuming (i.e., like with
> >>>>>   cpu_synchronize_all_states() only); so ultimately the
> >>>>>   pausing/resuming doesn't help.
> >>>>
> >>>> I also tried this variant (bracketing only the sync call with pause/resume):
> >>>>
> >>>>     static void ich9_apm_broadcast_smi(void)
> >>>>     {
> >>>>         CPUState *cs;
> >>>>
> >>>>         pause_all_vcpus();
> >>>>         cpu_synchronize_all_states();
> >>>>         resume_all_vcpus();
> >>>>         CPU_FOREACH(cs) {
> >>>>             X86CPU *cpu = X86_CPU(cs);
> >>>>             CPUX86State *env = &cpu->env;
> >>>>
> >>>>             if (env->smbase == 0x30000 && env->eip == 0xfff0) {
> >>>>                 CPUClass *k = CPU_GET_CLASS(cs);
> >>>>                 uint64_t cpu_arch_id = k->get_arch_id(cs);
> >>>>
> >>>>                 trace_ich9_apm_broadcast_smi_skip(cpu_arch_id);
> >>>>                 continue;
> >>>>             }
> >>>>
> >>>>             cpu_interrupt(cs, CPU_INTERRUPT_SMI);
> >>>>         }
> >>>>     }
> >>>>
> >>>> This behaves identically to the version where pause/resume bracket the
> >>>> entire function body (i.e., 1 VCPU pegged, super-slow boot progress).
> >>> wrapping the entire function into pause_all_vcpus()/resume_all_vcpus()
> >>> looks better, as then cpu_interrupt() would not need to send IPIs to CPUs,
> >>> since pause_all_vcpus() would have done it already.
> >>
> >> I don't understand, why would a pause operation send IPIs?
> > pause_all forces an exit on all CPUs, and cpu_interrupt() does the same to
> > a CPU, so if all CPUs were already kicked out of the running state, then
> > cpu_interrupt() would skip the kick-out step.
>
> Understood.
>
> >
> >> Do you mean the resume? Even in that case, why would the resume send
> >> precisely SMIs (substituting for the specific CPU_INTERRUPT_SMI calls)
> >> and not some other kind of interrupt?
> >>
> >>> So the remaining question is what this 1 CPU does to make things so slow.
> >>> Does it send a lot of SMIs or something else?
> >>
> >> The firmware uses many SMIs / Trigger() calls, for example for the
> >> implementation of UEFI variable services, and SMM lockbox operations.
> >> However, it does not use *masses* of them. Following the OVMF log, I see
> >> that *individual* APM_CNT writes take very long (on the order of a
> >> second, even), which implies that a single cpu_synchronize_all_states()
> >> function call takes very long.
> >>
> >> I briefly checked what cpu_synchronize_all_states() does, digging down
> >> its call stack. I don't see anything in it that we should or could
> >> modify for the sake of this work.
> >>
> >>
> >> I think the current line of investigation is way out of scope for this
> >> work, and that we've lost focus.
> >> In the original "PATCH v6 wave 2 2/3",
> >> cpu_interrupt() just works as intended.
> > It's not that I'm outright against v6 as is,
> > it just seems that something is fundamentally broken in the
> > cpu_synchronize_all_states() call chain
>
> Maybe, but I don't see how fixing that is a prerequisite for this patch.
>
> > and instead of fixing the issue
> > the same SMBASE race case would just be ignored
>
> I don't understand. What SMBASE race? Ignored by what and when? Sorry,
> I'm lost.
>
> If you are saying that, for the moment we can ignore any SMBASE races
> (whatever they may be) because CPU hotplug is broken with OVMF anyway,
> then I agree. I would like to limit the scope of this series as much as
> possible.
>
> > since CPU hotplug is
> > broken with OVMF anyways.
> >
> > So I'm fine with applying v6 as is
> > and as follow up fix to cpu_synchronize_all_states()
>
> I cannot promise to work on cpu_synchronize_all_states(). It ties into
> KVM and it goes over my head.
>
> In fact, if there is a problem with cpu_synchronize_all_states(), it
> must be a general one.
>
> The only reason I even thought of calling cpu_synchronize_all_states()
> here, after seeing the out-of-date register values, is that I recalled
> it from when I worked on dump-guest-memory. dump-guest-memory writes the
> register file to the dump, and so it needs cpu_synchronize_all_states().
>
> But, again, cpu_synchronize_all_states() is a general facility, and I
> can't promise to work on it.
>
> > with
> > follow up filtering out of the same SMBASE in reset state CPUs
> > if possible in 2.9 merge window.
>
> I'm still not convinced that this filtering is absolutely necessary, in
> the face of feature negotiation; *but*, if someone figures out what's
> wrong with cpu_synchronize_all_states(), and manages to make its
> performance impact negligible, then I am certainly willing to add the
> filtering to the SMI broadcast.
>
> At that point, it shouldn't be much work, I hope.
>
> >
> >> Why are we branching out this far from that? Just to avoid the CPU
> >> unpark feature that Paolo suggested (which would also be negotiated)?
> >>
> >> What is so wrong with the planned CPU unpark stuff that justifies
> >> blocking this patch?
> >>
> >> Honestly I don't understand why we can't approach these features
> >> *incrementally*. Let's do the negotiable SMI broadcast now, then do VCPU
> >> hotplug later. I think the current negotiation pattern is flexible and
> >> future-proofed enough for that. It was you who suggested, for
> >> simplifying the last patch in the series, that we completely ignore VCPU
> >> hotplug for now (and I whole-heartedly agreed).
> >>
> >> Let me put it like this: what is in this patch, in your opinion, that
> >> *irrevocably* breaks the upcoming VCPU hotplug feature, or else limits
> >> our options in implementing it? In my opinion, we can later implement
> >> *anything at all*, dependent on new feature bits. We can even decide to
> >> reject guest feature requests that specify *only* bit#0; which
> >> practically means disabling guests (using SMIs) that don't conform to
> >> our VCPU hotplug implementation.
> > Why do negotiation if hotplug could be done without it,
>
> Because it lets us segment the feature set into several small steps,
> where I have a chance of grasping the small pieces and maybe even
> contributing small, surgical patches?
>
> You and Paolo might be seeing the big picture, and I certainly don't
> want to go *against* the big picture.
> I just want to advance in steps as small as possible.
>
> There's the historical observation that a compiler has as many stages as
> there are sub-teams in the compiler team. (Please excuse me for not
> recalling the exact phrasing and the person who quipped it.) That
> observation exists for a reason. Feature and module boundaries tend to
> mirror human understanding.
>
> > on the hw (QEMU)
> > side, hotplug seems to be implemented sufficiently (but we are still
> > missing SMRR support). The only thing needed to make hotplug work with
> > OVMF might be issuing an SMI either directly from QEMU or from AML
> > (a write into APM_CNT) after a CPU is hotplugged.
>
> Maybe. For me, that looms in the future.
>
> If you agree with my participation as outlined above; that is,
> - I care about this exact patch, as posted,
> - someone else looks into cpu_synchronize_all_states(),
CCing Radim, who graciously agreed to take a look at what is wrong on the
KVM side. Could you give him steps to reproduce the issue, please?

> - and then I'm willing to care about the incremental patch for the
>   filtering,
ok

> then I propose we go ahead with this patch. It's the last one in the
> series that needs your R-b.
Reviewed-by: Igor Mammedov <imammedo@redhat.com>

>
> Thanks
> Laszlo
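As a starting point for the cpu_synchronize_all_states() investigation handed
to Radim above, one rough way to quantify the per-call cost is to bracket the
call with a clock read inside the experimental ich9_apm_broadcast_smi() quoted
earlier in the thread. This is only a sketch of such an instrumentation, not
code that was posted in the series:

    /* Sketch only: measure one cpu_synchronize_all_states() call. */
    int64_t t0 = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
    cpu_synchronize_all_states();
    fprintf(stderr, "cpu_synchronize_all_states: %" PRId64 " ns\n",
            qemu_clock_get_ns(QEMU_CLOCK_REALTIME) - t0);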
diff --git a/include/hw/i386/ich9.h b/include/hw/i386/ich9.h
index da1118727146..18dcca7ebcbf 100644
--- a/include/hw/i386/ich9.h
+++ b/include/hw/i386/ich9.h
@@ -250,4 +250,7 @@ Object *ich9_lpc_find(void);
 #define ICH9_SMB_HST_D1                         0x06
 #define ICH9_SMB_HOST_BLOCK_DB                  0x07
 
+/* bit positions used in fw_cfg SMI feature negotiation */
+#define ICH9_LPC_SMI_F_BROADCAST_BIT            0
+
 #endif /* HW_ICH9_H */
diff --git a/hw/isa/lpc_ich9.c b/hw/isa/lpc_ich9.c
index 376b7801a42c..ced6f803a4f2 100644
--- a/hw/isa/lpc_ich9.c
+++ b/hw/isa/lpc_ich9.c
@@ -437,7 +437,15 @@ static void ich9_apm_ctrl_changed(uint32_t val, void *arg)
 
     /* SMI_EN = PMBASE + 30. SMI control and enable register */
     if (lpc->pm.smi_en & ICH9_PMIO_SMI_EN_APMC_EN) {
-        cpu_interrupt(current_cpu, CPU_INTERRUPT_SMI);
+        if (lpc->smi_negotiated_features &
+            (UINT64_C(1) << ICH9_LPC_SMI_F_BROADCAST_BIT)) {
+            CPUState *cs;
+            CPU_FOREACH(cs) {
+                cpu_interrupt(cs, CPU_INTERRUPT_SMI);
+            }
+        } else {
+            cpu_interrupt(current_cpu, CPU_INTERRUPT_SMI);
+        }
     }
 }
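For context on how the broadcast bit checked above becomes negotiable at all,
a rough sketch of exposing it as a host-side feature bit follows. The property
name "x-smi-broadcast" and the smi_host_features field are assumptions here
(they belong to a follow-up patch, not to this one):

    /* Sketch only: advertise the broadcast capability as a bit property,
     * so machine compat code can mask it out for older machine types. */
    static Property ich9_lpc_properties[] = {
        DEFINE_PROP_BIT64("x-smi-broadcast", ICH9LPCState, smi_host_features,
                          ICH9_LPC_SMI_F_BROADCAST_BIT, true),
        DEFINE_PROP_END_OF_LIST(),
    };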