| Message ID | 20200710161704.309824-4-imammedo@redhat.com (mailing list archive) |
| --- | --- |
| State | New, archived |
| Series | x86: fix cpu hotplug with secure boot |
(CC'ing Peter Krempa due to virsh setvcpu (singular) / setvcpus (plural)
references)

On 07/10/20 18:17, Igor Mammedov wrote:
> In case firmware has negotiated CPU hotplug SMI feature, generate
> AML to describe SMI IO port region and send SMI to firmware
> on each CPU hotplug SCI.
>
> It might be not really usable, but should serve as a starting point to
> discuss how better to deal with split hotplug sequence during hot-add
> (
> ex scenario where it will break is:
> hot-add
>   -> (QEMU) add CPU in hotplug regs
>   -> (QEMU) SCI
>     -1-> (OS) scan
>       -1-> (OS) SMI
>       -1-> (FW) pull in CPU1 ***
>       -1-> (OS) start iterating hotplug regs
> hot-add
>   -> (QEMU) add CPU in hotplug regs
>   -> (QEMU) SCI
>     -2-> (OS) scan (blocked on mutex till previous scan is finished)
>       -1-> (OS) 1st added CPU1 send device check event -> INIT/SIPI
>       -1-> (OS) 1st added CPU2 send device check event -> INIT/SIPI
>            that's where it explodes, since FW didn't see CPU2
>            when SMI was called
> )
>
> hot remove will throw in yet another set of problems, so lets discuss
> both here and see if we can really share hotplug registers block between
> FW and AML or we should do something else with it.

This issue is generally triggered by management applications such as
libvirt that issue device_add commands in quick succession. For libvirt,
the command is "virsh setvcpus" (plural) with multiple CPUs specified
for plugging. The singular "virsh setvcpu" command, which is more
friendly towards guest NUMA, does not run into the symptom.

The scope of the scan method lock is not large enough, with SMI in the
picture.

I suggest that we not uproot the existing AML code or the hotplug
register block. Instead, I suggest that we add serialization at a higher
level, with sufficient scope.
QEMU:

- introduce a new flag standing for "CPU plug operation in progress"

- if ICH9_LPC_SMI_F_BROADCAST_BIT has been negotiated:

  - "device_add" and "device_del" should enforce
    ICH9_LPC_SMI_F_CPU_HOTPLUG_BIT and
    ICH9_LPC_SMI_F_CPU_HOT_UNPLUG_BIT, respectively

  - both device_add and device_del (for VCPUs) should check the
    "in progress" flag.

    - If set, reject the request synchronously

    - Otherwise, set the flag, and commence the operation

  - in cpu_hotplug_wr(), where we emit the ACPI_DEVICE_OST event with
    qapi_event_send_acpi_device_ost(), clear the "in-progress" flag.

  - If QEMU executes the QMP command processing and the cpu_hotplug_wr()
    function on different (host OS) threads, then perhaps we should use
    an atomic type for the flag. (Not sure about locking between QEMU
    threads, sorry.) I don't really expect race conditions, but in case
    we ever get stuck with the flag, we should make sure that the stuck
    state is "in progress", and not "not in progress". (The former state
    can prevent further plug operations, but cannot cause the guest to
    lose state.)

libvirtd:

- the above QEMU changes will gracefully prevent "virsh setvcpus"
  (plural) from working, so the libvirtd changes should "only" aim at
  coping with the new serialization added in QEMU

- introduce a new info channel between qemuDomainSetVcpusLive() and
  qemuProcessHandleAcpiOstInfo(). In the former, delay sending the next
  "device_add" command until ACPI_DEVICE_OST is queued by
  qemuProcessHandleAcpiOstInfo().

My point is that we have 60+ SMI feature bits left, so if we have to
invent something entirely new for unplug, we can still do it later. I'm
unable to foresee the particulars of unplug at this stage, beyond what
I've said already in this thread (i.e. that we should immediately
introduce the unplug feature bit, just so we can reject unplug
gracefully). If the current ACPI stuff, the hotplug register block,
etc. prove *inextensible for unplug*, we can use further feature bits
for something entirely new.
But I think we can extend what we have now for completing plug only. And
I think modifying QEMU and libvirtd for that is fair -- I've just read
the _OST method documentation in the ACPI spec, and _OST was clearly
designed for coordinating (un)plug between agents. For example, my
understanding is that memory hotplug can fail, and the failure is
reported via ACPI_DEVICE_OST to libvirtd; so ACPI_DEVICE_OST is already
used for synchronizing guest operations with management layer
operations.

>
> Signed-off-by: Igor Mammedov <imammedo@redhat.com>
> ---
> v0:
>  - s/aml_string/aml_eisaid/
>    /fixes Windows BSOD, on nonsense value/ (Laszlo Ersek <lersek@redhat.com>)
>  - put SMI device under PCI0 like the rest of hotplug IO port
>  - do not generate SMI AML if CPU hotplug SMI feature hasn't been negotiated
> ---
>  include/hw/acpi/cpu.h |  1 +
>  hw/acpi/cpu.c         |  6 ++++++
>  hw/i386/acpi-build.c  | 33 ++++++++++++++++++++++++++++++++-
>  hw/isa/lpc_ich9.c     |  3 +++
>  4 files changed, 42 insertions(+), 1 deletion(-)
>
> diff --git a/include/hw/acpi/cpu.h b/include/hw/acpi/cpu.h
> index 62f0278ba2..0eeedaa491 100644
> --- a/include/hw/acpi/cpu.h
> +++ b/include/hw/acpi/cpu.h
> @@ -50,6 +50,7 @@ void cpu_hotplug_hw_init(MemoryRegion *as, Object *owner,
>  typedef struct CPUHotplugFeatures {
>      bool acpi_1_compatible;
>      bool has_legacy_cphp;
> +    const char *smi_path;
>  } CPUHotplugFeatures;
>
>  void build_cpus_aml(Aml *table, MachineState *machine, CPUHotplugFeatures opts,
> diff --git a/hw/acpi/cpu.c b/hw/acpi/cpu.c
> index 3d6a500fb7..6539bc23f6 100644
> --- a/hw/acpi/cpu.c
> +++ b/hw/acpi/cpu.c
> @@ -14,6 +14,8 @@
>  #define ACPI_CPU_CMD_DATA_OFFSET_RW 8
>  #define ACPI_CPU_CMD_DATA2_OFFSET_R 0
>
> +#define OVMF_CPUHP_SMI_CMD 4
> +
>  enum {
>      CPHP_GET_NEXT_CPU_WITH_EVENT_CMD = 0,
>      CPHP_OST_EVENT_CMD = 1,
> @@ -473,6 +475,10 @@ void build_cpus_aml(Aml *table, MachineState *machine, CPUHotplugFeatures opts,
>      Aml *next_cpu_cmd = aml_int(CPHP_GET_NEXT_CPU_WITH_EVENT_CMD);
>
>      aml_append(method,
>                 aml_acquire(ctrl_lock, 0xFFFF));
> +    if (opts.smi_path) {
> +        aml_append(method, aml_store(aml_int(OVMF_CPUHP_SMI_CMD),
> +                                     aml_name("%s", opts.smi_path)));
> +    }
>      aml_append(method, aml_store(one, has_event));
>      while_ctx = aml_while(aml_equal(has_event, one));
>      {
> diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
> index b7bcbbbb2a..1e21386f0c 100644
> --- a/hw/i386/acpi-build.c
> +++ b/hw/i386/acpi-build.c
> @@ -95,6 +95,7 @@ typedef struct AcpiPmInfo {
>      bool s3_disabled;
>      bool s4_disabled;
>      bool pcihp_bridge_en;
> +    bool smi_on_cpuhp;
>      uint8_t s4_val;
>      AcpiFadtData fadt;
>      uint16_t cpu_hp_io_base;
> @@ -194,6 +195,7 @@ static void acpi_get_pm_info(MachineState *machine, AcpiPmInfo *pm)
>      pm->cpu_hp_io_base = 0;
>      pm->pcihp_io_base = 0;
>      pm->pcihp_io_len = 0;
> +    pm->smi_on_cpuhp = false;
>
>      assert(obj);
>      init_common_fadt_data(machine, obj, &pm->fadt);
> @@ -213,6 +215,9 @@ static void acpi_get_pm_info(MachineState *machine, AcpiPmInfo *pm)
>          pm->fadt.reset_val = 0xf;
>          pm->fadt.flags |= 1 << ACPI_FADT_F_RESET_REG_SUP;
>          pm->cpu_hp_io_base = ICH9_CPU_HOTPLUG_IO_BASE;
> +        pm->smi_on_cpuhp =
> +            !!(object_property_get_uint(lpc, "x-smi-negotiated-features", NULL)
> +               & BIT_ULL(ICH9_LPC_SMI_F_CPU_HOTPLUG_BIT));

(1) style: I think the "&" operator should be placed at the end of the
line, not at the start

(2) style: can we introduce a macro for the new property name?
>      }
>
>  /* The above need not be conditional on machine type because the reset port
> @@ -1515,6 +1520,31 @@ build_dsdt(GArray *table_data, BIOSLinker *linker,
>          aml_append(dev, aml_name_decl("_UID", aml_int(1)));
>          aml_append(dev, build_q35_osc_method());
>          aml_append(sb_scope, dev);
> +
> +        if (pm->smi_on_cpuhp) {
> +            /* reserve SMI block resources */
> +            dev = aml_device("PCI0.SMI0");
> +            aml_append(dev, aml_name_decl("_HID", aml_eisaid("PNP0A06")));
> +            aml_append(dev, aml_name_decl("_UID", aml_string("SMI resources")));
> +            crs = aml_resource_template();
> +            aml_append(crs,
> +                aml_io(
> +                       AML_DECODE16,
> +                       ACPI_PORT_SMI_CMD,
> +                       ACPI_PORT_SMI_CMD,
> +                       1,
> +                       1)
> +            );

(3) Just a thought: I wonder if we should reserve both ports (0xB2 and
0xB3 too). For now we don't have any use for the "data" port, but
arguably it's part of the same register block.

> +            aml_append(dev, aml_name_decl("_CRS", crs));
> +            aml_append(dev, aml_operation_region("SMIR", AML_SYSTEM_IO,
> +                                                 aml_int(0xB2), 1));

(4) Style: we should probably use ACPI_PORT_SMI_CMD rather than 0xB2

(5) Same as (3) -- make the "data" port part of the opregion? (We'd
still make the SMIC field below cover only 0xB2, of course.)

Anyway, feel free to ignore (3) and (5) if they are out of scope.

> +            field = aml_field("SMIR", AML_BYTE_ACC, AML_NOLOCK,
> +                              AML_WRITE_AS_ZEROS);
> +            aml_append(field, aml_named_field("SMIC", 8));
> +            aml_append(dev, field);
> +            aml_append(sb_scope, dev);
> +        }
> +
>          aml_append(dsdt, sb_scope);
>
>          build_hpet_aml(dsdt);
> @@ -1530,7 +1560,8 @@ build_dsdt(GArray *table_data, BIOSLinker *linker,
>          build_legacy_cpu_hotplug_aml(dsdt, machine, pm->cpu_hp_io_base);
>      } else {
>          CPUHotplugFeatures opts = {
> -            .acpi_1_compatible = true, .has_legacy_cphp = true
> +            .acpi_1_compatible = true, .has_legacy_cphp = true,
> +            .smi_path = pm->smi_on_cpuhp ?
> "\\_SB.PCI0.SMI0.SMIC" : NULL,
>          };
>          build_cpus_aml(dsdt, machine, opts, pm->cpu_hp_io_base,
>                         "\\_SB.PCI0", "\\_GPE._E02");
> diff --git a/hw/isa/lpc_ich9.c b/hw/isa/lpc_ich9.c
> index 83da7346ab..5ebea70a8d 100644
> --- a/hw/isa/lpc_ich9.c
> +++ b/hw/isa/lpc_ich9.c
> @@ -643,6 +643,9 @@ static void ich9_lpc_initfn(Object *obj)
>                                    &acpi_enable_cmd, OBJ_PROP_FLAG_READ);
>      object_property_add_uint8_ptr(OBJECT(lpc), ACPI_PM_PROP_ACPI_DISABLE_CMD,
>                                    &acpi_disable_cmd, OBJ_PROP_FLAG_READ);
> +    object_property_add_uint64_ptr(obj, "x-smi-negotiated-features",

(6) same as (2), we should use a macro for the property name.

> +                                   &lpc->smi_negotiated_features,
> +                                   OBJ_PROP_FLAG_READ);
>
>      ich9_pm_add_properties(obj, &lpc->pm);
>

I'll follow up with test results.

Thanks
Laszlo
On 07/14/20 14:28, Laszlo Ersek wrote:
[...]
> - in cpu_hotplug_wr(), where we emit the ACPI_DEVICE_OST event with
> qapi_event_send_acpi_device_ost(), clear the "in-progress" flag.
[...]

Furthermore, the "CPU plug operation in progress" flag should be:
- either migrated,
- or a migration blocker.

Because on the destination host, device_add should be possible if and
only if the plug operation completed (either still on the source host,
or on the destination host).

I guess that the "migration blocker" option is easier.

Anyway I assume this is already handled with memory hotplug somehow
(i.e., migration attempt between device_add and ACPI_DEVICE_OST).

Thanks,
Laszlo
On Tue, 14 Jul 2020 14:41:28 +0200
Laszlo Ersek <lersek@redhat.com> wrote:

> On 07/14/20 14:28, Laszlo Ersek wrote:
[...]
> Furthermore, the "CPU plug operation in progress" flag should be:
> - either migrated,
> - or a migration blocker.
>
> Because on the destination host, device_add should be possible if and
> only if the plug operation completed (either still on the source host,
> or on the destination host).

I have a way more simple alternative idea, which doesn't involve libvirt.

We can change AML to
 1. cache hotplugged CPUs from controller
 2. send SMI
 3. send Notify event to OS to online cached CPUs
this way we never INIT/SIPI cpus that FW hasn't seen yet
as for FW, it can relocate extra CPU that arrived after #1
it won't cause any harm as on the next SCI AML will pick up those
CPUs and SMI upcall will be just NOP.

I'll post a patch here on top of this series for you to try
(without any of your comments addressed yet, as it's already written
and I was testing it for a while to make sure it won't explode
with various windows versions)

> I guess that the "migration blocker" option is easier.
>
> Anyway I assume this is already handled with memory hotplug somehow
> (i.e., migration attempt between device_add and ACPI_DEVICE_OST).

Thanks for comments,
I'll need some time to ponder on other comments and do some
palaeontology research to answer questions
(aka. I need to make up excuses for the code I wrote :) )
On 07/14/20 17:19, Igor Mammedov wrote:
> On Tue, 14 Jul 2020 14:41:28 +0200
> Laszlo Ersek <lersek@redhat.com> wrote:
[...]
> I have a way more simple alternative idea, which doesn't involve libvirt.
>
> We can change AML to
> 1. cache hotplugged CPUs from controller
> 2. send SMI
> 3. send Notify event to OS to online cached CPUs
> this way we never INIT/SIPI cpus that FW hasn't seen yet
> as for FW, it can relocate extra CPU that arrived after #1
> it won't cause any harm as on the next SCI AML will pick up those
> CPUs and SMI upcall will be just NOP.
>
> I'll post a patch here on top of this series for you to try
> (without any of your comments addressed yet, as it's already written
> and I was testing it for a while to make sure it won't explode
> with various windows versions)

Sounds good, I'll be happy to test it.

Indeed "no event" is something that the fw deals with gracefully. (IIRC
I wanted to cover a "spurious SMI" explicitly.)

It didn't occur to me that you could dynamically size e.g. a package
object in AML. Based on my reading of the ACPI spec, "VarPackageOp" can
take a *runtime* "NumElements", so if you did two loops, the first loop
could count the pending stuff, and then a VarPackageOp could be used
with just the right NumElements... Anyway, I digress :)

> I'll need some time to ponder on other comments and do some
> palaeontology research to answer questions
> (aka. I need to make up excuses for the code I wrote :) )

haha, thanks :)
Laszlo
On Wed, 15 Jul 2020 14:38:00 +0200
Laszlo Ersek <lersek@redhat.com> wrote:

> On 07/14/20 17:19, Igor Mammedov wrote:
[...]
> Sounds good, I'll be happy to test it.
>
> Indeed "no event" is something that the fw deals with gracefully. (IIRC
> I wanted to cover a "spurious SMI" explicitly.)

is it possible to distinguish a "spurious SMI" from a hotplug SMI?
if yes, then we probably do not care about any SMIs other than the
hotplug one.

> It didn't occur to me that you could dynamically size e.g. a package
> object in AML. Based on my reading of the ACPI spec, "VarPackageOp" can
> take a *runtime* "NumElements", so if you did two loops, the first loop
> could count the pending stuff, and then a VarPackageOp could be used
> with just the right NumElements... Anyway, I digress :)

well, it's a minefield, since Windows implements only a subset of the
spec and VarPackageOp is not available on all versions that support
hotplug. I think I've narrowed the language down to the supported
subset, so I need to complete another round of testing to see if I
didn't break anything by accident.
On 07/15/20 15:43, Igor Mammedov wrote:
> On Wed, 15 Jul 2020 14:38:00 +0200
> Laszlo Ersek <lersek@redhat.com> wrote:
[...]
>> Indeed "no event" is something that the fw deals with gracefully. (IIRC
>> I wanted to cover a "spurious SMI" explicitly.)
> is it possible to distinguish "spurious SMI" vs hotplug SMI,
> if yes then we probably do not care about any other SMIs except hotplug one.

Sorry, I was unclear. By "spurious SMI", I meant that the hotplug SMI
(command value 4) is raised, but then the hotplug register block reports
no CPUs with pending events. The fw code handles that.

> well, it's mine field since Windows implement only a subset of spec
> and VarPackageOp is not avalable on all version that support hotplug.

Ugh. :/

> I think, I've narrowed language down to supported subset, so I need
> to complete another round of testing to see if I didn't break anything
> by accident.

Thanks.
Laszlo
On Tue, 14 Jul 2020 14:28:29 +0200
Laszlo Ersek <lersek@redhat.com> wrote:

> (CC'ing Peter Krempa due to virsh setvcpu (singular) / setvcpus
> (plural) references)
>
> On 07/10/20 18:17, Igor Mammedov wrote:
[...]

> (3) Just a thought: I wonder if we should reserve both ports (0xB2 and
> 0xB3 too). For now we don't have any use for the "data" port, but
> arguably it's part of the same register block.

we probably should, might be used for unplug part.

BTW any ideas how we'd like to proceed with unplug?

current flow looks like:

     QEMU                          OSPM
     unplug_req()
1)     =>SCI --->
-------------------------
2)                                 handle_sci()
                                   scan for events and send
                                   Notify per removed CPU
                                   clear rm_evt
-------------------------
3)                                 offline cpu
-------------------------
4)                                 call _EJ0 to unplug CPU
                                   write into ej_evt
       <-------------
-------------------------
5)   unplug cb

We probably should modify _EJ0 to send SMI instead of direct access
to cpuhp block, the question is how OSPM would tell FW which CPU it
ejects.

[...]
On 07/17/20 15:13, Igor Mammedov wrote:
> On Tue, 14 Jul 2020 14:28:29 +0200
> Laszlo Ersek <lersek@redhat.com> wrote:
>
>> (CC'ing Peter Krempa due to virsh setvcpu (singular) / setvcpus
>> (plural) references)
>>
>> On 07/10/20 18:17, Igor Mammedov wrote:
> [...]
>
>> (3) Just a thought: I wonder if we should reserve both ports (0xB2 and
>> 0xB3 too). For now we don't have any use for the "data" port, but
>> arguably it's part of the same register block.
>
> we probably should, might be used for unplug part.
>
> BTW any ideas how we'd like to proceed with unplug?
>
> current flow looks like:
>
>      QEMU                          OSPM
>      unplug_req()
> 1)     =>SCI --->
> -------------------------
> 2)                                 handle_sci()
>                                    scan for events and send
>                                    Notify per removed CPU
>                                    clear rm_evt
> -------------------------
> 3)                                 offline cpu
> -------------------------
> 4)                                 call _EJ0 to unplug CPU
>                                    write into ej_evt
>        <-------------
> -------------------------
> 5)   unplug cb
>
> We probably should modify _EJ0 to send SMI instead of direct access
> to cpuhp block, the question is how OSPM would tell FW which CPU it
> ejects.

The optimal solution would be DataTableRegion, but of course Windows
doesn't support that. So the usual replacement is something like below
(we used a variant of it in "docs/specs/vmgenid.txt"):

- Create a new ACPI data table. See "Appendix O - UEFI ACPI Data Table",
  table 94, in the UEFI 2.8 spec <https://uefi.org/specifications>. In
  this table, set

    Identifier = B055E4C3-B5B5-4948-9D02-A46FCA3B0A3B

  (I just generated this with "uuidgen"). Set DataOffset to 54 decimal.
  Place 4 zeroes at offset 54, for Data.

- Create a SystemMemory Operation Region, 58 bytes in size (for covering
  the entire table), with a single 4-byte Field at offset 54. Use the
  add_pointer command for pointing the OperationRegion opcode to offset
  0 (that is, the very beginning) of the above-created ACPI table.

  (Normally I would suggest creating an Integer object, using
  add_pointer on the Integer object, and then dynamically creating an
  OperationRegion at that Integer address -- but I don't know if Windows
  can deal with dynamically determined OperationRegion addresses.)

- In the CPU hot-unplug ACPI code, write the APIC ID (or the selector I
  guess?) of the CPU being un-plugged to the field at offset 54, before
  raising the SMI.

- Raise the SMI with a new command (value 5, I guess).

- In OVMF, the CpuHotplugSmm driver can register an EndOfDxe callback,
  and use EFI_ACPI_SDT_PROTOCOL.GetAcpiTable(), in a loop, for locating
  the data table. Save a pointer (in SMRAM) to offset 54 of that table.
  At runtime, when handling the SMI (command 5), read the APIC ID (or
  selector) from the data field at offset 54.

  (The AddTableToList() function in edk2's
  "MdeModulePkg/Universal/Acpi/AcpiTableDxe/AcpiTableProtocol.c" file
  makes sure that "UEFI ACPI Data Tables" are allocated in AcpiNVS
  memory, which is permitted for use as an "SMM communication buffer".)

(The difference with VMGENID is that QEMU need not learn of the
guest-side address of the data table, hence no "write_pointer" command.
Instead, CpuHotplugSmm needs to learn of the address. That can be done
in an EndOfDxe callback, using EFI_ACPI_SDT_PROTOCOL.GetAcpiTable() in a
loop, but that does require QEMU to produce a real ACPI data table, not
just a header-less blob placed in AcpiNVS memory.)

Thanks,
Laszlo
diff --git a/include/hw/acpi/cpu.h b/include/hw/acpi/cpu.h
index 62f0278ba2..0eeedaa491 100644
--- a/include/hw/acpi/cpu.h
+++ b/include/hw/acpi/cpu.h
@@ -50,6 +50,7 @@ void cpu_hotplug_hw_init(MemoryRegion *as, Object *owner,
 typedef struct CPUHotplugFeatures {
     bool acpi_1_compatible;
     bool has_legacy_cphp;
+    const char *smi_path;
 } CPUHotplugFeatures;
 
 void build_cpus_aml(Aml *table, MachineState *machine, CPUHotplugFeatures opts,
diff --git a/hw/acpi/cpu.c b/hw/acpi/cpu.c
index 3d6a500fb7..6539bc23f6 100644
--- a/hw/acpi/cpu.c
+++ b/hw/acpi/cpu.c
@@ -14,6 +14,8 @@
 #define ACPI_CPU_CMD_DATA_OFFSET_RW 8
 #define ACPI_CPU_CMD_DATA2_OFFSET_R 0
 
+#define OVMF_CPUHP_SMI_CMD 4
+
 enum {
     CPHP_GET_NEXT_CPU_WITH_EVENT_CMD = 0,
     CPHP_OST_EVENT_CMD = 1,
@@ -473,6 +475,10 @@ void build_cpus_aml(Aml *table, MachineState *machine, CPUHotplugFeatures opts,
         Aml *next_cpu_cmd = aml_int(CPHP_GET_NEXT_CPU_WITH_EVENT_CMD);
 
         aml_append(method, aml_acquire(ctrl_lock, 0xFFFF));
+        if (opts.smi_path) {
+            aml_append(method, aml_store(aml_int(OVMF_CPUHP_SMI_CMD),
+                                         aml_name("%s", opts.smi_path)));
+        }
         aml_append(method, aml_store(one, has_event));
         while_ctx = aml_while(aml_equal(has_event, one));
         {
diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
index b7bcbbbb2a..1e21386f0c 100644
--- a/hw/i386/acpi-build.c
+++ b/hw/i386/acpi-build.c
@@ -95,6 +95,7 @@ typedef struct AcpiPmInfo {
     bool s3_disabled;
     bool s4_disabled;
     bool pcihp_bridge_en;
+    bool smi_on_cpuhp;
     uint8_t s4_val;
     AcpiFadtData fadt;
     uint16_t cpu_hp_io_base;
@@ -194,6 +195,7 @@ static void acpi_get_pm_info(MachineState *machine, AcpiPmInfo *pm)
     pm->cpu_hp_io_base = 0;
     pm->pcihp_io_base = 0;
     pm->pcihp_io_len = 0;
+    pm->smi_on_cpuhp = false;
 
     assert(obj);
     init_common_fadt_data(machine, obj, &pm->fadt);
@@ -213,6 +215,9 @@ static void acpi_get_pm_info(MachineState *machine, AcpiPmInfo *pm)
         pm->fadt.reset_val = 0xf;
         pm->fadt.flags |= 1 << ACPI_FADT_F_RESET_REG_SUP;
         pm->cpu_hp_io_base = ICH9_CPU_HOTPLUG_IO_BASE;
+        pm->smi_on_cpuhp =
+            !!(object_property_get_uint(lpc, "x-smi-negotiated-features", NULL)
+               & BIT_ULL(ICH9_LPC_SMI_F_CPU_HOTPLUG_BIT));
     }
 
     /* The above need not be conditional on machine type because the reset port
@@ -1515,6 +1520,31 @@ build_dsdt(GArray *table_data, BIOSLinker *linker,
         aml_append(dev, aml_name_decl("_UID", aml_int(1)));
         aml_append(dev, build_q35_osc_method());
         aml_append(sb_scope, dev);
+
+        if (pm->smi_on_cpuhp) {
+            /* reserve SMI block resources */
+            dev = aml_device("PCI0.SMI0");
+            aml_append(dev, aml_name_decl("_HID", aml_eisaid("PNP0A06")));
+            aml_append(dev, aml_name_decl("_UID", aml_string("SMI resources")));
+            crs = aml_resource_template();
+            aml_append(crs,
+                aml_io(
+                       AML_DECODE16,
+                       ACPI_PORT_SMI_CMD,
+                       ACPI_PORT_SMI_CMD,
+                       1,
+                       1)
+            );
+            aml_append(dev, aml_name_decl("_CRS", crs));
+            aml_append(dev, aml_operation_region("SMIR", AML_SYSTEM_IO,
+                                                 aml_int(0xB2), 1));
+            field = aml_field("SMIR", AML_BYTE_ACC, AML_NOLOCK,
+                              AML_WRITE_AS_ZEROS);
+            aml_append(field, aml_named_field("SMIC", 8));
+            aml_append(dev, field);
+            aml_append(sb_scope, dev);
+        }
+
         aml_append(dsdt, sb_scope);
 
         build_hpet_aml(dsdt);
@@ -1530,7 +1560,8 @@ build_dsdt(GArray *table_data, BIOSLinker *linker,
         build_legacy_cpu_hotplug_aml(dsdt, machine, pm->cpu_hp_io_base);
     } else {
         CPUHotplugFeatures opts = {
-            .acpi_1_compatible = true, .has_legacy_cphp = true
+            .acpi_1_compatible = true, .has_legacy_cphp = true,
+            .smi_path = pm->smi_on_cpuhp ? "\\_SB.PCI0.SMI0.SMIC" : NULL,
         };
         build_cpus_aml(dsdt, machine, opts, pm->cpu_hp_io_base,
                        "\\_SB.PCI0", "\\_GPE._E02");
diff --git a/hw/isa/lpc_ich9.c b/hw/isa/lpc_ich9.c
index 83da7346ab..5ebea70a8d 100644
--- a/hw/isa/lpc_ich9.c
+++ b/hw/isa/lpc_ich9.c
@@ -643,6 +643,9 @@ static void ich9_lpc_initfn(Object *obj)
                                   &acpi_enable_cmd, OBJ_PROP_FLAG_READ);
     object_property_add_uint8_ptr(OBJECT(lpc), ACPI_PM_PROP_ACPI_DISABLE_CMD,
                                   &acpi_disable_cmd, OBJ_PROP_FLAG_READ);
+    object_property_add_uint64_ptr(obj, "x-smi-negotiated-features",
+                                   &lpc->smi_negotiated_features,
+                                   OBJ_PROP_FLAG_READ);
 
     ich9_pm_add_properties(obj, &lpc->pm);
 }
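[Editorial aside: for readers who do not parse the aml_*() builder calls
easily, the SMI0 device generated by the acpi-build.c hunk above should
decompile to roughly the following ASL. This is a hand-written sketch,
not actual iasl disassembler output.]

```
Device (\_SB.PCI0.SMI0)
{
    Name (_HID, EisaId ("PNP0A06"))
    Name (_UID, "SMI resources")
    Name (_CRS, ResourceTemplate ()
    {
        IO (Decode16, 0xB2, 0xB2, 0x01, 0x01)
    })
    OperationRegion (SMIR, SystemIO, 0xB2, One)
    Field (SMIR, ByteAcc, NoLock, WriteAsZeros)
    {
        SMIC, 8
    }
}
```

In addition, the hw/acpi/cpu.c hunk makes the CPU scan method store
OVMF_CPUHP_SMI_CMD (4) into \_SB.PCI0.SMI0.SMIC right after acquiring
the hotplug lock, i.e. the SMI is raised before the event-scan loop
runs -- which is exactly the ordering the hot-add race discussed in
this thread revolves around.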
In case firmware has negotiated CPU hotplug SMI feature, generate
AML to describe SMI IO port region and send SMI to firmware
on each CPU hotplug SCI.

It might not be really usable, but it should serve as a starting point
to discuss how better to deal with the split hotplug sequence during
hot-add
(
ex scenario where it will break is:
  hot-add
    -> (QEMU) add CPU in hotplug regs
    -> (QEMU) SCI
       -1-> (OS) scan
          -1-> (OS) SMI
             -1-> (FW) pull in CPU1 ***
          -1-> (OS) start iterating hotplug regs
  hot-add
    -> (QEMU) add CPU in hotplug regs
    -> (QEMU) SCI
       -2-> (OS) scan (blocked on mutex till previous scan is finished)
  -1-> (OS) 1st added CPU1 send device check event -> INIT/SIPI
  -1-> (OS) 1st added CPU2 send device check event -> INIT/SIPI
       that's where it explodes, since FW didn't see CPU2
       when SMI was called
)

hot remove will throw in yet another set of problems, so let's discuss
both here and see if we can really share the hotplug register block
between FW and AML, or if we should do something else with it.

Signed-off-by: Igor Mammedov <imammedo@redhat.com>
---
v0:
  - s/aml_string/aml_eisaid/
    /fixes Windows BSOD, on nonsense value/ (Laszlo Ersek
    <lersek@redhat.com>)
  - put SMI device under PCI0 like the rest of hotplug IO port
  - do not generate SMI AML if CPU hotplug SMI feature hasn't been
    negotiated
---
 include/hw/acpi/cpu.h |  1 +
 hw/acpi/cpu.c         |  6 ++++++
 hw/i386/acpi-build.c  | 33 ++++++++++++++++++++++++++++++++-
 hw/isa/lpc_ich9.c     |  3 +++
 4 files changed, 42 insertions(+), 1 deletion(-)