diff mbox series

acpi: cpuhp: fix guest-visible maximum access size to the legacy reg block

Message ID 20230104090138.214862-1-lersek@redhat.com (mailing list archive)
State New, archived
Headers show
Series acpi: cpuhp: fix guest-visible maximum access size to the legacy reg block | expand

Commit Message

Laszlo Ersek Jan. 4, 2023, 9:01 a.m. UTC
The modern ACPI CPU hotplug interface was introduced in the following
series (aa1dd39ca307..679dd1a957df), released in v2.7.0:

  1  abd49bc2ed2f docs: update ACPI CPU hotplug spec with new protocol
  2  16bcab97eb9f pc: piix4/ich9: add 'cpu-hotplug-legacy' property
  3  5e1b5d93887b acpi: cpuhp: add CPU devices AML with _STA method
  4  ac35f13ba8f8 pc: acpi: introduce AcpiDeviceIfClass.madt_cpu hook
  5  d2238cb6781d acpi: cpuhp: implement hot-add parts of CPU hotplug
                  interface
  6  8872c25a26cc acpi: cpuhp: implement hot-remove parts of CPU hotplug
                  interface
  7  76623d00ae57 acpi: cpuhp: add cpu._OST handling
  8  679dd1a957df pc: use new CPU hotplug interface since 2.7 machine type

Before patch#1, "docs/specs/acpi_cpu_hotplug.txt" only specified 1-byte
accesses for the hotplug register block.  Patch#1 preserved the same
restriction for the legacy register block, but:

- it specified DWORD accesses for some of the modern registers,

- in particular, the switch from the legacy block to the modern block
  would require a DWORD write to the *legacy* block.

The latter functionality was then implemented in cpu_status_write()
[hw/acpi/cpu_hotplug.c], in patch#8.

Unfortunately, all DWORD accesses depended on a dormant bug: the one
introced in earlier commit a014ed07bd5a ("memory: accept mismatching sizes
in memory_region_access_valid", 2013-05-29); first released in v1.6.0.
Due to commit a014ed07bd5a, the DWORD accesses to the *legacy* CPU hotplug
register block would work in spite of the above series *not* relaxing
"valid.max_access_size = 1" in "hw/acpi/cpu_hotplug.c":

> static const MemoryRegionOps AcpiCpuHotplug_ops = {
>     .read = cpu_status_read,
>     .write = cpu_status_write,
>     .endianness = DEVICE_LITTLE_ENDIAN,
>     .valid = {
>         .min_access_size = 1,
>         .max_access_size = 1,
>     },
> };

Later, in commits e6d0c3ce6895 ("acpi: cpuhp: introduce 'Command data 2'
field", 2020-01-22) and ae340aa3d256 ("acpi: cpuhp: spec: add typical
usecases", 2020-01-22), first released in v5.0.0, the modern CPU hotplug
interface (including the documentation) was extended with another DWORD
*read* access, namely to the "Command data 2" register, which would be
important for the guest to confirm whether it managed to switch the
register block from legacy to modern.

This functionality too silently depended on the bug from commit
a014ed07bd5a.

In commit 5d971f9e6725 ('memory: Revert "memory: accept mismatching sizes
in memory_region_access_valid"', 2020-06-26), first released in v5.1.0,
the bug from commit a014ed07bd5a was fixed (the commit was reverted).
That swiftly exposed the bug in "AcpiCpuHotplug_ops", still present from
the v2.7.0 series quoted at the top -- namely the fact that
"valid.max_access_size = 1" didn't match what the guest was supposed to
do, according to the spec ("docs/specs/acpi_cpu_hotplug.txt").

The symptom is that the "modern interface negotiation protocol"
described in commit ae340aa3d256:

> +      Use following steps to detect and enable modern CPU hotplug interface:
> +        1. Store 0x0 to the 'CPU selector' register,
> +           attempting to switch to modern mode
> +        2. Store 0x0 to the 'CPU selector' register,
> +           to ensure valid selector value
> +        3. Store 0x0 to the 'Command field' register,
> +        4. Read the 'Command data 2' register.
> +           If read value is 0x0, the modern interface is enabled.
> +           Otherwise legacy or no CPU hotplug interface available

falls apart for the guest: steps 1 and 2 are lost, because they are DWORD
writes; so no switching happens.  Step 3 (a single-byte write) is not
lost, but it has no effect; see the condition in cpu_status_write() in
patch#8.  And step 4 *misleads* the guest into thinking that the switch
worked: the DWORD read is lost again -- it returns zero to the guest
without ever reaching the device model, so the guest never learns the
switch didn't work.

This means that guest behavior centered on the "Command data 2" register
worked *only* in the v5.0.0 release; it got effectively regressed in
v5.1.0.

To make things *even more* complicated, the breakage was (and remains, as
of today) visible with TCG acceleration only.  Commit 5d971f9e6725 makes
no difference with KVM acceleration -- the DWORD accesses still work,
despite "valid.max_access_size = 1".

As commit 5d971f9e6725 suggests, fix the problem by raising
"valid.max_access_size" to 4 -- the spec now clearly instructs the guest
to perform DWORD accesses to the legacy register block too, for enabling
(and verifying!) the modern block.  In order to keep compatibility for the
device model implementation though, set "impl.max_access_size = 1", so
that wide accesses be split before they reach the legacy read/write
handlers, like they always have been on KVM, and like they were on TCG
before 5d971f9e6725 (v5.1.0).

Tested with:

- OVMF IA32 + qemu-system-i386, CPU hotplug/hot-unplug with SMM,
  intermixed with ACPI S3 suspend/resume, using KVM accel
  (regression-test);

- OVMF IA32X64 + qemu-system-x86_64, CPU hotplug/hot-unplug with SMM,
  intermixed with ACPI S3 suspend/resume, using KVM accel
  (regression-test);

- OVMF IA32 + qemu-system-i386, SMM enabled, using TCG accel; verified the
  register block switch and the present/possible CPU counting through the
  modern hotplug interface, during OVMF boot (bugfix test);

- I do not have any testcase (guest payload) for regression-testing CPU
  hotplug through the *legacy* CPU hotplug register block.

Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Ani Sinha <ani@anisinha.ca>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Maydell <peter.maydell@linaro.org>
Cc: qemu-stable@nongnu.org
Ref: "IO port write width clamping differs between TCG and KVM"
Link: http://mid.mail-archive.com/aaedee84-d3ed-a4f9-21e7-d221a28d1683@redhat.com
Link: https://lists.gnu.org/archive/html/qemu-devel/2023-01/msg00199.html
Reported-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Laszlo Ersek <lersek@redhat.com>
---

Notes:
    This should be applied to:
    
    - stable-5.2 (new branch)
    
    - stable-6.2 (new branch)
    
    - stable-7.2 (new branch)
    
    whichever is still considered maintained, as there is currently *no*
    public QEMU release in which the modern CPU hotplug register block
    works, when using TCG acceleration.  v5.0.0 works, but that minor
    release has been obsoleted by v5.2.0, which does not work.

 hw/acpi/cpu_hotplug.c | 3 +++
 1 file changed, 3 insertions(+)

Comments

Ard Biesheuvel Jan. 4, 2023, 9:33 a.m. UTC | #1
On Wed, 4 Jan 2023 at 10:01, Laszlo Ersek <lersek@redhat.com> wrote:
>
> The modern ACPI CPU hotplug interface was introduced in the following
> series (aa1dd39ca307..679dd1a957df), released in v2.7.0:
>
>   1  abd49bc2ed2f docs: update ACPI CPU hotplug spec with new protocol
>   2  16bcab97eb9f pc: piix4/ich9: add 'cpu-hotplug-legacy' property
>   3  5e1b5d93887b acpi: cpuhp: add CPU devices AML with _STA method
>   4  ac35f13ba8f8 pc: acpi: introduce AcpiDeviceIfClass.madt_cpu hook
>   5  d2238cb6781d acpi: cpuhp: implement hot-add parts of CPU hotplug
>                   interface
>   6  8872c25a26cc acpi: cpuhp: implement hot-remove parts of CPU hotplug
>                   interface
>   7  76623d00ae57 acpi: cpuhp: add cpu._OST handling
>   8  679dd1a957df pc: use new CPU hotplug interface since 2.7 machine type
>
> Before patch#1, "docs/specs/acpi_cpu_hotplug.txt" only specified 1-byte
> accesses for the hotplug register block.  Patch#1 preserved the same
> restriction for the legacy register block, but:
>
> - it specified DWORD accesses for some of the modern registers,
>
> - in particular, the switch from the legacy block to the modern block
>   would require a DWORD write to the *legacy* block.
>
> The latter functionality was then implemented in cpu_status_write()
> [hw/acpi/cpu_hotplug.c], in patch#8.
>
> Unfortunately, all DWORD accesses depended on a dormant bug: the one
> introced in earlier commit a014ed07bd5a ("memory: accept mismatching sizes
> in memory_region_access_valid", 2013-05-29); first released in v1.6.0.
> Due to commit a014ed07bd5a, the DWORD accesses to the *legacy* CPU hotplug
> register block would work in spite of the above series *not* relaxing
> "valid.max_access_size = 1" in "hw/acpi/cpu_hotplug.c":
>
> > static const MemoryRegionOps AcpiCpuHotplug_ops = {
> >     .read = cpu_status_read,
> >     .write = cpu_status_write,
> >     .endianness = DEVICE_LITTLE_ENDIAN,
> >     .valid = {
> >         .min_access_size = 1,
> >         .max_access_size = 1,
> >     },
> > };
>
> Later, in commits e6d0c3ce6895 ("acpi: cpuhp: introduce 'Command data 2'
> field", 2020-01-22) and ae340aa3d256 ("acpi: cpuhp: spec: add typical
> usecases", 2020-01-22), first released in v5.0.0, the modern CPU hotplug
> interface (including the documentation) was extended with another DWORD
> *read* access, namely to the "Command data 2" register, which would be
> important for the guest to confirm whether it managed to switch the
> register block from legacy to modern.
>
> This functionality too silently depended on the bug from commit
> a014ed07bd5a.
>
> In commit 5d971f9e6725 ('memory: Revert "memory: accept mismatching sizes
> in memory_region_access_valid"', 2020-06-26), first released in v5.1.0,
> the bug from commit a014ed07bd5a was fixed (the commit was reverted).
> That swiftly exposed the bug in "AcpiCpuHotplug_ops", still present from
> the v2.7.0 series quoted at the top -- namely the fact that
> "valid.max_access_size = 1" didn't match what the guest was supposed to
> do, according to the spec ("docs/specs/acpi_cpu_hotplug.txt").
>
> The symptom is that the "modern interface negotiation protocol"
> described in commit ae340aa3d256:
>
> > +      Use following steps to detect and enable modern CPU hotplug interface:
> > +        1. Store 0x0 to the 'CPU selector' register,
> > +           attempting to switch to modern mode
> > +        2. Store 0x0 to the 'CPU selector' register,
> > +           to ensure valid selector value
> > +        3. Store 0x0 to the 'Command field' register,
> > +        4. Read the 'Command data 2' register.
> > +           If read value is 0x0, the modern interface is enabled.
> > +           Otherwise legacy or no CPU hotplug interface available
>
> falls apart for the guest: steps 1 and 2 are lost, because they are DWORD
> writes; so no switching happens.  Step 3 (a single-byte write) is not
> lost, but it has no effect; see the condition in cpu_status_write() in
> patch#8.  And step 4 *misleads* the guest into thinking that the switch
> worked: the DWORD read is lost again -- it returns zero to the guest
> without ever reaching the device model, so the guest never learns the
> switch didn't work.
>
> This means that guest behavior centered on the "Command data 2" register
> worked *only* in the v5.0.0 release; it got effectively regressed in
> v5.1.0.
>
> To make things *even more* complicated, the breakage was (and remains, as
> of today) visible with TCG acceleration only.  Commit 5d971f9e6725 makes
> no difference with KVM acceleration -- the DWORD accesses still work,
> despite "valid.max_access_size = 1".
>
> As commit 5d971f9e6725 suggests, fix the problem by raising
> "valid.max_access_size" to 4 -- the spec now clearly instructs the guest
> to perform DWORD accesses to the legacy register block too, for enabling
> (and verifying!) the modern block.  In order to keep compatibility for the
> device model implementation though, set "impl.max_access_size = 1", so
> that wide accesses be split before they reach the legacy read/write
> handlers, like they always have been on KVM, and like they were on TCG
> before 5d971f9e6725 (v5.1.0).
>
> Tested with:
>
> - OVMF IA32 + qemu-system-i386, CPU hotplug/hot-unplug with SMM,
>   intermixed with ACPI S3 suspend/resume, using KVM accel
>   (regression-test);
>
> - OVMF IA32X64 + qemu-system-x86_64, CPU hotplug/hot-unplug with SMM,
>   intermixed with ACPI S3 suspend/resume, using KVM accel
>   (regression-test);
>
> - OVMF IA32 + qemu-system-i386, SMM enabled, using TCG accel; verified the
>   register block switch and the present/possible CPU counting through the
>   modern hotplug interface, during OVMF boot (bugfix test);
>
> - I do not have any testcase (guest payload) for regression-testing CPU
>   hotplug through the *legacy* CPU hotplug register block.
>
> Cc: "Michael S. Tsirkin" <mst@redhat.com>
> Cc: Ani Sinha <ani@anisinha.ca>
> Cc: Ard Biesheuvel <ardb@kernel.org>
> Cc: Igor Mammedov <imammedo@redhat.com>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Peter Maydell <peter.maydell@linaro.org>
> Cc: qemu-stable@nongnu.org
> Ref: "IO port write width clamping differs between TCG and KVM"
> Link: http://mid.mail-archive.com/aaedee84-d3ed-a4f9-21e7-d221a28d1683@redhat.com
> Link: https://lists.gnu.org/archive/html/qemu-devel/2023-01/msg00199.html
> Reported-by: Ard Biesheuvel <ardb@kernel.org>
> Signed-off-by: Laszlo Ersek <lersek@redhat.com>

Thanks for going down this rabbit hole.

With this patch applied, the QEMU IA32 regression that would only
manifest when using KVM now also happens in TCG mode.

Yay

Tested-by: Ard Biesheuvel <ardb@kernel.org>

> ---
>
> Notes:
>     This should be applied to:
>
>     - stable-5.2 (new branch)
>
>     - stable-6.2 (new branch)
>
>     - stable-7.2 (new branch)
>
>     whichever is still considered maintained, as there is currently *no*
>     public QEMU release in which the modern CPU hotplug register block
>     works, when using TCG acceleration.  v5.0.0 works, but that minor
>     release has been obsoleted by v5.2.0, which does not work.
>
>  hw/acpi/cpu_hotplug.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/hw/acpi/cpu_hotplug.c b/hw/acpi/cpu_hotplug.c
> index 53654f863830..ff14c3f4106f 100644
> --- a/hw/acpi/cpu_hotplug.c
> +++ b/hw/acpi/cpu_hotplug.c
> @@ -52,6 +52,9 @@ static const MemoryRegionOps AcpiCpuHotplug_ops = {
>      .endianness = DEVICE_LITTLE_ENDIAN,
>      .valid = {
>          .min_access_size = 1,
> +        .max_access_size = 4,
> +    },
> +    .impl = {
>          .max_access_size = 1,
>      },
>  };
Philippe Mathieu-Daudé Jan. 4, 2023, 9:34 a.m. UTC | #2
On 4/1/23 10:01, Laszlo Ersek wrote:
> The modern ACPI CPU hotplug interface was introduced in the following
> series (aa1dd39ca307..679dd1a957df), released in v2.7.0:
> 
>    1  abd49bc2ed2f docs: update ACPI CPU hotplug spec with new protocol
>    2  16bcab97eb9f pc: piix4/ich9: add 'cpu-hotplug-legacy' property
>    3  5e1b5d93887b acpi: cpuhp: add CPU devices AML with _STA method
>    4  ac35f13ba8f8 pc: acpi: introduce AcpiDeviceIfClass.madt_cpu hook
>    5  d2238cb6781d acpi: cpuhp: implement hot-add parts of CPU hotplug
>                    interface
>    6  8872c25a26cc acpi: cpuhp: implement hot-remove parts of CPU hotplug
>                    interface
>    7  76623d00ae57 acpi: cpuhp: add cpu._OST handling
>    8  679dd1a957df pc: use new CPU hotplug interface since 2.7 machine type
> 
> Before patch#1, "docs/specs/acpi_cpu_hotplug.txt" only specified 1-byte
> accesses for the hotplug register block.  Patch#1 preserved the same
> restriction for the legacy register block, but:
> 
> - it specified DWORD accesses for some of the modern registers,
> 
> - in particular, the switch from the legacy block to the modern block
>    would require a DWORD write to the *legacy* block.
> 
> The latter functionality was then implemented in cpu_status_write()
> [hw/acpi/cpu_hotplug.c], in patch#8.
> 
> Unfortunately, all DWORD accesses depended on a dormant bug: the one
> introced in earlier commit a014ed07bd5a ("memory: accept mismatching sizes

Typo "introduced",

> in memory_region_access_valid", 2013-05-29); first released in v1.6.0.
> Due to commit a014ed07bd5a, the DWORD accesses to the *legacy* CPU hotplug
> register block would work in spite of the above series *not* relaxing
> "valid.max_access_size = 1" in "hw/acpi/cpu_hotplug.c":
> 
>> static const MemoryRegionOps AcpiCpuHotplug_ops = {
>>      .read = cpu_status_read,
>>      .write = cpu_status_write,
>>      .endianness = DEVICE_LITTLE_ENDIAN,
>>      .valid = {
>>          .min_access_size = 1,
>>          .max_access_size = 1,
>>      },
>> };
> 
> Later, in commits e6d0c3ce6895 ("acpi: cpuhp: introduce 'Command data 2'
> field", 2020-01-22) and ae340aa3d256 ("acpi: cpuhp: spec: add typical
> usecases", 2020-01-22), first released in v5.0.0, the modern CPU hotplug
> interface (including the documentation) was extended with another DWORD
> *read* access, namely to the "Command data 2" register, which would be
> important for the guest to confirm whether it managed to switch the
> register block from legacy to modern.
> 
> This functionality too silently depended on the bug from commit
> a014ed07bd5a.
> 
> In commit 5d971f9e6725 ('memory: Revert "memory: accept mismatching sizes
> in memory_region_access_valid"', 2020-06-26), first released in v5.1.0,
> the bug from commit a014ed07bd5a was fixed (the commit was reverted).
> That swiftly exposed the bug in "AcpiCpuHotplug_ops", still present from
> the v2.7.0 series quoted at the top -- namely the fact that
> "valid.max_access_size = 1" didn't match what the guest was supposed to
> do, according to the spec ("docs/specs/acpi_cpu_hotplug.txt").
> 
> The symptom is that the "modern interface negotiation protocol"
> described in commit ae340aa3d256:
> 
>> +      Use following steps to detect and enable modern CPU hotplug interface:
>> +        1. Store 0x0 to the 'CPU selector' register,
>> +           attempting to switch to modern mode
>> +        2. Store 0x0 to the 'CPU selector' register,
>> +           to ensure valid selector value
>> +        3. Store 0x0 to the 'Command field' register,
>> +        4. Read the 'Command data 2' register.
>> +           If read value is 0x0, the modern interface is enabled.
>> +           Otherwise legacy or no CPU hotplug interface available
> 
> falls apart for the guest: steps 1 and 2 are lost, because they are DWORD
> writes; so no switching happens.  Step 3 (a single-byte write) is not
> lost, but it has no effect; see the condition in cpu_status_write() in
> patch#8.  And step 4 *misleads* the guest into thinking that the switch
> worked: the DWORD read is lost again -- it returns zero to the guest
> without ever reaching the device model, so the guest never learns the
> switch didn't work.
> 
> This means that guest behavior centered on the "Command data 2" register
> worked *only* in the v5.0.0 release; it got effectively regressed in
> v5.1.0.
> 
> To make things *even more* complicated, the breakage was (and remains, as
> of today) visible with TCG acceleration only.  Commit 5d971f9e6725 makes
> no difference with KVM acceleration -- the DWORD accesses still work,
> despite "valid.max_access_size = 1".
> 
> As commit 5d971f9e6725 suggests, fix the problem by raising
> "valid.max_access_size" to 4 -- the spec now clearly instructs the guest
> to perform DWORD accesses to the legacy register block too, for enabling
> (and verifying!) the modern block.  In order to keep compatibility for the
> device model implementation though, set "impl.max_access_size = 1", so
> that wide accesses be split before they reach the legacy read/write
> handlers, like they always have been on KVM, and like they were on TCG
> before 5d971f9e6725 (v5.1.0).
> 
> Tested with:
> 
> - OVMF IA32 + qemu-system-i386, CPU hotplug/hot-unplug with SMM,
>    intermixed with ACPI S3 suspend/resume, using KVM accel
>    (regression-test);
> 
> - OVMF IA32X64 + qemu-system-x86_64, CPU hotplug/hot-unplug with SMM,
>    intermixed with ACPI S3 suspend/resume, using KVM accel
>    (regression-test);
> 
> - OVMF IA32 + qemu-system-i386, SMM enabled, using TCG accel; verified the
>    register block switch and the present/possible CPU counting through the
>    modern hotplug interface, during OVMF boot (bugfix test);
> 
> - I do not have any testcase (guest payload) for regression-testing CPU
>    hotplug through the *legacy* CPU hotplug register block.
> 
> Cc: "Michael S. Tsirkin" <mst@redhat.com>
> Cc: Ani Sinha <ani@anisinha.ca>
> Cc: Ard Biesheuvel <ardb@kernel.org>
> Cc: Igor Mammedov <imammedo@redhat.com>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Peter Maydell <peter.maydell@linaro.org>
> Cc: qemu-stable@nongnu.org
> Ref: "IO port write width clamping differs between TCG and KVM"
> Link: http://mid.mail-archive.com/aaedee84-d3ed-a4f9-21e7-d221a28d1683@redhat.com
> Link: https://lists.gnu.org/archive/html/qemu-devel/2023-01/msg00199.html
> Reported-by: Ard Biesheuvel <ardb@kernel.org>
> Signed-off-by: Laszlo Ersek <lersek@redhat.com>
> ---
> 
> Notes:
>      This should be applied to:
>      
>      - stable-5.2 (new branch)
>      
>      - stable-6.2 (new branch)
>      
>      - stable-7.2 (new branch)
>      
>      whichever is still considered maintained, as there is currently *no*
>      public QEMU release in which the modern CPU hotplug register block
>      works, when using TCG acceleration.  v5.0.0 works, but that minor
>      release has been obsoleted by v5.2.0, which does not work.
> 
>   hw/acpi/cpu_hotplug.c | 3 +++
>   1 file changed, 3 insertions(+)
> 
> diff --git a/hw/acpi/cpu_hotplug.c b/hw/acpi/cpu_hotplug.c
> index 53654f863830..ff14c3f4106f 100644
> --- a/hw/acpi/cpu_hotplug.c
> +++ b/hw/acpi/cpu_hotplug.c
> @@ -52,6 +52,9 @@ static const MemoryRegionOps AcpiCpuHotplug_ops = {
>       .endianness = DEVICE_LITTLE_ENDIAN,
>       .valid = {
>           .min_access_size = 1,
> +        .max_access_size = 4,
> +    },
> +    .impl = {
>           .max_access_size = 1,

Arguably:
Fixes: b8622725cf ("acpi_piix4: Add infrastructure to send CPU hot-plug 
GPE to guest")

>       },
>   };

Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Laszlo Ersek Jan. 4, 2023, 10:26 a.m. UTC | #3
On 1/4/23 10:33, Ard Biesheuvel wrote:
> On Wed, 4 Jan 2023 at 10:01, Laszlo Ersek <lersek@redhat.com> wrote:
>>
>> The modern ACPI CPU hotplug interface was introduced in the following
>> series (aa1dd39ca307..679dd1a957df), released in v2.7.0:
>>
>>   1  abd49bc2ed2f docs: update ACPI CPU hotplug spec with new protocol
>>   2  16bcab97eb9f pc: piix4/ich9: add 'cpu-hotplug-legacy' property
>>   3  5e1b5d93887b acpi: cpuhp: add CPU devices AML with _STA method
>>   4  ac35f13ba8f8 pc: acpi: introduce AcpiDeviceIfClass.madt_cpu hook
>>   5  d2238cb6781d acpi: cpuhp: implement hot-add parts of CPU hotplug
>>                   interface
>>   6  8872c25a26cc acpi: cpuhp: implement hot-remove parts of CPU hotplug
>>                   interface
>>   7  76623d00ae57 acpi: cpuhp: add cpu._OST handling
>>   8  679dd1a957df pc: use new CPU hotplug interface since 2.7 machine type
>>
>> Before patch#1, "docs/specs/acpi_cpu_hotplug.txt" only specified 1-byte
>> accesses for the hotplug register block.  Patch#1 preserved the same
>> restriction for the legacy register block, but:
>>
>> - it specified DWORD accesses for some of the modern registers,
>>
>> - in particular, the switch from the legacy block to the modern block
>>   would require a DWORD write to the *legacy* block.
>>
>> The latter functionality was then implemented in cpu_status_write()
>> [hw/acpi/cpu_hotplug.c], in patch#8.
>>
>> Unfortunately, all DWORD accesses depended on a dormant bug: the one
>> introced in earlier commit a014ed07bd5a ("memory: accept mismatching sizes
>> in memory_region_access_valid", 2013-05-29); first released in v1.6.0.
>> Due to commit a014ed07bd5a, the DWORD accesses to the *legacy* CPU hotplug
>> register block would work in spite of the above series *not* relaxing
>> "valid.max_access_size = 1" in "hw/acpi/cpu_hotplug.c":
>>
>>> static const MemoryRegionOps AcpiCpuHotplug_ops = {
>>>     .read = cpu_status_read,
>>>     .write = cpu_status_write,
>>>     .endianness = DEVICE_LITTLE_ENDIAN,
>>>     .valid = {
>>>         .min_access_size = 1,
>>>         .max_access_size = 1,
>>>     },
>>> };
>>
>> Later, in commits e6d0c3ce6895 ("acpi: cpuhp: introduce 'Command data 2'
>> field", 2020-01-22) and ae340aa3d256 ("acpi: cpuhp: spec: add typical
>> usecases", 2020-01-22), first released in v5.0.0, the modern CPU hotplug
>> interface (including the documentation) was extended with another DWORD
>> *read* access, namely to the "Command data 2" register, which would be
>> important for the guest to confirm whether it managed to switch the
>> register block from legacy to modern.
>>
>> This functionality too silently depended on the bug from commit
>> a014ed07bd5a.
>>
>> In commit 5d971f9e6725 ('memory: Revert "memory: accept mismatching sizes
>> in memory_region_access_valid"', 2020-06-26), first released in v5.1.0,
>> the bug from commit a014ed07bd5a was fixed (the commit was reverted).
>> That swiftly exposed the bug in "AcpiCpuHotplug_ops", still present from
>> the v2.7.0 series quoted at the top -- namely the fact that
>> "valid.max_access_size = 1" didn't match what the guest was supposed to
>> do, according to the spec ("docs/specs/acpi_cpu_hotplug.txt").
>>
>> The symptom is that the "modern interface negotiation protocol"
>> described in commit ae340aa3d256:
>>
>>> +      Use following steps to detect and enable modern CPU hotplug interface:
>>> +        1. Store 0x0 to the 'CPU selector' register,
>>> +           attempting to switch to modern mode
>>> +        2. Store 0x0 to the 'CPU selector' register,
>>> +           to ensure valid selector value
>>> +        3. Store 0x0 to the 'Command field' register,
>>> +        4. Read the 'Command data 2' register.
>>> +           If read value is 0x0, the modern interface is enabled.
>>> +           Otherwise legacy or no CPU hotplug interface available
>>
>> falls apart for the guest: steps 1 and 2 are lost, because they are DWORD
>> writes; so no switching happens.  Step 3 (a single-byte write) is not
>> lost, but it has no effect; see the condition in cpu_status_write() in
>> patch#8.  And step 4 *misleads* the guest into thinking that the switch
>> worked: the DWORD read is lost again -- it returns zero to the guest
>> without ever reaching the device model, so the guest never learns the
>> switch didn't work.
>>
>> This means that guest behavior centered on the "Command data 2" register
>> worked *only* in the v5.0.0 release; it got effectively regressed in
>> v5.1.0.
>>
>> To make things *even more* complicated, the breakage was (and remains, as
>> of today) visible with TCG acceleration only.  Commit 5d971f9e6725 makes
>> no difference with KVM acceleration -- the DWORD accesses still work,
>> despite "valid.max_access_size = 1".
>>
>> As commit 5d971f9e6725 suggests, fix the problem by raising
>> "valid.max_access_size" to 4 -- the spec now clearly instructs the guest
>> to perform DWORD accesses to the legacy register block too, for enabling
>> (and verifying!) the modern block.  In order to keep compatibility for the
>> device model implementation though, set "impl.max_access_size = 1", so
>> that wide accesses be split before they reach the legacy read/write
>> handlers, like they always have been on KVM, and like they were on TCG
>> before 5d971f9e6725 (v5.1.0).
>>
>> Tested with:
>>
>> - OVMF IA32 + qemu-system-i386, CPU hotplug/hot-unplug with SMM,
>>   intermixed with ACPI S3 suspend/resume, using KVM accel
>>   (regression-test);
>>
>> - OVMF IA32X64 + qemu-system-x86_64, CPU hotplug/hot-unplug with SMM,
>>   intermixed with ACPI S3 suspend/resume, using KVM accel
>>   (regression-test);
>>
>> - OVMF IA32 + qemu-system-i386, SMM enabled, using TCG accel; verified the
>>   register block switch and the present/possible CPU counting through the
>>   modern hotplug interface, during OVMF boot (bugfix test);
>>
>> - I do not have any testcase (guest payload) for regression-testing CPU
>>   hotplug through the *legacy* CPU hotplug register block.
>>
>> Cc: "Michael S. Tsirkin" <mst@redhat.com>
>> Cc: Ani Sinha <ani@anisinha.ca>
>> Cc: Ard Biesheuvel <ardb@kernel.org>
>> Cc: Igor Mammedov <imammedo@redhat.com>
>> Cc: Paolo Bonzini <pbonzini@redhat.com>
>> Cc: Peter Maydell <peter.maydell@linaro.org>
>> Cc: qemu-stable@nongnu.org
>> Ref: "IO port write width clamping differs between TCG and KVM"
>> Link: http://mid.mail-archive.com/aaedee84-d3ed-a4f9-21e7-d221a28d1683@redhat.com
>> Link: https://lists.gnu.org/archive/html/qemu-devel/2023-01/msg00199.html
>> Reported-by: Ard Biesheuvel <ardb@kernel.org>
>> Signed-off-by: Laszlo Ersek <lersek@redhat.com>
> 
> Thanks for going down this rabbit hole.
> 
> With this patch applied, the QEMU IA32 regression that would only
> manifest when using KVM now also happens in TCG mode.
> 
> Yay

Yes -- now that OVMF sees the proper CPU counts on TCG, the UefiCpuPkg
regression in <https://bugzilla.tianocore.org/show_bug.cgi?id=4234> is
supposed to hit OVMF on TCG as well. IOW, with the device model fixed,
TCG's effective "hiding" of the CPUs no longer happens, and thus no
longer masks TianoCore#4234.

(Expanded your comment a little bit because at first I didn't get it,
and then thought that qemu-devel might benefit from an explanation.)

> 
> Tested-by: Ard Biesheuvel <ardb@kernel.org>

Thank you!
Laszlo

> 
>> ---
>>
>> Notes:
>>     This should be applied to:
>>
>>     - stable-5.2 (new branch)
>>
>>     - stable-6.2 (new branch)
>>
>>     - stable-7.2 (new branch)
>>
>>     whichever is still considered maintained, as there is currently *no*
>>     public QEMU release in which the modern CPU hotplug register block
>>     works, when using TCG acceleration.  v5.0.0 works, but that minor
>>     release has been obsoleted by v5.2.0, which does not work.
>>
>>  hw/acpi/cpu_hotplug.c | 3 +++
>>  1 file changed, 3 insertions(+)
>>
>> diff --git a/hw/acpi/cpu_hotplug.c b/hw/acpi/cpu_hotplug.c
>> index 53654f863830..ff14c3f4106f 100644
>> --- a/hw/acpi/cpu_hotplug.c
>> +++ b/hw/acpi/cpu_hotplug.c
>> @@ -52,6 +52,9 @@ static const MemoryRegionOps AcpiCpuHotplug_ops = {
>>      .endianness = DEVICE_LITTLE_ENDIAN,
>>      .valid = {
>>          .min_access_size = 1,
>> +        .max_access_size = 4,
>> +    },
>> +    .impl = {
>>          .max_access_size = 1,
>>      },
>>  };
>
Laszlo Ersek Jan. 4, 2023, 10:29 a.m. UTC | #4
On 1/4/23 10:34, Philippe Mathieu-Daudé wrote:
> On 4/1/23 10:01, Laszlo Ersek wrote:
>> The modern ACPI CPU hotplug interface was introduced in the following
>> series (aa1dd39ca307..679dd1a957df), released in v2.7.0:
>>
>>    1  abd49bc2ed2f docs: update ACPI CPU hotplug spec with new protocol
>>    2  16bcab97eb9f pc: piix4/ich9: add 'cpu-hotplug-legacy' property
>>    3  5e1b5d93887b acpi: cpuhp: add CPU devices AML with _STA method
>>    4  ac35f13ba8f8 pc: acpi: introduce AcpiDeviceIfClass.madt_cpu hook
>>    5  d2238cb6781d acpi: cpuhp: implement hot-add parts of CPU hotplug
>>                    interface
>>    6  8872c25a26cc acpi: cpuhp: implement hot-remove parts of CPU hotplug
>>                    interface
>>    7  76623d00ae57 acpi: cpuhp: add cpu._OST handling
>>    8  679dd1a957df pc: use new CPU hotplug interface since 2.7 machine
>> type
>>
>> Before patch#1, "docs/specs/acpi_cpu_hotplug.txt" only specified 1-byte
>> accesses for the hotplug register block.  Patch#1 preserved the same
>> restriction for the legacy register block, but:
>>
>> - it specified DWORD accesses for some of the modern registers,
>>
>> - in particular, the switch from the legacy block to the modern block
>>    would require a DWORD write to the *legacy* block.
>>
>> The latter functionality was then implemented in cpu_status_write()
>> [hw/acpi/cpu_hotplug.c], in patch#8.
>>
>> Unfortunately, all DWORD accesses depended on a dormant bug: the one
>> introced in earlier commit a014ed07bd5a ("memory: accept mismatching
>> sizes
> 
> Typo "introduced",
> 
>> in memory_region_access_valid", 2013-05-29); first released in v1.6.0.
>> Due to commit a014ed07bd5a, the DWORD accesses to the *legacy* CPU
>> hotplug
>> register block would work in spite of the above series *not* relaxing
>> "valid.max_access_size = 1" in "hw/acpi/cpu_hotplug.c":
>>
>>> static const MemoryRegionOps AcpiCpuHotplug_ops = {
>>>      .read = cpu_status_read,
>>>      .write = cpu_status_write,
>>>      .endianness = DEVICE_LITTLE_ENDIAN,
>>>      .valid = {
>>>          .min_access_size = 1,
>>>          .max_access_size = 1,
>>>      },
>>> };
>>
>> Later, in commits e6d0c3ce6895 ("acpi: cpuhp: introduce 'Command data 2'
>> field", 2020-01-22) and ae340aa3d256 ("acpi: cpuhp: spec: add typical
>> usecases", 2020-01-22), first released in v5.0.0, the modern CPU hotplug
>> interface (including the documentation) was extended with another DWORD
>> *read* access, namely to the "Command data 2" register, which would be
>> important for the guest to confirm whether it managed to switch the
>> register block from legacy to modern.
>>
>> This functionality too silently depended on the bug from commit
>> a014ed07bd5a.
>>
>> In commit 5d971f9e6725 ('memory: Revert "memory: accept mismatching sizes
>> in memory_region_access_valid"', 2020-06-26), first released in v5.1.0,
>> the bug from commit a014ed07bd5a was fixed (the commit was reverted).
>> That swiftly exposed the bug in "AcpiCpuHotplug_ops", still present from
>> the v2.7.0 series quoted at the top -- namely the fact that
>> "valid.max_access_size = 1" didn't match what the guest was supposed to
>> do, according to the spec ("docs/specs/acpi_cpu_hotplug.txt").
>>
>> The symptom is that the "modern interface negotiation protocol"
>> described in commit ae340aa3d256:
>>
>>> +      Use following steps to detect and enable modern CPU hotplug
>>> interface:
>>> +        1. Store 0x0 to the 'CPU selector' register,
>>> +           attempting to switch to modern mode
>>> +        2. Store 0x0 to the 'CPU selector' register,
>>> +           to ensure valid selector value
>>> +        3. Store 0x0 to the 'Command field' register,
>>> +        4. Read the 'Command data 2' register.
>>> +           If read value is 0x0, the modern interface is enabled.
>>> +           Otherwise legacy or no CPU hotplug interface available
>>
>> falls apart for the guest: steps 1 and 2 are lost, because they are DWORD
>> writes; so no switching happens.  Step 3 (a single-byte write) is not
>> lost, but it has no effect; see the condition in cpu_status_write() in
>> patch#8.  And step 4 *misleads* the guest into thinking that the switch
>> worked: the DWORD read is lost again -- it returns zero to the guest
>> without ever reaching the device model, so the guest never learns the
>> switch didn't work.
>>
>> This means that guest behavior centered on the "Command data 2" register
>> worked *only* in the v5.0.0 release; it got effectively regressed in
>> v5.1.0.
>>
>> To make things *even more* complicated, the breakage was (and remains, as
>> of today) visible with TCG acceleration only.  Commit 5d971f9e6725 makes
>> no difference with KVM acceleration -- the DWORD accesses still work,
>> despite "valid.max_access_size = 1".
>>
>> As commit 5d971f9e6725 suggests, fix the problem by raising
>> "valid.max_access_size" to 4 -- the spec now clearly instructs the guest
>> to perform DWORD accesses to the legacy register block too, for enabling
>> (and verifying!) the modern block.  In order to keep compatibility for
>> the
>> device model implementation though, set "impl.max_access_size = 1", so
>> that wide accesses be split before they reach the legacy read/write
>> handlers, like they always have been on KVM, and like they were on TCG
>> before 5d971f9e6725 (v5.1.0).
>>
>> Tested with:
>>
>> - OVMF IA32 + qemu-system-i386, CPU hotplug/hot-unplug with SMM,
>>    intermixed with ACPI S3 suspend/resume, using KVM accel
>>    (regression-test);
>>
>> - OVMF IA32X64 + qemu-system-x86_64, CPU hotplug/hot-unplug with SMM,
>>    intermixed with ACPI S3 suspend/resume, using KVM accel
>>    (regression-test);
>>
>> - OVMF IA32 + qemu-system-i386, SMM enabled, using TCG accel; verified
>> the
>>    register block switch and the present/possible CPU counting through
>> the
>>    modern hotplug interface, during OVMF boot (bugfix test);
>>
>> - I do not have any testcase (guest payload) for regression-testing CPU
>>    hotplug through the *legacy* CPU hotplug register block.
>>
>> Cc: "Michael S. Tsirkin" <mst@redhat.com>
>> Cc: Ani Sinha <ani@anisinha.ca>
>> Cc: Ard Biesheuvel <ardb@kernel.org>
>> Cc: Igor Mammedov <imammedo@redhat.com>
>> Cc: Paolo Bonzini <pbonzini@redhat.com>
>> Cc: Peter Maydell <peter.maydell@linaro.org>
>> Cc: qemu-stable@nongnu.org
>> Ref: "IO port write width clamping differs between TCG and KVM"
>> Link:
>> http://mid.mail-archive.com/aaedee84-d3ed-a4f9-21e7-d221a28d1683@redhat.com
>> Link: https://lists.gnu.org/archive/html/qemu-devel/2023-01/msg00199.html
>> Reported-by: Ard Biesheuvel <ardb@kernel.org>
>> Signed-off-by: Laszlo Ersek <lersek@redhat.com>
>> ---
>>
>> Notes:
>>      This should be applied to:
>>           - stable-5.2 (new branch)
>>           - stable-6.2 (new branch)
>>           - stable-7.2 (new branch)
>>           whichever is still considered maintained, as there is
>> currently *no*
>>      public QEMU release in which the modern CPU hotplug register block
>>      works, when using TCG acceleration.  v5.0.0 works, but that minor
>>      release has been obsoleted by v5.2.0, which does not work.
>>
>>   hw/acpi/cpu_hotplug.c | 3 +++
>>   1 file changed, 3 insertions(+)
>>
>> diff --git a/hw/acpi/cpu_hotplug.c b/hw/acpi/cpu_hotplug.c
>> index 53654f863830..ff14c3f4106f 100644
>> --- a/hw/acpi/cpu_hotplug.c
>> +++ b/hw/acpi/cpu_hotplug.c
>> @@ -52,6 +52,9 @@ static const MemoryRegionOps AcpiCpuHotplug_ops = {
>>       .endianness = DEVICE_LITTLE_ENDIAN,
>>       .valid = {
>>           .min_access_size = 1,
>> +        .max_access_size = 4,
>> +    },
>> +    .impl = {
>>           .max_access_size = 1,
> 
> Arguably:
> Fixes: b8622725cf ("acpi_piix4: Add infrastructure to send CPU hot-plug
> GPE to guest")

Hmm, not sure. I thought hard about a potential "Fixes:" line for the
commit message, but couldn't identify any one particular commit. That's
why I quoted the whole initial 8-part patch series -- there are multiple
stages there where DWORD access dependencies are introduced, but the
restriction in the device model is not relaxed. Patch#1 (commit
abd49bc2ed2f) did it with the docs, but that wasn't the only such patch.

I *think* b8622725cf is fine, per se. It entirely precedes the modern
register block, as it is part of the now-legacy block's introduction. At
that time, 1-byte accesses were entirely fine, AFAICT.

> 
>>       },
>>   };
> 
> Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
> 

Thank you! :)
Laszlo
Igor Mammedov Jan. 4, 2023, 10:35 a.m. UTC | #5
On Wed,  4 Jan 2023 10:01:38 +0100
Laszlo Ersek <lersek@redhat.com> wrote:

> The modern ACPI CPU hotplug interface was introduced in the following
> series (aa1dd39ca307..679dd1a957df), released in v2.7.0:
> 
>   1  abd49bc2ed2f docs: update ACPI CPU hotplug spec with new protocol
>   2  16bcab97eb9f pc: piix4/ich9: add 'cpu-hotplug-legacy' property
>   3  5e1b5d93887b acpi: cpuhp: add CPU devices AML with _STA method
>   4  ac35f13ba8f8 pc: acpi: introduce AcpiDeviceIfClass.madt_cpu hook
>   5  d2238cb6781d acpi: cpuhp: implement hot-add parts of CPU hotplug
>                   interface
>   6  8872c25a26cc acpi: cpuhp: implement hot-remove parts of CPU hotplug
>                   interface
>   7  76623d00ae57 acpi: cpuhp: add cpu._OST handling
>   8  679dd1a957df pc: use new CPU hotplug interface since 2.7 machine type
> 
> Before patch#1, "docs/specs/acpi_cpu_hotplug.txt" only specified 1-byte
> accesses for the hotplug register block.  Patch#1 preserved the same
> restriction for the legacy register block, but:
> 
> - it specified DWORD accesses for some of the modern registers,
> 
> - in particular, the switch from the legacy block to the modern block
>   would require a DWORD write to the *legacy* block.
> 
> The latter functionality was then implemented in cpu_status_write()
> [hw/acpi/cpu_hotplug.c], in patch#8.
> 
> Unfortunately, all DWORD accesses depended on a dormant bug: the one
> introced in earlier commit a014ed07bd5a ("memory: accept mismatching sizes
> in memory_region_access_valid", 2013-05-29); first released in v1.6.0.
> Due to commit a014ed07bd5a, the DWORD accesses to the *legacy* CPU hotplug
> register block would work in spite of the above series *not* relaxing
> "valid.max_access_size = 1" in "hw/acpi/cpu_hotplug.c":
> 
> > static const MemoryRegionOps AcpiCpuHotplug_ops = {
> >     .read = cpu_status_read,
> >     .write = cpu_status_write,
> >     .endianness = DEVICE_LITTLE_ENDIAN,
> >     .valid = {
> >         .min_access_size = 1,
> >         .max_access_size = 1,
> >     },
> > };  
> 
> Later, in commits e6d0c3ce6895 ("acpi: cpuhp: introduce 'Command data 2'
> field", 2020-01-22) and ae340aa3d256 ("acpi: cpuhp: spec: add typical
> usecases", 2020-01-22), first released in v5.0.0, the modern CPU hotplug
> interface (including the documentation) was extended with another DWORD
> *read* access, namely to the "Command data 2" register, which would be
> important for the guest to confirm whether it managed to switch the
> register block from legacy to modern.
> 
> This functionality too silently depended on the bug from commit
> a014ed07bd5a.
> 
> In commit 5d971f9e6725 ('memory: Revert "memory: accept mismatching sizes
> in memory_region_access_valid"', 2020-06-26), first released in v5.1.0,
> the bug from commit a014ed07bd5a was fixed (the commit was reverted).
> That swiftly exposed the bug in "AcpiCpuHotplug_ops", still present from
> the v2.7.0 series quoted at the top -- namely the fact that
> "valid.max_access_size = 1" didn't match what the guest was supposed to
> do, according to the spec ("docs/specs/acpi_cpu_hotplug.txt").
> 
> The symptom is that the "modern interface negotiation protocol"
> described in commit ae340aa3d256:
> 
> > +      Use following steps to detect and enable modern CPU hotplug interface:
> > +        1. Store 0x0 to the 'CPU selector' register,
> > +           attempting to switch to modern mode
> > +        2. Store 0x0 to the 'CPU selector' register,
> > +           to ensure valid selector value
> > +        3. Store 0x0 to the 'Command field' register,
> > +        4. Read the 'Command data 2' register.
> > +           If read value is 0x0, the modern interface is enabled.
> > +           Otherwise legacy or no CPU hotplug interface available  
> 
> falls apart for the guest: steps 1 and 2 are lost, because they are DWORD
> writes; so no switching happens.  Step 3 (a single-byte write) is not
> lost, but it has no effect; see the condition in cpu_status_write() in
> patch#8.  And step 4 *misleads* the guest into thinking that the switch
> worked: the DWORD read is lost again -- it returns zero to the guest
> without ever reaching the device model, so the guest never learns the
> switch didn't work.
> 
> This means that guest behavior centered on the "Command data 2" register
> worked *only* in the v5.0.0 release; it got effectively regressed in
> v5.1.0.
> 
> To make things *even more* complicated, the breakage was (and remains, as
> of today) visible with TCG acceleration only.  Commit 5d971f9e6725 makes
> no difference with KVM acceleration -- the DWORD accesses still work,
> despite "valid.max_access_size = 1".
> 
> As commit 5d971f9e6725 suggests, fix the problem by raising
> "valid.max_access_size" to 4 -- the spec now clearly instructs the guest
> to perform DWORD accesses to the legacy register block too, for enabling
> (and verifying!) the modern block.  In order to keep compatibility for the
> device model implementation though, set "impl.max_access_size = 1", so
> that wide accesses be split before they reach the legacy read/write
> handlers, like they always have been on KVM, and like they were on TCG
> before 5d971f9e6725 (v5.1.0).
> 
> Tested with:
> 
> - OVMF IA32 + qemu-system-i386, CPU hotplug/hot-unplug with SMM,
>   intermixed with ACPI S3 suspend/resume, using KVM accel
>   (regression-test);
> 
> - OVMF IA32X64 + qemu-system-x86_64, CPU hotplug/hot-unplug with SMM,
>   intermixed with ACPI S3 suspend/resume, using KVM accel
>   (regression-test);
> 
> - OVMF IA32 + qemu-system-i386, SMM enabled, using TCG accel; verified the
>   register block switch and the present/possible CPU counting through the
>   modern hotplug interface, during OVMF boot (bugfix test);


> - I do not have any testcase (guest payload) for regression-testing CPU
>   hotplug through the *legacy* CPU hotplug register block.
I've checked it with old Seabios (that had it's own ACPI tables) (taken from 1.6 QEMU branch),  
it works fine in TCG and KVM mode.

Tested-by: Igor Mammedov <imammedo@redhat.com>

> 
> Cc: "Michael S. Tsirkin" <mst@redhat.com>
> Cc: Ani Sinha <ani@anisinha.ca>
> Cc: Ard Biesheuvel <ardb@kernel.org>
> Cc: Igor Mammedov <imammedo@redhat.com>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Peter Maydell <peter.maydell@linaro.org>
> Cc: qemu-stable@nongnu.org
> Ref: "IO port write width clamping differs between TCG and KVM"
> Link: http://mid.mail-archive.com/aaedee84-d3ed-a4f9-21e7-d221a28d1683@redhat.com
> Link: https://lists.gnu.org/archive/html/qemu-devel/2023-01/msg00199.html
> Reported-by: Ard Biesheuvel <ardb@kernel.org>
> Signed-off-by: Laszlo Ersek <lersek@redhat.com>
> ---
> 
> Notes:
>     This should be applied to:
>     
>     - stable-5.2 (new branch)
>     
>     - stable-6.2 (new branch)
>     
>     - stable-7.2 (new branch)
>     
>     whichever is still considered maintained, as there is currently *no*
>     public QEMU release in which the modern CPU hotplug register block
>     works, when using TCG acceleration.  v5.0.0 works, but that minor
>     release has been obsoleted by v5.2.0, which does not work.
> 
>  hw/acpi/cpu_hotplug.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/hw/acpi/cpu_hotplug.c b/hw/acpi/cpu_hotplug.c
> index 53654f863830..ff14c3f4106f 100644
> --- a/hw/acpi/cpu_hotplug.c
> +++ b/hw/acpi/cpu_hotplug.c
> @@ -52,6 +52,9 @@ static const MemoryRegionOps AcpiCpuHotplug_ops = {
>      .endianness = DEVICE_LITTLE_ENDIAN,
>      .valid = {
>          .min_access_size = 1,
> +        .max_access_size = 4,
> +    },
> +    .impl = {
>          .max_access_size = 1,
>      },
>  };
Igor Mammedov Jan. 4, 2023, 10:38 a.m. UTC | #6
On Wed, 4 Jan 2023 10:34:09 +0100
Philippe Mathieu-Daudé <philmd@linaro.org> wrote:

> On 4/1/23 10:01, Laszlo Ersek wrote:
[...]
> > diff --git a/hw/acpi/cpu_hotplug.c b/hw/acpi/cpu_hotplug.c
> > index 53654f863830..ff14c3f4106f 100644
> > --- a/hw/acpi/cpu_hotplug.c
> > +++ b/hw/acpi/cpu_hotplug.c
> > @@ -52,6 +52,9 @@ static const MemoryRegionOps AcpiCpuHotplug_ops = {
> >       .endianness = DEVICE_LITTLE_ENDIAN,
> >       .valid = {
> >           .min_access_size = 1,
> > +        .max_access_size = 4,
> > +    },
> > +    .impl = {
> >           .max_access_size = 1,  
> 
> Arguably:
> Fixes: b8622725cf ("acpi_piix4: Add infrastructure to send CPU hot-plug 
> GPE to guest")

nope, this one is correct, as legacy interface used 1 byte access only

> 
> >       },
> >   };
Laszlo Ersek Jan. 4, 2023, 12:13 p.m. UTC | #7
On 1/4/23 11:35, Igor Mammedov wrote:
> On Wed,  4 Jan 2023 10:01:38 +0100
> Laszlo Ersek <lersek@redhat.com> wrote:
> 
>> The modern ACPI CPU hotplug interface was introduced in the following
>> series (aa1dd39ca307..679dd1a957df), released in v2.7.0:
>>
>>   1  abd49bc2ed2f docs: update ACPI CPU hotplug spec with new protocol
>>   2  16bcab97eb9f pc: piix4/ich9: add 'cpu-hotplug-legacy' property
>>   3  5e1b5d93887b acpi: cpuhp: add CPU devices AML with _STA method
>>   4  ac35f13ba8f8 pc: acpi: introduce AcpiDeviceIfClass.madt_cpu hook
>>   5  d2238cb6781d acpi: cpuhp: implement hot-add parts of CPU hotplug
>>                   interface
>>   6  8872c25a26cc acpi: cpuhp: implement hot-remove parts of CPU hotplug
>>                   interface
>>   7  76623d00ae57 acpi: cpuhp: add cpu._OST handling
>>   8  679dd1a957df pc: use new CPU hotplug interface since 2.7 machine type
>>
>> Before patch#1, "docs/specs/acpi_cpu_hotplug.txt" only specified 1-byte
>> accesses for the hotplug register block.  Patch#1 preserved the same
>> restriction for the legacy register block, but:
>>
>> - it specified DWORD accesses for some of the modern registers,
>>
>> - in particular, the switch from the legacy block to the modern block
>>   would require a DWORD write to the *legacy* block.
>>
>> The latter functionality was then implemented in cpu_status_write()
>> [hw/acpi/cpu_hotplug.c], in patch#8.
>>
>> Unfortunately, all DWORD accesses depended on a dormant bug: the one
>> introced in earlier commit a014ed07bd5a ("memory: accept mismatching sizes
>> in memory_region_access_valid", 2013-05-29); first released in v1.6.0.
>> Due to commit a014ed07bd5a, the DWORD accesses to the *legacy* CPU hotplug
>> register block would work in spite of the above series *not* relaxing
>> "valid.max_access_size = 1" in "hw/acpi/cpu_hotplug.c":
>>
>>> static const MemoryRegionOps AcpiCpuHotplug_ops = {
>>>     .read = cpu_status_read,
>>>     .write = cpu_status_write,
>>>     .endianness = DEVICE_LITTLE_ENDIAN,
>>>     .valid = {
>>>         .min_access_size = 1,
>>>         .max_access_size = 1,
>>>     },
>>> };  
>>
>> Later, in commits e6d0c3ce6895 ("acpi: cpuhp: introduce 'Command data 2'
>> field", 2020-01-22) and ae340aa3d256 ("acpi: cpuhp: spec: add typical
>> usecases", 2020-01-22), first released in v5.0.0, the modern CPU hotplug
>> interface (including the documentation) was extended with another DWORD
>> *read* access, namely to the "Command data 2" register, which would be
>> important for the guest to confirm whether it managed to switch the
>> register block from legacy to modern.
>>
>> This functionality too silently depended on the bug from commit
>> a014ed07bd5a.
>>
>> In commit 5d971f9e6725 ('memory: Revert "memory: accept mismatching sizes
>> in memory_region_access_valid"', 2020-06-26), first released in v5.1.0,
>> the bug from commit a014ed07bd5a was fixed (the commit was reverted).
>> That swiftly exposed the bug in "AcpiCpuHotplug_ops", still present from
>> the v2.7.0 series quoted at the top -- namely the fact that
>> "valid.max_access_size = 1" didn't match what the guest was supposed to
>> do, according to the spec ("docs/specs/acpi_cpu_hotplug.txt").
>>
>> The symptom is that the "modern interface negotiation protocol"
>> described in commit ae340aa3d256:
>>
>>> +      Use following steps to detect and enable modern CPU hotplug interface:
>>> +        1. Store 0x0 to the 'CPU selector' register,
>>> +           attempting to switch to modern mode
>>> +        2. Store 0x0 to the 'CPU selector' register,
>>> +           to ensure valid selector value
>>> +        3. Store 0x0 to the 'Command field' register,
>>> +        4. Read the 'Command data 2' register.
>>> +           If read value is 0x0, the modern interface is enabled.
>>> +           Otherwise legacy or no CPU hotplug interface available  
>>
>> falls apart for the guest: steps 1 and 2 are lost, because they are DWORD
>> writes; so no switching happens.  Step 3 (a single-byte write) is not
>> lost, but it has no effect; see the condition in cpu_status_write() in
>> patch#8.  And step 4 *misleads* the guest into thinking that the switch
>> worked: the DWORD read is lost again -- it returns zero to the guest
>> without ever reaching the device model, so the guest never learns the
>> switch didn't work.
>>
>> This means that guest behavior centered on the "Command data 2" register
>> worked *only* in the v5.0.0 release; it got effectively regressed in
>> v5.1.0.
>>
>> To make things *even more* complicated, the breakage was (and remains, as
>> of today) visible with TCG acceleration only.  Commit 5d971f9e6725 makes
>> no difference with KVM acceleration -- the DWORD accesses still work,
>> despite "valid.max_access_size = 1".
>>
>> As commit 5d971f9e6725 suggests, fix the problem by raising
>> "valid.max_access_size" to 4 -- the spec now clearly instructs the guest
>> to perform DWORD accesses to the legacy register block too, for enabling
>> (and verifying!) the modern block.  In order to keep compatibility for the
>> device model implementation though, set "impl.max_access_size = 1", so
>> that wide accesses be split before they reach the legacy read/write
>> handlers, like they always have been on KVM, and like they were on TCG
>> before 5d971f9e6725 (v5.1.0).
>>
>> Tested with:
>>
>> - OVMF IA32 + qemu-system-i386, CPU hotplug/hot-unplug with SMM,
>>   intermixed with ACPI S3 suspend/resume, using KVM accel
>>   (regression-test);
>>
>> - OVMF IA32X64 + qemu-system-x86_64, CPU hotplug/hot-unplug with SMM,
>>   intermixed with ACPI S3 suspend/resume, using KVM accel
>>   (regression-test);
>>
>> - OVMF IA32 + qemu-system-i386, SMM enabled, using TCG accel; verified the
>>   register block switch and the present/possible CPU counting through the
>>   modern hotplug interface, during OVMF boot (bugfix test);
> 
> 
>> - I do not have any testcase (guest payload) for regression-testing CPU
>>   hotplug through the *legacy* CPU hotplug register block.

> I've checked it with old Seabios (that had it's own ACPI tables) (taken from 1.6 QEMU branch),  
> it works fine in TCG and KVM mode.
> 
> Tested-by: Igor Mammedov <imammedo@redhat.com>

Awesome, thank you!
Laszlo

> 
>>
>> Cc: "Michael S. Tsirkin" <mst@redhat.com>
>> Cc: Ani Sinha <ani@anisinha.ca>
>> Cc: Ard Biesheuvel <ardb@kernel.org>
>> Cc: Igor Mammedov <imammedo@redhat.com>
>> Cc: Paolo Bonzini <pbonzini@redhat.com>
>> Cc: Peter Maydell <peter.maydell@linaro.org>
>> Cc: qemu-stable@nongnu.org
>> Ref: "IO port write width clamping differs between TCG and KVM"
>> Link: http://mid.mail-archive.com/aaedee84-d3ed-a4f9-21e7-d221a28d1683@redhat.com
>> Link: https://lists.gnu.org/archive/html/qemu-devel/2023-01/msg00199.html
>> Reported-by: Ard Biesheuvel <ardb@kernel.org>
>> Signed-off-by: Laszlo Ersek <lersek@redhat.com>
>> ---
>>
>> Notes:
>>     This should be applied to:
>>     
>>     - stable-5.2 (new branch)
>>     
>>     - stable-6.2 (new branch)
>>     
>>     - stable-7.2 (new branch)
>>     
>>     whichever is still considered maintained, as there is currently *no*
>>     public QEMU release in which the modern CPU hotplug register block
>>     works, when using TCG acceleration.  v5.0.0 works, but that minor
>>     release has been obsoleted by v5.2.0, which does not work.
>>
>>  hw/acpi/cpu_hotplug.c | 3 +++
>>  1 file changed, 3 insertions(+)
>>
>> diff --git a/hw/acpi/cpu_hotplug.c b/hw/acpi/cpu_hotplug.c
>> index 53654f863830..ff14c3f4106f 100644
>> --- a/hw/acpi/cpu_hotplug.c
>> +++ b/hw/acpi/cpu_hotplug.c
>> @@ -52,6 +52,9 @@ static const MemoryRegionOps AcpiCpuHotplug_ops = {
>>      .endianness = DEVICE_LITTLE_ENDIAN,
>>      .valid = {
>>          .min_access_size = 1,
>> +        .max_access_size = 4,
>> +    },
>> +    .impl = {
>>          .max_access_size = 1,
>>      },
>>  };
>
Philippe Mathieu-Daudé Jan. 4, 2023, 12:27 p.m. UTC | #8
On 4/1/23 11:38, Igor Mammedov wrote:
> On Wed, 4 Jan 2023 10:34:09 +0100
> Philippe Mathieu-Daudé <philmd@linaro.org> wrote:
> 
>> On 4/1/23 10:01, Laszlo Ersek wrote:
> [...]
>>> diff --git a/hw/acpi/cpu_hotplug.c b/hw/acpi/cpu_hotplug.c
>>> index 53654f863830..ff14c3f4106f 100644
>>> --- a/hw/acpi/cpu_hotplug.c
>>> +++ b/hw/acpi/cpu_hotplug.c
>>> @@ -52,6 +52,9 @@ static const MemoryRegionOps AcpiCpuHotplug_ops = {
>>>        .endianness = DEVICE_LITTLE_ENDIAN,
>>>        .valid = {
>>>            .min_access_size = 1,
>>> +        .max_access_size = 4,
>>> +    },
>>> +    .impl = {
>>>            .max_access_size = 1,
>>
>> Arguably:
>> Fixes: b8622725cf ("acpi_piix4: Add infrastructure to send CPU hot-plug
>> GPE to guest")
> 
> nope, this one is correct, as legacy interface used 1 byte access only

Yes, Laszlo explained elsewhere in the thread.
Michael S. Tsirkin Jan. 4, 2023, 12:35 p.m. UTC | #9
On Wed, Jan 04, 2023 at 10:01:38AM +0100, Laszlo Ersek wrote:
> The modern ACPI CPU hotplug interface was introduced in the following
> series (aa1dd39ca307..679dd1a957df), released in v2.7.0:
> 
>   1  abd49bc2ed2f docs: update ACPI CPU hotplug spec with new protocol
>   2  16bcab97eb9f pc: piix4/ich9: add 'cpu-hotplug-legacy' property
>   3  5e1b5d93887b acpi: cpuhp: add CPU devices AML with _STA method
>   4  ac35f13ba8f8 pc: acpi: introduce AcpiDeviceIfClass.madt_cpu hook
>   5  d2238cb6781d acpi: cpuhp: implement hot-add parts of CPU hotplug
>                   interface
>   6  8872c25a26cc acpi: cpuhp: implement hot-remove parts of CPU hotplug
>                   interface
>   7  76623d00ae57 acpi: cpuhp: add cpu._OST handling
>   8  679dd1a957df pc: use new CPU hotplug interface since 2.7 machine type
> 
> Before patch#1, "docs/specs/acpi_cpu_hotplug.txt" only specified 1-byte
> accesses for the hotplug register block.  Patch#1 preserved the same
> restriction for the legacy register block, but:
> 
> - it specified DWORD accesses for some of the modern registers,
> 
> - in particular, the switch from the legacy block to the modern block
>   would require a DWORD write to the *legacy* block.
> 
> The latter functionality was then implemented in cpu_status_write()
> [hw/acpi/cpu_hotplug.c], in patch#8.
> 
> Unfortunately, all DWORD accesses depended on a dormant bug: the one
> introced

introduced

> in earlier commit a014ed07bd5a ("memory: accept mismatching sizes
> in memory_region_access_valid", 2013-05-29); first released in v1.6.0.
> Due to commit a014ed07bd5a, the DWORD accesses to the *legacy* CPU hotplug
> register block would work in spite of the above series *not* relaxing
> "valid.max_access_size = 1" in "hw/acpi/cpu_hotplug.c":
> 
> > static const MemoryRegionOps AcpiCpuHotplug_ops = {
> >     .read = cpu_status_read,
> >     .write = cpu_status_write,
> >     .endianness = DEVICE_LITTLE_ENDIAN,
> >     .valid = {
> >         .min_access_size = 1,
> >         .max_access_size = 1,
> >     },
> > };
> 
> Later, in commits e6d0c3ce6895 ("acpi: cpuhp: introduce 'Command data 2'
> field", 2020-01-22) and ae340aa3d256 ("acpi: cpuhp: spec: add typical
> usecases", 2020-01-22), first released in v5.0.0, the modern CPU hotplug
> interface (including the documentation) was extended with another DWORD
> *read* access, namely to the "Command data 2" register, which would be
> important for the guest to confirm whether it managed to switch the
> register block from legacy to modern.
> 
> This functionality too silently depended on the bug from commit
> a014ed07bd5a.
> 
> In commit 5d971f9e6725 ('memory: Revert "memory: accept mismatching sizes
> in memory_region_access_valid"', 2020-06-26), first released in v5.1.0,
> the bug from commit a014ed07bd5a was fixed (the commit was reverted).
> That swiftly exposed the bug in "AcpiCpuHotplug_ops", still present from
> the v2.7.0 series quoted at the top -- namely the fact that
> "valid.max_access_size = 1" didn't match what the guest was supposed to
> do, according to the spec ("docs/specs/acpi_cpu_hotplug.txt").
> 
> The symptom is that the "modern interface negotiation protocol"
> described in commit ae340aa3d256:
> 
> > +      Use following steps to detect and enable modern CPU hotplug interface:
> > +        1. Store 0x0 to the 'CPU selector' register,
> > +           attempting to switch to modern mode
> > +        2. Store 0x0 to the 'CPU selector' register,
> > +           to ensure valid selector value
> > +        3. Store 0x0 to the 'Command field' register,
> > +        4. Read the 'Command data 2' register.
> > +           If read value is 0x0, the modern interface is enabled.
> > +           Otherwise legacy or no CPU hotplug interface available
> 
> falls apart for the guest: steps 1 and 2 are lost, because they are DWORD
> writes; so no switching happens.  Step 3 (a single-byte write) is not
> lost, but it has no effect; see the condition in cpu_status_write() in
> patch#8.  And step 4 *misleads* the guest into thinking that the switch
> worked: the DWORD read is lost again -- it returns zero to the guest
> without ever reaching the device model, so the guest never learns the
> switch didn't work.
> 
> This means that guest behavior centered on the "Command data 2" register
> worked *only* in the v5.0.0 release; it got effectively regressed in
> v5.1.0.
> 
> To make things *even more* complicated, the breakage was (and remains, as
> of today) visible with TCG acceleration only.  Commit 5d971f9e6725 makes
> no difference with KVM acceleration -- the DWORD accesses still work,
> despite "valid.max_access_size = 1".

BTW do you happen to know why that's the case for KVM?
Because if kvm ignores valid.max_access_size generally then
commit 5d971f9e6725 is incomplete, and we probably have some
related kvm-only bugs.

> As commit 5d971f9e6725 suggests, fix the problem by raising
> "valid.max_access_size" to 4 -- the spec now clearly instructs the guest
> to perform DWORD accesses to the legacy register block too, for enabling
> (and verifying!) the modern block.  In order to keep compatibility for the
> device model implementation though, set "impl.max_access_size = 1", so
> that wide accesses be split before they reach the legacy read/write
> handlers, like they always have been on KVM, and like they were on TCG
> before 5d971f9e6725 (v5.1.0).
> 
> Tested with:
> 
> - OVMF IA32 + qemu-system-i386, CPU hotplug/hot-unplug with SMM,
>   intermixed with ACPI S3 suspend/resume, using KVM accel
>   (regression-test);
> 
> - OVMF IA32X64 + qemu-system-x86_64, CPU hotplug/hot-unplug with SMM,
>   intermixed with ACPI S3 suspend/resume, using KVM accel
>   (regression-test);
> 
> - OVMF IA32 + qemu-system-i386, SMM enabled, using TCG accel; verified the
>   register block switch and the present/possible CPU counting through the
>   modern hotplug interface, during OVMF boot (bugfix test);
> 
> - I do not have any testcase (guest payload) for regression-testing CPU
>   hotplug through the *legacy* CPU hotplug register block.
> 
> Cc: "Michael S. Tsirkin" <mst@redhat.com>
> Cc: Ani Sinha <ani@anisinha.ca>
> Cc: Ard Biesheuvel <ardb@kernel.org>
> Cc: Igor Mammedov <imammedo@redhat.com>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Peter Maydell <peter.maydell@linaro.org>
> Cc: qemu-stable@nongnu.org
> Ref: "IO port write width clamping differs between TCG and KVM"
> Link: http://mid.mail-archive.com/aaedee84-d3ed-a4f9-21e7-d221a28d1683@redhat.com
> Link: https://lists.gnu.org/archive/html/qemu-devel/2023-01/msg00199.html
> Reported-by: Ard Biesheuvel <ardb@kernel.org>
> Signed-off-by: Laszlo Ersek <lersek@redhat.com>
> ---
> 
> Notes:
>     This should be applied to:
>     
>     - stable-5.2 (new branch)
>     
>     - stable-6.2 (new branch)
>     
>     - stable-7.2 (new branch)
>     
>     whichever is still considered maintained, as there is currently *no*
>     public QEMU release in which the modern CPU hotplug register block
>     works, when using TCG acceleration.  v5.0.0 works, but that minor
>     release has been obsoleted by v5.2.0, which does not work.
> 
>  hw/acpi/cpu_hotplug.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/hw/acpi/cpu_hotplug.c b/hw/acpi/cpu_hotplug.c
> index 53654f863830..ff14c3f4106f 100644
> --- a/hw/acpi/cpu_hotplug.c
> +++ b/hw/acpi/cpu_hotplug.c
> @@ -52,6 +52,9 @@ static const MemoryRegionOps AcpiCpuHotplug_ops = {
>      .endianness = DEVICE_LITTLE_ENDIAN,
>      .valid = {
>          .min_access_size = 1,
> +        .max_access_size = 4,
> +    },
> +    .impl = {
>          .max_access_size = 1,
>      },
>  };
Laszlo Ersek Jan. 5, 2023, 7:13 a.m. UTC | #10
On 1/4/23 13:35, Michael S. Tsirkin wrote:
> On Wed, Jan 04, 2023 at 10:01:38AM +0100, Laszlo Ersek wrote:
>> The modern ACPI CPU hotplug interface was introduced in the following
>> series (aa1dd39ca307..679dd1a957df), released in v2.7.0:
>>
>>   1  abd49bc2ed2f docs: update ACPI CPU hotplug spec with new protocol
>>   2  16bcab97eb9f pc: piix4/ich9: add 'cpu-hotplug-legacy' property
>>   3  5e1b5d93887b acpi: cpuhp: add CPU devices AML with _STA method
>>   4  ac35f13ba8f8 pc: acpi: introduce AcpiDeviceIfClass.madt_cpu hook
>>   5  d2238cb6781d acpi: cpuhp: implement hot-add parts of CPU hotplug
>>                   interface
>>   6  8872c25a26cc acpi: cpuhp: implement hot-remove parts of CPU hotplug
>>                   interface
>>   7  76623d00ae57 acpi: cpuhp: add cpu._OST handling
>>   8  679dd1a957df pc: use new CPU hotplug interface since 2.7 machine type
>>
>> Before patch#1, "docs/specs/acpi_cpu_hotplug.txt" only specified 1-byte
>> accesses for the hotplug register block.  Patch#1 preserved the same
>> restriction for the legacy register block, but:
>>
>> - it specified DWORD accesses for some of the modern registers,
>>
>> - in particular, the switch from the legacy block to the modern block
>>   would require a DWORD write to the *legacy* block.
>>
>> The latter functionality was then implemented in cpu_status_write()
>> [hw/acpi/cpu_hotplug.c], in patch#8.
>>
>> Unfortunately, all DWORD accesses depended on a dormant bug: the one
>> introced
>
> introduced

Huh, thanks. :)

... Because I'm proposing this for stable as well, I figure if I post a
v2 just with this small update, too.

(and I'm just noticing that Phil pointed out the same typo earlier --
sorry Phil, I scrolled through that, my apologies, and thanks for
catching it on your end as well!)

>
>> in earlier commit a014ed07bd5a ("memory: accept mismatching sizes
>> in memory_region_access_valid", 2013-05-29); first released in v1.6.0.
>> Due to commit a014ed07bd5a, the DWORD accesses to the *legacy* CPU hotplug
>> register block would work in spite of the above series *not* relaxing
>> "valid.max_access_size = 1" in "hw/acpi/cpu_hotplug.c":
>>
>>> static const MemoryRegionOps AcpiCpuHotplug_ops = {
>>>     .read = cpu_status_read,
>>>     .write = cpu_status_write,
>>>     .endianness = DEVICE_LITTLE_ENDIAN,
>>>     .valid = {
>>>         .min_access_size = 1,
>>>         .max_access_size = 1,
>>>     },
>>> };
>>
>> Later, in commits e6d0c3ce6895 ("acpi: cpuhp: introduce 'Command data 2'
>> field", 2020-01-22) and ae340aa3d256 ("acpi: cpuhp: spec: add typical
>> usecases", 2020-01-22), first released in v5.0.0, the modern CPU hotplug
>> interface (including the documentation) was extended with another DWORD
>> *read* access, namely to the "Command data 2" register, which would be
>> important for the guest to confirm whether it managed to switch the
>> register block from legacy to modern.
>>
>> This functionality too silently depended on the bug from commit
>> a014ed07bd5a.
>>
>> In commit 5d971f9e6725 ('memory: Revert "memory: accept mismatching sizes
>> in memory_region_access_valid"', 2020-06-26), first released in v5.1.0,
>> the bug from commit a014ed07bd5a was fixed (the commit was reverted).
>> That swiftly exposed the bug in "AcpiCpuHotplug_ops", still present from
>> the v2.7.0 series quoted at the top -- namely the fact that
>> "valid.max_access_size = 1" didn't match what the guest was supposed to
>> do, according to the spec ("docs/specs/acpi_cpu_hotplug.txt").
>>
>> The symptom is that the "modern interface negotiation protocol"
>> described in commit ae340aa3d256:
>>
>>> +      Use following steps to detect and enable modern CPU hotplug interface:
>>> +        1. Store 0x0 to the 'CPU selector' register,
>>> +           attempting to switch to modern mode
>>> +        2. Store 0x0 to the 'CPU selector' register,
>>> +           to ensure valid selector value
>>> +        3. Store 0x0 to the 'Command field' register,
>>> +        4. Read the 'Command data 2' register.
>>> +           If read value is 0x0, the modern interface is enabled.
>>> +           Otherwise legacy or no CPU hotplug interface available
>>
>> falls apart for the guest: steps 1 and 2 are lost, because they are DWORD
>> writes; so no switching happens.  Step 3 (a single-byte write) is not
>> lost, but it has no effect; see the condition in cpu_status_write() in
>> patch#8.  And step 4 *misleads* the guest into thinking that the switch
>> worked: the DWORD read is lost again -- it returns zero to the guest
>> without ever reaching the device model, so the guest never learns the
>> switch didn't work.
>>
>> This means that guest behavior centered on the "Command data 2" register
>> worked *only* in the v5.0.0 release; it got effectively regressed in
>> v5.1.0.
>>
>> To make things *even more* complicated, the breakage was (and remains, as
>> of today) visible with TCG acceleration only.  Commit 5d971f9e6725 makes
>> no difference with KVM acceleration -- the DWORD accesses still work,
>> despite "valid.max_access_size = 1".
>
> BTW do you happen to know why that's the case for KVM? Because if kvm
> ignores valid.max_access_size generally then commit 5d971f9e6725 is
> incomplete, and we probably have some related kvm-only bugs.

It remains a mystery for me why KVM accel does not enforce
"valid.max_access_size".

In the thread I started earlier (which led to this patch), at

  "IO port write width clamping differs between TCG and KVM"
  https://lists.gnu.org/archive/html/qemu-devel/2023-01/msg00199.html

I included a backtrace of the call stack, taken when KVM *let through*
the DWORD access (and it got split into 4 single-byte accesses). I don't
want to quote the entire stack trace, but I'll quote my (rudimentary)
understanding of it:

On 1/3/23 18:42, Laszlo Ersek wrote:
> Now, if I look at the above-quoted backtrace, I see that, from frame#5
> to frame#4, a kind of "streaming" happens (with KVM only). Because,
> len=4 is there, but "l=1" appears upon entry to
> flatview_write_continue():
>
>   flatview_write()                         [softmmu/physmem.c]
>     flatview_translate()                   [softmmu/physmem.c]
>       flatview_do_translate()              [softmmu/physmem.c]
>         address_space_translate_internal() [softmmu/physmem.c]
>     flatview_write_continue()              [softmmu/physmem.c]
>
> And then I vaguely feel that this is somehow related to the following
> big comment in address_space_translate_internal() [softmmu/physmem.c]:
>
>>     /* MMIO registers can be expected to perform full-width accesses based only
>>      * on their address, without considering adjacent registers that could
>>      * decode to completely different MemoryRegions.  When such registers
>>      * exist (e.g. I/O ports 0xcf8 and 0xcf9 on most PC chipsets), MMIO
>>      * regions overlap wildly.  For this reason we cannot clamp the accesses
>>      * here.
>>      *
>>      * If the length is small (as is the case for address_space_ldl/stl),
>>      * everything works fine.  If the incoming length is large, however,
>>      * the caller really has to do the clamping through memory_access_size.
>>      */
>
> What I don't understand is the *difference* in behavior between KVM
> and TCG. The above reasoning either applies to both KVM and TCG, or it
> applies to neither, right?

I can try one more thing. I'll set another breakpoint on
cpu_status_write(), and get a backtrace when it is hit on *TCG* (now
with this patch applied). Then compare the backtrace with that taken
from the KVM run (which had no fix applied). There could be a
difference.

... Oh yeah, it's a *completely* different stack trace. OK, I've changed
my mind about not including the KVM stack trace here -- I need to do
that now, for comparison.

So, this is the KVM stack trace again, with the DWORD access succeeding,
*without* this patch applied to QEMU:

> #0  cpu_status_write (opaque=0x5555572b9880, addr=0, data=0, size=1) at ../../src/upstream/qemu/hw/acpi/cpu_hotplug.c:42
> #1  0x0000555555d3d34f in memory_region_write_accessor (mr=0x5555572b9890, addr=0, value=0x7fffe6412fe8, size=1, shift=0, mask=255, attrs=...) at ../../src/upstream/qemu/softmmu/memory.c:493
>         tmp = 0
> #2  0x0000555555d3d595 in access_with_adjusted_size (addr=0, value=0x7fffe6412fe8, size=1, access_size_min=1, access_size_max=4, access_fn=0x555555d3d259 <memory_region_write_accessor>, mr=0x5555572b9890, attrs=...)
>     at ../../src/upstream/qemu/softmmu/memory.c:555
>         access_mask = 255
>         access_size = 1
>         i = 0
>         r = 0
> #3  0x0000555555d406d8 in memory_region_dispatch_write (mr=0x5555572b9890, addr=0, data=0, op=MO_8, attrs=...) at ../../src/upstream/qemu/softmmu/memory.c:1515
>         size = 1
> #4  0x0000555555d4e320 in flatview_write_continue (fv=0x7fffdc0141e0, addr=3288, attrs=..., ptr=0x7ffff7fbd000, len=4, addr1=0, l=1, mr=0x5555572b9890) at ../../src/upstream/qemu/softmmu/physmem.c:2825
>         ram_ptr = 0x4 <error: Cannot access memory at address 0x4>
>         val = 0
>         result = 0
>         release_lock = true
>         buf = 0x7ffff7fbd000 ""
> #5  0x0000555555d4e483 in flatview_write (fv=0x7fffdc0141e0, addr=3288, attrs=..., buf=0x7ffff7fbd000, len=4) at ../../src/upstream/qemu/softmmu/physmem.c:2867
>         l = 4
>         addr1 = 0
>         mr = 0x5555572b9890
> #6  0x0000555555d4e833 in address_space_write (as=0x55555690d000 <address_space_io>, addr=3288, attrs=..., buf=0x7ffff7fbd000, len=4) at ../../src/upstream/qemu/softmmu/physmem.c:2963
>         _rcu_read_auto = 0x1
>         result = 0
>         fv = 0x7fffdc0141e0
> #7  0x0000555555d4e8a0 in address_space_rw (as=0x55555690d000 <address_space_io>, addr=3288, attrs=..., buf=0x7ffff7fbd000, len=4, is_write=true) at ../../src/upstream/qemu/softmmu/physmem.c:2973
> #8  0x0000555555de1c6e in kvm_handle_io (port=3288, attrs=..., data=0x7ffff7fbd000, direction=1, size=4, count=1) at ../../src/upstream/qemu/accel/kvm/kvm-all.c:2639
>         i = 0
>         ptr = 0x7ffff7fbd000 ""
> #9  0x0000555555de240e in kvm_cpu_exec (cpu=0x555556d42db0) at ../../src/upstream/qemu/accel/kvm/kvm-all.c:2890
>         attrs = {unspecified = 0, secure = 0, user = 0, memory = 0, requester_id = 0, byte_swap = 0, target_tlb_bit0 = 0, target_tlb_bit1 = 0, target_tlb_bit2 = 0}
>         run = 0x7ffff7fbc000
>         ret = 0
>         run_ret = 0
> #10 0x0000555555de51a2 in kvm_vcpu_thread_fn (arg=0x555556d42db0) at ../../src/upstream/qemu/accel/kvm/kvm-accel-ops.c:51
>         cpu = 0x555556d42db0
>         r = 65536
> #11 0x0000555555feb23a in qemu_thread_start (args=0x555556d46240) at ../../src/upstream/qemu/util/qemu-thread-posix.c:505
>         __cancel_buf =
>             {__cancel_jmp_buf = {{__cancel_jmp_buf = {140737056437824, -6609603676087882481, 140737056437824, 7, 140737304061232, 0, -6609603676064813809, -1076185459444912881}, __mask_was_saved = 0}}, __pad = {0x7fffe6413330, 0x0, 0x0, 0x0}}
>         __cancel_routine = 0x555555feb0e9 <qemu_thread_atexit_notify>
>         __cancel_arg = 0x0
>         __not_first_call = 0
>         qemu_thread_args = 0x555556d46240
>         start_routine = 0x555555de50d6 <kvm_vcpu_thread_fn>
>         arg = 0x555556d42db0
>         r = 0x0
> #12 0x00007ffff503e802 in start_thread () at /lib64/libc.so.6
> #13 0x00007ffff4fde450 in clone3 () at /lib64/libc.so.6


And the following is the TCG stack trace, with the patch *applied*:


> #0  cpu_status_write (opaque=0x55555733cd10, addr=0, data=0, size=1) at ../../src/upstream/qemu/hw/acpi/cpu_hotplug.c:42
> #1  0x0000555555d3fc48 in memory_region_write_accessor (mr=0x55555733cd20, addr=0, value=0x7fffe65d1ac8, size=1, shift=0, mask=255, attrs=...) at ../../src/upstream/qemu/softmmu/memory.c:493
>         tmp = 0
> #2  0x0000555555d3fe8e in access_with_adjusted_size (addr=0, value=0x7fffe65d1ac8, size=4, access_size_min=1, access_size_max=1, access_fn=0x555555d3fb52 <memory_region_write_accessor>, mr=0x55555733cd20, attrs=...)
>     at ../../src/upstream/qemu/softmmu/memory.c:555
>         access_mask = 255
>         access_size = 1
>         i = 0
>         r = 0
> #3  0x0000555555d42fd1 in memory_region_dispatch_write (mr=0x55555733cd20, addr=0, data=0, op=MO_32, attrs=...) at ../../src/upstream/qemu/softmmu/memory.c:1515
>         size = 4
> #4  0x0000555555d527a9 in address_space_stl_internal (as=0x555556912900 <address_space_io>, addr=3288, val=0, attrs=..., result=0x0, endian=DEVICE_NATIVE_ENDIAN) at /home/lacos/src/upstream/qemu/memory_ldst.c.inc:319
>         ptr = 0x555556d67920 "@\210\v\230\377\177"
>         mr = 0x55555733cd20
>         l = 4
>         addr1 = 0
>         r = 8606785
>         release_lock = true
> #5  0x0000555555d528a0 in address_space_stl (as=0x555556912900 <address_space_io>, addr=3288, val=0, attrs=..., result=0x0) at /home/lacos/src/upstream/qemu/memory_ldst.c.inc:350
> #6  0x0000555555bc69a4 in helper_outl (env=0x555556d5d410, port=3288, data=0) at ../../src/upstream/qemu/target/i386/tcg/sysemu/misc_helper.c:55
> #7  0x00007fff5812602e in code_gen_buffer ()
> #8  0x0000555555dbed82 in cpu_tb_exec (cpu=0x555556d5c6a0, itb=0x7fff9813d740, tb_exit=0x7fffe65d2158) at ../../src/upstream/qemu/accel/tcg/cpu-exec.c:438
>         env = 0x555556d5d410
>         ret = 93825001053227
>         last_tb = 0x83a067
>         tb_ptr = 0x7fff5813d800 <code_gen_buffer+1300435>
>         __PRETTY_FUNCTION__ = "cpu_tb_exec"
> #9  0x0000555555dbfa2a in cpu_loop_exec_tb (cpu=0x555556d5c6a0, tb=0x7fff9813d740, pc=8626279, last_tb=0x7fffe65d2170, tb_exit=0x7fffe65d2158) at ../../src/upstream/qemu/accel/tcg/cpu-exec.c:868
>         insns_left = 0
>         __PRETTY_FUNCTION__ = "cpu_loop_exec_tb"
> #10 0x0000555555dbfe10 in cpu_exec (cpu=0x555556d5c6a0) at ../../src/upstream/qemu/accel/tcg/cpu-exec.c:1032
>         tb = 0x7fff9813d740
>         flags = 4194992
>         cflags = 4278714368
>         cs_base = 0
>         pc = 8626279
>         last_tb = 0x7fff9813d640
>         tb_exit = 0
>         ret = 32767
>         sc = {diff_clk = 0, last_cpu_icount = 0, realtime_clock = 0}
>         __func__ = "cpu_exec"
> #11 0x0000555555deb245 in tcg_cpus_exec (cpu=0x555556d5c6a0) at ../../src/upstream/qemu/accel/tcg/tcg-accel-ops.c:69
>         ret = 21845
>         __PRETTY_FUNCTION__ = "tcg_cpus_exec"
> #12 0x0000555555deb8fd in mttcg_cpu_thread_fn (arg=0x555556d5c6a0) at ../../src/upstream/qemu/accel/tcg/tcg-accel-ops-mttcg.c:95
>         r = 65536
>         force_rcu = {notifier = {notify = 0x555555deb766 <mttcg_force_rcu>, node = {le_next = 0x0, le_prev = 0x7fffe65d64a0}}, cpu = 0x555556d5c6a0}
>         cpu = 0x555556d5c6a0
>         __PRETTY_FUNCTION__ = "mttcg_cpu_thread_fn"
>         __func__ = "mttcg_cpu_thread_fn"
> #13 0x0000555555fee5a2 in qemu_thread_start (args=0x555556d899a0) at ../../src/upstream/qemu/util/qemu-thread-posix.c:505
>         __cancel_buf =
>             {__cancel_jmp_buf = {{__cancel_jmp_buf = {140737058268736, 8614073237258514976, 140737058268736, 0, 140737304061232, 0, 8614073237214474784, 2512501441319154208}, __mask_was_saved = 0}}, __pad = {0x7fffe65d2330, 0x0, 0x0, 0x0}}
>         __cancel_routine = 0x555555fee451 <qemu_thread_atexit_notify>
>         __cancel_arg = 0x0
>         __not_first_call = 0
>         qemu_thread_args = 0x555556d899a0
>         start_routine = 0x555555deb7a8 <mttcg_cpu_thread_fn>
>         arg = 0x555556d5c6a0
>         r = 0x0
> #14 0x00007ffff503e802 in start_thread () at /lib64/libc.so.6
> #15 0x00007ffff4fde450 in clone3 () at /lib64/libc.so.6

Almost entirely different. They only join at
memory_region_dispatch_write(), and you'll notice that, for
memory_region_dispatch_write(), the operation is different: in the
unpatched-QEMU-with-KVM case, it is MO_8, in the patched-QEMU-with-TCG
case, it is MO_32. So whatever does the verification and splitting for
KVM runs *earlier* (more outwards on the stack) than
memory_region_dispatch_write() -- in the somewhere in the parts that
differ between TCG and KVM.

So commit 5d971f9e6725 made a difference for helper_outl() /
address_space_stl() / address_space_stl_internal(), by modifying
(restricting) memory_region_access_valid(). The some modification does
not play a role, apparently, for the "flatview" stuff that is in use on
KVM!

Ah, found something else. Set a breakpoint like this, in the
patched-QEMU-with-TCG case:

(gdb) break memory_region_access_valid if mr->ops==&AcpiCpuHotplug_ops

The backtrace I got for that is this:

> #0  memory_region_access_valid (mr=0x55555733cd20, addr=0, size=4, is_write=true, attrs=...) at ../../src/upstream/qemu/softmmu/memory.c:1380
> #1  0x0000555555d42f11 in memory_region_dispatch_write (mr=0x55555733cd20, addr=0, data=0, op=MO_32, attrs=...) at ../../src/upstream/qemu/softmmu/memory.c:1502
> #2  0x0000555555d527a9 in address_space_stl_internal (as=0x555556912900 <address_space_io>, addr=3288, val=0, attrs=..., result=0x0, endian=DEVICE_NATIVE_ENDIAN) at /home/lacos/src/upstream/qemu/memory_ldst.c.inc:319
> #3  0x0000555555d528a0 in address_space_stl (as=0x555556912900 <address_space_io>, addr=3288, val=0, attrs=..., result=0x0) at /home/lacos/src/upstream/qemu/memory_ldst.c.inc:350
> #4  0x0000555555bc69a4 in helper_outl (env=0x555556d5d410, port=3288, data=0) at ../../src/upstream/qemu/target/i386/tcg/sysemu/misc_helper.c:55
> #5  0x00007fff5812602e in code_gen_buffer ()
> #6  0x0000555555dbed82 in cpu_tb_exec (cpu=0x555556d5c6a0, itb=0x7fff9813d740, tb_exit=0x7fffe65d2158) at ../../src/upstream/qemu/accel/tcg/cpu-exec.c:438
> #7  0x0000555555dbfa2a in cpu_loop_exec_tb (cpu=0x555556d5c6a0, tb=0x7fff9813d740, pc=8626279, last_tb=0x7fffe65d2170, tb_exit=0x7fffe65d2158) at ../../src/upstream/qemu/accel/tcg/cpu-exec.c:868
> #8  0x0000555555dbfe10 in cpu_exec (cpu=0x555556d5c6a0) at ../../src/upstream/qemu/accel/tcg/cpu-exec.c:1032
> #9  0x0000555555deb245 in tcg_cpus_exec (cpu=0x555556d5c6a0) at ../../src/upstream/qemu/accel/tcg/tcg-accel-ops.c:69
> #10 0x0000555555deb8fd in mttcg_cpu_thread_fn (arg=0x555556d5c6a0) at ../../src/upstream/qemu/accel/tcg/tcg-accel-ops-mttcg.c:95
> #11 0x0000555555fee5a2 in qemu_thread_start (args=0x555556d899a0) at ../../src/upstream/qemu/util/qemu-thread-posix.c:505
> #12 0x00007ffff503e802 in start_thread () at /lib64/libc.so.6
> #13 0x00007ffff4fde450 in clone3 () at /lib64/libc.so.6

This explains why commit 5d971f9e6725 makes no difference for KVM.

Namely, in case of TCG, memory_region_dispatch_write() is entered with
op=MO_32, and *then* memory_region_access_valid() verifies that, with
size=4. Commit 5d971f9e6725 restricts memory_region_access_valid(), so
-- without the present patch -- it returns "false", and then
memory_region_dispatch_write() fails too, returning MEMTX_DECODE_ERROR.

But in case of KVM, memory_region_dispatch_write() is *already* entered
with op=MO_8, due to the splitting happening earlier -- and therefore
memory_region_access_valid() is perfectly happy with size=1 (even
without this patch).

As I wrote earlier, the splitting on KVM definitely happens between
flatview_write() (frame#5) and flatview_write_continue() (frame#4). Let
me try check that...

OK, I think I've found it. Unfortuntely, it was not easy. You'll see
above in the KVM stack trace:

> #4  0x0000555555d4e320 in flatview_write_continue (fv=0x7fffdc0141e0, addr=3288, attrs=..., ptr=0x7ffff7fbd000, len=4, addr1=0, l=1, mr=0x5555572b9890) at ../../src/upstream/qemu/softmmu/physmem.c:2825

which might tempt you to set a breakpoint like this:

(gdb) break flatview_write_continue if addr==3288 && len==4 && l==1

However, this breakpoint never will be hit, even though the stack frame
*does* exist like shown, when cpu_status_write() is reached.

The solution to the riddle is that flatview_write_continue() actually
gets "l==4" from the outside... and then internally, it overwrites the
parameter "l". :( What the heck. Terrible programming practice.

So the breakpoint to set is actually (note "l==4"):

(gdb) break flatview_write_continue if addr==3288 && len==4 && l==4

And yes, that gives us the solution -- please locate line *2809* below,
at first:

  2733	int memory_access_size(MemoryRegion *mr, unsigned l, hwaddr addr)
  2734	{
  2735	    unsigned access_size_max = mr->ops->valid.max_access_size;
  2736
  2737	    /* Regions are assumed to support 1-4 byte accesses unless
  2738	       otherwise specified.  */
  2739	    if (access_size_max == 0) {
  2740	        access_size_max = 4;
  2741	    }
  2742
  2743	    /* Bound the maximum access by the alignment of the address.  */
  2744	    if (!mr->ops->impl.unaligned) {
  2745	        unsigned align_size_max = addr & -addr;
  2746	        if (align_size_max != 0 && align_size_max < access_size_max) {
  2747	            access_size_max = align_size_max;
  2748	        }
  2749	    }
  2750
  2751	    /* Don't attempt accesses larger than the maximum.  */
  2752	    if (l > access_size_max) {
  2753	        l = access_size_max;
  2754	    }
  2755	    l = pow2floor(l);
  2756
  2757	    return l;
  2758	}
[...]
  2786	static bool flatview_access_allowed(MemoryRegion *mr, MemTxAttrs attrs,
  2787	                                    hwaddr addr, hwaddr len)
  2788	{
  2789	    if (likely(!attrs.memory)) {
  2790	        return true;
  2791	    }
  2792	    if (memory_region_is_ram(mr)) {
  2793	        return true;
  2794	    }
  2795	    qemu_log_mask(LOG_GUEST_ERROR,
  2796	                  "Invalid access to non-RAM device at "
  2797	                  "addr 0x%" HWADDR_PRIX ", size %" HWADDR_PRIu ", "
  2798	                  "region '%s'\n", addr, len, memory_region_name(mr));
  2799	    return false;
  2800	}
  2801
  2802	/* Called within RCU critical section.  */
  2803	static MemTxResult flatview_write_continue(FlatView *fv, hwaddr addr,
  2804	                                           MemTxAttrs attrs,
  2805	                                           const void *ptr,
  2806	                                           hwaddr len, hwaddr addr1,
  2807	                                           hwaddr l, MemoryRegion *mr)
  2808	{
  2809	    uint8_t *ram_ptr;
  2810	    uint64_t val;
  2811	    MemTxResult result = MEMTX_OK;
  2812	    bool release_lock = false;
  2813	    const uint8_t *buf = ptr;
  2814
  2815	    for (;;) {
  2816	        if (!flatview_access_allowed(mr, attrs, addr1, l)) {
  2817	            result |= MEMTX_ACCESS_ERROR;
  2818	            /* Keep going. */
  2819	        } else if (!memory_access_is_direct(mr, true)) {
  2820	            release_lock |= prepare_mmio_access(mr);
  2821	            l = memory_access_size(mr, l, addr1);
  2822	            /* XXX: could force current_cpu to NULL to avoid
  2823	               potential bugs */
  2824	            val = ldn_he_p(buf, l);
  2825	            result |= memory_region_dispatch_write(mr, addr1, val,
  2826	                                                   size_memop(l), attrs);

So we enter at line 2809 with addr==3288 && len==4 && l==4.

The flatview_access_allowed() function -- defined at line 2786 --
returns "true", from line 2790, taking the (!attrs.memory) branch.

Note that flatview_access_allowed() doesn't check *len* at all. Not sure
if that's right.

Either way, back in flatview_write_continue(), we proceed to line 2821,
and call memory_access_size(), and assign the result to "l". This
assignment reduces "l" from 4 to 1:

- memory_access_size() is defined at line 2733, see above. In it,
  "access_size_max" is initialized to 1, on line 2735 -- this patch is
  not applied, so "AcpiCpuHotplug_ops.valid.max_access_size" is still 1.

- On line 2745, "align_size_max" is initialized to 0, so line 2747 is
  not reached, and "access_size_max" remains 1.

- Then the branch on line 2752 is taken, and on line 2753, "l" is set to
  1.

- Line 2755 leaves "l" at 1.

Back in flatview_write_continue(), on line 2825, we call
memory_region_dispatch_write() with "op" being set from "size_memop(l)",
where "l" is now 1 -- hence op=MO_8 in frame#3
(memory_region_dispatch_write()).

So, I think the bug is somehow "distributed" between
flatview_write_continue(), flatview_access_allowed(), and
memory_access_size(). flatview_access_allowed() does not care about "l"
at all, when it should (maybe?) compare it against
"mr->ops->valid.max_access_size". In turn, memory_access_size()
*silently* reduces the access width, based on
"->ops->valid.max_access_size".

And all this this *precedes* the call to memory_region_access_valid(),
which is only called from within memory_region_dispatch_write(), which
already gets the reduced width only.

Now, flatview_access_allowed() is from commit 3ab6fdc91b72
("softmmu/physmem: Introduce MemTxAttrs::memory field and
MEMTX_ACCESS_ERROR", 2022-03-21), and the fact it does not check "len"
seems intentional -- it only takes "len" for logging.

Hmm. After digging a lot more, I find the issue may have been introduced
over three commits:

- 82f2563fc815 ("exec: introduce memory_access_size", 2013-05-29), which
  (IIUC) was the first step towards automatically reducing the address
  width, but at first only based on alignment,

- 23326164ae6f ("exec: Support 64-bit operations in address_space_rw",
  2013-07-14), which extended the splitting based on
  "MemoryRegionOps.impl",

- e1622f4b1539 ("exec: fix incorrect assumptions in memory_access_size",
  2013-07-18), which flipped the splitting basis to
  "MemoryRegionOps.valid".

To me, 23326164ae6f seems *vaguely* correct ("vague" is not criticism
for the commit, it's criticism for my understanding :)); after all we're
on our way towards the device model, and the device model exposes via
"MemoryRegionOps.impl" what it can handle. Plus, commit 5d971f9e6725
does direct us towards "MemoryRegionOps.impl"!

But clearly there must have been something wrong with 23326164ae6f,
according to e1622f4b1539...

The latter is what introduced the current "silent splitting of access
based on 'valid'". The message of commit e1622f4b1539 says, almost like
an afterthought:

>     access_size_max can be mr->ops->valid.max_access_size because memory.c
>     can and will still break accesses bigger than
>     mr->ops->impl.max_access_size.

I think this argument may have been wrong: if "impl.max_access_size" is
large (such as: unset), but "valid.max_access_size" is small, that just
means:

  the implementation is flexible and can deal with any access widths (so
  "memory.c" *need not* break up accesses for the device model's sake),
  but the device should restrict the *guest* to small accesses. So if
  the guest tries something larger, we shouldn't silently accommodate
  that.

I have zero idea how to fix this, but I feel that the quoted argument
from commit e1622f4b1539 is the reason why KVM accel is so lenient that
it sort of "de-fangs" commit 5d971f9e6725.

Laszlo
Philippe Mathieu-Daudé Jan. 5, 2023, 9 a.m. UTC | #11
On 5/1/23 08:13, Laszlo Ersek wrote:
> On 1/4/23 13:35, Michael S. Tsirkin wrote:
>> On Wed, Jan 04, 2023 at 10:01:38AM +0100, Laszlo Ersek wrote:
[...]

>>> To make things *even more* complicated, the breakage was (and remains, as
>>> of today) visible with TCG acceleration only.  Commit 5d971f9e6725 makes
>>> no difference with KVM acceleration -- the DWORD accesses still work,
>>> despite "valid.max_access_size = 1".
>>
>> BTW do you happen to know why that's the case for KVM? Because if kvm
>> ignores valid.max_access_size generally then commit 5d971f9e6725 is
>> incomplete, and we probably have some related kvm-only bugs.
> 
> It remains a mystery for me why KVM accel does not enforce
> "valid.max_access_size".
> 
> In the thread I started earlier (which led to this patch), at
> 
>    "IO port write width clamping differs between TCG and KVM"
>    https://lists.gnu.org/archive/html/qemu-devel/2023-01/msg00199.html

[...]

> So, I think the bug is somehow "distributed" between
> flatview_write_continue(), flatview_access_allowed(), and
> memory_access_size(). flatview_access_allowed() does not care about "l"
> at all, when it should (maybe?) compare it against
> "mr->ops->valid.max_access_size". In turn, memory_access_size()
> *silently* reduces the access width, based on
> "->ops->valid.max_access_size".
> 
> And all this this *precedes* the call to memory_region_access_valid(),
> which is only called from within memory_region_dispatch_write(), which
> already gets the reduced width only.
> 
> Now, flatview_access_allowed() is from commit 3ab6fdc91b72
> ("softmmu/physmem: Introduce MemTxAttrs::memory field and
> MEMTX_ACCESS_ERROR", 2022-03-21), and the fact it does not check "len"
> seems intentional -- it only takes "len" for logging.
> 
> Hmm. After digging a lot more, I find the issue may have been introduced
> over three commits:
> 
> - 82f2563fc815 ("exec: introduce memory_access_size", 2013-05-29), which
>    (IIUC) was the first step towards automatically reducing the address
>    width, but at first only based on alignment,
> 
> - 23326164ae6f ("exec: Support 64-bit operations in address_space_rw",
>    2013-07-14), which extended the splitting based on
>    "MemoryRegionOps.impl",
> 
> - e1622f4b1539 ("exec: fix incorrect assumptions in memory_access_size",
>    2013-07-18), which flipped the splitting basis to
>    "MemoryRegionOps.valid".
> 
> To me, 23326164ae6f seems *vaguely* correct ("vague" is not criticism
> for the commit, it's criticism for my understanding :)); after all we're
> on our way towards the device model, and the device model exposes via
> "MemoryRegionOps.impl" what it can handle. Plus, commit 5d971f9e6725
> does direct us towards "MemoryRegionOps.impl"!
> 
> But clearly there must have been something wrong with 23326164ae6f,
> according to e1622f4b1539...

Maybe the long-standing unaligned access problem? Could be fixed by:
https://lore.kernel.org/qemu-devel/20210619172626.875885-15-richard.henderson@linaro.org/

> The latter is what introduced the current "silent splitting of access
> based on 'valid'". The message of commit e1622f4b1539 says, almost like
> an afterthought:
> 
>>      access_size_max can be mr->ops->valid.max_access_size because memory.c
>>      can and will still break accesses bigger than
>>      mr->ops->impl.max_access_size.
> 
> I think this argument may have been wrong: if "impl.max_access_size" is
> large (such as: unset), but "valid.max_access_size" is small, that just
> means:
> 
>    the implementation is flexible and can deal with any access widths (so
>    "memory.c" *need not* break up accesses for the device model's sake),
>    but the device should restrict the *guest* to small accesses. So if
>    the guest tries something larger, we shouldn't silently accommodate
>    that.

Indeed. '.impl' is a software thing for the device modeller, ideally one
will chose a value that allows the simplest implementation. I.e. if a
device only allows 8-bit access, use 8-bit registers aligned on a 64-bit
boundary, the model might use:

   .impl.min_access_size = 8,
   .impl.max_access_size = 1,

Also we need to keep in mind that even if most MemoryRegionOps structs
are 'static const', such structure can be dynamically created. I.e.:
https://lore.kernel.org/qemu-devel/20200817161853.593247-5-f4bug@amsat.org/

> I have zero idea how to fix this, but I feel that the quoted argument
> from commit e1622f4b1539 is the reason why KVM accel is so lenient that
> it sort of "de-fangs" commit 5d971f9e6725.
> 
> Laszlo
>
Michael S. Tsirkin Jan. 5, 2023, 9:51 a.m. UTC | #12
On Thu, Jan 05, 2023 at 10:00:00AM +0100, Philippe Mathieu-Daudé wrote:
> On 5/1/23 08:13, Laszlo Ersek wrote:
> > On 1/4/23 13:35, Michael S. Tsirkin wrote:
> > > On Wed, Jan 04, 2023 at 10:01:38AM +0100, Laszlo Ersek wrote:
> [...]
> 
> > > > To make things *even more* complicated, the breakage was (and remains, as
> > > > of today) visible with TCG acceleration only.  Commit 5d971f9e6725 makes
> > > > no difference with KVM acceleration -- the DWORD accesses still work,
> > > > despite "valid.max_access_size = 1".
> > > 
> > > BTW do you happen to know why that's the case for KVM? Because if kvm
> > > ignores valid.max_access_size generally then commit 5d971f9e6725 is
> > > incomplete, and we probably have some related kvm-only bugs.
> > 
> > It remains a mystery for me why KVM accel does not enforce
> > "valid.max_access_size".
> > 
> > In the thread I started earlier (which led to this patch), at
> > 
> >    "IO port write width clamping differs between TCG and KVM"
> >    https://lists.gnu.org/archive/html/qemu-devel/2023-01/msg00199.html
> 
> [...]
> 
> > So, I think the bug is somehow "distributed" between
> > flatview_write_continue(), flatview_access_allowed(), and
> > memory_access_size(). flatview_access_allowed() does not care about "l"
> > at all, when it should (maybe?) compare it against
> > "mr->ops->valid.max_access_size". In turn, memory_access_size()
> > *silently* reduces the access width, based on
> > "->ops->valid.max_access_size".
> > 
> > And all this this *precedes* the call to memory_region_access_valid(),
> > which is only called from within memory_region_dispatch_write(), which
> > already gets the reduced width only.
> > 
> > Now, flatview_access_allowed() is from commit 3ab6fdc91b72
> > ("softmmu/physmem: Introduce MemTxAttrs::memory field and
> > MEMTX_ACCESS_ERROR", 2022-03-21), and the fact it does not check "len"
> > seems intentional -- it only takes "len" for logging.
> > 
> > Hmm. After digging a lot more, I find the issue may have been introduced
> > over three commits:
> > 
> > - 82f2563fc815 ("exec: introduce memory_access_size", 2013-05-29), which
> >    (IIUC) was the first step towards automatically reducing the address
> >    width, but at first only based on alignment,
> > 
> > - 23326164ae6f ("exec: Support 64-bit operations in address_space_rw",
> >    2013-07-14), which extended the splitting based on
> >    "MemoryRegionOps.impl",
> > 
> > - e1622f4b1539 ("exec: fix incorrect assumptions in memory_access_size",
> >    2013-07-18), which flipped the splitting basis to
> >    "MemoryRegionOps.valid".
> > 
> > To me, 23326164ae6f seems *vaguely* correct ("vague" is not criticism
> > for the commit, it's criticism for my understanding :)); after all we're
> > on our way towards the device model, and the device model exposes via
> > "MemoryRegionOps.impl" what it can handle. Plus, commit 5d971f9e6725
> > does direct us towards "MemoryRegionOps.impl"!
> > 
> > But clearly there must have been something wrong with 23326164ae6f,
> > according to e1622f4b1539...
> 
> Maybe the long-standing unaligned access problem? Could be fixed by:
> https://lore.kernel.org/qemu-devel/20210619172626.875885-15-richard.henderson@linaro.org/

indeed. want to dust it up and post?

> > The latter is what introduced the current "silent splitting of access
> > based on 'valid'". The message of commit e1622f4b1539 says, almost like
> > an afterthought:
> > 
> > >      access_size_max can be mr->ops->valid.max_access_size because memory.c
> > >      can and will still break accesses bigger than
> > >      mr->ops->impl.max_access_size.
> > 
> > I think this argument may have been wrong: if "impl.max_access_size" is
> > large (such as: unset), but "valid.max_access_size" is small, that just
> > means:
> > 
> >    the implementation is flexible and can deal with any access widths (so
> >    "memory.c" *need not* break up accesses for the device model's sake),
> >    but the device should restrict the *guest* to small accesses. So if
> >    the guest tries something larger, we shouldn't silently accommodate
> >    that.
> 
> Indeed. '.impl' is a software thing for the device modeller, ideally one
> will chose a value that allows the simplest implementation. I.e. if a
> device only allows 8-bit access, use 8-bit registers aligned on a 64-bit
> boundary, the model might use:
> 
>   .impl.min_access_size = 8,
>   .impl.max_access_size = 1,
> 
> Also we need to keep in mind that even if most MemoryRegionOps structs
> are 'static const', such structure can be dynamically created. I.e.:
> https://lore.kernel.org/qemu-devel/20200817161853.593247-5-f4bug@amsat.org/
> 
> > I have zero idea how to fix this, but I feel that the quoted argument
> > from commit e1622f4b1539 is the reason why KVM accel is so lenient that
> > it sort of "de-fangs" commit 5d971f9e6725.
> > 
> > Laszlo
> >
Christian Ehrhardt March 1, 2023, 7:17 a.m. UTC | #13
On Thu, Jan 5, 2023 at 8:14 AM Laszlo Ersek <lersek@redhat.com> wrote:
>
> On 1/4/23 13:35, Michael S. Tsirkin wrote:
> > On Wed, Jan 04, 2023 at 10:01:38AM +0100, Laszlo Ersek wrote:
> >> The modern ACPI CPU hotplug interface was introduced in the following
> >> series (aa1dd39ca307..679dd1a957df), released in v2.7.0:
> >>
> >>   1  abd49bc2ed2f docs: update ACPI CPU hotplug spec with new protocol
> >>   2  16bcab97eb9f pc: piix4/ich9: add 'cpu-hotplug-legacy' property
> >>   3  5e1b5d93887b acpi: cpuhp: add CPU devices AML with _STA method
> >>   4  ac35f13ba8f8 pc: acpi: introduce AcpiDeviceIfClass.madt_cpu hook
> >>   5  d2238cb6781d acpi: cpuhp: implement hot-add parts of CPU hotplug
> >>                   interface
> >>   6  8872c25a26cc acpi: cpuhp: implement hot-remove parts of CPU hotplug
> >>                   interface
> >>   7  76623d00ae57 acpi: cpuhp: add cpu._OST handling
> >>   8  679dd1a957df pc: use new CPU hotplug interface since 2.7 machine type
> >>
...
>
> The solution to the riddle

Hi,
just to add to this nicely convoluted case an FYI to everyone involved
back then,
the fix seems to have caused a regression [1] in - as far as I've
found - an edge case.

[1]: https://gitlab.com/qemu-project/qemu/-/issues/1520

...

> Laszlo
>
>
Laszlo Ersek March 1, 2023, 8:03 a.m. UTC | #14
Hello Christian,

On 3/1/23 08:17, Christian Ehrhardt wrote:
> On Thu, Jan 5, 2023 at 8:14 AM Laszlo Ersek <lersek@redhat.com> wrote:
>>
>> On 1/4/23 13:35, Michael S. Tsirkin wrote:
>>> On Wed, Jan 04, 2023 at 10:01:38AM +0100, Laszlo Ersek wrote:
>>>> The modern ACPI CPU hotplug interface was introduced in the following
>>>> series (aa1dd39ca307..679dd1a957df), released in v2.7.0:
>>>>
>>>>   1  abd49bc2ed2f docs: update ACPI CPU hotplug spec with new protocol
>>>>   2  16bcab97eb9f pc: piix4/ich9: add 'cpu-hotplug-legacy' property
>>>>   3  5e1b5d93887b acpi: cpuhp: add CPU devices AML with _STA method
>>>>   4  ac35f13ba8f8 pc: acpi: introduce AcpiDeviceIfClass.madt_cpu hook
>>>>   5  d2238cb6781d acpi: cpuhp: implement hot-add parts of CPU hotplug
>>>>                   interface
>>>>   6  8872c25a26cc acpi: cpuhp: implement hot-remove parts of CPU hotplug
>>>>                   interface
>>>>   7  76623d00ae57 acpi: cpuhp: add cpu._OST handling
>>>>   8  679dd1a957df pc: use new CPU hotplug interface since 2.7 machine type
>>>>
> ...
>>
>> The solution to the riddle
> 
> Hi,
> just to add to this nicely convoluted case an FYI to everyone involved
> back then,
> the fix seems to have caused a regression [1] in - as far as I've
> found - an edge case.
> 
> [1]: https://gitlab.com/qemu-project/qemu/-/issues/1520

After reading the gitlab case, here's my theory on it:

- Without the patch applied, the CPU hotplug register block in QEMU is
broken. Effectively, it has *always* been broken; to put it differently,
you have most likely *never* seen a QEMU in which the CPU hotplug
register block was not broken. The reason is that the only QEMU release
without the breakage (as far as a guest could see it!) was v5.0.0, but
it got exposed to the guest as early as v5.1.0 (IOW, in the 5.* series,
the first stable release already exposed the issue), and the symptom has
existed since (up to and including 7.2).

- With the register block broken, OVMF's multiprocessing is broken, and
the random chaos just happens to play out in a way that makes OVMF think
it's running on a uniprocessor system.

- With the register block *fixed* (commit dab30fbe applied), OVMF
actually boots up your VCPUs. With MT-TCG, this translates to as many
host-side VCPU threads running in your QEMU process as you have VCPUs.

- Furthermore, if your OVMF build includes the SMM driver stack, then
each UEFI variable update will require all VCPUs to enter SMM. All VCPUs
entering SMM is a "thundering herd" event, so it seriously spins up all
your host-side threads. (I assume the SMM-enabled binaries are what you
refer to as "signed OVMF cases" in the gitlab ticket.)

- If you overcommit the VCPUs (#vcpus > #pcpus), then your host-side
threads will be competing for PCPUs. On s390x, there is apparently some
bottleneck in QEMU's locking or in the host kernel or wherever else that
penalizes (#threads > #pcpus) heavily, while on other host arches, the
penalty is (apparently) not as severe.

So, the QEMU fix actually "only exposes" the high penalty of the MT-TCG
VCPU thread overcommit that appears characteristic of s390x hosts.
You've not seen this symptom before because, regardless of how many
VCPUs you've specified in the past, OVMF has never actually attempted to
bring those up, due to the hotplug regblock breakage "masking" the
actual VCPU counts (the present-at-boot VCPU count and the possible max
VCPU count).

Here's a test you could try: go back to QEMU v5.0.0 *precisely*, and try
to reproduce the symptom. I expect that it should reproduce.

Here's another test you can try: with latest QEMU, boot an x86 Linux
guest, but using SeaBIOS, not OVMF, on your s390x host. Then, in the
Linux guest, run as many busy loops (e.g. in the shell) as there are
VCPUs. Compare the behavior between #vcpus = #pcpus vs. #vcpus > #pcpus.
The idea here is of course to show that the impact of overcommitting x86
VCPUs on s390x is not specific to OVMF. Note that I don't *fully* expect
this test to confirm the expectation, because the guest workload will be
very different: in the Linux guest case, your VCPUs will not be
attempting to enter SMM *or* to access pflash, so the paths exercised in
QEMU will be very different. But, the test may still be worth a try.

Yet another test (or more like, information gathering): re-run the
problematic case, while printing the OVMF debug log (the x86 debug
console) to stdout, and visually determine at what part(s) the slowdown
hits. (I guess you can also feed the debug console log through some
timestamping utility like "logger".) I suspect it's going to be those
log sections that relate to SMM entry -- initial SMBASE relocation, and
then whenever UEFI variables are modified.

Preliminary advice: don't overcommit VCPUs in the setup at hand, or else
please increase the timeout. :)

In edk2, a way to mitigate said "thundering herd" problem *supposedly*
exists (using unicast SMIs rather than broadcast ones), but that
configuration of the core SMM components in edk2 had always been
extremely unstable when built into OVMF *and* running on QEMU/KVM. So we
opted for broadcast SMIs (supporting which actually required some QEMU
patches). Broadcast SMIs generate larger spikes in host load, but
regarding guest functionally, they are much more stable/robust.

Laszlo
Christian Ehrhardt March 2, 2023, 8:32 a.m. UTC | #15
On Wed, Mar 1, 2023 at 9:04 AM Laszlo Ersek <lersek@redhat.com> wrote:
>
> Hello Christian,
>
> On 3/1/23 08:17, Christian Ehrhardt wrote:
> > On Thu, Jan 5, 2023 at 8:14 AM Laszlo Ersek <lersek@redhat.com> wrote:
> >>
> >> On 1/4/23 13:35, Michael S. Tsirkin wrote:
> >>> On Wed, Jan 04, 2023 at 10:01:38AM +0100, Laszlo Ersek wrote:
> >>>> The modern ACPI CPU hotplug interface was introduced in the following
> >>>> series (aa1dd39ca307..679dd1a957df), released in v2.7.0:
> >>>>
> >>>>   1  abd49bc2ed2f docs: update ACPI CPU hotplug spec with new protocol
> >>>>   2  16bcab97eb9f pc: piix4/ich9: add 'cpu-hotplug-legacy' property
> >>>>   3  5e1b5d93887b acpi: cpuhp: add CPU devices AML with _STA method
> >>>>   4  ac35f13ba8f8 pc: acpi: introduce AcpiDeviceIfClass.madt_cpu hook
> >>>>   5  d2238cb6781d acpi: cpuhp: implement hot-add parts of CPU hotplug
> >>>>                   interface
> >>>>   6  8872c25a26cc acpi: cpuhp: implement hot-remove parts of CPU hotplug
> >>>>                   interface
> >>>>   7  76623d00ae57 acpi: cpuhp: add cpu._OST handling
> >>>>   8  679dd1a957df pc: use new CPU hotplug interface since 2.7 machine type
> >>>>
> > ...
> >>
> >> The solution to the riddle
> >
> > Hi,
> > just to add to this nicely convoluted case an FYI to everyone involved
> > back then,
> > the fix seems to have caused a regression [1] in - as far as I've
> > found - an edge case.
> >
> > [1]: https://gitlab.com/qemu-project/qemu/-/issues/1520
>
> After reading the gitlab case, here's my theory on it:
>
> - Without the patch applied, the CPU hotplug register block in QEMU is
> broken. Effectively, it has *always* been broken; to put it differently,
> you have most likely *never* seen a QEMU in which the CPU hotplug
> register block was not broken. The reason is that the only QEMU release
> without the breakage (as far as a guest could see it!) was v5.0.0, but
> it got exposed to the guest as early as v5.1.0 (IOW, in the 5.* series,
> the first stable release already exposed the issue), and the symptom has
> existed since (up to and including 7.2).
>
> - With the register block broken, OVMF's multiprocessing is broken, and
> the random chaos just happens to play out in a way that makes OVMF think
> it's running on a uniprocessor system.
>
> - With the register block *fixed* (commit dab30fbe applied), OVMF
> actually boots up your VCPUs. With MT-TCG, this translates to as many
> host-side VCPU threads running in your QEMU process as you have VCPUs.
>
> - Furthermore, if your OVMF build includes the SMM driver stack, then
> each UEFI variable update will require all VCPUs to enter SMM. All VCPUs
> entering SMM is a "thundering herd" event, so it seriously spins up all
> your host-side threads. (I assume the SMM-enabled binaries are what you
> refer to as "signed OVMF cases" in the gitlab ticket.)
>
> - If you overcommit the VCPUs (#vcpus > #pcpus), then your host-side
> threads will be competing for PCPUs. On s390x, there is apparently some
> bottleneck in QEMU's locking or in the host kernel or wherever else that
> penalizes (#threads > #pcpus) heavily, while on other host arches, the
> penalty is (apparently) not as severe.
>
> So, the QEMU fix actually "only exposes" the high penalty of the MT-TCG
> VCPU thread overcommit that appears characteristic of s390x hosts.
> You've not seen this symptom before because, regardless of how many
> VCPUs you've specified in the past, OVMF has never actually attempted to
> bring those up, due to the hotplug regblock breakage "masking" the
> actual VCPU counts (the present-at-boot VCPU count and the possible max
> VCPU count).

Thank you for the detailed thoughts - if we can confirm this we can
close the case as "it is odd that there is so much penalty, but =>
Won't Fix / Works as Intended"

> Here's a test you could try: go back to QEMU v5.0.0 *precisely*, and try
> to reproduce the symptom. I expect that it should reproduce.

v5.0.0 - 1 host cpu vs 2 vcpu - 58.47s
v5.0.0 - 1 host cpu vs 1 vcpu -  5.33s
v5.0.0 - 2 host cpu vs 2 vcpu -  5.27s
v5.1.0 - 1 host cpu vs 2 vcpu -  7.18s
v5.1.0 - 1 host cpu vs 1 vcpu -  5.22s
v5.1.0 - 2 host cpu vs 2 vcpu -  5.40s

Yes, v5.0.0 behaves exactly like the recent master branch does since your fix.
And v5.1.0 does no more, just as you predicted

> Here's another test you can try: with latest QEMU, boot an x86 Linux
> guest, but using SeaBIOS, not OVMF, on your s390x host. Then, in the
> Linux guest, run as many busy loops (e.g. in the shell) as there are
> VCPUs. Compare the behavior between #vcpus = #pcpus vs. #vcpus > #pcpus.
> The idea here is of course to show that the impact of overcommitting x86
> VCPUs on s390x is not specific to OVMF. Note that I don't *fully* expect
> this test to confirm the expectation, because the guest workload will be
> very different: in the Linux guest case, your VCPUs will not be
> attempting to enter SMM *or* to access pflash, so the paths exercised in
> QEMU will be very different. But, the test may still be worth a try.

That felt too much of a different workload to me, so I have skipped this
one as - without further evidence that it will help - it could be quite a
time sink.

> Yet another test (or more like, information gathering): re-run the
> problematic case, while printing the OVMF debug log (the x86 debug
> console) to stdout, and visually determine at what part(s) the slowdown
> hits. (I guess you can also feed the debug console log through some
> timestamping utility like "logger".) I suspect it's going to be those
> log sections that relate to SMM entry -- initial SMBASE relocation, and
> then whenever UEFI variables are modified.

Building without -b RELEASE adding debugcon and timestamping that
ouput showed that each individual initialization takes the expected
~x10 longer.
So up to these they are more or less at the same speed initially.
But then the bad case slows down.

Here one example on BootGraphicsResourceTableDxe.efi

good ~0.09
[08:14:36.657866559] Loading driver at 0x0000DA42000
EntryPoint=0x0000DA43545 BootGraphicsResourceTableDxe.efi
[08:14:36.658913369] InstallProtocolInterface:
BC62157E-3E33-4FEC-9920-2D3B36D750DF DA72D18
[08:14:36.659946746] ProtectUefiImageCommon - 0xDA72040
[08:14:36.660982120] - 0x000000000DA42000 - 0x0000000000002840
[08:14:36.662043745] InstallProtocolInterface:
CDEA2BD3-FC25-4C1C-B97C-B31186064990 DA445F0
[08:14:36.663092745] InstallProtocolInterface:
4B5DC1DF-1EAA-48B2-A7E9-EAC489A00B5C DA44670
[08:14:36.664139682] Loading driver
961578FE-B6B7-44C3-AF35-6BC705CD2B1F
[08:14:36.665191815] InstallProtocolInterface:
5B1B31A1-9562-11D2-8E3F-00A0C969723B DA72540
[08:14:36.666244307] Loading driver at 0x0000DA02000
EntryPoint=0x0000DA099BC Fat.efi

bad ~0.17s
[08:15:30.386201946] Loading driver at 0x0000DA49000
EntryPoint=0x0000DA4A545 BootGraphicsResourceTableDxe.efi
[08:15:30.410568994] InstallProtocolInterface:
BC62157E-3E33-4FEC-9920-2D3B36D750DF DA7EB18
[08:15:30.430838932] ProtectUefiImageCommon - 0xDA7E140
[08:15:30.440526879] - 0x000000000DA49000 - 0x0000000000002840
[08:15:30.450730504] InstallProtocolInterface:
CDEA2BD3-FC25-4C1C-B97C-B31186064990 DA4B5F0
[08:15:30.480538889] InstallProtocolInterface:
4B5DC1DF-1EAA-48B2-A7E9-EAC489A00B5C DA4B670
[08:15:30.490532370] Loading driver
961578FE-B6B7-44C3-AF35-6BC705CD2B1F
[08:15:30.510566744] InstallProtocolInterface:
5B1B31A1-9562-11D2-8E3F-00A0C969723B DA7D040
[08:15:30.550572432] Loading driver at 0x0000D7F6000
EntryPoint=0x0000D7FD9BC Fat.efi

This seems to be the case for each driver load in here which then adds up.
There is another rather big jump here a bit later

good ~instant
[08:14:37.267336194] Select Item: 0xE
[08:14:37.268346995] [Bds]RegisterKeyNotify: 000C/0000 80000000/00 Success

bad ~8s
[08:15:43.561054490] Select Item: 0xE
[08:15:51.291039364] [Bds]RegisterKeyNotify: 000C/0000 80000000/00 Success

The whole late section of OVMF init makes up for almost all of the loss.

Full files:
- good: https://paste.ubuntu.com/p/DcMpxtd9Cy/
- bad: https://paste.ubuntu.com/p/4wDfzmC9Sm/

> Preliminary advice: don't overcommit VCPUs in the setup at hand, or else
> please increase the timeout. :)

I was always in the "if possible you should not overcommit" camp anyway.
And we have - by now resolved this in the tests [1] due to my bug
about it - thanks @Dann Frazier

[1]: https://salsa.debian.org/qemu-team/edk2/-/commit/243f0c2533fc18671dc373645e44b5071d8474a5

> In edk2, a way to mitigate said "thundering herd" problem *supposedly*
> exists (using unicast SMIs rather than broadcast ones), but that
> configuration of the core SMM components in edk2 had always been
> extremely unstable when built into OVMF *and* running on QEMU/KVM. So we
> opted for broadcast SMIs (supporting which actually required some QEMU
> patches). Broadcast SMIs generate larger spikes in host load, but
> regarding guest functionally, they are much more stable/robust.
>
> Laszlo
>
Laszlo Ersek March 2, 2023, 10:38 a.m. UTC | #16
On 3/2/23 09:32, Christian Ehrhardt wrote:

> good ~instant
> [08:14:37.267336194] Select Item: 0xE
> [08:14:37.268346995] [Bds]RegisterKeyNotify: 000C/0000 80000000/00 Success
>
> bad ~8s
> [08:15:43.561054490] Select Item: 0xE
> [08:15:51.291039364] [Bds]RegisterKeyNotify: 000C/0000 80000000/00 Success

Yes, this is consistent with my hypothesis.

  PlatformBootManagerBeforeConsole()

    GetFrontPageTimeoutFromQemu()
      QemuFwCfgSelectItem (QemuFwCfgItemBootMenu)
        // "Select Item: 0xE"

    gRT->SetVariable()

    PlatformRegisterOptionsAndKeys()
      EfiBootManagerAddKeyOptionVariable()

        gRT->SetVariable()

        BmProcessKeyOption()
          BmRegisterHotkeyNotify()
            // "[Bds]RegisterKeyNotify: 000C/0000 80000000/00 Success"

IOW, there are at least two gRT->SetVariable() calls in OVMF (with
EFI_VARIABLE_NON_VOLATILE attribute) between the two adjacent log lines
you quoted.

The other functions listed in the call tree may contain further
gRT->SetVariable() calls.

Laszlo
diff mbox series

Patch

diff --git a/hw/acpi/cpu_hotplug.c b/hw/acpi/cpu_hotplug.c
index 53654f863830..ff14c3f4106f 100644
--- a/hw/acpi/cpu_hotplug.c
+++ b/hw/acpi/cpu_hotplug.c
@@ -52,6 +52,9 @@  static const MemoryRegionOps AcpiCpuHotplug_ops = {
     .endianness = DEVICE_LITTLE_ENDIAN,
     .valid = {
         .min_access_size = 1,
+        .max_access_size = 4,
+    },
+    .impl = {
         .max_access_size = 1,
     },
 };