q35/mch: implement extended TSEG sizes

Message ID	20170608161013.17920-1-lersek@redhat.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <qemu-devel-bounces+patchwork-qemu-devel=patchwork.kernel.org@nongnu.org> DMARC-Filter: OpenDMARC Filter v1.3.2 mx1.redhat.com C8FD1A08F8 From: Laszlo Ersek <lersek@redhat.com> To: qemu devel list <qemu-devel@nongnu.org> Date: Thu, 8 Jun 2017 18:10:13 +0200 Message-Id: <20170608161013.17920-1-lersek@redhat.com> Subject: [Qemu-devel] [PATCH] q35/mch: implement extended TSEG sizes Precedence: list Cc: Paolo Bonzini <pbonzini@redhat.com>, Gerd Hoffmann <kraxel@redhat.com>, "Michael S. Tsirkin" <mst@redhat.com> Errors-To: qemu-devel-bounces+patchwork-qemu-devel=patchwork.kernel.org@nongnu.org Sender: "Qemu-devel" <qemu-devel-bounces+patchwork-qemu-devel=patchwork.kernel.org@nongnu.org>

Laszlo Ersek June 8, 2017, 4:10 p.m. UTC

The q35 machine type currently lets the guest firmware select a 1MB, 2MB
or 8MB TSEG (basically, SMRAM) size. In edk2/OVMF, we use 8MB, but even
that is not enough when a lot of VCPUs (more than approx. 224) are
configured -- SMRAM footprint scales largely proportionally with VCPU
count.

Introduce a new property for "mch" called "extended-tseg-mbytes", which
expresses (in megabytes) the user's choice of TSEG (SMRAM) size.

Invent a new, QEMU-specific register in the config space of the DRAM
Controller, at offset 0x50, in order to allow guest firmware to query the
TSEG (SMRAM) size.

According to Intel Document Number 316966-002, Table 5-1 "DRAM Controller
Register Address Map (D0:F0)":

    Warning: Address locations that are not listed are considered Intel
             Reserved registers locations. Reads to Reserved registers may
             return non-zero values. Writes to reserved locations may
             cause system failures.

             All registers that are defined in the PCI 2.3 specification,
             but are not necessary or implemented in this component are
             simply not included in this document. The
             reserved/unimplemented space in the PCI configuration header
             space is not documented as such in this summary.

Offsets 0x50 and 0x51 are not listed in Table 5-1. They are also not part
of the standard PCI config space header. And they precede the capability
list as well, which starts at 0xe0 for this device.

When the guest writes value 0xffff to this register, the value that can be
read back is that of "mch.extended-tseg-mbytes" -- unless it remains
0xffff. The guest is required to write 0xffff first (as opposed to a
read-only register) because PCI config space is generally not cleared on
QEMU reset, and after S3 resume or reboot, new guest firmware running on
old QEMU could read a guest OS-injected value from this register.

After reading the available "extended" TSEG size, the guest firmware may
actually request that TSEG size by writing pattern 11b to the ESMRAMC
register's TSEG_SZ bit-field. (The Intel spec referenced above defines
only patterns 00b (1MB), 01b (2MB) and 10b (8MB); 11b is reserved.)

On the QEMU command line, the value can be set with

  -global mch.extended-tseg-mbytes=N

The default value for 2.10+ q35 machine types is 16. The value is limited
to 0xfff (4095) at the moment, purely so that the product (4095 MB) can be
stored to the uint32_t variable "tseg_size" in mch_update_smram(). Users
are responsible for choosing sensible TSEG sizes.

On 2.9 and earlier q35 machine types, the default value is 0. This lets
the 11b bit pattern in ESMRAMC.TSEG_SZ, and the register at offset 0x50,
keep their original behavior.

When "extended-tseg-mbytes" is nonzero, the new register at offset 0x50 is
set to that value on reset, for completeness.

PCI config space is migrated automatically, so no VMSD changes are
necessary.

Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Gerd Hoffmann <kraxel@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Ref: https://bugzilla.redhat.com/show_bug.cgi?id=1447027
Ref: https://lists.01.org/pipermail/edk2-devel/2017-May/010456.html
Signed-off-by: Laszlo Ersek <lersek@redhat.com>
---

Notes:
    RFC->PATCH changes:
    - This is an identical PATCH repost of the original RFC at
      <http://lists.nongnu.org/archive/html/qemu-devel/2017-05/msg06912.html>;
      I got the OVMF patches now.

 include/hw/i386/pc.h      |  5 +++++
 include/hw/pci-host/q35.h |  6 ++++++
 hw/pci-host/q35.c         | 41 ++++++++++++++++++++++++++++++++++++++---
 3 files changed, 49 insertions(+), 3 deletions(-)

Paolo Bonzini June 8, 2017, 4:34 p.m. UTC | #1

On 08/06/2017 18:10, Laszlo Ersek wrote:
> When the guest writes value 0xffff to this register, the value that can be
> read back is that of "mch.extended-tseg-mbytes" -- unless it remains
> 0xffff. The guest is required to write 0xffff first (as opposed to a
> read-only register) because PCI config space is generally not cleared on
> QEMU reset, and after S3 resume or reboot, new guest firmware running on
> old QEMU could read a guest OS-injected value from this register.

I guess that's also a reason not to make it readonly (that is, it would
require some firmware code anyway to test for "readonlyness" and
distinguish old machine types from new)?

Thanks,

Paolo

Michael S. Tsirkin June 8, 2017, 5:41 p.m. UTC | #2

On Thu, Jun 08, 2017 at 06:10:13PM +0200, Laszlo Ersek wrote:
> The q35 machine type currently lets the guest firmware select a 1MB, 2MB
> or 8MB TSEG (basically, SMRAM) size. In edk2/OVMF, we use 8MB, but even
> that is not enough when a lot of VCPUs (more than approx. 224) are
> configured -- SMRAM footprint scales largely proportionally with VCPU
> count.
> 
> Introduce a new property for "mch" called "extended-tseg-mbytes", which
> expresses (in megabytes) the user's choice of TSEG (SMRAM) size.
> 
> Invent a new, QEMU-specific register in the config space of the DRAM
> Controller, at offset 0x50, in order to allow guest firmware to query the
> TSEG (SMRAM) size.
> 
> According to Intel Document Number 316966-002, Table 5-1 "DRAM Controller
> Register Address Map (D0:F0)":
> 
>     Warning: Address locations that are not listed are considered Intel
>              Reserved registers locations. Reads to Reserved registers may
>              return non-zero values. Writes to reserved locations may
>              cause system failures.
> 
>              All registers that are defined in the PCI 2.3 specification,
>              but are not necessary or implemented in this component are
>              simply not included in this document. The
>              reserved/unimplemented space in the PCI configuration header
>              space is not documented as such in this summary.
> 
> Offsets 0x50 and 0x51 are not listed in Table 5-1. They are also not part
> of the standard PCI config space header. And they precede the capability
> list as well, which starts at 0xe0 for this device.
> 
> When the guest writes value 0xffff to this register, the value that can be
> read back is that of "mch.extended-tseg-mbytes" -- unless it remains
> 0xffff. The guest is required to write 0xffff first (as opposed to a
> read-only register) because PCI config space is generally not cleared on
> QEMU reset, and after S3 resume or reboot, new guest firmware running on
> old QEMU could read a guest OS-injected value from this register.
> 
> After reading the available "extended" TSEG size, the guest firmware may
> actually request that TSEG size by writing pattern 11b to the ESMRAMC
> register's TSEG_SZ bit-field. (The Intel spec referenced above defines
> only patterns 00b (1MB), 01b (2MB) and 10b (8MB); 11b is reserved.)
> 
> On the QEMU command line, the value can be set with
> 
>   -global mch.extended-tseg-mbytes=N
> 
> The default value for 2.10+ q35 machine types is 16. The value is limited
> to 0xfff (4095) at the moment, purely so that the product (4095 MB) can be
> stored to the uint32_t variable "tseg_size" in mch_update_smram(). Users
> are responsible for choosing sensible TSEG sizes.
> 
> On 2.9 and earlier q35 machine types, the default value is 0. This lets
> the 11b bit pattern in ESMRAMC.TSEG_SZ, and the register at offset 0x50,
> keep their original behavior.
> 
> When "extended-tseg-mbytes" is nonzero, the new register at offset 0x50 is
> set to that value on reset, for completeness.
> 
> PCI config space is migrated automatically, so no VMSD changes are
> necessary.
> 
> Cc: "Michael S. Tsirkin" <mst@redhat.com>
> Cc: Gerd Hoffmann <kraxel@redhat.com>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Ref: https://bugzilla.redhat.com/show_bug.cgi?id=1447027
> Ref: https://lists.01.org/pipermail/edk2-devel/2017-May/010456.html
> Signed-off-by: Laszlo Ersek <lersek@redhat.com>

I really dislike negotiation being re-invented for each device.  Do we
need these tricks?  Can we just do fw cfg with standard discovery?
This ties in with my proposal to generalize smi features to
generic ones.


> ---
> 
> Notes:
>     RFC->PATCH changes:
>     - This is an identical PATCH repost of the original RFC at
>       <http://lists.nongnu.org/archive/html/qemu-devel/2017-05/msg06912.html>;
>       I got the OVMF patches now.
> 
>  include/hw/i386/pc.h      |  5 +++++
>  include/hw/pci-host/q35.h |  6 ++++++
>  hw/pci-host/q35.c         | 41 ++++++++++++++++++++++++++++++++++++++---
>  3 files changed, 49 insertions(+), 3 deletions(-)
> 
> diff --git a/include/hw/i386/pc.h b/include/hw/i386/pc.h
> index d071c9c0e9cd..233216abdc1a 100644
> --- a/include/hw/i386/pc.h
> +++ b/include/hw/i386/pc.h
> @@ -384,6 +384,11 @@ bool e820_get_entry(int, uint32_t, uint64_t *, uint64_t *);
>  
>  #define PC_COMPAT_2_9 \
>      HW_COMPAT_2_9 \
> +    {\
> +        .driver   = "mch",\
> +        .property = "extended-tseg-mbytes",\
> +        .value    = stringify(0),\
> +    },\
>  
>  #define PC_COMPAT_2_8 \
>      HW_COMPAT_2_8 \
> diff --git a/include/hw/pci-host/q35.h b/include/hw/pci-host/q35.h
> index 53b6760c16c5..58983c00b32d 100644
> --- a/include/hw/pci-host/q35.h
> +++ b/include/hw/pci-host/q35.h
> @@ -60,6 +60,7 @@ typedef struct MCHPCIState {
>      uint64_t above_4g_mem_size;
>      uint64_t pci_hole64_size;
>      uint32_t short_root_bus;
> +    uint16_t ext_tseg_mbytes;
>  } MCHPCIState;
>  
>  typedef struct Q35PCIHost {
> @@ -91,6 +92,11 @@ typedef struct Q35PCIHost {
>  /* D0:F0 configuration space */
>  #define MCH_HOST_BRIDGE_REVISION_DEFAULT       0x0
>  
> +#define MCH_HOST_BRIDGE_EXT_TSEG_MBYTES        0x50
> +#define MCH_HOST_BRIDGE_EXT_TSEG_MBYTES_SIZE   2
> +#define MCH_HOST_BRIDGE_EXT_TSEG_MBYTES_QUERY  0xffff
> +#define MCH_HOST_BRIDGE_EXT_TSEG_MBYTES_MAX    0xfff
> +
>  #define MCH_HOST_BRIDGE_PCIEXBAR               0x60    /* 64bit register */
>  #define MCH_HOST_BRIDGE_PCIEXBAR_SIZE          8       /* 64bit register */
>  #define MCH_HOST_BRIDGE_PCIEXBAR_DEFAULT       0xb0000000
> diff --git a/hw/pci-host/q35.c b/hw/pci-host/q35.c
> index cd5c49616ef9..28cb97b60fa3 100644
> --- a/hw/pci-host/q35.c
> +++ b/hw/pci-host/q35.c
> @@ -134,7 +134,7 @@ static void q35_host_get_mmcfg_size(Object *obj, Visitor *v, const char *name,
>      visit_type_uint32(v, name, &value, errp);
>  }
>  
> -static Property mch_props[] = {
> +static Property q35_host_props[] = {
>      DEFINE_PROP_UINT64(PCIE_HOST_MCFG_BASE, Q35PCIHost, parent_obj.base_addr,
>                          MCH_HOST_BRIDGE_PCIEXBAR_DEFAULT),
>      DEFINE_PROP_SIZE(PCI_HOST_PROP_PCI_HOLE64_SIZE, Q35PCIHost,
> @@ -154,7 +154,7 @@ static void q35_host_class_init(ObjectClass *klass, void *data)
>  
>      hc->root_bus_path = q35_host_root_bus_path;
>      dc->realize = q35_host_realize;
> -    dc->props = mch_props;
> +    dc->props = q35_host_props;
>      /* Reason: needs to be wired up by pc_q35_init */
>      dc->user_creatable = false;
>      set_bit(DEVICE_CATEGORY_BRIDGE, dc->categories);
> @@ -369,7 +369,7 @@ static void mch_update_smram(MCHPCIState *mch)
>              tseg_size = 1024 * 1024 * 8;
>              break;
>          default:
> -            tseg_size = 0;
> +            tseg_size = 1024 * 1024 * (uint32_t)mch->ext_tseg_mbytes;
>              break;
>          }
>      } else {
> @@ -392,6 +392,17 @@ static void mch_update_smram(MCHPCIState *mch)
>      memory_region_transaction_commit();
>  }
>  
> +static void mch_update_ext_tseg_mbytes(MCHPCIState *mch)
> +{
> +    PCIDevice *pd = PCI_DEVICE(mch);
> +    uint8_t *reg = pd->config + MCH_HOST_BRIDGE_EXT_TSEG_MBYTES;
> +
> +    if (mch->ext_tseg_mbytes > 0 &&
> +        pci_get_word(reg) == MCH_HOST_BRIDGE_EXT_TSEG_MBYTES_QUERY) {
> +        pci_set_word(reg, mch->ext_tseg_mbytes);
> +    }
> +}
> +
>  static void mch_write_config(PCIDevice *d,
>                                uint32_t address, uint32_t val, int len)
>  {
> @@ -413,6 +424,11 @@ static void mch_write_config(PCIDevice *d,
>                         MCH_HOST_BRIDGE_SMRAM_SIZE)) {
>          mch_update_smram(mch);
>      }
> +
> +    if (ranges_overlap(address, len, MCH_HOST_BRIDGE_EXT_TSEG_MBYTES,
> +                       MCH_HOST_BRIDGE_EXT_TSEG_MBYTES_SIZE)) {
> +        mch_update_ext_tseg_mbytes(mch);
> +    }
>  }
>  
>  static void mch_update(MCHPCIState *mch)
> @@ -420,6 +436,7 @@ static void mch_update(MCHPCIState *mch)
>      mch_update_pciexbar(mch);
>      mch_update_pam(mch);
>      mch_update_smram(mch);
> +    mch_update_ext_tseg_mbytes(mch);
>  }
>  
>  static int mch_post_load(void *opaque, int version_id)
> @@ -457,6 +474,11 @@ static void mch_reset(DeviceState *qdev)
>      d->wmask[MCH_HOST_BRIDGE_SMRAM] = MCH_HOST_BRIDGE_SMRAM_WMASK;
>      d->wmask[MCH_HOST_BRIDGE_ESMRAMC] = MCH_HOST_BRIDGE_ESMRAMC_WMASK;
>  
> +    if (mch->ext_tseg_mbytes > 0) {
> +        pci_set_word(d->config + MCH_HOST_BRIDGE_EXT_TSEG_MBYTES,
> +                     MCH_HOST_BRIDGE_EXT_TSEG_MBYTES_QUERY);
> +    }
> +
>      mch_update(mch);
>  }
>  
> @@ -465,6 +487,12 @@ static void mch_realize(PCIDevice *d, Error **errp)
>      int i;
>      MCHPCIState *mch = MCH_PCI_DEVICE(d);
>  
> +    if (mch->ext_tseg_mbytes > MCH_HOST_BRIDGE_EXT_TSEG_MBYTES_MAX) {
> +        error_setg(errp, "invalid extended-tseg-mbytes value: %" PRIu16,
> +                   mch->ext_tseg_mbytes);
> +        return;
> +    }
> +
>      /* setup pci memory mapping */
>      pc_pci_as_mapping_init(OBJECT(mch), mch->system_memory,
>                             mch->pci_address_space);
> @@ -530,6 +558,12 @@ uint64_t mch_mcfg_base(void)
>      return MCH_HOST_BRIDGE_PCIEXBAR_DEFAULT;
>  }
>  
> +static Property mch_props[] = {
> +    DEFINE_PROP_UINT16("extended-tseg-mbytes", MCHPCIState, ext_tseg_mbytes,
> +                       16),
> +    DEFINE_PROP_END_OF_LIST(),
> +};
> +
>  static void mch_class_init(ObjectClass *klass, void *data)
>  {
>      PCIDeviceClass *k = PCI_DEVICE_CLASS(klass);
> @@ -538,6 +572,7 @@ static void mch_class_init(ObjectClass *klass, void *data)
>      k->realize = mch_realize;
>      k->config_write = mch_write_config;
>      dc->reset = mch_reset;
> +    dc->props = mch_props;
>      set_bit(DEVICE_CATEGORY_BRIDGE, dc->categories);
>      dc->desc = "Host bridge";
>      dc->vmsd = &vmstate_mch;
> -- 
> 2.9.3

Laszlo Ersek June 8, 2017, 6:31 p.m. UTC | #3

On 06/08/17 18:34, Paolo Bonzini wrote:
> 
> 
> On 08/06/2017 18:10, Laszlo Ersek wrote:
>> When the guest writes value 0xffff to this register, the value that can be
>> read back is that of "mch.extended-tseg-mbytes" -- unless it remains
>> 0xffff. The guest is required to write 0xffff first (as opposed to a
>> read-only register) because PCI config space is generally not cleared on
>> QEMU reset, and after S3 resume or reboot, new guest firmware running on
>> old QEMU could read a guest OS-injected value from this register.
> 
> I guess that's also a reason not to make it readonly (that is, it would
> require some firmware code anyway to test for "readonlyness" and
> distinguish old machine types from new)?

That's right; in
<https://lists.01.org/pipermail/edk2-devel/2017-May/010432.html> I wrote,

> If we invent such a new register, it should be in a location that is
> either read-only, or zeroed-on-reset, in current QEMU. Otherwise, new
> firmware running on old QEMU could be misled by a guest OS that writes
> to this register, and then either reboots or enters S3.
>
> ... With this in mind, I don't oppose "having to write somewhere to
> read back the result", but then let's please make that write access as
> well to the same new qemu-specific register, and not to MCH_ESMRAMC.

The problem is that config space is by default r/w and not cleared on
reboot -- see Gerd's
<https://lists.01.org/pipermail/edk2-devel/2017-May/010456.html> --, and
once we designate another offset as "special" (either "reactive" or
"read only"), the firmware will have to write to it and read it back
unconditionally, to tell it apart from the default / original /
non-special config space.

Thanks!
Laszlo

Gerd Hoffmann June 8, 2017, 7:48 p.m. UTC | #4

Hi,

> I really dislike negotiation being re-invented for each device.  Do
> we
> need these tricks?  Can we just do fw cfg with standard discovery?
> This ties in with my proposal to generalize smi features to
> generic ones.

Device properties should be part of the device.
We should have done this with the smi too.

A more standard way to handle this would be to add a vendor-specific
pci capability and place the register there.  Not sure we have room for
that in the pci config space though.

cheers,
  Gerd

Michael S. Tsirkin June 8, 2017, 7:55 p.m. UTC | #5

On Thu, Jun 08, 2017 at 09:48:53PM +0200, Gerd Hoffmann wrote:
>   Hi,
> 
> > I really dislike negotiation being re-invented for each device.  Do
> > we
> > need these tricks?  Can we just do fw cfg with standard discovery?
> > This ties in with my proposal to generalize smi features to
> > generic ones.
> 
> Device properties should be part of the device.
> We should have done this with the smi too.

What is part of the device and what isn't? It's all part
of QEMU in the end.  Adding probing for multiple devices
will just add to number of exits and slow down guest boot.

We do want to stick to emulating real devices if we can, no argument
here - but this stuff is PV anyway - what do we gain by spreading it
out?

> A more standard way to handle this would be to add a vendor-specific
> pci capability and place the register there.  Not sure we have room for
> that in the pci config space though.
> 
> cheers,
>   Gerd

We don't have room anywhere in PCI config space. Laszlo makes argument
why it's safe for this device based on spec but it's anyone's guess
whether current and future software will follow spec.  In short, going
anywhere near the emulated device has a potential to break some drivers.

Laszlo Ersek June 8, 2017, 11:01 p.m. UTC | #6

On 06/08/17 21:55, Michael S. Tsirkin wrote:
> On Thu, Jun 08, 2017 at 09:48:53PM +0200, Gerd Hoffmann wrote:
>>   Hi,
>>
>>> I really dislike negotiation being re-invented for each device.  Do
>>> we
>>> need these tricks?  Can we just do fw cfg with standard discovery?
>>> This ties in with my proposal to generalize smi features to
>>> generic ones.
>>
>> Device properties should be part of the device.
>> We should have done this with the smi too.
> 
> What is part of the device and what isn't? It's all part
> of QEMU in the end.  Adding probing for multiple devices
> will just add to number of exits and slow down guest boot.
> 
> We do want to stick to emulating real devices if we can, no argument
> here - but this stuff is PV anyway - what do we gain by spreading it
> out?
> 
>> A more standard way to handle this would be to add a vendor-specific
>> pci capability and place the register there.  Not sure we have room for
>> that in the pci config space though.
>>
>> cheers,
>>   Gerd
> 
> We don't have room anywhere in PCI config space. Laszlo makes argument
> why it's safe for this device based on spec but it's anyone's guess
> whether current and future software will follow spec.  In short, going
> anywhere near the emulated device has a potential to break some drivers.

I'm fine using any QEMU facility that lets independent firmware modules
perform their feature detections / negotiations / lockdowns in arbitrary
order between each other. (Hopefully without extreme parsing requirements.)

What I can not sign up for is to develop a general QEMU infrastructure
for this (regardless of whether it is the fw_cfg bitmap stuff prevails,
or the PCI config space register / capability list). Either is complex
work, needing documentation too, the design has to be future proof. I'm
not experienced enough in QEMU to get it right reasonably soon
(everything is surprisingly complex and difficult in QEMU -- this has
been my experience over the years, and I still struggle with QOM every
single time), and I definitely do not have the capacity to take on a
QEMU feature of the suggested size.

It's not lack of interest on my part, but lack of capacity. (Case in
point: it's ~1AM local time, and my laptop's uptime, which quite closely
approximates the hours I've actually spent working today, is ~15:30.)
The reason I keep submitting these little patches to qemu-devel is that
I figure everyone else is overloaded too, so I might as well try what
I'm capable of. But, we should be clear that that is not much, load-wise
and sophistication-wise.

The alternative could have been that I'd clone
<https://bugzilla.redhat.com/show_bug.cgi?id=1447027> to qemu-kvm-rhev
(from OVMF), set up the cross-BZ dependencies correctly, wait until the
clone gets assigned to a seasoned QEMU developer, and once he or she
gets to work on it, we figure out the design together, and once he/she
writes the code for QEMU, I write the code for the firmware.

I figured that sending a patch like the present one (having discussed it
preliminarily with Gerd and Paolo in the "[edk2] SMRAM sizes on large
hosts" thread) would be more efficient than waiting for a seasoned QEMU
developer. I didn't expect that my patch would be better than theirs. :)
The above kind of collaboration has certainly proved functional in the
past, it just takes a lot of time and coordination.

Anyway, "Laszlo embarking on a QEMU infrastructure project that's liable
to take fifteen patch set iterations" is not an alternative,
unfortunately. I definitely don't intend to throw QEMU patches over the
fence; I know what drag that creates for maintainers. I intend to be
responsible for my QEMU patches. However -- or perhaps, "exactly because
of that"? -- I simply can't take on QEMU work that's larger than this
caliber.

Sorry about the wall of text.

Thanks,
Laszlo

Michael S. Tsirkin June 9, 2017, 12:19 a.m. UTC | #7

On Fri, Jun 09, 2017 at 01:01:54AM +0200, Laszlo Ersek wrote:
> On 06/08/17 21:55, Michael S. Tsirkin wrote:
> > On Thu, Jun 08, 2017 at 09:48:53PM +0200, Gerd Hoffmann wrote:
> >>   Hi,
> >>
> >>> I really dislike negotiation being re-invented for each device.  Do
> >>> we
> >>> need these tricks?  Can we just do fw cfg with standard discovery?
> >>> This ties in with my proposal to generalize smi features to
> >>> generic ones.
> >>
> >> Device properties should be part of the device.
> >> We should have done this with the smi too.
> > 
> > What is part of the device and what isn't? It's all part
> > of QEMU in the end.  Adding probing for multiple devices
> > will just add to number of exits and slow down guest boot.
> > 
> > We do want to stick to emulating real devices if we can, no argument
> > here - but this stuff is PV anyway - what do we gain by spreading it
> > out?
> > 
> >> A more standard way to handle this would be to add a vendor-specific
> >> pci capability and place the register there.  Not sure we have room for
> >> that in the pci config space though.
> >>
> >> cheers,
> >>   Gerd
> > 
> > We don't have room anywhere in PCI config space. Laszlo makes argument
> > why it's safe for this device based on spec but it's anyone's guess
> > whether current and future software will follow spec.  In short, going
> > anywhere near the emulated device has a potential to break some drivers.
> 
> I'm fine using any QEMU facility that lets independent firmware modules
> perform their feature detections / negotiations / lockdowns in arbitrary
> order between each other. (Hopefully without extreme parsing requirements.)

How about adding etc/mch/features etc copying the smi stuff? Is this
simple enough? We can worry about removing code duplication later.

> What I can not sign up for is to develop a general QEMU infrastructure
> for this (regardless of whether it is the fw_cfg bitmap stuff prevails,
> or the PCI config space register / capability list). Either is complex
> work, needing documentation too, the design has to be future proof. I'm
> not experienced enough in QEMU to get it right reasonably soon
> (everything is surprisingly complex and difficult in QEMU -- this has
> been my experience over the years, and I still struggle with QOM every
> single time), and I definitely do not have the capacity to take on a
> QEMU feature of the suggested size.
> 
> It's not lack of interest on my part, but lack of capacity. (Case in
> point: it's ~1AM local time, and my laptop's uptime, which quite closely
> approximates the hours I've actually spent working today, is ~15:30.)
> The reason I keep submitting these little patches to qemu-devel is that
> I figure everyone else is overloaded too, so I might as well try what
> I'm capable of. But, we should be clear that that is not much, load-wise
> and sophistication-wise.
> 
> The alternative could have been that I'd clone
> <https://bugzilla.redhat.com/show_bug.cgi?id=1447027> to qemu-kvm-rhev
> (from OVMF), set up the cross-BZ dependencies correctly, wait until the
> clone gets assigned to a seasoned QEMU developer, and once he or she
> gets to work on it, we figure out the design together, and once he/she
> writes the code for QEMU, I write the code for the firmware.
> 
> I figured that sending a patch like the present one (having discussed it
> preliminarily with Gerd and Paolo in the "[edk2] SMRAM sizes on large
> hosts" thread) would be more efficient than waiting for a seasoned QEMU
> developer. I didn't expect that my patch would be better than theirs. :)
> The above kind of collaboration has certainly proved functional in the
> past, it just takes a lot of time and coordination.
> 
> Anyway, "Laszlo embarking on a QEMU infrastructure project that's liable
> to take fifteen patch set iterations" is not an alternative,
> unfortunately. I definitely don't intend to throw QEMU patches over the
> fence; I know what drag that creates for maintainers. I intend to be
> responsible for my QEMU patches. However -- or perhaps, "exactly because
> of that"? -- I simply can't take on QEMU work that's larger than this
> caliber.
> 
> Sorry about the wall of text.
> 
> Thanks,
> Laszlo

Paolo Bonzini June 9, 2017, 11:40 a.m. UTC | #8

On 08/06/2017 21:55, Michael S. Tsirkin wrote:
> We don't have room anywhere in PCI config space. Laszlo makes argument
> why it's safe for this device based on spec but it's anyone's guess
> whether current and future software will follow spec.  In short, going
> anywhere near the emulated device has a potential to break some drivers.

There are no such drivers.  The MCH and PCH are only touched by the
firmware, not by the OS.  There are exceptions such as the Intel
graphics driver's incestuous relationship with the PCH, but they are
well known.

Let's avoid overengineering please.  The only sensible (but less
sensible) alternative is to just say it's 32MB (or 64, or 128) and cross
fingers that we won't need more.

Paolo

Laszlo Ersek June 9, 2017, 5:41 p.m. UTC | #9

On 06/09/17 02:19, Michael S. Tsirkin wrote:
> On Fri, Jun 09, 2017 at 01:01:54AM +0200, Laszlo Ersek wrote:
>> On 06/08/17 21:55, Michael S. Tsirkin wrote:
>>> On Thu, Jun 08, 2017 at 09:48:53PM +0200, Gerd Hoffmann wrote:
>>>>   Hi,
>>>>
>>>>> I really dislike negotiation being re-invented for each device.  Do
>>>>> we
>>>>> need these tricks?  Can we just do fw cfg with standard discovery?
>>>>> This ties in with my proposal to generalize smi features to
>>>>> generic ones.
>>>>
>>>> Device properties should be part of the device.
>>>> We should have done this with the smi too.
>>>
>>> What is part of the device and what isn't? It's all part
>>> of QEMU in the end.  Adding probing for multiple devices
>>> will just add to number of exits and slow down guest boot.
>>>
>>> We do want to stick to emulating real devices if we can, no argument
>>> here - but this stuff is PV anyway - what do we gain by spreading it
>>> out?
>>>
>>>> A more standard way to handle this would be to add a vendor-specific
>>>> pci capability and place the register there.  Not sure we have room for
>>>> that in the pci config space though.
>>>>
>>>> cheers,
>>>>   Gerd
>>>
>>> We don't have room anywhere in PCI config space. Laszlo makes argument
>>> why it's safe for this device based on spec but it's anyone's guess
>>> whether current and future software will follow spec.  In short, going
>>> anywhere near the emulated device has a potential to break some drivers.
>>
>> I'm fine using any QEMU facility that lets independent firmware modules
>> perform their feature detections / negotiations / lockdowns in arbitrary
>> order between each other. (Hopefully without extreme parsing requirements.)
> 
> How about adding etc/mch/features etc copying the smi stuff? Is this
> simple enough? We can worry about removing code duplication later.

I'm having a hard time mapping the negotiation used for SMI to this
TSEG-size case. (Just to be sure I've now re-read the commit message on
50de920b372b ("hw/isa/lpc_ich9: add SMI feature negotiation via fw_cfg",
2017-01-26).)

For SMI, there is only one way to trigger it. And the broadcast feature,
if negotiated, modifies the behavior of SMI. It does not change how the
SMI is triggered by the guest.

The TSEG size selection differs. There are already three options for the
guest (bit patterns 00 -> 1MB, 01 -> 2MB, 10 -> 8MB) to write to
ESMRAMC.TSEG_SZ. Let's say the guest wants more:

- First, beyond learning whether "more is possible", it has to know
exactly "how much". So a single bit in some "host-features" bitmask is
not good enough for that.

- Second, assuming an increased TSEG size is available (with a queriable
size), the bit patterns 00, 01, 10 should still not change their
behavior / meaning. For example, I don't see how

  ESMRAMC.TSEG_SZ == 00

(implying 1MB TSEG) could coexist with the guest actually negotiating
16MB via another communication channel. Whatever code looked at
ESMRAMC.TSEG_SZ would rightfully think there was a 1MB TSEG.

So I think such a side-channel negotiation would actually violate
specified behavior from the Q35 spec, rather than just fill in
reserved/unspecified bits.

- The third doubt I have is security / lockdown. In the SMI case, we
defined our own lockdown scheme, and it did not overlap or conflict with
anything. For SMRAM however, the SMRAM.D_LCK bit already locks down a
whole bunch of things, including the ESMRAMC.TSEG_SZ field. Should
SMRAM.D_LCK also lock down the "other" negotiation channel? If that's
our choice, then we diverge at once from the negotiation pattern seen
with SMI. Or else, what if we lock down the "other" channel (selecting
16MB for example), but don't lock down D_LCK? Then ESMRAMC.TSEG_SZ can
still be modified, and that should actually affect the TSEG size
(because D_LCK is not set).

... I think the negotiation pattern we used for etc/smi/* doesn't apply
here; the Q35 spec already defines too much of a selection / lockdown
dance for that. We could perhaps figure out the interactions, but (a)
it's a security thing so there isn't much room for mistakes, and (b) we
wouldn't be reusing the pattern seen with the SMI feature negotiation
anyway.

ESMRAMC.TSEG_SZ == 11 is the simplest (non-conflicting) extension to the
Q35 spec, IMO. That's the core thing.

How we pass the information to the guest that
(a) ESMRAMC.TSEG_SZ == 11 is available, and
(b) what size it actually means

is a secondary question. For this, my original suggestion was a new
fw_cfg file indeed:

https://lists.01.org/pipermail/edk2-devel/2017-May/010404.html

but Gerd seemed to prefer PCI config space. I was fine either way.

(Please note that a malicious guest OS cannot fake an fw_cfg file on old
QEMU, but it can fill in unimplemented registers in PCI config space.
This is why using fw_cfg for querying the extended TSEG availability and
size would be a "read only" operation in the guest, and why using PCI
config space instead requires a *write* operation first. The latter has
nothing to do with feature negotiation; the write occurs only to
suppress earlier tampering from a malicious guest OS, pre-reboot.)

Summary:

(1) I don't think the pattern that we used for broadcast SMI applies
here; we already have a spec-mandated selection and lockdown protocol
involving ESMRAMC.TSEG_SZ and SMRAM.D_LCK.

I don't think we can, or should try to, diverge from this protocol. In
fact we're lucky to have a reserved bitmask value (11) in ESMRAMC.TSEG_SZ.

(2) How exactly we inform the guest about the TSEG size that
(ESMRAMC.TSEG_SZ == 11) corresponds to is a secondary question. We can
make it a constant (and pray, like Paolo mentions), or we can make it
cmdline-configurable and expose it via PCI config space, or we can make
it cmdline-configurable and expose it via a new fw_cfg file. In OVMF I
can easily accommodate either of these three options.

Thanks
Laszlo

Gerd Hoffmann June 9, 2017, 8:01 p.m. UTC | #10

On Fri, 2017-06-09 at 13:40 +0200, Paolo Bonzini wrote:
> 
> On 08/06/2017 21:55, Michael S. Tsirkin wrote:
> > We don't have room anywhere in PCI config space. Laszlo makes
> > argument
> > why it's safe for this device based on spec but it's anyone's guess
> > whether current and future software will follow spec.  In short,
> > going
> > anywhere near the emulated device has a potential to break some
> > drivers.
> 
> There are no such drivers.  The MCH and PCH are only touched by the
> firmware, not by the OS.

Yea.  That is *exactly* the reason why I think simply using the 0x50
offset probably works fine in practice even though I suspect on
physical hardware it might be some undocumented register.  Much of the
stuff in the host bridge pci config space is firmware territory, and we
run qemu specific firmware *anyway*.

cheers,
  Gerd

Michael S. Tsirkin June 14, 2017, 6:25 p.m. UTC | #11

On Fri, Jun 09, 2017 at 10:01:18PM +0200, Gerd Hoffmann wrote:
> On Fri, 2017-06-09 at 13:40 +0200, Paolo Bonzini wrote:
> > 
> > On 08/06/2017 21:55, Michael S. Tsirkin wrote:
> > > We don't have room anywhere in PCI config space. Laszlo makes
> > > argument
> > > why it's safe for this device based on spec but it's anyone's guess
> > > whether current and future software will follow spec.  In short,
> > > going
> > > anywhere near the emulated device has a potential to break some
> > > drivers.
> > 
> > There are no such drivers.  The MCH and PCH are only touched by the
> > firmware, not by the OS.
> 
> Yea.  That is *exactly* the reason why I think simply using the 0x50
> offset probably works fine in practice even though I suspect on
> physical hardware it might be some undocumented register.  Much of the
> stuff in the host bridge pci config space is firmware territory, and we
> run qemu specific firmware *anyway*.
> 
> cheers,
>   Gerd

To be specific, what I meant is a bit that tells guest that a
config space register is available, and lets host find out
that guest is going to use it.

This to ensure full forward and backward compatibility.

I agree a fw cfg file for a single bit seems like an overkill, that's
why I thought sharing feature files with SMI would be a good idea.

Do you see an issue with that?

Gerd Hoffmann June 15, 2017, 7:07 a.m. UTC | #12

Hi,

> To be specific, what I meant is a bit that tells guest that a
> config space register is available, and lets host find out
> that guest is going to use it.
> 
> This to ensure full forward and backward compatibility.
> 
> I agree a fw cfg file for a single bit seems like an overkill, that's
> why I thought sharing feature files with SMI would be a good idea.
> 
> Do you see an issue with that?

The point of placing the extended-tseg-size register in mch pci config
is that the firmware can figure this easily *without* looking somewhere
else, as all the other tseg (and smram) config bits are in mch pci
config space too.

When involving fw_cfg there is no reason to keep the extended-tseg-size 
register in mch.  We can simply place a "etc/q35-extended-tseg-size"
file in fw_cfg then (which is either not present or contains the
extended tseg size).  So a feature bit in fw_cfg looks absolutely
pointless to me.

I still think the approach and patch by Laszlo is perfectly fine. 
While being at it:

Reviewed-by: Gerd Hoffmann <kraxel@redhat.com>

cheers,
  Gerd

Michael S. Tsirkin June 16, 2017, 3:23 a.m. UTC | #13

On Thu, Jun 15, 2017 at 09:07:45AM +0200, Gerd Hoffmann wrote:
>   Hi,
> 
> > To be specific, what I meant is a bit that tells guest that a
> > config space register is available, and lets host find out
> > that guest is going to use it.
> > 
> > This to ensure full forward and backward compatibility.
> > 
> > I agree a fw cfg file for a single bit seems like an overkill, that's
> > why I thought sharing feature files with SMI would be a good idea.
> > 
> > Do you see an issue with that?
> 
> The point of placing the extended-tseg-size register in mch pci config
> is that the firmware can figure this easily *without* looking somewhere
> else, as all the other tseg (and smram) config bits are in mch pci
> config space too.
> 
> When involving fw_cfg there is no reason to keep the extended-tseg-size 
> register in mch.  We can simply place a "etc/q35-extended-tseg-size"
> file in fw_cfg then (which is either not present or contains the
> extended tseg size).  So a feature bit in fw_cfg looks absolutely
> pointless to me.
> 
> I still think the approach and patch by Laszlo is perfectly fine. 
> While being at it:
> 
> Reviewed-by: Gerd Hoffmann <kraxel@redhat.com>
> 
> cheers,
>   Gerd


I guess I'll merge it and we'll see if there's any fallout.
Thanks everyone.

q35/mch: implement extended TSEG sizes

Commit Message

Comments

Patch