Message ID | 20170608161013.17920-1-lersek@redhat.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On 08/06/2017 18:10, Laszlo Ersek wrote: > When the guest writes value 0xffff to this register, the value that can be > read back is that of "mch.extended-tseg-mbytes" -- unless it remains > 0xffff. The guest is required to write 0xffff first (as opposed to a > read-only register) because PCI config space is generally not cleared on > QEMU reset, and after S3 resume or reboot, new guest firmware running on > old QEMU could read a guest OS-injected value from this register. I guess that's also a reason not to make it readonly (that is, it would require some firmware code anyway to test for "readonlyness" and distinguish old machine types from new)? Thanks, Paolo
On Thu, Jun 08, 2017 at 06:10:13PM +0200, Laszlo Ersek wrote: > The q35 machine type currently lets the guest firmware select a 1MB, 2MB > or 8MB TSEG (basically, SMRAM) size. In edk2/OVMF, we use 8MB, but even > that is not enough when a lot of VCPUs (more than approx. 224) are > configured -- SMRAM footprint scales largely proportionally with VCPU > count. > > Introduce a new property for "mch" called "extended-tseg-mbytes", which > expresses (in megabytes) the user's choice of TSEG (SMRAM) size. > > Invent a new, QEMU-specific register in the config space of the DRAM > Controller, at offset 0x50, in order to allow guest firmware to query the > TSEG (SMRAM) size. > > According to Intel Document Number 316966-002, Table 5-1 "DRAM Controller > Register Address Map (D0:F0)": > > Warning: Address locations that are not listed are considered Intel > Reserved registers locations. Reads to Reserved registers may > return non-zero values. Writes to reserved locations may > cause system failures. > > All registers that are defined in the PCI 2.3 specification, > but are not necessary or implemented in this component are > simply not included in this document. The > reserved/unimplemented space in the PCI configuration header > space is not documented as such in this summary. > > Offsets 0x50 and 0x51 are not listed in Table 5-1. They are also not part > of the standard PCI config space header. And they precede the capability > list as well, which starts at 0xe0 for this device. > > When the guest writes value 0xffff to this register, the value that can be > read back is that of "mch.extended-tseg-mbytes" -- unless it remains > 0xffff. The guest is required to write 0xffff first (as opposed to a > read-only register) because PCI config space is generally not cleared on > QEMU reset, and after S3 resume or reboot, new guest firmware running on > old QEMU could read a guest OS-injected value from this register. > > After reading the available "extended" TSEG size, the guest firmware may > actually request that TSEG size by writing pattern 11b to the ESMRAMC > register's TSEG_SZ bit-field. (The Intel spec referenced above defines > only patterns 00b (1MB), 01b (2MB) and 10b (8MB); 11b is reserved.) > > On the QEMU command line, the value can be set with > > -global mch.extended-tseg-mbytes=N > > The default value for 2.10+ q35 machine types is 16. The value is limited > to 0xfff (4095) at the moment, purely so that the product (4095 MB) can be > stored to the uint32_t variable "tseg_size" in mch_update_smram(). Users > are responsible for choosing sensible TSEG sizes. > > On 2.9 and earlier q35 machine types, the default value is 0. This lets > the 11b bit pattern in ESMRAMC.TSEG_SZ, and the register at offset 0x50, > keep their original behavior. > > When "extended-tseg-mbytes" is nonzero, the new register at offset 0x50 is > set to that value on reset, for completeness. > > PCI config space is migrated automatically, so no VMSD changes are > necessary. > > Cc: "Michael S. Tsirkin" <mst@redhat.com> > Cc: Gerd Hoffmann <kraxel@redhat.com> > Cc: Paolo Bonzini <pbonzini@redhat.com> > Ref: https://bugzilla.redhat.com/show_bug.cgi?id=1447027 > Ref: https://lists.01.org/pipermail/edk2-devel/2017-May/010456.html > Signed-off-by: Laszlo Ersek <lersek@redhat.com> I really dislike negotiation being re-invented for each device. Do we need these tricks? Can we just do fw cfg with standard discovery? This ties in with my proposal to generalize smi features to generic ones. > --- > > Notes: > RFC->PATCH changes: > - This is an identical PATCH repost of the original RFC at > <http://lists.nongnu.org/archive/html/qemu-devel/2017-05/msg06912.html>; > I got the OVMF patches now. > > include/hw/i386/pc.h | 5 +++++ > include/hw/pci-host/q35.h | 6 ++++++ > hw/pci-host/q35.c | 41 ++++++++++++++++++++++++++++++++++++++--- > 3 files changed, 49 insertions(+), 3 deletions(-) > > diff --git a/include/hw/i386/pc.h b/include/hw/i386/pc.h > index d071c9c0e9cd..233216abdc1a 100644 > --- a/include/hw/i386/pc.h > +++ b/include/hw/i386/pc.h > @@ -384,6 +384,11 @@ bool e820_get_entry(int, uint32_t, uint64_t *, uint64_t *); > > #define PC_COMPAT_2_9 \ > HW_COMPAT_2_9 \ > + {\ > + .driver = "mch",\ > + .property = "extended-tseg-mbytes",\ > + .value = stringify(0),\ > + },\ > > #define PC_COMPAT_2_8 \ > HW_COMPAT_2_8 \ > diff --git a/include/hw/pci-host/q35.h b/include/hw/pci-host/q35.h > index 53b6760c16c5..58983c00b32d 100644 > --- a/include/hw/pci-host/q35.h > +++ b/include/hw/pci-host/q35.h > @@ -60,6 +60,7 @@ typedef struct MCHPCIState { > uint64_t above_4g_mem_size; > uint64_t pci_hole64_size; > uint32_t short_root_bus; > + uint16_t ext_tseg_mbytes; > } MCHPCIState; > > typedef struct Q35PCIHost { > @@ -91,6 +92,11 @@ typedef struct Q35PCIHost { > /* D0:F0 configuration space */ > #define MCH_HOST_BRIDGE_REVISION_DEFAULT 0x0 > > +#define MCH_HOST_BRIDGE_EXT_TSEG_MBYTES 0x50 > +#define MCH_HOST_BRIDGE_EXT_TSEG_MBYTES_SIZE 2 > +#define MCH_HOST_BRIDGE_EXT_TSEG_MBYTES_QUERY 0xffff > +#define MCH_HOST_BRIDGE_EXT_TSEG_MBYTES_MAX 0xfff > + > #define MCH_HOST_BRIDGE_PCIEXBAR 0x60 /* 64bit register */ > #define MCH_HOST_BRIDGE_PCIEXBAR_SIZE 8 /* 64bit register */ > #define MCH_HOST_BRIDGE_PCIEXBAR_DEFAULT 0xb0000000 > diff --git a/hw/pci-host/q35.c b/hw/pci-host/q35.c > index cd5c49616ef9..28cb97b60fa3 100644 > --- a/hw/pci-host/q35.c > +++ b/hw/pci-host/q35.c > @@ -134,7 +134,7 @@ static void q35_host_get_mmcfg_size(Object *obj, Visitor *v, const char *name, > visit_type_uint32(v, name, &value, errp); > } > > -static Property mch_props[] = { > +static Property q35_host_props[] = { > DEFINE_PROP_UINT64(PCIE_HOST_MCFG_BASE, Q35PCIHost, parent_obj.base_addr, > MCH_HOST_BRIDGE_PCIEXBAR_DEFAULT), > DEFINE_PROP_SIZE(PCI_HOST_PROP_PCI_HOLE64_SIZE, Q35PCIHost, > @@ -154,7 +154,7 @@ static void q35_host_class_init(ObjectClass *klass, void *data) > > hc->root_bus_path = q35_host_root_bus_path; > dc->realize = q35_host_realize; > - dc->props = mch_props; > + dc->props = q35_host_props; > /* Reason: needs to be wired up by pc_q35_init */ > dc->user_creatable = false; > set_bit(DEVICE_CATEGORY_BRIDGE, dc->categories); > @@ -369,7 +369,7 @@ static void mch_update_smram(MCHPCIState *mch) > tseg_size = 1024 * 1024 * 8; > break; > default: > - tseg_size = 0; > + tseg_size = 1024 * 1024 * (uint32_t)mch->ext_tseg_mbytes; > break; > } > } else { > @@ -392,6 +392,17 @@ static void mch_update_smram(MCHPCIState *mch) > memory_region_transaction_commit(); > } > > +static void mch_update_ext_tseg_mbytes(MCHPCIState *mch) > +{ > + PCIDevice *pd = PCI_DEVICE(mch); > + uint8_t *reg = pd->config + MCH_HOST_BRIDGE_EXT_TSEG_MBYTES; > + > + if (mch->ext_tseg_mbytes > 0 && > + pci_get_word(reg) == MCH_HOST_BRIDGE_EXT_TSEG_MBYTES_QUERY) { > + pci_set_word(reg, mch->ext_tseg_mbytes); > + } > +} > + > static void mch_write_config(PCIDevice *d, > uint32_t address, uint32_t val, int len) > { > @@ -413,6 +424,11 @@ static void mch_write_config(PCIDevice *d, > MCH_HOST_BRIDGE_SMRAM_SIZE)) { > mch_update_smram(mch); > } > + > + if (ranges_overlap(address, len, MCH_HOST_BRIDGE_EXT_TSEG_MBYTES, > + MCH_HOST_BRIDGE_EXT_TSEG_MBYTES_SIZE)) { > + mch_update_ext_tseg_mbytes(mch); > + } > } > > static void mch_update(MCHPCIState *mch) > @@ -420,6 +436,7 @@ static void mch_update(MCHPCIState *mch) > mch_update_pciexbar(mch); > mch_update_pam(mch); > mch_update_smram(mch); > + mch_update_ext_tseg_mbytes(mch); > } > > static int mch_post_load(void *opaque, int version_id) > @@ -457,6 +474,11 @@ static void mch_reset(DeviceState *qdev) > d->wmask[MCH_HOST_BRIDGE_SMRAM] = MCH_HOST_BRIDGE_SMRAM_WMASK; > d->wmask[MCH_HOST_BRIDGE_ESMRAMC] = MCH_HOST_BRIDGE_ESMRAMC_WMASK; > > + if (mch->ext_tseg_mbytes > 0) { > + pci_set_word(d->config + MCH_HOST_BRIDGE_EXT_TSEG_MBYTES, > + MCH_HOST_BRIDGE_EXT_TSEG_MBYTES_QUERY); > + } > + > mch_update(mch); > } > > @@ -465,6 +487,12 @@ static void mch_realize(PCIDevice *d, Error **errp) > int i; > MCHPCIState *mch = MCH_PCI_DEVICE(d); > > + if (mch->ext_tseg_mbytes > MCH_HOST_BRIDGE_EXT_TSEG_MBYTES_MAX) { > + error_setg(errp, "invalid extended-tseg-mbytes value: %" PRIu16, > + mch->ext_tseg_mbytes); > + return; > + } > + > /* setup pci memory mapping */ > pc_pci_as_mapping_init(OBJECT(mch), mch->system_memory, > mch->pci_address_space); > @@ -530,6 +558,12 @@ uint64_t mch_mcfg_base(void) > return MCH_HOST_BRIDGE_PCIEXBAR_DEFAULT; > } > > +static Property mch_props[] = { > + DEFINE_PROP_UINT16("extended-tseg-mbytes", MCHPCIState, ext_tseg_mbytes, > + 16), > + DEFINE_PROP_END_OF_LIST(), > +}; > + > static void mch_class_init(ObjectClass *klass, void *data) > { > PCIDeviceClass *k = PCI_DEVICE_CLASS(klass); > @@ -538,6 +572,7 @@ static void mch_class_init(ObjectClass *klass, void *data) > k->realize = mch_realize; > k->config_write = mch_write_config; > dc->reset = mch_reset; > + dc->props = mch_props; > set_bit(DEVICE_CATEGORY_BRIDGE, dc->categories); > dc->desc = "Host bridge"; > dc->vmsd = &vmstate_mch; > -- > 2.9.3
On 06/08/17 18:34, Paolo Bonzini wrote: > > > On 08/06/2017 18:10, Laszlo Ersek wrote: >> When the guest writes value 0xffff to this register, the value that can be >> read back is that of "mch.extended-tseg-mbytes" -- unless it remains >> 0xffff. The guest is required to write 0xffff first (as opposed to a >> read-only register) because PCI config space is generally not cleared on >> QEMU reset, and after S3 resume or reboot, new guest firmware running on >> old QEMU could read a guest OS-injected value from this register. > > I guess that's also a reason not to make it readonly (that is, it would > require some firmware code anyway to test for "readonlyness" and > distinguish old machine types from new)? That's right; in <https://lists.01.org/pipermail/edk2-devel/2017-May/010432.html> I wrote, > If we invent such a new register, it should be in a location that is > either read-only, or zeroed-on-reset, in current QEMU. Otherwise, new > firmware running on old QEMU could be misled by a guest OS that writes > to this register, and then either reboots or enters S3. > > ... With this in mind, I don't oppose "having to write somewhere to > read back the result", but then let's please make that write access as > well to the same new qemu-specific register, and not to MCH_ESMRAMC. The problem is that config space is by default r/w and not cleared on reboot -- see Gerd's <https://lists.01.org/pipermail/edk2-devel/2017-May/010456.html> --, and once we designate another offset as "special" (either "reactive" or "read only"), the firmware will have to write to it and read it back unconditionally, to tell it apart from the default / original / non-special config space. Thanks! Laszlo
Hi, > I really dislike negotiation being re-invented for each device. Do > we > need these tricks? Can we just do fw cfg with standard discovery? > This ties in with my proposal to generalize smi features to > generic ones. Device properties should be part of the device. We should have done this with the smi too. A more standard way to handle this would be to add a vendor-specific pci capability and place the register there. Not sure we have room for that in the pci config space though. cheers, Gerd
On Thu, Jun 08, 2017 at 09:48:53PM +0200, Gerd Hoffmann wrote: > Hi, > > > I really dislike negotiation being re-invented for each device. Do > > we > > need these tricks? Can we just do fw cfg with standard discovery? > > This ties in with my proposal to generalize smi features to > > generic ones. > > Device properties should be part of the device. > We should have done this with the smi too. What is part of the device and what isn't? It's all part of QEMU in the end. Adding probing for multiple devices will just add to number of exits and slow down guest boot. We do want to stick to emulating real devices if we can, no argument here - but this stuff is PV anyway - what do we gain by spreading it out? > A more standard way to handle this would be to add a vendor-specific > pci capability and place the register there. Not sure we have room for > that in the pci config space though. > > cheers, > Gerd We don't have room anywhere in PCI config space. Laszlo makes argument why it's safe for this device based on spec but it's anyone's guess whether current and future software will follow spec. In short, going anywhere near the emulated device has a potential to break some drivers.
On 06/08/17 21:55, Michael S. Tsirkin wrote: > On Thu, Jun 08, 2017 at 09:48:53PM +0200, Gerd Hoffmann wrote: >> Hi, >> >>> I really dislike negotiation being re-invented for each device. Do >>> we >>> need these tricks? Can we just do fw cfg with standard discovery? >>> This ties in with my proposal to generalize smi features to >>> generic ones. >> >> Device properties should be part of the device. >> We should have done this with the smi too. > > What is part of the device and what isn't? It's all part > of QEMU in the end. Adding probing for multiple devices > will just add to number of exits and slow down guest boot. > > We do want to stick to emulating real devices if we can, no argument > here - but this stuff is PV anyway - what do we gain by spreading it > out? > >> A more standard way to handle this would be to add a vendor-specific >> pci capability and place the register there. Not sure we have room for >> that in the pci config space though. >> >> cheers, >> Gerd > > We don't have room anywhere in PCI config space. Laszlo makes argument > why it's safe for this device based on spec but it's anyone's guess > whether current and future software will follow spec. In short, going > anywhere near the emulated device has a potential to break some drivers. I'm fine using any QEMU facility that lets independent firmware modules perform their feature detections / negotiations / lockdowns in arbitrary order between each other. (Hopefully without extreme parsing requirements.) What I can not sign up for is to develop a general QEMU infrastructure for this (regardless of whether it is the fw_cfg bitmap stuff prevails, or the PCI config space register / capability list). Either is complex work, needing documentation too, the design has to be future proof. I'm not experienced enough in QEMU to get it right reasonably soon (everything is surprisingly complex and difficult in QEMU -- this has been my experience over the years, and I still struggle with QOM every single time), and I definitely do not have the capacity to take on a QEMU feature of the suggested size. It's not lack of interest on my part, but lack of capacity. (Case in point: it's ~1AM local time, and my laptop's uptime, which quite closely approximates the hours I've actually spent working today, is ~15:30.) The reason I keep submitting these little patches to qemu-devel is that I figure everyone else is overloaded too, so I might as well try what I'm capable of. But, we should be clear that that is not much, load-wise and sophistication-wise. The alternative could have been that I'd clone <https://bugzilla.redhat.com/show_bug.cgi?id=1447027> to qemu-kvm-rhev (from OVMF), set up the cross-BZ dependencies correctly, wait until the clone gets assigned to a seasoned QEMU developer, and once he or she gets to work on it, we figure out the design together, and once he/she writes the code for QEMU, I write the code for the firmware. I figured that sending a patch like the present one (having discussed it preliminarily with Gerd and Paolo in the "[edk2] SMRAM sizes on large hosts" thread) would be more efficient than waiting for a seasoned QEMU developer. I didn't expect that my patch would be better than theirs. :) The above kind of collaboration has certainly proved functional in the past, it just takes a lot of time and coordination. Anyway, "Laszlo embarking on a QEMU infrastructure project that's liable to take fifteen patch set iterations" is not an alternative, unfortunately. I definitely don't intend to throw QEMU patches over the fence; I know what drag that creates for maintainers. I intend to be responsible for my QEMU patches. However -- or perhaps, "exactly because of that"? -- I simply can't take on QEMU work that's larger than this caliber. Sorry about the wall of text. Thanks, Laszlo
On Fri, Jun 09, 2017 at 01:01:54AM +0200, Laszlo Ersek wrote: > On 06/08/17 21:55, Michael S. Tsirkin wrote: > > On Thu, Jun 08, 2017 at 09:48:53PM +0200, Gerd Hoffmann wrote: > >> Hi, > >> > >>> I really dislike negotiation being re-invented for each device. Do > >>> we > >>> need these tricks? Can we just do fw cfg with standard discovery? > >>> This ties in with my proposal to generalize smi features to > >>> generic ones. > >> > >> Device properties should be part of the device. > >> We should have done this with the smi too. > > > > What is part of the device and what isn't? It's all part > > of QEMU in the end. Adding probing for multiple devices > > will just add to number of exits and slow down guest boot. > > > > We do want to stick to emulating real devices if we can, no argument > > here - but this stuff is PV anyway - what do we gain by spreading it > > out? > > > >> A more standard way to handle this would be to add a vendor-specific > >> pci capability and place the register there. Not sure we have room for > >> that in the pci config space though. > >> > >> cheers, > >> Gerd > > > > We don't have room anywhere in PCI config space. Laszlo makes argument > > why it's safe for this device based on spec but it's anyone's guess > > whether current and future software will follow spec. In short, going > > anywhere near the emulated device has a potential to break some drivers. > > I'm fine using any QEMU facility that lets independent firmware modules > perform their feature detections / negotiations / lockdowns in arbitrary > order between each other. (Hopefully without extreme parsing requirements.) How about adding etc/mch/features etc copying the smi stuff? Is this simple enough? We can worry about removing code duplication later. > What I can not sign up for is to develop a general QEMU infrastructure > for this (regardless of whether it is the fw_cfg bitmap stuff prevails, > or the PCI config space register / capability list). Either is complex > work, needing documentation too, the design has to be future proof. I'm > not experienced enough in QEMU to get it right reasonably soon > (everything is surprisingly complex and difficult in QEMU -- this has > been my experience over the years, and I still struggle with QOM every > single time), and I definitely do not have the capacity to take on a > QEMU feature of the suggested size. > > It's not lack of interest on my part, but lack of capacity. (Case in > point: it's ~1AM local time, and my laptop's uptime, which quite closely > approximates the hours I've actually spent working today, is ~15:30.) > The reason I keep submitting these little patches to qemu-devel is that > I figure everyone else is overloaded too, so I might as well try what > I'm capable of. But, we should be clear that that is not much, load-wise > and sophistication-wise. > > The alternative could have been that I'd clone > <https://bugzilla.redhat.com/show_bug.cgi?id=1447027> to qemu-kvm-rhev > (from OVMF), set up the cross-BZ dependencies correctly, wait until the > clone gets assigned to a seasoned QEMU developer, and once he or she > gets to work on it, we figure out the design together, and once he/she > writes the code for QEMU, I write the code for the firmware. > > I figured that sending a patch like the present one (having discussed it > preliminarily with Gerd and Paolo in the "[edk2] SMRAM sizes on large > hosts" thread) would be more efficient than waiting for a seasoned QEMU > developer. I didn't expect that my patch would be better than theirs. :) > The above kind of collaboration has certainly proved functional in the > past, it just takes a lot of time and coordination. > > Anyway, "Laszlo embarking on a QEMU infrastructure project that's liable > to take fifteen patch set iterations" is not an alternative, > unfortunately. I definitely don't intend to throw QEMU patches over the > fence; I know what drag that creates for maintainers. I intend to be > responsible for my QEMU patches. However -- or perhaps, "exactly because > of that"? -- I simply can't take on QEMU work that's larger than this > caliber. > > Sorry about the wall of text. > > Thanks, > Laszlo
On 08/06/2017 21:55, Michael S. Tsirkin wrote: > We don't have room anywhere in PCI config space. Laszlo makes argument > why it's safe for this device based on spec but it's anyone's guess > whether current and future software will follow spec. In short, going > anywhere near the emulated device has a potential to break some drivers. There are no such drivers. The MCH and PCH are only touched by the firmware, not by the OS. There are exceptions such as the Intel graphics driver's incestuous relationship with the PCH, but they are well known. Let's avoid overengineering please. The only sensible (but less sensible) alternative is to just say it's 32MB (or 64, or 128) and cross fingers that we won't need more. Paolo
On 06/09/17 02:19, Michael S. Tsirkin wrote: > On Fri, Jun 09, 2017 at 01:01:54AM +0200, Laszlo Ersek wrote: >> On 06/08/17 21:55, Michael S. Tsirkin wrote: >>> On Thu, Jun 08, 2017 at 09:48:53PM +0200, Gerd Hoffmann wrote: >>>> Hi, >>>> >>>>> I really dislike negotiation being re-invented for each device. Do >>>>> we >>>>> need these tricks? Can we just do fw cfg with standard discovery? >>>>> This ties in with my proposal to generalize smi features to >>>>> generic ones. >>>> >>>> Device properties should be part of the device. >>>> We should have done this with the smi too. >>> >>> What is part of the device and what isn't? It's all part >>> of QEMU in the end. Adding probing for multiple devices >>> will just add to number of exits and slow down guest boot. >>> >>> We do want to stick to emulating real devices if we can, no argument >>> here - but this stuff is PV anyway - what do we gain by spreading it >>> out? >>> >>>> A more standard way to handle this would be to add a vendor-specific >>>> pci capability and place the register there. Not sure we have room for >>>> that in the pci config space though. >>>> >>>> cheers, >>>> Gerd >>> >>> We don't have room anywhere in PCI config space. Laszlo makes argument >>> why it's safe for this device based on spec but it's anyone's guess >>> whether current and future software will follow spec. In short, going >>> anywhere near the emulated device has a potential to break some drivers. >> >> I'm fine using any QEMU facility that lets independent firmware modules >> perform their feature detections / negotiations / lockdowns in arbitrary >> order between each other. (Hopefully without extreme parsing requirements.) > > How about adding etc/mch/features etc copying the smi stuff? Is this > simple enough? We can worry about removing code duplication later. I'm having a hard time mapping the negotiation used for SMI to this TSEG-size case. (Just to be sure I've now re-read the commit message on 50de920b372b ("hw/isa/lpc_ich9: add SMI feature negotiation via fw_cfg", 2017-01-26).) For SMI, there is only one way to trigger it. And the broadcast feature, if negotiated, modifies the behavior of SMI. It does not change how the SMI is triggered by the guest. The TSEG size selection differs. There are already three options for the guest (bit patterns 00 -> 1MB, 01 -> 2MB, 10 -> 8MB) to write to ESMRAMC.TSEG_SZ. Let's say the guest wants more: - First, beyond learning whether "more is possible", it has to know exactly "how much". So a single bit in some "host-features" bitmask is not good enough for that. - Second, assuming an increased TSEG size is available (with a queriable size), the bit patterns 00, 01, 10 should still not change their behavior / meaning. For example, I don't see how ESMRAMC.TSEG_SZ == 00 (implying 1MB TSEG) could coexist with the guest actually negotiating 16MB via another communication channel. Whatever code looked at ESMRAMC.TSEG_SZ would rightfully think there was a 1MB TSEG. So I think such a side-channel negotiation would actually violate specified behavior from the Q35 spec, rather than just fill in reserved/unspecified bits. - The third doubt I have is security / lockdown. In the SMI case, we defined our own lockdown scheme, and it did not overlap or conflict with anything. For SMRAM however, the SMRAM.D_LCK bit already locks down a whole bunch of things, including the ESMRAMC.TSEG_SZ field. Should SMRAM.D_LCK also lock down the "other" negotiation channel? If that's our choice, then we diverge at once from the negotiation pattern seen with SMI. Or else, what if we lock down the "other" channel (selecting 16MB for example), but don't lock down D_LCK? Then ESMRAMC.TSEG_SZ can still be modified, and that should actually affect the TSEG size (because D_LCK is not set). ... I think the negotiation pattern we used for etc/smi/* doesn't apply here; the Q35 spec already defines too much of a selection / lockdown dance for that. We could perhaps figure out the interactions, but (a) it's a security thing so there isn't much room for mistakes, and (b) we wouldn't be reusing the pattern seen with the SMI feature negotiation anyway. ESMRAMC.TSEG_SZ == 11 is the simplest (non-conflicting) extension to the Q35 spec, IMO. That's the core thing. How we pass the information to the guest that (a) ESMRAMC.TSEG_SZ == 11 is available, and (b) what size it actually means is a secondary question. For this, my original suggestion was a new fw_cfg file indeed: https://lists.01.org/pipermail/edk2-devel/2017-May/010404.html but Gerd seemed to prefer PCI config space. I was fine either way. (Please note that a malicious guest OS cannot fake an fw_cfg file on old QEMU, but it can fill in unimplemented registers in PCI config space. This is why using fw_cfg for querying the extended TSEG availability and size would be a "read only" operation in the guest, and why using PCI config space instead requires a *write* operation first. The latter has nothing to do with feature negotiation; the write occurs only to suppress earlier tampering from a malicious guest OS, pre-reboot.) Summary: (1) I don't think the pattern that we used for broadcast SMI applies here; we already have a spec-mandated selection and lockdown protocol involving ESMRAMC.TSEG_SZ and SMRAM.D_LCK. I don't think we can, or should try to, diverge from this protocol. In fact we're lucky to have a reserved bitmask value (11) in ESMRAMC.TSEG_SZ. (2) How exactly we inform the guest about the TSEG size that (ESMRAMC.TSEG_SZ == 11) corresponds to is a secondary question. We can make it a constant (and pray, like Paolo mentions), or we can make it cmdline-configurable and expose it via PCI config space, or we can make it cmdline-configurable and expose it via a new fw_cfg file. In OVMF I can easily accommodate either of these three options. Thanks Laszlo
On Fri, 2017-06-09 at 13:40 +0200, Paolo Bonzini wrote: > > On 08/06/2017 21:55, Michael S. Tsirkin wrote: > > We don't have room anywhere in PCI config space. Laszlo makes > > argument > > why it's safe for this device based on spec but it's anyone's guess > > whether current and future software will follow spec. In short, > > going > > anywhere near the emulated device has a potential to break some > > drivers. > > There are no such drivers. The MCH and PCH are only touched by the > firmware, not by the OS. Yea. That is *exactly* the reason why I think simply using the 0x50 offset probably works fine in practice even though I suspect on physical hardware it might be some undocumented register. Much of the stuff in the host bridge pci config space is firmware territory, and we run qemu specific firmware *anyway*. cheers, Gerd
On Fri, Jun 09, 2017 at 10:01:18PM +0200, Gerd Hoffmann wrote: > On Fri, 2017-06-09 at 13:40 +0200, Paolo Bonzini wrote: > > > > On 08/06/2017 21:55, Michael S. Tsirkin wrote: > > > We don't have room anywhere in PCI config space. Laszlo makes > > > argument > > > why it's safe for this device based on spec but it's anyone's guess > > > whether current and future software will follow spec. In short, > > > going > > > anywhere near the emulated device has a potential to break some > > > drivers. > > > > There are no such drivers. The MCH and PCH are only touched by the > > firmware, not by the OS. > > Yea. That is *exactly* the reason why I think simply using the 0x50 > offset probably works fine in practice even though I suspect on > physical hardware it might be some undocumented register. Much of the > stuff in the host bridge pci config space is firmware territory, and we > run qemu specific firmware *anyway*. > > cheers, > Gerd To be specific, what I meant is a bit that tells guest that a config space register is available, and lets host find out that guest is going to use it. This to ensure full forward and backward compatibility. I agree a fw cfg file for a single bit seems like an overkill, that's why I thought sharing feature files with SMI would be a good idea. Do you see an issue with that?
Hi, > To be specific, what I meant is a bit that tells guest that a > config space register is available, and lets host find out > that guest is going to use it. > > This to ensure full forward and backward compatibility. > > I agree a fw cfg file for a single bit seems like an overkill, that's > why I thought sharing feature files with SMI would be a good idea. > > Do you see an issue with that? The point of placing the extended-tseg-size register in mch pci config is that the firmware can figure this easily *without* looking somewhere else, as all the other tseg (and smram) config bits are in mch pci config space too. When involving fw_cfg there is no reason to keep the extended-tseg-size register in mch. We can simply place a "etc/q35-extended-tseg-size" file in fw_cfg then (which is either not present or contains the extended tseg size). So a feature bit in fw_cfg looks absolutely pointless to me. I still think the approach and patch by Laszlo is perfectly fine. While being at it: Reviewed-by: Gerd Hoffmann <kraxel@redhat.com> cheers, Gerd
On Thu, Jun 15, 2017 at 09:07:45AM +0200, Gerd Hoffmann wrote: > Hi, > > > To be specific, what I meant is a bit that tells guest that a > > config space register is available, and lets host find out > > that guest is going to use it. > > > > This to ensure full forward and backward compatibility. > > > > I agree a fw cfg file for a single bit seems like an overkill, that's > > why I thought sharing feature files with SMI would be a good idea. > > > > Do you see an issue with that? > > The point of placing the extended-tseg-size register in mch pci config > is that the firmware can figure this easily *without* looking somewhere > else, as all the other tseg (and smram) config bits are in mch pci > config space too. > > When involving fw_cfg there is no reason to keep the extended-tseg-size > register in mch. We can simply place a "etc/q35-extended-tseg-size" > file in fw_cfg then (which is either not present or contains the > extended tseg size). So a feature bit in fw_cfg looks absolutely > pointless to me. > > I still think the approach and patch by Laszlo is perfectly fine. > While being at it: > > Reviewed-by: Gerd Hoffmann <kraxel@redhat.com> > > cheers, > Gerd I guess I'll merge it and we'll see if there's any fallout. Thanks everyone.
diff --git a/include/hw/i386/pc.h b/include/hw/i386/pc.h index d071c9c0e9cd..233216abdc1a 100644 --- a/include/hw/i386/pc.h +++ b/include/hw/i386/pc.h @@ -384,6 +384,11 @@ bool e820_get_entry(int, uint32_t, uint64_t *, uint64_t *); #define PC_COMPAT_2_9 \ HW_COMPAT_2_9 \ + {\ + .driver = "mch",\ + .property = "extended-tseg-mbytes",\ + .value = stringify(0),\ + },\ #define PC_COMPAT_2_8 \ HW_COMPAT_2_8 \ diff --git a/include/hw/pci-host/q35.h b/include/hw/pci-host/q35.h index 53b6760c16c5..58983c00b32d 100644 --- a/include/hw/pci-host/q35.h +++ b/include/hw/pci-host/q35.h @@ -60,6 +60,7 @@ typedef struct MCHPCIState { uint64_t above_4g_mem_size; uint64_t pci_hole64_size; uint32_t short_root_bus; + uint16_t ext_tseg_mbytes; } MCHPCIState; typedef struct Q35PCIHost { @@ -91,6 +92,11 @@ typedef struct Q35PCIHost { /* D0:F0 configuration space */ #define MCH_HOST_BRIDGE_REVISION_DEFAULT 0x0 +#define MCH_HOST_BRIDGE_EXT_TSEG_MBYTES 0x50 +#define MCH_HOST_BRIDGE_EXT_TSEG_MBYTES_SIZE 2 +#define MCH_HOST_BRIDGE_EXT_TSEG_MBYTES_QUERY 0xffff +#define MCH_HOST_BRIDGE_EXT_TSEG_MBYTES_MAX 0xfff + #define MCH_HOST_BRIDGE_PCIEXBAR 0x60 /* 64bit register */ #define MCH_HOST_BRIDGE_PCIEXBAR_SIZE 8 /* 64bit register */ #define MCH_HOST_BRIDGE_PCIEXBAR_DEFAULT 0xb0000000 diff --git a/hw/pci-host/q35.c b/hw/pci-host/q35.c index cd5c49616ef9..28cb97b60fa3 100644 --- a/hw/pci-host/q35.c +++ b/hw/pci-host/q35.c @@ -134,7 +134,7 @@ static void q35_host_get_mmcfg_size(Object *obj, Visitor *v, const char *name, visit_type_uint32(v, name, &value, errp); } -static Property mch_props[] = { +static Property q35_host_props[] = { DEFINE_PROP_UINT64(PCIE_HOST_MCFG_BASE, Q35PCIHost, parent_obj.base_addr, MCH_HOST_BRIDGE_PCIEXBAR_DEFAULT), DEFINE_PROP_SIZE(PCI_HOST_PROP_PCI_HOLE64_SIZE, Q35PCIHost, @@ -154,7 +154,7 @@ static void q35_host_class_init(ObjectClass *klass, void *data) hc->root_bus_path = q35_host_root_bus_path; dc->realize = q35_host_realize; - dc->props = mch_props; + dc->props = q35_host_props; /* Reason: needs to be wired up by pc_q35_init */ dc->user_creatable = false; set_bit(DEVICE_CATEGORY_BRIDGE, dc->categories); @@ -369,7 +369,7 @@ static void mch_update_smram(MCHPCIState *mch) tseg_size = 1024 * 1024 * 8; break; default: - tseg_size = 0; + tseg_size = 1024 * 1024 * (uint32_t)mch->ext_tseg_mbytes; break; } } else { @@ -392,6 +392,17 @@ static void mch_update_smram(MCHPCIState *mch) memory_region_transaction_commit(); } +static void mch_update_ext_tseg_mbytes(MCHPCIState *mch) +{ + PCIDevice *pd = PCI_DEVICE(mch); + uint8_t *reg = pd->config + MCH_HOST_BRIDGE_EXT_TSEG_MBYTES; + + if (mch->ext_tseg_mbytes > 0 && + pci_get_word(reg) == MCH_HOST_BRIDGE_EXT_TSEG_MBYTES_QUERY) { + pci_set_word(reg, mch->ext_tseg_mbytes); + } +} + static void mch_write_config(PCIDevice *d, uint32_t address, uint32_t val, int len) { @@ -413,6 +424,11 @@ static void mch_write_config(PCIDevice *d, MCH_HOST_BRIDGE_SMRAM_SIZE)) { mch_update_smram(mch); } + + if (ranges_overlap(address, len, MCH_HOST_BRIDGE_EXT_TSEG_MBYTES, + MCH_HOST_BRIDGE_EXT_TSEG_MBYTES_SIZE)) { + mch_update_ext_tseg_mbytes(mch); + } } static void mch_update(MCHPCIState *mch) @@ -420,6 +436,7 @@ static void mch_update(MCHPCIState *mch) mch_update_pciexbar(mch); mch_update_pam(mch); mch_update_smram(mch); + mch_update_ext_tseg_mbytes(mch); } static int mch_post_load(void *opaque, int version_id) @@ -457,6 +474,11 @@ static void mch_reset(DeviceState *qdev) d->wmask[MCH_HOST_BRIDGE_SMRAM] = MCH_HOST_BRIDGE_SMRAM_WMASK; d->wmask[MCH_HOST_BRIDGE_ESMRAMC] = MCH_HOST_BRIDGE_ESMRAMC_WMASK; + if (mch->ext_tseg_mbytes > 0) { + pci_set_word(d->config + MCH_HOST_BRIDGE_EXT_TSEG_MBYTES, + MCH_HOST_BRIDGE_EXT_TSEG_MBYTES_QUERY); + } + mch_update(mch); } @@ -465,6 +487,12 @@ static void mch_realize(PCIDevice *d, Error **errp) int i; MCHPCIState *mch = MCH_PCI_DEVICE(d); + if (mch->ext_tseg_mbytes > MCH_HOST_BRIDGE_EXT_TSEG_MBYTES_MAX) { + error_setg(errp, "invalid extended-tseg-mbytes value: %" PRIu16, + mch->ext_tseg_mbytes); + return; + } + /* setup pci memory mapping */ pc_pci_as_mapping_init(OBJECT(mch), mch->system_memory, mch->pci_address_space); @@ -530,6 +558,12 @@ uint64_t mch_mcfg_base(void) return MCH_HOST_BRIDGE_PCIEXBAR_DEFAULT; } +static Property mch_props[] = { + DEFINE_PROP_UINT16("extended-tseg-mbytes", MCHPCIState, ext_tseg_mbytes, + 16), + DEFINE_PROP_END_OF_LIST(), +}; + static void mch_class_init(ObjectClass *klass, void *data) { PCIDeviceClass *k = PCI_DEVICE_CLASS(klass); @@ -538,6 +572,7 @@ static void mch_class_init(ObjectClass *klass, void *data) k->realize = mch_realize; k->config_write = mch_write_config; dc->reset = mch_reset; + dc->props = mch_props; set_bit(DEVICE_CATEGORY_BRIDGE, dc->categories); dc->desc = "Host bridge"; dc->vmsd = &vmstate_mch;
The q35 machine type currently lets the guest firmware select a 1MB, 2MB or 8MB TSEG (basically, SMRAM) size. In edk2/OVMF, we use 8MB, but even that is not enough when a lot of VCPUs (more than approx. 224) are configured -- SMRAM footprint scales largely proportionally with VCPU count. Introduce a new property for "mch" called "extended-tseg-mbytes", which expresses (in megabytes) the user's choice of TSEG (SMRAM) size. Invent a new, QEMU-specific register in the config space of the DRAM Controller, at offset 0x50, in order to allow guest firmware to query the TSEG (SMRAM) size. According to Intel Document Number 316966-002, Table 5-1 "DRAM Controller Register Address Map (D0:F0)": Warning: Address locations that are not listed are considered Intel Reserved registers locations. Reads to Reserved registers may return non-zero values. Writes to reserved locations may cause system failures. All registers that are defined in the PCI 2.3 specification, but are not necessary or implemented in this component are simply not included in this document. The reserved/unimplemented space in the PCI configuration header space is not documented as such in this summary. Offsets 0x50 and 0x51 are not listed in Table 5-1. They are also not part of the standard PCI config space header. And they precede the capability list as well, which starts at 0xe0 for this device. When the guest writes value 0xffff to this register, the value that can be read back is that of "mch.extended-tseg-mbytes" -- unless it remains 0xffff. The guest is required to write 0xffff first (as opposed to a read-only register) because PCI config space is generally not cleared on QEMU reset, and after S3 resume or reboot, new guest firmware running on old QEMU could read a guest OS-injected value from this register. After reading the available "extended" TSEG size, the guest firmware may actually request that TSEG size by writing pattern 11b to the ESMRAMC register's TSEG_SZ bit-field. (The Intel spec referenced above defines only patterns 00b (1MB), 01b (2MB) and 10b (8MB); 11b is reserved.) On the QEMU command line, the value can be set with -global mch.extended-tseg-mbytes=N The default value for 2.10+ q35 machine types is 16. The value is limited to 0xfff (4095) at the moment, purely so that the product (4095 MB) can be stored to the uint32_t variable "tseg_size" in mch_update_smram(). Users are responsible for choosing sensible TSEG sizes. On 2.9 and earlier q35 machine types, the default value is 0. This lets the 11b bit pattern in ESMRAMC.TSEG_SZ, and the register at offset 0x50, keep their original behavior. When "extended-tseg-mbytes" is nonzero, the new register at offset 0x50 is set to that value on reset, for completeness. PCI config space is migrated automatically, so no VMSD changes are necessary. Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Gerd Hoffmann <kraxel@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Ref: https://bugzilla.redhat.com/show_bug.cgi?id=1447027 Ref: https://lists.01.org/pipermail/edk2-devel/2017-May/010456.html Signed-off-by: Laszlo Ersek <lersek@redhat.com> --- Notes: RFC->PATCH changes: - This is an identical PATCH repost of the original RFC at <http://lists.nongnu.org/archive/html/qemu-devel/2017-05/msg06912.html>; I got the OVMF patches now. include/hw/i386/pc.h | 5 +++++ include/hw/pci-host/q35.h | 6 ++++++ hw/pci-host/q35.c | 41 ++++++++++++++++++++++++++++++++++++++--- 3 files changed, 49 insertions(+), 3 deletions(-)