Message ID: 20180924221346.16733-3-lersek@redhat.com (mailing list archive)
State: New, archived
Series: hw/pci-host/x86: extend the 64-bit PCI hole relative to the fw-assigned base
Hi Laszlo, On 9/25/18 1:13 AM, Laszlo Ersek wrote: > In commit 9fa99d2519cb ("hw/pci-host: Fix x86 Host Bridges 64bit PCI > hole", 2017-11-16), we meant to expose such a 64-bit PCI MMIO aperture in > the ACPI DSDT that would be at least as large as the new "pci-hole64-size" > property (2GB on i440fx, 32GB on q35). The goal was to offer "enough" > 64-bit MMIO aperture to the guest OS for hotplug purposes. > [...] > Also, because SeaBIOS never assigns 64-bit GPAs to > 64-bit BARs, the patch makes no difference to SeaBIOS guests. And this is why we didn't catch that earlier... we need to include OVMF in our basic tests. > (Which is in > turn why ACPI test data for the "bios-tables-test" need not be refreshed.) > > Using an i440fx OVMF guest with 5GB RAM, an example _CRS change is: > >> @@ -881,9 +881,9 @@ >> QWordMemory (ResourceProducer, PosDecode, MinFixed, MaxFixed, Cacheable, ReadWrite, >> 0x0000000000000000, // Granularity >> 0x0000000800000000, // Range Minimum >> - 0x000000080001C0FF, // Range Maximum >> + 0x000000087FFFFFFF, // Range Maximum >> 0x0000000000000000, // Translation Offset >> - 0x000000000001C100, // Length >> + 0x0000000080000000, // Length >> ,, , AddressRangeMemory, TypeStatic) >> }) >> Device (GPE0) > (On i440fx, the low RAM split is at 3GB, in this case. Therefore, with 5GB > guest RAM and no DIMM hotplug range, pc_pci_hole64_start() returns 4 + > (5-3) = 6 GB. Adding the 2GB extension to that yields 8GB, which is below > the firmware-programmed base of 32GB, before the patch. Therefore, before > the patch, the extension is ineffective. After the patch, we add the 2GB > extension to the firmware-programmed base, namely 32GB.) > > Using a q35 OVMF guest with 5GB RAM, an example _CRS change is: > >> @@ -3162,9 +3162,9 @@ >> QWordMemory (ResourceProducer, PosDecode, MinFixed, MaxFixed, Cacheable, ReadWrite, >> 0x0000000000000000, // Granularity >> 0x0000000800000000, // Range Minimum >> - 0x00000009BFFFFFFF, // Range Maximum >> + 0x0000000FFFFFFFFF, // Range Maximum >> 0x0000000000000000, // Translation Offset >> - 0x00000001C0000000, // Length >> + 0x0000000800000000, // Length >> ,, , AddressRangeMemory, TypeStatic) >> }) >> Device (GPE0) > (On Q35, the low RAM split is at 2GB. Therefore, with 5GB guest RAM and no > DIMM hotplug range, pc_pci_hole64_start() returns 4 + (5-2) = 7 GB. Adding > the 32GB extension to that yields 39GB (0x0000_0009_BFFF_FFFF + 1), before > the patch. After the patch, we add the 32GB extension to the > firmware-programmed base, namely 32GB.) > > Cc: "Michael S. 
Tsirkin" <mst@redhat.com> > Cc: Alex Williamson <alex.williamson@redhat.com> > Cc: Marcel Apfelbaum <marcel.apfelbaum@gmail.com> > Link: http://mid.mail-archive.com/a56b3710-9c2d-9ad0-5590-efe30b6d7bd9@redhat.com > Fixes: 9fa99d2519cbf71f871e46871df12cb446dc1c3e > Signed-off-by: Laszlo Ersek <lersek@redhat.com> > --- > hw/pci-host/piix.c | 2 +- > hw/pci-host/q35.c | 2 +- > 2 files changed, 2 insertions(+), 2 deletions(-) > > diff --git a/hw/pci-host/piix.c b/hw/pci-host/piix.c > index 0df91e002076..fb4b0669ac9f 100644 > --- a/hw/pci-host/piix.c > +++ b/hw/pci-host/piix.c > @@ -285,7 +285,7 @@ static void i440fx_pcihost_get_pci_hole64_end(Object *obj, Visitor *v, > { > PCIHostState *h = PCI_HOST_BRIDGE(obj); > I440FXState *s = I440FX_PCI_HOST_BRIDGE(obj); > - uint64_t hole64_start = pc_pci_hole64_start(); > + uint64_t hole64_start = i440fx_pcihost_get_pci_hole64_start_value(obj); > Range w64; > uint64_t value, hole64_end; > > diff --git a/hw/pci-host/q35.c b/hw/pci-host/q35.c > index 8acf942b5e65..1c5433161f14 100644 > --- a/hw/pci-host/q35.c > +++ b/hw/pci-host/q35.c > @@ -145,7 +145,7 @@ static void q35_host_get_pci_hole64_end(Object *obj, Visitor *v, > { > PCIHostState *h = PCI_HOST_BRIDGE(obj); > Q35PCIHost *s = Q35_HOST_DEVICE(obj); > - uint64_t hole64_start = pc_pci_hole64_start(); > + uint64_t hole64_start = q35_host_get_pci_hole64_start_value(obj); > Range w64; > uint64_t value, hole64_end; > Reviewed-by: Marcel Apfelbaum<marcel.apfelbaum@gmail.com> Thanks, Marcel
On Tue, 25 Sep 2018 00:13:46 +0200 Laszlo Ersek <lersek@redhat.com> wrote: > In commit 9fa99d2519cb ("hw/pci-host: Fix x86 Host Bridges 64bit PCI > hole", 2017-11-16), we meant to expose such a 64-bit PCI MMIO aperture in > the ACPI DSDT that would be at least as large as the new "pci-hole64-size" > property (2GB on i440fx, 32GB on q35). The goal was to offer "enough" > 64-bit MMIO aperture to the guest OS for hotplug purposes. > > In that commit, we added or modified five functions: > > - pc_pci_hole64_start(): shared between i440fx and q35. Provides a default > 64-bit base, which starts beyond the cold-plugged 64-bit RAM, and skips > the DIMM hotplug area too (if any). > > - i440fx_pcihost_get_pci_hole64_start(), q35_host_get_pci_hole64_start(): > board-specific 64-bit base property getters called abstractly by the > ACPI generator. Both of these fall back to pc_pci_hole64_start() if the > firmware didn't program any 64-bit hole (i.e. if the firmware didn't > assign a 64-bit GPA to any MMIO BAR on any device). Otherwise, they > honor the firmware's BAR assignments (i.e., they treat the lowest 64-bit > GPA programmed by the firmware as the base address for the aperture). > > - i440fx_pcihost_get_pci_hole64_end(), q35_host_get_pci_hole64_end(): > these intended to extend the aperture to our size recommendation, > calculated relative to the base of the aperture. > > Despite the original intent, i440fx_pcihost_get_pci_hole64_end() and > q35_host_get_pci_hole64_end() currently only extend the aperture relative > to the default base (pc_pci_hole64_start()), ignoring any programming done > by the firmware. This means that our size recommendation may not be met. > Fix it by honoring the firmware's address assignments. > > The strange extension sizes were spotted by Alex, in the log of a guest > kernel running on top of OVMF (which does assign 64-bit GPAs to BARs). > > This change only affects DSDT generation, therefore no new compat property > is being introduced. Also, because SeaBIOS never assigns 64-bit GPAs to > 64-bit BARs, the patch makes no difference to SeaBIOS guests. This is not exactly true, SeaBIOS will make use of 64-bit MMIO, but only if it cannot satisfy all the BARs from 32-bit MMMIO, see src/fw/pciinit.c:pci_bios_map_devices. Create a VM with several assigned GPUs and you'll eventually cross that threshold and all 64-bit BARs will be moved above 4G. I'm sure a few sufficiently sized ivshmem devices could do the same. Thanks, Alex > (Which is in > turn why ACPI test data for the "bios-tables-test" need not be refreshed.) > > Using an i440fx OVMF guest with 5GB RAM, an example _CRS change is: > > > @@ -881,9 +881,9 @@ > > QWordMemory (ResourceProducer, PosDecode, MinFixed, MaxFixed, Cacheable, ReadWrite, > > 0x0000000000000000, // Granularity > > 0x0000000800000000, // Range Minimum > > - 0x000000080001C0FF, // Range Maximum > > + 0x000000087FFFFFFF, // Range Maximum > > 0x0000000000000000, // Translation Offset > > - 0x000000000001C100, // Length > > + 0x0000000080000000, // Length > > ,, , AddressRangeMemory, TypeStatic) > > }) > > Device (GPE0) > > (On i440fx, the low RAM split is at 3GB, in this case. Therefore, with 5GB > guest RAM and no DIMM hotplug range, pc_pci_hole64_start() returns 4 + > (5-3) = 6 GB. Adding the 2GB extension to that yields 8GB, which is below > the firmware-programmed base of 32GB, before the patch. Therefore, before > the patch, the extension is ineffective. After the patch, we add the 2GB > extension to the firmware-programmed base, namely 32GB.) 
> > Using a q35 OVMF guest with 5GB RAM, an example _CRS change is: > > > @@ -3162,9 +3162,9 @@ > > QWordMemory (ResourceProducer, PosDecode, MinFixed, MaxFixed, Cacheable, ReadWrite, > > 0x0000000000000000, // Granularity > > 0x0000000800000000, // Range Minimum > > - 0x00000009BFFFFFFF, // Range Maximum > > + 0x0000000FFFFFFFFF, // Range Maximum > > 0x0000000000000000, // Translation Offset > > - 0x00000001C0000000, // Length > > + 0x0000000800000000, // Length > > ,, , AddressRangeMemory, TypeStatic) > > }) > > Device (GPE0) > > (On Q35, the low RAM split is at 2GB. Therefore, with 5GB guest RAM and no > DIMM hotplug range, pc_pci_hole64_start() returns 4 + (5-2) = 7 GB. Adding > the 32GB extension to that yields 39GB (0x0000_0009_BFFF_FFFF + 1), before > the patch. After the patch, we add the 32GB extension to the > firmware-programmed base, namely 32GB.) > > Cc: "Michael S. Tsirkin" <mst@redhat.com> > Cc: Alex Williamson <alex.williamson@redhat.com> > Cc: Marcel Apfelbaum <marcel.apfelbaum@gmail.com> > Link: http://mid.mail-archive.com/a56b3710-9c2d-9ad0-5590-efe30b6d7bd9@redhat.com > Fixes: 9fa99d2519cbf71f871e46871df12cb446dc1c3e > Signed-off-by: Laszlo Ersek <lersek@redhat.com> > --- > hw/pci-host/piix.c | 2 +- > hw/pci-host/q35.c | 2 +- > 2 files changed, 2 insertions(+), 2 deletions(-) > > diff --git a/hw/pci-host/piix.c b/hw/pci-host/piix.c > index 0df91e002076..fb4b0669ac9f 100644 > --- a/hw/pci-host/piix.c > +++ b/hw/pci-host/piix.c > @@ -285,7 +285,7 @@ static void i440fx_pcihost_get_pci_hole64_end(Object *obj, Visitor *v, > { > PCIHostState *h = PCI_HOST_BRIDGE(obj); > I440FXState *s = I440FX_PCI_HOST_BRIDGE(obj); > - uint64_t hole64_start = pc_pci_hole64_start(); > + uint64_t hole64_start = i440fx_pcihost_get_pci_hole64_start_value(obj); > Range w64; > uint64_t value, hole64_end; > > diff --git a/hw/pci-host/q35.c b/hw/pci-host/q35.c > index 8acf942b5e65..1c5433161f14 100644 > --- a/hw/pci-host/q35.c > +++ b/hw/pci-host/q35.c > @@ -145,7 +145,7 @@ static void q35_host_get_pci_hole64_end(Object *obj, Visitor *v, > { > PCIHostState *h = PCI_HOST_BRIDGE(obj); > Q35PCIHost *s = Q35_HOST_DEVICE(obj); > - uint64_t hole64_start = pc_pci_hole64_start(); > + uint64_t hole64_start = q35_host_get_pci_hole64_start_value(obj); > Range w64; > uint64_t value, hole64_end; >
On Tue, Sep 25, 2018 at 09:07:45PM +0300, Marcel Apfelbaum wrote: > Hi Laszlo, > > On 9/25/18 1:13 AM, Laszlo Ersek wrote: > > In commit 9fa99d2519cb ("hw/pci-host: Fix x86 Host Bridges 64bit PCI > > hole", 2017-11-16), we meant to expose such a 64-bit PCI MMIO aperture in > > the ACPI DSDT that would be at least as large as the new "pci-hole64-size" > > property (2GB on i440fx, 32GB on q35). The goal was to offer "enough" > > 64-bit MMIO aperture to the guest OS for hotplug purposes. > > > > [...] > > > Also, because SeaBIOS never assigns 64-bit GPAs to > > 64-bit BARs, the patch makes no difference to SeaBIOS guests. > > And this is why we didn't catch that earlier... we need to include > OVMF in our basic tests. For that we need to distribute the binary with QEMU, we don't want basic tests to have another external dependency. > > (Which is in > > turn why ACPI test data for the "bios-tables-test" need not be refreshed.) > > > > Using an i440fx OVMF guest with 5GB RAM, an example _CRS change is: > > > > > @@ -881,9 +881,9 @@ > > > QWordMemory (ResourceProducer, PosDecode, MinFixed, MaxFixed, Cacheable, ReadWrite, > > > 0x0000000000000000, // Granularity > > > 0x0000000800000000, // Range Minimum > > > - 0x000000080001C0FF, // Range Maximum > > > + 0x000000087FFFFFFF, // Range Maximum > > > 0x0000000000000000, // Translation Offset > > > - 0x000000000001C100, // Length > > > + 0x0000000080000000, // Length > > > ,, , AddressRangeMemory, TypeStatic) > > > }) > > > Device (GPE0) > > (On i440fx, the low RAM split is at 3GB, in this case. Therefore, with 5GB > > guest RAM and no DIMM hotplug range, pc_pci_hole64_start() returns 4 + > > (5-3) = 6 GB. Adding the 2GB extension to that yields 8GB, which is below > > the firmware-programmed base of 32GB, before the patch. Therefore, before > > the patch, the extension is ineffective. After the patch, we add the 2GB > > extension to the firmware-programmed base, namely 32GB.) > > > > Using a q35 OVMF guest with 5GB RAM, an example _CRS change is: > > > > > @@ -3162,9 +3162,9 @@ > > > QWordMemory (ResourceProducer, PosDecode, MinFixed, MaxFixed, Cacheable, ReadWrite, > > > 0x0000000000000000, // Granularity > > > 0x0000000800000000, // Range Minimum > > > - 0x00000009BFFFFFFF, // Range Maximum > > > + 0x0000000FFFFFFFFF, // Range Maximum > > > 0x0000000000000000, // Translation Offset > > > - 0x00000001C0000000, // Length > > > + 0x0000000800000000, // Length > > > ,, , AddressRangeMemory, TypeStatic) > > > }) > > > Device (GPE0) > > (On Q35, the low RAM split is at 2GB. Therefore, with 5GB guest RAM and no > > DIMM hotplug range, pc_pci_hole64_start() returns 4 + (5-2) = 7 GB. Adding > > the 32GB extension to that yields 39GB (0x0000_0009_BFFF_FFFF + 1), before > > the patch. After the patch, we add the 32GB extension to the > > firmware-programmed base, namely 32GB.) > > > > Cc: "Michael S. 
Tsirkin" <mst@redhat.com> > > Cc: Alex Williamson <alex.williamson@redhat.com> > > Cc: Marcel Apfelbaum <marcel.apfelbaum@gmail.com> > > Link: http://mid.mail-archive.com/a56b3710-9c2d-9ad0-5590-efe30b6d7bd9@redhat.com > > Fixes: 9fa99d2519cbf71f871e46871df12cb446dc1c3e > > Signed-off-by: Laszlo Ersek <lersek@redhat.com> > > --- > > hw/pci-host/piix.c | 2 +- > > hw/pci-host/q35.c | 2 +- > > 2 files changed, 2 insertions(+), 2 deletions(-) > > > > diff --git a/hw/pci-host/piix.c b/hw/pci-host/piix.c > > index 0df91e002076..fb4b0669ac9f 100644 > > --- a/hw/pci-host/piix.c > > +++ b/hw/pci-host/piix.c > > @@ -285,7 +285,7 @@ static void i440fx_pcihost_get_pci_hole64_end(Object *obj, Visitor *v, > > { > > PCIHostState *h = PCI_HOST_BRIDGE(obj); > > I440FXState *s = I440FX_PCI_HOST_BRIDGE(obj); > > - uint64_t hole64_start = pc_pci_hole64_start(); > > + uint64_t hole64_start = i440fx_pcihost_get_pci_hole64_start_value(obj); > > Range w64; > > uint64_t value, hole64_end; > > diff --git a/hw/pci-host/q35.c b/hw/pci-host/q35.c > > index 8acf942b5e65..1c5433161f14 100644 > > --- a/hw/pci-host/q35.c > > +++ b/hw/pci-host/q35.c > > @@ -145,7 +145,7 @@ static void q35_host_get_pci_hole64_end(Object *obj, Visitor *v, > > { > > PCIHostState *h = PCI_HOST_BRIDGE(obj); > > Q35PCIHost *s = Q35_HOST_DEVICE(obj); > > - uint64_t hole64_start = pc_pci_hole64_start(); > > + uint64_t hole64_start = q35_host_get_pci_hole64_start_value(obj); > > Range w64; > > uint64_t value, hole64_end; > > > Reviewed-by: Marcel Apfelbaum<marcel.apfelbaum@gmail.com> > > > Thanks, > Marcel >
On 09/25/18 22:36, Alex Williamson wrote: > On Tue, 25 Sep 2018 00:13:46 +0200 > Laszlo Ersek <lersek@redhat.com> wrote: > >> In commit 9fa99d2519cb ("hw/pci-host: Fix x86 Host Bridges 64bit PCI >> hole", 2017-11-16), we meant to expose such a 64-bit PCI MMIO aperture in >> the ACPI DSDT that would be at least as large as the new "pci-hole64-size" >> property (2GB on i440fx, 32GB on q35). The goal was to offer "enough" >> 64-bit MMIO aperture to the guest OS for hotplug purposes. >> >> In that commit, we added or modified five functions: >> >> - pc_pci_hole64_start(): shared between i440fx and q35. Provides a default >> 64-bit base, which starts beyond the cold-plugged 64-bit RAM, and skips >> the DIMM hotplug area too (if any). >> >> - i440fx_pcihost_get_pci_hole64_start(), q35_host_get_pci_hole64_start(): >> board-specific 64-bit base property getters called abstractly by the >> ACPI generator. Both of these fall back to pc_pci_hole64_start() if the >> firmware didn't program any 64-bit hole (i.e. if the firmware didn't >> assign a 64-bit GPA to any MMIO BAR on any device). Otherwise, they >> honor the firmware's BAR assignments (i.e., they treat the lowest 64-bit >> GPA programmed by the firmware as the base address for the aperture). >> >> - i440fx_pcihost_get_pci_hole64_end(), q35_host_get_pci_hole64_end(): >> these intended to extend the aperture to our size recommendation, >> calculated relative to the base of the aperture. >> >> Despite the original intent, i440fx_pcihost_get_pci_hole64_end() and >> q35_host_get_pci_hole64_end() currently only extend the aperture relative >> to the default base (pc_pci_hole64_start()), ignoring any programming done >> by the firmware. This means that our size recommendation may not be met. >> Fix it by honoring the firmware's address assignments. >> >> The strange extension sizes were spotted by Alex, in the log of a guest >> kernel running on top of OVMF (which does assign 64-bit GPAs to BARs). >> >> This change only affects DSDT generation, therefore no new compat property >> is being introduced. Also, because SeaBIOS never assigns 64-bit GPAs to >> 64-bit BARs, the patch makes no difference to SeaBIOS guests. > > This is not exactly true, SeaBIOS will make use of 64-bit MMIO, but > only if it cannot satisfy all the BARs from 32-bit MMMIO, see > src/fw/pciinit.c:pci_bios_map_devices. Indeed, this mistake in the commit message briefly crossed my mind, some time after posting the series. Then I promptly forgot about it again. :/ (Because, if SeaBIOS really never picked 64-bit addresses, then commit 9fa99d2519cb would have been mostly untestable, in the first place.) Thank you for pointing this out! > Create a VM with several > assigned GPUs and you'll eventually cross that threshold and all 64-bit > BARs will be moved above 4G. I'm sure a few sufficiently sized ivshmem > devices could do the same. Thanks, OK, I think I'll have to do those ivshmem tests, and rework the commit message a bit. Thanks! Laszlo > > Alex > >> (Which is in >> turn why ACPI test data for the "bios-tables-test" need not be refreshed.) 
>> >> Using an i440fx OVMF guest with 5GB RAM, an example _CRS change is: >> >>> @@ -881,9 +881,9 @@ >>> QWordMemory (ResourceProducer, PosDecode, MinFixed, MaxFixed, Cacheable, ReadWrite, >>> 0x0000000000000000, // Granularity >>> 0x0000000800000000, // Range Minimum >>> - 0x000000080001C0FF, // Range Maximum >>> + 0x000000087FFFFFFF, // Range Maximum >>> 0x0000000000000000, // Translation Offset >>> - 0x000000000001C100, // Length >>> + 0x0000000080000000, // Length >>> ,, , AddressRangeMemory, TypeStatic) >>> }) >>> Device (GPE0) >> >> (On i440fx, the low RAM split is at 3GB, in this case. Therefore, with 5GB >> guest RAM and no DIMM hotplug range, pc_pci_hole64_start() returns 4 + >> (5-3) = 6 GB. Adding the 2GB extension to that yields 8GB, which is below >> the firmware-programmed base of 32GB, before the patch. Therefore, before >> the patch, the extension is ineffective. After the patch, we add the 2GB >> extension to the firmware-programmed base, namely 32GB.) >> >> Using a q35 OVMF guest with 5GB RAM, an example _CRS change is: >> >>> @@ -3162,9 +3162,9 @@ >>> QWordMemory (ResourceProducer, PosDecode, MinFixed, MaxFixed, Cacheable, ReadWrite, >>> 0x0000000000000000, // Granularity >>> 0x0000000800000000, // Range Minimum >>> - 0x00000009BFFFFFFF, // Range Maximum >>> + 0x0000000FFFFFFFFF, // Range Maximum >>> 0x0000000000000000, // Translation Offset >>> - 0x00000001C0000000, // Length >>> + 0x0000000800000000, // Length >>> ,, , AddressRangeMemory, TypeStatic) >>> }) >>> Device (GPE0) >> >> (On Q35, the low RAM split is at 2GB. Therefore, with 5GB guest RAM and no >> DIMM hotplug range, pc_pci_hole64_start() returns 4 + (5-2) = 7 GB. Adding >> the 32GB extension to that yields 39GB (0x0000_0009_BFFF_FFFF + 1), before >> the patch. After the patch, we add the 32GB extension to the >> firmware-programmed base, namely 32GB.) >> >> Cc: "Michael S. Tsirkin" <mst@redhat.com> >> Cc: Alex Williamson <alex.williamson@redhat.com> >> Cc: Marcel Apfelbaum <marcel.apfelbaum@gmail.com> >> Link: http://mid.mail-archive.com/a56b3710-9c2d-9ad0-5590-efe30b6d7bd9@redhat.com >> Fixes: 9fa99d2519cbf71f871e46871df12cb446dc1c3e >> Signed-off-by: Laszlo Ersek <lersek@redhat.com> >> --- >> hw/pci-host/piix.c | 2 +- >> hw/pci-host/q35.c | 2 +- >> 2 files changed, 2 insertions(+), 2 deletions(-) >> >> diff --git a/hw/pci-host/piix.c b/hw/pci-host/piix.c >> index 0df91e002076..fb4b0669ac9f 100644 >> --- a/hw/pci-host/piix.c >> +++ b/hw/pci-host/piix.c >> @@ -285,7 +285,7 @@ static void i440fx_pcihost_get_pci_hole64_end(Object *obj, Visitor *v, >> { >> PCIHostState *h = PCI_HOST_BRIDGE(obj); >> I440FXState *s = I440FX_PCI_HOST_BRIDGE(obj); >> - uint64_t hole64_start = pc_pci_hole64_start(); >> + uint64_t hole64_start = i440fx_pcihost_get_pci_hole64_start_value(obj); >> Range w64; >> uint64_t value, hole64_end; >> >> diff --git a/hw/pci-host/q35.c b/hw/pci-host/q35.c >> index 8acf942b5e65..1c5433161f14 100644 >> --- a/hw/pci-host/q35.c >> +++ b/hw/pci-host/q35.c >> @@ -145,7 +145,7 @@ static void q35_host_get_pci_hole64_end(Object *obj, Visitor *v, >> { >> PCIHostState *h = PCI_HOST_BRIDGE(obj); >> Q35PCIHost *s = Q35_HOST_DEVICE(obj); >> - uint64_t hole64_start = pc_pci_hole64_start(); >> + uint64_t hole64_start = q35_host_get_pci_hole64_start_value(obj); >> Range w64; >> uint64_t value, hole64_end; >> >
On 09/25/18 22:36, Alex Williamson wrote: > On Tue, 25 Sep 2018 00:13:46 +0200 > Laszlo Ersek <lersek@redhat.com> wrote: > >> In commit 9fa99d2519cb ("hw/pci-host: Fix x86 Host Bridges 64bit PCI >> hole", 2017-11-16), we meant to expose such a 64-bit PCI MMIO aperture in >> the ACPI DSDT that would be at least as large as the new "pci-hole64-size" >> property (2GB on i440fx, 32GB on q35). The goal was to offer "enough" >> 64-bit MMIO aperture to the guest OS for hotplug purposes. >> >> In that commit, we added or modified five functions: >> >> - pc_pci_hole64_start(): shared between i440fx and q35. Provides a default >> 64-bit base, which starts beyond the cold-plugged 64-bit RAM, and skips >> the DIMM hotplug area too (if any). >> >> - i440fx_pcihost_get_pci_hole64_start(), q35_host_get_pci_hole64_start(): >> board-specific 64-bit base property getters called abstractly by the >> ACPI generator. Both of these fall back to pc_pci_hole64_start() if the >> firmware didn't program any 64-bit hole (i.e. if the firmware didn't >> assign a 64-bit GPA to any MMIO BAR on any device). Otherwise, they >> honor the firmware's BAR assignments (i.e., they treat the lowest 64-bit >> GPA programmed by the firmware as the base address for the aperture). >> >> - i440fx_pcihost_get_pci_hole64_end(), q35_host_get_pci_hole64_end(): >> these intended to extend the aperture to our size recommendation, >> calculated relative to the base of the aperture. >> >> Despite the original intent, i440fx_pcihost_get_pci_hole64_end() and >> q35_host_get_pci_hole64_end() currently only extend the aperture relative >> to the default base (pc_pci_hole64_start()), ignoring any programming done >> by the firmware. This means that our size recommendation may not be met. >> Fix it by honoring the firmware's address assignments. >> >> The strange extension sizes were spotted by Alex, in the log of a guest >> kernel running on top of OVMF (which does assign 64-bit GPAs to BARs). >> >> This change only affects DSDT generation, therefore no new compat property >> is being introduced. Also, because SeaBIOS never assigns 64-bit GPAs to >> 64-bit BARs, the patch makes no difference to SeaBIOS guests. > > This is not exactly true, SeaBIOS will make use of 64-bit MMIO, but > only if it cannot satisfy all the BARs from 32-bit MMMIO, see > src/fw/pciinit.c:pci_bios_map_devices. Create a VM with several > assigned GPUs and you'll eventually cross that threshold and all 64-bit > BARs will be moved above 4G. I'm sure a few sufficiently sized ivshmem > devices could do the same. Thanks, The effect of this patch is not hard to demonstrate with SeaBIOS+Q35, when using e.g. 5GB of guest RAM and a 4GB ivshmem-plain device. However, using SeaBIOS+i440fx, I can't show the difference. I've been experimenting with various ivshmem devices (even multiple at the same time, with different sizes). The "all or nothing" nature of SeaBIOS's high allocation of the 64-bit BARs, combined with hugepage alignment inside SeaBIOS, combined with the small (2GB) rounding size used in QEMU for i440fx, seem to make it surprisingly difficult to trigger the issue. I figure I should: (1) remove the sentence "the patch makes no difference to SeaBIOS guests" from the commit message, (2) include the DSDT diff on SeaBIOS/q35 in the commit message, (3) remain silent on SeaBIOS/i440fx, in the commit message, (4) append a new patch, for "bios-tables-test", so that the ACPI gen change is validated as part of the test suite, on SeaBIOS/q35. 
Regarding (4):

- is it OK if I add the test only for Q35?

- what guest RAM size am I allowed to use in the test suite? In my own
SeaBIOS/Q35 reproducer I currently use 5GB, but I'm not sure if it's
acceptable for the test suite.

Thanks!
Laszlo
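A minimal command-line sketch of that SeaBIOS/Q35 reproducer (5GB guest RAM plus a 4GB ivshmem-plain device, as described above); the backend id and mem-path are illustrative, not taken from the thread:

  qemu-system-x86_64 -machine q35 -m 5G \
    -object memory-backend-file,id=shm0,size=4G,mem-path=/dev/shm/shm0,share=on \
    -device ivshmem-plain,memdev=shm0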
On 09/26/18 13:12, Laszlo Ersek wrote:

> (4) append a new patch, for "bios-tables-test", so that the ACPI gen
> change is validated as part of the test suite, on SeaBIOS/q35.
>
> Regarding (4):
>
> - is it OK if I add the test only for Q35?
>
> - what guest RAM size am I allowed to use in the test suite? In my own
> SeaBIOS/Q35 reproducer I currently use 5GB, but I'm not sure if it's
> acceptable for the test suite.

And, even if the patch's effect can be shown with little guest DRAM, the test case still requires a multi-gig ivshmem-plain device. In "tests/ivshmem-test.c", I see how it is set up -- the backend is created with shm_open(). The file created under /dev/shm (on Linux) might require host RAM just the same as normal guest DRAM (especially with memory overcommit disabled on the host), correct?

Thanks
Laszlo
On Wed, 26 Sep 2018 13:35:14 +0200
Laszlo Ersek <lersek@redhat.com> wrote:

> On 09/26/18 13:12, Laszlo Ersek wrote:
>
> > (4) append a new patch, for "bios-tables-test", so that the ACPI gen
> > change is validated as part of the test suite, on SeaBIOS/q35.
> >
> > Regarding (4):
> >
> > - is it OK if I add the test only for Q35?
> >
> > - what guest RAM size am I allowed to use in the test suite? In my own
> > SeaBIOS/Q35 reproducer I currently use 5GB, but I'm not sure if it's
> > acceptable for the test suite.
>
> And, even if the patch's effect can be shown with little guest DRAM, the
> test case still requires a multi-gig ivshmem-plain device. In
> "tests/ivshmem-test.c", I see how it is set up -- the backend is created
> with shm_open(). The file created under /dev/shm (on Linux) might
> require host RAM just the same as normal guest DRAM (especially with
> memory overcommit disabled on the host), correct?

With overcommit disabled, or cgroup limits enforced (which I'd expect in an automated testing environment, i.e. Travis or something else), allocating such an amount of RAM would probably fail like crazy.

Maybe using a memdev file backend with a manually created sparse file might actually work (with preallocation disabled).

> Thanks
> Laszlo
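A quick sketch of the sparse-file idea; the path and size are illustrative. A file resized this way occupies no blocks until its pages are actually dirtied, which du can confirm:

  truncate -s 4G /tmp/ivshmem-backing
  du -h --apparent-size /tmp/ivshmem-backing   # reports 4.0G
  du -h /tmp/ivshmem-backing                   # reports 0: nothing allocated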
On Wed, 26 Sep 2018 13:12:47 +0200 Laszlo Ersek <lersek@redhat.com> wrote: > On 09/25/18 22:36, Alex Williamson wrote: > > On Tue, 25 Sep 2018 00:13:46 +0200 > > Laszlo Ersek <lersek@redhat.com> wrote: > > > >> In commit 9fa99d2519cb ("hw/pci-host: Fix x86 Host Bridges 64bit PCI > >> hole", 2017-11-16), we meant to expose such a 64-bit PCI MMIO aperture in > >> the ACPI DSDT that would be at least as large as the new "pci-hole64-size" > >> property (2GB on i440fx, 32GB on q35). The goal was to offer "enough" > >> 64-bit MMIO aperture to the guest OS for hotplug purposes. > >> > >> In that commit, we added or modified five functions: > >> > >> - pc_pci_hole64_start(): shared between i440fx and q35. Provides a default > >> 64-bit base, which starts beyond the cold-plugged 64-bit RAM, and skips > >> the DIMM hotplug area too (if any). > >> > >> - i440fx_pcihost_get_pci_hole64_start(), q35_host_get_pci_hole64_start(): > >> board-specific 64-bit base property getters called abstractly by the > >> ACPI generator. Both of these fall back to pc_pci_hole64_start() if the > >> firmware didn't program any 64-bit hole (i.e. if the firmware didn't > >> assign a 64-bit GPA to any MMIO BAR on any device). Otherwise, they > >> honor the firmware's BAR assignments (i.e., they treat the lowest 64-bit > >> GPA programmed by the firmware as the base address for the aperture). > >> > >> - i440fx_pcihost_get_pci_hole64_end(), q35_host_get_pci_hole64_end(): > >> these intended to extend the aperture to our size recommendation, > >> calculated relative to the base of the aperture. > >> > >> Despite the original intent, i440fx_pcihost_get_pci_hole64_end() and > >> q35_host_get_pci_hole64_end() currently only extend the aperture relative > >> to the default base (pc_pci_hole64_start()), ignoring any programming done > >> by the firmware. This means that our size recommendation may not be met. > >> Fix it by honoring the firmware's address assignments. > >> > >> The strange extension sizes were spotted by Alex, in the log of a guest > >> kernel running on top of OVMF (which does assign 64-bit GPAs to BARs). > >> > >> This change only affects DSDT generation, therefore no new compat property > >> is being introduced. Also, because SeaBIOS never assigns 64-bit GPAs to > >> 64-bit BARs, the patch makes no difference to SeaBIOS guests. > > > > This is not exactly true, SeaBIOS will make use of 64-bit MMIO, but > > only if it cannot satisfy all the BARs from 32-bit MMMIO, see > > src/fw/pciinit.c:pci_bios_map_devices. Create a VM with several > > assigned GPUs and you'll eventually cross that threshold and all 64-bit > > BARs will be moved above 4G. I'm sure a few sufficiently sized ivshmem > > devices could do the same. Thanks, > > The effect of this patch is not hard to demonstrate with SeaBIOS+Q35, > when using e.g. 5GB of guest RAM and a 4GB ivshmem-plain device. > > However, using SeaBIOS+i440fx, I can't show the difference. I've been > experimenting with various ivshmem devices (even multiple at the same > time, with different sizes). The "all or nothing" nature of SeaBIOS's > high allocation of the 64-bit BARs, combined with hugepage alignment > inside SeaBIOS, combined with the small (2GB) rounding size used in QEMU > for i440fx, seem to make it surprisingly difficult to trigger the issue. 
> > I figure I should: > > (1) remove the sentence "the patch makes no difference to SeaBIOS > guests" from the commit message, > > (2) include the DSDT diff on SeaBIOS/q35 in the commit message, > > (3) remain silent on SeaBIOS/i440fx, in the commit message, > > (4) append a new patch, for "bios-tables-test", so that the ACPI gen > change is validated as part of the test suite, on SeaBIOS/q35. > > Regarding (4): > > - is it OK if I add the test only for Q35? > > - what guest RAM size am I allowed to use in the test suite? In my own > SeaBIOS/Q35 reproducer I currently use 5GB, but I'm not sure if it's > acceptable for the test suite. Seems like you've done due diligence, the plan looks ok to me. Regarding the test memory allocation, is it possible and reasonable to perhaps create a 256MB shared memory area and re-use it for multiple ivshmem devices? ie. rather than 1, 4GB ivshmem device, use 16, 256MB devices, all with the same backing. Thanks, Alex
(+Eric)

On 09/26/18 14:10, Igor Mammedov wrote:
> On Wed, 26 Sep 2018 13:35:14 +0200
> Laszlo Ersek <lersek@redhat.com> wrote:
>
>> On 09/26/18 13:12, Laszlo Ersek wrote:
>>
>>> (4) append a new patch, for "bios-tables-test", so that the ACPI gen
>>> change is validated as part of the test suite, on SeaBIOS/q35.
>>>
>>> Regarding (4):
>>>
>>> - is it OK if I add the test only for Q35?
>>>
>>> - what guest RAM size am I allowed to use in the test suite? In my own
>>> SeaBIOS/Q35 reproducer I currently use 5GB, but I'm not sure if it's
>>> acceptable for the test suite.
>>
>> And, even if the patch's effect can be shown with little guest DRAM, the
>> test case still requires a multi-gig ivshmem-plain device. In
>> "tests/ivshmem-test.c", I see how it is set up -- the backend is created
>> with shm_open(). The file created under /dev/shm (on Linux) might
>> require host RAM just the same as normal guest DRAM (especially with
>> memory overcommit disabled on the host), correct?
>
> With overcommit disabled, or cgroup limits enforced (which I'd expect in
> an automated testing environment, i.e. Travis or something else),
> allocating such an amount of RAM would probably fail like crazy.
>
> Maybe using a memdev file backend with a manually created sparse file
> might actually work (with preallocation disabled).

Thanks, this sounds like a good idea.

I see shm_open() is used heavily in ivshmem-related tests. I haven't looked much at shm_open() before. (I've always known it existed in POSIX, but I've never cared.)

So now I first checked what shm_open() would give me over a regular temporary file created with open(); after all, the file descriptor returned by either would have to be mmap()'d. From the rationale in POSIX: <http://pubs.opengroup.org/onlinepubs/9699919799/xrat/V4_xsh_chap02.html#tag_22_02_08_14>, it seems like the {shm_open(), mmap()} combo has two significant guarantees over {open(), mmap()}:

- the namespace may be distinct (there need not be a writeable filesystem at all),

- the shared object will *always* be locked in RAM ("Shared memory is not just simply providing common access to data, it is providing the fastest possible communication between the processes").

The rationale seems to permit, on purpose, an shm_open() implementation that is actually based on open(), using a special file system -- and AIUI, /dev/shm is just that, on Linux.

Eric, does the above sound more or less correct?

If it is correct, then I think shm_open() is exactly what I *don't* want for this use case. Because, while I do need a pathname for an mmap()-able object (regular file, or otherwise), just so I can do:

  -object memory-backend-file,id=mem-obj,...,mem-path=... \
  -device ivshmem-plain,memdev=mem-obj,...

I want the underlying object to put as little pressure on the system that runs the test suite as possible. This means I should specifically ask for a regular file, to be mmap()'d (with MAP_SHARED). Then the kernel knows in advance that it can always page out the dirty stuff, and the mapping shouldn't clash with cgroups, or disabled memory overcommit.

Now, in order to make that actually safe, I should in theory ask for preallocation on the filesystem (otherwise, if the filesystem runs out of space while the kernel is allocating fs extents in order to flush the dirty pages to them, the process gets a SIGBUS, IIRC). However, because I know that nothing will in fact be dirtied, I can minimize the footprint on the filesystem as well, and forgo preallocation too.

This suggests that, in my test case,

- I call g_file_open_tmp() for creating the temporary file,

- pass the returned fd to ftruncate() for resizing the temporary file,

- pass the returned pathname to the "memory-backend-file" objects, in the "mem-path" property,

- set "share=on",

- set "prealloc=off",

- "discard-data" is irrelevant (there won't be any dirty pages).

Thanks
Laszlo
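Put together, a shell-level sketch of what the planned test setup amounts to; the real test would use g_file_open_tmp() and ftruncate() rather than mktemp and truncate, and the file name and sizes are illustrative:

  tmpfile=$(mktemp)
  truncate -s 2G "$tmpfile"    # sparse resize; no extents allocated
  qemu-system-x86_64 -machine q35 \
    -object memory-backend-file,id=mem-obj,size=2G,mem-path="$tmpfile",share=on,prealloc=off \
    -device ivshmem-plain,memdev=mem-obj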
(+Eric) On 09/26/18 18:26, Alex Williamson wrote: > On Wed, 26 Sep 2018 13:12:47 +0200 > Laszlo Ersek <lersek@redhat.com> wrote: > >> On 09/25/18 22:36, Alex Williamson wrote: >>> On Tue, 25 Sep 2018 00:13:46 +0200 >>> Laszlo Ersek <lersek@redhat.com> wrote: >>> >>>> In commit 9fa99d2519cb ("hw/pci-host: Fix x86 Host Bridges 64bit PCI >>>> hole", 2017-11-16), we meant to expose such a 64-bit PCI MMIO aperture in >>>> the ACPI DSDT that would be at least as large as the new "pci-hole64-size" >>>> property (2GB on i440fx, 32GB on q35). The goal was to offer "enough" >>>> 64-bit MMIO aperture to the guest OS for hotplug purposes. >>>> >>>> In that commit, we added or modified five functions: >>>> >>>> - pc_pci_hole64_start(): shared between i440fx and q35. Provides a default >>>> 64-bit base, which starts beyond the cold-plugged 64-bit RAM, and skips >>>> the DIMM hotplug area too (if any). >>>> >>>> - i440fx_pcihost_get_pci_hole64_start(), q35_host_get_pci_hole64_start(): >>>> board-specific 64-bit base property getters called abstractly by the >>>> ACPI generator. Both of these fall back to pc_pci_hole64_start() if the >>>> firmware didn't program any 64-bit hole (i.e. if the firmware didn't >>>> assign a 64-bit GPA to any MMIO BAR on any device). Otherwise, they >>>> honor the firmware's BAR assignments (i.e., they treat the lowest 64-bit >>>> GPA programmed by the firmware as the base address for the aperture). >>>> >>>> - i440fx_pcihost_get_pci_hole64_end(), q35_host_get_pci_hole64_end(): >>>> these intended to extend the aperture to our size recommendation, >>>> calculated relative to the base of the aperture. >>>> >>>> Despite the original intent, i440fx_pcihost_get_pci_hole64_end() and >>>> q35_host_get_pci_hole64_end() currently only extend the aperture relative >>>> to the default base (pc_pci_hole64_start()), ignoring any programming done >>>> by the firmware. This means that our size recommendation may not be met. >>>> Fix it by honoring the firmware's address assignments. >>>> >>>> The strange extension sizes were spotted by Alex, in the log of a guest >>>> kernel running on top of OVMF (which does assign 64-bit GPAs to BARs). >>>> >>>> This change only affects DSDT generation, therefore no new compat property >>>> is being introduced. Also, because SeaBIOS never assigns 64-bit GPAs to >>>> 64-bit BARs, the patch makes no difference to SeaBIOS guests. >>> >>> This is not exactly true, SeaBIOS will make use of 64-bit MMIO, but >>> only if it cannot satisfy all the BARs from 32-bit MMMIO, see >>> src/fw/pciinit.c:pci_bios_map_devices. Create a VM with several >>> assigned GPUs and you'll eventually cross that threshold and all 64-bit >>> BARs will be moved above 4G. I'm sure a few sufficiently sized ivshmem >>> devices could do the same. Thanks, >> >> The effect of this patch is not hard to demonstrate with SeaBIOS+Q35, >> when using e.g. 5GB of guest RAM and a 4GB ivshmem-plain device. >> >> However, using SeaBIOS+i440fx, I can't show the difference. I've been >> experimenting with various ivshmem devices (even multiple at the same >> time, with different sizes). The "all or nothing" nature of SeaBIOS's >> high allocation of the 64-bit BARs, combined with hugepage alignment >> inside SeaBIOS, combined with the small (2GB) rounding size used in QEMU >> for i440fx, seem to make it surprisingly difficult to trigger the issue. 
>> >> I figure I should: >> >> (1) remove the sentence "the patch makes no difference to SeaBIOS >> guests" from the commit message, >> >> (2) include the DSDT diff on SeaBIOS/q35 in the commit message, >> >> (3) remain silent on SeaBIOS/i440fx, in the commit message, >> >> (4) append a new patch, for "bios-tables-test", so that the ACPI gen >> change is validated as part of the test suite, on SeaBIOS/q35. >> >> Regarding (4): >> >> - is it OK if I add the test only for Q35? >> >> - what guest RAM size am I allowed to use in the test suite? In my own >> SeaBIOS/Q35 reproducer I currently use 5GB, but I'm not sure if it's >> acceptable for the test suite. > > Seems like you've done due diligence, the plan looks ok to me. > Regarding the test memory allocation, is it possible and reasonable to > perhaps create a 256MB shared memory area and re-use it for multiple > ivshmem devices? ie. rather than 1, 4GB ivshmem device, use 16, 256MB > devices, all with the same backing. Thanks, This too sounds useful. AIUI, ftruncate() is neither forbidden, nor required, to allocate filesystem extents when increasing the size of a file. Using one smaller regular temporary file as the common foundation for multiple "memory-backend-file" objects will save space on the fs if ftruncate() happens to allocate extents. (I've also thought of passing the same "memory-backend-file" object to multiple ivshmem-plain devices, but ivshmem_plain_realize() [hw/misc/ivshmem.c] checks whether the HostMemoryBackend is already mapped.) Thanks! Laszlo
On 09/26/18 18:26, Alex Williamson wrote: > On Wed, 26 Sep 2018 13:12:47 +0200 > Laszlo Ersek <lersek@redhat.com> wrote: > >> On 09/25/18 22:36, Alex Williamson wrote: >>> On Tue, 25 Sep 2018 00:13:46 +0200 >>> Laszlo Ersek <lersek@redhat.com> wrote: >>> >>>> In commit 9fa99d2519cb ("hw/pci-host: Fix x86 Host Bridges 64bit PCI >>>> hole", 2017-11-16), we meant to expose such a 64-bit PCI MMIO aperture in >>>> the ACPI DSDT that would be at least as large as the new "pci-hole64-size" >>>> property (2GB on i440fx, 32GB on q35). The goal was to offer "enough" >>>> 64-bit MMIO aperture to the guest OS for hotplug purposes. >>>> >>>> In that commit, we added or modified five functions: >>>> >>>> - pc_pci_hole64_start(): shared between i440fx and q35. Provides a default >>>> 64-bit base, which starts beyond the cold-plugged 64-bit RAM, and skips >>>> the DIMM hotplug area too (if any). >>>> >>>> - i440fx_pcihost_get_pci_hole64_start(), q35_host_get_pci_hole64_start(): >>>> board-specific 64-bit base property getters called abstractly by the >>>> ACPI generator. Both of these fall back to pc_pci_hole64_start() if the >>>> firmware didn't program any 64-bit hole (i.e. if the firmware didn't >>>> assign a 64-bit GPA to any MMIO BAR on any device). Otherwise, they >>>> honor the firmware's BAR assignments (i.e., they treat the lowest 64-bit >>>> GPA programmed by the firmware as the base address for the aperture). >>>> >>>> - i440fx_pcihost_get_pci_hole64_end(), q35_host_get_pci_hole64_end(): >>>> these intended to extend the aperture to our size recommendation, >>>> calculated relative to the base of the aperture. >>>> >>>> Despite the original intent, i440fx_pcihost_get_pci_hole64_end() and >>>> q35_host_get_pci_hole64_end() currently only extend the aperture relative >>>> to the default base (pc_pci_hole64_start()), ignoring any programming done >>>> by the firmware. This means that our size recommendation may not be met. >>>> Fix it by honoring the firmware's address assignments. >>>> >>>> The strange extension sizes were spotted by Alex, in the log of a guest >>>> kernel running on top of OVMF (which does assign 64-bit GPAs to BARs). >>>> >>>> This change only affects DSDT generation, therefore no new compat property >>>> is being introduced. Also, because SeaBIOS never assigns 64-bit GPAs to >>>> 64-bit BARs, the patch makes no difference to SeaBIOS guests. >>> >>> This is not exactly true, SeaBIOS will make use of 64-bit MMIO, but >>> only if it cannot satisfy all the BARs from 32-bit MMMIO, see >>> src/fw/pciinit.c:pci_bios_map_devices. Create a VM with several >>> assigned GPUs and you'll eventually cross that threshold and all 64-bit >>> BARs will be moved above 4G. I'm sure a few sufficiently sized ivshmem >>> devices could do the same. Thanks, >> >> The effect of this patch is not hard to demonstrate with SeaBIOS+Q35, >> when using e.g. 5GB of guest RAM and a 4GB ivshmem-plain device. >> >> However, using SeaBIOS+i440fx, I can't show the difference. I've been >> experimenting with various ivshmem devices (even multiple at the same >> time, with different sizes). The "all or nothing" nature of SeaBIOS's >> high allocation of the 64-bit BARs, combined with hugepage alignment >> inside SeaBIOS, combined with the small (2GB) rounding size used in QEMU >> for i440fx, seem to make it surprisingly difficult to trigger the issue. 
>> I figure I should:
>>
>> (1) remove the sentence "the patch makes no difference to SeaBIOS
>> guests" from the commit message,
>>
>> (2) include the DSDT diff on SeaBIOS/q35 in the commit message,
>>
>> (3) remain silent on SeaBIOS/i440fx, in the commit message,
>>
>> (4) append a new patch, for "bios-tables-test", so that the ACPI gen
>> change is validated as part of the test suite, on SeaBIOS/q35.
>>
>> Regarding (4):
>>
>> - is it OK if I add the test only for Q35?
>>
>> - what guest RAM size am I allowed to use in the test suite? In my own
>> SeaBIOS/Q35 reproducer I currently use 5GB, but I'm not sure if it's
>> acceptable for the test suite.
>
> Seems like you've done due diligence, the plan looks ok to me.
> Regarding the test memory allocation, is it possible and reasonable to
> perhaps create a 256MB shared memory area and re-use it for multiple
> ivshmem devices? i.e. rather than 1, 4GB ivshmem device, use 16, 256MB
> devices, all with the same backing. Thanks,

In order to show that the patch makes a difference, on SeaBIOS+Q35, I must have SeaBIOS place the lowest 64-bit BAR strictly above end-of-RAM / end-of-DIMM-hotplug-area (i.e., strictly past the return value of pc_pci_hole64_start()). That's not easy, assuming test suite constraints:

- With a small guest RAM size (and no DIMM hotplug area), pc_pci_hole64_start() returns 4GB. Unfortunately, 4GB is well aligned for 256MB BARs, so that's where SeaBIOS will place the lowest such BAR too. Therefore the patch makes no difference.

- If I try to mis-align pc_pci_hole64_start() for the 256MB BAR size by adding a DIMM hotplug area, it's not helpful, because the end of the DIMM area ("etc/reserved-memory-end") is rounded up to 1GB alignment automatically. Similarly, pc_pci_hole64_start() is rounded up to 1GB alignment automatically. Thus, the BARs remain aligned.

- If I try to mis-align pc_pci_hole64_start() for the 256MB BAR size by using more guest RAM (e.g. 2049 MB -- the low RAM split is at 2GB on Q35, so RAM would end at 4GB+1MB), then it's the same as the previous bullet: pc_pci_hole64_start() is rounded up to 1GB alignment, and the BARs remain aligned.

- If I try to mis-align pc_pci_hole64_start() (again, 4GB, with small guest RAM) for the ivshmem BAR size by changing the BAR size, then I must pick an 8GB BAR size at least. And then I can't use the 256MB "fragmentation".

- I guess I could ensure the mis-alignment by using small guest RAM (128MB), a single DIMM hotplug slot so that reserved-memory-end is rounded up to 5GB (through the 1GB alignment), and a single ivshmem device with a 2GB BAR (mis-aligned with the reserved-memory-end at 5GB). Is this the best compromise?

(The last option is basically an optimization of my previous reproducer, namely 5GB guest RAM and a 4GB BAR. The 5GB guest RAM made sure that RAM would end at an *odd* GB number -- we can optimize that by using an empty DIMM hotplug area rather than guest RAM. And the 4GB BAR can be optimized to the smallest *even* (positive) number of GBs, namely 2GB. But we can't do anything about the pervasive GB-alignment.)

Thanks
Laszlo
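A sketch of that last configuration; the maxmem value below is an assumption on my part and would need tuning so that the DIMM hotplug area ends at, and "etc/reserved-memory-end" rounds up to, 5GB:

  truncate -s 2G /tmp/ivshmem.img
  qemu-system-x86_64 -machine q35 \
    -m 128M,slots=1,maxmem=1G \
    -object memory-backend-file,id=mem-obj,size=2G,mem-path=/tmp/ivshmem.img,share=on,prealloc=off \
    -device ivshmem-plain,memdev=mem-obj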
Hi,

> > Maybe using a memdev file backend with a manually created sparse file
> > might actually work (with preallocation disabled).
>
> Thanks, this sounds like a good idea.
>
> I see shm_open() is used heavily in ivshmem-related tests. I haven't
> looked much at shm_open() before. (I've always known it existed in
> POSIX, but I've never cared.)

How about improving the lovely pci-testdev we have a bit? Then we can have huge pci bars without needing backing storage for them ...

HTH,
Gerd

=========================== cut here ===========================

From 05cfced149b0b5c953391666c3151034bc7fe88b Mon Sep 17 00:00:00 2001
From: Gerd Hoffmann <kraxel@redhat.com>
Date: Thu, 27 Sep 2018 07:43:10 +0200
Subject: [PATCH] pci-testdev: add optional memory bar

Add memory bar to pci-testdev.  Size is configurable using the membar
property.  Setting the size to zero (default) turns it off.

Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
---
 hw/misc/pci-testdev.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/hw/misc/pci-testdev.c b/hw/misc/pci-testdev.c
index 32041f535f..af4d678ee4 100644
--- a/hw/misc/pci-testdev.c
+++ b/hw/misc/pci-testdev.c
@@ -85,6 +85,9 @@ typedef struct PCITestDevState {
     MemoryRegion portio;
     IOTest *tests;
     int current;
+
+    size_t membar_size;
+    MemoryRegion membar;
 } PCITestDevState;
 
 #define TYPE_PCI_TEST_DEV "pci-testdev"
@@ -253,6 +256,15 @@ static void pci_testdev_realize(PCIDevice *pci_dev, Error **errp)
     pci_register_bar(pci_dev, 0, PCI_BASE_ADDRESS_SPACE_MEMORY, &d->mmio);
     pci_register_bar(pci_dev, 1, PCI_BASE_ADDRESS_SPACE_IO, &d->portio);
 
+    if (d->membar_size) {
+        memory_region_init(&d->membar, OBJECT(d), "membar", d->membar_size);
+        pci_register_bar(pci_dev, 2,
+                         PCI_BASE_ADDRESS_SPACE_MEMORY |
+                         PCI_BASE_ADDRESS_MEM_PREFETCH |
+                         PCI_BASE_ADDRESS_MEM_TYPE_64,
+                         &d->membar);
+    }
+
     d->current = -1;
     d->tests = g_malloc0(IOTEST_MAX * sizeof *d->tests);
     for (i = 0; i < IOTEST_MAX; ++i) {
@@ -305,6 +317,11 @@ static void qdev_pci_testdev_reset(DeviceState *dev)
     pci_testdev_reset(d);
 }
 
+static Property pci_testdev_properties[] = {
+    DEFINE_PROP_SIZE("membar", PCITestDevState, membar_size, 0),
+    DEFINE_PROP_END_OF_LIST(),
+};
+
 static void pci_testdev_class_init(ObjectClass *klass, void *data)
 {
     DeviceClass *dc = DEVICE_CLASS(klass);
@@ -319,6 +336,7 @@ static void pci_testdev_class_init(ObjectClass *klass, void *data)
     dc->desc = "PCI Test Device";
     set_bit(DEVICE_CATEGORY_MISC, dc->categories);
     dc->reset = qdev_pci_testdev_reset;
+    dc->props = pci_testdev_properties;
 }
 
 static const TypeInfo pci_testdev_info = {
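With a patch like the above applied, a huge prefetchable 64-bit BAR needs no host backing at all; something along these lines (the 4G value is illustrative) could then stand in for the multi-gig ivshmem device in the reproducer:

  qemu-system-x86_64 -machine q35 -m 5G -device pci-testdev,membar=4G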
On Thu, 27 Sep 2018 00:31:07 +0200 Laszlo Ersek <lersek@redhat.com> wrote: > On 09/26/18 18:26, Alex Williamson wrote: > > On Wed, 26 Sep 2018 13:12:47 +0200 > > Laszlo Ersek <lersek@redhat.com> wrote: > > > >> On 09/25/18 22:36, Alex Williamson wrote: > >>> On Tue, 25 Sep 2018 00:13:46 +0200 > >>> Laszlo Ersek <lersek@redhat.com> wrote: > >>> > >>>> In commit 9fa99d2519cb ("hw/pci-host: Fix x86 Host Bridges 64bit PCI > >>>> hole", 2017-11-16), we meant to expose such a 64-bit PCI MMIO aperture in > >>>> the ACPI DSDT that would be at least as large as the new "pci-hole64-size" > >>>> property (2GB on i440fx, 32GB on q35). The goal was to offer "enough" > >>>> 64-bit MMIO aperture to the guest OS for hotplug purposes. > >>>> > >>>> In that commit, we added or modified five functions: > >>>> > >>>> - pc_pci_hole64_start(): shared between i440fx and q35. Provides a default > >>>> 64-bit base, which starts beyond the cold-plugged 64-bit RAM, and skips > >>>> the DIMM hotplug area too (if any). > >>>> > >>>> - i440fx_pcihost_get_pci_hole64_start(), q35_host_get_pci_hole64_start(): > >>>> board-specific 64-bit base property getters called abstractly by the > >>>> ACPI generator. Both of these fall back to pc_pci_hole64_start() if the > >>>> firmware didn't program any 64-bit hole (i.e. if the firmware didn't > >>>> assign a 64-bit GPA to any MMIO BAR on any device). Otherwise, they > >>>> honor the firmware's BAR assignments (i.e., they treat the lowest 64-bit > >>>> GPA programmed by the firmware as the base address for the aperture). > >>>> > >>>> - i440fx_pcihost_get_pci_hole64_end(), q35_host_get_pci_hole64_end(): > >>>> these intended to extend the aperture to our size recommendation, > >>>> calculated relative to the base of the aperture. > >>>> > >>>> Despite the original intent, i440fx_pcihost_get_pci_hole64_end() and > >>>> q35_host_get_pci_hole64_end() currently only extend the aperture relative > >>>> to the default base (pc_pci_hole64_start()), ignoring any programming done > >>>> by the firmware. This means that our size recommendation may not be met. > >>>> Fix it by honoring the firmware's address assignments. > >>>> > >>>> The strange extension sizes were spotted by Alex, in the log of a guest > >>>> kernel running on top of OVMF (which does assign 64-bit GPAs to BARs). > >>>> > >>>> This change only affects DSDT generation, therefore no new compat property > >>>> is being introduced. Also, because SeaBIOS never assigns 64-bit GPAs to > >>>> 64-bit BARs, the patch makes no difference to SeaBIOS guests. > >>> > >>> This is not exactly true, SeaBIOS will make use of 64-bit MMIO, but > >>> only if it cannot satisfy all the BARs from 32-bit MMMIO, see > >>> src/fw/pciinit.c:pci_bios_map_devices. Create a VM with several > >>> assigned GPUs and you'll eventually cross that threshold and all 64-bit > >>> BARs will be moved above 4G. I'm sure a few sufficiently sized ivshmem > >>> devices could do the same. Thanks, > >> > >> The effect of this patch is not hard to demonstrate with SeaBIOS+Q35, > >> when using e.g. 5GB of guest RAM and a 4GB ivshmem-plain device. > >> > >> However, using SeaBIOS+i440fx, I can't show the difference. I've been > >> experimenting with various ivshmem devices (even multiple at the same > >> time, with different sizes). 
The "all or nothing" nature of SeaBIOS's > >> high allocation of the 64-bit BARs, combined with hugepage alignment > >> inside SeaBIOS, combined with the small (2GB) rounding size used in QEMU > >> for i440fx, seem to make it surprisingly difficult to trigger the issue. > >> > >> I figure I should: > >> > >> (1) remove the sentence "the patch makes no difference to SeaBIOS > >> guests" from the commit message, > >> > >> (2) include the DSDT diff on SeaBIOS/q35 in the commit message, > >> > >> (3) remain silent on SeaBIOS/i440fx, in the commit message, > >> > >> (4) append a new patch, for "bios-tables-test", so that the ACPI gen > >> change is validated as part of the test suite, on SeaBIOS/q35. > >> > >> Regarding (4): > >> > >> - is it OK if I add the test only for Q35? > >> > >> - what guest RAM size am I allowed to use in the test suite? In my own > >> SeaBIOS/Q35 reproducer I currently use 5GB, but I'm not sure if it's > >> acceptable for the test suite. > > > > Seems like you've done due diligence, the plan looks ok to me. > > Regarding the test memory allocation, is it possible and reasonable to > > perhaps create a 256MB shared memory area and re-use it for multiple > > ivshmem devices? ie. rather than 1, 4GB ivshmem device, use 16, 256MB > > devices, all with the same backing. Thanks, > > In order to show that the patch makes a difference, on SeaBIOS+Q35, I > must have SeaBIOS place the lowest 64-bit BAR strictly above end-of-RAM > / end-of-DIMM-hotplug-area (i.e., strictly past the return value of > pc_pci_hole64_start()). That's not easy, assuming test suite constraints: > > - With a small guest RAM size (and no DIMM hotplug area), > pc_pci_hole64_start() returns 4GB. Unfortunately, 4GB is well aligned > for 256MB BARs, so that's where SeaBIOS will place the lowest such BAR > too. Therefore the patch makes no difference. > > - If I try to mis-align pc_pci_hole64_start() for the 256MB BAR size, by > adding a DIMM hotplug area, it's not helpful. Because the end of the > DIMM area ("etc/reserved-memory-end") is rounded up to 1GB alignment > automatically. Similarly, pc_pci_hole64_start() is rounded up to 1GB > alignment automatically. Thus, the BARs remain aligned. > > - If I try to mis-align pc_pci_hole64_start() for the 256MB BAR size by > using more guest RAM (e.g. 2049 MB --> the low RAM split is at 2GB on > Q35, so RAM would end at 4GB+1MB), then it's the same as the previous > bullet, because pc_pci_hole64_start() is rounded up to 1GB alignment, > and the BARs remain aligned. > > - If I try to mis-align pc_pci_hole64_start() (again, 4GB, with small > guest RAM) for the ivshmem BAR size by changing the BAR size, then I > must pick a 8GB BAR size at least. And then I can't use the 256MB > "fragmentation". > > - I guess I could ascertain the mis-alignment by using small guest RAM > (128MB), a single DIMM hotplug slot so that reserved-memory-end is > rounded up to 5GB (through the 1GB alignment), and a single ivshmem > device with a 2GB BAR (mis-aligned with the reserved-memory-end at 5GB). > Is this the best compromise? > > (The last option is basically an optimization of my previous reproducer, > namely 5GB guest RAM and 4GB BAR. The 5GB guest RAM made sure that RAM > would end at an *odd* GB number -- we can optimize that by using an > empty DIMM hotplug area rather than guest RAM. And the 4GB BAR can be > optimized to the smallest *even* (positive) number of GBs, namely 2GB. that probably would do the job > But we can't do anything about the pervasive GB-alignment.) 
I haven't read the code to be sure, but at first glance, thanks to the earlier brokenness, it might be possible to mis-align the pci_hole64 start if one used a combination of broken_reserved_end=true && enforce_aligned_dimm=false.

That's assuming that an old machine type that has this combo is otherwise still suitable for the test case.

>
> Thanks
> Laszlo
On 09/27/18 09:03, Igor Mammedov wrote: > On Thu, 27 Sep 2018 00:31:07 +0200 > Laszlo Ersek <lersek@redhat.com> wrote: > >> On 09/26/18 18:26, Alex Williamson wrote: >>> On Wed, 26 Sep 2018 13:12:47 +0200 >>> Laszlo Ersek <lersek@redhat.com> wrote: >>> >>>> On 09/25/18 22:36, Alex Williamson wrote: >>>>> On Tue, 25 Sep 2018 00:13:46 +0200 >>>>> Laszlo Ersek <lersek@redhat.com> wrote: >>>>> >>>>>> In commit 9fa99d2519cb ("hw/pci-host: Fix x86 Host Bridges 64bit PCI >>>>>> hole", 2017-11-16), we meant to expose such a 64-bit PCI MMIO aperture in >>>>>> the ACPI DSDT that would be at least as large as the new "pci-hole64-size" >>>>>> property (2GB on i440fx, 32GB on q35). The goal was to offer "enough" >>>>>> 64-bit MMIO aperture to the guest OS for hotplug purposes. >>>>>> >>>>>> In that commit, we added or modified five functions: >>>>>> >>>>>> - pc_pci_hole64_start(): shared between i440fx and q35. Provides a default >>>>>> 64-bit base, which starts beyond the cold-plugged 64-bit RAM, and skips >>>>>> the DIMM hotplug area too (if any). >>>>>> >>>>>> - i440fx_pcihost_get_pci_hole64_start(), q35_host_get_pci_hole64_start(): >>>>>> board-specific 64-bit base property getters called abstractly by the >>>>>> ACPI generator. Both of these fall back to pc_pci_hole64_start() if the >>>>>> firmware didn't program any 64-bit hole (i.e. if the firmware didn't >>>>>> assign a 64-bit GPA to any MMIO BAR on any device). Otherwise, they >>>>>> honor the firmware's BAR assignments (i.e., they treat the lowest 64-bit >>>>>> GPA programmed by the firmware as the base address for the aperture). >>>>>> >>>>>> - i440fx_pcihost_get_pci_hole64_end(), q35_host_get_pci_hole64_end(): >>>>>> these intended to extend the aperture to our size recommendation, >>>>>> calculated relative to the base of the aperture. >>>>>> >>>>>> Despite the original intent, i440fx_pcihost_get_pci_hole64_end() and >>>>>> q35_host_get_pci_hole64_end() currently only extend the aperture relative >>>>>> to the default base (pc_pci_hole64_start()), ignoring any programming done >>>>>> by the firmware. This means that our size recommendation may not be met. >>>>>> Fix it by honoring the firmware's address assignments. >>>>>> >>>>>> The strange extension sizes were spotted by Alex, in the log of a guest >>>>>> kernel running on top of OVMF (which does assign 64-bit GPAs to BARs). >>>>>> >>>>>> This change only affects DSDT generation, therefore no new compat property >>>>>> is being introduced. Also, because SeaBIOS never assigns 64-bit GPAs to >>>>>> 64-bit BARs, the patch makes no difference to SeaBIOS guests. >>>>> >>>>> This is not exactly true, SeaBIOS will make use of 64-bit MMIO, but >>>>> only if it cannot satisfy all the BARs from 32-bit MMMIO, see >>>>> src/fw/pciinit.c:pci_bios_map_devices. Create a VM with several >>>>> assigned GPUs and you'll eventually cross that threshold and all 64-bit >>>>> BARs will be moved above 4G. I'm sure a few sufficiently sized ivshmem >>>>> devices could do the same. Thanks, >>>> >>>> The effect of this patch is not hard to demonstrate with SeaBIOS+Q35, >>>> when using e.g. 5GB of guest RAM and a 4GB ivshmem-plain device. >>>> >>>> However, using SeaBIOS+i440fx, I can't show the difference. I've been >>>> experimenting with various ivshmem devices (even multiple at the same >>>> time, with different sizes). 
The "all or nothing" nature of SeaBIOS's >>>> high allocation of the 64-bit BARs, combined with hugepage alignment >>>> inside SeaBIOS, combined with the small (2GB) rounding size used in QEMU >>>> for i440fx, seem to make it surprisingly difficult to trigger the issue. >>>> >>>> I figure I should: >>>> >>>> (1) remove the sentence "the patch makes no difference to SeaBIOS >>>> guests" from the commit message, >>>> >>>> (2) include the DSDT diff on SeaBIOS/q35 in the commit message, >>>> >>>> (3) remain silent on SeaBIOS/i440fx, in the commit message, >>>> >>>> (4) append a new patch, for "bios-tables-test", so that the ACPI gen >>>> change is validated as part of the test suite, on SeaBIOS/q35. >>>> >>>> Regarding (4): >>>> >>>> - is it OK if I add the test only for Q35? >>>> >>>> - what guest RAM size am I allowed to use in the test suite? In my own >>>> SeaBIOS/Q35 reproducer I currently use 5GB, but I'm not sure if it's >>>> acceptable for the test suite. >>> >>> Seems like you've done due diligence, the plan looks ok to me. >>> Regarding the test memory allocation, is it possible and reasonable to >>> perhaps create a 256MB shared memory area and re-use it for multiple >>> ivshmem devices? ie. rather than 1, 4GB ivshmem device, use 16, 256MB >>> devices, all with the same backing. Thanks, >> >> In order to show that the patch makes a difference, on SeaBIOS+Q35, I >> must have SeaBIOS place the lowest 64-bit BAR strictly above end-of-RAM >> / end-of-DIMM-hotplug-area (i.e., strictly past the return value of >> pc_pci_hole64_start()). That's not easy, assuming test suite constraints: >> >> - With a small guest RAM size (and no DIMM hotplug area), >> pc_pci_hole64_start() returns 4GB. Unfortunately, 4GB is well aligned >> for 256MB BARs, so that's where SeaBIOS will place the lowest such BAR >> too. Therefore the patch makes no difference. >> >> - If I try to mis-align pc_pci_hole64_start() for the 256MB BAR size, by >> adding a DIMM hotplug area, it's not helpful. Because the end of the >> DIMM area ("etc/reserved-memory-end") is rounded up to 1GB alignment >> automatically. Similarly, pc_pci_hole64_start() is rounded up to 1GB >> alignment automatically. Thus, the BARs remain aligned. >> >> - If I try to mis-align pc_pci_hole64_start() for the 256MB BAR size by >> using more guest RAM (e.g. 2049 MB --> the low RAM split is at 2GB on >> Q35, so RAM would end at 4GB+1MB), then it's the same as the previous >> bullet, because pc_pci_hole64_start() is rounded up to 1GB alignment, >> and the BARs remain aligned. >> >> - If I try to mis-align pc_pci_hole64_start() (again, 4GB, with small >> guest RAM) for the ivshmem BAR size by changing the BAR size, then I >> must pick a 8GB BAR size at least. And then I can't use the 256MB >> "fragmentation". >> >> - I guess I could ascertain the mis-alignment by using small guest RAM >> (128MB), a single DIMM hotplug slot so that reserved-memory-end is >> rounded up to 5GB (through the 1GB alignment), and a single ivshmem >> device with a 2GB BAR (mis-aligned with the reserved-memory-end at 5GB). >> Is this the best compromise? >> >> (The last option is basically an optimization of my previous reproducer, >> namely 5GB guest RAM and 4GB BAR. The 5GB guest RAM made sure that RAM >> would end at an *odd* GB number -- we can optimize that by using an >> empty DIMM hotplug area rather than guest RAM. And the 4GB BAR can be >> optimized to the smallest *even* (positive) number of GBs, namely 2GB. > that probably would do the job OK, thanks. 
>> But we can't do anything about the pervasive GB-alignment.)
> I haven't read the code to be sure, but at first glance, thanks to the
> earlier brokenness, it might be possible to mis-align the pci_hole64
> start if one used the combination
> broken_reserved_end=true && enforce_aligned_dimm=false
>
> That's assuming that an old machine type that has this combo is still
> otherwise suitable for the testcase.

I thought of messing with manual compat property overrides (or old
machine types), but I'd prefer not to go there.

The code I'm patching already depends on a compat property
("x-pci-hole64-fix", for both "i440FX-pcihost" and "q35-pcihost"), and
that one is set to "off" up to and including PC_COMPAT_2_10. Meanwhile,
the latest machine types (i440fx and q35) where "broken_reserved_end" is
true are the 2.4 ones. Which leaves me with an empty set.

I'll start work on the "best compromise" method.

Thanks!
Laszlo
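For concreteness, the "best compromise" reproducer sketched above would look
roughly like this on the QEMU command line (a sketch only: the backing file
path, slot count, and maxmem value are illustrative, and prealloc=off keeps
the sparse backing file nearly free on the host):

  truncate -s 2G /tmp/hole64-backing
  qemu-system-x86_64 -machine q35 \
      -m size=128M,slots=1,maxmem=4G \
      -object memory-backend-file,id=mem-obj,size=2G,share=on,prealloc=off,mem-path=/tmp/hole64-backing \
      -device ivshmem-plain,memdev=mem-obj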
On 09/27/18 07:48, Gerd Hoffmann wrote:
> Hi,
>
>>> Maybe using memdev file backend with manually created sparse file
>>> might actually work (with preallocate disabled)
>>
>> Thanks, this sounds like a good idea.
>>
>> I see shm_open() is used heavily in ivshmem-related tests. I haven't
>> looked much at shm_open() before. (I've always known it existed in
>> POSIX, but I've never cared.)
>
> How about improving the lovely pci-testdev we have a bit? Then we can
> have huge pci bars without needing backing storage for them ...

Earlier I checked

  qemu-system-x86_64 -device pci-testdev,\?

briefly. I didn't see anything relevant and figured it was out of scope.

(
Actually, I've thought of yet another thing: the resource reservation
capability for PCIe Root Ports. The 64-bit reservation field in that cap
structure should be perfectly suitable for this testing.

Except, of course, SeaBIOS only handles the bus number reservation
field, at this time. :/
)

Looking at "docs/specs/pci-testdev.txt" now, extending the PCI testdev
looks like a separate project to me. We're getting quite far from the
original goal. I totally agree that if the PCI testdev already had this
ability, it would be a perfect fit, but I don't think I can personally
take that on now.

Thanks
Laszlo

> =========================== cut here ===========================
> From 05cfced149b0b5c953391666c3151034bc7fe88b Mon Sep 17 00:00:00 2001
> From: Gerd Hoffmann <kraxel@redhat.com>
> Date: Thu, 27 Sep 2018 07:43:10 +0200
> Subject: [PATCH] pci-testdev: add optional memory bar
>
> Add memory bar to pci-testdev. Size is configurable using the membar
> property. Setting the size to zero (default) turns it off.
>
> Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
> ---
>  hw/misc/pci-testdev.c | 18 ++++++++++++++++++
>  1 file changed, 18 insertions(+)
>
> diff --git a/hw/misc/pci-testdev.c b/hw/misc/pci-testdev.c
> index 32041f535f..af4d678ee4 100644
> --- a/hw/misc/pci-testdev.c
> +++ b/hw/misc/pci-testdev.c
> @@ -85,6 +85,9 @@ typedef struct PCITestDevState {
>      MemoryRegion portio;
>      IOTest *tests;
>      int current;
> +
> +    size_t membar_size;
> +    MemoryRegion membar;
>  } PCITestDevState;
>
>  #define TYPE_PCI_TEST_DEV "pci-testdev"
> @@ -253,6 +256,15 @@ static void pci_testdev_realize(PCIDevice *pci_dev, Error **errp)
>      pci_register_bar(pci_dev, 0, PCI_BASE_ADDRESS_SPACE_MEMORY, &d->mmio);
>      pci_register_bar(pci_dev, 1, PCI_BASE_ADDRESS_SPACE_IO, &d->portio);
>
> +    if (d->membar_size) {
> +        memory_region_init(&d->membar, OBJECT(d), "membar", d->membar_size);
> +        pci_register_bar(pci_dev, 2,
> +                         PCI_BASE_ADDRESS_SPACE_MEMORY |
> +                         PCI_BASE_ADDRESS_MEM_PREFETCH |
> +                         PCI_BASE_ADDRESS_MEM_TYPE_64,
> +                         &d->membar);
> +    }
> +
>      d->current = -1;
>      d->tests = g_malloc0(IOTEST_MAX * sizeof *d->tests);
>      for (i = 0; i < IOTEST_MAX; ++i) {
> @@ -305,6 +317,11 @@ static void qdev_pci_testdev_reset(DeviceState *dev)
>      pci_testdev_reset(d);
>  }
>
> +static Property pci_testdev_properties[] = {
> +    DEFINE_PROP_SIZE("membar", PCITestDevState, membar_size, 0),
> +    DEFINE_PROP_END_OF_LIST(),
> +};
> +
>  static void pci_testdev_class_init(ObjectClass *klass, void *data)
>  {
>      DeviceClass *dc = DEVICE_CLASS(klass);
> @@ -319,6 +336,7 @@ static void pci_testdev_class_init(ObjectClass *klass, void *data)
>      dc->desc = "PCI Test Device";
>      set_bit(DEVICE_CATEGORY_MISC, dc->categories);
>      dc->reset = qdev_pci_testdev_reset;
> +    dc->props = pci_testdev_properties;
>  }
>
>  static const TypeInfo pci_testdev_info = {
>
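With Gerd's prototype above applied, a huge 64-bit prefetchable BAR would
need no host-side backing storage at all; usage would presumably be as
simple as the following (the "membar" property and its zero default come
from the patch itself; the 4G value is just an example):

  qemu-system-x86_64 -device pci-testdev,membar=4G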
On 09/27/18 11:21, Laszlo Ersek wrote:
> On 09/27/18 07:48, Gerd Hoffmann wrote:
>> Hi,
>>
>>>> Maybe using memdev file backend with manually created sparse file
>>>> might actually work (with preallocate disabled)
>>>
>>> Thanks, this sounds like a good idea.
>>>
>>> I see shm_open() is used heavily in ivshmem-related tests. I haven't
>>> looked much at shm_open() before. (I've always known it existed in
>>> POSIX, but I've never cared.)
>>
>> How about improving the lovely pci-testdev we have a bit? Then we can
>> have huge pci bars without needing backing storage for them ...
>
> Earlier I checked
>
>   qemu-system-x86_64 -device pci-testdev,\?
>
> briefly. I didn't see anything relevant and figured it was out of scope.
>
> (
> Actually, I've thought of yet another thing: the resource reservation
> capability for PCIe Root Ports. The 64-bit reservation field in that cap
> structure should be perfectly suitable for this testing.
>
> Except, of course, SeaBIOS only handles the bus number reservation
> field, at this time. :/
> )
>
> Looking at "docs/specs/pci-testdev.txt" now, extending the PCI testdev
> looks like a separate project to me. We're getting quite far from the
> original goal. I totally agree that if the PCI testdev already had this
> ability, it would be a perfect fit, but I don't think I can personally
> take that on now.

I realize you pasted a prototype patch below (thanks for that!), but it
doesn't update the documentation. I wouldn't like to dig down another
design rabbit hole here.

If you can pursue upstreaming this change, I can definitely delay my
series, and use the enhanced pci testdev as a reproducer.

But, I'm not explicitly asking you to pursue it -- none of this
classifies as a "serious bug", so these patches might not be the best
use of our time.

Laszlo

>> =========================== cut here ===========================
>> From 05cfced149b0b5c953391666c3151034bc7fe88b Mon Sep 17 00:00:00 2001
>> From: Gerd Hoffmann <kraxel@redhat.com>
>> Date: Thu, 27 Sep 2018 07:43:10 +0200
>> Subject: [PATCH] pci-testdev: add optional memory bar
>> [...]
Hi,

> I realize you pasted a prototype patch below (thanks for that!), but it
> doesn't update the documentation. I wouldn't like to dig down another
> design rabbit hole here.
>
> If you can pursue upstreaming this change, I can definitely delay my
> series, and use the enhanced pci testdev as a reproducer.

Patch sent.

> But, I'm not explicitly asking you to pursue it -- none of this
> classifies as a "serious bug", so these patches might not be the best
> use of our time.

Well, jumping through hoops to get the tests going with ivshmem +
memory overcommit isn't that great a use of our time either ...

cheers,
  Gerd
On 9/27/18 4:21 AM, Laszlo Ersek wrote:
> On 09/27/18 07:48, Gerd Hoffmann wrote:
>> Hi,
>>
>>>> Maybe using memdev file backend with manually created sparse file
>>>> might actually work (with preallocate disabled)
>>>
>>> Thanks, this sounds like a good idea.
>>>
>>> I see shm_open() is used heavily in ivshmem-related tests. I haven't
>>> looked much at shm_open() before. (I've always known it existed in
>>> POSIX, but I've never cared.)
>>
>> How about improving the lovely pci-testdev we have a bit? Then we can
>> have huge pci bars without needing backing storage for them ...
>
> Earlier I checked
>
>   qemu-system-x86_64 -device pci-testdev,\?

Side note: While that works, we are trying to shift the documentation
over to recommending

  -device pci-testdev,help

instead; then you don't need to remember to shell-quote the '?' to
prevent unintended globbing.
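To illustrate the quoting pitfall: both spellings query the same property
list, and only the shell treatment of '?' differs:

  qemu-system-x86_64 -device pci-testdev,\?    # '?' must be escaped from the shell
  qemu-system-x86_64 -device pci-testdev,help  # no shell metacharacter involved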
On 9/26/18 3:26 PM, Laszlo Ersek wrote:
> (+Eric)
>
> I see shm_open() is used heavily in ivshmem-related tests. I haven't
> looked much at shm_open() before. (I've always known it existed in
> POSIX, but I've never cared.)

I've never actually played with shm_open() myself, but I understand the
theory of it well enough to reply.

>
> So now I first checked what shm_open() would give me over a regular
> temporary file created with open(); after all, the file descriptor
> returned by either would have to be mmap()'d. From the rationale in POSIX:
>
> <http://pubs.opengroup.org/onlinepubs/9699919799/xrat/V4_xsh_chap02.html#tag_22_02_08_14>,
>
> it seems like the {shm_open(), mmap()} combo has two significant
> guarantees over {open(), mmap()}:
>
> - the namespace may be distinct (there need not be a writeable
>   filesystem at all),
>
> - the shared object will *always* be locked in RAM ("Shared memory is
>   not just simply providing common access to data, it is providing the
>   fastest possible communication between the processes").
>
> The rationale seems to permit, on purpose, an shm_open() implementation
> that is actually based on open(), using a special file system -- and
> AIUI, /dev/shm is just that, on Linux.
>
> Eric, does the above sound more or less correct?

You're right about it being permitted to be a distinct namespace; on the
other hand, it doesn't even have to be a special file system. An
implementation could even use a compile-time fixed directory name
visible to the rest of the system (although of course you shouldn't rely
on being able to use the file system to poke at the shmem objects, nor
should you manipulate the file system underneath the reserved directory
behind shmem's back, if that is what the implementation is using).

So I'm less certain whether you are guaranteed that the shared memory
has to be locked in place (where it can never be paged out), since an
implementation on top of the filesystem does not have to do such
locking, but you are also right that a high-quality implementation will
strive to keep the memory live rather than paging it out, precisely
because it is used for interprocess communication that would be
penalized if it could be paged out.

> If it is correct, then I think shm_open() is exactly what I *don't* want
> for this use case. Because, while I do need a pathname for an
> mmap()-able object (regular file, or otherwise), just so I can do:
>
>   -object memory-backend-file,id=mem-obj,...,mem-path=... \
>   -device ivshmem-plain,memdev=mem-obj,...
>
> , I want the underlying object to put as little pressure on the system
> that runs the test suite as possible.
>
> This means I should specifically ask for a regular file, to be mmap()'d
> (with MAP_SHARED). Then the kernel knows in advance that it can always
> page out the dirty stuff, and the mapping shouldn't clash with cgroups,
> or disabled memory overcommit.

Indeed, shmem CAN be a thin veneer on top of the file system, and
support being paged out; but since an implementation that pins the
memory such that it cannot page is permitted (and in fact maybe
desirable), you are right that using shmem can indeed put pressure on
different resources compared to what you can accomplish by using the
file system yourself.

> Now, in order to make that actually safe, I should in theory ask for
> preallocation on the filesystem (otherwise, if the filesystem runs out
> of space while the kernel is allocating fs extents in order to flush
> the dirty pages to them, the process gets a SIGBUS, IIRC).
> However,
> because I know that nothing will in fact be dirtied, I can minimize the
> footprint on the filesystem as well, and forego preallocation too.
>
> This suggests that, in my test case,
> - I call g_file_open_tmp() for creating the temporary file,
> - pass the returned fd to ftruncate() for resizing the temporary file,
> - pass the returned pathname to the "memory-backend-file" objects, in
>   the "mem-path" property,
> - set "share=on",
> - set "prealloc=off",
> - "discard-data" is irrelevant (there won't be any dirty pages).
>
> Thanks
> Laszlo
>
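Laszlo's plan above translates into only a few lines of test code. A minimal
sketch, assuming glib plus POSIX; the helper name create_sparse_backing() is
invented for illustration and is not from the actual test:

  /* Create a sparse temporary file usable as mem-path=... backing.  Since
   * the guest never dirties the ivshmem pages in this scenario, the file
   * should consume (nearly) no filesystem space. */
  #include <glib.h>
  #include <unistd.h>

  static gchar *create_sparse_backing(off_t size)
  {
      GError *err = NULL;
      gchar *path = NULL;
      int fd = g_file_open_tmp("hole64-test-XXXXXX", &path, &err);

      g_assert_no_error(err);
      /* Extend without writing: most likely a hole, so no extents yet. */
      g_assert_cmpint(ftruncate(fd, size), ==, 0);
      close(fd);
      return path; /* to be passed along with share=on,prealloc=off */
  }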
On 9/26/18 3:35 PM, Laszlo Ersek wrote:
> (+Eric)
>
> This too sounds useful. AIUI, ftruncate() is neither forbidden, nor
> required, to allocate filesystem extents when increasing the size of a
> file. Using one smaller regular temporary file as the common foundation
> for multiple "memory-backend-file" objects will save space on the fs if
> ftruncate() happens to allocate extents.

Correct. Most likely, ftruncate() extends by creating a hole if the file
system supports sparse files, but POSIX has no way to guarantee holes.
(It merely states that ftruncate() and write() beyond EOF are the two
POSIX interfaces that may create holes, but that it is up to the
implementation whether holes actually get created; fstat() and
lseek(SEEK_HOLE) can then tell you whether holes were actually created,
the former when st_blocks < st_size.) posix_fallocate() exists to force
allocation, and Linux fallocate() offers even more fine-tuning when you
need specific hole or non-hole behaviors.
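Eric's st_blocks observation is easy to see in a self-contained program (the
filename and the 1 GiB size are arbitrary; error handling elided for
brevity):

  /* Check whether ftruncate() created a hole, then force allocation. */
  #include <fcntl.h>
  #include <stdio.h>
  #include <sys/stat.h>
  #include <unistd.h>

  int main(void)
  {
      struct stat st;
      int fd = open("sparse-demo", O_RDWR | O_CREAT | O_TRUNC, 0600);

      ftruncate(fd, 1 << 30);  /* extend to 1 GiB; a hole on most filesystems */
      fstat(fd, &st);
      /* st_blocks counts 512-byte units; fewer than st_size/512 means holes. */
      printf("sparse after ftruncate: %s\n",
             (long long)st.st_blocks * 512 < st.st_size ? "yes" : "no");

      posix_fallocate(fd, 0, 1 << 30);  /* force allocation of real extents */
      fstat(fd, &st);
      printf("blocks after posix_fallocate: %lld\n", (long long)st.st_blocks);

      close(fd);
      unlink("sparse-demo");
      return 0;
  }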
On 09/27/18 17:24, Eric Blake wrote:
> On 9/26/18 3:35 PM, Laszlo Ersek wrote:
>> (+Eric)
>
>> This too sounds useful. AIUI, ftruncate() is neither forbidden, nor
>> required, to allocate filesystem extents when increasing the size of a
>> file. Using one smaller regular temporary file as the common foundation
>> for multiple "memory-backend-file" objects will save space on the fs if
>> ftruncate() happens to allocate extents.
>
> Correct. Most likely, ftruncate() extends by creating a hole if the file
> system supports sparse files, but POSIX has no way to guarantee holes.
> (It merely states that ftruncate() and write() beyond EOF are the two
> POSIX interfaces that may create holes, but that it is up to the
> implementation whether holes actually get created; fstat() and
> lseek(SEEK_HOLE) can then tell you whether holes were actually created,
> the former when st_blocks < st_size.) posix_fallocate() exists to force
> allocation, and Linux fallocate() offers even more fine-tuning when you
> need specific hole or non-hole behaviors.
>

Thank you for the answers!
Laszlo
On 09/27/18 09:03, Igor Mammedov wrote:
> On Thu, 27 Sep 2018 00:31:07 +0200
> Laszlo Ersek <lersek@redhat.com> wrote:
>> - I guess I could ascertain the mis-alignment by using small guest RAM
>>   (128MB), a single DIMM hotplug slot so that reserved-memory-end is
>>   rounded up to 5GB (through the 1GB alignment), and a single ivshmem
>>   device with a 2GB BAR (mis-aligned with the reserved-memory-end at 5GB).
>>   Is this the best compromise?
>>
>> (The last option is basically an optimization of my previous reproducer,
>> namely 5GB guest RAM and 4GB BAR. The 5GB guest RAM made sure that RAM
>> would end at an *odd* GB number -- we can optimize that by using an
>> empty DIMM hotplug area rather than guest RAM. And the 4GB BAR can be
>> optimized to the smallest *even* (positive) number of GBs, namely 2GB.
> that probably would do the job

There's some black magic going on in the sizing of the reserved memory
area, due to at least commit 085f8e88ba73 ("pc: count in 1Gb hugepage
alignment when sizing hotplug-memory container", 2014-11-24).

At this point I'm basically trying to find numbers that (a) don't take
up many resources, and (b) reproduce the issue. :)

Thanks
Laszlo
diff --git a/hw/pci-host/piix.c b/hw/pci-host/piix.c
index 0df91e002076..fb4b0669ac9f 100644
--- a/hw/pci-host/piix.c
+++ b/hw/pci-host/piix.c
@@ -285,7 +285,7 @@ static void i440fx_pcihost_get_pci_hole64_end(Object *obj, Visitor *v,
 {
     PCIHostState *h = PCI_HOST_BRIDGE(obj);
     I440FXState *s = I440FX_PCI_HOST_BRIDGE(obj);
-    uint64_t hole64_start = pc_pci_hole64_start();
+    uint64_t hole64_start = i440fx_pcihost_get_pci_hole64_start_value(obj);
     Range w64;
     uint64_t value, hole64_end;

diff --git a/hw/pci-host/q35.c b/hw/pci-host/q35.c
index 8acf942b5e65..1c5433161f14 100644
--- a/hw/pci-host/q35.c
+++ b/hw/pci-host/q35.c
@@ -145,7 +145,7 @@ static void q35_host_get_pci_hole64_end(Object *obj, Visitor *v,
 {
     PCIHostState *h = PCI_HOST_BRIDGE(obj);
     Q35PCIHost *s = Q35_HOST_DEVICE(obj);
-    uint64_t hole64_start = pc_pci_hole64_start();
+    uint64_t hole64_start = q35_host_get_pci_hole64_start_value(obj);
     Range w64;
     uint64_t value, hole64_end;
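The one-line change per board has a simple meaning: the recommended 64-bit
hole end is now computed from the same (possibly firmware-programmed) base
that the start getters report. The following arithmetic sketch reconstructs
that logic using the i440fx example from the commit message; it is a
simplification, not the QEMU source (the real getters operate on QOM
properties and the bus's 64-bit window):

  /* Sketch: extend the hole by pci-hole64-size relative to the
   * firmware-honoring base; never report less than what firmware used. */
  #include <inttypes.h>
  #include <stdint.h>
  #include <stdio.h>

  #define ROUND_UP(n, d) (((n) + (d) - 1) & ~((d) - 1)) /* d: power of two */

  int main(void)
  {
      uint64_t fw_base     = 32ULL << 30; /* lowest 64-bit BAR GPA (OVMF)   */
      uint64_t hole64_size =  2ULL << 30; /* i440fx default pci-hole64-size */
      uint64_t fw_end      = fw_base + (64ULL << 20); /* example BAR span   */

      /* after the patch: extend relative to fw_base, then align to 1 GiB */
      uint64_t hole64_end = ROUND_UP(fw_base + hole64_size, 1ULL << 30);

      if (hole64_end < fw_end) {
          hole64_end = fw_end; /* keep everything the firmware assigned */
      }
      printf("aperture end: 0x%" PRIx64 "\n", hole64_end); /* 0x880000000 */
      return 0;
  }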