Message ID | 20230503002701.854329-1-leobras@redhat.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [v1,1/1] hw/pci: Disable PCI_ERR_UNCOR_MASK register for machine type < 8.0 | expand |
On Tue, 2 May 2023 21:27:02 -0300 Leonardo Bras <leobras@redhat.com> wrote: > Since it's implementation on v8.0.0-rc0, having the PCI_ERR_UNCOR_MASK > set for machine types < 8.0 will cause migration to fail if the target > QEMU version is < 8.0.0 : > > qemu-system-x86_64: get_pci_config_device: Bad config data: i=0x10a read: 40 device: 0 cmask: ff wmask: 0 w1cmask:0 > qemu-system-x86_64: Failed to load PCIDevice:config > qemu-system-x86_64: Failed to load e1000e:parent_obj > qemu-system-x86_64: error while loading state for instance 0x0 of device '0000:00:02.0/e1000e' > qemu-system-x86_64: load of migration failed: Invalid argument > > The above test migrated a 7.2 machine type from QEMU master to QEMU 7.2.0, > with this cmdline: > > ./qemu-system-x86_64 -M pc-q35-7.2 [-incoming XXX] > > In order to fix this, property x-pcie-err-unc-mask was introduced to > control when PCI_ERR_UNCOR_MASK is enabled. This property is enabled by > default, but is disabled if machine type <= 7.2. > > Fixes: 010746ae1d ("hw/pci/aer: Implement PCI_ERR_UNCOR_MASK register") > Suggested-by: Michael S. Tsirkin <mst@redhat.com> > Signed-off-by: Leonardo Bras <leobras@redhat.com> Thanks Leo, you are a star. LGTM Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> > --- > include/hw/pci/pci.h | 2 ++ > hw/core/machine.c | 1 + > hw/pci/pci.c | 2 ++ > hw/pci/pcie_aer.c | 11 +++++++---- > 4 files changed, 12 insertions(+), 4 deletions(-) > > diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h > index 935b4b91b4..e6d0574a29 100644 > --- a/include/hw/pci/pci.h > +++ b/include/hw/pci/pci.h > @@ -207,6 +207,8 @@ enum { > QEMU_PCIE_EXTCAP_INIT = (1 << QEMU_PCIE_EXTCAP_INIT_BITNR), > #define QEMU_PCIE_CXL_BITNR 10 > QEMU_PCIE_CAP_CXL = (1 << QEMU_PCIE_CXL_BITNR), > +#define QEMU_PCIE_ERR_UNC_MASK_BITNR 11 > + QEMU_PCIE_ERR_UNC_MASK = (1 << QEMU_PCIE_ERR_UNC_MASK_BITNR), > }; > > typedef struct PCIINTxRoute { > diff --git a/hw/core/machine.c b/hw/core/machine.c > index 47a34841a5..07f763eb2e 100644 > --- a/hw/core/machine.c > +++ b/hw/core/machine.c > @@ -48,6 +48,7 @@ GlobalProperty hw_compat_7_2[] = { > { "e1000e", "migrate-timadj", "off" }, > { "virtio-mem", "x-early-migration", "false" }, > { "migration", "x-preempt-pre-7-2", "true" }, > + { TYPE_PCI_DEVICE, "x-pcie-err-unc-mask", "off" }, > }; > const size_t hw_compat_7_2_len = G_N_ELEMENTS(hw_compat_7_2); > > diff --git a/hw/pci/pci.c b/hw/pci/pci.c > index 8a87ccc8b0..5153ad63d6 100644 > --- a/hw/pci/pci.c > +++ b/hw/pci/pci.c > @@ -79,6 +79,8 @@ static Property pci_props[] = { > DEFINE_PROP_STRING("failover_pair_id", PCIDevice, > failover_pair_id), > DEFINE_PROP_UINT32("acpi-index", PCIDevice, acpi_index, 0), > + DEFINE_PROP_BIT("x-pcie-err-unc-mask", PCIDevice, cap_present, > + QEMU_PCIE_ERR_UNC_MASK_BITNR, true), > DEFINE_PROP_END_OF_LIST() > }; > > diff --git a/hw/pci/pcie_aer.c b/hw/pci/pcie_aer.c > index 103667c368..374d593ead 100644 > --- a/hw/pci/pcie_aer.c > +++ b/hw/pci/pcie_aer.c > @@ -112,10 +112,13 @@ int pcie_aer_init(PCIDevice *dev, uint8_t cap_ver, uint16_t offset, > > pci_set_long(dev->w1cmask + offset + PCI_ERR_UNCOR_STATUS, > PCI_ERR_UNC_SUPPORTED); > - pci_set_long(dev->config + offset + PCI_ERR_UNCOR_MASK, > - PCI_ERR_UNC_MASK_DEFAULT); > - pci_set_long(dev->wmask + offset + PCI_ERR_UNCOR_MASK, > - PCI_ERR_UNC_SUPPORTED); > + > + if (dev->cap_present & QEMU_PCIE_ERR_UNC_MASK) { > + pci_set_long(dev->config + offset + PCI_ERR_UNCOR_MASK, > + PCI_ERR_UNC_MASK_DEFAULT); > + pci_set_long(dev->wmask + offset + PCI_ERR_UNCOR_MASK, > + PCI_ERR_UNC_SUPPORTED); > + } > > pci_set_long(dev->config + offset + PCI_ERR_UNCOR_SEVER, > PCI_ERR_UNC_SEVERITY_DEFAULT);
On Tue, May 02, 2023 at 09:27:02PM -0300, Leonardo Bras wrote: > Since it's implementation on v8.0.0-rc0, having the PCI_ERR_UNCOR_MASK > set for machine types < 8.0 will cause migration to fail if the target > QEMU version is < 8.0.0 : > > qemu-system-x86_64: get_pci_config_device: Bad config data: i=0x10a read: 40 device: 0 cmask: ff wmask: 0 w1cmask:0 > qemu-system-x86_64: Failed to load PCIDevice:config > qemu-system-x86_64: Failed to load e1000e:parent_obj > qemu-system-x86_64: error while loading state for instance 0x0 of device '0000:00:02.0/e1000e' > qemu-system-x86_64: load of migration failed: Invalid argument > > The above test migrated a 7.2 machine type from QEMU master to QEMU 7.2.0, > with this cmdline: > > ./qemu-system-x86_64 -M pc-q35-7.2 [-incoming XXX] > > In order to fix this, property x-pcie-err-unc-mask was introduced to > control when PCI_ERR_UNCOR_MASK is enabled. This property is enabled by > default, but is disabled if machine type <= 7.2. > > Fixes: 010746ae1d ("hw/pci/aer: Implement PCI_ERR_UNCOR_MASK register") > Suggested-by: Michael S. Tsirkin <mst@redhat.com> > Signed-off-by: Leonardo Bras <leobras@redhat.com> Reviewed-by: Peter Xu <peterx@redhat.com>
On Wed, May 3, 2023 at 6:49 AM Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote: > > On Tue, 2 May 2023 21:27:02 -0300 > Leonardo Bras <leobras@redhat.com> wrote: > > > Since it's implementation on v8.0.0-rc0, having the PCI_ERR_UNCOR_MASK > > set for machine types < 8.0 will cause migration to fail if the target > > QEMU version is < 8.0.0 : > > > > qemu-system-x86_64: get_pci_config_device: Bad config data: i=0x10a read: 40 device: 0 cmask: ff wmask: 0 w1cmask:0 > > qemu-system-x86_64: Failed to load PCIDevice:config > > qemu-system-x86_64: Failed to load e1000e:parent_obj > > qemu-system-x86_64: error while loading state for instance 0x0 of device '0000:00:02.0/e1000e' > > qemu-system-x86_64: load of migration failed: Invalid argument > > > > The above test migrated a 7.2 machine type from QEMU master to QEMU 7.2.0, > > with this cmdline: > > > > ./qemu-system-x86_64 -M pc-q35-7.2 [-incoming XXX] > > > > In order to fix this, property x-pcie-err-unc-mask was introduced to > > control when PCI_ERR_UNCOR_MASK is enabled. This property is enabled by > > default, but is disabled if machine type <= 7.2. > > > > Fixes: 010746ae1d ("hw/pci/aer: Implement PCI_ERR_UNCOR_MASK register") > > Suggested-by: Michael S. Tsirkin <mst@redhat.com> > > Signed-off-by: Leonardo Bras <leobras@redhat.com> > > Thanks Leo, you are a star. > > LGTM > > Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> > Thanks! > > --- > > include/hw/pci/pci.h | 2 ++ > > hw/core/machine.c | 1 + > > hw/pci/pci.c | 2 ++ > > hw/pci/pcie_aer.c | 11 +++++++---- > > 4 files changed, 12 insertions(+), 4 deletions(-) > > > > diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h > > index 935b4b91b4..e6d0574a29 100644 > > --- a/include/hw/pci/pci.h > > +++ b/include/hw/pci/pci.h > > @@ -207,6 +207,8 @@ enum { > > QEMU_PCIE_EXTCAP_INIT = (1 << QEMU_PCIE_EXTCAP_INIT_BITNR), > > #define QEMU_PCIE_CXL_BITNR 10 > > QEMU_PCIE_CAP_CXL = (1 << QEMU_PCIE_CXL_BITNR), > > +#define QEMU_PCIE_ERR_UNC_MASK_BITNR 11 > > + QEMU_PCIE_ERR_UNC_MASK = (1 << QEMU_PCIE_ERR_UNC_MASK_BITNR), > > }; > > > > typedef struct PCIINTxRoute { > > diff --git a/hw/core/machine.c b/hw/core/machine.c > > index 47a34841a5..07f763eb2e 100644 > > --- a/hw/core/machine.c > > +++ b/hw/core/machine.c > > @@ -48,6 +48,7 @@ GlobalProperty hw_compat_7_2[] = { > > { "e1000e", "migrate-timadj", "off" }, > > { "virtio-mem", "x-early-migration", "false" }, > > { "migration", "x-preempt-pre-7-2", "true" }, > > + { TYPE_PCI_DEVICE, "x-pcie-err-unc-mask", "off" }, > > }; > > const size_t hw_compat_7_2_len = G_N_ELEMENTS(hw_compat_7_2); > > > > diff --git a/hw/pci/pci.c b/hw/pci/pci.c > > index 8a87ccc8b0..5153ad63d6 100644 > > --- a/hw/pci/pci.c > > +++ b/hw/pci/pci.c > > @@ -79,6 +79,8 @@ static Property pci_props[] = { > > DEFINE_PROP_STRING("failover_pair_id", PCIDevice, > > failover_pair_id), > > DEFINE_PROP_UINT32("acpi-index", PCIDevice, acpi_index, 0), > > + DEFINE_PROP_BIT("x-pcie-err-unc-mask", PCIDevice, cap_present, > > + QEMU_PCIE_ERR_UNC_MASK_BITNR, true), > > DEFINE_PROP_END_OF_LIST() > > }; > > > > diff --git a/hw/pci/pcie_aer.c b/hw/pci/pcie_aer.c > > index 103667c368..374d593ead 100644 > > --- a/hw/pci/pcie_aer.c > > +++ b/hw/pci/pcie_aer.c > > @@ -112,10 +112,13 @@ int pcie_aer_init(PCIDevice *dev, uint8_t cap_ver, uint16_t offset, > > > > pci_set_long(dev->w1cmask + offset + PCI_ERR_UNCOR_STATUS, > > PCI_ERR_UNC_SUPPORTED); > > - pci_set_long(dev->config + offset + PCI_ERR_UNCOR_MASK, > > - PCI_ERR_UNC_MASK_DEFAULT); > > - pci_set_long(dev->wmask + offset + PCI_ERR_UNCOR_MASK, > > - PCI_ERR_UNC_SUPPORTED); > > + > > + if (dev->cap_present & QEMU_PCIE_ERR_UNC_MASK) { > > + pci_set_long(dev->config + offset + PCI_ERR_UNCOR_MASK, > > + PCI_ERR_UNC_MASK_DEFAULT); > > + pci_set_long(dev->wmask + offset + PCI_ERR_UNCOR_MASK, > > + PCI_ERR_UNC_SUPPORTED); > > + } > > > > pci_set_long(dev->config + offset + PCI_ERR_UNCOR_SEVER, > > PCI_ERR_UNC_SEVERITY_DEFAULT); >
Leonardo Bras <leobras@redhat.com> wrote: > Since it's implementation on v8.0.0-rc0, having the PCI_ERR_UNCOR_MASK > set for machine types < 8.0 will cause migration to fail if the target > QEMU version is < 8.0.0 : > > qemu-system-x86_64: get_pci_config_device: Bad config data: i=0x10a read: 40 device: 0 cmask: ff wmask: 0 w1cmask:0 > qemu-system-x86_64: Failed to load PCIDevice:config > qemu-system-x86_64: Failed to load e1000e:parent_obj > qemu-system-x86_64: error while loading state for instance 0x0 of device '0000:00:02.0/e1000e' > qemu-system-x86_64: load of migration failed: Invalid argument > > The above test migrated a 7.2 machine type from QEMU master to QEMU 7.2.0, > with this cmdline: > > ./qemu-system-x86_64 -M pc-q35-7.2 [-incoming XXX] > > In order to fix this, property x-pcie-err-unc-mask was introduced to > control when PCI_ERR_UNCOR_MASK is enabled. This property is enabled by > default, but is disabled if machine type <= 7.2. > > Fixes: 010746ae1d ("hw/pci/aer: Implement PCI_ERR_UNCOR_MASK register") > Suggested-by: Michael S. Tsirkin <mst@redhat.com> > Signed-off-by: Leonardo Bras <leobras@redhat.com> Reviewed-by: Juan Quintela <quintela@redhat.com>
On Tue, May 02, 2023 at 09:27:02PM -0300, Leonardo Bras wrote: > Since it's implementation on v8.0.0-rc0, having the PCI_ERR_UNCOR_MASK > set for machine types < 8.0 will cause migration to fail if the target > QEMU version is < 8.0.0 : > > qemu-system-x86_64: get_pci_config_device: Bad config data: i=0x10a read: 40 device: 0 cmask: ff wmask: 0 w1cmask:0 > qemu-system-x86_64: Failed to load PCIDevice:config > qemu-system-x86_64: Failed to load e1000e:parent_obj > qemu-system-x86_64: error while loading state for instance 0x0 of device '0000:00:02.0/e1000e' > qemu-system-x86_64: load of migration failed: Invalid argument > > The above test migrated a 7.2 machine type from QEMU master to QEMU 7.2.0, > with this cmdline: > > ./qemu-system-x86_64 -M pc-q35-7.2 [-incoming XXX] > > In order to fix this, property x-pcie-err-unc-mask was introduced to > control when PCI_ERR_UNCOR_MASK is enabled. This property is enabled by > default, but is disabled if machine type <= 7.2. > > Fixes: 010746ae1d ("hw/pci/aer: Implement PCI_ERR_UNCOR_MASK register") > Suggested-by: Michael S. Tsirkin <mst@redhat.com> > Signed-off-by: Leonardo Bras <leobras@redhat.com> Since this one blocks mostly all backward migration for current master and all the downstream trees, shall we consider pick it up sooner? And I think we should make sure qemu-stable gets this too, I'm not pretty sure how we normally do that and whether we need explicit CC: to stable list, perhaps not.. Thanks!
On Tue, May 09, 2023 at 07:01:53AM -0700, Peter Xu wrote: > On Tue, May 02, 2023 at 09:27:02PM -0300, Leonardo Bras wrote: > > Since it's implementation on v8.0.0-rc0, having the PCI_ERR_UNCOR_MASK > > set for machine types < 8.0 will cause migration to fail if the target > > QEMU version is < 8.0.0 : > > > > qemu-system-x86_64: get_pci_config_device: Bad config data: i=0x10a read: 40 device: 0 cmask: ff wmask: 0 w1cmask:0 > > qemu-system-x86_64: Failed to load PCIDevice:config > > qemu-system-x86_64: Failed to load e1000e:parent_obj > > qemu-system-x86_64: error while loading state for instance 0x0 of device '0000:00:02.0/e1000e' > > qemu-system-x86_64: load of migration failed: Invalid argument > > > > The above test migrated a 7.2 machine type from QEMU master to QEMU 7.2.0, > > with this cmdline: > > > > ./qemu-system-x86_64 -M pc-q35-7.2 [-incoming XXX] > > > > In order to fix this, property x-pcie-err-unc-mask was introduced to > > control when PCI_ERR_UNCOR_MASK is enabled. This property is enabled by > > default, but is disabled if machine type <= 7.2. > > > > Fixes: 010746ae1d ("hw/pci/aer: Implement PCI_ERR_UNCOR_MASK register") > > Suggested-by: Michael S. Tsirkin <mst@redhat.com> > > Signed-off-by: Leonardo Bras <leobras@redhat.com> > > Since this one blocks mostly all backward migration for current master and > all the downstream trees, shall we consider pick it up sooner? > > And I think we should make sure qemu-stable gets this too, I'm not pretty > sure how we normally do that and whether we need explicit CC: to stable > list, perhaps not.. > > Thanks! > > -- > Peter Xu Yes will be in the pull I'm working on, targeting end of week.
Leonardo Bras <leobras@redhat.com> wrote: > Since it's implementation on v8.0.0-rc0, having the PCI_ERR_UNCOR_MASK > set for machine types < 8.0 will cause migration to fail if the target > QEMU version is < 8.0.0 : > > qemu-system-x86_64: get_pci_config_device: Bad config data: i=0x10a read: 40 device: 0 cmask: ff wmask: 0 w1cmask:0 > qemu-system-x86_64: Failed to load PCIDevice:config > qemu-system-x86_64: Failed to load e1000e:parent_obj > qemu-system-x86_64: error while loading state for instance 0x0 of device '0000:00:02.0/e1000e' > qemu-system-x86_64: load of migration failed: Invalid argument > > The above test migrated a 7.2 machine type from QEMU master to QEMU 7.2.0, > with this cmdline: > > ./qemu-system-x86_64 -M pc-q35-7.2 [-incoming XXX] > > In order to fix this, property x-pcie-err-unc-mask was introduced to > control when PCI_ERR_UNCOR_MASK is enabled. This property is enabled by > default, but is disabled if machine type <= 7.2. > > Fixes: 010746ae1d ("hw/pci/aer: Implement PCI_ERR_UNCOR_MASK register") > Suggested-by: Michael S. Tsirkin <mst@redhat.com> > Signed-off-by: Leonardo Bras <leobras@redhat.com> Reviewed-by: Juan Quintela <quintela@redhat.com>
03.05.2023 03:27, Leonardo Bras пишет: > Since it's implementation on v8.0.0-rc0, having the PCI_ERR_UNCOR_MASK > set for machine types < 8.0 will cause migration to fail if the target > QEMU version is < 8.0.0 : > > qemu-system-x86_64: get_pci_config_device: Bad config data: i=0x10a read: 40 device: 0 cmask: ff wmask: 0 w1cmask:0 > qemu-system-x86_64: Failed to load PCIDevice:config > qemu-system-x86_64: Failed to load e1000e:parent_obj > qemu-system-x86_64: error while loading state for instance 0x0 of device '0000:00:02.0/e1000e' > qemu-system-x86_64: load of migration failed: Invalid argument > > The above test migrated a 7.2 machine type from QEMU master to QEMU 7.2.0, > with this cmdline: > > ./qemu-system-x86_64 -M pc-q35-7.2 [-incoming XXX] > > In order to fix this, property x-pcie-err-unc-mask was introduced to > control when PCI_ERR_UNCOR_MASK is enabled. This property is enabled by > default, but is disabled if machine type <= 7.2. > > Fixes: 010746ae1d ("hw/pci/aer: Implement PCI_ERR_UNCOR_MASK register") > Suggested-by: Michael S. Tsirkin <mst@redhat.com> > Signed-off-by: Leonardo Bras <leobras@redhat.com> It looks like 8.0-stable material to me, is it not? /mjt
On Wed, May 10, 2023 at 07:29:40PM +0300, Michael Tokarev wrote: > 03.05.2023 03:27, Leonardo Bras пишет: > > Since it's implementation on v8.0.0-rc0, having the PCI_ERR_UNCOR_MASK > > set for machine types < 8.0 will cause migration to fail if the target > > QEMU version is < 8.0.0 : > > > > qemu-system-x86_64: get_pci_config_device: Bad config data: i=0x10a read: 40 device: 0 cmask: ff wmask: 0 w1cmask:0 > > qemu-system-x86_64: Failed to load PCIDevice:config > > qemu-system-x86_64: Failed to load e1000e:parent_obj > > qemu-system-x86_64: error while loading state for instance 0x0 of device '0000:00:02.0/e1000e' > > qemu-system-x86_64: load of migration failed: Invalid argument > > > > The above test migrated a 7.2 machine type from QEMU master to QEMU 7.2.0, > > with this cmdline: > > > > ./qemu-system-x86_64 -M pc-q35-7.2 [-incoming XXX] > > > > In order to fix this, property x-pcie-err-unc-mask was introduced to > > control when PCI_ERR_UNCOR_MASK is enabled. This property is enabled by > > default, but is disabled if machine type <= 7.2. > > > > Fixes: 010746ae1d ("hw/pci/aer: Implement PCI_ERR_UNCOR_MASK register") > > Suggested-by: Michael S. Tsirkin <mst@redhat.com> > > Signed-off-by: Leonardo Bras <leobras@redhat.com> > > It looks like 8.0-stable material to me, is it not? > > /mjt Good point. Feel free to CC stable.
Michael Tokarev <mjt@tls.msk.ru> wrote: > 03.05.2023 03:27, Leonardo Bras пишет: >> Since it's implementation on v8.0.0-rc0, having the PCI_ERR_UNCOR_MASK >> set for machine types < 8.0 will cause migration to fail if the target >> QEMU version is < 8.0.0 : >> qemu-system-x86_64: get_pci_config_device: Bad config data: i=0x10a >> read: 40 device: 0 cmask: ff wmask: 0 w1cmask:0 >> qemu-system-x86_64: Failed to load PCIDevice:config >> qemu-system-x86_64: Failed to load e1000e:parent_obj >> qemu-system-x86_64: error while loading state for instance 0x0 of device '0000:00:02.0/e1000e' >> qemu-system-x86_64: load of migration failed: Invalid argument >> The above test migrated a 7.2 machine type from QEMU master to QEMU >> 7.2.0, >> with this cmdline: >> ./qemu-system-x86_64 -M pc-q35-7.2 [-incoming XXX] >> In order to fix this, property x-pcie-err-unc-mask was introduced to >> control when PCI_ERR_UNCOR_MASK is enabled. This property is enabled by >> default, but is disabled if machine type <= 7.2. >> Fixes: 010746ae1d ("hw/pci/aer: Implement PCI_ERR_UNCOR_MASK >> register") >> Suggested-by: Michael S. Tsirkin <mst@redhat.com> >> Signed-off-by: Leonardo Bras <leobras@redhat.com> > > It looks like 8.0-stable material to me, is it not? It is . I thought we have already requested that. We lost the intent somewhere, sorry. Later, Juan.
Am 03.05.23 um 02:27 schrieb Leonardo Bras: > Since it's implementation on v8.0.0-rc0, having the PCI_ERR_UNCOR_MASK > set for machine types < 8.0 will cause migration to fail if the target > QEMU version is < 8.0.0 : > > qemu-system-x86_64: get_pci_config_device: Bad config data: i=0x10a read: 40 device: 0 cmask: ff wmask: 0 w1cmask:0 > qemu-system-x86_64: Failed to load PCIDevice:config > qemu-system-x86_64: Failed to load e1000e:parent_obj > qemu-system-x86_64: error while loading state for instance 0x0 of device '0000:00:02.0/e1000e' > qemu-system-x86_64: load of migration failed: Invalid argument > > The above test migrated a 7.2 machine type from QEMU master to QEMU 7.2.0, > with this cmdline: > > ./qemu-system-x86_64 -M pc-q35-7.2 [-incoming XXX] > > In order to fix this, property x-pcie-err-unc-mask was introduced to > control when PCI_ERR_UNCOR_MASK is enabled. This property is enabled by > default, but is disabled if machine type <= 7.2. > > Fixes: 010746ae1d ("hw/pci/aer: Implement PCI_ERR_UNCOR_MASK register") > Suggested-by: Michael S. Tsirkin <mst@redhat.com> > Signed-off-by: Leonardo Bras <leobras@redhat.com> Thank you for the patch! Closes: https://gitlab.com/qemu-project/qemu/-/issues/1576 AFAICT, this breaks (forward) migration from 8.0 to 8.0 + this patch when using machine type <= 7.2. That is because after this patch, when using machine type <= 7.2, the wmask for the register is not set and when 8.0 sends a nonzero value for the register, the error condition in get_pci_config_device() will trigger again. Is it necessary to also handle that? Maybe by special casing the error condition in get_pci_config_device() to be prepared to accept such a stream from 8.0? If that is considered not worth it, consider this: Tested-by: Fiona Ebner <f.ebner@proxmox.com> Best Regards, Fiona
Fiona Ebner <f.ebner@proxmox.com> wrote: > Am 03.05.23 um 02:27 schrieb Leonardo Bras: >> Since it's implementation on v8.0.0-rc0, having the PCI_ERR_UNCOR_MASK >> set for machine types < 8.0 will cause migration to fail if the target >> QEMU version is < 8.0.0 : >> >> qemu-system-x86_64: get_pci_config_device: Bad config data: i=0x10a read: 40 device: 0 cmask: ff wmask: 0 w1cmask:0 >> qemu-system-x86_64: Failed to load PCIDevice:config >> qemu-system-x86_64: Failed to load e1000e:parent_obj >> qemu-system-x86_64: error while loading state for instance 0x0 of device '0000:00:02.0/e1000e' >> qemu-system-x86_64: load of migration failed: Invalid argument >> >> The above test migrated a 7.2 machine type from QEMU master to QEMU 7.2.0, >> with this cmdline: >> >> ./qemu-system-x86_64 -M pc-q35-7.2 [-incoming XXX] >> >> In order to fix this, property x-pcie-err-unc-mask was introduced to >> control when PCI_ERR_UNCOR_MASK is enabled. This property is enabled by >> default, but is disabled if machine type <= 7.2. >> >> Fixes: 010746ae1d ("hw/pci/aer: Implement PCI_ERR_UNCOR_MASK register") >> Suggested-by: Michael S. Tsirkin <mst@redhat.com> >> Signed-off-by: Leonardo Bras <leobras@redhat.com> > > Thank you for the patch! > > Closes: https://gitlab.com/qemu-project/qemu/-/issues/1576 > > AFAICT, this breaks (forward) migration from 8.0 to 8.0 + this patch > when using machine type <= 7.2. That is because after this patch, when > using machine type <= 7.2, the wmask for the register is not set and > when 8.0 sends a nonzero value for the register, the error condition in > get_pci_config_device() will trigger again. I think that works correctly. See https://lists.gnu.org/archive/html/qemu-devel/2023-05/msg02733.html What we have (before this patch) (using abbrevs as in the doc before) Current state: (1) qemu-8.0 -M pc-8.0 -> qemu-8.0 -M pc-8.0 works not affected by the patch (2) qemu-7.2 -M pc-7.2 -> qemu-8.0 -M pc-8.0 works works well because 7.2 don't change that field (3) qemu-8.0 -M pc-7.2 -> qemu-7.2 -M pc-7.2 fails With the patch we fixed 3, so once it is in stable, 1 and 2 continue as usual and for (3) we will have: (3) qemu-8.0.1 -M pc-7.2 -> qemu-7.2 -M pc-7.2 works If what you mean is that: (3) qemu-8.0 -M pc-7.2 -> qemu-8.0.1 -M pc-7.2 works Will fail, that is true, but I can think a "sane" way to fix this. > Is it necessary to also handle that? Maybe by special casing the error > condition in get_pci_config_device() to be prepared to accept such a > stream from 8.0? Well, we can do that, but it is to the pci maintainers to decide if that is "sane". > If that is considered not worth it, consider this: > > Tested-by: Fiona Ebner <f.ebner@proxmox.com> > > Best Regards, > Fiona Later, Juan.
On Thu, May 11, 2023 at 10:27:35AM +0200, Fiona Ebner wrote: > Am 03.05.23 um 02:27 schrieb Leonardo Bras: > > Since it's implementation on v8.0.0-rc0, having the PCI_ERR_UNCOR_MASK > > set for machine types < 8.0 will cause migration to fail if the target > > QEMU version is < 8.0.0 : > > > > qemu-system-x86_64: get_pci_config_device: Bad config data: i=0x10a read: 40 device: 0 cmask: ff wmask: 0 w1cmask:0 > > qemu-system-x86_64: Failed to load PCIDevice:config > > qemu-system-x86_64: Failed to load e1000e:parent_obj > > qemu-system-x86_64: error while loading state for instance 0x0 of device '0000:00:02.0/e1000e' > > qemu-system-x86_64: load of migration failed: Invalid argument > > > > The above test migrated a 7.2 machine type from QEMU master to QEMU 7.2.0, > > with this cmdline: > > > > ./qemu-system-x86_64 -M pc-q35-7.2 [-incoming XXX] > > > > In order to fix this, property x-pcie-err-unc-mask was introduced to > > control when PCI_ERR_UNCOR_MASK is enabled. This property is enabled by > > default, but is disabled if machine type <= 7.2. > > > > Fixes: 010746ae1d ("hw/pci/aer: Implement PCI_ERR_UNCOR_MASK register") > > Suggested-by: Michael S. Tsirkin <mst@redhat.com> > > Signed-off-by: Leonardo Bras <leobras@redhat.com> > > Thank you for the patch! > > Closes: https://gitlab.com/qemu-project/qemu/-/issues/1576 > > AFAICT, this breaks (forward) migration from 8.0 to 8.0 + this patch > when using machine type <= 7.2. That is because after this patch, when > using machine type <= 7.2, the wmask for the register is not set and > when 8.0 sends a nonzero value for the register, the error condition in > get_pci_config_device() will trigger again. > > Is it necessary to also handle that? Maybe by special casing the error > condition in get_pci_config_device() to be prepared to accept such a > stream from 8.0? > > If that is considered not worth it, consider this: > > Tested-by: Fiona Ebner <f.ebner@proxmox.com> > > Best Regards, > Fiona Yes any fix is like that. We keep encountering bugs like this but there does not seem to be will to create infrastructure for fixing it, which would involve describing version of qemu being migrated to.
"Michael S. Tsirkin" <mst@redhat.com> wrote: [Added libvirt people to the party, see the end of the message ] > On Thu, May 11, 2023 at 10:27:35AM +0200, Fiona Ebner wrote: >> Am 03.05.23 um 02:27 schrieb Leonardo Bras: >> > Since it's implementation on v8.0.0-rc0, having the PCI_ERR_UNCOR_MASK >> > set for machine types < 8.0 will cause migration to fail if the target >> > QEMU version is < 8.0.0 : >> > >> > qemu-system-x86_64: get_pci_config_device: Bad config data: i=0x10a read: 40 device: 0 cmask: ff wmask: 0 w1cmask:0 >> > qemu-system-x86_64: Failed to load PCIDevice:config >> > qemu-system-x86_64: Failed to load e1000e:parent_obj >> > qemu-system-x86_64: error while loading state for instance 0x0 of device '0000:00:02.0/e1000e' >> > qemu-system-x86_64: load of migration failed: Invalid argument >> > >> > The above test migrated a 7.2 machine type from QEMU master to QEMU 7.2.0, >> > with this cmdline: >> > >> > ./qemu-system-x86_64 -M pc-q35-7.2 [-incoming XXX] >> > >> > In order to fix this, property x-pcie-err-unc-mask was introduced to >> > control when PCI_ERR_UNCOR_MASK is enabled. This property is enabled by >> > default, but is disabled if machine type <= 7.2. >> > >> > Fixes: 010746ae1d ("hw/pci/aer: Implement PCI_ERR_UNCOR_MASK register") >> > Suggested-by: Michael S. Tsirkin <mst@redhat.com> >> > Signed-off-by: Leonardo Bras <leobras@redhat.com> >> >> Thank you for the patch! >> >> Closes: https://gitlab.com/qemu-project/qemu/-/issues/1576 >> >> AFAICT, this breaks (forward) migration from 8.0 to 8.0 + this patch >> when using machine type <= 7.2. That is because after this patch, when >> using machine type <= 7.2, the wmask for the register is not set and >> when 8.0 sends a nonzero value for the register, the error condition in >> get_pci_config_device() will trigger again. >> >> Is it necessary to also handle that? Maybe by special casing the error >> condition in get_pci_config_device() to be prepared to accept such a >> stream from 8.0? >> >> If that is considered not worth it, consider this: >> >> Tested-by: Fiona Ebner <f.ebner@proxmox.com> >> >> Best Regards, >> Fiona > > Yes any fix is like that. We keep encountering bugs like this > but there does not seem to be will to create infrastructure > for fixing it, which would involve describing > version of qemu being migrated to. You have it. qemu-8.0 -M pc-7.2 -> is going to migrate to an older qemu, so be extra careful. We tried (and faield) that you could always migrate to the next version without changing the machine type. That was a myth that never really worked, but sometimes it did. Now we say that we only support migration with same machine type, period. And then you know what you are going to do. And what we are debating here is something different. $ qemu-7.2 -M pc-7.2 -> qemu-8.0 -M pc-7.2 We broke it. We don't have a time machine. We can fix that for 8.0. And we have a fix. But what the fix does is: $ qemu-7.2 -M pc-7.2 -> qemu-8.0.1 -M pc-7.2 works again. But what Fiona was asking for is: $ qemu-8.0 -M pc-7.2 -> qemu-8.0.1 -M pc-7.2 is broken right now. But we can "fix" it. But that implies that we change pci to allow both the correct value (the old one, what qemu-7.2 sends), and the new value that qemu-8.0 -M pc-7.2 sends. Yes, it is a kludge. That is why I said that it depended on what PCI maintainers think. diff --git a/hw/pci/pcie_aer.c b/hw/pci/pcie_aer.c index 103667c368..374d593ead 100644 --- a/hw/pci/pcie_aer.c +++ b/hw/pci/pcie_aer.c @@ -112,10 +112,13 @@ int pcie_aer_init(PCIDevice *dev, uint8_t cap_ver, uint16_t offset, pci_set_long(dev->w1cmask + offset + PCI_ERR_UNCOR_STATUS, PCI_ERR_UNC_SUPPORTED); - pci_set_long(dev->config + offset + PCI_ERR_UNCOR_MASK, - PCI_ERR_UNC_MASK_DEFAULT); - pci_set_long(dev->wmask + offset + PCI_ERR_UNCOR_MASK, - PCI_ERR_UNC_SUPPORTED); + + if (dev->cap_present & QEMU_PCIE_ERR_UNC_MASK) { + pci_set_long(dev->config + offset + PCI_ERR_UNCOR_MASK, + PCI_ERR_UNC_MASK_DEFAULT); + pci_set_long(dev->wmask + offset + PCI_ERR_UNCOR_MASK, + PCI_ERR_UNC_SUPPORTED); + } pci_set_long(dev->config + offset + PCI_ERR_UNCOR_SEVER, PCI_ERR_UNC_SEVERITY_DEFAULT); We do this at init time. But we need to a migration incoming time, that for old machine types we can find that this has been set up. So they need to accept both ways. And this is in the middle of the pci configuration space. I have no clue how to do this and be sure that it is a valid pci configuration afterwards. With my migration knowledge, I know that the problem is in this function (ok, I know th error message gave you a big clue): static int get_pci_config_device(QEMUFile *f, void *pv, size_t size, const VMStateField *field) { PCIDevice *s = container_of(pv, PCIDevice, config); uint8_t *config; int i; assert(size == pci_config_size(s)); config = g_malloc(size); qemu_get_buffer(f, config, size); for (i = 0; i < size; ++i) { if ((config[i] ^ s->config[i]) & s->cmask[i] & ~s->wmask[i] & ~s->w1cmask[i]) { -----> here --->> error_report("%s: Bad config data: i=0x%x read: %x device: %x " "cmask: %x wmask: %x w1cmask:%x", __func__, i, config[i], s->config[i], s->cmask[i], s->wmask[i], s->w1cmask[i]); g_free(config); return -EINVAL; } } memcpy(s->config, config, size); pci_update_mappings(s); if (IS_PCI_BRIDGE(s)) { pci_bridge_update_mappings(PCI_BRIDGE(s)); } memory_region_set_enabled(&s->bus_master_enable_region, pci_get_word(s->config + PCI_COMMAND) & PCI_COMMAND_MASTER); g_free(config); return 0; } But what is the kludge that I have to do to put in "here" a condition that says: if the problem is at dev->config + offset + PCI_ERR_UNCOR_MASK that is not enabled in local (target of migration) device but it was enabled in remote (source of migration) (the same for wmask) + if (dev->cap_present & QEMU_PCIE_ERR_UNC_MASK) { + pci_set_long(dev->config + offset + PCI_ERR_UNCOR_MASK, + PCI_ERR_UNC_MASK_DEFAULT); + pci_set_long(dev->wmask + offset + PCI_ERR_UNCOR_MASK, + PCI_ERR_UNC_SUPPORTED); + } just enable it locally and don't give an error. And we have to maintain the kludge forever because every machine with a machine type older than qemu-8.0 that has started in qemu-8.0 (i.e. pre qemu-8.0.1 that has the fix) is going to have that enable when the machine definition says that it should be disabled. That would fix the: qemu-8.0 -M pc-7.2 -> qemu-8.0.1 -M pc-7.2 It is worth it? Dunno. That is my question. And knowing from what qemu it has migrated from would not help. We would need to add a new tweak and means: This is a pc-7.2 machine that has been isntantiated in a qemu-8.0 and has the pciaerr bug. But wait, we have _that_. And it is called + { TYPE_PCI_DEVICE, "x-pcie-err-unc-mask", "off" }, from the patch. We can teach libvirt about this glitch, and if he is migrating a pc-7.2 machine in qemu-8.0 machine, And they want to migrate to a new qemu (call it qemu-8.1), it needs to be started: qemu-8.1 -M pc-7.2 <whatever pci devices need to do>,x-pci-err-unc-mask="true" Until the user reboots it and then that property can be reset to default value. So at the end, pci maintainers don't need to do anything (so they don't have to kill me). And now it is the libvirt people who is going to try to kill me. Sniff. It is worth it? I don't know. But as seen, migration has the infrastructure to do it. But it is a very hard problem, and fixing it in general is very complicated. We are very bad at testing that we have broken migration for qemu-X.Y -M pc-X.Y -> qemu-X.(Y+1) -M pc-X.Y Sorry for the long mail, Juan.
On Thu, May 11, 2023 at 01:43:47PM +0200, Juan Quintela wrote: > "Michael S. Tsirkin" <mst@redhat.com> wrote: > > [Added libvirt people to the party, see the end of the message ] > > > On Thu, May 11, 2023 at 10:27:35AM +0200, Fiona Ebner wrote: > >> Am 03.05.23 um 02:27 schrieb Leonardo Bras: > >> > Since it's implementation on v8.0.0-rc0, having the PCI_ERR_UNCOR_MASK > >> > set for machine types < 8.0 will cause migration to fail if the target > >> > QEMU version is < 8.0.0 : > >> > > >> > qemu-system-x86_64: get_pci_config_device: Bad config data: i=0x10a read: 40 device: 0 cmask: ff wmask: 0 w1cmask:0 > >> > qemu-system-x86_64: Failed to load PCIDevice:config > >> > qemu-system-x86_64: Failed to load e1000e:parent_obj > >> > qemu-system-x86_64: error while loading state for instance 0x0 of device '0000:00:02.0/e1000e' > >> > qemu-system-x86_64: load of migration failed: Invalid argument > >> > > >> > The above test migrated a 7.2 machine type from QEMU master to QEMU 7.2.0, > >> > with this cmdline: > >> > > >> > ./qemu-system-x86_64 -M pc-q35-7.2 [-incoming XXX] > >> > > >> > In order to fix this, property x-pcie-err-unc-mask was introduced to > >> > control when PCI_ERR_UNCOR_MASK is enabled. This property is enabled by > >> > default, but is disabled if machine type <= 7.2. > >> > > >> > Fixes: 010746ae1d ("hw/pci/aer: Implement PCI_ERR_UNCOR_MASK register") > >> > Suggested-by: Michael S. Tsirkin <mst@redhat.com> > >> > Signed-off-by: Leonardo Bras <leobras@redhat.com> > >> > >> Thank you for the patch! > >> > >> Closes: https://gitlab.com/qemu-project/qemu/-/issues/1576 > >> > >> AFAICT, this breaks (forward) migration from 8.0 to 8.0 + this patch > >> when using machine type <= 7.2. That is because after this patch, when > >> using machine type <= 7.2, the wmask for the register is not set and > >> when 8.0 sends a nonzero value for the register, the error condition in > >> get_pci_config_device() will trigger again. > >> > >> Is it necessary to also handle that? Maybe by special casing the error > >> condition in get_pci_config_device() to be prepared to accept such a > >> stream from 8.0? > >> > >> If that is considered not worth it, consider this: > >> > >> Tested-by: Fiona Ebner <f.ebner@proxmox.com> > >> > >> Best Regards, > >> Fiona > > > > Yes any fix is like that. We keep encountering bugs like this > > but there does not seem to be will to create infrastructure > > for fixing it, which would involve describing > > version of qemu being migrated to. > > You have it. > > qemu-8.0 -M pc-7.2 -> is going to migrate to an older qemu, so be extra > careful. > > We tried (and faield) that you could always migrate to the next version > without changing the machine type. > > That was a myth that never really worked, but sometimes it did. > > Now we say that we only support migration with same machine type, > period. And then you know what you are going to do. > > And what we are debating here is something different. > > $ qemu-7.2 -M pc-7.2 -> qemu-8.0 -M pc-7.2 > > We broke it. We don't have a time machine. We can fix that for 8.0. > > And we have a fix. But what the fix does is: > > $ qemu-7.2 -M pc-7.2 -> qemu-8.0.1 -M pc-7.2 > > works again. > > But what Fiona was asking for is: > > $ qemu-8.0 -M pc-7.2 -> qemu-8.0.1 -M pc-7.2 > > is broken right now. But we can "fix" it. > But that implies that we change pci to allow both the correct value > (the old one, what qemu-7.2 sends), and the new value that qemu-8.0 -M > pc-7.2 sends. Yes, it is a kludge. That is why I said that it depended > on what PCI maintainers think. > > diff --git a/hw/pci/pcie_aer.c b/hw/pci/pcie_aer.c > index 103667c368..374d593ead 100644 > --- a/hw/pci/pcie_aer.c > +++ b/hw/pci/pcie_aer.c > @@ -112,10 +112,13 @@ int pcie_aer_init(PCIDevice *dev, uint8_t cap_ver, > uint16_t offset, > > pci_set_long(dev->w1cmask + offset + PCI_ERR_UNCOR_STATUS, > PCI_ERR_UNC_SUPPORTED); > - pci_set_long(dev->config + offset + PCI_ERR_UNCOR_MASK, > - PCI_ERR_UNC_MASK_DEFAULT); > - pci_set_long(dev->wmask + offset + PCI_ERR_UNCOR_MASK, > - PCI_ERR_UNC_SUPPORTED); > + > + if (dev->cap_present & QEMU_PCIE_ERR_UNC_MASK) { > + pci_set_long(dev->config + offset + PCI_ERR_UNCOR_MASK, > + PCI_ERR_UNC_MASK_DEFAULT); > + pci_set_long(dev->wmask + offset + PCI_ERR_UNCOR_MASK, > + PCI_ERR_UNC_SUPPORTED); > + } > > pci_set_long(dev->config + offset + PCI_ERR_UNCOR_SEVER, > PCI_ERR_UNC_SEVERITY_DEFAULT); > > We do this at init time. > But we need to a migration incoming time, that for old machine types we > can find that this has been set up. So they need to accept both ways. > And this is in the middle of the pci configuration space. > > I have no clue how to do this and be sure that it is a valid pci > configuration afterwards. > > With my migration knowledge, I know that the problem is in this function > (ok, I know th error message gave you a big clue): > > static int get_pci_config_device(QEMUFile *f, void *pv, size_t size, > const VMStateField *field) > { > PCIDevice *s = container_of(pv, PCIDevice, config); > uint8_t *config; > int i; > > assert(size == pci_config_size(s)); > config = g_malloc(size); > > qemu_get_buffer(f, config, size); > for (i = 0; i < size; ++i) { > if ((config[i] ^ s->config[i]) & > s->cmask[i] & ~s->wmask[i] & ~s->w1cmask[i]) { > > -----> here --->> > error_report("%s: Bad config data: i=0x%x read: %x device: %x " > "cmask: %x wmask: %x w1cmask:%x", __func__, > i, config[i], s->config[i], > s->cmask[i], s->wmask[i], s->w1cmask[i]); > g_free(config); > return -EINVAL; > } > } > memcpy(s->config, config, size); > > pci_update_mappings(s); > if (IS_PCI_BRIDGE(s)) { > pci_bridge_update_mappings(PCI_BRIDGE(s)); > } > > memory_region_set_enabled(&s->bus_master_enable_region, > pci_get_word(s->config + PCI_COMMAND) > & PCI_COMMAND_MASTER); > > g_free(config); > return 0; > } > > But what is the kludge that I have to do to put in "here" a condition > that says: > > if the problem is at dev->config + offset + PCI_ERR_UNCOR_MASK that is > not enabled in local (target of migration) device but it was enabled in remote > (source of migration) (the same for wmask) > > + if (dev->cap_present & QEMU_PCIE_ERR_UNC_MASK) { > + pci_set_long(dev->config + offset + PCI_ERR_UNCOR_MASK, > + PCI_ERR_UNC_MASK_DEFAULT); > + pci_set_long(dev->wmask + offset + PCI_ERR_UNCOR_MASK, > + PCI_ERR_UNC_SUPPORTED); > + } > > just enable it locally and don't give an error. And we have to maintain > the kludge forever because every machine with a machine type older than > qemu-8.0 that has started in qemu-8.0 (i.e. pre qemu-8.0.1 that has the > fix) is going to have that enable when the machine definition says that > it should be disabled. > > That would fix the: > > qemu-8.0 -M pc-7.2 -> qemu-8.0.1 -M pc-7.2 > > It is worth it? Dunno. That is my question. > > And knowing from what qemu it has migrated from would not help. We > would need to add a new tweak and means: > > This is a pc-7.2 machine that has been isntantiated in a qemu-8.0 and > has the pciaerr bug. But wait, we have _that_. > > And it is called > > + { TYPE_PCI_DEVICE, "x-pcie-err-unc-mask", "off" }, > > from the patch. > > We can teach libvirt about this glitch, and if he is migrating a pc-7.2 > machine in qemu-8.0 machine, And they want to migrate to a new qemu > (call it qemu-8.1), it needs to be started: > > qemu-8.1 -M pc-7.2 <whatever pci devices need to do>,x-pci-err-unc-mask="true" > > Until the user reboots it and then that property can be reset to default > value. > > So at the end, pci maintainers don't need to do anything (so they don't > have to kill me). > > And now it is the libvirt people who is going to try to kill me. > > Sniff. > > It is worth it? I don't know. > > But as seen, migration has the infrastructure to do it. But it is a very > hard problem, and fixing it in general is very complicated. We are very > bad at testing that we have broken migration for > > qemu-X.Y -M pc-X.Y -> qemu-X.(Y+1) -M pc-X.Y > > Sorry for the long mail, Juan. Great summary thanks! Yes the infrastructure is there but it's too hard to use. qemu knows which versions need which properties but libvirt does not.
11.05.2023 11:40, Juan Quintela wrote: > Fiona Ebner <f.ebner@proxmox.com> wrote: ... >> Closes: https://gitlab.com/qemu-project/qemu/-/issues/1576 >> >> AFAICT, this breaks (forward) migration from 8.0 to 8.0 + this patch >> when using machine type <= 7.2. That is because after this patch, when >> using machine type <= 7.2, the wmask for the register is not set and >> when 8.0 sends a nonzero value for the register, the error condition in >> get_pci_config_device() will trigger again. > > I think that works correctly. > See https://lists.gnu.org/archive/html/qemu-devel/2023-05/msg02733.html > > What we have (before this patch) (using abbrevs as in the doc before) > > Current state: > > (1) qemu-8.0 -M pc-8.0 -> qemu-8.0 -M pc-8.0 works > > not affected by the patch > > (2) qemu-7.2 -M pc-7.2 -> qemu-8.0 -M pc-8.0 works > > works well because 7.2 don't change that field > > (3) qemu-8.0 -M pc-7.2 -> qemu-7.2 -M pc-7.2 fails > > With the patch we fixed 3, so once it is in stable, 1 and 2 continue as > usual and for (3) we will have: > > (3) qemu-8.0.1 -M pc-7.2 -> qemu-7.2 -M pc-7.2 works > > If what you mean is that: > > (3) qemu-8.0 -M pc-7.2 -> qemu-8.0.1 -M pc-7.2 works > > Will fail, that is true, but I can think a "sane" way to fix this. That's a great summary indeed. >> Is it necessary to also handle that? Maybe by special casing the error >> condition in get_pci_config_device() to be prepared to accept such a >> stream from 8.0? > > Well, we can do that, but it is to the pci maintainers to decide if that > is "sane". So, can we go from here somewhere? I'd love this fix to be in 8.0.1, either with or without the (un)sane part of the (3) variant above which might fail. Or else we'll have the same situation in 8.0.1 as we now have in 8.0.0 (the deadline is May-27). We did broke x.y.0 => x.y.1 migration before already like this, such as with 7.2.0=>7.2.1. I'm not saying it's a nice thing to do, just stating a fact. Yes, it is better to avoid such breakage, but.. meh.. Thanks! /mjt
Michael Tokarev <mjt@tls.msk.ru> wrote: > 11.05.2023 11:40, Juan Quintela wrote: >> Fiona Ebner <f.ebner@proxmox.com> wrote: > ... >>> Closes: https://gitlab.com/qemu-project/qemu/-/issues/1576 >>> >>> AFAICT, this breaks (forward) migration from 8.0 to 8.0 + this patch >>> when using machine type <= 7.2. That is because after this patch, when >>> using machine type <= 7.2, the wmask for the register is not set and >>> when 8.0 sends a nonzero value for the register, the error condition in >>> get_pci_config_device() will trigger again. >> I think that works correctly. >> See https://lists.gnu.org/archive/html/qemu-devel/2023-05/msg02733.html >> What we have (before this patch) (using abbrevs as in the doc >> before) >> Current state: >> (1) qemu-8.0 -M pc-8.0 -> qemu-8.0 -M pc-8.0 works >> not affected by the patch >> (2) qemu-7.2 -M pc-7.2 -> qemu-8.0 -M pc-8.0 works >> works well because 7.2 don't change that field >> (3) qemu-8.0 -M pc-7.2 -> qemu-7.2 -M pc-7.2 fails >> With the patch we fixed 3, so once it is in stable, 1 and 2 continue >> as >> usual and for (3) we will have: >> (3) qemu-8.0.1 -M pc-7.2 -> qemu-7.2 -M pc-7.2 works >> If what you mean is that: >> (3) qemu-8.0 -M pc-7.2 -> qemu-8.0.1 -M pc-7.2 works >> Will fail, that is true, but I can think a "sane" way to fix this. Hi > That's a great summary indeed. Thanks. >>> Is it necessary to also handle that? Maybe by special casing the error >>> condition in get_pci_config_device() to be prepared to accept such a >>> stream from 8.0? >> Well, we can do that, but it is to the pci maintainers to decide if >> that >> is "sane". > > So, can we go from here somewhere? I'd love this fix to be in 8.0.1, > either with or without the (un)sane part of the (3) variant above which > might fail. Or else we'll have the same situation in 8.0.1 as we now > have in 8.0.0 (the deadline is May-27). > > We did broke x.y.0 => x.y.1 migration before already like this, such as > with 7.2.0=>7.2.1. I'm not saying it's a nice thing to do, just stating > a fact. Yes, it is better to avoid such breakage, but.. meh.. See patch for documentation: https://lists.gnu.org/archive/html/qemu-devel/2023-05/msg03288.html Basically, the best we can do is: - get the patch posted. Fixes everything except: (3) qemu-8.0 -M pc-7.2 -> qemu-8.0.1 -M pc-7.2 works And for that, we can document somewhere that we need to launch qemu-8.0.1 as: $ qemu-8.0.1 -M pc-7.2 -device blah,x-pci-err-unc-mask=on And mark someone that this machine is tainted an can only be migrated to qemu's >= qemu-8.0.1. And that we should reboot it as the user convenience. (reboot here means poweroff qemu and poweron it back without x-pci-err-unc-mask=on). Later, Juan.
On Thu, May 18, 2023 at 01:33:43PM +0200, Juan Quintela wrote: > Michael Tokarev <mjt@tls.msk.ru> wrote: > > 11.05.2023 11:40, Juan Quintela wrote: > >> Fiona Ebner <f.ebner@proxmox.com> wrote: > > ... > >>> Closes: https://gitlab.com/qemu-project/qemu/-/issues/1576 > >>> > >>> AFAICT, this breaks (forward) migration from 8.0 to 8.0 + this patch > >>> when using machine type <= 7.2. That is because after this patch, when > >>> using machine type <= 7.2, the wmask for the register is not set and > >>> when 8.0 sends a nonzero value for the register, the error condition in > >>> get_pci_config_device() will trigger again. > >> I think that works correctly. > >> See https://lists.gnu.org/archive/html/qemu-devel/2023-05/msg02733.html > >> What we have (before this patch) (using abbrevs as in the doc > >> before) > >> Current state: > >> (1) qemu-8.0 -M pc-8.0 -> qemu-8.0 -M pc-8.0 works > >> not affected by the patch > >> (2) qemu-7.2 -M pc-7.2 -> qemu-8.0 -M pc-8.0 works > >> works well because 7.2 don't change that field > >> (3) qemu-8.0 -M pc-7.2 -> qemu-7.2 -M pc-7.2 fails > >> With the patch we fixed 3, so once it is in stable, 1 and 2 continue > >> as > >> usual and for (3) we will have: > >> (3) qemu-8.0.1 -M pc-7.2 -> qemu-7.2 -M pc-7.2 works > >> If what you mean is that: > >> (3) qemu-8.0 -M pc-7.2 -> qemu-8.0.1 -M pc-7.2 works > >> Will fail, that is true, but I can think a "sane" way to fix this. > > Hi > > > That's a great summary indeed. > > Thanks. > > >>> Is it necessary to also handle that? Maybe by special casing the error > >>> condition in get_pci_config_device() to be prepared to accept such a > >>> stream from 8.0? > >> Well, we can do that, but it is to the pci maintainers to decide if > >> that > >> is "sane". > > > > So, can we go from here somewhere? I'd love this fix to be in 8.0.1, > > either with or without the (un)sane part of the (3) variant above which > > might fail. Or else we'll have the same situation in 8.0.1 as we now > > have in 8.0.0 (the deadline is May-27). > > > > We did broke x.y.0 => x.y.1 migration before already like this, such as > > with 7.2.0=>7.2.1. I'm not saying it's a nice thing to do, just stating > > a fact. Yes, it is better to avoid such breakage, but.. meh.. > > See patch for documentation: > > https://lists.gnu.org/archive/html/qemu-devel/2023-05/msg03288.html > > Basically, the best we can do is: > - get the patch posted. Fixes everything except: > (3) qemu-8.0 -M pc-7.2 -> qemu-8.0.1 -M pc-7.2 works > > And for that, we can document somewhere that we need to launch > qemu-8.0.1 as: > > $ qemu-8.0.1 -M pc-7.2 -device blah,x-pci-err-unc-mask=on One thing we can also do to avoid it in the future is simply having someone do this check around each softfreeze (and we'll also need maintainers be careful on merging anything that's risky though after softfreeze) rather than after release (what I did for this time, which is late), try to cover as much devices as possible. I don't know whether there's a way to always cover all devices. I'll volunteer myself for that as long as I'll remember. Juan, please also have a check or remind me if I didn't. :) I am not sure whether I mentioned it somewhere before, but maybe it'll work if we can also have some way we check migrating each of the vmsd from old-qemu to new-qemu (and also new->old) covering all devices. It doesn't need to be a real migration, just generate the per-device stream and try loading on the other binary. It might be an overkill to be part of CI to check each commit, but if there's some way to check it then at least we can run it also after softfreeze. I also don't know whether it'll be easy to achieve it at all, but I'll think more about it too and update if I found something useful. > > And mark someone that this machine is tainted an can only be migrated to > qemu's >= qemu-8.0.1. And that we should reboot it as the user > convenience. (reboot here means poweroff qemu and poweron it back > without x-pci-err-unc-mask=on). > > Later, Juan. > > > >
On Thu, May 18, 2023 at 09:27:01AM -0400, Peter Xu wrote: > On Thu, May 18, 2023 at 01:33:43PM +0200, Juan Quintela wrote: > > Michael Tokarev <mjt@tls.msk.ru> wrote: > > > 11.05.2023 11:40, Juan Quintela wrote: > > >> Fiona Ebner <f.ebner@proxmox.com> wrote: > > > ... > > >>> Closes: https://gitlab.com/qemu-project/qemu/-/issues/1576 > > >>> > > >>> AFAICT, this breaks (forward) migration from 8.0 to 8.0 + this patch > > >>> when using machine type <= 7.2. That is because after this patch, when > > >>> using machine type <= 7.2, the wmask for the register is not set and > > >>> when 8.0 sends a nonzero value for the register, the error condition in > > >>> get_pci_config_device() will trigger again. > > >> I think that works correctly. > > >> See https://lists.gnu.org/archive/html/qemu-devel/2023-05/msg02733.html > > >> What we have (before this patch) (using abbrevs as in the doc > > >> before) > > >> Current state: > > >> (1) qemu-8.0 -M pc-8.0 -> qemu-8.0 -M pc-8.0 works > > >> not affected by the patch > > >> (2) qemu-7.2 -M pc-7.2 -> qemu-8.0 -M pc-8.0 works > > >> works well because 7.2 don't change that field > > >> (3) qemu-8.0 -M pc-7.2 -> qemu-7.2 -M pc-7.2 fails > > >> With the patch we fixed 3, so once it is in stable, 1 and 2 continue > > >> as > > >> usual and for (3) we will have: > > >> (3) qemu-8.0.1 -M pc-7.2 -> qemu-7.2 -M pc-7.2 works > > >> If what you mean is that: > > >> (3) qemu-8.0 -M pc-7.2 -> qemu-8.0.1 -M pc-7.2 works > > >> Will fail, that is true, but I can think a "sane" way to fix this. > > > > Hi > > > > > That's a great summary indeed. > > > > Thanks. > > > > >>> Is it necessary to also handle that? Maybe by special casing the error > > >>> condition in get_pci_config_device() to be prepared to accept such a > > >>> stream from 8.0? > > >> Well, we can do that, but it is to the pci maintainers to decide if > > >> that > > >> is "sane". > > > > > > So, can we go from here somewhere? I'd love this fix to be in 8.0.1, > > > either with or without the (un)sane part of the (3) variant above which > > > might fail. Or else we'll have the same situation in 8.0.1 as we now > > > have in 8.0.0 (the deadline is May-27). > > > > > > We did broke x.y.0 => x.y.1 migration before already like this, such as > > > with 7.2.0=>7.2.1. I'm not saying it's a nice thing to do, just stating > > > a fact. Yes, it is better to avoid such breakage, but.. meh.. > > > > See patch for documentation: > > > > https://lists.gnu.org/archive/html/qemu-devel/2023-05/msg03288.html > > > > Basically, the best we can do is: > > - get the patch posted. Fixes everything except: > > (3) qemu-8.0 -M pc-7.2 -> qemu-8.0.1 -M pc-7.2 works > > > > And for that, we can document somewhere that we need to launch > > qemu-8.0.1 as: > > > > $ qemu-8.0.1 -M pc-7.2 -device blah,x-pci-err-unc-mask=on > > One thing we can also do to avoid it in the future is simply having someone > do this check around each softfreeze (and we'll also need maintainers be > careful on merging anything that's risky though after softfreeze) rather > than after release (what I did for this time, which is late), try to cover > as much devices as possible. I don't know whether there's a way to always > cover all devices. > > I'll volunteer myself for that as long as I'll remember. Juan, please also > have a check or remind me if I didn't. :) > > I am not sure whether I mentioned it somewhere before, but maybe it'll work > if we can also have some way we check migrating each of the vmsd from > old-qemu to new-qemu (and also new->old) covering all devices. It doesn't > need to be a real migration, just generate the per-device stream and try > loading on the other binary. > > It might be an overkill to be part of CI to check each commit, but if > there's some way to check it then at least we can run it also after > softfreeze. I also don't know whether it'll be easy to achieve it at all, > but I'll think more about it too and update if I found something useful. I'm all for more testing but if it does not actually test that the values loaded are actually used correctly then the testing is of course lower quality. Better than nothing I guess ... > > > > And mark someone that this machine is tainted an can only be migrated to > > qemu's >= qemu-8.0.1. And that we should reboot it as the user > > convenience. (reboot here means poweroff qemu and poweron it back > > without x-pci-err-unc-mask=on). > > > > Later, Juan. > > > > > > > > > > -- > Peter Xu
Peter Xu <peterx@redhat.com> wrote: > On Thu, May 18, 2023 at 01:33:43PM +0200, Juan Quintela wrote: >> See patch for documentation: >> >> https://lists.gnu.org/archive/html/qemu-devel/2023-05/msg03288.html >> >> Basically, the best we can do is: >> - get the patch posted. Fixes everything except: >> (3) qemu-8.0 -M pc-7.2 -> qemu-8.0.1 -M pc-7.2 works >> >> And for that, we can document somewhere that we need to launch >> qemu-8.0.1 as: >> >> $ qemu-8.0.1 -M pc-7.2 -device blah,x-pci-err-unc-mask=on > > One thing we can also do to avoid it in the future is simply having someone > do this check around each softfreeze (and we'll also need maintainers be > careful on merging anything that's risky though after softfreeze) rather > than after release (what I did for this time, which is late), try to cover > as much devices as possible. I don't know whether there's a way to always > cover all devices. > > I'll volunteer myself for that as long as I'll remember. Juan, please also > have a check or remind me if I didn't. :) > > I am not sure whether I mentioned it somewhere before, but maybe it'll work > if we can also have some way we check migrating each of the vmsd from > old-qemu to new-qemu (and also new->old) covering all devices. It doesn't > need to be a real migration, just generate the per-device stream and try > loading on the other binary. > > It might be an overkill to be part of CI to check each commit, but if > there's some way to check it then at least we can run it also after > softfreeze. I also don't know whether it'll be easy to achieve it at all, > but I'll think more about it too and update if I found something useful. Hi Peter First, thanks for volunteering. And next, I think this is done better as part of avocado. Several reasons: - We need two different qemu's - We want to run it perhaps daily. - We want to report any problem. I will start with something really simply. Like getting the migration-test tests cases that we have, and just run them in both directions. I.e. new -> old and old -> new. That will give us a baseline: - x86_64 - i386 - aarch64 - ppc - s390 I think nothing else cares about versioned machine types right now. Once the mechanism is working and the reporting is sent somewhere, we can go from there and add machines with the devices that we care about. But just the example that I showed would have detected the problem that we are talking about. After that I would make sure that we are checking all virtio devices, with/without vhost. And once we have done that, the device authors that care about their devices will add test to the infrastructure. Later, Juan.
"Michael S. Tsirkin" <mst@redhat.com> wrote: [adding cleber] > On Thu, May 18, 2023 at 09:27:01AM -0400, Peter Xu wrote: >> One thing we can also do to avoid it in the future is simply having someone >> do this check around each softfreeze (and we'll also need maintainers be >> careful on merging anything that's risky though after softfreeze) rather >> than after release (what I did for this time, which is late), try to cover >> as much devices as possible. I don't know whether there's a way to always >> cover all devices. >> >> I'll volunteer myself for that as long as I'll remember. Juan, please also >> have a check or remind me if I didn't. :) >> >> I am not sure whether I mentioned it somewhere before, but maybe it'll work >> if we can also have some way we check migrating each of the vmsd from >> old-qemu to new-qemu (and also new->old) covering all devices. It doesn't >> need to be a real migration, just generate the per-device stream and try >> loading on the other binary. >> >> It might be an overkill to be part of CI to check each commit, but if >> there's some way to check it then at least we can run it also after >> softfreeze. I also don't know whether it'll be easy to achieve it at all, >> but I'll think more about it too and update if I found something useful. > > I'm all for more testing but if it does not actually test that the > values loaded are actually used correctly then the testing is of course > lower quality. Better than nothing I guess ... As said in my other email. The problem that I have is that there is no easy way to test two different qemu binaries. And what we would like is testing at least: qemu-8.0 -M pc-8.0 -> qemu-latest -M pc-8.0 qemu-latest -M pc-8.0 -> qemu-8.0 -M pc-8.0 Then we have the probem of: - what architectures do we care? My bet is on x86_64, i386? (almost free with previous one), ppc64, aarch64, s390. - what devices do we care? virtio-devices for sure. Not sure with others. Notice that if we want the test to run on CI, they need to be virtual devices. Once the mechanism of testing and reporting is there, we can decide: - what other devices we care no good ideas here: SATA, e1000e perhaps. - what other versions we care perhaps it is not out of qestion to test the last two versions. i.e. to add: $ qemu-7.2 -M pc-7.2 -> qemu-latest -M pc-7.2 $ qemu-latest -M pc-7.2 -> qemu-7.2 -M pc-7.2 But we haven't been very good just allowing the latest version to add the latests two. Cleber, any idea how difficult would be to add something like that? For reference, we are trying to test, detect the failures described here: https://lists.gnu.org/archive/html/qemu-devel/2023-05/msg03285.html See the three emails. Thanks, Juan.
On Thu, May 11, 2023 at 13:43:47 +0200, Juan Quintela wrote: > "Michael S. Tsirkin" <mst@redhat.com> wrote: > > [Added libvirt people to the party, see the end of the message ] Sorry, I'm not that much into parties :-) > That would fix the: > > qemu-8.0 -M pc-7.2 -> qemu-8.0.1 -M pc-7.2 > > It is worth it? Dunno. That is my question. > > And knowing from what qemu it has migrated from would not help. We > would need to add a new tweak and means: > > This is a pc-7.2 machine that has been isntantiated in a qemu-8.0 and > has the pciaerr bug. But wait, we have _that_. > > And it is called > > + { TYPE_PCI_DEVICE, "x-pcie-err-unc-mask", "off" }, > > from the patch. > > We can teach libvirt about this glitch, and if he is migrating a pc-7.2 > machine in qemu-8.0 machine, And they want to migrate to a new qemu > (call it qemu-8.1), it needs to be started: > > qemu-8.1 -M pc-7.2 <whatever pci devices need to do>,x-pci-err-unc-mask="true" > > Until the user reboots it and then that property can be reset to default > value. Hmm and what would happen if eventually this machine gets migrated back to qemu-8.0? Or even when the machine is stopped, started again, and then migrated to qemu-8.0? Jirka
Jiri Denemark <jdenemar@redhat.com> wrote: > On Thu, May 11, 2023 at 13:43:47 +0200, Juan Quintela wrote: >> "Michael S. Tsirkin" <mst@redhat.com> wrote: >> >> [Added libvirt people to the party, see the end of the message ] > > Sorry, I'm not that much into parties :-) > >> That would fix the: >> >> qemu-8.0 -M pc-7.2 -> qemu-8.0.1 -M pc-7.2 >> >> It is worth it? Dunno. That is my question. >> >> And knowing from what qemu it has migrated from would not help. We >> would need to add a new tweak and means: >> >> This is a pc-7.2 machine that has been isntantiated in a qemu-8.0 and >> has the pciaerr bug. But wait, we have _that_. >> >> And it is called >> >> + { TYPE_PCI_DEVICE, "x-pcie-err-unc-mask", "off" }, >> >> from the patch. >> >> We can teach libvirt about this glitch, and if he is migrating a pc-7.2 >> machine in qemu-8.0 machine, And they want to migrate to a new qemu >> (call it qemu-8.1), it needs to be started: >> >> qemu-8.1 -M pc-7.2 <whatever pci devices need to do>,x-pci-err-unc-mask="true" >> >> Until the user reboots it and then that property can be reset to default >> value. > > Hmm and what would happen if eventually this machine gets migrated back > to qemu-8.0? It works. migrating to qemu-7.2 is what is not going to work. To migrate to qemu-8.0, you just need to drop the "x-pci-err-unc-mask=true" bit. And it would work. So, to be clear, this machine can migrate to: - qemu-8.0, you just need to drop the "x-pci-err-unc-mask=true" bit - qemu-8.0.1 or newer, you just need to maintain the "x-pci-err-unc-mask=true" bit. Let's just assume that qemu-7.2.1 don't get the "x-pci-err-unc-mask=true" bit, so it will not be able to migrate there. > Or even when the machine is stopped, started again, and > then migrated to qemu-8.0? If you do what I call a hard reset (i.e. poweroff + poweron so qemu dies), you should drop the "x-pci-err-unc-mask=true" bit. And then you can migrate to qemu-7.2 and all qemu-8.0.1 and newer. Basically what we need is a "mark" inside libvirt that means something like: - this is weird machine that looks like pc-7.2 - but has "x-pci-err-unc-mask=true" - so it can only migrate to qemu-8.0 and newer. - but if it even reboots in qemu-8.0.1 or newer, we want it back to become a "normal" pc-7.2 machine (i.e. drop the x-pci-err-unc-mask=true). That would be the perfect world. But as we are in an imperfect world, something like: - this machine started in qemu-8.0 -M pc-7.2, we know this is broken and it can't migrate outside of qemu-8.0 because it would fail to go to either qemu-7.2 or qemu-8.0.1. I would argue that if you do the second option doing the "right" option i.e. the first one is not much more complicated, but that is a question that you should be better to answer. And then we have the other Michael question. How can we export that information so libvirt can use it. In this case we can comunicate libvirt: - In qemu-8.0 we broke pc-7.2. - The problem is fixed in qemu-8.0.1 using property "x-pci-err-unc-mask=false". - You can migrate from qemu-8.0 in newer if you set that property as true. - Guests started in qemu-8.0 -M pc-7.2 should reboot in qemu-8.0.1 or newer to become "normal pc-7.2". - If we publish this on qemu, we can only publish it on qemu-8.0.1 and newer. - Or we can publish it somewhere else and any libvirt can take this information. - Or we can comunicate this to libvirt, and they incorporate it on their source anywhere that you see fit. The point here is that when we use a property on a machine type, it can be for two reasons: - We detected at the right time that we changed the value of something, and we did the right thing on hw_compat_X_Y, so libvirt needs to do nothing. - We *DID NOT* detect that we broke compatibility before release, and we need to make a property to identify that problem. This is where we need to do this dance. Notice that normally we detect lots of problems during development and this *should* not happen. But when it happens, we need to be able to do something. And also notice that normally we broke just some device, not a whole machine type. But as you can see we have broke it this time. We are trying to automate the detection of this kind of failures, but we are still on design stage, so we need to plan how to handle this. Any comments? Later, Juan.
On Fri, May 26, 2023 at 09:55:22AM +0200, Juan Quintela wrote: > Jiri Denemark <jdenemar@redhat.com> wrote: > > On Thu, May 11, 2023 at 13:43:47 +0200, Juan Quintela wrote: > >> "Michael S. Tsirkin" <mst@redhat.com> wrote: > >> > >> [Added libvirt people to the party, see the end of the message ] > > > > Sorry, I'm not that much into parties :-) > > > >> That would fix the: > >> > >> qemu-8.0 -M pc-7.2 -> qemu-8.0.1 -M pc-7.2 > >> > >> It is worth it? Dunno. That is my question. > >> > >> And knowing from what qemu it has migrated from would not help. We > >> would need to add a new tweak and means: > >> > >> This is a pc-7.2 machine that has been isntantiated in a qemu-8.0 and > >> has the pciaerr bug. But wait, we have _that_. > >> > >> And it is called > >> > >> + { TYPE_PCI_DEVICE, "x-pcie-err-unc-mask", "off" }, > >> > >> from the patch. > >> > >> We can teach libvirt about this glitch, and if he is migrating a pc-7.2 > >> machine in qemu-8.0 machine, And they want to migrate to a new qemu > >> (call it qemu-8.1), it needs to be started: > >> > >> qemu-8.1 -M pc-7.2 <whatever pci devices need to do>,x-pci-err-unc-mask="true" > >> > >> Until the user reboots it and then that property can be reset to default > >> value. > > > > Hmm and what would happen if eventually this machine gets migrated back > > to qemu-8.0? > > It works. > migrating to qemu-7.2 is what is not going to work. > To migrate to qemu-8.0, you just need to drop the > "x-pci-err-unc-mask=true" bit. And it would work. > > So, to be clear, this machine can migrate to: > > - qemu-8.0, you just need to drop the "x-pci-err-unc-mask=true" bit > > - qemu-8.0.1 or newer, you just need to maintain the > "x-pci-err-unc-mask=true" bit. > > Let's just assume that qemu-7.2.1 don't get the > "x-pci-err-unc-mask=true" bit, so it will not be able to migrate there. > > > > Or even when the machine is stopped, started again, and > > then migrated to qemu-8.0? > > If you do what I call a hard reset (i.e. poweroff + poweron so qemu > dies), you should drop the "x-pci-err-unc-mask=true" bit. And then you > can migrate to qemu-7.2 and all qemu-8.0.1 and newer. > > Basically what we need is a "mark" inside libvirt that means something > like: > > - this is weird machine that looks like pc-7.2 > - but has "x-pci-err-unc-mask=true" > - so it can only migrate to qemu-8.0 and newer. > - but if it even reboots in qemu-8.0.1 or newer, we want it back to > become a "normal" pc-7.2 machine (i.e. drop the > x-pci-err-unc-mask=true). > > That would be the perfect world. But as we are in an imperfect world, > something like: > > - this machine started in qemu-8.0 -M pc-7.2, we know this is broken and > it can't migrate outside of qemu-8.0 because it would fail to go to > either qemu-7.2 or qemu-8.0.1. > > I would argue that if you do the second option doing the "right" option > i.e. the first one is not much more complicated, but that is a question > that you should be better to answer. > > And then we have the other Michael question. How can we export that > information so libvirt can use it. > > In this case we can comunicate libvirt: > - In qemu-8.0 we broke pc-7.2. > - The problem is fixed in qemu-8.0.1 using property > "x-pci-err-unc-mask=false". > - You can migrate from qemu-8.0 in newer if you set that property as > true. > - Guests started in qemu-8.0 -M pc-7.2 should reboot in qemu-8.0.1 or > newer to become "normal pc-7.2". > - If we publish this on qemu, we can only publish it on qemu-8.0.1 and > newer. > - Or we can publish it somewhere else and any libvirt can take this > information. > - Or we can comunicate this to libvirt, and they incorporate it on their > source anywhere that you see fit. And this is not an isolated instance. There are things like this in almost each release. My suggestion is a package with known bugs like this. It would list these work arounds in some machine readable format and would be essentially append only, making it relatively safe even for very old RHEL distros to pick up the latest version once in a while. E.g. the fact we add bug workaround for 10.0 will not affect 7.2 so you do not need to fork with each release. > The point here is that when we use a property on a machine type, it can > be for two reasons: > > - We detected at the right time that we changed the value of something, > and we did the right thing on hw_compat_X_Y, so libvirt needs to do > nothing. > > - We *DID NOT* detect that we broke compatibility before release, and we > need to make a property to identify that problem. This is where we > need to do this dance. > > Notice that normally we detect lots of problems during development and > this *should* not happen. But when it happens, we need to be able to do > something. > > And also notice that normally we broke just some device, not a whole > machine type. But as you can see we have broke it this time. We are > trying to automate the detection of this kind of failures, but we are > still on design stage, so we need to plan how to handle this. > > Any comments? > > Later, Juan. > > > >
diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h index 935b4b91b4..e6d0574a29 100644 --- a/include/hw/pci/pci.h +++ b/include/hw/pci/pci.h @@ -207,6 +207,8 @@ enum { QEMU_PCIE_EXTCAP_INIT = (1 << QEMU_PCIE_EXTCAP_INIT_BITNR), #define QEMU_PCIE_CXL_BITNR 10 QEMU_PCIE_CAP_CXL = (1 << QEMU_PCIE_CXL_BITNR), +#define QEMU_PCIE_ERR_UNC_MASK_BITNR 11 + QEMU_PCIE_ERR_UNC_MASK = (1 << QEMU_PCIE_ERR_UNC_MASK_BITNR), }; typedef struct PCIINTxRoute { diff --git a/hw/core/machine.c b/hw/core/machine.c index 47a34841a5..07f763eb2e 100644 --- a/hw/core/machine.c +++ b/hw/core/machine.c @@ -48,6 +48,7 @@ GlobalProperty hw_compat_7_2[] = { { "e1000e", "migrate-timadj", "off" }, { "virtio-mem", "x-early-migration", "false" }, { "migration", "x-preempt-pre-7-2", "true" }, + { TYPE_PCI_DEVICE, "x-pcie-err-unc-mask", "off" }, }; const size_t hw_compat_7_2_len = G_N_ELEMENTS(hw_compat_7_2); diff --git a/hw/pci/pci.c b/hw/pci/pci.c index 8a87ccc8b0..5153ad63d6 100644 --- a/hw/pci/pci.c +++ b/hw/pci/pci.c @@ -79,6 +79,8 @@ static Property pci_props[] = { DEFINE_PROP_STRING("failover_pair_id", PCIDevice, failover_pair_id), DEFINE_PROP_UINT32("acpi-index", PCIDevice, acpi_index, 0), + DEFINE_PROP_BIT("x-pcie-err-unc-mask", PCIDevice, cap_present, + QEMU_PCIE_ERR_UNC_MASK_BITNR, true), DEFINE_PROP_END_OF_LIST() }; diff --git a/hw/pci/pcie_aer.c b/hw/pci/pcie_aer.c index 103667c368..374d593ead 100644 --- a/hw/pci/pcie_aer.c +++ b/hw/pci/pcie_aer.c @@ -112,10 +112,13 @@ int pcie_aer_init(PCIDevice *dev, uint8_t cap_ver, uint16_t offset, pci_set_long(dev->w1cmask + offset + PCI_ERR_UNCOR_STATUS, PCI_ERR_UNC_SUPPORTED); - pci_set_long(dev->config + offset + PCI_ERR_UNCOR_MASK, - PCI_ERR_UNC_MASK_DEFAULT); - pci_set_long(dev->wmask + offset + PCI_ERR_UNCOR_MASK, - PCI_ERR_UNC_SUPPORTED); + + if (dev->cap_present & QEMU_PCIE_ERR_UNC_MASK) { + pci_set_long(dev->config + offset + PCI_ERR_UNCOR_MASK, + PCI_ERR_UNC_MASK_DEFAULT); + pci_set_long(dev->wmask + offset + PCI_ERR_UNCOR_MASK, + PCI_ERR_UNC_SUPPORTED); + } pci_set_long(dev->config + offset + PCI_ERR_UNCOR_SEVER, PCI_ERR_UNC_SEVERITY_DEFAULT);
Since it's implementation on v8.0.0-rc0, having the PCI_ERR_UNCOR_MASK set for machine types < 8.0 will cause migration to fail if the target QEMU version is < 8.0.0 : qemu-system-x86_64: get_pci_config_device: Bad config data: i=0x10a read: 40 device: 0 cmask: ff wmask: 0 w1cmask:0 qemu-system-x86_64: Failed to load PCIDevice:config qemu-system-x86_64: Failed to load e1000e:parent_obj qemu-system-x86_64: error while loading state for instance 0x0 of device '0000:00:02.0/e1000e' qemu-system-x86_64: load of migration failed: Invalid argument The above test migrated a 7.2 machine type from QEMU master to QEMU 7.2.0, with this cmdline: ./qemu-system-x86_64 -M pc-q35-7.2 [-incoming XXX] In order to fix this, property x-pcie-err-unc-mask was introduced to control when PCI_ERR_UNCOR_MASK is enabled. This property is enabled by default, but is disabled if machine type <= 7.2. Fixes: 010746ae1d ("hw/pci/aer: Implement PCI_ERR_UNCOR_MASK register") Suggested-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Leonardo Bras <leobras@redhat.com> --- include/hw/pci/pci.h | 2 ++ hw/core/machine.c | 1 + hw/pci/pci.c | 2 ++ hw/pci/pcie_aer.c | 11 +++++++---- 4 files changed, 12 insertions(+), 4 deletions(-)