| Message ID | 20170919164952.13200.78849.stgit@gimli.home (mailing list archive) |
|---|---|
| State | New, archived |
Hi Alex,

On 19/09/2017 18:58, Alex Williamson wrote:
> With virtual PCI-Express chipsets, we now see userspace/guest drivers
> trying to match the physical MPS setting to a virtual downstream port.
> Of course a lone physical device surrounded by virtual interconnects
> cannot make a correct decision for a proper MPS setting. Instead,
> let's virtualize the MPS control register so that writes through to
> hardware are disallowed. Userspace drivers like QEMU assume they can
> write anything to the device and we'll filter out anything dangerous.
> Since mismatched MPS can lead to AER and other faults, let's add it
> to the kernel side rather than relying on userspace virtualization to
> handle it.
>
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> ---
>
> Do we have any reason to suspect that a userspace driver has any
> dependencies on the physical MPS setting or is this only tuning the
> protocol layer and it's transparent to the driver? Note that per the
> PCI spec, a device supporting only 128B MPS can hardwire the control
> register to 000b, but it doesn't seem PCIe compliant to hardwire it to
> any given value, such as would be the appearance if we exposed this as
> a read-only register rather than virtualizing it. QEMU would then be
> responsible for virtualizing it, which makes coordinating the upgrade
> troublesome.
>
> drivers/vfio/pci/vfio_pci_config.c | 6 ++++--
> 1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
> index 5628fe114347..91335e6de88a 100644
> --- a/drivers/vfio/pci/vfio_pci_config.c
> +++ b/drivers/vfio/pci/vfio_pci_config.c
> @@ -849,11 +849,13 @@ static int __init init_pci_cap_exp_perm(struct perm_bits *perm)
>
>  	/*
>  	 * Allow writes to device control fields, except devctl_phantom,
> -	 * which could confuse IOMMU, and the ARI bit in devctl2, which
> +	 * which could confuse IOMMU, MPS, which can break communication
> +	 * with other physical devices, and the ARI bit in devctl2, which
>  	 * is set at probe time. FLR gets virtualized via our writefn.
>  	 */
>  	p_setw(perm, PCI_EXP_DEVCTL,
> -	       PCI_EXP_DEVCTL_BCR_FLR, ~PCI_EXP_DEVCTL_PHANTOM);
> +	       PCI_EXP_DEVCTL_BCR_FLR | PCI_EXP_DEVCTL_PAYLOAD,
> +	       ~PCI_EXP_DEVCTL_PHANTOM);
>  	p_setw(perm, PCI_EXP_DEVCTL2, NO_VIRT, ~PCI_EXP_DEVCTL2_ARI);

Is it correct that the read value still will be the one written by the
guest? I see the MMRS can take the read MPS value in some pcie_bus_config
values. So a consequence could be that the applied MMRS (which is not
virtualized) is lower than what is set by host, due to a guest pcie root
port MPSS for instance.

So if the above is not totally wrong, shouldn't we virtualize MMRS as well?

Thanks

Eric

> 	return 0;
> }
>
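The p_setw() masks in the patch drive vfio's split-write scheme: bits in the virt mask are emulated in a shadow copy of config space (vconfig), while writable bits outside it pass through to hardware. The following is a minimal, simplified sketch of that filtering, not the kernel's actual vfio_default_config_write() (widths and the standalone register pointers are assumptions for illustration):

```c
#include <assert.h>
#include <stdint.h>

#define PCI_EXP_DEVCTL_PHANTOM 0x0200  /* Phantom Functions Enable */
#define PCI_EXP_DEVCTL_PAYLOAD 0x00e0  /* Max_Payload_Size field   */

/*
 * Split a guest write between a shadow copy and real hardware.
 * 'virt' marks virtualized bits (emulated only); 'write' marks
 * writable bits.  Virtualized writable bits land in *vconfig,
 * writable non-virtualized bits land in *hwreg; all other bits
 * are preserved unchanged.
 */
static void config_write(uint16_t val, uint16_t virt, uint16_t write,
                         uint16_t *vconfig, uint16_t *hwreg)
{
    uint16_t emu_mask = virt & write;   /* emulated bits     */
    uint16_t hw_mask  = write & ~virt;  /* pass-through bits */

    *vconfig = (*vconfig & ~emu_mask) | (val & emu_mask);
    *hwreg   = (*hwreg   & ~hw_mask)  | (val & hw_mask);
}
```

With virt including PCI_EXP_DEVCTL_PAYLOAD (as the patch adds) and write set to ~PCI_EXP_DEVCTL_PHANTOM, a guest write to the MPS field changes only the shadow copy, so the guest reads back the value it wrote while the physical MPS stays untouched.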
[cc +linux-pci, Sinan]

On Tue, 19 Sep 2017 19:50:37 +0200
Auger Eric <eric.auger@redhat.com> wrote:

> Hi Alex,
>
> On 19/09/2017 18:58, Alex Williamson wrote:
> > With virtual PCI-Express chipsets, we now see userspace/guest drivers
> > trying to match the physical MPS setting to a virtual downstream port.
> > Of course a lone physical device surrounded by virtual interconnects
> > cannot make a correct decision for a proper MPS setting. Instead,
> > let's virtualize the MPS control register so that writes through to
> > hardware are disallowed. Userspace drivers like QEMU assume they can
> > write anything to the device and we'll filter out anything dangerous.
> > Since mismatched MPS can lead to AER and other faults, let's add it
> > to the kernel side rather than relying on userspace virtualization to
> > handle it.
> >
> > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > ---
> >
> > Do we have any reason to suspect that a userspace driver has any
> > dependencies on the physical MPS setting or is this only tuning the
> > protocol layer and it's transparent to the driver? Note that per the
> > PCI spec, a device supporting only 128B MPS can hardwire the control
> > register to 000b, but it doesn't seem PCIe compliant to hardwire it to
> > any given value, such as would be the appearance if we exposed this as
> > a read-only register rather than virtualizing it. QEMU would then be
> > responsible for virtualizing it, which makes coordinating the upgrade
> > troublesome.
> >
> > drivers/vfio/pci/vfio_pci_config.c | 6 ++++--
> > 1 file changed, 4 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
> > index 5628fe114347..91335e6de88a 100644
> > --- a/drivers/vfio/pci/vfio_pci_config.c
> > +++ b/drivers/vfio/pci/vfio_pci_config.c
> > @@ -849,11 +849,13 @@ static int __init init_pci_cap_exp_perm(struct perm_bits *perm)
> >
> >  	/*
> >  	 * Allow writes to device control fields, except devctl_phantom,
> > -	 * which could confuse IOMMU, and the ARI bit in devctl2, which
> > +	 * which could confuse IOMMU, MPS, which can break communication
> > +	 * with other physical devices, and the ARI bit in devctl2, which
> >  	 * is set at probe time. FLR gets virtualized via our writefn.
> >  	 */
> >  	p_setw(perm, PCI_EXP_DEVCTL,
> > -	       PCI_EXP_DEVCTL_BCR_FLR, ~PCI_EXP_DEVCTL_PHANTOM);
> > +	       PCI_EXP_DEVCTL_BCR_FLR | PCI_EXP_DEVCTL_PAYLOAD,
> > +	       ~PCI_EXP_DEVCTL_PHANTOM);
> >  	p_setw(perm, PCI_EXP_DEVCTL2, NO_VIRT, ~PCI_EXP_DEVCTL2_ARI);
> Is it correct that the read value still will be the one written by the
> guest?

If we don't do this, the register is read-only, which is what I'm
suggesting in my comment above doesn't seem spec compliant. If the
kernel makes it read-only, then userspace (ex. QEMU) would need to
virtualize it, and we get into a more complicated, two part fix.

> I see the MMRS can take the read MPS value in some pcie_bus_config
> values. So a consequence could be that the applied MMRS (which is not
> virtualized) is lower than what is set by host, due to a guest pcie root
> port MPSS for instance.
>
> So if the above is not totally wrong, shouldn't we virtualize MMRS as well?

I think MPSS is Maximum Payload Size Supported and MMRS... did you mean
MRRS (Maximum Read Request Size)?

My impression is that MRRS is predominantly device and driver
dependent, not topology dependent. A device can send a read request
with a size larger than MPS, which implies that the device supplying
the read data would split it into multiple TLPs based on MPS. The
implementation note in PCIe rev3.1 sec 7.8.4 suggests there may be QoS
implications of MRRS such that if MRRS>MPS, then the device requesting
the data may use fewer tokens for the request. I have no idea how we
could infer any sort of QoS policy though and I don't see that
in-kernel drivers are bound to any sort of policy.

The pcie_write_mrrs() function also implies a protocol (which I can't
find a spec reference to) where it will re-try writing MRRS if the
value written doesn't stick (and generate a dev_warn on each
iteration). Unless we want extra warnings in VMs, that suggests we
don't want this to be read-only.

So I think we need to allow MRRS writes through to hardware, but I do
question whether we need to fill the gap that a VM might casually write
MRRS to a value less than the physical MPS. This is what we'd expect
today where the virtual topology has an MPS of 128B regardless of the
physical topology. Do we virtualize MRRS writes such that an MRRS
value less than the physical MPS value never reaches hardware? Can we
assume that's always invalid? Thanks,

Alex
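For reference, MPS and MRRS are separate fields of the same PCIe Device Control register: Max_Payload_Size occupies bits 7:5 and Max_Read_Request_Size bits 14:12, and both encode powers of two starting at 128 bytes. A small sketch of the decoding (the mask values match the kernel's PCI_EXP_DEVCTL_* definitions; the helper names are made up here):

```c
#include <assert.h>
#include <stdint.h>

#define PCI_EXP_DEVCTL_PAYLOAD 0x00e0  /* Max_Payload_Size, bits 7:5       */
#define PCI_EXP_DEVCTL_READRQ  0x7000  /* Max_Read_Request_Size, bits 14:12 */

/* Both fields encode 128 << n bytes, n = 0..5 (128B .. 4096B). */
static unsigned int mps_bytes(uint16_t devctl)
{
    return 128u << ((devctl & PCI_EXP_DEVCTL_PAYLOAD) >> 5);
}

static unsigned int mrrs_bytes(uint16_t devctl)
{
    return 128u << ((devctl & PCI_EXP_DEVCTL_READRQ) >> 12);
}
```

So a DEVCTL value of 0x2020 decodes to MPS = 256B and MRRS = 512B, which shows the two knobs really are independently programmable.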
Hi Alex,

On 19/09/2017 22:20, Alex Williamson wrote:
> [cc +linux-pci, Sinan]
>
> On Tue, 19 Sep 2017 19:50:37 +0200
> Auger Eric <eric.auger@redhat.com> wrote:
>
>> Hi Alex,
>>
>> On 19/09/2017 18:58, Alex Williamson wrote:
>>> With virtual PCI-Express chipsets, we now see userspace/guest drivers
>>> trying to match the physical MPS setting to a virtual downstream port.
>>> Of course a lone physical device surrounded by virtual interconnects
>>> cannot make a correct decision for a proper MPS setting. Instead,
>>> let's virtualize the MPS control register so that writes through to
>>> hardware are disallowed. Userspace drivers like QEMU assume they can
>>> write anything to the device and we'll filter out anything dangerous.
>>> Since mismatched MPS can lead to AER and other faults, let's add it
>>> to the kernel side rather than relying on userspace virtualization to
>>> handle it.
>>>
>>> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
>>> ---
>>>
>>> Do we have any reason to suspect that a userspace driver has any
>>> dependencies on the physical MPS setting or is this only tuning the
>>> protocol layer and it's transparent to the driver? Note that per the
>>> PCI spec, a device supporting only 128B MPS can hardwire the control
>>> register to 000b, but it doesn't seem PCIe compliant to hardwire it to
>>> any given value, such as would be the appearance if we exposed this as
>>> a read-only register rather than virtualizing it. QEMU would then be
>>> responsible for virtualizing it, which makes coordinating the upgrade
>>> troublesome.
>>>
>>> drivers/vfio/pci/vfio_pci_config.c | 6 ++++--
>>> 1 file changed, 4 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
>>> index 5628fe114347..91335e6de88a 100644
>>> --- a/drivers/vfio/pci/vfio_pci_config.c
>>> +++ b/drivers/vfio/pci/vfio_pci_config.c
>>> @@ -849,11 +849,13 @@ static int __init init_pci_cap_exp_perm(struct perm_bits *perm)
>>>
>>>  	/*
>>>  	 * Allow writes to device control fields, except devctl_phantom,
>>> -	 * which could confuse IOMMU, and the ARI bit in devctl2, which
>>> +	 * which could confuse IOMMU, MPS, which can break communication
>>> +	 * with other physical devices, and the ARI bit in devctl2, which
>>>  	 * is set at probe time. FLR gets virtualized via our writefn.
>>>  	 */
>>>  	p_setw(perm, PCI_EXP_DEVCTL,
>>> -	       PCI_EXP_DEVCTL_BCR_FLR, ~PCI_EXP_DEVCTL_PHANTOM);
>>> +	       PCI_EXP_DEVCTL_BCR_FLR | PCI_EXP_DEVCTL_PAYLOAD,
>>> +	       ~PCI_EXP_DEVCTL_PHANTOM);
>>>  	p_setw(perm, PCI_EXP_DEVCTL2, NO_VIRT, ~PCI_EXP_DEVCTL2_ARI);
>> Is it correct that the read value still will be the one written by the
>> guest?
>
> If we don't do this, the register is read-only, which is what I'm
> suggesting in my comment above doesn't seem spec compliant. If the
> kernel makes it read-only, then userspace (ex. QEMU) would need to
> virtualize it, and we get into a more complicated, two part fix.
>
>> I see the MMRS can take the read MPS value in some pcie_bus_config
>> values. So a consequence could be that the applied MMRS (which is not
>> virtualized) is lower than what is set by host, due to a guest pcie root
>> port MPSS for instance.
>>
>> So if the above is not totally wrong, shouldn't we virtualize MMRS as well?
>
> I think MPSS is Maximum Payload Size Supported and MMRS... did you mean
> MRRS (Maximum Read Request Size)?

Yes, sorry for the typo. Indeed I meant MRRS.

> My impression is that MRRS is predominantly device and driver
> dependent, not topology dependent. A device can send a read request
> with a size larger than MPS, which implies that the device supplying
> the read data would split it into multiple TLPs based on MPS.

I read that too on the net. However in 6.3.4.1 (3.0, Nov 10), Rules for
SW Configuration, it is written:

"Software must set Max_Read_Request_Size of an isochronous-configured
device with a value that does not exceed the Max_Payload_Size set for
the device."

But on the other hand some drivers are setting the MRRS directly
without further checking the MPS?

> The implementation note in PCIe rev3.1 sec 7.8.4 suggests there may be QoS
> implications of MRRS such that if MRRS>MPS, then the device requesting
> the data may use fewer tokens for the request. I have no idea how we
> could infer any sort of QoS policy though and I don't see that
> in-kernel drivers are bound to any sort of policy.
>
> The pcie_write_mrrs() function also implies a protocol (which I can't
> find a spec reference to) where it will re-try writing MRRS if the
> value written doesn't stick (and generate a dev_warn on each
> iteration). Unless we want extra warnings in VMs, that suggests we
> don't want this to be read-only.
>
> So I think we need to allow MRRS writes through to hardware, but I do
> question whether we need to fill the gap that a VM might casually write
> MRRS to a value less than the physical MPS. This is what we'd expect
> today where the virtual topology has an MPS of 128B regardless of the
> physical topology. Do we virtualize MRRS writes such that an MRRS
> value less than the physical MPS value never reaches hardware? Can we
> assume that's always invalid? Thanks,

Thanks

Eric

> Alex
>
On 9/20/2017 3:59 AM, Auger Eric wrote:
>> My impression is that MRRS is predominantly device and driver
>> dependent, not topology dependent. A device can send a read request
>> with a size larger than MPS, which implies that the device supplying
>> the read data would split it into multiple TLPs based on MPS.
> I read that too on the net. However in 6.3.4.1 (3.0, Nov 10), Rules for
> SW Configuration, it is written:
> "Software must set Max_Read_Request_Size of an isochronous-configured
> device with a value that does not exceed the Max_Payload_Size set for
> the device."
>
> But on the other hand some drivers are setting the MRRS directly
> without further checking the MPS?

We discussed this on LPC. MRRS and MPS are two independent concepts and
are not related to each other under normal circumstances.

The only valid criteria is that MRRS needs to be a multiple of MPS.

https://linuxplumbersconf.org/2017/ocw//system/presentations/4732/original/crs.pdf

Because completions are required to be a minimum of MPS size. If MRRS > MPS,
read response is sent as multiple completions.

The only reason you want to match MRRS==MPS is that you don't want a single
device to hog system resources. You can have MRRS 4k and MPS 128 bytes.
Completions come in as 128 x N packets on the PCI bus.

If you are sharing the same PCI bus with some other device or switch, you are
effectively stalling other devices. That's why isochronous devices are
requesting small MRRS.

An isochronous device is an exception, not the norm.
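Sinan's 4k/128B example is just completion-splitting arithmetic: a read request of up to MRRS bytes comes back as completions no larger than MPS. A sketch of the count (helper name is made up for illustration):

```c
#include <assert.h>

/*
 * A read request of 'read_size' bytes (itself capped by MRRS) is
 * returned as completion TLPs no larger than MPS, so the link sees
 * roughly this many completions per request.
 */
static unsigned int completions_per_read(unsigned int read_size,
                                         unsigned int mps)
{
    return (read_size + mps - 1) / mps;  /* round up */
}
```

With MRRS = 4k and MPS = 128B a single read monopolizes the link for 32 back-to-back completions, which is exactly the resource-hogging concern behind matching MRRS to MPS.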
On 9/20/2017 9:01 AM, Sinan Kaya wrote:
> The only valid criteria is that MRRS needs to be a multiple of MPS.
>
> https://linuxplumbersconf.org/2017/ocw//system/presentations/4732/original/crs.pdf

Apologies for sending the wrong link. Correcting it.

https://linuxplumbersconf.org/2017/ocw//system/presentations/4737/original/mps.pdf
Hi Sinan,

On 20/09/2017 15:01, Sinan Kaya wrote:
> On 9/20/2017 3:59 AM, Auger Eric wrote:
>>> My impression is that MRRS is predominantly device and driver
>>> dependent, not topology dependent. A device can send a read request
>>> with a size larger than MPS, which implies that the device supplying
>>> the read data would split it into multiple TLPs based on MPS.
>> I read that too on the net. However in 6.3.4.1 (3.0, Nov 10), Rules for
>> SW Configuration, it is written:
>> "Software must set Max_Read_Request_Size of an isochronous-configured
>> device with a value that does not exceed the Max_Payload_Size set for
>> the device."
>>
>> But on the other hand some drivers are setting the MRRS directly
>> without further checking the MPS?
>
> We discussed this on LPC. MRRS and MPS are two independent concepts and
> are not related to each other under normal circumstances.
>
> The only valid criteria is that MRRS needs to be a multiple of MPS.
>
> https://linuxplumbersconf.org/2017/ocw//system/presentations/4732/original/crs.pdf
>
> Because completions are required to be a minimum of MPS size. If MRRS > MPS,
> read response is sent as multiple completions.

With that patch, you can end up with MRRS < MPS. Do I understand
correctly this is an issue?

Thanks

Eric

> The only reason you want to match MRRS==MPS is that you don't want a single
> device to hog system resources. You can have MRRS 4k and MPS 128 bytes.
> Completions come in as 128 x N packets on the PCI bus.
>
> If you are sharing the same PCI bus with some other device or switch, you are
> effectively stalling other devices. That's why isochronous devices are
> requesting small MRRS.
>
> An isochronous device is an exception, not the norm.
On 9/20/2017 10:26 AM, Auger Eric wrote:
>> Because completions are required to be a minimum of MPS size. If MRRS > MPS,
>> read response is sent as multiple completions.
> With that patch, you can end up with MRRS < MPS. Do I understand
> correctly this is an issue?

To give the right context, I tried to mean that you can't use MRRS as a
length value for completions if MRRS is a fairly large number and bigger
than MPS. Therefore, a single completion packet size needs to be less
than MRRS and cannot exceed MPS. Spec is even calling for receivers to
throw an error if completion length is greater than MPS.

single completion <= MPS (128) <= MRRS (4k)

I looked at the spec again. I have not seen anything in the spec that
prohibits MRRS < MPS as long as completion size is less than MPS.

single completion <= MRRS (128) < MPS (256)

I think this is also valid. I hope I got it right this time.
On Wed, 20 Sep 2017 16:26:25 +0200
Auger Eric <eric.auger@redhat.com> wrote:

> Hi Sinan,
>
> On 20/09/2017 15:01, Sinan Kaya wrote:
> > On 9/20/2017 3:59 AM, Auger Eric wrote:
> >>> My impression is that MRRS is predominantly device and driver
> >>> dependent, not topology dependent. A device can send a read request
> >>> with a size larger than MPS, which implies that the device supplying
> >>> the read data would split it into multiple TLPs based on MPS.
> >> I read that too on the net. However in 6.3.4.1 (3.0, Nov 10), Rules for
> >> SW Configuration, it is written:
> >> "Software must set Max_Read_Request_Size of an isochronous-configured
> >> device with a value that does not exceed the Max_Payload_Size set for
> >> the device."
> >>
> >> But on the other hand some drivers are setting the MRRS directly
> >> without further checking the MPS?
> >
> > We discussed this on LPC. MRRS and MPS are two independent concepts and
> > are not related to each other under normal circumstances.
> >
> > The only valid criteria is that MRRS needs to be a multiple of MPS.
> >
> > https://linuxplumbersconf.org/2017/ocw//system/presentations/4732/original/crs.pdf
> >
> > Because completions are required to be a minimum of MPS size. If MRRS > MPS,
> > read response is sent as multiple completions.
>
> With that patch, you can end up with MRRS < MPS. Do I understand
> correctly this is an issue?

My impression is that the issue would be inefficiency. There should be
nothing functionally wrong with a read request less than MPS, but we're
not "filling" the TLP as much as the topology allows. Is that your
understanding as well, Sinan?

It seems like it would be relatively easy to virtualize MRRS like we do
the FLR bit, ie. evaluate the change the user is trying to make and
update MRRS with pci-core callbacks, capping the lower bound equal to
MPS for efficiency. It's possible we'll encounter devices that really
do need a lower MRRS, but it seems unlikely since this is the setting
the PCI core seems to make by default (MRRS == MPS). Thanks,

Alex
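The policy Alex sketches here, evaluate the user's MRRS write and cap its lower bound at the physical MPS, can be expressed in a few lines. This is a hypothetical sketch of such a clamp, not the actual vfio writefn or any pci-core API:

```c
#include <assert.h>

/*
 * Hypothetical policy for a virtualized MRRS write: honor the guest's
 * requested value, but never program hardware with an MRRS below the
 * physical MPS, since that only shrinks read requests for no benefit.
 */
static unsigned int mrrs_to_program(unsigned int guest_mrrs_bytes,
                                    unsigned int phys_mps_bytes)
{
    return guest_mrrs_bytes < phys_mps_bytes ? phys_mps_bytes
                                             : guest_mrrs_bytes;
}
```

So a guest sized to a virtual 128B topology writing MRRS = 128 on a host with MPS = 256 would still leave the physical MRRS at 256, matching the PCI core's own MRRS == MPS default.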
On 9/20/2017 12:11 PM, Alex Williamson wrote:
> My impression is that the issue would be inefficiency. There should be
> nothing functionally wrong with a read request less than MPS, but we're
> not "filling" the TLP as much as the topology allows. Is that your
> understanding as well, Sinan?

That's my understanding as well. It would be a performance hit to use a
value less than MPS.
Hi Sinan,

On 20/09/2017 18:29, Sinan Kaya wrote:
> On 9/20/2017 12:11 PM, Alex Williamson wrote:
>> My impression is that the issue would be inefficiency. There should be
>> nothing functionally wrong with a read request less than MPS, but we're
>> not "filling" the TLP as much as the topology allows. Is that your
>> understanding as well, Sinan?
>
> That's my understanding as well. It would be a performance hit to use a
> value less than MPS.

Thank you for your explanations.

Best regards,

Eric
Hi Alex,

On 19/09/2017 18:58, Alex Williamson wrote:
> With virtual PCI-Express chipsets, we now see userspace/guest drivers
> trying to match the physical MPS setting to a virtual downstream port.
> Of course a lone physical device surrounded by virtual interconnects
> cannot make a correct decision for a proper MPS setting. Instead,
> let's virtualize the MPS control register so that writes through to
> hardware are disallowed. Userspace drivers like QEMU assume they can
> write anything to the device and we'll filter out anything dangerous.
> Since mismatched MPS can lead to AER and other faults, let's add it
> to the kernel side rather than relying on userspace virtualization to
> handle it.
>
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>

Reviewed-by: Eric Auger <eric.auger@redhat.com>

Thanks

Eric

> ---
>
> Do we have any reason to suspect that a userspace driver has any
> dependencies on the physical MPS setting or is this only tuning the
> protocol layer and it's transparent to the driver? Note that per the
> PCI spec, a device supporting only 128B MPS can hardwire the control
> register to 000b, but it doesn't seem PCIe compliant to hardwire it to
> any given value, such as would be the appearance if we exposed this as
> a read-only register rather than virtualizing it. QEMU would then be
> responsible for virtualizing it, which makes coordinating the upgrade
> troublesome.
>
> drivers/vfio/pci/vfio_pci_config.c | 6 ++++--
> 1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
> index 5628fe114347..91335e6de88a 100644
> --- a/drivers/vfio/pci/vfio_pci_config.c
> +++ b/drivers/vfio/pci/vfio_pci_config.c
> @@ -849,11 +849,13 @@ static int __init init_pci_cap_exp_perm(struct perm_bits *perm)
>
>  	/*
>  	 * Allow writes to device control fields, except devctl_phantom,
> -	 * which could confuse IOMMU, and the ARI bit in devctl2, which
> +	 * which could confuse IOMMU, MPS, which can break communication
> +	 * with other physical devices, and the ARI bit in devctl2, which
>  	 * is set at probe time. FLR gets virtualized via our writefn.
>  	 */
>  	p_setw(perm, PCI_EXP_DEVCTL,
> -	       PCI_EXP_DEVCTL_BCR_FLR, ~PCI_EXP_DEVCTL_PHANTOM);
> +	       PCI_EXP_DEVCTL_BCR_FLR | PCI_EXP_DEVCTL_PAYLOAD,
> +	       ~PCI_EXP_DEVCTL_PHANTOM);
>  	p_setw(perm, PCI_EXP_DEVCTL2, NO_VIRT, ~PCI_EXP_DEVCTL2_ARI);
> 	return 0;
> }
>
diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
index 5628fe114347..91335e6de88a 100644
--- a/drivers/vfio/pci/vfio_pci_config.c
+++ b/drivers/vfio/pci/vfio_pci_config.c
@@ -849,11 +849,13 @@ static int __init init_pci_cap_exp_perm(struct perm_bits *perm)
 
 	/*
 	 * Allow writes to device control fields, except devctl_phantom,
-	 * which could confuse IOMMU, and the ARI bit in devctl2, which
+	 * which could confuse IOMMU, MPS, which can break communication
+	 * with other physical devices, and the ARI bit in devctl2, which
 	 * is set at probe time. FLR gets virtualized via our writefn.
 	 */
 	p_setw(perm, PCI_EXP_DEVCTL,
-	       PCI_EXP_DEVCTL_BCR_FLR, ~PCI_EXP_DEVCTL_PHANTOM);
+	       PCI_EXP_DEVCTL_BCR_FLR | PCI_EXP_DEVCTL_PAYLOAD,
+	       ~PCI_EXP_DEVCTL_PHANTOM);
 	p_setw(perm, PCI_EXP_DEVCTL2, NO_VIRT, ~PCI_EXP_DEVCTL2_ARI);
 	return 0;
 }
With virtual PCI-Express chipsets, we now see userspace/guest drivers
trying to match the physical MPS setting to a virtual downstream port.
Of course a lone physical device surrounded by virtual interconnects
cannot make a correct decision for a proper MPS setting. Instead,
let's virtualize the MPS control register so that writes through to
hardware are disallowed. Userspace drivers like QEMU assume they can
write anything to the device and we'll filter out anything dangerous.
Since mismatched MPS can lead to AER and other faults, let's add it
to the kernel side rather than relying on userspace virtualization to
handle it.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---

Do we have any reason to suspect that a userspace driver has any
dependencies on the physical MPS setting or is this only tuning the
protocol layer and it's transparent to the driver? Note that per the
PCI spec, a device supporting only 128B MPS can hardwire the control
register to 000b, but it doesn't seem PCIe compliant to hardwire it to
any given value, such as would be the appearance if we exposed this as
a read-only register rather than virtualizing it. QEMU would then be
responsible for virtualizing it, which makes coordinating the upgrade
troublesome.

 drivers/vfio/pci/vfio_pci_config.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)