| Message ID | 20170919164952.13200.78849.stgit@gimli.home (mailing list archive) |
|---|---|
| State | New, archived |
Hi Alex,

On 19/09/2017 18:58, Alex Williamson wrote:
> With virtual PCI-Express chipsets, we now see userspace/guest drivers
> trying to match the physical MPS setting to a virtual downstream port.
> Of course a lone physical device surrounded by virtual interconnects
> cannot make a correct decision for a proper MPS setting. Instead,
> let's virtualize the MPS control register so that writes through to
> hardware are disallowed. Userspace drivers like QEMU assume they can
> write anything to the device and we'll filter out anything dangerous.
> Since mismatched MPS can lead to AER and other faults, let's add it
> to the kernel side rather than relying on userspace virtualization to
> handle it.
>
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> ---
>
> Do we have any reason to suspect that a userspace driver has any
> dependencies on the physical MPS setting or is this only tuning the
> protocol layer and it's transparent to the driver? Note that per the
> PCI spec, a device supporting only 128B MPS can hardwire the control
> register to 000b, but it doesn't seem PCIe compliant to hardwire it to
> any given value, such as would be the appearance if we exposed this as
> a read-only register rather than virtualizing it. QEMU would then be
> responsible for virtualizing it, which makes coordinating the upgrade
> troublesome.
>
> drivers/vfio/pci/vfio_pci_config.c | 6 ++++--
> 1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
> index 5628fe114347..91335e6de88a 100644
> --- a/drivers/vfio/pci/vfio_pci_config.c
> +++ b/drivers/vfio/pci/vfio_pci_config.c
> @@ -849,11 +849,13 @@ static int __init init_pci_cap_exp_perm(struct perm_bits *perm)
>
>  	/*
>  	 * Allow writes to device control fields, except devctl_phantom,
> -	 * which could confuse IOMMU, and the ARI bit in devctl2, which
> +	 * which could confuse IOMMU, MPS, which can break communication
> +	 * with other physical devices, and the ARI bit in devctl2, which
>  	 * is set at probe time. FLR gets virtualized via our writefn.
>  	 */
>  	p_setw(perm, PCI_EXP_DEVCTL,
> -	       PCI_EXP_DEVCTL_BCR_FLR, ~PCI_EXP_DEVCTL_PHANTOM);
> +	       PCI_EXP_DEVCTL_BCR_FLR | PCI_EXP_DEVCTL_PAYLOAD,
> +	       ~PCI_EXP_DEVCTL_PHANTOM);
>  	p_setw(perm, PCI_EXP_DEVCTL2, NO_VIRT, ~PCI_EXP_DEVCTL2_ARI);

Is it correct that the read value still will be the one written by the
guest? I see the MMRS can take the read MPS value in some pcie_bus_config
values. So a consequence could be that the applied MMRS (which is not
virtualized) is lower than what is set by host, due to a guest pcie root
port MPSS for instance.

So if the above is not totally wrong, shouldn't we virtualize MMRS as well?

Thanks

Eric

> 	return 0;
> }
>
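The p_setw() masks in the patch drive vfio's split-write scheme: bits in the virt mask are emulated in a shadow copy of config space (vconfig), while writable bits outside it pass through to hardware. The following is a minimal, simplified sketch of that filtering, not the kernel's actual vfio_default_config_write() (widths and the standalone register pointers are assumptions for illustration):

```c
#include <assert.h>
#include <stdint.h>

#define PCI_EXP_DEVCTL_PHANTOM 0x0200  /* Phantom Functions Enable */
#define PCI_EXP_DEVCTL_PAYLOAD 0x00e0  /* Max_Payload_Size field   */

/*
 * Split a guest write between a shadow copy and real hardware.
 * 'virt' marks virtualized bits (emulated only); 'write' marks
 * writable bits.  Virtualized writable bits land in *vconfig,
 * writable non-virtualized bits land in *hwreg; all other bits
 * are preserved unchanged.
 */
static void config_write(uint16_t val, uint16_t virt, uint16_t write,
                         uint16_t *vconfig, uint16_t *hwreg)
{
    uint16_t emu_mask = virt & write;   /* emulated bits     */
    uint16_t hw_mask  = write & ~virt;  /* pass-through bits */

    *vconfig = (*vconfig & ~emu_mask) | (val & emu_mask);
    *hwreg   = (*hwreg   & ~hw_mask)  | (val & hw_mask);
}
```

With virt including PCI_EXP_DEVCTL_PAYLOAD (as the patch adds) and write set to ~PCI_EXP_DEVCTL_PHANTOM, a guest write to the MPS field changes only the shadow copy, so the guest reads back the value it wrote while the physical MPS stays untouched.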
[cc +linux-pci, Sinan]

On Tue, 19 Sep 2017 19:50:37 +0200
Auger Eric <eric.auger@redhat.com> wrote:

> Hi Alex,
>
> On 19/09/2017 18:58, Alex Williamson wrote:
> > With virtual PCI-Express chipsets, we now see userspace/guest drivers
> > trying to match the physical MPS setting to a virtual downstream port.
> > Of course a lone physical device surrounded by virtual interconnects
> > cannot make a correct decision for a proper MPS setting. Instead,
> > let's virtualize the MPS control register so that writes through to
> > hardware are disallowed. Userspace drivers like QEMU assume they can
> > write anything to the device and we'll filter out anything dangerous.
> > Since mismatched MPS can lead to AER and other faults, let's add it
> > to the kernel side rather than relying on userspace virtualization to
> > handle it.
> >
> > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > ---
> >
> > Do we have any reason to suspect that a userspace driver has any
> > dependencies on the physical MPS setting or is this only tuning the
> > protocol layer and it's transparent to the driver? Note that per the
> > PCI spec, a device supporting only 128B MPS can hardwire the control
> > register to 000b, but it doesn't seem PCIe compliant to hardwire it to
> > any given value, such as would be the appearance if we exposed this as
> > a read-only register rather than virtualizing it. QEMU would then be
> > responsible for virtualizing it, which makes coordinating the upgrade
> > troublesome.
> >
> > drivers/vfio/pci/vfio_pci_config.c | 6 ++++--
> > 1 file changed, 4 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
> > index 5628fe114347..91335e6de88a 100644
> > --- a/drivers/vfio/pci/vfio_pci_config.c
> > +++ b/drivers/vfio/pci/vfio_pci_config.c
> > @@ -849,11 +849,13 @@ static int __init init_pci_cap_exp_perm(struct perm_bits *perm)
> >
> >  	/*
> >  	 * Allow writes to device control fields, except devctl_phantom,
> > -	 * which could confuse IOMMU, and the ARI bit in devctl2, which
> > +	 * which could confuse IOMMU, MPS, which can break communication
> > +	 * with other physical devices, and the ARI bit in devctl2, which
> >  	 * is set at probe time. FLR gets virtualized via our writefn.
> >  	 */
> >  	p_setw(perm, PCI_EXP_DEVCTL,
> > -	       PCI_EXP_DEVCTL_BCR_FLR, ~PCI_EXP_DEVCTL_PHANTOM);
> > +	       PCI_EXP_DEVCTL_BCR_FLR | PCI_EXP_DEVCTL_PAYLOAD,
> > +	       ~PCI_EXP_DEVCTL_PHANTOM);
> >  	p_setw(perm, PCI_EXP_DEVCTL2, NO_VIRT, ~PCI_EXP_DEVCTL2_ARI);
> Is it correct that the read value still will be the one written by the
> guest?

If we don't do this, the register is read-only, which is what I'm
suggesting in my comment above doesn't seem spec compliant. If the
kernel makes it read-only, then userspace (ex. QEMU) would need to
virtualize it, and we get into a more complicated, two part fix.

> I see the MMRS can take the read MPS value in some pcie_bus_config
> values. So a consequence could be that the applied MMRS (which is not
> virtualized) is lower than what is set by host, due to a guest pcie root
> port MPSS for instance.
>
> So if the above is not totally wrong, shouldn't we virtualize MMRS as well?

I think MPSS is Maximum Payload Size Supported and MMRS... did you mean
MRRS (Maximum Read Request Size)?

My impression is that MRRS is predominantly device and driver
dependent, not topology dependent. A device can send a read request
with a size larger than MPS, which implies that the device supplying
the read data would split it into multiple TLPs based on MPS. The
implementation note in PCIe rev3.1 sec 7.8.4 suggests there may be QoS
implications of MRRS such that if MRRS>MPS, then the device requesting
the data may use fewer tokens for the request. I have no idea how we
could infer any sort of QoS policy though and I don't see that
in-kernel drivers are bound to any sort of policy.

The pcie_write_mrrs() function also implies a protocol (which I can't
find a spec reference to) where it will re-try writing MRRS if the
value written doesn't stick (and generate a dev_warn on each
iteration). Unless we want extra warnings in VMs, that suggests we
don't want this to be read-only.

So I think we need to allow MRRS writes through to hardware, but I do
question whether we need to fill the gap that a VM might casually write
MRRS to a value less than the physical MPS. This is what we'd expect
today where the virtual topology has an MPS of 128B regardless of the
physical topology. Do we virtualize MRRS writes such that an MRRS
value less than the physical MPS value never reaches hardware? Can we
assume that's always invalid? Thanks,

Alex
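For reference, MPS and MRRS are separate fields of the same PCIe Device Control register: Max_Payload_Size occupies bits 7:5 and Max_Read_Request_Size bits 14:12, and both encode powers of two starting at 128 bytes. A small sketch of the decoding (the mask values match the kernel's PCI_EXP_DEVCTL_* definitions; the helper names are made up here):

```c
#include <assert.h>
#include <stdint.h>

#define PCI_EXP_DEVCTL_PAYLOAD 0x00e0  /* Max_Payload_Size, bits 7:5       */
#define PCI_EXP_DEVCTL_READRQ  0x7000  /* Max_Read_Request_Size, bits 14:12 */

/* Both fields encode 128 << n bytes, n = 0..5 (128B .. 4096B). */
static unsigned int mps_bytes(uint16_t devctl)
{
    return 128u << ((devctl & PCI_EXP_DEVCTL_PAYLOAD) >> 5);
}

static unsigned int mrrs_bytes(uint16_t devctl)
{
    return 128u << ((devctl & PCI_EXP_DEVCTL_READRQ) >> 12);
}
```

So a DEVCTL value of 0x2020 decodes to MPS = 256B and MRRS = 512B, which shows the two knobs really are independently programmable.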
Hi Alex,

On 19/09/2017 22:20, Alex Williamson wrote:
> [cc +linux-pci, Sinan]
>
> On Tue, 19 Sep 2017 19:50:37 +0200
> Auger Eric <eric.auger@redhat.com> wrote:
>
>> Hi Alex,
>>
>> On 19/09/2017 18:58, Alex Williamson wrote:
>>> With virtual PCI-Express chipsets, we now see userspace/guest drivers
>>> trying to match the physical MPS setting to a virtual downstream port.
>>> Of course a lone physical device surrounded by virtual interconnects
>>> cannot make a correct decision for a proper MPS setting. Instead,
>>> let's virtualize the MPS control register so that writes through to
>>> hardware are disallowed. Userspace drivers like QEMU assume they can
>>> write anything to the device and we'll filter out anything dangerous.
>>> Since mismatched MPS can lead to AER and other faults, let's add it
>>> to the kernel side rather than relying on userspace virtualization to
>>> handle it.
>>>
>>> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
>>> ---
>>>
>>> Do we have any reason to suspect that a userspace driver has any
>>> dependencies on the physical MPS setting or is this only tuning the
>>> protocol layer and it's transparent to the driver? Note that per the
>>> PCI spec, a device supporting only 128B MPS can hardwire the control
>>> register to 000b, but it doesn't seem PCIe compliant to hardwire it to
>>> any given value, such as would be the appearance if we exposed this as
>>> a read-only register rather than virtualizing it. QEMU would then be
>>> responsible for virtualizing it, which makes coordinating the upgrade
>>> troublesome.
>>>
>>> drivers/vfio/pci/vfio_pci_config.c | 6 ++++--
>>> 1 file changed, 4 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
>>> index 5628fe114347..91335e6de88a 100644
>>> --- a/drivers/vfio/pci/vfio_pci_config.c
>>> +++ b/drivers/vfio/pci/vfio_pci_config.c
>>> @@ -849,11 +849,13 @@ static int __init init_pci_cap_exp_perm(struct perm_bits *perm)
>>>
>>>  	/*
>>>  	 * Allow writes to device control fields, except devctl_phantom,
>>> -	 * which could confuse IOMMU, and the ARI bit in devctl2, which
>>> +	 * which could confuse IOMMU, MPS, which can break communication
>>> +	 * with other physical devices, and the ARI bit in devctl2, which
>>>  	 * is set at probe time. FLR gets virtualized via our writefn.
>>>  	 */
>>>  	p_setw(perm, PCI_EXP_DEVCTL,
>>> -	       PCI_EXP_DEVCTL_BCR_FLR, ~PCI_EXP_DEVCTL_PHANTOM);
>>> +	       PCI_EXP_DEVCTL_BCR_FLR | PCI_EXP_DEVCTL_PAYLOAD,
>>> +	       ~PCI_EXP_DEVCTL_PHANTOM);
>>>  	p_setw(perm, PCI_EXP_DEVCTL2, NO_VIRT, ~PCI_EXP_DEVCTL2_ARI);
>> Is it correct that the read value still will be the one written by the
>> guest?
>
> If we don't do this, the register is read-only, which is what I'm
> suggesting in my comment above doesn't seem spec compliant. If the
> kernel makes it read-only, then userspace (ex. QEMU) would need to
> virtualize it, and we get into a more complicated, two part fix.
>
>> I see the MMRS can take the read MPS value in some pcie_bus_config
>> values. So a consequence could be that the applied MMRS (which is not
>> virtualized) is lower than what is set by host, due to a guest pcie root
>> port MPSS for instance.
>>
>> So if the above is not totally wrong, shouldn't we virtualize MMRS as well?
>
> I think MPSS is Maximum Payload Size Supported and MMRS... did you mean
> MRRS (Maximum Read Request Size)?

Yes, sorry for the typo. Indeed I meant MRRS.

> My impression is that MRRS is predominantly device and driver
> dependent, not topology dependent. A device can send a read request
> with a size larger than MPS, which implies that the device supplying
> the read data would split it into multiple TLPs based on MPS.

I read that too on the net. However in 6.3.4.1 (3.0, Nov 10), Rules for
SW Configuration, it is written:

"Software must set Max_Read_Request_Size of an isochronous-configured
device with a value that does not exceed the Max_Payload_Size set for
the device."

But on the other hand some drivers are setting the MRRS directly
without further checking the MPS?

> The implementation note in PCIe rev3.1 sec 7.8.4 suggests there may be QoS
> implications of MRRS such that if MRRS>MPS, then the device requesting
> the data may use fewer tokens for the request. I have no idea how we
> could infer any sort of QoS policy though and I don't see that
> in-kernel drivers are bound to any sort of policy.
>
> The pcie_write_mrrs() function also implies a protocol (which I can't
> find a spec reference to) where it will re-try writing MRRS if the
> value written doesn't stick (and generate a dev_warn on each
> iteration). Unless we want extra warnings in VMs, that suggests we
> don't want this to be read-only.
>
> So I think we need to allow MRRS writes through to hardware, but I do
> question whether we need to fill the gap that a VM might casually write
> MRRS to a value less than the physical MPS. This is what we'd expect
> today where the virtual topology has an MPS of 128B regardless of the
> physical topology. Do we virtualize MRRS writes such that an MRRS
> value less than the physical MPS value never reaches hardware? Can we
> assume that's always invalid? Thanks,

Thanks

Eric

> Alex
>
On 9/20/2017 3:59 AM, Auger Eric wrote:
>> My impression is that MRRS is predominantly device and driver
>> dependent, not topology dependent. A device can send a read request
>> with a size larger than MPS, which implies that the device supplying
>> the read data would split it into multiple TLPs based on MPS.
> I read that too on the net. However in 6.3.4.1 (3.0, Nov 10), Rules for
> SW Configuration, it is written:
> "Software must set Max_Read_Request_Size of an isochronous-configured
> device with a value that does not exceed the Max_Payload_Size set for
> the device."
>
> But on the other hand some drivers are setting the MRRS directly
> without further checking the MPS?

We discussed this on LPC. MRRS and MPS are two independent concepts and
are not related to each other under normal circumstances.

The only valid criteria is that MRRS needs to be a multiple of MPS.

https://linuxplumbersconf.org/2017/ocw//system/presentations/4732/original/crs.pdf

Because completions are required to be a minimum of MPS size. If MRRS > MPS,
read response is sent as multiple completions.

The only reason you want to match MRRS==MPS is that you don't want a single
device to hog system resources. You can have MRRS 4k and MPS 128 bytes.
Completions come in as 128 x N packets on the PCI bus.

If you are sharing the same PCI bus with some other device or switch, you are
effectively stalling other devices. That's why isochronous devices are
requesting small MRRS.

An isochronous device is an exception, not the norm.
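Sinan's 4k/128B example is just completion-splitting arithmetic: a read request of up to MRRS bytes comes back as completions no larger than MPS. A sketch of the count (helper name is made up for illustration):

```c
#include <assert.h>

/*
 * A read request of 'read_size' bytes (itself capped by MRRS) is
 * returned as completion TLPs no larger than MPS, so the link sees
 * roughly this many completions per request.
 */
static unsigned int completions_per_read(unsigned int read_size,
                                         unsigned int mps)
{
    return (read_size + mps - 1) / mps;  /* round up */
}
```

With MRRS = 4k and MPS = 128B a single read monopolizes the link for 32 back-to-back completions, which is exactly the resource-hogging concern behind matching MRRS to MPS.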
On 9/20/2017 9:01 AM, Sinan Kaya wrote:
> The only valid criteria is that MRRS needs to be a multiple of MPS.
>
> https://linuxplumbersconf.org/2017/ocw//system/presentations/4732/original/crs.pdf

Apologies for sending the wrong link. Correcting it.

https://linuxplumbersconf.org/2017/ocw//system/presentations/4737/original/mps.pdf
Hi Sinan,

On 20/09/2017 15:01, Sinan Kaya wrote:
> On 9/20/2017 3:59 AM, Auger Eric wrote:
>>> My impression is that MRRS is predominantly device and driver
>>> dependent, not topology dependent. A device can send a read request
>>> with a size larger than MPS, which implies that the device supplying
>>> the read data would split it into multiple TLPs based on MPS.
>> I read that too on the net. However in 6.3.4.1 (3.0, Nov 10), Rules for
>> SW Configuration, it is written:
>> "Software must set Max_Read_Request_Size of an isochronous-configured
>> device with a value that does not exceed the Max_Payload_Size set for
>> the device."
>>
>> But on the other hand some drivers are setting the MRRS directly
>> without further checking the MPS?
>
> We discussed this on LPC. MRRS and MPS are two independent concepts and
> are not related to each other under normal circumstances.
>
> The only valid criteria is that MRRS needs to be a multiple of MPS.
>
> https://linuxplumbersconf.org/2017/ocw//system/presentations/4732/original/crs.pdf
>
> Because completions are required to be a minimum of MPS size. If MRRS > MPS,
> read response is sent as multiple completions.

With that patch, you can end up with MRRS < MPS. Do I understand
correctly this is an issue?

Thanks

Eric

> The only reason you want to match MRRS==MPS is that you don't want a single
> device to hog system resources. You can have MRRS 4k and MPS 128 bytes.
> Completions come in as 128 x N packets on the PCI bus.
>
> If you are sharing the same PCI bus with some other device or switch, you are
> effectively stalling other devices. That's why isochronous devices are
> requesting small MRRS.
>
> An isochronous device is an exception, not the norm.
On 9/20/2017 10:26 AM, Auger Eric wrote:
>> Because completions are required to be a minimum of MPS size. If MRRS > MPS,
>> read response is sent as multiple completions.
> With that patch, you can end up with MRRS < MPS. Do I understand
> correctly this is an issue?

To give the right context, I tried to mean that you can't use MRRS as a
length value for completions if MRRS is a fairly large number and bigger
than MPS. Therefore, a single completion packet size needs to be less
than MRRS and cannot exceed MPS. Spec is even calling for receivers to
throw an error if completion length is greater than MPS.

single completion <= MPS (128) <= MRRS (4k)

I looked at the spec again. I have not seen anything in the spec that
prohibits MRRS < MPS as long as completion size is less than MPS.

single completion <= MRRS (128) < MPS (256)

I think this is also valid. I hope I got it right this time.
On Wed, 20 Sep 2017 16:26:25 +0200
Auger Eric <eric.auger@redhat.com> wrote:

> Hi Sinan,
>
> On 20/09/2017 15:01, Sinan Kaya wrote:
> > On 9/20/2017 3:59 AM, Auger Eric wrote:
> >>> My impression is that MRRS is predominantly device and driver
> >>> dependent, not topology dependent. A device can send a read request
> >>> with a size larger than MPS, which implies that the device supplying
> >>> the read data would split it into multiple TLPs based on MPS.
> >> I read that too on the net. However in 6.3.4.1 (3.0, Nov 10), Rules for
> >> SW Configuration, it is written:
> >> "Software must set Max_Read_Request_Size of an isochronous-configured
> >> device with a value that does not exceed the Max_Payload_Size set for
> >> the device."
> >>
> >> But on the other hand some drivers are setting the MRRS directly
> >> without further checking the MPS?
> >
> > We discussed this on LPC. MRRS and MPS are two independent concepts and
> > are not related to each other under normal circumstances.
> >
> > The only valid criteria is that MRRS needs to be a multiple of MPS.
> >
> > https://linuxplumbersconf.org/2017/ocw//system/presentations/4732/original/crs.pdf
> >
> > Because completions are required to be a minimum of MPS size. If MRRS > MPS,
> > read response is sent as multiple completions.
>
> With that patch, you can end up with MRRS < MPS. Do I understand
> correctly this is an issue?

My impression is that the issue would be inefficiency. There should be
nothing functionally wrong with a read request less than MPS, but we're
not "filling" the TLP as much as the topology allows. Is that your
understanding as well, Sinan?

It seems like it would be relatively easy to virtualize MRRS like we do
the FLR bit, ie. evaluate the change the user is trying to make and
update MRRS with pci-core callbacks, capping the lower bound equal to
MPS for efficiency. It's possible we'll encounter devices that really
do need a lower MRRS, but it seems unlikely since this is the setting
the PCI core seems to make by default (MRRS == MPS). Thanks,

Alex
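The policy Alex sketches here, evaluate the user's MRRS write and cap its lower bound at the physical MPS, can be expressed in a few lines. This is a hypothetical sketch of such a clamp, not the actual vfio writefn or any pci-core API:

```c
#include <assert.h>

/*
 * Hypothetical policy for a virtualized MRRS write: honor the guest's
 * requested value, but never program hardware with an MRRS below the
 * physical MPS, since that only shrinks read requests for no benefit.
 */
static unsigned int mrrs_to_program(unsigned int guest_mrrs_bytes,
                                    unsigned int phys_mps_bytes)
{
    return guest_mrrs_bytes < phys_mps_bytes ? phys_mps_bytes
                                             : guest_mrrs_bytes;
}
```

So a guest sized to a virtual 128B topology writing MRRS = 128 on a host with MPS = 256 would still leave the physical MRRS at 256, matching the PCI core's own MRRS == MPS default.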
On 9/20/2017 12:11 PM, Alex Williamson wrote:
> My impression is that the issue would be inefficiency. There should be
> nothing functionally wrong with a read request less than MPS, but we're
> not "filling" the TLP as much as the topology allows. Is that your
> understanding as well, Sinan?

That's my understanding as well. It would be a performance hit to use a
value less than MPS.
Hi Sinan,

On 20/09/2017 18:29, Sinan Kaya wrote:
> On 9/20/2017 12:11 PM, Alex Williamson wrote:
>> My impression is that the issue would be inefficiency. There should be
>> nothing functionally wrong with a read request less than MPS, but we're
>> not "filling" the TLP as much as the topology allows. Is that your
>> understanding as well, Sinan?
>
> That's my understanding as well. It would be a performance hit to use a
> value less than MPS.

Thank you for your explanations.

Best regards,

Eric
Hi Alex,

On 19/09/2017 18:58, Alex Williamson wrote:
> With virtual PCI-Express chipsets, we now see userspace/guest drivers
> trying to match the physical MPS setting to a virtual downstream port.
> Of course a lone physical device surrounded by virtual interconnects
> cannot make a correct decision for a proper MPS setting. Instead,
> let's virtualize the MPS control register so that writes through to
> hardware are disallowed. Userspace drivers like QEMU assume they can
> write anything to the device and we'll filter out anything dangerous.
> Since mismatched MPS can lead to AER and other faults, let's add it
> to the kernel side rather than relying on userspace virtualization to
> handle it.
>
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>

Reviewed-by: Eric Auger <eric.auger@redhat.com>

Thanks

Eric

> ---
>
> Do we have any reason to suspect that a userspace driver has any
> dependencies on the physical MPS setting or is this only tuning the
> protocol layer and it's transparent to the driver? Note that per the
> PCI spec, a device supporting only 128B MPS can hardwire the control
> register to 000b, but it doesn't seem PCIe compliant to hardwire it to
> any given value, such as would be the appearance if we exposed this as
> a read-only register rather than virtualizing it. QEMU would then be
> responsible for virtualizing it, which makes coordinating the upgrade
> troublesome.
>
> drivers/vfio/pci/vfio_pci_config.c | 6 ++++--
> 1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
> index 5628fe114347..91335e6de88a 100644
> --- a/drivers/vfio/pci/vfio_pci_config.c
> +++ b/drivers/vfio/pci/vfio_pci_config.c
> @@ -849,11 +849,13 @@ static int __init init_pci_cap_exp_perm(struct perm_bits *perm)
>
>  	/*
>  	 * Allow writes to device control fields, except devctl_phantom,
> -	 * which could confuse IOMMU, and the ARI bit in devctl2, which
> +	 * which could confuse IOMMU, MPS, which can break communication
> +	 * with other physical devices, and the ARI bit in devctl2, which
>  	 * is set at probe time. FLR gets virtualized via our writefn.
>  	 */
>  	p_setw(perm, PCI_EXP_DEVCTL,
> -	       PCI_EXP_DEVCTL_BCR_FLR, ~PCI_EXP_DEVCTL_PHANTOM);
> +	       PCI_EXP_DEVCTL_BCR_FLR | PCI_EXP_DEVCTL_PAYLOAD,
> +	       ~PCI_EXP_DEVCTL_PHANTOM);
>  	p_setw(perm, PCI_EXP_DEVCTL2, NO_VIRT, ~PCI_EXP_DEVCTL2_ARI);
> 	return 0;
> }
>
diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
index 5628fe114347..91335e6de88a 100644
--- a/drivers/vfio/pci/vfio_pci_config.c
+++ b/drivers/vfio/pci/vfio_pci_config.c
@@ -849,11 +849,13 @@ static int __init init_pci_cap_exp_perm(struct perm_bits *perm)
 
 	/*
 	 * Allow writes to device control fields, except devctl_phantom,
-	 * which could confuse IOMMU, and the ARI bit in devctl2, which
+	 * which could confuse IOMMU, MPS, which can break communication
+	 * with other physical devices, and the ARI bit in devctl2, which
 	 * is set at probe time. FLR gets virtualized via our writefn.
 	 */
 	p_setw(perm, PCI_EXP_DEVCTL,
-	       PCI_EXP_DEVCTL_BCR_FLR, ~PCI_EXP_DEVCTL_PHANTOM);
+	       PCI_EXP_DEVCTL_BCR_FLR | PCI_EXP_DEVCTL_PAYLOAD,
+	       ~PCI_EXP_DEVCTL_PHANTOM);
 	p_setw(perm, PCI_EXP_DEVCTL2, NO_VIRT, ~PCI_EXP_DEVCTL2_ARI);
 	return 0;
 }
With virtual PCI-Express chipsets, we now see userspace/guest drivers
trying to match the physical MPS setting to a virtual downstream port.
Of course a lone physical device surrounded by virtual interconnects
cannot make a correct decision for a proper MPS setting. Instead,
let's virtualize the MPS control register so that writes through to
hardware are disallowed. Userspace drivers like QEMU assume they can
write anything to the device and we'll filter out anything dangerous.
Since mismatched MPS can lead to AER and other faults, let's add it
to the kernel side rather than relying on userspace virtualization to
handle it.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---

Do we have any reason to suspect that a userspace driver has any
dependencies on the physical MPS setting or is this only tuning the
protocol layer and it's transparent to the driver? Note that per the
PCI spec, a device supporting only 128B MPS can hardwire the control
register to 000b, but it doesn't seem PCIe compliant to hardwire it to
any given value, such as would be the appearance if we exposed this as
a read-only register rather than virtualizing it. QEMU would then be
responsible for virtualizing it, which makes coordinating the upgrade
troublesome.

 drivers/vfio/pci/vfio_pci_config.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)