
PCI: Match Root Port's MPS to endpoint's MPSS when necessary

Message ID 20180718185158.149373.77902.stgit@tak.stowe (mailing list archive)
State New, archived
Delegated to: Bjorn Helgaas

Commit Message

Myron Stowe July 18, 2018, 6:51 p.m. UTC
In commit 27d868b5e6cf ("PCI: Set MPS to match upstream bridge"), we made
sure every device's MPS setting matches its upstream bridge, making it more
likely that a hot-added device will work in a system with an optimized MPS
configuration.

Recently I've started encountering systems where the endpoint device's MPSS
capability is less than its root port's current MPS value, thus the
endpoint is not capable of matching its upstream bridge's MPS setting (see:
bugzilla via "Link:" below).  This leaves the system vulnerable - the
upstream root port could respond with larger sized TLPs than the endpoint
can handle, and the endpoint will consider them to be 'Malformed'.

One could use the "pci=pcie_bus_safe" kernel parameter to resolve the
issue, but doing so both forces the user to supply a kernel parameter to
get the system to function reliably and may end up limiting the MPS settings
of other, unrelated sub-topologies that could benefit from maintaining
their larger values.

This patch augments Keith's approach to include tuning down a root port's
MPS setting when its hot-added endpoint device is not capable of matching
it.  The tuning down, so that both the root port and endpoint match, is
limited to root ports with downstream endpoint device sub-topologies.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=200527
Cc: Keith Busch <keith.busch@intel.com>
Cc: Jon Mason <jdmason@kudzu.us>
Cc: Sinan Kaya <okaya@kernel.org>
Signed-off-by: Myron Stowe <myron.stowe@redhat.com>
---
 drivers/pci/probe.c |   12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

Comments

Jon Mason July 24, 2018, 3:47 p.m. UTC | #1
On Wed, Jul 18, 2018 at 12:51:58PM -0600, Myron Stowe wrote:
> In commit 27d868b5e6cf ("PCI: Set MPS to match upstream bridge"), we made
> sure every device's MPS setting matches its upstream bridge, making it more
> likely that a hot-added device will work in a system with an optimized MPS
> configuration.
> 
> Recently I've started encountering systems where the endpoint device's MPSS
> capability is less than its root port's current MPS value, thus the
> endpoint is not capable of matching its upstream bridge's MPS setting (see:
> bugzilla via "Link:" below).  This leaves the system vulnerable - the
> upstream root port could respond with larger sized TLPs than the endpoint
> can handle, and the endpoint will consider them to be 'Malformed'.
> 
> One could use the "pci=pcie_bus_safe" kernel parameter to resolve the
> issue, but doing so both forces the user to supply a kernel parameter to
> get the system to function reliably and may end up limiting the MPS settings
> of other, unrelated sub-topologies that could benefit from maintaining
> their larger values.
> 
> This patch augments Keith's approach to include tuning down a root port's
> MPS setting when its hot-added endpoint device is not capable of matching
> it.  The tuning down, so that both the root port and endpoint match, is
> limited to root ports with downstream endpoint device sub-topologies.
> 
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=200527
> Cc: Keith Busch <keith.busch@intel.com>
> Cc: Jon Mason <jdmason@kudzu.us>

Looks good to me
Acked-by: Jon Mason <jdmason@kudzu.us>

> Cc: Sinan Kaya <okaya@kernel.org>
> Signed-off-by: Myron Stowe <myron.stowe@redhat.com>
Bjorn Helgaas Aug. 1, 2018, 2:05 p.m. UTC | #2
On Wed, Jul 18, 2018 at 12:51:58PM -0600, Myron Stowe wrote:
> In commit 27d868b5e6cf ("PCI: Set MPS to match upstream bridge"), we made
> sure every device's MPS setting matches its upstream bridge, making it more
> likely that a hot-added device will work in a system with an optimized MPS
> configuration.
> 
> Recently I've started encountering systems where the endpoint device's MPSS
> capability is less than its root port's current MPS value, thus the
> endpoint is not capable of matching its upstream bridge's MPS setting (see:
> bugzilla via "Link:" below).  This leaves the system vulnerable - the
> upstream root port could respond with larger sized TLPs than the endpoint
> can handle, and the endpoint will consider them to be 'Malformed'.
> 
> One could use the "pci=pcie_bus_safe" kernel parameter to resolve the
> issue, but doing so both forces the user to supply a kernel parameter to
> get the system to function reliably and may end up limiting the MPS settings
> of other, unrelated sub-topologies that could benefit from maintaining
> their larger values.
> 
> This patch augments Keith's approach to include tuning down a root port's
> MPS setting when its hot-added endpoint device is not capable of matching
> it.  The tuning down, so that both the root port and endpoint match, is
> limited to root ports with downstream endpoint device sub-topologies.
> 
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=200527
> Cc: Keith Busch <keith.busch@intel.com>
> Cc: Jon Mason <jdmason@kudzu.us>
> Cc: Sinan Kaya <okaya@kernel.org>
> Signed-off-by: Myron Stowe <myron.stowe@redhat.com>

Applied to pci/enumeration for v4.19, thanks!

Dongdong Liu Aug. 10, 2018, 10:04 a.m. UTC | #3
Hi Bjorn, Myron

I found a bug after applying the patch.

The topology is shown below. An 82599 network card with two functions is connected to the RP.
  +-[0000:80]-+-00.0-[81]--+-00.0  Device 8086:10fb
  |           |            \-00.1  Device 8086:10fb

1. Run "lspci -s BDF -vvv" to get each device's MPSS, MPS and MRRS values.
RP     (80:00.0): MPSS=512 MPS=512 MRRS=512
EP PF0 (81:00.0): MPSS=512 MPS=512 MRRS=512
   PF1 (81:00.1): MPSS=512 MPS=512 MRRS=512

2. Enable SR-IOV.
echo 1 > /sys/devices/pci0000\:80/0000\:80\:00.0/0000\:81\:00.0/sriov_numvfs
RP     (80:00.0): MPSS=512 MPS=128 MRRS=512
                               ^^^
EP PF0 (81:00.0): MPSS=512 MPS=512 MRRS=512
                               ^^^
   PF1 (81:00.1): MPSS=512 MPS=512 MRRS=512
                               ^^^
   VF0 (81:10.0): MPSS=128 MPS=128 MRRS=128
                               ^^^
The 82599's PF (MPSS 512) and VF (MPSS 128) report different MPSS values.
The RP (MPS 128) will then report a Malformed TLP when PF0/PF1 performs a memory write with MPS 512.

The 82599 works fine without the patch.
Without the patch, the MPSS, MPS and MRRS values are as follows.

RP     (80:00.0): MPSS=512 MPS=512 MRRS=512
                               ^^^
EP PF0 (81:00.0): MPSS=512 MPS=512 MRRS=512
                               ^^^
   PF1 (81:00.1): MPSS=512 MPS=512 MRRS=512
                               ^^^
   VF0 (81:10.0): MPSS=128 MPS=128 MRRS=128
                               ^^^

Thanks,
Dongdong

Bjorn Helgaas Aug. 10, 2018, 5:28 p.m. UTC | #4
On Fri, Aug 10, 2018 at 06:04:39PM +0800, Dongdong Liu wrote:
> Hi Bjorn, Myron
> 
> I found a bug after applying the patch.
> 
> The topology is shown below. An 82599 network card with two functions is connected to the RP.
>  +-[0000:80]-+-00.0-[81]--+-00.0  Device 8086:10fb
>  |           |            \-00.1  Device 8086:10fb
> 
> 1. Run "lspci -s BDF -vvv" to get each device's MPSS, MPS and MRRS values.
> RP     (80:00.0): MPSS=512 MPS=512 MRRS=512
> EP PF0 (81:00.0): MPSS=512 MPS=512 MRRS=512
>    PF1 (81:00.1): MPSS=512 MPS=512 MRRS=512
> 
> 2. Enable SR-IOV.
> echo 1 > /sys/devices/pci0000\:80/0000\:80\:00.0/0000\:81\:00.0/sriov_numvfs
> RP     (80:00.0): MPSS=512 MPS=128 MRRS=512
>                                ^^^
> EP PF0 (81:00.0): MPSS=512 MPS=512 MRRS=512
>                                ^^^
>    PF1 (81:00.1): MPSS=512 MPS=512 MRRS=512
>                                ^^^
>    VF0 (81:10.0): MPSS=128 MPS=128 MRRS=128
>                                ^^^
> The 82599's PF (MPSS 512) and VF (MPSS 128) report different MPSS values.
> The RP (MPS 128) will then report a Malformed TLP when PF0/PF1 performs a memory write with MPS 512.
> 
> The 82599 works fine without the patch.
> Without the patch, the MPSS, MPS and MRRS values are as follows.
> 
> RP     (80:00.0): MPSS=512 MPS=512 MRRS=512
>                                ^^^
> EP PF0 (81:00.0): MPSS=512 MPS=512 MRRS=512
>                                ^^^
>    PF1 (81:00.1): MPSS=512 MPS=512 MRRS=512
>                                ^^^
>    VF0 (81:10.0): MPSS=128 MPS=128 MRRS=128
>                                ^^^

OK, thanks a lot for testing this out.

I'll drop this change for now until we figure out what's going on.

Myron Stowe Aug. 10, 2018, 9:33 p.m. UTC | #5
On Fri, 10 Aug 2018 18:04:39 +0800
Dongdong Liu <liudongdong3@huawei.com> wrote:

> Hi Bjorn, Myron
> 
> I found a bug after applying the patch.
> 
> The topology is shown below. An 82599 network card with two functions
> is connected to the RP.
>   +-[0000:80]-+-00.0-[81]--+-00.0  Device 8086:10fb
>   |           |            \-00.1  Device 8086:10fb
> 
> 1. Run "lspci -s BDF -vvv" to get each device's MPSS, MPS and MRRS values.
> RP     (80:00.0): MPSS=512 MPS=512 MRRS=512
> EP PF0 (81:00.0): MPSS=512 MPS=512 MRRS=512
>    PF1 (81:00.1): MPSS=512 MPS=512 MRRS=512
> 
> 2. Enable SR-IOV.
> echo 1 > /sys/devices/pci0000\:80/0000\:80\:00.0/0000\:81\:00.0/sriov_numvfs
> RP     (80:00.0): MPSS=512 MPS=128 MRRS=512
>                                ^^^
> EP PF0 (81:00.0): MPSS=512 MPS=512 MRRS=512
>                                ^^^
>    PF1 (81:00.1): MPSS=512 MPS=512 MRRS=512
>                                ^^^
>    VF0 (81:10.0): MPSS=128 MPS=128 MRRS=128
>                                ^^^
> The 82599's PF (MPSS 512) and VF (MPSS 128) report different MPSS
> values. The RP (MPS 128) will then report a Malformed TLP when PF0/PF1
> performs a memory write with MPS 512.
> 
> The 82599 works fine without the patch.
> Without the patch, the MPSS, MPS and MRRS values are as follows.
> 
> RP     (80:00.0): MPSS=512 MPS=512 MRRS=512
>                                ^^^
> EP PF0 (81:00.0): MPSS=512 MPS=512 MRRS=512
>                                ^^^
>    PF1 (81:00.1): MPSS=512 MPS=512 MRRS=512
>                                ^^^
>    VF0 (81:10.0): MPSS=128 MPS=128 MRRS=128
>                                ^^^

Hi Dongdong,

Thanks for testing and for noticing a problem with the patch,
especially before it was incorporated upstream!


Looking into the PCI Express Base spec (4.0 r1.0), section 9.3.5.3
concerning the "Device Capabilities Register", it indicates "PF and VF
functionality is defined in Section 7.5.3.3 except where noted in
Table 9-15".  Table 9-15 doesn't specifically mention anything with
respect to MPSS, which would make one _think_ that the VF's respective
bits are valid.

However, section 9.3.5.4, concerning the "Device Control Register",
does specifically show both Max_Payload_Size (MPS) and
Max_Read_Request_Size (MRRS) to be 'RsvdP' for VFs in Table 9-16
[1].  Just prior to the table it states:
  "PF and VF functionality is defined in Section 7.5.3.4 except where 
   noted in Table 9-16. For VF fields marked RsvdP, the PF setting
   applies to the VF."

All of which implies that with respect to MPSS, MPS, and MRRS values,
we should _not_ be paying any attention to the VF's fields, but
rather only to the PF's.  Only looking at the PF's fields also
_logically_ makes sense as it is the sole physical interface to the
PCIe bus.


As for the patch, it looks like an additional check for whether the
device is a virtual function ('dev->is_virtfn') is needed, bailing
out early if it is.


[1] Per section 7.4, "Configuration Register Types", 'RsvdP' fields are:
      "Reserved for future RW implementations.  Register bits are
       read-only and must return zero when read. Software must preserve
       the value read for writes to bits."
    which accounts for the MPS and MRRS values being read as '0', and
    thus subsequently interpreted as '128'.

    Which brings up a tangential question: Should 'lspci' interpret,
    and output, 'RsvdP' fields of the Device Control Register
    corresponding to VFs?

Myron

> 
> Thanks,
> Dongdong
On 2018/8/1 22:05, Bjorn Helgaas wrote:
>
snip O<
Dongdong Liu Aug. 11, 2018, 3:47 a.m. UTC | #6
Hi Myron

On 2018/8/11 5:33, Myron Stowe wrote:
> Hi Dongdong,
>
> Thanks for testing and for noticing a problem with the patch,
> especially before it was incorporated upstream!
>
>
> Looking into the PCI Express Base spec (4.0 r1.0), section 9.3.5.3
> concerning the "Device Capabilities Register", it indicates "PF and VF
> functionality is defined in Section 7.5.3.3 except where noted in
> Table 9-15".  Table 9-15 doesn't specifically mention anything with
> respect to MPSS which would make one _think_ that its respective VF's
> bits are valid.
Yes, it is very easy to misunderstand, especially since section 7.5.3.3 says:
  Max_Payload_Size Supported:
  The Functions of a Multi-Function Device are permitted to report
  different values for this field.
>
> However, section 9.3.5.4, concerning the "Device Control Register",
> does specifically show both Max_Payload_Size (MPS) and
> Max_Read_Request_Size (MRRS) to be 'RsvdP' for VFs in Table 9-16
> [1].  Just prior to the table it states:
>   "PF and VF functionality is defined in Section 7.5.3.4 except where
>    noted in Table 9-16. For VF fields marked RsvdP, the PF setting
>    applies to the VF."
>
> All of which implies that with respect to MPSS, MPS, and MRRS values,
> we should _not_ be paying any attention to the VF's fields, but
> rather only to the PF's.  Only looking at the PF's fields also
> _logically_ makes sense as it is the sole physical interface to the
> PCIe bus.
Thanks for clarifying this.
>
>
> As for the patch, it looks like an additional check for whether the
> device is a virtual function ('dev->is_virtfn') is needed, bailing
> out early if it is.

Yes, that will be ok.
Thanks,
Dongdong

Patch

diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index ac91b6f..2987bd9 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -1670,7 +1670,7 @@  int pci_setup_device(struct pci_dev *dev)
 static void pci_configure_mps(struct pci_dev *dev)
 {
 	struct pci_dev *bridge = pci_upstream_bridge(dev);
-	int mps, p_mps, rc;
+	int mps, mpss, p_mps, rc;
 
 	if (!pci_is_pcie(dev) || !bridge || !pci_is_pcie(bridge))
 		return;
@@ -1694,6 +1694,14 @@  static void pci_configure_mps(struct pci_dev *dev)
 	if (pcie_bus_config != PCIE_BUS_DEFAULT)
 		return;
 
+	mpss = 128 << dev->pcie_mpss;
+	if (mpss < p_mps && pci_pcie_type(bridge) == PCI_EXP_TYPE_ROOT_PORT) {
+		pcie_set_mps(bridge, mpss);
+		pci_info(dev, "Upstream bridge's Max Payload Size set to %d (was %d, max %d)\n",
+			 mpss, p_mps, 128 << bridge->pcie_mpss);
+		p_mps = pcie_get_mps(bridge);
+	}
+
 	rc = pcie_set_mps(dev, p_mps);
 	if (rc) {
 		pci_warn(dev, "can't set Max Payload Size to %d; if necessary, use \"pci=pcie_bus_safe\" and report a bug\n",
@@ -1702,7 +1710,7 @@  static void pci_configure_mps(struct pci_dev *dev)
 	}
 
 	pci_info(dev, "Max Payload Size set to %d (was %d, max %d)\n",
-		 p_mps, mps, 128 << dev->pcie_mpss);
+		 p_mps, mps, mpss);
 }
 
 static struct hpp_type0 pci_default_type0 = {