diff mbox

PCI: Remove MRRS modification from MPS setting code

Message ID 1314971145-28038-1-git-send-email-mason@myri.com (mailing list archive)
State New, archived
Delegated to: Bjorn Helgaas
Headers show

Commit Message

Jon Mason Sept. 2, 2011, 1:45 p.m. UTC
Modifying the Maximum Read Request Size to 0 (value of 128Bytes) has
massive negative ramifications on some devices.  Without knowing which
devices have this issue, do not modify from the default value when
walking the PCI-E bus.

Tested-by: Stephen M. Cameron <scameron@beardog.cce.hp.com>
Reported-and-tested-by: Eric Dumazet <eric.dumazet@gmail.com>
Reported-and-tested-by: Niels Ole Salscheider <niels_ole@salscheider-online.de>
References: https://bugzilla.kernel.org/show_bug.cgi?id=42162
Signed-off-by: Jon Mason <mason@myri.com>
---
 drivers/pci/probe.c |   36 ------------------------------------
 1 files changed, 0 insertions(+), 36 deletions(-)

Comments

Benjamin Herrenschmidt Sept. 5, 2011, 5:48 p.m. UTC | #1
On Fri, 2011-09-02 at 08:45 -0500, Jon Mason wrote:
> Modifying the Maximum Read Request Size to 0 (value of 128Bytes) has
> massive negative ramifications on some devices.  Without knowing which
> devices have this issue, do not modify from the default value when
> walking the PCI-E bus.

BTW. I recently ran into an MRRS related problem (on a system that
doesn't have any of our recent patches yet).

This does have an impact however if we use the algorithm I suggested for
setting MPS, which is to allow parents to have a larger MPS than
children in order to avoid clamping an entire hierarchy when it contains
one child device that has a small MPS.

When doing that, it is critical that the MRRS be set to be lower or
equal to the MPS of the device. Because the parent bridge (and thus the
host bridge) will potentially have an MPS larger than the requesting
function, the bridge -will- honor a large read request with a packet
potential as large as it's own MPS, and thus potentially larger that
what the function can cope with.

This isn't a problem if the MPS are clamped to the smallest common
denominator of the entire hierarchy.

So it depends really what algorithm has been chosen for the
configuration of the MPS.

Cheers,
Ben. 

> Tested-by: Stephen M. Cameron <scameron@beardog.cce.hp.com>
> Reported-and-tested-by: Eric Dumazet <eric.dumazet@gmail.com>
> Reported-and-tested-by: Niels Ole Salscheider <niels_ole@salscheider-online.de>
> References: https://bugzilla.kernel.org/show_bug.cgi?id=42162
> Signed-off-by: Jon Mason <mason@myri.com>
> ---
>  drivers/pci/probe.c |   36 ------------------------------------
>  1 files changed, 0 insertions(+), 36 deletions(-)
> 
> diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
> index 8473727..d896c5e 100644
> --- a/drivers/pci/probe.c
> +++ b/drivers/pci/probe.c
> @@ -1394,41 +1394,6 @@ static void pcie_write_mps(struct pci_dev *dev, int mps)
>  		dev_err(&dev->dev, "Failed attempting to set the MPS\n");
>  }
>  
> -static void pcie_write_mrrs(struct pci_dev *dev, int mps)
> -{
> -	int rc, mrrs;
> -
> -	if (pcie_bus_config == PCIE_BUS_PERFORMANCE) {
> -		int dev_mpss = 128 << dev->pcie_mpss;
> -
> -		/* For Max performance, the MRRS must be set to the largest
> -		 * supported value.  However, it cannot be configured larger
> -		 * than the MPS the device or the bus can support.  This assumes
> -		 * that the largest MRRS available on the device cannot be
> -		 * smaller than the device MPSS.
> -		 */
> -		mrrs = mps < dev_mpss ? mps : dev_mpss;
> -	} else
> -		/* In the "safe" case, configure the MRRS for fairness on the
> -		 * bus by making all devices have the same size
> -		 */
> -		mrrs = mps;
> -
> -
> -	/* MRRS is a R/W register.  Invalid values can be written, but a
> -	 * subsiquent read will verify if the value is acceptable or not.
> -	 * If the MRRS value provided is not acceptable (e.g., too large),
> -	 * shrink the value until it is acceptable to the HW.
> - 	 */
> -	while (mrrs != pcie_get_readrq(dev) && mrrs >= 128) {
> -		rc = pcie_set_readrq(dev, mrrs);
> -		if (rc)
> -			dev_err(&dev->dev, "Failed attempting to set the MRRS\n");
> -
> -		mrrs /= 2;
> -	}
> -}
> -
>  static int pcie_bus_configure_set(struct pci_dev *dev, void *data)
>  {
>  	int mps = 128 << *(u8 *)data;
> @@ -1440,7 +1405,6 @@ static int pcie_bus_configure_set(struct pci_dev *dev, void *data)
>  		 pcie_get_mps(dev), 128<<dev->pcie_mpss, pcie_get_readrq(dev));
>  
>  	pcie_write_mps(dev, mps);
> -	pcie_write_mrrs(dev, mps);
>  
>  	dev_info(&dev->dev, "Dev MPS %d MPSS %d MRRS %d\n",
>  		 pcie_get_mps(dev), 128<<dev->pcie_mpss, pcie_get_readrq(dev));


--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jon Mason Sept. 6, 2011, 6:11 p.m. UTC | #2
On Mon, Sep 5, 2011 at 10:48 AM, Benjamin Herrenschmidt
<benh@kernel.crashing.org> wrote:
> On Fri, 2011-09-02 at 08:45 -0500, Jon Mason wrote:
>> Modifying the Maximum Read Request Size to 0 (value of 128Bytes) has
>> massive negative ramifications on some devices.  Without knowing which
>> devices have this issue, do not modify from the default value when
>> walking the PCI-E bus.
>
> BTW. I recently ran into an MRRS related problem (on a system that
> doesn't have any of our recent patches yet).
>
> This does have an impact however if we use the algorithm I suggested for
> setting MPS, which is to allow parents to have a larger MPS than
> children in order to avoid clamping an entire hierarchy when it contains
> one child device that has a small MPS.
>
> When doing that, it is critical that the MRRS be set to be lower or
> equal to the MPS of the device. Because the parent bridge (and thus the
> host bridge) will potentially have an MPS larger than the requesting
> function, the bridge -will- honor a large read request with a packet
> potential as large as it's own MPS, and thus potentially larger that
> what the function can cope with.
>
> This isn't a problem if the MPS are clamped to the smallest common
> denominator of the entire hierarchy.
>
> So it depends really what algorithm has been chosen for the
> configuration of the MPS.

Yes, that is why I added the MRRS setting code originally.  Since the
MRRS setting is almost universally larger than the MPS and some
devices are freaking when their MRRS is changed, we must either use
the "safe" option by default and not touch the MRRS or allow the
"performance" option to have this hole.  I suppose a third option is
to create a list of faulty devices that cannot handle MRRS
modification, but that seems rather ugly.

I'm open to any option, but there seems to be a large number of issues
with the MRRS modification code and we need to get a fix into 3.1.

Thanks,
Jon



>
> Cheers,
> Ben.
>
>> Tested-by: Stephen M. Cameron <scameron@beardog.cce.hp.com>
>> Reported-and-tested-by: Eric Dumazet <eric.dumazet@gmail.com>
>> Reported-and-tested-by: Niels Ole Salscheider <niels_ole@salscheider-online.de>
>> References: https://bugzilla.kernel.org/show_bug.cgi?id=42162
>> Signed-off-by: Jon Mason <mason@myri.com>
>> ---
>>  drivers/pci/probe.c |   36 ------------------------------------
>>  1 files changed, 0 insertions(+), 36 deletions(-)
>>
>> diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
>> index 8473727..d896c5e 100644
>> --- a/drivers/pci/probe.c
>> +++ b/drivers/pci/probe.c
>> @@ -1394,41 +1394,6 @@ static void pcie_write_mps(struct pci_dev *dev, int mps)
>>               dev_err(&dev->dev, "Failed attempting to set the MPS\n");
>>  }
>>
>> -static void pcie_write_mrrs(struct pci_dev *dev, int mps)
>> -{
>> -     int rc, mrrs;
>> -
>> -     if (pcie_bus_config == PCIE_BUS_PERFORMANCE) {
>> -             int dev_mpss = 128 << dev->pcie_mpss;
>> -
>> -             /* For Max performance, the MRRS must be set to the largest
>> -              * supported value.  However, it cannot be configured larger
>> -              * than the MPS the device or the bus can support.  This assumes
>> -              * that the largest MRRS available on the device cannot be
>> -              * smaller than the device MPSS.
>> -              */
>> -             mrrs = mps < dev_mpss ? mps : dev_mpss;
>> -     } else
>> -             /* In the "safe" case, configure the MRRS for fairness on the
>> -              * bus by making all devices have the same size
>> -              */
>> -             mrrs = mps;
>> -
>> -
>> -     /* MRRS is a R/W register.  Invalid values can be written, but a
>> -      * subsiquent read will verify if the value is acceptable or not.
>> -      * If the MRRS value provided is not acceptable (e.g., too large),
>> -      * shrink the value until it is acceptable to the HW.
>> -      */
>> -     while (mrrs != pcie_get_readrq(dev) && mrrs >= 128) {
>> -             rc = pcie_set_readrq(dev, mrrs);
>> -             if (rc)
>> -                     dev_err(&dev->dev, "Failed attempting to set the MRRS\n");
>> -
>> -             mrrs /= 2;
>> -     }
>> -}
>> -
>>  static int pcie_bus_configure_set(struct pci_dev *dev, void *data)
>>  {
>>       int mps = 128 << *(u8 *)data;
>> @@ -1440,7 +1405,6 @@ static int pcie_bus_configure_set(struct pci_dev *dev, void *data)
>>                pcie_get_mps(dev), 128<<dev->pcie_mpss, pcie_get_readrq(dev));
>>
>>       pcie_write_mps(dev, mps);
>> -     pcie_write_mrrs(dev, mps);
>>
>>       dev_info(&dev->dev, "Dev MPS %d MPSS %d MRRS %d\n",
>>                pcie_get_mps(dev), 128<<dev->pcie_mpss, pcie_get_readrq(dev));
>
>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Benjamin Herrenschmidt Sept. 6, 2011, 6:52 p.m. UTC | #3
On Tue, 2011-09-06 at 11:11 -0700, Jon Mason wrote:
> Yes, that is why I added the MRRS setting code originally.  Since the
> MRRS setting is almost universally larger than the MPS and some
> devices are freaking when their MRRS is changed, we must either use
> the "safe" option by default and not touch the MRRS or allow the
> "performance" option to have this hole.  I suppose a third option is
> to create a list of faulty devices that cannot handle MRRS
> modification, but that seems rather ugly.

Do we know for sure that card can't handle it at all or only freaks out
with "0" ?

IE Does the card work if MRS == MPS of the card (ie using my algorithm
and assuming the host bridge supports arbitrary MPS).

> I'm open to any option, but there seems to be a large number of issues
> with the MRRS modification code and we need to get a fix into 3.1. 

Ok, I only head of one, I didn't know there were more.

Are these typically hitting with the "performance" option ? IE. It make
sense to leave MRRS untouched in the "safe" case.

Cheers,
Ben.


--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jon Mason Sept. 6, 2011, 7:12 p.m. UTC | #4
On Tue, Sep 6, 2011 at 11:52 AM, Benjamin Herrenschmidt
<benh@kernel.crashing.org> wrote:
> On Tue, 2011-09-06 at 11:11 -0700, Jon Mason wrote:
>> Yes, that is why I added the MRRS setting code originally.  Since the
>> MRRS setting is almost universally larger than the MPS and some
>> devices are freaking when their MRRS is changed, we must either use
>> the "safe" option by default and not touch the MRRS or allow the
>> "performance" option to have this hole.  I suppose a third option is
>> to create a list of faulty devices that cannot handle MRRS
>> modification, but that seems rather ugly.
>
> Do we know for sure that card can't handle it at all or only freaks out
> with "0" ?

It's hard to say since the MPS on most systems is "0" (e.g., 128B).
It appears that one of the Radeon chips doesn't like the MRRS set to
0/6/7 (according to a patch attached to the same bug).  So, there are
a number of invalid settings that will have a negative impact on that.
 If we don't want to allow the hole in the patch I pushed to Jesse, I
think we should go with the "safe" option not setting the MRRS and
make it the default.

> IE Does the card work if MRS == MPS of the card (ie using my algorithm
> and assuming the host bridge supports arbitrary MPS).

Currently, the code will set the MRRS to the MPS of the parent (if the
"performance" option is used).  It will then read back the value to
verify that it has been changed.  If not, then it will try setting it
to the next value lower, etc. until it hits 0.

The "safe" option tries to set the MRRS to a uniform size to make
everything equitable, which seemed logical but not 100% necessary.  I
am more than happy to have this option not touch the default MRRS at
all.

>> I'm open to any option, but there seems to be a large number of issues
>> with the MRRS modification code and we need to get a fix into 3.1.
>
> Ok, I only head of one, I didn't know there were more.

I've gotten 5 e-mails so far.  I'm betting there are more out there.

> Are these typically hitting with the "performance" option ? IE. It make
> sense to leave MRRS untouched in the "safe" case.

The patch I sent out still used the "performance" option without
modifying the default MRRS of the device.  All that have tested it
said that it resolves their issues.

thanks,
Jon

>
> Cheers,
> Ben.
>
>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Benjamin Herrenschmidt Sept. 6, 2011, 8:47 p.m. UTC | #5
On Tue, 2011-09-06 at 12:12 -0700, Jon Mason wrote:
> > Are these typically hitting with the "performance" option ? IE. It
> make
> > sense to leave MRRS untouched in the "safe" case.
> 
> The patch I sent out still used the "performance" option without
> modifying the default MRRS of the device.  All that have tested it
> said that it resolves their issues.

But that will cause other issues as I described, if the MRRS end up
larger than the MPS. IE. The MRRS of a device must be set to be lower or
equal to the MPS of that device (not of the parent btw) if we allow the
parent(s) to have a larger MPS.

I -did- hit a very real problem with adapters where that wasn't true.

As for those users, are they all the same radeon ?

Cheers,
Ben.


--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Rolf Eike Beer Sept. 7, 2011, 8:13 a.m. UTC | #6
> On Tue, 2011-09-06 at 12:12 -0700, Jon Mason wrote:
>> > Are these typically hitting with the "performance" option ? IE. It
>> make
>> > sense to leave MRRS untouched in the "safe" case.
>>
>> The patch I sent out still used the "performance" option without
>> modifying the default MRRS of the device.  All that have tested it
>> said that it resolves their issues.
>
> But that will cause other issues as I described, if the MRRS end up
> larger than the MPS. IE. The MRRS of a device must be set to be lower or
> equal to the MPS of that device (not of the parent btw) if we allow the
> parent(s) to have a larger MPS.

I don't think so. Just looking into my lspci I found more than one
occurence of things like this:

MaxPayload 128 bytes, MaxReadReq 512 bytes

Which is perfectly fine. The requester (i.e. the device) may issue read
requests up to 512 bytes at once, which is a thing about transfer credits
and the like (IIRC). The completer may split the completions into packages
of any valid size between the minimum size (IIRC 64 bytes) and the maximum
_payload_ size the requester can handle. In this case you will likely get
either 8*64 or 4*128 byte completions for a read request of 512 byte, but
any combination in between would be valid, too.

One needs to keep in mind that a read request of 512 byte is itself only a
12 or 16 byte packet, so it isn't affected by the payload size at all.

> I -did- hit a very real problem with adapters where that wasn't true.

From my understanding this shows that these adapters were broken, nothing
else. An adapter must be prepared to get a bunch of smaller packets
anyway, as it has no control of how the completer sends out the data. So
maybe your adapters just got screwed by awaiting a 512 byte reply and
getting 4*128? You could connect them to some sort of bastard completer if
you have one (I don't) that completes every request in packets of 64 byte
and see if they will not even explode with 128 byte MRRS.

Eike
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Benjamin Herrenschmidt Sept. 7, 2011, 9:30 a.m. UTC | #7
On Wed, 2011-09-07 at 10:13 +0200, Rolf Eike Beer wrote:
> > On Tue, 2011-09-06 at 12:12 -0700, Jon Mason wrote:
> >> > Are these typically hitting with the "performance" option ? IE. It
> >> make
> >> > sense to leave MRRS untouched in the "safe" case.
> >>
> >> The patch I sent out still used the "performance" option without
> >> modifying the default MRRS of the device.  All that have tested it
> >> said that it resolves their issues.
> >
> > But that will cause other issues as I described, if the MRRS end up
> > larger than the MPS. IE. The MRRS of a device must be set to be lower or
> > equal to the MPS of that device (not of the parent btw) if we allow the
> > parent(s) to have a larger MPS.
> 
> I don't think so. Just looking into my lspci I found more than one
> occurence of things like this:
> 
> MaxPayload 128 bytes, MaxReadReq 512 bytes
>
> Which is perfectly fine. The requester (i.e. the device) may issue read
> requests up to 512 bytes at once, which is a thing about transfer credits
> and the like (IIRC). The completer may split the completions into packages
> of any valid size between the minimum size (IIRC 64 bytes) and the maximum
> _payload_ size the requester can handle. In this case you will likely get
> either 8*64 or 4*128 byte completions for a read request of 512 byte, but
> any combination in between would be valid, too.

No. If the MPS of the host bridge is larger than 128, then the host
bridge will possibly return read responses using payloads up to the
MRRS.

> One needs to keep in mind that a read request of 512 byte is itself only a
> 12 or 16 byte packet, so it isn't affected by the payload size at all.

The problem is of course not the request packet itself but the response.

> > I -did- hit a very real problem with adapters where that wasn't true.
> 
> >From my understanding this shows that these adapters were broken, nothing
> else. An adapter must be prepared to get a bunch of smaller packets
> anyway, as it has no control of how the completer sends out the data. So
> maybe your adapters just got screwed by awaiting a 512 byte reply and
> getting 4*128? You could connect them to some sort of bastard completer if
> you have one (I don't) that completes every request in packets of 64 byte
> and see if they will not even explode with 128 byte MRRS.

Unfortunately, I didn't manage to get a good TLP capture of the problem
packets in AER. But basically what happens is:

 - Host bridge has a large MPS (For example 4096)
 - Device has a smaller MPS (for example 128)
 - Device has a large MRRS (for example 512)

What I observed is when receiving network packets larger than roughly
128 bytes (I didn't get precise packet size threshold, I wasn't doing
the tests myself), the device appears to get read -responses- larger
than it's MPS (up to it's MRRS, ie, the size it specified in the read
request), and shoots an UE upstream.

This happens with e1000e's so I doubt it's a broken PCIe implementation
in the device, and it makes sense all things considered.

The Host bridge having an MPS larger than 128, it is allowed to send a
read response using a large TLP, which will be rejected by the device.

The "safe" approach of course is to clamp all MPS to the minimum, but
that leads to way too many situations where everybody gets down to 128
bytes because -one- device in the system has 128 bytes, and that means
that anything that has a hotplug slot must clamp everybody as well.

Cheers,
Ben.
 

--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Benjamin Herrenschmidt Sept. 7, 2011, 10:19 a.m. UTC | #8
On Wed, 2011-09-07 at 06:30 -0300, Benjamin Herrenschmidt wrote:

> Unfortunately, I didn't manage to get a good TLP capture of the problem
> packets in AER. But basically what happens is:
> 
>  - Host bridge has a large MPS (For example 4096)
>  - Device has a smaller MPS (for example 128)
>  - Device has a large MRRS (for example 512)

Just double checked on the actual machine. The device has a MPS of 256
and the bridge can go up to 4096. We were letting it up and observed the
problem with an MRRS of 512 (apparently the power-on default of that
adapter).

So either we clamp the bridge to 256 and penalize everybody, or we clamp
the e1000's MRRS to 256 and things work.

Cheers,
Ben.

> What I observed is when receiving network packets larger than roughly
> 128 bytes (I didn't get precise packet size threshold, I wasn't doing
> the tests myself), the device appears to get read -responses- larger
> than it's MPS (up to it's MRRS, ie, the size it specified in the read
> request), and shoots an UE upstream.
> 
> This happens with e1000e's so I doubt it's a broken PCIe implementation
> in the device, and it makes sense all things considered.
> 
> The Host bridge having an MPS larger than 128, it is allowed to send a
> read response using a large TLP, which will be rejected by the device.
> 
> The "safe" approach of course is to clamp all MPS to the minimum, but
> that leads to way too many situations where everybody gets down to 128
> bytes because -one- device in the system has 128 bytes, and that means
> that anything that has a hotplug slot must clamp everybody as well.
> 
> Cheers,
> Ben.
>  


--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Rolf Eike Beer Sept. 7, 2011, 10:37 a.m. UTC | #9
> On Wed, 2011-09-07 at 06:30 -0300, Benjamin Herrenschmidt wrote:
>
>> Unfortunately, I didn't manage to get a good TLP capture of the problem
>> packets in AER. But basically what happens is:
>>
>>  - Host bridge has a large MPS (For example 4096)

Out of curiosity: what is this for a board? The only thing we ever found
for a reasonable price that has more than 128 byte here is the Intel
X58/Tylersburg (and an older NVidia one). This is not really my business
anymore but I guess I could make some people happy if I tell them what to
look for.

>>  - Device has a smaller MPS (for example 128)
>>  - Device has a large MRRS (for example 512)
>
> Just double checked on the actual machine. The device has a MPS of 256
> and the bridge can go up to 4096. We were letting it up and observed the
> problem with an MRRS of 512 (apparently the power-on default of that
> adapter).
>
> So either we clamp the bridge to 256 and penalize everybody, or we clamp
> the e1000's MRRS to 256 and things work.

We need to change the MPS of the bridge anyway as it could send e.g. a
write request of 4k otherwise. Which is completely orthogonal to the MRRS
and would cause the same breakage. And as far as I understand what the
patches do is exactly this change for exactly this reason: avoid too large
packets hitting the device. The MRRS is only for things that were
originally requested by the target device, but it is by far not the only
way such packets may happen. Maybe it is the most likely way, but nothing
more.

But it would still be an interesting question to get a list of devices
broken when the MRRS is changed. And to kick the vendors hard to fix that
mess.

Eike
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Benjamin Herrenschmidt Sept. 7, 2011, 10:58 a.m. UTC | #10
On Wed, 2011-09-07 at 12:37 +0200, Rolf Eike Beer wrote:
> > On Wed, 2011-09-07 at 06:30 -0300, Benjamin Herrenschmidt wrote:
> >
> >> Unfortunately, I didn't manage to get a good TLP capture of the problem
> >> packets in AER. But basically what happens is:
> >>
> >>  - Host bridge has a large MPS (For example 4096)
> 
> Out of curiosity: what is this for a board? The only thing we ever found
> for a reasonable price that has more than 128 byte here is the Intel
> X58/Tylersburg (and an older NVidia one). This is not really my business
> anymore but I guess I could make some people happy if I tell them what to
> look for.

IBM POWER7 server :-)
 
> >>  - Device has a smaller MPS (for example 128)
> >>  - Device has a large MRRS (for example 512)
> >
> > Just double checked on the actual machine. The device has a MPS of 256
> > and the bridge can go up to 4096. We were letting it up and observed the
> > problem with an MRRS of 512 (apparently the power-on default of that
> > adapter).
> >
> > So either we clamp the bridge to 256 and penalize everybody, or we clamp
> > the e1000's MRRS to 256 and things work.
> 
> We need to change the MPS of the bridge anyway as it could send e.g. a
> write request of 4k otherwise. 

No. It couldn't in our case (and in most implementations) simply because
write requests from the bridge are triggered by MMIO operations that
simply cannot gather that much. Granted, it requires platform specific
knowledge to be able to make that assumption (which is why I originally
wanted that "fast" mode to be selected specifically by the
architecture).

In our case, we -know- our bridges will never generate a request longer
than 128 bytes. In fact, it can probably only happen if you have
something like a fancy DMA controller on the CPU side of the bridge (tho
Intels tend to have that nowadays).

> Which is completely orthogonal to the MRRS
> and would cause the same breakage. And as far as I understand what the
> patches do is exactly this change for exactly this reason: avoid too large
> packets hitting the device. The MRRS is only for things that were
> originally requested by the target device, but it is by far not the only
> way such packets may happen. Maybe it is the most likely way, but nothing
> more.
> 
> But it would still be an interesting question to get a list of devices
> broken when the MRRS is changed. And to kick the vendors hard to fix that
> mess.

Well, I would definitely recommend that the default unless changed by
the architecture code is the "safe" mode, in which case the bridge is
clamped (and effectively all devices in the hierarchy below the host
bridge are clamped to the lowest common MPS denominator).

On Power server, I'm happy to switch to the "fast" approach by default
which is to leave the upstream bridges at a higher MPS since that will
gain us a significant performance boost with some adapters, and have the
"safe" mode remain a kernel command line option just in case.

Without that, anything like a hotplug enclosure for example would have
to have everything permanently clamped to 128 which sucks.

Cheers,
Ben.


--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jon Mason Sept. 7, 2011, 6:48 p.m. UTC | #11
On Sep 6, 2011 1:47 PM, "Benjamin Herrenschmidt"
<benh@kernel.crashing.org> wrote:
> On Tue, 2011-09-06 at 12:12 -0700, Jon Mason wrote:
>> > Are these typically hitting with the "performance" option ? IE. It
>> make
>> > sense to leave MRRS untouched in the "safe" case.
>>
>> The patch I sent out still used the "performance" option without
>> modifying the default MRRS of the device. All that have tested it
>> said that it resolves their issues.
>
> But that will cause other issues as I described, if the MRRS end up
> larger than the MPS. IE. The MRRS of a device must be set to be lower or
> equal to the MPS of that device (not of the parent btw) if we allow the
> parent(s) to have a larger MPS.

Yes, that is the hole.

> I -did- hit a very real problem with adapters where that wasn't true.
>
> As for those users, are they all the same radeon ?

hpsa users too.  Also, some Dell servers are experiencing issues, but
I am not sure what hardware is causing issue on those systems.  Based
on this thread, I believe the best way to move forward is to make the
"safe" option the default and have it not set the MRRS.  The
"performance" option will set the MRRS.  I will add a printk to notify
the user of potential issues.  Hopefully users will try this option
and inform us of any issues encountered.

Thoughts?

Thanks,
Jon

> Cheers,
> Ben.
>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Shyam Iyer Sept. 7, 2011, 8:36 p.m. UTC | #12
> -----Original Message-----
> From: linux-pci-owner@vger.kernel.org [mailto:linux-pci-
> owner@vger.kernel.org] On Behalf Of Jon Mason
> Sent: Wednesday, September 07, 2011 2:49 PM
> To: Benjamin Herrenschmidt
> Cc: Jesse Barnes; linux-pci@vger.kernel.org
> Subject: Re: [PATCH] PCI: Remove MRRS modification from MPS setting
> code
> 
> On Sep 6, 2011 1:47 PM, "Benjamin Herrenschmidt"
> <benh@kernel.crashing.org> wrote:
> > On Tue, 2011-09-06 at 12:12 -0700, Jon Mason wrote:
> >> > Are these typically hitting with the "performance" option ? IE. It
> >> make
> >> > sense to leave MRRS untouched in the "safe" case.
> >>
> >> The patch I sent out still used the "performance" option without
> >> modifying the default MRRS of the device. All that have tested it
> >> said that it resolves their issues.
> >
> > But that will cause other issues as I described, if the MRRS end up
> > larger than the MPS. IE. The MRRS of a device must be set to be lower
> or
> > equal to the MPS of that device (not of the parent btw) if we allow
> the
> > parent(s) to have a larger MPS.
> 
> Yes, that is the hole.
> 
> > I -did- hit a very real problem with adapters where that wasn't true.
> >
> > As for those users, are they all the same radeon ?
> 
> hpsa users too.  Also, some Dell servers are experiencing issues, but
> I am not sure what hardware is causing issue on those systems.  Based
> on this thread, I believe the best way to move forward is to make the
> "safe" option the default and have it not set the MRRS.  The
> "performance" option will set the MRRS.  I will add a printk to notify
> the user of potential issues.  Hopefully users will try this option
> and inform us of any issues encountered.

A parameter option may help the cause for folks who would rather like the performance option.

> 
> Thoughts?
> 
> Thanks,
> Jon
> 
> > Cheers,
> > Ben.
> >
> >
> --
> To unsubscribe from this list: send the line "unsubscribe linux-pci" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jon Mason Sept. 7, 2011, 9:20 p.m. UTC | #13
On Wed, Sep 7, 2011 at 1:36 PM,  <Shyam_Iyer@dell.com> wrote:
>
>
>> -----Original Message-----
>> From: linux-pci-owner@vger.kernel.org [mailto:linux-pci-
>> owner@vger.kernel.org] On Behalf Of Jon Mason
>> Sent: Wednesday, September 07, 2011 2:49 PM
>> To: Benjamin Herrenschmidt
>> Cc: Jesse Barnes; linux-pci@vger.kernel.org
>> Subject: Re: [PATCH] PCI: Remove MRRS modification from MPS setting
>> code
>>
>> On Sep 6, 2011 1:47 PM, "Benjamin Herrenschmidt"
>> <benh@kernel.crashing.org> wrote:
>> > On Tue, 2011-09-06 at 12:12 -0700, Jon Mason wrote:
>> >> > Are these typically hitting with the "performance" option ? IE. It
>> >> make
>> >> > sense to leave MRRS untouched in the "safe" case.
>> >>
>> >> The patch I sent out still used the "performance" option without
>> >> modifying the default MRRS of the device. All that have tested it
>> >> said that it resolves their issues.
>> >
>> > But that will cause other issues as I described, if the MRRS end up
>> > larger than the MPS. IE. The MRRS of a device must be set to be lower
>> or
>> > equal to the MPS of that device (not of the parent btw) if we allow
>> the
>> > parent(s) to have a larger MPS.
>>
>> Yes, that is the hole.
>>
>> > I -did- hit a very real problem with adapters where that wasn't true.
>> >
>> > As for those users, are they all the same radeon ?
>>
>> hpsa users too.  Also, some Dell servers are experiencing issues, but
>> I am not sure what hardware is causing issue on those systems.  Based
>> on this thread, I believe the best way to move forward is to make the
>> "safe" option the default and have it not set the MRRS.  The
>> "performance" option will set the MRRS.  I will add a printk to notify
>> the user of potential issues.  Hopefully users will try this option
>> and inform us of any issues encountered.
>
> A parameter option may help the cause for folks who would rather like the performance option.

A boot parm already exists, pci=pcie_bus_perf.

>
>>
>> Thoughts?
>>
>> Thanks,
>> Jon
>>
>> > Cheers,
>> > Ben.
>> >
>> >
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-pci" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Benjamin Herrenschmidt Sept. 8, 2011, 9:50 a.m. UTC | #14
On Wed, 2011-09-07 at 11:48 -0700, Jon Mason wrote:
> > I -did- hit a very real problem with adapters where that wasn't
> true.
> >
> > As for those users, are they all the same radeon ?
> 
> hpsa users too.  Also, some Dell servers are experiencing issues, but
> I am not sure what hardware is causing issue on those systems.  Based
> on this thread, I believe the best way to move forward is to make the
> "safe" option the default and have it not set the MRRS.  The
> "performance" option will set the MRRS.  I will add a printk to notify
> the user of potential issues.  Hopefully users will try this option
> and inform us of any issues encountered.

I'm fine with that tho I'd like to have a way to change the default in
my arch code. People tend to use a limited set of adapters on Power
which are "supported" by IBM and we more/less know we can do the
"performance" option with them :-)

Cheers,
Ben.


--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 8473727..d896c5e 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -1394,41 +1394,6 @@  static void pcie_write_mps(struct pci_dev *dev, int mps)
 		dev_err(&dev->dev, "Failed attempting to set the MPS\n");
 }
 
-static void pcie_write_mrrs(struct pci_dev *dev, int mps)
-{
-	int rc, mrrs;
-
-	if (pcie_bus_config == PCIE_BUS_PERFORMANCE) {
-		int dev_mpss = 128 << dev->pcie_mpss;
-
-		/* For Max performance, the MRRS must be set to the largest
-		 * supported value.  However, it cannot be configured larger
-		 * than the MPS the device or the bus can support.  This assumes
-		 * that the largest MRRS available on the device cannot be
-		 * smaller than the device MPSS.
-		 */
-		mrrs = mps < dev_mpss ? mps : dev_mpss;
-	} else
-		/* In the "safe" case, configure the MRRS for fairness on the
-		 * bus by making all devices have the same size
-		 */
-		mrrs = mps;
-
-
-	/* MRRS is a R/W register.  Invalid values can be written, but a
-	 * subsiquent read will verify if the value is acceptable or not.
-	 * If the MRRS value provided is not acceptable (e.g., too large),
-	 * shrink the value until it is acceptable to the HW.
- 	 */
-	while (mrrs != pcie_get_readrq(dev) && mrrs >= 128) {
-		rc = pcie_set_readrq(dev, mrrs);
-		if (rc)
-			dev_err(&dev->dev, "Failed attempting to set the MRRS\n");
-
-		mrrs /= 2;
-	}
-}
-
 static int pcie_bus_configure_set(struct pci_dev *dev, void *data)
 {
 	int mps = 128 << *(u8 *)data;
@@ -1440,7 +1405,6 @@  static int pcie_bus_configure_set(struct pci_dev *dev, void *data)
 		 pcie_get_mps(dev), 128<<dev->pcie_mpss, pcie_get_readrq(dev));
 
 	pcie_write_mps(dev, mps);
-	pcie_write_mrrs(dev, mps);
 
 	dev_info(&dev->dev, "Dev MPS %d MPSS %d MRRS %d\n",
 		 pcie_get_mps(dev), 128<<dev->pcie_mpss, pcie_get_readrq(dev));