diff mbox

[v6,0/3] Add new PCI_DEV_FLAGS_NO_RELAXED_ORDERING flag

Message ID MWHPR12MB160087E9E0B17F7A17609714C8AA0@MWHPR12MB1600.namprd12.prod.outlook.com (mailing list archive)
State New, archived
Headers show

Commit Message

Casey Leedom July 7, 2017, 9:48 p.m. UTC
Hey Ding, Bjorn, Alexander, et.al.,

  Sorry for the insane delay in getting to this and thanks especially to
Ding for picking up the ball on this feature.  I got side=-tracked into a
multi-week rewrite of our Firmware/Host Driver Port Capabilities code, then
to the recent Ethernet Plug-Fest at the University of New Hampshire for a
week, and finally the 4th of July weekend.  Digging out from under email has
been non-amusing.  In any case, enough excuses.
 
  I'm looking at the "v6" version of the patches.  If this isn't the corect
version, please yell at me and I'll look at the correct one immediately.

  In the change to drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c (patch
3/3) it looks like the code is checking the Chelsio Adapter PCI Device to
see if Relaxed Ordering is supported.  But what we really want to test is
the Root Complex port.  I.e. something like:


The basic idea is that we want to find out if the Root Complex is okay with
TLPs directed its way with the Relaxed Ordering Attribute set.

  I also have to say that I'm a bit confused by the
pcie_relaxed_ordering_supported() in use here.  It seems that it's testing
whether the PCI Device's own Relaxed Ordering Enable is set which governs
whether it will send TLPs with the Relaxed Ordering Attribute set.  It has
nothing to do with whether or not the device responds well to incoming TLPs
with the Relaxed Ordering Attribute set.  I think that we really want to use
the pci_dev_should_disable_relaxed_ordering() API on the Root Complex Port
because that tests whether the quirck fired on it?  I.e.:

+       root = pci_find_pcie_root_port(pdev);
+       if (!pci_dev_should_disable_relaxed_ordering(root))
+               adapter->flags |= ROOT_NO_RELAXED_ORDERING;


Casey

Comments

Casey Leedom July 8, 2017, 12:30 a.m. UTC | #1
By the way, it ~seems~ like the patch set confuses the idea of the PCIe Capability Device Control[Relaxed Ordering Enable] with the device's ability to handle incoming TLPs with the Relaxed Ordering Attribute set.  These are completely different things.  The PCIe Capability Device Control[Relaxed Ordering Enable] solely governs the ability of the device to _send_ TLPs with the Relaxed Ordering Attribute set.  It has nothing whatsoever to do with the handling of incoming TLPs with the Relaxed Ordering Attribute set.  In fact, there is any standard way to disable receipt processing of such TLPs.  That's kind of the whole point of the majority of the patch.  If there was a separate "Ignore Incoming Relaxed Ordering Attributes" then we'd mostly just have a quirk which would set that for problematic devices.

  Worse yet, if I'm reading the patch right, it _is_ turning off the PCIe Capability Device Control[Relaxed Ordering Enable] for the devices that we've identified with problematic receive handling.  Not only does this not solve the problem that we've been talking about, it actually almost certainly introduces a huge Graphics Performance Bug.  The Relaxed Ordering Attribute was originally developed for Graphics Device support in order to download textures, etc. to a graphics devices as fast as possible and only ensure ordering at points where needed.  For instance, as far as I know, the Intel Root Complexes that we've been talking about have no issues whatsoever in generating downstream TLPs with the Relaxed Ordering Attribute set.  By turning off the PCIe Capability Device Control[Relaxed Ordering Enable] on these Root Complexes, this patch prevents the generation of downstream TLPs with the Relaxed Ordering Attribute set targeted at devices like graphics adapters.

Casey
Alexander Duyck July 8, 2017, 12:47 a.m. UTC | #2
On Fri, Jul 7, 2017 at 5:30 PM, Casey Leedom <leedom@chelsio.com> wrote:
>   By the way, it ~seems~ like the patch set confuses the idea of the PCIe Capability Device Control[Relaxed Ordering Enable] with the device's ability to handle incoming TLPs with the Relaxed Ordering Attribute set.  These are completely different things.  The PCIe Capability Device Control[Relaxed Ordering Enable] solely governs the ability of the device to _send_ TLPs with the Relaxed Ordering Attribute set.  It has nothing whatsoever to do with the handling of incoming TLPs with the Relaxed Ordering Attribute set.  In fact, there is any standard way to disable receipt processing of such TLPs.  That's kind of the whole point of the majority of the patch.  If there was a separate "Ignore Incoming Relaxed Ordering Attributes" then we'd mostly just have a quirk which would set that for problematic devices.
>
>   Worse yet, if I'm reading the patch right, it _is_ turning off the PCIe Capability Device Control[Relaxed Ordering Enable] for the devices that we've identified with problematic receive handling.  Not only does this not solve the problem that we've been talking about, it actually almost certainly introduces a huge Graphics Performance Bug.  The Relaxed Ordering Attribute was originally developed for Graphics Device support in order to download textures, etc. to a graphics devices as fast as possible and only ensure ordering at points where needed.  For instance, as far as I know, the Intel Root Complexes that we've been talking about have no issues whatsoever in generating downstream TLPs with the Relaxed Ordering Attribute set.  By turning off the PCIe Capability Device Control[Relaxed Ordering Enable] on these Root Complexes, this patch prevents the generation of downstream TLPs with the Relaxed Ordering Attribute set targeted at devices like graphics adapters.
>
> Casey

The patches should be disabling the relaxed ordering on upstream
facing ports that would be sending requests to the root complex. So if
the root complex cannot handle receiving a message with relaxed
ordering enabled we should be clearing the relaxed ordering bit in the
PCIe configuration on those device that could send TLPs to that root
complex. By doing that the devices hosted on that root complex cannot
generate requests that contain relaxed ordering TLPs.

I'll have to do some double checking of the patches, but I thought we
were only clearing the relaxed ordering bits on the devices that would
be sending requests to the root complex, not the downstream ports of
the root complex itself. As such we should be able to generate relaxed
ordering TLPs from the root complex, but the devices hosted on the bus
shouldn't be able to generate relaxed ordering TLPs. The bit in the
configuration space controls the ability to generate content, not the
ability to receive it so clearing the bit on the device should be the
correct behavior for this.

As far as your suggestions for the cxgb4 patch it has the same problem
it originally had. We want to disable the relaxed ordering bit on the
device since the device should not be generating relaxed ordering
requests. This is why I was trying to get your input on this as I know
you have a peer-to-peer configuration that you wanted to support. What
we need to do is identify what platforms cannot support relaxed
ordering to the root complex and prevent that from happening which is
why we clear the relaxed ordering bit on the device. In that setup we
need to have a good way to identify when this has occurred and instead
setup a side channel type configuration where we then re-enable
relaxed ordering, but only allow for sending directly to the peer and
not the root complex.

- Alex
Casey Leedom July 8, 2017, 2:04 a.m. UTC | #3
Okay, thanks for the note Alexander.	I'll have to look more closely at
the patch on Monday and try it out on one of the targeted systems to verify
the semantics you describe.

  However, that said, there is no way to tell a priori where a device will
send TLPs.  To simply assume that all TLPs will be directed towards the Root
Complex is a big assumption.  Only the device and the code controlling it
know where the TLPs will be directed.  That's why there are changes required
in the cxgb4 driver.  For instance, the code in
drivers/net/ethernet/chelsio./cxgb4/sge.c: t4_sge_alloc_rxq() knows that
it's allocating Free List Buffers in Host Memory and that the RX Queues that
it's allocating in the Hardware will eventually send Ingress Data to those
Free List Buffers.  (And similarly for the Free List Buffer Pointer Queue
with respect to DMA Reads from the host.)  In that routine we explicitly
configure the Hardware to use/not-use the Relaxed Ordering Attribute via the
FW_IQ_CMD_FL0FETCHRO and FW_IQ_CMD_FL0DATARO flags.  Basically we're
conditionally setting them based on the desirability of sending Relaxed
Ordering TLPs to the Root Complex.  (And we would perform the same kind of
check for an nVME application ... which brings us to ...)

  And what would be the code using these patch APIs to set up a Peer-to-Peer
nVME-style application?	 In that case we'd need the Chelsio adapter's PCIe
Capability Device Control[Relaxed Ordering Enable] set for the nVME
application ... and we would avoid programming the Chelsio Hardware to use
Relaxed Ordering for TLPs directed at the Root Complex.	 Thus we would be in
a position where some TLPs being emitted by the device to Peer devices would
have Relaxed Ordering set and some directed at the Root Complex would not.
And the only way for that to work is if the source device's Device
Control[Relaxed Ordering Enable] is set ...

  Finally, setting aside my disagreements with the patch, we still have the
code in the cxgb4 driver which explicitly turns on its own Device
Control[Relaxed Ordering Enable] in cxgb4_main.c:
enable_pcie_relaxed_ordering().  So the patch is something of a loop if all
we're doing is testing our own Relaxed Ordering Enable state ...
 
Casey
Alexander Duyck July 8, 2017, 3:37 a.m. UTC | #4
On Fri, Jul 7, 2017 at 7:04 PM, Casey Leedom <leedom@chelsio.com> wrote:
>   Okay, thanks for the note Alexander.  I'll have to look more closely at
> the patch on Monday and try it out on one of the targeted systems to verify
> the semantics you describe.
>
>   However, that said, there is no way to tell a priori where a device will
> send TLPs.  To simply assume that all TLPs will be directed towards the Root
> Complex is a big assumption.  Only the device and the code controlling it
> know where the TLPs will be directed.  That's why there are changes required
> in the cxgb4 driver.  For instance, the code in
> drivers/net/ethernet/chelsio./cxgb4/sge.c: t4_sge_alloc_rxq() knows that
> it's allocating Free List Buffers in Host Memory and that the RX Queues that
> it's allocating in the Hardware will eventually send Ingress Data to those
> Free List Buffers.  (And similarly for the Free List Buffer Pointer Queue
> with respect to DMA Reads from the host.)  In that routine we explicitly
> configure the Hardware to use/not-use the Relaxed Ordering Attribute via the
> FW_IQ_CMD_FL0FETCHRO and FW_IQ_CMD_FL0DATARO flags.  Basically we're
> conditionally setting them based on the desirability of sending Relaxed
> Ordering TLPs to the Root Complex.  (And we would perform the same kind of
> check for an nVME application ... which brings us to ...)

The general idea with this is to keep this simple. In the vast
majority of cases the assumption is that a device will be sending data
back and forth from system memory. As such the mostly likely thing
that any given device will be interacting with is the root complex. By
making it so that we disable relaxed ordering if the root complex says
we can't support it seems like the simplest and most direct solution
to avoid the issue of us sending any requests with relaxed ordering
enabled TLPs to the root complex.

With that said what we are getting into are subtleties that end up
impacting devices that perform peer-to-peer operations and I don't
suspect that peer-to-peer is really all that common. Ideally my
thought on this is that if there is something in the setup that cannot
support relaxed ordering we should be disabling relaxed ordering and
then re-enabling it in the drivers that have special peer-to-peer
routes that they are aware of. This should help to simplify things for
cases such as a function being direct assigned as the configuration
space should be passed through to the guest with the relaxed ordering
attribute disabled, and that status will be visible to the guest. If
we just use the quirk we lose that and it becomes problematic if a
function is direct assigned on a system that doesn't support relaxed
ordering as the guest has no visibility into the host's quirks.

>   And what would be the code using these patch APIs to set up a Peer-to-Peer
> nVME-style application?  In that case we'd need the Chelsio adapter's PCIe
> Capability Device Control[Relaxed Ordering Enable] set for the nVME
> application ... and we would avoid programming the Chelsio Hardware to use
> Relaxed Ordering for TLPs directed at the Root Complex.  Thus we would be in
> a position where some TLPs being emitted by the device to Peer devices would
> have Relaxed Ordering set and some directed at the Root Complex would not.
> And the only way for that to work is if the source device's Device
> Control[Relaxed Ordering Enable] is set ...

Right. I admit this is pushing extra complexity into the driver, but
what else would you have us do?

The idea here is more of a lock-out tag-out type setup. We go in and
disable relaxed ordering on the devices if it isn't safe to send a
relaxed ordering TLP to the root complex. We then leave it up to the
driver to go through and re-enable it if the driver knows enough about
how it works and what kind of transactions it might issue. I'm not
saying we have to leave the relaxed ordering bit disabled in the
control register. I'm saying we disable it at first, and then leave it
up to the device driver to re-enable it if it needs the functionality
for something that is other than root-complex based and it knows it
can avoid sending the frames to the root complex. Ideally such a
driver would also clean up after itself if removed so that it leaves
the device in the same state it found it in.

Maybe we don't even really need patch 3/3 in this series. If disabling
the relaxed ordering bit in the configuration space already disabled
relaxed ordering for the entire device then this patch doesn't even
really do anything then right? The device takes care of it already for
us so we could probably just drop this patch as it currently stands.

If that is the case maybe we need to refocus patch 3/3 on re-enabling
relaxed ordering for that nVME specific case. That might be beyond the
scope of what Ding can handle though, and I am not familiar with the
Chelsio hardware either. So that might be something that is best left
to you as a follow-up patch.

>   Finally, setting aside my disagreements with the patch, we still have the
> code in the cxgb4 driver which explicitly turns on its own Device
> Control[Relaxed Ordering Enable] in cxgb4_main.c:
> enable_pcie_relaxed_ordering().  So the patch is something of a loop if all
> we're doing is testing our own Relaxed Ordering Enable state ...

I assume you are talking about what I am suggesting and not what is in
the actual patch.

Basically what it comes down to is sort of a loop. You would need to
add some code on your probe and remove routines to check the status of
the bits and configure them the way you want, and on remove you would
need to clean things up so that the next time a driver loads it
doesn't get confused. However you should only need this in the case
where the root complex cannot support relaxed ordering. In all other
cases you should be just running with relaxed ordering enabled for
everything.
Ding Tianhong July 10, 2017, 10:49 a.m. UTC | #5
Hi Casey:

On 2017/7/8 10:04, Casey Leedom wrote:
>   Okay, thanks for the note Alexander.	I'll have to look more closely at
> the patch on Monday and try it out on one of the targeted systems to verify
> the semantics you describe.
> 

All the modification is only clearing the device's Device Control{Relaxed Ordering
Enable]bit when distinguish that the platform should not support RO and did nothing
to the RC configuration, so I don't think it will break anything compare to the
first version from yours.

>   However, that said, there is no way to tell a priori where a device will
> send TLPs.  To simply assume that all TLPs will be directed towards the Root
> Complex is a big assumption.  Only the device and the code controlling it
> know where the TLPs will be directed.  That's why there are changes required
> in the cxgb4 driver.  For instance, the code in
> drivers/net/ethernet/chelsio./cxgb4/sge.c: t4_sge_alloc_rxq() knows that
> it's allocating Free List Buffers in Host Memory and that the RX Queues that
> it's allocating in the Hardware will eventually send Ingress Data to those
> Free List Buffers.  (And similarly for the Free List Buffer Pointer Queue
> with respect to DMA Reads from the host.)  In that routine we explicitly
> configure the Hardware to use/not-use the Relaxed Ordering Attribute via the
> FW_IQ_CMD_FL0FETCHRO and FW_IQ_CMD_FL0DATARO flags.  Basically we're
> conditionally setting them based on the desirability of sending Relaxed
> Ordering TLPs to the Root Complex.  (And we would perform the same kind of
> check for an nVME application ... which brings us to ...)
> 
>   And what would be the code using these patch APIs to set up a Peer-to-Peer
> nVME-style application?	 In that case we'd need the Chelsio adapter's PCIe
> Capability Device Control[Relaxed Ordering Enable] set for the nVME
> application ... and we would avoid programming the Chelsio Hardware to use
> Relaxed Ordering for TLPs directed at the Root Complex.	 Thus we would be in
> a position where some TLPs being emitted by the device to Peer devices would
> have Relaxed Ordering set and some directed at the Root Complex would not.
> And the only way for that to work is if the source device's Device
> Control[Relaxed Ordering Enable] is set ...
> 
>   Finally, setting aside my disagreements with the patch, we still have the
> code in the cxgb4 driver which explicitly turns on its own Device
> Control[Relaxed Ordering Enable] in cxgb4_main.c:
> enable_pcie_relaxed_ordering().  So the patch is something of a loop if all
> we're doing is testing our own Relaxed Ordering Enable state ...
>  
> Casey
> 
> .
>
Casey Leedom July 11, 2017, 12:01 a.m. UTC | #6
Hey Alexander,

  Okay, I understand your point regarding the "most likely scenario" being
TLPs directed upstream to the Root Complex.  But I'd still like to make sure
that we have an agreed upon API/methodology for doing Peer-to-Peer with
Relaxed Ordering and no Relaxed Ordering to the Root Complex.  I don't see
how the proposed APIs can be used in that fashion.
 
  Right now the proposed change for cxgb4 is for it to test its own PCIe
Capability Device Control[Relaxed Ordering Enable] in order to use that
information to program the Chelsio Hardware to emit/not emit upstream TLPs
with the Relaxed Ordering Attribute set.  But if we're going to have the
mixed mode situation I describe, the PCIe Capability Device Control[Relaxed
Ordering Enable] will have to be set which means that we'll be programming
the Chelsio Hardware to send upstream TLPs with Relaxed Ordering Enable to
the Root Complex which is what we were trying to avoid in the first place ...

  [[ And, as I noted on Friday evening, the currect cxgb4 Driver hardwires
     the Relaxed Ordering Enable on early dureing device probe, so that
     would minimally need to be addressed even if we decide that we don't
     ever want to support mixed mode Relaxed Ordering. ]]

  We need some method of telling the Chelsio Driver that it should/shouldn't
use Relaxed Ordering with TLPs directed at the Root Complex.  And the same
is true for a Peer PCIe Device.

  It may be that we should approach this from the completely opposite
direction and instead of having quirks which identify problematic devices,
have quirks which identify devices which would benefit from the use of
Relaxed Ordering (if the sending device supports that).  That is, assume the
using Relaxed Ordering shouldn't be done unless the target device says "I
love Relaxed Ordering TLPs" ...  In such a world, an NVMe or a Graphics
device might declare love of Relaxed Ordering and the same for a SPARC Root
Complex (I think that was your example).

  By the way, the sole example of Data Corruption with Relaxed Ordering is
the AMD A1100 ARM SoC and AMD appears to have given up on that almost as
soon as it was released.  So what we're left with currently is a performance
problem on modern Intel CPUs ...  (And hopefully we'll get a Technical
Publication on that issue fairly soon.)

Casey
Alexander Duyck July 11, 2017, 8:33 p.m. UTC | #7
On Mon, Jul 10, 2017 at 5:01 PM, Casey Leedom <leedom@chelsio.com> wrote:
>
> Hey Alexander,
>
>   Okay, I understand your point regarding the "most likely scenario" being
> TLPs directed upstream to the Root Complex.  But I'd still like to make sure
> that we have an agreed upon API/methodology for doing Peer-to-Peer with
> Relaxed Ordering and no Relaxed Ordering to the Root Complex.  I don't see
> how the proposed APIs can be used in that fashion.
>
>   Right now the proposed change for cxgb4 is for it to test its own PCIe
> Capability Device Control[Relaxed Ordering Enable] in order to use that
> information to program the Chelsio Hardware to emit/not emit upstream TLPs
> with the Relaxed Ordering Attribute set.  But if we're going to have the
> mixed mode situation I describe, the PCIe Capability Device Control[Relaxed
> Ordering Enable] will have to be set which means that we'll be programming
> the Chelsio Hardware to send upstream TLPs with Relaxed Ordering Enable to
> the Root Complex which is what we were trying to avoid in the first place ...
>
>   [[ And, as I noted on Friday evening, the currect cxgb4 Driver hardwires
>      the Relaxed Ordering Enable on early dureing device probe, so that
>      would minimally need to be addressed even if we decide that we don't
>      ever want to support mixed mode Relaxed Ordering. ]]

So when you say "hardwires the Relaxed Ordering Enable" do you mean it
cannot be changed by software, or are you saying the firmware is just
setting the bit? I just want to make sure the change even works.

Really what I was trying to get at is the Relaxed Ordering Enable bit
doesn't have to stay as 0 if it is cleared due to the root complex not
supporting it. If you have a device that is smart enough to be able to
distinguish between different destinations and can change the TLPs
depending on where they are routed then the driver should feel free to
re-enable the relaxed ordering bit itself and do whatever it wants.
The idea though is that in most cases it will probably be enabled by
default since most firmwares default to that if I am not mistaken. If
we disable it, then it is up to the drivers to figure out if they need
to come up with a workaround.

>   We need some method of telling the Chelsio Driver that it should/shouldn't
> use Relaxed Ordering with TLPs directed at the Root Complex.  And the same
> is true for a Peer PCIe Device.

Well my thought for now is to take this in baby steps. Looking over
the code I might suggest that Ding just drops patch 3 and just focuses
on the first two patches for now. Patch 3 doesn't do anything from the
sounds of it since the Relaxed Ordering Enable being cleared will
cause the TLPs to already not set the relaxed ordering bit, do I have
that right?

The main thing Ding cared about is making the ixgbe driver behave more
like the cxgb4 driver in that he wanted to have us enable relaxed
ordering by default which right now we were only doing for the SPARC
platforms. As such we may want to look at changing patch 3 to instead
strip the code in ixgbe so that it is enabled to use Relaxed Ordering
by default and then let the configuration space determine if we use it
or not.

>   It may be that we should approach this from the completely opposite
> direction and instead of having quirks which identify problematic devices,
> have quirks which identify devices which would benefit from the use of
> Relaxed Ordering (if the sending device supports that).  That is, assume the
> using Relaxed Ordering shouldn't be done unless the target device says "I
> love Relaxed Ordering TLPs" ...  In such a world, an NVMe or a Graphics
> device might declare love of Relaxed Ordering and the same for a SPARC Root
> Complex (I think that was your example).

Really I think we are probably better off enabling it by default and
leaving it up to the configuration space of the end points to sort it
out. The advantage is that it will let us catch issues with platforms
that need these kind of quirks sooner since up until now we have been
avoiding enabling relaxed ordering for most devices on x86 and as we
get faster and faster buses we are going to need to start fully
supporting this sooner or later anyway.

>   By the way, the sole example of Data Corruption with Relaxed Ordering is
> the AMD A1100 ARM SoC and AMD appears to have given up on that almost as
> soon as it was released.  So what we're left with currently is a performance
> problem on modern Intel CPUs ...  (And hopefully we'll get a Technical
> Publication on that issue fairly soon.)
>
> Casey

That's also assuming there aren't any other platforms with issues
lurking out there. If something like this had been in place before
some of the modern Intel CPUs were developed perhaps they would have
caught the relaxed ordering issue soon enough to have resolved it in
the silicon.

Odds are this sets things up for how we will deal with future issues
more than current ones. I'm more a fan of just letting the drivers
configure themselves for relaxed ordering to the root complex by
default, and then if something comes up where relaxed ordering can't
be supported on the platform then we do workarounds for things like
the peer-to-peer performance you mentioned before for the NVMe
solution.

- Alex
Ding Tianhong July 12, 2017, 9:46 a.m. UTC | #8
On 2017/7/11 8:01, Casey Leedom wrote:
> 
> Hey Alexander,
> 
>   Okay, I understand your point regarding the "most likely scenario" being
> TLPs directed upstream to the Root Complex.  But I'd still like to make sure
> that we have an agreed upon API/methodology for doing Peer-to-Peer with
> Relaxed Ordering and no Relaxed Ordering to the Root Complex.  I don't see
> how the proposed APIs can be used in that fashion.
>  
>   Right now the proposed change for cxgb4 is for it to test its own PCIe
> Capability Device Control[Relaxed Ordering Enable] in order to use that
> information to program the Chelsio Hardware to emit/not emit upstream TLPs
> with the Relaxed Ordering Attribute set.  But if we're going to have the
> mixed mode situation I describe, the PCIe Capability Device Control[Relaxed
> Ordering Enable] will have to be set which means that we'll be programming
> the Chelsio Hardware to send upstream TLPs with Relaxed Ordering Enable to
> the Root Complex which is what we were trying to avoid in the first place ...
> 
>   [[ And, as I noted on Friday evening, the currect cxgb4 Driver hardwires
>      the Relaxed Ordering Enable on early dureing device probe, so that
>      would minimally need to be addressed even if we decide that we don't
>      ever want to support mixed mode Relaxed Ordering. ]]
> 
>   We need some method of telling the Chelsio Driver that it should/shouldn't
> use Relaxed Ordering with TLPs directed at the Root Complex.  And the same
> is true for a Peer PCIe Device.
> 
>   It may be that we should approach this from the completely opposite
> direction and instead of having quirks which identify problematic devices,
> have quirks which identify devices which would benefit from the use of
> Relaxed Ordering (if the sending device supports that).  That is, assume the
> using Relaxed Ordering shouldn't be done unless the target device says "I
> love Relaxed Ordering TLPs" ...  In such a world, an NVMe or a Graphics
> device might declare love of Relaxed Ordering and the same for a SPARC Root
> Complex (I think that was your example).
> 
>   By the way, the sole example of Data Corruption with Relaxed Ordering is
> the AMD A1100 ARM SoC and AMD appears to have given up on that almost as
> soon as it was released.  So what we're left with currently is a performance
> problem on modern Intel CPUs ...  (And hopefully we'll get a Technical
> Publication on that issue fairly soon.)
> 
> Casey
> 

Hi Casey:

After the long discuss, I think If the PCIe Capability Device Control[Relaxed Ordering
Enable] to be cleared when the platform's RC has some problematic for RO didn't break
anything in your driver, I think you could choose to check the
(!pci_dev_should_disable_relaxed_ordering(root)) in the code to to enable
ROOT_NO_RELAXED_ORDERING for your adapter, and enable the PCIe Capability Device Control
[Relaxed Ordering Enable] bit when you need it, I think we don't have much gap here.
And we could leave the pear-to-pear situation to be fixed later.

Thanks
Ding

> .
>
Casey Leedom July 13, 2017, 12:52 a.m. UTC | #9
Sorry again for the delay.  This time at least partially caused by a Chelsio-internal Customer Support request to simply disable Relaxed Ordering entirely due to the performance issues with our 100Gb/s product and relatively recent Intel Root Complexes.  Our Customer Support people are tired of asking customers to try turning off Relaxed Ordering. (sigh)

  So, first off, I've mentioned a couple of times that the current cxgb4 driver hardwired the PCIe Capability Device Control[Relaxed Ordering Enable] on.  Here's the code which does it:

    drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c:4657:

    static void enable_pcie_relaxed_ordering(struct pci_dev *dev)
    {
        pcie_capability_set_word(dev, PCI_EXP_DEVCTL, PCI_EXP_DEVCTL_RELAX_EN);
    }

This is called from the PCI Probe() routine init_one() later in that file.  I just wanted to make sure people knew about this.  Obviously given our current very difficult thread, this would either need to be diked out or changed or a completely different mechanism put in place.

  Second, just to make sure everyone's on the same page, the above simply allows the device to send TLPs with the Relaxed Ordering Attribute.  It doesn't cause TLPs to suddenly all be sent with RO set.  The use of Relaxed Ordering is selective.  For instance, in our hardware we can configure the RX Path to use RO on Ingress Packet Data delivery to Free List Buffers, but not use RO for delivery of messages noting newly delivered Ingress Packet Data.  Doing this allows the destination PCIe target to [potentially] optimize the DMA Writes to it based on local conditions (memory controller channel availability, etc.), but ensure that the message noting newly delivered Ingress Packet Data isn't processed till all of the preceding TLPs with RO set containing Ingress Packet Data have been processed.  (This by the way is the essence of the AMD A1100 ARM SoC bug: its Root Complex isn't obeying that PCIe ordering rule.)

  Third, as noted above, I'm getting a lot of pressure to get this addressed sooner than later, so I think that we should go with something fairly simple along the lines that you guys are proposing and I'll stop whining about the problem of needing to handle Peer-to-Peer with Relaxed Ordering while not using it for deliveries to the Root Complex.  We can just wait for that kettle of fish to explode on us and deal with the mess then.  (Hhmmm, the mixed metaphor landed in an entirely different place than I originally intended ... :-))

  If we try to stick as closely to Ding's latest patch set as possible, then we can probably just add the diff to remove the enable_pcie_relaxed_ordering() code in cxgb4_main.c.

Casey
Ding Tianhong July 13, 2017, 1:18 a.m. UTC | #10
On 2017/7/13 8:52, Casey Leedom wrote:
>   Sorry again for the delay.  This time at least partially caused by a Chelsio-internal Customer Support request to simply disable Relaxed Ordering entirely due to the performance issues with our 100Gb/s product and relatively recent Intel Root Complexes.  Our Customer Support people are tired of asking customers to try turning off Relaxed Ordering. (sigh)
> 
>   So, first off, I've mentioned a couple of times that the current cxgb4 driver hardwired the PCIe Capability Device Control[Relaxed Ordering Enable] on.  Here's the code which does it:
> 
>     drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c:4657:
> 
>     static void enable_pcie_relaxed_ordering(struct pci_dev *dev)
>     {
>         pcie_capability_set_word(dev, PCI_EXP_DEVCTL, PCI_EXP_DEVCTL_RELAX_EN);
>     }

we should remove it.

> 
> This is called from the PCI Probe() routine init_one() later in that file.  I just wanted to make sure people knew about this.  Obviously given our current very difficult thread, this would either need to be diked out or changed or a completely different mechanism put in place.
> 
>   Second, just to make sure everyone's on the same page, the above simply allows the device to send TLPs with the Relaxed Ordering Attribute.  It doesn't cause TLPs to suddenly all be sent with RO set.  The use of Relaxed Ordering is selective.  For instance, in our hardware we can configure the RX Path to use RO on Ingress Packet Data delivery to Free List Buffers, but not use RO for delivery of messages noting newly delivered Ingress Packet Data.  Doing this allows the destination PCIe target to [potentially] optimize the DMA Writes to it based on local conditions (memory controller channel availability, etc.), but ensure that the message noting newly delivered Ingress Packet Data isn't processed till all of the preceding TLPs with RO set containing Ingress Packet Data have been processed.  (This by the way is the essence of the AMD A1100 ARM SoC bug: its Root Complex isn't obeying that PCIe ordering rule.)
> 
>   Third, as noted above, I'm getting a lot of pressure to get this addressed sooner than later, so I think that we should go with something fairly simple along the lines that you guys are proposing and I'll stop whining about the problem of needing to handle Peer-to-Peer with Relaxed Ordering while not using it for deliveries to the Root Complex.  We can just wait for that kettle of fish to explode on us and deal with the mess then.  (Hhmmm, the mixed metaphor landed in an entirely different place than I originally intended ... :-))
> 

Ok, we could fix them when we trigger this, I think it is not a big problem.

>   If we try to stick as closely to Ding's latest patch set as possible, then we can probably just add the diff to remove the enable_pcie_relaxed_ordering() code in cxgb4_main.c.
> 

If no other more suggestion, I will send a new version and remove the enable_pcie_relaxed_ordering(), thanks.  :)

Ding
> Casey
> .
>
Casey Leedom July 13, 2017, 2:28 a.m. UTC | #11
| From: Ding Tianhong <dingtianhong@huawei.com>
| Sent: Wednesday, July 12, 2017 6:18 PM
| 
| If no other more suggestion, I will send a new version and remove the
| enable_pcie_relaxed_ordering(), thanks.  :)

   Sounds good to me.  (And sorry for forgetting to justify that last message.
I hate working with web-based email agents.)

Casey
diff mbox

Patch

diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
index 38a5c67..546538d 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
@@ -4620,6 +4620,7 @@  static int init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
        struct port_info *pi;
        bool highdma = false;
        struct adapter *adapter = NULL;
+       struct pci_dev *root;
        struct net_device *netdev;
        void __iomem *regs;
        u32 whoami, pl_rev;
@@ -4726,6 +4727,24 @@  static int init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
        adapter->msg_enable = DFLT_MSG_ENABLE;
        memset(adapter->chan_map, 0xff, sizeof(adapter->chan_map));

+       /* If possible, we use PCIe Relaxed Ordering Attribute to deliver
+        * Ingress Packet Data to Free List Buffers in order to allow for
+        * chipset performance optimizations between the Root Complex and
+        * Memory Controllers.  (Messages to the associated Ingress Queue
+        * notifying new Packet Placement in the Free Lists Buffers will be
+        * send without the Relaxed Ordering Attribute thus guaranteeing that
+        * all preceding PCIe Transaction Layer Packets will be processed
+        * first.)  But some Root Complexes have various issues with Upstream
+        * Transaction Layer Packets with the Relaxed Ordering Attribute set.
+        * The PCIe devices which under the Root Complexes will be cleared the
+        * Relaxed Ordering bit in the configuration space, So we check our
+        * PCIe configuration space to see if it's flagged with advice against
+        * using Relaxed Ordering.
+        */
+       root = pci_find_pcie_root_port(pdev);
+       if (!pcie_relaxed_ordering_supported(root))
+               adapter->flags |= ROOT_NO_RELAXED_ORDERING;
+
        spin_lock_init(&adapter->stats_lock);
        spin_lock_init(&adapter->tid_release_lock);
        spin_lock_init(&adapter->win0_lock);