From patchwork Sat May 27 10:34:09 2017
Subject: Re: [PATCH 1/2] PCI: Add new PCIe Fabric End Node flag, PCI_DEV_FLAGS_NO_RELAXED_ORDERING
From: Ding Tianhong
To: Alexander Duyck
Date: Sat, 27 May 2017 18:34:09 +0800
On 2017/5/26 3:49, Alexander Duyck wrote:
> On Thu, May 25, 2017 at 6:35 AM, Ding Tianhong wrote:
>>
>> On 2017/5/9 8:48, Casey Leedom wrote:
>>>
>>> | From: Alexander Duyck
>>> | Date: Saturday, May 6, 2017 11:07 AM
>>> |
>>> | | From: Ding Tianhong
>>> | | Date: Fri, May 5, 2017 at 8:08 PM
>>> | |
>>> | | According to the suggestion, I could only think of this code:
>>> | | ..
>>> |
>>> | This is a bit simplistic but it is a start.
>>>
>>> Yes, something tells me that this is going to be more complicated than any
>>> of us would like ...
>>>
>>> | The other bit I was getting at is that we need to update the core PCIe
>>> | code so that when we configure devices and the root complex reports no
>>> | support for relaxed ordering, it should be clearing the relaxed
>>> | ordering bits in the PCIe configuration registers on the upstream
>>> | facing devices.
>>>
>>> Of course, this can be written to by the Driver at any time ... and is in
>>> the case of the cxgb4 Driver ...
>>>
>>> After a lot of rummaging around, it does look like KVM prohibits writes to
>>> the PCIe Capability Device Control register in drivers/xen/xen-pciback/
>>> conf_space_capability.c and conf_space.c simply because writes aren't
>>> allowed unless "permissive" is set. So it ~looks~ like a driver running in
>>> a Virtual Machine can't turn Enable Relaxed Ordering back on ...
>>>
>>> | The last bit we need in all this is a way to allow for setups where
>>> | peer-to-peer wants to perform relaxed ordering but for writes to the
>>> | host we have to not use relaxed ordering. For that we need to enable a
>>> | special case, and that isn't handled right now in any of the solutions
>>> | we have coded up so far.
>>>
>>> Yes, we do need this.
>>>
>>>
>>> | From: Alexander Duyck
>>> | Date: Saturday, May 8, 2017 08:22 AM
>>> |
>>> | The problem is we need to have something that can be communicated
>>> | through a VM. Your change doesn't work in that regard. That was why I
>>> | suggested just updating the code so that when we initialize PCIe
>>> | devices, we either set or clear the relaxed ordering bit in
>>> | the PCIe device control register. That way when we direct assign an
>>> | interface it could know just based on the bits in the PCIe
>>> | configuration if it could use relaxed ordering or not.
>>> |
>>> | At that point the driver code itself becomes very simple since you
>>> | could just enable relaxed ordering by default in the igb/ixgbe
>>> | driver, and if the bit is set or cleared in the PCIe configuration then
>>> | we are either sending relaxed ordering requests or not, and don't
>>> | have to try to locate the root complex.
>>> |
>>> | So from the sound of it Casey has a special use case where he doesn't
>>> | want to send relaxed ordering frames to the root complex, but instead
>>> | would like to send them to another PCIe device. To do that he needs to
>>> | have a way to enable the relaxed ordering bit in the PCIe
>>> | configuration but then not send any to the root complex. Odds are that
>>> | is something he might be able to just implement in the driver, but it is
>>> | something that may become a more general case in the future.
>>> | I don't see our change here impacting it as long as we keep the solution
>>> | generic and mostly confined to when we instantiate the devices, as the
>>> | driver could likely make the decision to change the behavior later.
>>>
>>> It's not just me. Intel has said that while RO directed at Root
>>> Complex Host Coherent Memory has a performance bug (not Data Corruption),
>>> it's a performance win for Peer-to-Peer writes to MMIO Space. (I'll be very
>>> interested in hearing what the bug is if we get that much detail. The very
>>> same TLPs directed to the Root Complex Port without Relaxed Ordering set get
>>> good performance, so this is essentially a bug in hardware that was
>>> ~trying~ to implement a performance win.)
>>>
>>> Meanwhile, I currently only know of a single PCIe End Point which causes
>>> catastrophic results: the AMD A1100 ARM SoC ("SEATTLE"). And it's not even
>>> clear that product is still alive, since I haven't been able to get
>>> any responses from them for several months.
>>>
>>> What I'm saying is: let's try to architect a solution which doesn't throw
>>> the baby out with the bath water ...
>>>
>>> I think that if a Device's Root Complex Port has problems with Relaxed
>>> Ordering, it ~probably~ makes sense to turn off the PCIe Capability Device
>>> Control[Enable Relaxed Ordering] when we assign a device to a Virtual
>>> Machine, since the Device Driver can no longer query the Relaxed Ordering
>>> Support of the Root Complex Port. The only downside of this would be if we
>>> assigned two Peers to a VM in an application which wanted to do Peer-to-Peer
>>> transfers. But that seems like a hard application to support in any case,
>>> since the PCI Bus:Slot.Function IDs for assigned Devices within a VM don't
>>> match the actual values.
>>>
>>> For Devices running in the base OS/Hypervisor, their Drivers can query the
>>> Relaxed Ordering Support of the Root Complex Port or a Peer Device.
So a >>> simple flag within the (struct pci_dev *)->dev_flags would serve for that >>> along with a per-Architecture/Platform mechanism for setting it ... >>> >>> Casey >>> >> >> I have take a time to talk to our kvm team about how to distinguish the relaxed >> ordering in the VM for some vf just like 82599-vf, the probe routine looks like >> could work like this: >> 1) QEMU could emulate the platform by the Vender ID and device ID which could be >> read from the host. >> 2) The QEMU could create a virtual PCIe dev complex and recognize the PCIe bus address which >> come and detach from the host to the guest. >> 3) the PCI quirk could enable the Relaxed Ordering by the Vendor ID and Device ID. >> 4) The VF drivers could read the flag and set to the hw. >> >> So I think we could set the PCI_DEV_FLAGS_RELAXED_ORDERING_ENABLED for some special platform >> and don't enable by default, if I miss something, please not hesitate to enlighten me :) > > This isn't what I had in mind at all. > > So what Casey had originally submitted was a step in the direction of > what I was thinking. Basically on platforms where it is not advisable > to enable relaxed ordering we need to advertise that relaxed ordering > is not safe. Then when we are initializing the devices underneath > those we need to be clearing the relaxed ordering enable bits in the > PCI configuration, that is the piece that was missing from Casey's > original patch. In addition we then need to have a way for devices to > optionally enable relaxed ordering for cases like Casey pointed out > where they might want to use relaxed ordering for peer-to-peer > transactions, but not for transactions to the root complex. Finally in > the case of the Intel drivers we could then just drop the compile time > checks entirely and just enable the device to configure the internal > registers for relaxed ordering because the configuration space becomes > the spot that controls if this gets enabled or not. 
>
> So the initial set of patches Casey submitted only really consisted of
> 2 patches. What I am proposing is that we would be looking at
> expanding this out to about 4 patches. The first patch is the original
> 1 of 2; the second patch would be to modify the PCI initialization
> code to clear the relaxed ordering enable bit in the event that the
> root complex doesn't want relaxed ordering enabled; the third would be
> to make changes to the Chelsio driver as needed to allow the
> peer-to-peer case to be enabled when the relaxed ordering bit in the
> configuration space is not enabled, without triggering any relaxed
> ordering requests to the root complex; and the last one would be to
> drop the defines in ixgbe and whatever other Intel drivers are
> currently checking for either SPARC or the define that was added to
> support relaxed ordering, and just act like we are going to do it
> always, with the PCI configuration space controlling whether we do or not.
>
> Ideally, as part of the second patch we should have a way of testing
> whether a given path can support relaxed ordering. That way when we go to
> try to enable a peer-to-peer setup we can be certain that a given path
> will work, and don't try enabling it in paths that would be unsupported
> for peer-to-peer.
>
> This ends up being somewhat of a risk for the Intel NICs, but if the
> Chelsio devices have been running with relaxed ordering enabled for
> some time and have identified the chipsets that have issues, then
> odds are we should be fine as well.
>

Following your suggestion, I tried to build the second patch again:
1. At PCIe probe time, pci_configure_relaxed_ordering() is used to set or clear
   the relaxed ordering bit in the configuration space.
2. pcie_get_relaxed_ordering() is exported so that drivers can decide whether
   they should enable relaxed ordering.

If I am still on the wrong path, please correct me, thanks.
---------------------------------------------------------------
Thanks
Ding

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index b01bd5b..0076e4a 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -4852,6 +4852,28 @@ int pcie_get_mps(struct pci_dev *dev)
 }
 EXPORT_SYMBOL(pcie_get_mps);
 
+int pcie_set_relaxed_ordering(struct pci_dev *dev)
+{
+	return pcie_capability_set_word(dev, PCI_EXP_DEVCTL, PCI_EXP_DEVCTL_RELAX_EN);
+}
+EXPORT_SYMBOL(pcie_set_relaxed_ordering);
+
+int pcie_clear_relaxed_ordering(struct pci_dev *dev)
+{
+	return pcie_capability_clear_word(dev, PCI_EXP_DEVCTL, PCI_EXP_DEVCTL_RELAX_EN);
+}
+EXPORT_SYMBOL(pcie_clear_relaxed_ordering);
+
+int pcie_get_relaxed_ordering(struct pci_dev *dev)
+{
+	u16 v;
+
+	pcie_capability_read_word(dev, PCI_EXP_DEVCTL, &v);
+
+	return (v & PCI_EXP_DEVCTL_RELAX_EN) >> 4;
+}
+EXPORT_SYMBOL(pcie_get_relaxed_ordering);
+
 /**
  * pcie_set_mps - set PCI Express maximum payload size
  * @dev: PCI device to query
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 19c8950..aeb22b5 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -1701,6 +1701,15 @@ static void pci_configure_extended_tags(struct pci_dev *dev)
 				 PCI_EXP_DEVCTL_EXT_TAG);
 }
 
+static void pci_configure_relaxed_ordering(struct pci_dev *dev)
+{
+	/* Clear the enable bit on paths flagged unsafe for relaxed ordering */
+	if (dev->dev_flags & PCI_DEV_FLAGS_NO_RELAXED_ORDERING)
+		pcie_clear_relaxed_ordering(dev);
+	else
+		pcie_set_relaxed_ordering(dev);
+}
+
 static void pci_configure_device(struct pci_dev *dev)
 {
 	struct hotplug_params hpp;
@@ -1708,6 +1718,7 @@ static void pci_configure_device(struct pci_dev *dev)
 
 	pci_configure_mps(dev);
 	pci_configure_extended_tags(dev);
+	pci_configure_relaxed_ordering(dev);
 
 	memset(&hpp, 0, sizeof(hpp));
 	ret = pci_get_hp_params(dev, &hpp);