[V4] PCI: handle CRS returned by device after FLR

Message ID	1499375234-23928-1-git-send-email-okaya@codeaurora.org (mailing list archive)
State	New, archived
Delegated to:	Bjorn Helgaas
Headers	show Return-Path: <linux-pci-owner@kernel.org> sender: okaya@smtp.codeaurora.org) by smtp.codeaurora.org (Postfix) with ESMTPSA id B914E6081E; Thu, 6 Jul 2017 21:07:21 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 smtp.codeaurora.org B914E6081E From: Sinan Kaya <okaya@codeaurora.org> To: linux-pci@vger.kernel.org, timur@codeaurora.org, alex.williamson@redhat.com, vikrams@codeaurora.org Cc: Lorenzo.Pieralisi@arm.com, linux-arm-msm@vger.kernel.org, linux-arm-kernel@lists.infradead.org, Sinan Kaya <okaya@codeaurora.org>, Bjorn Helgaas <bhelgaas@google.com>, linux-kernel@vger.kernel.org Subject: [PATCH V4] PCI: handle CRS returned by device after FLR Date: Thu, 6 Jul 2017 17:07:14 -0400 Message-Id: <1499375234-23928-1-git-send-email-okaya@codeaurora.org> Sender: linux-pci-owner@vger.kernel.org Precedence: bulk

Sinan Kaya July 6, 2017, 9:07 p.m. UTC

An endpoint is allowed to issue Configuration Request Retry Status (CRS)
following a Function Level Reset (FLR) request to indicate that it is not
ready to accept new requests.

Seen a timeout message with Intel 750 NVMe drive and FLR reset.

Kernel enables CRS visibility in pci_enable_crs() function for each bridge
it discovers. The OS observes a special vendor ID read value of 0xFFFF0001
in this case. We need to keep polling until this special read value
disappears. pci_bus_read_dev_vendor_id() takes care of CRS handling for a
given vendor id read request under the covers.

Adding a vendor ID read if this is a physical function before attempting
to read any other registers on the endpoint. A CRS indication will only
be given if the address to be read is vendor ID register.

Note that virtual functions report their vendor ID through another
mechanism.

The spec is calling to wait up to 1 seconds if the device is sending CRS.
The NVMe device seems to be requiring more. Relax this up to 60 seconds.

Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
---
 drivers/pci/pci.c | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

Bjorn Helgaas July 13, 2017, 12:17 p.m. UTC | #1

On Thu, Jul 06, 2017 at 05:07:14PM -0400, Sinan Kaya wrote:
> An endpoint is allowed to issue Configuration Request Retry Status (CRS)
> following a Function Level Reset (FLR) request to indicate that it is not
> ready to accept new requests.
> 
> Seen a timeout message with Intel 750 NVMe drive and FLR reset.
> 
> Kernel enables CRS visibility in pci_enable_crs() function for each bridge
> it discovers. The OS observes a special vendor ID read value of 0xFFFF0001
> in this case. We need to keep polling until this special read value
> disappears. pci_bus_read_dev_vendor_id() takes care of CRS handling for a
> given vendor id read request under the covers.
> 
> Adding a vendor ID read if this is a physical function before attempting
> to read any other registers on the endpoint. A CRS indication will only
> be given if the address to be read is vendor ID register.
> 
> Note that virtual functions report their vendor ID through another
> mechanism.
> 
> The spec is calling to wait up to 1 seconds if the device is sending CRS.
> The NVMe device seems to be requiring more. Relax this up to 60 seconds.

Can you add a pointer to the "1 second" requirement in the spec here?
We use 60 seconds in pci_scan_device() and acpiphp_add_context().  Is
there a basis in the spec for the 60 second timeout?

What's the NVMe excuse for requiring more time than the spec allows?
Is this a hardware erratum?  Is there some PCIe ECN pending to address
this?

I try to avoid adding generic changes based on one specific piece of
hardware because it can penalize everybody else who actually bothered
to follow the spec.  For example, if FLR fails because a non-NVMe
device is broken, it will now take 60 seconds to notice that instead
of 1 second.

> Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
> ---
>  drivers/pci/pci.c | 14 ++++++++++----
>  1 file changed, 10 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index aab9d51..83a9784 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -3723,10 +3723,16 @@ static void pci_flr_wait(struct pci_dev *dev)
>  	int i = 0;
>  	u32 id;
>  
> -	do {
> -		msleep(100);
> -		pci_read_config_dword(dev, PCI_COMMAND, &id);
> -	} while (i++ < 10 && id == ~0);
> +	if (dev->is_virtfn) {
> +		do {
> +			msleep(100);
> +			pci_read_config_dword(dev, PCI_COMMAND, &id);
> +		} while (i++ < 10 && id == ~0);
> +	} else {
> +		if (!pci_bus_read_dev_vendor_id(dev->bus, dev->devfn, &id,
> +						60*1000))
> +			id = ~0;
> +	}
>  
>  	if (id == ~0)
>  		dev_warn(&dev->dev, "Failed to return from FLR\n");
> -- 
> 1.9.1
> 
> 
> _______________________________________________
> linux-arm-kernel mailing list
> linux-arm-kernel@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

Sinan Kaya July 13, 2017, 3:44 p.m. UTC | #2

On 7/13/2017 8:17 AM, Bjorn Helgaas wrote:
>> he spec is calling to wait up to 1 seconds if the device is sending CRS.
>> The NVMe device seems to be requiring more. Relax this up to 60 seconds.
> Can you add a pointer to the "1 second" requirement in the spec here?
> We use 60 seconds in pci_scan_device() and acpiphp_add_context().  Is
> there a basis in the spec for the 60 second timeout?

This does not specify a hard limit above on how long SW need to wait. 

"6.6.2 Function Level Reset
After an FLR has been initiated by writing a 1b to the Initiate Function Level Reset bit, 
the Function must complete the FLR within 100 ms.

While a Function is required to complete the FLR operation within the time limit described above,
the subsequent Function-specific initialization sequence may require additional time. 
If additional time is required, the Function must return a Configuration Request Retry Status (CRS) 
Completion Status when a Configuration Request is received 15 after the time limit above. 
After the Function responds to a Configuration Request with a Completion status other than CRS, 
it is not permitted to return CRS until it is reset again."

However, another indirect reference here tells us it is capped by 1 second below.

"6.23. Readiness Notifications (RN)
Readiness Notifications (RN) is intended to reduce the time software needs to
wait before issuing Configuration Requests to a Device or Function following DRS
Events or FRS Events. RN includes both the Device Readiness Status (DRS) and
Function Readiness Status (FRS) mechanisms. These mechanisms provide a direct
indication of Configuration-Readiness (see 5 Terms and Acronyms entry for “Configuration-Ready”). 

When used, DRS and FRS allow an improved behavior over the CRS mechanism, and eliminate
its associated periodic polling time of up to 1 second following a reset."

If I remember it right from CRS commit messages, 60 seconds was coming from
some PCIe switch taking too long to boot.

> 
> What's the NVMe excuse for requiring more time than the spec allows?
> Is this a hardware erratum?  Is there some PCIe ECN pending to address
> this?

We have seen the issue with Intel 750 and Intel P3600 NVMe drives. I don't
have access to the errata document for either of the drives.

> 
> I try to avoid adding generic changes based on one specific piece of
> hardware because it can penalize everybody else who actually bothered
> to follow the spec.  For example, if FLR fails because a non-NVMe
> device is broken, it will now take 60 seconds to notice that instead
> of 1 second.
> 

We can look for a better number like 3-4 seconds and put some nice warning
that HW might be broken (violating the spec) and could be in need of
a FW/BIOS update.

What do you think?

Keith Busch July 13, 2017, 4:03 p.m. UTC | #3

On Thu, Jul 13, 2017 at 07:17:58AM -0500, Bjorn Helgaas wrote:
> On Thu, Jul 06, 2017 at 05:07:14PM -0400, Sinan Kaya wrote:
> > An endpoint is allowed to issue Configuration Request Retry Status (CRS)
> > following a Function Level Reset (FLR) request to indicate that it is not
> > ready to accept new requests.
> > 
> > Seen a timeout message with Intel 750 NVMe drive and FLR reset.
> > 
> > Kernel enables CRS visibility in pci_enable_crs() function for each bridge
> > it discovers. The OS observes a special vendor ID read value of 0xFFFF0001
> > in this case. We need to keep polling until this special read value
> > disappears. pci_bus_read_dev_vendor_id() takes care of CRS handling for a
> > given vendor id read request under the covers.
> > 
> > Adding a vendor ID read if this is a physical function before attempting
> > to read any other registers on the endpoint. A CRS indication will only
> > be given if the address to be read is vendor ID register.
> > 
> > Note that virtual functions report their vendor ID through another
> > mechanism.
> > 
> > The spec is calling to wait up to 1 seconds if the device is sending CRS.
> > The NVMe device seems to be requiring more. Relax this up to 60 seconds.
> 
> Can you add a pointer to the "1 second" requirement in the spec here?
> We use 60 seconds in pci_scan_device() and acpiphp_add_context().  Is
> there a basis in the spec for the 60 second timeout?

I also don't see anywhere that says CRS is limited to only 1 second. It
looks to me that the spec allows a device to return CRS for as long as it
takes to complete initialization.

From PCIe Base Spec, Section 2.3.1 CRS Implementation note:

  A device in receipt of a Configuration Request following a valid reset
  condition may respond with a CRS Completion Status to terminate the
  Request, and thus effectively stall the Configuration Request until
  such time that the subsystem has completed local initialization and
  is ready to communicate with the host.

No time limit specified here, or anywhere else for that matter AFAICT.
Where is 1 second requirement coming from?

Keith Busch July 13, 2017, 4:29 p.m. UTC | #4

On Thu, Jul 13, 2017 at 11:44:12AM -0400, Sinan Kaya wrote:
> On 7/13/2017 8:17 AM, Bjorn Helgaas wrote:
> >> he spec is calling to wait up to 1 seconds if the device is sending CRS.
> >> The NVMe device seems to be requiring more. Relax this up to 60 seconds.
> > Can you add a pointer to the "1 second" requirement in the spec here?
> > We use 60 seconds in pci_scan_device() and acpiphp_add_context().  Is
> > there a basis in the spec for the 60 second timeout?
> 
> This does not specify a hard limit above on how long SW need to wait. 
> 
> "6.6.2 Function Level Reset
> After an FLR has been initiated by writing a 1b to the Initiate Function Level Reset bit, 
> the Function must complete the FLR within 100 ms.
> 
> While a Function is required to complete the FLR operation within the time limit described above,
> the subsequent Function-specific initialization sequence may require additional time. 
> If additional time is required, the Function must return a Configuration Request Retry Status (CRS) 
> Completion Status when a Configuration Request is received 15 after the time limit above. 
> After the Function responds to a Configuration Request with a Completion status other than CRS, 
> it is not permitted to return CRS until it is reset again."
> 
> However, another indirect reference here tells us it is capped by 1 second below.
> 
> "6.23. Readiness Notifications (RN)
> Readiness Notifications (RN) is intended to reduce the time software needs to
> wait before issuing Configuration Requests to a Device or Function following DRS
> Events or FRS Events. RN includes both the Device Readiness Status (DRS) and
> Function Readiness Status (FRS) mechanisms. These mechanisms provide a direct
> indication of Configuration-Readiness (see 5 Terms and Acronyms entry for “Configuration-Ready”). 
> 
> When used, DRS and FRS allow an improved behavior over the CRS mechanism, and eliminate
> its associated periodic polling time of up to 1 second following a reset."

That wording is just confusing. It looks to me the 1 second polling is
to be used following a reset if CRS is not implemented.

  https://pcisig.com/sites/default/files/specification_documents/ECN_RN_29_Aug_2013.pdf

"
  Through the mechanisms defined by this ECR, we can avoid the long,
  architected, fixed delays following various forms of reset before
  software is permitted to perform its first Configuration Request. These
  delays are very large:

  1 second if Configuration Retry Status (CRS) is not used
"

It goes on to say CRS is usually much lower, but doesn't specify an
upper bound either.

Sinan Kaya July 13, 2017, 4:42 p.m. UTC | #5

On 7/13/2017 12:29 PM, Keith Busch wrote:
>> When used, DRS and FRS allow an improved behavior over the CRS mechanism, and eliminate
>> its associated periodic polling time of up to 1 second following a reset."
> That wording is just confusing. It looks to me the 1 second polling is
> to be used following a reset if CRS is not implemented.
> 
>   https://pcisig.com/sites/default/files/specification_documents/ECN_RN_29_Aug_2013.pdf
> 
> "
>   Through the mechanisms defined by this ECR, we can avoid the long,
>   architected, fixed delays following various forms of reset before
>   software is permitted to perform its first Configuration Request. These
>   delays are very large:
> 
>   1 second if Configuration Retry Status (CRS) is not used
> "
> 
> It goes on to say CRS is usually much lower, but doesn't specify an
> upper bound either.
> 

I see, we got caught on spec language where we don't know what 'its' is. 

Bjorn,

Since there is no upper cap on how long, what is your preference (stick to 60),
give incremental warning updates every 5 seconds?

I can certainly rewrite the commit message.

Sinan

Keith Busch July 13, 2017, 5:24 p.m. UTC | #6

On Thu, Jul 13, 2017 at 12:42:44PM -0400, Sinan Kaya wrote:
> On 7/13/2017 12:29 PM, Keith Busch wrote:
> > That wording is just confusing. It looks to me the 1 second polling is
> > to be used following a reset if CRS is not implemented.
> > 
> >   https://pcisig.com/sites/default/files/specification_documents/ECN_RN_29_Aug_2013.pdf
> > 
> > "
> >   Through the mechanisms defined by this ECR, we can avoid the long,
> >   architected, fixed delays following various forms of reset before
> >   software is permitted to perform its first Configuration Request. These
> >   delays are very large:
> > 
> >   1 second if Configuration Retry Status (CRS) is not used
> > "
> > 
> > It goes on to say CRS is usually much lower, but doesn't specify an
> > upper bound either.
> > 
> 
> I see, we got caught on spec language where we don't know what 'its' is. 

Well, I don't know for certain if your original interpretation is
incorrect. Just saying the CRS intention doesn't explicitly stand out to me.

On a side note, I'll also see if I can get clarification on what
expectations the hardware people have for this particular product. Your
observation seems a little high to me, but I don't know if that's
outside the product's limits.

Bjorn Helgaas July 13, 2017, 11:38 p.m. UTC | #7

On Thu, Jul 13, 2017 at 11:44:12AM -0400, Sinan Kaya wrote:
> On 7/13/2017 8:17 AM, Bjorn Helgaas wrote:
> >> he spec is calling to wait up to 1 seconds if the device is sending CRS.
> >> The NVMe device seems to be requiring more. Relax this up to 60 seconds.
> > Can you add a pointer to the "1 second" requirement in the spec here?
> > We use 60 seconds in pci_scan_device() and acpiphp_add_context().  Is
> > there a basis in the spec for the 60 second timeout?
> 
> This does not specify a hard limit above on how long SW need to wait. 

I wouldn't expect a *maximum* time we can wait.  I'm looking for the
minimum times the spec requires.

If you're claiming "the spec is calling to wait up to 1 second", I
just want to know where in the spec it says that.  That helps in the
future when we need to maintain code like this.

> If I remember it right from CRS commit messages, 60 seconds was coming from
> some PCIe switch taking too long to boot.

If you have a pointer to this, please include it.  The earliest thing
I can find is when Linux was imported into git (1da177e4c3f4
("Linux-2.6.12-rc2")), which includes a 60 second timeout:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/pci/probe.c?id=1da177e4c3f4#n686

Bjorn Helgaas July 13, 2017, 11:49 p.m. UTC | #8

On Thu, Jul 06, 2017 at 05:07:14PM -0400, Sinan Kaya wrote:
> An endpoint is allowed to issue Configuration Request Retry Status (CRS)
> following a Function Level Reset (FLR) request to indicate that it is not
> ready to accept new requests.
> 
> Seen a timeout message with Intel 750 NVMe drive and FLR reset.
> 
> Kernel enables CRS visibility in pci_enable_crs() function for each bridge
> it discovers. The OS observes a special vendor ID read value of 0xFFFF0001
> in this case. We need to keep polling until this special read value
> disappears. pci_bus_read_dev_vendor_id() takes care of CRS handling for a
> given vendor id read request under the covers.

This patch isn't about how CRS works; we already have support for
that.  So this paragraph is mostly extraneous and can be replaced by a
simple reference to CRS in the spec.  

> Adding a vendor ID read if this is a physical function before attempting
> to read any other registers on the endpoint. A CRS indication will only
> be given if the address to be read is vendor ID register.
> 
> Note that virtual functions report their vendor ID through another
> mechanism.

How VFs report vendor ID is irrelevant.

What *is* relevant is how FLR affects VFs.  The SR-IOV spec, r1.1,
sec 2.2.2, says FLR doesn't affect a VF's existence in config space.

That suggests to me that reading a VF's PCI_COMMAND register after an
FLR should always return valid data (never ~0).  I suppose an FLR VF
reset isn't instantaneous, though, so I suppose we do need the 100ms
delay.  But maybe we should just return immediately after that,
without reading any VF config space?

pci_flr_wait() was added by 5adecf817dd6 ("PCI: Wait for up to 1000ms
after FLR reset"); maybe Alex has more insight into this.

> The spec is calling to wait up to 1 seconds if the device is sending CRS.
> The NVMe device seems to be requiring more. Relax this up to 60 seconds.
> 
> Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
> ---
>  drivers/pci/pci.c | 14 ++++++++++----
>  1 file changed, 10 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index aab9d51..83a9784 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -3723,10 +3723,16 @@ static void pci_flr_wait(struct pci_dev *dev)
>  	int i = 0;
>  	u32 id;
>  
> -	do {
> -		msleep(100);
> -		pci_read_config_dword(dev, PCI_COMMAND, &id);
> -	} while (i++ < 10 && id == ~0);
> +	if (dev->is_virtfn) {
> +		do {
> +			msleep(100);
> +			pci_read_config_dword(dev, PCI_COMMAND, &id);
> +		} while (i++ < 10 && id == ~0);
> +	} else {
> +		if (!pci_bus_read_dev_vendor_id(dev->bus, dev->devfn, &id,
> +						60*1000))
> +			id = ~0;
> +	}
>  
>  	if (id == ~0)
>  		dev_warn(&dev->dev, "Failed to return from FLR\n");
> -- 
> 1.9.1
> 
> 
> _______________________________________________
> linux-arm-kernel mailing list
> linux-arm-kernel@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

Sinan Kaya July 14, 2017, 2:10 p.m. UTC | #9

On 7/13/2017 7:38 PM, Bjorn Helgaas wrote:
>> This does not specify a hard limit above on how long SW need to wait. 
> I wouldn't expect a *maximum* time we can wait.  I'm looking for the
> minimum times the spec requires.

My understanding is FLR needs to finish in 100ms max under normal circumstances.
If an endpoint needs more, it needs to issue CRS.

	"After an FLR has been initiated by writing a 1b to the Initiate Function Level Reset bit, 
	the Function must complete the FLR within 100 ms. 

	...

	it is recommended that software allow as much time as provided by the pre-FLR value for Completion 
	Timeout on the device. If Completion Timeouts were disabled on the Function when FLR was issued, 
	then the delay is system dependent but must be no less than 100 ms."

The only minimum I found is in the last paragraph where somebody actually disables
completion timeout. I don't know why anyone would do that.

> 
> If you're claiming "the spec is calling to wait up to 1 second", I
> just want to know where in the spec it says that.  That helps in the
> future when we need to maintain code like this.
> 

Keith and I discussed this here. 

https://www.spinics.net/lists/arm-kernel/msg593493.html

We have a spec language problem. My interpretation of this is 1 seconds
max for CRS.

	"When used, DRS and FRS allow an improved behavior over the CRS mechanism, 
	and eliminate its associated periodic polling time of up to 1 second
	following a reset."

The ECN is referring to conventional reset as 1 second max rather than CRS.
	https://www.spinics.net/lists/arm-kernel/msg593500.html

Sinan Kaya July 14, 2017, 2:28 p.m. UTC | #10

Hi Bjorn,

On 7/13/2017 7:49 PM, Bjorn Helgaas wrote:
> How VFs report vendor ID is irrelevant.

I was trying to highlight this. 

"SR-IOV spec rev 1.1, 3.4.1.1 & 3.4.1.2, Vendor ID and Device ID fields
for the VF return 0xFFFF when read.  The "Virtualization Intermediary"
is supposed to use the vendor ID from the PF and the device ID defined
in the PF SR-IOV capability."

https://www.spinics.net/lists/linux-pci/msg58530.html

The point is we can't use pci_bus_read_dev_vendor_id() for both
virtual and physical functions as in my V3 patch.

> 
> What *is* relevant is how FLR affects VFs.  The SR-IOV spec, r1.1,
> sec 2.2.2, says FLR doesn't affect a VF's existence in config space.
> 

I see.

> That suggests to me that reading a VF's PCI_COMMAND register after an
> FLR should always return valid data (never ~0).  I suppose an FLR VF
> reset isn't instantaneous, though, so I suppose we do need the 100ms
> delay.  But maybe we should just return immediately after that,
> without reading any VF config space?

make sense.

> 
> pci_flr_wait() was added by 5adecf817dd6 ("PCI: Wait for up to 1000ms
> after FLR reset"); maybe Alex has more insight into this.

I think code is handling both virtual and physical functions. 1 second
is nice to have for physical functions. See this comment at the top.

/*
 * We should only need to wait 100ms after FLR, but some devices take longer.
 * Wait for up to 1000ms for config space to return something other than -1.
 * Intel IGD requires this when an LCD panel is attached.  We read the 2nd
 * dword because VFs don't implement the 1st dword.
 */

Since we are separating the handling of physical and virtual functions,
we could go back to 100ms for virtual functions and handle CRS/wait for
physical functions.

Sinan

Sinan Kaya July 31, 2017, 9:45 p.m. UTC | #11

Hi Bjorn,

On 7/13/2017 7:49 PM, Bjorn Helgaas wrote:
> On Thu, Jul 06, 2017 at 05:07:14PM -0400, Sinan Kaya wrote:
>> An endpoint is allowed to issue Configuration Request Retry Status (CRS)
>> following a Function Level Reset (FLR) request to indicate that it is not
>> ready to accept new requests.
>>
>> Seen a timeout message with Intel 750 NVMe drive and FLR reset.
>>
>> Kernel enables CRS visibility in pci_enable_crs() function for each bridge
>> it discovers. The OS observes a special vendor ID read value of 0xFFFF0001
>> in this case. We need to keep polling until this special read value
>> disappears. pci_bus_read_dev_vendor_id() takes care of CRS handling for a
>> given vendor id read request under the covers.
> 
> This patch isn't about how CRS works; we already have support for
> that.  So this paragraph is mostly extraneous and can be replaced by a
> simple reference to CRS in the spec.  
> 
>> Adding a vendor ID read if this is a physical function before attempting
>> to read any other registers on the endpoint. A CRS indication will only
>> be given if the address to be read is vendor ID register.
>>
>> Note that virtual functions report their vendor ID through another
>> mechanism.
> 
> How VFs report vendor ID is irrelevant.
> 
> What *is* relevant is how FLR affects VFs.  The SR-IOV spec, r1.1,
> sec 2.2.2, says FLR doesn't affect a VF's existence in config space.
> 
> That suggests to me that reading a VF's PCI_COMMAND register after an
> FLR should always return valid data (never ~0).  I suppose an FLR VF
> reset isn't instantaneous, though, so I suppose we do need the 100ms
> delay.  But maybe we should just return immediately after that,
> without reading any VF config space?
> 
> pci_flr_wait() was added by 5adecf817dd6 ("PCI: Wait for up to 1000ms
> after FLR reset"); maybe Alex has more insight into this.
> 
>> The spec is calling to wait up to 1 seconds if the device is sending CRS.
>> The NVMe device seems to be requiring more. Relax this up to 60 seconds.
>>
>> Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
>> ---
>>  drivers/pci/pci.c | 14 ++++++++++----
>>  1 file changed, 10 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
>> index aab9d51..83a9784 100644
>> --- a/drivers/pci/pci.c
>> +++ b/drivers/pci/pci.c
>> @@ -3723,10 +3723,16 @@ static void pci_flr_wait(struct pci_dev *dev)
>>  	int i = 0;
>>  	u32 id;
>>  
>> -	do {
>> -		msleep(100);
>> -		pci_read_config_dword(dev, PCI_COMMAND, &id);
>> -	} while (i++ < 10 && id == ~0);
>> +	if (dev->is_virtfn) {
>> +		do {
>> +			msleep(100);
>> +			pci_read_config_dword(dev, PCI_COMMAND, &id);
>> +		} while (i++ < 10 && id == ~0);
>> +	} else {
>> +		if (!pci_bus_read_dev_vendor_id(dev->bus, dev->devfn, &id,
>> +						60*1000))
>> +			id = ~0;
>> +	}
>>  
>>  	if (id == ~0)
>>  		dev_warn(&dev->dev, "Failed to return from FLR\n");
>> -- 
>> 1.9.1
>>
>>

Welcome back from vacation. Let's pick up from where we left. 

Based on our email conversation, here are things I noted:

1. You want to go back to 100ms for virtual functions.
2. You want to print some user visible information when CRS is taking longer.
I can call the vendor ID function 60 times and print a message on each loop
if it helps.
3. You are looking for maximum CRS time reference from the spec. The reference
depends on how you interpret it. I interpreted it as 1 second maximum CRS time.
Keith interpreted it as 1 second being a reference to conventional reset maximum
time rather than CRS time. The fact that no time limit specified in the CRS chapter
also makes me lean towards Keith's interpretation. 
4. I'll clean up the commit message based on your feedback.

Please let me know if I missed anything.

Sinan

Bjorn Helgaas July 31, 2017, 10:19 p.m. UTC | #12

On Mon, Jul 31, 2017 at 05:45:46PM -0400, Sinan Kaya wrote:
> Hi Bjorn,
> 
> On 7/13/2017 7:49 PM, Bjorn Helgaas wrote:
> > On Thu, Jul 06, 2017 at 05:07:14PM -0400, Sinan Kaya wrote:
> >> An endpoint is allowed to issue Configuration Request Retry Status (CRS)
> >> following a Function Level Reset (FLR) request to indicate that it is not
> >> ready to accept new requests.
> >>
> >> Seen a timeout message with Intel 750 NVMe drive and FLR reset.
> >>
> >> Kernel enables CRS visibility in pci_enable_crs() function for each bridge
> >> it discovers. The OS observes a special vendor ID read value of 0xFFFF0001
> >> in this case. We need to keep polling until this special read value
> >> disappears. pci_bus_read_dev_vendor_id() takes care of CRS handling for a
> >> given vendor id read request under the covers.
> > 
> > This patch isn't about how CRS works; we already have support for
> > that.  So this paragraph is mostly extraneous and can be replaced by a
> > simple reference to CRS in the spec.  
> > 
> >> Adding a vendor ID read if this is a physical function before attempting
> >> to read any other registers on the endpoint. A CRS indication will only
> >> be given if the address to be read is vendor ID register.
> >>
> >> Note that virtual functions report their vendor ID through another
> >> mechanism.
> > 
> > How VFs report vendor ID is irrelevant.
> > 
> > What *is* relevant is how FLR affects VFs.  The SR-IOV spec, r1.1,
> > sec 2.2.2, says FLR doesn't affect a VF's existence in config space.
> > 
> > That suggests to me that reading a VF's PCI_COMMAND register after an
> > FLR should always return valid data (never ~0).  I suppose an FLR VF
> > reset isn't instantaneous, though, so I suppose we do need the 100ms
> > delay.  But maybe we should just return immediately after that,
> > without reading any VF config space?
> > 
> > pci_flr_wait() was added by 5adecf817dd6 ("PCI: Wait for up to 1000ms
> > after FLR reset"); maybe Alex has more insight into this.
> > 
> >> The spec is calling to wait up to 1 seconds if the device is sending CRS.
> >> The NVMe device seems to be requiring more. Relax this up to 60 seconds.
> >>
> >> Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
> >> ---
> >>  drivers/pci/pci.c | 14 ++++++++++----
> >>  1 file changed, 10 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> >> index aab9d51..83a9784 100644
> >> --- a/drivers/pci/pci.c
> >> +++ b/drivers/pci/pci.c
> >> @@ -3723,10 +3723,16 @@ static void pci_flr_wait(struct pci_dev *dev)
> >>  	int i = 0;
> >>  	u32 id;
> >>  
> >> -	do {
> >> -		msleep(100);
> >> -		pci_read_config_dword(dev, PCI_COMMAND, &id);
> >> -	} while (i++ < 10 && id == ~0);
> >> +	if (dev->is_virtfn) {
> >> +		do {
> >> +			msleep(100);
> >> +			pci_read_config_dword(dev, PCI_COMMAND, &id);
> >> +		} while (i++ < 10 && id == ~0);
> >> +	} else {
> >> +		if (!pci_bus_read_dev_vendor_id(dev->bus, dev->devfn, &id,
> >> +						60*1000))
> >> +			id = ~0;
> >> +	}
> >>  
> >>  	if (id == ~0)
> >>  		dev_warn(&dev->dev, "Failed to return from FLR\n");
> >> -- 
> >> 1.9.1
> >>
> >>
> 
> Welcome back from vacation. Let's pick up from where we left. 
> 
> Based on our email conversation, here are things I noted:
> 
> 1. You want to go back to 100ms for virtual functions.

This should be a separate patch with a changelog and a comment that
references the spec, e.g., SR-IOV r1.1, sec 2.2.2 or whatever seems
most relevant.

> 2. You want to print some user visible information when CRS is
> taking longer.  I can call the vendor ID function 60 times and print
> a message on each loop if it helps.

I don't remember the context of this.  60 copies of a message sounds
like too much.

> 3. You are looking for maximum CRS time reference from the spec. The
> reference depends on how you interpret it. I interpreted it as 1
> second maximum CRS time.  Keith interpreted it as 1 second being a
> reference to conventional reset maximum time rather than CRS time.
> The fact that no time limit specified in the CRS chapter also makes
> me lean towards Keith's interpretation. 

I'm looking for the *minimum* time we're required to wait per the
spec, with pointers to the spec sections you used to derive it.

In most cases the spec does not specify maximum times for software
events because that would be like saying "your software must be at
least this fast" and that sort of thing is impossible to enforce.

If you find a maximum, great -- use it and include the spec pointer.
But I suspect the max time we wait for a device to become ready will
be a policy question of how long Linux wants to wait.

Bjorn

[V4] PCI: handle CRS returned by device after FLR

Commit Message

Comments

Patch