diff mbox series

[V14,4/7] PCI: loongson: Don't access non-existant devices

Message ID 20220617074330.12605-5-chenhuacai@loongson.cn (mailing list archive)
State Superseded
Headers show
Series PCI: Loongson pci improvements and quirks | expand

Commit Message

Huacai Chen June 17, 2022, 7:43 a.m. UTC
On LS2K/LS7A, some non-existant devices don't return 0xffffffff when
scanning. This is a hardware flaw but we can only avoid it by software
now.

Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
---
 drivers/pci/controller/pci-loongson.c | 19 +++++++++++++++++--
 1 file changed, 17 insertions(+), 2 deletions(-)

Comments

Bjorn Helgaas June 27, 2022, 9:38 p.m. UTC | #1
On Fri, Jun 17, 2022 at 03:43:27PM +0800, Huacai Chen wrote:
> On LS2K/LS7A, some non-existant devices don't return 0xffffffff when
> scanning. This is a hardware flaw but we can only avoid it by software
> now.

We should say what *does* happen if we do a config read to a device
that doesn't exit.  Machine check, hang, etc?

> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
> ---
>  drivers/pci/controller/pci-loongson.c | 19 +++++++++++++++++--
>  1 file changed, 17 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/pci/controller/pci-loongson.c b/drivers/pci/controller/pci-loongson.c
> index a1222fc15454..e22142f75d97 100644
> --- a/drivers/pci/controller/pci-loongson.c
> +++ b/drivers/pci/controller/pci-loongson.c
> @@ -134,10 +134,20 @@ static void __iomem *cfg0_map(struct loongson_pci *priv, int bus,
>  	return priv->cfg0_base + addroff;
>  }
>  
> +static bool pdev_is_existant(unsigned char bus, unsigned int device, unsigned int function)
> +{
> +	if ((bus == 0) && (device >= 9 && device <= 20) && (function > 0))
> +		return false;

Why do you test pci_is_root_bus() below and "bus == 0" here?  I think
you intend them both to test the same thing.  If so, I think you
should test for "if (pci_is_root_bus(bus) ..." here.

Generally speaking we only probe for functions > 0 if .0 is marked as
multi-function, so I guess this means 00:09.0 is marked as a
multi-function device, but config reads to 00:09.1 would fail?

> +	return true;

Returning "true" here means "the device *may* exist," not "this device
*does* exist," right?  If so, the function name probably should be
"pdev_may_exist()".

I guess that when we do a config read to a non-root bus device that
doesn't exist, e.g., "01:00.0", that read terminates with an
Unsupported Request error, the config read gets the ~0 data we expect?

> +}
> +
>  static void __iomem *pci_loongson_map_bus(struct pci_bus *bus, unsigned int devfn,
>  			       int where)
>  {
>  	unsigned char busnum = bus->number;
> +	unsigned int device = PCI_SLOT(devfn);
> +	unsigned int function = PCI_FUNC(devfn);
>  	struct loongson_pci *priv = pci_bus_to_loongson_pci(bus);
>  
>  	if (pci_is_root_bus(bus))
> @@ -147,8 +157,13 @@ static void __iomem *pci_loongson_map_bus(struct pci_bus *bus, unsigned int devf
>  	 * Do not read more than one device on the bus other than
>  	 * the host bus.
>  	 */
> -	if (priv->data->flags & FLAG_DEV_FIX &&
> -			!pci_is_root_bus(bus) && PCI_SLOT(devfn) > 0)
> +	if ((priv->data->flags & FLAG_DEV_FIX) && bus->self) {
> +		if (!pci_is_root_bus(bus) && (device > 0))
> +			return NULL;
> +	}
> +
> +	/* Don't access non-existant devices */
> +	if (!pdev_is_existant(busnum, device, function))
>  		return NULL;

Is this a "forever" hardware bug that will never be fixed, or should
there be a flag like FLAG_DEV_FIX so we only do this on the broken
devices?

>  	/* CFG0 can only access standard space */
> -- 
> 2.27.0
>
吕建民 June 28, 2022, 1:03 p.m. UTC | #2
On 2022/6/28 上午5:38, Bjorn Helgaas wrote:
> On Fri, Jun 17, 2022 at 03:43:27PM +0800, Huacai Chen wrote:
>> On LS2K/LS7A, some non-existant devices don't return 0xffffffff when
>> scanning. This is a hardware flaw but we can only avoid it by software
>> now.
> 
> We should say what *does* happen if we do a config read to a device
> that doesn't exit.  Machine check, hang, etc?
> 

The device is a hidden device(only for debug) that should not be 
scanned. If scanned in a non-normal way, the machine is hang(one case in 
ltp pci test can trigger the issue, which is explained blow).


>> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
>> ---
>>   drivers/pci/controller/pci-loongson.c | 19 +++++++++++++++++--
>>   1 file changed, 17 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/pci/controller/pci-loongson.c b/drivers/pci/controller/pci-loongson.c
>> index a1222fc15454..e22142f75d97 100644
>> --- a/drivers/pci/controller/pci-loongson.c
>> +++ b/drivers/pci/controller/pci-loongson.c
>> @@ -134,10 +134,20 @@ static void __iomem *cfg0_map(struct loongson_pci *priv, int bus,
>>   	return priv->cfg0_base + addroff;
>>   }
>>   
>> +static bool pdev_is_existant(unsigned char bus, unsigned int device, unsigned int function)
>> +{
>> +	if ((bus == 0) && (device >= 9 && device <= 20) && (function > 0))
>> +		return false;
> 
> Why do you test pci_is_root_bus() below and "bus == 0" here?  I think
> you intend them both to test the same thing.  If so, I think you
> should test for "if (pci_is_root_bus(bus) ..." here.
> 

I agree, I think we can only use pci_is_root_bus to do the work.


> Generally speaking we only probe for functions > 0 if .0 is marked as
> multi-function, so I guess this means 00:09.0 is marked as a
> multi-function device, but config reads to 00:09.1 would fail?
> 

Yes, definitely. Actually, the 00:09.0 is a single device, so fun1(09.1)
will not be scanned(e.g. the fun1 will be not scanned on pci enumeration
during kernel booting).

But, there is one situation: when running ltp pci test case on LS7A,
the 00:08.2 is a sata controller(a valid device), and the bus number(0)
and devfn(0x42) are inputted to kernel api pci_scan_slot(), which has
clear note: devfn must have zero function. So, apparently, the inputted
devfn's function is not zero, but 2, and then in the pci_scan_slot():


         for (fn = next_fn(bus, dev, 0); fn > 0; fn = next_fn(bus, dev, 
fn)) {
                 dev = pci_scan_single_device(bus, devfn + fn);
                 ...
         }


08.2,08.3...and 09.1 will be scanned one by one, so the 09.1(fun1) is
scanned.

>> +	return true;
> 
> Returning "true" here means "the device *may* exist," not "this device
> *does* exist," right?  If so, the function name probably should be
> "pdev_may_exist()".
> 

Yes, I think pdev_may_exist maybe better.

> I guess that when we do a config read to a non-root bus device that
> doesn't exist, e.g., "01:00.0", that read terminates with an
> Unsupported Request error, the config read gets the ~0 data we expect?
> 

Yes, I think so.

>> +}
>> +
>>   static void __iomem *pci_loongson_map_bus(struct pci_bus *bus, unsigned int devfn,
>>   			       int where)
>>   {
>>   	unsigned char busnum = bus->number;
>> +	unsigned int device = PCI_SLOT(devfn);
>> +	unsigned int function = PCI_FUNC(devfn);
>>   	struct loongson_pci *priv = pci_bus_to_loongson_pci(bus);
>>   
>>   	if (pci_is_root_bus(bus))
>> @@ -147,8 +157,13 @@ static void __iomem *pci_loongson_map_bus(struct pci_bus *bus, unsigned int devf
>>   	 * Do not read more than one device on the bus other than
>>   	 * the host bus.
>>   	 */
>> -	if (priv->data->flags & FLAG_DEV_FIX &&
>> -			!pci_is_root_bus(bus) && PCI_SLOT(devfn) > 0)
>> +	if ((priv->data->flags & FLAG_DEV_FIX) && bus->self) {
>> +		if (!pci_is_root_bus(bus) && (device > 0))
>> +			return NULL;
>> +	}
>> +
>> +	/* Don't access non-existant devices */
>> +	if (!pdev_is_existant(busnum, device, function))
>>   		return NULL;
> 
> Is this a "forever" hardware bug that will never be fixed, or should
> there be a flag like FLAG_DEV_FIX so we only do this on the broken
> devices?
> 

No, the next new version LS7A will correct it, so maybe we can use 
FLAG_DEV_FIX-like to address it.

>>   	/* CFG0 can only access standard space */
>> -- 
>> 2.27.0
>>
Bjorn Helgaas June 28, 2022, 4:04 p.m. UTC | #3
On Tue, Jun 28, 2022 at 09:03:02PM +0800, Jianmin Lv wrote:
> On 2022/6/28 上午5:38, Bjorn Helgaas wrote:
> > On Fri, Jun 17, 2022 at 03:43:27PM +0800, Huacai Chen wrote:
> > > On LS2K/LS7A, some non-existant devices don't return 0xffffffff when
> > > scanning. This is a hardware flaw but we can only avoid it by software
> > > now.
> > 
> > We should say what *does* happen if we do a config read to a device
> > that doesn't exit.  Machine check, hang, etc?
> 
> The device is a hidden device(only for debug) that should not be
> scanned. If scanned in a non-normal way, the machine is hang(one
> case in ltp pci test can trigger the issue, which is explained
> below).

Reading the Vendor ID is the *normal* way to scan for a device.  It
seems that this hardware just hangs in some cases when the device
doesn't exist.

> > Generally speaking we only probe for functions > 0 if .0 is marked as
> > multi-function, so I guess this means 00:09.0 is marked as a
> > multi-function device, but config reads to 00:09.1 would fail?
> 
> Yes, definitely. Actually, the 00:09.0 is a single device, so fun1(09.1)
> will not be scanned(e.g. the fun1 will be not scanned on pci enumeration
> during kernel booting).
> 
> But, there is one situation: when running ltp pci test case on LS7A,
> the 00:08.2 is a sata controller(a valid device), and the bus number(0)
> and devfn(0x42) are inputted to kernel api pci_scan_slot(), which has
> clear note: devfn must have zero function. So, apparently, the inputted
> devfn's function is not zero, but 2, and then in the pci_scan_slot():
> 
>         for (fn = next_fn(bus, dev, 0); fn > 0; fn = next_fn(bus, dev, fn))
> {
>                 dev = pci_scan_single_device(bus, devfn + fn);
>                 ...
>         }
> 
> 08.2,08.3...and 09.1 will be scanned one by one, so the 09.1(fun1) is
> scanned.

Does the "((bus == 0) && (device >= 9 && device <= 20) && (function > 0))"
test catch *all* devfns where the hang occurs?  I wouldn't want to
only avoid the ones that LTP happens to use.  If we did that, a future
LTP change could easily break things again.  But I assume you know
exactly what devices are present on the root bus.

> > > -	if (priv->data->flags & FLAG_DEV_FIX &&
> > > -			!pci_is_root_bus(bus) && PCI_SLOT(devfn) > 0)
> > > +	if ((priv->data->flags & FLAG_DEV_FIX) && bus->self) {
> > > +		if (!pci_is_root_bus(bus) && (device > 0))
> > > +			return NULL;
> > > +	}
> > > +
> > > +	/* Don't access non-existant devices */
> > > +	if (!pdev_is_existant(busnum, device, function))
> > >   		return NULL;
> > 
> > Is this a "forever" hardware bug that will never be fixed, or should
> > there be a flag like FLAG_DEV_FIX so we only do this on the broken
> > devices?
> 
> No, the next new version LS7A will correct it, so maybe we can use
> FLAG_DEV_FIX-like to address it.

You should add the flag now instead of waiting for the new hardware.
Otherwise you may not remember or notice the need to make this
conditional on the hardware version, you'll wonder why the fixed
hardware doesn't enumerate devices correctly.

Bjorn
吕建民 June 29, 2022, 12:33 a.m. UTC | #4
On 2022/6/29 上午12:04, Bjorn Helgaas wrote:
> On Tue, Jun 28, 2022 at 09:03:02PM +0800, Jianmin Lv wrote:
>> On 2022/6/28 上午5:38, Bjorn Helgaas wrote:
>>> On Fri, Jun 17, 2022 at 03:43:27PM +0800, Huacai Chen wrote:
>>>> On LS2K/LS7A, some non-existant devices don't return 0xffffffff when
>>>> scanning. This is a hardware flaw but we can only avoid it by software
>>>> now.
>>>
>>> We should say what *does* happen if we do a config read to a device
>>> that doesn't exit.  Machine check, hang, etc?
>>
>> The device is a hidden device(only for debug) that should not be
>> scanned. If scanned in a non-normal way, the machine is hang(one
>> case in ltp pci test can trigger the issue, which is explained
>> below).
> 
> Reading the Vendor ID is the *normal* way to scan for a device.  It
> seems that this hardware just hangs in some cases when the device
> doesn't exist.
> 
>>> Generally speaking we only probe for functions > 0 if .0 is marked as
>>> multi-function, so I guess this means 00:09.0 is marked as a
>>> multi-function device, but config reads to 00:09.1 would fail?
>>
>> Yes, definitely. Actually, the 00:09.0 is a single device, so fun1(09.1)
>> will not be scanned(e.g. the fun1 will be not scanned on pci enumeration
>> during kernel booting).
>>
>> But, there is one situation: when running ltp pci test case on LS7A,
>> the 00:08.2 is a sata controller(a valid device), and the bus number(0)
>> and devfn(0x42) are inputted to kernel api pci_scan_slot(), which has
>> clear note: devfn must have zero function. So, apparently, the inputted
>> devfn's function is not zero, but 2, and then in the pci_scan_slot():
>>
>>          for (fn = next_fn(bus, dev, 0); fn > 0; fn = next_fn(bus, dev, fn))
>> {
>>                  dev = pci_scan_single_device(bus, devfn + fn);
>>                  ...
>>          }
>>
>> 08.2,08.3...and 09.1 will be scanned one by one, so the 09.1(fun1) is
>> scanned.
> 
> Does the "((bus == 0) && (device >= 9 && device <= 20) && (function > 0))"
> test catch *all* devfns where the hang occurs?  I wouldn't want to
> only avoid the ones that LTP happens to use.  If we did that, a future
> LTP change could easily break things again.  But I assume you know
> exactly what devices are present on the root bus.
> 

Yes, as you said, I'm sure that only these hidden functions(fun1 of dev 
9 to 20) on root bus can cause issue, so this fix is enough to address it.

>>>> -	if (priv->data->flags & FLAG_DEV_FIX &&
>>>> -			!pci_is_root_bus(bus) && PCI_SLOT(devfn) > 0)
>>>> +	if ((priv->data->flags & FLAG_DEV_FIX) && bus->self) {
>>>> +		if (!pci_is_root_bus(bus) && (device > 0))
>>>> +			return NULL;
>>>> +	}
>>>> +
>>>> +	/* Don't access non-existant devices */
>>>> +	if (!pdev_is_existant(busnum, device, function))
>>>>    		return NULL;
>>>
>>> Is this a "forever" hardware bug that will never be fixed, or should
>>> there be a flag like FLAG_DEV_FIX so we only do this on the broken
>>> devices?
>>
>> No, the next new version LS7A will correct it, so maybe we can use
>> FLAG_DEV_FIX-like to address it.
> 
> You should add the flag now instead of waiting for the new hardware.
> Otherwise you may not remember or notice the need to make this
> conditional on the hardware version, you'll wonder why the fixed
> hardware doesn't enumerate devices correctly.
> 

Thanks for your suggestion, I agree that, Huacai, WDYT?


> Bjorn
>
Huacai Chen June 29, 2022, 10:03 a.m. UTC | #5
Hi, Jianmin and Bjorn,

On Wed, Jun 29, 2022 at 8:33 AM Jianmin Lv <lvjianmin@loongson.cn> wrote:
>
>
>
> On 2022/6/29 上午12:04, Bjorn Helgaas wrote:
> > On Tue, Jun 28, 2022 at 09:03:02PM +0800, Jianmin Lv wrote:
> >> On 2022/6/28 上午5:38, Bjorn Helgaas wrote:
> >>> On Fri, Jun 17, 2022 at 03:43:27PM +0800, Huacai Chen wrote:
> >>>> On LS2K/LS7A, some non-existant devices don't return 0xffffffff when
> >>>> scanning. This is a hardware flaw but we can only avoid it by software
> >>>> now.
> >>>
> >>> We should say what *does* happen if we do a config read to a device
> >>> that doesn't exit.  Machine check, hang, etc?
> >>
> >> The device is a hidden device(only for debug) that should not be
> >> scanned. If scanned in a non-normal way, the machine is hang(one
> >> case in ltp pci test can trigger the issue, which is explained
> >> below).
> >
> > Reading the Vendor ID is the *normal* way to scan for a device.  It
> > seems that this hardware just hangs in some cases when the device
> > doesn't exist.
> >
> >>> Generally speaking we only probe for functions > 0 if .0 is marked as
> >>> multi-function, so I guess this means 00:09.0 is marked as a
> >>> multi-function device, but config reads to 00:09.1 would fail?
> >>
> >> Yes, definitely. Actually, the 00:09.0 is a single device, so fun1(09.1)
> >> will not be scanned(e.g. the fun1 will be not scanned on pci enumeration
> >> during kernel booting).
> >>
> >> But, there is one situation: when running ltp pci test case on LS7A,
> >> the 00:08.2 is a sata controller(a valid device), and the bus number(0)
> >> and devfn(0x42) are inputted to kernel api pci_scan_slot(), which has
> >> clear note: devfn must have zero function. So, apparently, the inputted
> >> devfn's function is not zero, but 2, and then in the pci_scan_slot():
> >>
> >>          for (fn = next_fn(bus, dev, 0); fn > 0; fn = next_fn(bus, dev, fn))
> >> {
> >>                  dev = pci_scan_single_device(bus, devfn + fn);
> >>                  ...
> >>          }
> >>
> >> 08.2,08.3...and 09.1 will be scanned one by one, so the 09.1(fun1) is
> >> scanned.
> >
> > Does the "((bus == 0) && (device >= 9 && device <= 20) && (function > 0))"
> > test catch *all* devfns where the hang occurs?  I wouldn't want to
> > only avoid the ones that LTP happens to use.  If we did that, a future
> > LTP change could easily break things again.  But I assume you know
> > exactly what devices are present on the root bus.
> >
>
> Yes, as you said, I'm sure that only these hidden functions(fun1 of dev
> 9 to 20) on root bus can cause issue, so this fix is enough to address it.
>
> >>>> -  if (priv->data->flags & FLAG_DEV_FIX &&
> >>>> -                  !pci_is_root_bus(bus) && PCI_SLOT(devfn) > 0)
> >>>> +  if ((priv->data->flags & FLAG_DEV_FIX) && bus->self) {
> >>>> +          if (!pci_is_root_bus(bus) && (device > 0))
> >>>> +                  return NULL;
> >>>> +  }
> >>>> +
> >>>> +  /* Don't access non-existant devices */
> >>>> +  if (!pdev_is_existant(busnum, device, function))
> >>>>                    return NULL;
> >>>
> >>> Is this a "forever" hardware bug that will never be fixed, or should
> >>> there be a flag like FLAG_DEV_FIX so we only do this on the broken
> >>> devices?
> >>
> >> No, the next new version LS7A will correct it, so maybe we can use
> >> FLAG_DEV_FIX-like to address it.
> >
> > You should add the flag now instead of waiting for the new hardware.
> > Otherwise you may not remember or notice the need to make this
> > conditional on the hardware version, you'll wonder why the fixed
> > hardware doesn't enumerate devices correctly.
> >
>
> Thanks for your suggestion, I agree that, Huacai, WDYT?
Agree, I think we can use a FLAG_DEV_HIDDEN flag.

Huacai
>
>
> > Bjorn
> >
>
diff mbox series

Patch

diff --git a/drivers/pci/controller/pci-loongson.c b/drivers/pci/controller/pci-loongson.c
index a1222fc15454..e22142f75d97 100644
--- a/drivers/pci/controller/pci-loongson.c
+++ b/drivers/pci/controller/pci-loongson.c
@@ -134,10 +134,20 @@  static void __iomem *cfg0_map(struct loongson_pci *priv, int bus,
 	return priv->cfg0_base + addroff;
 }
 
+static bool pdev_is_existant(unsigned char bus, unsigned int device, unsigned int function)
+{
+	if ((bus == 0) && (device >= 9 && device <= 20) && (function > 0))
+		return false;
+
+	return true;
+}
+
 static void __iomem *pci_loongson_map_bus(struct pci_bus *bus, unsigned int devfn,
 			       int where)
 {
 	unsigned char busnum = bus->number;
+	unsigned int device = PCI_SLOT(devfn);
+	unsigned int function = PCI_FUNC(devfn);
 	struct loongson_pci *priv = pci_bus_to_loongson_pci(bus);
 
 	if (pci_is_root_bus(bus))
@@ -147,8 +157,13 @@  static void __iomem *pci_loongson_map_bus(struct pci_bus *bus, unsigned int devf
 	 * Do not read more than one device on the bus other than
 	 * the host bus.
 	 */
-	if (priv->data->flags & FLAG_DEV_FIX &&
-			!pci_is_root_bus(bus) && PCI_SLOT(devfn) > 0)
+	if ((priv->data->flags & FLAG_DEV_FIX) && bus->self) {
+		if (!pci_is_root_bus(bus) && (device > 0))
+			return NULL;
+	}
+
+	/* Don't access non-existant devices */
+	if (!pdev_is_existant(busnum, device, function))
 		return NULL;
 
 	/* CFG0 can only access standard space */