[v3,2/2] PCI/AER: Enable AER on all PCIe devices supporting it

Message ID 20220119092200.35823-3-sr@denx.de (mailing list archive)
State Superseded
Delegated to: Bjorn Helgaas
Series Fully enable AER

Commit Message

Stefan Roese Jan. 19, 2022, 9:22 a.m. UTC
With this change, AER is enabled on all PCIe devices, including devices
that are hot-plugged.

Please note that this change is quite invasive: with this patch applied,
AER will be enabled in the Device Control registers of all available
PCIe Endpoints, which is currently not the case.

When "pci=noaer" is selected, AER stays disabled of course.

Signed-off-by: Stefan Roese <sr@denx.de>
Cc: Bjorn Helgaas <helgaas@kernel.org>
Cc: Pali Rohár <pali@kernel.org>
Cc: Bharat Kumar Gogada <bharat.kumar.gogada@xilinx.com>
Cc: Michal Simek <michal.simek@xilinx.com>
Cc: Yao Hongbo <yaohongbo@linux.alibaba.com>
Cc: Naveen Naidu <naveennaidu479@gmail.com>
---
v3:
- New patch, replacing the "old" 2/2 patch
  Now enabling of AER for each PCIe device is done in pci_aer_init(),
  which also makes sure that AER is enabled in each PCIe device even when
  it's hot-plugged.

 drivers/pci/pcie/aer.c | 4 ++++
 1 file changed, 4 insertions(+)

Comments

Pali Rohár Jan. 19, 2022, 10:37 a.m. UTC | #1
On Wednesday 19 January 2022 10:22:00 Stefan Roese wrote:
> With this change, AER is now enabled on all PCIe devices, also when the
> PCIe device is hot-plugged.
> 
> Please note that this change is quite invasive, as with this patch
> applied, AER now will be enabled in the Device Control registers of all
> available PCIe Endpoints, which currently is not the case.
> 
> When "pci=noaer" is selected, AER stays disabled of course.

Hello Stefan! I was thinking more about this change and I'm not sure
what happens if an AER-capable PCIe device is hotplugged into a PCIe
switch in a PCIe hierarchy where the Root Port is not AER-capable
(e.g. the current Linux implementations of pci-aardvark.c and
pci-mvebu.c). My feeling is that in this case AER should not be enabled,
as there is nobody who can deliver the AER interrupt to the OS. But I
really do not know what is expected from the kernel AER driver, so let's
wait for Bjorn's reply.

And since you raised this issue with hotplugging, another thing for
future follow-up changes is calling the pcie_set_ecrc_checking() function
to align the ECRC state of a newly hotplugged device with the
"pci=ecrc=..." cmdline option. Currently that is done only in the
set_device_error_reporting() function.

> Signed-off-by: Stefan Roese <sr@denx.de>
> Cc: Bjorn Helgaas <helgaas@kernel.org>
> Cc: Pali Rohár <pali@kernel.org>
> Cc: Bharat Kumar Gogada <bharat.kumar.gogada@xilinx.com>
> Cc: Michal Simek <michal.simek@xilinx.com>
> Cc: Yao Hongbo <yaohongbo@linux.alibaba.com>
> Cc: Naveen Naidu <naveennaidu479@gmail.com>
> ---
> v3:
> - New patch, replacing the "old" 2/2 patch
>   Now enabling of AER for each PCIe device is done in pci_aer_init(),
>   which also makes sure that AER is enabled in each PCIe device even when
>   it's hot-plugged.
> 
>  drivers/pci/pcie/aer.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 9fa1f97e5b27..01a25e4a5168 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -387,6 +387,10 @@ void pci_aer_init(struct pci_dev *dev)
>  	pci_add_ext_cap_save_buffer(dev, PCI_EXT_CAP_ID_ERR, sizeof(u32) * n);
>  
>  	pci_aer_clear_status(dev);
> +
> +	/* Enable AER if requested */
> +	if (pci_aer_available())
> +		pci_enable_pcie_error_reporting(dev);
>  }
>  
>  void pci_aer_exit(struct pci_dev *dev)
> -- 
> 2.34.1
>
Keith Busch Jan. 19, 2022, 6:25 p.m. UTC | #2
On Wed, Jan 19, 2022 at 10:22:00AM +0100, Stefan Roese wrote:
> @@ -387,6 +387,10 @@ void pci_aer_init(struct pci_dev *dev)
>  	pci_add_ext_cap_save_buffer(dev, PCI_EXT_CAP_ID_ERR, sizeof(u32) * n);
>  
>  	pci_aer_clear_status(dev);
> +
> +	/* Enable AER if requested */
> +	if (pci_aer_available())
> +		pci_enable_pcie_error_reporting(dev);
>  }

Hasn't it always been the device specific driver's responsibility to
call this function?
Bjorn Helgaas Jan. 19, 2022, 9 p.m. UTC | #3
On Wed, Jan 19, 2022 at 10:25:50AM -0800, Keith Busch wrote:
> On Wed, Jan 19, 2022 at 10:22:00AM +0100, Stefan Roese wrote:
> > @@ -387,6 +387,10 @@ void pci_aer_init(struct pci_dev *dev)
> >  	pci_add_ext_cap_save_buffer(dev, PCI_EXT_CAP_ID_ERR, sizeof(u32) * n);
> >  
> >  	pci_aer_clear_status(dev);
> > +
> > +	/* Enable AER if requested */
> > +	if (pci_aer_available())
> > +		pci_enable_pcie_error_reporting(dev);
> >  }
> 
> Hasn't it always been the device specific driver's responsibility to
> call this function?

So far it has been done by the driver, because the PCI core doesn't do
it.  But is there a reason it should be done by the driver?  It
doesn't seem necessarily device-specific.

Bjorn
Keith Busch Jan. 19, 2022, 9:18 p.m. UTC | #4
On Wed, Jan 19, 2022 at 03:00:02PM -0600, Bjorn Helgaas wrote:
> On Wed, Jan 19, 2022 at 10:25:50AM -0800, Keith Busch wrote:
> > On Wed, Jan 19, 2022 at 10:22:00AM +0100, Stefan Roese wrote:
> > > @@ -387,6 +387,10 @@ void pci_aer_init(struct pci_dev *dev)
> > >  	pci_add_ext_cap_save_buffer(dev, PCI_EXT_CAP_ID_ERR, sizeof(u32) * n);
> > >  
> > >  	pci_aer_clear_status(dev);
> > > +
> > > +	/* Enable AER if requested */
> > > +	if (pci_aer_available())
> > > +		pci_enable_pcie_error_reporting(dev);
> > >  }
> > 
> > Hasn't it always been the device specific driver's responsibility to
> > call this function?
> 
> So far it has been done by the driver, because the PCI core doesn't do
> it.  But is there a reason it should be done by the driver?  It
> doesn't seem necessarily device-specific.

I was thinking the device driver knows if it provides .err_handler
callbacks in order to respond to AER handling, so it would know if it is
ready for its device to enable error reporting. But I guess it doesn't
really matter if the driver provides callbacks anyway.
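
The point about .err_handler can be illustrated with a small user-space sketch. This is not kernel code: `toy_err_handlers`, `toy_driver` and `toy_report_error()` are hypothetical stand-ins, and the model simply assumes that an error is logged whether or not the bound driver registered a callback.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for a driver's error-handler callback table. */
struct toy_err_handlers {
	int (*error_detected)(const char *dev_name);
};

/* Hypothetical stand-in for a bound driver; err_handler may be NULL. */
struct toy_driver {
	const char *name;
	const struct toy_err_handlers *err_handler;
};

enum toy_result { TOY_NO_DRIVER_CALLBACK, TOY_DRIVER_NOTIFIED };

/*
 * Report an error for a device. In this model the error is always
 * logged; the driver is notified only if it registered a callback.
 */
static enum toy_result toy_report_error(const struct toy_driver *drv,
					const char *dev_name)
{
	assert(dev_name != NULL);
	printf("%s: error reported\n", dev_name);

	if (!drv || !drv->err_handler || !drv->err_handler->error_detected)
		return TOY_NO_DRIVER_CALLBACK;

	drv->err_handler->error_detected(dev_name);
	return TOY_DRIVER_NOTIFIED;
}

/* Example callback a driver might register in this model. */
static int toy_nvme_error_detected(const char *dev_name)
{
	printf("%s: driver callback invoked\n", dev_name);
	return 0;
}
```

In this model, enabling reporting for a device whose driver has no callbacks still yields the logged error; only the driver notification is skipped.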
Stefan Roese Jan. 20, 2022, 7:31 a.m. UTC | #5
On 1/19/22 11:37, Pali Rohár wrote:
> On Wednesday 19 January 2022 10:22:00 Stefan Roese wrote:
>> With this change, AER is now enabled on all PCIe devices, also when the
>> PCIe device is hot-plugged.
>>
>> Please note that this change is quite invasive, as with this patch
>> applied, AER now will be enabled in the Device Control registers of all
>> available PCIe Endpoints, which currently is not the case.
>>
>> When "pci=noaer" is selected, AER stays disabled of course.
> 
> Hello Stefan! I was thinking more about this change and I'm not sure
> what happens if AER-capable PCIe device is hotplugged into some PCIe
> switch connected in the PCIe hierarchy where Root Port is not
> AER-capable (e.g. current linux implementation of pci-aardvark.c and
> pci-mvebu.c). My feeling is that in this case AER should not be enabled
> as there is nobody who can deliver AER interrupt to the OS. But I really
> do not know what is supposed from kernel AER driver, so lets wait for
> Bjorn reply.

But what happens right now, when a device driver like the NVMe driver
calls pci_enable_pcie_error_reporting()? There is also no check whether
the connected Root Port or some switch / bridge in between supports
AER. IIUTC, this is identical to what this patch does: enable AER in
the device, and if the upstream infrastructure does not support AER,
then the AER event will just not be received by the kernel. That is
most likely no worse than not enabling AER at all on this device. Or am
I missing something?

> And when you opened this issue with hotplugging, another thing for
> followup changes in future is calling pcie_set_ecrc_checking() function
> to align ECRC state of newly hotplugged device with "pci=ecrc=..."
> cmdline option. As currently it is done only at that function
> set_device_error_reporting().

Agreed, this is another area to look into. Not sure if it's okay to
address this after this patch set has been accepted (if it will be).

Thanks,
Stefan
Stefan Roese Jan. 20, 2022, 7:32 a.m. UTC | #6
On 1/19/22 22:18, Keith Busch wrote:
> On Wed, Jan 19, 2022 at 03:00:02PM -0600, Bjorn Helgaas wrote:
>> On Wed, Jan 19, 2022 at 10:25:50AM -0800, Keith Busch wrote:
>>> On Wed, Jan 19, 2022 at 10:22:00AM +0100, Stefan Roese wrote:
>>>> @@ -387,6 +387,10 @@ void pci_aer_init(struct pci_dev *dev)
>>>>   	pci_add_ext_cap_save_buffer(dev, PCI_EXT_CAP_ID_ERR, sizeof(u32) * n);
>>>>   
>>>>   	pci_aer_clear_status(dev);
>>>> +
>>>> +	/* Enable AER if requested */
>>>> +	if (pci_aer_available())
>>>> +		pci_enable_pcie_error_reporting(dev);
>>>>   }
>>>
>>> Hasn't it always been the device specific driver's responsibility to
>>> call this function?
>>
>> So far it has been done by the driver, because the PCI core doesn't do
>> it.  But is there a reason it should be done by the driver?  It
>> doesn't seem necessarily device-specific.
> 
> I was thinking the device driver knows if it provides .err_handler
> callbacks in order to respond to AER handling, so it would know if it is
> ready for its device to enable error reporting. But I guess it doesn't
> really matter if the driver provides callbacks anyway.

That's my understanding as well.

Thanks,
Stefan
Pali Rohár Jan. 20, 2022, 1:23 p.m. UTC | #7
On Thursday 20 January 2022 08:31:31 Stefan Roese wrote:
> On 1/19/22 11:37, Pali Rohár wrote:
> > On Wednesday 19 January 2022 10:22:00 Stefan Roese wrote:
> > > With this change, AER is now enabled on all PCIe devices, also when the
> > > PCIe device is hot-plugged.
> > > 
> > > Please note that this change is quite invasive, as with this patch
> > > applied, AER now will be enabled in the Device Control registers of all
> > > available PCIe Endpoints, which currently is not the case.
> > > 
> > > When "pci=noaer" is selected, AER stays disabled of course.
> > 
> > Hello Stefan! I was thinking more about this change and I'm not sure
> > what happens if AER-capable PCIe device is hotplugged into some PCIe
> > switch connected in the PCIe hierarchy where Root Port is not
> > AER-capable (e.g. current linux implementation of pci-aardvark.c and
> > pci-mvebu.c). My feeling is that in this case AER should not be enabled
> > as there is nobody who can deliver AER interrupt to the OS. But I really
> > do not know what is supposed from kernel AER driver, so lets wait for
> > Bjorn reply.
> 
> But what happens right now, when a device driver like the NVMe driver
> calls pci_enable_pcie_error_reporting() ? There is also no checking,
> if the connected Root Port or some switch / bridge in-between supports
> AER or not. IIUTC, this is identical to what this patch here does.
> Enable AER in the device and if the upstream infrastructure does not
> support AER, then the AER event will just not be received by the
> Kernel. Which is most likely not worse than not enabling AER at all
> on this device. Or am I missing something?

You are right!

It seems the AER code has a lot of candidates for follow-up fixes/cleanups...

> > And when you opened this issue with hotplugging, another thing for
> > followup changes in future is calling pcie_set_ecrc_checking() function
> > to align ECRC state of newly hotplugged device with "pci=ecrc=..."
> > cmdline option. As currently it is done only at that function
> > set_device_error_reporting().
> 
> Agreed, this is another area to look into. Not sure if it's okay to
> address this, once this patch-set has been accepted (if it will be).
> 
> Thanks,
> Stefan
Bjorn Helgaas Jan. 20, 2022, 3:46 p.m. UTC | #8
On Thu, Jan 20, 2022 at 08:31:31AM +0100, Stefan Roese wrote:
> On 1/19/22 11:37, Pali Rohár wrote:

> > And when you opened this issue with hotplugging, another thing for
> > followup changes in future is calling pcie_set_ecrc_checking() function
> > to align ECRC state of newly hotplugged device with "pci=ecrc=..."
> > cmdline option. As currently it is done only at that function
> > set_device_error_reporting().
> 
> Agreed, this is another area to look into. Not sure if it's okay to
> address this, once this patch-set has been accepted (if it will be).

ECRC might be something that could be peeled off first to reduce the
complexity of AER itself.

The ECRC capability and enable bits are in the AER Capability, so I
think it should be moved to pci_aer_init() so it happens for every
device as we enumerate it.

As far as I can tell, there is no requirement that every device in the
path support ECRC, so it can be enabled independently for each device.
I think devices that don't support ECRC checking must handle TLPs with
ECRC without error.

Per Table 6-5, ECRC check failures result in a device logging the
prefix/header of the TLP and sending ERR_NONFATAL or ERR_COR.  I think
this is useful regardless of whether AER interrupts are enabled
because error information is logged where the ECRC failure was
detected.

Bjorn
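
The per-device nature of ECRC enabling described above can be sketched in user space as follows. This is an illustration, not the kernel implementation: the bit values are the ECRC capable/enable bits of the AER Advanced Error Capabilities and Control register, while `toy_enable_ecrc()` is a hypothetical helper that operates on a plain copy of the register value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* ECRC bits of the AER Advanced Error Capabilities and Control register. */
#define ERR_CAP_ECRC_GENC 0x00000020 /* ECRC Generation Capable */
#define ERR_CAP_ECRC_GENE 0x00000040 /* ECRC Generation Enable */
#define ERR_CAP_ECRC_CHKC 0x00000080 /* ECRC Check Capable */
#define ERR_CAP_ECRC_CHKE 0x00000100 /* ECRC Check Enable */

/*
 * Set each enable bit only when the matching capability bit is present.
 * The decision uses nothing but this device's own register value, so it
 * can be made independently for every device as it is enumerated.
 */
static void toy_enable_ecrc(uint32_t *err_cap)
{
	assert(err_cap != NULL);

	if (*err_cap & ERR_CAP_ECRC_GENC)
		*err_cap |= ERR_CAP_ECRC_GENE;
	if (*err_cap & ERR_CAP_ECRC_CHKC)
		*err_cap |= ERR_CAP_ECRC_CHKE;
}
```

A device that advertises neither capability is left untouched, which matches the observation that not every device in the path has to support ECRC.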
Stefan Roese Jan. 20, 2022, 4:59 p.m. UTC | #9
On 1/20/22 16:46, Bjorn Helgaas wrote:
> On Thu, Jan 20, 2022 at 08:31:31AM +0100, Stefan Roese wrote:
>> On 1/19/22 11:37, Pali Rohár wrote:
> 
>>> And when you opened this issue with hotplugging, another thing for
>>> followup changes in future is calling pcie_set_ecrc_checking() function
>>> to align ECRC state of newly hotplugged device with "pci=ecrc=..."
>>> cmdline option. As currently it is done only at that function
>>> set_device_error_reporting().
>>
>> Agreed, this is another area to look into. Not sure if it's okay to
>> address this, once this patch-set has been accepted (if it will be).
> 
> ECRC might be something that could be peeled off first to reduce the
> complexity of AER itself.
> 
> The ECRC capability and enable bits are in the AER Capability, so I
> think it should be moved to pci_aer_init() so it happens for every
> device as we enumerate it.

Just so there is no misunderstanding: you are thinking of something
like this?

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 9fa1f97e5b27..5585fefc4d0e 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -387,6 +387,9 @@ void pci_aer_init(struct pci_dev *dev)
         pci_add_ext_cap_save_buffer(dev, PCI_EXT_CAP_ID_ERR, sizeof(u32) * n);

         pci_aer_clear_status(dev);
+
+       /* Enable ECRC checking if enabled and configured */
+       pcie_set_ecrc_checking(dev);
  }

  void pci_aer_exit(struct pci_dev *dev)
@@ -1223,9 +1226,6 @@ static int set_device_error_reporting(struct pci_dev *dev, void *data)
                         pci_disable_pcie_error_reporting(dev);
         }

-       if (enable)
-               pcie_set_ecrc_checking(dev);
-
         return 0;
  }

Perhaps as patch 1/3 in this patch series? Or as some completely
separate patch?

Thanks,
Stefan

> As far as I can tell, there is no requirement that every device in the
> path support ECRC, so it can be enabled independently for each device.
> I think devices that don't support ECRC checking must handle TLPs with
> ECRC without error.
> 
> Per Table 6-5, ECRC check failures result in a device logging the
> prefix/header of the TLP and sending ERR_NONFATAL or ERR_COR.  I think
> this is useful regardless of whether AER interrupts are enabled
> because error information is logged where the ECRC failure was
> detected.
> 
> Bjorn
> 

Best regards,
Stefan Roese
Bjorn Helgaas Jan. 20, 2022, 5:54 p.m. UTC | #10
On Thu, Jan 20, 2022 at 05:59:22PM +0100, Stefan Roese wrote:
> On 1/20/22 16:46, Bjorn Helgaas wrote:
> > On Thu, Jan 20, 2022 at 08:31:31AM +0100, Stefan Roese wrote:
> > > On 1/19/22 11:37, Pali Rohár wrote:
> > 
> > > > And when you opened this issue with hotplugging, another thing for
> > > > followup changes in future is calling pcie_set_ecrc_checking() function
> > > > to align ECRC state of newly hotplugged device with "pci=ecrc=..."
> > > > cmdline option. As currently it is done only at that function
> > > > set_device_error_reporting().
> > > 
> > > Agreed, this is another area to look into. Not sure if it's okay to
> > > address this, once this patch-set has been accepted (if it will be).
> > 
> > ECRC might be something that could be peeled off first to reduce the
> > complexity of AER itself.
> > 
> > The ECRC capability and enable bits are in the AER Capability, so I
> > think it should be moved to pci_aer_init() so it happens for every
> > device as we enumerate it.
> 
> Just that there is no misunderstanding: You are thinking about something
> like this:
> 
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 9fa1f97e5b27..5585fefc4d0e 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -387,6 +387,9 @@ void pci_aer_init(struct pci_dev *dev)
>         pci_add_ext_cap_save_buffer(dev, PCI_EXT_CAP_ID_ERR, sizeof(u32) * n);
> 
>         pci_aer_clear_status(dev);
> +
> +       /* Enable ECRC checking if enabled and configured */
> +       pcie_set_ecrc_checking(dev);
>  }
> 
>  void pci_aer_exit(struct pci_dev *dev)
> @@ -1223,9 +1226,6 @@ static int set_device_error_reporting(struct pci_dev *dev, void *data)
>                         pci_disable_pcie_error_reporting(dev);
>         }
> 
> -       if (enable)
> -               pcie_set_ecrc_checking(dev);
> -
>         return 0;
>  }
> 
> Perhaps as patch 1/3 in this patch series? Or as some completely
> separate patch?

Yes.  Probably as 1/3, since subsequent patches may depend on this
one, or at least may not apply cleanly without this one.

Bjorn
Patch

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 9fa1f97e5b27..01a25e4a5168 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -387,6 +387,10 @@  void pci_aer_init(struct pci_dev *dev)
 	pci_add_ext_cap_save_buffer(dev, PCI_EXT_CAP_ID_ERR, sizeof(u32) * n);
 
 	pci_aer_clear_status(dev);
+
+	/* Enable AER if requested */
+	if (pci_aer_available())
+		pci_enable_pcie_error_reporting(dev);
 }
 
 void pci_aer_exit(struct pci_dev *dev)