[RFC,0/9] Add RAS support for CXL root ports, CXL downstream switch ports, and CXL upstream switch ports

Message ID	20240617200411.1426554-1-terry.bowman@amd.com
Headers
Received: from NAM10-BN7-obe.outbound.protection.outlook.com
 (mail-bn7nam10on2064.outbound.protection.outlook.com [40.107.92.64])
 (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
 (No client certificate requested)
 by smtp.subspace.kernel.org (Postfix) with ESMTPS id 70294E542;
 Mon, 17 Jun 2024 20:04:30 +0000 (UTC)
Received-SPF: Pass (protection.outlook.com: domain of amd.com designates
 165.204.84.17 as permitted sender) receiver=protection.outlook.com;
 client-ip=165.204.84.17; helo=SATLEXMB04.amd.com; pr=C
From: Terry Bowman <terry.bowman@amd.com>
To: <dan.j.williams@intel.com>, <ira.weiny@intel.com>, <dave@stgolabs.net>,
	<dave.jiang@intel.com>, <alison.schofield@intel.com>, <ming4.li@intel.com>,
	<vishal.l.verma@intel.com>, <jim.harris@samsung.com>,
	<ilpo.jarvinen@linux.intel.com>, <ardb@kernel.org>,
	<sathyanarayanan.kuppuswamy@linux.intel.com>, <linux-cxl@vger.kernel.org>,
	<linux-kernel@vger.kernel.org>, <terry.bowman@amd.com>,
	<Yazen.Ghannam@amd.com>, <Robert.Richter@amd.com>
Subject: [RFC PATCH 0/9] Add RAS support for CXL root ports,
 CXL downstream switch ports, and CXL upstream switch ports
Date: Mon, 17 Jun 2024 15:04:02 -0500
Message-ID: <20240617200411.1426554-1-terry.bowman@amd.com>
Precedence: bulk
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Content-Type: text/plain
Series	Add RAS support for CXL root ports, CXL downstream switch ports, and CXL upstream switch ports
  [RFC,1/9] PCI/AER: Update AER driver to call root port and downstream port UCE handlers
  [RFC,2/9] PCI/AER: Call AER CE handler before clearing AER CE status register
  [RFC,3/9] PCI/portdrv: Update portdrv with an atomic notifier for reporting AER internal errors
  [RFC,4/9] cxl/pci: Map CXL PCIe ports' RAS registers
  [RFC,5/9] cxl/pci: Update RAS handler interfaces to support CXL PCIe ports
  [RFC,6/9] cxl/pci: Add trace logging for CXL PCIe port RAS errors
  [RFC,7/9] cxl/pci: Add atomic notifier callback for CXL PCIe port AER internal errors
  [RFC,8/9] PCI/AER: Export pci_aer_unmask_internal_errors()
  [RFC,9/9] cxl/pci: Enable interrupts for CXL PCIe ports' AER internal errors

Message

Terry Bowman June 17, 2024, 8:04 p.m. UTC

This patchset provides RAS logging for CXL root ports, CXL downstream
switch ports, and CXL upstream switch ports. This includes changes to
use a portdrv notifier chain to communicate CXL AER/RAS errors to a
cxl_pci callback.

The first 3 patches prepare for and add an atomic notifier chain to the
portdrv driver. The portdrv's notifier chain reports the port device's
AER internal errors to the registered callback(s). The preparation changes
include a portdrv update to call the uncorrectable handler for PCIe root
ports and PCIe downstream switch ports. Also, the AER correctable error
(CE) status is made available to the AER CE handler.
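
As a rough illustration of why the CE ordering matters, here is a minimal
userspace sketch (not code from this series). The `fake_port` struct and
all function names are hypothetical. The only facts assumed are that the
AER correctable error status register is write-1-to-clear and that the
corrected internal error bit is bit 14. If the status is cleared before
the CE handler runs, the handler has nothing left to inspect.

```c
#include <assert.h>
#include <stdint.h>

#define COR_INTERNAL 0x00004000u /* Corrected Internal Error, bit 14 */

/* Hypothetical stand-in for a port's AER correctable status register. */
struct fake_port {
	uint32_t cor_status;      /* write-1-to-clear (RW1C) */
	uint32_t seen_by_handler; /* status bits the CE handler observed */
};

/* Model a write to an RW1C register: every 1 bit written clears that bit. */
static void cor_status_write(struct fake_port *port, uint32_t val)
{
	assert(port);
	port->cor_status &= ~val;
}

/* Hypothetical CE handler: it can only report bits that are still set. */
static void ce_handler(struct fake_port *port)
{
	assert(port);
	port->seen_by_handler = port->cor_status;
}

/* Clear first, then call the handler: the handler sees an empty status. */
static void process_ce_clear_first(struct fake_port *port)
{
	uint32_t status = port->cor_status;

	cor_status_write(port, status);
	ce_handler(port);
}

/* Call the handler first, then clear: the handler sees the error bits. */
static void process_ce_handler_first(struct fake_port *port)
{
	uint32_t status = port->cor_status;

	ce_handler(port);
	cor_status_write(port, status);
}
```

In both orderings the status register ends up cleared; the only difference
is whether the handler could still read the error bits when it ran.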

The next 4 patches are in preparation for adding an atomic notification
callback in the cxl_pci driver. This is for receiving AER internal error
events from the portdrv notifier chain. Preparation includes adding RAS
register block mapping, adding trace functions for logging, and
refactoring cxl_pci RAS functions for reuse.

The final 2 patches enable the AER internal error interrupts.

Testing RAS CE/UCE:
  QEMU was used for testing CXL root port, CXL downstream switch port, and
  CXL upstream switch port. The aer-inject tool was used to inject AER
  errors, and a test patch was used to set the AER CIE/UIE and RAS CE/UCE
  status during testing. Testing passed with no issues.
 
  An AMD platform with the AMD RAS error injection tool was used for
  testing CXL root port injection. Testing passed with no issues.

  TODO - regression test CXL1.1 RCH handling.

Solutions Considered (1-4):
  Below are solutions that were considered. Solution #4 is
  implemented in this patchset. 

  1.) Reassigning portdrv error handler for CXL port devices
  
  This solution was based on reassigning the portdrv's CE/UCE err_handler
  to be CXL cxl_pci driver functions.
  
  I started with this solution, and once the flow was working I realized
  that endpoint removal would have to be addressed as well. While this
  could be resolved, it highlights the odd coupling and dependency
  between the CXL port devices' error handling and the cxl_pci endpoint's
  handlers. Also, reassigning err_handler at runtime required ignoring
  the 'const' definition. I don't believe this should be considered as a
  possible solution.
  
  2.) Update the AER driver to call cxl_pci driver's error handler before
  calling pci_aer_handle_error()

  This is similar to the existing RCH port error approach in aer.c.
  In this solution the AER driver searches for a downstream CXL endpoint
  to 'handle' detected CXL port protocol errors.

  This is a good solution to consider if the one presented in this patchset
  is not acceptable. I was initially reluctant to take this approach because
  it adds more CXL coupling to the AER driver. But I think this solution
  would technically work. I believe Ming was working towards this
  solution.

  3.) Refactor portdrv
  The portdrv refactoring solution is to change the portdrv service drivers
  into PCIe auxiliary drivers. With this change the facility drivers can be
  associated with a PCIe driver instead of being bound to the portdrv
  driver in a fixed way.

  In this case the CXL port functionality would be added either as a CXL
  auxiliary driver or as a CXL specific port driver
  (PCI_CLASS_BRIDGE_PCI_NORMAL).

  This solution has challenges in the interrupt allocation by separate
  auxiliary drivers and in binding of a specific driver. Binding is
  currently based on PCIe class and would require extending the binding
  logic to support multiple drivers for the same class.
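
  The binding problem can be sketched in userspace as follows. This is an
  illustration only: `match_id`, `class_matches()`, `count_matches()`, the
  24-bit mask, and the "cxl_portdrv" driver name are hypothetical. The
  masked class comparison mirrors the general shape of PCI class-based ID
  matching. Two drivers that advertise the same bridge class both match the
  same device, so class alone cannot select between them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PCI_CLASS_BRIDGE_PCI_NORMAL 0x060400u

/* Hypothetical, trimmed-down match entry: a class code plus a mask. */
struct match_id {
	const char *driver;
	uint32_t class_code;
	uint32_t class_mask;
};

/* A device matches when the class bits selected by the mask are equal. */
static int class_matches(const struct match_id *id, uint32_t dev_class)
{
	assert(id);
	return ((id->class_code ^ dev_class) & id->class_mask) == 0;
}

/* Count how many entries in the table would claim the device. */
static size_t count_matches(const struct match_id *ids, size_t n,
			    uint32_t dev_class)
{
	size_t hits = 0;

	for (size_t i = 0; i < n; i++)
		hits += (size_t)class_matches(&ids[i], dev_class);
	return hits;
}
```

  With a table holding both a "portdrv" entry and a "cxl_portdrv" entry for
  PCI_CLASS_BRIDGE_PCI_NORMAL, count_matches() returns 2 for a PCI-to-PCI
  bridge, which is the ambiguity the extended binding logic would have to
  resolve.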

  Jonathan Cameron is working towards this solution by initially solving
  for the PMU service driver.[1] It is using the auxiliary bus to associate
  what were service drivers with the portdrv driver. Using a CXL auxiliary
  for handling CXL port RAS errors would result in RAS logic called from
  the cxl_pci and CXL auxiliary drivers. This may need a library driver.

  4.) Using a portdrv notifier chain/callback for CIE/UIE
  (Implemented in this patchset)

  This solution uses a portdrv atomic chain notifier and a cxl_pci
  callback to handle and log CXL port RAS errors.
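
  For readers unfamiliar with the pattern, the following is a minimal
  userspace model of a notifier chain, not the code in this series. The
  `chain_*()` helpers, the `PORT_AER_*` event codes, `struct ras_log`, and
  `cxl_port_ras_cb()` are hypothetical names. The NOTIFY_* values follow the
  kernel's notifier.h. The model deliberately omits the locking and RCU
  protection that the kernel's atomic notifier provides.

```c
#include <assert.h>
#include <stddef.h>

#define NOTIFY_DONE      0x0000
#define NOTIFY_OK        0x0001
#define NOTIFY_STOP_MASK 0x8000

struct notifier_block {
	int (*notifier_call)(struct notifier_block *nb,
			     unsigned long action, void *data);
	struct notifier_block *next;
	int priority;
};

struct notifier_head {
	struct notifier_block *head;
};

/* Insert in descending priority order; equal priority goes after existing. */
static void chain_register(struct notifier_head *nh, struct notifier_block *nb)
{
	struct notifier_block **p;

	assert(nh && nb && nb->notifier_call);
	p = &nh->head;
	while (*p && (*p)->priority >= nb->priority)
		p = &(*p)->next;
	nb->next = *p;
	*p = nb;
}

/* Unlink the block if it is on the chain; otherwise do nothing. */
static void chain_unregister(struct notifier_head *nh, struct notifier_block *nb)
{
	struct notifier_block **p;

	assert(nh && nb);
	for (p = &nh->head; *p; p = &(*p)->next) {
		if (*p == nb) {
			*p = nb->next;
			return;
		}
	}
}

/* Call each callback in order until one sets NOTIFY_STOP_MASK. */
static int chain_call(struct notifier_head *nh, unsigned long action, void *data)
{
	struct notifier_block *nb, *next;
	int ret = NOTIFY_DONE;

	assert(nh);
	for (nb = nh->head; nb; nb = next) {
		next = nb->next;
		ret = nb->notifier_call(nb, action, data);
		if (ret & NOTIFY_STOP_MASK)
			break;
	}
	return ret;
}

/* Hypothetical event codes for corrected/uncorrected internal errors. */
enum { PORT_AER_CIE = 1, PORT_AER_UIE = 2 };

/* Hypothetical log standing in for the RAS trace output. */
struct ras_log {
	unsigned int cie;
	unsigned int uie;
};

/* Hypothetical endpoint-driver callback: count events it understands. */
static int cxl_port_ras_cb(struct notifier_block *nb,
			   unsigned long action, void *data)
{
	struct ras_log *log = data;

	(void)nb;
	if (action == PORT_AER_CIE)
		log->cie++;
	else if (action == PORT_AER_UIE)
		log->uie++;
	else
		return NOTIFY_DONE;
	return NOTIFY_OK;
}
```

  In this model the port side only calls chain_call() with an event code,
  and the endpoint driver registers and unregisters its callback as it
  binds and unbinds, so the port side never references the callback
  directly.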
  
  I chose this after trying solution #1 above. I see several advantages to
  this solution:
  - It is a general port implementation of the CIE/UIE-specific handling
  mentioned in the PCIe spec.[2]
  - A notifier is already used in the RAS MCE driver, which serves as an
  existing example.
  - Does not introduce further CXL dependencies into the AER driver.
  - The notifier chain provides registration/unregistration and
  synchronization.

  A disadvantage of this approach is coupling still exists between the CXL
  port's driver (portdrv) and the cxl_pci driver. The CXL port device's RAS
  is handled by a notifier callback in the cxl_pci endpoint driver.

  Most of the patches in this patchset could be reused to work with
  solution #2 or solution #3. The atomic notifier would be dropped in
  favor of an auxiliary device or AER driver awareness, and the other
  changes in this patchset could possibly be carried over.

  [1] Kernel.org -
  https://lore.kernel.org/all/f4b23710-059a-51b7-9d27-b62e8b358b54@linux.intel.com
  [2] PCIe 6.0 - 6.2.10 Internal Errors

 drivers/cxl/core/core.h    |   4 +
 drivers/cxl/core/pci.c     | 153 ++++++++++++++++++++++++++++++++-----
 drivers/cxl/core/port.c    |   6 +-
 drivers/cxl/core/trace.h   |  34 +++++++++
 drivers/cxl/cxl.h          |  10 +++
 drivers/cxl/cxlpci.h       |   2 +
 drivers/cxl/mem.c          |  32 +++++++-
 drivers/cxl/pci.c          |  19 ++++-
 drivers/pci/pcie/aer.c     |  10 ++-
 drivers/pci/pcie/err.c     |  20 +++++
 drivers/pci/pcie/portdrv.c |  32 ++++++++
 drivers/pci/pcie/portdrv.h |   2 +
 include/linux/aer.h        |   6 ++
 13 files changed, 303 insertions(+), 27 deletions(-)


base-commit: ca3d4767c8054447ac2a58356080e299a59e05b8

Comments

Dan Williams June 21, 2024, 7:04 p.m. UTC | #1

Terry Bowman wrote:
> This patchset provides RAS logging for CXL root ports, CXL downstream
> switch ports, and CXL upstream switch ports. This includes changes to
> use a portdrv notifier chain to communicate CXL AER/RAS errors to a
> cxl_pci callback.
> 
> The first 3 patches prepare for and add an atomic notifier chain to the
> portdrv driver. The portdrv's notifier chain reports the port device's
> AER internal errors to the registered callback(s). The preparation changes
> include a portdrv update to call the uncorrectable handler for PCIe root
> ports and PCIe downstream switch ports. Also, the AER correctable error
> (CE) status is made available to the AER CE handler.
> 
> The next 4 patches are in preparation for adding an atomic notification
> callback in the cxl_pci driver. This is for receiving AER internal error
> events from the portdrv notifier chain. Preparation includes adding RAS
> register block mapping, adding trace functions for logging, and
> refactoring cxl_pci RAS functions for reuse.
> 
> The final 2 patches enable the AER internal error interrupts.
[..] 
> 
> Solutions Considered (1-4):
>   Below are solutions that were considered. Solution #4 is
>   implemented in this patchset. 
[..]
>  2.) Update the AER driver to call cxl_pci driver's error handler before
>  calling pci_aer_handle_error()
>
>  This is similar to the existing RCH port error approach in aer.c.
>  In this solution the AER driver searches for a downstream CXL endpoint
>  to 'handle' detected CXL port protocol errors.
>
>  This is a good solution to consider if the one presented in this patchset
>  is not acceptable. I was initially reluctant to this approach because it
>  adds more CXL coupling to the AER driver. But, I think this solution
>  would technically work. I believe Ming was working towards this
>  solution.

I feel like the coupling is warranted because these things *are* PCIe
and CXL ports, but it means solving the interrupt distribution problem.

>   3.) Refactor portdrv
>   The portdrv refactoring solution is to change the portdrv service drivers
>   into PCIe auxiliary drivers. With this change the facility drivers can be
>   associated with a PCIe driver instead fixed bound to the portdrv driver.
> 
>   In this case the CXL port functionality would be added either as a CXL
>   auxiliary driver or as a CXL specific port driver
>   (PCI_CLASS_BRIDGE_PCI_NORMAL).
> 
>   This solution has challenges in the interrupt allocation by separate
>   auxiliary drivers and in binding of a specific driver. Binding is
>   currently based on PCIe class and would require extending the binding
>   logic to support multiple drivers for the same class.
> 
>   Jonathan Cameron is working towards this solution by initially solving
>   for the PMU service driver.[1] It is using the auxiliary bus to associate
>   what were service drivers with the portdrv driver. Using a CXL auxiliary
>   for handling CXL port RAS errors would result in RAS logic called from
>   the cxl_pci and CXL auxiliary drivers. This may need a library driver.

I don't think auxiliary bus is a fundamental step forward from pcie
portdrv, it's just a s/pcie_port_bus_type/auxiliary_bus_type/ rename,
but with all the same problems around how to distribute interrupt
services to different interested parties.

So I think notifiers are interesting from the perspective of a software
hack to enable interrupt distribution. However, given that dynamic MSI-X
support is within reach I am interested in exploring that path and
mandating that archs that want to handle CXL protocol errors natively
need to enable dynamic MSI-X. Otherwise, those platforms should disclaim
native protocol error handling support via CXL _OSC.

In other words, I expect native dynamic MSI-X support is more
maintainable in the sense of keeping all the code in one notification
domain.

>   4.) Using a portdrv notifier chain/callback for CIE/UIE
>   (Implemented in this patchset)
> 
>   This solution uses a portdrv atomic chain notifier and a cxl_pci
>   callback to handle and log CXL port RAS errors.

Oh, I will need to look at the cxl_pci tie-in for this. I was
expecting cxl_pci to get involved only in the RCH case because the port
and the endpoint are one and the same object. In the VH case I would only
expect cxl_pci to get involved for its own observed protocol errors, not
those reported upstream from that endpoint.

>   I chose this after trying solution#1 above. I see a couple advantages to
>   this solution are:
>   - Is general port implementation for CIE/UIE specific handling mentioned
>   in the PCIe spec.[2]
>   - Notifier is used in RAS MCE driver as an existing example.
>   - Does not introduce further CXL dependencies into the AER driver.
>   - The notifier chain provides registration/unregistration and
>   synchronization.
> 
>   A disadvantage of this approach is coupling still exists between the CXL
>   port's driver (portdrv) and the cxl_pci driver. The CXL port device's RAS
>   is handled by a notifier callback in the cxl_pci endpoint driver.
> 
>   Most of the patches in this patchset could be reused to work with
>   solution#3 or solution#2. The atomic notifier could be dropped and
>   instead use an auxiliary device or AER driver awareness. The other
>   changes in this patchset could possibly be reused.

I appreciate the discussion of tradeoffs, thanks Terry!

Terry Bowman June 24, 2024, 5:47 p.m. UTC | #2

Hi Dan,

I added responses below.

On 6/21/24 14:04, Dan Williams wrote:
> Terry Bowman wrote:
>> This patchset provides RAS logging for CXL root ports, CXL downstream
>> switch ports, and CXL upstream switch ports. This includes changes to
>> use a portdrv notifier chain to communicate CXL AER/RAS errors to a
>> cxl_pci callback.
>>
>> The first 3 patches prepare for and add an atomic notifier chain to the
>> portdrv driver. The portdrv's notifier chain reports the port device's
>> AER internal errors to the registered callback(s). The preparation changes
>> include a portdrv update to call the uncorrectable handler for PCIe root
>> ports and PCIe downstream switch ports. Also, the AER correctable error
>> (CE) status is made available to the AER CE handler.
>>
>> The next 4 patches are in preparation for adding an atomic notification
>> callback in the cxl_pci driver. This is for receiving AER internal error
>> events from the portdrv notifier chain. Preparation includes adding RAS
>> register block mapping, adding trace functions for logging, and
>> refactoring cxl_pci RAS functions for reuse.
>>
>> The final 2 patches enable the AER internal error interrupts.
> [..] 
>>
>> Solutions Considered (1-4):
>>   Below are solutions that were considered. Solution #4 is
>>   implemented in this patchset. 
> [..]
>>  2.) Update the AER driver to call cxl_pci driver's error handler before
>>  calling pci_aer_handle_error()
>>
>>  This is similar to the existing RCH port error approach in aer.c.
>>  In this solution the AER driver searches for a downstream CXL endpoint
>>  to 'handle' detected CXL port protocol errors.
>>
>>  This is a good solution to consider if the one presented in this patchset
>>  is not acceptable. I was initially reluctant to this approach because it
>>  adds more CXL coupling to the AER driver. But, I think this solution
>>  would technically work. I believe Ming was working towards this
>>  solution.
> 
> I feel like the coupling is warranted because these things *are* PCIe
> and CXL ports, but it means solving the interrupt distribution problem.
> 

I understand the service driver interrupt issue, but it is not clear how it
applies to the CXL port error handling. Can you help me understand how the
interrupt issue affects CXL port AER UIE/CIE handling in the AER driver?


>>   3.) Refactor portdrv
>>   The portdrv refactoring solution is to change the portdrv service drivers
>>   into PCIe auxiliary drivers. With this change the facility drivers can be
>>   associated with a PCIe driver instead fixed bound to the portdrv driver.
>>
>>   In this case the CXL port functionality would be added either as a CXL
>>   auxiliary driver or as a CXL specific port driver
>>   (PCI_CLASS_BRIDGE_PCI_NORMAL).
>>
>>   This solution has challenges in the interrupt allocation by separate
>>   auxiliary drivers and in binding of a specific driver. Binding is
>>   currently based on PCIe class and would require extending the binding
>>   logic to support multiple drivers for the same class.
>>
>>   Jonathan Cameron is working towards this solution by initially solving
>>   for the PMU service driver.[1] It is using the auxiliary bus to associate
>>   what were service drivers with the portdrv driver. Using a CXL auxiliary
>>   for handling CXL port RAS errors would result in RAS logic called from
>>   the cxl_pci and CXL auxiliary drivers. This may need a library driver.
> 
> I don't think auxiliary bus is a fundamental step forward from pcie
> portdrv, it's just a s/pcie_port_bus_type/auxiliary_bus_type/ rename,
> but with all the same problems around how to distribute interrupt
> services to different interested parties.
> 
> So I think notifiers are interesting from the perspective of a software
> hack to enable interrupt distribution. However, given that dynamic MSI-X
> support is within reach I am interested in exploring that path and
> mandating that archs that want to handle CXL protocol errors natively
> need to enable dynamic MSI-X. Otherwise, those platforms should disclaim
> native protocol error handling support via CXL _OSC.
> 
> In other words, I expect native dynamic MSI-X support is more
> maintainable in the sense of keeping all the code in one notification
> domain.
> 
>>   4.) Using a portdrv notifier chain/callback for CIE/UIE
>>   (Implemented in this patchset)
>>
>>   This solution uses a portdrv atomic chain notifier and a cxl_pci
>>   callback to handle and log CXL port RAS errors.
> 
> Oh, I will need to look that the cxl_pci tie in for this, I was
> expecting cxl_pci only gets involved in the RCH case because the port
> and the endpoint are one in the same object. in the VH case I would only
> expect cxl_pci to get involved for its own observed protocol errors, not
> those reported upstream from that endpoint.
> 

The CXL port error handling needs a place to live, and there are few options at
the moment. Where do you want the CXL port error handlers to reside?

Regards,
Terry

>>   I chose this after trying solution#1 above. I see a couple advantages to
>>   this solution are:
>>   - Is general port implementation for CIE/UIE specific handling mentioned
>>   in the PCIe spec.[2]
>>   - Notifier is used in RAS MCE driver as an existing example.
>>   - Does not introduce further CXL dependencies into the AER driver.
>>   - The notifier chain provides registration/unregistration and
>>   synchronization.
>>
>>   A disadvantage of this approach is coupling still exists between the CXL
>>   port's driver (portdrv) and the cxl_pci driver. The CXL port device's RAS
>>   is handled by a notifier callback in the cxl_pci endpoint driver.
>>
>>   Most of the patches in this patchset could be reused to work with
>>   solution#3 or solution#2. The atomic notifier could be dropped and
>>   instead use an auxiliary device or AER driver awareness. The other
>>   changes in this patchset could possibly be reused.
> 
> I appreciate the discussion of tradeoffs, thanks Terry!

Dan Williams June 24, 2024, 8:51 p.m. UTC | #3

Terry Bowman wrote:
> Hi Dan,
> 
> I added responses below.
> 
> On 6/21/24 14:04, Dan Williams wrote:
> > Terry Bowman wrote:
> >> This patchset provides RAS logging for CXL root ports, CXL downstream
> >> switch ports, and CXL upstream switch ports. This includes changes to
> >> use a portdrv notifier chain to communicate CXL AER/RAS errors to a
> >> cxl_pci callback.
> >>
> >> The first 3 patches prepare for and add an atomic notifier chain to the
> >> portdrv driver. The portdrv's notifier chain reports the port device's
> >> AER internal errors to the registered callback(s). The preparation changes
> >> include a portdrv update to call the uncorrectable handler for PCIe root
> >> ports and PCIe downstream switch ports. Also, the AER correctable error
> >> (CE) status is made available to the AER CE handler.
> >>
> >> The next 4 patches are in preparation for adding an atomic notification
> >> callback in the cxl_pci driver. This is for receiving AER internal error
> >> events from the portdrv notifier chain. Preparation includes adding RAS
> >> register block mapping, adding trace functions for logging, and
> >> refactoring cxl_pci RAS functions for reuse.
> >>
> >> The final 2 patches enable the AER internal error interrupts.
> > [..] 
> >>
> >> Solutions Considered (1-4):
> >>   Below are solutions that were considered. Solution #4 is
> >>   implemented in this patchset. 
> > [..]
> >>  2.) Update the AER driver to call cxl_pci driver's error handler before
> >>  calling pci_aer_handle_error()
> >>
> >>  This is similar to the existing RCH port error approach in aer.c.
> >>  In this solution the AER driver searches for a downstream CXL endpoint
> >>  to 'handle' detected CXL port protocol errors.
> >>
> >>  This is a good solution to consider if the one presented in this patchset
> >>  is not acceptable. I was initially reluctant to this approach because it
> >>  adds more CXL coupling to the AER driver. But, I think this solution
> >>  would technically work. I believe Ming was working towards this
> >>  solution.
> > 
> > I feel like the coupling is warranted because these things *are* PCIe
> > and CXL ports, but it means solving the interrupt distribution problem.
> > 
> 
> I understand the service driver interrupt issue but it is not clear how it 
> applies to the CXL port error handling. Can you help me understand how the 
> interrupt issue affects CXL port AER UIE/CIE handling in the AER driver.

Just the case of the AER MSI/-X vector being multiplexed with other CXL
functionality on the same device. If the CXL interrupt vector is to be
enabled later then it means MSI/-X vector enabling needs to be dynamic.

...but yeah, not a problem now as we are only talking about PCIe AER
events and not multiplexing yet. I.e. that problem can be solved later.

> 
> 
> >>   3.) Refactor portdrv
> >>   The portdrv refactoring solution is to change the portdrv service drivers
> >>   into PCIe auxiliary drivers. With this change the facility drivers can be
> >>   associated with a PCIe driver instead fixed bound to the portdrv driver.
> >>
> >>   In this case the CXL port functionality would be added either as a CXL
> >>   auxiliary driver or as a CXL specific port driver
> >>   (PCI_CLASS_BRIDGE_PCI_NORMAL).
> >>
> >>   This solution has challenges in the interrupt allocation by separate
> >>   auxiliary drivers and in binding of a specific driver. Binding is
> >>   currently based on PCIe class and would require extending the binding
> >>   logic to support multiple drivers for the same class.
> >>
> >>   Jonathan Cameron is working towards this solution by initially solving
> >>   for the PMU service driver.[1] It is using the auxiliary bus to associate
> >>   what were service drivers with the portdrv driver. Using a CXL auxiliary
> >>   for handling CXL port RAS errors would result in RAS logic called from
> >>   the cxl_pci and CXL auxiliary drivers. This may need a library driver.
> > 
> > I don't think auxiliary bus is a fundamental step forward from pcie
> > portdrv, it's just a s/pcie_port_bus_type/auxiliary_bus_type/ rename,
> > but with all the same problems around how to distribute interrupt
> > services to different interested parties.
> > 
> > So I think notifiers are interesting from the perspective of a software
> > hack to enable interrupt distribution. However, given that dynamic MSI-X
> > support is within reach I am interested in exploring that path and
> > mandating that archs that want to handle CXL protocol errors natively
> > need to enable dynamic MSI-X. Otherwise, those platforms should disclaim
> > native protocol error handling support via CXL _OSC.
> > 
> > In other words, I expect native dynamic MSI-X support is more
> > maintainable in the sense of keeping all the code in one notification
> > domain.
> > 
> >>   4.) Using a portdrv notifier chain/callback for CIE/UIE
> >>   (Implemented in this patchset)
> >>
> >>   This solution uses a portdrv atomic chain notifier and a cxl_pci
> >>   callback to handle and log CXL port RAS errors.
> > 
> > Oh, I will need to look that the cxl_pci tie in for this, I was
> > expecting cxl_pci only gets involved in the RCH case because the port
> > and the endpoint are one in the same object. in the VH case I would only
> > expect cxl_pci to get involved for its own observed protocol errors, not
> > those reported upstream from that endpoint.
> > 
> 
> The CXL port error handling needs a place to live with few options at the moment.
> Where do you want the CXL port error handlers to reside? 

I need to go understand exactly why cxl_pci is involved in this current
proposal, but I was thinking it is probably more natural for cxl_port to
have error handlers.

Terry Bowman June 25, 2024, 2:29 p.m. UTC | #4

On 6/24/24 15:51, Dan Williams wrote:
> Terry Bowman wrote:
>> Hi Dan,
>>
>> I added responses below.
>>
>> On 6/21/24 14:04, Dan Williams wrote:
>>> Terry Bowman wrote:
>>>> This patchset provides RAS logging for CXL root ports, CXL downstream
>>>> switch ports, and CXL upstream switch ports. This includes changes to
>>>> use a portdrv notifier chain to communicate CXL AER/RAS errors to a
>>>> cxl_pci callback.
>>>>
>>>> The first 3 patches prepare for and add an atomic notifier chain to the
>>>> portdrv driver. The portdrv's notifier chain reports the port device's
>>>> AER internal errors to the registered callback(s). The preparation changes
>>>> include a portdrv update to call the uncorrectable handler for PCIe root
>>>> ports and PCIe downstream switch ports. Also, the AER correctable error
>>>> (CE) status is made available to the AER CE handler.
>>>>
>>>> The next 4 patches are in preparation for adding an atomic notification
>>>> callback in the cxl_pci driver. This is for receiving AER internal error
>>>> events from the portdrv notifier chain. Preparation includes adding RAS
>>>> register block mapping, adding trace functions for logging, and
>>>> refactoring cxl_pci RAS functions for reuse.
>>>>
>>>> The final 2 patches enable the AER internal error interrupts.
>>> [..] 
>>>>
>>>> Solutions Considered (1-4):
>>>>   Below are solutions that were considered. Solution #4 is
>>>>   implemented in this patchset. 
>>> [..]
>>>>  2.) Update the AER driver to call cxl_pci driver's error handler before
>>>>  calling pci_aer_handle_error()
>>>>
>>>>  This is similar to the existing RCH port error approach in aer.c.
>>>>  In this solution the AER driver searches for a downstream CXL endpoint
>>>>  to 'handle' detected CXL port protocol errors.
>>>>
>>>>  This is a good solution to consider if the one presented in this patchset
>>>>  is not acceptable. I was initially reluctant to this approach because it
>>>>  adds more CXL coupling to the AER driver. But, I think this solution
>>>>  would technically work. I believe Ming was working towards this
>>>>  solution.
>>>
>>> I feel like the coupling is warranted because these things *are* PCIe
>>> and CXL ports, but it means solving the interrupt distribution problem.
>>>
>>
>> I understand the service driver interrupt issue but it is not clear how it 
>> applies to the CXL port error handling. Can you help me understand how the 
>> interrupt issue affects CXL port AER UIE/CIE handling in the AER driver.
> 
> Just the case of the AER MSI/-X vector being multiplexed with other CXL
> functionality on the same device. If the CXL interrupt vector is to be
> enabled later then it means MSI/-X vector enabling needs to be dynamic.
> 
> ...but yeah, not a problem now as we are only talking about PCIe AER
> events and not multiplexing yet. I.e. that problem can be solved later.
> 
>>
>>
>>>>   3.) Refactor portdrv
>>>>   The portdrv refactoring solution is to change the portdrv service drivers
>>>>   into PCIe auxiliary drivers. With this change the facility drivers can be
>>>>   associated with a PCIe driver instead fixed bound to the portdrv driver.
>>>>
>>>>   In this case the CXL port functionality would be added either as a CXL
>>>>   auxiliary driver or as a CXL specific port driver
>>>>   (PCI_CLASS_BRIDGE_PCI_NORMAL).
>>>>
>>>>   This solution has challenges in the interrupt allocation by separate
>>>>   auxiliary drivers and in binding of a specific driver. Binding is
>>>>   currently based on PCIe class and would require extending the binding
>>>>   logic to support multiple drivers for the same class.
>>>>
>>>>   Jonathan Cameron is working towards this solution by initially solving
>>>>   for the PMU service driver.[1] It is using the auxiliary bus to associate
>>>>   what were service drivers with the portdrv driver. Using a CXL auxiliary
>>>>   for handling CXL port RAS errors would result in RAS logic called from
>>>>   the cxl_pci and CXL auxiliary drivers. This may need a library driver.
>>>
>>> I don't think auxiliary bus is a fundamental step forward from pcie
>>> portdrv, it's just a s/pcie_port_bus_type/auxiliary_bus_type/ rename,
>>> but with all the same problems around how to distribute interrupt
>>> services to different interested parties.
>>>
>>> So I think notifiers are interesting from the perspective of a software
>>> hack to enable interrupt distribution. However, given that dynamic MSI-X
>>> support is within reach I am interested in exploring that path and
>>> mandating that archs that want to handle CXL protocol errors natively
>>> need to enable dynamic MSI-X. Otherwise, those platforms should disclaim
>>> native protocol error handling support via CXL _OSC.
>>>
>>> In other words, I expect native dynamic MSI-X support is more
>>> maintainable in the sense of keeping all the code in one notification
>>> domain.
>>>
>>>>   4.) Using a portdrv notifier chain/callback for CIE/UIE
>>>>   (Implemented in this patchset)
>>>>
>>>>   This solution uses a portdrv atomic chain notifier and a cxl_pci
>>>>   callback to handle and log CXL port RAS errors.
>>>
>>> Oh, I will need to look that the cxl_pci tie in for this, I was
>>> expecting cxl_pci only gets involved in the RCH case because the port
>>> and the endpoint are one in the same object. in the VH case I would only
>>> expect cxl_pci to get involved for its own observed protocol errors, not
>>> those reported upstream from that endpoint.
>>>
>>
>> The CXL port error handling needs a place to live with few options at the moment.
>> Where do you want the CXL port error handlers to reside? 
> 
> I need to go understand exactly why cxl_pci is involved in this current
> proposal, but I was thinking it is probably more natural for cxl_port to
> have error handlers.

Ok. I agree, cxl_port is a better location for the handlers.

Regards,
Terry