[v3,4/4] cxl: Add post reset warning if reset is detected as Secondary Bus Reset (SBR)

Message ID 20240402234848.3287160-5-dave.jiang@intel.com (mailing list archive)
State Superseded
Delegated to: Bjorn Helgaas
Series PCI: Add Secondary Bus Reset (SBR) support for CXL

Commit Message

Dave Jiang April 2, 2024, 11:45 p.m. UTC
SBR is equivalent to a device being hot removed and inserted again. Doing an
SBR on a CXL type 3 device is problematic if the exported device memory is
part of system memory that cannot be offlined. The event is equivalent to
violently ripping out that range of memory from the kernel. While the
hardware requires the "Unmask SBR" bit set in the Port Control Extensions
register and the kernel currently does not unmask it, a user can unmask
this bit via setpci or a similar tool.

The driver does not have a way to detect whether a reset coming from the
PCI subsystem is a Function Level Reset (FLR) or an SBR. The only way to
detect an SBR is to note whether a decoder is marked as enabled in
software but the decoder control register indicates it is not committed.

A helper function is added to find a discrepancy between the decoder
software state and the hardware register state.

Suggested-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
---
v3:
- Rename decocer_hw_mismatch() to __cxl_endpoint_decoder_reset_detected(). (Dan)
- Move register accessing function to core/pci.c. (Dan)
- Add kernel taint to decoder reset. (Dan)
---
 drivers/cxl/core/pci.c | 31 +++++++++++++++++++++++++++++++
 drivers/cxl/cxl.h      |  2 ++
 drivers/cxl/pci.c      | 20 ++++++++++++++++++++
 3 files changed, 53 insertions(+)

Comments

Jonathan Cameron April 3, 2024, 3:32 p.m. UTC | #1
On Tue, 2 Apr 2024 16:45:32 -0700
Dave Jiang <dave.jiang@intel.com> wrote:

> SBR is equivalent to a device being hot removed and inserted again. Doing an
> SBR on a CXL type 3 device is problematic if the exported device memory is
> part of system memory that cannot be offlined. The event is equivalent to
> violently ripping out that range of memory from the kernel. While the
> hardware requires the "Unmask SBR" bit set in the Port Control Extensions
> register and the kernel currently does not unmask it, a user can unmask
> this bit via setpci or a similar tool.
> 
> The driver does not have a way to detect whether a reset coming from the
> PCI subsystem is a Function Level Reset (FLR) or an SBR. The only way to
> detect an SBR is to note whether a decoder is marked as enabled in
> software but the decoder control register indicates it is not committed.
> 
> A helper function is added to find a discrepancy between the decoder
> software state and the hardware register state.
> 
> Suggested-by: Dan Williams <dan.j.williams@intel.com>
> Signed-off-by: Dave Jiang <dave.jiang@intel.com>

As I said way back on v1, this smells hacky.

Why not pass the info on what reset was done down from the PCI core?
I see Bjorn commented it would be *possible* to do it in the PCI core
but raised other concerns that needed addressing first (I think you've
dealt with those now).  Doesn't look that hard to me (I've not coded it
up yet though).

The core code knows how far it got down the list reset_methods before
it succeeded in resetting.  So...

Modify __pci_reset_function_locked() to return the index of the reset
method that succeeded. Then pass that to pci_dev_restore().
Finally push it into a reset_done2() that takes that as an extra
parameter and the driver can see if it is FLR or SBR.
The extended reset_done is to avoid modifying lots of drivers.
However a quick grep suggests it's not that heavily used (15ish?)
so maybe just add the parameter.
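
A minimal user-space sketch of the flow proposed above, assuming a fixed
ordered list of reset methods. Every name here (model_dev,
model_reset_function(), model_reset_and_notify(), the callback type) is
hypothetical and stands in for the kernel code; none of it is the PCI
core's actual API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical ordered list of reset methods, tried first to last. */
enum model_reset_method {
	MODEL_RESET_FLR,
	MODEL_RESET_SBR,
	MODEL_RESET_COUNT
};

/* Hypothetical device model; not struct pci_dev. */
struct model_dev {
	bool supports[MODEL_RESET_COUNT]; /* which methods can succeed */
	int last_reset;                   /* filled in by the callback */
};

/* Extended "reset done" callback that also receives the method index. */
typedef void (*model_reset_done_fn)(struct model_dev *dev, int method);

/* Walk the list and return the index of the first method that works. */
static int model_reset_function(struct model_dev *dev)
{
	assert(dev != NULL);
	for (int i = 0; i < MODEL_RESET_COUNT; i++)
		if (dev->supports[i])
			return i;
	return -1; /* no method succeeded */
}

/* Example callback: remember which method was used. */
static void model_record_reset(struct model_dev *dev, int method)
{
	dev->last_reset = method;
}

/* Reset, then hand the successful method index to the driver callback. */
static int model_reset_and_notify(struct model_dev *dev,
				  model_reset_done_fn done)
{
	int method = model_reset_function(dev);

	if (method >= 0 && done)
		done(dev, method);
	return method;
}
```

With a device that only supports the second method, the callback would
see MODEL_RESET_SBR and could tell it apart from an FLR.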

There are a few other paths, but none look that problematic at
first glance...

So Bjorn, now the rest of this is hopefully close to what you'll be
happy with, which way do you prefer?



> ---
> v3:
> - Rename decocer_hw_mismatch() to __cxl_endpoint_decoder_reset_detected(). (Dan)
> - Move register accessing function to core/pci.c. (Dan)
> - Add kernel taint to decoder reset. (Dan)
> ---
>  drivers/cxl/core/pci.c | 31 +++++++++++++++++++++++++++++++
>  drivers/cxl/cxl.h      |  2 ++
>  drivers/cxl/pci.c      | 20 ++++++++++++++++++++
>  3 files changed, 53 insertions(+)
> 
> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
> index c496a9710d62..597221f7f19b 100644
> --- a/drivers/cxl/core/pci.c
> +++ b/drivers/cxl/core/pci.c
> @@ -1045,3 +1045,34 @@ long cxl_pci_get_latency(struct pci_dev *pdev)
>  
>  	return cxl_flit_size(pdev) * MEGA / bw;
>  }
> +
> +static int __cxl_endpoint_decoder_reset_detected(struct device *dev, void *data)
> +{
> +	struct cxl_endpoint_decoder *cxled;
> +	struct cxl_port *port = data;
> +	struct cxl_decoder *cxld;
> +	struct cxl_hdm *cxlhdm;
> +	void __iomem *hdm;
> +	u32 ctrl;
> +
> +	if (!is_endpoint_decoder(dev))
> +		return 0;
> +
> +	cxled = to_cxl_endpoint_decoder(dev);
> +	if ((cxled->cxld.flags & CXL_DECODER_F_ENABLE) == 0)
> +		return 0;
> +
> +	cxld = &cxled->cxld;
> +	cxlhdm = dev_get_drvdata(&port->dev);
> +	hdm = cxlhdm->regs.hdm_decoder;
> +	ctrl = readl(hdm + CXL_HDM_DECODER0_CTRL_OFFSET(cxld->id));
> +
> +	return !FIELD_GET(CXL_HDM_DECODER0_CTRL_COMMITTED, ctrl);
> +}
> +
> +bool cxl_endpoint_decoder_reset_detected(struct cxl_port *port)
> +{
> +	return device_for_each_child(&port->dev, port,
> +				     __cxl_endpoint_decoder_reset_detected);
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_endpoint_decoder_reset_detected, CXL);
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index 534e25e2f0a4..e3c237c50b59 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -895,6 +895,8 @@ void cxl_coordinates_combine(struct access_coordinate *out,
>  			     struct access_coordinate *c1,
>  			     struct access_coordinate *c2);
>  
> +bool cxl_endpoint_decoder_reset_detected(struct cxl_port *port);
> +
>  /*
>   * Unit test builds overrides this to __weak, find the 'strong' version
>   * of these symbols in tools/testing/cxl/.
> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> index 110478573296..5dc1f28a031d 100644
> --- a/drivers/cxl/pci.c
> +++ b/drivers/cxl/pci.c
> @@ -957,11 +957,31 @@ static void cxl_error_resume(struct pci_dev *pdev)
>  		 dev->driver ? "successful" : "failed");
>  }
>  
> +static void cxl_reset_done(struct pci_dev *pdev)
> +{
> +	struct cxl_dev_state *cxlds = pci_get_drvdata(pdev);
> +	struct cxl_memdev *cxlmd = cxlds->cxlmd;
> +	struct device *dev = &pdev->dev;
> +
> +	/*
> +	 * FLR does not expect to touch the HDM decoders and related registers.
> +	 * SBR however will wipe all device configurations.
> +	 * Issue warning if there was active decoder before reset that no
> +	 * longer exists.
> +	 */
> +	if (cxl_endpoint_decoder_reset_detected(cxlmd->endpoint)) {
> +		dev_warn(dev, "SBR happened without memory regions removal.\n");
> +		dev_warn(dev, "System may be unstable if regions hosted system memory.\n");
> +		add_taint(TAINT_USER, LOCKDEP_NOW_UNRELIABLE);
> +	}
> +}
> +
>  static const struct pci_error_handlers cxl_error_handlers = {
>  	.error_detected	= cxl_error_detected,
>  	.slot_reset	= cxl_slot_reset,
>  	.resume		= cxl_error_resume,
>  	.cor_error_detected	= cxl_cor_error_detected,
> +	.reset_done	= cxl_reset_done,
>  };
>  
>  static struct pci_driver cxl_pci_driver = {
Dan Williams April 3, 2024, 4:27 p.m. UTC | #2
Jonathan Cameron wrote:
> On Tue, 2 Apr 2024 16:45:32 -0700
> Dave Jiang <dave.jiang@intel.com> wrote:
> 
> > SBR is equivalent to a device being hot removed and inserted again. Doing an
> > SBR on a CXL type 3 device is problematic if the exported device memory is
> > part of system memory that cannot be offlined. The event is equivalent to
> > violently ripping out that range of memory from the kernel. While the
> > hardware requires the "Unmask SBR" bit set in the Port Control Extensions
> > register and the kernel currently does not unmask it, a user can unmask
> > this bit via setpci or a similar tool.
> > 
> > The driver does not have a way to detect whether a reset coming from the
> > PCI subsystem is a Function Level Reset (FLR) or an SBR. The only way to
> > detect an SBR is to note whether a decoder is marked as enabled in
> > software but the decoder control register indicates it is not committed.
> > 
> > A helper function is added to find a discrepancy between the decoder
> > software state and the hardware register state.
> > 
> > Suggested-by: Dan Williams <dan.j.williams@intel.com>
> > Signed-off-by: Dave Jiang <dave.jiang@intel.com>
> 
> As I said way back on v1, this smells hacky.
> 
> Why not pass the info on what reset was done down from the PCI core?
> I see Bjorn commented it would be *possible* to do it in the PCI core
> but raised other concerns that needed addressing first (I think you've
> dealt with those now).  Doesn't look that hard to me (I've not coded it
> up yet though).
> 
> The core code knows how far it got down the list reset_methods before
> it succeeded in resetting.  So...
> 
> Modify __pci_reset_function_locked() to return the index of the reset
> method that succeeded. Then pass that to pci_dev_restore().
> Finally push it into a reset_done2() that takes that as an extra
> parameter and the driver can see if it is FLR or SBR.
> The extended reset_done is to avoid modifying lots of drivers.
> However a quick grep suggests it's not that heavily used (15ish?)
> so maybe just add the parameter.
> 
> There are a few other paths, but none look that problematic at
> first glance...
> 
> So Bjorn, now the rest of this is hopefully close to what you'll be
> happy with, which way do you prefer?

I will defer to Bjorn, but I am not a fan of this reset_done2() proposal.
"Revalidate after reset" is a common driver pattern and all that
plumbing the effective-reset-type does is make cxl_reset_done() more
precise for no discernible value.
Lukas Wunner April 4, 2024, 8:51 a.m. UTC | #3
On Wed, Apr 03, 2024 at 04:32:57PM +0100, Jonathan Cameron wrote:
> On Tue, 2 Apr 2024 16:45:32 -0700 Dave Jiang <dave.jiang@intel.com> wrote:
> Why not pass the info on what reset was done down from the PCI core?
> I see Bjorn commented it would be *possible* to do it in the PCI core
> but raised other concerns that needed addressing first (I think you've
> dealt with those now).  Doesn't look that hard to me (I've not coded it
> up yet though).
> 
> The core code knows how far it got down the list reset_methods before
> it succeeded in resetting.  So...
> 
> Modify __pci_reset_function_locked() to return the index of the reset
> method that succeeded. Then pass that to pci_dev_restore().
> Finally push it into a reset_done2() that takes that as an extra
> parameter and the driver can see if it is FLR or SBR.

The reset types to distinguish per PCIe r6.2 sec 6.6 are
Conventional Reset and Function Level Reset.

Secondary Bus Reset is a Conventional Reset.

The spec subdivides Conventional Reset into Cold, Warm and Hot,
but that distinction is probably irrelevant for the kernel.

I think a more generalized (and therefore better) approach would be
to store the reset type the device has undergone in struct pci_dev,
right next to error_state, so that not just the ->reset_done()
callback benefits from the information.  The reset type applied has
consequences beyond the individual driver:  E.g. an FLR does not
affect CMA-SPDM session state, but a Conventional Reset does.
So there may be consumers of that information in the PCI core as well.

It's worth noting that we already have an enum pcie_reset_state in
<linux/pci.h> which distinguishes between deassert, warm and hot reset.
It is currently only used by PowerPC EEH to convey to the platform
which type of reset it should apply.  It might be possible to extend
the enum so that it can be used to store the reset type that *was*
applied to a device in struct pci_dev.
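
A rough user-space sketch of that idea, assuming a per-device field that
records the kind of reset last applied. The struct, the enum and both
helpers are hypothetical illustrations; they are not the existing
enum pcie_reset_state or struct pci_dev.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical reset classification: only the two kinds that matter. */
enum model_reset_kind {
	MODEL_RESET_NONE,
	MODEL_RESET_FLR,
	MODEL_RESET_CONVENTIONAL /* covers SBR and Cold/Warm/Hot reset */
};

/* Hypothetical device model with the reset kind next to error_state. */
struct model_pci_dev {
	int error_state;
	enum model_reset_kind last_reset;
};

/* Record which kind of reset the device has just undergone. */
static void model_note_reset(struct model_pci_dev *dev,
			     enum model_reset_kind kind)
{
	assert(dev != NULL);
	dev->last_reset = kind;
}

/*
 * Example consumer outside the driver: session state survives an FLR
 * but not a Conventional Reset.
 */
static bool model_session_lost(const struct model_pci_dev *dev)
{
	assert(dev != NULL);
	return dev->last_reset == MODEL_RESET_CONVENTIONAL;
}
```

Any code holding the device could then query the recorded kind, not
just the ->reset_done() callback.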

That all being said, checking for the *symptoms* of a Conventional Reset,
as Dave has done here, may actually be more robust than just relying on
what type of reset was applied.  E.g. after an FLR was handled, the device
may experience a DPC-induced Hot Reset.  By checking for the *symptoms*,
the driver may be able to catch that the device has undergone a
Conventional Reset immediately after an FLR.  Also, who knows if all
devices are well-behaved and retain their state during an FLR, as they
should per the spec?  Maybe there are broken devices which do not respect
that rule.  Checking for symptoms of a Conventional Reset would catch
those devices as well.
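
The symptom check can be modelled in user space roughly as follows. This
is only a sketch: the struct and both bit values are made up for the
model, and the real patch reads the control register with readl() and
tests CXL_HDM_DECODER0_CTRL_COMMITTED.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit values chosen for this model only. */
#define MODEL_DECODER_F_ENABLE (1u << 0) /* software: decoder in use */
#define MODEL_CTRL_COMMITTED   (1u << 1) /* hardware: decoder committed */

/* Hypothetical pairing of software state with the register snapshot. */
struct model_decoder {
	uint32_t sw_flags; /* what the driver believes */
	uint32_t hw_ctrl;  /* what the control register reports */
};

/* True if software thinks the decoder is enabled but hardware lost it. */
static bool model_decoder_lost(const struct model_decoder *d)
{
	assert(d != NULL);
	if (!(d->sw_flags & MODEL_DECODER_F_ENABLE))
		return false; /* never enabled, nothing to lose */
	return !(d->hw_ctrl & MODEL_CTRL_COMMITTED);
}

/* True if any decoder shows the mismatch, whatever caused the reset. */
static bool model_reset_detected(const struct model_decoder *decs, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (model_decoder_lost(&decs[i]))
			return true;
	return false;
}
```

The check is independent of which reset method ran, so a later Hot
Reset or a misbehaving device would be flagged in the same way.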

Thanks,

Lukas
Jonathan Cameron April 4, 2024, 1:13 p.m. UTC | #4
On Thu, 4 Apr 2024 10:51:36 +0200
Lukas Wunner <lukas@wunner.de> wrote:

> On Wed, Apr 03, 2024 at 04:32:57PM +0100, Jonathan Cameron wrote:
> > On Tue, 2 Apr 2024 16:45:32 -0700 Dave Jiang <dave.jiang@intel.com> wrote:
> > Why not pass the info on what reset was done down from the PCI core?
> > I see Bjorn commented it would be *possible* to do it in the PCI core
> > but raised other concerns that needed addressing first (I think you've
> > dealt with those now).  Doesn't look that hard to me (I've not coded it
> > up yet though).
> > 
> > The core code knows how far it got down the list reset_methods before
> > it succeeded in resetting.  So...
> > 
> > Modify __pci_reset_function_locked() to return the index of the reset
> > method that succeeded. Then pass that to pci_dev_restore().
> > Finally push it into a reset_done2() that takes that as an extra
> > parameter and the driver can see if it is FLR or SBR.  
> 
> The reset types to distinguish per PCIe r6.2 sec 6.6 are
> Conventional Reset and Function Level Reset.
> 
> Secondary Bus Reset is a Conventional Reset.
> 
> The spec subdivides Conventional Reset into Cold, Warm and Hot,
> but that distinction is probably irrelevant for the kernel.

Agreed. SBR is only called out explicitly here because it's the one
with a handy triggering mechanism.

> 
> I think a more generalized (and therefore better) approach would be
> to store the reset type the device has undergone in struct pci_dev,
> right next to error_state, so that not just the ->reset_done()
> callback benefits from the information.  The reset type applied has
> consequences beyond the individual driver:  E.g. an FLR does not
> affect CMA-SPDM session state, but a Conventional Reset does.
> So there may be consumers of that information in the PCI core as well.

That makes sense if we do go the route of enhancing the information
provided for a reset.

> 
> It's worth noting that we already have an enum pcie_reset_state in
> <linux/pci.h> which distinguishes between deassert, warm and hot reset.
> It is currently only used by PowerPC EEH to convey to the platform
> which type of reset it should apply.  It might be possible to extend
> the enum so that it can be used to store the reset type that *was*
> applied to a device in struct pci_dev.
> 
> That all being said, checking for the *symptoms* of a Conventional Reset,
> as Dave has done here, may actually be more robust than just relying on
> what type of reset was applied.  E.g. after an FLR was handled, the device
> may experience a DPC-induced Hot Reset.  

This sounds like a plausible reason for doing it by symptom checking.

> By checking for the *symptoms*,
> the driver may be able to catch that the device has undergone a
> Conventional Reset immediately after an FLR.  Also, who knows if all
> devices are well-behaved and retain their state during an FLR, as they
> should per the spec?  Maybe there are broken devices which do not respect
> that rule.  Checking for symptoms of a Conventional Reset would catch
> those devices as well.

I'm not particularly keen on complexity additions to the kernel for
possible broken devices. For CXL devices the rules are very clear
and the HDM decoder must not be reset.  If not, chances are the host OS will
take out BIOS setup memory and that isn't healthy.

Perhaps the key point here is that the patch title is misleading / simplistic.
The patch only warns if a reset happened that caused a configuration mismatch
for the address decoders.  SBR at other times is fine.

So even if we had a reset_type available, the driver would still need
to see if it mattered.

So I've ended up arguing myself into the view that all this code is needed anyway.
Perhaps change the patch title to

cxl: Add post reset warning if reset results in loss of previously committed HDM decoders.

Or something along those lines.

Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>

Jonathan

> 
> Thanks,
> 
> Lukas
Jonathan Cameron April 4, 2024, 1:16 p.m. UTC | #5
On Wed, 3 Apr 2024 09:27:28 -0700
Dan Williams <dan.j.williams@intel.com> wrote:

> Jonathan Cameron wrote:
> > On Tue, 2 Apr 2024 16:45:32 -0700
> > Dave Jiang <dave.jiang@intel.com> wrote:
> >   
> > > SBR is equivalent to a device being hot removed and inserted again. Doing an
> > > SBR on a CXL type 3 device is problematic if the exported device memory is
> > > part of system memory that cannot be offlined. The event is equivalent to
> > > violently ripping out that range of memory from the kernel. While the
> > > hardware requires the "Unmask SBR" bit set in the Port Control Extensions
> > > register and the kernel currently does not unmask it, a user can unmask
> > > this bit via setpci or a similar tool.
> > > 
> > > The driver does not have a way to detect whether a reset coming from the
> > > PCI subsystem is a Function Level Reset (FLR) or an SBR. The only way to
> > > detect an SBR is to note whether a decoder is marked as enabled in
> > > software but the decoder control register indicates it is not committed.
> > > 
> > > A helper function is added to find a discrepancy between the decoder
> > > software state and the hardware register state.
> > > 
> > > Suggested-by: Dan Williams <dan.j.williams@intel.com>
> > > Signed-off-by: Dave Jiang <dave.jiang@intel.com>  
> > 
> > As I said way back on v1, this smells hacky.
> > 
> > Why not pass the info on what reset was done down from the PCI core?
> > I see Bjorn commented it would be *possible* to do it in the PCI core
> > but raised other concerns that needed addressing first (I think you've
> > dealt with those now).  Doesn't look that hard to me (I've not coded it
> > up yet though).
> > 
> > The core code knows how far it got down the list reset_methods before
> > it succeeded in resetting.  So...
> > 
> > Modify __pci_reset_function_locked() to return the index of the reset
> > method that succeeded. Then pass that to pci_dev_restore().
> > Finally push it into a reset_done2() that takes that as an extra
> > parameter and the driver can see if it is FLR or SBR.
> > The extended reset_done is to avoid modifying lots of drivers.
> > However a quick grep suggests it's not that heavily used (15ish?)
> > so maybe just add the parameter.
> > 
> > There are a few other paths, but none look that problematic at
> > first glance...
> > 
> > So Bjorn, now the rest of this is hopefully close to what you'll be
> > happy with, which way do you prefer?  
> 
> I will defer to Bjorn, but I am not a fan of this reset_done2() proposal.
> "Revalidate after reset" is a common driver pattern and all that
> plumbing the effective-reset-type does is make cxl_reset_done() more
> precise for no discernible value.

As per the other thread branch, I think you are right, but the key is that this
is not detecting the SBR at all; it's detecting HDM decoders not being in the
expected state. If they weren't set up before the SBR, then we don't warn.  So SBR
is the cause, but not what is being detected (which is a subset of SBR results).
  
>
Patch

diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
index c496a9710d62..597221f7f19b 100644
--- a/drivers/cxl/core/pci.c
+++ b/drivers/cxl/core/pci.c
@@ -1045,3 +1045,34 @@  long cxl_pci_get_latency(struct pci_dev *pdev)
 
 	return cxl_flit_size(pdev) * MEGA / bw;
 }
+
+static int __cxl_endpoint_decoder_reset_detected(struct device *dev, void *data)
+{
+	struct cxl_endpoint_decoder *cxled;
+	struct cxl_port *port = data;
+	struct cxl_decoder *cxld;
+	struct cxl_hdm *cxlhdm;
+	void __iomem *hdm;
+	u32 ctrl;
+
+	if (!is_endpoint_decoder(dev))
+		return 0;
+
+	cxled = to_cxl_endpoint_decoder(dev);
+	if ((cxled->cxld.flags & CXL_DECODER_F_ENABLE) == 0)
+		return 0;
+
+	cxld = &cxled->cxld;
+	cxlhdm = dev_get_drvdata(&port->dev);
+	hdm = cxlhdm->regs.hdm_decoder;
+	ctrl = readl(hdm + CXL_HDM_DECODER0_CTRL_OFFSET(cxld->id));
+
+	return !FIELD_GET(CXL_HDM_DECODER0_CTRL_COMMITTED, ctrl);
+}
+
+bool cxl_endpoint_decoder_reset_detected(struct cxl_port *port)
+{
+	return device_for_each_child(&port->dev, port,
+				     __cxl_endpoint_decoder_reset_detected);
+}
+EXPORT_SYMBOL_NS_GPL(cxl_endpoint_decoder_reset_detected, CXL);
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index 534e25e2f0a4..e3c237c50b59 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -895,6 +895,8 @@  void cxl_coordinates_combine(struct access_coordinate *out,
 			     struct access_coordinate *c1,
 			     struct access_coordinate *c2);
 
+bool cxl_endpoint_decoder_reset_detected(struct cxl_port *port);
+
 /*
  * Unit test builds overrides this to __weak, find the 'strong' version
  * of these symbols in tools/testing/cxl/.
diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
index 110478573296..5dc1f28a031d 100644
--- a/drivers/cxl/pci.c
+++ b/drivers/cxl/pci.c
@@ -957,11 +957,31 @@  static void cxl_error_resume(struct pci_dev *pdev)
 		 dev->driver ? "successful" : "failed");
 }
 
+static void cxl_reset_done(struct pci_dev *pdev)
+{
+	struct cxl_dev_state *cxlds = pci_get_drvdata(pdev);
+	struct cxl_memdev *cxlmd = cxlds->cxlmd;
+	struct device *dev = &pdev->dev;
+
+	/*
+	 * FLR does not expect to touch the HDM decoders and related registers.
+	 * SBR however will wipe all device configurations.
+	 * Issue warning if there was active decoder before reset that no
+	 * longer exists.
+	 */
+	if (cxl_endpoint_decoder_reset_detected(cxlmd->endpoint)) {
+		dev_warn(dev, "SBR happened without memory regions removal.\n");
+		dev_warn(dev, "System may be unstable if regions hosted system memory.\n");
+		add_taint(TAINT_USER, LOCKDEP_NOW_UNRELIABLE);
+	}
+}
+
 static const struct pci_error_handlers cxl_error_handlers = {
 	.error_detected	= cxl_error_detected,
 	.slot_reset	= cxl_slot_reset,
 	.resume		= cxl_error_resume,
 	.cor_error_detected	= cxl_cor_error_detected,
+	.reset_done	= cxl_reset_done,
 };
 
 static struct pci_driver cxl_pci_driver = {