diff mbox series

cxl: Add post reset warning if the reset is detected as Secondary Bus Reset (SBR)

Message ID 20240215232307.2793530-1-dave.jiang@intel.com
State New, archived
Headers show
Series cxl: Add post reset warning if the reset is detected as Secondary Bus Reset (SBR) | expand

Commit Message

Dave Jiang Feb. 15, 2024, 11:23 p.m. UTC
SBR is equivalent to a device been hot removed and inserted again. Doing a
SBR on a CXL type 3 device is problematic if the exported device memory is
part of system memory that cannot be offlined. The event is equivalent to
violently ripping out that range of memory from the kernel. While the
hardware requires the "Unmask SBR" bit set in the Port Control Extensions
register and the kernel currently does not unmask it, user can unmask
this bit via setpci or similar tool.

The driver does not have a way to detect whether a reset coming from the
PCI subsystem is a Function Level Reset (FLR) or SBR. The only way to
detect is to note if there are active decoders before the reset and check
if the range register memory active bit remains set after reset.

A helper function to check is added to detect if the range register memory
active bit is set. A locked helper for cxl_num_decoders_committed() is also
added to allow pci code to call the cxl_num_decoders_committed() while
holding the cxl_region_rwsem.

Add a err_handler->reset_prepare() to detect whether there are active
decoders.  Add a err_handler->reset_done() to check if there was active
memory before the reset and it is no longer active after the reset. A
warning is emitted in the case of active memory has been offlined.

Suggested-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
---
 drivers/cxl/core/pci.c  | 14 ++++++++++++++
 drivers/cxl/core/port.c | 11 +++++++++++
 drivers/cxl/cxl.h       |  2 ++
 drivers/cxl/cxlmem.h    |  1 +
 drivers/cxl/pci.c       | 31 +++++++++++++++++++++++++++++++
 5 files changed, 59 insertions(+)

Comments

Ira Weiny Feb. 18, 2024, 7:36 p.m. UTC | #1
Dave Jiang wrote:
> SBR is equivalent to a device been hot removed and inserted again. Doing a
> SBR on a CXL type 3 device is problematic if the exported device memory is
> part of system memory that cannot be offlined. The event is equivalent to
> violently ripping out that range of memory from the kernel. While the
> hardware requires the "Unmask SBR" bit set in the Port Control Extensions
> register and the kernel currently does not unmask it, user can unmask
> this bit via setpci or similar tool.
> 
> The driver does not have a way to detect whether a reset coming from the
> PCI subsystem is a Function Level Reset (FLR) or SBR. The only way to
> detect is to note if there are active decoders before the reset and check
> if the range register memory active bit remains set after reset.
> 
> A helper function to check is added to detect if the range register memory
> active bit is set. A locked helper for cxl_num_decoders_committed() is also
> added to allow pci code to call the cxl_num_decoders_committed() while
> holding the cxl_region_rwsem.
> 
> Add a err_handler->reset_prepare() to detect whether there are active
> decoders.  Add a err_handler->reset_done() to check if there was active
> memory before the reset and it is no longer active after the reset. A
> warning is emitted in the case of active memory has been offlined.
> 
> Suggested-by: Dan Williams <dan.j.williams@intel.com>
> Signed-off-by: Dave Jiang <dave.jiang@intel.com>

Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Jonathan Cameron Feb. 19, 2024, 2:20 p.m. UTC | #2
On Thu, 15 Feb 2024 16:23:07 -0700
Dave Jiang <dave.jiang@intel.com> wrote:

> SBR is equivalent to a device been hot removed and inserted again. Doing a
> SBR on a CXL type 3 device is problematic if the exported device memory is
> part of system memory that cannot be offlined. The event is equivalent to
> violently ripping out that range of memory from the kernel. While the
> hardware requires the "Unmask SBR" bit set in the Port Control Extensions
> register and the kernel currently does not unmask it, user can unmask
> this bit via setpci or similar tool.
> 
> The driver does not have a way to detect whether a reset coming from the
> PCI subsystem is a Function Level Reset (FLR) or SBR. The only way to
> detect is to note if there are active decoders before the reset and check
> if the range register memory active bit remains set after reset.
> 
> A helper function to check is added to detect if the range register memory
> active bit is set. A locked helper for cxl_num_decoders_committed() is also
> added to allow pci code to call the cxl_num_decoders_committed() while
> holding the cxl_region_rwsem.
> 
> Add a err_handler->reset_prepare() to detect whether there are active
> decoders.  Add a err_handler->reset_done() to check if there was active
> memory before the reset and it is no longer active after the reset. A
> warning is emitted in the case of active memory has been offlined.
> 
> Suggested-by: Dan Williams <dan.j.williams@intel.com>
> Signed-off-by: Dave Jiang <dave.jiang@intel.com>

This feels like we are papering over a hole in the PCI core.
Is there no way of detecting Secondary Bus Reset (SBR) and
communicate that down to the device?
+CC Bjorn. 
Most of the logic would be needed in driver anyway though as
we don't want to bother warning on SBR if there was no memory mapped.

Bjorn, would you prefer this FLR vs SBR being detected by state
change in driver, or a modification to the PCI core so that it
provides this info to the drivers?  I assume this pretty unique
to CXL as normally there isn't a magic control to ignore triggering
a reset.

One trivial comment inline.

Jonathan

> diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
> index e59d9d37aa65..81d9f57d2e84 100644
> --- a/drivers/cxl/core/port.c
> +++ b/drivers/cxl/core/port.c
> @@ -45,6 +45,17 @@ int cxl_num_decoders_committed(struct cxl_port *port)
>  	return port->commit_end + 1;
>  }
>  
> +int cxl_num_decoders_committed_locked(struct cxl_port *port)
> +{
> +	int decoders;
> +
> +	guard(rwsem_read)(&cxl_region_rwsem);
> +	decoders = cxl_num_decoders_committed(port);

return cxl_num_decoder_commited(port);

> +
> +	return decoders;
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_num_decoders_committed_locked, CXL);
> +
>  static ssize_t devtype_show(struct device *dev, struct device_attribute *attr,
>  			    char *buf)
>  {
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index b6017c0c57b4..530c7e693096 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -720,6 +720,7 @@ static inline bool is_cxl_root(struct cxl_port *port)
>  }
>  
>  int cxl_num_decoders_committed(struct cxl_port *port);
> +int cxl_num_decoders_committed_locked(struct cxl_port *port);
>  bool is_cxl_port(const struct device *dev);
>  struct cxl_port *to_cxl_port(const struct device *dev);
>  struct pci_bus;
> @@ -800,6 +801,7 @@ int devm_cxl_enumerate_decoders(struct cxl_hdm *cxlhdm,
>  int devm_cxl_add_passthrough_decoder(struct cxl_port *port);
>  int cxl_dvsec_rr_decode(struct device *dev, int dvsec,
>  			struct cxl_endpoint_dvsec_info *info);
> +bool cxl_dvsec_rr_active(struct device *dev, int d);
>  
>  bool is_cxl_region(struct device *dev);
>  
> diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> index 5303d6942b88..9f1814005322 100644
> --- a/drivers/cxl/cxlmem.h
> +++ b/drivers/cxl/cxlmem.h
> @@ -440,6 +440,7 @@ struct cxl_dev_state {
>  	struct resource ram_res;
>  	u64 serial;
>  	enum cxl_devtype type;
> +	bool active_rr_prereset;
>  };
>  
>  /**
> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> index 233e7c42c161..5a5fda7134f6 100644
> --- a/drivers/cxl/pci.c
> +++ b/drivers/cxl/pci.c
> @@ -957,11 +957,42 @@ static void cxl_error_resume(struct pci_dev *pdev)
>  		 dev->driver ? "successful" : "failed");
>  }
>  
> +static void cxl_reset_prepare(struct pci_dev *pdev)
> +{
> +	struct cxl_dev_state *cxlds = pci_get_drvdata(pdev);
> +	struct cxl_memdev *cxlmd = cxlds->cxlmd;
> +
> +	if (cxl_num_decoders_committed_locked(cxlmd->endpoint))
> +		cxlds->active_rr_prereset = true;
> +}
> +
> +static void cxl_reset_done(struct pci_dev *pdev)
> +{
> +	struct cxl_dev_state *cxlds = pci_get_drvdata(pdev);
> +	struct cxl_memdev *cxlmd = cxlds->cxlmd;
> +	struct device *dev = &cxlmd->dev;
> +
> +	/*
> +	 * FLR does not expect to touch the HDM decoders and related registers.
> +	 * SBR however will wipe all device configurations.
> +	 * Issue warning if there was active configuration before reset that no
> +	 * longer exists.
> +	 */
> +	if (cxlds->active_rr_prereset &&
> +	    !cxl_dvsec_rr_active(&pdev->dev, cxlds->cxl_dvsec)) {
> +		dev_warn(dev, "SBR happened without memory regions removal.\n");
> +		dev_warn(dev, "System may be unstable if regions hosted system memory.\n");
> +	}
> +	cxlds->active_rr_prereset = false;
> +}
> +
>  static const struct pci_error_handlers cxl_error_handlers = {
>  	.error_detected	= cxl_error_detected,
>  	.slot_reset	= cxl_slot_reset,
>  	.resume		= cxl_error_resume,
>  	.cor_error_detected	= cxl_cor_error_detected,
> +	.reset_prepare	= cxl_reset_prepare,
> +	.reset_done	= cxl_reset_done,
>  };
>  
>  static struct pci_driver cxl_pci_driver = {
Dan Williams Feb. 20, 2024, 6:20 p.m. UTC | #3
Jonathan Cameron wrote:
> On Thu, 15 Feb 2024 16:23:07 -0700
> Dave Jiang <dave.jiang@intel.com> wrote:
> 
> > SBR is equivalent to a device been hot removed and inserted again. Doing a
> > SBR on a CXL type 3 device is problematic if the exported device memory is
> > part of system memory that cannot be offlined. The event is equivalent to
> > violently ripping out that range of memory from the kernel. While the
> > hardware requires the "Unmask SBR" bit set in the Port Control Extensions
> > register and the kernel currently does not unmask it, user can unmask
> > this bit via setpci or similar tool.
> > 
> > The driver does not have a way to detect whether a reset coming from the
> > PCI subsystem is a Function Level Reset (FLR) or SBR. The only way to
> > detect is to note if there are active decoders before the reset and check
> > if the range register memory active bit remains set after reset.
> > 
> > A helper function to check is added to detect if the range register memory
> > active bit is set. A locked helper for cxl_num_decoders_committed() is also
> > added to allow pci code to call the cxl_num_decoders_committed() while
> > holding the cxl_region_rwsem.
> > 
> > Add a err_handler->reset_prepare() to detect whether there are active
> > decoders.  Add a err_handler->reset_done() to check if there was active
> > memory before the reset and it is no longer active after the reset. A
> > warning is emitted in the case of active memory has been offlined.
> > 
> > Suggested-by: Dan Williams <dan.j.williams@intel.com>
> > Signed-off-by: Dave Jiang <dave.jiang@intel.com>
> 
> This feels like we are papering over a hole in the PCI core.
> Is there no way of detecting Secondary Bus Reset (SBR) and
> communicate that down to the device?
> +CC Bjorn. 
> Most of the logic would be needed in driver anyway though as
> we don't want to bother warning on SBR if there was no memory mapped.
> 
> Bjorn, would you prefer this FLR vs SBR being detected by state
> change in driver, or a modification to the PCI core so that it
> provides this info to the drivers?  I assume this pretty unique
> to CXL as normally there isn't a magic control to ignore triggering
> a reset.

So there *is* a magic control to ignore triggering a reset per the CXL
specification, see "Unmask SBR" in "Port Control Extensions".

Moreover, I do not see this as papering over a hole. The only software
that flips that "Unmask SBR" bit from its default today is a userpace
"setpci" script.  Unless kernel_lockdown is in force there is nothing to
stop or warn root about the danger, in fact there is a wide swath of
damage that root with config-cycle-write-access can wreak.

If someone goes through that trouble, and in keeping with the general
Linux ethos of giving root access to footguns (outside of
kernel_lockdown), there is not much justification to block it, but the
driver can definitely clarify the damage after the fact.

I will also point out that the lack of a reset reason notification is
not the loan concern. If there is appetite for increasing core-to-driver
transparency, the hotplug reason is also missing. Whether ->remove() is
logical or physical and the ability to set the magnetic-retention-latch
from an endpoint driver could be interesting, but the staus quo is
sufficient for now.

...a comment for Dave below

> 
> One trivial comment inline.
> 
> Jonathan
> 
> > diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
> > index e59d9d37aa65..81d9f57d2e84 100644
> > --- a/drivers/cxl/core/port.c
> > +++ b/drivers/cxl/core/port.c
> > @@ -45,6 +45,17 @@ int cxl_num_decoders_committed(struct cxl_port *port)
> >  	return port->commit_end + 1;
> >  }
> >  
> > +int cxl_num_decoders_committed_locked(struct cxl_port *port)
> > +{
> > +	int decoders;
> > +
> > +	guard(rwsem_read)(&cxl_region_rwsem);
> > +	decoders = cxl_num_decoders_committed(port);
> 
> return cxl_num_decoder_commited(port);
> 
> > +
> > +	return decoders;
> > +}
> > +EXPORT_SYMBOL_NS_GPL(cxl_num_decoders_committed_locked, CXL);
> > +
> >  static ssize_t devtype_show(struct device *dev, struct device_attribute *attr,
> >  			    char *buf)
> >  {
> > diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> > index b6017c0c57b4..530c7e693096 100644
> > --- a/drivers/cxl/cxl.h
> > +++ b/drivers/cxl/cxl.h
> > @@ -720,6 +720,7 @@ static inline bool is_cxl_root(struct cxl_port *port)
> >  }
> >  
> >  int cxl_num_decoders_committed(struct cxl_port *port);
> > +int cxl_num_decoders_committed_locked(struct cxl_port *port);
> >  bool is_cxl_port(const struct device *dev);
> >  struct cxl_port *to_cxl_port(const struct device *dev);
> >  struct pci_bus;
> > @@ -800,6 +801,7 @@ int devm_cxl_enumerate_decoders(struct cxl_hdm *cxlhdm,
> >  int devm_cxl_add_passthrough_decoder(struct cxl_port *port);
> >  int cxl_dvsec_rr_decode(struct device *dev, int dvsec,
> >  			struct cxl_endpoint_dvsec_info *info);
> > +bool cxl_dvsec_rr_active(struct device *dev, int d);
> >  
> >  bool is_cxl_region(struct device *dev);
> >  
> > diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> > index 5303d6942b88..9f1814005322 100644
> > --- a/drivers/cxl/cxlmem.h
> > +++ b/drivers/cxl/cxlmem.h
> > @@ -440,6 +440,7 @@ struct cxl_dev_state {
> >  	struct resource ram_res;
> >  	u64 serial;
> >  	enum cxl_devtype type;
> > +	bool active_rr_prereset;
> >  };
> >  
> >  /**
> > diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> > index 233e7c42c161..5a5fda7134f6 100644
> > --- a/drivers/cxl/pci.c
> > +++ b/drivers/cxl/pci.c
> > @@ -957,11 +957,42 @@ static void cxl_error_resume(struct pci_dev *pdev)
> >  		 dev->driver ? "successful" : "failed");
> >  }
> >  
> > +static void cxl_reset_prepare(struct pci_dev *pdev)
> > +{
> > +	struct cxl_dev_state *cxlds = pci_get_drvdata(pdev);
> > +	struct cxl_memdev *cxlmd = cxlds->cxlmd;
> > +
> > +	if (cxl_num_decoders_committed_locked(cxlmd->endpoint))
> > +		cxlds->active_rr_prereset = true;
> > +}
> > +
> > +static void cxl_reset_done(struct pci_dev *pdev)
> > +{
> > +	struct cxl_dev_state *cxlds = pci_get_drvdata(pdev);
> > +	struct cxl_memdev *cxlmd = cxlds->cxlmd;
> > +	struct device *dev = &cxlmd->dev;
> > +
> > +	/*
> > +	 * FLR does not expect to touch the HDM decoders and related registers.
> > +	 * SBR however will wipe all device configurations.
> > +	 * Issue warning if there was active configuration before reset that no
> > +	 * longer exists.
> > +	 */
> > +	if (cxlds->active_rr_prereset &&
> > +	    !cxl_dvsec_rr_active(&pdev->dev, cxlds->cxl_dvsec)) {
> > +		dev_warn(dev, "SBR happened without memory regions removal.\n");
> > +		dev_warn(dev, "System may be unstable if regions hosted system memory.\n");

Dave, did you test this? I reacted to the addition of
->active_rr_prereset as a case of putting code logic in a data
structure, but I doubt it is even effectice since nothing informs
software that the register values changed. I.e. the check should be to
walk through all the software committed decoders and see if they are
still hardware committed. No need for ->active_rr_prereset.
Bjorn Helgaas Feb. 20, 2024, 8:39 p.m. UTC | #4
[+cc Alex, reset expert]

On Mon, Feb 19, 2024 at 02:20:06PM +0000, Jonathan Cameron wrote:
> On Thu, 15 Feb 2024 16:23:07 -0700
> Dave Jiang <dave.jiang@intel.com> wrote:
> 
> > SBR is equivalent to a device been hot removed and inserted again. Doing a
> > SBR on a CXL type 3 device is problematic if the exported device memory is
> > part of system memory that cannot be offlined. The event is equivalent to
> > violently ripping out that range of memory from the kernel. While the
> > hardware requires the "Unmask SBR" bit set in the Port Control Extensions
> > register and the kernel currently does not unmask it, user can unmask
> > this bit via setpci or similar tool.

IIUC, this refers to CXL r3.1, sec 8.1.5.2, which says the default
"Unmask SBR" value is 0, and that when it is 0, the SBR bit in Bridge
Control has no effect unless the Port is operating in PCIe or RCD
mode.

So I guess the scenario is a CXL Port leading to a CXL type 3 device,
and if the Port has the default "Unmask SBR" of 0, an attempt to reset
the type 3 device via SBR would have no effect.  But if a user or some
future kernel *sets* "Unmask SBR", the type 3 device could see a hot
reset.

It sounds kind of problematic that when "Unmask SBR" is 0, an attempt
to reset downstream devices using SBR would fail but the caller of
pci_reset_function() would think it succeeded.

> > The driver does not have a way to detect whether a reset coming from the
> > PCI subsystem is a Function Level Reset (FLR) or SBR. The only way to
> > detect is to note if there are active decoders before the reset and check
> > if the range register memory active bit remains set after reset.
> > 
> > A helper function to check is added to detect if the range register memory
> > active bit is set. A locked helper for cxl_num_decoders_committed() is also
> > added to allow pci code to call the cxl_num_decoders_committed() while
> > holding the cxl_region_rwsem.
> > 
> > Add a err_handler->reset_prepare() to detect whether there are active
> > decoders.  Add a err_handler->reset_done() to check if there was active
> > memory before the reset and it is no longer active after the reset. A
> > warning is emitted in the case of active memory has been offlined.
> > 
> > Suggested-by: Dan Williams <dan.j.williams@intel.com>
> > Signed-off-by: Dave Jiang <dave.jiang@intel.com>
> 
> This feels like we are papering over a hole in the PCI core.
> Is there no way of detecting Secondary Bus Reset (SBR) and
> communicate that down to the device?
> +CC Bjorn. 
> Most of the logic would be needed in driver anyway though as
> we don't want to bother warning on SBR if there was no memory mapped.
> 
> Bjorn, would you prefer this FLR vs SBR being detected by state
> change in driver, or a modification to the PCI core so that it
> provides this info to the drivers?

I guess this is about pci_reset_function() and similar paths, which
call .reset_prepare() in pci_dev_save_and_disable() before the reset
and call .reset_done() afterwards, and the question is whether we
should pass a new parameter to .reset_done() to tell the driver what
type of reset was done?

That seems *possible*, but kind of a hassle because there are several
different reset methods (six in pci_reset_fn_methods[] plus
device-specific things in pci_dev_reset_methods[] and a few more in
various hotplug_slot_ops structs), and I guess we'd have to plumb them
all to return some indication of what kind of reset they used.

But even before we get that far, if pci_reset_function() just does
nothing on these devices because SBRs are masked by default, that
sounds like it needs to be fixed first.

> I assume this pretty unique
> to CXL as normally there isn't a magic control to ignore triggering
> a reset.
> 
> One trivial comment inline.
> 
> Jonathan
> 
> > diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
> > index e59d9d37aa65..81d9f57d2e84 100644
> > --- a/drivers/cxl/core/port.c
> > +++ b/drivers/cxl/core/port.c
> > @@ -45,6 +45,17 @@ int cxl_num_decoders_committed(struct cxl_port *port)
> >  	return port->commit_end + 1;
> >  }
> >  
> > +int cxl_num_decoders_committed_locked(struct cxl_port *port)
> > +{
> > +	int decoders;
> > +
> > +	guard(rwsem_read)(&cxl_region_rwsem);
> > +	decoders = cxl_num_decoders_committed(port);
> 
> return cxl_num_decoder_commited(port);
> 
> > +
> > +	return decoders;
> > +}
> > +EXPORT_SYMBOL_NS_GPL(cxl_num_decoders_committed_locked, CXL);
> > +
> >  static ssize_t devtype_show(struct device *dev, struct device_attribute *attr,
> >  			    char *buf)
> >  {
> > diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> > index b6017c0c57b4..530c7e693096 100644
> > --- a/drivers/cxl/cxl.h
> > +++ b/drivers/cxl/cxl.h
> > @@ -720,6 +720,7 @@ static inline bool is_cxl_root(struct cxl_port *port)
> >  }
> >  
> >  int cxl_num_decoders_committed(struct cxl_port *port);
> > +int cxl_num_decoders_committed_locked(struct cxl_port *port);
> >  bool is_cxl_port(const struct device *dev);
> >  struct cxl_port *to_cxl_port(const struct device *dev);
> >  struct pci_bus;
> > @@ -800,6 +801,7 @@ int devm_cxl_enumerate_decoders(struct cxl_hdm *cxlhdm,
> >  int devm_cxl_add_passthrough_decoder(struct cxl_port *port);
> >  int cxl_dvsec_rr_decode(struct device *dev, int dvsec,
> >  			struct cxl_endpoint_dvsec_info *info);
> > +bool cxl_dvsec_rr_active(struct device *dev, int d);
> >  
> >  bool is_cxl_region(struct device *dev);
> >  
> > diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> > index 5303d6942b88..9f1814005322 100644
> > --- a/drivers/cxl/cxlmem.h
> > +++ b/drivers/cxl/cxlmem.h
> > @@ -440,6 +440,7 @@ struct cxl_dev_state {
> >  	struct resource ram_res;
> >  	u64 serial;
> >  	enum cxl_devtype type;
> > +	bool active_rr_prereset;
> >  };
> >  
> >  /**
> > diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> > index 233e7c42c161..5a5fda7134f6 100644
> > --- a/drivers/cxl/pci.c
> > +++ b/drivers/cxl/pci.c
> > @@ -957,11 +957,42 @@ static void cxl_error_resume(struct pci_dev *pdev)
> >  		 dev->driver ? "successful" : "failed");
> >  }
> >  
> > +static void cxl_reset_prepare(struct pci_dev *pdev)
> > +{
> > +	struct cxl_dev_state *cxlds = pci_get_drvdata(pdev);
> > +	struct cxl_memdev *cxlmd = cxlds->cxlmd;
> > +
> > +	if (cxl_num_decoders_committed_locked(cxlmd->endpoint))
> > +		cxlds->active_rr_prereset = true;
> > +}
> > +
> > +static void cxl_reset_done(struct pci_dev *pdev)
> > +{
> > +	struct cxl_dev_state *cxlds = pci_get_drvdata(pdev);
> > +	struct cxl_memdev *cxlmd = cxlds->cxlmd;
> > +	struct device *dev = &cxlmd->dev;
> > +
> > +	/*
> > +	 * FLR does not expect to touch the HDM decoders and related registers.
> > +	 * SBR however will wipe all device configurations.
> > +	 * Issue warning if there was active configuration before reset that no
> > +	 * longer exists.
> > +	 */
> > +	if (cxlds->active_rr_prereset &&
> > +	    !cxl_dvsec_rr_active(&pdev->dev, cxlds->cxl_dvsec)) {
> > +		dev_warn(dev, "SBR happened without memory regions removal.\n");
> > +		dev_warn(dev, "System may be unstable if regions hosted system memory.\n");
> > +	}
> > +	cxlds->active_rr_prereset = false;
> > +}
> > +
> >  static const struct pci_error_handlers cxl_error_handlers = {
> >  	.error_detected	= cxl_error_detected,
> >  	.slot_reset	= cxl_slot_reset,
> >  	.resume		= cxl_error_resume,
> >  	.cor_error_detected	= cxl_cor_error_detected,
> > +	.reset_prepare	= cxl_reset_prepare,
> > +	.reset_done	= cxl_reset_done,
> >  };
> >  
> >  static struct pci_driver cxl_pci_driver = {
>
Dan Williams Feb. 20, 2024, 9 p.m. UTC | #5
Bjorn Helgaas wrote:
> [+cc Alex, reset expert]
> 
> On Mon, Feb 19, 2024 at 02:20:06PM +0000, Jonathan Cameron wrote:
> > On Thu, 15 Feb 2024 16:23:07 -0700
> > Dave Jiang <dave.jiang@intel.com> wrote:
> > 
> > > SBR is equivalent to a device been hot removed and inserted again. Doing a
> > > SBR on a CXL type 3 device is problematic if the exported device memory is
> > > part of system memory that cannot be offlined. The event is equivalent to
> > > violently ripping out that range of memory from the kernel. While the
> > > hardware requires the "Unmask SBR" bit set in the Port Control Extensions
> > > register and the kernel currently does not unmask it, user can unmask
> > > this bit via setpci or similar tool.
> 
> IIUC, this refers to CXL r3.1, sec 8.1.5.2, which says the default
> "Unmask SBR" value is 0, and that when it is 0, the SBR bit in Bridge
> Control has no effect unless the Port is operating in PCIe or RCD
> mode.
> 
> So I guess the scenario is a CXL Port leading to a CXL type 3 device,
> and if the Port has the default "Unmask SBR" of 0, an attempt to reset
> the type 3 device via SBR would have no effect.  But if a user or some
> future kernel *sets* "Unmask SBR", the type 3 device could see a hot
> reset.
> 
> It sounds kind of problematic that when "Unmask SBR" is 0, an attempt
> to reset downstream devices using SBR would fail but the caller of
> pci_reset_function() would think it succeeded.
> 
> > > The driver does not have a way to detect whether a reset coming from the
> > > PCI subsystem is a Function Level Reset (FLR) or SBR. The only way to
> > > detect is to note if there are active decoders before the reset and check
> > > if the range register memory active bit remains set after reset.
> > > 
> > > A helper function to check is added to detect if the range register memory
> > > active bit is set. A locked helper for cxl_num_decoders_committed() is also
> > > added to allow pci code to call the cxl_num_decoders_committed() while
> > > holding the cxl_region_rwsem.
> > > 
> > > Add a err_handler->reset_prepare() to detect whether there are active
> > > decoders.  Add a err_handler->reset_done() to check if there was active
> > > memory before the reset and it is no longer active after the reset. A
> > > warning is emitted in the case of active memory has been offlined.
> > > 
> > > Suggested-by: Dan Williams <dan.j.williams@intel.com>
> > > Signed-off-by: Dave Jiang <dave.jiang@intel.com>
> > 
> > This feels like we are papering over a hole in the PCI core.
> > Is there no way of detecting Secondary Bus Reset (SBR) and
> > communicate that down to the device?
> > +CC Bjorn. 
> > Most of the logic would be needed in driver anyway though as
> > we don't want to bother warning on SBR if there was no memory mapped.
> > 
> > Bjorn, would you prefer this FLR vs SBR being detected by state
> > change in driver, or a modification to the PCI core so that it
> > provides this info to the drivers?
> 
> I guess this is about pci_reset_function() and similar paths, which
> call .reset_prepare() in pci_dev_save_and_disable() before the reset
> and call .reset_done() afterwards, and the question is whether we
> should pass a new parameter to .reset_done() to tell the driver what
> type of reset was done?
> 
> That seems *possible*, but kind of a hassle because there are several
> different reset methods (six in pci_reset_fn_methods[] plus
> device-specific things in pci_dev_reset_methods[] and a few more in
> various hotplug_slot_ops structs), and I guess we'd have to plumb them
> all to return some indication of what kind of reset they used.
> 
> But even before we get that far, if pci_reset_function() just does
> nothing on these devices because SBRs are masked by default, that
> sounds like it needs to be fixed first.

True, if the administrator asked for SBR the kernel should work to honor
that regardless of any quirks like CXL SBR masking that might be in
effect.

Is that effectively a new reset_method? In other words, teach the kernel
to toggle "Unmask SBR" on demand, but only if the selected reset_method
is "cxl_bus". That at least allows us a place to hook some documentation
around why this is a potentially system-catastrophic event.

Otherwise CXL SBR masking is in effect modeled as just another way to
indicate a PCI_DEV_FLAGS_NO_BUS_RESET quirk.
Dave Jiang Feb. 21, 2024, 4:35 p.m. UTC | #6
On 2/20/24 11:20 AM, Dan Williams wrote:
> Jonathan Cameron wrote:
>> On Thu, 15 Feb 2024 16:23:07 -0700
>> Dave Jiang <dave.jiang@intel.com> wrote:
>>
>>> SBR is equivalent to a device been hot removed and inserted again. Doing a
>>> SBR on a CXL type 3 device is problematic if the exported device memory is
>>> part of system memory that cannot be offlined. The event is equivalent to
>>> violently ripping out that range of memory from the kernel. While the
>>> hardware requires the "Unmask SBR" bit set in the Port Control Extensions
>>> register and the kernel currently does not unmask it, user can unmask
>>> this bit via setpci or similar tool.
>>>
>>> The driver does not have a way to detect whether a reset coming from the
>>> PCI subsystem is a Function Level Reset (FLR) or SBR. The only way to
>>> detect is to note if there are active decoders before the reset and check
>>> if the range register memory active bit remains set after reset.
>>>
>>> A helper function to check is added to detect if the range register memory
>>> active bit is set. A locked helper for cxl_num_decoders_committed() is also
>>> added to allow pci code to call the cxl_num_decoders_committed() while
>>> holding the cxl_region_rwsem.
>>>
>>> Add a err_handler->reset_prepare() to detect whether there are active
>>> decoders.  Add a err_handler->reset_done() to check if there was active
>>> memory before the reset and it is no longer active after the reset. A
>>> warning is emitted in the case of active memory has been offlined.
>>>
>>> Suggested-by: Dan Williams <dan.j.williams@intel.com>
>>> Signed-off-by: Dave Jiang <dave.jiang@intel.com>
>>
>> This feels like we are papering over a hole in the PCI core.
>> Is there no way of detecting Secondary Bus Reset (SBR) and
>> communicate that down to the device?
>> +CC Bjorn. 
>> Most of the logic would be needed in driver anyway though as
>> we don't want to bother warning on SBR if there was no memory mapped.
>>
>> Bjorn, would you prefer this FLR vs SBR being detected by state
>> change in driver, or a modification to the PCI core so that it
>> provides this info to the drivers?  I assume this pretty unique
>> to CXL as normally there isn't a magic control to ignore triggering
>> a reset.
> 
> So there *is* a magic control to ignore triggering a reset per the CXL
> specification, see "Unmask SBR" in "Port Control Extensions".
> 
> Moreover, I do not see this as papering over a hole. The only software
> that flips that "Unmask SBR" bit from its default today is a userpace
> "setpci" script.  Unless kernel_lockdown is in force there is nothing to
> stop or warn root about the danger, in fact there is a wide swath of
> damage that root with config-cycle-write-access can wreak.
> 
> If someone goes through that trouble, and in keeping with the general
> Linux ethos of giving root access to footguns (outside of
> kernel_lockdown), there is not much justification to block it, but the
> driver can definitely clarify the damage after the fact.
> 
> I will also point out that the lack of a reset reason notification is
> not the loan concern. If there is appetite for increasing core-to-driver
> transparency, the hotplug reason is also missing. Whether ->remove() is
> logical or physical and the ability to set the magnetic-retention-latch
> from an endpoint driver could be interesting, but the staus quo is
> sufficient for now.
> 
> ...a comment for Dave below
> 
>>
>> One trivial comment inline.
>>
>> Jonathan
>>
>>> diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
>>> index e59d9d37aa65..81d9f57d2e84 100644
>>> --- a/drivers/cxl/core/port.c
>>> +++ b/drivers/cxl/core/port.c
>>> @@ -45,6 +45,17 @@ int cxl_num_decoders_committed(struct cxl_port *port)
>>>  	return port->commit_end + 1;
>>>  }
>>>  
>>> +int cxl_num_decoders_committed_locked(struct cxl_port *port)
>>> +{
>>> +	int decoders;
>>> +
>>> +	guard(rwsem_read)(&cxl_region_rwsem);
>>> +	decoders = cxl_num_decoders_committed(port);
>>
>> return cxl_num_decoder_commited(port);
>>
>>> +
>>> +	return decoders;
>>> +}
>>> +EXPORT_SYMBOL_NS_GPL(cxl_num_decoders_committed_locked, CXL);
>>> +
>>>  static ssize_t devtype_show(struct device *dev, struct device_attribute *attr,
>>>  			    char *buf)
>>>  {
>>> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
>>> index b6017c0c57b4..530c7e693096 100644
>>> --- a/drivers/cxl/cxl.h
>>> +++ b/drivers/cxl/cxl.h
>>> @@ -720,6 +720,7 @@ static inline bool is_cxl_root(struct cxl_port *port)
>>>  }
>>>  
>>>  int cxl_num_decoders_committed(struct cxl_port *port);
>>> +int cxl_num_decoders_committed_locked(struct cxl_port *port);
>>>  bool is_cxl_port(const struct device *dev);
>>>  struct cxl_port *to_cxl_port(const struct device *dev);
>>>  struct pci_bus;
>>> @@ -800,6 +801,7 @@ int devm_cxl_enumerate_decoders(struct cxl_hdm *cxlhdm,
>>>  int devm_cxl_add_passthrough_decoder(struct cxl_port *port);
>>>  int cxl_dvsec_rr_decode(struct device *dev, int dvsec,
>>>  			struct cxl_endpoint_dvsec_info *info);
>>> +bool cxl_dvsec_rr_active(struct device *dev, int d);
>>>  
>>>  bool is_cxl_region(struct device *dev);
>>>  
>>> diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
>>> index 5303d6942b88..9f1814005322 100644
>>> --- a/drivers/cxl/cxlmem.h
>>> +++ b/drivers/cxl/cxlmem.h
>>> @@ -440,6 +440,7 @@ struct cxl_dev_state {
>>>  	struct resource ram_res;
>>>  	u64 serial;
>>>  	enum cxl_devtype type;
>>> +	bool active_rr_prereset;
>>>  };
>>>  
>>>  /**
>>> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
>>> index 233e7c42c161..5a5fda7134f6 100644
>>> --- a/drivers/cxl/pci.c
>>> +++ b/drivers/cxl/pci.c
>>> @@ -957,11 +957,42 @@ static void cxl_error_resume(struct pci_dev *pdev)
>>>  		 dev->driver ? "successful" : "failed");
>>>  }
>>>  
>>> +static void cxl_reset_prepare(struct pci_dev *pdev)
>>> +{
>>> +	struct cxl_dev_state *cxlds = pci_get_drvdata(pdev);
>>> +	struct cxl_memdev *cxlmd = cxlds->cxlmd;
>>> +
>>> +	if (cxl_num_decoders_committed_locked(cxlmd->endpoint))
>>> +		cxlds->active_rr_prereset = true;
>>> +}
>>> +
>>> +static void cxl_reset_done(struct pci_dev *pdev)
>>> +{
>>> +	struct cxl_dev_state *cxlds = pci_get_drvdata(pdev);
>>> +	struct cxl_memdev *cxlmd = cxlds->cxlmd;
>>> +	struct device *dev = &cxlmd->dev;
>>> +
>>> +	/*
>>> +	 * FLR does not expect to touch the HDM decoders and related registers.
>>> +	 * SBR however will wipe all device configurations.
>>> +	 * Issue warning if there was active configuration before reset that no
>>> +	 * longer exists.
>>> +	 */
>>> +	if (cxlds->active_rr_prereset &&
>>> +	    !cxl_dvsec_rr_active(&pdev->dev, cxlds->cxl_dvsec)) {
>>> +		dev_warn(dev, "SBR happened without memory regions removal.\n");
>>> +		dev_warn(dev, "System may be unstable if regions hosted system memory.\n");
> 
> Dave, did you test this? I reacted to the addition of
> ->active_rr_prereset as a case of putting code logic in a data
> structure, but I doubt it is even effectice since nothing informs
> software that the register values changed. I.e. the check should be to
> walk through all the software committed decoders and see if they are
> still hardware committed. No need for ->active_rr_prereset.

I've not got hold of hw to test yet. I just figured to see if this is the direction we want to go while I work on getting hold of hw. I added ->active_rr_prereset with the thinking that if we find there's nothing setup before the reset and after the reset we can skip emitting false warnings. But it sounds like we want only ->reset_done() to walk through the decoders and emit warning if nothing is setup regardless of previous state? Although would it be sufficient to just detect the range register Memory_Active bit? SBR would reset this bit to 0 right?
Dan Williams Feb. 21, 2024, 7:45 p.m. UTC | #7
Dave Jiang wrote:
[..]
> > Dave, did you test this? I reacted to the addition of
> > ->active_rr_prereset as a case of putting code logic in a data
> > structure, but I doubt it is even effectice since nothing informs
> > software that the register values changed. I.e. the check should be to
> > walk through all the software committed decoders and see if they are
> > still hardware committed. No need for ->active_rr_prereset.
> 
> I've not got hold of hw to test yet. I just figured to see if this is
> the direction we want to go while I work on getting hold of hw. I
> added ->active_rr_prereset with the thinking that if we find there's
> nothing setup before the reset and after the reset we can skip
> emitting false warnings. But it sounds like we want only
> ->reset_done() to walk through the decoders and emit warning if
> nothing is setup regardless of previous state?

Wait, no. Previous state is already recorded in the 'struct cxl_decoder'
software state, right?  I.e. if CXL_DECODER_F_ENABLE was set before the
reset then when you go to read the hardware state you will see the
mismatch, if SBR was triggered. If FLR is triggered then the decoder's
CXL_DECODER_F_ENABLE state will still match hardware.

> Although would it be sufficient to just detect the range register
> Memory_Active bit? SBR would reset this bit to 0 right?

If the driver just revalidates all the device's decoder instances
against hardware that should be sufficient. Now there is a question
about whether it is sufficient to check one vs all, but might as well
check them all for a quick sanity check.
diff mbox series

Patch

diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
index 6c9c8d92f8f7..73b5cd54e761 100644
--- a/drivers/cxl/core/pci.c
+++ b/drivers/cxl/core/pci.c
@@ -322,6 +322,20 @@  static int devm_cxl_enable_hdm(struct device *host, struct cxl_hdm *cxlhdm)
 	return devm_add_action_or_reset(host, disable_hdm, cxlhdm);
 }
 
+bool cxl_dvsec_rr_active(struct device *dev, int d)
+{
+	struct pci_dev *pdev = to_pci_dev(dev);
+	u32 temp;
+	int rc;
+
+	rc = pci_read_config_dword(pdev, d + CXL_DVSEC_RANGE_SIZE_LOW(0), &temp);
+	if (rc)
+		return false;
+
+	return FIELD_GET(CXL_DVSEC_MEM_ACTIVE, temp);
+}
+EXPORT_SYMBOL_NS_GPL(cxl_dvsec_rr_active, CXL);
+
 int cxl_dvsec_rr_decode(struct device *dev, int d,
 			struct cxl_endpoint_dvsec_info *info)
 {
diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
index e59d9d37aa65..81d9f57d2e84 100644
--- a/drivers/cxl/core/port.c
+++ b/drivers/cxl/core/port.c
@@ -45,6 +45,17 @@  int cxl_num_decoders_committed(struct cxl_port *port)
 	return port->commit_end + 1;
 }
 
+int cxl_num_decoders_committed_locked(struct cxl_port *port)
+{
+	int decoders;
+
+	guard(rwsem_read)(&cxl_region_rwsem);
+	decoders = cxl_num_decoders_committed(port);
+
+	return decoders;
+}
+EXPORT_SYMBOL_NS_GPL(cxl_num_decoders_committed_locked, CXL);
+
 static ssize_t devtype_show(struct device *dev, struct device_attribute *attr,
 			    char *buf)
 {
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index b6017c0c57b4..530c7e693096 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -720,6 +720,7 @@  static inline bool is_cxl_root(struct cxl_port *port)
 }
 
 int cxl_num_decoders_committed(struct cxl_port *port);
+int cxl_num_decoders_committed_locked(struct cxl_port *port);
 bool is_cxl_port(const struct device *dev);
 struct cxl_port *to_cxl_port(const struct device *dev);
 struct pci_bus;
@@ -800,6 +801,7 @@  int devm_cxl_enumerate_decoders(struct cxl_hdm *cxlhdm,
 int devm_cxl_add_passthrough_decoder(struct cxl_port *port);
 int cxl_dvsec_rr_decode(struct device *dev, int dvsec,
 			struct cxl_endpoint_dvsec_info *info);
+bool cxl_dvsec_rr_active(struct device *dev, int d);
 
 bool is_cxl_region(struct device *dev);
 
diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
index 5303d6942b88..9f1814005322 100644
--- a/drivers/cxl/cxlmem.h
+++ b/drivers/cxl/cxlmem.h
@@ -440,6 +440,7 @@  struct cxl_dev_state {
 	struct resource ram_res;
 	u64 serial;
 	enum cxl_devtype type;
+	bool active_rr_prereset;
 };
 
 /**
diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
index 233e7c42c161..5a5fda7134f6 100644
--- a/drivers/cxl/pci.c
+++ b/drivers/cxl/pci.c
@@ -957,11 +957,42 @@  static void cxl_error_resume(struct pci_dev *pdev)
 		 dev->driver ? "successful" : "failed");
 }
 
+static void cxl_reset_prepare(struct pci_dev *pdev)
+{
+	struct cxl_dev_state *cxlds = pci_get_drvdata(pdev);
+	struct cxl_memdev *cxlmd = cxlds->cxlmd;
+
+	if (cxl_num_decoders_committed_locked(cxlmd->endpoint))
+		cxlds->active_rr_prereset = true;
+}
+
+static void cxl_reset_done(struct pci_dev *pdev)
+{
+	struct cxl_dev_state *cxlds = pci_get_drvdata(pdev);
+	struct cxl_memdev *cxlmd = cxlds->cxlmd;
+	struct device *dev = &cxlmd->dev;
+
+	/*
+	 * FLR does not expect to touch the HDM decoders and related registers.
+	 * SBR however will wipe all device configurations.
+	 * Issue warning if there was active configuration before reset that no
+	 * longer exists.
+	 */
+	if (cxlds->active_rr_prereset &&
+	    !cxl_dvsec_rr_active(&pdev->dev, cxlds->cxl_dvsec)) {
+		dev_warn(dev, "SBR happened without memory regions removal.\n");
+		dev_warn(dev, "System may be unstable if regions hosted system memory.\n");
+	}
+	cxlds->active_rr_prereset = false;
+}
+
 static const struct pci_error_handlers cxl_error_handlers = {
 	.error_detected	= cxl_error_detected,
 	.slot_reset	= cxl_slot_reset,
 	.resume		= cxl_error_resume,
 	.cor_error_detected	= cxl_cor_error_detected,
+	.reset_prepare	= cxl_reset_prepare,
+	.reset_done	= cxl_reset_done,
 };
 
 static struct pci_driver cxl_pci_driver = {