mbox series

[0/3] cxl/mbox: support background operation abort requests

Message ID 20241022031809.242591-1-dave@stgolabs.net
Headers show
Series cxl/mbox: support background operation abort requests | expand

Message

Davidlohr Bueso Oct. 22, 2024, 3:18 a.m. UTC
Hello,

This series implements the abortable capabilities for mailbox background
operations, introduced in CXL 3.1.

Details in each patch, but patch 1 moves non-sanitize background command
polling out of the mbox_mutex, such that patch 2 can add the machinery
to request aborting the on going command. Patch 3 simply renames a call.

While an overal policy could include command-priorities (which could
be trivially added); the current policy is that, if there is an on going
background operation, a new incoming bg-capable command will blindly
request it to be cancelled, if capable.

Considering the interest of this for Logs[0], perhaps Micron folks could
this? Does this work along the lines of what that discussion concluded?

Applies against Linus' latest tree.

[0]: https://lore.kernel.org/all/20240313071218.729-1-sthanneeru.opensrc@micron.com/

Testing:

lockdep enabled kernel + the qemu counterpart patches:
	https://lore.kernel.org/linux-cxl/20240813221255.179200-1-dave@stgolabs.net/

#> echo 1 > /sys/bus/cxl/devices/mem1/security/sanitize && sleep 3 && echo 1 > /sys/bus/cxl/devices/mem1/security/sanitize
[  185.928276] cxl_pci:__cxl_pci_mbox_send_cmd:262: cxl_pci 0000:0d:00.0: Sending command: 0x4400cxl/devices/mem1/security/sanitize
[  185.930024] cxl_pci:cxl_pci_mbox_wait_for_doorbell:74: cxl_pci 0000:0d:00.0: Doorbell wait took 0ms
[  185.931608] cxl_pci:__cxl_pci_mbox_send_cmd:306: cxl_pci 0000:0d:00.0: Sanitization operation started
[  188.936583] cxl_pci:__cxl_pci_mbox_send_cmd:262: cxl_pci 0000:0d:00.0: Sending command: 0x0005
[  188.943956] cxl_pci:cxl_pci_mbox_wait_for_doorbell:74: cxl_pci 0000:0d:00.0: Doorbell wait took 0ms
[  188.951786] cxl_pci:cxl_try_to_cancel_background:376: cxl_pci 0000:0d:00.0: Sanitization operation aborted
[  188.957762] cxl_pci:__cxl_pci_mbox_send_cmd:262: cxl_pci 0000:0d:00.0: Sending command: 0x4400
[  188.959886] cxl_pci:cxl_pci_mbox_wait_for_doorbell:74: cxl_pci 0000:0d:00.0: Doorbell wait took 0ms
[  188.962325] cxl_pci:__cxl_pci_mbox_send_cmd:306: cxl_pci 0000:0d:00.0: Sanitization operation started
[  197.034644] cxl_pci:cxl_mbox_sanitize_work:164: cxl_pci 0000:0d:00.0: Sanitization operation ended

#> cxl update-firmware -F img.fw mem1 && sleep 3 && echo 1 > /sys/bus/cxl/devices/mem1/security/sanitize
[   14.443484] cxl_pci:__cxl_pci_mbox_send_cmd:262: cxl_pci 0000:0e:00.0: Sending command: 0x0200
[   14.445884] cxl_pci:cxl_pci_mbox_wait_for_doorbell:74: cxl_pci 0000:0e:00.0: Doorbell wait took 0ms
[   14.448744] cxl_core:cxl_query_cmd:539: cxl_mem mem1: Query IOCTL
[   14.450458] cxl_core:cxl_query_cmd:539: cxl_mem mem1: Query IOCTL
[   14.452307] cxl_core:cxl_send_cmd:644: cxl_mem mem1: Send IOCTL
[   14.453686] cxl_core:handle_mailbox_cmd_from_user:602: cxl_pci 0000:0e:00.0: Submitting Get FW Info command for user
[   14.453686] 	opcode: 200
[   14.453686] 	size: 0
[   14.460453] cxl_pci:__cxl_pci_mbox_send_cmd:262: cxl_pci 0000:0e:00.0: Sending command: 0x0201
[   14.462313] cxl_pci:cxl_pci_mbox_wait_for_doorbell:74: cxl_pci 0000:0e:00.0: Doorbell wait took 0ms
[   14.464166] cxl_pci:__cxl_pci_mbox_send_cmd:310: cxl_pci 0000:0e:00.0: Mailbox background operation (0x0201) started
[   14.466536] cxl_pci:__cxl_pci_mbox_send_cmd:262: cxl_pci 0000:0e:00.0: Sending command: 0x0200
[   14.468380] cxl_pci:cxl_pci_mbox_wait_for_doorbell:74: cxl_pci 0000:0e:00.0: Doorbell wait took 0ms
{
  "memdev":"mem1",
  "pmem_size":"1024.00 MiB (1073.74 MB)",
  "serial":"0",
  "host":"0000:0e:00.0",
  "firmware":{
    "num_slots":2,
    "active_slot":1,
    "online_activate_capable":true,
    "slot_1_version":"BWFW VERSION 0",
    "fw_update_in_progress":true,
    "remaining_size":"48.83 MiB (51.20 MB)"
  }
}
cxl memdev: cmd_update_fw: firmware update started on 1 mem device
[   17.484680] cxl_pci:__cxl_pci_mbox_send_cmd:262: cxl_pci 0000:0e:00.0: Sending command: 0x0005
[   17.486993] cxl_pci:cxl_pci_mbox_wait_for_doorbell:74: cxl_pci 0000:0e:00.0: Doorbell wait took 0ms
[   17.489510] cxl_pci:cxl_pci_mbox_send:451: cxl_pci 0000:0e:00.0: Mailbox background operation (0x0201) aborted
[   17.492476] cxl_pci:__cxl_pci_mbox_send_cmd:262: cxl_pci 0000:0e:00.0: Sending command: 0x4400
[   17.494937] cxl_pci:cxl_pci_mbox_wait_for_doorbell:74: cxl_pci 0000:0e:00.0: Doorbell wait took 0ms
[   17.497598] cxl_pci:__cxl_pci_mbox_send_cmd:306: cxl_pci 0000:0e:00.0: Sanitization operation started
[   25.682631] cxl_pci:cxl_mbox_sanitize_work:164: cxl_pci 0000:0e:00.0: Sanitization operation ended

Thanks!

Davidlohr Bueso (3):
  cxl/pci: lockless background synchronous polling
  cxl/mbox: support aborting the current background operation
  cxl/pci: rename cxl_mbox_background_complete()

 drivers/cxl/core/mbox.c |   1 +
 drivers/cxl/cxlmem.h    |  14 +++
 drivers/cxl/pci.c       | 189 ++++++++++++++++++++++++++++------------
 include/cxl/mailbox.h   |   2 +
 4 files changed, 152 insertions(+), 54 deletions(-)

--
2.46.1

Comments

Ravis OpenSrc Oct. 23, 2024, 5:32 a.m. UTC | #1
Hi Davidlohr,

   We have recently submitted an RFC to implement Request Background Abort
in case a timeout is encountered while executing a background command.
The patch series disables those background commands which do not support abort
based on their individual CEL properties.

https://lore.kernel.org/linux-cxl/20241017163215.00000547@Huawei.com/T/#ma6710971908b85a5f8c5da2e23b8102b5a6e277c

This implementation is based on Dan's suggestion in a previous thread.
https://lore.kernel.org/linux-cxl/66035c2e8ba17_770232948b@dwillia2-xfh.jf.intel.com.notmuch/

We can discuss more on how we can potentially merge and arrive at a common ground.

--Ravi.
>On Mon, 21 Oct , Davidlohr Bueso wrote:
>Hello,
>
>This series implements the abortable capabilities for mailbox background
>operations, introduced in CXL 3.1.
>
>Details in each patch, but patch 1 moves non-sanitize background command
>polling out of the mbox_mutex, such that patch 2 can add the machinery
>to request aborting the on going command. Patch 3 simply renames a call.
>
>While an overal policy could include command-priorities (which could
>be trivially added); the current policy is that, if there is an on going
>background operation, a new incoming bg-capable command will blindly
>request it to be cancelled, if capable.
>
>Considering the interest of this for Logs[0], perhaps Micron folks could
>this? Does this work along the lines of what that discussion concluded?
>
>Applies against Linus' latest tree.
>
>[0]: https://lore.kernel.org/all/20240313071218.729-1-sthanneeru.opensrc@micron.com/
>
>Testing:
>
>lockdep enabled kernel + the qemu counterpart patches:
>    https://lore.kernel.org/linux-cxl/20240813221255.179200-1-dave@stgolabs.net/ 
>
>#> echo 1 > /sys/bus/cxl/devices/mem1/security/sanitize && sleep 3 && echo 1 > /sys/bus/cxl/devices/mem1/security/sanitize
>[  185.928276] cxl_pci:__cxl_pci_mbox_send_cmd:262: cxl_pci 0000:0d:00.0: Sending command: 0x4400cxl/devices/mem1/security/sanitize
>[  185.930024] cxl_pci:cxl_pci_mbox_wait_for_doorbell:74: cxl_pci 0000:0d:00.0: Doorbell wait took 0ms
>[  185.931608] cxl_pci:__cxl_pci_mbox_send_cmd:306: cxl_pci 0000:0d:00.0: Sanitization operation started
>[  188.936583] cxl_pci:__cxl_pci_mbox_send_cmd:262: cxl_pci 0000:0d:00.0: Sending command: 0x0005
>[  188.943956] cxl_pci:cxl_pci_mbox_wait_for_doorbell:74: cxl_pci 0000:0d:00.0: Doorbell wait took 0ms
>[  188.951786] cxl_pci:cxl_try_to_cancel_background:376: cxl_pci 0000:0d:00.0: Sanitization operation aborted
>[  188.957762] cxl_pci:__cxl_pci_mbox_send_cmd:262: cxl_pci 0000:0d:00.0: Sending command: 0x4400
>[  188.959886] cxl_pci:cxl_pci_mbox_wait_for_doorbell:74: cxl_pci 0000:0d:00.0: Doorbell wait took 0ms
>[  188.962325] cxl_pci:__cxl_pci_mbox_send_cmd:306: cxl_pci 0000:0d:00.0: Sanitization operation started
>[  197.034644] cxl_pci:cxl_mbox_sanitize_work:164: cxl_pci 0000:0d:00.0: Sanitization operation ended
>
>#> cxl update-firmware -F img.fw mem1 && sleep 3 && echo 1 > /sys/bus/cxl/devices/mem1/security/sanitize
>[   14.443484] cxl_pci:__cxl_pci_mbox_send_cmd:262: cxl_pci 0000:0e:00.0: Sending command: 0x0200
>[   14.445884] cxl_pci:cxl_pci_mbox_wait_for_doorbell:74: cxl_pci 0000:0e:00.0: Doorbell wait took 0ms
>[   14.448744] cxl_core:cxl_query_cmd:539: cxl_mem mem1: Query IOCTL
>[   14.450458] cxl_core:cxl_query_cmd:539: cxl_mem mem1: Query IOCTL
>[   14.452307] cxl_core:cxl_send_cmd:644: cxl_mem mem1: Send IOCTL
>[   14.453686] cxl_core:handle_mailbox_cmd_from_user:602: cxl_pci 0000:0e:00.0: Submitting Get FW Info command for user
>[   14.453686]  opcode: 200
>[   14.453686]  size: 0
>[   14.460453] cxl_pci:__cxl_pci_mbox_send_cmd:262: cxl_pci 0000:0e:00.0: Sending command: 0x0201
>[   14.462313] cxl_pci:cxl_pci_mbox_wait_for_doorbell:74: cxl_pci 0000:0e:00.0: Doorbell wait took 0ms
>[   14.464166] cxl_pci:__cxl_pci_mbox_send_cmd:310: cxl_pci 0000:0e:00.0: Mailbox background operation (0x0201) started
>[   14.466536] cxl_pci:__cxl_pci_mbox_send_cmd:262: cxl_pci 0000:0e:00.0: Sending command: 0x0200
>[   14.468380] cxl_pci:cxl_pci_mbox_wait_for_doorbell:74: cxl_pci 0000:0e:00.0: Doorbell wait took 0ms
>{
>  "memdev":"mem1",
>  "pmem_size":"1024.00 MiB (1073.74 MB)",
>  "serial":"0",
>  "host":"0000:0e:00.0",
>  "firmware":{
>    "num_slots":2,
>    "active_slot":1,
>    "online_activate_capable":true,
>    "slot_1_version":"BWFW VERSION 0",
>    "fw_update_in_progress":true,
>    "remaining_size":"48.83 MiB (51.20 MB)"
>  }
>}
>cxl memdev: cmd_update_fw: firmware update started on 1 mem device
>[   17.484680] cxl_pci:__cxl_pci_mbox_send_cmd:262: cxl_pci 0000:0e:00.0: Sending command: 0x0005
>[   17.486993] cxl_pci:cxl_pci_mbox_wait_for_doorbell:74: cxl_pci 0000:0e:00.0: Doorbell wait took 0ms
>[   17.489510] cxl_pci:cxl_pci_mbox_send:451: cxl_pci 0000:0e:00.0: Mailbox background operation (0x0201) aborted
>[   17.492476] cxl_pci:__cxl_pci_mbox_send_cmd:262: cxl_pci 0000:0e:00.0: Sending command: 0x4400
>[   17.494937] cxl_pci:cxl_pci_mbox_wait_for_doorbell:74: cxl_pci 0000:0e:00.0: Doorbell wait took 0ms
>[   17.497598] cxl_pci:__cxl_pci_mbox_send_cmd:306: cxl_pci 0000:0e:00.0: Sanitization operation started
>[   25.682631] cxl_pci:cxl_mbox_sanitize_work:164: cxl_pci 0000:0e:00.0: Sanitization operation ended
>
>Thanks!
>
>Davidlohr Bueso (3):
>  cxl/pci: lockless background synchronous polling
>  cxl/mbox: support aborting the current background operation
>  cxl/pci: rename cxl_mbox_background_complete()
>
> drivers/cxl/core/mbox.c |   1 +
> drivers/cxl/cxlmem.h    |  14 +++
> drivers/cxl/pci.c       | 189 ++++++++++++++++++++++++++++------------
> include/cxl/mailbox.h   |   2 +
> 4 files changed, 152 insertions(+), 54 deletions(-)
Davidlohr Bueso Oct. 23, 2024, 2:29 p.m. UTC | #2
On Wed, 23 Oct 2024, Ravis OpenSrc wrote:\n
>Hi Davidlohr,
>
>   We have recently submitted an RFC to implement Request Background Abort
>in case a timeout is encountered while executing a background command.
>The patch series disables those background commands which do not support abort
>based on their individual CEL properties.
>
>https://lore.kernel.org/linux-cxl/20241017163215.00000547@Huawei.com/T/#ma6710971908b85a5f8c5da2e23b8102b5a6e277c

*sigh* it would have been nice to be Cc'ed.

>
>This implementation is based on Dan's suggestion in a previous thread.
>https://lore.kernel.org/linux-cxl/66035c2e8ba17_770232948b@dwillia2-xfh.jf.intel.com.notmuch/
>
>We can discuss more on how we can potentially merge and arrive at a common ground.

I think that my approach is way more robust then defining a timeout as in that series.
It also aligns more with Jonathan's comments about the abort being on demand.
Ravis OpenSrc Oct. 23, 2024, 9:17 p.m. UTC | #3
>>On Wed, 23 Oct 2024, Ravis OpenSrc wrote:\n
>>Hi Davidlohr,
>>
>>   We have recently submitted an RFC to implement Request Background Abort
>>in case a timeout is encountered while executing a background command.
>>The patch series disables those background commands which do not support abort
>>based on their individual CEL properties.
>>
>>https://lore.kernel.org/linux-cxl/20241017163215.00000547@Huawei.com/T/#ma6710971908b85a5f8c5da2e23b8102b5a6e277c
>
>*sigh* it would have been nice to be Cc'ed.

Noted. Will take care of this in future.

>
>>
>>This implementation is based on Dan's suggestion in a previous thread.
>>https://lore.kernel.org/linux-cxl/66035c2e8ba17_770232948b@dwillia2-xfh.jf.intel.com.notmuch/
>>
>>We can discuss more on how we can potentially merge and arrive at a common ground.
>
>I think that my approach is way more robust then defining a timeout as in that series.
>It also aligns more with Jonathan's comments about the abort being on demand.

The one major functionality in our original proposal apart from implementing abort is

Allowing background commands to be invoked from user space through the primary mailbox
by ensuring only those background commands are enabled which also support request abort operation
by checking their individual CEL details.

Few questions on abort implementation are:
1. Would abort be required only when a new bg command is issued?
2. What is the path for supporting background commands from primary mailbox if not done through
CEL check?
3. Should all future background commands follow sysfs path?

--Ravi.
Davidlohr Bueso Oct. 23, 2024, 11:35 p.m. UTC | #4
On Wed, 23 Oct 2024, Ravis OpenSrc wrote:
>The one major functionality in our original proposal apart from implementing abort is
>
>Allowing background commands to be invoked from user space through the primary mailbox
>by ensuring only those background commands are enabled which also support request abort operation
>by checking their individual CEL details.

Is vendor-specific logs not something that can be done OoB?

If we are strictly talking about CEL details, yes this series could use that, and was
planning on adding it for an eventual v2 -- ie: why send the abort cmd at all if we
know the current one doesn't support it. But that is minutiae, for kernel bg commands.

But yeah, for your needs, the enabled cmds would need that filter.

Then with that, would adding something like the below address your needs and below
questions? Basically, if userspace is running a command, then the kernel comes in
and wants to run its own bg command, it will simply abort *anything* ongoing as a
last resort. And since we have no kernel-context about whatever is running at that
point, is *think* it is safe to assume it was user-initiated.

diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
index 7b0fad7f6c4d..bf0742546ea8 100644
--- a/drivers/cxl/pci.c
+++ b/drivers/cxl/pci.c
@@ -374,7 +374,7 @@ static bool cxl_try_to_cancel_background(struct cxl_mailbox *cxl_mbox)
		mds->security.sanitize_active = false;

		dev_dbg(cxlds->dev, "Sanitization operation aborted\n");
-	} else {
+	} else if (atomic_read_acquire(&cxl_mbox->poll_bgop)){
		/*
		 * Kick the poller and wait for it to be done - no one else
		 * can touch mbox regs. rcuwait_wake_up() provides full
@@ -403,7 +403,9 @@ static int cxl_pci_mbox_send(struct cxl_mailbox *cxl_mbox,
	 */
	if (cxl_is_background_cmd(cmd->opcode)) {
		if (mds->security.sanitize_active ||
-		    atomic_read_acquire(&cxl_mbox->poll_bgop)) {
+		    atomic_read_acquire(&cxl_mbox->poll_bgop) ||
+		    /* userspace-initiated ? */
+		    !cxl_mbox_background_done(cxlds)) {
			if (!cxl_try_to_cancel_background(cxl_mbox)) {
				mutex_unlock(&cxl_mbox->mbox_mutex);
				return -EBUSY;
Ravis OpenSrc Oct. 24, 2024, 6:30 a.m. UTC | #5
>On Wed, 23 Oct 2024, Davidlohr Bueso wrote:
>>The one major functionality in our original proposal apart from implementing abort is
>>
>>Allowing background commands to be invoked from user space through the primary mailbox
>>by ensuring only those background commands are enabled which also support request abort operation
>>by checking their individual CEL details.
>
>Is vendor-specific logs not something that can be done OoB?
>
>If we are strictly talking about CEL details, yes this series could use that, and was
>planning on adding it for an eventual v2 -- ie: why send the abort cmd at all if we
>know the current one doesn't support it. But that is minutiae, for kernel bg commands.
>
>But yeah, for your needs, the enabled cmds would need that filter.

Ok. we can merge changes related to CEL checking and filtering of enabled commands.

>
>Then with that, would adding something like the below address your needs and below
>questions? Basically, if userspace is running a command, then the kernel comes in
>and wants to run its own bg command, it will simply abort *anything* ongoing as a
>last resort. And since we have no kernel-context about whatever is running at that
>point, is *think* it is safe to assume it was user-initiated.

Yes, this seems a reasonable assumption.

Currently user space doesn't have a way to initiate background commands
through primary mailbox as there is no provision to provide timeout value.

By filtering out user-initiated background commands which do not support abort,
we can potentially have a default timeout or allow user space to provide a custom timeout value. 
 
Overall, the approach to cancel current background operation when new bg operation
is initiated seems elegant. 
>
>diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
>index 7b0fad7f6c4d..bf0742546ea8 100644
>--- a/drivers/cxl/pci.c
>+++ b/drivers/cxl/pci.c
>@@ -374,7 +374,7 @@ static bool cxl_try_to_cancel_background(struct cxl_mailbox *cxl_mbox)
>		mds->security.sanitize_active = false;
>
>		dev_dbg(cxlds->dev, "Sanitization operation aborted\n");
>-	} else {
>+	} else if (atomic_read_acquire(&cxl_mbox->poll_bgop)){
>		/*
>		 * Kick the poller and wait for it to be done - no one else
>		 * can touch mbox regs. rcuwait_wake_up() provides full
>@@ -403,7 +403,9 @@ static int cxl_pci_mbox_send(struct cxl_mailbox *cxl_mbox,
>	 */
>	if (cxl_is_background_cmd(cmd->opcode)) {
>		if (mds->security.sanitize_active ||
>-		    atomic_read_acquire(&cxl_mbox->poll_bgop)) {
>+		    atomic_read_acquire(&cxl_mbox->poll_bgop) ||
>+		    /* userspace-initiated ? */
>+		    !cxl_mbox_background_done(cxlds)) {
>			if (!cxl_try_to_cancel_background(cxl_mbox)) {
>				mutex_unlock(&cxl_mbox->mbox_mutex);
>
--Ravi.