Message ID: 20221206011501.464916-2-dave@stgolabs.net
State: New, archived
Series: cxl: BG operations and device sanitation
On 12/5/2022 6:14 PM, Davidlohr Bueso wrote: > This adds support for handling background operations, as defined in > the CXL 3.0 spec. Commands that can take too long (over ~2 seconds) > can run asynchronously in the background, these are limited to: > transfer/activate firmware, scan media, sanitize (aka overwrite) > and VPPB bind/unbind. > > Completion of background operations, successful or not, can be handled > via irq or poll based. This patch deals only with polling, which > introduces some complexities as we need to handle the window in which > the original background command completed, then a new one is > successfully started before the poller wakes up and checks. This, > in theory, can occur any amount of times. One of the problems is > that previous states cannot be tracked as hw background status > registers always deal with the last command. > > So in order to keep hardware and the driver in sync, there can be > windows where the hardware is ready but the driver won't be, and > thus must serialize/handle it accordingly. While this polling cost > may not be ideal: 1) background commands are rare/limited and > 2) the extra busy time should be small compared to the overall > command duration and 3) devices that support mailbox interrupts > do not use this. > > The implementation extends struct cxl_mem_command to become aware > of background-capable commands in a generic fashion and presents > some interfaces: > > - Calls for bg operations, where each bg command can choose to implement > whatever needed based on the nature of the operation. For example, > while running, overwrite may only allow certain related commands to > occur, while scan media does not have any such limitations. > Delay times can also be different, for which ad-hoc hinting can > be defined - for example, scan media could use some value based on > GET_SCAN_MEDIA_CAPS and overwrite has predefined values based on > pmem DSM specs[0]. Similarly some commands need to execute tasks > once the command finishes, such as overwriting requires CPU flushing > when successfully done. These are: > > cxl_mbox_bgcmd_conflicts() > cxl_mbox_bgcmd_delay() > cxl_mbox_bgcmd_post() > > - cxl_mbox_send_cmd() is extended such that callers can distinguish, > upon rc == 0, between completed and successfully started in the > background. > > - cxl_mbox_bgcmd_running() will atomically tell which command is > running in the background, if any. This allows easy querying > functionality. Similarly, there are cxl_mbox_bgcmd_start() and > cxl_mbox_bgcmd_done() to safely mark the in-flight operation. > While x86 serializes atomics, care must be taken with arm64, for > example, ensuring, minimally, release/acquire semantics. > > There are currently no supported commands. 
> > [0] https://pmem.io/documents/IntelOptanePMem_DSM_Interface-V2.0.pdf > > Signed-off-by: Davidlohr Bueso <dave@stgolabs.net> > --- > drivers/cxl/core/core.h | 3 +- > drivers/cxl/core/mbox.c | 213 ++++++++++++++++++++++++++++++++++++++-- > drivers/cxl/core/port.c | 1 + > drivers/cxl/cxl.h | 8 ++ > drivers/cxl/cxlmem.h | 55 ++++++++++- > drivers/cxl/pci.c | 4 + > drivers/cxl/pmem.c | 5 +- > drivers/cxl/security.c | 13 +-- > 8 files changed, 282 insertions(+), 20 deletions(-) > > diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h > index 1d8f87be283f..2a71664a0668 100644 > --- a/drivers/cxl/core/core.h > +++ b/drivers/cxl/core/core.h > @@ -69,6 +69,7 @@ static inline struct cxl_ep *cxl_ep_load(struct cxl_port *port, > > int cxl_memdev_init(void); > void cxl_memdev_exit(void); > -void cxl_mbox_init(void); > +int cxl_mbox_init(void); > +void cxl_mbox_exit(void); > > #endif /* __CXL_CORE_H__ */ > diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c > index 35dd889f1d3a..bfee9251c81c 100644 > --- a/drivers/cxl/core/mbox.c > +++ b/drivers/cxl/core/mbox.c > @@ -10,6 +10,7 @@ > #include "core.h" > > static bool cxl_raw_allow_all; > +static struct workqueue_struct *cxl_mbox_bgpoll_wq; Do we want per device wq rather than a global one? Just thinking if you have a device go away and you need to flush outstanding commands, you might impact the other devices. > > /** > * DOC: cxl mbox > @@ -24,7 +25,7 @@ static bool cxl_raw_allow_all; > for ((cmd) = &cxl_mem_commands[0]; \ > ((cmd) - cxl_mem_commands) < ARRAY_SIZE(cxl_mem_commands); (cmd)++) > > -#define CXL_CMD(_id, sin, sout, _flags) \ > +#define __CXL_CMD(_id, sin, sout, _flags, _bgops) \ > [CXL_MEM_COMMAND_ID_##_id] = { \ > .info = { \ > .id = CXL_MEM_COMMAND_ID_##_id, \ > @@ -33,8 +34,13 @@ static bool cxl_raw_allow_all; > }, \ > .opcode = CXL_MBOX_OP_##_id, \ > .flags = _flags, \ > + .bgops = _bgops, \ > } > > +#define CXL_CMD(_id, sin, sout, _flags) __CXL_CMD(_id, sin, sout, _flags, NULL) > +#define CXL_BGCMD(_id, sin, sout, _flags, _bgops) \ > + __CXL_CMD(_id, sin, sout, _flags | CXL_CMD_FLAG_BACKGROUND, _bgops) > + > #define CXL_VARIABLE_PAYLOAD ~0U > /* > * This table defines the supported mailbox commands for the driver. 
This table > @@ -63,7 +69,7 @@ static struct cxl_mem_command cxl_mem_commands[CXL_MEM_COMMAND_ID_MAX] = { > CXL_CMD(INJECT_POISON, 0x8, 0, 0), > CXL_CMD(CLEAR_POISON, 0x48, 0, 0), > CXL_CMD(GET_SCAN_MEDIA_CAPS, 0x10, 0x4, 0), > - CXL_CMD(SCAN_MEDIA, 0x11, 0, 0), > + CXL_BGCMD(SCAN_MEDIA, 0x11, 0, 0, NULL), > CXL_CMD(GET_SCAN_MEDIA, 0, CXL_VARIABLE_PAYLOAD, 0), > CXL_CMD(GET_SECURITY_STATE, 0, 0x4, 0), > CXL_CMD(SET_PASSPHRASE, 0x60, 0, 0), > @@ -145,6 +151,136 @@ static const char *cxl_mem_opcode_to_name(u16 opcode) > return cxl_command_names[c->info.id].name; > } > > +static unsigned long __maybe_unused > +cxl_mbox_bgcmd_delay(struct cxl_dev_state *cxlds, u16 opcode) > +{ > + struct cxl_mem_command *c; > + unsigned long ret = 0; > + > + c = cxl_mem_find_command(opcode); > + if (!c || !(c->flags & CXL_CMD_FLAG_BACKGROUND)) { > + dev_WARN_ONCE(cxlds->dev, true, > + "Not a background command\n"); > + return 0; > + } > + > + if (c->bgops && c->bgops->delay) > + ret = c->bgops->delay(cxlds); > + > + return ret * HZ; > +} > + > +static void cxl_mbox_bgcmd_post(struct cxl_dev_state *cxlds, > + u16 opcode, bool success) > +{ > + struct cxl_mem_command *c; > + > + c = cxl_mem_find_command(opcode); > + if (!c || !(c->flags & CXL_CMD_FLAG_BACKGROUND)) { > + dev_WARN_ONCE(cxlds->dev, true, > + "Not a background command\n"); > + return; > + } > + > + if (c->bgops && c->bgops->post) > + c->bgops->post(cxlds, success); > +} > + > +/* > + * Ensure that ->mbox_send(@new) can run safely when a background > + * command is running. If so, returns zero, otherwise error. > + * > + * check 1. bg cmd running. If not running it is fine to > + * send any command. If one is running then there > + * may be restrictions. > + * check 2. @new incoming command is capable of running > + * in the background. If there is an in-flight bg > + * operation, all these are forbidden as we need > + * to serialize when polling. > + * check 3. @new incoming command is not blacklisted by the > + * current running command. > + */ > +static int __maybe_unused > +cxl_mbox_check_cmd_bgcmd(struct cxl_dev_state *cxlds, u16 new) > +{ > + struct cxl_mem_command *c; > + > + /* 1 */ > + if (likely(!cxl_mbox_bgcmd_running(cxlds))) > + return 0; > + > + c = cxl_mem_find_command(new); > + if (!c) > + return -EINVAL; > + > + /* 2 */ > + if (c->flags & CXL_CMD_FLAG_BACKGROUND) > + return -EBUSY; > + > + /* 3 */ > + if (c->bgops && c->bgops->conflicts) > + return c->bgops->conflicts(new); > + > + return 0; > +} > + > +/* > + * Background operation polling mode. > + */ > +void cxl_mbox_bgcmd_work(struct work_struct *work) > +{ > + struct cxl_dev_state *cxlds; > + u64 bgcmd_status_reg; > + u16 opcode; > + u32 pct; > + > + cxlds = container_of(work, struct cxl_dev_state, mbox_bgpoll.work); > + > + dev_WARN_ONCE(cxlds->dev, !!cxlds->mbox_irq, > + "Polling mailbox when IRQs are supported\n"); > + > + bgcmd_status_reg = readq(cxlds->regs.mbox + > + CXLDEV_MBOX_BG_CMD_STATUS_OFFSET); > + opcode = FIELD_GET(CXLDEV_MBOX_BG_CMD_COMMAND_OPCODE_MASK, > + bgcmd_status_reg); > + > + pct = FIELD_GET(CXLDEV_MBOX_BG_CMD_COMMAND_PCT_MASK, bgcmd_status_reg); > + if (pct != 100) { > + unsigned long hint; > + u64 status_reg; > + > + status_reg = readq(cxlds->regs.mbox + CXLDEV_MBOX_STATUS_OFFSET); > + if (!FIELD_GET(CXLDEV_MBOX_STATUS_BACKGROUND_OPERATION, > + status_reg)) > + dev_WARN(cxlds->dev, > + "No background operation running (%u%%).\n", > + pct); > + > + hint = cxl_mbox_bgcmd_delay(cxlds, opcode); > + if (hint == 0) { > + /* > + * TODO: try to do better(?). 
pct is updated every > + * ~1 sec, maybe use a heurstic based on that. > + */ > + hint = 5 * HZ; > + } > + > + queue_delayed_work(cxl_mbox_bgpoll_wq, > + &cxlds->mbox_bgpoll, hint); > + } else { /* bg operation completed */ > + u16 return_code; > + > + return_code = FIELD_GET(CXLDEV_MBOX_BG_CMD_COMMAND_RC_MASK, > + bgcmd_status_reg); > + cxl_mbox_bgcmd_post(cxlds, opcode, > + return_code == CXL_MBOX_CMD_RC_SUCCESS); > + > + put_device(cxlds->dev); > + cxl_mbox_bgcmd_done(cxlds); > + } > +} > +EXPORT_SYMBOL_NS_GPL(cxl_mbox_bgcmd_work, CXL); > + > /** > * cxl_mbox_send_cmd() - Send a mailbox command to a device. > * @cxlds: The device data for the operation > @@ -153,6 +289,7 @@ static const char *cxl_mem_opcode_to_name(u16 opcode) > * @in_size: The length of the input payload > * @out: Caller allocated buffer for the output. > * @out_size: Expected size of output. > + * @return_code: (optional output) HW returned code. > * > * Context: Any context. > * Return: > @@ -167,8 +304,9 @@ static const char *cxl_mem_opcode_to_name(u16 opcode) > * error. While this distinction can be useful for commands from userspace, the > * kernel will only be able to use results when both are successful. > */ > -int cxl_mbox_send_cmd(struct cxl_dev_state *cxlds, u16 opcode, void *in, > - size_t in_size, void *out, size_t out_size) > +int cxl_mbox_send_cmd(struct cxl_dev_state *cxlds, u16 opcode, > + void *in, size_t in_size, > + void *out, size_t out_size, u16 *return_code) > { > const struct cxl_mem_command *cmd = cxl_mem_find_command(opcode); > struct cxl_mbox_cmd mbox_cmd = { > @@ -183,12 +321,53 @@ int cxl_mbox_send_cmd(struct cxl_dev_state *cxlds, u16 opcode, void *in, > if (in_size > cxlds->payload_size || out_size > cxlds->payload_size) > return -E2BIG; > > + /* > + * With bg polling this can overlap a scenario where the > + * hardware can receive new requests but the driver is > + * not ready. Handle accordingly. 
> + */ > + if (!cxlds->mbox_irq) { > + rc = cxl_mbox_check_cmd_bgcmd(cxlds, opcode); > + if (rc) > + return rc; > + } > + > rc = cxlds->mbox_send(cxlds, &mbox_cmd); > + if (return_code) > + *return_code = mbox_cmd.return_code; > if (rc) > return rc; > > - if (mbox_cmd.return_code != CXL_MBOX_CMD_RC_SUCCESS) > + switch (mbox_cmd.return_code) { > + case CXL_MBOX_CMD_RC_BACKGROUND: > + { > + int err; > + > + dev_dbg(cxlds->dev, "Opcode 0x%04x: %s\n", opcode, > + cxl_mbox_cmd_rc2str(&mbox_cmd)); > + > + err = cxl_mbox_bgcmd_begin(cxlds, opcode); > + if (err) { > + dev_WARN(cxlds->dev, > + "Corrupted background cmd (opcode 0x%04x)\n", > + err); > + return -ENXIO; > + } > + > + get_device(cxlds->dev); > + > + if (!cxlds->mbox_irq) { > + /* do first poll check asap before using any hinting */ > + queue_delayed_work(cxl_mbox_bgpoll_wq, > + &cxlds->mbox_bgpoll, 0); > + } > + break; > + } > + case CXL_MBOX_CMD_RC_SUCCESS: > + break; > + default: > return cxl_mbox_cmd_rc2errno(&mbox_cmd); > + } > > /* > * Variable sized commands can't be validated and so it's up to the > @@ -575,7 +754,7 @@ static int cxl_xfer_log(struct cxl_dev_state *cxlds, uuid_t *uuid, u32 size, u8 > int rc; > > rc = cxl_mbox_send_cmd(cxlds, CXL_MBOX_OP_GET_LOG, &log, sizeof(log), > - out, xfer_size); > + out, xfer_size, NULL); > if (rc < 0) > return rc; > > @@ -628,7 +807,7 @@ static struct cxl_mbox_get_supported_logs *cxl_get_gsl(struct cxl_dev_state *cxl > return ERR_PTR(-ENOMEM); > > rc = cxl_mbox_send_cmd(cxlds, CXL_MBOX_OP_GET_SUPPORTED_LOGS, NULL, 0, ret, > - cxlds->payload_size); > + cxlds->payload_size, NULL); > if (rc < 0) { > kvfree(ret); > return ERR_PTR(rc); > @@ -738,7 +917,7 @@ static int cxl_mem_get_partition_info(struct cxl_dev_state *cxlds) > int rc; > > rc = cxl_mbox_send_cmd(cxlds, CXL_MBOX_OP_GET_PARTITION_INFO, NULL, 0, > - &pi, sizeof(pi)); > + &pi, sizeof(pi), NULL); > > if (rc) > return rc; > @@ -771,7 +950,7 @@ int cxl_dev_state_identify(struct cxl_dev_state *cxlds) > int rc; > > rc = cxl_mbox_send_cmd(cxlds, CXL_MBOX_OP_IDENTIFY, NULL, 0, &id, > - sizeof(id)); > + sizeof(id), NULL); > if (rc < 0) > return rc; > > @@ -868,11 +1047,25 @@ struct cxl_dev_state *cxl_dev_state_create(struct device *dev) > } > EXPORT_SYMBOL_NS_GPL(cxl_dev_state_create, CXL); > > -void __init cxl_mbox_init(void) > +int __init cxl_mbox_init(void) > { > struct dentry *mbox_debugfs; > > mbox_debugfs = cxl_debugfs_create_dir("mbox"); > debugfs_create_bool("raw_allow_all", 0600, mbox_debugfs, > &cxl_raw_allow_all); > + > + /* background commands can run concurrently on different devices */ > + cxl_mbox_bgpoll_wq = alloc_workqueue("cxl_mbox_bgcmd", WQ_UNBOUND, 0); > + if (!cxl_mbox_bgpoll_wq) { > + debugfs_remove_recursive(mbox_debugfs); > + return -ENOMEM; > + } > + > + return 0; > +} > + > +void cxl_mbox_exit(void) > +{ > + destroy_workqueue(cxl_mbox_bgpoll_wq); > } > diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c > index 0d2f5eaaca7d..449767bfb5aa 100644 > --- a/drivers/cxl/core/port.c > +++ b/drivers/cxl/core/port.c > @@ -1945,6 +1945,7 @@ static void cxl_core_exit(void) > destroy_workqueue(cxl_bus_wq); > cxl_memdev_exit(); > debugfs_remove_recursive(cxl_debugfs); > + cxl_mbox_exit(); > } > > module_init(cxl_core_init); > diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h > index e5e1abceeca7..a10e979ef3a4 100644 > --- a/drivers/cxl/cxl.h > +++ b/drivers/cxl/cxl.h > @@ -139,14 +139,22 @@ static inline int ways_to_cxl(unsigned int ways, u8 *iw) > /* CXL 2.0 8.2.8.4 Mailbox Registers */ > #define 
CXLDEV_MBOX_CAPS_OFFSET 0x00 > #define CXLDEV_MBOX_CAP_PAYLOAD_SIZE_MASK GENMASK(4, 0) > +#define CXLDEV_MBOX_CAP_BG_CMD_IRQNUM_MASK GENMASK(10, 7) > +#define CXLDEV_MBOX_CAP_BG_CMD_IRQ BIT(6) > #define CXLDEV_MBOX_CTRL_OFFSET 0x04 > #define CXLDEV_MBOX_CTRL_DOORBELL BIT(0) > +#define CXLDEV_MBOX_CTRL_BGCMD_IRQ BIT(2) > #define CXLDEV_MBOX_CMD_OFFSET 0x08 > #define CXLDEV_MBOX_CMD_COMMAND_OPCODE_MASK GENMASK_ULL(15, 0) > #define CXLDEV_MBOX_CMD_PAYLOAD_LENGTH_MASK GENMASK_ULL(36, 16) > #define CXLDEV_MBOX_STATUS_OFFSET 0x10 > +#define CXLDEV_MBOX_STATUS_BACKGROUND_OPERATION BIT(0) > #define CXLDEV_MBOX_STATUS_RET_CODE_MASK GENMASK_ULL(47, 32) > +/* CXL 3.0 8.2.8.4.7 Background Command Status Register */ > #define CXLDEV_MBOX_BG_CMD_STATUS_OFFSET 0x18 > +#define CXLDEV_MBOX_BG_CMD_COMMAND_OPCODE_MASK GENMASK_ULL(15, 0) > +#define CXLDEV_MBOX_BG_CMD_COMMAND_PCT_MASK GENMASK_ULL(22, 16) > +#define CXLDEV_MBOX_BG_CMD_COMMAND_RC_MASK GENMASK_ULL(47, 32) > #define CXLDEV_MBOX_PAYLOAD_OFFSET 0x20 > > /* > diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h > index 75baeb0bbe57..e7cb2f2fadc4 100644 > --- a/drivers/cxl/cxlmem.h > +++ b/drivers/cxl/cxlmem.h > @@ -212,6 +212,9 @@ struct cxl_endpoint_dvsec_info { > * @serial: PCIe Device Serial Number > * @doe_mbs: PCI DOE mailbox array > * @mbox_send: @dev specific transport for transmitting mailbox commands > + * @mbox_irq: @dev supports mailbox interrupts > + * @mbox_bg: opcode for the in-flight background operation on @dev > + * @mbox_bgpoll: self-polling delayed work item > * > * See section 8.2.9.5.2 Capacity Configuration and Label Storage for > * details on capacity parameters. > @@ -248,6 +251,9 @@ struct cxl_dev_state { > struct xarray doe_mbs; > > int (*mbox_send)(struct cxl_dev_state *cxlds, struct cxl_mbox_cmd *cmd); > + bool mbox_irq; > + atomic_t mbox_bg; > + struct delayed_work __maybe_unused mbox_bgpoll; > }; > > enum cxl_opcode { > @@ -290,6 +296,26 @@ enum cxl_opcode { > UUID_INIT(0xe1819d9, 0x11a9, 0x400c, 0x81, 0x1f, 0xd6, 0x07, 0x19, \ > 0x40, 0x3d, 0x86) > > +/* > + * Supported CXL mailbox commands that can run in the background. > + * > + * Barriers pair with release/acquire semantics. When done, > + * clearing ->mbox_bg must be the very last operation before > + * finishing the command. > + */ > +static inline int cxl_mbox_bgcmd_begin(struct cxl_dev_state *cxlds, u16 opcode) > +{ > + return atomic_xchg(&cxlds->mbox_bg, opcode); > +} > +static inline int cxl_mbox_bgcmd_running(struct cxl_dev_state *cxlds) > +{ > + return atomic_read_acquire(&cxlds->mbox_bg); > +} > +static inline void cxl_mbox_bgcmd_done(struct cxl_dev_state *cxlds) > +{ > + atomic_set_release(&cxlds->mbox_bg, 0); > +} > + > struct cxl_mbox_get_supported_logs { > __le16 entries; > u8 rsvd[6]; > @@ -353,16 +379,38 @@ struct cxl_mbox_set_partition_info { > > #define CXL_SET_PARTITION_IMMEDIATE_FLAG BIT(0) > > +/** > + * struct cxl_mbox_bgcmd_ops - Optional ad-hoc handling of background cmds > + * @post: Execute after the command completes > + * @conflicts: How to handle a @new command when one is currently executing > + * (only for polling mode) > + * @delay: Delay hint for polling on command completion > + */ > +struct cxl_mem_bgcommand_ops { > + void (*post)(struct cxl_dev_state *cxlds, bool success); "post" here I think conveys different meaning than "after". Typically it is used in the context of issuing the command. To remove confusion, maybe call it "end_cmd" or "finish"? > + int (*conflicts)(u16 new); has_conflicts() may be better. 
Also, it can return a bool to indicate whether there are conflicts. > + unsigned long (*delay)(struct cxl_dev_state *cxlds); "poll_delay"? > +}; > + > /** > * struct cxl_mem_command - Driver representation of a memory device command > * @info: Command information as it exists for the UAPI > * @opcode: The actual bits used for the mailbox protocol > * @flags: Set of flags effecting driver behavior. > + * @bg_ops: Set of optional callbacks to handle commands that can run in > + * the background. > * > * * %CXL_CMD_FLAG_FORCE_ENABLE: In cases of error, commands with this flag > * will be enabled by the driver regardless of what hardware may have > * advertised. > * > + * * %CXL_CMD_FLAG_BACKGROUND: The command may execute asynchronously in the > + * background if the operation takes longer than ~2 seconds. The semantics > + * are: > + * - Only a single of these commands can run at a time. > + * - Depending on the nature of the operation, devices may continue to > + * accept new non-background commands. > + * > * The cxl_mem_command is the driver's internal representation of commands that > * are supported by the driver. Some of these commands may not be supported by > * the hardware. The driver will use @info to validate the fields passed in by > @@ -374,8 +422,10 @@ struct cxl_mem_command { > struct cxl_command_info info; > enum cxl_opcode opcode; > u32 flags; > + struct cxl_mem_bgcommand_ops __maybe_unused *bgops; > #define CXL_CMD_FLAG_NONE 0 > #define CXL_CMD_FLAG_FORCE_ENABLE BIT(0) > +#define CXL_CMD_FLAG_BACKGROUND BIT(1) > }; > > #define CXL_PMEM_SEC_STATE_USER_PASS_SET 0x01 > @@ -414,7 +464,8 @@ enum { > }; > > int cxl_mbox_send_cmd(struct cxl_dev_state *cxlds, u16 opcode, void *in, > - size_t in_size, void *out, size_t out_size); > + size_t in_size, void *out, size_t out_size, > + u16 *return_code); Given the existing commands don't care about return_code and only the new ones do, and to minimize code changes and having to chase down every call of the existing command, you can maybe do something like: int cxl_mbox_send_cmd_with_return(struct cxl_dev_state *cxlds, u16 opcode, void *in, size_t in_size, void *out, size_t out_size, u16 *return_code) { ..... 
} int cxl_mbox_send_cmd(struct cxl_dev_state *cxlds, u16 opcode, void *in, size_t in_size, void *out, size_t out_size) { return cxl_mbox_send_cmd_with_return(cxlds, opcode, in, in_size, out, out_size, NULL); } > int cxl_dev_state_identify(struct cxl_dev_state *cxlds); > int cxl_await_media_ready(struct cxl_dev_state *cxlds); > int cxl_enumerate_cmds(struct cxl_dev_state *cxlds); > @@ -434,6 +485,8 @@ static inline void cxl_mem_active_dec(void) > } > #endif > > +void cxl_mbox_bgcmd_work(struct work_struct *work); > + > struct cxl_hdm { > struct cxl_component_regs regs; > unsigned int decoder_count; > diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c > index 621a0522b554..b21d8681beac 100644 > --- a/drivers/cxl/pci.c > +++ b/drivers/cxl/pci.c > @@ -268,6 +268,10 @@ static int cxl_pci_setup_mailbox(struct cxl_dev_state *cxlds) > return -ENXIO; > } > > + atomic_set(&cxlds->mbox_bg, 0); > + if (!cxlds->mbox_irq) > + INIT_DELAYED_WORK(&cxlds->mbox_bgpoll, cxl_mbox_bgcmd_work); > + > dev_dbg(cxlds->dev, "Mailbox payload sized %zu", > cxlds->payload_size); > > diff --git a/drivers/cxl/pmem.c b/drivers/cxl/pmem.c > index ab40c93c44e5..c6a6d3fd6883 100644 > --- a/drivers/cxl/pmem.c > +++ b/drivers/cxl/pmem.c > @@ -173,7 +173,8 @@ static int cxl_pmem_get_config_data(struct cxl_dev_state *cxlds, > }; > > rc = cxl_mbox_send_cmd(cxlds, CXL_MBOX_OP_GET_LSA, &get_lsa, > - sizeof(get_lsa), cmd->out_buf, cmd->in_length); > + sizeof(get_lsa), cmd->out_buf, > + cmd->in_length, NULL); > cmd->status = 0; > > return rc; > @@ -205,7 +206,7 @@ static int cxl_pmem_set_config_data(struct cxl_dev_state *cxlds, > > rc = cxl_mbox_send_cmd(cxlds, CXL_MBOX_OP_SET_LSA, set_lsa, > struct_size(set_lsa, data, cmd->in_length), > - NULL, 0); > + NULL, 0, NULL); > > /* > * Set "firmware" status (4-packed bytes at the end of the input > diff --git a/drivers/cxl/security.c b/drivers/cxl/security.c > index 5484d4eecfd1..825d1bd1dd16 100644 > --- a/drivers/cxl/security.c > +++ b/drivers/cxl/security.c > @@ -20,7 +20,7 @@ static unsigned long cxl_pmem_get_security_flags(struct nvdimm *nvdimm, > int rc; > > rc = cxl_mbox_send_cmd(cxlds, CXL_MBOX_OP_GET_SECURITY_STATE, NULL, 0, > - &sec_out, sizeof(sec_out)); > + &sec_out, sizeof(sec_out), NULL); > if (rc < 0) > return 0; > > @@ -67,7 +67,7 @@ static int cxl_pmem_security_change_key(struct nvdimm *nvdimm, > memcpy(set_pass.new_pass, new_data->data, NVDIMM_PASSPHRASE_LEN); > > rc = cxl_mbox_send_cmd(cxlds, CXL_MBOX_OP_SET_PASSPHRASE, > - &set_pass, sizeof(set_pass), NULL, 0); > + &set_pass, sizeof(set_pass), NULL, 0, NULL); > return rc; > } > > @@ -86,7 +86,7 @@ static int __cxl_pmem_security_disable(struct nvdimm *nvdimm, > memcpy(dis_pass.pass, key_data->data, NVDIMM_PASSPHRASE_LEN); > > rc = cxl_mbox_send_cmd(cxlds, CXL_MBOX_OP_DISABLE_PASSPHRASE, > - &dis_pass, sizeof(dis_pass), NULL, 0); > + &dis_pass, sizeof(dis_pass), NULL, 0, NULL); > return rc; > } > > @@ -108,7 +108,8 @@ static int cxl_pmem_security_freeze(struct nvdimm *nvdimm) > struct cxl_memdev *cxlmd = cxl_nvd->cxlmd; > struct cxl_dev_state *cxlds = cxlmd->cxlds; > > - return cxl_mbox_send_cmd(cxlds, CXL_MBOX_OP_FREEZE_SECURITY, NULL, 0, NULL, 0); > + return cxl_mbox_send_cmd(cxlds, CXL_MBOX_OP_FREEZE_SECURITY, > + NULL, 0, NULL, 0, NULL); > } > > static int cxl_pmem_security_unlock(struct nvdimm *nvdimm, > @@ -122,7 +123,7 @@ static int cxl_pmem_security_unlock(struct nvdimm *nvdimm, > > memcpy(pass, key_data->data, NVDIMM_PASSPHRASE_LEN); > rc = cxl_mbox_send_cmd(cxlds, CXL_MBOX_OP_UNLOCK, > - pass, 
NVDIMM_PASSPHRASE_LEN, NULL, 0); > + pass, NVDIMM_PASSPHRASE_LEN, NULL, 0, NULL); > if (rc < 0) > return rc; > > @@ -143,7 +144,7 @@ static int cxl_pmem_security_passphrase_erase(struct nvdimm *nvdimm, > CXL_PMEM_SEC_PASS_MASTER : CXL_PMEM_SEC_PASS_USER; > memcpy(erase.pass, key->data, NVDIMM_PASSPHRASE_LEN); > rc = cxl_mbox_send_cmd(cxlds, CXL_MBOX_OP_PASSPHRASE_SECURE_ERASE, > - &erase, sizeof(erase), NULL, 0); > + &erase, sizeof(erase), NULL, 0, NULL); > if (rc < 0) > return rc; >
Davidlohr Bueso wrote: > This adds support for handling background operations, as defined in > the CXL 3.0 spec. Commands that can take too long (over ~2 seconds) > can run asynchronously in the background, these are limited to: > transfer/activate firmware, scan media, sanitize (aka overwrite) > and VPPB bind/unbind. > > Completion of background operations, successful or not, can be handled > via irq or poll based. This patch deals only with polling, which > introduces some complexities as we need to handle the window in which > the original background command completed, then a new one is > successfully started before the poller wakes up and checks. This, > in theory, can occur any amount of times. One of the problems is > that previous states cannot be tracked as hw background status > registers always deal with the last command. > > So in order to keep hardware and the driver in sync, there can be > windows where the hardware is ready but the driver won't be, and > thus must serialize/handle it accordingly. While this polling cost > may not be ideal: 1) background commands are rare/limited and > 2) the extra busy time should be small compared to the overall > command duration and 3) devices that support mailbox interrupts > do not use this. > > The implementation extends struct cxl_mem_command to become aware > of background-capable commands in a generic fashion and presents > some interfaces: > > - Calls for bg operations, where each bg command can choose to implement > whatever needed based on the nature of the operation. For example, > while running, overwrite may only allow certain related commands to > occur, while scan media does not have any such limitations. > Delay times can also be different, for which ad-hoc hinting can > be defined - for example, scan media could use some value based on > GET_SCAN_MEDIA_CAPS and overwrite has predefined values based on > pmem DSM specs[0]. Similarly some commands need to execute tasks > once the command finishes, such as overwriting requires CPU flushing > when successfully done. These are: > > cxl_mbox_bgcmd_conflicts() > cxl_mbox_bgcmd_delay() > cxl_mbox_bgcmd_post() > > - cxl_mbox_send_cmd() is extended such that callers can distinguish, > upon rc == 0, between completed and successfully started in the > background. > > - cxl_mbox_bgcmd_running() will atomically tell which command is > running in the background, if any. This allows easy querying > functionality. Similarly, there are cxl_mbox_bgcmd_start() and > cxl_mbox_bgcmd_done() to safely mark the in-flight operation. > While x86 serializes atomics, care must be taken with arm64, for > example, ensuring, minimally, release/acquire semantics. > > There are currently no supported commands. 
> > [0] https://pmem.io/documents/IntelOptanePMem_DSM_Interface-V2.0.pdf > > Signed-off-by: Davidlohr Bueso <dave@stgolabs.net> > --- > drivers/cxl/core/core.h | 3 +- > drivers/cxl/core/mbox.c | 213 ++++++++++++++++++++++++++++++++++++++-- > drivers/cxl/core/port.c | 1 + > drivers/cxl/cxl.h | 8 ++ > drivers/cxl/cxlmem.h | 55 ++++++++++- > drivers/cxl/pci.c | 4 + > drivers/cxl/pmem.c | 5 +- > drivers/cxl/security.c | 13 +-- > 8 files changed, 282 insertions(+), 20 deletions(-) > > diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h > index 1d8f87be283f..2a71664a0668 100644 > --- a/drivers/cxl/core/core.h > +++ b/drivers/cxl/core/core.h > @@ -69,6 +69,7 @@ static inline struct cxl_ep *cxl_ep_load(struct cxl_port *port, > > int cxl_memdev_init(void); > void cxl_memdev_exit(void); > -void cxl_mbox_init(void); > +int cxl_mbox_init(void); > +void cxl_mbox_exit(void); > > #endif /* __CXL_CORE_H__ */ > diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c > index 35dd889f1d3a..bfee9251c81c 100644 > --- a/drivers/cxl/core/mbox.c > +++ b/drivers/cxl/core/mbox.c > @@ -10,6 +10,7 @@ > #include "core.h" > > static bool cxl_raw_allow_all; > +static struct workqueue_struct *cxl_mbox_bgpoll_wq; > > /** > * DOC: cxl mbox > @@ -24,7 +25,7 @@ static bool cxl_raw_allow_all; > for ((cmd) = &cxl_mem_commands[0]; \ > ((cmd) - cxl_mem_commands) < ARRAY_SIZE(cxl_mem_commands); (cmd)++) > > -#define CXL_CMD(_id, sin, sout, _flags) \ > +#define __CXL_CMD(_id, sin, sout, _flags, _bgops) \ > [CXL_MEM_COMMAND_ID_##_id] = { \ > .info = { \ > .id = CXL_MEM_COMMAND_ID_##_id, \ > @@ -33,8 +34,13 @@ static bool cxl_raw_allow_all; > }, \ > .opcode = CXL_MBOX_OP_##_id, \ > .flags = _flags, \ > + .bgops = _bgops, \ > } > > +#define CXL_CMD(_id, sin, sout, _flags) __CXL_CMD(_id, sin, sout, _flags, NULL) > +#define CXL_BGCMD(_id, sin, sout, _flags, _bgops) \ > + __CXL_CMD(_id, sin, sout, _flags | CXL_CMD_FLAG_BACKGROUND, _bgops) > + > #define CXL_VARIABLE_PAYLOAD ~0U > /* > * This table defines the supported mailbox commands for the driver. This table > @@ -63,7 +69,7 @@ static struct cxl_mem_command cxl_mem_commands[CXL_MEM_COMMAND_ID_MAX] = { > CXL_CMD(INJECT_POISON, 0x8, 0, 0), > CXL_CMD(CLEAR_POISON, 0x48, 0, 0), > CXL_CMD(GET_SCAN_MEDIA_CAPS, 0x10, 0x4, 0), > - CXL_CMD(SCAN_MEDIA, 0x11, 0, 0), > + CXL_BGCMD(SCAN_MEDIA, 0x11, 0, 0, NULL), One of the recent revelations with this patch: https://lore.kernel.org/linux-cxl/167030056464.4044561.11486507095384253833.stgit@dwillia2-xfh.jf.intel.com/ ...is that commands that are not meant to be called from userspace should simply never appear in CXL_CMDS nor cxl_mem_commands. The motivation for defining SCAN_MEDIA and friends was to use their definitions to block the RAW command passthrough, but ironically that enables the ioctl path which is also unwanted. So I think this patchset first needs to mark those uapi command numbers deprecated. Then ensure that all other background commands can not be submitted via the ioctl path. > CXL_CMD(GET_SCAN_MEDIA, 0, CXL_VARIABLE_PAYLOAD, 0), > CXL_CMD(GET_SECURITY_STATE, 0, 0x4, 0), > CXL_CMD(SET_PASSPHRASE, 0x60, 0, 0), > @@ -145,6 +151,136 @@ static const char *cxl_mem_opcode_to_name(u16 opcode) > return cxl_command_names[c->info.id].name; > } > > +static unsigned long __maybe_unused > +cxl_mbox_bgcmd_delay(struct cxl_dev_state *cxlds, u16 opcode) > +{ > + struct cxl_mem_command *c; > + unsigned long ret = 0; > + > + c = cxl_mem_find_command(opcode); Per that change above, cxl_mem_find_command() is only for the uapi. 
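A minimal way to enforce the ioctl-path exclusion mentioned above might be (sketch, not part of this patch; it reuses the patch's CXL_CMD_FLAG_BACKGROUND flag inside cxl_validate_cmd_from_user()):

	/* keep background-capable commands kernel-internal only */
	if (c->flags & CXL_CMD_FLAG_BACKGROUND)
		return -EOPNOTSUPP;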
> + if (!c || !(c->flags & CXL_CMD_FLAG_BACKGROUND)) { > + dev_WARN_ONCE(cxlds->dev, true, > + "Not a background command\n"); > + return 0; > + } > + > + if (c->bgops && c->bgops->delay) > + ret = c->bgops->delay(cxlds); > + > + return ret * HZ; > +} In general I worry that this is a bit too flexible. Instead of callback ops I think this wants to be straight line code with something like a poll_interval argument in 'struct cxl_mbox_cmd' where 0 means the command is immediate and non-zero is a number of jiffies between background operation polling intervals. > + > +static void cxl_mbox_bgcmd_post(struct cxl_dev_state *cxlds, > + u16 opcode, bool success) > +{ > + struct cxl_mem_command *c; > + > + c = cxl_mem_find_command(opcode); > + if (!c || !(c->flags & CXL_CMD_FLAG_BACKGROUND)) { > + dev_WARN_ONCE(cxlds->dev, true, > + "Not a background command\n"); > + return; > + } > + > + if (c->bgops && c->bgops->post) > + c->bgops->post(cxlds, success); > +} > + > +/* > + * Ensure that ->mbox_send(@new) can run safely when a background > + * command is running. If so, returns zero, otherwise error. > + * > + * check 1. bg cmd running. If not running it is fine to > + * send any command. If one is running then there > + * may be restrictions. > + * check 2. @new incoming command is capable of running > + * in the background. If there is an in-flight bg > + * operation, all these are forbidden as we need > + * to serialize when polling. > + * check 3. @new incoming command is not blacklisted by the > + * current running command. > + */ > +static int __maybe_unused > +cxl_mbox_check_cmd_bgcmd(struct cxl_dev_state *cxlds, u16 new) > +{ > + struct cxl_mem_command *c; > + > + /* 1 */ > + if (likely(!cxl_mbox_bgcmd_running(cxlds))) > + return 0; > + > + c = cxl_mem_find_command(new); > + if (!c) > + return -EINVAL; > + > + /* 2 */ > + if (c->flags & CXL_CMD_FLAG_BACKGROUND) > + return -EBUSY; > + > + /* 3 */ > + if (c->bgops && c->bgops->conflicts) > + return c->bgops->conflicts(new); > + > + return 0; > +} > + > +/* > + * Background operation polling mode. > + */ > +void cxl_mbox_bgcmd_work(struct work_struct *work) Is a workqueue needed? If the goal is to make it so that all the kernel background command use cases chop up their work so that one command can not monopolize the mailbox for a long amount of time then the implementation can also assume that foreground commands that collide with an in-flight background operation can just wait synchronously for their time slice. That makes the handling simple, acquire lock, poll for completion, and everyone else just queues up on the lock. 
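As a rough sketch of that straight-line flow (untested; poll_interval is the suggested new 'struct cxl_mbox_cmd' member, not something this patch defines):

	mutex_lock(&cxlds->mbox_mutex);
	rc = cxlds->mbox_send(cxlds, &mbox_cmd);
	if (!rc && mbox_cmd.return_code == CXL_MBOX_CMD_RC_BACKGROUND) {
		u64 bg;

		/* colliding foreground commands just queue on mbox_mutex */
		do {
			msleep(jiffies_to_msecs(mbox_cmd.poll_interval));
			bg = readq(cxlds->regs.mbox +
				   CXLDEV_MBOX_BG_CMD_STATUS_OFFSET);
		} while (FIELD_GET(CXLDEV_MBOX_BG_CMD_COMMAND_PCT_MASK,
				   bg) != 100);
	}
	mutex_unlock(&cxlds->mbox_mutex);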
> +{ > + struct cxl_dev_state *cxlds; > + u64 bgcmd_status_reg; > + u16 opcode; > + u32 pct; > + > + cxlds = container_of(work, struct cxl_dev_state, mbox_bgpoll.work); > + > + dev_WARN_ONCE(cxlds->dev, !!cxlds->mbox_irq, > + "Polling mailbox when IRQs are supported\n"); > + > + bgcmd_status_reg = readq(cxlds->regs.mbox + > + CXLDEV_MBOX_BG_CMD_STATUS_OFFSET); > + opcode = FIELD_GET(CXLDEV_MBOX_BG_CMD_COMMAND_OPCODE_MASK, > + bgcmd_status_reg); > + > + pct = FIELD_GET(CXLDEV_MBOX_BG_CMD_COMMAND_PCT_MASK, bgcmd_status_reg); > + if (pct != 100) { > + unsigned long hint; > + u64 status_reg; > + > + status_reg = readq(cxlds->regs.mbox + CXLDEV_MBOX_STATUS_OFFSET); > + if (!FIELD_GET(CXLDEV_MBOX_STATUS_BACKGROUND_OPERATION, > + status_reg)) > + dev_WARN(cxlds->dev, > + "No background operation running (%u%%).\n", > + pct); > + > + hint = cxl_mbox_bgcmd_delay(cxlds, opcode); > + if (hint == 0) { > + /* > + * TODO: try to do better(?). pct is updated every > + * ~1 sec, maybe use a heurstic based on that. > + */ > + hint = 5 * HZ; > + } > + > + queue_delayed_work(cxl_mbox_bgpoll_wq, > + &cxlds->mbox_bgpoll, hint); > + } else { /* bg operation completed */ > + u16 return_code; > + > + return_code = FIELD_GET(CXLDEV_MBOX_BG_CMD_COMMAND_RC_MASK, > + bgcmd_status_reg); > + cxl_mbox_bgcmd_post(cxlds, opcode, > + return_code == CXL_MBOX_CMD_RC_SUCCESS); > + > + put_device(cxlds->dev); > + cxl_mbox_bgcmd_done(cxlds); > + } > +} > +EXPORT_SYMBOL_NS_GPL(cxl_mbox_bgcmd_work, CXL); > + > /** > * cxl_mbox_send_cmd() - Send a mailbox command to a device. > * @cxlds: The device data for the operation > @@ -153,6 +289,7 @@ static const char *cxl_mem_opcode_to_name(u16 opcode) > * @in_size: The length of the input payload > * @out: Caller allocated buffer for the output. > * @out_size: Expected size of output. > + * @return_code: (optional output) HW returned code. > * > * Context: Any context. > * Return: > @@ -167,8 +304,9 @@ static const char *cxl_mem_opcode_to_name(u16 opcode) > * error. While this distinction can be useful for commands from userspace, the > * kernel will only be able to use results when both are successful. > */ > -int cxl_mbox_send_cmd(struct cxl_dev_state *cxlds, u16 opcode, void *in, > - size_t in_size, void *out, size_t out_size) > +int cxl_mbox_send_cmd(struct cxl_dev_state *cxlds, u16 opcode, > + void *in, size_t in_size, > + void *out, size_t out_size, u16 *return_code) > { > const struct cxl_mem_command *cmd = cxl_mem_find_command(opcode); > struct cxl_mbox_cmd mbox_cmd = { > @@ -183,12 +321,53 @@ int cxl_mbox_send_cmd(struct cxl_dev_state *cxlds, u16 opcode, void *in, > if (in_size > cxlds->payload_size || out_size > cxlds->payload_size) > return -E2BIG; > > + /* > + * With bg polling this can overlap a scenario where the > + * hardware can receive new requests but the driver is > + * not ready. Handle accordingly. > + */ > + if (!cxlds->mbox_irq) { > + rc = cxl_mbox_check_cmd_bgcmd(cxlds, opcode); > + if (rc) > + return rc; > + } Can't happen with the straight-line approach as all an interrupt will do is wake up a wait_event_timeout() instance for the currently executing background command. 
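i.e. something like (sketch; mbox_wait, cxl_bg_done() and timeout_ms are illustrative names only, not from the patch):

	static irqreturn_t cxl_mbox_irq(int irq, void *data)
	{
		struct cxl_dev_state *cxlds = data;

		/* nothing to do but kick the synchronous waiter */
		wake_up(&cxlds->mbox_wait);
		return IRQ_HANDLED;
	}

	/* in the send path, instead of rescheduling a poller */
	wait_event_timeout(cxlds->mbox_wait, cxl_bg_done(cxlds),
			   msecs_to_jiffies(timeout_ms));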
> + > rc = cxlds->mbox_send(cxlds, &mbox_cmd); > + if (return_code) > + *return_code = mbox_cmd.return_code; > if (rc) > return rc; > > - if (mbox_cmd.return_code != CXL_MBOX_CMD_RC_SUCCESS) > + switch (mbox_cmd.return_code) { > + case CXL_MBOX_CMD_RC_BACKGROUND: > + { > + int err; > + > + dev_dbg(cxlds->dev, "Opcode 0x%04x: %s\n", opcode, > + cxl_mbox_cmd_rc2str(&mbox_cmd)); > + > + err = cxl_mbox_bgcmd_begin(cxlds, opcode); > + if (err) { > + dev_WARN(cxlds->dev, > + "Corrupted background cmd (opcode 0x%04x)\n", > + err); > + return -ENXIO; > + } > + > + get_device(cxlds->dev); > + > + if (!cxlds->mbox_irq) { > + /* do first poll check asap before using any hinting */ > + queue_delayed_work(cxl_mbox_bgpoll_wq, > + &cxlds->mbox_bgpoll, 0); > + } > + break; > + } > + case CXL_MBOX_CMD_RC_SUCCESS: > + break; > + default: > return cxl_mbox_cmd_rc2errno(&mbox_cmd); > + } > > /* > * Variable sized commands can't be validated and so it's up to the > @@ -575,7 +754,7 @@ static int cxl_xfer_log(struct cxl_dev_state *cxlds, uuid_t *uuid, u32 size, u8 > int rc; > > rc = cxl_mbox_send_cmd(cxlds, CXL_MBOX_OP_GET_LOG, &log, sizeof(log), > - out, xfer_size); > + out, xfer_size, NULL); In the above referenced patch I fixed up cxl_internal_send_cmd() to derive all its parameters from a passed-in 'struct cxl_mbox_cmd', that should make it easier to add new parameters. I'll stop here until we can come to consensus on how general this solution needs to be. I think this really wants to have a single background command use case in mind to answer some of the use-case questions.
On Tue, 06 Dec 2022, Dave Jiang wrote: >On 12/5/2022 6:14 PM, Davidlohr Bueso wrote: >>+static struct workqueue_struct *cxl_mbox_bgpoll_wq; > >Do we want per device wq rather than a global one? Just thinking if >you have a device go away and you need to flush outstanding commands, >you might impact the other devices. I'm not sure I completely follow here as the device is refcounted so it doesn't go away while the op is in progress. But this actually makes me think that the current usage is broken as nothing is actually serializing access to queue on the cxl_mbox_bgpoll_wq. So two different devices could be adding themselves concurrently. And doing this per-device actually solves the above nicely, and no queue is actually needed as self would always be the only one. ... >>+/** >>+ * struct cxl_mbox_bgcmd_ops - Optional ad-hoc handling of background cmds >>+ * @post: Execute after the command completes >>+ * @conflicts: How to handle a @new command when one is currently executing >>+ * (only for polling mode) >>+ * @delay: Delay hint for polling on command completion >>+ */ >>+struct cxl_mem_bgcommand_ops { >>+ void (*post)(struct cxl_dev_state *cxlds, bool success); > >"post" here I think conveys different meaning than "after". Typically >it is used in the context of issuing the command. To remove confusion, >maybe call it "end_cmd" or "finish"? Sure. > > >>+ int (*conflicts)(u16 new); > >has_conflicts() may be better. Also, it can return a bool to indicate >whether there are conflicts. I'd rather have the callback return the error directly, which is why it's an int. > >>+ unsigned long (*delay)(struct cxl_dev_state *cxlds); > >"poll_delay"? Sure. > >>+}; >>+ >> /** >> * struct cxl_mem_command - Driver representation of a memory device command >> * @info: Command information as it exists for the UAPI >> * @opcode: The actual bits used for the mailbox protocol >> * @flags: Set of flags effecting driver behavior. >>+ * @bg_ops: Set of optional callbacks to handle commands that can run in >>+ * the background. >> * >> * * %CXL_CMD_FLAG_FORCE_ENABLE: In cases of error, commands with this flag >> * will be enabled by the driver regardless of what hardware may have >> * advertised. >> * >>+ * * %CXL_CMD_FLAG_BACKGROUND: The command may execute asynchronously in the >>+ * background if the operation takes longer than ~2 seconds. The semantics >>+ * are: >>+ * - Only a single of these commands can run at a time. >>+ * - Depending on the nature of the operation, devices may continue to >>+ * accept new non-background commands. >>+ * >> * The cxl_mem_command is the driver's internal representation of commands that >> * are supported by the driver. Some of these commands may not be supported by >> * the hardware. 
The driver will use @info to validate the fields passed in by >>@@ -374,8 +422,10 @@ struct cxl_mem_command { >> struct cxl_command_info info; >> enum cxl_opcode opcode; >> u32 flags; >>+ struct cxl_mem_bgcommand_ops __maybe_unused *bgops; >> #define CXL_CMD_FLAG_NONE 0 >> #define CXL_CMD_FLAG_FORCE_ENABLE BIT(0) >>+#define CXL_CMD_FLAG_BACKGROUND BIT(1) >> }; >> #define CXL_PMEM_SEC_STATE_USER_PASS_SET 0x01 >>@@ -414,7 +464,8 @@ enum { >> }; >> int cxl_mbox_send_cmd(struct cxl_dev_state *cxlds, u16 opcode, void *in, >>- size_t in_size, void *out, size_t out_size); >>+ size_t in_size, void *out, size_t out_size, >>+ u16 *return_code); > >Given the existing commands don't care about return_code and only the >new ones do, and to minimize code changes and having to chase down >every call of the existing command, you can maybe do something like: > >int cxl_mbox_send_cmd_with_return(struct cxl_dev_state *cxlds, > u16 opcode, void *in, size_t in_size, > void *out, size_t out_size, > u16 *return_code) >{ > ..... >} > >int cxl_mbox_send_cmd(struct cxl_dev_state *cxlds, > u16 opcode, void *in, size_t in_size, > void *out, size_t out_size) >{ > return cxl_mbox_send_cmd_with_return(cxlds, opcode, in, in_size, > out, out_size, NULL); >} I thought about this as Jonathan wasn't too fond of this either. But considering Dan's recent rework of cxl_mbox_send_cmd() I have to redo these and should be cleaner anyway. Thanks, Davidlohr
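To make the per-device argument concrete, a sketch (assuming mbox_bgpoll stays in 'struct cxl_dev_state' as in the patch): with no shared workqueue there is nothing global to flush; unbind only needs to quiesce this device's own poller.

	/* per-device teardown, e.g. on driver unbind */
	cancel_delayed_work_sync(&cxlds->mbox_bgpoll);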
On 12/6/2022 8:42 PM, Davidlohr Bueso wrote: > On Tue, 06 Dec 2022, Dave Jiang wrote: > >> On 12/5/2022 6:14 PM, Davidlohr Bueso wrote: >>> +static struct workqueue_struct *cxl_mbox_bgpoll_wq; >> >> Do we want per device wq rather than a global one? Just thinking if >> you have a device go away and you need to flush outstanding commands, >> you might impact the other devices. > > I'm not sure I completely follow here as the device is refcounted > so it doesn't go away while the op is in progress. But this actually i.e. somebody unbinds the driver from a device for whatever reason. DJ > makes me think that the current usage is broken as nothing is actually > serializing access to queue on the cxl_mbox_bgpoll_wq. So two different > devices could be adding themselves concurrently. > > And doing this per-device actually solves the above nicely, and no > queue is actually needed as self would always be the only one. > > ... > >>> +/** >>> + * struct cxl_mbox_bgcmd_ops - Optional ad-hoc handling of >>> background cmds >>> + * @post: Execute after the command completes >>> + * @conflicts: How to handle a @new command when one is currently >>> executing >>> + * (only for polling mode) >>> + * @delay: Delay hint for polling on command completion >>> + */ >>> +struct cxl_mem_bgcommand_ops { >>> + void (*post)(struct cxl_dev_state *cxlds, bool success); >> >> "post" here I think conveys different meaning than "after". Typically >> it is used in the context of issuing the command. To remove confusion, >> maybe call it "end_cmd" or "finish"? > > Sure. > >> >> >>> + int (*conflicts)(u16 new); >> >> has_conflicts() may be better. Also, it can return a bool to indicate >> whether there are conflicts. > > I'd rather have the callback return the error directly, which is why > it's an int. > >> >>> + unsigned long (*delay)(struct cxl_dev_state *cxlds); >> >> "poll_delay"? > > Sure. > >> >>> +}; >>> + >>> /** >>> * struct cxl_mem_command - Driver representation of a memory device >>> command >>> * @info: Command information as it exists for the UAPI >>> * @opcode: The actual bits used for the mailbox protocol >>> * @flags: Set of flags effecting driver behavior. >>> + * @bg_ops: Set of optional callbacks to handle commands that can >>> run in >>> + * the background. >>> * >>> * * %CXL_CMD_FLAG_FORCE_ENABLE: In cases of error, commands with >>> this flag >>> * will be enabled by the driver regardless of what hardware may >>> have >>> * advertised. >>> * >>> + * * %CXL_CMD_FLAG_BACKGROUND: The command may execute >>> asynchronously in the >>> + * background if the operation takes longer than ~2 seconds. The >>> semantics >>> + * are: >>> + * - Only a single of these commands can run at a time. >>> + * - Depending on the nature of the operation, devices may >>> continue to >>> + * accept new non-background commands. >>> + * >>> * The cxl_mem_command is the driver's internal representation of >>> commands that >>> * are supported by the driver. Some of these commands may not be >>> supported by >>> * the hardware.
The driver will use @info to validate the fields >>> passed in by >>> @@ -374,8 +422,10 @@ struct cxl_mem_command { >>> struct cxl_command_info info; >>> enum cxl_opcode opcode; >>> u32 flags; >>> + struct cxl_mem_bgcommand_ops __maybe_unused *bgops; >>> #define CXL_CMD_FLAG_NONE 0 >>> #define CXL_CMD_FLAG_FORCE_ENABLE BIT(0) >>> +#define CXL_CMD_FLAG_BACKGROUND BIT(1) >>> }; >>> #define CXL_PMEM_SEC_STATE_USER_PASS_SET 0x01 >>> @@ -414,7 +464,8 @@ enum { >>> }; >>> int cxl_mbox_send_cmd(struct cxl_dev_state *cxlds, u16 opcode, void >>> *in, >>> - size_t in_size, void *out, size_t out_size); >>> + size_t in_size, void *out, size_t out_size, >>> + u16 *return_code); >> >> Given the existing commands don't care about return_code and only the >> new ones do, and to minimize code changes and having to chase down >> every call of the existing command, you can maybe do something like: >> >> int cxl_mbox_send_cmd_with_return(struct cxl_dev_state *cxlds, >> u16 opcode, void *in, size_t in_size, >> void *out, size_t out_size, >> u16 *return_code) >> { >> ..... >> } >> >> int cxl_mbox_send_cmd(struct cxl_dev_state *cxlds, >> u16 opcode, void *in, size_t in_size, >> void *out, size_t out_size) >> { >> return cxl_mbox_send_cmd_with_return(cxlds, opcode, in, in_size, >> out, out_size, NULL); >> } > > I thought about this as Jonathan wasn't too fond of this either. But > considering Dan's recent rework of cxl_mbox_send_cmd() I have to redo > these and should be cleaner anyway. > > Thanks, > Davidlohr
On 12/6/2022 6:55 PM, Dan Williams wrote: > Davidlohr Bueso wrote: >> This adds support for handling background operations, as defined in >> the CXL 3.0 spec. Commands that can take too long (over ~2 seconds) >> can run asynchronously in the background, these are limited to: >> transfer/activate firmware, scan media, sanitize (aka overwrite) >> and VPPB bind/unbind. >> >> Completion of background operations, successful or not, can be handled >> via irq or poll based. This patch deals only with polling, which >> introduces some complexities as we need to handle the window in which >> the original background command completed, then a new one is >> successfully started before the poller wakes up and checks. This, >> in theory, can occur any amount of times. One of the problems is >> that previous states cannot be tracked as hw background status >> registers always deal with the last command. >> >> So in order to keep hardware and the driver in sync, there can be >> windows where the hardware is ready but the driver won't be, and >> thus must serialize/handle it accordingly. While this polling cost >> may not be ideal: 1) background commands are rare/limited and >> 2) the extra busy time should be small compared to the overall >> command duration and 3) devices that support mailbox interrupts >> do not use this. >> >> The implementation extends struct cxl_mem_command to become aware >> of background-capable commands in a generic fashion and presents >> some interfaces: >> >> - Calls for bg operations, where each bg command can choose to implement >> whatever needed based on the nature of the operation. For example, >> while running, overwrite may only allow certain related commands to >> occur, while scan media does not have any such limitations. >> Delay times can also be different, for which ad-hoc hinting can >> be defined - for example, scan media could use some value based on >> GET_SCAN_MEDIA_CAPS and overwrite has predefined values based on >> pmem DSM specs[0]. Similarly some commands need to execute tasks >> once the command finishes, such as overwriting requires CPU flushing >> when successfully done. These are: >> >> cxl_mbox_bgcmd_conflicts() >> cxl_mbox_bgcmd_delay() >> cxl_mbox_bgcmd_post() >> >> - cxl_mbox_send_cmd() is extended such that callers can distinguish, >> upon rc == 0, between completed and successfully started in the >> background. >> >> - cxl_mbox_bgcmd_running() will atomically tell which command is >> running in the background, if any. This allows easy querying >> functionality. Similarly, there are cxl_mbox_bgcmd_start() and >> cxl_mbox_bgcmd_done() to safely mark the in-flight operation. >> While x86 serializes atomics, care must be taken with arm64, for >> example, ensuring, minimally, release/acquire semantics. >> >> There are currently no supported commands. 
>> >> [0] https://pmem.io/documents/IntelOptanePMem_DSM_Interface-V2.0.pdf >> >> Signed-off-by: Davidlohr Bueso <dave@stgolabs.net> >> --- >> drivers/cxl/core/core.h | 3 +- >> drivers/cxl/core/mbox.c | 213 ++++++++++++++++++++++++++++++++++++++-- >> drivers/cxl/core/port.c | 1 + >> drivers/cxl/cxl.h | 8 ++ >> drivers/cxl/cxlmem.h | 55 ++++++++++- >> drivers/cxl/pci.c | 4 + >> drivers/cxl/pmem.c | 5 +- >> drivers/cxl/security.c | 13 +-- >> 8 files changed, 282 insertions(+), 20 deletions(-) >> >> diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h >> index 1d8f87be283f..2a71664a0668 100644 >> --- a/drivers/cxl/core/core.h >> +++ b/drivers/cxl/core/core.h >> @@ -69,6 +69,7 @@ static inline struct cxl_ep *cxl_ep_load(struct cxl_port *port, >> >> int cxl_memdev_init(void); >> void cxl_memdev_exit(void); >> -void cxl_mbox_init(void); >> +int cxl_mbox_init(void); >> +void cxl_mbox_exit(void); >> >> #endif /* __CXL_CORE_H__ */ >> diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c >> index 35dd889f1d3a..bfee9251c81c 100644 >> --- a/drivers/cxl/core/mbox.c >> +++ b/drivers/cxl/core/mbox.c >> @@ -10,6 +10,7 @@ >> #include "core.h" >> >> static bool cxl_raw_allow_all; >> +static struct workqueue_struct *cxl_mbox_bgpoll_wq; >> >> /** >> * DOC: cxl mbox >> @@ -24,7 +25,7 @@ static bool cxl_raw_allow_all; >> for ((cmd) = &cxl_mem_commands[0]; \ >> ((cmd) - cxl_mem_commands) < ARRAY_SIZE(cxl_mem_commands); (cmd)++) >> >> -#define CXL_CMD(_id, sin, sout, _flags) \ >> +#define __CXL_CMD(_id, sin, sout, _flags, _bgops) \ >> [CXL_MEM_COMMAND_ID_##_id] = { \ >> .info = { \ >> .id = CXL_MEM_COMMAND_ID_##_id, \ >> @@ -33,8 +34,13 @@ static bool cxl_raw_allow_all; >> }, \ >> .opcode = CXL_MBOX_OP_##_id, \ >> .flags = _flags, \ >> + .bgops = _bgops, \ >> } >> >> +#define CXL_CMD(_id, sin, sout, _flags) __CXL_CMD(_id, sin, sout, _flags, NULL) >> +#define CXL_BGCMD(_id, sin, sout, _flags, _bgops) \ >> + __CXL_CMD(_id, sin, sout, _flags | CXL_CMD_FLAG_BACKGROUND, _bgops) >> + >> #define CXL_VARIABLE_PAYLOAD ~0U >> /* >> * This table defines the supported mailbox commands for the driver. This table >> @@ -63,7 +69,7 @@ static struct cxl_mem_command cxl_mem_commands[CXL_MEM_COMMAND_ID_MAX] = { >> CXL_CMD(INJECT_POISON, 0x8, 0, 0), >> CXL_CMD(CLEAR_POISON, 0x48, 0, 0), >> CXL_CMD(GET_SCAN_MEDIA_CAPS, 0x10, 0x4, 0), >> - CXL_CMD(SCAN_MEDIA, 0x11, 0, 0), >> + CXL_BGCMD(SCAN_MEDIA, 0x11, 0, 0, NULL), > > One of the recent revelations with this patch: > > https://lore.kernel.org/linux-cxl/167030056464.4044561.11486507095384253833.stgit@dwillia2-xfh.jf.intel.com/ > > ...is that commands that are not meant to be called from userspace > should simply never appear in CXL_CMDS nor cxl_mem_commands. The > motivation for defining SCAN_MEDIA and friends was to use their > definitions to block the RAW command passthrough, but ironically that > enables the ioctl path which is also unwanted. > > So I think this patchset first needs to mark those uapi command numbers > deprecated. Then ensure that all other background commands can not be > submitted via the ioctl path. 
> >> CXL_CMD(GET_SCAN_MEDIA, 0, CXL_VARIABLE_PAYLOAD, 0), >> CXL_CMD(GET_SECURITY_STATE, 0, 0x4, 0), >> CXL_CMD(SET_PASSPHRASE, 0x60, 0, 0), >> @@ -145,6 +151,136 @@ static const char *cxl_mem_opcode_to_name(u16 opcode) >> return cxl_command_names[c->info.id].name; >> } >> >> +static unsigned long __maybe_unused >> +cxl_mbox_bgcmd_delay(struct cxl_dev_state *cxlds, u16 opcode) >> +{ >> + struct cxl_mem_command *c; >> + unsigned long ret = 0; >> + >> + c = cxl_mem_find_command(opcode); > > Per that change above, cxl_mem_find_command() is only for the uapi. > >> + if (!c || !(c->flags & CXL_CMD_FLAG_BACKGROUND)) { >> + dev_WARN_ONCE(cxlds->dev, true, >> + "Not a background command\n"); >> + return 0; >> + } >> + >> + if (c->bgops && c->bgops->delay) >> + ret = c->bgops->delay(cxlds); >> + >> + return ret * HZ; >> +} > > In general I worry that this is a bit too flexible. Instead of callback > ops I think this wants to be straight line code with something like a > poll_interval argument in 'struct cxl_mbox_cmd' where 0 means the > command is immediate and non-zero is a number of jiffies between > background operation polling intervals. > >> + >> +static void cxl_mbox_bgcmd_post(struct cxl_dev_state *cxlds, >> + u16 opcode, bool success) >> +{ >> + struct cxl_mem_command *c; >> + >> + c = cxl_mem_find_command(opcode); >> + if (!c || !(c->flags & CXL_CMD_FLAG_BACKGROUND)) { >> + dev_WARN_ONCE(cxlds->dev, true, >> + "Not a background command\n"); >> + return; >> + } >> + >> + if (c->bgops && c->bgops->post) >> + c->bgops->post(cxlds, success); >> +} >> + >> +/* >> + * Ensure that ->mbox_send(@new) can run safely when a background >> + * command is running. If so, returns zero, otherwise error. >> + * >> + * check 1. bg cmd running. If not running it is fine to >> + * send any command. If one is running then there >> + * may be restrictions. >> + * check 2. @new incoming command is capable of running >> + * in the background. If there is an in-flight bg >> + * operation, all these are forbidden as we need >> + * to serialize when polling. >> + * check 3. @new incoming command is not blacklisted by the >> + * current running command. >> + */ >> +static int __maybe_unused >> +cxl_mbox_check_cmd_bgcmd(struct cxl_dev_state *cxlds, u16 new) >> +{ >> + struct cxl_mem_command *c; >> + >> + /* 1 */ >> + if (likely(!cxl_mbox_bgcmd_running(cxlds))) >> + return 0; >> + >> + c = cxl_mem_find_command(new); >> + if (!c) >> + return -EINVAL; >> + >> + /* 2 */ >> + if (c->flags & CXL_CMD_FLAG_BACKGROUND) >> + return -EBUSY; >> + >> + /* 3 */ >> + if (c->bgops && c->bgops->conflicts) >> + return c->bgops->conflicts(new); >> + >> + return 0; >> +} >> + >> +/* >> + * Background operation polling mode. >> + */ >> +void cxl_mbox_bgcmd_work(struct work_struct *work) > > Is a workqueue needed? If the goal is to make it so that all the kernel > background command use cases chop up their work so that one command can > not monopolize the mailbox for a long amount of time then the > implementation can also assume that foreground commands that collide > with an in-flight background operation can just wait synchronously for > their time slice. > > That makes the handling simple, acquire lock, poll for completion, > and everyone else just queues up on the lock. Aren't certain commands valid while a background command is ongoing? Just thinking if you are doing a sanitize (overwrite), for a pmem it may take a LONG time. And maybe you don't want to block everything for the duration of that command completion?
Maybe I'm not understanding what you are saying here. DJ > >> +{ >> + struct cxl_dev_state *cxlds; >> + u64 bgcmd_status_reg; >> + u16 opcode; >> + u32 pct; >> + >> + cxlds = container_of(work, struct cxl_dev_state, mbox_bgpoll.work); >> + >> + dev_WARN_ONCE(cxlds->dev, !!cxlds->mbox_irq, >> + "Polling mailbox when IRQs are supported\n"); >> + >> + bgcmd_status_reg = readq(cxlds->regs.mbox + >> + CXLDEV_MBOX_BG_CMD_STATUS_OFFSET); >> + opcode = FIELD_GET(CXLDEV_MBOX_BG_CMD_COMMAND_OPCODE_MASK, >> + bgcmd_status_reg); >> + >> + pct = FIELD_GET(CXLDEV_MBOX_BG_CMD_COMMAND_PCT_MASK, bgcmd_status_reg); >> + if (pct != 100) { >> + unsigned long hint; >> + u64 status_reg; >> + >> + status_reg = readq(cxlds->regs.mbox + CXLDEV_MBOX_STATUS_OFFSET); >> + if (!FIELD_GET(CXLDEV_MBOX_STATUS_BACKGROUND_OPERATION, >> + status_reg)) >> + dev_WARN(cxlds->dev, >> + "No background operation running (%u%%).\n", >> + pct); >> + >> + hint = cxl_mbox_bgcmd_delay(cxlds, opcode); >> + if (hint == 0) { >> + /* >> + * TODO: try to do better(?). pct is updated every >> + * ~1 sec, maybe use a heurstic based on that. >> + */ >> + hint = 5 * HZ; >> + } >> + >> + queue_delayed_work(cxl_mbox_bgpoll_wq, >> + &cxlds->mbox_bgpoll, hint); >> + } else { /* bg operation completed */ >> + u16 return_code; >> + >> + return_code = FIELD_GET(CXLDEV_MBOX_BG_CMD_COMMAND_RC_MASK, >> + bgcmd_status_reg); >> + cxl_mbox_bgcmd_post(cxlds, opcode, >> + return_code == CXL_MBOX_CMD_RC_SUCCESS); >> + >> + put_device(cxlds->dev); >> + cxl_mbox_bgcmd_done(cxlds); >> + } >> +} >> +EXPORT_SYMBOL_NS_GPL(cxl_mbox_bgcmd_work, CXL); >> + >> /** >> * cxl_mbox_send_cmd() - Send a mailbox command to a device. >> * @cxlds: The device data for the operation >> @@ -153,6 +289,7 @@ static const char *cxl_mem_opcode_to_name(u16 opcode) >> * @in_size: The length of the input payload >> * @out: Caller allocated buffer for the output. >> * @out_size: Expected size of output. >> + * @return_code: (optional output) HW returned code. >> * >> * Context: Any context. >> * Return: >> @@ -167,8 +304,9 @@ static const char *cxl_mem_opcode_to_name(u16 opcode) >> * error. While this distinction can be useful for commands from userspace, the >> * kernel will only be able to use results when both are successful. >> */ >> -int cxl_mbox_send_cmd(struct cxl_dev_state *cxlds, u16 opcode, void *in, >> - size_t in_size, void *out, size_t out_size) >> +int cxl_mbox_send_cmd(struct cxl_dev_state *cxlds, u16 opcode, >> + void *in, size_t in_size, >> + void *out, size_t out_size, u16 *return_code) >> { >> const struct cxl_mem_command *cmd = cxl_mem_find_command(opcode); >> struct cxl_mbox_cmd mbox_cmd = { >> @@ -183,12 +321,53 @@ int cxl_mbox_send_cmd(struct cxl_dev_state *cxlds, u16 opcode, void *in, >> if (in_size > cxlds->payload_size || out_size > cxlds->payload_size) >> return -E2BIG; >> >> + /* >> + * With bg polling this can overlap a scenario where the >> + * hardware can receive new requests but the driver is >> + * not ready. Handle accordingly. >> + */ >> + if (!cxlds->mbox_irq) { >> + rc = cxl_mbox_check_cmd_bgcmd(cxlds, opcode); >> + if (rc) >> + return rc; >> + } > > Can't happen with the straight-line approach as all an interrupt will do > is wake up a wait_event_timeout() instance for the currently executing > background command. 
> >> + >> rc = cxlds->mbox_send(cxlds, &mbox_cmd); >> + if (return_code) >> + *return_code = mbox_cmd.return_code; >> if (rc) >> return rc; >> >> - if (mbox_cmd.return_code != CXL_MBOX_CMD_RC_SUCCESS) >> + switch (mbox_cmd.return_code) { >> + case CXL_MBOX_CMD_RC_BACKGROUND: >> + { >> + int err; >> + >> + dev_dbg(cxlds->dev, "Opcode 0x%04x: %s\n", opcode, >> + cxl_mbox_cmd_rc2str(&mbox_cmd)); >> + >> + err = cxl_mbox_bgcmd_begin(cxlds, opcode); >> + if (err) { >> + dev_WARN(cxlds->dev, >> + "Corrupted background cmd (opcode 0x%04x)\n", >> + err); >> + return -ENXIO; >> + } >> + >> + get_device(cxlds->dev); >> + >> + if (!cxlds->mbox_irq) { >> + /* do first poll check asap before using any hinting */ >> + queue_delayed_work(cxl_mbox_bgpoll_wq, >> + &cxlds->mbox_bgpoll, 0); >> + } >> + break; >> + } >> + case CXL_MBOX_CMD_RC_SUCCESS: >> + break; >> + default: >> return cxl_mbox_cmd_rc2errno(&mbox_cmd); >> + } >> >> /* >> * Variable sized commands can't be validated and so it's up to the >> @@ -575,7 +754,7 @@ static int cxl_xfer_log(struct cxl_dev_state *cxlds, uuid_t *uuid, u32 size, u8 >> int rc; >> >> rc = cxl_mbox_send_cmd(cxlds, CXL_MBOX_OP_GET_LOG, &log, sizeof(log), >> - out, xfer_size); >> + out, xfer_size, NULL); > > In the above referenced patch I fixed up cxl_internal_send_cmd() to > derive all its parameters from a passed in 'struct cxl_mbox_cmd', that > should make it easier to add new parameters. > > I'll stop here until we can come to consensus on how general this > solution needs to be. I think this really wants to have a single > background command use case in mind to answer some of the use case > questions.
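A minimal sketch of the poll_interval model suggested above; the two new fields and the cxl_background_complete() helper are hypothetical, not from the posted patch:

/*
 * Straight-line alternative: no per-command callbacks, just a polling
 * budget carried in the command descriptor itself.
 */
struct cxl_mbox_cmd {
	u16 opcode;
	void *payload_in;
	void *payload_out;
	size_t size_in;
	size_t size_out;
	u16 return_code;
	unsigned long poll_interval;	/* 0 == immediate command */
	int poll_count;			/* intervals before giving up */
};

static int cxl_mbox_wait_background(struct cxl_dev_state *cxlds,
				    struct cxl_mbox_cmd *cmd)
{
	int i;

	for (i = 0; i < cmd->poll_count; i++) {
		/* sleep one slice, then re-check the hardware */
		schedule_timeout_interruptible(cmd->poll_interval);
		if (cxl_background_complete(cxlds))	/* hypothetical */
			return 0;
	}
	return -ETIMEDOUT;
}

Under this model a caller that cannot tolerate a long wait simply picks a small poll_count and handles -ETIMEDOUT itself.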
On Tue, 06 Dec 2022, Dan Williams wrote: >One of the recent revelations with this patch: > >https://lore.kernel.org/linux-cxl/167030056464.4044561.11486507095384253833.stgit@dwillia2-xfh.jf.intel.com/ > >...is that commands that are not meant to be called from userspace >should simply never appear in CXL_CMDS nor cxl_mem_commands. The >motivation for defining SCAN_MEDIA and friends was to use their >definitions to block the RAW command passthrough, but ironically that >enables the ioctl path which is also unwanted. > >So I think this patchset first needs to mark those uapi command numbers >deprecated. Then ensure that all other background commands can not be >submitted via the ioctl path. Ok I will have a look. > >> CXL_CMD(GET_SCAN_MEDIA, 0, CXL_VARIABLE_PAYLOAD, 0), >> CXL_CMD(GET_SECURITY_STATE, 0, 0x4, 0), >> CXL_CMD(SET_PASSPHRASE, 0x60, 0, 0), >> @@ -145,6 +151,136 @@ static const char *cxl_mem_opcode_to_name(u16 opcode) >> return cxl_command_names[c->info.id].name; >> } >> >> +static unsigned long __maybe_unused >> +cxl_mbox_bgcmd_delay(struct cxl_dev_state *cxlds, u16 opcode) >> +{ >> + struct cxl_mem_command *c; >> + unsigned long ret = 0; >> + >> + c = cxl_mem_find_command(opcode); > >Per that change above, cxl_mem_find_command() is only for the uapi. > >> + if (!c || !(c->flags & CXL_CMD_FLAG_BACKGROUND)) { >> + dev_WARN_ONCE(cxlds->dev, true, >> + "Not a background command\n"); >> + return 0; >> + } >> + >> + if (c->bgops && c->bgops->delay) >> + ret = c->bgops->delay(cxlds); >> + >> + return ret * HZ; >> +} > >In general I worry that this is a bit too flexible. Instead of callback >ops I think this wants to be straight line code with something like a >poll_interval argument in 'struct cxl_mbox_cmd' where 0 means the >command is immediate and non-zero is a number of jiffies between >background operation polling intervals. Are you referring _only_ to the delay callback being too flexible or the whole set of callbacks? If only delay, I don't like the idea of having a poll_interval argument *and* a bgops structure with post and conflicts callbacks. We could also move conflicts functionality to another argument in the cxl_mbox_cmd structure, but not delay functionality, which makes consolidating everything in a callback structure nicer imo.
>> + */ >> +static int __maybe_unused >> +cxl_mbox_check_cmd_bgcmd(struct cxl_dev_state *cxlds, u16 new) >> +{ >> + struct cxl_mem_command *c; >> + >> + /* 1 */ >> + if (likely(!cxl_mbox_bgcmd_running(cxlds))) >> + return 0; >> + >> + c = cxl_mem_find_command(new); >> + if (!c) >> + return -EINVAL; >> + >> + /* 2 */ >> + if (c->flags & CXL_CMD_FLAG_BACKGROUND) >> + return -EBUSY; >> + >> + /* 3 */ >> + if (c->bgops && c->bgops->conflicts) >> + return c->bgops->conflicts(new); >> + >> + return 0; >> +} >> + >> +/* >> + * Background operation polling mode. >> + */ >> +void cxl_mbox_bgcmd_work(struct work_struct *work) > >Is a workqueue needed? If the goal is to make it so that all the kernel >background command use cases chop up their work so that one command can >not monopolize the mailbox for a long amount of time then the >implementation can also assume that foreground commands that collide >with an in-flight background operation can just wait synchronously for >their time slice. > >That makes the handling simple, acquire lock, poll for completion, >and everyone else just queues up on the lock. I don't see how associating a lock with the bg op will make it simpler. I happen to think that my approach is simpler than anything involving locking. I'll explore getting rid of the wq in favor of a per-device one as suggested by Dave, but fundamentally it would be along the same lines. If we really want to make things simple, then maybe just not supporting polling would be better - I assume most devices will have irq support anyway. ... >> + /* >> + * With bg polling this can overlap a scenario where the >> + * hardware can receive new requests but the driver is >> + * not ready. Handle accordingly. >> + */ >> + if (!cxlds->mbox_irq) { >> + rc = cxl_mbox_check_cmd_bgcmd(cxlds, opcode); >> + if (rc) >> + return rc; >> + } > >Can't happen with the straight-line approach as all an interrupt will do >is wake up a wait_event_timeout() instance for the currently executing >background command. I don't follow here. Upon mbox irq support, getting an irq will have the isr simply do a: cxl_mbox_bgcmd_post(); put_device(cxlds->dev); cxl_mbox_bgcmd_done(cxlds); return IRQ_HANDLED; ... to terminate the bg command and no wakeup is needed. But polling won't be able to do this, hence the check above. Maybe I'm just missing your point. Thanks, Davidlohr
On Wed, 07 Dec 2022, Davidlohr Bueso wrote: >>>+ /* >>>+ * With bg polling this can overlap a scenario where the >>>+ * hardware can receive new requests but the driver is >>>+ * not ready. Handle accordingly. >>>+ */ >>>+ if (!cxlds->mbox_irq) { >>>+ rc = cxl_mbox_check_cmd_bgcmd(cxlds, opcode); >>>+ if (rc) >>>+ return rc; >>>+ } >> >>Can't happen with the straight-line approach as all an interrupt will do >>is wake up a wait_event_timeout() instance for the currently executing >>background command. > >I don't follow here. Upon mbox irq support, getting an irq will have >the isr simply do a: > > cxl_mbox_bgcmd_post(); > put_device(cxlds->dev); > cxl_mbox_bgcmd_done(cxlds); > > return IRQ_HANDLED; > >... to terminate the bg command and no wakeup is needed. But polling won't >be able to do this hence the check above. Maybe I'm just missing your point. Ok so I'm assuming here cxl_mbox_bgcmd_post() can run in irq context (this would of course need to be documented). If, for example, doing a wbinvd_on_all_cpus() requires a threaded irq (I have seen usage in the kernel that calls it in sleepable contexts), then we cannot avoid the cxl_mbox_check_cmd_bgcmd() check. The same would apply for any other bg command in the future that implements the ->post() callback. Thanks, Davidlohr
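A minimal sketch of the threaded-irq shape this concern points at, reusing the patch's helpers as if they were visible here; the handler names are hypothetical and the success/refcount handling is elided:

static irqreturn_t cxl_mbox_bg_irq(int irq, void *data)
{
	/* hard-irq context: nothing sleepable allowed, defer to thread */
	return IRQ_WAKE_THREAD;
}

static irqreturn_t cxl_mbox_bg_irq_thread(int irq, void *data)
{
	struct cxl_dev_state *cxlds = data;
	u16 opcode = cxl_mbox_bgcmd_running(cxlds);

	/* sleepable context: wbinvd_on_all_cpus() and friends are fine */
	cxl_mbox_bgcmd_post(cxlds, opcode, true);	/* success check elided */
	cxl_mbox_bgcmd_done(cxlds);
	return IRQ_HANDLED;
}

/* registration, e.g. at probe time:
 *	rc = request_threaded_irq(irq, cxl_mbox_bg_irq, cxl_mbox_bg_irq_thread,
 *				  IRQF_ONESHOT, "cxl-mbox-bg", cxlds);
 */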
Dave Jiang wrote: > > > On 12/6/2022 6:55 PM, Dan Williams wrote: > > Davidlohr Bueso wrote: > >> This adds support for handling background operations, as defined in > >> the CXL 3.0 spec. Commands that can take too long (over ~2 seconds) > >> can run asynchronously in the background, these are limited to: > >> transfer/activate firmware, scan media, sanitize (aka overwrite) > >> and VPPB bind/unbind. > >> > >> Completion of background operations, successful or not, can be handled > >> via irq or poll based. This patch deals only with polling, which > >> introduces some complexities as we need to handle the window in which > >> the original background command completed, then a new one is > >> successfully started before the poller wakes up and checks. This, > >> in theory, can occur any amount of times. One of the problems is > >> that previous states cannot be tracked as hw background status > >> registers always deal with the last command. > >> > >> So in order to keep hardware and the driver in sync, there can be > >> windows where the hardware is ready but the driver won't be, and > >> thus must serialize/handle it accordingly. While this polling cost > >> may not be ideal: 1) background commands are rare/limited and > >> 2) the extra busy time should be small compared to the overall > >> command duration and 3) devices that support mailbox interrupts > >> do not use this. > >> > >> The implementation extends struct cxl_mem_command to become aware > >> of background-capable commands in a generic fashion and presents > >> some interfaces: > >> > >> - Calls for bg operations, where each bg command can choose to implement > >> whatever needed based on the nature of the operation. For example, > >> while running, overwrite may only allow certain related commands to > >> occur, while scan media does not have any such limitations. > >> Delay times can also be different, for which ad-hoc hinting can > >> be defined - for example, scan media could use some value based on > >> GET_SCAN_MEDIA_CAPS and overwrite has predefined values based on > >> pmem DSM specs[0]. Similarly some commands need to execute tasks > >> once the command finishes, such as overwriting requires CPU flushing > >> when successfully done. These are: > >> > >> cxl_mbox_bgcmd_conflicts() > >> cxl_mbox_bgcmd_delay() > >> cxl_mbox_bgcmd_post() > >> > >> - cxl_mbox_send_cmd() is extended such that callers can distinguish, > >> upon rc == 0, between completed and successfully started in the > >> background. > >> > >> - cxl_mbox_bgcmd_running() will atomically tell which command is > >> running in the background, if any. This allows easy querying > >> functionality. Similarly, there are cxl_mbox_bgcmd_start() and > >> cxl_mbox_bgcmd_done() to safely mark the in-flight operation. > >> While x86 serializes atomics, care must be taken with arm64, for > >> example, ensuring, minimally, release/acquire semantics. > >> > >> There are currently no supported commands. 
> >> > >> [0] https://pmem.io/documents/IntelOptanePMem_DSM_Interface-V2.0.pdf > >> > >> Signed-off-by: Davidlohr Bueso <dave@stgolabs.net> > >> --- > >> drivers/cxl/core/core.h | 3 +- > >> drivers/cxl/core/mbox.c | 213 ++++++++++++++++++++++++++++++++++++++-- > >> drivers/cxl/core/port.c | 1 + > >> drivers/cxl/cxl.h | 8 ++ > >> drivers/cxl/cxlmem.h | 55 ++++++++++- > >> drivers/cxl/pci.c | 4 + > >> drivers/cxl/pmem.c | 5 +- > >> drivers/cxl/security.c | 13 +-- > >> 8 files changed, 282 insertions(+), 20 deletions(-) > >> > >> diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h > >> index 1d8f87be283f..2a71664a0668 100644 > >> --- a/drivers/cxl/core/core.h > >> +++ b/drivers/cxl/core/core.h > >> @@ -69,6 +69,7 @@ static inline struct cxl_ep *cxl_ep_load(struct cxl_port *port, > >> > >> int cxl_memdev_init(void); > >> void cxl_memdev_exit(void); > >> -void cxl_mbox_init(void); > >> +int cxl_mbox_init(void); > >> +void cxl_mbox_exit(void); > >> > >> #endif /* __CXL_CORE_H__ */ > >> diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c > >> index 35dd889f1d3a..bfee9251c81c 100644 > >> --- a/drivers/cxl/core/mbox.c > >> +++ b/drivers/cxl/core/mbox.c > >> @@ -10,6 +10,7 @@ > >> #include "core.h" > >> > >> static bool cxl_raw_allow_all; > >> +static struct workqueue_struct *cxl_mbox_bgpoll_wq; > >> > >> /** > >> * DOC: cxl mbox > >> @@ -24,7 +25,7 @@ static bool cxl_raw_allow_all; > >> for ((cmd) = &cxl_mem_commands[0]; \ > >> ((cmd) - cxl_mem_commands) < ARRAY_SIZE(cxl_mem_commands); (cmd)++) > >> > >> -#define CXL_CMD(_id, sin, sout, _flags) \ > >> +#define __CXL_CMD(_id, sin, sout, _flags, _bgops) \ > >> [CXL_MEM_COMMAND_ID_##_id] = { \ > >> .info = { \ > >> .id = CXL_MEM_COMMAND_ID_##_id, \ > >> @@ -33,8 +34,13 @@ static bool cxl_raw_allow_all; > >> }, \ > >> .opcode = CXL_MBOX_OP_##_id, \ > >> .flags = _flags, \ > >> + .bgops = _bgops, \ > >> } > >> > >> +#define CXL_CMD(_id, sin, sout, _flags) __CXL_CMD(_id, sin, sout, _flags, NULL) > >> +#define CXL_BGCMD(_id, sin, sout, _flags, _bgops) \ > >> + __CXL_CMD(_id, sin, sout, _flags | CXL_CMD_FLAG_BACKGROUND, _bgops) > >> + > >> #define CXL_VARIABLE_PAYLOAD ~0U > >> /* > >> * This table defines the supported mailbox commands for the driver. This table > >> @@ -63,7 +69,7 @@ static struct cxl_mem_command cxl_mem_commands[CXL_MEM_COMMAND_ID_MAX] = { > >> CXL_CMD(INJECT_POISON, 0x8, 0, 0), > >> CXL_CMD(CLEAR_POISON, 0x48, 0, 0), > >> CXL_CMD(GET_SCAN_MEDIA_CAPS, 0x10, 0x4, 0), > >> - CXL_CMD(SCAN_MEDIA, 0x11, 0, 0), > >> + CXL_BGCMD(SCAN_MEDIA, 0x11, 0, 0, NULL), > > > > One of the recent revelations with this patch: > > > > https://lore.kernel.org/linux-cxl/167030056464.4044561.11486507095384253833.stgit@dwillia2-xfh.jf.intel.com/ > > > > ...is that commands that are not meant to be called from userspace > > should simply never appear in CXL_CMDS nor cxl_mem_commands. The > > motivation for defining SCAN_MEDIA and friends was to use their > > definitions to block the RAW command passthrough, but ironically that > > enables the ioctl path which is also unwanted. > > > > So I think this patchset first needs to mark those uapi command numbers > > deprecated. Then ensure that all other background commands can not be > > submitted via the ioctl path. 
> > > >> CXL_CMD(GET_SCAN_MEDIA, 0, CXL_VARIABLE_PAYLOAD, 0), > >> CXL_CMD(GET_SECURITY_STATE, 0, 0x4, 0), > >> CXL_CMD(SET_PASSPHRASE, 0x60, 0, 0), > >> @@ -145,6 +151,136 @@ static const char *cxl_mem_opcode_to_name(u16 opcode) > >> return cxl_command_names[c->info.id].name; > >> } > >> > >> +static unsigned long __maybe_unused > >> +cxl_mbox_bgcmd_delay(struct cxl_dev_state *cxlds, u16 opcode) > >> +{ > >> + struct cxl_mem_command *c; > >> + unsigned long ret = 0; > >> + > >> + c = cxl_mem_find_command(opcode); > > > > Per that change above, cxl_mem_find_command() is only for the uapi. > > > >> + if (!c || !(c->flags & CXL_CMD_FLAG_BACKGROUND)) { > >> + dev_WARN_ONCE(cxlds->dev, true, > >> + "Not a background command\n"); > >> + return 0; > >> + } > >> + > >> + if (c->bgops && c->bgops->delay) > >> + ret = c->bgops->delay(cxlds); > >> + > >> + return ret * HZ; > >> +} > > > > In general I worry that this is a bit too flexible. Instead of callback > > ops I think this wants to be straight line code with something like a > > poll_interval argument in 'struct cxl_mbox_cmd' where 0 means the > > command is immediate and non-zero is a number of jiffies between > > background operation polling intervals. > > > >> + > >> +static void cxl_mbox_bgcmd_post(struct cxl_dev_state *cxlds, > >> + u16 opcode, bool success) > >> +{ > >> + struct cxl_mem_command *c; > >> + > >> + c = cxl_mem_find_command(opcode); > >> + if (!c || !(c->flags & CXL_CMD_FLAG_BACKGROUND)) { > >> + dev_WARN_ONCE(cxlds->dev, true, > >> + "Not a background command\n"); > >> + return; > >> + } > >> + > >> + if (c->bgops && c->bgops->post) > >> + c->bgops->post(cxlds, success); > >> +} > >> + > >> +/* > >> + * Ensure that ->mbox_send(@new) can run safely when a background > >> + * command is running. If so, returns zero, otherwise error. > >> + * > >> + * check 1. bg cmd running. If not running it is fine to > >> + * send any command. If one is running then there > >> + * may be restrictions. > >> + * check 2. @new incoming command is capable of running > >> + * in the background. If there is an in-flight bg > >> + * operation, all these are forbidden as we need > >> + * to serialize when polling. > >> + * check 3. @new incoming command is not blacklisted by the > >> + * current running command. > >> + */ > >> +static int __maybe_unused > >> +cxl_mbox_check_cmd_bgcmd(struct cxl_dev_state *cxlds, u16 new) > >> +{ > >> + struct cxl_mem_command *c; > >> + > >> + /* 1 */ > >> + if (likely(!cxl_mbox_bgcmd_running(cxlds))) > >> + return 0; > >> + > >> + c = cxl_mem_find_command(new); > >> + if (!c) > >> + return -EINVAL; > >> + > >> + /* 2 */ > >> + if (c->flags & CXL_CMD_FLAG_BACKGROUND) > >> + return -EBUSY; > >> + > >> + /* 3 */ > >> + if (c->bgops && c->bgops->conflicts) > >> + return c->bgops->conflicts(new); > >> + > >> + return 0; > >> +} > >> + > >> +/* > >> + * Background operation polling mode. > >> + */ > >> +void cxl_mbox_bgcmd_work(struct work_struct *work) > > > > Is a workqueue needed? If the goal is to make it so that all the kernel > > background command use cases chop up their work so that one command can > > not monopolize the mailbox for a long amount of time then the > > implementation can also assume that foreground commands that collide > > with an in-flight background operation can just wait synchronously for > > their time slice. > > > > That makes the handling simple, acquire lock, poll for completion, > > and everyone else just queues up on the lock. 
> > Aren't certain commands valid while a background command is ongoing? > Just thinking if you are doing a sanitize (overwrite), for a pmem it may > take a LONG time. And maybe you don't want to block everything for the > duration of that command completion? Maybe I'm not understanding what > you are saying here. > I am saying the possibility that a device implementation *might* allow foreground commands to proceed while background commands are operating is close to useless given the way the specification defines it. The fact that the implementation needs to be prepared for any foreground command to return "Busy" due to background operation conflict and the fact that background operations do not stack leads me to say, no, let's not play those games and just handle everything in-order with no overlap. While it is true that overwrite can occupy ->mbox_send() for a long time, the spec also says that commands that depend on the media being in a 'not disabled' state will also fail. So again, for simplicity, if you start an overwrite, one of the consequences is that all other command requests block or fail until that completes. Likely 'fail' is better than 'block' to encourage user space to be mindful that it cannot overlap overwrite and any other device usage. Otherwise, the complexity needs to have an exceedingly good reason, not just "because the spec might allow it".
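The locking model being argued for here is small enough to sketch inline; cxl_background_complete() is again a hypothetical helper and the sleep length is arbitrary:

static DEFINE_MUTEX(mbox_mutex);

static int cxl_mbox_send_sync(struct cxl_dev_state *cxlds,
			      struct cxl_mbox_cmd *cmd)
{
	int rc;

	mutex_lock(&mbox_mutex);
	rc = cxlds->mbox_send(cxlds, cmd);
	if (rc == 0 && cmd->return_code == CXL_MBOX_CMD_RC_BACKGROUND) {
		/* hold the lock for the command's whole time slice */
		while (!cxl_background_complete(cxlds))
			msleep(1000);
	}
	mutex_unlock(&mbox_mutex);
	return rc;
}

Every other caller simply sleeps on mbox_mutex, so there is no window where hardware and driver disagree about what is in flight; the remaining policy question is only whether contenders should block like this or fail with -EBUSY.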
Davidlohr Bueso wrote: > On Tue, 06 Dec 2022, Dan Williams wrote: > > >One of the recent revelations with this patch: > > > >https://lore.kernel.org/linux-cxl/167030056464.4044561.11486507095384253833.stgit@dwillia2-xfh.jf.intel.com/ > > > >...is that commands that are not meant to be called from userspace > >should simply never appear in CXL_CMDS nor cxl_mem_commands. The > >motivation for defining SCAN_MEDIA and friends was to use their > >definitions to block the RAW command passthrough, but ironically that > >enables the ioctl path which is also unwanted. > > > >So I think this patchset first needs to mark those uapi command numbers > >deprecated. Then ensure that all other background commands can not be > >submitted via the ioctl path. > > Ok I will have a look. > > > > > >> CXL_CMD(GET_SCAN_MEDIA, 0, CXL_VARIABLE_PAYLOAD, 0), > >> CXL_CMD(GET_SECURITY_STATE, 0, 0x4, 0), > >> CXL_CMD(SET_PASSPHRASE, 0x60, 0, 0), > >> @@ -145,6 +151,136 @@ static const char *cxl_mem_opcode_to_name(u16 opcode) > >> return cxl_command_names[c->info.id].name; > >> } > >> > >> +static unsigned long __maybe_unused > >> +cxl_mbox_bgcmd_delay(struct cxl_dev_state *cxlds, u16 opcode) > >> +{ > >> + struct cxl_mem_command *c; > >> + unsigned long ret = 0; > >> + > >> + c = cxl_mem_find_command(opcode); > > > >Per that change above, cxl_mem_find_command() is only for the uapi. > > > >> + if (!c || !(c->flags & CXL_CMD_FLAG_BACKGROUND)) { > >> + dev_WARN_ONCE(cxlds->dev, true, > >> + "Not a background command\n"); > >> + return 0; > >> + } > >> + > >> + if (c->bgops && c->bgops->delay) > >> + ret = c->bgops->delay(cxlds); > >> + > >> + return ret * HZ; > >> +} > > > >In general I worry that this is a bit too flexible. Instead of callback > >ops I think this wants to be straight line code with something like a > >poll_interval argument in 'struct cxl_mbox_cmd' where 0 means the > >command is immediate and non-zero is a number of jiffies between > >background operation polling intervals. > > Are you referring _only_ to the delay callback being too flexible or the > whole set of callbacks? If only delay, I don't like the idea of > having a poll_interval argument *and* a bgops structure with post and > conflicts callbacks. We could also move conflicts functionality to > another argument in the cxl_mbox_cmd structure, but not delay functionality, > which makes consolidating everything in a callback structure nicer imo. I am referring to all of the callback operations, because the callback design assumes that the CXL specification actually defined a functional background command facility, which it did not. A functional background mechanism would arrange for queueing to overlap with result reaping, like an NVME ring buffer*. As it stands the driver needs to be careful to reap previous results before issuing the next command, which complicates the implementation for little benefit. A straight-line implementation eliminates the conflict concept. It moves the details to the caller about how to time-slice the operation, or in the case of overwrite, allow it to monopolize the mailbox since the device is effectively dead during the overwrite. --- * it turns out the Bind vPPB command got this right. The command completion is queued to an event, not to the singular background command status that needs collision protection. 
> > > > >> + > >> +static void cxl_mbox_bgcmd_post(struct cxl_dev_state *cxlds, > >> + u16 opcode, bool success) > >> +{ > >> + struct cxl_mem_command *c; > >> + > >> + c = cxl_mem_find_command(opcode); > >> + if (!c || !(c->flags & CXL_CMD_FLAG_BACKGROUND)) { > >> + dev_WARN_ONCE(cxlds->dev, true, > >> + "Not a background command\n"); > >> + return; > >> + } > >> + > >> + if (c->bgops && c->bgops->post) > >> + c->bgops->post(cxlds, success); > >> +} > >> + > >> +/* > >> + * Ensure that ->mbox_send(@new) can run safely when a background > >> + * command is running. If so, returns zero, otherwise error. > >> + * > >> + * check 1. bg cmd running. If not running it is fine to > >> + * send any command. If one is running then there > >> + * may be restrictions. > >> + * check 2. @new incoming command is capable of running > >> + * in the background. If there is an in-flight bg > >> + * operation, all these are forbidden as we need > >> + * to serialize when polling. > >> + * check 3. @new incoming command is not blacklisted by the > >> + * current running command. > >> + */ > >> +static int __maybe_unused > >> +cxl_mbox_check_cmd_bgcmd(struct cxl_dev_state *cxlds, u16 new) > >> +{ > >> + struct cxl_mem_command *c; > >> + > >> + /* 1 */ > >> + if (likely(!cxl_mbox_bgcmd_running(cxlds))) > >> + return 0; > >> + > >> + c = cxl_mem_find_command(new); > >> + if (!c) > >> + return -EINVAL; > >> + > >> + /* 2 */ > >> + if (c->flags & CXL_CMD_FLAG_BACKGROUND) > >> + return -EBUSY; > >> + > >> + /* 3 */ > >> + if (c->bgops && c->bgops->conflicts) > >> + return c->bgops->conflicts(new); > >> + > >> + return 0; > >> +} > >> + > >> +/* > >> + * Background operation polling mode. > >> + */ > >> +void cxl_mbox_bgcmd_work(struct work_struct *work) > > > >Is a workqueue needed? If the goal is to make it so that all the kernel > >background command use cases chop up their work so that one command can > >not monopolize the mailbox for a long amount of time then the > >implementation can also assume that foreground commands that collide > >with an in-flight background operation can just wait synchronously for > >their time slice. > > > >That makes the handling simple, acquire lock, poll for completion, > >and everyone else just queues up on the lock. > > I don't see how associating a lock to the bg op will make it simpler. I > happen to think that my approach is simpler than anything involving locking. > I'll explore getting rid of the wq in favor of per-device as suggested by > Dave, but fundamentally would be along the same lines. > > If we really want to make things simple, than maybe just not supporting > polling would be better - I assume most devices will have irq support > anyway. Hmm, I'm just not (yet) grokking how the callbacks or irq support simplify the implementation. Perhaps I can explain better in code... I see irq support as a small incremental extension of polling support, not something that precludes polling support. 
For example (untested / unfinished): diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c index 3a66aadb4df0..30b21ae4cbe5 100644 --- a/drivers/cxl/pci.c +++ b/drivers/cxl/pci.c @@ -178,6 +178,19 @@ static int __cxl_pci_mbox_send_cmd(struct cxl_dev_state *cxlds, mbox_cmd->return_code = FIELD_GET(CXLDEV_MBOX_STATUS_RET_CODE_MASK, status_reg); + if (mbox_cmd->return_code == CXL_MBOX_CMD_RC_BACKGROUND) { + int i; + + for (i = 0; i < mbox_cmd->poll_count; i++) + rc = wait_event_interruptible_timeout( + &mbox_wait, cxl_background_complete(cxlds), + mbox_cmd->poll_interval); + + if (!cxl_background_complete(cxlds)) + cxl_cmd_err(cxlds->dev, mbox_cmd, md_status, + "background timeout"); + } + if (mbox_cmd->return_code != CXL_MBOX_CMD_RC_SUCCESS) { dev_dbg(dev, "Mailbox operation had an error: %s\n", cxl_mbox_cmd_rc2str(mbox_cmd)); ...where all irq support adds is a small handler that does: wake_up(&mbox_wait); return IRQ_HANDLED; ...to short-circuit the wait in __cxl_pci_mbox_send_cmd(). The cxl_internal_send_cmd() caller gets a synchronous response and can handle it from there. No need for conflict detection nor async execution contexts. Caller is responsible for making sure ->poll_count and ->poll_interval are reasonable. > > ... > > >> + /* > >> + * With bg polling this can overlap a scenario where the > >> + * hardware can receive new requests but the driver is > >> + * not ready. Handle accordingly. > >> + */ > >> + if (!cxlds->mbox_irq) { > >> + rc = cxl_mbox_check_cmd_bgcmd(cxlds, opcode); > >> + if (rc) > >> + return rc; > >> + } > > > >Can't happen with the straight-line approach as all an interrupt will do > >is wake up a wait_event_timeout() instance for the currently executing > >background command. > > I don't follow here. Upon mbox irq support, getting an irq will have have > the isr simply do a: > > cxl_mbox_bgcmd_post(); > put_device(cxlds->dev); Side comment: an isr should never need put_device() because the irq must be shutdown before the device is unregistered. > cxl_mbox_bgcmd_done(cxlds); The complexity I see here is the synchronization needed to make sure that the isr is aligned with the current background command queued for execution and in turn synchronized with incoming commands. My observation is that current mailbox state synchronization is sufficient for: - Transfer FW: broken up into multiple transfers - Scan Media: broken up by Scan Media Physical Address Length values that try to get the Estimated Scan Media Time below CXL_MAILBOX_TIMEOUT_MS per iteration. - Sanitize: the device is effectively dead during this operation, so there is little need to time-share I do have questions about: - Activate FW: not clear how long an activation could take in the online case, and no way to minimize it. Might be something that ends up being handled by Documentation about specific device behavior. - Populate Log: ugh, any log for any reason for an indefinite amount of time. This one needs an example use case of what the kernel would do with it. - Maintenance: I think this is in the same category as Sanitize in terms of recovery of a failed / offline device where command collisions do not matter. ...but I do not see those questions as prohibitive to the one-at-a-time approach for the commands that are going to be enabled in the kernel in the near to medium term. > > return IRQ_HANDLED; > > ... to terminate the bg command and no wakeup is needed. But polling won't > be able to do this hence the check above. Maybe I'm just missing your point. > > Thanks, > Davidlohr
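One detail worth noting about the sketch above: wait_event_interruptible_timeout() takes the wait queue head by name rather than by address, and its return value already separates the interesting outcomes, so no extra state is needed to tell them apart:

/* assuming: static DECLARE_WAIT_QUEUE_HEAD(mbox_wait); */
long rc = wait_event_interruptible_timeout(mbox_wait,
					   cxl_background_complete(cxlds),
					   mbox_cmd->poll_interval);
/*
 * rc > 0            condition became true; rc is the remaining jiffies
 * rc == 0           slice expired with the condition still false
 * rc == -ERESTARTSYS  interrupted by a signal
 */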
On Wed, 07 Dec 2022, Dan Williams wrote: >Hmm, I'm just not (yet) grokking how the callbacks or irq support >simplify the implementation. Perhaps I can explain better in code... So when I say that no polling at all makes the implementation simpler I'm referring to the fact that only some form of ->post() would be needed (in an async context of course). As such, most of the infrastructure in this series is to support polling, and without it we could do away with the other callbacks as well as the atomics. The hardware would take care of "synchronizing" everything - which is similar to what Jonathan had highlighted in the rfc version and was wondering about the explicit checks. >I see irq support as a small incremental extension of polling support, >not something that precludes polling support. For example (untested / >unfinished): > >diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c >index 3a66aadb4df0..30b21ae4cbe5 100644 >--- a/drivers/cxl/pci.c >+++ b/drivers/cxl/pci.c >@@ -178,6 +178,19 @@ static int __cxl_pci_mbox_send_cmd(struct cxl_dev_state *cxlds, > mbox_cmd->return_code = > FIELD_GET(CXLDEV_MBOX_STATUS_RET_CODE_MASK, status_reg); > >+ if (mbox_cmd->return_code == CXL_MBOX_CMD_RC_BACKGROUND) { >+ int i; >+ >+ for (i = 0; i < mbox_cmd->poll_count; i++) >+ rc = wait_event_interruptible_timeout( >+ &mbox_wait, cxl_background_complete(cxlds), >+ mbox_cmd->poll_interval); >+ >+ if (!cxl_background_complete(cxlds)) >+ cxl_cmd_err(cxlds->dev, mbox_cmd, md_status, >+ "background timeout"); >+ } >+ > if (mbox_cmd->return_code != CXL_MBOX_CMD_RC_SUCCESS) { > dev_dbg(dev, "Mailbox operation had an error: %s\n", > cxl_mbox_cmd_rc2str(mbox_cmd)); > >...where all irq support adds is a small handler that does: > > wake_up(&mbox_wait); > return IRQ_HANDLED; > >...to short-circuit the wait in __cxl_pci_mbox_send_cmd(). I agree that under this synchronous scheme polling and irq are quite straightforward. I had always considered an async approach and not playing with timeouts, but just letting the whole thing run to completion. I was also guided by the fact that the spec says: "It is strongly recommended that devices continue to accept new non-background commands while the background operation is running." What I don't like about a synchronous approach is that you get the situation where the driver times out, but the hw is still doing the background command, no? And there is no way to "cancel" it. Sure, any new incoming mbox commands during that time would self-serialize naturally by hw. But when the bg command is done and we have irqs enabled, the isr will be triggered and we no longer have the context. You would also have a discrepancy in a returned timeout error vs a different return code from the actual hw, especially if, for example, the operation was successful. afaict the only thing that would allow users to know the real status of the sent bg command would be the Background Operation Status command, but that is far from ideal. >The cxl_internal_send_cmd() caller gets a synchronous response and can >handle it from there. No need for conflict detection nor async execution >contexts. Caller is responsible for making sure ->poll_count and >->poll_interval are reasonable. ... >The complexity I see here is the synchronization needed to make sure >that the isr is aligned with the current background command queued for >execution and in turn synchronized with incoming commands. 
My >observation is that current mailbox state synchronization is >sufficient for: > >- Transfer FW: broken up into multiple transfers >- Scan Media: broken up by Scan Media Physical Address Length values > that try to get the Estimated Scan Media Time below > CXL_MAILBOX_TIMEOUT_MS per iteration. >- Sanitize: the device is effectively dead during this operation, so > there is little need to time-share But tbh I find it much more intuitive to get an error than having to block for who knows how long. Either way, this would still require special cases such as sanitize where the poller should not time out. Of course there is nothing in the approach from this series that prevents users from being mindful of the background command and breaking things up, where possible. Thanks, Davidlohr
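Worth noting that, with the register definitions the patch adds, the hardware keeps reporting the last background command's opcode, progress and return code, so the real outcome can still be queried after a driver-side timeout. A sketch built only from the patch's own defines:

static void cxl_mbox_bg_query(struct cxl_dev_state *cxlds)
{
	u64 reg = readq(cxlds->regs.mbox + CXLDEV_MBOX_BG_CMD_STATUS_OFFSET);
	u16 opcode = FIELD_GET(CXLDEV_MBOX_BG_CMD_COMMAND_OPCODE_MASK, reg);
	u32 pct = FIELD_GET(CXLDEV_MBOX_BG_CMD_COMMAND_PCT_MASK, reg);
	u16 bg_rc = FIELD_GET(CXLDEV_MBOX_BG_CMD_COMMAND_RC_MASK, reg);

	if (pct == 100)
		dev_dbg(cxlds->dev, "opcode 0x%04x done, rc %u\n",
			opcode, bg_rc);
	else
		dev_dbg(cxlds->dev, "opcode 0x%04x at %u%%\n", opcode, pct);
}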
Davidlohr Bueso wrote: > On Wed, 07 Dec 2022, Dan Williams wrote: > > >Hmm, I'm just not (yet) grokking how the callbacks or irq support > >simplify the implementation. Perhaps I can explain better in code... > > So when I say that no polling at all makes the implementation simpler I'm > referring to the fact that only some form of ->post() would be needed (in > an async context of course). As such, most of the infrastructure in this > series is to support polling, and without it we could do away with the > other callbacks as well as the atomics. The hardware would take care > "synchronizing" everything - which is similar to what Jonathan had > highlighted in the rfc version and was wondering about the explicit > checks. > > >I see irq support as a small incremental extension of polling support, > >not something that precludes polling support. For example (untested / > >unfinished): > > > >diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c > >index 3a66aadb4df0..30b21ae4cbe5 100644 > >--- a/drivers/cxl/pci.c > >+++ b/drivers/cxl/pci.c > >@@ -178,6 +178,19 @@ static int __cxl_pci_mbox_send_cmd(struct cxl_dev_state *cxlds, > > mbox_cmd->return_code = > > FIELD_GET(CXLDEV_MBOX_STATUS_RET_CODE_MASK, status_reg); > > > >+ if (mbox_cmd->return_code == CXL_MBOX_CMD_RC_BACKGROUND) { > >+ int i; > >+ > >+ for (i = 0; i < mbox_cmd->poll_count; i++) > >+ rc = wait_event_interruptible_timeout( > >+ &mbox_wait, cxl_background_complete(cxlds), > >+ mbox_cmd->poll_interval); > >+ > >+ if (!cxl_background_complete(cxlds)) > >+ cxl_cmd_err(cxlds->dev, mbox_cmd, md_status, > >+ "background timeout"); > >+ } > >+ > > if (mbox_cmd->return_code != CXL_MBOX_CMD_RC_SUCCESS) { > > dev_dbg(dev, "Mailbox operation had an error: %s\n", > > cxl_mbox_cmd_rc2str(mbox_cmd)); > > > >...where all irq support adds is a small handler that does: > > > > wake_up(&mbox_wait); > > return IRQ_HANDLED; > > > >...to short-circuit the wait in __cxl_pci_mbox_send_cmd(). > > I agree that under this synchronous scheme polling and irq are quite > straightforward. I had always considered an async approach and not > playing with timeouts, but just let the whole thing run to completion. > I was also guided by the fact that the spec says: > > "It is strongly recommended that devices continue to accept new non- > background commands while the background operation is running." A "strong recommendation" without a specification defined mechanism to know when a device adheres to that is effectively useless for the driver. If it needs to consider that failure case at all, it might as well skip the optimized path for what are slow-path operations that do not demand optimization. > > What I don't like about a synchronous approach is that you get the > situation where the driver timesout, but the hw is still doing the > background command, no? And there is no way to "cancel" it. Sure, > any new incoming mbox commands during that time would self-serialize > naturally by hw. But when the bg command is done and we have irqs enabled, > the isr will be triggered and we no longer have the context. You would > also have a discrepancy in a returned timeout error vs a different > return code from the actual hw, specially if for example the operation > was successful. afaict the only thing that would allow users to know > the real status of the sent bg command would be the Background Operation > Status command, but that is far from ideal. The kernel needs to give up at some point, so the status for that lost command is "timeout". 
Then the mailbox state is indeterminate until the next successful command submission and the driver can get back in sync with the hardware state. The typical solution for this case is trigger device error handling up to and including reset to get the driver back in sync with the hardware state. Unless I am overlooking something CXL is atypical / stuck because any resets that might recover the mbox are destructive to HDM decoding. > >The cxl_internal_send_cmd() caller gets a synchronous response and can > >handle it from there. No need for conflict detection nor async execution > >contexts. Caller is responsible for making sure ->poll_count and > >->poll_interval are reasonable. > > ... > > >The complexity I see here is the synchronization needed to make sure > >that the isr is aligned with the current background command queued for > >execution and in turn synchronized with incoming commands. My > >observation is that current mailbox state synchronization is > >sufficient for: > > > >- Transfer FW: broken up into multiple transfers > >- Scan Media: broken up by Scan Media Physical Address Length values > > that try to get the Estimated Scan Media Time below > > CXL_MAILBOX_TIMEOUT_MS per iteration. > >- Sanitize: the device is effectively dead during this operation, so > > there is little need to time-share > > But tbh I find it much more intuitive to get an error then having to > block for who knows how long. Yes, agree it would be reasonable to just have the driver reject all other commands while sanitize is in-flight. > Either way this would still require special cases such as sanitize > where the poller should not timeout. Of course there is nothing in the > approach from this series that prevents users from being mindful of > the background command and breaking things up, where possible. Right, especially because all the users are kernel internal. So it's a code complexity question. It's easier to add flexibility than rip it out later so I am leaning towards excruciatingly simple until that becomes unbearable... which in retrospect would have saved some of the DOE implementation thrash.
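The "reject everything while sanitize is in flight" behavior both sides land on above is small; a sketch, assuming a CXL_MBOX_OP_SANITIZE opcode which the posted patch does not define yet:

static int cxl_mbox_check_sanitize(struct cxl_dev_state *cxlds, u16 opcode)
{
	/* fail fast rather than block for the duration of overwrite */
	if (cxl_mbox_bgcmd_running(cxlds) == CXL_MBOX_OP_SANITIZE)
		return -EBUSY;
	return 0;
}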
diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h index 1d8f87be283f..2a71664a0668 100644 --- a/drivers/cxl/core/core.h +++ b/drivers/cxl/core/core.h @@ -69,6 +69,7 @@ static inline struct cxl_ep *cxl_ep_load(struct cxl_port *port, int cxl_memdev_init(void); void cxl_memdev_exit(void); -void cxl_mbox_init(void); +int cxl_mbox_init(void); +void cxl_mbox_exit(void); #endif /* __CXL_CORE_H__ */ diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c index 35dd889f1d3a..bfee9251c81c 100644 --- a/drivers/cxl/core/mbox.c +++ b/drivers/cxl/core/mbox.c @@ -10,6 +10,7 @@ #include "core.h" static bool cxl_raw_allow_all; +static struct workqueue_struct *cxl_mbox_bgpoll_wq; /** * DOC: cxl mbox @@ -24,7 +25,7 @@ static bool cxl_raw_allow_all; for ((cmd) = &cxl_mem_commands[0]; \ ((cmd) - cxl_mem_commands) < ARRAY_SIZE(cxl_mem_commands); (cmd)++) -#define CXL_CMD(_id, sin, sout, _flags) \ +#define __CXL_CMD(_id, sin, sout, _flags, _bgops) \ [CXL_MEM_COMMAND_ID_##_id] = { \ .info = { \ .id = CXL_MEM_COMMAND_ID_##_id, \ @@ -33,8 +34,13 @@ static bool cxl_raw_allow_all; }, \ .opcode = CXL_MBOX_OP_##_id, \ .flags = _flags, \ + .bgops = _bgops, \ } +#define CXL_CMD(_id, sin, sout, _flags) __CXL_CMD(_id, sin, sout, _flags, NULL) +#define CXL_BGCMD(_id, sin, sout, _flags, _bgops) \ + __CXL_CMD(_id, sin, sout, _flags | CXL_CMD_FLAG_BACKGROUND, _bgops) + #define CXL_VARIABLE_PAYLOAD ~0U /* * This table defines the supported mailbox commands for the driver. This table @@ -63,7 +69,7 @@ static struct cxl_mem_command cxl_mem_commands[CXL_MEM_COMMAND_ID_MAX] = { CXL_CMD(INJECT_POISON, 0x8, 0, 0), CXL_CMD(CLEAR_POISON, 0x48, 0, 0), CXL_CMD(GET_SCAN_MEDIA_CAPS, 0x10, 0x4, 0), - CXL_CMD(SCAN_MEDIA, 0x11, 0, 0), + CXL_BGCMD(SCAN_MEDIA, 0x11, 0, 0, NULL), CXL_CMD(GET_SCAN_MEDIA, 0, CXL_VARIABLE_PAYLOAD, 0), CXL_CMD(GET_SECURITY_STATE, 0, 0x4, 0), CXL_CMD(SET_PASSPHRASE, 0x60, 0, 0), @@ -145,6 +151,136 @@ static const char *cxl_mem_opcode_to_name(u16 opcode) return cxl_command_names[c->info.id].name; } +static unsigned long __maybe_unused +cxl_mbox_bgcmd_delay(struct cxl_dev_state *cxlds, u16 opcode) +{ + struct cxl_mem_command *c; + unsigned long ret = 0; + + c = cxl_mem_find_command(opcode); + if (!c || !(c->flags & CXL_CMD_FLAG_BACKGROUND)) { + dev_WARN_ONCE(cxlds->dev, true, + "Not a background command\n"); + return 0; + } + + if (c->bgops && c->bgops->delay) + ret = c->bgops->delay(cxlds); + + return ret * HZ; +} + +static void cxl_mbox_bgcmd_post(struct cxl_dev_state *cxlds, + u16 opcode, bool success) +{ + struct cxl_mem_command *c; + + c = cxl_mem_find_command(opcode); + if (!c || !(c->flags & CXL_CMD_FLAG_BACKGROUND)) { + dev_WARN_ONCE(cxlds->dev, true, + "Not a background command\n"); + return; + } + + if (c->bgops && c->bgops->post) + c->bgops->post(cxlds, success); +} + +/* + * Ensure that ->mbox_send(@new) can run safely when a background + * command is running. If so, returns zero, otherwise error. + * + * check 1. bg cmd running. If not running it is fine to + * send any command. If one is running then there + * may be restrictions. + * check 2. @new incoming command is capable of running + * in the background. If there is an in-flight bg + * operation, all these are forbidden as we need + * to serialize when polling. + * check 3. @new incoming command is not blacklisted by the + * current running command. 
+ */ +static int __maybe_unused +cxl_mbox_check_cmd_bgcmd(struct cxl_dev_state *cxlds, u16 new) +{ + struct cxl_mem_command *c; + + /* 1 */ + if (likely(!cxl_mbox_bgcmd_running(cxlds))) + return 0; + + c = cxl_mem_find_command(new); + if (!c) + return -EINVAL; + + /* 2 */ + if (c->flags & CXL_CMD_FLAG_BACKGROUND) + return -EBUSY; + + /* 3 */ + if (c->bgops && c->bgops->conflicts) + return c->bgops->conflicts(new); + + return 0; +} + +/* + * Background operation polling mode. + */ +void cxl_mbox_bgcmd_work(struct work_struct *work) +{ + struct cxl_dev_state *cxlds; + u64 bgcmd_status_reg; + u16 opcode; + u32 pct; + + cxlds = container_of(work, struct cxl_dev_state, mbox_bgpoll.work); + + dev_WARN_ONCE(cxlds->dev, !!cxlds->mbox_irq, + "Polling mailbox when IRQs are supported\n"); + + bgcmd_status_reg = readq(cxlds->regs.mbox + + CXLDEV_MBOX_BG_CMD_STATUS_OFFSET); + opcode = FIELD_GET(CXLDEV_MBOX_BG_CMD_COMMAND_OPCODE_MASK, + bgcmd_status_reg); + + pct = FIELD_GET(CXLDEV_MBOX_BG_CMD_COMMAND_PCT_MASK, bgcmd_status_reg); + if (pct != 100) { + unsigned long hint; + u64 status_reg; + + status_reg = readq(cxlds->regs.mbox + CXLDEV_MBOX_STATUS_OFFSET); + if (!FIELD_GET(CXLDEV_MBOX_STATUS_BACKGROUND_OPERATION, + status_reg)) + dev_WARN(cxlds->dev, + "No background operation running (%u%%).\n", + pct); + + hint = cxl_mbox_bgcmd_delay(cxlds, opcode); + if (hint == 0) { + /* + * TODO: try to do better(?). pct is updated every + * ~1 sec, maybe use a heuristic based on that. + */ + hint = 5 * HZ; + } + + queue_delayed_work(cxl_mbox_bgpoll_wq, + &cxlds->mbox_bgpoll, hint); + } else { /* bg operation completed */ + u16 return_code; + + return_code = FIELD_GET(CXLDEV_MBOX_BG_CMD_COMMAND_RC_MASK, + bgcmd_status_reg); + cxl_mbox_bgcmd_post(cxlds, opcode, + return_code == CXL_MBOX_CMD_RC_SUCCESS); + + put_device(cxlds->dev); + cxl_mbox_bgcmd_done(cxlds); + } +} +EXPORT_SYMBOL_NS_GPL(cxl_mbox_bgcmd_work, CXL); + /** * cxl_mbox_send_cmd() - Send a mailbox command to a device. * @cxlds: The device data for the operation @@ -153,6 +289,7 @@ static const char *cxl_mem_opcode_to_name(u16 opcode) * @in_size: The length of the input payload * @out: Caller allocated buffer for the output. * @out_size: Expected size of output. + * @return_code: (optional output) HW returned code. * * Context: Any context. * Return: @@ -167,8 +304,9 @@ static const char *cxl_mem_opcode_to_name(u16 opcode) * error. While this distinction can be useful for commands from userspace, the * kernel will only be able to use results when both are successful. */ -int cxl_mbox_send_cmd(struct cxl_dev_state *cxlds, u16 opcode, void *in, - size_t in_size, void *out, size_t out_size) +int cxl_mbox_send_cmd(struct cxl_dev_state *cxlds, u16 opcode, + void *in, size_t in_size, + void *out, size_t out_size, u16 *return_code) { const struct cxl_mem_command *cmd = cxl_mem_find_command(opcode); struct cxl_mbox_cmd mbox_cmd = { @@ -183,12 +321,53 @@ int cxl_mbox_send_cmd(struct cxl_dev_state *cxlds, u16 opcode, void *in, if (in_size > cxlds->payload_size || out_size > cxlds->payload_size) return -E2BIG; + /* + * With bg polling this can overlap a scenario where the + * hardware can receive new requests but the driver is + * not ready. Handle accordingly. 
+ */ + if (!cxlds->mbox_irq) { + rc = cxl_mbox_check_cmd_bgcmd(cxlds, opcode); + if (rc) + return rc; + } + rc = cxlds->mbox_send(cxlds, &mbox_cmd); + if (return_code) + *return_code = mbox_cmd.return_code; if (rc) return rc; - if (mbox_cmd.return_code != CXL_MBOX_CMD_RC_SUCCESS) + switch (mbox_cmd.return_code) { + case CXL_MBOX_CMD_RC_BACKGROUND: + { + int err; + + dev_dbg(cxlds->dev, "Opcode 0x%04x: %s\n", opcode, + cxl_mbox_cmd_rc2str(&mbox_cmd)); + + err = cxl_mbox_bgcmd_begin(cxlds, opcode); + if (err) { + dev_WARN(cxlds->dev, + "Corrupted background cmd (opcode 0x%04x)\n", + err); + return -ENXIO; + } + + get_device(cxlds->dev); + + if (!cxlds->mbox_irq) { + /* do first poll check asap before using any hinting */ + queue_delayed_work(cxl_mbox_bgpoll_wq, + &cxlds->mbox_bgpoll, 0); + } + break; + } + case CXL_MBOX_CMD_RC_SUCCESS: + break; + default: return cxl_mbox_cmd_rc2errno(&mbox_cmd); + } /* * Variable sized commands can't be validated and so it's up to the @@ -575,7 +754,7 @@ static int cxl_xfer_log(struct cxl_dev_state *cxlds, uuid_t *uuid, u32 size, u8 int rc; rc = cxl_mbox_send_cmd(cxlds, CXL_MBOX_OP_GET_LOG, &log, sizeof(log), - out, xfer_size); + out, xfer_size, NULL); if (rc < 0) return rc; @@ -628,7 +807,7 @@ static struct cxl_mbox_get_supported_logs *cxl_get_gsl(struct cxl_dev_state *cxl return ERR_PTR(-ENOMEM); rc = cxl_mbox_send_cmd(cxlds, CXL_MBOX_OP_GET_SUPPORTED_LOGS, NULL, 0, ret, - cxlds->payload_size); + cxlds->payload_size, NULL); if (rc < 0) { kvfree(ret); return ERR_PTR(rc); @@ -738,7 +917,7 @@ static int cxl_mem_get_partition_info(struct cxl_dev_state *cxlds) int rc; rc = cxl_mbox_send_cmd(cxlds, CXL_MBOX_OP_GET_PARTITION_INFO, NULL, 0, - &pi, sizeof(pi)); + &pi, sizeof(pi), NULL); if (rc) return rc; @@ -771,7 +950,7 @@ int cxl_dev_state_identify(struct cxl_dev_state *cxlds) int rc; rc = cxl_mbox_send_cmd(cxlds, CXL_MBOX_OP_IDENTIFY, NULL, 0, &id, - sizeof(id)); + sizeof(id), NULL); if (rc < 0) return rc; @@ -868,11 +1047,25 @@ struct cxl_dev_state *cxl_dev_state_create(struct device *dev) } EXPORT_SYMBOL_NS_GPL(cxl_dev_state_create, CXL); -void __init cxl_mbox_init(void) +int __init cxl_mbox_init(void) { struct dentry *mbox_debugfs; mbox_debugfs = cxl_debugfs_create_dir("mbox"); debugfs_create_bool("raw_allow_all", 0600, mbox_debugfs, &cxl_raw_allow_all); + + /* background commands can run concurrently on different devices */ + cxl_mbox_bgpoll_wq = alloc_workqueue("cxl_mbox_bgcmd", WQ_UNBOUND, 0); + if (!cxl_mbox_bgpoll_wq) { + debugfs_remove_recursive(mbox_debugfs); + return -ENOMEM; + } + + return 0; +} + +void cxl_mbox_exit(void) +{ + destroy_workqueue(cxl_mbox_bgpoll_wq); } diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c index 0d2f5eaaca7d..449767bfb5aa 100644 --- a/drivers/cxl/core/port.c +++ b/drivers/cxl/core/port.c @@ -1945,6 +1945,7 @@ static void cxl_core_exit(void) destroy_workqueue(cxl_bus_wq); cxl_memdev_exit(); debugfs_remove_recursive(cxl_debugfs); + cxl_mbox_exit(); } module_init(cxl_core_init); diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h index e5e1abceeca7..a10e979ef3a4 100644 --- a/drivers/cxl/cxl.h +++ b/drivers/cxl/cxl.h @@ -139,14 +139,22 @@ static inline int ways_to_cxl(unsigned int ways, u8 *iw) /* CXL 2.0 8.2.8.4 Mailbox Registers */ #define CXLDEV_MBOX_CAPS_OFFSET 0x00 #define CXLDEV_MBOX_CAP_PAYLOAD_SIZE_MASK GENMASK(4, 0) +#define CXLDEV_MBOX_CAP_BG_CMD_IRQNUM_MASK GENMASK(10, 7) +#define CXLDEV_MBOX_CAP_BG_CMD_IRQ BIT(6) #define CXLDEV_MBOX_CTRL_OFFSET 0x04 #define CXLDEV_MBOX_CTRL_DOORBELL BIT(0) +#define 
CXLDEV_MBOX_CTRL_BGCMD_IRQ BIT(2) #define CXLDEV_MBOX_CMD_OFFSET 0x08 #define CXLDEV_MBOX_CMD_COMMAND_OPCODE_MASK GENMASK_ULL(15, 0) #define CXLDEV_MBOX_CMD_PAYLOAD_LENGTH_MASK GENMASK_ULL(36, 16) #define CXLDEV_MBOX_STATUS_OFFSET 0x10 +#define CXLDEV_MBOX_STATUS_BACKGROUND_OPERATION BIT(0) #define CXLDEV_MBOX_STATUS_RET_CODE_MASK GENMASK_ULL(47, 32) +/* CXL 3.0 8.2.8.4.7 Background Command Status Register */ #define CXLDEV_MBOX_BG_CMD_STATUS_OFFSET 0x18 +#define CXLDEV_MBOX_BG_CMD_COMMAND_OPCODE_MASK GENMASK_ULL(15, 0) +#define CXLDEV_MBOX_BG_CMD_COMMAND_PCT_MASK GENMASK_ULL(22, 16) +#define CXLDEV_MBOX_BG_CMD_COMMAND_RC_MASK GENMASK_ULL(47, 32) #define CXLDEV_MBOX_PAYLOAD_OFFSET 0x20 /* diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h index 75baeb0bbe57..e7cb2f2fadc4 100644 --- a/drivers/cxl/cxlmem.h +++ b/drivers/cxl/cxlmem.h @@ -212,6 +212,9 @@ struct cxl_endpoint_dvsec_info { * @serial: PCIe Device Serial Number * @doe_mbs: PCI DOE mailbox array * @mbox_send: @dev specific transport for transmitting mailbox commands + * @mbox_irq: @dev supports mailbox interrupts + * @mbox_bg: opcode for the in-flight background operation on @dev + * @mbox_bgpoll: self-polling delayed work item * * See section 8.2.9.5.2 Capacity Configuration and Label Storage for * details on capacity parameters. @@ -248,6 +251,9 @@ struct cxl_dev_state { struct xarray doe_mbs; int (*mbox_send)(struct cxl_dev_state *cxlds, struct cxl_mbox_cmd *cmd); + bool mbox_irq; + atomic_t mbox_bg; + struct delayed_work __maybe_unused mbox_bgpoll; }; enum cxl_opcode { @@ -290,6 +296,26 @@ enum cxl_opcode { UUID_INIT(0xe1819d9, 0x11a9, 0x400c, 0x81, 0x1f, 0xd6, 0x07, 0x19, \ 0x40, 0x3d, 0x86) +/* + * Supported CXL mailbox commands that can run in the background. + * + * Barriers pair with release/acquire semantics. When done, + * clearing ->mbox_bg must be the very last operation before + * finishing the command. + */ +static inline int cxl_mbox_bgcmd_begin(struct cxl_dev_state *cxlds, u16 opcode) +{ + return atomic_xchg(&cxlds->mbox_bg, opcode); +} +static inline int cxl_mbox_bgcmd_running(struct cxl_dev_state *cxlds) +{ + return atomic_read_acquire(&cxlds->mbox_bg); +} +static inline void cxl_mbox_bgcmd_done(struct cxl_dev_state *cxlds) +{ + atomic_set_release(&cxlds->mbox_bg, 0); +} + struct cxl_mbox_get_supported_logs { __le16 entries; u8 rsvd[6]; @@ -353,16 +379,38 @@ struct cxl_mbox_set_partition_info { #define CXL_SET_PARTITION_IMMEDIATE_FLAG BIT(0) +/** + * struct cxl_mem_bgcommand_ops - Optional ad-hoc handling of background cmds + * @post: Execute after the command completes + * @conflicts: How to handle a @new command when one is currently executing + * (only for polling mode) + * @delay: Delay hint for polling on command completion + */ +struct cxl_mem_bgcommand_ops { + void (*post)(struct cxl_dev_state *cxlds, bool success); + int (*conflicts)(u16 new); + unsigned long (*delay)(struct cxl_dev_state *cxlds); +}; + /** * struct cxl_mem_command - Driver representation of a memory device command * @info: Command information as it exists for the UAPI * @opcode: The actual bits used for the mailbox protocol * @flags: Set of flags effecting driver behavior. + * @bgops: Set of optional callbacks to handle commands that can run in + * the background. * * * %CXL_CMD_FLAG_FORCE_ENABLE: In cases of error, commands with this flag * will be enabled by the driver regardless of what hardware may have * advertised. 
* + * * %CXL_CMD_FLAG_BACKGROUND: The command may execute asynchronously in the + * background if the operation takes longer than ~2 seconds. The semantics + * are: + * - Only a single of these commands can run at a time. + * - Depending on the nature of the operation, devices may continue to + * accept new non-background commands. + * * The cxl_mem_command is the driver's internal representation of commands that * are supported by the driver. Some of these commands may not be supported by * the hardware. The driver will use @info to validate the fields passed in by @@ -374,8 +422,10 @@ struct cxl_mem_command { struct cxl_command_info info; enum cxl_opcode opcode; u32 flags; + struct cxl_mem_bgcommand_ops __maybe_unused *bgops; #define CXL_CMD_FLAG_NONE 0 #define CXL_CMD_FLAG_FORCE_ENABLE BIT(0) +#define CXL_CMD_FLAG_BACKGROUND BIT(1) }; #define CXL_PMEM_SEC_STATE_USER_PASS_SET 0x01 @@ -414,7 +464,8 @@ enum { }; int cxl_mbox_send_cmd(struct cxl_dev_state *cxlds, u16 opcode, void *in, - size_t in_size, void *out, size_t out_size); + size_t in_size, void *out, size_t out_size, + u16 *return_code); int cxl_dev_state_identify(struct cxl_dev_state *cxlds); int cxl_await_media_ready(struct cxl_dev_state *cxlds); int cxl_enumerate_cmds(struct cxl_dev_state *cxlds); @@ -434,6 +485,8 @@ static inline void cxl_mem_active_dec(void) } #endif +void cxl_mbox_bgcmd_work(struct work_struct *work); + struct cxl_hdm { struct cxl_component_regs regs; unsigned int decoder_count; diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c index 621a0522b554..b21d8681beac 100644 --- a/drivers/cxl/pci.c +++ b/drivers/cxl/pci.c @@ -268,6 +268,10 @@ static int cxl_pci_setup_mailbox(struct cxl_dev_state *cxlds) return -ENXIO; } + atomic_set(&cxlds->mbox_bg, 0); + if (!cxlds->mbox_irq) + INIT_DELAYED_WORK(&cxlds->mbox_bgpoll, cxl_mbox_bgcmd_work); + dev_dbg(cxlds->dev, "Mailbox payload sized %zu", cxlds->payload_size); diff --git a/drivers/cxl/pmem.c b/drivers/cxl/pmem.c index ab40c93c44e5..c6a6d3fd6883 100644 --- a/drivers/cxl/pmem.c +++ b/drivers/cxl/pmem.c @@ -173,7 +173,8 @@ static int cxl_pmem_get_config_data(struct cxl_dev_state *cxlds, }; rc = cxl_mbox_send_cmd(cxlds, CXL_MBOX_OP_GET_LSA, &get_lsa, - sizeof(get_lsa), cmd->out_buf, cmd->in_length); + sizeof(get_lsa), cmd->out_buf, + cmd->in_length, NULL); cmd->status = 0; return rc; @@ -205,7 +206,7 @@ static int cxl_pmem_set_config_data(struct cxl_dev_state *cxlds, rc = cxl_mbox_send_cmd(cxlds, CXL_MBOX_OP_SET_LSA, set_lsa, struct_size(set_lsa, data, cmd->in_length), - NULL, 0); + NULL, 0, NULL); /* * Set "firmware" status (4-packed bytes at the end of the input diff --git a/drivers/cxl/security.c b/drivers/cxl/security.c index 5484d4eecfd1..825d1bd1dd16 100644 --- a/drivers/cxl/security.c +++ b/drivers/cxl/security.c @@ -20,7 +20,7 @@ static unsigned long cxl_pmem_get_security_flags(struct nvdimm *nvdimm, int rc; rc = cxl_mbox_send_cmd(cxlds, CXL_MBOX_OP_GET_SECURITY_STATE, NULL, 0, - &sec_out, sizeof(sec_out)); + &sec_out, sizeof(sec_out), NULL); if (rc < 0) return 0; @@ -67,7 +67,7 @@ static int cxl_pmem_security_change_key(struct nvdimm *nvdimm, memcpy(set_pass.new_pass, new_data->data, NVDIMM_PASSPHRASE_LEN); rc = cxl_mbox_send_cmd(cxlds, CXL_MBOX_OP_SET_PASSPHRASE, - &set_pass, sizeof(set_pass), NULL, 0); + &set_pass, sizeof(set_pass), NULL, 0, NULL); return rc; } @@ -86,7 +86,7 @@ static int __cxl_pmem_security_disable(struct nvdimm *nvdimm, memcpy(dis_pass.pass, key_data->data, NVDIMM_PASSPHRASE_LEN); rc = cxl_mbox_send_cmd(cxlds, 
CXL_MBOX_OP_DISABLE_PASSPHRASE, - &dis_pass, sizeof(dis_pass), NULL, 0); + &dis_pass, sizeof(dis_pass), NULL, 0, NULL); return rc; } @@ -108,7 +108,8 @@ static int cxl_pmem_security_freeze(struct nvdimm *nvdimm) struct cxl_memdev *cxlmd = cxl_nvd->cxlmd; struct cxl_dev_state *cxlds = cxlmd->cxlds; - return cxl_mbox_send_cmd(cxlds, CXL_MBOX_OP_FREEZE_SECURITY, NULL, 0, NULL, 0); + return cxl_mbox_send_cmd(cxlds, CXL_MBOX_OP_FREEZE_SECURITY, + NULL, 0, NULL, 0, NULL); } static int cxl_pmem_security_unlock(struct nvdimm *nvdimm, @@ -122,7 +123,7 @@ static int cxl_pmem_security_unlock(struct nvdimm *nvdimm, memcpy(pass, key_data->data, NVDIMM_PASSPHRASE_LEN); rc = cxl_mbox_send_cmd(cxlds, CXL_MBOX_OP_UNLOCK, - pass, NVDIMM_PASSPHRASE_LEN, NULL, 0); + pass, NVDIMM_PASSPHRASE_LEN, NULL, 0, NULL); if (rc < 0) return rc; @@ -143,7 +144,7 @@ static int cxl_pmem_security_passphrase_erase(struct nvdimm *nvdimm, CXL_PMEM_SEC_PASS_MASTER : CXL_PMEM_SEC_PASS_USER; memcpy(erase.pass, key->data, NVDIMM_PASSPHRASE_LEN); rc = cxl_mbox_send_cmd(cxlds, CXL_MBOX_OP_PASSPHRASE_SECURE_ERASE, - &erase, sizeof(erase), NULL, 0); + &erase, sizeof(erase), NULL, 0, NULL); if (rc < 0) return rc;
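For context, a hypothetical example of how a background-capable command could plug into the ops the patch introduces; the sanitize wiring below is illustrative only, since the posted patch enables no background commands yet and does not define a SANITIZE table entry:

static void sanitize_post(struct cxl_dev_state *cxlds, bool success)
{
	/* flush CPU caches that may still hold pre-overwrite data */
	if (success)
		wbinvd_on_all_cpus();
}

static int sanitize_conflicts(u16 new)
{
	/* simplest policy: reject everything else while overwrite runs */
	return -EBUSY;
}

static unsigned long sanitize_delay(struct cxl_dev_state *cxlds)
{
	/* seconds; cxl_mbox_bgcmd_delay() converts to jiffies via HZ */
	return 5 * 60;
}

static struct cxl_mem_bgcommand_ops sanitize_bgops = {
	.post = sanitize_post,
	.conflicts = sanitize_conflicts,
	.delay = sanitize_delay,
};

/* and the table entry would read:
 *	CXL_BGCMD(SANITIZE, 0, 0, 0, &sanitize_bgops),
 */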