Message ID | 1541657381-7452-1-git-send-email-lijie34@huawei.com (mailing list archive)
---|---
State | Not Applicable, archived
Delegated to: | christophe varoqui
Series | multipath-tools: add ANA support for NVMe device
Hello Lijie, On Thu, 2018-11-08 at 14:09 +0800, lijie wrote: > Add support for Asynchronous Namespace Access as specified in NVMe > 1.3 > TP 4004. The states are updated through reading the ANA log page. > > By default, the native nvme multipath takes over the nvme device. > We can pass a false to the parameter 'multipath' of the nvme-core.ko > module,when we want to use multipath-tools. Thank you for the patch. It looks quite good to me. I've tested it with a Linux target and found no problems so far. I have a few questions and comments inline below. I suggest you also have a look at detect_prio(); it seems to make sense to use the ana prioritizer for NVMe paths automatically if ANA is supported (with your patch, "detect_prio no" and "prio ana" have to be configured explicitly). But that can be done in a later patch. Martin > --- > libmultipath/prio.h | 1 + > libmultipath/prioritizers/Makefile | 1 + > libmultipath/prioritizers/ana.c | 292 > +++++++++++++++++++++++++++++++++++++ > libmultipath/prioritizers/ana.h | 221 > ++++++++++++++++++++++++++++ > multipath/multipath.conf.5 | 8 + > 5 files changed, 523 insertions(+) > create mode 100644 libmultipath/prioritizers/ana.c > create mode 100644 libmultipath/prioritizers/ana.h > > diff --git a/libmultipath/prio.h b/libmultipath/prio.h > index aa587cc..599d1d8 100644 > --- a/libmultipath/prio.h > +++ b/libmultipath/prio.h > @@ -30,6 +30,7 @@ struct path; > #define PRIO_WEIGHTED_PATH "weightedpath" > #define PRIO_SYSFS "sysfs" > #define PRIO_PATH_LATENCY "path_latency" > +#define PRIO_ANA "ana" > > /* > * Value used to mark the fact prio was not defined > diff --git a/libmultipath/prioritizers/Makefile > b/libmultipath/prioritizers/Makefile > index ab7bc07..15afaba 100644 > --- a/libmultipath/prioritizers/Makefile > +++ b/libmultipath/prioritizers/Makefile > @@ -19,6 +19,7 @@ LIBS = \ > libpriordac.so \ > libprioweightedpath.so \ > libpriopath_latency.so \ > + libprioana.so \ > libpriosysfs.so > > all: $(LIBS) > diff 
--git a/libmultipath/prioritizers/ana.c > b/libmultipath/prioritizers/ana.c > new file mode 100644 > index 0000000..c5aaa5f > --- /dev/null > +++ b/libmultipath/prioritizers/ana.c > @@ -0,0 +1,292 @@ > +/* > + * (C) Copyright HUAWEI Technology Corp. 2017 All Rights Reserved. > + * > + * ana.c > + * Version 1.00 > + * > + * Tool to make use of a NVMe-feature called Asymmetric Namespace > Access. > + * It determines the ANA state of a device and prints a priority > value to stdout. > + * > + * Author(s): Cheng Jike <chengjike.cheng@huawei.com> > + * Li Jie <lijie34@huawei.com> > + * > + * This file is released under the GPL version 2, or any later > version. > + */ > +#include <stdio.h> > +#include <sys/ioctl.h> > +#include <sys/stat.h> > +#include <sys/types.h> > +#include <stdbool.h> > + > +#include "debug.h" > +#include "prio.h" > +#include "structs.h" > +#include "ana.h" > + > +enum { > + ANA_PRIO_OPTIMIZED = 50, > + ANA_PRIO_NONOPTIMIZED = 10, > + ANA_PRIO_INACCESSIBLE = 5, > + ANA_PRIO_PERSISTENT_LOSS = 1, > + ANA_PRIO_CHANGE = 0, > + ANA_PRIO_RESERVED = 0, > + ANA_PRIO_GETCTRL_FAILED = -1, > + ANA_PRIO_NOT_SUPPORTED = -2, > + ANA_PRIO_GETANAS_FAILED = -3, > + ANA_PRIO_GETANALOG_FAILED = -4, > + ANA_PRIO_GETNSID_FAILED = -5, > + ANA_PRIO_GETNS_FAILED = -6, > + ANA_PRIO_NO_MEMORY = -7, > + ANA_PRIO_NO_INFORMATION = -8, > +}; > + > +static const char * anas_string[] = { > + [NVME_ANA_OPTIMIZED] = "ANA Optimized > State", > + [NVME_ANA_NONOPTIMIZED] = "ANA Non-Optimized > State", > + [NVME_ANA_INACCESSIBLE] = "ANA Inaccessible > State", > + [NVME_ANA_PERSISTENT_LOSS] = "ANA Persistent > Loss State", > + [NVME_ANA_CHANGE] = "ANA Change state", > + [NVME_ANA_RESERVED] = "Invalid namespace > group state!", > +}; > + > +static const char *aas_print_string(int rc) > +{ > + rc &= 0xff; > + > + switch(rc) { > + case NVME_ANA_OPTIMIZED: > + case NVME_ANA_NONOPTIMIZED: > + case NVME_ANA_INACCESSIBLE: > + case NVME_ANA_PERSISTENT_LOSS: > + case NVME_ANA_CHANGE: > + return 
anas_string[rc]; > + default: > + return anas_string[NVME_ANA_RESERVED]; > + } > + > + return anas_string[NVME_ANA_RESERVED]; > +} nit: the last statement is superfluous. > + > +static int nvme_get_nsid(int fd, unsigned *nsid) > +{ > + static struct stat nvme_stat; > + int err = fstat(fd, &nvme_stat); > + if (err < 0) > + return 1; > + > + if (!S_ISBLK(nvme_stat.st_mode)) { > + condlog(0, "Error: requesting namespace-id from non- > block device\n"); > + return 1; > + } > + > + *nsid = ioctl(fd, NVME_IOCTL_ID); > + return 0; > +} > + > +static int nvme_submit_admin_passthru(int fd, struct > nvme_passthru_cmd *cmd) > +{ > + return ioctl(fd, NVME_IOCTL_ADMIN_CMD, cmd); > +} > + > +int nvme_get_log13(int fd, __u32 nsid, __u8 log_id, __u8 lsp, __u64 > lpo, > + __u16 lsi, bool rae, __u32 data_len, void *data) lpo and lsi are 0 in all callers. Likewise, rae is always true. You might consider simplifying your code by omitting these parameters. I can see that you did it this way because you have copied the code from nvme-cli/nvme-ioctl.c, which is fine. In this case, however, we may want to copy the entire file, in order to be able to track more easily if our code is up-to-date. I wish something like libnvmecli existed. > +{ > + struct nvme_admin_cmd cmd = { > + .opcode = nvme_admin_get_log_page, > + .nsid = nsid, > + .addr = (__u64)(uintptr_t) data, > + .data_len = data_len, > + }; > + __u32 numd = (data_len >> 2) - 1; > + __u16 numdu = numd >> 16, numdl = numd & 0xffff; > + > + cmd.cdw10 = log_id | (numdl << 16) | (rae ?
1 << 15 : 0); > + if (lsp) > + cmd.cdw10 |= lsp << 8; > + > + cmd.cdw11 = numdu | (lsi << 16); > + cmd.cdw12 = lpo; > + cmd.cdw13 = (lpo >> 32); > + > + return nvme_submit_admin_passthru(fd, &cmd); > + > +} > + > +int nvme_identify13(int fd, __u32 nsid, __u32 cdw10, __u32 cdw11, > void *data) > +{ > + struct nvme_admin_cmd cmd = { > + .opcode = nvme_admin_identify, > + .nsid = nsid, > + .addr = (__u64)(uintptr_t) data, > + .data_len = NVME_IDENTIFY_DATA_SIZE, > + .cdw10 = cdw10, > + .cdw11 = cdw11, Along similar lines as above, this could be simplified, as cdw11 is never used. > + }; > + > + return nvme_submit_admin_passthru(fd, &cmd); > +} > + > +int nvme_identify(int fd, __u32 nsid, __u32 cdw10, void *data) > +{ > + return nvme_identify13(fd, nsid, cdw10, 0, data); > +} > + > +int nvme_identify_ctrl(int fd, void *data) > +{ > + return nvme_identify(fd, 0, NVME_ID_CNS_CTRL, data); > +} > + > +int nvme_identify_ns(int fd, __u32 nsid, void *data) > +{ > + return nvme_identify(fd, nsid, NVME_ID_CNS_NS, data); > +} > + > +int nvme_ana_log(int fd, void *ana_log, size_t ana_log_len, int rgo) > +{ > + __u64 lpo = 0; > + > + return nvme_get_log13(fd, NVME_NSID_ALL, NVME_LOG_ANA, rgo, > lpo, 0, > + true, ana_log_len, ana_log); > +} > + > +static int get_ana_state(__u32 nsid, __u32 anagrpid, void *ana_log) > +{ > + int rc = ANA_PRIO_GETANAS_FAILED; > + void *base = ana_log; > + struct nvme_ana_rsp_hdr *hdr = base; > + struct nvme_ana_group_desc *ana_desc; > + int offset = sizeof(struct nvme_ana_rsp_hdr); > + __u32 nr_nsids; > + size_t nsid_buf_size; > + int i, j; > + > + for (i = 0; i < le16_to_cpu(hdr->ngrps); i++) { > + ana_desc = base + offset; > + nr_nsids = le32_to_cpu(ana_desc->nnsids); > + nsid_buf_size = nr_nsids * sizeof(__le32); > + > + offset += sizeof(*ana_desc); > + > + for (j = 0; j < nr_nsids; j++) { > + if (nsid == le32_to_cpu(ana_desc->nsids[j])) > + return ana_desc->state; > + } > + > + if (anagrpid != 0 && anagrpid == le32_to_cpu(ana_desc-
>grpid)) > + rc = ana_desc->state; Shouldn't you check this condition before iterating over the nsids? And if the anagrpid matches, shouldn't you return the state immediately? Some comments would be helpful in order to understand this logic. > + > + offset += nsid_buf_size; > + } > + > + return rc; > +} > + > +int get_ana_info(struct path * pp, unsigned int timeout) This should be a static function. > +{ > + int rc; > + __u32 nsid; > + struct nvme_id_ctrl ctrl; > + struct nvme_id_ns ns; > + void *ana_log; > + size_t ana_log_len; > + > + rc = nvme_identify_ctrl(pp->fd, &ctrl); > + if (rc) > + return ANA_PRIO_GETCTRL_FAILED; > + > + if(!(ctrl.cmic & (1 << 3))) > + return ANA_PRIO_NOT_SUPPORTED; > + > + rc = nvme_get_nsid(pp->fd, &nsid); > + if (rc) > + return ANA_PRIO_GETNSID_FAILED; > + > + rc = nvme_identify_ns(pp->fd, nsid, &ns); > + if (rc) > + return ANA_PRIO_GETNS_FAILED; > + > + ana_log_len = sizeof(struct nvme_ana_rsp_hdr) + > + le32_to_cpu(ctrl.nanagrpid) * sizeof(struct > nvme_ana_group_desc); > + if (!(ctrl.anacap & (1 << 6))) > + ana_log_len += le32_to_cpu(ctrl.mnan) * sizeof(__le32); I'd like to see a sanity check for ana_log_len here. > + > + ana_log = malloc(ana_log_len); > + if (!ana_log) > + return ANA_PRIO_NO_MEMORY; > + > + rc = nvme_ana_log(pp->fd, ana_log, ana_log_len, > + (ctrl.anacap & (1 << 6)) ? NVME_ANA_LOG_RGO : 0); > + if (rc) { > + free(ana_log); > + return ANA_PRIO_GETANALOG_FAILED; > + } Please use a "goto error;" like construct here. > + > + rc = get_ana_state(nsid, le32_to_cpu(ns.anagrpid), ana_log); > + if (rc < 0){ > + free(ana_log); > + return ANA_PRIO_GETANAS_FAILED; > + } > + > + free(ana_log); > + condlog(3, "%s: ana state = %02x [%s]", pp->dev, rc, > aas_print_string(rc)); Please put an "error:" label here and free ana_log there. 
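The "goto error" construct the review asks for can be sketched roughly as follows. This is only an illustration of the single-cleanup-label pattern applied to get_ana_info(); the stub_* helper names and the hard-coded return values are assumptions made to keep the sketch self-contained, not the real NVMe helpers from the patch.

```c
#include <stdlib.h>

/* Stubbed stand-ins for the real NVMe helpers -- hypothetical names,
 * only here to make the sketch self-contained. */
static int stub_read_ana_log(void *log) { (void)log; return 0; }
static int stub_parse_state(const void *log) { (void)log; return 0x01; /* "optimized" */ }

/* One allocation, one exit point: every failure after the malloc()
 * jumps to the "out" label, so ana_log is freed in exactly one place. */
static int get_ana_info_sketch(void)
{
	int rc;
	void *ana_log = malloc(4096);

	if (!ana_log)
		return -7;		/* ANA_PRIO_NO_MEMORY */

	rc = stub_read_ana_log(ana_log);
	if (rc) {
		rc = -4;		/* ANA_PRIO_GETANALOG_FAILED */
		goto out;
	}

	rc = stub_parse_state(ana_log);
	if (rc < 0)
		rc = -3;		/* ANA_PRIO_GETANAS_FAILED */
out:
	free(ana_log);
	return rc;
}
```

With this shape, adding a new failure case later only requires a `goto out;`, and the three separate `free(ana_log)` calls in the posted patch collapse into one.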
> + > + return rc; > +} > + > +int getprio(struct path * pp, char * args, unsigned int timeout) > +{ > + int rc; > + > + if (pp->fd < 0) > + return ANA_PRIO_NO_INFORMATION; > + > + rc = get_ana_info(pp, timeout); > + if (rc >= 0) { > + rc &= 0x0f; > + switch(rc) { > + case NVME_ANA_OPTIMIZED: > + rc = ANA_PRIO_OPTIMIZED; > + break; > + case NVME_ANA_NONOPTIMIZED: > + rc = ANA_PRIO_NONOPTIMIZED; > + break; > + case NVME_ANA_INACCESSIBLE: > + rc = ANA_PRIO_INACCESSIBLE; > + break; > + case NVME_ANA_PERSISTENT_LOSS: > + rc = ANA_PRIO_PERSISTENT_LOSS; > + break; > + case NVME_ANA_CHANGE: > + rc = ANA_PRIO_CHANGE; > + break; > + default: > + rc = ANA_PRIO_RESERVED; > + } > + } else { > + switch(rc) { > + case ANA_PRIO_GETCTRL_FAILED: > + condlog(0, "%s: couldn't get ctrl info", pp- > >dev); > + break; > + case ANA_PRIO_NOT_SUPPORTED: > + condlog(0, "%s: ana not supported", pp->dev); > + break; > + case ANA_PRIO_GETANAS_FAILED: > + condlog(0, "%s: couldn't get ana state", pp- > >dev); > + break; > + case ANA_PRIO_GETANALOG_FAILED: > + condlog(0, "%s: couldn't get ana log", pp- > >dev); > + break; > + case ANA_PRIO_GETNS_FAILED: > + condlog(0, "%s: couldn't get namespace", pp- > >dev); > + break; > + case ANA_PRIO_GETNSID_FAILED: > + condlog(0, "%s: couldn't get namespace id", pp- > >dev); > + break; > + case ANA_PRIO_NO_MEMORY: > + condlog(0, "%s: couldn't alloc memory", pp- > >dev); > + break; > + } > + } > + return rc; > +} > + > diff --git a/libmultipath/prioritizers/ana.h > b/libmultipath/prioritizers/ana.h > new file mode 100644 > index 0000000..92cfa9e > --- /dev/null > +++ b/libmultipath/prioritizers/ana.h > @@ -0,0 +1,221 @@ > +#ifndef _ANA_H > +#define _ANA_H > + > +#include <linux/types.h> > + > +#define NVME_NSID_ALL 0xffffffff > +#define NVME_IDENTIFY_DATA_SIZE 4096 > + > +#define NVME_LOG_ANA 0x0c > + > +/* Admin commands */ > +enum nvme_admin_opcode { > + nvme_admin_get_log_page = 0x02, > + nvme_admin_identify = 0x06, > +}; > + > +enum { > + NVME_ID_CNS_NS 
= 0x00, > + NVME_ID_CNS_CTRL = 0x01, > +}; > + > +/* nvme ioctl start */ > +struct nvme_passthru_cmd { > + __u8 opcode; > + __u8 flags; > + __u16 rsvd1; > + __u32 nsid; > + __u32 cdw2; > + __u32 cdw3; > + __u64 metadata; > + __u64 addr; > + __u32 metadata_len; > + __u32 data_len; > + __u32 cdw10; > + __u32 cdw11; > + __u32 cdw12; > + __u32 cdw13; > + __u32 cdw14; > + __u32 cdw15; > + __u32 timeout_ms; > + __u32 result; > +}; > + > +#define nvme_admin_cmd nvme_passthru_cmd > + > +#define NVME_IOCTL_ID _IO('N', 0x40) > +#define NVME_IOCTL_ADMIN_CMD _IOWR('N', 0x41, struct nvme_admin_cmd) > +/* nvme ioctl end */ Please #include <linux/nvme_ioctl.h> rather than copying its contents here. > + > +/* nvme id ctrl start */ > +struct nvme_id_power_state { > + __le16 max_power; /* centiwatts */ > + __u8 rsvd2; > + __u8 flags; > + __le32 entry_lat; /* microseconds */ > + __le32 exit_lat; /* microseconds */ > + __u8 read_tput; > + __u8 read_lat; > + __u8 write_tput; > + __u8 write_lat; > + __le16 idle_power; > + __u8 idle_scale; > + __u8 rsvd19; > + __le16 active_power; > + __u8 active_work_scale; > + __u8 rsvd23[9]; > +}; > + > +struct nvme_id_ctrl { > + __le16 vid; > + __le16 ssvid; > + char sn[20]; > + char mn[40]; > + char fr[8]; > + __u8 rab; > + __u8 ieee[3]; > + __u8 cmic; > + __u8 mdts; > + __le16 cntlid; > + __le32 ver; > + __le32 rtd3r; > + __le32 rtd3e; > + __le32 oaes; > + __le32 ctratt; > + __u8 rsvd100[156]; > + __le16 oacs; > + __u8 acl; > + __u8 aerl; > + __u8 frmw; > + __u8 lpa; > + __u8 elpe; > + __u8 npss; > + __u8 avscc; > + __u8 apsta; > + __le16 wctemp; > + __le16 cctemp; > + __le16 mtfa; > + __le32 hmpre; > + __le32 hmmin; > + __u8 tnvmcap[16]; > + __u8 unvmcap[16]; > + __le32 rpmbs; > + __le16 edstt; > + __u8 dsto; > + __u8 fwug; > + __le16 kas; > + __le16 hctma; > + __le16 mntmt; > + __le16 mxtmt; > + __le32 sanicap; > + __le32 hmminds; > + __le16 hmmaxd; > + __u8 rsvd338[4]; > + __u8 anatt; > + __u8 anacap; > + __le32 anagrpmax; > + __le32 nanagrpid; 
> + __u8 rsvd352[160]; > + __u8 sqes; > + __u8 cqes; > + __le16 maxcmd; > + __le32 nn; > + __le16 oncs; > + __le16 fuses; > + __u8 fna; > + __u8 vwc; > + __le16 awun; > + __le16 awupf; > + __u8 nvscc; > + __u8 nwpc; > + __le16 acwu; > + __u8 rsvd534[2]; > + __le32 sgls; > + __le32 mnan; > + __u8 rsvd544[224]; > + char subnqn[256]; > + __u8 rsvd1024[768]; > + __le32 ioccsz; > + __le32 iorcsz; > + __le16 icdoff; > + __u8 ctrattr; > + __u8 msdbd; > + __u8 rsvd1804[244]; > + struct nvme_id_power_state psd[32]; > + __u8 vs[1024]; > +}; > +/* nvme id ctrl end */ > + > +/* nvme id ns start */ > +struct nvme_lbaf { > + __le16 ms; > + __u8 ds; > + __u8 rp; > +}; > + > +struct nvme_id_ns { > + __le64 nsze; > + __le64 ncap; > + __le64 nuse; > + __u8 nsfeat; > + __u8 nlbaf; > + __u8 flbas; > + __u8 mc; > + __u8 dpc; > + __u8 dps; > + __u8 nmic; > + __u8 rescap; > + __u8 fpi; > + __u8 rsvd33; > + __le16 nawun; > + __le16 nawupf; > + __le16 nacwu; > + __le16 nabsn; > + __le16 nabo; > + __le16 nabspf; > + __le16 noiob; > + __u8 nvmcap[16]; > + __u8 rsvd64[28]; > + __le32 anagrpid; > + __u8 rsvd96[3]; > + __u8 nsattr; > + __u8 rsvd100[4]; > + __u8 nguid[16]; > + __u8 eui64[8]; > + struct nvme_lbaf lbaf[16]; > + __u8 rsvd192[192]; > + __u8 vs[3712]; > +}; > +/* nvme id ns end */ > + > +/* nvme ana start */ > +enum nvme_ana_state { > + NVME_ANA_OPTIMIZED = 0x01, > + NVME_ANA_NONOPTIMIZED = 0x02, > + NVME_ANA_INACCESSIBLE = 0x03, > + NVME_ANA_PERSISTENT_LOSS = 0x04, > + NVME_ANA_CHANGE = 0x0f, > + NVME_ANA_RESERVED = 0x05, > +}; > + > +struct nvme_ana_rsp_hdr { > + __le64 chgcnt; > + __le16 ngrps; > + __le16 rsvd10[3]; > +}; > + > +struct nvme_ana_group_desc { > + __le32 grpid; > + __le32 nnsids; > + __le64 chgcnt; > + __u8 state; > + __u8 rsvd17[15]; > + __le32 nsids[]; > +}; > + > +/* flag for the log specific field of the ANA log */ > +#define NVME_ANA_LOG_RGO (1 << 0) > + > +/* nvme ana end */ > + > +#endif > diff --git a/multipath/multipath.conf.5 b/multipath/multipath.conf.5 > 
index 6333366..c5b2fb6 100644 > --- a/multipath/multipath.conf.5 > +++ b/multipath/multipath.conf.5 > @@ -334,6 +334,10 @@ priority provided as argument. Requires > prio_args keyword. > Generate the path priority based on a latency algorithm. > Requires prio_args keyword. > .TP > +.I ana > +(Hardware-dependent) > +Generate the path priority based on the NVMe ANA settings. > +.TP > .I datacore > (Hardware-dependent) > Generate the path priority for some DataCore storage arrays. > Requires prio_args > @@ -1391,6 +1395,10 @@ Active/Standby mode exclusively. > .I 1 alua > (Hardware-dependent) > Hardware handler for SCSI-3 ALUA compatible arrays. > +.TP > +.I 1 ana > +(Hardware-dependent) > +Hardware handler for NVMe ANA compatible arrays. No. hardware handlers are for SCSI only. Regards, Martin
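One of the review comments above asks for a sanity check on ana_log_len before the allocation. A minimal sketch of such a check is below; the struct layouts mirror the ones in ana.h, but the 1 MiB cap is purely an assumption for illustration (the spec imposes no such limit), and the function name is made up.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical upper bound on a sane ANA log size -- an assumption
 * for illustration, not a value from the NVMe spec. */
#define ANA_LOG_MAX_LEN (1024 * 1024)

/* Same sizes as struct nvme_ana_rsp_hdr / nvme_ana_group_desc in ana.h. */
struct ana_rsp_hdr  { uint64_t chgcnt; uint16_t ngrps; uint16_t rsvd10[3]; };
struct ana_grp_desc { uint32_t grpid; uint32_t nnsids; uint64_t chgcnt;
		      uint8_t state; uint8_t rsvd17[15]; };

/* Compute the log length in 64-bit arithmetic and reject values the
 * controller reports that would overflow or be absurdly large.
 * Returns 0 and stores the length on success, -1 otherwise. */
static int ana_log_len_checked(uint32_t nanagrpid, uint32_t mnan,
			       int rgo, size_t *len)
{
	uint64_t l = sizeof(struct ana_rsp_hdr) +
		     (uint64_t)nanagrpid * sizeof(struct ana_grp_desc);

	if (!rgo)	/* namespace lists are included in the log */
		l += (uint64_t)mnan * sizeof(uint32_t);

	if (l == 0 || l > ANA_LOG_MAX_LEN)
		return -1;
	*len = (size_t)l;
	return 0;
}
```

A controller that reports a bogus NANAGRPID or MNAN then fails cleanly instead of driving a huge malloc().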
On Mon, Nov 12 2018 at 11:23am -0500, Martin Wilck <mwilck@suse.com> wrote: > Hello Lijie, > > On Thu, 2018-11-08 at 14:09 +0800, lijie wrote: > > Add support for Asynchronous Namespace Access as specified in NVMe > > 1.3 > > TP 4004. The states are updated through reading the ANA log page. > > > > By default, the native nvme multipath takes over the nvme device. > > We can pass a false to the parameter 'multipath' of the nvme-core.ko > > module,when we want to use multipath-tools. > > Thank you for the patch. It looks quite good to me. I've tested it with > a Linux target and found no problems so far. > > I have a few questions and comments inline below. > > I suggest you also have a look at detect_prio(); it seems to make sense > to use the ana prioritizer for NVMe paths automatically if ANA is > supported (with your patch, "detect_prio no" and "prio ana" have to be > configured explicitly). But that can be done in a later patch. I (and others) think it makes sense to at least triple check with the NVMe developers (now cc'd) to see if we could get agreement on the nvme driver providing the ANA state via sysfs (when modparam nvme_core.multipath=N is set), like Hannes proposed here: http://lists.infradead.org/pipermail/linux-nvme/2018-November/020765.html Then the userspace multipath-tools ANA support could just read sysfs rather than reinvent harvesting the ANA state via ioctl. But if we continue to hit a wall on that type of sharing of the nvme driver's state then I'm fine with reinventing ANA state inquiry and tracking like has been proposed here. Mike p.s. thanks for your review Martin, we really need to determine the way forward for full multipath-tools support of NVMe with ANA. -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
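If the driver did export the ANA state as proposed, the userspace side would shrink to a one-line attribute read. The sketch below assumes a sysfs attribute along the lines of Hannes's proposal (e.g. an `ana_state` file under the path's sysfs directory) -- the attribute name and path layout are assumptions, not an existing kernel interface. The path is a parameter so the helper can be exercised against any file.

```c
#include <stdio.h>
#include <string.h>

/* Read a one-line sysfs-style attribute (e.g. a hypothetical
 * /sys/block/nvme0c1n1/ana_state) into buf, trimming the trailing
 * newline. Returns 0 on success, -1 on error. */
static int read_sysfs_attr(const char *path, char *buf, size_t len)
{
	FILE *f = fopen(path, "r");
	size_t n;

	if (!f)
		return -1;
	if (!fgets(buf, (int)len, f)) {
		fclose(f);
		return -1;
	}
	fclose(f);
	n = strlen(buf);
	if (n && buf[n - 1] == '\n')
		buf[n - 1] = '\0';
	return 0;
}
```

Compared with the ioctl route, there is no log-page layout to parse in userspace at all; the string ("optimized", "non-optimized", ...) maps directly onto a priority.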
On Mon, 2018-11-12 at 16:53 -0500, Mike Snitzer wrote: > > I (and others) think it makes sense to at least triple check with the > NVMe developers (now cc'd) to see if we could get agreement on the > nvme > driver providing the ANA state via sysfs (when modparam > nvme_core.multipath=N is set), like Hannes proposed here: > http://lists.infradead.org/pipermail/linux-nvme/2018-November/020765.html > > Then the userspace multipath-tools ANA support could just read sysfs > rather than reinvent harvesting the ANA state via ioctl. If some kernels expose the sysfs attributes and some don't, what are we supposed to do? In order to be portable, multipath-tools (and other user space tools, for that matter) need to support both, and maintain multiple code paths. Just like SCSI :-) I'd really like to see this abstracted away with a "libnvme" (you name it), which user space tools could simply call without having to worry how it actually talks to the kernel. Martin
On Mon, Nov 12, 2018 at 04:53:23PM -0500, Mike Snitzer wrote: > On Mon, Nov 12 2018 at 11:23am -0500, > Martin Wilck <mwilck@suse.com> wrote: > > > Hello Lijie, > > > > On Thu, 2018-11-08 at 14:09 +0800, lijie wrote: > > > Add support for Asynchronous Namespace Access as specified in NVMe > > > 1.3 > > > TP 4004. The states are updated through reading the ANA log page. > > > > > > By default, the native nvme multipath takes over the nvme device. > > > We can pass a false to the parameter 'multipath' of the nvme-core.ko > > > module,when we want to use multipath-tools. > > > > Thank you for the patch. It looks quite good to me. I've tested it with > > a Linux target and found no problems so far. > > > > I have a few questions and comments inline below. > > > > I suggest you also have a look at detect_prio(); it seems to make sense > > to use the ana prioritizer for NVMe paths automatically if ANA is > > supported (with your patch, "detect_prio no" and "prio ana" have to be > > configured explicitly). But that can be done in a later patch. > > I (and others) think it makes sense to at least triple check with the > NVMe developers (now cc'd) to see if we could get agreement on the nvme > driver providing the ANA state via sysfs (when modparam > nvme_core.multipath=N is set), like Hannes proposed here: > http://lists.infradead.org/pipermail/linux-nvme/2018-November/020765.html > > Then the userspace multipath-tools ANA support could just read sysfs > rather than reinvent harvesting the ANA state via ioctl. I'd prefer not duplicating the log page parsing. Maybe nvme's shouldn't even be tied to CONFIG_NVME_MULTIPATH so that the 'multipath' param isn't even an issue. > But if we continue to hit a wall on that type of sharing of the nvme > driver's state then I'm fine with reinventing ANA state inquiry and > tracking like has been proposed here. > > Mike > > p.s. 
thanks for your review Martin, we really need to determine the way > forward for full multipath-tools support of NVMe with ANA.
On Tue, Nov 13 2018 at 11:18am -0500, Keith Busch <keith.busch@intel.com> wrote: > On Mon, Nov 12, 2018 at 04:53:23PM -0500, Mike Snitzer wrote: > > On Mon, Nov 12 2018 at 11:23am -0500, > > Martin Wilck <mwilck@suse.com> wrote: > > > > > Hello Lijie, > > > > > > On Thu, 2018-11-08 at 14:09 +0800, lijie wrote: > > > > Add support for Asynchronous Namespace Access as specified in NVMe > > > > 1.3 > > > > TP 4004. The states are updated through reading the ANA log page. > > > > > > > > By default, the native nvme multipath takes over the nvme device. > > > > We can pass a false to the parameter 'multipath' of the nvme-core.ko > > > > module,when we want to use multipath-tools. > > > > > > Thank you for the patch. It looks quite good to me. I've tested it with > > > a Linux target and found no problems so far. > > > > > > I have a few questions and comments inline below. > > > > > > I suggest you also have a look at detect_prio(); it seems to make sense > > > to use the ana prioritizer for NVMe paths automatically if ANA is > > > supported (with your patch, "detect_prio no" and "prio ana" have to be > > > configured explicitly). But that can be done in a later patch. > > > > I (and others) think it makes sense to at least triple check with the > > NVMe developers (now cc'd) to see if we could get agreement on the nvme > > driver providing the ANA state via sysfs (when modparam > > nvme_core.multipath=N is set), like Hannes proposed here: > > http://lists.infradead.org/pipermail/linux-nvme/2018-November/020765.html > > > > Then the userspace multipath-tools ANA support could just read sysfs > > rather than reinvent harvesting the ANA state via ioctl. > > I'd prefer not duplicating the log page parsing. Maybe nvme's shouldn't > even be tied to CONFIG_NVME_MULTIPATH so that the 'multipath' param > isn't even an issue. I like your instincts, we just need to take them a bit further. 
Splitting out the kernel's ANA log page parsing won't buy us much given it is userspace (multipath-tools) that needs to consume it. The less work userspace needs to do (because kernel has already done it) the better. If the NVMe driver is made to always track and export the ANA state via sysfs [1] we'd avoid userspace parsing duplication "for free". This should occur regardless of what layer is reacting to the ANA state changes (be it NVMe's native multipathing or multipath-tools). ANA and NVMe multipathing really are disjoint, making them tightly coupled only serves to force NVMe driver provided multipathing _or_ userspace ANA state tracking duplication that really isn't ideal [2]. We need a reasoned answer to the primary question of whether the NVMe maintainers are willing to cooperate by providing this basic ANA sysfs export even if nvme_core.multipath=N [1]. Christoph said "No" [3], but offered little _real_ justification for why this isn't the right thing for NVMe in general -- even when asked the question gets ignored [4]. The inability to provide proper justification for rejecting a patch (that already had one co-maintainer's Reviewed-by [5]) _should_ render that rejection baseless, and the patch applied (especially if there is contributing subsystem developer interest in maintaining this support over time, which there is). At least that is what would happen in a properly maintained kernel subsystem. It'd really go a long way if senior Linux NVMe maintainers took steps to accept reasonable changes. 
Mike [1]: http://lists.infradead.org/pipermail/linux-nvme/2018-November/020765.html [2]: https://www.redhat.com/archives/dm-devel/2018-November/msg00072.html [3]: http://lists.infradead.org/pipermail/linux-nvme/2018-November/020815.html [4]: http://lists.infradead.org/pipermail/linux-nvme/2018-November/020846.html [5]: http://lists.infradead.org/pipermail/linux-nvme/2018-November/020790.html
On Tue, Nov 13 2018 at 1:00pm -0500, Mike Snitzer <snitzer@redhat.com> wrote: > On Tue, Nov 13 2018 at 11:18am -0500, > Keith Busch <keith.busch@intel.com> wrote: > > > On Mon, Nov 12, 2018 at 04:53:23PM -0500, Mike Snitzer wrote: > > > On Mon, Nov 12 2018 at 11:23am -0500, > > > Martin Wilck <mwilck@suse.com> wrote: > > > > > > > Hello Lijie, > > > > > > > > On Thu, 2018-11-08 at 14:09 +0800, lijie wrote: > > > > > Add support for Asynchronous Namespace Access as specified in NVMe > > > > > 1.3 > > > > > TP 4004. The states are updated through reading the ANA log page. > > > > > > > > > > By default, the native nvme multipath takes over the nvme device. > > > > > We can pass a false to the parameter 'multipath' of the nvme-core.ko > > > > > module,when we want to use multipath-tools. > > > > > > > > Thank you for the patch. It looks quite good to me. I've tested it with > > > > a Linux target and found no problems so far. > > > > > > > > I have a few questions and comments inline below. > > > > > > > > I suggest you also have a look at detect_prio(); it seems to make sense > > > > to use the ana prioritizer for NVMe paths automatically if ANA is > > > > supported (with your patch, "detect_prio no" and "prio ana" have to be > > > > configured explicitly). But that can be done in a later patch. > > > > > > I (and others) think it makes sense to at least triple check with the > > > NVMe developers (now cc'd) to see if we could get agreement on the nvme > > > driver providing the ANA state via sysfs (when modparam > > > nvme_core.multipath=N is set), like Hannes proposed here: > > > http://lists.infradead.org/pipermail/linux-nvme/2018-November/020765.html > > > > > > Then the userspace multipath-tools ANA support could just read sysfs > > > rather than reinvent harvesting the ANA state via ioctl. > > > > I'd prefer not duplicating the log page parsing. 
Maybe nvme's shouldn't > > even be tied to CONFIG_NVME_MULTIPATH so that the 'multipath' param > > isn't even an issue. > > I like your instincts, we just need to take them a bit further. > > Splitting out the kernel's ANA log page parsing won't buy us much given > it is userspace (multipath-tools) that needs to consume it. The less > work userspace needs to do (because kernel has already done it) the > better. > > If the NVMe driver is made to always track and export the ANA state via > sysfs [1] we'd avoid userspace parsing duplication "for free". This > should occur regardless of what layer is reacting to the ANA state > changes (be it NVMe's native multipathing or multipath-tools). > > ANA and NVMe multipathing really are disjoint, making them tightly > coupled only serves to force NVMe driver provided multipathing _or_ > userspace ANA state tracking duplication that really isn't ideal [2]. > > We need a reasoned answer to the primary question of whether the NVMe > maintainers are willing to cooperate by providing this basic ANA sysfs > export even if nvme_core.multipath=N [1]. > > Christoph said "No" [3], but offered little _real_ justification for why > this isn't the right thing for NVMe in general. ... > [1]: http://lists.infradead.org/pipermail/linux-nvme/2018-November/020765.html > [2]: https://www.redhat.com/archives/dm-devel/2018-November/msg00072.html ... I knew there had to be a pretty tight coupling between the NVMe driver's native multipathing and ANA support... and that the simplicity of Hannes' patch [1] was too good to be true. The real justification for not making Hannes' change is it'd effectively be useless without first splitting out the ANA handling done during NVMe request completion (NVME_SC_ANA_* cases in nvme_failover_req) that triggers re-reading the ANA log page accordingly. 
So without the ability to drive the ANA workqueue to trigger nvme_read_ana_log() from the nvme driver's completion path -- even if nvme_core.multipath=N -- it really doesn't buy multipath-tools anything to have the NVMe driver export the ana state via sysfs, because that ANA state will never get updated. > The inability to provide proper justification for rejecting a patch > (that already had one co-maintainer's Reviewed-by [5]) _should_ render > that rejection baseless, and the patch applied (especially if there is > contributing subsystem developer interest in maintaining this support > over time, which there is). At least that is what would happen in a > properly maintained kernel subsystem. > > It'd really go a long way if senior Linux NVMe maintainers took steps to > accept reasonable changes. Even though I'm frustrated I was clearly too harsh and regret my tone. I promise to _try_ to suck less. This dynamic of terse responses or no responses at all whenever NVMe driver changes to ease multipath-tools NVMe support are floated is the depressing gift that keeps on giving. But enough excuses... Not holding my breath BUT: if decoupling the reading of ANA state from native NVMe multipathing specific work during nvme request completion were an acceptable advancement I'd gladly do the work. Mike
On 11/13/18 5:18 PM, Keith Busch wrote: > On Mon, Nov 12, 2018 at 04:53:23PM -0500, Mike Snitzer wrote: >> On Mon, Nov 12 2018 at 11:23am -0500, >> Martin Wilck <mwilck@suse.com> wrote: >> >>> Hello Lijie, >>> >>> On Thu, 2018-11-08 at 14:09 +0800, lijie wrote: >>>> Add support for Asynchronous Namespace Access as specified in NVMe >>>> 1.3 >>>> TP 4004. The states are updated through reading the ANA log page. >>>> >>>> By default, the native nvme multipath takes over the nvme device. >>>> We can pass a false to the parameter 'multipath' of the nvme-core.ko >>>> module,when we want to use multipath-tools. >>> >>> Thank you for the patch. It looks quite good to me. I've tested it with >>> a Linux target and found no problems so far. >>> >>> I have a few questions and comments inline below. >>> >>> I suggest you also have a look at detect_prio(); it seems to make sense >>> to use the ana prioritizer for NVMe paths automatically if ANA is >>> supported (with your patch, "detect_prio no" and "prio ana" have to be >>> configured explicitly). But that can be done in a later patch. >> >> I (and others) think it makes sense to at least triple check with the >> NVMe developers (now cc'd) to see if we could get agreement on the nvme >> driver providing the ANA state via sysfs (when modparam >> nvme_core.multipath=N is set), like Hannes proposed here: >> http://lists.infradead.org/pipermail/linux-nvme/2018-November/020765.html >> >> Then the userspace multipath-tools ANA support could just read sysfs >> rather than reinvent harvesting the ANA state via ioctl. > > I'd prefer not duplicating the log page parsing. Maybe nvme's shouldn't > even be tied to CONFIG_NVME_MULTIPATH so that the 'multipath' param > isn't even an issue. > My argument here is that _ANA_ support should not be tied to the NVME native multipathing at all. Whether or not ANA is present is a choice of the target implementation; the host has _zero_ influence on this. 
It may argue all it wants and doesn't even have to implement multipathing. But in the end, if the target declares a path as 'inaccessible', the path _is_ inaccessible to the host, even if that particular host doesn't implement or activate multipathing. (You wouldn't believe how often I had this discussion with our QA team for ALUA ...) Whether or not _multipathing_ is used is a _host_ implementation, and is actually independent of ANA support; Linux itself proved the point here, as we supported (symmetric) multipathing far longer than we had an ANA implementation. So personally I don't see why the 'raw' ANA support (parsing log pages, figuring out path states etc.) needs to be tied in with native NVMe multipathing. _Especially_ not as my patch really is trivial. Cheers, Hannes
On 11/14/18 6:38 AM, Mike Snitzer wrote: > On Tue, Nov 13 2018 at 1:00pm -0500, > Mike Snitzer <snitzer@redhat.com> wrote: > >> On Tue, Nov 13 2018 at 11:18am -0500, >> Keith Busch <keith.busch@intel.com> wrote: >> >>> On Mon, Nov 12, 2018 at 04:53:23PM -0500, Mike Snitzer wrote: >>>> On Mon, Nov 12 2018 at 11:23am -0500, >>>> Martin Wilck <mwilck@suse.com> wrote: >>>> >>>>> Hello Lijie, >>>>> >>>>> On Thu, 2018-11-08 at 14:09 +0800, lijie wrote: >>>>>> Add support for Asynchronous Namespace Access as specified in NVMe >>>>>> 1.3 >>>>>> TP 4004. The states are updated through reading the ANA log page. >>>>>> >>>>>> By default, the native nvme multipath takes over the nvme device. >>>>>> We can pass a false to the parameter 'multipath' of the nvme-core.ko >>>>>> module,when we want to use multipath-tools. >>>>> >>>>> Thank you for the patch. It looks quite good to me. I've tested it with >>>>> a Linux target and found no problems so far. >>>>> >>>>> I have a few questions and comments inline below. >>>>> >>>>> I suggest you also have a look at detect_prio(); it seems to make sense >>>>> to use the ana prioritizer for NVMe paths automatically if ANA is >>>>> supported (with your patch, "detect_prio no" and "prio ana" have to be >>>>> configured explicitly). But that can be done in a later patch. >>>> >>>> I (and others) think it makes sense to at least triple check with the >>>> NVMe developers (now cc'd) to see if we could get agreement on the nvme >>>> driver providing the ANA state via sysfs (when modparam >>>> nvme_core.multipath=N is set), like Hannes proposed here: >>>> http://lists.infradead.org/pipermail/linux-nvme/2018-November/020765.html >>>> >>>> Then the userspace multipath-tools ANA support could just read sysfs >>>> rather than reinvent harvesting the ANA state via ioctl. >>> >>> I'd prefer not duplicating the log page parsing. Maybe nvme's shouldn't >>> even be tied to CONFIG_NVME_MULTIPATH so that the 'multipath' param >>> isn't even an issue. 
>> >> I like your instincts, we just need to take them a bit further. >> >> Splitting out the kernel's ANA log page parsing won't buy us much given >> it is userspace (multipath-tools) that needs to consume it. The less >> work userspace needs to do (because kernel has already done it) the >> better. >> >> If the NVMe driver is made to always track and export the ANA state via >> sysfs [1] we'd avoid userspace parsing duplication "for free". This >> should occur regardless of what layer is reacting to the ANA state >> changes (be it NVMe's native multipathing or multipath-tools). >> >> ANA and NVMe multipathing really are disjoint, making them tightly >> coupled only serves to force NVMe driver provided multipathing _or_ >> userspace ANA state tracking duplication that really isn't ideal [2]. >> >> We need a reasoned answer to the primary question of whether the NVMe >> maintainers are willing to cooperate by providing this basic ANA sysfs >> export even if nvme_core.multipath=N [1]. >> >> Christoph said "No" [3], but offered little _real_ justification for why >> this isn't the right thing for NVMe in general. > ... >> [1]: http://lists.infradead.org/pipermail/linux-nvme/2018-November/020765.html >> [2]: https://www.redhat.com/archives/dm-devel/2018-November/msg00072.html > ... > > I knew there had to be a pretty tight coupling between the NVMe driver's > native multipathing and ANA support... and that the simplicity of > Hannes' patch [1] was too good to be true. > > The real justification for not making Hannes' change is it'd effectively > be useless without first splitting out the ANA handling done during NVMe > request completion (NVME_SC_ANA_* cases in nvme_failover_req) that > triggers re-reading the ANA log page accordingly. 
> > So without the ability to drive the ANA workqueue to trigger > nvme_read_ana_log() from the nvme driver's completion path -- even if > nvme_core.multipath=N -- it really doesn't buy multipath-tools anything > to have the NVMe driver export the ana state via sysfs, because that ANA > state will never get updated. > Hmm. Indeed, I was more focussed on having the sysfs attributes displayed, so yes, indeed it needs some more work. >> The inability to provide proper justification for rejecting a patch >> (that already had one co-maintainer's Reviewed-by [5]) _should_ render >> that rejection baseless, and the patch applied (especially if there is >> contributing subsystem developer interest in maintaining this support >> over time, which there is). At least that is what would happen in a >> properly maintained kernel subsystem. >> >> It'd really go a long way if senior Linux NVMe maintainers took steps to >> accept reasonable changes. > > Even though I'm frustrated I was clearly too harsh and regret my tone. > I promise to _try_ to suck less. > > This dynamic of terse responses or no responses at all whenever NVMe > driver changes to ease multipath-tools NVMe support are floated is the > depressing gift that keeps on giving. But enough excuses... > > Not holding my breath BUT: > if decoupling the reading of ANA state from native NVMe multipathing > specific work during nvme request completion were an acceptable > advancement I'd gladly do the work. > I'd be happy to work on that, given that we'll have to have 'real' ANA support for device-mapper anyway for SLE12 SP4 etc. Cheers, Hannes
On Wed, 2018-11-14 at 08:49 +0100, Hannes Reinecke wrote: > On 11/14/18 6:38 AM, Mike Snitzer wrote: > > > > Not holding my breath BUT: > > if decoupling the reading of ANA state from native NVMe > > multipathing > > specific work during nvme request completion were an acceptable > > advancement I'd gladly do the work. > > > I'd be happy to work on that, given that we'll have to have 'real' > ANA > support for device-mapper anyway for SLE12 SP4 etc. So, what's the way ahead for multipath-tools? Given that, IIUC, at least one official kernel (4.19) has been released with ANA support but without the ANA sysfs attributes exported to user space, multipath-tools can't rely on them being present. I for one have no issue with copying code from nvme-cli and doing NVMe ioctls from multipath-tools, as in Lijie's patch. When those sysfs attributes are added (and work as expected), we will use them, but IMO we'll need to fall back to ioctls until then, and also afterwards, because we can't assume to be always running on the latest kernel. Best, Martin
On Wed, Nov 14, 2018 at 08:24:05AM +0100, Hannes Reinecke wrote: > My argument here is that _ANA_ support should not be tied to the NVME > native multipathing at all. It should. Because nvme driver multipathing is the only sanctioned way to use it. All other ways aren't supported and might break at any time. > So personally I don't see why the 'raw' ANA support (parsing log pages, > figuring out path states etc) needs to be tied in with native NVMe > multipathing. _Especially_ not as my patch really is trivial. And not actually usable for anything. The only thing you do is to increase the code size for embedded nvme users that don't need any multipathing.
On Wed, Nov 14 2018 at 10:35am -0500, Christoph Hellwig <hch@lst.de> wrote: > On Wed, Nov 14, 2018 at 08:24:05AM +0100, Hannes Reinecke wrote: > > My argument here is that _ANA_ support should not be tied to the NVME > > native multipathing at all. > > It should. Because nvme driver multipathing is the only sanctioned > way to use it. All other ways aren't supported and might break at > any time. Quite a few of us who are multipath-tools oriented would like the proper separation rather than your more pragmatic isolated approach. And we'd fix it if it broke in the future. > > So personally I don't see why the 'raw' ANA support (parsing log pages, > > figuring out path states etc) needs to be tied in with native NVMe > > multipathing. _Especially_ not as my patch really is trivial. > > And not actually usable for anything.. They only thing you do is to > increase the code size for embedded nvme users that don't need any > multipathing. Isn't that why CONFIG_NVME_MULTIPATH exists? Embedded nvme users wouldn't set that.
On Wed, Nov 14 2018 at 2:49am -0500, Hannes Reinecke <hare@suse.de> wrote: > On 11/14/18 6:38 AM, Mike Snitzer wrote: > >On Tue, Nov 13 2018 at 1:00pm -0500, > >Mike Snitzer <snitzer@redhat.com> wrote: > > > >>[1]: http://lists.infradead.org/pipermail/linux-nvme/2018-November/020765.html > >>[2]: https://www.redhat.com/archives/dm-devel/2018-November/msg00072.html > >... > > > >I knew there had to be a pretty tight coupling between the NVMe driver's > >native multipathing and ANA support... and that the simplicity of > >Hannes' patch [1] was too good to be true. > > > >The real justification for not making Hannes' change is it'd effectively > >be useless without first splitting out the ANA handling done during NVMe > >request completion (NVME_SC_ANA_* cases in nvme_failover_req) that > >triggers re-reading the ANA log page accordingly. > > > >So without the ability to drive the ANA workqueue to trigger > >nvme_read_ana_log() from the nvme driver's completion path -- even if > >nvme_core.multipath=N -- it really doesn't buy multipath-tools anything > >to have the NVMe driver export the ana state via sysfs, because that ANA > >state will never get updated. > > > Hmm. Indeed, I was more focussed on having the sysfs attributes > displayed, so yes, indeed it needs some more work. ... > >Not holding my breath BUT: > >if decoupling the reading of ANA state from native NVMe multipathing > >specific work during nvme request completion were an acceptable > >advancement I'd gladly do the work. > > > I'd be happy to work on that, given that we'll have to have 'real' > ANA support for device-mapper anyway for SLE12 SP4 etc. 
I had a close enough look yesterday that I figured I'd just implement what I reasoned through as one way forward, compile tested only (patch relative to Jens' for-4.21/block): drivers/nvme/host/core.c | 14 +++++++--- drivers/nvme/host/multipath.c | 65 ++++++++++++++++++++++++++----------------- drivers/nvme/host/nvme.h | 4 +++ 3 files changed, 54 insertions(+), 29 deletions(-) diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index f172d63db2b5..05313ab5d91e 100644 --- a/drivers/nvme/host/core.c +++ b/drivers/nvme/host/core.c @@ -252,10 +252,16 @@ void nvme_complete_rq(struct request *req) trace_nvme_complete_rq(req); if (unlikely(status != BLK_STS_OK && nvme_req_needs_retry(req))) { - if ((req->cmd_flags & REQ_NVME_MPATH) && - blk_path_error(status)) { - nvme_failover_req(req); - return; + if (blk_path_error(status)) { + struct nvme_ns *ns = req->q->queuedata; + u16 nvme_status = nvme_req(req)->status; + + if (req->cmd_flags & REQ_NVME_MPATH) { + nvme_failover_req(req); + nvme_update_ana(ns, nvme_status); + return; + } + nvme_update_ana(ns, nvme_status); } if (!blk_queue_dying(req->q)) { diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c index 5e3cc8c59a39..f7fbc161dc8c 100644 --- a/drivers/nvme/host/multipath.c +++ b/drivers/nvme/host/multipath.c @@ -22,7 +22,7 @@ MODULE_PARM_DESC(multipath, inline bool nvme_ctrl_use_ana(struct nvme_ctrl *ctrl) { - return multipath && ctrl->subsys && (ctrl->subsys->cmic & (1 << 3)); + return ctrl->subsys && (ctrl->subsys->cmic & (1 << 3)); } /* @@ -47,6 +47,17 @@ void nvme_set_disk_name(char *disk_name, struct nvme_ns *ns, } } +static bool nvme_ana_error(u16 status) +{ + switch (status & 0x7ff) { + case NVME_SC_ANA_TRANSITION: + case NVME_SC_ANA_INACCESSIBLE: + case NVME_SC_ANA_PERSISTENT_LOSS: + return true; + } + return false; +} + void nvme_failover_req(struct request *req) { struct nvme_ns *ns = req->q->queuedata; @@ -58,10 +69,7 @@ void nvme_failover_req(struct request *req) 
spin_unlock_irqrestore(&ns->head->requeue_lock, flags); blk_mq_end_request(req, 0); - switch (status & 0x7ff) { - case NVME_SC_ANA_TRANSITION: - case NVME_SC_ANA_INACCESSIBLE: - case NVME_SC_ANA_PERSISTENT_LOSS: + if (nvme_ana_error(status)) { /* * If we got back an ANA error we know the controller is alive, * but not ready to serve this namespaces. The spec suggests @@ -69,31 +77,38 @@ void nvme_failover_req(struct request *req) * that the admin and I/O queues are not serialized that is * fundamentally racy. So instead just clear the current path, * mark the the path as pending and kick of a re-read of the ANA - * log page ASAP. + * log page ASAP (see nvme_update_ana() below). */ nvme_mpath_clear_current_path(ns); - if (ns->ctrl->ana_log_buf) { - set_bit(NVME_NS_ANA_PENDING, &ns->flags); - queue_work(nvme_wq, &ns->ctrl->ana_work); + } else { + switch (status & 0x7ff) { + case NVME_SC_HOST_PATH_ERROR: + /* + * Temporary transport disruption in talking to the + * controller. Try to send on a new path. + */ + nvme_mpath_clear_current_path(ns); + break; + default: + /* + * Reset the controller for any non-ANA error as we + * don't know what caused the error. + */ + nvme_reset_ctrl(ns->ctrl); + break; } - break; - case NVME_SC_HOST_PATH_ERROR: - /* - * Temporary transport disruption in talking to the controller. - * Try to send on a new path. - */ - nvme_mpath_clear_current_path(ns); - break; - default: - /* - * Reset the controller for any non-ANA error as we don't know - * what caused the error. 
- */ - nvme_reset_ctrl(ns->ctrl); - break; } +} - kblockd_schedule_work(&ns->head->requeue_work); +void nvme_update_ana(struct nvme_ns *ns, u16 status) +{ + if (nvme_ana_error(status) && ns->ctrl->ana_log_buf) { + set_bit(NVME_NS_ANA_PENDING, &ns->flags); + queue_work(nvme_wq, &ns->ctrl->ana_work); + } + + if (multipath) + kblockd_schedule_work(&ns->head->requeue_work); } void nvme_kick_requeue_lists(struct nvme_ctrl *ctrl) diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h index 32a1f1cfdfb4..8b4bc2054b7a 100644 --- a/drivers/nvme/host/nvme.h +++ b/drivers/nvme/host/nvme.h @@ -468,6 +468,7 @@ bool nvme_ctrl_use_ana(struct nvme_ctrl *ctrl); void nvme_set_disk_name(char *disk_name, struct nvme_ns *ns, struct nvme_ctrl *ctrl, int *flags); void nvme_failover_req(struct request *req); +void nvme_update_ana(struct nvme_ns *ns, u16 status); void nvme_kick_requeue_lists(struct nvme_ctrl *ctrl); int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl,struct nvme_ns_head *head); void nvme_mpath_add_disk(struct nvme_ns *ns, struct nvme_id_ns *id); @@ -507,6 +508,9 @@ static inline void nvme_set_disk_name(char *disk_name, struct nvme_ns *ns, static inline void nvme_failover_req(struct request *req) { } +static inline void nvme_update_ana(struct nvme_ns *ns, u16 status) +{ +} static inline void nvme_kick_requeue_lists(struct nvme_ctrl *ctrl) { }
On 11/14/18 6:47 PM, Mike Snitzer wrote: > On Wed, Nov 14 2018 at 2:49am -0500, > Hannes Reinecke <hare@suse.de> wrote: > >> On 11/14/18 6:38 AM, Mike Snitzer wrote: >>> On Tue, Nov 13 2018 at 1:00pm -0500, >>> Mike Snitzer <snitzer@redhat.com> wrote: >>> >>>> [1]: http://lists.infradead.org/pipermail/linux-nvme/2018-November/020765.html >>>> [2]: https://www.redhat.com/archives/dm-devel/2018-November/msg00072.html >>> ... >>> >>> I knew there had to be a pretty tight coupling between the NVMe driver's >>> native multipathing and ANA support... and that the simplicity of >>> Hannes' patch [1] was too good to be true. >>> >>> The real justification for not making Hannes' change is it'd effectively >>> be useless without first splitting out the ANA handling done during NVMe >>> request completion (NVME_SC_ANA_* cases in nvme_failover_req) that >>> triggers re-reading the ANA log page accordingly. >>> >>> So without the ability to drive the ANA workqueue to trigger >>> nvme_read_ana_log() from the nvme driver's completion path -- even if >>> nvme_core.multipath=N -- it really doesn't buy multipath-tools anything >>> to have the NVMe driver export the ana state via sysfs, because that ANA >>> state will never get updated. >>> >> Hmm. Indeed, I was more focussed on having the sysfs attributes >> displayed, so yes, indeed it needs some more work. > ... >>> Not holding my breath BUT: >>> if decoupling the reading of ANA state from native NVMe multipathing >>> specific work during nvme request completion were an acceptable >>> advancement I'd gladly do the work. >>> >> I'd be happy to work on that, given that we'll have to have 'real' >> ANA support for device-mapper anyway for SLE12 SP4 etc. 
> > I had a close enough look yesterday that I figured I'd just implement > what I reasoned through as one way forward, compile tested only (patch > relative to Jens' for-4.21/block): > > drivers/nvme/host/core.c | 14 +++++++--- > drivers/nvme/host/multipath.c | 65 ++++++++++++++++++++++++++----------------- > drivers/nvme/host/nvme.h | 4 +++ > 3 files changed, 54 insertions(+), 29 deletions(-) > > diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c > index f172d63db2b5..05313ab5d91e 100644 > --- a/drivers/nvme/host/core.c > +++ b/drivers/nvme/host/core.c > @@ -252,10 +252,16 @@ void nvme_complete_rq(struct request *req) > trace_nvme_complete_rq(req); > > if (unlikely(status != BLK_STS_OK && nvme_req_needs_retry(req))) { > - if ((req->cmd_flags & REQ_NVME_MPATH) && > - blk_path_error(status)) { > - nvme_failover_req(req); > - return; > + if (blk_path_error(status)) { > + struct nvme_ns *ns = req->q->queuedata; > + u16 nvme_status = nvme_req(req)->status; > + > + if (req->cmd_flags & REQ_NVME_MPATH) { > + nvme_failover_req(req); > + nvme_update_ana(ns, nvme_status); > + return; > + } > + nvme_update_ana(ns, nvme_status); > } > > if (!blk_queue_dying(req->q)) { > diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c > index 5e3cc8c59a39..f7fbc161dc8c 100644 > --- a/drivers/nvme/host/multipath.c > +++ b/drivers/nvme/host/multipath.c > @@ -22,7 +22,7 @@ MODULE_PARM_DESC(multipath, > > inline bool nvme_ctrl_use_ana(struct nvme_ctrl *ctrl) > { > - return multipath && ctrl->subsys && (ctrl->subsys->cmic & (1 << 3)); > + return ctrl->subsys && (ctrl->subsys->cmic & (1 << 3)); > } > > /* > @@ -47,6 +47,17 @@ void nvme_set_disk_name(char *disk_name, struct nvme_ns *ns, > } > } > > +static bool nvme_ana_error(u16 status) > +{ > + switch (status & 0x7ff) { > + case NVME_SC_ANA_TRANSITION: > + case NVME_SC_ANA_INACCESSIBLE: > + case NVME_SC_ANA_PERSISTENT_LOSS: > + return true; > + } > + return false; > +} > + > void nvme_failover_req(struct 
request *req) > { > struct nvme_ns *ns = req->q->queuedata; > @@ -58,10 +69,7 @@ void nvme_failover_req(struct request *req) > spin_unlock_irqrestore(&ns->head->requeue_lock, flags); > blk_mq_end_request(req, 0); > > - switch (status & 0x7ff) { > - case NVME_SC_ANA_TRANSITION: > - case NVME_SC_ANA_INACCESSIBLE: > - case NVME_SC_ANA_PERSISTENT_LOSS: > + if (nvme_ana_error(status)) { > /* > * If we got back an ANA error we know the controller is alive, > * but not ready to serve this namespaces. The spec suggests > @@ -69,31 +77,38 @@ void nvme_failover_req(struct request *req) > * that the admin and I/O queues are not serialized that is > * fundamentally racy. So instead just clear the current path, > * mark the the path as pending and kick of a re-read of the ANA > - * log page ASAP. > + * log page ASAP (see nvme_update_ana() below). > */ > nvme_mpath_clear_current_path(ns); > - if (ns->ctrl->ana_log_buf) { > - set_bit(NVME_NS_ANA_PENDING, &ns->flags); > - queue_work(nvme_wq, &ns->ctrl->ana_work); > + } else { > + switch (status & 0x7ff) { > + case NVME_SC_HOST_PATH_ERROR: > + /* > + * Temporary transport disruption in talking to the > + * controller. Try to send on a new path. > + */ > + nvme_mpath_clear_current_path(ns); > + break; > + default: > + /* > + * Reset the controller for any non-ANA error as we > + * don't know what caused the error. > + */ > + nvme_reset_ctrl(ns->ctrl); > + break; > } > - break; > - case NVME_SC_HOST_PATH_ERROR: > - /* > - * Temporary transport disruption in talking to the controller. > - * Try to send on a new path. > - */ > - nvme_mpath_clear_current_path(ns); > - break; > - default: > - /* > - * Reset the controller for any non-ANA error as we don't know > - * what caused the error. > - */ > - nvme_reset_ctrl(ns->ctrl); > - break; > } Maybe move nvme_mpath_clear_current_path() out of the loop (it wouldn't matter if we clear the path if we reset the controller afterwards); that might even clean up the code even more. 
> +} > > - kblockd_schedule_work(&ns->head->requeue_work); > +void nvme_update_ana(struct nvme_ns *ns, u16 status) > +{ > + if (nvme_ana_error(status) && ns->ctrl->ana_log_buf) { > + set_bit(NVME_NS_ANA_PENDING, &ns->flags); > + queue_work(nvme_wq, &ns->ctrl->ana_work); > + } > + > + if (multipath) > + kblockd_schedule_work(&ns->head->requeue_work); > } maybe use 'ns->head->disk' here; we only need to call this if we have a multipath disk. Remaining bits are okay. Cheers, Hannes
On Wed, Nov 14 2018 at 1:51pm -0500, Hannes Reinecke <hare@suse.de> wrote: > On 11/14/18 6:47 PM, Mike Snitzer wrote: > > On Wed, Nov 14 2018 at 2:49am -0500, > > Hannes Reinecke <hare@suse.de> wrote: > > > >> On 11/14/18 6:38 AM, Mike Snitzer wrote: > >>> On Tue, Nov 13 2018 at 1:00pm -0500, > >>> Mike Snitzer <snitzer@redhat.com> wrote: > >>> > >>>> [1]: http://lists.infradead.org/pipermail/linux-nvme/2018-November/020765.html > >>>> [2]: https://www.redhat.com/archives/dm-devel/2018-November/msg00072.html > >>> ... > >>> > >>> I knew there had to be a pretty tight coupling between the NVMe driver's > >>> native multipathing and ANA support... and that the simplicity of > >>> Hannes' patch [1] was too good to be true. > >>> > >>> The real justification for not making Hannes' change is it'd effectively > >>> be useless without first splitting out the ANA handling done during NVMe > >>> request completion (NVME_SC_ANA_* cases in nvme_failover_req) that > >>> triggers re-reading the ANA log page accordingly. > >>> > >>> So without the ability to drive the ANA workqueue to trigger > >>> nvme_read_ana_log() from the nvme driver's completion path -- even if > >>> nvme_core.multipath=N -- it really doesn't buy multipath-tools anything > >>> to have the NVMe driver export the ana state via sysfs, because that ANA > >>> state will never get updated. > >>> > >> Hmm. Indeed, I was more focussed on having the sysfs attributes > >> displayed, so yes, indeed it needs some more work. > > ... > >>> Not holding my breath BUT: > >>> if decoupling the reading of ANA state from native NVMe multipathing > >>> specific work during nvme request completion were an acceptable > >>> advancement I'd gladly do the work. > >>> > >> I'd be happy to work on that, given that we'll have to have 'real' > >> ANA support for device-mapper anyway for SLE12 SP4 etc. 
> > > > I had a close enough look yesterday that I figured I'd just implement > > what I reasoned through as one way forward, compile tested only (patch > > relative to Jens' for-4.21/block): > > > > drivers/nvme/host/core.c | 14 +++++++--- > > drivers/nvme/host/multipath.c | 65 ++++++++++++++++++++++++++----------------- > > drivers/nvme/host/nvme.h | 4 +++ > > 3 files changed, 54 insertions(+), 29 deletions(-) > > > > diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c > > index f172d63db2b5..05313ab5d91e 100644 > > --- a/drivers/nvme/host/core.c > > +++ b/drivers/nvme/host/core.c > > @@ -252,10 +252,16 @@ void nvme_complete_rq(struct request *req) > > trace_nvme_complete_rq(req); > > > > if (unlikely(status != BLK_STS_OK && nvme_req_needs_retry(req))) { > > - if ((req->cmd_flags & REQ_NVME_MPATH) && > > - blk_path_error(status)) { > > - nvme_failover_req(req); > > - return; > > + if (blk_path_error(status)) { > > + struct nvme_ns *ns = req->q->queuedata; > > + u16 nvme_status = nvme_req(req)->status; > > + > > + if (req->cmd_flags & REQ_NVME_MPATH) { > > + nvme_failover_req(req); > > + nvme_update_ana(ns, nvme_status); > > + return; > > + } > > + nvme_update_ana(ns, nvme_status); > > } > > > > if (!blk_queue_dying(req->q)) { > > diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c > > index 5e3cc8c59a39..f7fbc161dc8c 100644 > > --- a/drivers/nvme/host/multipath.c > > +++ b/drivers/nvme/host/multipath.c > > @@ -22,7 +22,7 @@ MODULE_PARM_DESC(multipath, > > > > inline bool nvme_ctrl_use_ana(struct nvme_ctrl *ctrl) > > { > > - return multipath && ctrl->subsys && (ctrl->subsys->cmic & (1 << 3)); > > + return ctrl->subsys && (ctrl->subsys->cmic & (1 << 3)); > > } > > > > /* > > @@ -47,6 +47,17 @@ void nvme_set_disk_name(char *disk_name, struct nvme_ns *ns, > > } > > } > > > > +static bool nvme_ana_error(u16 status) > > +{ > > + switch (status & 0x7ff) { > > + case NVME_SC_ANA_TRANSITION: > > + case NVME_SC_ANA_INACCESSIBLE: > > + 
case NVME_SC_ANA_PERSISTENT_LOSS: > > + return true; > > + } > > + return false; > > +} > > + > > void nvme_failover_req(struct request *req) > > { > > struct nvme_ns *ns = req->q->queuedata; > > @@ -58,10 +69,7 @@ void nvme_failover_req(struct request *req) > > spin_unlock_irqrestore(&ns->head->requeue_lock, flags); > > blk_mq_end_request(req, 0); > > > > - switch (status & 0x7ff) { > > - case NVME_SC_ANA_TRANSITION: > > - case NVME_SC_ANA_INACCESSIBLE: > > - case NVME_SC_ANA_PERSISTENT_LOSS: > > + if (nvme_ana_error(status)) { > > /* > > * If we got back an ANA error we know the controller is alive, > > * but not ready to serve this namespaces. The spec suggests > > @@ -69,31 +77,38 @@ void nvme_failover_req(struct request *req) > > * that the admin and I/O queues are not serialized that is > > * fundamentally racy. So instead just clear the current path, > > * mark the the path as pending and kick of a re-read of the ANA > > - * log page ASAP. > > + * log page ASAP (see nvme_update_ana() below). > > */ > > nvme_mpath_clear_current_path(ns); > > - if (ns->ctrl->ana_log_buf) { > > - set_bit(NVME_NS_ANA_PENDING, &ns->flags); > > - queue_work(nvme_wq, &ns->ctrl->ana_work); > > + } else { > > + switch (status & 0x7ff) { > > + case NVME_SC_HOST_PATH_ERROR: > > + /* > > + * Temporary transport disruption in talking to the > > + * controller. Try to send on a new path. > > + */ > > + nvme_mpath_clear_current_path(ns); > > + break; > > + default: > > + /* > > + * Reset the controller for any non-ANA error as we > > + * don't know what caused the error. > > + */ > > + nvme_reset_ctrl(ns->ctrl); > > + break; > > } > > - break; > > - case NVME_SC_HOST_PATH_ERROR: > > - /* > > - * Temporary transport disruption in talking to the controller. > > - * Try to send on a new path. > > - */ > > - nvme_mpath_clear_current_path(ns); > > - break; > > - default: > > - /* > > - * Reset the controller for any non-ANA error as we don't know > > - * what caused the error. 
> > - */ > > - nvme_reset_ctrl(ns->ctrl); > > - break; > > } > Maybe move nvme_mpath_clear_current_path() out of the loop (it wouldn't > matter if we clear the path if we reset the controller afterwards); that > might even clean up the code even more. Not completely sure what you're suggesting. But I was going for a purely functional equivalent of the existing upstream code -- just with multipathing and ana split. Could be that in future it wouldn't make sense to always nvme_mpath_clear_current_path() for all cases? > > > +} > > > > - kblockd_schedule_work(&ns->head->requeue_work); > > +void nvme_update_ana(struct nvme_ns *ns, u16 status) > > +{ > > + if (nvme_ana_error(status) && ns->ctrl->ana_log_buf) { > > + set_bit(NVME_NS_ANA_PENDING, &ns->flags); > > + queue_work(nvme_wq, &ns->ctrl->ana_work); > > + } > > + > > + if (multipath) > > + kblockd_schedule_work(&ns->head->requeue_work); > > } > > maybe use 'ns->head->disk' here; we only need to call this if we have a > multipath disk. Sure. > Remaining bits are okay. Thanks, Mike
diff --git a/libmultipath/prio.h b/libmultipath/prio.h
index aa587cc..599d1d8 100644
--- a/libmultipath/prio.h
+++ b/libmultipath/prio.h
@@ -30,6 +30,7 @@ struct path;
 #define PRIO_WEIGHTED_PATH	"weightedpath"
 #define PRIO_SYSFS		"sysfs"
 #define PRIO_PATH_LATENCY	"path_latency"
+#define PRIO_ANA		"ana"
 
 /*
  * Value used to mark the fact prio was not defined
diff --git a/libmultipath/prioritizers/Makefile b/libmultipath/prioritizers/Makefile
index ab7bc07..15afaba 100644
--- a/libmultipath/prioritizers/Makefile
+++ b/libmultipath/prioritizers/Makefile
@@ -19,6 +19,7 @@ LIBS = \
 	libpriordac.so \
 	libprioweightedpath.so \
 	libpriopath_latency.so \
+	libprioana.so \
 	libpriosysfs.so
 
 all: $(LIBS)
diff --git a/libmultipath/prioritizers/ana.c b/libmultipath/prioritizers/ana.c
new file mode 100644
index 0000000..c5aaa5f
--- /dev/null
+++ b/libmultipath/prioritizers/ana.c
@@ -0,0 +1,292 @@
+/*
+ * (C) Copyright HUAWEI Technology Corp. 2017  All Rights Reserved.
+ *
+ * ana.c
+ * Version 1.00
+ *
+ * Tool to make use of a NVMe-feature called Asymmetric Namespace Access.
+ * It determines the ANA state of a device and prints a priority value to stdout.
+ *
+ * Author(s): Cheng Jike <chengjike.cheng@huawei.com>
+ *            Li Jie <lijie34@huawei.com>
+ *
+ * This file is released under the GPL version 2, or any later version.
+ */
+#include <stdio.h>
+#include <sys/ioctl.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+#include <stdbool.h>
+
+#include "debug.h"
+#include "prio.h"
+#include "structs.h"
+#include "ana.h"
+
+enum {
+	ANA_PRIO_OPTIMIZED		= 50,
+	ANA_PRIO_NONOPTIMIZED		= 10,
+	ANA_PRIO_INACCESSIBLE		= 5,
+	ANA_PRIO_PERSISTENT_LOSS	= 1,
+	ANA_PRIO_CHANGE			= 0,
+	ANA_PRIO_RESERVED		= 0,
+	ANA_PRIO_GETCTRL_FAILED		= -1,
+	ANA_PRIO_NOT_SUPPORTED		= -2,
+	ANA_PRIO_GETANAS_FAILED		= -3,
+	ANA_PRIO_GETANALOG_FAILED	= -4,
+	ANA_PRIO_GETNSID_FAILED		= -5,
+	ANA_PRIO_GETNS_FAILED		= -6,
+	ANA_PRIO_NO_MEMORY		= -7,
+	ANA_PRIO_NO_INFORMATION		= -8,
+};
+
+static const char * anas_string[] = {
+	[NVME_ANA_OPTIMIZED]		= "ANA Optimized State",
+	[NVME_ANA_NONOPTIMIZED]		= "ANA Non-Optimized State",
+	[NVME_ANA_INACCESSIBLE]		= "ANA Inaccessible State",
+	[NVME_ANA_PERSISTENT_LOSS]	= "ANA Persistent Loss State",
+	[NVME_ANA_CHANGE]		= "ANA Change state",
+	[NVME_ANA_RESERVED]		= "Invalid namespace group state!",
+};
+
+static const char *aas_print_string(int rc)
+{
+	rc &= 0xff;
+
+	switch(rc) {
+	case NVME_ANA_OPTIMIZED:
+	case NVME_ANA_NONOPTIMIZED:
+	case NVME_ANA_INACCESSIBLE:
+	case NVME_ANA_PERSISTENT_LOSS:
+	case NVME_ANA_CHANGE:
+		return anas_string[rc];
+	default:
+		return anas_string[NVME_ANA_RESERVED];
+	}
+
+	return anas_string[NVME_ANA_RESERVED];
+}
+
+static int nvme_get_nsid(int fd, unsigned *nsid)
+{
+	static struct stat nvme_stat;
+	int err = fstat(fd, &nvme_stat);
+	if (err < 0)
+		return 1;
+
+	if (!S_ISBLK(nvme_stat.st_mode)) {
+		condlog(0, "Error: requesting namespace-id from non-block device\n");
+		return 1;
+	}
+
+	*nsid = ioctl(fd, NVME_IOCTL_ID);
+	return 0;
+}
+
+static int nvme_submit_admin_passthru(int fd, struct nvme_passthru_cmd *cmd)
+{
+	return ioctl(fd, NVME_IOCTL_ADMIN_CMD, cmd);
+}
+
+int nvme_get_log13(int fd, __u32 nsid, __u8 log_id, __u8 lsp, __u64 lpo,
+		   __u16 lsi, bool rae, __u32 data_len, void *data)
+{
+	struct nvme_admin_cmd cmd = {
+		.opcode		= nvme_admin_get_log_page,
+		.nsid		= nsid,
+		.addr		= (__u64)(uintptr_t) data,
+		.data_len	= data_len,
+	};
+	__u32 numd = (data_len >> 2) - 1;
+	__u16 numdu = numd >> 16, numdl = numd & 0xffff;
+
+	cmd.cdw10 = log_id | (numdl << 16) | (rae ? 1 << 15 : 0);
+	if (lsp)
+		cmd.cdw10 |= lsp << 8;
+
+	cmd.cdw11 = numdu | (lsi << 16);
+	cmd.cdw12 = lpo;
+	cmd.cdw13 = (lpo >> 32);
+
+	return nvme_submit_admin_passthru(fd, &cmd);
+
+}
+
+int nvme_identify13(int fd, __u32 nsid, __u32 cdw10, __u32 cdw11, void *data)
+{
+	struct nvme_admin_cmd cmd = {
+		.opcode		= nvme_admin_identify,
+		.nsid		= nsid,
+		.addr		= (__u64)(uintptr_t) data,
+		.data_len	= NVME_IDENTIFY_DATA_SIZE,
+		.cdw10		= cdw10,
+		.cdw11		= cdw11,
+	};
+
+	return nvme_submit_admin_passthru(fd, &cmd);
+}
+
+int nvme_identify(int fd, __u32 nsid, __u32 cdw10, void *data)
+{
+	return nvme_identify13(fd, nsid, cdw10, 0, data);
+}
+
+int nvme_identify_ctrl(int fd, void *data)
+{
+	return nvme_identify(fd, 0, NVME_ID_CNS_CTRL, data);
+}
+
+int nvme_identify_ns(int fd, __u32 nsid, void *data)
+{
+	return nvme_identify(fd, nsid, NVME_ID_CNS_NS, data);
+}
+
+int nvme_ana_log(int fd, void *ana_log, size_t ana_log_len, int rgo)
+{
+	__u64 lpo = 0;
+
+	return nvme_get_log13(fd, NVME_NSID_ALL, NVME_LOG_ANA, rgo, lpo, 0,
+			      true, ana_log_len, ana_log);
+}
+
+static int get_ana_state(__u32 nsid, __u32 anagrpid, void *ana_log)
+{
+	int rc = ANA_PRIO_GETANAS_FAILED;
+	void *base = ana_log;
+	struct nvme_ana_rsp_hdr *hdr = base;
+	struct nvme_ana_group_desc *ana_desc;
+	int offset = sizeof(struct nvme_ana_rsp_hdr);
+	__u32 nr_nsids;
+	size_t nsid_buf_size;
+	int i, j;
+
+	for (i = 0; i < le16_to_cpu(hdr->ngrps); i++) {
+		ana_desc = base + offset;
+		nr_nsids = le32_to_cpu(ana_desc->nnsids);
+		nsid_buf_size = nr_nsids * sizeof(__le32);
+
+		offset += sizeof(*ana_desc);
+
+		for (j = 0; j < nr_nsids; j++) {
+			if (nsid == le32_to_cpu(ana_desc->nsids[j]))
+				return ana_desc->state;
+		}
+
+		if (anagrpid != 0 && anagrpid == le32_to_cpu(ana_desc->grpid))
+			rc = ana_desc->state;
+
+		offset += nsid_buf_size;
+	}
+
+	return rc;
+}
+
+int get_ana_info(struct path * pp, unsigned int timeout)
+{
+	int rc;
+	__u32 nsid;
+	struct nvme_id_ctrl ctrl;
+	struct nvme_id_ns ns;
+	void *ana_log;
+	size_t ana_log_len;
+
+	rc = nvme_identify_ctrl(pp->fd, &ctrl);
+	if (rc)
+		return ANA_PRIO_GETCTRL_FAILED;
+
+	if(!(ctrl.cmic & (1 << 3)))
+		return ANA_PRIO_NOT_SUPPORTED;
+
+	rc = nvme_get_nsid(pp->fd, &nsid);
+	if (rc)
+		return ANA_PRIO_GETNSID_FAILED;
+
+	rc = nvme_identify_ns(pp->fd, nsid, &ns);
+	if (rc)
+		return ANA_PRIO_GETNS_FAILED;
+
+	ana_log_len = sizeof(struct nvme_ana_rsp_hdr) +
+		le32_to_cpu(ctrl.nanagrpid) * sizeof(struct nvme_ana_group_desc);
+	if (!(ctrl.anacap & (1 << 6)))
+		ana_log_len += le32_to_cpu(ctrl.mnan) * sizeof(__le32);
+
+	ana_log = malloc(ana_log_len);
+	if (!ana_log)
+		return ANA_PRIO_NO_MEMORY;
+
+	rc = nvme_ana_log(pp->fd, ana_log, ana_log_len,
+			  (ctrl.anacap & (1 << 6)) ? NVME_ANA_LOG_RGO : 0);
+	if (rc) {
+		free(ana_log);
+		return ANA_PRIO_GETANALOG_FAILED;
+	}
+
+	rc = get_ana_state(nsid, le32_to_cpu(ns.anagrpid), ana_log);
+	if (rc < 0){
+		free(ana_log);
+		return ANA_PRIO_GETANAS_FAILED;
+	}
+
+	free(ana_log);
+	condlog(3, "%s: ana state = %02x [%s]", pp->dev, rc, aas_print_string(rc));
+
+	return rc;
+}
+
+int getprio(struct path * pp, char * args, unsigned int timeout)
+{
+	int rc;
+
+	if (pp->fd < 0)
+		return ANA_PRIO_NO_INFORMATION;
+
+	rc = get_ana_info(pp, timeout);
+	if (rc >= 0) {
+		rc &= 0x0f;
+		switch(rc) {
+		case NVME_ANA_OPTIMIZED:
+			rc = ANA_PRIO_OPTIMIZED;
+			break;
+		case NVME_ANA_NONOPTIMIZED:
+			rc = ANA_PRIO_NONOPTIMIZED;
+			break;
+		case NVME_ANA_INACCESSIBLE:
+			rc = ANA_PRIO_INACCESSIBLE;
+			break;
+		case NVME_ANA_PERSISTENT_LOSS:
+			rc = ANA_PRIO_PERSISTENT_LOSS;
+			break;
+		case NVME_ANA_CHANGE:
+			rc = ANA_PRIO_CHANGE;
+			break;
+		default:
+			rc = ANA_PRIO_RESERVED;
+		}
+	} else {
+		switch(rc) {
+		case ANA_PRIO_GETCTRL_FAILED:
+			condlog(0, "%s: couldn't get ctrl info", pp->dev);
+			break;
+		case ANA_PRIO_NOT_SUPPORTED:
+			condlog(0, "%s: ana not supported", pp->dev);
+			break;
+		case ANA_PRIO_GETANAS_FAILED:
+			condlog(0, "%s: couldn't get ana state", pp->dev);
+			break;
+		case ANA_PRIO_GETANALOG_FAILED:
+			condlog(0, "%s: couldn't get ana log", pp->dev);
+			break;
+		case ANA_PRIO_GETNS_FAILED:
+			condlog(0, "%s: couldn't get namespace", pp->dev);
+			break;
+		case ANA_PRIO_GETNSID_FAILED:
+			condlog(0, "%s: couldn't get namespace id", pp->dev);
+			break;
+		case ANA_PRIO_NO_MEMORY:
+			condlog(0, "%s: couldn't alloc memory", pp->dev);
+			break;
+		}
+	}
+	return rc;
+}
+
diff --git a/libmultipath/prioritizers/ana.h b/libmultipath/prioritizers/ana.h
new file mode 100644
index 0000000..92cfa9e
--- /dev/null
+++ b/libmultipath/prioritizers/ana.h
@@ -0,0 +1,221 @@
+#ifndef _ANA_H
+#define _ANA_H
+
+#include <linux/types.h>
+
+#define NVME_NSID_ALL		0xffffffff
+#define NVME_IDENTIFY_DATA_SIZE	4096
+
+#define NVME_LOG_ANA		0x0c
+
+/* Admin commands */
+enum nvme_admin_opcode {
+	nvme_admin_get_log_page	= 0x02,
+	nvme_admin_identify	= 0x06,
+};
+
+enum {
+	NVME_ID_CNS_NS		= 0x00,
+	NVME_ID_CNS_CTRL	= 0x01,
+};
+
+/* nvme ioctl start */
+struct nvme_passthru_cmd {
+	__u8	opcode;
+	__u8	flags;
+	__u16	rsvd1;
+	__u32	nsid;
+	__u32	cdw2;
+	__u32	cdw3;
+	__u64	metadata;
+	__u64	addr;
+	__u32	metadata_len;
+	__u32	data_len;
+	__u32	cdw10;
+	__u32	cdw11;
+	__u32	cdw12;
+	__u32	cdw13;
+	__u32	cdw14;
+	__u32	cdw15;
+	__u32	timeout_ms;
+	__u32	result;
+};
+
+#define nvme_admin_cmd nvme_passthru_cmd
+
+#define NVME_IOCTL_ID		_IO('N', 0x40)
+#define NVME_IOCTL_ADMIN_CMD	_IOWR('N', 0x41, struct nvme_admin_cmd)
+/* nvme ioctl end */
+
+/* nvme id ctrl start */
+struct nvme_id_power_state {
+	__le16	max_power;	/* centiwatts */
+	__u8	rsvd2;
+	__u8	flags;
+	__le32	entry_lat;	/* microseconds */
+	__le32	exit_lat;	/* microseconds */
+	__u8	read_tput;
+	__u8	read_lat;
+	__u8	write_tput;
+	__u8	write_lat;
+	__le16	idle_power;
+	__u8	idle_scale;
+	__u8	rsvd19;
+	__le16	active_power;
+	__u8	active_work_scale;
+	__u8	rsvd23[9];
+};
+
+struct nvme_id_ctrl {
+	__le16	vid;
+	__le16	ssvid;
+	char	sn[20];
+	char	mn[40];
+	char	fr[8];
+	__u8	rab;
+	__u8	ieee[3];
+	__u8	cmic;
+	__u8	mdts;
+	__le16	cntlid;
+	__le32	ver;
+	__le32	rtd3r;
+	__le32	rtd3e;
+	__le32	oaes;
+	__le32	ctratt;
+	__u8	rsvd100[156];
+	__le16	oacs;
+	__u8	acl;
+	__u8	aerl;
+	__u8	frmw;
+	__u8	lpa;
+	__u8	elpe;
+	__u8	npss;
+	__u8	avscc;
+	__u8	apsta;
+	__le16	wctemp;
+	__le16	cctemp;
+	__le16	mtfa;
+	__le32	hmpre;
+	__le32	hmmin;
+	__u8	tnvmcap[16];
+	__u8	unvmcap[16];
+	__le32	rpmbs;
+	__le16	edstt;
+	__u8	dsto;
+	__u8	fwug;
+	__le16	kas;
+	__le16	hctma;
+	__le16	mntmt;
+	__le16	mxtmt;
+	__le32	sanicap;
+	__le32	hmminds;
+	__le16	hmmaxd;
+	__u8	rsvd338[4];
+	__u8	anatt;
+	__u8	anacap;
+	__le32	anagrpmax;
+	__le32	nanagrpid;
+	__u8	rsvd352[160];
+	__u8	sqes;
+	__u8	cqes;
+	__le16	maxcmd;
+	__le32	nn;
+	__le16	oncs;
+	__le16	fuses;
+	__u8	fna;
+	__u8	vwc;
+	__le16	awun;
+	__le16	awupf;
+	__u8	nvscc;
+	__u8	nwpc;
+	__le16	acwu;
+	__u8	rsvd534[2];
+	__le32	sgls;
+	__le32	mnan;
+	__u8	rsvd544[224];
+	char	subnqn[256];
+	__u8	rsvd1024[768];
+	__le32	ioccsz;
+	__le32	iorcsz;
+	__le16	icdoff;
+	__u8	ctrattr;
+	__u8	msdbd;
+	__u8	rsvd1804[244];
+	struct nvme_id_power_state	psd[32];
+	__u8	vs[1024];
+};
+/* nvme id ctrl end */
+
+/* nvme id ns start */
+struct nvme_lbaf {
+	__le16	ms;
+	__u8	ds;
+	__u8	rp;
+};
+
+struct nvme_id_ns {
+	__le64	nsze;
+	__le64	ncap;
+	__le64	nuse;
+	__u8	nsfeat;
+	__u8	nlbaf;
+	__u8	flbas;
+	__u8	mc;
+	__u8	dpc;
+	__u8	dps;
+	__u8	nmic;
+	__u8	rescap;
+	__u8	fpi;
+	__u8	rsvd33;
+	__le16	nawun;
+	__le16	nawupf;
+	__le16	nacwu;
+	__le16	nabsn;
+	__le16	nabo;
+	__le16	nabspf;
+	__le16	noiob;
+	__u8	nvmcap[16];
+	__u8	rsvd64[28];
+	__le32	anagrpid;
+	__u8	rsvd96[3];
+	__u8	nsattr;
+	__u8	rsvd100[4];
+	__u8	nguid[16];
+	__u8	eui64[8];
+	struct nvme_lbaf	lbaf[16];
+	__u8	rsvd192[192];
+	__u8	vs[3712];
+};
+/* nvme id ns end */
+
+/* nvme ana start */
+enum nvme_ana_state {
+	NVME_ANA_OPTIMIZED		= 0x01,
+	NVME_ANA_NONOPTIMIZED		= 0x02,
+	NVME_ANA_INACCESSIBLE		= 0x03,
+	NVME_ANA_PERSISTENT_LOSS	= 0x04,
+	NVME_ANA_CHANGE			= 0x0f,
+	NVME_ANA_RESERVED		= 0x05,
+};
+
+struct nvme_ana_rsp_hdr {
+	__le64	chgcnt;
+	__le16	ngrps;
+	__le16	rsvd10[3];
+};
+
+struct nvme_ana_group_desc {
+	__le32	grpid;
+	__le32	nnsids;
+	__le64	chgcnt;
+	__u8	state;
+	__u8	rsvd17[15];
+	__le32	nsids[];
+};
+
+/* flag for the log specific field of the ANA log */
+#define NVME_ANA_LOG_RGO	(1 << 0)
+
+/* nvme ana end */
+
+#endif
diff --git a/multipath/multipath.conf.5 b/multipath/multipath.conf.5
index 6333366..c5b2fb6 100644
--- a/multipath/multipath.conf.5
+++ b/multipath/multipath.conf.5
@@ -334,6 +334,10 @@ priority provided as argument. Requires prio_args keyword.
 Generate the path priority based on a latency algorithm.
 Requires prio_args keyword.
 .TP
+.I ana
+(Hardware-dependent)
+Generate the path priority based on the NVMe ANA settings.
+.TP
 .I datacore
 (Hardware-dependent)
 Generate the path priority for some DataCore storage arrays. Requires prio_args
@@ -1391,6 +1395,10 @@ Active/Standby mode exclusively.
 .I 1 alua
 (Hardware-dependent)
 Hardware handler for SCSI-3 ALUA compatible arrays.
+.TP
+.I 1 ana
+(Hardware-dependent)
+Hardware handler for NVMe ANA compatible arrays.
 .PP
 The default is: \fB<unset>\fR
 .PP
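As noted in the review above, this patch does not auto-detect the ana prioritizer, so it must be configured explicitly. A minimal multipath.conf sketch for testing (the vendor/product match strings below are illustrative, not part of the patch; native NVMe multipathing must also be disabled with the nvme-core "multipath=N" module parameter, as the cover letter describes):

```
devices {
	device {
		# Hypothetical match for the NVMe paths under test.
		vendor "NVME"
		product ".*"
		# Skip prioritizer auto-detection and force the new ana prioritizer.
		detect_prio no
		prio ana
	}
}
```

With "detect_prio no" omitted, detect_prio would override the explicit "prio ana" setting, which is why the review suggests teaching detect_prio about ANA in a follow-up patch.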