| Message ID | 572cfcc0-197a-9ead-9cb-3c5bf5e735@google.com (mailing list archive) |
|---|---|
| State | New, archived |
| Series | 6.2 nvme-pci: something wrong |
On Fri, Dec 23, 2022 at 09:24:56PM -0800, Hugh Dickins wrote:
> Hi Christoph,
>
> There's something wrong with the nvme-pci heading for 6.2-rc1:
> no problem booting here on this Lenovo ThinkPad X1 Carbon 5th,
> but under load...
>
> nvme nvme0: I/O 0 (I/O Cmd) QID 2 timeout, aborting
> nvme nvme0: I/O 1 (I/O Cmd) QID 2 timeout, aborting
> nvme nvme0: I/O 2 (I/O Cmd) QID 2 timeout, aborting
> nvme nvme0: I/O 3 (I/O Cmd) QID 2 timeout, aborting
> nvme nvme0: Abort status: 0x0
> nvme nvme0: Abort status: 0x0
> nvme nvme0: Abort status: 0x0
> nvme nvme0: Abort status: 0x0
> nvme nvme0: I/O 0 QID 2 timeout, reset controller
>
> ...and more, until I just have to poweroff and reboot.
>
> Bisection points to your
> 0da7feaa5913 ("nvme-pci: use the tagset alloc/free helpers")
> And that does revert cleanly, giving a kernel which shows no problem.
>
> I've spent a while comparing old nvme_pci_alloc_tag_set() and new
> nvme_alloc_io_tag_set(), I do not know my way around there at all
> and may be talking nonsense, but it did look as if there might now
> be a difference in the queue_depth, sqsize, q_depth conversions.
>
> I'm running load successfully with the patch below, but I strongly
> suspect that the right patch will be somewhere else: over to you!

Thanks for the report! The patch is definitively wrong: ->sqsize holds
one of the awful so-called 'zeroes based values' in NVMe, where 0 means 1,
and thus has a built-in off-by-one. We should probably convert it to a
sane value at read time, but that's a separate discussion.

I suspect your controller is one of those where we quirk the size, and
the commit you bisected fails to reflect that in the common sqsize value.
The patch below should be the minimum fix, and in the long term, the
duplicate bookkeeping for it in the PCI driver should go away:

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index f0f8027644bbf8..a73c0ee7bd1892 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2536,7 +2536,6 @@ static int nvme_pci_enable(struct nvme_dev *dev)
 
 	dev->q_depth = min_t(u32, NVME_CAP_MQES(dev->ctrl.cap) + 1,
 				io_queue_depth);
-	dev->ctrl.sqsize = dev->q_depth - 1; /* 0's based queue depth */
 	dev->db_stride = 1 << NVME_CAP_STRIDE(dev->ctrl.cap);
 	dev->dbs = dev->bar + 4096;
 
@@ -2577,7 +2576,7 @@ static int nvme_pci_enable(struct nvme_dev *dev)
 		dev_warn(dev->ctrl.device, "IO queue depth clamped to %d\n",
 			 dev->q_depth);
 	}
-
+	dev->ctrl.sqsize = dev->q_depth - 1; /* 0's based queue depth */
 
 	nvme_map_cmb(dev);
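For readers not familiar with the 'zeroes based' convention mentioned above: both CAP.MQES and sqsize store one less than the real queue depth. Below is a minimal, standalone user-space sketch of the conversions involved; the variable names only mirror the driver's, nothing here is the actual kernel implementation.

/*
 * Sketch of NVMe's "zeroes based" queue size fields: the stored value
 * is always one less than the real number of queue slots.
 */
#include <stdio.h>

int main(void)
{
	unsigned int mqes = 511;           /* CAP.MQES as reported: 0's based, so 512 real slots */
	unsigned int q_depth = mqes + 1;   /* real number of submission queue entries */
	unsigned int sqsize = q_depth - 1; /* value kept/exposed by the driver: 0's based again */

	printf("MQES=%u -> q_depth=%u -> sqsize=%u\n", mqes, q_depth, sqsize);
	return 0;
}

Under this convention, the sqsize of 511 that Hugh reads from sysfs later in the thread corresponds to a 512-slot submission queue.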
[Note: this mail contains only information for Linux kernel regression
tracking. Mails like these contain '#forregzbot' in the subject to make
them easy to spot and filter out. The author also tried to remove most
or all individuals from the list of recipients to spare them the hassle.]

On 24.12.22 06:24, Hugh Dickins wrote:
>
> There's something wrong with the nvme-pci heading for 6.2-rc1:
> no problem booting here on this Lenovo ThinkPad X1 Carbon 5th,
> but under load...
> [...]
> Bisection points to your
> 0da7feaa5913 ("nvme-pci: use the tagset alloc/free helpers")
> And that does revert cleanly, giving a kernel which shows no problem.
> [...]

Thanks for the report. To be sure the below issue doesn't fall through
the cracks unnoticed, I'm adding it to regzbot, my Linux kernel
regression tracking bot:

#regzbot ^introduced 0da7feaa5913
#regzbot title nvme-pci: problems under load
#regzbot ignore-activity

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)

P.S.: As the Linux kernel's regression tracker I deal with a lot of
reports and sometimes miss something important when writing mails like
this. If that's the case here, don't hesitate to tell me in a public
reply, it's in everyone's interest to set the public record straight.
On Fri, 23 Dec 2022, Christoph Hellwig wrote:
> On Fri, Dec 23, 2022 at 09:24:56PM -0800, Hugh Dickins wrote:
> > Hi Christoph,
> >
> > There's something wrong with the nvme-pci heading for 6.2-rc1:
> > no problem booting here on this Lenovo ThinkPad X1 Carbon 5th,
> > but under load...
> >
> > nvme nvme0: I/O 0 (I/O Cmd) QID 2 timeout, aborting
> > nvme nvme0: I/O 1 (I/O Cmd) QID 2 timeout, aborting
> > nvme nvme0: I/O 2 (I/O Cmd) QID 2 timeout, aborting
> > nvme nvme0: I/O 3 (I/O Cmd) QID 2 timeout, aborting
> > nvme nvme0: Abort status: 0x0
> > nvme nvme0: Abort status: 0x0
> > nvme nvme0: Abort status: 0x0
> > nvme nvme0: Abort status: 0x0
> > nvme nvme0: I/O 0 QID 2 timeout, reset controller
> >
> > ...and more, until I just have to poweroff and reboot.
> >
> > Bisection points to your
> > 0da7feaa5913 ("nvme-pci: use the tagset alloc/free helpers")
> > And that does revert cleanly, giving a kernel which shows no problem.
> >
> > I've spent a while comparing old nvme_pci_alloc_tag_set() and new
> > nvme_alloc_io_tag_set(), I do not know my way around there at all
> > and may be talking nonsense, but it did look as if there might now
> > be a difference in the queue_depth, sqsize, q_depth conversions.
> >
> > I'm running load successfully with the patch below, but I strongly
> > suspect that the right patch will be somewhere else: over to you!
>
> Thanks for the report! The patch is definitively wrong: ->sqsize holds
> one of the awful so-called 'zeroes based values' in NVMe, where 0 means 1,
> and thus has a built-in off-by-one. We should probably convert it to a
> sane value at read time, but that's a separate discussion.
>
> I suspect your controller is one of those where we quirk the size, and
> the commit you bisected fails to reflect that in the common sqsize value.
> The patch below should be the minimum fix, and in the long term, the
> duplicate bookkeeping for it in the PCI driver should go away:

Thanks for the rapid response. No, I've just tried your patch below,
and it does not help. And I don't see any message about "queue size"
in dmesg, so I don't think mine is quirked.

cat /sys/devices/pci0000:00/0000:00:1c.4/0000:05:00.0/nvme/nvme0/sqsize
tells me 511 (and that looked like the only relevant file under /sys).

Regarding the awful 0's based queue depth: yes, it just looked to me
as if the way that got handled in pci.c before differed from the way
it gets handled in pci.c and core.c now, one too many "+ 1"s or "- 1"s
somewhere.

>
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index f0f8027644bbf8..a73c0ee7bd1892 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -2536,7 +2536,6 @@ static int nvme_pci_enable(struct nvme_dev *dev)
>
>  	dev->q_depth = min_t(u32, NVME_CAP_MQES(dev->ctrl.cap) + 1,
>  				io_queue_depth);
> -	dev->ctrl.sqsize = dev->q_depth - 1; /* 0's based queue depth */
>  	dev->db_stride = 1 << NVME_CAP_STRIDE(dev->ctrl.cap);
>  	dev->dbs = dev->bar + 4096;
>
> @@ -2577,7 +2576,7 @@ static int nvme_pci_enable(struct nvme_dev *dev)
>  		dev_warn(dev->ctrl.device, "IO queue depth clamped to %d\n",
>  			 dev->q_depth);
>  	}
> -
> +	dev->ctrl.sqsize = dev->q_depth - 1; /* 0's based queue depth */
>
>  	nvme_map_cmb(dev);
>
On Sat, Dec 24, 2022 at 2:19 AM Hugh Dickins <hughd@google.com> wrote:
>
> Regarding the awful 0's based queue depth: yes, it just looked to me
> as if the way that got handled in pci.c before differed from the way
> it gets handled in pci.c and core.c now, one too many "+ 1"s or "- 1"s
> somewhere.

The commit in question seems to replace nvme_pci_alloc_tag_set() calls
with nvme_alloc_io_tag_set(), and that has a big difference in how
queue_depth is set.

It used to do (in nvme_pci_alloc_tag_set()):

        set->queue_depth = min_t(unsigned, dev->q_depth, BLK_MQ_MAX_DEPTH) - 1;

but now it does (in nvme_alloc_io_tag_set())

        set->queue_depth = ctrl->sqsize + 1;

instead.

So that "set->queue_depth" _seems_ to have historically had that "-1"
(that "zero means one" that apparently sqsize also has), but the new
code basically undoes it.

I don't know the code at all, but this does all seem to be a change
(and *very* confusing).

The fact that Hugh gets it to work by doing that

        set->queue_depth = ctrl->sqsize;

does seem to match the whole "it used to subtract one" behavior it had.
Which is why I assume Hugh tried that patch in the first place.

              Linus
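Spelling out the comparison above with concrete numbers, here is a standalone sketch using Hugh's reported sqsize of 511. It is not kernel code: BLK_MQ_MAX_DEPTH is stubbed with a locally defined constant (assumed to be 10240, as in recent kernels), and min_t is replaced by a plain helper.

/*
 * Old vs. new tag-set depth for a controller whose 0's based sqsize
 * reads back as 511 (i.e. a 512-slot hardware queue).
 */
#include <stdio.h>

#define BLK_MQ_MAX_DEPTH 10240	/* assumed value, only used for the comparison */

static unsigned int min_u(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

int main(void)
{
	unsigned int sqsize = 511;		/* 0's based, as read from sysfs */
	unsigned int q_depth = sqsize + 1;	/* 512 real hardware slots */

	unsigned int old_depth = min_u(q_depth, BLK_MQ_MAX_DEPTH) - 1;	/* 511 */
	unsigned int new_depth = sqsize + 1;				/* 512 */

	printf("old=%u new=%u\n", old_depth, new_depth);
	return 0;
}

The old path ends up one below the hardware queue depth, while the new path hands out a tag for every slot, which is what sets up the queue-wrap condition described in the next message.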
On Fri, Dec 23, 2022 at 09:24:56PM -0800, Hugh Dickins wrote:
> Hi Christoph,
>
> There's something wrong with the nvme-pci heading for 6.2-rc1:
> no problem booting here on this Lenovo ThinkPad X1 Carbon 5th,
> but under load...
>
> nvme nvme0: I/O 0 (I/O Cmd) QID 2 timeout, aborting
> nvme nvme0: I/O 1 (I/O Cmd) QID 2 timeout, aborting
> nvme nvme0: I/O 2 (I/O Cmd) QID 2 timeout, aborting
> nvme nvme0: I/O 3 (I/O Cmd) QID 2 timeout, aborting
> nvme nvme0: Abort status: 0x0
> nvme nvme0: Abort status: 0x0
> nvme nvme0: Abort status: 0x0
> nvme nvme0: Abort status: 0x0
> nvme nvme0: I/O 0 QID 2 timeout, reset controller
>
> ...and more, until I just have to poweroff and reboot.
>
> Bisection points to your
> 0da7feaa5913 ("nvme-pci: use the tagset alloc/free helpers")
> And that does revert cleanly, giving a kernel which shows no problem.
>
> I've spent a while comparing old nvme_pci_alloc_tag_set() and new
> nvme_alloc_io_tag_set(), I do not know my way around there at all
> and may be talking nonsense, but it did look as if there might now
> be a difference in the queue_depth, sqsize, q_depth conversions.
>
> I'm running load successfully with the patch below, but I strongly
> suspect that the right patch will be somewhere else: over to you!
>
> Hugh
>
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -4926,7 +4926,7 @@ int nvme_alloc_io_tag_set(struct nvme_ct
>
>  	memset(set, 0, sizeof(*set));
>  	set->ops = ops;
> -	set->queue_depth = ctrl->sqsize + 1;
> +	set->queue_depth = ctrl->sqsize;

Your observation is a queue-wrap condition that makes it impossible
for the controller to know there are new commands.

Your patch does look like the correct thing to do. The "zero means one"
thing is a confusing distraction, I think. It makes more sense if you
consider sqsize as the maximum number of tags we can have outstanding
at one time, and it looks like all the drivers set it that way. We're
supposed to leave one slot empty for a full NVMe queue, so adding one
here to report the total number of slots isn't right since that would
allow us to fill all slots.

Fabrics drivers have been using this method for a while, though, so
it's interesting they haven't had a similar problem.
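The queue-wrap condition described above is the generic ring-buffer rule: a queue driven by head and tail indices can only ever hold one less entry than it has slots, because a tail that catches up with the head is indistinguishable from an empty queue. Here is a self-contained sketch of that invariant with a plain user-space ring; it is not the NVMe submission path, just an illustration of why one slot must stay empty.

/*
 * If the producer were allowed to fill all QSIZE slots, tail would wrap
 * around to equal head, which already means "empty", so the consumer
 * (here: the controller) would never see the new entries.
 */
#include <stdbool.h>
#include <stdio.h>

#define QSIZE 8	/* illustrative number of queue slots */

struct ring {
	unsigned int head;	/* next entry the consumer will take */
	unsigned int tail;	/* next entry the producer will fill */
};

static bool ring_empty(const struct ring *r)
{
	return r->head == r->tail;
}

static bool ring_full(const struct ring *r)
{
	return (r->tail + 1) % QSIZE == r->head;
}

int main(void)
{
	struct ring r = { 0, 0 };
	unsigned int queued = 0;

	while (!ring_full(&r)) {	/* stops after QSIZE - 1 submissions */
		r.tail = (r.tail + 1) % QSIZE;
		queued++;
	}
	printf("slots=%u, max outstanding=%u, empty=%d\n",
	       QSIZE, queued, ring_empty(&r));
	return 0;
}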
On Sat, Dec 24, 2022 at 03:06:38PM -0700, Keith Busch wrote:
> Your observation is a queue-wrap condition that makes it impossible
> for the controller to know there are new commands.
>
> Your patch does look like the correct thing to do. The "zero means one"
> thing is a confusing distraction, I think. It makes more sense if you
> consider sqsize as the maximum number of tags we can have outstanding
> at one time, and it looks like all the drivers set it that way. We're
> supposed to leave one slot empty for a full NVMe queue, so adding one
> here to report the total number of slots isn't right since that would
> allow us to fill all slots.

Yes, and PCIe did actually do the - 1 from q_depth, so we should
drop the +1 for sqsize. And add back the missing BLK_MQ_MAX_DEPTH.
But we still need to keep sqsize updated as well.

> Fabrics drivers have been using this method for a while, though, so
> it's interesting they haven't had a similar problem.

Fabrics doesn't have a real queue and thus no actual wrap, so
I don't think they will be hit as bad by this.

So we'll probably need something like this, split into two patches.
And then for 6.2 clean up the sqsize vs q_depth mess for real.

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 95c488ea91c303..5b723c65fbeab5 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -4926,7 +4926,7 @@ int nvme_alloc_io_tag_set(struct nvme_ctrl *ctrl, struct blk_mq_tag_set *set,
 
 	memset(set, 0, sizeof(*set));
 	set->ops = ops;
-	set->queue_depth = ctrl->sqsize + 1;
+	set->queue_depth = min_t(unsigned, ctrl->sqsize, BLK_MQ_MAX_DEPTH - 1);
 	/*
 	 * Some Apple controllers requires tags to be unique across admin and
 	 * the (only) I/O queue, so reserve the first 32 tags of the I/O queue.
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index f0f8027644bbf8..ec5e1c578a710b 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2332,10 +2332,12 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
 	if (dev->cmb_use_sqes) {
 		result = nvme_cmb_qdepth(dev, nr_io_queues,
 				sizeof(struct nvme_command));
-		if (result > 0)
+		if (result > 0) {
 			dev->q_depth = result;
-		else
+			dev->ctrl.sqsize = dev->q_depth - 1;
+		} else {
 			dev->cmb_use_sqes = false;
+		}
 	}
 
 	do {
@@ -2536,7 +2538,6 @@ static int nvme_pci_enable(struct nvme_dev *dev)
 
 	dev->q_depth = min_t(u32, NVME_CAP_MQES(dev->ctrl.cap) + 1,
 				io_queue_depth);
-	dev->ctrl.sqsize = dev->q_depth - 1; /* 0's based queue depth */
 	dev->db_stride = 1 << NVME_CAP_STRIDE(dev->ctrl.cap);
 	dev->dbs = dev->bar + 4096;
 
@@ -2577,7 +2578,7 @@ static int nvme_pci_enable(struct nvme_dev *dev)
 		dev_warn(dev->ctrl.device, "IO queue depth clamped to %d\n",
 			 dev->q_depth);
 	}
-
+	dev->ctrl.sqsize = dev->q_depth - 1; /* 0's based queue depth */
 
 	nvme_map_cmb(dev);
On Sat, 24 Dec 2022, Christoph Hellwig wrote:
> On Sat, Dec 24, 2022 at 03:06:38PM -0700, Keith Busch wrote:
> > Your observation is a queue-wrap condition that makes it impossible
> > for the controller to know there are new commands.
> >
> > Your patch does look like the correct thing to do. The "zero means one"
> > thing is a confusing distraction, I think. It makes more sense if you
> > consider sqsize as the maximum number of tags we can have outstanding
> > at one time, and it looks like all the drivers set it that way. We're
> > supposed to leave one slot empty for a full NVMe queue, so adding one
> > here to report the total number of slots isn't right since that would
> > allow us to fill all slots.
>
> Yes, and PCIe did actually do the - 1 from q_depth, so we should
> drop the +1 for sqsize. And add back the missing BLK_MQ_MAX_DEPTH.
> But we still need to keep sqsize updated as well.
>
> > Fabrics drivers have been using this method for a while, though, so
> > it's interesting they haven't had a similar problem.
>
> Fabrics doesn't have a real queue and thus no actual wrap, so
> I don't think they will be hit as bad by this.
>
> So we'll probably need something like this, split into two patches.
> And then for 6.2 clean up the sqsize vs q_depth mess for real.

This patch is working fine for me; and, in the light of Keith's
explanation, so far as I can tell, seems the right thing to do.
Thanks!

Hugh

>
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 95c488ea91c303..5b723c65fbeab5 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -4926,7 +4926,7 @@ int nvme_alloc_io_tag_set(struct nvme_ctrl *ctrl, struct blk_mq_tag_set *set,
>
>  	memset(set, 0, sizeof(*set));
>  	set->ops = ops;
> -	set->queue_depth = ctrl->sqsize + 1;
> +	set->queue_depth = min_t(unsigned, ctrl->sqsize, BLK_MQ_MAX_DEPTH - 1);
>  	/*
>  	 * Some Apple controllers requires tags to be unique across admin and
>  	 * the (only) I/O queue, so reserve the first 32 tags of the I/O queue.
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index f0f8027644bbf8..ec5e1c578a710b 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -2332,10 +2332,12 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
>  	if (dev->cmb_use_sqes) {
>  		result = nvme_cmb_qdepth(dev, nr_io_queues,
>  				sizeof(struct nvme_command));
> -		if (result > 0)
> +		if (result > 0) {
>  			dev->q_depth = result;
> -		else
> +			dev->ctrl.sqsize = dev->q_depth - 1;
> +		} else {
>  			dev->cmb_use_sqes = false;
> +		}
>  	}
>
>  	do {
> @@ -2536,7 +2538,6 @@ static int nvme_pci_enable(struct nvme_dev *dev)
>
>  	dev->q_depth = min_t(u32, NVME_CAP_MQES(dev->ctrl.cap) + 1,
>  				io_queue_depth);
> -	dev->ctrl.sqsize = dev->q_depth - 1; /* 0's based queue depth */
>  	dev->db_stride = 1 << NVME_CAP_STRIDE(dev->ctrl.cap);
>  	dev->dbs = dev->bar + 4096;
>
> @@ -2577,7 +2578,7 @@ static int nvme_pci_enable(struct nvme_dev *dev)
>  		dev_warn(dev->ctrl.device, "IO queue depth clamped to %d\n",
>  			 dev->q_depth);
>  	}
> -
> +	dev->ctrl.sqsize = dev->q_depth - 1; /* 0's based queue depth */
>
>  	nvme_map_cmb(dev);
>
[TLDR: This mail is primarily relevant for Linux kernel regression
tracking. See link in footer if these mails annoy you.]

On 24.12.22 08:52, Thorsten Leemhuis wrote:
> [Note: this mail contains only information for Linux kernel regression
> tracking. Mails like these contain '#forregzbot' in the subject to make
> them easy to spot and filter out. The author also tried to remove most
> or all individuals from the list of recipients to spare them the hassle.]
>
> On 24.12.22 06:24, Hugh Dickins wrote:
>>
>> There's something wrong with the nvme-pci heading for 6.2-rc1:
>> no problem booting here on this Lenovo ThinkPad X1 Carbon 5th,
>> but under load...
>> [...]
>> Bisection points to your
>> 0da7feaa5913 ("nvme-pci: use the tagset alloc/free helpers")
>> And that does revert cleanly, giving a kernel which shows no problem.
>> [...]
>
> Thanks for the report. To be sure the below issue doesn't fall through
> the cracks unnoticed, I'm adding it to regzbot, my Linux kernel
> regression tracking bot:
>
> #regzbot ^introduced 0da7feaa5913
> #regzbot title nvme-pci: problems under load
> #regzbot ignore-activity

#regzbot fix: 88d356ca41ba1c3effc2d4208dfbd43

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)

--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
That page also explains what to do if mails like this annoy you.
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -4926,7 +4926,7 @@ int nvme_alloc_io_tag_set(struct nvme_ct
 
 	memset(set, 0, sizeof(*set));
 	set->ops = ops;
-	set->queue_depth = ctrl->sqsize + 1;
+	set->queue_depth = ctrl->sqsize;
 	/*
 	 * Some Apple controllers requires tags to be unique across admin and
 	 * the (only) I/O queue, so reserve the first 32 tags of the I/O queue.