
6.2 nvme-pci: something wrong

Message ID 572cfcc0-197a-9ead-9cb-3c5bf5e735@google.com (mailing list archive)
State New, archived
Headers show
Series 6.2 nvme-pci: something wrong | expand

Commit Message

Hugh Dickins Dec. 24, 2022, 5:24 a.m. UTC
Hi Christoph,

There's something wrong with the nvme-pci heading for 6.2-rc1:
no problem booting here on this Lenovo ThinkPad X1 Carbon 5th,
but under load...

nvme nvme0: I/O 0 (I/O Cmd) QID 2 timeout, aborting
nvme nvme0: I/O 1 (I/O Cmd) QID 2 timeout, aborting
nvme nvme0: I/O 2 (I/O Cmd) QID 2 timeout, aborting
nvme nvme0: I/O 3 (I/O Cmd) QID 2 timeout, aborting
nvme nvme0: Abort status: 0x0
nvme nvme0: Abort status: 0x0
nvme nvme0: Abort status: 0x0
nvme nvme0: Abort status: 0x0
nvme nvme0: I/O 0 QID 2 timeout, reset controller

...and more, until I just have to poweroff and reboot.

Bisection points to your
0da7feaa5913 ("nvme-pci: use the tagset alloc/free helpers")
And that does revert cleanly, giving a kernel which shows no problem.

I've spent a while comparing old nvme_pci_alloc_tag_set() and new
nvme_alloc_io_tag_set(), I do not know my way around there at all
and may be talking nonsense, but it did look as if there might now
be a difference in the queue_depth, sqsize, q_depth conversions.

I'm running load successfully with the patch below, but I strongly
suspect that the right patch will be somewhere else: over to you!

Hugh

Comments

Christoph Hellwig Dec. 24, 2022, 7:14 a.m. UTC | #1
On Fri, Dec 23, 2022 at 09:24:56PM -0800, Hugh Dickins wrote:
> Hi Christoph,
> 
> There's something wrong with the nvme-pci heading for 6.2-rc1:
> no problem booting here on this Lenovo ThinkPad X1 Carbon 5th,
> but under load...
> 
> nvme nvme0: I/O 0 (I/O Cmd) QID 2 timeout, aborting
> nvme nvme0: I/O 1 (I/O Cmd) QID 2 timeout, aborting
> nvme nvme0: I/O 2 (I/O Cmd) QID 2 timeout, aborting
> nvme nvme0: I/O 3 (I/O Cmd) QID 2 timeout, aborting
> nvme nvme0: Abort status: 0x0
> nvme nvme0: Abort status: 0x0
> nvme nvme0: Abort status: 0x0
> nvme nvme0: Abort status: 0x0
> nvme nvme0: I/O 0 QID 2 timeout, reset controller
> 
> ...and more, until I just have to poweroff and reboot.
> 
> Bisection points to your
> 0da7feaa5913 ("nvme-pci: use the tagset alloc/free helpers")
> And that does revert cleanly, giving a kernel which shows no problem.
> 
> I've spent a while comparing old nvme_pci_alloc_tag_set() and new
> nvme_alloc_io_tag_set(), I do not know my way around there at all
> and may be talking nonsense, but it did look as if there might now
> be a difference in the queue_depth, sqsize, q_depth conversions.
> 
> I'm running load successfully with the patch below, but I strongly
> suspect that the right patch will be somewhere else: over to you!

Thanks for the report!  The patch is definitely wrong: ->sqsize
holds one of the awful so-called 'zero's based values' in NVMe,
where 0 means 1, and thus has a built-in off-by-one.  We should
probably convert it to a sane value at read time, but that's a
separate discussion.
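
For readers new to the convention, a minimal sketch of what a 0's based
field means (illustration only; the variable names are made up and this is
not driver code):

	/* NVMe 0's based fields store (real value - 1) */
	u16 sqsize_field = q_depth - 1;      /* 512 queue entries -> stored as 511 */
	u32 entries      = sqsize_field + 1; /* decoding it again: 511 -> 512 */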

I suspect your controller is one of those where we quirk the size,
and the commit you bisected fails to reflect that in the common
sqsize value.  The patch below should be the minimum fix, and in
the long term, the duplicate bookkeeping for it in the PCI driver
should go away:

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index f0f8027644bbf8..a73c0ee7bd1892 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2536,7 +2536,6 @@ static int nvme_pci_enable(struct nvme_dev *dev)
 
 	dev->q_depth = min_t(u32, NVME_CAP_MQES(dev->ctrl.cap) + 1,
 				io_queue_depth);
-	dev->ctrl.sqsize = dev->q_depth - 1; /* 0's based queue depth */
 	dev->db_stride = 1 << NVME_CAP_STRIDE(dev->ctrl.cap);
 	dev->dbs = dev->bar + 4096;
 
@@ -2577,7 +2576,7 @@ static int nvme_pci_enable(struct nvme_dev *dev)
 		dev_warn(dev->ctrl.device, "IO queue depth clamped to %d\n",
 			 dev->q_depth);
 	}
-
+	dev->ctrl.sqsize = dev->q_depth - 1; /* 0's based queue depth */
 
 	nvme_map_cmb(dev);
Thorsten Leemhuis Dec. 24, 2022, 7:52 a.m. UTC | #2
[Note: this mail contains only information for Linux kernel regression
tracking. Mails like these contain '#forregzbot' in the subject to make
them easy to spot and filter out. The author also tried to remove most
or all individuals from the list of recipients to spare them the hassle.]

On 24.12.22 06:24, Hugh Dickins wrote:
> 
> There's something wrong with the nvme-pci heading for 6.2-rc1:
> no problem booting here on this Lenovo ThinkPad X1 Carbon 5th,
> but under load...
> [...]
> Bisection points to your
> 0da7feaa5913 ("nvme-pci: use the tagset alloc/free helpers")
> And that does revert cleanly, giving a kernel which shows no problem.
> [...]

Thanks for the report. To be sure below issue doesn't fall through the
cracks unnoticed, I'm adding it to regzbot, my Linux kernel regression
tracking bot:

#regzbot ^introduced 0da7feaa5913
#regzbot title nvme-pci: problems under load
#regzbot ignore-activity
Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)

P.S.: As the Linux kernel's regression tracker I deal with a lot of
reports and sometimes miss something important when writing mails like
this. If that's the case here, don't hesitate to tell me in a public
reply, it's in everyone's interest to set the public record straight.
Hugh Dickins Dec. 24, 2022, 10:19 a.m. UTC | #3
On Fri, 23 Dec 2022, Christoph Hellwig wrote:
> On Fri, Dec 23, 2022 at 09:24:56PM -0800, Hugh Dickins wrote:
> > Hi Christoph,
> > 
> > There's something wrong with the nvme-pci heading for 6.2-rc1:
> > no problem booting here on this Lenovo ThinkPad X1 Carbon 5th,
> > but under load...
> > 
> > nvme nvme0: I/O 0 (I/O Cmd) QID 2 timeout, aborting
> > nvme nvme0: I/O 1 (I/O Cmd) QID 2 timeout, aborting
> > nvme nvme0: I/O 2 (I/O Cmd) QID 2 timeout, aborting
> > nvme nvme0: I/O 3 (I/O Cmd) QID 2 timeout, aborting
> > nvme nvme0: Abort status: 0x0
> > nvme nvme0: Abort status: 0x0
> > nvme nvme0: Abort status: 0x0
> > nvme nvme0: Abort status: 0x0
> > nvme nvme0: I/O 0 QID 2 timeout, reset controller
> > 
> > ...and more, until I just have to poweroff and reboot.
> > 
> > Bisection points to your
> > 0da7feaa5913 ("nvme-pci: use the tagset alloc/free helpers")
> > And that does revert cleanly, giving a kernel which shows no problem.
> > 
> > I've spent a while comparing old nvme_pci_alloc_tag_set() and new
> > nvme_alloc_io_tag_set(), I do not know my way around there at all
> > and may be talking nonsense, but it did look as if there might now
> > be a difference in the queue_depth, sqsize, q_depth conversions.
> > 
> > I'm running load successfully with the patch below, but I strongly
> > suspect that the right patch will be somewhere else: over to you!
> 
> Thanks for the report!  The patch is definitely wrong: ->sqsize
> holds one of the awful so-called 'zero's based values' in NVMe,
> where 0 means 1, and thus has a built-in off-by-one.  We should
> probably convert it to a sane value at read time, but that's a
> separate discussion.
> 
> I suspect your controller is one of those where we quirk the size,
> and the commit you bisected fails to reflect that in the common
> sqsize value.  The patch below should be the minimum fix, and in
> the long term, the duplicate bookkeeping for it in the PCI driver
> should go away:

Thanks for the rapid response.  No, I've just tried your patch below,
and it does not help.  And I don't see any message about "queue size"
in dmesg, so I don't think mine is quirked.

cat /sys/devices/pci0000:00/0000:00:1c.4/0000:05:00.0/nvme/nvme0/sqsize
tells me 511 (and that looked like the only relevant file under /sys).

Regarding the awful 0's based queue depth: yes, it just looked to me
as if the way that got handled in pci.c before differed from the way
it gets handled in pci.c and core.c now, one too many "+ 1"s or "- 1"s
somewhere.

> 
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index f0f8027644bbf8..a73c0ee7bd1892 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -2536,7 +2536,6 @@ static int nvme_pci_enable(struct nvme_dev *dev)
>  
>  	dev->q_depth = min_t(u32, NVME_CAP_MQES(dev->ctrl.cap) + 1,
>  				io_queue_depth);
> -	dev->ctrl.sqsize = dev->q_depth - 1; /* 0's based queue depth */
>  	dev->db_stride = 1 << NVME_CAP_STRIDE(dev->ctrl.cap);
>  	dev->dbs = dev->bar + 4096;
>  
> @@ -2577,7 +2576,7 @@ static int nvme_pci_enable(struct nvme_dev *dev)
>  		dev_warn(dev->ctrl.device, "IO queue depth clamped to %d\n",
>  			 dev->q_depth);
>  	}
> -
> +	dev->ctrl.sqsize = dev->q_depth - 1; /* 0's based queue depth */
>  
>  	nvme_map_cmb(dev);
>
Linus Torvalds Dec. 24, 2022, 4:56 p.m. UTC | #4
On Sat, Dec 24, 2022 at 2:19 AM Hugh Dickins <hughd@google.com> wrote:
>
> Regarding the awful 0's based queue depth: yes, it just looked to me
> as if the way that got handled in pci.c before differed from the way
> it gets handled in pci.c and core.c now, one too many "+ 1"s or "- 1"s
> somewhere.

The commit in question seems to replace nvme_pci_alloc_tag_set() calls
with nvme_alloc_io_tag_set(), and that has a big difference in how
queue_depth is set.

It used to do (in nvme_pci_alloc_tag_set()):

        set->queue_depth = min_t(unsigned, dev->q_depth, BLK_MQ_MAX_DEPTH) - 1;

but now it does (in nvme_alloc_io_tag_set())

        set->queue_depth = ctrl->sqsize + 1;

instead.

So that "set->queue_depth" _seems_ to have historically had that "-1"
(that "zero means one" that apparently sqsize also has), but the new
code basically undoes it.
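
To make the difference concrete, a rough trace using the sqsize of 511 that
Hugh read from sysfs (assuming q_depth works out to 512 on this controller
and BLK_MQ_MAX_DEPTH does not clamp it):

	/* old: nvme_pci_alloc_tag_set() */
	set->queue_depth = min_t(unsigned, 512, BLK_MQ_MAX_DEPTH) - 1;	/* 511 */

	/* new: nvme_alloc_io_tag_set() */
	set->queue_depth = 511 + 1;					/* 512 */

so the new helper hands blk-mq one more usable tag than the old one did.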

I don't know the code at all, but this does all seem to be a change
(and *very* confusing).

The fact that Hugh gets it to work by doing that

        set->queue_depth = ctrl->sqsize;

does seem to match the whole "it used to subtract one" behavior it
had. Which is why I assume Hugh tried that patch in the first place.

             Linus
Keith Busch Dec. 24, 2022, 10:06 p.m. UTC | #5
On Fri, Dec 23, 2022 at 09:24:56PM -0800, Hugh Dickins wrote:
> Hi Christoph,
> 
> There's something wrong with the nvme-pci heading for 6.2-rc1:
> no problem booting here on this Lenovo ThinkPad X1 Carbon 5th,
> but under load...
> 
> nvme nvme0: I/O 0 (I/O Cmd) QID 2 timeout, aborting
> nvme nvme0: I/O 1 (I/O Cmd) QID 2 timeout, aborting
> nvme nvme0: I/O 2 (I/O Cmd) QID 2 timeout, aborting
> nvme nvme0: I/O 3 (I/O Cmd) QID 2 timeout, aborting
> nvme nvme0: Abort status: 0x0
> nvme nvme0: Abort status: 0x0
> nvme nvme0: Abort status: 0x0
> nvme nvme0: Abort status: 0x0
> nvme nvme0: I/O 0 QID 2 timeout, reset controller
> 
> ...and more, until I just have to poweroff and reboot.
> 
> Bisection points to your
> 0da7feaa5913 ("nvme-pci: use the tagset alloc/free helpers")
> And that does revert cleanly, giving a kernel which shows no problem.
> 
> I've spent a while comparing old nvme_pci_alloc_tag_set() and new
> nvme_alloc_io_tag_set(), I do not know my way around there at all
> and may be talking nonsense, but it did look as if there might now
> be a difference in the queue_depth, sqsize, q_depth conversions.
> 
> I'm running load successfully with the patch below, but I strongly
> suspect that the right patch will be somewhere else: over to you!
> 
> Hugh
> 
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -4926,7 +4926,7 @@ int nvme_alloc_io_tag_set(struct nvme_ct
>  
>  	memset(set, 0, sizeof(*set));
>  	set->ops = ops;
> -	set->queue_depth = ctrl->sqsize + 1;
> +	set->queue_depth = ctrl->sqsize;

Your observation is a queue-wrap condition that makes it impossible for
the controller to know there are new commands.

Your patch does look like the correct thing to do. The "zero means one"
thing is a confusing distraction, I think. It makes more sense if you
consider sqsize as the maximum number of tags we can have outstanding at
one time and it looks like all the drivers set it that way. We're
supposed to leave one slot empty for a full NVMe queue, so adding one
here to report the total number of slots isn't right, since that would allow
us to fill all slots.
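
A minimal sketch of the wrap condition being described, written as generic
ring logic rather than anything controller-specific (sq_empty/sq_full are
made-up helpers for illustration):

	/*
	 * With only head and tail indices to compare, "tail == head" has to
	 * mean empty, so at most entries - 1 commands can be outstanding.
	 * Filling every slot would wrap the tail back onto the head and the
	 * consumer would see an apparently empty queue, i.e. the newly
	 * submitted commands become invisible.
	 */
	static bool sq_empty(u16 head, u16 tail)
	{
		return head == tail;
	}

	static bool sq_full(u16 head, u16 tail, u16 entries)
	{
		return (u16)((tail + 1) % entries) == head;
	}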

Fabrics drivers have been using this method for a while, though, so
it's interesting they haven't had a similar problem.
Christoph Hellwig Dec. 25, 2022, 5:30 a.m. UTC | #6
On Sat, Dec 24, 2022 at 03:06:38PM -0700, Keith Busch wrote:
> Your observation is a queue-wrap condition that makes it impossible for
> the controller to know there are new commands.
> 
> Your patch does look like the correct thing to do. The "zero means one"
> thing is a confusing distraction, I think. It makes more sense if you
> consider sqsize as the maximum number of tags we can have outstanding at
> one time and it looks like all the drivers set it that way. We're
> supposed to leave one slot empty for a full NVMe queue, so adding one
> here to report the total number of slots isn't right, since that would allow
> us to fill all slots.

Yes, and the PCIe driver did actually do the - 1 from q_depth, so we should
drop the +1 for sqsize.  And add back the missing BLK_MQ_MAX_DEPTH clamp.
But we still need to keep sqsize updated as well.

> Fabrics drivers have been using this method for a while, though, so
> it's interesting they haven't had a similar problem.

Fabrics doesn't have a real queue and thus no actual wrap, so
I don't think they will be hit as bad by this.

So we'll probably need something like this, split into two patches.
And then for 6.2 clean up the sqsize vs q_depth mess for real.

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 95c488ea91c303..5b723c65fbeab5 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -4926,7 +4926,7 @@ int nvme_alloc_io_tag_set(struct nvme_ctrl *ctrl, struct blk_mq_tag_set *set,
 
 	memset(set, 0, sizeof(*set));
 	set->ops = ops;
-	set->queue_depth = ctrl->sqsize + 1;
+	set->queue_depth = min_t(unsigned, ctrl->sqsize, BLK_MQ_MAX_DEPTH - 1);
 	/*
 	 * Some Apple controllers requires tags to be unique across admin and
 	 * the (only) I/O queue, so reserve the first 32 tags of the I/O queue.
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index f0f8027644bbf8..ec5e1c578a710b 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2332,10 +2332,12 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
 	if (dev->cmb_use_sqes) {
 		result = nvme_cmb_qdepth(dev, nr_io_queues,
 				sizeof(struct nvme_command));
-		if (result > 0)
+		if (result > 0) {
 			dev->q_depth = result;
-		else
+			dev->ctrl.sqsize = dev->q_depth - 1;
+		} else {
 			dev->cmb_use_sqes = false;
+		}
 	}
 
 	do {
@@ -2536,7 +2538,6 @@ static int nvme_pci_enable(struct nvme_dev *dev)
 
 	dev->q_depth = min_t(u32, NVME_CAP_MQES(dev->ctrl.cap) + 1,
 				io_queue_depth);
-	dev->ctrl.sqsize = dev->q_depth - 1; /* 0's based queue depth */
 	dev->db_stride = 1 << NVME_CAP_STRIDE(dev->ctrl.cap);
 	dev->dbs = dev->bar + 4096;
 
@@ -2577,7 +2578,7 @@ static int nvme_pci_enable(struct nvme_dev *dev)
 		dev_warn(dev->ctrl.device, "IO queue depth clamped to %d\n",
 			 dev->q_depth);
 	}
-
+	dev->ctrl.sqsize = dev->q_depth - 1; /* 0's based queue depth */
 
 	nvme_map_cmb(dev);
Hugh Dickins Dec. 25, 2022, 8:33 a.m. UTC | #7
On Sat, 24 Dec 2022, Christoph Hellwig wrote:
> On Sat, Dec 24, 2022 at 03:06:38PM -0700, Keith Busch wrote:
> > Your observation is a queue-wrap condition that makes it impossible for
> > the controller to know there are new commands.
> > 
> > Your patch does look like the correct thing to do. The "zero means one"
> > thing is a confusing distraction, I think. It makes more sense if you
> > consider sqsize as the maximum number of tags we can have outstanding at
> > one time and it looks like all the drivers set it that way. We're
> > supposed to leave one slot empty for a full NVMe queue, so adding one
> > here to report the total number of slots isn't right, since that would allow
> > us to fill all slots.
> 
> Yes, and the PCIe driver did actually do the - 1 from q_depth, so we should
> drop the +1 for sqsize.  And add back the missing BLK_MQ_MAX_DEPTH clamp.
> But we still need to keep sqsize updated as well.
> 
> > Fabrics drivers have been using this method for a while, though, so
> > it's interesting they haven't had a similar problem.
> 
> Fabrics doesn't have a real queue and thus no actual wrap, so
> I don't think they will be hit as bad by this.
> 
> So we'll probably need something like this, split into two patches.
> And then for 6.2 clean up the sqsize vs q_depth mess for real.

This patch is working fine for me; and, in the light of Keith's
explanation, so far as I can tell, seems the right thing to do.

Thanks!
Hugh

> 
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 95c488ea91c303..5b723c65fbeab5 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -4926,7 +4926,7 @@ int nvme_alloc_io_tag_set(struct nvme_ctrl *ctrl, struct blk_mq_tag_set *set,
>  
>  	memset(set, 0, sizeof(*set));
>  	set->ops = ops;
> -	set->queue_depth = ctrl->sqsize + 1;
> +	set->queue_depth = min_t(unsigned, ctrl->sqsize, BLK_MQ_MAX_DEPTH - 1);
>  	/*
>  	 * Some Apple controllers requires tags to be unique across admin and
>  	 * the (only) I/O queue, so reserve the first 32 tags of the I/O queue.
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index f0f8027644bbf8..ec5e1c578a710b 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -2332,10 +2332,12 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
>  	if (dev->cmb_use_sqes) {
>  		result = nvme_cmb_qdepth(dev, nr_io_queues,
>  				sizeof(struct nvme_command));
> -		if (result > 0)
> +		if (result > 0) {
>  			dev->q_depth = result;
> -		else
> +			dev->ctrl.sqsize = dev->q_depth - 1;
> +		} else {
>  			dev->cmb_use_sqes = false;
> +		}
>  	}
>  
>  	do {
> @@ -2536,7 +2538,6 @@ static int nvme_pci_enable(struct nvme_dev *dev)
>  
>  	dev->q_depth = min_t(u32, NVME_CAP_MQES(dev->ctrl.cap) + 1,
>  				io_queue_depth);
> -	dev->ctrl.sqsize = dev->q_depth - 1; /* 0's based queue depth */
>  	dev->db_stride = 1 << NVME_CAP_STRIDE(dev->ctrl.cap);
>  	dev->dbs = dev->bar + 4096;
>  
> @@ -2577,7 +2578,7 @@ static int nvme_pci_enable(struct nvme_dev *dev)
>  		dev_warn(dev->ctrl.device, "IO queue depth clamped to %d\n",
>  			 dev->q_depth);
>  	}
> -
> +	dev->ctrl.sqsize = dev->q_depth - 1; /* 0's based queue depth */
>  
>  	nvme_map_cmb(dev);
>  
>
Thorsten Leemhuis Jan. 4, 2023, 2:02 p.m. UTC | #8
[TLDR: This mail is primarily relevant for Linux kernel regression
tracking. See link in footer if these mails annoy you.]

On 24.12.22 08:52, Thorsten Leemhuis wrote:
> [Note: this mail contains only information for Linux kernel regression
> tracking. Mails like these contain '#forregzbot' in the subject to make
> them easy to spot and filter out. The author also tried to remove most
> or all individuals from the list of recipients to spare them the hassle.]
> 
> On 24.12.22 06:24, Hugh Dickins wrote:
>>
>> There's something wrong with the nvme-pci heading for 6.2-rc1:
>> no problem booting here on this Lenovo ThinkPad X1 Carbon 5th,
>> but under load...
>> [...]
>> Bisection points to your
>> 0da7feaa5913 ("nvme-pci: use the tagset alloc/free helpers")
>> And that does revert cleanly, giving a kernel which shows no problem.
>> [...]
> 
> Thanks for the report. To be sure below issue doesn't fall through the
> cracks unnoticed, I'm adding it to regzbot, my Linux kernel regression
> tracking bot:
> 
> #regzbot ^introduced 0da7feaa5913
> #regzbot title nvme-pci: problems under load
> #regzbot ignore-activity

#regzbot fix: 88d356ca41ba1c3effc2d4208dfbd43

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
That page also explains what to do if mails like this annoy you.

Patch

--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -4926,7 +4926,7 @@  int nvme_alloc_io_tag_set(struct nvme_ct
 
 	memset(set, 0, sizeof(*set));
 	set->ops = ops;
-	set->queue_depth = ctrl->sqsize + 1;
+	set->queue_depth = ctrl->sqsize;
 	/*
 	 * Some Apple controllers requires tags to be unique across admin and
 	 * the (only) I/O queue, so reserve the first 32 tags of the I/O queue.