diff mbox series

[v3] scsi: core: Fix shost->cmd_per_lun calculation in scsi_add_host_with_dma()

Message ID 20211008043546.6006-1-decui@microsoft.com (mailing list archive)
State Accepted
Headers show
Series [v3] scsi: core: Fix shost->cmd_per_lun calculation in scsi_add_host_with_dma() | expand

Commit Message

Dexuan Cui Oct. 8, 2021, 4:35 a.m. UTC
After commit ea2f0f77538c, a 416-CPU VM running on Hyper-V hangs during
boot because the hv_storvsc driver sets scsi_driver.can_queue to an "int"
value that exceeds SHRT_MAX, and hence scsi_add_host_with_dma() sets
shost->cmd_per_lun to a negative "short" value.

Use min_t(int, ...) to fix the issue.

Fixes: ea2f0f77538c ("scsi: core: Cap scsi_host cmd_per_lun at can_queue")
Cc: stable@vger.kernel.org
Signed-off-by: Dexuan Cui <decui@microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
---

v1 tried to fix the issue by changing the storvsc driver:
https://lwn.net/ml/linux-kernel/BYAPR21MB1270BBC14D5F1AE69FC31A16BFB09@BYAPR21MB1270.namprd21.prod.outlook.com/

v2 directly fixed the scsi core change instead as Michael Kelley suggested
(refer to the above link).

v3 simplified the commit log, as John Garry suggested.
   Added Haiyang's and Ming's Reviewed-by.

 drivers/scsi/hosts.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Comments

John Garry Oct. 8, 2021, noon UTC | #1
On 08/10/2021 05:35, Dexuan Cui wrote:
> After commit ea2f0f77538c, a 416-CPU VM running on Hyper-V hangs during
> boot because the hv_storvsc driver sets scsi_driver.can_queue to an "int"
> value that exceeds SHRT_MAX, and hence scsi_add_host_with_dma() sets
> shost->cmd_per_lun to a negative "short" value.
> 
> Use min_t(int, ...) to fix the issue.
> 
> Fixes: ea2f0f77538c ("scsi: core: Cap scsi_host cmd_per_lun at can_queue")
> Cc:stable@vger.kernel.org
> Signed-off-by: Dexuan Cui<decui@microsoft.com>
> Reviewed-by: Haiyang Zhang<haiyangz@microsoft.com>
> Reviewed-by: Ming Lei<ming.lei@redhat.com>

Reviewed-by: John Garry <john.garry@huawei.com>

thanks
Martin K. Petersen Oct. 12, 2021, 4:41 p.m. UTC | #2
Dexuan,

> After commit ea2f0f77538c, a 416-CPU VM running on Hyper-V hangs during
> boot because the hv_storvsc driver sets scsi_driver.can_queue to an "int"
> value that exceeds SHRT_MAX, and hence scsi_add_host_with_dma() sets
> shost->cmd_per_lun to a negative "short" value.
>
> Use min_t(int, ...) to fix the issue.

I queued this up as a short term workaround. However, I am hoping that
the rework of the scaling code in storvsc lands soon.
Dexuan Cui Oct. 12, 2021, 6:05 p.m. UTC | #3
> From: Martin K. Petersen <martin.petersen@oracle.com>
> Sent: Tuesday, October 12, 2021 9:42 AM
> 
> Dexuan,
> 
> > After commit ea2f0f77538c, a 416-CPU VM running on Hyper-V hangs during
> > boot because the hv_storvsc driver sets scsi_driver.can_queue to an "int"
> > value that exceeds SHRT_MAX, and hence scsi_add_host_with_dma() sets
> > shost->cmd_per_lun to a negative "short" value.
> >
> > Use min_t(int, ...) to fix the issue.
> 
> I queued this up as a short term workaround. However, I am hoping that
> the rework of the scaling code in storvsc lands soon.

Thanks, Martin! I know Michael Kelley will improve the netvsc.

Regarding this patch, I'm not sure if it's a "workaround": if it's incorrect to
set a bigger-than-SHRT_MAX scsi_driver.can_queue value, probably we should
change scsi_driver.can_queue from "int" to "u16"? BTW, I guess the "cmd_per_lun"
should also be "u16" rather than "short"?

This was discussed in May, and it looks like the conclusion was not clear to me:
https://lwn.net/ml/linux-kernel/457d23a9-deb0-4ee1-fe7f-5a63605d9686@huawei.com/

Thanks,
-- Dexuan
Martin K. Petersen Oct. 12, 2021, 8:35 p.m. UTC | #4
On Thu, 7 Oct 2021 21:35:46 -0700, Dexuan Cui wrote:

> After commit ea2f0f77538c, a 416-CPU VM running on Hyper-V hangs during
> boot because the hv_storvsc driver sets scsi_driver.can_queue to an "int"
> value that exceeds SHRT_MAX, and hence scsi_add_host_with_dma() sets
> shost->cmd_per_lun to a negative "short" value.
> 
> Use min_t(int, ...) to fix the issue.
> 
> [...]

Applied to 5.15/scsi-fixes, thanks!

[1/1] scsi: core: Fix shost->cmd_per_lun calculation in scsi_add_host_with_dma()
      https://git.kernel.org/mkp/scsi/c/50b6cb351636
Martin K. Petersen Oct. 12, 2021, 9:27 p.m. UTC | #5
Dexuan,

> Regarding this patch, I'm not sure if it's a "workaround": if it's
> incorrect to set a bigger-than-SHRT_MAX scsi_driver.can_queue value,
> probably we should change scsi_driver.can_queue from "int" to "u16"?

> BTW, I guess the "cmd_per_lun" should also be "u16" rather than
> "short"?

I agree that it would be nice to get all this cleaned up. Several,
somewhat peculiar, 25-year old design choices.

cmd_per_lun has traditionally been in the ballpark of low hundreds,
can_queue typically in the low thousands. And the block layer currently
caps at ~10K. Happy to take patches fixing this up, although I am a bit
worried about how much churn it will generate.

That said, I do think that cleaning this up is somewhat orthogonal to
the issue with storvsc. I suspect that allowing a huge amount of
concurrent outstanding commands is going to be detrimental to
performance for most workloads. And from that perspective I think that
the short->int fix, while valid given the type discrepancy, is just
treating the symptom.

Therefore I consider the short->int fix a workaround. And the proper fix
involves looking closely at things are scaled in the storvsc case. Which
I have noted that Michael is working on.
Dexuan Cui Oct. 12, 2021, 9:43 p.m. UTC | #6
> From: Martin K. Petersen <martin.petersen@oracle.com>
> Sent: Tuesday, October 12, 2021 2:28 PM
> To: Dexuan Cui <decui@microsoft.com>
> 
> > Regarding this patch, I'm not sure if it's a "workaround": if it's
> > incorrect to set a bigger-than-SHRT_MAX scsi_driver.can_queue value,
> > probably we should change scsi_driver.can_queue from "int" to "u16"?
> 
> > BTW, I guess the "cmd_per_lun" should also be "u16" rather than
> > "short"?
> 
> I agree that it would be nice to get all this cleaned up. Several,
> somewhat peculiar, 25-year old design choices.
> 
> cmd_per_lun has traditionally been in the ballpark of low hundreds,
> can_queue typically in the low thousands. And the block layer currently
> caps at ~10K. Happy to take patches fixing this up, although I am a bit
> worried about how much churn it will generate.

Thanks for the explanation!
 
> That said, I do think that cleaning this up is somewhat orthogonal to
> the issue with storvsc. I suspect that allowing a huge amount of
> concurrent outstanding commands is going to be detrimental to
> performance for most workloads. And from that perspective I think that
> the short->int fix, while valid given the type discrepancy, is just
> treating the symptom.

Agreed.
 
> Therefore I consider the short->int fix a workaround. And the proper fix
> involves looking closely at things are scaled in the storvsc case. Which
> I have noted that Michael is working on.

Agreed. My v1 actually tried to work around the storvsc driver instread. :-)
diff mbox series

Patch

diff --git a/drivers/scsi/hosts.c b/drivers/scsi/hosts.c
index 3f6f14f0cafb..24b72ee4246f 100644
--- a/drivers/scsi/hosts.c
+++ b/drivers/scsi/hosts.c
@@ -220,7 +220,8 @@  int scsi_add_host_with_dma(struct Scsi_Host *shost, struct device *dev,
 		goto fail;
 	}
 
-	shost->cmd_per_lun = min_t(short, shost->cmd_per_lun,
+	/* Use min_t(int, ...) in case shost->can_queue exceeds SHRT_MAX */
+	shost->cmd_per_lun = min_t(int, shost->cmd_per_lun,
 				   shost->can_queue);
 
 	error = scsi_init_sense_cache(shost);