[00/10] scsi: Fix internal host code use

Message ID	20220804034100.121125-1-michael.christie@oracle.com (mailing list archive)
Series	scsi: Fix internal host code use
[00/10] scsi: Fix internal host code use
[01/10] scsi: xen: Drop use of internal host codes.
[02/10] scsi: storvsc: Drop DID_TARGET_FAILURE use.
[03/10] scsi: uas: Drop DID_TARGET_FAILURE use.
[04/10] scsi: virtio_scsi: Drop DID_TARGET_FAILURE use.
[05/10] scsi: virtio_scsi: Drop DID_NEXUS_FAILURE use.
[06/10] scsi: qla2xxx: Drop DID_TARGET_FAILURE use.
[07/10] scsi: cxlflash: Drop DID_ALLOC_FAILURE use.
[08/10] scsi: Add error codes for internal scsi-ml use.
[09/10] scsi: Convert scsi_decide_disposition to use SCSIML_STAT
[10/10] scsi: Remove useless host error codes.

Headers

From: Mike Christie <michael.christie@oracle.com>
To: jgross@suse.com, njavali@marvell.com, pbonzini@redhat.com,
        jasowang@redhat.com, mst@redhat.com, stefanha@redhat.com,
        oneukum@suse.com, manoj@linux.ibm.com, mrochs@linux.ibm.com,
        ukrishn@linux.ibm.com, martin.petersen@oracle.com,
        linux-scsi@vger.kernel.org, james.bottomley@hansenpartnership.com
Subject: [PATCH 00/10] scsi: Fix internal host code use
Date: Wed,  3 Aug 2022 22:40:50 -0500
Message-Id: <20220804034100.121125-1-michael.christie@oracle.com>
Content-Transfer-Encoding: 8bit
Content-Type: text/plain
MIME-Version: 1.0
Precedence: bulk
List-ID: <linux-scsi.vger.kernel.org>
X-Mailing-List: linux-scsi@vger.kernel.org

Message

Mike Christie Aug. 4, 2022, 3:40 a.m. UTC

The following patches made over Martin's 5.20 staging branch fix an issue
where we probably intended the host codes:

DID_TARGET_FAILURE
DID_NEXUS_FAILURE
DID_ALLOC_FAILURE
DID_MEDIUM_ERROR

to be internal to scsi-ml, but at some point drivers started using them
and the driver writers never updated scsi-ml.

The problem with drivers using them to tell scsi-ml there was an error
is:

1. scsi_result_to_blk_status clears those codes, so they are not
propagated upwards. SG IO/passthrough users will then not see an error
and think a command was successful.

2. The SCSI error handler runs because scsi_decide_disposition has no
case statements for them and we return FAILED.
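
To make the two failure modes concrete, here is a minimal userspace sketch in C. It is a simplified model, not the kernel code: every `model_*` and `MODEL_*` name is hypothetical, the switch statements are reduced to a few cases, and the numeric `DID_*` values are assumed to match the kernel headers of that era.

```c
#include <stdint.h>

/* Host byte values; assumed, not copied from the kernel tree. */
enum {
	DID_OK             = 0x00,
	DID_ERROR          = 0x07,
	DID_TARGET_FAILURE = 0x10,
	DID_NEXUS_FAILURE  = 0x11,
	DID_ALLOC_FAILURE  = 0x12,
	DID_MEDIUM_ERROR   = 0x13,
};

/* Hypothetical stand-ins for block layer status codes. */
enum model_blk_status {
	MODEL_BLK_STS_OK,
	MODEL_BLK_STS_IOERR,
	MODEL_BLK_STS_TARGET,
	MODEL_BLK_STS_NEXUS,
	MODEL_BLK_STS_NOSPC,
	MODEL_BLK_STS_MEDIUM,
};

/* Hypothetical stand-ins for scsi_decide_disposition() return values. */
enum model_disposition { MODEL_SUCCESS, MODEL_NEEDS_RETRY, MODEL_FAILED };

/* The host byte occupies bits 16..23 of the 32-bit result word. */
static inline uint8_t model_host_byte(uint32_t result)
{
	return (result >> 16) & 0xff;
}

static inline uint32_t model_set_host_byte(uint32_t result, uint8_t host)
{
	return (result & 0xff00ffffu) | ((uint32_t)host << 16);
}

/*
 * Problem 1: converting to a block status resets the host byte to DID_OK,
 * so a passthrough caller that only inspects *result sees success.
 */
static enum model_blk_status model_result_to_blk_status(uint32_t *result)
{
	enum model_blk_status sts;

	switch (model_host_byte(*result)) {
	case DID_OK:
		return MODEL_BLK_STS_OK;
	case DID_TARGET_FAILURE:
		sts = MODEL_BLK_STS_TARGET;
		break;
	case DID_NEXUS_FAILURE:
		sts = MODEL_BLK_STS_NEXUS;
		break;
	case DID_ALLOC_FAILURE:
		sts = MODEL_BLK_STS_NOSPC;
		break;
	case DID_MEDIUM_ERROR:
		sts = MODEL_BLK_STS_MEDIUM;
		break;
	default:
		return MODEL_BLK_STS_IOERR;
	}
	*result = model_set_host_byte(*result, DID_OK);
	return sts;
}

/*
 * Problem 2: with no case for the four codes they hit the default
 * branch, which in this model means the error handler would run.
 */
static enum model_disposition model_decide_disposition(uint32_t result)
{
	switch (model_host_byte(result)) {
	case DID_OK:
		return MODEL_SUCCESS;
	case DID_ERROR:
		return MODEL_NEEDS_RETRY;
	default:
		return MODEL_FAILED;
	}
}
```

In this model a driver that completes a command with DID_TARGET_FAILURE gets MODEL_FAILED from the disposition step, and after the block status conversion the result word carries DID_OK again.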

This patchset converts the drivers to stop using these codes, and then
moves them to scsi_priv.h in a new error byte so they can only be used
by scsi-ml.
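
As a rough illustration of the "new error byte" idea, the sketch below keeps a midlayer-private status in its own byte of the result word, separate from the host byte that drivers set. This is not the code from the patches: all `ml_*` and `ML_*` names, the status values, and the choice of bits 8..15 for the private byte are assumptions made for the example.

```c
#include <stdint.h>

#define ML_BYTE_SHIFT 8	/* assumed position of the private byte */

/* Hypothetical midlayer-private status values. */
enum ml_status {
	ML_STAT_OK            = 0x00,
	ML_STAT_NEXUS_FAILURE = 0x01,
	ML_STAT_NOSPC         = 0x02,
	ML_STAT_MED_ERROR     = 0x03,
	ML_STAT_TGT_FAILURE   = 0x04,
};

/* Hypothetical stand-ins for block layer status codes. */
enum ml_blk_status {
	ML_BLK_STS_OK,
	ML_BLK_STS_IOERR,
	ML_BLK_STS_NEXUS,
	ML_BLK_STS_NOSPC,
	ML_BLK_STS_MEDIUM,
	ML_BLK_STS_TARGET,
};

static inline uint8_t ml_byte(uint32_t result)
{
	return (result >> ML_BYTE_SHIFT) & 0xff;
}

static inline uint8_t ml_host_byte(uint32_t result)
{
	return (result >> 16) & 0xff;
}

/* Only the midlayer would call this; drivers keep using the host byte. */
static inline uint32_t ml_set_byte(uint32_t result, enum ml_status st)
{
	return (result & ~(0xffu << ML_BYTE_SHIFT)) |
	       ((uint32_t)st << ML_BYTE_SHIFT);
}

/*
 * The private byte is consulted first. The host byte is never rewritten,
 * so whatever the driver reported stays visible to passthrough users.
 */
static enum ml_blk_status ml_result_to_blk_status(uint32_t result)
{
	switch (ml_byte(result)) {
	case ML_STAT_NEXUS_FAILURE:
		return ML_BLK_STS_NEXUS;
	case ML_STAT_NOSPC:
		return ML_BLK_STS_NOSPC;
	case ML_STAT_MED_ERROR:
		return ML_BLK_STS_MEDIUM;
	case ML_STAT_TGT_FAILURE:
		return ML_BLK_STS_TARGET;
	default:
		break;
	}
	return ml_host_byte(result) ? ML_BLK_STS_IOERR : ML_BLK_STS_OK;
}
```

Because the conversion takes the result by value, it cannot clear a driver-reported host code, which is the property the first problem above lacks.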

Comments

Oliver Neukum Aug. 4, 2022, 6:55 a.m. UTC | #1

On 04.08.22 05:40, Mike Christie wrote:
> The following patches made over Martin's 5.20 staging branch fix an issue
> where we probably intended the host codes:
> 
> DID_TARGET_FAILURE
> DID_NEXUS_FAILURE
> DID_ALLOC_FAILURE
> DID_MEDIUM_ERROR
> 
> to be internal to scsi-ml, but at some point drivers started using them
> and the driver writers never updated scsi-ml.

Hi,

this approach drops useful information, though. If a device
reports such a specific error condition, why not use that
information?

	Regards
		Oliver

Mike Christie Aug. 4, 2022, 5:04 p.m. UTC | #2

On 8/4/22 1:55 AM, Oliver Neukum wrote:
> 
> 
> On 04.08.22 05:40, Mike Christie wrote:
>> The following patches made over Martin's 5.20 staging branch fix an issue
>> where we probably intended the host codes:
>>
>> DID_TARGET_FAILURE
>> DID_NEXUS_FAILURE
>> DID_ALLOC_FAILURE
>> DID_MEDIUM_ERROR
>>
>> to be internal to scsi-ml, but at some point drivers started using them
>> and the driver writers never updated scsi-ml.
> 
> Hi,
> 
> this approach drops useful information, though. If a device
> reports such a specific error condition, why not use that
> information?

Is there a specific patch/case/code you are concerned about?

I think in most cases the drivers were not using the correct
error code, or they were stretching to find an existing code
that fit.

The only ones that I thought were questionable were:

1. storvsc_drv: Used DID_TARGET_FAILURE for a local allocation
failure when they wanted to handle lun removal/scanning from a
worker thread.

I don't think DID_TARGET_FAILURE is right here. The driver wants
to just not retry this command. It's not really a perm target
failure like DID_TARGET_FAILURE is documented as. The failure
is just that the driver can't allocate some mem to perform lun
management.

I think either:

a. When we hit that failure path, we want to keep the
DID_NO_CONNECT/DID_REQUEUE and not overwrite them.

Or

b. I used DID_BAD_TARGET to try and keep the spirit of their
DID_TARGET_FAILURE use where we couldn't handle an operation on
its behalf. So the target itself is not bad but our processing
for it was, so I thought it was close enough.

Note that I think the root issue is that the driver should
not be handling UAs and doing LUN scanning/removal and should
have added code to scsi-ml so it can be handled for everyone.
So really that code should not exist but that is a larger
change. I didn't want to add a new error code because of this.

2. uas: Used DID_TARGET_FAILURE when a TMF was not supported.
Again I don't think that code was right because it's not
a perm target failure. It is something that we don't want to
retry on another path but I don't think that comes up for
this driver ever.

I think DID_BAD_TARGET is ok'ish for this one. It's not a bad
target, but the target doesn't support what we needed and
DID_BAD_TARGET still conveys what we wanted and gives us the
same behavior.

3. cxlflash: DID_ALLOC_FAILURE was wrong in this case because
they wanted a retryable error. DID_ALLOC_FAILURE was for when
we are doing provisioning and couldn't allocate space on the
device, and is not retryable.

DID_ERROR gives them the behavior they want. It does lose info
but that's just how drivers ask scsi-ml to retry errors we don't
have codes for. We could add a new code but I don't think it
was worth it since we don't do that for every other driver and
their retryable errors. If there are drivers that have the same
issue then I'm for adding a new code.
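
The three replacements discussed above can be summarized as a small mapping from a driver's failure reason to a host byte. This is a sketch under stated assumptions: the `drv_*` and `DRV_*` names are hypothetical, the numeric `DID_*` values are assumed to match the kernel headers, and the retry check is a deliberately simplified model of how scsi-ml treats these two codes.

```c
#include <stdbool.h>
#include <stdint.h>

/* Host byte values; assumed, not copied from the kernel tree. */
enum {
	DID_OK         = 0x00,
	DID_BAD_TARGET = 0x04,
	DID_ERROR      = 0x07,
};

/* Hypothetical driver-side failure reasons, one per case in the thread. */
enum drv_failure {
	DRV_OK,
	DRV_LOCAL_ALLOC_FAILED,	/* storvsc: no memory for LUN management work */
	DRV_TMF_UNSUPPORTED,	/* uas: device does not support the TMF */
	DRV_TRANSIENT_ERROR,	/* cxlflash: error the driver wants retried */
};

/* Pick a public host code instead of a scsi-ml internal one. */
static uint8_t drv_failure_to_host_byte(enum drv_failure f)
{
	switch (f) {
	case DRV_OK:
		return DID_OK;
	case DRV_LOCAL_ALLOC_FAILED:
	case DRV_TMF_UNSUPPORTED:
		/* Not a bad target as such, but the command must not be retried. */
		return DID_BAD_TARGET;
	case DRV_TRANSIENT_ERROR:
		/* Loses detail, but asks scsi-ml for a retry. */
		return DID_ERROR;
	}
	return DID_ERROR;
}

/* Build a result word with only the host byte (bits 16..23) set. */
static inline uint32_t drv_make_result(enum drv_failure f)
{
	return (uint32_t)drv_failure_to_host_byte(f) << 16;
}

/* Simplified assumption: of the codes above, only DID_ERROR is retried. */
static bool drv_midlayer_would_retry(uint32_t result)
{
	return ((result >> 16) & 0xff) == DID_ERROR;
}
```

The mapping shows the trade-off raised in the thread: the storvsc and uas cases collapse into one code, and the cxlflash case keeps its retry behavior at the cost of a generic error.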