mbox series

[00/10] scsi: Fix internal host code use

Message ID 20220804034100.121125-1-michael.christie@oracle.com (mailing list archive)
Headers show
Series scsi: Fix internal host code use | expand

Message

Mike Christie Aug. 4, 2022, 3:40 a.m. UTC
The following patches made over Martin's 5.20 staging branch fix an issue
where we probably intended the host codes:

DID_TARGET_FAILURE
DID_NEXUS_FAILURE
DID_ALLOC_FAILURE
DID_MEDIUM_ERROR

to be internal to scsi-ml, but at some point drivers started using them
and the driver writers never updated scsi-ml.

The problem with drivers using them to tell scsi-ml there was an error
is:

1. scsi_result_to_blk_status clears those codes, so they are not
propagated upwards. SG IO/passthrough users will then not see an error
and think a command was successful.

2. The SCSI error handler runs because scsi_decide_disposition has no
case statements for them and we return FAILED.

This patchset converts the drivers to stop using these codes, and then
moves them to scsi_priv.h in a new error byte so they can only be used
by scsi-ml.

Comments

Oliver Neukum Aug. 4, 2022, 6:55 a.m. UTC | #1
On 04.08.22 05:40, Mike Christie wrote:
> The following patches made over Martin's 5.20 staging branch fix an issue
> where we probably intended the host codes:
> 
> DID_TARGET_FAILURE
> DID_NEXUS_FAILURE
> DID_ALLOC_FAILURE
> DID_MEDIUM_ERROR
> 
> to be internal to scsi-ml, but at some point drivers started using them
> and the driver writers never updated scsi-ml.

Hi,

this approach drops useful information, though. If a device
reports such specific an error condition, why not use that
information?

	Regards
		Oliver
Mike Christie Aug. 4, 2022, 5:04 p.m. UTC | #2
On 8/4/22 1:55 AM, Oliver Neukum wrote:
> 
> 
> On 04.08.22 05:40, Mike Christie wrote:
>> The following patches made over Martin's 5.20 staging branch fix an issue
>> where we probably intended the host codes:
>>
>> DID_TARGET_FAILURE
>> DID_NEXUS_FAILURE
>> DID_ALLOC_FAILURE
>> DID_MEDIUM_ERROR
>>
>> to be internal to scsi-ml, but at some point drivers started using them
>> and the driver writers never updated scsi-ml.
> 
> Hi,
> 
> this approach drops useful information, though. If a device
> reports such specific an error condition, why not use that
> information

Is there a specific patch/case/code you are concerned?

I think in most cases the drivers were not using the correct
error code or they were stretching in trying to find a code
already.

The only ones that I thought were questionable were:

1. storvsc_drv: Used DID_TARGET_FAILURE for a local allocation
failure when they wanted to handle lun removal/scanning from a
worker thread.

I don't think DID_TARGET_FAILURE is right here. The driver wants
to just not retry this command. It's not really a perm target
failure like DID_TARGET_FAILURE is documented as. The failure
is just that the driver can't allocate some mem to perform lun
management.

I think either:

1. When we hit that failure path that we want to keep the 
DID_NO_CONNECT/DID_REQUEUE and not overwrite them.

Or

2. I used DID_BAD_TARGET to try and keep the spirit of their
DID_TARGET_FAILURE use where we couldn't handle an operation on
it's behalf. So the target itself is not bad but our processing
for it was so I thought it was close enough.

Note that I think the root issue is that the driver should
not be handling UAs and doing LUN scanning/removal and should
have added code to scsi-ml so it can be handled for everyone.
So really that code should not exist but that is a larger
change. I didn't want to add a new error code because of this.

2. uas: Used DID_TARGET_FAILURE when a TMF was not supported.
Again I don't think that code was right because it's not
a perm target failure. It is something that we don't want to
retry on another path but I don't think that comes up for
this driver ever.

I think DID_BAD_TARGET is ok'ish for this one. It's not a bad
target, but the target doesn't support what we needed and
DID_BAD_TARGET still conveys what we wanted and gives us the
same behavior.

3. cxlflash: DID_ALLOC_FAILURE was wrong in this case because
they wanted a retryable error. DID_ALLOC_FAILURE was for when
we are doing provisioning and couldn't allocate space on the
device, and is not retrable.

DID_ERROR gives them the behavior they want. It does lose info
but that's just how drivers ask scsi-ml to retry errors we don't
have codes for. We could add a new code but I don't think it
was worth it since we don't do that for every other driver and
their retryable errors. If there are drivers that have the same
issue then I'm for adding a new code.