Message ID | 20250314012927.150860-1-jiangjianjun3@huawei.com (mailing list archive) |
---|---|
Series | scsi: scsi_error: Introduce new error handle mechanism |
On 3/14/25 02:29, JiangJianJun wrote:
> It's unbearable for systems with large scale scsi devices share HBAs to
> block all devices' IOs when handle error commands, we need a new error
> handle mechanism to address this issue.
>
> I consulted about this issue a year ago, the discuss link can be found in
> reference. Hannes replied about why we have to block the SCSI host
> then perform error recovery kindly. I think it's unnecessary to block
> SCSI host for all drivers and can try a small level recovery(LUN based for
> example) first to avoid block the SCSI host.
>

Technically, yes. There are, however, some issues which would need to be
addressed if someone were to design a new error handler.

1. The 'LUN Reset' TMF (as it's currently being used) is badly scoped;
   it will reset the LUN itself, affecting all ports to that LUN.
   So in a multipathed/multiported environment all initiators will be
   affected, even if they haven't experienced an error.
   Is that what we want? Shouldn't we rather use the 'Reset IT Nexus'
   TMF here?
   And, of course, the 'Target Reset' TMF has been dropped from SAM,
   so I really don't see the point in spending time here ...

2. Irrespective of the EH granularity, any error handling requires that
   all activity on that level has to be stopped. If you need to issue a
   LUN reset, you need to stop I/O for that LUN.

3. The current EH framework is designed around 'struct scsi_cmnd'.
   Which means that the command _initiating_ the error handling can
   only be returned once the _entire_ error handling (with all
   escalations) is finished. And more often than not, the application
   is waiting on that command to be completed before the next I/O
   is sent. And that really limits the effectiveness of any improved
   error handler; the application ultimately has to wait for a
   host reset before it can continue.

But anyway.
We already have a mechanism for asynchronous command aborts;
have you checked if you can adapt it for LUN reset, too?
That would be the easiest solution, I guess ...

Cheers,

Hannes
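Editorial illustration of point 3 above, not code from the thread or from the kernel: a minimal sketch of an escalation ladder in which the command that triggered error handling can only be handed back once the walk over all levels has ended. Every name here (`eh_level`, `eh_try_fn`, `eh_escalate`, `succeeds_from`) is hypothetical; the order of the levels loosely follows the `eh_*_handler` callbacks in `struct scsi_host_template`, and the real midlayer logic is considerably more involved.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical recovery levels, ordered from narrowest to widest scope. */
enum eh_level {
    EH_ABORT,
    EH_LUN_RESET,
    EH_TARGET_RESET,
    EH_BUS_RESET,
    EH_HOST_RESET,
    EH_LEVELS /* number of levels, not a level itself */
};

/* Hypothetical callback: returns true if recovery at 'level' succeeded. */
typedef bool (*eh_try_fn)(enum eh_level level, void *ctx);

/*
 * Walk the ladder from the narrowest level upwards.
 * Returns how many levels were attempted before one succeeded,
 * or -1 if every level failed. In this model the command that
 * triggered recovery can be completed only after this returns.
 */
int eh_escalate(eh_try_fn try_level, void *ctx)
{
    assert(try_level);
    for (int level = EH_ABORT; level < EH_LEVELS; level++) {
        if (try_level((enum eh_level)level, ctx))
            return level + 1;
    }
    return -1;
}

/* Example callback: recovery succeeds at or above the level stored in *ctx. */
bool succeeds_from(enum eh_level level, void *ctx)
{
    const enum eh_level *threshold = ctx;
    return level >= *threshold;
}
```

With a threshold of `EH_HOST_RESET` the caller sits through all five attempts before its command comes back, which is the latency problem described in point 3.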
On 3/14/25 2:01 AM, Hannes Reinecke wrote:
> 3. The current EH framework is designed around 'struct scsi_cmnd'.
>    Which means that the command _initiating_ the error handling can
>    only be returned once the _entire_ error handling (with all
>    escalations) is finished. And more often than not, the application
>    is waiting on that command to be completed before the next I/O
>    is sent. And that really limits the effectiveness of any improved
>    error handler; the application ultimately has to wait for a
>    host reset before it can continue.
>
> But anyway.
> We already have a mechanism for asynchronous command aborts;
> have you checked if you can adapt it for LUN reset, too?
> That would be the easiest solution, I guess ...

Hmm ... does this mean submitting a LUN reset while new SCSI commands
can concurrently be submitted from another thread? I don't think that's
safe.

Additionally, how could a LUN reset help if a SCSI abort doesn't help?
If a SCSI abort doesn't help, it probably means that the host controller
locked up, e.g. due to a firmware bug. How to recover from this without
resetting the host controller?

Thanks,

Bart.
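Editorial illustration of the concern raised here and in Hannes's point 2, not code from the thread or from the kernel: a minimal single-threaded model of a per-LUN gate that refuses new submissions and only reports the LUN as ready for a reset once in-flight commands have drained. The type `lun_gate` and the functions `lun_submit`, `lun_complete`, `lun_quiesce` and `lun_resume` are hypothetical, and the sketch deliberately omits the locking a real multi-threaded implementation would need.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-LUN state; NOT thread-safe, for illustration only. */
struct lun_gate {
    bool blocked;          /* set while recovery owns the LUN */
    unsigned int inflight; /* commands issued but not yet completed */
};

/* Try to submit a command; refused while the LUN is blocked. */
bool lun_submit(struct lun_gate *g)
{
    assert(g);
    if (g->blocked)
        return false;
    g->inflight++;
    return true;
}

/* Account for one completed command. */
void lun_complete(struct lun_gate *g)
{
    assert(g);
    if (g->inflight > 0)
        g->inflight--;
}

/*
 * Block new submissions. Returns true once nothing is in flight,
 * i.e. when a reset could be issued without racing against I/O.
 */
bool lun_quiesce(struct lun_gate *g)
{
    assert(g);
    g->blocked = true;
    return g->inflight == 0;
}

/* Reopen the LUN for submissions after recovery. */
void lun_resume(struct lun_gate *g)
{
    assert(g);
    g->blocked = false;
}
```

In this model only the affected LUN is gated and other LUNs behind the same host keep running, but it does not address the case where the host controller itself has locked up.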