[08/11] scsi: ufs: Improve SCSI abort handling further

Message ID	20211110004440.3389311-9-bvanassche@acm.org (mailing list archive)
State	Superseded
Headers	show Return-Path: <linux-scsi-owner@kernel.org> From: Bart Van Assche <bvanassche@acm.org> To: "Martin K . Petersen" <martin.petersen@oracle.com> Cc: linux-scsi@vger.kernel.org, Jaegeuk Kim <jaegeuk@kernel.org>, Adrian Hunter <adrian.hunter@intel.com>, Bart Van Assche <bvanassche@acm.org>, "James E.J. Bottomley" <jejb@linux.ibm.com>, Bean Huo <beanhuo@micron.com>, Can Guo <cang@codeaurora.org>, Stanley Chu <stanley.chu@mediatek.com>, Asutosh Das <asutoshd@codeaurora.org>, James Bottomley <James.Bottomley@HansenPartnership.com>, Santosh Yaraganavi <santoshsy@gmail.com>, Arnd Bergmann <arnd@arndb.de>, Vishak G <vishak.g@samsung.com> Subject: [PATCH 08/11] scsi: ufs: Improve SCSI abort handling further Date: Tue, 9 Nov 2021 16:44:37 -0800 Message-Id: <20211110004440.3389311-9-bvanassche@acm.org> In-Reply-To: <20211110004440.3389311-1-bvanassche@acm.org> References: <20211110004440.3389311-1-bvanassche@acm.org> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	UFS patches for kernel v5.17 \| expand [00/11] UFS patches for kernel v5.17 [01/11] scsi: ufs: Rename a function argument [02/11] scsi: ufs: Remove is_rpmb_wlun() [03/11] scsi: ufs: Remove the sdev_rpmb member [04/11] scsi: ufs: Remove dead code [05/11] scsi: core: Add support for reserved tags [06/11] scsi: ufs: Rework ufshcd_change_queue_depth() [07/11] scsi: ufs: Fix a deadlock in the error handler [08/11] scsi: ufs: Improve SCSI abort handling further [09/11] scsi: ufs: Fix a kernel crash during shutdown [10/11] scsi: ufs: Optimize the command queueing code [11/11] scsi: ufs: Implement polling support

Bart Van Assche Nov. 10, 2021, 12:44 a.m. UTC

Make sure that aborted commands are completed once by clearing the
corresponding tag bit from hba->outstanding_reqs. This patch is a
follow-up for commit cd892096c940 ("scsi: ufs: core: Improve SCSI
abort handling").

Fixes: 7a3e97b0dc4b ("[SCSI] ufshcd: UFS Host controller driver")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 drivers/scsi/ufs/ufshcd.c | 9 +++++++++
 1 file changed, 9 insertions(+)

Adrian Hunter Nov. 10, 2021, 8:57 a.m. UTC | #1

On 10/11/2021 02:44, Bart Van Assche wrote:
> Make sure that aborted commands are completed once by clearing the
> corresponding tag bit from hba->outstanding_reqs. This patch is a
> follow-up for commit cd892096c940 ("scsi: ufs: core: Improve SCSI
> abort handling").
> 
> Fixes: 7a3e97b0dc4b ("[SCSI] ufshcd: UFS Host controller driver")
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
>  drivers/scsi/ufs/ufshcd.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
> 
> diff --git a/drivers/scsi/ufs/ufshcd.c b/drivers/scsi/ufs/ufshcd.c
> index 8f5640647054..1e15ed1f639f 100644
> --- a/drivers/scsi/ufs/ufshcd.c
> +++ b/drivers/scsi/ufs/ufshcd.c
> @@ -7090,6 +7090,15 @@ static int ufshcd_abort(struct scsi_cmnd *cmd)
>  		goto release;
>  	}
>  
> +	/*
> +	 * ufshcd_try_to_abort_task() cleared the 'tag' bit in the doorbell
> +	 * register. Clear the corresponding bit from outstanding_reqs to
> +	 * prevent early completion.
> +	 */
> +	spin_lock_irqsave(&hba->outstanding_lock, flags);
> +	__clear_bit(tag, &hba->outstanding_reqs);
> +	spin_unlock_irqrestore(&hba->outstanding_lock, flags);

Seems like something ufshcd_clear_cmd() should be doing instead?

> +
>  	lrbp->cmd = NULL;
>  	err = SUCCESS;
>  
>

Bart Van Assche Nov. 10, 2021, 6:56 p.m. UTC | #2

On 11/10/21 12:57 AM, Adrian Hunter wrote:
> On 10/11/2021 02:44, Bart Van Assche wrote:
>> Make sure that aborted commands are completed once by clearing the
>> corresponding tag bit from hba->outstanding_reqs. This patch is a
>> follow-up for commit cd892096c940 ("scsi: ufs: core: Improve SCSI
>> abort handling").
>>
>> Fixes: 7a3e97b0dc4b ("[SCSI] ufshcd: UFS Host controller driver")
>> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
>> ---
>>   drivers/scsi/ufs/ufshcd.c | 9 +++++++++
>>   1 file changed, 9 insertions(+)
>>
>> diff --git a/drivers/scsi/ufs/ufshcd.c b/drivers/scsi/ufs/ufshcd.c
>> index 8f5640647054..1e15ed1f639f 100644
>> --- a/drivers/scsi/ufs/ufshcd.c
>> +++ b/drivers/scsi/ufs/ufshcd.c
>> @@ -7090,6 +7090,15 @@ static int ufshcd_abort(struct scsi_cmnd *cmd)
>>   		goto release;
>>   	}
>>   
>> +	/*
>> +	 * ufshcd_try_to_abort_task() cleared the 'tag' bit in the doorbell
>> +	 * register. Clear the corresponding bit from outstanding_reqs to
>> +	 * prevent early completion.
>> +	 */
>> +	spin_lock_irqsave(&hba->outstanding_lock, flags);
>> +	__clear_bit(tag, &hba->outstanding_reqs);
>> +	spin_unlock_irqrestore(&hba->outstanding_lock, flags);
> 
> Seems like something ufshcd_clear_cmd() should be doing instead?

Hi Adrian,

I'm concerned that would break ufshcd_eh_device_reset_handler() since 
that reset handler retries SCSI commands by calling 
__ufshcd_transfer_req_compl() after having called ufshcd_clear_cmd().

Thanks,

Bart.

Peter Wang (王信友) Nov. 11, 2021, 9:17 a.m. UTC | #3

Hi Bart,

On 11/10/21 8:44 AM, Bart Van Assche wrote:
> Make sure that aborted commands are completed once by clearing the
> corresponding tag bit from hba->outstanding_reqs. This patch is a
> follow-up for commit cd892096c940 ("scsi: ufs: core: Improve SCSI
> abort handling").
>
> Fixes: 7a3e97b0dc4b ("[SCSI] ufshcd: UFS Host controller driver")
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
>   drivers/scsi/ufs/ufshcd.c | 9 +++++++++
>   1 file changed, 9 insertions(+)
>
> diff --git a/drivers/scsi/ufs/ufshcd.c b/drivers/scsi/ufs/ufshcd.c
> index 8f5640647054..1e15ed1f639f 100644
> --- a/drivers/scsi/ufs/ufshcd.c
> +++ b/drivers/scsi/ufs/ufshcd.c
> @@ -7090,6 +7090,15 @@ static int ufshcd_abort(struct scsi_cmnd *cmd)
>   		goto release;
>   	}
>   
> +	/*
> +	 * ufshcd_try_to_abort_task() cleared the 'tag' bit in the doorbell
> +	 * register. Clear the corresponding bit from outstanding_reqs to
> +	 * prevent early completion.
> +	 */
> +	spin_lock_irqsave(&hba->outstanding_lock, flags);
> +	__clear_bit(tag, &hba->outstanding_reqs);
> +	spin_unlock_irqrestore(&hba->outstanding_lock, flags);
> +

should we also call unmap?

     scsi_dma_unmap(cmd);


>   	lrbp->cmd = NULL;
>   	err = SUCCESS;
>

Adrian Hunter Nov. 12, 2021, 10:56 a.m. UTC | #4

On 10/11/2021 20:56, Bart Van Assche wrote:
> On 11/10/21 12:57 AM, Adrian Hunter wrote:
>> On 10/11/2021 02:44, Bart Van Assche wrote:
>>> Make sure that aborted commands are completed once by clearing the
>>> corresponding tag bit from hba->outstanding_reqs. This patch is a
>>> follow-up for commit cd892096c940 ("scsi: ufs: core: Improve SCSI
>>> abort handling").
>>>
>>> Fixes: 7a3e97b0dc4b ("[SCSI] ufshcd: UFS Host controller driver")
>>> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
>>> ---
>>>   drivers/scsi/ufs/ufshcd.c | 9 +++++++++
>>>   1 file changed, 9 insertions(+)
>>>
>>> diff --git a/drivers/scsi/ufs/ufshcd.c b/drivers/scsi/ufs/ufshcd.c
>>> index 8f5640647054..1e15ed1f639f 100644
>>> --- a/drivers/scsi/ufs/ufshcd.c
>>> +++ b/drivers/scsi/ufs/ufshcd.c
>>> @@ -7090,6 +7090,15 @@ static int ufshcd_abort(struct scsi_cmnd *cmd)
>>>           goto release;
>>>       }
>>>   +    /*
>>> +     * ufshcd_try_to_abort_task() cleared the 'tag' bit in the doorbell
>>> +     * register. Clear the corresponding bit from outstanding_reqs to
>>> +     * prevent early completion.
>>> +     */
>>> +    spin_lock_irqsave(&hba->outstanding_lock, flags);
>>> +    __clear_bit(tag, &hba->outstanding_reqs);
>>> +    spin_unlock_irqrestore(&hba->outstanding_lock, flags);
>>
>> Seems like something ufshcd_clear_cmd() should be doing instead?
> 
> Hi Adrian,
> 
> I'm concerned that would break ufshcd_eh_device_reset_handler() since that reset handler retries SCSI commands by calling __ufshcd_transfer_req_compl() after having called ufshcd_clear_cmd().

Whenever an outstanding_reqs bit transitions 1 -> 0, then
__ufshcd_transfer_req_compl() must be called.

In all cases, the correct logic must have the effect of:

	spin_lock_irqsave(&hba->outstanding_lock, flags);
	bit = __test_and_clear_bit(tag, &hba->outstanding_reqs);
	spin_unlock_irqrestore(&hba->outstanding_lock, flags);

	if (bit)
		__ufshcd_transfer_req_compl(hba, 1UL << tag);

To put it another way, how else does the driver know whether or
not __ufshcd_transfer_req_compl() has been called already.


As a separate issue, in ufshcd_abort() there is:

	/* If command is already aborted/completed, return FAILED. */
	if (!(test_bit(tag, &hba->outstanding_reqs))) {
		dev_err(hba->dev,
			"%s: cmd at tag %d already completed, outstanding=0x%lx, doorbell=0x%x\n",
			__func__, tag, hba->outstanding_reqs, reg);
		goto release;
	}

which seems wrong. FAILED should only be returned to escalate the
error handling, so if the slot has already successfully been
cleared, that is SUCCESS.  scsi_times_out() has already blocked
the scsi_done() path (by setting SCMD_STATE_COMPLETE), so any use
after free must be being caused by SCSI not the ufs driver.

path

Bart Van Assche Nov. 15, 2021, 11:09 p.m. UTC | #5

On 11/12/21 2:56 AM, Adrian Hunter wrote:
> On 10/11/2021 20:56, Bart Van Assche wrote:
>> On 11/10/21 12:57 AM, Adrian Hunter wrote:
>>> Seems like something ufshcd_clear_cmd() should be doing instead?
>>
>> I'm concerned that would break ufshcd_eh_device_reset_handler()
 >> since that reset handler retries SCSI commands by calling
 >> __ufshcd_transfer_req_compl() after having called ufshcd_clear_cmd().
> Whenever an outstanding_reqs bit transitions 1 -> 0, then
> __ufshcd_transfer_req_compl() must be called.

I will look further into this.

> As a separate issue, in ufshcd_abort() there is:
> 
> 	/* If command is already aborted/completed, return FAILED. */
> 	if (!(test_bit(tag, &hba->outstanding_reqs))) {
> 		dev_err(hba->dev,
> 			"%s: cmd at tag %d already completed, outstanding=0x%lx, doorbell=0x%x\n",
> 			__func__, tag, hba->outstanding_reqs, reg);
> 		goto release;
> 	}
> 
> which seems wrong. FAILED should only be returned to escalate the
> error handling, so if the slot has already successfully been
> cleared, that is SUCCESS.  scsi_times_out() has already blocked
> the scsi_done() path (by setting SCMD_STATE_COMPLETE), so any use
> after free must be being caused by SCSI not the ufs driver.

scmd_eh_abort_handler() would trigger a use-after-free if ufshcd_abort() 
would return SUCCESS for completed commands. Hence the choice for the 
return value FAILED for completed commands.

BTW, can this code path ever be reached since scsi_done() sets the 
SCMD_STATE_COMPLETE bit before it calls blk_mq_complete_request() and 
since scsi_times_out() tests that bit before it schedules a call of 
ufshcd_abort()?

Thanks,

Bart.

Adrian Hunter Nov. 16, 2021, 9:03 a.m. UTC | #6

On 16/11/2021 01:09, Bart Van Assche wrote:
> On 11/12/21 2:56 AM, Adrian Hunter wrote:
>> On 10/11/2021 20:56, Bart Van Assche wrote:
>>> On 11/10/21 12:57 AM, Adrian Hunter wrote:
>>>> Seems like something ufshcd_clear_cmd() should be doing instead?
>>>
>>> I'm concerned that would break ufshcd_eh_device_reset_handler()
>>> since that reset handler retries SCSI commands by calling
>>> __ufshcd_transfer_req_compl() after having called ufshcd_clear_cmd().
>> Whenever an outstanding_reqs bit transitions 1 -> 0, then
>> __ufshcd_transfer_req_compl() must be called.
> 
> I will look further into this.
> 
>> As a separate issue, in ufshcd_abort() there is:
>>
>>     /* If command is already aborted/completed, return FAILED. */
>>     if (!(test_bit(tag, &hba->outstanding_reqs))) {
>>         dev_err(hba->dev,
>>             "%s: cmd at tag %d already completed, outstanding=0x%lx, doorbell=0x%x\n",
>>             __func__, tag, hba->outstanding_reqs, reg);
>>         goto release;
>>     }
>>
>> which seems wrong. FAILED should only be returned to escalate the
>> error handling, so if the slot has already successfully been
>> cleared, that is SUCCESS.  scsi_times_out() has already blocked
>> the scsi_done() path (by setting SCMD_STATE_COMPLETE), so any use
>> after free must be being caused by SCSI not the ufs driver.
> 
> scmd_eh_abort_handler() would trigger a use-after-free if ufshcd_abort() would return SUCCESS for completed commands. Hence the choice for the return value FAILED for completed commands.
> 
> BTW, can this code path ever be reached since scsi_done() sets the SCMD_STATE_COMPLETE bit before it calls blk_mq_complete_request() and since scsi_times_out() tests that bit before it schedules a call of ufshcd_abort()?

Racing with timing out, the UFS interrupt handler can clear hba->outstanding_reqs bit and call __ufshcd_transfer_req_compl(), but the request will not be freed if scsi_times_out() wins the race.  So there should be no use-after-free unless SCSI error handling itself has a bug.

One perhaps unrelated issue in scsi_times_out():

		if (test_and_set_bit(SCMD_STATE_COMPLETE, &scmd->state))
			return BLK_EH_RESET_TIMER;

Shouldn't that return BLK_EH_DONE not BLK_EH_RESET_TIMER, since the request has been through blk_mq_complete_request() ?

Peter Wang (王信友) Nov. 16, 2021, 9:07 a.m. UTC | #7

On 11/11/21 5:17 PM, Peter Wang wrote:
> Hi Bart,
>
> On 11/10/21 8:44 AM, Bart Van Assche wrote:
>> Make sure that aborted commands are completed once by clearing the
>> corresponding tag bit from hba->outstanding_reqs. This patch is a
>> follow-up for commit cd892096c940 ("scsi: ufs: core: Improve SCSI
>> abort handling").
>>
>> Fixes: 7a3e97b0dc4b ("[SCSI] ufshcd: UFS Host controller driver")
>> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
>> ---
>>   drivers/scsi/ufs/ufshcd.c | 9 +++++++++
>>   1 file changed, 9 insertions(+)
>>
>> diff --git a/drivers/scsi/ufs/ufshcd.c b/drivers/scsi/ufs/ufshcd.c
>> index 8f5640647054..1e15ed1f639f 100644
>> --- a/drivers/scsi/ufs/ufshcd.c
>> +++ b/drivers/scsi/ufs/ufshcd.c
>> @@ -7090,6 +7090,15 @@ static int ufshcd_abort(struct scsi_cmnd *cmd)
>>           goto release;
>>       }
>>   +    /*
>> +     * ufshcd_try_to_abort_task() cleared the 'tag' bit in the doorbell
>> +     * register. Clear the corresponding bit from outstanding_reqs to
>> +     * prevent early completion.
>> +     */
>> +    spin_lock_irqsave(&hba->outstanding_lock, flags);
>> +    __clear_bit(tag, &hba->outstanding_reqs);
>> +    spin_unlock_irqrestore(&hba->outstanding_lock, flags);
>> +
>
> should we also call unmap?
>
>     scsi_dma_unmap(cmd);
>
Hi Bart,

Should we add unmap?


Thanks.

Peter


>
>>       lrbp->cmd = NULL;
>>       err = SUCCESS;

Bart Van Assche Nov. 16, 2021, 4:07 p.m. UTC | #8

On 11/16/21 1:03 AM, Adrian Hunter wrote:
> One perhaps unrelated issue in scsi_times_out():
> 
> 		if (test_and_set_bit(SCMD_STATE_COMPLETE, &scmd->state))
> 			return BLK_EH_RESET_TIMER;
> 
> Shouldn't that return BLK_EH_DONE not BLK_EH_RESET_TIMER, since the request has been through blk_mq_complete_request() ?

I think so. I will submit a patch.

Bart.

Bart Van Assche Nov. 16, 2021, 4:08 p.m. UTC | #9

On 11/16/21 1:07 AM, Peter Wang wrote:
> Should we add unmap?

Hi Peter,

I will add DMA unmapping code in the abort handler.

Thanks,

Bart.

Adrian Hunter Nov. 16, 2021, 8:16 p.m. UTC | #10

On 16/11/2021 18:08, Bart Van Assche wrote:
> On 11/16/21 1:07 AM, Peter Wang wrote:
>> Should we add unmap?
> 
> Hi Peter,
> 
> I will add DMA unmapping code in the abort handler.

I would note that __ufshcd_transfer_req_compl() does that, as well as providing
a matching ufshcd_release() for the ufshcd_hold() in ufshcd_queuecommand(), so
do consider __ufshcd_transfer_req_compl().

Using __ufshcd_transfer_req_compl() seems consistent with the error handler which
uses __ufshcd_transfer_req_compl() via ufshcd_complete_requests(), which will pick
up all the requests that the error handler has just aborted via
ufshcd_try_to_abort_task().  Also ufshcd_host_reset_and_restore() uses
__ufshcd_transfer_req_compl() via ufshcd_complete_requests(), which will pick
up anything still in outstanding_reqs because the doorbell has become zero at
that point.

Bart Van Assche Nov. 16, 2021, 9:53 p.m. UTC | #11

On 11/16/21 12:16, Adrian Hunter wrote:
> On 16/11/2021 18:08, Bart Van Assche wrote:
>> On 11/16/21 1:07 AM, Peter Wang wrote:
>>> Should we add unmap?
>>
>> Hi Peter,
>>
>> I will add DMA unmapping code in the abort handler.
> 
> I would note that __ufshcd_transfer_req_compl() does that, as well as providing
> a matching ufshcd_release() for the ufshcd_hold() in ufshcd_queuecommand(), so
> do consider __ufshcd_transfer_req_compl().
> 
> Using __ufshcd_transfer_req_compl() seems consistent with the error handler which
> uses __ufshcd_transfer_req_compl() via ufshcd_complete_requests(), which will pick
> up all the requests that the error handler has just aborted via
> ufshcd_try_to_abort_task().  Also ufshcd_host_reset_and_restore() uses
> __ufshcd_transfer_req_compl() via ufshcd_complete_requests(), which will pick
> up anything still in outstanding_reqs because the doorbell has become zero at
> that point.

Hi Adrian,

Although I agree with minimizing code duplication, I'm not sure that 
__ufshcd_transfer_req_compl() should be called from inside 
ufshcd_abort(). __ufshcd_transfer_req_compl() also calls 
ufshcd_update_monitor() and ufshcd_add_command_trace(hba, index, 
UFS_CMD_COMP). Neither function should be called when aborting a command.

Thanks,

Bart.

Adrian Hunter Nov. 17, 2021, 7:37 a.m. UTC | #12

On 16/11/2021 23:53, Bart Van Assche wrote:
> On 11/16/21 12:16, Adrian Hunter wrote:
>> On 16/11/2021 18:08, Bart Van Assche wrote:
>>> On 11/16/21 1:07 AM, Peter Wang wrote:
>>>> Should we add unmap?
>>>
>>> Hi Peter,
>>>
>>> I will add DMA unmapping code in the abort handler.
>>
>> I would note that __ufshcd_transfer_req_compl() does that, as well as providing
>> a matching ufshcd_release() for the ufshcd_hold() in ufshcd_queuecommand(), so
>> do consider __ufshcd_transfer_req_compl().
>>
>> Using __ufshcd_transfer_req_compl() seems consistent with the error handler which
>> uses __ufshcd_transfer_req_compl() via ufshcd_complete_requests(), which will pick
>> up all the requests that the error handler has just aborted via
>> ufshcd_try_to_abort_task().  Also ufshcd_host_reset_and_restore() uses
>> __ufshcd_transfer_req_compl() via ufshcd_complete_requests(), which will pick
>> up anything still in outstanding_reqs because the doorbell has become zero at
>> that point.
> 
> Hi Adrian,
> 
> Although I agree with minimizing code duplication, I'm not sure that __ufshcd_transfer_req_compl() should be called from inside ufshcd_abort(). __ufshcd_transfer_req_compl() also calls ufshcd_update_monitor() and ufshcd_add_command_trace(hba, index, UFS_CMD_COMP). Neither function should be called when aborting a command.

But we should trace the abort I would have thought - seems like a separate issue though.

[08/11] scsi: ufs: Improve SCSI abort handling further

Commit Message

Comments

Patch