
[v3,3/5] nvme-fabrics: avoid double request completion for nvmf_fail_nonready_command

Message ID 20210121070330.19701-4-lengchao@huawei.com (mailing list archive)
State New, archived
Headers show
Series avoid double request completion and IO error

Commit Message

Chao Leng Jan. 21, 2021, 7:03 a.m. UTC
During reconnect, a request may be completed with NVME_SC_HOST_PATH_ERROR
in nvmf_fail_nonready_command. The request state is changed to
MQ_RQ_IN_FLIGHT before nvme_complete_rq is called. If the request is
freed asynchronously, such as in nvme_submit_user_cmd, then in an
extreme scenario the request may be completed again by the teardown
process. nvmf_fail_nonready_command does not need to call
blk_mq_start_request before completing the request; it should instead
set the request state to MQ_RQ_COMPLETE before completing it.

Signed-off-by: Chao Leng <lengchao@huawei.com>
---
 drivers/nvme/host/fabrics.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)
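
The nvme_complete_failed_req() helper used by this patch is introduced by
an earlier patch in the series and is not shown here. Based on the commit
message, a minimal sketch of what such a helper would do (the exact body
and placement are assumptions, including the use of a block-layer helper
such as blk_mq_set_request_complete() to set the request state) is:

/*
 * Sketch only: fail the request without starting it, so its state goes
 * straight to MQ_RQ_COMPLETE and a concurrent teardown that skips
 * already-completed requests will leave it alone.
 */
static inline void nvme_complete_failed_req(struct request *req)
{
	nvme_req(req)->status = NVME_SC_HOST_PATH_ERROR;
	blk_mq_set_request_complete(req);
	nvme_complete_rq(req);
}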

Comments

Hannes Reinecke Jan. 21, 2021, 8:58 a.m. UTC | #1
On 1/21/21 8:03 AM, Chao Leng wrote:
> During reconnect, a request may be completed with NVME_SC_HOST_PATH_ERROR
> in nvmf_fail_nonready_command. The request state is changed to
> MQ_RQ_IN_FLIGHT before nvme_complete_rq is called. If the request is
> freed asynchronously, such as in nvme_submit_user_cmd, then in an
> extreme scenario the request may be completed again by the teardown
> process. nvmf_fail_nonready_command does not need to call
> blk_mq_start_request before completing the request; it should instead
> set the request state to MQ_RQ_COMPLETE before completing it.
> 

So what you are saying is that there is a race condition between
blk_mq_start_request()
and
nvme_complete_request()

> Signed-off-by: Chao Leng <lengchao@huawei.com>
> ---
>   drivers/nvme/host/fabrics.c | 4 +---
>   1 file changed, 1 insertion(+), 3 deletions(-)
> 
> diff --git a/drivers/nvme/host/fabrics.c b/drivers/nvme/host/fabrics.c
> index 72ac00173500..874e4320e214 100644
> --- a/drivers/nvme/host/fabrics.c
> +++ b/drivers/nvme/host/fabrics.c
> @@ -553,9 +553,7 @@ blk_status_t nvmf_fail_nonready_command(struct nvme_ctrl *ctrl,
>   	    !blk_noretry_request(rq) && !(rq->cmd_flags & REQ_NVME_MPATH))
>   		return BLK_STS_RESOURCE;
>   
> -	nvme_req(rq)->status = NVME_SC_HOST_PATH_ERROR;
> -	blk_mq_start_request(rq);
> -	nvme_complete_rq(rq);
> +	nvme_complete_failed_req(rq);
>   	return BLK_STS_OK;
>   }
>   EXPORT_SYMBOL_GPL(nvmf_fail_nonready_command);

I'd rather have 'nvme_complete_failed_req()' accept the status as
argument, like

nvme_complete_failed_request(rq, NVME_SC_HOST_PATH_ERROR)

that way it's obvious what is happening, and the status isn't hidden in 
the function.
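
As a sketch (assuming the helper otherwise stays as proposed in the
earlier patch of this series), that would be something like:

static inline void nvme_complete_failed_req(struct request *req,
					    u16 status)
{
	nvme_req(req)->status = status;
	blk_mq_set_request_complete(req);
	nvme_complete_rq(req);
}

with the call site here becoming:

	nvme_complete_failed_req(rq, NVME_SC_HOST_PATH_ERROR);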

Cheers,

Hannes
Christoph Hellwig Jan. 21, 2021, 9 a.m. UTC | #2
On Thu, Jan 21, 2021 at 09:58:37AM +0100, Hannes Reinecke wrote:
> On 1/21/21 8:03 AM, Chao Leng wrote:
>> During reconnect, a request may be completed with NVME_SC_HOST_PATH_ERROR
>> in nvmf_fail_nonready_command. The request state is changed to
>> MQ_RQ_IN_FLIGHT before nvme_complete_rq is called. If the request is
>> freed asynchronously, such as in nvme_submit_user_cmd, then in an
>> extreme scenario the request may be completed again by the teardown
>> process. nvmf_fail_nonready_command does not need to call
>> blk_mq_start_request before completing the request; it should instead
>> set the request state to MQ_RQ_COMPLETE before completing it.
>>
>
> So what you are saying is that there is a race condition between
> blk_mq_start_request()
> and
> nvme_complete_request()

Between those two, a teardown that cancels all requests can come in.
Hannes Reinecke Jan. 21, 2021, 9:27 a.m. UTC | #3
On 1/21/21 10:00 AM, Christoph Hellwig wrote:
> On Thu, Jan 21, 2021 at 09:58:37AM +0100, Hannes Reinecke wrote:
>> On 1/21/21 8:03 AM, Chao Leng wrote:
>>> During reconnect, a request may be completed with NVME_SC_HOST_PATH_ERROR
>>> in nvmf_fail_nonready_command. The request state is changed to
>>> MQ_RQ_IN_FLIGHT before nvme_complete_rq is called. If the request is
>>> freed asynchronously, such as in nvme_submit_user_cmd, then in an
>>> extreme scenario the request may be completed again by the teardown
>>> process. nvmf_fail_nonready_command does not need to call
>>> blk_mq_start_request before completing the request; it should instead
>>> set the request state to MQ_RQ_COMPLETE before completing it.
>>>
>>
>> So what you are saying is that there is a race condition between
>> blk_mq_start_request()
>> and
>> nvme_complete_request()
> 
> Between those two, a teardown that cancels all requests can come in.
> 
Doesn't nvme_complete_request() insulate against a double completion?
I seem to remember we've gone to great lengths to ensure that.

And if this is just about setting the correct error code on completion 
I'd really prefer to stick with the current code. Moving that into a 
helper is fine, but I'd rather not introduce our own code modifying 
request state.

If there really is a race condition this feels like a more generic 
problem; calling blk_mq_start_request() followed by blk_mq_end_request() 
is a quite common pattern, and from my impression the recommended way.
So if there is an issue it would need to be addressed for all drivers, 
not just some nvme-specific way.
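
For illustration, the pattern in question is roughly what many ->queue_rq()
error paths do today (generic sketch, not taken from any particular driver;
device_is_ready() is a placeholder):

	blk_mq_start_request(rq);
	if (!device_is_ready(dev)) {		/* placeholder check */
		/* complete the already-started request with an error */
		blk_mq_end_request(rq, BLK_STS_IOERR);
		return BLK_STS_OK;
	}
	/* ... otherwise hand the request to the hardware ... */
	return BLK_STS_OK;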
Plus I'd like to have Jens' opinion here.

Cheers,

Hannes
Chao Leng Jan. 22, 2021, 1:48 a.m. UTC | #4
On 2021/1/21 16:58, Hannes Reinecke wrote:
> On 1/21/21 8:03 AM, Chao Leng wrote:
>> During reconnect, a request may be completed with NVME_SC_HOST_PATH_ERROR
>> in nvmf_fail_nonready_command. The request state is changed to
>> MQ_RQ_IN_FLIGHT before nvme_complete_rq is called. If the request is
>> freed asynchronously, such as in nvme_submit_user_cmd, then in an
>> extreme scenario the request may be completed again by the teardown
>> process. nvmf_fail_nonready_command does not need to call
>> blk_mq_start_request before completing the request; it should instead
>> set the request state to MQ_RQ_COMPLETE before completing it.
>>
> 
> So what you are saying is that there is a race condition between
> blk_mq_start_request()
> and
> nvme_complete_request()
Yes. The race is:
process 1: error recovery -> tear down -> quiesce queue (wait for
    dispatch to finish).
process 2: dispatch -> queue_rq -> nvmf_fail_nonready_command ->
    nvme_complete_rq (if the request is freed asynchronously, this wakes
    the waiter, nvme_submit_user_cmd for example, which does not get a
    chance to run yet).
process 1: continues -> cancels the outstanding requests; the request
    state is neither MQ_RQ_IDLE nor MQ_RQ_COMPLETE, so it completes
    (frees) the request.
process 3: nvme_submit_user_cmd now gets a chance to run and frees the
    request again.
Test injection method: insert an msleep before the call to
blk_mq_free_request in nvme_submit_user_cmd.
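Roughly (sketch only; the surrounding lines are approximate, not an exact
diff, and <linux/delay.h> is needed for msleep):

--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ static int nvme_submit_user_cmd(...)
+	/*
+	 * Test-only fault injection: widen the window between the request
+	 * being completed in queue_rq and being freed here, so that a
+	 * concurrent teardown can complete it first.
+	 */
+	msleep(1000);
 	blk_mq_free_request(req);
 	return ret;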
> 
>> Signed-off-by: Chao Leng <lengchao@huawei.com>
>> ---
>>   drivers/nvme/host/fabrics.c | 4 +---
>>   1 file changed, 1 insertion(+), 3 deletions(-)
>>
>> diff --git a/drivers/nvme/host/fabrics.c b/drivers/nvme/host/fabrics.c
>> index 72ac00173500..874e4320e214 100644
>> --- a/drivers/nvme/host/fabrics.c
>> +++ b/drivers/nvme/host/fabrics.c
>> @@ -553,9 +553,7 @@ blk_status_t nvmf_fail_nonready_command(struct nvme_ctrl *ctrl,
>>           !blk_noretry_request(rq) && !(rq->cmd_flags & REQ_NVME_MPATH))
>>           return BLK_STS_RESOURCE;
>> -    nvme_req(rq)->status = NVME_SC_HOST_PATH_ERROR;
>> -    blk_mq_start_request(rq);
>> -    nvme_complete_rq(rq);
>> +    nvme_complete_failed_req(rq);
>>       return BLK_STS_OK;
>>   }
>>   EXPORT_SYMBOL_GPL(nvmf_fail_nonready_command);
>>
> I'd rather have 'nvme_complete_failed_req()' accept the status as
> argument, like
> 
> nvme_complete_failed_request(rq, NVME_SC_HOST_PATH_ERROR)
> 
> that way it's obvious what is happening, and the status isn't hidden in the function.
Ok, good idea. Thank you for your suggestion.
> 
> Cheers,
> 
> Hannes
Chao Leng Jan. 22, 2021, 1:50 a.m. UTC | #5
On 2021/1/21 17:27, Hannes Reinecke wrote:
> On 1/21/21 10:00 AM, Christoph Hellwig wrote:
>> On Thu, Jan 21, 2021 at 09:58:37AM +0100, Hannes Reinecke wrote:
>>> On 1/21/21 8:03 AM, Chao Leng wrote:
>>>> During reconnect, a request may be completed with NVME_SC_HOST_PATH_ERROR
>>>> in nvmf_fail_nonready_command. The request state is changed to
>>>> MQ_RQ_IN_FLIGHT before nvme_complete_rq is called. If the request is
>>>> freed asynchronously, such as in nvme_submit_user_cmd, then in an
>>>> extreme scenario the request may be completed again by the teardown
>>>> process. nvmf_fail_nonready_command does not need to call
>>>> blk_mq_start_request before completing the request; it should instead
>>>> set the request state to MQ_RQ_COMPLETE before completing it.
>>>>
>>>
>>> So what you are saying is that there is a race condition between
>>> blk_mq_start_request()
>>> and
>>> nvme_complete_request()
>>
>> Between those two, a teardown that cancels all requests can come in.
>>
> Doesn't nvme_complete_request() insulate against a double completion?
nvme_complete_request cannot insulate against a double completion.
Setting the request state to MQ_RQ_COMPLETE is what avoids the double
completion: the teardown path (nvme_cancel_request) checks the request
state and skips the completion if the state is MQ_RQ_COMPLETE.
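Paraphrased (not an exact copy of the current nvme_cancel_request), that
check looks like:

bool nvme_cancel_request(struct request *req, void *data, bool reserved)
{
	/* don't abort a request that has already been completed */
	if (blk_mq_request_completed(req))
		return true;

	nvme_req(req)->status = NVME_SC_HOST_ABORTED_CMD;
	blk_mq_complete_request(req);
	return true;
}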
> I seem to remember we've gone to great lengths to ensure that.
> 
> And if this is just about setting the correct error code on completion I'd really prefer to stick with the current code. Moving that into a helper is fine, but I'd rather not introduce our own code modifying request state.
> 
> If there really is a race condition this feels like a more generic problem; calling blk_mq_start_request() followed by blk_mq_end_request() is a quite common pattern, and from my impression the recommended way.
> So if there is an issue it would need to be addressed for all drivers, not just some nvme-specific way.
Currently it is not safe for nvme, although the probability is very low.
I am not sure whether something similar can occur in other scenarios.
> Plus I'd like to have Jens' opinion here.
> 
> Cheers,
> 
> Hannes

Patch

diff --git a/drivers/nvme/host/fabrics.c b/drivers/nvme/host/fabrics.c
index 72ac00173500..874e4320e214 100644
--- a/drivers/nvme/host/fabrics.c
+++ b/drivers/nvme/host/fabrics.c
@@ -553,9 +553,7 @@  blk_status_t nvmf_fail_nonready_command(struct nvme_ctrl *ctrl,
 	    !blk_noretry_request(rq) && !(rq->cmd_flags & REQ_NVME_MPATH))
 		return BLK_STS_RESOURCE;
 
-	nvme_req(rq)->status = NVME_SC_HOST_PATH_ERROR;
-	blk_mq_start_request(rq);
-	nvme_complete_rq(rq);
+	nvme_complete_failed_req(rq);
 	return BLK_STS_OK;
 }
 EXPORT_SYMBOL_GPL(nvmf_fail_nonready_command);