blk-mq: fix corruption with direct issue

Message ID 1d359819-5410-7af2-d02b-f0ecca39d2c9@kernel.dk (mailing list archive)
State New, archived
Series blk-mq: fix corruption with direct issue

Commit Message

Jens Axboe Dec. 4, 2018, 10:47 p.m. UTC
If we attempt a direct issue to a SCSI device, and it returns BUSY, then
we queue the request up normally. However, the SCSI layer may have
already set up SG tables etc. for this particular command. If we later
merge with this request, the old tables are no longer valid. Once we
issue the IO, we only read/write the original part of the request, not
its new state.

This causes data corruption, and is most often noticed when the file
system complains that the just-read data is invalid:

[  235.934465] EXT4-fs error (device sda1): ext4_iget:4831: inode #7142: comm dpkg-query: bad extra_isize 24937 (inode size 256)

because most of it is garbage...

This doesn't happen from the normal issue path, as we will simply defer
the request to the hardware queue dispatch list if we fail. Once it's on
the dispatch list, we never merge with it.

Fix this in the direct issue path by flagging the request as
REQ_NOMERGE so we don't change its size before issue.

See also:
  https://bugzilla.kernel.org/show_bug.cgi?id=201685

Fixes: 6ce3dd6eec1 ("blk-mq: issue directly if hw queue isn't busy in case of 'none'")
Signed-off-by: Jens Axboe <axboe@kernel.dk>

---

Comments

Ming Lei Dec. 5, 2018, 1:37 a.m. UTC | #1
On Tue, Dec 04, 2018 at 03:47:46PM -0700, Jens Axboe wrote:
> If we attempt a direct issue to a SCSI device, and it returns BUSY, then
> we queue the request up normally. However, the SCSI layer may have
> already setup SG tables etc for this particular command. If we later
> merge with this request, then the old tables are no longer valid. Once
> we issue the IO, we only read/write the original part of the request,
> not the new state of it.
> 
> This causes data corruption, and is most often noticed with the file
> system complaining about the just read data being invalid:
> 
> [  235.934465] EXT4-fs error (device sda1): ext4_iget:4831: inode #7142: comm dpkg-query: bad extra_isize 24937 (inode size 256)
> 
> because most of it is garbage...
> 
> This doesn't happen from the normal issue path, as we will simply defer
> the request to the hardware queue dispatch list if we fail. Once it's on
> the dispatch list, we never merge with it.
> 
> Fix this from the direct issue path by flagging the request as
> REQ_NOMERGE so we don't change the size of it before issue.
> 
> See also:
>   https://bugzilla.kernel.org/show_bug.cgi?id=201685
> 
> Fixes: 6ce3dd6eec1 ("blk-mq: issue directly if hw queue isn't busy in case of 'none'")
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> 
> ---
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 3f91c6e5b17a..d8f518c6ea38 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1715,6 +1715,15 @@ static blk_status_t __blk_mq_issue_directly(struct blk_mq_hw_ctx *hctx,
>  		break;
>  	case BLK_STS_RESOURCE:
>  	case BLK_STS_DEV_RESOURCE:
> +		/*
> +		 * If direct dispatch fails, we cannot allow any merging on
> +		 * this IO. Drivers (like SCSI) may have set up permanent state
> +		 * for this request, like SG tables and mappings, and if we
> +		 * merge to it later on then we'll still only do IO to the
> +		 * original part.
> +		 */
> +		rq->cmd_flags |= REQ_NOMERGE;
> +
>  		blk_mq_update_dispatch_busy(hctx, true);
>  		__blk_mq_requeue_request(rq);
>  		break;
> 

I'm not sure it is enough to just mark it as NOMERGE. For example, the
driver may have set up .special_vec for a discard, and NOMERGE may not
prevent the request from entering the elevator queue completely, because
'rq.rb_node' and 'rq.special_vec' share the same space.
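
For reference, the overlap comes from an anonymous union in struct request;
a condensed excerpt from include/linux/blkdev.h of this era, trimmed to the
relevant members:

	/*
	 * The rb_node is only used inside the io scheduler, so it shares
	 * storage with fields that only matter after the request has left
	 * the scheduler. A driver that populated special_vec (e.g. for a
	 * discard) would see it clobbered if the request later entered an
	 * rbtree-based elevator.
	 */
	union {
		struct rb_node rb_node;		/* sort/lookup in elevator */
		struct bio_vec special_vec;	/* driver payload, e.g. discard */
		void *completion_data;
	};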

So how about inserting this request via blk_mq_request_bypass_insert()
when direct issue returns BUSY? Then it is an invariant that any request
queued via .queue_rq() never enters the scheduler queue.

--

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 3f91c6e5b17a..4b2db0b2909e 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1764,7 +1764,7 @@ static blk_status_t __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
 	if (bypass_insert)
 		return BLK_STS_RESOURCE;
 
-	blk_mq_sched_insert_request(rq, false, run_queue, false);
+	blk_mq_request_bypass_insert(rq, run_queue);
 	return BLK_STS_OK;
 }
 
@@ -1780,7 +1780,7 @@ static void blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
 
 	ret = __blk_mq_try_issue_directly(hctx, rq, cookie, false);
 	if (ret == BLK_STS_RESOURCE || ret == BLK_STS_DEV_RESOURCE)
-		blk_mq_sched_insert_request(rq, false, true, false);
+		blk_mq_request_bypass_insert(rq, true);
 	else if (ret != BLK_STS_OK)
 		blk_mq_end_request(rq, ret);
 


Thanks,
Ming
Guenter Roeck Dec. 5, 2018, 1:38 a.m. UTC | #2
On Tue, Dec 04, 2018 at 03:47:46PM -0700, Jens Axboe wrote:
> If we attempt a direct issue to a SCSI device, and it returns BUSY, then
> we queue the request up normally. However, the SCSI layer may have
> already setup SG tables etc for this particular command. If we later
> merge with this request, then the old tables are no longer valid. Once
> we issue the IO, we only read/write the original part of the request,
> not the new state of it.
> 
> This causes data corruption, and is most often noticed with the file
> system complaining about the just read data being invalid:
> 
> [  235.934465] EXT4-fs error (device sda1): ext4_iget:4831: inode #7142: comm dpkg-query: bad extra_isize 24937 (inode size 256)
> 
> because most of it is garbage...
> 
> This doesn't happen from the normal issue path, as we will simply defer
> the request to the hardware queue dispatch list if we fail. Once it's on
> the dispatch list, we never merge with it.
> 
> Fix this from the direct issue path by flagging the request as
> REQ_NOMERGE so we don't change the size of it before issue.
> 
> See also:
>   https://bugzilla.kernel.org/show_bug.cgi?id=201685
> 
> Fixes: 6ce3dd6eec1 ("blk-mq: issue directly if hw queue isn't busy in case of 'none'")
> Signed-off-by: Jens Axboe <axboe@kernel.dk>

Tested-by: Guenter Roeck <linux@roeck-us.net>

... on two systems affected by the problem.

> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 3f91c6e5b17a..d8f518c6ea38 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1715,6 +1715,15 @@ static blk_status_t __blk_mq_issue_directly(struct blk_mq_hw_ctx *hctx,
>  		break;
>  	case BLK_STS_RESOURCE:
>  	case BLK_STS_DEV_RESOURCE:
> +		/*
> +		 * If direct dispatch fails, we cannot allow any merging on
> +		 * this IO. Drivers (like SCSI) may have set up permanent state
> +		 * for this request, like SG tables and mappings, and if we
> +		 * merge to it later on then we'll still only do IO to the
> +		 * original part.
> +		 */
> +		rq->cmd_flags |= REQ_NOMERGE;
> +
>  		blk_mq_update_dispatch_busy(hctx, true);
>  		__blk_mq_requeue_request(rq);
>  		break;
Jens Axboe Dec. 5, 2018, 2:16 a.m. UTC | #3
On 12/4/18 6:37 PM, Ming Lei wrote:
> On Tue, Dec 04, 2018 at 03:47:46PM -0700, Jens Axboe wrote:
>> If we attempt a direct issue to a SCSI device, and it returns BUSY, then
>> we queue the request up normally. However, the SCSI layer may have
>> already setup SG tables etc for this particular command. If we later
>> merge with this request, then the old tables are no longer valid. Once
>> we issue the IO, we only read/write the original part of the request,
>> not the new state of it.
>>
>> This causes data corruption, and is most often noticed with the file
>> system complaining about the just read data being invalid:
>>
>> [  235.934465] EXT4-fs error (device sda1): ext4_iget:4831: inode #7142: comm dpkg-query: bad extra_isize 24937 (inode size 256)
>>
>> because most of it is garbage...
>>
>> This doesn't happen from the normal issue path, as we will simply defer
>> the request to the hardware queue dispatch list if we fail. Once it's on
>> the dispatch list, we never merge with it.
>>
>> Fix this from the direct issue path by flagging the request as
>> REQ_NOMERGE so we don't change the size of it before issue.
>>
>> See also:
>>   https://bugzilla.kernel.org/show_bug.cgi?id=201685
>>
>> Fixes: 6ce3dd6eec1 ("blk-mq: issue directly if hw queue isn't busy in case of 'none'")
>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>>
>> ---
>>
>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>> index 3f91c6e5b17a..d8f518c6ea38 100644
>> --- a/block/blk-mq.c
>> +++ b/block/blk-mq.c
>> @@ -1715,6 +1715,15 @@ static blk_status_t __blk_mq_issue_directly(struct blk_mq_hw_ctx *hctx,
>>  		break;
>>  	case BLK_STS_RESOURCE:
>>  	case BLK_STS_DEV_RESOURCE:
>> +		/*
>> +		 * If direct dispatch fails, we cannot allow any merging on
>> +		 * this IO. Drivers (like SCSI) may have set up permanent state
>> +		 * for this request, like SG tables and mappings, and if we
>> +		 * merge to it later on then we'll still only do IO to the
>> +		 * original part.
>> +		 */
>> +		rq->cmd_flags |= REQ_NOMERGE;
>> +
>>  		blk_mq_update_dispatch_busy(hctx, true);
>>  		__blk_mq_requeue_request(rq);
>>  		break;
>>
> 
> Not sure it is enough to just mark it as NOMERGE, for example, driver
> may have setup the .special_vec for discard, and NOMERGE may not prevent
> request from entering elevator queue completely. Cause 'rq.rb_node' and
> 'rq.special_vec' share same space.

We should rather limit the scope of direct dispatch instead. It
doesn't make sense to do it for anything but reads/writes anyway.

> So how about inserting this request via blk_mq_request_bypass_insert()
> in case that direct issue returns BUSY? Then it is invariant that
> any request queued via .queue_rq() won't enter scheduler queue.

I did consider this, but I didn't want to experiment with exercising
a new path for an important bug fix. You do realize that your original
patch has been corrupting data for months? I think a little caution
is in order here.
Jens Axboe Dec. 5, 2018, 2:23 a.m. UTC | #4
On 12/4/18 7:16 PM, Jens Axboe wrote:
> On 12/4/18 6:37 PM, Ming Lei wrote:
>> On Tue, Dec 04, 2018 at 03:47:46PM -0700, Jens Axboe wrote:
>>> If we attempt a direct issue to a SCSI device, and it returns BUSY, then
>>> we queue the request up normally. However, the SCSI layer may have
>>> already setup SG tables etc for this particular command. If we later
>>> merge with this request, then the old tables are no longer valid. Once
>>> we issue the IO, we only read/write the original part of the request,
>>> not the new state of it.
>>>
>>> This causes data corruption, and is most often noticed with the file
>>> system complaining about the just read data being invalid:
>>>
>>> [  235.934465] EXT4-fs error (device sda1): ext4_iget:4831: inode #7142: comm dpkg-query: bad extra_isize 24937 (inode size 256)
>>>
>>> because most of it is garbage...
>>>
>>> This doesn't happen from the normal issue path, as we will simply defer
>>> the request to the hardware queue dispatch list if we fail. Once it's on
>>> the dispatch list, we never merge with it.
>>>
>>> Fix this from the direct issue path by flagging the request as
>>> REQ_NOMERGE so we don't change the size of it before issue.
>>>
>>> See also:
>>>   https://bugzilla.kernel.org/show_bug.cgi?id=201685
>>>
>>> Fixes: 6ce3dd6eec1 ("blk-mq: issue directly if hw queue isn't busy in case of 'none'")
>>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>>>
>>> ---
>>>
>>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>>> index 3f91c6e5b17a..d8f518c6ea38 100644
>>> --- a/block/blk-mq.c
>>> +++ b/block/blk-mq.c
>>> @@ -1715,6 +1715,15 @@ static blk_status_t __blk_mq_issue_directly(struct blk_mq_hw_ctx *hctx,
>>>  		break;
>>>  	case BLK_STS_RESOURCE:
>>>  	case BLK_STS_DEV_RESOURCE:
>>> +		/*
>>> +		 * If direct dispatch fails, we cannot allow any merging on
>>> +		 * this IO. Drivers (like SCSI) may have set up permanent state
>>> +		 * for this request, like SG tables and mappings, and if we
>>> +		 * merge to it later on then we'll still only do IO to the
>>> +		 * original part.
>>> +		 */
>>> +		rq->cmd_flags |= REQ_NOMERGE;
>>> +
>>>  		blk_mq_update_dispatch_busy(hctx, true);
>>>  		__blk_mq_requeue_request(rq);
>>>  		break;
>>>
>>
>> Not sure it is enough to just mark it as NOMERGE, for example, driver
>> may have setup the .special_vec for discard, and NOMERGE may not prevent
>> request from entering elevator queue completely. Cause 'rq.rb_node' and
>> 'rq.special_vec' share same space.
> 
> We should rather limit the scope of the direct dispatch instead. It
> doesn't make sense to do for anything but read/write anyway.
> 
>> So how about inserting this request via blk_mq_request_bypass_insert()
>> in case that direct issue returns BUSY? Then it is invariant that
>> any request queued via .queue_rq() won't enter scheduler queue.
> 
> I did consider this, but I didn't want to experiment with exercising
> a new path for an important bug fix. You do realize that your original
> patch has been corrupting data for months? I think a little caution
> is in order here.

Here's a version that limits the scope further. And we seriously need to
clean up the direct issue paths; it's ridiculous.


diff --git a/block/blk-mq.c b/block/blk-mq.c
index 3f91c6e5b17a..3262d83b9e07 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1715,6 +1715,15 @@ static blk_status_t __blk_mq_issue_directly(struct blk_mq_hw_ctx *hctx,
 		break;
 	case BLK_STS_RESOURCE:
 	case BLK_STS_DEV_RESOURCE:
+		/*
+		 * If direct dispatch fails, we cannot allow any merging on
+		 * this IO. Drivers (like SCSI) may have set up permanent state
+		 * for this request, like SG tables and mappings, and if we
+		 * merge to it later on then we'll still only do IO to the
+		 * original part.
+		 */
+		rq->cmd_flags |= REQ_NOMERGE;
+
 		blk_mq_update_dispatch_busy(hctx, true);
 		__blk_mq_requeue_request(rq);
 		break;
@@ -1727,6 +1736,18 @@ static blk_status_t __blk_mq_issue_directly(struct blk_mq_hw_ctx *hctx,
 	return ret;
 }
 
+/*
+ * Don't allow direct dispatch of anything but regular reads/writes,
+ * as some of the other commands can potentially share request space
+ * with data we need for the IO scheduler. If we attempt a direct dispatch
+ * on those and fail, we can't safely add it to the scheduler afterwards
+ * without potentially overwriting data that the driver has already written.
+ */
+static bool blk_rq_can_direct_dispatch(struct request *rq)
+{
+	return req_op(rq) == REQ_OP_READ || req_op(rq) == REQ_OP_WRITE;
+}
+
 static blk_status_t __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
 						struct request *rq,
 						blk_qc_t *cookie,
@@ -1748,7 +1769,7 @@ static blk_status_t __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
 		goto insert;
 	}
 
-	if (q->elevator && !bypass_insert)
+	if (!blk_rq_can_direct_dispatch(rq) || (q->elevator && !bypass_insert))
 		goto insert;
 
 	if (!blk_mq_get_dispatch_budget(hctx))
@@ -1810,6 +1831,9 @@ void blk_mq_try_issue_list_directly(struct blk_mq_hw_ctx *hctx,
 		struct request *rq = list_first_entry(list, struct request,
 				queuelist);
 
+		if (!blk_rq_can_direct_dispatch(rq))
+			break;
+
 		list_del_init(&rq->queuelist);
 		ret = blk_mq_request_issue_directly(rq);
 		if (ret != BLK_STS_OK) {
Jens Axboe Dec. 5, 2018, 2:25 a.m. UTC | #5
On 12/4/18 6:38 PM, Guenter Roeck wrote:
> On Tue, Dec 04, 2018 at 03:47:46PM -0700, Jens Axboe wrote:
>> If we attempt a direct issue to a SCSI device, and it returns BUSY, then
>> we queue the request up normally. However, the SCSI layer may have
>> already setup SG tables etc for this particular command. If we later
>> merge with this request, then the old tables are no longer valid. Once
>> we issue the IO, we only read/write the original part of the request,
>> not the new state of it.
>>
>> This causes data corruption, and is most often noticed with the file
>> system complaining about the just read data being invalid:
>>
>> [  235.934465] EXT4-fs error (device sda1): ext4_iget:4831: inode #7142: comm dpkg-query: bad extra_isize 24937 (inode size 256)
>>
>> because most of it is garbage...
>>
>> This doesn't happen from the normal issue path, as we will simply defer
>> the request to the hardware queue dispatch list if we fail. Once it's on
>> the dispatch list, we never merge with it.
>>
>> Fix this from the direct issue path by flagging the request as
>> REQ_NOMERGE so we don't change the size of it before issue.
>>
>> See also:
>>   https://bugzilla.kernel.org/show_bug.cgi?id=201685
>>
>> Fixes: 6ce3dd6eec1 ("blk-mq: issue directly if hw queue isn't busy in case of 'none'")
>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> 
> Tested-by: Guenter Roeck <linux@roeck-us.net>
> 
> ... on two systems affected by the problem.

Thanks for testing! And for being persistent in reproducing and
providing clues for getting this nailed.
Ming Lei Dec. 5, 2018, 2:27 a.m. UTC | #6
On Tue, Dec 04, 2018 at 07:16:11PM -0700, Jens Axboe wrote:
> On 12/4/18 6:37 PM, Ming Lei wrote:
> > On Tue, Dec 04, 2018 at 03:47:46PM -0700, Jens Axboe wrote:
> >> If we attempt a direct issue to a SCSI device, and it returns BUSY, then
> >> we queue the request up normally. However, the SCSI layer may have
> >> already setup SG tables etc for this particular command. If we later
> >> merge with this request, then the old tables are no longer valid. Once
> >> we issue the IO, we only read/write the original part of the request,
> >> not the new state of it.
> >>
> >> This causes data corruption, and is most often noticed with the file
> >> system complaining about the just read data being invalid:
> >>
> >> [  235.934465] EXT4-fs error (device sda1): ext4_iget:4831: inode #7142: comm dpkg-query: bad extra_isize 24937 (inode size 256)
> >>
> >> because most of it is garbage...
> >>
> >> This doesn't happen from the normal issue path, as we will simply defer
> >> the request to the hardware queue dispatch list if we fail. Once it's on
> >> the dispatch list, we never merge with it.
> >>
> >> Fix this from the direct issue path by flagging the request as
> >> REQ_NOMERGE so we don't change the size of it before issue.
> >>
> >> See also:
> >>   https://bugzilla.kernel.org/show_bug.cgi?id=201685
> >>
> >> Fixes: 6ce3dd6eec1 ("blk-mq: issue directly if hw queue isn't busy in case of 'none'")
> >> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> >>
> >> ---
> >>
> >> diff --git a/block/blk-mq.c b/block/blk-mq.c
> >> index 3f91c6e5b17a..d8f518c6ea38 100644
> >> --- a/block/blk-mq.c
> >> +++ b/block/blk-mq.c
> >> @@ -1715,6 +1715,15 @@ static blk_status_t __blk_mq_issue_directly(struct blk_mq_hw_ctx *hctx,
> >>  		break;
> >>  	case BLK_STS_RESOURCE:
> >>  	case BLK_STS_DEV_RESOURCE:
> >> +		/*
> >> +		 * If direct dispatch fails, we cannot allow any merging on
> >> +		 * this IO. Drivers (like SCSI) may have set up permanent state
> >> +		 * for this request, like SG tables and mappings, and if we
> >> +		 * merge to it later on then we'll still only do IO to the
> >> +		 * original part.
> >> +		 */
> >> +		rq->cmd_flags |= REQ_NOMERGE;
> >> +
> >>  		blk_mq_update_dispatch_busy(hctx, true);
> >>  		__blk_mq_requeue_request(rq);
> >>  		break;
> >>
> > 
> > Not sure it is enough to just mark it as NOMERGE, for example, driver
> > may have setup the .special_vec for discard, and NOMERGE may not prevent
> > request from entering elevator queue completely. Cause 'rq.rb_node' and
> > 'rq.special_vec' share same space.
> 
> We should rather limit the scope of the direct dispatch instead. It
> doesn't make sense to do for anything but read/write anyway.

discard is a kind of write, and it isn't treated very specially in the
make_request path, except for multi-range discard.

> 
> > So how about inserting this request via blk_mq_request_bypass_insert()
> > in case that direct issue returns BUSY? Then it is invariant that
> > any request queued via .queue_rq() won't enter scheduler queue.
> 
> I did consider this, but I didn't want to experiment with exercising
> a new path for an important bug fix. You do realize that your original
> patch has been corrupting data for months? I think a little caution
> is in order here.

But marking NOMERGE may still leave a hole when a discard request is
re-inserted, as mentioned above.

Given that we never allowed re-inserting a queued request into the
scheduler queue except in the path added by 6ce3dd6eec1, I think it is
the correct thing to do, and the fix is simple too.

Thanks,
Ming
Jens Axboe Dec. 5, 2018, 2:30 a.m. UTC | #7
On 12/4/18 7:27 PM, Ming Lei wrote:
> On Tue, Dec 04, 2018 at 07:16:11PM -0700, Jens Axboe wrote:
>> On 12/4/18 6:37 PM, Ming Lei wrote:
>>> On Tue, Dec 04, 2018 at 03:47:46PM -0700, Jens Axboe wrote:
>>>> If we attempt a direct issue to a SCSI device, and it returns BUSY, then
>>>> we queue the request up normally. However, the SCSI layer may have
>>>> already setup SG tables etc for this particular command. If we later
>>>> merge with this request, then the old tables are no longer valid. Once
>>>> we issue the IO, we only read/write the original part of the request,
>>>> not the new state of it.
>>>>
>>>> This causes data corruption, and is most often noticed with the file
>>>> system complaining about the just read data being invalid:
>>>>
>>>> [  235.934465] EXT4-fs error (device sda1): ext4_iget:4831: inode #7142: comm dpkg-query: bad extra_isize 24937 (inode size 256)
>>>>
>>>> because most of it is garbage...
>>>>
>>>> This doesn't happen from the normal issue path, as we will simply defer
>>>> the request to the hardware queue dispatch list if we fail. Once it's on
>>>> the dispatch list, we never merge with it.
>>>>
>>>> Fix this from the direct issue path by flagging the request as
>>>> REQ_NOMERGE so we don't change the size of it before issue.
>>>>
>>>> See also:
>>>>   https://bugzilla.kernel.org/show_bug.cgi?id=201685
>>>>
>>>> Fixes: 6ce3dd6eec1 ("blk-mq: issue directly if hw queue isn't busy in case of 'none'")
>>>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>>>>
>>>> ---
>>>>
>>>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>>>> index 3f91c6e5b17a..d8f518c6ea38 100644
>>>> --- a/block/blk-mq.c
>>>> +++ b/block/blk-mq.c
>>>> @@ -1715,6 +1715,15 @@ static blk_status_t __blk_mq_issue_directly(struct blk_mq_hw_ctx *hctx,
>>>>  		break;
>>>>  	case BLK_STS_RESOURCE:
>>>>  	case BLK_STS_DEV_RESOURCE:
>>>> +		/*
>>>> +		 * If direct dispatch fails, we cannot allow any merging on
>>>> +		 * this IO. Drivers (like SCSI) may have set up permanent state
>>>> +		 * for this request, like SG tables and mappings, and if we
>>>> +		 * merge to it later on then we'll still only do IO to the
>>>> +		 * original part.
>>>> +		 */
>>>> +		rq->cmd_flags |= REQ_NOMERGE;
>>>> +
>>>>  		blk_mq_update_dispatch_busy(hctx, true);
>>>>  		__blk_mq_requeue_request(rq);
>>>>  		break;
>>>>
>>>
>>> Not sure it is enough to just mark it as NOMERGE, for example, driver
>>> may have setup the .special_vec for discard, and NOMERGE may not prevent
>>> request from entering elevator queue completely. Cause 'rq.rb_node' and
>>> 'rq.special_vec' share same space.
>>
>> We should rather limit the scope of the direct dispatch instead. It
>> doesn't make sense to do for anything but read/write anyway.
> 
> discard is kind of write, and it isn't treated very specially in make
> request path, except for multi-range discard.

The point of direct dispatch is to reduce latencies for requests.
Discards are so damn slow on ALL devices anyway that it doesn't make any
sense to try direct dispatch to begin with, regardless of whether it is
possible or not.

>>> So how about inserting this request via blk_mq_request_bypass_insert()
>>> in case that direct issue returns BUSY? Then it is invariant that
>>> any request queued via .queue_rq() won't enter scheduler queue.
>>
>> I did consider this, but I didn't want to experiment with exercising
>> a new path for an important bug fix. You do realize that your original
>> patch has been corrupting data for months? I think a little caution
>> is in order here.
> 
> But marking NOMERGE still may have a hole on re-insert discard request as
> mentioned above.

What I said was to further limit the scope of direct dispatch, which
means not allowing anything that isn't a read/write.

> Given we never allow to re-insert queued request to scheduler queue
> except for 6ce3dd6eec1, I think it is the correct thing to do, and the
> fix is simple too.

As I said, it's not the time to experiment. This issue has been there
since 4.19-rc1. The alternative is yanking both those patches, and then
revisiting this once the direct issue path has been cleaned up.
Ming Lei Dec. 5, 2018, 2:58 a.m. UTC | #8
On Tue, Dec 04, 2018 at 07:30:24PM -0700, Jens Axboe wrote:
> On 12/4/18 7:27 PM, Ming Lei wrote:
> > On Tue, Dec 04, 2018 at 07:16:11PM -0700, Jens Axboe wrote:
> >> On 12/4/18 6:37 PM, Ming Lei wrote:
> >>> On Tue, Dec 04, 2018 at 03:47:46PM -0700, Jens Axboe wrote:
> >>>> If we attempt a direct issue to a SCSI device, and it returns BUSY, then
> >>>> we queue the request up normally. However, the SCSI layer may have
> >>>> already setup SG tables etc for this particular command. If we later
> >>>> merge with this request, then the old tables are no longer valid. Once
> >>>> we issue the IO, we only read/write the original part of the request,
> >>>> not the new state of it.
> >>>>
> >>>> This causes data corruption, and is most often noticed with the file
> >>>> system complaining about the just read data being invalid:
> >>>>
> >>>> [  235.934465] EXT4-fs error (device sda1): ext4_iget:4831: inode #7142: comm dpkg-query: bad extra_isize 24937 (inode size 256)
> >>>>
> >>>> because most of it is garbage...
> >>>>
> >>>> This doesn't happen from the normal issue path, as we will simply defer
> >>>> the request to the hardware queue dispatch list if we fail. Once it's on
> >>>> the dispatch list, we never merge with it.
> >>>>
> >>>> Fix this from the direct issue path by flagging the request as
> >>>> REQ_NOMERGE so we don't change the size of it before issue.
> >>>>
> >>>> See also:
> >>>>   https://bugzilla.kernel.org/show_bug.cgi?id=201685
> >>>>
> >>>> Fixes: 6ce3dd6eec1 ("blk-mq: issue directly if hw queue isn't busy in case of 'none'")
> >>>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> >>>>
> >>>> ---
> >>>>
> >>>> diff --git a/block/blk-mq.c b/block/blk-mq.c
> >>>> index 3f91c6e5b17a..d8f518c6ea38 100644
> >>>> --- a/block/blk-mq.c
> >>>> +++ b/block/blk-mq.c
> >>>> @@ -1715,6 +1715,15 @@ static blk_status_t __blk_mq_issue_directly(struct blk_mq_hw_ctx *hctx,
> >>>>  		break;
> >>>>  	case BLK_STS_RESOURCE:
> >>>>  	case BLK_STS_DEV_RESOURCE:
> >>>> +		/*
> >>>> +		 * If direct dispatch fails, we cannot allow any merging on
> >>>> +		 * this IO. Drivers (like SCSI) may have set up permanent state
> >>>> +		 * for this request, like SG tables and mappings, and if we
> >>>> +		 * merge to it later on then we'll still only do IO to the
> >>>> +		 * original part.
> >>>> +		 */
> >>>> +		rq->cmd_flags |= REQ_NOMERGE;
> >>>> +
> >>>>  		blk_mq_update_dispatch_busy(hctx, true);
> >>>>  		__blk_mq_requeue_request(rq);
> >>>>  		break;
> >>>>
> >>>
> >>> Not sure it is enough to just mark it as NOMERGE, for example, driver
> >>> may have setup the .special_vec for discard, and NOMERGE may not prevent
> >>> request from entering elevator queue completely. Cause 'rq.rb_node' and
> >>> 'rq.special_vec' share same space.
> >>
> >> We should rather limit the scope of the direct dispatch instead. It
> >> doesn't make sense to do for anything but read/write anyway.
> > 
> > discard is kind of write, and it isn't treated very specially in make
> > request path, except for multi-range discard.
> 
> The point of direct dispatch is to reduce latencies for requests,
> discards are so damn slow on ALL devices anyway that it doesn't make any
> sense to try direct dispatch to begin with, regardless of whether it
> possible or not.

A SCSI MQ device may benefit from direct dispatch through reduced lock contention.

> 
> >>> So how about inserting this request via blk_mq_request_bypass_insert()
> >>> in case that direct issue returns BUSY? Then it is invariant that
> >>> any request queued via .queue_rq() won't enter scheduler queue.
> >>
> >> I did consider this, but I didn't want to experiment with exercising
> >> a new path for an important bug fix. You do realize that your original
> >> patch has been corrupting data for months? I think a little caution
> >> is in order here.
> > 
> > But marking NOMERGE still may have a hole on re-insert discard request as
> > mentioned above.
> 
> What I said was further limit the scope of direct dispatch, which means
> not allowing anything that isn't a read/write.

IMO, the conservative approach is to take the one used in the legacy IO
path, in which a queued request is never re-inserted into the scheduler
queue except via requeue, and RQF_DONTPREP is cleared before the request
is requeued to the scheduler.
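
A rough sketch of that legacy convention, loosely modeled on the old SCSI
requeue path (the helper name requeue_after_busy is made up here, and the
real code folds the flag clearing into scsi_unprep_request(); treat this as
illustrative, not a verbatim excerpt):

	/*
	 * Legacy-style requeue: tear down the driver's prep state and clear
	 * RQF_DONTPREP so the request is fully re-prepared on its next
	 * dispatch; stale SG tables can then never be reused.
	 */
	static void requeue_after_busy(struct request_queue *q, struct request *req)
	{
		scsi_unprep_request(req);	/* free SG tables and driver state */
		req->rq_flags &= ~RQF_DONTPREP;	/* force prep to run again */
		blk_requeue_request(q, req);	/* legacy path re-queue */
	}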

> 
> > Given we never allow to re-insert queued request to scheduler queue
> > except for 6ce3dd6eec1, I think it is the correct thing to do, and the
> > fix is simple too.
> 
> As I said, it's not the time to experiment. This issue has been there
> since 4.19-rc1. The alternative is yanking both those patches, and then
> looking at it later when the direct issue path has been cleaned up
> first.

The issue should have been there since v4.1, especially after commit
f984df1f0f7 ("blk-mq: do limited block plug for multiple queue case"),
which was the first to re-insert a queued request into the scheduler
queue.


Thanks,
Ming
Ming Lei Dec. 5, 2018, 3:03 a.m. UTC | #9
On Wed, Dec 05, 2018 at 10:58:02AM +0800, Ming Lei wrote:
> On Tue, Dec 04, 2018 at 07:30:24PM -0700, Jens Axboe wrote:
> > On 12/4/18 7:27 PM, Ming Lei wrote:
> > > On Tue, Dec 04, 2018 at 07:16:11PM -0700, Jens Axboe wrote:
> > >> On 12/4/18 6:37 PM, Ming Lei wrote:
> > >>> On Tue, Dec 04, 2018 at 03:47:46PM -0700, Jens Axboe wrote:
> > >>>> If we attempt a direct issue to a SCSI device, and it returns BUSY, then
> > >>>> we queue the request up normally. However, the SCSI layer may have
> > >>>> already setup SG tables etc for this particular command. If we later
> > >>>> merge with this request, then the old tables are no longer valid. Once
> > >>>> we issue the IO, we only read/write the original part of the request,
> > >>>> not the new state of it.
> > >>>>
> > >>>> This causes data corruption, and is most often noticed with the file
> > >>>> system complaining about the just read data being invalid:
> > >>>>
> > >>>> [  235.934465] EXT4-fs error (device sda1): ext4_iget:4831: inode #7142: comm dpkg-query: bad extra_isize 24937 (inode size 256)
> > >>>>
> > >>>> because most of it is garbage...
> > >>>>
> > >>>> This doesn't happen from the normal issue path, as we will simply defer
> > >>>> the request to the hardware queue dispatch list if we fail. Once it's on
> > >>>> the dispatch list, we never merge with it.
> > >>>>
> > >>>> Fix this from the direct issue path by flagging the request as
> > >>>> REQ_NOMERGE so we don't change the size of it before issue.
> > >>>>
> > >>>> See also:
> > >>>>   https://bugzilla.kernel.org/show_bug.cgi?id=201685
> > >>>>
> > >>>> Fixes: 6ce3dd6eec1 ("blk-mq: issue directly if hw queue isn't busy in case of 'none'")
> > >>>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> > >>>>
> > >>>> ---
> > >>>>
> > >>>> diff --git a/block/blk-mq.c b/block/blk-mq.c
> > >>>> index 3f91c6e5b17a..d8f518c6ea38 100644
> > >>>> --- a/block/blk-mq.c
> > >>>> +++ b/block/blk-mq.c
> > >>>> @@ -1715,6 +1715,15 @@ static blk_status_t __blk_mq_issue_directly(struct blk_mq_hw_ctx *hctx,
> > >>>>  		break;
> > >>>>  	case BLK_STS_RESOURCE:
> > >>>>  	case BLK_STS_DEV_RESOURCE:
> > >>>> +		/*
> > >>>> +		 * If direct dispatch fails, we cannot allow any merging on
> > >>>> +		 * this IO. Drivers (like SCSI) may have set up permanent state
> > >>>> +		 * for this request, like SG tables and mappings, and if we
> > >>>> +		 * merge to it later on then we'll still only do IO to the
> > >>>> +		 * original part.
> > >>>> +		 */
> > >>>> +		rq->cmd_flags |= REQ_NOMERGE;
> > >>>> +
> > >>>>  		blk_mq_update_dispatch_busy(hctx, true);
> > >>>>  		__blk_mq_requeue_request(rq);
> > >>>>  		break;
> > >>>>
> > >>>
> > >>> Not sure it is enough to just mark it as NOMERGE, for example, driver
> > >>> may have setup the .special_vec for discard, and NOMERGE may not prevent
> > >>> request from entering elevator queue completely. Cause 'rq.rb_node' and
> > >>> 'rq.special_vec' share same space.
> > >>
> > >> We should rather limit the scope of the direct dispatch instead. It
> > >> doesn't make sense to do for anything but read/write anyway.
> > > 
> > > discard is kind of write, and it isn't treated very specially in make
> > > request path, except for multi-range discard.
> > 
> > The point of direct dispatch is to reduce latencies for requests,
> > discards are so damn slow on ALL devices anyway that it doesn't make any
> > sense to try direct dispatch to begin with, regardless of whether it
> > possible or not.
> 
> SCSI MQ device may benefit from direct dispatch from reduced lock contention.
> 
> > 
> > >>> So how about inserting this request via blk_mq_request_bypass_insert()
> > >>> in case that direct issue returns BUSY? Then it is invariant that
> > >>> any request queued via .queue_rq() won't enter scheduler queue.
> > >>
> > >> I did consider this, but I didn't want to experiment with exercising
> > >> a new path for an important bug fix. You do realize that your original
> > >> patch has been corrupting data for months? I think a little caution
> > >> is in order here.
> > > 
> > > But marking NOMERGE still may have a hole on re-insert discard request as
> > > mentioned above.
> > 
> > What I said was further limit the scope of direct dispatch, which means
> > not allowing anything that isn't a read/write.
> 
> IMO, the conservative approach is to take the one used in legacy io
> path, in which it is never allowed to re-insert queued request to
> scheduler queue except for requeue, however RQF_DONTPREP is cleared
> before requeuing request to scheduler.
> 
> > 
> > > Given we never allow to re-insert queued request to scheduler queue
> > > except for 6ce3dd6eec1, I think it is the correct thing to do, and the
> > > fix is simple too.
> > 
> > As I said, it's not the time to experiment. This issue has been there
> > since 4.19-rc1. The alternative is yanking both those patches, and then
> > looking at it later when the direct issue path has been cleaned up
> > first.
> 
> The issue should have been there from v4.1, especially after commit
> f984df1f0f7 ("blk-mq: do limited block plug for multiple queue case"),
> which is the 1st one to re-insert the queued request into scheduler
> queue.

But at that time there wasn't an IO scheduler for MQ, so in theory the
issue has been there since v4.11, in particular since 945ffb60c11d
("mq-deadline: add blk-mq adaptation of the deadline IO scheduler").

Thanks,
Ming
Jens Axboe Dec. 5, 2018, 3:04 a.m. UTC | #10
On 12/4/18 7:58 PM, Ming Lei wrote:
> On Tue, Dec 04, 2018 at 07:30:24PM -0700, Jens Axboe wrote:
>> On 12/4/18 7:27 PM, Ming Lei wrote:
>>> On Tue, Dec 04, 2018 at 07:16:11PM -0700, Jens Axboe wrote:
>>>> On 12/4/18 6:37 PM, Ming Lei wrote:
>>>>> On Tue, Dec 04, 2018 at 03:47:46PM -0700, Jens Axboe wrote:
>>>>>> If we attempt a direct issue to a SCSI device, and it returns BUSY, then
>>>>>> we queue the request up normally. However, the SCSI layer may have
>>>>>> already setup SG tables etc for this particular command. If we later
>>>>>> merge with this request, then the old tables are no longer valid. Once
>>>>>> we issue the IO, we only read/write the original part of the request,
>>>>>> not the new state of it.
>>>>>>
>>>>>> This causes data corruption, and is most often noticed with the file
>>>>>> system complaining about the just read data being invalid:
>>>>>>
>>>>>> [  235.934465] EXT4-fs error (device sda1): ext4_iget:4831: inode #7142: comm dpkg-query: bad extra_isize 24937 (inode size 256)
>>>>>>
>>>>>> because most of it is garbage...
>>>>>>
>>>>>> This doesn't happen from the normal issue path, as we will simply defer
>>>>>> the request to the hardware queue dispatch list if we fail. Once it's on
>>>>>> the dispatch list, we never merge with it.
>>>>>>
>>>>>> Fix this from the direct issue path by flagging the request as
>>>>>> REQ_NOMERGE so we don't change the size of it before issue.
>>>>>>
>>>>>> See also:
>>>>>>   https://bugzilla.kernel.org/show_bug.cgi?id=201685
>>>>>>
>>>>>> Fixes: 6ce3dd6eec1 ("blk-mq: issue directly if hw queue isn't busy in case of 'none'")
>>>>>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>>>>>>
>>>>>> ---
>>>>>>
>>>>>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>>>>>> index 3f91c6e5b17a..d8f518c6ea38 100644
>>>>>> --- a/block/blk-mq.c
>>>>>> +++ b/block/blk-mq.c
>>>>>> @@ -1715,6 +1715,15 @@ static blk_status_t __blk_mq_issue_directly(struct blk_mq_hw_ctx *hctx,
>>>>>>  		break;
>>>>>>  	case BLK_STS_RESOURCE:
>>>>>>  	case BLK_STS_DEV_RESOURCE:
>>>>>> +		/*
>>>>>> +		 * If direct dispatch fails, we cannot allow any merging on
>>>>>> +		 * this IO. Drivers (like SCSI) may have set up permanent state
>>>>>> +		 * for this request, like SG tables and mappings, and if we
>>>>>> +		 * merge to it later on then we'll still only do IO to the
>>>>>> +		 * original part.
>>>>>> +		 */
>>>>>> +		rq->cmd_flags |= REQ_NOMERGE;
>>>>>> +
>>>>>>  		blk_mq_update_dispatch_busy(hctx, true);
>>>>>>  		__blk_mq_requeue_request(rq);
>>>>>>  		break;
>>>>>>
>>>>>
>>>>> Not sure it is enough to just mark it as NOMERGE, for example, driver
>>>>> may have setup the .special_vec for discard, and NOMERGE may not prevent
>>>>> request from entering elevator queue completely. Cause 'rq.rb_node' and
>>>>> 'rq.special_vec' share same space.
>>>>
>>>> We should rather limit the scope of the direct dispatch instead. It
>>>> doesn't make sense to do for anything but read/write anyway.
>>>
>>> discard is kind of write, and it isn't treated very specially in make
>>> request path, except for multi-range discard.
>>
>> The point of direct dispatch is to reduce latencies for requests,
>> discards are so damn slow on ALL devices anyway that it doesn't make any
>> sense to try direct dispatch to begin with, regardless of whether it
>> possible or not.
> 
> SCSI MQ device may benefit from direct dispatch from reduced lock contention.
> 
>>
>>>>> So how about inserting this request via blk_mq_request_bypass_insert()
>>>>> in case that direct issue returns BUSY? Then it is invariant that
>>>>> any request queued via .queue_rq() won't enter scheduler queue.
>>>>
>>>> I did consider this, but I didn't want to experiment with exercising
>>>> a new path for an important bug fix. You do realize that your original
>>>> patch has been corrupting data for months? I think a little caution
>>>> is in order here.
>>>
>>> But marking NOMERGE still may have a hole on re-insert discard request as
>>> mentioned above.
>>
>> What I said was further limit the scope of direct dispatch, which means
>> not allowing anything that isn't a read/write.
> 
> IMO, the conservative approach is to take the one used in legacy io
> path, in which it is never allowed to re-insert queued request to
> scheduler queue except for requeue, however RQF_DONTPREP is cleared
> before requeuing request to scheduler.

I don't agree. Whether you agree or not, I'm doing the simple fix for
4.20 and queueing it up for 4.21 as well. Then I hope Jianchao can
rework his direct dispatch cleanup patches for 4.21 on top of that,
and then we can experiment on top of that with other solutions.

>>> Given we never allow to re-insert queued request to scheduler queue
>>> except for 6ce3dd6eec1, I think it is the correct thing to do, and the
>>> fix is simple too.
>>
>> As I said, it's not the time to experiment. This issue has been there
>> since 4.19-rc1. The alternative is yanking both those patches, and then
>> looking at it later when the direct issue path has been cleaned up
>> first.
> 
> The issue should have been there from v4.1, especially after commit
> f984df1f0f7 ("blk-mq: do limited block plug for multiple queue case"),
> which is the 1st one to re-insert the queued request into scheduler
> queue.

You can hand-wave all you want; the fact is that people have been
hitting this left and right since 4.19 was released, and it was even
bisected down to the two specific direct issue patches you did. This
isn't some theoretical debate. You've seen the bugzilla; I suggest you
go read the whole thing.
Jens Axboe Dec. 5, 2018, 3:05 a.m. UTC | #11
On 12/4/18 8:03 PM, Ming Lei wrote:
> On Wed, Dec 05, 2018 at 10:58:02AM +0800, Ming Lei wrote:
>> On Tue, Dec 04, 2018 at 07:30:24PM -0700, Jens Axboe wrote:
>>> On 12/4/18 7:27 PM, Ming Lei wrote:
>>>> On Tue, Dec 04, 2018 at 07:16:11PM -0700, Jens Axboe wrote:
>>>>> On 12/4/18 6:37 PM, Ming Lei wrote:
>>>>>> On Tue, Dec 04, 2018 at 03:47:46PM -0700, Jens Axboe wrote:
>>>>>>> If we attempt a direct issue to a SCSI device, and it returns BUSY, then
>>>>>>> we queue the request up normally. However, the SCSI layer may have
>>>>>>> already setup SG tables etc for this particular command. If we later
>>>>>>> merge with this request, then the old tables are no longer valid. Once
>>>>>>> we issue the IO, we only read/write the original part of the request,
>>>>>>> not the new state of it.
>>>>>>>
>>>>>>> This causes data corruption, and is most often noticed with the file
>>>>>>> system complaining about the just read data being invalid:
>>>>>>>
>>>>>>> [  235.934465] EXT4-fs error (device sda1): ext4_iget:4831: inode #7142: comm dpkg-query: bad extra_isize 24937 (inode size 256)
>>>>>>>
>>>>>>> because most of it is garbage...
>>>>>>>
>>>>>>> This doesn't happen from the normal issue path, as we will simply defer
>>>>>>> the request to the hardware queue dispatch list if we fail. Once it's on
>>>>>>> the dispatch list, we never merge with it.
>>>>>>>
>>>>>>> Fix this from the direct issue path by flagging the request as
>>>>>>> REQ_NOMERGE so we don't change the size of it before issue.
>>>>>>>
>>>>>>> See also:
>>>>>>>   https://bugzilla.kernel.org/show_bug.cgi?id=201685
>>>>>>>
>>>>>>> Fixes: 6ce3dd6eec1 ("blk-mq: issue directly if hw queue isn't busy in case of 'none'")
>>>>>>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>>>>>>>
>>>>>>> ---
>>>>>>>
>>>>>>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>>>>>>> index 3f91c6e5b17a..d8f518c6ea38 100644
>>>>>>> --- a/block/blk-mq.c
>>>>>>> +++ b/block/blk-mq.c
>>>>>>> @@ -1715,6 +1715,15 @@ static blk_status_t __blk_mq_issue_directly(struct blk_mq_hw_ctx *hctx,
>>>>>>>  		break;
>>>>>>>  	case BLK_STS_RESOURCE:
>>>>>>>  	case BLK_STS_DEV_RESOURCE:
>>>>>>> +		/*
>>>>>>> +		 * If direct dispatch fails, we cannot allow any merging on
>>>>>>> +		 * this IO. Drivers (like SCSI) may have set up permanent state
>>>>>>> +		 * for this request, like SG tables and mappings, and if we
>>>>>>> +		 * merge to it later on then we'll still only do IO to the
>>>>>>> +		 * original part.
>>>>>>> +		 */
>>>>>>> +		rq->cmd_flags |= REQ_NOMERGE;
>>>>>>> +
>>>>>>>  		blk_mq_update_dispatch_busy(hctx, true);
>>>>>>>  		__blk_mq_requeue_request(rq);
>>>>>>>  		break;
>>>>>>>
>>>>>>
>>>>>> Not sure it is enough to just mark it as NOMERGE, for example, driver
>>>>>> may have setup the .special_vec for discard, and NOMERGE may not prevent
>>>>>> request from entering elevator queue completely. Cause 'rq.rb_node' and
>>>>>> 'rq.special_vec' share same space.
>>>>>
>>>>> We should rather limit the scope of the direct dispatch instead. It
>>>>> doesn't make sense to do for anything but read/write anyway.
>>>>
>>>> discard is kind of write, and it isn't treated very specially in make
>>>> request path, except for multi-range discard.
>>>
>>> The point of direct dispatch is to reduce latencies for requests,
>>> discards are so damn slow on ALL devices anyway that it doesn't make any
>>> sense to try direct dispatch to begin with, regardless of whether it
>>> possible or not.
>>
>> SCSI MQ device may benefit from direct dispatch from reduced lock contention.
>>
>>>
>>>>>> So how about inserting this request via blk_mq_request_bypass_insert()
>>>>>> in case that direct issue returns BUSY? Then it is invariant that
>>>>>> any request queued via .queue_rq() won't enter scheduler queue.
>>>>>
>>>>> I did consider this, but I didn't want to experiment with exercising
>>>>> a new path for an important bug fix. You do realize that your original
>>>>> patch has been corrupting data for months? I think a little caution
>>>>> is in order here.
>>>>
>>>> But marking NOMERGE still may have a hole on re-insert discard request as
>>>> mentioned above.
>>>
>>> What I said was further limit the scope of direct dispatch, which means
>>> not allowing anything that isn't a read/write.
>>
>> IMO, the conservative approach is to take the one used in legacy io
>> path, in which it is never allowed to re-insert queued request to
>> scheduler queue except for requeue, however RQF_DONTPREP is cleared
>> before requeuing request to scheduler.
>>
>>>
>>>> Given we never allow to re-insert queued request to scheduler queue
>>>> except for 6ce3dd6eec1, I think it is the correct thing to do, and the
>>>> fix is simple too.
>>>
>>> As I said, it's not the time to experiment. This issue has been there
>>> since 4.19-rc1. The alternative is yanking both those patches, and then
>>> looking at it later when the direct issue path has been cleaned up
>>> first.
>>
>> The issue should have been there from v4.1, especially after commit
>> f984df1f0f7 ("blk-mq: do limited block plug for multiple queue case"),
>> which is the 1st one to re-insert the queued request into scheduler
>> queue.
> 
> But at that time, there isn't io scheduler for MQ, so in theory the
> issue should be there since v4.11, especially 945ffb60c11d ("mq-deadline:
> add blk-mq adaptation of the deadline IO scheduler").

Ming, I'm getting really tired of this. As mentioned in the other email,
we're not having a theoretical or hypothetical debate here. The facts
are on the table; there's no point in trying to shift blame. We need
to deal with the current situation.
Christoph Hellwig Dec. 5, 2018, 2:41 p.m. UTC | #12
On Tue, Dec 04, 2018 at 03:47:46PM -0700, Jens Axboe wrote:
> If we attempt a direct issue to a SCSI device, and it returns BUSY, then
> we queue the request up normally. However, the SCSI layer may have
> already setup SG tables etc for this particular command. If we later
> merge with this request, then the old tables are no longer valid. Once
> we issue the IO, we only read/write the original part of the request,
> not the new state of it.
> 
> This causes data corruption, and is most often noticed with the file
> system complaining about the just read data being invalid:

Looks good as the first quick fix:

Reviewed-by: Christoph Hellwig <hch@lst.de>

But as we discussed, I think we need to be even more careful, including
the read/write-only check below in the thread or some higher-level
approach.
Jens Axboe Dec. 5, 2018, 3:15 p.m. UTC | #13
On 12/5/18 7:41 AM, Christoph Hellwig wrote:
> On Tue, Dec 04, 2018 at 03:47:46PM -0700, Jens Axboe wrote:
>> If we attempt a direct issue to a SCSI device, and it returns BUSY, then
>> we queue the request up normally. However, the SCSI layer may have
>> already setup SG tables etc for this particular command. If we later
>> merge with this request, then the old tables are no longer valid. Once
>> we issue the IO, we only read/write the original part of the request,
>> not the new state of it.
>>
>> This causes data corruption, and is most often noticed with the file
>> system complaining about the just read data being invalid:
> 
> Looks good as the first quick fix:
> 
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> 
> But as we discussed I think we need to be even more careful. including
> the only read and write check below in the thread or some higher level
> approach.

I agree, I'm reviewing the patches from Jianchao now and will do
something cleaner on top of that.
Guenter Roeck Dec. 5, 2018, 5:55 p.m. UTC | #14
On Tue, Dec 04, 2018 at 07:25:05PM -0700, Jens Axboe wrote:
> On 12/4/18 6:38 PM, Guenter Roeck wrote:
> > On Tue, Dec 04, 2018 at 03:47:46PM -0700, Jens Axboe wrote:
> >> If we attempt a direct issue to a SCSI device, and it returns BUSY, then
> >> we queue the request up normally. However, the SCSI layer may have
> >> already setup SG tables etc for this particular command. If we later
> >> merge with this request, then the old tables are no longer valid. Once
> >> we issue the IO, we only read/write the original part of the request,
> >> not the new state of it.
> >>
> >> This causes data corruption, and is most often noticed with the file
> >> system complaining about the just read data being invalid:
> >>
> >> [  235.934465] EXT4-fs error (device sda1): ext4_iget:4831: inode #7142: comm dpkg-query: bad extra_isize 24937 (inode size 256)
> >>
> >> because most of it is garbage...
> >>
> >> This doesn't happen from the normal issue path, as we will simply defer
> >> the request to the hardware queue dispatch list if we fail. Once it's on
> >> the dispatch list, we never merge with it.
> >>
> >> Fix this from the direct issue path by flagging the request as
> >> REQ_NOMERGE so we don't change the size of it before issue.
> >>
> >> See also:
> >>   https://bugzilla.kernel.org/show_bug.cgi?id=201685
> >>
> >> Fixes: 6ce3dd6eec1 ("blk-mq: issue directly if hw queue isn't busy in case of 'none'")
> >> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> > 
> > Tested-by: Guenter Roeck <linux@roeck-us.net>
> > 
> > ... on two systems affected by the problem.
> 
> Thanks for testing! And for being persistent in reproducing and
> providing clues for getting this nailed.
> 

My pleasure.

I see that there is some discussion about this patch.

Unfortunately, everyone running a 4.19 or later kernel is at serious
risk of data corruption. Given that, if this patch doesn't make it
upstream for one reason or another, would it be possible to at least
revert the two patches introducing the problem until this is sorted
out for good? If this is not acceptable either, maybe mark blk-mq
as broken? After all, it _is_ broken. This is even more true if it
turns out that a problem may exist since 4.1, as suggested in the
discussion.

Also, it seems to me that even with this problem fixed, blk-mq may not
be ready for primetime after all. With that in mind, maybe commit
d5038a13eca72 ("scsi: core: switch to scsi-mq by default") was a
bit premature. Should that be reverted?

Thanks,
Guenter
Jens Axboe Dec. 5, 2018, 5:59 p.m. UTC | #15
On 12/5/18 10:55 AM, Guenter Roeck wrote:
> On Tue, Dec 04, 2018 at 07:25:05PM -0700, Jens Axboe wrote:
>> On 12/4/18 6:38 PM, Guenter Roeck wrote:
>>> On Tue, Dec 04, 2018 at 03:47:46PM -0700, Jens Axboe wrote:
>>>> If we attempt a direct issue to a SCSI device, and it returns BUSY, then
>>>> we queue the request up normally. However, the SCSI layer may have
>>>> already setup SG tables etc for this particular command. If we later
>>>> merge with this request, then the old tables are no longer valid. Once
>>>> we issue the IO, we only read/write the original part of the request,
>>>> not the new state of it.
>>>>
>>>> This causes data corruption, and is most often noticed with the file
>>>> system complaining about the just read data being invalid:
>>>>
>>>> [  235.934465] EXT4-fs error (device sda1): ext4_iget:4831: inode #7142: comm dpkg-query: bad extra_isize 24937 (inode size 256)
>>>>
>>>> because most of it is garbage...
>>>>
>>>> This doesn't happen from the normal issue path, as we will simply defer
>>>> the request to the hardware queue dispatch list if we fail. Once it's on
>>>> the dispatch list, we never merge with it.
>>>>
>>>> Fix this from the direct issue path by flagging the request as
>>>> REQ_NOMERGE so we don't change the size of it before issue.
>>>>
>>>> See also:
>>>>   https://bugzilla.kernel.org/show_bug.cgi?id=201685
>>>>
>>>> Fixes: 6ce3dd6eec1 ("blk-mq: issue directly if hw queue isn't busy in case of 'none'")
>>>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>>>
>>> Tested-by: Guenter Roeck <linux@roeck-us.net>
>>>
>>> ... on two systems affected by the problem.
>>
>> Thanks for testing! And for being persistent in reproducing and
>> providing clues for getting this nailed.
>>
> 
> My pleasure.
> 
> I see that there is some discussion about this patch.
> 
> Unfortunately, everyone running a 4.19 or later kernel is at serious
> risk of data corruption. Given that, if this patch doesn't make it
> upstream for one reason or another, would it be possible to at least
> revert the two patches introducing the problem until this is sorted
> out for good ? If this is not acceptable either, maybe mark blk-mq
> as broken ? After all, it _is_ broken. This is even more true if it
> turns out that a problem may exist since 4.1, as suggested in the
> discussion.

It is queued up; it'll go upstream later today.

> Also, it seems to me that even with this problem fixed, blk-mq may not
> be ready for primetime after all. With that in mind, maybe commit
> d5038a13eca72 ("scsi: core: switch to scsi-mq by default") was a
> bit premature. Should that be reverted ?

I have to strongly disagree with that; the timing is just unfortunate.
There are literally millions of machines running blk-mq/scsi-mq, and
this is the only hiccup we've had. So I want to put this one to rest
once and for all; there's absolutely no reason not to continue with
what we've planned.
Guenter Roeck Dec. 5, 2018, 7:09 p.m. UTC | #16
On Wed, Dec 05, 2018 at 10:59:21AM -0700, Jens Axboe wrote:
[ ... ]
> 
> > Also, it seems to me that even with this problem fixed, blk-mq may not
> > be ready for primetime after all. With that in mind, maybe commit
> > d5038a13eca72 ("scsi: core: switch to scsi-mq by default") was a
> > bit premature. Should that be reverted ?
> 
> I have to strongly disagree with that, the timing is just unfortunate.
> There are literally millions of machines running blk-mq/scsi-mq, and
> this is the only hickup we've had. So I want to put this one to rest
> once and for all, there's absolutely no reason not to continue with
> what we've planned.
> 

Guess we have to agree to disagree. In my opinion, for infrastructure as
critical as this, a single hiccup is one hiccup too many. Not that I would
describe this as a hiccup in the first place; I would describe it as a
major disaster. I also don't think the timing is unfortunate. If anything,
it proves my point.

Thanks,
Guenter
Jens Axboe Dec. 5, 2018, 8:11 p.m. UTC | #17
On 12/5/18 12:09 PM, Guenter Roeck wrote:
> On Wed, Dec 05, 2018 at 10:59:21AM -0700, Jens Axboe wrote:
> [ ... ]
>>
>>> Also, it seems to me that even with this problem fixed, blk-mq may not
>>> be ready for primetime after all. With that in mind, maybe commit
>>> d5038a13eca72 ("scsi: core: switch to scsi-mq by default") was a
>>> bit premature. Should that be reverted ?
>>
>> I have to strongly disagree with that, the timing is just unfortunate.
>> There are literally millions of machines running blk-mq/scsi-mq, and
>> this is the only hickup we've had. So I want to put this one to rest
>> once and for all, there's absolutely no reason not to continue with
>> what we've planned.
>>
> 
> Guess we have to agree to disagree. In my opinion, for infrastructure
> as critical as this, a single hickup is one hickup too many. Not that
> I would describe this as hickup in the first place; I would describe
> it as major disaster.

Don't get me wrong, I don't mean to use hiccup in a diminishing fashion;
this was by all means a disaster for the ones hit by it. But if you look
at the scope of how many folks are using blk-mq/scsi-mq and have been
for years, we're really talking about a tiny tiny percentage here. This
could just as easily have happened with the old IO stack. The bug was a
freak accident, and even with full knowledge of why it happened, I'm
still having an extraordinarily hard time triggering it at will on my
test boxes. As with any disaster, it's usually a combination of multiple
things that go wrong, and this one is no different. The folks that hit
this generally hit it pretty easily, and (by far) the majority would
never hit it.

Bugs happen, whether you like it or not. They happen in file systems,
memory management, and they happen in storage. Things are continually
developed, and that sometimes introduces bugs. We do our best to ensure
that doesn't happen, but sometimes freak accidents like this happen. I
think my track record of decades of work speaks for itself there, it's
not like this is a frequent occurence. And if this particular issue
wasn't well understood and instead just reverted the offending commits,
then I would agree with you. But that's not the case. I'm very confident
in the stability, among other things, of blk-mq and the drivers that
utilize it. Most of the storage drivers are using it today, and have
been for a long time.
Theodore Ts'o Dec. 7, 2018, 2:46 a.m. UTC | #18
On Wed, Dec 05, 2018 at 11:03:01AM +0800, Ming Lei wrote:
> 
> But at that time, there wasn't an IO scheduler for MQ, so in theory the
> issue should have been there since v4.11, especially since 945ffb60c11d
> ("mq-deadline: add blk-mq adaptation of the deadline IO scheduler").

Hi Ming,

How serious were you about this being (theoretically) an issue since
4.11?  Can you talk about how it might get triggered, and how we can
test for it?  The reason I ask is that we're trying to track down a
mysterious file system corruption problem on a 4.14.x stable kernel.
The symptoms are *very* eerily similar to kernel bugzilla #201685.

The problem is that it's super-rare --- roughly once a week out of a
population of about 2500 systems.  The workload is NFS serving.
Unfortunately, since 4.14.63 we can no longer disable blk-mq for the
virtio-scsi driver, thanks to commit b5b6e8c8d3b4 ("scsi: virtio_scsi:
fix IO hang caused by automatic irq vector affinity") getting
backported into 4.14.63 as commit 70b522f163bbb32.
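
(For context: on kernels where the legacy path is still selectable, the
usual knob is the scsi_mod.use_blk_mq parameter, e.g. on the kernel
command line:

	scsi_mod.use_blk_mq=0

but if I read b5b6e8c8d3b4 correctly, virtio-scsi now forces blk-mq
regardless of that setting, so the knob no longer helps us here.)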

We're considering reverting this patch in our 4.14 LTS kernel, and
seeing whether it makes the problem go away.  Is there anything else
you might suggest?

Thanks,

						- Ted

P.S.  Unlike the repros that users were seeing in #201685, we *did*
have an I/O scheduler enabled --- it was mq-deadline.  But right now,
given your comments and the corruptions that we're seeing, I'm not
feeling very warm and fuzzy about blk-mq.  :-( :-( :-(
Jens Axboe Dec. 7, 2018, 3:04 a.m. UTC | #19
On 12/6/18 7:46 PM, Theodore Y. Ts'o wrote:
> On Wed, Dec 05, 2018 at 11:03:01AM +0800, Ming Lei wrote:
>>
>> But at that time, there wasn't an IO scheduler for MQ, so in theory the
>> issue should have been there since v4.11, especially since 945ffb60c11d
>> ("mq-deadline: add blk-mq adaptation of the deadline IO scheduler").
> 
> Hi Ming,
> 
> How serious were you about this being (theoretically) an issue since
> 4.11?  Can you talk about how it might get triggered, and how we can
> test for it?  The reason I ask is that we're trying to track down a
> mysterious file system corruption problem on a 4.14.x stable kernel.
> The symptoms are *very* eerily similar to kernel bugzilla #201685.
> 
> The problem is that it's super-rare --- roughly once a week out of a
> population of about 2500 systems.  The workload is NFS serving.
> Unfortunately, since 4.14.63 we can no longer disable blk-mq for the
> virtio-scsi driver, thanks to commit b5b6e8c8d3b4 ("scsi: virtio_scsi:
> fix IO hang caused by automatic irq vector affinity") getting
> backported into 4.14.63 as commit 70b522f163bbb32.
> 
> We're considering reverting this patch in our 4.14 LTS kernel, and
> seeing whether it makes the problem go away.  Is there anything else
> you might suggest?

We should just make SCSI do the right thing, which is to unprep if it
sees BUSY and prep again the next time. Otherwise I fear the direct
dispatch isn't going to be super useful, if a failed direct dispatch
prevents future merging.

This would be a lot less error-prone for other cases as well.
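
Roughly, what I have in mind (just a sketch; the helper below is
hypothetical, the real change would live in the scsi_queue_rq() BUSY
handling):

	/*
	 * Hypothetical helper, not actual kernel code: undo the prep work
	 * when the LLD reports BUSY, so the request is re-prepped from
	 * scratch (fresh SG tables/mappings) on the next dispatch, even
	 * if it grew through a merge in the meantime.
	 */
	static void scsi_mq_unprep_on_busy(struct request *req,
					   struct scsi_cmnd *cmd)
	{
		scsi_mq_uninit_cmd(cmd);	/* tear down SG tables etc */
		req->rq_flags &= ~RQF_DONTPREP;	/* force a fresh prep */
	}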
Ming Lei Dec. 7, 2018, 3:44 a.m. UTC | #20
On Thu, Dec 06, 2018 at 09:46:42PM -0500, Theodore Y. Ts'o wrote:
> On Wed, Dec 05, 2018 at 11:03:01AM +0800, Ming Lei wrote:
> > 
> > But at that time, there wasn't an IO scheduler for MQ, so in theory the
> > issue should have been there since v4.11, especially since 945ffb60c11d
> > ("mq-deadline: add blk-mq adaptation of the deadline IO scheduler").
> 
> Hi Ming,
> 
> How serious were you about this being (theoretically) an issue since
> 4.11?  Can you talk about how it might get triggered, and how we can
> test for it?  The reason I ask is that we're trying to track down a
> mysterious file system corruption problem on a 4.14.x stable kernel.
> The symptoms are *very* eerily similar to kernel bugzilla #201685.

Hi Theodore,

It is just theoretical analysis.

blk_mq_try_issue_directly() is called in two branches of
blk_mq_make_request(), both only on real MQ disks.

IO merge can be done with either the 'none' or a real IO scheduler, so
in theory the risk might have existed since v4.11, but IO merge on the
sw queue didn't work for quite a while, until it was fixed by
ab42f35d9cb5ac49b5a2.
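
To illustrate the window, a simplified sketch of the pre-fix
direct-issue path (not the literal kernel code, just the shape of the
hazard):

	ret = q->mq_ops->queue_rq(hctx, &bd);	/* SCSI preps the command:
						   SG tables sized for rq
						   as it is *now* */
	switch (ret) {
	case BLK_STS_RESOURCE:
	case BLK_STS_DEV_RESOURCE:
		/*
		 * rq is requeued and later re-inserted into the scheduler
		 * queues.  Without REQ_NOMERGE, a subsequent bio can merge
		 * into rq and grow it, but the SG tables built above still
		 * describe only the original size -- the merged part never
		 * hits the wire.
		 */
		__blk_mq_requeue_request(rq);
		break;
	}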

As Jens mentioned in bugzilla, there are several conditions required
for triggering the issue:

- MQ device

- queue busy can be triggered. It is hard to trigger on NVMe PCI,
  but may be possible on NVMe FC. However, it can be quite easy to
  trigger on SCSI devices. We know there are some MQ SCSI HBAs,
  such as qlogic FC and megaraid_sas.

- IO merge is enabled. 

I have set up scsi_debug in the following way:

modprobe scsi_debug dev_size_mb=4096 clustering=1 \
		max_luns=1 submit_queues=2 max_queue=2

- submit_queues=2 sets this disk up as MQ
- max_queue=2 may trigger the queue busy condition easily

and ran some write IO on ext4 over the disk (fio, kernel building, ...)
for some time, but still couldn't trigger the data corruption even once.
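
In case anybody wants to repeat this, the write workload was along
these lines (device name and fio parameters are just an example;
adjust for the scsi_debug disk on your box):

	mkfs.ext4 /dev/sdX			# the scsi_debug disk
	mount /dev/sdX /mnt/test
	fio --name=writer --directory=/mnt/test --rw=randwrite \
		--bs=4k --size=1g --numjobs=4 --time_based --runtime=600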

I should have created more LUNs, so that the queue becomes busy more
easily; will do that soon.

> 
> The problem is that it's super-rare --- roughly once a week out of a
> population of about 2500 systems.  The workload is NFS serving.
> Unfortunately, since 4.14.63 we can no longer disable blk-mq for the
> virtio-scsi driver, thanks to commit b5b6e8c8d3b4 ("scsi: virtio_scsi:
> fix IO hang caused by automatic irq vector affinity") getting
> backported into 4.14.63 as commit 70b522f163bbb32.

virtio_scsi supports multi-queue mode; if that is enabled in your
setup, you may try single queue and see if it makes a difference.

If multi-queue mode isn't enabled, your problem should be different from
this one. I remember multi-queue mode isn't enabled on virtio-scsi in GCE.
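
(One way to check: on these kernels each hardware queue shows up as a
numbered directory under the disk's mq sysfs directory, e.g.:

	ls /sys/block/sda/mq/

more than one entry means the disk is running in multi-queue mode.)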

> We're considering reverting this patch in our 4.14 LTS kernel, and
> seeing whether it makes the problem go away.  Is there anything else
> you might suggest?

The IO hang is only triggered on machines with a special CPU topology;
it should be fine to revert b5b6e8c8d3b4 on a normal VM.

No other suggestions yet.

Thanks,
Ming
Ming Lei Dec. 7, 2018, 9:30 a.m. UTC | #21
On Fri, Dec 07, 2018 at 11:44:39AM +0800, Ming Lei wrote:
> On Thu, Dec 06, 2018 at 09:46:42PM -0500, Theodore Y. Ts'o wrote:
> > On Wed, Dec 05, 2018 at 11:03:01AM +0800, Ming Lei wrote:
> > > 
> > > But at that time, there wasn't an IO scheduler for MQ, so in theory the
> > > issue should have been there since v4.11, especially since 945ffb60c11d
> > > ("mq-deadline: add blk-mq adaptation of the deadline IO scheduler").
> > 
> > Hi Ming,
> > 
> > How serious were you about this being (theoretically) an issue since
> > 4.11?  Can you talk about how it might get triggered, and how we can
> > test for it?  The reason I ask is that we're trying to track down a
> > mysterious file system corruption problem on a 4.14.x stable kernel.
> > The symptoms are *very* eerily similar to kernel bugzilla #201685.
> 
> Hi Theodore,
> 
> It is just theoretical analysis.
> 
> blk_mq_try_issue_directly() is called in two branches of
> blk_mq_make_request(), both only on real MQ disks.
> 
> IO merge can be done with either the 'none' or a real IO scheduler, so
> in theory the risk might have existed since v4.11, but IO merge on the
> sw queue didn't work for quite a while, until it was fixed by
> ab42f35d9cb5ac49b5a2.
> 
> As Jens mentioned in bugzilla, there are several conditions required
> for triggering the issue:
> 
> - MQ device
> 
> - queue busy can be triggered. It is hard to trigger on NVMe PCI,
>   but may be possible on NVMe FC. However, it can be quite easy to
>   trigger on SCSI devices. We know there are some MQ SCSI HBAs,
>   such as qlogic FC and megaraid_sas.
> 
> - IO merge is enabled. 
> 
> I have set up scsi_debug in the following way:
> 
> modprobe scsi_debug dev_size_mb=4096 clustering=1 \
> 		max_luns=1 submit_queues=2 max_queue=2
> 
> - submit_queues=2 sets this disk up as MQ
> - max_queue=2 may trigger the queue busy condition easily
> 
> and ran some write IO on ext4 over the disk (fio, kernel building, ...)
> for some time, but still couldn't trigger the data corruption even once.
> 
> I should have created more LUNs, so that the queue becomes busy more
> easily; will do that soon.

Actually, I should have used SDEBUG_OPT_HOST_BUSY to simulate queue busy.
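
Something like the following should do it, though please double-check
the SDEBUG_OPT_HOST_BUSY bit value against the scsi_debug.c of the
kernel being tested --- I'm quoting 0x8000 from memory:

	# toggle host-busy simulation at runtime
	echo 0x8000 > /sys/bus/pseudo/drivers/scsi_debug/opts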

Thanks,
Ming
diff mbox series

Patch

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 3f91c6e5b17a..d8f518c6ea38 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1715,6 +1715,15 @@ static blk_status_t __blk_mq_issue_directly(struct blk_mq_hw_ctx *hctx,
 		break;
 	case BLK_STS_RESOURCE:
 	case BLK_STS_DEV_RESOURCE:
+		/*
+		 * If direct dispatch fails, we cannot allow any merging on
+		 * this IO. Drivers (like SCSI) may have set up permanent state
+		 * for this request, like SG tables and mappings, and if we
+		 * merge to it later on then we'll still only do IO to the
+		 * original part.
+		 */
+		rq->cmd_flags |= REQ_NOMERGE;
+
 		blk_mq_update_dispatch_busy(hctx, true);
 		__blk_mq_requeue_request(rq);
 		break;