Message ID: 1d359819-5410-7af2-d02b-f0ecca39d2c9@kernel.dk (mailing list archive)
State: New, archived
Series: blk-mq: fix corruption with direct issue
On Tue, Dec 04, 2018 at 03:47:46PM -0700, Jens Axboe wrote: > If we attempt a direct issue to a SCSI device, and it returns BUSY, then > we queue the request up normally. However, the SCSI layer may have > already setup SG tables etc for this particular command. If we later > merge with this request, then the old tables are no longer valid. Once > we issue the IO, we only read/write the original part of the request, > not the new state of it. > > This causes data corruption, and is most often noticed with the file > system complaining about the just read data being invalid: > > [ 235.934465] EXT4-fs error (device sda1): ext4_iget:4831: inode #7142: comm dpkg-query: bad extra_isize 24937 (inode size 256) > > because most of it is garbage... > > This doesn't happen from the normal issue path, as we will simply defer > the request to the hardware queue dispatch list if we fail. Once it's on > the dispatch list, we never merge with it. > > Fix this from the direct issue path by flagging the request as > REQ_NOMERGE so we don't change the size of it before issue. > > See also: > https://bugzilla.kernel.org/show_bug.cgi?id=201685 > > Fixes: 6ce3dd6eec1 ("blk-mq: issue directly if hw queue isn't busy in case of 'none'") > Signed-off-by: Jens Axboe <axboe@kernel.dk> > > --- > > diff --git a/block/blk-mq.c b/block/blk-mq.c > index 3f91c6e5b17a..d8f518c6ea38 100644 > --- a/block/blk-mq.c > +++ b/block/blk-mq.c > @@ -1715,6 +1715,15 @@ static blk_status_t __blk_mq_issue_directly(struct blk_mq_hw_ctx *hctx, > break; > case BLK_STS_RESOURCE: > case BLK_STS_DEV_RESOURCE: > + /* > + * If direct dispatch fails, we cannot allow any merging on > + * this IO. Drivers (like SCSI) may have set up permanent state > + * for this request, like SG tables and mappings, and if we > + * merge to it later on then we'll still only do IO to the > + * original part. > + */ > + rq->cmd_flags |= REQ_NOMERGE; > + > blk_mq_update_dispatch_busy(hctx, true); > __blk_mq_requeue_request(rq); > break; > Not sure it is enough to just mark it as NOMERGE, for example, driver may have setup the .special_vec for discard, and NOMERGE may not prevent request from entering elevator queue completely. Cause 'rq.rb_node' and 'rq.special_vec' share same space. So how about inserting this request via blk_mq_request_bypass_insert() in case that direct issue returns BUSY? Then it is invariant that any request queued via .queue_rq() won't enter scheduler queue. -- diff --git a/block/blk-mq.c b/block/blk-mq.c index 3f91c6e5b17a..4b2db0b2909e 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -1764,7 +1764,7 @@ static blk_status_t __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx, if (bypass_insert) return BLK_STS_RESOURCE; - blk_mq_sched_insert_request(rq, false, run_queue, false); + blk_mq_request_bypass_insert(rq, run_queue); return BLK_STS_OK; } @@ -1780,7 +1780,7 @@ static void blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx, ret = __blk_mq_try_issue_directly(hctx, rq, cookie, false); if (ret == BLK_STS_RESOURCE || ret == BLK_STS_DEV_RESOURCE) - blk_mq_sched_insert_request(rq, false, true, false); + blk_mq_request_bypass_insert(rq, true); else if (ret != BLK_STS_OK) blk_mq_end_request(rq, ret); Thanks, Ming
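The overlap Ming describes comes from an anonymous union inside struct request. A minimal sketch of that layout, paraphrased from include/linux/blkdev.h of roughly this vintage (surrounding fields omitted), shows why letting a prepped discard back into an elevator is dangerous:

struct request {
	/* ... */
	union {
		struct rb_node rb_node;		/* sort/lookup, used by io schedulers */
		struct bio_vec special_vec;	/* e.g. discard payload prepped by the driver */
		void *completion_data;
	};
	/* ... */
};

Linking the request into the elevator's rb-tree writes through rb_node and therefore clobbers special_vec, independent of any merging.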
On Tue, Dec 04, 2018 at 03:47:46PM -0700, Jens Axboe wrote: > If we attempt a direct issue to a SCSI device, and it returns BUSY, then > we queue the request up normally. However, the SCSI layer may have > already setup SG tables etc for this particular command. If we later > merge with this request, then the old tables are no longer valid. Once > we issue the IO, we only read/write the original part of the request, > not the new state of it. > > This causes data corruption, and is most often noticed with the file > system complaining about the just read data being invalid: > > [ 235.934465] EXT4-fs error (device sda1): ext4_iget:4831: inode #7142: comm dpkg-query: bad extra_isize 24937 (inode size 256) > > because most of it is garbage... > > This doesn't happen from the normal issue path, as we will simply defer > the request to the hardware queue dispatch list if we fail. Once it's on > the dispatch list, we never merge with it. > > Fix this from the direct issue path by flagging the request as > REQ_NOMERGE so we don't change the size of it before issue. > > See also: > https://bugzilla.kernel.org/show_bug.cgi?id=201685 > > Fixes: 6ce3dd6eec1 ("blk-mq: issue directly if hw queue isn't busy in case of 'none'") > Signed-off-by: Jens Axboe <axboe@kernel.dk> Tested-by: Guenter Roeck <linux@roeck-us.net> ... on two systems affected by the problem. > > diff --git a/block/blk-mq.c b/block/blk-mq.c > index 3f91c6e5b17a..d8f518c6ea38 100644 > --- a/block/blk-mq.c > +++ b/block/blk-mq.c > @@ -1715,6 +1715,15 @@ static blk_status_t __blk_mq_issue_directly(struct blk_mq_hw_ctx *hctx, > break; > case BLK_STS_RESOURCE: > case BLK_STS_DEV_RESOURCE: > + /* > + * If direct dispatch fails, we cannot allow any merging on > + * this IO. Drivers (like SCSI) may have set up permanent state > + * for this request, like SG tables and mappings, and if we > + * merge to it later on then we'll still only do IO to the > + * original part. > + */ > + rq->cmd_flags |= REQ_NOMERGE; > + > blk_mq_update_dispatch_busy(hctx, true); > __blk_mq_requeue_request(rq); > break;
On 12/4/18 6:37 PM, Ming Lei wrote: > On Tue, Dec 04, 2018 at 03:47:46PM -0700, Jens Axboe wrote: >> If we attempt a direct issue to a SCSI device, and it returns BUSY, then >> we queue the request up normally. However, the SCSI layer may have >> already setup SG tables etc for this particular command. If we later >> merge with this request, then the old tables are no longer valid. Once >> we issue the IO, we only read/write the original part of the request, >> not the new state of it. >> >> This causes data corruption, and is most often noticed with the file >> system complaining about the just read data being invalid: >> >> [ 235.934465] EXT4-fs error (device sda1): ext4_iget:4831: inode #7142: comm dpkg-query: bad extra_isize 24937 (inode size 256) >> >> because most of it is garbage... >> >> This doesn't happen from the normal issue path, as we will simply defer >> the request to the hardware queue dispatch list if we fail. Once it's on >> the dispatch list, we never merge with it. >> >> Fix this from the direct issue path by flagging the request as >> REQ_NOMERGE so we don't change the size of it before issue. >> >> See also: >> https://bugzilla.kernel.org/show_bug.cgi?id=201685 >> >> Fixes: 6ce3dd6eec1 ("blk-mq: issue directly if hw queue isn't busy in case of 'none'") >> Signed-off-by: Jens Axboe <axboe@kernel.dk> >> >> --- >> >> diff --git a/block/blk-mq.c b/block/blk-mq.c >> index 3f91c6e5b17a..d8f518c6ea38 100644 >> --- a/block/blk-mq.c >> +++ b/block/blk-mq.c >> @@ -1715,6 +1715,15 @@ static blk_status_t __blk_mq_issue_directly(struct blk_mq_hw_ctx *hctx, >> break; >> case BLK_STS_RESOURCE: >> case BLK_STS_DEV_RESOURCE: >> + /* >> + * If direct dispatch fails, we cannot allow any merging on >> + * this IO. Drivers (like SCSI) may have set up permanent state >> + * for this request, like SG tables and mappings, and if we >> + * merge to it later on then we'll still only do IO to the >> + * original part. >> + */ >> + rq->cmd_flags |= REQ_NOMERGE; >> + >> blk_mq_update_dispatch_busy(hctx, true); >> __blk_mq_requeue_request(rq); >> break; >> > > Not sure it is enough to just mark it as NOMERGE, for example, driver > may have setup the .special_vec for discard, and NOMERGE may not prevent > request from entering elevator queue completely. Cause 'rq.rb_node' and > 'rq.special_vec' share same space. We should rather limit the scope of the direct dispatch instead. It doesn't make sense to do for anything but read/write anyway. > So how about inserting this request via blk_mq_request_bypass_insert() > in case that direct issue returns BUSY? Then it is invariant that > any request queued via .queue_rq() won't enter scheduler queue. I did consider this, but I didn't want to experiment with exercising a new path for an important bug fix. You do realize that your original patch has been corrupting data for months? I think a little caution is in order here.
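For reference, REQ_NOMERGE only feeds the merge gate; it says nothing about whether a request enters the elevator. A sketch of that gate, paraphrased from include/linux/blkdev.h around v4.20:

static inline bool rq_mergeable(struct request *rq)
{
	if (blk_rq_is_passthrough(rq))
		return false;
	if (req_op(rq) == REQ_OP_FLUSH)
		return false;
	if (req_op(rq) == REQ_OP_WRITE_ZEROES)
		return false;
	if (rq->cmd_flags & REQ_NOMERGE_FLAGS)	/* REQ_NOMERGE lives here */
		return false;
	if (rq->rq_flags & RQF_NOMERGE_FLAGS)
		return false;
	return true;
}

So the flag reliably stops bio and request merging, which is Jens' concern, while Ming's union concern is about the insertion itself.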
On 12/4/18 7:16 PM, Jens Axboe wrote: > On 12/4/18 6:37 PM, Ming Lei wrote: >> On Tue, Dec 04, 2018 at 03:47:46PM -0700, Jens Axboe wrote: >>> If we attempt a direct issue to a SCSI device, and it returns BUSY, then >>> we queue the request up normally. However, the SCSI layer may have >>> already setup SG tables etc for this particular command. If we later >>> merge with this request, then the old tables are no longer valid. Once >>> we issue the IO, we only read/write the original part of the request, >>> not the new state of it. >>> >>> This causes data corruption, and is most often noticed with the file >>> system complaining about the just read data being invalid: >>> >>> [ 235.934465] EXT4-fs error (device sda1): ext4_iget:4831: inode #7142: comm dpkg-query: bad extra_isize 24937 (inode size 256) >>> >>> because most of it is garbage... >>> >>> This doesn't happen from the normal issue path, as we will simply defer >>> the request to the hardware queue dispatch list if we fail. Once it's on >>> the dispatch list, we never merge with it. >>> >>> Fix this from the direct issue path by flagging the request as >>> REQ_NOMERGE so we don't change the size of it before issue. >>> >>> See also: >>> https://bugzilla.kernel.org/show_bug.cgi?id=201685 >>> >>> Fixes: 6ce3dd6eec1 ("blk-mq: issue directly if hw queue isn't busy in case of 'none'") >>> Signed-off-by: Jens Axboe <axboe@kernel.dk> >>> >>> --- >>> >>> diff --git a/block/blk-mq.c b/block/blk-mq.c >>> index 3f91c6e5b17a..d8f518c6ea38 100644 >>> --- a/block/blk-mq.c >>> +++ b/block/blk-mq.c >>> @@ -1715,6 +1715,15 @@ static blk_status_t __blk_mq_issue_directly(struct blk_mq_hw_ctx *hctx, >>> break; >>> case BLK_STS_RESOURCE: >>> case BLK_STS_DEV_RESOURCE: >>> + /* >>> + * If direct dispatch fails, we cannot allow any merging on >>> + * this IO. Drivers (like SCSI) may have set up permanent state >>> + * for this request, like SG tables and mappings, and if we >>> + * merge to it later on then we'll still only do IO to the >>> + * original part. >>> + */ >>> + rq->cmd_flags |= REQ_NOMERGE; >>> + >>> blk_mq_update_dispatch_busy(hctx, true); >>> __blk_mq_requeue_request(rq); >>> break; >>> >> >> Not sure it is enough to just mark it as NOMERGE, for example, driver >> may have setup the .special_vec for discard, and NOMERGE may not prevent >> request from entering elevator queue completely. Cause 'rq.rb_node' and >> 'rq.special_vec' share same space. > > We should rather limit the scope of the direct dispatch instead. It > doesn't make sense to do for anything but read/write anyway. > >> So how about inserting this request via blk_mq_request_bypass_insert() >> in case that direct issue returns BUSY? Then it is invariant that >> any request queued via .queue_rq() won't enter scheduler queue. > > I did consider this, but I didn't want to experiment with exercising > a new path for an important bug fix. You do realize that your original > patch has been corrupting data for months? I think a little caution > is in order here. Here's a further limiting version. And we seriously need to clean up the direct issue paths, it's ridiculous. diff --git a/block/blk-mq.c b/block/blk-mq.c index 3f91c6e5b17a..3262d83b9e07 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -1715,6 +1715,15 @@ static blk_status_t __blk_mq_issue_directly(struct blk_mq_hw_ctx *hctx, break; case BLK_STS_RESOURCE: case BLK_STS_DEV_RESOURCE: + /* + * If direct dispatch fails, we cannot allow any merging on + * this IO. 
Drivers (like SCSI) may have set up permanent state + * for this request, like SG tables and mappings, and if we + * merge to it later on then we'll still only do IO to the + * original part. + */ + rq->cmd_flags |= REQ_NOMERGE; + blk_mq_update_dispatch_busy(hctx, true); __blk_mq_requeue_request(rq); break; @@ -1727,6 +1736,18 @@ static blk_status_t __blk_mq_issue_directly(struct blk_mq_hw_ctx *hctx, return ret; } +/* + * Don't allow direct dispatch of anything but regular reads/writes, + * as some of the other commands can potentially share request space + * with data we need for the IO scheduler. If we attempt a direct dispatch + * on those and fail, we can't safely add it to the scheduler afterwards + * without potentially overwriting data that the driver has already written. + */ +static bool blk_rq_can_direct_dispatch(struct request *rq) +{ + return req_op(rq) == REQ_OP_READ || req_op(rq) == REQ_OP_WRITE; +} + static blk_status_t __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx, struct request *rq, blk_qc_t *cookie, @@ -1748,7 +1769,7 @@ static blk_status_t __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx, goto insert; } - if (q->elevator && !bypass_insert) + if (!blk_rq_can_direct_dispatch(rq) || (q->elevator && !bypass_insert)) goto insert; if (!blk_mq_get_dispatch_budget(hctx)) @@ -1810,6 +1831,9 @@ void blk_mq_try_issue_list_directly(struct blk_mq_hw_ctx *hctx, struct request *rq = list_first_entry(list, struct request, queuelist); + if (!blk_rq_can_direct_dispatch(rq)) + break; + list_del_init(&rq->queuelist); ret = blk_mq_request_issue_directly(rq); if (ret != BLK_STS_OK) {
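For context on what the new helper filters: req_op() masks the low op bits of cmd_flags, and only the first two op codes stay eligible. Illustrative values, paraphrased from include/linux/blk_types.h of this era:

enum req_opf {
	REQ_OP_READ		= 0,	/* still eligible for direct dispatch */
	REQ_OP_WRITE		= 1,	/* still eligible for direct dispatch */
	REQ_OP_FLUSH		= 2,	/* everything below now inserts normally */
	REQ_OP_DISCARD		= 3,
	REQ_OP_SECURE_ERASE	= 5,
	REQ_OP_WRITE_SAME	= 7,
	REQ_OP_WRITE_ZEROES	= 9,
	/* ... */
};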
On 12/4/18 6:38 PM, Guenter Roeck wrote: > On Tue, Dec 04, 2018 at 03:47:46PM -0700, Jens Axboe wrote: >> If we attempt a direct issue to a SCSI device, and it returns BUSY, then >> we queue the request up normally. However, the SCSI layer may have >> already setup SG tables etc for this particular command. If we later >> merge with this request, then the old tables are no longer valid. Once >> we issue the IO, we only read/write the original part of the request, >> not the new state of it. >> >> This causes data corruption, and is most often noticed with the file >> system complaining about the just read data being invalid: >> >> [ 235.934465] EXT4-fs error (device sda1): ext4_iget:4831: inode #7142: comm dpkg-query: bad extra_isize 24937 (inode size 256) >> >> because most of it is garbage... >> >> This doesn't happen from the normal issue path, as we will simply defer >> the request to the hardware queue dispatch list if we fail. Once it's on >> the dispatch list, we never merge with it. >> >> Fix this from the direct issue path by flagging the request as >> REQ_NOMERGE so we don't change the size of it before issue. >> >> See also: >> https://bugzilla.kernel.org/show_bug.cgi?id=201685 >> >> Fixes: 6ce3dd6eec1 ("blk-mq: issue directly if hw queue isn't busy in case of 'none'") >> Signed-off-by: Jens Axboe <axboe@kernel.dk> > > Tested-by: Guenter Roeck <linux@roeck-us.net> > > ... on two systems affected by the problem. Thanks for testing! And for being persistent in reproducing and providing clues for getting this nailed.
On Tue, Dec 04, 2018 at 07:16:11PM -0700, Jens Axboe wrote: > On 12/4/18 6:37 PM, Ming Lei wrote: > > On Tue, Dec 04, 2018 at 03:47:46PM -0700, Jens Axboe wrote: > >> If we attempt a direct issue to a SCSI device, and it returns BUSY, then > >> we queue the request up normally. However, the SCSI layer may have > >> already setup SG tables etc for this particular command. If we later > >> merge with this request, then the old tables are no longer valid. Once > >> we issue the IO, we only read/write the original part of the request, > >> not the new state of it. > >> > >> This causes data corruption, and is most often noticed with the file > >> system complaining about the just read data being invalid: > >> > >> [ 235.934465] EXT4-fs error (device sda1): ext4_iget:4831: inode #7142: comm dpkg-query: bad extra_isize 24937 (inode size 256) > >> > >> because most of it is garbage... > >> > >> This doesn't happen from the normal issue path, as we will simply defer > >> the request to the hardware queue dispatch list if we fail. Once it's on > >> the dispatch list, we never merge with it. > >> > >> Fix this from the direct issue path by flagging the request as > >> REQ_NOMERGE so we don't change the size of it before issue. > >> > >> See also: > >> https://bugzilla.kernel.org/show_bug.cgi?id=201685 > >> > >> Fixes: 6ce3dd6eec1 ("blk-mq: issue directly if hw queue isn't busy in case of 'none'") > >> Signed-off-by: Jens Axboe <axboe@kernel.dk> > >> > >> --- > >> > >> diff --git a/block/blk-mq.c b/block/blk-mq.c > >> index 3f91c6e5b17a..d8f518c6ea38 100644 > >> --- a/block/blk-mq.c > >> +++ b/block/blk-mq.c > >> @@ -1715,6 +1715,15 @@ static blk_status_t __blk_mq_issue_directly(struct blk_mq_hw_ctx *hctx, > >> break; > >> case BLK_STS_RESOURCE: > >> case BLK_STS_DEV_RESOURCE: > >> + /* > >> + * If direct dispatch fails, we cannot allow any merging on > >> + * this IO. Drivers (like SCSI) may have set up permanent state > >> + * for this request, like SG tables and mappings, and if we > >> + * merge to it later on then we'll still only do IO to the > >> + * original part. > >> + */ > >> + rq->cmd_flags |= REQ_NOMERGE; > >> + > >> blk_mq_update_dispatch_busy(hctx, true); > >> __blk_mq_requeue_request(rq); > >> break; > >> > > > > Not sure it is enough to just mark it as NOMERGE, for example, driver > > may have setup the .special_vec for discard, and NOMERGE may not prevent > > request from entering elevator queue completely. Cause 'rq.rb_node' and > > 'rq.special_vec' share same space. > > We should rather limit the scope of the direct dispatch instead. It > doesn't make sense to do for anything but read/write anyway. discard is kind of write, and it isn't treated very specially in make request path, except for multi-range discard. > > > So how about inserting this request via blk_mq_request_bypass_insert() > > in case that direct issue returns BUSY? Then it is invariant that > > any request queued via .queue_rq() won't enter scheduler queue. > > I did consider this, but I didn't want to experiment with exercising > a new path for an important bug fix. You do realize that your original > patch has been corrupting data for months? I think a little caution > is in order here. But marking NOMERGE still may have a hole on re-insert discard request as mentioned above. Given we never allow to re-insert queued request to scheduler queue except for 6ce3dd6eec1, I think it is the correct thing to do, and the fix is simple too. Thanks, Ming
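Ming's alternative hinges on blk_mq_request_bypass_insert() placing the request straight on the hctx dispatch list, which is never merged against. A sketch of that helper as it looked around this time, paraphrased from block/blk-mq.c:

void blk_mq_request_bypass_insert(struct request *rq, bool run_queue)
{
	struct blk_mq_ctx *ctx = rq->mq_ctx;
	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(rq->q, ctx->cpu);

	/* requests on the dispatch list bypass any io scheduler */
	spin_lock(&hctx->lock);
	list_add_tail(&rq->queuelist, &hctx->dispatch);
	spin_unlock(&hctx->lock);

	if (run_queue)
		blk_mq_run_hw_queue(hctx, false);
}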
On 12/4/18 7:27 PM, Ming Lei wrote: > On Tue, Dec 04, 2018 at 07:16:11PM -0700, Jens Axboe wrote: >> On 12/4/18 6:37 PM, Ming Lei wrote: >>> On Tue, Dec 04, 2018 at 03:47:46PM -0700, Jens Axboe wrote: >>>> If we attempt a direct issue to a SCSI device, and it returns BUSY, then >>>> we queue the request up normally. However, the SCSI layer may have >>>> already setup SG tables etc for this particular command. If we later >>>> merge with this request, then the old tables are no longer valid. Once >>>> we issue the IO, we only read/write the original part of the request, >>>> not the new state of it. >>>> >>>> This causes data corruption, and is most often noticed with the file >>>> system complaining about the just read data being invalid: >>>> >>>> [ 235.934465] EXT4-fs error (device sda1): ext4_iget:4831: inode #7142: comm dpkg-query: bad extra_isize 24937 (inode size 256) >>>> >>>> because most of it is garbage... >>>> >>>> This doesn't happen from the normal issue path, as we will simply defer >>>> the request to the hardware queue dispatch list if we fail. Once it's on >>>> the dispatch list, we never merge with it. >>>> >>>> Fix this from the direct issue path by flagging the request as >>>> REQ_NOMERGE so we don't change the size of it before issue. >>>> >>>> See also: >>>> https://bugzilla.kernel.org/show_bug.cgi?id=201685 >>>> >>>> Fixes: 6ce3dd6eec1 ("blk-mq: issue directly if hw queue isn't busy in case of 'none'") >>>> Signed-off-by: Jens Axboe <axboe@kernel.dk> >>>> >>>> --- >>>> >>>> diff --git a/block/blk-mq.c b/block/blk-mq.c >>>> index 3f91c6e5b17a..d8f518c6ea38 100644 >>>> --- a/block/blk-mq.c >>>> +++ b/block/blk-mq.c >>>> @@ -1715,6 +1715,15 @@ static blk_status_t __blk_mq_issue_directly(struct blk_mq_hw_ctx *hctx, >>>> break; >>>> case BLK_STS_RESOURCE: >>>> case BLK_STS_DEV_RESOURCE: >>>> + /* >>>> + * If direct dispatch fails, we cannot allow any merging on >>>> + * this IO. Drivers (like SCSI) may have set up permanent state >>>> + * for this request, like SG tables and mappings, and if we >>>> + * merge to it later on then we'll still only do IO to the >>>> + * original part. >>>> + */ >>>> + rq->cmd_flags |= REQ_NOMERGE; >>>> + >>>> blk_mq_update_dispatch_busy(hctx, true); >>>> __blk_mq_requeue_request(rq); >>>> break; >>>> >>> >>> Not sure it is enough to just mark it as NOMERGE, for example, driver >>> may have setup the .special_vec for discard, and NOMERGE may not prevent >>> request from entering elevator queue completely. Cause 'rq.rb_node' and >>> 'rq.special_vec' share same space. >> >> We should rather limit the scope of the direct dispatch instead. It >> doesn't make sense to do for anything but read/write anyway. > > discard is kind of write, and it isn't treated very specially in make > request path, except for multi-range discard. The point of direct dispatch is to reduce latencies for requests, discards are so damn slow on ALL devices anyway that it doesn't make any sense to try direct dispatch to begin with, regardless of whether it possible or not. >>> So how about inserting this request via blk_mq_request_bypass_insert() >>> in case that direct issue returns BUSY? Then it is invariant that >>> any request queued via .queue_rq() won't enter scheduler queue. >> >> I did consider this, but I didn't want to experiment with exercising >> a new path for an important bug fix. You do realize that your original >> patch has been corrupting data for months? I think a little caution >> is in order here. 
> > But marking NOMERGE still may have a hole on re-insert discard request as > mentioned above. What I said was further limit the scope of direct dispatch, which means not allowing anything that isn't a read/write. > Given we never allow to re-insert queued request to scheduler queue > except for 6ce3dd6eec1, I think it is the correct thing to do, and the > fix is simple too. As I said, it's not the time to experiment. This issue has been there since 4.19-rc1. The alternative is yanking both those patches, and then looking at it later when the direct issue path has been cleaned up first.
On Tue, Dec 04, 2018 at 07:30:24PM -0700, Jens Axboe wrote: > On 12/4/18 7:27 PM, Ming Lei wrote: > > On Tue, Dec 04, 2018 at 07:16:11PM -0700, Jens Axboe wrote: > >> On 12/4/18 6:37 PM, Ming Lei wrote: > >>> On Tue, Dec 04, 2018 at 03:47:46PM -0700, Jens Axboe wrote: > >>>> If we attempt a direct issue to a SCSI device, and it returns BUSY, then > >>>> we queue the request up normally. However, the SCSI layer may have > >>>> already setup SG tables etc for this particular command. If we later > >>>> merge with this request, then the old tables are no longer valid. Once > >>>> we issue the IO, we only read/write the original part of the request, > >>>> not the new state of it. > >>>> > >>>> This causes data corruption, and is most often noticed with the file > >>>> system complaining about the just read data being invalid: > >>>> > >>>> [ 235.934465] EXT4-fs error (device sda1): ext4_iget:4831: inode #7142: comm dpkg-query: bad extra_isize 24937 (inode size 256) > >>>> > >>>> because most of it is garbage... > >>>> > >>>> This doesn't happen from the normal issue path, as we will simply defer > >>>> the request to the hardware queue dispatch list if we fail. Once it's on > >>>> the dispatch list, we never merge with it. > >>>> > >>>> Fix this from the direct issue path by flagging the request as > >>>> REQ_NOMERGE so we don't change the size of it before issue. > >>>> > >>>> See also: > >>>> https://bugzilla.kernel.org/show_bug.cgi?id=201685 > >>>> > >>>> Fixes: 6ce3dd6eec1 ("blk-mq: issue directly if hw queue isn't busy in case of 'none'") > >>>> Signed-off-by: Jens Axboe <axboe@kernel.dk> > >>>> > >>>> --- > >>>> > >>>> diff --git a/block/blk-mq.c b/block/blk-mq.c > >>>> index 3f91c6e5b17a..d8f518c6ea38 100644 > >>>> --- a/block/blk-mq.c > >>>> +++ b/block/blk-mq.c > >>>> @@ -1715,6 +1715,15 @@ static blk_status_t __blk_mq_issue_directly(struct blk_mq_hw_ctx *hctx, > >>>> break; > >>>> case BLK_STS_RESOURCE: > >>>> case BLK_STS_DEV_RESOURCE: > >>>> + /* > >>>> + * If direct dispatch fails, we cannot allow any merging on > >>>> + * this IO. Drivers (like SCSI) may have set up permanent state > >>>> + * for this request, like SG tables and mappings, and if we > >>>> + * merge to it later on then we'll still only do IO to the > >>>> + * original part. > >>>> + */ > >>>> + rq->cmd_flags |= REQ_NOMERGE; > >>>> + > >>>> blk_mq_update_dispatch_busy(hctx, true); > >>>> __blk_mq_requeue_request(rq); > >>>> break; > >>>> > >>> > >>> Not sure it is enough to just mark it as NOMERGE, for example, driver > >>> may have setup the .special_vec for discard, and NOMERGE may not prevent > >>> request from entering elevator queue completely. Cause 'rq.rb_node' and > >>> 'rq.special_vec' share same space. > >> > >> We should rather limit the scope of the direct dispatch instead. It > >> doesn't make sense to do for anything but read/write anyway. > > > > discard is kind of write, and it isn't treated very specially in make > > request path, except for multi-range discard. > > The point of direct dispatch is to reduce latencies for requests, > discards are so damn slow on ALL devices anyway that it doesn't make any > sense to try direct dispatch to begin with, regardless of whether it > possible or not. SCSI MQ device may benefit from direct dispatch from reduced lock contention. > > >>> So how about inserting this request via blk_mq_request_bypass_insert() > >>> in case that direct issue returns BUSY? Then it is invariant that > >>> any request queued via .queue_rq() won't enter scheduler queue. 
> >> > >> I did consider this, but I didn't want to experiment with exercising > >> a new path for an important bug fix. You do realize that your original > >> patch has been corrupting data for months? I think a little caution > >> is in order here. > > > > But marking NOMERGE still may have a hole on re-insert discard request as > > mentioned above. > > What I said was further limit the scope of direct dispatch, which means > not allowing anything that isn't a read/write. IMO, the conservative approach is to take the one used in legacy io path, in which it is never allowed to re-insert queued request to scheduler queue except for requeue, however RQF_DONTPREP is cleared before requeuing request to scheduler. > > > Given we never allow to re-insert queued request to scheduler queue > > except for 6ce3dd6eec1, I think it is the correct thing to do, and the > > fix is simple too. > > As I said, it's not the time to experiment. This issue has been there > since 4.19-rc1. The alternative is yanking both those patches, and then > looking at it later when the direct issue path has been cleaned up > first. The issue should have been there from v4.1, especially after commit f984df1f0f7 ("blk-mq: do limited block plug for multiple queue case"), which is the 1st one to re-insert the queued request into scheduler queue. Thanks, Ming
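The legacy-path behavior Ming invokes can be sketched as below. This is a hypothetical illustration of the "unprep before re-insert" idea, not code from the thread; the driver hook is invented:

/*
 * Hypothetical: before a queued request is allowed back into an IO
 * scheduler, undo driver prep state (SG tables, .special_vec) and
 * clear RQF_DONTPREP, mirroring what the legacy path did on requeue.
 */
static void reinsert_to_scheduler(struct request *rq)
{
	if (rq->rq_flags & RQF_DONTPREP) {
		driver_unprep_request(rq);	/* invented hook, illustration only */
		rq->rq_flags &= ~RQF_DONTPREP;
	}
	blk_mq_sched_insert_request(rq, false, true, false);
}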
On Wed, Dec 05, 2018 at 10:58:02AM +0800, Ming Lei wrote: > On Tue, Dec 04, 2018 at 07:30:24PM -0700, Jens Axboe wrote: > > On 12/4/18 7:27 PM, Ming Lei wrote: > > > On Tue, Dec 04, 2018 at 07:16:11PM -0700, Jens Axboe wrote: > > >> On 12/4/18 6:37 PM, Ming Lei wrote: > > >>> On Tue, Dec 04, 2018 at 03:47:46PM -0700, Jens Axboe wrote: > > >>>> If we attempt a direct issue to a SCSI device, and it returns BUSY, then > > >>>> we queue the request up normally. However, the SCSI layer may have > > >>>> already setup SG tables etc for this particular command. If we later > > >>>> merge with this request, then the old tables are no longer valid. Once > > >>>> we issue the IO, we only read/write the original part of the request, > > >>>> not the new state of it. > > >>>> > > >>>> This causes data corruption, and is most often noticed with the file > > >>>> system complaining about the just read data being invalid: > > >>>> > > >>>> [ 235.934465] EXT4-fs error (device sda1): ext4_iget:4831: inode #7142: comm dpkg-query: bad extra_isize 24937 (inode size 256) > > >>>> > > >>>> because most of it is garbage... > > >>>> > > >>>> This doesn't happen from the normal issue path, as we will simply defer > > >>>> the request to the hardware queue dispatch list if we fail. Once it's on > > >>>> the dispatch list, we never merge with it. > > >>>> > > >>>> Fix this from the direct issue path by flagging the request as > > >>>> REQ_NOMERGE so we don't change the size of it before issue. > > >>>> > > >>>> See also: > > >>>> https://bugzilla.kernel.org/show_bug.cgi?id=201685 > > >>>> > > >>>> Fixes: 6ce3dd6eec1 ("blk-mq: issue directly if hw queue isn't busy in case of 'none'") > > >>>> Signed-off-by: Jens Axboe <axboe@kernel.dk> > > >>>> > > >>>> --- > > >>>> > > >>>> diff --git a/block/blk-mq.c b/block/blk-mq.c > > >>>> index 3f91c6e5b17a..d8f518c6ea38 100644 > > >>>> --- a/block/blk-mq.c > > >>>> +++ b/block/blk-mq.c > > >>>> @@ -1715,6 +1715,15 @@ static blk_status_t __blk_mq_issue_directly(struct blk_mq_hw_ctx *hctx, > > >>>> break; > > >>>> case BLK_STS_RESOURCE: > > >>>> case BLK_STS_DEV_RESOURCE: > > >>>> + /* > > >>>> + * If direct dispatch fails, we cannot allow any merging on > > >>>> + * this IO. Drivers (like SCSI) may have set up permanent state > > >>>> + * for this request, like SG tables and mappings, and if we > > >>>> + * merge to it later on then we'll still only do IO to the > > >>>> + * original part. > > >>>> + */ > > >>>> + rq->cmd_flags |= REQ_NOMERGE; > > >>>> + > > >>>> blk_mq_update_dispatch_busy(hctx, true); > > >>>> __blk_mq_requeue_request(rq); > > >>>> break; > > >>>> > > >>> > > >>> Not sure it is enough to just mark it as NOMERGE, for example, driver > > >>> may have setup the .special_vec for discard, and NOMERGE may not prevent > > >>> request from entering elevator queue completely. Cause 'rq.rb_node' and > > >>> 'rq.special_vec' share same space. > > >> > > >> We should rather limit the scope of the direct dispatch instead. It > > >> doesn't make sense to do for anything but read/write anyway. > > > > > > discard is kind of write, and it isn't treated very specially in make > > > request path, except for multi-range discard. > > > > The point of direct dispatch is to reduce latencies for requests, > > discards are so damn slow on ALL devices anyway that it doesn't make any > > sense to try direct dispatch to begin with, regardless of whether it > > possible or not. > > SCSI MQ device may benefit from direct dispatch from reduced lock contention. 
> > > > > >>> So how about inserting this request via blk_mq_request_bypass_insert() > > >>> in case that direct issue returns BUSY? Then it is invariant that > > >>> any request queued via .queue_rq() won't enter scheduler queue. > > >> > > >> I did consider this, but I didn't want to experiment with exercising > > >> a new path for an important bug fix. You do realize that your original > > >> patch has been corrupting data for months? I think a little caution > > >> is in order here. > > > > > > But marking NOMERGE still may have a hole on re-insert discard request as > > > mentioned above. > > > > What I said was further limit the scope of direct dispatch, which means > > not allowing anything that isn't a read/write. > > IMO, the conservative approach is to take the one used in legacy io > path, in which it is never allowed to re-insert queued request to > scheduler queue except for requeue, however RQF_DONTPREP is cleared > before requeuing request to scheduler. > > > > > > Given we never allow to re-insert queued request to scheduler queue > > > except for 6ce3dd6eec1, I think it is the correct thing to do, and the > > > fix is simple too. > > > > As I said, it's not the time to experiment. This issue has been there > > since 4.19-rc1. The alternative is yanking both those patches, and then > > looking at it later when the direct issue path has been cleaned up > > first. > > The issue should have been there from v4.1, especially after commit > f984df1f0f7 ("blk-mq: do limited block plug for multiple queue case"), > which is the 1st one to re-insert the queued request into scheduler > queue. But at that time, there isn't io scheduler for MQ, so in theory the issue should be there since v4.11, especially 945ffb60c11d ("mq-deadline: add blk-mq adaptation of the deadline IO scheduler"). Thanks, Ming
On 12/4/18 7:58 PM, Ming Lei wrote: > On Tue, Dec 04, 2018 at 07:30:24PM -0700, Jens Axboe wrote: >> On 12/4/18 7:27 PM, Ming Lei wrote: >>> On Tue, Dec 04, 2018 at 07:16:11PM -0700, Jens Axboe wrote: >>>> On 12/4/18 6:37 PM, Ming Lei wrote: >>>>> On Tue, Dec 04, 2018 at 03:47:46PM -0700, Jens Axboe wrote: >>>>>> If we attempt a direct issue to a SCSI device, and it returns BUSY, then >>>>>> we queue the request up normally. However, the SCSI layer may have >>>>>> already setup SG tables etc for this particular command. If we later >>>>>> merge with this request, then the old tables are no longer valid. Once >>>>>> we issue the IO, we only read/write the original part of the request, >>>>>> not the new state of it. >>>>>> >>>>>> This causes data corruption, and is most often noticed with the file >>>>>> system complaining about the just read data being invalid: >>>>>> >>>>>> [ 235.934465] EXT4-fs error (device sda1): ext4_iget:4831: inode #7142: comm dpkg-query: bad extra_isize 24937 (inode size 256) >>>>>> >>>>>> because most of it is garbage... >>>>>> >>>>>> This doesn't happen from the normal issue path, as we will simply defer >>>>>> the request to the hardware queue dispatch list if we fail. Once it's on >>>>>> the dispatch list, we never merge with it. >>>>>> >>>>>> Fix this from the direct issue path by flagging the request as >>>>>> REQ_NOMERGE so we don't change the size of it before issue. >>>>>> >>>>>> See also: >>>>>> https://bugzilla.kernel.org/show_bug.cgi?id=201685 >>>>>> >>>>>> Fixes: 6ce3dd6eec1 ("blk-mq: issue directly if hw queue isn't busy in case of 'none'") >>>>>> Signed-off-by: Jens Axboe <axboe@kernel.dk> >>>>>> >>>>>> --- >>>>>> >>>>>> diff --git a/block/blk-mq.c b/block/blk-mq.c >>>>>> index 3f91c6e5b17a..d8f518c6ea38 100644 >>>>>> --- a/block/blk-mq.c >>>>>> +++ b/block/blk-mq.c >>>>>> @@ -1715,6 +1715,15 @@ static blk_status_t __blk_mq_issue_directly(struct blk_mq_hw_ctx *hctx, >>>>>> break; >>>>>> case BLK_STS_RESOURCE: >>>>>> case BLK_STS_DEV_RESOURCE: >>>>>> + /* >>>>>> + * If direct dispatch fails, we cannot allow any merging on >>>>>> + * this IO. Drivers (like SCSI) may have set up permanent state >>>>>> + * for this request, like SG tables and mappings, and if we >>>>>> + * merge to it later on then we'll still only do IO to the >>>>>> + * original part. >>>>>> + */ >>>>>> + rq->cmd_flags |= REQ_NOMERGE; >>>>>> + >>>>>> blk_mq_update_dispatch_busy(hctx, true); >>>>>> __blk_mq_requeue_request(rq); >>>>>> break; >>>>>> >>>>> >>>>> Not sure it is enough to just mark it as NOMERGE, for example, driver >>>>> may have setup the .special_vec for discard, and NOMERGE may not prevent >>>>> request from entering elevator queue completely. Cause 'rq.rb_node' and >>>>> 'rq.special_vec' share same space. >>>> >>>> We should rather limit the scope of the direct dispatch instead. It >>>> doesn't make sense to do for anything but read/write anyway. >>> >>> discard is kind of write, and it isn't treated very specially in make >>> request path, except for multi-range discard. >> >> The point of direct dispatch is to reduce latencies for requests, >> discards are so damn slow on ALL devices anyway that it doesn't make any >> sense to try direct dispatch to begin with, regardless of whether it >> possible or not. > > SCSI MQ device may benefit from direct dispatch from reduced lock contention. > >> >>>>> So how about inserting this request via blk_mq_request_bypass_insert() >>>>> in case that direct issue returns BUSY? 
Then it is invariant that >>>>> any request queued via .queue_rq() won't enter scheduler queue. >>>> >>>> I did consider this, but I didn't want to experiment with exercising >>>> a new path for an important bug fix. You do realize that your original >>>> patch has been corrupting data for months? I think a little caution >>>> is in order here. >>> >>> But marking NOMERGE still may have a hole on re-insert discard request as >>> mentioned above. >> >> What I said was further limit the scope of direct dispatch, which means >> not allowing anything that isn't a read/write. > > IMO, the conservative approach is to take the one used in legacy io > path, in which it is never allowed to re-insert queued request to > scheduler queue except for requeue, however RQF_DONTPREP is cleared > before requeuing request to scheduler. I don't agree. Whether you agree or not, I'm doing the simple fix for 4.20 and queueing it up for 4.21 as well. Then I hope Jianchao can rework his direct dispatch cleanup patches for 4.21 on top of that, and then we can experiment on top of that with other solutions. >>> Given we never allow to re-insert queued request to scheduler queue >>> except for 6ce3dd6eec1, I think it is the correct thing to do, and the >>> fix is simple too. >> >> As I said, it's not the time to experiment. This issue has been there >> since 4.19-rc1. The alternative is yanking both those patches, and then >> looking at it later when the direct issue path has been cleaned up >> first. > > The issue should have been there from v4.1, especially after commit > f984df1f0f7 ("blk-mq: do limited block plug for multiple queue case"), > which is the 1st one to re-insert the queued request into scheduler > queue. You can hand wave all you want, fact is that people have been hitting this left and right since 4.19 was released, and it was even bisected down to the specific two direct issue patches you did. This isn't some theoretical debate. You've seen the bugzilla, I suggest you go read the whole thing.
On 12/4/18 8:03 PM, Ming Lei wrote: > On Wed, Dec 05, 2018 at 10:58:02AM +0800, Ming Lei wrote: >> On Tue, Dec 04, 2018 at 07:30:24PM -0700, Jens Axboe wrote: >>> On 12/4/18 7:27 PM, Ming Lei wrote: >>>> On Tue, Dec 04, 2018 at 07:16:11PM -0700, Jens Axboe wrote: >>>>> On 12/4/18 6:37 PM, Ming Lei wrote: >>>>>> On Tue, Dec 04, 2018 at 03:47:46PM -0700, Jens Axboe wrote: >>>>>>> If we attempt a direct issue to a SCSI device, and it returns BUSY, then >>>>>>> we queue the request up normally. However, the SCSI layer may have >>>>>>> already setup SG tables etc for this particular command. If we later >>>>>>> merge with this request, then the old tables are no longer valid. Once >>>>>>> we issue the IO, we only read/write the original part of the request, >>>>>>> not the new state of it. >>>>>>> >>>>>>> This causes data corruption, and is most often noticed with the file >>>>>>> system complaining about the just read data being invalid: >>>>>>> >>>>>>> [ 235.934465] EXT4-fs error (device sda1): ext4_iget:4831: inode #7142: comm dpkg-query: bad extra_isize 24937 (inode size 256) >>>>>>> >>>>>>> because most of it is garbage... >>>>>>> >>>>>>> This doesn't happen from the normal issue path, as we will simply defer >>>>>>> the request to the hardware queue dispatch list if we fail. Once it's on >>>>>>> the dispatch list, we never merge with it. >>>>>>> >>>>>>> Fix this from the direct issue path by flagging the request as >>>>>>> REQ_NOMERGE so we don't change the size of it before issue. >>>>>>> >>>>>>> See also: >>>>>>> https://bugzilla.kernel.org/show_bug.cgi?id=201685 >>>>>>> >>>>>>> Fixes: 6ce3dd6eec1 ("blk-mq: issue directly if hw queue isn't busy in case of 'none'") >>>>>>> Signed-off-by: Jens Axboe <axboe@kernel.dk> >>>>>>> >>>>>>> --- >>>>>>> >>>>>>> diff --git a/block/blk-mq.c b/block/blk-mq.c >>>>>>> index 3f91c6e5b17a..d8f518c6ea38 100644 >>>>>>> --- a/block/blk-mq.c >>>>>>> +++ b/block/blk-mq.c >>>>>>> @@ -1715,6 +1715,15 @@ static blk_status_t __blk_mq_issue_directly(struct blk_mq_hw_ctx *hctx, >>>>>>> break; >>>>>>> case BLK_STS_RESOURCE: >>>>>>> case BLK_STS_DEV_RESOURCE: >>>>>>> + /* >>>>>>> + * If direct dispatch fails, we cannot allow any merging on >>>>>>> + * this IO. Drivers (like SCSI) may have set up permanent state >>>>>>> + * for this request, like SG tables and mappings, and if we >>>>>>> + * merge to it later on then we'll still only do IO to the >>>>>>> + * original part. >>>>>>> + */ >>>>>>> + rq->cmd_flags |= REQ_NOMERGE; >>>>>>> + >>>>>>> blk_mq_update_dispatch_busy(hctx, true); >>>>>>> __blk_mq_requeue_request(rq); >>>>>>> break; >>>>>>> >>>>>> >>>>>> Not sure it is enough to just mark it as NOMERGE, for example, driver >>>>>> may have setup the .special_vec for discard, and NOMERGE may not prevent >>>>>> request from entering elevator queue completely. Cause 'rq.rb_node' and >>>>>> 'rq.special_vec' share same space. >>>>> >>>>> We should rather limit the scope of the direct dispatch instead. It >>>>> doesn't make sense to do for anything but read/write anyway. >>>> >>>> discard is kind of write, and it isn't treated very specially in make >>>> request path, except for multi-range discard. >>> >>> The point of direct dispatch is to reduce latencies for requests, >>> discards are so damn slow on ALL devices anyway that it doesn't make any >>> sense to try direct dispatch to begin with, regardless of whether it >>> possible or not. >> >> SCSI MQ device may benefit from direct dispatch from reduced lock contention. 
>> >>> >>>>>> So how about inserting this request via blk_mq_request_bypass_insert() >>>>>> in case that direct issue returns BUSY? Then it is invariant that >>>>>> any request queued via .queue_rq() won't enter scheduler queue. >>>>> >>>>> I did consider this, but I didn't want to experiment with exercising >>>>> a new path for an important bug fix. You do realize that your original >>>>> patch has been corrupting data for months? I think a little caution >>>>> is in order here. >>>> >>>> But marking NOMERGE still may have a hole on re-insert discard request as >>>> mentioned above. >>> >>> What I said was further limit the scope of direct dispatch, which means >>> not allowing anything that isn't a read/write. >> >> IMO, the conservative approach is to take the one used in legacy io >> path, in which it is never allowed to re-insert queued request to >> scheduler queue except for requeue, however RQF_DONTPREP is cleared >> before requeuing request to scheduler. >> >>> >>>> Given we never allow to re-insert queued request to scheduler queue >>>> except for 6ce3dd6eec1, I think it is the correct thing to do, and the >>>> fix is simple too. >>> >>> As I said, it's not the time to experiment. This issue has been there >>> since 4.19-rc1. The alternative is yanking both those patches, and then >>> looking at it later when the direct issue path has been cleaned up >>> first. >> >> The issue should have been there from v4.1, especially after commit >> f984df1f0f7 ("blk-mq: do limited block plug for multiple queue case"), >> which is the 1st one to re-insert the queued request into scheduler >> queue. > > But at that time, there isn't io scheduler for MQ, so in theory the > issue should be there since v4.11, especially 945ffb60c11d ("mq-deadline: > add blk-mq adaptation of the deadline IO scheduler"). Ming, I'm getting really tired of this. As mentioned in the other email, we're not having a theoretical or hypothetical debate here. The facts are on the table, there's no point in trying to shift blame. We need to deal with the current situation.
On Tue, Dec 04, 2018 at 03:47:46PM -0700, Jens Axboe wrote: > If we attempt a direct issue to a SCSI device, and it returns BUSY, then > we queue the request up normally. However, the SCSI layer may have > already setup SG tables etc for this particular command. If we later > merge with this request, then the old tables are no longer valid. Once > we issue the IO, we only read/write the original part of the request, > not the new state of it. > > This causes data corruption, and is most often noticed with the file > system complaining about the just read data being invalid: Looks good as the first quick fix: Reviewed-by: Christoph Hellwig <hch@lst.de> But as we discussed, I think we need to be even more careful, including the read/write-only check from later in the thread or some higher-level approach.
On 12/5/18 7:41 AM, Christoph Hellwig wrote: > On Tue, Dec 04, 2018 at 03:47:46PM -0700, Jens Axboe wrote: >> If we attempt a direct issue to a SCSI device, and it returns BUSY, then >> we queue the request up normally. However, the SCSI layer may have >> already setup SG tables etc for this particular command. If we later >> merge with this request, then the old tables are no longer valid. Once >> we issue the IO, we only read/write the original part of the request, >> not the new state of it. >> >> This causes data corruption, and is most often noticed with the file >> system complaining about the just read data being invalid: > > Looks good as the first quick fix: > > Reviewed-by: Christoph Hellwig <hch@lst.de> > > But as we discussed I think we need to be even more careful. including > the only read and write check below in the thread or some higher level > approach. I agree, I'm reviewing the patches from Jianchao now and will do something cleaner on top of that.
On Tue, Dec 04, 2018 at 07:25:05PM -0700, Jens Axboe wrote: > On 12/4/18 6:38 PM, Guenter Roeck wrote: > > On Tue, Dec 04, 2018 at 03:47:46PM -0700, Jens Axboe wrote: > >> If we attempt a direct issue to a SCSI device, and it returns BUSY, then > >> we queue the request up normally. However, the SCSI layer may have > >> already setup SG tables etc for this particular command. If we later > >> merge with this request, then the old tables are no longer valid. Once > >> we issue the IO, we only read/write the original part of the request, > >> not the new state of it. > >> > >> This causes data corruption, and is most often noticed with the file > >> system complaining about the just read data being invalid: > >> > >> [ 235.934465] EXT4-fs error (device sda1): ext4_iget:4831: inode #7142: comm dpkg-query: bad extra_isize 24937 (inode size 256) > >> > >> because most of it is garbage... > >> > >> This doesn't happen from the normal issue path, as we will simply defer > >> the request to the hardware queue dispatch list if we fail. Once it's on > >> the dispatch list, we never merge with it. > >> > >> Fix this from the direct issue path by flagging the request as > >> REQ_NOMERGE so we don't change the size of it before issue. > >> > >> See also: > >> https://bugzilla.kernel.org/show_bug.cgi?id=201685 > >> > >> Fixes: 6ce3dd6eec1 ("blk-mq: issue directly if hw queue isn't busy in case of 'none'") > >> Signed-off-by: Jens Axboe <axboe@kernel.dk> > > > > Tested-by: Guenter Roeck <linux@roeck-us.net> > > > > ... on two systems affected by the problem. > > Thanks for testing! And for being persistent in reproducing and > providing clues for getting this nailed. > My pleasure. I see that there is some discussion about this patch. Unfortunately, everyone running a 4.19 or later kernel is at serious risk of data corruption. Given that, if this patch doesn't make it upstream for one reason or another, would it be possible to at least revert the two patches introducing the problem until this is sorted out for good ? If this is not acceptable either, maybe mark blk-mq as broken ? After all, it _is_ broken. This is even more true if it turns out that a problem may exist since 4.1, as suggested in the discussion. Also, it seems to me that even with this problem fixed, blk-mq may not be ready for primetime after all. With that in mind, maybe commit d5038a13eca72 ("scsi: core: switch to scsi-mq by default") was a bit premature. Should that be reverted ? Thanks, Guenter
On 12/5/18 10:55 AM, Guenter Roeck wrote: > On Tue, Dec 04, 2018 at 07:25:05PM -0700, Jens Axboe wrote: >> On 12/4/18 6:38 PM, Guenter Roeck wrote: >>> On Tue, Dec 04, 2018 at 03:47:46PM -0700, Jens Axboe wrote: >>>> If we attempt a direct issue to a SCSI device, and it returns BUSY, then >>>> we queue the request up normally. However, the SCSI layer may have >>>> already setup SG tables etc for this particular command. If we later >>>> merge with this request, then the old tables are no longer valid. Once >>>> we issue the IO, we only read/write the original part of the request, >>>> not the new state of it. >>>> >>>> This causes data corruption, and is most often noticed with the file >>>> system complaining about the just read data being invalid: >>>> >>>> [ 235.934465] EXT4-fs error (device sda1): ext4_iget:4831: inode #7142: comm dpkg-query: bad extra_isize 24937 (inode size 256) >>>> >>>> because most of it is garbage... >>>> >>>> This doesn't happen from the normal issue path, as we will simply defer >>>> the request to the hardware queue dispatch list if we fail. Once it's on >>>> the dispatch list, we never merge with it. >>>> >>>> Fix this from the direct issue path by flagging the request as >>>> REQ_NOMERGE so we don't change the size of it before issue. >>>> >>>> See also: >>>> https://bugzilla.kernel.org/show_bug.cgi?id=201685 >>>> >>>> Fixes: 6ce3dd6eec1 ("blk-mq: issue directly if hw queue isn't busy in case of 'none'") >>>> Signed-off-by: Jens Axboe <axboe@kernel.dk> >>> >>> Tested-by: Guenter Roeck <linux@roeck-us.net> >>> >>> ... on two systems affected by the problem. >> >> Thanks for testing! And for being persistent in reproducing and >> providing clues for getting this nailed. >> > > My pleasure. > > I see that there is some discussion about this patch. > > Unfortunately, everyone running a 4.19 or later kernel is at serious > risk of data corruption. Given that, if this patch doesn't make it > upstream for one reason or another, would it be possible to at least > revert the two patches introducing the problem until this is sorted > out for good ? If this is not acceptable either, maybe mark blk-mq > as broken ? After all, it _is_ broken. This is even more true if it > turns out that a problem may exist since 4.1, as suggested in the > discussion. It is queued up, it'll go upstream later today. > Also, it seems to me that even with this problem fixed, blk-mq may not > be ready for primetime after all. With that in mind, maybe commit > d5038a13eca72 ("scsi: core: switch to scsi-mq by default") was a > bit premature. Should that be reverted ? I have to strongly disagree with that, the timing is just unfortunate. There are literally millions of machines running blk-mq/scsi-mq, and this is the only hickup we've had. So I want to put this one to rest once and for all, there's absolutely no reason not to continue with what we've planned.
On Wed, Dec 05, 2018 at 10:59:21AM -0700, Jens Axboe wrote: [ ... ] > > > Also, it seems to me that even with this problem fixed, blk-mq may not > > be ready for primetime after all. With that in mind, maybe commit > > d5038a13eca72 ("scsi: core: switch to scsi-mq by default") was a > > bit premature. Should that be reverted ? > > I have to strongly disagree with that, the timing is just unfortunate. > There are literally millions of machines running blk-mq/scsi-mq, and > this is the only hickup we've had. So I want to put this one to rest > once and for all, there's absolutely no reason not to continue with > what we've planned. > Guess we have to agree to disagree. In my opinion, for infrastructure as critical as this, a single hickup is one hickup too many. Not that I would describe this as a hickup in the first place; I would describe it as a major disaster. I also don't think the timing is unfortunate. If anything, it proves my point. Thanks, Guenter
On 12/5/18 12:09 PM, Guenter Roeck wrote: > On Wed, Dec 05, 2018 at 10:59:21AM -0700, Jens Axboe wrote: > [ ... ] >> >>> Also, it seems to me that even with this problem fixed, blk-mq may not >>> be ready for primetime after all. With that in mind, maybe commit >>> d5038a13eca72 ("scsi: core: switch to scsi-mq by default") was a >>> bit premature. Should that be reverted ? >> >> I have to strongly disagree with that, the timing is just unfortunate. >> There are literally millions of machines running blk-mq/scsi-mq, and >> this is the only hickup we've had. So I want to put this one to rest >> once and for all, there's absolutely no reason not to continue with >> what we've planned. >> > > Guess we have to agree to disagree. In my opinion, for infrastructure > as critical as this, a single hickup is one hickup too many. Not that > I would describe this as a hickup in the first place; I would describe > it as a major disaster. Don't get me wrong, I don't mean to use hickup in a diminishing fashion, this was by all means a disaster for the ones hit by it. But if you look at the scope of how many folks are using blk-mq/scsi-mq and have been for years, we're really talking about a tiny tiny percentage here. This could just as easily have happened with the old IO stack. The bug was a freak accident, and even with full knowledge of why it happened, I'm still having an extraordinarily hard time triggering it at will on my test boxes. As with any disaster, it's usually a combination of multiple things that go wrong, and this one is no different. The folks that hit this generally hit it pretty easily, but (by far) the majority would never hit it. Bugs happen, whether you like it or not. They happen in file systems, memory management, and they happen in storage. Things are continually developed, and that sometimes introduces bugs. We do our best to ensure that doesn't happen, but sometimes freak accidents like this happen. I think my track record of decades of work speaks for itself there; it's not like this is a frequent occurrence. And if this particular issue hadn't been well understood and we had instead just reverted the offending commits, then I would agree with you. But that's not the case. I'm very confident in the stability, among other things, of blk-mq and the drivers that utilize it. Most of the storage drivers are using it today, and have been for a long time.
On Wed, Dec 05, 2018 at 11:03:01AM +0800, Ming Lei wrote: > > But at that time, there isn't io scheduler for MQ, so in theory the > issue should be there since v4.11, especially 945ffb60c11d ("mq-deadline: > add blk-mq adaptation of the deadline IO scheduler"). Hi Ming, How serious were you about this issue being there (theoretically) since 4.11? Can you talk about how it might get triggered, and how we can test for it? The reason why I ask is that we're trying to track down a mysterious file system corruption problem on a 4.14.x stable kernel. The symptoms are *very* eerily similar to kernel bugzilla #201685. The problem is super-rare --- roughly once a week out of a population of about 2500 systems. The workload is NFS serving. Unfortunately, since 4.14.63 we can no longer disable blk-mq for the virtio-scsi driver, thanks to the commit b5b6e8c8d3b4 ("scsi: virtio_scsi: fix IO hang caused by automatic irq vector affinity") getting backported into 4.14.63 as commit 70b522f163bbb32. We're considering reverting this patch in our 4.14 LTS kernel, and seeing whether it makes the problem go away. Is there anything else you might suggest? Thanks, - Ted P.S. Unlike the repros that users were seeing in #201685, we *did* have an I/O scheduler enabled --- it was mq-deadline. But right now, given your comments, and the corruptions that we're seeing, I'm not feeling very warm and fuzzy about blk-mq. :-( :-( :-(
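For readers following along: whether a disk is on blk-mq, and which scheduler it uses, can be checked from sysfs; on kernels of this era scsi-mq was also a module-parameter toggle. A quick illustration, assuming a SCSI disk named sda:

# 'Y' means the SCSI midlayer is using blk-mq (the scsi_mod.use_blk_mq toggle)
cat /sys/module/scsi_mod/parameters/use_blk_mq
# the active scheduler is shown in brackets; "none" and "mq-deadline"
# are blk-mq schedulers
cat /sys/block/sda/queue/scheduler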
On 12/6/18 7:46 PM, Theodore Y. Ts'o wrote: > On Wed, Dec 05, 2018 at 11:03:01AM +0800, Ming Lei wrote: >> >> But at that time, there isn't io scheduler for MQ, so in theory the >> issue should be there since v4.11, especially 945ffb60c11d ("mq-deadline: >> add blk-mq adaptation of the deadline IO scheduler"). > > Hi Ming, > > How serious were you about this issue being there (theoretically) since > 4.11? Can you talk about how it might get triggered, and how we can test > for it? The reason why I ask is that we're trying to track down a > mysterious file system corruption problem on a 4.14.x stable kernel. The > symptoms are *very* eerily similar to kernel bugzilla #201685. > > The problem is super-rare --- roughly once a week out of a population of > about 2500 systems. The workload is NFS serving. Unfortunately, since > 4.14.63 we can no longer disable blk-mq for the virtio-scsi driver, > thanks to the commit b5b6e8c8d3b4 ("scsi: virtio_scsi: fix IO hang caused > by automatic irq vector affinity") getting backported into 4.14.63 as > commit 70b522f163bbb32. > > We're considering reverting this patch in our 4.14 LTS kernel, and > seeing whether it makes the problem go away. Is there anything else > you might suggest? We should just make SCSI do the right thing, which is to unprep if it sees BUSY and prep again next time. Otherwise I fear the direct dispatch isn't going to be super useful, if a failed direct dispatch prevents future merging. This would be a lot less error-prone for other cases as well.
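A sketch of what Jens suggests, written as a hypothetical variant of the __blk_mq_issue_directly() error path quoted earlier; the unprep callback is invented for illustration, as no such mq_ops hook existed at the time:

	case BLK_STS_RESOURCE:
	case BLK_STS_DEV_RESOURCE:
		/*
		 * Hypothetical: let the driver tear down SG tables and
		 * mappings on BUSY, so the request can merge safely and
		 * is simply prepped again on the next dispatch attempt.
		 */
		if (rq->q->mq_ops->unprep_request)	/* invented hook */
			rq->q->mq_ops->unprep_request(hctx, rq);
		rq->rq_flags &= ~RQF_DONTPREP;
		blk_mq_update_dispatch_busy(hctx, true);
		__blk_mq_requeue_request(rq);
		break;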
On Thu, Dec 06, 2018 at 09:46:42PM -0500, Theodore Y. Ts'o wrote:
> On Wed, Dec 05, 2018 at 11:03:01AM +0800, Ming Lei wrote:
>>
>> But at that time, there isn't io scheduler for MQ, so in theory the
>> issue should be there since v4.11, especially 945ffb60c11d ("mq-deadline:
>> add blk-mq adaptation of the deadline IO scheduler").
>
> Hi Ming,
>
> How serious were you about this issue being there (theoretically)
> since 4.11? Can you talk about how it might get triggered, and how we
> can test for it? The reason I ask is that we're trying to track down a
> mysterious file system corruption problem on a 4.14.x stable kernel.
> The symptoms are *very* eerily similar to kernel bugzilla #201685.

Hi Theodore,

It is just a theoretical analysis.

blk_mq_try_issue_directly() is called from two branches of
blk_mq_make_request(), both only on real MQ disks. IO merge can happen
with either none or a real IO scheduler, so in theory the risk has
been there since v4.11; however, IO merge on the sw queue didn't
actually work for quite a while, until it was fixed by
ab42f35d9cb5ac49b5a2.

As Jens mentioned in bugzilla, several conditions are required to
trigger the issue:

- an MQ device

- queue busy can be triggered: this is hard to trigger on NVMe PCI,
  but may be possible on NVMe FC, and can be quite easy to trigger on
  SCSI devices (there are several MQ SCSI HBAs: qlogic FC,
  megaraid_sas, ...)

- IO merge is enabled

I have set up scsi_debug in the following way:

	modprobe scsi_debug dev_size_mb=4096 clustering=1 \
		max_luns=1 submit_queues=2 max_queue=2

- submit_queues=2 sets this disk up as MQ
- max_queue=2 should trigger the queue-busy condition easily

and ran some write IO on ext4 over the disk (fio, kernel building,
...) for some time, but still couldn't trigger the data corruption
even once.

I should have created more LUNs, so that the queue becomes busy more
easily; will do that soon.

> The problem is super-rare --- roughly once a week out of a population
> of about 2500 systems. The workload is NFS serving. Unfortunately,
> since 4.14.63 we can no longer disable blk-mq for the virtio-scsi
> driver, thanks to the commit b5b6e8c8d3b4 ("scsi: virtio_scsi: fix IO
> hang caused by automatic irq vector affinity") getting backported into
> 4.14.63 as commit 70b522f163bbb32.

virtio_scsi supports multi-queue mode; if that is enabled in your
setup, you may try single queue and see if it makes a difference. If
multi-queue mode isn't enabled, your problem should be different from
this one. I remember that multi-queue mode isn't enabled on
virtio-scsi in GCE.

> We're considering reverting this patch in our 4.14 LTS kernel and
> seeing whether it makes the problem go away. Is there anything else
> you might suggest?

The IO hang is only triggered on machines with a special CPU topology,
so it should be fine to revert b5b6e8c8d3b4 on a normal VM. No other
suggestions yet.

Thanks,
Ming
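For reference, here is a heavily abridged sketch of the two branches
mentioned above, modeled on the tail of blk_mq_make_request() around
v4.20; plug and flush handling is elided and the conditions are
simplified, so treat this as an illustration rather than the exact
upstream code.

	/*
	 * Abridged: both marked branches hand a freshly built request
	 * straight to the driver via blk_mq_try_issue_directly(), which
	 * is where a BUSY return can leave driver state behind.
	 */
	if (q->nr_hw_queues > 1 && is_sync) {
		/* branch 1: sync IO on a multi hw-queue device */
		blk_mq_bio_to_request(rq, bio);
		blk_mq_try_issue_directly(data.hctx, rq, &cookie);
	} else if (!q->elevator && !data.hctx->dispatch_busy) {
		/* branch 2, added by 6ce3dd6eec1: no scheduler, idle queue */
		blk_mq_bio_to_request(rq, bio);
		blk_mq_try_issue_directly(data.hctx, rq, &cookie);
	} else {
		/* otherwise go through the scheduler insert path */
		blk_mq_bio_to_request(rq, bio);
		blk_mq_sched_insert_request(rq, false, true, true);
	}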
On Fri, Dec 07, 2018 at 11:44:39AM +0800, Ming Lei wrote:
> On Thu, Dec 06, 2018 at 09:46:42PM -0500, Theodore Y. Ts'o wrote:
> [ ... ]
>
> I have set up scsi_debug in the following way:
>
> 	modprobe scsi_debug dev_size_mb=4096 clustering=1 \
> 		max_luns=1 submit_queues=2 max_queue=2
>
> - submit_queues=2 sets this disk up as MQ
> - max_queue=2 should trigger the queue-busy condition easily
>
> and ran some write IO on ext4 over the disk (fio, kernel building,
> ...) for some time, but still couldn't trigger the data corruption
> even once.
>
> I should have created more LUNs, so that the queue becomes busy more
> easily; will do that soon.

Actually, I should have used SDEBUG_OPT_HOST_BUSY to simulate the
queue-busy condition.

Thanks,
Ming
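For context, SDEBUG_OPT_HOST_BUSY is a bit in scsi_debug's opts module
parameter bitmask that makes the driver's queuecommand path report a
busy host, which the SCSI midlayer turns into the BLK_STS_RESOURCE
requeue this bug needs. A rough sketch of the effect follows; the
helper name here is hypothetical, and the real check in
drivers/scsi/scsi_debug.c is more involved.

	/*
	 * Rough sketch: with SDEBUG_OPT_HOST_BUSY set in the opts
	 * bitmask, queuecommand reports a busy host instead of queuing
	 * the command, simulating the BUSY condition at will.
	 */
	static int sdebug_maybe_busy(unsigned int sdebug_opts)
	{
		if (sdebug_opts & SDEBUG_OPT_HOST_BUSY)
			return SCSI_MLQUEUE_HOST_BUSY;	/* midlayer requeues */
		return 0;				/* dispatch normally */
	}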
If we attempt a direct issue to a SCSI device, and it returns BUSY, then
we queue the request up normally. However, the SCSI layer may have
already setup SG tables etc for this particular command. If we later
merge with this request, then the old tables are no longer valid. Once
we issue the IO, we only read/write the original part of the request,
not the new state of it.

This causes data corruption, and is most often noticed with the file
system complaining about the just read data being invalid:

[  235.934465] EXT4-fs error (device sda1): ext4_iget:4831: inode #7142: comm dpkg-query: bad extra_isize 24937 (inode size 256)

because most of it is garbage...

This doesn't happen from the normal issue path, as we will simply defer
the request to the hardware queue dispatch list if we fail. Once it's on
the dispatch list, we never merge with it.

Fix this from the direct issue path by flagging the request as
REQ_NOMERGE so we don't change the size of it before issue.

See also: https://bugzilla.kernel.org/show_bug.cgi?id=201685

Fixes: 6ce3dd6eec1 ("blk-mq: issue directly if hw queue isn't busy in case of 'none'")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 3f91c6e5b17a..d8f518c6ea38 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1715,6 +1715,15 @@ static blk_status_t __blk_mq_issue_directly(struct blk_mq_hw_ctx *hctx,
 		break;
 	case BLK_STS_RESOURCE:
 	case BLK_STS_DEV_RESOURCE:
+		/*
+		 * If direct dispatch fails, we cannot allow any merging on
+		 * this IO. Drivers (like SCSI) may have set up permanent state
+		 * for this request, like SG tables and mappings, and if we
+		 * merge to it later on then we'll still only do IO to the
+		 * original part.
+		 */
+		rq->cmd_flags |= REQ_NOMERGE;
+
 		blk_mq_update_dispatch_busy(hctx, true);
 		__blk_mq_requeue_request(rq);
 		break;
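On the merge side, a single flag is enough because every merge attempt
first checks rq_mergeable(), which rejects requests carrying any of
REQ_NOMERGE_FLAGS (including REQ_NOMERGE). Abridged from this era's
include/linux/blkdev.h:

	/* Abridged: a request flagged REQ_NOMERGE is never merge-eligible. */
	static inline bool rq_mergeable(struct request *rq)
	{
		if (blk_rq_is_passthrough(rq))
			return false;
		if (req_op(rq) == REQ_OP_FLUSH)
			return false;
		if (rq->cmd_flags & REQ_NOMERGE_FLAGS)
			return false;
		if (rq->rq_flags & RQF_NOMERGE_FLAGS)
			return false;
		return true;
	}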