| Message ID | 87ft85osn6.fsf@mail.parknet.co.jp (mailing list archive) |
|---|---|
| State | New, archived |
| Series | fat: Avoid oops when bdi->io_pages==0 |
On Sun, Aug 30, 2020 at 09:59:41AM +0900, OGAWA Hirofumi wrote:
> On one system, there was bdi->io_pages==0. This seems to be a bug in
> a driver somewhere, which should be fixed. Anyway, it is better to
> avoid the divide-by-zero Oops.
>
> So check for it here.
>
> Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
> Cc: <stable@vger.kernel.org>
> ---
>  fs/fat/fatent.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/fs/fat/fatent.c b/fs/fat/fatent.c
> index f7e3304..98a1c4f 100644
> --- a/fs/fat/fatent.c	2020-08-30 06:52:47.251564566 +0900
> +++ b/fs/fat/fatent.c	2020-08-30 06:54:05.838319213 +0900
> @@ -660,7 +660,7 @@ static void fat_ra_init(struct super_blo
>  	if (fatent->entry >= ent_limit)
>  		return;
>
> -	if (ra_pages > sb->s_bdi->io_pages)
> +	if (sb->s_bdi->io_pages && ra_pages > sb->s_bdi->io_pages)
>  		ra_pages = rounddown(ra_pages, sb->s_bdi->io_pages);

Wait, rounddown?  ->io_pages is supposed to be the maximum number of
pages to readahead.  Shouldn't this be max() instead of rounddown()?
Matthew Wilcox <willy@infradead.org> writes:

> On Sun, Aug 30, 2020 at 09:59:41AM +0900, OGAWA Hirofumi wrote:
>> On one system, there was bdi->io_pages==0. This seems to be a bug in
>> a driver somewhere, which should be fixed. Anyway, it is better to
>> avoid the divide-by-zero Oops.
>>
>> So check for it here.
>>
>> Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
>> Cc: <stable@vger.kernel.org>
>> ---
>>  fs/fat/fatent.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/fs/fat/fatent.c b/fs/fat/fatent.c
>> index f7e3304..98a1c4f 100644
>> --- a/fs/fat/fatent.c	2020-08-30 06:52:47.251564566 +0900
>> +++ b/fs/fat/fatent.c	2020-08-30 06:54:05.838319213 +0900
>> @@ -660,7 +660,7 @@ static void fat_ra_init(struct super_blo
>>  	if (fatent->entry >= ent_limit)
>>  		return;
>>
>> -	if (ra_pages > sb->s_bdi->io_pages)
>> +	if (sb->s_bdi->io_pages && ra_pages > sb->s_bdi->io_pages)
>>  		ra_pages = rounddown(ra_pages, sb->s_bdi->io_pages);
>
> Wait, rounddown?  ->io_pages is supposed to be the maximum number of
> pages to readahead.  Shouldn't this be max() instead of rounddown()?

Hm, io_pages is limited by the driver setting too, and io_pages can be
lower than ra_pages, e.g. on USB storage.

Assuming ra_pages is the user's intended readahead window: if io_pages
is lower than ra_pages, this aligns ra_pages to io_pages-sized chunks,
but does not go above ra_pages. Because if the block layer splits I/O
requests at the hard limit, the I/O is not optimal.

That is the intent; I may be misunderstanding, though.

Thanks.
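To make the chunk alignment concrete, here is a minimal user-space
sketch with hypothetical numbers (a 32-page window and a 10-page
optimal I/O size, neither taken from the thread); the rounddown()
macro is simplified from the kernel's:

	#include <stdio.h>

	/* Simplified form of the kernel's rounddown() macro. */
	#define rounddown(x, y)	((x) - ((x) % (y)))

	int main(void)
	{
		unsigned long ra_pages = 32;	/* user's readahead window */
		unsigned long io_pages = 10;	/* device's optimal request size */

		/* Align the window to whole io_pages chunks without growing
		 * it: 32 rounds down to 30, i.e. three full 10-page requests
		 * instead of 10+10+10+2 after block-layer splitting. */
		if (ra_pages > io_pages)
			ra_pages = rounddown(ra_pages, io_pages);

		printf("ra_pages = %lu\n", ra_pages);	/* prints 30 */
		return 0;
	}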
On Sun, Aug 30, 2020 at 10:54:35AM +0900, OGAWA Hirofumi wrote:
> Matthew Wilcox <willy@infradead.org> writes:
>
>> On Sun, Aug 30, 2020 at 09:59:41AM +0900, OGAWA Hirofumi wrote:
>>> On one system, there was bdi->io_pages==0. This seems to be a bug in
>>> a driver somewhere, which should be fixed. Anyway, it is better to
>>> avoid the divide-by-zero Oops.
[...]
>> Wait, rounddown?  ->io_pages is supposed to be the maximum number of
>> pages to readahead.  Shouldn't this be max() instead of rounddown()?

Sorry, I meant 'min', not 'max'.

> Hm, io_pages is limited by the driver setting too, and io_pages can be
> lower than ra_pages, e.g. on USB storage.
>
> Assuming ra_pages is the user's intended readahead window: if io_pages
> is lower than ra_pages, this aligns ra_pages to io_pages-sized chunks,
> but does not go above ra_pages. Because if the block layer splits I/O
> requests at the hard limit, the I/O is not optimal.
>
> That is the intent; I may be misunderstanding, though.

Looking at this some more, I'm not sure it makes sense to consult
->io_pages at all.  I see how it gets set to 0 -- the admin can write
'1' to /sys/block/<device>/queue/max_sectors_kb and that gets turned
into 0 in ->io_pages.  But I'm not sure it makes any sense to respect
that.

Looking at mm/readahead.c, all it does is limit the size of a read
request which exceeds the current readahead window.  It's not used to
limit the readahead window itself.  For example:

	unsigned long max_pages = ra->ra_pages;
	...
	if (req_size > max_pages && bdi->io_pages > max_pages)
		max_pages = min(req_size, bdi->io_pages);

Setting io_pages below ra_pages has no effect.  So maybe fat should
also disregard it?
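A self-contained sketch of that clamp, with hypothetical values,
showing why an io_pages below ra_pages is a no-op for the pagecache;
the helper name clamp_req() is made up for illustration:

	#include <stdio.h>

	static unsigned long min(unsigned long a, unsigned long b)
	{
		return a < b ? a : b;
	}

	/* Mirrors the mm/readahead.c logic quoted above. */
	static unsigned long clamp_req(unsigned long ra_pages,
				       unsigned long io_pages,
				       unsigned long req_size)
	{
		unsigned long max_pages = ra_pages;

		if (req_size > max_pages && io_pages > max_pages)
			max_pages = min(req_size, io_pages);
		return max_pages;
	}

	int main(void)
	{
		/* io_pages (16) below ra_pages (32): no effect, stays 32. */
		printf("%lu\n", clamp_req(32, 16, 64));
		/* io_pages (128) above ra_pages: big request grows to 64. */
		printf("%lu\n", clamp_req(32, 128, 64));
		return 0;
	}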
Matthew Wilcox <willy@infradead.org> writes:

> On Sun, Aug 30, 2020 at 10:54:35AM +0900, OGAWA Hirofumi wrote:
>> Hm, io_pages is limited by the driver setting too, and io_pages can be
>> lower than ra_pages, e.g. on USB storage.
>>
>> Assuming ra_pages is the user's intended readahead window: if io_pages
>> is lower than ra_pages, this aligns ra_pages to io_pages-sized chunks,
>> but does not go above ra_pages. Because if the block layer splits I/O
>> requests at the hard limit, the I/O is not optimal.
>>
>> That is the intent; I may be misunderstanding, though.
>
> Looking at this some more, I'm not sure it makes sense to consult
> ->io_pages at all.  I see how it gets set to 0 -- the admin can write
> '1' to /sys/block/<device>/queue/max_sectors_kb and that gets turned
> into 0 in ->io_pages.

	if (max_sectors_kb > max_hw_sectors_kb || max_sectors_kb < page_kb)
		return -EINVAL;

It should not be possible to set it to 0 via /sys/.../max_sectors_kb.
However, the default of bdi->io_pages is 0, so this happens if a driver
didn't initialize it.

> But I'm not sure it makes any sense to respect that.  Looking at
> mm/readahead.c, all it does is limit the size of a read request which
> exceeds the current readahead window.  It's not used to limit the
> readahead window itself.  For example:
>
>	unsigned long max_pages = ra->ra_pages;
>	...
>	if (req_size > max_pages && bdi->io_pages > max_pages)
>		max_pages = min(req_size, bdi->io_pages);
>
> Setting io_pages below ra_pages has no effect.  So maybe fat should
> also disregard it?

              |-----------------------|               requested blocks
[before]
    ra_pages  |===========|===========|===========|
    io_pages  |---------|---------|---------|
    req       |---------|-|-------|---|

[after]
    ra_pages  |=========|=========|=========|
    io_pages  |---------|---------|---------|
    req       |---------|---------|---|

This path is known to be a large sequential read. Anyway, the intent is
to issue [after]'s 3 requests instead of [before]'s 4 requests.

Thanks.
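Worked arithmetic behind the diagram, assuming a 24-page sequential
read with io_pages = 10 and an original ra_pages of 12; these are
hypothetical numbers chosen to match the picture, not values from the
report:

	#include <stdio.h>

	int main(void)
	{
		unsigned long total = 24, io_pages = 10, ra_pages = 12;

		/* [before]: two 12-page windows, each split 10+2 by the
		 * block layer -> 4 requests on the wire. */
		unsigned long before = (total / ra_pages) * 2;

		/* [after]: ra_pages rounded down to 10 -> 10+10+4,
		 * i.e. 3 requests, none split at the hard limit. */
		unsigned long after = (total + io_pages - 1) / io_pages;

		printf("before=%lu after=%lu\n", before, after);	/* 4 and 3 */
		return 0;
	}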
Hi,

[This is an automated email]

This commit has been processed because it contains a -stable tag.
The stable tag indicates that it's relevant for the following trees: all

The bot has tested the following trees: v5.8.5, v5.4.61, v4.19.142,
v4.14.195, v4.9.234, v4.4.234.

v5.8.5: Build OK!

v5.4.61: Failed to apply! Possible dependencies:
    898310032b96 ("fat: improve the readahead for FAT entries")
    a090a5a7d73f ("fat: fix fat_ra_init() for data clusters == 0")

v4.19.142: Failed to apply! Possible dependencies:
    898310032b96 ("fat: improve the readahead for FAT entries")
    a090a5a7d73f ("fat: fix fat_ra_init() for data clusters == 0")

v4.14.195: Failed to apply! Possible dependencies:
    898310032b96 ("fat: improve the readahead for FAT entries")
    a090a5a7d73f ("fat: fix fat_ra_init() for data clusters == 0")
    f663b5b38fff ("fat: add FITRIM ioctl for FAT file system")

v4.9.234: Failed to apply! Possible dependencies:
    898310032b96 ("fat: improve the readahead for FAT entries")
    a090a5a7d73f ("fat: fix fat_ra_init() for data clusters == 0")
    f663b5b38fff ("fat: add FITRIM ioctl for FAT file system")

v4.4.234: Failed to apply! Possible dependencies:
    898310032b96 ("fat: improve the readahead for FAT entries")
    8992de4cec12 ("fat: constify fatent_operations structures")
    a090a5a7d73f ("fat: fix fat_ra_init() for data clusters == 0")
    f663b5b38fff ("fat: add FITRIM ioctl for FAT file system")

NOTE: The patch will not be queued to stable trees until it is upstream.

How should we proceed with this patch?
Sasha Levin <sashal@kernel.org> writes:

> Hi,
>
> [This is an automated email]
>
> This commit has been processed because it contains a -stable tag.
> The stable tag indicates that it's relevant for the following trees: all
>
> The bot has tested the following trees: v5.8.5, v5.4.61, v4.19.142,
> v4.14.195, v4.9.234, v4.4.234.
>
> v5.8.5: Build OK!
[...]
> How should we proceed with this patch?

Only 5.8.x needs this patch.

Thanks.
On Sat, Aug 29, 2020 at 7:08 PM OGAWA Hirofumi
<hirofumi@mail.parknet.co.jp> wrote:
>
> On one system, there was bdi->io_pages==0. This seems to be a bug in
> a driver somewhere, which should be fixed. Anyway, it is better to
> avoid the divide-by-zero Oops.
>
> So check for it here.
>
> Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
> Cc: <stable@vger.kernel.org>
> ---
>  fs/fat/fatent.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/fs/fat/fatent.c b/fs/fat/fatent.c
> index f7e3304..98a1c4f 100644
> --- a/fs/fat/fatent.c	2020-08-30 06:52:47.251564566 +0900
> +++ b/fs/fat/fatent.c	2020-08-30 06:54:05.838319213 +0900
> @@ -660,7 +660,7 @@ static void fat_ra_init(struct super_blo
>  	if (fatent->entry >= ent_limit)
>  		return;
>
> -	if (ra_pages > sb->s_bdi->io_pages)
> +	if (sb->s_bdi->io_pages && ra_pages > sb->s_bdi->io_pages)
>  		ra_pages = rounddown(ra_pages, sb->s_bdi->io_pages);
>  	reada_blocks = ra_pages << (PAGE_SHIFT - sb->s_blocksize_bits + 1);

I don't think we should work-around this here. What device is this on?
Something like the below may help.

diff --git a/block/blk-core.c b/block/blk-core.c
index d9d632639bd1..10c08ac50697 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -539,6 +539,7 @@ struct request_queue *blk_alloc_queue(int node_id)
 		goto fail_stats;
 
 	q->backing_dev_info->ra_pages = VM_READAHEAD_PAGES;
+	q->backing_dev_info->io_pages = VM_READAHEAD_PAGES;
 	q->backing_dev_info->capabilities = BDI_CAP_CGROUP_WRITEBACK;
 	q->node = node_id;
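For reference (not part of the patch): VM_READAHEAD_PAGES, already used
above for ra_pages and now for io_pages, is defined in
include/linux/mm.h in kernels of this era as:

	#define VM_READAHEAD_PAGES	(SZ_128K / PAGE_SIZE)

so with 4 KiB pages a freshly allocated queue would start with
io_pages = 32 rather than 0.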
Jens Axboe <axboe@kernel.dk> writes:

> On Sat, Aug 29, 2020 at 7:08 PM OGAWA Hirofumi
> <hirofumi@mail.parknet.co.jp> wrote:
>>
>> On one system, there was bdi->io_pages==0. This seems to be a bug in
>> a driver somewhere, which should be fixed. Anyway, it is better to
>> avoid the divide-by-zero Oops.
>>
>> So check for it here.
[...]
> I don't think we should work-around this here. What device is this on?
> Something like the below may help.

The reported bug is from the nvme stack, and the patch below (I
submitted the same patch to you) fixed the reported case. But I didn't
verify all possible paths, so I'd like to stay on the safe side.

If the block layer can guarantee io_pages != 0 instead, and that can be
applied to the stable branches (5.8+), it would work too.

Thanks.

> diff --git a/block/blk-core.c b/block/blk-core.c
> index d9d632639bd1..10c08ac50697 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -539,6 +539,7 @@ struct request_queue *blk_alloc_queue(int node_id)
>  		goto fail_stats;
>
>  	q->backing_dev_info->ra_pages = VM_READAHEAD_PAGES;
> +	q->backing_dev_info->io_pages = VM_READAHEAD_PAGES;
>  	q->backing_dev_info->capabilities = BDI_CAP_CGROUP_WRITEBACK;
>  	q->node = node_id;
On 8/31/20 10:37 AM, OGAWA Hirofumi wrote:
> Jens Axboe <axboe@kernel.dk> writes:
>
>> On Sat, Aug 29, 2020 at 7:08 PM OGAWA Hirofumi
>> <hirofumi@mail.parknet.co.jp> wrote:
>>>
>>> On one system, there was bdi->io_pages==0. This seems to be a bug in
>>> a driver somewhere, which should be fixed. Anyway, it is better to
>>> avoid the divide-by-zero Oops.
[...]
>> I don't think we should work-around this here. What device is this on?
>> Something like the below may help.
>
> The reported bug is from the nvme stack, and the patch below (I
> submitted the same patch to you) fixed the reported case. But I didn't
> verify all possible paths, so I'd like to stay on the safe side.
>
> If the block layer can guarantee io_pages != 0 instead, and that can be
> applied to the stable branches (5.8+), it would work too.

We really should ensure that ->io_pages is always set, imho, instead of
having to work around it in other spots.
On Mon, Aug 31, 2020 at 10:39:26AM -0600, Jens Axboe wrote:
> We really should ensure that ->io_pages is always set, imho, instead of
> having to work around it in other spots.

Interestingly, there are only three places in the entire kernel which
_use_ bdi->io_pages: FAT, verity and the pagecache readahead code.

Verity:
	unsigned long num_ra_pages =
		min_t(unsigned long, num_blocks_to_hash - i,
		      inode->i_sb->s_bdi->io_pages);

FAT:
	if (ra_pages > sb->s_bdi->io_pages)
		ra_pages = rounddown(ra_pages, sb->s_bdi->io_pages);

Pagecache:
	max_pages = max_t(unsigned long, bdi->io_pages, ra->ra_pages);
and
	if (req_size > max_pages && bdi->io_pages > max_pages)
		max_pages = min(req_size, bdi->io_pages);

The funny thing is that all three are using it differently.  Verity
takes io_pages to be the maximum amount to read ahead.  FAT uses it
as the unit of readahead (round down to the previous multiple), and
the pagecache uses it to limit reads that exceed the current per-file
readahead limit (but allows per-file readahead to exceed io_pages,
in which case it has no effect).

So how should it be used?  My inclination is to say that the pagecache
is right, by virtue of being the most-used.
On 8/31/20 10:56 AM, Matthew Wilcox wrote:
> On Mon, Aug 31, 2020 at 10:39:26AM -0600, Jens Axboe wrote:
>> We really should ensure that ->io_pages is always set, imho, instead of
>> having to work around it in other spots.
>
> Interestingly, there are only three places in the entire kernel which
> _use_ bdi->io_pages: FAT, verity and the pagecache readahead code.
[...]
> The funny thing is that all three are using it differently.  Verity
> takes io_pages to be the maximum amount to read ahead.  FAT uses it
> as the unit of readahead (round down to the previous multiple), and
> the pagecache uses it to limit reads that exceed the current per-file
> readahead limit (but allows per-file readahead to exceed io_pages,
> in which case it has no effect).
>
> So how should it be used?  My inclination is to say that the pagecache
> is right, by virtue of being the most-used.

When I added ->io_pages, it was for the page cache use case. The others
grew after that...
Jens Axboe <axboe@kernel.dk> writes:

> On 8/31/20 10:37 AM, OGAWA Hirofumi wrote:
>> Jens Axboe <axboe@kernel.dk> writes:
>>
>>> I don't think we should work-around this here. What device is this on?
>>> Something like the below may help.
>>
>> The reported bug is from the nvme stack, and the patch below (I
>> submitted the same patch to you) fixed the reported case. But I didn't
>> verify all possible paths, so I'd like to stay on the safe side.
>>
>> If the block layer can guarantee io_pages != 0 instead, and that can be
>> applied to the stable branches (5.8+), it would work too.
>
> We really should ensure that ->io_pages is always set, imho, instead of
> having to work around it in other spots.

I think that is good too. However, the issue is how to do it for the
stable branches.

If you think the block layer patch is enough and submit it to the
stable (5.8+) branches instead, I'm fine without the fatfs patch. (Or
remove the workaround in fatfs along with the block layer patch later?)

Thanks.
On 8/31/20 11:16 AM, OGAWA Hirofumi wrote:
> Jens Axboe <axboe@kernel.dk> writes:
>
>> We really should ensure that ->io_pages is always set, imho, instead of
>> having to work around it in other spots.
>
> I think that is good too. However, the issue is how to do it for the
> stable branches.

Agree.

> If you think the block layer patch is enough and submit it to the
> stable (5.8+) branches instead, I'm fine without the fatfs patch. (Or
> remove the workaround in fatfs along with the block layer patch later?)

Well, it should catch any block based case, as we then set it when
allocating the queue. So that should do the trick for all block based
cases, at least.
Jens Axboe <axboe@kernel.dk> writes:

> On 8/31/20 10:56 AM, Matthew Wilcox wrote:
>> Interestingly, there are only three places in the entire kernel which
>> _use_ bdi->io_pages: FAT, verity and the pagecache readahead code.
[...]
>> The funny thing is that all three are using it differently.  Verity
>> takes io_pages to be the maximum amount to read ahead.  FAT uses it
>> as the unit of readahead (round down to the previous multiple), and
>> the pagecache uses it to limit reads that exceed the current per-file
>> readahead limit (but allows per-file readahead to exceed io_pages,
>> in which case it has no effect).
>>
>> So how should it be used?  My inclination is to say that the pagecache
>> is right, by virtue of being the most-used.
>
> When I added ->io_pages, it was for the page cache use case. The others
> grew after that...

The FAT and pagecache usages have a similar or the same purpose: both
use io_pages as the optimal I/O size. In the pagecache case, io_pages
is consulted when a single request exceeds it. In the FAT case, there
is perfect knowledge of the future/total request size, so FAT divides
the request by io_pages and adjusts ra_pages with that knowledge.

I don't know about verity.

Thanks.
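A sketch contrasting the three interpretations with one set of
hypothetical inputs (ra_pages = 32, io_pages = 10, a 64-page request);
the verity line only approximates its min_t() use, and none of this is
actual kernel code:

	#include <stdio.h>

	#define rounddown(x, y)	((x) - ((x) % (y)))
	#define min(a, b)	((a) < (b) ? (a) : (b))

	int main(void)
	{
		unsigned long ra_pages = 32, io_pages = 10, req_size = 64;

		/* FAT: io_pages as the unit of readahead. */
		printf("fat:       %lu\n", rounddown(ra_pages, io_pages));	/* 30 */

		/* Verity-style: io_pages as a hard cap on the window. */
		printf("verity:    %lu\n", min(req_size, io_pages));		/* 10 */

		/* Pagecache: io_pages only enlarges the window for big
		 * requests, and only when it exceeds ra_pages. */
		unsigned long max_pages = ra_pages;
		if (req_size > max_pages && io_pages > max_pages)
			max_pages = min(req_size, io_pages);
		printf("pagecache: %lu\n", max_pages);				/* 32 */
		return 0;
	}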
diff --git a/fs/fat/fatent.c b/fs/fat/fatent.c
index f7e3304..98a1c4f 100644
--- a/fs/fat/fatent.c	2020-08-30 06:52:47.251564566 +0900
+++ b/fs/fat/fatent.c	2020-08-30 06:54:05.838319213 +0900
@@ -660,7 +660,7 @@ static void fat_ra_init(struct super_blo
 	if (fatent->entry >= ent_limit)
 		return;
 
-	if (ra_pages > sb->s_bdi->io_pages)
+	if (sb->s_bdi->io_pages && ra_pages > sb->s_bdi->io_pages)
 		ra_pages = rounddown(ra_pages, sb->s_bdi->io_pages);
 	reada_blocks = ra_pages << (PAGE_SHIFT - sb->s_blocksize_bits + 1);
On one system, there was bdi->io_pages==0. This seems to be a bug in a
driver somewhere, which should be fixed. Anyway, it is better to avoid
the divide-by-zero Oops.

So check for it here.

Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: <stable@vger.kernel.org>
---
 fs/fat/fatent.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
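For completeness, why io_pages == 0 can oops here: rounddown() expands
to a modulo, so a zero divisor is a divide-by-zero. A minimal
user-space sketch of the guarded path, with a simplified macro and
hypothetical values:

	#include <stdio.h>

	/* Simplified form of the kernel's rounddown(); the real macro
	 * uses x - (x % y), which also divides by y. */
	#define rounddown(x, y)	((x) - ((x) % (y)))

	int main(void)
	{
		unsigned long ra_pages = 32, io_pages = 0;

		/* Without the added io_pages check, this would evaluate
		 * rounddown(32, 0) -> a division by zero, which in the
		 * kernel shows up as the reported Oops. */
		if (io_pages && ra_pages > io_pages)
			ra_pages = rounddown(ra_pages, io_pages);

		printf("ra_pages = %lu\n", ra_pages);	/* stays 32 */
		return 0;
	}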