Message ID | 20230414110821.21548-1-p.raghav@samsung.com (mailing list archive) |
---|---|
Headers | show |
Series | convert create_page_buffers to create_folio_buffers | expand |
On 4/14/23 13:08, Pankaj Raghav wrote: > One of the first kernel panic we hit when we try to increase the > block size > 4k is inside create_page_buffers()[1]. Even though buffer.c > function do not support large folios (folios > PAGE_SIZE) at the moment, > these changes are required when we want to remove that constraint. > > Willy had already mentioned that he wanted to convert create_page_buffers to > create_folio_buffers but didn't get to it yet, so I decided to take a > shot. > > No functional changes introduced. > > OI: > - I don't like the fact that I had to introduce > folio_create_empty_buffers() as create_empty_buffers() is used in > many parts of the kernel. Should I do a big bang change as a part of > this series where we convert create_empty_buffers to take a folio and > change the callers to pass a folio instead of a page? > > - I split the series into 4 commits for clarity. I could squash them > into one patch if needed. > > [1] https://lore.kernel.org/all/ZBnGc4WbBOlnRUgd@casper.infradead.org/ > > Pankaj Raghav (4): > fs/buffer: add set_bh_folio helper > buffer: add alloc_folio_buffers() helper > fs/buffer: add folio_create_empty_buffers helper > fs/buffer: convert create_page_buffers to create_folio_buffers > > fs/buffer.c | 131 +++++++++++++++++++++++++++++++++--- > include/linux/buffer_head.h | 4 ++ > 2 files changed, 125 insertions(+), 10 deletions(-) > Funnily enough, I've been tinkering along the same lines, and ended up with pretty similar patches. I've had to use two additional patches to get my modified 'brd' driver off the ground with logical blocksize of 16k: - mm/filemap: allocate folios according to the blocksize (will be sending the patch separately) - Modify read_folio() to use the correct order: @@ -2333,13 +2395,15 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block) if (IS_ENABLED(CONFIG_FS_VERITY) && IS_VERITY(inode)) limit = inode->i_sb->s_maxbytes; - VM_BUG_ON_FOLIO(folio_test_large(folio), folio); - head = create_folio_buffers(folio, inode, 0); blocksize = head->b_size; bbits = block_size_bits(blocksize); - iblock = (sector_t)folio->index << (PAGE_SHIFT - bbits); + if (WARN_ON(PAGE_SHIFT < bbits)) { + iblock = (sector_t)folio->index >> (bbits - PAGE_SHIFT); + } else { + iblock = (sector_t)folio->index << (PAGE_SHIFT - bbits); + } lblock = (limit+blocksize-1) >> bbits; bh = head; nr = 0; With that (and my modified brd driver) I've been able to set the logical blocksize to 16k for brd and have it happily loaded. Haven't tested the write path yet, though; there's surely quite some work to be done. BTW; I've got another patch replacing 'writepage' with 'write_folio' (and the corresponding argument update). Is that a direction you want to go? Cheers, Hannes
On Fri, Apr 14, 2023 at 03:47:13PM +0200, Hannes Reinecke wrote: > BTW; I've got another patch replacing 'writepage' with 'write_folio' > (and the corresponding argument update). Is that a direction you want to go? No; ->writepage is being deleted. It's already gone from ext4 and xfs.
On 4/14/23 15:51, Matthew Wilcox wrote: > On Fri, Apr 14, 2023 at 03:47:13PM +0200, Hannes Reinecke wrote: >> BTW; I've got another patch replacing 'writepage' with 'write_folio' >> (and the corresponding argument update). Is that a direction you want to go? > > No; ->writepage is being deleted. It's already gone from ext4 and xfs. Aw. And here's me having converted block/fops over to using iomap w/ iomap_writepage(). Tough. Oh well. Wasn't a great fit anyway as for a sb_bread() replacement we would need a sub-page access for iomap. Question is whether we really need that or shouldn't read PAGE_SIZE sectors always. Surely would make life easier. And would save us all the logic with bh_lru etc as we can rely on the page cache. But probably an item for the iomap discussion at LSF. Unless you got plans already ... Cheers, Hannes
>> Pankaj Raghav (4): >> fs/buffer: add set_bh_folio helper >> buffer: add alloc_folio_buffers() helper >> fs/buffer: add folio_create_empty_buffers helper >> fs/buffer: convert create_page_buffers to create_folio_buffers >> >> fs/buffer.c | 131 +++++++++++++++++++++++++++++++++--- >> include/linux/buffer_head.h | 4 ++ >> 2 files changed, 125 insertions(+), 10 deletions(-) >> > Funnily enough, I've been tinkering along the same lines, and ended up with pretty similar patches. > I've had to use two additional patches to get my modified 'brd' driver off the ground with logical > blocksize of 16k: Good to know that we are working on a similar direction here. > - mm/filemap: allocate folios according to the blocksize > (will be sending the patch separately) > - Modify read_folio() to use the correct order: > > @@ -2333,13 +2395,15 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block) > if (IS_ENABLED(CONFIG_FS_VERITY) && IS_VERITY(inode)) > limit = inode->i_sb->s_maxbytes; > > - VM_BUG_ON_FOLIO(folio_test_large(folio), folio); > - > head = create_folio_buffers(folio, inode, 0); > blocksize = head->b_size; > bbits = block_size_bits(blocksize); > > - iblock = (sector_t)folio->index << (PAGE_SHIFT - bbits); > + if (WARN_ON(PAGE_SHIFT < bbits)) { > + iblock = (sector_t)folio->index >> (bbits - PAGE_SHIFT); > + } else { > + iblock = (sector_t)folio->index << (PAGE_SHIFT - bbits); > + } > lblock = (limit+blocksize-1) >> bbits; > bh = head; > nr = 0; > > > With that (and my modified brd driver) I've been able to set the logical blocksize to 16k for brd > and have it happily loaded. I will give your patches a try as well.
On Fri, Apr 14, 2023 at 03:47:13PM +0200, Hannes Reinecke wrote: > On 4/14/23 13:08, Pankaj Raghav wrote: > > One of the first kernel panic we hit when we try to increase the > > block size > 4k is inside create_page_buffers()[1]. Even though buffer.c > > function do not support large folios (folios > PAGE_SIZE) at the moment, > > these changes are required when we want to remove that constraint. > Funnily enough, I've been tinkering along the same lines, and ended up with > pretty similar patches. > I've had to use two additional patches to get my modified 'brd' driver off > the ground with logical blocksize of 16k: > - mm/filemap: allocate folios according to the blocksize > (will be sending the patch separately) > - Modify read_folio() to use the correct order: > > @@ -2333,13 +2395,15 @@ int block_read_full_folio(struct folio *folio, > get_block_t *get_block) > if (IS_ENABLED(CONFIG_FS_VERITY) && IS_VERITY(inode)) > limit = inode->i_sb->s_maxbytes; > > - VM_BUG_ON_FOLIO(folio_test_large(folio), folio); > - > head = create_folio_buffers(folio, inode, 0); > blocksize = head->b_size; > bbits = block_size_bits(blocksize); > > - iblock = (sector_t)folio->index << (PAGE_SHIFT - bbits); > + if (WARN_ON(PAGE_SHIFT < bbits)) { > + iblock = (sector_t)folio->index >> (bbits - PAGE_SHIFT); > + } else { > + iblock = (sector_t)folio->index << (PAGE_SHIFT - bbits); > + } > lblock = (limit+blocksize-1) >> bbits; > bh = head; > nr = 0; At a quick glance I think both approaches (unless Hannes already did it) seem to just miss that pesky static *arr[MAX_BUF_PER_PAGE], and so I think we need to: a) dynamically allocate those now b) do a cursory review of the users of that and prepare them to grok buffer heads which are blocksize based rather than PAGE_SIZE based. So we just try to kill MAX_BUF_PER_PAGE. Without a) I think buffers after PAGE_SIZE won't get submit_bh() or lock for bs > PAGE_SIZE right now. Luis
On Fri, Apr 14, 2023 at 06:01:16PM -0700, Luis Chamberlain wrote: > a) dynamically allocate those now > b) do a cursory review of the users of that and prepare them > to grok buffer heads which are blocksize based rather than > PAGE_SIZE based. So we just try to kill MAX_BUF_PER_PAGE. > > Without a) I think buffers after PAGE_SIZE won't get submit_bh() or lock for > bs > PAGE_SIZE right now. Worse, we'll overflow the array and corrupt the stack. This one is a simple fix ... +++ b/fs/buffer.c @@ -2282,7 +2282,7 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block) { struct inode *inode = folio->mapping->host; sector_t iblock, lblock; - struct buffer_head *bh, *head, *arr[MAX_BUF_PER_PAGE]; + struct buffer_head *bh, *head; unsigned int blocksize, bbits; int nr, i; int fully_mapped = 1; @@ -2335,7 +2335,6 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block) if (buffer_uptodate(bh)) continue; } - arr[nr++] = bh; } while (i++, iblock++, (bh = bh->b_this_page) != head); if (fully_mapped) @@ -2353,24 +2352,27 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block) } /* Stage two: lock the buffers */ - for (i = 0; i < nr; i++) { - bh = arr[i]; + bh = head; + do { lock_buffer(bh); mark_buffer_async_read(bh); - } + bh = bh->b_this_page; + } while (bh != head); /* * Stage 3: start the IO. Check for uptodateness * inside the buffer lock in case another process reading * the underlying blockdev brought it uptodate (the sct fix). */ - for (i = 0; i < nr; i++) { - bh = arr[i]; + bh = head; + do { if (buffer_uptodate(bh)) end_buffer_async_read(bh, 1); else submit_bh(REQ_OP_READ, bh); - } + bh = bh->b_this_page; + } while (bh != head); + return 0; } EXPORT_SYMBOL(block_read_full_folio);
On Sat, Apr 15, 2023 at 03:31:54AM +0100, Matthew Wilcox wrote: > On Fri, Apr 14, 2023 at 06:01:16PM -0700, Luis Chamberlain wrote: > > a) dynamically allocate those now > > b) do a cursory review of the users of that and prepare them > > to grok buffer heads which are blocksize based rather than > > PAGE_SIZE based. So we just try to kill MAX_BUF_PER_PAGE. > > > > Without a) I think buffers after PAGE_SIZE won't get submit_bh() or lock for > > bs > PAGE_SIZE right now. > > Worse, we'll overflow the array and corrupt the stack. > > This one is a simple fix ... > > +++ b/fs/buffer.c > @@ -2282,7 +2282,7 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block) > { > struct inode *inode = folio->mapping->host; > sector_t iblock, lblock; > - struct buffer_head *bh, *head, *arr[MAX_BUF_PER_PAGE]; > + struct buffer_head *bh, *head; > unsigned int blocksize, bbits; > int nr, i; > int fully_mapped = 1; > @@ -2335,7 +2335,6 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block) > if (buffer_uptodate(bh)) > continue; > } > - arr[nr++] = bh; > } while (i++, iblock++, (bh = bh->b_this_page) != head); > > if (fully_mapped) > @@ -2353,24 +2352,27 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block) > } > > /* Stage two: lock the buffers */ > - for (i = 0; i < nr; i++) { > - bh = arr[i]; > + bh = head; > + do { > lock_buffer(bh); > mark_buffer_async_read(bh); > - } > + bh = bh->b_this_page; > + } while (bh != head); > > /* > * Stage 3: start the IO. Check for uptodateness > * inside the buffer lock in case another process reading > * the underlying blockdev brought it uptodate (the sct fix). > */ > - for (i = 0; i < nr; i++) { > - bh = arr[i]; > + bh = head; > + do { > if (buffer_uptodate(bh)) > end_buffer_async_read(bh, 1); > else > submit_bh(REQ_OP_READ, bh); > - } > + bh = bh->b_this_page; > + } while (bh != head); > + > return 0; I thought of that but I saw that the loop that assigns the arr only pegs a bh if we don't "continue" for certain conditions, which made me believe that we only wanted to keep on the array as non-null items which meet the initial loop's criteria. If that is not accurate then yes, the simplication is nice! Luis
On Fri, Apr 14, 2023 at 08:24:56PM -0700, Luis Chamberlain wrote: > I thought of that but I saw that the loop that assigns the arr only > pegs a bh if we don't "continue" for certain conditions, which made me > believe that we only wanted to keep on the array as non-null items which > meet the initial loop's criteria. If that is not accurate then yes, > the simplication is nice! Uh, right. A little bit more carefully this time ... how does this look? diff --git a/fs/buffer.c b/fs/buffer.c index 5e67e21b350a..dff671079b02 100644 --- a/fs/buffer.c +++ b/fs/buffer.c @@ -2282,7 +2282,7 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block) { struct inode *inode = folio->mapping->host; sector_t iblock, lblock; - struct buffer_head *bh, *head, *arr[MAX_BUF_PER_PAGE]; + struct buffer_head *bh, *head; unsigned int blocksize, bbits; int nr, i; int fully_mapped = 1; @@ -2335,7 +2335,7 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block) if (buffer_uptodate(bh)) continue; } - arr[nr++] = bh; + nr++; } while (i++, iblock++, (bh = bh->b_this_page) != head); if (fully_mapped) @@ -2352,25 +2352,29 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block) return 0; } - /* Stage two: lock the buffers */ - for (i = 0; i < nr; i++) { - bh = arr[i]; + /* + * Stage two: lock the buffers. Recheck the uptodate flag under + * the lock in case somebody else brought it uptodate first. + */ + bh = head; + do { + if (buffer_uptodate(bh)) + continue; lock_buffer(bh); + if (buffer_uptodate(bh)) { + unlock_buffer(bh); + continue; + } mark_buffer_async_read(bh); - } + } while ((bh = bh->b_this_page) != head); - /* - * Stage 3: start the IO. Check for uptodateness - * inside the buffer lock in case another process reading - * the underlying blockdev brought it uptodate (the sct fix). - */ - for (i = 0; i < nr; i++) { - bh = arr[i]; - if (buffer_uptodate(bh)) - end_buffer_async_read(bh, 1); - else + /* Stage 3: start the IO */ + bh = head; + do { + if (buffer_async_read(bh)) submit_bh(REQ_OP_READ, bh); - } + } while ((bh = bh->b_this_page) != head); + return 0; } EXPORT_SYMBOL(block_read_full_folio); I do wonder how much it's worth doing this vs switching to non-BH methods. I appreciate that's a lot of work still.
On 4/15/23 05:44, Matthew Wilcox wrote: > On Fri, Apr 14, 2023 at 08:24:56PM -0700, Luis Chamberlain wrote: >> I thought of that but I saw that the loop that assigns the arr only >> pegs a bh if we don't "continue" for certain conditions, which made me >> believe that we only wanted to keep on the array as non-null items which >> meet the initial loop's criteria. If that is not accurate then yes, >> the simplication is nice! > > Uh, right. A little bit more carefully this time ... how does this > look? > > diff --git a/fs/buffer.c b/fs/buffer.c > index 5e67e21b350a..dff671079b02 100644 > --- a/fs/buffer.c > +++ b/fs/buffer.c > @@ -2282,7 +2282,7 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block) > { > struct inode *inode = folio->mapping->host; > sector_t iblock, lblock; > - struct buffer_head *bh, *head, *arr[MAX_BUF_PER_PAGE]; > + struct buffer_head *bh, *head; > unsigned int blocksize, bbits; > int nr, i; > int fully_mapped = 1; > @@ -2335,7 +2335,7 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block) > if (buffer_uptodate(bh)) > continue; > } > - arr[nr++] = bh; > + nr++; > } while (i++, iblock++, (bh = bh->b_this_page) != head); > > if (fully_mapped) > @@ -2352,25 +2352,29 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block) > return 0; > } > > - /* Stage two: lock the buffers */ > - for (i = 0; i < nr; i++) { > - bh = arr[i]; > + /* > + * Stage two: lock the buffers. Recheck the uptodate flag under > + * the lock in case somebody else brought it uptodate first. > + */ > + bh = head; > + do { > + if (buffer_uptodate(bh)) > + continue; > lock_buffer(bh); > + if (buffer_uptodate(bh)) { > + unlock_buffer(bh); > + continue; > + } > mark_buffer_async_read(bh); > - } > + } while ((bh = bh->b_this_page) != head); > > - /* > - * Stage 3: start the IO. Check for uptodateness > - * inside the buffer lock in case another process reading > - * the underlying blockdev brought it uptodate (the sct fix). > - */ > - for (i = 0; i < nr; i++) { > - bh = arr[i]; > - if (buffer_uptodate(bh)) > - end_buffer_async_read(bh, 1); > - else > + /* Stage 3: start the IO */ > + bh = head; > + do { > + if (buffer_async_read(bh)) > submit_bh(REQ_OP_READ, bh); > - } > + } while ((bh = bh->b_this_page) != head); > + > return 0; > } > EXPORT_SYMBOL(block_read_full_folio); > > > I do wonder how much it's worth doing this vs switching to non-BH methods. > I appreciate that's a lot of work still. That's what I've been wondering, too. I would _vastly_ prefer to switch over to iomap; however, the blasted sb_bread() is getting in the way. Currently iomap only runs on entire pages / folios, but a lot of (older) filesystems insist on doing 512 byte I/O. While this seem logical (seeing that 512 bytes is the default, and, in most cases, the only supported sector size) question is whether _we_ from the linux side need to do that. We _could_ upgrade to always do full page I/O; there's a good chance we'll be using the entire page anyway eventually. And with storage bandwidth getting larger and larger we might even get a performance boost there. And it would save us having to implement sub-page I/O for iomap. Hmm? Cheers, Hannes
On Sat, Apr 15, 2023 at 03:14:33PM +0200, Hannes Reinecke wrote: > On 4/15/23 05:44, Matthew Wilcox wrote: > > I do wonder how much it's worth doing this vs switching to non-BH methods. > > I appreciate that's a lot of work still. > > That's what I've been wondering, too. > > I would _vastly_ prefer to switch over to iomap; however, the blasted > sb_bread() is getting in the way. Currently iomap only runs on entire > pages / folios, but a lot of (older) filesystems insist on doing 512 Hang on, no, iomap can issue sub-page reads. eg iomap_read_folio_sync() will read the parts of the folio which have not yet been read when called from __iomap_write_begin(). > byte I/O. While this seem logical (seeing that 512 bytes is the > default, and, in most cases, the only supported sector size) question > is whether _we_ from the linux side need to do that. > We _could_ upgrade to always do full page I/O; there's a good > chance we'll be using the entire page anyway eventually. > And with storage bandwidth getting larger and larger we might even > get a performance boost there. I think we need to look at this from the filesystem side. What do filesystems actually want to do? The first thing is they want to read the superblock. That's either going to be immediately freed ("Oh, this isn't a JFS filesystem after all") or it's going to hang around indefinitely. There's no particular need to keep it in any kind of cache (buffer or page). Except ... we want to probe a dozen different filesystems, and half of them keep their superblock at the same offset from the start of the block device. So we do want to keep it cached. That's arguing for using the page cache, at least to read it. Now, do we want userspace to be able to dd a new superblock into place and have the mounted filesystem see it? I suspect that confuses just about every filesystem out there. So I think the right answer is to read the page into the bdev's page cache and then copy it into a kmalloc'ed buffer which the filesystem is then responsible for freeing. It's also responsible for writing it back (so that's another API we need), and for a journalled filesystem, it needs to fit into the journalling scheme. Also, we may need to write back multiple copies of the superblock, possibly with slight modifications. There are a lot of considerations here, and I don't feel like I have enough of an appreciation of filesystem needs to come up with a decent API. I'd hope we can get a good discussion going at LSFMM.
On Sat, Apr 15, 2023 at 06:09:12PM +0100, Matthew Wilcox wrote: > On Sat, Apr 15, 2023 at 03:14:33PM +0200, Hannes Reinecke wrote: > > On 4/15/23 05:44, Matthew Wilcox wrote: > > We _could_ upgrade to always do full page I/O; there's a good > > chance we'll be using the entire page anyway eventually. *Iff* doing away with buffer head 512 granularity could help block sizes greater than page size where physical and logical block size > PAGE_SIZE we whould also be able to see it on 4kn drives (logical and physical block size == 4k). A projection could be made after. In so far as experimenting with this, if you already have some effort on IOMAP for bdev aops one possibility for pure experimentation for now would be to peg a new set of aops could be set in the path of __alloc_disk_node() --> bdev_alloc() but that's a wee-bit too early for us to know if the device is has (lbs = pbs) > 512. For NVMe for instance this would be nvme_alloc_ns() --> blk_mq_alloc_disk(). We put together and set the logical and phyiscal block size on NVMe on nvme_update_ns_info() --> nvme_update_disk_info(), right before we call device_add_disk(). The only way to override the aops then would be right before device_add_disk(), or as part of a new device_add_disk_aops() or whatever. > > And with storage bandwidth getting larger and larger we might even > > get a performance boost there. > > I think we need to look at this from the filesystem side. Before that let's recap the bdev cache current issues. Today by just adding the disk we move on to partition scanning immediately unless your block driver has a flag that says otherwise. The current crash we're evaluating with brd and that we also hit with NVMe is due to this part. device_add_disk() --> disk_scan_partitions() --> blkdev_get_whole() --> bdev_disk_changed() --> filemap_read_folio() --> filler() The filler is from aops. We don't even have a filesystem yet on these devices at this point. The entire partition core does this partition scanning. Refer to: disk_scan_partitions() --> block/partitions/core.c : bdev_disk_changed() And all of that stuff is also under a 512-byte atomic operation assumption, we could do better if wanted to. > What do filesystems actually want to do? So you are suggesting that the early reads of the block device by the block cache and its use of the page cache cache should be aligned / perhaps redesigned to assist more clearly with what modern filesystems might actually would want today? > The first thing is they want to read > the superblock. That's either going to be immediately freed ("Oh, > this isn't a JFS filesystem after all") or it's going to hang around > indefinitely. There's no particular need to keep it in any kind of > cache (buffer or page). And the bdev cache would not be able to know before hand that's the case. > Except ... we want to probe a dozen different > filesystems, and half of them keep their superblock at the same offset > from the start of the block device. So we do want to keep it cached. > That's arguing for using the page cache, at least to read it. Do we currently share anything from the bdev cache with the fs for this? Let's say that first block device blocksize in memory. > Now, do we want userspace to be able to dd a new superblock into place > and have the mounted filesystem see it? Not sure I follow this. dd a new super block? > I suspect that confuses just > about every filesystem out there. So I think the right answer is to read > the page into the bdev's page cache and then copy it into a kmalloc'ed > buffer which the filesystem is then responsible for freeing. It's also > responsible for writing it back (so that's another API we need), and for > a journalled filesystem, it needs to fit into the journalling scheme. > Also, we may need to write back multiple copies of the superblock, > possibly with slight modifications. Are you considering these as extentions to the bdev cache? Luis
On Sat, Apr 15, 2023 at 06:28:11PM -0700, Luis Chamberlain wrote: > On Sat, Apr 15, 2023 at 06:09:12PM +0100, Matthew Wilcox wrote: > > On Sat, Apr 15, 2023 at 03:14:33PM +0200, Hannes Reinecke wrote: > > > On 4/15/23 05:44, Matthew Wilcox wrote: > > > We _could_ upgrade to always do full page I/O; there's a good > > > chance we'll be using the entire page anyway eventually. > > *Iff* doing away with buffer head 512 granularity could help block sizes > greater than page size where physical and logical block size > PAGE_SIZE > we whould also be able to see it on 4kn drives (logical and physical > block size == 4k). A projection could be made after. > > In so far as experimenting with this, if you already have some > effort on IOMAP for bdev aops one possibility for pure experimentation > for now would be to peg a new set of aops could be set in the path > of __alloc_disk_node() --> bdev_alloc() but that's a wee-bit too early > for us to know if the device is has (lbs = pbs) > 512. For NVMe for > instance this would be nvme_alloc_ns() --> blk_mq_alloc_disk(). We > put together and set the logical and phyiscal block size on NVMe on > nvme_update_ns_info() --> nvme_update_disk_info(), right before we > call device_add_disk(). The only way to override the aops then would > be right before device_add_disk(), or as part of a new device_add_disk_aops() > or whatever. I think you're making this harder than you need to. For LBA size > PAGE_SIZE, there is always only going to be one BH per folio-that-is-LBA-size, so all the problems with more-than-8-BHs-per-4k-page don't actually exist. I don't think we should be overriding the aops, and if we narrow the scope of large folio support in blockdev t only supporting folio_size == LBA size, it becomes much more feasible. > > > And with storage bandwidth getting larger and larger we might even > > > get a performance boost there. > > > > I think we need to look at this from the filesystem side. > > Before that let's recap the bdev cache current issues. Ooh, yes, this is good! I was totally neglecting the partition scanning code. > Today by just adding the disk we move on to partition scanning > immediately unless your block driver has a flag that says otherwise. The > current crash we're evaluating with brd and that we also hit with NVMe > is due to this part. > > device_add_disk() --> > disk_scan_partitions() --> > blkdev_get_whole() --> > bdev_disk_changed() --> > filemap_read_folio() --> filler() > > The filler is from aops. Right, so you missed a step in that callchain, which is read_mapping_folio(). That ends up in do_read_cache_folio(), which contains the deadly: folio = filemap_alloc_folio(gfp, 0); so that needs to be changed to get the minimum folio order from the mapping, and then it should work. > > What do filesystems actually want to do? > > So you are suggesting that the early reads of the block device by the > block cache and its use of the page cache cache should be aligned / > perhaps redesigned to assist more clearly with what modern filesystems > might actually would want today? Perhaps? I'm just saying the needs of the block device are not the only consideration here. I'd like an API that makes sense for the fs author. > > The first thing is they want to read > > the superblock. That's either going to be immediately freed ("Oh, > > this isn't a JFS filesystem after all") or it's going to hang around > > indefinitely. There's no particular need to keep it in any kind of > > cache (buffer or page). > > And the bdev cache would not be able to know before hand that's the case. Right, nobody knows until it's been read and examined. > > Except ... we want to probe a dozen different > > filesystems, and half of them keep their superblock at the same offset > > from the start of the block device. So we do want to keep it cached. > > That's arguing for using the page cache, at least to read it. > > Do we currently share anything from the bdev cache with the fs for this? > Let's say that first block device blocksize in memory. sb_bread() is used by most filesystems, and the buffer cache aliases into the page cache. > > Now, do we want userspace to be able to dd a new superblock into place > > and have the mounted filesystem see it? > > Not sure I follow this. dd a new super block? In userspace, if I run 'dd if=blah of=/dev/sda1 bs=512 count=1 seek=N', I can overwrite the superblock. Do we want filesystems to see that kind of vandalism, or do we want the mounted filesystem to have its own copy of the data and overwrite what userspace wrote the next time it updates the superblock? (the trick is that this may not be vandalism, it might be the sysadmin updating the uuid or running some fsck-ish program or trying to update the superblock to support fabulous-new-feature on next mount. does this change the answer?) > > I suspect that confuses just > > about every filesystem out there. So I think the right answer is to read > > the page into the bdev's page cache and then copy it into a kmalloc'ed > > buffer which the filesystem is then responsible for freeing. It's also > > responsible for writing it back (so that's another API we need), and for > > a journalled filesystem, it needs to fit into the journalling scheme. > > Also, we may need to write back multiple copies of the superblock, > > possibly with slight modifications. > > Are you considering these as extentions to the bdev cache? I'm trying to suggest some of the considerations that need to go into a replacement for sb_bread().
On Sun, Apr 16, 2023 at 04:40:06AM +0100, Matthew Wilcox wrote: > On Sat, Apr 15, 2023 at 06:28:11PM -0700, Luis Chamberlain wrote: > > On Sat, Apr 15, 2023 at 06:09:12PM +0100, Matthew Wilcox wrote: > > > On Sat, Apr 15, 2023 at 03:14:33PM +0200, Hannes Reinecke wrote: > > > > On 4/15/23 05:44, Matthew Wilcox wrote: > > > > We _could_ upgrade to always do full page I/O; there's a good > > > > chance we'll be using the entire page anyway eventually. > > > > *Iff* doing away with buffer head 512 granularity could help block sizes > > greater than page size where physical and logical block size > PAGE_SIZE > > we whould also be able to see it on 4kn drives (logical and physical > > block size == 4k). A projection could be made after. > > > > In so far as experimenting with this, if you already have some > > effort on IOMAP for bdev aops one possibility for pure experimentation > > for now would be to peg a new set of aops could be set in the path > > of __alloc_disk_node() --> bdev_alloc() but that's a wee-bit too early > > for us to know if the device is has (lbs = pbs) > 512. For NVMe for > > instance this would be nvme_alloc_ns() --> blk_mq_alloc_disk(). We > > put together and set the logical and phyiscal block size on NVMe on > > nvme_update_ns_info() --> nvme_update_disk_info(), right before we > > call device_add_disk(). The only way to override the aops then would > > be right before device_add_disk(), or as part of a new device_add_disk_aops() > > or whatever. > > I think you're making this harder than you need to. > For LBA size > PAGE_SIZE, there is always only going to be > one BH per folio-that-is-LBA-size, so all the problems with > more-than-8-BHs-per-4k-page don't actually exist. Oh! Then yes, sorry! > I don't think we > should be overriding the aops, and if we narrow the scope of large folio > support in blockdev t only supporting folio_size == LBA size, it becomes > much more feasible. I'm trying to think of the possible use cases where folio_size != LBA size and I cannot immediately think of some. Yes there are cases where a filesystem may use a different block for say meta data than data, but that I believe is side issue, ie, read/writes for small metadata would have to be accepted. At least for NVMe we have metadata size as part of the LBA format, but from what I understand no Linux filesystem yet uses that. > > > > And with storage bandwidth getting larger and larger we might even > > > > get a performance boost there. > > > > > > I think we need to look at this from the filesystem side. > > > > Before that let's recap the bdev cache current issues. > > Ooh, yes, this is good! I was totally neglecting the partition > scanning code. > > > Today by just adding the disk we move on to partition scanning > > immediately unless your block driver has a flag that says otherwise. The > > current crash we're evaluating with brd and that we also hit with NVMe > > is due to this part. > > > > device_add_disk() --> > > disk_scan_partitions() --> > > blkdev_get_whole() --> > > bdev_disk_changed() --> > > filemap_read_folio() --> filler() > > > > The filler is from aops. > > Right, so you missed a step in that callchain, which is > read_mapping_folio(). That ends up in do_read_cache_folio(), which > contains the deadly: > > folio = filemap_alloc_folio(gfp, 0); Right and before this we have: if (!filler) filler = mapping->a_ops->read_folio; The folio is order 0 and after filemap_alloc_folio() its added to the page cache, then filemap_read_folio(file, filler, folio) uses the filler blkdev_read_folio() --> fs/buffer.c block_read_full_folio() Just for posterity: fs/buffer.c int block_read_full_folio(struct folio *folio, get_block_t *get_block) { ... struct buffer_head *bh, *head,...; ... head = create_page_buffers(&folio->page, inode, 0); ... } static struct buffer_head *create_page_buffers(struct page *page, struct inode *inode, unsigned int b_state) { BUG_ON(!PageLocked(page)); if (!page_has_buffers(page)) create_empty_buffers(page, 1 << READ_ONCE(inode->i_blkbits), b_state); return page_buffers(page); } void create_empty_buffers(struct page *page, unsigned long blocksize, unsigned long b_state) { struct buffer_head *bh, *head, *tail; head = alloc_page_buffers(page, blocksize, true); bh = head; do { bh->b_state |= b_state; -----> CRASH HERE head is NULL tail = bh; bh = bh->b_this_page; } while (bh); tail->b_this_page = head; } Why is it NULL? The blocksize passed to alloc_page_buffers() is larger than PAGE_SIZE and below offset is PAGE_SIZE. PAGE_SIZE - blocksize gives a negative number and so the loop is not traversed. struct buffer_head *alloc_page_buffers(struct page *page, unsigned long size, bool retry) { struct buffer_head *bh, *head; gfp_t gfp = GFP_NOFS | __GFP_ACCOUNT; long offset; struct mem_cgroup *memcg, *old_memcg; if (retry) gfp |= __GFP_NOFAIL; /* The page lock pins the memcg */ memcg = page_memcg(page); old_memcg = set_active_memcg(memcg); head = NULL; offset = PAGE_SIZE; while ((offset -= size) >= 0) { bh = alloc_buffer_head(gfp); if (!bh) goto no_grow; bh->b_this_page = head; bh->b_blocknr = -1; head = bh; bh->b_size = size; /* Link the buffer to its page */ set_bh_page(bh, page, offset); } out: set_active_memcg(old_memcg); return head; ... } I see now what you say about the buffer head being of the block size bh->b_size = size above. > so that needs to be changed to get the minimum folio order from the > mapping, and then it should work. > > > > What do filesystems actually want to do? > > > > So you are suggesting that the early reads of the block device by the > > block cache and its use of the page cache cache should be aligned / > > perhaps redesigned to assist more clearly with what modern filesystems > > might actually would want today? > > Perhaps? I'm just saying the needs of the block device are not the > only consideration here. I'd like an API that makes sense for the fs > author. Makes sense! > > > The first thing is they want to read > > > the superblock. That's either going to be immediately freed ("Oh, > > > this isn't a JFS filesystem after all") or it's going to hang around > > > indefinitely. There's no particular need to keep it in any kind of > > > cache (buffer or page). > > > > And the bdev cache would not be able to know before hand that's the case. > > Right, nobody knows until it's been read and examined. > > > > Except ... we want to probe a dozen different > > > filesystems, and half of them keep their superblock at the same offset > > > from the start of the block device. So we do want to keep it cached. > > > That's arguing for using the page cache, at least to read it. > > > > Do we currently share anything from the bdev cache with the fs for this? > > Let's say that first block device blocksize in memory. > > sb_bread() is used by most filesystems, and the buffer cache aliases > into the page cache. I see thanks. I checked what xfs does and its xfs_readsb() uses its own xfs_buf_read_uncached(). It ends up calling xfs_buf_submit() and xfs_buf_ioapply_map() does it's own submit_bio(). So I'm curious why they did that. > > > Now, do we want userspace to be able to dd a new superblock into place > > > and have the mounted filesystem see it? > > > > Not sure I follow this. dd a new super block? > > In userspace, if I run 'dd if=blah of=/dev/sda1 bs=512 count=1 seek=N', > I can overwrite the superblock. Do we want filesystems to see that > kind of vandalism, or do we want the mounted filesystem to have its > own copy of the data and overwrite what userspace wrote the next time it > updates the superblock? Oh, what happens today? > (the trick is that this may not be vandalism, it might be the sysadmin > updating the uuid or running some fsck-ish program or trying to update > the superblock to support fabulous-new-feature on next mount. does this > change the answer?) > > > > I suspect that confuses just > > > about every filesystem out there. So I think the right answer is to read > > > the page into the bdev's page cache and then copy it into a kmalloc'ed > > > buffer which the filesystem is then responsible for freeing. It's also > > > responsible for writing it back (so that's another API we need), and for > > > a journalled filesystem, it needs to fit into the journalling scheme. > > > Also, we may need to write back multiple copies of the superblock, > > > possibly with slight modifications. > > > > Are you considering these as extentions to the bdev cache? > > I'm trying to suggest some of the considerations that need to go into > a replacement for sb_bread(). I see! That would also help EOL buffer heads for that purpose. Luis
On Sat, Apr 15, 2023 at 10:26:42PM -0700, Luis Chamberlain wrote: > On Sun, Apr 16, 2023 at 04:40:06AM +0100, Matthew Wilcox wrote: > > I don't think we > > should be overriding the aops, and if we narrow the scope of large folio > > support in blockdev t only supporting folio_size == LBA size, it becomes > > much more feasible. > > I'm trying to think of the possible use cases where folio_size != LBA size > and I cannot immediately think of some. Yes there are cases where a > filesystem may use a different block for say meta data than data, but that > I believe is side issue, ie, read/writes for small metadata would have > to be accepted. At least for NVMe we have metadata size as part of the > LBA format, but from what I understand no Linux filesystem yet uses that. NVMe metadata is per-block metadata -- a CRC or similar. Filesystem metadata is things like directories, inode tables, free space bitmaps, etc. > struct buffer_head *alloc_page_buffers(struct page *page, unsigned long size, > bool retry) > { [...] > head = NULL; > offset = PAGE_SIZE; > while ((offset -= size) >= 0) { > > I see now what you say about the buffer head being of the block size > bh->b_size = size above. Yes, just changing that to 'offset = page_size(page);' will do the trick. > > sb_bread() is used by most filesystems, and the buffer cache aliases > > into the page cache. > > I see thanks. I checked what xfs does and its xfs_readsb() uses its own > xfs_buf_read_uncached(). It ends up calling xfs_buf_submit() and > xfs_buf_ioapply_map() does it's own submit_bio(). So I'm curious why > they did that. IRIX didn't have an sb_bread() ;-) > > In userspace, if I run 'dd if=blah of=/dev/sda1 bs=512 count=1 seek=N', > > I can overwrite the superblock. Do we want filesystems to see that > > kind of vandalism, or do we want the mounted filesystem to have its > > own copy of the data and overwrite what userspace wrote the next time it > > updates the superblock? > > Oh, what happens today? Depends on the filesystem, I think? Not really sure, to be honest.
On Sat, Apr 15, 2023 at 10:26:42PM -0700, Luis Chamberlain wrote: > > > > Except ... we want to probe a dozen different > > > > filesystems, and half of them keep their superblock at the same offset > > > > from the start of the block device. So we do want to keep it cached. > > > > That's arguing for using the page cache, at least to read it. > > > > > > Do we currently share anything from the bdev cache with the fs for this? > > > Let's say that first block device blocksize in memory. > > > > sb_bread() is used by most filesystems, and the buffer cache aliases > > into the page cache. > > I see thanks. I checked what xfs does and its xfs_readsb() uses its own > xfs_buf_read_uncached(). It ends up calling xfs_buf_submit() and > xfs_buf_ioapply_map() does it's own submit_bio(). So I'm curious why > they did that. XFS has it's own metadata address space for caching - it does not use the block device page cache at all. This is not new, it never has. The xfs_buf buffer cache does not use the page cache, either. It does it's own thing, has it's own indexing, locking, shrinkers, etc. IOWs, it does not use the iomap infrastructure at all - iomap is used by XFS exclusively for data IO. As for why we use an uncached buffer for the superblock? That's largely historic because prior to 2007 every modification that did allocation/free needed to lock and modify the superblock at transaction commit. Hence it's always needed in memory but a critical fast path, so it is always directly available without needing to do a cache lookup to callers that need it. In 2007, lazy superblock counters got rid of the requirement to lock the superblock buffer in every transaction commit, so the uncached buffer optimisation hasn't really been needed for the past decade. But if it ain't broke, don't try to fix it.... > > > > Now, do we want userspace to be able to dd a new superblock into place > > > > and have the mounted filesystem see it? > > > > > > Not sure I follow this. dd a new super block? > > > > In userspace, if I run 'dd if=blah of=/dev/sda1 bs=512 count=1 seek=N', > > I can overwrite the superblock. Do we want filesystems to see that > > kind of vandalism, or do we want the mounted filesystem to have its > > own copy of the data and overwrite what userspace wrote the next time it > > updates the superblock? > > Oh, what happens today? In XFS, it will completely ignore the fact the the superblock got trashed like this. When the fs goes idle, or the sb modified for some other reason, it will relog the in-memory superblock and write it back to disk, thereby fixing the corruption. i.e. while the filesystem is mounted, the superblock is _write-only_... > > (the trick is that this may not be vandalism, it might be the sysadmin > > updating the uuid or running some fsck-ish program or trying to update > > the superblock to support fabulous-new-feature on next mount. does this > > change the answer?) If you need to change anything in the superblock while the XFS fs is mounted, then you have to use ioctls to modify the superblock contents through the running transaction subsystem. Editting the block device directly breaks the security model of filesystems that assume they have exclusive access to the block device whilst the filesystem is mounted.... -Dave.
On Fri, Apr 14, 2023 at 03:47:13PM +0200, Hannes Reinecke wrote: > @@ -2333,13 +2395,15 @@ int block_read_full_folio(struct folio *folio, > get_block_t *get_block) > if (IS_ENABLED(CONFIG_FS_VERITY) && IS_VERITY(inode)) > limit = inode->i_sb->s_maxbytes; > > - VM_BUG_ON_FOLIO(folio_test_large(folio), folio); > - > head = create_folio_buffers(folio, inode, 0); > blocksize = head->b_size; > bbits = block_size_bits(blocksize); > > - iblock = (sector_t)folio->index << (PAGE_SHIFT - bbits); > + if (WARN_ON(PAGE_SHIFT < bbits)) { > + iblock = (sector_t)folio->index >> (bbits - PAGE_SHIFT); > + } else { > + iblock = (sector_t)folio->index << (PAGE_SHIFT - bbits); > + } > lblock = (limit+blocksize-1) >> bbits; > bh = head; > nr = 0; BTW I See this pattern in: fs/mpage.c: do_mpage_readpage() fs/mpage.c: __mpage_writepage() A helper might be in order. Luis
On 4/17/23 04:27, Luis Chamberlain wrote: > On Fri, Apr 14, 2023 at 03:47:13PM +0200, Hannes Reinecke wrote: >> @@ -2333,13 +2395,15 @@ int block_read_full_folio(struct folio *folio, >> get_block_t *get_block) >> if (IS_ENABLED(CONFIG_FS_VERITY) && IS_VERITY(inode)) >> limit = inode->i_sb->s_maxbytes; >> >> - VM_BUG_ON_FOLIO(folio_test_large(folio), folio); >> - >> head = create_folio_buffers(folio, inode, 0); >> blocksize = head->b_size; >> bbits = block_size_bits(blocksize); >> >> - iblock = (sector_t)folio->index << (PAGE_SHIFT - bbits); >> + if (WARN_ON(PAGE_SHIFT < bbits)) { >> + iblock = (sector_t)folio->index >> (bbits - PAGE_SHIFT); >> + } else { >> + iblock = (sector_t)folio->index << (PAGE_SHIFT - bbits); >> + } >> lblock = (limit+blocksize-1) >> bbits; >> bh = head; >> nr = 0; > > BTW I See this pattern in: > > fs/mpage.c: do_mpage_readpage() > fs/mpage.c: __mpage_writepage() > > A helper might be in order. > But only _iff_ we decide to stick with buffer_heads and upgrade the buffer_head code to handle multi-page block sizes. The idea was to move over to iomap, hence I'm not sure into which lengths we should go keeping buffer_heads alive ... Cheers, Hannes
On Sun, Apr 16, 2023 at 03:07:33PM +0100, Matthew Wilcox wrote: > On Sat, Apr 15, 2023 at 10:26:42PM -0700, Luis Chamberlain wrote: > > On Sun, Apr 16, 2023 at 04:40:06AM +0100, Matthew Wilcox wrote: > > > I don't think we > > > should be overriding the aops, and if we narrow the scope of large folio > > > support in blockdev t only supporting folio_size == LBA size, it becomes > > > much more feasible. > > > > I'm trying to think of the possible use cases where folio_size != LBA size > > and I cannot immediately think of some. Yes there are cases where a > > filesystem may use a different block for say meta data than data, but that > > I believe is side issue, ie, read/writes for small metadata would have > > to be accepted. At least for NVMe we have metadata size as part of the > > LBA format, but from what I understand no Linux filesystem yet uses that. > > NVMe metadata is per-block metadata -- a CRC or similar. Filesystem > metadata is things like directories, inode tables, free space bitmaps, > etc. > > > struct buffer_head *alloc_page_buffers(struct page *page, unsigned long size, > > bool retry) > > { > [...] > > head = NULL; > > offset = PAGE_SIZE; > > while ((offset -= size) >= 0) { > > > > I see now what you say about the buffer head being of the block size > > bh->b_size = size above. > > Yes, just changing that to 'offset = page_size(page);' will do the trick. > > > > sb_bread() is used by most filesystems, and the buffer cache aliases > > > into the page cache. > > > > I see thanks. I checked what xfs does and its xfs_readsb() uses its own > > xfs_buf_read_uncached(). It ends up calling xfs_buf_submit() and > > xfs_buf_ioapply_map() does it's own submit_bio(). So I'm curious why > > they did that. > > IRIX didn't have an sb_bread() ;-) > > > > In userspace, if I run 'dd if=blah of=/dev/sda1 bs=512 count=1 seek=N', > > > I can overwrite the superblock. Do we want filesystems to see that > > > kind of vandalism, or do we want the mounted filesystem to have its > > > own copy of the data and overwrite what userspace wrote the next time it > > > updates the superblock? > > > > Oh, what happens today? > > Depends on the filesystem, I think? Not really sure, to be honest. The filesystem driver sees the vandalism, and can very well crash as a result[1]. In that case it was corrupted journal contents being replayed, but the same thing would happen if you wrote a malicious userspace program to set the metadata_csum feature flag in the ondisk superblock after mounting the fs. https://bugzilla.kernel.org/show_bug.cgi?id=82201#c4 I've tried to prevent people from writing to mounted block devices in the past, but did not succeed. If you try to prevent programs from opening such devices with O_RDWR/O_WRONLY you then break lvm tools which require that ability even though they don't actually write anything to the block device. If you make the block device write_iter function fail, then old e2fsprogs breaks and you get shouted at for breaking userspace. Hence I decided to let security researchers find these bugs and control the design discussion via CVE. That's not correct and it's not smart, but it preserves some of my sanity. --D