Message ID | 20240815090849.972355-1-kernel@pankajraghav.com
---|---
Series | enable bs > ps in XFS
Hi Pankaj,

I applied the first five patches and set minimum folio size for afs files to
8K (see attached patch) and ran some tests.

With simple tests, I can see in the trace log that it is definitely creating
8K folios where it would previously create 4K folios.

However, with 'xfstests -g quick', generic/075 generic/112 generic/393 fail
where they didn't previously. I won't be able to look into this more till
Monday.

If you want to try using afs for yourself, install the kafs-client package
(available on Fedora and Debian), do 'systemctl start afs.mount' and then you
can, say, do:

    ls /afs/openafs.org/www/docs.openafs.org/

and browse the publicly accessible files under there.

David
---
commit d676df787baee3b710b9f0d284b21518473feb3c
Author: David Howells <dhowells@redhat.com>
Date:   Fri Aug 16 19:54:25 2024 +0100

    afs: [DEBUGGING] Set min folio order

diff --git a/fs/afs/inode.c b/fs/afs/inode.c
index 3acf5e050072..c3842cba92e7 100644
--- a/fs/afs/inode.c
+++ b/fs/afs/inode.c
@@ -104,6 +104,7 @@ static int afs_inode_init_from_status(struct afs_operation *op,
 		inode->i_fop            = &afs_file_operations;
 		inode->i_mapping->a_ops = &afs_file_aops;
 		mapping_set_large_folios(inode->i_mapping);
+		mapping_set_folio_min_order(inode->i_mapping, 1);
 		break;
 	case AFS_FTYPE_DIR:
 		inode->i_mode   = S_IFDIR | (status->mode & S_IALLUGO);
Hi David, On Fri, Aug 16, 2024 at 08:31:03PM +0100, David Howells wrote: > Hi Pankaj, > > I applied the first five patches and set minimum folio size for afs files to > 8K (see attached patch) and ran some tests. > > With simple tests, I can see in the trace log that it is definitely creating > 8K folios where it would previously create 4K folios. > > However, with 'xfstests -g quick', generic/075 generic/112 generic/393 fail > where they didn't previously. I won't be able to look into this more till > Monday. Thanks for trying it out! As you might have seen the whole patchset, typically filesystems will require some changes to support min order correctly. That is why this patchset only enables XFS to use min order to support bs > ps. In the case of XFS (block-based FS), we set the min order to the FS block size as that is the smallest unit of operation in the data path, and we know for sure there are no implicit PAGE_SIZE assumption. I am no expert in network filesystems but are you sure there are no PAGE_SIZE assumption when manipulating folios from the page cache in AFS? Similar to AFS, XFS also supported large_folios but we found some bugs when we set min order to be the block size of the FS. > > If you want to try using afs for yourself, install the kafs-client package > (available on Fedora and Debian), do 'systemctl start afs.mount' and then you > can, say, do: > > ls /afs/openafs.org/www/docs.openafs.org/ > > and browse the publicly accessible files under there. Great. But is this enough to run FStests? I assume I also need some afs server to run the fstests? Are the tests just failing or are you getting some kernel panic? > > David > --- > commit d676df787baee3b710b9f0d284b21518473feb3c > Author: David Howells <dhowells@redhat.com> > Date: Fri Aug 16 19:54:25 2024 +0100 > > afs: [DEBUGGING] Set min folio order > > diff --git a/fs/afs/inode.c b/fs/afs/inode.c > index 3acf5e050072..c3842cba92e7 100644 > --- a/fs/afs/inode.c > +++ b/fs/afs/inode.c > @@ -104,6 +104,7 @@ static int afs_inode_init_from_status(struct afs_operation *op, > inode->i_fop = &afs_file_operations; > inode->i_mapping->a_ops = &afs_file_aops; > mapping_set_large_folios(inode->i_mapping); > + mapping_set_folio_min_order(inode->i_mapping, 1); > break; > case AFS_FTYPE_DIR: > inode->i_mode = S_IFDIR | (status->mode & S_IALLUGO); >
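For illustration, mapping a filesystem block size to a minimum folio order is
just the difference of the two shifts. The sketch below is written against the
series' mapping_set_folio_min_order() API; fs_min_folio_order() and
fs_init_mapping() are made-up names for illustration, not code from the patches:

#include <linux/pagemap.h>

/*
 * Sketch: derive the minimum folio order the page cache must use so that
 * one folio always covers at least one filesystem block.  A 16k block on
 * a 4k-page system gives order 2 (4 pages).
 */
static unsigned int fs_min_folio_order(unsigned int blocksize_shift)
{
	if (blocksize_shift <= PAGE_SHIFT)
		return 0;			/* bs <= ps: no constraint needed */
	return blocksize_shift - PAGE_SHIFT;	/* bs > ps: need larger folios */
}

/* At inode init time the mapping is then constrained accordingly: */
static void fs_init_mapping(struct address_space *mapping,
			    unsigned int blocksize_shift)
{
	mapping_set_folio_min_order(mapping, fs_min_folio_order(blocksize_shift));
}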
Pankaj Raghav (Samsung) <kernel@pankajraghav.com> wrote: > I am no expert in network filesystems but are you sure there are no > PAGE_SIZE assumption when manipulating folios from the page cache in > AFS? Note that I've removed the knowledge of the pagecache from 9p, afs and cifs to netfslib and intend to do the same to ceph. The client filesystems just provide read and write ops to netfslib and netfslib uses those to do ordinary buffered I/O, unbuffered I/O (selectable by mount option on some filesystems) and DIO. That said, I'm not sure that I haven't made some PAGE_SIZE assumptions. I don't *think* I have since netfslib is fully multipage folio capable, but I can't guarantee it. Mostly this was just a note to you that there might be an issue with your code - but I haven't investigated it yet and it could well be in my code. Apparently, I also need to update xfstests, so it could be that too. > > ls /afs/openafs.org/www/docs.openafs.org/ > > > > and browse the publicly accessible files under there. > > Great. But is this enough to run FStests? I assume I also need some afs > server to run the fstests? Sadly not, but if you turn on some tracepoints, you can see netfslib operating under the bonnet. > Are the tests just failing or are you getting some kernel panic? Just failing. Thanks, David
On 8/18/24 22:16, David Howells wrote: > Pankaj Raghav (Samsung) <kernel@pankajraghav.com> wrote: > >> I am no expert in network filesystems but are you sure there are no >> PAGE_SIZE assumption when manipulating folios from the page cache in >> AFS? > > Note that I've removed the knowledge of the pagecache from 9p, afs and cifs to > netfslib and intend to do the same to ceph. The client filesystems just > provide read and write ops to netfslib and netfslib uses those to do ordinary > buffered I/O, unbuffered I/O (selectable by mount option on some filesystems) > and DIO. > > That said, I'm not sure that I haven't made some PAGE_SIZE assumptions. I > don't *think* I have since netfslib is fully multipage folio capable, but I > can't guarantee it. > I guess you did: static int afs_fill_super(struct super_block *sb, struct afs_fs_context *ctx) { struct afs_super_info *as = AFS_FS_S(sb); struct inode *inode = NULL; int ret; _enter(""); /* fill in the superblock */ sb->s_blocksize = PAGE_SIZE; sb->s_blocksize_bits = PAGE_SHIFT; sb->s_maxbytes = MAX_LFS_FILESIZE; sb->s_magic = AFS_FS_MAGIC; sb->s_op = &afs_super_ops; IE you essentially nail AFS to use PAGE_SIZE. Not sure how you would tell AFS to use a different block size; maybe a mount option? And there are several other places which will need to be modified; eg afs_mntpt_set_params() is trying to read from a page which won't fly with large blocks (converted to read_full_folio()?), and, of course, the infamous AFS_DIR_BLOCKS_PER_PAGE which will overflow for large blocks. So some work is required, but everything looks doable. Maybe I can find some time until LPC. > Mostly this was just a note to you that there might be an issue with your code > - but I haven't investigated it yet and it could well be in my code. > Hmm. I'd rather fix the obvious places in afs first; just do a quick grep for 'PAGE_', that'll give you a good impression of places to look at. Cheers, Hannes
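To make the AFS_DIR_BLOCKS_PER_PAGE point concrete: a per-folio variant would
have to follow the size of the folio actually in use rather than PAGE_SIZE.
This is only a sketch of the sort of change described above;
afs_dir_blocks_per_folio() is a hypothetical helper, not something in the tree:

/* fs/afs/xdr_fs.h currently has:
 *   #define AFS_DIR_BLOCKS_PER_PAGE (PAGE_SIZE / AFS_DIR_BLOCK_SIZE)
 * With min-order (large) folios, the count would need to track the folio
 * rather than the base page size, e.g.:
 */
static inline unsigned int afs_dir_blocks_per_folio(const struct folio *folio)
{
	return folio_size(folio) / AFS_DIR_BLOCK_SIZE;	/* 2048-byte dir blocks */
}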
On Mon, Aug 19, 2024 at 09:24:11AM +0200, Hannes Reinecke wrote: > On 8/18/24 22:16, David Howells wrote: > > Pankaj Raghav (Samsung) <kernel@pankajraghav.com> wrote: > > > > > I am no expert in network filesystems but are you sure there are no > > > PAGE_SIZE assumption when manipulating folios from the page cache in > > > AFS? > > > > Note that I've removed the knowledge of the pagecache from 9p, afs and cifs to > > netfslib and intend to do the same to ceph. The client filesystems just > > provide read and write ops to netfslib and netfslib uses those to do ordinary > > buffered I/O, unbuffered I/O (selectable by mount option on some filesystems) > > and DIO. > > > > That said, I'm not sure that I haven't made some PAGE_SIZE assumptions. I > > don't *think* I have since netfslib is fully multipage folio capable, but I > > can't guarantee it. > > > I guess you did: > > static int afs_fill_super(struct super_block *sb, struct afs_fs_context > *ctx) > { > struct afs_super_info *as = AFS_FS_S(sb); > struct inode *inode = NULL; > int ret; > > _enter(""); > > /* fill in the superblock */ > sb->s_blocksize = PAGE_SIZE; > sb->s_blocksize_bits = PAGE_SHIFT; > sb->s_maxbytes = MAX_LFS_FILESIZE; > sb->s_magic = AFS_FS_MAGIC; > sb->s_op = &afs_super_ops; > > IE you essentially nail AFS to use PAGE_SIZE. > Not sure how you would tell AFS to use a different block size; > maybe a mount option? I saw this as well, but I didn't see this variable being used anywhere. Probably this has no meaning in a network-based FSs? > And there are several other places which will need to be modified; > eg afs_mntpt_set_params() is trying to read from a page which > won't fly with large blocks (converted to read_full_folio()?), > and, of course, the infamous AFS_DIR_BLOCKS_PER_PAGE which will > overflow for large blocks. But the min folio order is set only for AFS_FTYPE_FILE and not for AFS_FTYPE_DIR. > > So some work is required, but everything looks doable. > Maybe I can find some time until LPC. > > > Mostly this was just a note to you that there might be an issue with your code > > - but I haven't investigated it yet and it could well be in my code. > > > Hmm. I'd rather fix the obvious places in afs first; just do a quick > grep for 'PAGE_', that'll give you a good impression of places to look at. > Agree. > Cheers, > > Hannes > -- > Dr. Hannes Reinecke Kernel Storage Architect > hare@suse.de +49 911 74053 688 > SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg > HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich >
Hi Pankaj,

I can reproduce the problem with:

    xfs_io -t -f -c "pwrite -S 0x58 0 40" -c "fsync" -c "truncate 4" -c "truncate 4096" /xfstest.test/wubble; od -x /xfstest.test/wubble

borrowed from generic/393. I've distilled it down to the attached C program.

Turning on tracing and adding a bit more, I can see the problem happening.
Here's an excerpt of the tracing (I've added some non-upstream tracepoints).
Firstly, you can see the second pwrite at fpos 0, 40 bytes (ie. 0x28):

   pankaj-5833: netfs_write_iter: WRITE-ITER i=9e s=0 l=28 f=0
   pankaj-5833: netfs_folio: pfn=116fec i=0009e ix=00000-00001 mod-streamw

Then first ftruncate() is called to reduce the file size to 4:

   pankaj-5833: netfs_truncate: ni=9e isz=2028 rsz=2028 zp=4000 to=4
   pankaj-5833: netfs_inval_folio: pfn=116fec i=0009e ix=00000-00001 o=4 l=1ffc d=78787878
   pankaj-5833: netfs_folio: pfn=116fec i=0009e ix=00000-00001 inval-part
   pankaj-5833: netfs_set_size: ni=9e resize-file isz=4 rsz=4 zp=4

You can see the invalidate_folio call, with the offset at 0x4 and the length as
0x1ffc. The data at the beginning of the page is 0x78787878. This looks
correct.

Then second ftruncate() is called to increase the file size to 4096
(ie. 0x1000):

   pankaj-5833: netfs_truncate: ni=9e isz=4 rsz=4 zp=4 to=1000
   pankaj-5833: netfs_inval_folio: pfn=116fec i=0009e ix=00000-00001 o=1000 l=1000 d=78787878
   pankaj-5833: netfs_folio: pfn=116fec i=0009e ix=00000-00001 inval-part
   pankaj-5833: netfs_set_size: ni=9e resize-file isz=1000 rsz=1000 zp=4

And here's the problem: in the invalidate_folio() call, the offset is 0x1000
and the length is 0x1000 (o= and l=). But that's the wrong half of the folio!
I'm guessing that the caller thereafter clears the other half of the folio -
the bit that should be kept.

David
---
/* Distillation of the generic/393 xfstest */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>

#define ERR(x, y) do { if ((long)(x) == -1) { perror(y); exit(1); } } while(0)

static const char xxx[40] = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx";
static const char yyy[40] = "yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy";
static const char dropfile[] = "/proc/sys/vm/drop_caches";
static const char droptype[] = "3";
static const char file[] = "/xfstest.test/wubble";

int main(int argc, char *argv[])
{
	int fd, drop;

	/* Fill in the second 8K block of the file... */
	fd = open(file, O_CREAT|O_TRUNC|O_WRONLY, 0666);
	ERR(fd, "open");
	ERR(ftruncate(fd, 0), "pre-trunc $file");
	ERR(pwrite(fd, yyy, sizeof(yyy), 0x2000), "write-2000");
	ERR(close(fd), "close");

	/* ... and drop the pagecache so that we get a streaming
	 * write, attaching some private data to the folio.
	 */
	drop = open(dropfile, O_WRONLY);
	ERR(drop, dropfile);
	ERR(write(drop, droptype, sizeof(droptype) - 1), "write-drop");
	ERR(close(drop), "close-drop");

	fd = open(file, O_WRONLY, 0666);
	ERR(fd, "reopen");
	/* Make a streaming write on the first 8K block (needs O_WRONLY). */
	ERR(pwrite(fd, xxx, sizeof(xxx), 0), "write-0");
	/* Now use truncate to shrink and reexpand. */
	ERR(ftruncate(fd, 4), "trunc-4");
	ERR(ftruncate(fd, 4096), "trunc-4096");
	ERR(close(fd), "close-2");
	exit(0);
}
David Howells <dhowells@redhat.com> wrote: > You can see the invalidate_folio call, with the offset at 0x4 an the length as > 0x1ffc. The data at the beginning of the page is 0x78787878. This looks > correct. > > Then second ftruncate() is called to increase the file size to 4096 > (ie. 0x1000): > > pankaj-5833: netfs_truncate: ni=9e isz=4 rsz=4 zp=4 to=1000 > pankaj-5833: netfs_inval_folio: pfn=116fec i=0009e ix=00000-00001 o=1000 l=1000 d=78787878 > pankaj-5833: netfs_folio: pfn=116fec i=0009e ix=00000-00001 inval-part > pankaj-5833: netfs_set_size: ni=9e resize-file isz=1000 rsz=1000 zp=4 > > And here's the problem: in the invalidate_folio() call, the offset is 0x1000 > and the length is 0x1000 (o= and l=). But that's the wrong half of the folio! > I'm guessing that the caller thereafter clears the other half of the folio - > the bit that should be kept. Actually, I think I'm wrong in my evaluation - I think that's the region to be invalidated, not the region to be kept. David
Hannes Reinecke <hare@suse.de> wrote: > IE you essentially nail AFS to use PAGE_SIZE. > Not sure how you would tell AFS to use a different block size; > maybe a mount option? As far as I know: sb->s_blocksize = PAGE_SIZE; sb->s_blocksize_bits = PAGE_SHIFT; isn't used by the VM. > Hmm. I'd rather fix the obvious places in afs first; just do a quick > grep for 'PAGE_', that'll give you a good impression of places to look at. Sure: fs/afs/dir.c: nr_pages = (i_size + PAGE_SIZE - 1) / PAGE_SIZE; fs/afs/dir.c: req->len = nr_pages * PAGE_SIZE; /* We can ask for more than there is */ fs/afs/dir.c: task_io_account_read(PAGE_SIZE * req->nr_pages); fs/afs/dir.c: folio = __filemap_get_folio(dir->i_mapping, ctx->pos / PAGE_SIZE, fs/afs/xdr_fs.h:#define AFS_DIR_BLOCKS_PER_PAGE (PAGE_SIZE / AFS_DIR_BLOCK_SIZE) Those only affect directories. fs/afs/mntpt.c: if (size < 2 || size > PAGE_SIZE - 1) That only affects mountpoint symlinks. fs/afs/super.c: sb->s_blocksize = PAGE_SIZE; This is the only thing (and sb->s_blocksize_bits) that might affect files. I checked, and doubling this and adding 1 to bits does not alter the outcome. Now, the VM wrangling is offloaded to netfslib, and most of that is to do with converting between indices and file positions. Going through the usages of PAGE_SIZE there: fs/netfs/buffered_read.c: size += PAGE_SIZE << order; That was recording the size of a folio readahead allocated. fs/netfs/buffered_read.c: size_t nr_bvec = flen / PAGE_SIZE + 2; fs/netfs/buffered_read.c: part = min_t(size_t, to - off, PAGE_SIZE); Those two are used to fill in the gaps around a partial page - but that didn't appear in the logs. fs/netfs/buffered_write.c: pgoff_t index = pos / PAGE_SIZE; fs/netfs/buffered_write.c: fgp_flags |= fgf_set_order(pos % PAGE_SIZE + part); Those two are used when asking __filemap_get_folio() to allocate a folio to write into. I got a folio of the right size and index, so that's not the problem. fs/netfs/fscache_io.c: pgoff_t first = start / PAGE_SIZE; fs/netfs/fscache_io.c: pgoff_t last = (start + len - 1) / PAGE_SIZE; Caching is not enabled at the moment, so these don't happen. fs/netfs/iterator.c: cur_npages = DIV_ROUND_UP(ret, PAGE_SIZE); fs/netfs/iterator.c: len = ret > PAGE_SIZE ? PAGE_SIZE : ret; I'm not doing DIO, so these aren't used. fs/netfs/iterator.c: pgoff_t index = pos / PAGE_SIZE; I'm not using an ITER_XARRAY iterator, so this doesn't happen. fs/netfs/misc.c: rreq->io_iter.count += PAGE_SIZE << order; This is just multiplying up the folio size to add to the byte count. fs/netfs/read_collect.c: fsize = PAGE_SIZE << subreq->curr_folio_order; fs/netfs/read_collect.c: WARN_ON_ONCE(folioq_folio(folioq, slot)->index != fpos / PAGE_SIZE)) { These two are converting between a file pos and an index - but only during read, and I can see from wireshark that we're writing the wrong data to the server before we get this far. And that's all the PAGE_SIZE usages in afs and netfslib. David
On 8/19/24 13:46, David Howells wrote: > Hi Pankaj, > > I can reproduce the problem with: > > xfs_io -t -f -c "pwrite -S 0x58 0 40" -c "fsync" -c "truncate 4" -c "truncate 4096" /xfstest.test/wubble; od -x /xfstest.test/wubble > > borrowed from generic/393. I've distilled it down to the attached C program. > > Turning on tracing and adding a bit more, I can see the problem happening. > Here's an excerpt of the tracing (I've added some non-upstream tracepoints). > Firstly, you can see the second pwrite at fpos 0, 40 bytes (ie. 0x28): > > pankaj-5833: netfs_write_iter: WRITE-ITER i=9e s=0 l=28 f=0 > pankaj-5833: netfs_folio: pfn=116fec i=0009e ix=00000-00001 mod-streamw > > Then first ftruncate() is called to reduce the file size to 4: > > pankaj-5833: netfs_truncate: ni=9e isz=2028 rsz=2028 zp=4000 to=4 > pankaj-5833: netfs_inval_folio: pfn=116fec i=0009e ix=00000-00001 o=4 l=1ffc d=78787878 > pankaj-5833: netfs_folio: pfn=116fec i=0009e ix=00000-00001 inval-part > pankaj-5833: netfs_set_size: ni=9e resize-file isz=4 rsz=4 zp=4 > > You can see the invalidate_folio call, with the offset at 0x4 an the length as > 0x1ffc. The data at the beginning of the page is 0x78787878. This looks > correct. > > Then second ftruncate() is called to increase the file size to 4096 > (ie. 0x1000): > > pankaj-5833: netfs_truncate: ni=9e isz=4 rsz=4 zp=4 to=1000 > pankaj-5833: netfs_inval_folio: pfn=116fec i=0009e ix=00000-00001 o=1000 l=1000 d=78787878 > pankaj-5833: netfs_folio: pfn=116fec i=0009e ix=00000-00001 inval-part > pankaj-5833: netfs_set_size: ni=9e resize-file isz=1000 rsz=1000 zp=4 > > And here's the problem: in the invalidate_folio() call, the offset is 0x1000 > and the length is 0x1000 (o= and l=). But that's the wrong half of the folio! > I'm guessing that the caller thereafter clears the other half of the folio - > the bit that should be kept. > > David > --- > /* Distillation of the generic/393 xfstest */ > #include <stdio.h> > #include <stdlib.h> > #include <unistd.h> > #include <fcntl.h> > > #define ERR(x, y) do { if ((long)(x) == -1) { perror(y); exit(1); } } while(0) > > static const char xxx[40] = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"; > static const char yyy[40] = "yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy"; > static const char dropfile[] = "/proc/sys/vm/drop_caches"; > static const char droptype[] = "3"; > static const char file[] = "/xfstest.test/wubble"; > > int main(int argc, char *argv[]) > { > int fd, drop; > > /* Fill in the second 8K block of the file... */ > fd = open(file, O_CREAT|O_TRUNC|O_WRONLY, 0666); > ERR(fd, "open"); > ERR(ftruncate(fd, 0), "pre-trunc $file"); > ERR(pwrite(fd, yyy, sizeof(yyy), 0x2000), "write-2000"); > ERR(close(fd), "close"); > > /* ... and drop the pagecache so that we get a streaming > * write, attaching some private data to the folio. > */ > drop = open(dropfile, O_WRONLY); > ERR(drop, dropfile); > ERR(write(drop, droptype, sizeof(droptype) - 1), "write-drop"); > ERR(close(drop), "close-drop"); > > fd = open(file, O_WRONLY, 0666); > ERR(fd, "reopen"); > /* Make a streaming write on the first 8K block (needs O_WRONLY). */ > ERR(pwrite(fd, xxx, sizeof(xxx), 0), "write-0"); > /* Now use truncate to shrink and reexpand. */ > ERR(ftruncate(fd, 4), "trunc-4"); > ERR(ftruncate(fd, 4096), "trunc-4096"); > ERR(close(fd), "close-2"); > exit(0); > } > Wouldn't the second truncate end up with a 4k file, and not an 8k? IE the resulting file will be: After step 1: 8k After step 2: 4 After step 3: 4k Hmm? Cheers, Hannes
Hannes Reinecke <hare@suse.de> wrote:

> Wouldn't the second truncate end up with a 4k file, and not an 8k?
> IE the resulting file will be:
> After step 1: 8k
> After step 2: 4
> After step 3: 4k

Yes, but the folio should still be an 8K folio, and it is:

> pankaj-5833: netfs_folio: pfn=116fec i=0009e ix=00000-00001 inval-part

as indicated by the inclusive folio index range ix=00000-00001.

The problem is that the bottom four bytes of the file are getting cleared
somewhere. They *should* be "XXXX", but they're all zeros.

David
Okay, the code in netfs_invalidate_folio() isn't correct in the way it reduces
streaming writes. Attached is a patch that shows some of the changes I need to
make - but this is not yet working.

David
---
diff --git a/fs/netfs/misc.c b/fs/netfs/misc.c
index eaa0a992d178..e237c771eeb5 100644
--- a/fs/netfs/misc.c
+++ b/fs/netfs/misc.c
@@ -214,18 +215,34 @@ void netfs_invalidate_folio(struct folio *folio, size_t offset, size_t length)
 		/* We have a partially uptodate page from a streaming write. */
 		unsigned int fstart = finfo->dirty_offset;
 		unsigned int fend = fstart + finfo->dirty_len;
-		unsigned int end = offset + length;
+		unsigned int iend = offset + length;
 
 		if (offset >= fend)
 			return;
-		if (end <= fstart)
+		if (iend <= fstart)
+			return;
+
+		/* The invalidation region overlaps the data.  If the region
+		 * covers the start of the data, we either move along the start
+		 * or just erase the data entirely.
+		 */
+		if (offset <= fstart) {
+			if (iend >= fend)
+				goto erase_completely;
+			/* Move the start of the data. */
+			finfo->dirty_len = fend - iend;
+			finfo->dirty_offset = offset;
 			return;
-		if (offset <= fstart && end >= fend)
-			goto erase_completely;
-		if (offset <= fstart && end > fstart)
-			goto reduce_len;
-		if (offset > fstart && end >= fend)
-			goto move_start;
+		}
+
+		/* Reduce the length of the data if the invalidation region
+		 * covers the tail part.
+		 */
+		if (iend >= fend) {
+			finfo->dirty_len = offset - fstart;
+			return;
+		}
+
 		/* A partial write was split.  The caller has already zeroed
 		 * it, so just absorb the hole.
 		 */
@@ -238,12 +261,6 @@ void netfs_invalidate_folio(struct folio *folio, size_t offset, size_t length)
 	folio_clear_uptodate(folio);
 	kfree(finfo);
 	return;
-reduce_len:
-	finfo->dirty_len = offset + length - finfo->dirty_offset;
-	return;
-move_start:
-	finfo->dirty_len -= offset - finfo->dirty_offset;
-	finfo->dirty_offset = offset;
 }
 EXPORT_SYMBOL(netfs_invalidate_folio);
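For reference, here is a small userspace model of the arithmetic the rewrite
above is aiming for: how a streaming-write dirty region [fstart, fend) should
react when [off, off + len) of the same folio is invalidated. It only
illustrates the intent (note the head-trim case moves the start to the end of
the invalidated range); it is not the netfs code:

#include <stdio.h>

static void trim(unsigned int fstart, unsigned int flen,
		 unsigned int off, unsigned int len)
{
	unsigned int fend = fstart + flen, iend = off + len;

	if (off >= fend || iend <= fstart) {
		/* No overlap: dirty data untouched. */
	} else if (off <= fstart && iend >= fend) {
		flen = 0;				/* erase completely */
	} else if (off <= fstart) {
		flen = fend - iend; fstart = iend;	/* trim the head */
	} else if (iend >= fend) {
		flen = off - fstart;			/* trim the tail */
	}
	printf("dirty region now [%#x, %#x)\n", fstart, fstart + flen);
}

int main(void)
{
	/* generic/393 case: 0x28 dirty bytes at offset 0 in an 8K folio. */
	trim(0, 0x28, 0x4, 0x1ffc);	/* truncate to 4: keep [0, 4) */
	trim(0, 0x4, 0x1000, 0x1000);	/* truncate back up hits the other half: keep [0, 4) */
	return 0;
}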
> --- > /* Distillation of the generic/393 xfstest */ > #include <stdio.h> > #include <stdlib.h> > #include <unistd.h> > #include <fcntl.h> > > #define ERR(x, y) do { if ((long)(x) == -1) { perror(y); exit(1); } } while(0) > > static const char xxx[40] = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"; > static const char yyy[40] = "yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy"; > static const char dropfile[] = "/proc/sys/vm/drop_caches"; > static const char droptype[] = "3"; > static const char file[] = "/xfstest.test/wubble"; > > int main(int argc, char *argv[]) > { > int fd, drop; > > /* Fill in the second 8K block of the file... */ > fd = open(file, O_CREAT|O_TRUNC|O_WRONLY, 0666); > ERR(fd, "open"); > ERR(ftruncate(fd, 0), "pre-trunc $file"); > ERR(pwrite(fd, yyy, sizeof(yyy), 0x2000), "write-2000"); > ERR(close(fd), "close"); > > /* ... and drop the pagecache so that we get a streaming > * write, attaching some private data to the folio. > */ > drop = open(dropfile, O_WRONLY); > ERR(drop, dropfile); > ERR(write(drop, droptype, sizeof(droptype) - 1), "write-drop"); > ERR(close(drop), "close-drop"); > > fd = open(file, O_WRONLY, 0666); > ERR(fd, "reopen"); > /* Make a streaming write on the first 8K block (needs O_WRONLY). */ > ERR(pwrite(fd, xxx, sizeof(xxx), 0), "write-0"); > /* Now use truncate to shrink and reexpand. */ > ERR(ftruncate(fd, 4), "trunc-4"); > ERR(ftruncate(fd, 4096), "trunc-4096"); > ERR(close(fd), "close-2"); > exit(0); > } I tried this code on XFS, and it is working as expected (I am getting xxxx). [nix-shell:~/xfstests]# hexdump -C /media/test/wubble 00000000 78 78 78 78 00 00 00 00 00 00 00 00 00 00 00 00 |xxxx............| 00000010 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................| * 00001000 I did some tracing as well and here are the results. $ trace-cmd record -e xfs_file_fsync -e xfs_file_buffered_write -e xfs_setattr -e xfs_zero_eof -F -c ./a.out [nix-shell:~/xfstests]# trace-cmd report cpus=4 a.out-3872 [003] 84120.161472: xfs_setattr: dev 259:0 ino 0x103 iflags 0x0 a.out-3872 [003] 84120.172109: xfs_setattr: dev 259:0 ino 0x103 iflags 0x20 a.out-3872 [003] 84120.172151: xfs_zero_eof: dev 259:0 ino 0x103 isize 0x0 disize 0x0 pos 0x0 bytecount 0x2000 // First truncate a.out-3872 [003] 84120.172156: xfs_file_buffered_write: dev 259:0 ino 0x103 disize 0x0 pos 0x2000 bytecount 0x28 a.out-3872 [003] 84120.185423: xfs_file_buffered_write: dev 259:0 ino 0x103 disize 0x2028 pos 0x0 bytecount 0x28 a.out-3872 [003] 84120.185477: xfs_setattr: dev 259:0 ino 0x103 iflags 0x0 a.out-3872 [003] 84120.186493: xfs_setattr: dev 259:0 ino 0x103 iflags 0x20 a.out-3872 [003] 84120.186495: xfs_zero_eof: dev 259:0 ino 0x103 isize 0x4 disize 0x4 pos 0x4 bytecount 0xffc // Third truncate First and third truncate result in calling xfs_zero_eof as we are increasing the size of the file. When we do the second ftruncate(fd, 4), we call into iomap_truncate_page() with offset 0: int iomap_truncate_page(struct inode *inode, loff_t pos, bool *did_zero, const struct iomap_ops *ops) { unsigned int blocksize = i_blocksize(inode); unsigned int off = pos & (blocksize - 1); /* Block boundary? Nothing to do */ if (!off) return 0; return iomap_zero_range(inode, pos, blocksize - off, did_zero, ops); } As you can see, we take into account the blocksize (which is set as minorder during inode init) and make sure the sub-block zeroing is done correctly. Also if you see iomap_invalidate_folio(), we don't remove the folio private data until the whole folio is invalidated. 
I doubt we are doing anything wrong in the page cache layer with these
patches. All we do with min order support is make sure we always allocate
folios in the page cache that are at least min order in size and aligned to
the min order (PATCH 2 and 3), and we maintain this even when we do a split
(PATCH 4).

I hope this helps!
--
Pankaj
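A toy illustration of that min-order rule, assuming a min order of 1 (8K
folios on a 4K page system) as in the afs debugging patch; the rounding below
is illustrative, not lifted from the series:

#include <stdio.h>

int main(void)
{
	unsigned int min_order = 1;		/* 8K folios on 4K pages */
	unsigned long nr = 1UL << min_order;	/* pages per minimum folio */
	unsigned long index = 3;		/* page index a fault/read wants */

	/* The covering folio must start on a min-order-aligned index. */
	unsigned long folio_index = index & ~(nr - 1);

	printf("min folio spans indices %lu-%lu (%lu pages)\n",
	       folio_index, folio_index + nr - 1, nr);
	return 0;
}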
Okay, I think there is a bug in your patches also. If I do: xfs_io -t -f -c "pwrite -S 0x58 0 40" -c "fsync" \ -c "truncate 4" -c "truncate 4096" \ /xfstest.test/wubble; od /xfstest.test/wubble I see: xfs_io-6059: netfs_truncate: ni=9e isz=1000 rsz=1000 zp=0 to=0 xfs_io-6059: netfs_set_size: ni=9e resize-file isz=0 rsz=0 zp=0 xfs_io-6059: netfs_write_iter: WRITE-ITER i=9e s=0 l=28 f=0 xfs_io-6059: netfs_folio: pfn=10d996 i=0009e ix=00000-00001 mod-n-clear d=5858585858585858 xfs_io-6059: netfs_write: R=0000000c WRITEBACK c=00000002 i=9e by=0-ffffffffffffffff xfs_io-6059: netfs_folio: pfn=10d996 i=0009e ix=00000-00001 store d=5858585858585858 xfs_io-6059: netfs_sreq: R=0000000c[1] UPLD PREP f=00 s=0 0/0 e=0 xfs_io-6059: netfs_sreq: R=0000000c[1] UPLD SUBMT f=100 s=0 0/28 e=0 kworker-5948: netfs_sreq: R=0000000c[1] UPLD TERM f=100 s=0 28/28 e=0 kworker-5948: netfs_rreq: R=0000000c WB COLLECT f=2120 kworker-5948: netfs_sreq: R=0000000c[1] UPLD FREE f=00 s=0 28/28 e=0 kworker-5948: netfs_folio: pfn=10d996 i=0009e ix=00000-00001 clear d=5858585858585858 kworker-5948: netfs_rreq: R=0000000c WB WR-DONE f=2120 kworker-5948: netfs_rreq: R=0000000c WB WAKE-IP f=2120 kworker-5948: netfs_rreq: R=0000000c WB FREE f=2100 xfs_io-6059: netfs_truncate: ni=9e isz=28 rsz=28 zp=0 to=4 xfs_io-6059: netfs_set_size: ni=9e resize-file isz=4 rsz=4 zp=0 But ->release_folio() should have been called here because netfs_inode_init() would have called mapping_set_release_always() for ordinary afs files. xfs_io-6059: netfs_truncate: ni=9e isz=4 rsz=4 zp=0 to=1000 xfs_io-6059: netfs_set_size: ni=9e resize-file isz=1000 rsz=1000 zp=0 od-6060: netfs_read: R=0000000d READAHEAD c=00000002 ni=9e s=0 l=2000 sz=1000 od-6060: netfs_folio: pfn=10d996 i=0009e ix=00000-00001 read d=58585858 od-6060: netfs_sreq: R=0000000d[1] ---- ADD f=00 s=0 0/2000 e=0 od-6060: netfs_sreq: R=0000000d[1] ZERO SUBMT f=00 s=0 0/2000 e=0 od-6060: netfs_sreq: R=0000000d[1] ZERO CLEAR f=02 s=0 2000/2000 e=0 od-6060: netfs_folio: pfn=10d996 i=0009e ix=00000-00001 read-done d=0 od-6060: netfs_folio: pfn=10d996 i=0009e ix=00000-00001 read-unlock d=0 od-6060: netfs_sreq: R=0000000d[1] ZERO TERM f=02 s=0 2000/2000 e=0 od-6060: netfs_sreq: R=0000000d[1] ZERO FREE f=02 s=0 2000/2000 e=0 od-6060: netfs_rreq: R=0000000d RA ASSESS f=20 od-6060: netfs_rreq: R=0000000d RA WAKE-IP f=20 od-6060: netfs_rreq: R=0000000d RA DONE f=00 od-6060: netfs_folio: pfn=10d996 i=0009e ix=00000-00001 read-put d=0 kworker-5948: netfs_rreq: R=0000000d RA FREE f=00 David
Pankaj Raghav (Samsung) <kernel@pankajraghav.com> wrote:

> I tried this code on XFS, and it is working as expected (I am getting
> xxxx).

XFS doesn't try to use mapping_set_release_always().

David
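The difference matters because the truncate/invalidate paths only call
->release_folio() on a clean folio when the mapping has opted in. Roughly, as
a sketch modelled on the check in mm (not verbatim kernel code):

static inline bool needs_release(struct folio *folio,
				 struct address_space *mapping)
{
	/* netfs-backed inodes set release_always; XFS does not. */
	return folio_has_private(folio) ||
	       (mapping && mapping_release_always(mapping));
}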
On Mon, Aug 19, 2024 at 07:40:44PM +0100, David Howells wrote:
> Pankaj Raghav (Samsung) <kernel@pankajraghav.com> wrote:
>
> > I tried this code on XFS, and it is working as expected (I am getting
> > xxxx).
>
> XFS doesn't try to use mapping_set_release_always().

Thanks David for digging deep. It is indeed a bug in this patchset (PATCH 1).
I think I overlooked the way we mask the folio order bits when we changed it
some time back.

But I still don't know why AS_RELEASE_ALWAYS is being cleared, because it is
BIT 6 and the existing bug should not affect BIT 6.

The following triggers an ASSERT failure:

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 0fcf235e5023..35961d73d54a 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -88,9 +88,13 @@ xfs_inode_alloc(
 
 	/* VFS doesn't initialise i_mode! */
 	VFS_I(ip)->i_mode = 0;
+	mapping_set_unevictable(VFS_I(ip)->i_mapping);
 	mapping_set_folio_min_order(VFS_I(ip)->i_mapping,
 				    M_IGEO(mp)->min_folio_order);
 
+	ASSERT(mapping_unevictable(VFS_I(ip)->i_mapping) == 1);
+
+	mapping_clear_unevictable(VFS_I(ip)->i_mapping);
 	XFS_STATS_INC(mp, vn_active);
 	ASSERT(atomic_read(&ip->i_pincount) == 0);
 	ASSERT(ip->i_ino == 0);

The patch that fixes this is:

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 61a7649d86e5..5e245b8dcfd6 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -217,6 +217,7 @@ enum mapping_flags {
 #define AS_FOLIO_ORDER_MASK     ((1u << AS_FOLIO_ORDER_BITS) - 1)
 #define AS_FOLIO_ORDER_MIN_MASK (AS_FOLIO_ORDER_MASK << AS_FOLIO_ORDER_MIN)
 #define AS_FOLIO_ORDER_MAX_MASK (AS_FOLIO_ORDER_MASK << AS_FOLIO_ORDER_MAX)
+#define AS_FOLIO_ORDER_MIN_MAX_MASK (AS_FOLIO_ORDER_MIN_MASK | AS_FOLIO_ORDER_MAX_MASK)
 
 /**
  * mapping_set_error - record a writeback error in the address_space
@@ -418,7 +419,7 @@ static inline void mapping_set_folio_order_range(struct address_space *mapping,
 	if (max < min)
 		max = min;
 
-	mapping->flags = (mapping->flags & ~AS_FOLIO_ORDER_MASK) |
+	mapping->flags = (mapping->flags & ~AS_FOLIO_ORDER_MIN_MAX_MASK) |
 			 (min << AS_FOLIO_ORDER_MIN) | (max << AS_FOLIO_ORDER_MAX);
 }

Could you try this patch and see if it fixes it by any chance?

--
Pankaj
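A userspace model of the mask bug makes the ASSERT failure above easy to see:
clearing with the unshifted AS_FOLIO_ORDER_MASK wipes low flag bits such as
AS_UNEVICTABLE instead of the shifted min/max fields. The bit positions below
follow the patch but are only illustrative:

#include <stdio.h>

#define AS_UNEVICTABLE		3	/* low mapping flag */
#define AS_RELEASE_ALWAYS	6	/* low mapping flag */
#define AS_FOLIO_ORDER_MIN	16	/* illustrative field positions */
#define AS_FOLIO_ORDER_MAX	21
#define AS_FOLIO_ORDER_BITS	5
#define AS_FOLIO_ORDER_MASK	((1u << AS_FOLIO_ORDER_BITS) - 1)
#define AS_FOLIO_ORDER_MIN_MASK	(AS_FOLIO_ORDER_MASK << AS_FOLIO_ORDER_MIN)
#define AS_FOLIO_ORDER_MAX_MASK	(AS_FOLIO_ORDER_MASK << AS_FOLIO_ORDER_MAX)

int main(void)
{
	unsigned int flags = (1u << AS_UNEVICTABLE) | (1u << AS_RELEASE_ALWAYS);
	unsigned int min = 2, max = 7;

	/* Buggy: ~AS_FOLIO_ORDER_MASK clears bits 0-4 (unrelated low flags,
	 * AS_UNEVICTABLE among them) and leaves any previously set order
	 * fields at bits 16+ uncleared. */
	unsigned int buggy = (flags & ~AS_FOLIO_ORDER_MASK) |
			     (min << AS_FOLIO_ORDER_MIN) | (max << AS_FOLIO_ORDER_MAX);

	/* Fixed: clear only the two shifted order fields. */
	unsigned int fixed = (flags & ~(AS_FOLIO_ORDER_MIN_MASK | AS_FOLIO_ORDER_MAX_MASK)) |
			     (min << AS_FOLIO_ORDER_MIN) | (max << AS_FOLIO_ORDER_MAX);

	printf("buggy: unevictable=%u  fixed: unevictable=%u\n",
	       (buggy >> AS_UNEVICTABLE) & 1, (fixed >> AS_UNEVICTABLE) & 1);
	return 0;
}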
Okay, I think I've found the bugs in my code and in truncate. It appears
they're exposed by your code, but exist upstream. You can add:
Tested-by: David Howells <dhowells@redhat.com>
to patches 1-5 if you wish.
David
On Wed, Aug 21, 2024 at 12:24:24AM +0100, David Howells wrote:
> Okay, I think I've found the bugs in my code and in truncate. It appears
> they're exposed by your code, but exist upstream. You can add:
>
> Tested-by: David Howells <dhowells@redhat.com>
>
> to patches 1-5 if you wish.

Thanks, David. I will send a new version with your Tested-by and the one fix
in the first patch.
From: Pankaj Raghav <p.raghav@samsung.com>

This is the 12th version of the series that enables block size > page size
(Large Block Size) experimental support in XFS. Please consider this for
inclusion in 6.12. The series is based on fs-next as I was not able to run
tests on the latest linux-next.

The context and motivation can be seen in the cover letter of RFC v1 [0]. We
also recorded a talk about this effort at LPC [1], if someone would like more
context on this effort.

A lot of emphasis has been put on testing using kdevops, starting with an XFS
baseline [3]. The testing has been split into regression and progression.

Regression testing:
In regression testing, we ran the whole test suite to check for regressions
on existing profiles due to the page cache changes.

I also ran the split_huge_page_test selftest on an XFS filesystem to check
that huge page splits into min order chunks are done correctly.

No regressions were found with these patches added on top.

Progression testing:
For progression testing, we tested for 8k, 16k, 32k and 64k block sizes. To
compare it with existing support, an ARM VM with a 64k base page size
(without our patches) was used as a reference to check for actual failures
due to LBS support on a 4k base page size system.

No new failures were found with the LBS support.

We've done some preliminary performance tests with fio on XFS on 4k block
size against pmem and NVMe with buffered IO and Direct IO on vanilla vs. with
these patches applied, and detected no regressions.

We ran sysbench on postgres and mysql for several hours on LBS XFS without
any issues.

We also wrote an eBPF tool called blkalgn [5] to see if IO sent to the device
is aligned and at least filesystem block size in length.

For those who want this in a git tree, we have this up on a kdevops
large-block-minorder-for-next-v12 tag [6].

[0] https://lore.kernel.org/lkml/20230915183848.1018717-1-kernel@pankajraghav.com/
[1] https://www.youtube.com/watch?v=ar72r5Xf7x4
[2] https://lkml.kernel.org/r/20240501153120.4094530-1-willy@infradead.org
[3] https://github.com/linux-kdevops/kdevops/blob/master/docs/xfs-bugs.md
    489 non-critical issues and 55 critical issues. We've determined and
    reported that the 55 critical issues all fall into 5 common XFS asserts or
    hung tasks and 2 memory management asserts.
[4] https://github.com/linux-kdevops/fstests/tree/lbs-fixes
[5] https://github.com/iovisor/bcc/pull/4813
[6] https://github.com/linux-kdevops/linux/
[7] https://lore.kernel.org/linux-kernel/Zl20pc-YlIWCSy6Z@casper.infradead.org/#t

Changes since v11:
- Minor string alignment fixup.
- Collected RVB from Dave.
Dave Chinner (1):
  xfs: use kvmalloc for xattr buffers

Luis Chamberlain (1):
  mm: split a folio in minimum folio order chunks

Matthew Wilcox (Oracle) (1):
  fs: Allow fine-grained control of folio sizes

Pankaj Raghav (7):
  filemap: allocate mapping_min_order folios in the page cache
  readahead: allocate folios with mapping_min_order in readahead
  filemap: cap PTE range to be created to allowed zero fill in folio_map_range()
  iomap: fix iomap_dio_zero() for fs bs > system page size
  xfs: expose block size in stat
  xfs: make the calculation generic in xfs_sb_validate_fsb_count()
  xfs: enable block size larger than page size support

 fs/iomap/buffered-io.c        |   4 +-
 fs/iomap/direct-io.c          |  45 +++++++++++--
 fs/xfs/libxfs/xfs_attr_leaf.c |  15 ++---
 fs/xfs/libxfs/xfs_ialloc.c    |   5 ++
 fs/xfs/libxfs/xfs_shared.h    |   3 +
 fs/xfs/xfs_icache.c           |   6 +-
 fs/xfs/xfs_iops.c             |   2 +-
 fs/xfs/xfs_mount.c            |   8 ++-
 fs/xfs/xfs_super.c            |  28 +++++---
 include/linux/huge_mm.h       |  14 ++--
 include/linux/pagemap.h       | 122 ++++++++++++++++++++++++++++++----
 mm/filemap.c                  |  36 ++++++----
 mm/huge_memory.c              |  59 ++++++++++++++--
 mm/readahead.c                |  83 +++++++++++++++++------
 14 files changed, 345 insertions(+), 85 deletions(-)


base-commit: bb62fbd2b0e31b2ed5dccf1dc4489460137fdf5c