[v12,00/10] enable bs > ps in XFS

Message ID 20240815090849.972355-1-kernel@pankajraghav.com

Message

Pankaj Raghav (Samsung) Aug. 15, 2024, 9:08 a.m. UTC
From: Pankaj Raghav <p.raghav@samsung.com>

This is the 12th version of the series that enables block size > page size
(Large Block Size) experimental support in XFS. Please consider this for
inclusion in 6.12.
The series is based on fs-next as I was not able to run tests on
the latest linux-next.

The context and motivation can be seen in the cover letter of the RFC v1 [0].
We also recorded a talk about this effort at LPC [1], for anyone who would
like more context.

A lot of emphasis has been put on testing using kdevops, starting with an XFS
baseline [3]. The testing has been split into regression and progression.

Regression testing:
In regression testing, we ran the whole test suite to check for regressions on
existing profiles due to the page cache changes.

I also ran the split_huge_page_test selftest on an XFS filesystem to check
that huge page splits in min-order chunks are done correctly.

No regressions were found with these patches added on top.

Progression testing:
For progression testing, we tested for 8k, 16k, 32k and 64k block sizes. To
compare with the existing support, an ARM VM with a 64k base page size
(without our patches) was used as a reference to separate actual failures
caused by LBS support on a 4k base page size system from pre-existing ones.

No new failures were found with the LBS support.

We've done some preliminary performance tests with fio on XFS with a 4k
block size against pmem and NVMe, with buffered IO and Direct IO, comparing
vanilla against these patches applied, and detected no regressions.

We ran sysbench on postgres and mysql for several hours on LBS XFS
without any issues.

We also wrote an eBPF tool called blkalgn [5] to check that IO sent to the
device is aligned and at least one filesystem block in length.
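
As a rough illustration of the check the tool performs (plain C, logic
only; the actual tool is an eBPF/bcc program and these helper names are
ours, not the tool's):

#include <stdio.h>

/* Sketch of the per-IO property blkalgn verifies: the device offset is
 * aligned to the FS block size and the length is at least one FS block. */
static int io_well_formed(unsigned long long offset,
			  unsigned long long length,
			  unsigned int fs_block_size)
{
	return (offset % fs_block_size) == 0 && length >= fs_block_size;
}

int main(void)
{
	/* With 16k FS blocks: a 32k write at offset 16k is fine... */
	printf("%d\n", io_well_formed(0x4000, 0x8000, 16384)); /* 1 */
	/* ...but a misaligned 4k write at offset 4k is not. */
	printf("%d\n", io_well_formed(0x1000, 0x1000, 16384)); /* 0 */
	return 0;
}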

For those who want this in a git tree, we have it up on the kdevops linux
tree under the large-block-minorder-for-next-v12 tag [6].

[0] https://lore.kernel.org/lkml/20230915183848.1018717-1-kernel@pankajraghav.com/
[1] https://www.youtube.com/watch?v=ar72r5Xf7x4
[2] https://lkml.kernel.org/r/20240501153120.4094530-1-willy@infradead.org
[3] https://github.com/linux-kdevops/kdevops/blob/master/docs/xfs-bugs.md
489 non-critical issues and 55 critical issues. We've determined and reported
that the 55 critical issues all fall into 5 common XFS asserts or hung tasks
and 2 memory management asserts.
[4] https://github.com/linux-kdevops/fstests/tree/lbs-fixes
[5] https://github.com/iovisor/bcc/pull/4813
[6] https://github.com/linux-kdevops/linux/
[7] https://lore.kernel.org/linux-kernel/Zl20pc-YlIWCSy6Z@casper.infradead.org/#t

Changes since v11:
- Minor string alignment fixup.
- Collected RVB from Dave.

Dave Chinner (1):
  xfs: use kvmalloc for xattr buffers

Luis Chamberlain (1):
  mm: split a folio in minimum folio order chunks

Matthew Wilcox (Oracle) (1):
  fs: Allow fine-grained control of folio sizes

Pankaj Raghav (7):
  filemap: allocate mapping_min_order folios in the page cache
  readahead: allocate folios with mapping_min_order in readahead
  filemap: cap PTE range to be created to allowed zero fill in
    folio_map_range()
  iomap: fix iomap_dio_zero() for fs bs > system page size
  xfs: expose block size in stat
  xfs: make the calculation generic in xfs_sb_validate_fsb_count()
  xfs: enable block size larger than page size support

 fs/iomap/buffered-io.c        |   4 +-
 fs/iomap/direct-io.c          |  45 +++++++++++--
 fs/xfs/libxfs/xfs_attr_leaf.c |  15 ++---
 fs/xfs/libxfs/xfs_ialloc.c    |   5 ++
 fs/xfs/libxfs/xfs_shared.h    |   3 +
 fs/xfs/xfs_icache.c           |   6 +-
 fs/xfs/xfs_iops.c             |   2 +-
 fs/xfs/xfs_mount.c            |   8 ++-
 fs/xfs/xfs_super.c            |  28 +++++---
 include/linux/huge_mm.h       |  14 ++--
 include/linux/pagemap.h       | 122 ++++++++++++++++++++++++++++++----
 mm/filemap.c                  |  36 ++++++----
 mm/huge_memory.c              |  59 ++++++++++++++--
 mm/readahead.c                |  83 +++++++++++++++++------
 14 files changed, 345 insertions(+), 85 deletions(-)


base-commit: bb62fbd2b0e31b2ed5dccf1dc4489460137fdf5c

Comments

David Howells Aug. 16, 2024, 7:31 p.m. UTC | #1
Hi Pankaj,

I applied the first five patches and set minimum folio size for afs files to
8K (see attached patch) and ran some tests.

With simple tests, I can see in the trace log that it is definitely creating
8K folios where it would previously create 4K folios.

However, with 'xfstests -g quick', generic/075 generic/112 generic/393 fail
where they didn't previously.  I won't be able to look into this more till
Monday.

If you want to try using afs for yourself, install the kafs-client package
(available on Fedora and Debian), do 'systemctl start afs.mount' and then you
can, say, do:

	ls /afs/openafs.org/www/docs.openafs.org/

and browse the publicly accessible files under there.

David
---
commit d676df787baee3b710b9f0d284b21518473feb3c
Author: David Howells <dhowells@redhat.com>
Date:   Fri Aug 16 19:54:25 2024 +0100

    afs: [DEBUGGING] Set min folio order

diff --git a/fs/afs/inode.c b/fs/afs/inode.c
index 3acf5e050072..c3842cba92e7 100644
--- a/fs/afs/inode.c
+++ b/fs/afs/inode.c
@@ -104,6 +104,7 @@ static int afs_inode_init_from_status(struct afs_operation *op,
 		inode->i_fop	= &afs_file_operations;
 		inode->i_mapping->a_ops	= &afs_file_aops;
 		mapping_set_large_folios(inode->i_mapping);
+		mapping_set_folio_min_order(inode->i_mapping, 1);
 		break;
 	case AFS_FTYPE_DIR:
 		inode->i_mode	= S_IFDIR |  (status->mode & S_IALLUGO);
Pankaj Raghav (Samsung) Aug. 18, 2024, 4:51 p.m. UTC | #2
Hi David,

On Fri, Aug 16, 2024 at 08:31:03PM +0100, David Howells wrote:
> Hi Pankaj,
> 
> I applied the first five patches and set minimum folio size for afs files to
> 8K (see attached patch) and ran some tests.
> 
> With simple tests, I can see in the trace log that it is definitely creating
> 8K folios where it would previously create 4K folios.
> 
> However, with 'xfstests -g quick', generic/075 generic/112 generic/393 fail
> where they didn't previously.  I won't be able to look into this more till
> Monday.

Thanks for trying it out!

As you might have seen from the whole patchset, filesystems will typically
require some changes to support min order correctly. That is why
this patchset only enables XFS to use min order to support bs > ps.

In the case of XFS (a block-based FS), we set the min order to match the FS
block size, as that is the smallest unit of operation in the data path,
and we know for sure there are no implicit PAGE_SIZE assumptions.
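
As a rough sketch of that relationship (user-space C for illustration;
the kernel-side helper this series uses is mapping_set_folio_min_order()):

#include <stdio.h>

#define PAGE_SHIFT 12 /* 4k base pages, as in this discussion */

/* Sketch: with a block size of 2^bs_shift bytes, the minimum folio order
 * is how far the block size exceeds the page size (0 when bs <= ps). */
static unsigned int min_folio_order(unsigned int bs_shift)
{
	return bs_shift > PAGE_SHIFT ? bs_shift - PAGE_SHIFT : 0;
}

int main(void)
{
	printf("bs=4k  -> min order %u\n", min_folio_order(12)); /* 0 */
	printf("bs=16k -> min order %u\n", min_folio_order(14)); /* 2 */
	printf("bs=64k -> min order %u\n", min_folio_order(16)); /* 4 */
	return 0;
}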

I am no expert in network filesystems, but are you sure there are no
PAGE_SIZE assumptions when manipulating folios from the page cache in
AFS?

Like AFS, XFS also supported large folios, but we found some bugs when we
set the min order to be the block size of the FS.
> 
> If you want to try using afs for yourself, install the kafs-client package
> (available on Fedora and Debian), do 'systemctl start afs.mount' and then you
> can, say, do:
> 
> 	ls /afs/openafs.org/www/docs.openafs.org/
> 
> and browse the publicly accessible files under there.

Great. But is this enough to run fstests? I assume I also need an AFS
server to run them?

Are the tests just failing or are you getting some kernel panic?

> 
> David
> ---
> commit d676df787baee3b710b9f0d284b21518473feb3c
> Author: David Howells <dhowells@redhat.com>
> Date:   Fri Aug 16 19:54:25 2024 +0100
> 
>     afs: [DEBUGGING] Set min folio order
> 
> diff --git a/fs/afs/inode.c b/fs/afs/inode.c
> index 3acf5e050072..c3842cba92e7 100644
> --- a/fs/afs/inode.c
> +++ b/fs/afs/inode.c
> @@ -104,6 +104,7 @@ static int afs_inode_init_from_status(struct afs_operation *op,
>  		inode->i_fop	= &afs_file_operations;
>  		inode->i_mapping->a_ops	= &afs_file_aops;
>  		mapping_set_large_folios(inode->i_mapping);
> +		mapping_set_folio_min_order(inode->i_mapping, 1);
>  		break;
>  	case AFS_FTYPE_DIR:
>  		inode->i_mode	= S_IFDIR |  (status->mode & S_IALLUGO);
>
David Howells Aug. 18, 2024, 8:16 p.m. UTC | #3
Pankaj Raghav (Samsung) <kernel@pankajraghav.com> wrote:

> I am no expert in network filesystems, but are you sure there are no
> PAGE_SIZE assumptions when manipulating folios from the page cache in
> AFS?

Note that I've moved the knowledge of the pagecache out of 9p, afs and cifs
into netfslib and intend to do the same for ceph.  The client filesystems just
provide read and write ops to netfslib and netfslib uses those to do ordinary
buffered I/O, unbuffered I/O (selectable by mount option on some filesystems)
and DIO.

That said, I'm not sure that I haven't made some PAGE_SIZE assumptions.  I
don't *think* I have since netfslib is fully multipage folio capable, but I
can't guarantee it.

Mostly this was just a note to you that there might be an issue with your code
- but I haven't investigated it yet and it could well be in my code.

Apparently, I also need to update xfstests, so it could be that too.

> > 	ls /afs/openafs.org/www/docs.openafs.org/
> > 
> > and browse the publicly accessible files under there.
> 
> Great. But is this enough to run FStests? I assume I also need some afs
> server to run the fstests?

Sadly not, but if you turn on some tracepoints, you can see netfslib operating
under the bonnet.

> Are the tests just failing or are you getting some kernel panic?

Just failing.

Thanks,
David
Hannes Reinecke Aug. 19, 2024, 7:24 a.m. UTC | #4
On 8/18/24 22:16, David Howells wrote:
> Pankaj Raghav (Samsung) <kernel@pankajraghav.com> wrote:
> 
>> I am no expert in network filesystems, but are you sure there are no
>> PAGE_SIZE assumptions when manipulating folios from the page cache in
>> AFS?
> 
> Note that I've moved the knowledge of the pagecache out of 9p, afs and cifs
> into netfslib and intend to do the same for ceph.  The client filesystems just
> provide read and write ops to netfslib and netfslib uses those to do ordinary
> buffered I/O, unbuffered I/O (selectable by mount option on some filesystems)
> and DIO.
> 
> That said, I'm not sure that I haven't made some PAGE_SIZE assumptions.  I
> don't *think* I have since netfslib is fully multipage folio capable, but I
> can't guarantee it.
> 
I guess you did:

static int afs_fill_super(struct super_block *sb, struct afs_fs_context *ctx)
{
         struct afs_super_info *as = AFS_FS_S(sb);
         struct inode *inode = NULL;
         int ret;

         _enter("");

         /* fill in the superblock */
         sb->s_blocksize         = PAGE_SIZE;
         sb->s_blocksize_bits    = PAGE_SHIFT;
         sb->s_maxbytes          = MAX_LFS_FILESIZE;
         sb->s_magic             = AFS_FS_MAGIC;
         sb->s_op                = &afs_super_ops;

IE you essentially nail AFS to use PAGE_SIZE.
Not sure how you would tell AFS to use a different block size;
maybe a mount option?

And there are several other places which will need to be modified;
eg afs_mntpt_set_params() is trying to read from a page which
won't fly with large blocks (converted to read_full_folio()?),
and, of course, the infamous AFS_DIR_BLOCKS_PER_PAGE which will
overflow for large blocks.

So some work is required, but everything looks doable.
Maybe I can find some time before LPC.

> Mostly this was just a note to you that there might be an issue with your code
> - but I haven't investigated it yet and it could well be in my code.
> 
Hmm. I'd rather fix the obvious places in afs first; just do a quick
grep for 'PAGE_', that'll give you a good impression of places to look at.

Cheers,

Hannes
Pankaj Raghav (Samsung) Aug. 19, 2024, 7:37 a.m. UTC | #5
On Mon, Aug 19, 2024 at 09:24:11AM +0200, Hannes Reinecke wrote:
> On 8/18/24 22:16, David Howells wrote:
> > Pankaj Raghav (Samsung) <kernel@pankajraghav.com> wrote:
> > 
> > > I am no expert in network filesystems, but are you sure there are no
> > > PAGE_SIZE assumptions when manipulating folios from the page cache in
> > > AFS?
> > 
> > Note that I've moved the knowledge of the pagecache out of 9p, afs and cifs
> > into netfslib and intend to do the same for ceph.  The client filesystems just
> > provide read and write ops to netfslib and netfslib uses those to do ordinary
> > buffered I/O, unbuffered I/O (selectable by mount option on some filesystems)
> > and DIO.
> > 
> > That said, I'm not sure that I haven't made some PAGE_SIZE assumptions.  I
> > don't *think* I have since netfslib is fully multipage folio capable, but I
> > can't guarantee it.
> > 
> I guess you did:
> 
> static int afs_fill_super(struct super_block *sb, struct afs_fs_context
> *ctx)
> {
>         struct afs_super_info *as = AFS_FS_S(sb);
>         struct inode *inode = NULL;
>         int ret;
> 
>         _enter("");
> 
>         /* fill in the superblock */
>         sb->s_blocksize         = PAGE_SIZE;
>         sb->s_blocksize_bits    = PAGE_SHIFT;
>         sb->s_maxbytes          = MAX_LFS_FILESIZE;
>         sb->s_magic             = AFS_FS_MAGIC;
>         sb->s_op                = &afs_super_ops;
> 
> IE you essentially nail AFS to use PAGE_SIZE.
> Not sure how you would tell AFS to use a different block size;
> maybe a mount option?

I saw this as well, but I didn't see this variable being used anywhere.
Probably it has no meaning in network-based FSs?

> And there are several other places which will need to be modified;
> eg afs_mntpt_set_params() is trying to read from a page which
> won't fly with large blocks (converted to read_full_folio()?),
> and, of course, the infamous AFS_DIR_BLOCKS_PER_PAGE which will
> overflow for large blocks.

But the min folio order is set only for AFS_FTYPE_FILE and not
for AFS_FTYPE_DIR.

> 
> So some work is required, but everything looks doable.
> Maybe I can find some time before LPC.
> 
> > Mostly this was just a note to you that there might be an issue with your code
> > - but I haven't investigated it yet and it could well be in my code.
> > 
> Hmm. I'd rather fix the obvious places in afs first; just do a quick
> grep for 'PAGE_', that'll give you a good impression of places to look at.
> 
Agree.

> Cheers,
> 
> Hannes
> -- 
> Dr. Hannes Reinecke                  Kernel Storage Architect
> hare@suse.de                                +49 911 74053 688
> SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
> HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
>
David Howells Aug. 19, 2024, 11:46 a.m. UTC | #6
Hi Pankaj,

I can reproduce the problem with:

xfs_io -t -f -c "pwrite -S 0x58 0 40" -c "fsync" -c "truncate 4" -c "truncate 4096" /xfstest.test/wubble; od -x /xfstest.test/wubble

borrowed from generic/393.  I've distilled it down to the attached C program.

Turning on tracing and adding a bit more, I can see the problem happening.
Here's an excerpt of the tracing (I've added some non-upstream tracepoints).
Firstly, you can see the second pwrite at fpos 0, 40 bytes (ie. 0x28):

 pankaj-5833: netfs_write_iter: WRITE-ITER i=9e s=0 l=28 f=0
 pankaj-5833: netfs_folio: pfn=116fec i=0009e ix=00000-00001 mod-streamw

Then first ftruncate() is called to reduce the file size to 4:

 pankaj-5833: netfs_truncate: ni=9e isz=2028 rsz=2028 zp=4000 to=4
 pankaj-5833: netfs_inval_folio: pfn=116fec i=0009e ix=00000-00001 o=4 l=1ffc d=78787878
 pankaj-5833: netfs_folio: pfn=116fec i=0009e ix=00000-00001 inval-part
 pankaj-5833: netfs_set_size: ni=9e resize-file isz=4 rsz=4 zp=4

You can see the invalidate_folio call, with the offset at 0x4 and the length as
0x1ffc.  The data at the beginning of the page is 0x78787878.  This looks
correct.

Then second ftruncate() is called to increase the file size to 4096
(ie. 0x1000):

 pankaj-5833: netfs_truncate: ni=9e isz=4 rsz=4 zp=4 to=1000
 pankaj-5833: netfs_inval_folio: pfn=116fec i=0009e ix=00000-00001 o=1000 l=1000 d=78787878
 pankaj-5833: netfs_folio: pfn=116fec i=0009e ix=00000-00001 inval-part
 pankaj-5833: netfs_set_size: ni=9e resize-file isz=1000 rsz=1000 zp=4

And here's the problem: in the invalidate_folio() call, the offset is 0x1000
and the length is 0x1000 (o= and l=).  But that's the wrong half of the folio!
I'm guessing that the caller thereafter clears the other half of the folio -
the bit that should be kept.

David
---
/* Distillation of the generic/393 xfstest */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>

#define ERR(x, y) do { if ((long)(x) == -1) { perror(y); exit(1); } } while(0)

static const char xxx[40] = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx";
static const char yyy[40] = "yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy";
static const char dropfile[] = "/proc/sys/vm/drop_caches";
static const char droptype[] = "3";
static const char file[] = "/xfstest.test/wubble";

int main(int argc, char *argv[])
{
        int fd, drop;

	/* Fill in the second 8K block of the file... */
        fd = open(file, O_CREAT|O_TRUNC|O_WRONLY, 0666);
        ERR(fd, "open");
        ERR(ftruncate(fd, 0), "pre-trunc $file");
        ERR(pwrite(fd, yyy, sizeof(yyy), 0x2000), "write-2000");
        ERR(close(fd), "close");

	/* ... and drop the pagecache so that we get a streaming
	 * write, attaching some private data to the folio.
	 */
        drop = open(dropfile, O_WRONLY);
        ERR(drop, dropfile);
        ERR(write(drop, droptype, sizeof(droptype) - 1), "write-drop");
        ERR(close(drop), "close-drop");

        fd = open(file, O_WRONLY, 0666);
        ERR(fd, "reopen");
	/* Make a streaming write on the first 8K block (needs O_WRONLY). */
        ERR(pwrite(fd, xxx, sizeof(xxx), 0), "write-0");
	/* Now use truncate to shrink and reexpand. */
        ERR(ftruncate(fd, 4), "trunc-4");
        ERR(ftruncate(fd, 4096), "trunc-4096");
        ERR(close(fd), "close-2");
        exit(0);
}
David Howells Aug. 19, 2024, 11:59 a.m. UTC | #7
David Howells <dhowells@redhat.com> wrote:

> You can see the invalidate_folio call, with the offset at 0x4 and the length as
> 0x1ffc.  The data at the beginning of the page is 0x78787878.  This looks
> correct.
> 
> Then second ftruncate() is called to increase the file size to 4096
> (ie. 0x1000):
> 
>  pankaj-5833: netfs_truncate: ni=9e isz=4 rsz=4 zp=4 to=1000
>  pankaj-5833: netfs_inval_folio: pfn=116fec i=0009e ix=00000-00001 o=1000 l=1000 d=78787878
>  pankaj-5833: netfs_folio: pfn=116fec i=0009e ix=00000-00001 inval-part
>  pankaj-5833: netfs_set_size: ni=9e resize-file isz=1000 rsz=1000 zp=4
> 
> And here's the problem: in the invalidate_folio() call, the offset is 0x1000
> and the length is 0x1000 (o= and l=).  But that's the wrong half of the folio!
> I'm guessing that the caller thereafter clears the other half of the folio -
> the bit that should be kept.

Actually, I think I'm wrong in my evaluation - I think that's the region to be
invalidated, not the region to be kept.

David
David Howells Aug. 19, 2024, 12:25 p.m. UTC | #8
Hannes Reinecke <hare@suse.de> wrote:

> IE you essentially nail AFS to use PAGE_SIZE.
> Not sure how you would tell AFS to use a different block size;
> maybe a mount option?

As far as I know:

        sb->s_blocksize         = PAGE_SIZE;
        sb->s_blocksize_bits    = PAGE_SHIFT;

isn't used by the VM.

> Hmm. I'd rather fix the obvious places in afs first; just do a quick
> grep for 'PAGE_', that'll give you a good impression of places to look at.

Sure:

   fs/afs/dir.c:   nr_pages = (i_size + PAGE_SIZE - 1) / PAGE_SIZE;
   fs/afs/dir.c:   req->len = nr_pages * PAGE_SIZE; /* We can ask for more than there is */
   fs/afs/dir.c:           task_io_account_read(PAGE_SIZE * req->nr_pages);
   fs/afs/dir.c:           folio = __filemap_get_folio(dir->i_mapping, ctx->pos / PAGE_SIZE,
   fs/afs/xdr_fs.h:#define AFS_DIR_BLOCKS_PER_PAGE (PAGE_SIZE / AFS_DIR_BLOCK_SIZE)

Those only affect directories.

   fs/afs/mntpt.c:         if (size < 2 || size > PAGE_SIZE - 1)

That only affects mountpoint symlinks.

   fs/afs/super.c: sb->s_blocksize         = PAGE_SIZE;

This is the only thing (and sb->s_blocksize_bits) that might affect files.  I
checked, and doubling this and adding 1 to bits does not alter the outcome.

Now, the VM wrangling is offloaded to netfslib, and most of that is to do with
converting between indices and file positions.  Going through the usages of
PAGE_SIZE there:

   fs/netfs/buffered_read.c:               size += PAGE_SIZE << order;

That was recording the size of the folio that readahead allocated.

   fs/netfs/buffered_read.c:       size_t nr_bvec = flen / PAGE_SIZE + 2;
   fs/netfs/buffered_read.c:               part = min_t(size_t, to - off, PAGE_SIZE);

Those two are used to fill in the gaps around a partial page - but that didn't
appear in the logs.

   fs/netfs/buffered_write.c:      pgoff_t index = pos / PAGE_SIZE;
   fs/netfs/buffered_write.c:              fgp_flags |= fgf_set_order(pos % PAGE_SIZE + part);

Those two are used when asking __filemap_get_folio() to allocate a folio to
write into.  I got a folio of the right size and index, so that's not the
problem.

   fs/netfs/fscache_io.c:  pgoff_t first = start / PAGE_SIZE;
   fs/netfs/fscache_io.c:  pgoff_t last = (start + len - 1) / PAGE_SIZE;

Caching is not enabled at the moment, so these don't happen.

   fs/netfs/iterator.c:            cur_npages = DIV_ROUND_UP(ret, PAGE_SIZE);
   fs/netfs/iterator.c:                    len = ret > PAGE_SIZE ? PAGE_SIZE : ret;

I'm not doing DIO, so these aren't used.

   fs/netfs/iterator.c:    pgoff_t index = pos / PAGE_SIZE;

I'm not using an ITER_XARRAY iterator, so this doesn't happen.

   fs/netfs/misc.c:        rreq->io_iter.count += PAGE_SIZE << order;

This is just multiplying up the folio size to add to the byte count.

   fs/netfs/read_collect.c:        fsize = PAGE_SIZE << subreq->curr_folio_order;
   fs/netfs/read_collect.c:            WARN_ON_ONCE(folioq_folio(folioq, slot)->index != fpos / PAGE_SIZE)) {

These two are converting between a file pos and an index - but only during
read, and I can see from wireshark that we're writing the wrong data to the
server before we get this far.

And that's all the PAGE_SIZE usages in afs and netfslib.

David
Hannes Reinecke Aug. 19, 2024, 12:48 p.m. UTC | #9
On 8/19/24 13:46, David Howells wrote:
> Hi Pankaj,
> 
> I can reproduce the problem with:
> 
> xfs_io -t -f -c "pwrite -S 0x58 0 40" -c "fsync" -c "truncate 4" -c "truncate 4096" /xfstest.test/wubble; od -x /xfstest.test/wubble
> 
> borrowed from generic/393.  I've distilled it down to the attached C program.
> 
> Turning on tracing and adding a bit more, I can see the problem happening.
> Here's an excerpt of the tracing (I've added some non-upstream tracepoints).
> Firstly, you can see the second pwrite at fpos 0, 40 bytes (ie. 0x28):
> 
>   pankaj-5833: netfs_write_iter: WRITE-ITER i=9e s=0 l=28 f=0
>   pankaj-5833: netfs_folio: pfn=116fec i=0009e ix=00000-00001 mod-streamw
> 
> Then first ftruncate() is called to reduce the file size to 4:
> 
>   pankaj-5833: netfs_truncate: ni=9e isz=2028 rsz=2028 zp=4000 to=4
>   pankaj-5833: netfs_inval_folio: pfn=116fec i=0009e ix=00000-00001 o=4 l=1ffc d=78787878
>   pankaj-5833: netfs_folio: pfn=116fec i=0009e ix=00000-00001 inval-part
>   pankaj-5833: netfs_set_size: ni=9e resize-file isz=4 rsz=4 zp=4
> 
> You can see the invalidate_folio call, with the offset at 0x4 and the length as
> 0x1ffc.  The data at the beginning of the page is 0x78787878.  This looks
> correct.
> 
> Then second ftruncate() is called to increase the file size to 4096
> (ie. 0x1000):
> 
>   pankaj-5833: netfs_truncate: ni=9e isz=4 rsz=4 zp=4 to=1000
>   pankaj-5833: netfs_inval_folio: pfn=116fec i=0009e ix=00000-00001 o=1000 l=1000 d=78787878
>   pankaj-5833: netfs_folio: pfn=116fec i=0009e ix=00000-00001 inval-part
>   pankaj-5833: netfs_set_size: ni=9e resize-file isz=1000 rsz=1000 zp=4
> 
> And here's the problem: in the invalidate_folio() call, the offset is 0x1000
> and the length is 0x1000 (o= and l=).  But that's the wrong half of the folio!
> I'm guessing that the caller thereafter clears the other half of the folio -
> the bit that should be kept.
> 
> David
> ---
> /* Distillation of the generic/393 xfstest */
> #include <stdio.h>
> #include <stdlib.h>
> #include <unistd.h>
> #include <fcntl.h>
> 
> #define ERR(x, y) do { if ((long)(x) == -1) { perror(y); exit(1); } } while(0)
> 
> static const char xxx[40] = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx";
> static const char yyy[40] = "yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy";
> static const char dropfile[] = "/proc/sys/vm/drop_caches";
> static const char droptype[] = "3";
> static const char file[] = "/xfstest.test/wubble";
> 
> int main(int argc, char *argv[])
> {
>          int fd, drop;
> 
> 	/* Fill in the second 8K block of the file... */
>          fd = open(file, O_CREAT|O_TRUNC|O_WRONLY, 0666);
>          ERR(fd, "open");
>          ERR(ftruncate(fd, 0), "pre-trunc $file");
>          ERR(pwrite(fd, yyy, sizeof(yyy), 0x2000), "write-2000");
>          ERR(close(fd), "close");
> 
> 	/* ... and drop the pagecache so that we get a streaming
> 	 * write, attaching some private data to the folio.
> 	 */
>          drop = open(dropfile, O_WRONLY);
>          ERR(drop, dropfile);
>          ERR(write(drop, droptype, sizeof(droptype) - 1), "write-drop");
>          ERR(close(drop), "close-drop");
> 
>          fd = open(file, O_WRONLY, 0666);
>          ERR(fd, "reopen");
> 	/* Make a streaming write on the first 8K block (needs O_WRONLY). */
>          ERR(pwrite(fd, xxx, sizeof(xxx), 0), "write-0");
> 	/* Now use truncate to shrink and reexpand. */
>          ERR(ftruncate(fd, 4), "trunc-4");
>          ERR(ftruncate(fd, 4096), "trunc-4096");
>          ERR(close(fd), "close-2");
>          exit(0);
> }
> 

Wouldn't the second truncate end up with a 4k file, and not an 8k?
IE the resulting file will be:
After step 1: 8k
After step 2: 4
After step 3: 4k

Hmm?

Cheers,

Hannes
David Howells Aug. 19, 2024, 2:08 p.m. UTC | #10
Hannes Reinecke <hare@suse.de> wrote:

> Wouldn't the second truncate end up with a 4k file, and not an 8k?
> IE the resulting file will be:
> After step 1: 8k
> After step 2: 4
> After step 3: 4k

Yes, but the folio should still be an 8K folio, and it is:

>   pankaj-5833: netfs_folio: pfn=116fec i=0009e ix=00000-00001 inval-part

as indicated by the inclusive folio index range ix=00000-00001.

The problem is that the bottom four bytes of the file are getting cleared
somewhere.  They *should* be "XXXX", but they're all zeros.

David
David Howells Aug. 19, 2024, 3:17 p.m. UTC | #11
Okay, the code in netfs_invalidate_folio() isn't correct in the way it reduces
streaming writes.  Attached is a patch that shows some of the changes I need
to make - but this is not yet working.

David
---
diff --git a/fs/netfs/misc.c b/fs/netfs/misc.c
index eaa0a992d178..e237c771eeb5 100644
--- a/fs/netfs/misc.c
+++ b/fs/netfs/misc.c
@@ -214,18 +215,34 @@ void netfs_invalidate_folio(struct folio *folio, size_t offset, size_t length)
 		/* We have a partially uptodate page from a streaming write. */
 		unsigned int fstart = finfo->dirty_offset;
 		unsigned int fend = fstart + finfo->dirty_len;
-		unsigned int end = offset + length;
+		unsigned int iend = offset + length;
 
 		if (offset >= fend)
 			return;
-		if (end <= fstart)
+		if (iend <= fstart)
+			return;
+
+		/* The invalidation region overlaps the data.  If the region
+		 * covers the start of the data, we either move along the start
+		 * or just erase the data entirely.
+		 */
+		if (offset <= fstart) {
+			if (iend >= fend)
+				goto erase_completely;
+			/* Move the start of the data. */
+			finfo->dirty_len = fend - iend;
+			finfo->dirty_offset = offset;
 			return;
-		if (offset <= fstart && end >= fend)
-			goto erase_completely;
-		if (offset <= fstart && end > fstart)
-			goto reduce_len;
-		if (offset > fstart && end >= fend)
-			goto move_start;
+		}
+
+		/* Reduce the length of the data if the invalidation region
+		 * covers the tail part.
+		 */
+		if (iend >= fend) {
+			finfo->dirty_len = offset - fstart;
+			return;
+		}
+
 		/* A partial write was split.  The caller has already zeroed
 		 * it, so just absorb the hole.
 		 */
@@ -238,12 +261,6 @@ void netfs_invalidate_folio(struct folio *folio, size_t offset, size_t length)
 	folio_clear_uptodate(folio);
 	kfree(finfo);
 	return;
-reduce_len:
-	finfo->dirty_len = offset + length - finfo->dirty_offset;
-	return;
-move_start:
-	finfo->dirty_len -= offset - finfo->dirty_offset;
-	finfo->dirty_offset = offset;
 }
 EXPORT_SYMBOL(netfs_invalidate_folio);
Pankaj Raghav (Samsung) Aug. 19, 2024, 4:39 p.m. UTC | #12
> ---
> /* Distillation of the generic/393 xfstest */
> #include <stdio.h>
> #include <stdlib.h>
> #include <unistd.h>
> #include <fcntl.h>
> 
> #define ERR(x, y) do { if ((long)(x) == -1) { perror(y); exit(1); } } while(0)
> 
> static const char xxx[40] = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx";
> static const char yyy[40] = "yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy";
> static const char dropfile[] = "/proc/sys/vm/drop_caches";
> static const char droptype[] = "3";
> static const char file[] = "/xfstest.test/wubble";
> 
> int main(int argc, char *argv[])
> {
>         int fd, drop;
> 
> 	/* Fill in the second 8K block of the file... */
>         fd = open(file, O_CREAT|O_TRUNC|O_WRONLY, 0666);
>         ERR(fd, "open");
>         ERR(ftruncate(fd, 0), "pre-trunc $file");
>         ERR(pwrite(fd, yyy, sizeof(yyy), 0x2000), "write-2000");
>         ERR(close(fd), "close");
> 
> 	/* ... and drop the pagecache so that we get a streaming
> 	 * write, attaching some private data to the folio.
> 	 */
>         drop = open(dropfile, O_WRONLY);
>         ERR(drop, dropfile);
>         ERR(write(drop, droptype, sizeof(droptype) - 1), "write-drop");
>         ERR(close(drop), "close-drop");
> 
>         fd = open(file, O_WRONLY, 0666);
>         ERR(fd, "reopen");
> 	/* Make a streaming write on the first 8K block (needs O_WRONLY). */
>         ERR(pwrite(fd, xxx, sizeof(xxx), 0), "write-0");
> 	/* Now use truncate to shrink and reexpand. */
>         ERR(ftruncate(fd, 4), "trunc-4");
>         ERR(ftruncate(fd, 4096), "trunc-4096");
>         ERR(close(fd), "close-2");
>         exit(0);
> }

I tried this code on XFS, and it is working as expected (I am getting
xxxx).

[nix-shell:~/xfstests]# hexdump -C /media/test/wubble
00000000  78 78 78 78 00 00 00 00  00 00 00 00 00 00 00 00  |xxxx............|
00000010  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00001000

I did some tracing as well and here are the results.

$ trace-cmd record -e xfs_file_fsync -e xfs_file_buffered_write -e xfs_setattr -e xfs_zero_eof -F -c ./a.out

[nix-shell:~/xfstests]# trace-cmd report
cpus=4
           a.out-3872  [003] 84120.161472: xfs_setattr:          dev 259:0 ino 0x103 iflags 0x0
           a.out-3872  [003] 84120.172109: xfs_setattr:          dev 259:0 ino 0x103 iflags 0x20 
           a.out-3872  [003] 84120.172151: xfs_zero_eof:         dev 259:0 ino 0x103 isize 0x0 disize 0x0 pos 0x0 bytecount 0x2000 // First truncate
           a.out-3872  [003] 84120.172156: xfs_file_buffered_write: dev 259:0 ino 0x103 disize 0x0 pos 0x2000 bytecount 0x28
           a.out-3872  [003] 84120.185423: xfs_file_buffered_write: dev 259:0 ino 0x103 disize 0x2028 pos 0x0 bytecount 0x28
           a.out-3872  [003] 84120.185477: xfs_setattr:          dev 259:0 ino 0x103 iflags 0x0
           a.out-3872  [003] 84120.186493: xfs_setattr:          dev 259:0 ino 0x103 iflags 0x20
           a.out-3872  [003] 84120.186495: xfs_zero_eof:         dev 259:0 ino 0x103 isize 0x4 disize 0x4 pos 0x4 bytecount 0xffc // Third truncate

First and third truncate result in calling xfs_zero_eof as we are
increasing the size of the file.

When we do the second ftruncate(fd, 4), we call into iomap_truncate_page() with
offset 0:

int
iomap_truncate_page(struct inode *inode, loff_t pos, bool *did_zero,
		const struct iomap_ops *ops)
{
	unsigned int blocksize = i_blocksize(inode);
	unsigned int off = pos & (blocksize - 1);

	/* Block boundary? Nothing to do */
	if (!off)
		return 0;
	return iomap_zero_range(inode, pos, blocksize - off, did_zero, ops);
}

As you can see, we take into account the blocksize (which is set as
minorder during inode init) and make sure the sub-block zeroing is done
correctly.

Also, if you look at iomap_invalidate_folio(), we don't remove the folio
private data until the whole folio is invalidated.

I doubt we are doing anything wrong in the page cache layer with these
patches.

All we do with minorder support is make sure we always allocate folios
in the page cache that are at least min order in size and aligned to the
min order (PATCH 2 and 3), and we maintain this even when we do a split
(PATCH 4).
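
A minimal sketch of that invariant (illustrative helpers, not functions
from the patches): with min order n, a folio's index is aligned down to a
multiple of 2^n and a split never produces folios below order n, so a
folio always covers whole FS blocks.

#include <stdio.h>

/* Illustrative only: align a page cache index to the min order, and
 * clamp a requested split order so it never goes below the min order. */
static unsigned long align_index(unsigned long index, unsigned int min_order)
{
	return index & ~((1UL << min_order) - 1);
}

static unsigned int clamp_split_order(unsigned int new_order,
				      unsigned int min_order)
{
	return new_order < min_order ? min_order : new_order;
}

int main(void)
{
	/* 16k blocks on 4k pages -> min order 2: index 5 aligns to 4. */
	printf("index %lu\n", align_index(5, 2));      /* 4 */
	/* A request to split to order 0 is raised to order 2. */
	printf("order %u\n", clamp_split_order(0, 2)); /* 2 */
	return 0;
}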

I hope this helps!

--
Pankaj
David Howells Aug. 19, 2024, 4:51 p.m. UTC | #13
Okay, I think there is a bug in your patches also.  If I do:

	xfs_io -t -f -c "pwrite -S 0x58 0 40" -c "fsync" \
		-c "truncate 4" -c "truncate 4096" \
		/xfstest.test/wubble; od /xfstest.test/wubble

I see:

  xfs_io-6059: netfs_truncate: ni=9e isz=1000 rsz=1000 zp=0 to=0
  xfs_io-6059: netfs_set_size: ni=9e resize-file isz=0 rsz=0 zp=0
  xfs_io-6059: netfs_write_iter: WRITE-ITER i=9e s=0 l=28 f=0
  xfs_io-6059: netfs_folio: pfn=10d996 i=0009e ix=00000-00001 mod-n-clear d=5858585858585858
  xfs_io-6059: netfs_write: R=0000000c WRITEBACK c=00000002 i=9e by=0-ffffffffffffffff
  xfs_io-6059: netfs_folio: pfn=10d996 i=0009e ix=00000-00001 store d=5858585858585858
  xfs_io-6059: netfs_sreq: R=0000000c[1] UPLD PREP  f=00 s=0 0/0 e=0
  xfs_io-6059: netfs_sreq: R=0000000c[1] UPLD SUBMT f=100 s=0 0/28 e=0
 kworker-5948: netfs_sreq: R=0000000c[1] UPLD TERM  f=100 s=0 28/28 e=0
 kworker-5948: netfs_rreq: R=0000000c WB COLLECT f=2120
 kworker-5948: netfs_sreq: R=0000000c[1] UPLD FREE  f=00 s=0 28/28 e=0
 kworker-5948: netfs_folio: pfn=10d996 i=0009e ix=00000-00001 clear d=5858585858585858
 kworker-5948: netfs_rreq: R=0000000c WB WR-DONE f=2120
 kworker-5948: netfs_rreq: R=0000000c WB WAKE-IP f=2120
 kworker-5948: netfs_rreq: R=0000000c WB FREE    f=2100
  xfs_io-6059: netfs_truncate: ni=9e isz=28 rsz=28 zp=0 to=4
  xfs_io-6059: netfs_set_size: ni=9e resize-file isz=4 rsz=4 zp=0

But ->release_folio() should have been called here because netfs_inode_init()
would have called mapping_set_release_always() for ordinary afs files.

  xfs_io-6059: netfs_truncate: ni=9e isz=4 rsz=4 zp=0 to=1000
  xfs_io-6059: netfs_set_size: ni=9e resize-file isz=1000 rsz=1000 zp=0
      od-6060: netfs_read: R=0000000d READAHEAD c=00000002 ni=9e s=0 l=2000 sz=1000
      od-6060: netfs_folio: pfn=10d996 i=0009e ix=00000-00001 read d=58585858
      od-6060: netfs_sreq: R=0000000d[1] ---- ADD   f=00 s=0 0/2000 e=0
      od-6060: netfs_sreq: R=0000000d[1] ZERO SUBMT f=00 s=0 0/2000 e=0
      od-6060: netfs_sreq: R=0000000d[1] ZERO CLEAR f=02 s=0 2000/2000 e=0
      od-6060: netfs_folio: pfn=10d996 i=0009e ix=00000-00001 read-done d=0
      od-6060: netfs_folio: pfn=10d996 i=0009e ix=00000-00001 read-unlock d=0
      od-6060: netfs_sreq: R=0000000d[1] ZERO TERM  f=02 s=0 2000/2000 e=0
      od-6060: netfs_sreq: R=0000000d[1] ZERO FREE  f=02 s=0 2000/2000 e=0
      od-6060: netfs_rreq: R=0000000d RA ASSESS  f=20
      od-6060: netfs_rreq: R=0000000d RA WAKE-IP f=20
      od-6060: netfs_rreq: R=0000000d RA DONE    f=00
      od-6060: netfs_folio: pfn=10d996 i=0009e ix=00000-00001 read-put d=0
 kworker-5948: netfs_rreq: R=0000000d RA FREE    f=00

David
David Howells Aug. 19, 2024, 6:40 p.m. UTC | #14
Pankaj Raghav (Samsung) <kernel@pankajraghav.com> wrote:

> I tried this code on XFS, and it is working as expected (I am getting
> xxxx).

XFS doesn't try to use mapping_set_release_always().

David
Pankaj Raghav (Samsung) Aug. 20, 2024, 9:17 a.m. UTC | #15
On Mon, Aug 19, 2024 at 07:40:44PM +0100, David Howells wrote:
> Pankaj Raghav (Samsung) <kernel@pankajraghav.com> wrote:
> 
> > I tried this code on XFS, and it is working as expected (I am getting
> > xxxx).
> 
> XFS doesn't try to use mapping_set_release_always().

Thanks David for digging deep. It is indeed a bug in this patchset
(PATCH 1). I think I overlooked the way we mask the folio order bits
when we changed it some time back.

But I still don't know why AS_RELEASE_ALWAYS is being cleared, because it
is bit 6, and the existing bug should not affect bit 6.
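
For illustration, a minimal user-space reproduction of the masking bug,
assuming the flag layout from PATCH 1 (5 order bits each for min and max,
at bits 16..25) and the upstream positions of AS_UNEVICTABLE (bit 3) and
AS_RELEASE_ALWAYS (bit 6): clearing ~AS_FOLIO_ORDER_MASK wipes bits 0..4
instead of the shifted order fields, so bit 3 is lost while bit 6
survives.

#include <stdio.h>

#define AS_UNEVICTABLE      3  /* bit 3, as upstream */
#define AS_RELEASE_ALWAYS   6  /* bit 6, as upstream */
#define AS_FOLIO_ORDER_BITS 5
#define AS_FOLIO_ORDER_MIN  16
#define AS_FOLIO_ORDER_MAX  21
#define AS_FOLIO_ORDER_MASK ((1UL << AS_FOLIO_ORDER_BITS) - 1)

int main(void)
{
	unsigned long flags = (1UL << AS_UNEVICTABLE) |
			      (1UL << AS_RELEASE_ALWAYS);

	/* Buggy update: ~AS_FOLIO_ORDER_MASK is ~0x1f, which clears the
	 * low flag bits 0..4 rather than the order fields at 16..25. */
	flags = (flags & ~AS_FOLIO_ORDER_MASK) |
		(2UL << AS_FOLIO_ORDER_MIN) | (2UL << AS_FOLIO_ORDER_MAX);

	printf("unevictable:    %lu\n", (flags >> AS_UNEVICTABLE) & 1);    /* 0: lost */
	printf("release_always: %lu\n", (flags >> AS_RELEASE_ALWAYS) & 1); /* 1: kept */
	return 0;
}

This matches the ASSERT demonstration below: AS_UNEVICTABLE is wiped by
the buggy mask, while AS_RELEASE_ALWAYS would not be.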

The following triggers an ASSERT failure.

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 0fcf235e5023..35961d73d54a 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -88,9 +88,13 @@ xfs_inode_alloc(
 
        /* VFS doesn't initialise i_mode! */
        VFS_I(ip)->i_mode = 0;
+       mapping_set_unevictable(VFS_I(ip)->i_mapping);
        mapping_set_folio_min_order(VFS_I(ip)->i_mapping,
                                    M_IGEO(mp)->min_folio_order);
 
+       ASSERT(mapping_unevictable(VFS_I(ip)->i_mapping) == 1);
+
+       mapping_clear_unevictable(VFS_I(ip)->i_mapping);
        XFS_STATS_INC(mp, vn_active);
        ASSERT(atomic_read(&ip->i_pincount) == 0);
        ASSERT(ip->i_ino == 0);

The patch that fixes this is:

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 61a7649d86e5..5e245b8dcfd6 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -217,6 +217,7 @@ enum mapping_flags {
 #define AS_FOLIO_ORDER_MASK     ((1u << AS_FOLIO_ORDER_BITS) - 1)
 #define AS_FOLIO_ORDER_MIN_MASK (AS_FOLIO_ORDER_MASK << AS_FOLIO_ORDER_MIN)
 #define AS_FOLIO_ORDER_MAX_MASK (AS_FOLIO_ORDER_MASK << AS_FOLIO_ORDER_MAX)
+#define AS_FOLIO_ORDER_MIN_MAX_MASK (AS_FOLIO_ORDER_MIN_MASK | AS_FOLIO_ORDER_MAX_MASK)
 
 /**
  * mapping_set_error - record a writeback error in the address_space
@@ -418,7 +419,7 @@ static inline void mapping_set_folio_order_range(struct address_space *mapping,
        if (max < min)
                max = min;
 
-       mapping->flags = (mapping->flags & ~AS_FOLIO_ORDER_MASK) |
+       mapping->flags = (mapping->flags & ~AS_FOLIO_ORDER_MIN_MAX_MASK) |
                (min << AS_FOLIO_ORDER_MIN) | (max << AS_FOLIO_ORDER_MAX);
 }
 
Could you try this patch and see if it fixes it by any chance?

--
Pankaj
David Howells Aug. 20, 2024, 11:24 p.m. UTC | #16
Okay, I think I've found the bugs in my code and in truncate.  It appears
they're exposed by your code, but exist upstream.  You can add:

	Tested-by: David Howells <dhowells@redhat.com>

to patches 1-5 if you wish.

David
Pankaj Raghav (Samsung) Aug. 21, 2024, 7:16 a.m. UTC | #17
On Wed, Aug 21, 2024 at 12:24:24AM +0100, David Howells wrote:
> Okay, I think I've found the bugs in my code and in truncate.  It appears
> they're exposed by your code, but exist upstream.  You can add:
> 
> 	Tested-by: David Howells <dhowells@redhat.com>
> 
> to patches 1-5 if you wish.

Thanks David. I will send a new version with your Tested-by and the one
fix in the first patch.