[v2] iomap: skip pages past eof in iomap_do_writepage()

Message ID 20220608004228.3658429-1-clm@fb.com
State New, archived

Commit Message

Chris Mason June 8, 2022, 12:42 a.m. UTC
iomap_do_writepage() sends pages past i_size through
folio_redirty_for_writepage(), which normally isn't a problem because
truncate and friends clean them very quickly.

When the system has cgroups configured, we can end up in situations
where one cgroup has almost no dirty pages at all, and other cgroups
consume the entire background dirty limit.  This is especially common in
our XFS workloads in production because they have cgroups using O_DIRECT
for almost all of the IO mixed in with cgroups that do more traditional
buffered IO work.

We've hit storms where the redirty path hits millions of times in a few
seconds, all on a single file that's only ~40 pages long.  This leads to
long tail latencies for file writes because the pdflush workers are
hogging the CPU from some kworkers bound to the same CPU.

Reproducing this on 5.18 was tricky because 869ae85dae ("xfs: flush new
eof page on truncate...") ends up writing/waiting most of these dirty pages
before truncate gets a chance to wait on them.

The actual repro looks like this:

/*
 * run me in a cgroup all alone.  Start a second cgroup with dd
 * streaming IO into the block device.
 */
#define _GNU_SOURCE	/* for sync_file_range() */
#include <err.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * BUFFER_SIZE wasn't defined in the original posting; 16KB is an
 * assumption that keeps the write extending past the 16384 byte
 * offset handed to sync_file_range() below.
 */
#define BUFFER_SIZE (16 * 1024)

int main(int ac, char **av) {
	int fd;
	int ret;
	char buf[BUFFER_SIZE];
	char *filename = av[1];

	memset(buf, 0, BUFFER_SIZE);

	if (ac != 2) {
		fprintf(stderr, "usage: looper filename\n");
		exit(1);
	}
	fd = open(filename, O_WRONLY | O_CREAT, 0600);
	if (fd < 0) {
		err(errno, "failed to open");
	}
	fprintf(stderr, "looping on %s\n", filename);
	while (1) {
		/*
		 * skip past page 0 so truncate doesn't write and wait
		 * on our extent before changing i_size
		 */
		ret = lseek(fd, 8192, SEEK_SET);
		if (ret < 0)
			err(errno, "lseek");
		ret = write(fd, buf, BUFFER_SIZE);
		if (ret != BUFFER_SIZE)
			err(errno, "write failed");
		/* start IO so truncate has to wait after i_size is 0 */
		ret = sync_file_range(fd, 16384, 4095, SYNC_FILE_RANGE_WRITE);
		if (ret < 0)
			err(errno, "sync_file_range");
		ret = ftruncate(fd, 0);
		if (ret < 0)
			err(errno, "truncate");
		usleep(1000);
	}
}
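
To build and run the reproducer, a plain gcc invocation is enough,
e.g.:

    gcc -O2 -o looper looper.c
    ./looper /mnt/test/file

(the target path is an example).  Per the comment at the top, run the
looper in a cgroup of its own while a second cgroup runs dd streaming
IO to the same block device.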

And this bpftrace script will show when you've hit a redirty storm:

kretprobe:xfs_vm_writepages {
    delete(@dirty[pid]);
}

kprobe:xfs_vm_writepages {
    @dirty[pid] = 1;
}

kprobe:folio_redirty_for_writepage /@dirty[pid] > 0/ {
    $inode = ((struct folio *)arg1)->mapping->host->i_ino;
    @inodes[$inode] = count();
    @redirty++;
    if (@redirty > 90000) {
        printf("inode %d redirty was %d", $inode, @redirty);
        exit();
    }
}
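
Assuming the script is saved as redirty.bt, run it as root alongside
the looper:

    bpftrace redirty.bt

The struct folio cast needs kernel type information, so this assumes a
kernel built with CONFIG_DEBUG_INFO_BTF (or debuginfo that bpftrace
can find).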

This patch has the same number of failures on xfstests as unpatched 5.18:
Failures: generic/648 xfs/019 xfs/050 xfs/168 xfs/299 xfs/348 xfs/506
xfs/543

I also ran it through a long stress run with multiple fsx processes
hammering the same filesystem.

(Johannes Weiner did significant tracing and debugging on this as well)

Signed-off-by: Chris Mason <clm@fb.com>
Co-authored-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Domas Mituzas <domas@fb.com>
---
 fs/iomap/buffered-io.c | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

Comments

Matthew Wilcox June 8, 2022, 6:27 p.m. UTC | #1
On Tue, Jun 07, 2022 at 05:42:29PM -0700, Chris Mason wrote:
> iomap_do_writepage() sends pages past i_size through
> folio_redirty_for_writepage(), which normally isn't a problem because
> truncate and friends clean them very quickly.
[...]
> Signed-off-by: Chris Mason <clm@fb.com>
> Co-authored-by: Johannes Weiner <hannes@cmpxchg.org>
> Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
> Reported-by: Domas Mituzas <domas@fb.com>

Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Dave Chinner June 9, 2022, 12:53 a.m. UTC | #2
On Tue, Jun 07, 2022 at 05:42:29PM -0700, Chris Mason wrote:
> iomap_do_writepage() sends pages past i_size through
> folio_redirty_for_writepage(), which normally isn't a problem because
> truncate and friends clean them very quickly.
> 
> When the system has cgroups configured, we can end up in situations
> where one cgroup has almost no dirty pages at all, and other cgroups
> consume the entire background dirty limit.  This is especially common in
> our XFS workloads in production because they have cgroups using O_DIRECT
> for almost all of the IO mixed in with cgroups that do more traditional
> buffered IO work.
> 
> We've hit storms where the redirty path hits millions of times in a few
> seconds, all on a single file that's only ~40 pages long.  This leads to
> long tail latencies for file writes because the pdflush workers are
> hogging the CPU from some kworkers bound to the same CPU.
> 
> Reproducing this on 5.18 was tricky because 869ae85dae ("xfs: flush new
> eof page on truncate...") ends up writing/waiting most of these dirty pages
> before truncate gets a chance to wait on them.

That commit went into 5.10, so this would mean it's not easily
reproducible on kernels released since then?

> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 8ce8720093b9..64d1476c457d 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -1482,10 +1482,10 @@ iomap_do_writepage(struct page *page, struct writeback_control *wbc, void *data)
>  		pgoff_t end_index = isize >> PAGE_SHIFT;
>  
>  		/*
> -		 * Skip the page if it's fully outside i_size, e.g. due to a
> -		 * truncate operation that's in progress. We must redirty the
> -		 * page so that reclaim stops reclaiming it. Otherwise
> -		 * iomap_vm_releasepage() is called on it and gets confused.
> +		 * Skip the page if it's fully outside i_size, e.g.
> +		 * due to a truncate operation that's in progress.  We've
> +		 * cleaned this page and truncate will finish things off for
> +		 * us.
>  		 *
>  		 * Note that the end_index is unsigned long.  If the given
>  		 * offset is greater than 16TB on a 32-bit system then if we
> @@ -1500,7 +1500,7 @@ iomap_do_writepage(struct page *page, struct writeback_control *wbc, void *data)
>  		 */
>  		if (folio->index > end_index ||
>  		    (folio->index == end_index && poff == 0))
> -			goto redirty;
> +			goto unlock;
>  
>  		/*
>  		 * The page straddles i_size.  It must be zeroed out on each
> @@ -1518,6 +1518,7 @@ iomap_do_writepage(struct page *page, struct writeback_control *wbc, void *data)
>  
>  redirty:
>  	folio_redirty_for_writepage(wbc, folio);
> +unlock:
>  	folio_unlock(folio);
>  	return 0;
>  }

Regardless, the change looks fine.

Reviewed-by: Dave Chinner <dchinner@redhat.com>

Cheers,

Dave.
Chris Mason June 9, 2022, 9:15 p.m. UTC | #3
On 6/8/22 8:53 PM, Dave Chinner wrote:
> On Tue, Jun 07, 2022 at 05:42:29PM -0700, Chris Mason wrote:
>> iomap_do_writepage() sends pages past i_size through
>> folio_redirty_for_writepage(), which normally isn't a problem because
>> truncate and friends clean them very quickly.
>>
>> When the system has cgroups configured, we can end up in situations
>> where one cgroup has almost no dirty pages at all, and other cgroups
>> consume the entire background dirty limit.  This is especially common in
>> our XFS workloads in production because they have cgroups using O_DIRECT
>> for almost all of the IO mixed in with cgroups that do more traditional
>> buffered IO work.
>>
>> We've hit storms where the redirty path hits millions of times in a few
>> seconds, all on a single file that's only ~40 pages long.  This leads to
>> long tail latencies for file writes because the pdflush workers are
>> hogging the CPU from some kworkers bound to the same CPU.
>>
>> Reproducing this on 5.18 was tricky because 869ae85dae ("xfs: flush new
>> eof page on truncate...") ends up writing/waiting most of these dirty pages
>> before truncate gets a chance to wait on them.
> 
> That commit went into 5.10, so this would mean it's not easily
> reproducible on kernels released since then?

Yes, our two main prod kernels right now are v5.6 and v5.12, but we
don't have enough of this database tier on 5.12 to have any meaningful
data from production.  For my repro, I didn't spend much time on 5.12,
but it was hard to trigger there as well.

[...]

> Regardless, the change looks fine.
> 
> Reviewed-by: Dave Chinner <dchinner@redhat.com>

Thanks!  Johannes and I are both going on vacation, but I'll get an
experiment rolled to enough hosts to see if the long tails get shorter.
We're unlikely to come back with results before July.

-chris
Chris Mason June 11, 2022, 11:34 p.m. UTC | #4
On 6/9/22 5:15 PM, Chris Mason wrote:
> On 6/8/22 8:53 PM, Dave Chinner wrote:

>> Reviewed-by: Dave Chinner <dchinner@redhat.com>
> 
> Thanks!  Johannes and I are both going on vacation, but I'll get an
> experiment rolled to enough hosts to see if the long tails get shorter.
> We're unlikely to come back with results before July.
> 

Of course, the easiest way to test my theory is live patching against 
v5.6, but there's a wrinkle because v5.6 still has xfs_vm_writepage().

Looks like the iomap conversion deleted the warning that Jan was 
originally fixing, and I went through some hoops to trigger skipping the 
pages from inside writepage().  As you noted in the commit to delete 
writepage, this is pretty hard to trigger but it does happen once I get 
down to a few hundred MB free.  It doesn't seem to impact fsx runs or 
other load, and we unmount/xfs_repair cleanly.

I don't like to ask people to think about ancient kernels, but am I 
missing any huge problems?  I've got this patch backported on v5.6, and 
writepage is still in place.

-chris
Dave Chinner June 13, 2022, 10:04 p.m. UTC | #5
On Sat, Jun 11, 2022 at 07:34:01PM -0400, Chris Mason wrote:
> On 6/9/22 5:15 PM, Chris Mason wrote:
> > On 6/8/22 8:53 PM, Dave Chinner wrote:
> 
> > > Reviewed-by: Dave Chinner <dchinner@redhat.com>
> > 
> > Thanks!  Johannes and I are both going on vacation, but I'll get an
> > experiment rolled to enough hosts to see if the long tails get shorter.
> >  We're unlikely to come back with results before July.
> > 
> 
> Of course, the easiest way to test my theory is live patching against v5.6,
> but there's a wrinkle because v5.6 still has xfs_vm_writepage().
> 
> Looks like the iomap conversion deleted the warning that Jan was originally
> fixing, and I went through some hoops to trigger skipping the pages from
> inside writepage().  As you noted in the commit to delete writepage, this is
> pretty hard to trigger but it does happen once I get down to a few hundred
> MB free.  It doesn't seem to impact fsx runs or other load, and we
> unmount/xfs_repair cleanly.
>
> I don't like to ask people to think about ancient kernels, but am I missing
> any huge problems?  I've got this patch backported on v5.6, and writepage is
> still in place.

Just backport the patch to remove ->writepage as well? I've run test
kernels since ~2017 with the writepage delete in it and never seen
any problems as a result of it. That would seem to me to be the
simplest way of avoiding unexpected issues with reclaim writeback
and truncate races...

Cheers,

Dave.

Patch

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 8ce8720093b9..64d1476c457d 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1482,10 +1482,10 @@ iomap_do_writepage(struct page *page, struct writeback_control *wbc, void *data)
 		pgoff_t end_index = isize >> PAGE_SHIFT;
 
 		/*
-		 * Skip the page if it's fully outside i_size, e.g. due to a
-		 * truncate operation that's in progress. We must redirty the
-		 * page so that reclaim stops reclaiming it. Otherwise
-		 * iomap_vm_releasepage() is called on it and gets confused.
+		 * Skip the page if it's fully outside i_size, e.g.
+		 * due to a truncate operation that's in progress.  We've
+		 * cleaned this page and truncate will finish things off for
+		 * us.
 		 *
 		 * Note that the end_index is unsigned long.  If the given
 		 * offset is greater than 16TB on a 32-bit system then if we
@@ -1500,7 +1500,7 @@ iomap_do_writepage(struct page *page, struct writeback_control *wbc, void *data)
 		 */
 		if (folio->index > end_index ||
 		    (folio->index == end_index && poff == 0))
-			goto redirty;
+			goto unlock;
 
 		/*
 		 * The page straddles i_size.  It must be zeroed out on each
@@ -1518,6 +1518,7 @@  iomap_do_writepage(struct page *page, struct writeback_control *wbc, void *data)
 
 redirty:
 	folio_redirty_for_writepage(wbc, folio);
+unlock:
 	folio_unlock(folio);
 	return 0;
 }