diff mbox series

[14/16] xfs: align writepages to large block sizes

Message ID 20181107063127.3902-15-david@fromorbit.com (mailing list archive)
State New, archived
Headers show
Series xfs: Block size > PAGE_SIZE support | expand

Commit Message

Dave Chinner Nov. 7, 2018, 6:31 a.m. UTC
From: Dave Chinner <dchinner@redhat.com>

For data integrity purposes, we need to write back the entire
filesystem block when asked to sync a sub-block range of the file.
When the filesystem block size is larger than the page size, this
means we need to convert single page integrity writes into whole
block integrity writes. We do this by extending the writepage range
to filesystem block granularity and alignment.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_aops.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

Comments

Christoph Hellwig Nov. 9, 2018, 3:22 p.m. UTC | #1
> +	unsigned		bsize =	i_blocksize(mapping->host);
>  
>  	/*
>  	 * Refuse to write pages out if we are called from reclaim context.
> @@ -922,6 +923,19 @@ xfs_vm_writepages(
>  	if (WARN_ON_ONCE(current->flags & PF_MEMALLOC_NOFS))
>  		return 0;
>  
> +	/*
> +	 * If the block size is larger than page size, extent the incoming write
> +	 * request to fsb granularity and alignment. This is a requirement for
> +	 * data integrity operations and it doesn't hurt for other write
> +	 * operations, so do it unconditionally.
> +	 */
> +	if (wbc->range_start)
> +		wbc->range_start = round_down(wbc->range_start, bsize);
> +	if (wbc->range_end != LLONG_MAX)
> +		wbc->range_end = round_up(wbc->range_end, bsize);
> +	if (wbc->nr_to_write < wbc->range_end - wbc->range_start)
> +		wbc->nr_to_write = round_up(wbc->nr_to_write, bsize);

This looks fine to me, but I'd be much more comfortable it we had
it in the common writeback code instead of inside XFS.
Dave Chinner Nov. 11, 2018, 1:20 a.m. UTC | #2
On Fri, Nov 09, 2018 at 07:22:18AM -0800, Christoph Hellwig wrote:
> > +	unsigned		bsize =	i_blocksize(mapping->host);
> >  
> >  	/*
> >  	 * Refuse to write pages out if we are called from reclaim context.
> > @@ -922,6 +923,19 @@ xfs_vm_writepages(
> >  	if (WARN_ON_ONCE(current->flags & PF_MEMALLOC_NOFS))
> >  		return 0;
> >  
> > +	/*
> > +	 * If the block size is larger than page size, extent the incoming write
> > +	 * request to fsb granularity and alignment. This is a requirement for
> > +	 * data integrity operations and it doesn't hurt for other write
> > +	 * operations, so do it unconditionally.
> > +	 */
> > +	if (wbc->range_start)
> > +		wbc->range_start = round_down(wbc->range_start, bsize);
> > +	if (wbc->range_end != LLONG_MAX)
> > +		wbc->range_end = round_up(wbc->range_end, bsize);
> > +	if (wbc->nr_to_write < wbc->range_end - wbc->range_start)
> > +		wbc->nr_to_write = round_up(wbc->nr_to_write, bsize);
> 
> This looks fine to me, but I'd be much more comfortable it we had
> it in the common writeback code instead of inside XFS.

Where in the common code? there's quite a few places that can call
->writepages...

Cheers,

Dave.
Christoph Hellwig Nov. 11, 2018, 4:32 p.m. UTC | #3
On Sun, Nov 11, 2018 at 12:20:57PM +1100, Dave Chinner wrote:
> > > +	if (wbc->range_start)
> > > +		wbc->range_start = round_down(wbc->range_start, bsize);
> > > +	if (wbc->range_end != LLONG_MAX)
> > > +		wbc->range_end = round_up(wbc->range_end, bsize);
> > > +	if (wbc->nr_to_write < wbc->range_end - wbc->range_start)
> > > +		wbc->nr_to_write = round_up(wbc->nr_to_write, bsize);
> > 
> > This looks fine to me, but I'd be much more comfortable it we had
> > it in the common writeback code instead of inside XFS.
> 
> Where in the common code? there's quite a few places that can call
> ->writepages...

As far as I can tell the only place calling ->writepages is
do_writepages, which sounds like the right place to me.  Maybe
conditional on a block size > page size to reduce the scare factor.
Brian Foster Nov. 14, 2018, 2:19 p.m. UTC | #4
On Wed, Nov 07, 2018 at 05:31:25PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> For data integrity purposes, we need to write back the entire
> filesystem block when asked to sync a sub-block range of the file.
> When the filesystem block size is larger than the page size, this
> means we need to convert single page integrity writes into whole
> block integrity writes. We do this by extending the writepage range
> to filesystem block granularity and alignment.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_aops.c | 14 ++++++++++++++
>  1 file changed, 14 insertions(+)
> 
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index f6ef9e0a7312..5334f16be166 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -900,6 +900,7 @@ xfs_vm_writepages(
>  		.io_type = XFS_IO_HOLE,
>  	};
>  	int			ret;
> +	unsigned		bsize =	i_blocksize(mapping->host);
>  
>  	/*
>  	 * Refuse to write pages out if we are called from reclaim context.
> @@ -922,6 +923,19 @@ xfs_vm_writepages(
>  	if (WARN_ON_ONCE(current->flags & PF_MEMALLOC_NOFS))
>  		return 0;
>  
> +	/*
> +	 * If the block size is larger than page size, extent the incoming write
> +	 * request to fsb granularity and alignment. This is a requirement for
> +	 * data integrity operations and it doesn't hurt for other write
> +	 * operations, so do it unconditionally.
> +	 */
> +	if (wbc->range_start)
> +		wbc->range_start = round_down(wbc->range_start, bsize);
> +	if (wbc->range_end != LLONG_MAX)
> +		wbc->range_end = round_up(wbc->range_end, bsize);
> +	if (wbc->nr_to_write < wbc->range_end - wbc->range_start)
> +		wbc->nr_to_write = round_up(wbc->nr_to_write, bsize);
> +

This latter bit causes endless writeback loops in tests such as
generic/475 (I think I reproduced it with xfs/141 as well). The
writeback infrastructure samples ->nr_to_write before and after
->writepages() calls to identify progress. Unconditionally bumping it to
something larger than the original value can lead to an underflow in the
writeback code that seems to throw things off. E.g., see the following
wb tracepoints (w/ 4k block and page size):

   kworker/u8:13-189   [003] ...1   317.968147: writeback_single_inode_start: bdi 253:9: ino=8389005 state=I_DIRTY_PAGES|I_SYNC dirtied_when=4294773087 age=211 index=0 to_write=1024 wrote=0 cgroup_ino=4294967295
   kworker/u8:13-189   [003] ...1   317.968150: writeback_single_inode: bdi 253:9: ino=8389005 state=I_DIRTY_PAGES|I_SYNC dirtied_when=4294773087 age=211 index=0 to_write=1024 wrote=18446744073709548544 cgroup_ino=4294967295

The wrote value goes from 0 to garbage and writeback_sb_inodes() uses
the same basic calculation for 'wrote.'

BTW, I haven't gone through the broader set, but just looking at this
bit what's the purpose of rounding ->nr_to_write (which is a page count)
to a block size in the first place?

Brian

>  	xfs_iflags_clear(XFS_I(mapping->host), XFS_ITRUNCATED);
>  	ret = write_cache_pages(mapping, wbc, xfs_do_writepage, &wpc);
>  	if (wpc.ioend)
> -- 
> 2.19.1
>
Dave Chinner Nov. 14, 2018, 9:18 p.m. UTC | #5
On Wed, Nov 14, 2018 at 09:19:26AM -0500, Brian Foster wrote:
> On Wed, Nov 07, 2018 at 05:31:25PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > For data integrity purposes, we need to write back the entire
> > filesystem block when asked to sync a sub-block range of the file.
> > When the filesystem block size is larger than the page size, this
> > means we need to convert single page integrity writes into whole
> > block integrity writes. We do this by extending the writepage range
> > to filesystem block granularity and alignment.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/xfs/xfs_aops.c | 14 ++++++++++++++
> >  1 file changed, 14 insertions(+)
> > 
> > diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> > index f6ef9e0a7312..5334f16be166 100644
> > --- a/fs/xfs/xfs_aops.c
> > +++ b/fs/xfs/xfs_aops.c
> > @@ -900,6 +900,7 @@ xfs_vm_writepages(
> >  		.io_type = XFS_IO_HOLE,
> >  	};
> >  	int			ret;
> > +	unsigned		bsize =	i_blocksize(mapping->host);
> >  
> >  	/*
> >  	 * Refuse to write pages out if we are called from reclaim context.
> > @@ -922,6 +923,19 @@ xfs_vm_writepages(
> >  	if (WARN_ON_ONCE(current->flags & PF_MEMALLOC_NOFS))
> >  		return 0;
> >  
> > +	/*
> > +	 * If the block size is larger than page size, extent the incoming write
> > +	 * request to fsb granularity and alignment. This is a requirement for
> > +	 * data integrity operations and it doesn't hurt for other write
> > +	 * operations, so do it unconditionally.
> > +	 */
> > +	if (wbc->range_start)
> > +		wbc->range_start = round_down(wbc->range_start, bsize);
> > +	if (wbc->range_end != LLONG_MAX)
> > +		wbc->range_end = round_up(wbc->range_end, bsize);
> > +	if (wbc->nr_to_write < wbc->range_end - wbc->range_start)
> > +		wbc->nr_to_write = round_up(wbc->nr_to_write, bsize);
> > +
> 
> This latter bit causes endless writeback loops in tests such as
> generic/475 (I think I reproduced it with xfs/141 as well). The

Yup, I've seen that, but haven't fixed it yet because I still
haven't climbed out of the dedupe/clone/copy file range data
corruption hole that fsx pulled the lid of.

Basically, I can't get back to working on bs > ps until I get the
stuff we actually support working correctly first...

> writeback infrastructure samples ->nr_to_write before and after
> ->writepages() calls to identify progress. Unconditionally bumping it to
> something larger than the original value can lead to an underflow in the
> writeback code that seems to throw things off. E.g., see the following
> wb tracepoints (w/ 4k block and page size):
> 
>    kworker/u8:13-189   [003] ...1   317.968147: writeback_single_inode_start: bdi 253:9: ino=8389005 state=I_DIRTY_PAGES|I_SYNC dirtied_when=4294773087 age=211 index=0 to_write=1024 wrote=0 cgroup_ino=4294967295
>    kworker/u8:13-189   [003] ...1   317.968150: writeback_single_inode: bdi 253:9: ino=8389005 state=I_DIRTY_PAGES|I_SYNC dirtied_when=4294773087 age=211 index=0 to_write=1024 wrote=18446744073709548544 cgroup_ino=4294967295
> 
> The wrote value goes from 0 to garbage and writeback_sb_inodes() uses
> the same basic calculation for 'wrote.'

Easy enough to fix, just stash the originals and restore them once
done.

> 
> BTW, I haven't gone through the broader set, but just looking at this
> bit what's the purpose of rounding ->nr_to_write (which is a page count)
> to a block size in the first place?

fsync on a single page range.

We write that page, allocate the block (which spans 16 pages), and
then return from writeback leaving 15/16 pages on that block still
dirty in memory.  Then we force the log, pushing the allocation and
metadata to disk.  Crash.

On recovery, we expose 15/16 pages of stale data because we only
wrote one of the pages over the block during fsync.

Cheers,

Dave.
Brian Foster Nov. 15, 2018, 12:55 p.m. UTC | #6
On Thu, Nov 15, 2018 at 08:18:18AM +1100, Dave Chinner wrote:
> On Wed, Nov 14, 2018 at 09:19:26AM -0500, Brian Foster wrote:
> > On Wed, Nov 07, 2018 at 05:31:25PM +1100, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > For data integrity purposes, we need to write back the entire
> > > filesystem block when asked to sync a sub-block range of the file.
> > > When the filesystem block size is larger than the page size, this
> > > means we need to convert single page integrity writes into whole
> > > block integrity writes. We do this by extending the writepage range
> > > to filesystem block granularity and alignment.
> > > 
> > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > ---
> > >  fs/xfs/xfs_aops.c | 14 ++++++++++++++
> > >  1 file changed, 14 insertions(+)
> > > 
> > > diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> > > index f6ef9e0a7312..5334f16be166 100644
> > > --- a/fs/xfs/xfs_aops.c
> > > +++ b/fs/xfs/xfs_aops.c
> > > @@ -900,6 +900,7 @@ xfs_vm_writepages(
> > >  		.io_type = XFS_IO_HOLE,
> > >  	};
> > >  	int			ret;
> > > +	unsigned		bsize =	i_blocksize(mapping->host);
> > >  
> > >  	/*
> > >  	 * Refuse to write pages out if we are called from reclaim context.
> > > @@ -922,6 +923,19 @@ xfs_vm_writepages(
> > >  	if (WARN_ON_ONCE(current->flags & PF_MEMALLOC_NOFS))
> > >  		return 0;
> > >  
> > > +	/*
> > > +	 * If the block size is larger than page size, extent the incoming write
> > > +	 * request to fsb granularity and alignment. This is a requirement for
> > > +	 * data integrity operations and it doesn't hurt for other write
> > > +	 * operations, so do it unconditionally.
> > > +	 */
> > > +	if (wbc->range_start)
> > > +		wbc->range_start = round_down(wbc->range_start, bsize);
> > > +	if (wbc->range_end != LLONG_MAX)
> > > +		wbc->range_end = round_up(wbc->range_end, bsize);
> > > +	if (wbc->nr_to_write < wbc->range_end - wbc->range_start)
> > > +		wbc->nr_to_write = round_up(wbc->nr_to_write, bsize);
> > > +
> > 
> > This latter bit causes endless writeback loops in tests such as
> > generic/475 (I think I reproduced it with xfs/141 as well). The
> 
> Yup, I've seen that, but haven't fixed it yet because I still
> haven't climbed out of the dedupe/clone/copy file range data
> corruption hole that fsx pulled the lid of.
> 
> Basically, I can't get back to working on bs > ps until I get the
> stuff we actually support working correctly first...
> 
> > writeback infrastructure samples ->nr_to_write before and after
> > ->writepages() calls to identify progress. Unconditionally bumping it to
> > something larger than the original value can lead to an underflow in the
> > writeback code that seems to throw things off. E.g., see the following
> > wb tracepoints (w/ 4k block and page size):
> > 
> >    kworker/u8:13-189   [003] ...1   317.968147: writeback_single_inode_start: bdi 253:9: ino=8389005 state=I_DIRTY_PAGES|I_SYNC dirtied_when=4294773087 age=211 index=0 to_write=1024 wrote=0 cgroup_ino=4294967295
> >    kworker/u8:13-189   [003] ...1   317.968150: writeback_single_inode: bdi 253:9: ino=8389005 state=I_DIRTY_PAGES|I_SYNC dirtied_when=4294773087 age=211 index=0 to_write=1024 wrote=18446744073709548544 cgroup_ino=4294967295
> > 
> > The wrote value goes from 0 to garbage and writeback_sb_inodes() uses
> > the same basic calculation for 'wrote.'
> 
> Easy enough to fix, just stash the originals and restore them once
> done.
> 
> > 
> > BTW, I haven't gone through the broader set, but just looking at this
> > bit what's the purpose of rounding ->nr_to_write (which is a page count)
> > to a block size in the first place?
> 
> fsync on a single page range.
> 
> We write that page, allocate the block (which spans 16 pages), and
> then return from writeback leaving 15/16 pages on that block still
> dirty in memory.  Then we force the log, pushing the allocation and
> metadata to disk.  Crash.
> 
> On recovery, we expose 15/16 pages of stale data because we only
> wrote one of the pages over the block during fsync.
> 

But why the block size and not something that represents pages per
block? Note again that ->nr_to_write is a page count and you're
comparing and aligning it with fields that are byte offsets.

Also, it looks like nr_to_write is set to LONG_MAX regardless of the
fsync range (whereas range_start/end are set appropriately). IOW, an
integrity write isn't limited by page count since it's critical to flush
the specified range. Unless I'm misunderstanding how this field is used
in the writeback code, ISTM that there shouldn't be any need to muck
with nr_to_write here for the purposes of fsync.

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
Dave Chinner Nov. 16, 2018, 6:19 a.m. UTC | #7
On Thu, Nov 15, 2018 at 07:55:44AM -0500, Brian Foster wrote:
> On Thu, Nov 15, 2018 at 08:18:18AM +1100, Dave Chinner wrote:
> > On Wed, Nov 14, 2018 at 09:19:26AM -0500, Brian Foster wrote:
> > > On Wed, Nov 07, 2018 at 05:31:25PM +1100, Dave Chinner wrote:
> > > > From: Dave Chinner <dchinner@redhat.com>
> > > > 
> > > > For data integrity purposes, we need to write back the entire
> > > > filesystem block when asked to sync a sub-block range of the file.
> > > > When the filesystem block size is larger than the page size, this
> > > > means we need to convert single page integrity writes into whole
> > > > block integrity writes. We do this by extending the writepage range
> > > > to filesystem block granularity and alignment.
> > > > 
> > > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > > ---
> > > >  fs/xfs/xfs_aops.c | 14 ++++++++++++++
> > > >  1 file changed, 14 insertions(+)
> > > > 
> > > > diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> > > > index f6ef9e0a7312..5334f16be166 100644
> > > > --- a/fs/xfs/xfs_aops.c
> > > > +++ b/fs/xfs/xfs_aops.c
> > > > @@ -900,6 +900,7 @@ xfs_vm_writepages(
> > > >  		.io_type = XFS_IO_HOLE,
> > > >  	};
> > > >  	int			ret;
> > > > +	unsigned		bsize =	i_blocksize(mapping->host);
> > > >  
> > > >  	/*
> > > >  	 * Refuse to write pages out if we are called from reclaim context.
> > > > @@ -922,6 +923,19 @@ xfs_vm_writepages(
> > > >  	if (WARN_ON_ONCE(current->flags & PF_MEMALLOC_NOFS))
> > > >  		return 0;
> > > >  
> > > > +	/*
> > > > +	 * If the block size is larger than page size, extent the incoming write
> > > > +	 * request to fsb granularity and alignment. This is a requirement for
> > > > +	 * data integrity operations and it doesn't hurt for other write
> > > > +	 * operations, so do it unconditionally.
> > > > +	 */
> > > > +	if (wbc->range_start)
> > > > +		wbc->range_start = round_down(wbc->range_start, bsize);
> > > > +	if (wbc->range_end != LLONG_MAX)
> > > > +		wbc->range_end = round_up(wbc->range_end, bsize);
> > > > +	if (wbc->nr_to_write < wbc->range_end - wbc->range_start)
> > > > +		wbc->nr_to_write = round_up(wbc->nr_to_write, bsize);
> > > > +
> > > 
> > > This latter bit causes endless writeback loops in tests such as
> > > generic/475 (I think I reproduced it with xfs/141 as well). The
> > 
> > Yup, I've seen that, but haven't fixed it yet because I still
> > haven't climbed out of the dedupe/clone/copy file range data
> > corruption hole that fsx pulled the lid of.
> > 
> > Basically, I can't get back to working on bs > ps until I get the
> > stuff we actually support working correctly first...
> > 
> > > writeback infrastructure samples ->nr_to_write before and after
> > > ->writepages() calls to identify progress. Unconditionally bumping it to
> > > something larger than the original value can lead to an underflow in the
> > > writeback code that seems to throw things off. E.g., see the following
> > > wb tracepoints (w/ 4k block and page size):
> > > 
> > >    kworker/u8:13-189   [003] ...1   317.968147: writeback_single_inode_start: bdi 253:9: ino=8389005 state=I_DIRTY_PAGES|I_SYNC dirtied_when=4294773087 age=211 index=0 to_write=1024 wrote=0 cgroup_ino=4294967295
> > >    kworker/u8:13-189   [003] ...1   317.968150: writeback_single_inode: bdi 253:9: ino=8389005 state=I_DIRTY_PAGES|I_SYNC dirtied_when=4294773087 age=211 index=0 to_write=1024 wrote=18446744073709548544 cgroup_ino=4294967295
> > > 
> > > The wrote value goes from 0 to garbage and writeback_sb_inodes() uses
> > > the same basic calculation for 'wrote.'
> > 
> > Easy enough to fix, just stash the originals and restore them once
> > done.
> > 
> > > 
> > > BTW, I haven't gone through the broader set, but just looking at this
> > > bit what's the purpose of rounding ->nr_to_write (which is a page count)
> > > to a block size in the first place?
> > 
> > fsync on a single page range.
> > 
> > We write that page, allocate the block (which spans 16 pages), and
> > then return from writeback leaving 15/16 pages on that block still
> > dirty in memory.  Then we force the log, pushing the allocation and
> > metadata to disk.  Crash.
> > 
> > On recovery, we expose 15/16 pages of stale data because we only
> > wrote one of the pages over the block during fsync.
> > 
> 
> But why the block size and not something that represents pages per
> block? Note again that ->nr_to_write is a page count and you're
> comparing and aligning it with fields that are byte offsets.

Not sure I follow you - block size is directly related to the
number of pages per block. You can't calculate one without the
other...

/me goes back an looks at the code, finally it dawns on him that
the bug is he's rounding to block size, not (block size >>
PAGE_SHIFT). I'll fix that.

> IOW, an
> integrity write isn't limited by page count since it's critical to flush
> the specified range. Unless I'm misunderstanding how this field is used
> in the writeback code, ISTM that there shouldn't be any need to muck
> with nr_to_write here for the purposes of fsync.

Sorry, I thought you were asking about why were were rounding the
range in general, not nr_to_write specifically.

We don't need to modify nr_to_write for WB_SYNC_ALL,
write_cache_pages() will not terminate on a nr_to_write expiry. So,
unlike the range rounding, the nr_to_write rounding is really only
for WB_SYNC_NONE.

i.e. we round WB_SYNC_NONE to try to get whole blocks written so we
don't end up with partially written blocks on disk for extended
periods of time (e.g. between background writeback periods). It
doesn't matter for WB_SYNC_ALL, but it will reduce the potential for
stale data beign exposed when crashes occur and random pages in
blocks have't been written back. i.e. it's to help iprevent
re-exposing the problematic cases that we added the "NULL files
after crash" workarounds for.

Cheers,

Dave.
Brian Foster Nov. 16, 2018, 1:29 p.m. UTC | #8
On Fri, Nov 16, 2018 at 05:19:36PM +1100, Dave Chinner wrote:
> On Thu, Nov 15, 2018 at 07:55:44AM -0500, Brian Foster wrote:
> > On Thu, Nov 15, 2018 at 08:18:18AM +1100, Dave Chinner wrote:
> > > On Wed, Nov 14, 2018 at 09:19:26AM -0500, Brian Foster wrote:
> > > > On Wed, Nov 07, 2018 at 05:31:25PM +1100, Dave Chinner wrote:
> > > > > From: Dave Chinner <dchinner@redhat.com>
> > > > > 
> > > > > For data integrity purposes, we need to write back the entire
> > > > > filesystem block when asked to sync a sub-block range of the file.
> > > > > When the filesystem block size is larger than the page size, this
> > > > > means we need to convert single page integrity writes into whole
> > > > > block integrity writes. We do this by extending the writepage range
> > > > > to filesystem block granularity and alignment.
> > > > > 
> > > > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > > > ---
> > > > >  fs/xfs/xfs_aops.c | 14 ++++++++++++++
> > > > >  1 file changed, 14 insertions(+)
> > > > > 
> > > > > diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> > > > > index f6ef9e0a7312..5334f16be166 100644
> > > > > --- a/fs/xfs/xfs_aops.c
> > > > > +++ b/fs/xfs/xfs_aops.c
> > > > > @@ -900,6 +900,7 @@ xfs_vm_writepages(
> > > > >  		.io_type = XFS_IO_HOLE,
> > > > >  	};
> > > > >  	int			ret;
> > > > > +	unsigned		bsize =	i_blocksize(mapping->host);
> > > > >  
> > > > >  	/*
> > > > >  	 * Refuse to write pages out if we are called from reclaim context.
> > > > > @@ -922,6 +923,19 @@ xfs_vm_writepages(
> > > > >  	if (WARN_ON_ONCE(current->flags & PF_MEMALLOC_NOFS))
> > > > >  		return 0;
> > > > >  
> > > > > +	/*
> > > > > +	 * If the block size is larger than page size, extent the incoming write
> > > > > +	 * request to fsb granularity and alignment. This is a requirement for
> > > > > +	 * data integrity operations and it doesn't hurt for other write
> > > > > +	 * operations, so do it unconditionally.
> > > > > +	 */
> > > > > +	if (wbc->range_start)
> > > > > +		wbc->range_start = round_down(wbc->range_start, bsize);
> > > > > +	if (wbc->range_end != LLONG_MAX)
> > > > > +		wbc->range_end = round_up(wbc->range_end, bsize);
> > > > > +	if (wbc->nr_to_write < wbc->range_end - wbc->range_start)
> > > > > +		wbc->nr_to_write = round_up(wbc->nr_to_write, bsize);
> > > > > +
> > > > 
> > > > This latter bit causes endless writeback loops in tests such as
> > > > generic/475 (I think I reproduced it with xfs/141 as well). The
> > > 
> > > Yup, I've seen that, but haven't fixed it yet because I still
> > > haven't climbed out of the dedupe/clone/copy file range data
> > > corruption hole that fsx pulled the lid of.
> > > 
> > > Basically, I can't get back to working on bs > ps until I get the
> > > stuff we actually support working correctly first...
> > > 
> > > > writeback infrastructure samples ->nr_to_write before and after
> > > > ->writepages() calls to identify progress. Unconditionally bumping it to
> > > > something larger than the original value can lead to an underflow in the
> > > > writeback code that seems to throw things off. E.g., see the following
> > > > wb tracepoints (w/ 4k block and page size):
> > > > 
> > > >    kworker/u8:13-189   [003] ...1   317.968147: writeback_single_inode_start: bdi 253:9: ino=8389005 state=I_DIRTY_PAGES|I_SYNC dirtied_when=4294773087 age=211 index=0 to_write=1024 wrote=0 cgroup_ino=4294967295
> > > >    kworker/u8:13-189   [003] ...1   317.968150: writeback_single_inode: bdi 253:9: ino=8389005 state=I_DIRTY_PAGES|I_SYNC dirtied_when=4294773087 age=211 index=0 to_write=1024 wrote=18446744073709548544 cgroup_ino=4294967295
> > > > 
> > > > The wrote value goes from 0 to garbage and writeback_sb_inodes() uses
> > > > the same basic calculation for 'wrote.'
> > > 
> > > Easy enough to fix, just stash the originals and restore them once
> > > done.
> > > 
> > > > 
> > > > BTW, I haven't gone through the broader set, but just looking at this
> > > > bit what's the purpose of rounding ->nr_to_write (which is a page count)
> > > > to a block size in the first place?
> > > 
> > > fsync on a single page range.
> > > 
> > > We write that page, allocate the block (which spans 16 pages), and
> > > then return from writeback leaving 15/16 pages on that block still
> > > dirty in memory.  Then we force the log, pushing the allocation and
> > > metadata to disk.  Crash.
> > > 
> > > On recovery, we expose 15/16 pages of stale data because we only
> > > wrote one of the pages over the block during fsync.
> > > 
> > 
> > But why the block size and not something that represents pages per
> > block? Note again that ->nr_to_write is a page count and you're
> > comparing and aligning it with fields that are byte offsets.
> 
> Not sure I follow you - block size is directly related to the
> number of pages per block. You can't calculate one without the
> other...
> 
> /me goes back an looks at the code, finally it dawns on him that
> the bug is he's rounding to block size, not (block size >>
> PAGE_SHIFT). I'll fix that.
> 

Right.. and comparing it to range_end - range_start without conversion
as well.

> > IOW, an
> > integrity write isn't limited by page count since it's critical to flush
> > the specified range. Unless I'm misunderstanding how this field is used
> > in the writeback code, ISTM that there shouldn't be any need to muck
> > with nr_to_write here for the purposes of fsync.
> 
> Sorry, I thought you were asking about why were were rounding the
> range in general, not nr_to_write specifically.
> 

No worries.

> We don't need to modify nr_to_write for WB_SYNC_ALL,
> write_cache_pages() will not terminate on a nr_to_write expiry. So,
> unlike the range rounding, the nr_to_write rounding is really only
> for WB_SYNC_NONE.
> 

Ok, I was responding to the "fsync() on a single page range" use case,
but I see it wasn't clear I was asking specifically about ->nr_to_write
and not necessarily the broader change.

Then again, the commit log seems to focus on fsync exclusively, so some
more content could be included there at minimum.

> i.e. we round WB_SYNC_NONE to try to get whole blocks written so we
> don't end up with partially written blocks on disk for extended
> periods of time (e.g. between background writeback periods). It
> doesn't matter for WB_SYNC_ALL, but it will reduce the potential for
> stale data beign exposed when crashes occur and random pages in
> blocks have't been written back. i.e. it's to help iprevent
> re-exposing the problematic cases that we added the "NULL files
> after crash" workarounds for.
> 

Ok. I see that there are earlier patches to do zero-around on sub-block
writes, so the idea makes a bit more sense with that in mind. That said,
I still don't grok how messing with nr_to_write is effective.

For background writeback (WB_SYNC_NONE), the range fields are clamped
out (0-LONG_MAX) since the location of pages to write is not really a
concern. In that case, ->nr_to_write is set based on some bandwidth
heuristic and the only change we make here is to round it. If we
consider the fact that any mapping itself may consist of some
combination of zeroed-around delalloc blocks (covered by an aligned
number of dirty pages) and already allocated/overwrite blocks (covered
by any number of dirty pages), how does a rounded ->nr_to_write actually
help us at all? Can't the magic ->nr_to_write value that prevents
stopping at a partially written sub-block page be unaligned to the block
size?

For integrity writeback (WB_SYNC_ALL), the range rounding makes sense to
me. In that case, the appropriate range_start/end values are set and
->nr_to_write is LONG_MAX because we don't want to limit on a page
count. So we round out the range to help ensure we don't do any
sub-block writes and as already discussed, ->nr_to_write is irrelevant
here.

Given the above, I don't see how tweaking ->nr_to_write really helps at
all even as an optimization. Unless I'm missing something else in the
earlier patches that facilitate this, ISTM that something more explicit
is required if you want to increase the odds that zeroed-around blocks
are written together. For example, perhaps try to detect this further
down in the writepage callout if we're about to stop in a bad spot, use
something like the old xfs_convert_page() mechanism to handle N pages
per-callout, or (as Christoph suggested), try to incorporate this kind
of block alignment consideration into write_cache_pages() itself. For
the latter, perhaps we could add a mod/align field or some such to the
wbc that the fs can use to instruct writeback to consider bs > ps
alignment explicitly.

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
Dave Chinner Nov. 19, 2018, 1:14 a.m. UTC | #9
On Fri, Nov 16, 2018 at 08:29:23AM -0500, Brian Foster wrote:
> On Fri, Nov 16, 2018 at 05:19:36PM +1100, Dave Chinner wrote:
> > i.e. we round WB_SYNC_NONE to try to get whole blocks written so we
> > don't end up with partially written blocks on disk for extended
> > periods of time (e.g. between background writeback periods). It
> > doesn't matter for WB_SYNC_ALL, but it will reduce the potential for
> > stale data beign exposed when crashes occur and random pages in
> > blocks have't been written back. i.e. it's to help iprevent
> > re-exposing the problematic cases that we added the "NULL files
> > after crash" workarounds for.
> > 
> 
> Ok. I see that there are earlier patches to do zero-around on sub-block
> writes, so the idea makes a bit more sense with that in mind. That said,
> I still don't grok how messing with nr_to_write is effective.

Because background writeback is range_cyclic = 1, and that means we
always start at offset zero, and then if nr_to_write expires we
stash the page index we are up to in:

	mapping->writeback_index = done_index

And the next background writeback will start again from there.

Hence if nr_to_write is always rounding to the number of pages per
block, background writeback will /tend/ towards writing full blocks
because the writeback_index will always end up a multiple of pages
per block. Hence cyclic writeback will tend towards writing aligned,
full blocks when nr_to_write is rounded.

That's the fundamental concept here - write-in does "zero-around" to
initialise full blocks, writeback does "write-around" to push full
blocks to disk. WB_SYNC_ALL needs ranges to be rounded to do full
block writeback, WB_SYNC_NONE background writeback needs it's range
cyclic behaviour to round to writing full blocks (and that's what
rounding nr_to_write is doing in this patch).

> For background writeback (WB_SYNC_NONE), the range fields are clamped
> out (0-LONG_MAX) since the location of pages to write is not really a
> concern. In that case, ->nr_to_write is set based on some bandwidth
> heuristic and the only change we make here is to round it. If we
> consider the fact that any mapping itself may consist of some
> combination of zeroed-around delalloc blocks (covered by an aligned
> number of dirty pages) and already allocated/overwrite blocks (covered
> by any number of dirty pages), how does a rounded ->nr_to_write actually
> help us at all? Can't the magic ->nr_to_write value that prevents
> stopping at a partially written sub-block page be unaligned to the block
> size?

Yup, I never intended for this RFC prototype to deal with all these
problems. That doesn't mean I'm not aware of them, nor that I don't
have a plan to deal with them.

> Given the above, I don't see how tweaking ->nr_to_write really helps at
> all even as an optimization. Unless I'm missing something else in the
> earlier patches that facilitate this, ISTM that something more explicit
> is required if you want to increase the odds that zeroed-around blocks
> are written together.

Which I've always intended as future work. I've spent about 14 hours
on this patch set so far - it's a functional prototype, not a
finished, completed piece of work.

I'm fully aware that this going to need a lot more work before it is
ready for merging This is an early prototype I'm putting out there
for architectural/design review. i.e. don't bother nitpicking
unimplemented details or bugs, look for big picture things I've got
wrong. Look for showstoppers and fundamental problems, not things
that just require a little bit of time and coding to implement....

Cheers,

Dave.
diff mbox series

Patch

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index f6ef9e0a7312..5334f16be166 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -900,6 +900,7 @@  xfs_vm_writepages(
 		.io_type = XFS_IO_HOLE,
 	};
 	int			ret;
+	unsigned		bsize =	i_blocksize(mapping->host);
 
 	/*
 	 * Refuse to write pages out if we are called from reclaim context.
@@ -922,6 +923,19 @@  xfs_vm_writepages(
 	if (WARN_ON_ONCE(current->flags & PF_MEMALLOC_NOFS))
 		return 0;
 
+	/*
+	 * If the block size is larger than page size, extent the incoming write
+	 * request to fsb granularity and alignment. This is a requirement for
+	 * data integrity operations and it doesn't hurt for other write
+	 * operations, so do it unconditionally.
+	 */
+	if (wbc->range_start)
+		wbc->range_start = round_down(wbc->range_start, bsize);
+	if (wbc->range_end != LLONG_MAX)
+		wbc->range_end = round_up(wbc->range_end, bsize);
+	if (wbc->nr_to_write < wbc->range_end - wbc->range_start)
+		wbc->nr_to_write = round_up(wbc->nr_to_write, bsize);
+
 	xfs_iflags_clear(XFS_I(mapping->host), XFS_ITRUNCATED);
 	ret = write_cache_pages(mapping, wbc, xfs_do_writepage, &wpc);
 	if (wpc.ioend)