diff mbox

xfs: flush the range before zero partial block range on truncate down

Message ID 20171027125328.25001-1-eguan@redhat.com (mailing list archive)
State New, archived
Headers show

Commit Message

Eryu Guan Oct. 27, 2017, 12:53 p.m. UTC
On truncate down, if new size is not block size aligned, we zero the
rest of block via iomap_truncate_page() to avoid exposing stale data
to user, and iomap_truncate_page() skips zeroing if the range is
already in unwritten status or a hole.

But it's possible that a buffer write overwrites the unwritten
extent, which won't be converted to a normal extent until I/O
completion, and iomap_truncate_page() skips zeroing wrongly because
of the not-converted unwritten extent. This would cause a subsequent
mmap read sees non-zeros beyond EOF.

I occasionally see this in fsx runs in fstests generic/112, a
simplified fsx operation sequence is like (assuming 4k block size
xfs):

  fallocate 0x0 0x1000 0x0 keep_size
  write 0x0 0x1000 0x0
  truncate 0x0 0x800 0x1000
  punch_hole 0x0 0x800 0x800
  mapread 0x0 0x800 0x800

This is introduced along with the switch to iomap based buffer write
support, commit 68a9f5e7007c ("xfs: implement iomap based buffered
write path"), because previously, block_truncate_page() did fs block
based check and only skipped holes.

Fix it by flushing the range we're going to zero to ensure
iomap_truncate_page() sees the final extent state and zeros the
range correctly.

Also fixed the wrong 'end' param of filemap_write_and_wait_range()
call while we're at it, the 'end' is inclusive and should be
'newsize - 1'.

Signed-off-by: Eryu Guan <eguan@redhat.com>
---

An fstests test case followed, I've verified it could reproduce the
bug reliably and quickly.

 fs/xfs/xfs_iops.c | 23 +++++++++++++++++++----
 1 file changed, 19 insertions(+), 4 deletions(-)

Comments

Christoph Hellwig Oct. 28, 2017, 6:05 a.m. UTC | #1
On Fri, Oct 27, 2017 at 08:53:28PM +0800, Eryu Guan wrote:
> But it's possible that a buffer write overwrites the unwritten
> extent, which won't be converted to a normal extent until I/O
> completion, and iomap_truncate_page() skips zeroing wrongly because
> of the not-converted unwritten extent. This would cause a subsequent
> mmap read sees non-zeros beyond EOF.

I suspect the right fix is to look at the in-core state im the iomap
truncate helpers instead of doing a duplicate flush.
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eryu Guan Oct. 31, 2017, 10:09 a.m. UTC | #2
On Fri, Oct 27, 2017 at 11:05:29PM -0700, Christoph Hellwig wrote:
> On Fri, Oct 27, 2017 at 08:53:28PM +0800, Eryu Guan wrote:
> > But it's possible that a buffer write overwrites the unwritten
> > extent, which won't be converted to a normal extent until I/O
> > completion, and iomap_truncate_page() skips zeroing wrongly because
> > of the not-converted unwritten extent. This would cause a subsequent
> > mmap read sees non-zeros beyond EOF.
> 
> I suspect the right fix is to look at the in-core state im the iomap
> truncate helpers instead of doing a duplicate flush.

I may (and very likely) miss something, but my understanding is that
iomap_truncate_page() already looks at the in-core extent state provided
by xfs_file_iomap_begin()

xfs_setattr_size()
 iomap_truncate_page()
  iomap_zero_range()
   iomap_apply()
    xfs_file_iomap_begin()	# finds extent according to the range
    iomap_zero_range_actor() 	# sees iomap->type == IOMAP_UNWRITTEN

And this in-core extent state won't be converted to XFS_EXT_NORM until
writeback & I/O completion, so I think a flush is required.

Thanks,
Eryu
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Darrick J. Wong Oct. 31, 2017, 5:11 p.m. UTC | #3
On Tue, Oct 31, 2017 at 06:09:58PM +0800, Eryu Guan wrote:
> On Fri, Oct 27, 2017 at 11:05:29PM -0700, Christoph Hellwig wrote:
> > On Fri, Oct 27, 2017 at 08:53:28PM +0800, Eryu Guan wrote:
> > > But it's possible that a buffer write overwrites the unwritten
> > > extent, which won't be converted to a normal extent until I/O
> > > completion, and iomap_truncate_page() skips zeroing wrongly because
> > > of the not-converted unwritten extent. This would cause a subsequent
> > > mmap read sees non-zeros beyond EOF.
> > 
> > I suspect the right fix is to look at the in-core state im the iomap
> > truncate helpers instead of doing a duplicate flush.
> 
> I may (and very likely) miss something, but my understanding is that
> iomap_truncate_page() already looks at the in-core extent state provided
> by xfs_file_iomap_begin()
> 
> xfs_setattr_size()
>  iomap_truncate_page()
>   iomap_zero_range()
>    iomap_apply()
>     xfs_file_iomap_begin()	# finds extent according to the range
>     iomap_zero_range_actor() 	# sees iomap->type == IOMAP_UNWRITTEN
> 
> And this in-core extent state won't be converted to XFS_EXT_NORM until
> writeback & I/O completion, so I think a flush is required.

If we get to iomap_zero_range_actor & type == UNWRITTEN, can we ask the
pagecache if there's a page backing the (unwritten) range and if so zero
the appropriate parts of that page?

--D

> Thanks,
> Eryu
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Dave Chinner Oct. 31, 2017, 10:58 p.m. UTC | #4
On Fri, Oct 27, 2017 at 08:53:28PM +0800, Eryu Guan wrote:
> On truncate down, if new size is not block size aligned, we zero the
> rest of block via iomap_truncate_page() to avoid exposing stale data
> to user, and iomap_truncate_page() skips zeroing if the range is
> already in unwritten status or a hole.

Unless the page is in the page cache already, and then it gets
zeroed in memory as part of truncate_setsize() call.

> But it's possible that a buffer write overwrites the unwritten
> extent, which won't be converted to a normal extent until I/O
> completion, and iomap_truncate_page() skips zeroing wrongly because
> of the not-converted unwritten extent. This would cause a subsequent
> mmap read sees non-zeros beyond EOF.

Yes, it should skip the zeroing on disk. The page in the page cache
over the unwritten extent will be zeroed on read.

The real question is this: where are the zeros in the page that fsx
is complaining about?

> I occasionally see this in fsx runs in fstests generic/112, a
> simplified fsx operation sequence is like (assuming 4k block size
> xfs):

What should have is:

>   fallocate 0x0 0x1000 0x0 keep_size

Unwritten, no data.

>   write 0x0 0x1000 0x0

Unwritten, contains data in page cache.

>   truncate 0x0 0x800 0x1000

Unwritten, page contains data 0-0x800, zeros 0x800-0x1000

>   punch_hole 0x0 0x800 0x800

Unwritten, page contains zeros 0x0-0x1000

>   mapread 0x0 0x800 0x800

Should map a page full of zeros as it is either a read over an
unwritten extent or a hole, or it finds a page cache page that is
fully zeroed.

The only wrinkle in this is if the write is direct IO, but
then the truncate would see a written extent and this whole problem
doesn't occur.

So, more info required. :P

Cheers,

Dave.
Dave Chinner Oct. 31, 2017, 11:03 p.m. UTC | #5
On Tue, Oct 31, 2017 at 10:11:31AM -0700, Darrick J. Wong wrote:
> On Tue, Oct 31, 2017 at 06:09:58PM +0800, Eryu Guan wrote:
> > On Fri, Oct 27, 2017 at 11:05:29PM -0700, Christoph Hellwig wrote:
> > > On Fri, Oct 27, 2017 at 08:53:28PM +0800, Eryu Guan wrote:
> > > > But it's possible that a buffer write overwrites the unwritten
> > > > extent, which won't be converted to a normal extent until I/O
> > > > completion, and iomap_truncate_page() skips zeroing wrongly because
> > > > of the not-converted unwritten extent. This would cause a subsequent
> > > > mmap read sees non-zeros beyond EOF.
> > > 
> > > I suspect the right fix is to look at the in-core state im the iomap
> > > truncate helpers instead of doing a duplicate flush.
> > 
> > I may (and very likely) miss something, but my understanding is that
> > iomap_truncate_page() already looks at the in-core extent state provided
> > by xfs_file_iomap_begin()
> > 
> > xfs_setattr_size()
> >  iomap_truncate_page()
> >   iomap_zero_range()
> >    iomap_apply()
> >     xfs_file_iomap_begin()	# finds extent according to the range
> >     iomap_zero_range_actor() 	# sees iomap->type == IOMAP_UNWRITTEN
> > 
> > And this in-core extent state won't be converted to XFS_EXT_NORM until
> > writeback & I/O completion, so I think a flush is required.
> 
> If we get to iomap_zero_range_actor & type == UNWRITTEN, can we ask the
> pagecache if there's a page backing the (unwritten) range and if so zero
> the appropriate parts of that page?

The partial page zeroing in the page cache should be taken care of
by

truncate_setsize
  truncate_pagecache()
    truncate_inode_pages()
      truncate_inode_pages_range()

that follows up the on-disk EOF zeroing. See the partial page
zeroing in that function....

Cheers,

Dave.
Eryu Guan Nov. 1, 2017, 3:46 a.m. UTC | #6
On Wed, Nov 01, 2017 at 09:58:04AM +1100, Dave Chinner wrote:
> On Fri, Oct 27, 2017 at 08:53:28PM +0800, Eryu Guan wrote:
> > On truncate down, if new size is not block size aligned, we zero the
> > rest of block via iomap_truncate_page() to avoid exposing stale data
> > to user, and iomap_truncate_page() skips zeroing if the range is
> > already in unwritten status or a hole.
> 
> Unless the page is in the page cache already, and then it gets
> zeroed in memory as part of truncate_setsize() call.
> 
> > But it's possible that a buffer write overwrites the unwritten
> > extent, which won't be converted to a normal extent until I/O
> > completion, and iomap_truncate_page() skips zeroing wrongly because
> > of the not-converted unwritten extent. This would cause a subsequent
> > mmap read sees non-zeros beyond EOF.
> 
> Yes, it should skip the zeroing on disk. The page in the page cache
> over the unwritten extent will be zeroed on read.
> 
> The real question is this: where are the zeros in the page that fsx
> is complaining about?

The partial block that iomap_truncate_page() skipped zeroing was latter
written back to disk, and the punch_hole before mmap read invalidated
the page cache so mmap read from disk and saw non-zeros. This is a
hard-to-hit sequence, it took me almost 2000 iterations of generic/112
runs to hit one failure. I'll provide more details below.

> 
> > I occasionally see this in fsx runs in fstests generic/112, a
> > simplified fsx operation sequence is like (assuming 4k block size
> > xfs):
> 
> What should have is:
> 
> >   fallocate 0x0 0x1000 0x0 keep_size
> 
> Unwritten, no data.

Yes, assuming 4k block size and 4k page size, unwritten extent with 1
block allocated, i_size stays 0.

> 
> >   write 0x0 0x1000 0x0
> 
> Unwritten, contains data in page cache.

Exactly, and in-core i_size is 4k now, but on-disk di_size is still 0.

> 
> >   truncate 0x0 0x800 0x1000
> 
> Unwritten, page contains data 0-0x800, zeros 0x800-0x1000

Yes, the page cache after truncate is correct. But before we zero the
page cache (in truncate_setsize()), we skipped zeroing the partial block
range 0x800-0x1000 and then triggered a writeback on range
[di_size, newsize], which was 0-0x800, and 0x800-0x1000 was written back
to disk too, which contained non-zeros.

(newsize(2k) > di_size(0) && oldsize(4k) != di_size(0)) was true.

        if (did_zeroing ||                                                                                                                                                                                                       
            (newsize > ip->i_d.di_size && oldsize != ip->i_d.di_size)) {                                                                                                                                                         
                error = filemap_write_and_wait_range(mapping, ip->i_d.di_size,                                                                                                                                                   
                                                        newsize - 1);                                                                                                                                                            
                if (error)                                                                                                                                                                                                       
                        return error;                                                                                                                                                                                            
        }

> 
> >   punch_hole 0x0 0x800 0x800
> 
> Unwritten, page contains zeros 0x0-0x1000

i_mapping had no pages (nrpages == 0) after punch_hole.

> 
> >   mapread 0x0 0x800 0x800

pagefault read block from disk, 0-0x7ff was zero, but 0x800-0xfff was
non-zero.

> 
> Should map a page full of zeros as it is either a read over an
> unwritten extent or a hole, or it finds a page cache page that is
> fully zeroed.
> 
> The only wrinkle in this is if the write is direct IO, but
> then the truncate would see a written extent and this whole problem
> doesn't occur.
> 
> So, more info required. :P

Above is what I observed in debugging, maybe I should have provided more
details in the first place. (I did add some comments to the fstests case
though..).

Thanks,
Eryu

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Dave Chinner Nov. 1, 2017, 4:44 a.m. UTC | #7
On Wed, Nov 01, 2017 at 11:46:39AM +0800, Eryu Guan wrote:
> On Wed, Nov 01, 2017 at 09:58:04AM +1100, Dave Chinner wrote:
> > On Fri, Oct 27, 2017 at 08:53:28PM +0800, Eryu Guan wrote:
> > > On truncate down, if new size is not block size aligned, we zero the
> > > rest of block via iomap_truncate_page() to avoid exposing stale data
> > > to user, and iomap_truncate_page() skips zeroing if the range is
> > > already in unwritten status or a hole.
> > 
> > Unless the page is in the page cache already, and then it gets
> > zeroed in memory as part of truncate_setsize() call.
> > 
> > > But it's possible that a buffer write overwrites the unwritten
> > > extent, which won't be converted to a normal extent until I/O
> > > completion, and iomap_truncate_page() skips zeroing wrongly because
> > > of the not-converted unwritten extent. This would cause a subsequent
> > > mmap read sees non-zeros beyond EOF.
> > 
> > Yes, it should skip the zeroing on disk. The page in the page cache
> > over the unwritten extent will be zeroed on read.
> > 
> > The real question is this: where are the zeros in the page that fsx
> > is complaining about?
> 
> The partial block that iomap_truncate_page() skipped zeroing was latter
> written back to disk, and the punch_hole before mmap read invalidated
> the page cache so mmap read from disk and saw non-zeros.  This is a
> hard-to-hit sequence, it took me almost 2000 iterations of generic/112
> runs to hit one failure. I'll provide more details below.

Oh, ok, so they weren't close together operations but far apart in
the trace. I usually indicate that by showing [....] lines between
the operations if there's stuff that occurred between them.

> > > simplified fsx operation sequence is like (assuming 4k block size
> > > xfs):
> > 
> > What should have is:
> > 
> > >   fallocate 0x0 0x1000 0x0 keep_size
> > 
> > Unwritten, no data.
> 
> Yes, assuming 4k block size and 4k page size, unwritten extent with 1
> block allocated, i_size stays 0.
> 
> > 
> > >   write 0x0 0x1000 0x0
> > 
> > Unwritten, contains data in page cache.
> 
> Exactly, and in-core i_size is 4k now, but on-disk di_size is still 0.
> 
> > 
> > >   truncate 0x0 0x800 0x1000
> > 
> > Unwritten, page contains data 0-0x800, zeros 0x800-0x1000
> 
> Yes, the page cache after truncate is correct. But before we zero the
> page cache (in truncate_setsize()), we skipped zeroing the partial block
> range 0x800-0x1000 and then triggered a writeback on range
> [di_size, newsize], which was 0-0x800, and 0x800-0x1000 was written back
> to disk too, which contained non-zeros.
> 
> (newsize(2k) > di_size(0) && oldsize(4k) != di_size(0)) was true.
> 
>         if (did_zeroing ||                                                                                                                                                                                                       
>             (newsize > ip->i_d.di_size && oldsize != ip->i_d.di_size)) {                                                                                                                                                         
>                 error = filemap_write_and_wait_range(mapping, ip->i_d.di_size,                                                                                                                                                   
>                                                         newsize - 1);                                                                                                                                                            
>                 if (error)                                                                                                                                                                                                       
>                         return error;                                                                                                                                                                                            
>         }

Ok, so we're writing data between di_size and newsize before
removing the page cache beyond newsize. As such, the page of data
that newsize lies in has not been zeroed by page cache invalidation
before it is written.

Ok, that explains why the EOF page zeroing in xfs_do_writepage()
isn't catching this - we haven't updated the inode size yet.

IOWs, the /three places/ where we normally catch this and zero the
partial tail page beyond EOF are not doing it because:

	1. iomap_truncate_page() sees unwritten and skips.
	2. truncate_setsize() has not yet been called so can't
	   zero the tail of the page.
	3. we haven't changed where EOF is yet, so
	   xfs_do_writepage() hasn't triggered it's "zero data
	   beyond EOF" case before it sends the page to disk. 

So, we have three options here:

	1. iomap_truncate_page() always zeros
	2. update inode size before writeback after zeroing so the
	   xfs_do_writepage() zeros the tail page, or
	3. move truncate_setsize() to before writeback so the page
	   cache invalidation zeros the part page at the new EOF.

I think 1) is a no-go for performance reasons. 2) is better, but
I don't like the idea of separating the page cache invalidation
from the size truncation. That leaves 3) - moving
truncate_setsize().

I think I prefer 3) because it triggers multiple layers of defense
against writing stale data past EOF, and from an crash behaviour
point of view it makes no difference whether we truncate the page
cache before or after triggering writeback because it will just make
the result the same as if we were zeroing a written extent....

Cheers,

Dave.
Eryu Guan Nov. 1, 2017, 12:06 p.m. UTC | #8
On Wed, Nov 01, 2017 at 03:44:51PM +1100, Dave Chinner wrote:
> On Wed, Nov 01, 2017 at 11:46:39AM +0800, Eryu Guan wrote:
> > On Wed, Nov 01, 2017 at 09:58:04AM +1100, Dave Chinner wrote:
> > > On Fri, Oct 27, 2017 at 08:53:28PM +0800, Eryu Guan wrote:
> > > > On truncate down, if new size is not block size aligned, we zero the
> > > > rest of block via iomap_truncate_page() to avoid exposing stale data
> > > > to user, and iomap_truncate_page() skips zeroing if the range is
> > > > already in unwritten status or a hole.
> > > 
> > > Unless the page is in the page cache already, and then it gets
> > > zeroed in memory as part of truncate_setsize() call.
> > > 
> > > > But it's possible that a buffer write overwrites the unwritten
> > > > extent, which won't be converted to a normal extent until I/O
> > > > completion, and iomap_truncate_page() skips zeroing wrongly because
> > > > of the not-converted unwritten extent. This would cause a subsequent
> > > > mmap read sees non-zeros beyond EOF.
> > > 
> > > Yes, it should skip the zeroing on disk. The page in the page cache
> > > over the unwritten extent will be zeroed on read.
> > > 
> > > The real question is this: where are the zeros in the page that fsx
> > > is complaining about?
> > 
> > The partial block that iomap_truncate_page() skipped zeroing was latter
> > written back to disk, and the punch_hole before mmap read invalidated
> > the page cache so mmap read from disk and saw non-zeros.  This is a
> > hard-to-hit sequence, it took me almost 2000 iterations of generic/112
> > runs to hit one failure. I'll provide more details below.
> 
> Oh, ok, so they weren't close together operations but far apart in
> the trace. I usually indicate that by showing [....] lines between
> the operations if there's stuff that occurred between them.

They are not strictly one-by-one operations in the original fsxops log,
but are close enough. Then I tailored the ops into a minimal
step-by-step reproducer.

> 
> > > > simplified fsx operation sequence is like (assuming 4k block size
> > > > xfs):
> > > 
> > > What should have is:
> > > 
> > > >   fallocate 0x0 0x1000 0x0 keep_size
> > > 
> > > Unwritten, no data.
> > 
> > Yes, assuming 4k block size and 4k page size, unwritten extent with 1
> > block allocated, i_size stays 0.
> > 
> > > 
> > > >   write 0x0 0x1000 0x0
> > > 
> > > Unwritten, contains data in page cache.
> > 
> > Exactly, and in-core i_size is 4k now, but on-disk di_size is still 0.
> > 
> > > 
> > > >   truncate 0x0 0x800 0x1000
> > > 
> > > Unwritten, page contains data 0-0x800, zeros 0x800-0x1000
> > 
> > Yes, the page cache after truncate is correct. But before we zero the
> > page cache (in truncate_setsize()), we skipped zeroing the partial block
> > range 0x800-0x1000 and then triggered a writeback on range
> > [di_size, newsize], which was 0-0x800, and 0x800-0x1000 was written back
> > to disk too, which contained non-zeros.
> > 
> > (newsize(2k) > di_size(0) && oldsize(4k) != di_size(0)) was true.
> > 
> >         if (did_zeroing ||
> >             (newsize > ip->i_d.di_size && oldsize != ip->i_d.di_size)) {
> >                 error = filemap_write_and_wait_range(mapping, ip->i_d.di_size,
> >                                                         newsize - 1);
> >                 if (error)
> >                         return error;
> >         }
> 
> Ok, so we're writing data between di_size and newsize before
> removing the page cache beyond newsize. As such, the page of data
> that newsize lies in has not been zeroed by page cache invalidation
> before it is written.
> 
> Ok, that explains why the EOF page zeroing in xfs_do_writepage()
> isn't catching this - we haven't updated the inode size yet.
> 
> IOWs, the /three places/ where we normally catch this and zero the
> partial tail page beyond EOF are not doing it because:
> 
> 	1. iomap_truncate_page() sees unwritten and skips.
> 	2. truncate_setsize() has not yet been called so can't
> 	   zero the tail of the page.
> 	3. we haven't changed where EOF is yet, so
> 	   xfs_do_writepage() hasn't triggered it's "zero data
> 	   beyond EOF" case before it sends the page to disk. 
> 
> So, we have three options here:
> 
> 	1. iomap_truncate_page() always zeros
> 	2. update inode size before writeback after zeroing so the
> 	   xfs_do_writepage() zeros the tail page, or
> 	3. move truncate_setsize() to before writeback so the page
> 	   cache invalidation zeros the part page at the new EOF.

This really helps summarize the problem and solution, thanks!

Yeah, I started to realize moving the order of writeback vs setsize
around might be a fix when I was writing my last reply - explaining the
problem to someone else really helps understand the problem itself :)

> 
> I think 1) is a no-go for performance reasons. 2) is better, but
> I don't like the idea of separating the page cache invalidation
> from the size truncation. That leaves 3) - moving
> truncate_setsize().
> 
> I think I prefer 3) because it triggers multiple layers of defense
> against writing stale data past EOF, and from an crash behaviour
> point of view it makes no difference whether we truncate the page
> cache before or after triggering writeback because it will just make
> the result the same as if we were zeroing a written extent....

I'm testing an updated patch based on option 3 now, the finished tests
look good. I'll send the new version out for review soon. Thanks a lot
for the suggestion and review!

Eryu
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 17081c77ef86..cd4c612cc23c 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -824,6 +824,7 @@  xfs_setattr_size(
 {
 	struct xfs_mount	*mp = ip->i_mount;
 	struct inode		*inode = VFS_I(ip);
+	struct address_space	*mapping = inode->i_mapping;
 	xfs_off_t		oldsize, newsize;
 	struct xfs_trans	*tp;
 	int			error;
@@ -878,8 +879,22 @@  xfs_setattr_size(
 	if (newsize > oldsize) {
 		error = xfs_zero_eof(ip, newsize, oldsize, &did_zeroing);
 	} else {
-		error = iomap_truncate_page(inode, newsize, &did_zeroing,
-				&xfs_iomap_ops);
+		/*
+		 * Firstly write out the range that will be zeroed, otherwise
+		 * iomap_truncate_page() may skip zeroing this range wrongly.
+		 * Because the range could still be dirty after buffer writes
+		 * over unwritten extents, and the extent still stays in
+		 * UNWRITTEN state, because the conversion only happens at I/O
+		 * completion.
+		 */
+		unsigned int blocksize = i_blocksize(inode);
+		unsigned int off = newsize & (blocksize - 1);
+		if (off && mapping->nrpages)
+			error = filemap_write_and_wait_range(mapping, newsize,
+					newsize + (blocksize - off) - 1);
+		if (!error)
+			error = iomap_truncate_page(inode, newsize,
+					&did_zeroing, &xfs_iomap_ops);
 	}
 
 	if (error)
@@ -895,8 +910,8 @@  xfs_setattr_size(
 	 */
 	if (did_zeroing ||
 	    (newsize > ip->i_d.di_size && oldsize != ip->i_d.di_size)) {
-		error = filemap_write_and_wait_range(VFS_I(ip)->i_mapping,
-						      ip->i_d.di_size, newsize);
+		error = filemap_write_and_wait_range(mapping, ip->i_d.di_size,
+							newsize - 1);
 		if (error)
 			return error;
 	}