
[v4,29/30] btrfs: fix a subpage relocation data corruption

Message ID 20210531085106.259490-30-wqu@suse.com (mailing list archive)
State New, archived
Series btrfs: add data write support for subpage

Commit Message

Qu Wenruo May 31, 2021, 8:51 a.m. UTC
[BUG]
When using the following script, btrfs will report data corruption after
one data balance with subpage support:

  mkfs.btrfs -f -s 4k $dev
  mount $dev -o nospace_cache $mnt
  $fsstress -w -n 8 -s 1620948986 -d $mnt/ -v > /tmp/fsstress
  sync
  btrfs balance start -d $mnt
  btrfs scrub start -B $mnt

A similar problem can easily be observed in the btrfs/028 test case, where
there will be tons of balance failures with -EIO.

[CAUSE]
The above fsstress run results in the following data extent layout in the
extent tree:
        item 10 key (13631488 EXTENT_ITEM 98304) itemoff 15889 itemsize 82
                refs 2 gen 7 flags DATA
                extent data backref root FS_TREE objectid 259 offset 1339392 count 1
                extent data backref root FS_TREE objectid 259 offset 647168 count 1
        item 11 key (13631488 BLOCK_GROUP_ITEM 8388608) itemoff 15865 itemsize 24
                block group used 102400 chunk_objectid 256 flags DATA
        item 12 key (13733888 EXTENT_ITEM 4096) itemoff 15812 itemsize 53
                refs 1 gen 7 flags DATA
                extent data backref root FS_TREE objectid 259 offset 729088 count 1

Then when the data reloc inode is created, it will look like this:

	0	32K	64K	96K 100K	104K
	|<------ Extent A ----->|   |<- Ext B ->|

Then when we first try to relocate extent A, we set up the data reloc
inode with isize 96K, then read both page [0, 64K) and page [64K, 128K).

For the page at 64K, since isize is just 96K, we fill range [96K, 128K)
with 0 and set the page uptodate.

Then when we come to extent B, we update isize to 104K and try to read
page [64K, 128K) again.
We find the page is already uptodate, so we skip the read, but range
[96K, 128K) is still filled with 0, not the real data.

Then we write the data reloc inode back to disk with range [96K, 128K)
still filled with 0, corrupting the content of extent B.

The behavior is caused by the fact that we still do full-page reads in
the subpage case.

The bug won't really happen for regular sectorsize, as one page only
contains one sector.
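
To make the sequence easier to follow, here is a minimal sketch of the
pattern (this is not the actual btrfs read path; the helper name and the
elided read step are only for illustration):

/* Sketch: a full-page read pads past isize, later reads get skipped. */
static int demo_read_reloc_page(struct inode *inode, struct page *page)
{
	loff_t isize = i_size_read(inode);

	/*
	 * Second pass: the page is already uptodate, so the read is
	 * skipped and the zero-filled tail from the first pass is
	 * reused as "data".
	 */
	if (PageUptodate(page))
		return 0;

	/* ... read the on-disk data up to isize ... */

	/* First pass: pad [isize, PAGE_END] with zeros. */
	if (isize < page_offset(page) + PAGE_SIZE) {
		size_t zero_start = isize > page_offset(page) ?
				    offset_in_page(isize) : 0;

		zero_user_segment(page, zero_start, PAGE_SIZE);
	}

	/* The whole page, padding included, is now marked uptodate. */
	SetPageUptodate(page);
	return 0;
}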

[FIX]
This patch will fix the problem by invalidating range [isize, PAGE_END]
in prealloc_file_extent_cluster().

That way, if the above situation happens, when we preallocate the file
extent for extent B we will clear the uptodate bits for range [96K, 128K),
allowing the later relocate_one_page() call to re-read the needed range.

Reported-by: Ritesh Harjani <riteshh@linux.ibm.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/relocation.c | 38 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 38 insertions(+)

Comments

Qu Wenruo May 31, 2021, 10:26 a.m. UTC | #1
On 2021/5/31 下午4:51, Qu Wenruo wrote:
> [BUG]
> When using the following script, btrfs will report data corruption after
> one data balance with subpage support:
> 
>    mkfs.btrfs -f -s 4k $dev
>    mount $dev -o nospace_cache $mnt
>    $fsstress -w -n 8 -s 1620948986 -d $mnt/ -v > /tmp/fsstress
>    sync
>    btrfs balance start -d $mnt
>    btrfs scrub start -B $mnt
> 
> Similar problem can be easily observed in btrfs/028 test case, there
> will be tons of balance failure with -EIO.
> 
> [CAUSE]
> Above fsstress will result the following data extents layout in extent
> tree:
>          item 10 key (13631488 EXTENT_ITEM 98304) itemoff 15889 itemsize 82
>                  refs 2 gen 7 flags DATA
>                  extent data backref root FS_TREE objectid 259 offset 1339392 count 1
>                  extent data backref root FS_TREE objectid 259 offset 647168 count 1
>          item 11 key (13631488 BLOCK_GROUP_ITEM 8388608) itemoff 15865 itemsize 24
>                  block group used 102400 chunk_objectid 256 flags DATA
>          item 12 key (13733888 EXTENT_ITEM 4096) itemoff 15812 itemsize 53
>                  refs 1 gen 7 flags DATA
>                  extent data backref root FS_TREE objectid 259 offset 729088 count 1
> 
> Then when creating the data reloc inode, the data reloc inode will look
> like this:
> 
> 	0	32K	64K	96K 100K	104K
> 	|<------ Extent A ----->|   |<- Ext B ->|
> 
> Then when we first try to relocate extent A, we setup the data reloc
> inode with isize 96K, then read both page [0, 64K) and page [64K, 128K).
> 
> For page 64K, since the isize is just 96K, we fill range [96K, 128K)
> with 0 and set it uptodate.
> 
> Then when we come to extent B, we update isize to 104K, then try to read
> page [64K, 128K).
> Then we find the page is already uptodate, so we skip the read.
> But range [96K, 128K) is filled with 0, not the real data.
> 
> Then we writeback the data reloc inode to disk, with 0 filling range
> [96K, 128K), corrupting the content of extent B.
> 
> The behavior is caused by the fact that we still do full page read for
> subpage case.
> 
> The bug won't really happen for regular sectorsize, as one page only
> contains one sector.
> 
> [FIX]
> This patch will fix the problem by invalidating range [isize, PAGE_END]
> in prealloc_file_extent_cluster().

The fix is enough to address the data corruption, but it leaves a very
rare deadlock.

The above invalidation is in fact not safe, since we're not doing a
proper btrfs_invalidatepage().

The biggest problem here is that we can leave the page half dirty and
half out-of-date.

Then later btrfs_readpage() can trigger a deadlock like this:
btrfs_readpage()
|  We already have the page locked.
|
|- btrfs_lock_and_flush_ordered_range()
    |- btrfs_start_ordered_extent()
       |- extent_write_cache_pages()
          |- pagevec_lookup_range_tag()
          |- lock_page()
             We try to lock a page which is already locked by ourselves.

This can only happen in the subpage case, and normally it should never
happen for regular subpage operations, as we either read the full page
and then dirty part of it, or dirty the full page without reading it.

This shortcut in the relocation code breaks that assumption, and could
lead to the above deadlock.

Although I still don't like to call btrfs_invalidatepage(), here we can
work around the half-dirty, half-out-of-date problem by just writing the
page back to disk.

This will clear the page dirty bits and make the later clear_uptodate()
call safe.

I'll update the patchset in the GitHub repo first, and hope to merge it
along with other feedback in the next update.

Currently the test looks very promising, as the Power8 VM has survived 
over 100 loops without crashing.

Thanks,
Qu

> 
> So that if above example happens, when we preallocate the file extent
> for extent B, we will clear the uptodate bits for range [96K, 128K),
> allowing later relocate_one_page() to re-read the needed range.
> 
> Reported-by: Ritesh Harjani <riteshh@linux.ibm.com>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>   fs/btrfs/relocation.c | 38 ++++++++++++++++++++++++++++++++++++++
>   1 file changed, 38 insertions(+)
> 
> diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
> index cd50559c6d17..b50ee800993d 100644
> --- a/fs/btrfs/relocation.c
> +++ b/fs/btrfs/relocation.c
> @@ -2782,10 +2782,48 @@ static noinline_for_stack int prealloc_file_extent_cluster(
>   	u64 num_bytes;
>   	int nr;
>   	int ret = 0;
> +	u64 isize = i_size_read(&inode->vfs_inode);
>   	u64 prealloc_start = cluster->start - offset;
>   	u64 prealloc_end = cluster->end - offset;
>   	u64 cur_offset = prealloc_start;
>   
> +	/*
> +	 * For subpage case, previous isize may not be aligned to PAGE_SIZE.
> +	 * This means the range [isize, PAGE_END + 1) is filled with 0 by
> +	 * btrfs_do_readpage() call of previously relocated file cluster.
> +	 *
> +	 * If the current cluster starts in above range, btrfs_do_readpage()
> +	 * will skip the read, and relocate_one_page() will later writeback
> +	 * the padding 0 as new data, causing data corruption.
> +	 *
> +	 * Here we have to manually invalidate the range (isize, PAGE_END + 1).
> +	 */
> +	if (!IS_ALIGNED(isize, PAGE_SIZE)) {
> +		struct btrfs_fs_info *fs_info = inode->root->fs_info;
> +		const u32 sectorsize = fs_info->sectorsize;
> +		struct page *page;
> +
> +		ASSERT(sectorsize < PAGE_SIZE);
> +		ASSERT(IS_ALIGNED(isize, sectorsize));
> +
> +		page = find_lock_page(inode->vfs_inode.i_mapping,
> +				      isize >> PAGE_SHIFT);
> +		/*
> +		 * If page is freed we don't need to do anything then, as
> +		 * we will re-read the whole page anyway.
> +		 */
> +		if (page) {
> +			u64 page_end = page_offset(page) + PAGE_SIZE - 1;
> +
> +			clear_extent_bits(&inode->io_tree, isize, page_end,
> +					  EXTENT_UPTODATE);
> +			btrfs_subpage_clear_uptodate(fs_info, page, isize,
> +						     page_end + 1 - isize);
> +			unlock_page(page);
> +			put_page(page);
> +		}
> +	}
> +
>   	BUG_ON(cluster->start != cluster->boundary[0]);
>   	ret = btrfs_alloc_data_chunk_ondemand(inode,
>   					      prealloc_end + 1 - prealloc_start);
>
Qu Wenruo June 1, 2021, 1:07 a.m. UTC | #2
On 2021/5/31 下午6:26, Qu Wenruo wrote:
> 
> 
> On 2021/5/31 下午4:51, Qu Wenruo wrote:
>> [BUG]
>> When using the following script, btrfs will report data corruption after
>> one data balance with subpage support:
>>
>>    mkfs.btrfs -f -s 4k $dev
>>    mount $dev -o nospace_cache $mnt
>>    $fsstress -w -n 8 -s 1620948986 -d $mnt/ -v > /tmp/fsstress
>>    sync
>>    btrfs balance start -d $mnt
>>    btrfs scrub start -B $mnt
>>
>> Similar problem can be easily observed in btrfs/028 test case, there
>> will be tons of balance failure with -EIO.
>>
>> [CAUSE]
>> Above fsstress will result the following data extents layout in extent
>> tree:
>>          item 10 key (13631488 EXTENT_ITEM 98304) itemoff 15889 itemsize 82
>>                  refs 2 gen 7 flags DATA
>>                  extent data backref root FS_TREE objectid 259 offset 1339392 count 1
>>                  extent data backref root FS_TREE objectid 259 offset 647168 count 1
>>          item 11 key (13631488 BLOCK_GROUP_ITEM 8388608) itemoff 15865 itemsize 24
>>                  block group used 102400 chunk_objectid 256 flags DATA
>>          item 12 key (13733888 EXTENT_ITEM 4096) itemoff 15812 itemsize 53
>>                  refs 1 gen 7 flags DATA
>>                  extent data backref root FS_TREE objectid 259 offset 729088 count 1
>>
>> Then when creating the data reloc inode, the data reloc inode will look
>> like this:
>>
>>     0    32K    64K    96K 100K    104K
>>     |<------ Extent A ----->|   |<- Ext B ->|
>>
>> Then when we first try to relocate extent A, we setup the data reloc
>> inode with isize 96K, then read both page [0, 64K) and page [64K, 128K).
>>
>> For page 64K, since the isize is just 96K, we fill range [96K, 128K)
>> with 0 and set it uptodate.
>>
>> Then when we come to extent B, we update isize to 104K, then try to read
>> page [64K, 128K).
>> Then we find the page is already uptodate, so we skip the read.
>> But range [96K, 128K) is filled with 0, not the real data.
>>
>> Then we writeback the data reloc inode to disk, with 0 filling range
>> [96K, 128K), corrupting the content of extent B.
>>
>> The behavior is caused by the fact that we still do full page read for
>> subpage case.
>>
>> The bug won't really happen for regular sectorsize, as one page only
>> contains one sector.
>>
>> [FIX]
>> This patch will fix the problem by invalidating range [isize, PAGE_END]
>> in prealloc_file_extent_cluster().
> 
> The fix is enough to fix the data corruption, but it leaves a very rare 
> deadlock.
> 
> Above invalidating is in fact not safe, since we're not doing a proper 
> btrfs_invalidatepage().
> 
> The biggest problem here is, we can leave the page half dirty, and half 
> out-of-date.
> 
> Then later btrfs_readpage() can trigger a deadlock like this:
> btrfs_readpage()
> |  We already have the page locked.
> |
> |- btrfs_lock_and_flush_ordered_range()
>     |- btrfs_start_ordered_extent()
>        |- extent_write_cache_pages()
>           |- pagevec_lookup_range_tag()
>           |- lock_page()
>              We try to lock a page which is already locked by ourselves.
> 
> This can only happen for subpage case, and normally it should never 
> happen for regular subpage operations.
> As we either read the full page and then dirty part of it, or dirty
> the full page without reading it.
> 
> This shortcut in relocation code breaks the assumption, and could lead 
> to above deadlock.
> 
> Although I still don't like to call btrfs_invalidatepage(), here we can 
> workaround the half-dirty-half-out-of-date problem by just writing the 
> page back to disk.
> 
> This will clear the page dirty bits, and allow later clear_uptodate() 
> call to be safe.
> 
> I'll update the patchset in github repo first, and hope to merge it with 
> other feedback into next update.
> 
> Currently the test looks very promising, as the Power8 VM has survived 
> over 100 loops without crashing.

The extra diff, applied before invalidating the extent and page status,
will look like this:

+               /*
+                * Btrfs subpage can't handle page with DIRTY but without
+                * UPTODATE bit as it can lead to the following deadlock:
+                * btrfs_readpage()
+                * | Page already *locked*
+                * |- btrfs_lock_and_flush_ordered_range()
+                *    |- btrfs_start_ordered_extent()
+                *       |- extent_write_cache_pages()
+                *          |- lock_page()
+                *             We try to lock the page we already hold.
+                *
+                * Here we just writeback the whole data reloc inode, so that
+                * we will be ensured to have no dirty range in the page, and
+                * are safe to clear the uptodate bits.
+                *
+                * This shouldn't cause too much overhead, as we need to write
+                * the data back anyway.
+                */
+               ret = filemap_write_and_wait(mapping);
+               if (ret < 0)
+                       return ret;
+

One special reason for using filemap_write_and_wait() on the whole data
reloc inode is that we can't just write back one page: for the data reloc
inode we have to write back up to the cluster boundary to keep the extent
size.

So far it has survived the overnight tests, and the overhead should be
minimal, as we have to write back the whole data reloc inode anyway.
We only get here because either the previous cluster is not contiguous
with the current one, or we have reached the cluster size limit.

Either way, writing back the whole inode brings no extra overhead.

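For clarity, the combined flow inside prealloc_file_extent_cluster() would
then look roughly like this (just a sketch assembled from the hunk above
and the invalidation already in this patch, not the final submitted code):

	if (!IS_ALIGNED(isize, PAGE_SIZE)) {
		struct address_space *mapping = inode->vfs_inode.i_mapping;
		struct btrfs_fs_info *fs_info = inode->root->fs_info;
		struct page *page;

		/*
		 * Flush the whole data reloc inode first, so the page
		 * covering isize cannot be left DIRTY but not UPTODATE.
		 */
		ret = filemap_write_and_wait(mapping);
		if (ret < 0)
			return ret;

		page = find_lock_page(mapping, isize >> PAGE_SHIFT);
		if (page) {
			u64 page_end = page_offset(page) + PAGE_SIZE - 1;

			clear_extent_bits(&inode->io_tree, isize, page_end,
					  EXTENT_UPTODATE);
			btrfs_subpage_clear_uptodate(fs_info, page, isize,
						     page_end + 1 - isize);
			unlock_page(page);
			put_page(page);
		}
	}
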
Thanks,
Qu


> 
> Thanks,
> Qu
> 
>>
>> So that if above example happens, when we preallocate the file extent
>> for extent B, we will clear the uptodate bits for range [96K, 128K),
>> allowing later relocate_one_page() to re-read the needed range.
>>
>> Reported-by: Ritesh Harjani <riteshh@linux.ibm.com>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>>   fs/btrfs/relocation.c | 38 ++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 38 insertions(+)
>>
>> diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
>> index cd50559c6d17..b50ee800993d 100644
>> --- a/fs/btrfs/relocation.c
>> +++ b/fs/btrfs/relocation.c
>> @@ -2782,10 +2782,48 @@ static noinline_for_stack int prealloc_file_extent_cluster(
>>       u64 num_bytes;
>>       int nr;
>>       int ret = 0;
>> +    u64 isize = i_size_read(&inode->vfs_inode);
>>       u64 prealloc_start = cluster->start - offset;
>>       u64 prealloc_end = cluster->end - offset;
>>       u64 cur_offset = prealloc_start;
>> +    /*
>> +     * For subpage case, previous isize may not be aligned to PAGE_SIZE.
>> +     * This means the range [isize, PAGE_END + 1) is filled with 0 by
>> +     * btrfs_do_readpage() call of previously relocated file cluster.
>> +     *
>> +     * If the current cluster starts in above range, btrfs_do_readpage()
>> +     * will skip the read, and relocate_one_page() will later writeback
>> +     * the padding 0 as new data, causing data corruption.
>> +     *
>> +     * Here we have to manually invalidate the range (isize, PAGE_END + 1).
>> +     */
>> +    if (!IS_ALIGNED(isize, PAGE_SIZE)) {
>> +        struct btrfs_fs_info *fs_info = inode->root->fs_info;
>> +        const u32 sectorsize = fs_info->sectorsize;
>> +        struct page *page;
>> +
>> +        ASSERT(sectorsize < PAGE_SIZE);
>> +        ASSERT(IS_ALIGNED(isize, sectorsize));
>> +
>> +        page = find_lock_page(inode->vfs_inode.i_mapping,
>> +                      isize >> PAGE_SHIFT);
>> +        /*
>> +         * If page is freed we don't need to do anything then, as
>> +         * we will re-read the whole page anyway.
>> +         */
>> +        if (page) {
>> +            u64 page_end = page_offset(page) + PAGE_SIZE - 1;
>> +
>> +            clear_extent_bits(&inode->io_tree, isize, page_end,
>> +                      EXTENT_UPTODATE);
>> +            btrfs_subpage_clear_uptodate(fs_info, page, isize,
>> +                             page_end + 1 - isize);
>> +            unlock_page(page);
>> +            put_page(page);
>> +        }
>> +    }
>> +
>>       BUG_ON(cluster->start != cluster->boundary[0]);
>>       ret = btrfs_alloc_data_chunk_ondemand(inode,
>>                             prealloc_end + 1 - prealloc_start);
>>
>
David Sterba June 2, 2021, 5:10 p.m. UTC | #3
On Tue, Jun 01, 2021 at 09:07:46AM +0800, Qu Wenruo wrote:
> 
> 
> On 2021/5/31 下午6:26, Qu Wenruo wrote:
> > 
> > 
> > On 2021/5/31 下午4:51, Qu Wenruo wrote:
> >> [BUG]
> >> When using the following script, btrfs will report data corruption after
> >> one data balance with subpage support:
> >>
> >>    mkfs.btrfs -f -s 4k $dev
> >>    mount $dev -o nospace_cache $mnt
> >>    $fsstress -w -n 8 -s 1620948986 -d $mnt/ -v > /tmp/fsstress
> >>    sync
> >>    btrfs balance start -d $mnt
> >>    btrfs scrub start -B $mnt
> >>
> >> Similar problem can be easily observed in btrfs/028 test case, there
> >> will be tons of balance failure with -EIO.
> >>
> >> [CAUSE]
> >> Above fsstress will result the following data extents layout in extent
> >> tree:
> >>          item 10 key (13631488 EXTENT_ITEM 98304) itemoff 15889 itemsize 82
> >>                  refs 2 gen 7 flags DATA
> >>                  extent data backref root FS_TREE objectid 259 offset 1339392 count 1
> >>                  extent data backref root FS_TREE objectid 259 offset 647168 count 1
> >>          item 11 key (13631488 BLOCK_GROUP_ITEM 8388608) itemoff 15865 itemsize 24
> >>                  block group used 102400 chunk_objectid 256 flags DATA
> >>          item 12 key (13733888 EXTENT_ITEM 4096) itemoff 15812 itemsize 53
> >>                  refs 1 gen 7 flags DATA
> >>                  extent data backref root FS_TREE objectid 259 offset 729088 count 1
> >>
> >> Then when creating the data reloc inode, the data reloc inode will look
> >> like this:
> >>
> >>     0    32K    64K    96K 100K    104K
> >>     |<------ Extent A ----->|   |<- Ext B ->|
> >>
> >> Then when we first try to relocate extent A, we setup the data reloc
> >> inode with isize 96K, then read both page [0, 64K) and page [64K, 128K).
> >>
> >> For page 64K, since the isize is just 96K, we fill range [96K, 128K)
> >> with 0 and set it uptodate.
> >>
> >> Then when we come to extent B, we update isize to 104K, then try to read
> >> page [64K, 128K).
> >> Then we find the page is already uptodate, so we skip the read.
> >> But range [96K, 128K) is filled with 0, not the real data.
> >>
> >> Then we writeback the data reloc inode to disk, with 0 filling range
> >> [96K, 128K), corrupting the content of extent B.
> >>
> >> The behavior is caused by the fact that we still do full page read for
> >> subpage case.
> >>
> >> The bug won't really happen for regular sectorsize, as one page only
> >> contains one sector.
> >>
> >> [FIX]
> >> This patch will fix the problem by invalidating range [isize, PAGE_END]
> >> in prealloc_file_extent_cluster().
> > 
> > The fix is enough to fix the data corruption, but it leaves a very rare 
> > deadlock.
> > 
> > Above invalidating is in fact not safe, since we're not doing a proper 
> > btrfs_invalidatepage().
> > 
> > The biggest problem here is, we can leave the page half dirty, and half 
> > out-of-date.
> > 
> > Then later btrfs_readpage() can trigger a deadlock like this:
> > btrfs_readpage()
> > |  We already have the page locked.
> > |
> > |- btrfs_lock_and_flush_ordered_range()
> >     |- btrfs_start_ordered_extent()
> >        |- extent_write_cache_pages()
> >           |- pagevec_lookup_range_tag()
> >           |- lock_page()
> >              We try to lock a page which is already locked by ourselves.
> > 
> > This can only happen for subpage case, and normally it should never 
> > happen for regular subpage operations.
> > As we either read the full page and then dirty part of it, or dirty
> > the full page without reading it.
> > 
> > This shortcut in relocation code breaks the assumption, and could lead 
> > to above deadlock.
> > 
> > Although I still don't like to call btrfs_invalidatepage(), here we can 
> > workaround the half-dirty-half-out-of-date problem by just writing the 
> > page back to disk.
> > 
> > This will clear the page dirty bits, and allow later clear_uptodate() 
> > call to be safe.
> > 
> > I'll update the patchset in github repo first, and hope to merge it with 
> > other feedback into next update.
> > 
> > Currently the test looks very promising, as the Power8 VM has survived 
> > over 100 loops without crashing.
> 
> The extra diff will look like this before invalidating extent and page 
> status.
> 
> +               /*
> +                * Btrfs subpage can't handle page with DIRTY but without
> +                * UPTODATE bit as it can lead to the following deadlock:
> +                * btrfs_readpage()
> +                * | Page already *locked*
> +                * |- btrfs_lock_and_flush_ordered_range()
> +                *    |- btrfs_start_ordered_extent()
> +                *       |- extent_write_cache_pages()
> +                *          |- lock_page()
> +                *             We try to lock the page we already hold.
> +                *
> +                * Here we just writeback the whole data reloc inode, so that
> +                * we will be ensured to have no dirty range in the page, and
> +                * are safe to clear the uptodate bits.
> +                *
> +                * This shouldn't cause too much overhead, as we need to write
> +                * the data back anyway.
> +                */
> +               ret = filemap_write_and_wait(mapping);
> +               if (ret < 0)
> +                       return ret;
> +
> 
> One special reason for using filemap_write_and_wait() for the whole data 
> reloc inode is, we can't just write back one page, as for data reloc 
> inode we have to writeback the whole cluster boundary, to meet the 
> extent size.
> 
> So far it survives the full night tests, and the overhead should be minimal.
> As we have to writeback the whole data reloc inode anyway.
> And we are here because either previous cluster is not continuous with 
> current one, or we have reached the cluster size limit.
> 
> Either way, writing back the whole inode would bring no extra overhead.

I've updated the patch from GitHub. I've also renamed isize to i_size so
it matches struct inode->i_size.

Patch

diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index cd50559c6d17..b50ee800993d 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -2782,10 +2782,48 @@  static noinline_for_stack int prealloc_file_extent_cluster(
 	u64 num_bytes;
 	int nr;
 	int ret = 0;
+	u64 isize = i_size_read(&inode->vfs_inode);
 	u64 prealloc_start = cluster->start - offset;
 	u64 prealloc_end = cluster->end - offset;
 	u64 cur_offset = prealloc_start;
 
+	/*
+	 * For subpage case, previous isize may not be aligned to PAGE_SIZE.
+	 * This means the range [isize, PAGE_END + 1) is filled with 0 by
+	 * btrfs_do_readpage() call of previously relocated file cluster.
+	 *
+	 * If the current cluster starts in above range, btrfs_do_readpage()
+	 * will skip the read, and relocate_one_page() will later writeback
+	 * the padding 0 as new data, causing data corruption.
+	 *
+	 * Here we have to manually invalidate the range (isize, PAGE_END + 1).
+	 */
+	if (!IS_ALIGNED(isize, PAGE_SIZE)) {
+		struct btrfs_fs_info *fs_info = inode->root->fs_info;
+		const u32 sectorsize = fs_info->sectorsize;
+		struct page *page;
+
+		ASSERT(sectorsize < PAGE_SIZE);
+		ASSERT(IS_ALIGNED(isize, sectorsize));
+
+		page = find_lock_page(inode->vfs_inode.i_mapping,
+				      isize >> PAGE_SHIFT);
+		/*
+		 * If page is freed we don't need to do anything then, as
+		 * we will re-read the whole page anyway.
+		 */
+		if (page) {
+			u64 page_end = page_offset(page) + PAGE_SIZE - 1;
+
+			clear_extent_bits(&inode->io_tree, isize, page_end,
+					  EXTENT_UPTODATE);
+			btrfs_subpage_clear_uptodate(fs_info, page, isize,
+						     page_end + 1 - isize);
+			unlock_page(page);
+			put_page(page);
+		}
+	}
+
 	BUG_ON(cluster->start != cluster->boundary[0]);
 	ret = btrfs_alloc_data_chunk_ondemand(inode,
 					      prealloc_end + 1 - prealloc_start);