diff mbox

Btrfs: fix corruption after write/fsync failure + fsync + log recovery

Message ID 1408959780-4217-1-git-send-email-fdmanana@suse.com (mailing list archive)
State Accepted
Headers show

Commit Message

Filipe Manana Aug. 25, 2014, 9:43 a.m. UTC
While writing to a file, in inode.c:cow_file_range() (and same applies to
submit_compressed_extents()), after reserving an extent for the file data,
we create a new extent map for the written range and insert it into the
extent map cache. After that, we create an ordered operation, but if it
fails (due to a transient/temporary-ENOMEM), we return without dropping
that extent map, which points to a reserved extent that is freed when we
return. A subsequent incremental fsync (when the btrfs inode doesn't have
the flag BTRFS_INODE_NEEDS_FULL_SYNC) considers this extent map valid and
logs a file extent item based on that extent map, which points to a disk
extent that doesn't contain valid data - it was freed by us earlier, at this
point it might contain any random/garbage data.

Therefore, if we reach an error condition when cowing a file range after
we added the new extent map to the cache, drop it from the cache before
returning.

Some sequence of steps that lead to this:

    $ mkfs.btrfs -f /dev/sdd
    $ mount -o commit=9999 /dev/sdd /mnt
    $ cd /mnt

    $ xfs_io -f -c "pwrite -S 0x01 -b 4096 0 4096" -c "fsync" foo
    $ xfs_io -c "pwrite -S 0x02 -b 4096 4096 4096"
    $ sync

    $ od -t x1 foo
    0000000 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01
    *
    0010000 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02
    *
    0020000

    $ xfs_io -c "pwrite -S 0xa1 -b 4096 0 4096" foo

    # Now this write + fsync fail with -ENOMEM, which was returned by
    # btrfs_add_ordered_extent() in inode.c:cow_file_range().
    $ xfs_io -c "pwrite -S 0xff -b 4096 4096 4096" foo
    $ xfs_io -c "fsync" foo
    fsync: Cannot allocate memory

    # Now do a new write + fsync, which will succeed. Our previous
    # -ENOMEM was a transient/temporary error.
    $ xfs_io -c "pwrite -S 0xee -b 4096 16384 4096" foo
    $ xfs_io -c "fsync" foo

    # Our file content (in page cache) is now:
    $ od -t x1 foo
    0000000 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1
    *
    0010000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
    *
    0020000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    *
    0040000 ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee
    *
    0050000

    # Now reboot the machine, and mount the fs, so that fsync log replay
    # takes place.

    # The file content is now weird, in particular the first 8Kb, which
    # do not match our data before nor after the sync command above.
    $ od -t x1 foo
    0000000 ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee
    *
    0010000 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01
    *
    0020000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    *
    0040000 ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee
    *
    0050000

    # In fact these first 4Kb are a duplicate of the last 4kb block.
    # The last write got an extent map/file extent item that points to
    # the same disk extent that we got in the write+fsync that failed
    # with the -ENOMEM error. btrfs-debug-tree and btrfsck allow us to
    # verify that:

    $ btrfs-debug-tree /dev/sdd
    (...)
	item 6 key (257 EXTENT_DATA 0) itemoff 15819 itemsize 53
		extent data disk byte 12582912 nr 8192
		extent data offset 0 nr 8192 ram 8192
	item 7 key (257 EXTENT_DATA 8192) itemoff 15766 itemsize 53
		extent data disk byte 0 nr 0
		extent data offset 0 nr 8192 ram 8192
	item 8 key (257 EXTENT_DATA 16384) itemoff 15713 itemsize 53
		extent data disk byte 12582912 nr 4096
		extent data offset 0 nr 4096 ram 4096

    $ umount /dev/sdd
    $ btrfsck /dev/sdd
    Checking filesystem on /dev/sdd
    UUID: db5e60e1-050d-41e6-8c7f-3d742dea5d8f
    checking extents
    extent item 12582912 has multiple extent items
    ref mismatch on [12582912 4096] extent item 1, found 2
    Backref bytes do not match extent backref, bytenr=12582912, ref bytes=4096, backref bytes=8192
    backpointer mismatch on [12582912 4096]
    Errors found in extent allocation tree or chunk allocation
    checking free space cache
    checking fs roots
    root 5 inode 257 errors 1000, some csum missing
    found 131074 bytes used err is 1
    total csum bytes: 4
    total tree bytes: 131072
    total fs tree bytes: 32768
    total extent tree bytes: 16384
    btree space waste bytes: 123404
    file data blocks allocated: 274432
     referenced 274432
    Btrfs v3.14.1-96-gcc7fd5a-dirty

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/inode.c | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

Comments

Liu Bo Aug. 26, 2014, 3:32 a.m. UTC | #1
On Mon, Aug 25, 2014 at 10:43:00AM +0100, Filipe Manana wrote:
> While writing to a file, in inode.c:cow_file_range() (and same applies to
> submit_compressed_extents()), after reserving an extent for the file data,
> we create a new extent map for the written range and insert it into the
> extent map cache. After that, we create an ordered operation, but if it
> fails (due to a transient/temporary-ENOMEM), we return without dropping
> that extent map, which points to a reserved extent that is freed when we
> return. A subsequent incremental fsync (when the btrfs inode doesn't have
> the flag BTRFS_INODE_NEEDS_FULL_SYNC) considers this extent map valid and
> logs a file extent item based on that extent map, which points to a disk
> extent that doesn't contain valid data - it was freed by us earlier, at this
> point it might contain any random/garbage data.
> 
> Therefore, if we reach an error condition when cowing a file range after
> we added the new extent map to the cache, drop it from the cache before
> returning.
> 
> Some sequence of steps that lead to this:
> 
>     $ mkfs.btrfs -f /dev/sdd
>     $ mount -o commit=9999 /dev/sdd /mnt
>     $ cd /mnt
> 
>     $ xfs_io -f -c "pwrite -S 0x01 -b 4096 0 4096" -c "fsync" foo
>     $ xfs_io -c "pwrite -S 0x02 -b 4096 4096 4096"
>     $ sync
> 
>     $ od -t x1 foo
>     0000000 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01
>     *
>     0010000 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02
>     *
>     0020000
> 
>     $ xfs_io -c "pwrite -S 0xa1 -b 4096 0 4096" foo
> 
>     # Now this write + fsync fail with -ENOMEM, which was returned by
>     # btrfs_add_ordered_extent() in inode.c:cow_file_range().
>     $ xfs_io -c "pwrite -S 0xff -b 4096 4096 4096" foo
>     $ xfs_io -c "fsync" foo
>     fsync: Cannot allocate memory
> 
>     # Now do a new write + fsync, which will succeed. Our previous
>     # -ENOMEM was a transient/temporary error.
>     $ xfs_io -c "pwrite -S 0xee -b 4096 16384 4096" foo
>     $ xfs_io -c "fsync" foo
> 
>     # Our file content (in page cache) is now:
>     $ od -t x1 foo
>     0000000 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1
>     *
>     0010000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
>     *
>     0020000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>     *
>     0040000 ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee
>     *
>     0050000
> 
>     # Now reboot the machine, and mount the fs, so that fsync log replay
>     # takes place.
> 
>     # The file content is now weird, in particular the first 8Kb, which
>     # do not match our data before nor after the sync command above.
>     $ od -t x1 foo
>     0000000 ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee
>     *
>     0010000 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01
>     *
>     0020000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>     *
>     0040000 ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee
>     *
>     0050000
> 
>     # In fact these first 4Kb are a duplicate of the last 4kb block.
>     # The last write got an extent map/file extent item that points to
>     # the same disk extent that we got in the write+fsync that failed
>     # with the -ENOMEM error. btrfs-debug-tree and btrfsck allow us to
>     # verify that:
> 
>     $ btrfs-debug-tree /dev/sdd
>     (...)
> 	item 6 key (257 EXTENT_DATA 0) itemoff 15819 itemsize 53
> 		extent data disk byte 12582912 nr 8192
> 		extent data offset 0 nr 8192 ram 8192
> 	item 7 key (257 EXTENT_DATA 8192) itemoff 15766 itemsize 53
> 		extent data disk byte 0 nr 0
> 		extent data offset 0 nr 8192 ram 8192
> 	item 8 key (257 EXTENT_DATA 16384) itemoff 15713 itemsize 53
> 		extent data disk byte 12582912 nr 4096
> 		extent data offset 0 nr 4096 ram 4096
> 
>     $ umount /dev/sdd
>     $ btrfsck /dev/sdd
>     Checking filesystem on /dev/sdd
>     UUID: db5e60e1-050d-41e6-8c7f-3d742dea5d8f
>     checking extents
>     extent item 12582912 has multiple extent items
>     ref mismatch on [12582912 4096] extent item 1, found 2
>     Backref bytes do not match extent backref, bytenr=12582912, ref bytes=4096, backref bytes=8192
>     backpointer mismatch on [12582912 4096]
>     Errors found in extent allocation tree or chunk allocation
>     checking free space cache
>     checking fs roots
>     root 5 inode 257 errors 1000, some csum missing
>     found 131074 bytes used err is 1
>     total csum bytes: 4
>     total tree bytes: 131072
>     total fs tree bytes: 32768
>     total extent tree bytes: 16384
>     btree space waste bytes: 123404
>     file data blocks allocated: 274432
>      referenced 274432
>     Btrfs v3.14.1-96-gcc7fd5a-dirty
> 
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> ---
>  fs/btrfs/inode.c | 12 +++++++++---
>  1 file changed, 9 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index c678dea..16e8146 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -792,8 +792,12 @@ retry:
>  						ins.offset,
>  						BTRFS_ORDERED_COMPRESSED,
>  						async_extent->compress_type);
> -		if (ret)
> +		if (ret) {
> +			btrfs_drop_extent_cache(inode, async_extent->start,
> +						async_extent->start +
> +						async_extent->ram_size - 1, 0);
>  			goto out_free_reserve;
> +		}
>  
>  		/*
>  		 * clear dirty, set writeback and unlock the pages.

What about the 'if (ret) {}' after btrfs_add_ordered_extent_compress()?
It looks similar to this case.

thanks,
-liubo

> @@ -985,14 +989,14 @@ static noinline int cow_file_range(struct inode *inode,
>  		ret = btrfs_add_ordered_extent(inode, start, ins.objectid,
>  					       ram_size, cur_alloc_size, 0);
>  		if (ret)
> -			goto out_reserve;
> +			goto out_drop_extent_cache;
>  
>  		if (root->root_key.objectid ==
>  		    BTRFS_DATA_RELOC_TREE_OBJECTID) {
>  			ret = btrfs_reloc_clone_csums(inode, start,
>  						      cur_alloc_size);
>  			if (ret)
> -				goto out_reserve;
> +				goto out_drop_extent_cache;
>  		}
>  
>  		if (disk_num_bytes < cur_alloc_size)
> @@ -1020,6 +1024,8 @@ static noinline int cow_file_range(struct inode *inode,
>  out:
>  	return ret;
>  
> +out_drop_extent_cache:
> +	btrfs_drop_extent_cache(inode, start, start + ram_size - 1, 0);
>  out_reserve:
>  	btrfs_free_reserved_extent(root, ins.objectid, ins.offset, 1);
>  out_unlock:
> -- 
> 1.9.1
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Filipe Manana Aug. 26, 2014, 7:56 a.m. UTC | #2
On Tue, Aug 26, 2014 at 4:32 AM, Liu Bo <bo.li.liu@oracle.com> wrote:
> On Mon, Aug 25, 2014 at 10:43:00AM +0100, Filipe Manana wrote:
>> While writing to a file, in inode.c:cow_file_range() (and same applies to
>> submit_compressed_extents()), after reserving an extent for the file data,
>> we create a new extent map for the written range and insert it into the
>> extent map cache. After that, we create an ordered operation, but if it
>> fails (due to a transient/temporary-ENOMEM), we return without dropping
>> that extent map, which points to a reserved extent that is freed when we
>> return. A subsequent incremental fsync (when the btrfs inode doesn't have
>> the flag BTRFS_INODE_NEEDS_FULL_SYNC) considers this extent map valid and
>> logs a file extent item based on that extent map, which points to a disk
>> extent that doesn't contain valid data - it was freed by us earlier, at this
>> point it might contain any random/garbage data.
>>
>> Therefore, if we reach an error condition when cowing a file range after
>> we added the new extent map to the cache, drop it from the cache before
>> returning.
>>
>> Some sequence of steps that lead to this:
>>
>>     $ mkfs.btrfs -f /dev/sdd
>>     $ mount -o commit=9999 /dev/sdd /mnt
>>     $ cd /mnt
>>
>>     $ xfs_io -f -c "pwrite -S 0x01 -b 4096 0 4096" -c "fsync" foo
>>     $ xfs_io -c "pwrite -S 0x02 -b 4096 4096 4096"
>>     $ sync
>>
>>     $ od -t x1 foo
>>     0000000 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01
>>     *
>>     0010000 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02
>>     *
>>     0020000
>>
>>     $ xfs_io -c "pwrite -S 0xa1 -b 4096 0 4096" foo
>>
>>     # Now this write + fsync fail with -ENOMEM, which was returned by
>>     # btrfs_add_ordered_extent() in inode.c:cow_file_range().
>>     $ xfs_io -c "pwrite -S 0xff -b 4096 4096 4096" foo
>>     $ xfs_io -c "fsync" foo
>>     fsync: Cannot allocate memory
>>
>>     # Now do a new write + fsync, which will succeed. Our previous
>>     # -ENOMEM was a transient/temporary error.
>>     $ xfs_io -c "pwrite -S 0xee -b 4096 16384 4096" foo
>>     $ xfs_io -c "fsync" foo
>>
>>     # Our file content (in page cache) is now:
>>     $ od -t x1 foo
>>     0000000 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1
>>     *
>>     0010000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
>>     *
>>     0020000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>>     *
>>     0040000 ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee
>>     *
>>     0050000
>>
>>     # Now reboot the machine, and mount the fs, so that fsync log replay
>>     # takes place.
>>
>>     # The file content is now weird, in particular the first 8Kb, which
>>     # do not match our data before nor after the sync command above.
>>     $ od -t x1 foo
>>     0000000 ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee
>>     *
>>     0010000 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01
>>     *
>>     0020000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>>     *
>>     0040000 ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee
>>     *
>>     0050000
>>
>>     # In fact these first 4Kb are a duplicate of the last 4kb block.
>>     # The last write got an extent map/file extent item that points to
>>     # the same disk extent that we got in the write+fsync that failed
>>     # with the -ENOMEM error. btrfs-debug-tree and btrfsck allow us to
>>     # verify that:
>>
>>     $ btrfs-debug-tree /dev/sdd
>>     (...)
>>       item 6 key (257 EXTENT_DATA 0) itemoff 15819 itemsize 53
>>               extent data disk byte 12582912 nr 8192
>>               extent data offset 0 nr 8192 ram 8192
>>       item 7 key (257 EXTENT_DATA 8192) itemoff 15766 itemsize 53
>>               extent data disk byte 0 nr 0
>>               extent data offset 0 nr 8192 ram 8192
>>       item 8 key (257 EXTENT_DATA 16384) itemoff 15713 itemsize 53
>>               extent data disk byte 12582912 nr 4096
>>               extent data offset 0 nr 4096 ram 4096
>>
>>     $ umount /dev/sdd
>>     $ btrfsck /dev/sdd
>>     Checking filesystem on /dev/sdd
>>     UUID: db5e60e1-050d-41e6-8c7f-3d742dea5d8f
>>     checking extents
>>     extent item 12582912 has multiple extent items
>>     ref mismatch on [12582912 4096] extent item 1, found 2
>>     Backref bytes do not match extent backref, bytenr=12582912, ref bytes=4096, backref bytes=8192
>>     backpointer mismatch on [12582912 4096]
>>     Errors found in extent allocation tree or chunk allocation
>>     checking free space cache
>>     checking fs roots
>>     root 5 inode 257 errors 1000, some csum missing
>>     found 131074 bytes used err is 1
>>     total csum bytes: 4
>>     total tree bytes: 131072
>>     total fs tree bytes: 32768
>>     total extent tree bytes: 16384
>>     btree space waste bytes: 123404
>>     file data blocks allocated: 274432
>>      referenced 274432
>>     Btrfs v3.14.1-96-gcc7fd5a-dirty
>>
>> Signed-off-by: Filipe Manana <fdmanana@suse.com>
>> ---
>>  fs/btrfs/inode.c | 12 +++++++++---
>>  1 file changed, 9 insertions(+), 3 deletions(-)
>>
>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
>> index c678dea..16e8146 100644
>> --- a/fs/btrfs/inode.c
>> +++ b/fs/btrfs/inode.c
>> @@ -792,8 +792,12 @@ retry:
>>                                               ins.offset,
>>                                               BTRFS_ORDERED_COMPRESSED,
>>                                               async_extent->compress_type);
>> -             if (ret)
>> +             if (ret) {
>> +                     btrfs_drop_extent_cache(inode, async_extent->start,
>> +                                             async_extent->start +
>> +                                             async_extent->ram_size - 1, 0);
>>                       goto out_free_reserve;
>> +             }
>>
>>               /*
>>                * clear dirty, set writeback and unlock the pages.
>
> What about the 'if (ret) {}' after btrfs_add_ordered_extent_compress()?
> It looks similar to this case.

Similar but different, and not a problem.

The problem is returning after adding the extent map to the modified
list and before creating an ordered operation.
For that case you mention, since the ordered operation was created, we
return without removing the extent map and without returning the
reserved space too - that's because removing the em and returning the
space is done by btrfs_finish_ordered_io(), for which the fsync was to
wait for (again, because the ordered operation exists).

thanks

>
> thanks,
> -liubo
>
>> @@ -985,14 +989,14 @@ static noinline int cow_file_range(struct inode *inode,
>>               ret = btrfs_add_ordered_extent(inode, start, ins.objectid,
>>                                              ram_size, cur_alloc_size, 0);
>>               if (ret)
>> -                     goto out_reserve;
>> +                     goto out_drop_extent_cache;
>>
>>               if (root->root_key.objectid ==
>>                   BTRFS_DATA_RELOC_TREE_OBJECTID) {
>>                       ret = btrfs_reloc_clone_csums(inode, start,
>>                                                     cur_alloc_size);
>>                       if (ret)
>> -                             goto out_reserve;
>> +                             goto out_drop_extent_cache;
>>               }
>>
>>               if (disk_num_bytes < cur_alloc_size)
>> @@ -1020,6 +1024,8 @@ static noinline int cow_file_range(struct inode *inode,
>>  out:
>>       return ret;
>>
>> +out_drop_extent_cache:
>> +     btrfs_drop_extent_cache(inode, start, start + ram_size - 1, 0);
>>  out_reserve:
>>       btrfs_free_reserved_extent(root, ins.objectid, ins.offset, 1);
>>  out_unlock:
>> --
>> 1.9.1
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
Liu Bo Aug. 26, 2014, 8:28 a.m. UTC | #3
On Tue, Aug 26, 2014 at 08:56:18AM +0100, Filipe David Manana wrote:
> On Tue, Aug 26, 2014 at 4:32 AM, Liu Bo <bo.li.liu@oracle.com> wrote:
> > On Mon, Aug 25, 2014 at 10:43:00AM +0100, Filipe Manana wrote:
> >> While writing to a file, in inode.c:cow_file_range() (and same applies to
> >> submit_compressed_extents()), after reserving an extent for the file data,
> >> we create a new extent map for the written range and insert it into the
> >> extent map cache. After that, we create an ordered operation, but if it
> >> fails (due to a transient/temporary-ENOMEM), we return without dropping
> >> that extent map, which points to a reserved extent that is freed when we
> >> return. A subsequent incremental fsync (when the btrfs inode doesn't have
> >> the flag BTRFS_INODE_NEEDS_FULL_SYNC) considers this extent map valid and
> >> logs a file extent item based on that extent map, which points to a disk
> >> extent that doesn't contain valid data - it was freed by us earlier, at this
> >> point it might contain any random/garbage data.
> >>
> >> Therefore, if we reach an error condition when cowing a file range after
> >> we added the new extent map to the cache, drop it from the cache before
> >> returning.
> >>
> >> Some sequence of steps that lead to this:
> >>
> >>     $ mkfs.btrfs -f /dev/sdd
> >>     $ mount -o commit=9999 /dev/sdd /mnt
> >>     $ cd /mnt
> >>
> >>     $ xfs_io -f -c "pwrite -S 0x01 -b 4096 0 4096" -c "fsync" foo
> >>     $ xfs_io -c "pwrite -S 0x02 -b 4096 4096 4096"
> >>     $ sync
> >>
> >>     $ od -t x1 foo
> >>     0000000 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01
> >>     *
> >>     0010000 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02
> >>     *
> >>     0020000
> >>
> >>     $ xfs_io -c "pwrite -S 0xa1 -b 4096 0 4096" foo
> >>
> >>     # Now this write + fsync fail with -ENOMEM, which was returned by
> >>     # btrfs_add_ordered_extent() in inode.c:cow_file_range().
> >>     $ xfs_io -c "pwrite -S 0xff -b 4096 4096 4096" foo
> >>     $ xfs_io -c "fsync" foo
> >>     fsync: Cannot allocate memory
> >>
> >>     # Now do a new write + fsync, which will succeed. Our previous
> >>     # -ENOMEM was a transient/temporary error.
> >>     $ xfs_io -c "pwrite -S 0xee -b 4096 16384 4096" foo
> >>     $ xfs_io -c "fsync" foo
> >>
> >>     # Our file content (in page cache) is now:
> >>     $ od -t x1 foo
> >>     0000000 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1
> >>     *
> >>     0010000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
> >>     *
> >>     0020000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> >>     *
> >>     0040000 ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee
> >>     *
> >>     0050000
> >>
> >>     # Now reboot the machine, and mount the fs, so that fsync log replay
> >>     # takes place.
> >>
> >>     # The file content is now weird, in particular the first 8Kb, which
> >>     # do not match our data before nor after the sync command above.
> >>     $ od -t x1 foo
> >>     0000000 ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee
> >>     *
> >>     0010000 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01
> >>     *
> >>     0020000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> >>     *
> >>     0040000 ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee
> >>     *
> >>     0050000
> >>
> >>     # In fact these first 4Kb are a duplicate of the last 4kb block.
> >>     # The last write got an extent map/file extent item that points to
> >>     # the same disk extent that we got in the write+fsync that failed
> >>     # with the -ENOMEM error. btrfs-debug-tree and btrfsck allow us to
> >>     # verify that:
> >>
> >>     $ btrfs-debug-tree /dev/sdd
> >>     (...)
> >>       item 6 key (257 EXTENT_DATA 0) itemoff 15819 itemsize 53
> >>               extent data disk byte 12582912 nr 8192
> >>               extent data offset 0 nr 8192 ram 8192
> >>       item 7 key (257 EXTENT_DATA 8192) itemoff 15766 itemsize 53
> >>               extent data disk byte 0 nr 0
> >>               extent data offset 0 nr 8192 ram 8192
> >>       item 8 key (257 EXTENT_DATA 16384) itemoff 15713 itemsize 53
> >>               extent data disk byte 12582912 nr 4096
> >>               extent data offset 0 nr 4096 ram 4096
> >>
> >>     $ umount /dev/sdd
> >>     $ btrfsck /dev/sdd
> >>     Checking filesystem on /dev/sdd
> >>     UUID: db5e60e1-050d-41e6-8c7f-3d742dea5d8f
> >>     checking extents
> >>     extent item 12582912 has multiple extent items
> >>     ref mismatch on [12582912 4096] extent item 1, found 2
> >>     Backref bytes do not match extent backref, bytenr=12582912, ref bytes=4096, backref bytes=8192
> >>     backpointer mismatch on [12582912 4096]
> >>     Errors found in extent allocation tree or chunk allocation
> >>     checking free space cache
> >>     checking fs roots
> >>     root 5 inode 257 errors 1000, some csum missing
> >>     found 131074 bytes used err is 1
> >>     total csum bytes: 4
> >>     total tree bytes: 131072
> >>     total fs tree bytes: 32768
> >>     total extent tree bytes: 16384
> >>     btree space waste bytes: 123404
> >>     file data blocks allocated: 274432
> >>      referenced 274432
> >>     Btrfs v3.14.1-96-gcc7fd5a-dirty
> >>
> >> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> >> ---
> >>  fs/btrfs/inode.c | 12 +++++++++---
> >>  1 file changed, 9 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> >> index c678dea..16e8146 100644
> >> --- a/fs/btrfs/inode.c
> >> +++ b/fs/btrfs/inode.c
> >> @@ -792,8 +792,12 @@ retry:
> >>                                               ins.offset,
> >>                                               BTRFS_ORDERED_COMPRESSED,
> >>                                               async_extent->compress_type);
> >> -             if (ret)
> >> +             if (ret) {
> >> +                     btrfs_drop_extent_cache(inode, async_extent->start,
> >> +                                             async_extent->start +
> >> +                                             async_extent->ram_size - 1, 0);
> >>                       goto out_free_reserve;
> >> +             }

My bad, I made a mistake, what I want is just sitting here ;)

> >>
> >>               /*
> >>                * clear dirty, set writeback and unlock the pages.
> >
> > What about the 'if (ret) {}' after btrfs_add_ordered_extent_compress()?
> > It looks similar to this case.
> 
> Similar but different, and not a problem.
> 
> The problem is returning after adding the extent map to the modified
> list and before creating an ordered operation.
> For that case you mention, since the ordered operation was created, we
> return without removing the extent map and without returning the
> reserved space too - that's because removing the em and returning the
> space is done by btrfs_finish_ordered_io(), for which the fsync was to
> wait for (again, because the ordered operation exists).

Thanks for the explanation.

-liubo

> 
> thanks
> 
> >
> > thanks,
> > -liubo
> >
> >> @@ -985,14 +989,14 @@ static noinline int cow_file_range(struct inode *inode,
> >>               ret = btrfs_add_ordered_extent(inode, start, ins.objectid,
> >>                                              ram_size, cur_alloc_size, 0);
> >>               if (ret)
> >> -                     goto out_reserve;
> >> +                     goto out_drop_extent_cache;
> >>
> >>               if (root->root_key.objectid ==
> >>                   BTRFS_DATA_RELOC_TREE_OBJECTID) {
> >>                       ret = btrfs_reloc_clone_csums(inode, start,
> >>                                                     cur_alloc_size);
> >>                       if (ret)
> >> -                             goto out_reserve;
> >> +                             goto out_drop_extent_cache;
> >>               }
> >>
> >>               if (disk_num_bytes < cur_alloc_size)
> >> @@ -1020,6 +1024,8 @@ static noinline int cow_file_range(struct inode *inode,
> >>  out:
> >>       return ret;
> >>
> >> +out_drop_extent_cache:
> >> +     btrfs_drop_extent_cache(inode, start, start + ram_size - 1, 0);
> >>  out_reserve:
> >>       btrfs_free_reserved_extent(root, ins.objectid, ins.offset, 1);
> >>  out_unlock:
> >> --
> >> 1.9.1
> >>
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> >> the body of a message to majordomo@vger.kernel.org
> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 
> 
> -- 
> Filipe David Manana,
> 
> "Reasonable men adapt themselves to the world.
>  Unreasonable men adapt the world to themselves.
>  That's why all progress depends on unreasonable men."
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index c678dea..16e8146 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -792,8 +792,12 @@  retry:
 						ins.offset,
 						BTRFS_ORDERED_COMPRESSED,
 						async_extent->compress_type);
-		if (ret)
+		if (ret) {
+			btrfs_drop_extent_cache(inode, async_extent->start,
+						async_extent->start +
+						async_extent->ram_size - 1, 0);
 			goto out_free_reserve;
+		}
 
 		/*
 		 * clear dirty, set writeback and unlock the pages.
@@ -985,14 +989,14 @@  static noinline int cow_file_range(struct inode *inode,
 		ret = btrfs_add_ordered_extent(inode, start, ins.objectid,
 					       ram_size, cur_alloc_size, 0);
 		if (ret)
-			goto out_reserve;
+			goto out_drop_extent_cache;
 
 		if (root->root_key.objectid ==
 		    BTRFS_DATA_RELOC_TREE_OBJECTID) {
 			ret = btrfs_reloc_clone_csums(inode, start,
 						      cur_alloc_size);
 			if (ret)
-				goto out_reserve;
+				goto out_drop_extent_cache;
 		}
 
 		if (disk_num_bytes < cur_alloc_size)
@@ -1020,6 +1024,8 @@  static noinline int cow_file_range(struct inode *inode,
 out:
 	return ret;
 
+out_drop_extent_cache:
+	btrfs_drop_extent_cache(inode, start, start + ram_size - 1, 0);
 out_reserve:
 	btrfs_free_reserved_extent(root, ins.objectid, ins.offset, 1);
 out_unlock: