Message ID | 20191023135727.64358-3-wqu@suse.com (mailing list archive) |
---|---|
State | New, archived |
Series | btrfs: trim: Fix a bug certain range may not be trimmed properly |
On 23.10.19 at 16:57, Qu Wenruo wrote:
> [BUG]
> When deleting large files (which cross a block group boundary) with the discard
> mount option, we find that some btrfs_discard_extent() calls only trim part
> of their range, not the whole range:
>
>   btrfs_discard_extent: type=0x1 start=19626196992 len=2144530432 trimmed=1073741824 ratio=50%
>
> type: bbio->map_type, in the above case, it's SINGLE DATA.
> start: Logical address of this trim
> len: Logical length of this trim
> trimmed: Physically trimmed bytes
> ratio: trimmed / len
>
> This leaves some unused space undiscarded.
>
> [CAUSE]
> When the discard mount option is specified, after a transaction is fully
> committed (super block written to disk), we begin to clean up pinned
> extents in the following call chain:
>
> btrfs_commit_transaction()
> |- write_all_supers()

You can remove write_all_supers

> |- btrfs_finish_extent_commit()
>    |- find_first_extent_bit(unpin, 0, &start, &end, EXTENT_DIRTY);
>    |- btrfs_discard_extent()
>
> However, pinned extents are recorded in an extent_io_tree, which can
> merge adjacent extent states.
>
> When a large file gets deleted and it has adjacent file extents across a
> block group boundary, we will get a large merged range.

This is wrong: it will only get merged if the extent spans contiguous bg
boundaries (this is very important!)

>
> Then when we pass the large range into btrfs_discard_extent(),
> btrfs_discard_extent() will just trim the first part, without trimming
> the remaining part.

Here is what my testing shows:

    mkfs.btrfs -f /dev/vdc

    mount -onodatasum,nospace_cache /dev/vdc /media/scratch/
    xfs_io -f -c "pwrite 0 800m" /media/scratch/file1 && sync
    xfs_io -f -c "pwrite 0 300m" /media/scratch/file2 && sync
    umount /media/scratch

    mount -odiscard /dev/vdc /media/scratch
    rm -f /media/scratch/file2 && sync
    trace-cmd show

    umount /media/scratch

The output I get in trace-cmd is:

    sync-1014 [001] .... 534.272310: btrfs_finish_extent_commit: Discarding 1943011328-2077229055 (len: 134217728)
    sync-1014 [001] .... 534.272315: btrfs_discard_extent: Requested to discard: 134217728 but discarded: 134217728

    sync-1014 [001] .... 534.272325: btrfs_finish_extent_commit: Discarding 2177892352-2358247423 (len: 180355072)
    sync-1014 [001] .... 534.272330: btrfs_discard_extent: Requested to discard: 180355072 but discarded: 180355072

The extents of this file look like this in the extent tree prior to the trim:

    item 18 key (1943011328 EXTENT_ITEM 134217728) itemoff 15523 itemsize 53
        refs 1 gen 7 flags DATA
        extent data backref root FS_TREE objectid 258 offset 0 count 1
    item 19 key (2177892352 EXTENT_ITEM 134217728) itemoff 15470 itemsize 53
        refs 1 gen 7 flags DATA
        extent data backref root FS_TREE objectid 258 offset 134217728 count 1
    item 20 key (2177892352 BLOCK_GROUP_ITEM 1073741824) itemoff 15446 itemsize 24
        block group used 180355072 chunk_objectid 256 flags DATA
    item 21 key (2312110080 EXTENT_ITEM 46137344) itemoff 15393 itemsize 53
        refs 1 gen 7 flags DATA
        extent data backref root FS_TREE objectid 258 offset 268435456 count 1

So we have 3 extents, 1 of which is in bg1 and the other 2 in bg2. The 2 extents
in bg2 are merged, but since the 2nd bg is not contiguous with the first, they
are not merged with the extent in bg1.

This is why the bgs must be contiguous.

If I modify my test case with slightly different write offsets, such that bg1
is indeed filled and the next extent gets allocated in bg2, which is adjacent,
then the bug is reproduced:

    mkfs.btrfs -f /dev/vdc

    mount -onodatasum,nospace_cache /dev/vdc /media/scratch/
    xfs_io -f -c "pwrite 0 800m" /media/scratch/file1 && sync
    xfs_io -f -c "pwrite 0 224m" /media/scratch/file2 && sync
    xfs_io -f -c "pwrite 224m 76m" /media/scratch/file2 && sync
    umount /media/scratch

    mount -odiscard /dev/vdc /media/scratch
    rm -f /media/scratch/file2 && sync
    trace-cmd show

    umount /media/scratch

The 3 extents being created and subsequently deleted are:

    sync-799 [000] .... 313.938048: btrfs_update_block_group: Pinning 1943011328-2077229055
    sync-799 [000] .... 313.938073: btrfs_update_block_group: Pinning 2077229056-2177892351 <- BG1 ends
    sync-799 [000] .... 313.938116: btrfs_update_block_group: Pinning 2177892352-2257584127 <- BG2 begins

But we only get 1 discard request:

    sync-798 [003] .... 154.077897: btrfs_finish_extent_commit: Discarding 1943011328-2257584127 (len: 314572800) <- this is the request passed to btrfs_discard_extent
    sync-798 [003] .... 154.077901: btrfs_discard_extent: Discarding 234881024 length for bytenr: 1943011328 <- this is the actual range being discarded inside the for loop.

So the bug is genuine. I will test whether your patch fixes it and report back.

> Furthermore, this bug is not reliably observed, because if the whole
> block group is empty, there will be another trim for that block group.

Not only because of this, mainly because of the contiguousness requirement.

>
> So the most obvious way to find this missing trim is to delete large
> extents at a block group boundary without emptying the involved block groups.
>
> [FIX]
> - Allow __btrfs_map_block_for_discard() to modify the @length parameter
>   btrfs_map_block() uses its @length parameter to notify the caller how
>   many bytes are mapped in the current call.
>   With __btrfs_map_block_for_discard() also modifying @length,
>   btrfs_discard_extent() now understands when to do an extra trim.
>
> - Call btrfs_map_block() in a loop until we hit the range end
>   Since we now know how many bytes are mapped each time, we can iterate
>   through each block group boundary and issue the correct trim for each
>   range.
>
> Signed-off-by: Qu Wenruo <wqu@suse.com>

<snip>
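The numbers in the second trace can be checked by hand. Below is a minimal userspace sketch, assuming (as the "BG1 ends" marker in the pinning trace suggests) that the next block group starts at byte 2177892352. The helpers `bytes_before_boundary()` and `bytes_after_boundary()` are hypothetical names used only for this illustration; they are not kernel functions.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper, not kernel code: number of bytes of the range
 * [start, start + len) that lie before bg_end, where bg_end is the first
 * byte of the NEXT block group.
 */
static uint64_t bytes_before_boundary(uint64_t start, uint64_t len,
				      uint64_t bg_end)
{
	uint64_t end = start + len;

	if (start >= bg_end)
		return 0;
	return (end < bg_end ? end : bg_end) - start;
}

/* Bytes of the same range that lie at or after the boundary. */
static uint64_t bytes_after_boundary(uint64_t start, uint64_t len,
				     uint64_t bg_end)
{
	return len - bytes_before_boundary(start, len, bg_end);
}
```

With start=1943011328, len=314572800 and bg_end=2177892352, the first helper yields 234881024 bytes (224 MiB), which is the length reported by btrfs_discard_extent in the trace, and the second yields 79691776 bytes (76 MiB), the part of the request that was never discarded.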
On 2019/10/23 11:41 PM, Nikolay Borisov wrote:
>
>
> On 23.10.19 at 16:57, Qu Wenruo wrote:
>> [BUG]
>> When deleting large files (which cross a block group boundary) with the discard
>> mount option, we find that some btrfs_discard_extent() calls only trim part
>> of their range, not the whole range:
>>
>>   btrfs_discard_extent: type=0x1 start=19626196992 len=2144530432 trimmed=1073741824 ratio=50%
>>
>> type: bbio->map_type, in the above case, it's SINGLE DATA.
>> start: Logical address of this trim
>> len: Logical length of this trim
>> trimmed: Physically trimmed bytes
>> ratio: trimmed / len
>>
>> This leaves some unused space undiscarded.
>>
>> [CAUSE]
>> When the discard mount option is specified, after a transaction is fully
>> committed (super block written to disk), we begin to clean up pinned
>> extents in the following call chain:
>>
>> btrfs_commit_transaction()
>> |- write_all_supers()
>
> You can remove write_all_supers
>
>> |- btrfs_finish_extent_commit()
>>    |- find_first_extent_bit(unpin, 0, &start, &end, EXTENT_DIRTY);
>>    |- btrfs_discard_extent()
>>
>> However, pinned extents are recorded in an extent_io_tree, which can
>> merge adjacent extent states.
>>
>> When a large file gets deleted and it has adjacent file extents across a
>> block group boundary, we will get a large merged range.
>
> This is wrong: it will only get merged if the extent spans contiguous bg
> boundaries (this is very important!)

Yep, I skipped some details as I thought it was too obvious, but indeed it
needs extra wording.

>
>>
>> Then when we pass the large range into btrfs_discard_extent(),
>> btrfs_discard_extent() will just trim the first part, without trimming
>> the remaining part.
>
> Here is what my testing shows:
>
>     mkfs.btrfs -f /dev/vdc
>
>     mount -onodatasum,nospace_cache /dev/vdc /media/scratch/

nospace_cache is important, as the v1 space cache exists as a regular file,
which could take up data space and screw up the extent layout.
(And in the original report, the v1 cache is the cause of the randomness in
reproducibility.)

>     xfs_io -f -c "pwrite 0 800m" /media/scratch/file1 && sync
>     xfs_io -f -c "pwrite 0 300m" /media/scratch/file2 && sync
>     umount /media/scratch
>
>     mount -odiscard /dev/vdc /media/scratch
>     rm -f /media/scratch/file2 && sync
>     trace-cmd show
>
>     umount /media/scratch
>
> The output I get in trace-cmd is:
>
>     sync-1014 [001] .... 534.272310: btrfs_finish_extent_commit: Discarding 1943011328-2077229055 (len: 134217728)
>     sync-1014 [001] .... 534.272315: btrfs_discard_extent: Requested to discard: 134217728 but discarded: 134217728
>
>     sync-1014 [001] .... 534.272325: btrfs_finish_extent_commit: Discarding 2177892352-2358247423 (len: 180355072)
>     sync-1014 [001] .... 534.272330: btrfs_discard_extent: Requested to discard: 180355072 but discarded: 180355072
>
> The extents of this file look like this in the extent tree prior to the trim:
>
>     item 18 key (1943011328 EXTENT_ITEM 134217728) itemoff 15523 itemsize 53
>         refs 1 gen 7 flags DATA
>         extent data backref root FS_TREE objectid 258 offset 0 count 1
>     item 19 key (2177892352 EXTENT_ITEM 134217728) itemoff 15470 itemsize 53
>         refs 1 gen 7 flags DATA
>         extent data backref root FS_TREE objectid 258 offset 134217728 count 1
>     item 20 key (2177892352 BLOCK_GROUP_ITEM 1073741824) itemoff 15446 itemsize 24
>         block group used 180355072 chunk_objectid 256 flags DATA
>     item 21 key (2312110080 EXTENT_ITEM 46137344) itemoff 15393 itemsize 53
>         refs 1 gen 7 flags DATA
>         extent data backref root FS_TREE objectid 258 offset 268435456 count 1
>
> So we have 3 extents, 1 of which is in bg1 and the other 2 in bg2. The 2 extents
> in bg2 are merged, but since the 2nd bg is not contiguous with the first, they
> are not merged with the extent in bg1.
>
> This is why the bgs must be contiguous.
>
> If I modify my test case with slightly different write offsets, such that bg1
> is indeed filled and the next extent gets allocated in bg2, which is adjacent,
> then the bug is reproduced:
>
>     mkfs.btrfs -f /dev/vdc
>
>     mount -onodatasum,nospace_cache /dev/vdc /media/scratch/
>     xfs_io -f -c "pwrite 0 800m" /media/scratch/file1 && sync
>     xfs_io -f -c "pwrite 0 224m" /media/scratch/file2 && sync
>     xfs_io -f -c "pwrite 224m 76m" /media/scratch/file2 && sync
>     umount /media/scratch
>
>     mount -odiscard /dev/vdc /media/scratch
>     rm -f /media/scratch/file2 && sync
>     trace-cmd show
>
>     umount /media/scratch
>
> The 3 extents being created and subsequently deleted are:
>
>     sync-799 [000] .... 313.938048: btrfs_update_block_group: Pinning 1943011328-2077229055
>     sync-799 [000] .... 313.938073: btrfs_update_block_group: Pinning 2077229056-2177892351 <- BG1 ends
>     sync-799 [000] .... 313.938116: btrfs_update_block_group: Pinning 2177892352-2257584127 <- BG2 begins
>
> But we only get 1 discard request:
>
>     sync-798 [003] .... 154.077897: btrfs_finish_extent_commit: Discarding 1943011328-2257584127 (len: 314572800) <- this is the request passed to btrfs_discard_extent
>     sync-798 [003] .... 154.077901: btrfs_discard_extent: Discarding 234881024 length for bytenr: 1943011328 <- this is the actual range being discarded inside the for loop.
>
> So the bug is genuine. I will test whether your patch fixes it and report back.
>
>> Furthermore, this bug is not reliably observed, because if the whole
>> block group is empty, there will be another trim for that block group.
>
> Not only because of this, mainly because of the contiguousness requirement.

The contiguousness requirement is in fact pretty easy to hit.

Just do a 20G write and you will find most bgs and file extents are
contiguous.

The problem is to craft the extent and bg layout in a reproducible way, and
to find a good way to detect the missing trim.

In my environment, I'm using a 20G write, so even if the space cache is
screwing me up, I still have 20 chances to have contiguous bgs and file
extents.

And for the trim result, a loopback device can be used to account for how
many bytes are really used.

I just need to polish them into a good fstests case.

Thanks,
Qu

>
>>
>> So the most obvious way to find this missing trim is to delete large
>> extents at a block group boundary without emptying the involved block groups.
>>
>> [FIX]
>> - Allow __btrfs_map_block_for_discard() to modify the @length parameter
>>   btrfs_map_block() uses its @length parameter to notify the caller how
>>   many bytes are mapped in the current call.
>>   With __btrfs_map_block_for_discard() also modifying @length,
>>   btrfs_discard_extent() now understands when to do an extra trim.
>>
>> - Call btrfs_map_block() in a loop until we hit the range end
>>   Since we now know how many bytes are mapped each time, we can iterate
>>   through each block group boundary and issue the correct trim for each
>>   range.
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>
> <snip>
>
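The contiguity point discussed above can be illustrated with a simplified model of range merging. This is only a sketch: `struct range` and `merge_adjacent()` are hypothetical names, and the code is not how extent_io_tree is implemented in the kernel. It only shows that byte-adjacent ranges coalesce regardless of which block group they belong to, while ranges separated by a gap do not.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a pinned range; both ends are inclusive. */
struct range {
	uint64_t start;
	uint64_t end;
};

/*
 * Merge byte-adjacent ranges in place and return the new count.
 * The input must be sorted by start and non-overlapping. Block group
 * boundaries play no role here, which is the point of the illustration.
 */
static size_t merge_adjacent(struct range *r, size_t n)
{
	size_t out = 0;

	if (n == 0)
		return 0;
	for (size_t i = 1; i < n; i++) {
		if (r[i].start == r[out].end + 1)
			r[out].end = r[i].end;	/* extend the current range */
		else
			r[++out] = r[i];	/* gap: start a new range */
	}
	return out + 1;
}
```

Fed with the three extents from the first test (1943011328-2077229055, 2177892352-2312110079, 2312110080-2358247423), the model produces two ranges, matching the two discard requests in that trace. Fed with the three pinned ranges from the second test, it produces the single range 1943011328-2257584127 that crosses the block group boundary.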
On 23.10.19 at 16:57, Qu Wenruo wrote:
> [BUG]
> When deleting large files (which cross a block group boundary) with the discard
> mount option, we find that some btrfs_discard_extent() calls only trim part
> of their range, not the whole range:
>
>   btrfs_discard_extent: type=0x1 start=19626196992 len=2144530432 trimmed=1073741824 ratio=50%
>
> type: bbio->map_type, in the above case, it's SINGLE DATA.
> start: Logical address of this trim
> len: Logical length of this trim
> trimmed: Physically trimmed bytes
> ratio: trimmed / len
>
> This leaves some unused space undiscarded.
>
> [CAUSE]
> When the discard mount option is specified, after a transaction is fully
> committed (super block written to disk), we begin to clean up pinned
> extents in the following call chain:
>
> btrfs_commit_transaction()
> |- write_all_supers()
> |- btrfs_finish_extent_commit()
>    |- find_first_extent_bit(unpin, 0, &start, &end, EXTENT_DIRTY);
>    |- btrfs_discard_extent()
>
> However, pinned extents are recorded in an extent_io_tree, which can
> merge adjacent extent states.
>
> When a large file gets deleted and it has adjacent file extents across a
> block group boundary, we will get a large merged range.
>
> Then when we pass the large range into btrfs_discard_extent(),
> btrfs_discard_extent() will just trim the first part, without trimming
> the remaining part.
>
> Furthermore, this bug is not reliably observed, because if the whole
> block group is empty, there will be another trim for that block group.
>
> So the most obvious way to find this missing trim is to delete large
> extents at a block group boundary without emptying the involved block groups.
>
> [FIX]
> - Allow __btrfs_map_block_for_discard() to modify the @length parameter
>   btrfs_map_block() uses its @length parameter to notify the caller how
>   many bytes are mapped in the current call.
>   With __btrfs_map_block_for_discard() also modifying @length,
>   btrfs_discard_extent() now understands when to do an extra trim.
>
> - Call btrfs_map_block() in a loop until we hit the range end
>   Since we now know how many bytes are mapped each time, we can iterate
>   through each block group boundary and issue the correct trim for each
>   range.
>
> Signed-off-by: Qu Wenruo <wqu@suse.com>

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 49cb26fa7c63..ff2838bd677d 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -1306,8 +1306,10 @@ static int btrfs_issue_discard(struct block_device *bdev, u64 start, u64 len,
 int btrfs_discard_extent(struct btrfs_fs_info *fs_info, u64 bytenr,
 			 u64 num_bytes, u64 *actual_bytes)
 {
-	int ret;
+	int ret = 0;
 	u64 discarded_bytes = 0;
+	u64 end = bytenr + num_bytes;
+	u64 cur = bytenr;
 	struct btrfs_bio *bbio = NULL;
 
 
@@ -1316,15 +1318,23 @@ int btrfs_discard_extent(struct btrfs_fs_info *fs_info, u64 bytenr,
 	 * associated to its stripes that don't go away while we are discarding.
 	 */
 	btrfs_bio_counter_inc_blocked(fs_info);
-	/* Tell the block device(s) that the sectors can be discarded */
-	ret = btrfs_map_block(fs_info, BTRFS_MAP_DISCARD, bytenr, &num_bytes,
-			      &bbio, 0);
-	/* Error condition is -ENOMEM */
-	if (!ret) {
-		struct btrfs_bio_stripe *stripe = bbio->stripes;
+	while (cur < end) {
+		struct btrfs_bio_stripe *stripe;
 		int i;
 
+		num_bytes = end - cur;
+		/* Tell the block device(s) that the sectors can be discarded */
+		ret = btrfs_map_block(fs_info, BTRFS_MAP_DISCARD, cur,
+				      &num_bytes, &bbio, 0);
+		/*
+		 * Error can be -ENOMEM, -ENOENT (no such chunk mapping) or
+		 * -EOPNOTSUPP. For any such error, @num_bytes is not updated,
+		 * thus we can't continue anyway.
+		 */
+		if (ret < 0)
+			goto out;
 
+		stripe = bbio->stripes;
 		for (i = 0; i < bbio->num_stripes; i++, stripe++) {
 			u64 bytes;
 			struct request_queue *req_q;
@@ -1341,10 +1351,19 @@ int btrfs_discard_extent(struct btrfs_fs_info *fs_info, u64 bytenr,
 						  stripe->physical,
 						  stripe->length,
 						  &bytes);
-			if (!ret)
+			if (!ret) {
 				discarded_bytes += bytes;
-			else if (ret != -EOPNOTSUPP)
-				break; /* Logic errors or -ENOMEM, or -EIO but I don't know how that could happen JDM */
+			} else if (ret != -EOPNOTSUPP) {
+				/*
+				 * Logic errors or -ENOMEM, or -EIO but I don't
+				 * know how that could happen JDM
+				 *
+				 * Ans since there are two loops, explicitly
+				 * goto out to avoid confusion.
+				 */
+				btrfs_put_bbio(bbio);
+				goto out;
+			}
 
 			/*
 			 * Just in case we get back EOPNOTSUPP for some reason,
@@ -1354,7 +1373,9 @@ int btrfs_discard_extent(struct btrfs_fs_info *fs_info, u64 bytenr,
 			ret = 0;
 		}
 		btrfs_put_bbio(bbio);
+		cur += num_bytes;
 	}
+out:
 	btrfs_bio_counter_dec(fs_info);
 
 	if (actual_bytes)
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index a6db11e821a5..f66bd0d03f44 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5578,12 +5578,13 @@ void btrfs_put_bbio(struct btrfs_bio *bbio)
  * replace.
  */
 static int __btrfs_map_block_for_discard(struct btrfs_fs_info *fs_info,
-					 u64 logical, u64 length,
+					 u64 logical, u64 *length_ret,
 					 struct btrfs_bio **bbio_ret)
 {
 	struct extent_map *em;
 	struct map_lookup *map;
 	struct btrfs_bio *bbio;
+	u64 length = *length_ret;
 	u64 offset;
 	u64 stripe_nr;
 	u64 stripe_nr_end;
@@ -5617,6 +5618,7 @@ static int __btrfs_map_block_for_discard(struct btrfs_fs_info *fs_info,
 
 	offset = logical - em->start;
 	length = min_t(u64, em->start + em->len - logical, length);
+	*length_ret = length;
 
 	stripe_len = map->stripe_len;
 	/*
@@ -6031,7 +6033,7 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info,
 
 	if (op == BTRFS_MAP_DISCARD)
 		return __btrfs_map_block_for_discard(fs_info, logical,
-						     *length, bbio_ret);
+						     length, bbio_ret);
 
 	ret = btrfs_get_io_geometry(fs_info, op, logical, *length, &geom);
 	if (ret < 0)
[BUG]
When deleting large files (which cross a block group boundary) with the discard
mount option, we find that some btrfs_discard_extent() calls only trim part
of their range, not the whole range:

  btrfs_discard_extent: type=0x1 start=19626196992 len=2144530432 trimmed=1073741824 ratio=50%

type: bbio->map_type, in the above case, it's SINGLE DATA.
start: Logical address of this trim
len: Logical length of this trim
trimmed: Physically trimmed bytes
ratio: trimmed / len

This leaves some unused space undiscarded.

[CAUSE]
When the discard mount option is specified, after a transaction is fully
committed (super block written to disk), we begin to clean up pinned
extents in the following call chain:

btrfs_commit_transaction()
|- write_all_supers()
|- btrfs_finish_extent_commit()
   |- find_first_extent_bit(unpin, 0, &start, &end, EXTENT_DIRTY);
   |- btrfs_discard_extent()

However, pinned extents are recorded in an extent_io_tree, which can
merge adjacent extent states.

When a large file gets deleted and it has adjacent file extents across a
block group boundary, we will get a large merged range.

Then when we pass the large range into btrfs_discard_extent(),
btrfs_discard_extent() will just trim the first part, without trimming
the remaining part.

Furthermore, this bug is not reliably observed, because if the whole
block group is empty, there will be another trim for that block group.

So the most obvious way to find this missing trim is to delete large
extents at a block group boundary without emptying the involved block groups.

[FIX]
- Allow __btrfs_map_block_for_discard() to modify the @length parameter
  btrfs_map_block() uses its @length parameter to notify the caller how
  many bytes are mapped in the current call.
  With __btrfs_map_block_for_discard() also modifying @length,
  btrfs_discard_extent() now understands when to do an extra trim.

- Call btrfs_map_block() in a loop until we hit the range end
  Since we now know how many bytes are mapped each time, we can iterate
  through each block group boundary and issue the correct trim for each
  range.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent-tree.c | 41 +++++++++++++++++++++++++++++++----------
 fs/btrfs/volumes.c     |  6 ++++--
 2 files changed, 35 insertions(+), 12 deletions(-)
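The two steps of the fix can be modelled in userspace. The following is a sketch under stated assumptions, not the kernel code: `struct chunk`, `map_for_discard()` and `discard_range()` are hypothetical stand-ins for the chunk mapping, __btrfs_map_block_for_discard() and btrfs_discard_extent(). The mapping helper clamps the length to the end of the chunk and reports it back through the pointer, and the caller loops until the whole range has been covered.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one chunk / block group in logical address space. */
struct chunk {
	uint64_t start;
	uint64_t len;
};

/*
 * Model of the mapping step: find the chunk containing @logical and clamp
 * *@length so that it does not cross the end of that chunk. On error,
 * *@length is left untouched, as the patch comment describes.
 */
static int map_for_discard(const struct chunk *c, size_t n,
			   uint64_t logical, uint64_t *length)
{
	for (size_t i = 0; i < n; i++) {
		uint64_t chunk_end = c[i].start + c[i].len;

		if (logical >= c[i].start && logical < chunk_end) {
			if (*length > chunk_end - logical)
				*length = chunk_end - logical;
			return 0;
		}
	}
	return -ENOENT;	/* no such chunk mapping */
}

/*
 * Model of the fixed caller: keep mapping and "discarding" until the end
 * of the requested range is reached, one chunk-bounded piece at a time.
 */
static int discard_range(const struct chunk *c, size_t n, uint64_t bytenr,
			 uint64_t num_bytes, uint64_t *discarded)
{
	uint64_t end = bytenr + num_bytes;
	uint64_t cur = bytenr;

	*discarded = 0;
	while (cur < end) {
		uint64_t len = end - cur;
		int ret = map_for_discard(c, n, cur, &len);

		if (ret < 0)
			return ret;
		*discarded += len;	/* stand-in for issuing the discard */
		cur += len;
	}
	return 0;
}
```

As a usage example, assume two contiguous 1 GiB chunks starting at 1104150528 and 2177892352 (the second start and size come from the BLOCK_GROUP_ITEM shown in the thread; the first is an assumption derived from the "BG1 ends" marker). A single `map_for_discard()` call for the range 1943011328 + 314572800 returns a clamped length of 234881024, which corresponds to the old behaviour, while `discard_range()` covers all 314572800 bytes.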