Message ID: a02fc8daecc6973fc928501c4bc2554062ff43e7.1674893735.git.wqu@suse.com (mailing list archive)
State: New, archived
Series: btrfs: reduce the memory usage for replace in btrfs_io_context.
On 28.01.23 09:23, Qu Wenruo wrote:
> @@ -5888,13 +5888,22 @@ static struct btrfs_io_context *alloc_btrfs_io_context(struct btrfs_fs_info *fs_
>  					       int total_stripes,
>  					       int real_stripes)
>  {
> -	struct btrfs_io_context *bioc = kzalloc(
> +	struct btrfs_io_context *bioc;
> +
> +	/*
> +	 * We should have valid number of stripes, larger than U16_MAX(65535)
> +	 * indicates something totally wrong, as our on-disk format is only
> +	 * at u16.
> +	 */
> +	ASSERT(total_stripes < U16_MAX && total_stripes > 0);

Shouldn't we better change 'int total_stripes' and 'int real_stripes' to u16?
That'll imply above ASSERT()
On 2023/1/30 15:27, Johannes Thumshirn wrote:
> On 28.01.23 09:23, Qu Wenruo wrote:
>> @@ -5888,13 +5888,22 @@ static struct btrfs_io_context *alloc_btrfs_io_context(struct btrfs_fs_info *fs_
>>  					       int total_stripes,
>>  					       int real_stripes)
>>  {
>> -	struct btrfs_io_context *bioc = kzalloc(
>> +	struct btrfs_io_context *bioc;
>> +
>> +	/*
>> +	 * We should have valid number of stripes, larger than U16_MAX(65535)
>> +	 * indicates something totally wrong, as our on-disk format is only
>> +	 * at u16.
>> +	 */
>> +	ASSERT(total_stripes < U16_MAX && total_stripes > 0);
>
> Shouldn't we better change 'int total_stripes' and 'int real_stripes' to u16?
> That'll imply above ASSERT()

Right, that is way better.

Thanks,
Qu
> +	bioc = kzalloc(
>  		/* The size of btrfs_io_context */
>  		sizeof(struct btrfs_io_context) +
>  		/* Plus the variable array for the stripes */
>  		sizeof(struct btrfs_io_stripe) * (total_stripes) +
>  		/* Plus the variable array for the tgt dev */
> -		sizeof(int) * (real_stripes) +
> +		sizeof(u16) * (real_stripes) +
>  		/*
>  		 * Plus the raid_map, which includes both the tgt dev
>  		 * and the stripes.

Why not order the various sizeof() in the same manner as in the struct
btrfs_io_context?

Thx.
On 2023/2/2 14:28, Anand Jain wrote:
>> +	bioc = kzalloc(
>>  		/* The size of btrfs_io_context */
>>  		sizeof(struct btrfs_io_context) +
>>  		/* Plus the variable array for the stripes */
>>  		sizeof(struct btrfs_io_stripe) * (total_stripes) +
>>  		/* Plus the variable array for the tgt dev */
>> -		sizeof(int) * (real_stripes) +
>> +		sizeof(u16) * (real_stripes) +
>>  		/*
>>  		 * Plus the raid_map, which includes both the tgt dev
>>  		 * and the stripes.
>
> Why not order the various sizeof() in the same manner as in the struct
> btrfs_io_context?

Because the tgtdev_map would soon get completely removed in the next patch.

Just to mention, I don't like the current way to allocate memory at all.

If there is more feedback, I can convert the allocation to the same way as
alloc_rbio() of raid56.c, AKA, use dedicated kcalloc() calls for those
arrays.

Thanks,
Qu
On 2/2/23 14:47, Qu Wenruo wrote:
> Because the tgtdev_map would soon get completely removed in the next patch.

Ah. Ok.

> Just to mention, I don't like the current way to allocate memory at all.
>
> If there is more feedback, I can convert the allocation to the same way
> as alloc_rbio() of raid56.c, AKA, use dedicated kcalloc() calls for
> those arrays.

The alloc_rbio() method allocates each component independently, leading
to frequent memory allocation and freeing, which could impact performance
in an IO context.

It may be a good idea to conduct a small experiment to assess any
performance penalties. If none are detected, then it should be ok.

Thx,
On Thu, Feb 02, 2023 at 02:47:13PM +0800, Qu Wenruo wrote:
> Because the tgtdev_map would soon get completely removed in the next patch.
>
> Just to mention, I don't like the current way to allocate memory at all.
>
> If there is more feedback, I can convert the allocation to the same way as
> alloc_rbio() of raid56.c, AKA, use dedicated kcalloc() calls for those
> arrays.

The real elephant in the room is that the io_context needs to use a
mempool anyway to be deadlock safe. Which means we'll need to size
for the worst case (per file system?) here. Fortunately the read
fast path now doesn't use the io_context at all which helps with
the different sizing optimizations.
On 2023/2/3 14:19, Christoph Hellwig wrote:
> The real elephant in the room is that the io_context needs to use a
> mempool anyway to be deadlock safe.

In fact, that makes more sense to migrate to the raid56 method.

We go mempool for the io_context structure itself, then still do regular
memory allocation for the stripes (and the raid56 map).

Furthermore we can skip the raid56_map if we don't need, especially
currently we're wasting memory for non-raid56 profiles.

And with my current work to make that tgtdev_map static (currently two
u16, but next version it would be just one u16), for non-raid56 cases we
will just do one mempool alloc (io_context) + one regular alloc (stripes).

Which I guess won't be that bad compared to the existing one.

Thanks,
Qu

> Which means we'll need to size
> for the worst case (per file system?) here. Fortunately the read
> fast path now doesn't use the io_context at all which helps with
> the different sizing optimizations.
On 2023/2/2 17:12, Anand Jain wrote:
> The alloc_rbio() method allocates each component independently, leading
> to frequent memory allocation and freeing, which could impact
> performance in an IO context.

That method is fine if we only have one variable length array (stripes).

But it won't be a problem anymore, the 3/3 would make the dev-replace to
only take one s16 to record the replace source.

Then I have a new patch to this series, which converts *raid_map into a
single u64.

By all those works, btrfs_io_context structure would only have one
variable length array (stripes), then the existing method would be
acceptable.

The final result would make btrfs_io_context look like this:
(comments skipped, as the comments would be larger than the structure
itself)

struct btrfs_io_context {
	refcount_t refs;
	struct btrfs_fs_info *fs_info;
	u64 map_type; /* get from map_lookup->type */
	struct bio *orig_bio;
	atomic_t error;

	u16 max_errors;
	u16 num_stripes;
	u16 mirror_num;
	u16 replace_nr_stripes;
	s16 replace_stripe_src;
	u64 full_stripe_logical;

	struct btrfs_io_stripe stripes[];
};

> It may be a good idea to conduct a small experiment to assess any
> performance penalties. If none are detected, then it should be ok.

If HCH or someone else is going to make btrfs_io_context mempool based,
then I guess we have to go the other way, mempool for btrfs_io_context,
but still manually allocate the stripes.

Anyway the updated series would benefit everyone (at least I hope so).

Thanks,
Qu
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 66d44167f7b1..f00716fbb6cd 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5888,13 +5888,22 @@ static struct btrfs_io_context *alloc_btrfs_io_context(struct btrfs_fs_info *fs_
 					       int total_stripes,
 					       int real_stripes)
 {
-	struct btrfs_io_context *bioc = kzalloc(
+	struct btrfs_io_context *bioc;
+
+	/*
+	 * We should have valid number of stripes, larger than U16_MAX(65535)
+	 * indicates something totally wrong, as our on-disk format is only
+	 * at u16.
+	 */
+	ASSERT(total_stripes < U16_MAX && total_stripes > 0);
+
+	bioc = kzalloc(
 		/* The size of btrfs_io_context */
 		sizeof(struct btrfs_io_context) +
 		/* Plus the variable array for the stripes */
 		sizeof(struct btrfs_io_stripe) * (total_stripes) +
 		/* Plus the variable array for the tgt dev */
-		sizeof(int) * (real_stripes) +
+		sizeof(u16) * (real_stripes) +
 		/*
 		 * Plus the raid_map, which includes both the tgt dev
 		 * and the stripes.
@@ -5908,7 +5917,7 @@ static struct btrfs_io_context *alloc_btrfs_io_context(struct btrfs_fs_info *fs_
 	refcount_set(&bioc->refs, 1);
 	bioc->fs_info = fs_info;
-	bioc->tgtdev_map = (int *)(bioc->stripes + total_stripes);
+	bioc->tgtdev_map = (u16 *)(bioc->stripes + total_stripes);
 	bioc->raid_map = (u64 *)(bioc->tgtdev_map + real_stripes);
 
 	return bioc;
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 6b7a05f6cf82..fe664512191b 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -419,11 +419,54 @@ struct btrfs_io_context {
 	u64 map_type; /* get from map_lookup->type */
 	struct bio *orig_bio;
 	atomic_t error;
-	int max_errors;
-	int num_stripes;
-	int mirror_num;
-	int num_tgtdevs;
-	int *tgtdev_map;
+	u16 max_errors;
+
+	/*
+	 * The total number of stripes, including the extra duplicated
+	 * stripe for replace.
+	 */
+	u16 num_stripes;
+
+	/*
+	 * The mirror_num of this bioc.
+	 *
+	 * This is for reads which uses 0 as mirror_num, thus we should
+	 * return a valid mirror_num (>0) for the reader.
+	 */
+	u16 mirror_num;
+
+	/*
+	 * The following two members are for dev-replace case only.
+	 *
+	 * @num_tgtdevs:	Number of duplicated stripes which needs to be
+	 *			written to replace target.
+	 *			Should be <= 2 (2 for DUP, otherwise <= 1).
+	 * @tgtdev_map:		The array indicates where the duplicated stripes
+	 *			are from. The size is the number of original
+	 *			stripes (num_stripes - num_tgtdevs).
+	 *
+	 * The @tgtdev_map[] array is mostly for RAID56 cases.
+	 * As non-RAID56 stripes share the same contents for the mapped range,
+	 * thus no need to bother where the duplicated ones are from.
+	 *
+	 * But for RAID56 case, all stripes contain different contents, thus
+	 * must need a way to know the mapping.
+	 *
+	 * There is an example for the two members, using a RAID5 write:
+	 *   num_stripes:	4 (3 + 1 duplicated write)
+	 *   stripes[0]:	dev = devid 1, physical = X
+	 *   stripes[1]:	dev = devid 2, physical = Y
+	 *   stripes[2]:	dev = devid 3, physical = Z
+	 *   stripes[3]:	dev = devid 0, physical = Y
+	 *
+	 *   num_tgtdevs = 1
+	 *   tgtdev_map[0] = 0	<- Means stripes[0] is not involved in replace.
+	 *   tgtdev_map[1] = 3	<- Means stripes[1] is involved in replace,
+	 *			   and it's duplicated to stripes[3].
+	 *   tgtdev_map[2] = 0	<- Means stripes[2] is not involved in replace.
+	 */
+	u16 num_tgtdevs;
+	u16 *tgtdev_map;
+
 	/*
 	 * logical block numbers for the start of each stripe
 	 * The last one or two are p/q. These are sorted,
That structure is our ultimate objective for all __btrfs_map_block()
related functions.

We have some hard to understand members, like tgtdev_map, but without
any comments.

This patch will improve the situation by:

- Add extra comments for num_stripes, mirror_num, num_tgtdevs and
  tgtdev_map[].
  Especially for the last two members, add a dedicated (thus very long)
  comment for them, with an example to explain it.

- Shrink those int members to u16.
  In fact our on-disk format is only using u16 for num_stripes, thus no
  need to go int at all.

- Add extra ASSERT() to alloc_btrfs_io_context() for the stripes
  argument.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/volumes.c | 15 ++++++++++---
 fs/btrfs/volumes.h | 53 +++++++++++++++++++++++++++++++++++++++++-----
 2 files changed, 60 insertions(+), 8 deletions(-)