
[2/3] btrfs: small improvement for btrfs_io_context structure

Message ID a02fc8daecc6973fc928501c4bc2554062ff43e7.1674893735.git.wqu@suse.com (mailing list archive)
State New, archived
Series btrfs: reduce the memory usage for replace in btrfs_io_context.

Commit Message

Qu Wenruo Jan. 28, 2023, 8:23 a.m. UTC
That structure is the ultimate objective of all __btrfs_map_block()
related functions.

It has some hard-to-understand members, like tgtdev_map, which have no
comments at all.

This patch will improve the situation by:

- Add extra comments for num_stripes, mirror_num, num_tgtdevs and
  tgtdev_map[].
  Especially for the last two members, add a dedicated (and thus very
  long) comment with an example to explain how they work.

- Shrink those int members to u16.
  In fact our on-disk format only uses u16 for num_stripes, so there is
  no need to use int at all.

- Add an extra ASSERT() in alloc_btrfs_io_context() for the stripes
  argument

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/volumes.c | 15 ++++++++++---
 fs/btrfs/volumes.h | 53 +++++++++++++++++++++++++++++++++++++++++-----
 2 files changed, 60 insertions(+), 8 deletions(-)

Comments

Johannes Thumshirn Jan. 30, 2023, 7:27 a.m. UTC | #1
On 28.01.23 09:23, Qu Wenruo wrote:
> @@ -5888,13 +5888,22 @@ static struct btrfs_io_context *alloc_btrfs_io_context(struct btrfs_fs_info *fs_
>  						       int total_stripes,
>  						       int real_stripes)
>  {
> -	struct btrfs_io_context *bioc = kzalloc(
> +	struct btrfs_io_context *bioc;
> +
> +	/*
> +	 * We should have a valid number of stripes. Anything larger than
> +	 * U16_MAX (65535) indicates something is totally wrong, as our
> +	 * on-disk format only uses u16 for it.
> +	 */
> +	ASSERT(total_stripes < U16_MAX && total_stripes > 0);

Shouldn't we rather change 'int total_stripes' and 'int real_stripes' to u16?
That would imply the above ASSERT().
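
For illustration, a minimal sketch of what that suggested prototype change
could look like (hypothetical, not taken from the posted series; the
allocation body is elided):

static struct btrfs_io_context *alloc_btrfs_io_context(struct btrfs_fs_info *fs_info,
						       u16 total_stripes,
						       u16 real_stripes)
{
	struct btrfs_io_context *bioc;

	/*
	 * With u16 parameters the upper bound is enforced by the type
	 * itself, so only the lower bound needs an explicit check.
	 */
	ASSERT(total_stripes > 0);

	/* ... allocation and member setup as in the patch below ... */

	return bioc;
}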
Qu Wenruo Jan. 30, 2023, 7:31 a.m. UTC | #2
On 2023/1/30 15:27, Johannes Thumshirn wrote:
> On 28.01.23 09:23, Qu Wenruo wrote:
>> @@ -5888,13 +5888,22 @@ static struct btrfs_io_context *alloc_btrfs_io_context(struct btrfs_fs_info *fs_
>>   						       int total_stripes,
>>   						       int real_stripes)
>>   {
>> -	struct btrfs_io_context *bioc = kzalloc(
>> +	struct btrfs_io_context *bioc;
>> +
>> +	/*
>> +	 * We should have a valid number of stripes. Anything larger than
>> +	 * U16_MAX (65535) indicates something is totally wrong, as our
>> +	 * on-disk format only uses u16 for it.
>> +	 */
>> +	ASSERT(total_stripes < U16_MAX && total_stripes > 0);
> 
> Shouldn't we rather change 'int total_stripes' and 'int real_stripes' to u16?
> That would imply the above ASSERT().

Right, that is way better.

Thanks,
Qu
Anand Jain Feb. 2, 2023, 6:28 a.m. UTC | #3
> +	bioc = kzalloc(
>   		 /* The size of btrfs_io_context */
>   		sizeof(struct btrfs_io_context) +
>   		/* Plus the variable array for the stripes */
>   		sizeof(struct btrfs_io_stripe) * (total_stripes) +
>   		/* Plus the variable array for the tgt dev */
> -		sizeof(int) * (real_stripes) +
> +		sizeof(u16) * (real_stripes) +
>   		/*
>   		 * Plus the raid_map, which includes both the tgt dev
>   		 * and the stripes.

Why not order the various sizeof() in the same manner as in the struct 
btrfs_io_context?

Thx.
Qu Wenruo Feb. 2, 2023, 6:47 a.m. UTC | #4
On 2023/2/2 14:28, Anand Jain wrote:
> 
> 
>> +    bioc = kzalloc(
>>            /* The size of btrfs_io_context */
>>           sizeof(struct btrfs_io_context) +
>>           /* Plus the variable array for the stripes */
>>           sizeof(struct btrfs_io_stripe) * (total_stripes) +
>>           /* Plus the variable array for the tgt dev */
>> -        sizeof(int) * (real_stripes) +
>> +        sizeof(u16) * (real_stripes) +
>>           /*
>>            * Plus the raid_map, which includes both the tgt dev
>>            * and the stripes.
> 
> Why not order the various sizeof() in the same manner as in the struct 
> btrfs_io_context?

Because the tgtdev_map would soon get completely removed in the next patch.

Just to mention, I don't like the current way to allocate memory at all.

If there is more feedback, I can convert the allocation to the same way 
as alloc_rbio() of raid56.c, AKA, use dedicated kcalloc() calls for 
those arrays.

Thanks,
Qu

> 
> Thx.
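
For reference, a rough sketch of the alloc_rbio()-style approach mentioned
above, assuming stripes, tgtdev_map and raid_map were turned into plain
pointers instead of one flat allocation (illustration only, not code from
the series):

	bioc = kzalloc(sizeof(*bioc), GFP_NOFS);
	if (!bioc)
		return NULL;

	/* Each array gets its own allocation, as alloc_rbio() does for raid56. */
	bioc->stripes = kcalloc(total_stripes,
				sizeof(struct btrfs_io_stripe), GFP_NOFS);
	bioc->tgtdev_map = kcalloc(real_stripes, sizeof(u16), GFP_NOFS);
	bioc->raid_map = kcalloc(total_stripes, sizeof(u64), GFP_NOFS);
	if (!bioc->stripes || !bioc->tgtdev_map || !bioc->raid_map) {
		/* kfree(NULL) is a no-op, so partial failures are fine here. */
		kfree(bioc->stripes);
		kfree(bioc->tgtdev_map);
		kfree(bioc->raid_map);
		kfree(bioc);
		return NULL;
	}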
Anand Jain Feb. 2, 2023, 9:12 a.m. UTC | #5
On 2/2/23 14:47, Qu Wenruo wrote:
> 
> 
> On 2023/2/2 14:28, Anand Jain wrote:
>>
>>
>>> +    bioc = kzalloc(
>>>            /* The size of btrfs_io_context */
>>>           sizeof(struct btrfs_io_context) +
>>>           /* Plus the variable array for the stripes */
>>>           sizeof(struct btrfs_io_stripe) * (total_stripes) +
>>>           /* Plus the variable array for the tgt dev */
>>> -        sizeof(int) * (real_stripes) +
>>> +        sizeof(u16) * (real_stripes) +
>>>           /*
>>>            * Plus the raid_map, which includes both the tgt dev
>>>            * and the stripes.
>>
>> Why not order the various sizeof() in the same manner as in the struct 
>> btrfs_io_context?
> 
> Because the tgtdev_map would soon get completely removed in the next patch.
> 

Ah. Ok.

> Just to mention, I don't like the current way to allocate memory at all.
> 
> If there is more feedback, I can convert the allocation to the same way 
> as alloc_rbio() of raid56.c, AKA, use dedicated kcalloc() calls for 
> those arrays.

The alloc_rbio() method allocates each component independently, leading
to frequent memory allocation and freeing, which could impact 
performance in an IO context.

It may be a good idea to conduct a small experiment to assess any
performance penalties. If none are detected, then it should be ok.

Thx,


> 
> Thanks,
> Qu
> 
>>
>> Thx.
Christoph Hellwig Feb. 3, 2023, 6:19 a.m. UTC | #6
On Thu, Feb 02, 2023 at 02:47:13PM +0800, Qu Wenruo wrote:
> Because the tgtdev_map would soon get completely removed in the next patch.
> 
> Just to mention, I don't like the current way to allocate memory at all.
> 
> If there is more feedback, I can convert the allocation to the same way as
> alloc_rbio() of raid56.c, AKA, use dedicated kcalloc() calls for those
> arrays.

The real elephant in the room is that the io_context needs to use a
mempool anyway to be deadlock safe.  Which means we'll need to size
for the worst case (per file system?) here.  Fortunately the read
fast path now doesn't use the io_context at all which helps with
the different sizing optimizations.
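
A hedged sketch of that mempool direction, where BTRFS_BIOC_POOL_MIN,
fs_info->bioc_pool and worst_case_bioc_size are made-up names for the
pre-sized, per-filesystem pool discussed above:

	/* At mount time: pre-size the pool for the worst-case io_context. */
	fs_info->bioc_pool = mempool_create_kmalloc_pool(BTRFS_BIOC_POOL_MIN,
							 worst_case_bioc_size);
	if (!fs_info->bioc_pool)
		return -ENOMEM;

	/* In the I/O path: the allocation can no longer deadlock on reclaim. */
	bioc = mempool_alloc(fs_info->bioc_pool, GFP_NOFS);
	/* ... map the block and submit the bios ... */
	mempool_free(bioc, fs_info->bioc_pool);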
Qu Wenruo Feb. 3, 2023, 6:44 a.m. UTC | #7
On 2023/2/3 14:19, Christoph Hellwig wrote:
> On Thu, Feb 02, 2023 at 02:47:13PM +0800, Qu Wenruo wrote:
>> Because the tgtdev_map would soon get completely removed in the next patch.
>>
>> Just to mention, I don't like the current way to allocate memory at all.
>>
>> If there is more feedback, I can convert the allocation to the same way as
>> alloc_rbio() of raid56.c, AKA, use dedicated kcalloc() calls for those
>> arrays.
> 
> The real elephant in the room is that the io_context needs to use a
> mempool anyway to be deadlock safe.

In fact, that makes it more reasonable to migrate to the raid56 method.

We can use a mempool for the io_context structure itself, and still do
regular memory allocation for the stripes (and the raid_map).

Furthermore we can skip the raid_map if we don't need it; currently we
are just wasting that memory for non-raid56 profiles.

And with my current work to make that tgtdev_map static (currently two
u16, but in the next version it would be just one u16), for non-raid56
cases we would just do one mempool alloc (io_context) + one regular
alloc (stripes).

I guess that won't be that bad compared to the existing scheme.

Thanks,
Qu
>  Which means we'll need to size
> for the worst case (per file system?) here.  Fortunately the read
> fast path now doesn't use the io_context at all which helps with
> the different sizing optimizations.
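
A rough sketch of the "one mempool alloc + one regular alloc" combination
described above for the non-raid56 case, again using the hypothetical
fs_info->bioc_pool and a pointer-based stripes member (not code from the
series):

	/* The context itself comes out of the pre-sized mempool. */
	bioc = mempool_alloc(fs_info->bioc_pool, GFP_NOFS);
	memset(bioc, 0, sizeof(*bioc));
	refcount_set(&bioc->refs, 1);
	bioc->fs_info = fs_info;

	/* The only remaining variable-length part stays a regular allocation. */
	bioc->stripes = kcalloc(total_stripes,
				sizeof(struct btrfs_io_stripe), GFP_NOFS);
	if (!bioc->stripes) {
		mempool_free(bioc, fs_info->bioc_pool);
		return NULL;
	}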
Qu Wenruo Feb. 7, 2023, 2:19 a.m. UTC | #8
On 2023/2/2 17:12, Anand Jain wrote:
> 
> 
> On 2/2/23 14:47, Qu Wenruo wrote:
>>
>>
>> On 2023/2/2 14:28, Anand Jain wrote:
>>>
>>>
>>>> +    bioc = kzalloc(
>>>>            /* The size of btrfs_io_context */
>>>>           sizeof(struct btrfs_io_context) +
>>>>           /* Plus the variable array for the stripes */
>>>>           sizeof(struct btrfs_io_stripe) * (total_stripes) +
>>>>           /* Plus the variable array for the tgt dev */
>>>> -        sizeof(int) * (real_stripes) +
>>>> +        sizeof(u16) * (real_stripes) +
>>>>           /*
>>>>            * Plus the raid_map, which includes both the tgt dev
>>>>            * and the stripes.
>>>
>>> Why not order the various sizeof() in the same manner as in the 
>>> struct btrfs_io_context?
>>
>> Because the tgtdev_map would soon get completely removed in the next 
>> patch.
>>
> 
> Ah. Ok.
> 
>> Just to mention, I don't like the current way to allocate memory at all.
>>
>> If there is more feedback, I can convert the allocation to the same 
>> way as alloc_rbio() of raid56.c, AKA, use dedicated kcalloc() calls 
>> for those arrays.
> 
> The alloc_rbio() method allocates each component independently, leading
> to frequent memory allocation and freeing, which could impact 
> performance in an IO context.

That method is fine if we only have one variable-length array (stripes).

But it won't be a problem anymore; patch 3/3 makes dev-replace take only
one s16 to record the replace source.

I also have a new patch for this series, which converts *raid_map into a
single u64.

With all that work, the btrfs_io_context structure would only have one
variable-length array (stripes), and then the existing method would be
acceptable.

The final result would make btrfs_io_context look like this (comments
skipped, as they would be larger than the structure itself):

struct btrfs_io_context {
         refcount_t refs;
         struct btrfs_fs_info *fs_info;
         u64 map_type; /* get from map_lookup->type */
         struct bio *orig_bio;
         atomic_t error;
         u16 max_errors;
         u16 num_stripes;
         u16 mirror_num;

         u16 replace_nr_stripes;
         s16 replace_stripe_src;

         u64 full_stripe_logical;
         struct btrfs_io_stripe stripes[];
};

> 
> It may be a good idea to conduct a small experiment to assess any
> performance penalties. If none are detected, then it should be ok.

If HCH or someone else is going to make btrfs_io_context mempool based,
then I guess we would have to go the other way: a mempool for
btrfs_io_context, but still a manual allocation for the stripes.

Anyway the updated series would benefit everyone (at least I hope so).

Thanks,
Qu

> 
> Thx,
> 
> 
>>
>> Thanks,
>> Qu
>>
>>>
>>> Thx.

Patch

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 66d44167f7b1..f00716fbb6cd 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5888,13 +5888,22 @@  static struct btrfs_io_context *alloc_btrfs_io_context(struct btrfs_fs_info *fs_
 						       int total_stripes,
 						       int real_stripes)
 {
-	struct btrfs_io_context *bioc = kzalloc(
+	struct btrfs_io_context *bioc;
+
+	/*
+	 * We should have a valid number of stripes. Anything larger than
+	 * U16_MAX (65535) indicates something is totally wrong, as our
+	 * on-disk format only uses u16 for it.
+	 */
+	ASSERT(total_stripes < U16_MAX && total_stripes > 0);
+
+	bioc = kzalloc(
 		 /* The size of btrfs_io_context */
 		sizeof(struct btrfs_io_context) +
 		/* Plus the variable array for the stripes */
 		sizeof(struct btrfs_io_stripe) * (total_stripes) +
 		/* Plus the variable array for the tgt dev */
-		sizeof(int) * (real_stripes) +
+		sizeof(u16) * (real_stripes) +
 		/*
 		 * Plus the raid_map, which includes both the tgt dev
 		 * and the stripes.
@@ -5908,7 +5917,7 @@  static struct btrfs_io_context *alloc_btrfs_io_context(struct btrfs_fs_info *fs_
 	refcount_set(&bioc->refs, 1);
 
 	bioc->fs_info = fs_info;
-	bioc->tgtdev_map = (int *)(bioc->stripes + total_stripes);
+	bioc->tgtdev_map = (u16 *)(bioc->stripes + total_stripes);
 	bioc->raid_map = (u64 *)(bioc->tgtdev_map + real_stripes);
 
 	return bioc;
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 6b7a05f6cf82..fe664512191b 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -419,11 +419,54 @@  struct btrfs_io_context {
 	u64 map_type; /* get from map_lookup->type */
 	struct bio *orig_bio;
 	atomic_t error;
-	int max_errors;
-	int num_stripes;
-	int mirror_num;
-	int num_tgtdevs;
-	int *tgtdev_map;
+	u16 max_errors;
+
+	/*
+	 * The total number of stripes, including the extra duplicated
+	 * stripe for replace.
+	 */
+	u16 num_stripes;
+
+	/*
+	 * The mirror_num of this bioc.
+	 *
+	 * This is for reads, which use 0 as mirror_num; in that case we
+	 * should return a valid mirror_num (>0) to the reader.
+	 */
+	u16 mirror_num;
+
+	/*
+	 * The following two members are for the dev-replace case only.
+	 *
+	 * @num_tgtdevs:	Number of duplicated stripes which need to be
+	 *			written to the replace target.
+	 *			Should be <= 2 (2 for DUP, otherwise <= 1).
+	 * @tgtdev_map:		The array indicating where the duplicated stripes
+	 *			come from. Its size is the number of original
+	 *			stripes (num_stripes - num_tgtdevs).
+	 *
+	 * The @tgtdev_map[] array is mostly for RAID56 cases.
+	 * Non-RAID56 stripes share the same contents for the mapped range,
+	 * so there is no need to care where the duplicated ones are from.
+	 *
+	 * But in the RAID56 case, all stripes contain different contents, so
+	 * we need a way to know the mapping.
+	 *
+	 * Here is an example for the two members, using a RAID5 write:
+	 * num_stripes:		4 (3 + 1 duplicated write)
+	 * stripes[0]:		dev = devid 1, physical = X
+	 * stripes[1]:		dev = devid 2, physical = Y
+	 * stripes[2]:		dev = devid 3, physical = Z
+	 * stripes[3]:		dev = devid 0, physical = Y
+	 *
+	 * num_tgtdevs = 1
+	 * tgtdev_map[0] = 0	<- Means stripes[0] is not involved in replace.
+	 * tgtdev_map[1] = 3	<- Means stripes[1] is involved in replace,
+	 *			   and it's duplicated to stripes[3].
+	 * tgtdev_map[2] = 0	<- Means stripes[2] is not involved in replace.
+	 */
+	u16 num_tgtdevs;
+	u16 *tgtdev_map;
 	/*
 	 * logical block numbers for the start of each stripe
 	 * The last one or two are p/q.  These are sorted,