
[2/2] qemu-img: align next status sector on destination alignment.

Message ID 20201111153913.41840-3-mlevitsk@redhat.com (mailing list archive)
State New, archived
Headers show
Series RFC: Issue with discards on raw block device without O_DIRECT | expand

Commit Message

Maxim Levitsky Nov. 11, 2020, 3:39 p.m. UTC
This helps avoid unneeded writes and discards.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
---
 qemu-img.c | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

Comments

Peter Lieven Nov. 12, 2020, 12:40 p.m. UTC | #1
On 11.11.20 at 16:39, Maxim Levitsky wrote:
> This helps avoid unneeded writes and discards.
>
> Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
> ---
>  qemu-img.c | 13 ++++++++-----
>  1 file changed, 8 insertions(+), 5 deletions(-)
>
> diff --git a/qemu-img.c b/qemu-img.c
> index c2c56fc797..7e9b0f659f 100644
> --- a/qemu-img.c
> +++ b/qemu-img.c
> @@ -1722,7 +1722,7 @@ static void convert_select_part(ImgConvertState *s, int64_t sector_num,
>  static int convert_iteration_sectors(ImgConvertState *s, int64_t sector_num)
>  {
>      int64_t src_cur_offset;
> -    int ret, n, src_cur;
> +    int ret, n, src_cur, alignment;
>      bool post_backing_zero = false;
>  
>      convert_select_part(s, sector_num, &src_cur, &src_cur_offset);
> @@ -1785,11 +1785,14 @@ static int convert_iteration_sectors(ImgConvertState *s, int64_t sector_num)
>          n = DIV_ROUND_UP(count, BDRV_SECTOR_SIZE);
>  
>          /*
> -         * Avoid that s->sector_next_status becomes unaligned to the source
> -         * request alignment and/or cluster size to avoid unnecessary read
> -         * cycles.
> +         * Avoid that s->sector_next_status becomes unaligned to the
> +         * source/destination request alignment and/or cluster size to avoid
> +         * unnecessary read/write cycles.
>           */
> -        tail = (sector_num - src_cur_offset + n) % s->src_alignment[src_cur];
> +        alignment = MAX(s->src_alignment[src_cur], s->alignment);
> +        assert(is_power_of_2(alignment));
> +
> +        tail = (sector_num - src_cur_offset + n) % alignment;
>          if (n > tail) {
>              n -= tail;
>          }


I was also considering including s->alignment when adding this change. However, you need the least common multiple of both alignments, not the maximum; otherwise
you might get misaligned to either the source or the destination.


Why exactly do you need the power of two requirement?


Peter
Eric Blake Nov. 12, 2020, 1:45 p.m. UTC | #2
On 11/12/20 6:40 AM, Peter Lieven wrote:

>>          /*
>> -         * Avoid that s->sector_next_status becomes unaligned to the source
>> -         * request alignment and/or cluster size to avoid unnecessary read
>> -         * cycles.
>> +         * Avoid that s->sector_next_status becomes unaligned to the
>> +         * source/destination request alignment and/or cluster size to avoid
>> +         * unnecessary read/write cycles.
>>           */
>> -        tail = (sector_num - src_cur_offset + n) % s->src_alignment[src_cur];
>> +        alignment = MAX(s->src_alignment[src_cur], s->alignment);
>> +        assert(is_power_of_2(alignment));
>> +
>> +        tail = (sector_num - src_cur_offset + n) % alignment;
>>          if (n > tail) {
>>              n -= tail;
>>          }
> 
> 
> I was also considering including s->alignment when adding this change. However, you need the least common multiple of both alignments, not the maximum; otherwise
> you might get misaligned to either the source or the destination.
> 
> 
> Why exactly do you need the power of two requirement?

The power of two requirement ensures that you have the least common
multiple of both alignments ;)

However, there ARE devices in practice that have advertised a
non-power-of-2 discard granularity, such as 15MiB (see commits 430b26a8,
63188c24).  Which means you probably don't want to assert power-of-2,
and in turn need to worry about least common multiple.
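
One way to get that least common multiple without assuming powers of two is
sketched below; the helpers are hypothetical, not existing qemu-img code:

    /* Greatest common divisor (Euclid), then least common multiple. */
    /* For powers of two the LCM degenerates to MAX(a, b).           */
    static int64_t gcd_i64(int64_t a, int64_t b)
    {
        while (b) {
            int64_t t = a % b;
            a = b;
            b = t;
        }
        return a;
    }

    static int64_t lcm_i64(int64_t a, int64_t b)
    {
        return a / gcd_i64(a, b) * b;
    }

The tail calculation in convert_iteration_sectors() could then use
lcm_i64(s->src_alignment[src_cur], s->alignment) instead of MAX().
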
Maxim Levitsky Nov. 12, 2020, 3:04 p.m. UTC | #3
On Thu, 2020-11-12 at 07:45 -0600, Eric Blake wrote:
> On 11/12/20 6:40 AM, Peter Lieven wrote:
> 
> > >          /*
> > > -         * Avoid that s->sector_next_status becomes unaligned to the source
> > > -         * request alignment and/or cluster size to avoid unnecessary read
> > > -         * cycles.
> > > +         * Avoid that s->sector_next_status becomes unaligned to the
> > > +         * source/destination request alignment and/or cluster size to avoid
> > > +         * unnecessary read/write cycles.
> > >           */
> > > -        tail = (sector_num - src_cur_offset + n) % s->src_alignment[src_cur];
> > > +        alignment = MAX(s->src_alignment[src_cur], s->alignment);
> > > +        assert(is_power_of_2(alignment));
> > > +
> > > +        tail = (sector_num - src_cur_offset + n) % alignment;
> > >          if (n > tail) {
> > >              n -= tail;
> > >          }
> > 
> > I was also considering including s->alignment when adding this change. However, you need the least common multiple of both alignments, not the maximum; otherwise
> > you might get misaligned to either the source or the destination.
> > 
> > 
> > Why exactly do you need the power of two requirement?
> 
> The power of two requirement ensures that you have the least common
> multiple of both alignments ;)
Yes, and in addition to that both alignments are already powers of two, because we align them
that way. The assert I added is just in case.

This is how we calculate the destination alignment:

    s.alignment = MAX(pow2floor(s.min_sparse),
                      DIV_ROUND_UP(out_bs->bl.request_alignment,
                                   BDRV_SECTOR_SIZE));

This is how we calculate the source alignments (it is possible to have several source files):

    s.src_alignment[bs_i] = DIV_ROUND_UP(src_bs->bl.request_alignment,
                                         BDRV_SECTOR_SIZE);
    if (!bdrv_get_info(src_bs, &bdi)) {
        s.src_alignment[bs_i] = MAX(s.src_alignment[bs_i],
                                    bdi.cluster_size / BDRV_SECTOR_SIZE);
    }
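
To make the arithmetic concrete, here is a small worked example with made-up
numbers (a qcow2 source with the default 64 KiB cluster size on a 512-byte
device, a destination with 4 KiB request_alignment, and the default min_sparse
of 8 sectors); none of this is taken from a real measurement:

    /*
     * destination: MAX(pow2floor(8), DIV_ROUND_UP(4096, 512))
     *            = MAX(8, 8)   = 8 sectors   (4 KiB)
     * source:      MAX(DIV_ROUND_UP(512, 512), 65536 / 512)
     *            = MAX(1, 128) = 128 sectors (64 KiB)
     */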


The bl.request_alignment comment mentions that it must be a power of 2,
and cluster sizes are also very likely to be powers of 2 in all drivers
we support. An assert for s.src_alignment won't hurt though.
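
Just as a sketch, such an assert could sit right after the source alignment
calculation above (hypothetical placement, not part of the current code):

    assert(is_power_of_2(s.src_alignment[bs_i]));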

 
Note though that we don't really read the discard alignment.
We have max_pdiscard, and pdiscard_alignment, but max_pdiscard
is only used to split 'too large' discard requests, and pdiscard_alignment
is advisory and used only in a couple of places.
Neither is set by file-posix.
 
We do have request_alignment, which file-posix tries to align to the logical block
size, which is still often 512 bytes for backward compatibility on many devices (even NVMe).
 
Now both the source and destination alignments in qemu-img convert are based on request_alignment
and on min_sparse, which is 4K by default and controlled by the qemu-img -S parameter
(the parameter that otherwise controls the minimum number of consecutive zero blocks
needed to consider discarding them in convert).
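
For illustration (the image name, the target device and the 64k value below are
only placeholders), the threshold can be changed on the command line:

    # require at least 64 KiB of consecutive zeros before a range is treated
    # as sparse (zeroed/discarded) instead of being written out
    qemu-img convert -S 64k -O raw source.qcow2 /dev/sdX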
 

This means that there is no guarantee of 4K alignment, and besides,
with a sufficiently fragmented source file, bdrv_block_status_above
can return arbitrarily short and unaligned extents which can't be aligned;
thus, as I said, this patch alone can't guarantee that we won't have
a write and a discard sharing the same page.
 
Another thing that can be discussed is why is_allocated_sectors was patched
to convert short discards to writes. Probably because the underlying hardware
ignores them or something? In theory we should detect this and fall
back to regular zeroing in that case.
Again though, while in the case of converting an empty file this is the only
source of writes (so eliminating it also 'fixes' this case), with a sufficiently
fragmented source file we can get a write and a discard sharing a page even
without this code.


Best regards,
	Maxim Levitsky

> 
> However, there ARE devices in practice that have advertised a
> non-power-of-2 discard granularity, such as 15MiB (see commits 430b26a8,
> 63188c24).  Which means you probably don't want to assert power-of-2,
> and in turn need to worry about least common multiple.
>
Maxim Levitsky March 18, 2021, 9:57 a.m. UTC | #4
Any update on this patch?

Best regards,
	Maxim Levitsky

Patch

diff --git a/qemu-img.c b/qemu-img.c
index c2c56fc797..7e9b0f659f 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -1722,7 +1722,7 @@  static void convert_select_part(ImgConvertState *s, int64_t sector_num,
 static int convert_iteration_sectors(ImgConvertState *s, int64_t sector_num)
 {
     int64_t src_cur_offset;
-    int ret, n, src_cur;
+    int ret, n, src_cur, alignment;
     bool post_backing_zero = false;
 
     convert_select_part(s, sector_num, &src_cur, &src_cur_offset);
@@ -1785,11 +1785,14 @@  static int convert_iteration_sectors(ImgConvertState *s, int64_t sector_num)
         n = DIV_ROUND_UP(count, BDRV_SECTOR_SIZE);
 
         /*
-         * Avoid that s->sector_next_status becomes unaligned to the source
-         * request alignment and/or cluster size to avoid unnecessary read
-         * cycles.
+         * Avoid that s->sector_next_status becomes unaligned to the
+         * source/destination request alignment and/or cluster size to avoid
+         * unnecessary read/write cycles.
          */
-        tail = (sector_num - src_cur_offset + n) % s->src_alignment[src_cur];
+        alignment = MAX(s->src_alignment[src_cur], s->alignment);
+        assert(is_power_of_2(alignment));
+
+        tail = (sector_num - src_cur_offset + n) % alignment;
         if (n > tail) {
             n -= tail;
         }