
[v3,3/9] block: Let write zeroes fallback work even with small max_transfer

Message ID 1479413642-22463-4-git-send-email-eblake@redhat.com (mailing list archive)
State New, archived

Commit Message

Eric Blake Nov. 17, 2016, 8:13 p.m. UTC
Commit 443668ca rewrote the write_zeroes logic to guarantee that
an unaligned request never crosses a cluster boundary.  But
in the rewrite, the new code assumed that at most one iteration
would be needed to get to an alignment boundary.

However, it is easy to trigger an assertion failure: the Linux
kernel limits loopback devices to advertise a max_transfer of
only 64k.  Any operation that requires falling back to writes
rather than more efficient zeroing must obey max_transfer during
that fallback, which means an unaligned head may require multiple
iterations of the write fallbacks before reaching the aligned
boundaries, when layering a format with clusters larger than 64k
atop the protocol of file access to a loopback device.

Test case:

$ qemu-img create -f qcow2 -o cluster_size=1M file 10M
$ losetup /dev/loop2 /path/to/file
$ qemu-io -f qcow2 /dev/loop2
qemu-io> w 7m 1k
qemu-io> w -z 8003584 2093056

In fairness to Denis (as the original listed author of the culprit
commit), the faulty logic for at most one iteration is probably all
my fault in reworking his idea.  But the solution is to restore what
was in place prior to that commit: when dealing with an unaligned
head or tail, iterate as many times as necessary while fragmenting
the operation at max_transfer boundaries.

Reported-by: Ed Swierk <eswierk@skyportsystems.com>
CC: qemu-stable@nongnu.org
CC: Denis V. Lunev <den@openvz.org>
Signed-off-by: Eric Blake <eblake@redhat.com>
---
 block/io.c | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

Comments

Max Reitz Nov. 17, 2016, 9:40 p.m. UTC | #1
On 17.11.2016 21:13, Eric Blake wrote:
> Commit 443668ca rewrote the write_zeroes logic to guarantee that
> an unaligned request never crosses a cluster boundary.  But
> in the rewrite, the new code assumed that at most one iteration
> would be needed to get to an alignment boundary.
> 
> [...]

Reviewed-by: Max Reitz <mreitz@redhat.com>
Kevin Wolf Nov. 22, 2016, 1:16 p.m. UTC | #2
Am 17.11.2016 um 21:13 hat Eric Blake geschrieben:
> Commit 443668ca rewrote the write_zeroes logic to guarantee that
> an unaligned request never crosses a cluster boundary.  But
> in the rewrite, the new code assumed that at most one iteration
> would be needed to get to an alignment boundary.
> 
> [...]
> 
> diff --git a/block/io.c b/block/io.c
> index aa532a5..085ac34 100644
> --- a/block/io.c
> +++ b/block/io.c
> @@ -1214,6 +1214,8 @@ static int coroutine_fn bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
>      int max_write_zeroes = MIN_NON_ZERO(bs->bl.max_pwrite_zeroes, INT_MAX);
>      int alignment = MAX(bs->bl.pwrite_zeroes_alignment,
>                          bs->bl.request_alignment);
> +    int max_transfer = MIN_NON_ZERO(bs->bl.max_transfer,
> +                                    MAX_WRITE_ZEROES_BOUNCE_BUFFER);
> 
>      assert(alignment % bs->bl.request_alignment == 0);
>      head = offset % alignment;
> @@ -1229,9 +1231,12 @@ static int coroutine_fn bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
>           * boundaries.
>           */
>          if (head) {
> -            /* Make a small request up to the first aligned sector.  */
> -            num = MIN(count, alignment - head);
> -            head = 0;
> +            /* Make a small request up to the first aligned sector. For
> +             * convenience, limit this request to max_transfer even if
> +             * we don't need to fall back to writes.  */
> +            num = MIN(MIN(count, max_transfer), alignment - head);
> +            head = (head + num) % alignment;
> +            assert(num < max_write_zeroes);
>          } else if (tail && num > alignment) {
>              /* Shorten the request to the last aligned sector.  */
>              num -= tail;
>
> @@ -1257,8 +1262,6 @@ static int coroutine_fn bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
> 
>          if (ret == -ENOTSUP) {
>              /* Fall back to bounce buffer if write zeroes is unsupported */
> -            int max_transfer = MIN_NON_ZERO(bs->bl.max_transfer,
> -                                            MAX_WRITE_ZEROES_BOUNCE_BUFFER);
>              BdrvRequestFlags write_flags = flags & ~BDRV_REQ_ZERO_WRITE;

Why do we even still bother with max_transfer in this function when we
could just call bdrv_aligned_pwritev() and use its fragmentation code?

Of course, when bdrv_co_do_pwrite_zeroes() was written, your
fragmentation code didn't exist yet, but today I think it would make
more sense to use a single centralised version of it instead of
reimplementing it here.

This doesn't make your fix less correct, but if we did things this way,
the fix wouldn't even be needed because a single iteration (in this
loop) would indeed always be enough.

Kevin
Eric Blake Nov. 22, 2016, 1:22 p.m. UTC | #3
On 11/22/2016 07:16 AM, Kevin Wolf wrote:
> Am 17.11.2016 um 21:13 hat Eric Blake geschrieben:
>> Commit 443668ca rewrote the write_zeroes logic to guarantee that
>> an unaligned request never crosses a cluster boundary.  But
>> in the rewrite, the new code assumed that at most one iteration
>> would be needed to get to an alignment boundary.
>>

>> @@ -1257,8 +1262,6 @@ static int coroutine_fn bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
>>
>>          if (ret == -ENOTSUP) {
>>              /* Fall back to bounce buffer if write zeroes is unsupported */
>> -            int max_transfer = MIN_NON_ZERO(bs->bl.max_transfer,
>> -                                            MAX_WRITE_ZEROES_BOUNCE_BUFFER);
>>              BdrvRequestFlags write_flags = flags & ~BDRV_REQ_ZERO_WRITE;
> 
> Why do we even still bother with max_transfer in this function when we
> could just call bdrv_aligned_pwritev() and use its fragmentation code?

Hmm. bdrv_aligned_pwritev() asserts that its arguments are already
aligned, but for the head and tail, they might not be.  I agree that for
the bulk of the body, it may help, but it would take more thought on
refactoring if we want to have fragmentation at only one spot.

> 
> Of course, when bdrv_co_do_pwrite_zeroes() was written, your
> fragmentation code didn't exist yet, but today I think it would make
> more sense to use a single centralised version of it instead of
> reimplementing it here.
> 
> This doesn't make your fix less correct, but if we did things this way,
> the fix wouldn't even be needed because a single iteration (in this
> loop) would indeed always be enough.

Can I request to defer such refactoring to 2.9, while getting this patch
as-is into 2.8?
Kevin Wolf Nov. 22, 2016, 1:30 p.m. UTC | #4
Am 22.11.2016 um 14:22 hat Eric Blake geschrieben:
> On 11/22/2016 07:16 AM, Kevin Wolf wrote:
> > Am 17.11.2016 um 21:13 hat Eric Blake geschrieben:
> >> Commit 443668ca rewrote the write_zeroes logic to guarantee that
> >> an unaligned request never crosses a cluster boundary.  But
> >> in the rewrite, the new code assumed that at most one iteration
> >> would be needed to get to an alignment boundary.
> >>
> 
> >> @@ -1257,8 +1262,6 @@ static int coroutine_fn bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
> >>
> >>          if (ret == -ENOTSUP) {
> >>              /* Fall back to bounce buffer if write zeroes is unsupported */
> >> -            int max_transfer = MIN_NON_ZERO(bs->bl.max_transfer,
> >> -                                            MAX_WRITE_ZEROES_BOUNCE_BUFFER);
> >>              BdrvRequestFlags write_flags = flags & ~BDRV_REQ_ZERO_WRITE;
> > 
> > Why do we even still bother with max_transfer in this function when we
> > could just call bdrv_aligned_pwritev() and use its fragmentation code?
> 
> Hmm. bdrv_aligned_pwritev() asserts that its arguments are already
> aligned, but for the head and tail, they might not be.  I agree that for
> the bulk of the body, it may help, but it would take more thought on
> refactoring if we want to have fragmentation at only one spot.

Right, it should be more like bdrv_co_pwritev() then, but something that
uses the logic in bdrv_aligned_pwritev().

Using bdrv_co_pwritev() would mean that it's tracked as another request,
but I don't think that's a problem. Otherwise we'd have to factor that
part out.

> > Of course, when bdrv_co_do_pwrite_zeroes() was written, your
> > fragmentation code didn't exist yet, but today I think it would make
> > more sense to use a single centralised version of it instead of
> > reimplementing it here.
> > 
> > This doesn't make your fix less correct, but if we did things this way,
> > the fix wouldn't even be needed because a single iteration (in this
> > loop) would indeed always be enough.
> 
> Can I request to defer such refactoring to 2.9, while getting this patch
> as-is into 2.8?

Yes, the refactoring is definitely for 2.9.

Kevin

Patch

diff --git a/block/io.c b/block/io.c
index aa532a5..085ac34 100644
--- a/block/io.c
+++ b/block/io.c
@@ -1214,6 +1214,8 @@  static int coroutine_fn bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
     int max_write_zeroes = MIN_NON_ZERO(bs->bl.max_pwrite_zeroes, INT_MAX);
     int alignment = MAX(bs->bl.pwrite_zeroes_alignment,
                         bs->bl.request_alignment);
+    int max_transfer = MIN_NON_ZERO(bs->bl.max_transfer,
+                                    MAX_WRITE_ZEROES_BOUNCE_BUFFER);

     assert(alignment % bs->bl.request_alignment == 0);
     head = offset % alignment;
@@ -1229,9 +1231,12 @@  static int coroutine_fn bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
          * boundaries.
          */
         if (head) {
-            /* Make a small request up to the first aligned sector.  */
-            num = MIN(count, alignment - head);
-            head = 0;
+            /* Make a small request up to the first aligned sector. For
+             * convenience, limit this request to max_transfer even if
+             * we don't need to fall back to writes.  */
+            num = MIN(MIN(count, max_transfer), alignment - head);
+            head = (head + num) % alignment;
+            assert(num < max_write_zeroes);
         } else if (tail && num > alignment) {
             /* Shorten the request to the last aligned sector.  */
             num -= tail;
@@ -1257,8 +1262,6 @@  static int coroutine_fn bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,

         if (ret == -ENOTSUP) {
             /* Fall back to bounce buffer if write zeroes is unsupported */
-            int max_transfer = MIN_NON_ZERO(bs->bl.max_transfer,
-                                            MAX_WRITE_ZEROES_BOUNCE_BUFFER);
             BdrvRequestFlags write_flags = flags & ~BDRV_REQ_ZERO_WRITE;

             if ((flags & BDRV_REQ_FUA) &&