[PULL,5/7] file-posix: Support BDRV_REQ_NO_FALLBACK for zero writes
diff mbox series

Message ID 20190326155157.3719-6-kwolf@redhat.com
State New
Headers show
Series
  • [PULL,1/7] iotests: add 248: test resume mirror after auto pause on ENOSPC
Related show

Commit Message

Kevin Wolf March 26, 2019, 3:51 p.m. UTC
We know that the kernel implements a slow fallback code path for
BLKZEROOUT, so if BDRV_REQ_NO_FALLBACK is given, we shouldn't call it.
The other operations we call in the context of .bdrv_co_pwrite_zeroes
should usually be quick, so no modification should be needed for them.
If we ever notice that there are additional problematic cases, we can
still make these conditional as well.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Acked-by: Eric Blake <eblake@redhat.com>
---
 include/block/raw-aio.h |  1 +
 block/file-posix.c      | 24 ++++++++++++++++--------
 2 files changed, 17 insertions(+), 8 deletions(-)

Comments

Eric Blake Aug. 15, 2019, 2:44 a.m. UTC | #1
On 3/26/19 10:51 AM, Kevin Wolf wrote:
> We know that the kernel implements a slow fallback code path for
> BLKZEROOUT, so if BDRV_REQ_NO_FALLBACK is given, we shouldn't call it.
> The other operations we call in the context of .bdrv_co_pwrite_zeroes
> should usually be quick, so no modification should be needed for them.
> If we ever notice that there are additional problematic cases, we can
> still make these conditional as well.

Are there cases where fallocate(FALLOC_FL_ZERO_RANGE) falls back to slow
writes?  It may be fast on some file systems, but when used on a block
device, that may equally trigger slow fallbacks.  The man page is not
clear on that fact; I suspect that there may be cases in there that need
to be made conditional (it would be awesome if the kernel folks would
give us another FALLOC_ flag when we want to guarantee no fallback).

By the way, is there an easy setup to prove (maybe some qemu-img convert
command on a specially-prepared source image) whether the no fallback
flag makes a difference?  I'm about to cross-post a series of patches to
nbd/qemu/nbdkit/libnbd that adds a new NBD_CMD_FLAG_FAST_ZERO which fits
the bill of BDRV_REQ_NO_FALLBACK, but would like to include some
benchmark numbers in my cover letter if I can reproduce a setup where it
matters.

And this patch has a bug:

> +++ b/block/file-posix.c
> @@ -652,7 +652,7 @@ static int raw_open_common(BlockDriverState *bs, QDict *options,
>      }
>  #endif
>  
> -    bs->supported_zero_flags = BDRV_REQ_MAY_UNMAP;
> +    bs->supported_zero_flags = BDRV_REQ_MAY_UNMAP | BDRV_REQ_NO_FALLBACK;
>      ret = 0;
>  fail:
>      if (filename && (bdrv_flags & BDRV_O_TEMPORARY)) {
> @@ -1500,14 +1500,19 @@ static ssize_t handle_aiocb_write_zeroes_block(RawPosixAIOData *aiocb)
{
    int ret = -ENOTSUP;
    BDRVRawState *s = aiocb->bs->opaque;

    if (!s->has_write_zeroes) {
        return -ENOTSUP;
>      }

At this point, ret is -ENOTSUP.

>  
>  #ifdef BLKZEROOUT
> -    do {
> -        uint64_t range[2] = { aiocb->aio_offset, aiocb->aio_nbytes };
> -        if (ioctl(aiocb->aio_fildes, BLKZEROOUT, range) == 0) {
> -            return 0;
> -        }
> -    } while (errno == EINTR);
> +    /* The BLKZEROOUT implementation in the kernel doesn't set
> +     * BLKDEV_ZERO_NOFALLBACK, so we can't call this if we have to avoid slow
> +     * fallbacks. */
> +    if (!(aiocb->aio_type & QEMU_AIO_NO_FALLBACK)) {
> +        do {
> +            uint64_t range[2] = { aiocb->aio_offset, aiocb->aio_nbytes };
> +            if (ioctl(aiocb->aio_fildes, BLKZEROOUT, range) == 0) {
> +                return 0;
> +            }
> +        } while (errno == EINTR);
>  
> -    ret = translate_err(-errno);
> +        ret = translate_err(-errno);
> +    }

If the very first call to this function is with NO_FALLBACK, then this
'if' is skipped,

>  #endif
>  
>      if (ret == -ENOTSUP) {
        s->has_write_zeroes = false;
    }

and we set s->has_write_zeroes to false, permanently disabling any
BLKZEROOUT attempts in future calls, even if the future calls no longer
pass the NO_FALLBACK flag.
Kevin Wolf Aug. 15, 2019, 10:29 a.m. UTC | #2
Am 15.08.2019 um 04:44 hat Eric Blake geschrieben:
> On 3/26/19 10:51 AM, Kevin Wolf wrote:
> > We know that the kernel implements a slow fallback code path for
> > BLKZEROOUT, so if BDRV_REQ_NO_FALLBACK is given, we shouldn't call it.
> > The other operations we call in the context of .bdrv_co_pwrite_zeroes
> > should usually be quick, so no modification should be needed for them.
> > If we ever notice that there are additional problematic cases, we can
> > still make these conditional as well.
> 
> Are there cases where fallocate(FALLOC_FL_ZERO_RANGE) falls back to slow
> writes?  It may be fast on some file systems, but when used on a block
> device, that may equally trigger slow fallbacks.  The man page is not
> clear on that fact; I suspect that there may be cases in there that need
> to be made conditional (it would be awesome if the kernel folks would
> give us another FALLOC_ flag when we want to guarantee no fallback).

The NO_FALLBACK changes were based on the Linux code rather than
documentation because no interface is explicitly documented to forbid
fallbacks.

I think for file systems, we can generally assume that we don't get
fallbacks because for file systems, just deallocating blocks is the
easiest way to implement the function anyway. (Hm, or is it when we
don't punch holes...?)

And for block devices, we don't try FALLOC_FL_ZERO_RANGE because it also
involves the same slow fallback as BLKZEROOUT. In other words,
bdrv_co_pwrite_zeroes() with NO_FALLBACK, but without MAY_UNMAP, always
fails on Linux block devices, and we fall back to emulation in user
space.

We would need a kernel interface that calls blkdev_issue_zeroout() with
BLKDEV_ZERO_NOUNMAP | BLKDEV_ZERO_NOFALLBACK, but no such interface
exists.

When I talked to some file system people, they insisted that "efficient"
or "fast" wasn't well-defined enough for them or something, so if we
want to get a kernel change, maybe a new block device ioctl would be the
most realistic thing.

We do use FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE for MAY_UNMAP,
which works for both file systems (I assume - each file system has a
separate implementation) and block devices without slow fallbacks.

qemu-img create sets MAY_UNMAP, so the case we are most interested in is
covered with a fast implementation.

> By the way, is there an easy setup to prove (maybe some qemu-img convert
> command on a specially-prepared source image) whether the no fallback
> flag makes a difference?  I'm about to cross-post a series of patches to
> nbd/qemu/nbdkit/libnbd that adds a new NBD_CMD_FLAG_FAST_ZERO which fits
> the bill of BDRV_REQ_NO_FALLBACK, but would like to include some
> benchmark numbers in my cover letter if I can reproduce a setup where it
> matters.

Hm, the original case came from Nir, maybe he can suggest something.

You'll definitely need a block device that doesn't support
FALLOC_FL_PUNCH_HOLE, otherwise you can't trigger the fallback. My
first though was a loop device, but this actually does support the
operation and passes it through to the underlying file system. So maybe
if you know a file system that doesn't support it. Or if you have an old
hard disk handy.

Or... Actually there is an easily available block device that doesn't
suppport zero writes: The in-kernel NBD client! :-)

And I think I remember now how I tested this back then:

1. qemu-nbd exports an image with very slow throttling enabled
   (throttling affects writes, but not write_zeroes)

2. Attach the NBD device to /dev/nbd0

3. Convert to there (use a second NBD connection to test the fix)

> And this patch has a bug:
> 
> > +++ b/block/file-posix.c
> > @@ -652,7 +652,7 @@ static int raw_open_common(BlockDriverState *bs, QDict *options,
> >      }
> >  #endif
> >  
> > -    bs->supported_zero_flags = BDRV_REQ_MAY_UNMAP;
> > +    bs->supported_zero_flags = BDRV_REQ_MAY_UNMAP | BDRV_REQ_NO_FALLBACK;
> >      ret = 0;
> >  fail:
> >      if (filename && (bdrv_flags & BDRV_O_TEMPORARY)) {
> > @@ -1500,14 +1500,19 @@ static ssize_t handle_aiocb_write_zeroes_block(RawPosixAIOData *aiocb)
> {
>     int ret = -ENOTSUP;
>     BDRVRawState *s = aiocb->bs->opaque;
> 
>     if (!s->has_write_zeroes) {
>         return -ENOTSUP;
> >      }
> 
> At this point, ret is -ENOTSUP.
> 
> >  
> >  #ifdef BLKZEROOUT
> > -    do {
> > -        uint64_t range[2] = { aiocb->aio_offset, aiocb->aio_nbytes };
> > -        if (ioctl(aiocb->aio_fildes, BLKZEROOUT, range) == 0) {
> > -            return 0;
> > -        }
> > -    } while (errno == EINTR);
> > +    /* The BLKZEROOUT implementation in the kernel doesn't set
> > +     * BLKDEV_ZERO_NOFALLBACK, so we can't call this if we have to avoid slow
> > +     * fallbacks. */
> > +    if (!(aiocb->aio_type & QEMU_AIO_NO_FALLBACK)) {
> > +        do {
> > +            uint64_t range[2] = { aiocb->aio_offset, aiocb->aio_nbytes };
> > +            if (ioctl(aiocb->aio_fildes, BLKZEROOUT, range) == 0) {
> > +                return 0;
> > +            }
> > +        } while (errno == EINTR);
> >  
> > -    ret = translate_err(-errno);
> > +        ret = translate_err(-errno);
> > +    }
> 
> If the very first call to this function is with NO_FALLBACK, then this
> 'if' is skipped,
> 
> >  #endif
> >  
> >      if (ret == -ENOTSUP) {
>         s->has_write_zeroes = false;
>     }
> 
> and we set s->has_write_zeroes to false, permanently disabling any
> BLKZEROOUT attempts in future calls, even if the future calls no longer
> pass the NO_FALLBACK flag.

Right, this should be moved inside the if (!NO_FALLBACK) block.

Kevin
Nir Soffer Aug. 17, 2019, 5:45 p.m. UTC | #3
On Thu, Aug 15, 2019 at 1:29 PM Kevin Wolf <kwolf@redhat.com> wrote:

> Am 15.08.2019 um 04:44 hat Eric Blake geschrieben:
> > On 3/26/19 10:51 AM, Kevin Wolf wrote:
> > > We know that the kernel implements a slow fallback code path for
> > > BLKZEROOUT, so if BDRV_REQ_NO_FALLBACK is given, we shouldn't call it.
> > > The other operations we call in the context of .bdrv_co_pwrite_zeroes
> > > should usually be quick, so no modification should be needed for them.
> > > If we ever notice that there are additional problematic cases, we can
> > > still make these conditional as well.
> >
> > Are there cases where fallocate(FALLOC_FL_ZERO_RANGE) falls back to slow
> > writes?  It may be fast on some file systems, but when used on a block
> > device, that may equally trigger slow fallbacks.  The man page is not
> > clear on that fact; I suspect that there may be cases in there that need
> > to be made conditional (it would be awesome if the kernel folks would
> > give us another FALLOC_ flag when we want to guarantee no fallback).
>
> The NO_FALLBACK changes were based on the Linux code rather than
> documentation because no interface is explicitly documented to forbid
> fallbacks.
>
> I think for file systems, we can generally assume that we don't get
> fallbacks because for file systems, just deallocating blocks is the
> easiest way to implement the function anyway. (Hm, or is it when we
> don't punch holes...?)
>
> And for block devices, we don't try FALLOC_FL_ZERO_RANGE because it also
> involves the same slow fallback as BLKZEROOUT. In other words,
> bdrv_co_pwrite_zeroes() with NO_FALLBACK, but without MAY_UNMAP, always
> fails on Linux block devices, and we fall back to emulation in user
> space.
>
> We would need a kernel interface that calls blkdev_issue_zeroout() with
> BLKDEV_ZERO_NOUNMAP | BLKDEV_ZERO_NOFALLBACK, but no such interface
> exists.
>
> When I talked to some file system people, they insisted that "efficient"
> or "fast" wasn't well-defined enough for them or something, so if we
> want to get a kernel change, maybe a new block device ioctl would be the
> most realistic thing.
>
> We do use FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE for MAY_UNMAP,
> which works for both file systems (I assume - each file system has a
> separate implementation) and block devices without slow fallbacks.
>
> qemu-img create sets MAY_UNMAP, so the case we are most interested in is
> covered with a fast implementation.
>
> > By the way, is there an easy setup to prove (maybe some qemu-img convert
> > command on a specially-prepared source image) whether the no fallback
> > flag makes a difference?  I'm about to cross-post a series of patches to
> > nbd/qemu/nbdkit/libnbd that adds a new NBD_CMD_FLAG_FAST_ZERO which fits
> > the bill of BDRV_REQ_NO_FALLBACK, but would like to include some
> > benchmark numbers in my cover letter if I can reproduce a setup where it
> > matters.
>
> Hm, the original case came from Nir, maybe he can suggest something.
>

The original case came from RHEL 7.{5,6}. The flow was:

qemu-img convert -> nbdkit rhv plugin -> imageio -> storage

nbdkit got NBD_CMD_WRITE_ZEROES request, converted it to imageio ZERO
request.

For block devices, imageio was trying:
1. fallocate(ZERO_RANGE) - fails
2. ioctl(BLKZEROOUT) - succeeds

See
https://github.com/oVirt/ovirt-imageio/blob/ca70170886b0c1fbeca8640b12bcf54f01a3fea0/common/ovirt_imageio_common/backends/file.py#L247

BLKZEROOUT can be fast (100 GiB/s) or slow (100 MiB/s) depending on the
server,
and on the allocation status of that area.

On our current storage (3PAR), if the device is fully allocated, for
example:

   dd if=/dev/zero bs=8M of=/dev/vg/lv

Then blkdiscard -z is slow (800 MiB/s):

But if you discard the device:

    blkdiscard /dev/vg/lv

blkdiscard -z becomes fast (100 GiB/s).

Previously we had XtremIO storage, which was able to zero 50 GiB/s
regardless
of the allocation.

You'll definitely need a block device that doesn't support
> FALLOC_FL_PUNCH_HOLE,


Old kernels (CentOS 7) did not support this.

# uname -r
3.10.0-957.21.3.el7.x86_64

# strace -e trace=fallocate fallocate -l 100m /dev/loop0
fallocate(3, 0, 0, 104857600)           = -1 ENODEV (No such device)
fallocate: fallocate failed: No such device
+++ exited with 1 +++

# strace -e trace=fallocate fallocate -p -l 100m /dev/loop0
fallocate(3, FALLOC_FL_KEEP_SIZE|FALLOC_FL_PUNCH_HOLE, 0, 104857600) = -1
ENODEV (No such device)
fallocate: fallocate failed: No such device
+++ exited with 1 +++

# strace -e trace=fallocate fallocate -z -l 100m /dev/loop0
fallocate(3, FALLOC_FL_ZERO_RANGE, 0, 104857600) = -1 ENODEV (No such
device)
fallocate: fallocate failed: No such device
+++ exited with 1 +++

otherwise you can't trigger the fallback. My
> first though was a loop device, but this actually does support the
> operation and passes it through to the underlying file system. So maybe
> if you know a file system that doesn't support it. Or if you have an old
> hard disk handy.

...

Nir

Patch
diff mbox series

diff --git a/include/block/raw-aio.h b/include/block/raw-aio.h
index 6799614e56..ba223dd1f1 100644
--- a/include/block/raw-aio.h
+++ b/include/block/raw-aio.h
@@ -40,6 +40,7 @@ 
 /* AIO flags */
 #define QEMU_AIO_MISALIGNED   0x1000
 #define QEMU_AIO_BLKDEV       0x2000
+#define QEMU_AIO_NO_FALLBACK  0x4000
 
 
 /* linux-aio.c - Linux native implementation */
diff --git a/block/file-posix.c b/block/file-posix.c
index d102f3b222..db4cccbe51 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -652,7 +652,7 @@  static int raw_open_common(BlockDriverState *bs, QDict *options,
     }
 #endif
 
-    bs->supported_zero_flags = BDRV_REQ_MAY_UNMAP;
+    bs->supported_zero_flags = BDRV_REQ_MAY_UNMAP | BDRV_REQ_NO_FALLBACK;
     ret = 0;
 fail:
     if (filename && (bdrv_flags & BDRV_O_TEMPORARY)) {
@@ -1500,14 +1500,19 @@  static ssize_t handle_aiocb_write_zeroes_block(RawPosixAIOData *aiocb)
     }
 
 #ifdef BLKZEROOUT
-    do {
-        uint64_t range[2] = { aiocb->aio_offset, aiocb->aio_nbytes };
-        if (ioctl(aiocb->aio_fildes, BLKZEROOUT, range) == 0) {
-            return 0;
-        }
-    } while (errno == EINTR);
+    /* The BLKZEROOUT implementation in the kernel doesn't set
+     * BLKDEV_ZERO_NOFALLBACK, so we can't call this if we have to avoid slow
+     * fallbacks. */
+    if (!(aiocb->aio_type & QEMU_AIO_NO_FALLBACK)) {
+        do {
+            uint64_t range[2] = { aiocb->aio_offset, aiocb->aio_nbytes };
+            if (ioctl(aiocb->aio_fildes, BLKZEROOUT, range) == 0) {
+                return 0;
+            }
+        } while (errno == EINTR);
 
-    ret = translate_err(-errno);
+        ret = translate_err(-errno);
+    }
 #endif
 
     if (ret == -ENOTSUP) {
@@ -2659,6 +2664,9 @@  raw_do_pwrite_zeroes(BlockDriverState *bs, int64_t offset, int bytes,
     if (blkdev) {
         acb.aio_type |= QEMU_AIO_BLKDEV;
     }
+    if (flags & BDRV_REQ_NO_FALLBACK) {
+        acb.aio_type |= QEMU_AIO_NO_FALLBACK;
+    }
 
     if (flags & BDRV_REQ_MAY_UNMAP) {
         acb.aio_type |= QEMU_AIO_DISCARD;