
block: posix: Always allocate the first block

Message ID 20190816212122.8816-1-nsoffer@redhat.com (mailing list archive)
State New, archived
Series block: posix: Always allocate the first block

Commit Message

Nir Soffer Aug. 16, 2019, 9:21 p.m. UTC
When creating an image with preallocation "off" or "falloc", the first
block of the image is typically not allocated. When using Gluster
storage backed by an XFS filesystem, reading this block using direct I/O
succeeds regardless of request length, fooling alignment detection.

In this case we fall back to a safe value (4096) instead of the optimal
value (512), which may lead to unneeded data copying when aligning
requests.  Allocating the first block avoids the fallback.

With this change, an image created with preallocation=off always has at
least one filesystem block allocated:

    $ ./qemu-img create -f raw test.raw 1g
    Formatting 'test.raw', fmt=raw size=1073741824

    $ ls -lhs test.raw
    4.0K -rw-r--r--. 1 nsoffer nsoffer 1.0G Aug 16 23:48 test.raw

I did quick performance tests for these flows:
- Provisioning a VM with a new raw image.
- Copying disks with qemu-img convert to a new raw target image.

I installed Fedora 29 Server on a raw sparse image, measuring the time
from clicking "Begin installation" until the "Reboot" button appears:

Before(s)  After(s)     Diff(%)
-------------------------------
     356        389        +8.4

I ran this only once, so we cannot tell much from these results.

The second test was cloning the installation image with qemu-img
convert, doing 10 runs:

    for i in $(seq 10); do
        rm -f dst.raw
        sleep 10
        time ./qemu-img convert -f raw -O raw -t none -T none src.raw dst.raw
    done

Here is a table comparing the total time spent:

Type    Before(s)   After(s)    Diff(%)
---------------------------------------
real      530.028    469.123      -11.4
user       17.204     10.768      -37.4
sys        17.881      7.011      -60.7

Here we see a very clear improvement in CPU usage.

Signed-off-by: Nir Soffer <nsoffer@redhat.com>
---
 block/file-posix.c         | 25 +++++++++++++++++++++++++
 tests/qemu-iotests/150.out |  1 +
 tests/qemu-iotests/160     |  4 ++++
 tests/qemu-iotests/175     | 19 +++++++++++++------
 tests/qemu-iotests/175.out |  8 ++++----
 tests/qemu-iotests/221.out | 12 ++++++++----
 tests/qemu-iotests/253.out | 12 ++++++++----
 7 files changed, 63 insertions(+), 18 deletions(-)
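
The allocation effect described in the commit message can be reproduced outside QEMU. The following is a minimal sketch, not the patch itself: allocate_first_block() mirrors the helper added by the patch, while sparse_file_blocks(), the temporary file name, and the use of buffered I/O are assumptions made for illustration. With O_DIRECT, a one-byte write may be rejected by the filesystem, so the sketch deliberately opens the file without it.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Same approach as the patch: write one zero byte at offset 0,
 * retrying if the call is interrupted by a signal. */
static int allocate_first_block(int fd)
{
    ssize_t n;

    do {
        n = pwrite(fd, "\0", 1, 0);
    } while (n == -1 && errno == EINTR);

    return (n == -1) ? -errno : 0;
}

/*
 * Hypothetical helper (not part of QEMU): create a temporary sparse file
 * of `size` bytes, optionally allocate its first block, and return the
 * number of 512-byte units reported in st_blocks, or -1 on error.
 */
static long long sparse_file_blocks(off_t size, int allocate)
{
    char path[] = "/tmp/first-block-XXXXXX";
    struct stat st;
    long long blocks = -1;
    int fd;

    assert(size > 0);

    fd = mkstemp(path);
    if (fd < 0) {
        return -1;
    }

    /* ftruncate() extends the file without allocating data blocks. */
    if (ftruncate(fd, size) == 0 &&
        (!allocate || allocate_first_block(fd) == 0) &&
        fsync(fd) == 0 &&
        fstat(fd, &st) == 0) {
        blocks = (long long)st.st_blocks;
    }

    close(fd);
    unlink(path);
    return blocks;
}
```

Calling sparse_file_blocks(1024 * 1024, 0) and sparse_file_blocks(1024 * 1024, 1) and comparing the results shows how many blocks the local filesystem charges for the single written byte. The exact numbers depend on the filesystem, so none are quoted here.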

Comments

John Snow Aug. 16, 2019, 9:57 p.m. UTC | #1
On 8/16/19 5:21 PM, Nir Soffer wrote:
> When creating an image with preallocation "off" or "falloc", the first
> block of the image is typically not allocated. When using Gluster
> storage backed by XFS filesystem, reading this block using direct I/O
> succeeds regardless of request length, fooling alignment detection.
> 
> In this case we fallback to a safe value (4096) instead of the optimal
> value (512), which may lead to unneeded data copying when aligning
> requests.  Allocating the first block avoids the fallback.
> 

Where does this detection/fallback happen? (Can it be improved?)

> When using preallocation=off, we always allocate at least one filesystem
> block:
> 
>     $ ./qemu-img create -f raw test.raw 1g
>     Formatting 'test.raw', fmt=raw size=1073741824
> 
>     $ ls -lhs test.raw
>     4.0K -rw-r--r--. 1 nsoffer nsoffer 1.0G Aug 16 23:48 test.raw
> 
> I did quick performance tests for these flows:
> - Provisioning a VM with a new raw image.
> - Copying disks with qemu-img convert to new raw target image
> 
> I installed Fedora 29 server on raw sparse image, measuring the time
> from clicking "Begin installation" until the "Reboot" button appears:
> 
> Before(s)  After(s)     Diff(%)
> -------------------------------
>      356        389        +8.4
> 
> I ran this only once, so we cannot tell much from these results.
> 

That seems like a pretty big difference for just having pre-allocated a
single block. What was the actual command line / block graph for that test?

Was this over a network that could explain the variance?

> The second test was cloning the installation image with qemu-img
> convert, doing 10 runs:
> 
>     for i in $(seq 10); do
>         rm -f dst.raw
>         sleep 10
>         time ./qemu-img convert -f raw -O raw -t none -T none src.raw dst.raw
>     done
> 
> Here is a table comparing the total time spent:
> 
> Type    Before(s)   After(s)    Diff(%)
> ---------------------------------------
> real      530.028    469.123      -11.4
> user       17.204     10.768      -37.4
> sys        17.881      7.011      -60.7
> 
> Here we see very clear improvement in CPU usage.
> 

Hard to argue much with that. I feel a little strange trying to force
the allocation of the first block, but I suppose in practice "almost no
preallocation" is indistinguishable from "exactly no preallocation" if
you squint.

> Signed-off-by: Nir Soffer <nsoffer@redhat.com>
> ---
>  block/file-posix.c         | 25 +++++++++++++++++++++++++
>  tests/qemu-iotests/150.out |  1 +
>  tests/qemu-iotests/160     |  4 ++++
>  tests/qemu-iotests/175     | 19 +++++++++++++------
>  tests/qemu-iotests/175.out |  8 ++++----
>  tests/qemu-iotests/221.out | 12 ++++++++----
>  tests/qemu-iotests/253.out | 12 ++++++++----
>  7 files changed, 63 insertions(+), 18 deletions(-)
> 
> diff --git a/block/file-posix.c b/block/file-posix.c
> index b9c33c8f6c..3964dd2021 100644
> --- a/block/file-posix.c
> +++ b/block/file-posix.c
> @@ -1755,6 +1755,27 @@ static int handle_aiocb_discard(void *opaque)
>      return ret;
>  }
>  
> +/*
> + * Help alignment detection by allocating the first block.
> + *
> + * When reading with direct I/O from unallocated area on Gluster backed by XFS,
> + * reading succeeds regardless of request length. In this case we fallback to
> + * safe alignment which is not optimal. Allocating the first block avoids this
> + * fallback.
> + *
> + * Returns: 0 on success, -errno on failure.
> + */
> +static int allocate_first_block(int fd)
> +{
> +    ssize_t n;
> +
> +    do {
> +        n = pwrite(fd, "\0", 1, 0);
> +    } while (n == -1 && errno == EINTR);
> +
> +    return (n == -1) ? -errno : 0;
> +}
> +
>  static int handle_aiocb_truncate(void *opaque)
>  {
>      RawPosixAIOData *aiocb = opaque;
> @@ -1794,6 +1815,8 @@ static int handle_aiocb_truncate(void *opaque)
>                  /* posix_fallocate() doesn't set errno. */
>                  error_setg_errno(errp, -result,
>                                   "Could not preallocate new data");
> +            } else if (current_length == 0) {
> +                allocate_first_block(fd);
>              }
>          } else {
>              result = 0;
> @@ -1855,6 +1878,8 @@ static int handle_aiocb_truncate(void *opaque)
>          if (ftruncate(fd, offset) != 0) {
>              result = -errno;
>              error_setg_errno(errp, -result, "Could not resize file");
> +        } else if (current_length == 0 && offset > current_length) {
> +            allocate_first_block(fd);
>          }
>          return result;
>      default:
> diff --git a/tests/qemu-iotests/150.out b/tests/qemu-iotests/150.out
> index 2a54e8dcfa..3cdc7727a5 100644
> --- a/tests/qemu-iotests/150.out
> +++ b/tests/qemu-iotests/150.out
> @@ -3,6 +3,7 @@ QA output created by 150
>  === Mapping sparse conversion ===
>  
>  Offset          Length          File
> +0               0x1000          TEST_DIR/t.IMGFMT
>  
>  === Mapping non-sparse conversion ===
>  
> diff --git a/tests/qemu-iotests/160 b/tests/qemu-iotests/160
> index df89d3864b..ad2d054a47 100755
> --- a/tests/qemu-iotests/160
> +++ b/tests/qemu-iotests/160
> @@ -57,6 +57,10 @@ for skip in $TEST_SKIP_BLOCKS; do
>      $QEMU_IMG dd if="$TEST_IMG" of="$TEST_IMG.out" skip="$skip" -O "$IMGFMT" \
>          2> /dev/null
>      TEST_IMG="$TEST_IMG.out" _check_test_img
> +
> +    # We always write the first byte of an image.
> +    printf "\0" > "$TEST_IMG.out.dd"
> +
>      dd if="$TEST_IMG" of="$TEST_IMG.out.dd" skip="$skip" status=none
>  
>      echo
> diff --git a/tests/qemu-iotests/175 b/tests/qemu-iotests/175
> index 51e62c8276..c6a3a7bb1e 100755
> --- a/tests/qemu-iotests/175
> +++ b/tests/qemu-iotests/175
> @@ -37,14 +37,16 @@ trap "_cleanup; exit \$status" 0 1 2 3 15
>  # the file size.  This function hides the resulting difference in the
>  # stat -c '%b' output.
>  # Parameter 1: Number of blocks an empty file occupies
> -# Parameter 2: Image size in bytes
> +# Parameter 2: Minimal number of blocks in an image
> +# Parameter 3: Image size in bytes
>  _filter_blocks()
>  {
>      extra_blocks=$1
> -    img_size=$2
> +    min_blocks=$2
> +    img_size=$3
>  
> -    sed -e "s/blocks=$extra_blocks\\(\$\\|[^0-9]\\)/nothing allocated/" \
> -        -e "s/blocks=$((extra_blocks + img_size / 512))\\(\$\\|[^0-9]\\)/everything allocated/"
> +    sed -e "s/blocks=$((extra_blocks + min_blocks))\\(\$\\|[^0-9]\\)/min allocation/" \
> +        -e "s/blocks=$((extra_blocks + img_size / 512))\\(\$\\|[^0-9]\\)/max allocation/"
>  }
>  
>  # get standard environment, filters and checks
> @@ -60,16 +62,21 @@ size=$((1 * 1024 * 1024))
>  touch "$TEST_DIR/empty"
>  extra_blocks=$(stat -c '%b' "$TEST_DIR/empty")
>  
> +# We always write the first byte; check how many blocks this filesystem
> +# allocates to match empty image allocation.
> +printf "\0" > "$TEST_DIR/empty"
> +min_blocks=$(stat -c '%b' "$TEST_DIR/empty")
> +
>  echo
>  echo "== creating image with default preallocation =="
>  _make_test_img $size | _filter_imgfmt
> -stat -c "size=%s, blocks=%b" $TEST_IMG | _filter_blocks $extra_blocks $size
> +stat -c "size=%s, blocks=%b" $TEST_IMG | _filter_blocks $extra_blocks $min_blocks $size
>  
>  for mode in off full falloc; do
>      echo
>      echo "== creating image with preallocation $mode =="
>      IMGOPTS=preallocation=$mode _make_test_img $size | _filter_imgfmt
> -    stat -c "size=%s, blocks=%b" $TEST_IMG | _filter_blocks $extra_blocks $size
> +    stat -c "size=%s, blocks=%b" $TEST_IMG | _filter_blocks $extra_blocks $min_blocks $size
>  done
>  
>  # success, all done
> diff --git a/tests/qemu-iotests/175.out b/tests/qemu-iotests/175.out
> index 6d9a5ed84e..263e521262 100644
> --- a/tests/qemu-iotests/175.out
> +++ b/tests/qemu-iotests/175.out
> @@ -2,17 +2,17 @@ QA output created by 175
>  
>  == creating image with default preallocation ==
>  Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1048576
> -size=1048576, nothing allocated
> +size=1048576, min allocation
>  
>  == creating image with preallocation off ==
>  Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1048576 preallocation=off
> -size=1048576, nothing allocated
> +size=1048576, min allocation
>  
>  == creating image with preallocation full ==
>  Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1048576 preallocation=full
> -size=1048576, everything allocated
> +size=1048576, max allocation
>  
>  == creating image with preallocation falloc ==
>  Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1048576 preallocation=falloc
> -size=1048576, everything allocated
> +size=1048576, max allocation
>   *** done
> diff --git a/tests/qemu-iotests/221.out b/tests/qemu-iotests/221.out
> index 9f9dd52bb0..dca024a0c3 100644
> --- a/tests/qemu-iotests/221.out
> +++ b/tests/qemu-iotests/221.out
> @@ -3,14 +3,18 @@ QA output created by 221
>  === Check mapping of unaligned raw image ===
>  
>  Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=65537
> -[{ "start": 0, "length": 66048, "depth": 0, "zero": true, "data": false, "offset": OFFSET}]
> -[{ "start": 0, "length": 66048, "depth": 0, "zero": true, "data": false, "offset": OFFSET}]
> +[{ "start": 0, "length": 4096, "depth": 0, "zero": false, "data": true, "offset": OFFSET},
> +{ "start": 4096, "length": 61952, "depth": 0, "zero": true, "data": false, "offset": OFFSET}]
> +[{ "start": 0, "length": 4096, "depth": 0, "zero": false, "data": true, "offset": OFFSET},
> +{ "start": 4096, "length": 61952, "depth": 0, "zero": true, "data": false, "offset": OFFSET}]
>  wrote 1/1 bytes at offset 65536
>  1 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> -[{ "start": 0, "length": 65536, "depth": 0, "zero": true, "data": false, "offset": OFFSET},
> +[{ "start": 0, "length": 4096, "depth": 0, "zero": false, "data": true, "offset": OFFSET},
> +{ "start": 4096, "length": 61440, "depth": 0, "zero": true, "data": false, "offset": OFFSET},
>  { "start": 65536, "length": 1, "depth": 0, "zero": false, "data": true, "offset": OFFSET},
>  { "start": 65537, "length": 511, "depth": 0, "zero": true, "data": false, "offset": OFFSET}]
> -[{ "start": 0, "length": 65536, "depth": 0, "zero": true, "data": false, "offset": OFFSET},
> +[{ "start": 0, "length": 4096, "depth": 0, "zero": false, "data": true, "offset": OFFSET},
> +{ "start": 4096, "length": 61440, "depth": 0, "zero": true, "data": false, "offset": OFFSET},
>  { "start": 65536, "length": 1, "depth": 0, "zero": false, "data": true, "offset": OFFSET},
>  { "start": 65537, "length": 511, "depth": 0, "zero": true, "data": false, "offset": OFFSET}]
>  *** done
> diff --git a/tests/qemu-iotests/253.out b/tests/qemu-iotests/253.out
> index 607c0baa0b..3d08b305d7 100644
> --- a/tests/qemu-iotests/253.out
> +++ b/tests/qemu-iotests/253.out
> @@ -3,12 +3,16 @@ QA output created by 253
>  === Check mapping of unaligned raw image ===
>  
>  Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1048575
> -[{ "start": 0, "length": 1048576, "depth": 0, "zero": true, "data": false, "offset": OFFSET}]
> -[{ "start": 0, "length": 1048576, "depth": 0, "zero": true, "data": false, "offset": OFFSET}]
> +[{ "start": 0, "length": 4096, "depth": 0, "zero": false, "data": true, "offset": OFFSET},
> +{ "start": 4096, "length": 1044480, "depth": 0, "zero": true, "data": false, "offset": OFFSET}]
> +[{ "start": 0, "length": 4096, "depth": 0, "zero": false, "data": true, "offset": OFFSET},
> +{ "start": 4096, "length": 1044480, "depth": 0, "zero": true, "data": false, "offset": OFFSET}]
>  wrote 65535/65535 bytes at offset 983040
>  63.999 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> -[{ "start": 0, "length": 983040, "depth": 0, "zero": true, "data": false, "offset": OFFSET},
> +[{ "start": 0, "length": 4096, "depth": 0, "zero": false, "data": true, "offset": OFFSET},
> +{ "start": 4096, "length": 978944, "depth": 0, "zero": true, "data": false, "offset": OFFSET},
>  { "start": 983040, "length": 65536, "depth": 0, "zero": false, "data": true, "offset": OFFSET}]
> -[{ "start": 0, "length": 983040, "depth": 0, "zero": true, "data": false, "offset": OFFSET},
> +[{ "start": 0, "length": 4096, "depth": 0, "zero": false, "data": true, "offset": OFFSET},
> +{ "start": 4096, "length": 978944, "depth": 0, "zero": true, "data": false, "offset": OFFSET},
>  { "start": 983040, "length": 65536, "depth": 0, "zero": false, "data": true, "offset": OFFSET}]
>  *** done
>
Nir Soffer Aug. 16, 2019, 10:45 p.m. UTC | #2
On Sat, Aug 17, 2019 at 12:57 AM John Snow <jsnow@redhat.com> wrote:

> On 8/16/19 5:21 PM, Nir Soffer wrote:
> > When creating an image with preallocation "off" or "falloc", the first
> > block of the image is typically not allocated. When using Gluster
> > storage backed by XFS filesystem, reading this block using direct I/O
> > succeeds regardless of request length, fooling alignment detection.
> >
> > In this case we fallback to a safe value (4096) instead of the optimal
> > value (512), which may lead to unneeded data copying when aligning
> > requests.  Allocating the first block avoids the fallback.
> >
>
> Where does this detection/fallback happen? (Can it be improved?)
>

In raw_probe_alignment().

This patch explains the issues:
https://lists.nongnu.org/archive/html/qemu-block/2019-08/msg00568.html

Kevin and I discussed ways to improve it here:
https://lists.nongnu.org/archive/html/qemu-block/2019-08/msg00426.html
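
The detection and fallback can be modelled roughly as follows. This is a simplified sketch of the idea, not the code of raw_probe_alignment(): the candidate list, the probe_read_fn callback, and the two stand-in probes are assumptions made for illustration. The callback answers whether a direct I/O read of a given length succeeds. If even a one-byte read succeeds, the probe has learned nothing, so the safe value is used.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_ALIGN 4096  /* safe fallback when detection is inconclusive */

/* Hypothetical callback: true if a direct I/O read of `len` bytes succeeds. */
typedef bool (*probe_read_fn)(size_t len);

/*
 * Try increasing request lengths and return the first one that works.
 * A successful 1-byte read means the alignment cannot be detected
 * (the unallocated-block case in this thread), so fall back to MAX_ALIGN.
 * Returns 0 if no candidate length succeeds.
 */
static size_t probe_request_alignment(probe_read_fn try_read)
{
    static const size_t candidates[] = { 1, 512, 1024, 2048, 4096 };

    assert(try_read != NULL);

    for (size_t i = 0; i < sizeof(candidates) / sizeof(candidates[0]); i++) {
        if (try_read(candidates[i])) {
            return candidates[i] != 1 ? candidates[i] : MAX_ALIGN;
        }
    }
    return 0;
}

/* Stand-in for storage that enforces 512-byte request alignment. */
static bool read_ok_if_512_aligned(size_t len)
{
    return len % 512 == 0;
}

/* Stand-in for reading an unallocated block: every length succeeds. */
static bool read_ok_always(size_t len)
{
    (void)len;
    return true;
}
```

With read_ok_if_512_aligned the probe settles on 512, while read_ok_always ends in the 4096 fallback, which is the situation the patch avoids by making sure the first block is allocated.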

> > When using preallocation=off, we always allocate at least one filesystem
> > block:
> >
> >     $ ./qemu-img create -f raw test.raw 1g
> >     Formatting 'test.raw', fmt=raw size=1073741824
> >
> >     $ ls -lhs test.raw
> >     4.0K -rw-r--r--. 1 nsoffer nsoffer 1.0G Aug 16 23:48 test.raw
> >
> > I did quick performance tests for these flows:
> > - Provisioning a VM with a new raw image.
> > - Copying disks with qemu-img convert to new raw target image
> >
> > I installed Fedora 29 server on raw sparse image, measuring the time
> > from clicking "Begin installation" until the "Reboot" button appears:
> >
> > Before(s)  After(s)     Diff(%)
> > -------------------------------
> >      356        389        +8.4
> >
> > I ran this only once, so we cannot tell much from these results.
> >
>
> That seems like a pretty big difference for just having pre-allocated a
> single block. What was the actual command line / block graph for that test?
>

Having the first block allocated changes the alignment.

Before this patch, we detect request_alignment=1, so we fall back to 4096.
Then we detect buf_align=1, so we fall back to the value of request_alignment.

The guest sees a disk with:
logical_block_size = 512
physical_block_size = 512

But qemu uses:
request_alignment = 4096
buf_align = 4096

And the storage uses:
logical_block_size = 512
physical_block_size = 512

If the guest does direct I/O using 512-byte alignment, qemu has to copy
the buffers to align them to 4096 bytes.

After this patch, qemu detects the alignment correctly, so we have:

guest
logical_block_size = 512
physical_block_size = 512

qemu
request_alignment = 512
buf_align = 512

storage:
logical_block_size = 512
physical_block_size = 512

We expect this to be more efficient because qemu does not have to emulate
anything.
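
To make the cost concrete, here is a small sketch of how a request grows when it is rounded to the request alignment. It illustrates the arithmetic only and is not QEMU code; struct aligned_req and align_request() are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical result type: the request after rounding to the alignment. */
struct aligned_req {
    uint64_t offset;  /* start rounded down to the alignment */
    uint64_t bytes;   /* length after rounding the end up */
    bool widened;     /* true if the request had to grow */
};

/* Round [offset, offset + bytes) outwards to a power-of-two alignment. */
static struct aligned_req align_request(uint64_t offset, uint64_t bytes,
                                        uint64_t align)
{
    struct aligned_req r;
    uint64_t end;

    /* The mask arithmetic below only works for powers of two. */
    assert(align > 0 && (align & (align - 1)) == 0);

    r.offset = offset & ~(align - 1);
    end = (offset + bytes + align - 1) & ~(align - 1);
    r.bytes = end - r.offset;
    r.widened = r.offset != offset || end != offset + bytes;
    return r;
}
```

For a 512-byte guest request at offset 512, align_request(512, 512, 4096) yields offset 0 and 4096 bytes with widened set, so the data has to pass through an intermediate buffer. With an alignment of 512 the same request is returned unchanged.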

> Was this over a network that could explain the variance?
>

Maybe. This was a complete install of Fedora 29 Server, and I'm not sure
whether the installation accesses the network.

> > The second test was cloning the installation image with qemu-img
> > convert, doing 10 runs:
> >
> >     for i in $(seq 10); do
> >         rm -f dst.raw
> >         sleep 10
> >         time ./qemu-img convert -f raw -O raw -t none -T none src.raw dst.raw
> >     done
> >
> > Here is a table comparing the total time spent:
> >
> > Type    Before(s)   After(s)    Diff(%)
> > ---------------------------------------
> > real      530.028    469.123      -11.4
> > user       17.204     10.768      -37.4
> > sys        17.881      7.011      -60.7
> >
> > Here we see very clear improvement in CPU usage.
> >
>
> Hard to argue much with that. I feel a little strange trying to force
> the allocation of the first block, but I suppose in practice "almost no
> preallocation" is indistinguishable from "exactly no preallocation" if
> you squint.
>

Right.

The real issue is that filesystems and block devices do not expose the
alignment requirement for direct I/O, so we need to use these hacks and
assumptions.

With local XFS we use xfsctl(XFS_IOC_DIOINFO) to get request_alignment, but
this does not help for an XFS filesystem used by Gluster on the server side.

I hope that Niels is working on adding a similar ioctl for Gluster, so it can
expose the properties of the remote filesystem.

Nir
John Snow Aug. 16, 2019, 11 p.m. UTC | #3
On 8/16/19 6:45 PM, Nir Soffer wrote:
> On Sat, Aug 17, 2019 at 12:57 AM John Snow <jsnow@redhat.com> wrote:
> 
>     On 8/16/19 5:21 PM, Nir Soffer wrote:
>     > When creating an image with preallocation "off" or "falloc", the first
>     > block of the image is typically not allocated. When using Gluster
>     > storage backed by XFS filesystem, reading this block using direct I/O
>     > succeeds regardless of request length, fooling alignment detection.
>     >
>     > In this case we fallback to a safe value (4096) instead of the optimal
>     > value (512), which may lead to unneeded data copying when aligning
>     > requests.  Allocating the first block avoids the fallback.
>     >
> 
>     Where does this detection/fallback happen? (Can it be improved?)
> 
> 
> In raw_probe_alignment().
> 
> This patch explain the issues:
> https://lists.nongnu.org/archive/html/qemu-block/2019-08/msg00568.html
> 
> Here Kevin and me discussed ways to improve it:
> https://lists.nongnu.org/archive/html/qemu-block/2019-08/msg00426.html
> 

Thanks for the reading!
That does help explain this patch better.

>     > When using preallocation=off, we always allocate at least one filesystem
>     > block:
>     >
>     >     $ ./qemu-img create -f raw test.raw 1g
>     >     Formatting 'test.raw', fmt=raw size=1073741824
>     >
>     >     $ ls -lhs test.raw
>     >     4.0K -rw-r--r--. 1 nsoffer nsoffer 1.0G Aug 16 23:48 test.raw
>     >
>     > I did quick performance tests for these flows:
>     > - Provisioning a VM with a new raw image.
>     > - Copying disks with qemu-img convert to new raw target image
>     >
>     > I installed Fedora 29 server on raw sparse image, measuring the time
>     > from clicking "Begin installation" until the "Reboot" button appears:
>     >
>     > Before(s)  After(s)     Diff(%)
>     > -------------------------------
>     >      356        389        +8.4
>     >
>     > I ran this only once, so we cannot tell much from these results.
>     >
> 
>     That seems like a pretty big difference for just having pre-allocated a
>     single block. What was the actual command line / block graph for
>     that test?
> 
> 
> Having the first block allocated changes the alignment.
> 
> Before this patch, we detect request_alignment=1, so we fallback to 4096.
> Then we detect buf_align=1, so we fallback to value of request alignment.
> 
> The guest see a disk with:
> logical_block_size = 512
> physical_block_size = 512
> 
> But qemu uses:
> request_alignment = 4096
> buf_align = 4096
> 
> storage uses:
> logical_block_size = 512
> physical_block_size = 512
> 
> If the guest does direct I/O using 512 bytes alignment, qemu has to copy
> the buffer to align them to 4096 bytes.
> 
> After this patch, qemu detects the alignment correctly, so we have:
> 
> guest
> logical_block_size = 512
> physical_block_size = 512
> 
> qemu
> request_alignment = 512
> buf_align = 512
> 
> storage:
> logical_block_size = 512
> physical_block_size = 512
> 
> We expect this to be more efficient because qemu does not have to emulate
> anything.
> 
>     Was this over a network that could explain the variance?
> 
> 
> Maybe, this is complete install of Fedora 29 server, I'm not sure if the
> installation access the network.
> 
>     > The second test was cloning the installation image with qemu-img
>     > convert, doing 10 runs:
>     >
>     >     for i in $(seq 10); do
>     >         rm -f dst.raw
>     >         sleep 10
>     >         time ./qemu-img convert -f raw -O raw -t none -T none src.raw dst.raw
>     >     done
>     >
>     > Here is a table comparing the total time spent:
>     >
>     > Type    Before(s)   After(s)    Diff(%)
>     > ---------------------------------------
>     > real      530.028    469.123      -11.4
>     > user       17.204     10.768      -37.4
>     > sys        17.881      7.011      -60.7
>     >
>     > Here we see very clear improvement in CPU usage.
>     >
> 
>     Hard to argue much with that. I feel a little strange trying to force
>     the allocation of the first block, but I suppose in practice "almost no
>     preallocation" is indistinguishable from "exactly no preallocation" if
>     you squint.
> 
> 
> Right.
> 
> The real issue is that filesystems and block devices do not expose the
> alignment requirement for direct I/O, so we need to use these hacks and
> assumptions.
> 
> With local XFS we use xfsctl(XFS_IOC_DIOINFO) to get request_alignment,
> but this does not help for XFS filesystem used by Gluster on the server side.
> 
> I hope that Niels is working on adding similar ioctl for Gluster, so it
> can expose the properties of the remote filesystem.
> 
> Nir

That sounds quite a bit less hacky, but I agree we still have to do what
we can in the meantime.

(It looks like you've been hashing this out with Kevin for a while, so
I'm going to sheepishly defer to his judgment on this patch. While I
think it's probably a fine trade-off, I can't really say off-hand if
there's a better, more targeted way to accomplish it.)

--js
Nir Soffer Aug. 22, 2019, 11:30 a.m. UTC | #4
Max, did you have time to look at this?

On Sat, Aug 17, 2019 at 12:21 AM Nir Soffer <nirsof@gmail.com> wrote:

> When creating an image with preallocation "off" or "falloc", the first
> block of the image is typically not allocated. When using Gluster
> storage backed by XFS filesystem, reading this block using direct I/O
> succeeds regardless of request length, fooling alignment detection.
>
> In this case we fallback to a safe value (4096) instead of the optimal
> value (512), which may lead to unneeded data copying when aligning
> requests.  Allocating the first block avoids the fallback.
>
> When using preallocation=off, we always allocate at least one filesystem
> block:
>
>     $ ./qemu-img create -f raw test.raw 1g
>     Formatting 'test.raw', fmt=raw size=1073741824
>
>     $ ls -lhs test.raw
>     4.0K -rw-r--r--. 1 nsoffer nsoffer 1.0G Aug 16 23:48 test.raw
>
> I did quick performance tests for these flows:
> - Provisioning a VM with a new raw image.
> - Copying disks with qemu-img convert to new raw target image
>
> I installed Fedora 29 server on raw sparse image, measuring the time
> from clicking "Begin installation" until the "Reboot" button appears:
>
> Before(s)  After(s)     Diff(%)
> -------------------------------
>      356        389        +8.4
>
> I ran this only once, so we cannot tell much from these results.
>
> The second test was cloning the installation image with qemu-img
> convert, doing 10 runs:
>
>     for i in $(seq 10); do
>         rm -f dst.raw
>         sleep 10
>         time ./qemu-img convert -f raw -O raw -t none -T none src.raw
> dst.raw
>     done
>
> Here is a table comparing the total time spent:
>
> Type    Before(s)   After(s)    Diff(%)
> ---------------------------------------
> real      530.028    469.123      -11.4
> user       17.204     10.768      -37.4
> sys        17.881      7.011      -60.7
>
> Here we see very clear improvement in CPU usage.
>
> Signed-off-by: Nir Soffer <nsoffer@redhat.com>
> ---
>  block/file-posix.c         | 25 +++++++++++++++++++++++++
>  tests/qemu-iotests/150.out |  1 +
>  tests/qemu-iotests/160     |  4 ++++
>  tests/qemu-iotests/175     | 19 +++++++++++++------
>  tests/qemu-iotests/175.out |  8 ++++----
>  tests/qemu-iotests/221.out | 12 ++++++++----
>  tests/qemu-iotests/253.out | 12 ++++++++----
>  7 files changed, 63 insertions(+), 18 deletions(-)
>
> diff --git a/block/file-posix.c b/block/file-posix.c
> index b9c33c8f6c..3964dd2021 100644
> --- a/block/file-posix.c
> +++ b/block/file-posix.c
> @@ -1755,6 +1755,27 @@ static int handle_aiocb_discard(void *opaque)
>      return ret;
>  }
>
> +/*
> + * Help alignment detection by allocating the first block.
> + *
> + * When reading with direct I/O from unallocated area on Gluster backed by XFS,
> + * reading succeeds regardless of request length. In this case we fallback to
> + * safe alignment which is not optimal. Allocating the first block avoids this
> + * fallback.
> + *
> + * Returns: 0 on success, -errno on failure.
> + */
> +static int allocate_first_block(int fd)
> +{
> +    ssize_t n;
> +
> +    do {
> +        n = pwrite(fd, "\0", 1, 0);
> +    } while (n == -1 && errno == EINTR);
> +
> +    return (n == -1) ? -errno : 0;
> +}
> +
>  static int handle_aiocb_truncate(void *opaque)
>  {
>      RawPosixAIOData *aiocb = opaque;
> @@ -1794,6 +1815,8 @@ static int handle_aiocb_truncate(void *opaque)
>                  /* posix_fallocate() doesn't set errno. */
>                  error_setg_errno(errp, -result,
>                                   "Could not preallocate new data");
> +            } else if (current_length == 0) {
> +                allocate_first_block(fd);
>              }
>          } else {
>              result = 0;
> @@ -1855,6 +1878,8 @@ static int handle_aiocb_truncate(void *opaque)
>          if (ftruncate(fd, offset) != 0) {
>              result = -errno;
>              error_setg_errno(errp, -result, "Could not resize file");
> +        } else if (current_length == 0 && offset > current_length) {
> +            allocate_first_block(fd);
>          }
>          return result;
>      default:
> diff --git a/tests/qemu-iotests/150.out b/tests/qemu-iotests/150.out
> index 2a54e8dcfa..3cdc7727a5 100644
> --- a/tests/qemu-iotests/150.out
> +++ b/tests/qemu-iotests/150.out
> @@ -3,6 +3,7 @@ QA output created by 150
>  === Mapping sparse conversion ===
>
>  Offset          Length          File
> +0               0x1000          TEST_DIR/t.IMGFMT
>
>  === Mapping non-sparse conversion ===
>
> diff --git a/tests/qemu-iotests/160 b/tests/qemu-iotests/160
> index df89d3864b..ad2d054a47 100755
> --- a/tests/qemu-iotests/160
> +++ b/tests/qemu-iotests/160
> @@ -57,6 +57,10 @@ for skip in $TEST_SKIP_BLOCKS; do
>      $QEMU_IMG dd if="$TEST_IMG" of="$TEST_IMG.out" skip="$skip" -O "$IMGFMT" \
>          2> /dev/null
>      TEST_IMG="$TEST_IMG.out" _check_test_img
> +
> +    # We always write the first byte of an image.
> +    printf "\0" > "$TEST_IMG.out.dd"
> +
>      dd if="$TEST_IMG" of="$TEST_IMG.out.dd" skip="$skip" status=none
>
>      echo
> diff --git a/tests/qemu-iotests/175 b/tests/qemu-iotests/175
> index 51e62c8276..c6a3a7bb1e 100755
> --- a/tests/qemu-iotests/175
> +++ b/tests/qemu-iotests/175
> @@ -37,14 +37,16 @@ trap "_cleanup; exit \$status" 0 1 2 3 15
>  # the file size.  This function hides the resulting difference in the
>  # stat -c '%b' output.
>  # Parameter 1: Number of blocks an empty file occupies
> -# Parameter 2: Image size in bytes
> +# Parameter 2: Minimal number of blocks in an image
> +# Parameter 3: Image size in bytes
>  _filter_blocks()
>  {
>      extra_blocks=$1
> -    img_size=$2
> +    min_blocks=$2
> +    img_size=$3
>
> -    sed -e "s/blocks=$extra_blocks\\(\$\\|[^0-9]\\)/nothing allocated/" \
> -        -e "s/blocks=$((extra_blocks + img_size / 512))\\(\$\\|[^0-9]\\)/everything allocated/"
> +    sed -e "s/blocks=$((extra_blocks + min_blocks))\\(\$\\|[^0-9]\\)/min allocation/" \
> +        -e "s/blocks=$((extra_blocks + img_size / 512))\\(\$\\|[^0-9]\\)/max allocation/"
>  }
>
>  # get standard environment, filters and checks
> @@ -60,16 +62,21 @@ size=$((1 * 1024 * 1024))
>  touch "$TEST_DIR/empty"
>  extra_blocks=$(stat -c '%b' "$TEST_DIR/empty")
>
> +# We always write the first byte; check how many blocks this filesystem
> +# allocates to match empty image allocation.
> +printf "\0" > "$TEST_DIR/empty"
> +min_blocks=$(stat -c '%b' "$TEST_DIR/empty")
> +
>  echo
>  echo "== creating image with default preallocation =="
>  _make_test_img $size | _filter_imgfmt
> -stat -c "size=%s, blocks=%b" $TEST_IMG | _filter_blocks $extra_blocks $size
> +stat -c "size=%s, blocks=%b" $TEST_IMG | _filter_blocks $extra_blocks $min_blocks $size
>
>  for mode in off full falloc; do
>      echo
>      echo "== creating image with preallocation $mode =="
>      IMGOPTS=preallocation=$mode _make_test_img $size | _filter_imgfmt
> -    stat -c "size=%s, blocks=%b" $TEST_IMG | _filter_blocks $extra_blocks $size
> +    stat -c "size=%s, blocks=%b" $TEST_IMG | _filter_blocks $extra_blocks $min_blocks $size
>  done
>
>  # success, all done
> diff --git a/tests/qemu-iotests/175.out b/tests/qemu-iotests/175.out
> index 6d9a5ed84e..263e521262 100644
> --- a/tests/qemu-iotests/175.out
> +++ b/tests/qemu-iotests/175.out
> @@ -2,17 +2,17 @@ QA output created by 175
>
>  == creating image with default preallocation ==
>  Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1048576
> -size=1048576, nothing allocated
> +size=1048576, min allocation
>
>  == creating image with preallocation off ==
>  Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1048576 preallocation=off
> -size=1048576, nothing allocated
> +size=1048576, min allocation
>
>  == creating image with preallocation full ==
>  Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1048576 preallocation=full
> -size=1048576, everything allocated
> +size=1048576, max allocation
>
>  == creating image with preallocation falloc ==
>  Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1048576 preallocation=falloc
> -size=1048576, everything allocated
> +size=1048576, max allocation
>   *** done
> diff --git a/tests/qemu-iotests/221.out b/tests/qemu-iotests/221.out
> index 9f9dd52bb0..dca024a0c3 100644
> --- a/tests/qemu-iotests/221.out
> +++ b/tests/qemu-iotests/221.out
> @@ -3,14 +3,18 @@ QA output created by 221
>  === Check mapping of unaligned raw image ===
>
>  Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=65537
> -[{ "start": 0, "length": 66048, "depth": 0, "zero": true, "data": false, "offset": OFFSET}]
> -[{ "start": 0, "length": 66048, "depth": 0, "zero": true, "data": false, "offset": OFFSET}]
> +[{ "start": 0, "length": 4096, "depth": 0, "zero": false, "data": true, "offset": OFFSET},
> +{ "start": 4096, "length": 61952, "depth": 0, "zero": true, "data": false, "offset": OFFSET}]
> +[{ "start": 0, "length": 4096, "depth": 0, "zero": false, "data": true, "offset": OFFSET},
> +{ "start": 4096, "length": 61952, "depth": 0, "zero": true, "data": false, "offset": OFFSET}]
>  wrote 1/1 bytes at offset 65536
>  1 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> -[{ "start": 0, "length": 65536, "depth": 0, "zero": true, "data": false, "offset": OFFSET},
> +[{ "start": 0, "length": 4096, "depth": 0, "zero": false, "data": true, "offset": OFFSET},
> +{ "start": 4096, "length": 61440, "depth": 0, "zero": true, "data": false, "offset": OFFSET},
>  { "start": 65536, "length": 1, "depth": 0, "zero": false, "data": true, "offset": OFFSET},
>  { "start": 65537, "length": 511, "depth": 0, "zero": true, "data": false, "offset": OFFSET}]
> -[{ "start": 0, "length": 65536, "depth": 0, "zero": true, "data": false, "offset": OFFSET},
> +[{ "start": 0, "length": 4096, "depth": 0, "zero": false, "data": true, "offset": OFFSET},
> +{ "start": 4096, "length": 61440, "depth": 0, "zero": true, "data": false, "offset": OFFSET},
>  { "start": 65536, "length": 1, "depth": 0, "zero": false, "data": true, "offset": OFFSET},
>  { "start": 65537, "length": 511, "depth": 0, "zero": true, "data": false, "offset": OFFSET}]
>  *** done
> diff --git a/tests/qemu-iotests/253.out b/tests/qemu-iotests/253.out
> index 607c0baa0b..3d08b305d7 100644
> --- a/tests/qemu-iotests/253.out
> +++ b/tests/qemu-iotests/253.out
> @@ -3,12 +3,16 @@ QA output created by 253
>  === Check mapping of unaligned raw image ===
>
>  Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1048575
> -[{ "start": 0, "length": 1048576, "depth": 0, "zero": true, "data": false, "offset": OFFSET}]
> -[{ "start": 0, "length": 1048576, "depth": 0, "zero": true, "data": false, "offset": OFFSET}]
> +[{ "start": 0, "length": 4096, "depth": 0, "zero": false, "data": true, "offset": OFFSET},
> +{ "start": 4096, "length": 1044480, "depth": 0, "zero": true, "data": false, "offset": OFFSET}]
> +[{ "start": 0, "length": 4096, "depth": 0, "zero": false, "data": true, "offset": OFFSET},
> +{ "start": 4096, "length": 1044480, "depth": 0, "zero": true, "data": false, "offset": OFFSET}]
>  wrote 65535/65535 bytes at offset 983040
>  63.999 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> -[{ "start": 0, "length": 983040, "depth": 0, "zero": true, "data": false, "offset": OFFSET},
> +[{ "start": 0, "length": 4096, "depth": 0, "zero": false, "data": true, "offset": OFFSET},
> +{ "start": 4096, "length": 978944, "depth": 0, "zero": true, "data": false, "offset": OFFSET},
>  { "start": 983040, "length": 65536, "depth": 0, "zero": false, "data": true, "offset": OFFSET}]
> -[{ "start": 0, "length": 983040, "depth": 0, "zero": true, "data": false, "offset": OFFSET},
> +[{ "start": 0, "length": 4096, "depth": 0, "zero": false, "data": true, "offset": OFFSET},
> +{ "start": 4096, "length": 978944, "depth": 0, "zero": true, "data": false, "offset": OFFSET},
>  { "start": 983040, "length": 65536, "depth": 0, "zero": false, "data": true, "offset": OFFSET}]
>  *** done
> --
> 2.20.1
>
>
Max Reitz Aug. 22, 2019, 2:28 p.m. UTC | #5
On 16.08.19 23:21, Nir Soffer wrote:
> When creating an image with preallocation "off" or "falloc", the first
> block of the image is typically not allocated. When using Gluster
> storage backed by XFS filesystem, reading this block using direct I/O
> succeeds regardless of request length, fooling alignment detection.
> 
> In this case we fallback to a safe value (4096) instead of the optimal
> value (512), which may lead to unneeded data copying when aligning
> requests.  Allocating the first block avoids the fallback.
> 
> When using preallocation=off, we always allocate at least one filesystem
> block:
> 
>     $ ./qemu-img create -f raw test.raw 1g
>     Formatting 'test.raw', fmt=raw size=1073741824
> 
>     $ ls -lhs test.raw
>     4.0K -rw-r--r--. 1 nsoffer nsoffer 1.0G Aug 16 23:48 test.raw
> 
> I did quick performance tests for these flows:
> - Provisioning a VM with a new raw image.
> - Copying disks with qemu-img convert to new raw target image
> 
> I installed Fedora 29 server on raw sparse image, measuring the time
> from clicking "Begin installation" until the "Reboot" button appears:
> 
> Before(s)  After(s)     Diff(%)
> -------------------------------
>      356        389        +8.4
> 
> I ran this only once, so we cannot tell much from these results.

So you’d expect it to be fast but it was slower?  Well, you only ran it
once and it isn’t really a precise benchmark...

> The second test was cloning the installation image with qemu-img
> convert, doing 10 runs:
> 
>     for i in $(seq 10); do
>         rm -f dst.raw
>         sleep 10
>         time ./qemu-img convert -f raw -O raw -t none -T none src.raw dst.raw
>     done
> 
> Here is a table comparing the total time spent:
> 
> Type    Before(s)   After(s)    Diff(%)
> ---------------------------------------
> real      530.028    469.123      -11.4
> user       17.204     10.768      -37.4
> sys        17.881      7.011      -60.7
> 
> Here we see very clear improvement in CPU usage.
> 
> Signed-off-by: Nir Soffer <nsoffer@redhat.com>
> ---
>  block/file-posix.c         | 25 +++++++++++++++++++++++++
>  tests/qemu-iotests/150.out |  1 +
>  tests/qemu-iotests/160     |  4 ++++
>  tests/qemu-iotests/175     | 19 +++++++++++++------
>  tests/qemu-iotests/175.out |  8 ++++----
>  tests/qemu-iotests/221.out | 12 ++++++++----
>  tests/qemu-iotests/253.out | 12 ++++++++----
>  7 files changed, 63 insertions(+), 18 deletions(-)
> 
> diff --git a/block/file-posix.c b/block/file-posix.c
> index b9c33c8f6c..3964dd2021 100644
> --- a/block/file-posix.c
> +++ b/block/file-posix.c
> @@ -1755,6 +1755,27 @@ static int handle_aiocb_discard(void *opaque)
>      return ret;
>  }
>  
> +/*
> + * Help alignment detection by allocating the first block.
> + *
> + * When reading with direct I/O from unallocated area on Gluster backed by XFS,
> + * reading succeeds regardless of request length. In this case we fallback to
> + * safe aligment which is not optimal. Allocating the first block avoids this
> + * fallback.
> + *
> + * Returns: 0 on success, -errno on failure.
> + */
> +static int allocate_first_block(int fd)
> +{
> +    ssize_t n;
> +
> +    do {
> +        n = pwrite(fd, "\0", 1, 0);

This breaks when fd has been opened with O_DIRECT.

(Which happens when you open some file with cache.direct=on, and then
use e.g. QMP’s block_resize.)

It isn’t that bad because eventually you simply ignore the error.  But
it still makes me wonder whether we shouldn’t instead write the largest
power of two that exceeds neither the new file length nor MAX_BLOCKSIZE.
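
For illustration only (this is not code from the patch), the suggestion
above might be sketched as follows. The names first_block_write_size()
and allocate_first_block_aligned() are hypothetical, and
SKETCH_MAX_BLOCKSIZE merely stands in for QEMU's MAX_BLOCKSIZE, with an
assumed value of 4096. An aligned buffer and a power-of-two length give
the write a chance of succeeding on an fd opened with O_DIRECT, though
a length below the device's sector size could still be rejected.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Assumption: stands in for QEMU's MAX_BLOCKSIZE; the value is illustrative. */
#define SKETCH_MAX_BLOCKSIZE 4096

/*
 * Largest power of two that exceeds neither the new file length nor
 * SKETCH_MAX_BLOCKSIZE. Returns 0 for an empty file.
 */
static size_t first_block_write_size(uint64_t new_length)
{
    size_t size = 1;

    if (new_length == 0) {
        return 0;
    }
    while (size * 2 <= SKETCH_MAX_BLOCKSIZE &&
           (uint64_t)size * 2 <= new_length) {
        size *= 2;
    }
    return size;
}

/*
 * Write zeroes over the start of the file from an aligned buffer.
 * Only meant for a file that was just grown from length 0, so no
 * existing data is overwritten.
 * Returns 0 on success, -errno on failure.
 */
static int allocate_first_block_aligned(int fd, uint64_t new_length)
{
    size_t size = first_block_write_size(new_length);
    void *buf;
    ssize_t n;
    int ret;

    if (size == 0) {
        return 0;
    }
    assert((size & (size - 1)) == 0); /* must be a power of two */

    ret = posix_memalign(&buf, SKETCH_MAX_BLOCKSIZE, size);
    if (ret != 0) {
        return -ret;
    }
    memset(buf, 0, size);

    do {
        n = pwrite(fd, buf, size, 0);
    } while (n == -1 && errno == EINTR);

    ret = (n == -1) ? -errno : 0;
    free(buf);
    return ret;
}
```

With these definitions a 1 MiB image gets a 4096-byte write and a
1000-byte image gets a 512-byte write.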

> +    } while (n == -1 && errno == EINTR);
> +
> +    return (n == -1) ? -errno : 0;
> +}
> +
>  static int handle_aiocb_truncate(void *opaque)
>  {
>      RawPosixAIOData *aiocb = opaque;
> @@ -1794,6 +1815,8 @@ static int handle_aiocb_truncate(void *opaque)
>                  /* posix_fallocate() doesn't set errno. */
>                  error_setg_errno(errp, -result,
>                                   "Could not preallocate new data");
> +            } else if (current_length == 0) {
> +                allocate_first_block(fd);

Should posix_fallocate() not take care of precisely this?

>              }
>          } else {
>              result = 0;

[...]

> diff --git a/tests/qemu-iotests/160 b/tests/qemu-iotests/160
> index df89d3864b..ad2d054a47 100755
> --- a/tests/qemu-iotests/160
> +++ b/tests/qemu-iotests/160
> @@ -57,6 +57,10 @@ for skip in $TEST_SKIP_BLOCKS; do
>      $QEMU_IMG dd if="$TEST_IMG" of="$TEST_IMG.out" skip="$skip" -O "$IMGFMT" \
>          2> /dev/null
>      TEST_IMG="$TEST_IMG.out" _check_test_img
> +
> +    # We always write the first byte of an image.
> +    printf "\0" > "$TEST_IMG.out.dd"
> +
>      dd if="$TEST_IMG" of="$TEST_IMG.out.dd" skip="$skip" status=none

Won’t this dd completely overwrite $TEST_IMG.out.dd (especially given
the lack of conv=notrunc)?

>  
>      echo
> diff --git a/tests/qemu-iotests/175 b/tests/qemu-iotests/175
> index 51e62c8276..c6a3a7bb1e 100755
> --- a/tests/qemu-iotests/175
> +++ b/tests/qemu-iotests/175
> @@ -37,14 +37,16 @@ trap "_cleanup; exit \$status" 0 1 2 3 15
>  # the file size.  This function hides the resulting difference in the
>  # stat -c '%b' output.
>  # Parameter 1: Number of blocks an empty file occupies
> -# Parameter 2: Image size in bytes
> +# Parameter 2: Minimal number of blocks in an image
> +# Parameter 3: Image size in bytes
>  _filter_blocks()
>  {
>      extra_blocks=$1
> -    img_size=$2
> +    min_blocks=$2
> +    img_size=$3
>  
> -    sed -e "s/blocks=$extra_blocks\\(\$\\|[^0-9]\\)/nothing allocated/" \
> -        -e "s/blocks=$((extra_blocks + img_size / 512))\\(\$\\|[^0-9]\\)/everything allocated/"
> +    sed -e "s/blocks=$((extra_blocks + min_blocks))\\(\$\\|[^0-9]\\)/min allocation/" \

I don’t think adding extra_blocks to min_blocks makes sense.  Just
min_blocks alone should be what we want here.

Max
Nir Soffer Aug. 22, 2019, 4:39 p.m. UTC | #6
On Thu, Aug 22, 2019 at 5:28 PM Max Reitz <mreitz@redhat.com> wrote:

> On 16.08.19 23:21, Nir Soffer wrote:
> > When creating an image with preallocation "off" or "falloc", the first
> > block of the image is typically not allocated. When using Gluster
> > storage backed by XFS filesystem, reading this block using direct I/O
> > succeeds regardless of request length, fooling alignment detection.
> >
> > In this case we fallback to a safe value (4096) instead of the optimal
> > value (512), which may lead to unneeded data copying when aligning
> > requests.  Allocating the first block avoids the fallback.
> >
> > When using preallocation=off, we always allocate at least one filesystem
> > block:
> >
> >     $ ./qemu-img create -f raw test.raw 1g
> >     Formatting 'test.raw', fmt=raw size=1073741824
> >
> >     $ ls -lhs test.raw
> >     4.0K -rw-r--r--. 1 nsoffer nsoffer 1.0G Aug 16 23:48 test.raw
> >
> > I did quick performance tests for these flows:
> > - Provisioning a VM with a new raw image.
> > - Copying disks with qemu-img convert to new raw target image
> >
> > I installed Fedora 29 server on raw sparse image, measuring the time
> > from clicking "Begin installation" until the "Reboot" button appears:
> >
> > Before(s)  After(s)     Diff(%)
> > -------------------------------
> >      356        389        +8.4
> >
> > I ran this only once, so we cannot tell much from these results.
>
> So you’d expect it to be fast but it was slower?  Well, you only ran it
> once and it isn’t really a precise benchmark...
>
> > The second test was cloning the installation image with qemu-img
> > convert, doing 10 runs:
> >
> >     for i in $(seq 10); do
> >         rm -f dst.raw
> >         sleep 10
> >         time ./qemu-img convert -f raw -O raw -t none -T none src.raw dst.raw
> >     done
> >
> > Here is a table comparing the total time spent:
> >
> > Type    Before(s)   After(s)    Diff(%)
> > ---------------------------------------
> > real      530.028    469.123      -11.4
> > user       17.204     10.768      -37.4
> > sys        17.881      7.011      -60.7
> >
> > Here we see very clear improvement in CPU usage.
> >
> > Signed-off-by: Nir Soffer <nsoffer@redhat.com>
> > ---
> >  block/file-posix.c         | 25 +++++++++++++++++++++++++
> >  tests/qemu-iotests/150.out |  1 +
> >  tests/qemu-iotests/160     |  4 ++++
> >  tests/qemu-iotests/175     | 19 +++++++++++++------
> >  tests/qemu-iotests/175.out |  8 ++++----
> >  tests/qemu-iotests/221.out | 12 ++++++++----
> >  tests/qemu-iotests/253.out | 12 ++++++++----
> >  7 files changed, 63 insertions(+), 18 deletions(-)
> >
> > diff --git a/block/file-posix.c b/block/file-posix.c
> > index b9c33c8f6c..3964dd2021 100644
> > --- a/block/file-posix.c
> > +++ b/block/file-posix.c
> > @@ -1755,6 +1755,27 @@ static int handle_aiocb_discard(void *opaque)
> >      return ret;
> >  }
> >
> > +/*
> > + * Help alignment detection by allocating the first block.
> > + *
> > + * When reading with direct I/O from unallocated area on Gluster backed by XFS,
> > + * reading succeeds regardless of request length. In this case we fallback to
> > + * safe aligment which is not optimal. Allocating the first block avoids this
> > + * fallback.
> > + *
> > + * Returns: 0 on success, -errno on failure.
> > + */
> > +static int allocate_first_block(int fd)
> > +{
> > +    ssize_t n;
> > +
> > +    do {
> > +        n = pwrite(fd, "\0", 1, 0);
>
> This breaks when fd has been opened with O_DIRECT.
>

It seems that we always open images without O_DIRECT when creating an image
in qemu-img create, or when creating a target image in qemu-img convert.

Here is a trace of qemu-img create:

$ strace -f -tt -o /tmp/create.trace ./qemu-img create -f raw -o preallocation=falloc /tmp/gv1/src.raw 1g
Formatting '/tmp/gv1/src.raw', fmt=raw size=1073741824 preallocation=falloc

1. open #1

17686 18:58:23.681921 openat(AT_FDCWD, "/tmp/gv1/src.raw", O_RDONLY|O_NONBLOCK|O_CLOEXEC) = 9
17686 18:58:23.683753 fstat(9, {st_mode=S_IFREG|0600, st_size=1073741824, ...}) = 0
17686 18:58:23.683829 close(9)          = 0

2. open #2

17686 18:58:23.684146 openat(AT_FDCWD, "/tmp/gv1/src.raw", O_RDONLY|O_NONBLOCK|O_CLOEXEC) = 9
17686 18:58:23.684227 fstat(9, {st_mode=S_IFREG|0600, st_size=1073741824, ...}) = 0
17686 18:58:23.684256 close(9)          = 0

3. open #3

17686 18:58:23.684339 openat(AT_FDCWD, "/tmp/gv1/src.raw", O_RDWR|O_CREAT|O_CLOEXEC, 0644) = 9
...
17688 18:58:23.690178 fstat(9, {st_mode=S_IFREG|0600, st_size=1073741824, ...}) = 0
17688 18:58:23.690217 ftruncate(9, 0 <unfinished ...>
...
17688 18:58:23.700266 <... ftruncate resumed>) = 0
...
17688 18:58:23.700595 fstat(9,  <unfinished ...>
...
17688 18:58:23.700619 <... fstat resumed>{st_mode=S_IFREG|0600, st_size=0, ...}) = 0
...
17688 18:58:23.700651 fallocate(9, 0, 0, 1073741824 <unfinished ...>
...
17688 18:58:23.728141 <... fallocate resumed>) = 0
...
17688 18:58:23.728196 pwrite64(9, "\0", 1, 0) = 1
...
17686 18:58:23.738391 close(9)          = 0

Here is convert trace:

$ strace -f -tt -o /tmp/convert.trace ./qemu-img convert -f raw -O raw -t none -T none /tmp/gv1/src.raw /tmp/gv1/dst.raw

1. open #1

18175 19:07:23.364417 openat(AT_FDCWD, "/tmp/gv1/dst.raw", O_RDONLY|O_NONBLOCK|O_CLOEXEC) = 10
18175 19:07:23.365282 fstat(10, {st_mode=S_IFREG|0600, st_size=1073741824, ...}) = 0
18175 19:07:23.365323 close(10)         = 0

2. open #2

18175 19:07:23.365660 openat(AT_FDCWD, "/tmp/gv1/dst.raw", O_RDONLY|O_NONBLOCK|O_CLOEXEC) = 10
18175 19:07:23.365717 fstat(10, {st_mode=S_IFREG|0600, st_size=1073741824, ...}) = 0
18175 19:07:23.365742 close(10)         = 0

3. open #3

18175 19:07:23.365839 openat(AT_FDCWD, "/tmp/gv1/dst.raw", O_RDWR|O_CREAT|O_CLOEXEC, 0644) = 10
...
18177 19:07:23.372112 fstat(10, {st_mode=S_IFREG|0600, st_size=1073741824, ...}) = 0
18177 19:07:23.372138 ftruncate(10, 0)  = 0
...
18177 19:07:23.375760 fstat(10, {st_mode=S_IFREG|0600, st_size=0, ...}) = 0
18177 19:07:23.375788 ftruncate(10, 1073741824) = 0
18177 19:07:23.376383 pwrite64(10, "\0", 1, 0) = 1
...
18175 19:07:23.390989 close(10)         = 0

4. open #4

18175 19:07:23.391429 openat(AT_FDCWD, "/tmp/gv1/dst.raw", O_RDONLY|O_NONBLOCK|O_CLOEXEC) = 10
18175 19:07:23.392433 fstat(10, {st_mode=S_IFREG|0600, st_size=1073741824, ...}) = 0
18175 19:07:23.392483 close(10)         = 0

5. open #5

18175 19:07:23.392743 openat(AT_FDCWD, "/tmp/gv1/dst.raw", O_RDWR|O_DIRECT|O_CLOEXEC) = 10
...
18175 19:07:23.393731 ioctl(10, BLKSSZGET, 0x7ffe75ead334) = -1 ENOSYS (Function not implemented)
18175 19:07:23.393784 pread64(10, 0x558796451000, 1, 0) = -1 EINVAL (Invalid argument)
18175 19:07:23.395362 pread64(10, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 512, 0) = 512
18175 19:07:23.395905 pread64(10, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096, 0) = 4096
...

> (Which happens when you open some file with cache.direct=on, and then
> use e.g. QMP’s block_resize.)
>

What would be a command triggering this? I can add a test.

> It isn’t that bad because eventually you simply ignore the error.  But
> it still makes me wonder whether we shouldn’t write like the biggest
> power of two that does not exceed the new file length or MAX_BLOCKSIZE.
>

It makes sense if there is a way to cause qemu-img to use O_DIRECT when
creating an image.

> > +    } while (n == -1 && errno == EINTR);
> > +
> > +    return (n == -1) ? -errno : 0;
> > +}
> > +
> >  static int handle_aiocb_truncate(void *opaque)
> >  {
> >      RawPosixAIOData *aiocb = opaque;
> > @@ -1794,6 +1815,8 @@ static int handle_aiocb_truncate(void *opaque)
> >                  /* posix_fallocate() doesn't set errno. */
> >                  error_setg_errno(errp, -result,
> >                                   "Could not preallocate new data");
> > +            } else if (current_length == 0) {
> > +                allocate_first_block(fd);
>
> Should posix_fallocate() not take care of precisely this?
>

Only if the filesystem does not support fallocate() (e.g. NFS < 4.2).

In this case posix_fallocate() is doing:

  for (offset += (len - 1) % increment; len > 0; offset += increment)
    {
      len -= increment;
      if (offset < st.st_size)
        {
          unsigned char c;
          ssize_t rsize = __pread (fd, &c, 1, offset);
          if (rsize < 0)
            return errno;
          /* If there is a non-zero byte, the block must have been
             allocated already.  */
          else if (rsize == 1 && c != 0)
            continue;
        }
      if (__pwrite (fd, "", 1, offset) != 1)
        return errno;
    }

https://code.woboq.org/userspace/glibc/sysdeps/posix/posix_fallocate.c.html#96

So opening a file with O_DIRECT will break preallocation=falloc on such
filesystems, and writing one byte in allocate_first_block() is safe.
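
To make the access pattern of that fallback loop easier to see, here is
a small sketch that only records the offsets the loop would write to.
It is not glibc code: fallback_write_offsets() is a hypothetical helper,
the increment is passed in as a parameter rather than derived from the
filesystem, and the pread() check that skips already non-zero bytes is
left out, so this corresponds to a file consisting of holes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Record the offsets that the fallback loop quoted above would write one
 * byte to. Stores at most max_out offsets in out and returns the total
 * number of loop iterations.
 */
static size_t fallback_write_offsets(int64_t offset, int64_t len,
                                     int64_t increment,
                                     int64_t *out, size_t max_out)
{
    size_t count = 0;

    assert(increment > 0);
    if (len <= 0) {
        return 0;
    }
    /* Same loop header as the quoted code. */
    for (offset += (len - 1) % increment; len > 0; offset += increment) {
        len -= increment;
        if (count < max_out) {
            out[count] = offset;
        }
        count++;
    }
    return count;
}
```

With offset 0, a length of 1 MiB and an increment of 4096, this yields
256 one-byte writes at 4095, 8191, ..., 1048575, that is, the last byte
of each 4096-byte block.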

> >              }
> >          } else {
> >              result = 0;
>
> [...]
>
> > diff --git a/tests/qemu-iotests/160 b/tests/qemu-iotests/160
> > index df89d3864b..ad2d054a47 100755
> > --- a/tests/qemu-iotests/160
> > +++ b/tests/qemu-iotests/160
> > @@ -57,6 +57,10 @@ for skip in $TEST_SKIP_BLOCKS; do
> >      $QEMU_IMG dd if="$TEST_IMG" of="$TEST_IMG.out" skip="$skip" -O "$IMGFMT" \
> >          2> /dev/null
> >      TEST_IMG="$TEST_IMG.out" _check_test_img
> > +
> > +    # We always write the first byte of an image.
> > +    printf "\0" > "$TEST_IMG.out.dd"
> > +
> >      dd if="$TEST_IMG" of="$TEST_IMG.out.dd" skip="$skip" status=none
>
> Won’t this dd completely overwrite $TEST_IMG.out.dd (especially given
> the lack of conv=notrunc)?
>

There is an issue only if dd opens the file with O_TRUNC. I will test this
again.

>
> >      echo
> > diff --git a/tests/qemu-iotests/175 b/tests/qemu-iotests/175
> > index 51e62c8276..c6a3a7bb1e 100755
> > --- a/tests/qemu-iotests/175
> > +++ b/tests/qemu-iotests/175
> > @@ -37,14 +37,16 @@ trap "_cleanup; exit \$status" 0 1 2 3 15
> >  # the file size.  This function hides the resulting difference in the
> >  # stat -c '%b' output.
> >  # Parameter 1: Number of blocks an empty file occupies
> > -# Parameter 2: Image size in bytes
> > +# Parameter 2: Minimal number of blocks in an image
> > +# Parameter 3: Image size in bytes
> >  _filter_blocks()
> >  {
> >      extra_blocks=$1
> > -    img_size=$2
> > +    min_blocks=$2
> > +    img_size=$3
> >
> > -    sed -e "s/blocks=$extra_blocks\\(\$\\|[^0-9]\\)/nothing allocated/" \
> > -        -e "s/blocks=$((extra_blocks + img_size / 512))\\(\$\\|[^0-9]\\)/everything allocated/"
> > +    sed -e "s/blocks=$((extra_blocks + min_blocks))\\(\$\\|[^0-9]\\)/min allocation/" \
>
> I don’t think adding extra_blocks to min_blocks makes sense.  Just
> min_blocks alone should be what we want here.
>

We had failing tests (in vdsm) showing that a filesystem may report more
blocks than expected even for non-empty files, so this may be needed. I
have not tested it yet with a filesystem that shows this issue, but I
found your instructions for creating one.

Thanks for reviewing.

Nir
Max Reitz Aug. 22, 2019, 6:11 p.m. UTC | #7
On 22.08.19 18:39, Nir Soffer wrote:
> On Thu, Aug 22, 2019 at 5:28 PM Max Reitz <mreitz@redhat.com> wrote:
> 
>     On 16.08.19 23:21, Nir Soffer wrote:
>     > When creating an image with preallocation "off" or "falloc", the first
>     > block of the image is typically not allocated. When using Gluster
>     > storage backed by XFS filesystem, reading this block using direct I/O
>     > succeeds regardless of request length, fooling alignment detection.
>     >
>     > In this case we fallback to a safe value (4096) instead of the optimal
>     > value (512), which may lead to unneeded data copying when aligning
>     > requests.  Allocating the first block avoids the fallback.
>     >
>     > When using preallocation=off, we always allocate at least one filesystem
>     > block:
>     >
>     >     $ ./qemu-img create -f raw test.raw 1g
>     >     Formatting 'test.raw', fmt=raw size=1073741824
>     >
>     >     $ ls -lhs test.raw
>     >     4.0K -rw-r--r--. 1 nsoffer nsoffer 1.0G Aug 16 23:48 test.raw
>     >
>     > I did quick performance tests for these flows:
>     > - Provisioning a VM with a new raw image.
>     > - Copying disks with qemu-img convert to new raw target image
>     >
>     > I installed Fedora 29 server on raw sparse image, measuring the time
>     > from clicking "Begin installation" until the "Reboot" button appears:
>     >
>     > Before(s)  After(s)     Diff(%)
>     > -------------------------------
>     >      356        389        +8.4
>     >
>     > I ran this only once, so we cannot tell much from these results.
> 
>     So you’d expect it to be fast but it was slower?  Well, you only ran it
>     once and it isn’t really a precise benchmark...
> 
>     > The second test was cloning the installation image with qemu-img
>     > convert, doing 10 runs:
>     >
>     >     for i in $(seq 10); do
>     >         rm -f dst.raw
>     >         sleep 10
>     >         time ./qemu-img convert -f raw -O raw -t none -T none src.raw dst.raw
>     >     done
>     >
>     > Here is a table comparing the total time spent:
>     >
>     > Type    Before(s)   After(s)    Diff(%)
>     > ---------------------------------------
>     > real      530.028    469.123      -11.4
>     > user       17.204     10.768      -37.4
>     > sys        17.881      7.011      -60.7
>     >
>     > Here we see very clear improvement in CPU usage.
>     >
>     > Signed-off-by: Nir Soffer <nsoffer@redhat.com>
>     > ---
>     >  block/file-posix.c         | 25 +++++++++++++++++++++++++
>     >  tests/qemu-iotests/150.out |  1 +
>     >  tests/qemu-iotests/160     |  4 ++++
>     >  tests/qemu-iotests/175     | 19 +++++++++++++------
>     >  tests/qemu-iotests/175.out |  8 ++++----
>     >  tests/qemu-iotests/221.out | 12 ++++++++----
>     >  tests/qemu-iotests/253.out | 12 ++++++++----
>     >  7 files changed, 63 insertions(+), 18 deletions(-)
>     >
>     > diff --git a/block/file-posix.c b/block/file-posix.c
>     > index b9c33c8f6c..3964dd2021 100644
>     > --- a/block/file-posix.c
>     > +++ b/block/file-posix.c
>     > @@ -1755,6 +1755,27 @@ static int handle_aiocb_discard(void *opaque)
>     >      return ret;
>     >  }
>     > 
>     > +/*
>     > + * Help alignment detection by allocating the first block.
>     > + *
>     > + * When reading with direct I/O from unallocated area on Gluster backed by XFS,
>     > + * reading succeeds regardless of request length. In this case we fallback to
>     > + * safe aligment which is not optimal. Allocating the first block avoids this
>     > + * fallback.
>     > + *
>     > + * Returns: 0 on success, -errno on failure.
>     > + */
>     > +static int allocate_first_block(int fd)
>     > +{
>     > +    ssize_t n;
>     > +
>     > +    do {
>     > +        n = pwrite(fd, "\0", 1, 0);
> 
>     This breaks when fd has been opened with O_DIRECT.
> 
> 
> It seems that we always open images without O_DIRECT when creating an image
> in qemu-img create, or when creating a target image in qemu-img convert.

Yes.  But you don’t call this function directly from image creation code
but instead from the truncation function.  (The former also calls the
latter, but truncating is also an operation on its own.)

[...]

>     (Which happens when you open some file with cache.direct=on, and then
>     use e.g. QMP’s block_resize.)
> 
> 
> What would be a command triggering this? I can add a test.

block_resize, as I’ve said:

$ ./qemu-img create -f raw empty.img 0
$ x86_64-softmmu/qemu-system-x86_64 \
    -qmp stdio \
    -blockdev file,node-name=file,filename=empty.img,cache.direct=on \
     <<EOF
{'execute':'qmp_capabilities'}
{'execute':'block_resize',
 'arguments':{'node-name':'file',
              'size':1048576}}
EOF
$ ./qemu-img map empty.img
Offset          Length          Mapped to       File

(You’d expect a data chunk here.)

I suppose you can get the same effect with blockdev-create and some
format that explicitly resizes the file to some target length (LUKS does
this, I think), but this is the most direct route.
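
As an aside for anyone writing a test for this, a check along the lines
of the empty `qemu-img map` output above could be sketched with
lseek(SEEK_DATA). This is an illustration, not QEMU code:
first_data_offset() is a hypothetical helper, and it assumes a Linux
filesystem that reports holes.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef SEEK_DATA
#define SEEK_DATA 3 /* Linux value, in case the libc headers hide it */
#endif

/*
 * Return the offset of the first data byte in the file, or -errno on
 * failure. -ENXIO means the file contains no data at all (it is empty
 * or consists of a single hole).
 */
static long long first_data_offset(const char *path)
{
    off_t pos;
    long long ret;
    int fd;

    assert(path != NULL);
    fd = open(path, O_RDONLY);
    if (fd < 0) {
        return -errno;
    }
    pos = lseek(fd, 0, SEEK_DATA);
    ret = (pos == (off_t)-1) ? -errno : (long long)pos;
    close(fd);
    return ret;
}
```

If the first block had been allocated by the resize, the helper would
return 0 for empty.img; with no data extent at all it would be expected
to return -ENXIO instead.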

> 
>     It isn’t that bad because eventually you simply ignore the error.  But
>     it still makes me wonder whether we shouldn’t write like the biggest
>     power of two that does not exceed the new file length or MAX_BLOCKSIZE.
> 
> 
> It makes sense if there is a way to cause qemu-img to use O_DIRECT when
> creating an image.
> 
>     > +    } while (n == -1 && errno == EINTR);
>     > +
>     > +    return (n == -1) ? -errno : 0;
>     > +}
>     > +
>     >  static int handle_aiocb_truncate(void *opaque)
>     >  {
>     >      RawPosixAIOData *aiocb = opaque;
>     > @@ -1794,6 +1815,8 @@ static int handle_aiocb_truncate(void *opaque)
>     >                  /* posix_fallocate() doesn't set errno. */
>     >                  error_setg_errno(errp, -result,
>     >                                   "Could not preallocate new data");
>     > +            } else if (current_length == 0) {
>     > +                allocate_first_block(fd);
> 
>     Should posix_fallocate() not take care of precisely this?
> 
> 
> Only if the filesystem does not support fallocate() (e.g. NFS < 4.2).
> 
> In this case posix_fallocate() is doing:
> 
>   for (offset += (len - 1) % increment; len > 0; offset += increment)
>     {
>       len -= increment;
>       if (offset < st.st_size)
>         {
>           unsigned char c;
>           ssize_t rsize = __pread (fd, &c, 1, offset);
>           if (rsize < 0)
>             return errno;
>           /* If there is a non-zero byte, the block must have been
>              allocated already.  */
>           else if (rsize == 1 && c != 0)
>             continue;
>         }
>       if (__pwrite (fd, "", 1, offset) != 1)
>         return errno;
>     }
> 
> https://code.woboq.org/userspace/glibc/sysdeps/posix/posix_fallocate.c.html#96
> 
> So opening a file with O_DIRECT will break preallocation=falloc on such filesystems,

But won’t the function above just fail with EINVAL?
allocate_first_block() is executed only in case of success.

> and writing one byte in allocate_first_block() is safe.
> 
>     >              }
>     >          } else {
>     >              result = 0;
> 
>     [...]
> 
>     > diff --git a/tests/qemu-iotests/160 b/tests/qemu-iotests/160
>     > index df89d3864b..ad2d054a47 100755
>     > --- a/tests/qemu-iotests/160
>     > +++ b/tests/qemu-iotests/160
>     > @@ -57,6 +57,10 @@ for skip in $TEST_SKIP_BLOCKS; do
>     >      $QEMU_IMG dd if="$TEST_IMG" of="$TEST_IMG.out" skip="$skip" -O "$IMGFMT" \
>     >          2> /dev/null
>     >      TEST_IMG="$TEST_IMG.out" _check_test_img
>     > +
>     > +    # We always write the first byte of an image.
>     > +    printf "\0" > "$TEST_IMG.out.dd"
>     > +
>     >      dd if="$TEST_IMG" of="$TEST_IMG.out.dd" skip="$skip" status=none
> 
>     Won’t this dd completely overwrite $TEST_IMG.out.dd (especially given
>     the lack of conv=notrunc)?
> 
> 
> There is an issue only if dd opens the file with O_TRUNC.

It isn’t an issue, I just don’t understand why the printf would be
necessary at all.

dd should always truncate the output image unless conv=notrunc is
specified.  But even if it didn’t do that, in all of these test cases it
should copy some data from $TEST_IMG to the output, and thus should
always overwrite the first byte anyway.
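
The truncation argument can be illustrated with a small C sketch. It
assumes that dd without conv=notrunc opens its output file with O_TRUNC
(which is my understanding of GNU dd when no seek= is given); the helper
write_output() is hypothetical and only models that open mode.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Write data at the start of path. With truncate_first != 0 the file is
 * opened with O_TRUNC (modelling plain dd), otherwise existing contents
 * beyond the written range are kept (modelling conv=notrunc).
 * Returns the resulting file size, or -1 on failure.
 */
static long long write_output(const char *path, const char *data,
                              int truncate_first)
{
    int flags = O_WRONLY | O_CREAT | (truncate_first ? O_TRUNC : 0);
    size_t len;
    struct stat st;
    long long size = -1;
    int fd;

    assert(path != NULL && data != NULL);
    len = strlen(data);
    fd = open(path, flags, 0600);
    if (fd < 0) {
        return -1;
    }
    if (write(fd, data, len) == (ssize_t)len && fstat(fd, &st) == 0) {
        size = st.st_size;
    }
    close(fd);
    return size;
}
```

In the truncating mode whatever was written to the file beforehand is
discarded, and in both modes the first byte is overwritten as long as
some data is copied, so a byte written ahead of time has no effect on
the result.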

> I will test this again.
> 
>     > 
>     >      echo
>     > diff --git a/tests/qemu-iotests/175 b/tests/qemu-iotests/175
>     > index 51e62c8276..c6a3a7bb1e 100755
>     > --- a/tests/qemu-iotests/175
>     > +++ b/tests/qemu-iotests/175
>     > @@ -37,14 +37,16 @@ trap "_cleanup; exit \$status" 0 1 2 3 15
>     >  # the file size.  This function hides the resulting difference in the
>     >  # stat -c '%b' output.
>     >  # Parameter 1: Number of blocks an empty file occupies
>     > -# Parameter 2: Image size in bytes
>     > +# Parameter 2: Minimal number of blocks in an image
>     > +# Parameter 3: Image size in bytes
>     >  _filter_blocks()
>     >  {
>     >      extra_blocks=$1
>     > -    img_size=$2
>     > +    min_blocks=$2
>     > +    img_size=$3
>     > 
>     > -    sed -e "s/blocks=$extra_blocks\\(\$\\|[^0-9]\\)/nothing allocated/" \
>     > -        -e "s/blocks=$((extra_blocks + img_size / 512))\\(\$\\|[^0-9]\\)/everything allocated/"
>     > +    sed -e "s/blocks=$((extra_blocks + min_blocks))\\(\$\\|[^0-9]\\)/min allocation/" \
> 
>     I don’t think adding extra_blocks to min_blocks makes sense.  Just
>     min_blocks alone should be what we want here.
> 
> 
> We had failing tests (in vdsm) showing that filesystem may return more
> blocks than expected even for non-empty files, so this may be needed.

But min_blocks is exactly the number of blocks of a file that has one
allocated block.  I don’t see how adding the number of blocks an empty
file occupies makes sense.
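
A rough C equivalent of the two `stat -c '%b'` measurements in the test
shows why: both values are absolute block counts for a whole file, so
the one-byte file already includes whatever an empty file occupies.
blocks_used() is a hypothetical helper written for this illustration,
and the temporary path under /tmp is an assumption.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Create a temporary file holding nbytes zero bytes and return its
 * st_blocks value (512-byte units, as printed by stat -c '%b'),
 * or -1 on failure.
 */
static long long blocks_used(size_t nbytes)
{
    char path[] = "/tmp/blocks-sketch-XXXXXX";
    struct stat st;
    long long blocks = -1;
    size_t i;
    int fd = mkstemp(path);

    if (fd < 0) {
        return -1;
    }
    for (i = 0; i < nbytes; i++) {
        if (write(fd, "", 1) != 1) { /* "" supplies one NUL byte */
            goto out;
        }
    }
    /* Flush first so that delayed allocation does not hide the blocks. */
    if (fsync(fd) == 0 && fstat(fd, &st) == 0) {
        blocks = st.st_blocks;
    }
out:
    close(fd);
    unlink(path);
    return blocks;
}
```

Here blocks_used(0) corresponds to extra_blocks and blocks_used(1) to
min_blocks in test 175; comparing an image's block count against
blocks_used(1) alone already covers the per-file overhead.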

Max
Nir Soffer Aug. 22, 2019, 7:01 p.m. UTC | #8
On Thu, Aug 22, 2019 at 9:11 PM Max Reitz <mreitz@redhat.com> wrote:

> On 22.08.19 18:39, Nir Soffer wrote:
> > On Thu, Aug 22, 2019 at 5:28 PM Max Reitz <mreitz@redhat.com> wrote:
> >
> >     On 16.08.19 23:21, Nir Soffer wrote:
> >     > When creating an image with preallocation "off" or "falloc", the first
> >     > block of the image is typically not allocated. When using Gluster
> >     > storage backed by XFS filesystem, reading this block using direct I/O
> >     > succeeds regardless of request length, fooling alignment detection.
> >     >
> >     > In this case we fallback to a safe value (4096) instead of the optimal
> >     > value (512), which may lead to unneeded data copying when aligning
> >     > requests.  Allocating the first block avoids the fallback.
> >     >
> >     > When using preallocation=off, we always allocate at least one filesystem
> >     > block:
> >     >
> >     >     $ ./qemu-img create -f raw test.raw 1g
> >     >     Formatting 'test.raw', fmt=raw size=1073741824
> >     >
> >     >     $ ls -lhs test.raw
> >     >     4.0K -rw-r--r--. 1 nsoffer nsoffer 1.0G Aug 16 23:48 test.raw
> >     >
> >     > I did quick performance tests for these flows:
> >     > - Provisioning a VM with a new raw image.
> >     > - Copying disks with qemu-img convert to new raw target image
> >     >
> >     > I installed Fedora 29 server on raw sparse image, measuring the
> time
> >     > from clicking "Begin installation" until the "Reboot" button
> appears:
> >     >
> >     > Before(s)  After(s)     Diff(%)
> >     > -------------------------------
> >     >      356        389        +8.4
> >     >
> >     > I ran this only once, so we cannot tell much from these results.
> >
> >     So you’d expect it to be fast but it was slower?  Well, you only ran
> it
> >     once and it isn’t really a precise benchmark...
> >
> >     > The second test was cloning the installation image with qemu-img
> >     > convert, doing 10 runs:
> >     >
> >     >     for i in $(seq 10); do
> >     >         rm -f dst.raw
> >     >         sleep 10
> >     >         time ./qemu-img convert -f raw -O raw -t none -T none
> >     src.raw dst.raw
> >     >     done
> >     >
> >     > Here is a table comparing the total time spent:
> >     >
> >     > Type    Before(s)   After(s)    Diff(%)
> >     > ---------------------------------------
> >     > real      530.028    469.123      -11.4
> >     > user       17.204     10.768      -37.4
> >     > sys        17.881      7.011      -60.7
> >     >
> >     > Here we see very clear improvement in CPU usage.
> >     >
> >     > Signed-off-by: Nir Soffer <nsoffer@redhat.com
> >     <mailto:nsoffer@redhat.com>>
> >     > ---
> >     >  block/file-posix.c         | 25 +++++++++++++++++++++++++
> >     >  tests/qemu-iotests/150.out |  1 +
> >     >  tests/qemu-iotests/160     |  4 ++++
> >     >  tests/qemu-iotests/175     | 19 +++++++++++++------
> >     >  tests/qemu-iotests/175.out |  8 ++++----
> >     >  tests/qemu-iotests/221.out | 12 ++++++++----
> >     >  tests/qemu-iotests/253.out | 12 ++++++++----
> >     >  7 files changed, 63 insertions(+), 18 deletions(-)
> >     >
> >     > diff --git a/block/file-posix.c b/block/file-posix.c
> >     > index b9c33c8f6c..3964dd2021 100644
> >     > --- a/block/file-posix.c
> >     > +++ b/block/file-posix.c
> >     > @@ -1755,6 +1755,27 @@ static int handle_aiocb_discard(void
> *opaque)
> >     >      return ret;
> >     >  }
> >     >
> >     > +/*
> >     > + * Help alignment detection by allocating the first block.
> >     > + *
> >     > + * When reading with direct I/O from unallocated area on Gluster
> >     backed by XFS,
> >     > + * reading succeeds regardless of request length. In this case we
> >     fallback to
> >     > + * safe aligment which is not optimal. Allocating the first block
> >     avoids this
> >     > + * fallback.
> >     > + *
> >     > + * Returns: 0 on success, -errno on failure.
> >     > + */
> >     > +static int allocate_first_block(int fd)
> >     > +{
> >     > +    ssize_t n;
> >     > +
> >     > +    do {
> >     > +        n = pwrite(fd, "\0", 1, 0);
> >
> >     This breaks when fd has been opened with O_DIRECT.
> >
> >
> > It seems that we always open images without O_DIRECT when creating an
> image
> > in qemu-img create, or when creating a target image in qemu-img convert.
>
> Yes.  But you don’t call this function directly from image creation code
> but instead from the truncation function.  (The former also calls the
> latter, but truncating is also an operation on its own.)
>
> [...]
>
> >     (Which happens when you open some file with cache.direct=on, and then
> >     use e.g. QMP’s block_resize.)
> >
> >
> > What would be a command triggering this? I can add a test.
>
> block_resize, as I’ve said:
>
> $ ./qemu-img create -f raw empty.img 0
>

This is an extreme edge case - why would someone create such an image?


> $ x86_64-softmmu/qemu-system-x86_64 \
>     -qmp stdio \
>     -blockdev file,node-name=file,filename=empty.img,cache.direct=on \
>      <<EOF
> {'execute':'qmp_capabilities'}
>

This is probably too late for the allocation, since we already probed
the alignment before executing block_resize, and used a safe fallback
(4096).
It can help if the image is reopened, since we probe alignment again.

> {'execute':'block_resize',
>  'arguments':{'node-name':'file',
>               'size':1048576}}

> EOF
> $ ./qemu-img map empty.img
> Offset          Length          Mapped to       File
>
> (You’d expect a data chunk here.)

> I suppose you can get the same effect with blockdev-create and some
> format that explicitly resizes the file to some target length (LUKS does
> this, I think), but this is the most direct route.
>

I will try to handle -blockdev in the next version.

> >     It isn’t that bad because eventually you simply ignore the error.  But
> >     it still makes me wonder whether we shouldn’t write like the biggest
> >     power of two that does not exceed the new file length or
> MAX_BLOCKSIZE.
> >
> >
> > It makes sense if there is a way to cause qemu-img to use O_DIRECT when
> > creating an image.
> >
> >     > +    } while (n == -1 && errno == EINTR);
> >     > +
> >     > +    return (n == -1) ? -errno : 0;
> >     > +}
> >     > +
> >     >  static int handle_aiocb_truncate(void *opaque)
> >     >  {
> >     >      RawPosixAIOData *aiocb = opaque;
> >     > @@ -1794,6 +1815,8 @@ static int handle_aiocb_truncate(void
> *opaque)
> >     >                  /* posix_fallocate() doesn't set errno. */
> >     >                  error_setg_errno(errp, -result,
> >     >                                   "Could not preallocate new
> data");
> >     > +            } else if (current_length == 0) {
> >     > +                allocate_first_block(fd);
> >
> >     Should posix_fallocate() not take care of precisely this?
> >
> >
> > Only if the filesystem does not support fallocate() (e.g. NFS < 4.2).
> >
> > In this case posix_fallocate() is doing:
> >
> >   for (offset += (len - 1) % increment; len > 0; offset += increment)
> >     {
> >       len -= increment;
> >       if (offset < st.st_size)
> >         {
> >           unsigned char c;
> >           ssize_t rsize = __pread (fd, &c, 1, offset);
> >           if (rsize < 0)
> >             return errno;
> >           /* If there is a non-zero byte, the block must have been
> >              allocated already.  */
> >           else if (rsize == 1 && c != 0)
> >             continue;
> >         }
> >       if (__pwrite (fd, "", 1, offset) != 1)
> >         return errno;
> >     }
> >
> >
> https://code.woboq.org/userspace/glibc/sysdeps/posix/posix_fallocate.c.html#96
> >
> > So opening a file with O_DIRECT will break preallocation=falloc on such
> > filesystems,
>
> But won’t the function above just fail with EINVAL?
> allocate_first_block() is executed only in case of success.
>

Sure, but if posix_fallocate() fails, we fail qemu-img create/convert.

> > and writing one byte in allocate_first_block() is safe.
> >
> >     >              }
> >     >          } else {
> >     >              result = 0;
> >
> >     [...]
> >
> >     > diff --git a/tests/qemu-iotests/160 b/tests/qemu-iotests/160
> >     > index df89d3864b..ad2d054a47 100755
> >     > --- a/tests/qemu-iotests/160
> >     > +++ b/tests/qemu-iotests/160
> >     > @@ -57,6 +57,10 @@ for skip in $TEST_SKIP_BLOCKS; do
> >     >      $QEMU_IMG dd if="$TEST_IMG" of="$TEST_IMG.out" skip="$skip"
> >     -O "$IMGFMT" \
> >     >          2> /dev/null
> >     >      TEST_IMG="$TEST_IMG.out" _check_test_img
> >     > +
> >     > +    # We always write the first byte of an image.
> >     > +    printf "\0" > "$TEST_IMG.out.dd"
> >     > +
> >     >      dd if="$TEST_IMG" of="$TEST_IMG.out.dd" skip="$skip"
> status=none
> >
> >     Won’t this dd completely overwrite $TEST_IMG.out.dd (especially given
> >     the lack of conv=notrunc)?
> >
> >
> > There is an issue only if dd open the file with O_TRUNC.
>
> It isn’t an issue, I just don’t understand why the printf would be
> necessary at all.
>
> dd should always truncate the output image unless conv=notrunc is
> specified.  But even if it didn’t do that, in all of these test cases it
> should copy some data from $TEST_IMG to the output, and thus should
> always overwrite the first byte anyway.
>

Right, this change is not needed.

> > I will test
> > this again.
> >
> >     >
> >     >      echo
> >     > diff --git a/tests/qemu-iotests/175 b/tests/qemu-iotests/175
> >     > index 51e62c8276..c6a3a7bb1e 100755
> >     > --- a/tests/qemu-iotests/175
> >     > +++ b/tests/qemu-iotests/175
> >     > @@ -37,14 +37,16 @@ trap "_cleanup; exit \$status" 0 1 2 3 15
> >     >  # the file size.  This function hides the resulting difference in
> the
> >     >  # stat -c '%b' output.
> >     >  # Parameter 1: Number of blocks an empty file occupies
> >     > -# Parameter 2: Image size in bytes
> >     > +# Parameter 2: Minimal number of blocks in an image
> >     > +# Parameter 3: Image size in bytes
> >     >  _filter_blocks()
> >     >  {
> >     >      extra_blocks=$1
> >     > -    img_size=$2
> >     > +    min_blocks=$2
> >     > +    img_size=$3
> >     >
> >     > -    sed -e "s/blocks=$extra_blocks\\(\$\\|[^0-9]\\)/nothing
> >     allocated/" \
> >     > -        -e "s/blocks=$((extra_blocks + img_size /
> >     512))\\(\$\\|[^0-9]\\)/everything allocated/"
> >     > +    sed -e "s/blocks=$((extra_blocks +
> >     min_blocks))\\(\$\\|[^0-9]\\)/min allocation/" \
> >
> >     I don’t think adding extra_blocks to min_blocks makes sense.  Just
> >     min_blocks alone should be what we want here.
> >
> >
> > We had failing tests (in vdsm) showing that a filesystem may return more
> > blocks than expected even for non-empty files, so this may be needed.
>
> But min_blocks is exactly the number of blocks of a file that has one
> allocated block.  I don’t see how adding the number of blocks an empty
> file occupies makes sense.
>

You are right. The test fails on a filesystem that allocates extra blocks.

Nir
Max Reitz Aug. 23, 2019, 1:58 p.m. UTC | #9
On 22.08.19 21:01, Nir Soffer wrote:
> On Thu, Aug 22, 2019 at 9:11 PM Max Reitz <mreitz@redhat.com
> <mailto:mreitz@redhat.com>> wrote:
> 
>     On 22.08.19 18:39, Nir Soffer wrote:
>     > On Thu, Aug 22, 2019 at 5:28 PM Max Reitz <mreitz@redhat.com
>     <mailto:mreitz@redhat.com>
>     > <mailto:mreitz@redhat.com <mailto:mreitz@redhat.com>>> wrote:
>     >
>     >     On 16.08.19 23:21, Nir Soffer wrote:
>     >     > When creating an image with preallocation "off" or "falloc",
>     the first
>     >     > block of the image is typically not allocated. When using
>     Gluster
>     >     > storage backed by XFS filesystem, reading this block using
>     direct I/O
>     >     > succeeds regardless of request length, fooling alignment
>     detection.
>     >     >
>     >     > In this case we fallback to a safe value (4096) instead of
>     the optimal
>     >     > value (512), which may lead to unneeded data copying when
>     aligning
>     >     > requests.  Allocating the first block avoids the fallback.
>     >     >
>     >     > When using preallocation=off, we always allocate at least one
>     >     filesystem
>     >     > block:
>     >     >
>     >     >     $ ./qemu-img create -f raw test.raw 1g
>     >     >     Formatting 'test.raw', fmt=raw size=1073741824
>     >     >
>     >     >     $ ls -lhs test.raw
>     >     >     4.0K -rw-r--r--. 1 nsoffer nsoffer 1.0G Aug 16 23:48
>     test.raw
>     >     >
>     >     > I did quick performance tests for these flows:
>     >     > - Provisioning a VM with a new raw image.
>     >     > - Copying disks with qemu-img convert to new raw target image
>     >     >
>     >     > I installed Fedora 29 server on raw sparse image, measuring
>     the time
>     >     > from clicking "Begin installation" until the "Reboot" button
>     appears:
>     >     >
>     >     > Before(s)  After(s)     Diff(%)
>     >     > -------------------------------
>     >     >      356        389        +8.4
>     >     >
>     >     > I ran this only once, so we cannot tell much from these results.
>     >
>     >     So you’d expect it to be fast but it was slower?  Well, you
>     only ran it
>     >     once and it isn’t really a precise benchmark...
>     >
>     >     > The second test was cloning the installation image with qemu-img
>     >     > convert, doing 10 runs:
>     >     >
>     >     >     for i in $(seq 10); do
>     >     >         rm -f dst.raw
>     >     >         sleep 10
>     >     >         time ./qemu-img convert -f raw -O raw -t none -T none
>     >     src.raw dst.raw
>     >     >     done
>     >     >
>     >     > Here is a table comparing the total time spent:
>     >     >
>     >     > Type    Before(s)   After(s)    Diff(%)
>     >     > ---------------------------------------
>     >     > real      530.028    469.123      -11.4
>     >     > user       17.204     10.768      -37.4
>     >     > sys        17.881      7.011      -60.7
>     >     >
>     >     > Here we see very clear improvement in CPU usage.
>     >     >
>     >     > Signed-off-by: Nir Soffer <nsoffer@redhat.com
>     <mailto:nsoffer@redhat.com>
>     >     <mailto:nsoffer@redhat.com <mailto:nsoffer@redhat.com>>>
>     >     > ---
>     >     >  block/file-posix.c         | 25 +++++++++++++++++++++++++
>     >     >  tests/qemu-iotests/150.out |  1 +
>     >     >  tests/qemu-iotests/160     |  4 ++++
>     >     >  tests/qemu-iotests/175     | 19 +++++++++++++------
>     >     >  tests/qemu-iotests/175.out |  8 ++++----
>     >     >  tests/qemu-iotests/221.out | 12 ++++++++----
>     >     >  tests/qemu-iotests/253.out | 12 ++++++++----
>     >     >  7 files changed, 63 insertions(+), 18 deletions(-)
>     >     >
>     >     > diff --git a/block/file-posix.c b/block/file-posix.c
>     >     > index b9c33c8f6c..3964dd2021 100644
>     >     > --- a/block/file-posix.c
>     >     > +++ b/block/file-posix.c
>     >     > @@ -1755,6 +1755,27 @@ static int handle_aiocb_discard(void
>     *opaque)
>     >     >      return ret;
>     >     >  }
>     >     > 
>     >     > +/*
>     >     > + * Help alignment detection by allocating the first block.
>     >     > + *
>     >     > + * When reading with direct I/O from unallocated area on
>     Gluster
>     >     backed by XFS,
>     >     > + * reading succeeds regardless of request length. In this
>     case we
>     >     fallback to
>     >     > + * safe aligment which is not optimal. Allocating the first
>     block
>     >     avoids this
>     >     > + * fallback.
>     >     > + *
>     >     > + * Returns: 0 on success, -errno on failure.
>     >     > + */
>     >     > +static int allocate_first_block(int fd)
>     >     > +{
>     >     > +    ssize_t n;
>     >     > +
>     >     > +    do {
>     >     > +        n = pwrite(fd, "\0", 1, 0);
>     >
>     >     This breaks when fd has been opened with O_DIRECT.
>     >
>     >
>     > It seems that we always open images without O_DIRECT when creating
>     an image
>     > in qemu-img create, or when creating a target image in qemu-img
>     convert.
> 
>     Yes.  But you don’t call this function directly from image creation code
>     but instead from the truncation function.  (The former also calls the
>     latter, but truncating is also an operation on its own.)
> 
>     [...]
> 
>     >     (Which happens when you open some file with cache.direct=on,
>     and then
>     >     use e.g. QMP’s block_resize.)
>     >
>     >
>     > What would be a command triggering this? I can add a test.
> 
>     block_resize, as I’ve said:
> 
>     $ ./qemu-img create -f raw empty.img 0
> 
> 
> This is an extreme edge case - why would someone create such an image?

Because it works?

This is generally the first step of image creation with blockdev-create,
because you don’t care about the size of the protocol layer.

If you have a format layer that truncates the image to a fixed size and
does not write anything into the first block itself (say because it uses
a footer), then (with O_DIRECT) allocate_first_block() will fail
(silently, because while it does return an error value, it is never
checked and there is no comment that explains why we don’t check it) and
the first block actually will not be allocated.
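
One way to avoid the silent failure would be to write a whole aligned
block instead of a single byte. The following is only a minimal sketch
of that idea, not the code from this patch: the names
first_block_write_size(), allocate_first_block_aligned() and
SKETCH_MAX_BLOCKSIZE are hypothetical, and the 4096 limit is an assumed
stand-in for MAX_BLOCKSIZE. It is only safe when the file was empty
before the truncate, as in the current_length == 0 branch.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Assumed upper bound on the block size; the real constant may differ. */
#define SKETCH_MAX_BLOCKSIZE 4096

/*
 * Largest power of two that exceeds neither new_length nor
 * SKETCH_MAX_BLOCKSIZE. Returns 0 for an empty file.
 */
static size_t first_block_write_size(size_t new_length)
{
    size_t size = SKETCH_MAX_BLOCKSIZE;

    while (size > new_length) {
        size >>= 1;
    }
    return size;
}

/*
 * Allocate the first block by writing zeroes from an aligned buffer.
 * Only call this on a file that was empty before it was extended,
 * otherwise existing data at offset 0 would be overwritten.
 *
 * Returns: 0 on success, -errno on failure.
 */
static int allocate_first_block_aligned(int fd, size_t new_length)
{
    size_t write_size = first_block_write_size(new_length);
    void *buf;
    ssize_t n;
    int ret;

    if (write_size == 0) {
        return 0; /* nothing to allocate in a zero-length file */
    }
    assert((write_size & (write_size - 1)) == 0);

    /* O_DIRECT needs an aligned buffer address as well as length. */
    ret = posix_memalign(&buf, SKETCH_MAX_BLOCKSIZE, write_size);
    if (ret != 0) {
        return -ret;
    }
    memset(buf, 0, write_size);

    do {
        n = pwrite(fd, buf, write_size, 0);
    } while (n == -1 && errno == EINTR);

    ret = (n == -1) ? -errno : 0;
    free(buf);
    return ret;
}
```

If the computed size is still smaller than the logical block size of
the storage (for example a 64 byte file on a 512 byte device), a direct
write would presumably still fail with EINVAL, so the caller would have
to tolerate that error.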

I could show you that with VPC (which supports a fixed subformat where
it uses a footer), but unfortunately that’s a bit broken right now
(because of a bug in blockdev-create; I’ll send a patch).

The test would go like this:

$ x86_64-softmmu/qemu-system-x86_64 -qmp stdio
{"execute":"qmp_capabilities"}

{"execute":"blockdev-create",
 "arguments":{
    "job-id":"create",
    "options":{"driver":"file",
               "filename":"test.img",
               "size":0}}}

[Wait until the job is pending]

{"execute":"job-dismiss","arguments":{"id":"create"}}

{"execute":"blockdev-add",
 "arguments":{
    "driver":"file",
    "node-name":"protocol-node",
    "filename":"test.img",
    "cache":{"direct":true}}}

{"execute":"blockdev-create",
 "arguments":{
    "job-id":"create",
    "options":{"driver":"vpc",
               "file":"protocol-node",
               "subformat":"fixed",
               "size":67108864,
               "force-size":true}}}

[Wait until the job is pending]

{"execute":"job-dismiss","arguments":{"id":"create"}}

{"execute":"quit"}

And then:

$ ./qemu-img map test.img
Offset          Length          Mapped to       File
0x4000000       0x200           0x4000000       test.img

The footer is mapped, but the first block is not allocated.


As I said, for that to work, you need a patch (because of a bug), namely:

[Start of patch]

diff --git a/block/create.c b/block/create.c
index 1bd00ed5f8..572d3a4176 100644
--- a/block/create.c
+++ b/block/create.c
@@ -48,7 +48,7 @@ static int coroutine_fn blockdev_create_run(Job *job,
Error **errp)

     qapi_free_BlockdevCreateOptions(s->opts);

-    return ret;
+    return ret < 0 ? ret : 0;
 }

 static const JobDriver blockdev_create_job_driver = {

[End of patch]

(The reason being that the vpc block driver returns 512 here to signify
success, but the job infrastructure treats anything but 0 as a failure.)

>     $ x86_64-softmmu/qemu-system-x86_64 \
>         -qmp stdio \
>         -blockdev file,node-name=file,filename=empty.img,cache.direct=on \
>          <<EOF
>     {'execute':'qmp_capabilities'}
> 
> 
> This is probably too late for the allocation, since we already probed
> the alignment before executing block_resize, and used a safe fallback
> (4096).
> It can help if the image is reopened, since we probe alignment again.

I’m not talking about getting the alignment right when you have a
zero-length image.  That can probably never work with probing.  (Well, I
mean, technically you could make allocate_first_block() probe.  I won’t
ask for that because that really seems like too little gain for too much
effort.)
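
For reference, the kind of probing discussed here can be sketched
roughly as below. This is a simplified illustration under stated
assumptions, not the actual block/file-posix.c code: the names
probe_request_alignment(), read_is_accepted() and SKETCH_SAFE_ALIGNMENT
are hypothetical, and the list of candidate sizes is assumed. It shows
why a read that succeeds for any length leaves only the safe fallback.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <unistd.h>

/* The value the thread calls the "safe" fallback. */
#define SKETCH_SAFE_ALIGNMENT 4096

/* True if a read of len bytes at offset 0 does not fail. */
static bool read_is_accepted(int fd, void *buf, size_t len)
{
    return pread(fd, buf, len, 0) >= 0;
}

/*
 * Try reads of increasing size and take the first accepted size as the
 * request alignment. If even a 1 byte read is accepted, the probe has
 * learned nothing, so fall back to the safe value.
 */
static size_t probe_request_alignment(int fd)
{
    static const size_t candidates[] = { 1, 512, 1024, 2048, 4096 };
    size_t result = SKETCH_SAFE_ALIGNMENT;
    void *buf;
    size_t i;

    assert(fd >= 0);

    if (posix_memalign(&buf, SKETCH_SAFE_ALIGNMENT,
                       SKETCH_SAFE_ALIGNMENT) != 0) {
        return SKETCH_SAFE_ALIGNMENT;
    }

    for (i = 0; i < sizeof(candidates) / sizeof(candidates[0]); i++) {
        if (read_is_accepted(fd, buf, candidates[i])) {
            result = (candidates[i] != 1) ? candidates[i]
                                          : SKETCH_SAFE_ALIGNMENT;
            break;
        }
    }

    free(buf);
    return result;
}
```

On a zero-length file every read returns 0 bytes without an error, so
this sketch ends up with the fallback no matter how the file was opened.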

I’m just talking about the fact that this allocating write will fail, so
when the image is used the next time, it will not have the first block
allocated.

[...]

>     >     > @@ -1794,6 +1815,8 @@ static int handle_aiocb_truncate(void
>     *opaque)
>     >     >                  /* posix_fallocate() doesn't set errno. */
>     >     >                  error_setg_errno(errp, -result,
>     >     >                                   "Could not preallocate new
>     data");
>     >     > +            } else if (current_length == 0) {
>     >     > +                allocate_first_block(fd);
>     >
>     >     Should posix_fallocate() not take care of precisely this?
>     >
>     >
>     > Only if the filesystem does not support fallocate() (e.g. NFS < 4.2).
>     >
>     > In this case posix_fallocate() is doing:
>     >
>     >   for (offset += (len - 1) % increment; len > 0; offset += increment)
>     >     {
>     >       len -= increment;
>     >       if (offset < st.st_size)
>     >         {
>     >           unsigned char c;
>     >           ssize_t rsize = __pread (fd, &c, 1, offset);
>     >           if (rsize < 0)
>     >             return errno;
>     >           /* If there is a non-zero byte, the block must have been
>     >              allocated already.  */
>     >           else if (rsize == 1 && c != 0)
>     >             continue;
>     >         }
>     >       if (__pwrite (fd, "", 1, offset) != 1)
>     >         return errno;
>     >     }
>     >
>     >
>     https://code.woboq.org/userspace/glibc/sysdeps/posix/posix_fallocate.c.html#96
>     >
>     > So opening a file with O_DIRECT will break preallocation=falloc on
>     such
>     > filesystems,
> 
>     But won’t the function above just fail with EINVAL?
>     allocate_first_block() is executed only in case of success.
> 
> 
> Sure, but if posix_fallocate() fails, we fail qemu-img create/convert.

Exactly.  But if posix_fallocate() works, it should have allocated the
first block.

Max
Nir Soffer Aug. 23, 2019, 4:30 p.m. UTC | #10
On Fri, Aug 23, 2019 at 4:58 PM Max Reitz <mreitz@redhat.com> wrote:

> On 22.08.19 21:01, Nir Soffer wrote:
>
...

> >     >     > @@ -1794,6 +1815,8 @@ static int handle_aiocb_truncate(void
> >     *opaque)
> >     >     >                  /* posix_fallocate() doesn't set errno. */
> >     >     >                  error_setg_errno(errp, -result,
> >     >     >                                   "Could not preallocate new
> >     data");
> >     >     > +            } else if (current_length == 0) {
> >     >     > +                allocate_first_block(fd);
> >     >
> >     >     Should posix_fallocate() not take care of precisely this?
> >     >
> >     >
> >     > Only if the filesystem does not support fallocate() (e.g. NFS <
> 4.2).
> >     >
> >     > In this case posix_fallocate() is doing:
> >     >
> >     >   for (offset += (len - 1) % increment; len > 0; offset +=
> increment)
> >     >     {
> >     >       len -= increment;
> >     >       if (offset < st.st_size)
> >     >         {
> >     >           unsigned char c;
> >     >           ssize_t rsize = __pread (fd, &c, 1, offset);
> >     >           if (rsize < 0)
> >     >             return errno;
> >     >           /* If there is a non-zero byte, the block must have been
> >     >              allocated already.  */
> >     >           else if (rsize == 1 && c != 0)
> >     >             continue;
> >     >         }
> >     >       if (__pwrite (fd, "", 1, offset) != 1)
> >     >         return errno;
> >     >     }
> >     >
> >     >
> >
> https://code.woboq.org/userspace/glibc/sysdeps/posix/posix_fallocate.c.html#96
> >     >
> >     > So opening a file with O_DIRECT will break preallocation=falloc on
> >     such
> >     > filesystems,
> >
> >     But won’t the function above just fail with EINVAL?
> >     allocate_first_block() is executed only in case of success.
> >
> >
> > Sure, but if posix_fallocate() fails, we fail qemu-img create/convert.
>
> Exactly.  But if posix_fallocate() works, it should have allocated the
> first block.
>

Only if the file system does not support fallocate(). posix_fallocate()
first tries fallocate(), and falls back to manual preallocation:
https://code.woboq.org/userspace/glibc/sysdeps/unix/sysv/linux/posix_fallocate.c.html#27

Here is an example using fallocate --posix:

$ sudo mount -t glusterfs gluster1:/gv0 /tmp/gv0

(gv0 is a Gluster volume backed by XFS on top of a VDO device with a 4k
sector size)

$ fallocate -l 1g --posix empty.raw
$ dd if=empty.raw bs=1 count=1 of=/dev/null iflag=direct status=none

$ dd if=/dev/zero bs=1 count=1 of=empty.raw conv=notrunc status=none
$ dd if=empty.raw bs=1 count=1 of=/dev/null iflag=direct status=none
dd: error reading 'empty.raw': Invalid argument

$ dd if=empty.raw bs=512 count=1 of=/dev/null iflag=direct status=none
dd: error reading 'empty.raw': Invalid argument

$ dd if=empty.raw bs=4096 count=1 of=/dev/null iflag=direct status=none

Here is an example using Gluster storage with a sector size of 512 bytes.

$ sudo mount -t glusterfs gluster1:/gv1 /tmp/gv1

$ fallocate -l 1g --posix empty.raw
$ dd if=empty.raw bs=1 count=1 of=/dev/null iflag=direct status=none

$ dd if=/dev/zero bs=1 count=1 of=empty.raw conv=notrunc status=none

$ dd if=empty.raw bs=1 count=1 of=/dev/null iflag=direct status=none
dd: error reading 'empty.raw': Invalid argument

$ dd if=empty.raw bs=512 count=1 of=/dev/null iflag=direct status=none

So we must allocate the first block using write() after calling
posix_fallocate().
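
The sequence argued for above can be sketched as follows. This is a
minimal illustration, not the patch itself: preallocate_and_touch() is
a hypothetical name, and the sketch assumes the file was empty before
the call and that fd was not opened with O_DIRECT.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Preallocate length bytes, then write one zero byte at offset 0 so
 * that the first block is really written and not just reserved.
 *
 * Returns: 0 on success, -errno on failure.
 */
static int preallocate_and_touch(int fd, off_t length)
{
    ssize_t n;
    int err;

    assert(length > 0);

    /* posix_fallocate() returns the error number and doesn't set errno. */
    err = posix_fallocate(fd, 0, length);
    if (err != 0) {
        return -err;
    }

    do {
        n = pwrite(fd, "\0", 1, 0);
    } while (n == -1 && errno == EINTR);

    return (n == -1) ? -errno : 0;
}
```

In the patch the result of the write is ignored, since the write is
only an optimization; a caller of this sketch could do the same.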

Nir
Nir Soffer Aug. 23, 2019, 4:48 p.m. UTC | #11
On Fri, Aug 23, 2019 at 4:58 PM Max Reitz <mreitz@redhat.com> wrote:

> On 22.08.19 21:01, Nir Soffer wrote:
> > On Thu, Aug 22, 2019 at 9:11 PM Max Reitz <mreitz@redhat.com
> > <mailto:mreitz@redhat.com>> wrote:
> >
> >     On 22.08.19 18:39, Nir Soffer wrote:
> >     > On Thu, Aug 22, 2019 at 5:28 PM Max Reitz <mreitz@redhat.com
> >     <mailto:mreitz@redhat.com>
> >     > <mailto:mreitz@redhat.com <mailto:mreitz@redhat.com>>> wrote:
> >     >
> >     >     On 16.08.19 23:21, Nir Soffer wrote:
> >     >     > When creating an image with preallocation "off" or "falloc",
> >     the first
> >     >     > block of the image is typically not allocated. When using
> >     Gluster
> >     >     > storage backed by XFS filesystem, reading this block using
> >     direct I/O
> >     >     > succeeds regardless of request length, fooling alignment
> >     detection.
> >     >     >
> >     >     > In this case we fallback to a safe value (4096) instead of
> >     the optimal
> >     >     > value (512), which may lead to unneeded data copying when
> >     aligning
> >     >     > requests.  Allocating the first block avoids the fallback.
> >     >     >
> >     >     > When using preallocation=off, we always allocate at least one
> >     >     filesystem
> >     >     > block:
> >     >     >
> >     >     >     $ ./qemu-img create -f raw test.raw 1g
> >     >     >     Formatting 'test.raw', fmt=raw size=1073741824
> >     >     >
> >     >     >     $ ls -lhs test.raw
> >     >     >     4.0K -rw-r--r--. 1 nsoffer nsoffer 1.0G Aug 16 23:48
> >     test.raw
> >     >     >
> >     >     > I did quick performance tests for these flows:
> >     >     > - Provisioning a VM with a new raw image.
> >     >     > - Copying disks with qemu-img convert to new raw target image
> >     >     >
> >     >     > I installed Fedora 29 server on raw sparse image, measuring
> >     the time
> >     >     > from clicking "Begin installation" until the "Reboot" button
> >     appears:
> >     >     >
> >     >     > Before(s)  After(s)     Diff(%)
> >     >     > -------------------------------
> >     >     >      356        389        +8.4
> >     >     >
> >     >     > I ran this only once, so we cannot tell much from these
> results.
> >     >
> >     >     So you’d expect it to be fast but it was slower?  Well, you
> >     only ran it
> >     >     once and it isn’t really a precise benchmark...
> >     >
> >     >     > The second test was cloning the installation image with
> qemu-img
> >     >     > convert, doing 10 runs:
> >     >     >
> >     >     >     for i in $(seq 10); do
> >     >     >         rm -f dst.raw
> >     >     >         sleep 10
> >     >     >         time ./qemu-img convert -f raw -O raw -t none -T none
> >     >     src.raw dst.raw
> >     >     >     done
> >     >     >
> >     >     > Here is a table comparing the total time spent:
> >     >     >
> >     >     > Type    Before(s)   After(s)    Diff(%)
> >     >     > ---------------------------------------
> >     >     > real      530.028    469.123      -11.4
> >     >     > user       17.204     10.768      -37.4
> >     >     > sys        17.881      7.011      -60.7
> >     >     >
> >     >     > Here we see very clear improvement in CPU usage.
> >     >     >
> >     >     > Signed-off-by: Nir Soffer <nsoffer@redhat.com
> >     <mailto:nsoffer@redhat.com>
> >     >     <mailto:nsoffer@redhat.com <mailto:nsoffer@redhat.com>>>
> >     >     > ---
> >     >     >  block/file-posix.c         | 25 +++++++++++++++++++++++++
> >     >     >  tests/qemu-iotests/150.out |  1 +
> >     >     >  tests/qemu-iotests/160     |  4 ++++
> >     >     >  tests/qemu-iotests/175     | 19 +++++++++++++------
> >     >     >  tests/qemu-iotests/175.out |  8 ++++----
> >     >     >  tests/qemu-iotests/221.out | 12 ++++++++----
> >     >     >  tests/qemu-iotests/253.out | 12 ++++++++----
> >     >     >  7 files changed, 63 insertions(+), 18 deletions(-)
> >     >     >
> >     >     > diff --git a/block/file-posix.c b/block/file-posix.c
> >     >     > index b9c33c8f6c..3964dd2021 100644
> >     >     > --- a/block/file-posix.c
> >     >     > +++ b/block/file-posix.c
> >     >     > @@ -1755,6 +1755,27 @@ static int handle_aiocb_discard(void
> >     *opaque)
> >     >     >      return ret;
> >     >     >  }
> >     >     >
> >     >     > +/*
> >     >     > + * Help alignment detection by allocating the first block.
> >     >     > + *
> >     >     > + * When reading with direct I/O from unallocated area on
> >     Gluster
> >     >     backed by XFS,
> >     >     > + * reading succeeds regardless of request length. In this
> >     case we
> >     >     fallback to
> >     >     > + * safe aligment which is not optimal. Allocating the first
> >     block
> >     >     avoids this
> >     >     > + * fallback.
> >     >     > + *
> >     >     > + * Returns: 0 on success, -errno on failure.
> >     >     > + */
> >     >     > +static int allocate_first_block(int fd)
> >     >     > +{
> >     >     > +    ssize_t n;
> >     >     > +
> >     >     > +    do {
> >     >     > +        n = pwrite(fd, "\0", 1, 0);
> >     >
> >     >     This breaks when fd has been opened with O_DIRECT.
> >     >
> >     >
> >     > It seems that we always open images without O_DIRECT when creating
> >     an image
> >     > in qemu-img create, or when creating a target image in qemu-img
> >     convert.
> >
> >     Yes.  But you don’t call this function directly from image creation
> code
> >     but instead from the truncation function.  (The former also calls the
> >     latter, but truncating is also an operation on its own.)
> >
> >     [...]
> >
> >     >     (Which happens when you open some file with cache.direct=on,
> >     and then
> >     >     use e.g. QMP’s block_resize.)
> >     >
> >     >
> >     > What would be a command triggering this? I can add a test.
> >
> >     block_resize, as I’ve said:
> >
> >     $ ./qemu-img create -f raw empty.img 0
> >
> >
> > This is an extreme edge case - why would someone create such an image?
>
> Because it works?
>
> This is generally the first step of image creation with blockdev-create,
> because you don’t care about the size of the protocol layer.
>
> If you have a format layer that truncates the image to a fixed size and
> does not write anything into the first block itself (say because it uses
> a footer), then (with O_DIRECT) allocate_first_block() will fail
> (silently, because while it does return an error value, it is never
> checked and there is no comment that explains why we don’t check it)


The motivation is that this is an optimization for the special case of
using an empty image, so it is not worth failing image creation over it.
I will add a comment about that.


> and
> the first block actually will not be allocated.
>
> I could show you that with VPC (which supports a fixed subformat where
> it uses a footer), but unfortunately that’s a bit broken right now
> (because of a bug in blockdev-create; I’ll send a patch).
>
> The test would go like this:
>
> $ x86_64-softmmu/qemu-system-x86_64 -qmp stdio
> {"execute":"qmp_capabilities"}
>
> {"execute":"blockdev-create",
>  "arguments":{
>     "job-id":"create",
>     "options":{"driver":"file",
>                "filename":"test.img",
>                "size":0}}}
>
> [Wait until the job is pending]
>
> {"execute":"job-dismiss","arguments":{"id":"create"}}
>
> {"execute":"blockdev-add",
>  "arguments":{
>     "driver":"file",
>     "node-name":"protocol-node",
>     "filename":"test.img",
>     "cache":{"direct":true}}}
>
> {"execute":"blockdev-create",
>  "arguments":{
>     "job-id":"create",
>     "options":{"driver":"vpc",
>                "file":"protocol-node",
>                "subformat":"fixed",
>                "size":67108864,
>                "force-size":true}}}
>
> [Wait until the job is pending]
>
> {"execute":"job-dismiss","arguments":{"id":"create"}}
>
> {"execute":"quit"}
>
> And then:
>
> $ ./qemu-img map test.img
> Offset          Length          Mapped to       File
> 0x4000000       0x200           0x4000000       test.img
>
> The footer is mapped, but the first block is not allocated.
>

Thanks for the example.

I will need time to play with blockdev and understand the flows when images
are created. Do you think it would be useful to fix only image creation
via qemu-img now, and handle blockdev later?
...
Max Reitz Aug. 23, 2019, 5:41 p.m. UTC | #12
On 23.08.19 18:30, Nir Soffer wrote:
> On Fri, Aug 23, 2019 at 4:58 PM Max Reitz <mreitz@redhat.com
> <mailto:mreitz@redhat.com>> wrote:
> 

[...]

>     Exactly.  But if posix_fallocate() works, it should have allocated the
>     first block.
> 
> 
> Only if the file system does not support fallocate(). posix_fallocate()
> first tries fallocate(), and falls back to manual preallocation:
> https://code.woboq.org/userspace/glibc/sysdeps/unix/sysv/linux/posix_fallocate.c.html#27

I still don’t understand.  Your example does show that the first block
is not allocated by fallocate(), but I still don’t understand the
connection to not having fallocate() support.

If it doesn’t have fallocate() support and posix_fallocate() does fall
back, the result should be that posix_fallocate() manually allocates
data, which should be completely sufficient.

So in fact, it seems to me that the opposite is true: It seems that when
allocating blocks on XFS with fallocate(), that simply won’t be enough
to cause alignment errors.  So it doesn’t seem to be about fallback
code, but precisely the normal XFS code that fully supports fallocate.

(Just running your example on a local file on XFS shows the same result.)

So that seems to me why the additional allocation is necessary.  I think
that should be noted in a comment – if I’m right (I may well not be).
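
For context, the alignment detection that this allocation is meant to help can be modeled roughly as follows. This is a simplified sketch, not the code in block/file-posix.c: the candidate list, `probe_request_alignment()` and the two simulated read predicates are assumptions made for illustration. Only the safe fallback (4096) and the optimal value (512) come from the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Safe fallback alignment named in the commit message. */
#define SAFE_ALIGN 4096

/*
 * Hypothetical predicate standing in for "a direct I/O read of len bytes
 * at offset 0 succeeds". A real probe would call pread() on an O_DIRECT fd.
 */
typedef bool (*read_ok_fn)(size_t len);

/* Try increasing request lengths and return the first one that works. */
static size_t probe_request_alignment(read_ok_fn read_ok)
{
    /* Assumed candidate list; 1 is only there to detect a useless probe. */
    static const size_t candidates[] = { 1, 512, 1024, 2048, 4096 };
    size_t i;

    for (i = 0; i < sizeof(candidates) / sizeof(candidates[0]); i++) {
        if (read_ok(candidates[i])) {
            /*
             * If even a 1-byte read succeeds, the storage is not enforcing
             * any alignment for this read, so the probe learned nothing:
             * fall back to the safe value.
             */
            return candidates[i] == 1 ? SAFE_ALIGN : candidates[i];
        }
    }
    return SAFE_ALIGN;
}

/* Simulated allocated block on storage with 512-byte sectors. */
static bool reads_need_512(size_t len)
{
    return len % 512 == 0;
}

/* Simulated unallocated block where a read of any length succeeds. */
static bool reads_always_succeed(size_t len)
{
    (void)len;
    return true;
}
```

In this model, `probe_request_alignment(reads_need_512)` settles on 512, while `probe_request_alignment(reads_always_succeed)` ends up at the 4096 fallback, which is the case the patch tries to avoid.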

Max
Max Reitz Aug. 23, 2019, 5:53 p.m. UTC | #13
On 23.08.19 18:48, Nir Soffer wrote:
> On Fri, Aug 23, 2019 at 4:58 PM Max Reitz <mreitz@redhat.com
> <mailto:mreitz@redhat.com>> wrote:

[...]

>     If you have a format layer that truncates the image to a fixed size and
>     does not write anything into the first block itself (say because it uses
>     a footer), then (with O_DIRECT) allocate_first_block() will fail
>     (silently, because while it does return an error value, it is never
>     checked and there is no comment that explains why we don’t check it)
> 
> 
> The motivation is that this is an optimization for the special case of using
> an empty image, so it is not worth failing image creation.
> I will add a comment about that.

Thanks!

[...]

> Thanks for the example.
> 
> I will need time to play with blockdev and understand the flows when images
> are created. Do you think it would be useful to fix now only image creation
> via qemu-img, and handle blockdev later?

Well, it isn’t about blockdev, it’s simply about the fact that this
function doesn’t work for O_DIRECT files.  I showed how to reproduce the
issue without blockdev (namely block_resize).  Sure, that is an edge
case, but it is a completely valid case.

Also, it seems to me the fix is rather simple.  Just something like:

static int allocate_first_block(int fd, int64_t max_size)
{
    int write_size = MIN(max_size, MAX_BLOCKSIZE);
    void *buf;
    ssize_t n;
    int ret;

    /* Round down to power of two */
    assert(write_size > 0);
    write_size = 1 << (31 - clz32(write_size));

    buf = qemu_memalign(MAX(getpagesize(), write_size), write_size);
    memset(buf, 0, write_size);

    do {
        n = pwrite(fd, buf, write_size, 0);
    } while (n < 0 && errno == EINTR);

    /* Save the result first: freeing the buffer may clobber errno. */
    ret = n < 0 ? -errno : 0;

    qemu_vfree(buf);

    return ret;
}

Wouldn’t that work?
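
The "round down to power of two" step in the function above can be checked in isolation. The following is a minimal standalone sketch: `clz32_portable()` is a hypothetical stand-in for QEMU's `clz32()`, and `round_down_pow2()` is an illustrative name, not something from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical portable replacement for clz32(): count leading zero bits. */
static int clz32_portable(uint32_t v)
{
    int n = 0;

    if (v == 0) {
        return 32;
    }
    while (!(v & UINT32_C(0x80000000))) {
        v <<= 1;
        n++;
    }
    return n;
}

/* Round a positive value down to the nearest power of two. */
static uint32_t round_down_pow2(uint32_t v)
{
    assert(v > 0);
    /* 31 - clz32(v) is the index of the highest set bit. */
    return UINT32_C(1) << (31 - clz32_portable(v));
}
```

For example, a size of 513 rounds down to 512 and a size of 4095 rounds down to 2048, while exact powers of two such as 512 or 4096 are returned unchanged.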

Max
Nir Soffer Aug. 24, 2019, 10:57 p.m. UTC | #14
On Fri, Aug 23, 2019 at 8:53 PM Max Reitz <mreitz@redhat.com> wrote:

> On 23.08.19 18:48, Nir Soffer wrote:
> > On Fri, Aug 23, 2019 at 4:58 PM Max Reitz <mreitz@redhat.com
> > <mailto:mreitz@redhat.com>> wrote:
>
> [...]
>
> >     If you have a format layer that truncates the image to a fixed size and
> >     does not write anything into the first block itself (say because it uses
> >     a footer), then (with O_DIRECT) allocate_first_block() will fail
> >     (silently, because while it does return an error value, it is never
> >     checked and there is no comment that explains why we don’t check it)
> >
> >
> > The motivation is that this is an optimization for the special case of using
> > an empty image, so it is not worth failing image creation.
> > I will add a comment about that.
>
> Thanks!
>
> [...]
>
> > Thanks for the example.
> >
> > I will need time to play with blockdev and understand the flows when images
> > are created. Do you think it would be useful to fix now only image creation
> > via qemu-img, and handle blockdev later?
>
> Well, it isn’t about blockdev, it’s simply about the fact that this
> function doesn’t work for O_DIRECT files.  I showed how to reproduce the
> issue without blockdev (namely block_resize).  Sure, that is an edge
> case, but it is a completely valid case.
>
> Also, it seems to me the fix is rather simple.  Just something like:
>
> static int allocate_first_block(int fd, int64_t max_size)
> {
>     int write_size = MIN(max_size, MAX_BLOCKSIZE);
>     void *buf;
>     ssize_t n;
>     int ret;
>
>     /* Round down to power of two */
>     assert(write_size > 0);
>     write_size = 1 << (31 - clz32(write_size));
>
>     buf = qemu_memalign(MAX(getpagesize(), write_size), write_size);
>     memset(buf, 0, write_size);
>
>     do {
>         n = pwrite(fd, buf, write_size, 0);
>     } while (n < 0 && errno == EINTR);
>
>     /* Save the result first: freeing the buffer may clobber errno. */
>     ret = n < 0 ? -errno : 0;
>
>     qemu_vfree(buf);
>
>     return ret;
> }
>
> Wouldn’t that work?
>

Sure, it should work.

But I think we can make this simpler, always writing MIN(max_size,
MAX_BLOCKSIZE).

vdsm now enforces 4k alignment, and there is no way to create images with an
unaligned size. Maybe qemu should adopt this rule?

Nir
Maxim Levitsky Aug. 25, 2019, 7:44 a.m. UTC | #15
On Sat, 2019-08-17 at 00:21 +0300, Nir Soffer wrote:
> When creating an image with preallocation "off" or "falloc", the first
> block of the image is typically not allocated. When using Gluster
> storage backed by XFS filesystem, reading this block using direct I/O
> succeeds regardless of request length, fooling alignment detection.
> 
> In this case we fallback to a safe value (4096) instead of the optimal
> value (512), which may lead to unneeded data copying when aligning
> requests.  Allocating the first block avoids the fallback.
> 
> When using preallocation=off, we always allocate at least one filesystem
> block:
> 
>     $ ./qemu-img create -f raw test.raw 1g
>     Formatting 'test.raw', fmt=raw size=1073741824
> 
>     $ ls -lhs test.raw
>     4.0K -rw-r--r--. 1 nsoffer nsoffer 1.0G Aug 16 23:48 test.raw

Are you sure about this?

[mlevitsk@maximlenovopc ~/work/test_area/posix-file 0]$ qemu-img create -f raw test.raw 1g -o preallocation=off
Formatting 'test.raw', fmt=raw size=1073741824 preallocation=off
[mlevitsk@maximlenovopc ~/work/test_area/posix-file 0]$ls -lhs ./test.raw 
0 -rw-r--r--. 1 mlevitsk mlevitsk 1.0G Aug 25 10:38 ./test.raw

ext4, tested on qemu-4.0.0 and qemu git master.
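
The same check can be done programmatically by comparing `st_blocks` for a sparse file with and without a byte written at offset 0. This is a minimal sketch under assumptions: `sparse_file_blocks()` and the path passed to it are hypothetical, and the block counts it reports depend on the filesystem, so no particular values are claimed here.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700
#endif
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Create a sparse file of `size` bytes at `path`, optionally write a single
 * zero byte at offset 0, and return the number of 512-byte blocks the
 * filesystem reports as allocated (st_blocks). The file is removed again
 * before returning. Returns -1 on any error.
 */
static long long sparse_file_blocks(const char *path, off_t size,
                                    int write_first_byte)
{
    struct stat st;
    long long blocks = -1;
    int fd = open(path, O_CREAT | O_TRUNC | O_RDWR, 0600);

    if (fd < 0) {
        return -1;
    }

    if (ftruncate(fd, size) == 0 &&
        (!write_first_byte || pwrite(fd, "\0", 1, 0) == 1) &&
        fsync(fd) == 0 &&
        fstat(fd, &st) == 0 &&
        st.st_size == size) {
        blocks = (long long)st.st_blocks;
    }

    close(fd);
    unlink(path);
    return blocks;
}
```

Calling it once with `write_first_byte` set to 0 and once with 1 on the filesystem under test corresponds to the "before" and "after" `ls -lhs` outputs discussed in this thread.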


From what I remember, the only case when posix-raw touches the first block is to zero it out
when running on top of kernel block device, to erase whatever header might be there, and this
is also kind of a backward compat hack which might be one day removed.

[...]

Best regards,
	Maxim Levitsky
Nir Soffer Aug. 25, 2019, 7:51 p.m. UTC | #16
On Sun, Aug 25, 2019 at 10:44 AM Maxim Levitsky <mlevitsk@redhat.com> wrote:

> On Sat, 2019-08-17 at 00:21 +0300, Nir Soffer wrote:
> > When creating an image with preallocation "off" or "falloc", the first
> > block of the image is typically not allocated. When using Gluster
> > storage backed by XFS filesystem, reading this block using direct I/O
> > succeeds regardless of request length, fooling alignment detection.
> >
> > In this case we fallback to a safe value (4096) instead of the optimal
> > value (512), which may lead to unneeded data copying when aligning
> > requests.  Allocating the first block avoids the fallback.
> >
> > When using preallocation=off, we always allocate at least one filesystem
> > block:
> >
> >     $ ./qemu-img create -f raw test.raw 1g
> >     Formatting 'test.raw', fmt=raw size=1073741824
> >
> >     $ ls -lhs test.raw
> >     4.0K -rw-r--r--. 1 nsoffer nsoffer 1.0G Aug 16 23:48 test.raw
>
> Are you sure about this?
>

This is the new behaviour with this change...

> [mlevitsk@maximlenovopc ~/work/test_area/posix-file 0]$ qemu-img create -f raw test.raw 1g -o preallocation=off
> Formatting 'test.raw', fmt=raw size=1073741824 preallocation=off
> [mlevitsk@maximlenovopc ~/work/test_area/posix-file 0]$ls -lhs ./test.raw
> 0 -rw-r--r--. 1 mlevitsk mlevitsk 1.0G Aug 25 10:38 ./test.raw
>
> ext4, tested on qemu-4.0.0 and qemu git master.
>

And this is the old behavior. I guess the commit message does not make it
clear.

> From what I remember, the only case when posix-raw touches the first block is to zero it out
> when running on top of kernel block device, to erase whatever header might be there, and this
> is also kind of a backward compat hack which might be one day removed.
>

This change is only for files; on block storage we use BLKSSZGET.


>
> [...]
>
> Best regards,
>         Maxim Levitsky
>
>
>
Maxim Levitsky Aug. 25, 2019, 10:17 p.m. UTC | #17
On Sun, 2019-08-25 at 22:51 +0300, Nir Soffer wrote:
> On Sun, Aug 25, 2019 at 10:44 AM Maxim Levitsky <mlevitsk@redhat.com> wrote:
> > On Sat, 2019-08-17 at 00:21 +0300, Nir Soffer wrote:
> > > When creating an image with preallocation "off" or "falloc", the first
> > > block of the image is typically not allocated. When using Gluster
> > > storage backed by XFS filesystem, reading this block using direct I/O
> > > succeeds regardless of request length, fooling alignment detection.
> > > 
> > > In this case we fallback to a safe value (4096) instead of the optimal
> > > value (512), which may lead to unneeded data copying when aligning
> > > requests.  Allocating the first block avoids the fallback.
> > > 
> > > When using preallocation=off, we always allocate at least one filesystem
> > > block:
> > > 
> > >     $ ./qemu-img create -f raw test.raw 1g
> > >     Formatting 'test.raw', fmt=raw size=1073741824
> > > 
> > >     $ ls -lhs test.raw
> > >     4.0K -rw-r--r--. 1 nsoffer nsoffer 1.0G Aug 16 23:48 test.raw
> > 
> > Are you sure about this?
> 
> This is the new behaviour with this change...
> 
> > [mlevitsk@maximlenovopc ~/work/test_area/posix-file 0]$ qemu-img create -f raw test.raw 1g -o preallocation=off
> > Formatting 'test.raw', fmt=raw size=1073741824 preallocation=off
> > [mlevitsk@maximlenovopc ~/work/test_area/posix-file 0]$ls -lhs ./test.raw 
> > 0 -rw-r--r--. 1 mlevitsk mlevitsk 1.0G Aug 25 10:38 ./test.raw
> > 
> > ext4, tested on qemu-4.0.0 and qemu git master.
> 
> And this is the old behavior. I guess the commit message does not make it clear.

Ah, thanks!


> > From what I remember, the only case when posix-raw touches the first block is to zero it out
> > when running on top of kernel block device, to erase whatever header might be there, and this
> > is also kind of a backward compat hack which might be one day removed.
> 
> This change is only for file, on block storage we use BLKSSZGET.
>  
> > [...]
> > 
> > Best regards,
> >         Maxim Levitsky
> > 
> > 

Best regards,
	Maxim Levitsky

Patch

diff --git a/block/file-posix.c b/block/file-posix.c
index b9c33c8f6c..3964dd2021 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -1755,6 +1755,27 @@  static int handle_aiocb_discard(void *opaque)
     return ret;
 }
 
+/*
+ * Help alignment detection by allocating the first block.
+ *
+ * When reading with direct I/O from unallocated area on Gluster backed by XFS,
+ * reading succeeds regardless of request length. In this case we fallback to
+ * safe alignment which is not optimal. Allocating the first block avoids this
+ * fallback.
+ *
+ * Returns: 0 on success, -errno on failure.
+ */
+static int allocate_first_block(int fd)
+{
+    ssize_t n;
+
+    do {
+        n = pwrite(fd, "\0", 1, 0);
+    } while (n == -1 && errno == EINTR);
+
+    return (n == -1) ? -errno : 0;
+}
+
 static int handle_aiocb_truncate(void *opaque)
 {
     RawPosixAIOData *aiocb = opaque;
@@ -1794,6 +1815,8 @@  static int handle_aiocb_truncate(void *opaque)
                 /* posix_fallocate() doesn't set errno. */
                 error_setg_errno(errp, -result,
                                  "Could not preallocate new data");
+            } else if (current_length == 0) {
+                allocate_first_block(fd);
             }
         } else {
             result = 0;
@@ -1855,6 +1878,8 @@  static int handle_aiocb_truncate(void *opaque)
         if (ftruncate(fd, offset) != 0) {
             result = -errno;
             error_setg_errno(errp, -result, "Could not resize file");
+        } else if (current_length == 0 && offset > current_length) {
+            allocate_first_block(fd);
         }
         return result;
     default:
diff --git a/tests/qemu-iotests/150.out b/tests/qemu-iotests/150.out
index 2a54e8dcfa..3cdc7727a5 100644
--- a/tests/qemu-iotests/150.out
+++ b/tests/qemu-iotests/150.out
@@ -3,6 +3,7 @@  QA output created by 150
 === Mapping sparse conversion ===
 
 Offset          Length          File
+0               0x1000          TEST_DIR/t.IMGFMT
 
 === Mapping non-sparse conversion ===
 
diff --git a/tests/qemu-iotests/160 b/tests/qemu-iotests/160
index df89d3864b..ad2d054a47 100755
--- a/tests/qemu-iotests/160
+++ b/tests/qemu-iotests/160
@@ -57,6 +57,10 @@  for skip in $TEST_SKIP_BLOCKS; do
     $QEMU_IMG dd if="$TEST_IMG" of="$TEST_IMG.out" skip="$skip" -O "$IMGFMT" \
         2> /dev/null
     TEST_IMG="$TEST_IMG.out" _check_test_img
+
+    # We always write the first byte of an image.
+    printf "\0" > "$TEST_IMG.out.dd"
+
     dd if="$TEST_IMG" of="$TEST_IMG.out.dd" skip="$skip" status=none
 
     echo
diff --git a/tests/qemu-iotests/175 b/tests/qemu-iotests/175
index 51e62c8276..c6a3a7bb1e 100755
--- a/tests/qemu-iotests/175
+++ b/tests/qemu-iotests/175
@@ -37,14 +37,16 @@  trap "_cleanup; exit \$status" 0 1 2 3 15
 # the file size.  This function hides the resulting difference in the
 # stat -c '%b' output.
 # Parameter 1: Number of blocks an empty file occupies
-# Parameter 2: Image size in bytes
+# Parameter 2: Minimal number of blocks in an image
+# Parameter 3: Image size in bytes
 _filter_blocks()
 {
     extra_blocks=$1
-    img_size=$2
+    min_blocks=$2
+    img_size=$3
 
-    sed -e "s/blocks=$extra_blocks\\(\$\\|[^0-9]\\)/nothing allocated/" \
-        -e "s/blocks=$((extra_blocks + img_size / 512))\\(\$\\|[^0-9]\\)/everything allocated/"
+    sed -e "s/blocks=$((extra_blocks + min_blocks))\\(\$\\|[^0-9]\\)/min allocation/" \
+        -e "s/blocks=$((extra_blocks + img_size / 512))\\(\$\\|[^0-9]\\)/max allocation/"
 }
 
 # get standard environment, filters and checks
@@ -60,16 +62,21 @@  size=$((1 * 1024 * 1024))
 touch "$TEST_DIR/empty"
 extra_blocks=$(stat -c '%b' "$TEST_DIR/empty")
 
+# We always write the first byte; check how many blocks this filesystem
+# allocates to match empty image allocation.
+printf "\0" > "$TEST_DIR/empty"
+min_blocks=$(stat -c '%b' "$TEST_DIR/empty")
+
 echo
 echo "== creating image with default preallocation =="
 _make_test_img $size | _filter_imgfmt
-stat -c "size=%s, blocks=%b" $TEST_IMG | _filter_blocks $extra_blocks $size
+stat -c "size=%s, blocks=%b" $TEST_IMG | _filter_blocks $extra_blocks $min_blocks $size
 
 for mode in off full falloc; do
     echo
     echo "== creating image with preallocation $mode =="
     IMGOPTS=preallocation=$mode _make_test_img $size | _filter_imgfmt
-    stat -c "size=%s, blocks=%b" $TEST_IMG | _filter_blocks $extra_blocks $size
+    stat -c "size=%s, blocks=%b" $TEST_IMG | _filter_blocks $extra_blocks $min_blocks $size
 done
 
 # success, all done
diff --git a/tests/qemu-iotests/175.out b/tests/qemu-iotests/175.out
index 6d9a5ed84e..263e521262 100644
--- a/tests/qemu-iotests/175.out
+++ b/tests/qemu-iotests/175.out
@@ -2,17 +2,17 @@  QA output created by 175
 
 == creating image with default preallocation ==
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1048576
-size=1048576, nothing allocated
+size=1048576, min allocation
 
 == creating image with preallocation off ==
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1048576 preallocation=off
-size=1048576, nothing allocated
+size=1048576, min allocation
 
 == creating image with preallocation full ==
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1048576 preallocation=full
-size=1048576, everything allocated
+size=1048576, max allocation
 
 == creating image with preallocation falloc ==
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1048576 preallocation=falloc
-size=1048576, everything allocated
+size=1048576, max allocation
  *** done
diff --git a/tests/qemu-iotests/221.out b/tests/qemu-iotests/221.out
index 9f9dd52bb0..dca024a0c3 100644
--- a/tests/qemu-iotests/221.out
+++ b/tests/qemu-iotests/221.out
@@ -3,14 +3,18 @@  QA output created by 221
 === Check mapping of unaligned raw image ===
 
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=65537
-[{ "start": 0, "length": 66048, "depth": 0, "zero": true, "data": false, "offset": OFFSET}]
-[{ "start": 0, "length": 66048, "depth": 0, "zero": true, "data": false, "offset": OFFSET}]
+[{ "start": 0, "length": 4096, "depth": 0, "zero": false, "data": true, "offset": OFFSET},
+{ "start": 4096, "length": 61952, "depth": 0, "zero": true, "data": false, "offset": OFFSET}]
+[{ "start": 0, "length": 4096, "depth": 0, "zero": false, "data": true, "offset": OFFSET},
+{ "start": 4096, "length": 61952, "depth": 0, "zero": true, "data": false, "offset": OFFSET}]
 wrote 1/1 bytes at offset 65536
 1 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
-[{ "start": 0, "length": 65536, "depth": 0, "zero": true, "data": false, "offset": OFFSET},
+[{ "start": 0, "length": 4096, "depth": 0, "zero": false, "data": true, "offset": OFFSET},
+{ "start": 4096, "length": 61440, "depth": 0, "zero": true, "data": false, "offset": OFFSET},
 { "start": 65536, "length": 1, "depth": 0, "zero": false, "data": true, "offset": OFFSET},
 { "start": 65537, "length": 511, "depth": 0, "zero": true, "data": false, "offset": OFFSET}]
-[{ "start": 0, "length": 65536, "depth": 0, "zero": true, "data": false, "offset": OFFSET},
+[{ "start": 0, "length": 4096, "depth": 0, "zero": false, "data": true, "offset": OFFSET},
+{ "start": 4096, "length": 61440, "depth": 0, "zero": true, "data": false, "offset": OFFSET},
 { "start": 65536, "length": 1, "depth": 0, "zero": false, "data": true, "offset": OFFSET},
 { "start": 65537, "length": 511, "depth": 0, "zero": true, "data": false, "offset": OFFSET}]
 *** done
diff --git a/tests/qemu-iotests/253.out b/tests/qemu-iotests/253.out
index 607c0baa0b..3d08b305d7 100644
--- a/tests/qemu-iotests/253.out
+++ b/tests/qemu-iotests/253.out
@@ -3,12 +3,16 @@  QA output created by 253
 === Check mapping of unaligned raw image ===
 
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1048575
-[{ "start": 0, "length": 1048576, "depth": 0, "zero": true, "data": false, "offset": OFFSET}]
-[{ "start": 0, "length": 1048576, "depth": 0, "zero": true, "data": false, "offset": OFFSET}]
+[{ "start": 0, "length": 4096, "depth": 0, "zero": false, "data": true, "offset": OFFSET},
+{ "start": 4096, "length": 1044480, "depth": 0, "zero": true, "data": false, "offset": OFFSET}]
+[{ "start": 0, "length": 4096, "depth": 0, "zero": false, "data": true, "offset": OFFSET},
+{ "start": 4096, "length": 1044480, "depth": 0, "zero": true, "data": false, "offset": OFFSET}]
 wrote 65535/65535 bytes at offset 983040
 63.999 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
-[{ "start": 0, "length": 983040, "depth": 0, "zero": true, "data": false, "offset": OFFSET},
+[{ "start": 0, "length": 4096, "depth": 0, "zero": false, "data": true, "offset": OFFSET},
+{ "start": 4096, "length": 978944, "depth": 0, "zero": true, "data": false, "offset": OFFSET},
 { "start": 983040, "length": 65536, "depth": 0, "zero": false, "data": true, "offset": OFFSET}]
-[{ "start": 0, "length": 983040, "depth": 0, "zero": true, "data": false, "offset": OFFSET},
+[{ "start": 0, "length": 4096, "depth": 0, "zero": false, "data": true, "offset": OFFSET},
+{ "start": 4096, "length": 978944, "depth": 0, "zero": true, "data": false, "offset": OFFSET},
 { "start": 983040, "length": 65536, "depth": 0, "zero": false, "data": true, "offset": OFFSET}]
 *** done