[RFC] qemu-img: Drop BLK_ZERO from convert

Message ID 20180226170313.8178-1-mreitz@redhat.com (mailing list archive)
State New, archived

Commit Message

Max Reitz Feb. 26, 2018, 5:03 p.m. UTC
There are filesystems (among which is tmpfs) that have a hard time
reporting allocation status.  That is definitely a bug in them.

However, there is no good reason why qemu-img convert should query the
allocation status in the first place.  It does zero detection by itself
anyway, so we can detect unallocated areas ourselves.

Furthermore, if a filesystem driver has any sense, reading unallocated
data should take just as much time as lseek(SEEK_DATA) + memset().  So
the only overhead we introduce by dropping the manual lseek() call is a
memset() in the driver and a buffer_is_zero() in qemu-img, both of which
should be relatively quick.
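
For a raw file on a local filesystem, the block-status query in question
essentially boils down to lseek(SEEK_DATA)/lseek(SEEK_HOLE) probing.  A
minimal, self-contained POSIX sketch of such probing (not qemu's actual
file-posix code, which handles more cases):

#define _GNU_SOURCE             /* SEEK_DATA / SEEK_HOLE on glibc */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Return 1 for data, 0 for a hole, -errno on failure; *pnum is set to the
 * number of contiguous bytes with that status starting at 'offset'. */
static int probe_extent(int fd, off_t offset, off_t end, off_t *pnum)
{
    off_t data = lseek(fd, offset, SEEK_DATA);

    if (data < 0) {
        if (errno == ENXIO) {           /* inside the trailing hole */
            *pnum = end - offset;
            return 0;
        }
        return -errno;
    }
    if (data > offset) {                /* inside a hole before 'data' */
        *pnum = data - offset;
        return 0;
    }
    /* inside data; SEEK_HOLE finds where it ends (EOF counts as a hole) */
    off_t hole = lseek(fd, offset, SEEK_HOLE);
    if (hole < 0) {
        return -errno;
    }
    *pnum = hole - offset;
    return 1;
}

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s FILE\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    off_t end = lseek(fd, 0, SEEK_END);
    off_t pnum = 0;

    for (off_t ofs = 0; ofs < end; ofs += pnum) {
        int ret = probe_extent(fd, ofs, end, &pnum);
        if (ret < 0) {
            fprintf(stderr, "probe failed at %lld\n", (long long)ofs);
            break;
        }
        printf("%10lld +%10lld: %s\n", (long long)ofs, (long long)pnum,
               ret ? "data" : "hole");
    }

    close(fd);
    return 0;
}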

On the other hand, if we query the allocation status repeatedly for a
file that is nearly fully allocated, we do many lseek(SEEK_DATA/HOLE)
calls for nothing.  While we can easily blame bugs in filesystem
drivers, it still is a fact that this can cause considerable overhead.

Note that we still have to invoke bdrv_is_allocated() when copying only
the top overlay image.  But this is quick and completely under our
control because it only involves the format layer and does not go down
to the protocol level; so this will never leave qemu.

Buglink: https://bugs.launchpad.net/qemu/+bug/1751264
Signed-off-by: Max Reitz <mreitz@redhat.com>
---
If we decide against this patch, we can still do the same if -S 0 has
been specified.  Then, if the user uses a buggy filesystem, we can tell
them to use -S 0 so that converting at least works (even if we don't do
any zero detection then).

(And we definitely should do this for -S 0.  Our current implementation
 then is to detect zero areas, create a bounce buffer, zero it, and
 write that to the destination.  But that's exactly what the source
 filesystem driver would do if we simply read the data from it, so we're
 just introducing overhead because of another lseek(SEEK_DATA).)
---
 qemu-img.c                 | 41 +++++++++----------------------
 tests/qemu-iotests/086.out | 60 ++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 71 insertions(+), 30 deletions(-)

Comments

Stefan Hajnoczi Feb. 27, 2018, 4:17 p.m. UTC | #1
On Mon, Feb 26, 2018 at 06:03:13PM +0100, Max Reitz wrote:
> There are filesystems (among which is tmpfs) that have a hard time
> reporting allocation status.  That is definitely a bug in them.
> 
> However, there is no good reason why qemu-img convert should query the
> allocation status in the first place.  It does zero detection by itself
> anyway, so we can detect unallocated areas ourselves.
> 
> Furthermore, if a filesystem driver has any sense, reading unallocated
> data should take just as much time as lseek(SEEK_DATA) + memset().  So
> the only overhead we introduce by dropping the manual lseek() call is a
> memset() in the driver and a buffer_is_zero() in qemu-img, both of which
> should be relatively quick.

This makes sense.  Which file systems did you test this patch on?

XFS, ext4, and tmpfs would be a good minimal test set to prove the
patch.  Perhaps with two input files:
1. A file that is mostly filled with data.
2. A file that is only sparsely populated with data.

The time taken should be comparable with the time before this patch.
Max Reitz Feb. 28, 2018, 6:08 p.m. UTC | #2
On 2018-02-27 17:17, Stefan Hajnoczi wrote:
> On Mon, Feb 26, 2018 at 06:03:13PM +0100, Max Reitz wrote:
>> There are filesystems (among which is tmpfs) that have a hard time
>> reporting allocation status.  That is definitely a bug in them.
>>
>> However, there is no good reason why qemu-img convert should query the
>> allocation status in the first place.  It does zero detection by itself
>> anyway, so we can detect unallocated areas ourselves.
>>
>> Furthermore, if a filesystem driver has any sense, reading unallocated
>> data should take just as much time as lseek(SEEK_DATA) + memset().  So
>> the only overhead we introduce by dropping the manual lseek() call is a
>> memset() in the driver and a buffer_is_zero() in qemu-img, both of which
>> should be relatively quick.
> 
> This makes sense.  Which file systems did you test this patch on?

On tmpfs and xfs, so far.

> XFS, ext4, and tmpfs would be a good minimal test set to prove the
> patch.  Perhaps with two input files:
> 1. A file that is mostly filled with data.
> 2. A file that is only sparsely populated with data.

And probably with vmdk, which (by default) forbids querying any areas
larger than 64 kB.

> The time taken should be comparable with the time before this patch.

Yep, I'll do some benchmarks.

Max
Max Reitz Feb. 28, 2018, 8:11 p.m. UTC | #3
On 2018-02-28 19:08, Max Reitz wrote:
> On 2018-02-27 17:17, Stefan Hajnoczi wrote:
>> On Mon, Feb 26, 2018 at 06:03:13PM +0100, Max Reitz wrote:
>>> There are filesystems (among which is tmpfs) that have a hard time
>>> reporting allocation status.  That is definitely a bug in them.
>>>
>>> However, there is no good reason why qemu-img convert should query the
>>> allocation status in the first place.  It does zero detection by itself
>>> anyway, so we can detect unallocated areas ourselves.
>>>
>>> Furthermore, if a filesystem driver has any sense, reading unallocated
>>> data should take just as much time as lseek(SEEK_DATA) + memset().  So
>>> the only overhead we introduce by dropping the manual lseek() call is a
>>> memset() in the driver and a buffer_is_zero() in qemu-img, both of which
>>> should be relatively quick.
>>
>> This makes sense.  Which file systems did you test this patch on?
> 
> On tmpfs and xfs, so far.
> 
>> XFS, ext4, and tmpfs would be a good minimal test set to prove the
>> patch.  Perhaps with two input files:
>> 1. A file that is mostly filled with data.
>> 2. A file that is only sparsely populated with data.
> 
> And probably with vmdk, which (by default) forbids querying any areas
> larger than 64 kB.
> 
>> The time taken should be comparable with the time before this patch.
> 
> Yep, I'll do some benchmarks.

And the results are in.  I've created 2 GB images on various filesystems
in various formats, then I've either written 64 kB every 32 MB to them
("sparse"), or left out 64 kB every 32 MB ("full").  Then I've converted
them to null-co:// and took the (real) time through "time". (Script is
attached.)

I've attached the raw results before and after this patch.  Usually, I
did six runs for each case and dropped the most extreme outlier --
except for full vmdk images, where I've only done one run for each case
because creating these images can take a very long time.

Here are the differences from before to after:

sparse raw on tmpfs:    + 19 % (436 ms to 520 ms)
sparse qcow2 on tmpfs:  - 31 % (435 ms to 301 ms)
sparse vmdk on tmpfs:   + 37 % (214 ms to 294 ms)

sparse raw on xfs:      + 69 % (452 ms to 762 ms)
sparse qcow2 on xfs:    - 34 % (462 ms to 304 ms)
sparse vmdk on xfs:     + 42 % (210 ms to 298 ms)

sparse raw on ext4:     +360 % (144 ms to 655 ms)
sparse qcow2 on ext4:   +120 % (147 ms to 330 ms)
sparse vmdk on ext4:    + 16 % (253 ms to 293 ms)


full raw on tmpfs:      -  9 % (437 ms to 398 ms)
full qcow2 on tmpfs:    - 75 % (1.63 s to 403 ms)
full vmdk on tmpfs:     -100 % (10 min to 767 ms)

full raw on xfs:        -  1 % (407 ms to 404 ms, insignificant)
full qcow2 on xfs:      -  1 % (410 ms to 404 ms, insignificant)
full vmdk on xfs:       - 33 % (1.05 s to 695 ms)




full raw on ext4:       -  2 % (308 ms to 301 ms, insignificant)
full qcow2 on ext4:     +  2 % (307 ms to 312 ms, insignificant)
full vmdk on ext4:      - 74 % (3.53 s to 839 ms)


So...  It's more extreme than I had hoped, that's for sure.  What I
conclude from this is:

(1) This patch is generally good for nearly fully allocated images.  In
the worst case (on well-behaving filesystems with well-optimized image
formats) it changes nothing.  In the best case, conversion time is
reduced drastically.

(2) For sparse raw images, this is absolutely devastating.  Reading them
now takes more than (ext4) or nearly (xfs) twice as much time as reading
a fully allocated image.  So much for "if a filesystem driver has any
sense".

(2a) It might be worth noting that on xfs, reading the sparse file took
longer even before this patch...

(3) qcow2 is different: It benefits from this patch on tmpfs and xfs
(note that reading a sparse qcow2 file took longer than reading a full
qcow2 file before this patch!), but it gets pretty much destroyed on
ext4, too.

(4) As for sparse vmdk images...  Reading them takes longer, but it's
still faster than reading full vmdk images, so that's not horrible.


So there we are.  I was wrong about filesystem drivers having any sense,
so this patch can indeed have a hugely negative impact.

I would argue that conversion time for full images is more important,
because that's probably the main use case; but the thing is that here
this patch only helps for tmpfs and vmdk.  We don't care too much about
vmdk, and the fact that tmpfs takes so long simply is a bug.

I guess the xfs/tmpfs results would still be in a range where they are
barely acceptable (because it's mainly a qcow2 vs. raw tradeoff), but
the ext4 horrors probably make this patch a no-go in its current form.

In any case it's interesting to see that even the current qemu-img
convert takes longer to read sparsely allocated qcow2/raw files from xfs
than fully allocated images...


So I guess I'll re-send this patch where the change is done only for
-S 0.

Max
Raw results before the patch:

Testing: raw on tmpfs (sparse): 0.436 ±0.012
real    0.426 0.444 0.447 0.442 (0.500) 0.421

Testing: qcow2 on tmpfs (sparse): 0.435 ±0.018
real    0.421 0.459 0.448 0.430 (0.776) 0.415

Testing: vmdk on tmpfs (sparse): 0.214 ±0.004
real    0.216 0.218 0.210 0.215 (0.222) 0.209

Testing: raw on xfs (sparse): 0.452 ±0.028
real    0.430 0.468 0.493 0.442 (0.640) 0.428

Testing: qcow2 on xfs (sparse): 0.462 ±0.016
real    0.445 0.475 0.478 0.445 0.466 (0.437)

Testing: vmdk on xfs (sparse): 0.210 ±0.005
real    0.214 0.213 (0.222) 0.203 0.207 0.213

Testing: raw on ext4 (sparse): 0.144 ±0.023
real    0.119 0.181 0.142 0.143 0.137 (0.296)

Testing: qcow2 on ext4 (sparse): 0.147 ±0.016
real    0.139 0.138 0.142 0.175 0.140 (0.301)

Testing: vmdk on ext4 (sparse): 0.253 ±0.004
real    0.247 (0.273) 0.257 0.252 0.254 0.254

Testing: raw on tmpfs (full): 0.437 ±0.012
real    (0.477) 0.436 0.434 0.432 0.457 0.424

Testing: qcow2 on tmpfs (full): 1.630 ±0.021
real    1.618 1.631 (1.684) 1.625 1.665 1.612

Testing: vmdk on tmpfs (full): 612.379 (n=1)
real    612.379

Testing: raw on xfs (full): 0.407 ±0.011
real    0.400 0.413 0.402 (0.495) 0.424 0.397

Testing: qcow2 on xfs (full): 0.410 ±0.009
real    0.403 0.406 0.413 (0.485) 0.423 0.403

Testing: vmdk on xfs (full): 1.048 (n=1)
real    1.048

Testing: raw on ext4 (full): 0.308 ±0.007
real    0.299 (0.419) 0.315 0.311 0.301 0.314

Testing: qcow2 on ext4 (full): 0.307 ±0.003
real    0.307 0.306 (0.490) 0.308 0.303 0.312

Testing: vmdk on ext4 (full): 3.528 (n=1)
real    3.528

Raw results after the patch:

Testing: raw on tmpfs (sparse): 0.520 ±0.022
real	0.496 0.535 0.536 0.538 (0.569) 0.495

Testing: qcow2 on tmpfs (sparse): 0.301 ±0.008
real	0.298 0.297 (0.338) 0.292 0.307 0.312

Testing: vmdk on tmpfs (sparse): 0.294 ±0.002
real	0.295 0.294 0.292 (0.287) 0.293 0.297

Testing: raw on xfs (sparse): 0.762 ±0.026
real	(0.713) 0.770 0.769 0.759 0.790 0.720

Testing: qcow2 on xfs (sparse): 0.304 ±0.003
real	0.309 (0.312) 0.301 0.302 0.306 0.304

Testing: vmdk on xfs (sparse): 0.298 ±0.009
real	0.294 0.312 0.293 0.301 0.290 (0.345)

Testing: raw on ext4 (sparse): 0.655 ±0.011
real    0.664 (0.624) 0.663 0.639 0.648 0.659

Testing: qcow2 on ext4 (sparse): 0.330 ±0.035
real    0.293 0.333 (0.391) 0.374 0.353 0.297

Testing: vmdk on ext4 (sparse): 0.293 ±0.004
real    0.290 0.289 0.293 0.297 (0.285) 0.296

Testing: raw on tmpfs (full): 0.398 ±0.004
real	0.397 (0.406) 0.396 0.397 0.394 0.404

Testing: qcow2 on tmpfs (full): 0.403 ±0.004
real	0.403 0.408 0.406 0.399 (0.538) 0.400

Testing: vmdk on tmpfs (full): 0.767 (n=1)
real    0.767

Testing: raw on xfs (full): 0.404 ±0.004
real	0.411 0.402 0.401 0.402 0.402 (0.421)

Testing: qcow2 on xfs (full): 0.404 ±0.004
real	0.402 (0.415) 0.410 0.401 0.401 0.404

Testing: vmdk on xfs (full): 0.695 (n=1)
real    0.695

Testing: raw on ext4 (full): 0.301 ±0.004
real    0.304 0.303 0.305 0.299 0.296 (0.322)

Testing: qcow2 on ext4 (full): 0.312 ±0.008
real    (0.325) 0.310 0.319 0.304 0.305 0.321

Testing: vmdk on ext4 (full): 0.839 (n=1)
real    0.839
Max Reitz Feb. 28, 2018, 8:23 p.m. UTC | #4
On 2018-02-28 21:11, Max Reitz wrote:
> On 2018-02-28 19:08, Max Reitz wrote:

[...]

> In any case it's interesting to see that even the current qemu-img
> convert takes longer to read sparsely allocated qcow2/raw files from xfs
> than fully allocated images...

(That's because I didn't drop the caches after the qemu-io run...  If I
do that, reading the full image takes its four seconds.  But that
doesn't make the full results better, and the sparse results don't
change much.)

Max
Stefan Hajnoczi March 6, 2018, 1:47 p.m. UTC | #5
On Wed, Feb 28, 2018 at 09:11:32PM +0100, Max Reitz wrote:
> On 2018-02-28 19:08, Max Reitz wrote:
> > On 2018-02-27 17:17, Stefan Hajnoczi wrote:
> >> On Mon, Feb 26, 2018 at 06:03:13PM +0100, Max Reitz wrote:
> >>> There are filesystems (among which is tmpfs) that have a hard time
> >>> reporting allocation status.  That is definitely a bug in them.
> >>>
> >>> However, there is no good reason why qemu-img convert should query the
> >>> allocation status in the first place.  It does zero detection by itself
> >>> anyway, so we can detect unallocated areas ourselves.
> >>>
> >>> Furthermore, if a filesystem driver has any sense, reading unallocated
> >>> data should take just as much time as lseek(SEEK_DATA) + memset().  So
> >>> the only overhead we introduce by dropping the manual lseek() call is a
> >>> memset() in the driver and a buffer_is_zero() in qemu-img, both of which
> >>> should be relatively quick.
> >>
> >> This makes sense.  Which file systems did you test this patch on?
> > 
> > On tmpfs and xfs, so far.
> > 
> >> XFS, ext4, and tmpfs would be a good minimal test set to prove the
> >> patch.  Perhaps with two input files:
> >> 1. A file that is mostly filled with data.
> >> 2. A file that is only sparsely populated with data.
> > 
> > And probably with vmdk, which (by default) forbids querying any areas
> > larger than 64 kB.
> > 
> >> The time taken should be comparable with the time before this patch.
> > 
> > Yep, I'll do some benchmarks.
> 
> And the results are in.  I've created 2 GB images on various filesystems
> in various formats, then I've either written 64 kB every 32 MB to them
> ("sparse"), or left out 64 kB every 32 MB ("full").  Then I've converted
> them to null-co:// and took the (real) time through "time". (Script is
> attached.)
> 
> I've attached the raw results before and after this patch.  Usually, I
> did six runs for each case and dropped the most extreme outlier --
> except for full vmdk images, where I've only done one run for each case
> because creating these images can take a very long time.
> 
> Here are the differences from before to after:
> 
> sparse raw on tmpfs:    + 19 % (436 ms to 520 ms)
> sparse qcow2 on tmpfs:  - 31 % (435 ms to 301 ms)
> sparse vmdk on tmpfs:   + 37 % (214 ms to 294 ms)
> 
> sparse raw on xfs:      + 69 % (452 ms to 762 ms)
> sparse qcow2 on xfs:    - 34 % (462 ms to 304 ms)
> sparse vmdk on xfs:     + 42 % (210 ms to 298 ms)
> 
> sparse raw on ext4:     +360 % (144 ms to 655 ms)
> sparse qcow2 on ext4:   +120 % (147 ms to 330 ms)
> sparse vmdk on ext4:    + 16 % (253 ms to 293 ms)
> 
> 
> full raw on tmpfs:      -  9 % (437 ms to 398 ms)
> full qcow2 on tmpfs:    - 75 % (1.63 s to 403 ms)
> full vmdk on tmpfs:     -100 % (10 min to 767 ms)
> 
> full raw on xfs:        -  1 % (407 ms to 404 ms, insignificant)
> full qcow2 on xfs:      -  1 % (410 ms to 404 ms, insignificant)
> full vmdk on xfs:       - 33 % (1.05 s to 695 ms)
> 
> 
> 
> 
> full raw on ext4:       -  2 % (308 ms to 301 ms, insignificant)
> full qcow2 on ext4:     +  2 % (307 ms to 312 ms, insignificant)
> full vmdk on ext4:      - 74 % (3.53 s to 839 ms)
> 
> 
> So...  It's more extreme than I had hoped, that's for sure.  What I
> conclude from this is:
> 
> (1) This patch is generally good for nearly fully allocated images.  In
> the worst case (on well-behaving filesystems with well-optimized image
> formats) it changes nothing.  In the best case, conversion time is
> reduced drastically.
> 
> (2) For sparse raw images, this is absolutely devastating.  Reading them
> now takes more than (ext4) or nearly (xfs) twice as much time as reading
> a fully allocated image.  So much for "if a filesystem driver has any
> sense".
> 
> (2a) It might be worth noting that on xfs, reading the sparse file took
> longer even before this patch...
> 
> (3) qcow2 is different: It benefits from this patch on tmpfs and xfs
> (note that reading a sparse qcow2 file took longer than reading a full
> qcow2 file before this patch!), but it gets pretty much destroyed on
> ext4, too.
> 
> (4) As for sparse vmdk images...  Reading them takes longer, but it's
> still faster than reading full vmdk images, so that's not horrible.
> 
> 
> So there we are.  I was wrong about filesystem drivers having any sense,
> so this patch can indeed have a hugely negative impact.
> 
> I would argue that conversion time for full images is more important,
> because that's probably the main use case; but the thing is that here
> this patch only helps for tmpfs and vmdk.  We don't care too much about
> vmdk, and the fact that tmpfs takes so long simply is a bug.
> 
> I guess the xfs/tmpfs results would still be in a range where they are
> barely acceptable (because it's mainly a qcow2 vs. raw tradeoff), but
> the ext4 horrors probably make this patch a no-go in its current form.
> 
> In any case it's interesting to see that even the current qemu-img
> convert takes longer to read sparsely allocated qcow2/raw files from xfs
> than fully allocated images...
> 
> 
> So I guess I'll re-send this patch where the change is done only for
> -S 0.

Wow, unexpected.  Thanks for doing the benchmarks!

Stefan
Kevin Wolf March 6, 2018, 5:37 p.m. UTC | #6
On 06.03.2018 at 14:47, Stefan Hajnoczi wrote:
> On Wed, Feb 28, 2018 at 09:11:32PM +0100, Max Reitz wrote:
> > On 2018-02-28 19:08, Max Reitz wrote:
> > > On 2018-02-27 17:17, Stefan Hajnoczi wrote:
> > >> On Mon, Feb 26, 2018 at 06:03:13PM +0100, Max Reitz wrote:
> > >>> There are filesystems (among which is tmpfs) that have a hard time
> > >>> reporting allocation status.  That is definitely a bug in them.
> > >>>
> > >>> However, there is no good reason why qemu-img convert should query the
> > >>> allocation status in the first place.  It does zero detection by itself
> > >>> anyway, so we can detect unallocated areas ourselves.
> > >>>
> > >>> Furthermore, if a filesystem driver has any sense, reading unallocated
> > >>> data should take just as much time as lseek(SEEK_DATA) + memset().  So
> > >>> the only overhead we introduce by dropping the manual lseek() call is a
> > >>> memset() in the driver and a buffer_is_zero() in qemu-img, both of which
> > >>> should be relatively quick.
> > >>
> > >> This makes sense.  Which file systems did you test this patch on?
> > > 
> > > On tmpfs and xfs, so far.
> > > 
> > >> XFS, ext4, and tmpfs would be a good minimal test set to prove the
> > >> patch.  Perhaps with two input files:
> > >> 1. A file that is mostly filled with data.
> > >> 2. A file that is only sparsely populated with data.
> > > 
> > > And probably with vmdk, which (by default) forbids querying any areas
> > > larger than 64 kB.
> > > 
> > >> The time taken should be comparable with the time before this patch.
> > > 
> > > Yep, I'll do some benchmarks.
> > 
> > And the results are in.  I've created 2 GB images on various filesystems
> > in various formats, then I've either written 64 kB every 32 MB to them
> > ("sparse"), or left out 64 kB every 32 MB ("full").  Then I've converted
> > them to null-co:// and took the (real) time through "time". (Script is
> > attached.)
> > 
> > I've attached the raw results before and after this patch.  Usually, I
> > did six runs for each case and dropped the most extreme outlier --
> > except for full vmdk images, where I've only done one run for each case
> > because creating these images can take a very long time.
> > 
> > Here are the differences from before to after:
> > 
> > sparse raw on tmpfs:    + 19 % (436 ms to 520 ms)
> > sparse qcow2 on tmpfs:  - 31 % (435 ms to 301 ms)
> > sparse vmdk on tmpfs:   + 37 % (214 ms to 294 ms)
> > 
> > sparse raw on xfs:      + 69 % (452 ms to 762 ms)
> > sparse qcow2 on xfs:    - 34 % (462 ms to 304 ms)
> > sparse vmdk on xfs:     + 42 % (210 ms to 298 ms)
> > 
> > sparse raw on ext4:     +360 % (144 ms to 655 ms)
> > sparse qcow2 on ext4:   +120 % (147 ms to 330 ms)
> > sparse vmdk on ext4:    + 16 % (253 ms to 293 ms)
> > 
> > 
> > full raw on tmpfs:      -  9 % (437 ms to 398 ms)
> > full qcow2 on tmpfs:    - 75 % (1.63 s to 403 ms)
> > full vmdk on tmpfs:     -100 % (10 min to 767 ms)
> > 
> > full raw on xfs:        -  1 % (407 ms to 404 ms, insignificant)
> > full qcow2 on xfs:      -  1 % (410 ms to 404 ms, insignificant)
> > full vmdk on xfs:       - 33 % (1.05 s to 695 ms)
> > 
> > 
> > 
> > 
> > full raw on ext4:       -  2 % (308 ms to 301 ms, insignificant)
> > full qcow2 on ext4:     +  2 % (307 ms to 312 ms, insignificant)
> > full vmdk on ext4:      - 74 % (3.53 s to 839 ms)
> > 
> > 
> > So...  It's more extreme than I had hoped, that's for sure.  What I
> > conclude from this is:
> > 
> > (1) This patch is generally good for nearly fully allocated images.  In
> > the worst case (on well-behaving filesystems with well-optimized image
> > formats) it changes nothing.  In the best case, conversion time is
> > reduced drastically.

This makes sense. Asking the kernel whether a block is zero only helps
the performance if the result is yes occasionally, otherwise it's just
wasted work.

Maybe we could try to guess the ratio by comparing the number of
allocated blocks in the image file and the virtual disk size or
something? Then we could only call lseek() when we can actually expect
an improvement from it.
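
A rough sketch of such a heuristic, in plain POSIX terms for a raw image
file (qemu itself would presumably compare bdrv_getlength() against
bdrv_get_allocated_file_size() rather than call fstat(); the 50% threshold
is an arbitrary placeholder):

#include <stdbool.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Decide whether probing for holes is likely to pay off, by comparing the
 * file's allocated size with its apparent size. */
static bool worth_querying_holes(int fd)
{
    struct stat st;

    if (fstat(fd, &st) < 0 || st.st_size == 0) {
        return false;
    }

    /* st_blocks is counted in 512-byte units, independent of st_blksize */
    off_t allocated = (off_t)st.st_blocks * 512;

    /* Only pay for lseek(SEEK_DATA/SEEK_HOLE) calls if a substantial part
     * of the file can be expected to be holes (threshold is arbitrary). */
    return allocated < st.st_size / 2;
}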

> > (2) For sparse raw images, this is absolutely devastating.  Reading them
> > now takes more than (ext4) or nearly (xfs) twice as much time as reading
> > a fully allocated image.  So much for "if a filesystem driver has any
> > sense".

Are you sure that only the filesystem is the problem? Checking for every
single byte of an image whether it is zero has to cost some performance.
The fully allocated image doesn't suffer from this because (a) it only
has to check the first byte in each block and (b) the cost was already
there before this patch.

In fact, if null-co supported .bdrv_co_pwrite_zeroes, I think you would
get even worse results for your patch because then the pre-patch version
doesn't even have to do the memset().
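
Point (a) is easy to see with a naive stand-in for buffer_is_zero() (the
real function is vectorized and more elaborate, but the asymmetry is the
same): a block of real data usually fails on the very first byte, while a
block of zeroes has to be scanned to the end:

#include <stdbool.h>
#include <stddef.h>

/* Naive stand-in for qemu's buffer_is_zero(). */
static bool naive_buffer_is_zero(const unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (buf[i] != 0) {
            return false;       /* typical early exit for allocated data */
        }
    }
    return true;                /* every byte had to be looked at */
}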

> > (2a) It might be worth noting that on xfs, reading the sparse file took
> > longer even before this patch...
> > 
> > (3) qcow2 is different: It benefits from this patch on tmpfs and xfs
> > (note that reading a sparse qcow2 file took longer than reading a full
> > qcow2 file before this patch!), but it gets pretty much destroyed on
> > ext4, too.

I suppose an empty qcow2 with metadata preallocation behaves roughly
like sparse raw?

As long as the qcow2 metadata reflects the allocation status in the
image file (which it probably usually does, except with preallocation),
it makes sense that qcow2 performs better with just relying on its
metadata. Calling an lseek() that just returns the same result is a
wasted effort then.

> > (4) As for sparse vmdk images...  Reading them takes longer, but it's
> > still faster than reading full vmdk images, so that's not horrible.

Hm, why is that? Shouldn't vmdk metadata reflect the allocation status
in the image file just as well as qcow2 metadata?

But actually, the absolute numbers are much lower than both raw and
qcow2, which is a bit surprising. Is there a bug somewhere in vmdk or
are we missing optimisations in raw and qcow2?

> > So there we are.  I was wrong about filesystem drivers having any sense,
> > so this patch can indeed have a hugely negative impact.
> > 
> > I would argue that conversion time for full images is more important,
> > because that's probably the main use case; but the thing is that here
> > this patch only helps for tmpfs and vmdk.  We don't care too much about
> > vmdk, and the fact that tmpfs takes so long simply is a bug.
> > 
> > I guess the xfs/tmpfs results would still be in a range where they are
> > barely acceptable (because it's mainly a qcow2 vs. raw tradeoff), but
> > the ext4 horrors probably make this patch a no-go in its current form.
> > 
> > In any case it's interesting to see that even the current qemu-img
> > convert takes longer to read sparsely allocated qcow2/raw files from xfs
> > than fully allocated images...
> > 
> > 
> > So I guess I'll re-send this patch where the change is done only for
> > -S 0.
> 
> Wow, unexpected.  Thanks for doing the benchmarks!

ext4 is a bit surprising, yes. vmdk, too. The rest kind of makes sense
to me.

Kevin
Max Reitz March 7, 2018, 3:57 p.m. UTC | #7
On 2018-03-06 18:37, Kevin Wolf wrote:
> On 06.03.2018 at 14:47, Stefan Hajnoczi wrote:
>> On Wed, Feb 28, 2018 at 09:11:32PM +0100, Max Reitz wrote:
>>> On 2018-02-28 19:08, Max Reitz wrote:
>>>> On 2018-02-27 17:17, Stefan Hajnoczi wrote:
>>>>> On Mon, Feb 26, 2018 at 06:03:13PM +0100, Max Reitz wrote:
>>>>>> There are filesystems (among which is tmpfs) that have a hard time
>>>>>> reporting allocation status.  That is definitely a bug in them.
>>>>>>
>>>>>> However, there is no good reason why qemu-img convert should query the
>>>>>> allocation status in the first place.  It does zero detection by itself
>>>>>> anyway, so we can detect unallocated areas ourselves.
>>>>>>
>>>>>> Furthermore, if a filesystem driver has any sense, reading unallocated
>>>>>> data should take just as much time as lseek(SEEK_DATA) + memset().  So
>>>>>> the only overhead we introduce by dropping the manual lseek() call is a
>>>>>> memset() in the driver and a buffer_is_zero() in qemu-img, both of which
>>>>>> should be relatively quick.
>>>>>
>>>>> This makes sense.  Which file systems did you test this patch on?
>>>>
>>>> On tmpfs and xfs, so far.
>>>>
>>>>> XFS, ext4, and tmpfs would be a good minimal test set to prove the
>>>>> patch.  Perhaps with two input files:
>>>>> 1. A file that is mostly filled with data.
>>>>> 2. A file that is only sparsely populated with data.
>>>>
>>>> And probably with vmdk, which (by default) forbids querying any areas
>>>> larger than 64 kB.
>>>>
>>>>> The time taken should be comparable with the time before this patch.
>>>>
>>>> Yep, I'll do some benchmarks.
>>>
>>> And the results are in.  I've created 2 GB images on various filesystems
>>> in various formats, then I've either written 64 kB every 32 MB to them
>>> ("sparse"), or left out 64 kB every 32 MB ("full").  Then I've converted
>>> them to null-co:// and took the (real) time through "time". (Script is
>>> attached.)
>>>
>>> I've attached the raw results before and after this patch.  Usually, I
>>> did six runs for each case and dropped the most extreme outlier --
>>> except for full vmdk images, where I've only done one run for each case
>>> because creating these images can take a very long time.
>>>
>>> Here are the differences from before to after:
>>>
>>> sparse raw on tmpfs:    + 19 % (436 ms to 520 ms)
>>> sparse qcow2 on tmpfs:  - 31 % (435 ms to 301 ms)
>>> sparse vmdk on tmpfs:   + 37 % (214 ms to 294 ms)
>>>
>>> sparse raw on xfs:      + 69 % (452 ms to 762 ms)
>>> sparse qcow2 on xfs:    - 34 % (462 ms to 304 ms)
>>> sparse vmdk on xfs:     + 42 % (210 ms to 298 ms)
>>>
>>> sparse raw on ext4:     +360 % (144 ms to 655 ms)
>>> sparse qcow2 on ext4:   +120 % (147 ms to 330 ms)
>>> sparse vmdk on ext4:    + 16 % (253 ms to 293 ms)
>>>
>>>
>>> full raw on tmpfs:      -  9 % (437 ms to 398 ms)
>>> full qcow2 on tmpfs:    - 75 % (1.63 s to 403 ms)
>>> full vmdk on tmpfs:     -100 % (10 min to 767 ms)
>>>
>>> full raw on xfs:        -  1 % (407 ms to 404 ms, insignificant)
>>> full qcow2 on xfs:      -  1 % (410 ms to 404 ms, insignificant)
>>> full vmdk on xfs:       - 33 % (1.05 s to 695 ms)
>>>
>>>
>>>
>>>
>>> full raw on ext4:       -  2 % (308 ms to 301 ms, insignificant)
>>> full qcow2 on ext4:     +  2 % (307 ms to 312 ms, insignificant)
>>> full vmdk on ext4:      - 74 % (3.53 s to 839 ms)
>>>
>>>
>>> So...  It's more extreme than I had hoped, that's for sure.  What I
>>> conclude from this is:
>>>
>>> (1) This patch is generally good for nearly fully allocated images.  In
>>> the worst case (on well-behaving filesystems with well-optimized image
>>> formats) it changes nothing.  In the best case, conversion time is
>>> reduced drastically.
> 
> This makes sense. Asking the kernel whether a block is zero only helps
> the performance if the result is yes occasionally, otherwise it's just
> wasted work.
> 
> Maybe we could try to guess the ratio by comparing the number of
> allocated blocks in the image file and the virtual disk size or
> something? Then we could only call lseek() when we can actually expect
> an improvement from it.

Sounds like "qemu should not contain policy" to me.  If the user expects
the image to be fully allocated, they might as well use -S 0.

>>> (2) For sparse raw images, this is absolutely devastating.  Reading them
>>> now takes more than (ext4) or nearly (xfs) twice as much time as reading
>>> a fully allocated image.  So much for "if a filesystem driver has any
>>> sense".
> 
> Are you sure that only the filesystem is the problem? Checking for every
> single byte of an image whether it is zero has to cost some performance.

Well, yes, but "read data location from FS metadata" + "realize it's a
hole" + memset() + "repe scasb" shouldn't take twice as much time as
"read data location from FS metadata" + "read data from SSD".

I expected the "realize it's a hole" part to fall out for free, so this
would mean that memset() + repe scasb take much longer than reading data from
the SSD -- and that's just pretty much impossible.

> The fully allocated image doesn't suffer from this because (a) it only
> has to check the first byte in each block and (b) the cost was already
> there before this patch.
> 
> In fact, if null-co supported .bdrv_co_pwrite_zeroes, I think you would
> get even worse results for your patch because then the pre-patch version
> doesn't even have to do the memset().
> 
>>> (2a) It might be worth noting that on xfs, reading the sparse file took
>>> longer even before this patch...
>>>
>>> (3) qcow2 is different: It benefits from this patch on tmpfs and xfs
>>> (note that reading a sparse qcow2 file took longer than reading a full
>>> qcow2 file before this patch!), but it gets pretty much destroyed on
>>> ext4, too.
> 
> I suppose an empty qcow2 with metadata preallocation behaves roughly
> like sparse raw?

Yep, more on that below.

> As long as the qcow2 metadata reflects the allocation status in the
> image file (which it probably usually does, except with preallocation),
> it makes sense that qcow2 performs better with just relying on its
> metadata. Calling an lseek() that just returns the same result is a
> wasted effort then.
> 
>>> (4) As for sparse vmdk images...  Reading them takes longer, but it's
>>> still faster than reading full vmdk images, so that's not horrible.
> 
> Hm, why is that? Shouldn't vmdk metadata reflect the allocation status
> in the image file just as well as qcow2 metadata?
> 
> But actually, the absolute numbers are much lower than both raw and
> qcow2, which is a bit surprising. Is there a bug somewhere in vmdk or
> are we missing optimisations in raw and qcow2?

I wondered about that, too.  Maybe something to look into another time...

>>> So there we are.  I was wrong about filesystem drivers having any sense,
>>> so this patch can indeed have a hugely negative impact.
>>>
>>> I would argue that conversion time for full images is more important,
>>> because that's probably the main use case; but the thing is that here
>>> this patch only helps for tmpfs and vmdk.  We don't care too much about
>>> vmdk, and the fact that tmpfs takes so long simply is a bug.
>>>
>>> I guess the xfs/tmpfs results would still be in a range where they are
>>> barely acceptable (because it's mainly a qcow2 vs. raw tradeoff), but
>>> the ext4 horrors probably make this patch a no-go in its current form.
>>>
>>> In any case it's interesting to see that even the current qemu-img
>>> convert takes longer to read sparsely allocated qcow2/raw files from xfs
>>> than fully allocated images...
>>>
>>>
>>> So I guess I'll re-send this patch where the change is done only for
>>> -S 0.
>>
>> Wow, unexpected.  Thanks for doing the benchmarks!
> 
> ext4 is a bit surprising, yes. vmdk, too. The rest kind of makes sense
> to me.

To me it still quite doesn't...


Anyway, I wrote another variation of this which I called "greedy
block_status".  This series introduced a
bdrv_greedy_block_status_above() function which basically returns the
first "reliable" allocation information it can get: For instance, for
qcow2 images it will use only the qcow2 metadata and not descend down to
the protocol level.  However, for raw images it will go down to the
protocol because raw doesn't give any information.

The idea was that we don't want to leave the block layer unless we have
to, because we have no idea how protocols behave and whether they agree
to our notion that hole information should be available quickly.

The result looked good, but then I thought "wait, what about
preallocated qcow2 files"...  And, well...  Not good.  Here are the
benchmarks from that:

sparse raw/tmpfs:       -  3 % (0.6σ)
sparse qcow2/tmpfs:     +  6 % (0.9σ)
sparse vmdk/tmpfs:      -  5 % (0.7σ)

sparse raw/xfs:         +  2 % (0.4σ)
sparse qcow2/xfs:       +  4 % (1.0σ)
sparse vmdk/xfs:        +  0 % (0.1σ)

sparse raw/ext4:        +  2 % (0.4σ)
sparse qcow2/ext4:      ±  0 % (0.0σ)
sparse vmdk/ext4:       -  0 % (0.1σ)

full raw/tmpfs:         +  5 % (1.5σ)
full qcow2/tmpfs:       - 71 % (> 3σ) (from 1.76 s to 0.51 s)
full vmdk/tmpfs:        -100 % (n=1)  (from 10 min to 0.8 s)

full raw/xfs:           -  0 % (0.2σ)
full qcow2/xfs:         -  1 % (1.1σ)
full vmdk/xfs:          -  7 % (n=1)

full raw/ext4:          +  0 % (0.1σ)
full qcow2/ext4:        -  0 % (1.6σ)
full vmdk/ext4:         - 26 % (> 3σ) (from 7.5 s to 5.6 s)

metadata qcow2/tmpfs:   +800 % (> 3σ) (from 0.14 s to 1.25 s)
metadata qcow2/xfs:     +870 % (> 3σ) (from 0.14 s to 1.34 s)
metadata qcow2/ext4:    +780 % (> 3σ) (from 0.11 s to 0.98 s)


And here are some notes from the cover letter I had already written:

== cover letter excerpt start ==

So, the results:

- The main issue before this series is relieved: you can now indeed
  convert a vmdk file from tmpfs without starving in the meantime.

- However, in exchange, converting metadata-preallocated qcow2 files now
  takes much longer than before.  (Same for falloc, in case you are
  wondering.)

- Apart from that, the only significant changes are that converting
  qcow2 off tmpfs got quicker, same for vmdk from ext4.


The question is now whether it is more important to bring down the
conversion time for vmdk, which is absolutely unbearable, and in
exchange bring up the conversion time for falloc/metadata-preallocated
qcow2 images to something that is much larger than before, but still
much lower than if they were fully allocated (about a fourth of that).

Comparison for qcow2 on e.g. xfs:
- Sparse: from 0.569 s to 0.592 s
- preallocation=metadata: from 0.138 s to 1.339 s
- Nearly full: from 4.056 s to 4.032 s

Just for fun:
- preallocation=off: from 0.162 s to 0.152 s

Interesting to see that (before this patch) it seems to take a little
longer with preallocation=off than with preallocation=metadata.
(And yes, if you take enough measurements, you can see that this is a
 consistent result.)

So I wouldn't call the change for preallocated qcow2 files absolutely
horrible -- whereas the previous time for vmdk was absolutely horrible.
But qcow2 is much more important than vmdk on a buggy filesystem
(tmpfs), so there's that.

== cover letter excerpt end ==

If anyone is still interested, I have the patches here...  But in the
end, it probably is indeed best to just do something about -S 0 and
leave the rest alone.  (And keep putting the blame on tmpfs.)

Max
Paolo Bonzini March 7, 2018, 4:33 p.m. UTC | #8
On 07/03/2018 16:57, Max Reitz wrote:
>>>> (2) For sparse raw images, this is absolutely devastating.  Reading them
>>>> now takes more than (ext4) or nearly (xfs) twice as much time as reading
>>>> a fully allocated image.  So much for "if a filesystem driver has any
>>>> sense".
>> Are you sure that only the filesystem is the problem? Checking for every
>> single byte of an image whether it is zero has to cost some performance.
> Well, yes, but "read data location from FS metadata" + "realize it's a
> hole" + memset() + "repe scasb" shouldn't take twice as much time as
> "read data location from FS metadata" + "read data from SSD".
> 
> I expected the "realize it's a hole" part to fall out for free, so this
> would mean that memset() + repe scasb take much longer than reading data from
> the SSD -- and that's just pretty much impossible.
> 

This makes a lot of sense, but just to double-check, what does profiling
say?

Paolo
Kevin Wolf March 7, 2018, 5:05 p.m. UTC | #9
On 07.03.2018 at 16:57, Max Reitz wrote:
> On 2018-03-06 18:37, Kevin Wolf wrote:
> > On 06.03.2018 at 14:47, Stefan Hajnoczi wrote:
> >> On Wed, Feb 28, 2018 at 09:11:32PM +0100, Max Reitz wrote:
> >>> So...  It's more extreme than I had hoped, that's for sure.  What I
> >>> conclude from this is:
> >>>
> >>> (1) This patch is generally good for nearly fully allocated images.  In
> >>> the worst case (on well-behaving filesystems with well-optimized image
> >>> formats) it changes nothing.  In the best case, conversion time is
> >>> reduced drastically.
> > 
> > This makes sense. Asking the kernel whether a block is zero only helps
> > the performance if the result is yes occasionally, otherwise it's just
> > wasted work.
> > 
> > Maybe we could try to guess the ratio by comparing the number of
> > allocated blocks in the image file and the virtual disk size or
> > something? Then we could only call lseek() when we can actually expect
> > an improvement from it.
> 
> Sounds like "qemu should not contain policy" to me.  If the user expects
> the image to be fully allocated, they might as well use -S 0.

Optimising special cases isn't really what is meant when we're talking
about having policy in qemu. The result doesn't change, but the
performance does potentially. And as long as you access the whole image
like in qemu-img convert, qemu can make a pretty good estimation because
it knows both physical and virtual image sizes, so why bother the
management layer with it?

The thing that could be considered policy is the threshold where you
switch from one method to the other.

> >>> (2) For sparse raw images, this is absolutely devastating.  Reading them
> >>> now takes more than (ext4) or nearly (xfs) twice as much time as reading
> >>> a fully allocated image.  So much for "if a filesystem driver has any
> >>> sense".
> > 
> > Are you sure that only the filesystem is the problem? Checking for every
> > single byte of an image whether it is zero has to cost some performance.
> 
> Well, yes, but "read data location from FS metadata" + "realize it's a
> hole" + memset() + "repe scasb" shouldn't take twice as much time as
> "read data location from FS metadata" + "read data from SSD".
> 
> I expected the "realize it's a hole" part to fall out for free, so this
> would mean that memset() + repe scasb take much longer than reading data from
> the SSD -- and that's just pretty much impossible.

Not sure where you get the SSD part from. The scenarios you're comparing
are these:

1. Query holes with lseek() and then memset() in qemu's emulation of
   bdrv_co_pwrite_zeroes() for drivers that don't implement it. (Which
   is true for your null-co, but not representative for the real-world
   use cases with an actual file-posix protocol layer. Benchmarking with
   an extended null-co that has write_zero support would probably
   better.)

2. Unconditionally call pread() and let the kernel do a memset(), at the
   cost of having to scan the buffer afterwards because qemu doesn't
   know yet that it contains zeros.

Neither case involves disk accesses if the filesystem metadata is
cached. You're comparing a memset+scan to just a memset (and the more
realistic case should be comparing memset+scan to nothing).

(BTW, buffer_is_zero() does complicated stuff, but 'repe scasb' isn't
among it.)
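
In plain POSIX terms, and leaving out the block layer, the write-zeroes
emulation and all error handling, the two scenarios compare roughly like
the following sketch (not qemu code; a trivial is_zero() stands in for
buffer_is_zero(), and holes are simply skipped on the destination):

#define _GNU_SOURCE             /* SEEK_DATA / SEEK_HOLE on glibc */
#include <fcntl.h>
#include <stdbool.h>
#include <stddef.h>
#include <unistd.h>

#define MIN(a, b) ((a) < (b) ? (a) : (b))

/* Trivial stand-in for qemu's buffer_is_zero(). */
static bool is_zero(const char *buf, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (buf[i]) {
            return false;
        }
    }
    return true;
}

/* 1. Pre-patch style: ask the kernel where the data is and only read that. */
static void copy_skipping_holes(int in, int out, off_t end,
                                char *buf, size_t bufsz)
{
    off_t ofs = 0;

    while (ofs < end) {
        off_t data = lseek(in, ofs, SEEK_DATA);
        if (data < 0) {
            break;                      /* ENXIO: only a trailing hole left */
        }
        off_t hole = lseek(in, data, SEEK_HOLE);

        for (ofs = data; ofs < hole; ofs += bufsz) {
            size_t n = MIN(bufsz, (size_t)(hole - ofs));
            pread(in, buf, n, ofs);
            pwrite(out, buf, n, ofs);
        }
        ofs = hole;
    }
}

/* 2. Post-patch style: read everything and let zero detection sort it out.
 *    Reading a hole makes the kernel memset() the buffer; is_zero() then
 *    has to scan it before the write can be skipped. */
static void copy_detecting_zeroes(int in, int out, off_t end,
                                  char *buf, size_t bufsz)
{
    for (off_t ofs = 0; ofs < end; ofs += bufsz) {
        size_t n = MIN(bufsz, (size_t)(end - ofs));
        pread(in, buf, n, ofs);
        if (!is_zero(buf, n)) {
            pwrite(out, buf, n, ofs);
        }
    }
}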

> > The fully allocated image doesn't suffer from this because (a) it only
> > has to check the first byte in each block and (b) the cost was already
> > there before this patch.
> > 
> > In fact, if null-co supported .bdrv_co_pwrite_zeroes, I think you would
> > get even worse results for your patch because then the pre-patch version
> > doesn't even have to do the memset().
> > 
> >>> (2a) It might be worth noting that on xfs, reading the sparse file took
> >>> longer even before this patch...
> >>>
> >>> (3) qcow2 is different: It benefits from this patch on tmpfs and xfs
> >>> (note that reading a sparse qcow2 file took longer than reading a full
> >>> qcow2 file before this patch!), but it gets pretty much destroyed on
> >>> ext4, too.
> > 
> > I suppose an empty qcow2 with metadata preallocation behaves roughly
> > like sparse raw?
> 
> Yep, more on that below.
> 
> > As long as the qcow2 metadata reflects the allocation status in the
> > image file (which it probably usually does, except with preallocation),
> > it makes sense that qcow2 performs better with just relying on its
> > metadata. Calling an lseek() that just returns the same result is a
> > wasted effort then.
> > 
> >>> (4) As for sparse vmdk images...  Reading them takes longer, but it's
> >>> still faster than reading full vmdk images, so that's not horrible.
> > 
> > Hm, why is that? Shouldn't vmdk metadata reflect the allocation status
> > in the image file just as well as qcow2 metadata?
> > 
> > But actually, the absolute numbers are much lower than both raw and
> > qcow2, which is a bit surprising. Is there a bug somewhere in vmdk or
> > are we missing optimisations in raw and qcow2?
> 
> I wondered about that, too.  Maybe something to look into another time...

A first thing to try would be a larger qcow2 cluster size. VMDK has 1 MB
clusters, if I remember correctly, so maybe it just covers larger ranges
with a single call.

Kevin
Max Reitz March 7, 2018, 5:11 p.m. UTC | #10
On 2018-03-07 17:33, Paolo Bonzini wrote:
> On 07/03/2018 16:57, Max Reitz wrote:
>>>>> (2) For sparse raw images, this is absolutely devastating.  Reading them
>>>>> now takes more than (ext4) or nearly (xfs) twice as much time as reading
>>>>> a fully allocated image.  So much for "if a filesystem driver has any
>>>>> sense".
>>> Are you sure that only the filesystem is the problem? Checking for every
>>> single byte of an image whether it is zero has to cost some performance.
>> Well, yes, but "read data location from FS metadata" + "realize it's a
>> hole" + memset() + "repe scasb" shouldn't take twice as much time as
>> "read data location from FS metadata" + "read data from SSD".
>>
>> I expected the "realize it's a hole" part to fall out for free, so this
>> would mean that memset() + repe scasb take much longer than reading data from
>> the SSD -- and that's just pretty much impossible.
>>
> 
> This makes a lot of sense, but just to double-check, what does profiling
> say?

Oops, right.  I forgot that I forgot to drop the caches in that first
benchmark...
(http://lists.nongnu.org/archive/html/qemu-block/2018-02/msg01166.html)

I don't have a full test run for this RFC version, but with the modified
series I hinted at in
http://lists.nongnu.org/archive/html/qemu-block/2018-03/msg00244.html I get:
- 0.6 s for a sparse qcow2
- 1.3 s for a preallocated qcow2 (basically like a sparse raw file in
this RFC here)
- 4.0 s for a fully allocated qcow2

So that makes more sense.

Max
Max Reitz March 7, 2018, 5:22 p.m. UTC | #11
On 2018-03-07 18:05, Kevin Wolf wrote:
> On 07.03.2018 at 16:57, Max Reitz wrote:
>> On 2018-03-06 18:37, Kevin Wolf wrote:
>>> On 06.03.2018 at 14:47, Stefan Hajnoczi wrote:
>>>> On Wed, Feb 28, 2018 at 09:11:32PM +0100, Max Reitz wrote:
>>>>> So...  It's more extreme than I had hoped, that's for sure.  What I
>>>>> conclude from this is:
>>>>>
>>>>> (1) This patch is generally good for nearly fully allocated images.  In
>>>>> the worst case (on well-behaving filesystems with well-optimized image
>>>>> formats) it changes nothing.  In the best case, conversion time is
>>>>> reduced drastically.
>>>
>>> This makes sense. Asking the kernel whether a block is zero only helps
>>> the performance if the result is yes occasionally, otherwise it's just
>>> wasted work.
>>>
>>> Maybe we could try to guess the ratio by comparing the number of
>>> allocated blocks in the image file and the virtual disk size or
>>> something? Then we could only call lseek() when we can actually expect
>>> an improvement from it.
>>
>> Sounds like "qemu should not contain policy" to me.  If the user expects
>> the image to be fully allocated, they might as well use -S 0.
> 
> Optimising special cases isn't really what is meant when we're talking
> about having policy in qemu. The result doesn't change, but the
> performance does potentially. And as long as you access the whole image
> like in qemu-img convert, qemu can make a pretty good estimation because
> it knows both physical and virtual image sizes, so why bother the
> management layer with it?

Oh, I thought we should measure how long an lseek() takes and then
decide based on that.  Hm, well, yes, comparing the allocation size
would be worth thinking about.  But then it gets tricky with internal
snapshots and such...

> The thing that could be considered policy is the threshold where you
> switch from one method to the other.
> 
>>>>> (2) For sparse raw images, this is absolutely devastating.  Reading them
>>>>> now takes more than (ext4) or nearly (xfs) twice as much time as reading
>>>>> a fully allocated image.  So much for "if a filesystem driver has any
>>>>> sense".
>>>
>>> Are you sure that only the filesystem is the problem? Checking for every
>>> single byte of an image whether it is zero has to cost some performance.
>>
>> Well, yes, but "read data location from FS metadata" + "realize it's a
>> hole" + memset() + "repe scasb" shouldn't take twice as much time as
>> "read data location from FS metadata" + "read data from SSD".
>>
>> I expected the "realize it's a hole" part to fall out for free, so this
>> would mean that memset() + repe scasb take much longer than reading data from
>> the SSD -- and that's just pretty much impossible.
> 
> Not sure where you get the SSD part from. The scenarios you're comparing
> are these:
> 
> 1. Query holes with lseek() and then memset() in qemu's emulation of
>    bdrv_co_pwrite_zeroes() for drivers that don't implement it. (Which
>    is true for your null-co, but not representative for the real-world
>    use cases with an actual file-posix protocol layer. Benchmarking with
>    an extended null-co that has write_zero support would probably
>    better.)

That's before this patch for sparsely allocated images.

> 2. Unconditionally call pread() and let the kernel do a memset(), at the
>    cost of having to scan the buffer afterwards because qemu doesn't
>    know yet that it contains zeros.

That's after this patch for sparsely allocated images.

What I was wondering about was solely post-patch behavior, namely
sparsely vs. nearly fully allocated images.

The thing was that converting a sparsely allocated image from an SSD
took twice as much time as converting a fully allocated image.  Reading
data comes in for the fully allocated case.

The thing was that I forgot to drop the caches (and I really do want to
drop those, because they only help for my 2 GB test case, but in the
real world with 300+ GB images, they won't do much).  So, yes, I guess
what I compared was an in-memory metadata lookup + memset() +
O(n) buffer_is_zero() vs. in-memory metadata lookup + memcpy() +
O(1) buffer_is_zero().

Still leaves something to be explained (because I'd expect memset() to
be twice as fast as memcpy()), but at least it isn't completely weird.

> Neither case involves disk accesses if the filesystem metadata is
> cached. You're comparing a memset+scan to just a memset (and the more
> realistic case should be comparing memset+scan to nothing).
> 
> (BTW, buffer_is_zero() does complicated stuff, but 'repe scasb' isn't
> among it.)

I know, that was just a simplification.

>>> The fully allocated image doesn't suffer from this because (a) it only
>>> has to check the first byte in each block and (b) the cost was already
>>> there before this patch.
>>>
>>> In fact, if null-co supported .bdrv_co_pwrite_zeroes, I think you would
>>> get even worse results for your patch because then the pre-patch version
>>> doesn't even have to do the memset().
>>>
>>>>> (2a) It might be worth noting that on xfs, reading the sparse file took
>>>>> longer even before this patch...
>>>>>
>>>>> (3) qcow2 is different: It benefits from this patch on tmpfs and xfs
>>>>> (note that reading a sparse qcow2 file took longer than reading a full
>>>>> qcow2 file before this patch!), but it gets pretty much destroyed on
>>>>> ext4, too.
>>>
>>> I suppose an empty qcow2 with metadata preallocation behaves roughly
>>> like sparse raw?
>>
>> Yep, more on that below.
>>
>>> As long as the qcow2 metadata reflects the allocation status in the
>>> image file (which it probably usually does, except with preallocation),
>>> it makes sense that qcow2 performs better with just relying on its
>>> metadata. Calling an lseek() that just returns the same result is a
>>> wasted effort then.
>>>
>>>>> (4) As for sparse vmdk images...  Reading them takes longer, but it's
>>>>> still faster than reading full vmdk images, so that's not horrible.
>>>
>>> Hm, why is that? Shouldn't vmdk metadata reflect the allocation status
>>> in the image file just as well as qcow2 metadata?
>>>
>>> But actually, the absolute numbers are much lower than both raw and
>>> qcow2, which is a bit surprising. Is there a bug somewhere in vmdk or
>>> are we missing optimisations in raw and qcow2?
>>
>> I wondered about that, too.  Maybe something to look into another time...
> 
> A first thing to try would be a larger qcow2 cluster size. VMDK has 1 MB
> clusters, if I remember correctly, so maybe it just covers larger ranges
> with a single call.

IIRC when doing block-status queries on VMDK, it only returns 64 kB
clusters.  That's the main issue for tmpfs, because we have to do so
many lseek()s, whereas qcow2 can cover an entire 2 GB image with a
single block-status call.

Max

Patch

diff --git a/qemu-img.c b/qemu-img.c
index aa99fd32e9..d9e39c8901 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -1522,7 +1522,6 @@  out4:
 
 enum ImgConvertBlockStatus {
     BLK_DATA,
-    BLK_ZERO,
     BLK_BACKING_FILE,
 };
 
@@ -1581,30 +1580,20 @@  static int convert_iteration_sectors(ImgConvertState *s, int64_t sector_num)
         int64_t count = n * BDRV_SECTOR_SIZE;
 
         if (s->target_has_backing) {
-
-            ret = bdrv_block_status(blk_bs(s->src[src_cur]),
+            ret = bdrv_is_allocated(blk_bs(s->src[src_cur]),
                                     (sector_num - src_cur_offset) *
                                     BDRV_SECTOR_SIZE,
-                                    count, &count, NULL, NULL);
-        } else {
-            ret = bdrv_block_status_above(blk_bs(s->src[src_cur]), NULL,
-                                          (sector_num - src_cur_offset) *
-                                          BDRV_SECTOR_SIZE,
-                                          count, &count, NULL, NULL);
-        }
-        if (ret < 0) {
-            return ret;
-        }
-        n = DIV_ROUND_UP(count, BDRV_SECTOR_SIZE);
+                                    count, &count);
+            if (ret < 0) {
+                return ret;
+            }
 
-        if (ret & BDRV_BLOCK_ZERO) {
-            s->status = BLK_ZERO;
-        } else if (ret & BDRV_BLOCK_DATA) {
-            s->status = BLK_DATA;
+            s->status = ret ? BLK_DATA : BLK_BACKING_FILE;
         } else {
-            s->status = s->target_has_backing ? BLK_BACKING_FILE : BLK_DATA;
+            s->status = BLK_DATA;
         }
 
+        n = DIV_ROUND_UP(count, BDRV_SECTOR_SIZE);
         s->sector_next_status = sector_num + n;
     }
 
@@ -1713,9 +1702,8 @@  static int coroutine_fn convert_co_write(ImgConvertState *s, int64_t sector_num,
                 }
                 break;
             }
-            /* fall-through */
 
-        case BLK_ZERO:
+            /* Zero the target */
             if (s->has_zero_init) {
                 assert(!s->target_has_backing);
                 break;
@@ -1774,15 +1762,12 @@  static void coroutine_fn convert_co_do_copy(void *opaque)
         /* save current sector and allocation status to local variables */
         sector_num = s->sector_num;
         status = s->status;
-        if (!s->min_sparse && s->status == BLK_ZERO) {
-            n = MIN(n, s->buf_sectors);
-        }
         /* increment global sector counter so that other coroutines can
          * already continue reading beyond this request */
         s->sector_num += n;
         qemu_co_mutex_unlock(&s->lock);
 
-        if (status == BLK_DATA || (!s->min_sparse && status == BLK_ZERO)) {
+        if (status == BLK_DATA) {
             s->allocated_done += n;
             qemu_progress_print(100.0 * s->allocated_done /
                                         s->allocated_sectors, 0);
@@ -1795,9 +1780,6 @@  static void coroutine_fn convert_co_do_copy(void *opaque)
                              ": %s", sector_num, strerror(-ret));
                 s->ret = ret;
             }
-        } else if (!s->min_sparse && status == BLK_ZERO) {
-            status = BLK_DATA;
-            memset(buf, 0x00, n * BDRV_SECTOR_SIZE);
         }
 
         if (s->wr_in_order) {
@@ -1879,8 +1861,7 @@  static int convert_do_copy(ImgConvertState *s)
         if (n < 0) {
             return n;
         }
-        if (s->status == BLK_DATA || (!s->min_sparse && s->status == BLK_ZERO))
-        {
+        if (s->status == BLK_DATA) {
             s->allocated_sectors += n;
         }
         sector_num += n;
diff --git a/tests/qemu-iotests/086.out b/tests/qemu-iotests/086.out
index 5ff996101b..057a21abdb 100644
--- a/tests/qemu-iotests/086.out
+++ b/tests/qemu-iotests/086.out
@@ -9,9 +9,69 @@  wrote 1048576/1048576 bytes at offset 4194304
 wrote 1048576/1048576 bytes at offset 33554432
 1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
     (0.00/100%)
+    (1.56/100%)
+    (3.12/100%)
+    (4.69/100%)
+    (6.25/100%)
+    (7.81/100%)
+    (9.38/100%)
+    (10.94/100%)
+    (12.50/100%)
+    (14.06/100%)
+    (15.62/100%)
+    (17.19/100%)
+    (18.75/100%)
+    (20.31/100%)
+    (21.88/100%)
+    (23.44/100%)
     (25.00/100%)
+    (26.56/100%)
+    (28.12/100%)
+    (29.69/100%)
+    (31.25/100%)
+    (32.81/100%)
+    (34.38/100%)
+    (35.94/100%)
+    (37.50/100%)
+    (39.06/100%)
+    (40.62/100%)
+    (42.19/100%)
+    (43.75/100%)
+    (45.31/100%)
+    (46.88/100%)
+    (48.44/100%)
     (50.00/100%)
+    (51.56/100%)
+    (53.12/100%)
+    (54.69/100%)
+    (56.25/100%)
+    (57.81/100%)
+    (59.38/100%)
+    (60.94/100%)
+    (62.50/100%)
+    (64.06/100%)
+    (65.62/100%)
+    (67.19/100%)
+    (68.75/100%)
+    (70.31/100%)
+    (71.88/100%)
+    (73.44/100%)
     (75.00/100%)
+    (76.56/100%)
+    (78.12/100%)
+    (79.69/100%)
+    (81.25/100%)
+    (82.81/100%)
+    (84.38/100%)
+    (85.94/100%)
+    (87.50/100%)
+    (89.06/100%)
+    (90.62/100%)
+    (92.19/100%)
+    (93.75/100%)
+    (95.31/100%)
+    (96.88/100%)
+    (98.44/100%)
     (100.00/100%)
     (100.00/100%)