Message ID | 20250411010732.358817-10-eblake@redhat.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | Make blockdev-mirror dest sparse in more cases | expand |
On 11.04.25 04:04, Eric Blake wrote: > The 'want_zero' parameter to raw_co_block_status() was added so that > we can avoid potentially time-consuming lseek(SEEK_DATA) calls > throughout the file (working around poor filesystems that have O(n) > rather than O(1) extent probing). But when it comes to learning if a > file is completely sparse (for example, it was just created), always > claiming that a file is all data without even checking offset 0 breaks > what would otherwise be attempts at useful optimizations for a > known-zero mirror destination. > > Note that this allows file-posix to report a file as completely zero > if it was externally created (such as via 'truncate --size=$n file') > as entirely sparse; however, it does NOT work for files created > internally by blockdev-create. That's because blockdev-create > intentionally does a sequence of truncate(0), truncate(size), > allocate_first_block(), in order to make it possible for gluster on > XFS to probe the sector size for direct I/O (which doesn't work if the > first block is sparse). That will be addressed in a later patch. > > Signed-off-by: Eric Blake <eblake@redhat.com> > --- > block/file-posix.c | 9 ++++++++- > 1 file changed, 8 insertions(+), 1 deletion(-) > > diff --git a/block/file-posix.c b/block/file-posix.c > index 56d1972d156..67e83528cf5 100644 > --- a/block/file-posix.c > +++ b/block/file-posix.c > @@ -3217,7 +3217,14 @@ static int coroutine_fn raw_co_block_status(BlockDriverState *bs, > return ret; > } > > - if (!want_zero) { > + /* > + * If want_zero is clear, then the caller wants speed over > + * accuracy, and the only place where SEEK_DATA should be > + * attempted is at the start of the file to learn if the file has > + * any data at all (anywhere else, just blindly claim the entire > + * file is data). > + */ > + if (!want_zero && offset) { > *pnum = bytes; > *map = offset; > *file = bs; Looks like a hack. So we have bdrv_co_is_zero_fast() which do pass want_zero=false to block-status. But in case of mirror, which want to check the whole disk, we actually want want_zero=true, and detect it by offset=0.. Isn't it better to add a kind of bdrv_is_zero_middle_speed() (which means, don't try to read the data to check, but be free to use suboptimal lseek call or something like this), which will pass want_zero=true, and use it from mirror? Mirror case differs from usage in qcow2 exactly by the fact that we call it only once. Another doubt (really weak): can this one extra lseek be so slow, that mirror becomes worse? Den, is it right, that problems about slow lseek (that we experienced several years ago) were about qcow2-internals, and nothing related to mirror itself? May one lseek call on mirror target break something?
On Mon, Apr 14, 2025 at 08:05:21PM +0300, Vladimir Sementsov-Ogievskiy wrote: > On 11.04.25 04:04, Eric Blake wrote: > > The 'want_zero' parameter to raw_co_block_status() was added so that > > we can avoid potentially time-consuming lseek(SEEK_DATA) calls > > throughout the file (working around poor filesystems that have O(n) > > rather than O(1) extent probing). But when it comes to learning if a > > file is completely sparse (for example, it was just created), always > > claiming that a file is all data without even checking offset 0 breaks > > what would otherwise be attempts at useful optimizations for a > > known-zero mirror destination. > > > > Note that this allows file-posix to report a file as completely zero > > if it was externally created (such as via 'truncate --size=$n file') > > as entirely sparse; however, it does NOT work for files created > > internally by blockdev-create. That's because blockdev-create > > intentionally does a sequence of truncate(0), truncate(size), > > allocate_first_block(), in order to make it possible for gluster on > > XFS to probe the sector size for direct I/O (which doesn't work if the > > first block is sparse). That will be addressed in a later patch. > > > > Signed-off-by: Eric Blake <eblake@redhat.com> > > --- > > block/file-posix.c | 9 ++++++++- > > 1 file changed, 8 insertions(+), 1 deletion(-) > > > > diff --git a/block/file-posix.c b/block/file-posix.c > > index 56d1972d156..67e83528cf5 100644 > > --- a/block/file-posix.c > > +++ b/block/file-posix.c > > @@ -3217,7 +3217,14 @@ static int coroutine_fn raw_co_block_status(BlockDriverState *bs, > > return ret; > > } > > > > - if (!want_zero) { > > + /* > > + * If want_zero is clear, then the caller wants speed over > > + * accuracy, and the only place where SEEK_DATA should be > > + * attempted is at the start of the file to learn if the file has > > + * any data at all (anywhere else, just blindly claim the entire > > + * file is data). > > + */ > > + if (!want_zero && offset) { > > *pnum = bytes; > > *map = offset; > > *file = bs; > > Looks like a hack. So we have bdrv_co_is_zero_fast() which do pass want_zero=false to block-status. But in case of mirror, which want to check the whole disk, we actually want want_zero=true, and detect it by offset=0.. > > Isn't it better to add a kind of bdrv_is_zero_middle_speed() (which means, don't try to read the data to check, but be free to use suboptimal lseek call or something like this), which will pass want_zero=true, and use it from mirror? Mirror case differs from usage in qcow2 exactly by the fact that we call it only once. Which is exactly why I wrote patch 4/6 turning the want_zero bool into an enum so that we are being more explicit in WHY block status is being requested. > > > Another doubt (really weak): can this one extra lseek be so slow, that mirror becomes worse? > Den, is it right, that problems about slow lseek (that we experienced several years ago) were about qcow2-internals, and nothing related to mirror itself? May one lseek call on mirror target break something? I'm not worried about one or even two lseek on the destination; what I'm trying to avoid is an lseek on the destination for every chunk where the source is doing a write-zeroes request (we're already doing an lseek on the source as part of the background task of the mirror, but foreground writes being split into the original and the mirror are where it really matters that we aren't going slower just because of the mirror).
On 14.04.25 21:12, Eric Blake wrote: > On Mon, Apr 14, 2025 at 08:05:21PM +0300, Vladimir Sementsov-Ogievskiy wrote: >> On 11.04.25 04:04, Eric Blake wrote: >>> The 'want_zero' parameter to raw_co_block_status() was added so that >>> we can avoid potentially time-consuming lseek(SEEK_DATA) calls >>> throughout the file (working around poor filesystems that have O(n) >>> rather than O(1) extent probing). But when it comes to learning if a >>> file is completely sparse (for example, it was just created), always >>> claiming that a file is all data without even checking offset 0 breaks >>> what would otherwise be attempts at useful optimizations for a >>> known-zero mirror destination. >>> >>> Note that this allows file-posix to report a file as completely zero >>> if it was externally created (such as via 'truncate --size=$n file') >>> as entirely sparse; however, it does NOT work for files created >>> internally by blockdev-create. That's because blockdev-create >>> intentionally does a sequence of truncate(0), truncate(size), >>> allocate_first_block(), in order to make it possible for gluster on >>> XFS to probe the sector size for direct I/O (which doesn't work if the >>> first block is sparse). That will be addressed in a later patch. >>> >>> Signed-off-by: Eric Blake<eblake@redhat.com> >>> --- >>> block/file-posix.c | 9 ++++++++- >>> 1 file changed, 8 insertions(+), 1 deletion(-) >>> >>> diff --git a/block/file-posix.c b/block/file-posix.c >>> index 56d1972d156..67e83528cf5 100644 >>> --- a/block/file-posix.c >>> +++ b/block/file-posix.c >>> @@ -3217,7 +3217,14 @@ static int coroutine_fn raw_co_block_status(BlockDriverState *bs, >>> return ret; >>> } >>> >>> - if (!want_zero) { >>> + /* >>> + * If want_zero is clear, then the caller wants speed over >>> + * accuracy, and the only place where SEEK_DATA should be >>> + * attempted is at the start of the file to learn if the file has >>> + * any data at all (anywhere else, just blindly claim the entire >>> + * file is data). >>> + */ >>> + if (!want_zero && offset) { >>> *pnum = bytes; >>> *map = offset; >>> *file = bs; >> Looks like a hack. So we have bdrv_co_is_zero_fast() which do pass want_zero=false to block-status. But in case of mirror, which want to check the whole disk, we actually want want_zero=true, and detect it by offset=0.. >> >> Isn't it better to add a kind of bdrv_is_zero_middle_speed() (which means, don't try to read the data to check, but be free to use suboptimal lseek call or something like this), which will pass want_zero=true, and use it from mirror? Mirror case differs from usage in qcow2 exactly by the fact that we call it only once. > Which is exactly why I wrote patch 4/6 turning the want_zero bool into > an enum so that we are being more explicit in WHY block status is > being requested. Hmm, and this makes more strange that this hack for file-posix is kept after it. Don't we have other block drivers, where we should behave similarly in block_status for offset=0? Or I mean, could we just use different block-status modes in qcow2 and mirror, when call bdrv_co_is_zero_fast(), and don't handle offset==0 specially in file-posix?
On Tue, Apr 15, 2025 at 03:37:39PM +0300, Vladimir Sementsov-Ogievskiy wrote: > On 14.04.25 21:12, Eric Blake wrote: > > On Mon, Apr 14, 2025 at 08:05:21PM +0300, Vladimir Sementsov-Ogievskiy wrote: > > > On 11.04.25 04:04, Eric Blake wrote: > > > > The 'want_zero' parameter to raw_co_block_status() was added so that > > > > we can avoid potentially time-consuming lseek(SEEK_DATA) calls > > > > throughout the file (working around poor filesystems that have O(n) > > > > rather than O(1) extent probing). But when it comes to learning if a > > > > file is completely sparse (for example, it was just created), always > > > > claiming that a file is all data without even checking offset 0 breaks > > > > what would otherwise be attempts at useful optimizations for a > > > > known-zero mirror destination. > > > > > > > > Note that this allows file-posix to report a file as completely zero > > > > if it was externally created (such as via 'truncate --size=$n file') > > > > as entirely sparse; however, it does NOT work for files created > > > > internally by blockdev-create. That's because blockdev-create > > > > intentionally does a sequence of truncate(0), truncate(size), > > > > allocate_first_block(), in order to make it possible for gluster on > > > > XFS to probe the sector size for direct I/O (which doesn't work if the > > > > first block is sparse). That will be addressed in a later patch. > > > > > > > > Signed-off-by: Eric Blake<eblake@redhat.com> > > > > --- > > > > block/file-posix.c | 9 ++++++++- > > > > 1 file changed, 8 insertions(+), 1 deletion(-) > > > > > > > > diff --git a/block/file-posix.c b/block/file-posix.c > > > > index 56d1972d156..67e83528cf5 100644 > > > > --- a/block/file-posix.c > > > > +++ b/block/file-posix.c > > > > @@ -3217,7 +3217,14 @@ static int coroutine_fn raw_co_block_status(BlockDriverState *bs, > > > > return ret; > > > > } > > > > > > > > - if (!want_zero) { > > > > + /* > > > > + * If want_zero is clear, then the caller wants speed over > > > > + * accuracy, and the only place where SEEK_DATA should be > > > > + * attempted is at the start of the file to learn if the file has > > > > + * any data at all (anywhere else, just blindly claim the entire > > > > + * file is data). > > > > + */ > > > > + if (!want_zero && offset) { > > > > *pnum = bytes; > > > > *map = offset; > > > > *file = bs; > > > Looks like a hack. So we have bdrv_co_is_zero_fast() which do pass want_zero=false to block-status. But in case of mirror, which want to check the whole disk, we actually want want_zero=true, and detect it by offset=0.. > > > > > > Isn't it better to add a kind of bdrv_is_zero_middle_speed() (which means, don't try to read the data to check, but be free to use suboptimal lseek call or something like this), which will pass want_zero=true, and use it from mirror? Mirror case differs from usage in qcow2 exactly by the fact that we call it only once. > > Which is exactly why I wrote patch 4/6 turning the want_zero bool into > > an enum so that we are being more explicit in WHY block status is > > being requested. > > Hmm, and this makes more strange that this hack for file-posix is kept after it. Don't we have other block drivers, where we should behave similarly in block_status for offset=0? Or I mean, could we just use different block-status modes in qcow2 and mirror, when call bdrv_co_is_zero_fast(), and don't handle offset==0 specially in file-posix? I'll need to ajust this in v2 of my series. The problem I'm trying to resolve: libvirt migration with --migrate-disks-detect-zeroes is causing the destination to become fully allocated if the destination does not use "discard":"unmap", whereas with QEMU 9.0 it did not. The change was commit d05ae948c, where we fixed a problem with punching holes into the mirror destination even when the user had requested the file to not shrink; but it means that a file that is not allowed to shrink is now fully allocated rather than left sparse even when leaving it sparse would still be an accurate mirror result with less I/O. And it turns out that when libvirt is driving the mirror into a raw destination, the mirroring is done over NBD, rather than directly to a file-posix destination. Coupled with the oddity that 'qemu-img create -f raw' ends up pre-allocating up to 64k of the image with all zero contents to make alignment probing easier, it is no longer possible to quickly see that the entire NBD image reads as zero, just as it was not possible to quickly see that for file-posix. So, in v2 of my patch series, I think I need to hoist the logic that I added to file-posix.c that reads an initial small sector in order to determine if the entire image is zero to instead live in io.c to be used across all protocols, NBD included. So if you don't think offset=0 should be given any special treatment in file-posix, but instead have generic code in io.c that checks if an entire image has block_status of sparse and/or just a small initial non-sparse section that can be manually checked for zero contents, then that should still fix the issue of mirroring to a sparse destination with "discard":"ignore" but keeping the destination sparse.
On 15.04.25 18:22, Eric Blake wrote: > On Tue, Apr 15, 2025 at 03:37:39PM +0300, Vladimir Sementsov-Ogievskiy wrote: >> On 14.04.25 21:12, Eric Blake wrote: >>> On Mon, Apr 14, 2025 at 08:05:21PM +0300, Vladimir Sementsov-Ogievskiy wrote: >>>> On 11.04.25 04:04, Eric Blake wrote: >>>>> The 'want_zero' parameter to raw_co_block_status() was added so that >>>>> we can avoid potentially time-consuming lseek(SEEK_DATA) calls >>>>> throughout the file (working around poor filesystems that have O(n) >>>>> rather than O(1) extent probing). But when it comes to learning if a >>>>> file is completely sparse (for example, it was just created), always >>>>> claiming that a file is all data without even checking offset 0 breaks >>>>> what would otherwise be attempts at useful optimizations for a >>>>> known-zero mirror destination. >>>>> >>>>> Note that this allows file-posix to report a file as completely zero >>>>> if it was externally created (such as via 'truncate --size=$n file') >>>>> as entirely sparse; however, it does NOT work for files created >>>>> internally by blockdev-create. That's because blockdev-create >>>>> intentionally does a sequence of truncate(0), truncate(size), >>>>> allocate_first_block(), in order to make it possible for gluster on >>>>> XFS to probe the sector size for direct I/O (which doesn't work if the >>>>> first block is sparse). That will be addressed in a later patch. >>>>> >>>>> Signed-off-by: Eric Blake<eblake@redhat.com> >>>>> --- >>>>> block/file-posix.c | 9 ++++++++- >>>>> 1 file changed, 8 insertions(+), 1 deletion(-) >>>>> >>>>> diff --git a/block/file-posix.c b/block/file-posix.c >>>>> index 56d1972d156..67e83528cf5 100644 >>>>> --- a/block/file-posix.c >>>>> +++ b/block/file-posix.c >>>>> @@ -3217,7 +3217,14 @@ static int coroutine_fn raw_co_block_status(BlockDriverState *bs, >>>>> return ret; >>>>> } >>>>> >>>>> - if (!want_zero) { >>>>> + /* >>>>> + * If want_zero is clear, then the caller wants speed over >>>>> + * accuracy, and the only place where SEEK_DATA should be >>>>> + * attempted is at the start of the file to learn if the file has >>>>> + * any data at all (anywhere else, just blindly claim the entire >>>>> + * file is data). >>>>> + */ >>>>> + if (!want_zero && offset) { >>>>> *pnum = bytes; >>>>> *map = offset; >>>>> *file = bs; >>>> Looks like a hack. So we have bdrv_co_is_zero_fast() which do pass want_zero=false to block-status. But in case of mirror, which want to check the whole disk, we actually want want_zero=true, and detect it by offset=0.. >>>> >>>> Isn't it better to add a kind of bdrv_is_zero_middle_speed() (which means, don't try to read the data to check, but be free to use suboptimal lseek call or something like this), which will pass want_zero=true, and use it from mirror? Mirror case differs from usage in qcow2 exactly by the fact that we call it only once. >>> Which is exactly why I wrote patch 4/6 turning the want_zero bool into >>> an enum so that we are being more explicit in WHY block status is >>> being requested. >> >> Hmm, and this makes more strange that this hack for file-posix is kept after it. Don't we have other block drivers, where we should behave similarly in block_status for offset=0? Or I mean, could we just use different block-status modes in qcow2 and mirror, when call bdrv_co_is_zero_fast(), and don't handle offset==0 specially in file-posix? > > I'll need to ajust this in v2 of my series. > > The problem I'm trying to resolve: libvirt migration with > --migrate-disks-detect-zeroes is causing the destination to become > fully allocated if the destination does not use "discard":"unmap", > whereas with QEMU 9.0 it did not. The change was commit d05ae948c, > where we fixed a problem with punching holes into the mirror > destination even when the user had requested the file to not shrink; > but it means that a file that is not allowed to shrink is now fully > allocated rather than left sparse even when leaving it sparse would > still be an accurate mirror result with less I/O. > > And it turns out that when libvirt is driving the mirror into a raw > destination, the mirroring is done over NBD, rather than directly to a > file-posix destination. Coupled with the oddity that 'qemu-img create > -f raw' ends up pre-allocating up to 64k of the image with all zero > contents to make alignment probing easier, it is no longer possible to > quickly see that the entire NBD image reads as zero, just as it was > not possible to quickly see that for file-posix. So, in v2 of my > patch series, I think I need to hoist the logic that I added to > file-posix.c that reads an initial small sector in order to determine > if the entire image is zero to instead live in io.c to be used across > all protocols, NBD included. > > So if you don't think offset=0 should be given any special treatment > in file-posix, but instead have generic code in io.c that checks if an > entire image has block_status of sparse and/or just a small initial > non-sparse section that can be manually checked for zero contents, > then that should still fix the issue of mirroring to a sparse > destination with "discard":"ignore" but keeping the destination > sparse. > Thanks for explanation! Yes I think making such specific checks about first sector in some kind of generic bdrv_is_all_zeroes() would be cleaner.
diff --git a/block/file-posix.c b/block/file-posix.c index 56d1972d156..67e83528cf5 100644 --- a/block/file-posix.c +++ b/block/file-posix.c @@ -3217,7 +3217,14 @@ static int coroutine_fn raw_co_block_status(BlockDriverState *bs, return ret; } - if (!want_zero) { + /* + * If want_zero is clear, then the caller wants speed over + * accuracy, and the only place where SEEK_DATA should be + * attempted is at the start of the file to learn if the file has + * any data at all (anywhere else, just blindly claim the entire + * file is data). + */ + if (!want_zero && offset) { *pnum = bytes; *map = offset; *file = bs;
The 'want_zero' parameter to raw_co_block_status() was added so that we can avoid potentially time-consuming lseek(SEEK_DATA) calls throughout the file (working around poor filesystems that have O(n) rather than O(1) extent probing). But when it comes to learning if a file is completely sparse (for example, it was just created), always claiming that a file is all data without even checking offset 0 breaks what would otherwise be attempts at useful optimizations for a known-zero mirror destination. Note that this allows file-posix to report a file as completely zero if it was externally created (such as via 'truncate --size=$n file') as entirely sparse; however, it does NOT work for files created internally by blockdev-create. That's because blockdev-create intentionally does a sequence of truncate(0), truncate(size), allocate_first_block(), in order to make it possible for gluster on XFS to probe the sector size for direct I/O (which doesn't work if the first block is sparse). That will be addressed in a later patch. Signed-off-by: Eric Blake <eblake@redhat.com> --- block/file-posix.c | 9 ++++++++- 1 file changed, 8 insertions(+), 1 deletion(-)