[2/6] file-posix: Allow lseek at offset 0 when !want_zero

Message ID	20250411010732.358817-10-eblake@redhat.com (mailing list archive)
State	New
Headers	show Return-Path: <qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org> From: Eric Blake <eblake@redhat.com> To: qemu-devel@nongnu.org Cc: Kevin Wolf <kwolf@redhat.com>, Hanna Reitz <hreitz@redhat.com>, qemu-block@nongnu.org (open list:raw) Subject: [PATCH 2/6] file-posix: Allow lseek at offset 0 when !want_zero Date: Thu, 10 Apr 2025 20:04:52 -0500 Message-ID: <20250411010732.358817-10-eblake@redhat.com> In-Reply-To: <20250411010732.358817-8-eblake@redhat.com> References: <20250411010732.358817-8-eblake@redhat.com> MIME-Version: 1.0 Content-type: text/plain Content-Transfer-Encoding: 8bit Received-SPF: pass client-ip=170.10.129.124; envelope-from=eblake@redhat.com; helo=us-smtp-delivery-124.mimecast.com X-Spam_score_int: -26 X-Spam_score: -2.7 X-Spam_bar: -- X-Spam_report: (-2.7 / 5.0 requ) BAYES_00=-1.9, DKIMWL_WL_HIGH=-0.593, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, RCVD_IN_DNSWL_NONE=-0.0001, RCVD_IN_VALIDITY_RPBL_BLOCKED=0.001, RCVD_IN_VALIDITY_SAFE_BLOCKED=0.001, SPF_HELO_NONE=0.001, SPF_PASS=-0.001 autolearn=ham autolearn_force=no X-Spam_action: no action Precedence: list Errors-To: qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org Sender: qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org
Series	Make blockdev-mirror dest sparse in more cases \| expand [0/6] Make blockdev-mirror dest sparse in more cases [1/6] mirror: Skip pre-zeroing destination if it is already zero [2/6] file-posix: Allow lseek at offset 0 when !want_zero [3/6] mirror: Skip writing zeroes when target is already zero [4/6] block: Expand block status mode from bool to enum [5/6] file-posix: Recognize blockdev-create file as starting all zero [6/6] tests: Add iotest mirror-sparse for recent patches [7/6] mirror: Allow QMP override to declare target already zero

Eric Blake April 11, 2025, 1:04 a.m. UTC

The 'want_zero' parameter to raw_co_block_status() was added so that
we can avoid potentially time-consuming lseek(SEEK_DATA) calls
throughout the file (working around poor filesystems that have O(n)
rather than O(1) extent probing).  But when it comes to learning if a
file is completely sparse (for example, it was just created), always
claiming that a file is all data without even checking offset 0 breaks
what would otherwise be attempts at useful optimizations for a
known-zero mirror destination.

Note that this allows file-posix to report a file as completely zero
if it was externally created (such as via 'truncate --size=$n file')
as entirely sparse; however, it does NOT work for files created
internally by blockdev-create.  That's because blockdev-create
intentionally does a sequence of truncate(0), truncate(size),
allocate_first_block(), in order to make it possible for gluster on
XFS to probe the sector size for direct I/O (which doesn't work if the
first block is sparse).  That will be addressed in a later patch.

Signed-off-by: Eric Blake <eblake@redhat.com>
---
 block/file-posix.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

Vladimir Sementsov-Ogievskiy April 14, 2025, 5:05 p.m. UTC | #1

On 11.04.25 04:04, Eric Blake wrote:
> The 'want_zero' parameter to raw_co_block_status() was added so that
> we can avoid potentially time-consuming lseek(SEEK_DATA) calls
> throughout the file (working around poor filesystems that have O(n)
> rather than O(1) extent probing).  But when it comes to learning if a
> file is completely sparse (for example, it was just created), always
> claiming that a file is all data without even checking offset 0 breaks
> what would otherwise be attempts at useful optimizations for a
> known-zero mirror destination.
> 
> Note that this allows file-posix to report a file as completely zero
> if it was externally created (such as via 'truncate --size=$n file')
> as entirely sparse; however, it does NOT work for files created
> internally by blockdev-create.  That's because blockdev-create
> intentionally does a sequence of truncate(0), truncate(size),
> allocate_first_block(), in order to make it possible for gluster on
> XFS to probe the sector size for direct I/O (which doesn't work if the
> first block is sparse).  That will be addressed in a later patch.
> 
> Signed-off-by: Eric Blake <eblake@redhat.com>
> ---
>   block/file-posix.c | 9 ++++++++-
>   1 file changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/block/file-posix.c b/block/file-posix.c
> index 56d1972d156..67e83528cf5 100644
> --- a/block/file-posix.c
> +++ b/block/file-posix.c
> @@ -3217,7 +3217,14 @@ static int coroutine_fn raw_co_block_status(BlockDriverState *bs,
>           return ret;
>       }
> 
> -    if (!want_zero) {
> +    /*
> +     * If want_zero is clear, then the caller wants speed over
> +     * accuracy, and the only place where SEEK_DATA should be
> +     * attempted is at the start of the file to learn if the file has
> +     * any data at all (anywhere else, just blindly claim the entire
> +     * file is data).
> +     */
> +    if (!want_zero && offset) {
>           *pnum = bytes;
>           *map = offset;
>           *file = bs;

Looks like a hack. So we have bdrv_co_is_zero_fast() which do pass want_zero=false to block-status. But in case of mirror, which want to check the whole disk, we actually want want_zero=true, and detect it by offset=0..

Isn't it better to add a kind of bdrv_is_zero_middle_speed() (which means, don't try to read the data to check, but be free to use suboptimal lseek call or something like this), which will pass want_zero=true, and use it from mirror? Mirror case differs from usage in qcow2 exactly by the fact that we call it only once.


Another doubt (really weak): can this one extra lseek be so slow, that mirror becomes worse?
Den, is it right, that problems about slow lseek (that we experienced several years ago) were about qcow2-internals, and nothing related to mirror itself? May one lseek call on mirror target break something?

Eric Blake April 14, 2025, 6:12 p.m. UTC | #2

On Mon, Apr 14, 2025 at 08:05:21PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> On 11.04.25 04:04, Eric Blake wrote:
> > The 'want_zero' parameter to raw_co_block_status() was added so that
> > we can avoid potentially time-consuming lseek(SEEK_DATA) calls
> > throughout the file (working around poor filesystems that have O(n)
> > rather than O(1) extent probing).  But when it comes to learning if a
> > file is completely sparse (for example, it was just created), always
> > claiming that a file is all data without even checking offset 0 breaks
> > what would otherwise be attempts at useful optimizations for a
> > known-zero mirror destination.
> > 
> > Note that this allows file-posix to report a file as completely zero
> > if it was externally created (such as via 'truncate --size=$n file')
> > as entirely sparse; however, it does NOT work for files created
> > internally by blockdev-create.  That's because blockdev-create
> > intentionally does a sequence of truncate(0), truncate(size),
> > allocate_first_block(), in order to make it possible for gluster on
> > XFS to probe the sector size for direct I/O (which doesn't work if the
> > first block is sparse).  That will be addressed in a later patch.
> > 
> > Signed-off-by: Eric Blake <eblake@redhat.com>
> > ---
> >   block/file-posix.c | 9 ++++++++-
> >   1 file changed, 8 insertions(+), 1 deletion(-)
> > 
> > diff --git a/block/file-posix.c b/block/file-posix.c
> > index 56d1972d156..67e83528cf5 100644
> > --- a/block/file-posix.c
> > +++ b/block/file-posix.c
> > @@ -3217,7 +3217,14 @@ static int coroutine_fn raw_co_block_status(BlockDriverState *bs,
> >           return ret;
> >       }
> > 
> > -    if (!want_zero) {
> > +    /*
> > +     * If want_zero is clear, then the caller wants speed over
> > +     * accuracy, and the only place where SEEK_DATA should be
> > +     * attempted is at the start of the file to learn if the file has
> > +     * any data at all (anywhere else, just blindly claim the entire
> > +     * file is data).
> > +     */
> > +    if (!want_zero && offset) {
> >           *pnum = bytes;
> >           *map = offset;
> >           *file = bs;
> 
> Looks like a hack. So we have bdrv_co_is_zero_fast() which do pass want_zero=false to block-status. But in case of mirror, which want to check the whole disk, we actually want want_zero=true, and detect it by offset=0..
> 
> Isn't it better to add a kind of bdrv_is_zero_middle_speed() (which means, don't try to read the data to check, but be free to use suboptimal lseek call or something like this), which will pass want_zero=true, and use it from mirror? Mirror case differs from usage in qcow2 exactly by the fact that we call it only once.

Which is exactly why I wrote patch 4/6 turning the want_zero bool into
an enum so that we are being more explicit in WHY block status is
being requested.

> 
> 
> Another doubt (really weak): can this one extra lseek be so slow, that mirror becomes worse?
> Den, is it right, that problems about slow lseek (that we experienced several years ago) were about qcow2-internals, and nothing related to mirror itself? May one lseek call on mirror target break something?

I'm not worried about one or even two lseek on the destination; what
I'm trying to avoid is an lseek on the destination for every chunk
where the source is doing a write-zeroes request (we're already doing
an lseek on the source as part of the background task of the mirror,
but foreground writes being split into the original and the mirror are
where it really matters that we aren't going slower just because of
the mirror).

Vladimir Sementsov-Ogievskiy April 15, 2025, 12:37 p.m. UTC | #3

On 14.04.25 21:12, Eric Blake wrote:
> On Mon, Apr 14, 2025 at 08:05:21PM +0300, Vladimir Sementsov-Ogievskiy wrote:
>> On 11.04.25 04:04, Eric Blake wrote:
>>> The 'want_zero' parameter to raw_co_block_status() was added so that
>>> we can avoid potentially time-consuming lseek(SEEK_DATA) calls
>>> throughout the file (working around poor filesystems that have O(n)
>>> rather than O(1) extent probing).  But when it comes to learning if a
>>> file is completely sparse (for example, it was just created), always
>>> claiming that a file is all data without even checking offset 0 breaks
>>> what would otherwise be attempts at useful optimizations for a
>>> known-zero mirror destination.
>>>
>>> Note that this allows file-posix to report a file as completely zero
>>> if it was externally created (such as via 'truncate --size=$n file')
>>> as entirely sparse; however, it does NOT work for files created
>>> internally by blockdev-create.  That's because blockdev-create
>>> intentionally does a sequence of truncate(0), truncate(size),
>>> allocate_first_block(), in order to make it possible for gluster on
>>> XFS to probe the sector size for direct I/O (which doesn't work if the
>>> first block is sparse).  That will be addressed in a later patch.
>>>
>>> Signed-off-by: Eric Blake<eblake@redhat.com>
>>> ---
>>>    block/file-posix.c | 9 ++++++++-
>>>    1 file changed, 8 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/block/file-posix.c b/block/file-posix.c
>>> index 56d1972d156..67e83528cf5 100644
>>> --- a/block/file-posix.c
>>> +++ b/block/file-posix.c
>>> @@ -3217,7 +3217,14 @@ static int coroutine_fn raw_co_block_status(BlockDriverState *bs,
>>>            return ret;
>>>        }
>>>
>>> -    if (!want_zero) {
>>> +    /*
>>> +     * If want_zero is clear, then the caller wants speed over
>>> +     * accuracy, and the only place where SEEK_DATA should be
>>> +     * attempted is at the start of the file to learn if the file has
>>> +     * any data at all (anywhere else, just blindly claim the entire
>>> +     * file is data).
>>> +     */
>>> +    if (!want_zero && offset) {
>>>            *pnum = bytes;
>>>            *map = offset;
>>>            *file = bs;
>> Looks like a hack. So we have bdrv_co_is_zero_fast() which do pass want_zero=false to block-status. But in case of mirror, which want to check the whole disk, we actually want want_zero=true, and detect it by offset=0..
>>
>> Isn't it better to add a kind of bdrv_is_zero_middle_speed() (which means, don't try to read the data to check, but be free to use suboptimal lseek call or something like this), which will pass want_zero=true, and use it from mirror? Mirror case differs from usage in qcow2 exactly by the fact that we call it only once.
> Which is exactly why I wrote patch 4/6 turning the want_zero bool into
> an enum so that we are being more explicit in WHY block status is
> being requested.

Hmm, and this makes more strange that this hack for file-posix is kept after it. Don't we have other block drivers, where we should behave similarly in block_status for offset=0? Or I mean, could we just use different block-status modes in qcow2 and mirror, when call bdrv_co_is_zero_fast(), and don't handle offset==0 specially in file-posix?

Eric Blake April 15, 2025, 3:22 p.m. UTC | #4

On Tue, Apr 15, 2025 at 03:37:39PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> On 14.04.25 21:12, Eric Blake wrote:
> > On Mon, Apr 14, 2025 at 08:05:21PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> > > On 11.04.25 04:04, Eric Blake wrote:
> > > > The 'want_zero' parameter to raw_co_block_status() was added so that
> > > > we can avoid potentially time-consuming lseek(SEEK_DATA) calls
> > > > throughout the file (working around poor filesystems that have O(n)
> > > > rather than O(1) extent probing).  But when it comes to learning if a
> > > > file is completely sparse (for example, it was just created), always
> > > > claiming that a file is all data without even checking offset 0 breaks
> > > > what would otherwise be attempts at useful optimizations for a
> > > > known-zero mirror destination.
> > > > 
> > > > Note that this allows file-posix to report a file as completely zero
> > > > if it was externally created (such as via 'truncate --size=$n file')
> > > > as entirely sparse; however, it does NOT work for files created
> > > > internally by blockdev-create.  That's because blockdev-create
> > > > intentionally does a sequence of truncate(0), truncate(size),
> > > > allocate_first_block(), in order to make it possible for gluster on
> > > > XFS to probe the sector size for direct I/O (which doesn't work if the
> > > > first block is sparse).  That will be addressed in a later patch.
> > > > 
> > > > Signed-off-by: Eric Blake<eblake@redhat.com>
> > > > ---
> > > >    block/file-posix.c | 9 ++++++++-
> > > >    1 file changed, 8 insertions(+), 1 deletion(-)
> > > > 
> > > > diff --git a/block/file-posix.c b/block/file-posix.c
> > > > index 56d1972d156..67e83528cf5 100644
> > > > --- a/block/file-posix.c
> > > > +++ b/block/file-posix.c
> > > > @@ -3217,7 +3217,14 @@ static int coroutine_fn raw_co_block_status(BlockDriverState *bs,
> > > >            return ret;
> > > >        }
> > > > 
> > > > -    if (!want_zero) {
> > > > +    /*
> > > > +     * If want_zero is clear, then the caller wants speed over
> > > > +     * accuracy, and the only place where SEEK_DATA should be
> > > > +     * attempted is at the start of the file to learn if the file has
> > > > +     * any data at all (anywhere else, just blindly claim the entire
> > > > +     * file is data).
> > > > +     */
> > > > +    if (!want_zero && offset) {
> > > >            *pnum = bytes;
> > > >            *map = offset;
> > > >            *file = bs;
> > > Looks like a hack. So we have bdrv_co_is_zero_fast() which do pass want_zero=false to block-status. But in case of mirror, which want to check the whole disk, we actually want want_zero=true, and detect it by offset=0..
> > > 
> > > Isn't it better to add a kind of bdrv_is_zero_middle_speed() (which means, don't try to read the data to check, but be free to use suboptimal lseek call or something like this), which will pass want_zero=true, and use it from mirror? Mirror case differs from usage in qcow2 exactly by the fact that we call it only once.
> > Which is exactly why I wrote patch 4/6 turning the want_zero bool into
> > an enum so that we are being more explicit in WHY block status is
> > being requested.
> 
> Hmm, and this makes more strange that this hack for file-posix is kept after it. Don't we have other block drivers, where we should behave similarly in block_status for offset=0? Or I mean, could we just use different block-status modes in qcow2 and mirror, when call bdrv_co_is_zero_fast(), and don't handle offset==0 specially in file-posix?

I'll need to ajust this in v2 of my series.

The problem I'm trying to resolve: libvirt migration with
--migrate-disks-detect-zeroes is causing the destination to become
fully allocated if the destination does not use "discard":"unmap",
whereas with QEMU 9.0 it did not.  The change was commit d05ae948c,
where we fixed a problem with punching holes into the mirror
destination even when the user had requested the file to not shrink;
but it means that a file that is not allowed to shrink is now fully
allocated rather than left sparse even when leaving it sparse would
still be an accurate mirror result with less I/O.

And it turns out that when libvirt is driving the mirror into a raw
destination, the mirroring is done over NBD, rather than directly to a
file-posix destination.  Coupled with the oddity that 'qemu-img create
-f raw' ends up pre-allocating up to 64k of the image with all zero
contents to make alignment probing easier, it is no longer possible to
quickly see that the entire NBD image reads as zero, just as it was
not possible to quickly see that for file-posix.  So, in v2 of my
patch series, I think I need to hoist the logic that I added to
file-posix.c that reads an initial small sector in order to determine
if the entire image is zero to instead live in io.c to be used across
all protocols, NBD included.

So if you don't think offset=0 should be given any special treatment
in file-posix, but instead have generic code in io.c that checks if an
entire image has block_status of sparse and/or just a small initial
non-sparse section that can be manually checked for zero contents,
then that should still fix the issue of mirroring to a sparse
destination with "discard":"ignore" but keeping the destination
sparse.

Vladimir Sementsov-Ogievskiy April 15, 2025, 4:11 p.m. UTC | #5

On 15.04.25 18:22, Eric Blake wrote:
> On Tue, Apr 15, 2025 at 03:37:39PM +0300, Vladimir Sementsov-Ogievskiy wrote:
>> On 14.04.25 21:12, Eric Blake wrote:
>>> On Mon, Apr 14, 2025 at 08:05:21PM +0300, Vladimir Sementsov-Ogievskiy wrote:
>>>> On 11.04.25 04:04, Eric Blake wrote:
>>>>> The 'want_zero' parameter to raw_co_block_status() was added so that
>>>>> we can avoid potentially time-consuming lseek(SEEK_DATA) calls
>>>>> throughout the file (working around poor filesystems that have O(n)
>>>>> rather than O(1) extent probing).  But when it comes to learning if a
>>>>> file is completely sparse (for example, it was just created), always
>>>>> claiming that a file is all data without even checking offset 0 breaks
>>>>> what would otherwise be attempts at useful optimizations for a
>>>>> known-zero mirror destination.
>>>>>
>>>>> Note that this allows file-posix to report a file as completely zero
>>>>> if it was externally created (such as via 'truncate --size=$n file')
>>>>> as entirely sparse; however, it does NOT work for files created
>>>>> internally by blockdev-create.  That's because blockdev-create
>>>>> intentionally does a sequence of truncate(0), truncate(size),
>>>>> allocate_first_block(), in order to make it possible for gluster on
>>>>> XFS to probe the sector size for direct I/O (which doesn't work if the
>>>>> first block is sparse).  That will be addressed in a later patch.
>>>>>
>>>>> Signed-off-by: Eric Blake<eblake@redhat.com>
>>>>> ---
>>>>>     block/file-posix.c | 9 ++++++++-
>>>>>     1 file changed, 8 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/block/file-posix.c b/block/file-posix.c
>>>>> index 56d1972d156..67e83528cf5 100644
>>>>> --- a/block/file-posix.c
>>>>> +++ b/block/file-posix.c
>>>>> @@ -3217,7 +3217,14 @@ static int coroutine_fn raw_co_block_status(BlockDriverState *bs,
>>>>>             return ret;
>>>>>         }
>>>>>
>>>>> -    if (!want_zero) {
>>>>> +    /*
>>>>> +     * If want_zero is clear, then the caller wants speed over
>>>>> +     * accuracy, and the only place where SEEK_DATA should be
>>>>> +     * attempted is at the start of the file to learn if the file has
>>>>> +     * any data at all (anywhere else, just blindly claim the entire
>>>>> +     * file is data).
>>>>> +     */
>>>>> +    if (!want_zero && offset) {
>>>>>             *pnum = bytes;
>>>>>             *map = offset;
>>>>>             *file = bs;
>>>> Looks like a hack. So we have bdrv_co_is_zero_fast() which do pass want_zero=false to block-status. But in case of mirror, which want to check the whole disk, we actually want want_zero=true, and detect it by offset=0..
>>>>
>>>> Isn't it better to add a kind of bdrv_is_zero_middle_speed() (which means, don't try to read the data to check, but be free to use suboptimal lseek call or something like this), which will pass want_zero=true, and use it from mirror? Mirror case differs from usage in qcow2 exactly by the fact that we call it only once.
>>> Which is exactly why I wrote patch 4/6 turning the want_zero bool into
>>> an enum so that we are being more explicit in WHY block status is
>>> being requested.
>>
>> Hmm, and this makes more strange that this hack for file-posix is kept after it. Don't we have other block drivers, where we should behave similarly in block_status for offset=0? Or I mean, could we just use different block-status modes in qcow2 and mirror, when call bdrv_co_is_zero_fast(), and don't handle offset==0 specially in file-posix?
> 
> I'll need to ajust this in v2 of my series.
> 
> The problem I'm trying to resolve: libvirt migration with
> --migrate-disks-detect-zeroes is causing the destination to become
> fully allocated if the destination does not use "discard":"unmap",
> whereas with QEMU 9.0 it did not.  The change was commit d05ae948c,
> where we fixed a problem with punching holes into the mirror
> destination even when the user had requested the file to not shrink;
> but it means that a file that is not allowed to shrink is now fully
> allocated rather than left sparse even when leaving it sparse would
> still be an accurate mirror result with less I/O.
> 
> And it turns out that when libvirt is driving the mirror into a raw
> destination, the mirroring is done over NBD, rather than directly to a
> file-posix destination.  Coupled with the oddity that 'qemu-img create
> -f raw' ends up pre-allocating up to 64k of the image with all zero
> contents to make alignment probing easier, it is no longer possible to
> quickly see that the entire NBD image reads as zero, just as it was
> not possible to quickly see that for file-posix.  So, in v2 of my
> patch series, I think I need to hoist the logic that I added to
> file-posix.c that reads an initial small sector in order to determine
> if the entire image is zero to instead live in io.c to be used across
> all protocols, NBD included.
> 
> So if you don't think offset=0 should be given any special treatment
> in file-posix, but instead have generic code in io.c that checks if an
> entire image has block_status of sparse and/or just a small initial
> non-sparse section that can be manually checked for zero contents,
> then that should still fix the issue of mirroring to a sparse
> destination with "discard":"ignore" but keeping the destination
> sparse.
> 

Thanks for explanation! Yes I think making such specific checks about first sector in some kind of generic bdrv_is_all_zeroes() would be cleaner.

[2/6] file-posix: Allow lseek at offset 0 when !want_zero

Commit Message

Comments

Patch