diff mbox series

[2/2] file-posix: Cache next hole

Message ID 20210211172242.146671-3-mreitz@redhat.com (mailing list archive)
State New, archived
Headers show
Series file-posix: Cache next hole | expand

Commit Message

Max Reitz Feb. 11, 2021, 5:22 p.m. UTC
We have repeatedly received reports that SEEK_HOLE and SEEK_DATA are
slow on certain filesystems and/or under certain circumstances.  That is
why we generally try to avoid it (which is why bdrv_co_block_status()
has the @want_zero parameter, and which is why qcow2 has a metadata
preallocation detection, so we do not fall through to the protocol layer
to discover which blocks are zero, unless that is really necessary
(i.e., for metadata-preallocated images)).

In addition to those measures, we can also try to speed up zero
detection by letting file-posix cache some hole location information,
namely where the next hole after the most recently queried offset is.
This helps especially for images that are (nearly) fully allocated,
which is coincidentally also the case where querying for zero
information cannot gain us much.

Note that this of course only works so long as we have no concurrent
writers to the image, which is the case when the WRITE capability is not
shared.

Alternatively (or perhaps as an improvement in the future), we could let
file-posix keep track of what it knows is zero and what it knows is
non-zero with bitmaps, which would help images that actually have a
significant number of holes (where this implementation cannot do
much).  But for such images, SEEK_HOLE/DATA are generally faster (they
do not need to seek through the whole file), and the performance lost by
querying the block status does not feel as bad because it is outweighed
by the performance that can be saved by special-casing zeroed areas, so
focussing on images that are (nearly) fully allocated is more important.

Signed-off-by: Max Reitz <mreitz@redhat.com>
---
 block/file-posix.c | 81 +++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 80 insertions(+), 1 deletion(-)

Comments

Eric Blake Feb. 11, 2021, 8 p.m. UTC | #1
On 2/11/21 11:22 AM, Max Reitz wrote:
> [...] so
> focussing on images that are (nearly) fully allocated is more important.

focusing


>  static int find_allocation(BlockDriverState *bs, off_t start,
>                             off_t *data, off_t *hole)
>  {
> -#if defined SEEK_HOLE && defined SEEK_DATA
>      BDRVRawState *s = bs->opaque;
> +
> +    if (s->next_zero_offset_valid) {
> +        if (start >= s->next_zero_offset_from && start < s->next_zero_offset) {
> +            *data = start;
> +            *hole = s->next_zero_offset;
> +            return 0;
> +        }
> +    }
> +
> +#if defined SEEK_HOLE && defined SEEK_DATA

Why move the #if? If SEEK_HOLE is not defined, s->next_zero_offset_valid
should never be set, because we'll treat the entire image as data.  But
at the same time, it doesn't hurt, so doesn't stop my review.

Reviewed-by: Eric Blake <eblake@redhat.com>
Vladimir Sementsov-Ogievskiy Feb. 11, 2021, 8:38 p.m. UTC | #2
11.02.2021 20:22, Max Reitz wrote:
> [...]

I'll look at it tomorrow... Just wanted to note that something similar was proposed by Kevin some time ago:

<20190124141731.21509-1-kwolf@redhat.com>
https://lists.gnu.org/archive/html/qemu-devel/2019-01/msg06271.html

(o_0 two years ago.. time passes so fast)
Max Reitz Feb. 12, 2021, 9:14 a.m. UTC | #3
On 11.02.21 21:38, Vladimir Sementsov-Ogievskiy wrote:
> 11.02.2021 20:22, Max Reitz wrote:
>> [...]
> 
> I'll look at it tomorrow... Just wanted to note that something similar 
> was proposed by Kevin some time ago:
> 
> <20190124141731.21509-1-kwolf@redhat.com>
> https://lists.gnu.org/archive/html/qemu-devel/2019-01/msg06271.html

Interesting.  The reasoning that it doesn’t matter whether anyone writes 
to the assumed-data regions makes sense.

I can’t see a real reason why it was kind of forgotten, apparently...

Max
Max Reitz Feb. 12, 2021, 9:25 a.m. UTC | #4
On 11.02.21 21:00, Eric Blake wrote:
> On 2/11/21 11:22 AM, Max Reitz wrote:
>> [...] so
>> focussing on images that are (nearly) fully allocated is more important.
> 
> focusing

Wiktionary lists both as valid. *shrug*

>>   static int find_allocation(BlockDriverState *bs, off_t start,
>>                              off_t *data, off_t *hole)
>>   {
>> -#if defined SEEK_HOLE && defined SEEK_DATA
>>       BDRVRawState *s = bs->opaque;
>> +
>> +    if (s->next_zero_offset_valid) {
>> +        if (start >= s->next_zero_offset_from && start < s->next_zero_offset) {
>> +            *data = start;
>> +            *hole = s->next_zero_offset;
>> +            return 0;
>> +        }
>> +    }
>> +
>> +#if defined SEEK_HOLE && defined SEEK_DATA
> 
> Why move the #if? If SEEK_HOLE is not defined, s->next_zero_offset_valid
> should never be set, because we'll treat the entire image as data.  But
> at the same time, it doesn't hurt, so doesn't stop my review.

I found it better to put it outside the SEEK_HOLE/DATA section, because 
while it won’t work when neither are defined, the code is still valid.

I don’t know. :)

> Reviewed-by: Eric Blake <eblake@redhat.com>

Thanks!

Max
Kevin Wolf Feb. 12, 2021, 10:25 a.m. UTC | #5
Am 12.02.2021 um 10:14 hat Max Reitz geschrieben:
> On 11.02.21 21:38, Vladimir Sementsov-Ogievskiy wrote:
> > 11.02.2021 20:22, Max Reitz wrote:
> > > [...]
> > [...]
> 
> Interesting.  The reasoning that it doesn’t matter whether anyone
> writes to the assumed-data regions makes sense.
> 
> I can’t see a real reason why it was kind of forgotten, apparently...

After qcow2 stopped recursively querying the file-posix layer, the
relevant case under discussion was fixed anyway, so it didn't have the
highest priority any more...

I think the open question (and possibly work) in the old thread was
whether this should be moved out of file-posix into the generic block
layer.

With your patch, I guess the other open question is whether we want to
try and cache holes anyway. I assume that in the common case, you may
have many consecutive data extents, but probably rarely many holes (I
guess you can have more than one if some areas are unallocated and
others are allocated, but unwritten?) Then it's probably not worth
caching holes.

Kevin
Max Reitz Feb. 12, 2021, 12:11 p.m. UTC | #6
On 12.02.21 11:25, Kevin Wolf wrote:
> Am 12.02.2021 um 10:14 hat Max Reitz geschrieben:
>> On 11.02.21 21:38, Vladimir Sementsov-Ogievskiy wrote:
>>> 11.02.2021 20:22, Max Reitz wrote:
>>>> [...]
> [...]
> With your patch, I guess the other open question is whether we want to
> try and cache holes anyway. I assume that in the common case, you may
> have many consecutive data extents, but probably rarely many holes (I
> guess you can have more than one if some areas are unallocated and
> others are allocated, but unwritten?) Then it's probably not worth
> caching holes.

Yes, that’s what I think.  If you have many small holes, caching won’t 
help much anyway (at least this way, where we’d only cache information 
about a single hole).  So caching holes would mainly help when you have 
several large holes.  But in that case, finding them will surely speed 
up whatever it is the block-status caller is doing (e.g. mirroring), so 
a slow SEEK_HOLE/DATA will be tolerable.

Max
Patch

diff --git a/block/file-posix.c b/block/file-posix.c
index 05079b40ca..2ca0a2e05b 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -172,6 +172,11 @@  typedef struct BDRVRawState {
     } stats;
 
     PRManager *pr_mgr;
+
+    bool can_cache_next_zero_offset;
+    bool next_zero_offset_valid;
+    uint64_t next_zero_offset_from;
+    uint64_t next_zero_offset;
 } BDRVRawState;
 
 typedef struct BDRVRawReopenState {
@@ -2049,7 +2054,25 @@  static int coroutine_fn raw_co_pwritev(BlockDriverState *bs, uint64_t offset,
                                        uint64_t bytes, QEMUIOVector *qiov,
                                        int flags)
 {
+    BDRVRawState *s = bs->opaque;
+
     assert(flags == 0);
+
+    /*
+     * If offset is just above s->next_zero_offset, the hole that was
+     * reportedly there might be removed from the file (because only
+     * whole filesystem clusters can be zeroed).  But that does not
+     * matter, because block-status does not care about whether there
+     * actually is a hole, but just about whether there are zeroes
+     * there - and this write will not make those zeroes non-zero.
+     */
+    if (s->next_zero_offset_valid &&
+        offset <= s->next_zero_offset &&
+        offset + bytes > s->next_zero_offset)
+    {
+        s->next_zero_offset_valid = false;
+    }
+
     return raw_co_prw(bs, offset, bytes, qiov, QEMU_AIO_WRITE);
 }
 
@@ -2183,6 +2206,10 @@  static int coroutine_fn raw_co_truncate(BlockDriverState *bs, int64_t offset,
     struct stat st;
     int ret;
 
+    if (s->next_zero_offset_valid && offset < s->next_zero_offset) {
+        s->next_zero_offset_valid = false;
+    }
+
     if (fstat(s->fd, &st)) {
         ret = -errno;
         error_setg_errno(errp, -ret, "Failed to fstat() the file");
@@ -2616,8 +2643,17 @@  static int coroutine_fn raw_co_delete_file(BlockDriverState *bs,
 static int find_allocation(BlockDriverState *bs, off_t start,
                            off_t *data, off_t *hole)
 {
-#if defined SEEK_HOLE && defined SEEK_DATA
     BDRVRawState *s = bs->opaque;
+
+    if (s->next_zero_offset_valid) {
+        if (start >= s->next_zero_offset_from && start < s->next_zero_offset) {
+            *data = start;
+            *hole = s->next_zero_offset;
+            return 0;
+        }
+    }
+
+#if defined SEEK_HOLE && defined SEEK_DATA
     off_t offs;
 
     /*
@@ -2716,6 +2752,7 @@  static int coroutine_fn raw_co_block_status(BlockDriverState *bs,
                                             int64_t *map,
                                             BlockDriverState **file)
 {
+    BDRVRawState *s = bs->opaque;
     off_t data = 0, hole = 0;
     int ret;
 
@@ -2734,6 +2771,7 @@  static int coroutine_fn raw_co_block_status(BlockDriverState *bs,
     }
 
     ret = find_allocation(bs, offset, &data, &hole);
+    s->next_zero_offset_valid = false;
     if (ret == -ENXIO) {
         /* Trailing hole */
         *pnum = bytes;
@@ -2761,6 +2799,12 @@  static int coroutine_fn raw_co_block_status(BlockDriverState *bs,
         }
 
         ret = BDRV_BLOCK_DATA;
+
+        if (s->can_cache_next_zero_offset) {
+            s->next_zero_offset_valid = true;
+            s->next_zero_offset_from = offset;
+            s->next_zero_offset = hole;
+        }
     } else {
         /* On a hole, compute bytes to the beginning of the next extent.  */
         assert(hole == offset);
@@ -2910,6 +2954,13 @@  raw_do_pdiscard(BlockDriverState *bs, int64_t offset, int bytes, bool blkdev)
     RawPosixAIOData acb;
     int ret;
 
+    if (s->next_zero_offset_valid &&
+        offset <= s->next_zero_offset &&
+        offset + bytes > s->next_zero_offset_from)
+    {
+        s->next_zero_offset_valid = false;
+    }
+
     acb = (RawPosixAIOData) {
         .bs             = bs,
         .aio_fildes     = s->fd,
@@ -2941,6 +2992,17 @@  raw_do_pwrite_zeroes(BlockDriverState *bs, int64_t offset, int bytes,
     RawPosixAIOData acb;
     ThreadPoolFunc *handler;
 
+    if (s->next_zero_offset_valid &&
+        offset < s->next_zero_offset &&
+        offset + bytes > s->next_zero_offset_from)
+    {
+        if (offset > s->next_zero_offset_from) {
+            s->next_zero_offset = offset;
+        } else {
+            s->next_zero_offset_valid = false;
+        }
+    }
+
 #ifdef CONFIG_FALLOCATE
     if (offset + bytes > bs->total_sectors * BDRV_SECTOR_SIZE) {
         BdrvTrackedRequest *req;
@@ -3155,6 +3217,15 @@  static void raw_set_perm(BlockDriverState *bs, uint64_t perm, uint64_t shared)
     raw_handle_perm_lock(bs, RAW_PL_COMMIT, perm, shared, NULL);
     s->perm = perm;
     s->shared_perm = shared;
+
+    /*
+     * We can only cache anything if there are no external writers on
+     * the image.
+     */
+    s->can_cache_next_zero_offset = !(shared & BLK_PERM_WRITE);
+    if (!s->can_cache_next_zero_offset) {
+        s->next_zero_offset_valid = false;
+    }
 }
 
 static void raw_abort_perm_update(BlockDriverState *bs)
@@ -3203,6 +3274,14 @@  static int coroutine_fn raw_co_copy_range_to(BlockDriverState *bs,
         return -EIO;
     }
 
+    /* Same as in raw_co_pwritev() */
+    if (s->next_zero_offset_valid &&
+        dst_offset <= s->next_zero_offset &&
+        dst_offset + bytes > s->next_zero_offset_from)
+    {
+        s->next_zero_offset_valid = false;
+    }
+
     acb = (RawPosixAIOData) {
         .bs             = bs,
         .aio_type       = QEMU_AIO_COPY_RANGE,