
[2/6] block: block-status cache for data regions

Message ID 20210617155247.442150-3-mreitz@redhat.com (mailing list archive)
State New, archived
Series block: block-status cache for data regions

Commit Message

Max Reitz June 17, 2021, 3:52 p.m. UTC
As we have attempted before
(https://lists.gnu.org/archive/html/qemu-devel/2019-01/msg06451.html,
"file-posix: Cache lseek result for data regions";
https://lists.nongnu.org/archive/html/qemu-block/2021-02/msg00934.html,
"file-posix: Cache next hole"), this patch seeks to reduce the number of
SEEK_DATA/HOLE operations the file-posix driver has to perform.  The
main difference is that this time it is implemented as part of the
general block layer code.

The problem we face is that on some filesystems or in some
circumstances, SEEK_DATA/HOLE is unreasonably slow.  Given the
implementation is outside of qemu, there is little we can do about its
performance.

We have already introduced the want_zero parameter to
bdrv_co_block_status() to reduce the number of SEEK_DATA/HOLE calls
unless we really want zero information; but sometimes we do want that
information, because for files that consist largely of zero areas,
special-casing those areas can give large performance boosts.  So the
real problem is with files that consist largely of data, so that
inquiring the block status does not gain us much performance, but where
such an inquiry itself takes a lot of time.

To address this, we want to cache data regions.  Most of the time, when
bad performance is reported, it is in places where the image is iterated
over from start to end (qemu-img convert or the mirror job), so a simple
yet effective solution is to cache only the current data region.

(Note that only caching data regions but not zero regions means that
returning false information from the cache is not catastrophic: Treating
zeroes as data is fine.  While we try to invalidate the cache on zero
writes and discards, such incongruences may still occur when there are
other processes writing to the image.)

We only use the cache for nodes without children (i.e. protocol nodes),
because that is where the problem is: Drivers that rely on block-status
implementations outside of qemu (e.g. SEEK_DATA/HOLE).

Resolves: https://gitlab.com/qemu-project/qemu/-/issues/307
Signed-off-by: Max Reitz <mreitz@redhat.com>
---
 include/block/block_int.h | 19 ++++++++++
 block.c                   |  2 +
 block/io.c                | 80 +++++++++++++++++++++++++++++++++++++--
 3 files changed, 98 insertions(+), 3 deletions(-)

Comments

Eric Blake June 18, 2021, 6:51 p.m. UTC | #1
On Thu, Jun 17, 2021 at 05:52:43PM +0200, Max Reitz wrote:
> 
> To address this, we want to cache data regions.  Most of the time, when
> bad performance is reported, it is in places where the image is iterated
> over from start to end (qemu-img convert or the mirror job), so a simple
> yet effective solution is to cache only the current data region.

Here's hoping third time's the charm!

> 
> (Note that only caching data regions but not zero regions means that
> returning false information from the cache is not catastrophic: Treating
> zeroes as data is fine.  While we try to invalidate the cache on zero
> writes and discards, such incongruences may still occur when there are
> other processes writing to the image.)
> 
> We only use the cache for nodes without children (i.e. protocol nodes),
> because that is where the problem is: Drivers that rely on block-status
> implementations outside of qemu (e.g. SEEK_DATA/HOLE).
> 
> Resolves: https://gitlab.com/qemu-project/qemu/-/issues/307
> Signed-off-by: Max Reitz <mreitz@redhat.com>
> ---
>  include/block/block_int.h | 19 ++++++++++
>  block.c                   |  2 +
>  block/io.c                | 80 +++++++++++++++++++++++++++++++++++++--
>  3 files changed, 98 insertions(+), 3 deletions(-)
> 
> diff --git a/include/block/block_int.h b/include/block/block_int.h
> index a8f9598102..c09512556a 100644
> --- a/include/block/block_int.h
> +++ b/include/block/block_int.h
> @@ -832,6 +832,23 @@ struct BdrvChild {
>      QLIST_ENTRY(BdrvChild) next_parent;
>  };
>  
> +/*
> + * Allows bdrv_co_block_status() to cache one data region for a
> + * protocol node.
> + *
> + * @lock: Lock for accessing this object's fields
> + * @valid: Whether the cache is valid
> + * @data_start: Offset where we know (or strongly assume) is data
> + * @data_end: Offset where the data region ends (which is not necessarily
> + *            the start of a zeroed region)
> + */
> +typedef struct BdrvBlockStatusCache {
> +    CoMutex lock;
> +    bool valid;
> +    int64_t data_start;
> +    int64_t data_end;
> +} BdrvBlockStatusCache;

Looks like the right bits of information, and I'm glad you documented
the need to be prepared for protocols that report split data sections
rather than consolidated.

> +++ b/block/io.c
> @@ -35,6 +35,7 @@
>  #include "qapi/error.h"
>  #include "qemu/error-report.h"
>  #include "qemu/main-loop.h"
> +#include "qemu/range.h"
>  #include "sysemu/replay.h"
>  
>  /* Maximum bounce buffer for copy-on-read and write zeroes, in bytes */
> @@ -1862,6 +1863,7 @@ static int coroutine_fn bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
>      bool need_flush = false;
>      int head = 0;
>      int tail = 0;
> +    BdrvBlockStatusCache *bsc = &bs->block_status_cache;
>  
>      int max_write_zeroes = MIN_NON_ZERO(bs->bl.max_pwrite_zeroes, INT_MAX);
>      int alignment = MAX(bs->bl.pwrite_zeroes_alignment,
> @@ -1878,6 +1880,16 @@ static int coroutine_fn bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
>          return -ENOTSUP;
>      }
>  
> +    /* Invalidate the cached block-status data range if this write overlaps */
> +    qemu_co_mutex_lock(&bsc->lock);

Are we going to be suffering from a lot of lock contention performance
degradation?  Is there a way to take advantage of RCU access patterns
for any more performance without sacrificing correctness?

> +    if (bsc->valid &&
> +        ranges_overlap(offset, bytes, bsc->data_start,
> +                       bsc->data_end - bsc->data_start))
> +    {
> +        bsc->valid = false;
> +    }

Do we want to invalidate the entire bsc, or can we be smart and leave
the prefix intact (if offset > bsc->data_start, then set bsc->data_end
to offset)?
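
Just to illustrate what I mean (untested sketch, reusing the variables from
the patch hunk above; whether keeping the prefix is worth the extra logic is
up to you):

if (bsc->valid &&
    ranges_overlap(offset, bytes, bsc->data_start,
                   bsc->data_end - bsc->data_start))
{
    if (offset > bsc->data_start) {
        /* Keep the prefix [data_start, offset) that is still known data */
        bsc->data_end = offset;
    } else {
        /* The write starts at or before data_start; drop the cache */
        bsc->valid = false;
    }
}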

> +    qemu_co_mutex_unlock(&bsc->lock);

Worth using WITH_QEMU_LOCK_GUARD?
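
Something like this, perhaps (untested; I believe CoMutex is covered by
QemuLockable, so the guard should also work in coroutine context):

WITH_QEMU_LOCK_GUARD(&bsc->lock) {
    if (bsc->valid &&
        ranges_overlap(offset, bytes, bsc->data_start,
                       bsc->data_end - bsc->data_start))
    {
        bsc->valid = false;
    }
}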

> +
>      assert(alignment % bs->bl.request_alignment == 0);
>      head = offset % alignment;
>      tail = (offset + bytes) % alignment;
> @@ -2394,6 +2406,7 @@ static int coroutine_fn bdrv_co_block_status(BlockDriverState *bs,
>      int64_t aligned_offset, aligned_bytes;
>      uint32_t align;
>      bool has_filtered_child;
> +    BdrvBlockStatusCache *bsc = &bs->block_status_cache;
>  
>      assert(pnum);
>      *pnum = 0;
> @@ -2442,9 +2455,59 @@ static int coroutine_fn bdrv_co_block_status(BlockDriverState *bs,
>      aligned_bytes = ROUND_UP(offset + bytes, align) - aligned_offset;
>  
>      if (bs->drv->bdrv_co_block_status) {
> -        ret = bs->drv->bdrv_co_block_status(bs, want_zero, aligned_offset,
> -                                            aligned_bytes, pnum, &local_map,
> -                                            &local_file);
> +        bool from_cache = false;
> +
> +        /*
> +         * Use the block-status cache only for protocol nodes: Format
> +         * drivers are generally quick to inquire the status, but protocol
> +         * drivers often need to get information from outside of qemu, so
> +         * we do not have control over the actual implementation.  There
> +         * have been cases where inquiring the status took an unreasonably
> +         * long time, and we can do nothing in qemu to fix it.
> +         * This is especially problematic for images with large data areas,
> +         * because finding the few holes in them and giving them special
> +         * treatment does not gain much performance.  Therefore, we try to
> +         * cache the last-identified data region.
> +         *
> +         * Second, limiting ourselves to protocol nodes allows us to assume
> +         * the block status for data regions to be DATA | OFFSET_VALID, and
> +         * that the host offset is the same as the guest offset.
> +         *
> +         * Note that it is possible that external writers zero parts of
> +         * the cached regions without the cache being invalidated, and so
> +         * we may report zeroes as data.  This is not catastrophic,
> +         * however, because reporting zeroes as data is fine.
> +         */

Useful comment.

> +        if (QLIST_EMPTY(&bs->children)) {
> +            qemu_co_mutex_lock(&bsc->lock);
> +            if (bsc->valid &&
> +                aligned_offset >= bsc->data_start &&
> +                aligned_offset < bsc->data_end)
> +            {
> +                ret = BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID;
> +                local_file = bs;
> +                local_map = aligned_offset;
> +
> +                *pnum = bsc->data_end - aligned_offset;
> +
> +                from_cache = true;
> +            }
> +            qemu_co_mutex_unlock(&bsc->lock);
> +        }
> +
> +        if (!from_cache) {
> +            ret = bs->drv->bdrv_co_block_status(bs, want_zero, aligned_offset,
> +                                                aligned_bytes, pnum, &local_map,
> +                                                &local_file);
> +
> +            if (ret == (BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID)) {
> +                qemu_co_mutex_lock(&bsc->lock);
> +                bsc->data_start = aligned_offset;
> +                bsc->data_end = aligned_offset + *pnum;
> +                bsc->valid = true;
> +                qemu_co_mutex_unlock(&bsc->lock);
> +            }
> +        }

Looks correct.

>      } else {
>          /* Default code for filters */
>  
> @@ -2974,6 +3037,7 @@ int coroutine_fn bdrv_co_pdiscard(BdrvChild *child, int64_t offset,
>      int max_pdiscard, ret;
>      int head, tail, align;
>      BlockDriverState *bs = child->bs;
> +    BdrvBlockStatusCache *bsc = &bs->block_status_cache;
>  
>      if (!bs || !bs->drv || !bdrv_is_inserted(bs)) {
>          return -ENOMEDIUM;
> @@ -2997,6 +3061,16 @@ int coroutine_fn bdrv_co_pdiscard(BdrvChild *child, int64_t offset,
>          return 0;
>      }
>  
> +    /* Invalidate the cached block-status data range if this discard overlaps */
> +    qemu_co_mutex_lock(&bsc->lock);
> +    if (bsc->valid &&
> +        ranges_overlap(offset, bytes, bsc->data_start,
> +                       bsc->data_end - bsc->data_start))
> +    {
> +        bsc->valid = false;
> +    }

Same question as above about possibly shortening the cached range
in-place rather than discarding it altogether.

> +    qemu_co_mutex_unlock(&bsc->lock);
> +
>      /* Discard is advisory, but some devices track and coalesce
>       * unaligned requests, so we must pass everything down rather than
>       * round here.  Still, most devices will just silently ignore
> -- 
> 2.31.1
> 
>
Vladimir Sementsov-Ogievskiy June 19, 2021, 10:20 a.m. UTC | #2
17.06.2021 18:52, Max Reitz wrote:
> As we have attempted before
> (https://lists.gnu.org/archive/html/qemu-devel/2019-01/msg06451.html,
> "file-posix: Cache lseek result for data regions";
> https://lists.nongnu.org/archive/html/qemu-block/2021-02/msg00934.html,
> "file-posix: Cache next hole"), this patch seeks to reduce the number of
> SEEK_DATA/HOLE operations the file-posix driver has to perform.  The
> main difference is that this time it is implemented as part of the
> general block layer code.
> 
> The problem we face is that on some filesystems or in some
> circumstances, SEEK_DATA/HOLE is unreasonably slow.  Given the
> implementation is outside of qemu, there is little we can do about its
> performance.
> 
> We have already introduced the want_zero parameter to
> bdrv_co_block_status() to reduce the number of SEEK_DATA/HOLE calls
> unless we really want zero information; but sometimes we do want that
> information, because for files that consist largely of zero areas,
> special-casing those areas can give large performance boosts.  So the
> real problem is with files that consist largely of data, so that
> inquiring the block status does not gain us much performance, but where
> such an inquiry itself takes a lot of time.
> 
> To address this, we want to cache data regions.  Most of the time, when
> bad performance is reported, it is in places where the image is iterated
> over from start to end (qemu-img convert or the mirror job), so a simple
> yet effective solution is to cache only the current data region.
> 
> (Note that only caching data regions but not zero regions means that
> returning false information from the cache is not catastrophic: Treating
> zeroes as data is fine.  While we try to invalidate the cache on zero
> writes and discards, such incongruences may still occur when there are
> other processes writing to the image.)
> 
> We only use the cache for nodes without children (i.e. protocol nodes),
> because that is where the problem is: Drivers that rely on block-status
> implementations outside of qemu (e.g. SEEK_DATA/HOLE).
> 
> Resolves: https://gitlab.com/qemu-project/qemu/-/issues/307
> Signed-off-by: Max Reitz <mreitz@redhat.com>
> ---
>   include/block/block_int.h | 19 ++++++++++
>   block.c                   |  2 +
>   block/io.c                | 80 +++++++++++++++++++++++++++++++++++++--
>   3 files changed, 98 insertions(+), 3 deletions(-)
> 
> diff --git a/include/block/block_int.h b/include/block/block_int.h
> index a8f9598102..c09512556a 100644
> --- a/include/block/block_int.h
> +++ b/include/block/block_int.h
> @@ -832,6 +832,23 @@ struct BdrvChild {
>       QLIST_ENTRY(BdrvChild) next_parent;
>   };
>   
> +/*
> + * Allows bdrv_co_block_status() to cache one data region for a
> + * protocol node.
> + *
> + * @lock: Lock for accessing this object's fields
> + * @valid: Whether the cache is valid
> + * @data_start: Offset where we know (or strongly assume) is data
> + * @data_end: Offset where the data region ends (which is not necessarily
> + *            the start of a zeroed region)
> + */
> +typedef struct BdrvBlockStatusCache {
> +    CoMutex lock;
> +    bool valid;
> +    int64_t data_start;
> +    int64_t data_end;
> +} BdrvBlockStatusCache;
> +
>   struct BlockDriverState {
>       /* Protected by big QEMU lock or read-only after opening.  No special
>        * locking needed during I/O...
> @@ -997,6 +1014,8 @@ struct BlockDriverState {
>   
>       /* BdrvChild links to this node may never be frozen */
>       bool never_freeze;
> +
> +    BdrvBlockStatusCache block_status_cache;
>   };
>   
>   struct BlockBackendRootState {
> diff --git a/block.c b/block.c
> index 3f456892d0..bad64d5c4f 100644
> --- a/block.c
> +++ b/block.c
> @@ -398,6 +398,8 @@ BlockDriverState *bdrv_new(void)
>   
>       qemu_co_queue_init(&bs->flush_queue);
>   
> +    qemu_co_mutex_init(&bs->block_status_cache.lock);
> +
>       for (i = 0; i < bdrv_drain_all_count; i++) {
>           bdrv_drained_begin(bs);
>       }
> diff --git a/block/io.c b/block/io.c
> index 323854d063..320638cc48 100644
> --- a/block/io.c
> +++ b/block/io.c
> @@ -35,6 +35,7 @@
>   #include "qapi/error.h"
>   #include "qemu/error-report.h"
>   #include "qemu/main-loop.h"
> +#include "qemu/range.h"
>   #include "sysemu/replay.h"
>   
>   /* Maximum bounce buffer for copy-on-read and write zeroes, in bytes */
> @@ -1862,6 +1863,7 @@ static int coroutine_fn bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
>       bool need_flush = false;
>       int head = 0;
>       int tail = 0;
> +    BdrvBlockStatusCache *bsc = &bs->block_status_cache;
>   
>       int max_write_zeroes = MIN_NON_ZERO(bs->bl.max_pwrite_zeroes, INT_MAX);
>       int alignment = MAX(bs->bl.pwrite_zeroes_alignment,
> @@ -1878,6 +1880,16 @@ static int coroutine_fn bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
>           return -ENOTSUP;
>       }
>   
> +    /* Invalidate the cached block-status data range if this write overlaps */
> +    qemu_co_mutex_lock(&bsc->lock);
> +    if (bsc->valid &&
> +        ranges_overlap(offset, bytes, bsc->data_start,
> +                       bsc->data_end - bsc->data_start))
> +    {
> +        bsc->valid = false;
> +    }
> +    qemu_co_mutex_unlock(&bsc->lock);

I think this fragment, used twice, is worth a small function, like

block_status_cache_discard_region().

That would be clean and avoid duplication.
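
Something along these lines, maybe (untested; the exact signature is just a
suggestion):

/*
 * Invalidate the block-status cache of @bs if the range
 * [offset, offset + bytes) overlaps the cached data region.
 */
static void coroutine_fn
block_status_cache_discard_region(BlockDriverState *bs,
                                  int64_t offset, int64_t bytes)
{
    BdrvBlockStatusCache *bsc = &bs->block_status_cache;

    qemu_co_mutex_lock(&bsc->lock);
    if (bsc->valid &&
        ranges_overlap(offset, bytes, bsc->data_start,
                       bsc->data_end - bsc->data_start))
    {
        bsc->valid = false;
    }
    qemu_co_mutex_unlock(&bsc->lock);
}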

> +
>       assert(alignment % bs->bl.request_alignment == 0);
>       head = offset % alignment;
>       tail = (offset + bytes) % alignment;
> @@ -2394,6 +2406,7 @@ static int coroutine_fn bdrv_co_block_status(BlockDriverState *bs,
>       int64_t aligned_offset, aligned_bytes;
>       uint32_t align;
>       bool has_filtered_child;
> +    BdrvBlockStatusCache *bsc = &bs->block_status_cache;
>   
>       assert(pnum);
>       *pnum = 0;
> @@ -2442,9 +2455,59 @@ static int coroutine_fn bdrv_co_block_status(BlockDriverState *bs,
>       aligned_bytes = ROUND_UP(offset + bytes, align) - aligned_offset;
>   
>       if (bs->drv->bdrv_co_block_status) {
> -        ret = bs->drv->bdrv_co_block_status(bs, want_zero, aligned_offset,
> -                                            aligned_bytes, pnum, &local_map,
> -                                            &local_file);
> +        bool from_cache = false;
> +
> +        /*
> +         * Use the block-status cache only for protocol nodes: Format
> +         * drivers are generally quick to inquire the status, but protocol
> +         * drivers often need to get information from outside of qemu, so
> +         * we do not have control over the actual implementation.  There
> +         * have been cases where inquiring the status took an unreasonably
> +         * long time, and we can do nothing in qemu to fix it.
> +         * This is especially problematic for images with large data areas,
> +         * because finding the few holes in them and giving them special
> +         * treatment does not gain much performance.  Therefore, we try to
> +         * cache the last-identified data region.
> +         *
> +         * Second, limiting ourselves to protocol nodes allows us to assume
> +         * the block status for data regions to be DATA | OFFSET_VALID, and
> +         * that the host offset is the same as the guest offset.
> +         *
> +         * Note that it is possible that external writers zero parts of
> +         * the cached regions without the cache being invalidated, and so
> +         * we may report zeroes as data.  This is not catastrophic,
> +         * however, because reporting zeroes as data is fine.
> +         */
> +        if (QLIST_EMPTY(&bs->children)) {
> +            qemu_co_mutex_lock(&bsc->lock);
> +            if (bsc->valid &&
> +                aligned_offset >= bsc->data_start &&
> +                aligned_offset < bsc->data_end)
> +            {
> +                ret = BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID;
> +                local_file = bs;
> +                local_map = aligned_offset;
> +
> +                *pnum = bsc->data_end - aligned_offset;
> +
> +                from_cache = true;
> +            }
> +            qemu_co_mutex_unlock(&bsc->lock);
> +        }
> +
> +        if (!from_cache) {
> +            ret = bs->drv->bdrv_co_block_status(bs, want_zero, aligned_offset,
> +                                                aligned_bytes, pnum, &local_map,
> +                                                &local_file);
> +
> +            if (ret == (BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID)) {

You do cache the data for any driver that returns DATA | OFFSET_VALID. But it's unused for format drivers (which may return OFFSET_VALID of course).

I think here it should be:

if ((ret == (BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID)) && QLIST_EMPTY(&bs->children)) {

> +                qemu_co_mutex_lock(&bsc->lock);
> +                bsc->data_start = aligned_offset;
> +                bsc->data_end = aligned_offset + *pnum;
> +                bsc->valid = true;
> +                qemu_co_mutex_unlock(&bsc->lock);
> +            }
> +        }


Hmm, also, the new additions may be worth separate small functions too, so that the new object gets a clean API:

bsc_discard()
bsc_get()
bsc_set()

or something like this. I don't insist.
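
For example (untested sketch; bsc_discard() would be the invalidation helper
from above, the other two could look roughly like this, signatures are just a
guess):

/* Return true (and set *pnum) if @offset is inside the cached data region */
static bool coroutine_fn bsc_get(BlockDriverState *bs, int64_t offset,
                                 int64_t *pnum)
{
    BdrvBlockStatusCache *bsc = &bs->block_status_cache;
    bool hit;

    qemu_co_mutex_lock(&bsc->lock);
    hit = bsc->valid &&
          offset >= bsc->data_start && offset < bsc->data_end;
    if (hit) {
        *pnum = bsc->data_end - offset;
    }
    qemu_co_mutex_unlock(&bsc->lock);

    return hit;
}

/* Remember [offset, offset + bytes) as the current data region */
static void coroutine_fn bsc_set(BlockDriverState *bs, int64_t offset,
                                 int64_t bytes)
{
    BdrvBlockStatusCache *bsc = &bs->block_status_cache;

    qemu_co_mutex_lock(&bsc->lock);
    bsc->data_start = offset;
    bsc->data_end = offset + bytes;
    bsc->valid = true;
    qemu_co_mutex_unlock(&bsc->lock);
}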

>       } else {
>           /* Default code for filters */
>   
> @@ -2974,6 +3037,7 @@ int coroutine_fn bdrv_co_pdiscard(BdrvChild *child, int64_t offset,
>       int max_pdiscard, ret;
>       int head, tail, align;
>       BlockDriverState *bs = child->bs;
> +    BdrvBlockStatusCache *bsc = &bs->block_status_cache;
>   
>       if (!bs || !bs->drv || !bdrv_is_inserted(bs)) {
>           return -ENOMEDIUM;
> @@ -2997,6 +3061,16 @@ int coroutine_fn bdrv_co_pdiscard(BdrvChild *child, int64_t offset,
>           return 0;
>       }
>   
> +    /* Invalidate the cached block-status data range if this discard overlaps */
> +    qemu_co_mutex_lock(&bsc->lock);
> +    if (bsc->valid &&
> +        ranges_overlap(offset, bytes, bsc->data_start,
> +                       bsc->data_end - bsc->data_start))
> +    {
> +        bsc->valid = false;
> +    }
> +    qemu_co_mutex_unlock(&bsc->lock);
> +
>       /* Discard is advisory, but some devices track and coalesce
>        * unaligned requests, so we must pass everything down rather than
>        * round here.  Still, most devices will just silently ignore
> 

Overall, seems good to me.
Max Reitz June 21, 2021, 9:37 a.m. UTC | #3
On 18.06.21 20:51, Eric Blake wrote:
> On Thu, Jun 17, 2021 at 05:52:43PM +0200, Max Reitz wrote:
>> To address this, we want to cache data regions.  Most of the time, when
>> bad performance is reported, it is in places where the image is iterated
>> over from start to end (qemu-img convert or the mirror job), so a simple
>> yet effective solution is to cache only the current data region.
> Here's hoping third time's the charm!

Indeed :)

>> (Note that only caching data regions but not zero regions means that
>> returning false information from the cache is not catastrophic: Treating
>> zeroes as data is fine.  While we try to invalidate the cache on zero
>> writes and discards, such incongruences may still occur when there are
>> other processes writing to the image.)
>>
>> We only use the cache for nodes without children (i.e. protocol nodes),
>> because that is where the problem is: Drivers that rely on block-status
>> implementations outside of qemu (e.g. SEEK_DATA/HOLE).
>>
>> Resolves: https://gitlab.com/qemu-project/qemu/-/issues/307
>> Signed-off-by: Max Reitz <mreitz@redhat.com>
>> ---
>>   include/block/block_int.h | 19 ++++++++++
>>   block.c                   |  2 +
>>   block/io.c                | 80 +++++++++++++++++++++++++++++++++++++--
>>   3 files changed, 98 insertions(+), 3 deletions(-)
>>
>> diff --git a/include/block/block_int.h b/include/block/block_int.h
>> index a8f9598102..c09512556a 100644
>> --- a/include/block/block_int.h
>> +++ b/include/block/block_int.h
>> @@ -832,6 +832,23 @@ struct BdrvChild {
>>       QLIST_ENTRY(BdrvChild) next_parent;
>>   };
>>   
>> +/*
>> + * Allows bdrv_co_block_status() to cache one data region for a
>> + * protocol node.
>> + *
>> + * @lock: Lock for accessing this object's fields
>> + * @valid: Whether the cache is valid
>> + * @data_start: Offset where we know (or strongly assume) is data
>> + * @data_end: Offset where the data region ends (which is not necessarily
>> + *            the start of a zeroed region)
>> + */
>> +typedef struct BdrvBlockStatusCache {
>> +    CoMutex lock;
>> +    bool valid;
>> +    int64_t data_start;
>> +    int64_t data_end;
>> +} BdrvBlockStatusCache;
> Looks like the right bits of information, and I'm glad you documented
> the need to be prepared for protocols that report split data sections
> rather than consolidated.
>
>> +++ b/block/io.c
>> @@ -35,6 +35,7 @@
>>   #include "qapi/error.h"
>>   #include "qemu/error-report.h"
>>   #include "qemu/main-loop.h"
>> +#include "qemu/range.h"
>>   #include "sysemu/replay.h"
>>   
>>   /* Maximum bounce buffer for copy-on-read and write zeroes, in bytes */
>> @@ -1862,6 +1863,7 @@ static int coroutine_fn bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
>>       bool need_flush = false;
>>       int head = 0;
>>       int tail = 0;
>> +    BdrvBlockStatusCache *bsc = &bs->block_status_cache;
>>   
>>       int max_write_zeroes = MIN_NON_ZERO(bs->bl.max_pwrite_zeroes, INT_MAX);
>>       int alignment = MAX(bs->bl.pwrite_zeroes_alignment,
>> @@ -1878,6 +1880,16 @@ static int coroutine_fn bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
>>           return -ENOTSUP;
>>       }
>>   
>> +    /* Invalidate the cached block-status data range if this write overlaps */
>> +    qemu_co_mutex_lock(&bsc->lock);
> Are we going to be suffering from a lot of lock contention performance
> degradation?  Is there a way to take advantage of RCU access patterns
> for any more performance without sacrificing correctness?

The critical section is so short that I considered it fine.  I wanted to 
use RW locks, but then realized that every RW lock operation is 
internally locked by another mutex, so it wouldn’t gain anything.

I’m not sure whether RCU is worth it here.

We could try something very crude, namely to just not take a lock and 
make `valid` an atomic.  After all, it doesn’t really matter whether 
`data_start` and `data_end` are consistent values, and resetting `valid` 
to false is always safe.

The worst that could happen is that a concurrent block-status call tries 
to set up an overlapping data area, which we thus fail to recognize 
here.  But if such a thing were to happen, it could just as well happen 
before said concurrent call took any lock on `bsc`.
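
So, roughly (completely untested; only `valid` would be accessed with
qatomic_*, data_start/data_end stay plain fields and may be read while
momentarily inconsistent, which per the argument above is tolerable):

/* Invalidation on an overlapping write-zeroes/discard */
if (qatomic_read(&bsc->valid) &&
    ranges_overlap(offset, bytes, bsc->data_start,
                   bsc->data_end - bsc->data_start))
{
    qatomic_set(&bsc->valid, false);
}

/* Filling the cache after a successful block-status inquiry */
bsc->data_start = aligned_offset;
bsc->data_end = aligned_offset + *pnum;
qatomic_set(&bsc->valid, true);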

>> +    if (bsc->valid &&
>> +        ranges_overlap(offset, bytes, bsc->data_start,
>> +                       bsc->data_end - bsc->data_start))
>> +    {
>> +        bsc->valid = false;
>> +    }
> Do we want to invalidate the entire bsc, or can we be smart and leave
> the prefix intact (if offset > bsc->data_start, then set bsc->data_end
> to offset)?

Perhaps we could be smart, but I don’t think it really makes a 
difference in practice, so I think keeping it simple is better.

>> +    qemu_co_mutex_unlock(&bsc->lock);
> Worth using WITH_QEMU_LOCK_GUARD?

I knew I forgot something, right.  Will use!

Max
Max Reitz June 21, 2021, 10:05 a.m. UTC | #4
On 19.06.21 12:20, Vladimir Sementsov-Ogievskiy wrote:
> 17.06.2021 18:52, Max Reitz wrote:
>> As we have attempted before
>> (https://lists.gnu.org/archive/html/qemu-devel/2019-01/msg06451.html,
>> "file-posix: Cache lseek result for data regions";
>> https://lists.nongnu.org/archive/html/qemu-block/2021-02/msg00934.html,
>> "file-posix: Cache next hole"), this patch seeks to reduce the number of
>> SEEK_DATA/HOLE operations the file-posix driver has to perform. The
>> main difference is that this time it is implemented as part of the
>> general block layer code.
>>
>> The problem we face is that on some filesystems or in some
>> circumstances, SEEK_DATA/HOLE is unreasonably slow.  Given the
>> implementation is outside of qemu, there is little we can do about its
>> performance.
>>
>> We have already introduced the want_zero parameter to
>> bdrv_co_block_status() to reduce the number of SEEK_DATA/HOLE calls
>> unless we really want zero information; but sometimes we do want that
>> information, because for files that consist largely of zero areas,
>> special-casing those areas can give large performance boosts. So the
>> real problem is with files that consist largely of data, so that
>> inquiring the block status does not gain us much performance, but where
>> such an inquiry itself takes a lot of time.
>>
>> To address this, we want to cache data regions.  Most of the time, when
>> bad performance is reported, it is in places where the image is iterated
>> over from start to end (qemu-img convert or the mirror job), so a simple
>> yet effective solution is to cache only the current data region.
>>
>> (Note that only caching data regions but not zero regions means that
>> returning false information from the cache is not catastrophic: Treating
>> zeroes as data is fine.  While we try to invalidate the cache on zero
>> writes and discards, such incongruences may still occur when there are
>> other processes writing to the image.)
>>
>> We only use the cache for nodes without children (i.e. protocol nodes),
>> because that is where the problem is: Drivers that rely on block-status
>> implementations outside of qemu (e.g. SEEK_DATA/HOLE).
>>
>> Resolves: https://gitlab.com/qemu-project/qemu/-/issues/307
>> Signed-off-by: Max Reitz <mreitz@redhat.com>
>> ---
>>   include/block/block_int.h | 19 ++++++++++
>>   block.c                   |  2 +
>>   block/io.c                | 80 +++++++++++++++++++++++++++++++++++++--
>>   3 files changed, 98 insertions(+), 3 deletions(-)
>>
>> diff --git a/include/block/block_int.h b/include/block/block_int.h
>> index a8f9598102..c09512556a 100644
>> --- a/include/block/block_int.h
>> +++ b/include/block/block_int.h
>> @@ -832,6 +832,23 @@ struct BdrvChild {
>>       QLIST_ENTRY(BdrvChild) next_parent;
>>   };
>>   +/*
>> + * Allows bdrv_co_block_status() to cache one data region for a
>> + * protocol node.
>> + *
>> + * @lock: Lock for accessing this object's fields
>> + * @valid: Whether the cache is valid
>> + * @data_start: Offset where we know (or strongly assume) is data
>> + * @data_end: Offset where the data region ends (which is not 
>> necessarily
>> + *            the start of a zeroed region)
>> + */
>> +typedef struct BdrvBlockStatusCache {
>> +    CoMutex lock;
>> +    bool valid;
>> +    int64_t data_start;
>> +    int64_t data_end;
>> +} BdrvBlockStatusCache;
>> +
>>   struct BlockDriverState {
>>       /* Protected by big QEMU lock or read-only after opening. No 
>> special
>>        * locking needed during I/O...
>> @@ -997,6 +1014,8 @@ struct BlockDriverState {
>>         /* BdrvChild links to this node may never be frozen */
>>       bool never_freeze;
>> +
>> +    BdrvBlockStatusCache block_status_cache;
>>   };
>>     struct BlockBackendRootState {
>> diff --git a/block.c b/block.c
>> index 3f456892d0..bad64d5c4f 100644
>> --- a/block.c
>> +++ b/block.c
>> @@ -398,6 +398,8 @@ BlockDriverState *bdrv_new(void)
>>         qemu_co_queue_init(&bs->flush_queue);
>>   +    qemu_co_mutex_init(&bs->block_status_cache.lock);
>> +
>>       for (i = 0; i < bdrv_drain_all_count; i++) {
>>           bdrv_drained_begin(bs);
>>       }
>> diff --git a/block/io.c b/block/io.c
>> index 323854d063..320638cc48 100644
>> --- a/block/io.c
>> +++ b/block/io.c
>> @@ -35,6 +35,7 @@
>>   #include "qapi/error.h"
>>   #include "qemu/error-report.h"
>>   #include "qemu/main-loop.h"
>> +#include "qemu/range.h"
>>   #include "sysemu/replay.h"
>>     /* Maximum bounce buffer for copy-on-read and write zeroes, in 
>> bytes */
>> @@ -1862,6 +1863,7 @@ static int coroutine_fn 
>> bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
>>       bool need_flush = false;
>>       int head = 0;
>>       int tail = 0;
>> +    BdrvBlockStatusCache *bsc = &bs->block_status_cache;
>>         int max_write_zeroes = MIN_NON_ZERO(bs->bl.max_pwrite_zeroes, 
>> INT_MAX);
>>       int alignment = MAX(bs->bl.pwrite_zeroes_alignment,
>> @@ -1878,6 +1880,16 @@ static int coroutine_fn 
>> bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
>>           return -ENOTSUP;
>>       }
>>   +    /* Invalidate the cached block-status data range if this write 
>> overlaps */
>> +    qemu_co_mutex_lock(&bsc->lock);
>> +    if (bsc->valid &&
>> +        ranges_overlap(offset, bytes, bsc->data_start,
>> +                       bsc->data_end - bsc->data_start))
>> +    {
>> +        bsc->valid = false;
>> +    }
>> +    qemu_co_mutex_unlock(&bsc->lock);
>
> I think this fragment, used twice, is worth a small function, like
>
> block_status_cache_discard_region().
>
> That would be clean and avoid duplication.

Sure, sounds good.

>> +
>>       assert(alignment % bs->bl.request_alignment == 0);
>>       head = offset % alignment;
>>       tail = (offset + bytes) % alignment;
>> @@ -2394,6 +2406,7 @@ static int coroutine_fn 
>> bdrv_co_block_status(BlockDriverState *bs,
>>       int64_t aligned_offset, aligned_bytes;
>>       uint32_t align;
>>       bool has_filtered_child;
>> +    BdrvBlockStatusCache *bsc = &bs->block_status_cache;
>>         assert(pnum);
>>       *pnum = 0;
>> @@ -2442,9 +2455,59 @@ static int coroutine_fn 
>> bdrv_co_block_status(BlockDriverState *bs,
>>       aligned_bytes = ROUND_UP(offset + bytes, align) - aligned_offset;
>>         if (bs->drv->bdrv_co_block_status) {
>> -        ret = bs->drv->bdrv_co_block_status(bs, want_zero, 
>> aligned_offset,
>> -                                            aligned_bytes, pnum, 
>> &local_map,
>> -                                            &local_file);
>> +        bool from_cache = false;
>> +
>> +        /*
>> +         * Use the block-status cache only for protocol nodes: Format
>> +         * drivers are generally quick to inquire the status, but 
>> protocol
>> +         * drivers often need to get information from outside of 
>> qemu, so
>> +         * we do not have control over the actual implementation.  
>> There
>> +         * have been cases where inquiring the status took an 
>> unreasonably
>> +         * long time, and we can do nothing in qemu to fix it.
>> +         * This is especially problematic for images with large data 
>> areas,
>> +         * because finding the few holes in them and giving them 
>> special
>> +         * treatment does not gain much performance. Therefore, we 
>> try to
>> +         * cache the last-identified data region.
>> +         *
>> +         * Second, limiting ourselves to protocol nodes allows us to 
>> assume
>> +         * the block status for data regions to be DATA | 
>> OFFSET_VALID, and
>> +         * that the host offset is the same as the guest offset.
>> +         *
>> +         * Note that it is possible that external writers zero parts of
>> +         * the cached regions without the cache being invalidated, 
>> and so
>> +         * we may report zeroes as data.  This is not catastrophic,
>> +         * however, because reporting zeroes as data is fine.
>> +         */
>> +        if (QLIST_EMPTY(&bs->children)) {
>> +            qemu_co_mutex_lock(&bsc->lock);
>> +            if (bsc->valid &&
>> +                aligned_offset >= bsc->data_start &&
>> +                aligned_offset < bsc->data_end)
>> +            {
>> +                ret = BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID;
>> +                local_file = bs;
>> +                local_map = aligned_offset;
>> +
>> +                *pnum = bsc->data_end - aligned_offset;
>> +
>> +                from_cache = true;
>> +            }
>> +            qemu_co_mutex_unlock(&bsc->lock);
>> +        }
>> +
>> +        if (!from_cache) {
>> +            ret = bs->drv->bdrv_co_block_status(bs, want_zero, 
>> aligned_offset,
>> +                                                aligned_bytes, pnum, 
>> &local_map,
>> + &local_file);
>> +
>> +            if (ret == (BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID)) {
>
> You do cache the data for any driver that returns DATA | OFFSET_VALID. 
> But it's unused for format drivers (which may return OFFSET_VALID of 
> course).
>
> I think here it should be:
>
> if ((ret == (BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID)) && 
> QLIST_EMPTY(&bs->children)) {

Actually, I originally had it like this but then was afraid that a 
reviewer would ask me why I check the children list twice, once here and 
once when reading from the cache. :)

So I thought I should check it only once, and the better place to do so 
was where the cache is actually used.

I don’t think it makes much of a difference in practice (setting the 
cache doesn’t really matter if it isn’t used), but perhaps it might make 
the checks in write-zero/discard a bit quicker, because they can return 
early with `valid == false`.

>> + qemu_co_mutex_lock(&bsc->lock);
>> +                bsc->data_start = aligned_offset;
>> +                bsc->data_end = aligned_offset + *pnum;
>> +                bsc->valid = true;
>> +                qemu_co_mutex_unlock(&bsc->lock);
>> +            }
>> +        }
>
>
> Hmm, also, the new additions may be worth separate small functions too, so 
> that the new object gets a clean API:
>
> bsc_discard()
> bsc_get()
> bsc_set()
>
> or something like this. I don't insist.

Sounds good, will do.

>>       } else {
>>           /* Default code for filters */
>>   @@ -2974,6 +3037,7 @@ int coroutine_fn bdrv_co_pdiscard(BdrvChild 
>> *child, int64_t offset,
>>       int max_pdiscard, ret;
>>       int head, tail, align;
>>       BlockDriverState *bs = child->bs;
>> +    BdrvBlockStatusCache *bsc = &bs->block_status_cache;
>>         if (!bs || !bs->drv || !bdrv_is_inserted(bs)) {
>>           return -ENOMEDIUM;
>> @@ -2997,6 +3061,16 @@ int coroutine_fn bdrv_co_pdiscard(BdrvChild 
>> *child, int64_t offset,
>>           return 0;
>>       }
>>   +    /* Invalidate the cached block-status data range if this 
>> discard overlaps */
>> +    qemu_co_mutex_lock(&bsc->lock);
>> +    if (bsc->valid &&
>> +        ranges_overlap(offset, bytes, bsc->data_start,
>> +                       bsc->data_end - bsc->data_start))
>> +    {
>> +        bsc->valid = false;
>> +    }
>> +    qemu_co_mutex_unlock(&bsc->lock);
>> +
>>       /* Discard is advisory, but some devices track and coalesce
>>        * unaligned requests, so we must pass everything down rather than
>>        * round here.  Still, most devices will just silently ignore
>>
>
> Overall, seems good to me.

Great :)

Max

Patch

diff --git a/include/block/block_int.h b/include/block/block_int.h
index a8f9598102..c09512556a 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -832,6 +832,23 @@  struct BdrvChild {
     QLIST_ENTRY(BdrvChild) next_parent;
 };
 
+/*
+ * Allows bdrv_co_block_status() to cache one data region for a
+ * protocol node.
+ *
+ * @lock: Lock for accessing this object's fields
+ * @valid: Whether the cache is valid
+ * @data_start: Offset where we know (or strongly assume) is data
+ * @data_end: Offset where the data region ends (which is not necessarily
+ *            the start of a zeroed region)
+ */
+typedef struct BdrvBlockStatusCache {
+    CoMutex lock;
+    bool valid;
+    int64_t data_start;
+    int64_t data_end;
+} BdrvBlockStatusCache;
+
 struct BlockDriverState {
     /* Protected by big QEMU lock or read-only after opening.  No special
      * locking needed during I/O...
@@ -997,6 +1014,8 @@  struct BlockDriverState {
 
     /* BdrvChild links to this node may never be frozen */
     bool never_freeze;
+
+    BdrvBlockStatusCache block_status_cache;
 };
 
 struct BlockBackendRootState {
diff --git a/block.c b/block.c
index 3f456892d0..bad64d5c4f 100644
--- a/block.c
+++ b/block.c
@@ -398,6 +398,8 @@  BlockDriverState *bdrv_new(void)
 
     qemu_co_queue_init(&bs->flush_queue);
 
+    qemu_co_mutex_init(&bs->block_status_cache.lock);
+
     for (i = 0; i < bdrv_drain_all_count; i++) {
         bdrv_drained_begin(bs);
     }
diff --git a/block/io.c b/block/io.c
index 323854d063..320638cc48 100644
--- a/block/io.c
+++ b/block/io.c
@@ -35,6 +35,7 @@ 
 #include "qapi/error.h"
 #include "qemu/error-report.h"
 #include "qemu/main-loop.h"
+#include "qemu/range.h"
 #include "sysemu/replay.h"
 
 /* Maximum bounce buffer for copy-on-read and write zeroes, in bytes */
@@ -1862,6 +1863,7 @@  static int coroutine_fn bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
     bool need_flush = false;
     int head = 0;
     int tail = 0;
+    BdrvBlockStatusCache *bsc = &bs->block_status_cache;
 
     int max_write_zeroes = MIN_NON_ZERO(bs->bl.max_pwrite_zeroes, INT_MAX);
     int alignment = MAX(bs->bl.pwrite_zeroes_alignment,
@@ -1878,6 +1880,16 @@  static int coroutine_fn bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
         return -ENOTSUP;
     }
 
+    /* Invalidate the cached block-status data range if this write overlaps */
+    qemu_co_mutex_lock(&bsc->lock);
+    if (bsc->valid &&
+        ranges_overlap(offset, bytes, bsc->data_start,
+                       bsc->data_end - bsc->data_start))
+    {
+        bsc->valid = false;
+    }
+    qemu_co_mutex_unlock(&bsc->lock);
+
     assert(alignment % bs->bl.request_alignment == 0);
     head = offset % alignment;
     tail = (offset + bytes) % alignment;
@@ -2394,6 +2406,7 @@  static int coroutine_fn bdrv_co_block_status(BlockDriverState *bs,
     int64_t aligned_offset, aligned_bytes;
     uint32_t align;
     bool has_filtered_child;
+    BdrvBlockStatusCache *bsc = &bs->block_status_cache;
 
     assert(pnum);
     *pnum = 0;
@@ -2442,9 +2455,59 @@  static int coroutine_fn bdrv_co_block_status(BlockDriverState *bs,
     aligned_bytes = ROUND_UP(offset + bytes, align) - aligned_offset;
 
     if (bs->drv->bdrv_co_block_status) {
-        ret = bs->drv->bdrv_co_block_status(bs, want_zero, aligned_offset,
-                                            aligned_bytes, pnum, &local_map,
-                                            &local_file);
+        bool from_cache = false;
+
+        /*
+         * Use the block-status cache only for protocol nodes: Format
+         * drivers are generally quick to inquire the status, but protocol
+         * drivers often need to get information from outside of qemu, so
+         * we do not have control over the actual implementation.  There
+         * have been cases where inquiring the status took an unreasonably
+         * long time, and we can do nothing in qemu to fix it.
+         * This is especially problematic for images with large data areas,
+         * because finding the few holes in them and giving them special
+         * treatment does not gain much performance.  Therefore, we try to
+         * cache the last-identified data region.
+         *
+         * Second, limiting ourselves to protocol nodes allows us to assume
+         * the block status for data regions to be DATA | OFFSET_VALID, and
+         * that the host offset is the same as the guest offset.
+         *
+         * Note that it is possible that external writers zero parts of
+         * the cached regions without the cache being invalidated, and so
+         * we may report zeroes as data.  This is not catastrophic,
+         * however, because reporting zeroes as data is fine.
+         */
+        if (QLIST_EMPTY(&bs->children)) {
+            qemu_co_mutex_lock(&bsc->lock);
+            if (bsc->valid &&
+                aligned_offset >= bsc->data_start &&
+                aligned_offset < bsc->data_end)
+            {
+                ret = BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID;
+                local_file = bs;
+                local_map = aligned_offset;
+
+                *pnum = bsc->data_end - aligned_offset;
+
+                from_cache = true;
+            }
+            qemu_co_mutex_unlock(&bsc->lock);
+        }
+
+        if (!from_cache) {
+            ret = bs->drv->bdrv_co_block_status(bs, want_zero, aligned_offset,
+                                                aligned_bytes, pnum, &local_map,
+                                                &local_file);
+
+            if (ret == (BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID)) {
+                qemu_co_mutex_lock(&bsc->lock);
+                bsc->data_start = aligned_offset;
+                bsc->data_end = aligned_offset + *pnum;
+                bsc->valid = true;
+                qemu_co_mutex_unlock(&bsc->lock);
+            }
+        }
     } else {
         /* Default code for filters */
 
@@ -2974,6 +3037,7 @@  int coroutine_fn bdrv_co_pdiscard(BdrvChild *child, int64_t offset,
     int max_pdiscard, ret;
     int head, tail, align;
     BlockDriverState *bs = child->bs;
+    BdrvBlockStatusCache *bsc = &bs->block_status_cache;
 
     if (!bs || !bs->drv || !bdrv_is_inserted(bs)) {
         return -ENOMEDIUM;
@@ -2997,6 +3061,16 @@  int coroutine_fn bdrv_co_pdiscard(BdrvChild *child, int64_t offset,
         return 0;
     }
 
+    /* Invalidate the cached block-status data range if this discard overlaps */
+    qemu_co_mutex_lock(&bsc->lock);
+    if (bsc->valid &&
+        ranges_overlap(offset, bytes, bsc->data_start,
+                       bsc->data_end - bsc->data_start))
+    {
+        bsc->valid = false;
+    }
+    qemu_co_mutex_unlock(&bsc->lock);
+
     /* Discard is advisory, but some devices track and coalesce
      * unaligned requests, so we must pass everything down rather than
      * round here.  Still, most devices will just silently ignore