Message ID: 20190124141731.21509-1-kwolf@redhat.com (mailing list archive)
State: New, archived
Series: file-posix: Cache lseek result for data regions
24.01.2019 17:17, Kevin Wolf wrote:
> Depending on the exact image layout and the storage backend (tmpfs is
> known to have very slow SEEK_HOLE/SEEK_DATA), caching lseek results can
> save us a lot of time e.g. during a mirror block job or qemu-img convert
> with a fragmented source image (.bdrv_co_block_status on the protocol
> layer can be called for every single cluster in the extreme case).
>
> We may only cache data regions because of possible concurrent writers.
> This means that we can later treat a recently punched hole as data, but
> this is safe. We can't cache holes because then we might treat recently
> written data as holes, which can cause corruption.
>
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---
>  block/file-posix.c | 51 ++++++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 49 insertions(+), 2 deletions(-)
>
> diff --git a/block/file-posix.c b/block/file-posix.c
> index 8aee7a3fb8..7272c7c99d 100644
> --- a/block/file-posix.c
> +++ b/block/file-posix.c
> @@ -168,6 +168,12 @@ typedef struct BDRVRawState {
>      bool needs_alignment;
>      bool check_cache_dropped;
>
> +    struct seek_data_cache {
> +        bool valid;
> +        uint64_t start;
> +        uint64_t end;
> +    } seek_data_cache;

Should we have some mutex-locking to protect it?
On 24.01.2019 15:40, Vladimir Sementsov-Ogievskiy wrote:
> 24.01.2019 17:17, Kevin Wolf wrote:
> > [...]
> > +    struct seek_data_cache {
> > +        bool valid;
> > +        uint64_t start;
> > +        uint64_t end;
> > +    } seek_data_cache;
>
> Should we have some mutex-locking to protect it?

It is protected by the AioContext lock, like everything else in
BDRVRawState.

Kevin
24.01.2019 18:11, Kevin Wolf wrote:
> On 24.01.2019 15:40, Vladimir Sementsov-Ogievskiy wrote:
>> [...]
>> Should we have some mutex-locking to protect it?
>
> It is protected by the AioContext lock, like everything else in
> BDRVRawState.

Recently Paolo asked me not to add more users of the AioContext lock.
Unfortunately I don't understand the whole picture around it. Doesn't this
apply here?
https://lists.gnu.org/archive/html/qemu-devel/2018-12/msg03410.html
On 24.01.2019 16:22, Vladimir Sementsov-Ogievskiy wrote:
> [...]
> Recently Paolo asked me not to add more users of the AioContext lock.
> Unfortunately I don't understand the whole picture around it. Doesn't this
> apply here?
> https://lists.gnu.org/archive/html/qemu-devel/2018-12/msg03410.html

I don't know. Honestly I feel nobody except Paolo knows, because we
don't know his patches yet.

But raw doesn't have an s->lock yet, so I think removing the AioContext
lock involves some work on it anyway, and adding this doesn't really
change the amount of work.

Kevin
On 1/24/19 8:17 AM, Kevin Wolf wrote:
> Depending on the exact image layout and the storage backend (tmpfs is
> known to have very slow SEEK_HOLE/SEEK_DATA), caching lseek results can
> save us a lot of time e.g. during a mirror block job or qemu-img convert
> with a fragmented source image (.bdrv_co_block_status on the protocol
> layer can be called for every single cluster in the extreme case).
>
> We may only cache data regions because of possible concurrent writers.
> This means that we can later treat a recently punched hole as data, but
> this is safe. We can't cache holes because then we might treat recently
> written data as holes, which can cause corruption.

gluster copies heavily from file-posix's implementation; should it also
copy this cache of known-data? Should NBD also cache known-data when
NBD_CMD_BLOCK_STATUS is available?

> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---
>  block/file-posix.c | 51 ++++++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 49 insertions(+), 2 deletions(-)
>
> +    /* Invalidate seek_data_cache if it overlaps */
> +    sdc = &s->seek_data_cache;
> +    if (sdc->valid && !(sdc->end < aiocb->aio_offset ||
> +                        sdc->start > aiocb->aio_offset + aiocb->aio_nbytes))
> +    {
> +        sdc->valid = false;
> +    }

Worth a helper function for this repeated action?

Reviewed-by: Eric Blake <eblake@redhat.com>
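The repeated invalidation pattern Eric points at could be factored out roughly like this (a standalone sketch with a hypothetical helper name, treating both the cached range and the written range as half-open intervals):

```c
#include <stdbool.h>
#include <stdint.h>

/* Mirror of the patch's cached lseek(SEEK_DATA) result: a single known-data
 * range, [start, end) as a half-open interval. */
struct seek_data_cache {
    bool valid;
    uint64_t start;   /* inclusive */
    uint64_t end;     /* exclusive */
};

/* Hypothetical helper for the repeated pattern: drop the cached range if it
 * overlaps the byte range [offset, offset + bytes) about to be modified.
 * With half-open intervals, merely touching ranges do not overlap, which is
 * why the comparisons use <= and >= rather than < and >. */
static void seek_cache_invalidate(struct seek_data_cache *sdc,
                                  uint64_t offset, uint64_t bytes)
{
    if (sdc->valid &&
        !(sdc->end <= offset || sdc->start >= offset + bytes)) {
        sdc->valid = false;
    }
}
```

Both handle_aiocb_write_zeroes_unmap() and handle_aiocb_discard() could then call this one helper instead of open-coding the condition.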
24.01.2019 17:17, Kevin Wolf wrote:
> [...]
> diff --git a/block/file-posix.c b/block/file-posix.c
> index 8aee7a3fb8..7272c7c99d 100644
> --- a/block/file-posix.c
> +++ b/block/file-posix.c
> @@ -168,6 +168,12 @@ typedef struct BDRVRawState {
>      bool needs_alignment;
>      bool check_cache_dropped;
>
> +    struct seek_data_cache {
> +        bool valid;
> +        uint64_t start;
> +        uint64_t end;
> +    } seek_data_cache;
> +
>      PRManager *pr_mgr;
>  } BDRVRawState;
>
> @@ -1555,8 +1561,17 @@ static int handle_aiocb_write_zeroes_unmap(void *opaque)
>  {
>      RawPosixAIOData *aiocb = opaque;
>      BDRVRawState *s G_GNUC_UNUSED = aiocb->bs->opaque;
> +    struct seek_data_cache *sdc;
>      int ret;
>
> +    /* Invalidate seek_data_cache if it overlaps */
> +    sdc = &s->seek_data_cache;
> +    if (sdc->valid && !(sdc->end < aiocb->aio_offset ||
> +                        sdc->start > aiocb->aio_offset + aiocb->aio_nbytes))

to be precise: <= and >=

> +    {
> +        sdc->valid = false;
> +    }
> +
>      /* First try to write zeros and unmap at the same time */

Why not drop the cache in handle_aiocb_write_zeroes() as well? Otherwise
we'll return DATA for these regions, which may be unallocated read-as-zero,
if I'm not mistaken.

>  #ifdef CONFIG_FALLOCATE_PUNCH_HOLE
> @@ -1634,11 +1649,20 @@ static int handle_aiocb_discard(void *opaque)
>      RawPosixAIOData *aiocb = opaque;
>      int ret = -EOPNOTSUPP;
>      BDRVRawState *s = aiocb->bs->opaque;
> +    struct seek_data_cache *sdc;
>
>      if (!s->has_discard) {
>          return -ENOTSUP;
>      }
>
> +    /* Invalidate seek_data_cache if it overlaps */
> +    sdc = &s->seek_data_cache;
> +    if (sdc->valid && !(sdc->end < aiocb->aio_offset ||
> +                        sdc->start > aiocb->aio_offset + aiocb->aio_nbytes))

and <= and >=

And if the same is added to handle_aiocb_write_zeroes(), then it's worth
creating a helper function to invalidate the cache.

> +    {
> +        sdc->valid = false;
> +    }
> +
>      if (aiocb->aio_type & QEMU_AIO_BLKDEV) {
>  #ifdef BLKDISCARD
>          do {
> @@ -2424,6 +2448,8 @@ static int coroutine_fn raw_co_block_status(BlockDriverState *bs,
>                                              int64_t *map,
>                                              BlockDriverState **file)
>  {
> +    BDRVRawState *s = bs->opaque;
> +    struct seek_data_cache *sdc;
>      off_t data = 0, hole = 0;
>      int ret;
>
> @@ -2439,6 +2465,14 @@ static int coroutine_fn raw_co_block_status(BlockDriverState *bs,
>          return BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID;
>      }
>
> +    sdc = &s->seek_data_cache;
> +    if (sdc->valid && sdc->start <= offset && sdc->end > offset) {
> +        *pnum = MIN(bytes, sdc->end - offset);
> +        *map = offset;
> +        *file = bs;
> +        return BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID;
> +    }
> +
>      ret = find_allocation(bs, offset, &data, &hole);
>      if (ret == -ENXIO) {
>          /* Trailing hole */
> @@ -2451,14 +2485,27 @@ static int coroutine_fn raw_co_block_status(BlockDriverState *bs,
>      } else if (data == offset) {
>          /* On a data extent, compute bytes to the end of the extent,
>           * possibly including a partial sector at EOF. */
> -        *pnum = MIN(bytes, hole - offset);
> +        *pnum = hole - offset;

hmm, why? At least you didn't mention it in the commit message..

>          ret = BDRV_BLOCK_DATA;
>      } else {
>          /* On a hole, compute bytes to the beginning of the next extent. */
>          assert(hole == offset);
> -        *pnum = MIN(bytes, data - offset);
> +        *pnum = data - offset;
>          ret = BDRV_BLOCK_ZERO;
>      }
> +
> +    /* Caching allocated ranges is okay even if another process writes to the
> +     * same file because we allow declaring things allocated even if there is a
> +     * hole. However, we cannot cache holes without risking corruption. */
> +    if (ret == BDRV_BLOCK_DATA) {
> +        *sdc = (struct seek_data_cache) {
> +            .valid = true,
> +            .start = offset,
> +            .end = offset + *pnum,
> +        };
> +    }
> +
> +    *pnum = MIN(*pnum, bytes);
>      *map = offset;
>      *file = bs;
>      return ret | BDRV_BLOCK_OFFSET_VALID;
On 24.01.2019 17:18, Vladimir Sementsov-Ogievskiy wrote:
> 24.01.2019 17:17, Kevin Wolf wrote:
> > [...]
> > +    /* Invalidate seek_data_cache if it overlaps */
> > +    sdc = &s->seek_data_cache;
> > +    if (sdc->valid && !(sdc->end < aiocb->aio_offset ||
> > +                        sdc->start > aiocb->aio_offset + aiocb->aio_nbytes))
>
> to be precise: <= and >=

Yes, you're right.

> > +    {
> > +        sdc->valid = false;
> > +    }
> > +
> >      /* First try to write zeros and unmap at the same time */
>
> Why not drop the cache in handle_aiocb_write_zeroes() as well? Otherwise
> we'll return DATA for these regions, which may be unallocated read-as-zero,
> if I'm not mistaken.

handle_aiocb_write_zeroes() is not allowed to unmap things, so we don't
need to invalidate the cache there.

> and if the same is added to handle_aiocb_write_zeroes(), then it's worth
> creating a helper function to invalidate the cache.

Ok.

> > -        *pnum = MIN(bytes, hole - offset);
> > +        *pnum = hole - offset;
>
> hmm, why? At least you didn't mention it in the commit message..

We want to cache the whole range returned by lseek(), not just whatever
the raw_co_block_status() caller wanted to know.

For the returned value, *pnum is adjusted to MIN(bytes, *pnum) below...

> > +    *pnum = MIN(*pnum, bytes);

...here. So what we return doesn't change.

Kevin
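Kevin's cache-the-whole-extent-then-clamp ordering can be modelled in isolation (a sketch with hypothetical names; `hole` plays the role of the next hole offset that find_allocation()/lseek() would report):

```c
#include <stdbool.h>
#include <stdint.h>

struct seek_data_cache {
    bool valid;
    uint64_t start, end;   /* half-open range of known data */
};

#define MIN(a, b) ((a) < (b) ? (a) : (b))

/* Toy model of the data-extent branch of raw_co_block_status(): the full
 * extent reported by lseek() is cached first, and only the value returned
 * to the caller is clamped to the requested 'bytes'. */
static uint64_t block_status_data_model(struct seek_data_cache *sdc,
                                        uint64_t offset, uint64_t bytes,
                                        uint64_t hole)
{
    uint64_t pnum = hole - offset;            /* whole extent, unclamped */

    /* Cache the entire data extent, not just the queried piece */
    *sdc = (struct seek_data_cache) {
        .valid = true, .start = offset, .end = offset + pnum,
    };

    return MIN(pnum, bytes);                  /* returned value is unchanged */
}
```

A 4 KiB query inside a 1 MiB data extent thus still returns 4 KiB, while the cache remembers the full megabyte for later queries.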
24.01.2019 19:36, Kevin Wolf wrote:
> On 24.01.2019 17:18, Vladimir Sementsov-Ogievskiy wrote:
>> [...]
>> Why not drop the cache in handle_aiocb_write_zeroes() as well? Otherwise
>> we'll return DATA for these regions, which may be unallocated read-as-zero,
>> if I'm not mistaken.
>
> handle_aiocb_write_zeroes() is not allowed to unmap things, so we don't
> need to invalidate the cache there.

So you're saying that for fallocated regions we always return just _DATA,
without _ZERO? If so, that's of course bad: it means convert will have to
copy (or at least read and detect zeroes by hand, if enabled) areas that
were write-zeroed without unmap.

Let's check (hmm, I had to use qemu-img map inside qemu-io, patch attached
for it; I also added printf("%s\n", __func__) to
handle_aiocb_write_zeroes_unmap and handle_aiocb_write_zeroes):

]# cat test
./qemu-img create -f raw x 1M
./qemu-io -f raw x <<CMDS
write 0 1M
map
write -z 100K 100K
map
write -z -u 500K 100K
map
CMDS
rm -rf x

before your patch:
]# ./test
Formatting 'x', fmt=raw size=1048576
qemu-io> wrote 1048576/1048576 bytes at offset 0
1 MiB, 1 ops; 0.0523 sec (19.093 MiB/sec and 19.0927 ops/sec)
qemu-io> [{ "start": 0, "length": 1048576, "depth": 0, "zero": false, "data": true, "offset": 0}]
qemu-io> handle_aiocb_write_zeroes
wrote 102400/102400 bytes at offset 102400
100 KiB, 1 ops; 0.0165 sec (5.898 MiB/sec and 60.3974 ops/sec)
qemu-io> [{ "start": 0, "length": 102400, "depth": 0, "zero": false, "data": true, "offset": 0},
{ "start": 102400, "length": 102400, "depth": 0, "zero": true, "data": false, "offset": 102400},
{ "start": 204800, "length": 843776, "depth": 0, "zero": false, "data": true, "offset": 204800}]
qemu-io> handle_aiocb_write_zeroes_unmap
wrote 102400/102400 bytes at offset 512000
100 KiB, 1 ops; 0.0001 sec (545.566 MiB/sec and 5586.5922 ops/sec)
qemu-io> [{ "start": 0, "length": 102400, "depth": 0, "zero": false, "data": true, "offset": 0},
{ "start": 102400, "length": 102400, "depth": 0, "zero": true, "data": false, "offset": 102400},
{ "start": 204800, "length": 307200, "depth": 0, "zero": false, "data": true, "offset": 204800},
{ "start": 512000, "length": 102400, "depth": 0, "zero": true, "data": false, "offset": 512000},
{ "start": 614400, "length": 434176, "depth": 0, "zero": false, "data": true, "offset": 614400}]

after your patch:
# ./test
Formatting 'x', fmt=raw size=1048576
qemu-io> wrote 1048576/1048576 bytes at offset 0
1 MiB, 1 ops; 0.0768 sec (13.019 MiB/sec and 13.0195 ops/sec)
qemu-io> [{ "start": 0, "length": 1048576, "depth": 0, "zero": false, "data": true, "offset": 0}]
qemu-io> handle_aiocb_write_zeroes
wrote 102400/102400 bytes at offset 102400
100 KiB, 1 ops; 0.0166 sec (5.883 MiB/sec and 60.2410 ops/sec)
qemu-io> [{ "start": 0, "length": 1048576, "depth": 0, "zero": false, "data": true, "offset": 0}]
qemu-io> handle_aiocb_write_zeroes_unmap
wrote 102400/102400 bytes at offset 512000
100 KiB, 1 ops; 0.0002 sec (469.501 MiB/sec and 4807.6923 ops/sec)
qemu-io> [{ "start": 0, "length": 102400, "depth": 0, "zero": false, "data": true, "offset": 0},
{ "start": 102400, "length": 102400, "depth": 0, "zero": true, "data": false, "offset": 102400},
{ "start": 204800, "length": 307200, "depth": 0, "zero": false, "data": true, "offset": 204800},
{ "start": 512000, "length": 102400, "depth": 0, "zero": true, "data": false, "offset": 512000},
{ "start": 614400, "length": 434176, "depth": 0, "zero": false, "data": true, "offset": 614400}]

So you've changed the behavior of block_status after write_zeroes without
UNMAP for the worse.

Hmm, should I prepare a patch for qemu-io? qemu-img map is definitely better.

> [...]
>> hmm, why? At least you didn't mention it in the commit message..
>
> We want to cache the whole range returned by lseek(), not just whatever
> the raw_co_block_status() caller wanted to know.
>
> For the returned value, *pnum is adjusted to MIN(bytes, *pnum) below...

Oops, stupid question it was, sorry :(

> ...here. So what we return doesn't change.
>
> Kevin
On 24/01/19 16:42, Kevin Wolf wrote:
>> Recently Paolo asked me not to add more users of the AioContext lock.
>> Unfortunately I don't understand the whole picture around it. Doesn't this
>> apply here?
>> https://lists.gnu.org/archive/html/qemu-devel/2018-12/msg03410.html
> I don't know. Honestly I feel nobody except Paolo knows, because we
> don't know his patches yet.

This is true. On the other hand, the AioContext lock is only used in
some special cases around block jobs and blk_set_aio_context, and in
general the block devices already should not have any dependencies
(unless they crept in without me noticing). In particular...

> But raw doesn't have an s->lock yet, so I think removing the AioContext
> lock involves some work on it anyway and adding this doesn't really
> change the amount of work.

... BDRVRawState doesn't have any data that changes after open, does it?
This is why it doesn't have an s->lock.

Paolo
On 25.01.2019 11:10, Paolo Bonzini wrote:
> [...]
> This is true. On the other hand, the AioContext lock is only used in
> some special cases around block jobs and blk_set_aio_context, and in
> general the block devices already should not have any dependencies
> (unless they crept in without me noticing).

It's also used in those cases where coroutines don't need locking, but
threads would. Did you audit all of the drivers for such cases?

> In particular...
>
> ... BDRVRawState doesn't have any data that changes after open, does it?
> This is why it doesn't have an s->lock.

No important data anyway. We do things like setting
s->has_write_zeroes = false after failure, but if we hit a race and end up
trying twice before disabling it, it doesn't really hurt either. Then there
is reopen, but that involves a drain anyway. And that's probably it.

So do you think I should introduce a CoMutex for raw here? Or a QemuMutex?

Kevin
On 1/25/19 3:13 AM, Vladimir Sementsov-Ogievskiy wrote:
>
> before your patch:
> ]# ./test
> Formatting 'x', fmt=raw size=1048576
> qemu-io> wrote 1048576/1048576 bytes at offset 0
> 1 MiB, 1 ops; 0.0523 sec (19.093 MiB/sec and 19.0927 ops/sec)
> qemu-io> [{ "start": 0, "length": 1048576, "depth": 0, "zero": false, "data": true, "offset": 0}]
> qemu-io> handle_aiocb_write_zeroes
> wrote 102400/102400 bytes at offset 102400
> 100 KiB, 1 ops; 0.0165 sec (5.898 MiB/sec and 60.3974 ops/sec)
> qemu-io> [{ "start": 0, "length": 102400, "depth": 0, "zero": false, "data": true, "offset": 0},
> { "start": 102400, "length": 102400, "depth": 0, "zero": true, "data": false, "offset": 102400},
> { "start": 204800, "length": 843776, "depth": 0, "zero": false, "data": true, "offset": 204800}]
> qemu-io> handle_aiocb_write_zeroes_unmap
> wrote 102400/102400 bytes at offset 512000
> 100 KiB, 1 ops; 0.0001 sec (545.566 MiB/sec and 5586.5922 ops/sec)
> qemu-io> [{ "start": 0, "length": 102400, "depth": 0, "zero": false, "data": true, "offset": 0},
> { "start": 102400, "length": 102400, "depth": 0, "zero": true, "data": false, "offset": 102400},
> { "start": 204800, "length": 307200, "depth": 0, "zero": false, "data": true, "offset": 204800},
> { "start": 512000, "length": 102400, "depth": 0, "zero": true, "data": false, "offset": 512000},
> { "start": 614400, "length": 434176, "depth": 0, "zero": false, "data": true, "offset": 614400}]

Your demonstration pre-patch shows that both 'write -z' and 'write -z -u'
produced areas of the raw image that were marked as known zeroes, and claim
'data':false, meaning that those two areas are sparse (that is, it appears
'write -z' managed to unmap after all?)

> after your patch:
> # ./test
> Formatting 'x', fmt=raw size=1048576
> [...]
> qemu-io> handle_aiocb_write_zeroes
> wrote 102400/102400 bytes at offset 102400
> 100 KiB, 1 ops; 0.0166 sec (5.883 MiB/sec and 60.2410 ops/sec)
> qemu-io> [{ "start": 0, "length": 1048576, "depth": 0, "zero": false, "data": true, "offset": 0}]

So here, the cache was not invalidated, so the 'write -z' area was
temporarily reported as data...

> qemu-io> handle_aiocb_write_zeroes_unmap
> wrote 102400/102400 bytes at offset 512000
> 100 KiB, 1 ops; 0.0002 sec (469.501 MiB/sec and 4807.6923 ops/sec)
> qemu-io> [{ "start": 0, "length": 102400, "depth": 0, "zero": false, "data": true, "offset": 0},
> { "start": 102400, "length": 102400, "depth": 0, "zero": true, "data": false, "offset": 102400},
> { "start": 204800, "length": 307200, "depth": 0, "zero": false, "data": true, "offset": 204800},
> { "start": 512000, "length": 102400, "depth": 0, "zero": true, "data": false, "offset": 512000},
> { "start": 614400, "length": 434176, "depth": 0, "zero": false, "data": true, "offset": 614400}]

but after the 'write -z -u' invalidated the cache, we once again see that
the 'write -z' worked.

> So you've changed the behavior of block_status after write_zeroes without
> UNMAP for the worse.
>
> Hmm, should I prepare a patch for qemu-io? qemu-img map is definitely better.

qemu-io map is NOT asking the same information as qemu-img map (it's
annoying - but BOTH pieces of information are useful, and there are some
iotests that check both outputs to get a full picture of things). qemu-img
asks as much information as possible about all layers, while qemu-io asks
only about the top layer.

That said, qemu-io is NOT baked in stone; if you want to patch it AND fix
the iotest fallout, I'm not opposed to that.
On 24.01.2019 16:56, Eric Blake wrote:
> On 1/24/19 8:17 AM, Kevin Wolf wrote:
> > [...]
> > We may only cache data regions because of possible concurrent writers.
> > This means that we can later treat a recently punched hole as data, but
> > this is safe. We can't cache holes because then we might treat recently
> > written data as holes, which can cause corruption.
>
> gluster copies heavily from file-posix's implementation; should it also
> copy this cache of known-data? Should NBD also cache known-data when
> NBD_CMD_BLOCK_STATUS is available?

This almost suggests that we should do the caching in generic block
layer code.

It would require that we can return a *pnum from the block driver that
is larger than the requested bytes, but it looks like
raw_co_block_status() already handles this? We just don't seem to do
this yet in the block drivers.

If we want to cache for all drivers, however, the question is whether
there are drivers that can transition a block from data to hole without
a discard operation, so that we would have to invalidate the cache in
more places. One thing that comes to mind is loading an internal
snapshot for qcow2.

Or maybe we need to make this opt-in for drivers, with a bool flag in
BlockDriver?

Kevin
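One way the generic-layer caching with a per-driver opt-in flag could look, sketched entirely outside QEMU (DriverModel, cache_data_regions and fake_probe are made-up names for illustration, not the real BlockDriver API):

```c
#include <stdbool.h>
#include <stdint.h>

enum { BLOCK_HOLE = 0, BLOCK_DATA = 1 };

struct data_cache { bool valid; uint64_t start, end; };

/* Hypothetical driver vtable: an opt-in flag like the one Kevin suggests,
 * plus a block-status callback that reports the end of the current extent. */
typedef struct DriverModel {
    bool cache_data_regions;
    int (*block_status)(uint64_t offset, uint64_t *extent_end);
} DriverModel;

/* Demo driver: every 64 KiB chunk is a data extent; counts probe calls so a
 * cache hit is observable. */
static int probe_calls;
static int fake_probe(uint64_t offset, uint64_t *extent_end)
{
    probe_calls++;
    *extent_end = (offset / 65536 + 1) * 65536;
    return BLOCK_DATA;
}

/* Generic-layer wrapper: consult the cache only for opted-in drivers, and
 * refill it whenever the driver reports a data extent. */
static int generic_block_status(DriverModel *drv, struct data_cache *c,
                                uint64_t offset, uint64_t *extent_end)
{
    int ret;

    if (drv->cache_data_regions && c->valid &&
        c->start <= offset && offset < c->end) {
        *extent_end = c->end;          /* cache hit, no driver call */
        return BLOCK_DATA;
    }

    ret = drv->block_status(offset, extent_end);
    if (drv->cache_data_regions && ret == BLOCK_DATA) {
        *c = (struct data_cache){ .valid = true,
                                  .start = offset, .end = *extent_end };
    }
    return ret;
}
```

A driver that can drop data behind the generic layer's back (internal snapshot load, external writers) would either leave the flag unset or invalidate the cache from the action in question, matching the hook discussed below in the thread's own terms.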
On 1/29/19 4:56 AM, Kevin Wolf wrote: >> gluster copies heavily from file-posix's implementation; should it also >> copy this cache of known-data? Should NBD also cache known-data when >> NBD_CMD_BLOCK_STATUS is available? > > This almost suggests that we should do the caching in generic block > layer code. > > It would require that we can return a *pnum from the block driver that > is larger than the requested bytes, but it looks like > raw_co_block_status() already handles this? We just don't seem to do > this yet in the block drivers. The code in io.c bdrv_co_block_status() currently does one final clamp-down to limit the answer to the caller's maximum request, but I don't know if any drivers actually take advantage of passing back larger than the request. I _do_ know that the NBD protocol took pains to ensure that NBD_CMD_BLOCK_STATUS is permitted to return a value beyond the caller's request if such information was easily obtained, precisely because the idea of letting the caller cache the knowledge of a data section that extends beyond the current query's area of interest may be useful to minimize the need to make future block status calls. > > If we want to cache for all drivers, however, the question is whether > there are drivers that can transition a block from data to hole without > a discard operation, so that we would have to invalidate the cache in > more places. One thing that comes to mind is loading an internal > snapshot for qcow2. Oh, good point - switching to a different L1 table (due to loading an internal snapshot) can indeed make a hole appear that used to read as data, so if the block layer caches data ranges, it also needs to provide a hook for drivers to invalidate the cache when doing unusual actions.
Still, I can't think of any place where a hole spontaneously appears unless a specific driver action is taken (so the driver should have the opportunity to invalidate the cache during that action), or if an image is in active use by more than just the qemu process. And if the driver knows that an image might be shared with external processes modifying the image, then yes, maybe having a way to opt out of caching altogether is also appropriate. > > Or maybe we need to make this opt-in for drivers, with a bool flag in > BlockDriver? > > Kevin >
On 25/01/19 11:30, Kevin Wolf wrote: >> On the other hand, the AioContext lock is only used in >> some special cases around block jobs and blk_set_aio_context, and in >> general the block devices already should not have any dependencies >> (unless they crept in without me noticing). > It's also used in those cases where coroutines don't need locking, but > threads would. Did you audit all of the drivers for such cases? I did and the drivers already have a QemuMutex if that's the case (e.g. curl, iscsi). >> In particular... >> >>> But raw doesn't have an s->lock yet, so I >>> think removing the AioContext lock involves some work on it anyway and >>> adding this doesn't really change the amount of work. >> ... BDRVRawState doesn't have any data that changes after open, does it? >> This is why it doesn't have an s->lock. > No important data anyway. We do things like setting s->has_write_zeroes > = false after failure, but if we got a race and end up trying twice > before disabling it, it doesn't really hurt either. > > Then there is reopen, but that involves a drain anyway. And that's it > probably. > > So do you think I should introduce a CoMutex for raw here? Or QemuMutex? For the cache you can introduce either a CoMutex or QemuMutex, it's the same. Paolo
diff --git a/block/file-posix.c b/block/file-posix.c index 8aee7a3fb8..7272c7c99d 100644 --- a/block/file-posix.c +++ b/block/file-posix.c @@ -168,6 +168,12 @@ typedef struct BDRVRawState { bool needs_alignment; bool check_cache_dropped; + struct seek_data_cache { + bool valid; + uint64_t start; + uint64_t end; + } seek_data_cache; + PRManager *pr_mgr; } BDRVRawState; @@ -1555,8 +1561,17 @@ static int handle_aiocb_write_zeroes_unmap(void *opaque) { RawPosixAIOData *aiocb = opaque; BDRVRawState *s G_GNUC_UNUSED = aiocb->bs->opaque; + struct seek_data_cache *sdc; int ret; + /* Invalidate seek_data_cache if it overlaps */ + sdc = &s->seek_data_cache; + if (sdc->valid && !(sdc->end < aiocb->aio_offset || + sdc->start > aiocb->aio_offset + aiocb->aio_nbytes)) + { + sdc->valid = false; + } + /* First try to write zeros and unmap at the same time */ #ifdef CONFIG_FALLOCATE_PUNCH_HOLE @@ -1634,11 +1649,20 @@ static int handle_aiocb_discard(void *opaque) RawPosixAIOData *aiocb = opaque; int ret = -EOPNOTSUPP; BDRVRawState *s = aiocb->bs->opaque; + struct seek_data_cache *sdc; if (!s->has_discard) { return -ENOTSUP; } + /* Invalidate seek_data_cache if it overlaps */ + sdc = &s->seek_data_cache; + if (sdc->valid && !(sdc->end < aiocb->aio_offset || + sdc->start > aiocb->aio_offset + aiocb->aio_nbytes)) + { + sdc->valid = false; + } + if (aiocb->aio_type & QEMU_AIO_BLKDEV) { #ifdef BLKDISCARD do { @@ -2424,6 +2448,8 @@ static int coroutine_fn raw_co_block_status(BlockDriverState *bs, int64_t *map, BlockDriverState **file) { + BDRVRawState *s = bs->opaque; + struct seek_data_cache *sdc; off_t data = 0, hole = 0; int ret; @@ -2439,6 +2465,14 @@ static int coroutine_fn raw_co_block_status(BlockDriverState *bs, return BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID; } + sdc = &s->seek_data_cache; + if (sdc->valid && sdc->start <= offset && sdc->end > offset) { + *pnum = MIN(bytes, sdc->end - offset); + *map = offset; + *file = bs; + return BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID; + } 
+ ret = find_allocation(bs, offset, &data, &hole); if (ret == -ENXIO) { /* Trailing hole */ @@ -2451,14 +2485,27 @@ static int coroutine_fn raw_co_block_status(BlockDriverState *bs, } else if (data == offset) { /* On a data extent, compute bytes to the end of the extent, * possibly including a partial sector at EOF. */ - *pnum = MIN(bytes, hole - offset); + *pnum = hole - offset; ret = BDRV_BLOCK_DATA; } else { /* On a hole, compute bytes to the beginning of the next extent. */ assert(hole == offset); - *pnum = MIN(bytes, data - offset); + *pnum = data - offset; ret = BDRV_BLOCK_ZERO; } + + /* Caching allocated ranges is okay even if another process writes to the + * same file because we allow declaring things allocated even if there is a + * hole. However, we cannot cache holes without risking corruption. */ + if (ret == BDRV_BLOCK_DATA) { + *sdc = (struct seek_data_cache) { + .valid = true, + .start = offset, + .end = offset + *pnum, + }; + } + + *pnum = MIN(*pnum, bytes); *map = offset; *file = bs; return ret | BDRV_BLOCK_OFFSET_VALID;
Depending on the exact image layout and the storage backend (tmpfs is known to have very slow SEEK_HOLE/SEEK_DATA), caching lseek results can save us a lot of time e.g. during a mirror block job or qemu-img convert with a fragmented source image (.bdrv_co_block_status on the protocol layer can be called for every single cluster in the extreme case). We may only cache data regions because of possible concurrent writers. This means that we can later treat a recently punched hole as data, but this is safe. We can't cache holes because then we might treat recently written data as holes, which can cause corruption. Signed-off-by: Kevin Wolf <kwolf@redhat.com> --- block/file-posix.c | 51 ++++++++++++++++++++++++++++++++++++++++++++-- 1 file changed, 49 insertions(+), 2 deletions(-)