file-posix: Cache lseek result for data regions

Message ID 20190124141731.21509-1-kwolf@redhat.com (mailing list archive)
State New, archived
Series: file-posix: Cache lseek result for data regions

Commit Message

Kevin Wolf Jan. 24, 2019, 2:17 p.m. UTC
Depending on the exact image layout and the storage backend (tmpfs is
known to have very slow SEEK_HOLE/SEEK_DATA), caching lseek results can
save us a lot of time e.g. during a mirror block job or qemu-img convert
with a fragmented source image (.bdrv_co_block_status on the protocol
layer can be called for every single cluster in the extreme case).

We may only cache data regions because of possible concurrent writers.
This means that we can later treat a recently punched hole as data, but
this is safe. We can't cache holes because then we might treat recently
written data as holes, which can cause corruption.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block/file-posix.c | 51 ++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 49 insertions(+), 2 deletions(-)

Comments

Vladimir Sementsov-Ogievskiy Jan. 24, 2019, 2:40 p.m. UTC | #1
24.01.2019 17:17, Kevin Wolf wrote:
> Depending on the exact image layout and the storage backend (tmpfs is
> known to have very slow SEEK_HOLE/SEEK_DATA), caching lseek results can
> save us a lot of time e.g. during a mirror block job or qemu-img convert
> with a fragmented source image (.bdrv_co_block_status on the protocol
> layer can be called for every single cluster in the extreme case).
> 
> We may only cache data regions because of possible concurrent writers.
> This means that we can later treat a recently punched hole as data, but
> this is safe. We can't cache holes because then we might treat recently
> written data as holes, which can cause corruption.
> 
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---
>   block/file-posix.c | 51 ++++++++++++++++++++++++++++++++++++++++++++--
>   1 file changed, 49 insertions(+), 2 deletions(-)
> 
> diff --git a/block/file-posix.c b/block/file-posix.c
> index 8aee7a3fb8..7272c7c99d 100644
> --- a/block/file-posix.c
> +++ b/block/file-posix.c
> @@ -168,6 +168,12 @@ typedef struct BDRVRawState {
>       bool needs_alignment;
>       bool check_cache_dropped;
>   
> +    struct seek_data_cache {
> +        bool        valid;
> +        uint64_t    start;
> +        uint64_t    end;
> +    } seek_data_cache;

Should we have some mutex-locking to protect it?
Kevin Wolf Jan. 24, 2019, 3:11 p.m. UTC | #2
Am 24.01.2019 um 15:40 hat Vladimir Sementsov-Ogievskiy geschrieben:
> 24.01.2019 17:17, Kevin Wolf wrote:
> > Depending on the exact image layout and the storage backend (tmpfs is
> > known to have very slow SEEK_HOLE/SEEK_DATA), caching lseek results can
> > save us a lot of time e.g. during a mirror block job or qemu-img convert
> > with a fragmented source image (.bdrv_co_block_status on the protocol
> > layer can be called for every single cluster in the extreme case).
> > 
> > We may only cache data regions because of possible concurrent writers.
> > This means that we can later treat a recently punched hole as data, but
> > this is safe. We can't cache holes because then we might treat recently
> > written data as holes, which can cause corruption.
> > 
> > Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> > ---
> >   block/file-posix.c | 51 ++++++++++++++++++++++++++++++++++++++++++++--
> >   1 file changed, 49 insertions(+), 2 deletions(-)
> > 
> > diff --git a/block/file-posix.c b/block/file-posix.c
> > index 8aee7a3fb8..7272c7c99d 100644
> > --- a/block/file-posix.c
> > +++ b/block/file-posix.c
> > @@ -168,6 +168,12 @@ typedef struct BDRVRawState {
> >       bool needs_alignment;
> >       bool check_cache_dropped;
> >   
> > +    struct seek_data_cache {
> > +        bool        valid;
> > +        uint64_t    start;
> > +        uint64_t    end;
> > +    } seek_data_cache;
> 
> Should we have some mutex-locking to protect it?

It is protected by the AioContext lock, like everything else in
BDRVRawState.

Kevin
Vladimir Sementsov-Ogievskiy Jan. 24, 2019, 3:22 p.m. UTC | #3
24.01.2019 18:11, Kevin Wolf wrote:
> Am 24.01.2019 um 15:40 hat Vladimir Sementsov-Ogievskiy geschrieben:
>> 24.01.2019 17:17, Kevin Wolf wrote:
>>> Depending on the exact image layout and the storage backend (tmpfs is
>>> known to have very slow SEEK_HOLE/SEEK_DATA), caching lseek results can
>>> save us a lot of time e.g. during a mirror block job or qemu-img convert
>>> with a fragmented source image (.bdrv_co_block_status on the protocol
>>> layer can be called for every single cluster in the extreme case).
>>>
>>> We may only cache data regions because of possible concurrent writers.
>>> This means that we can later treat a recently punched hole as data, but
>>> this is safe. We can't cache holes because then we might treat recently
>>> written data as holes, which can cause corruption.
>>>
>>> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
>>> ---
>>>    block/file-posix.c | 51 ++++++++++++++++++++++++++++++++++++++++++++--
>>>    1 file changed, 49 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/block/file-posix.c b/block/file-posix.c
>>> index 8aee7a3fb8..7272c7c99d 100644
>>> --- a/block/file-posix.c
>>> +++ b/block/file-posix.c
>>> @@ -168,6 +168,12 @@ typedef struct BDRVRawState {
>>>        bool needs_alignment;
>>>        bool check_cache_dropped;
>>>    
>>> +    struct seek_data_cache {
>>> +        bool        valid;
>>> +        uint64_t    start;
>>> +        uint64_t    end;
>>> +    } seek_data_cache;
>>
>> Should we have some mutex-locking to protect it?
> 
> It is protected by the AioContext lock, like everything else in
> BDRVRawState.
> 

Recently Paolo asked me not to add more users of AioContext lock. Unfortunately
I don't understand the whole picture around it.. Doesn't this apply here?
https://lists.gnu.org/archive/html/qemu-devel/2018-12/msg03410.html
Kevin Wolf Jan. 24, 2019, 3:42 p.m. UTC | #4
Am 24.01.2019 um 16:22 hat Vladimir Sementsov-Ogievskiy geschrieben:
> 24.01.2019 18:11, Kevin Wolf wrote:
> > Am 24.01.2019 um 15:40 hat Vladimir Sementsov-Ogievskiy geschrieben:
> >> 24.01.2019 17:17, Kevin Wolf wrote:
> >>> Depending on the exact image layout and the storage backend (tmpfs is
> >>> known to have very slow SEEK_HOLE/SEEK_DATA), caching lseek results can
> >>> save us a lot of time e.g. during a mirror block job or qemu-img convert
> >>> with a fragmented source image (.bdrv_co_block_status on the protocol
> >>> layer can be called for every single cluster in the extreme case).
> >>>
> >>> We may only cache data regions because of possible concurrent writers.
> >>> This means that we can later treat a recently punched hole as data, but
> >>> this is safe. We can't cache holes because then we might treat recently
> >>> written data as holes, which can cause corruption.
> >>>
> >>> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> >>> ---
> >>>    block/file-posix.c | 51 ++++++++++++++++++++++++++++++++++++++++++++--
> >>>    1 file changed, 49 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/block/file-posix.c b/block/file-posix.c
> >>> index 8aee7a3fb8..7272c7c99d 100644
> >>> --- a/block/file-posix.c
> >>> +++ b/block/file-posix.c
> >>> @@ -168,6 +168,12 @@ typedef struct BDRVRawState {
> >>>        bool needs_alignment;
> >>>        bool check_cache_dropped;
> >>>    
> >>> +    struct seek_data_cache {
> >>> +        bool        valid;
> >>> +        uint64_t    start;
> >>> +        uint64_t    end;
> >>> +    } seek_data_cache;
> >>
> >> Should we have some mutex-locking to protect it?
> > 
> > It is protected by the AioContext lock, like everything else in
> > BDRVRawState.
> 
> Recently Paolo asked me not to add more users of AioContext lock. Unfortunately
> I don't understand the whole picture around it.. Doesn't this apply here?
> https://lists.gnu.org/archive/html/qemu-devel/2018-12/msg03410.html

I don't know. Honestly I feel nobody except Paolo knows, because we
don't know his patches yet. But raw doesn't have an s->lock yet, so I
think removing the AioContext lock involves some work on it anyway and
adding this doesn't really change the amount of work.

Kevin
Eric Blake Jan. 24, 2019, 3:56 p.m. UTC | #5
On 1/24/19 8:17 AM, Kevin Wolf wrote:
> Depending on the exact image layout and the storage backend (tmpfs is
> known to have very slow SEEK_HOLE/SEEK_DATA), caching lseek results can
> save us a lot of time e.g. during a mirror block job or qemu-img convert
> with a fragmented source image (.bdrv_co_block_status on the protocol
> layer can be called for every single cluster in the extreme case).
> 
> We may only cache data regions because of possible concurrent writers.
> This means that we can later treat a recently punched hole as data, but
> this is safe. We can't cache holes because then we might treat recently
> written data as holes, which can cause corruption.

gluster copies heavily from file-posix's implementation; should it also
copy this cache of known-data?  Should NBD also cache known-data when
NBD_CMD_BLOCK_STATUS is available?

> 
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---
>  block/file-posix.c | 51 ++++++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 49 insertions(+), 2 deletions(-)
> 

>  
> +    /* Invalidate seek_data_cache if it overlaps */
> +    sdc = &s->seek_data_cache;
> +    if (sdc->valid && !(sdc->end < aiocb->aio_offset ||
> +                        sdc->start > aiocb->aio_offset + aiocb->aio_nbytes))
> +    {
> +        sdc->valid = false;
> +    }

Worth a helper function for this repeated action?

Reviewed-by: Eric Blake <eblake@redhat.com>
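
A minimal sketch of the helper Eric suggests (the function name here is made
up, and the comparison already uses the half-open bounds that Vladimir points
out below):

    /* Sketch only: drop the cached data range if it overlaps the byte range
     * [offset, offset + bytes) that is about to be discarded or unmapped.
     * The cached range is half-open, [start, end). */
    static void raw_invalidate_seek_cache(BDRVRawState *s,
                                          uint64_t offset, uint64_t bytes)
    {
        struct seek_data_cache *sdc = &s->seek_data_cache;

        if (sdc->valid && sdc->end > offset && sdc->start < offset + bytes) {
            sdc->valid = false;
        }
    }

handle_aiocb_write_zeroes_unmap() and handle_aiocb_discard() could then both
call raw_invalidate_seek_cache(s, aiocb->aio_offset, aiocb->aio_nbytes)
instead of open-coding the check.
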
Vladimir Sementsov-Ogievskiy Jan. 24, 2019, 4:18 p.m. UTC | #6
24.01.2019 17:17, Kevin Wolf wrote:
> Depending on the exact image layout and the storage backend (tmpfs is
> known to have very slow SEEK_HOLE/SEEK_DATA), caching lseek results can
> save us a lot of time e.g. during a mirror block job or qemu-img convert
> with a fragmented source image (.bdrv_co_block_status on the protocol
> layer can be called for every single cluster in the extreme case).
> 
> We may only cache data regions because of possible concurrent writers.
> This means that we can later treat a recently punched hole as data, but
> this is safe. We can't cache holes because then we might treat recently
> written data as holes, which can cause corruption.
> 
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---
>   block/file-posix.c | 51 ++++++++++++++++++++++++++++++++++++++++++++--
>   1 file changed, 49 insertions(+), 2 deletions(-)
> 
> diff --git a/block/file-posix.c b/block/file-posix.c
> index 8aee7a3fb8..7272c7c99d 100644
> --- a/block/file-posix.c
> +++ b/block/file-posix.c
> @@ -168,6 +168,12 @@ typedef struct BDRVRawState {
>       bool needs_alignment;
>       bool check_cache_dropped;
>   
> +    struct seek_data_cache {
> +        bool        valid;
> +        uint64_t    start;
> +        uint64_t    end;
> +    } seek_data_cache;
> +
>       PRManager *pr_mgr;
>   } BDRVRawState;
>   
> @@ -1555,8 +1561,17 @@ static int handle_aiocb_write_zeroes_unmap(void *opaque)
>   {
>       RawPosixAIOData *aiocb = opaque;
>       BDRVRawState *s G_GNUC_UNUSED = aiocb->bs->opaque;
> +    struct seek_data_cache *sdc;
>       int ret;
>   
> +    /* Invalidate seek_data_cache if it overlaps */
> +    sdc = &s->seek_data_cache;
> +    if (sdc->valid && !(sdc->end < aiocb->aio_offset ||
> +                        sdc->start > aiocb->aio_offset + aiocb->aio_nbytes))

to be precise: <= and >=

> +    {
> +        sdc->valid = false;
> +    }
> +
>       /* First try to write zeros and unmap at the same time */
>   


Why not drop the cache in handle_aiocb_write_zeroes() as well? Otherwise, we'll return DATA
for these regions, which may be unallocated and read as zero, if I'm not mistaken.


>   #ifdef CONFIG_FALLOCATE_PUNCH_HOLE
> @@ -1634,11 +1649,20 @@ static int handle_aiocb_discard(void *opaque)
>       RawPosixAIOData *aiocb = opaque;
>       int ret = -EOPNOTSUPP;
>       BDRVRawState *s = aiocb->bs->opaque;
> +    struct seek_data_cache *sdc;
>   
>       if (!s->has_discard) {
>           return -ENOTSUP;
>       }
>   
> +    /* Invalidate seek_data_cache if it overlaps */
> +    sdc = &s->seek_data_cache;
> +    if (sdc->valid && !(sdc->end < aiocb->aio_offset ||
> +                        sdc->start > aiocb->aio_offset + aiocb->aio_nbytes))

and <= and >=

and if we add the same to handle_aiocb_write_zeroes(), then it's worth creating a helper
function to invalidate the cache.

> +    {
> +        sdc->valid = false;
> +    }
> +
>       if (aiocb->aio_type & QEMU_AIO_BLKDEV) {
>   #ifdef BLKDISCARD
>           do {
> @@ -2424,6 +2448,8 @@ static int coroutine_fn raw_co_block_status(BlockDriverState *bs,
>                                               int64_t *map,
>                                               BlockDriverState **file)
>   {
> +    BDRVRawState *s = bs->opaque;
> +    struct seek_data_cache *sdc;
>       off_t data = 0, hole = 0;
>       int ret;
>   
> @@ -2439,6 +2465,14 @@ static int coroutine_fn raw_co_block_status(BlockDriverState *bs,
>           return BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID;
>       }
>   
> +    sdc = &s->seek_data_cache;
> +    if (sdc->valid && sdc->start <= offset && sdc->end > offset) {
> +        *pnum = MIN(bytes, sdc->end - offset);
> +        *map = offset;
> +        *file = bs;
> +        return BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID;
> +    }
> +
>       ret = find_allocation(bs, offset, &data, &hole);
>       if (ret == -ENXIO) {
>           /* Trailing hole */
> @@ -2451,14 +2485,27 @@ static int coroutine_fn raw_co_block_status(BlockDriverState *bs,
>       } else if (data == offset) {
>           /* On a data extent, compute bytes to the end of the extent,
>            * possibly including a partial sector at EOF. */
> -        *pnum = MIN(bytes, hole - offset);
> +        *pnum = hole - offset;

hmm, why? At least you didn't mention it in the commit message.

>           ret = BDRV_BLOCK_DATA;
>       } else {
>           /* On a hole, compute bytes to the beginning of the next extent.  */
>           assert(hole == offset);
> -        *pnum = MIN(bytes, data - offset);
> +        *pnum = data - offset;
>           ret = BDRV_BLOCK_ZERO;
>       }
> +
> +    /* Caching allocated ranges is okay even if another process writes to the
> +     * same file because we allow declaring things allocated even if there is a
> +     * hole. However, we cannot cache holes without risking corruption. */
> +    if (ret == BDRV_BLOCK_DATA) {
> +        *sdc = (struct seek_data_cache) {
> +            .valid  = true,
> +            .start  = offset,
> +            .end    = offset + *pnum,
> +        };
> +    }
> +
> +    *pnum = MIN(*pnum, bytes);
>       *map = offset;
>       *file = bs;
>       return ret | BDRV_BLOCK_OFFSET_VALID;
>
Kevin Wolf Jan. 24, 2019, 4:36 p.m. UTC | #7
Am 24.01.2019 um 17:18 hat Vladimir Sementsov-Ogievskiy geschrieben:
> 24.01.2019 17:17, Kevin Wolf wrote:
> > Depending on the exact image layout and the storage backend (tmpfs is
> > known to have very slow SEEK_HOLE/SEEK_DATA), caching lseek results can
> > save us a lot of time e.g. during a mirror block job or qemu-img convert
> > with a fragmented source image (.bdrv_co_block_status on the protocol
> > layer can be called for every single cluster in the extreme case).
> > 
> > We may only cache data regions because of possible concurrent writers.
> > This means that we can later treat a recently punched hole as data, but
> > this is safe. We can't cache holes because then we might treat recently
> > written data as holes, which can cause corruption.
> > 
> > Signed-off-by: Kevin Wolf <kwolf@redhat.com>

> > @@ -1555,8 +1561,17 @@ static int handle_aiocb_write_zeroes_unmap(void *opaque)
> >   {
> >       RawPosixAIOData *aiocb = opaque;
> >       BDRVRawState *s G_GNUC_UNUSED = aiocb->bs->opaque;
> > +    struct seek_data_cache *sdc;
> >       int ret;
> >   
> > +    /* Invalidate seek_data_cache if it overlaps */
> > +    sdc = &s->seek_data_cache;
> > +    if (sdc->valid && !(sdc->end < aiocb->aio_offset ||
> > +                        sdc->start > aiocb->aio_offset + aiocb->aio_nbytes))
> 
> to be precise: <= and >=

Yes, you're right.

> > +    {
> > +        sdc->valid = false;
> > +    }
> > +
> >       /* First try to write zeros and unmap at the same time */
> >   
> 
> 
> Why not drop the cache in handle_aiocb_write_zeroes() as well? Otherwise, we'll return DATA
> for these regions, which may be unallocated and read as zero, if I'm not mistaken.

handle_aiocb_write_zeroes() is not allowed to unmap things, so we don't
need to invalidate the cache there.

> >   #ifdef CONFIG_FALLOCATE_PUNCH_HOLE
> > @@ -1634,11 +1649,20 @@ static int handle_aiocb_discard(void *opaque)
> >       RawPosixAIOData *aiocb = opaque;
> >       int ret = -EOPNOTSUPP;
> >       BDRVRawState *s = aiocb->bs->opaque;
> > +    struct seek_data_cache *sdc;
> >   
> >       if (!s->has_discard) {
> >           return -ENOTSUP;
> >       }
> >   
> > +    /* Invalidate seek_data_cache if it overlaps */
> > +    sdc = &s->seek_data_cache;
> > +    if (sdc->valid && !(sdc->end < aiocb->aio_offset ||
> > +                        sdc->start > aiocb->aio_offset + aiocb->aio_nbytes))
> 
> and <= and >=
> 
> and if we add the same to handle_aiocb_write_zeroes(), then it's worth
> creating a helper function to invalidate the cache.

Ok.

> > +    {
> > +        sdc->valid = false;
> > +    }
> > +
> >       if (aiocb->aio_type & QEMU_AIO_BLKDEV) {
> >   #ifdef BLKDISCARD
> >           do {
> > @@ -2424,6 +2448,8 @@ static int coroutine_fn raw_co_block_status(BlockDriverState *bs,
> >                                               int64_t *map,
> >                                               BlockDriverState **file)
> >   {
> > +    BDRVRawState *s = bs->opaque;
> > +    struct seek_data_cache *sdc;
> >       off_t data = 0, hole = 0;
> >       int ret;
> >   
> > @@ -2439,6 +2465,14 @@ static int coroutine_fn raw_co_block_status(BlockDriverState *bs,
> >           return BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID;
> >       }
> >   
> > +    sdc = &s->seek_data_cache;
> > +    if (sdc->valid && sdc->start <= offset && sdc->end > offset) {
> > +        *pnum = MIN(bytes, sdc->end - offset);
> > +        *map = offset;
> > +        *file = bs;
> > +        return BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID;
> > +    }
> > +
> >       ret = find_allocation(bs, offset, &data, &hole);
> >       if (ret == -ENXIO) {
> >           /* Trailing hole */
> > @@ -2451,14 +2485,27 @@ static int coroutine_fn raw_co_block_status(BlockDriverState *bs,
> >       } else if (data == offset) {
> >           /* On a data extent, compute bytes to the end of the extent,
> >            * possibly including a partial sector at EOF. */
> > -        *pnum = MIN(bytes, hole - offset);
> > +        *pnum = hole - offset;
> 
> hmm, why? At least you didn't mention it in the commit message.

We want to cache the whole range returned by lseek(), not just whatever
the raw_co_block_status() caller wanted to know.

For the returned value, *pnum is adjusted to MIN(bytes, *pnum) below...

> >           ret = BDRV_BLOCK_DATA;
> >       } else {
> >           /* On a hole, compute bytes to the beginning of the next extent.  */
> >           assert(hole == offset);
> > -        *pnum = MIN(bytes, data - offset);
> > +        *pnum = data - offset;
> >           ret = BDRV_BLOCK_ZERO;
> >       }
> > +
> > +    /* Caching allocated ranges is okay even if another process writes to the
> > +     * same file because we allow declaring things allocated even if there is a
> > +     * hole. However, we cannot cache holes without risking corruption. */
> > +    if (ret == BDRV_BLOCK_DATA) {
> > +        *sdc = (struct seek_data_cache) {
> > +            .valid  = true,
> > +            .start  = offset,
> > +            .end    = offset + *pnum,
> > +        };
> > +    }
> > +
> > +    *pnum = MIN(*pnum, bytes);

...here.

So what we return doesn't change.

> >       *map = offset;
> >       *file = bs;
> >       return ret | BDRV_BLOCK_OFFSET_VALID;

Kevin
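
To make that concrete with made-up numbers: suppose the caller asks about 64k
at offset 0 and lseek() reports a data extent covering the first 1M of the
file. Then:

    find_allocation(bs, 0, &data, &hole)  ->  data = 0, hole = 1M
    *pnum = hole - offset                 ->  1M  (cached as [0, 1M))
    *pnum = MIN(*pnum, bytes)             ->  64k (returned to the caller)

A later raw_co_block_status() call at, say, offset 512k can then be answered
from the cache without another lseek().
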
Vladimir Sementsov-Ogievskiy Jan. 25, 2019, 9:13 a.m. UTC | #8
24.01.2019 19:36, Kevin Wolf wrote:
> Am 24.01.2019 um 17:18 hat Vladimir Sementsov-Ogievskiy geschrieben:
>> 24.01.2019 17:17, Kevin Wolf wrote:
>>> Depending on the exact image layout and the storage backend (tmpfs is
>>> known to have very slow SEEK_HOLE/SEEK_DATA), caching lseek results can
>>> save us a lot of time e.g. during a mirror block job or qemu-img convert
>>> with a fragmented source image (.bdrv_co_block_status on the protocol
>>> layer can be called for every single cluster in the extreme case).
>>>
>>> We may only cache data regions because of possible concurrent writers.
>>> This means that we can later treat a recently punched hole as data, but
>>> this is safe. We can't cache holes because then we might treat recently
>>> written data as holes, which can cause corruption.
>>>
>>> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> 
>>> @@ -1555,8 +1561,17 @@ static int handle_aiocb_write_zeroes_unmap(void *opaque)
>>>    {
>>>        RawPosixAIOData *aiocb = opaque;
>>>        BDRVRawState *s G_GNUC_UNUSED = aiocb->bs->opaque;
>>> +    struct seek_data_cache *sdc;
>>>        int ret;
>>>    
>>> +    /* Invalidate seek_data_cache if it overlaps */
>>> +    sdc = &s->seek_data_cache;
>>> +    if (sdc->valid && !(sdc->end < aiocb->aio_offset ||
>>> +                        sdc->start > aiocb->aio_offset + aiocb->aio_nbytes))
>>
>> to be precise: <= and >=
> 
> Yes, you're right.
> 
>>> +    {
>>> +        sdc->valid = false;
>>> +    }
>>> +
>>>        /* First try to write zeros and unmap at the same time */
>>>    
>>
>>
>> Why not drop the cache in handle_aiocb_write_zeroes() as well? Otherwise, we'll return DATA
>> for these regions, which may be unallocated and read as zero, if I'm not mistaken.
> 
> handle_aiocb_write_zeroes() is not allowed to unmap things, so we don't
> need to invalidate the cache there.

So you're saying that for fallocated regions we always return just _DATA, without _ZERO?
If so, that's of course bad: it means that convert will have to copy (or at least read
and detect zeroes by hand, if enabled) areas that were write-zeroed without unmap.

Let's check (hmm, I had to use qemu-img map inside qemu-io, patch attached for it;
I also added a printf("%s\n", __func__) in handle_aiocb_write_zeroes_unmap and
handle_aiocb_write_zeroes):

Let's test:
]# cat test
./qemu-img create -f raw x 1M

./qemu-io -f raw x <<CMDS
write 0 1M
map
write -z 100K 100K
map
write -z -u 500K 100K
map
CMDS

rm -rf x

before your patch:
]# ./test
Formatting 'x', fmt=raw size=1048576
qemu-io> wrote 1048576/1048576 bytes at offset 0
1 MiB, 1 ops; 0.0523 sec (19.093 MiB/sec and 19.0927 ops/sec)
qemu-io> [{ "start": 0, "length": 1048576, "depth": 0, "zero": false, "data": true, "offset": 0}]
qemu-io> handle_aiocb_write_zeroes
wrote 102400/102400 bytes at offset 102400
100 KiB, 1 ops; 0.0165 sec (5.898 MiB/sec and 60.3974 ops/sec)
qemu-io> [{ "start": 0, "length": 102400, "depth": 0, "zero": false, "data": true, "offset": 0},
{ "start": 102400, "length": 102400, "depth": 0, "zero": true, "data": false, "offset": 102400},
{ "start": 204800, "length": 843776, "depth": 0, "zero": false, "data": true, "offset": 204800}]
qemu-io> handle_aiocb_write_zeroes_unmap
wrote 102400/102400 bytes at offset 512000
100 KiB, 1 ops; 0.0001 sec (545.566 MiB/sec and 5586.5922 ops/sec)
qemu-io> [{ "start": 0, "length": 102400, "depth": 0, "zero": false, "data": true, "offset": 0},
{ "start": 102400, "length": 102400, "depth": 0, "zero": true, "data": false, "offset": 102400},
{ "start": 204800, "length": 307200, "depth": 0, "zero": false, "data": true, "offset": 204800},
{ "start": 512000, "length": 102400, "depth": 0, "zero": true, "data": false, "offset": 512000},
{ "start": 614400, "length": 434176, "depth": 0, "zero": false, "data": true, "offset": 614400}]



after your patch:
# ./test
Formatting 'x', fmt=raw size=1048576
qemu-io> wrote 1048576/1048576 bytes at offset 0
1 MiB, 1 ops; 0.0768 sec (13.019 MiB/sec and 13.0195 ops/sec)
qemu-io> [{ "start": 0, "length": 1048576, "depth": 0, "zero": false, "data": true, "offset": 0}]
qemu-io> handle_aiocb_write_zeroes
wrote 102400/102400 bytes at offset 102400
100 KiB, 1 ops; 0.0166 sec (5.883 MiB/sec and 60.2410 ops/sec)
qemu-io> [{ "start": 0, "length": 1048576, "depth": 0, "zero": false, "data": true, "offset": 0}]
qemu-io> handle_aiocb_write_zeroes_unmap
wrote 102400/102400 bytes at offset 512000
100 KiB, 1 ops; 0.0002 sec (469.501 MiB/sec and 4807.6923 ops/sec)
qemu-io> [{ "start": 0, "length": 102400, "depth": 0, "zero": false, "data": true, "offset": 0},
{ "start": 102400, "length": 102400, "depth": 0, "zero": true, "data": false, "offset": 102400},
{ "start": 204800, "length": 307200, "depth": 0, "zero": false, "data": true, "offset": 204800},
{ "start": 512000, "length": 102400, "depth": 0, "zero": true, "data": false, "offset": 512000},
{ "start": 614400, "length": 434176, "depth": 0, "zero": false, "data": true, "offset": 614400}]


So you've changed the behavior of block_status after write_zeroes without UNMAP for the worse.

Hmm, should I prepare a patch for qemu-io? qemu-img map is definitely better.

> 
>>>    #ifdef CONFIG_FALLOCATE_PUNCH_HOLE
>>> @@ -1634,11 +1649,20 @@ static int handle_aiocb_discard(void *opaque)
>>>        RawPosixAIOData *aiocb = opaque;
>>>        int ret = -EOPNOTSUPP;
>>>        BDRVRawState *s = aiocb->bs->opaque;
>>> +    struct seek_data_cache *sdc;
>>>    
>>>        if (!s->has_discard) {
>>>            return -ENOTSUP;
>>>        }
>>>    
>>> +    /* Invalidate seek_data_cache if it overlaps */
>>> +    sdc = &s->seek_data_cache;
>>> +    if (sdc->valid && !(sdc->end < aiocb->aio_offset ||
>>> +                        sdc->start > aiocb->aio_offset + aiocb->aio_nbytes))
>>
>> and <= and >=
>>
>> and if we add the same to handle_aiocb_write_zeroes(), then it's worth
>> creating a helper function to invalidate the cache.
> 
> Ok.
> 
>>> +    {
>>> +        sdc->valid = false;
>>> +    }
>>> +
>>>        if (aiocb->aio_type & QEMU_AIO_BLKDEV) {
>>>    #ifdef BLKDISCARD
>>>            do {
>>> @@ -2424,6 +2448,8 @@ static int coroutine_fn raw_co_block_status(BlockDriverState *bs,
>>>                                                int64_t *map,
>>>                                                BlockDriverState **file)
>>>    {
>>> +    BDRVRawState *s = bs->opaque;
>>> +    struct seek_data_cache *sdc;
>>>        off_t data = 0, hole = 0;
>>>        int ret;
>>>    
>>> @@ -2439,6 +2465,14 @@ static int coroutine_fn raw_co_block_status(BlockDriverState *bs,
>>>            return BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID;
>>>        }
>>>    
>>> +    sdc = &s->seek_data_cache;
>>> +    if (sdc->valid && sdc->start <= offset && sdc->end > offset) {
>>> +        *pnum = MIN(bytes, sdc->end - offset);
>>> +        *map = offset;
>>> +        *file = bs;
>>> +        return BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID;
>>> +    }
>>> +
>>>        ret = find_allocation(bs, offset, &data, &hole);
>>>        if (ret == -ENXIO) {
>>>            /* Trailing hole */
>>> @@ -2451,14 +2485,27 @@ static int coroutine_fn raw_co_block_status(BlockDriverState *bs,
>>>        } else if (data == offset) {
>>>            /* On a data extent, compute bytes to the end of the extent,
>>>             * possibly including a partial sector at EOF. */
>>> -        *pnum = MIN(bytes, hole - offset);
>>> +        *pnum = hole - offset;
>>
>> hmm, why? At least you didn't mention it in the commit message.
> 
> We want to cache the whole range returned by lseek(), not just whatever
> the raw_co_block_status() caller wanted to know.
> 
> For the returned value, *pnum is adjusted to MIN(bytes, *pnum) below...

Oops, stupid question it was, sorry:(

> 
>>>            ret = BDRV_BLOCK_DATA;
>>>        } else {
>>>            /* On a hole, compute bytes to the beginning of the next extent.  */
>>>            assert(hole == offset);
>>> -        *pnum = MIN(bytes, data - offset);
>>> +        *pnum = data - offset;
>>>            ret = BDRV_BLOCK_ZERO;
>>>        }
>>> +
>>> +    /* Caching allocated ranges is okay even if another process writes to the
>>> +     * same file because we allow declaring things allocated even if there is a
>>> +     * hole. However, we cannot cache holes without risking corruption. */
>>> +    if (ret == BDRV_BLOCK_DATA) {
>>> +        *sdc = (struct seek_data_cache) {
>>> +            .valid  = true,
>>> +            .start  = offset,
>>> +            .end    = offset + *pnum,
>>> +        };
>>> +    }
>>> +
>>> +    *pnum = MIN(*pnum, bytes);
> 
> ...here.
> 
> So what we return doesn't change.
> 
>>>        *map = offset;
>>>        *file = bs;
>>>        return ret | BDRV_BLOCK_OFFSET_VALID;
> 
> Kevin
>
Paolo Bonzini Jan. 25, 2019, 10:10 a.m. UTC | #9
On 24/01/19 16:42, Kevin Wolf wrote:
>> Recently Paolo asked me not to add more users of AioContext lock. Unfortunately
>> I don't understand the whole picture around it.. Doesn't this apply here?
>> https://lists.gnu.org/archive/html/qemu-devel/2018-12/msg03410.html
> I don't know. Honestly I feel nobody except Paolo knows, because we
> don't know his patches yet.

This is true.  On the other hand, the AioContext lock is only used in
some special cases around block jobs and blk_set_aio_context, and in
general the block devices already should not have any dependencies
(unless they crept in without me noticing).

In particular...

> But raw doesn't have an s->lock yet, so I
> think removing the AioContext lock involves some work on it anyway and
> adding this doesn't really change the amount of work.

... BDRVRawState doesn't have any data that changes after open, does it?
 This is why it doesn't have an s->lock.

Paolo
Kevin Wolf Jan. 25, 2019, 10:30 a.m. UTC | #10
Am 25.01.2019 um 11:10 hat Paolo Bonzini geschrieben:
> On 24/01/19 16:42, Kevin Wolf wrote:
> >> Recently Paolo asked me not to add more users of AioContext lock. Unfortunately
> >> I don't understand the whole picture around it.. Doesn't this apply here?
> >> https://lists.gnu.org/archive/html/qemu-devel/2018-12/msg03410.html
> > I don't know. Honestly I feel nobody except Paolo knows, because we
> > don't know his patches yet.
> 
> This is true.  On the other hand, the AioContext lock is only used in
> some special cases around block jobs and blk_set_aio_context, and in
> general the block devices already should not have any dependencies
> (unless they crept in without me noticing).

It's also used in those cases where coroutines don't need locking, but
threads would. Did you audit all of the drivers for such cases?

> In particular...
> 
> > But raw doesn't have an s->lock yet, so I
> > think removing the AioContext lock involves some work on it anyway and
> > adding this doesn't really change the amount of work.
> 
> ... BDRVRawState doesn't have any data that changes after open, does it?
>  This is why it doesn't have an s->lock.

No important data anyway. We do things like setting s->has_write_zeroes
= false after failure, but if we got a race and end up trying twice
before disabling it, it doesn't really hurt either.

Then there is reopen, but that involves a drain anyway. And that's it
probably.

So do you think I should introduce a CoMutex for raw here? Or QemuMutex?

Kevin
Eric Blake Jan. 25, 2019, 1:26 p.m. UTC | #11
On 1/25/19 3:13 AM, Vladimir Sementsov-Ogievskiy wrote:

> 
> before your patch:
> ]# ./test
> Formatting 'x', fmt=raw size=1048576
> qemu-io> wrote 1048576/1048576 bytes at offset 0
> 1 MiB, 1 ops; 0.0523 sec (19.093 MiB/sec and 19.0927 ops/sec)
> qemu-io> [{ "start": 0, "length": 1048576, "depth": 0, "zero": false, "data": true, "offset": 0}]
> qemu-io> handle_aiocb_write_zeroes
> wrote 102400/102400 bytes at offset 102400
> 100 KiB, 1 ops; 0.0165 sec (5.898 MiB/sec and 60.3974 ops/sec)
> qemu-io> [{ "start": 0, "length": 102400, "depth": 0, "zero": false, "data": true, "offset": 0},
> { "start": 102400, "length": 102400, "depth": 0, "zero": true, "data": false, "offset": 102400},
> { "start": 204800, "length": 843776, "depth": 0, "zero": false, "data": true, "offset": 204800}]
> qemu-io> handle_aiocb_write_zeroes_unmap
> wrote 102400/102400 bytes at offset 512000
> 100 KiB, 1 ops; 0.0001 sec (545.566 MiB/sec and 5586.5922 ops/sec)
> qemu-io> [{ "start": 0, "length": 102400, "depth": 0, "zero": false, "data": true, "offset": 0},
> { "start": 102400, "length": 102400, "depth": 0, "zero": true, "data": false, "offset": 102400},
> { "start": 204800, "length": 307200, "depth": 0, "zero": false, "data": true, "offset": 204800},
> { "start": 512000, "length": 102400, "depth": 0, "zero": true, "data": false, "offset": 512000},
> { "start": 614400, "length": 434176, "depth": 0, "zero": false, "data": true, "offset": 614400}]

Your pre-patch demonstration shows that both 'write -z' and 'write -z
-u' produced areas of the raw image that were marked as known zeroes
and claim 'data':false, meaning that those two areas are sparse (that is,
it appears 'write -z' managed to unmap after all?)

> 
> 
> 
> after your patch:
> # ./test
> Formatting 'x', fmt=raw size=1048576
> qemu-io> wrote 1048576/1048576 bytes at offset 0
> 1 MiB, 1 ops; 0.0768 sec (13.019 MiB/sec and 13.0195 ops/sec)
> qemu-io> [{ "start": 0, "length": 1048576, "depth": 0, "zero": false, "data": true, "offset": 0}]
> qemu-io> handle_aiocb_write_zeroes
> wrote 102400/102400 bytes at offset 102400
> 100 KiB, 1 ops; 0.0166 sec (5.883 MiB/sec and 60.2410 ops/sec)
> qemu-io> [{ "start": 0, "length": 1048576, "depth": 0, "zero": false, "data": true, "offset": 0}]

So here, the cache was not invalidated, so the 'write -z' area was
temporarily reported as data...

> qemu-io> handle_aiocb_write_zeroes_unmap
> wrote 102400/102400 bytes at offset 512000
> 100 KiB, 1 ops; 0.0002 sec (469.501 MiB/sec and 4807.6923 ops/sec)
> qemu-io> [{ "start": 0, "length": 102400, "depth": 0, "zero": false, "data": true, "offset": 0},
> { "start": 102400, "length": 102400, "depth": 0, "zero": true, "data": false, "offset": 102400},
> { "start": 204800, "length": 307200, "depth": 0, "zero": false, "data": true, "offset": 204800},
> { "start": 512000, "length": 102400, "depth": 0, "zero": true, "data": false, "offset": 512000},
> { "start": 614400, "length": 434176, "depth": 0, "zero": false, "data": true, "offset": 614400}]

but after the 'write -z -u' invalidated the cache, we once again see
that the 'write -z' worked.

> 
> 
> So, you've changed behavior of block_status after write_zeroes without UNMAP for the worse.
> 
> Hmm, should I prepare patch for qemu-io? qemu-img map is definitely better.

qemu-io map is NOT asking the same information as qemu-img map (it's
annoying - but BOTH pieces of information are useful, and there are some
iotests that check both outputs to get a full picture of things).
qemu-img asks as much information as possible about all layers, while
qemu-io asks only about the top layer.

That said, qemu-io is NOT baked in stone; if you want to patch it AND
fix the iotest fallout, I'm not opposed to that.
Kevin Wolf Jan. 29, 2019, 10:56 a.m. UTC | #12
Am 24.01.2019 um 16:56 hat Eric Blake geschrieben:
> On 1/24/19 8:17 AM, Kevin Wolf wrote:
> > Depending on the exact image layout and the storage backend (tmpfs is
> > known to have very slow SEEK_HOLE/SEEK_DATA), caching lseek results can
> > save us a lot of time e.g. during a mirror block job or qemu-img convert
> > with a fragmented source image (.bdrv_co_block_status on the protocol
> > layer can be called for every single cluster in the extreme case).
> > 
> > We may only cache data regions because of possible concurrent writers.
> > This means that we can later treat a recently punched hole as data, but
> > this is safe. We can't cache holes because then we might treat recently
> > written data as holes, which can cause corruption.
> 
> gluster copies heavily from file-posix's implementation; should it also
> copy this cache of known-data?  Should NBD also cache known-data when
> NBD_CMD_BLOCK_STATUS is available?

This almost suggests that we should do the caching in generic block
layer code.

It would require that we can return a *pnum from the block driver that
is larger than the requested bytes, but it looks like
raw_co_block_status() already handles this? We just don't seem to do
this yet in the block drivers.

If we want to cache for all drivers, however, the question is whether
there are drivers that can transition a block from data to hole without
a discard operation, so that we would have to invalidate the cache in
more places. One thing that comes to mind is loading an internal
snapshot for qcow2.

Or maybe we need to make this opt-in for drivers, with a bool flag in
BlockDriver?

Kevin
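
Purely as an illustration of that idea (neither the field nor the check below
exists; the name is made up), the opt-in could be a flag that the generic code
consults before caching:

    /* in BlockDriver, set by drivers where data can only turn into a hole
     * via a discard that the block layer sees */
    bool supports_block_status_cache;

    /* in bdrv_co_block_status() */
    if (bs->drv->supports_block_status_cache && (ret & BDRV_BLOCK_DATA)) {
        /* remember [offset, offset + *pnum) as known data */
    }
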
Eric Blake Jan. 29, 2019, 9:03 p.m. UTC | #13
On 1/29/19 4:56 AM, Kevin Wolf wrote:

>> gluster copies heavily from file-posix's implementation; should it also
>> copy this cache of known-data?  Should NBD also cache known-data when
>> NBD_CMD_BLOCK_STATUS is available?
> 
> This almost suggests that we should do the caching in generic block
> layer code.
> 
> It would require that we can return a *pnum from the block driver that
> is larger than the requested bytes from, but it looks like
> raw_co_block_status() already handles this? We just don't seem to do
> this yet in the block drivers.

The code in io.c bdrv_co_block_status() currently does one final
clamp-down to limit the answer to the caller's maximum request, but I
don't know if any drivers actually take advantage of passing back larger
than the request. I _do_ know that the NBD protocol took pains to ensure
that NBD_CMD_BLOCK_STATUS is permitted to return a value beyond the
caller's request if such information was easily obtained, precisely
because the idea of letting the caller cache the knowledge of a data
section that extends beyond the current query's area of interest may be
useful to minimize the need to make future block status calls.

> 
> If we want to cache for all drivers, however, the question is whether
> there are drivers that can transition a block from data to hole without
> a discard operation, so that we would have to invalidate the cache in
> more places. One thing that comes to mind is loading an internal
> snapshot for qcow2.

Oh, good point - switching to a different L1 table (due to loading an
internal snapshot) can indeed make a hole appear that used to read as
data, so if the block layer caches data ranges, it also needs to provide
a hook for drivers to invalidate the cache when doing unusual actions.
Still, I can't think of any place where a hole spontaneously appears
unless a specific driver action is taken (so the driver should have the
opportunity to invalidate the cache during that action), or if an image
is in active use by more than just the qemu process.  And if the driver
knows that an image might be shared with external processes modifying
the image, then yes, maybe having a way to opt out of caching altogether
is also appropriate.

> 
> Or maybe we need to make this opt-in for drivers, with a bool flag in
> BlockDriver?
> 
> Kevin
>
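
If the cache lived in the generic layer, the hook Eric describes could be as
simple as the following sketch (the function and field names are invented
here):

    /* Called by a driver from operations that can turn data into holes
     * behind the block layer's back, e.g. loading a qcow2 internal
     * snapshot. */
    void bdrv_block_status_cache_invalidate(BlockDriverState *bs)
    {
        bs->block_status_cache.valid = false;
    }
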
Paolo Bonzini Feb. 4, 2019, 10:17 a.m. UTC | #14
On 25/01/19 11:30, Kevin Wolf wrote:
>> On the other hand, the AioContext lock is only used in
>> some special cases around block jobs and blk_set_aio_context, and in
>> general the block devices already should not have any dependencies
>> (unless they crept in without me noticing).
> It's also used in those cases where coroutines don't need locking, but
> threads would. Did you audit all of the drivers for such cases?

I did and the drivers already have a QemuMutex if that's the case (e.g.
curl, iscsi).

>> In particular...
>>
>>> But raw doesn't have an s->lock yet, so I
>>> think removing the AioContext lock involves some work on it anyway and
>>> adding this doesn't really change the amount of work.
>> ... BDRVRawState doesn't have any data that changes after open, does it?
>>  This is why it doesn't have an s->lock.
> No important data anyway. We do things like setting s->has_write_zeroes
> = false after failure, but if we got a race and end up trying twice
> before disabling it, it doesn't really hurt either.
> 
> Then there is reopen, but that involves a drain anyway. And that's it
> probably.
> 
> So do you think I should introduce a CoMutex for raw here? Or QemuMutex?

For the cache you can introduce either a CoMutex or QemuMutex, it's the
same.

Paolo
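
As a sketch of what the QemuMutex variant could look like (sdc_lock is an
assumed new field in BDRVRawState, initialized with qemu_mutex_init() during
open; only the lookup side is shown, and the update and invalidation paths
would take the same lock):

    static bool raw_seek_cache_lookup(BDRVRawState *s, uint64_t offset,
                                      uint64_t *end)
    {
        bool hit = false;

        qemu_mutex_lock(&s->sdc_lock);
        if (s->seek_data_cache.valid &&
            s->seek_data_cache.start <= offset &&
            s->seek_data_cache.end > offset)
        {
            *end = s->seek_data_cache.end;
            hit = true;
        }
        qemu_mutex_unlock(&s->sdc_lock);

        return hit;
    }

With a CoMutex the calls would be qemu_co_mutex_lock()/qemu_co_mutex_unlock()
instead.
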
Patch

diff --git a/block/file-posix.c b/block/file-posix.c
index 8aee7a3fb8..7272c7c99d 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -168,6 +168,12 @@  typedef struct BDRVRawState {
     bool needs_alignment;
     bool check_cache_dropped;
 
+    struct seek_data_cache {
+        bool        valid;
+        uint64_t    start;
+        uint64_t    end;
+    } seek_data_cache;
+
     PRManager *pr_mgr;
 } BDRVRawState;
 
@@ -1555,8 +1561,17 @@  static int handle_aiocb_write_zeroes_unmap(void *opaque)
 {
     RawPosixAIOData *aiocb = opaque;
     BDRVRawState *s G_GNUC_UNUSED = aiocb->bs->opaque;
+    struct seek_data_cache *sdc;
     int ret;
 
+    /* Invalidate seek_data_cache if it overlaps */
+    sdc = &s->seek_data_cache;
+    if (sdc->valid && !(sdc->end < aiocb->aio_offset ||
+                        sdc->start > aiocb->aio_offset + aiocb->aio_nbytes))
+    {
+        sdc->valid = false;
+    }
+
     /* First try to write zeros and unmap at the same time */
 
 #ifdef CONFIG_FALLOCATE_PUNCH_HOLE
@@ -1634,11 +1649,20 @@  static int handle_aiocb_discard(void *opaque)
     RawPosixAIOData *aiocb = opaque;
     int ret = -EOPNOTSUPP;
     BDRVRawState *s = aiocb->bs->opaque;
+    struct seek_data_cache *sdc;
 
     if (!s->has_discard) {
         return -ENOTSUP;
     }
 
+    /* Invalidate seek_data_cache if it overlaps */
+    sdc = &s->seek_data_cache;
+    if (sdc->valid && !(sdc->end < aiocb->aio_offset ||
+                        sdc->start > aiocb->aio_offset + aiocb->aio_nbytes))
+    {
+        sdc->valid = false;
+    }
+
     if (aiocb->aio_type & QEMU_AIO_BLKDEV) {
 #ifdef BLKDISCARD
         do {
@@ -2424,6 +2448,8 @@  static int coroutine_fn raw_co_block_status(BlockDriverState *bs,
                                             int64_t *map,
                                             BlockDriverState **file)
 {
+    BDRVRawState *s = bs->opaque;
+    struct seek_data_cache *sdc;
     off_t data = 0, hole = 0;
     int ret;
 
@@ -2439,6 +2465,14 @@  static int coroutine_fn raw_co_block_status(BlockDriverState *bs,
         return BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID;
     }
 
+    sdc = &s->seek_data_cache;
+    if (sdc->valid && sdc->start <= offset && sdc->end > offset) {
+        *pnum = MIN(bytes, sdc->end - offset);
+        *map = offset;
+        *file = bs;
+        return BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID;
+    }
+
     ret = find_allocation(bs, offset, &data, &hole);
     if (ret == -ENXIO) {
         /* Trailing hole */
@@ -2451,14 +2485,27 @@  static int coroutine_fn raw_co_block_status(BlockDriverState *bs,
     } else if (data == offset) {
         /* On a data extent, compute bytes to the end of the extent,
          * possibly including a partial sector at EOF. */
-        *pnum = MIN(bytes, hole - offset);
+        *pnum = hole - offset;
         ret = BDRV_BLOCK_DATA;
     } else {
         /* On a hole, compute bytes to the beginning of the next extent.  */
         assert(hole == offset);
-        *pnum = MIN(bytes, data - offset);
+        *pnum = data - offset;
         ret = BDRV_BLOCK_ZERO;
     }
+
+    /* Caching allocated ranges is okay even if another process writes to the
+     * same file because we allow declaring things allocated even if there is a
+     * hole. However, we cannot cache holes without risking corruption. */
+    if (ret == BDRV_BLOCK_DATA) {
+        *sdc = (struct seek_data_cache) {
+            .valid  = true,
+            .start  = offset,
+            .end    = offset + *pnum,
+        };
+    }
+
+    *pnum = MIN(*pnum, bytes);
     *map = offset;
     *file = bs;
     return ret | BDRV_BLOCK_OFFSET_VALID;