diff mbox series

[v3] block/rbd: implement .bdrv_get_allocated_file_size callback

Message ID 20190705093258.47856-1-sgarzare@redhat.com (mailing list archive)
State New, archived
Headers show
Series [v3] block/rbd: implement .bdrv_get_allocated_file_size callback | expand

Commit Message

Stefano Garzarella July 5, 2019, 9:32 a.m. UTC
This patch allows 'qemu-img info' to show the 'disk size' for
the RBD images that have the fast-diff feature enabled.

If this feature is enabled, we use the rbd_diff_iterate2() API
to calculate the allocated size for the image.

Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
---
v3:
  - return -ENOTSUP instead of -1 when fast-diff is not available
    [John, Jason]
v2:
  - calculate the actual usage only if the fast-diff feature is
    enabled [Jason]
---
 block/rbd.c | 54 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 54 insertions(+)

Comments

Max Reitz July 5, 2019, 9:58 a.m. UTC | #1
On 05.07.19 11:32, Stefano Garzarella wrote:
> This patch allows 'qemu-img info' to show the 'disk size' for
> the RBD images that have the fast-diff feature enabled.
> 
> If this feature is enabled, we use the rbd_diff_iterate2() API
> to calculate the allocated size for the image.
> 
> Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
> ---
> v3:
>   - return -ENOTSUP instead of -1 when fast-diff is not available
>     [John, Jason]
> v2:
>   - calculate the actual usage only if the fast-diff feature is
>     enabled [Jason]
> ---
>  block/rbd.c | 54 +++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 54 insertions(+)

Well, the librbd documentation is non-existing as always, but while
googling, I at least found that libvirt has exactly the same code.  So I
suppose it must be quite correct, then.

> diff --git a/block/rbd.c b/block/rbd.c
> index 59757b3120..b6bed683e5 100644
> --- a/block/rbd.c
> +++ b/block/rbd.c
> @@ -1084,6 +1084,59 @@ static int64_t qemu_rbd_getlength(BlockDriverState *bs)
>      return info.size;
>  }
>  
> +static int rbd_allocated_size_cb(uint64_t offset, size_t len, int exists,
> +                                 void *arg)
> +{
> +    int64_t *alloc_size = (int64_t *) arg;
> +
> +    if (exists) {
> +        (*alloc_size) += len;
> +    }
> +
> +    return 0;
> +}
> +
> +static int64_t qemu_rbd_get_allocated_file_size(BlockDriverState *bs)
> +{
> +    BDRVRBDState *s = bs->opaque;
> +    uint64_t flags, features;
> +    int64_t alloc_size = 0;
> +    int r;
> +
> +    r = rbd_get_flags(s->image, &flags);
> +    if (r < 0) {
> +        return r;
> +    }
> +
> +    r = rbd_get_features(s->image, &features);
> +    if (r < 0) {
> +        return r;
> +    }
> +
> +    /*
> +     * We use rbd_diff_iterate2() only if the RBD image have fast-diff
> +     * feature enabled. If it is disabled, rbd_diff_iterate2() could be
> +     * very slow on a big image.
> +     */
> +    if (!(features & RBD_FEATURE_FAST_DIFF) ||
> +        (flags & RBD_FLAG_FAST_DIFF_INVALID)) {
> +        return -ENOTSUP;
> +    }
> +
> +    /*
> +     * rbd_diff_iterate2(), if the source snapshot name is NULL, invokes
> +     * the callback on all allocated regions of the image.
> +     */
> +    r = rbd_diff_iterate2(s->image, NULL, 0,
> +                          bs->total_sectors * BDRV_SECTOR_SIZE, 0, 1,
> +                          &rbd_allocated_size_cb, &alloc_size);

But I have a question.  This is basically block_status, right?  So it
gives us information on which areas are allocated and which are not.
The result thus gives us a lower bound on the allocation size, but is it
really exactly the allocation size?

There are two things I’m concerned about:

1. What about metadata?

2. If you have multiple snapshots, this will only report the overall
allocation information, right?  So say there is something like this:

(“A” means an allocated MB, “-” is an unallocated MB)

Snapshot 1: AAAA---
Snapshot 2: --AAAAA
Snapshot 3: -AAAA--

I think the allocated data size is the number of As in total (13 MB).
But I suppose this API will just return 7 MB, because it looks on
everything an it sees the whole image range (7 MB) to be allocated.  It
doesn’t report in how many snapshots some region is allocated.

Max

> +    if (r < 0) {
> +        return r;
> +    }
> +
> +    return alloc_size;
> +}
> +
>  static int coroutine_fn qemu_rbd_co_truncate(BlockDriverState *bs,
>                                               int64_t offset,
>                                               PreallocMode prealloc,
> @@ -1291,6 +1344,7 @@ static BlockDriver bdrv_rbd = {
>      .bdrv_get_info          = qemu_rbd_getinfo,
>      .create_opts            = &qemu_rbd_create_opts,
>      .bdrv_getlength         = qemu_rbd_getlength,
> +    .bdrv_get_allocated_file_size = qemu_rbd_get_allocated_file_size,
>      .bdrv_co_truncate       = qemu_rbd_co_truncate,
>      .protocol_name          = "rbd",
>  
>
Stefano Garzarella July 5, 2019, 10:43 a.m. UTC | #2
On Fri, Jul 05, 2019 at 11:58:43AM +0200, Max Reitz wrote:
> On 05.07.19 11:32, Stefano Garzarella wrote:
> > This patch allows 'qemu-img info' to show the 'disk size' for
> > the RBD images that have the fast-diff feature enabled.
> > 
> > If this feature is enabled, we use the rbd_diff_iterate2() API
> > to calculate the allocated size for the image.
> > 
> > Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
> > ---
> > v3:
> >   - return -ENOTSUP instead of -1 when fast-diff is not available
> >     [John, Jason]
> > v2:
> >   - calculate the actual usage only if the fast-diff feature is
> >     enabled [Jason]
> > ---
> >  block/rbd.c | 54 +++++++++++++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 54 insertions(+)
> 
> Well, the librbd documentation is non-existing as always, but while
> googling, I at least found that libvirt has exactly the same code.  So I
> suppose it must be quite correct, then.
> 

While I wrote this code I took a look at libvirt implementation and also
at the "rbd" tool in the ceph repository: compute_image_disk_usage() in
src/tools/rbd/action/DiskUsage.cc

> > diff --git a/block/rbd.c b/block/rbd.c
> > index 59757b3120..b6bed683e5 100644
> > --- a/block/rbd.c
> > +++ b/block/rbd.c
> > @@ -1084,6 +1084,59 @@ static int64_t qemu_rbd_getlength(BlockDriverState *bs)
> >      return info.size;
> >  }
> >  
> > +static int rbd_allocated_size_cb(uint64_t offset, size_t len, int exists,
> > +                                 void *arg)
> > +{
> > +    int64_t *alloc_size = (int64_t *) arg;
> > +
> > +    if (exists) {
> > +        (*alloc_size) += len;
> > +    }
> > +
> > +    return 0;
> > +}
> > +
> > +static int64_t qemu_rbd_get_allocated_file_size(BlockDriverState *bs)
> > +{
> > +    BDRVRBDState *s = bs->opaque;
> > +    uint64_t flags, features;
> > +    int64_t alloc_size = 0;
> > +    int r;
> > +
> > +    r = rbd_get_flags(s->image, &flags);
> > +    if (r < 0) {
> > +        return r;
> > +    }
> > +
> > +    r = rbd_get_features(s->image, &features);
> > +    if (r < 0) {
> > +        return r;
> > +    }
> > +
> > +    /*
> > +     * We use rbd_diff_iterate2() only if the RBD image have fast-diff
> > +     * feature enabled. If it is disabled, rbd_diff_iterate2() could be
> > +     * very slow on a big image.
> > +     */
> > +    if (!(features & RBD_FEATURE_FAST_DIFF) ||
> > +        (flags & RBD_FLAG_FAST_DIFF_INVALID)) {
> > +        return -ENOTSUP;
> > +    }
> > +
> > +    /*
> > +     * rbd_diff_iterate2(), if the source snapshot name is NULL, invokes
> > +     * the callback on all allocated regions of the image.
> > +     */
> > +    r = rbd_diff_iterate2(s->image, NULL, 0,
> > +                          bs->total_sectors * BDRV_SECTOR_SIZE, 0, 1,
> > +                          &rbd_allocated_size_cb, &alloc_size);
> 
> But I have a question.  This is basically block_status, right?  So it
> gives us information on which areas are allocated and which are not.
> The result thus gives us a lower bound on the allocation size, but is it
> really exactly the allocation size?
> 
> There are two things I’m concerned about:
> 
> 1. What about metadata?

Good question, I don't think it includes the size used by metadata and I
don't know if there is a way to know it. I'll check better.

> 
> 2. If you have multiple snapshots, this will only report the overall
> allocation information, right?  So say there is something like this:
> 
> (“A” means an allocated MB, “-” is an unallocated MB)
> 
> Snapshot 1: AAAA---
> Snapshot 2: --AAAAA
> Snapshot 3: -AAAA--
> 
> I think the allocated data size is the number of As in total (13 MB).
> But I suppose this API will just return 7 MB, because it looks on
> everything an it sees the whole image range (7 MB) to be allocated.  It
> doesn’t report in how many snapshots some region is allocated.

Looking at the documentation of rbd_diff_iterate2() [1] they says:

 *                        If the source snapshot name is NULL, we
 * interpret that as the beginning of time and return all allocated
 * regions of the image.

But I don't know the answer of your question (maybe Jason can help
here).
I should check better the implementation to understand if I can cycle
on all snapshot to get the exact allocated data size.

https://github.com/ceph/ceph/blob/master/src/include/rbd/librbd.h#L925

I'll back when I have more details on the rbd implementation to better
answer your questions.

Thanks,
Stefano
Jason Dillaman July 9, 2019, 3:08 a.m. UTC | #3
On Fri, Jul 5, 2019 at 6:43 AM Stefano Garzarella <sgarzare@redhat.com> wrote:
>
> On Fri, Jul 05, 2019 at 11:58:43AM +0200, Max Reitz wrote:
> > On 05.07.19 11:32, Stefano Garzarella wrote:
> > > This patch allows 'qemu-img info' to show the 'disk size' for
> > > the RBD images that have the fast-diff feature enabled.
> > >
> > > If this feature is enabled, we use the rbd_diff_iterate2() API
> > > to calculate the allocated size for the image.
> > >
> > > Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
> > > ---
> > > v3:
> > >   - return -ENOTSUP instead of -1 when fast-diff is not available
> > >     [John, Jason]
> > > v2:
> > >   - calculate the actual usage only if the fast-diff feature is
> > >     enabled [Jason]
> > > ---
> > >  block/rbd.c | 54 +++++++++++++++++++++++++++++++++++++++++++++++++++++
> > >  1 file changed, 54 insertions(+)
> >
> > Well, the librbd documentation is non-existing as always, but while
> > googling, I at least found that libvirt has exactly the same code.  So I
> > suppose it must be quite correct, then.
> >
>
> While I wrote this code I took a look at libvirt implementation and also
> at the "rbd" tool in the ceph repository: compute_image_disk_usage() in
> src/tools/rbd/action/DiskUsage.cc
>
> > > diff --git a/block/rbd.c b/block/rbd.c
> > > index 59757b3120..b6bed683e5 100644
> > > --- a/block/rbd.c
> > > +++ b/block/rbd.c
> > > @@ -1084,6 +1084,59 @@ static int64_t qemu_rbd_getlength(BlockDriverState *bs)
> > >      return info.size;
> > >  }
> > >
> > > +static int rbd_allocated_size_cb(uint64_t offset, size_t len, int exists,
> > > +                                 void *arg)
> > > +{
> > > +    int64_t *alloc_size = (int64_t *) arg;
> > > +
> > > +    if (exists) {
> > > +        (*alloc_size) += len;
> > > +    }
> > > +
> > > +    return 0;
> > > +}
> > > +
> > > +static int64_t qemu_rbd_get_allocated_file_size(BlockDriverState *bs)
> > > +{
> > > +    BDRVRBDState *s = bs->opaque;
> > > +    uint64_t flags, features;
> > > +    int64_t alloc_size = 0;
> > > +    int r;
> > > +
> > > +    r = rbd_get_flags(s->image, &flags);
> > > +    if (r < 0) {
> > > +        return r;
> > > +    }
> > > +
> > > +    r = rbd_get_features(s->image, &features);
> > > +    if (r < 0) {
> > > +        return r;
> > > +    }
> > > +
> > > +    /*
> > > +     * We use rbd_diff_iterate2() only if the RBD image have fast-diff
> > > +     * feature enabled. If it is disabled, rbd_diff_iterate2() could be
> > > +     * very slow on a big image.
> > > +     */
> > > +    if (!(features & RBD_FEATURE_FAST_DIFF) ||
> > > +        (flags & RBD_FLAG_FAST_DIFF_INVALID)) {
> > > +        return -ENOTSUP;
> > > +    }
> > > +
> > > +    /*
> > > +     * rbd_diff_iterate2(), if the source snapshot name is NULL, invokes
> > > +     * the callback on all allocated regions of the image.
> > > +     */
> > > +    r = rbd_diff_iterate2(s->image, NULL, 0,
> > > +                          bs->total_sectors * BDRV_SECTOR_SIZE, 0, 1,
> > > +                          &rbd_allocated_size_cb, &alloc_size);
> >
> > But I have a question.  This is basically block_status, right?  So it
> > gives us information on which areas are allocated and which are not.
> > The result thus gives us a lower bound on the allocation size, but is it
> > really exactly the allocation size?
> >
> > There are two things I’m concerned about:
> >
> > 1. What about metadata?
>
> Good question, I don't think it includes the size used by metadata and I
> don't know if there is a way to know it. I'll check better.

It does not include the size of metadata, the "rbd_diff_iterate2"
function is literally just looking for touched data blocks within the
RBD image.

> >
> > 2. If you have multiple snapshots, this will only report the overall
> > allocation information, right?  So say there is something like this:
> >
> > (“A” means an allocated MB, “-” is an unallocated MB)
> >
> > Snapshot 1: AAAA---
> > Snapshot 2: --AAAAA
> > Snapshot 3: -AAAA--
> >
> > I think the allocated data size is the number of As in total (13 MB).
> > But I suppose this API will just return 7 MB, because it looks on
> > everything an it sees the whole image range (7 MB) to be allocated.  It
> > doesn’t report in how many snapshots some region is allocated.

It should return 13 dirty data blocks (multipled by the size of the
data block) since when you don't provide a "from snapshot" name, it
will iterate from the first snapshot to the HEAD revision.

> Looking at the documentation of rbd_diff_iterate2() [1] they says:
>
>  *                        If the source snapshot name is NULL, we
>  * interpret that as the beginning of time and return all allocated
>  * regions of the image.
>
> But I don't know the answer of your question (maybe Jason can help
> here).
> I should check better the implementation to understand if I can cycle
> on all snapshot to get the exact allocated data size.
>
> https://github.com/ceph/ceph/blob/master/src/include/rbd/librbd.h#L925
>
> I'll back when I have more details on the rbd implementation to better
> answer your questions.
>
> Thanks,
> Stefano
Max Reitz July 9, 2019, 8:55 a.m. UTC | #4
On 09.07.19 05:08, Jason Dillaman wrote:
> On Fri, Jul 5, 2019 at 6:43 AM Stefano Garzarella <sgarzare@redhat.com> wrote:
>>
>> On Fri, Jul 05, 2019 at 11:58:43AM +0200, Max Reitz wrote:
>>> On 05.07.19 11:32, Stefano Garzarella wrote:
>>>> This patch allows 'qemu-img info' to show the 'disk size' for
>>>> the RBD images that have the fast-diff feature enabled.
>>>>
>>>> If this feature is enabled, we use the rbd_diff_iterate2() API
>>>> to calculate the allocated size for the image.
>>>>
>>>> Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
>>>> ---
>>>> v3:
>>>>   - return -ENOTSUP instead of -1 when fast-diff is not available
>>>>     [John, Jason]
>>>> v2:
>>>>   - calculate the actual usage only if the fast-diff feature is
>>>>     enabled [Jason]
>>>> ---
>>>>  block/rbd.c | 54 +++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>  1 file changed, 54 insertions(+)
>>>
>>> Well, the librbd documentation is non-existing as always, but while
>>> googling, I at least found that libvirt has exactly the same code.  So I
>>> suppose it must be quite correct, then.
>>>
>>
>> While I wrote this code I took a look at libvirt implementation and also
>> at the "rbd" tool in the ceph repository: compute_image_disk_usage() in
>> src/tools/rbd/action/DiskUsage.cc
>>
>>>> diff --git a/block/rbd.c b/block/rbd.c
>>>> index 59757b3120..b6bed683e5 100644
>>>> --- a/block/rbd.c
>>>> +++ b/block/rbd.c
>>>> @@ -1084,6 +1084,59 @@ static int64_t qemu_rbd_getlength(BlockDriverState *bs)
>>>>      return info.size;
>>>>  }
>>>>
>>>> +static int rbd_allocated_size_cb(uint64_t offset, size_t len, int exists,
>>>> +                                 void *arg)
>>>> +{
>>>> +    int64_t *alloc_size = (int64_t *) arg;
>>>> +
>>>> +    if (exists) {
>>>> +        (*alloc_size) += len;
>>>> +    }
>>>> +
>>>> +    return 0;
>>>> +}
>>>> +
>>>> +static int64_t qemu_rbd_get_allocated_file_size(BlockDriverState *bs)
>>>> +{
>>>> +    BDRVRBDState *s = bs->opaque;
>>>> +    uint64_t flags, features;
>>>> +    int64_t alloc_size = 0;
>>>> +    int r;
>>>> +
>>>> +    r = rbd_get_flags(s->image, &flags);
>>>> +    if (r < 0) {
>>>> +        return r;
>>>> +    }
>>>> +
>>>> +    r = rbd_get_features(s->image, &features);
>>>> +    if (r < 0) {
>>>> +        return r;
>>>> +    }
>>>> +
>>>> +    /*
>>>> +     * We use rbd_diff_iterate2() only if the RBD image have fast-diff
>>>> +     * feature enabled. If it is disabled, rbd_diff_iterate2() could be
>>>> +     * very slow on a big image.
>>>> +     */
>>>> +    if (!(features & RBD_FEATURE_FAST_DIFF) ||
>>>> +        (flags & RBD_FLAG_FAST_DIFF_INVALID)) {
>>>> +        return -ENOTSUP;
>>>> +    }
>>>> +
>>>> +    /*
>>>> +     * rbd_diff_iterate2(), if the source snapshot name is NULL, invokes
>>>> +     * the callback on all allocated regions of the image.
>>>> +     */
>>>> +    r = rbd_diff_iterate2(s->image, NULL, 0,
>>>> +                          bs->total_sectors * BDRV_SECTOR_SIZE, 0, 1,
>>>> +                          &rbd_allocated_size_cb, &alloc_size);
>>>
>>> But I have a question.  This is basically block_status, right?  So it
>>> gives us information on which areas are allocated and which are not.
>>> The result thus gives us a lower bound on the allocation size, but is it
>>> really exactly the allocation size?
>>>
>>> There are two things I’m concerned about:
>>>
>>> 1. What about metadata?
>>
>> Good question, I don't think it includes the size used by metadata and I
>> don't know if there is a way to know it. I'll check better.
> 
> It does not include the size of metadata, the "rbd_diff_iterate2"
> function is literally just looking for touched data blocks within the
> RBD image.
> 
>>>
>>> 2. If you have multiple snapshots, this will only report the overall
>>> allocation information, right?  So say there is something like this:
>>>
>>> (“A” means an allocated MB, “-” is an unallocated MB)
>>>
>>> Snapshot 1: AAAA---
>>> Snapshot 2: --AAAAA
>>> Snapshot 3: -AAAA--
>>>
>>> I think the allocated data size is the number of As in total (13 MB).
>>> But I suppose this API will just return 7 MB, because it looks on
>>> everything an it sees the whole image range (7 MB) to be allocated.  It
>>> doesn’t report in how many snapshots some region is allocated.
> 
> It should return 13 dirty data blocks (multipled by the size of the
> data block) since when you don't provide a "from snapshot" name, it
> will iterate from the first snapshot to the HEAD revision.

Have you tested that?

I‘m so skeptical because the callback function interface has no way of
distinguishing between different layers of snapshots.

And also because we have the bdrv_block_status_above() function which
just looks strikingly similar (with the difference that it does not
invoke a callback but just returns the next allocated range).  If you
pass base=NULL to it, it will also “interpret that as the beginning of
time and return all allocated regions of the image” (or rather image
chain, in our case).  But it would just return 7 MB as allocated.  (Even
though it does in fact return layer information, i.e. where a given
continuous chunk of data is allocated.)

Sure, there is no good reason for why our interface should by chance be
the same as librbd’s interface.  But without having tested it, the fact
that the callback cannot detect which layer a chunk is allocated on just
makes me wary.

Max
Max Reitz July 9, 2019, 9:45 a.m. UTC | #5
On 09.07.19 10:55, Max Reitz wrote:
> On 09.07.19 05:08, Jason Dillaman wrote:
>> On Fri, Jul 5, 2019 at 6:43 AM Stefano Garzarella <sgarzare@redhat.com> wrote:
>>>
>>> On Fri, Jul 05, 2019 at 11:58:43AM +0200, Max Reitz wrote:
>>>> On 05.07.19 11:32, Stefano Garzarella wrote:
>>>>> This patch allows 'qemu-img info' to show the 'disk size' for
>>>>> the RBD images that have the fast-diff feature enabled.
>>>>>
>>>>> If this feature is enabled, we use the rbd_diff_iterate2() API
>>>>> to calculate the allocated size for the image.
>>>>>
>>>>> Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
>>>>> ---
>>>>> v3:
>>>>>   - return -ENOTSUP instead of -1 when fast-diff is not available
>>>>>     [John, Jason]
>>>>> v2:
>>>>>   - calculate the actual usage only if the fast-diff feature is
>>>>>     enabled [Jason]
>>>>> ---
>>>>>  block/rbd.c | 54 +++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>  1 file changed, 54 insertions(+)
>>>>
>>>> Well, the librbd documentation is non-existing as always, but while
>>>> googling, I at least found that libvirt has exactly the same code.  So I
>>>> suppose it must be quite correct, then.
>>>>
>>>
>>> While I wrote this code I took a look at libvirt implementation and also
>>> at the "rbd" tool in the ceph repository: compute_image_disk_usage() in
>>> src/tools/rbd/action/DiskUsage.cc
>>>
>>>>> diff --git a/block/rbd.c b/block/rbd.c
>>>>> index 59757b3120..b6bed683e5 100644
>>>>> --- a/block/rbd.c
>>>>> +++ b/block/rbd.c
>>>>> @@ -1084,6 +1084,59 @@ static int64_t qemu_rbd_getlength(BlockDriverState *bs)
>>>>>      return info.size;
>>>>>  }
>>>>>
>>>>> +static int rbd_allocated_size_cb(uint64_t offset, size_t len, int exists,
>>>>> +                                 void *arg)
>>>>> +{
>>>>> +    int64_t *alloc_size = (int64_t *) arg;
>>>>> +
>>>>> +    if (exists) {
>>>>> +        (*alloc_size) += len;
>>>>> +    }
>>>>> +
>>>>> +    return 0;
>>>>> +}
>>>>> +
>>>>> +static int64_t qemu_rbd_get_allocated_file_size(BlockDriverState *bs)
>>>>> +{
>>>>> +    BDRVRBDState *s = bs->opaque;
>>>>> +    uint64_t flags, features;
>>>>> +    int64_t alloc_size = 0;
>>>>> +    int r;
>>>>> +
>>>>> +    r = rbd_get_flags(s->image, &flags);
>>>>> +    if (r < 0) {
>>>>> +        return r;
>>>>> +    }
>>>>> +
>>>>> +    r = rbd_get_features(s->image, &features);
>>>>> +    if (r < 0) {
>>>>> +        return r;
>>>>> +    }
>>>>> +
>>>>> +    /*
>>>>> +     * We use rbd_diff_iterate2() only if the RBD image have fast-diff
>>>>> +     * feature enabled. If it is disabled, rbd_diff_iterate2() could be
>>>>> +     * very slow on a big image.
>>>>> +     */
>>>>> +    if (!(features & RBD_FEATURE_FAST_DIFF) ||
>>>>> +        (flags & RBD_FLAG_FAST_DIFF_INVALID)) {
>>>>> +        return -ENOTSUP;
>>>>> +    }
>>>>> +
>>>>> +    /*
>>>>> +     * rbd_diff_iterate2(), if the source snapshot name is NULL, invokes
>>>>> +     * the callback on all allocated regions of the image.
>>>>> +     */
>>>>> +    r = rbd_diff_iterate2(s->image, NULL, 0,
>>>>> +                          bs->total_sectors * BDRV_SECTOR_SIZE, 0, 1,
>>>>> +                          &rbd_allocated_size_cb, &alloc_size);
>>>>
>>>> But I have a question.  This is basically block_status, right?  So it
>>>> gives us information on which areas are allocated and which are not.
>>>> The result thus gives us a lower bound on the allocation size, but is it
>>>> really exactly the allocation size?
>>>>
>>>> There are two things I’m concerned about:
>>>>
>>>> 1. What about metadata?
>>>
>>> Good question, I don't think it includes the size used by metadata and I
>>> don't know if there is a way to know it. I'll check better.
>>
>> It does not include the size of metadata, the "rbd_diff_iterate2"
>> function is literally just looking for touched data blocks within the
>> RBD image.
>>
>>>>
>>>> 2. If you have multiple snapshots, this will only report the overall
>>>> allocation information, right?  So say there is something like this:
>>>>
>>>> (“A” means an allocated MB, “-” is an unallocated MB)
>>>>
>>>> Snapshot 1: AAAA---
>>>> Snapshot 2: --AAAAA
>>>> Snapshot 3: -AAAA--
>>>>
>>>> I think the allocated data size is the number of As in total (13 MB).
>>>> But I suppose this API will just return 7 MB, because it looks on
>>>> everything an it sees the whole image range (7 MB) to be allocated.  It
>>>> doesn’t report in how many snapshots some region is allocated.
>>
>> It should return 13 dirty data blocks (multipled by the size of the
>> data block) since when you don't provide a "from snapshot" name, it
>> will iterate from the first snapshot to the HEAD revision.
> 
> Have you tested that?
> 
> I‘m so skeptical because the callback function interface has no way of
> distinguishing between different layers of snapshots.
> 
> And also because we have the bdrv_block_status_above() function which
> just looks strikingly similar (with the difference that it does not
> invoke a callback but just returns the next allocated range).  If you
> pass base=NULL to it, it will also “interpret that as the beginning of
> time and return all allocated regions of the image” (or rather image
> chain, in our case).  But it would just return 7 MB as allocated.  (Even
> though it does in fact return layer information, i.e. where a given
> continuous chunk of data is allocated.)
> 
> Sure, there is no good reason for why our interface should by chance be
> the same as librbd’s interface.  But without having tested it, the fact
> that the callback cannot detect which layer a chunk is allocated on just
> makes me wary.

And the librbd code doesn’t alleviate my concerns.

From what I can see, it first creates a bitmap (two bits per entry) that
covers the whole image and says which objects are allocated and which
aren‘t.  Through the whole chain, that is.  So in the above example, the
bitmap would report everything as allocated.  (Or rather “updated“ in
librbd‘s terms.)

Then it has this piece:

>   uint64_t off = m_offset;
>   uint64_t left = m_length;
> 
>   while (left > 0) {
>     uint64_t period_off = off - (off % period);                                                                                                   
>     uint64_t read_len = min(period_off + period - off, left);
> 
>     // map to extents
>     map<object_t,vector<ObjectExtent> > object_extents;
>     Striper::file_to_extents(cct, m_image_ctx.format_string,
>                              &m_image_ctx.layout, off, read_len, 0,
>                              object_extents, 0);
> 
>     // get snap info for each object
>     for (map<object_t,vector<ObjectExtent> >::iterator p =
>            object_extents.begin();
>          p != object_extents.end(); ++p) 
[...]
>           for (std::vector<ObjectExtent>::iterator q = p->second.begin();
>                q != p->second.end(); ++q) {
>             r = m_callback(off + q->offset, q->length, updated, m_callback_arg);
[...]
>           }
[...]
>     }
>>     left -= read_len;
>     off += read_len;
>   }

So that iterates over the whole image (in our case, because m_offset is
0 and m_length is the image length), then picks out a chunk of read_len
length, converts that to objects, and then iterates over all of those
objects and all of their extents.

file_to_extents looks like it’s just an arithmetic operation.  So it
doesn‘t look like that function returns extents in multiple snapshots.
It just converts a range into subranges called “objects” and “extents”
(at least that’s the way it looks to me).

So overall, this looks awfully like it simply iterates over the whole
image and then returns allocation information gathered as a top-down
view through all of the snapshots, but not for each snapshot individually.

Max
Jason Dillaman July 9, 2019, 12:55 p.m. UTC | #6
On Tue, Jul 9, 2019 at 5:45 AM Max Reitz <mreitz@redhat.com> wrote:
>
> On 09.07.19 10:55, Max Reitz wrote:
> > On 09.07.19 05:08, Jason Dillaman wrote:
> >> On Fri, Jul 5, 2019 at 6:43 AM Stefano Garzarella <sgarzare@redhat.com> wrote:
> >>>
> >>> On Fri, Jul 05, 2019 at 11:58:43AM +0200, Max Reitz wrote:
> >>>> On 05.07.19 11:32, Stefano Garzarella wrote:
> >>>>> This patch allows 'qemu-img info' to show the 'disk size' for
> >>>>> the RBD images that have the fast-diff feature enabled.
> >>>>>
> >>>>> If this feature is enabled, we use the rbd_diff_iterate2() API
> >>>>> to calculate the allocated size for the image.
> >>>>>
> >>>>> Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
> >>>>> ---
> >>>>> v3:
> >>>>>   - return -ENOTSUP instead of -1 when fast-diff is not available
> >>>>>     [John, Jason]
> >>>>> v2:
> >>>>>   - calculate the actual usage only if the fast-diff feature is
> >>>>>     enabled [Jason]
> >>>>> ---
> >>>>>  block/rbd.c | 54 +++++++++++++++++++++++++++++++++++++++++++++++++++++
> >>>>>  1 file changed, 54 insertions(+)
> >>>>
> >>>> Well, the librbd documentation is non-existing as always, but while
> >>>> googling, I at least found that libvirt has exactly the same code.  So I
> >>>> suppose it must be quite correct, then.
> >>>>
> >>>
> >>> While I wrote this code I took a look at libvirt implementation and also
> >>> at the "rbd" tool in the ceph repository: compute_image_disk_usage() in
> >>> src/tools/rbd/action/DiskUsage.cc
> >>>
> >>>>> diff --git a/block/rbd.c b/block/rbd.c
> >>>>> index 59757b3120..b6bed683e5 100644
> >>>>> --- a/block/rbd.c
> >>>>> +++ b/block/rbd.c
> >>>>> @@ -1084,6 +1084,59 @@ static int64_t qemu_rbd_getlength(BlockDriverState *bs)
> >>>>>      return info.size;
> >>>>>  }
> >>>>>
> >>>>> +static int rbd_allocated_size_cb(uint64_t offset, size_t len, int exists,
> >>>>> +                                 void *arg)
> >>>>> +{
> >>>>> +    int64_t *alloc_size = (int64_t *) arg;
> >>>>> +
> >>>>> +    if (exists) {
> >>>>> +        (*alloc_size) += len;
> >>>>> +    }
> >>>>> +
> >>>>> +    return 0;
> >>>>> +}
> >>>>> +
> >>>>> +static int64_t qemu_rbd_get_allocated_file_size(BlockDriverState *bs)
> >>>>> +{
> >>>>> +    BDRVRBDState *s = bs->opaque;
> >>>>> +    uint64_t flags, features;
> >>>>> +    int64_t alloc_size = 0;
> >>>>> +    int r;
> >>>>> +
> >>>>> +    r = rbd_get_flags(s->image, &flags);
> >>>>> +    if (r < 0) {
> >>>>> +        return r;
> >>>>> +    }
> >>>>> +
> >>>>> +    r = rbd_get_features(s->image, &features);
> >>>>> +    if (r < 0) {
> >>>>> +        return r;
> >>>>> +    }
> >>>>> +
> >>>>> +    /*
> >>>>> +     * We use rbd_diff_iterate2() only if the RBD image have fast-diff
> >>>>> +     * feature enabled. If it is disabled, rbd_diff_iterate2() could be
> >>>>> +     * very slow on a big image.
> >>>>> +     */
> >>>>> +    if (!(features & RBD_FEATURE_FAST_DIFF) ||
> >>>>> +        (flags & RBD_FLAG_FAST_DIFF_INVALID)) {
> >>>>> +        return -ENOTSUP;
> >>>>> +    }
> >>>>> +
> >>>>> +    /*
> >>>>> +     * rbd_diff_iterate2(), if the source snapshot name is NULL, invokes
> >>>>> +     * the callback on all allocated regions of the image.
> >>>>> +     */
> >>>>> +    r = rbd_diff_iterate2(s->image, NULL, 0,
> >>>>> +                          bs->total_sectors * BDRV_SECTOR_SIZE, 0, 1,
> >>>>> +                          &rbd_allocated_size_cb, &alloc_size);
> >>>>
> >>>> But I have a question.  This is basically block_status, right?  So it
> >>>> gives us information on which areas are allocated and which are not.
> >>>> The result thus gives us a lower bound on the allocation size, but is it
> >>>> really exactly the allocation size?
> >>>>
> >>>> There are two things I’m concerned about:
> >>>>
> >>>> 1. What about metadata?
> >>>
> >>> Good question, I don't think it includes the size used by metadata and I
> >>> don't know if there is a way to know it. I'll check better.
> >>
> >> It does not include the size of metadata, the "rbd_diff_iterate2"
> >> function is literally just looking for touched data blocks within the
> >> RBD image.
> >>
> >>>>
> >>>> 2. If you have multiple snapshots, this will only report the overall
> >>>> allocation information, right?  So say there is something like this:
> >>>>
> >>>> (“A” means an allocated MB, “-” is an unallocated MB)
> >>>>
> >>>> Snapshot 1: AAAA---
> >>>> Snapshot 2: --AAAAA
> >>>> Snapshot 3: -AAAA--
> >>>>
> >>>> I think the allocated data size is the number of As in total (13 MB).
> >>>> But I suppose this API will just return 7 MB, because it looks on
> >>>> everything an it sees the whole image range (7 MB) to be allocated.  It
> >>>> doesn’t report in how many snapshots some region is allocated.
> >>
> >> It should return 13 dirty data blocks (multipled by the size of the
> >> data block) since when you don't provide a "from snapshot" name, it
> >> will iterate from the first snapshot to the HEAD revision.
> >
> > Have you tested that?
> >
> > I‘m so skeptical because the callback function interface has no way of
> > distinguishing between different layers of snapshots.
> >
> > And also because we have the bdrv_block_status_above() function which
> > just looks strikingly similar (with the difference that it does not
> > invoke a callback but just returns the next allocated range).  If you
> > pass base=NULL to it, it will also “interpret that as the beginning of
> > time and return all allocated regions of the image” (or rather image
> > chain, in our case).  But it would just return 7 MB as allocated.  (Even
> > though it does in fact return layer information, i.e. where a given
> > continuous chunk of data is allocated.)
> >
> > Sure, there is no good reason for why our interface should by chance be
> > the same as librbd’s interface.  But without having tested it, the fact
> > that the callback cannot detect which layer a chunk is allocated on just
> > makes me wary.
>
> And the librbd code doesn’t alleviate my concerns.
>
> From what I can see, it first creates a bitmap (two bits per entry) that
> covers the whole image and says which objects are allocated and which
> aren‘t.  Through the whole chain, that is.  So in the above example, the
> bitmap would report everything as allocated.  (Or rather “updated“ in
> librbd‘s terms.)
>
> Then it has this piece:
>
> >   uint64_t off = m_offset;
> >   uint64_t left = m_length;
> >
> >   while (left > 0) {
> >     uint64_t period_off = off - (off % period);
> >     uint64_t read_len = min(period_off + period - off, left);
> >
> >     // map to extents
> >     map<object_t,vector<ObjectExtent> > object_extents;
> >     Striper::file_to_extents(cct, m_image_ctx.format_string,
> >                              &m_image_ctx.layout, off, read_len, 0,
> >                              object_extents, 0);
> >
> >     // get snap info for each object
> >     for (map<object_t,vector<ObjectExtent> >::iterator p =
> >            object_extents.begin();
> >          p != object_extents.end(); ++p)
> [...]
> >           for (std::vector<ObjectExtent>::iterator q = p->second.begin();
> >                q != p->second.end(); ++q) {
> >             r = m_callback(off + q->offset, q->length, updated, m_callback_arg);
> [...]
> >           }
> [...]
> >     }
> >>     left -= read_len;
> >     off += read_len;
> >   }
>
> So that iterates over the whole image (in our case, because m_offset is
> 0 and m_length is the image length), then picks out a chunk of read_len
> length, converts that to objects, and then iterates over all of those
> objects and all of their extents.
>
> file_to_extents looks like it’s just an arithmetic operation.  So it
> doesn‘t look like that function returns extents in multiple snapshots.
> It just converts a range into subranges called “objects” and “extents”
> (at least that’s the way it looks to me).
>
> So overall, this looks awfully like it simply iterates over the whole
> image and then returns allocation information gathered as a top-down
> view through all of the snapshots, but not for each snapshot individually.

Sorry, you're correct. The API function was originally designed to
support delta extents for supporting diff exports, so while it does
open each snapshot's object-map, it only returns a unioned set of
delta extents. The rbd CLI's "disk-usage" action behaves how I
described by counting the actual dirty block usage between snapshots.

> Max
>
Stefano Garzarella July 9, 2019, 1:09 p.m. UTC | #7
On Tue, Jul 09, 2019 at 08:55:19AM -0400, Jason Dillaman wrote:
> On Tue, Jul 9, 2019 at 5:45 AM Max Reitz <mreitz@redhat.com> wrote:
> >
> > On 09.07.19 10:55, Max Reitz wrote:
> > > On 09.07.19 05:08, Jason Dillaman wrote:
> > >> On Fri, Jul 5, 2019 at 6:43 AM Stefano Garzarella <sgarzare@redhat.com> wrote:
> > >>>
> > >>> On Fri, Jul 05, 2019 at 11:58:43AM +0200, Max Reitz wrote:
> > >>>> On 05.07.19 11:32, Stefano Garzarella wrote:
> > >>>>> This patch allows 'qemu-img info' to show the 'disk size' for
> > >>>>> the RBD images that have the fast-diff feature enabled.
> > >>>>>
> > >>>>> If this feature is enabled, we use the rbd_diff_iterate2() API
> > >>>>> to calculate the allocated size for the image.
> > >>>>>
> > >>>>> Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
> > >>>>> ---
> > >>>>> v3:
> > >>>>>   - return -ENOTSUP instead of -1 when fast-diff is not available
> > >>>>>     [John, Jason]
> > >>>>> v2:
> > >>>>>   - calculate the actual usage only if the fast-diff feature is
> > >>>>>     enabled [Jason]
> > >>>>> ---
> > >>>>>  block/rbd.c | 54 +++++++++++++++++++++++++++++++++++++++++++++++++++++
> > >>>>>  1 file changed, 54 insertions(+)
> > >>>>
> > >>>> Well, the librbd documentation is non-existing as always, but while
> > >>>> googling, I at least found that libvirt has exactly the same code.  So I
> > >>>> suppose it must be quite correct, then.
> > >>>>
> > >>>
> > >>> While I wrote this code I took a look at libvirt implementation and also
> > >>> at the "rbd" tool in the ceph repository: compute_image_disk_usage() in
> > >>> src/tools/rbd/action/DiskUsage.cc
> > >>>
> > >>>>> diff --git a/block/rbd.c b/block/rbd.c
> > >>>>> index 59757b3120..b6bed683e5 100644
> > >>>>> --- a/block/rbd.c
> > >>>>> +++ b/block/rbd.c
> > >>>>> @@ -1084,6 +1084,59 @@ static int64_t qemu_rbd_getlength(BlockDriverState *bs)
> > >>>>>      return info.size;
> > >>>>>  }
> > >>>>>
> > >>>>> +static int rbd_allocated_size_cb(uint64_t offset, size_t len, int exists,
> > >>>>> +                                 void *arg)
> > >>>>> +{
> > >>>>> +    int64_t *alloc_size = (int64_t *) arg;
> > >>>>> +
> > >>>>> +    if (exists) {
> > >>>>> +        (*alloc_size) += len;
> > >>>>> +    }
> > >>>>> +
> > >>>>> +    return 0;
> > >>>>> +}
> > >>>>> +
> > >>>>> +static int64_t qemu_rbd_get_allocated_file_size(BlockDriverState *bs)
> > >>>>> +{
> > >>>>> +    BDRVRBDState *s = bs->opaque;
> > >>>>> +    uint64_t flags, features;
> > >>>>> +    int64_t alloc_size = 0;
> > >>>>> +    int r;
> > >>>>> +
> > >>>>> +    r = rbd_get_flags(s->image, &flags);
> > >>>>> +    if (r < 0) {
> > >>>>> +        return r;
> > >>>>> +    }
> > >>>>> +
> > >>>>> +    r = rbd_get_features(s->image, &features);
> > >>>>> +    if (r < 0) {
> > >>>>> +        return r;
> > >>>>> +    }
> > >>>>> +
> > >>>>> +    /*
> > >>>>> +     * We use rbd_diff_iterate2() only if the RBD image have fast-diff
> > >>>>> +     * feature enabled. If it is disabled, rbd_diff_iterate2() could be
> > >>>>> +     * very slow on a big image.
> > >>>>> +     */
> > >>>>> +    if (!(features & RBD_FEATURE_FAST_DIFF) ||
> > >>>>> +        (flags & RBD_FLAG_FAST_DIFF_INVALID)) {
> > >>>>> +        return -ENOTSUP;
> > >>>>> +    }
> > >>>>> +
> > >>>>> +    /*
> > >>>>> +     * rbd_diff_iterate2(), if the source snapshot name is NULL, invokes
> > >>>>> +     * the callback on all allocated regions of the image.
> > >>>>> +     */
> > >>>>> +    r = rbd_diff_iterate2(s->image, NULL, 0,
> > >>>>> +                          bs->total_sectors * BDRV_SECTOR_SIZE, 0, 1,
> > >>>>> +                          &rbd_allocated_size_cb, &alloc_size);
> > >>>>
> > >>>> But I have a question.  This is basically block_status, right?  So it
> > >>>> gives us information on which areas are allocated and which are not.
> > >>>> The result thus gives us a lower bound on the allocation size, but is it
> > >>>> really exactly the allocation size?
> > >>>>
> > >>>> There are two things I’m concerned about:
> > >>>>
> > >>>> 1. What about metadata?
> > >>>
> > >>> Good question, I don't think it includes the size used by metadata and I
> > >>> don't know if there is a way to know it. I'll check better.
> > >>
> > >> It does not include the size of metadata, the "rbd_diff_iterate2"
> > >> function is literally just looking for touched data blocks within the
> > >> RBD image.
> > >>
> > >>>>
> > >>>> 2. If you have multiple snapshots, this will only report the overall
> > >>>> allocation information, right?  So say there is something like this:
> > >>>>
> > >>>> (“A” means an allocated MB, “-” is an unallocated MB)
> > >>>>
> > >>>> Snapshot 1: AAAA---
> > >>>> Snapshot 2: --AAAAA
> > >>>> Snapshot 3: -AAAA--
> > >>>>
> > >>>> I think the allocated data size is the number of As in total (13 MB).
> > >>>> But I suppose this API will just return 7 MB, because it looks on
> > >>>> everything an it sees the whole image range (7 MB) to be allocated.  It
> > >>>> doesn’t report in how many snapshots some region is allocated.
> > >>
> > >> It should return 13 dirty data blocks (multipled by the size of the
> > >> data block) since when you don't provide a "from snapshot" name, it
> > >> will iterate from the first snapshot to the HEAD revision.
> > >
> > > Have you tested that?
> > >
> > > I‘m so skeptical because the callback function interface has no way of
> > > distinguishing between different layers of snapshots.
> > >
> > > And also because we have the bdrv_block_status_above() function which
> > > just looks strikingly similar (with the difference that it does not
> > > invoke a callback but just returns the next allocated range).  If you
> > > pass base=NULL to it, it will also “interpret that as the beginning of
> > > time and return all allocated regions of the image” (or rather image
> > > chain, in our case).  But it would just return 7 MB as allocated.  (Even
> > > though it does in fact return layer information, i.e. where a given
> > > continuous chunk of data is allocated.)
> > >
> > > Sure, there is no good reason for why our interface should by chance be
> > > the same as librbd’s interface.  But without having tested it, the fact
> > > that the callback cannot detect which layer a chunk is allocated on just
> > > makes me wary.
> >
> > And the librbd code doesn’t alleviate my concerns.
> >
> > From what I can see, it first creates a bitmap (two bits per entry) that
> > covers the whole image and says which objects are allocated and which
> > aren‘t.  Through the whole chain, that is.  So in the above example, the
> > bitmap would report everything as allocated.  (Or rather “updated“ in
> > librbd‘s terms.)
> >
> > Then it has this piece:
> >
> > >   uint64_t off = m_offset;
> > >   uint64_t left = m_length;
> > >
> > >   while (left > 0) {
> > >     uint64_t period_off = off - (off % period);
> > >     uint64_t read_len = min(period_off + period - off, left);
> > >
> > >     // map to extents
> > >     map<object_t,vector<ObjectExtent> > object_extents;
> > >     Striper::file_to_extents(cct, m_image_ctx.format_string,
> > >                              &m_image_ctx.layout, off, read_len, 0,
> > >                              object_extents, 0);
> > >
> > >     // get snap info for each object
> > >     for (map<object_t,vector<ObjectExtent> >::iterator p =
> > >            object_extents.begin();
> > >          p != object_extents.end(); ++p)
> > [...]
> > >           for (std::vector<ObjectExtent>::iterator q = p->second.begin();
> > >                q != p->second.end(); ++q) {
> > >             r = m_callback(off + q->offset, q->length, updated, m_callback_arg);
> > [...]
> > >           }
> > [...]
> > >     }
> > >>     left -= read_len;
> > >     off += read_len;
> > >   }
> >
> > So that iterates over the whole image (in our case, because m_offset is
> > 0 and m_length is the image length), then picks out a chunk of read_len
> > length, converts that to objects, and then iterates over all of those
> > objects and all of their extents.
> >
> > file_to_extents looks like it’s just an arithmetic operation.  So it
> > doesn‘t look like that function returns extents in multiple snapshots.
> > It just converts a range into subranges called “objects” and “extents”
> > (at least that’s the way it looks to me).
> >
> > So overall, this looks awfully like it simply iterates over the whole
> > image and then returns allocation information gathered as a top-down
> > view through all of the snapshots, but not for each snapshot individually.
> 
> Sorry, you're correct. The API function was originally designed to
> support delta extents for supporting diff exports, so while it does
> open each snapshot's object-map, it only returns a unioned set of
> delta extents. The rbd CLI's "disk-usage" action behaves how I
> described by counting the actual dirty block usage between snapshots.

Max, Jason, thanks for the details!

If you agree, I'll try to implement something similar, iterating on all
snapshots.

What about metadata?
I saw some APIs (e.g. rbd_metadata_list()) that allow us to get the metadata.
An idea could be to iterate over them and sum the keys-values size returned,
but I'm not sure they are returned in the same format as they are stored.

Thanks,
Stefano
Max Reitz July 9, 2019, 3:32 p.m. UTC | #8
On 09.07.19 15:09, Stefano Garzarella wrote:
> On Tue, Jul 09, 2019 at 08:55:19AM -0400, Jason Dillaman wrote:
>> On Tue, Jul 9, 2019 at 5:45 AM Max Reitz <mreitz@redhat.com> wrote:
>>>
>>> On 09.07.19 10:55, Max Reitz wrote:
>>>> On 09.07.19 05:08, Jason Dillaman wrote:
>>>>> On Fri, Jul 5, 2019 at 6:43 AM Stefano Garzarella <sgarzare@redhat.com> wrote:
>>>>>>
>>>>>> On Fri, Jul 05, 2019 at 11:58:43AM +0200, Max Reitz wrote:
>>>>>>> On 05.07.19 11:32, Stefano Garzarella wrote:
>>>>>>>> This patch allows 'qemu-img info' to show the 'disk size' for
>>>>>>>> the RBD images that have the fast-diff feature enabled.
>>>>>>>>
>>>>>>>> If this feature is enabled, we use the rbd_diff_iterate2() API
>>>>>>>> to calculate the allocated size for the image.
>>>>>>>>
>>>>>>>> Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
>>>>>>>> ---
>>>>>>>> v3:
>>>>>>>>   - return -ENOTSUP instead of -1 when fast-diff is not available
>>>>>>>>     [John, Jason]
>>>>>>>> v2:
>>>>>>>>   - calculate the actual usage only if the fast-diff feature is
>>>>>>>>     enabled [Jason]
>>>>>>>> ---
>>>>>>>>  block/rbd.c | 54 +++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>>>>  1 file changed, 54 insertions(+)
>>>>>>>
>>>>>>> Well, the librbd documentation is non-existing as always, but while
>>>>>>> googling, I at least found that libvirt has exactly the same code.  So I
>>>>>>> suppose it must be quite correct, then.
>>>>>>>
>>>>>>
>>>>>> While I wrote this code I took a look at libvirt implementation and also
>>>>>> at the "rbd" tool in the ceph repository: compute_image_disk_usage() in
>>>>>> src/tools/rbd/action/DiskUsage.cc
>>>>>>
>>>>>>>> diff --git a/block/rbd.c b/block/rbd.c
>>>>>>>> index 59757b3120..b6bed683e5 100644
>>>>>>>> --- a/block/rbd.c
>>>>>>>> +++ b/block/rbd.c
>>>>>>>> @@ -1084,6 +1084,59 @@ static int64_t qemu_rbd_getlength(BlockDriverState *bs)
>>>>>>>>      return info.size;
>>>>>>>>  }
>>>>>>>>
>>>>>>>> +static int rbd_allocated_size_cb(uint64_t offset, size_t len, int exists,
>>>>>>>> +                                 void *arg)
>>>>>>>> +{
>>>>>>>> +    int64_t *alloc_size = (int64_t *) arg;
>>>>>>>> +
>>>>>>>> +    if (exists) {
>>>>>>>> +        (*alloc_size) += len;
>>>>>>>> +    }
>>>>>>>> +
>>>>>>>> +    return 0;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +static int64_t qemu_rbd_get_allocated_file_size(BlockDriverState *bs)
>>>>>>>> +{
>>>>>>>> +    BDRVRBDState *s = bs->opaque;
>>>>>>>> +    uint64_t flags, features;
>>>>>>>> +    int64_t alloc_size = 0;
>>>>>>>> +    int r;
>>>>>>>> +
>>>>>>>> +    r = rbd_get_flags(s->image, &flags);
>>>>>>>> +    if (r < 0) {
>>>>>>>> +        return r;
>>>>>>>> +    }
>>>>>>>> +
>>>>>>>> +    r = rbd_get_features(s->image, &features);
>>>>>>>> +    if (r < 0) {
>>>>>>>> +        return r;
>>>>>>>> +    }
>>>>>>>> +
>>>>>>>> +    /*
>>>>>>>> +     * We use rbd_diff_iterate2() only if the RBD image have fast-diff
>>>>>>>> +     * feature enabled. If it is disabled, rbd_diff_iterate2() could be
>>>>>>>> +     * very slow on a big image.
>>>>>>>> +     */
>>>>>>>> +    if (!(features & RBD_FEATURE_FAST_DIFF) ||
>>>>>>>> +        (flags & RBD_FLAG_FAST_DIFF_INVALID)) {
>>>>>>>> +        return -ENOTSUP;
>>>>>>>> +    }
>>>>>>>> +
>>>>>>>> +    /*
>>>>>>>> +     * rbd_diff_iterate2(), if the source snapshot name is NULL, invokes
>>>>>>>> +     * the callback on all allocated regions of the image.
>>>>>>>> +     */
>>>>>>>> +    r = rbd_diff_iterate2(s->image, NULL, 0,
>>>>>>>> +                          bs->total_sectors * BDRV_SECTOR_SIZE, 0, 1,
>>>>>>>> +                          &rbd_allocated_size_cb, &alloc_size);
>>>>>>>
>>>>>>> But I have a question.  This is basically block_status, right?  So it
>>>>>>> gives us information on which areas are allocated and which are not.
>>>>>>> The result thus gives us a lower bound on the allocation size, but is it
>>>>>>> really exactly the allocation size?
>>>>>>>
>>>>>>> There are two things I’m concerned about:
>>>>>>>
>>>>>>> 1. What about metadata?
>>>>>>
>>>>>> Good question, I don't think it includes the size used by metadata and I
>>>>>> don't know if there is a way to know it. I'll check better.
>>>>>
>>>>> It does not include the size of metadata, the "rbd_diff_iterate2"
>>>>> function is literally just looking for touched data blocks within the
>>>>> RBD image.
>>>>>
>>>>>>>
>>>>>>> 2. If you have multiple snapshots, this will only report the overall
>>>>>>> allocation information, right?  So say there is something like this:
>>>>>>>
>>>>>>> (“A” means an allocated MB, “-” is an unallocated MB)
>>>>>>>
>>>>>>> Snapshot 1: AAAA---
>>>>>>> Snapshot 2: --AAAAA
>>>>>>> Snapshot 3: -AAAA--
>>>>>>>
>>>>>>> I think the allocated data size is the number of As in total (13 MB).
>>>>>>> But I suppose this API will just return 7 MB, because it looks on
>>>>>>> everything an it sees the whole image range (7 MB) to be allocated.  It
>>>>>>> doesn’t report in how many snapshots some region is allocated.
>>>>>
>>>>> It should return 13 dirty data blocks (multipled by the size of the
>>>>> data block) since when you don't provide a "from snapshot" name, it
>>>>> will iterate from the first snapshot to the HEAD revision.
>>>>
>>>> Have you tested that?
>>>>
>>>> I‘m so skeptical because the callback function interface has no way of
>>>> distinguishing between different layers of snapshots.
>>>>
>>>> And also because we have the bdrv_block_status_above() function which
>>>> just looks strikingly similar (with the difference that it does not
>>>> invoke a callback but just returns the next allocated range).  If you
>>>> pass base=NULL to it, it will also “interpret that as the beginning of
>>>> time and return all allocated regions of the image” (or rather image
>>>> chain, in our case).  But it would just return 7 MB as allocated.  (Even
>>>> though it does in fact return layer information, i.e. where a given
>>>> continuous chunk of data is allocated.)
>>>>
>>>> Sure, there is no good reason for why our interface should by chance be
>>>> the same as librbd’s interface.  But without having tested it, the fact
>>>> that the callback cannot detect which layer a chunk is allocated on just
>>>> makes me wary.
>>>
>>> And the librbd code doesn’t alleviate my concerns.
>>>
>>> From what I can see, it first creates a bitmap (two bits per entry) that
>>> covers the whole image and says which objects are allocated and which
>>> aren‘t.  Through the whole chain, that is.  So in the above example, the
>>> bitmap would report everything as allocated.  (Or rather “updated“ in
>>> librbd‘s terms.)
>>>
>>> Then it has this piece:
>>>
>>>>   uint64_t off = m_offset;
>>>>   uint64_t left = m_length;
>>>>
>>>>   while (left > 0) {
>>>>     uint64_t period_off = off - (off % period);
>>>>     uint64_t read_len = min(period_off + period - off, left);
>>>>
>>>>     // map to extents
>>>>     map<object_t,vector<ObjectExtent> > object_extents;
>>>>     Striper::file_to_extents(cct, m_image_ctx.format_string,
>>>>                              &m_image_ctx.layout, off, read_len, 0,
>>>>                              object_extents, 0);
>>>>
>>>>     // get snap info for each object
>>>>     for (map<object_t,vector<ObjectExtent> >::iterator p =
>>>>            object_extents.begin();
>>>>          p != object_extents.end(); ++p)
>>> [...]
>>>>           for (std::vector<ObjectExtent>::iterator q = p->second.begin();
>>>>                q != p->second.end(); ++q) {
>>>>             r = m_callback(off + q->offset, q->length, updated, m_callback_arg);
>>> [...]
>>>>           }
>>> [...]
>>>>     }
>>>>>     left -= read_len;
>>>>     off += read_len;
>>>>   }
>>>
>>> So that iterates over the whole image (in our case, because m_offset is
>>> 0 and m_length is the image length), then picks out a chunk of read_len
>>> length, converts that to objects, and then iterates over all of those
>>> objects and all of their extents.
>>>
>>> file_to_extents looks like it’s just an arithmetic operation.  So it
>>> doesn‘t look like that function returns extents in multiple snapshots.
>>> It just converts a range into subranges called “objects” and “extents”
>>> (at least that’s the way it looks to me).
>>>
>>> So overall, this looks awfully like it simply iterates over the whole
>>> image and then returns allocation information gathered as a top-down
>>> view through all of the snapshots, but not for each snapshot individually.
>>
>> Sorry, you're correct. The API function was originally designed to
>> support delta extents for supporting diff exports, so while it does
>> open each snapshot's object-map, it only returns a unioned set of
>> delta extents. The rbd CLI's "disk-usage" action behaves how I
>> described by counting the actual dirty block usage between snapshots.
> 
> Max, Jason, thanks for the details!
> 
> If you agree, I'll try to implement something similar, iterating on all
> snapshots.

Yes, that should work.

> What about metadata?
> I saw some APIs (e.g. rbd_metadata_list()) that allow us to get the metadata.
> An idea could be to iterate over them and sum the keys-values size returned,
> but I'm not sure they are returned in the same format as they are stored.

I wouldn’t mind ignoring metadata too much.  If you can get something to
work, even better, but I don’t consider that too important.

Max
Jason Dillaman July 10, 2019, 1:42 a.m. UTC | #9
On Tue, Jul 9, 2019 at 11:32 AM Max Reitz <mreitz@redhat.com> wrote:
>
> On 09.07.19 15:09, Stefano Garzarella wrote:
> > On Tue, Jul 09, 2019 at 08:55:19AM -0400, Jason Dillaman wrote:
> >> On Tue, Jul 9, 2019 at 5:45 AM Max Reitz <mreitz@redhat.com> wrote:
> >>>
> >>> On 09.07.19 10:55, Max Reitz wrote:
> >>>> On 09.07.19 05:08, Jason Dillaman wrote:
> >>>>> On Fri, Jul 5, 2019 at 6:43 AM Stefano Garzarella <sgarzare@redhat.com> wrote:
> >>>>>>
> >>>>>> On Fri, Jul 05, 2019 at 11:58:43AM +0200, Max Reitz wrote:
> >>>>>>> On 05.07.19 11:32, Stefano Garzarella wrote:
> >>>>>>>> This patch allows 'qemu-img info' to show the 'disk size' for
> >>>>>>>> the RBD images that have the fast-diff feature enabled.
> >>>>>>>>
> >>>>>>>> If this feature is enabled, we use the rbd_diff_iterate2() API
> >>>>>>>> to calculate the allocated size for the image.
> >>>>>>>>
> >>>>>>>> Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
> >>>>>>>> ---
> >>>>>>>> v3:
> >>>>>>>>   - return -ENOTSUP instead of -1 when fast-diff is not available
> >>>>>>>>     [John, Jason]
> >>>>>>>> v2:
> >>>>>>>>   - calculate the actual usage only if the fast-diff feature is
> >>>>>>>>     enabled [Jason]
> >>>>>>>> ---
> >>>>>>>>  block/rbd.c | 54 +++++++++++++++++++++++++++++++++++++++++++++++++++++
> >>>>>>>>  1 file changed, 54 insertions(+)
> >>>>>>>
> >>>>>>> Well, the librbd documentation is non-existing as always, but while
> >>>>>>> googling, I at least found that libvirt has exactly the same code.  So I
> >>>>>>> suppose it must be quite correct, then.
> >>>>>>>
> >>>>>>
> >>>>>> While I wrote this code I took a look at libvirt implementation and also
> >>>>>> at the "rbd" tool in the ceph repository: compute_image_disk_usage() in
> >>>>>> src/tools/rbd/action/DiskUsage.cc
> >>>>>>
> >>>>>>>> diff --git a/block/rbd.c b/block/rbd.c
> >>>>>>>> index 59757b3120..b6bed683e5 100644
> >>>>>>>> --- a/block/rbd.c
> >>>>>>>> +++ b/block/rbd.c
> >>>>>>>> @@ -1084,6 +1084,59 @@ static int64_t qemu_rbd_getlength(BlockDriverState *bs)
> >>>>>>>>      return info.size;
> >>>>>>>>  }
> >>>>>>>>
> >>>>>>>> +static int rbd_allocated_size_cb(uint64_t offset, size_t len, int exists,
> >>>>>>>> +                                 void *arg)
> >>>>>>>> +{
> >>>>>>>> +    int64_t *alloc_size = (int64_t *) arg;
> >>>>>>>> +
> >>>>>>>> +    if (exists) {
> >>>>>>>> +        (*alloc_size) += len;
> >>>>>>>> +    }
> >>>>>>>> +
> >>>>>>>> +    return 0;
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +static int64_t qemu_rbd_get_allocated_file_size(BlockDriverState *bs)
> >>>>>>>> +{
> >>>>>>>> +    BDRVRBDState *s = bs->opaque;
> >>>>>>>> +    uint64_t flags, features;
> >>>>>>>> +    int64_t alloc_size = 0;
> >>>>>>>> +    int r;
> >>>>>>>> +
> >>>>>>>> +    r = rbd_get_flags(s->image, &flags);
> >>>>>>>> +    if (r < 0) {
> >>>>>>>> +        return r;
> >>>>>>>> +    }
> >>>>>>>> +
> >>>>>>>> +    r = rbd_get_features(s->image, &features);
> >>>>>>>> +    if (r < 0) {
> >>>>>>>> +        return r;
> >>>>>>>> +    }
> >>>>>>>> +
> >>>>>>>> +    /*
> >>>>>>>> +     * We use rbd_diff_iterate2() only if the RBD image have fast-diff
> >>>>>>>> +     * feature enabled. If it is disabled, rbd_diff_iterate2() could be
> >>>>>>>> +     * very slow on a big image.
> >>>>>>>> +     */
> >>>>>>>> +    if (!(features & RBD_FEATURE_FAST_DIFF) ||
> >>>>>>>> +        (flags & RBD_FLAG_FAST_DIFF_INVALID)) {
> >>>>>>>> +        return -ENOTSUP;
> >>>>>>>> +    }
> >>>>>>>> +
> >>>>>>>> +    /*
> >>>>>>>> +     * rbd_diff_iterate2(), if the source snapshot name is NULL, invokes
> >>>>>>>> +     * the callback on all allocated regions of the image.
> >>>>>>>> +     */
> >>>>>>>> +    r = rbd_diff_iterate2(s->image, NULL, 0,
> >>>>>>>> +                          bs->total_sectors * BDRV_SECTOR_SIZE, 0, 1,
> >>>>>>>> +                          &rbd_allocated_size_cb, &alloc_size);
> >>>>>>>
> >>>>>>> But I have a question.  This is basically block_status, right?  So it
> >>>>>>> gives us information on which areas are allocated and which are not.
> >>>>>>> The result thus gives us a lower bound on the allocation size, but is it
> >>>>>>> really exactly the allocation size?
> >>>>>>>
> >>>>>>> There are two things I’m concerned about:
> >>>>>>>
> >>>>>>> 1. What about metadata?
> >>>>>>
> >>>>>> Good question, I don't think it includes the size used by metadata and I
> >>>>>> don't know if there is a way to know it. I'll check better.
> >>>>>
> >>>>> It does not include the size of metadata, the "rbd_diff_iterate2"
> >>>>> function is literally just looking for touched data blocks within the
> >>>>> RBD image.
> >>>>>
> >>>>>>>
> >>>>>>> 2. If you have multiple snapshots, this will only report the overall
> >>>>>>> allocation information, right?  So say there is something like this:
> >>>>>>>
> >>>>>>> (“A” means an allocated MB, “-” is an unallocated MB)
> >>>>>>>
> >>>>>>> Snapshot 1: AAAA---
> >>>>>>> Snapshot 2: --AAAAA
> >>>>>>> Snapshot 3: -AAAA--
> >>>>>>>
> >>>>>>> I think the allocated data size is the number of As in total (13 MB).
> >>>>>>> But I suppose this API will just return 7 MB, because it looks on
> >>>>>>> everything an it sees the whole image range (7 MB) to be allocated.  It
> >>>>>>> doesn’t report in how many snapshots some region is allocated.
> >>>>>
> >>>>> It should return 13 dirty data blocks (multipled by the size of the
> >>>>> data block) since when you don't provide a "from snapshot" name, it
> >>>>> will iterate from the first snapshot to the HEAD revision.
> >>>>
> >>>> Have you tested that?
> >>>>
> >>>> I‘m so skeptical because the callback function interface has no way of
> >>>> distinguishing between different layers of snapshots.
> >>>>
> >>>> And also because we have the bdrv_block_status_above() function which
> >>>> just looks strikingly similar (with the difference that it does not
> >>>> invoke a callback but just returns the next allocated range).  If you
> >>>> pass base=NULL to it, it will also “interpret that as the beginning of
> >>>> time and return all allocated regions of the image” (or rather image
> >>>> chain, in our case).  But it would just return 7 MB as allocated.  (Even
> >>>> though it does in fact return layer information, i.e. where a given
> >>>> continuous chunk of data is allocated.)
> >>>>
> >>>> Sure, there is no good reason for why our interface should by chance be
> >>>> the same as librbd’s interface.  But without having tested it, the fact
> >>>> that the callback cannot detect which layer a chunk is allocated on just
> >>>> makes me wary.
> >>>
> >>> And the librbd code doesn’t alleviate my concerns.
> >>>
> >>> From what I can see, it first creates a bitmap (two bits per entry) that
> >>> covers the whole image and says which objects are allocated and which
> >>> aren‘t.  Through the whole chain, that is.  So in the above example, the
> >>> bitmap would report everything as allocated.  (Or rather “updated“ in
> >>> librbd‘s terms.)
> >>>
> >>> Then it has this piece:
> >>>
> >>>>   uint64_t off = m_offset;
> >>>>   uint64_t left = m_length;
> >>>>
> >>>>   while (left > 0) {
> >>>>     uint64_t period_off = off - (off % period);
> >>>>     uint64_t read_len = min(period_off + period - off, left);
> >>>>
> >>>>     // map to extents
> >>>>     map<object_t,vector<ObjectExtent> > object_extents;
> >>>>     Striper::file_to_extents(cct, m_image_ctx.format_string,
> >>>>                              &m_image_ctx.layout, off, read_len, 0,
> >>>>                              object_extents, 0);
> >>>>
> >>>>     // get snap info for each object
> >>>>     for (map<object_t,vector<ObjectExtent> >::iterator p =
> >>>>            object_extents.begin();
> >>>>          p != object_extents.end(); ++p)
> >>> [...]
> >>>>           for (std::vector<ObjectExtent>::iterator q = p->second.begin();
> >>>>                q != p->second.end(); ++q) {
> >>>>             r = m_callback(off + q->offset, q->length, updated, m_callback_arg);
> >>> [...]
> >>>>           }
> >>> [...]
> >>>>     }
> >>>>>     left -= read_len;
> >>>>     off += read_len;
> >>>>   }
> >>>
> >>> So that iterates over the whole image (in our case, because m_offset is
> >>> 0 and m_length is the image length), then picks out a chunk of read_len
> >>> length, converts that to objects, and then iterates over all of those
> >>> objects and all of their extents.
> >>>
> >>> file_to_extents looks like it’s just an arithmetic operation.  So it
> >>> doesn‘t look like that function returns extents in multiple snapshots.
> >>> It just converts a range into subranges called “objects” and “extents”
> >>> (at least that’s the way it looks to me).
> >>>
> >>> So overall, this looks awfully like it simply iterates over the whole
> >>> image and then returns allocation information gathered as a top-down
> >>> view through all of the snapshots, but not for each snapshot individually.
> >>
> >> Sorry, you're correct. The API function was originally designed to
> >> support delta extents for supporting diff exports, so while it does
> >> open each snapshot's object-map, it only returns a unioned set of
> >> delta extents. The rbd CLI's "disk-usage" action behaves how I
> >> described by counting the actual dirty block usage between snapshots.
> >
> > Max, Jason, thanks for the details!
> >
> > If you agree, I'll try to implement something similar, iterating on all
> > snapshots.
>
> Yes, that should work.
>
> > What about metadata?
> > I saw some APIs (e.g. rbd_metadata_list()) that allow us to get the metadata.
> > An idea could be to iterate over them and sum the keys-values size returned,
> > but I'm not sure they are returned in the same format as they are stored.
>
> I wouldn’t mind ignoring metadata too much.  If you can get something to
> work, even better, but I don’t consider that too important.

I don't think it's a good idea to attempt to estimate it. If there is
enough desire, we can always add a supported "disk-usage" API method
that takes into account things like the image metadata, object-map,
journaling, etc overhead. However, I wouldn't expect it to be anywhere
near the actual block usage (unless something has gone terribly wrong
w/ journaling and it fails to trim your log).

> Max
>
Stefano Garzarella July 10, 2019, 3:28 p.m. UTC | #10
On Tue, Jul 09, 2019 at 09:42:14PM -0400, Jason Dillaman wrote:
> On Tue, Jul 9, 2019 at 11:32 AM Max Reitz <mreitz@redhat.com> wrote:
> > On 09.07.19 15:09, Stefano Garzarella wrote:
> > >
> > > Max, Jason, thanks for the details!
> > >
> > > If you agree, I'll try to implement something similar, iterating on all
> > > snapshots.
> >
> > Yes, that should work.
> >
> > > What about metadata?
> > > I saw some APIs (e.g. rbd_metadata_list()) that allow us to get the metadata.
> > > An idea could be to iterate over them and sum the keys-values size returned,
> > > but I'm not sure they are returned in the same format as they are stored.
> >
> > I wouldn’t mind ignoring metadata too much.  If you can get something to
> > work, even better, but I don’t consider that too important.
> 
> I don't think it's a good idea to attempt to estimate it. If there is
> enough desire, we can always add a supported "disk-usage" API method
> that takes into account things like the image metadata, object-map,
> journaling, etc overhead. However, I wouldn't expect it to be anywhere
> near the actual block usage (unless something has gone terribly wrong
> w/ journaling and it fails to trim your log).

Okay, so for now, I'll skip the metadata and if you think can be useful
to add "disk-usage" API in RBD, I'll use it later.

Thanks,
Stefano
diff mbox series

Patch

diff --git a/block/rbd.c b/block/rbd.c
index 59757b3120..b6bed683e5 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -1084,6 +1084,59 @@  static int64_t qemu_rbd_getlength(BlockDriverState *bs)
     return info.size;
 }
 
+static int rbd_allocated_size_cb(uint64_t offset, size_t len, int exists,
+                                 void *arg)
+{
+    int64_t *alloc_size = (int64_t *) arg;
+
+    if (exists) {
+        (*alloc_size) += len;
+    }
+
+    return 0;
+}
+
+static int64_t qemu_rbd_get_allocated_file_size(BlockDriverState *bs)
+{
+    BDRVRBDState *s = bs->opaque;
+    uint64_t flags, features;
+    int64_t alloc_size = 0;
+    int r;
+
+    r = rbd_get_flags(s->image, &flags);
+    if (r < 0) {
+        return r;
+    }
+
+    r = rbd_get_features(s->image, &features);
+    if (r < 0) {
+        return r;
+    }
+
+    /*
+     * We use rbd_diff_iterate2() only if the RBD image have fast-diff
+     * feature enabled. If it is disabled, rbd_diff_iterate2() could be
+     * very slow on a big image.
+     */
+    if (!(features & RBD_FEATURE_FAST_DIFF) ||
+        (flags & RBD_FLAG_FAST_DIFF_INVALID)) {
+        return -ENOTSUP;
+    }
+
+    /*
+     * rbd_diff_iterate2(), if the source snapshot name is NULL, invokes
+     * the callback on all allocated regions of the image.
+     */
+    r = rbd_diff_iterate2(s->image, NULL, 0,
+                          bs->total_sectors * BDRV_SECTOR_SIZE, 0, 1,
+                          &rbd_allocated_size_cb, &alloc_size);
+    if (r < 0) {
+        return r;
+    }
+
+    return alloc_size;
+}
+
 static int coroutine_fn qemu_rbd_co_truncate(BlockDriverState *bs,
                                              int64_t offset,
                                              PreallocMode prealloc,
@@ -1291,6 +1344,7 @@  static BlockDriver bdrv_rbd = {
     .bdrv_get_info          = qemu_rbd_getinfo,
     .create_opts            = &qemu_rbd_create_opts,
     .bdrv_getlength         = qemu_rbd_getlength,
+    .bdrv_get_allocated_file_size = qemu_rbd_get_allocated_file_size,
     .bdrv_co_truncate       = qemu_rbd_co_truncate,
     .protocol_name          = "rbd",