diff mbox series

[v2,03/21] preallocate: Don't poll during permission updates

Message ID 20230911094620.45040-4-kwolf@redhat.com (mailing list archive)
State New, archived
Headers show
Series Graph locking part 4 (node management) | expand

Commit Message

Kevin Wolf Sept. 11, 2023, 9:46 a.m. UTC
When the permission related BlockDriver callbacks are called, we are in
the middle of an operation traversing the block graph. Polling in such a
place is a very bad idea because the graph could change in unexpected
ways. In the future, callers will also hold the graph lock, which is
likely to turn polling into a deadlock.

So we need to get rid of calls to functions like bdrv_getlength() or
bdrv_truncate() there as these functions poll internally. They are
currently used so that when no parent has write/resize permissions on
the image any more, the preallocate filter drops the extra preallocated
area in the image file and gives up write/resize permissions itself.

In order to achieve this without polling in .bdrv_check_perm, don't
immediately truncate the image, but only schedule a BH to do so. The
filter keeps the write/resize permissions a bit longer now until the BH
has executed.

There is one case in which delaying doesn't work: Reopening the image
read-only. In this case, bs->file will likely be reopened read-only,
too, so keeping write permissions a bit longer on it doesn't work. But
we can already cover this case in preallocate_reopen_prepare() and not
rely on the permission updates for it.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 block/preallocate.c | 89 +++++++++++++++++++++++++++++++++++----------
 1 file changed, 69 insertions(+), 20 deletions(-)

Comments

Vladimir Sementsov-Ogievskiy Oct. 5, 2023, 7:55 p.m. UTC | #1
On 11.09.23 12:46, Kevin Wolf wrote:
> When the permission related BlockDriver callbacks are called, we are in
> the middle of an operation traversing the block graph. Polling in such a
> place is a very bad idea because the graph could change in unexpected
> ways. In the future, callers will also hold the graph lock, which is
> likely to turn polling into a deadlock.
> 
> So we need to get rid of calls to functions like bdrv_getlength() or
> bdrv_truncate() there as these functions poll internally. They are
> currently used so that when no parent has write/resize permissions on
> the image any more, the preallocate filter drops the extra preallocated
> area in the image file and gives up write/resize permissions itself.
> 
> In order to achieve this without polling in .bdrv_check_perm, don't
> immediately truncate the image, but only schedule a BH to do so. The
> filter keeps the write/resize permissions a bit longer now until the BH
> has executed.
> 
> There is one case in which delaying doesn't work: Reopening the image
> read-only. In this case, bs->file will likely be reopened read-only,
> too, so keeping write permissions a bit longer on it doesn't work. But
> we can already cover this case in preallocate_reopen_prepare() and not
> rely on the permission updates for it.

Hmm, now I found one more "future" case.

I now try to rebase my "[PATCH v7 0/7] blockdev-replace" https://patchew.org/QEMU/20230421114102.884457-1-vsementsov@yandex-team.ru/

And it breaks after this commit.

By accident, blockdev-replace series uses exactly "preallocate" filter to test insertion/removing of filters. And removing is broken now.

Removing is done as follows:

1. We have filter inserted: disk0 -file-> filter -file-> file0

2. blockdev-replace, replaces file child of disk0, so we should get the picture*: disk0 -file-> file0 <-file- filter

3. blockdev-del filter


But step [2] fails, as now preallocate filter doesn't drop permissions during the operation (postponing this for a while) and the picture* is impossible. Permission check fails.

Hmmm... Any idea how blockdev-replace and preallocate filter should work :) ? Maybe, doing truncation in .drain_begin() will help? Will try

> 
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> Reviewed-by: Eric Blake <eblake@redhat.com>
> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
> ---
Kevin Wolf Oct. 6, 2023, 8:56 a.m. UTC | #2
Am 05.10.2023 um 21:55 hat Vladimir Sementsov-Ogievskiy geschrieben:
> On 11.09.23 12:46, Kevin Wolf wrote:
> > When the permission related BlockDriver callbacks are called, we are in
> > the middle of an operation traversing the block graph. Polling in such a
> > place is a very bad idea because the graph could change in unexpected
> > ways. In the future, callers will also hold the graph lock, which is
> > likely to turn polling into a deadlock.
> > 
> > So we need to get rid of calls to functions like bdrv_getlength() or
> > bdrv_truncate() there as these functions poll internally. They are
> > currently used so that when no parent has write/resize permissions on
> > the image any more, the preallocate filter drops the extra preallocated
> > area in the image file and gives up write/resize permissions itself.
> > 
> > In order to achieve this without polling in .bdrv_check_perm, don't
> > immediately truncate the image, but only schedule a BH to do so. The
> > filter keeps the write/resize permissions a bit longer now until the BH
> > has executed.
> > 
> > There is one case in which delaying doesn't work: Reopening the image
> > read-only. In this case, bs->file will likely be reopened read-only,
> > too, so keeping write permissions a bit longer on it doesn't work. But
> > we can already cover this case in preallocate_reopen_prepare() and not
> > rely on the permission updates for it.
> 
> Hmm, now I found one more "future" case.
> 
> I now try to rebase my "[PATCH v7 0/7] blockdev-replace"
> https://patchew.org/QEMU/20230421114102.884457-1-vsementsov@yandex-team.ru/
> 
> And it breaks after this commit.
> 
> By accident, blockdev-replace series uses exactly "preallocate" filter
> to test insertion/removing of filters. And removing is broken now.
> 
> Removing is done as follows:
> 
> 1. We have filter inserted: disk0 -file-> filter -file-> file0
> 
> 2. blockdev-replace, replaces file child of disk0, so we should get the picture*: disk0 -file-> file0 <-file- filter
> 
> 3. blockdev-del filter
> 
> 
> But step [2] fails, as now preallocate filter doesn't drop permissions
> during the operation (postponing this for a while) and the picture* is
> impossible. Permission check fails.
> 
> Hmmm... Any idea how blockdev-replace and preallocate filter should
> work :) ? Maybe, doing truncation in .drain_begin() will help? Will
> try

Hm... What preallocate tries to do is really tricky...

Of course, the error is correct, this is an invalid configuration if
preallocate can still resize the image. So it would have to truncate the
file earlier, but the first time that preallocate knows of the change is
already too late to run requests.

Truncating on drain_begin feels more like a hack, but as long as it does
the job... Of course, this will have the preallocation truncated away on
events that have nothing to do with removing the filter. It's not
necessarily a disaster because preallocation is only an optimisation,
but it doesn't feel great.

Maybe let's take a step back: Which scenario is the preallocate driver
meant for and why do we even need to truncate the image file after
removing the filter? I suppose the filter doesn't make sense with raw
images because these are fixed size anyway, and pretty much any other
image format should be able to tolerate a permanently rounded up file
size. As long as you don't write to the preallocated area, it shouldn't
take space either on any sane filesystem.

Hmm, actually both VHD and VMDK can have footers, better avoid it with
those... But if truncating the image file on close is critical, what do
you do on crashes? Maybe preallocate should just not be considered
compatible with these formats?

Kevin
Vladimir Sementsov-Ogievskiy Oct. 6, 2023, 6:10 p.m. UTC | #3
On 06.10.23 11:56, Kevin Wolf wrote:
> Am 05.10.2023 um 21:55 hat Vladimir Sementsov-Ogievskiy geschrieben:
>> On 11.09.23 12:46, Kevin Wolf wrote:
>>> When the permission related BlockDriver callbacks are called, we are in
>>> the middle of an operation traversing the block graph. Polling in such a
>>> place is a very bad idea because the graph could change in unexpected
>>> ways. In the future, callers will also hold the graph lock, which is
>>> likely to turn polling into a deadlock.
>>>
>>> So we need to get rid of calls to functions like bdrv_getlength() or
>>> bdrv_truncate() there as these functions poll internally. They are
>>> currently used so that when no parent has write/resize permissions on
>>> the image any more, the preallocate filter drops the extra preallocated
>>> area in the image file and gives up write/resize permissions itself.
>>>
>>> In order to achieve this without polling in .bdrv_check_perm, don't
>>> immediately truncate the image, but only schedule a BH to do so. The
>>> filter keeps the write/resize permissions a bit longer now until the BH
>>> has executed.
>>>
>>> There is one case in which delaying doesn't work: Reopening the image
>>> read-only. In this case, bs->file will likely be reopened read-only,
>>> too, so keeping write permissions a bit longer on it doesn't work. But
>>> we can already cover this case in preallocate_reopen_prepare() and not
>>> rely on the permission updates for it.
>>
>> Hmm, now I found one more "future" case.
>>
>> I now try to rebase my "[PATCH v7 0/7] blockdev-replace"
>> https://patchew.org/QEMU/20230421114102.884457-1-vsementsov@yandex-team.ru/
>>
>> And it breaks after this commit.
>>
>> By accident, blockdev-replace series uses exactly "preallocate" filter
>> to test insertion/removing of filters. And removing is broken now.
>>
>> Removing is done as follows:
>>
>> 1. We have filter inserted: disk0 -file-> filter -file-> file0
>>
>> 2. blockdev-replace, replaces file child of disk0, so we should get the picture*: disk0 -file-> file0 <-file- filter
>>
>> 3. blockdev-del filter
>>
>>
>> But step [2] fails, as now preallocate filter doesn't drop permissions
>> during the operation (postponing this for a while) and the picture* is
>> impossible. Permission check fails.
>>
>> Hmmm... Any idea how blockdev-replace and preallocate filter should
>> work :) ? Maybe, doing truncation in .drain_begin() will help? Will
>> try
> 
> Hm... What preallocate tries to do is really tricky...
> 
> Of course, the error is correct, this is an invalid configuration if
> preallocate can still resize the image. So it would have to truncate the
> file earlier, but the first time that preallocate knows of the change is
> already too late to run requests.
> 
> Truncating on drain_begin feels more like a hack, but as long as it does
> the job... Of course, this will have the preallocation truncated away on
> events that have nothing to do with removing the filter. It's not
> necessarily a disaster because preallocation is only an optimisation,
> but it doesn't feel great.

Hmm, yes, that's not good.

> 
> Maybe let's take a step back: Which scenario is the preallocate driver
> meant for and why do we even need to truncate the image file after
> removing the filter? I suppose the filter doesn't make sense with raw
> images because these are fixed size anyway, and pretty much any other
> image format should be able to tolerate a permanently rounded up file
> size. As long as you don't write to the preallocated area, it shouldn't
> take space either on any sane filesystem.
> 
> Hmm, actually both VHD and VMDK can have footers, better avoid it with
> those... But if truncating the image file on close is critical, what do
> you do on crashes? Maybe preallocate should just not be considered
> compatible with these formats?
> 

Originally preallocate filter was made to be used with qcow2, on some proprietary storage, where:

1. Allocating of big chunk works a lot fater than allocating several smaller chunks
2. Holes are not free and/or file length is not free, so we really want to truncate the file back on close

Den, correct me if I'm wrong.

Good thing is that in this scenario we don't need to remove the filter in runtime, so there is no problem.


Now I think that the generic solution is just add a new handler .bdrv_pre_replace, so blockdev-replace may work as follows:

drain_begin

call .bdrv_pre_replace for all affected nodes

do the replace

drain_end

And prellocate filter would do truncation in this .bdrv_pre_replace handler and set some flag, that we have nothing to trunctate (the flag is automatically cleared on .drained_end handler). Then during permission update if we see that nothing-to-truncate flag, we can drop permissions immediately.

But this difficulty may be postponed, and I can just document that preallocate filter doesn't support removing in runtime (and modify the test to use another filter, or just not to remove preallocate filter).
Denis V. Lunev Oct. 9, 2023, 9:28 a.m. UTC | #4
On 10/6/23 20:10, Vladimir Sementsov-Ogievskiy wrote:
> On 06.10.23 11:56, Kevin Wolf wrote:
>> Am 05.10.2023 um 21:55 hat Vladimir Sementsov-Ogievskiy geschrieben:
>>> On 11.09.23 12:46, Kevin Wolf wrote:
>>>> When the permission related BlockDriver callbacks are called, we 
>>>> are in
>>>> the middle of an operation traversing the block graph. Polling in 
>>>> such a
>>>> place is a very bad idea because the graph could change in unexpected
>>>> ways. In the future, callers will also hold the graph lock, which is
>>>> likely to turn polling into a deadlock.
>>>>
>>>> So we need to get rid of calls to functions like bdrv_getlength() or
>>>> bdrv_truncate() there as these functions poll internally. They are
>>>> currently used so that when no parent has write/resize permissions on
>>>> the image any more, the preallocate filter drops the extra 
>>>> preallocated
>>>> area in the image file and gives up write/resize permissions itself.
>>>>
>>>> In order to achieve this without polling in .bdrv_check_perm, don't
>>>> immediately truncate the image, but only schedule a BH to do so. The
>>>> filter keeps the write/resize permissions a bit longer now until 
>>>> the BH
>>>> has executed.
>>>>
>>>> There is one case in which delaying doesn't work: Reopening the image
>>>> read-only. In this case, bs->file will likely be reopened read-only,
>>>> too, so keeping write permissions a bit longer on it doesn't work. But
>>>> we can already cover this case in preallocate_reopen_prepare() and not
>>>> rely on the permission updates for it.
>>>
>>> Hmm, now I found one more "future" case.
>>>
>>> I now try to rebase my "[PATCH v7 0/7] blockdev-replace"
>>> https://patchew.org/QEMU/20230421114102.884457-1-vsementsov@yandex-team.ru/ 
>>>
>>>
>>> And it breaks after this commit.
>>>
>>> By accident, blockdev-replace series uses exactly "preallocate" filter
>>> to test insertion/removing of filters. And removing is broken now.
>>>
>>> Removing is done as follows:
>>>
>>> 1. We have filter inserted: disk0 -file-> filter -file-> file0
>>>
>>> 2. blockdev-replace, replaces file child of disk0, so we should get 
>>> the picture*: disk0 -file-> file0 <-file- filter
>>>
>>> 3. blockdev-del filter
>>>
>>>
>>> But step [2] fails, as now preallocate filter doesn't drop permissions
>>> during the operation (postponing this for a while) and the picture* is
>>> impossible. Permission check fails.
>>>
>>> Hmmm... Any idea how blockdev-replace and preallocate filter should
>>> work :) ? Maybe, doing truncation in .drain_begin() will help? Will
>>> try
>>
>> Hm... What preallocate tries to do is really tricky...
>>
>> Of course, the error is correct, this is an invalid configuration if
>> preallocate can still resize the image. So it would have to truncate the
>> file earlier, but the first time that preallocate knows of the change is
>> already too late to run requests.
>>
>> Truncating on drain_begin feels more like a hack, but as long as it does
>> the job... Of course, this will have the preallocation truncated away on
>> events that have nothing to do with removing the filter. It's not
>> necessarily a disaster because preallocation is only an optimisation,
>> but it doesn't feel great.
>
> Hmm, yes, that's not good.
>
>>
>> Maybe let's take a step back: Which scenario is the preallocate driver
>> meant for and why do we even need to truncate the image file after
>> removing the filter? I suppose the filter doesn't make sense with raw
>> images because these are fixed size anyway, and pretty much any other
>> image format should be able to tolerate a permanently rounded up file
>> size. As long as you don't write to the preallocated area, it shouldn't
>> take space either on any sane filesystem.
>>
>> Hmm, actually both VHD and VMDK can have footers, better avoid it with
>> those... But if truncating the image file on close is critical, what do
>> you do on crashes? Maybe preallocate should just not be considered
>> compatible with these formats?
>>
>
> Originally preallocate filter was made to be used with qcow2, on some 
> proprietary storage, where:
>
> 1. Allocating of big chunk works a lot fater than allocating several 
> smaller chunks
> 2. Holes are not free and/or file length is not free, so we really 
> want to truncate the file back on close
>
> Den, correct me if I'm wrong.

1. Absolutely correct. This is true when the file attributes
     are stored in a centralized place aka metadata storage
     and requests to it does not scale well.

2. This is at my opinion has different meaning. We have
     tried to make local storage behavior and distributed
     storage behavior to be the same when VM is off, i.e.
     the file should be in the same state (no free blocks
     at the end of the file).

>
> Good thing is that in this scenario we don't need to remove the filter 
> in runtime, so there is no problem.
>
Yes, this filter is not dynamic in that respect. It is either
here or not here.


>
> Now I think that the generic solution is just add a new handler 
> .bdrv_pre_replace, so blockdev-replace may work as follows:
>
> drain_begin
>
> call .bdrv_pre_replace for all affected nodes
>
> do the replace
>
> drain_end
>
> And prellocate filter would do truncation in this .bdrv_pre_replace 
> handler and set some flag, that we have nothing to trunctate (the flag 
> is automatically cleared on .drained_end handler). Then during 
> permission update if we see that nothing-to-truncate flag, we can drop 
> permissions immediately.
>
> But this difficulty may be postponed, and I can just document that 
> preallocate filter doesn't support removing in runtime (and modify the 
> test to use another filter, or just not to remove preallocate filter).
>

Den
diff mbox series

Patch

diff --git a/block/preallocate.c b/block/preallocate.c
index 3173d80534..bfb638d8b1 100644
--- a/block/preallocate.c
+++ b/block/preallocate.c
@@ -75,8 +75,14 @@  typedef struct BDRVPreallocateState {
      * be invalid (< 0) when we don't have both exclusive BLK_PERM_RESIZE and
      * BLK_PERM_WRITE permissions on file child.
      */
+
+    /* Gives up the resize permission on children when parents don't need it */
+    QEMUBH *drop_resize_bh;
 } BDRVPreallocateState;
 
+static int preallocate_drop_resize(BlockDriverState *bs, Error **errp);
+static void preallocate_drop_resize_bh(void *opaque);
+
 #define PREALLOCATE_OPT_PREALLOC_ALIGN "prealloc-align"
 #define PREALLOCATE_OPT_PREALLOC_SIZE "prealloc-size"
 static QemuOptsList runtime_opts = {
@@ -142,6 +148,7 @@  static int preallocate_open(BlockDriverState *bs, QDict *options, int flags,
      * For this to work, mark them invalid.
      */
     s->file_end = s->zero_start = s->data_end = -EINVAL;
+    s->drop_resize_bh = qemu_bh_new(preallocate_drop_resize_bh, bs);
 
     ret = bdrv_open_file_child(NULL, options, "file", bs, errp);
     if (ret < 0) {
@@ -193,6 +200,9 @@  static void preallocate_close(BlockDriverState *bs)
 {
     BDRVPreallocateState *s = bs->opaque;
 
+    qemu_bh_cancel(s->drop_resize_bh);
+    qemu_bh_delete(s->drop_resize_bh);
+
     if (s->data_end >= 0) {
         preallocate_truncate_to_real_size(bs, NULL);
     }
@@ -211,6 +221,7 @@  static int preallocate_reopen_prepare(BDRVReopenState *reopen_state,
                                       BlockReopenQueue *queue, Error **errp)
 {
     PreallocateOpts *opts = g_new0(PreallocateOpts, 1);
+    int ret;
 
     if (!preallocate_absorb_opts(opts, reopen_state->options,
                                  reopen_state->bs->file->bs, errp)) {
@@ -218,6 +229,19 @@  static int preallocate_reopen_prepare(BDRVReopenState *reopen_state,
         return -EINVAL;
     }
 
+    /*
+     * Drop the preallocation already here if reopening read-only. The child
+     * might also be reopened read-only and then scheduling a BH during the
+     * permission update is too late.
+     */
+    if ((reopen_state->flags & BDRV_O_RDWR) == 0) {
+        ret = preallocate_drop_resize(reopen_state->bs, errp);
+        if (ret < 0) {
+            g_free(opts);
+            return ret;
+        }
+    }
+
     reopen_state->opaque = opts;
 
     return 0;
@@ -475,41 +499,61 @@  preallocate_co_getlength(BlockDriverState *bs)
     return ret;
 }
 
-static int preallocate_check_perm(BlockDriverState *bs,
-                                  uint64_t perm, uint64_t shared, Error **errp)
+static int preallocate_drop_resize(BlockDriverState *bs, Error **errp)
 {
     BDRVPreallocateState *s = bs->opaque;
+    int ret;
 
-    if (s->data_end >= 0 && !can_write_resize(perm)) {
-        /*
-         * Lose permissions.
-         * We should truncate in check_perm, as in set_perm bs->file->perm will
-         * be already changed, and we should not violate it.
-         */
-        return preallocate_truncate_to_real_size(bs, errp);
+    if (s->data_end < 0) {
+        return 0;
+    }
+
+    /*
+     * Before switching children to be read-only, truncate them to remove
+     * the preallocation and let them have the real size.
+     */
+    ret = preallocate_truncate_to_real_size(bs, errp);
+    if (ret < 0) {
+        return ret;
     }
 
+    /*
+     * We'll drop our permissions and will allow other users to take write and
+     * resize permissions (see preallocate_child_perm). Anyone will be able to
+     * change the child, so mark all states invalid. We'll regain control if a
+     * parent requests write access again.
+     */
+    s->data_end = s->file_end = s->zero_start = -EINVAL;
+
+    bdrv_graph_rdlock_main_loop();
+    bdrv_child_refresh_perms(bs, bs->file, NULL);
+    bdrv_graph_rdunlock_main_loop();
+
     return 0;
 }
 
+static void preallocate_drop_resize_bh(void *opaque)
+{
+    /*
+     * In case of errors, we'll simply keep the exclusive lock on the image
+     * indefinitely.
+     */
+    preallocate_drop_resize(opaque, NULL);
+}
+
 static void preallocate_set_perm(BlockDriverState *bs,
                                  uint64_t perm, uint64_t shared)
 {
     BDRVPreallocateState *s = bs->opaque;
 
     if (can_write_resize(perm)) {
+        qemu_bh_cancel(s->drop_resize_bh);
         if (s->data_end < 0) {
             s->data_end = s->file_end = s->zero_start =
-                bdrv_getlength(bs->file->bs);
+                bs->file->bs->total_sectors * BDRV_SECTOR_SIZE;
         }
     } else {
-        /*
-         * We drop our permissions, as well as allow shared
-         * permissions (see preallocate_child_perm), anyone will be able to
-         * change the child, so mark all states invalid. We'll regain control if
-         * get good permissions back.
-         */
-        s->data_end = s->file_end = s->zero_start = -EINVAL;
+        qemu_bh_schedule(s->drop_resize_bh);
     }
 }
 
@@ -517,10 +561,16 @@  static void preallocate_child_perm(BlockDriverState *bs, BdrvChild *c,
     BdrvChildRole role, BlockReopenQueue *reopen_queue,
     uint64_t perm, uint64_t shared, uint64_t *nperm, uint64_t *nshared)
 {
+    BDRVPreallocateState *s = bs->opaque;
+
     bdrv_default_perms(bs, c, role, reopen_queue, perm, shared, nperm, nshared);
 
-    if (can_write_resize(perm)) {
-        /* This should come by default, but let's enforce: */
+    /*
+     * We need exclusive write and resize permissions on the child not only when
+     * the parent can write to it, but also after the parent gave up write
+     * permissions until preallocate_drop_resize() has completed.
+     */
+    if (can_write_resize(perm) || s->data_end != -EINVAL) {
         *nperm |= BLK_PERM_WRITE | BLK_PERM_RESIZE;
 
         /*
@@ -550,7 +600,6 @@  static BlockDriver bdrv_preallocate_filter = {
     .bdrv_co_flush = preallocate_co_flush,
     .bdrv_co_truncate = preallocate_co_truncate,
 
-    .bdrv_check_perm = preallocate_check_perm,
     .bdrv_set_perm = preallocate_set_perm,
     .bdrv_child_perm = preallocate_child_perm,