diff mbox series

[v4,2/3] replication: acquire aio context before calling job_cancel_sync

Message ID 20200401081504.200017-3-s.reiter@proxmox.com (mailing list archive)
State New, archived
Headers show
Series Fix some AIO context locking in jobs | expand

Commit Message

Stefan Reiter April 1, 2020, 8:15 a.m. UTC
job_cancel_sync requires the job's lock to be held, all other callers
already do this (replication_stop, drive_backup_abort,
blockdev_backup_abort, job_cancel_sync_all, cancel_common).

Signed-off-by: Stefan Reiter <s.reiter@proxmox.com>
---
 block/replication.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

Comments

Max Reitz April 2, 2020, 12:41 p.m. UTC | #1
On 01.04.20 10:15, Stefan Reiter wrote:
> job_cancel_sync requires the job's lock to be held, all other callers
> already do this (replication_stop, drive_backup_abort,
> blockdev_backup_abort, job_cancel_sync_all, cancel_common).

I think all other callers come directly from QMP, though, so they have
no locks yet.  This OTOH is called from a block driver function, so I
would assume the BDS context is locked already (or rather, this is
executed in the BDS context).

I also think that the commit job runs in the same context.  So I would
assume that this would be a nested lock, which should be unnecessary and
might cause problems.  Maybe we should just assert that the job’s
context is the current context?

(Or would that still be problematic because now job_txn_apply() wants to
release some context, and that isn’t possible without this patch?  I
would hope it’s possible, because I think we shouldn’t have to acquire
the current context, and should be free to release it...?  I have no
idea.  Maybe we are actually free to reacquire the current context here.)

> Signed-off-by: Stefan Reiter <s.reiter@proxmox.com>
> ---
>  block/replication.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/block/replication.c b/block/replication.c
> index 413d95407d..17ddc31569 100644
> --- a/block/replication.c
> +++ b/block/replication.c
> @@ -144,12 +144,18 @@ fail:
>  static void replication_close(BlockDriverState *bs)
>  {
>      BDRVReplicationState *s = bs->opaque;
> +    Job *commit_job;
> +    AioContext *commit_ctx;
>  
>      if (s->stage == BLOCK_REPLICATION_RUNNING) {
>          replication_stop(s->rs, false, NULL);
>      }
>      if (s->stage == BLOCK_REPLICATION_FAILOVER) {
> -        job_cancel_sync(&s->commit_job->job);
> +        commit_job = &s->commit_job->job;
> +        commit_ctx = commit_job->aio_context;
> +        aio_context_acquire(commit_ctx);
> +        job_cancel_sync(commit_job);
> +        aio_context_release(commit_ctx);

Anyway, there’s also the problem that I would guess the
job_cancel_sync() might move the job from its current context back into
the main context.  Then we’d release the wrong context here.

Max

>      }
>  
>      if (s->mode == REPLICATION_MODE_SECONDARY) {
>
Stefan Reiter April 2, 2020, 3:05 p.m. UTC | #2
On 02/04/2020 14:41, Max Reitz wrote:
> On 01.04.20 10:15, Stefan Reiter wrote:
>> job_cancel_sync requires the job's lock to be held, all other callers
>> already do this (replication_stop, drive_backup_abort,
>> blockdev_backup_abort, job_cancel_sync_all, cancel_common).
> 
> I think all other callers come directly from QMP, though, so they have
> no locks yet.  This OTOH is called from a block driver function, so I
> would assume the BDS context is locked already (or rather, this is
> executed in the BDS context).
> 
> I also think that the commit job runs in the same context.  So I would
> assume that this would be a nested lock, which should be unnecessary and
> might cause problems.  Maybe we should just assert that the job’s
> context is the current context?
> 
> (Or would that still be problematic because now job_txn_apply() wants to
> release some context, and that isn’t possible without this patch?  I
> would hope it’s possible, because I think we shouldn’t have to acquire
> the current context, and should be free to release it...?  I have no
> idea.  Maybe we are actually free to reacquire the current context here.)
> 

You're right, this seems to be unnecessary. Adding an

   assert(commit_job->aio_context == qemu_get_current_aio_context());

to make sure seems like a good idea though. bdrv_close appears to always 
have the lock on its driver's context held.

>> Signed-off-by: Stefan Reiter <s.reiter@proxmox.com>
>> ---
>>   block/replication.c | 8 +++++++-
>>   1 file changed, 7 insertions(+), 1 deletion(-)
>>
>> diff --git a/block/replication.c b/block/replication.c
>> index 413d95407d..17ddc31569 100644
>> --- a/block/replication.c
>> +++ b/block/replication.c
>> @@ -144,12 +144,18 @@ fail:
>>   static void replication_close(BlockDriverState *bs)
>>   {
>>       BDRVReplicationState *s = bs->opaque;
>> +    Job *commit_job;
>> +    AioContext *commit_ctx;
>>   
>>       if (s->stage == BLOCK_REPLICATION_RUNNING) {
>>           replication_stop(s->rs, false, NULL);
>>       }
>>       if (s->stage == BLOCK_REPLICATION_FAILOVER) {
>> -        job_cancel_sync(&s->commit_job->job);
>> +        commit_job = &s->commit_job->job;
>> +        commit_ctx = commit_job->aio_context;
>> +        aio_context_acquire(commit_ctx);
>> +        job_cancel_sync(commit_job);
>> +        aio_context_release(commit_ctx);
> 
> Anyway, there’s also the problem that I would guess the
> job_cancel_sync() might move the job from its current context back into
> the main context.  Then we’d release the wrong context here.
>  > Max
> 
>>       }
>>   
>>       if (s->mode == REPLICATION_MODE_SECONDARY) {
>>
> 
>
diff mbox series

Patch

diff --git a/block/replication.c b/block/replication.c
index 413d95407d..17ddc31569 100644
--- a/block/replication.c
+++ b/block/replication.c
@@ -144,12 +144,18 @@  fail:
 static void replication_close(BlockDriverState *bs)
 {
     BDRVReplicationState *s = bs->opaque;
+    Job *commit_job;
+    AioContext *commit_ctx;
 
     if (s->stage == BLOCK_REPLICATION_RUNNING) {
         replication_stop(s->rs, false, NULL);
     }
     if (s->stage == BLOCK_REPLICATION_FAILOVER) {
-        job_cancel_sync(&s->commit_job->job);
+        commit_job = &s->commit_job->job;
+        commit_ctx = commit_job->aio_context;
+        aio_context_acquire(commit_ctx);
+        job_cancel_sync(commit_job);
+        aio_context_release(commit_ctx);
     }
 
     if (s->mode == REPLICATION_MODE_SECONDARY) {