Message ID | 20210315060611.2989049-8-rvkagan@yandex-team.ru (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | block/nbd: decouple reconnect from drain | expand |
15.03.2021 09:06, Roman Kagan wrote: > As the reconnect logic no longer interferes with drained sections, it > appears unnecessary to explicitly manipulate the in_flight counter. > > Fixes: 5ad81b4946 ("nbd: Restrict connection_co reentrance") And here you actually allow qemu_aio_coroutine_enter() call in nbd_client_attach_aio_context_bh() to enter connection_co in any yield point which is possible during drained section. The analysis should be done to be sure that all these yield points are safe for reentering by external qemu_aio_coroutine_enter(). (By external I mean not by the actual enter() we are waiting for at the yield() point. For example qemu_channel_yield() supports reentering.. And therefore (as I understand after fast looking through) nbd_read() should support reentering too..
On Mon, Mar 15, 2021 at 11:15:44PM +0300, Vladimir Sementsov-Ogievskiy wrote: > 15.03.2021 09:06, Roman Kagan wrote: > > As the reconnect logic no longer interferes with drained sections, it > > appears unnecessary to explicitly manipulate the in_flight counter. > > > > Fixes: 5ad81b4946 ("nbd: Restrict connection_co reentrance") > > And here you actually allow qemu_aio_coroutine_enter() call in > nbd_client_attach_aio_context_bh() to enter connection_co in any yield > point which is possible during drained section. The analysis should be > done to be sure that all these yield points are safe for reentering by > external qemu_aio_coroutine_enter(). (By external I mean not by the > actual enter() we are waiting for at the yield() point. For example > qemu_channel_yield() supports reentering.. And therefore (as I > understand after fast looking through) nbd_read() should support > reentering too.. I'll do a more thorough analysis of how safe it is. FWIW this hasn't triggered any test failures yet, but that assert in patch 3 didn't ever go off either so I'm not sure I can trust the tests on this. Thanks, Roman.
16.03.2021 19:08, Roman Kagan wrote: > On Mon, Mar 15, 2021 at 11:15:44PM +0300, Vladimir Sementsov-Ogievskiy wrote: >> 15.03.2021 09:06, Roman Kagan wrote: >>> As the reconnect logic no longer interferes with drained sections, it >>> appears unnecessary to explicitly manipulate the in_flight counter. >>> >>> Fixes: 5ad81b4946 ("nbd: Restrict connection_co reentrance") >> >> And here you actually allow qemu_aio_coroutine_enter() call in >> nbd_client_attach_aio_context_bh() to enter connection_co in any yield >> point which is possible during drained section. The analysis should be >> done to be sure that all these yield points are safe for reentering by >> external qemu_aio_coroutine_enter(). (By external I mean not by the >> actual enter() we are waiting for at the yield() point. For example >> qemu_channel_yield() supports reentering.. And therefore (as I >> understand after fast looking through) nbd_read() should support >> reentering too.. > > I'll do a more thorough analysis of how safe it is. > > FWIW this hasn't triggered any test failures yet, but that assert in > patch 3 didn't ever go off either so I'm not sure I can trust the tests > on this. > Hmm. First, we should consider qemu_coroutine_yield() in nbd_co_establish_connection(). Most of nbd_co_establish_connection_cancel() purpose is to avoid reentering this yield().. And I don't know, how to make it safely reenterable: keep in mind bh that may be already scheduled by connect_thread_func(). And if bh is already scheduled, can we cancel it? I'm not sure. We have qemu_bh_delete(). But is it possible, that BH is near to be executed and already cannot be removed by qemu_bh_delete()? I don't know. And if we can't safely drop the bh at any moment, we should wait in nbd_client_detach_aio_context until the scheduled bh enters the connection_co.. Or something like this
On Tue, Mar 16, 2021 at 09:37:13PM +0300, Vladimir Sementsov-Ogievskiy wrote: > 16.03.2021 19:08, Roman Kagan wrote: > > On Mon, Mar 15, 2021 at 11:15:44PM +0300, Vladimir Sementsov-Ogievskiy wrote: > > > 15.03.2021 09:06, Roman Kagan wrote: > > > > As the reconnect logic no longer interferes with drained sections, it > > > > appears unnecessary to explicitly manipulate the in_flight counter. > > > > > > > > Fixes: 5ad81b4946 ("nbd: Restrict connection_co reentrance") > > > > > > And here you actually allow qemu_aio_coroutine_enter() call in > > > nbd_client_attach_aio_context_bh() to enter connection_co in any yield > > > point which is possible during drained section. The analysis should be > > > done to be sure that all these yield points are safe for reentering by > > > external qemu_aio_coroutine_enter(). (By external I mean not by the > > > actual enter() we are waiting for at the yield() point. For example > > > qemu_channel_yield() supports reentering.. And therefore (as I > > > understand after fast looking through) nbd_read() should support > > > reentering too.. > > > > I'll do a more thorough analysis of how safe it is. > > > > FWIW this hasn't triggered any test failures yet, but that assert in > > patch 3 didn't ever go off either so I'm not sure I can trust the tests > > on this. > > > > Hmm. First, we should consider qemu_coroutine_yield() in > nbd_co_establish_connection(). > > Most of nbd_co_establish_connection_cancel() purpose is to avoid > reentering this yield().. Unless I'm overlooking something, nbd_co_establish_connection() is fine with spurious entering at this yield point. What does look problematic, though, is your next point: > And I don't know, how to make it safely reenterable: keep in mind bh > that may be already scheduled by connect_thread_func(). And if bh is > already scheduled, can we cancel it? I'm not sure. > > We have qemu_bh_delete(). But is it possible, that BH is near to be > executed and already cannot be removed by qemu_bh_delete()? I don't > know. > > And if we can't safely drop the bh at any moment, we should wait in > nbd_client_detach_aio_context until the scheduled bh enters the > connection_co.. Or something like this So I think it's not the reentry at this yield point per se which is problematic, it's that that bh may have been scheduled before the aio_context switch so once it runs it would wake up connection_co on the old aio_context. I think it may be possible to address by adding a check into connect_bh(). Thanks, Roman.
diff --git a/block/nbd.c b/block/nbd.c index a5a9e4aca5..3a22f5d897 100644 --- a/block/nbd.c +++ b/block/nbd.c @@ -311,7 +311,6 @@ static void nbd_client_attach_aio_context_bh(void *opaque) if (s->connection_co) { qemu_aio_coroutine_enter(bs->aio_context, s->connection_co); } - bdrv_dec_in_flight(bs); } static void nbd_client_attach_aio_context(BlockDriverState *bs, @@ -327,7 +326,6 @@ static void nbd_client_attach_aio_context(BlockDriverState *bs, qio_channel_attach_aio_context(QIO_CHANNEL(s->ioc), new_context); } - bdrv_inc_in_flight(bs); /* * Need to wait here for the BH to run because the BH must run while the @@ -643,11 +641,9 @@ static coroutine_fn void nbd_reconnect_attempt(BDRVNBDState *s) goto out; } - bdrv_dec_in_flight(s->bs); ret = nbd_client_handshake(s->bs, &local_err); - bdrv_inc_in_flight(s->bs); out: s->connect_status = ret; @@ -759,7 +755,6 @@ static coroutine_fn void nbd_connection_entry(void *opaque) qemu_co_queue_restart_all(&s->free_sema); nbd_recv_coroutines_wake_all(s); - bdrv_dec_in_flight(s->bs); s->connection_co = NULL; if (s->ioc) { @@ -2307,7 +2302,6 @@ static int nbd_open(BlockDriverState *bs, QDict *options, int flags, nbd_init_connect_thread(s); s->connection_co = qemu_coroutine_create(nbd_connection_entry, s); - bdrv_inc_in_flight(bs); aio_co_schedule(bdrv_get_aio_context(bs), s->connection_co); return 0; diff --git a/nbd/client.c b/nbd/client.c index 0c2db4bcba..30d5383cb1 100644 --- a/nbd/client.c +++ b/nbd/client.c @@ -1434,9 +1434,7 @@ nbd_read_eof(BlockDriverState *bs, QIOChannel *ioc, void *buffer, size_t size, len = qio_channel_readv(ioc, &iov, 1, errp); if (len == QIO_CHANNEL_ERR_BLOCK) { - bdrv_dec_in_flight(bs); qio_channel_yield(ioc, G_IO_IN); - bdrv_inc_in_flight(bs); continue; } else if (len < 0) { return -EIO;
As the reconnect logic no longer interferes with drained sections, it appears unnecessary to explicitly manipulate the in_flight counter. Fixes: 5ad81b4946 ("nbd: Restrict connection_co reentrance") Signed-off-by: Roman Kagan <rvkagan@yandex-team.ru> --- block/nbd.c | 6 ------ nbd/client.c | 2 -- 2 files changed, 8 deletions(-)