Message ID | 20191002083003.21556-1-w.bumiller@proxmox.com (mailing list archive) |
---|---|
State | New, archived |
Series | monitor/qmp: resume monitor when clearing its queue |
Cc: Marc-André for additional monitor and chardev expertise.

Wolfgang Bumiller <w.bumiller@proxmox.com> writes:

> When a monitor's queue is filled up in handle_qmp_command()
> it gets suspended. It's the dispatcher bh's job currently to
> resume the monitor, which it does after processing an event
> from the queue. However, it is possible for a
> CHR_EVENT_CLOSED event to be processed before the bh
> is scheduled, which will clear the queue without resuming
> the monitor, thereby preventing the dispatcher from reaching
> the resume() call.

Because with the request queue cleared, there's nothing for
monitor_qmp_requests_pop_any_with_lock() to pop, so
monitor_qmp_bh_dispatcher() won't look at this monitor.  It stays
suspended forever.  Correct?

Observable effect for the monitor's user?

> Fix this by resuming the monitor when clearing a queue which
> was filled up.
>
> Signed-off-by: Wolfgang Bumiller <w.bumiller@proxmox.com>
> ---
> @Michael, we ran into this with qemu 4.0, so if the logic in this patch
> is correct it may make sense to include it in the 4.0.1 roundup.
> A backport is at [1] as 4.0 was before the monitor/ dir split.
>
> [1] https://gitlab.com/wbumiller/qemu/commit/9d8bbb5294ed084f282174b0c91e1a614e0a0714
>
>  monitor/qmp.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
>
> diff --git a/monitor/qmp.c b/monitor/qmp.c
> index 9d9e5d8b27..c1db5bf940 100644
> --- a/monitor/qmp.c
> +++ b/monitor/qmp.c
> @@ -70,9 +70,19 @@ static void qmp_request_free(QMPRequest *req)
>  /* Caller must hold mon->qmp.qmp_queue_lock */
>  static void monitor_qmp_cleanup_req_queue_locked(MonitorQMP *mon)
>  {
> +    bool need_resume = (!qmp_oob_enabled(mon) && mon->qmp_requests->length > 0)
> +        || mon->qmp_requests->length == QMP_REQ_QUEUE_LEN_MAX;

Can you explain why this condition is correct?

>      while (!g_queue_is_empty(mon->qmp_requests)) {
>          qmp_request_free(g_queue_pop_head(mon->qmp_requests));
>      }
> +    if (need_resume) {
> +        /*
> +         * Pairs with the monitor_suspend() in handle_qmp_command() in case the
> +         * queue gets cleared from a CHR_EVENT_CLOSED event before the dispatch
> +         * bh got scheduled.
> +         */
> +        monitor_resume(&mon->common);
> +    }
>  }
>
>  static void monitor_qmp_cleanup_queues(MonitorQMP *mon)

Is monitor_qmp_cleanup_req_queue_locked() the correct place?

It's called from

* monitor_qmp_event() case CHR_EVENT_CLOSED via
  monitor_qmp_cleanup_queues(), as part of destroying the monitor's
  session state.

  This is the case you're trying to fix.  Correct?

  I figure monitor_resume() is safe because we haven't really destroyed
  anything, yet, we merely flushed the request queue.  Correct?

* monitor_data_destroy() via monitor_data_destroy_qmp() when destroying
  the monitor.

  Can need_resume be true in this case?  If yes, is monitor_resume()
  still safe?  We're in the middle of destroying the monitor...
On Wed, Oct 09, 2019 at 10:39:44AM +0200, Markus Armbruster wrote:
> Cc: Marc-André for additional monitor and chardev expertise.
>
> Wolfgang Bumiller <w.bumiller@proxmox.com> writes:
>
> > When a monitor's queue is filled up in handle_qmp_command()
> > it gets suspended. It's the dispatcher bh's job currently to
> > resume the monitor, which it does after processing an event
> > from the queue. However, it is possible for a
> > CHR_EVENT_CLOSED event to be processed before the bh
> > is scheduled, which will clear the queue without resuming
> > the monitor, thereby preventing the dispatcher from reaching
> > the resume() call.
>
> Because with the request queue cleared, there's nothing for
> monitor_qmp_requests_pop_any_with_lock() to pop, so
> monitor_qmp_bh_dispatcher() won't look at this monitor.  It stays
> suspended forever.  Correct?
>
> Observable effect for the monitor's user?

Yes.

More easily triggered now with oob. We ran into this a while ago, but
our only reliable trigger was a customized version of -loadstate which
loads the state from a separate file instead of the vmstate region of a
qcow2. Turns out that doing this on slow storage (~12s to load the
data) caused our status daemon to try to poll the qmp socket during the
load-state and give up after a 3s timeout. And since the BH runs in the
main loop, which is not even entered until after the loadstate has
finished, while the iothread handling the qmp socket does fill & clear
the queue, the qmp socket always ended up unusable afterwards.

Aside from that we have users reporting the same symptom (hanging qmp)
appearing randomly on busy systems.

> > Fix this by resuming the monitor when clearing a queue which
> > was filled up.
> >
> > Signed-off-by: Wolfgang Bumiller <w.bumiller@proxmox.com>
> > ---
> > @Michael, we ran into this with qemu 4.0, so if the logic in this patch
> > is correct it may make sense to include it in the 4.0.1 roundup.
> > A backport is at [1] as 4.0 was before the monitor/ dir split.
> >
> > [1] https://gitlab.com/wbumiller/qemu/commit/9d8bbb5294ed084f282174b0c91e1a614e0a0714
> >
> >  monitor/qmp.c | 10 ++++++++++
> >  1 file changed, 10 insertions(+)
> >
> > diff --git a/monitor/qmp.c b/monitor/qmp.c
> > index 9d9e5d8b27..c1db5bf940 100644
> > --- a/monitor/qmp.c
> > +++ b/monitor/qmp.c
> > @@ -70,9 +70,19 @@ static void qmp_request_free(QMPRequest *req)
> >  /* Caller must hold mon->qmp.qmp_queue_lock */
> >  static void monitor_qmp_cleanup_req_queue_locked(MonitorQMP *mon)
> >  {
> > +    bool need_resume = (!qmp_oob_enabled(mon) && mon->qmp_requests->length > 0)
> > +        || mon->qmp_requests->length == QMP_REQ_QUEUE_LEN_MAX;
>
> Can you explain why this condition is correct?

Sorry, I meant to add a comment pointing to monitor_qmp_bh_dispatcher(),
which does the following *after* popping 1 element off the queue:

    need_resume = !qmp_oob_enabled(mon) ||
        mon->qmp_requests->length == QMP_REQ_QUEUE_LEN_MAX - 1;
    qemu_mutex_unlock(&mon->qmp_queue_lock);

It's supposed to be the same condition, but _before_ popping off an
element (hence no `- 1`), except that the queue also shouldn't be
empty, because otherwise the `monitor_suspend()` in
`handle_qmp_command()` hasn't happened (though on second thought, we
could probably just return early in that case).

> >      while (!g_queue_is_empty(mon->qmp_requests)) {
> >          qmp_request_free(g_queue_pop_head(mon->qmp_requests));
> >      }
> > +    if (need_resume) {
> > +        /*
> > +         * Pairs with the monitor_suspend() in handle_qmp_command() in case the
> > +         * queue gets cleared from a CHR_EVENT_CLOSED event before the dispatch
> > +         * bh got scheduled.
> > +         */
> > +        monitor_resume(&mon->common);
> > +    }
> >  }
> >
> >  static void monitor_qmp_cleanup_queues(MonitorQMP *mon)
>
> Is monitor_qmp_cleanup_req_queue_locked() the correct place?
>
> It's called from
>
> * monitor_qmp_event() case CHR_EVENT_CLOSED via
>   monitor_qmp_cleanup_queues(), as part of destroying the monitor's
>   session state.
>
>   This is the case you're trying to fix.  Correct?
>
>   I figure monitor_resume() is safe because we haven't really destroyed
>   anything, yet, we merely flushed the request queue.  Correct?
>
> * monitor_data_destroy() via monitor_data_destroy_qmp() when destroying
>   the monitor.
>
>   Can need_resume be true in this case?  If yes, is monitor_resume()
>   still safe?  We're in the middle of destroying the monitor...

I thought so when first reading through it, but on second thought, we
should probably avoid this for sanity's sake.
Maybe with a flag, or an extra parameter.
Or we could introduce a "bool queue_filled" that we set in
handle_qmp_command() instead of "calculating" `need_resume` in 2 places,
and unset it in `monitor_data_destroy()` before clearing the queue?
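A quick way to check the equivalence described above is to compare both
conditions for every possible pre-pop queue length. The following
standalone program is only an illustration (QMP_REQ_QUEUE_LEN_MAX is
assumed to be 8 here; the real value is defined in QEMU's monitor code):

    #include <assert.h>
    #include <stdbool.h>
    #include <stdio.h>

    #define QMP_REQ_QUEUE_LEN_MAX 8   /* assumed; the real value lives in QEMU */

    int main(void)
    {
        for (int oob = 0; oob <= 1; oob++) {
            /* len is the queue length *before* anything is popped; the
             * dispatcher only runs after popping one request, so len >= 1 */
            for (int len = 1; len <= QMP_REQ_QUEUE_LEN_MAX; len++) {
                /* proposed condition in monitor_qmp_cleanup_req_queue_locked(),
                 * evaluated before the queue is flushed */
                bool cleanup_resume = (!oob && len > 0)
                    || len == QMP_REQ_QUEUE_LEN_MAX;
                /* condition in monitor_qmp_bh_dispatcher(), evaluated after
                 * one request has been popped off the queue */
                bool dispatcher_resume = !oob
                    || (len - 1) == QMP_REQ_QUEUE_LEN_MAX - 1;
                assert(cleanup_resume == dispatcher_resume);
            }
        }
        printf("both conditions agree for every non-empty queue length\n");
        return 0;
    }

With an empty queue the cleanup-time condition is false, which matches
the observation that handle_qmp_command() never suspended the monitor
in that case.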
Wolfgang Bumiller <w.bumiller@proxmox.com> writes:

> On Wed, Oct 09, 2019 at 10:39:44AM +0200, Markus Armbruster wrote:
>> Cc: Marc-André for additional monitor and chardev expertise.
>>
>> Wolfgang Bumiller <w.bumiller@proxmox.com> writes:
>>
>> > When a monitor's queue is filled up in handle_qmp_command()
>> > it gets suspended. It's the dispatcher bh's job currently to
>> > resume the monitor, which it does after processing an event
>> > from the queue. However, it is possible for a
>> > CHR_EVENT_CLOSED event to be processed before the bh
>> > is scheduled, which will clear the queue without resuming
>> > the monitor, thereby preventing the dispatcher from reaching
>> > the resume() call.
>>
>> Because with the request queue cleared, there's nothing for
>> monitor_qmp_requests_pop_any_with_lock() to pop, so
>> monitor_qmp_bh_dispatcher() won't look at this monitor.  It stays
>> suspended forever.  Correct?
>>
>> Observable effect for the monitor's user?
>
> Yes.

I was too terse, let me try again: what exactly breaks for the monitor's
user?

> More easily triggered now with oob. We ran into this a while ago, but
> our only reliable trigger was a customized version of -loadstate which
> loads the state from a separate file instead of the vmstate region of a
> qcow2. Turns out that doing this on slow storage (~12s to load the
> data) caused our status daemon to try to poll the qmp socket during the
> load-state and give up after a 3s timeout. And since the BH runs in the
> main loop, which is not even entered until after the loadstate has
> finished, while the iothread handling the qmp socket does fill & clear
> the queue, the qmp socket always ended up unusable afterwards.
>
> Aside from that we have users reporting the same symptom (hanging qmp)
> appearing randomly on busy systems.
>
>> > Fix this by resuming the monitor when clearing a queue which
>> > was filled up.
>> >
>> > Signed-off-by: Wolfgang Bumiller <w.bumiller@proxmox.com>
>> > ---
>> > @Michael, we ran into this with qemu 4.0, so if the logic in this patch
>> > is correct it may make sense to include it in the 4.0.1 roundup.
>> > A backport is at [1] as 4.0 was before the monitor/ dir split.
>> >
>> > [1] https://gitlab.com/wbumiller/qemu/commit/9d8bbb5294ed084f282174b0c91e1a614e0a0714
>> >
>> >  monitor/qmp.c | 10 ++++++++++
>> >  1 file changed, 10 insertions(+)
>> >
>> > diff --git a/monitor/qmp.c b/monitor/qmp.c
>> > index 9d9e5d8b27..c1db5bf940 100644
>> > --- a/monitor/qmp.c
>> > +++ b/monitor/qmp.c
>> > @@ -70,9 +70,19 @@ static void qmp_request_free(QMPRequest *req)
>> >  /* Caller must hold mon->qmp.qmp_queue_lock */
>> >  static void monitor_qmp_cleanup_req_queue_locked(MonitorQMP *mon)
>> >  {
>> > +    bool need_resume = (!qmp_oob_enabled(mon) && mon->qmp_requests->length > 0)
>> > +        || mon->qmp_requests->length == QMP_REQ_QUEUE_LEN_MAX;
>>
>> Can you explain why this condition is correct?
>
> Sorry, I meant to add a comment pointing to monitor_qmp_bh_dispatcher(),
> which does the following *after* popping 1 element off the queue:
>
>     need_resume = !qmp_oob_enabled(mon) ||
>         mon->qmp_requests->length == QMP_REQ_QUEUE_LEN_MAX - 1;
>     qemu_mutex_unlock(&mon->qmp_queue_lock);
>
> It's supposed to be the same condition, but _before_ popping off an
> element (hence no `- 1`), except that the queue also shouldn't be
> empty, because otherwise the `monitor_suspend()` in
> `handle_qmp_command()` hasn't happened (though on second thought, we
> could probably just return early in that case).

I see.

Could we monitor_resume() unconditionally here?

>> >      while (!g_queue_is_empty(mon->qmp_requests)) {
>> >          qmp_request_free(g_queue_pop_head(mon->qmp_requests));
>> >      }
>> > +    if (need_resume) {
>> > +        /*
>> > +         * Pairs with the monitor_suspend() in handle_qmp_command() in case the
>> > +         * queue gets cleared from a CHR_EVENT_CLOSED event before the dispatch
>> > +         * bh got scheduled.
>> > +         */
>> > +        monitor_resume(&mon->common);
>> > +    }
>> >  }
>> >
>> >  static void monitor_qmp_cleanup_queues(MonitorQMP *mon)
>>
>> Is monitor_qmp_cleanup_req_queue_locked() the correct place?
>>
>> It's called from
>>
>> * monitor_qmp_event() case CHR_EVENT_CLOSED via
>>   monitor_qmp_cleanup_queues(), as part of destroying the monitor's
>>   session state.
>>
>>   This is the case you're trying to fix.  Correct?
>>
>>   I figure monitor_resume() is safe because we haven't really destroyed
>>   anything, yet, we merely flushed the request queue.  Correct?
>>
>> * monitor_data_destroy() via monitor_data_destroy_qmp() when destroying
>>   the monitor.
>>
>>   Can need_resume be true in this case?  If yes, is monitor_resume()
>>   still safe?  We're in the middle of destroying the monitor...
>
> I thought so when first reading through it, but on second thought, we
> should probably avoid this for sanity's sake.
> Maybe with a flag, or an extra parameter.
> Or we could introduce a "bool queue_filled" that we set in
> handle_qmp_command() instead of "calculating" `need_resume` in 2 places,
> and unset it in `monitor_data_destroy()` before clearing the queue?

Could we simply call monitor_resume() in monitor_qmp_event() right after
monitor_qmp_cleanup_queues()?
On Wed, Oct 09, 2019 at 09:18:04PM +0200, Markus Armbruster wrote:
> Wolfgang Bumiller <w.bumiller@proxmox.com> writes:
>
> > On Wed, Oct 09, 2019 at 10:39:44AM +0200, Markus Armbruster wrote:
> >> Cc: Marc-André for additional monitor and chardev expertise.
> >>
> >> Wolfgang Bumiller <w.bumiller@proxmox.com> writes:
> >>
> >> > When a monitor's queue is filled up in handle_qmp_command()
> >> > it gets suspended. It's the dispatcher bh's job currently to
> >> > resume the monitor, which it does after processing an event
> >> > from the queue. However, it is possible for a
> >> > CHR_EVENT_CLOSED event to be processed before the bh
> >> > is scheduled, which will clear the queue without resuming
> >> > the monitor, thereby preventing the dispatcher from reaching
> >> > the resume() call.
> >>
> >> Because with the request queue cleared, there's nothing for
> >> monitor_qmp_requests_pop_any_with_lock() to pop, so
> >> monitor_qmp_bh_dispatcher() won't look at this monitor.  It stays
> >> suspended forever.  Correct?
> >>
> >> Observable effect for the monitor's user?
> >
> > Yes.
>
> I was too terse, let me try again: what exactly breaks for the monitor's
> user?

Any new connections to qmp will be accept()ed and show the greeting
({"QMP":{"version"...}}), but not respond to any messages sent
afterwards (though in my tests sometimes (after a few more attempts to
reconnect/talk to the same qmp socket), not even the greeting would
appear anymore).

> > More easily triggered now with oob. We ran into this a while ago, but
> > our only reliable trigger was a customized version of -loadstate which
> > loads the state from a separate file instead of the vmstate region of a
> > qcow2. Turns out that doing this on slow storage (~12s to load the
> > data) caused our status daemon to try to poll the qmp socket during the
> > load-state and give up after a 3s timeout. And since the BH runs in the
> > main loop, which is not even entered until after the loadstate has
> > finished, while the iothread handling the qmp socket does fill & clear
> > the queue, the qmp socket always ended up unusable afterwards.
> >
> > Aside from that we have users reporting the same symptom (hanging qmp)
> > appearing randomly on busy systems.
> >
> >> > Fix this by resuming the monitor when clearing a queue which
> >> > was filled up.
> >> >
> >> > Signed-off-by: Wolfgang Bumiller <w.bumiller@proxmox.com>
> >> > ---
> >> > @Michael, we ran into this with qemu 4.0, so if the logic in this patch
> >> > is correct it may make sense to include it in the 4.0.1 roundup.
> >> > A backport is at [1] as 4.0 was before the monitor/ dir split.
> >> >
> >> > [1] https://gitlab.com/wbumiller/qemu/commit/9d8bbb5294ed084f282174b0c91e1a614e0a0714
> >> >
> >> >  monitor/qmp.c | 10 ++++++++++
> >> >  1 file changed, 10 insertions(+)
> >> >
> >> > diff --git a/monitor/qmp.c b/monitor/qmp.c
> >> > index 9d9e5d8b27..c1db5bf940 100644
> >> > --- a/monitor/qmp.c
> >> > +++ b/monitor/qmp.c
> >> > @@ -70,9 +70,19 @@ static void qmp_request_free(QMPRequest *req)
> >> >  /* Caller must hold mon->qmp.qmp_queue_lock */
> >> >  static void monitor_qmp_cleanup_req_queue_locked(MonitorQMP *mon)
> >> >  {
> >> > +    bool need_resume = (!qmp_oob_enabled(mon) && mon->qmp_requests->length > 0)
> >> > +        || mon->qmp_requests->length == QMP_REQ_QUEUE_LEN_MAX;
> >>
> >> Can you explain why this condition is correct?
> >
> > Sorry, I meant to add a comment pointing to monitor_qmp_bh_dispatcher(),
> > which does the following *after* popping 1 element off the queue:
> >
> >     need_resume = !qmp_oob_enabled(mon) ||
> >         mon->qmp_requests->length == QMP_REQ_QUEUE_LEN_MAX - 1;
> >     qemu_mutex_unlock(&mon->qmp_queue_lock);
> >
> > It's supposed to be the same condition, but _before_ popping off an
> > element (hence no `- 1`), except that the queue also shouldn't be
> > empty, because otherwise the `monitor_suspend()` in
> > `handle_qmp_command()` hasn't happened (though on second thought, we
> > could probably just return early in that case).
>
> I see.
>
> Could we monitor_resume() unconditionally here?

We can't, because suspend()/resume() employ a counter, which would
become unbalanced and could easily become negative by queuing a command
without filling the queue up and quickly disconnecting.

> >> >      while (!g_queue_is_empty(mon->qmp_requests)) {
> >> >          qmp_request_free(g_queue_pop_head(mon->qmp_requests));
> >> >      }
> >> > +    if (need_resume) {
> >> > +        /*
> >> > +         * Pairs with the monitor_suspend() in handle_qmp_command() in case the
> >> > +         * queue gets cleared from a CHR_EVENT_CLOSED event before the dispatch
> >> > +         * bh got scheduled.
> >> > +         */
> >> > +        monitor_resume(&mon->common);
> >> > +    }
> >> >  }
> >> >
> >> >  static void monitor_qmp_cleanup_queues(MonitorQMP *mon)
> >>
> >> Is monitor_qmp_cleanup_req_queue_locked() the correct place?
> >>
> >> It's called from
> >>
> >> * monitor_qmp_event() case CHR_EVENT_CLOSED via
> >>   monitor_qmp_cleanup_queues(), as part of destroying the monitor's
> >>   session state.
> >>
> >>   This is the case you're trying to fix.  Correct?
> >>
> >>   I figure monitor_resume() is safe because we haven't really destroyed
> >>   anything, yet, we merely flushed the request queue.  Correct?
> >>
> >> * monitor_data_destroy() via monitor_data_destroy_qmp() when destroying
> >>   the monitor.
> >>
> >>   Can need_resume be true in this case?  If yes, is monitor_resume()
> >>   still safe?  We're in the middle of destroying the monitor...
> >
> > I thought so when first reading through it, but on second thought, we
> > should probably avoid this for sanity's sake.
> > Maybe with a flag, or an extra parameter.
> > Or we could introduce a "bool queue_filled" that we set in
> > handle_qmp_command() instead of "calculating" `need_resume` in 2 places,
> > and unset it in `monitor_data_destroy()` before clearing the queue?
>
> Could we simply call monitor_resume() in monitor_qmp_event() right after
> monitor_qmp_cleanup_queues()?

We'd still need to make sure the suspend counter is balanced out with
the corresponding suspend() calls.
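The counter argument can be illustrated with a toy model (the names
only mimic the real monitor code; this is not QEMU's implementation):

    #include <assert.h>
    #include <stdio.h>

    /* toy stand-ins for the real Monitor and its suspend counter */
    typedef struct { int suspend_cnt; } ToyMonitor;

    static void toy_suspend(ToyMonitor *mon) { mon->suspend_cnt++; }
    static void toy_resume(ToyMonitor *mon)  { mon->suspend_cnt--; }

    int main(void)
    {
        ToyMonitor mon = { 0 };

        /* client queues one command; the queue is NOT full, so
         * handle_qmp_command() never suspends the monitor */

        /* client disconnects quickly; an unconditional resume on queue
         * cleanup would now drive the counter negative ... */
        toy_resume(&mon);
        printf("unconditional resume: suspend_cnt = %d\n", mon.suspend_cnt);
        assert(mon.suspend_cnt == -1);   /* unbalanced */

        /* ... whereas pairing each resume with a prior suspend keeps it at 0 */
        mon.suspend_cnt = 0;
        toy_suspend(&mon);               /* queue filled up -> suspend */
        toy_resume(&mon);                /* cleanup of a full queue -> resume */
        assert(mon.suspend_cnt == 0);
        printf("paired suspend/resume: suspend_cnt = %d\n", mon.suspend_cnt);
        return 0;
    }

An unconditional resume on disconnect decrements without a matching
suspend, which is exactly the imbalance described above.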
Wolfgang Bumiller <w.bumiller@proxmox.com> writes:

> On Wed, Oct 09, 2019 at 09:18:04PM +0200, Markus Armbruster wrote:
>> Wolfgang Bumiller <w.bumiller@proxmox.com> writes:
>>
>> > On Wed, Oct 09, 2019 at 10:39:44AM +0200, Markus Armbruster wrote:
>> >> Cc: Marc-André for additional monitor and chardev expertise.
>> >>
>> >> Wolfgang Bumiller <w.bumiller@proxmox.com> writes:
>> >>
>> >> > When a monitor's queue is filled up in handle_qmp_command()
>> >> > it gets suspended. It's the dispatcher bh's job currently to
>> >> > resume the monitor, which it does after processing an event
>> >> > from the queue. However, it is possible for a
>> >> > CHR_EVENT_CLOSED event to be processed before the bh
>> >> > is scheduled, which will clear the queue without resuming
>> >> > the monitor, thereby preventing the dispatcher from reaching
>> >> > the resume() call.
>> >>
>> >> Because with the request queue cleared, there's nothing for
>> >> monitor_qmp_requests_pop_any_with_lock() to pop, so
>> >> monitor_qmp_bh_dispatcher() won't look at this monitor.  It stays
>> >> suspended forever.  Correct?
>> >>
>> >> Observable effect for the monitor's user?
>> >
>> > Yes.
>>
>> I was too terse, let me try again: what exactly breaks for the monitor's
>> user?
>
> Any new connections to qmp will be accept()ed and show the greeting
> ({"QMP":{"version"...}}), but not respond to any messages sent
> afterwards (though in my tests sometimes (after a few more attempts to
> reconnect/talk to the same qmp socket), not even the greeting would
> appear anymore).

Work this into your commit message, please :)

>> > More easily triggered now with oob. We ran into this a while ago, but
>> > our only reliable trigger was a customized version of -loadstate which
>> > loads the state from a separate file instead of the vmstate region of a
>> > qcow2. Turns out that doing this on slow storage (~12s to load the
>> > data) caused our status daemon to try to poll the qmp socket during the
>> > load-state and give up after a 3s timeout. And since the BH runs in the
>> > main loop, which is not even entered until after the loadstate has
>> > finished, while the iothread handling the qmp socket does fill & clear
>> > the queue, the qmp socket always ended up unusable afterwards.
>> >
>> > Aside from that we have users reporting the same symptom (hanging qmp)
>> > appearing randomly on busy systems.
>> >
>> >> > Fix this by resuming the monitor when clearing a queue which
>> >> > was filled up.
>> >> >
>> >> > Signed-off-by: Wolfgang Bumiller <w.bumiller@proxmox.com>
>> >> > ---
>> >> > @Michael, we ran into this with qemu 4.0, so if the logic in this patch
>> >> > is correct it may make sense to include it in the 4.0.1 roundup.
>> >> > A backport is at [1] as 4.0 was before the monitor/ dir split.
>> >> >
>> >> > [1] https://gitlab.com/wbumiller/qemu/commit/9d8bbb5294ed084f282174b0c91e1a614e0a0714
>> >> >
>> >> >  monitor/qmp.c | 10 ++++++++++
>> >> >  1 file changed, 10 insertions(+)
>> >> >
>> >> > diff --git a/monitor/qmp.c b/monitor/qmp.c
>> >> > index 9d9e5d8b27..c1db5bf940 100644
>> >> > --- a/monitor/qmp.c
>> >> > +++ b/monitor/qmp.c
>> >> > @@ -70,9 +70,19 @@ static void qmp_request_free(QMPRequest *req)
>> >> >  /* Caller must hold mon->qmp.qmp_queue_lock */
>> >> >  static void monitor_qmp_cleanup_req_queue_locked(MonitorQMP *mon)
>> >> >  {
>> >> > +    bool need_resume = (!qmp_oob_enabled(mon) && mon->qmp_requests->length > 0)
>> >> > +        || mon->qmp_requests->length == QMP_REQ_QUEUE_LEN_MAX;
>> >>
>> >> Can you explain why this condition is correct?
>> >
>> > Sorry, I meant to add a comment pointing to monitor_qmp_bh_dispatcher(),
>> > which does the following *after* popping 1 element off the queue:
>> >
>> >     need_resume = !qmp_oob_enabled(mon) ||
>> >         mon->qmp_requests->length == QMP_REQ_QUEUE_LEN_MAX - 1;
>> >     qemu_mutex_unlock(&mon->qmp_queue_lock);
>> >
>> > It's supposed to be the same condition, but _before_ popping off an
>> > element (hence no `- 1`), except that the queue also shouldn't be
>> > empty, because otherwise the `monitor_suspend()` in
>> > `handle_qmp_command()` hasn't happened (though on second thought, we
>> > could probably just return early in that case).
>>
>> I see.
>>
>> Could we monitor_resume() unconditionally here?
>
> We can't, because suspend()/resume() employ a counter, which would
> become unbalanced and could easily become negative by queuing a command
> without filling the queue up and quickly disconnecting.
>
>> >> >      while (!g_queue_is_empty(mon->qmp_requests)) {
>> >> >          qmp_request_free(g_queue_pop_head(mon->qmp_requests));
>> >> >      }
>> >> > +    if (need_resume) {
>> >> > +        /*
>> >> > +         * Pairs with the monitor_suspend() in handle_qmp_command() in case the
>> >> > +         * queue gets cleared from a CHR_EVENT_CLOSED event before the dispatch
>> >> > +         * bh got scheduled.
>> >> > +         */
>> >> > +        monitor_resume(&mon->common);
>> >> > +    }
>> >> >  }
>> >> >
>> >> >  static void monitor_qmp_cleanup_queues(MonitorQMP *mon)
>> >>
>> >> Is monitor_qmp_cleanup_req_queue_locked() the correct place?
>> >>
>> >> It's called from
>> >>
>> >> * monitor_qmp_event() case CHR_EVENT_CLOSED via
>> >>   monitor_qmp_cleanup_queues(), as part of destroying the monitor's
>> >>   session state.
>> >>
>> >>   This is the case you're trying to fix.  Correct?
>> >>
>> >>   I figure monitor_resume() is safe because we haven't really destroyed
>> >>   anything, yet, we merely flushed the request queue.  Correct?
>> >>
>> >> * monitor_data_destroy() via monitor_data_destroy_qmp() when destroying
>> >>   the monitor.
>> >>
>> >>   Can need_resume be true in this case?  If yes, is monitor_resume()
>> >>   still safe?  We're in the middle of destroying the monitor...
>> >
>> > I thought so when first reading through it, but on second thought, we
>> > should probably avoid this for sanity's sake.
>> > Maybe with a flag, or an extra parameter.
>> > Or we could introduce a "bool queue_filled" that we set in
>> > handle_qmp_command() instead of "calculating" `need_resume` in 2 places,
>> > and unset it in `monitor_data_destroy()` before clearing the queue?
>>
>> Could we simply call monitor_resume() in monitor_qmp_event() right after
>> monitor_qmp_cleanup_queues()?
>
> We'd still need to make sure the suspend counter is balanced out with
> the corresponding suspend() calls.
You made me look at ->suspend_cnt, and now I feel nauseous.

monitor_suspend() increments, and monitor_resume() decrements.  This
permits "recursive" suspension.  Fair enough.

I'd expect monitor_resume() to assert the decrement doesn't go negative,
but that's detail.

The nauseating part is in monitor_event():

        switch (event) {
        case CHR_EVENT_MUX_IN:
            qemu_mutex_lock(&mon->mon_lock);
            mon->mux_out = 0;
            qemu_mutex_unlock(&mon->mon_lock);
            if (mon->reset_seen) {
                readline_restart(hmp_mon->rs);
--->            monitor_resume(mon);
                monitor_flush(mon);
            } else {
--->            atomic_mb_set(&mon->suspend_cnt, 0);
            }
            break;

        case CHR_EVENT_MUX_OUT:
            if (mon->reset_seen) {
                if (atomic_mb_read(&mon->suspend_cnt) == 0) {
                    monitor_printf(mon, "\n");
                }
                monitor_flush(mon);
--->            monitor_suspend(mon);
            } else {
--->            atomic_inc(&mon->suspend_cnt);
            }
            qemu_mutex_lock(&mon->mon_lock);
            mon->mux_out = 1;
            qemu_mutex_unlock(&mon->mon_lock);
            break;

If mon->reset_seen, we use monitor_resume() and monitor_suspend() like
a good boy.  Else, we mess with mon->suspend_cnt directly.  Worse, we
don't just decrement, we go all the way to zero!

Gerd, this goes back to your a7aec5da4d "monitor: fix muxing".  It's
been ten years, but I have to ask anyway: why?
diff --git a/monitor/qmp.c b/monitor/qmp.c
index 9d9e5d8b27..c1db5bf940 100644
--- a/monitor/qmp.c
+++ b/monitor/qmp.c
@@ -70,9 +70,19 @@ static void qmp_request_free(QMPRequest *req)
 /* Caller must hold mon->qmp.qmp_queue_lock */
 static void monitor_qmp_cleanup_req_queue_locked(MonitorQMP *mon)
 {
+    bool need_resume = (!qmp_oob_enabled(mon) && mon->qmp_requests->length > 0)
+        || mon->qmp_requests->length == QMP_REQ_QUEUE_LEN_MAX;
     while (!g_queue_is_empty(mon->qmp_requests)) {
         qmp_request_free(g_queue_pop_head(mon->qmp_requests));
     }
+    if (need_resume) {
+        /*
+         * Pairs with the monitor_suspend() in handle_qmp_command() in case the
+         * queue gets cleared from a CHR_EVENT_CLOSED event before the dispatch
+         * bh got scheduled.
+         */
+        monitor_resume(&mon->common);
+    }
 }
 
 static void monitor_qmp_cleanup_queues(MonitorQMP *mon)
When a monitor's queue is filled up in handle_qmp_command()
it gets suspended. It's the dispatcher bh's job currently to
resume the monitor, which it does after processing an event
from the queue. However, it is possible for a
CHR_EVENT_CLOSED event to be processed before the bh
is scheduled, which will clear the queue without resuming
the monitor, thereby preventing the dispatcher from reaching
the resume() call.

Fix this by resuming the monitor when clearing a queue which
was filled up.

Signed-off-by: Wolfgang Bumiller <w.bumiller@proxmox.com>
---
@Michael, we ran into this with qemu 4.0, so if the logic in this patch
is correct it may make sense to include it in the 4.0.1 roundup.
A backport is at [1] as 4.0 was before the monitor/ dir split.

[1] https://gitlab.com/wbumiller/qemu/commit/9d8bbb5294ed084f282174b0c91e1a614e0a0714

 monitor/qmp.c | 10 ++++++++++
 1 file changed, 10 insertions(+)
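The ordering problem the commit message describes can be reproduced in
miniature with the following self-contained sketch (illustration only:
the names mirror the real code, a queue limit of 8 is assumed, and only
the OOB-enabled path, where the monitor is suspended once the queue
fills up, is modelled):

    #include <assert.h>
    #include <stdbool.h>
    #include <stdio.h>

    #define QUEUE_MAX 8

    static int queue_len;     /* stands in for mon->qmp_requests->length */
    static int suspend_cnt;   /* stands in for the monitor's suspend counter */

    static void handle_qmp_command(void)          /* iothread: enqueue request */
    {
        queue_len++;
        if (queue_len == QUEUE_MAX) {
            suspend_cnt++;                        /* monitor_suspend() */
        }
    }

    static void chr_event_closed(bool with_fix)   /* iothread: client went away */
    {
        bool need_resume = queue_len == QUEUE_MAX;
        queue_len = 0;                            /* queue cleared ... */
        if (with_fix && need_resume) {
            suspend_cnt--;                        /* ... and monitor resumed */
        }
    }

    int main(void)
    {
        /* Without the fix: queue fills up, client disconnects before the
         * dispatcher BH ever runs; the monitor stays suspended forever. */
        for (int i = 0; i < QUEUE_MAX; i++) {
            handle_qmp_command();
        }
        chr_event_closed(false);
        printf("without fix: suspend_cnt = %d (stuck)\n", suspend_cnt);
        assert(suspend_cnt == 1);

        /* With the fix: clearing a filled-up queue also resumes the monitor. */
        suspend_cnt = 0;
        for (int i = 0; i < QUEUE_MAX; i++) {
            handle_qmp_command();
        }
        chr_event_closed(true);
        printf("with fix: suspend_cnt = %d (monitor usable again)\n", suspend_cnt);
        assert(suspend_cnt == 0);
        return 0;
    }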